Version: v1.4

SkyRay: Ray Scheduling with Nova

SkyRay combines KubeRay with Nova to enable multi-cluster deployment and management of Ray workloads.

KubeRay manages Ray workloads within a Kubernetes cluster. Nova extends this model by selecting the workload cluster where Ray resources should run.

When to Use This

Use this pattern when:

  • Running Ray workloads across multiple Kubernetes clusters
  • Scaling Ray workloads beyond a single cluster
  • Optimizing placement based on available compute resources
  • Migrating Ray workloads between clusters

How Nova Helps

Nova supports KubeRay custom resources, including:

  • RayCluster
  • RayJob
  • RayService

A typical deployment pattern is:

  1. Deploy KubeRay and its CRDs across all workload clusters
  2. Submit Ray resources to the Nova control plane
  3. Nova selects the appropriate workload cluster based on policy and capacity
  4. KubeRay in the selected cluster creates and manages the Ray workloads

Once Nova selects a cluster, KubeRay handles pod creation and lifecycle management within that cluster.
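
As a rough illustration of step 2, the sketch below builds a minimal RayCluster manifest in Python and serializes it to JSON, which kubectl accepts alongside YAML. The field layout follows the KubeRay ray.io/v1 RayCluster schema. The resource name, image tag, and resource sizes are placeholder assumptions, not values taken from this guide.

```python
import json


def ray_pod_template(container_name, image, cpu, memory, gpu=0):
    """Build a pod template whose single container requests the given resources."""
    resources = {"cpu": cpu, "memory": memory}
    if gpu:
        resources["nvidia.com/gpu"] = str(gpu)
    return {
        "spec": {
            "containers": [{
                "name": container_name,
                "image": image,
                # Requests and limits are set equal here purely for simplicity.
                "resources": {"requests": resources, "limits": resources},
            }]
        }
    }


def build_raycluster(name, image, workers):
    """Return a RayCluster manifest with one head pod and one worker group."""
    return {
        "apiVersion": "ray.io/v1",
        "kind": "RayCluster",
        "metadata": {"name": name},
        "spec": {
            "headGroupSpec": {
                "rayStartParams": {},
                "template": ray_pod_template("ray-head", image, "2", "4Gi"),
            },
            "workerGroupSpecs": [{
                "groupName": "workers",
                "replicas": workers,
                "minReplicas": workers,
                "maxReplicas": workers,
                "rayStartParams": {},
                "template": ray_pod_template("ray-worker", image, "4", "8Gi", gpu=1),
            }],
        },
    }


# Placeholder name, image, and worker count (assumptions for illustration).
manifest = build_raycluster("demo-raycluster", "rayproject/ray:2.9.0", workers=3)
manifest_json = json.dumps(manifest, indent=2)
```

The resulting JSON could be written to a file and applied with kubectl apply -f, using a kubeconfig that points at the Nova control plane and assuming a schedule policy that matches the resource is already in place.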

Placement Behavior

Nova evaluates the resource requirements of Ray custom resources and can:

  • Select clusters based on available CPU, memory, and GPU capacity
  • Apply policy-based placement rules
  • Ensure related Ray resources are scheduled together

Resource-Aware Placement

Nova understands the resource requirements associated with KubeRay custom resources such as RayCluster, RayJob, and RayService.

When these resources are submitted:

  • Nova evaluates the requested CPU, memory, and GPU resources
  • It selects a workload cluster that can satisfy the full set of requirements
  • The selected cluster is expected to have sufficient capacity to run the workload

This lets Nova perform capacity-aware placement for Ray workloads.
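
To make "the full set of requirements" concrete, the following sketch sums the CPU, memory, and GPU requests across the head pod and every worker replica of a RayCluster-shaped dictionary. It illustrates the arithmetic only and is not Nova's implementation. The helper names are hypothetical, the quantity parser handles only a subset of Kubernetes quantity formats (millicores, whole cores, binary memory suffixes, plain bytes), and the GPU resource name nvidia.com/gpu is an assumption.

```python
from decimal import Decimal

# Binary suffixes only; decimal suffixes such as "M" or "G" are not handled here.
_MEM_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_cpu(quantity):
    """Convert a CPU quantity such as '500m' or '2' to cores."""
    q = str(quantity)
    return Decimal(q[:-1]) / 1000 if q.endswith("m") else Decimal(q)


def parse_memory(quantity):
    """Convert a memory quantity such as '4Gi' or a plain byte count to bytes."""
    q = str(quantity)
    for suffix, factor in _MEM_UNITS.items():
        if q.endswith(suffix):
            return int(Decimal(q[:-2]) * factor)
    return int(Decimal(q))


def pod_requests(template):
    """Sum the requests of all containers in one pod template."""
    cpu, mem, gpu = Decimal(0), 0, 0
    for container in template["spec"]["containers"]:
        req = container.get("resources", {}).get("requests", {})
        cpu += parse_cpu(req.get("cpu", "0"))
        mem += parse_memory(req.get("memory", "0"))
        gpu += int(req.get("nvidia.com/gpu", 0))
    return cpu, mem, gpu


def total_requests(raycluster):
    """Total requests for one head pod plus every worker replica."""
    spec = raycluster["spec"]
    groups = [(1, spec["headGroupSpec"]["template"])]
    groups += [(g.get("replicas", 0), g["template"])
               for g in spec.get("workerGroupSpecs", [])]
    cpu, mem, gpu = Decimal(0), 0, 0
    for count, template in groups:
        c, m, g = pod_requests(template)
        cpu += count * c
        mem += count * m
        gpu += count * g
    return {"cpu": cpu, "memory": mem, "gpu": gpu}


def _template(requests):
    return {"spec": {"containers": [{"resources": {"requests": requests}}]}}


# Illustrative input: a 2-core head and three 500m workers with one GPU each.
example = {
    "spec": {
        "headGroupSpec": {"template": _template({"cpu": "2", "memory": "4Gi"})},
        "workerGroupSpecs": [{
            "replicas": 3,
            "template": _template({"cpu": "500m", "memory": "8Gi", "nvidia.com/gpu": "1"}),
        }],
    }
}
totals = total_requests(example)  # 3.5 cores, 28 GiB, 3 GPUs
```

A cluster is a candidate only if its free capacity covers all three totals at once.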

Group and Gang Scheduling Behavior

Because Ray workloads often consist of multiple components that must run together, Nova treats each Ray resource and its components as a single scheduling unit.

This enables:

  • Placement of all required components on the same cluster
  • Avoidance of partial scheduling where only part of the workload runs

Once the workload is placed, KubeRay handles pod-level scheduling and lifecycle management within the selected cluster.
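
The all-or-nothing idea can be sketched as follows. This is a simplified model of group placement under stated assumptions, not Nova's scheduler: the Resources type, the place_group function, the cluster names, and the first-fit selection order are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cpu: float = 0.0
    memory_gib: float = 0.0
    gpu: int = 0

    def __add__(self, other):
        return Resources(self.cpu + other.cpu,
                         self.memory_gib + other.memory_gib,
                         self.gpu + other.gpu)

    def fits_in(self, free):
        """True if every dimension of this request fits in the free capacity."""
        return (self.cpu <= free.cpu
                and self.memory_gib <= free.memory_gib
                and self.gpu <= free.gpu)


def place_group(components, free_by_cluster):
    """Return the first cluster that fits all components together, else None.

    Components are never split across clusters: if no single cluster can
    hold the whole group, nothing is placed.
    """
    total = sum(components, Resources())
    for name, free in free_by_cluster.items():
        if total.fits_in(free):
            return name
    return None


# One head plus three GPU workers: 14 CPUs, 28 GiB, 3 GPUs in total.
group = [Resources(2, 4, 0)] + [Resources(4, 8, 1)] * 3

placed = place_group(group, {
    "cluster-a": Resources(8, 32, 2),    # too small for the whole group
    "cluster-b": Resources(16, 64, 4),   # fits the whole group
})

# cluster-a could hold the head and one worker, but the group stays unplaced.
unplaced = place_group(group, {
    "cluster-a": Resources(8, 32, 2),
    "cluster-c": Resources(12, 64, 4),
})
```

In the second call the result is None even though some components would fit individually, which is the partial scheduling that treating the workload as one unit avoids.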

Deployment Pattern

A common approach is to use Nova spread or duplicate scheduling to deploy KubeRay and its CRDs across all workload clusters.

This ensures that any selected cluster is capable of running Ray workloads when Nova schedules them.

Resources and Examples

Example policies and deployment scripts are available here:

Additional background and examples can be found in this blog:

Considerations

  • KubeRay must be installed on all candidate workload clusters
  • Required CRDs must be available before scheduling Ray resources
  • Placement decisions depend on the CPU, memory, and GPU capacity available in the candidate clusters
  • Migrating a Ray workload between clusters may disrupt it while it is running
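
The first two considerations can be checked before submitting Ray resources. The sketch below shells out to kubectl get crd for each KubeRay CRD in each workload cluster. It assumes kubectl is installed and that each workload cluster is reachable through a kubeconfig context. The function names and the context names in the usage note are hypothetical.

```python
import subprocess

# CRD names registered by KubeRay for the three supported resource kinds.
KUBERAY_CRDS = ("rayclusters.ray.io", "rayjobs.ray.io", "rayservices.ray.io")


def crd_check_command(context, crd):
    """Build the kubectl command that looks up one CRD in one cluster."""
    return ["kubectl", "--context", context, "get", "crd", crd, "-o", "name"]


def missing_crds(contexts, run=subprocess.run):
    """Map each context to the KubeRay CRDs that could not be found there.

    A non-zero kubectl exit code is treated as "missing". It can also mean
    the cluster is unreachable, so results should be read with that in mind.
    """
    missing = {}
    for context in contexts:
        absent = [
            crd for crd in KUBERAY_CRDS
            if run(crd_check_command(context, crd), capture_output=True).returncode != 0
        ]
        if absent:
            missing[context] = absent
    return missing
```

Calling missing_crds(["workload-1", "workload-2"]) returns an empty dictionary when every listed cluster has all three CRDs. The run parameter exists so the check can be exercised without a live cluster.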