SkyRay: Nova Support for KubeRay RayCluster, RayJob, and RayService Custom Resources
Ray is a unified framework for scaling Python and AI applications from a laptop to a cluster. KubeRay runs on a Kubernetes cluster and supports the creation, deletion, and autoscaling of Ray clusters, along with managing the Ray jobs and services deployed on those Ray clusters. Nova includes support for the KubeRay RayCluster, RayJob, and RayService Custom Resources (CRs), allowing KubeRay to be extended seamlessly from single-cluster Kubernetes operation to multi-cluster, multi-cloud Kubernetes operation. We use the name SkyRay to refer to the combination of KubeRay and Nova, since the combination extends Ray towards the Sky computing model envisioned by Stoica and Shenker of UC Berkeley.
A common way to configure SkyRay is to use a Nova spread/duplicate policy to deploy KubeRay and its CRDs to all workload clusters, so that every workload cluster is KubeRay-enabled. For any subsequent deployment of a RayCluster, RayJob, or RayService CR, Nova applies the policy that matches that CR. Because Nova recognizes the resource requests associated with each of these kinds of CRs, it can apply a policy that selects a target workload cluster based on available resources, thereby gang scheduling the CR there. Once Nova selects the target workload cluster, the KubeRay deployment in that cluster materializes the pods associated with the CR in the usual manner. Nova policy updates can be used to trigger automatic migration of Ray clusters between Kubernetes workload clusters.
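To make the idea of resource-based gang scheduling concrete, the following is a minimal illustrative sketch in Python. It is not Nova's implementation or API. It only assumes the standard KubeRay RayCluster layout (`spec.headGroupSpec` and `spec.workerGroupSpecs`, each with a pod `template`) and sums the CPU requests of the head and worker pods. The helper names (`parse_cpu`, `pod_cpu`, `total_cpu_request`, `pick_cluster`), the cluster names, and the "most free CPU" tie-breaking rule are all hypothetical, chosen for illustration.

```python
from typing import Optional


def parse_cpu(quantity) -> float:
    """Convert a Kubernetes CPU quantity such as "500m" or "2" to cores."""
    quantity = str(quantity)
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def pod_cpu(template: dict) -> float:
    """Sum the CPU requests of all containers in one pod template."""
    containers = template["spec"]["containers"]
    return sum(
        parse_cpu(c.get("resources", {}).get("requests", {}).get("cpu", "0"))
        for c in containers
    )


def total_cpu_request(ray_cluster: dict) -> float:
    """Total CPU requested by the head pod plus all worker replicas."""
    spec = ray_cluster["spec"]
    total = pod_cpu(spec["headGroupSpec"]["template"])
    for group in spec.get("workerGroupSpecs", []):
        total += group.get("replicas", 0) * pod_cpu(group["template"])
    return total


def pick_cluster(required_cpu: float, available: dict) -> Optional[str]:
    """Hypothetical placement rule: the whole CR goes to one cluster or nowhere.

    Among clusters with enough free CPU for the entire request, pick the one
    with the most free CPU. Returns None if no single cluster fits.
    """
    candidates = {n: free for n, free in available.items() if free >= required_cpu}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


# A trimmed RayCluster CR: one head pod (1 CPU) and three workers (500m each).
example_cr = {
    "kind": "RayCluster",
    "spec": {
        "headGroupSpec": {
            "template": {"spec": {"containers": [
                {"name": "ray-head", "resources": {"requests": {"cpu": "1"}}}
            ]}}
        },
        "workerGroupSpecs": [{
            "replicas": 3,
            "template": {"spec": {"containers": [
                {"name": "ray-worker", "resources": {"requests": {"cpu": "500m"}}}
            ]}},
        }],
    },
}

required = total_cpu_request(example_cr)
# Hypothetical free CPU per workload cluster.
target = pick_cluster(required, {"cluster-a": 2.0, "cluster-b": 4.0})
print(required, target)
```

The key point of the sketch is the all-or-nothing check in `pick_cluster`: the CR's aggregate request is compared against a single cluster's free capacity, so the head and all workers land together rather than being split across clusters.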
Example SkyRay policies for KubeRay, its CRDs, and the RayCluster, RayJob, and RayService CRs can be found in the skyray repo, along with example deployment scripts. We have also published a blog post that uses SkyRay functionality to show how Nova's autoscaler-aware operation can improve the resource efficiency of ML/AI workloads.