Group Scheduling
In the Kubernetes world, an application or service rarely consists of a single Kubernetes object. Usually it consists of several different objects, e.g.: a Deployment for running your code, a ConfigMap to define app configuration, a Secret to store access credentials, a Service to expose your app, and a ServiceAccount to bind your app to the permissions defined in a Role or ClusterRole. There are also dependencies between the objects: the Deployment won't start unless it can mount its volumes, and those volumes are backed by the ConfigMap and Secrets. Tools like Helm emerged to help people compose an application or microservice from several Kubernetes objects.
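To illustrate that dependency, here is a minimal sketch (all names are hypothetical) of a Deployment whose Pods mount volumes backed by a ConfigMap and a Secret, so they cannot start until both objects exist:

```yaml
# Minimal sketch with hypothetical names: the Pods mount volumes backed by a
# ConfigMap and a Secret, so they stay in ContainerCreating until both exist.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: app
          image: example-app:latest
          volumeMounts:
            - name: config
              mountPath: /etc/app
            - name: credentials
              mountPath: /etc/app/secrets
      volumes:
        - name: config
          configMap:
            name: example-app-config          # must exist before the Pods can start
        - name: credentials
          secret:
            secretName: example-app-credentials
```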
But there is also another level of dependency between applications or microservices running on Kubernetes: they often require more than one Helm chart. Imagine a typical web application running on Kubernetes: it will very likely be composed of a backend application that relies on a relational database for storage (e.g., Postgres), a non-relational database for caching (e.g., Redis), a message queue (e.g., Kafka or RabbitMQ), and so on. Each of these is very likely configured and deployed as a separate Helm chart or set of vanilla Kubernetes manifests.
The user may want to ensure that all those applications or microservices are gang-scheduled, i.e., scheduled only when they can all get sufficient resources to run simultaneously. And in the multi-cluster context, the user may want them to run in the same workload cluster, so they can communicate using cluster DNS.
To address this, we introduced the `groupBy` field in `SchedulePolicy` that lets the user define a group label key:
```yaml
kind: SchedulePolicy
metadata:
  name: web-app-policy
spec:
  resourceSelectors:
    labelSelectors:
      - matchExpressions:
          - key: kubernetes.io/part-of
            operator: Exists
            values: []
  groupBy:
    labelKey: kubernetes.io/part-of
  # ...
```
This setting ensures that all Kubernetes resources sharing the same value for the defined `labelKey` are treated as one group and scheduled at once to the same workload cluster.
For example, consider these resources:
```yaml
kind: ConfigMap
metadata:
  name: backend-app-config
  labels:
    kubernetes.io/part-of: web-app
---
kind: Deployment
metadata:
  name: backend-app
  labels:
    kubernetes.io/part-of: web-app
---
kind: Deployment
metadata:
  name: redis
  labels:
    kubernetes.io/part-of: web-app
---
kind: StatefulSet
metadata:
  name: web-db
  labels:
    kubernetes.io/part-of: web-app
---
kind: Deployment
metadata:
  name: prometheus
  labels:
    kubernetes.io/part-of: observability-stack
```
This policy will create two ScheduleGroups for those resources:
- one with all resources labeled `kubernetes.io/part-of=web-app` (the `backend-app` and `redis` Deployments, the `backend-app-config` ConfigMap, and the `web-db` StatefulSet)
- one with `kubernetes.io/part-of=observability-stack` (only the `prometheus` Deployment)
Once the resources are assigned to groups, each group is treated as one scheduling unit, and Nova's capacity-based scheduling will only place the group when sufficient resources are available for the entire group. If the group is updated such that not all of the workloads in the ScheduleGroup can run due to insufficient CPU, memory, or (optionally) GPU resources, Nova will re-schedule the ScheduleGroup.
By default, Nova will run the entire group in the same workload cluster. If the group no longer fits in that cluster, Nova will try to find another workload cluster with enough available CPU, memory, and (optionally) GPU resources to run the ScheduleGroup. If Nova cannot find a workload cluster with available resources to run the entire group and the Nova option `multi-cluster-capacity` is set, Nova will consider placing the group on available resources across multiple workload clusters. This option should only be enabled if all the workload clusters are running a network layer such as Cilium Cluster Mesh that allows the pods in each cluster to discover and access services in the other workload clusters. Note that when Cilium Cluster Mesh is used, services whose pods may span multiple clusters are expected to include the annotation `service.cilium.io/global: "true"`.
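For example, the `redis` Service from a setup like the one above could be marked as a global service with that annotation (a hypothetical sketch; only the annotation itself comes from the Cilium Cluster Mesh requirement):

```yaml
# Hypothetical Service for the redis Deployment above; the annotation marks it
# as a global service so Cilium Cluster Mesh load-balances it across all
# connected workload clusters.
apiVersion: v1
kind: Service
metadata:
  name: redis
  labels:
    kubernetes.io/part-of: web-app
  annotations:
    service.cilium.io/global: "true"
spec:
  selector:
    app: redis        # assumed Pod label; match it to the redis Pods' labels
  ports:
    - port: 6379
      targetPort: 6379
```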
To learn more about how to use Group Scheduling, check out the Capacity-based Scheduling Tutorial.