Group Scheduling
In the Kubernetes world, an application or service rarely consists of a single Kubernetes resource. Usually it is composed of several: a Deployment to run your code, a ConfigMap to define the app's configuration, a Secret to store access credentials, a Service to expose the app, and a ServiceAccount to bind the app to the permissions defined in a Role or ClusterRole. These resources also depend on each other: a Deployment's Pods won't start until they can mount their volumes, and those volumes are often backed by ConfigMaps and Secrets. Tools like Helm emerged to help people compose an application or microservice from several Kubernetes resources.
But there is another level of dependency, between the applications / microservices running on Kubernetes: a complete system often requires more than one Helm chart. Imagine a typical web application running on Kubernetes: it will very likely be composed of a backend application that relies on a relational database for storage (e.g. Postgres), a non-relational database for caching (e.g. Redis), a message queue (such as Kafka or RabbitMQ), and so on. Each of these is very likely configured and deployed as a separate Helm chart or as vanilla Kubernetes manifests.
In a multi-cluster context, a user may want to ensure that all those applications / microservices always run in the same workload cluster, so that they can communicate using cluster DNS.
To address that, we introduced a field in SchedulePolicy that lets the user define a group label key:
kind: SchedulePolicy
metadata:
  name: web-app-policy
spec:
  resourceSelectors:
    labelSelectors:
      - matchExpressions:
          - key: kubernetes.io/part-of
            operator: Exists
            values: []
  groupBy:
    labelKey: kubernetes.io/part-of
  # ...(...)...
This setting ensures that all Kubernetes resources sharing the same value for the defined labelKey are treated as one group and scheduled together to the same workload cluster.
For example, consider these resources:
kind: ConfigMap
metadata:
  name: backend-app-config
  labels:
    kubernetes.io/part-of: web-app
---
kind: Deployment
metadata:
  name: backend-app
  labels:
    kubernetes.io/part-of: web-app
---
kind: Deployment
metadata:
  name: redis
  labels:
    kubernetes.io/part-of: web-app
---
kind: StatefulSet
metadata:
  name: web-db
  labels:
    kubernetes.io/part-of: web-app
---
kind: Deployment
metadata:
  name: prometheus
  labels:
    kubernetes.io/part-of: observability-stack
This policy will create two ScheduleGroups for those resources:
- one with all resources labeled kubernetes.io/part-of=web-app (the backend-app and redis Deployments, the backend-app-config ConfigMap, and the web-db StatefulSet)
- one with kubernetes.io/part-of=observability-stack (only the prometheus Deployment)
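The grouping described above can be illustrated with a short Python sketch. It is not Nova's implementation; it only mimics the idea of partitioning resources by the value of the groupBy label key. The `group_by_label` helper and the tuple layout are assumptions made for this illustration.

```python
from collections import defaultdict

# The labelKey from the policy's groupBy section.
GROUP_LABEL_KEY = "kubernetes.io/part-of"

# (kind, name, labels) for the example resources listed above.
resources = [
    ("ConfigMap", "backend-app-config", {"kubernetes.io/part-of": "web-app"}),
    ("Deployment", "backend-app", {"kubernetes.io/part-of": "web-app"}),
    ("Deployment", "redis", {"kubernetes.io/part-of": "web-app"}),
    ("StatefulSet", "web-db", {"kubernetes.io/part-of": "web-app"}),
    ("Deployment", "prometheus", {"kubernetes.io/part-of": "observability-stack"}),
]


def group_by_label(resources, label_key):
    """Hypothetical helper: bucket resources by the value of label_key."""
    groups = defaultdict(list)
    for kind, name, labels in resources:
        # Mirrors the `Exists` selector: resources without the label are skipped.
        if label_key in labels:
            groups[labels[label_key]].append(f"{kind}/{name}")
    return dict(groups)


groups = group_by_label(resources, GROUP_LABEL_KEY)
for value, members in groups.items():
    print(f"{value}: {members}")
```

Each key of `groups` corresponds to one ScheduleGroup in the example: four resources under web-app and one under observability-stack.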
Once the resources are assigned to groups, each group is treated as one scheduling unit, and Nova will always run those workloads in the same workload cluster. If one of the workloads in a ScheduleGroup cannot run because the workload cluster lacks CPU or memory, the whole ScheduleGroup is rescheduled: Nova tries to find a workload cluster with enough CPU, memory, and GPU resources to run it. To learn more about using Group Scheduling, see the Capacity-based Scheduling Tutorial.
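To make the "one scheduling unit" idea concrete, here is a minimal Python sketch of choosing a cluster for a whole group. It is an illustration under simplifying assumptions, not Nova's actual algorithm: capacity is compared as cluster-wide totals (a real scheduler also has to fit Pods onto individual nodes), and the helper names, cluster names, and all numbers are hypothetical.

```python
def total_requests(workloads):
    """Sum the resource requests of every workload in a group."""
    totals = {}
    for requests in workloads.values():
        for resource, amount in requests.items():
            totals[resource] = totals.get(resource, 0) + amount
    return totals


def pick_cluster(group_requests, clusters):
    """Return the first cluster whose free capacity covers the whole group."""
    for name, free in clusters.items():
        if all(free.get(res, 0) >= amount for res, amount in group_requests.items()):
            return name
    return None  # no single cluster can host the entire group


# Hypothetical requests for the web-app group.
web_app = {
    "backend-app": {"cpu": 2, "memory_gib": 4},
    "redis": {"cpu": 1, "memory_gib": 2},
    "web-db": {"cpu": 2, "memory_gib": 8},
}

# Hypothetical free capacity per workload cluster.
clusters = {
    "cluster-a": {"cpu": 4, "memory_gib": 32},  # plenty of memory, too little CPU
    "cluster-b": {"cpu": 8, "memory_gib": 16},
}

needed = total_requests(web_app)
chosen = pick_cluster(needed, clusters)
print(needed, "->", chosen)
```

The point of the sketch is that the decision is made for the group as a whole: cluster-a could host any single workload from the group, but not all of them together, so the group lands on cluster-b.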