Argo Rollouts in a multi-cluster environment
Prerequisites
- Argo Rollouts kubectl plugin installed (link to install instructions)
- Nova Control Plane installed with 2 workload clusters connected
- access to the Nova Control Plane hosting cluster
Installing Argo Rollouts
Typically, in a single-cluster scenario, both the Argo Rollouts Custom Resource Definitions (CRDs) and the controller are created in the same cluster. In a multi-cluster environment managed by Nova, we need to split them: the CRDs are created in the Nova Control Plane, while the argo-rollouts controller runs next to the Nova Control Plane components in the hosting cluster. The argo-rollouts controller manifest is modified so that it observes Rollouts in the Nova Control Plane instead of in the hosting cluster.
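The necessary modifications are already part of examples/argo-rollouts/argo-rollouts.yaml. Conceptually, the idea is to give the controller a kubeconfig for the Nova Control Plane instead of letting it use the in-cluster config of the hosting cluster. Below is a minimal sketch of that idea; the secret name, mount path, and image tag are illustrative assumptions and do not reproduce the exact contents of the provided manifest:
# illustrative fragment of an argo-rollouts controller Deployment (not the full manifest)
spec:
  template:
    spec:
      containers:
      - name: argo-rollouts
        image: quay.io/argoproj/argo-rollouts:latest  # assumed tag; use the image from the provided manifest
        args:
        - --kubeconfig=/etc/nova/kubeconfig           # point the controller at the Nova Control Plane API server
        volumeMounts:
        - name: nova-kubeconfig
          mountPath: /etc/nova
          readOnly: true
      volumes:
      - name: nova-kubeconfig
        secret:
          secretName: nova-kubeconfig                 # assumed Secret holding a kubeconfig for the Nova Control Plane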
Let's create Argo Rollouts CRDs in the Nova Control Plane first:
kubectl --context=nova create -f examples/argo-rollouts/crds.yaml
and then create the Argo Rollouts controller in the hosting cluster (note the kube context in the command below; modify it if your hosting cluster context is different).
We need to install the Argo Rollouts controller in the same namespace where the Nova Control Plane components were installed. By default, this is the elotl namespace.
export INSTALL_NAMESPACE=elotl
kubectl --context=kind-cp create -f examples/argo-rollouts/argo-rollouts.yaml -n ${INSTALL_NAMESPACE}
Now, we need to wait until the argo-rollouts controller is up and running:
kubectl --context=kind-cp wait --for=condition=available -n ${INSTALL_NAMESPACE} deployment argo-rollouts --timeout=210s
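Optionally, we can confirm that the Rollouts CRDs are registered in the Nova Control Plane and that the controller pod is up in the hosting cluster:
kubectl --context=nova get crds | grep argoproj.io
kubectl --context=kind-cp get pods -n ${INSTALL_NAMESPACE}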
Our goal in this tutorial is to do a canary rollout of our application to two workload clusters at the same time. We will deploy 10 replicas of our application and instruct Nova to spread them equally across the two workload clusters (5 replicas each). The Rollout that we will use has 5 steps:
- First, we will replace 20% of the 10 replicas with a new version (so 2 replicas, 1 in each workload cluster).
- Then we will pause, which gives us an opportunity to verify that the 2 new replicas are spread correctly across the two clusters. After that, we will move to the next step.
- We will run an automated analysis of the new version: a Kubernetes Job that runs 1 pod in each workload cluster, load testing the new version and checking the error rate.
- If the analysis run is successful, we scale the new version up to 6 replicas (3 in each workload cluster) and scale the old version down to 4 replicas (2 in each workload cluster).
- Then we will pause again to verify that the replicas are spread correctly. After that, we will promote the rollout one more time.
After those steps, the new version will be scaled up to 10 replicas (5 in each workload cluster) and the old version will be scaled down to 0. We will call that a successful Rollout in a multi-cluster environment.
All these steps are defined in the canary strategy steps part of the Rollout manifest:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: rollout-demo
spec:
  replicas: 10
  revisionHistoryLimit: 1
  workloadRef:
    apiVersion: apps/v1
    kind: Deployment
    name: rollouts-demo
  strategy:
    canary:
      canaryMetadata:
        labels:
          role: canary
      stableMetadata:
        labels:
          role: stable
      canaryService: rollouts-demo-canary
      steps:
      - setWeight: 20
      - pause: {}
      - analysis:
          templates:
          - templateName: http-benchmark
          args:
          - name: host
            value: rollouts-demo-canary
      - setWeight: 60
      - pause: {}
As you can see, we specify a reference to the Deployment whose pod spec will be used for the ReplicaSets created by the argo-rollouts controller. Here is the Deployment manifest:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rollouts-demo
  namespace: default
  labels:
    base-template: "yes"
spec:
  replicas: 10
  selector:
    matchLabels:
      app: rollouts-demo
  template:
    metadata:
      labels:
        app: rollouts-demo
        role: placeholder
    spec:
      containers:
      - name: main
        image: argoproj/rollouts-demo:green
        imagePullPolicy: Always
        ports:
        - name: http
          containerPort: 8080
          protocol: TCP
        readinessProbe:
          tcpSocket:
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
        resources:
          requests:
            memory: 32Mi
            cpu: 5m
In the 3rd step of the Rollout's canary strategy, we reference yet another Argo Rollouts custom resource: an AnalysisTemplate. It needs to be created in the Nova Control Plane for the rollout to proceed. Here is the AnalysisTemplate manifest:
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: http-benchmark
spec:
  args:
  - name: host
  metrics:
  - name: http-benchmark
    failureLimit: 1
    interval: 5s
    count: 1
    provider:
      job:
        metadata:
          labels:
            role: placeholder
            app: rollouts-demo
        spec:
          parallelism: 2
          template:
            spec:
              containers:
              - name: load-tester
                image: argoproj/load-tester:latest # we will use the load-test image provided by the argo project to test the new version
                command: [sh, -xec]
                # we will send requests to the new version of our app, collect the responses and then check whether the error rate is below the acceptable threshold.
                args:
                - |
                  wrk -t1 -c1 -d5 -s report.lua http://{{args.host}}/color
                  jq -e '.errors_ratio <= 0.05' report.json
              restartPolicy: Never
          backoffLimit: 0
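The jq -e flag makes the command exit with a non-zero status when the expression evaluates to false, and since the script runs under sh -e, that failure fails the Job and therefore the analysis. You can reproduce this gating behavior locally; the inline JSON below is just a stand-in for the report written by report.lua:
echo '{"errors_ratio": 0.01}' | jq -e '.errors_ratio <= 0.05'; echo $?  # prints true, exit code 0 -> analysis passes
echo '{"errors_ratio": 0.20}' | jq -e '.errors_ratio <= 0.05'; echo $?  # prints false, exit code 1 -> analysis fails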
Let's create the AnalysisTemplate and the base Deployment in the Nova Control Plane first.
kubectl --context=nova create -f examples/argo-rollouts/analysis_template.yaml
kubectl --context=nova create -f examples/argo-rollouts/base-deploy.yaml
As we mentioned before, we want to spread the Services and ReplicaSets for our Rollout across two workload clusters using Nova. Let's check the workload clusters connected to the Nova Control Plane:
kubectl --context=nova get clusters
NAME              K8S-VERSION   K8S-CLUSTER   REGION   ZONE   READY   IDLE    STANDBY
kind-workload-1   1.25          workload-1                    True    False   False
kind-workload-2   1.25          workload-2                    True    False   False
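The values in the NAME column are also what the SchedulePolicy below selects on. If you want to double-check the labels set on each Cluster object (the policy assumes a kubernetes.io/metadata.name label holding the cluster name), you can list them with:
kubectl --context=nova get clusters --show-labels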
In my case, they are named kind-workload-1 & kind-workload-2. We will use these names in the SchedulePolicy's clusterSelector as well as in its spread constraints:
apiVersion: policy.elotl.co/v1alpha1
kind: SchedulePolicy
metadata:
  name: rollouts-demo-policy
spec:
  namespaceSelector:
    matchLabels:
      kubernetes.io/metadata.name: default
  clusterSelector: # here we select both clusters
    matchExpressions:
    - key: kubernetes.io/metadata.name
      operator: In
      values:
      - kind-workload-1 # if your workload clusters are named differently, please modify the value here
      - kind-workload-2 # if your workload clusters are named differently, please modify the value here
  spreadConstraints:
    # in spread constraints, we ask Nova to divide our workload (ReplicaSet) equally across the two clusters selected in the clusterSelector section.
    # practically it means that a ReplicaSet with 10 replicas will run 5 replicas in each cluster.
    topologyKey: kubernetes.io/metadata.name
    spreadMode: Divide
  resourceSelectors:
    labelSelectors:
    - matchExpressions:
      # this selector will match the stable and canary Services and the ReplicaSets created by the argo-rollouts controller,
      # as well as the AnalysisRun Job.
      - key: app
        operator: In
        values:
        - rollouts-demo
      - key: role
        operator: Exists
      # this is to ensure that our "base" Deployment won't be scheduled by Nova to workload clusters
      - key: base-template
        operator: DoesNotExist
  groupBy:
    labelKey: role
Now, we can create this SchedulePolicy.
kubectl --context=nova create -f examples/argo-rollouts/schedule_policy_canary.yaml
The next step is creating the Services: one will point to the stable ReplicaSet's pods, the other to the canary pods.
kubectl --context=nova create -f examples/argo-rollouts/services-canary.yaml
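The exact manifests are in examples/argo-rollouts/services-canary.yaml. As an illustration of what such a Service can look like, here is a hypothetical canary Service; the port numbers and selector are assumptions, but the name must match the canaryService field of the Rollout, and the app and role labels matter because the SchedulePolicy above selects on them (the stable Service would look the same, with role: stable and its own name):
apiVersion: v1
kind: Service
metadata:
  name: rollouts-demo-canary   # must match spec.strategy.canary.canaryService in the Rollout
  namespace: default
  labels:
    app: rollouts-demo         # matched by the SchedulePolicy resourceSelectors
    role: canary
spec:
  selector:
    app: rollouts-demo
  ports:
  - name: http
    port: 80
    targetPort: 8080
    protocol: TCP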
Finally, we can create the Rollout spec and wait until it reaches 10 available replicas:
kubectl --context=nova create -f examples/argo-rollouts/rollout_prod_canary.yaml
kubectl --context=nova wait '--for=jsonpath={.status.availableReplicas}'=10 replicaset -l app=rollouts-demo --timeout=180s
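Optionally, we can also follow the Rollout's progress live using the plugin's --watch flag:
kubectl argo rollouts --context=nova get rollout rollout-demo --watch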
We can now verify that each workload cluster runs 5 replicas:
kubectl --context=kind-workload-1 get pods -l app=rollouts-demo
NAME READY STATUS RESTARTS AGE
rollout-demo-5cddbfbdbf-7z78m 1/1 Running 0 3m
rollout-demo-5cddbfbdbf-c8fhj 1/1 Running 0 3m
rollout-demo-5cddbfbdbf-c8pnn 1/1 Running 0 3m
rollout-demo-5cddbfbdbf-ntkcj 1/1 Running 0 3m
rollout-demo-5cddbfbdbf-s2f2n 1/1 Running 0 3m
kubectl --context=kind-workload-2 get pods -l app=rollouts-demo
NAME READY STATUS RESTARTS AGE
rollout-demo-5cddbfbdbf-7t2cx 1/1 Running 0 3m
rollout-demo-5cddbfbdbf-kbdf7 1/1 Running 0 3m
rollout-demo-5cddbfbdbf-mk6sl 1/1 Running 0 3m
rollout-demo-5cddbfbdbf-p7p76 1/1 Running 0 3m
rollout-demo-5cddbfbdbf-rbvl5 1/1 Running 0 3m
Great, our initial deployment was successful. Now we can deploy a new version of the application image and see progressive delivery in a multi-cluster environment in action.
Deploy a new version
We will update the image of our rollout from the green tag to the blue tag, using the Argo Rollouts kubectl plugin. This can also be done without the plugin, by updating the manifest and applying it to the cluster with kubectl.
kubectl argo rollouts --context=nova set image rollout-demo main=argoproj/rollouts-demo:blue
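If you prefer the plugin-free route mentioned above: because this Rollout uses workloadRef, a new revision should be triggered by changing the pod template of the referenced Deployment. A sketch of that route, assuming you first edit examples/argo-rollouts/base-deploy.yaml to use the blue image tag:
kubectl --context=nova apply -f examples/argo-rollouts/base-deploy.yaml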
Our rollout should complete step 1 and then pause. We can verify it using the Argo Rollouts kubectl plugin:
kubectl argo rollouts --context=nova get rollout rollout-demo
Name: rollout-demo
Namespace: default
Status: ॥ Paused
Message: CanaryPauseStep
Strategy: Canary
Step: 1/5
SetWeight: 20
ActualWeight: 20
Images: argoproj/rollouts-demo:blue (canary)
argoproj/rollouts-demo:green (stable)
Replicas:
Desired: 10
Current: 10
Updated: 2
Ready: 10
Available: 10
NAME KIND STATUS AGE INFO
⟳ rollout-demo Rollout ॥ Paused 3m11s
├──# revision:2
│ └──⧉ rollout-demo-845ffbfc68 ReplicaSet ✔ Healthy 2m39s canary
└──# revision:1
└──⧉ rollout-demo-5cddbfbdbf ReplicaSet ✔ Healthy 3m11s stable
It seems that the new ReplicaSet (called rollout-demo-845ffbfc68) was created with 2 replicas. Let's verify that each workload cluster got 1 replica of the new ReplicaSet:
kubectl --context=kind-workload-1 get pods
NAME READY STATUS RESTARTS AGE
rollout-demo-5cddbfbdbf-4bdvw 1/1 Running 0 2m5s
rollout-demo-5cddbfbdbf-4zfvb 1/1 Running 0 2m5s
rollout-demo-5cddbfbdbf-9qtst 1/1 Running 0 2m5s
rollout-demo-5cddbfbdbf-brfdw 1/1 Running 0 2m5s
rollout-demo-845ffbfc68-lzww7 1/1 Running 0 103s
kubectl --context=kind-workload-2 get pods
NAME READY STATUS RESTARTS AGE
rollout-demo-5cddbfbdbf-49dpj 1/1 Running 0 2m53s
rollout-demo-5cddbfbdbf-jfpvm 1/1 Running 0 2m53s
rollout-demo-5cddbfbdbf-ltbjf 1/1 Running 0 2m53s
rollout-demo-5cddbfbdbf-sdqdv 1/1 Running 0 2m53s
rollout-demo-845ffbfc68-xqjc8 1/1 Running 0 2m11s
Looks like everything worked as expected. We still have 8 replicas of the stable version (4 in each workload cluster) and 2 of the new one (1 in each workload cluster). We can promote the rollout and continue progressing:
kubectl argo rollouts --context=nova promote rollout-demo
This will continue the rollout until we reach step 4, which is another pause.
kubectl argo rollouts --context=nova get rollout rollout-demo
Name: rollout-demo
Namespace: default
Status: ॥ Paused
Message: CanaryPauseStep
Strategy: Canary
Step: 4/5
SetWeight: 60
ActualWeight: 60
Images: argoproj/rollouts-demo:blue (canary)
argoproj/rollouts-demo:green (stable)
Replicas:
Desired: 10
Current: 10
Updated: 6
Ready: 10
Available: 10
NAME KIND STATUS AGE INFO
⟳ rollout-demo Rollout ॥ Paused 4m48s
├──# revision:2
│ ├──⧉ rollout-demo-845ffbfc68 ReplicaSet ✔ Healthy 4m16s canary
│ └──α rollout-demo-845ffbfc68-2-2 AnalysisRun ✔ Successful 40s ✔ 1
│ └──⊞ e6f5f5d6-7518-421b-b672-b47701e67f6c.http-benchmark.1 Job ✔ Successful 40s
└──# revision:1
└──⧉ rollout-demo-5cddbfbdbf ReplicaSet ✔ Healthy 4m48s stable
It seems that the AnalysisRun (step 3) was successful. We can verify it with the following command:
kubectl --context=nova wait --for='jsonpath={.status.phase}=Successful' analysisrun -l app=rollout-demo -l step-index=2 --timeout=180s
Each workload cluster should now have: 1 completed pod of the AnalysisRun, 2 replicas of the stable version and 3 replicas of the canary version. Let's check that in both workload clusters:
kubectl --context=kind-workload-1 get pods
NAME READY STATUS RESTARTS AGE
e6f5f5d6-7518-421b-b672-b47701e67f6c.http-benchmark.1-wlsmk 0/1 Completed 0 2m14s
rollout-demo-5cddbfbdbf-4zfvb 1/1 Running 0 6m22s
rollout-demo-5cddbfbdbf-brfdw 1/1 Running 0 6m22s
rollout-demo-845ffbfc68-hkdxl 1/1 Running 0 119s
rollout-demo-845ffbfc68-lzww7 1/1 Running 0 6m
rollout-demo-845ffbfc68-rcv9k 1/1 Running 0 119s
kubectl --context=kind-workload-2 get pods
NAME READY STATUS RESTARTS AGE
e6f5f5d6-7518-421b-b672-b47701e67f6c.http-benchmark.1-vh6ln 0/1 Completed 0 2m31s
rollout-demo-5cddbfbdbf-49dpj 1/1 Running 0 6m49s
rollout-demo-5cddbfbdbf-ltbjf 1/1 Running 0 6m49s
rollout-demo-845ffbfc68-4g647 1/1 Running 0 2m16s
rollout-demo-845ffbfc68-gm6qg 1/1 Running 0 2m16s
rollout-demo-845ffbfc68-xqjc8 1/1 Running 0 6m7s
Everything looks OK, so we can promote the rollout. After that, the new version will reach 10 replicas and the old one will be scaled down to 0 replicas.
kubectl argo rollouts --context=nova promote rollout-demo
We can verify that the new version is running 10 replicas:
kubectl --context=nova wait '--for=jsonpath={.status.availableReplicas}'=10 replicaset -n default rollout-demo-845ffbfc68 --timeout=180s
or check the Rollout status using the Argo Rollouts kubectl plugin:
kubectl argo rollouts --context=nova get rollout rollout-demo
Name: rollout-demo
Namespace: default
Status: ✔ Healthy
Strategy: Canary
Step: 5/5
SetWeight: 100
ActualWeight: 100
Images: argoproj/rollouts-demo:blue (stable)
Replicas:
Desired: 10
Current: 10
Updated: 10
Ready: 10
Available: 10
NAME KIND STATUS AGE INFO
⟳ rollout-demo Rollout ✔ Healthy 10m
├──# revision:2
│ ├──⧉ rollout-demo-845ffbfc68 ReplicaSet ✔ Healthy 9m51s stable
│ └──α rollout-demo-845ffbfc68-2-2 AnalysisRun ✔ Successful 6m15s ✔ 1
│ └──⊞ e6f5f5d6-7518-421b-b672-b47701e67f6c.http-benchmark.1 Job ✔ Successful 6m15s
└──# revision:1
└──⧉ rollout-demo-5cddbfbdbf ReplicaSet • ScaledDown 10m
We can verify that each workload cluster has 5 pods of the new version:
kubectl --context=kind-workload-1 get pods
NAME READY STATUS RESTARTS AGE
e6f5f5d6-7518-421b-b672-b47701e67f6c.http-benchmark.1-wlsmk 0/1 Completed 0 7m2s
rollout-demo-845ffbfc68-d4gp9 1/1 Running 0 72s
rollout-demo-845ffbfc68-hkdxl 1/1 Running 0 6m47s
rollout-demo-845ffbfc68-lvbdd 1/1 Running 0 72s
rollout-demo-845ffbfc68-lzww7 1/1 Running 0 10m
rollout-demo-845ffbfc68-rcv9k 1/1 Running 0 6m47s
kubectl --context=kind-workload-2 get pods
NAME READY STATUS RESTARTS AGE
e6f5f5d6-7518-421b-b672-b47701e67f6c.http-benchmark.1-vh6ln 0/1 Completed 0 7m38s
rollout-demo-845ffbfc68-4g647 1/1 Running 0 7m23s
rollout-demo-845ffbfc68-4gtt6 1/1 Running 0 108s
rollout-demo-845ffbfc68-gm6qg 1/1 Running 0 7m23s
rollout-demo-845ffbfc68-tsm2k 1/1 Running 0 108s
rollout-demo-845ffbfc68-xqjc8 1/1 Running 0 11m
Cleanup
Let's first delete the Rollout, Services, AnalysisTemplate, and Deployment created for this tutorial:
kubectl --context=nova delete -f examples/argo-rollouts/rollout_prod_canary.yaml
kubectl --context=nova delete -f examples/argo-rollouts/services-canary.yaml
kubectl --context=nova delete -f examples/argo-rollouts/analysis_template.yaml
kubectl --context=nova delete -f examples/argo-rollouts/base-deploy.yaml
Then, we can delete the SchedulePolicy:
kubectl --context=nova delete -f examples/argo-rollouts/schedule_policy_canary.yaml
Finally, we can uninstall the Argo Rollouts CRDs and the Rollouts controller:
kubectl --context=nova delete -f examples/argo-rollouts/crds.yaml
kubectl --context=kind-cp delete -f examples/argo-rollouts/argo-rollouts.yaml -n ${INSTALL_NAMESPACE}
Additional resources
This tutorial is based on our presentation given at DevOps Day Madrid 2023. You can watch the presentation recording here or get the slides here.