(Hot Node Mitigation) Tutorial: Performing CPU and Memory Stress Tests on Your Cluster
This tutorial walks you through testing and verifying the Hot Node Mitigation feature (Luna v1.2.1 and higher). We suggest trying this on a test cluster.
The Hot Node Mitigation feature is disabled by default. To enable it, update the Helm value manageHighUtilization.enabled and optionally adjust scaleDown.binPackNodeUtilizationThreshold. This tutorial uses pods that don't have any resource requests; as such, we set scaleDown.binPackNodeUtilizationThreshold to 0 to keep Luna from scaling down nodes it considers below the default utilization threshold.
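If you prefer to keep these settings in your Helm values file instead of passing --set flags, the relevant entries would look roughly like the sketch below (this assumes the keys nest exactly as the dotted --set paths imply; verify against your rendered config.yml):
manageHighUtilization:
  enabled: true
scaleDown:
  binPackNodeUtilizationThreshold: 0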
The example below assumes Luna was deployed with --helm-release-name inst-1 and --set labels='elotl-luna-inst-1=true'; update these as appropriate for your Luna deployment:
kubectl get cm/inst-1-elotl-luna -n elotl -o "jsonpath={.data['config\.yml']}" > values_full_inst-1.yaml
helm upgrade inst-1 ./elotl-luna-v1.2.1.tgz --wait --namespace=elotl --values=./values_full_inst-1.yaml --set manageHighUtilization.enabled=true --set scaleDown.binPackNodeUtilizationThreshold=0
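After the upgrade completes, you can optionally confirm the settings were rendered into the Luna ConfigMap. This reuses the jsonpath query from above and assumes the rendered config.yml uses the same key names as the Helm values:
kubectl get cm/inst-1-elotl-luna -n elotl -o "jsonpath={.data['config\.yml']}" | grep -iE -A1 "manageHighUtilization|binPackNodeUtilizationThreshold"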
It may be helpful to run the following two commands in separate terminals to see what is happening with the pods and nodes while testing. This assumes you have watch installed; otherwise, run the kubectl commands manually or via some other method (one alternative is sketched after the commands):
watch kubectl top nodes
watch kubectl get pods -o wide
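If watch is not installed, a plain shell loop is one possible alternative (a minimal sketch; adjust the sleep interval to taste):
while true; do
  clear
  kubectl top nodes         # node-level CPU and memory utilization
  kubectl get pods -o wide  # pod placement across nodes
  sleep 10
done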
CPU Stress Tests
Submit a Bin-Packing Pod:
Create and submit a bin-packing pod with no resource requests and low actual utilization using the busynores.yaml file. Example:
cat <<EOF > busynores.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: busybox
spec:
  replicas: 1
  selector:
    matchLabels:
      app: busybox
  template:
    metadata:
      labels:
        app: busybox
        elotl-luna-inst-1: "true"
    spec:
      containers:
      - name: busybox
        image: busybox
        resources: {}
        command:
        - sleep
        - "infinity"
EOF
kubectl apply -f busynores.yaml
Once submitted, Luna will allocate a bin-packing node to handle this pod.
If you are running on a regional GKE cluster, wait for the 2 unused nodes to scale down (since 3 bin-packing nodes are launched initially).
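To confirm where the busybox pod landed and that the node count has settled, plain kubectl is enough (no Luna-specific labels are assumed here):
kubectl get pod -l app=busybox -o wide  # shows which node is running the busybox pod
kubectl get nodes                       # confirms the unused bin-packing nodes are gone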
Submit CPU Load Pods:
Create and submit 2 additional bin-packing pods using the stresscpu.yaml file, with no resource requests. Example:
cat <<EOF > stresscpu.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cpu-stress-deployment
  labels:
    app: cpu-stress
spec:
  replicas: 2
  selector:
    matchLabels:
      app: cpu-stress
  template:
    metadata:
      labels:
        app: cpu-stress
        elotl-luna-inst-1: "true"
    spec:
      containers:
      - name: stress-ng
        image: litmuschaos/stress-ng:latest
        args:
        - "--cpu"
        - "2"      # Run 2 CPU stress workers
        - "-t"
        - "600s"   # Run for 600 seconds (10 minutes)
EOF
kubectl apply -f stresscpu.yaml
The kube scheduler will place both pods on the existing bin-packing node, each consuming 51% of the node's CPU (totaling over 100%). Note: it will take a moment for these pods' CPU usage to ramp up, for that to be reflected in the kubectl top nodes output, and for Luna to react.
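In addition to kubectl top nodes, you can watch the ramp-up at the pod level by sorting pods by CPU usage (metrics lag actual consumption by a short interval):
kubectl top pods --sort-by=cpu  # the stress-ng pods should climb to the top as they ramp up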
Handling High CPU Utilization:
- When the node surpasses the yellow and red CPU utilization thresholds, Luna adds the high-utilization taint to the node.
- One of the CPU load pods will be evicted and will become pending until Luna allocates a second bin-packing node.
- Wait for the cluster to stabilize.
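While waiting, you can confirm that the taint was applied and that one CPU load pod went Pending. The exact taint key is internal to Luna and may vary by version, so the command below simply lists all node taints rather than assuming a specific key:
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints  # look for the high-utilization taint
kubectl get pods -l app=cpu-stress                                           # one pod should be Pending until the new node is ready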
Delete CPU Load Pods:
Delete the 2 CPU load pods.
kubectl delete -f stresscpu.yaml
Luna will scale down one of the bin-packing nodes and remove the high utilization taint from the remaining node.
Wait for this process to finish.
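One way to follow the scale-down is to stream node changes until the extra bin-packing node disappears:
kubectl get nodes -w  # press Ctrl-C once the extra bin-packing node has been removed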
Deploy Do-Not-Evict CPU Load Pods:
Deploy a new version of the 2 CPU load pods with do-not-evict annotations using the stresscpudonotevict.yaml file. Example:
cat <<EOF > stresscpudonotevict.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cpu-stress-deployment
  labels:
    app: cpu-stress
spec:
  replicas: 2
  selector:
    matchLabels:
      app: cpu-stress
  template:
    metadata:
      labels:
        app: cpu-stress
        elotl-luna-inst-1: "true"
      annotations:
        pod.elotl.co/do-not-evict: "true"
    spec:
      containers:
      - name: stress-ng
        image: litmuschaos/stress-ng:latest
        args:
        - "--cpu"
        - "2"      # Run 2 CPU stress workers
        - "-t"
        - "600s"   # Run for 600 seconds (10 minutes)
EOF
kubectl apply -f stresscpudonotevict.yaml
These pods will be added to the existing node, which will enter the red zone. Note: it will take a moment for these pods' CPU usage to ramp up, for that to be reflected in the kubectl top nodes output, and for Luna to react.
Luna will recognize that these pods cannot be evicted and will evict the other bin-packing pod with a light workload.
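You can check which pod was evicted via recent cluster events; the exact event wording depends on your Kubernetes version, so grep loosely:
kubectl get events --sort-by=.lastTimestamp | grep -iE "evict|taint"  # the busybox pod, not the do-not-evict pods, should show up here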
Delete Do-Not-Evict CPU Load Pods:
Delete the 2 CPU load pods.
kubectl delete -f stresscpudonotevict.yaml
Luna will scale down one of the bin-packing nodes and remove the high utilization taint from the remaining node.
Wait for this process to finish.
Memory Stress Tests
Submit a Bin-Packing Pod:
Create and submit a bin-packing pod with no resource requests and low actual utilization using the busynores.yaml file (same as for the CPU stress tests).
Submit Memory Load Pods:
Create and submit 2 additional bin-packing pods using the stressmem.yaml file, with no resource requests. Example:
cat <<EOF > stressmem.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mem-stress-deployment
  labels:
    app: mem-stress
spec:
  replicas: 2
  selector:
    matchLabels:
      app: mem-stress
  template:
    metadata:
      labels:
        app: mem-stress
        elotl-luna-inst-1: "true"
    spec:
      containers:
      - name: stress-ng
        image: litmuschaos/stress-ng:latest
        args:
        - "--vm"
        - "1"      # Run 1 VM stress worker
        - "--vm-bytes"
        - "40%"    # Use 40% of memory
        - "-t"
        - "600s"   # Run for 600 seconds (10 minutes)
EOF
kubectl apply -f stressmem.yaml
The kube scheduler will place both pods on the existing bin-packing node, each consuming ~44% of the node's memory (totaling over 88%). Note: it will take a moment for these pods' memory usage to ramp up, for that to be reflected in the kubectl top nodes output, and for Luna to react.
Handling High Memory Utilization:
- When the node surpasses the yellow and red memory utilization thresholds, Luna adds the high-utilization taint to the node.
- One of the memory load pods will be evicted and will become pending until Luna allocates a second bin-packing node.
- Wait for the cluster to stabilize.
Delete Memory Load Pods:
Delete the 2 memory load pods.
kubectl delete -f stressmem.yaml
Luna will scale down one of the bin-packing nodes and remove the high utilization taint from the remaining node.
Wait for this process to finish.
Deploy Do-Not-Evict Memory Load Pods:
Deploy a new version of the 2 memory load pods with do-not-evict annotations using the stressmemdonotevict.yaml file. Example:
cat <<EOF > stressmemdonotevict.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mem-stress-deployment
  labels:
    app: mem-stress
spec:
  replicas: 2
  selector:
    matchLabels:
      app: mem-stress
  template:
    metadata:
      labels:
        app: mem-stress
        elotl-luna-inst-1: "true"
      annotations:
        pod.elotl.co/do-not-evict: "true"
    spec:
      containers:
      - name: stress-ng
        image: litmuschaos/stress-ng:latest
        args:
        - "--vm"
        - "1"      # Run 1 VM stress worker
        - "--vm-bytes"
        - "40%"    # Use 40% of memory
        - "-t"
        - "600s"   # Run for 600 seconds (10 minutes)
EOF
kubectl apply -f stressmemdonotevict.yaml
These pods will be added to the existing node, which will enter the red zone. Note: it will take a moment for these pods' memory usage to ramp up, for that to be reflected in the kubectl top nodes output, and for Luna to react.
Luna will recognize that these pods cannot be evicted and will evict the other bin-packing pod with a light workload.
Delete Do-Not-Evict Memory Load Pods and Busybox Pod:
Delete the 2 memory load pods and the busybox pod.
kubectl delete -f stressmemdonotevict.yaml
kubectl delete -f busynores.yaml
Luna will scale down one of the bin-packing nodes and remove the high utilization taint from the remaining node.
Wait for this process to finish.
By following these steps, you can perform CPU and memory stress tests on your cluster and observe how Luna's Hot Node Mitigation handles high utilization scenarios.