Version: v1.2

Hot Node Mitigation Tutorial: Performing CPU and Memory Stress Tests on Your Cluster

This tutorial walks you through testing and verifying the Hot Node Mitigation feature (available in Luna v1.2.1 and higher). It is recommended that you run it on a test cluster.

The Hot Node Mitigation feature is disabled by default. To enable it, set the Helm value manageHighUtilization.enabled to true and, optionally, adjust scaleDown.binPackNodeUtilizationThreshold. This tutorial uses pods without resource requests, so we set scaleDown.binPackNodeUtilizationThreshold to 0 to prevent Luna from scaling down nodes it considers below the default utilization threshold.

The example below assumes Luna was deployed using --helm-release-name inst-1 and helm --set labels='elotl-luna-inst-1=true'; update these as appropriate for your Luna deployment.

kubectl get cm/inst-1-elotl-luna -n elotl -o "jsonpath={.data['config\.yml']}" > values_full_inst-1.yaml
helm upgrade inst-1 ./elotl-luna-v1.2.1.tgz --wait --namespace=elotl --values=./values_full_inst-1.yaml --set manageHighUtilization.enabled=true --set scaleDown.binPackNodeUtilizationThreshold=0
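
If you would rather keep these settings in a values file than pass repeated --set flags, the same two values can be captured in a small overlay. This is only a sketch: the nesting below is inferred from the --set paths above and the file name hot-node-overrides.yaml is just an example, so check it against your Luna chart's values.

cat <<EOF > hot-node-overrides.yaml
# Overlay enabling Hot Node Mitigation; nesting mirrors the --set paths above.
manageHighUtilization:
  enabled: true
scaleDown:
  binPackNodeUtilizationThreshold: 0
EOF

helm upgrade inst-1 ./elotl-luna-v1.2.1.tgz --wait --namespace=elotl --values=./values_full_inst-1.yaml --values=./hot-node-overrides.yaml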

It can be helpful to run the following two commands in separate terminals to watch what happens to the pods and nodes during testing. This assumes you have watch installed; otherwise, run the kubectl commands manually or via some other method:

watch kubectl top nodes
watch kubectl get pods -o wide
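
If watch is not installed, kubectl's built-in --watch flag can stream the pod and node listings instead (kubectl top has no watch mode, so re-run it periodically):

kubectl get pods -o wide --watch
kubectl get nodes --watch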

CPU Stress Tests

  1. Submit a Bin-Packing Pod:

    • Create and submit a bin-packing pod with no resource requests and low actual utilization using the busynores.yaml file.

    • Example:

      cat <<EOF > busynores.yaml
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: busybox
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: busybox
        template:
          metadata:
            labels:
              app: busybox
              elotl-luna-inst-1: "true"
          spec:
            containers:
            - name: busybox
              image: busybox
              resources: {}
              command:
              - sleep
              - "infinity"
      EOF

      kubectl apply -f busynores.yaml
    • Once submitted, Luna will allocate a bin-packing node to handle this pod.

    • If you are running on a regional GKE cluster, wait for the 2 unused nodes to scale down (since 3 bin-packing nodes are launched initially).

  2. Submit CPU Load Pods:

    • Create and submit 2 additional bin-packing pods using the stresscpu.yaml file, with no resource requests.

    • Example:

      cat <<EOF > stresscpu.yaml
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: cpu-stress-deployment
        labels:
          app: cpu-stress
      spec:
        replicas: 2
        selector:
          matchLabels:
            app: cpu-stress
        template:
          metadata:
            labels:
              app: cpu-stress
              elotl-luna-inst-1: "true"
          spec:
            containers:
            - name: stress-ng
              image: litmuschaos/stress-ng:latest
              args:
              - "--cpu"
              - "2" # Run 2 CPU stress workers
              - "-t"
              - "600s" # Run for 600 seconds (10 minutes)
      EOF

      kubectl apply -f stresscpu.yaml
    • The kube-scheduler will place both pods on the existing bin-packing node, each consuming 51% of the node's CPU (totaling over 100%). Note: it will take a moment for these pods' CPU usage to ramp up, for the increase to show in the kubectl top nodes output, and for Luna to react.

  3. Handling High CPU Utilization:

    • When the node surpasses the yellow and red CPU utilization thresholds, Luna adds the high-utilization taint to the node (a quick way to inspect node taints is shown after this list).
    • One of the CPU load pods will be evicted and will become pending until Luna allocates a second bin-packing node.
    • Wait for the cluster to stabilize.
  4. Delete CPU Load Pods:

    • Delete the 2 CPU load pods.

      kubectl delete -f stresscpu.yaml
    • Luna will scale down one of the bin-packing nodes and remove the high-utilization taint from the remaining node.

    • Wait for this process to finish.

  5. Deploy Do-Not-Evict CPU Load Pods:

    • Deploy a new version of the 2 CPU load pods with do-not-evict annotations using the stresscpudonotevict.yaml file.

    • Example:

      cat <<EOF > stresscpudonotevict.yaml
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: cpu-stress-deployment
        labels:
          app: cpu-stress
      spec:
        replicas: 2
        selector:
          matchLabels:
            app: cpu-stress
        template:
          metadata:
            labels:
              app: cpu-stress
              elotl-luna-inst-1: "true"
            annotations:
              pod.elotl.co/do-not-evict: "true"
          spec:
            containers:
            - name: stress-ng
              image: litmuschaos/stress-ng:latest
              args:
              - "--cpu"
              - "2" # Run 2 CPU stress workers
              - "-t"
              - "600s" # Run for 600 seconds (10 minutes)
      EOF

      kubectl apply -f stresscpudonotevict.yaml
    • These pods will be scheduled onto the existing node, which will enter the red zone. Note: it will take a moment for these pods' CPU usage to ramp up, for the increase to show in the kubectl top nodes output, and for Luna to react.

    • Luna will recognize that these pods cannot be evicted and will instead evict the other bin-packing pod, which has a light workload.

  6. Delete Do-Not-Evict CPU Load Pods:

    • Delete the 2 CPU load pods.

      kubectl delete -f stresscpudonotevict.yaml
    • Luna will scale down one of the bin-packing nodes and remove the high-utilization taint from the remaining node.

    • Wait for this process to finish.
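
Throughout these CPU steps (as noted in step 3), you can confirm when the high-utilization taint is added and removed by listing each node's taint keys. The exact taint key is Luna-specific, so the command below simply dumps all taint keys per node:

kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'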

Memory Stress Tests

  1. Submit a Bin-Packing Pod:

    • Create and submit a bin-packing pod with no resource requests and low actual utilization using the busynores.yaml file (same as for CPU stress tests).
  2. Submit Memory Load Pods:

    • Create and submit 2 additional bin-packing pods using the stressmem.yaml file, with no resource requests.

    • Example:

      cat <<EOF > stressmem.yaml
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: mem-stress-deployment
        labels:
          app: mem-stress
      spec:
        replicas: 2
        selector:
          matchLabels:
            app: mem-stress
        template:
          metadata:
            labels:
              app: mem-stress
              elotl-luna-inst-1: "true"
          spec:
            containers:
            - name: stress-ng
              image: litmuschaos/stress-ng:latest
              args:
              - "--vm"
              - "1" # Run 1 VM stress worker
              - "--vm-bytes"
              - "40%" # Use 40% of memory
              - "-t"
              - "600s" # Run for 600 seconds (10 minutes)
      EOF

      kubectl apply -f stressmem.yaml
    • The kube-scheduler will place both pods on the existing bin-packing node, each consuming ~44% of the node's memory (totaling over 88%). Note: it will take a moment for these pods' memory usage to ramp up, for the increase to show in the kubectl top nodes output, and for Luna to react.

  3. Handling High Memory Utilization:

    • When the node surpasses the yellow and red memory utilization thresholds, Luna adds the high-utilization taint to the node.
    • One of the memory load pods will be evicted and will become pending until Luna allocates a second bin-packing node.
    • Wait for the cluster to stabilize.
  4. Delete Memory Load Pods:

    • Delete the 2 memory load pods.

      kubectl delete -f stressmem.yaml
    • Luna will scale down one of the bin-packing nodes and remove the high-utilization taint from the remaining node.

    • Wait for this process to finish.

  5. Deploy Do-Not-Evict Memory Load Pods:

    • Deploy a new version of the 2 memory load pods with do-not-evict annotations using the stressmemdonotevict.yaml file.

    • Example:

      cat <<EOF > stressmemdonotevict.yaml
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: mem-stress-deployment
        labels:
          app: mem-stress
      spec:
        replicas: 2
        selector:
          matchLabels:
            app: mem-stress
        template:
          metadata:
            labels:
              app: mem-stress
              elotl-luna-inst-1: "true"
            annotations:
              pod.elotl.co/do-not-evict: "true"
          spec:
            containers:
            - name: stress-ng
              image: litmuschaos/stress-ng:latest
              args:
              - "--vm"
              - "1" # Run 1 VM stress worker
              - "--vm-bytes"
              - "40%" # Use 40% of memory
              - "-t"
              - "600s" # Run for 600 seconds (10 minutes)
      EOF

      kubectl apply -f stressmemdonotevict.yaml
    • These pods will be scheduled onto the existing node, which will enter the red zone. Note: it will take a moment for these pods' memory usage to ramp up, for the increase to show in the kubectl top nodes output, and for Luna to react.

    • Luna will recognize that these pods cannot be evicted and will instead evict the other bin-packing pod, which has a light workload.

  6. Delete Do-Not-Evict Memory Load Pods and the Busybox Pod:

    • Delete the 2 memory load pods and the busybox pod.

      kubectl delete -f stressmemdonotevict.yaml
      kubectl delete -f busynores.yaml
    • Luna will scale down one of the bin-packing nodes and remove the high-utilization taint from the remaining node.

    • Wait for this process to finish.
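
Once you are done testing, if you enabled Hot Node Mitigation only for this exercise, you can restore the settings captured at the start by reapplying the saved values file without the extra --set overrides (adjust the release name and chart path to match your deployment):

helm upgrade inst-1 ./elotl-luna-v1.2.1.tgz --wait --namespace=elotl --values=./values_full_inst-1.yaml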

By following these steps, you can perform CPU and memory stress tests on your cluster and observe how Luna's Hot Node Mitigation handles high-utilization scenarios.