Version: v1.2

Emergency fallback

This guide describes how to set up a disaster recovery fallback for Luna, for cases where Luna becomes unresponsive or can no longer provision nodes. The fallback is a node group managed by the cluster autoscaler, which you prepare in advance and keep disabled until it is needed.

Determine the node group’s requirements

To set up the node group, you'll need to determine the requirements for it. Unlike Luna, the cluster autoscaler cannot right-size the nodes based on the pods' requirements. Therefore, to run your workloads via the cluster autoscaler, you'll need to right-size the instance type for the node group yourself.

To do this, take the largest CPU, memory, and GPU request across all the pods in the workload you wish to run on the cluster, considering each resource separately. For example, the maximum per-pod requirements could be 6 vCPU, 12 GB of memory, and 1 GPU. The node group's instance type must offer at least this much of each resource, plus some headroom for system daemons. Instance types such as g5.2xlarge, g4dn.2xlarge, or g4ad.2xlarge (each with 8 vCPUs, 32 GiB of memory, and 1 GPU) can handle these requirements.
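As a rough sketch of this calculation, the shell function below sums the container requests of each pod and reports the largest per-pod totals. The function name `max_pod_requests` and its input format are assumptions made for illustration and are not part of Luna or the cluster autoscaler. It only understands CPU in cores or millicores and memory in Ki, Mi, Gi, or plain bytes, and it ignores init containers. The `kubectl` invocation shown in the comment assumes GPUs are exposed as `nvidia.com/gpu`; the column expressions may need adjusting for your cluster.

```shell
# Hypothetical helper: reads one line per pod in the form
#   <name> <cpu requests> <memory requests> <gpu requests>
# where each request field is a comma-separated list with one entry per
# container, and prints the largest per-pod totals.
max_pod_requests() {
  awk '
    # CPU quantity ("500m" or "2") to cores.
    function cpu(q) { return (q ~ /m$/) ? substr(q, 1, length(q) - 1) / 1000 : q + 0 }
    # Memory quantity to GiB; a missing value ("<none>") counts as 0.
    function mem(q) {
      if (q ~ /Gi$/) return q + 0
      if (q ~ /Mi$/) return (q + 0) / 1024
      if (q ~ /Ki$/) return (q + 0) / 1048576
      return (q + 0) / 1073741824
    }
    BEGIN { maxc = 0; maxm = 0; maxg = 0 }
    {
      # Sum the requests of all containers in this pod.
      c = 0; m = 0; g = 0
      n = split($2, a, ","); for (i = 1; i <= n; i++) c += cpu(a[i])
      n = split($3, a, ","); for (i = 1; i <= n; i++) m += mem(a[i])
      n = split($4, a, ","); for (i = 1; i <= n; i++) g += a[i] + 0
      # Track the maximum of each resource independently.
      if (c > maxc) maxc = c
      if (m > maxm) maxm = m
      if (g > maxg) maxg = g
    }
    END { printf "cpu=%g memGiB=%g gpu=%d\n", maxc, maxm, maxg }
  '
}

# Possible usage against a live cluster (column expressions are an assumption):
# kubectl get pods -A --no-headers -o custom-columns='NAME:.metadata.name,CPU:.spec.containers[*].resources.requests.cpu,MEM:.spec.containers[*].resources.requests.memory,GPU:.spec.containers[*].resources.requests.nvidia\.com/gpu' | max_pod_requests
```

Each maximum is tracked independently, so the reported CPU, memory, and GPU values may come from different pods. That is the intended behavior here, since the instance type has to fit the most demanding pod along every dimension.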

Create a node group and IAM permissions

Next, create a node group that can run up to 20 concurrent g5.2xlarge instances using the command below:

$ eksctl create nodegroup \
    --node-type 'g5.2xlarge' \
    --nodes 0 \
    --nodes-min 0 \
    --nodes-max 20 \
    --cluster $CLUSTER_NAME \
    --tags node.elotl.co/destination=bin-packing \
    --region $REGION

Then, create the IAM policy and roles following the EKS guide.

Install the cluster autoscaler and verify the node group works

Deploy the cluster autoscaler in your cluster using these instructions.

Verify the cluster autoscaler works properly with this command:

$ kubectl -n kube-system logs -f deployment.apps/cluster-autoscaler

Once the cluster autoscaler deployment is running, create a test workload to ensure the nodes in the node group can be scaled up and down. We highly recommend validating the fallback with a real workload: verify that all the pods reach the Running state and that the system functions correctly end to end.
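If no real workload is at hand for a first smoke test, a placeholder deployment can trigger a scale-up. The sketch below is only an illustration: the helper names, the deployment name `fallback-test`, and the `pause` image are arbitrary choices, and the requests reuse the example maximum of 6 vCPU and 12 GB from earlier. It does not request a GPU, since that depends on a GPU device plugin being installed on the nodes.

```shell
# Hypothetical helpers; the deployment name "fallback-test" is arbitrary.
create_fallback_test() {
  # Two placeholder pods that do nothing but hold their resource requests.
  kubectl create deployment fallback-test \
    --image=registry.k8s.io/pause:3.9 --replicas=2
  # Request the example per-pod maximum so that each pod needs its own
  # 8-vCPU node and the cluster autoscaler has to add nodes.
  kubectl set resources deployment fallback-test \
    --requests=cpu=6,memory=12Gi
}

delete_fallback_test() {
  # Removing the pods lets the cluster autoscaler scale the group back down.
  kubectl delete deployment fallback-test
}
```

After calling `create_fallback_test`, watching `kubectl get pods -l app=fallback-test -o wide --watch` should show the pods move from Pending to Running as new nodes join. A placeholder like this only confirms that scaling works; it is not a substitute for validating with the real workload.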

Disable the cluster autoscaler

Once the system is validated, remove the pods used to test the system, and wait around 15 minutes for the cluster autoscaler to fully scale down its node group. Once all the test nodes are shut down, disable the cluster autoscaler using this command:

$ kubectl -n kube-system scale --replicas=0 deploy/cluster-autoscaler

Luna and the backup cluster autoscaler do not work in tandem, so the cluster autoscaler stays disabled during normal operation and is re-enabled only if Luna fails.
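If Luna does fail, re-enabling the fallback amounts to reversing the scale-down command above. The following is a minimal sketch; the function name `enable_fallback_autoscaler` and the 120-second timeout are assumptions, and a single replica is assumed to be sufficient.

```shell
# Hypothetical helper: bring the backup cluster autoscaler back online.
enable_fallback_autoscaler() {
  # Restore one replica of the cluster autoscaler deployment.
  kubectl -n kube-system scale --replicas=1 deploy/cluster-autoscaler
  # Block until the deployment reports ready, or fail after the timeout.
  kubectl -n kube-system rollout status deploy/cluster-autoscaler --timeout=120s
}
```

Once the rollout completes, pending pods should start triggering scale-ups in the fallback node group, which can be confirmed by following the cluster autoscaler logs.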

Now, you're ready to deploy Luna as usual.