Version: v0.5

Emergency fallback

Implement a disaster recovery solution for Luna in case it becomes unable to provision nodes or stops responding. This guide details how to use a node group managed by the cluster autoscaler as a backup for Luna.

Determine the node group’s requirements

To set up the node group, first determine its requirements. Unlike Luna, the cluster autoscaler cannot right-size nodes based on the pods' requirements, so you'll need to right-size the instance type for the node group yourself.

To do this, take the highest CPU, memory, and GPU requirements across all the pods in the workloads you wish to run on the cluster. For example, if the maximum per-pod requirements are 6 vCPU, 12GB of memory, and 1 GPU, you'll need to set the node group's instance type to one with at least this configuration. Instance types such as g5.2xlarge, g4dn.2xlarge, or g4ad.2xlarge can handle these requirements.
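As a sketch of this sizing step: given a dump of per-pod requests (in practice you might produce one with `kubectl get pods -o jsonpath=...`; the pod names and sample data below are illustrative), a short awk pipeline finds the per-column maxima:

```shell
# Sizing sketch: find the largest per-pod CPU, memory, and GPU request.
# Columns: <pod> <vCPU> <memory GiB> <GPU>. Sample data is illustrative.
printf '%s\n' \
  'web-1     2  4  0' \
  'trainer-1 6 12  1' \
  'batch-1   4  8  0' |
awk '{ if ($2 > c) c = $2; if ($3 > m) m = $3; if ($4 > g) g = $4 }
     END { printf "max: %s vCPU, %sGiB memory, %s GPU\n", c, m, g }'
# → max: 6 vCPU, 12GiB memory, 1 GPU
```

Any instance type offering at least these maxima can host every pod in the workload.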

Create a node group and IAM permissions

Next, create a node group that can run up to 20 concurrent g5.2xlarge instances using the command below. The tags let the cluster autoscaler's auto-discovery find the node group:

$ eksctl create nodegroup \
--node-type 'g5.2xlarge' \
--nodes 0 \
--nodes-min 0 \
--nodes-max 20 \
--cluster $CLUSTER_NAME \
--tags "k8s.io/cluster-autoscaler/enabled=true,k8s.io/cluster-autoscaler/${CLUSTER_NAME}=owned" \
--region $REGION
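If you prefer declarative configuration, roughly the same node group can be expressed as an eksctl config file; the file and node group names below are illustrative assumptions:

```yaml
# luna-fallback-nodegroup.yaml (hypothetical file name)
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster        # your $CLUSTER_NAME
  region: us-east-1       # your $REGION
managedNodeGroups:
  - name: luna-fallback   # hypothetical node group name
    instanceType: g5.2xlarge
    desiredCapacity: 0
    minSize: 0
    maxSize: 20
```

Apply it with `eksctl create nodegroup --config-file=luna-fallback-nodegroup.yaml`; the tagging and IAM steps in the guide still apply.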

Then, create the IAM policy and roles following the EKS guide.
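The policy in the EKS guide grants the cluster autoscaler permission to inspect and resize Auto Scaling groups. It typically includes actions along the lines below; consult the guide for the current recommended version:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:DescribeAutoScalingGroups",
        "autoscaling:DescribeAutoScalingInstances",
        "autoscaling:DescribeLaunchConfigurations",
        "autoscaling:DescribeTags",
        "autoscaling:SetDesiredCapacity",
        "autoscaling:TerminateInstanceInAutoScalingGroup",
        "ec2:DescribeLaunchTemplateVersions"
      ],
      "Resource": "*"
    }
  ]
}
```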

Install the cluster autoscaler and verify the node group works

Deploy the cluster autoscaler in your cluster using these instructions.

Verify the cluster autoscaler works properly with this command:

$ kubectl -n kube-system logs -f deployment.apps/cluster-autoscaler

Once the cluster autoscaler deployment is running, create a test workload to ensure the nodes in the node group can be scaled up and down. We highly recommend validating the cluster with a real workload and verifying that all its pods reach the Running state.
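As a sketch, the test workload could be a Deployment requesting the maximum per-pod resources from the sizing example above. The name and image here are illustrative, and `nvidia.com/gpu` assumes the NVIDIA device plugin is installed:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fallback-scale-test   # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: fallback-scale-test
  template:
    metadata:
      labels:
        app: fallback-scale-test
    spec:
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "6"
              memory: 12Gi
              nvidia.com/gpu: "1"
            limits:
              nvidia.com/gpu: "1"
```

Applying this should make the cluster autoscaler scale the node group up; deleting the Deployment should let it scale back down to zero.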

Disable the cluster autoscaler

Once the system is validated, remove the pods used to test it and wait about 15 minutes for the cluster autoscaler to fully scale down its node group. Once all the test nodes have shut down, disable the cluster autoscaler using this command:

$ kubectl -n kube-system scale --replicas=0 deploy/cluster-autoscaler

Luna and the backup cluster autoscaler do not work in tandem, so the cluster autoscaler stays disabled and is re-activated only when disaster strikes.

Now, you're ready to deploy Luna as usual.

Emergency fallback to cluster autoscaler

In the event that the cluster autoscaler must take over from Luna as the autoscaler, you must first disable Luna, then activate the cluster autoscaler. Luna must be disabled even if it is not working properly. To disable Luna, scale down its two deployments, the webhook and the manager, using the following commands:

$ kubectl -n elotl scale --replicas=0 deploy/elotl-luna-webhook
$ kubectl -n elotl scale --replicas=0 deploy/elotl-luna-manager

To activate the cluster autoscaler, scale its deployment up using the command below:

$ kubectl -n kube-system scale --replicas=1 deploy/cluster-autoscaler

Once activated, the cluster autoscaler starts creating new nodes of the instance type specified when you created the node group. These nodes handle the incoming workloads that Luna was previously handling.