Version: v1.2

EKS

Prerequisites

  1. AWS CLI
  2. kubectl with the correct context selected, pointing to the cluster you want to deploy Luna on. If the name of the cluster passed to the deploy script doesn’t match the name of the EKS cluster in the kubectl context, the deploy script exits with an error.
  3. helm: the package manager for Kubernetes
  4. eksctl >= 0.104.0: to manage the EKS OpenID connect provider.
  5. cmctl: the cert-manager command line utility
  6. An existing EKS cluster with at least 2 nodes (for Luna webhook replica availability). If you don't have one, you can create a new one with eksctl: eksctl --region=... create cluster --name=...
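
To confirm these prerequisites, a quick check of each tool and of the active kubectl context might look like this (version output formats vary by tool):

    aws --version
    kubectl config current-context    # should point to the target EKS cluster
    helm version --short
    eksctl version                    # should be >= 0.104.0
    cmctl version
    kubectl get nodes                 # expect at least 2 nodes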

Step 1 (optional): Install the NVIDIA GPU driver for GPU workloads

    kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.0/deployments/static/nvidia-device-plugin.yml
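
To verify the device plugin is running and that GPUs are advertised to Kubernetes, something along these lines can be used; the pod label below matches the static manifest referenced above, and GPU capacity only appears on nodes that actually have NVIDIA GPUs:

    kubectl -n kube-system get pods -l name=nvidia-device-plugin-ds
    kubectl get nodes -o custom-columns="NAME:.metadata.name,GPU:.status.capacity.nvidia\.com/gpu"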

Step 2: Deploy Luna

Luna requires cert-manager to be running in the cluster. The deploy script tries to detect cert-manager in the cluster and, if it is not found, installs cert-manager into the cert-manager namespace.
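
If you are unsure whether cert-manager is already present, a quick check before running the deploy script might be:

    kubectl get pods -n cert-manager
    cmctl check api    # reports whether the cert-manager API is ready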

    cd luna-vX.Y.Z/
    ./deploy.sh --name <cluster-name> --region <compute-region> [--helm-release-name <release-name>] [--namespace <namespace>] [--additional-helm-values "<additional-helm-values>"]

Note: This command generates an eks-cluster-name_values.yaml file and (in post-0.6.0 Luna) an eks-cluster-name_helm-release-name_values_full.yaml file; please retain these files for use in future upgrades.
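
For illustration, with a hypothetical cluster named my-cluster in us-west-2 and the default release name and namespace, the invocation could look like:

    ./deploy.sh --name my-cluster --region us-west-2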

Step 3: Verify Luna

    kubectl get all -n elotl

Sample output

    NAME                                READY   STATUS    RESTARTS   AGE
    pod/luna-manager-5d8578565d-86jwc   1/1     Running   0          56s
    pod/luna-webhook-58b7b5dcfb-dwpcb   1/1     Running   0          56s
    pod/luna-webhook-58b7b5dcfb-xmlds   1/1     Running   0          56s

    NAME                   TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
    service/luna-webhook   ClusterIP   x.x.x.x      <none>        8443/TCP   57s

    NAME                           READY   UP-TO-DATE   AVAILABLE   AGE
    deployment.apps/luna-manager   1/1     1            1           57s
    deployment.apps/luna-webhook   2/2     2            2           57s

    NAME                                      DESIRED   CURRENT   READY   AGE
    replicaset.apps/luna-manager-5d8578565d   1         1         1       57s
    replicaset.apps/luna-webhook-58b7b5dcfb   2         2         2       57s
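
You can also wait for both deployments to report as available before proceeding, for example:

    kubectl -n elotl wait deployment/luna-manager deployment/luna-webhook \
      --for=condition=Available --timeout=120s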

Step 4: Testing

Follow our tutorial to understand the value provided by Luna.
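
As a minimal illustration (not a substitute for the tutorial), the following launches a test pod that matches the selector used in Step 5; it assumes the default configuration in which pods labeled elotl-luna=true are handled by Luna:

    kubectl run luna-test --image=nginx --labels=elotl-luna=true --restart=Never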

Step 5: Verify test pod launch and dynamic worker node addition/removal (while testing)

    kubectl get pods --selector=elotl-luna=true -o wide -w
    kubectl get nodes -w

Upgrade

When running the upgrade command described below, set <retained-values-file> to <retained-path>/<cluster-name>_values_full.yaml if your installed Luna version is later than 0.5.4, and to <retained-path>/<cluster-name>_values.yaml otherwise.
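
One way to pick the file, assuming the generated files were retained under a directory referred to here as <retained-path>, is to prefer the full values file when it exists:

    RETAINED="<retained-path>/<cluster-name>_values_full.yaml"
    [ -f "$RETAINED" ] || RETAINED="<retained-path>/<cluster-name>_values.yaml"
    echo "Using values file: $RETAINED"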

To upgrade an existing Luna deployment, run:

    helm upgrade elotl-luna <chart-path> --wait --namespace=<cluster-namespace> --values=<retained-values-file> <additional-helm-values(optional)>

For example, to upgrade my-cluster from luna-v0.4.6 to luna-v0.5.0 and set an additional Helm value binPackingNodeCpu=2, run:

    helm upgrade elotl-luna ./elotl-luna-v0.5.0.tgz --wait --namespace=elotl --values=../../luna-v0.4.6/eks/my-cluster_values.yaml --set binPackingNodeCpu=2

And validate the upgrade as follows:

    helm ls -A
    NAME         NAMESPACE   REVISION   UPDATED                                STATUS     CHART               APP VERSION
    elotl-luna   elotl       4          2023-05-19 14:15:30.686251 -0700 PDT   deployed   elotl-luna-v0.5.0   v0.5.0

Cleanup

We recommend you delete all the pods running on the nodes managed by the Luna manager before uninstalling; otherwise, there may be orphan nodes left behind once you uninstall Luna. The uninstall.sh script won’t remove orphan nodes, to avoid accidentally knocking out critical workloads. These orphan nodes can easily be cleaned up, as described below.
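
A sketch of how you might do that: inspect node labels to identify the Luna-managed nodes, then drain each one so its pods are rescheduled or terminated cleanly (the node name below is a placeholder):

    kubectl get nodes --show-labels
    kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data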

To remove the Luna manager’s Helm chart and the custom AWS resources created to run Luna, execute the uninstall script:

    ./uninstall.sh <cluster-name> <region>

This will not remove the leftover nodes that the Luna manager hasn’t scaled down yet. To get the list of orphan nodes’ instance IDs, you can use the following command, replacing <eks-cluster-name> with the name of the cluster:

    aws ec2 describe-instances \
      --filters Name=tag:elotl.co/nodeless-cluster/name/<eks-cluster-name>,Values=owned \
      --query "Reservations[*].Instances[*].[InstanceId]" \
      --output text

To ensure all the nodes managed by the Luna manager are deleted, execute the following command, replacing <eks-cluster-name> with the name of the cluster:

    aws ec2 terminate-instances --instance-ids \
      $(aws ec2 describe-instances \
        --filters Name=tag:elotl.co/nodeless-cluster/name/<eks-cluster-name>,Values=owned \
        --query "Reservations[*].Instances[*].[InstanceId]" \
        --output text)

Note that all the pods running on these nodes will be forcefully terminated.

To delete the Luna manager and the webhook from the cluster while preserving the AWS resources, execute the following:

    helm uninstall elotl-luna --namespace=elotl
    kubectl delete namespace elotl

If you decide to uninstall the Helm chart instead of running uninstall.sh, please ensure that all the orphan nodes have been cleaned up as described above.

Notes

Security Groups

Security Groups act as virtual firewalls for EC2 instances, controlling incoming and outgoing traffic. If a node is missing a security group rule, it can affect Luna’s ability to attach the node to the EKS cluster or prevent the node from running pods and services.

To ensure that all the security groups required by EKS are applied to the Luna-managed nodes, we tag security groups with the key elotl.co/nodeless-cluster/name and the cluster name as its value. At startup, Luna queries which security groups are needed and adds them to the nodes.

When Luna is deployed, the default EKS security groups are automatically tagged. If you wish to tag another security group, you can use the AWS CLI to add the tag to it:

    aws --region=<region> \
      ec2 create-tags \
      --resources <security group id> \
      --tags "Key=elotl.co/nodeless-cluster/name,Value=<cluster_name>"

Once tagged, you must restart the Luna manager pod for Luna to assign the new security group to newly provisioned nodes. Note that existing Luna-managed nodes will not have their security groups updated; they will have to be replaced to pick up the new security group assignment.
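
For example, assuming the elotl namespace and the luna-manager deployment shown earlier, a rolling restart does this:

    kubectl -n elotl rollout restart deployment/luna-manager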