EKS
Prerequisites
- AWS CLI
- kubectl with the correct context selected, pointing to the cluster you want to deploy Luna on. If the cluster name passed to the deploy script doesn't match the name of the EKS cluster in the kubectl context, the deploy script exits with an error.
- helm: the package manager for Kubernetes
- eksctl >= 0.104.0: to manage the EKS OpenID Connect provider.
- cmctl: the cert-manager command line utility
- An existing EKS cluster with at least 2 nodes (for Luna webhook replica availability). If you don't have one, you can create a new one with eksctl:
eksctl --region=... create cluster --name=...
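For example (my-cluster and us-west-2 below are placeholders; substitute your own cluster name and region), cluster creation and kubeconfig setup might look like:
eksctl --region=us-west-2 create cluster --name=my-cluster --nodes=2
aws eks update-kubeconfig --region us-west-2 --name my-cluster
kubectl config current-context
The last command prints the active context, which should point to the cluster you intend to deploy Luna on.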
Step 1 (optional): Install the NVIDIA GPU device plugin for GPU workloads
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.0/deployments/static/nvidia-device-plugin.yml
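To confirm the device plugin is running and GPUs are being advertised, you can check (the pod label below assumes the default labels in the manifest above):
kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
GPU nodes should report a non-empty value in the GPU column.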
Step 2: Deploy Luna
Luna needs cert-manager running in the cluster. The deploy script tries to detect cert-manager in the cluster and, if it is not found, installs cert-manager into the cert-manager namespace.
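If you want to check whether cert-manager is already installed before running the deploy script, one quick check (using the cmctl prerequisite) is:
kubectl get pods -n cert-manager
cmctl check api
cmctl check api reports whether the cert-manager API is ready to issue certificates.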
cd luna-vX.Y.Z/
./deploy.sh --name <cluster-name> --region <compute-region> [--helm-release-name <release-name>] [--namespace <namespace>] [--additional-helm-values "<additional-helm-values>"]
Note: This command generates an eks-cluster-name_values.yaml file and (in Luna versions after 0.6.0) an eks-cluster-name_helm-release-name_values_full.yaml file; please retain these files for use in future upgrades.
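For example, assuming a cluster named my-cluster in us-west-2 deployed into the elotl namespace (placeholders; adjust to your environment), the invocation might look like:
./deploy.sh --name my-cluster --region us-west-2 --namespace elotl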
Step 3: Verify Luna
kubectl get all -n elotl
Sample output
NAME READY STATUS RESTARTS AGE
pod/luna-manager-5d8578565d-86jwc 1/1 Running 0 56s
pod/luna-webhook-58b7b5dcfb-dwpcb 1/1 Running 0 56s
pod/luna-webhook-58b7b5dcfb-xmlds 1/1 Running 0 56s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/luna-webhook ClusterIP x.x.x.x <none> 8443/TCP 57s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/luna-manager 1/1 1 1 57s
deployment.apps/luna-webhook 2/2 2 2 57s
NAME DESIRED CURRENT READY AGE
replicaset.apps/luna-manager-5d8578565d 1 1 1 57s
replicaset.apps/luna-webhook-58b7b5dcfb 2 2 2 57s
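You can also check the Luna manager logs for startup errors; the deployment name and namespace below match the sample output above:
kubectl logs deployment/luna-manager -n elotl --tail=50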
Step 4: Testing
Follow our tutorial to understand the value provided by Luna.
Step 5: Verify test pod launch and dynamic worker node addition/removal (while testing)
kubectl get pods --selector=elotl-luna=true -o wide -w
kubectl get nodes -w
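If you want a quick smoke test outside the tutorial, the sketch below creates a small deployment whose pods carry the elotl-luna=true label used by the selector above. The deployment name, image, and resource requests are arbitrary placeholders, and the exact labeling scheme Luna expects may depend on your configuration, so treat the tutorial as authoritative:
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: luna-smoke-test
spec:
  replicas: 2
  selector:
    matchLabels:
      app: luna-smoke-test
  template:
    metadata:
      labels:
        app: luna-smoke-test
        elotl-luna: "true"
    spec:
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: "1"
            memory: 1Gi
EOF
Delete the deployment with kubectl delete deployment luna-smoke-test when you are done; the watch commands above should show nodes being added for the pods and removed again afterwards.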
Upgrade
When running the upgrade command described below, set <retained-values-file> to <retained-path>/<cluster-name>_values_full.yaml if your installed Luna version is later than 0.5.4, and to <retained-path>/<cluster-name>_values.yaml otherwise.
To upgrade an existing Luna deployment, run:
helm upgrade elotl-luna <chart-path> --wait --namespace=<cluster-namespace> --values=<retained-values-file> <additional-helm-values(optional)>
For example, to upgrade my-cluster from luna-v0.4.6 to luna-v0.5.0 and set an additional Helm value binPackingNodeCpu=2, run:
helm upgrade elotl-luna ./elotl-luna-v0.5.0.tgz --wait --namespace=elotl --values=../../luna-v0.4.6/eks/my-cluster_values.yaml --set binPackingNodeCpu=2
And validate the upgrade as follows:
helm ls -A
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
elotl-luna elotl 4 2023-05-19 14:15:30.686251 -0700 PDT deployed elotl-luna-v0.5.0 v0.5.0
Cleanup
We recommend deleting all the pods running on the nodes managed by the Luna manager before uninstalling; otherwise, orphan nodes may be left behind once you uninstall Luna. The uninstall.sh script won't remove orphan nodes, to prevent accidentally knocking out critical workloads. These orphan nodes can be easily cleaned up, as described below.
To remove the Luna manager's Helm chart and the custom AWS resources created to run Luna, execute the uninstall script:
./uninstall.sh <cluster-name> <region>
This will not remove the leftover nodes that the Luna manager hasn't scaled down yet. To get the list of orphan nodes' instance IDs, use the following command, replacing <eks-cluster-name> with the name of the cluster:
aws ec2 describe-instances \
--filters Name=tag:elotl.co/nodeless-cluster/name/<eks-cluster-name>,Values=owned \
--query "Reservations[*].Instances[*].[InstanceId]" \
--output text
To ensure all the nodes managed by the Luna manager are deleted, execute the following command, replacing <eks-cluster-name> with the name of the cluster:
aws ec2 terminate-instances --instance-ids \
$(aws ec2 describe-instances \
--filters Name=tag:elotl.co/nodeless-cluster/name/<eks-cluster-name>,Values=owned \
--query "Reservations[*].Instances[*].[InstanceId]" \
--output text)
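To confirm nothing is left running, you can re-run the query with an instance-state filter added (a sketch using standard AWS CLI filters):
aws ec2 describe-instances \
  --filters Name=tag:elotl.co/nodeless-cluster/name/<eks-cluster-name>,Values=owned \
            Name=instance-state-name,Values=pending,running \
  --query "Reservations[*].Instances[*].[InstanceId]" \
  --output text
An empty result means all Luna-managed instances have been terminated.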
Note that all the pods running on these nodes will be forcefully terminated.
To delete the Luna manager and the webhook from the cluster while preserving the AWS resources, execute the following:
helm uninstall elotl-luna --namespace=elotl
kubectl delete namespace elotl
If you decide to uninstall the Helm chart instead of running uninstall.sh, please ensure that all the orphan nodes have been cleaned up as described above.
Notes
Security Groups
Security Groups act as virtual firewalls for EC2 instances, controlling incoming and outgoing traffic. If a node is missing a security group rule, it can affect Luna's ability to attach the node to the EKS cluster or prevent nodes from running pods and services.
To ensure that all the security groups required by EKS are applied to the Luna-managed nodes, we tag security groups with the key elotl.co/nodeless-cluster/name and the cluster name as its value. When it starts, Luna queries which security groups are needed and adds them to the nodes.
When Luna is deployed, the default EKS security groups are automatically tagged. If you wish to tag another security group, you can use the AWS CLI to add the tags to the security group:
aws --region=<region> \
ec2 create-tags \
--resources <security group id> \
--tags "Key=elotl.co/nodeless-cluster/name,Value=<cluster_name>"
Once tagged, you must restart the Luna manager pod for Luna to assign the new security group to newly provisioned nodes. Note that existing Luna-managed nodes will not have their security groups updated; they will have to be replaced to pick up the new security group assignment.
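For example, assuming the default elotl namespace and the luna-manager deployment shown earlier, a rolling restart of the manager looks like:
kubectl rollout restart deployment/luna-manager -n elotl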