AKS
Prerequisites
- Azure Cloud Shell (Bash) CLI, with the environment variable ENVSUBST pointing to an installation of envsubst (Azure Cloud Shell does not allow root/sudo package installation).
- kubectl with the correct context selected, pointing to the cluster on which you want to deploy Luna.
- helm: the package manager for Kubernetes
- An existing AKS cluster without autoscaling enabled. Note that AKS has both free and standard tier clusters; please ensure your cluster tier can handle your expected load at scale.
Considerations
Spot
Luna running on AKS supports allocating Spot instances for bin selection.
If you would like Luna to consider a Spot instance for your workload, but fall back to an on-demand instance if no Spot instance is available, please include the following annotation in your configuration:
annotations:
  node.elotl.co/instance-offerings: "spot, on-demand"
Luna will allocate a Spot instance if available for the lowest-priced right-sized instance type; otherwise, it will allocate an on-demand instance.
If you would like Luna to consider only a Spot instance for your workload, and leave the workload pending if no Spot instance is available, please include the following annotation:
annotations:
  node.elotl.co/instance-offerings: "spot"
If a Luna-allocated Spot instance node is terminated, the associated workload will become pending and Luna will again select a node for it.
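As a rough sketch of where the annotation might sit in a complete manifest, the helper below prints a minimal Pod spec carrying it. The function name, container image, and resource requests are hypothetical. Placing the annotation in the pod's metadata and adding the elotl-luna: "true" label (the label selected in Step 5) are assumptions about a typical setup, not confirmed Luna requirements.

```shell
# Hypothetical helper: print a minimal Pod manifest with the Luna
# instance-offerings annotation.
#   $1 = pod name
#   $2 = offerings value (defaults to "spot, on-demand")
# Assumptions: annotation lives in pod metadata; the elotl-luna label
# matches the selector used in Step 5; image and requests are placeholders.
luna_spot_pod_manifest() {
  local name="$1"
  local offerings="${2:-spot, on-demand}"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${name}
  labels:
    elotl-luna: "true"
  annotations:
    node.elotl.co/instance-offerings: "${offerings}"
spec:
  containers:
    - name: app
      image: busybox:1.36
      command: ["sleep", "3600"]
      resources:
        requests:
          cpu: "1"
          memory: 1Gi
EOF
}
```

The output can be piped to kubectl, for example: luna_spot_pod_manifest spot-demo "spot" | kubectl apply -f -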
Pod Subnet
Luna running on AKS supports specifying the pod subnet used by Dynamic Azure CNI networking for bin selection. By default, Luna will use the same pod subnet as your cluster's system node pool; you can override this choice for your pod.
If you would like Luna to use a particular subnet (e.g., podsubnet1) that you have set up for your workload, please include the following annotation in your configuration:
annotations:
  node.elotl.co/aks-pod-subnet: "podsubnet1"
Managed Identity Authentication Setup
As outlined in Step 2 below, Luna supports two Azure authentication techniques to provide access to an account with the permissions Luna needs to perform its AKS cluster scaling operations.
If you want Luna to use managed identity authentication, you'll need to define a user-assigned managed identity and you'll need to give it the appropriate permissions. At Luna deployment time, you'll provide that managed identity's name in an environment variable and its client id as a parameter. You can create a user-assigned managed identity as shown below:
az identity create --name <user-assigned-identity-name> --resource-group <resource-group-name> --location <cluster-location> --subscription <subscription-id>
You can then grant it the "Contributor" role on both of your cluster's resource groups (the cluster's resource group and its node resource group) via:
az role assignment create --assignee <user-assigned-identity-principalId> --role "Contributor" --scope /subscriptions/<subscription-id>/resourceGroups/<resource-group-name>
az role assignment create --assignee <user-assigned-identity-principalId> --role "Contributor" --scope /subscriptions/<subscription-id>/resourceGroups/<node-resource-group-name>
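The placeholders in the commands above can be looked up with the Azure CLI. The helpers below are a minimal sketch: the function names are hypothetical, while the az subcommands and --query fields (principalId, clientId, nodeResourceGroup) are the standard ones reported by az identity show and az aks show.

```shell
# Hypothetical helpers for looking up the values used in the role
# assignments and, later, at Luna deployment time.

# Principal ID of the user-assigned identity ($1 = identity name, $2 = resource group).
identity_principal_id() {
  az identity show --name "$1" --resource-group "$2" --query principalId --output tsv
}

# Client ID of the user-assigned identity ($1 = identity name, $2 = resource group).
identity_client_id() {
  az identity show --name "$1" --resource-group "$2" --query clientId --output tsv
}

# Node resource group of the AKS cluster ($1 = cluster name, $2 = resource group).
cluster_node_resource_group() {
  az aks show --name "$1" --resource-group "$2" --query nodeResourceGroup --output tsv
}
```

For example, identity_principal_id <user-assigned-identity-name> <resource-group-name> prints the value to pass as --assignee.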
To allow managed identity authentication to work in an AKS cluster, Luna uses Azure's workload identity service (https://learn.microsoft.com/en-us/azure/aks/workload-identity-deploy-cluster). The AKS cluster must have the workload identity and OIDC issuer features enabled. You can enable these features at AKS cluster creation time, or you can add them to an existing AKS cluster via:
az aks update -n <cluster-name> -g <resource-group-name> --enable-oidc-issuer --enable-workload-identity
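To confirm that both features are active, one option is to query the cluster's properties. This is a sketch under stated assumptions: the function name is hypothetical, and it assumes az aks show reports the issuer URL under oidcIssuerProfile.issuerUrl and the workload identity flag under securityProfile.workloadIdentity.enabled.

```shell
# Hypothetical check: succeed only if the cluster reports an OIDC issuer URL
# and has workload identity enabled ($1 = cluster name, $2 = resource group).
check_workload_identity_features() {
  local cluster="$1" group="$2" issuer enabled
  issuer="$(az aks show --name "$cluster" --resource-group "$group" \
    --query "oidcIssuerProfile.issuerUrl" --output tsv)"
  enabled="$(az aks show --name "$cluster" --resource-group "$group" \
    --query "securityProfile.workloadIdentity.enabled" --output tsv)"

  # An enabled OIDC issuer is assumed to report an https URL.
  case "$issuer" in
    https://*) ;;
    *) echo "OIDC issuer is not enabled on ${cluster}" >&2; return 1 ;;
  esac

  # Accept either capitalization of the boolean in the CLI output.
  case "$enabled" in
    true|True) ;;
    *) echo "workload identity is not enabled on ${cluster}" >&2; return 1 ;;
  esac

  echo "OIDC issuer and workload identity are enabled on ${cluster}"
}
```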
Step 1 (optional): Install the NVIDIA GPU device plugin for GPU workloads
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.13.0/nvidia-device-plugin.yml
Step 2: Deploy Luna
Luna needs cert-manager running in the cluster. The deploy script tries to detect cert-manager in the cluster and, if it is not found, installs it into the cert-manager namespace.
To perform AKS cluster scaling, Luna needs create/read/update/delete access for node pools in the AKS cluster's resource group, read access on its VM SKUs, and read/update access on its VM scale set. To provide Luna with access to an account with the appropriate permissions, you can choose between these two Azure authentication methods (https://learn.microsoft.com/en-us/azure/developer/go/azure-sdk-authentication?tabs=bash):
- client secret. Pass the client secret as the argument to --client-secret when running the deploy script.
- managed identity. Pass the managed identity name as the argument to --identity-name when running the deploy script. Please see the "Managed Identity Authentication Setup" section above for setup details.
Note: the client secret and the identity name are mutually exclusive; specify only one of them.
You can then run the following command to deploy Luna into your AKS cluster:
./deploy.sh --name <cluster-name> --resource-group <resource-group-name> --location <cluster-location> --subscription <subscription-id> --tenant <tenant-id> --id <client-id> (--identity-name <managed-identity> or --client-secret <client-secret>) [--helm-release-name <release-name>] [--namespace <namespace>] [--additional-helm-values "<additional-helm-values>"]
Note: This command generates an aks-cluster-name_values.yaml file and (in post 0.6.0 Luna) an aks-cluster-name_helm-release-name_values_full.yaml file; please retain these files for use in future upgrades.
Also note: Azure kube-system metrics-server pods can block a node from being scaled down, because the pods mount local storage (/tmp mounted to tmp-dir of type emptyDir) for scratch space and the Luna scaleDown option skipNodesWithLocalStorage is true by default. To avoid this blocker to Luna scale-down, include "--set scaleDown.skipNodesWithLocalStorage=false" in the set of <additional-helm-values>.
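Putting the pieces together, a managed-identity deployment that also disables skipNodesWithLocalStorage might look like the sketch below. The wrapper function and the DEPLOY_SCRIPT variable are hypothetical conveniences; the flags are the ones listed in the deploy command above.

```shell
# Hypothetical wrapper around deploy.sh for managed identity authentication.
# Positional args: cluster name, resource group, location, subscription ID,
# tenant ID, client ID, managed identity name.
# DEPLOY_SCRIPT is an illustrative override; it defaults to ./deploy.sh.
deploy_luna_managed_identity() {
  "${DEPLOY_SCRIPT:-./deploy.sh}" \
    --name "$1" \
    --resource-group "$2" \
    --location "$3" \
    --subscription "$4" \
    --tenant "$5" \
    --id "$6" \
    --identity-name "$7" \
    --additional-helm-values "--set scaleDown.skipNodesWithLocalStorage=false"
}
```

Keeping the helm values in a single quoted argument matches the "<additional-helm-values>" form shown in the deploy command.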
Step 3: Verify Luna
kubectl get all -n elotl
Sample Output
NAME                                      READY   STATUS    RESTARTS   AGE
pod/elotl-luna-manager-6bd7f4674d-cxwz6   1/1     Running   0          2m39s
pod/elotl-luna-webhook-7fcf5998b6-ltrd6   1/1     Running   0          2m39s
pod/elotl-luna-webhook-7fcf5998b6-svr6b   1/1     Running   0          2m39s

NAME                         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
service/elotl-luna-manager   ClusterIP   x.x.x.x      <none>        9090/TCP   2m39s
service/elotl-luna-webhook   ClusterIP   x.x.x.x      <none>        8443/TCP   2m39s

NAME                                 READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/elotl-luna-manager   1/1     1            1           2m39s
deployment.apps/elotl-luna-webhook   2/2     2            2           2m39s

NAME                                            DESIRED   CURRENT   READY   AGE
replicaset.apps/elotl-luna-manager-6bd7f4674d   1         1         1       2m39s
replicaset.apps/elotl-luna-webhook-7fcf5998b6   2         2         2       2m39s
Step 4: Run some workloads!
Follow our tutorial to understand the value provided by Luna.
Step 5: Verify test pod launch and dynamic worker node addition/removal (while testing)
kubectl get pods --selector=elotl-luna=true -o wide -w
kubectl get nodes -w
Upgrade
When running the appropriate upgrade command described below, set <retained-values-file> to <retained-path>/<cluster-name>_values_full.yaml if your installed version was post 0.5.4 Luna, and to <retained-path>/<cluster-name>_values.yaml otherwise.
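If you are unsure which file applies, one possible shortcut is to check which file exists in the retained directory. The helper below is a hypothetical sketch that uses file existence as a stand-in for the version check; it assumes the file names follow the <cluster-name>_values_full.yaml and <cluster-name>_values.yaml patterns given above.

```shell
# Hypothetical helper: print the retained values file to pass to helm upgrade.
#   $1 = retained path, $2 = cluster name
# Prefers the _values_full.yaml file when present, otherwise falls back to
# _values.yaml. File existence is used as a proxy for the installed version.
retained_values_file() {
  local dir="$1" cluster="$2"
  if [ -f "${dir}/${cluster}_values_full.yaml" ]; then
    printf '%s\n' "${dir}/${cluster}_values_full.yaml"
  elif [ -f "${dir}/${cluster}_values.yaml" ]; then
    printf '%s\n' "${dir}/${cluster}_values.yaml"
  else
    echo "no retained values file for ${cluster} in ${dir}" >&2
    return 1
  fi
}
```

The result can then be supplied as --values="$(retained_values_file <retained-path> <cluster-name>)" in the upgrade commands.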
To upgrade an existing Luna deployment that uses managed identity, run:
helm upgrade elotl-luna <chart-path> --wait --namespace=<cluster-namespace> --values=<retained-values-file> <additional-helm-values(optional)>
To upgrade an existing Luna deployment that uses a client secret, run:
helm upgrade elotl-luna <chart-path> --wait --namespace=<cluster-namespace> --values=<retained-values-file> --set azure.clientSecret="<client-secret>" <additional-helm-values(optional)>
For example, to upgrade my-cluster with Luna using client-secret authentication from luna-v0.4.6 to luna-v0.5.0 and set an additional helm value binPackingNodeCpu=2, run:
helm upgrade elotl-luna ./elotl-luna-v0.5.0.tgz --wait --namespace=elotl --values=../../luna-v0.4.6/aks/my-cluster_values.yaml --set azure.clientSecret="<client-secret>" --set binPackingNodeCpu=2
And validate the upgrade as follows:
helm ls -A
NAME         NAMESPACE   REVISION   UPDATED                                STATUS     CHART               APP VERSION
elotl-luna   elotl       4          2023-05-19 14:15:30.686251 -0700 PDT   deployed   elotl-luna-v0.5.0   v0.5.0
Cleanup
helm uninstall elotl-luna --namespace=elotl
kubectl delete namespace elotl