Version: v0.5

AKS

Prerequisites

  1. Azure Cloud Shell (Bash) with the environment variable ENVSUBST pointing to an installation of envsubst (Azure Cloud Shell does not allow root/sudo package installation).
  2. kubectl with the correct context selected, pointing to the cluster on which you want to deploy Luna.
  3. helm: the package manager for Kubernetes
  4. An existing AKS cluster without autoscaling enabled. Note that AKS has both free and standard tier clusters; please ensure your cluster tier can handle your expected load at scale.
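The prerequisites above can be checked before deploying. The following is a minimal sketch and not part of the Luna distribution; the function names require_cmd and luna_preflight are hypothetical, and the checks cover only ENVSUBST, kubectl, and helm.

```shell
# Hypothetical helper: succeed only if the named command is available.
require_cmd() {
  command -v "$1" >/dev/null 2>&1 || {
    echo "missing prerequisite: $1" >&2
    return 1
  }
}

# Hypothetical preflight: check ENVSUBST, kubectl, and helm, then print the
# kubectl context that a deployment would target.
luna_preflight() {
  if [ -z "${ENVSUBST:-}" ] || [ ! -x "$ENVSUBST" ]; then
    echo "ENVSUBST must point to an executable envsubst binary" >&2
    return 1
  fi
  require_cmd kubectl || return 1
  require_cmd helm || return 1
  echo "kubectl context: $(kubectl config current-context)"
}
```

Running luna_preflight prints the current kubectl context so you can confirm it is the cluster you intend to deploy Luna on.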

Considerations

Spot

Luna running on AKS supports allocating Spot instances for bin selection.

If you would like Luna to consider a Spot instance for your workload, but use an on-demand instance if no Spot instance is available, please include the following annotation in your configuration:

annotations:
  node.elotl.co/instance-offerings: "spot, on-demand"

Luna will allocate a Spot instance if available for the lowest-priced right-sized instance type; otherwise, it will allocate an on-demand instance.

If you would like Luna to consider only Spot instances for your workload, and leave the workload pending if no Spot instance is available, please include the following annotation:

annotations:
  node.elotl.co/instance-offerings: "spot"

If a Luna-allocated Spot instance node is terminated, the associated workload will become pending and Luna will again select a node for it.
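To show where the annotation sits in a complete manifest, here is a minimal sketch that prints a pod definition for a given instance-offerings value. The function name spot_pod_manifest, the pod name, and the busybox image are placeholders; the elotl-luna: "true" label is an assumption inferred from the selector used in Step 5, so adjust it to whatever labels your Luna installation matches.

```shell
# Hypothetical helper: print a pod manifest that requests the given Luna
# instance offerings ("spot, on-demand" or "spot").
spot_pod_manifest() {
  offerings="$1"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: spot-demo              # placeholder name
  labels:
    elotl-luna: "true"         # assumed label, taken from the Step 5 selector
  annotations:
    node.elotl.co/instance-offerings: "${offerings}"
spec:
  containers:
  - name: app
    image: busybox             # placeholder image
    command: ["sleep", "3600"]
EOF
}
```

For example, spot_pod_manifest "spot, on-demand" | kubectl apply -f - would submit the pod with the Spot-with-fallback setting.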

Pod Subnet

Luna running on AKS supports specifying the pod subnet used by Dynamic Azure CNI networking for bin selection. By default, Luna will use the same pod subnet as your cluster's system node pool; you can override this choice for your pod.

If you would like Luna to use a particular subnet (e.g., podsubnet1) that you have set up for your workload, please include the following annotation in your configuration:

annotations:
  node.elotl.co/aks-pod-subnet: "podsubnet1"
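For a workload managed by a Deployment, the annotation belongs on the pod template so that every new pod carries it. The sketch below builds a merge patch for that purpose; the function names pod_subnet_patch and set_pod_subnet are hypothetical, and it assumes Luna reads the annotation from pods created after the patch triggers a rollout.

```shell
# Hypothetical helper: print a merge patch that sets the pod-subnet
# annotation on a Deployment's pod template.
pod_subnet_patch() {
  printf '{"spec":{"template":{"metadata":{"annotations":{"node.elotl.co/aks-pod-subnet":"%s"}}}}}\n' "$1"
}

# Hypothetical helper: apply the patch to an existing Deployment.
set_pod_subnet() {  # args: <deployment-name> <pod-subnet-name>
  kubectl patch deployment "$1" --type merge -p "$(pod_subnet_patch "$2")"
}
```

For example, set_pod_subnet my-app podsubnet1 would patch a Deployment named my-app (a placeholder name).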

Managed Identity Authentication Setup

As outlined in Step 2 below, Luna supports three Azure authentication techniques to provide access to an account with the permissions Luna needs to perform its AKS cluster scaling operations.

If you want Luna to use managed identity authentication, you'll need to define a user-assigned managed identity and you'll need to give it the appropriate permissions. At Luna deployment time, you'll provide that managed identity's name in an environment variable and its client id as a parameter. You can create a user-assigned managed identity as shown below:

az identity create --name <user-assigned-identity-name> --resource-group <resource-group-name> --location <cluster-location> --subscription <subscription-id>

Then grant it the "Contributor" role on both of your cluster's resource groups (the cluster's resource group and its node resource group) via:

az role assignment create --assignee <user-assigned-identity-principalId> --role "Contributor" --scope /subscriptions/<subscription-id>/resourceGroups/<resource-group-name>
az role assignment create --assignee <user-assigned-identity-principalId> --role "Contributor" --scope /subscriptions/<subscription-id>/resourceGroups/<node-resource-group-name>
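The placeholders in the commands above can be looked up with the az CLI. The following sketch wraps those lookups in small functions; the function names are hypothetical, while the az subcommands and query fields (principalId, clientId, nodeResourceGroup) are standard az output fields.

```shell
# Print the principal ID (used for role assignments) of a user-assigned identity.
identity_principal_id() {   # args: <identity-name> <resource-group-name>
  az identity show --name "$1" --resource-group "$2" --query principalId -o tsv
}

# Print the client ID (passed to Luna at deployment time) of the identity.
identity_client_id() {      # args: <identity-name> <resource-group-name>
  az identity show --name "$1" --resource-group "$2" --query clientId -o tsv
}

# Print the node resource group that AKS created for the cluster.
node_resource_group() {     # args: <cluster-name> <resource-group-name>
  az aks show --name "$1" --resource-group "$2" --query nodeResourceGroup -o tsv
}

# Build the scope string expected by `az role assignment create --scope`.
resource_group_scope() {    # args: <subscription-id> <resource-group-name>
  printf '/subscriptions/%s/resourceGroups/%s\n' "$1" "$2"
}
```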

To allow managed identity authentication to work in an AKS cluster, Luna uses Azure's workload identity service (https://learn.microsoft.com/en-us/azure/aks/workload-identity-deploy-cluster). The AKS cluster must have the workload identity and OIDC issuer features enabled. You can enable these features at AKS cluster creation time, or you can add them to an existing AKS cluster via:

az aks update -n <cluster-name> -g <resource-group-name> --enable-oidc-issuer --enable-workload-identity
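After the update, you may want to confirm that both features are reported as enabled. The sketch below is one way to do that; the function names are hypothetical, and the property paths oidcIssuerProfile.enabled and securityProfile.workloadIdentity.enabled are assumptions about the shape of the az aks show output, so verify them against your az CLI version.

```shell
# Hypothetical helper: succeed if the given boolean property of the cluster is true.
aks_feature_enabled() {  # args: <cluster-name> <resource-group-name> <JMESPath query>
  [ "$(az aks show --name "$1" --resource-group "$2" --query "$3" -o json)" = "true" ]
}

# Hypothetical helper: succeed only if both features Luna relies on are enabled.
workload_identity_ready() {  # args: <cluster-name> <resource-group-name>
  aks_feature_enabled "$1" "$2" "oidcIssuerProfile.enabled" &&
    aks_feature_enabled "$1" "$2" "securityProfile.workloadIdentity.enabled"
}
```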

Step 1 (optional): Install the NVIDIA GPU device plugin for GPU workloads

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.13.0/nvidia-device-plugin.yml
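Once GPU nodes exist in the cluster, you can check whether they advertise the nvidia.com/gpu resource. This is a small sketch with a hypothetical function name; nodes without GPUs (or without the plugin running) show <none> in the GPU column.

```shell
# Hypothetical helper: list each node with its allocatable nvidia.com/gpu count.
gpu_capacity_by_node() {
  kubectl get nodes \
    -o 'custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}
```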

Step 2: Deploy Luna

Luna needs cert-manager running in the cluster. The deploy script tries to detect cert-manager in the cluster; if it is not found, the script installs cert-manager into the cert-manager namespace.

To perform AKS cluster scaling, Luna needs create/read/update/delete access for node pools in the AKS cluster's resource group, read access on its VM SKUs, and read/update access on its VM scale set. To provide Luna with access to an account with the appropriate permissions, you can choose from these three Azure authentication methods (https://learn.microsoft.com/en-us/azure/developer/go/azure-sdk-authentication?tabs=bash):

  • client-secret. Set the environment variable AZURE_CLIENT_SECRET in the deployment environment.
  • username+password. Set the environment variables AZURE_USERNAME and AZURE_PASSWORD in the deployment environment.
  • managed identity. Set the environment variable AZURE_MANAGED_IDENTITY in the deployment environment. Please see the "Managed Identity Authentication Setup" section above for setup details.
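Before deploying, it can help to confirm which of these methods your environment is set up for. The sketch below reports the first method whose variables are set; the function name luna_auth_method is hypothetical, and the order of the checks is an arbitrary choice for this sketch, not the documented behavior of the deploy script.

```shell
# Hypothetical helper: report which authentication variables are set.
luna_auth_method() {
  if [ -n "${AZURE_CLIENT_SECRET:-}" ]; then
    echo "client-secret"
  elif [ -n "${AZURE_USERNAME:-}" ] && [ -n "${AZURE_PASSWORD:-}" ]; then
    echo "username+password"
  elif [ -n "${AZURE_MANAGED_IDENTITY:-}" ]; then
    echo "managed identity"
  else
    echo "none"
    return 1
  fi
}
```

The function returns a non-zero status when none of the variables is set, so it can gate a deployment script.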

You can then run the following command to deploy Luna into your AKS cluster:

./deploy.sh <resource-group-name> <cluster-name> <cluster-location> <subscription-id> <tenant-id> <client-id> <additional-helm-values(optional)>

Note: This command generates a <cluster-name>_values.yaml file; please retain this file for use in future upgrades.

Also note: Azure kube-system metrics-server pods can block a node from being scaled down, because the pods mount local storage (/tmp mounted to tmp-dir of type EmptyDir) for scratch space and the Luna scaleDown option skipNodesWithLocalStorage is true by default. Include "--set scaleDown.skipNodesWithLocalStorage=false" in the set of <additional-helm-values> to avoid this blocker to Luna scaleDown.
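The subscription and tenant IDs needed by the deploy command can be read from the active az login. The sketch below only prints a candidate command line, including the scale-down override described above, so you can review it before running it. The function name is hypothetical, and whether deploy.sh expects the additional Helm values as one quoted argument or as separate words is not stated in this guide, so check the script before running the printed command.

```shell
# Hypothetical helper: print (not run) a deploy.sh command line for the
# subscription and tenant of the current az login.
print_deploy_command() {  # args: <resource-group-name> <cluster-name> <cluster-location> <client-id>
  subscription_id=$(az account show --query id -o tsv) || return 1
  tenant_id=$(az account show --query tenantId -o tsv) || return 1
  printf './deploy.sh %s %s %s %s %s %s %s\n' \
    "$1" "$2" "$3" "$subscription_id" "$tenant_id" "$4" \
    "--set scaleDown.skipNodesWithLocalStorage=false"
}
```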

Step 3: Verify Luna

kubectl get all -n elotl

Sample Output
NAME                                      READY   STATUS    RESTARTS   AGE
pod/elotl-luna-manager-6bd7f4674d-cxwz6   1/1     Running   0          2m39s
pod/elotl-luna-webhook-7fcf5998b6-ltrd6   1/1     Running   0          2m39s
pod/elotl-luna-webhook-7fcf5998b6-svr6b   1/1     Running   0          2m39s

NAME                         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
service/elotl-luna-manager   ClusterIP   x.x.x.x      <none>        9090/TCP   2m39s
service/elotl-luna-webhook   ClusterIP   x.x.x.x      <none>        8443/TCP   2m39s

NAME                                 READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/elotl-luna-manager   1/1     1            1           2m39s
deployment.apps/elotl-luna-webhook   2/2     2            2           2m39s

NAME                                            DESIRED   CURRENT   READY   AGE
replicaset.apps/elotl-luna-manager-6bd7f4674d   1         1         1       2m39s
replicaset.apps/elotl-luna-webhook-7fcf5998b6   2         2         2       2m39s
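Instead of inspecting the listing by eye, you can block until both Luna deployments report a completed rollout. This is a minimal sketch; the function name wait_for_luna and the 180-second timeout are arbitrary choices, while the deployment names and namespace are taken from the sample output above.

```shell
# Hypothetical helper: wait for the Luna manager and webhook rollouts to finish.
wait_for_luna() {
  for deploy in elotl-luna-manager elotl-luna-webhook; do
    kubectl rollout status "deployment/$deploy" -n elotl --timeout=180s || return 1
  done
}
```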

Step 4: Run some workloads!

Follow our tutorial to understand the value provided by Luna.

Step 5: Verify test pod launch and dynamic worker node addition/removal (while testing)

kubectl get pods --selector=elotl-luna=true -o wide -w
kubectl get nodes -w

Upgrade

To upgrade an existing Luna deployment, set either the environment variables AZURE_USERNAME and AZURE_PASSWORD or the environment variable AZURE_CLIENT_SECRET (matching the method used for your install), and run:

helm upgrade elotl-luna <chart-path> --wait --namespace=<cluster-namespace> --values=<retained-path>/<cluster-name>_values.yaml <credential-vals> <additional-helm-values(optional)>

where <credential-vals> is either

--set azure.username="${AZURE_USERNAME}",azure.password="${AZURE_PASSWORD}"

or

--set azure.clientSecret="${AZURE_CLIENT_SECRET}"
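The choice between the two forms can be made from the environment. The sketch below prints the matching --set argument; the function name luna_credential_vals is hypothetical. It assumes the credentials contain no whitespace or commas, since the output is meant for unquoted command substitution and Helm treats commas in --set values as separators.

```shell
# Hypothetical helper: print the --set argument for whichever credentials are set.
luna_credential_vals() {
  if [ -n "${AZURE_CLIENT_SECRET:-}" ]; then
    printf '%s\n' "--set azure.clientSecret=${AZURE_CLIENT_SECRET}"
  elif [ -n "${AZURE_USERNAME:-}" ] && [ -n "${AZURE_PASSWORD:-}" ]; then
    printf '%s\n' "--set azure.username=${AZURE_USERNAME},azure.password=${AZURE_PASSWORD}"
  else
    echo "set AZURE_CLIENT_SECRET, or AZURE_USERNAME and AZURE_PASSWORD" >&2
    return 1
  fi
}
```

The output can then stand in for <credential-vals>, for example by appending $(luna_credential_vals) to the helm upgrade command. Because the function prints the secret, avoid running it where output is logged.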

For example, to upgrade my-cluster with Luna using client-secret authentication from luna-v0.4.6 to luna-v0.5.0 and set an additional helm value binPackingNodeCpu=2, run:

helm upgrade elotl-luna ./elotl-luna-v0.5.0.tgz --wait --create-namespace --namespace=elotl --values=../../luna-v0.4.6/aks/my-cluster_values.yaml --set azure.clientSecret="${AZURE_CLIENT_SECRET}" --set binPackingNodeCpu=2

And validate the upgrade as follows:

helm ls -A
NAME         NAMESPACE   REVISION   UPDATED                                STATUS     CHART               APP VERSION
elotl-luna   elotl       4          2023-05-19 14:15:30.686251 -0700 PDT   deployed   elotl-luna-v0.5.0   v0.5.0
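The same check can be scripted. The sketch below succeeds only when a release named elotl-luna is in the deployed state in the elotl namespace; the function name is hypothetical, and it does not verify the chart version, which you would still read from the listing above.

```shell
# Hypothetical helper: succeed if the elotl-luna release is in "deployed" state.
luna_release_deployed() {
  [ "$(helm ls --namespace elotl --deployed --short --filter '^elotl-luna$')" = "elotl-luna" ]
}
```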

Cleanup

helm uninstall elotl-luna --namespace=elotl
kubectl delete namespace elotl