Version: v0.5

Luna Configuration

Labels

In order for Luna Manager to manage a pod's scheduling, the following label must be applied:

metadata:
  labels:
    elotl-luna: "true"
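
For example, a Deployment whose pods should be managed by Luna carries the label in its pod template; the Deployment name and image below are illustrative:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
        elotl-luna: "true"
    spec:
      containers:
      - name: my-app
        image: nginx:1.25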

Instance family inclusion and exclusion

To instruct Luna to avoid starting nodes from given instance families for a given workload, annotate the pod as follows:

metadata:
  annotations:
    node.elotl.co/instance-family-exclusions: "t3,t3a"

This will prevent Luna from starting any t3 or t3a instance type for this pod.

To instruct Luna to use only nodes from specific instance families for a given workload, annotate the pod as follows:

metadata:
  annotations:
    node.elotl.co/instance-family-inclusions: "c6g,c6gd,c6gn,g5g"

This will restrict Luna to choosing from the c6g, c6gd, c6gn, and g5g instance types for this pod.

note

Pods with either of these annotations will be bin-selected, regardless of the pod’s resource requirements.

Removal of Under-utilized nodes and pod eviction

Luna is designed to remove nodes deemed under-utilized. A node falls into this category if the pods running on it require less than 20% of the node's CPU or memory for bin packing, or less than 75% of the node's CPU or memory for bin selection. If a node has been consistently under-utilized for longer than scaleDown.nodeUnneededDuration (5 minutes by default), Luna will evict the pods running on the node and then remove it.
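
For example, to extend the under-utilized window to 10 minutes, the option can be passed at deploy time like the other Helm values described below (the value here is illustrative):

./deploy.sh <cluster-name> <cluster-region> \
  --set scaleDown.nodeUnneededDuration=10m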

To prevent Luna from evicting a pod running on an under-utilized node, annotate the pod with pod.elotl.co/do-not-evict: "true" as shown below:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
  annotations:
    pod.elotl.co/do-not-evict: "true"
spec:
  ...

The annotation cluster-autoscaler.kubernetes.io/safe-to-evict: false is also supported.
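
For example, the equivalent cluster-autoscaler annotation looks like this:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
spec:
  ...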

GPU SKU annotation

To instruct Luna to start an instance with a specific graphics card, annotate the pod as follows:

metadata:
  annotations:
    node.elotl.co/instance-gpu-skus: "v100"

This will start a node with a V100 GPU card.
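
A pod would typically combine this annotation with a GPU resource request, for example as sketched below (the pod name, image, and GPU count are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
  annotations:
    node.elotl.co/instance-gpu-skus: "v100"
spec:
  containers:
  - name: training
    image: my-training-image:latest
    resources:
      limits:
        nvidia.com/gpu: 1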

note

Each pod with this annotation will be bin-selected, regardless of the pod’s resource requirements.

Advanced configuration via Helm Values

This is a list of the configuration options for Luna. These values can be passed to Helm when deploying Luna.

The keys and values are passed to the deploy script as follows:

./deploy.sh <cluster-name> <cluster-region> \
--set binSelectPodCpuThreshold=3.0 \
--set binSelectPodMemoryThreshold=2G \
--set binPackingNodeCpu=3250m \
--set binPackingNodeMemory=7Gi \
--set binPackingNodeMinPodCount=42 \
--set binPackingNodeTypeRegexp='^t3a.*$' \
--set binPackingNodePricing='spot\,on-demand' \
--set labels='key1=value1\,key2=value2'

These configuration options can be modified in the configuration map elotl-luna located in the namespace where the Luna manager runs. Once the configuration map has been modified, the Luna manager and its admission webhook must be restarted for the new configuration to take effect.

$ kubectl -n elotl rollout restart deploy/elotl-luna-manager
...
$ kubectl -n elotl rollout restart deploy/elotl-luna-webhook
...

labels

Specify the labels that Luna will use to match the pods to consider.

labels is a comma-separated list of key=value pairs, with the commas escaped for Helm: key1=value1\,key2=value2. Pods with any of these labels will be considered by Luna. The default value is elotl-luna=true.

--set labels='key1=value1\,key2=value2'

loopPeriod

How often the Luna main loop runs, by default 10 seconds. Increasing this value will ease the load on the Kubernetes control plane, while lowering it will intensify that load.

--set loopPeriod=20s

daemonSetSelector

Selects, by label, the daemon sets that will run on the Luna nodes. It is empty by default.

For example, if you wish to run a node with a GPU attached, you will have to select the GPU driver daemon set.

--set daemonSetSelector=name=nvidia-device-plugin-ds

daemonSetExclude

Comma-separated list of names of daemon sets to exclude from those Luna assumes may be active on newly added nodes. It is empty by default.

Use this option to avoid Luna reserving resources for daemon sets that you do not expect to be active on new nodes. For example, the following could be used for Luna on a GKE cluster for which you only plan to use --logging-variant=DEFAULT.

--set daemonSetExclude="fluentbit-gke-256pd\,fluentbit-gke-max\,gke-metrics-agent-scaling-500"

newPodScaleUpDelay

Minimum age a pod must reach before it is considered for scaling up nodes. It is set to 10 seconds by default.

Because pod creation may be scattered over time, it isn't desirable for Luna to react to pod creation immediately. Lowering this delay may result in less efficient packing, while increasing it will delay node creation and raise the mean time to placement of pods.

--set newPodScaleUpDelay=5s

scaleUpTimeout

Time to allow for the new node to be added and the pending pod to be scheduled before considering the scale up operation expired and subject to retry. It is set to 10 minutes by default. This value can be tuned for the target cloud.
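
For example, to allow 15 minutes before a scale-up operation is considered expired (the value is illustrative):

--set scaleUpTimeout=15m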

includeArmInstance

Whether to consider Arm instance types. It is set to false by default.

If this option is enabled, all the images of the pods run by Luna must support both the AMD64 and ARM64 architectures. Otherwise pod creation may fail.
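
For example, to enable Arm instance types:

--set includeArmInstance=true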

placeBoundPVC

Whether to consider pods with bound PVC. It is set to false by default.
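
For example, to let Luna place pods with bound PVCs:

--set placeBoundPVC=true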

reuseBinSelectNodes

Whether to reuse nodes for similar bin-select placed pods. It is set to true by default.
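
For example, to disable node reuse for bin-selected pods:

--set reuseBinSelectNodes=false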

prometheusListenPort

The port number on which the Luna manager and webhook expose their Prometheus metrics. It is 9090 by default.
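
For example, to expose the metrics on a different port (the port number is illustrative):

--set prometheusListenPort=9091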

clusterGPULimit

The maximum number of GPUs to run in the cluster. It is set to 10 by default. Once the GPU count in the cluster reaches this limit, Luna will stop scaling up GPU nodes.
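
For example, to allow up to 20 GPUs in the cluster (the value is illustrative):

--set clusterGPULimit=20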

nodeTags

Tags to add to the cloud instances. It is a comma-separated list of key=value pairs, with the commas escaped for Helm: key1=value1\,key2=value2. It is empty by default.

This is useful to clean up stale nodes:

--set nodeTags='key1=value1\,key2=value2'

loggingVerbosity

How verbose Luna manager and webhook are. It is set to 2 by default.

0: critical, 1: important, 2: informational, 3: debug
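
For example, to enable debug logging:

--set loggingVerbosity=3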

Bin-selection

Bin-selection means running the pod on a dedicated node.

When a pod's requirements are high enough, Luna provisions a dedicated node to run it. Luna uses the pod's requirements to determine the node's optimal configuration, add a new node to the cluster, and run the pod on it. If the pod's CPU requirement is above binSelectPodCpuThreshold and/or its memory requirement is above binSelectPodMemoryThreshold, the pod will be bin-selected and run on a dedicated node.
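
For example, with the thresholds from the deploy example above (binSelectPodCpuThreshold=3.0 and binSelectPodMemoryThreshold=2G), a pod like the following would be bin-selected because its CPU request exceeds the CPU threshold (the pod name and image are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: big-pod
  labels:
    elotl-luna: "true"
spec:
  containers:
  - name: worker
    image: my-worker-image:latest
    resources:
      requests:
        cpu: "4"
        memory: 1Gi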

Bin-packing

Bin-packing means running the pod with other pods on a shared node.

binPackingNodeCpu and binPackingNodeMemory let you configure the shared nodes' capacity. If you have an instance type in mind, set these parameters slightly below the node type you are targeting, to take into account the kubelet and kube-proxy overhead. For example, if you would like to have nodes with 8 vCPUs and 32 GB of memory, set binPackingNodeCpu to "7.5" and binPackingNodeMemory to "28G".

If a pod's requirements are too high for the bin-packing nodes, an over-sized node will be provisioned to handle the pod. For example, if the configured bin-packing nodes typically have 1 vCPU and a bin-packed pod needs 1.5 vCPUs, Luna will provision a node with 1.5 vCPUs to accommodate it. This will only happen when the bin-selection thresholds are above the bin-packing requirements.

Each node type can only run a limited number of pods. binPackingNodeMinPodCount lets you request a node that can support a minimum number of pods.

binPackingNodeTypeRegexp allows you to limit the instance types that will be considered. For example, if you would only like to run instances from the "t3a" family in AWS, you would use: binPackingNodeTypeRegexp='^t3a\..*$'

binPackingNodePricing allows you to indicate the price offerings category for the instances that will be considered. For example, if you would only like to run instances from the "spot" category, you would use: binPackingNodePricing='spot'

binPackingMinimumNodeCount allows you to specify a minimum number of bin-packing nodes. These nodes will be started immediately and will stay online even if no pods are running on them.
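
For example, to keep two bin-packing nodes online at all times (the value is illustrative):

--set binPackingMinimumNodeCount=2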

AWS

This section details AWS-specific configuration options.

Custom AMIs

NOTE: All custom AMIs must include the EKS node bootstrap script at /etc/eks/bootstrap.sh. Otherwise, nodes will not join the cluster.

You can tell Luna to use a specific AMI via the Helm values:

  1. aws.amiIdGeneric for x86-64 nodes
  2. aws.amiIdGenericArm for Arm64 nodes
  3. aws.amiIdGpu for x86-64 nodes with GPU

Each of these configuration options accepts an AMI ID. If the AMI doesn't exist or is not accessible, Luna will log an error and fall back to the latest generic EKS images.

Set these custom AMI IDs via helm values like this:

--set aws.amiIdGeneric=ami-1234567890
--set aws.amiIdGenericArm=ami-1234567890
--set aws.amiIdGpu=ami-1234567890

Azure

This section details Azure-specific configuration options.

Pod Subnet for Dynamic Azure CNI Networking

You can indicate the pod subnet to be used by Dynamic Azure CNI networking for the bin packing node via azure.binPackingNodePodSubnet.

For example, if you would like your bin packing instances to use podsubnet1, do this:

--set azure.binPackingNodePodSubnet=podsubnet1