Version: v1.2

Luna Configuration

Pod configuration

In order for Luna Manager to manage a pod's scheduling, the pod configuration must include a label or annotation that matches Luna's configured pod designation setting. By default, Luna's setting specifies that the following label is applied:

metadata:
  labels:
    elotl-luna: "true"

You can change the list of labels Luna will consider with the labels Helm value:

--set labels='key1=value1\,key2=value2'

You can change the list of annotations Luna will consider with the podAnnotations Helm value:

--set podAnnotations='key1=value1\,key2=value2'

To prevent Luna from matching a given pod, annotate it with pod.elotl.co/ignore: "true".
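For example, a pod manifest carrying the ignore annotation (the pod name and container are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-job                  # illustrative name
  annotations:
    pod.elotl.co/ignore: "true"    # Luna will not manage this pod's scheduling
spec:
  containers:
  - name: main
    image: busybox
```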

Instance family configuration

Bin selection

To avoid a given instance family, annotate the pod like this:

metadata:
  annotations:
    node.elotl.co/instance-family-exclusions: "t3,t3a"

In the example above Luna won’t start any t3 or t3a instance type for the pod.

To use a given instance family, annotate the pod like this:

metadata:
  annotations:
    node.elotl.co/instance-family-inclusions: "c6g,c6gd,c6gn,g5g"

In the example above Luna will choose an instance type from the c6g, c6gd, c6gn, or g5g instance families for the pod.

To specify the instance type, you can use a regular expression. For instance, to restrict the instance type to r6a.xlarge, annotate the pod like this:

metadata:
  annotations:
    node.elotl.co/instance-type-regexp: "^r6a.xlarge$"

In the example above, Luna will only consider the r6a.xlarge instance type.

You can combine the instance-type and instance-family annotations like this:

metadata:
  annotations:
    node.elotl.co/instance-type-regexp: '^.*\.xlarge$'
    node.elotl.co/instance-family-exclusions: "r6a"

In the example above, Luna will exclusively consider instance types ending with ".xlarge" and exclude types from the r6a family.

If any of these annotations are present, Luna will schedule the pods on nodes that fulfill all these constraints as well as the resource requirements of the pods. However, if the instance type constraints and the pod's resource requirements are incompatible, no node will be added and the pod will be stuck in the pending state.

Bin packing

Bin packing instance family and type can be configured via the global option binPackingNodeTypeRegexp. Only the instances matching the regular expression will be considered.

For example if you would like to use t3a nodes in AWS, you would set: binPackingNodeTypeRegexp='^t3a\..*$'.

Removal of Under-utilized nodes and possible pod eviction

Luna is designed to remove under-utilized nodes. A node that is running no Luna-managed pods is under-utilized. Additionally, in the case of bin-packing, a node is considered under-utilized if its Luna-managed pods' total resource requests are below scaleDown.binPackNodeUtilizationThreshold, set to 10% by default. If a node has been under-utilized for longer than scaleDown.nodeUnneededDuration, set to 5 minutes by default, and if all Luna-managed pods running on it can be placed on another node, Luna will evict the pods running on the node and remove the node.

To avoid Luna evicting a pod running on an under-utilized node, the pod must be annotated with pod.elotl.co/do-not-evict: true as shown below:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
  annotations:
    pod.elotl.co/do-not-evict: "true"
spec:
  ...

The annotation cluster-autoscaler.kubernetes.io/safe-to-evict: false is also supported.

Note that if Luna-managed bin-packing pods have no resource settings or if their resource settings are inaccurately very low, Luna's detection of under-utilized bin-packing nodes will be wrong. In this case, scaleDown.binPackNodeUtilizationThreshold should be set to 0.0 to avoid Luna evicting pods from bin-packing nodes incorrectly categorized as under-utilized. Please see the next section for more information relevant to such pods.

Management of Over-utilized nodes and possible pod eviction

Luna allocates node resources for pods based on the pods' resource settings. If Luna-managed pods have no resource settings or if their settings are inaccurately too low, Luna-allocated nodes may become over-utilized, causing performance problems.

Luna can be configured to use Kubernetes metrics server data to monitor the CPU and memory utilization of Luna-allocated nodes, and to take action to avoid or reduce high CPU or memory utilization. If the Luna option manageHighUtilization.enabled (default false) is set true, Luna uses metrics server node and pod CPU and memory utilization data as described below.

When a node's CPU utilization >= manageHighUtilization.yellowCPU (default 60) or its memory utilization >= manageHighUtilization.yellowMemory (default 65), Luna adds a taint to the node to prevent the kube scheduler from scheduling more pods on the node. This avoids CPU or memory over-utilization.

When a node's CPU utilization >= manageHighUtilization.redCPU (default 80) or its memory utilization >= manageHighUtilization.redMemory (default 85), Luna performs an eviction of the highest CPU- or memory-demand Luna-scheduled pod that meets the same pod eviction restrictions applied for scale-down.

When a node's CPU utilization < manageHighUtilization.greenCPU (default 10) and its memory utilization is < manageHighUtilization.greenMemory (default 15) and the node has a high utilization taint, that taint is removed from the node. This allows nodes that no longer have high CPU or memory utilization to again host additional pods.
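Putting the three thresholds together, the decision Luna makes for a node can be sketched as follows (a simplified illustration of the rules above, not Luna's actual implementation; the defaults mirror the documented values):

```python
def node_action(cpu, mem, tainted,
                yellow_cpu=60, yellow_mem=65,
                red_cpu=80, red_mem=85,
                green_cpu=10, green_mem=15):
    """Return the action for a node given its CPU/memory utilization (%)."""
    if cpu >= red_cpu or mem >= red_mem:
        # Red: evict the highest CPU- or memory-demand Luna-scheduled pod.
        return "evict-highest-demand-pod"
    if cpu >= yellow_cpu or mem >= yellow_mem:
        # Yellow: taint the node so no further pods are scheduled on it.
        return "add-taint"
    if tainted and cpu < green_cpu and mem < green_mem:
        # Green: utilization dropped back down, remove the taint.
        return "remove-taint"
    return "no-op"
```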

Note that if Luna-managed bin-packing pods have no resource settings or if their resource settings are inaccurately very low, Luna's detection of under-utilized bin-packing nodes will be wrong. Please see the previous section for more information relevant to such pods.

GPU SKU annotation

To instruct Luna to start an instance with a specific graphics card, annotate the pod like this:

metadata:
  annotations:
    node.elotl.co/instance-gpu-skus: "v100"

This will start a node with a V100 GPU card.

Note: each pod with this annotation will be bin-selected, regardless of the pod’s resource requirements.

Advanced configuration via Helm Values

This is a list of the configuration options for Luna. These values can be passed to Helm when deploying Luna.

The keys and values are passed to the deploy script as follows:

./deploy.sh <cluster-name> <cluster-region> \
--set binSelectPodCpuThreshold=3.0 \
--set binSelectPodMemoryThreshold=2Gi \
--set binSelectPodGPUThreshold=1 \
--set binPackingNodeCpu=3250m \
--set binPackingNodeMemory=7Gi \
--set binPackingNodeMinPodCount=42 \
--set binPackingNodeTypeRegexp='^t3a.*$' \
--set binPackingNodePricing='spot,on-demand' \
--set labels='key1=value1,key2=value2'

These configuration options can be modified in the configuration map elotl-luna located in the namespace where Luna manager runs. Once the configuration map has been modified, Luna manager and its admission webhook must be restarted for the new configuration to take effect.

$ kubectl -n elotl rollout restart deploy/elotl-luna-manager
...
$ kubectl -n elotl rollout restart deploy/elotl-luna-webhook
...

labels

Specify the labels that Luna will use to match the pods to consider.

labels is a list of comma-separated key-value pairs: key1=value1\,key2=value2; pods with any of the labels will be considered by Luna. The default value is elotl-luna=true.

--set labels='key1=value1\,key2=value2'

podAnnotations

Specify the annotations that Luna will use to match the pods to consider.

Similar to labels, podAnnotations is a list of comma-separated key-value pairs: key1=value1\,key2=value2; pods with any of the annotations will be considered by Luna. podAnnotations is empty by default.

--set podAnnotations='key1=value1\,key2=value2'

pod.elotl.co/ignore: true

This annotation instructs Luna to ignore a given pod even if it matches labels or podAnnotations.

It's important to note that ignored pods may still be scheduled on Luna-managed nodes, unless these nodes have a specific taint configured. Ignored pods don't have a node selector, so the Kubernetes scheduler will assign them to any available node. If Luna nodes don't have a taint set up, pods that aren't handled by Luna might be scheduled there.

To prevent pods that aren’t managed by Luna from running on Luna-managed nodes, you can utilize node and pod affinity configuration. Node affinity allows you to specify rules that restrict which nodes a pod can be scheduled on, while pod affinity enables you to define rules for co-locating or spreading pods across nodes based on labels.

By combining taints, tolerations, and affinity rules, you can have finer control over pod scheduling and ensure that ignored pods are not inadvertently scheduled on Luna-managed nodes.
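For instance, if Luna nodes are given a taint via the nodeTaints option described later on this page (the key and value below are hypothetical), Luna-managed pods would need a matching toleration to land on those nodes, while pods without it are kept off:

```yaml
# Assumes nodes were created with: --set nodeTaints='{dedicated=luna:NoSchedule}'
# (hypothetical taint). Pods that should run on those nodes then need:
tolerations:
- key: "dedicated"
  operator: "Equal"
  value: "luna"
  effect: "NoSchedule"
```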

loopPeriod

How often the Luna main loop runs, by default 10 seconds. Increasing this value eases the load on the Kubernetes control plane; lowering it makes Luna react more quickly but increases that load.

--set loopPeriod=20s

daemonSetSelector

daemonSetSelector is a label selector for the daemon sets that will run on the Luna nodes.

Luna cannot predict in advance which daemon sets will run on a given node. Since the conditions for daemon sets are dynamic, Luna must estimate which ones will end up on the node, potentially impacting cost optimization.

The daemonSetSelector configuration option allows you to specify the daemon sets Luna should consider in its capacity calculations.

By default, this option is empty, meaning all daemon sets are selected.

For example, to have Luna only consider the impact of the GPU driver daemon set, you can specify:

--set daemonSetSelector=name=nvidia-device-plugin-ds

daemonSetExclude

daemonSetExclude is a comma-separated list of daemon set names that you want to exclude from Luna's list of active daemon sets for newly added nodes.

It is empty by default.

After selecting daemon sets using daemonSetSelector, the sets are further filtered based on the daemonSetExclude list.

Use this option to prevent Luna from reserving resources for daemon sets you do not expect to be active on new nodes. For example, if you are running Luna on a GKE cluster and only plan to use the --logging-variant=DEFAULT, you might exclude the unused daemon sets as follows:

--set daemonSetExclude="fluentbit-gke-256pd\,fluentbit-gke-max\,gke-metrics-agent-scaling-500"

This option may be used along with daemonSetExcludeDesired0.

daemonSetExcludeDesired0

daemonSetExcludeDesired0 is a boolean that you set to true if you want to exclude daemon sets that currently have a Desired count of 0 from Luna's list of active daemon sets for newly added nodes.

It is false by default.

After selecting daemon sets using daemonSetSelector, if daemonSetExcludeDesired0 is true, daemon sets with a Desired count of 0 are filtered out.

Use this option to prevent Luna from reserving resources for daemon sets that are not active on current nodes and that you do not expect to be active on new nodes.

--set daemonSetExcludeDesired0=true

This option may be used along with daemonSetExclude.

newPodScaleUpDelay

Minimum age a pod must reach before it is considered for scaling up nodes. It is set to 10 seconds by default.

Because pod creation may be scattered, it isn’t desirable for Luna to immediately react to pod creation. Lowering this delay may result in less efficient packing, while increasing it will delay the creation of the nodes and increase the mean time to placement of pods.

--set newPodScaleUpDelay=5s

scaleUpTimeout

Time to allow for the new node to be added and the pending pod to be scheduled before considering the scale up operation expired and subject to retry. It is set to 10 minutes by default. This value can be tuned for the target cloud.

includeArmInstance

Whether to consider Arm instance types. It is set to false by default.

If this option is enabled, all the images of the pods run by Luna must support both the AMD64 and ARM64 architectures. Otherwise, pod creation may fail.

placeBoundPVC

Whether to consider pods with bound PVC. It is set to false by default.

placeNodeSelector

Whether to consider pods with existing node selector(s). It is set to false by default. When set to true, a pod's existing node selector(s) must be satisfiable by the Luna and pod settings; otherwise, Luna may allocate a node that cannot be used by the pod.

namespacesExclude

List of comma-separated names of namespaces whose pods should be excluded from Luna management. It is set to kube-system only by default. For example, to run with no namespace restrictions on Luna management, use:

--set namespacesExclude={}

To add the namespace test to the exclusion list specify:

--set namespacesExclude='{kube-system,test}'

Note that if the kube-system namespace is not part of the namespacesExclude list, Luna can spin up additional nodes for kube-system pods marked for Luna placement that remain in the Pending state for too long.

reuseBinSelectNodes

Whether to reuse nodes for similar bin-select placed pods. It is set to true by default.

skipIgnoredPods

Whether to add a node selector to pods not labeled for placement by Luna or to skip adding a node selector to such pods. It is set to false by default.

By default, the Luna webhook sets a node selector for each non-daemonset pod placement request it examines. If a pod is labeled for placement by Luna, its node selector is set to point to a Luna-created node. If a pod is not labeled for placement by Luna, its node selector is set to exclude any Luna-created node; the latter setting is skipped if skipIgnoredPods is set true.

prometheusListenPort

The port number on which Luna manager and webhook expose their Prometheus metrics. It is 9090 by default.

clusterGPULimit

The maximum number of GPUs to run in the cluster. It is set to 10 by default. When the GPU count in the cluster reaches this limit, Luna stops scaling up GPU nodes.

nvidiaGPUTimeSlices

The number of GPU time-slices for NVIDIA GPUs in cluster. It is set to 1 by default. When its value is greater than 1, Luna treats GPUs in cloud instances as N copies of themselves with respect to scheduling GPU resource requests. This value must match the NVIDIA GPU time slices setting for GPU nodes in the cluster for Luna GPU allocation to operate consistently with that setting.

On AKS, EKS, and OKE clusters, the NVIDIA time-slices setting is transparent to the cluster control plane and GPU workloads running in the cluster. The number of NVIDIA GPU time-slices can be set when installing the nvidia-device-plugin helm chart. The time-slices setting will automatically be configured for all NVIDIA GPUs in the cluster, and cluster nodes will use that value when they report their GPU capacity. GPU workloads transparently get a slice for each GPU resource they request.

On GKE clusters, the NVIDIA time-slices setting is visible to the cluster control plane and to GPU workloads running in the cluster. Luna configures the GPU slice count in the GKE node pool used for GPU node allocation. Note that GPU pods running on GKE clusters with time-sliced GPUs must include nodeSelectors indicating the workload can use time-shared GPUs and specifying the max clients-per-gpu value allowed. And the GPU pods running on time-sliced GPUs cannot specify a nvidia.com/gpu resource limit value greater than 1. Please see the associated GCP documentation for more details.
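As an illustration, a GPU pod on a GKE cluster with time-sliced GPUs might look like the sketch below. The selector keys follow GCP's GPU time-sharing documentation and the pod name, image, and slice count are assumptions; verify the exact keys against the current GCP docs:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload               # illustrative name
  labels:
    elotl-luna: "true"
spec:
  nodeSelector:
    cloud.google.com/gke-gpu-sharing-strategy: time-sharing
    cloud.google.com/gke-max-shared-clients-per-gpu: "2"   # example slice count
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04             # illustrative image
    resources:
      limits:
        nvidia.com/gpu: 1          # must not exceed 1 on time-sliced GPUs
```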

binSelectMaxPodsPerNode & binPackingMaxPodsPerNode

These configuration options control the maximum number of pods that can run on each node. Setting lower values can minimize the use of network resources like interfaces or IP addresses on the nodes.

AWS

When binSelectMaxPodsPerNode or binPackingMaxPodsPerNode is set to 0 (the default), Luna does not explicitly set the maximum number of pods on the nodes, leaving the AWS-defined ENI limit as the effective maximum pods per node. If you set a value greater than 0, Luna will set the specified maximum number of pods on the nodes.

For nodes with up to 30 VCPUs, the maximum number of pods per node is capped at 110. For nodes with more than 30 VCPUs, the maximum increases to 250.
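The cap described above amounts to the following (an illustrative sketch of the documented rule, not Luna's code):

```python
def aws_pod_cap(vcpus):
    # Cap on the maximum pods per node when an explicit value is set:
    # 110 for nodes with up to 30 vCPUs, 250 for larger nodes.
    return 110 if vcpus <= 30 else 250
```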

GCP

When binSelectMaxPodsPerNode or binPackingMaxPodsPerNode is set to 0 (the default), Luna defaults to a limit of 110 pods per node. These values must be between 8 and 256; otherwise, the API will produce an error, and nodes will not be created.

Azure

When binSelectMaxPodsPerNode or binPackingMaxPodsPerNode is set to 0 (the default), Luna uses the node’s default max pods per node. For clusters using Kubenet networking, this default is 110 pods per node; for clusters using CNI networking, it is 250 pods per node. The maximum value either can be set to is 250.

OCI

When binSelectMaxPodsPerNode or binPackingMaxPodsPerNode is set to 0 (the default), Luna defaults to a limit of 110 pods per node.

This option only applies to OKE clusters that use OCI_VCN_IP_NATIVE networking, and indicates how Luna should set max pods per node on the nodes it allocates. If the value is 0 (the default), Luna sets max pods per node to the maximum supported by the compute shape's vNICs. If it is greater than 0, Luna sets max pods per node to the minimum of the configured value and the maximum supported by the compute shape's vNICs.

nodeLabels

Labels to add to the nodes. It is a key-value mapping, empty by default.

For example, to add a label foo=bar to your nodes, use the following flag:

--set nodeLabels.foo=bar

Note that if you need to include dots in the label’s key you will have to escape them with \:

--set nodeLabels.my\.label\.example=value

nodeTags

Tags to add to the cloud instances. It is a key-value mapping, empty by default.

This can be useful for tracking and cleaning up stale cloud instances. For instance, to add the tags key1=value1 and key2=value2, use:

--set nodeTags.key1=value1
--set nodeTags.key2=value2

Note that the nodeTags option is not supported on GKE.

nodeTaints

To add taints to the nodes created by Luna, use the nodeTaints configuration option:

--set nodeTaints='{key1=value1:NoSchedule,key2=value2:NoExecute}'

Note that the nodeTaints option is not supported under Oracle Container Engine for Kubernetes (OKE).

loggingVerbosity

How verbose Luna manager and webhook are. It is set to 2 by default.

0: critical, 1: important, 2: informational, 3: debug

scaleDown.nodeUnneededDuration

If a node remains idle for longer than nodeUnneededDuration, Luna manager will scale it down. Default: 5m.

--set scaleDown.nodeUnneededDuration=1m

scaleDown.skipNodeWithSystemPods

Determines whether to skip nodes running pods from the kube-system namespace. Daemonset pods are never considered by Luna; this only applies to deployment pods. Default: false.

scaleDown.skipNodesWithLocalStorage

When true, Luna manager will never scale down nodes with local storage attached to a pod. Default: true.

scaleDown.skipEvictDaemonSetPods

When true, Luna manager will skip evicting daemonset pods from nodes removed for scale down. Default: false.

scaleDown.minReplicaCount

The minimum replica count ensures that the specified number of replicas are always available during node scale-down. Default: 0.

scaleDown.binPackNodeUtilizationThreshold

Defines the utilization threshold to scale down bin-packed nodes, ranging from 0.0 (0% utilization) to 1.0 (100% utilization). Default: 0.1 (10%).

Note that the Helm option --set cannot parse floating point numbers. Use --set-json to define scaleDown.binPackNodeUtilizationThreshold.
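For example, to set the threshold to 20% (the value is illustrative):

```shell
helm upgrade ... --set-json 'scaleDown.binPackNodeUtilizationThreshold=0.2'
```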

scaleDown.minNodeCountPerZone

For clusters supporting zone spread (currently only EKS clusters and GKE regional clusters), indicates the minimum number of nodes (0 or 1) that Luna should keep running per zone in target pools into which zone spread pods may be placed. This minimum is maintained even when no normal (not daemonset or mirror) Luna pods are currently running in the pool. Default: 0. Note that EKS does not support setting this value to 1.

In general, Luna keeps a minimum of 1 node per zone in node pools that may be used for zone spread, to ensure kube-scheduler can see all the zones in its target node set and hence can make the desired zone spread choices. Setting scaleDown.minNodeCountPerZone to 1 to maintain a min of 1 node per zone even when the associated count of normal (not daemonset or mirror) Luna pods is 0 avoids a possible race where kube-scheduler sees zone-spread pods arrive for scheduling when some but not all of a node pool's per-zone nodes have scaled down.

scaleDown.nodeTTL

When > 0, enables Luna support for node time-to-live. When scaleDown.nodeTTL is set to a non-zero value, it must be set to a value greater than or equal to scaleUpTimeout. If scaleDown.nodeTTL is less than scaleUpTimeout, Luna will set it to scaleUpTimeout internally and will emit a warning in the logs. Default: 0m (time-to-live unlimited).

When scaleDown.nodeTTL is set to a non-zero value, Luna uses the value as a time-to-live for its allocated nodes; Luna cordons, drains, and terminates its allocated nodes once they have been running longer than the specified scaleDown.nodeTTL time.

If a nodeTTL-expired node contains any pods with do-not-evict annotations (i.e., pod.elotl.co/do-not-evict: true or cluster-autoscaler.kubernetes.io/safe-to-evict: false), Luna supports the node's graceful termination by cordoning it, draining its non-kube-system non-daemonset pods except the do-not-evict pods, and then adding the configurable annotation scaleDown.drainedAnnotation to it. An external controller monitoring nodes for that annotation can perform eviction-related operations with respect to the do-not-evict pods and then remove their do-not-evict annotation. Once a nodeTTL-expired node contains no do-not-evict pods, Luna terminates the node.

scaleDown.managedNodeDelete

Set true to enable Luna support for graceful termination of nodes that are externally-deleted (e.g., "kubectl delete node/node-name"). Default: true.

When scaleDown.managedNodeDelete is set true, Luna adds a finalizer to its allocated nodes, allowing Luna to detect external deletion operations on those nodes. When Luna detects external deletion of an allocated node, if that node contains any do-not-evict pods, Luna performs the graceful termination steps outlined in scaleDown.nodeTTL. Once an externally-deleted Luna-allocated node contains no do-not-evict pods, Luna removes its finalizer from blocking the K8s node deletion and deletes the node from the cloud.

Note that if scaleDown.managedNodeDelete is set, the deletion of Luna-allocated nodes requires the removal of the Luna finalizer; hence, if Luna is disabled with some of its allocated nodes remaining and you later want to remove those nodes, you will need to manually remove the finalizer.

scaleDown.drainedAnnotation

Annotation used during graceful node termination; see scaleDown.nodeTTL or scaleDown.managedNodeDelete. Default: key: node.elotl.co/drained; value: true.

Pod retry

Luna cannot guarantee that a pod will run on one of its nodes; the node and pod have to be properly configured. If a pod is still in the pending state once the requested node is online, Luna will retry after a configurable delay, up to a configurable number of times.

How pod retry works:

  1. A new pod is created, the Luna webhook matches it, and a new node is provisioned by Luna manager.
  2. Luna manager waits for the node to come online or until scaleUpTimeout has passed, whichever happens first.
  3. Once the node is online or the request has timed out, Luna checks the pod’s status after podRetryPeriod has elapsed.
  4. If the pod is still in the pending state, there are two cases:
    1. The pod has been retried fewer than maxPodRetries times: the annotation pod.elotl.co/retry-count is added to the pod or incremented, and the pod will be retried after podRetryPeriod.
    2. The pod has been retried maxPodRetries times: the annotation pod.elotl.co/ignore: true is added to the pod. The pod will now be ignored by Luna until the annotation is removed.
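The annotation bookkeeping in step 4 can be sketched like this (a simplified model of the behavior described above, not Luna's actual code):

```python
def retry_decision(annotations, max_pod_retries=3):
    """Update a pending pod's annotations after a retry check.

    Increments pod.elotl.co/retry-count until max_pod_retries is
    reached, then marks the pod ignored."""
    count = int(annotations.get("pod.elotl.co/retry-count", "0"))
    if count < max_pod_retries:
        annotations["pod.elotl.co/retry-count"] = str(count + 1)
    else:
        annotations["pod.elotl.co/ignore"] = "true"
    return annotations
```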

maxPodRetries

Sets the maximum retry attempts for a pod. Each retry increments the annotation pod.elotl.co/retry-count on the pod. Once this limit is exceeded, the pod is annotated with pod.elotl.co/ignore: true, indicating Luna should ignore the pod until the annotation is removed.

Default: 3

podRetryPeriod

Determines the delay before Luna retries deploying a pod that remains in the pending state even after its node is available. This period must allow adequate time for Kubernetes to schedule the pod; otherwise, Luna may temporarily create unnecessary nodes.

Default: 5 minutes

Bin-selection

Bin-selection means running the pod on a dedicated node.

When a pod’s requirements are high enough, Luna provisions a dedicated node to run it. Luna uses the pod’s requirements to determine the node’s optimal configuration, adds a new node to the cluster, and runs the pod on it. If the pod’s CPU requirement is at or above binSelectPodCpuThreshold, its memory requirement is at or above binSelectPodMemoryThreshold, or its GPU requirement is at or above binSelectPodGPUThreshold, the pod will be bin-selected and run on a dedicated node.
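The bin-selection decision can be sketched as follows (illustrative only; the threshold defaults are taken from the deploy.sh example earlier on this page):

```python
def is_bin_selected(cpu, memory_gib, gpu,
                    cpu_threshold=3.0, memory_threshold_gib=2.0,
                    gpu_threshold=1):
    """A pod is bin-selected (gets a dedicated node) when any of its
    resource requests reaches the corresponding threshold; otherwise
    it is bin-packed onto a shared node."""
    return (cpu >= cpu_threshold
            or memory_gib >= memory_threshold_gib
            or gpu >= gpu_threshold)
```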

Bin-packing

Bin-packing means running the pod with other pods on a shared node.

binPackingNodeCpu, binPackingNodeMemory, and binPackingNodeGPU let you configure the shared nodes’ requirements. If you have an instance type in mind, set these parameters slightly below the node type you are targeting, to account for the kubelet and kube-proxy overhead. For example, if you would like non-GPU nodes with 8 vCPUs and 32 GB of memory, set binPackingNodeCpu to "7.5" and binPackingNodeMemory to "28G".

If a pod’s requirements exceed the configured bin-packing node size, an over-sized node will be provisioned to handle the pod. For example, if configured bin-packing nodes have 1 vCPU and a bin-packed pod needs 1.5 vCPUs, Luna will provision a node with 1.5 vCPUs to accommodate the pod. This only happens when the bin-selection thresholds are above the bin-packing requirements.

Each node type can only run a limited number of pods. binPackingNodeMinPodCount lets you request a node that can support a minimum number of pods.

binPackingNodeTypeRegexp allows you to limit the instances that will be considered. For example if you would only like to run instances from "t3a" family in AWS you would do: binPackingNodeTypeRegexp='^t3a\..*$'

binPackingNodePricing allows you to indicate the price offerings category for the instances that will be considered. For example if you would only like to run instances from the "spot" category you would do: binPackingNodePricing='spot'

binPackingMinimumNodeCount allows you to specify the minimum number of bin packed nodes. The nodes will be started immediately and will stay online even if no pods are running on them.

Luna’s own deployment and pod configuration

Annotations, tolerations, and affinity

Use the Helm value annotations to add custom annotations to Luna manager and webhook deployments:

$ helm install ... --set annotations.foo=bar --set annotations.hello=world ...

To add custom tolerations to Luna’s own pods use the configuration option tolerations.

The tolerations specification is rather complex, therefore we recommend you define it in a Helm values file and pass its filename with the -f or --values options:

$ cat tolerations.yaml
tolerations:
- key: "foo"
  value: "bar"
  operator: "Equal"
  effect: "NoSchedule"
$ helm install ... --values tolerations.yaml ...

To add custom affinity to Luna’s own pods use the configuration option affinity.

The affinity specification is rather complex, therefore we recommend you define it in a Helm values file and pass its filename with the -f or --values options:

$ cat affinity.yaml
# Helm values
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values:
          - us-central1-f
$ helm install ... --values affinity.yaml ...

Note that setting the affinity parameter will override the default affinity, which prevents Luna pods from running on Luna-managed nodes:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: node.elotl.co/managed-by
          operator: DoesNotExist

Add this snippet to your own affinity definition to prevent Luna pods from running on Luna managed nodes.

Webhook port

You can change the port of the mutation webhook with webhookPort configuration option:

$ helm install ... --set webhookPort=8999 ...

AWS

This section details AWS specific configuration options.

Custom AMIs

NOTE: All custom AMIs must include the EKS node bootstrap script at /etc/eks/bootstrap.sh. Otherwise, nodes will not join the cluster.

You can tell Luna to use a specific AMI via the Helm values:

  1. aws.amiIdGeneric for x86-64 nodes
  2. aws.amiIdGenericArm for Arm64 nodes
  3. aws.amiIdGpu for x86-64 nodes with GPU

Each of these configuration options accepts an AMI ID. If the AMI doesn’t exist or is not accessible, Luna will log an error and fall back to the latest generic EKS images.

Set these custom AMI IDs via helm values like this:

--set aws.amiIdGeneric=ami-1234567890
--set aws.amiIdGenericArm=ami-1234567890
--set aws.amiIdGpu=ami-1234567890

Custom AMIs with SSM

Amazon offers various EKS image families such as Amazon Linux, Ubuntu, and Bottlerocket. Luna can use AWS SSM to fetch the most up-to-date image from its store.

For Amazon Linux, you can get the latest EKS image for Kubernetes 1.27 on arm64 nodes at /aws/service/eks/optimized-ami/1.27/amazon-linux-2-arm64/recommended/image_id.

To configure an SSM query for each image type, use aws.imageSsmQueryGeneric, aws.imageSsmQueryGenericArm, and aws.imageSsmQueryGpu. Each of these parameters may include exactly one "%s" marker, which is replaced with the Kubernetes version.
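The "%s" substitution behaves like ordinary string formatting; for example (the version value is illustrative):

```python
# One "%s" marker in the SSM query is replaced with the Kubernetes version.
query = "/aws/service/bottlerocket/aws-k8s-%s/x86_64/latest/image_id"
k8s_version = "1.27"  # illustrative version
resolved = query % k8s_version
```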

For example here’s how to use BottleRocket images:

--set aws.imageSsmQueryGeneric="/aws/service/bottlerocket/aws-k8s-%s/x86_64/latest/image_id"
--set aws.imageSsmQueryGenericArm="/aws/service/bottlerocket/aws-k8s-%s/arm64/latest/image_id"
--set aws.imageSsmQueryGpu="/aws/service/bottlerocket/aws-k8s-%s-nvidia/x86_64/latest/image_id"

To use Ubuntu:

--set aws.imageSsmQueryGeneric="/aws/service/canonical/ubuntu/eks/20.04/%s/stable/current/amd64/hvm/ebs-gp2/ami-id"
--set aws.imageSsmQueryGenericArm="/aws/service/canonical/ubuntu/eks/20.04/%s/stable/current/arm64/hvm/ebs-gp2/ami-id"

Block device mappings

To customize disk settings for your EKS nodes, use the aws.blockDeviceMappings option. Configure it with JSON in a format like this:

[
  {
    "DeviceName": "/dev/xvda",
    "Ebs": {
      "DeleteOnTermination": true,
      "VolumeSize": 42,
      "VolumeType": "gp2",
      "Encrypted": false
    }
  }
]

Use Helm’s --set-string, --set-json, or --set-file options to set aws.blockDeviceMappings; avoid --set since it mangles its input.

For example:

$ cat block_device_mapping.json
[
  {
    "DeviceName": "/dev/xvda",
    "Ebs": {
      "DeleteOnTermination": true,
      "VolumeSize": 42,
      "VolumeType": "gp2",
      "Encrypted": false
    }
  }
]
$ helm ... --set-file aws.blockDeviceMappings=block_device_mapping.json

Bin Packing Zone Spread

When aws.binPackingZoneSpread is true (default false), Luna supports placement of bin packing pods that specify zone spread. To support bin packing zone spread, Luna keeps at least one bin packing node running in each zone associated with the EKS cluster as long as there are any Luna bin packing pods running.

User data

userData allows you to define a script to be executed after nodes have been bootstrapped.

For example specifying --set-string aws.userData="echo hello > /tmp/hello" will create a file named /tmp/hello with hello in it on the node once the EKS bootstrap script has completed.

If you have a large script we recommend you use the --set-file Helm option to load it:

$ cat myscript.sh
apt-get install my-package
$ ./deploy.sh ... --additional-helm-values "--set-file aws.userData=myscript.sh"

It is empty by default.

IMDS Metadata

metaData defines the instance metadata for EKS nodes. It’s a JSON document conforming to this specification.

Example:

{
"HttpEndpoint": "enabled",
"HttpProtocolIpv6": "disabled",
"HttpPutResponseHopLimit": 42,
"HttpTokens": "required",
"InstanceMetadataTagsState": "enabled"
}

Default: Empty.

Use --set-string or --set-file with Helm to set the instance metadata; --set will mangle the input.
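For example, the metadata document can be loaded from a file, following the same pattern used for aws.blockDeviceMappings above. This sketch assumes the option is exposed under the aws prefix like the other EKS settings; the file name is illustrative:

```shell
$ cat instance_metadata.json
{
  "HttpEndpoint": "enabled",
  "HttpPutResponseHopLimit": 42,
  "HttpTokens": "required"
}
$ helm ... --set-file aws.metaData=instance_metadata.json
```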

GCP

This section details GCP-specific configuration options.

Image Type

By default, Luna allows GCP to select the image type for nodes Luna adds to the cluster. The option gcp.imageType can be used to instead have Luna specify the image type for its added nodes. GCP's default image type and its valid image type values are available via the following command:

gcloud container get-server-config

For example, if you would like to set the image type for Luna nodes to UBUNTU_CONTAINERD, do this:

--set gcp.imageType=UBUNTU_CONTAINERD

Disk Type and size

The gcp.diskType parameter specifies the type of disk to use on the nodes. The available options are pd-standard, pd-ssd, and pd-balanced. By default pd-balanced is used.

To set a specific disk type, use the following Helm parameter:

--set gcp.diskType=pd-ssd

It’s important to note that not all instance types are compatible with the pd-standard disk type. If Luna selects a machine series such as C3 or G2 and gcp.diskType is set to pd-standard, node creation will fail.

The gcp.diskSizeGb parameter defines the disk size in gigabytes. If the disk size is too small to accommodate the system, the node may fail to start or may not be functional after starting. The minimum disk size is 10 GB, and the default size is 100 GB.

To set a specific disk size, use the following Helm parameter:

--set gcp.diskSizeGb=200

Node Service Account

By default, node VMs access Google Cloud Platform using the default service account. The option gcp.nodeServiceAccount can be set to the email address of an alternative service account to be used by the Luna-allocated node VMs.

For example, to set an alternative Google Cloud Platform service account for the Luna-allocated node VMs, do this:

--set gcp.nodeServiceAccount=myemail@myproject.iam.gserviceaccount.com

Network Tags

gcp.networkTags specifies the network tags to add to the nodes. It is a list of strings.

--set gcp.networkTags[0]=tag-value
--set gcp.networkTags[1]=other-tag-value

Empty by default.
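If you prefer to pass the whole list at once, Helm 3.10 and later accept a JSON array via --set-json; the following is equivalent to the indexed --set form above:

```shell
--set-json 'gcp.networkTags=["tag-value","other-tag-value"]'
```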

GCE Instance Metadata

gcp.gceInstanceMetadata specifies the metadata to add to the GCE instance backing the Kubernetes node. gcp.gceInstanceMetadata is a dictionary.

--set gcp.gceInstanceMetadata.key1=value1
--set gcp.gceInstanceMetadata.key2=value2

Empty by default.

Bin Packing Zone Spread

When gcp.binPackingZoneSpread is true (default is false) on a regional GKE cluster, Luna supports placement of bin packing pods that specify zone spread. When this feature is enabled, Luna ensures there is a minimum of one bin packing node in each zone as long as there are bin packing pods running, giving kube-scheduler visibility into all zones.

Node Management: auto-upgrade and auto-repair

gcp.autoUpgrade and gcp.autoRepair define the node management services for the node pools. Both are true by default.

See the GKE documentation for NodeManagement for more information.

To disable auto-upgrade and auto-repair pass the following Helm values:

--set gcp.autoUpgrade=false
--set gcp.autoRepair=false

Note that to disable node auto-upgrade on node pools, the cluster must be configured to use a static version instead of release channels. Otherwise, node creation will fail.

Shielded Instance Configuration: secure boot and integrity monitoring

gcp.enableSecureBoot and gcp.enableIntegrityMonitoring configure the options controlling secure boot and integrity monitoring on the node pools. Both are false by default.

See the GKE documentation for ShieldedInstanceConfig for more information.

To enable secure boot and integrity monitoring pass the following Helm values:

--set gcp.enableSecureBoot=true
--set gcp.enableIntegrityMonitoring=true

Node Version

gcp.version specifies the version of Kubernetes to run on the Luna-managed nodes. It is empty by default. When the version is not specified, each node will be started with the same version of Kubernetes running on the control plane.

To get the list of available versions you can run the following command:

gcloud container get-server-config --format="yaml(validNodeVersions)"

Because an unspecified gcp.version defaults to the control plane version, updating the control plane means that existing node pools running older versions will no longer scale up; instead, new node pools with the updated version will be created.

Ensure that the gcp.version you select is compatible with your cluster. Incompatibility will prevent Luna from successfully provisioning the nodes.
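For example, to pin Luna-managed nodes to a specific version, pass it as a string. The version below is illustrative; pick one from the validNodeVersions list returned by the command above:

```shell
--set-string gcp.version=1.27.8-gke.1067004
```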

OAuth Scopes

gcp.oauthScopes specifies the set of Google API scopes to be made available on all of the node VMs under the "default" service account. It’s an array of strings. It is empty by default.

The specified scopes will be added to the built-in scopes. Built-in scopes are cluster type dependent. See the Google Kubernetes Engine documentation about OAuth Scopes to learn more.

For example to allow nodes to mount persistent storage and communicate with gcr.io add the following Helm values:

--set gcp.oauthScopes[0]=https://www.googleapis.com/auth/compute
--set gcp.oauthScopes[1]=https://www.googleapis.com/auth/devstorage.read_only

If you want to reset the gcp.oauthScopes parameter after it has been set, you have a few options:

  • Use --set gcp.oauthScopes=null during upgrades
  • Use --set-json 'gcp.oauthScopes=[]' during upgrades
  • Set the parameter to an empty array in the values file
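For example, the null reset during an upgrade looks like this; the release and chart names are placeholders for your own deployment:

```shell
helm upgrade my-luna-release my-luna-chart --reuse-values --set gcp.oauthScopes=null
```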

Resource Labels

gcp.resourceLabels are resource labels added to nodes in the form of key-value pairs. By default, this mapping is empty.

To add a resource label foo=bar to your nodes, use the following command:

--set gcp.resourceLabels.foo=bar

If your label key includes dots, you must escape them using a backslash \. For example:

--set gcp.resourceLabels.my\.label\.example=value

For more detailed information, refer to the Google Cloud documentation on Labeling Resources.

Azure

This section details Azure-specific configuration options.

Pod Subnet for Dynamic Azure CNI Networking

You can indicate the pod subnet to be used by Dynamic Azure CNI networking for the bin packing node via azure.binPackingNodePodSubnet.

For example, if you would like your bin packing instances to use podsubnet1, do this:

--set azure.binPackingNodePodSubnet=podsubnet1

Ephemeral OS Disk

azure.useEphemeralOsDisk indicates that Luna should use the ephemeral OS disk type when bin packing or bin selection chooses a node instance type that supports it and that has a cache size of at least 30 GB (the minimum OS disk size for AKS). When Luna uses the ephemeral OS disk type, it explicitly sets the OS disk size to the node instance type’s cache size.

If azure.useEphemeralOsDisk is not set to true, or if the node instance type Luna chooses does not support the ephemeral OS disk type or lacks a large enough cache, Luna will use the default (managed) OS disk type.

To use this option, do this:

--set azure.useEphemeralOsDisk=true

Enable Node Public IP

You can indicate that Luna should enable AKS assignment of a public IP to the nodes it allocates via azure.enableNodePublicIP. The option is false by default.

To use this option, do this:

--set azure.enableNodePublicIP=true