Luna Configuration
Pod configuration
In order for Luna Manager to manage a pod's scheduling, the pod configuration must include a label or annotation that matches Luna's configured pod designation setting. By default, Luna matches pods that carry the following label:
metadata:
  labels:
    elotl-luna: "true"
You can change the list of labels Luna will consider with the labels Helm value:
--set labels='key1=value1,key2=value2'
You can change the list of annotations Luna will consider with the podAnnotations Helm value:
--set podAnnotations='key1=value1,key2=value2'
To prevent Luna from matching a given pod, annotate it with pod.elotl.co/ignore: true.
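For example, a pod that carries the default label but should still be skipped by Luna could be annotated as follows (a minimal sketch; the pod name is illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: my-ignored-pod
  labels:
    elotl-luna: "true"
  annotations:
    pod.elotl.co/ignore: "true"
spec:
  ...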
Instance family configuration
Bin selection
To avoid a given instance family, annotate the pod like this:
metadata:
  annotations:
    node.elotl.co/instance-family-exclusions: "t3,t3a"
In the example above, Luna won’t start any t3 or t3a instance type for the pod.
To use a given instance family, annotate the pod like this:
metadata:
  annotations:
    node.elotl.co/instance-family-inclusions: "c6g,c6gd,c6gn,g5g"
In the example above, Luna will choose an instance type from the c6g, c6gd, c6gn, or g5g instance families for the pod.
To specify the instance type, you can use a regular expression. For instance, if you'd like to specify the instance type to be r6a.xlarge, annotate the pod like this:
metadata:
  annotations:
    node.elotl.co/instance-type-regexp: "^r6a.xlarge$"
In the example above, Luna will only consider the r6a.xlarge instance type.
You can combine the instance-type and instance-family annotations like this:
metadata:
  annotations:
    node.elotl.co/instance-type-regexp: '^.*\.xlarge$'
    node.elotl.co/instance-family-exclusions: "r6a"
In the example above, Luna will exclusively consider instance types ending with ".xlarge" and exclude types from the r6a family.
If any of these annotations are present, Luna will schedule the pods on nodes that fulfill all these constraints as well as the resource requirements of the pods. However, if the instance type constraints and the pod's resource requirements are incompatible, no node will be added and the pod will be stuck in the pending state.
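Putting this together, a pod that excludes the t3 and t3a families while declaring its resource requests might look like the following sketch (the pod name, image, and request sizes are illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: constrained-pod
  labels:
    elotl-luna: "true"
  annotations:
    node.elotl.co/instance-family-exclusions: "t3,t3a"
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: "2"
        memory: 4Gi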
Bin packing
Bin packing instance family and type can be configured via the global option binPackingNodeTypeRegexp. Only the instances matching the regular expression will be considered.
For example, if you would like to use t3a nodes in AWS, you would set: binPackingNodeTypeRegexp='^t3a\..*$'.
Removal of under-utilized nodes and possible pod eviction
Luna is designed to remove under-utilized nodes. A node that is running no Luna-managed pods is under-utilized. Additionally, in the case of bin-packing, a node is considered under-utilized if its Luna-managed pods' total resource requests are below scaleDown.binPackNodeUtilizationThreshold, set to 10% by default. If a node has been under-utilized for longer than scaleDown.nodeUnneededDuration, set to 5 minutes by default, and if all Luna-managed pods running on it can be placed on another node, Luna will evict the pods running on the node and remove the node.
To avoid Luna evicting a pod running on an under-utilized node, the pod must be annotated with pod.elotl.co/do-not-evict: true as shown below:
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
  annotations:
    pod.elotl.co/do-not-evict: "true"
spec:
  ...
The annotation cluster-autoscaler.kubernetes.io/safe-to-evict: false is also supported.
Note that if Luna-managed bin-packing pods have no resource settings or if their resource settings are inaccurately very low, Luna's detection of under-utilized bin-packing nodes will be wrong. In this case, scaleDown.binPackNodeUtilizationThreshold should be set to 0.0 to avoid Luna evicting pods from bin-packing nodes incorrectly categorized as under-utilized. Please see the next section for more information relevant to such pods.
Management of over-utilized nodes and possible pod eviction
Luna allocates node resources for pods based on the pods' resource settings. If Luna-managed pods have no resource settings or if their settings are inaccurately too low, Luna-allocated nodes may become over-utilized, causing performance problems.
Luna can be configured to use Kubernetes metrics server data to monitor the CPU and memory utilization of Luna-allocated nodes, and to take action to avoid or reduce high CPU or memory utilization. If the Luna option manageHighUtilization.enabled (default false) is set true, Luna uses metrics server node and pod CPU and memory utilization data as described below.
When a node's CPU utilization >= manageHighUtilization.yellowCPU (default 60) or its memory utilization >= manageHighUtilization.yellowMemory (default 65), Luna adds a taint to the node to prevent the kube scheduler from scheduling more pods on the node. This avoids CPU or memory over-utilization.
When a node's CPU utilization >= manageHighUtilization.redCPU (default 80) or its memory utilization >= manageHighUtilization.redMemory (default 85), Luna performs an eviction of the highest CPU- or memory-demand Luna-scheduled pod that meets the same pod eviction restrictions applied for scale-down.
When a node's CPU utilization < manageHighUtilization.greenCPU (default 10) and its memory utilization < manageHighUtilization.greenMemory (default 15) and the node has a high utilization taint, that taint is removed from the node. This allows nodes that no longer have high CPU or memory utilization to again host additional pods.
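For example, to enable this behavior and adjust the thresholds via Helm (the threshold values below are illustrative; the option names are those described above):
--set manageHighUtilization.enabled=true \
--set manageHighUtilization.yellowCPU=70 \
--set manageHighUtilization.yellowMemory=70 \
--set manageHighUtilization.redCPU=85 \
--set manageHighUtilization.redMemory=90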
Note that if Luna-managed bin-packing pods have no resource settings or if their resource settings are inaccurately very low, Luna's detection of under-utilized bin-packing nodes will be wrong. Please see the previous section for more information relevant to such pods.
GPU SKU annotation
To instruct Luna to start an instance with a specific graphics card:
metadata:
  annotations:
    node.elotl.co/instance-gpu-skus: "v100"
This will start a node with a V100 GPU card.
Each pod with this annotation will be bin-selected, regardless of the pod’s resource requirements.
Advanced configuration via Helm Values
This is a list of the configuration options for Luna. These values can be passed to Helm when deploying Luna.
The keys and values are passed to the deploy script as follows:
./deploy.sh <cluster-name> <cluster-region> \
--set binSelectPodCpuThreshold=3.0 \
--set binSelectPodMemoryThreshold=2Gi \
--set binSelectPodGPUThreshold=1 \
--set binPackingNodeCpu=3250m \
--set binPackingNodeMemory=7Gi \
--set binPackingNodeMinPodCount=42 \
--set binPackingNodeTypeRegexp='^t3a.*$' \
--set binPackingNodePricing='spot,on-demand' \
--set labels='key1=value1,key2=value2'
These configuration options can be modified in the configuration map elotl-luna located in the namespace where Luna manager runs. Once the configuration map has been modified, Luna manager and its admission webhook must be restarted for the new configuration to be used.
$ kubectl -n elotl rollout restart deploy/elotl-luna-manager
...
$ kubectl -n elotl rollout restart deploy/elotl-luna-webhook
...
labels
Specify the labels that Luna will use to match the pods to consider. labels is a list of comma-separated key-value pairs: key1=value1\,key2=value2; pods with any of the labels will be considered by Luna. The default value is elotl-luna=true.
--set labels='key1=value1\,key2=value2'
podAnnotations
Specify the annotations that Luna will use to match the pods to consider. Similar to labels, podAnnotations is a list of comma-separated key-value pairs: key1=value1\,key2=value2; pods with any of the annotations will be considered by Luna. podAnnotations is empty by default.
--set podAnnotations='key1=value1\,key2=value2'
pod.elotl.co/ignore: true
This annotation instructs Luna to ignore a given pod even if it matches labels or podAnnotations.
It's important to note that ignored pods may still be scheduled on Luna-managed nodes, unless these nodes have a specific taint configured. Ignored pods don't have a node selector, so the Kubernetes scheduler will assign them to any available node. If Luna nodes don't have a taint set up, pods that aren't handled by Luna might be scheduled there.
To prevent pods that aren’t managed by Luna from running on Luna-managed nodes, you can utilize node and pod affinity configuration. Node affinity allows you to specify rules that restrict which nodes a pod can be scheduled on, while pod affinity enables you to define rules for co-locating or spreading pods across nodes based on labels.
By combining taints, tolerations, and affinity rules, you can have finer control over pod scheduling and ensure that ignored pods are not inadvertently scheduled on Luna-managed nodes.
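For example, assuming Luna-managed nodes carry the node.elotl.co/managed-by label (the label used in the default Luna affinity shown later in this document), a workload that must stay off Luna-managed nodes could declare the following node affinity in its pod spec (a sketch, not a complete manifest):
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node.elotl.co/managed-by
            operator: DoesNotExist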
loopPeriod
How often the Luna main loop runs, by default 10 seconds. Increasing this value eases the load on the Kubernetes control plane, while lowering it increases that load.
--set loopPeriod=20s
daemonSetSelector
daemonSetSelector is a label selector for the daemon sets that will run on the Luna nodes.
Luna cannot predict in advance which daemon sets will run on a given node. Since the conditions for daemon sets are dynamic, Luna must estimate which ones will end up on the node, potentially impacting cost optimization.
The daemonSetSelector configuration option allows you to specify the daemon sets Luna should consider in its capacity calculations.
By default, this option is empty, meaning all daemon sets are selected.
For example, to have Luna only consider the impact of the GPU driver daemon set, you can specify:
--set daemonSetSelector=name=nvidia-device-plugin-ds
daemonSetExclude
daemonSetExclude is a comma-separated list of daemon set names that you want to exclude from Luna's list of active daemon sets for newly added nodes.
It is empty by default.
After selecting daemon sets using daemonSetSelector, the sets are further filtered based on the daemonSetExclude list.
Use this option to prevent Luna from reserving resources for daemon sets you do not expect to be active on new nodes. For example, if you are running Luna on a GKE cluster and only plan to use the default logging variant (--logging-variant=DEFAULT), you might exclude the unused daemon sets as follows:
--set daemonSetExclude="fluentbit-gke-256pd\,fluentbit-gke-max\,gke-metrics-agent-scaling-500"
This option may be used along with daemonSetExcludeDesired0.
daemonSetExcludeDesired0
daemonSetExcludeDesired0 is a boolean that you set true if you want to exclude daemonsets that currently have a Desired count of 0 from Luna's list of active daemon sets for newly added nodes.
It is false by default.
After selecting daemon sets using daemonSetSelector, if daemonSetExcludeDesired0 is true, the sets are further filtered to exclude those that have a Desired count of 0.
Use this option to prevent Luna from reserving resources for daemon sets that are not active on current nodes and that you do not expect to be active on new nodes.
--set daemonSetExcludeDesired0=true
This option may be used along with daemonSetExclude.
newPodScaleUpDelay
Minimum age a pod must reach before Luna considers it for scaling up nodes. It is set to 10 seconds by default.
Because pod creation may be scattered, it isn’t desirable for Luna to immediately react to pod creation. Lowering this delay may result in less efficient packing, while increasing it will delay the creation of the nodes and increase the mean time to placement of pods.
--set newPodScaleUpDelay=5s
scaleUpTimeout
Time to allow for the new node to be added and the pending pod to be scheduled before considering the scale up operation expired and subject to retry. It is set to 10 minutes by default. This value can be tuned for the target cloud.
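For example, to allow more time on a cloud where node provisioning is slow (the value is illustrative; the format follows the other duration options such as loopPeriod):
--set scaleUpTimeout=15m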
includeArmInstance
Whether to consider Arm instance types. It is set to false by default.
If this option is enabled, all the images of the pods run by Luna must support both the AMD64 and ARM64 architectures. Otherwise, pod creation may fail.
placeBoundPVC
Whether to consider pods with bound PVC. It is set to false by default.
placeNodeSelector
Whether to consider pods with existing node selector(s). It is set to false by default. When set to true, a pod's existing node selector(s) must be satisfiable by the Luna and pod settings; otherwise, Luna may allocate a node that cannot be used by the pod.
namespacesExclude
List of comma-separated names of namespaces whose pods should be excluded from Luna management. It is set to kube-system only by default. For example, to run with no namespace restrictions on Luna management, use:
--set namespacesExclude={}
To add the namespace test
to the exclusion list specify:
--set namespacesExclude='{kube-system,test}'
Note that if the kube-system namespace is not part of the namespacesExclude list, Luna can spin up additional nodes for kube-system pods marked for Luna placement that are in the Pending state for too long.
reuseBinSelectNodes
Whether to reuse nodes for similar bin-select placed pods. It is set to true by default.
skipIgnoredPods
Whether to skip adding a node selector to pods that are not labeled for placement by Luna. It is set to false by default.
By default, the Luna webhook sets a node selector for each non-daemonset pod placement request it examines. If a pod is labeled for placement by Luna, its node selector is set to point to a Luna-created node. If a pod is not labeled for placement by Luna, its node selector is set to exclude any Luna-created node; the latter setting is skipped if skipIgnoredPods is set true.
prometheusListenPort
The port number on which Luna manager and webhook will expose their Prometheus metrics. It is 9090 by default.
clusterGPULimit
The maximum number of GPUs to run in the cluster. It is set to 10 by default. If the GPU count in the cluster reaches this limit, Luna will stop scaling up GPU nodes.
nvidiaGPUTimeSlices
The number of GPU time-slices for NVIDIA GPUs in cluster. It is set to 1 by default. When its value is greater than 1, Luna treats GPUs in cloud instances as N copies of themselves with respect to scheduling GPU resource requests. This value must match the NVIDIA GPU time slices setting for GPU nodes in the cluster for Luna GPU allocation to operate consistently with that setting.
On AKS, EKS, and OKE clusters, the NVIDIA time-slices setting is transparent to the cluster control plane and GPU workloads running in the cluster. The number of NVIDIA GPU time-slices can be set when installing the nvidia-device-plugin helm chart. The time-slices setting will automatically be configured for all NVIDIA GPUs in the cluster, and cluster nodes will use that value when they report their GPU capacity. GPU workloads transparently get a slice for each GPU resource they request.
On GKE clusters, the NVIDIA time-slices setting is visible to the cluster control plane and to GPU workloads running in the cluster. Luna configures the GPU slice count in the GKE node pool used for GPU node allocation. Note that GPU pods running on GKE clusters with time-sliced GPUs must include nodeSelectors indicating the workload can use time-shared GPUs and specifying the max clients-per-gpu value allowed. And the GPU pods running on time-sliced GPUs cannot specify a nvidia.com/gpu resource limit value greater than 1. Please see the associated GCP documentation for more details.
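For example, if the NVIDIA device plugin in the cluster is configured for 4 time-slices per GPU, the matching Luna setting would be (the value is illustrative and must mirror the cluster's actual time-slice configuration):
--set nvidiaGPUTimeSlices=4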
binSelectMaxPodsPerNode & binPackingMaxPodsPerNode
These configuration options control the maximum number of pods that can run on each node. Setting lower values can minimize the use of network resources like interfaces or IP addresses on the nodes.
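For example, to cap both bin-selected and bin-packed nodes at 110 pods each (the value is illustrative; see the per-cloud notes below for defaults and valid ranges):
--set binSelectMaxPodsPerNode=110 \
--set binPackingMaxPodsPerNode=110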
AWS
When binSelectMaxPodsPerNode or binPackingMaxPodsPerNode is set to 0 (the default), Luna uses the AWS-defined ENI limit as the maximum pods per node value. By default, Luna does not explicitly set the maximum number of pods on the nodes. If you set a value greater than 0, Luna will set the specified maximum number of pods on the nodes.
For nodes with up to 30 VCPUs, the maximum number of pods per node is capped at 110. For nodes with more than 30 VCPUs, the maximum increases to 250.
GCP
When binSelectMaxPodsPerNode or binPackingMaxPodsPerNode is set to 0 (the default), Luna defaults to a limit of 110 pods per node. These values must be between 8 and 256; otherwise, the API will produce an error, and nodes will not be created.
Azure
When binSelectMaxPodsPerNode or binPackingMaxPodsPerNode is set to 0 (the default), Luna uses the node’s default max pods per node. For clusters using Kubenet networking, this default is 110 pods per node; for clusters using CNI networking, it is 250 pods per node. The maximum value either can be set to is 250.
OCI
When binSelectMaxPodsPerNode or binPackingMaxPodsPerNode is set to 0 (the default), Luna defaults to a limit of 110 pods per node.
This option only applies to OKE clusters that use OCI_VCN_IP_NATIVE networking, and indicates how Luna should set max pods per node on nodes it allocates. If MaxPodsPerNode is 0 (default), Luna sets max pods per node to the maximum supported by compute shape vNICs. If MaxPodsPerNode is greater than 0, Luna sets max pods per node to min(MaxPodsPerNode, maximum supported by compute shape vNICs).
nodeLabels
Labels to add to the nodes. It is a key-value mapping, empty by default.
For example, to add a label foo=bar to your nodes, use the following flag:
--set nodeLabels.foo=bar
Note that if you need to include dots in the label’s key, you will have to escape them with a backslash:
--set nodeLabels.my\.label\.example=value
nodeTags
Tags to add to the cloud instances. It is a key-value mapping, empty by default.
This can be useful to track and clean up stale cloud instances. For instance, to add tags key1=value1 and key2=value2, use:
--set nodeTags.key1=value1
--set nodeTags.key2=value2
Note that the nodeTags option is not supported on GKE.
nodeTaints
To add taints to the nodes created by Luna, use the nodeTaints configuration option:
--set nodeTaints='{key1=value1:NoSchedule,key2=value2:NoExecute}'
Note that the nodeTaints option is not supported under Oracle Container Engine for Kubernetes (OKE).
loggingVerbosity
How verbose Luna manager and webhook are. It is set to 2 by default.
0 = critical, 1 = important, 2 = informational, 3 = debug.
scaleDown.nodeUnneededDuration
If a node remains idle for longer than nodeUnneededDuration, Luna manager will scale it down. Default: 5m.
--set scaleDown.nodeUnneededDuration=1m
scaleDown.skipNodeWithSystemPods
Determines whether to skip nodes running pods from the kube-system namespace. Daemonset pods are never considered by Luna; this only applies to deployment pods. Default: false.
scaleDown.skipNodesWithLocalStorage
When true, Luna manager will never scale down nodes with local storage attached to a pod. Default: true.
scaleDown.skipEvictDaemonSetPods
When true, Luna manager will skip evicting daemonset pods from nodes removed for scale down. Default: false.
scaleDown.minReplicaCount
The minimum replica count ensures that the specified number of replicas are always available during node scale-down. Default: 0.
scaleDown.binPackNodeUtilizationThreshold
Defines the utilization threshold to scale down bin-packed nodes, ranging from 0.0 (0% utilization) to 1.0 (100% utilization). Default: 0.1 (10%).
Note that the Helm option --set cannot parse floating point numbers. Use --set-json to define scaleDown.binPackNodeUtilizationThreshold.
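For example, to set the threshold to 20% (the value is illustrative):
--set-json 'scaleDown.binPackNodeUtilizationThreshold=0.2'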
scaleDown.minNodeCountPerZone
For clusters supporting zone spread (currently only EKS clusters and GKE regional clusters), indicates the minimum number of nodes (0 or 1) that Luna should keep running per zone in target pools into which zone spread pods may be placed. This minimum is maintained even when no normal (not daemonset or mirror) Luna pods are currently running in the pool. Default: 0. Note that EKS does not support setting this value to 1.
In general, Luna keeps a minimum of 1 node per zone in node pools that may be used for zone spread, to ensure kube-scheduler can see all the zones in its target node set and hence can make the desired zone spread choices. Setting scaleDown.minNodeCountPerZone to 1 to maintain a min of 1 node per zone even when the associated count of normal (not daemonset or mirror) Luna pods is 0 avoids a possible race where kube-scheduler sees zone-spread pods arrive for scheduling when some but not all of a node pool's per-zone nodes have scaled down.
scaleDown.nodeTTL
When > 0, enables Luna support for node time-to-live. When scaleDown.nodeTTL is set to a non-zero value, it must be set to a value greater than or equal to scaleUpTimeout. If scaleDown.nodeTTL is less than scaleUpTimeout, Luna will set it to scaleUpTimeout internally and will emit a warning in the logs. Default: 0m (time-to-live unlimited).
When scaleDown.nodeTTL is set to a non-zero value, Luna uses the value as a time-to-live for its allocated nodes; Luna cordons, drains, and terminates its allocated nodes once they have been running longer than the specified scaleDown.nodeTTL time.
If a nodeTTL-expired node contains any pods with do-not-evict annotations (i.e., pod.elotl.co/do-not-evict: true or cluster-autoscaler.kubernetes.io/safe-to-evict: false), Luna supports the node's graceful termination by cordoning it, draining its non-kube-system non-daemonset pods except the do-not-evict pods, and then adding the configurable annotation scaleDown.drainedAnnotation to it. An external controller monitoring nodes for that annotation can perform eviction-related operations with respect to the do-not-evict pods and then remove their do-not-evict annotation. Once a nodeTTL-expired node contains no do-not-evict pods, Luna terminates the node.
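For example, to recycle Luna-allocated nodes roughly once a day (the value is illustrative and must be greater than or equal to scaleUpTimeout):
--set scaleDown.nodeTTL=24h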
scaleDown.managedNodeDelete
Set true to enable Luna support for graceful termination of nodes that are externally-deleted (e.g., "kubectl delete node/node-name"). Default: true.
When scaleDown.managedNodeDelete is set true, Luna adds a finalizer to its allocated nodes, allowing Luna to detect external deletion operations on those nodes. When Luna detects external deletion of an allocated node, if that node contains any do-not-evict pods, Luna performs the graceful termination steps outlined in scaleDown.nodeTTL. Once an externally-deleted Luna-allocated node contains no do-not-evict pods, Luna removes its finalizer so it no longer blocks the K8s node deletion, and deletes the node from the cloud.
Note that if scaleDown.managedNodeDelete is set, the deletion of Luna-allocated nodes requires the removal of the Luna finalizer; hence, if Luna is disabled with some of its allocated nodes remaining and you later want to remove those nodes, you will need to manually remove the finalizer.
scaleDown.drainedAnnotation
Annotation used during graceful node termination; see scaleDown.nodeTTL or scaleDown.managedNodeDelete. Default: key: node.elotl.co/drained; value: true.
Pod retry
Luna cannot guarantee that a pod will run on one of its nodes; the node and pod have to be properly configured. If a pod is still in the pending state once the requested node is online, Luna will retry after a configurable delay, up to a configurable number of times.
How pod retry works:
- A new pod is created, the Luna webhook matches it, and a new node is provisioned by Luna manager.
- Luna manager waits for the node to come online or until scaleUpTimeout has passed, whichever happens first.
- Once the node is online or the request has timed out, Luna checks the pod’s status after podRetryPeriod has elapsed.
- If the pod is still in the pending state, there are two cases:
  - The pod has been retried fewer than maxPodRetries times: the annotation pod.elotl.co/retry-count is added to the pod or incremented, and the pod will be retried after podRetryPeriod.
  - The pod has been retried maxPodRetries times: the annotation pod.elotl.co/ignore: true is added to the pod. The pod will now be ignored by Luna until the annotation is removed.
maxPodRetries
Sets the maximum retry attempts for a pod. Each retry increments the annotation pod.elotl.co/retry-count on the pod. Once this limit is exceeded, the pod is annotated with pod.elotl.co/ignore: true, indicating Luna should ignore the pod until the annotation is removed.
Default: 3
podRetryPeriod
Determines the delay before Luna retries deploying a pod that remains in the pending state, even after its node is available. This period must allow adequate time for Kubernetes to schedule the pod, otherwise Luna may create unnecessary node(s) temporarily.
Default: 5 minutes
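For example, to allow more retries with a shorter wait between attempts (the values and the duration format are illustrative):
--set maxPodRetries=5 \
--set podRetryPeriod=2m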
Bin-selection
Bin-selection is a process where Luna provisions a dedicated node to run a pod with high resource requirements.
When a pod’s resource needs exceed certain thresholds, Luna automatically allocates a dedicated node for that pod. This process involves determining the optimal node configuration based on the pod’s requirements, adding a new node to the cluster, and scheduling the pod to run on this dedicated node.
Bin-selection is triggered when a pod’s resource requirements meet or exceed any of the following thresholds:
- CPU: binSelectPodCpuThreshold
- Memory: binSelectPodMemoryThreshold
- GPU: binSelectPodGPUThreshold
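As a sketch, with the thresholds from the deployment example above (CPU 3.0, memory 2Gi, GPU 1), a pod like the following would be bin-selected onto its own dedicated node (the pod name, image, and request sizes are illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: big-pod
  labels:
    elotl-luna: "true"
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: "4"
        memory: 8Gi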
Bin-packing
Bin-packing means running the pod with other pods on a shared node.
binPackingNodeCpu, binPackingNodeMemory, and binPackingNodeGPU let you configure the shared nodes’ requirements. If you have an instance type in mind, set these parameters slightly below the node type you are targeting, to take into account the kubelet and kube-proxy overhead. For example, if you would like to have non-GPU nodes with 8 VCPU and 32 GB of memory, set binPackingNodeCpu to "7.5" and binPackingNodeMemory to "28G".
Bin-selection thresholds must be lower than the bin-packing node requirements. Otherwise the system will log a warning, and any bin-packing node requirements that are too low will be increased to match the corresponding bin-selection thresholds.
Each node type can only run a limited number of pods. binPackingNodeMinPodCount lets you request a node that can support a minimum number of pods.
binPackingNodeTypeRegexp allows you to limit the instances that will be considered. For example, if you would only like to run instances from the "t3a" family in AWS, you would do: binPackingNodeTypeRegexp='^t3a\..*$'
binPackingMinimumNodeCount allows you to specify the minimum number of bin-packed nodes. The nodes will be started immediately and will stay online even if no pods are running on them.
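For example, to keep two bin-packed nodes online at all times (the value is illustrative):
--set binPackingMinimumNodeCount=2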
Spot and on-demand pricing
Spot pricing is a cloud pricing model where providers offer unused compute capacity at significantly discounted rates. Users can access these resources at lower costs, but instances may be reclaimed with short notice when demand increases.
To specify whether to use on-demand or spot pricing for bin-selected nodes, you can add the node.elotl.co/instance-offerings annotation to the pod’s definition. This annotation allows you to choose between different pricing options:
- Spot pricing: to run nodes exclusively on spot instances, use:
  node.elotl.co/instance-offerings: "spot"
- On-demand pricing: for the regular pricing model, use:
  node.elotl.co/instance-offerings: "on-demand"
- Spot with on-demand fallback: to use spot instances when possible and fall back to on-demand if spot isn’t available, use:
  node.elotl.co/instance-offerings: "spot,on-demand"
Here’s an example of a pod definition utilizing bin-selection with spot pricing:
apiVersion: v1
kind: Pod
metadata:
  name: high-resource-pod
  annotations:
    node.elotl.co/instance-offerings: "spot"
spec:
  ...
Bin-packed nodes also support spot pricing. The configuration option binPackingNodePricing allows you to indicate the price offerings category for the instances that will be considered. For example, if you would only like to run instances from the "spot" category:
binPackingNodePricing: spot
Spot pricing is supported on EKS, AKS, GKE, and OKE.
Special consideration for using Spot on Azure AKS
AKS nodes with spot pricing have a taint automatically applied to them. This means pods running on Spot nodes in AKS must have a toleration set in order to be scheduled and run on the nodes with Spot pricing.
In order to get the pods running on the spot nodes, the operator must add a toleration corresponding to the kubernetes.azure.com/scalesetpriority=spot:NoSchedule taint.
spec:
  containers:
  - name: spot-example
  tolerations:
  - key: "kubernetes.azure.com/scalesetpriority"
    operator: "Equal"
    value: "spot"
    effect: "NoSchedule"
Special consideration for using Spot on OCI OKE
OKE nodes with spot pricing have a taint automatically applied to them. This means pods running on Spot nodes in OKE (called preemptible instances) must have a toleration set in order to be scheduled and run on the nodes with Spot pricing.
In order to get the pods running on the spot nodes, the operator must add a toleration corresponding to the oci.oraclecloud.com/oke-is-preemptible taint.
spec:
  containers:
  - name: spot-example
  tolerations:
  - key: oci.oraclecloud.com/oke-is-preemptible
    operator: Exists
    effect: "NoSchedule"
Spot interruption message option on AWS EKS
The user can set up an AWS SQS queue to receive Spot interruption messages, delivered two minutes before termination, and can provide that queue name to Luna via the AWS option spotSqsQueueName. When Luna receives a Spot termination message, it marks the node with node.elotl.co/spot-event: termination. Nodes with this annotation are targeted in Luna's scale-down selection.
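Assuming the queue name is passed as a Helm value like the other options in this document (this is an assumption; consult the AWS-specific deployment settings for the exact key), the configuration might look like this, with a placeholder queue name:
--set spotSqsQueueName=my-spot-interruption-queue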
Luna’s own deployment and pod configuration
Annotations, tolerations, and affinity
Use the Helm value annotations to add custom annotations to Luna manager and webhook deployments:
$ helm install ... --set annotations.foo=bar --set annotations.hello=world ...
To add custom tolerations to Luna’s own pods, use the configuration option tolerations. The tolerations specification is rather complex, therefore we recommend you define it in a Helm values file and pass its filename with the -f or --values options:
$ cat tolerations.yaml
tolerations:
- key: "foo"
  value: "bar"
  operator: "Equal"
  effect: "NoSchedule"
$ helm install ... --values tolerations.yaml ...
To add custom affinity to Luna’s own pods, use the configuration option affinity. The affinity specification is rather complex, therefore we recommend you define it in a Helm values file and pass its filename with the -f or --values options:
$ cat affinity.yaml
# Helm values
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values:
          - us-central1-f
$ helm install ... --values affinity.yaml ...
Note that setting the affinity parameter will override the default affinity, which prevents Luna pods from running on Luna-managed nodes:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: node.elotl.co/managed-by
          operator: DoesNotExist
Add this snippet to your own affinity definition to prevent Luna pods from running on Luna-managed nodes.
Webhook port
You can change the port of the mutation webhook with the webhookPort configuration option:
$ helm install ... --set webhookPort=8999 ...