Release Notes
1.4.4
This release introduces high availability for Luna. The Luna Manager can now run multiple replicas, with one elected as the leader to handle all incoming requests. Enable this feature by setting the Helm value manager.leaderElection.enabled to true.
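Enabling this might look like the following sketch of a Helm values file; the replicaCount key is an assumption, so check the chart for the exact replica setting:

```yaml
# values.yaml (sketch) — run multiple Luna Manager replicas with
# leader election; only the elected leader handles incoming requests.
manager:
  replicaCount: 3          # replica key name is an assumption
  leaderElection:
    enabled: true          # Helm value named in the release notes
```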
Fixed an issue where a pod’s destination ID hash would change if it had an attached volume. This affected bin-selected nodes created with Luna versions 1.3.2 and earlier.
Luna no longer includes the deprecated includeArmInstance field in the ConfigMap when it is empty.
Google GKE
Fixed a configuration matching issue related to node pool reuse for bin-select pods with placement-related annotations.
1.4.3
This release resolves several bugs.
Google GKE
Resolved a crash that occurred when an existing node pool's LinuxNodeConfig was nil.
Added hugepages configuration matching with LinuxNodeConfig.
Microsoft AKS
Fixed an issue with tracking requests for bin-select node pool reuse.
1.4.2
Fixed slow eviction of pods with an associated pod disruption budget (PDB).
Previously, Luna could enter a continuous loop of cordon, drain, timeout, uncordon, and pod rescheduling for pods with a PDB and graceful termination duration longer than Kubernetes’s default.
1.4.1
Switched to fully-qualified image names.
Amazon EKS
Added the configuration option aws.disablePqcCurves to disable Post-Quantum Cryptography (PQC) algorithms when using Luna. These new curves may not be compatible with certain firewalls, such as AWS Network Firewall or Suricata.
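A minimal values-file sketch for disabling the PQC curves, useful behind firewalls such as AWS Network Firewall:

```yaml
aws:
  disablePqcCurves: true   # disable Post-Quantum Cryptography curves
```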
Updated the AWS Go SDK to v1.39.2.
Google GKE
Added support for the gcp.linuxNodeConfig configuration value, which allows changes to sysctls, cgroup mode, and hugepages configuration for Linux nodes.
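The shape below mirrors GKE's LinuxNodeConfig API; the exact Helm schema Luna accepts is an assumption, so consult the Luna configuration reference:

```yaml
gcp:
  linuxNodeConfig:
    sysctls:
      net.core.somaxconn: "4096"   # illustrative sysctl
    cgroupMode: CGROUP_MODE_V2     # cgroup v2, as in GKE's API
    hugepages:
      hugepageSize2m: 1024         # number of 2 MiB pages (assumed key)
```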
Updated the Google SDK to v1.44.1.
1.4.0
This release introduces enhanced support for multiple architectures. Nodes with amd64 and arm64 architectures can now be provisioned based on the kubernetes.io/arch node selector.
Resolved an issue with updating the pricing category for existing bin-packing node pools.
Added an event to pods when no suitable node can be provisioned for them.
Amazon EKS
Luna now uses EC2 instances rather than EKS managed node groups to determine the node instance profile.
Excluded the AWS a1 instance type, which is not supported by Amazon Linux 2023.
Google GKE
Fixed an issue with node type selection on GKE. Certain node types were overlooked if they were not available in the default zone.
Prevent the ARM64 taint added by GKE from being interpreted as a node pool change.
Microsoft AKS
Fixed disk configuration matching in AKS to prevent unnecessary node pool churn.
Oracle OKE
Updated OKE node pool naming to avoid overflow issues with new bin-packing node pool versions.
Ensured that flex node types with A2 processors respect the required memory ratio.
1.3.4
This release resolves an issue with node instance profile detection in Amazon EKS.
Amazon EKS
In version 1.3.3, Luna introduced automatic detection of the node instance profile based on the EKS node groups’ configuration. However, this method proved unreliable in certain scenarios. To address this, Luna now queries the EC2 instances attached to the cluster—similar to the old deployment script’s approach. This update ensures compatibility with EKS clusters that lack node groups.
1.3.3
This release includes two new features: automatic determination of the node instance profile on EKS, and generation of node cost estimate events for pods with the nodecostestimate schedulingGate.
The release updates the available node types and their pricing for Amazon EKS. It also contains several bug fixes:
- addressed a problem with the binSelectNodeTypeRegexp option
- improved scale-up time when adding multiple GPU-enabled nodes
- avoided logging spurious error messages when pod garbage collection fails with "pod not found"
- filtered out node types whose prices are unexpectedly set to 0, observed for a dynamically fetched price of a new node type on AKS
Amazon EKS
Luna was enhanced to automatically determine the node instance profile if aws.nodeInstanceProfile is empty. This feature requires an update to the IAM policy for Luna Manager to determine the node instance profile. Refer to the upgrade notes for this release.
The set of available node types and their pricing was updated.
Azure AKS
Luna's AKS code was modified to filter out node types whose dynamically fetched prices are reported as 0, which was observed for a recently added node type.
1.3.2
This release includes metrics enhancements, improved pod garbage collection behavior, AWS Spot Advisor support, and cloud provider updates.
We made some changes to Luna manager’s metrics:
Added:
- scale_errors_total — tracks scaling operation failures
Renamed:
- cluster_gpu_limit_exceeds_total → gpu_requests_exceeding_cluster_limit
- insufficient_free_addresses_in_subnet_thrown_total → insufficient_free_addresses_in_subnet_errors_total
When Luna scales down a node, daemonset pods that were running on that node occasionally persist in the Kubernetes inventory as orphaned zombie pods; Luna pod garbage collection cleans up those pods. Luna now uses a two-stage process to improve pod garbage collection reliability: First it attempts to remove the pods with foreground deletion. After loopPeriod, if a pod isn’t removed, it switches to background deletion.
Amazon EKS
Luna now supports AWS Spot Instance Advisor, which provides spot savings data and interruption frequency metrics.
To enable it, set aws.useSpotAdvisor to true. You can configure constraints on using spot instances based on the spot advisor data: the spot savings rate with aws.maxSpotPriceRatio, and the interruption frequency filter with aws.maxSpotInterruptBucket.
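A sketch of the Spot Advisor settings; the threshold values are illustrative, and their exact semantics should be checked against the Luna docs:

```yaml
aws:
  useSpotAdvisor: true
  maxSpotPriceRatio: 0.6      # only use spot when spot/on-demand price <= 0.6 (assumed meaning)
  maxSpotInterruptBucket: 2   # cap on the advisor's interruption-frequency bucket (assumed meaning)
```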
Google GKE
The node listing and pricing were updated.
1.3.1
This release includes changes to Luna’s internal database format. Before upgrading to version 1.3.1, ensure all node creation requests with spot pricing are complete and no spot node provisioning operations are in progress. In the rare case a spot node creation request fails during the upgrade, Luna may incorrectly mark the node type as unavailable for on-demand instead of spot. The duration of unavailability depends on the failure type and the option configured in the backoffDurations configuration section, by default 1 hour or less.
General
This release introduces several configuration options:
- maxPodsPerNode sets the maximum pods parameter for both bin-select and bin-packed nodes
- nodeTypeRegexp and binSelectNodeTypeRegexp filter node types using regular expressions
- nodeTypePricing and binSelectNodePricing configure pricing models (on-demand or spot) for all nodes or bin-selected nodes specifically
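Taken together, these options might appear in a values file like this (all values illustrative):

```yaml
maxPodsPerNode: 110                 # max pods for bin-select and bin-packed nodes
nodeTypeRegexp: "^(m5|c5)\\..*"     # allow only these families cluster-wide
binSelectNodeTypeRegexp: "^m5\\..*" # tighter filter for bin-selected nodes
nodeTypePricing: "on-demand"        # pricing model for all nodes
binSelectNodePricing: "spot"        # pricing model for bin-selected nodes
```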
This release resolves the following issues:
- Resolved an issue where pods with long graceful termination periods would block the Luna loop, causing 10+ minute delays in processing pending pods and evictions before the specified graceful termination period elapsed
- Fixed a crash that occurred when no bin-selected node types were available
- Fixed a regression where bin-packing placement groups could be formed from pods with mismatched zone affinity, zone spread, and host spread [rare; introduced in release 1.2.13 (February 2025)]
Oracle OKE
Enhanced unavailable shape detection during node creation, enabling Luna to retry with alternative shapes more quickly.
1.3.0
This release introduces support for Amazon Linux 2023 with EKS. Note that EKS 1.33 requires Amazon Linux 2023 and drops support for Amazon Linux 2.
This release optimizes the provisioning of bin-selected nodes by improving the calculation of required nodes. Luna now avoids creating unnecessary bin-selected nodes when multiple bin-selected pods can be scheduled on a single node.
Anti-affinity configuration has been added to webhook pods to ensure they run on different nodes whenever possible.
A bug that caused Luna to crash during pod eviction from nodes has been fixed.
Amazon EKS
With the addition of Amazon Linux 2023 support, Luna now supports ARM-based instances with GPUs, including the g5g and p6e families. ARM-based GPU instance types use dedicated image configuration values: aws.amiIdGpuArm and imageSsmQueryGpuArm.
Security group tags can now be configured using aws.securityGroupSelector.
1.2.21
This release introduces minor features and improvements:
Added a new binSelectInstanceFamilyExclusions option, allowing users to exclude specific bin-select node types from consideration. This serves as an alternative to the node.elotl.co/instance-family-exclusions pod annotation.
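The list form below is an assumption about the option's schema; the instance families shown are placeholders:

```yaml
# Exclude these families from bin-select consideration, instead of
# annotating each pod with node.elotl.co/instance-family-exclusions.
binSelectInstanceFamilyExclusions:
  - a1
  - t3
```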
Introduced the caBundle configuration option, enabling the use of a custom certificate authority bundle with the admission webhook.
Luna pods now include a node selector to ensure they run on the appropriate architecture and operating system.
Improved logging output during node draining operations.
1.2.20
We’ve resolved an issue introduced in Luna 1.2.19 where the node.elotl.co/force-bin-select annotation was accidentally removed from the configuration annotations list. This caused pods using this annotation to generate different hashes on restart, changing their node selector. As a result of this regression, existing bin-selected nodes weren’t being reused, forcing the creation of new nodes for restarted pods.
1.2.19
You can now add extra environment variables to the Luna manager and webhook pods via the manager.extraEnv and webhook.extraEnv Helm values.
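The extraEnv values presumably follow the standard Kubernetes env list format; the variable below is purely illustrative:

```yaml
manager:
  extraEnv:
    - name: HTTPS_PROXY                      # hypothetical example variable
      value: "http://proxy.internal:3128"
webhook:
  extraEnv:
    - name: HTTPS_PROXY
      value: "http://proxy.internal:3128"
```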
The deployment script now cleans up temporary files from the installation directory after completion.
Luna now deletes Kubernetes node objects before terminating the underlying node instances. This change allows finalizers on RKE2 nodes to be removed, preventing orphaned node objects from lingering in the etcd database.
SUSE RKE2
Luna now supports running on SUSE RKE2 clusters provisioned with Amazon EC2 nodes.
Installation documentation for RKE2 on EC2 will be available soon. In the meantime, please contact us for deployment assistance.
Amazon EKS
Updated pricing data and added support for the new instance types: c8gd, m8gd, and r8gd.
Introduced configuration options: aws.subnets and aws.securityGroups. If these are not specified, Luna automatically queries subnets and security groups using the EKS cluster information. These options allow you to restrict the subnets in which Luna nodes run and to assign additional security groups to Luna nodes as needed.
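A sketch with placeholder IDs (list form assumed from the plural option names):

```yaml
aws:
  subnets:            # restrict Luna nodes to these subnets
    - subnet-0abc1234example
    - subnet-0def5678example
  securityGroups:     # extra security groups attached to Luna nodes
    - sg-0123abcdexample
```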
Google GKE
Luna no longer attempts to garbage-collect node pools that are in provisioning, reconciling, or stopping states. This prevents logs from being cluttered with non-actionable messages (e.g., Nodepool ... garbage collect failed (...); nodepool is rechecked in future pass).
1.2.18
Amazon EKS
This change ensures that the Luna manager pod has the environment variables AWS_REGION, AWS_DEFAULT_REGION, and AWS_STS_REGIONAL_ENDPOINTS defined via the deployment manifest. A new Helm chart value, aws.clusterLocation, is introduced; when set to the cluster's region, it ensures these environment variables are injected into the Luna manager pod.
This release introduces a Helm chart value, aws.userDataType, to configure how user data is passed to EC2 instances. Setting aws.userDataType to Template allows the operator to create custom scripts to configure their EC2 nodes. This can be used to support non-standard nodes.
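The two values from this release might be combined as follows (the region is illustrative):

```yaml
aws:
  clusterLocation: us-west-2   # injects AWS_REGION et al. into the manager pod
  userDataType: Template       # pass user data via a custom template script
```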
1.2.17
This release addresses several issues that could cause Luna to overshoot its clusterGPULimit option. It also modifies Luna to reduce a scale request's node replica count when the full replica count would exceed clusterGPULimit.
The release fixes issues that interfered with the expected coalescing of bin-select pod placement requests when reuseBinSelectNodes is set.
The release adds the option webhookConfigPrefix to allow changing the prefix of Luna's webhook, so that its execution order relative to other webhooks can be modified.
The release updates the Luna manager to log all environment variables at startup, to facilitate troubleshooting configuration issues.
Amazon EKS
The Luna manager service account was updated to include the annotation eks.amazonaws.com/sts-regional-endpoints: "true", to reflect Luna's use of AssumeRoleWithWebIdentity.
Google GKE
Luna's spot instance offering was changed from using preemptible virtual machine instances to using spot virtual machine instances.
1.2.16
This release adds two new bin-select pod annotations, which can be used to constrain attributes of the instance type Luna may allocate for the pod. The annotation node.elotl.co/instance-max-cost limits the estimated USD cost per hour of the instance, and node.elotl.co/instance-max-gpus limits the GPU count of the instance; Luna will not choose an instance type for the bin-select pod that violates these limits, if specified.
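A pod using both annotations might look like the sketch below; the pod name, image, and annotation value formats are assumptions:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: capped-gpu-job                        # hypothetical pod
  annotations:
    node.elotl.co/instance-max-cost: "2.50"   # max estimated USD/hour (string form assumed)
    node.elotl.co/instance-max-gpus: "1"      # max GPU count of the chosen instance
spec:
  containers:
    - name: worker
      image: example.com/worker:latest        # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1
```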
The release introduces the option spotPriceRatioEstimate (default 0.5) to make Luna's spot-to-on-demand price ratio estimate user-visible and configurable.
The release adds the option pendingPodReasonRegexp, which can be set to a regular expression that a pod's pending reason message must match for Luna to consider it.
The release updates Luna's backoff handling for an instance type to capture the CPU, memory, and GPU configuration used for the instance type being avoided, since some instance types on some clouds allow multiple configurations.
The release removes some system configuration variables from the configuration map; they are now passed to luna executables via the command line.
This release includes several improvements specific to GKE as described in the next section.
Google GKE
This release modifies Luna to require that the instance type be explicitly constrained if gcp.minCpuPlatform is used.
The release fixes an issue where Luna could generate unsupported instance type candidates for GPU SKUs added to N1 instances. It also fixes an issue where scale down of a node pool, which Luna performs after a failed scale-up operation, would fail on GKE regional clusters.
1.2.15
This release addresses all high severity vulnerabilities reported by Dockerhub scanning. At this point, Dockerhub scanning shows no critical or high severity vulnerabilities.
The release improves the handling of cloud scale-up operations that take longer than scaleUpTimeout. When this occurs, Luna's subsequent scaling attempt backs off from using the node type and pricing category associated with the long latency for a period specified by backoffDurations.ScaleUpTimeoutBackoff.
The release improves contention detection and handling on GKE, as described in the GKE-specific section below.
The release addresses a price handling issue on AKS, as described in the AKS-specific section below.
Google GKE
Luna was updated to detect and handle two additional kinds of contention on GKE.
GCP scaling operations can fail due to an incompatible operation running, e.g., when the GKE control plane is being upgraded. Luna now detects when a nodepool scale-up operation encounters this contention, and backs off from re-attempting the operation for a period configurable via backoffDurations.nodepoolIncompatOpBackoff.
GCP limits the number of concurrent operations that can run in a GKE cluster. Luna now detects when cluster operations fail due to this limit, and backs off from running scale-up operations for a period configurable via backoffDurations.tooManyOpsBackoff.
Also, an issue was addressed in the Luna code that enhances scaling efficiency by opportunistically combining scale operations for multiple pods.
Azure AKS
Luna's AKS price-fetching code was updated to filter out prices whose effective end date has passed.
1.2.14
This release adds support for specifying the minimum CPU platform for GKE nodes.
The release enhances Luna's logging when no bin-select node meets the configured requirements, so that the user can more easily determine how to update the requirements to allow a node to be allocated.
The release reverts the change made in 1.2.12 to avoid a rare certificate setup issue at webhook startup, given its unintended impact on a key customer installation use case. Installation retry can instead be used if the rare certificate setup issue occurs.
The release adds options to configure the backoff times for allocating classes of unavailable resources, so that the user can set those values to match their operating environment, if desired.
The release improves Luna's spot price estimates to better handle spot vs on-demand selection. It also fixes a role policy issue that impacted Luna's spot termination event handling feature on EKS.
Google GKE
Adds the option gcp.minCpuPlatform to specify the minimum CPU platform for GKE nodes, and the annotation node.elotl.co/gcp.minCpuPlatform to override the setting for bin-select pods.
Please see minCpuPlatform for details.
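The two forms might be used as follows (the platform names are illustrative GCP values):

```yaml
# Cluster-wide default, in the Helm values file:
gcp:
  minCpuPlatform: "Intel Ice Lake"

# Per-pod override for a bin-select pod, as a pod annotation:
# metadata:
#   annotations:
#     node.elotl.co/gcp.minCpuPlatform: "Intel Cascade Lake"
```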
Amazon EKS
Fixes an issue that impacted Luna's access to spot termination warning messages when the option aws.spotSqsQueueName is set.
1.2.13
This release updates the operation of binPackingNodePricing and node.elotl.co/instance-offerings when set to "spot,on-demand" to try on-demand allocation if the spot capacity for a node type is exhausted.
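A minimal sketch of the two ways to request spot with on-demand fallback (the annotation value format is assumed to match the option's):

```yaml
# Helm values file:
binPackingNodePricing: "spot,on-demand"

# Or per pod, as an annotation:
# metadata:
#   annotations:
#     node.elotl.co/instance-offerings: "spot,on-demand"
```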
The release also improves particular scale-up situations on EKS and OKE.
Amazon EKS
For EKS, the release enhances scale-up handling of insufficient capacity in a zone by retrying the allocation in other zones recommended in the API response. The release also adds recognition and response to spot quota issues during scale-up.
Oracle OKE
For OKE, on clusters using OCI VCN IP Native networking, the release fixes an issue in the default value handling of the Luna options binSelectMaxPodsPerNode and binPackingMaxPodsPerNode that could lead scale-up to fail due to unexpected IP address exhaustion.
1.2.12
This release improves the management of instance type choices on clouds for which Luna uses node pools [AKS, GKE, and OKE]. For GKE, it updates the machine types and prices, handles the newly-added goog-gke-accelerator-type label on GPU node pools, and adds support for detecting, logging, and directly responding to node pool scale operations that fail at runtime.
Common
Add the option binPackingKeepNodeType [default true] to avoid bin-packing node pool churn solely to change the instance type for lower node cost [applies to AKS, GKE, and OKE].
Reduce latency in bin-selection placement when original instance type choice is unavailable on clouds for which Luna uses node pools [applies to AKS, GKE, and OKE].
Fix rare certificate setup issue at webhook startup.