Available Metrics
Luna exposes metrics in Prometheus format. They can be scraped from port 9090 of the elotl-luna-manager pod. The following table lists the available metrics with their descriptions and labels.
| Luna Metric Name | Description | Label(s) |
|---|---|---|
| elotl_luna_scale_actions_total | Counts total number of node scale actions done by luna-manager | "action" ("up" or "down") and "node_packing_mode" ("bin-packing" or "bin-selection"). |
| elotl_luna_scale_errors_total | Counts total number of node scale up or down errors | "action" ("up" or "down"), "node_packing_mode" ("bin-packing" or "bin-selection"), "node_type", "reason", "spot", and "subnet". |
| elotl_luna_started_node_types_total | Counts total number of started nodes, grouped by node_type label. Note that this metric will not appear until Luna has created a node. | "node_packing_mode" ("bin-packing" or "bin-selection") and "node_type" with the node type’s name. |
| elotl_luna_node_startup_duration_seconds_{bucket,sum,count} | Histogram of seconds between ScaleUp Request creation and nodepool completing the operation. | "node_packing_mode" ("bin-packing" or "bin-selection"). |
| elotl_luna_pods_evicted_total | Counts pod evictions. | "node_packing_mode" ("bin-packing" or "bin-selection") and "results" ("success" or "error"). |
| elotl_luna_nodes_drained_total | Counts node drain actions. | "node_packing_mode" ("bin-packing" or "bin-selection") and "results" ("success" or "error"). |
| elotl_luna_nodes_removed_total | Counts nodes removed from cluster (ready or not before cordoning). | "node_packing_mode" ("bin-packing" or "bin-selection") and "node_state" ("ready" or "not_ready"). |
| elotl_luna_unschedulable_pods | Gauge set to the current number of unschedulable pods considered by luna-manager. | "node_packing_mode" ("bin-packing" or "bin-selection"). |
| elotl_luna_gpu_requests_exceeding_cluster_limit | Counts the number of attempts in which pod requests exceed the cluster GPU limit. | "node_packing_mode" ("bin-packing" or "bin-selection"). |
| elotl_luna_pods_skipped | Number of skipped pods (in the last loop iteration) for various reasons. | "node_packing_mode" ("bin-packing" or "bin-selection") and "reason" ("pending_reason_mismatch" or "pvc_bound"). |
| elotl_luna_nodes_scale_up_request_expired_total | Counts the number of node scale-up request expirations. | "node_packing_mode" ("bin-packing" or "bin-selection"). |
| elotl_luna_insufficient_free_addresses_in_subnet_errors_total | Counts number of insufficient free addresses in subnet errors. | "node_packing_mode" ("bin-packing" or "bin-selection"). |
| elotl_luna_node_type_backoff_active | Indicates whether a node type is currently in backoff (1) or not (0). | "node_type" and "reason". |
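
As a rough illustration of how the labels in the table above can be used, the following Python sketch parses scraped text in the Prometheus exposition format and sums a counter by one of its labels. It is not part of Luna: the `parse_metrics` and `sum_by_label` helpers are hypothetical, the parser handles only simple sample lines, and the values in `SAMPLE` are made up for illustration.

```python
import re
from collections import defaultdict

# Hypothetical scrape output; the sample values are invented for illustration.
SAMPLE = '''\
# TYPE elotl_luna_scale_actions_total counter
elotl_luna_scale_actions_total{action="up",node_packing_mode="bin-packing"} 4
elotl_luna_scale_actions_total{action="up",node_packing_mode="bin-selection"} 3
elotl_luna_scale_actions_total{action="down",node_packing_mode="bin-packing"} 2
elotl_luna_unschedulable_pods{node_packing_mode="bin-packing"} 1
'''

# Matches: metric_name{optional labels} value
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# Matches one label pair: name="value"
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith('#'):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ''))
        samples.append((name, labels, float(value)))
    return samples


def sum_by_label(samples, metric, label):
    """Sum the values of one metric, grouped by the given label."""
    totals = defaultdict(float)
    for name, labels, value in samples:
        if name == metric and label in labels:
            totals[labels[label]] += value
    return dict(totals)


totals = sum_by_label(parse_metrics(SAMPLE), "elotl_luna_scale_actions_total", "action")
print(totals)
```

In practice the text would come from an HTTP request to port 9090 of the elotl-luna-manager pod rather than from a string literal. The exact path is an assumption here; `/metrics` is the usual Prometheus convention. For production use, an established Prometheus client or a Prometheus server query is likely a better choice than a hand-written parser.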