Available Metrics
Luna exposes metrics in Prometheus format. They can be scraped from port 9090 of the elotl-luna-manager pod. The available metrics are listed below with their descriptions.
Luna Metric Name | Description |
---|---|
elot_luna_scale_actions_total | Counts the total number of node scale actions performed by luna-manager. Its labels are "action" ("up" or "down") and "node_packing_mode" ("bin-packing" or "bin-selection"). |
elot_luna_scale_errors_total | Counts the total number of node scale-up or scale-down errors. Its labels are "action" ("up" or "down") and "node_packing_mode" ("bin-packing" or "bin-selection"). |
elot_luna_started_node_types_total | Counts the total number of started nodes, grouped by the node_type label. This metric does not appear until Luna has created a node. Its labels are "node_packing_mode" ("bin-packing" or "bin-selection") and "node_type" (the name of the node type). |
elotluna_node_startup_duration_seconds{bucket,sum,count} | Histogram of seconds between ScaleUp Request creation and nodepool completing the operation. Its label is "node_packing_mode" ("bin-packing" or "bin-selection"). |
elot_luna_pods_evicted_total | Counts pod evictions. Its labels are "node_packing_mode" ("bin-packing" or "bin-selection") and "results" ("success" or "error"). |
elot_luna_nodes_drained_total | Counts node drain actions. Its labels are "node_packing_mode" ("bin-packing" or "bin-selection") and "results" ("success" or "error"). |
elot_luna_nodes_removed_total | Counts nodes removed from the cluster, whether or not they were ready before cordoning. Its labels are "node_packing_mode" ("bin-packing" or "bin-selection") and "node_state" ("ready" or "not_ready"). |
elot_luna_unschedulable_pods | Gauge set to the current number of unschedulable pods considered by luna-manager. Its label is "node_packing_mode" ("bin-packing" or "bin-selection"). |
elot_luna_gpu_requests_exceeding_cluster_limit | Counts the number of attempts in which pod requests exceed the cluster GPU limit. Its label is "node_packing_mode" ("bin-packing" or "bin-selection"). |
elot_luna_pods_skipped | Number of pods skipped in the last loop iteration, for various reasons. Its labels are "node_packing_mode" ("bin-packing" or "bin-selection") and "reason" ("pending_reason_mismatch" or "pvc_bound"). |
elot_luna_nodes_scale_up_request_expired_total | Counts the number of node scale-up request expirations. Its label is "node_packing_mode" ("bin-packing" or "bin-selection"). |
elot_luna_insufficient_free_addresses_in_subnet_errors_total | Counts the number of "insufficient free addresses in subnet" errors. Its label is "node_packing_mode" ("bin-packing" or "bin-selection"). |
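
As a rough illustration of how the metrics in the table might be consumed outside of Prometheus, the sketch below parses Prometheus text-format output and sums a counter by one of its labels. It rests on several assumptions that this page does not state: that the endpoint path is `/metrics`, and that port 9090 of the elotl-luna-manager pod has been made reachable locally (for example through a port-forward). The helper names `fetch_metrics`, `parse_metrics`, and `sum_by_label` are hypothetical, and the values in `SAMPLE` are invented for illustration rather than real Luna output.

```python
import re
import urllib.request
from collections import defaultdict

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# One label pair inside the braces: key="value".
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def fetch_metrics(url="http://localhost:9090/metrics"):
    """Fetch raw metrics text. The URL and the /metrics path are assumptions."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def parse_metrics(text):
    """Parse text-format samples into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def sum_by_label(samples, metric, label):
    """Sum the values of one metric, grouped by the given label."""
    totals = defaultdict(float)
    for name, labels, value in samples:
        if name == metric:
            totals[labels.get(label, "")] += value
    return dict(totals)


# Invented sample data in the text exposition format, for illustration only.
SAMPLE = """\
# TYPE elot_luna_scale_actions_total counter
elot_luna_scale_actions_total{action="up",node_packing_mode="bin-packing"} 3
elot_luna_scale_actions_total{action="up",node_packing_mode="bin-selection"} 2
elot_luna_scale_actions_total{action="down",node_packing_mode="bin-packing"} 1
elot_luna_unschedulable_pods{node_packing_mode="bin-packing"} 0
"""

if __name__ == "__main__":
    parsed = parse_metrics(SAMPLE)
    print(sum_by_label(parsed, "elot_luna_scale_actions_total", "action"))
```

To run this against a live cluster instead of the sample text, `parse_metrics(fetch_metrics())` could be substituted for `parse_metrics(SAMPLE)`, provided the assumed URL is correct for your setup. For anything beyond quick inspection, an official Prometheus client or a Prometheus server scrape job is likely a better fit than a hand-written parser.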