Available Metrics
Luna exposes metrics in Prometheus format. They can be scraped from port 9090 of the elotl-luna-manager pod. The table below lists the available metrics with descriptions.
Luna Metric Name | Description |
---|---|
scale_actions_total | Counts total number of scale actions done by luna-manager |
started_node_types_total | Counts total number of started nodes, grouped by node_type label |
node_startup_duration_seconds | Measures the time, in seconds, between creation of a ScaleUp Request and the request being marked as succeeded |
pods_evicted_total | Counts pod evictions (successful or not, see <result> label) |
nodes_drained_total | Counts node drain actions (successful or not, see <ok> label) |
nodes_removed_total | Counts nodes removed from cluster (ready or not before cordoning, see <node_state> label) |
unschedulable_pods | Gauge set to the current number of unschedulable pods considered by luna-manager |
cluster_gpu_limit_exceeds_total | Counts the number of attempts in which pod requests exceed the cluster GPU limit |
pods_skipped | Number of pods skipped in the last loop iteration, for various reasons |
nodes_scale_up_request_expired_total | Counts the number of node scale-up request expirations |
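
As an illustration of how these metrics might be consumed outside of Prometheus, the following is a minimal Python sketch that parses the Prometheus text format and totals a counter across its labels. Several details are assumptions rather than documented Luna behavior: the URL (port 9090 forwarded to localhost, with the conventional `/metrics` path), the sample values, the `HELP` text, and the label values `success` and `failure`. The helper names `parse_metrics`, `fetch_metrics`, and `total` are hypothetical, and the parser handles only simple sample lines, not the full exposition format.

```python
import re
import urllib.request

# Assumed URL: port 9090 of the elotl-luna-manager pod forwarded to localhost,
# with metrics served at the conventional /metrics path (not confirmed here).
METRICS_URL = "http://localhost:9090/metrics"

# Hypothetical sample output; names come from the table above,
# values, HELP text, and label values are made up for illustration.
SAMPLE = """\
# HELP unschedulable_pods Current number of unschedulable pods
# TYPE unschedulable_pods gauge
unschedulable_pods 3
# TYPE scale_actions_total counter
scale_actions_total 17
# TYPE pods_evicted_total counter
pods_evicted_total{result="success"} 5
pods_evicted_total{result="failure"} 1
"""

# Matches label pairs such as result="success" inside the braces.
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text-format lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and # HELP / # TYPE comments.
        if not line or line.startswith("#"):
            continue
        if "{" in line:
            name, rest = line.split("{", 1)
            label_part, value_part = rest.rsplit("}", 1)
            labels = dict(LABEL_RE.findall(label_part))
        else:
            name, value_part = line.split(None, 1)
            labels = {}
        # The first token after the name/labels is the value;
        # an optional timestamp may follow and is ignored.
        value = float(value_part.split()[0])
        samples.append((name.strip(), labels, value))
    return samples


def fetch_metrics(url=METRICS_URL):
    """Fetch the raw metrics text from the (assumed) endpoint."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def total(samples, metric):
    """Sum a metric's value across all of its label combinations."""
    return sum(value for name, _, value in samples if name == metric)


if __name__ == "__main__":
    # Uses the embedded sample; swap in fetch_metrics() against a live endpoint.
    parsed = parse_metrics(SAMPLE)
    print("evictions:", total(parsed, "pods_evicted_total"))
    print("unschedulable:", total(parsed, "unschedulable_pods"))
```

Against a live cluster, `parse_metrics(fetch_metrics())` would replace the embedded sample, provided the endpoint is reachable at the assumed URL.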