Version: v0.4

Troubleshooting

Luna manager and webhook are down

If the luna-manager and luna-webhook pods have containers that are not ready, it is likely that there was an issue in the manager. If the manager pod is not running, Luna automatically marks the webhook pod(s) as not ready. You may notice this by listing the pods in the Luna namespace (elotl by default):

$ kubectl get pods -n elotl
elotl-luna-manager-d495f9f96-v7zng 1/2 Running 0 115d
elotl-luna-webhook-d495f9f96-z5vmh 1/2 Running 0 115d
elotl-luna-webhook-d495f9f96-w3prk 1/2 Running 0 115d

This is to avoid a situation where the Luna webhook adds a nodeSelector that cannot be satisfied because the Luna manager is down. This way, pods are not mutated with a nodeSelector, so they can still run on any node in the cluster.

To find the root cause, check the luna-manager logs:

$ kubectl logs -n elotl -l app.kubernetes.io/component=luna-manager

Node is idle but not scaled down

There may be cases where node utilization is low and the pod that was meant to run on the node has already been deleted, but the node is still in the cluster. In this case, you might see a log message in luna-manager specifying which pod(s) are blocking node removal:

Node X cannot be removed because pod Y on this node is blocking the removal due to <Reason>

where <Reason> is an explanation of why the pod cannot be moved to another node.

Another reason could be that there is not enough capacity on the other nodes to run the pods currently running on this node. In this case, the log message will say:

Node X cannot be removed because these pods can't be scheduled on another of N node choices: pod-A, pod-B, ...

It is up to the cluster operator to move those pods manually and unblock node removal.
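One common way to do this, assuming the blocking pods are managed by controllers (Deployment, StatefulSet, etc.) that will recreate them on other nodes, is to cordon and drain the node (the node name below is a placeholder):

# Prevent new pods from landing on the node, then evict the blocking pods:
$ kubectl cordon my-node
$ kubectl drain my-node --ignore-daemonsets --delete-emptydir-data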

My pods are in Pending state for too long

Luna adds a nodeSelector to the pods under its management and then adds a node with labels matching this node selector. Until the node(s) are added to the cluster and ready, the pod(s) will remain in a Pending state. For bin-selected nodes, it usually takes less than 5 minutes for pod(s) to transition from creation to the Running state; this may vary between cloud providers. For pods requesting GPUs it may take a bit longer: before Kubernetes marks the node as ready and schedulable, a DaemonSet pod usually needs to install GPU drivers, and it's common for those pods to use heavy Docker images. If you're seeing that one or more pods are in Pending state for longer than expected, here's a debugging checklist:

Are Pending pods labeled with Luna labels?

All pods managed by Luna are first labeled by luna-webhook and should have the following labels:

  • key: node.elotl.co/packing-strategy, possible values: bin-packing / bin-selection
  • key: node.elotl.co/node-pool-name, possible values: bin-packing or a value unique to the given pod

The Luna webhook adds those labels to all pods that match the labels passed in the Luna Helm chart values. By default, that label is elotl-luna=true.
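To check whether Pending pods actually carry these labels, you can list Pending pods with the Luna labels shown as extra columns, for example:

$ kubectl get pods -A --field-selector=status.phase=Pending \
    -L node.elotl.co/packing-strategy,node.elotl.co/node-pool-name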

note

If the Pending pods are controlled by a Deployment, StatefulSet, Job, or any other pod controller, make sure the labels are set on the Pod template, not on the Deployment itself.
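For example, for a hypothetical Deployment named my-deploy, you can verify that the labels are present on the pod template:

$ kubectl get deployment my-deploy -o jsonpath='{.spec.template.metadata.labels}'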

Luna labels are missing

It is likely that luna-webhook is having issues. See the Luna webhook debugging section below.

Do pending pods have nodeSelector set?

The Luna webhook also sets a nodeSelector. In the case of bin-packing, the nodeSelector is set to a fixed value: node.elotl.co/destination: bin-packing

In the case of bin-selection, each pod gets node.elotl.co/destination: <unique-value>.
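You can inspect the nodeSelector of a pod directly (my-pod is a placeholder name):

$ kubectl get pod my-pod -o jsonpath='{.spec.nodeSelector}'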

NodeSelector not set

If the nodeSelector is not set, it may indicate problems with the Luna webhook. See the Luna webhook debugging section below.

NodeSelector and labels set properly

It is likely an issue with luna-manager. Checking the luna-manager logs may give you a better understanding of what went wrong.
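For example (my-pod and the destination value are placeholders), you can grep the manager logs for the pod and check whether any node matching its nodeSelector already exists:

$ kubectl logs -n elotl -l app.kubernetes.io/component=luna-manager | grep my-pod
$ kubectl get nodes -l node.elotl.co/destination=<value-from-pod-nodeSelector>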

Pod is Pending, but no new node is created

If your pod is stuck in the Pending state and has a nodeSelector added by the Luna webhook, but there is no newly added node matching that node selector, it is possible that Luna couldn't find an instance type that satisfies the pod's resource requests. In this case, you should see a log message similar to this in the luna-manager logs:

unable to find node type for pod my-pod

This may happen in two cases:

  1. When there is no instance type satisfying the pod's resource requests combined with Luna's node type inclusions / exclusions (specified in the pod's annotations, see Luna configuration). In this case it is a configuration error on the user's side.
  2. When a matching instance type exists but is not available in the cloud at the moment. In this case it is a transient error.

On GKE, there is also a third case: a matching instance type is found, but it cannot start because the quota limit for the given CPU/GPU type has been reached in the region. In that case, you will see a log message in luna-manager similar to:

rpc error: code = PermissionDenied desc = Insufficient regional quota to satisfy request: resource "N2D_CPUS": request requires '4.0' and is short '4.0'. project has a quota of '0.0' with '0.0' available. View and manage quotas at https://console.cloud.google.com/iam-admin/quotas?usage=USED&project=your-project.
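You can also check the regional quotas from the command line; for example (the region and project names are placeholders), the quotas section of the output lists the limit and current usage for each CPU/GPU type:

$ gcloud compute regions describe us-central1 --project your-project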

Luna webhook debugging

You can verify its status with kubectl:

$ kubectl get deployment -n elotl

If the Luna webhook pods are up and running, you can check whether your pod (e.g. named my-pod) was processed or skipped by the webhook:

$ kubectl logs -n elotl deployment/elotl-luna-webhook | grep my-pod

If the pod was skipped, you will see one of the following log messages:

skipping pod my-pod because it doesn't match labels

or

skipping pod my-pod because it's a part of daemonset

If the pod was processed, you should see something similar to:

 handler.go:50] AdmissionReview Kind="/v1, Kind=Pod"
Namespace="my-namespace" Name="my-pod"/"" UID=57c7b78a-4545-42b8-91cb-9d9519c4a16f ...

handler.go:71] AdmissionResponse: patch=[{add /metadata/labels map[app:my-app elotl-systest:true node.elotl.co/node-pool-name:p915b9e7c node.elotl.co/packing-strategy:bin-selection pod-template-hash:7c5756f47c]} {add /spec/nodeSelector map[node.elotl.co/destination:7985b6b845]}]