Version: v0.5

Troubleshooting

Installation

Troubleshooting with Nova CLI

Nova CLI has a diagnostic sub-command, novactl status. It runs a set of checks to verify that your Nova Control Plane is up and running and that Nova's CRDs are installed. You need to pass the kubeconfig of the cluster hosting Nova's Control Plane.

Checking status of Nova Control Plane Components

* API Server status... Running √
* Kube-controller-manager status... Running √
* ETCD status... Running √
* Nova scheduler status... Running √
Nova Control Plane is healthy √

Checking presence of Nova Custom Resource Definitions

* Cluster CRD presence... installed √
        * Cluster kind-workload-1 connected and ready √
        * Cluster kind-workload-2 connected and ready √
* SchedulePolicy CRD presence... installed √
        * 0 SchedulePolicies defined ‼
                please create at least one SchedulePolicy, otherwise Nova does not know where to run your workloads. SchedulePolicy spec: https://docs.elotl.co
* ScheduleGroup CRD presence... installed √
All Nova Custom Resource Definitions installed √

If any component of the Nova Control Plane is not running, Nova cannot function properly. All of Nova's control plane components run in the elotl namespace. You can debug further by getting each component's logs using kubectl:

$ kubectl logs -n elotl deploy/nova-scheduler
$ kubectl logs -n elotl deploy/apiserver
$ kubectl logs -n elotl deploy/kube-controller-manager
$ kubectl logs -n elotl statefulset/etcd
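The four commands above can be wrapped in a small loop when you want a quick look at every component at once. This is a minimal sketch: the function name nova_control_plane_logs is hypothetical, and it assumes your current kubectl context points at the cluster hosting the Nova Control Plane.

```shell
# Print the most recent log lines of every Nova Control Plane component.
# Usage: nova_control_plane_logs [tail-lines]
nova_control_plane_logs() {
  tail_lines="${1:-50}"
  for component in deploy/nova-scheduler deploy/apiserver \
                   deploy/kube-controller-manager statefulset/etcd; do
    echo "=== ${component} ==="
    # --tail limits the output to the last N lines of each log.
    kubectl logs -n elotl "${component}" --tail="${tail_lines}"
  done
}
```

For example, nova_control_plane_logs 200 prints the last 200 lines per component.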

Nova agent was successfully installed in the workload cluster, but the cluster does not show up in Nova Control Plane

In this case, follow this checklist:

Is the Nova agent up and running in the workload cluster? You can verify it by checking its logs: kubectl logs -n elotl deploy/nova-agent
Is the Nova Control Plane API Server reachable from the workload cluster?
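Both checklist items can be sketched as shell helpers. The function names, the throwaway pod name nova-reach-test, the curlimages/curl image, and the /healthz path are assumptions made for illustration; they are not part of Nova. Any HTTP status code printed by the second helper (even 401 or 403) suggests that the network path from the workload cluster to the API Server is open.

```shell
# Check that the nova-agent Deployment has rolled out in the workload cluster.
# Usage: nova_agent_ready <workload-cluster-context>
nova_agent_ready() {
  kubectl --context="$1" -n elotl rollout status deploy/nova-agent --timeout=30s
}

# Probe the Nova Control Plane API Server from inside the workload cluster
# by running a short-lived curl pod and printing the HTTP status code.
# Usage: nova_apiserver_reachable <workload-cluster-context> <api-server-url>
nova_apiserver_reachable() {
  kubectl --context="$1" run nova-reach-test --rm -i --restart=Never \
    --image=curlimages/curl --command -- \
    curl -sk -m 5 -o /dev/null -w '%{http_code}\n' "$2/healthz"
}
```

The API Server URL is the address the nova-agent was configured with; the document does not state where it is stored, so you need to supply it yourself.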

Operations

My resources are created in Nova Control Plane, but not scheduled

Nova's scheduling process has multiple steps. In the first step, when a new resource is created, Nova tries to find a matching SchedulePolicy. You can check whether your resource was matched to a SchedulePolicy using the following command:

$ kubectl get events --namespace=<resource-namespace> --field-selector=involvedObject.name=<resource-name>

If a matching SchedulePolicy is found, Nova emits a Kubernetes Event stating that the object was matched, e.g.:

$ kubectl get events --namespace=<resource-namespace> --field-selector=involvedObject.name=<resource-name>
16s Normal SchedulePolicyMatched <resource-namespace>/<resource-name> schedule policy <policy-name> will be used to determine target cluster
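To check for the match event in a script, you can narrow the field selector to the SchedulePolicyMatched reason shown above. A minimal sketch, with a hypothetical function name; it assumes your current kubectl context points at the Nova Control Plane.

```shell
# Succeeds (exit 0) and prints the event(s) if a SchedulePolicyMatched event
# exists for the resource; otherwise prints a note to stderr and returns 1.
# Usage: nova_policy_matched <resource-namespace> <resource-name>
nova_policy_matched() {
  events="$(kubectl get events --namespace="$1" --no-headers \
    --field-selector="involvedObject.name=$2,reason=SchedulePolicyMatched")"
  if [ -n "${events}" ]; then
    echo "${events}"
  else
    echo "no SchedulePolicyMatched event found for $1/$2" >&2
    return 1
  fi
}
```

Keep in mind that Kubernetes Events expire after a retention period, so an empty result for an old resource does not prove that no policy was matched.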

If no SchedulePolicy was matched, check the following:

Whether your resource's Kind is supported by Nova. The list of supported Kinds is in the Introduction section of the Nova docs: https://docs.elotl.co/nova/intro
Whether you defined a SchedulePolicy with the correct namespaceSelector and resourceSelector.
Whether your resource is in one of the namespaces specified in the SchedulePolicy's namespaceSelector. Cluster-scoped objects are matched based on the label selector only.
Whether your resource has labels that match the SchedulePolicy's resourceSelectors.
Your object may match more than one SchedulePolicy. In this case, Nova sorts the matching SchedulePolicies in alphabetical order and uses the first one.
If your resource is a namespace whose name starts with 'kube-' or 'elotl-', it is ignored by Nova. Those are restricted namespaces.
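The checklist above comes down to comparing a resource's namespace and labels against the selectors of the defined policies. The sketch below prints those inputs side by side. The function name is hypothetical, and it assumes the SchedulePolicy CRD can be listed as schedulepolicies, which the document does not confirm.

```shell
# Show the inputs Nova's matching depends on: the resource's labels,
# its namespace's labels, and the SchedulePolicies currently defined.
# Usage: nova_match_inputs <kind> <resource-name> <resource-namespace>
nova_match_inputs() {
  kubectl get "$1" "$2" --namespace="$3" --show-labels
  kubectl get namespace "$3" --show-labels
  # Assumption: the SchedulePolicy CRD is addressable as "schedulepolicies".
  kubectl get schedulepolicies -o yaml
}
```

For example, nova_match_inputs deployment nginx default prints the labels of the nginx Deployment and the default namespace, followed by every policy's spec for manual comparison.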

Resources created in Nova Control Plane match a SchedulePolicy, but are not running in the workload cluster

When your resource(s) are matched to the SchedulePolicy, but they don't transition into Running state, there may be a few reasons:

SchedulePolicy has a clusterSelector that does not match any clusters.
You can see workload clusters connected to Nova by running:
```
$ kubectl get clusters --show-labels
```
Compare the output with your SchedulePolicy's '.spec.clusterSelector'. Then edit the cluster selector so that it matches one or more clusters.
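One way to make this comparison concrete is to print the policy's selector and then try a candidate label selector against the connected clusters. This is a sketch with hypothetical function names; the label env=dev in the usage comment is illustrative only, and the sketch assumes the CRD can be queried as schedulepolicy.

```shell
# Print the clusterSelector currently set on a SchedulePolicy.
# Usage: nova_policy_cluster_selector <policy-name>
nova_policy_cluster_selector() {
  kubectl get schedulepolicy "$1" -o jsonpath='{.spec.clusterSelector}{"\n"}'
}

# List the workload clusters whose labels satisfy a label selector, so a
# candidate selector can be tested before editing the SchedulePolicy.
# Usage: nova_clusters_matching <label-selector>   e.g. "env=dev"
nova_clusters_matching() {
  kubectl get clusters --selector="$1" --show-labels
}
```

If nova_clusters_matching returns no clusters for the labels used in the policy, the selector is the likely cause.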
SchedulePolicy has a clusterSelector matching cluster(s), but there is not enough capacity on those cluster nodes to run your resource(s).

How to Fix

If this is the case and you are using group scheduling, check the following:

$ kubectl get events --namespace=<resource-namespace> --field-selector=involvedObject.name=<resource-name>

Your resource should have an event saying: added to ScheduleGroup <schedule-group-name> which contains objects with groupBy.labelKey <foo>=<bar>

Then, you can get the details on this ScheduleGroup, using:

$ kubectl describe schedulegroup <schedule-group-name>

In the "Events" section, there should be a line saying:

SchedulePolicy has a clusterSelector matching cluster(s), and there is enough capacity, but the workloads cannot be created because their namespace does not exist in the workload cluster.

How to Fix

You can either create the namespace manually in the workload cluster (by running kubectl --context=workload-cluster-context create namespace <your-namespace>) or schedule the Namespace object using Nova. Remember that a Namespace is treated like any other resource, meaning that it needs labels matching the desired SchedulePolicy's resourceSelector.
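The manual option can be made idempotent by checking for the namespace first. A minimal sketch; the function name is hypothetical.

```shell
# Create the namespace in the workload cluster only if it is missing.
# Usage: ensure_workload_namespace <workload-cluster-context> <namespace>
ensure_workload_namespace() {
  if kubectl --context="$1" get namespace "$2" >/dev/null 2>&1; then
    echo "namespace $2 already exists in $1"
  else
    kubectl --context="$1" create namespace "$2"
  fi
}
```

Note that the existence check treats any kubectl failure (including an unreachable cluster) as "missing", so a connection problem surfaces as an error from the create call.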

SchedulePolicy has a clusterSelector matching cluster(s), and there is enough capacity, but the nova-agent in the cluster is having issues.

How to Fix

You can grab logs from nova-agent in this cluster by running:

$ kubectl logs -n elotl deploy/nova-agent

and contact the Elotl team.
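When preparing a support request, it can help to save the logs to a file rather than copying them from the terminal. A sketch under the assumption that the last hour of logs is sufficient; the function name and the file naming scheme are hypothetical.

```shell
# Save recent nova-agent logs to a timestamped file and print the file name.
# Usage: collect_nova_agent_logs <workload-cluster-context> [since]
collect_nova_agent_logs() {
  outfile="nova-agent-$1-$(date +%Y%m%d-%H%M%S).log"
  # --since restricts the output to logs newer than the given duration.
  kubectl --context="$1" logs -n elotl deploy/nova-agent --since="${2:-1h}" > "${outfile}" \
    && echo "${outfile}"
}
```

For example, collect_nova_agent_logs kind-workload-1 30m writes the last 30 minutes of agent logs to a file in the current directory.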

My resources got scheduled to an unexpected workload cluster

My resources were running in cluster A, but then they got moved to cluster B

Nova supports automatic re-scheduling, which happens in two cases:

Pods that were scheduled directly via Nova (important: this does not apply if you scheduled a Deployment or any other Pod controller via Nova) are Pending in the workload cluster because the cluster has insufficient capacity.

A Deployment that was scheduled via Nova has the following condition:

conditions:
- lastTransitionTime: <does not matter>
  lastUpdateTime: <does not matter>
  message: Deployment does not have minimum availability.
  reason: MinimumReplicasUnavailable
  status: "False"
  type: Available
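You can check for either trigger in the cluster your workloads were running in. The sketch below uses standard kubectl field selectors and JSONPath; the function names are hypothetical, and it assumes you have a kubectl context for the workload cluster.

```shell
# Case 1: list Pods stuck in Pending in a workload cluster.
# Usage: nova_pending_pods <workload-cluster-context> <namespace>
nova_pending_pods() {
  kubectl --context="$1" get pods --namespace="$2" --field-selector=status.phase=Pending
}

# Case 2: print the status ("True" or "False") of a Deployment's
# "Available" condition.
# Usage: nova_deployment_available <workload-cluster-context> <namespace> <deployment>
nova_deployment_available() {
  kubectl --context="$1" get deployment "$3" --namespace="$2" \
    -o jsonpath='{.status.conditions[?(@.type=="Available")].status}{"\n"}'
}
```

A result of False from nova_deployment_available corresponds to the condition shown above.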

If you defined your SchedulePolicy with groupBy settings, Nova schedules the entire ScheduleGroup at once. If one of the Deployments in the group has the condition above, the whole ScheduleGroup is re-scheduled. In this case, Nova emits the following Kubernetes Event for the ScheduleGroup:

$ kubectl describe schedulegroup <my-group-name>
...
Events:
  Type     Reason                                Age   From            Message
  ----     ------                                ----  ----            -------
  Warning  ReschedulingTriggered                 3s    nova-agent      deployment default/nginx-group-5 does not have minimum replicas available
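To find such events without describing each ScheduleGroup, you can filter events by the ReschedulingTriggered reason shown above. A minimal sketch with a hypothetical function name; it searches all namespaces because the document does not state where ScheduleGroup events are recorded.

```shell
# List ReschedulingTriggered events, optionally limited to one ScheduleGroup.
# Usage: nova_rescheduling_events [schedule-group-name]
nova_rescheduling_events() {
  selector="reason=ReschedulingTriggered"
  if [ -n "${1:-}" ]; then
    selector="${selector},involvedObject.name=$1"
  fi
  kubectl get events --all-namespaces --field-selector="${selector}"
}
```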
