Version: v0.9.0

Troubleshooting

This section details how to correct issues you may encounter when working with Nova. If you encounter an issue that is not listed here, please contact the Elotl team.

Installation

timed out waiting for the condition when Installing Kube API Server

If you get this output while installing Nova Control Plane:

kubectl nova install control-plane my-control-plane
Installing Nova Control Plane... 🪄
Cluster name - my-control-plane
Creating namespace elotl in control plane
Creating certificates
Generating certificates
Certificates successfully generated.
Installing Kube API Server...
timed out waiting for the condition

This means that the Nova Control Plane API server and/or its dependencies did not start properly. The most likely cause is that etcd cannot start because no storage provisioner is running in your cluster.

Run:

kubectl get pods -n elotl
NAME                                      READY   STATUS             RESTARTS       AGE
apiserver-6bf98bb5d5-vv7wc                0/1     CrashLoopBackOff   6 (110s ago)   9m42s
etcd-0                                    0/1     Pending            0              9m42s
kube-controller-manager-76d5d96df-ntl6g   0/1     CrashLoopBackOff   6 (3m42s ago)  9m42s

As you can see, apiserver and kube-controller-manager are starting and failing, while etcd is still in the Pending state.

You should follow your cloud provider's documentation and set up a storage provisioner on your cluster. Once that's done, run:

kubectl nova --context=[nova hosting cluster] uninstall control-plane my-control-plane

And install your Nova Control Plane again.
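Before reinstalling, you can confirm that a default StorageClass now exists so that the etcd volume claim can bind. A minimal sketch (the Rancher local-path provisioner shown here is just one illustrative option for self-managed or local clusters; on managed clusters follow your cloud provider's documentation instead):

# Check whether any StorageClass exists and which one is the default
kubectl --context=[nova hosting cluster] get storageclass

# Illustrative only: install the local-path provisioner and mark it as the default StorageClass
kubectl --context=[nova hosting cluster] apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/master/deploy/local-path-storage.yaml
kubectl --context=[nova hosting cluster] patch storageclass local-path -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'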

Diagnose with the novactl status CLI

The Nova CLI has a diagnostic sub-command, novactl status. The command runs checks to verify that your Nova Control Plane is up and running and that Nova's CRDs are installed.

Run it using:

kubectl nova --context=[nova-control-plane] --hosting-cluster-context=[nova-hosting-cluster] --hosting-cluster-nova-namespace=elotl status
Checking status of Nova Control Plane Components

* API Server status... Running √
* Kube-controller-manager status... Running √
* ETCD status... Running √
* Nova scheduler status... Running √
Nova Control Plane is healthy √

Checking presence of Nova Custom Resource Definitions

* Cluster CRD presence... installed √
* Cluster kind-workload-1 connected and ready √
* Cluster kind-workload-2 connected and ready √
* SchedulePolicy CRD presence... installed √
* 0 SchedulePolicies defined ‼
please create at least one SchedulePolicy, otherwise Nova does not know where to run your workloads. SchedulePolicy spec: https://docs.elotl.co
* ScheduleGroup CRD presence... installed √
All Nova Custom Resource Definitions installed √


If one of the Nova Control Plane components is not running, Nova cannot function properly. All of Nova's control plane components run in the elotl namespace.

To debug further, get each component's logs using kubectl:

kubectl logs -n elotl deploy/nova-scheduler
kubectl logs -n elotl deploy/apiserver
kubectl logs -n elotl deploy/kube-controller-manager
kubectl logs -n elotl statefulset/etcd

Your cluster does not appear in Nova Control Plane

If the Nova agent was successfully installed to the workload cluster, but the cluster does not show up in Nova Control Plane, do the following:

  1. Check that the Nova agent is up and running in the workload cluster. To do this, check the agent's Deployment status:
kubectl get --context [workload-cluster-context] --namespace elotl deployment nova-agent
  2. If the agent install finished without issues and the agent pod is up and running, something likely went wrong during the agent registration process. Run the following command to get the agent logs:
kubectl get pods --context nova-example-agent-1 -n elotl -o name -l "app.kubernetes.io/name"="nova-agent" | xargs -I {} kubectl logs --context nova-example-agent-1 -n elotl {}

And start debugging from there!
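
Once the agent registers successfully, the workload cluster should appear as a Cluster object in the Nova Control Plane; you can re-check with:

kubectl --context=[nova-control-plane] get clusters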

Operations

My resources are in the Nova Control Plane, but not scheduled

Nova's scheduling is a multi-step process. In the first step, when you create a new resource, Nova tries to find a matching SchedulePolicy. If your resource is not scheduled, check whether it was matched to a SchedulePolicy using the following command:

kubectl get events --namespace=<resource-namespace> --field-selector=involvedObject.name=<resource-name>

If a matching SchedulePolicy is found, Nova emits a Kubernetes Event recording which policy the object was matched to, for example:

kubectl get events --namespace=<resource-namespace> --field-selector=involvedObject.name=<resource-name>
16s Normal SchedulePolicyMatched <resource-namespace>/<resource-name> schedule policy <policy-name> will be used to determine target cluster

If no SchedulePolicy was matched, verify the following:

  • Ensure your resource's Kind is supported by Nova. The Nova introduction lists the supported kinds.
  • Check that you defined your SchedulePolicy with the correct namespaceSelector and resourceSelector (see the example sketch after this list).
  • Verify that your resource is in one of the namespaces specified in the SchedulePolicy's namespaceSelector. Cluster-scoped objects are matched only by label selector.
  • Does your resource have labels that match SchedulePolicy's resourceSelectors?
  • Do your objects match more than one SchedulePolicy? In this case, Nova will sort SchedulePolicies in alphabetical order and use the first one.
  • If your resource is a namespace starting with kube- or elotl, note that these are restricted namespaces and Nova ignores them.
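
For reference, the sketch below shows the general shape of a SchedulePolicy whose selectors would match resources labeled app: nginx in namespaces labeled team: demo. The apiVersion, exact field layout, and label values are assumptions for illustration only; consult the SchedulePolicy spec at https://docs.elotl.co for the schema used by your Nova version.

apiVersion: policy.elotl.co/v1alpha1   # assumed group/version; check your installed CRD
kind: SchedulePolicy
metadata:
  name: example-policy
spec:
  namespaceSelector:          # namespaces the policy applies to
    matchLabels:
      team: demo
  resourceSelectors:          # objects inside those namespaces that the policy matches (layout assumed)
    labelSelectors:
    - matchLabels:
        app: nginx
  clusterSelector:            # candidate workload clusters (labels as shown by kubectl get clusters --show-labels)
    matchLabels:
      env: dev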

Resources were created that match the SchedulePolicy, but do not appear in the workload cluster

When your resources are matched to a SchedulePolicy but don't transition into the Running state, there may be a few reasons:

  • The SchedulePolicy has a clusterSelector that does not match any clusters. To see the workload clusters connected to Nova, run:

    kubectl get clusters --show-labels
To fix this

Compare the output with your SchedulePolicy's .spec.clusterSelector. Then, edit the cluster selector so it matches one or more clusters.
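
For example, if kubectl get clusters --show-labels shows a cluster labeled env=dev while the policy selects env=prod, you can either relabel the cluster or adjust the policy (the cluster name and labels below are illustrative):

# Option 1: add the label the policy expects to the Cluster object in the Nova Control Plane
kubectl --context=[nova-control-plane] label cluster kind-workload-1 env=prod

# Option 2: edit the policy so .spec.clusterSelector matches labels the clusters already have
kubectl --context=[nova-control-plane] edit schedulepolicy <policy-name>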

  • SchedulePolicy has a clusterSelector matching cluster(s), but there is not enough capacity on those cluster nodes to run your resource(s).
To fix this

If this is the case and you are using group scheduling, check the following:

kubectl get events --namespace=<resource-namespace> --field-selector=involvedObject.name=<resource-name>

Your resource should have an event saying:

added to ScheduleGroup <schedule-group-name> which contains objects with groupBy.labelKey <foo>=<bar>

Then, you can get the details on this ScheduleGroup, using:

kubectl describe schedulegroup <schedule-group-name>

In the Events section there should be a line saying: Normal ScheduleGroupSyncedToWorkloadCluster 8s nova-scheduler Multiple clusters matching policy <policy-name> (empty cluster selector): <cluster-names>; group policy <schedule-group-name> does not fit in any cluster;
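
To see whether the matched clusters actually have room for the whole group, you can inspect node capacity in each workload cluster directly (generic kubectl commands; kubectl top requires metrics-server in that cluster):

# Allocatable capacity and current requests per node
kubectl --context=[workload-cluster-context] describe nodes | grep -A 8 "Allocated resources"

# Current usage (requires metrics-server)
kubectl --context=[workload-cluster-context] top nodes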

  • SchedulePolicy has a clusterSelector matching cluster(s), and there is enough capacity, but the workloads cannot be created because their namespace does not exist in the workload cluster.
To fix this

You can either create the namespace manually in the workload cluster (by running kubectl --context=<workload-cluster-context> create namespace <your-namespace>) or schedule the Namespace object through Nova. Remember that a Namespace is treated like any other resource, meaning it needs labels that match the desired SchedulePolicy's resourceSelector.
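
A minimal sketch of a Namespace manifest that Nova could schedule itself, assuming the SchedulePolicy's resourceSelector matches the team: demo label (the label key and value are illustrative):

apiVersion: v1
kind: Namespace
metadata:
  name: <your-namespace>
  labels:
    team: demo   # must match the labels your SchedulePolicy's resourceSelector expects

Apply it in the Nova Control Plane (for example, kubectl --context=[nova-control-plane] apply -f namespace.yaml) and Nova schedules it like any other matched resource.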

  • SchedulePolicy has a clusterSelector matching cluster(s), and there is enough capacity, but the nova-agent in the cluster is having issues.
To fix this

You can grab logs from nova-agent in this cluster by running:

kubectl logs -n elotl deploy/nova-agent

and contact the Elotl team at info@elotl.co.

Nova supports automatic re-scheduling, which happens in these cases:

  • Pod(s) that was/were scheduled via Nova are Pending in the workload cluster, because there is insufficient capacity in the cluster.
Note: This state does not occur if you scheduled a Deployment or any other pod controller via Nova.

  • A Deployment that was scheduled via Nova has the following condition (a command to check this is shown after the list):

    conditions:
    - lastTransitionTime: <does not matter>
      lastUpdateTime: <does not matter>
      message: Deployment does not have minimum availability.
      reason: MinimumReplicasUnavailable
      status: "False"
      type: Available
  • If you defined your SchedulePolicy with groupBy settings, Nova schedules the entire ScheduleGroup at once. If one of the deployments in this group has either of the preceding conditions, Nova reschedules the whole ScheduleGroup. In this case, Nova emits the following Kubernetes Event for the ScheduleGroup:

    kubectl describe schedulegroup <my-group-name>
    ...
    Events:
    Type     Reason                 Age   From        Message
    ----     ------                 ----  ----        -------
    Warning  ReschedulingTriggered  3s    nova-agent  deployment default/nginx-group-5 does not have minimum replicas available
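
To check whether a Deployment in the workload cluster currently reports the Available=False condition described above, you can inspect its status directly (generic kubectl; the names are placeholders):

kubectl --context=[workload-cluster-context] get deployment <deployment-name> -n <resource-namespace> -o jsonpath='{.status.conditions[?(@.type=="Available")].status}'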