Troubleshooting
This section details how to correct issues you may encounter when working with Nova. If you encounter an issue that is not listed here, please contact the Elotl team.
Installation
timed out waiting for the condition when Installing Kube API Server
If you get this output while installing Nova Control Plane:
kubectl nova install control-plane --context=${K8S_HOSTING_CLUSTER_CONTEXT} --namespace=${NOVA_NAMESPACE} ${NOVA_CONTROLPLANE_CONTEXT}
Installing Nova Control Plane... 🪄
Cluster name - ${NOVA_CONTROLPLANE_CONTEXT}
Creating namespace elotl in control plane
Creating certificates
Generating certificates
Certificates successfully generated.
Installing Kube API Server...
timed out waiting for the condition
This means that the API server of the Nova Control Plane and/or its dependencies did not start properly. The most likely cause is etcd failing to start because no storage provisioner is running in your cluster.
Run:
kubectl get pods --context=${K8S_HOSTING_CLUSTER_CONTEXT} --namespace=${NOVA_NAMESPACE}
NAME READY STATUS RESTARTS AGE
apiserver-6bf98bb5d5-vv7wc 0/1 CrashLoopBackOff 6 (110s ago) 9m42s
etcd-0 0/1 Pending 0 9m42s
kube-controller-manager-76d5d96df-ntl6g 0/1 CrashLoopBackOff 6 (3m42s ago) 9m42s
As you can see, apiserver and kube-controller-manager are starting and failing, while etcd is still in Pending state.
You should follow your Cloud Provider's documentation and set up a storage provisioner on your cluster. After you're done, run:
kubectl nova uninstall ${NOVA_CONTROLPLANE_CONTEXT} --context=${K8S_HOSTING_CLUSTER_CONTEXT}
And install your Nova Control Plane again.
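Before reinstalling, you can confirm that the hosting cluster now has a StorageClass that etcd's volume claim can use, for example:
kubectl get storageclass --context=${K8S_HOSTING_CLUSTER_CONTEXT}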
Diagnose with the novactl status CLI
The Nova CLI has a diagnostic sub-command, novactl status. The command runs checks to verify that your Nova Control Plane is up and running and that Nova's CRDs are installed.
Run it using:
kubectl nova status --context=${NOVA_CONTROLPLANE_CONTEXT} --hosting-cluster-context=${K8S_HOSTING_CLUSTER_CONTEXT} --hosting-cluster-nova-namespace=${NOVA_NAMESPACE}
Checking status of Nova Control Plane Components
* API Server status... Running √
* Kube-controller-manager status... Running √
* ETCD status... Running √
* Nova scheduler status... Running √
Nova Control Plane is healthy √
Checking presence of Nova Custom Resource Definitions
* Cluster CRD presence... installed √
* Cluster wlc-1 connected and ready √
* Cluster wlc-2 connected and ready √
* SchedulePolicy CRD presence... installed √
* 0 SchedulePolicies defined ‼
please create at least one SchedulePolicy, otherwise Nova does not know where to run your workloads. SchedulePolicy spec: https://docs.elotl.co
* ScheduleGroup CRD presence... installed √
All Nova Custom Resource Definitions installed √
If one of the components of the Nova Control Plane is not running, Nova cannot function properly. All of Nova's control plane components run in the elotl namespace.
To debug further, get each component's logs using the kubectl command:
kubectl logs -n elotl deploy/nova-scheduler
kubectl logs -n elotl deploy/apiserver
kubectl logs -n elotl deploy/kube-controller-manager
kubectl logs -n elotl statefulset/etcd
Your cluster does not appear in Nova Control Plane
If the Nova agent was successfully installed in the workload cluster but the cluster does not show up in the Nova Control Plane, do the following:
- Check that the Nova agent is up and running in the workload cluster. To do this, check the agent deployment:
kubectl get --context=${K8S_CLUSTER_CONTEXT_1} --namespace=${NOVA_NAMESPACE} deployment nova-agent
- If the agent install finished without issues and the agent pod is up and running, something went wrong during the agent registration process. Run the following command to get the agent logs:
kubectl get pods --context=${K8S_CLUSTER_CONTEXT_1} -n=${NOVA_NAMESPACE} -o name -l "app.kubernetes.io/name"="nova-agent" | xargs -I {} kubectl logs --context=${K8S_CLUSTER_CONTEXT_1} -n=${NOVA_NAMESPACE} {}
And start debugging from there!
Operations
My resources are in the Nova Control Plane, but not scheduled
Nova's scheduling is a multi-step process. In the first step, when you create a new resource, Nova tries to find a matching SchedulePolicy. If your resource is not scheduled, check whether it was matched to a SchedulePolicy using the following command:
kubectl get events --namespace=<resource-namespace> --field-selector=involvedObject.name=<resource-name>
If a matching SchedulePolicy is found, Nova emits a Kubernetes Event stating that the object was matched, for example:
kubectl get events --namespace=<resource-namespace> --field-selector=involvedObject.name=<resource-name>
16s Normal SchedulePolicyMatched <resource-namespace>/<resource-name> schedule policy <policy-name> will be used to determine target cluster
If no SchedulePolicy was matched, verify the following:
- Ensure your resource's Kind is supported by Nova. The Nova introduction lists the supported kinds.
- Check that you defined your SchedulePolicy with the correct namespaceSelector and resourceSelector (see the sketch after this list).
- Check that your resource is in one of the namespaces specified in the SchedulePolicy's namespaceSelector. Cluster-scoped objects are matched based on the label selector only.
- Check that your resource has labels matching the SchedulePolicy's resourceSelectors.
- Check whether your objects match more than one SchedulePolicy. In that case, Nova sorts the SchedulePolicies alphabetically and uses the first one.
- If your resource is a namespace starting with kube- or elotl, these are restricted namespaces and Nova ignores them.
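For reference, here is a minimal SchedulePolicy sketch. The apiVersion, field layout, and every label value are assumptions for illustration only; consult the SchedulePolicy spec in the Nova documentation for the authoritative schema.
kubectl --context=${NOVA_CONTROLPLANE_CONTEXT} apply -f - <<EOF
apiVersion: policy.elotl.co/v1alpha1   # assumed apiVersion; verify against your Nova version
kind: SchedulePolicy
metadata:
  name: example-policy                 # hypothetical name
spec:
  namespaceSelector:                   # namespaces whose resources this policy matches
    matchLabels:
      team: demo                       # hypothetical label
  resourceSelectors:                   # resources must carry labels matching these selectors
    labelSelectors:
      - matchLabels:
          app: demo                    # hypothetical label
  clusterSelector:                     # candidate workload clusters
    matchLabels:
      env: dev                         # hypothetical label
EOF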
Resources were created that match the SchedulePolicy, but are not running in the workload cluster
If your resource(s) are matched to a SchedulePolicy but do not transition into the Running state, there may be a few reasons:
- The SchedulePolicy has a clusterSelector that does not match any clusters. To see the workload clusters connected to Nova, run:
kubectl get clusters --show-labels
To fix this: Compare the output with your SchedulePolicy's .spec.clusterSelector. Then, edit the cluster selector so it matches one or more clusters; an example check is shown below.
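For example, to verify that a given label selector actually matches at least one connected cluster (the nodepool=gpu label is hypothetical; use a label from the --show-labels output):
kubectl get clusters -l nodepool=gpu   # hypothetical label selector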
- SchedulePolicy has a clusterSelector matching cluster(s), but there is not enough capacity on those cluster nodes to run your resource(s).
To fix this: If this is the case and you are using group scheduling, check the following:
kubectl get events --namespace=<resource-namespace> --field-selector=involvedObject.name=<resource-name>
Your resource should have an event saying:
added to ScheduleGroup <schedule-group-name> which contains objects with groupBy.labelKey <foo>=<bar>
Then, you can get the details of this ScheduleGroup using:
kubectl describe schedulegroup <schedule-group-name>
In the Events section there should be a line saying:
Normal ScheduleGroupSyncedToWorkloadCluster 8s nova-scheduler Multiple clusters matching policy <policy-name> (empty cluster selector): <cluster-names>; group policy <schedule-group-name> does not fit in any cluster;
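To confirm a capacity shortfall, you can inspect the nodes of the candidate workload cluster and compare their Allocatable and Allocated resources sections, for example:
kubectl --context=${K8S_CLUSTER_CONTEXT_1} describe nodes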
- SchedulePolicy has a clusterSelector matching cluster(s), and there is enough capacity, but the workloads cannot be created because their namespace does not exist in the workload cluster.
To fix this: You can either create the namespace manually in the workload cluster (by running
kubectl --context=workload-cluster-context create namespace <your-namespace>
) or schedule the Namespace object using Nova. Remember that a Namespace is treated like any other resource: it needs labels matching the desired SchedulePolicy's resourceSelector, as in the sketch below.
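A minimal sketch of the second option, assuming the SchedulePolicy's resourceSelector matches an app: demo label (the namespace name and label are hypothetical):
kubectl --context=${NOVA_CONTROLPLANE_CONTEXT} apply -f - <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: my-app               # hypothetical namespace name
  labels:
    app: demo                # must match your SchedulePolicy's resourceSelector
EOF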
- SchedulePolicy has a clusterSelector matching cluster(s), and there is enough capacity, but the nova-agent in the cluster is having issues.
To fix this: You can grab the nova-agent logs from this cluster by running:
kubectl logs -n elotl deploy/nova-agent
and contact the Elotl team at info@elotl.co.
Nova supports automatic re-scheduling, which happens in these cases:
- Pod(s) that were scheduled via Nova are Pending in the workload cluster because there is insufficient capacity in the cluster.
Note: This state does not occur if you scheduled a Deployment or any other pod controller via Nova.
- A Deployment that was scheduled via Nova has the following condition:
conditions:
- lastTransitionTime: <does not matter>
lastUpdateTime: <does not matter>
message: Deployment does not have minimum availability.
reason: MinimumReplicasUnavailable
status: "False"
type: Available
If you defined your SchedulePolicy with groupBy settings, Nova schedules the entire ScheduleGroup at once. If one of the deployments in the group has either of the preceding conditions, Nova reschedules the whole ScheduleGroup. In this case, Nova sends the following Kubernetes Event for the ScheduleGroup:
kubectl describe schedulegroup <my-group-name>
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning ReschedulingTriggered 3s nova-agent deployment default/nginx-group-5 does not have minimum replicas available
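To watch rescheduling as it happens, you can follow the Events emitted for ScheduleGroup objects in the Nova Control Plane, for example:
kubectl --context=${NOVA_CONTROLPLANE_CONTEXT} get events --all-namespaces --field-selector involvedObject.kind=ScheduleGroup --watch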