Recovery Framework
Nova provides a Kubernetes-native framework for defining and executing recovery workflows across clusters.
This framework is used to support disaster recovery, failover, and other automated operational workflows. Recovery behavior is defined using Kubernetes resources and executed by the Nova control plane.
Overview
The recovery framework is built around three resources:
- ReceivedAlert – captures an incoming alert
- RecoveryPlan – defines the recovery steps to run for matching alerts
- RecoveryRun – represents an execution of a recovery plan
How It Works
- An external alerting system sends an alert to Nova
- Nova records the alert as a ReceivedAlert
- Nova matches the alert to a RecoveryPlan
- Nova creates a RecoveryRun
- The recovery steps defined in the matching plan are executed
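The flow above can be modeled in a few lines of Python. This is only an illustrative in-memory sketch, not Nova's implementation: the class fields, the `handle_alert` helper, the subset-style label matching, and the choice to create one run per matching plan are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ReceivedAlert:
    name: str
    labels: dict
    processed: bool = False


@dataclass
class RecoveryPlan:
    name: str
    alert_labels: dict
    steps: list


@dataclass
class RecoveryRun:
    plan_name: str
    alert_name: str
    steps: list


def plan_matches(plan, alert):
    # Assumption: a plan matches when every label it lists is present
    # on the alert with the same value (extra alert labels are ignored).
    return all(alert.labels.get(k) == v for k, v in plan.alert_labels.items())


def handle_alert(name, labels, plans):
    """Record the alert, match it against the plans, and create runs."""
    alert = ReceivedAlert(name=name, labels=dict(labels))
    # Assumption: one run is created per matching plan, copying its steps.
    runs = [
        RecoveryRun(plan_name=p.name, alert_name=alert.name, steps=list(p.steps))
        for p in plans
        if plan_matches(p, alert)
    ]
    alert.processed = True
    return alert, runs
```

Calling `handle_alert("a1", {"severity": "critical", "app": "my-app"}, plans)` returns the recorded alert together with a run for each plan whose labels all appear on the alert.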
ReceivedAlert
A ReceivedAlert represents an alert received by Nova from an external alerting system, such as Prometheus Alertmanager.
It serves as the entry point into the recovery workflow.
Example
apiVersion: recovery.elotl.co/v1alpha1
kind: ReceivedAlert
metadata:
  name: example-received-alert
spec:
  labels:
    severity: critical
    app: my-app
status:
  processed: false
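To make the relationship between an incoming alert and a ReceivedAlert concrete, the sketch below turns an Alertmanager-style webhook payload into ReceivedAlert manifests shaped like the example above. How Nova performs this translation is not described here, so the function name, the naming scheme for `metadata.name`, and the decision to skip alerts that are not firing are all assumptions.

```python
def received_alerts_from_webhook(payload):
    """Build one ReceivedAlert manifest (as a dict) per firing alert."""
    manifests = []
    for index, alert in enumerate(payload.get("alerts", [])):
        # Assumption: only firing alerts should start a recovery workflow.
        if alert.get("status") != "firing":
            continue
        labels = alert.get("labels", {})
        # Hypothetical naming scheme: lowercased alertname plus its position.
        name = f"{labels.get('alertname', 'alert').lower()}-{index}"
        manifests.append(
            {
                "apiVersion": "recovery.elotl.co/v1alpha1",
                "kind": "ReceivedAlert",
                "metadata": {"name": name},
                "spec": {"labels": dict(labels)},
            }
        )
    return manifests
```

The alert labels are copied unchanged into `spec.labels`, which is the field a plan's label selector would be compared against.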
RecoveryPlan
A RecoveryPlan defines the recovery steps Nova should run when an incoming alert matches the plan.
Recovery plans allow recovery logic to be expressed declaratively. A plan can include multiple steps, such as reading fields from existing Kubernetes resources, patching resources, or running a Kubernetes Job.
Example
apiVersion: recovery.elotl.co/v1alpha1
kind: RecoveryPlan
metadata:
  name: example-recoveryplan
spec:
  alertLabels:
    severity: critical
    app: my-app
  steps:
    - type: readField
      readField:
        apiVersion: v1
        resource: pods
        namespace: default
        name: my-pod
        fieldPath: spec.containers[0].image
        outputKey: podImage
      title: Read Pod Image
    - type: patch
      patch:
        apiVersion: v1
        resource: pods
        namespace: default
        name: my-pod
        override:
          fieldPath: spec.containers[1].image
          value:
            raw: "{{ .Values.podImage }}"
        patchType: "application/json-patch+json"
      title: Update Second Container Image
    - type: job
      job:
        template:
          spec:
            containers:
              - name: update-notification
                image: busybox
                command:
                  - sh
                  - -c
                  - echo Updating to image {{ .Values.podImage }}
            restartPolicy: OnFailure
      title: Notify Image Update
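The plan above passes a value from one step to the next: the readField step stores the first container's image under `podImage`, and later steps refer to it as `{{ .Values.podImage }}`. The following sketch imitates that hand-off on plain Python dictionaries. It is not Nova's template engine; the `read_field` and `render` helpers and the path and placeholder grammar they accept are assumptions that cover only the forms shown in the example.

```python
import re

# A path segment is either a plain key ("spec") or a list index ("[0]").
_SEGMENT = re.compile(r"([^.\[\]]+)|\[(\d+)\]")
# Matches placeholders of the form "{{ .Values.<key> }}".
_TEMPLATE = re.compile(r"\{\{\s*\.Values\.(\w+)\s*\}\}")


def read_field(obj, field_path):
    """Resolve a path such as 'spec.containers[0].image' against a dict."""
    current = obj
    for key, index in _SEGMENT.findall(field_path):
        current = current[int(index)] if index else current[key]
    return current


def render(text, values):
    """Replace each '{{ .Values.<key> }}' placeholder with values[<key>]."""
    return _TEMPLATE.sub(lambda match: str(values[match.group(1)]), text)


# Stand-in for the pod the plan reads from.
pod = {"spec": {"containers": [{"image": "example-image:v1"}, {"image": "old:v0"}]}}

# Step 1 (readField): store the first container's image under "podImage".
values = {"podImage": read_field(pod, "spec.containers[0].image")}

# Step 3 (job): the command line with the placeholder filled in.
command = render("echo Updating to image {{ .Values.podImage }}", values)
```

In this sketch an unknown key raises `KeyError` instead of rendering an empty string, so a misspelled reference fails loudly.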
RecoveryRun
A RecoveryRun represents the execution of a RecoveryPlan.
When a ReceivedAlert matches a RecoveryPlan, Nova creates a RecoveryRun to track the execution of the recovery workflow.
Example
apiVersion: recovery.elotl.co/v1alpha1
kind: RecoveryRun
metadata:
  name: example-recoveryrun
spec:
  alertSpec:
    name: critical-alert
    labels:
      severity: critical
      app: my-app
  steps:
    - type: readField
      readField:
        apiVersion: v1
        resource: pods
        namespace: default
        name: my-pod
        fieldPath: spec.containers[0].image
        outputKey: podImage
    - type: patch
      patch:
        apiVersion: v1
        resource: pods
        namespace: default
        name: my-pod
        override:
          fieldPath: spec.containers[1].image
          value:
            raw: "{{ .Values.podImage }}"
        patchType: "application/json-patch+json"
    - type: job
      job:
        template:
          spec:
            containers:
              - name: update-notification
                image: busybox
                command:
                  - sh
                  - -c
                  - echo Updating to image {{ .Values.podImage }}
            restartPolicy: OnFailure
status:
  phase: Completed
  values:
    podImage: example-image:v1
  conditions:
    - type: Success
      status: "True"
      lastProbeTime: "2026-01-03T12:34:56Z"
      lastTransitionTime: "2026-01-03T12:34:56Z"
      reason: Completed
      message: Recovery step completed successfully
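A script that consumes RecoveryRun objects might reduce the status block above to a single yes-or-no answer. The helper below is a hedged sketch that relies only on the values visible in the example (`phase: Completed` and a `Success` condition with status `"True"`); other phases or condition types may exist but are not documented here, and `run_succeeded` is a hypothetical name.

```python
def run_succeeded(run):
    """Return True if a RecoveryRun dict reports a completed, successful run."""
    status = run.get("status", {})
    # Assumption: "Completed" is the terminal phase of a finished run.
    if status.get("phase") != "Completed":
        return False
    # Kubernetes-style conditions carry their status as the strings
    # "True", "False", or "Unknown", not as booleans.
    return any(
        cond.get("type") == "Success" and cond.get("status") == "True"
        for cond in status.get("conditions", [])
    )
```

The `run` argument is the object parsed into a dictionary, for example from its JSON or YAML representation.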
Workflow Summary
- Alert Reception – A ReceivedAlert is created when Nova receives an alert.
- Plan Matching – Nova compares alert labels against available RecoveryPlan resources.
- Recovery Execution – A RecoveryRun is created and executed.
- Status Tracking – Execution progress and results are tracked.
Considerations
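- Recovery workflows are triggered by alerts from an external alerting system. If an alert never reaches Nova, no RecoveryRun is created.
- Running the recovery steps does not by itself guarantee that an application recovers. Application-level recovery also depends on storage, networking, and workload readiness.
- Test plans in a non-production environment before relying on them in production.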
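One inexpensive pre-production check is a static pass over a plan's steps that flags any `{{ .Values.<key> }}` reference not produced by an earlier readField step. The sketch below works on a plan's `steps` list loaded as Python dictionaries. It assumes steps run in order and that a readField `outputKey` is the only way a value gets defined; neither assumption is confirmed here, and `undefined_references` is a hypothetical helper, not a Nova tool.

```python
import re

# Matches placeholders of the form "{{ .Values.<key> }}".
_REF = re.compile(r"\{\{\s*\.Values\.(\w+)\s*\}\}")


def _strings(node):
    """Yield every string found anywhere inside nested dicts and lists."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from _strings(value)
    elif isinstance(node, list):
        for value in node:
            yield from _strings(value)


def undefined_references(steps):
    """Return (step_index, key) pairs for values used before they are set."""
    defined = set()
    problems = []
    for index, step in enumerate(steps):
        # Check references against keys defined by earlier steps only.
        for text in _strings(step):
            for key in _REF.findall(text):
                if key not in defined:
                    problems.append((index, key))
        # Assumption: only a readField step's outputKey defines a value.
        output_key = step.get("readField", {}).get("outputKey")
        if output_key:
            defined.add(output_key)
    return problems
```

An empty result means every reference is backed by an earlier readField step. It says nothing about whether the referenced resources exist or whether the patch applies cleanly, so it complements a trial run in a test cluster and does not replace one.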