Version: v1.4

Recovery Framework

Nova provides a Kubernetes-native framework for defining and executing recovery workflows across clusters.

The framework supports disaster recovery, failover, and other automated operational workflows. Recovery behavior is defined in Kubernetes resources and executed by the Nova control plane.

Overview

The recovery framework is built around three resources:

  • ReceivedAlert – captures an incoming alert
  • RecoveryPlan – defines the recovery steps to run for matching alerts
  • RecoveryRun – represents an execution of a recovery plan

How It Works

  1. An external alerting system sends an alert to Nova
  2. Nova records the alert as a ReceivedAlert
  3. Nova matches the alert to a RecoveryPlan
  4. Nova creates a RecoveryRun
  5. The recovery steps defined in the matching plan are executed

ReceivedAlert

A ReceivedAlert represents an alert received by Nova from an external alerting system, such as Prometheus Alertmanager.

It serves as the entry point into the recovery workflow.
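
To make the shape of this entry point concrete, the following is a minimal sketch of how the alerts in a Prometheus Alertmanager webhook payload could be mapped onto ReceivedAlert-shaped manifests. Nova performs this step itself; the function name, the naming scheme, and the choice to skip resolved alerts are assumptions made for illustration, not Nova's documented behavior.

```python
from typing import Any


def received_alerts_from_webhook(payload: dict[str, Any]) -> list[dict[str, Any]]:
    """Build one ReceivedAlert-shaped manifest per firing alert in the payload.

    Hypothetical helper: the mapping below is an assumption, not Nova's code.
    """
    manifests = []
    for index, alert in enumerate(payload.get("alerts", [])):
        if alert.get("status") != "firing":
            continue  # assumption: resolved alerts do not start a recovery
        labels = dict(alert.get("labels", {}))
        # Assumed naming scheme: lowercased alertname plus position in the payload.
        name = f"{labels.get('alertname', 'alert').lower()}-{index}"
        manifests.append(
            {
                "apiVersion": "recovery.elotl.co/v1alpha1",
                "kind": "ReceivedAlert",
                "metadata": {"name": name},
                "spec": {"labels": labels},
            }
        )
    return manifests
```

The alert labels are copied unchanged into spec.labels, since labels are what a RecoveryPlan is later matched against.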

Example

apiVersion: recovery.elotl.co/v1alpha1
kind: ReceivedAlert
metadata:
  name: example-received-alert
spec:
  labels:
    severity: critical
    app: my-app
status:
  processed: false

RecoveryPlan

A RecoveryPlan defines the recovery steps Nova should run when an incoming alert matches the plan.

Recovery plans allow recovery logic to be expressed declaratively. A plan can include multiple steps, such as reading fields from existing Kubernetes resources, patching resources, or running a Kubernetes Job.
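
A step that reads a field has to resolve a path such as spec.containers[0].image against a resource. The sketch below shows one way such a lookup can work on a resource represented as nested dictionaries and lists. The function read_field is hypothetical, and the path grammar it accepts (dot-separated keys with optional [N] list indices) is an assumption based on the example path, not a description of Nova's parser.

```python
import re
from typing import Any

# One path segment is either a key (no dots or brackets) or a list index like [0].
_SEGMENT = re.compile(r"([^.\[\]]+)|\[(\d+)\]")


def read_field(resource: dict[str, Any], field_path: str) -> Any:
    """Resolve a dotted path with optional list indices against a nested structure.

    Raises KeyError or IndexError if any segment of the path is missing.
    """
    current: Any = resource
    for key, index in _SEGMENT.findall(field_path):
        current = current[key] if key else current[int(index)]
    return current


# Usage: store the result under an output key so later steps can refer to it.
pod = {"spec": {"containers": [{"name": "main", "image": "example-image:v1"}]}}
values = {"podImage": read_field(pod, "spec.containers[0].image")}
print(values)  # {'podImage': 'example-image:v1'}
```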

Example

apiVersion: recovery.elotl.co/v1alpha1
kind: RecoveryPlan
metadata:
  name: example-recoveryplan
spec:
  alertLabels:
    severity: critical
    app: my-app
  steps:
    - type: readField
      readField:
        apiVersion: v1
        resource: pods
        namespace: default
        name: my-pod
        fieldPath: spec.containers[0].image
        outputKey: podImage
      title: Read Pod Image
    - type: patch
      patch:
        apiVersion: v1
        resource: pods
        namespace: default
        name: my-pod
        override:
          fieldPath: spec.containers[1].image
          value:
            raw: "{{ .Values.podImage }}"
        patchType: "application/json-patch+json"
      title: Update Second Container Image
    - type: job
      job:
        template:
          spec:
            containers:
              - name: update-notification
                image: busybox
                command:
                  - sh
                  - -c
                  - echo Updating to image {{ .Values.podImage }}
            restartPolicy: OnFailure
      title: Notify Image Update
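
In this example, the readField step stores its result under the output key podImage, and the later steps refer to it as {{ .Values.podImage }}. The placeholder syntax resembles Go's text/template, but this page does not say which template engine Nova uses. The sketch below therefore only illustrates the substitution for the simple {{ .Values.<key> }} form; the render function is hypothetical and is not Nova's implementation.

```python
import re

# Matches only the simple form {{ .Values.<key> }}; no other template features.
_PLACEHOLDER = re.compile(r"\{\{\s*\.Values\.([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def render(text: str, values: dict[str, str]) -> str:
    """Replace each {{ .Values.<key> }} placeholder with values[<key>]."""

    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in values:
            # Assumption: a missing value is an error rather than an empty string.
            raise KeyError(f"no value recorded for {key!r}")
        return values[key]

    return _PLACEHOLDER.sub(substitute, text)


values = {"podImage": "example-image:v1"}
print(render("echo Updating to image {{ .Values.podImage }}", values))
# echo Updating to image example-image:v1
```

Failing on an unknown key, rather than substituting an empty string, keeps a mistyped output key from silently producing an empty image name in a patch.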

RecoveryRun

A RecoveryRun represents the execution of a RecoveryPlan.

When a ReceivedAlert matches a RecoveryPlan, Nova creates a RecoveryRun to track the execution of the recovery workflow.

Example

apiVersion: recovery.elotl.co/v1alpha1
kind: RecoveryRun
metadata:
  name: example-recoveryrun
spec:
  alertSpec:
    name: critical-alert
    labels:
      severity: critical
      app: my-app
  steps:
    - type: readField
      readField:
        apiVersion: v1
        resource: pods
        namespace: default
        name: my-pod
        fieldPath: spec.containers[0].image
        outputKey: podImage
    - type: patch
      patch:
        apiVersion: v1
        resource: pods
        namespace: default
        name: my-pod
        override:
          fieldPath: spec.containers[1].image
          value:
            raw: "{{ .Values.podImage }}"
        patchType: "application/json-patch+json"
    - type: job
      job:
        template:
          spec:
            containers:
              - name: update-notification
                image: busybox
                command:
                  - sh
                  - -c
                  - echo Updating to image {{ .Values.podImage }}
            restartPolicy: OnFailure
status:
  phase: Completed
  values:
    podImage: example-image:v1
  conditions:
    - type: Success
      status: "True"
      lastProbeTime: "2026-01-03T12:34:56Z"
      lastTransitionTime: "2026-01-03T12:34:56Z"
      reason: Completed
      message: Recovery step completed successfully
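
The status block in this example can be checked programmatically. The sketch below assumes the RecoveryRun has already been loaded into a Python dictionary (for example, from parsed YAML or JSON) and uses only the fields shown in the example. The function run_succeeded is hypothetical, and treating phase Completed together with a Success condition of "True" as the success criterion is an assumption.

```python
from typing import Any


def run_succeeded(run: dict[str, Any]) -> bool:
    """Return True if the run is Completed and reports a Success condition."""
    status = run.get("status", {})
    if status.get("phase") != "Completed":
        return False
    # Condition status values are strings ("True"/"False"), as in the example.
    return any(
        cond.get("type") == "Success" and cond.get("status") == "True"
        for cond in status.get("conditions", [])
    )


example_run = {
    "kind": "RecoveryRun",
    "status": {
        "phase": "Completed",
        "values": {"podImage": "example-image:v1"},
        "conditions": [{"type": "Success", "status": "True"}],
    },
}
print(run_succeeded(example_run))  # True
```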

Workflow Summary

  1. Alert Reception – A ReceivedAlert is created when Nova receives an alert.
  2. Plan Matching – Nova compares alert labels against available RecoveryPlan resources.
  3. Recovery Execution – A RecoveryRun is created and executed.
  4. Status Tracking – Execution progress and results are recorded in the RecoveryRun status (for example, its phase, values, and conditions).
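
The matching and run-creation steps above can be sketched as follows. This page does not spell out the matching rule, so the sketch assumes a plan matches when every label in its alertLabels appears in the alert with the same value, and that extra alert labels are ignored. The functions plan_matches and build_run and the run naming scheme are hypothetical.

```python
import copy
from typing import Any


def plan_matches(plan: dict[str, Any], alert_labels: dict[str, str]) -> bool:
    """Assumed rule: all of the plan's alertLabels must be present in the alert."""
    wanted = plan["spec"].get("alertLabels", {})
    return all(alert_labels.get(key) == value for key, value in wanted.items())


def build_run(alert: dict[str, Any], plan: dict[str, Any]) -> dict[str, Any]:
    """Assemble a RecoveryRun-shaped manifest from an alert and a matching plan."""
    return {
        "apiVersion": "recovery.elotl.co/v1alpha1",
        "kind": "RecoveryRun",
        # Assumed naming scheme, for illustration only.
        "metadata": {"name": f"{plan['metadata']['name']}-run"},
        "spec": {
            "alertSpec": {
                "name": alert["metadata"]["name"],
                "labels": dict(alert["spec"]["labels"]),
            },
            # Steps are copied verbatim so the run does not share state with the plan.
            "steps": copy.deepcopy(plan["spec"]["steps"]),
        },
    }
```

Under this rule a plan with no alertLabels would match every alert, so a real implementation would likely need to treat that case explicitly.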

Considerations

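  • Recovery workflows are triggered by alerts from an external alerting system, so a recovery starts only if that system detects the problem and delivers the alert to Nova.
  • Running a recovery plan does not by itself guarantee application-level recovery, which also depends on storage, networking, and workload readiness.
  • Test plans in a non-production environment before relying on them in production.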