Version: v0.9.8

Disaster Recovery

As a multi-cluster orchestrator, Nova can help with the Day-2 orchestration needs of applications running on the fleet of clusters it manages, in addition to the Day-0 static and dynamic scheduling of workloads covered in the prior sections.

Automated disaster recovery for databases is one such need, which Nova implements through a Kubernetes-native approach. This feature lets the Nova administrator specify a sequence of steps to run when a failure is detected in the fleet. The failure could be at the cluster, zone, or region level. As soon as Nova receives an alert capturing this failure, the Nova control plane executes a Recovery Plan consisting of a sequence of recovery steps.

In the following sections, we describe the three key CRDs that Nova's disaster recovery framework is built around: ReceivedAlert, RecoveryPlan, and RecoveryRun. These CRDs work in tandem to monitor, plan, and execute disaster recovery strategies. A RecoveryPlan is a user-defined plan that is matched to an incoming ReceivedAlert; a match results in a RecoveryRun being started.

ReceivedAlert

The ReceivedAlert CRD tracks incoming alerts within the Kubernetes cluster. It functions as the initial point of contact for the Nova disaster recovery system, capturing and recording alerts that may necessitate recovery actions.

  • Kind: ReceivedAlert
  • Function: Tracks and records incoming alerts.
  • Usage: When an alert is received from an alerting system (such as Prometheus Alertmanager), a ReceivedAlert resource is created to store the details of the alert.

apiVersion: recovery.elotl.co/v1alpha1
kind: ReceivedAlert
metadata:
  name: example-received-alert
spec:
  labels:
    key: value
status:
  processed: false
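
To illustrate how an alert's labels could end up in a ReceivedAlert, the following Python sketch converts an Alertmanager-style webhook payload into manifests shaped like the example above. The payload layout (a list of `alerts`, each with `status` and `labels`) follows Alertmanager's webhook format. The function name `received_alerts_from_webhook`, the naming scheme for `metadata.name`, and the choice to skip resolved alerts are assumptions made for illustration, not Nova's documented behavior.

```python
import json


def received_alerts_from_webhook(payload):
    """Build one ReceivedAlert-shaped manifest per firing alert (illustrative only)."""
    manifests = []
    for index, alert in enumerate(payload.get("alerts", [])):
        if alert.get("status") != "firing":
            continue  # assumption: resolved alerts do not trigger recovery
        labels = dict(alert.get("labels", {}))
        # Hypothetical naming scheme: lowercased alert name plus its position in the payload.
        name = f"{labels.get('alertname', 'alert')}-{index}".lower()
        manifests.append({
            "apiVersion": "recovery.elotl.co/v1alpha1",
            "kind": "ReceivedAlert",
            "metadata": {"name": name},
            "spec": {"labels": labels},
        })
    return manifests


# Sample payload with one firing and one resolved alert (made-up data).
sample_payload = {
    "status": "firing",
    "alerts": [
        {"status": "firing",
         "labels": {"alertname": "ClusterDown", "severity": "critical", "app": "my-app"}},
        {"status": "resolved",
         "labels": {"alertname": "DiskFull", "severity": "warning"}},
    ],
}

manifests = received_alerts_from_webhook(sample_payload)
print(json.dumps(manifests, indent=2))
```

Only the firing alert produces a manifest here. The `status` block is left out on the assumption that the controller, not the sender, sets it.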

RecoveryPlan

The RecoveryPlan CRD allows users to define specific recovery strategies. It's a user-created resource that outlines the steps to be taken in response to different types of alerts.

  • Kind: RecoveryPlan
  • Function: Defines the recovery steps for different alerts.
  • Usage: Users create RecoveryPlan resources to specify how the system should respond to various alerts.

apiVersion: recovery.elotl.co/v1alpha1
kind: RecoveryPlan
metadata:
  name: example-recoveryplan
spec:
  alertLabels:
    severity: critical
    app: my-app
  steps:
    - type: readField
      readField:
        apiVersion: v1
        resource: pods
        namespace: default
        name: my-pod
        fieldPath: spec.containers[0].image
        outputKey: podImage
      title: Read Pod Image
    - type: patch
      patch:
        apiVersion: v1
        resource: pods
        namespace: default
        name: my-pod
        override:
          fieldPath: spec.containers[1].image
          value:
            raw: "{{ .Values.podImage }}"
        patchType: "application/json-patch+json"
      title: Update Second Container Image
    - type: job
      job:
        template:
          spec:
            containers:
              - name: update-notification
                image: busybox
                command: ['sh', '-c', 'echo Updating to image {{ .Values.podImage }}']
            restartPolicy: OnFailure
      title: Notify Image Update
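
The plan above passes data between steps: the readField step stores a value under its outputKey, and later steps reference it as `{{ .Values.podImage }}`. The Python sketch below mimics that data flow on plain dictionaries. The helpers `read_field` and `render` are hypothetical, and the regular expressions handle only the simple forms seen in this example (one optional list index per path segment, and the bare `{{ .Values.<key> }}` placeholder). This is not Nova's template engine.

```python
import re

# Splits "spec.containers[0].image" into ("spec", ""), ("containers", "0"), ("image", "").
_SEGMENT = re.compile(r"([^.\[\]]+)(?:\[(\d+)\])?")
# Matches only the simple "{{ .Values.key }}" form used in the example plan.
_PLACEHOLDER = re.compile(r"\{\{\s*\.Values\.(\w+)\s*\}\}")


def read_field(obj, field_path):
    """Resolve a dotted path with optional list indexes against nested dicts and lists."""
    current = obj
    for key, index in _SEGMENT.findall(field_path):
        current = current[key]
        if index:
            current = current[int(index)]
    return current


def render(template, values):
    """Replace each {{ .Values.key }} placeholder with the stored value."""
    return _PLACEHOLDER.sub(lambda match: str(values[match.group(1)]), template)


# A stand-in for the pod object the plan reads from (made-up data).
pod = {"spec": {"containers": [{"image": "example-image:v1"}, {"image": "sidecar:v0"}]}}

# Step 1 (readField): store the first container's image under the output key.
values = {"podImage": read_field(pod, "spec.containers[0].image")}

# Step 2 (patch): the override value is rendered from the stored values.
patched_image = render("{{ .Values.podImage }}", values)

# Step 3 (job): the same placeholder is rendered inside the job's command string.
message = render("echo Updating to image {{ .Values.podImage }}", values)

print(patched_image)
print(message)
```

A missing key raises `KeyError` in this sketch, which makes a misspelled outputKey fail loudly rather than silently producing an empty string.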

RecoveryRun

The RecoveryRun CRD represents the execution of a disaster recovery process. It is triggered based on a matching RecoveryPlan for a given ReceivedAlert.

  • Kind: RecoveryRun
  • Function: Manages and tracks the execution of recovery processes.
  • Usage: When a ReceivedAlert matches a RecoveryPlan, a RecoveryRun is initiated to carry out the specified recovery steps.

apiVersion: recovery.elotl.co/v1alpha1
kind: RecoveryRun
metadata:
  name: example-recoveryrun
spec:
  alertSpec:
    name: critical-alert
    labels:
      severity: critical
      app: my-app
  steps:
    - type: readField
      readField:
        apiVersion: v1
        resource: pods
        namespace: default
        name: my-pod
        fieldPath: spec.containers[0].image
        outputKey: podImage
    - type: patch
      patch:
        apiVersion: v1
        resource: pods
        namespace: default
        name: my-pod
        override:
          fieldPath: spec.containers[1].image
          value:
            raw: "{{ .Values.podImage }}"
        patchType: "application/json-patch+json"
    - type: job
      job:
        template:
          spec:
            containers:
              - name: update-notification
                image: busybox
                command: ['sh', '-c', 'echo Updating to image {{ .Values.podImage }}']
            restartPolicy: OnFailure
status:
  phase: Completed
  values:
    podImage: "example-image:v1"
  conditions:
    - type: Success
      status: "True"
      lastProbeTime: "2024-01-03T12:34:56Z"
      lastTransitionTime: "2024-01-03T12:34:56Z"
      reason: Completed
      message: Recovery step completed successfully
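
A script that needs to know whether a run finished could inspect the status fields shown above. The sketch below is a minimal Python check on a RecoveryRun loaded as a dictionary. Treating `phase: Completed` together with a `Success` condition whose status is `"True"` as the success criterion is an assumption drawn from this one example, and `run_succeeded` is a hypothetical helper, not part of Nova.

```python
def run_succeeded(recovery_run):
    """Return True if the run reports phase Completed and a true Success condition."""
    status = recovery_run.get("status", {})
    if status.get("phase") != "Completed":
        return False
    return any(
        condition.get("type") == "Success" and condition.get("status") == "True"
        for condition in status.get("conditions", [])
    )


# Status copied from the example RecoveryRun above (spec omitted for brevity).
example_run = {
    "kind": "RecoveryRun",
    "metadata": {"name": "example-recoveryrun"},
    "status": {
        "phase": "Completed",
        "values": {"podImage": "example-image:v1"},
        "conditions": [
            {"type": "Success", "status": "True", "reason": "Completed"},
        ],
    },
}

# A run whose status has not been populated yet.
pending_run = {"kind": "RecoveryRun", "metadata": {"name": "pending"}, "status": {}}

print(run_succeeded(example_run), run_succeeded(pending_run))
```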

Workflow

  1. Alert Reception: A ReceivedAlert is created when an alert is received.
  2. Plan Matching: The system matches the ReceivedAlert to an appropriate RecoveryPlan.
  3. Recovery Execution: A RecoveryRun is initiated to execute the recovery steps defined in the matched RecoveryPlan.
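
The three workflow steps can be sketched in Python on plain dictionaries. The matching rule used here, that every key/value pair in a plan's `alertLabels` must appear in the alert's labels, is an assumption; this page only states that a plan is matched to an alert. The function names and the generated run name are hypothetical.

```python
def plan_matches(plan, alert):
    """Assumed rule: all of the plan's alertLabels must be present on the alert."""
    wanted = plan["spec"].get("alertLabels", {})
    labels = alert["spec"].get("labels", {})
    # Note: under this rule a plan with no alertLabels matches every alert.
    return all(labels.get(key) == value for key, value in wanted.items())


def start_recovery_run(alert, plans):
    """Return a RecoveryRun-shaped dict for the first matching plan, or None."""
    for plan in plans:
        if plan_matches(plan, alert):
            # Copy the steps without their titles, mirroring the RecoveryRun example.
            steps = [
                {key: value for key, value in step.items() if key != "title"}
                for step in plan["spec"]["steps"]
            ]
            return {
                "apiVersion": "recovery.elotl.co/v1alpha1",
                "kind": "RecoveryRun",
                "metadata": {"name": f"{plan['metadata']['name']}-run"},  # hypothetical name
                "spec": {
                    "alertSpec": {
                        "name": alert["metadata"]["name"],
                        "labels": dict(alert["spec"]["labels"]),
                    },
                    "steps": steps,
                },
            }
    return None


# 1. Alert Reception: a ReceivedAlert carrying the alert's labels.
alert = {
    "kind": "ReceivedAlert",
    "metadata": {"name": "critical-alert"},
    "spec": {"labels": {"severity": "critical", "app": "my-app", "region": "us-east"}},
}

# 2. Plan Matching: one plan that does not match and one that does.
other_plan = {
    "metadata": {"name": "other-plan"},
    "spec": {"alertLabels": {"app": "other-app"}, "steps": []},
}
matching_plan = {
    "metadata": {"name": "example-recoveryplan"},
    "spec": {
        "alertLabels": {"severity": "critical", "app": "my-app"},
        "steps": [{"type": "readField", "title": "Read Pod Image"}],
    },
}

# 3. Recovery Execution: the run that would be created for the matched plan.
run = start_recovery_run(alert, [other_plan, matching_plan])
print(run["metadata"]["name"], run["spec"]["steps"])
```

Extra labels on the alert (such as `region` here) do not prevent a match under this rule; only the labels the plan names are compared.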

This framework ensures a streamlined and flexible approach to managing disaster recovery in Kubernetes environments.