Disaster Recovery
As a multi-cluster orchestrator, Nova can help with the Day-2 orchestration needs of applications running on the fleet of clusters it manages, in addition to the Day-0 static and dynamic scheduling of workloads covered in the prior sections.
Automating disaster recovery for databases is one such need, and Nova implements it through a Kubernetes-native approach. This feature lets the Nova administrator specify a sequence of steps to be carried out when a failure is detected in the fleet. Such a failure could be a cluster-level, zone, or region failure. As soon as Nova receives an alert capturing this failure, the Nova control plane executes a Recovery Plan consisting of a sequence of recovery steps.
In the following sections, we describe the three key CRDs that Nova's disaster recovery framework is built around: ReceivedAlert, RecoveryPlan, and RecoveryRun. These CRDs work in tandem to monitor, plan, and execute disaster recovery strategies. A RecoveryPlan is a user-defined plan that is matched against an incoming ReceivedAlert; a match results in a RecoveryRun being started.
ReceivedAlert
The ReceivedAlert CRD tracks incoming alerts within the Kubernetes cluster. It functions as the initial point of contact for the Nova disaster recovery system, capturing and recording alerts that may necessitate recovery actions.
- Kind: ReceivedAlert
- Function: Tracks and records incoming alerts.
- Usage: When an alert from an alerting system (such as Prometheus Alertmanager) is received, a ReceivedAlert resource is created, storing the details of the alert.
apiVersion: recovery.elotl.co/v1alpha1
kind: ReceivedAlert
metadata:
  name: example-received-alert
spec:
  labels:
    key: value
status:
  processed: false
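As a rough illustration of how an alert's details could map onto this resource, the sketch below turns one alert from an Alertmanager-style webhook payload into a ReceivedAlert manifest. The function name, the generated resource name, and the choice to copy the alert's labels verbatim into spec.labels are assumptions made for illustration; they are not Nova's documented behavior.

```python
import hashlib
import json


def alert_to_received_alert(alert: dict) -> dict:
    """Build a ReceivedAlert manifest from one alert (hypothetical mapping)."""
    labels = dict(alert.get("labels", {}))
    # Assumed naming scheme: a short, stable hash of the label set.
    digest = hashlib.sha256(
        json.dumps(labels, sort_keys=True).encode()
    ).hexdigest()[:10]
    return {
        "apiVersion": "recovery.elotl.co/v1alpha1",
        "kind": "ReceivedAlert",
        "metadata": {"name": f"received-alert-{digest}"},
        "spec": {"labels": labels},
    }


# Trimmed-down webhook payload: a list of alerts, each carrying labels.
payload = {
    "status": "firing",
    "alerts": [
        {"status": "firing", "labels": {"severity": "critical", "app": "my-app"}},
    ],
}
manifests = [alert_to_received_alert(a) for a in payload["alerts"]]
```

Hashing the sorted labels keeps the name deterministic, so the same alert always maps to the same resource name in this sketch.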
RecoveryPlan
The RecoveryPlan CRD allows users to define specific recovery strategies. It's a user-created resource that outlines the steps to be taken in response to different types of alerts.
- Kind: RecoveryPlan
- Function: Defines the recovery steps for different alerts.
- Usage: Users create RecoveryPlan resources to specify how the system should respond to various alerts.
apiVersion: recovery.elotl.co/v1alpha1
kind: RecoveryPlan
metadata:
  name: example-recoveryplan
spec:
  alertLabels:
    severity: critical
    app: my-app
  steps:
    - type: readField
      readField:
        apiVersion: v1
        resource: pods
        namespace: default
        name: my-pod
        fieldPath: spec.containers[0].image
        outputKey: podImage
      title: Read Pod Image
    - type: patch
      patch:
        apiVersion: v1
        resource: pods
        namespace: default
        name: my-pod
        override:
          fieldPath: spec.containers[1].image
          value:
            raw: "{{ .Values.podImage }}"
        patchType: "application/json-patch+json"
      title: Update Second Container Image
    - type: job
      job:
        template:
          spec:
            containers:
              - name: update-notification
                image: busybox
                command: ['sh', '-c', 'echo Updating to image {{ .Values.podImage }}']
            restartPolicy: OnFailure
      title: Notify Image Update
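To make the readField step more concrete, here is a minimal sketch of how a fieldPath such as spec.containers[0].image could be resolved against an object and stored under the step's outputKey. The read_field helper and the values dictionary are hypothetical names used only for this example; how Nova actually evaluates field paths is not described here.

```python
import re

# Matches either a plain key (e.g. "spec") or a list index (e.g. "[0]").
_TOKEN = re.compile(r"([^.\[\]]+)|\[(\d+)\]")


def read_field(obj, field_path: str):
    """Resolve a dotted path with optional [n] indexes against dicts and lists."""
    current = obj
    for key, index in _TOKEN.findall(field_path):
        current = current[key] if key else current[int(index)]
    return current


# Stand-in for the pod the plan reads from.
pod = {
    "spec": {
        "containers": [
            {"name": "main", "image": "example-image:v1"},
            {"name": "sidecar", "image": "example-image:v0"},
        ]
    }
}

# Store the result under the step's outputKey so later steps can reference it.
values = {}
values["podImage"] = read_field(pod, "spec.containers[0].image")
```

A missing key or out-of-range index raises KeyError or IndexError in this sketch, which keeps a bad path from silently producing an empty value.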
RecoveryRun
The RecoveryRun CRD represents the execution of a disaster recovery process. It is triggered based on a matching RecoveryPlan for a given ReceivedAlert.
- Kind: RecoveryRun
- Function: Manages and tracks the execution of recovery processes.
- Usage: When a ReceivedAlert matches a RecoveryPlan, a RecoveryRun is initiated to carry out the specified recovery steps.
apiVersion: recovery.elotl.co/v1alpha1
kind: RecoveryRun
metadata:
  name: example-recoveryrun
spec:
  alertSpec:
    name: critical-alert
    labels:
      severity: critical
      app: my-app
  steps:
    - type: readField
      readField:
        apiVersion: v1
        resource: pods
        namespace: default
        name: my-pod
        fieldPath: spec.containers[0].image
        outputKey: podImage
    - type: patch
      patch:
        apiVersion: v1
        resource: pods
        namespace: default
        name: my-pod
        override:
          fieldPath: spec.containers[1].image
          value:
            raw: "{{ .Values.podImage }}"
        patchType: "application/json-patch+json"
    - type: job
      job:
        template:
          spec:
            containers:
              - name: update-notification
                image: busybox
                command: ['sh', '-c', 'echo Updating to image {{ .Values.podImage }}']
            restartPolicy: OnFailure
status:
  phase: Completed
  values:
    podImage: "example-image:v1"
  conditions:
    - type: Success
      status: "True"
      lastProbeTime: "2024-01-03T12:34:56Z"
      lastTransitionTime: "2024-01-03T12:34:56Z"
      reason: Completed
      message: Recovery step completed successfully
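The example above records podImage under status.values, and later steps refer to it as {{ .Values.podImage }}. The sketch below shows one way such placeholders could be filled in from the recorded values. It is a simplified stand-in: the render helper is hypothetical, it handles only the plain {{ .Values.<key> }} form, and it should not be read as Nova's actual templating mechanism.

```python
import re

# Matches "{{ .Values.someKey }}" with optional inner whitespace.
_PLACEHOLDER = re.compile(r"\{\{\s*\.Values\.(\w+)\s*\}\}")


def render(template: str, values: dict) -> str:
    """Replace each placeholder with its stored value; raises KeyError if absent."""
    return _PLACEHOLDER.sub(lambda m: str(values[m.group(1)]), template)


# Values as recorded in the run's status.values.
values = {"podImage": "example-image:v1"}

patched_image = render("{{ .Values.podImage }}", values)
command = render("echo Updating to image {{ .Values.podImage }}", values)
```

Failing loudly on an unknown key is a deliberate choice here: a step that depends on a value that was never read should not run with an empty string.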
Workflow
- Alert Reception: A ReceivedAlert is created when an alert is received.
- Plan Matching: The system matches the ReceivedAlert to an appropriate RecoveryPlan.
- Recovery Execution: A RecoveryRun is initiated to execute the recovery steps defined in the matched RecoveryPlan.
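The three workflow steps can be sketched in a few lines. This example assumes that a plan matches when every entry in its alertLabels appears on the alert with the same value, and that the first matching plan wins; both rules, along with the plan_matches and start_run helper names, are assumptions for illustration rather than documented Nova semantics.

```python
def plan_matches(alert_labels: dict, plan_labels: dict) -> bool:
    """True when every label the plan requires is on the alert with the same value."""
    return all(alert_labels.get(k) == v for k, v in plan_labels.items())


def start_run(alert: dict, plans: list):
    """Return a RecoveryRun-like dict for the first matching plan, or None."""
    labels = alert["spec"]["labels"]
    for plan in plans:
        if plan_matches(labels, plan["spec"]["alertLabels"]):
            return {
                "apiVersion": "recovery.elotl.co/v1alpha1",
                "kind": "RecoveryRun",
                "spec": {
                    "alertSpec": {"name": alert["metadata"]["name"], "labels": labels},
                    # Simplification: steps are copied from the plan unchanged.
                    "steps": plan["spec"]["steps"],
                },
            }
    return None


alert = {
    "metadata": {"name": "critical-alert"},
    "spec": {"labels": {"severity": "critical", "app": "my-app", "zone": "a"}},
}
plans = [
    {"spec": {"alertLabels": {"app": "other-app"}, "steps": [{"type": "job"}]}},
    {
        "spec": {
            "alertLabels": {"severity": "critical", "app": "my-app"},
            "steps": [{"type": "readField"}, {"type": "patch"}],
        }
    },
]
run = start_run(alert, plans)
```

Under this subset rule, extra labels on the alert (such as zone here) do not prevent a match, and a plan with no alertLabels would match every alert.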
This framework ensures a streamlined and flexible approach to managing disaster recovery in Kubernetes environments.