Version: v0.8.0

Disaster Recovery

As a multi-cluster orchestrator, Nova can help with the Day-2 orchestration needs of applications running on the fleet of clusters it manages, in addition to the Day-0 static and dynamic scheduling of workloads covered in the prior sections.

Automated disaster recovery of databases is one such need, and Nova implements it through a Kubernetes-native approach. This feature lets the Nova administrator specify a sequence of steps to carry out when a failure is detected in the fleet. Such a failure could be a cluster-level, zone-level, or region-level failure. As soon as Nova receives an alert capturing this failure, the Nova control plane executes a Recovery Plan consisting of a sequence of recovery steps.

In the following sections, we describe the three key CRDs that Nova's disaster recovery framework is built around: ReceivedAlert, RecoveryPlan, and RecoveryRun. These CRDs work in tandem to monitor, plan, and execute disaster recovery strategies. A RecoveryPlan is a user-defined plan that is matched against an incoming ReceivedAlert; a successful match starts a RecoveryRun.

ReceivedAlert

The ReceivedAlert CRD tracks incoming alerts within the Kubernetes cluster. It functions as the initial point of contact for the Nova disaster recovery system, capturing and recording alerts that may necessitate recovery actions.

Kind: ReceivedAlert
Function: Tracks and records incoming alerts.
Usage: When an alert is received from an alerting system (such as Prometheus Alertmanager), a ReceivedAlert resource is created to store the details of the alert.

apiVersion: recovery.elotl.co/v1alpha1
kind: ReceivedAlert
metadata:
  name: example-received-alert
spec:
  labels:
    key: value
status:
  processed: false
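
To make the relationship between an incoming alert and the resulting resource concrete, the following Python sketch builds ReceivedAlert-shaped manifests from an Alertmanager webhook payload. It is only an illustration: the function name `received_alerts_from_webhook` and the naming scheme for `metadata.name` are assumptions made for this example, and the way Nova itself ingests alerts is not described here.

```python
import json


def received_alerts_from_webhook(payload):
    """Build one ReceivedAlert-shaped manifest per alert in a webhook payload.

    Hypothetical helper for illustration; it does not call any Nova API.
    """
    manifests = []
    for index, alert in enumerate(payload.get("alerts", [])):
        labels = dict(alert.get("labels", {}))
        # Assumed naming scheme: lowercased alertname plus position in the batch.
        name = "{}-{}".format(labels.get("alertname", "alert").lower(), index)
        manifests.append({
            "apiVersion": "recovery.elotl.co/v1alpha1",
            "kind": "ReceivedAlert",
            "metadata": {"name": name},
            "spec": {"labels": labels},
            # A newly recorded alert has not been processed yet.
            "status": {"processed": False},
        })
    return manifests


# Trimmed-down payload in the shape Alertmanager sends to webhook receivers.
sample_payload = {
    "version": "4",
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "DBDown", "severity": "critical", "app": "my-app"},
        },
    ],
}

manifests = received_alerts_from_webhook(sample_payload)
print(json.dumps(manifests[0], indent=2))
```

The labels are copied unchanged into `spec.labels`, since those are the fields a RecoveryPlan is later matched against.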

RecoveryPlan

The RecoveryPlan CRD allows users to define specific recovery strategies. It's a user-created resource that outlines the steps to be taken in response to different types of alerts.

Kind: RecoveryPlan
Function: Defines the recovery steps for different alerts.
Usage: Users create RecoveryPlan resources to specify how the system should respond to various alerts.

apiVersion: recovery.elotl.co/v1alpha1
kind: RecoveryPlan
metadata:
  name: example-recoveryplan
spec:
  alertLabels:
    severity: critical
    app: my-app
  steps:
    - type: readField
      readField:
        apiVersion: v1
        resource: pods
        namespace: default
        name: my-pod
        fieldPath: spec.containers[0].image
        outputKey: podImage
      title: Read Pod Image
    - type: patch
      patch:
        apiVersion: v1
        resource: pods
        namespace: default
        name: my-pod
        override:
          fieldPath: spec.containers[1].image
          value:
            raw: "{{ .Values.podImage }}"
        patchType: "application/json-patch+json"
      title: Update Second Container Image
    - type: job
      job:
        template:
          spec:
            containers:
            - name: update-notification
              image: busybox
              command: ['sh', '-c', 'echo Updating to image {{ .Values.podImage }}']
            restartPolicy: OnFailure
      title: Notify Image Update
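
The plan above passes data between steps: the readField step stores a value under `outputKey`, and later steps refer to it as `{{ .Values.podImage }}`. The Python sketch below imitates that data flow on an in-memory pod object. It is a simplified model under stated assumptions, not Nova's implementation: `read_field` and `render` are hypothetical helpers, and the template handling covers only the `{{ .Values.<key> }}` form used in this example.

```python
import re


def read_field(obj, field_path):
    """Resolve a path such as 'spec.containers[0].image' in nested dicts/lists."""
    current = obj
    # Each token is either a key name or a bracketed list index.
    for key, index in re.findall(r"([^.\[\]]+)|\[(\d+)\]", field_path):
        current = current[key] if key else current[int(index)]
    return current


def render(template, values):
    """Replace each '{{ .Values.<key> }}' placeholder with values[<key>]."""
    pattern = r"\{\{\s*\.Values\.(\w+)\s*\}\}"
    return re.sub(pattern, lambda match: str(values[match.group(1)]), template)


# Stand-in for the pod named my-pod; the image names are made up.
pod = {
    "spec": {
        "containers": [
            {"name": "main", "image": "example-image:v1"},
            {"name": "sidecar", "image": "old-image:v0"},
        ]
    }
}

values = {}

# Step 1 (readField): store the first container's image under outputKey podImage.
values["podImage"] = read_field(pod, "spec.containers[0].image")

# Step 2 (patch): write the rendered value into the second container's image.
pod["spec"]["containers"][1]["image"] = render("{{ .Values.podImage }}", values)

# Step 3 (job): render the command string the notification job would run.
message = render("echo Updating to image {{ .Values.podImage }}", values)
print(message)
```

Keeping the collected values in a single dictionary mirrors the `status.values` map that appears on a RecoveryRun.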

RecoveryRun

The RecoveryRun CRD represents the execution of a disaster recovery process. It is triggered based on a matching RecoveryPlan for a given ReceivedAlert.

Kind: RecoveryRun
Function: Manages and tracks the execution of recovery processes.
Usage: When a ReceivedAlert matches a RecoveryPlan, a RecoveryRun is initiated to carry out the specified recovery steps.

apiVersion: recovery.elotl.co/v1alpha1
kind: RecoveryRun
metadata:
  name: example-recoveryrun
spec:
  alertSpec:
    name: critical-alert
    labels:
      severity: critical
      app: my-app
  steps:
    - type: readField
      readField:
        apiVersion: v1
        resource: pods
        namespace: default
        name: my-pod
        fieldPath: spec.containers[0].image
        outputKey: podImage
    - type: patch
      patch:
        apiVersion: v1
        resource: pods
        namespace: default
        name: my-pod
        override:
          fieldPath: spec.containers[1].image
          value:
            raw: "{{ .Values.podImage }}"
        patchType: "application/json-patch+json"
    - type: job
      job:
        template:
          spec:
            containers:
            - name: update-notification
              image: busybox
              command: ['sh', '-c', 'echo Updating to image {{ .Values.podImage }}']
            restartPolicy: OnFailure
status:
  phase: Completed
  values:
    podImage: "example-image:v1"
  conditions:
    - type: Success
      status: "True"
      lastProbeTime: "2024-01-03T12:34:56Z"
      lastTransitionTime: "2024-01-03T12:34:56Z"
      reason: Completed
      message: Recovery step completed successfully
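
A client that has fetched a RecoveryRun can inspect its `status` to decide whether recovery finished. The Python sketch below checks the two fields shown in the example above. It assumes that a finished run reports `phase: Completed` together with a `Success` condition whose status is `"True"`; other phases and condition types are not documented here, so the hypothetical helper `run_succeeded` treats anything else as not yet successful.

```python
def run_succeeded(run):
    """Return True if a RecoveryRun-shaped dict reports successful completion."""
    status = run.get("status", {})
    if status.get("phase") != "Completed":
        return False
    # Condition status values are strings ("True"/"False"), not booleans.
    return any(
        condition.get("type") == "Success" and condition.get("status") == "True"
        for condition in status.get("conditions", [])
    )


example_run = {
    "kind": "RecoveryRun",
    "status": {
        "phase": "Completed",
        "values": {"podImage": "example-image:v1"},
        "conditions": [
            {"type": "Success", "status": "True", "reason": "Completed"},
        ],
    },
}

print(run_succeeded(example_run))
```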

Workflow

Alert Reception: A ReceivedAlert is created when an alert is received.
Plan Matching: The system matches the ReceivedAlert to an appropriate RecoveryPlan.
Recovery Execution: A RecoveryRun is initiated to execute the recovery steps defined in the matched RecoveryPlan.
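
The three workflow steps can be sketched in Python as follows. The matching rule used here is an assumption: a plan is treated as matching when every key/value pair in its `alertLabels` also appears in the alert's labels, and the first matching plan wins. The function names and the `-run` name suffix are hypothetical; Nova's actual matching and naming behavior may differ.

```python
def plan_matches(plan_labels, alert_labels):
    """Assumed rule: every label required by the plan is present on the alert."""
    return all(alert_labels.get(key) == value for key, value in plan_labels.items())


def start_recovery_run(alert, plans):
    """Return a RecoveryRun-shaped dict for the first matching plan, else None."""
    alert_labels = alert["spec"]["labels"]
    for plan in plans:
        if plan_matches(plan["spec"]["alertLabels"], alert_labels):
            return {
                "apiVersion": "recovery.elotl.co/v1alpha1",
                "kind": "RecoveryRun",
                # Hypothetical naming scheme for illustration only.
                "metadata": {"name": plan["metadata"]["name"] + "-run"},
                "spec": {
                    "alertSpec": {
                        "name": alert["metadata"]["name"],
                        "labels": dict(alert_labels),
                    },
                    # The run carries a copy of the plan's steps.
                    "steps": list(plan["spec"]["steps"]),
                },
            }
    return None


alert = {
    "kind": "ReceivedAlert",
    "metadata": {"name": "critical-alert"},
    "spec": {"labels": {"severity": "critical", "app": "my-app", "zone": "a"}},
}
plans = [
    {
        "metadata": {"name": "other-plan"},
        "spec": {"alertLabels": {"app": "other-app"}, "steps": []},
    },
    {
        "metadata": {"name": "example-recoveryplan"},
        "spec": {
            "alertLabels": {"severity": "critical", "app": "my-app"},
            "steps": [{"type": "readField"}, {"type": "patch"}, {"type": "job"}],
        },
    },
]

run = start_recovery_run(alert, plans)
print(run["metadata"]["name"])
```

With subset matching, extra labels on the alert (such as `zone` in this example) do not prevent a match, which lets one plan cover alerts from several zones or clusters.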

Together, these three resources keep alert intake, recovery planning, and recovery execution as separate Kubernetes objects, so each stage of a disaster recovery can be defined and inspected on its own.
