Version: v1.5

Disaster Recovery for Percona PostgreSQL Operator

Prerequisites

AWS CLI
yq
kubectl
Nova Control Plane installed with 3 workload clusters connected

All file paths in this tutorial are relative to the try-nova root directory.

We will first export these environment variables so that subsequent steps in this tutorial can be easily followed.

export NOVA_NAMESPACE=elotl
export NOVA_CONTROLPLANE_CONTEXT=nova
export NOVA_WORKLOAD_CLUSTER_1=wlc-1
export NOVA_WORKLOAD_CLUSTER_2=wlc-2

Export this additional environment variable if you installed Nova using the tarball. You can optionally replace the value k8s-cluster-hosting-cp with the context name of your Nova hosting cluster.

export K8S_HOSTING_CLUSTER_CONTEXT=k8s-cluster-hosting-cp

Alternatively, export these environment variables if you installed Nova using the setup scripts provided in the try-nova repository.

export K8S_HOSTING_CLUSTER_CONTEXT=kind-hosting-cluster
export K8S_CLUSTER_CONTEXT_1=kind-wlc-1
export K8S_CLUSTER_CONTEXT_2=kind-wlc-2

Environment variable names with prefix NOVA_ refer to the custom resource Cluster in Nova. Cluster context names with prefix K8S_ refer to the underlying Kubernetes clusters.
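
Every later command depends on these variables, so it can help to confirm they are set before continuing. The following is a minimal sketch: the helper name check_vars is hypothetical and not part of try-nova.

```shell
# Hypothetical helper (not part of try-nova): prints the name of every
# variable passed as an argument that is unset or empty, and returns
# non-zero if at least one is missing.
check_vars() {
    missing=0
    for name in "$@"; do
        # Indirect lookup: read the value of the variable whose name is in $name.
        eval "value=\${${name}:-}"
        if [ -z "${value}" ]; then
            echo "missing: ${name}"
            missing=1
        fi
    done
    return "${missing}"
}

# Example call, using the variable names exported above:
# check_vars NOVA_NAMESPACE NOVA_CONTROLPLANE_CONTEXT \
#     NOVA_WORKLOAD_CLUSTER_1 NOVA_WORKLOAD_CLUSTER_2 K8S_HOSTING_CLUSTER_CONTEXT
```

If the function prints nothing and returns zero, all listed variables are set.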

Setting Up S3 Access for Backups

Our first step involves setting up an S3 bucket for backups. Follow these commands to create a bucket and configure access:

Create an S3 bucket

REGION=eu-west-2

aws s3api create-bucket \
    --bucket nova-postgresql-backup \
    --region $REGION \
    --create-bucket-configuration LocationConstraint=$REGION
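
The command above works as written for eu-west-2. If you pick us-east-1 instead, S3 rejects an explicit LocationConstraint, so the last argument has to be dropped. A minimal sketch of handling both cases follows; the helper name bucket_location_args is an assumption made for illustration.

```shell
# Hypothetical helper: prints the extra create-bucket arguments that a
# region needs. us-east-1 is the S3 default region and must be created
# without a LocationConstraint; every other region requires one.
bucket_location_args() {
    if [ "$1" = "us-east-1" ]; then
        echo ""
    else
        echo "--create-bucket-configuration LocationConstraint=$1"
    fi
}

# Example call (the command substitution is left unquoted on purpose so
# that the two words become separate arguments):
# aws s3api create-bucket \
#     --bucket nova-postgresql-backup \
#     --region "$REGION" \
#     $(bucket_location_args "$REGION")
```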

Create an IAM Policy:

aws iam create-policy \
    --policy-name read-write-list-s3-nova-postgresql-backup \
    --policy-document file://examples/percona-disaster-recovery/s3-policy.json

List IAM policies to verify successful creation:

aws iam list-policies --query 'Policies[?PolicyName==`read-write-list-s3-nova-postgresql-backup`].Arn' --output text

Create an IAM user and attach the IAM policy to it:

aws iam create-user --no-cli-pager --user-name s3-backup-service-account

POLICYARN=$(aws iam list-policies --query 'Policies[?PolicyName==`read-write-list-s3-nova-postgresql-backup`].Arn' --output text)
aws iam attach-user-policy \
    --policy-arn $POLICYARN \
    --user-name s3-backup-service-account

aws iam create-access-key --user-name s3-backup-service-account

NOTE: Before rerunning this tutorial, make sure that the bucket is empty.

The create-access-key command prints output similar to the following:

{
    "AccessKey": {
        "UserName": "s3-backup-service-account",
        "AccessKeyId": "AKIAXXXX",
        "Status": "Active",
        "SecretAccessKey": "VaC0xxxx",
        "CreateDate": "2023-12-13T13:59:34+00:00"
    }
}

Note down the AccessKeyId and SecretAccessKey values and substitute them into the examples/percona-disaster-recovery/template-s3-bucket-access-key-secret.txt file. Then base64-encode the file:

base64 -i examples/percona-disaster-recovery/template-s3-bucket-access-key-secret.txt

Place the output of this command in examples/percona-disaster-recovery/s3-access-secret.yaml.
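
One thing to watch for when pasting the encoded value into YAML: GNU base64 wraps its output at 76 characters by default, and a wrapped value is easy to paste incorrectly. The sketch below encodes a file as a single line; the helper name encode_single_line is hypothetical.

```shell
# Hypothetical helper: base64-encodes the file given as $1 and strips
# any line breaks, so the result is a single line that can be pasted
# into a YAML value.
encode_single_line() {
    base64 < "$1" | tr -d '\n'
}

# Example call:
# encode_single_line examples/percona-disaster-recovery/template-s3-bucket-access-key-secret.txt
```

Reading the file from standard input avoids the differing meanings of the -i flag across base64 implementations.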

Installing Percona PostgreSQL Operator

Next we install the Percona PostgreSQL Operator and set up the clusters:

Create Schedule Policies: The following policies schedule the PostgreSQL Operator to clusters 1 and 2, the primary PostgreSQL cluster to cluster 1, and the standby PostgreSQL cluster to cluster 2. HAProxy is also scheduled to cluster 2.

envsubst < "examples/percona-disaster-recovery/schedule-policies.yaml" > "./schedule-policies.yaml"
kubectl --context=${NOVA_CONTROLPLANE_CONTEXT} create -f ./schedule-policies.yaml

Clone Percona PostgreSQL Repository:

REPO_DIR="percona-postgresql-operator"
REPO_URL="https://github.com/percona/percona-postgresql-operator"
REPO_BRANCH="v2.3.0"

# Remove any previous checkout so the clone starts from a clean state
if [ -d "$REPO_DIR" ]; then
    rm -rf "$REPO_DIR"
fi

git clone -b "$REPO_BRANCH" "$REPO_URL"

Installing the Percona PostgreSQL Operator

echo "Creating operator namespace"
kubectl --context=${NOVA_CONTROLPLANE_CONTEXT} create ns psql-operator --dry-run=client -o yaml | yq e ".metadata.labels.psql-cluster = \"all\"" | kubectl --context=${NOVA_CONTROLPLANE_CONTEXT} apply -f -

echo "Installing operator to cluster all"
cat percona-postgresql-operator/deploy/bundle.yaml | \
./add_labels.sh -l psql-cluster=all | \
kubectl --context=${NOVA_CONTROLPLANE_CONTEXT} --namespace psql-operator create -f -

When running on AWS use:

echo "Setting up S3 access"
cat examples/percona-disaster-recovery/s3-access-secret.yaml | \
./add_labels.sh -l psql-cluster=all | \
kubectl --context=${NOVA_CONTROLPLANE_CONTEXT} --namespace psql-operator create -f -

and when running locally with Minio:

echo "Setting up S3 access"
cat examples/percona-disaster-recovery/s3-access-secret-minio.yaml | \
./add_labels.sh -l psql-cluster=all | \
kubectl --context=${NOVA_CONTROLPLANE_CONTEXT} --namespace psql-operator create -f -

Configuring the PostgreSQL clusters

cat examples/percona-disaster-recovery/cluster_1_cr.yaml | \
./add_labels.sh -l psql-cluster=cluster-1 | \
kubectl --context=${NOVA_CONTROLPLANE_CONTEXT} --namespace psql-operator create -f -

cat examples/percona-disaster-recovery/cluster_2_cr.yaml | \
./add_labels.sh -l psql-cluster=cluster-2 | \
kubectl --context=${NOVA_CONTROLPLANE_CONTEXT} --namespace psql-operator create -f -

Setting up a load balancer

A load balancer is needed so that client connections keep working after the recovery switch is made. For our example, we'll use HAProxy, which needs the IP address of our active PostgreSQL cluster. We can get this as follows:

kubectl wait perconapgcluster/cluster1 -n psql-operator --context=${K8S_CLUSTER_CONTEXT_1} '--for=jsonpath={.status.host}' --timeout=300s
DB_HOST=$(kubectl --context=${K8S_CLUSTER_CONTEXT_1} get perconapgcluster/cluster1 -n psql-operator -o jsonpath='{.status.host}')

envsubst < "examples/percona-disaster-recovery/haproxy.cfg" > "./haproxy.cfg"
kubectl --context=${NOVA_CONTROLPLANE_CONTEXT} create configmap haproxy-config --from-file=haproxy.cfg=./haproxy.cfg --dry-run=client -o yaml | \
./add_labels.sh -l cluster=cluster-ha-proxy | \
kubectl --context=${NOVA_CONTROLPLANE_CONTEXT} apply -f -

We then create the HAProxy deployment and service:

kubectl --context=${NOVA_CONTROLPLANE_CONTEXT} create -f examples/percona-disaster-recovery/haproxy.yaml

Setting up the RecoveryPlan custom resource

apiVersion: recovery.elotl.co/v1alpha1
kind: RecoveryPlan
metadata:
  name: psql-primary-failover-plan
spec:
  alertLabels:
    app: percona-postgresql-cluster-1
  steps:
    - type: patch  # set cluster 1 to standby
      patch:
        apiVersion: "pgv2.percona.com/v2"
        resource: "perconapgclusters"
        namespace: "psql-operator"
        name: "cluster1"
        override:
          fieldPath: "spec.standby.enabled"
          value:
            raw: true
        patchType: "application/merge-patch+json"
    - type: patch  # set cluster 2 as new primary
      patch:
        apiVersion: "pgv2.percona.com/v2"
        resource: "perconapgclusters"
        namespace: "psql-operator"
        name: "cluster2"
        override:
          fieldPath: "spec.standby.enabled"
          value:
            raw: false
        patchType: "application/merge-patch+json"
    - type: readField  # read cluster 2 host
      readField:
        apiVersion: "pgv2.percona.com/v2"
        resource: "perconapgclusters"
        namespace: "psql-operator"
        name: "cluster2"
        fieldPath: "status.host"
        outputKey: "Cluster2IP"
    - type: patch  # update HAProxy to point to cluster 2
      patch:
        apiVersion: "v1"
        resource: "configmaps"
        namespace: "default"
        name: "haproxy-config"
        override:
          fieldPath: "data"
          value:
            raw: {"haproxy.cfg": "defaults\n    mode tcp\n    timeout connect 5000ms\n    timeout client 50000ms\n    timeout server 50000ms\n\nfrontend fe_main\n    bind *:5432\n    default_backend be_db_2\n\nbackend be_db_2\n    server db2 {{ .Values.Cluster2IP }}:5432 check"}
        patchType: "application/merge-patch+json"
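
The last step relies on the Cluster2IP value produced by the readField step to fill the {{ .Values.Cluster2IP }} placeholder. To preview what the resulting backend line should look like, the substitution can be mimicked locally. This is only a sketch under that assumption: Nova performs the real substitution, and the helper name render_cluster2_ip is hypothetical.

```shell
# Hypothetical preview helper: reads template text on stdin and replaces
# the {{ .Values.Cluster2IP }} placeholder with the address given as $1.
# It only imitates the expected result; it does not call Nova.
render_cluster2_ip() {
    sed "s/{{ \.Values\.Cluster2IP }}/$1/g"
}

# Example call with a sample address:
# printf '%s\n' '    server db2 {{ .Values.Cluster2IP }}:5432 check' | render_cluster2_ip 10.0.0.5
```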

Recovery

The Recovery Plan reads the host of the standby cluster, so we need to make sure the host has been assigned before proceeding:

kubectl wait perconapgclusters/cluster2 -n psql-operator --context=${NOVA_CONTROLPLANE_CONTEXT} '--for=jsonpath={.status.host}' --timeout=180s
kubectl --context=${K8S_CLUSTER_CONTEXT_1} wait -n psql-operator perconapgcluster cluster1 --for=jsonpath='{.status.pgbouncer.ready}'=3 --timeout=180s

kubectl --context=${K8S_CLUSTER_CONTEXT_1} wait -n psql-operator pod -l postgres-operator.crunchydata.com/role=pgbouncer --for=condition=Ready --timeout=120s
kubectl --context=${K8S_CLUSTER_CONTEXT_1} get pods -n psql-operator

kubectl --context=${K8S_CLUSTER_CONTEXT_2} wait -n psql-operator perconapgcluster cluster2 --for=jsonpath='{.status.pgbouncer.ready}'=3 --timeout=120s
kubectl --context=${K8S_CLUSTER_CONTEXT_2} wait -n psql-operator pod -l postgres-operator.crunchydata.com/role=pgbouncer --for=condition=Ready --timeout=120s
kubectl --context=${K8S_CLUSTER_CONTEXT_2} get pods -n psql-operator

Next, we create the Recovery Plan

kubectl --context=${NOVA_CONTROLPLANE_CONTEXT} create -f examples/percona-disaster-recovery/recovery-plan.yaml

In production systems, a metrics service such as Prometheus with Alertmanager sends alerts to Nova through the recovery webhook. To keep this tutorial simple, we simulate receiving an alert by adding it to Nova directly. When the alert is added, Nova looks for a recovery plan by matching the alert labels to the recovery plan labels. Once it finds the matching recovery plan, it executes it.

kubectl --context=${NOVA_CONTROLPLANE_CONTEXT} create -f examples/percona-disaster-recovery/received-alert.yaml

Let's verify if recovery succeeded

Check if cluster 1 (in our tutorial we assume it fails) is set to standby.

kubectl wait perconapgclusters/cluster1 -n psql-operator --context=${NOVA_CONTROLPLANE_CONTEXT} '--for=jsonpath={.spec.standby.enabled}'=true --timeout=180s

Check if cluster 2 has taken over the role of the primary.

kubectl wait perconapgclusters/cluster2 -n psql-operator --context=${NOVA_CONTROLPLANE_CONTEXT} '--for=jsonpath={.spec.standby.enabled}'=false --timeout=180s

Check if HAProxy is now connected to the new primary cluster - cluster 2.

kubectl get cm/haproxy-config --context=${NOVA_CONTROLPLANE_CONTEXT} -n default -o jsonpath='{.data.haproxy\.cfg}' | grep 'server db2'

server db2 172.18.255.240:5432 check
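
Rather than comparing the address by eye, the backend host can be extracted and compared with the host reported by cluster 2. The following is a minimal sketch; the helper name backend_host is hypothetical, and the commented commands reuse the ones shown earlier in this tutorial.

```shell
# Hypothetical helper: reads an haproxy.cfg on stdin and prints the host
# part of the "server db2 HOST:PORT ..." backend line.
backend_host() {
    awk '$1 == "server" && $2 == "db2" { split($3, parts, ":"); print parts[1] }'
}

# Example comparison:
# actual=$(kubectl get cm/haproxy-config --context=${NOVA_CONTROLPLANE_CONTEXT} -n default \
#     -o jsonpath='{.data.haproxy\.cfg}' | backend_host)
# expected=$(kubectl --context=${K8S_CLUSTER_CONTEXT_2} get perconapgcluster/cluster2 \
#     -n psql-operator -o jsonpath='{.status.host}')
# [ "$actual" = "$expected" ] && echo "HAProxy points at cluster 2"
```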

Cleanup

kubectl --context=${NOVA_CONTROLPLANE_CONTEXT} delete -f examples/percona-disaster-recovery/received-alert.yaml

kubectl --context=${NOVA_CONTROLPLANE_CONTEXT} delete -f examples/percona-disaster-recovery/recovery-plan.yaml

kubectl --context=${NOVA_CONTROLPLANE_CONTEXT} delete -f examples/percona-disaster-recovery/haproxy.yaml

kubectl --context=${NOVA_CONTROLPLANE_CONTEXT} create configmap haproxy-config --from-file=haproxy.cfg=examples/percona-disaster-recovery/haproxy.cfg --dry-run=client -o yaml | \
./add_labels.sh -l cluster=cluster-ha-proxy | \
kubectl --context=${NOVA_CONTROLPLANE_CONTEXT} delete -f -

cat examples/percona-disaster-recovery/cluster_1_cr.yaml | \
./add_labels.sh -l psql-cluster=cluster-1 | \
kubectl --context=${NOVA_CONTROLPLANE_CONTEXT} --namespace psql-operator delete -f -

cat examples/percona-disaster-recovery/cluster_2_cr.yaml | \
./add_labels.sh -l psql-cluster=cluster-2 | \
kubectl --context=${NOVA_CONTROLPLANE_CONTEXT} --namespace psql-operator delete -f -

cat percona-postgresql-operator/deploy/bundle.yaml | \
./add_labels.sh -l psql-cluster=all | \
kubectl --context=${NOVA_CONTROLPLANE_CONTEXT} --namespace psql-operator delete -f -

cat examples/percona-disaster-recovery/s3-access-secret.yaml | \
./add_labels.sh -l psql-cluster=all | \
kubectl --context=${NOVA_CONTROLPLANE_CONTEXT} --namespace psql-operator delete -f -

kubectl --context=${NOVA_CONTROLPLANE_CONTEXT} create ns psql-operator --dry-run=client -o yaml | yq e ".metadata.labels.psql-cluster = \"all\"" | kubectl --context=${NOVA_CONTROLPLANE_CONTEXT} delete -f -

kubectl --context=${NOVA_CONTROLPLANE_CONTEXT} delete -f ./schedule-policies.yaml
rm -f ./schedule-policies.yaml
