Disaster Recovery for FerretDB with Percona Postgres Operator
Prerequisites
- AWS CLI
- yq
- kubectl
- Nova Control Plane installed with 3 workload clusters connected
File paths are given relative to the try-nova root directory.
Setting Up S3 Access for Backups
Our first step involves setting up an S3 bucket for backups. Follow these commands to create a bucket and configure access:
- Create S3 bucket:
REGION=eu-west-2
aws s3api create-bucket \
--bucket nova-ferretdb-backup \
--region $REGION \
--create-bucket-configuration LocationConstraint=$REGION
- Create IAM Policy:
aws iam create-policy \
--policy-name read-write-list-s3-nova-ferretdb-backup \
--policy-document file://examples/percona-disaster-recovery/s3-policy.json
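The command above references examples/percona-disaster-recovery/s3-policy.json from the repo. For orientation, a read/write/list policy for this bucket typically looks something like the sketch below; use the actual file from the repo, and make sure the bucket name matches the one created above:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
            "Resource": "arn:aws:s3:::nova-ferretdb-backup"
        },
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
            "Resource": "arn:aws:s3:::nova-ferretdb-backup/*"
        }
    ]
}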
- List Policies to Verify:
aws iam list-policies --query 'Policies[?PolicyName==`read-write-list-s3-nova-ferretdb-backup`].Arn' --output text
- Create User and Attach Policy:
aws iam create-user --no-cli-pager --user-name s3-backup-service-account
POLICYARN=$(aws iam list-policies --query 'Policies[?PolicyName==`read-write-list-s3-nova-ferretdb-backup`].Arn' --output text)
aws iam attach-user-policy \
--policy-arn $POLICYARN \
--user-name s3-backup-service-account
aws iam create-access-key --user-name s3-backup-service-account
The last command returns the access key credentials:
{
    "AccessKey": {
        "UserName": "s3-backup-service-account",
        "AccessKeyId": "AKIAXXXX",
        "Status": "Active",
        "SecretAccessKey": "VaC0xxxx",
        "CreateDate": "2023-12-13T13:59:34+00:00"
    }
}
NOTE: Before rerunning this tutorial, make sure that the bucket used above is empty.
Note down the AccessKeyId and SecretAccessKey values, substitute them in examples/percona-disaster-recovery/template-s3-bucket-access-key-secret.txt, and encode the file:
base64 -i examples/percona-disaster-recovery/template-s3-bucket-access-key-secret.txt
Place the output in examples/percona-disaster-recovery/s3-access-secret.yaml.
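For orientation, s3-access-secret.yaml is a standard Kubernetes Secret whose data field carries the base64 string produced above. The sketch below is only an assumption about its shape: the actual Secret name and data key come from the repo's template and are what the PostgreSQL cluster CRs reference, so keep those values as they are.
apiVersion: v1
kind: Secret
metadata:
  name: s3-backup-credentials   # hypothetical name; use the one from the repo template
type: Opaque
data:
  s3.conf: <paste the base64 output here>   # key name is an assumption; check the template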
Installing Percona PostgreSQL Operator
Now let's install the Percona PostgreSQL Operator and set up the clusters:
- Create Schedule Policies: The policies below schedule the PostgreSQL Operator to clusters 1 and 2, the primary PostgreSQL cluster to cluster 1, and the standby to cluster 2. HAProxy will also be scheduled to cluster 2, and the FerretDB deployments to clusters 1 and 2.
kubectl --context nova create -f examples/percona-disaster-recovery/schedule-policies.yaml
kubectl --context nova create -f examples/ferretdb-percona-disaster-recovery/policy-spread-ferretdb.yaml
- Clone Percona PostgreSQL Repository:
REPO_DIR="percona-postgresql-operator"
REPO_URL="https://github.com/percona/percona-postgresql-operator"
REPO_BRANCH="v2.3.0"
if [ -d "$REPO_DIR" ]; then
rm -rf $REPO_DIR
fi
git clone -b $REPO_BRANCH $REPO_URL
- Proceed with installing the Percona PostgreSQL Operator:
echo "Creating operator namespace"
kubectl --context nova create ns psql-operator --dry-run=client -o yaml | yq e ".metadata.labels.psql-cluster = \"all\"" | kubectl --context nova apply -f -
echo "Installing operator to cluster all"
cat percona-postgresql-operator/deploy/bundle.yaml | python3 add_labels.py namespace psql-cluster all | python3 add_labels.py cluster psql-cluster all | kubectl --context nova --namespace psql-operator create -f -
When running on AWS use:
# echo "Settting up s3 access"
cat examples/percona-disaster-recovery/s3-access-secret.yaml | python3 add_labels.py namespace psql-cluster all | kubectl --context nova create -f -
and when running locally with Minio:
# echo "Settting up s3 access"
cat examples/percona-disaster-recovery/s3-access-secret-minio.yaml | python3 add_labels.py namespace psql-cluster all | kubectl --context nova create -f -
- Configure the two PostgreSQL clusters:
cat examples/ferretdb-percona-disaster-recovery/cluster_1_cr.yaml | python3 add_labels.py namespace psql-cluster cluster-1 | kubectl --context nova --namespace psql-operator create -f -
cat examples/ferretdb-percona-disaster-recovery/cluster_2_cr.yaml | python3 add_labels.py namespace psql-cluster cluster-2 | kubectl --context nova --namespace psql-operator create -f -
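Before moving on, it is worth checking that both PerconaPGCluster resources were created and their pods are coming up, for example:
kubectl --context nova -n psql-operator get perconapgclusters
kubectl --context nova -n psql-operator get pods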
- Install FerretDB
kubectl --context nova create -f examples/ferretdb-percona-disaster-recovery/ferretdb.yaml
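You can verify that the FerretDB Deployments and Services came up; the Service names used later in this tutorial are ferretdb-service-1 and ferretdb-service-2:
kubectl --context nova -n default get deployments,services | grep ferretdb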
- Set up a load balancer in front of our databases. The load balancer is needed to keep serving client connections after the recovery switch is made. For our example we'll use HAProxy. We'll need the address of the FerretDB service connected to the active PostgreSQL cluster. To get it, you can run:
kubectl --context nova get service/ferretdb-service-1 -n default -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
Then substitute the address for server db1 in the HAProxy config:
defaults
  mode tcp
  timeout connect 5000ms
  timeout client 50000ms
  timeout server 50000ms

frontend fe_main
  bind *:27017
  default_backend be_db_1

backend be_db_1
  server db1 <address>:<port> check
Save this config as examples/ferretdb-percona-disaster-recovery/haproxy.cfg and run the following command to create the ConfigMap:
kubectl --context nova create configmap haproxy-config --from-file=haproxy.cfg=examples/ferretdb-percona-disaster-recovery/haproxy.cfg --dry-run=client -o yaml | python3 add_labels.py namespace psql-cluster cluster-ha-proxy | kubectl --context nova apply -f -
Then apply the actual HAProxy Deployment and Service:
kubectl --context nova create -f examples/ferretdb-percona-disaster-recovery/haproxy.yaml
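At this point you can optionally sanity-check the data path by connecting to FerretDB through HAProxy with a MongoDB client. The snippet below is only a sketch: the HAProxy Service name is assumed to be haproxy (check examples/ferretdb-percona-disaster-recovery/haproxy.yaml for the real name), and the username/password come from the PostgreSQL user Secret created by the Percona operator for cluster 1. FerretDB typically authenticates with the PLAIN mechanism using those PostgreSQL credentials. On a local setup the address will be under .ip or .spec.clusterIP instead of .hostname.
# Assumed Service name "haproxy"; adjust to the name defined in haproxy.yaml
HAPROXY_ADDR=$(kubectl --context nova get service/haproxy -n default -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
# <user>/<password> come from the Percona-generated pguser Secret for cluster 1
mongosh "mongodb://<user>:<password>@${HAPROXY_ADDR}:27017/<database>?authMechanism=PLAIN"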
Once this is done, we can start setting up our RecoveryPlan:
apiVersion: recovery.elotl.co/v1alpha1
kind: RecoveryPlan
metadata:
  name: psql-primary-failover-plan
spec:
  alertLabels:
    app: example-app
  steps:
    - type: patch # set cluster 1 to standby
      patch:
        apiVersion: "pgv2.percona.com/v2"
        resource: "perconapgclusters"
        namespace: "psql-operator"
        name: "cluster1"
        override:
          fieldPath: "spec.standby.enabled"
          value:
            raw: true
        patchType: "application/merge-patch+json"
    - type: patch # set cluster 2 as the new primary
      patch:
        apiVersion: "pgv2.percona.com/v2"
        resource: "perconapgclusters"
        namespace: "psql-operator"
        name: "cluster2"
        override:
          fieldPath: "spec.standby.enabled"
          value:
            raw: false
        patchType: "application/merge-patch+json"
    - type: readField # read the FerretDB service hostname from cluster 2
      readField:
        apiVersion: "v1"
        resource: "services"
        namespace: "default"
        name: "ferretdb-service-2"
        fieldPath: "status.loadBalancer.ingress[0].hostname"
        outputKey: "Cluster2IP"
    - type: patch # update HAProxy to point to the FerretDB service in cluster 2
      patch:
        apiVersion: "v1"
        resource: "configmaps"
        namespace: "psql-operator"
        name: "haproxy-config"
        override:
          fieldPath: "data"
          value:
            raw: {"haproxy.cfg": "defaults\n mode tcp\n timeout connect 5000ms\n timeout client 50000ms\n timeout server 50000ms\n\nfrontend fe_main\n bind *:27017\n default_backend be_db_2\n\nbackend be_db_2\n server db2 {{ .Values.Cluster2IP }}:27017 check"}
        patchType: "application/merge-patch+json"
Let's run it.
The recovery plan will read the host of the FerretDB service in cluster 2, so before proceeding we need to make sure it has been assigned. (The readField step stores this hostname under the outputKey Cluster2IP, and the final patch step substitutes it via the {{ .Values.Cluster2IP }} template.)
If running the example in the cloud use:
kubectl get service/ferretdb-service-2 -n default --context nova -o=jsonpath='{.status.loadBalancer.ingress[0].hostname}'
else:
kubectl get service/ferretdb-service-2 -n default --context nova -o=jsonpath='{.spec.clusterIP}'
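If you prefer to wait for the address automatically rather than re-running the command, a small poll using the same jsonpath works (cloud variant shown):
until [ -n "$(kubectl get service/ferretdb-service-2 -n default --context nova -o=jsonpath='{.status.loadBalancer.ingress[0].hostname}')" ]; do
  echo "Waiting for ferretdb-service-2 load balancer address..."
  sleep 5
done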
Add the recovery plan. If running the example in the cloud, use:
kubectl --context nova create -f examples/ferretdb-percona-disaster-recovery/recovery-plan-cloud.yaml
else:
kubectl --context nova create -f examples/ferretdb-percona-disaster-recovery/recovery-plan-kind.yaml
In production systems, alerts will be sent to Nova through the recovery webhook by a metrics service such as Prometheus with Alertmanager. For the sake of this tutorial, we will simulate receiving an alert by adding it to Nova directly. When the alert is added, Nova looks for a recovery plan by matching the alert's labels against each plan's alertLabels. Once it finds the matching recovery plan, it executes it.
kubectl --context nova create -f examples/percona-disaster-recovery/received-alert.yaml
Let's verify that the recovery succeeded:
Check if cluster 1 (which we assume has failed in this tutorial) is set to standby:
kubectl wait perconapgclusters/cluster1 -n psql-operator --context nova '--for=jsonpath={.spec.standby.enabled}'=true --timeout=180s
Check if cluster 2 took over the primary role (standby set to false):
kubectl wait perconapgclusters/cluster2 -n psql-operator --context nova '--for=jsonpath={.spec.standby.enabled}'=false --timeout=180s
Check if HAProxy now points to the FerretDB service in cluster 2, which is connected to PostgreSQL cluster 2, the new primary:
kubectl get cm/haproxy-config --context nova -n default -o jsonpath='{.data.haproxy\.cfg}' | grep 'server db2'
The output should look similar to:
server db2 172.18.255.200:27017 check
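As a final cross-check, the address HAProxy now targets should match the FerretDB service in cluster 2; the two commands below should print the same host (cloud variant shown; for the local setup use the jsonpath from the kind recovery plan instead):
kubectl get service/ferretdb-service-2 -n default --context nova -o=jsonpath='{.status.loadBalancer.ingress[0].hostname}'
kubectl get cm/haproxy-config --context nova -n default -o jsonpath='{.data.haproxy\.cfg}' | grep 'server db2'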
Cleanup
kubectl --context nova delete -f examples/percona-disaster-recovery/received-alert.yaml
kubectl --context nova delete -f examples/ferretdb-percona-disaster-recovery/recovery-plan-cloud.yaml # or recovery-plan-kind.yaml if you used the local variant
kubectl --context nova delete -f examples/ferretdb-percona-disaster-recovery/haproxy.yaml
kubectl --context nova create configmap haproxy-config --from-file=haproxy.cfg=examples/ferretdb-percona-disaster-recovery/haproxy.cfg --dry-run=client -o yaml | python3 add_labels.py namespace psql-cluster cluster-ha-proxy | kubectl --context nova delete -f -
kubectl --context nova delete -f examples/ferretdb-percona-disaster-recovery/ferretdb.yaml
cat examples/ferretdb-percona-disaster-recovery/cluster_1_cr.yaml | python3 add_labels.py namespace psql-cluster cluster-1 | kubectl --context nova --namespace psql-operator delete -f -
cat examples/ferretdb-percona-disaster-recovery/cluster_2_cr.yaml | python3 add_labels.py namespace psql-cluster cluster-2 | kubectl --context nova --namespace psql-operator delete -f -
cat percona-postgresql-operator/deploy/bundle.yaml | python3 add_labels.py namespace psql-cluster all | python3 add_labels.py cluster psql-cluster all | kubectl --context nova --namespace psql-operator delete -f -
cat examples/percona-disaster-recovery/s3-access-secret.yaml | python3 add_labels.py namespace psql-cluster all | kubectl --context nova delete -f -
kubectl --context nova create ns psql-operator --dry-run=client -o yaml | yq e ".metadata.labels.psql-cluster = \"all\"" | kubectl --context nova delete -f -
kubectl --context nova delete -f examples/percona-disaster-recovery/schedule-policies.yaml
kubectl --context nova delete -f examples/ferretdb-percona-disaster-recovery/policy-spread-ferretdb.yaml
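If you also want to remove the AWS resources created at the beginning of this tutorial, something along these lines will do it; emptying the bucket also satisfies the earlier NOTE about rerunning the tutorial. Replace the access key id with the one you noted down earlier.
ACCESS_KEY_ID=<AccessKeyId noted earlier>
aws iam detach-user-policy --user-name s3-backup-service-account --policy-arn $POLICYARN
aws iam delete-access-key --user-name s3-backup-service-account --access-key-id $ACCESS_KEY_ID
aws iam delete-user --user-name s3-backup-service-account
aws iam delete-policy --policy-arn $POLICYARN
aws s3 rm s3://nova-ferretdb-backup --recursive
aws s3api delete-bucket --bucket nova-ferretdb-backup --region $REGION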