Restore etcd from Backup on Exoscale
Steps to recover etcd on an OpenShift 4 cluster on Exoscale.
Restoring to a previous cluster state is a destructive and destabilizing action to take on a running cluster. It should only be used as a last resort. If you are able to retrieve data using the Kubernetes API server, then etcd is available and you shouldn’t restore using an etcd backup.
Starting situation
- You have an OpenShift 4 cluster on Exoscale
- One of the following scenarios is true:
  - The cluster has lost the majority of its control plane hosts (quorum loss).
  - An administrator has deleted a critical component which can’t be restored from the object backup.
Prerequisites
The following CLI utilities need to be available locally:
- restic (Restic Backup)
- kubectl
- vault (Vault CLI)
- commodore, see Running Commodore
- git
- jq
- yq (YAML processor, version 4 or higher)
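Before starting, you can verify that all of these tools are available on your PATH. This is a convenience sketch and not part of the official procedure:

# Pre-flight check: report any missing CLI tools
for tool in restic kubectl vault commodore git jq yq; do
  command -v "$tool" >/dev/null || echo "missing: $tool"
done
# yq must be version 4 or higher
yq --version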
Access and Download Backup
# For example: https://api.syn.vshn.net
# IMPORTANT: do NOT add a trailing `/`, or the commands below will fail.
export COMMODORE_API_URL=<lieutenant-api-endpoint>
# Set Project Syn cluster and tenant ID
export CLUSTER_ID=<lieutenant-cluster-id> # Looks like: c-<something>
export TENANT_ID=$(curl -sH "Authorization: Bearer $(commodore fetch-token)" ${COMMODORE_API_URL}/clusters/${CLUSTER_ID} | jq -r .tenant)
GIT_REPO=$(curl -sH "Authorization: Bearer $(commodore fetch-token)" ${COMMODORE_API_URL}/clusters/${CLUSTER_ID} | jq -r .gitRepo.url)
git clone --depth 1 $GIT_REPO cluster-repo
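# Extract the Restic S3 endpoint and bucket from the etcd backup Schedule in the cluster catalog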
RESTIC_ENDPOINT=$(find cluster-repo/manifests/cluster-backup -name '*.yaml' -exec yq eval-all 'select(.kind == "Schedule" and .metadata.name == "etcd" ) | .spec.backend.s3.endpoint' {} \;)
RESTIC_BUCKET=$(find cluster-repo/manifests/cluster-backup -name '*.yaml' -exec yq eval-all 'select(.kind == "Schedule" and .metadata.name == "etcd" ) | .spec.backend.s3.bucket' {} \;)
export RESTIC_REPOSITORY="s3:${RESTIC_ENDPOINT}/${RESTIC_BUCKET}"
echo $RESTIC_REPOSITORY
rm -rf cluster-repo
export VAULT_ADDR=https://vault-prod.syn.vshn.net
vault login -method=oidc
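# Fetch the Restic repository password and the S3 credentials from Vault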
export RESTIC_PASSWORD=$(vault kv get \
-format=json "clusters/kv/${TENANT_ID}/${CLUSTER_ID}/cluster-backup" | jq -r '.data.data.password')
export AWS_ACCESS_KEY_ID=$(vault kv get \
-format=json "clusters/kv/${TENANT_ID}/${CLUSTER_ID}/exoscale/storage_iam" | jq -r '.data.data.s3_access_key')
export AWS_SECRET_ACCESS_KEY=$(vault kv get \
-format=json "clusters/kv/${TENANT_ID}/${CLUSTER_ID}/exoscale/storage_iam" | jq -r '.data.data.s3_secret_key')
TEMP_DIR=$(mktemp -d)
pushd ${TEMP_DIR}
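# Find the latest snapshot that contains the etcd backup archive and extract it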
SNAPSHOT_ID=$(restic snapshots --json --latest=1 --path /syn-cluster-backup-etcd-etcd-backup.tar.gz | jq -r '.[0].id')
restic dump "${SNAPSHOT_ID}" /syn-cluster-backup-etcd-etcd-backup.tar.gz | tar xzv
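If the dump succeeded, the working directory should now contain the static Kubernetes resources archive and the etcd snapshot that are copied to the master node below. A quick, optional check:

# List the extracted backup files before copying them to the master node
ls -lh static_kuberesources_*.tar.gz snapshot_*.db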
Connect to Master Node by SSH
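# Fetch the SSH private key for the cluster nodes from Vault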
vault kv get -format=json clusters/kv/${TENANT_ID}/${CLUSTER_ID}/exoscale/ssh \
| jq -r '.data.data.private_key' | base64 --decode > ssh_key
chmod 400 ssh_key
The following steps are VSHN-specific.
LB_HOST=$(grep -E "^Host.*${CLUSTER_ID}" ~/.ssh/sshop_config | head -1 | awk '{print $2}')
echo $LB_HOST
Ensure your SSH config is up-to-date by running sshop_update.
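# Pick a healthy master node to connect to; adjust the name to match your cluster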
MASTER_NODE=etcd-0
scp -J "${LB_HOST}" -i ssh_key static_kuberesources_*.tar.gz snapshot_*.db "core@${MASTER_NODE}:"
ssh -J "${LB_HOST}" -i ssh_key "core@${MASTER_NODE}"
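Once logged in, you can optionally confirm that the backup files arrived on the node before proceeding. This is a quick sanity check, not part of the official procedure:

ls -lh ~/snapshot_*.db ~/static_kuberesources_*.tar.gz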
Restore etcd
You should now have:
- An SSH connection to a healthy master node
- The etcd backup files on that node
Refer to the OpenShift 4 Disaster Recovery Guide for further steps.
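For orientation only, the restore on the recovery node typically boils down to handing the copied files to the cluster-restore.sh script shipped with OpenShift. The directory layout and script behavior vary between OpenShift versions, so treat this as a sketch and follow the linked guide:

# Run on the master node as the core user (example paths; the DR guide for your version is authoritative)
mkdir -p /home/core/assets/backup
mv ~/snapshot_*.db ~/static_kuberesources_*.tar.gz /home/core/assets/backup/
sudo -E /usr/local/bin/cluster-restore.sh /home/core/assets/backup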