Restore etcd from Backup on cloudscale.ch

Steps to recover etcd on an OpenShift 4 cluster on cloudscale.ch.

Restoring to a previous cluster state is a destructive and destabilizing action to take on a running cluster. This should only be used as a last resort.

If you are able to retrieve data using the Kubernetes API server, then etcd is available and you shouldn’t restore using an etcd backup.
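A quick way to verify this is to query the cluster before deciding to restore. This is a minimal check and assumes oc is installed and a kubeconfig for the affected cluster is at hand:

# If this returns data, the API server and etcd are responsive and a restore is likely unnecessary
oc --kubeconfig <path-to-kubeconfig> get nodes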

Starting situation

  • You have an OpenShift 4 cluster on cloudscale.ch

  • One of the following scenarios is true:

    • The cluster has lost the majority of its control plane hosts (quorum loss).

    • An administrator has deleted a critical component which can’t be restored from the object backup.

Prerequisites

The following CLI utilities need to be available locally (all of them are used in the commands below):

  • commodore
  • curl
  • jq
  • yq
  • git
  • vault (Vault CLI)
  • restic
  • ssh and scp (OpenSSH)
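A quick way to confirm everything is installed (a minimal sketch; adjust the list if your environment differs):

for tool in commodore curl jq yq git vault restic ssh scp; do
  command -v "$tool" >/dev/null || echo "missing: $tool"
done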

Access and Download Backup

Access to various APIs
# For example: https://api.syn.vshn.net
# IMPORTANT: do NOT add a trailing `/`. Commands below will fail.
export COMMODORE_API_URL=<lieutenant-api-endpoint>

# Set Project Syn cluster and tenant ID
export CLUSTER_ID=<lieutenant-cluster-id> # Looks like: c-<something>
export TENANT_ID=$(curl -sH "Authorization: Bearer $(commodore fetch-token)" ${COMMODORE_API_URL}/clusters/${CLUSTER_ID} | jq -r .tenant)
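As a quick sanity check, the tenant ID should be non-empty (Project Syn tenant IDs typically look like t-<something>):

echo $TENANT_ID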
Fetch backup URL from cluster repo
GIT_REPO=$(curl -sH "Authorization: Bearer $(commodore fetch-token)" ${COMMODORE_API_URL}/clusters/${CLUSTER_ID} | jq -r .gitRepo.url)
git clone --depth 1 $GIT_REPO cluster-repo
RESTIC_ENDPOINT=$(find cluster-repo/manifests/cluster-backup -name '*.yaml' -exec yq eval-all 'select(.kind == "Schedule" and .metadata.name == "etcd" ) | .spec.backend.s3.endpoint' {} \;)
RESTIC_BUCKET=$(find cluster-repo/manifests/cluster-backup -name '*.yaml' -exec yq eval-all 'select(.kind == "Schedule" and .metadata.name == "etcd" ) | .spec.backend.s3.bucket' {} \;)
export RESTIC_REPOSITORY="s3:${RESTIC_ENDPOINT}/${RESTIC_BUCKET}"
echo $RESTIC_REPOSITORY
rm -rf cluster-repo
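The value printed above combines the S3 endpoint and bucket; the exact endpoint and bucket name depend on the cluster, so the shape below is only illustrative:

# Example shape of the repository string:
# s3:https://objects.rma.cloudscale.ch/<bucket-name>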
Connect with Vault
export VAULT_ADDR=https://vault-prod.syn.vshn.net
vault login -method=oidc
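To confirm the login worked before fetching secrets, you can inspect the current token (this only verifies authentication, not permissions on the backup paths):

vault token lookup >/dev/null && echo "Vault login OK"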
Fetch backup secrets from Vault
export RESTIC_PASSWORD=$(vault kv get \
  -format=json "clusters/kv/${TENANT_ID}/${CLUSTER_ID}/cluster-backup" | jq -r '.data.data.password')
export AWS_ACCESS_KEY_ID=$(vault kv get \
  -format=json "clusters/kv/${TENANT_ID}/${CLUSTER_ID}/cloudscale" | jq -r '.data.data.s3_access_key')
export AWS_SECRET_ACCESS_KEY=$(vault kv get \
  -format=json "clusters/kv/${TENANT_ID}/${CLUSTER_ID}/cloudscale" | jq -r '.data.data.s3_secret_key')
Download files from latest etcd snapshot
TEMP_DIR=$(mktemp -d)
pushd ${TEMP_DIR}
SNAPSHOT_ID=$(restic snapshots --json --latest=1 --path /syn-cluster-backup-etcd-etcd-backup.tar.gz | jq -r '.[0].id')
restic dump "${SNAPSHOT_ID}" /syn-cluster-backup-etcd-etcd-backup.tar.gz | tar xzv
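The extracted archive should contain the etcd snapshot and the static pod resources that are uploaded to the master node in the next step (file names carry a timestamp, so the exact names vary):

ls -lh snapshot_*.db static_kuberesources_*.tar.gz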

Connect to Master Node by SSH

Fetch the SSH key
vault kv get -format=json clusters/kv/${TENANT_ID}/${CLUSTER_ID}/cloudscale/ssh \
  | jq -r '.data.data.private_key' | base64 --decode > ssh_key
chmod 400 ssh_key
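To verify that the key decoded correctly, derive its public key from the private key file; ssh-keygen reports an error if the file is corrupt (a quick sanity check):

ssh-keygen -y -f ssh_key >/dev/null && echo "SSH key OK"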
The following steps are VSHN-specific.
Find load balancer host
LB_HOST=$(grep -E "^Host.*${CLUSTER_ID}" ~/.ssh/sshop_config | head -1 | awk '{print $2}')
echo $LB_HOST
Ensure your SSH config is up to date by running sshop_update.
Upload recovery files to master node
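# Pick a healthy master node as the restore target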
MASTER_NODE=etcd-0
scp -J "${LB_HOST}" -i ssh_key static_kuberesources_*.tar.gz snapshot_*.db "core@${MASTER_NODE}:"
Connect to master node
ssh -J "${LB_HOST}" -i ssh_key "core@${MASTER_NODE}"
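Once connected, it's worth confirming that the uploaded files arrived in the core user's home directory (a minimal check run on the node itself):

ls -lh ~/snapshot_*.db ~/static_kuberesources_*.tar.gz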

Restore etcd

You should now have:

  • An SSH connection to a healthy master node

  • The etcd backup archive

Refer to the OpenShift 4 Disaster Recovery Guide for the remaining steps.
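For orientation only: in the upstream procedure, the backup files are gathered into a single directory on the recovery host and passed to the cluster-restore.sh script shipped with OpenShift 4. Treat the following as a sketch under that assumption and follow the guide matching your OpenShift version; the directory name is an arbitrary choice:

# On the master node, as the core user:
mkdir -p /home/core/backup
mv /home/core/snapshot_*.db /home/core/static_kuberesources_*.tar.gz /home/core/backup/
sudo -E /usr/local/bin/cluster-restore.sh /home/core/backup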