Replace a storage node
Steps to replace a storage node of an OpenShift 4 cluster on cloudscale.ch.
Starting situation
-
You already have an OpenShift 4 cluster on cloudscale.ch
-
You have admin-level access to the cluster
-
The cluster is already running the APPUiO Managed Storage Cluster addon (Rook Ceph).
-
You want to replace an existing storage node in the storage cluster with a new storage node
Prerequisites
The following CLI utilities need to be available locally:
-
docker
-
curl
-
kubectl
-
oc
-
vault (Vault CLI)
-
commodore, see Running Commodore
-
jq
-
yq (yq YAML processor, version 4 or higher)
-
macOS: gdate from GNU coreutils, brew install coreutils
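Before starting, you may want to confirm that these tools are actually available. The loop below is a small convenience check, not part of the original procedure; adjust the tool list to your environment.
# Convenience check: report any missing CLI utilities
for tool in docker curl kubectl oc vault commodore jq yq; do
  command -v "$tool" >/dev/null 2>&1 || echo "Missing required tool: $tool"
done
# yq must be version 4 or higher
yq --version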
Prepare local environment
-
Create local directory to work in
We strongly recommend creating an empty directory, unless you already have a work directory for the cluster you’re about to work on. This guide will run Commodore in the directory created in this step.
export WORK_DIR=/path/to/work/dir
mkdir -p "${WORK_DIR}"
pushd "${WORK_DIR}"
-
Configure API access
Access to cloud API
# From https://control.cloudscale.ch/service/<your-project>/api-token
export CLOUDSCALE_API_TOKEN=<cloudscale-api-token>
Access to VSHN GitLab
# From https://git.vshn.net/-/user_settings/personal_access_tokens, "api" scope is sufficient
export GITLAB_TOKEN=<gitlab-api-token>
export GITLAB_USER=<gitlab-user-name>
# For example: https://api.syn.vshn.net
# IMPORTANT: do NOT add a trailing `/`, otherwise the commands below will fail.
export COMMODORE_API_URL=<lieutenant-api-endpoint>
# Set Project Syn cluster and tenant ID
export CLUSTER_ID=<lieutenant-cluster-id> # Looks like: c-<something>
export TENANT_ID=$(curl -sH "Authorization: Bearer $(commodore fetch-token)" ${COMMODORE_API_URL}/clusters/${CLUSTER_ID} | jq -r .tenant)
export GIT_AUTHOR_NAME=$(git config --global user.name)
export GIT_AUTHOR_EMAIL=$(git config --global user.email)
export TF_VAR_control_vshn_net_token=<control-vshn-net-token> # use your personal SERVERS API token from https://control.vshn.net/tokens
-
Get required tokens from Vault
Connect with Vault
export VAULT_ADDR=https://vault-prod.syn.vshn.net
vault login -method=oidc
Grab the LB hieradata repo token from Vault
export HIERADATA_REPO_SECRET=$(vault kv get \
  -format=json "clusters/kv/lbaas/hieradata_repo_token" | jq '.data.data')
export HIERADATA_REPO_USER=$(echo "${HIERADATA_REPO_SECRET}" | jq -r '.user')
export HIERADATA_REPO_TOKEN=$(echo "${HIERADATA_REPO_SECRET}" | jq -r '.token')
Get Floaty credentials
export TF_VAR_lb_cloudscale_api_secret=$(vault kv get \
  -format=json "clusters/kv/${TENANT_ID}/${CLUSTER_ID}/floaty" | jq -r '.data.data.iam_secret')
-
Compile the catalog for the cluster. Having the catalog available locally enables us to run Terraform for the cluster to make any required changes.
commodore catalog compile "${CLUSTER_ID}"
Prepare Terraform environment
-
Configure Terraform secrets
cat <<EOF > ./terraform.env
CLOUDSCALE_API_TOKEN
TF_VAR_ignition_bootstrap
TF_VAR_lb_cloudscale_api_secret
TF_VAR_control_vshn_net_token
GIT_AUTHOR_NAME
GIT_AUTHOR_EMAIL
HIERADATA_REPO_TOKEN
EOF
-
Set up Terraform
Prepare Terraform execution environment
# Set terraform image and tag to be used
tf_image=$(\
  yq eval ".parameters.openshift4_terraform.images.terraform.image" \
  dependencies/openshift4-terraform/class/defaults.yml)
tf_tag=$(\
  yq eval ".parameters.openshift4_terraform.images.terraform.tag" \
  dependencies/openshift4-terraform/class/defaults.yml)

# Generate the terraform alias
base_dir=$(pwd)
alias terraform='touch .terraformrc; docker run -it --rm \
  -e REAL_UID=$(id -u) \
  -e TF_CLI_CONFIG_FILE=/tf/.terraformrc \
  --env-file ${base_dir}/terraform.env \
  -w /tf \
  -v $(pwd):/tf \
  --ulimit memlock=-1 \
  "${tf_image}:${tf_tag}" /tf/terraform.sh'

export GITLAB_REPOSITORY_URL=$(curl -sH "Authorization: Bearer $(commodore fetch-token)" ${COMMODORE_API_URL}/clusters/${CLUSTER_ID} | jq -r '.gitRepo.url' | sed 's|ssh://||; s|/|:|')
export GITLAB_REPOSITORY_NAME=${GITLAB_REPOSITORY_URL##*/}
export GITLAB_CATALOG_PROJECT_ID=$(curl -sH "Authorization: Bearer ${GITLAB_TOKEN}" "https://git.vshn.net/api/v4/projects?simple=true&search=${GITLAB_REPOSITORY_NAME/.git}" | jq -r ".[] | select(.ssh_url_to_repo == \"${GITLAB_REPOSITORY_URL}\") | .id")
export GITLAB_STATE_URL="https://git.vshn.net/api/v4/projects/${GITLAB_CATALOG_PROJECT_ID}/terraform/state/cluster"

pushd catalog/manifests/openshift4-terraform/
Initialize Terraform
terraform init \
  "-backend-config=address=${GITLAB_STATE_URL}" \
  "-backend-config=lock_address=${GITLAB_STATE_URL}/lock" \
  "-backend-config=unlock_address=${GITLAB_STATE_URL}/lock" \
  "-backend-config=username=${GITLAB_USER}" \
  "-backend-config=password=${GITLAB_TOKEN}" \
  "-backend-config=lock_method=POST" \
  "-backend-config=unlock_method=DELETE" \
  "-backend-config=retry_wait_min=5"
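Optionally, you can run a plan at this point to confirm that the compiled catalog and the remote Terraform state agree before touching the state. This is not part of the original procedure; the plan should show no unexpected changes.
terraform plan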
Set alert silence and pause ArgoCD
-
Set a silence in Alertmanager for all rook-ceph alerts
if [[ "$OSTYPE" == "darwin"* ]]; then alias date=gdate; fi job_name=$(printf "POST-silence-rook-ceph-alerts-$(date +%s)" | tr '[:upper:]' '[:lower:]') silence_duration='+60 minutes' (1) kubectl --as=cluster-admin -n openshift-monitoring create -f- <<EOJ apiVersion: batch/v1 kind: Job metadata: name: ${job_name} labels: app: silence-rook-ceph-alerts spec: backoffLimit: 0 template: spec: restartPolicy: Never containers: - name: silence image: quay.io/appuio/oc:v4.13 command: - bash - -c - | curl_opts=( --cacert /etc/ssl/certs/serving-certs/service-ca.crt --header "Content-Type: application/json" --header "Authorization: Bearer \$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" --resolve alertmanager-main.openshift-monitoring.svc.cluster.local:9095:\$(getent hosts alertmanager-operated.openshift-monitoring.svc.cluster.local | awk '{print \$1}' | head -n 1) --silent ) read -d "" body << EOF { "matchers": [ { "name": "syn_component", "value": "rook-ceph", "isRegex": false } ], "startsAt": "$(date -u +'%Y-%m-%dT%H:%M:%S')", "endsAt": "$(date -u +'%Y-%m-%dT%H:%M:%S' --date "${silence_duration}")", "createdBy": "$(kubectl config current-context | cut -d/ -f3)", "comment": "Silence rook-ceph alerts" } EOF curl "\${curl_opts[@]}" \ "https://alertmanager-main.openshift-monitoring.svc.cluster.local:9095/api/v2/silences" \ -XPOST -d "\${body}" volumeMounts: - mountPath: /etc/ssl/certs/serving-certs/ name: ca-bundle readOnly: true - mountPath: /var/run/secrets/kubernetes.io/serviceaccount name: kube-api-access readOnly: true serviceAccountName: prometheus-k8s volumes: - name: ca-bundle configMap: defaultMode: 288 name: serving-certs-ca-bundle - name: kube-api-access projected: defaultMode: 420 sources: - serviceAccountToken: expirationSeconds: 3607 path: 'token' EOJ
(1) Adjust this variable to create a longer or shorter silence.
-
Extract Alertmanager silence ID from job logs
silence_id=$(kubectl --as=cluster-admin -n openshift-monitoring logs jobs/${job_name} | \
  jq -r '.silenceID')
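As a quick sanity check (not part of the original runbook), make sure the extracted silence ID is non-empty before continuing; if it isn't, inspect the job logs.
echo "Alertmanager silence ID: ${silence_id}"
if [ -z "${silence_id}" ] || [ "${silence_id}" = "null" ]; then
  # Something went wrong; look at the raw job output
  kubectl --as=cluster-admin -n openshift-monitoring logs "jobs/${job_name}"
fi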
-
Disable auto sync for component rook-ceph. This allows us to temporarily make manual changes to the Rook Ceph cluster.
kubectl --as=cluster-admin -n syn patch apps root --type=json \
  -p '[{"op":"replace", "path":"/spec/syncPolicy", "value": {}}]'
kubectl --as=cluster-admin -n syn patch apps rook-ceph --type=json \
  -p '[{"op":"replace", "path":"/spec/syncPolicy", "value": {}}]'
Replace node
-
Make a note of the node you want to replace
export NODE_TO_REPLACE=storage-XXXX
Create a new node
-
Find Terraform resource index of the node to replace
TF_MODULE='module.cluster.module.additional_worker["storage"]' (1)
(1) Select the correct worker group. This guide assumes that your storage nodes are part of an additional worker group called "storage".
# Grab JSON copy of current Terraform state
terraform state pull > .tfstate.json

node_index=$(jq --arg tfmodule "${TF_MODULE}" --arg storage_node "${NODE_TO_REPLACE}" -r \
  '.resources[] | select(.module==$tfmodule and .type=="random_id") | .instances[] | select(.attributes.hex==$storage_node) | .index_key' \
  .tfstate.json)
-
Verify that resource index is correct
jq --arg tfmodule "${TF_MODULE}" --arg index "${node_index}" -r \
  '.resources[] | select(.module==$tfmodule and .type=="cloudscale_server") | .instances[$index|tonumber] | .attributes.name' \
  .tfstate.json
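The command above should print the name of the node stored in ${NODE_TO_REPLACE}. If you prefer an explicit comparison, the following sketch (not part of the original guide) assumes the server name in the Terraform state starts with the node name:
found_name=$(jq --arg tfmodule "${TF_MODULE}" --arg index "${node_index}" -r \
  '.resources[] | select(.module==$tfmodule and .type=="cloudscale_server") | .instances[$index|tonumber] | .attributes.name' \
  .tfstate.json)
# Assumption: the name in the state may include the cluster domain, so match on the prefix
case "${found_name}" in
  "${NODE_TO_REPLACE}"*) echo "OK: index ${node_index} -> ${found_name}" ;;
  *) echo "MISMATCH: index ${node_index} -> ${found_name}, expected ${NODE_TO_REPLACE}" ;;
esac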
-
Remove the node ID and node resource of the node we want to replace from the Terraform state
terraform state rm "${TF_MODULE}.random_id.node[$node_index]"
terraform state rm "${TF_MODULE}.cloudscale_server.node[$node_index]"
-
Run Terraform to spin up a replacement node
terraform apply
-
Approve node cert for new storage node
# Once CSRs in state Pending show up, approve them
# Needs to be run twice, two CSRs for each node need to be approved
kubectl --as=cluster-admin get csr -w

oc --as=cluster-admin get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | \
  xargs oc --as=cluster-admin adm certificate approve

kubectl --as=cluster-admin get nodes
-
Label and taint the new storage node
kubectl get node -ojson | \
  jq -r '.items[] | select(.metadata.name | test("storage-")).metadata.name' | \
  xargs -I {} kubectl --as=cluster-admin label node {} node-role.kubernetes.io/storage=

kubectl --as=cluster-admin taint node -lnode-role.kubernetes.io/storage \
  storagenode=True:NoSchedule
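To double-check that the label and taint are in place (an optional verification, not part of the original steps):
# All storage nodes, including the new one, should be listed with the storagenode taint
kubectl --as=cluster-admin get nodes -l node-role.kubernetes.io/storage \
  -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'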
Remove the old MON
-
Find the MON(s) (if any) on the node(s) to replace
MON_IDS="" for node in $(echo -n $NODE_TO_REPLACE); do mon_id=$(kubectl --as=cluster-admin -n syn-rook-ceph-cluster get pods \ -lapp=rook-ceph-mon --field-selector="spec.nodeName=${node}" \ --no-headers -ocustom-columns="MON_ID:.metadata.labels.ceph_daemon_id") MON_IDS="$MON_IDS $mon_id" done echo $MON_IDS
You can skip the remaining steps in this section if $MON_IDS is empty.
-
Temporarily adjust the Rook MON failover timeout. This tells the operator to perform the MON failover after less time than the default 10 minutes.
kubectl --as=cluster-admin -n syn-rook-ceph-cluster patch cephcluster cluster --type=json \
  -p '[{
    "op": "replace",
    "path": "/spec/healthCheck/daemonHealth/mon",
    "value": {
      "disabled": false,
      "interval": "10s",
      "timeout": "10s"
    }
  }]'
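If you want to confirm that the patch was applied (optional, not part of the original runbook), print the MON health check settings; they should show the 10s interval and timeout:
kubectl --as=cluster-admin -n syn-rook-ceph-cluster get cephcluster cluster \
  -o jsonpath='{.spec.healthCheck.daemonHealth.mon}{"\n"}'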
-
Cordon node(s) to replace
for node in $(echo -n $NODE_TO_REPLACE); do
  kubectl --as=cluster-admin cordon "${node}"
done
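Optionally verify that the cordon took effect; the node(s) should report SchedulingDisabled:
kubectl --as=cluster-admin get nodes ${NODE_TO_REPLACE}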
-
For every ID in $MON_IDS, replace the MON pod
mon_id=<MON_ID>
kubectl --as=cluster-admin -n syn-rook-ceph-cluster delete pod \
  -l app=rook-ceph-mon,ceph_daemon_id="${mon_id}"

# Wait until new MON is scheduled
kubectl --as=cluster-admin -n syn-rook-ceph-cluster get pods -w

# Wait until the cluster has regained full quorum
kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
  ceph status

# Repeat for all other $MON_IDS
-
Verify that three MONs are running
kubectl --as=cluster-admin -n syn-rook-ceph-cluster get deploy -l app=rook-ceph-mon
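For an additional confirmation from Ceph itself (optional, via the toolbox), the monitor map should show three MONs in quorum:
kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
  ceph mon stat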
Clean up the old node
-
Drain the node(s)
for node in $(echo -n ${NODE_TO_REPLACE}); do
  kubectl --as=cluster-admin drain "${node}" \
    --delete-emptydir-data --ignore-daemonsets
done
On cloudscale.ch, we configure Rook Ceph to set up the OSDs in "portable" mode. This configuration enables OSDs to be scheduled on any storage node.
With this configuration, we don’t have to migrate OSDs hosted on the old node(s) manually. Instead, draining a node will cause any OSDs hosted on that node to be rescheduled on other storage nodes.
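If you'd like to confirm that the OSDs from the drained node(s) have come back up elsewhere, a quick check via the Rook toolbox (optional, not part of the original procedure) is:
# All OSDs should be reported as "up" and "in", and health should return to HEALTH_OK
kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
  ceph osd tree
kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
  ceph status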
-
Delete the node(s) from the cluster
for node in $(echo -n ${NODE_TO_REPLACE}); do
  kubectl --as=cluster-admin delete node "${node}"
done
-
Remove the cloudscale.ch VM(s)
for node in $(echo -n ${NODE_TO_REPLACE}); do
  node_id=$(curl -sH "Authorization: Bearer ${CLOUDSCALE_API_TOKEN}" \
    https://api.cloudscale.ch/v1/servers | \
    jq --arg storage_node "$node" -r \
    '.[] | select(.name|startswith($storage_node)) | .uuid')
  echo "Removing node:"
  curl -sH "Authorization: Bearer ${CLOUDSCALE_API_TOKEN}" \
    "https://api.cloudscale.ch/v1/servers/${node_id}" |\
    jq -r '.name'
  curl -XDELETE -H "Authorization: Bearer ${CLOUDSCALE_API_TOKEN}" \
    "https://api.cloudscale.ch/v1/servers/${node_id}"
done
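To verify the VM(s) are really gone (an optional check against the cloudscale.ch API, not part of the original steps), the query below should produce no output:
for node in $(echo -n ${NODE_TO_REPLACE}); do
  curl -sH "Authorization: Bearer ${CLOUDSCALE_API_TOKEN}" \
    https://api.cloudscale.ch/v1/servers | \
    jq --arg storage_node "$node" -r \
    '.[] | select(.name|startswith($storage_node)) | .name'
done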
Finish up
-
Remove silence in Alertmanager
if [[ "$OSTYPE" == "darwin"* ]]; then alias date=gdate; fi job_name=$(printf "DELETE-silence-rook-ceph-alerts-$(date +%s)" | tr '[:upper:]' '[:lower:]') kubectl --as=cluster-admin -n openshift-monitoring create -f- <<EOJ apiVersion: batch/v1 kind: Job metadata: name: ${job_name} labels: app: silence-rook-ceph-alerts spec: backoffLimit: 0 template: spec: restartPolicy: Never containers: - name: silence image: quay.io/appuio/oc:v4.13 command: - bash - -c - | curl_opts=( --cacert /etc/ssl/certs/serving-certs/service-ca.crt --header "Content-Type: application/json" --header "Authorization: Bearer \$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" --resolve alertmanager-main.openshift-monitoring.svc.cluster.local:9095:\$(getent hosts alertmanager-operated.openshift-monitoring.svc.cluster.local | awk '{print \$1}' | head -n 1) --silent ) curl "\${curl_opts[@]}" \ "https://alertmanager-main.openshift-monitoring.svc.cluster.local:9095/api/v2/silence/${silence_id}" \ -XDELETE volumeMounts: - mountPath: /etc/ssl/certs/serving-certs/ name: ca-bundle readOnly: true - mountPath: /var/run/secrets/kubernetes.io/serviceaccount name: kube-api-access readOnly: true serviceAccountName: prometheus-k8s volumes: - name: ca-bundle configMap: defaultMode: 288 name: serving-certs-ca-bundle - name: kube-api-access projected: defaultMode: 420 sources: - serviceAccountToken: expirationSeconds: 3607 path: 'token' EOJ
-
Clean up Alertmanager silence jobs
kubectl --as=cluster-admin -n openshift-monitoring delete jobs -l app=silence-rook-ceph-alerts
-
Re-enable ArgoCD auto sync
kubectl --as=cluster-admin -n syn patch apps root --type=json \
  -p '[{
    "op":"replace",
    "path":"/spec/syncPolicy",
    "value": {"automated": {"prune": true, "selfHeal": true}}
  }]'
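As a final optional check (not part of the original runbook), confirm that the sync policy on the root app is back to automated; the rook-ceph app's policy should be restored once the root app has synced again.
kubectl --as=cluster-admin -n syn get app root -o jsonpath='{.spec.syncPolicy}{"\n"}'
kubectl --as=cluster-admin -n syn get app rook-ceph -o jsonpath='{.spec.syncPolicy}{"\n"}'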