Remove a storage node
Steps to remove a storage node of an OpenShift 4 cluster on Exoscale.
Starting situation
- You already have an OpenShift 4 cluster on Exoscale
- You have admin-level access to the cluster
- You want to remove an existing storage node in the cluster
Prerequisites
The following CLI utilities need to be available locally:
- docker
- curl
- kubectl
- oc
- exo >= v1.28.0 (Exoscale CLI)
- vault (Vault CLI)
- commodore, see Running Commodore
- jq
- yq (yq YAML processor, version 4 or higher)
- macOS: gdate from GNU coreutils (brew install coreutils)
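A quick way to confirm that the utilities listed above are on your PATH is sketched below. The helper name `missing_tools` is hypothetical and not part of any of the listed tools. The check only looks for the binaries; it does not verify the minimum versions required for exo and yq.

```shell
# Print the name of every given tool that is not found on PATH, one per line.
# NOTE: missing_tools is an illustrative helper, not part of this guide's tooling.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Tools from the prerequisites list; gdate is only needed on macOS.
required_tools="docker curl kubectl oc exo vault commodore jq yq"
case "$(uname -s)" in
  Darwin) required_tools="$required_tools gdate" ;;
esac

# Report anything that is missing without aborting the shell.
# Word splitting of $required_tools is intended here.
missing=$(missing_tools $required_tools)
if [ -n "$missing" ]; then
  printf 'Missing CLI tools:\n%s\n' "$missing" >&2
fi
```

If nothing is printed, all listed binaries were found.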
Prepare local environment
- Create local directory to work in

We strongly recommend creating an empty directory, unless you already have a work directory for the cluster you’re about to work on. This guide will run Commodore in the directory created in this step.

export WORK_DIR=/path/to/work/dir
mkdir -p "${WORK_DIR}"
pushd "${WORK_DIR}"

- Configure API access

Access to cloud API

export EXOSCALE_API_KEY=<exoscale-key> # (1)
export EXOSCALE_API_SECRET=<exoscale-secret>
export EXOSCALE_ZONE=<exoscale-zone> # (2)
export EXOSCALE_S3_ENDPOINT="sos-${EXOSCALE_ZONE}.exo.io"

1 We recommend setting up an IAMv3 role called unrestricted with "Default Service Strategy" set to allow if it doesn’t exist yet.
2 All lower case. For example ch-dk-2.

Access to VSHN GitLab

# From https://git.vshn.net/-/user_settings/personal_access_tokens, "api" scope is sufficient
export GITLAB_TOKEN=<gitlab-api-token>
export GITLAB_USER=<gitlab-user-name>

# For example: https://api.syn.vshn.net
# IMPORTANT: do NOT add a trailing `/`. Commands below will fail.
export COMMODORE_API_URL=<lieutenant-api-endpoint>
# Set Project Syn cluster and tenant ID
export CLUSTER_ID=<lieutenant-cluster-id> # Looks like: c-<something>
export TENANT_ID=$(curl -sH "Authorization: Bearer $(commodore fetch-token)" ${COMMODORE_API_URL}/clusters/${CLUSTER_ID} | jq -r .tenant)
export GIT_AUTHOR_NAME=$(git config --global user.name)
export GIT_AUTHOR_EMAIL=$(git config --global user.email)
export TF_VAR_control_vshn_net_token=<control-vshn-net-token> # use your personal SERVERS API token from https://control.vshn.net/tokens

- Get required tokens from Vault

Connect with Vault

export VAULT_ADDR=https://vault-prod.syn.vshn.net
vault login -method=oidc

Grab the LB hieradata repo token from Vault

export HIERADATA_REPO_SECRET=$(vault kv get \
  -format=json "clusters/kv/lbaas/hieradata_repo_token" | jq '.data.data')
export HIERADATA_REPO_USER=$(echo "${HIERADATA_REPO_SECRET}" | jq -r '.user')
export HIERADATA_REPO_TOKEN=$(echo "${HIERADATA_REPO_SECRET}" | jq -r '.token')

- Compile the catalog for the cluster. Having the catalog available locally enables us to run Terraform for the cluster to make any required changes.

commodore catalog compile "${CLUSTER_ID}"
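Later steps fail in confusing ways if one of the variables exported in this section is empty, or if COMMODORE_API_URL ends in a slash. The following is a minimal sketch of a sanity check. The helper names `unset_vars` and `has_trailing_slash` are hypothetical, and the variable list is taken from the exports in this section.

```shell
# Print the name of every listed variable that is unset or empty.
# NOTE: unset_vars and has_trailing_slash are illustrative helpers.
unset_vars() {
  for name in "$@"; do
    # Indirect lookup in POSIX sh; names come from the fixed list below.
    eval "value=\${$name:-}"
    [ -n "$value" ] || printf '%s\n' "$name"
  done
}

# Succeeds if the URL ends in "/", which this guide warns against.
has_trailing_slash() {
  case "$1" in
    */) return 0 ;;
    *) return 1 ;;
  esac
}

# Report problems on stderr without aborting the shell.
unset_vars EXOSCALE_API_KEY EXOSCALE_API_SECRET EXOSCALE_ZONE \
  EXOSCALE_S3_ENDPOINT GITLAB_TOKEN GITLAB_USER COMMODORE_API_URL \
  CLUSTER_ID TENANT_ID GIT_AUTHOR_NAME GIT_AUTHOR_EMAIL \
  TF_VAR_control_vshn_net_token HIERADATA_REPO_USER HIERADATA_REPO_TOKEN >&2

if has_trailing_slash "${COMMODORE_API_URL:-}"; then
  echo "COMMODORE_API_URL must not end with a trailing /" >&2
fi
```

Any variable name printed by this check still needs to be exported before continuing.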
Set alert silence
- Set a silence in Alertmanager for all rook-ceph alerts

if [[ "$OSTYPE" == "darwin"* ]]; then alias date=gdate; fi
job_name=$(printf "POST-silence-rook-ceph-alerts-$(date +%s)" | tr '[:upper:]' '[:lower:]')
silence_duration='+60 minutes' # (1)
kubectl --as=cluster-admin -n openshift-monitoring create -f- <<EOJ
apiVersion: batch/v1
kind: Job
metadata:
  name: ${job_name}
  labels:
    app: silence-rook-ceph-alerts
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: silence
          image: quay.io/appuio/oc:v4.13
          command:
            - bash
            - -c
            - |
              curl_opts=(
                --cacert /etc/ssl/certs/serving-certs/service-ca.crt
                --header "Content-Type: application/json"
                --header "Authorization: Bearer \$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)"
                --resolve alertmanager-main.openshift-monitoring.svc.cluster.local:9095:\$(getent hosts alertmanager-operated.openshift-monitoring.svc.cluster.local | awk '{print \$1}' | head -n 1)
                --silent
              )
              read -d "" body << EOF
              {
                "matchers": [
                  {
                    "name": "syn_component",
                    "value": "rook-ceph",
                    "isRegex": false
                  }
                ],
                "startsAt": "$(date -u +'%Y-%m-%dT%H:%M:%S')",
                "endsAt": "$(date -u +'%Y-%m-%dT%H:%M:%S' --date "${silence_duration}")",
                "createdBy": "$(kubectl config current-context | cut -d/ -f3)",
                "comment": "Silence rook-ceph alerts"
              }
              EOF

              curl "\${curl_opts[@]}" \
                "https://alertmanager-main.openshift-monitoring.svc.cluster.local:9095/api/v2/silences" \
                -XPOST -d "\${body}"
          volumeMounts:
            - mountPath: /etc/ssl/certs/serving-certs/
              name: ca-bundle
              readOnly: true
            - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
              name: kube-api-access
              readOnly: true
      serviceAccountName: prometheus-k8s
      volumes:
        - name: ca-bundle
          configMap:
            defaultMode: 288
            name: serving-certs-ca-bundle
        - name: kube-api-access
          projected:
            defaultMode: 420
            sources:
              - serviceAccountToken:
                  expirationSeconds: 3607
                  path: 'token'
EOJ

1 Adjust this variable to create a longer or shorter silence

- Extract Alertmanager silence ID from job logs

silence_id=$(kubectl --as=cluster-admin -n openshift-monitoring logs jobs/${job_name} | \
  jq -r '.silenceID')
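The silence ID is needed again at the end of this guide to remove the silence, so it is worth confirming that the extraction worked. The sketch below assumes that Alertmanager returns silence IDs in UUID form. If the job failed, `jq -r '.silenceID'` typically yields `null` or an empty string instead. The helper name `looks_like_uuid` is hypothetical.

```shell
# Succeeds if the argument looks like a UUID (8-4-4-4-12 hex digits).
# NOTE: looks_like_uuid is an illustrative helper; the UUID format of
# Alertmanager silence IDs is an assumption of this sketch.
looks_like_uuid() {
  printf '%s\n' "$1" |
    grep -Eq '^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$'
}

# Warn if the variable set in the previous step does not hold a plausible ID.
if ! looks_like_uuid "${silence_id:-}"; then
  echo "silence_id does not look like a UUID; inspect the job logs before continuing" >&2
fi
```

If the warning appears, check the logs of the job named in ${job_name} before moving on.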
Update Cluster Config
- Update cluster config.

pushd "inventory/classes/${TENANT_ID}/"

yq eval -i ".parameters.openshift4_terraform.terraform_variables.storage_count -= 1" \
  ${CLUSTER_ID}.yml

yq eval -i ".parameters.rook_ceph.ceph_cluster.node_count -= 1" \
  ${CLUSTER_ID}.yml

Ceph can’t scale to fewer than 3 storage nodes, which is the default number of nodes. Please ensure that this update doesn’t reduce the number of storage nodes to fewer than 3 before continuing.
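The minimum of 3 storage nodes can be checked mechanically before committing. The following is a minimal sketch. The helper name `storage_count_ok` is hypothetical, and the commented usage assumes you are still in inventory/classes/${TENANT_ID}/ and that storage_count is set explicitly in ${CLUSTER_ID}.yml (yq prints `null` for a missing key, which the helper rejects).

```shell
# Succeeds if `count` is a plain integer and at least `minimum` (default 3).
# NOTE: storage_count_ok is an illustrative helper, not part of this guide's tooling.
storage_count_ok() {
  count=$1
  minimum=${2:-3}
  case "$count" in
    ''|*[!0-9]*) return 1 ;;  # empty, "null", or otherwise not a plain integer
  esac
  [ "$count" -ge "$minimum" ]
}

# Example usage after the yq edits above:
#   new_count=$(yq eval \
#     '.parameters.openshift4_terraform.terraform_variables.storage_count' \
#     "${CLUSTER_ID}.yml")
#   storage_count_ok "$new_count" || echo "storage_count would drop below 3" >&2
```

The same check applies to .parameters.rook_ceph.ceph_cluster.node_count.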

- Review and commit

# Have a look at the file ${CLUSTER_ID}.yml.

git commit -a -m "Remove storage node from cluster ${CLUSTER_ID}"
git push

popd

- Compile and push cluster catalog

commodore catalog compile ${CLUSTER_ID} --push -i

Prepare Terraform environment
- Configure Terraform secrets

cat <<EOF > ./terraform.env
EXOSCALE_API_KEY
EXOSCALE_API_SECRET
TF_VAR_control_vshn_net_token
GIT_AUTHOR_NAME
GIT_AUTHOR_EMAIL
HIERADATA_REPO_TOKEN
EOF

- Setup Terraform

Prepare Terraform execution environment

# Set terraform image and tag to be used
tf_image=$(\
  yq eval ".parameters.openshift4_terraform.images.terraform.image" \
  dependencies/openshift4-terraform/class/defaults.yml)
tf_tag=$(\
  yq eval ".parameters.openshift4_terraform.images.terraform.tag" \
  dependencies/openshift4-terraform/class/defaults.yml)

# Generate the terraform alias
base_dir=$(pwd)
alias terraform='touch .terraformrc; docker run -it --rm \
  -e REAL_UID=$(id -u) \
  -e TF_CLI_CONFIG_FILE=/tf/.terraformrc \
  --env-file ${base_dir}/terraform.env \
  -w /tf \
  -v $(pwd):/tf \
  --ulimit memlock=-1 \
  "${tf_image}:${tf_tag}" /tf/terraform.sh'

export GITLAB_REPOSITORY_URL=$(curl -sH "Authorization: Bearer $(commodore fetch-token)" ${COMMODORE_API_URL}/clusters/${CLUSTER_ID} | jq -r '.gitRepo.url' | sed 's|ssh://||; s|/|:|')
export GITLAB_REPOSITORY_NAME=${GITLAB_REPOSITORY_URL##*/}
export GITLAB_CATALOG_PROJECT_ID=$(curl -sH "Authorization: Bearer ${GITLAB_TOKEN}" "https://git.vshn.net/api/v4/projects?simple=true&search=${GITLAB_REPOSITORY_NAME/.git}" | jq -r ".[] | select(.ssh_url_to_repo == \"${GITLAB_REPOSITORY_URL}\") | .id")
export GITLAB_STATE_URL="https://git.vshn.net/api/v4/projects/${GITLAB_CATALOG_PROJECT_ID}/terraform/state/cluster"

pushd catalog/manifests/openshift4-terraform/

Initialize Terraform

terraform init \
  "-backend-config=address=${GITLAB_STATE_URL}" \
  "-backend-config=lock_address=${GITLAB_STATE_URL}/lock" \
  "-backend-config=unlock_address=${GITLAB_STATE_URL}/lock" \
  "-backend-config=username=${GITLAB_USER}" \
  "-backend-config=password=${GITLAB_TOKEN}" \
  "-backend-config=lock_method=POST" \
  "-backend-config=unlock_method=DELETE" \
  "-backend-config=retry_wait_min=5"

Remove Node
- Find the node you want to remove. It has to be the one with the highest Terraform index.

# Grab JSON copy of current Terraform state
terraform state pull > .tfstate.json

node_count=$(jq -r \
  '.resources[] |
   select(.module=="module.cluster.module.storage" and .type=="exoscale_compute") |
   .instances | length' \
  .tfstate.json)
# Verify that the number of nodes is one more than we configured earlier.
echo $node_count

export NODE_TO_REMOVE=$(jq --arg index "$node_count" -r \
  '.resources[] |
   select(.module=="module.cluster.module.storage" and .type=="exoscale_compute") |
   .instances[$index|tonumber-1] |
   .attributes.hostname' \
  .tfstate.json)
echo $NODE_TO_REMOVE

Remove old OSD
- Make sure ArgoCD ran and reduced the target number of OSDs

kubectl --as=cluster-admin -n syn-rook-ceph-cluster \
  get cephcluster cluster -o jsonpath='{.spec.storage.storageClassDeviceSets[0].count}'

- Disable ArgoCD auto sync for component rook-ceph

kubectl --as=cluster-admin -n syn patch apps root --type=json \
  -p '[{"op":"replace", "path":"/spec/syncPolicy", "value": {}}]'
kubectl --as=cluster-admin -n syn patch apps rook-ceph --type=json \
  -p '[{"op":"replace", "path":"/spec/syncPolicy", "value": {}}]'

- Scale down the Rook-Ceph operator

kubectl --as=cluster-admin -n syn-rook-ceph-operator scale --replicas=0 \
  deploy/rook-ceph-operator

- Tell Ceph to take the OSD(s) on the node(s) to remove out of service and relocate data stored on them

# Verify that the list of nodes to replace is correct
echo $NODE_TO_REMOVE

# Reweight OSDs on those nodes to 0
for node in $(echo -n $NODE_TO_REMOVE); do
  osd_id=$(kubectl --as=cluster-admin -n syn-rook-ceph-cluster get deploy \
    -l failure-domain="${node}" --no-headers \
    -o custom-columns="OSD_ID:.metadata.labels.ceph_daemon_id")
  kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
    ceph osd crush reweight "osd.${osd_id}" 0
done

- Wait for the data to be redistributed ("backfilled")

When backfilling is completed, ceph status should show all PGs as active+clean.

Depending on the number of OSDs in the storage cluster and the amount of data that needs to be moved, this may take a while. If the storage cluster is mostly idle, you can speed up backfilling by temporarily setting the following configuration.

kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
  ceph config set osd osd_mclock_override_recovery_settings true # (1)
kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
  ceph config set osd osd_max_backfills 10 # (2)

1 Allow overwriting osd_max_backfills.
2 The number of PGs which are allowed to backfill in parallel. Adjust up or down depending on client load on the storage cluster.

After backfilling is completed, you can remove the configuration with

kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
  ceph config rm osd osd_max_backfills
kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
  ceph config rm osd osd_mclock_override_recovery_settings

kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
  ceph status
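Instead of re-running ceph status by hand, waiting for all PGs to reach active+clean can be scripted. The sketch below polls `ceph pg stat` and assumes its one-line summary has the shape `<total> pgs: <n> active+clean[, <m> <other state>...]; ...`. Treat that format as an assumption and compare it against the output of your Ceph version. The function names `pgs_all_clean` and `wait_for_backfill` are hypothetical.

```shell
# Succeeds if a `ceph pg stat` summary line reports every PG as exactly
# active+clean. PGs in states such as active+clean+scrubbing count as not clean.
# NOTE: illustrative helper; the summary format is an assumption of this sketch.
pgs_all_clean() {
  total=$(printf '%s\n' "$1" | awk '{print $1}')
  clean=$(printf '%s\n' "$1" |
    sed -n 's/.*[:,] *\([0-9][0-9]*\) active+clean[;,].*/\1/p')
  [ -n "$total" ] && [ -n "$clean" ] && [ "$total" = "$clean" ]
}

# Poll until backfilling is done. Defined only; call it explicitly.
# Usage: wait_for_backfill [interval-seconds]
wait_for_backfill() {
  interval=${1:-30}
  until pgs_all_clean "$(kubectl --as=cluster-admin -n syn-rook-ceph-cluster \
      exec deploy/rook-ceph-tools -- ceph pg stat)"; do
    sleep "$interval"
  done
}
```

Because the check is strict, a cluster that is scrubbing keeps the loop waiting. In that case, fall back to reading ceph status directly.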

- Remove the OSD(s) from the Ceph cluster

for node in $(echo -n $NODE_TO_REMOVE); do
  osd_id=$(kubectl --as=cluster-admin -n syn-rook-ceph-cluster get deploy \
    -l failure-domain="${node}" --no-headers \
    -o custom-columns="OSD_ID:.metadata.labels.ceph_daemon_id")
  kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
    ceph osd out "${osd_id}"
  kubectl --as=cluster-admin -n syn-rook-ceph-cluster scale --replicas=0 \
    "deploy/rook-ceph-osd-${osd_id}"
  kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
    ceph osd purge "${osd_id}"
  kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
    ceph osd crush remove "${node}"
done

- Check that the OSD is no longer listed in ceph osd tree

kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
  ceph osd tree

- Make a note of the PVC(s) of the old OSD(s)

We also extract the name of the PV(s) here, but we’ll only delete the PV(s) after removing the node(s) from the cluster.

old_pvc_names=""
old_pv_names=""
for node in $(echo -n $NODE_TO_REMOVE); do
  osd_id=$(kubectl --as=cluster-admin -n syn-rook-ceph-cluster get deploy \
    -l failure-domain="${node}" --no-headers \
    -o custom-columns="NAME:.metadata.name" | cut -d- -f4)
  pvc_name=$(kubectl --as=cluster-admin -n syn-rook-ceph-cluster get deploy \
    "rook-ceph-osd-${osd_id}" -ojsonpath='{.metadata.labels.ceph\.rook\.io/pvc}')
  pv_name=$(kubectl --as=cluster-admin -n syn-rook-ceph-cluster get pvc \
    "${pvc_name}" -o jsonpath='{.spec.volumeName}')
  old_pvc_names="$old_pvc_names $pvc_name"
  old_pv_names="$old_pv_names $pv_name"
done
echo $old_pvc_names
echo $old_pv_names

- Delete old OSD deployment(s)

for node in $(echo -n $NODE_TO_REMOVE); do
  kubectl --as=cluster-admin -n syn-rook-ceph-cluster delete deploy \
    -l failure-domain="${node}"
done

- Clean up PVC(s) and prepare job(s) of the old OSD(s) if necessary

for pvc_name in $(echo -n $old_pvc_names); do
  kubectl --as=cluster-admin -n syn-rook-ceph-cluster delete job \
    -l ceph.rook.io/pvc="${pvc_name}"
  kubectl --as=cluster-admin -n syn-rook-ceph-cluster delete pvc "${pvc_name}"
done

- Clean up PVC encryption secret(s)

for pvc_name in $(echo -n $old_pvc_names); do
  kubectl --as=cluster-admin -n syn-rook-ceph-cluster delete secret -l pvc_name="${pvc_name}"
done

- Scale up the Rook-Ceph operator

kubectl --as=cluster-admin -n syn-rook-ceph-operator scale --replicas=1 \
  deploy/rook-ceph-operator

Remove the old MON
- Find the MON(s) (if any) on the node(s) to remove

MON_IDS=""
for node in $(echo -n $NODE_TO_REMOVE); do
  mon_id=$(kubectl --as=cluster-admin -n syn-rook-ceph-cluster get pods \
    -lapp=rook-ceph-mon --field-selector="spec.nodeName=${node}" \
    --no-headers -ocustom-columns="MON_ID:.metadata.labels.ceph_daemon_id")
  MON_IDS="$MON_IDS $mon_id"
done
echo $MON_IDS
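Note that the loop above builds MON_IDS by appending each result after a space, so the variable holds whitespace rather than an empty string when no MON is found. A plain `[ -z "$MON_IDS" ]` test would therefore report it as non-empty. A minimal sketch of a whitespace-tolerant check follows. The helper name `is_blank` is hypothetical.

```shell
# Succeeds if the argument contains nothing but whitespace.
# NOTE: is_blank is an illustrative helper, not part of this guide's tooling.
is_blank() {
  [ -z "$(printf '%s' "$1" | tr -d '[:space:]')" ]
}

# Tell the operator whether any MON runs on the node(s) to remove.
if is_blank "${MON_IDS:-}"; then
  echo "No MON found on the node(s) to remove"
else
  echo "MON(s) to replace:${MON_IDS}"
fi
```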

You can skip the remaining steps in this section if $MON_IDS is empty.

- Temporarily adjust the Rook MON failover timeout. This tells the operator to perform the MON failover after less time than the default 10 minutes.

kubectl --as=cluster-admin -n syn-rook-ceph-cluster patch cephcluster cluster --type=json \
  -p '[{
    "op": "replace",
    "path": "/spec/healthCheck/daemonHealth/mon",
    "value": {
      "disabled": false,
      "interval": "10s",
      "timeout": "10s"
    }
  }]'

- Cordon node(s) to remove

for node in $(echo -n $NODE_TO_REMOVE); do
  kubectl --as=cluster-admin cordon "${node}"
done

- For every id in $MON_IDS replace the MON pod

mon_id=<MON_ID>
kubectl --as=cluster-admin -n syn-rook-ceph-cluster delete pod \
  -l app=rook-ceph-mon,ceph_daemon_id="${mon_id}"

# Wait until new MON is scheduled
kubectl --as=cluster-admin -n syn-rook-ceph-cluster get pods -w

# Wait until the cluster has regained full quorum
kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
  ceph status

# Repeat for all other $MON_IDS

- Verify that three MONs are running

kubectl --as=cluster-admin -n syn-rook-ceph-cluster get deploy -l app=rook-ceph-mon

Remove VM
- Drain the node(s)

for node in $(echo -n ${NODE_TO_REMOVE}); do
  kubectl --as=cluster-admin drain "${node}" \
    --delete-emptydir-data --ignore-daemonsets
done

- Delete the node(s) from the cluster

for node in $(echo -n ${NODE_TO_REMOVE}); do
  kubectl --as=cluster-admin delete node "${node}"
done

- Remove the node(s) by applying Terraform

Verify that the hostname of the node(s) to be deleted matches ${NODE_TO_REMOVE}.

Ensure that you’re still in directory ${WORK_DIR}/catalog/manifests/openshift4-terraform before executing this command.

terraform apply

Finish up
- Remove silence in Alertmanager

if [[ "$OSTYPE" == "darwin"* ]]; then alias date=gdate; fi
job_name=$(printf "DELETE-silence-rook-ceph-alerts-$(date +%s)" | tr '[:upper:]' '[:lower:]')
kubectl --as=cluster-admin -n openshift-monitoring create -f- <<EOJ
apiVersion: batch/v1
kind: Job
metadata:
  name: ${job_name}
  labels:
    app: silence-rook-ceph-alerts
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: silence
          image: quay.io/appuio/oc:v4.13
          command:
            - bash
            - -c
            - |
              curl_opts=(
                --cacert /etc/ssl/certs/serving-certs/service-ca.crt
                --header "Content-Type: application/json"
                --header "Authorization: Bearer \$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)"
                --resolve alertmanager-main.openshift-monitoring.svc.cluster.local:9095:\$(getent hosts alertmanager-operated.openshift-monitoring.svc.cluster.local | awk '{print \$1}' | head -n 1)
                --silent
              )

              curl "\${curl_opts[@]}" \
                "https://alertmanager-main.openshift-monitoring.svc.cluster.local:9095/api/v2/silence/${silence_id}" \
                -XDELETE
          volumeMounts:
            - mountPath: /etc/ssl/certs/serving-certs/
              name: ca-bundle
              readOnly: true
            - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
              name: kube-api-access
              readOnly: true
      serviceAccountName: prometheus-k8s
      volumes:
        - name: ca-bundle
          configMap:
            defaultMode: 288
            name: serving-certs-ca-bundle
        - name: kube-api-access
          projected:
            defaultMode: 420
            sources:
              - serviceAccountToken:
                  expirationSeconds: 3607
                  path: 'token'
EOJ

- Clean up Alertmanager silence jobs

kubectl --as=cluster-admin -n openshift-monitoring delete jobs -l app=silence-rook-ceph-alerts

- Re-enable ArgoCD auto sync

kubectl --as=cluster-admin -n syn patch apps root --type=json \
  -p '[{
    "op":"replace",
    "path":"/spec/syncPolicy",
    "value": {"automated": {"prune": true, "selfHeal": true}}
  }]'