Remove a storage node

Steps to remove a storage node of an OpenShift 4 cluster on Exoscale.

Starting situation

You already have a OpenShift 4 cluster on Exoscale
You have admin-level access to the cluster
You want to remove an existing storage node in the cluster

Prerequisites

The following CLI utilities need to be available locally:

docker
curl
kubectl
oc
exo >= v1.28.0 Exoscale CLI
vault Vault CLI
commodore, see Running Commodore
jq
yq yq YAML processor (version 4 or higher)
macOS: gdate from GNU coreutils, brew install coreutils

Prepare local environment

Create local directory to work in

We strongly recommend creating an empty directory, unless you already have a work directory for the cluster you’re about to work on. This guide will run Commodore in the directory created in this step.

export WORK_DIR=/path/to/work/dir
mkdir -p "${WORK_DIR}"
pushd "${WORK_DIR}"

Configure API access

Access to cloud API

export EXOSCALE_API_KEY=<exoscale-key> (1)
export EXOSCALE_API_SECRET=<exoscale-secret>
export EXOSCALE_ZONE=<exoscale-zone> (2)
export EXOSCALE_S3_ENDPOINT="sos-${EXOSCALE_ZONE}.exo.io"

1	We recommend using the IAMv3 role called `Owner` for the API Key. This role gives full access to the project.
2	All lower case. For example `ch-dk-2`.

Access to VSHN GitLab

# From https://git.vshn.net/-/user_settings/personal_access_tokens, "api" scope is sufficient
export GITLAB_TOKEN=<gitlab-api-token>
export GITLAB_USER=<gitlab-user-name>

Access to VSHN Lieutenant

# For example: https://api.syn.vshn.net
# IMPORTANT: do NOT add a trailing `/`. Commands below will fail.
export COMMODORE_API_URL=<lieutenant-api-endpoint>

# Set Project Syn cluster and tenant ID
export CLUSTER_ID=<lieutenant-cluster-id> # Looks like: c-<something>
export TENANT_ID=$(curl -sH "Authorization: Bearer $(commodore fetch-token)" ${COMMODORE_API_URL}/clusters/${CLUSTER_ID} | jq -r .tenant)

Configuration for hieradata commits

export GIT_AUTHOR_NAME=$(git config --global user.name)
export GIT_AUTHOR_EMAIL=$(git config --global user.email)
export TF_VAR_control_vshn_net_token=<control-vshn-net-token> # use your personal SERVERS API token from https://control.vshn.net/tokens

Get required tokens from Vault

Connect with Vault

export VAULT_ADDR=https://vault-prod.syn.vshn.net
vault login -method=oidc

Grab the LB hieradata repo token from Vault

export HIERADATA_REPO_SECRET=$(vault kv get \
  -format=json "clusters/kv/lbaas/hieradata_repo_token" | jq '.data.data')
export HIERADATA_REPO_USER=$(echo "${HIERADATA_REPO_SECRET}" | jq -r '.user')
export HIERADATA_REPO_TOKEN=$(echo "${HIERADATA_REPO_SECRET}" | jq -r '.token')

Compile the catalog for the cluster. Having the catalog available locally enables us to run Terraform for the cluster to make any required changes.
```
commodore catalog compile "${CLUSTER_ID}"
```

Set alert silence

Create alertmanager silence

silence_id=$(
    kubectl --as=cluster-admin -n openshift-monitoring exec \
    sts/alertmanager-main -- amtool --alertmanager.url=http://localhost:9093 \
    silence add syn_component=rook-ceph --duration="1h" -c "Silence rook-ceph alerts" -a "$(oc whoami)"
)
echo $silence_id

Update Cluster Config

Update cluster config.

pushd "inventory/classes/${TENANT_ID}/"

yq eval -i ".parameters.openshift4_terraform.terraform_variables.storage_count -= 1" \
  ${CLUSTER_ID}.yml

yq eval -i ".parameters.rook_ceph.ceph_cluster.node_count -= 1" \
  ${CLUSTER_ID}.yml

Ceph can’t scale to less than 3 storage nodes, which is the default number of nodes. Please ensure that this update doesn’t reduce the number of storage nodes to less than 3 before continuing.

Review and commit

# Have a look at the file ${CLUSTER_ID}.yml.

git commit -a -m "Remove storage node from cluster ${CLUSTER_ID}"
git push

popd

Compile and push cluster catalog

commodore catalog compile ${CLUSTER_ID} --push -i

Prepare Terraform environment

Configure Terraform secrets

cat <<EOF > ./terraform.env
EXOSCALE_API_KEY
EXOSCALE_API_SECRET
TF_VAR_control_vshn_net_token
GIT_AUTHOR_NAME
GIT_AUTHOR_EMAIL
HIERADATA_REPO_TOKEN
EOF

Setup Terraform

Prepare Terraform execution environment

# Set terraform image and tag to be used
tf_image=$(\
  yq eval ".parameters.openshift4_terraform.images.terraform.image" \
  dependencies/openshift4-terraform/class/defaults.yml)
tf_tag=$(\
  yq eval ".parameters.openshift4_terraform.images.terraform.tag" \
  dependencies/openshift4-terraform/class/defaults.yml)

# Generate the terraform alias
base_dir=$(pwd)
alias terraform='touch .terraformrc; docker run -it --rm \
  -e REAL_UID=$(id -u) \
  -e TF_CLI_CONFIG_FILE=/tf/.terraformrc \
  --env-file ${base_dir}/terraform.env \
  -w /tf \
  -v $(pwd):/tf \
  --ulimit memlock=-1 \
  "${tf_image}:${tf_tag}" /tf/terraform.sh'

export GITLAB_REPOSITORY_URL=$(curl -sH "Authorization: Bearer $(commodore fetch-token)" ${COMMODORE_API_URL}/clusters/${CLUSTER_ID} | jq -r '.gitRepo.url' | sed 's|ssh://||; s|/|:|')
export GITLAB_REPOSITORY_NAME=${GITLAB_REPOSITORY_URL##*/}
export GITLAB_CATALOG_PROJECT_ID=$(curl -sH "Authorization: Bearer ${GITLAB_TOKEN}" "https://git.vshn.net/api/v4/projects?simple=true&search=${GITLAB_REPOSITORY_NAME/.git}" | jq -r ".[] | select(.ssh_url_to_repo == \"${GITLAB_REPOSITORY_URL}\") | .id")
export GITLAB_STATE_URL="https://git.vshn.net/api/v4/projects/${GITLAB_CATALOG_PROJECT_ID}/terraform/state/cluster"

pushd catalog/manifests/openshift4-terraform/

Initialize Terraform

terraform init \
  "-backend-config=address=${GITLAB_STATE_URL}" \
  "-backend-config=lock_address=${GITLAB_STATE_URL}/lock" \
  "-backend-config=unlock_address=${GITLAB_STATE_URL}/lock" \
  "-backend-config=username=${GITLAB_USER}" \
  "-backend-config=password=${GITLAB_TOKEN}" \
  "-backend-config=lock_method=POST" \
  "-backend-config=unlock_method=DELETE" \
  "-backend-config=retry_wait_min=5"

Remove Node

Find the node you want to remove. It has to be the one with the highest terraform index.

# Grab JSON copy of current Terraform state
terraform state pull > .tfstate.json

node_count=$(jq  -r \
  '.resources[] |
   select(.module=="module.cluster.module.storage" and .type=="exoscale_compute") |
   .instances | length' \
   .tfstate.json)
# Verify that the number of nodes is one more than we configured earlier.
echo $node_count

export NODE_TO_REMOVE=$(jq --arg index "$node_count" -r \
  '.resources[] |
   select(.module=="module.cluster.module.storage" and .type=="exoscale_compute") |
   .instances[$index|tonumber-1] |
   .attributes.hostname' \
   .tfstate.json)
echo $NODE_TO_REMOVE

Remove old OSD

Make sure ArgoCD ran and reduced the target number of OSDs

kubectl --as=cluster-admin -n syn-rook-ceph-cluster \
  get cephcluster cluster -o jsonpath='{.spec.storage.storageClassDeviceSets[0].count}'

Disable ArgoCD auto sync for component rook-ceph

kubectl --as=cluster-admin -n syn patch apps root --type=json \
  -p '[{"op":"replace", "path":"/spec/syncPolicy", "value": {}}]'
kubectl --as=cluster-admin -n syn patch apps rook-ceph --type=json \
  -p '[{"op":"replace", "path":"/spec/syncPolicy", "value": {}}]'

Scale down the Rook-Ceph operator

kubectl --as=cluster-admin -n syn-rook-ceph-operator scale --replicas=0 \
  deploy/rook-ceph-operator

Tell Ceph to take the OSD(s) on the node(s) to remove out of service and relocate data stored on them

# Verify that the list of nodes to replace is correct
echo $NODE_TO_REMOVE
# Reweight OSDs on those nodes to 0
for node in $(echo -n $NODE_TO_REMOVE); do
  osd_id=$(kubectl --as=cluster-admin -n syn-rook-ceph-cluster get deploy \
    -l failure-domain="${node}" --no-headers \
    -o custom-columns="OSD_ID:.metadata.labels.ceph_daemon_id")
  kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
    ceph osd crush reweight "osd.${osd_id}" 0
done

Wait for the data to be redistributed ("backfilled")

When backfilling is completed, ceph status should show all PGs as active+clean.

Depending on the number of OSDs in the storage cluster and the amount of data that needs to be moved, this may take a while.

If the storage cluster is mostly idle, you can speed up backfilling by temporarily setting the following configuration.

kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
  ceph config set osd osd_mclock_override_recovery_settings true (1)
kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
  ceph config set osd osd_max_backfills 10 (2)

1	Allow overwriting `osd_max_backfills`.
2	The number of PGs which are allowed to backfill in parallel. Adjust up or down depending on client load on the storage cluster.

After backfilling is completed, you can remove the configuration with

kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
  ceph config rm osd osd_max_backfills
kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
  ceph config rm osd osd_mclock_override_recovery_settings

kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
  ceph status

Remove the OSD(s) from the Ceph cluster

for node in $(echo -n $NODE_TO_REMOVE); do
  osd_id=$(kubectl --as=cluster-admin -n syn-rook-ceph-cluster get deploy \
    -l failure-domain="${node}" --no-headers \
    -o custom-columns="OSD_ID:.metadata.labels.ceph_daemon_id")
  kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
    ceph osd out "${osd_id}"
  kubectl --as=cluster-admin -n syn-rook-ceph-cluster scale --replicas=0 \
    "deploy/rook-ceph-osd-${osd_id}"
  kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
    ceph osd purge "${osd_id}"
  kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
    ceph osd crush remove "${node}"
done

Check that the OSD is no longer listed in ceph osd tree

kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
  ceph osd tree

Make a note of the PVC(s) of the old OSD(s)

We also extract the name of the PV(s) here, but we’ll only delete the PV(s) after removing the node(s) from the cluster.

old_pvc_names=""
old_pv_names=""
for node in $(echo -n $NODE_TO_REMOVE); do
  osd_id=$(kubectl --as=cluster-admin -n syn-rook-ceph-cluster get deploy \
    -l failure-domain="${node}" --no-headers \
    -o custom-columns="NAME:.metadata.name" | cut -d- -f4)

  pvc_name=$(kubectl --as=cluster-admin -n syn-rook-ceph-cluster get deploy \
    "rook-ceph-osd-${osd_id}" -ojsonpath='{.metadata.labels.ceph\.rook\.io/pvc}')
  pv_name=$(kubectl --as=cluster-admin -n syn-rook-ceph-cluster get pvc \
    "${pvc_name}" -o jsonpath='{.spec.volumeName}')

  old_pvc_names="$old_pvc_names $pvc_name"
  old_pv_names="$old_pv_names $pv_name"
done
echo $old_pvc_names
echo $old_pv_names

Delete old OSD deployment(s)

for node in $(echo -n $NODE_TO_REMOVE); do
  kubectl --as=cluster-admin -n syn-rook-ceph-cluster delete deploy \
    -l failure-domain="${node}"
done

Clean up PVC(s) and prepare job(s) of the old OSD(s) if necessary

for pvc_name in $(echo -n $old_pvc_names); do
  kubectl --as=cluster-admin -n syn-rook-ceph-cluster delete job \
    -l ceph.rook.io/pvc="${pvc_name}"
  kubectl --as=cluster-admin -n syn-rook-ceph-cluster delete pvc "${pvc_name}"
done

Clean up PVC encryption secret(s)

for pvc_name in $(echo -n $old_pvc_names); do
  kubectl --as=cluster-admin -n syn-rook-ceph-cluster delete secret -l pvc_name="${pvc_name}"
done

Scale up the Rook-Ceph operator

kubectl --as=cluster-admin -n syn-rook-ceph-operator scale --replicas=1 \
  deploy/rook-ceph-operator

Remove the old MON

We’ve observed situations where the Rook operator was unable to correctly replace the old MON using the instructions in this section.

If you run into issues, please double-check the Rook operator logs and create a ticket with the relevant information so we can improve the steps in this section.

Find the MON(s) (if any) on the node(s) to remove

MON_IDS=""
for node in $(echo -n $NODE_TO_REMOVE); do
  mon_id=$(kubectl --as=cluster-admin -n syn-rook-ceph-cluster get pods \
    -lapp=rook-ceph-mon --field-selector="spec.nodeName=${node}" \
    --no-headers -ocustom-columns="MON_ID:.metadata.labels.ceph_daemon_id")
  MON_IDS="$MON_IDS $mon_id"
done
echo $MON_IDS

You can skip the remaining steps in this section if $MON_ID is empty.

Temporarily adjust the Rook MON failover timeout. This tells the operator to perform the MON failover after less time than the default 10 minutes.

kubectl --as=cluster-admin -n syn-rook-ceph-cluster patch cephcluster cluster --type=json \
  -p '[{
    "op": "replace",
    "path": "/spec/healthCheck/daemonHealth/mon",
    "value": {
      "disabled": false,
      "interval": "10s",
      "timeout": "10s"
    }
  }]'

Cordon node(s) to remove

for node in $(echo -n $NODE_TO_REMOVE); do
  kubectl --as=cluster-admin cordon "${node}"
done

For every id in $MON_IDS replace the MON pod

mon_id=<MON_ID>
kubectl --as=cluster-admin -n syn-rook-ceph-cluster delete pod \
  -l app=rook-ceph-mon,ceph_daemon_id="${mon_id}"

# Wait until new MON is scheduled
kubectl --as=cluster-admin -n syn-rook-ceph-cluster get pods -w

# Wait until the cluster has regained full quorum
kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
  ceph status

# Repeat for all other $MON_IDS

Verify that three MONs are running

kubectl --as=cluster-admin -n syn-rook-ceph-cluster get deploy -l app=rook-ceph-mon

Remove VM

Drain the node(s)

for node in $(echo -n ${NODE_TO_REMOVE}); do
  kubectl --as=cluster-admin drain "${node}" \
    --delete-emptydir-data --ignore-daemonsets
done

Delete the node(s) from the cluster

for node in $(echo -n ${NODE_TO_REMOVE}); do
  kubectl --as=cluster-admin delete node "${node}"
done

Remove the node(s) by applying Terraform

Verify that the hostname of the to be deleted node(s) matches ${NODE_TO_REMOVE}

Ensure that you’re still in directory ${WORK_DIR}/catalog/manifests/openshift4-terraform before executing this command.
```
terraform apply
```

Finish up

Expire alertmanager silence

kubectl --as=cluster-admin -n openshift-monitoring exec sts/alertmanager-main --\
    amtool --alertmanager.url=http://localhost:9093 silence expire $silence_id

Re-enable ArgoCD auto sync

kubectl --as=cluster-admin -n syn patch apps root --type=json \
  -p '[{
    "op":"replace",
    "path":"/spec/syncPolicy",
    "value": {"automated": {"prune": true, "selfHeal": true}}
  }]'

Upstream documentation

Rook documentation
- Remove an OSD
- MON failover