Replace a storage node

Steps to replace a storage node of an OpenShift 4 cluster on cloudscale.ch.

Starting situation

You already have an OpenShift 4 cluster on cloudscale.ch.
You have admin-level access to the cluster.
The cluster is already running the APPUiO Managed Storage Cluster addon (Rook Ceph).
You want to replace an existing storage node in the storage cluster with a new storage node.

Prerequisites

The following CLI utilities need to be available locally:

docker
curl
kubectl
oc
vault (Vault CLI)
commodore, see Running Commodore
jq
yq (yq YAML processor, version 4 or higher)
macOS: gdate from GNU coreutils, brew install coreutils

Prepare local environment

Create local directory to work in

We strongly recommend creating an empty directory, unless you already have a work directory for the cluster you’re about to work on. This guide will run Commodore in the directory created in this step.

export WORK_DIR=/path/to/work/dir
mkdir -p "${WORK_DIR}"
pushd "${WORK_DIR}"

Configure API access

Access to cloud API

# From https://control.cloudscale.ch/service/<your-project>/api-token
export CLOUDSCALE_API_TOKEN=<cloudscale-api-token>

Access to VSHN GitLab

# From https://git.vshn.net/-/user_settings/personal_access_tokens, "api" scope is sufficient
export GITLAB_TOKEN=<gitlab-api-token>
export GITLAB_USER=<gitlab-user-name>

# For example: https://api.syn.vshn.net
# IMPORTANT: do NOT add a trailing `/`, otherwise the commands below will fail.
export COMMODORE_API_URL=<lieutenant-api-endpoint>

# Set Project Syn cluster and tenant ID
export CLUSTER_ID=<lieutenant-cluster-id> # Looks like: c-<something>
export TENANT_ID=$(curl -sH "Authorization: Bearer $(commodore fetch-token)" ${COMMODORE_API_URL}/clusters/${CLUSTER_ID} | jq -r .tenant)
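
The trailing-slash requirement can be checked before any command depends on the URL. This is a minimal sketch; the function name `check_api_url` is hypothetical and not part of Commodore.

```shell
# Hypothetical helper: reject an API URL that is empty or ends in "/".
check_api_url() {
  case "$1" in
    "") echo "API URL is empty" >&2; return 1 ;;
    */) echo "API URL must not end with '/': $1" >&2; return 1 ;;
    *)  echo "API URL looks fine: $1" ;;
  esac
}

# Usage:
#   check_api_url "${COMMODORE_API_URL}"
```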

Configuration for hieradata commits

export GIT_AUTHOR_NAME=$(git config --global user.name)
export GIT_AUTHOR_EMAIL=$(git config --global user.email)
export TF_VAR_control_vshn_net_token=<control-vshn-net-token> # use your personal SERVERS API token from https://control.vshn.net/tokens

Get required tokens from Vault

Connect with Vault

export VAULT_ADDR=https://vault-prod.syn.vshn.net
vault login -method=oidc

Grab the LB hieradata repo token from Vault

export HIERADATA_REPO_SECRET=$(vault kv get \
  -format=json "clusters/kv/lbaas/hieradata_repo_token" | jq '.data.data')
export HIERADATA_REPO_USER=$(echo "${HIERADATA_REPO_SECRET}" | jq -r '.user')
export HIERADATA_REPO_TOKEN=$(echo "${HIERADATA_REPO_SECRET}" | jq -r '.token')

Get Floaty credentials

export TF_VAR_lb_cloudscale_api_secret=$(vault kv get \
  -format=json "clusters/kv/${TENANT_ID}/${CLUSTER_ID}/floaty" | jq -r '.data.data.iam_secret')
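
A failed Vault lookup is easy to miss, because `jq -r` prints the literal string `null` for a missing key and prints nothing when it receives no input. The following sketch checks a fetched value before it's used; the helper name `require_value` is an assumption made for illustration.

```shell
# Hypothetical helper: fail if a fetched value is empty or the string "null".
require_value() {
  name=$1
  value=$2
  if [ -z "$value" ] || [ "$value" = "null" ]; then
    echo "${name} is empty or null; check the Vault path and your login" >&2
    return 1
  fi
  echo "${name} is set"
}

# Usage:
#   require_value HIERADATA_REPO_TOKEN "${HIERADATA_REPO_TOKEN}"
#   require_value TF_VAR_lb_cloudscale_api_secret "${TF_VAR_lb_cloudscale_api_secret}"
```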

Compile the catalog for the cluster. Having the catalog available locally enables us to run Terraform for the cluster to make any required changes.
```
commodore catalog compile "${CLUSTER_ID}"
```

Prepare Terraform environment

Configure Terraform secrets

cat <<EOF > ./terraform.env
CLOUDSCALE_API_TOKEN
TF_VAR_ignition_bootstrap
TF_VAR_lb_cloudscale_api_secret
TF_VAR_control_vshn_net_token
GIT_AUTHOR_NAME
GIT_AUTHOR_EMAIL
HIERADATA_REPO_TOKEN
EOF
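
The env file lists variable names only, so Docker takes each value from the calling shell and leaves out any variable that isn't exported there. The sketch below lists the names in the file that are currently unset; `list_unset_env_vars` is a hypothetical helper, and it assumes that the file contains bare variable names as shown above.

```shell
# Hypothetical helper: print each variable named in an env file that is
# not exported in the current shell. Assumes one bare name per line.
list_unset_env_vars() {
  while IFS= read -r name; do
    [ -z "$name" ] && continue
    if ! printenv "$name" >/dev/null; then
      echo "$name"
    fi
  done < "$1"
}

# Usage:
#   list_unset_env_vars ./terraform.env
```

This guide doesn't set `TF_VAR_ignition_bootstrap`, so expect that name in the output.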

Setup Terraform

Prepare Terraform execution environment

# Set terraform image and tag to be used
tf_image=$(\
  yq eval ".parameters.openshift4_terraform.images.terraform.image" \
  dependencies/openshift4-terraform/class/defaults.yml)
tf_tag=$(\
  yq eval ".parameters.openshift4_terraform.images.terraform.tag" \
  dependencies/openshift4-terraform/class/defaults.yml)

# Generate the terraform alias
base_dir=$(pwd)
alias terraform='touch .terraformrc; docker run -it --rm \
  -e REAL_UID=$(id -u) \
  -e TF_CLI_CONFIG_FILE=/tf/.terraformrc \
  --env-file ${base_dir}/terraform.env \
  -w /tf \
  -v $(pwd):/tf \
  --ulimit memlock=-1 \
  "${tf_image}:${tf_tag}" /tf/terraform.sh'

export GITLAB_REPOSITORY_URL=$(curl -sH "Authorization: Bearer $(commodore fetch-token)" ${COMMODORE_API_URL}/clusters/${CLUSTER_ID} | jq -r '.gitRepo.url' | sed 's|ssh://||; s|/|:|')
export GITLAB_REPOSITORY_NAME=${GITLAB_REPOSITORY_URL##*/}
export GITLAB_CATALOG_PROJECT_ID=$(curl -sH "Authorization: Bearer ${GITLAB_TOKEN}" "https://git.vshn.net/api/v4/projects?simple=true&search=${GITLAB_REPOSITORY_NAME/.git}" | jq -r ".[] | select(.ssh_url_to_repo == \"${GITLAB_REPOSITORY_URL}\") | .id")
export GITLAB_STATE_URL="https://git.vshn.net/api/v4/projects/${GITLAB_CATALOG_PROJECT_ID}/terraform/state/cluster"

pushd catalog/manifests/openshift4-terraform/
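
The string transformations used to derive the GitLab variables above can be tried on a sample value. The URL below is made up for illustration; the real one comes from the Lieutenant API. The last step uses the POSIX form `${name%.git}`, which gives the same result as the `${GITLAB_REPOSITORY_NAME/.git}` substitution above when `.git` appears only as a suffix.

```shell
# Example value only; the real URL comes from the Lieutenant API.
url='ssh://git@git.example.com/tenant/cluster-catalogs/c-example-1234.git'

# Strip the scheme, then turn the first "/" into ":" (scp-style Git URL).
scp_url=$(printf '%s\n' "$url" | sed 's|ssh://||; s|/|:|')
echo "$scp_url"

# Keep only the part after the last "/".
repo_name=${scp_url##*/}
echo "$repo_name"

# Drop the ".git" suffix to get the search term for the GitLab API.
echo "${repo_name%.git}"
```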

Initialize Terraform

terraform init \
  "-backend-config=address=${GITLAB_STATE_URL}" \
  "-backend-config=lock_address=${GITLAB_STATE_URL}/lock" \
  "-backend-config=unlock_address=${GITLAB_STATE_URL}/lock" \
  "-backend-config=username=${GITLAB_USER}" \
  "-backend-config=password=${GITLAB_TOKEN}" \
  "-backend-config=lock_method=POST" \
  "-backend-config=unlock_method=DELETE" \
  "-backend-config=retry_wait_min=5"

Set alert silence and pause ArgoCD

Create alertmanager silence

silence_id=$(
    kubectl --as=cluster-admin -n openshift-monitoring exec \
    sts/alertmanager-main -- amtool --alertmanager.url=http://localhost:9093 \
    silence add syn_component=rook-ceph --duration="1h" -c "Silence rook-ceph alerts" -a "$(oc whoami)"
)
echo $silence_id

Disable auto sync for the ArgoCD root app and for component rook-ceph. This allows us to temporarily make manual changes to the Rook Ceph cluster.

kubectl --as=cluster-admin -n syn patch apps root --type=json \
  -p '[{"op":"replace", "path":"/spec/syncPolicy", "value": {}}]'
kubectl --as=cluster-admin -n syn patch apps rook-ceph --type=json \
  -p '[{"op":"replace", "path":"/spec/syncPolicy", "value": {}}]'

Replace node

Make a note of the node you want to replace
```
export NODE_TO_REPLACE=storage-XXXX
```

Create a new node

Find Terraform resource index of the node to replace

# Select the correct worker group. This guide assumes that your storage nodes
# are part of an additional worker group called "storage".
TF_MODULE='module.cluster.module.additional_worker["storage"]'

# Grab JSON copy of current Terraform state
terraform state pull > .tfstate.json
node_index=$(jq --arg tfmodule "${TF_MODULE}" --arg storage_node "${NODE_TO_REPLACE}" -r \
  '.resources[] |
   select(.module==$tfmodule and .type=="random_id") |
   .instances[] |
   select(.attributes.hex==$storage_node) |
   .index_key' \
  .tfstate.json)

Verify that resource index is correct

jq --arg tfmodule "${TF_MODULE}" --arg index "${node_index}" -r \
  '.resources[] |
   select(.module==$tfmodule and .type=="cloudscale_server") |
   .instances[$index|tonumber] |
   .attributes.name' \
   .tfstate.json
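
Since the next step removes resources from the Terraform state, it can help to turn the visual check above into a guard. This is a sketch under the assumption that the server name starts with the node name, as the VM removal step later in this guide also assumes; `confirm_node_index` is a hypothetical helper.

```shell
# Hypothetical guard: the index must be a non-negative integer and the
# server name found at that index must start with the node name.
confirm_node_index() {
  index=$1
  server_name=$2
  node=$3
  if [ -z "$node" ]; then
    echo "node name is empty" >&2
    return 1
  fi
  case "$index" in
    ""|*[!0-9]*) echo "invalid index: '${index}'" >&2; return 1 ;;
  esac
  case "$server_name" in
    "${node}"*) echo "index ${index} matches ${server_name}" ;;
    *) echo "index ${index} points at '${server_name}', expected ${node}" >&2
       return 1 ;;
  esac
}

# Usage: pass the output of the verification command as the second argument.
#   confirm_node_index "${node_index}" "<server name printed above>" "${NODE_TO_REPLACE}"
```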

Remove the node ID and the node resource of the node we want to replace from the Terraform state

terraform state rm "${TF_MODULE}.random_id.node[$node_index]"
terraform state rm "${TF_MODULE}.cloudscale_server.node[$node_index]"

Run Terraform to spin up a replacement node
```
terraform apply
```

Approve node cert for new storage node

# Once CSRs in state Pending show up, approve them.
# The approval command needs to be run twice, since two CSRs need to be approved for each node.

kubectl --as=cluster-admin get csr -w

oc --as=cluster-admin get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | \
  xargs oc --as=cluster-admin adm certificate approve

kubectl --as=cluster-admin get nodes
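
Because the approval has to be repeated, it can be convenient to wrap it in a function. The sketch below uses the same `oc` commands as above, but reports when nothing is pending instead of calling `approve` without arguments; the function name `approve_pending_csrs` is hypothetical.

```shell
# Hypothetical helper: approve every CSR that has no status yet (Pending).
approve_pending_csrs() {
  pending=$(oc --as=cluster-admin get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}')
  if [ -z "$pending" ]; then
    echo "no pending CSRs"
    return 0
  fi
  # Iterate over the output of a command substitution so that the names
  # are split into words in zsh as well as in sh and bash.
  for csr in $(printf '%s\n' "$pending"); do
    oc --as=cluster-admin adm certificate approve "$csr"
  done
}

# Usage: run once, then again after the second CSR for the node shows up.
#   approve_pending_csrs
```

Note that this approves all pending CSRs, like the original command, so check the list from `get csr` first.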

Label and taint the new storage node

kubectl --as=cluster-admin get node -ojson | \
  jq -r '.items[] | select(.metadata.name | test("storage-")).metadata.name' | \
  xargs -I {} kubectl --as=cluster-admin label node {} node-role.kubernetes.io/storage=

kubectl --as=cluster-admin taint node -lnode-role.kubernetes.io/storage \
  storagenode=True:NoSchedule

Remove the old MON

We’ve observed situations where the Rook operator was unable to correctly replace the old MON using the instructions in this section.

If you run into issues, please double-check the Rook operator logs and create a ticket with the relevant information so we can improve the steps in this section.

Find the MON(s) (if any) on the node(s) to replace

MON_IDS=""
for node in $(echo -n $NODE_TO_REPLACE); do
  mon_id=$(kubectl --as=cluster-admin -n syn-rook-ceph-cluster get pods \
    -lapp=rook-ceph-mon --field-selector="spec.nodeName=${node}" \
    --no-headers -ocustom-columns="MON_ID:.metadata.labels.ceph_daemon_id")
  MON_IDS="$MON_IDS $mon_id"
done
echo $MON_IDS

You can skip the remaining steps in this section if $MON_IDS is empty.
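
The loop above always prepends a space, so the variable contains whitespace even when no MON was found, and a plain `-z` test isn't enough. The following sketch ignores whitespace when checking for emptiness; `has_mon_ids` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if the argument contains something
# other than whitespace.
has_mon_ids() {
  [ -n "$(printf '%s' "$1" | tr -d '[:space:]')" ]
}

# Usage:
#   if has_mon_ids "${MON_IDS}"; then
#     echo "MONs to replace:${MON_IDS}"
#   else
#     echo "No MON on the node, skip the rest of this section"
#   fi
```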

Temporarily adjust the Rook MON failover timeout. This tells the operator to perform the MON failover after less time than the default 10 minutes.

kubectl --as=cluster-admin -n syn-rook-ceph-cluster patch cephcluster cluster --type=json \
  -p '[{
    "op": "replace",
    "path": "/spec/healthCheck/daemonHealth/mon",
    "value": {
      "disabled": false,
      "interval": "10s",
      "timeout": "10s"
    }
  }]'

Cordon node(s) to replace

for node in $(echo -n $NODE_TO_REPLACE); do
  kubectl --as=cluster-admin cordon "${node}"
done

For every ID in $MON_IDS, replace the MON pod

mon_id=<MON_ID>
kubectl --as=cluster-admin -n syn-rook-ceph-cluster delete pod \
  -l app=rook-ceph-mon,ceph_daemon_id="${mon_id}"

# Wait until new MON is scheduled
kubectl --as=cluster-admin -n syn-rook-ceph-cluster get pods -w

# Wait until the cluster has regained full quorum
kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
  ceph status

# Repeat for all other $MON_IDS

Verify that three MONs are running

kubectl --as=cluster-admin -n syn-rook-ceph-cluster get deploy -l app=rook-ceph-mon
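
Instead of re-running the command by hand, the check can be polled. This is a sketch that assumes each MON runs as its own single-replica Deployment with the label `app=rook-ceph-mon`; the function names, retry count and interval are illustrative choices, not part of Rook.

```shell
# Hypothetical helper: count MON deployments that report one available replica.
count_available_mons() {
  kubectl --as=cluster-admin -n syn-rook-ceph-cluster get deploy \
    -l app=rook-ceph-mon \
    -o jsonpath='{range .items[*]}{.status.availableReplicas}{"\n"}{end}' |
    grep -c '^1$' || true
}

# Hypothetical helper: poll until the expected number of MONs is available.
# Arguments: expected count, number of tries (default 60), seconds between
# tries (default 10).
wait_for_mons() {
  expected=$1
  tries=${2:-60}
  interval=${3:-10}
  while [ "$tries" -gt 0 ]; do
    available=$(count_available_mons)
    if [ "$available" -eq "$expected" ]; then
      echo "${available} MONs available"
      return 0
    fi
    tries=$((tries - 1))
    sleep "$interval"
  done
  echo "timed out waiting for ${expected} MONs" >&2
  return 1
}

# Usage:
#   wait_for_mons 3
```

An available Deployment doesn't prove that the MON has joined the quorum, so the `ceph status` check above is still needed.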

Clean up the old node

Drain the node(s)

for node in $(echo -n ${NODE_TO_REPLACE}); do
  kubectl --as=cluster-admin drain "${node}" \
    --delete-emptydir-data --ignore-daemonsets
done

On cloudscale.ch, we configure Rook Ceph to set up the OSDs in "portable" mode. This configuration enables OSDs to be scheduled on any storage node.

With this configuration, we don’t have to migrate OSDs hosted on the old node(s) manually. Instead, draining a node will cause any OSDs hosted on that node to be rescheduled on other storage nodes.
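
Before deleting the node, it can be worth confirming that no OSD pod is left on it. The sketch below assumes that OSD pods carry the label `app=rook-ceph-osd`; both function names are hypothetical.

```shell
# Hypothetical helper: list OSD pods still scheduled on the given node.
osd_pods_on_node() {
  kubectl --as=cluster-admin -n syn-rook-ceph-cluster get pods \
    -l app=rook-ceph-osd --field-selector="spec.nodeName=$1" \
    --no-headers -o custom-columns="NAME:.metadata.name"
}

# Hypothetical helper: succeed only if no OSD pod is left on the node.
node_has_no_osds() {
  remaining=$(osd_pods_on_node "$1")
  if [ -n "$remaining" ]; then
    echo "OSD pods still on $1:" >&2
    echo "$remaining" >&2
    return 1
  fi
  echo "no OSD pods left on $1"
}

# Usage:
#   node_has_no_osds "${NODE_TO_REPLACE}"
```

This only looks at pod placement. Running `ceph status` in the tools pod, as shown in the MON section, shows whether the rescheduled OSDs are up again.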

Delete the node(s) from the cluster

for node in $(echo -n ${NODE_TO_REPLACE}); do
  kubectl --as=cluster-admin delete node "${node}"
done

Remove the cloudscale.ch VM(s)

for node in $(echo -n ${NODE_TO_REPLACE}); do
  node_id=$(curl -sH "Authorization: Bearer ${CLOUDSCALE_API_TOKEN}" \
    https://api.cloudscale.ch/v1/servers | \
    jq --arg storage_node "$node" -r \
    '.[] | select(.name|startswith($storage_node)) | .uuid')

  echo "Removing node:"
  curl -sH "Authorization: Bearer ${CLOUDSCALE_API_TOKEN}" \
    "https://api.cloudscale.ch/v1/servers/${node_id}" |\
    jq -r '.name'

  curl -XDELETE -H "Authorization: Bearer ${CLOUDSCALE_API_TOKEN}" \
    "https://api.cloudscale.ch/v1/servers/${node_id}"
done
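
The lookup above matches servers by name prefix, so `node_id` could be empty or contain several IDs if the prefix matches no server or more than one. A guard like the following sketch can be called before the `DELETE` request; it assumes that server IDs are standard UUIDs, and `is_single_uuid` is a hypothetical helper.

```shell
# Hypothetical guard: succeed only if the argument is exactly one line
# that looks like a UUID.
is_single_uuid() {
  total=$(printf '%s\n' "$1" | grep -c '' || true)
  valid=$(printf '%s\n' "$1" |
    grep -Ec '^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$' || true)
  [ "$total" -eq 1 ] && [ "$valid" -eq 1 ]
}

# Usage inside the loop, before the DELETE request:
#   if ! is_single_uuid "${node_id}"; then
#     echo "Unexpected server lookup result for ${node}: '${node_id}'" >&2
#     continue
#   fi
```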

Finish up

Expire alertmanager silence

kubectl --as=cluster-admin -n openshift-monitoring exec sts/alertmanager-main -- \
    amtool --alertmanager.url=http://localhost:9093 silence expire "${silence_id}"

Re-enable ArgoCD auto sync

kubectl --as=cluster-admin -n syn patch apps root --type=json \
  -p '[{
    "op":"replace",
    "path":"/spec/syncPolicy",
    "value": {"automated": {"prune": true, "selfHeal": true}}
  }]'

Upstream documentation

Rook documentation
- MON failover