Replace a storage node

Steps to replace a storage node of an OpenShift 4 cluster on Exoscale.

Starting situation

  • You already have an OpenShift 4 cluster on Exoscale

  • You have admin-level access to the cluster

  • You want to replace an existing storage node in the cluster with a new storage node

Prerequisites

The following CLI utilities need to be available locally:

  • commodore
  • vault (Vault CLI)
  • exo (Exoscale CLI)
  • docker
  • kubectl and oc
  • curl, jq and yq (mikefarah/yq v4)
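
If you want to verify this up front, a small loop such as the following sketch reports any tool that's missing from your PATH:

    # Sanity check for the tools used in this guide
    for tool in commodore vault exo docker kubectl oc curl jq yq; do
      command -v "$tool" >/dev/null || echo "missing: $tool"
    done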

Prepare local environment

  1. Create a local directory to work in

    We strongly recommend creating an empty directory, unless you already have a work directory for the cluster you’re about to work on. This guide will run Commodore in the directory created in this step.

    export WORK_DIR=/path/to/work/dir
    mkdir -p "${WORK_DIR}"
    pushd "${WORK_DIR}"
  2. Configure API access

    Access to cloud API
    export EXOSCALE_ACCOUNT=<exoscale-account>
    export EXOSCALE_API_KEY=<exoscale-key>
    export EXOSCALE_API_SECRET=<exoscale-secret>
    export EXOSCALE_REGION=<exoscale-zone>
    export EXOSCALE_S3_ENDPOINT="sos-${EXOSCALE_REGION}.exo.io"
    Access to various APIs
    # From https://git.vshn.net/-/profile/personal_access_tokens, "api" scope is sufficient
    export GITLAB_TOKEN=<gitlab-api-token>
    export GITLAB_USER=<gitlab-user-name>
    
    # For example: https://api.syn.vshn.net
    # IMPORTANT: do NOT add a trailing `/`. Commands below will fail.
    export COMMODORE_API_URL=<lieutenant-api-endpoint>
    export COMMODORE_API_TOKEN=<lieutenant-api-token>
    
    # Set Project Syn cluster and tenant ID
    export CLUSTER_ID=<lieutenant-cluster-id> # Looks like: c-<something>
    export TENANT_ID=$(curl -sH "Authorization: Bearer ${COMMODORE_API_TOKEN}" ${COMMODORE_API_URL}/clusters/${CLUSTER_ID} | jq -r .tenant)
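
    Before continuing, verify that the tenant lookup succeeded. If the token or the API URL is wrong, TENANT_ID will be empty or null.

    echo "cluster: ${CLUSTER_ID}, tenant: ${TENANT_ID}"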
  3. Get required tokens from Vault

    Connect with Vault
    export VAULT_ADDR=https://vault-prod.syn.vshn.net
    vault login -method=ldap username=<your.name>
    Grab the LB hieradata repo token from Vault
    export HIERADATA_REPO_SECRET=$(vault kv get \
      -format=json "clusters/kv/lbaas/hieradata_repo_token" | jq '.data.data')
    export HIERADATA_REPO_USER=$(echo "${HIERADATA_REPO_SECRET}" | jq -r '.user')
    export HIERADATA_REPO_TOKEN=$(echo "${HIERADATA_REPO_SECRET}" | jq -r '.token')
    Get Floaty credentials
    export TF_VAR_lb_exoscale_api_user=$(vault kv get \
      -format=json "clusters/kv/${TENANT_ID}/${CLUSTER_ID}/floaty" | jq '.data.data')
    export TF_VAR_lb_exoscale_api_key=$(echo "${TF_VAR_lb_exoscale_api_user}" | jq -r '.iam_key')
    export TF_VAR_lb_exoscale_api_secret=$(echo "${TF_VAR_lb_exoscale_api_user}" | jq -r '.iam_secret')
  4. Compile the catalog for the cluster. Having the catalog available locally enables us to run Terraform for the cluster to make any required changes.

    commodore catalog compile "${CLUSTER_ID}"
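
    The compiled catalog contains the Terraform configuration for the cluster. As a quick sanity check, the directory used in the next section should now exist:

    ls catalog/manifests/openshift4-terraform/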

Prepare Terraform environment

  1. Configure Terraform secrets

    cat <<EOF > ./terraform.env
    EXOSCALE_API_KEY
    EXOSCALE_API_SECRET
    TF_VAR_lb_exoscale_api_key
    TF_VAR_lb_exoscale_api_secret
    TF_VAR_control_vshn_net_token
    GIT_AUTHOR_NAME
    GIT_AUTHOR_EMAIL
    HIERADATA_REPO_TOKEN
    EOF
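
    Entries without a = are passed through from the current shell environment by Docker. If in doubt, you can inspect what the container will see; this sketch assumes a local alpine image is available:

    docker run --rm --env-file ./terraform.env alpine env | grep EXOSCALE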
  2. Set up Terraform

    Prepare Terraform execution environment
    # Set terraform image and tag to be used
    tf_image=$(\
      yq eval ".parameters.openshift4_terraform.images.terraform.image" \
      dependencies/openshift4-terraform/class/defaults.yml)
    tf_tag=$(\
      yq eval ".parameters.openshift4_terraform.images.terraform.tag" \
      dependencies/openshift4-terraform/class/defaults.yml)
    
    # Generate the terraform alias
    base_dir=$(pwd)
    alias terraform='docker run -it --rm \
      -e REAL_UID=$(id -u) \
      --env-file ${base_dir}/terraform.env \
      -w /tf \
      -v $(pwd):/tf \
      --ulimit memlock=-1 \
      "${tf_image}:${tf_tag}" /tf/terraform.sh'
    
    export GITLAB_REPOSITORY_URL=$(curl -sH "Authorization: Bearer ${COMMODORE_API_TOKEN}" ${COMMODORE_API_URL}/clusters/${CLUSTER_ID} | jq -r '.gitRepo.url' | sed 's|ssh://||; s|/|:|')
    export GITLAB_REPOSITORY_NAME=${GITLAB_REPOSITORY_URL##*/}
    export GITLAB_CATALOG_PROJECT_ID=$(curl -sH "Authorization: Bearer ${GITLAB_TOKEN}" "https://git.vshn.net/api/v4/projects?simple=true&search=${GITLAB_REPOSITORY_NAME/.git}" | jq -r ".[] | select(.ssh_url_to_repo == \"${GITLAB_REPOSITORY_URL}\") | .id")
    export GITLAB_STATE_URL="https://git.vshn.net/api/v4/projects/${GITLAB_CATALOG_PROJECT_ID}/terraform/state/cluster"
    
    pushd catalog/manifests/openshift4-terraform/
    Initialize Terraform
    terraform init \
      "-backend-config=address=${GITLAB_STATE_URL}" \
      "-backend-config=lock_address=${GITLAB_STATE_URL}/lock" \
      "-backend-config=unlock_address=${GITLAB_STATE_URL}/lock" \
      "-backend-config=username=${GITLAB_USER}" \
      "-backend-config=password=${GITLAB_TOKEN}" \
      "-backend-config=lock_method=POST" \
      "-backend-config=unlock_method=DELETE" \
      "-backend-config=retry_wait_min=5"

Set alert silence

  1. Set a silence in Alertmanager for all rook-ceph alerts

    if [[ "$OSTYPE" == "darwin"* ]]; then alias date=gdate; fi
    job_name=$(printf "POST-silence-rook-ceph-alerts-$(date +%s)" | tr '[:upper:]' '[:lower:]')
    silence_duration='+60 minutes' (1)
    kubectl --as=cluster-admin -n openshift-monitoring create -f- <<EOJ
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: ${job_name}
      labels:
        app: silence-rook-ceph-alerts
    spec:
     backoffLimit: 0
     template:
      spec:
        restartPolicy: Never
        containers:
          - name: silence
            image: quay.io/appuio/oc:v4.6
            command:
            - bash
            - -c
            - |
              curl_opts=(
                --cacert /etc/ssl/certs/serving-certs/service-ca.crt
                --header "Content-Type: application/json"
                --header "Authorization: Bearer \$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)"
                --resolve alertmanager-main.openshift-monitoring.svc.cluster.local:9095:\$(getent hosts alertmanager-operated.openshift-monitoring.svc.cluster.local | awk '{print \$1}' | head -n 1)
                --silent
              )
              read -d "" body << EOF
              {
                "matchers": [
                  {
                    "name": "syn_component",
                    "value": "rook-ceph",
                    "isRegex": false
                  }
                ],
                "startsAt": "$(date -u +'%Y-%m-%dT%H:%M:%S')",
                "endsAt": "$(date -u +'%Y-%m-%dT%H:%M:%S' --date "${silence_duration}")",
                "createdBy": "$(kubectl config current-context | cut -d/ -f3)",
                "comment": "Silence all rook-ceph alerts"
              }
              EOF
    
              curl "\${curl_opts[@]}" \
                "https://alertmanager-main.openshift-monitoring.svc.cluster.local:9095/api/v2/silences" \
                -XPOST -d "\${body}"
    
            volumeMounts:
            - mountPath: /etc/ssl/certs/serving-certs/
              name: ca-bundle
              readOnly: true
        serviceAccountName: prometheus-k8s
        volumes:
        - name: ca-bundle
          configMap:
            defaultMode: 288
            name: serving-certs-ca-bundle
    EOJ
    1 Adjust this variable to create a longer or shorter silence
  2. Extract Alertmanager silence ID from job logs

    silence_id=$(kubectl --as=cluster-admin -n openshift-monitoring logs jobs/${job_name} | \
      jq -r '.silenceID')
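
    Verify that the silence ID was extracted successfully. The job may take a few seconds to complete, so re-run the command above if the variable is empty.

    echo "silence ID: ${silence_id}"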

Replace node

  1. Make a note of the node you want to replace

    export NODE_TO_REPLACE=storage-XXXX
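
    Verify that the variable points at an existing node:

    kubectl --as=cluster-admin get node "${NODE_TO_REPLACE}"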

Create a new node

  1. Find Terraform resource index of the node to replace

    # Grab JSON copy of current Terraform state
    terraform state pull > .tfstate.json
    node_index=$(jq --arg storage_node "${NODE_TO_REPLACE}" -r \
      '.resources[] |
       select(.module=="module.cluster.module.storage" and .type=="random_id") |
       .instances[] |
       select(.attributes.hex==$storage_node) |
       .index_key' \
      .tfstate.json)
  2. Verify that resource index is correct

    jq --arg index "${node_index}" -r \
      '.resources[] |
       select(.module=="module.cluster.module.storage" and .type=="exoscale_compute") |
       .instances[$index|tonumber] |
       .attributes.hostname' \
       .tfstate.json
  3. Remove the node ID and node resource of the node to replace from the Terraform state

    terraform state rm "module.cluster.module.storage.random_id.node_id[$node_index]"
    terraform state rm "module.cluster.module.storage.exoscale_compute.nodes[$node_index]"
  4. Run Terraform to spin up a replacement node

    terraform apply
  5. Approve node certificates for the new storage node

    # Once CSRs in state Pending show up, approve them
    # The approval needs to be run twice: each new node requests two CSRs
    # (a client and a serving certificate)
    
    kubectl --as=cluster-admin get csr -w
    
    oc --as=cluster-admin get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | \
      xargs oc --as=cluster-admin adm certificate approve
    
    kubectl --as=cluster-admin get nodes
  6. Label and taint the new storage node

    kubectl --as=cluster-admin label --overwrite node -lnode-role.kubernetes.io/worker \
      node-role.kubernetes.io/storage=""
    kubectl --as=cluster-admin label node -lnode-role.kubernetes.io/infra \
      node-role.kubernetes.io/storage-
    kubectl --as=cluster-admin label node -lnode-role.kubernetes.io/app \
      node-role.kubernetes.io/storage-
    
    kubectl --as=cluster-admin taint node -lnode-role.kubernetes.io/storage \
      storagenode=True:NoSchedule
  7. Wait for the localstorage PV on the new node to be created

    kubectl --as=cluster-admin get pv \
      -l storage.openshift.com/local-volume-owner-name=storagevolumes -w
  8. Disable auto sync for component rook-ceph. This allows us to temporarily make manual changes to the Rook Ceph cluster.

    kubectl --as=cluster-admin -n syn patch apps root --type=json \
      -p '[{"op":"replace", "path":"/spec/syncPolicy", "value": {}}]'
    kubectl --as=cluster-admin -n syn patch apps rook-ceph --type=json \
      -p '[{"op":"replace", "path":"/spec/syncPolicy", "value": {}}]'
  9. Make a note of the original count of OSDs in the Ceph cluster

    orig_osd_count=$(kubectl --as=cluster-admin -n syn-rook-ceph-cluster \
      get cephcluster cluster -o jsonpath='{.spec.storage.storageClassDeviceSets[0].count}')
  10. Change Ceph cluster to have one more OSD

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster patch cephcluster cluster --type=json \
      -p "[{
        \"op\": \"replace\",
        \"path\": \"/spec/storage/storageClassDeviceSets/0/count\",
        \"value\": $(expr ${orig_osd_count} + 1)
      }]"
  11. Wait until the new OSD is launched

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster get pods -w
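
    Once the new OSD pod is running, Ceph should report one more OSD than before:

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
      ceph osd tree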

Remove old OSD

  1. Tell Ceph to take the OSD(s) on the node(s) to replace out of service and relocate data stored on them

    # Verify that the list of nodes to replace is correct
    echo $NODE_TO_REPLACE
    # Reweight OSDs on those nodes to 0
    for node in $(echo -n $NODE_TO_REPLACE); do
      osd_id=$(kubectl --as=cluster-admin -n syn-rook-ceph-cluster get deploy \
        -l failure-domain="${node}" --no-headers \
        -o custom-columns="OSD_ID:.metadata.labels.ceph_daemon_id")
      kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
        ceph osd crush reweight "osd.${osd_id}" 0
    done
  2. Wait for the data to be redistributed ("backfilled")

    When backfilling is completed, ceph status should show all PGs as active+clean.
    Depending on the number of OSDs in the storage cluster and the amount of data that needs to be moved, this may take a while.

    If the storage cluster is mostly idle, you can speed up backfilling by temporarily setting the following configuration.

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
      ceph config set osd osd_max_backfills 10 (1)
    1 The number of PGs which are allowed to backfill in parallel. Adjust up or down depending on client load on the storage cluster.

    After backfilling is completed, you can remove the configuration with

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
      ceph config rm osd osd_max_backfills
    kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
      ceph status
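
    If you prefer not to re-run ceph status manually, a simple polling loop such as this sketch prints the PG summary every 30 seconds; interrupt it once all PGs are active+clean.

    while true; do
      kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec deploy/rook-ceph-tools -- \
        ceph pg stat
      sleep 30
    done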
  3. Remove the OSD(s) from the Ceph cluster

    for node in $(echo -n $NODE_TO_REPLACE); do
      osd_id=$(kubectl --as=cluster-admin -n syn-rook-ceph-cluster get deploy \
        -l failure-domain="${node}" --no-headers \
        -o custom-columns="OSD_ID:.metadata.labels.ceph_daemon_id")
      kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
        ceph osd out "${osd_id}"
      kubectl --as=cluster-admin -n syn-rook-ceph-cluster scale --replicas=0 \
        "deploy/rook-ceph-osd-${osd_id}"
      kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
        ceph osd purge "${osd_id}"
      kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
        ceph osd crush remove "${node}"
    done
  4. Check that the OSD is no longer listed in ceph osd tree

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
      ceph osd tree
  5. Scale down the Rook-Ceph operator

    kubectl --as=cluster-admin -n syn-rook-ceph-operator scale --replicas=0 \
      deploy/rook-ceph-operator
  6. Reset Ceph cluster resource to have original number of OSDs

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster patch cephcluster cluster --type=json \
      -p "[{
        \"op\": \"replace\",
        \"path\": \"/spec/storage/storageClassDeviceSets/0/count\",
        \"value\": ${orig_osd_count}
      }]"
  7. Make a note of the PVC(s) of the old OSD(s)

    We also extract the name of the PV(s) here, but we’ll only delete the PV(s) after removing the node(s) from the cluster.
    old_pvc_names=""
    old_pv_names=""
    for node in $(echo -n $NODE_TO_REPLACE); do
      osd_id=$(kubectl --as=cluster-admin -n syn-rook-ceph-cluster get deploy \
        -l failure-domain="${node}" --no-headers \
        -o custom-columns="NAME:.metadata.name" | cut -d- -f4)
    
      pvc_name=$(kubectl --as=cluster-admin -n syn-rook-ceph-cluster get deploy \
        "rook-ceph-osd-${osd_id}" -ojsonpath='{.metadata.labels.ceph\.rook\.io/pvc}')
      pv_name=$(kubectl --as=cluster-admin -n syn-rook-ceph-cluster get pvc \
        "${pvc_name}" -o jsonpath='{.spec.volumeName}')
    
      old_pvc_names="$old_pvc_names $pvc_name"
      old_pv_names="$old_pv_names $pv_name"
    done
    echo $old_pvc_names
    echo $old_pv_names
  8. Delete old OSD deployment(s)

    for node in $(echo -n $NODE_TO_REPLACE); do
      kubectl --as=cluster-admin -n syn-rook-ceph-cluster delete deploy \
        -l failure-domain="${node}"
    done
  9. Clean up the PVC(s) and OSD prepare job(s) of the old OSD(s) if necessary

    for pvc_name in $(echo -n $old_pvc_names); do
      kubectl --as=cluster-admin -n syn-rook-ceph-cluster delete job \
        -l ceph.rook.io/pvc="${pvc_name}"
      kubectl --as=cluster-admin -n syn-rook-ceph-cluster delete pvc "${pvc_name}"
    done
  10. Clean up PVC encryption secret(s)

    for pvc_name in $(echo -n $old_pvc_names); do
      kubectl --as=cluster-admin -n syn-rook-ceph-cluster delete secret -l pvc_name="${pvc_name}"
    done
  11. Scale up the Rook-Ceph operator

    kubectl --as=cluster-admin -n syn-rook-ceph-operator scale --replicas=1 \
      deploy/rook-ceph-operator

Remove the old MON

  1. Find the MON(s) (if any) on the node(s) to replace

    MON_IDS=""
    for node in $(echo -n $NODE_TO_REPLACE); do
      mon_id=$(kubectl --as=cluster-admin -n syn-rook-ceph-cluster get pods \
        -lapp=rook-ceph-mon --field-selector="spec.nodeName=${node}" \
        --no-headers -ocustom-columns="MON_ID:.metadata.labels.ceph_daemon_id")
      MON_IDS="$MON_IDS $mon_id"
    done
    echo $MON_IDS
    You can skip the remaining steps in this section if $MON_IDS is empty.
  2. Temporarily adjust the Rook MON failover timeout. This tells the operator to perform the MON failover after less time than the default 10 minutes.

    We currently have to restart the operator to force it to pick up the new MON health check configuration. Once Rook.io GitHub issue #8363 is fixed, the operator restart shouldn’t be necessary anymore.

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster patch cephcluster cluster --type=json \
      -p '[{
        "op": "replace",
        "path": "/spec/healthCheck/daemonHealth/mon",
        "value": {
          "disabled": false,
          "interval": "10s",
          "timeout": "10s"
        }
      }]'
    kubectl --as=cluster-admin -n syn-rook-ceph-operator delete pods \
      -l app=rook-ceph-operator
  3. Wait for the operator to settle. Watch the operator logs for a message saying done reconciling ceph cluster in namespace "syn-rook-ceph-cluster"

    kubectl --as=cluster-admin -n syn-rook-ceph-operator logs -f \
      deploy/rook-ceph-operator
  4. Cordon node(s) to replace

    for node in $(echo -n $NODE_TO_REPLACE); do
      kubectl --as=cluster-admin cordon "${node}"
    done
  5. For every id in $MON_IDS replace the MON pod

    mon_id=<MON_ID>
    kubectl --as=cluster-admin -n syn-rook-ceph-cluster delete pod \
      -l app=rook-ceph-mon,ceph_daemon_id="${mon_id}"
    
    # Wait until new MON is scheduled
    kubectl --as=cluster-admin -n syn-rook-ceph-cluster get pods -w
    
    # Wait until the cluster has regained full quorum
    kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
      ceph status
    
    # Repeat for all other $MON_IDS
  6. Verify that three MONs are running

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster get deploy -l app=rook-ceph-mon
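
    In addition to counting the deployments, you can ask Ceph itself whether all three MONs are in quorum; the daemon IDs are listed under quorum_names:

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec deploy/rook-ceph-tools -- \
      ceph quorum_status | jq -r '.quorum_names'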
  7. Reset the MON failover timeout

    We currently have to restart the operator to force it to pick up the new MON health check configuration. Once Rook.io GitHub issue #8363 is fixed, the operator restart shouldn’t be necessary anymore.

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster patch cephcluster cluster --type=json \
      -p '[{
        "op": "replace",
        "path": "/spec/healthCheck/daemonHealth/mon",
        "value": {}
      }]'
    kubectl --as=cluster-admin -n syn-rook-ceph-operator delete pods \
      -l app=rook-ceph-operator

Clean up the old node

  1. Drain the node(s) to replace

    for node in $(echo -n ${NODE_TO_REPLACE}); do
      kubectl --as=cluster-admin drain "${node}" \
        --delete-emptydir-data --ignore-daemonsets
    done
  2. Delete the node(s) to replace from the cluster

    for node in $(echo -n ${NODE_TO_REPLACE}); do
      kubectl --as=cluster-admin delete node "${node}"
    done
  3. Remove the Exoscale VM(s)

    for node in $(echo -n ${NODE_TO_REPLACE}); do
      node_id=$(exo vm list -O json | \
        jq --arg storage_node "$node" -r \
        '.[] | select(.name==$storage_node) | .id')
    
      echo "Removing node:"
      exo vm list | grep "${node_id}"
    
      exo vm delete "${node_id}"
    done
  4. Clean up localstorage PV(s) of decommissioned node(s)

    for pv_name in $(echo -n $old_pv_names); do
      kubectl --as=cluster-admin delete pv "${pv_name}"
    done
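
    At this point the replacement is complete from Ceph's point of view. Before removing the alert silence, confirm that the cluster reports HEALTH_OK:

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
      ceph status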

Finish up

  1. Remove the silence in Alertmanager

    if [[ "$OSTYPE" == "darwin"* ]]; then alias date=gdate; fi
    job_name=$(printf "DELETE-silence-rook-ceph-alerts-$(date +%s)" | tr '[:upper:]' '[:lower:]')
    kubectl --as=cluster-admin -n openshift-monitoring create -f- <<EOJ
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: ${job_name}
      labels:
        app: silence-rook-ceph-alerts
    spec:
     backoffLimit: 0
     template:
      spec:
        restartPolicy: Never
        containers:
          - name: silence
            image: quay.io/appuio/oc:v4.6
            command:
            - bash
            - -c
            - |
              curl_opts=(
                --cacert /etc/ssl/certs/serving-certs/service-ca.crt
                --header "Content-Type: application/json"
                --header "Authorization: Bearer \$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)"
                --resolve alertmanager-main.openshift-monitoring.svc.cluster.local:9095:\$(getent hosts alertmanager-operated.openshift-monitoring.svc.cluster.local | awk '{print \$1}' | head -n 1)
                --silent
              )
    
              curl "\${curl_opts[@]}" \
                "https://alertmanager-main.openshift-monitoring.svc.cluster.local:9095/api/v2/silence/${silence_id}" \
                -XDELETE
    
            volumeMounts:
            - mountPath: /etc/ssl/certs/serving-certs/
              name: ca-bundle
              readOnly: true
        serviceAccountName: prometheus-k8s
        volumes:
        - name: ca-bundle
          configMap:
            defaultMode: 288
            name: serving-certs-ca-bundle
    EOJ
  2. Clean up Alertmanager silence jobs

    kubectl --as=cluster-admin -n openshift-monitoring delete jobs -l app=silence-rook-ceph-alerts
  3. Re-enable ArgoCD auto sync

    kubectl --as=cluster-admin -n syn patch apps root --type=json \
      -p '[{
        "op":"replace",
        "path":"/spec/syncPolicy",
        "value": {"automated": {"prune": true, "selfHeal": true}}
      }]'
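
    As a final check, verify that the apps report an automated sync policy again. The rook-ceph app's policy is restored automatically once the root app re-syncs.

    kubectl --as=cluster-admin -n syn get apps root rook-ceph \
      -o custom-columns="NAME:.metadata.name,SYNC_POLICY:.spec.syncPolicy"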

Upstream documentation