Reinitialize a storage disk

Steps to reinitialize an existing but corrupted Ceph storage disk of an OpenShift 4 cluster on Exoscale.

Starting situation

  • You already have an OpenShift 4 cluster on Exoscale

  • You have admin-level access to the cluster

  • You want to reinitialize a corrupted Ceph storage disk of an existing storage node of the cluster.

    The main symptoms indicating a corrupted storage disk are the following (a quick check is sketched after this list):

    • the OSD pod associated with the corrupted disk is in CrashLoopBackOff

    • the alert CephOSDDiskNotResponding is firing for the OSD associated with the corrupted disk.
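
A quick way to check for these symptoms is to list the OSD pods and look for pods in CrashLoopBackOff. This is a minimal sketch; it assumes the OSD pods carry the standard Rook label app=rook-ceph-osd:

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster get pods -l app=rook-ceph-osd

The OSD ID needed in the next section appears in the name of the crashing pod (rook-ceph-osd-<ID>-…).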

Prerequisites

The following CLI utilities need to be available locally:

  • kubectl

  • jq

  • oc

Gather information

  1. Make a note of the OSD ID for the disk you want to reinitialize

    export OSD_ID=<ID>
  2. Find PVC and PV of the disk to reinitialize

    pvc_name=$(kubectl --as=cluster-admin -n syn-rook-ceph-cluster get deploy \
      "rook-ceph-osd-${OSD_ID}" -ojsonpath='{.metadata.labels.ceph\.rook\.io/pvc}')
    pv_name=$(kubectl --as=cluster-admin -n syn-rook-ceph-cluster get pvc \
      "${pvc_name}" -o jsonpath='{.spec.volumeName}')
  3. Find node hosting the disk to reinitialize

    node_name=$(kubectl --as=cluster-admin get pv "${pv_name}" \
      -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0]}')
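
Optionally, verify the gathered values before continuing. If any of them is empty, one of the lookups above didn't match and the following steps would operate on the wrong resources:

    echo "OSD_ID=${OSD_ID} PVC=${pvc_name} PV=${pv_name} NODE=${node_name}"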

Create silence in Alertmanager

  1. Create an Alertmanager silence

    silence_id=$(
        kubectl --as=cluster-admin -n openshift-monitoring exec \
        sts/alertmanager-main -- amtool --alertmanager.url=http://localhost:9093 \
        silence add syn_component=rook-ceph --duration="30m" -c "Silence rook-ceph alerts" -a "$(oc whoami)"
    )
    echo "${silence_id}"
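
If the printed ID is empty, the silence wasn't created. You can list the active silences to double-check (optional):

    kubectl --as=cluster-admin -n openshift-monitoring exec sts/alertmanager-main -- \
        amtool --alertmanager.url=http://localhost:9093 silence query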

Reinitialize disk

Shut down the OSD of the disk to reinitialize

  1. Temporarily disable rebalancing of the Ceph cluster

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
      ceph osd set noout
  2. Disable auto sync for component rook-ceph and for the root application (so that root doesn't immediately revert the change). This allows us to temporarily make manual changes to the Rook Ceph cluster.

    kubectl --as=cluster-admin -n syn patch apps root --type=json \
      -p '[{"op":"replace", "path":"/spec/syncPolicy", "value": {}}]'
    kubectl --as=cluster-admin -n syn patch apps rook-ceph --type=json \
      -p '[{"op":"replace", "path":"/spec/syncPolicy", "value": {}}]'
  3. Scale down the Rook-Ceph operator

    kubectl --as=cluster-admin -n syn-rook-ceph-operator scale --replicas=0 \
      deploy/rook-ceph-operator
  4. Take the old OSD out of service

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
      ceph osd out "osd.${OSD_ID}"
  5. Delete the OSD deployment of the disk you want to reinitialize

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster delete deploy \
      "rook-ceph-osd-${OSD_ID}"

Clean the disk

  1. Find the local-storage-operator pod managing the disk

    diskmaker_pod=$(kubectl --as=cluster-admin -n openshift-local-storage get pods \
      -l "app=diskmaker-manager" --field-selector="spec.nodeName=${node_name}" \
      -o jsonpath='{.items[0].metadata.name}')
  2. Close the LUKS device of the disk

    ceph_image=$(kubectl --as=cluster-admin -n syn-rook-ceph-cluster get job \
      -l "ceph.rook.io/pvc=${pvc_name}" \
      -o jsonpath='{.items[0].spec.template.spec.containers[0].image}')
    kubectl --as=cluster-admin run -n syn-rook-ceph-cluster \
      "cryptclose-${node_name}-$(date +%s)" --restart=Never -it --rm --image overridden \
      --overrides '{
      "spec": {
        "nodeSelector": {
          "kubernetes.io/hostname": "'"${node_name}"'"
        },
        "hostNetwork": true,
        "hostIPC": true,
        "containers": [{
          "name": "crypttool",
          "image": "'"${ceph_image}"'",
          "command": [
            "sh", "-c",
            "cryptsetup remove /dev/mapper/'"${pvc_name}"'*"
          ],
          "securityContext": {
            "privileged": true,
            "runAsNonRoot": false,
            "runAsUser": 0
          },
          "serviceAccount": "rook-ceph-osd",
          "volumeMounts": [{
            "name": "devices",
            "mountPath": "/dev"
          }]
        }],
        "tolerations": [{
          "key": "storagenode",
          "operator": "Exists"
        }],
        "volumes": [{
          "hostPath": {
            "path": "/dev",
            "type": ""
          },
          "name": "devices"
        }]
      }
    }'
  3. Clean the disk

    We clean the disk by zeroing its first 512MB, which should be sufficient for Ceph to create a new OSD on it. If you get errors in the new OSD prepare job, increase the count argument of the dd command, for example to count=2048 to zero the first 2GB of the disk.

    disk_path=$(kubectl --as=cluster-admin get pv "${pv_name}" -o jsonpath='{.spec.local.path}')
    kubectl --as=cluster-admin -n openshift-local-storage exec -it "${diskmaker_pod}" -- \
      dd if=/dev/zero of="${disk_path}" bs=1M count=512
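
To confirm that the start of the disk was actually zeroed, you can dump its first bytes (optional sketch; this assumes od from coreutils is available in the diskmaker image):

    kubectl --as=cluster-admin -n openshift-local-storage exec -it "${diskmaker_pod}" -- \
      od -A d -N 4096 "${disk_path}"

For an all-zero region, od collapses the output to a single line of zeros followed by a * marker.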

Start a new OSD on the cleaned disk

  1. Scale Rook-Ceph operator back to 1 replica

    kubectl --as=cluster-admin -n syn-rook-ceph-operator scale --replicas=1 \
      deploy/rook-ceph-operator
  2. Wait for the operator to run a new OSD prepare job and start a new OSD pod on the cleaned disk

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster get pods -w
  3. Re-enable Ceph balancing

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
      ceph osd unset noout
  4. Wait for the OSD to be repopulated with data ("backfilled").

    When backfilling is completed, ceph status should show all PGs as active+clean.
    Depending on the number of OSDs in the storage cluster and the amount of data that needs to be moved, this may take a while.

    If the storage cluster is mostly idle, you can speed up backfilling by temporarily setting the following configuration.

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
      ceph config set osd osd_mclock_override_recovery_settings true (1)
    kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
      ceph config set osd osd_max_backfills 10 (2)
    1 Allow overwriting osd_max_backfills.
    2 The number of PGs which are allowed to backfill in parallel. Adjust up or down depending on client load on the storage cluster.

    You can check the backfill progress at any time with

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
      ceph status

    After backfilling is completed, remove the temporary configuration again with

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
      ceph config rm osd osd_max_backfills
    kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
      ceph config rm osd osd_mclock_override_recovery_settings
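
If you prefer a one-line summary over the full status output while waiting, ceph pg stat prints just the PG state counts:

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
      ceph pg stat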

Finalize reinitialization

  1. Clean up the old OSD

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
      ceph osd purge "osd.${OSD_ID}" --yes-i-really-mean-it
  2. Check that the Ceph cluster is healthy

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
      ceph status
    kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
      ceph osd tree
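
If the cluster still reports a warning at this point, ceph health detail lists the individual issues and usually points at the affected OSDs or PGs:

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
      ceph health detail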

Finish up

  1. Expire alertmanager silence

    kubectl --as=cluster-admin -n openshift-monitoring exec sts/alertmanager-main -- \
        amtool --alertmanager.url=http://localhost:9093 silence expire "${silence_id}"
  2. Re-enable ArgoCD auto sync

    kubectl --as=cluster-admin -n syn patch apps root --type=json \
      -p '[{
        "op":"replace",
        "path":"/spec/syncPolicy",
        "value": {"automated": {"prune": true, "selfHeal": true}}
      }]'
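
Optionally, confirm the sync status of the root and rook-ceph applications. This is a minimal sketch; it assumes the apps short name used above resolves to ArgoCD's Application resources:

    kubectl --as=cluster-admin -n syn get apps root rook-ceph \
      -o custom-columns=NAME:.metadata.name,SYNC:.status.sync.status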