Reinitialize a storage disk

Steps to reinitialize an existing but corrupted Ceph storage disk of an OpenShift 4 cluster on Exoscale.

Starting situation

  • You already have an OpenShift 4 cluster on Exoscale

  • You have admin-level access to the cluster

  • You want to reinitialize a corrupted Ceph storage disk of an existing storage node of the cluster.

    The main symptoms indicating a corrupted storage disk are:

    • the OSD pod associated with the corrupted disk is in CrashLoopBackOff

    • the alert CephOSDDiskNotResponding is firing for the OSD associated with the corrupted disk.
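
    One quick way to confirm this is to list the OSD pods (Rook labels them with app=rook-ceph-osd) and look for the pod stuck in CrashLoopBackOff:

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster get pods -l app=rook-ceph-osd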

Prerequisites

The following CLI utilities need to be available locally:

  • kubectl

  • jq

Gather information

  1. Make a note of the OSD ID for the disk you want to reinitialize

    export OSD_ID=<ID>
  2. Find PVC and PV of the disk to reinitialize

    pvc_name=$(kubectl --as=cluster-admin -n syn-rook-ceph-cluster get deploy \
      "rook-ceph-osd-${OSD_ID}" -ojsonpath='{.metadata.labels.ceph\.rook\.io/pvc}')
    pv_name=$(kubectl --as=cluster-admin -n syn-rook-ceph-cluster get pvc \
      "${pvc_name}" -o jsonpath='{.spec.volumeName}')
  3. Find node hosting the disk to reinitialize

    node_name=$(kubectl --as=cluster-admin get pv ${pv_name} \
      -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0]}')
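
    Before continuing, check that all of the gathered values are non-empty, for example:

    echo "OSD_ID=${OSD_ID} pvc=${pvc_name} pv=${pv_name} node=${node_name}"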

Create silence in Alertmanager

  1. Set a silence in Alertmanager for all rook-ceph alerts

    if [[ "$OSTYPE" == "darwin"* ]]; then alias date=gdate; fi
    job_name=$(printf "POST-silence-rook-ceph-alerts-$(date +%s)" | tr '[:upper:]' '[:lower:]')
    silence_duration='+30 minutes' (1)
    kubectl --as=cluster-admin -n openshift-monitoring create -f- <<EOJ
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: ${job_name}
      labels:
        app: silence-rook-ceph-alerts
    spec:
     backoffLimit: 0
     template:
      spec:
        restartPolicy: Never
        containers:
          - name: silence
            image: quay.io/appuio/oc:v4.13
            command:
            - bash
            - -c
            - |
              curl_opts=( --cacert /etc/ssl/certs/serving-certs/service-ca.crt --header "Content-Type: application/json" --header "Authorization: Bearer \$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" --resolve alertmanager-main.openshift-monitoring.svc.cluster.local:9095:\$(getent hosts alertmanager-operated.openshift-monitoring.svc.cluster.local | awk '{print \$1}' | head -n 1) --silent )
              read -d "" body << EOF
              {
                "matchers": [
                  {
                    "name": "syn_component",
                    "value": "rook-ceph",
                    "isRegex": false
                  }
                ],
                "startsAt": "$(date -u +'%Y-%m-%dT%H:%M:%S')",
                "endsAt": "$(date -u +'%Y-%m-%dT%H:%M:%S' --date "${silence_duration}")",
                "createdBy": "$(kubectl config current-context | cut -d/ -f3)",
                "comment": "Silence rook-ceph alerts"
              }
              EOF
    
              curl "\${curl_opts[@]}" \
                "https://alertmanager-main.openshift-monitoring.svc.cluster.local:9095/api/v2/silences" \
                -XPOST -d "\${body}"
    
            volumeMounts:
            - mountPath: /etc/ssl/certs/serving-certs/
              name: ca-bundle
              readOnly: true
            - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
              name: kube-api-access
              readOnly: true
        serviceAccountName: prometheus-k8s
        volumes:
        - name: ca-bundle
          configMap:
            defaultMode: 288
            name: serving-certs-ca-bundle
        - name: kube-api-access
          projected:
            defaultMode: 420
            sources:
              - serviceAccountToken:
                  expirationSeconds: 3607
                  path: 'token'
    EOJ
    1 Adjust this variable to create a longer or shorter silence.
  2. Extract Alertmanager silence ID from job logs

    silence_id=$(kubectl --as=cluster-admin -n openshift-monitoring logs jobs/${job_name} | \
      jq -r '.silenceID')
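
    The silence ID should be a UUID; if it's empty, the job probably hasn't completed yet, so wait a moment and re-run the command above. You can inspect the value with:

    echo "${silence_id}"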

Reinitialize disk

Shut down the OSD of the disk to reinitialize

  1. Temporarily disable rebalancing of the Ceph cluster

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
      ceph osd set noout
  2. Disable auto sync for component rook-ceph. This allows us to temporarily make manual changes to the Rook Ceph cluster.

    kubectl --as=cluster-admin -n syn patch apps root --type=json \
      -p '[{"op":"replace", "path":"/spec/syncPolicy", "value": {}}]'
    kubectl --as=cluster-admin -n syn patch apps rook-ceph --type=json \
      -p '[{"op":"replace", "path":"/spec/syncPolicy", "value": {}}]'
  3. Scale down the Rook-Ceph operator

    kubectl --as=cluster-admin -n syn-rook-ceph-operator scale --replicas=0 \
      deploy/rook-ceph-operator
  4. Take the old OSD out of service

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
      ceph osd out "osd.${OSD_ID}"
  5. Delete the OSD deployment of the disk you want to reinitialize

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster delete deploy \
      "rook-ceph-osd-${OSD_ID}"

Clean the disk

  1. Find the local-storage-provisioner pod managing the disk

    provisioner_pod_label="local-volume-provisioner-$(kubectl --as=cluster-admin \
      get pv ${pv_name} \
      -o jsonpath='{.metadata.labels.storage\.openshift\.com/local-volume-owner-name}')"
    provisioner_pod=$(kubectl --as=cluster-admin -n openshift-local-storage get pods \
      -l "app=${provisioner_pod_label}" --field-selector="spec.nodeName=${node_name}" \
      -o jsonpath='{.items[0].metadata.name}')
  2. Close the LUKS device of the disk

    ceph_image=$(kubectl --as=cluster-admin -n syn-rook-ceph-cluster get job \
      -l "ceph.rook.io/pvc=${pvc_name}" \
      -o jsonpath='{.items[0].spec.template.spec.containers[0].image}')
    kubectl --as=cluster-admin run -n syn-rook-ceph-cluster \
      "cryptclose-${node_name}-$(date +%s)" --restart=Never -it --rm --image overridden \
      --overrides '{
      "spec": {
        "nodeSelector": {
          "kubernetes.io/hostname": "'"${node_name}"'"
        },
        "hostNetwork": true,
        "hostIPC": true,
        "containers": [{
          "name": "crypttool",
          "image": "'"${ceph_image}"'",
          "command": [
            "sh", "-c",
            "cryptsetup remove /dev/mapper/'"${pvc_name}"'*"
          ],
          "securityContext": {
            "privileged": true,
            "runAsNonRoot": false,
            "runAsUser": 0
          },
          "serviceAccount": "rook-ceph-osd",
          "volumeMounts": [{
            "name": "devices",
            "mountPath": "/dev"
          }]
        }],
        "tolerations": [{
          "key": "storagenode",
          "operator": "Exists"
        }],
        "volumes": [{
          "hostPath": {
            "path": "/dev",
            "type": ""
          },
          "name": "devices"
        }]
      }
    }'
  3. Clean the disk

    We clean the disk by zeroing its first 512MB, which should be sufficient for Ceph to create a new OSD on it. If the new OSD prepare job reports errors, increase the count argument of the dd command, for example to count=2048 to zero the first 2GB of the disk.

    disk_path=$(kubectl --as=cluster-admin get pv "${pv_name}" -o jsonpath='{.spec.local.path}')
    kubectl --as=cluster-admin -n openshift-local-storage exec -it "${provisioner_pod}" -- \
     dd if=/dev/zero of="${disk_path}" bs=1M count=512

Start a new OSD on the cleaned disk

  1. Scale Rook-Ceph operator back to 1 replica

    kubectl --as=cluster-admin -n syn-rook-ceph-operator scale --replicas=1 \
      deploy/rook-ceph-operator
  2. Wait for the operator to reconfigure the disk for the OSD. The operator first runs a new rook-ceph-osd-prepare job for the disk's PVC; once that job completes, a new OSD pod should come up and reach Running.

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster get pods -w
  3. Re-enable Ceph rebalancing

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
      ceph osd unset noout
  4. Wait for the OSD to be repopulated with data ("backfilled").

    When backfilling is completed, ceph status should show all PGs as active+clean.
    Depending on the number of OSDs in the storage cluster and the amount of data that needs to be moved, this may take a while.

    If the storage cluster is mostly idle, you can speed up backfilling by temporarily setting the following configuration.

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
      ceph config set osd osd_mclock_override_recovery_settings true (1)
    kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
      ceph config set osd osd_max_backfills 10 (2)
    1 Allow overwriting osd_max_backfills.
    2 The number of PGs which are allowed to backfill in parallel. Adjust up or down depending on client load on the storage cluster.

    After backfilling is completed, you can remove the configuration with

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
      ceph config rm osd osd_max_backfills
    kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
      ceph config rm osd osd_mclock_override_recovery_settings
    kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
      ceph status
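
    To follow the backfill progress, you can re-run ceph status periodically, for example with watch if it's available locally (without -it, since watch doesn't provide a terminal):

    watch -n 60 "kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec deploy/rook-ceph-tools -- ceph status"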

Finalize reinitialization

  1. Clean up the old OSD

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
      ceph osd purge "osd.${OSD_ID}" --yes-i-really-mean-it
  2. Check that the Ceph cluster is healthy

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
      ceph status
    kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
      ceph osd tree
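
    ceph status should report HEALTH_OK and ceph osd tree should no longer list the purged osd.${OSD_ID}. As a compact check:

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
      ceph health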

Finish up

  1. Remove silence in Alertmanager

    if [[ "$OSTYPE" == "darwin"* ]]; then alias date=gdate; fi
    job_name=$(printf "DELETE-silence-rook-ceph-alerts-$(date +%s)" | tr '[:upper:]' '[:lower:]')
    kubectl --as=cluster-admin -n openshift-monitoring create -f- <<EOJ
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: ${job_name}
      labels:
        app: silence-rook-ceph-alerts
    spec:
     backoffLimit: 0
     template:
      spec:
        restartPolicy: Never
        containers:
          - name: silence
            image: quay.io/appuio/oc:v4.13
            command:
            - bash
            - -c
            - |
              curl_opts=( --cacert /etc/ssl/certs/serving-certs/service-ca.crt --header "Content-Type: application/json" --header "Authorization: Bearer \$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" --resolve alertmanager-main.openshift-monitoring.svc.cluster.local:9095:\$(getent hosts alertmanager-operated.openshift-monitoring.svc.cluster.local | awk '{print \$1}' | head -n 1) --silent )
    
              curl "\${curl_opts[@]}" \
                "https://alertmanager-main.openshift-monitoring.svc.cluster.local:9095/api/v2/silence/${silence_id}" \
                -XDELETE
    
            volumeMounts:
            - mountPath: /etc/ssl/certs/serving-certs/
              name: ca-bundle
              readOnly: true
            - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
              name: kube-api-access
              readOnly: true
        serviceAccountName: prometheus-k8s
        volumes:
        - name: ca-bundle
          configMap:
            defaultMode: 288
            name: serving-certs-ca-bundle
        - name: kube-api-access
          projected:
            defaultMode: 420
            sources:
              - serviceAccountToken:
                  expirationSeconds: 3607
                  path: 'token'
    EOJ
  2. Clean up Alertmanager silence jobs

    kubectl --as=cluster-admin -n openshift-monitoring delete jobs -l app=silence-rook-ceph-alerts
  3. Re-enable ArgoCD auto sync

    kubectl --as=cluster-admin -n syn patch apps root --type=json \
      -p '[{
        "op":"replace",
        "path":"/spec/syncPolicy",
        "value": {"automated": {"prune": true, "selfHeal": true}}
      }]'
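
    Re-enabling auto sync on the root app should be sufficient, since the root app manages the other Application resources and restores the sync policy of the rook-ceph app on its next sync. Once that has happened, you can verify that automated sync is back on both apps:

    kubectl --as=cluster-admin -n syn get apps root rook-ceph \
      -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.syncPolicy.automated}{"\n"}{end}'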