Reinitialize a storage disk
Steps to reinitialize an existing but corrupted Ceph storage disk of an OpenShift 4 cluster on Exoscale.
Starting situation
-
You already have an OpenShift 4 cluster on Exoscale
-
You have admin-level access to the cluster
-
You want to reinitialize a corrupted Ceph storage disk of an existing storage node of the cluster.
The main symptoms indicating a corrupted storage disk are:
-
the OSD pod associated with the corrupted disk is in CrashLoopBackOff
-
the alert CephOSDDiskNotResponding is firing for the OSD associated with the corrupted disk.
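The first symptom can be checked from the command line. The following is a minimal sketch, assuming the OSD pods carry the label app=rook-ceph-osd (the label Rook normally sets) and run in the syn-rook-ceph-cluster namespace used throughout this guide. The function name is hypothetical.

```shell
# Hypothetical helper: print the names of OSD pods whose containers are
# waiting in CrashLoopBackOff.
# Assumption: OSD pods are labelled app=rook-ceph-osd.
crashlooping_osd_pods() {
  kubectl --as=cluster-admin -n syn-rook-ceph-cluster get pods -l app=rook-ceph-osd \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}' \
    | awk -F'\t' '$2 ~ /CrashLoopBackOff/ { print $1 }'
}
```

Calling crashlooping_osd_pods prints one pod name per line, or nothing if no OSD pod is crash-looping.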
Gather information
-
Make a note of the OSD ID for the disk you want to reinitialize
export OSD_ID=<ID>
-
Find PVC and PV of the disk to reinitialize
pvc_name=$(kubectl --as=cluster-admin -n syn-rook-ceph-cluster get deploy \
  "rook-ceph-osd-${OSD_ID}" -ojsonpath='{.metadata.labels.ceph\.rook\.io/pvc}')
pv_name=$(kubectl --as=cluster-admin -n syn-rook-ceph-cluster get pvc \
  "${pvc_name}" -o jsonpath='{.spec.volumeName}')
-
Find node hosting the disk to reinitialize
node_name=$(kubectl --as=cluster-admin get pv ${pv_name} \
  -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0]}')
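All later steps depend on the variables gathered here, and a failed lookup leaves a variable empty without any error. As a sketch, a small guard like the following could be run before continuing. The function name require_vars is made up for illustration.

```shell
# Hypothetical helper: fail if any of the named shell variables is unset or empty.
require_vars() {
  rv_missing=0
  for rv_name in "$@"; do
    # Look up the value of the variable whose name is stored in rv_name.
    eval "rv_value=\${${rv_name}:-}"
    if [ -z "${rv_value}" ]; then
      echo "variable ${rv_name} is empty" >&2
      rv_missing=1
    fi
  done
  return "${rv_missing}"
}
```

For example, require_vars OSD_ID pvc_name pv_name node_name returns a non-zero status and names the offending variable if one of the lookups above returned nothing.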
Create silence in Alertmanager
-
Set a silence in Alertmanager for all rook-ceph alerts
if [[ "$OSTYPE" == "darwin"* ]]; then alias date=gdate; fi
job_name=$(printf "POST-silence-rook-ceph-alerts-$(date +%s)" | tr '[:upper:]' '[:lower:]')
silence_duration='+30 minutes' # (1)
kubectl --as=cluster-admin -n openshift-monitoring create -f- <<EOJ
apiVersion: batch/v1
kind: Job
metadata:
  name: ${job_name}
  labels:
    app: silence-rook-ceph-alerts
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: silence
          image: quay.io/appuio/oc:v4.13
          command:
          - bash
          - -c
          - |
            curl_opts=( --cacert /etc/ssl/certs/serving-certs/service-ca.crt --header "Content-Type: application/json" --header "Authorization: Bearer \$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" --resolve alertmanager-main.openshift-monitoring.svc.cluster.local:9095:\$(getent hosts alertmanager-operated.openshift-monitoring.svc.cluster.local | awk '{print \$1}' | head -n 1) --silent )
            read -d "" body << EOF
            {
              "matchers": [
                {
                  "name": "syn_component",
                  "value": "rook-ceph",
                  "isRegex": false
                }
              ],
              "startsAt": "$(date -u +'%Y-%m-%dT%H:%M:%S')",
              "endsAt": "$(date -u +'%Y-%m-%dT%H:%M:%S' --date "${silence_duration}")",
              "createdBy": "$(kubectl config current-context | cut -d/ -f3)",
              "comment": "Silence rook-ceph alerts"
            }
            EOF
            curl "\${curl_opts[@]}" \
              "https://alertmanager-main.openshift-monitoring.svc.cluster.local:9095/api/v2/silences" \
              -XPOST -d "\${body}"
          volumeMounts:
          - mountPath: /etc/ssl/certs/serving-certs/
            name: ca-bundle
            readOnly: true
          - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
            name: kube-api-access
            readOnly: true
      serviceAccountName: prometheus-k8s
      volumes:
      - name: ca-bundle
        configMap:
          defaultMode: 288
          name: serving-certs-ca-bundle
      - name: kube-api-access
        projected:
          defaultMode: 420
          sources:
            - serviceAccountToken:
                expirationSeconds: 3607
                path: 'token'
EOJ
(1) Adjust this variable to create a longer or shorter silence.
-
Extract Alertmanager silence ID from job logs
silence_id=$(kubectl --as=cluster-admin -n openshift-monitoring logs jobs/${job_name} | \
  jq -r '.silenceID')
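The job logs only contain the silence ID once the job has finished, and jq -r prints the literal string null when the silenceID key is missing from the response. The following sketch, with hypothetical function names, waits for the job and rejects an unusable ID. The 120 second timeout is an arbitrary assumption.

```shell
# Hypothetical helper: block until the given Job has completed
# (or the assumed timeout of 120 seconds expires).
wait_for_job() {
  kubectl --as=cluster-admin -n openshift-monitoring wait \
    --for=condition=complete "job/$1" --timeout=120s
}

# Hypothetical helper: succeed only if the argument looks like a usable silence ID.
# An empty string or the literal "null" means the ID could not be extracted.
valid_silence_id() {
  case "$1" in
    ''|null) return 1 ;;
    *) return 0 ;;
  esac
}
```

One possible use is to run wait_for_job "${job_name}" before reading the logs and valid_silence_id "${silence_id}" afterwards, so that a failed silence is noticed now rather than when removing it at the end.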
Reinitialize disk
Shut down OSD of the disk to reinitialize
-
Temporarily disable rebalancing of the Ceph cluster
kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
  ceph osd set noout
-
Disable auto sync for component rook-ceph. This allows us to temporarily make manual changes to the Rook Ceph cluster.
kubectl --as=cluster-admin -n syn patch apps root --type=json \
  -p '[{"op":"replace", "path":"/spec/syncPolicy", "value": {}}]'
kubectl --as=cluster-admin -n syn patch apps rook-ceph --type=json \
  -p '[{"op":"replace", "path":"/spec/syncPolicy", "value": {}}]'
-
Scale down the Rook-Ceph operator
kubectl --as=cluster-admin -n syn-rook-ceph-operator scale --replicas=0 \
  deploy/rook-ceph-operator
-
Take the old OSD out of service
kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
  ceph osd out "osd.${OSD_ID}"
-
Delete the OSD deployment of the disk you want to reinitialize
kubectl --as=cluster-admin -n syn-rook-ceph-cluster delete deploy \
  "rook-ceph-osd-${OSD_ID}"
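Before touching the disk, it can be worth confirming that the noout flag set at the start of this section is really active. The sketch below parses the flags line that ceph osd dump prints. The function name is hypothetical.

```shell
# Hypothetical helper: read `ceph osd dump` output on stdin and succeed
# if the cluster-wide noout flag appears in the "flags" line.
has_noout_flag() {
  grep '^flags ' | sed 's/^flags //' | tr ',' '\n' | grep -x 'noout' >/dev/null
}
```

It could be fed with the output of kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec deploy/rook-ceph-tools -- ceph osd dump, run without -it because the output is piped.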
Clean the disk
-
Find the local-storage-provisioner pod managing the disk
provisioner_pod_label="local-volume-provisioner-$(kubectl --as=cluster-admin \
  get pv ${pv_name} \
  -o jsonpath='{.metadata.labels.storage\.openshift\.com/local-volume-owner-name}')"
provisioner_pod=$(kubectl --as=cluster-admin -n openshift-local-storage get pods \
  -l "app=${provisioner_pod_label}" --field-selector="spec.nodeName=${node_name}" \
  -o jsonpath='{.items[0].metadata.name}')
-
Close the LUKS device of the disk
ceph_image=$(kubectl --as=cluster-admin -n syn-rook-ceph-cluster get job \
  -l "ceph.rook.io/pvc=${pvc_name}" \
  -o jsonpath='{.items[0].spec.template.spec.containers[0].image}')
kubectl --as=cluster-admin run -n syn-rook-ceph-cluster \
  "cryptclose-${node_name}-$(date +%s)" --restart=Never -it --rm --image overridden \
  --overrides '{
  "spec": {
    "nodeSelector": {
      "kubernetes.io/hostname": "'"${node_name}"'"
    },
    "hostNetwork": true,
    "hostIPC": true,
    "containers": [{
      "name": "crypttool",
      "image": "'"${ceph_image}"'",
      "command": [
        "sh", "-c",
        "cryptsetup remove /dev/mapper/'"${pvc_name}"'*"
      ],
      "securityContext": {
        "privileged": true,
        "runAsNonRoot": false,
        "runAsUser": 0
      },
      "serviceAccount": "rook-ceph-osd",
      "volumeMounts": [{
        "name": "devices",
        "mountPath": "/dev"
      }]
    }],
    "tolerations": [{
      "key": "storagenode",
      "operator": "Exists"
    }],
    "volumes": [{
      "hostPath": {
        "path": "/dev",
        "type": ""
      },
      "name": "devices"
    }]
  }
}'
-
Clean the disk
We’re cleaning the disk by zeroing the first 512MB. This should be sufficient to allow Ceph to create a new OSD on the disk. If you get errors in the new OSD prepare job, increase the count argument of the dd command to a larger number, for example count=2048 to zero the first 2GB of the disk.
disk_path=$(kubectl --as=cluster-admin get pv "${pv_name}" -o jsonpath='{.spec.local.path}')
kubectl --as=cluster-admin -n openshift-local-storage exec -it "${provisioner_pod}" -- \
  dd if=/dev/zero of="${disk_path}" bs=1M count=512
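To confirm that the start of the disk really reads back as zeros, a check along the following lines could be used. This is only a sketch: the function name is hypothetical, and it assumes that head, tr and wc are available wherever it runs, which may not hold inside the provisioner pod.

```shell
# Hypothetical helper: print how many of the first N bytes of a file or
# block device are non-zero.
# Usage: count_nonzero_bytes PATH N
count_nonzero_bytes() {
  # Drop all NUL bytes and count what is left; strip padding some wc versions add.
  head -c "$2" "$1" | tr -d '\000' | wc -c | tr -d ' '
}
```

A result of 0 for the range that was overwritten, for example count_nonzero_bytes "${disk_path}" 1048576 for the first MiB, indicates that the zeroing took effect.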
Start a new OSD on the cleaned disk
-
Scale Rook-Ceph operator back to 1 replica
kubectl --as=cluster-admin -n syn-rook-ceph-operator scale --replicas=1 \
  deploy/rook-ceph-operator
-
Wait for the operator to reconfigure the disk for the OSD
kubectl --as=cluster-admin -n syn-rook-ceph-cluster get pods -w
-
Re-enable Ceph balancing
kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
  ceph osd unset noout
-
Wait for the OSD to be repopulated with data ("backfilled").
When backfilling is completed, ceph status should show all PGs as active+clean.
Depending on the number of OSDs in the storage cluster and the amount of data that needs to be moved, this may take a while. If the storage cluster is mostly idle, you can speed up backfilling by temporarily setting the following configuration.
kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
  ceph config set osd osd_max_backfills 10 # (1)
(1) The number of PGs which are allowed to backfill in parallel. Adjust up or down depending on client load on the storage cluster.
After backfilling is completed, you can remove the configuration with
kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
  ceph config rm osd osd_max_backfills
kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
  ceph status
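Instead of re-running ceph status by hand, the wait can be scripted. The sketch below polls ceph health until it reports HEALTH_OK, which is used here as a rough proxy for all PGs being active+clean. Unrelated warnings would keep it waiting. The function name and its two parameters are hypothetical.

```shell
# Hypothetical helper: poll `ceph health` until it reports HEALTH_OK.
# Usage: wait_for_health_ok ATTEMPTS INTERVAL_SECONDS
wait_for_health_ok() {
  wh_attempts=$1
  wh_interval=$2
  while [ "${wh_attempts}" -gt 0 ]; do
    wh_status=$(kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec \
      deploy/rook-ceph-tools -- ceph health 2>/dev/null)
    case "${wh_status}" in
      HEALTH_OK*) return 0 ;;
    esac
    wh_attempts=$((wh_attempts - 1))
    sleep "${wh_interval}"
  done
  # Still not healthy after the given number of attempts.
  return 1
}
```

For example, wait_for_health_ok 120 30 checks every 30 seconds for up to an hour and returns a non-zero status if the cluster has not become healthy by then.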
Finalize reinitialization
-
Clean up the old OSD
kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
  ceph osd purge "osd.${OSD_ID}"
-
Check that Ceph cluster is healthy
kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
  ceph status
kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
  ceph osd tree
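As an additional check, ceph osd ls prints the IDs of all OSDs known to the cluster, one per line. Assuming the replacement OSD was created under a new ID, the purged ID should no longer be listed. The helper name below is made up for illustration.

```shell
# Hypothetical helper: succeed if the given OSD ID is still known to the cluster.
osd_id_listed() {
  kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec deploy/rook-ceph-tools -- \
    ceph osd ls | grep -Fx "$1" >/dev/null
}
```

After the purge, osd_id_listed "${OSD_ID}" is expected to fail under that assumption. If it still succeeds, inspect the ceph osd tree output above before continuing.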
Finish up
-
Remove silence in Alertmanager
if [[ "$OSTYPE" == "darwin"* ]]; then alias date=gdate; fi
job_name=$(printf "DELETE-silence-rook-ceph-alerts-$(date +%s)" | tr '[:upper:]' '[:lower:]')
kubectl --as=cluster-admin -n openshift-monitoring create -f- <<EOJ
apiVersion: batch/v1
kind: Job
metadata:
  name: ${job_name}
  labels:
    app: silence-rook-ceph-alerts
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: silence
          image: quay.io/appuio/oc:v4.13
          command:
          - bash
          - -c
          - |
            curl_opts=( --cacert /etc/ssl/certs/serving-certs/service-ca.crt --header "Content-Type: application/json" --header "Authorization: Bearer \$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" --resolve alertmanager-main.openshift-monitoring.svc.cluster.local:9095:\$(getent hosts alertmanager-operated.openshift-monitoring.svc.cluster.local | awk '{print \$1}' | head -n 1) --silent )
            curl "\${curl_opts[@]}" \
              "https://alertmanager-main.openshift-monitoring.svc.cluster.local:9095/api/v2/silence/${silence_id}" \
              -XDELETE
          volumeMounts:
          - mountPath: /etc/ssl/certs/serving-certs/
            name: ca-bundle
            readOnly: true
          - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
            name: kube-api-access
            readOnly: true
      serviceAccountName: prometheus-k8s
      volumes:
      - name: ca-bundle
        configMap:
          defaultMode: 288
          name: serving-certs-ca-bundle
      - name: kube-api-access
        projected:
          defaultMode: 420
          sources:
            - serviceAccountToken:
                expirationSeconds: 3607
                path: 'token'
EOJ
-
Clean up Alertmanager silence jobs
kubectl --as=cluster-admin -n openshift-monitoring delete jobs -l app=silence-rook-ceph-alerts
-
Re-enable ArgoCD auto sync
kubectl --as=cluster-admin -n syn patch apps root --type=json \
  -p '[{
    "op":"replace",
    "path":"/spec/syncPolicy",
    "value": {"automated": {"prune": true, "selfHeal": true}}
  }]'
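To confirm that the patch took effect, the sync policy can be read back from the app. The following is a minimal sketch with a hypothetical function name. It only checks the selfHeal field set by the patch above.

```shell
# Hypothetical helper: succeed if automated sync with selfHeal is enabled
# on the given ArgoCD app in the syn namespace.
autosync_enabled() {
  as_value=$(kubectl --as=cluster-admin -n syn get apps "$1" \
    -o jsonpath='{.spec.syncPolicy.automated.selfHeal}')
  [ "${as_value}" = "true" ]
}
```

Running autosync_enabled root should succeed once the patch is applied. The same check can be pointed at rook-ceph to see whether its sync policy has been restored as well.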