Reinitialize a storage disk

Steps to reinitialize an existing but corrupted Ceph storage disk of an OpenShift 4 cluster on Exoscale.

Starting situation

  • You already have an OpenShift 4 cluster on Exoscale

  • You have admin-level access to the cluster

  • You want to reinitialize a corrupted Ceph storage disk of an existing storage node of the cluster.

    The main symptoms indicating a corrupted storage disk are:

    • the OSD pod associated with the corrupted disk is in CrashLoopBackOff

    • the alert CephOSDDiskNotResponding is firing for the OSD associated with the corrupted disk.
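
    One quick way to confirm this is to list the OSD pods (Rook labels them with app=rook-ceph-osd) and look for the pod stuck in CrashLoopBackOff:

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster get pods -l app=rook-ceph-osd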

Prerequisites

The following CLI utilities need to be available locally:

  • kubectl

  • jq

Gather information

  1. Make a note of the OSD ID for the disk you want to reinitialize

    export OSD_ID=<ID>
  2. Find PVC and PV of the disk to reinitialize

    pvc_name=$(kubectl --as=cluster-admin -n syn-rook-ceph-cluster get deploy \
      "rook-ceph-osd-${OSD_ID}" -ojsonpath='{.metadata.labels.ceph\.rook\.io/pvc}')
    pv_name=$(kubectl --as=cluster-admin -n syn-rook-ceph-cluster get pvc \
      "${pvc_name}" -o jsonpath='{.spec.volumeName}')
  3. Find node hosting the disk to reinitialize

    node_name=$(kubectl --as=cluster-admin get pv ${pv_name} \
      -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0]}')
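
    Before continuing, check that all of the gathered values are non-empty, for example:

    echo "OSD_ID=${OSD_ID} pvc=${pvc_name} pv=${pv_name} node=${node_name}"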

Create silence in Alertmanager

  1. Set a silence in Alertmanager for all rook-ceph alerts

    if [[ "$OSTYPE" == "darwin"* ]]; then alias date=gdate; fi
    job_name=$(printf "POST-silence-rook-ceph-alerts-$(date +%s)" | tr '[:upper:]' '[:lower:]')
    silence_duration='+30 minutes' (1)
    kubectl --as=cluster-admin -n openshift-monitoring create -f- <<EOJ
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: ${job_name}
      labels:
        app: silence-rook-ceph-alerts
    spec:
     backoffLimit: 0
     template:
      spec:
        restartPolicy: Never
        containers:
          - name: silence
            image: quay.io/appuio/oc:v4.13
            command:
            - bash
            - -c
            - |
              curl_opts=( --cacert /etc/ssl/certs/serving-certs/service-ca.crt --header "Content-Type: application/json" --header "Authorization: Bearer \$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" --resolve alertmanager-main.openshift-monitoring.svc.cluster.local:9095:\$(getent hosts alertmanager-operated.openshift-monitoring.svc.cluster.local | awk '{print \$1}' | head -n 1) --silent )
              read -d "" body << EOF
              {
                "matchers": [
                  {
                    "name": "syn_component",
                    "value": "rook-ceph",
                    "isRegex": false
                  }
                ],
                "startsAt": "$(date -u +'%Y-%m-%dT%H:%M:%S')",
                "endsAt": "$(date -u +'%Y-%m-%dT%H:%M:%S' --date "${silence_duration}")",
                "createdBy": "$(kubectl config current-context | cut -d/ -f3)",
                "comment": "Silence rook-ceph alerts"
              }
              EOF
    
              curl "\${curl_opts[@]}" \
                "https://alertmanager-main.openshift-monitoring.svc.cluster.local:9095/api/v2/silences" \
                -XPOST -d "\${body}"
    
            volumeMounts:
            - mountPath: /etc/ssl/certs/serving-certs/
              name: ca-bundle
              readOnly: true
            - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
              name: kube-api-access
              readOnly: true
        serviceAccountName: prometheus-k8s
        volumes:
        - name: ca-bundle
          configMap:
            defaultMode: 288
            name: serving-certs-ca-bundle
        - name: kube-api-access
          projected:
            defaultMode: 420
            sources:
              - serviceAccountToken:
                  expirationSeconds: 3607
                  path: 'token'
    EOJ
    1 Adjust this variable to create a longer or shorter silence.
  2. Extract Alertmanager silence ID from job logs

    silence_id=$(kubectl --as=cluster-admin -n openshift-monitoring logs jobs/${job_name} | \
      jq -r '.silenceID')
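
    The silence ID should be a UUID; if it's empty, the job probably hasn't completed yet, so wait a moment and re-run the command above. You can inspect the value with:

    echo "${silence_id}"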

Reinitialize disk

Shut down the OSD of the disk to reinitialize

  1. Temporarily disable rebalancing of the Ceph cluster

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
      ceph osd set noout
  2. Disable auto sync for component rook-ceph. This allows us to temporarily make manual changes to the Rook Ceph cluster.

    kubectl --as=cluster-admin -n syn patch apps root --type=json \
      -p '[{"op":"replace", "path":"/spec/syncPolicy", "value": {}}]'
    kubectl --as=cluster-admin -n syn patch apps rook-ceph --type=json \
      -p '[{"op":"replace", "path":"/spec/syncPolicy", "value": {}}]'
  3. Scale down the Rook-Ceph operator

    kubectl --as=cluster-admin -n syn-rook-ceph-operator scale --replicas=0 \
      deploy/rook-ceph-operator
  4. Take the old OSD out of service

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
      ceph osd out "osd.${OSD_ID}"
  5. Delete the OSD deployment of the disk you want to reinitialize

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster delete deploy \
      "rook-ceph-osd-${OSD_ID}"

Clean the disk

  1. Find the local-storage-provisioner pod managing the disk

    provisioner_pod_label="local-volume-provisioner-$(kubectl --as=cluster-admin \
      get pv ${pv_name} \
      -o jsonpath='{.metadata.labels.storage\.openshift\.com/local-volume-owner-name}')"
    provisioner_pod=$(kubectl --as=cluster-admin -n openshift-local-storage get pods \
      -l "app=${provisioner_pod_label}" --field-selector="spec.nodeName=${node_name}" \
      -o jsonpath='{.items[0].metadata.name}')
  2. Close the LUKS device of the disk

    ceph_image=$(kubectl --as=cluster-admin -n syn-rook-ceph-cluster get job \
      -l "ceph.rook.io/pvc=${pvc_name}" \
      -o jsonpath='{.items[0].spec.template.spec.containers[0].image}')
    kubectl --as=cluster-admin run -n syn-rook-ceph-cluster \
      "cryptclose-${node_name}-$(date +%s)" --restart=Never -it --rm --image overridden \
      --overrides '{
      "spec": {
        "nodeSelector": {
          "kubernetes.io/hostname": "'"${node_name}"'"
        },
        "hostNetwork": true,
        "hostIPC": true,
        "containers": [{
          "name": "crypttool",
          "image": "'"${ceph_image}"'",
          "command": [
            "sh", "-c",
            "cryptsetup remove /dev/mapper/'"${pvc_name}"'*"
          ],
          "securityContext": {
            "privileged": true,
            "runAsNonRoot": false,
            "runAsUser": 0
          },
          "serviceAccount": "rook-ceph-osd",
          "volumeMounts": [{
            "name": "devices",
            "mountPath": "/dev"
          }]
        }],
        "tolerations": [{
          "key": "storagenode",
          "operator": "Exists"
        }],
        "volumes": [{
          "hostPath": {
            "path": "/dev",
            "type": ""
          },
          "name": "devices"
        }]
      }
    }'
  3. Clean the disk

    We clean the disk by zeroing its first 512MB, which should be sufficient for Ceph to create a new OSD on it. If the new OSD prepare job reports errors, increase the count argument of the dd command, for example to count=2048 to zero the first 2GB of the disk.

    disk_path=$(kubectl --as=cluster-admin get pv "${pv_name}" -o jsonpath='{.spec.local.path}')
    kubectl --as=cluster-admin -n openshift-local-storage exec -it "${provisioner_pod}" -- \
     dd if=/dev/zero of="${disk_path}" bs=1M count=512

Start a new OSD on the cleaned disk

  1. Scale Rook-Ceph operator back to 1 replica

    kubectl --as=cluster-admin -n syn-rook-ceph-operator scale --replicas=1 \
      deploy/rook-ceph-operator
  2. Wait for the operator to reconfigure the disk for the OSD. The operator first runs a new rook-ceph-osd-prepare job for the disk's PVC; once that job completes, a new OSD pod should come up and reach Running.

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster get pods -w
  3. Re-enable Ceph rebalancing

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
      ceph osd unset noout
  4. Wait for the OSD to be repopulated with data ("backfilled").

    When backfilling is completed, ceph status should show all PGs as active+clean.
    Depending on the number of OSDs in the storage cluster and the amount of data that needs to be moved, this may take a while.

    If the storage cluster is mostly idle, you can speed up backfilling by temporarily setting the following configuration.

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
      ceph config set osd osd_mclock_override_recovery_settings true (1)
    kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
      ceph config set osd osd_max_backfills 10 (2)
    1 Allow overwriting osd_max_backfills.
    2 The number of PGs which are allowed to backfill in parallel. Adjust up or down depending on client load on the storage cluster.

    After backfilling is completed, you can remove the configuration with

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
      ceph config rm osd osd_max_backfills
    kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
      ceph config rm osd osd_mclock_override_recovery_settings
    kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
      ceph status
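
    To follow the backfill progress, you can re-run ceph status periodically, for example with watch if it's available locally (without -it, since watch doesn't provide a terminal):

    watch -n 60 "kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec deploy/rook-ceph-tools -- ceph status"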

Finalize reinitialization

  1. Clean up the old OSD

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
      ceph osd purge "osd.${OSD_ID}" --yes-i-really-mean-it
  2. Check that the Ceph cluster is healthy

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
      ceph status
    kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
      ceph osd tree
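
    ceph status should report HEALTH_OK and ceph osd tree should no longer list the purged osd.${OSD_ID}. As a compact check:

    kubectl --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- \
      ceph health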

Finish up

  1. Remove silence in Alertmanager

    if [[ "$OSTYPE" == "darwin"* ]]; then alias date=gdate; fi
    job_name=$(printf "DELETE-silence-rook-ceph-alerts-$(date +%s)" | tr '[:upper:]' '[:lower:]')
    kubectl --as=cluster-admin -n openshift-monitoring create -f- <<EOJ
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: ${job_name}
      labels:
        app: silence-rook-ceph-alerts
    spec:
     backoffLimit: 0
     template:
      spec:
        restartPolicy: Never
        containers:
          - name: silence
            image: quay.io/appuio/oc:v4.13
            command:
            - bash
            - -c
            - |
              curl_opts=( --cacert /etc/ssl/certs/serving-certs/service-ca.crt --header "Content-Type: application/json" --header "Authorization: Bearer \$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" --resolve alertmanager-main.openshift-monitoring.svc.cluster.local:9095:\$(getent hosts alertmanager-operated.openshift-monitoring.svc.cluster.local | awk '{print \$1}' | head -n 1) --silent )
    
              curl "\${curl_opts[@]}" \
                "https://alertmanager-main.openshift-monitoring.svc.cluster.local:9095/api/v2/silence/${silence_id}" \
                -XDELETE
    
            volumeMounts:
            - mountPath: /etc/ssl/certs/serving-certs/
              name: ca-bundle
              readOnly: true
            - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
              name: kube-api-access
              readOnly: true
        serviceAccountName: prometheus-k8s
        volumes:
        - name: ca-bundle
          configMap:
            defaultMode: 288
            name: serving-certs-ca-bundle
        - name: kube-api-access
          projected:
            defaultMode: 420
            sources:
              - serviceAccountToken:
                  expirationSeconds: 3607
                  path: 'token'
    EOJ
  2. Clean up Alertmanager silence jobs

    kubectl --as=cluster-admin -n openshift-monitoring delete jobs -l app=silence-rook-ceph-alerts
  3. Re-enable ArgoCD auto sync

    kubectl --as=cluster-admin -n syn patch apps root --type=json \
      -p '[{
        "op":"replace",
        "path":"/spec/syncPolicy",
        "value": {"automated": {"prune": true, "selfHeal": true}}
      }]'
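
    Re-enabling auto sync on the root app should be sufficient, since the root app manages the other Application resources and restores the sync policy of the rook-ceph app on its next sync. Once that has happened, you can verify that automated sync is back on both apps:

    kubectl --as=cluster-admin -n syn get apps root rook-ceph \
      -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.syncPolicy.automated}{"\n"}{end}'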