Maintenance troubleshooting
General troubleshooting
Get more detailed information about the current state of the upgrade:

kubectl --as cluster-admin get clusterversions.config.openshift.io version -o jsonpath='{.status.conditions}' | jq .
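During an upgrade, the Progressing and Failing conditions are usually the most interesting ones. A minimal sketch that filters the conditions with jq:

# Show only the Progressing and Failing ClusterVersion conditions
kubectl --as cluster-admin get clusterversions.config.openshift.io version \
  -o jsonpath='{.status.conditions}' \
  | jq '.[] | select(.type == "Progressing" or .type == "Failing")'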
In general: in the case of an error or warning, try to get more detailed information from the logs of the specific operator or controller. Be patient, as some warnings are only temporary and the operators may be able to recover from a degraded state on their own!
Follow the log of an operator or controller:
kubectl --as cluster-admin -n openshift-machine-config-operator logs deployment/machine-config-operator -f
kubectl --as cluster-admin -n openshift-machine-config-operator logs deployment/machine-config-controller -f
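If an operator pod was restarted recently, for example after a crash, the logs of the previous container instance are often the more interesting ones; a minimal sketch:

# Show the logs of the previous container instance
kubectl --as cluster-admin -n openshift-machine-config-operator \
  logs deployment/machine-config-operator --previous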
Check all operator states
Get all clusteroperator objects for an overview of the cluster operator states:
$ oc --as cluster-admin get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.8.2     True        False         False      20h
baremetal                                  4.8.2     True        False         False      147d
cloud-credential                           4.8.2     True        False         False      345d
cluster-autoscaler                         4.8.2     True        False         False      345d
config-operator                            4.8.2     True        False         False      345d
console                                    4.8.2     True        False         False      7d3h
csi-snapshot-controller                    4.8.2     True        False         False      7d6h
dns                                        4.8.2     True        False         False      7d4h
etcd                                       4.8.2     True        False         False      345d
image-registry                             4.8.2     True        False         False      93d
ingress                                    4.8.2     True        False         False      7d4h
insights                                   4.8.2     True        False         False      345d
kube-apiserver                             4.8.2     True        False         False      345d
kube-controller-manager                    4.8.2     True        False         False      345d
kube-scheduler                             4.8.2     True        False         False      345d
kube-storage-version-migrator              4.8.2     True        False         False      7d3h
machine-api                                4.8.2     True        False         False      345d
machine-approver                           4.8.2     True        False         False      345d
machine-config                             4.8.2     True        False         False      7d6h
marketplace                                4.8.2     True        False         False      7d6h
monitoring                                 4.8.2     True        False         False      7d4h
network                                    4.8.2     True        False         False      147d
node-tuning                                4.8.2     True        False         False      7d4h
openshift-apiserver                        4.8.2     True        False         False      7d3h
openshift-controller-manager               4.8.2     True        False         False      7d4h
openshift-samples                          4.8.2     True        False         False      7d4h
operator-lifecycle-manager                 4.8.2     True        False         False      345d
operator-lifecycle-manager-catalog         4.8.2     True        False         False      345d
operator-lifecycle-manager-packageserver   4.8.2     True        False         False      7d3h
service-ca                                 4.8.2     True        False         False      345d
storage                                    4.8.2     True        False         False      148d
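During an upgrade, the operators that aren't in the steady state (AVAILABLE=True, PROGRESSING=False, DEGRADED=False) are the interesting ones. A minimal sketch that filters the tabular output with awk:

# Show only cluster operators which are unavailable, progressing or degraded
oc --as cluster-admin get co --no-headers \
  | awk '$3 != "True" || $4 != "False" || $5 != "False"'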
Check details of an OpenShift 4 release
You can get detailed information about an OpenShift 4 release with oc adm release info <version>.
In particular, this can be helpful to check whether an upgrade contains a CoreOS upgrade (see the comparison sketch below the callout descriptions).
$ oc --as=cluster-admin adm release info 4.8.28
Name:      4.8.28
Digest:    sha256:ba1299680b542e46744307afc7effc15957a20592d88de4651610b52ed8be9a8
Created:   2022-01-19T10:15:29Z
OS/Arch:   linux/amd64
Manifests: 496

Pull From: quay.io/openshift-release-dev/ocp-release@sha256:ba1299680b542e46744307afc7effc15957a20592d88de4651610b52ed8be9a8 (1)

Release Metadata:
  Version:  4.8.28
  Upgrades: 4.7.21, 4.7.22, 4.7.23, 4.7.24, 4.7.25, 4.7.26, 4.7.28, 4.7.29, 4.7.30, 4.7.31, 4.7.32, 4.7.33, 4.7.34, 4.7.35, 4.7.36, 4.7.37, 4.7.38, 4.7.39, 4.7.40, 4.7.41, 4.8.2, 4.8.3, 4.8.4, 4.8.5, 4.8.6, 4.8.7, 4.8.9, 4.8.10, 4.8.11, 4.8.12, 4.8.13, 4.8.14, 4.8.15, 4.8.16, 4.8.17, 4.8.18, 4.8.19, 4.8.20, 4.8.21, 4.8.22, 4.8.23, 4.8.24, 4.8.25, 4.8.26, 4.8.27
  Metadata:
    url: https://access.redhat.com/errata/RHBA-2022:0172 (2)

Component Versions:
  kubernetes 1.21.6 (3)
  machine-os 48.84.202201102304-0 Red Hat Enterprise Linux CoreOS (4)

Images: (5)
  NAME                                  DIGEST
  [ ... operator and controller image list snipped ... ]
(1) The container image which orchestrates installation of the release. This is the image we set in component openshift4-version.
(2) Link to the release notes for this release.
(3) The base Kubernetes version for the release.
(4) The CoreOS version for the release.
(5) A list of container image versions for all operators and controllers which are part of the release.
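A quick way to check whether an upgrade contains a CoreOS upgrade is to compare the machine-os component of the current and the target release. A minimal sketch, where 4.8.27 stands in for the currently installed version:

# Compare the machine-os component version of two releases
for version in 4.8.27 4.8.28; do
  echo "${version}: $(oc --as=cluster-admin adm release info "${version}" | grep machine-os)"
done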
Troubleshooting node upgrades
- List the latest rendered MachineConfig object for each machine pool:

  POOL_COUNT=$(kubectl --as=cluster-admin -n openshift-machine-config-operator get machineconfigpool --no-headers | wc -l)
  kubectl --as=cluster-admin -n openshift-machine-config-operator get machineconfig \
    --sort-by=".metadata.creationTimestamp" | grep "^rendered-" | tail -n "${POOL_COUNT}"
- List nodes with their current and desired MachineConfig objects (a jq variant that lists only mismatched nodes is sketched after this list):

  kubectl --as=cluster-admin get nodes -o custom-columns="NAME:.metadata.name,Current Config:.metadata.annotations.machineconfiguration\.openshift\.io/currentConfig,Desired Config:.metadata.annotations.machineconfiguration\.openshift\.io/desiredConfig"
- Check the machine-config-daemon pod logs on the node(s) for which the current and desired MachineConfig objects don't match. Among other things, the machine-config-daemon logs contain the kubectl drain logs for the node.

  NODE=<node-name>
  POD=$(kubectl --as=cluster-admin -o jsonpath='{.items[0].metadata.name}' \
    -n openshift-machine-config-operator get pods \
    --field-selector="spec.nodeName=${NODE}" -l k8s-app=machine-config-daemon)
  kubectl --as=cluster-admin -n openshift-machine-config-operator \
    logs -c machine-config-daemon -f "${POD}"
- If the Machine Config operator fails to drain a node, you may have to force-drain the node:

  oc --as=cluster-admin adm drain <node-name> --delete-emptydir-data --ignore-daemonsets --force --grace-period=0

  If manually force-draining the node isn't successful, check which pods are still running on the node with oc describe node <node-name> or oc get pods --all-namespaces --field-selector spec.nodeName=<node-name>, and force-delete any non-daemonset pods shown in the output (a force-delete sketch follows after this list). The Machine Config operator should then be able to continue with the node upgrades. Depending on what's blocking the drain, these steps may have to be repeated for several nodes.
- If nodes get stuck in NotReady during the upgrade process, check whether the VM got stuck trying to reboot itself into the new image:

  - Log in to the cloud provider's web console.
  - Check the VM's VNC (or equivalent) console.
  - If the VM is unresponsive on the VNC console, a reboot via the cloud provider's web interface should resolve the issue. We haven't investigated in depth why VMs sometimes get stuck trying to reboot themselves, and we haven't observed this problem on OCP 4.7 so far.