Maintenance and Update of an OpenShift 4 cluster
In contrast to previous versions of OpenShift, or other Kubernetes distributions, the operating system (Red Hat Enterprise Linux CoreOS (RHCOS)) and the OCP4 server components are bundled into a single unit. In practice, this means that there is no difference between "updating the operating system" and "installing the latest OCP version": they are one and the same.
- Get the list of available updates:
$ oc adm upgrade --as cluster-admin
Updates:

VERSION   IMAGE
4.5.19    quay.io/openshift-release-dev/ocp-release@sha256:bae5510f19324d8e9c313aaba767e93c3a311902f5358fe2569e380544d9113e
4.5.20    quay.io/openshift-release-dev/ocp-release@sha256:78b878986d2d0af6037d637aa63e7b6f80fc8f17d0f0d5b077ac6aca83f792a0
4.5.24    quay.io/openshift-release-dev/ocp-release@sha256:f3ce0aeebb116bbc7d8982cc347ffc68151c92598dfb0cc45aaf3ce03bb09d11
or
kubectl --as cluster-admin get clusterversion version -o json | jq '.status.availableUpdates[] | {image: .image, version: .version}'
If you don’t get the newest available version, this might be intended: Red Hat only releases new updates to specific clusters once they have no known issues. So on a stable channel you need some patience!
- Update the configuration hierarchy.
Set the following parameters to the values retrieved in the previous step (a sketch of the resulting entry follows this list):
- parameters.openshift4_version.spec.desiredUpdate.image
- parameters.openshift4_version.spec.desiredUpdate.version
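These parameter paths map directly onto the YAML structure of the hierarchy. A minimal sketch of the resulting entry, using the 4.5.24 values from the example above (the exact file to edit depends on your hierarchy layout):

parameters:
  openshift4_version:
    spec:
      desiredUpdate:
        # Image digest and version as reported by `oc adm upgrade`
        image: quay.io/openshift-release-dev/ocp-release@sha256:f3ce0aeebb116bbc7d8982cc347ffc68151c92598dfb0cc45aaf3ce03bb09d11
        version: 4.5.24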
- Compile the cluster catalog.
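How the catalog is compiled depends on your tooling. Assuming the cluster is managed with Project Syn’s Commodore, the invocation could look roughly like this (a hypothetical sketch; <cluster-id> is a placeholder):

# Compile the catalog for the cluster and push the result to the catalog repository
commodore catalog compile <cluster-id> --push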
- Enjoy the show.
Let the OpenShift operators do their job.
kubectl --as cluster-admin get clusterversion version --watch
- Check the upgrade state via the oc command:

$ oc adm upgrade --as cluster-admin
Cluster version is 4.5.24

No updates available. You may force an upgrade to a specific release image, but doing so may not be supported and result in downtime or data loss.
Even if oc adm upgrade shows that the upgrade has completed, it’s possible that nodes are still being upgraded.
- Check the node upgrade status via the MachineConfigPool resources:

$ oc --as=cluster-admin -n openshift-machine-config-operator get machineconfigpool
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-92e100dc64d7c9ecf669b1f69cdb5dca   True      False      False      3              3                   3                      0                      19d
worker   rendered-worker-4648c4badfb057c7e3e9f1030fa42507   True      False      False      6              6                   6                      0                      19d
Applications on the cluster may get rescheduled without prior notice as long as the worker MachineConfigPool doesn’t show Updated=True.
You can observe the progress of the node upgrades with oc --as=cluster-admin get mcp -w.
- The maintenance of the cluster is only finished once all nodes, including all worker nodes, have been upgraded and all MachineConfigPools show Updated=True.
Never leave the cluster in a state with pending node upgrades. If the Machine Config operator can’t drain a node (for example because doing so would violate a PodDisruptionBudget), you may have to manually force-drain the node or even manually delete pods that block the node drain; see the troubleshooting section below. Always open a follow-up ticket to investigate the underlying issues if manual intervention is required.
So far, the upgrade process has mostly just worked. Nevertheless, we’ve started documenting how to observe the upgrade process in the following section. More troubleshooting instructions will be added there as we gain experience.
For general information about the upgrade process, check out Updating a cluster between minor versions of the OpenShift 4 documentation.
Also have a look at the blog post The Ultimate Guide to OpenShift Release and Upgrade Process for Cluster Administrators, which is an excellent resource for understanding the process.
Troubleshooting
General troubleshooting
Get some more detailed information about the current state of the upgrade:
kubectl --as cluster-admin get clusterversions.config.openshift.io version -o jsonpath={.status.conditions} | jq .
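To pick out a single condition, for example Progressing, the same output can be filtered with jq (a small convenience variant, assuming jq is installed):

# Show only the Progressing condition of the ClusterVersion resource
kubectl --as cluster-admin get clusterversion version -o json \
  | jq '.status.conditions[] | select(.type == "Progressing")'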
In general: in the case of an error or warning, try to get more detailed information from the logs of the specific operator or controller. Be patient, as some warnings are only temporary and the operators may be able to recover from a degraded state on their own!
Follow the log of an operator or controller:
kubectl --as cluster-admin -n openshift-machine-config-operator logs deployment/machine-config-operator -f
kubectl --as cluster-admin -n openshift-machine-config-operator logs deployment/machine-config-controller -f
Check all operator states
Get all clusteroperator objects for an overview of the cluster operator states:
$ oc --as cluster-admin get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
authentication 4.8.2 True False False 20h
baremetal 4.8.2 True False False 147d
cloud-credential 4.8.2 True False False 345d
cluster-autoscaler 4.8.2 True False False 345d
config-operator 4.8.2 True False False 345d
console 4.8.2 True False False 7d3h
csi-snapshot-controller 4.8.2 True False False 7d6h
dns 4.8.2 True False False 7d4h
etcd 4.8.2 True False False 345d
image-registry 4.8.2 True False False 93d
ingress 4.8.2 True False False 7d4h
insights 4.8.2 True False False 345d
kube-apiserver 4.8.2 True False False 345d
kube-controller-manager 4.8.2 True False False 345d
kube-scheduler 4.8.2 True False False 345d
kube-storage-version-migrator 4.8.2 True False False 7d3h
machine-api 4.8.2 True False False 345d
machine-approver 4.8.2 True False False 345d
machine-config 4.8.2 True False False 7d6h
marketplace 4.8.2 True False False 7d6h
monitoring 4.8.2 True False False 7d4h
network 4.8.2 True False False 147d
node-tuning 4.8.2 True False False 7d4h
openshift-apiserver 4.8.2 True False False 7d3h
openshift-controller-manager 4.8.2 True False False 7d4h
openshift-samples 4.8.2 True False False 7d4h
operator-lifecycle-manager 4.8.2 True False False 345d
operator-lifecycle-manager-catalog 4.8.2 True False False 345d
operator-lifecycle-manager-packageserver 4.8.2 True False False 7d3h
service-ca 4.8.2 True False False 345d
storage 4.8.2 True False False 148d
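With this many operators, it can be handy to list only the unhealthy ones. A small jq sketch that prints the names of all cluster operators currently reporting Degraded=True (assuming jq is installed):

# Print the names of all degraded cluster operators
kubectl --as cluster-admin get co -o json \
  | jq -r '.items[] | select(.status.conditions[]? | select(.type == "Degraded" and .status == "True")) | .metadata.name'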
Check details of an OpenShift 4 release
You can get detailed information about an OpenShift 4 release with oc adm release info <version>.
In particular, this can be helpful to check whether an upgrade contains a CoreOS upgrade; see the comparison sketch below.
$ oc --as=cluster-admin adm release info 4.8.28
Name: 4.8.28
Digest: sha256:ba1299680b542e46744307afc7effc15957a20592d88de4651610b52ed8be9a8
Created: 2022-01-19T10:15:29Z
OS/Arch: linux/amd64
Manifests: 496
Pull From: quay.io/openshift-release-dev/ocp-release@sha256:ba1299680b542e46744307afc7effc15957a20592d88de4651610b52ed8be9a8 (1)
Release Metadata:
Version: 4.8.28
Upgrades: 4.7.21, 4.7.22, 4.7.23, 4.7.24, 4.7.25, 4.7.26, 4.7.28, 4.7.29, 4.7.30, 4.7.31, 4.7.32, 4.7.33, 4.7.34, 4.7.35, 4.7.36, 4.7.37, 4.7.38, 4.7.39, 4.7.40, 4.7.41, 4.8.2, 4.8.3, 4.8.4, 4.8.5, 4.8.6, 4.8.7, 4.8.9, 4.8.10, 4.8.11, 4.8.12, 4.8.13, 4.8.14, 4.8.15, 4.8.16, 4.8.17, 4.8.18, 4.8.19, 4.8.20, 4.8.21, 4.8.22, 4.8.23, 4.8.24, 4.8.25, 4.8.26, 4.8.27
Metadata:
url: https://access.redhat.com/errata/RHBA-2022:0172 (2)
Component Versions:
kubernetes 1.21.6 (3)
machine-os 48.84.202201102304-0 Red Hat Enterprise Linux CoreOS (4)
Images: (5)
NAME DIGEST
[ ... operator and controller image list snipped ... ]
(1) The container image which orchestrates installation of the release. This is the image we set in component openshift4-version.
(2) Link to the release notes for this release.
(3) The base Kubernetes version for the release.
(4) The CoreOS version for the release.
(5) A list of container image versions for all operators and controllers which are part of the release.
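Since the machine-os entry shows the CoreOS version, comparing it between the currently running and the target release tells you whether the upgrade will reboot all nodes. A small sketch (the versions are examples; substitute your own):

# Compare the CoreOS version of two releases; differing versions mean node reboots
for version in 4.8.27 4.8.28; do
  oc --as=cluster-admin adm release info "$version" | grep machine-os
done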
Troubleshooting node upgrades
- List the latest MachineConfig object for each machine pool:

POOL_COUNT=$(kubectl --as=cluster-admin -n openshift-machine-config-operator get machineconfigpool --no-headers | wc -l)
kubectl --as=cluster-admin -n openshift-machine-config-operator get machineconfig \
  --sort-by=".metadata.creationTimestamp" | grep "^rendered-" | tail -n "${POOL_COUNT}"
- List nodes with their current and desired MachineConfig objects:

kubectl --as=cluster-admin get nodes \
  -ocustom-columns="NAME:.metadata.name,Current Config:.metadata.annotations.machineconfiguration\.openshift\.io/currentConfig,Desired Config:.metadata.annotations.machineconfiguration\.openshift\.io/desiredConfig"
- Check the machine-config-daemon pod logs on the node(s) for which the current and desired MachineConfig objects don’t match.
The machine-config-daemon logs contain, among other things, the kubectl drain logs for the node.

NODE=<node-name>
POD=$(kubectl --as=cluster-admin -o jsonpath='{.items[0].metadata.name}' \
  -n openshift-machine-config-operator get pods \
  --field-selector="spec.nodeName=${NODE}" -l k8s-app=machine-config-daemon)
kubectl --as=cluster-admin -n openshift-machine-config-operator \
  logs -c machine-config-daemon -f "${POD}"
- If the Machine Config operator fails to drain a node, you may have to force-drain the node:

oc --as=cluster-admin adm drain <node-name> --delete-emptydir-data --ignore-daemonsets --force --grace-period=0
If manually force-draining the node isn’t successful, check which pods are still running on the node with oc describe node <node-name> or oc get pods --all-namespaces --field-selector spec.nodeName=<node-name>, and force-delete any non-DaemonSet pods shown in the output (see the example below). The Machine Config operator should then be able to continue with the node upgrades. Depending on what’s blocking the drain, these steps may have to be repeated for several nodes.
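Force-deleting a blocking pod uses the usual oc invocation; <namespace> and <pod-name> are placeholders for values from the output of the commands above:

# Force delete a pod that blocks the node drain
oc --as=cluster-admin -n <namespace> delete pod <pod-name> --force --grace-period=0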
- If nodes get stuck in NotReady during the upgrade process, check whether the VM got stuck trying to reboot itself into the new image:
- Log in to the cloud provider’s web console
- Check the VM’s VNC (or equivalent) console
- If the VM is unresponsive on the VNC console, a reboot via the cloud provider’s web interface should resolve the issue.
We’ve not investigated in depth why VMs sometimes get stuck trying to reboot themselves, and so far we haven’t observed this problem on OCP 4.7.
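To quickly spot affected nodes, a simple filter over the node list does the job (a small convenience sketch):

# List all nodes that currently report NotReady
kubectl --as=cluster-admin get nodes | grep -w NotReady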