Upgrade Controller
Problem Statement
Maintenance of OpenShift 4 clusters is a manual process. We promise maintenance windows outside office hours, and some of them are Switzerland only.
Staying up late to upgrade those clusters isn’t a sustainable solution. It binds team members to the task, and it isn’t a good use of their time.
The need for automation is obvious. We decided to write our own upgrade controller.
High Level Goals
- A normal, successful upgrade is done without any manual intervention during a defined maintenance window
- Maintenance window and upgrade rhythm are configurable on a per-cluster basis
- Suspending upgrades is possible
- Maintenance engineers are notified when an upgrade fails
- Maintenance is skipped when the cluster is unhealthy
Implementation
The controller is a standard controller-runtime controller.
It's deployed on each OpenShift 4 cluster and managed through a Custom Resource Definition (CRD) called UpgradeConfig.
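As a rough sketch of the wiring under these assumptions (the API package path and the reconciler name below are illustrative, not existing code), the setup could look like this:

package main

import (
    "context"

    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"

    // Assumed API package for the UpgradeConfig CRD; the real import path may differ.
    managedupgradev1beta1 "github.com/appuio/openshift-upgrade-controller/api/v1beta1"
)

// UpgradeConfigReconciler reconciles UpgradeConfig objects.
type UpgradeConfigReconciler struct {
    client.Client
}

func (r *UpgradeConfigReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // Create or update UpgradeJob objects according to the configured schedule.
    return ctrl.Result{}, nil
}

func main() {
    // Scheme registration for the managedupgrade types is omitted for brevity.
    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
    if err != nil {
        panic(err)
    }

    // Watch UpgradeConfig objects and reconcile them.
    if err := ctrl.NewControllerManagedBy(mgr).
        For(&managedupgradev1beta1.UpgradeConfig{}).
        Complete(&UpgradeConfigReconciler{Client: mgr.GetClient()}); err != nil {
        panic(err)
    }

    if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
        panic(err)
    }
}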
The controller is extensible through webhooks
The controller should be able to send notifications to a webhook. A notification should be sent for every step of the upgrade process. Failed notification deliveries must not block the upgrade flow.
We might want to reuse the Alertmanager webhook configuration format, since it already covers TLS, authentication, and other necessary features.
url: "https://example.com/webhook"
http_config: (1)
authorization:
credentials: "token"
proxy_url: "http://proxy.example.com"
annotations: (2)
cluster_id: "bar"
tenant_id: "foo"
1 | Alertmanager HTTP Config |
2 | Additional annotations to send with the webhook. |
The controller should send a POST request to the webhook with a JSON payload.
{
  "version": "1", (1)
  "type": "UpgradeSkipped", (2)
  "status": "True", (3)
  "reason": "ClusterUnhealthy", (4)
  "message": "Critical alerts [MultipleDefaultStorageClasses, NodeFilesystemAlmostOutOfFiles] are firing", (5)
  "desiredVersion": { (6)
    "version": "4.6.34",
    "image": "quay.io/openshift-release-dev/ocp-release@sha256:1234567890abcdef"
  },
  "annotations": { (7)
    "cluster_id": "bar",
    "tenant_id": "foo"
  }
}
1 | The version of the webhook payload. |
2 | The type of the notification. Inspired by metav1.Condition. |
3 | The status of the notification. One of True, False, Unknown. |
4 | The programmatic identifier indicating the reason for the notification. |
5 | The human-readable message indicating the reason for the notification. |
6 | The desired version of the cluster. Only present for certain notifications. |
7 | Additional annotations from the webhook configuration. |
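A sketch of how a notification sender could look in Go is shown below. The struct fields mirror the JSON example above, while the Send helper and its exact behaviour are assumptions rather than an existing API; the important property is that delivery failures are only logged, never returned, so they cannot block the upgrade flow.

package notify

import (
    "bytes"
    "context"
    "encoding/json"
    "log"
    "net/http"
    "time"
)

// Notification mirrors the JSON payload shown above.
type Notification struct {
    Version        string            `json:"version"`
    Type           string            `json:"type"`
    Status         string            `json:"status"`
    Reason         string            `json:"reason"`
    Message        string            `json:"message"`
    DesiredVersion *DesiredVersion   `json:"desiredVersion,omitempty"`
    Annotations    map[string]string `json:"annotations,omitempty"`
}

// DesiredVersion identifies the release the upgrade targets.
type DesiredVersion struct {
    Version string `json:"version"`
    Image   string `json:"image"`
}

// Send delivers a notification to the configured webhook URL.
// Errors are logged but never returned, so a failed delivery
// cannot block the upgrade flow.
func Send(ctx context.Context, url string, n Notification) {
    body, err := json.Marshal(n)
    if err != nil {
        log.Printf("webhook: marshal failed: %v", err)
        return
    }

    ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
    defer cancel()

    req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body))
    if err != nil {
        log.Printf("webhook: building request failed: %v", err)
        return
    }
    req.Header.Set("Content-Type", "application/json")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        log.Printf("webhook: delivery failed: %v", err)
        return
    }
    resp.Body.Close()
}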
The controller manages the content of the ClusterVersion/version object
The ClusterVersion/version object is the source of truth for the cluster's current version and available updates. It's currently managed by ArgoCD, which could conflict with the controller. The controller should replace ArgoCD and manage the object from its own CRD.
The configv1.ClusterVersionSpec is included in the UpgradeConfig CRD and synced to the ClusterVersion/version object. The .spec.desiredUpdate field is set to start the upgrade.
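Setting .spec.desiredUpdate could then look roughly like the following sketch. It uses the upstream configv1 types from github.com/openshift/api; the startUpgrade helper itself is illustrative, not existing controller code.

package upgrade

import (
    "context"

    configv1 "github.com/openshift/api/config/v1"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// startUpgrade sets .spec.desiredUpdate on the ClusterVersion/version object,
// which instructs the cluster-version operator to begin the upgrade.
// Conflict handling and retries are omitted in this sketch.
func startUpgrade(ctx context.Context, c client.Client, version, image string) error {
    var cv configv1.ClusterVersion
    if err := c.Get(ctx, client.ObjectKey{Name: "version"}, &cv); err != nil {
        return err
    }

    cv.Spec.DesiredUpdate = &configv1.Update{
        Version: version,
        Image:   image,
    }
    return c.Update(ctx, &cv)
}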
The controller pins the upgrade version at a time before the maintenance window
The controller creates an UpgradeJob object at a time configured in the UpgradeConfig object. The UpgradeJob contains a snapshot of the most recent version in the .status.availableUpdates field and a timestamp when the upgrade should start.
The UpgradeJob rechecks the available updates at the time of the upgrade. If the version is no longer available, the upgrade is skipped and a notification is sent to the webhook.
pinVersionWindow: "4h" (1)
1 | The time window before the maintenance window in which the upgrade version is pinned. Scheduled upgrade jobs are created at the start of this window; if the window is empty, they are created just in time. |
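A minimal sketch of the intended timing, assuming a helper on the controller side (the function name is illustrative):

package schedule

import "time"

// upgradeJobCreationTime returns when the UpgradeJob for the next maintenance
// window should be created. With a zero pinVersionWindow the job is created
// just in time; otherwise it is created pinVersionWindow before the window
// starts, which also pins the version snapshot at that moment.
func upgradeJobCreationTime(nextWindow time.Time, pinVersionWindow time.Duration) time.Time {
    return nextWindow.Add(-pinVersionWindow)
}

With pinVersionWindow: "4h" and a maintenance window starting Tuesday at 22:00, the UpgradeJob and its version snapshot would be created at 18:00, and the availability of that version is rechecked when the job actually starts.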
Interval / time window definition
- The controller must support a customizable upgrade start time
- The controller must support various upgrade rhythms (weekly, every two weeks, whenever there's an update)
The upgrade start time is defined in the UpgradeConfig object.
It's in the form of a cron expression with an additional field for the ISO 8601 week number (Go's time.Time.ISOWeek). The additional field is used to define the weekly upgrade rhythm. The syntax is cron-like; for example, 7 means the 7th week of the year. The initial implementation will support only @odd and @even, which mean every odd or even week of the year.
We support maintenance windows adhering to the local time of a cluster. The time zone of the schedule should be configurable.
It must be possible to suspend scheduling of upgrades.
schedule:
  cron: "0 22 * * 2" # 22:00 on Tuesdays (1)
  isoWeek: "@odd" (2)
  location: "Europe/Zurich" (3)
  suspend: false (4)
1 | Cron expression |
2 | Every odd week of the year according to the ISO 8601 week number. Initially supported values are @odd and @even. |
3 | Time zone |
4 | Whether to suspend scheduling of upgrades. |
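Evaluating the extra field could look like the sketch below. The cron part itself would be handled by an existing cron library; only the ISO week filter is shown, and the function name is illustrative.

package schedule

import (
    "fmt"
    "time"
)

// isoWeekMatches reports whether t falls on a week selected by the isoWeek
// field: "" matches every week, "@odd"/"@even" match odd/even ISO 8601 weeks.
func isoWeekMatches(spec string, t time.Time) (bool, error) {
    _, week := t.ISOWeek()
    switch spec {
    case "":
        return true, nil
    case "@odd":
        return week%2 == 1, nil
    case "@even":
        return week%2 == 0, nil
    default:
        return false, fmt.Errorf("unsupported isoWeek value %q", spec)
    }
}

The controller would compute candidate run times from the cron expression in the configured location (for example time.LoadLocation("Europe/Zurich")) and discard candidates whose ISO week doesn't match.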
The controller verifies cluster health before and after the upgrade
The controller shouldn't try to upgrade a cluster that isn't healthy.
An UpgradeJob checks the cluster health before the upgrade and skips the upgrade if the cluster is unhealthy. If the upgrade is skipped, the controller should send a notification to the webhook.
The controller should also check the cluster health after the upgrade. If the cluster is unhealthy, the controller should send a notification to the webhook.
Custom queries allow customers or VSHN to easily extend the checks that cause an upgrade to be skipped.
preUpgradeHealthChecks:
  timeout: "30m" (1)
  checkCriticalAlerts: true
  checkDegradedOperators: true
  excludeAlerts:
    - alertname: "KubePodCrashLooping"
  excludeNamespaces:
    - openshift-console
  excludeOperators:
    - openshift-monitoring
  customQueries:
    - query: 'up{job=~"^argocd-.+$",namespace="syn"} != 1'
1 | How long to wait for the health checks to be successful. |
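A sketch of the critical-alert part of the check against the in-cluster Prometheus, using the standard /api/v1/query HTTP endpoint. The exact query, the severity label selection, and how excludeAlerts/excludeNamespaces would be translated into additional matchers are assumptions.

package healthcheck

import (
    "context"
    "encoding/json"
    "net/http"
    "net/url"
)

// criticalAlertsFiring returns the names of firing critical alerts, queried
// through the standard Prometheus HTTP API. An empty result means the
// pre-upgrade check passes.
func criticalAlertsFiring(ctx context.Context, promURL string) ([]string, error) {
    q := `ALERTS{alertstate="firing",severity="critical"}`
    endpoint := promURL + "/api/v1/query?query=" + url.QueryEscape(q)

    req, err := http.NewRequestWithContext(ctx, http.MethodGet, endpoint, nil)
    if err != nil {
        return nil, err
    }
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()

    // Minimal decoding of the Prometheus query response.
    var body struct {
        Status string `json:"status"`
        Data   struct {
            Result []struct {
                Metric map[string]string `json:"metric"`
            } `json:"result"`
        } `json:"data"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
        return nil, err
    }

    var names []string
    for _, r := range body.Data.Result {
        names = append(names, r.Metric["alertname"])
    }
    return names, nil
}

The customQueries would be evaluated the same way; any query returning a non-empty result would mark the check as failed.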
Custom resource definitions
ClusterVersion
The ClusterVersion CRD defines the parameters synced to the ClusterVersion/version object.
There must be only one ClusterVersion object in the cluster.
apiVersion: managedupgrade.appuio.io/v1beta1
kind: ClusterVersion
metadata:
  name: version
spec:
  template: (1)
    spec:
      capabilities:
        baselineCapabilitySet: v4.11
      channel: stable-4.11
      clusterID: bc75be34-e92d-4745-bb9d-8ec39e877854
      desiredUpdate: {} (2)
      upstream: https://api.openshift.com/api/upgrades_info/v1/graph
1 | Template for the ClusterVersion/version object. |
2 | The desiredUpdate is ignored and set by the UpgradeJob controller. |
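The corresponding Go API types could be sketched as follows. The type names are assumptions; the point is that the template embeds the upstream configv1.ClusterVersionSpec so the whole spec can be synced as-is.

package v1beta1

import configv1 "github.com/openshift/api/config/v1"

// ClusterVersionSpec defines the parameters synced to the
// ClusterVersion/version object.
type ClusterVersionSpec struct {
    // Template for the ClusterVersion/version object. The desiredUpdate
    // inside it is ignored and managed by the UpgradeJob controller.
    Template ClusterVersionTemplate `json:"template,omitempty"`
}

// ClusterVersionTemplate wraps the upstream OpenShift ClusterVersion spec.
type ClusterVersionTemplate struct {
    Spec configv1.ClusterVersionSpec `json:"spec,omitempty"`
}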
UpgradeConfig
The UpgradeConfig CRD defines the upgrade schedule and the upgrade job template.
The reconciliation loop of the controller creates UpgradeJob objects based on the UpgradeConfig object.
apiVersion: managedupgrade.appuio.io/v1beta1
kind: UpgradeConfig
metadata:
  name: cluster-upgrade
spec:
  schedule: (1)
    cron: "0 22 * * 2"
    isoWeek: "@odd"
    location: "Europe/Zurich"
    suspend: false
  pinVersionWindow: "4h" (2)
  maxUpgradeStartDelay: "1h" (3)
  jobTemplate:
    spec:
      config:
        upgradeTimeout: "2h" (4)
        preUpgradeHealthChecks: (5)
          timeout: "30m"
          checkCriticalAlerts: true
          checkDegradedOperators: true
          excludeAlerts:
            - alertname: "KubePodCrashLooping"
          excludeNamespaces:
            - openshift-console
          excludeOperators:
            - openshift-monitoring
          customQueries:
            - query: 'up{job=~"^argocd-.+$",namespace="syn"} != 1'
        postUpgradeHealthChecks: (6)
          timeout: "30m"
          checkCriticalAlerts: true
          checkDegradedOperators: true
          excludeAlerts:
            - alertname: "KubePodCrashLooping"
          excludeNamespaces:
            - openshift-console
          excludeOperators:
            - openshift-monitoring
          customQueries:
            - query: 'up{job=~"^argocd-.+$",namespace="syn"} != 1'
        webhooks: (7)
          - url: "https://example.com/webhook"
            annotations:
              cluster_id: "bar"
              tenant_id: "foo"
  webhooks: (7)
    - url: "https://example.com/webhook"
      annotations:
        cluster_id: "bar"
        tenant_id: "foo"
1 | The upgrade schedule as defined in Interval / time window definition. |
2 | The time window before the maintenance window in which the upgrade version is pinned. UpgradeJobs are created at schedule - pinVersionWindow. |
3 | The maximum delay between the scheduled upgrade time and the actual upgrade time. Influences the UpgradeJob's .status.upgradeBefore field. |
4 | The timeout for the upgrade. The upgrade is marked as failed if it takes longer than this. |
5 | The health checks to perform before the upgrade as defined in The controller verifies cluster health before and after the upgrade. |
6 | The health checks to perform after the upgrade as defined in The controller verifies cluster health before and after the upgrade. |
7 | The webhooks to send notifications to, as defined in The controller is extensible through webhooks. Having multiple webhooks allows sending notifications to different systems. Both the UpgradeConfig and the UpgradeJob have a webhooks field since both might send notifications. |
UpgradeJob
An UpgradeJob is created for each upgrade.
It contains a snapshot of the most recent version in the .status.availableUpdates field, a snapshot of the config, and a timestamp when the upgrade should start.
apiVersion: managedupgrade.appuio.io/v1beta1
kind: UpgradeJob
metadata:
  name: cluster-upgrade-1609531200-ef11c47 (1)
spec:
  startAfter: "2021-01-01T22:00:00+01:00" (2)
  startBefore: "2021-01-01T23:00:00+01:00" (3)
  desiredVersion: (4)
    version: "4.6.1"
    image: "quay.io/openshift-release-dev/ocp-release@sha256:1234567890abcdef"
  config: (5)
    upgradeTimeout: "2h"
    preUpgradeHealthChecks: {} # ...
    postUpgradeHealthChecks: {} # ...
    webhooks: []
1 | The name of the UpgradeJob is the timestamp when the upgrade should start plus a hash of the UpgradeConfig object. The timestamp is primarily used for sorting the UpgradeJob objects should multiple exist. |
2 | The timestamp from when the upgrade should start. |
3 | The timestamp until when the upgrade should start. If the upgrade doesn’t start within this time window, for example when the controller is unavailable, the upgrade is marked as skipped. |
4 | The version to upgrade to. |
5 | The config as defined in the UpgradeConfig CRD, copied from the UpgradeConfig object. |
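A sketch of how the name described in callout 1 could be derived. The exact hashing scheme is an assumption; anything producing a short, stable hash of the UpgradeConfig would work.

package upgradejob

import (
    "crypto/sha256"
    "encoding/hex"
    "encoding/json"
    "fmt"
    "time"
)

// jobName builds a name like "cluster-upgrade-1609531200-ef11c47" from the
// owning UpgradeConfig's name, the Unix timestamp of the planned start, and a
// short, stable hash of the UpgradeConfig spec.
func jobName(configName string, startAfter time.Time, configSpec any) (string, error) {
    raw, err := json.Marshal(configSpec)
    if err != nil {
        return "", err
    }
    sum := sha256.Sum256(raw)
    return fmt.Sprintf("%s-%d-%s", configName, startAfter.Unix(), hex.EncodeToString(sum[:])[:7]), nil
}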