Upgrade Controller

Problem Statement

Maintenance of OpenShift 4 clusters is a manual process. We promise maintenance windows outside office hours, and some of them are Switzerland only.

Staying up late to upgrade those clusters isn’t a sustainable solution. It binds team members to the task, and it isn’t a good use of their time.

The need for automation is obvious. We decided to write our own upgrade controller.

High Level Goals

  • A normal, successful upgrade is done without any manual intervention during a defined maintenance window

  • Maintenance window and upgrade rhythm are configurable on a per-cluster basis

    • Suspending upgrades is possible

  • Maintenance engineers are notified when an upgrade fails

    • Maintenance is skipped when the cluster is unhealthy

Non-Goals

  • More general centralized management of OpenShift 4 clusters

Implementation

The controller is a standard controller-runtime controller. It’s deployed on each OpenShift 4 cluster. It’s managed through a Custom Resource Definition (CRD) called UpgradeConfig.

Basic upgrade flow

[Figure: upgrade controller high-level flow chart]

The controller is extendable through hooks

The controller can run arbitrary commands if certain events happen during the upgrade. The commands are executed as Kubernetes jobs. Information about the running upgrade is passed to the jobs through environment variables.

The controller manages the content of the ClusterVersion/version object

The ClusterVersion/version object is the source of truth for the cluster’s current version and available updates. It’s currently managed by ArgoCD which could conflict with the controller. The controller should replace ArgoCD and manage the object from its own CRD.

The configv1.ClusterVersionSpec is included in the UpgradeConfig CRD and syncs the ClusterVersion/version object.

The .spec.desiredUpdate field is set to start the upgrade.

The controller pins the upgrade at a time before the maintenance

The controller creates an UpgradeJob object at a time configured in the UpgradeConfig object. The UpgradeJob contains a snapshot of the most recent version in the .status.availableUpdates field and a timestamp when the upgrade should start.

The UpgradeJob rechecks the available updates at the time of the upgrade.

pinVersionWindow: "4h" (1)
1 The time window before the maintenance window in which the upgrade version is pinned. Upgrade jobs are created just in time if empty. Scheduled upgrade jobs are created in this time window.

Interval / time window definition

  • The controller must support customizable upgrade start time

  • The controller must be able to support various upgrade rhythms (weekly, every two weeks, whenever there’s an update)

The upgrade start time is defined in the UpgradeConfig object. It’s a cron expression with an additional field for the ISO 8601 week number (as returned by Go’s time.Time.ISOWeek). The additional field defines the weekly upgrade rhythm. Its syntax is cron-like; for example, 7 means the 7th week of the year. The initial implementation will support only @odd and @even, which select every odd or every even week of the year.

We support maintenance windows adhering to the local time of a cluster. The time zone of the schedule should be configurable.

It must be possible to suspend scheduling of upgrades.

schedule:
  cron: "0 22 * * 2" # 22:00 on Tuesdays (1)
  isoWeek: "@odd" (2)
  location: "Europe/Zurich" (3)
  suspend: false (4)
1 Cron expression
2 Every odd week of the year according to ISO 8601 week number. Initially supported values are @odd and @even.
3 Time zone
4 Whether to suspend scheduling of upgrades.

The controller verifies cluster health before and after the upgrade

The controller shouldn’t try to upgrade a cluster that isn’t healthy.

An UpgradeJob checks the cluster health before the upgrade and skips the upgrade if the cluster is unhealthy.

The controller should also check the cluster health after the upgrade.

Having custom queries allows customers or VSHN to extend checks to skip upgrades easily.

preUpgradeHealthChecks:
  timeout: "30m" (1)
  checkCriticalAlerts: true
  checkDegradedOperators: true
  excludeAlerts:
  - alertname: "KubePodCrashLooping"
  excludeNamespaces:
  - openshift-console
  excludeOperators:
  - openshift-monitoring
  customQueries:
  - query: 'up{job=~"^argocd-.+$",namespace="syn"} != 1'
1 How long to wait for the health checks to be successful.

Query alerts

The controller should query the cluster’s Prometheus instance for alerts. If there are any alerts with severity=critical, the cluster is unhealthy.

It should be possible to exclude specific alerts and all alerts for certain namespaces.
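
The filtering rules above could look roughly like the following sketch. The Alert type is a simplified stand-in for whatever the Prometheus query returns, and blockingAlerts and matchesAny are hypothetical names. Treating each excludeAlerts entry as a set of labels that must all match is an assumption.

```go
package main

import "fmt"

// Alert is a simplified stand-in for a firing alert returned by Prometheus.
type Alert struct {
	Labels map[string]string
}

// blockingAlerts returns the critical alerts that are excluded neither by
// label matchers nor by namespace. A non-empty result means "unhealthy".
func blockingAlerts(alerts []Alert, excludeAlerts []map[string]string, excludeNamespaces []string) []Alert {
	skipNS := make(map[string]bool, len(excludeNamespaces))
	for _, ns := range excludeNamespaces {
		skipNS[ns] = true
	}
	var out []Alert
	for _, a := range alerts {
		if a.Labels["severity"] != "critical" || skipNS[a.Labels["namespace"]] || matchesAny(a.Labels, excludeAlerts) {
			continue
		}
		out = append(out, a)
	}
	return out
}

// matchesAny reports whether all labels of at least one matcher are present
// on the alert with the same value. Empty matchers are ignored.
func matchesAny(labels map[string]string, matchers []map[string]string) bool {
	for _, m := range matchers {
		if len(m) == 0 {
			continue
		}
		all := true
		for k, v := range m {
			if labels[k] != v {
				all = false
				break
			}
		}
		if all {
			return true
		}
	}
	return false
}

func main() {
	alerts := []Alert{
		{Labels: map[string]string{"alertname": "KubePodCrashLooping", "severity": "critical", "namespace": "app"}},
		{Labels: map[string]string{"alertname": "EtcdDown", "severity": "critical", "namespace": "openshift-etcd"}},
	}
	blocking := blockingAlerts(alerts, []map[string]string{{"alertname": "KubePodCrashLooping"}}, []string{"openshift-console"})
	fmt.Println(len(blocking) == 0) // false: EtcdDown still blocks the upgrade
}
```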

Check cluster operator health

The ClusterVersion/version object contains a queryable list of each cluster operator’s health. If any of the operators is degraded, the cluster should be considered unhealthy and shouldn’t be upgraded.

It should be possible to exclude operators.
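
A minimal sketch of the operator check with exclusions follows. OperatorStatus is a simplified stand-in for the health information the text refers to, and degradedOperators is a hypothetical name; how the controller actually reads the operator conditions is not shown here.

```go
package main

import "fmt"

// OperatorStatus is a simplified stand-in for a cluster operator's health.
type OperatorStatus struct {
	Name     string
	Degraded bool
}

// degradedOperators returns the names of degraded operators that are not
// excluded. A non-empty result means the cluster should not be upgraded.
func degradedOperators(ops []OperatorStatus, exclude []string) []string {
	skip := make(map[string]bool, len(exclude))
	for _, name := range exclude {
		skip[name] = true
	}
	var degraded []string
	for _, op := range ops {
		if op.Degraded && !skip[op.Name] {
			degraded = append(degraded, op.Name)
		}
	}
	return degraded
}

func main() {
	ops := []OperatorStatus{
		{Name: "openshift-monitoring", Degraded: true},
		{Name: "ingress", Degraded: false},
	}
	fmt.Println(degradedOperators(ops, []string{"openshift-monitoring"})) // []
}
```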

The controller can pause and unpause machine configuration pools to delay node reboots

This allows updating master nodes and operators during office hours without affecting workloads on the worker nodes.

The UpgradeJob is marked as paused if all conditions listed under “When’s an upgrade job considered successful?” are met but there are paused machine configuration pools.

The overall upgrade timeout (.spec.timeout) is unaffected by the pause and continues to count down.

machineConfigPools:
- matchLabels: (1)
    name: x-app-night-maintenance
  delayUpgrade:
    startAfter: "1h" (2)
    startBefore: "2h" (3)
1 The label selector to match the machine configuration pool.
2 How long to delay the upgrade. Relative to the .spec.startAfter field.
3 The maximum delay to wait for the upgrade. If the controller can’t unpause the upgrade within this time, the upgrade is marked as failed.

The controller must expose Prometheus metrics indicating current state of upgrade

The controller should expose Prometheus metrics indicating the current state of the upgrade and the controller itself. This allows us to monitor the controller and the upgrade process and create alerts.

When’s an upgrade job considered successful?

The controller monitors the ClusterVersion/version for the Available condition. The UpgradeJob is considered successful if

  • the Available condition is True and the Version matches the desired version.

  • .Status.UpdatedMachineCount is equal to .Status.MachineCount for all machine configuration pools.
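
Expressed as code, the two conditions might look like the following sketch. ClusterVersionStatus and PoolStatus are simplified stand-ins for the fields named above, and upgradeSucceeded is a hypothetical name; the paused state is deliberately left out.

```go
package main

import "fmt"

// ClusterVersionStatus is a simplified stand-in for the relevant parts of
// the ClusterVersion/version status.
type ClusterVersionStatus struct {
	Available bool   // Available condition is True
	Version   string // currently reported version
}

// PoolStatus is a simplified stand-in for a machine configuration pool status.
type PoolStatus struct {
	Name                string
	MachineCount        int
	UpdatedMachineCount int
}

// upgradeSucceeded applies both conditions: the cluster reports the desired
// version as available, and every pool has updated all of its machines.
func upgradeSucceeded(cv ClusterVersionStatus, desiredVersion string, pools []PoolStatus) bool {
	if !cv.Available || cv.Version != desiredVersion {
		return false
	}
	for _, p := range pools {
		if p.UpdatedMachineCount != p.MachineCount {
			return false
		}
	}
	return true
}

func main() {
	cv := ClusterVersionStatus{Available: true, Version: "4.6.1"}
	pools := []PoolStatus{
		{Name: "master", MachineCount: 3, UpdatedMachineCount: 3},
		{Name: "worker", MachineCount: 5, UpdatedMachineCount: 4},
	}
	fmt.Println(upgradeSucceeded(cv, "4.6.1", pools)) // false: one worker is still outdated
}
```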

Custom resource definition

ClusterVersion

The ClusterVersion CRD defines the parameters synced to the ClusterVersion/version object.

There must be only one ClusterVersion object in the cluster.

apiVersion: managedupgrade.appuio.io/v1beta1
kind: ClusterVersion
metadata:
  name: version
spec:
  template: (1)
    spec:
      channel: stable-4.11
      clusterID: bc75be34-e92d-4745-bb9d-8ec39e877854
      desiredUpdate: {} (2)
      upstream: https://api.openshift.com/api/upgrades_info/v1/graph
1 Template for the ClusterVersion/version object.
2 The desiredUpdate is ignored and set by the UpgradeJob controller.

UpgradeConfig

The UpgradeConfig CRD defines the upgrade schedule and the upgrade job template. The reconciliation loop of the controller creates UpgradeJob objects based on the UpgradeConfig object.

apiVersion: managedupgrade.appuio.io/v1beta1
kind: UpgradeConfig
metadata:
  name: cluster-upgrade
spec:
  schedule: (1)
    cron: "0 22 * * 2"
    isoWeek: "@odd"
    location: "Europe/Zurich"
    suspend: false
  pinVersionWindow: "4h" (2)
  maxUpgradeStartDelay: "1h" (3)
  jobTemplate:
    metadata:
      labels:
        upgrade-config: cluster-upgrade (7)
    spec:
      config:
        upgradeTimeout: "2h" (4)
        preUpgradeHealthChecks: (5)
          timeout: "30m"
          checkCriticalAlerts: true
          checkDegradedOperators: true
          excludeAlerts:
          - alertname: "KubePodCrashLooping"
          excludeNamespaces:
          - openshift-console
          excludeOperators:
          - openshift-monitoring
          customQueries:
          - query: 'up{job=~"^argocd-.+$",namespace="syn"} != 1'
        postUpgradeHealthChecks: (6)
          timeout: "30m"
          checkCriticalAlerts: true
          checkDegradedOperators: true
          excludeAlerts:
          - alertname: "KubePodCrashLooping"
          excludeNamespaces:
          - openshift-console
          excludeOperators:
          - openshift-monitoring
          customQueries:
          - query: 'up{job=~"^argocd-.+$",namespace="syn"} != 1'
        machineConfigPools: (8)
        - matchLabels:
            name: x-app-night-maintenance
          delayUpgrade:
            delayMin: "1h"
            delayMax: "2h"
1 The upgrade schedule as defined in Interval / time window definition.
2 The time window before the maintenance window in which the upgrade version is pinned. UpgradeJobs are created at schedule - pinVersionWindow.
3 The maximum delay between the scheduled upgrade time and the actual upgrade time. Influences the UpgradeJob’s .status.upgradeBefore field.
4 The timeout for the upgrade. The upgrade is marked as failed if it takes longer than this.
5 The health checks to perform before the upgrade as defined in The controller verifies cluster health before and after the upgrade.
6 The health checks to perform after the upgrade as defined in The controller verifies cluster health before and after the upgrade.
7 Sets a label on the UpgradeJob. This allows selecting the created jobs in the UpgradeJobHook manifest.
8 Allows managing machine configuration pools. Currently supports delaying upgrades to nodes in the pool. See The controller can pause and unpause machine configuration pools to delay node reboots.

UpgradeJob

An UpgradeJob is created for each upgrade. It contains a snapshot of the most recent version in the .status.availableUpdates field, a snapshot of the config, and a timestamp when the upgrade should start.

apiVersion: managedupgrade.appuio.io/v1beta1
kind: UpgradeJob
metadata:
  name: cluster-upgrade-1609531200-ef11c47 (1)
spec:
  startAfter: "2021-01-01T22:00:00+01:00" (2)
  startBefore: "2021-01-01T23:00:00+01:00" (3)
  desiredVersion: (4)
    version: "4.6.1"
    image: "quay.io/openshift-release-dev/ocp-release@sha256:1234567890abcdef"
  config: (5)
    upgradeTimeout: "2h"
    preUpgradeHealthChecks: {} ...
    postUpgradeHealthChecks: {} ...
    machineConfigPools: [] ...
1 The name of the UpgradeJob is the timestamp when the upgrade should start plus a hash of the UpgradeConfig object. The timestamp is primarily used for sorting the UpgradeJob objects should multiple exist.
2 The timestamp from when the upgrade should start.
3 The timestamp until when the upgrade should start. If the upgrade doesn’t start within this time window, for example when the controller is unavailable, the upgrade is marked as skipped.
4 The version to upgrade to.
5 The config as defined in UpgradeConfig and copied from the UpgradeConfig object.

UpgradeJobHook

The UpgradeJobHook CRD allows running arbitrary jobs before and after the upgrade. The hook can be run once for the next upgrade, or for every upgrade.

Data about the upgrade is passed to the hook in environment variables.

apiVersion: managedupgrade.appuio.io/v1beta1
kind: UpgradeJobHook
metadata:
  name: cluster-upgrade-notify-ext
spec:
  events: (1)
    - Create
    - Start
    - Finish
    - Success
    - Failure
  run: Next # [Next, All] (2)
  failurePolicy: Ignore # [Abort, Ignore] (3)
  selector: (4)
    matchLabels:
      upgrade-config: cluster-upgrade
  template: (5)
    spec:
      template:
        spec:
          containers:
          - name: notify
            image: curlimages/curl:8.1.2 # sponsored OSS image
            args:
            - -XPOST
            - -H
            - "Content-Type: application/json"
            - -d
            - '{"event": $(EVENT_name), "version": $(JOB_spec_desiredVersion_image)}' (6)
            - https://example.com/webhook
          restartPolicy: Never
      backoffLimit: 3
      ttlSecondsAfterFinished: 43200 # 12h (7)
      activeDeadlineSeconds: 300 # 5m (8)
1 The events on which the hook runs:
  • Create: runs the hook when the UpgradeJob is created. The version is pinned at this point and the job is waiting for startAfter. This can be used to communicate the pending upgrade to other systems. See pinVersionWindow in UpgradeConfig.

  • Start: runs the hook when the UpgradeJob starts.

  • Finish: runs the hook when the UpgradeJob finishes, regardless of the outcome.

  • Success: runs the hook when the UpgradeJob finishes successfully.

  • Failure: runs the hook when the UpgradeJob finishes with an error.
2 Whether to run the hook for the next upgrade or for every upgrade.
3 What to do when the hook fails. Ignore is the default and continues the upgrade process. Abort marks the upgrade as failed and stops the upgrade process.

More advanced failure policies can be handled through the built-in Job failure handling mechanisms.

4 The selector to select the UpgradeJob objects to run the hook for.
5 The batchv1.JobTemplateSpec to run.
6 The controller injects the following environment variables:
  • EVENT: The event that triggered the hook as JSON.

    The event definition isn’t complete yet. It will be extended in the future. Guaranteed to be present are the name, time, reason, message fields.

  • EVENT_*: The event definition is flattened into environment variables. The values are JSON encoded; "string" is encoded as "\"string\"", null is encoded as null. The keys are the field paths separated by _. For example:

    • EVENT_name: The name of the event that triggered the hook.

    • EVENT_reason: The reason why the event was triggered.

  • JOB: The full UpgradeJob object as JSON.

  • JOB_*: The job definition is flattened into environment variables. The values are JSON encoded; "string" is encoded as "\"string\"", null is encoded as null. The keys are the field paths separated by _. For example:

    • JOB_metadata_name: The name of the UpgradeJob that triggered the hook.

    • JOB_metadata_labels_my_var_io_info: The label my-var.io/info of the UpgradeJob that triggered the hook.

    • JOB_spec_desiredVersion_image: The image of the UpgradeJob that triggered the hook.

7 Jobs aren’t deleted automatically. Use ttlSecondsAfterFinished to delete the job after a certain time.
8 There is no automatic timeout for jobs. Use activeDeadlineSeconds to set a timeout.