Upgrade Controller

Problem Statement

Maintenance of OpenShift 4 clusters is a manual process. We promise maintenance windows outside office hours, and some of those windows are Switzerland-only.

Staying up late to upgrade those clusters isn’t a sustainable solution. It binds team members to the task, and it isn’t a good use of their time.

The need for automation is obvious. We decided to write our own upgrade controller.

High Level Goals

  • A normal, successful upgrade completes during a defined maintenance window without any manual intervention

  • Maintenance window and upgrade rhythm are configurable on a per-cluster basis

    • Suspending upgrades is possible

  • Maintenance engineers are notified when an upgrade fails

    • Maintenance is skipped when the cluster is unhealthy

Non-Goals

  • More general centralized management of OpenShift 4 clusters

Implementation

The controller is a standard controller-runtime controller. It's deployed on each OpenShift 4 cluster and configured through a Custom Resource Definition (CRD) called UpgradeConfig.
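
The sketch below shows how such a reconciler could be wired up with controller-runtime. The API import path and type names are assumptions for illustration, not the final implementation.

package controllers

import (
    "context"

    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"

    // Hypothetical import path for the UpgradeConfig API types.
    managedupgradev1beta1 "example.com/upgrade-controller/api/v1beta1"
)

// UpgradeConfigReconciler reconciles UpgradeConfig objects.
type UpgradeConfigReconciler struct {
    client.Client
}

func (r *UpgradeConfigReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    var uc managedupgradev1beta1.UpgradeConfig
    if err := r.Get(ctx, req.NamespacedName, &uc); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }
    // Decide whether an UpgradeJob has to be created for the next maintenance window.
    return ctrl.Result{}, nil
}

func (r *UpgradeConfigReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&managedupgradev1beta1.UpgradeConfig{}).
        Complete(r)
}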

Basic upgrade flow

Figure: upgrade controller high-level flow chart

The controller is extensible through webhooks

The controller should be able to send notifications to a webhook. A notification should be sent for every step of the upgrade process. Failed notification deliveries must not block the upgrade flow.

We might want to reuse the Alertmanager webhook definition, since it already covers TLS, authentication, and other necessary features.

url: "https://example.com/webhook"
http_config: (1)
  authorization:
    credentials: "token"
  proxy_url: "http://proxy.example.com"
annotations: (2)
  cluster_id: "bar"
  tenant_id: "foo"
1 Alertmanager HTTP Config
2 Additional annotations to send with the webhook.

The controller should send a POST request to the webhook with a JSON payload.

{
  "version": "1", (1)

  "type": "UpgradeSkipped", (2)
  "status": "True", (3)
  "reason": "ClusterUnhealthy", (4)
  "message": "Critical alerts [MultipleDefaultStorageClasses, NodeFilesystemAlmostOutOfFiles] are firing", (5)

  "desiredVersion": { (6)
    "version": "4.6.34",
    "image": "quay.io/openshift-release-dev/ocp-release@sha256:1234567890abcdef"
  },

  "annotations": { (7)
    "cluster_id": "bar",
    "tenant_id": "foo"
  }
}
1 The version of the webhook payload
2 The type of the notification. Inspired by metav1.Condition.
3 The status of the notification. True, False, Unknown.
4 The programmatic identifier of the notification indicating the reason for the notification.
5 The human-readable message indicating the reason for the notification.
6 The desired version of the cluster. Only present for certain notifications.
7 Additional annotations from the webhook configuration.
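
One possible shape for the notification sender, shown as a sketch only: failed deliveries are logged and swallowed so they never block the upgrade flow. The Notification struct mirrors a subset of the payload above; the function signature and logging hook are assumptions.

package webhook

import (
    "bytes"
    "context"
    "encoding/json"
    "net/http"
    "time"
)

// Notification mirrors a subset of the JSON payload described above.
type Notification struct {
    Version     string            `json:"version"`
    Type        string            `json:"type"`
    Status      string            `json:"status"`
    Reason      string            `json:"reason"`
    Message     string            `json:"message"`
    Annotations map[string]string `json:"annotations,omitempty"`
}

// Send posts the notification and only logs failures; the caller
// (the upgrade flow) never sees an error from a failed delivery.
func Send(ctx context.Context, url string, n Notification, logf func(string, ...any)) {
    body, err := json.Marshal(n)
    if err != nil {
        logf("marshaling notification failed: %v", err)
        return
    }
    ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
    defer cancel()
    req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body))
    if err != nil {
        logf("building webhook request failed: %v", err)
        return
    }
    req.Header.Set("Content-Type", "application/json")
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        logf("webhook delivery failed: %v", err)
        return
    }
    defer resp.Body.Close()
    if resp.StatusCode >= 300 {
        logf("webhook returned status %d", resp.StatusCode)
    }
}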

The controller manages the content of the ClusterVersion/version object

The ClusterVersion/version object is the source of truth for the cluster's current version and available updates. It's currently managed by ArgoCD, which could conflict with the controller. The controller should replace ArgoCD and manage the object from its own CRD.

The configv1.ClusterVersionSpec is embedded in the controller's own ClusterVersion custom resource (see Custom resource definition) and synced to the ClusterVersion/version object.

The controller sets the .spec.desiredUpdate field to start the upgrade.
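
Starting the upgrade then boils down to patching that field, roughly as sketched below, assuming the openshift/api config types and a controller-runtime client; the function name is illustrative.

package upgradejob

import (
    "context"

    configv1 "github.com/openshift/api/config/v1"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// startUpgrade sets .spec.desiredUpdate on ClusterVersion/version,
// which triggers the cluster-version operator to begin the upgrade.
func startUpgrade(ctx context.Context, c client.Client, version, image string) error {
    var cv configv1.ClusterVersion
    if err := c.Get(ctx, client.ObjectKey{Name: "version"}, &cv); err != nil {
        return err
    }
    cv.Spec.DesiredUpdate = &configv1.Update{
        Version: version,
        Image:   image,
    }
    return c.Update(ctx, &cv)
}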

The controller pins the upgrade at a time before the maintenance

The controller creates an UpgradeJob object at a time configured in the UpgradeConfig object. The UpgradeJob contains a snapshot of the most recent version in the .status.availableUpdates field and a timestamp when the upgrade should start.

The UpgradeJob rechecks the available updates at the time of the upgrade. If the version is no longer available, the upgrade is skipped and a notification is sent to the webhook.

pinVersionWindow: "4h" (1)
1 The time window before the maintenance window in which the upgrade version is pinned. Scheduled upgrade jobs are created at the start of this window; if it's empty, upgrade jobs are created just in time.
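
A sketch of the pin-time arithmetic, assuming pinVersionWindow is parsed as a Go duration and nextRun is the next scheduled maintenance window start; the function name is illustrative.

package schedule

import "time"

// jobCreationTime returns when the UpgradeJob for the next maintenance
// window should be created: pinVersionWindow before the scheduled start,
// or just in time if no window is configured.
func jobCreationTime(nextRun time.Time, pinVersionWindow string) (time.Time, error) {
    if pinVersionWindow == "" {
        return nextRun, nil
    }
    window, err := time.ParseDuration(pinVersionWindow)
    if err != nil {
        return time.Time{}, err
    }
    return nextRun.Add(-window), nil
}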

Interval / time window definition

  • The controller must support customizable upgrade start time

  • The controller must be able to support various upgrade rhythms (weekly, every two weeks, whenever there’s an update)

The upgrade start time is defined in the UpgradeConfig object. It takes the form of a cron expression with an additional field for the ISO 8601 week number (time#Time.ISOWeek). The additional field defines the weekly upgrade rhythm. Its syntax is cron-like; for example, 7 means the 7th week of the year. The initial implementation will support only @odd and @even, which mean every odd or even week of the year.

We support maintenance windows that follow the local time of a cluster, so the time zone of the schedule should be configurable.

It must be possible to suspend scheduling of upgrades.

schedule:
  cron: "0 22 * * 2" # 22:00 on Tuesdays (1)
  isoWeek: "@odd" (2)
  location: "Europe/Zurich" (3)
  suspend: false (4)
1 Cron expression
2 Every odd week of the year according to ISO 8601 week number. Initially supported values are @odd and @even.
3 Time zone
4 Whether to suspend scheduling of upgrades.
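
One way the schedule could be evaluated, using the robfig/cron parser together with Go's time.Time.ISOWeek. This is a sketch under the assumption that only @odd and @even are supported and that the suspend flag is handled by the caller.

package schedule

import (
    "fmt"
    "time"

    "github.com/robfig/cron/v3"
)

// nextUpgradeTime returns the next scheduled upgrade start after "from",
// honoring the cluster's time zone and the @odd/@even ISO week filter.
func nextUpgradeTime(from time.Time, cronExpr, isoWeek, location string) (time.Time, error) {
    loc, err := time.LoadLocation(location)
    if err != nil {
        return time.Time{}, err
    }
    sched, err := cron.ParseStandard(cronExpr)
    if err != nil {
        return time.Time{}, err
    }
    t := from.In(loc)
    // Step through cron occurrences until one falls into a matching ISO week.
    for i := 0; i < 1000; i++ {
        t = sched.Next(t)
        _, week := t.ISOWeek()
        switch isoWeek {
        case "@odd":
            if week%2 == 1 {
                return t, nil
            }
        case "@even":
            if week%2 == 0 {
                return t, nil
            }
        default: // no ISO week restriction
            return t, nil
        }
    }
    return time.Time{}, fmt.Errorf("no matching occurrence found")
}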

The controller verifies cluster health before and after the upgrade

The controller shouldn’t try to upgrade a cluster that isn’t healthy.

An UpgradeJob checks the cluster health before the upgrade and skips the upgrade if the cluster is unhealthy. If the upgrade is skipped, the controller should send a notification to the webhook.

The controller should also check the cluster health after the upgrade. If the cluster is unhealthy, the controller should send a notification to the webhook.

Custom queries allow customers or VSHN to easily extend the checks that can cause an upgrade to be skipped.

preUpgradeHealthChecks:
  timeout: "30m" (1)
  checkCriticalAlerts: true
  checkDegradedOperators: true
  excludeAlerts:
  - alertname: "KubePodCrashLooping"
  excludeNamespaces:
  - openshift-console
  excludeOperators:
  - openshift-monitoring
  customQueries:
  - query: 'up{job=~"^argocd-.+$",namespace="syn"} != 1'
1 How long to wait for the health checks to be successful.

Query alerts

The controller should query the cluster’s Prometheus instance for alerts. If there are any alerts with severity=critical, the cluster is unhealthy.

It should be possible to exclude specific alerts and all alerts for certain namespaces.
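
A sketch of such an alert check using the official Prometheus Go client. The PromQL query and the way the exclusion lists are applied are assumptions about how the configuration above could be evaluated; the function name is illustrative.

package health

import (
    "context"
    "fmt"
    "time"

    "github.com/prometheus/client_golang/api"
    promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
    "github.com/prometheus/common/model"
)

// criticalAlertsFiring returns the names of firing critical alerts,
// ignoring excluded alert names and namespaces.
func criticalAlertsFiring(ctx context.Context, promURL string, excludeAlerts, excludeNamespaces map[string]bool) ([]string, error) {
    c, err := api.NewClient(api.Config{Address: promURL})
    if err != nil {
        return nil, err
    }
    val, _, err := promv1.NewAPI(c).Query(ctx, `ALERTS{alertstate="firing",severity="critical"}`, time.Now())
    if err != nil {
        return nil, err
    }
    vec, ok := val.(model.Vector)
    if !ok {
        return nil, fmt.Errorf("unexpected result type %T", val)
    }
    var firing []string
    for _, sample := range vec {
        name := string(sample.Metric["alertname"])
        namespace := string(sample.Metric["namespace"])
        if excludeAlerts[name] || excludeNamespaces[namespace] {
            continue
        }
        firing = append(firing, name)
    }
    return firing, nil
}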

Check cluster operator health

Each cluster operator reports its health in a ClusterOperator object. If any of the operators is degraded, the cluster should be considered unhealthy and shouldn't be upgraded.

It should be possible to exclude operators.
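
A possible implementation of the operator check using the openshift/api config types; the exclusion handling mirrors the configuration above and the function name is illustrative.

package health

import (
    "context"

    configv1 "github.com/openshift/api/config/v1"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// degradedOperators returns the names of cluster operators reporting
// Degraded=True, skipping operators on the exclude list.
func degradedOperators(ctx context.Context, c client.Client, exclude map[string]bool) ([]string, error) {
    var operators configv1.ClusterOperatorList
    if err := c.List(ctx, &operators); err != nil {
        return nil, err
    }
    var degraded []string
    for _, op := range operators.Items {
        if exclude[op.Name] {
            continue
        }
        for _, cond := range op.Status.Conditions {
            if cond.Type == configv1.OperatorDegraded && cond.Status == configv1.ConditionTrue {
                degraded = append(degraded, op.Name)
            }
        }
    }
    return degraded, nil
}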

The controller must expose Prometheus metrics indicating the current state of the upgrade

The controller should expose Prometheus metrics indicating the current state of the upgrade and the controller itself. This allows us to monitor the controller and the upgrade process and create alerts.
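
A sketch of how a metric could be registered on controller-runtime's built-in metrics registry; the metric name and labels are illustrative, not a fixed interface.

package upgrademetrics

import (
    "github.com/prometheus/client_golang/prometheus"
    "sigs.k8s.io/controller-runtime/pkg/metrics"
)

// upgradeJobState is an illustrative gauge tracking the state of the
// current UpgradeJob; the real metric names and labels are up to the
// implementation.
var upgradeJobState = prometheus.NewGaugeVec(prometheus.GaugeOpts{
    Name: "upgrade_controller_upgradejob_state",
    Help: "State of the current UpgradeJob (1 for the active state, 0 otherwise).",
}, []string{"upgradejob", "state"})

func init() {
    // Everything registered here is exposed on the controller's metrics
    // endpoint alongside the default controller-runtime metrics.
    metrics.Registry.MustRegister(upgradeJobState)
}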

When’s an upgrade job considered successful?

The controller monitors the ClusterVersion/version for the Available condition. The UpgradeJob is considered successful if the Available condition is True and the Version matches the desired version.
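
Expressed as code, the success check could look roughly like this. It's a simplified sketch against the openshift/api config types; the function name is illustrative.

package upgradejob

import configv1 "github.com/openshift/api/config/v1"

// upgradeSucceeded reports whether the cluster reached the desired version:
// the Available condition is True and the reported version matches.
func upgradeSucceeded(cv *configv1.ClusterVersion, desiredVersion string) bool {
    if cv.Status.Desired.Version != desiredVersion {
        return false
    }
    for _, cond := range cv.Status.Conditions {
        if cond.Type == configv1.OperatorAvailable {
            return cond.Status == configv1.ConditionTrue
        }
    }
    return false
}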

Custom resource definition

ClusterVersion

The ClusterVersion CRD defines the parameters synced to the ClusterVersion/version object.

There must be only one ClusterVersion object in the cluster.

apiVersion: managedupgrade.appuio.io/v1beta1
kind: ClusterVersion
metadata:
  name: version
spec:
  template: (1)
    spec:
      capabilities:
        baselineCapabilitySet: v4.11
      channel: stable-4.11
      clusterID: bc75be34-e92d-4745-bb9d-8ec39e877854
      desiredUpdate: {} (2)
      upstream: https://api.openshift.com/api/upgrades_info/v1/graph
1 Template for the ClusterVersion/version object.
2 The desiredUpdate is ignored and set by the UpgradeJob controller.

UpgradeConfig

The UpgradeConfig CRD defines the upgrade schedule and the upgrade job template. The reconciliation loop of the controller creates UpgradeJob objects based on the UpgradeConfig object.

apiVersion: managedupgrade.appuio.io/v1beta1
kind: UpgradeConfig
metadata:
  name: cluster-upgrade
spec:
  schedule: (1)
    cron: "0 22 * * 2"
    isoWeek: "@odd"
    location: "Europe/Zurich"
    suspend: false
  pinVersionWindow: "4h" (2)
  maxUpgradeStartDelay: "1h" (3)
  jobTemplate:
    spec:
      config:
        upgradeTimeout: "2h" (4)
        preUpgradeHealthChecks: (5)
          timeout: "30m"
          checkCriticalAlerts: true
          checkDegradedOperators: true
          excludeAlerts:
          - alertname: "KubePodCrashLooping"
          excludeNamespaces:
          - openshift-console
          excludeOperators:
          - openshift-monitoring
          customQueries:
          - query: 'up{job=~"^argocd-.+$",namespace="syn"} != 1'
        postUpgradeHealthChecks: (6)
          timeout: "30m"
          checkCriticalAlerts: true
          checkDegradedOperators: true
          excludeAlerts:
          - alertname: "KubePodCrashLooping"
          excludeNamespaces:
          - openshift-console
          excludeOperators:
          - openshift-monitoring
          customQueries:
          - query: 'up{job=~"^argocd-.+$",namespace="syn"} != 1'
        webhooks: (7)
          - url: "https://example.com/webhook"
            annotations:
              cluster_id: "bar"
              tenant_id: "foo"
  webhooks: (7)
    - url: "https://example.com/webhook"
      annotations:
        cluster_id: "bar"
        tenant_id: "foo"
1 The upgrade schedule as defined in Interval / time window definition.
2 The time window before the maintenance window in which the upgrade version is pinned. UpgradeJobs are created at schedule - pinVersionWindow.
3 The maximum delay between the scheduled upgrade time and the actual upgrade time. Influences the UpgradeJob's .spec.startBefore field.
4 The timeout for the upgrade. The upgrade is marked as failed if it takes longer than this.
5 The health checks to perform before the upgrade as defined in The controller verifies cluster health before and after the upgrade.
6 The health checks to perform after the upgrade as defined in The controller verifies cluster health before and after the upgrade.
7 The webhooks to send notifications to as defined in The controller is extensible through webhooks. Having multiple webhooks makes it possible to send notifications to different systems. Both the UpgradeConfig and the UpgradeJob have a webhooks field since both might send notifications.

UpgradeJob

An UpgradeJob is created for each upgrade. It contains a snapshot of the most recent version in the .status.availableUpdates field, a snapshot of the config, and a timestamp when the upgrade should start.

apiVersion: managedupgrade.appuio.io/v1beta1
kind: UpgradeJob
metadata:
  name: cluster-upgrade-1609531200-ef11c47 (1)
spec:
  startAfter: "2021-01-01T22:00:00+01:00" (2)
  startBefore: "2021-01-01T23:00:00+01:00" (3)
  desiredVersion: (4)
    version: "4.6.1"
    image: "quay.io/openshift-release-dev/ocp-release@sha256:1234567890abcdef"
  config: (5)
    upgradeTimeout: "2h"
    preUpgradeHealthChecks: {} ...
    postUpgradeHealthChecks: {} ...
    webhooks: []
1 The name of the UpgradeJob is the timestamp when the upgrade should start plus a hash of the UpgradeConfig object. The timestamp is primarily used for sorting the UpgradeJob objects should multiple exist.
2 The timestamp from when the upgrade should start.
3 The timestamp until when the upgrade should start. If the upgrade doesn’t start within this time window, for example when the controller is unavailable, the upgrade is marked as skipped.
4 The version to upgrade to.
5 The config as defined in the UpgradeConfig jobTemplate and copied from the UpgradeConfig object.