Triggering Automated Maintenance

Problem

We need a way to trigger OpenShift 4 updates automatically to achieve unattended maintenance. There are multiple potential approaches to do so; see the Proposals section below.

Goals

  • Automatically start upgrade at a specified time

    • Start time can be specified at the cluster level

    • Maintenance rhythm (weekly, every two weeks, …) can be specified at the cluster level

  • Scheduled upgrades can be skipped (for example due to known issues on a cluster, or by customer request)

  • Maintenance start trigger is monitored for failures

  • Maintenance is skipped when cluster is unhealthy

Non-goals

  • More general centralized management of OpenShift 4 clusters

Proposals

Option 1: Use Renovate

We can extend our customized Renovate fork to support updating the desired OpenShift 4 version in the configuration hierarchy. Since we’ve already written multiple custom Renovate managers, identifying and extracting the current desired OpenShift 4 version from the configuration should be straightforward.

Renovate’s data sources are used to determine available updates for a dependency identified by a manager. In contrast to custom managers, we’ve not yet implemented a custom data source for Renovate. However, looking at the documentation, it should be fairly straightforward to implement a data source which queries api.openshift.com for available OpenShift 4 upgrades. There’s already code which queries this API in the OpenShift managed upgrade operator and our own custom OpenShift 4 maintenance tool.
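
To illustrate the data-source side, the following sketch (written in Go for brevity, rather than as an actual Renovate TypeScript data source) shows the kind of query such a data source would have to make. It assumes the Cincinnati-style graph endpoint at api.openshift.com/api/upgrades_info/v1/graph; the channel and version values are placeholders, and the response fields should be verified against the actual API before relying on them.

    // upgradegraph_sketch.go: illustrative sketch which queries the
    // Cincinnati-style upgrade graph served by api.openshift.com and lists the
    // versions reachable from the cluster's current version.
    package main

    import (
        "encoding/json"
        "fmt"
        "net/http"
        "net/url"
    )

    // graph mirrors the parts of the Cincinnati graph response we care about:
    // release nodes and directed edges between node indices.
    type graph struct {
        Nodes []struct {
            Version string `json:"version"`
            Payload string `json:"payload"`
        } `json:"nodes"`
        Edges [][2]int `json:"edges"`
    }

    // availableUpdates returns the versions reachable from `current` in the
    // given channel, i.e. the candidate upgrade targets.
    func availableUpdates(channel, current string) ([]string, error) {
        u := url.URL{
            Scheme:   "https",
            Host:     "api.openshift.com",
            Path:     "/api/upgrades_info/v1/graph",
            RawQuery: url.Values{"channel": {channel}, "arch": {"amd64"}}.Encode(),
        }
        req, err := http.NewRequest(http.MethodGet, u.String(), nil)
        if err != nil {
            return nil, err
        }
        req.Header.Set("Accept", "application/json")

        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()

        var g graph
        if err := json.NewDecoder(resp.Body).Decode(&g); err != nil {
            return nil, err
        }

        // Find the node matching the cluster's current version ...
        from := -1
        for i, n := range g.Nodes {
            if n.Version == current {
                from = i
                break
            }
        }
        if from == -1 {
            return nil, fmt.Errorf("version %s not found in channel %s", current, channel)
        }

        // ... and collect every version directly reachable from it.
        var updates []string
        for _, e := range g.Edges {
            if e[0] == from {
                updates = append(updates, g.Nodes[e[1]].Version)
            }
        }
        return updates, nil
    }

    func main() {
        // Placeholder channel and version; the data source would take these
        // from the cluster's configuration.
        updates, err := availableUpdates("stable-4.11", "4.11.18")
        if err != nil {
            panic(err)
        }
        fmt.Println("available updates:", updates)
    }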

To ensure upgrades are started at a specified time we can leverage Renovate’s support for scheduled updates and automerging.

Monitoring trigger failures for the Renovate approach would require monitoring both Renovate’s execution and the Commodore CI pipelines for all clusters. This would require a significant amount of engineering, since we don’t have any infrastructure in place to do so yet.

Skipping maintenance when a cluster is unhealthy would require either a feedback loop between Renovate or the Commodore CI pipeline and the cluster, or an additional component on the cluster which can block a desired upgrade that has been pushed into the cluster catalog. An additional component on the cluster introduces additional complexity and raises the question of why that component doesn’t take on a larger share of the work and orchestrate the upgrade itself. On the other hand, implementing a feedback loop between the cluster and Renovate or the CI pipeline creates additional dependencies which may not be desirable.

Option 2: Use Red Hat Advanced Cluster Management (RHACM)

We can use RHACM to automate cluster upgrades. However, RHACM itself appears to be targeted more at manual operations work than at automated GitOps workflows. For example, upgrading clusters is documented as a manual process performed by clicking through the web interface.

Given this documentation, it looks like we’d need to engineer custom tooling which interacts with RHACM to update managed clusters to the desired version at the right time.

Overall, RHACM does a lot more than just cluster upgrades, and doesn’t provide enough upgrade functionality out of the box to fulfil our requirements. Given our experience with managing OpenShift 4 through GitOps, it’s quite likely that trying to interact with RHACM through GitOps would be finicky, if it’s even possible at all.

Option 3: Use OpenShift managed upgrade operator

The OpenShift managed upgrade operator supports scheduled upgrades. The operator is installed on each cluster and watches UpgradeConfig resources. It sets the cluster’s desired version based on the UpgradeConfig resources present on the cluster. Notably, the UpgradeConfig contains a desired OpenShift patch version, which would need to be injected into the cluster somehow. See the README on GitHub for details on how the upgrade process works.
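
As an illustration of what injecting the desired version could look like, the sketch below creates an UpgradeConfig through the Kubernetes dynamic client. The API group, namespace, and field names are written down from memory of the operator’s CRD and must be checked against the upstream README; the timestamp and versions are placeholders.

    // upgradeconfig_sketch.go: illustrative sketch which injects a desired patch
    // version into a cluster as an UpgradeConfig for the managed upgrade
    // operator. API group, namespace and field names should be verified against
    // the operator's CRD before use.
    package main

    import (
        "context"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
        "k8s.io/apimachinery/pkg/runtime/schema"
        "k8s.io/client-go/dynamic"
        "k8s.io/client-go/rest"
    )

    func main() {
        // In-cluster config; a triggering mechanism running outside the cluster
        // (for example a CI job) would use its own kubeconfig instead.
        cfg, err := rest.InClusterConfig()
        if err != nil {
            panic(err)
        }
        client, err := dynamic.NewForConfig(cfg)
        if err != nil {
            panic(err)
        }

        gvr := schema.GroupVersionResource{
            Group:    "upgrade.managed.openshift.io",
            Version:  "v1alpha1",
            Resource: "upgradeconfigs",
        }

        // Desired version and start time are placeholders for whatever the
        // triggering mechanism computes.
        uc := &unstructured.Unstructured{Object: map[string]interface{}{
            "apiVersion": "upgrade.managed.openshift.io/v1alpha1",
            "kind":       "UpgradeConfig",
            "metadata":   map[string]interface{}{"name": "managed-upgrade-config"},
            "spec": map[string]interface{}{
                "type":      "OSD",
                "upgradeAt": "2023-02-07T01:00:00Z",
                "desired": map[string]interface{}{
                    "channel": "stable-4.11",
                    "version": "4.11.25",
                },
            },
        }}

        _, err = client.Resource(gvr).
            Namespace("openshift-managed-upgrade-operator").
            Create(context.Background(), uc, metav1.CreateOptions{})
        if err != nil {
            panic(err)
        }
    }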

To be able to fully utilize the managed upgrade operator, we’d have to adjust the implementation, since the operator currently only looks at the worker machine config pool to determine whether the node upgrades have completed.

The managed upgrade operator already performs some simple health checks and won’t trigger a requested upgrade if the cluster is unhealthy.

The managed upgrade operator does provide some metrics which can be used to monitor an upgrade. There’s a metric which indicates whether there was an error while triggering the upgrade, as well as metrics indicating the progress of the upgrade.

Option 4: Use a policy tool

We could use a policy tool, such as Kyverno, to orchestrate OpenShift 4 upgrades.

A possible policy to orchestrate upgrades would check the ClusterVersion object for available updates and would update the cluster’s desired version to the latest available upgrade. However, it’s unclear how we could extend such a policy to ensure that upgrades are only applied at a specified time.

Moreover, our experience with Kyverno for APPUiO Cloud policies has shown that reasonably complex policies are hard to engineer and hard to test.

Monitoring the upgrade would be feasible with most policy tools, as they usually provide metrics regarding policy execution errors.

Introducing a pre-upgrade health check should be possible with a policy, since most policy tools have some concept of preconditions which must hold for the policy execution. We might have to implement an additional component which checks the cluster health periodically and updates a configmap with the health check results in order to make them available to the policy.
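
A rough sketch of such a hypothetical helper component is shown below: it derives a coarse health verdict from ClusterOperator conditions and publishes it in a ConfigMap. The namespace, the ConfigMap name, and the specific conditions checked are purely illustrative assumptions, not an existing interface.

    // healthcheck_sketch.go: illustrative sketch of a periodic cluster health
    // check which publishes its verdict in a ConfigMap so that a policy engine
    // can consume it as a precondition. All names are illustrative.
    package main

    import (
        "context"
        "strconv"
        "time"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
        "k8s.io/apimachinery/pkg/runtime/schema"
        "k8s.io/client-go/dynamic"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/rest"
    )

    var clusterOperatorGVR = schema.GroupVersionResource{
        Group: "config.openshift.io", Version: "v1", Resource: "clusteroperators",
    }

    // clusterHealthy reports false if any ClusterOperator is degraded or unavailable.
    func clusterHealthy(ctx context.Context, dyn dynamic.Interface) (bool, error) {
        cos, err := dyn.Resource(clusterOperatorGVR).List(ctx, metav1.ListOptions{})
        if err != nil {
            return false, err
        }
        for _, co := range cos.Items {
            conditions, _, _ := unstructured.NestedSlice(co.Object, "status", "conditions")
            for _, c := range conditions {
                cond, _ := c.(map[string]interface{})
                typ, _ := cond["type"].(string)
                status, _ := cond["status"].(string)
                if (typ == "Degraded" && status == "True") ||
                    (typ == "Available" && status == "False") {
                    return false, nil
                }
            }
        }
        return true, nil
    }

    func main() {
        cfg, err := rest.InClusterConfig()
        if err != nil {
            panic(err)
        }
        dyn := dynamic.NewForConfigOrDie(cfg)
        kube := kubernetes.NewForConfigOrDie(cfg)

        for {
            ctx := context.Background()
            if healthy, err := clusterHealthy(ctx, dyn); err == nil {
                // Publish the verdict where a policy could pick it up.
                cm := &corev1.ConfigMap{
                    ObjectMeta: metav1.ObjectMeta{Name: "cluster-health", Namespace: "upgrade-policy"},
                    Data:       map[string]string{"healthy": strconv.FormatBool(healthy)},
                }
                if _, err := kube.CoreV1().ConfigMaps(cm.Namespace).Update(ctx, cm, metav1.UpdateOptions{}); err != nil {
                    // Fall back to creating the ConfigMap on the first run.
                    _, _ = kube.CoreV1().ConfigMaps(cm.Namespace).Create(ctx, cm, metav1.CreateOptions{})
                }
            }
            time.Sleep(5 * time.Minute)
        }
    }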

Option 5: Implement our own upgrade controller

Instead of adjusting openshift-managed-upgrade-operator to support our requirements, or trying to write a rather complex policy for orchestrating automated updates, we could implement our own controller which orchestrates OpenShift 4 version updates.

By running an active component in the cluster we can easily monitor the cluster’s state and upgrade progress from the same tool which triggers the upgrade. If we implement the active component from scratch, we’ve got full control over the tool’s development. This allows quick turnaround for adding features and fixing bugs, and avoids the potentially high maintenance burden of pulling in upstream updates if we were to fork openshift-managed-upgrade-operator.

Additionally, it’s unlikely that we could upstream our changes to openshift-managed-upgrade-operator, as we probably have different requirements than the OpenShift Dedicated product. Since we already have sufficient experience in implementing Kubernetes controllers/operators, the additional effort for the initial implementation should be acceptable.

In order for this approach to actually reduce complexity compared to the Renovate-based approach, we need the controller to autonomously update the desired cluster version. Notably, we don’t gain anything except additional moving parts if we make the desired OpenShift patch version an input for the controller. Therefore, the biggest drawback of this approach, as with any option based on an active component in the cluster, is that we lose some amount of control over a cluster’s desired version in the GitOps repositories. Additionally, we don’t automatically have a persistent history of version updates for each cluster based on the commits in the cluster’s catalog repository.

However, by implementing our own controller, we should be able to add support for a persistent update history outside the cluster with reasonable effort.
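
To make this concrete, here’s a condensed sketch of the reconcile logic such a controller could implement, assuming the upgrade is triggered by setting .spec.desiredUpdate on the ClusterVersion object, which is how the cluster version operator is normally instructed to upgrade. The maintenance window, health check, and version selection are reduced to stubs, and all names are ours rather than an existing API.

    // upgradecontroller_sketch.go: condensed sketch of the reconcile logic for
    // our own upgrade controller. If a maintenance window is open and the
    // cluster looks healthy, it picks an entry from ClusterVersion
    // .status.availableUpdates and writes it to .spec.desiredUpdate. Window
    // handling, health checks and version comparison are reduced to stubs.
    package main

    import (
        "context"
        "fmt"
        "time"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
        "k8s.io/apimachinery/pkg/runtime/schema"
        "k8s.io/client-go/dynamic"
        "k8s.io/client-go/rest"
    )

    var clusterVersionGVR = schema.GroupVersionResource{
        Group: "config.openshift.io", Version: "v1", Resource: "clusterversions",
    }

    // inMaintenanceWindow is a stand-in for the per-cluster schedule,
    // e.g. "Tuesdays starting at 01:00 UTC".
    func inMaintenanceWindow(now time.Time) bool {
        return now.UTC().Weekday() == time.Tuesday && now.UTC().Hour() == 1
    }

    // clusterHealthy would aggregate ClusterOperator conditions, critical
    // alerts, etc.; always healthy in this sketch.
    func clusterHealthy(ctx context.Context) bool { return true }

    func reconcile(ctx context.Context, dyn dynamic.Interface) error {
        if !inMaintenanceWindow(time.Now()) || !clusterHealthy(ctx) {
            return nil // nothing to do, or upgrade deliberately skipped
        }

        cv, err := dyn.Resource(clusterVersionGVR).Get(ctx, "version", metav1.GetOptions{})
        if err != nil {
            return err
        }

        updates, _, _ := unstructured.NestedSlice(cv.Object, "status", "availableUpdates")
        if len(updates) == 0 {
            return nil // already up to date, or the update graph is unavailable
        }

        // Naively take the first offered update; a real controller would
        // compare semantic versions and respect a configured version pin.
        latest, ok := updates[0].(map[string]interface{})
        if !ok {
            return nil
        }

        if err := unstructured.SetNestedMap(cv.Object, latest, "spec", "desiredUpdate"); err != nil {
            return err
        }
        if _, err := dyn.Resource(clusterVersionGVR).Update(ctx, cv, metav1.UpdateOptions{}); err != nil {
            return err
        }
        fmt.Printf("triggered upgrade to %v\n", latest["version"])
        return nil
    }

    func main() {
        cfg, err := rest.InClusterConfig()
        if err != nil {
            panic(err)
        }
        if err := reconcile(context.Background(), dynamic.NewForConfigOrDie(cfg)); err != nil {
            panic(err)
        }
    }

Because the schedule check, the health check, and the version bump all live in one reconcile loop, exposing metrics for the failure modes listed in the goals is a matter of instrumenting this single code path.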

Decision

We decided to implement our own upgrade controller.

Rationale

By implementing our own upgrade controller we strike a good balance between implementation effort, customizability, and the complexity of monitoring the upgrade process.

With a controller that runs on the cluster which is being upgraded, we build a triggering mechanism which can easily react to dynamic changes on the cluster, such as skipping a scheduled upgrade if the cluster is unhealthy. In comparison, a solution where (parts of) the triggering mechanism runs outside the target cluster would need a complex feedback loop across multiple systems to be able to react to dynamic changes on the cluster.

Additionally, with a controller that runs in the cluster, we don’t need to invest time into engineering new monitoring infrastructure for external systems, such as GitLab CI pipelines which run Renovate or the Commodore catalog compilation CI pipeline, in order to be able to monitor the upgrade triggering mechanism itself.

Further, by writing our own controller, we’re able to accommodate specific upgrade requirements such as announcing target versions ahead of time or custom upgrade schedules more easily than by going with any of the other options.

On the implementation side, since we’ve got significant experience in writing Kubernetes controllers in Go, the overhead of bootstrapping and maintaining our own dedicated tool to trigger upgrades is acceptable.

Finally, having the update triggering mechanism running in the cluster matches Red Hat’s upgrade concept for OpenShift 4, which is inherently cluster-scoped; see the upstream documentation.