Cluster Monitoring

OpenShift 4 includes a cluster monitoring solution based on Prometheus. This document aims to explain how we use it in our setups.

Motivation

The documentation about configuring the monitoring stack lists quite a lot of unsupported cases:

  • Add additional ServiceMonitor objects

  • Create additional ConfigMap objects or PrometheusRule objects

  • Directly edit the resources and custom resources of the monitoring stack

  • Using resources of the stack for your own purposes

  • Stopping the Cluster Monitoring Operator from reconciling the monitoring stack

  • Create new and edit existing alert rules

  • Modify Grafana

We know from experience with OpenShift 3.11: some tweaking will be required at some point. This includes adding ServiceMonitor objects for things not (yet) covered by Cluster Monitoring, adding new rules to cover additional failure scenarios and filtering rules that are noisy or not actionable.

Design Proposal

Based on all those restrictions, one could conclude to omit the Cluster Monitoring altogether and do it on your own. This would give full control over everything. But the Cluster Monitoring is a fundamental part of an OpenShift 4 setup and will always be present. It’s required for certain things to work properly. The result of doing everything again, would be a huge waste of resources both in terms of management/engineering, compute and storage resources.

For that reason we’ll make use of Cluster Monitoring as much as possible.

To be able to change the existing alerting rules we’ll bring our own set of rules based on the cluster-monitoring-operator (which in turn is based on kube-prometheus). These rules we can change and adapt to our liking with Jsonnet. The current rules are ignored by creating silences in AlertManager.

Additionally we’ll use the user workload monitoring feature to add new scrape targets and Prometheus alert rules.

Implementation Details

A Commodore component which implements all the alert rules will be written. This component pulls the rules from all OpenShift repos which define any. We then prepend all these rule names with a SYN_ prefix and add a syn=true label to avoid naming collisions. The AlertManager silence can select rules based on this label (or rather the non-existence of it): syn="",alertname=~".+"

Risks and Mitigations

The OpenShift documentation explicitly states that adding new rules isn’t supported. In our experience (mainly on OpenShift 3.11) it’s not a problem though, as long as the names of the new rules don’t collide with existing ones.

Another risk is missing new rules which Red Hat introduces in a new release. By directly using the same rules as the cluster-monitoring-operator does we mitigate this risk to a large extent. Additionall, having an alert rule which triggers if new rules are discovered which aren’t yet par of our own setup ensures we’re notified about them.