Cluster Monitoring

OpenShift 4 includes a cluster monitoring solution based on Prometheus. This document aims to explain how we use it in our setups.

Motivation

The documentation about configuring the monitoring stack lists quite a lot of unsupported cases:

Adding additional ServiceMonitor objects
Creating additional ConfigMap or PrometheusRule objects
Directly editing the resources and custom resources of the monitoring stack
Using resources of the stack for your own purposes
Stopping the Cluster Monitoring Operator from reconciling the monitoring stack
Creating new alert rules or editing existing ones
Modifying Grafana

We know from experience with OpenShift 3.11 that some tweaking will be required at some point. This includes adding ServiceMonitor objects for things not (yet) covered by Cluster Monitoring, adding new rules to cover additional failure scenarios, and filtering out rules that are noisy or not actionable.

Design

Given all those restrictions, one could decide to bypass Cluster Monitoring altogether and build a separate monitoring stack, which would give full control over everything. But Cluster Monitoring is a fundamental part of an OpenShift 4 setup and will always be present; certain things require it to work properly. Building everything a second time would be a huge waste of engineering effort as well as compute and storage resources.

For that reason we’ll make use of Cluster Monitoring as much as possible.

To be able to change the existing alerting rules, we bring our own set of rules based on the cluster-monitoring-operator (which in turn is based on kube-prometheus). We can change and adapt these rules to our liking with Jsonnet. The original rules stay in place but are muted by silences created in AlertManager.
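As a rough illustration of what "change and adapt" can mean, the following Python sketch drops unwanted alerts and overrides individual fields of others. The real adaptation is done with Jsonnet; the names IGNORED, PATCHES and adapt_rules, as well as the alert names used, are hypothetical and only serve the example.

```python
from copy import deepcopy

# Hypothetical configuration: alert names to drop and per-alert field overrides.
IGNORED = {"NoisyAlert"}
PATCHES = {"ExampleAlert": {"for": "30m", "labels": {"severity": "warning"}}}


def adapt_rules(rules, ignored=IGNORED, patches=PATCHES):
    """Return a copy of `rules` without ignored alerts and with patches applied."""
    adapted = []
    for rule in rules:
        name = rule.get("alert")  # recording rules have no "alert" key
        if name in ignored:
            continue  # drop alerts that are noisy or not actionable
        rule = deepcopy(rule)  # never mutate the upstream definition
        for key, value in patches.get(name, {}).items():
            if isinstance(value, dict):
                # merge nested mappings such as labels and annotations
                rule[key] = {**rule.get(key, {}), **value}
            else:
                rule[key] = value
        adapted.append(rule)
    return adapted
```

Because the upstream rules are copied and never modified, the same input can be re-processed whenever the ignore list or the patches change.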

[Figure: cluster monitoring]

Implementation Details

The Commodore component component-openshift4-monitoring implements all the alert rules. It pulls the rules from all OpenShift repositories that define any. To avoid naming collisions, every rule name is prefixed with SYN_ and every rule gets the label syn=true. The AlertManager silence can then select what to mute based on this label (or rather the non-existence of it): syn="",alertname=~".+".
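The renaming and the silence matchers can be sketched in a few lines of Python. This is not the component's code (which is written in Jsonnet); the function names are hypothetical, and passing recording rules through unchanged is an assumption.

```python
def synthesize(rule):
    """Return a copy of an alerting rule with the SYN_ prefix and the syn label."""
    if "alert" not in rule:
        return dict(rule)  # assumption: recording rules are passed through unchanged
    renamed = dict(rule)
    renamed["alert"] = "SYN_" + rule["alert"]
    renamed["labels"] = {**rule.get("labels", {}), "syn": "true"}
    return renamed


def is_silenced(alert_labels):
    """Mimic the matchers syn="",alertname=~".+" for one alert's label set."""
    # An equality matcher against "" also matches an alert that lacks the label.
    has_no_syn = alert_labels.get("syn", "") == ""
    # The regex .+ matches any non-empty alert name.
    has_name = alert_labels.get("alertname", "") != ""
    return has_no_syn and has_name
```

An alert fired by an original rule carries no syn label and is therefore matched by the silence, while its SYN_ counterpart carries syn=true and is not.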

To automate this, a CronJob periodically extends the AlertManager silences.
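What such a job has to send can be sketched as follows, assuming it talks to the silences endpoint of AlertManager's v2 HTTP API (POST /api/v2/silences). The function name, the one-day duration, the author and the comment are hypothetical; the actual CronJob may work differently.

```python
from datetime import datetime, timedelta, timezone


def build_silence(now, duration=timedelta(days=1), silence_id=None):
    """Build a request body for a silence matching syn="",alertname=~".+"."""
    body = {
        "matchers": [
            {"name": "syn", "value": "", "isRegex": False},
            {"name": "alertname", "value": ".+", "isRegex": True},
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + duration).isoformat(),
        "createdBy": "silence-cronjob",  # hypothetical author name
        "comment": "Silence rules not managed by us",  # hypothetical comment
    }
    if silence_id is not None:
        # Passing the ID of an existing silence requests an update of that
        # silence instead of the creation of a new one.
        body["id"] = silence_id
    return body


# Example: a body that extends the (hypothetical) silence "abc" by one day from now.
payload = build_silence(datetime.now(timezone.utc), silence_id="abc")
```

Running this periodically with an interval shorter than the duration keeps the silence from expiring between runs.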

Silenced alerts are hidden in the OpenShift console by default, but users can easily display them again.

Risks and Mitigations

The OpenShift documentation explicitly states that adding new rules isn’t supported. In our experience (mainly on OpenShift 3.11) it’s not a problem though, as long as the names of the new rules don’t collide with existing ones.

Another risk is missing new rules that Red Hat introduces in a new release. By directly using the same rules as the cluster-monitoring-operator, we mitigate this risk to a large extent. Additionally, an alert rule that triggers when newly discovered rules aren’t yet part of our own setup ensures we’re notified about them.
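How the alert rule detects new rules isn't described here. One conceivable check, shown purely as an illustration with a hypothetical function name, compares the upstream alert names with the ones that exist in prefixed form:

```python
def missing_rules(upstream_names, own_names, prefix="SYN_"):
    """Return upstream alert names that have no prefixed counterpart in our set."""
    covered = {name[len(prefix):] for name in own_names if name.startswith(prefix)}
    return sorted(set(upstream_names) - covered)
```

A check like this would also report rules that were dropped on purpose, so a real implementation would need to take such an ignore list into account.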