Monitoring (OpenShift 3)
This document describes how OpenShift 3 is monitored, and how this configuration is managed.
This document only describes the new Prometheus-based monitoring setup.
All Icinga2-based checks will eventually go away. See profile_openshift3/manifests/monitoring for checks defined in Icinga2.
OpenShift 3 comes with a bundled Prometheus Cluster Monitoring operator, which installs and manages Prometheus, Alertmanager and Grafana.
The monitoring stack is deployed into the openshift-monitoring namespace.
For a very rough overview of the stack, here’s what each component does:
Prometheus
- Scrapes metrics/monitoring data from various endpoints, defined by ServiceMonitor objects in the openshift-monitoring namespace
- Sends out alerts based on PrometheusRule objects in the openshift-monitoring namespace. Alerts are sent to Alertmanager.

Alertmanager
- Receives alerts from Prometheus
- Groups and deduplicates them based on rules
- Forwards them according to the routes in the Alertmanager config

Grafana
- Queries Prometheus and draws pretty graphs
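To illustrate the kind of objects Prometheus reacts to, a minimal ServiceMonitor might look like the following sketch. The application name, label and port name are made up for the example:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app            # hypothetical name
  namespace: openshift-monitoring
spec:
  selector:
    matchLabels:
      app: example-app         # scrape Services carrying this label
  endpoints:
    - port: metrics            # name of the Service port exposing /metrics
      interval: 30s
```

Prometheus discovers all Services matching the selector and scrapes the named port on each of their endpoints.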
Most base configuration can be done by amending the inventory and then applying the changes by running the openshift-monitoring config playbook.
This will regenerate the secrets containing the configurations. All Prometheus components come with a config-reloader sidecar container, which makes sure the new configuration gets loaded without having to restart the component. Check the main log of the application for any errors after changing the configuration!
After initial setup, there isn’t really anything to configure for Prometheus, since most options revolve around configuring storage.
The Alertmanager configuration must be amended in order to configure alert routing and forwarding.
For an overview of configuring Alertmanager, see the Prometheus Alertmanager docs. However, be aware that those document the latest version, while OpenShift 3 ships with v0.15.2! If you're unsure whether a certain configuration option is supported, check the relevant source code.
Alertmanager’s global config contains mostly default values for various receiver configurations, for example default SMTP server & credentials or URLs for various notification service APIs.
It can be extended by setting mungg_cluster_monitoring_alertmanager_global_config (hash/map) in the inventory. Values will be merged with our default configuration.
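As an illustration, such an inventory entry might look like the following sketch. The keys follow the upstream Alertmanager global configuration schema; the hostnames, addresses and webhook URL are made up:

```yaml
mungg_cluster_monitoring_alertmanager_global_config:
  # SMTP defaults used by all email receivers (example values)
  smtp_smarthost: "smtp.example.com:587"
  smtp_from: "alertmanager@example.com"
  smtp_require_tls: true
  # Default Slack webhook used by slack_configs (placeholder URL)
  slack_api_url: "https://hooks.slack.com/services/T000/B000/XXXXXXXX"
```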
To apply those configurations, run the openshift-monitoring playbook as described above.
To send out custom notifications, two parts are required:
- a receiver defining how to send out the notifications
- a route sending an alert (based on its labels) to a configured receiver
Receivers can be configured using mungg_cluster_monitoring_alertmanager_extra_receivers (array). Make sure each receiver has a unique name; devnull, among others, is already taken in our default config.
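A hypothetical receiver entry might look like this. The name and address are made up; email_configs is one of the standard Alertmanager receiver types:

```yaml
mungg_cluster_monitoring_alertmanager_extra_receivers:
  # The name must be unique across all receivers ("devnull" is taken)
  - name: team-example
    email_configs:
      - to: "oncall@example.com"
        send_resolved: true   # also notify when the alert resolves
```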
Routes can be configured using mungg_cluster_monitoring_alertmanager_extra_routes (array). They will be prepended to the default routes. If not defined explicitly, profile_openshift3 will set the continue attribute of each extra route to true, so that alerts also keep matching against the default routes.
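A hypothetical extra route might look like this. The alert name and receiver name are made up for the example:

```yaml
mungg_cluster_monitoring_alertmanager_extra_routes:
  - match:
      alertname: ExampleAlert    # label matcher; hypothetical alert name
    receiver: team-example       # must match the name of a configured receiver
    continue: true               # keep evaluating the default routes afterwards
```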
mungg-cluster-monitoring defines the baseline for how a VSHN-managed OCP3 cluster is monitored. All default ServiceMonitor and PrometheusRule objects are managed there.
Changes can be rolled out using the postconfig playbook:
ansible-playbook /usr/share/mungg/playbooks/postconfig.yml --tags monitoring
Resource capacity of a cluster is monitored by Prometheus rules defined in mungg-cluster-monitoring.
By default, cluster resource checks are disabled; they can be enabled by setting the corresponding key in Hiera data.
There are currently three types of resources being checked.
Those resources are monitored by calculating a redundancy value to make sure a cluster can always tolerate a node failure. The redundancy is calculated per resource type:
("allocatable resources in cluster" - "used resources in cluster") / "allocatable resources of a single node"
The calculated value is the number of nodes that can fail until the cluster runs out of a specific resource.
Note: A value below 1 means that a node failure can't be tolerated.
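As a sketch only: expressed as a Prometheus recording rule, the memory variant of this calculation could look roughly like the following. The metric names are assumptions based on kube-state-metrics of that era, and whether "used" means requests or actual usage depends on the actual rules, which live in mungg-cluster-monitoring:

```yaml
groups:
  - name: example-capacity
    rules:
      # ("allocatable in cluster" - "used in cluster") / "allocatable of a single node"
      - record: cluster:memory_redundancy:node_count
        expr: |
          (
            sum(kube_node_status_allocatable_memory_bytes)
            - sum(kube_pod_container_resource_requests_memory_bytes)
          )
          / max(kube_node_status_allocatable_memory_bytes)
```

As a worked example: with four nodes of 16 GiB allocatable memory each and 40 GiB used cluster-wide, the redundancy is (64 - 40) / 16 = 1.5, i.e. one node can fail before memory runs out.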