Investigating and handling alerts

Looking at an alert can be overwhelming. This is especially true when you aren’t familiar with how the alert ended up in your inbox. This how-to gives you some pointers on where to start.

Investigate an alert

  1. Read the message

    Reading the message should give you a concise statement of what’s wrong. Most of the time, that message is enough to grasp what action needs to be taken.

    There is no standard field defined for alert messages. OpenShift 4 alerts build on top of kube-prometheus, where annotations.message is used, but annotations.summary and annotations.description are common as well. The example rule after this list shows where these annotations live.

  2. Look out for labels and annotations

    Alerts carry a set of labels and annotations which contain a lot of valuable information. Many labels refer to Kubernetes resources, such as the namespace, service, and pod that triggered the alert. Use that information as a starting point to inspect the affected resource on the cluster.

  3. Follow the source URL

    Each alert contains a source link, displayed as "Source: …" in OpsGenie. Following that link opens the alert expression in the Prometheus instance that generated the alert. Looking at the query can give further clues about what’s going on. The names of the time series are of special interest here.

    The source link only works when the cluster is publicly accessible. If the link doesn’t work, check the cluster-specific documentation; there you should find instructions on how to gain access.

  4. Understand the source of a time series

    Sometimes it’s not obvious what a time series is about. In those cases, it helps to understand where it comes from. Each time series has a job label, which refers to the scrape job that brought it into the system. Use that name to identify the source, then consult the source’s documentation for details about the time series.

  5. Inspect targets

    Should the job name not be enough to identify the source of a time series, have a look at the targets for more details. Navigate to Status > Targets and search for the job name. This gives you details of the scraped Kubernetes endpoint along with the associated service and pod.
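
The sketch below ties the steps above together. It shows a hypothetical alert rule, not one shipped by kube-prometheus or any Commodore component: the rule name, expression, and threshold are made up for illustration, but the structure (annotations, labels, expression, job label) is what you’ll encounter when investigating a real alert.

```yaml
# Hypothetical PrometheusRule, for illustration only.
# Real rules are shipped by kube-prometheus or by Commodore components.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-rules
  namespace: example-monitoring
spec:
  groups:
    - name: example.rules
      rules:
        - alert: ExamplePodRestartingTooOften
          # Step 3: this expression is what the "Source" link opens in Prometheus.
          # The time series it selects carry a job label (step 4), here pointing
          # at the kube-state-metrics scrape job.
          expr: increase(kube_pod_container_status_restarts_total{job="kube-state-metrics"}[1h]) > 5
          for: 10m
          labels:
            severity: warning
          annotations:
            # Step 1: the human-readable message. Depending on the rule's origin
            # it may live in message, summary or description.
            summary: Pod is restarting too often
            description: Pod {{ $labels.namespace }}/{{ $labels.pod }} restarted more than 5 times within the last hour.
```

The namespace and pod labels mentioned in step 2 aren’t declared in the rule itself; they’re copied from the time series matched by the expression, which is why they point directly at the affected resource.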

Handle an alert

An alert can be addressed in several ways:

Manual intervention

Sometimes, a manual intervention is the only thing required. This is fine as long as the issue isn’t expected to resurface. Fix the issue using your Kubernetes API client of choice (for example kubectl, oc, or the OpenShift Console).

Change of configuration

The alert is expected to show up again but can be prevented by tweaking the configuration. Granting more resources, fixing a typo, or changing a setting are all potential solutions.

Be cautious about throwing resources at a problem. Sometimes reducing the system load is the better way to go.

Change the configuration within the Syn configuration hierarchy. Do so on the Cluster, Tenant or global level as appropriate.
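
As a sketch of what such a change could look like: assuming the alert points at a component whose memory limit is too low, a Cluster-level file in the configuration hierarchy could override the corresponding parameter. The component name and parameter layout below are hypothetical; check the respective Commodore component’s documentation for the parameters it actually exposes.

```yaml
# Hypothetical Cluster-level override in the Syn configuration hierarchy.
# `example_component` and its parameter structure are made up for illustration;
# the real keys are documented by the component in question.
parameters:
  example_component:
    resources:
      limits:
        memory: 512Mi
```

Placing the same override in a Tenant-level or global file widens its scope accordingly.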

Change code

Code can behave badly and trigger an alert. Identify the circumstances that led to the failure and file a bug report. Pull requests with a fix are also welcome.

Another trigger for a code change is a missing feature. A likely case would be a Commodore component not exposing configuration values that could be used to address the alert. File a feature request, or better, submit a pull request to the component.

Change of alert rule

If the system is working as expected, then altering the alert rule might be the answer. This most likely results in a code change.

Allowing alert rules to be altered via the configuration hierarchy was discussed. The decision was not to allow this for now, since it would make overriding rules too easy. Resolving the issue at the source (kube-prometheus) is preferred. Where that isn’t possible, making the change in Jsonnet allows working with a scalpel instead of a sledgehammer: in Jsonnet we can use variables and otherwise reuse things, which isn’t possible in YAML and would result in a lot of duplication.

Dropping the alert

Some alerts aren’t actionable, and there is no way to improve them by altering the rule. In such cases, it’s OK to drop the alert. You can do so within the configuration hierarchy; see the documentation for removing alert rules.

This can also be a temporary mitigation while a more lasting solution gets implemented.
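
A hypothetical sketch of what dropping an alert could look like in the configuration hierarchy, assuming the monitoring component exposes a list of alert names to ignore. The `example_monitoring.ignoreAlerts` key is an assumption made up for illustration; the actual parameter is described in the component’s documentation for removing alert rules.

```yaml
# Hypothetical parameter for dropping a noisy, non-actionable alert.
# `ignoreAlerts` is an assumed key, not a documented one; consult the
# component's documentation for removing alert rules for the real name.
parameters:
  example_monitoring:
    ignoreAlerts:
      - ExampleNoisyAlert
```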