Investigating and handling alerts
Looking at an alert can be overwhelming. This is especially true when you aren't familiar with how the alert ended up in your inbox. This how-to gives you some guidance on where to start.
Investigate an alert
Read the message
The message usually gives you a concise statement of what's wrong. Most of the time, that should be enough to grasp what action needs to be taken.
There is no standard field defined for alert messages. OpenShift 4 alerts build on top of kube-prometheus; there, annotations.message is used, but annotations.summary and annotations.description are also used.
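If you want to see the full set of annotations on a firing alert, rather than only the text shown in OpsGenie, one option is to query the Prometheus alerts API directly. The following sketch assumes you can reach Prometheus on localhost:9090 (for example through a port-forward) and that jq is available:
  # List currently firing alerts together with their labels and
  # annotations (message, summary, description)
  curl -s http://localhost:9090/api/v1/alerts \
    | jq '.data.alerts[] | {labels: .labels, annotations: .annotations}'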
Look out for labels and annotations
Alerts can have a set of labels and annotations attached, and these contain a lot of valuable information. Many labels refer to Kubernetes resources, such as the namespace, service, and pod that triggered the alert. Use that information as a starting point to check the alerting resource on the cluster.
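For example, when an alert carries namespace and pod labels, a few read-only commands give you a good first look at the affected resource. The names below are placeholders to be replaced with the values from the alert's labels:
  # Inspect the resource referenced by the alert's labels (placeholder names)
  kubectl -n NAMESPACE get pod POD -o wide
  kubectl -n NAMESPACE describe pod POD
  kubectl -n NAMESPACE get events --sort-by=.lastTimestamp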
Follow the source URL
Each alert contains a source link, which is displayed as "Source: …" in OpsGenie. Following that link opens the alert expression in the Prometheus instance that generated the alert. Looking at the query can give further clues about what's going on. The names of the time series are of special interest here.
The source link only works when clusters are publicly accessible. If the link doesn't work, check the cluster-specific documentation. There you should be able to find instructions on how to gain access.
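One common way to gain access on clusters that aren't publicly reachable is a port-forward to the in-cluster Prometheus; the namespace and service name below are placeholders, as the actual values depend on the cluster and its monitoring setup:
  # Forward the in-cluster Prometheus to localhost, then open http://localhost:9090
  # Replace the namespace and service name with the ones used on your cluster
  kubectl -n MONITORING_NAMESPACE port-forward svc/PROMETHEUS_SERVICE 9090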
Understand the source of a time series
Sometimes it's not obvious what a time series is all about. In those cases, it helps to understand where it's coming from. Each time series has a job label, which refers to the scrape job that brought the time series into the system. Use that name to identify the source, and have a look at the source's documentation to find details about the time series.
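If it's easier than clicking through the UI, the same question can be asked via the Prometheus HTTP API; the metric name below is a placeholder, and localhost:9090 again assumes local access to Prometheus:
  # List the scrape jobs that produce a given metric (placeholder metric name)
  curl -s 'http://localhost:9090/api/v1/series?match[]=MY_METRIC' \
    | jq '[.data[].job] | unique'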
Inspect targets
Should the job name not be enough to identify the source of a time series, you can have a look at the targets for more details. Navigate to Status > Targets in the Prometheus UI and search for the job name. This will give you details of the scraped Kubernetes endpoint along with the associated service and pod.
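The target information is also available from the Prometheus HTTP API, which can be handy when working from a terminal; the job name is a placeholder:
  # Show the scraped endpoints and their health for a given job (placeholder job name)
  curl -s http://localhost:9090/api/v1/targets \
    | jq '.data.activeTargets[] | select(.labels.job == "MY_JOB") | {scrapeUrl, health, lastError}'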
Handle an alert
An alert can be addressed in several ways:
Manual intervention
Sometimes, a manual intervention is the only thing required.
This is fine as long as the issue isn't expected to resurface.
Fix the issue using your Kubernetes API client of choice (for example kubectl, oc, or the OpenShift Console).
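What that intervention looks like depends entirely on the alert. As a rough sketch, a misbehaving workload might be handled like this (namespace and resource names are placeholders):
  # Example manual intervention: restart a misbehaving deployment
  # and remove a pod that's stuck in a failed state (placeholder names)
  oc -n NAMESPACE rollout restart deployment/MY_APP
  oc -n NAMESPACE delete pod STUCK_POD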
Change of configuration
The alert is expected to show up again but can be prevented by tweaking the configuration. Granting more resources, fixing a typo, or changing a setting are all potential solutions.
Be cautious with throwing resources at a problem. Sometimes reducing system load is the better way to go.
Change the configuration within the Syn configuration hierarchy. Do so on the Cluster, Tenant or global level as appropriate.
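As a purely illustrative sketch, such a change typically means editing a file in the tenant or cluster configuration repository, committing, and pushing; the file name and parameter keys below are invented, as the real ones depend on the component in question:
  # Hypothetical sketch: raise a memory request in the cluster-level configuration,
  # then commit and push so the change gets compiled and rolled out.
  # File name and parameter keys are invented examples.
  $EDITOR MY_CLUSTER.yml   # e.g. set parameters.my_component.resources.requests.memory: 512Mi
  git commit -am "Raise memory request to address recurring alert"
  git push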
Change code
Code can behave badly and trigger an alert. Identify the circumstances that led to the failure and file a bug report. Pull requests with a fix are also welcome.
Another trigger for a code change is a missing feature. A likely case would be a Commodore component not exposing configuration values that could be used to address the alert. File a feature request, or better, submit a pull request to the component.
Change of alert rule
If the system is working as expected, then altering the alert rule might be the answer. This most likely results in a code change.
It was discussed whether to allow altering alert rules via the configuration hierarchy. The decision was not to allow this for now, since it would make this too easy; resolving the issue at the source is preferred.
Dropping the alert
Some alerts aren’t actionable and there is no way to improve it by altering the rule. In such cases, it’s OK to drop the alert. You can do so within the configuration hierarchy. See the documentation for removing alert rules.
This can also be a temporary mitigation while a more lasting solution gets implemented.