Investigating and handling alerts
Looking at an alert can be overwhelming. This is specially true when not familiar with how the alert ended up in your inbox. This how to provides you some help on where to start.
Read the message
Reading the message usually should give you a concise statement on what’s wrong. Most of the time, that message should be enough to grasp what action needs to be taken.
There is no standard field defined for alert messages. OpenShift 4 alerts build on top of kube-prometheus. There
annotations.messageis used, but
annotations.descriptionsare also used.
Look out for labels and annotations
Alerts can have a set of labels and annotations attached. They contain all sorts of valuable information. A lot of them refer to Kubernetes resources. Examples are namespace, service, pod and many more. Use them to identify the resources through the Kubernetes API and have a look at their states.
Follow the source URL
Each alert contains a source link. Following that link will open the alert expression within the originating Prometheus. Looking at the query can give further clues on what’s going on. The name of the time series are of special interest here.
The source link only works when clusters are publicly accessible. Check out the cluster specific documentation if the link doesn’t work. There you should be able to find instructions on how to gain access.
Understand the source of a time series
Sometimes looking at a time series, it’s not obvious what it’s all about. In those cases, it helps to understand where it’s coming from. Each time series has a
joblabel. That label refers to the scrape job that brought that time series into the system. Use that name to identify the source. Have a look at the sources documentation to find details about the time series.
Should the job name not be enough to identify the source of a time series, you can have a look at the targets to find more details. Navigate to Status > Target and search for the job name. This will give you details of the scraped Kubernetes endpoint along with the associated service and pod.
An alert can be addressed in several ways:
Sometimes, a manual intervention is the only thing required.
This is fine as long the issue isn’t expected to resurface again.
Fix the issue using your Kubernetes API client of choice (for example
oc or the OpenShift Console).
The system is expected to show up again but can be prevented by tweaking the configuration. Giving more resources, fixing a typo, changing a setting are all potential solutions.
Be cautious with throwing resources at a problem. Sometimes reducing system load is the better way to go.
Change the configuration within the Syn configuration hierarchy. Do so on the Cluster, Tenant or global level as appropriate.
Code can behave badly and trigger an alert. Identify the circumstances that led to the failure and file a bug report. Pull requests with a fix are also welcome.
Other trigger for a change of code are missing features. A likely case for this would be a Commodore component not allowing to configure certain aspects. File a feature request or better, submit a pull request.
If the system is working as expected then altering the alert rule might be the answer. This most likely results in a change of code.
It was discussed to allow altering alert rules via the configuration hierarchy.
The decision was to not allow this for now since it would make this too easy.
Resolving the issue at the source (
Some alerts aren’t actionable and there is no way to improve it by altering the rule. In such cases, it’s OK to drop the alert. You can do so within the configuration hierarchy. See Remove alert rules.
This can also be a temporary mitigation while a more lasting solution gets implemented.