Multi-team alert routing for base alerts

Problem

Currently, all base alerts configured through component-openshift4-monitoring are routed to the same team. For some alerts (such as KubeJobFailed), this isn’t always correct, because the affected job might be managed by a different team. It would be better to route those alerts dynamically based on some criterion, such as a namespace label.

Goals

  • Selectively route base alerts to the appropriate teams based on namespace ownership

  • Define how "namespace ownership" can be either configured or inferred

Non-Goals

  • Define alert routing for custom alerts

  • Handle alerts in namespaces that aren’t managed through Project Syn

Proposals

Add namespace labels to relevant alerts

One option is to modify the alert queries to join the kube_namespace_labels metric to the result, so that the namespace’s labels can be used in the (statically defined) routing rules. This implies that namespace ownership is configured by setting a specific label on the namespace in question. No suitable label currently exists, so all components would need to be updated to provide one, for example based on component ownership.

However, not all alert rules have the namespace label, since not all are namespace scoped. Blindly joining the namespace label metric would therefore break some alerts. This problem can be solved by creating a new configuration option in component-openshift4-monitoring, in which a list of alerts and a list of labels can be specified. The queries of all listed alert rules would then be modified to join in the listed labels.
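
The selective join described above could look roughly like the following sketch. It is written in Python for illustration only and is not the component's actual implementation. The function names, the simplified rule structure and the label name label_example_com_team are assumptions. The join itself relies on kube_namespace_labels having the value 1, so multiplying by it leaves the alert value unchanged.

```python
def join_namespace_labels(expr: str, labels: list[str]) -> str:
    """Wrap a PromQL expression so that the given labels are copied from
    kube_namespace_labels onto every result series with the same namespace."""
    label_list = ", ".join(labels)
    # Parenthesize the original expression so operator precedence is preserved.
    return f"({expr}) * on (namespace) group_left ({label_list}) kube_namespace_labels"


def patch_rules(rules: list[dict], alerts: set[str], labels: list[str]) -> list[dict]:
    """Return a copy of `rules` in which only the listed alerts are joined.

    Rules that are not listed (for example cluster-scoped alerts without a
    namespace label) are passed through unchanged.
    """
    patched = []
    for rule in rules:
        if rule.get("alert") in alerts:
            # Build a new dict instead of mutating the input rule.
            rule = {**rule, "expr": join_namespace_labels(rule["expr"], labels)}
        patched.append(rule)
    return patched


# Hypothetical, simplified rules; the real expressions are more specific.
rules = [
    {"alert": "KubeJobFailed", "expr": "kube_job_failed > 0"},
    {"alert": "Watchdog", "expr": "vector(1)"},
]
patched = patch_rules(rules, {"KubeJobFailed"}, ["label_example_com_team"])
```

Only KubeJobFailed is rewritten here. Watchdog has no namespace label and is left alone, which is exactly the case the explicit alert list is meant to protect.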

component-openshift4-monitoring already modifies the base OpenShift alert rules by creating copies of them. Modifying the alert rules in this way would be straightforward. The challenging part is figuring out which rules to modify, which would be specified through configuration. The alert routing rules will still be configured in the same way as they presently are.

If the modified alert queries are written in such a way that the ownership team label is renamed to syn_team, the existing alert routing rules should work without modification.
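
One way to perform this rename is PromQL's label_replace function. The sketch below (Python, illustration only) builds such an expression. The helper name and the source label label_example_com_team are assumptions. Note that label_replace copies the value into the target label and keeps the source label on the series.

```python
def rename_team_label(expr: str, source: str, target: str = "syn_team") -> str:
    """Wrap a PromQL expression so the value of `source` is written to `target`.

    The regex "(.+)" only matches non-empty values, so series without the
    source label keep whatever `target` value they already had.
    """
    return f'label_replace({expr}, "{target}", "$1", "{source}", "(.+)")'


# Hypothetical source label name; "up" stands in for a joined alert query.
wrapped = rename_team_label("up", "label_example_com_team")
```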

Generate routing rules for relevant namespaces based on cluster hierarchy data

With this option, namespace ownership is inferred from component ownership: Components with a dedicated team for operational responsibility should be configured with a syn.teams label, and components generally have a namespace label already. Together, these allow a mapping between namespaces and teams to be derived from the cluster hierarchy.

component-openshift4-monitoring would then generate routing rules for each team, which match any alerts where the namespace label is present and matches one of the team’s namespaces.
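
As a rough illustration of this generation step, the following Python sketch derives a team-to-namespaces mapping and turns it into Alertmanager-style routes. The input structure (one dict per component with "namespace" and "team" keys) is a simplifying assumption and does not reflect how the cluster hierarchy actually stores this data. The function names are hypothetical, and the receiver is assumed to be named after the team.

```python
from collections import defaultdict


def namespaces_by_team(components: dict[str, dict]) -> dict[str, list[str]]:
    """Group component namespaces by owning team.

    Components without a team or without a namespace are skipped, so their
    alerts would fall through to the existing default routing.
    """
    teams = defaultdict(set)
    for params in components.values():
        team = params.get("team")
        namespace = params.get("namespace")
        if team and namespace:
            teams[team].add(namespace)
    # Sort for deterministic output.
    return {team: sorted(nss) for team, nss in sorted(teams.items())}


def team_routes(teams: dict[str, list[str]]) -> list[dict]:
    """Build one route per team that matches alerts from the team's namespaces."""
    routes = []
    for team, namespaces in teams.items():
        # Namespace names are DNS labels (lowercase alphanumerics and "-"),
        # so they contain no regex metacharacters and can be joined directly.
        pattern = "|".join(namespaces)
        routes.append({
            "receiver": team,  # assumption: receiver named after the team
            "matchers": [f'namespace =~ "{pattern}"'],
        })
    return routes


# Hypothetical component data.
components = {
    "cert-manager": {"namespace": "syn-cert-manager", "team": "alpha"},
    "backup": {"namespace": "syn-backup", "team": "alpha"},
    "logging": {"namespace": "openshift-logging", "team": "beta"},
    "misc": {"namespace": "syn-misc"},  # no team configured
}
teams = namespaces_by_team(components)
routes = team_routes(teams)
```

Alerts without a namespace label never match these routes, so no list of namespace-scoped alerts has to be maintained.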

With this approach, the routing rules would be generated by component-openshift4-monitoring. Currently, they are configured statically (though still through the component) by specifying parameters.openshift4_monitoring.alertManagerConfig. In our setup, this configuration is provided as a template in commodore-defaults, and is included in the cluster hierarchy where necessary.

component-openshift4-monitoring should expose a way to configure which team’s alerts should go to which alert receiver, so it can generate the rules accordingly. There should be a default receiver for teams that have no explicit receiver configured. The new configuration can continue to be provided as a global template.
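
The receiver lookup with a default could be as simple as the following sketch (Python, illustration only). The function name, the shape of the team-to-receiver mapping and the receiver names are assumptions, not existing component parameters.

```python
def receiver_for(team: str, team_receivers: dict[str, str], default_receiver: str) -> str:
    """Return the receiver configured for `team`, or the default receiver
    if the team has no explicit entry."""
    return team_receivers.get(team, default_receiver)


# Hypothetical configuration: only team "alpha" has a dedicated receiver.
team_receivers = {"alpha": "alpha-oncall"}
resolved = {
    team: receiver_for(team, team_receivers, "default-oncall")
    for team in ["alpha", "beta"]
}
```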

If the complete new configuration can be provided under parameters.openshift4_monitoring.alertManagerConfig, tenant repos won’t require any changes. Only commodore-defaults will need to be updated with the new configuration.

Decision

component-openshift4-monitoring will generate dynamic alert routing rules based on cluster hierarchy data.

Rationale

With this approach, the monitoring component can leverage the existing source of truth for component (instance) ownership to route the base alerts. The other approach would necessitate duplicating this information in the namespace labels, or updating all of the components to expose this information in a namespace label.

In addition, the alternative approach requires maintaining a list of base alerts which have the namespace label. The chosen approach doesn’t require such a list, so it doesn’t carry a risk of forgetting certain alerts.