Handling of Capacity Alerts for Incident Prevention

Problem

All Databases by VSHN have limited capacity, such as storage and memory, or a limit on concurrent connections. Depending on the database type, running out of capacity can lead to downtime or even to data loss. We need a way to predict capacity issues so that we can react and prevent an incident.

Writing these predictive capacity alerts is generally a solved problem: we can write Prometheus queries that predict resource usage and alert based on the prediction. The main open question is how we want to handle these alerts.
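
For illustration, such a predictive rule could look roughly like the following sketch. The metric and the four-day prediction window are assumptions chosen for illustration; the actual rules will depend on the database type and the capacity dimension being watched.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: capacity-alerts
spec:
  groups:
  - name: capacity
    rules:
    - alert: StorageFillingUp
      # Extrapolate the disk usage trend of the last six hours and fire if
      # the volume is predicted to be full within four days.
      expr: predict_linear(kubelet_volume_stats_available_bytes[6h], 4 * 24 * 3600) < 0
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: Database volume is predicted to run full within four days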

From experience with similar systems, we know that in nearly all cases capacity issues can only be resolved by the service users themselves, either by freeing up resources or by scaling up the instance. So first and foremost we need a process to inform the service user about potential capacity issues.

Proposals

Alert Responsible Ops

The simplest approach would be to alert the engineer who is on responsible ops duty. This is how we are used to handling alerts and how we handled them in previous systems.

In practice this means we will deploy fixed Prometheus alert rules and Alertmanager configurations in every database namespace that forward alerts to Opsgenie.
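
Such a configuration could look roughly like the following sketch of an AlertmanagerConfig deployed into the database namespace; the secret name, key, and priority are placeholder assumptions.

apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: opsgenie-forwarding
spec:
  route:
    groupBy: [alertname]
    receiver: opsgenie
  receivers:
  - name: opsgenie
    opsgenieConfigs:
    - apiKey:
        # References the Opsgenie credentials distributed into this namespace
        name: opsgenie-credentials
        key: api-key
      priority: P3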

The only open question is how we securely distribute the necessary Opsgenie secrets to the database namespaces. This can be solved with espejo or through a custom controller, but we need to be aware that we increase the potential attack surface by distributing Opsgenie credentials to many namespaces.
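
As a sketch, distributing the secret with espejo could look roughly as follows, assuming the database namespaces carry a common label (the label and the secret contents are made up here). Note that with this approach the credential is embedded in the SyncConfig itself, which is part of the attack-surface concern mentioned above.

apiVersion: sync.appuio.ch/v1alpha1
kind: SyncConfig
metadata:
  name: opsgenie-credentials
spec:
  namespaceSelector:
    labelSelector:
      matchLabels:
        appcat.vshn.io/database-namespace: "true"   # assumed label on database namespaces
  syncItems:
  - apiVersion: v1
    kind: Secret
    metadata:
      name: opsgenie-credentials
    stringData:
      api-key: REDACTED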

Expose User-Workload Monitoring

APPUiO Cloud and APPUiO Managed OpenShift come with a versatile user-workload monitoring system that allows a user to specify their own Alertmanager configuration at the namespace level.

We could expose the Alertmanager configuration of this system directly in the database claim; the composition would then create an AlertmanagerConfig resource in the database namespace. This would enable service users to decide for themselves how they want to be alerted about capacity issues.

With such a system, the database claim could look similar to the following, which would configure all alerts to be sent to the service user’s Slack instance.

apiVersion: vshn.appcat.vshn.io/v1
kind: VSHNPostgreSQL
metadata:
  name: pgsql-app1-prod
  namespace: prod-app
spec:
  parameters:
    alerting:
      receivers:
      - name: slack
        slackConfigs:
        - apiURL:
            key: url
            name: slack-alert-webhook-secret
          channel: '#alerts'
      route:
        groupBy: [alertname]
        receiver: slack
  writeConnectionSecretToRef:
    name: postgres-creds
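
The alerting section of the claim maps almost one-to-one onto the AlertmanagerConfig the composition would create. A sketch of the generated resource could look like this; the resource name and instance namespace are assumptions, and the referenced webhook secret would also need to be made available in that namespace.

apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: pgsql-app1-prod-alerting              # assumed naming scheme
  namespace: vshn-postgresql-pgsql-app1-prod  # assumed instance namespace
spec:
  receivers:
  - name: slack
    slackConfigs:
    - apiURL:
        key: url
        name: slack-alert-webhook-secret
      channel: '#alerts'
  route:
    groupBy: [alertname]
    receiver: slack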

Send E-Mail Alerts through Alertmanager and mailgun

VSHN has the option to send e-mails automatically through mailgun. This can be leveraged to inform service users through e-mail without a lot of setup overhead.

In practice this can be implemented very similarly to forwarding alerts to Opsgenie: the composition will create an AlertmanagerConfig resource in the database namespace that configures Alertmanager to send e-mails to the provided address.

With such a system, the database claim could look similar to the following, which would send all alerts as e-mails to oncall@example.com.

apiVersion: vshn.appcat.vshn.io/v1
kind: VSHNPostgreSQL
metadata:
  name: pgsql-app1-prod
  namespace: prod-app
spec:
  parameters:
    alerting:
      email: oncall@example.com
  writeConnectionSecretToRef:
    name: postgres-creds
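
The composition could translate this into an AlertmanagerConfig along the following lines. This is a sketch: the resource name, sender address, and secret name are assumptions, with mailgun’s SMTP endpoint used as the smarthost.

apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: pgsql-app1-prod-mail   # assumed naming scheme
spec:
  route:
    groupBy: [alertname]
    receiver: email
  receivers:
  - name: email
    emailConfigs:
    - to: oncall@example.com
      from: alerts@example.com           # assumed sender address
      smarthost: smtp.mailgun.org:587
      authUsername: alerts@example.com   # assumed; mailgun authenticates with the sender address
      authPassword:
        name: mailgun-credentials        # the replicated mailgun token discussed below
        key: password
      requireTLS: true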

There are a few things we’ll need to consider for such an implementation:

  • This setup forces us to replicate a mailgun token to every database namespace, which increases the potential attack surface.

  • We might need to verify the e-mail address before forwarding alerts to it. This could be done through a custom controller.

Decision

We implement both exposing the user-workload monitoring and sending automated e-mail alerts through mailgun. E-mail-based alerting should be strongly encouraged, and might be mandatory in some cases, to ensure that we can prove that the service user was informed of the issue.

Rationale

Manual intervention for capacity alerts generates a lot of unnecessary toil, as in most cases the engineer can’t do anything but inform the service user. That’s why we believe the task should be automated and no engineer should be alerted for potential capacity issues.

Exposing the full capabilities of the user-workload monitoring system is fairly simple to implement, while at the same time being very versatile. It allows our service users to send alerts exactly where they are needed.

On the other hand, automatic e-mail alerts through mailgun offer a much better UX for simple deployments. With this solution, service users don’t need to handle any mail gateway or care about Alertmanager configuration. They simply provide an e-mail address, and they will be informed about capacity issues.

A further advantage of sending alerts as e-mails through mailgun is that we have a clear paper trail showing that the alert was sent to and received by the service user.

Exposing the user-workload monitoring system and the automated e-mail solution can co-exist and will complement each other. The user-workload based approach is quickly implemented and gives power users a lot of control, while the automated e-mail system will significantly improve the UX for simple deployments once it is ready.