SLI & SLO Reporting Service

Problem Statement

VSHN has a defined set of customer-facing SLOs which are monitored automatically on VSHN-managed OpenShift. However, the SLI numbers reported by monitoring don’t directly correspond to numbers that can be compared against a contractual SLA.

VSHN isn’t contractually accountable for service interruptions that are a direct result of 3rd party service interruptions outside of VSHN’s control. Even so, a 3rd party service interruption can be picked up by VSHN’s SLI monitoring, since it can result in a real service interruption.

For customer-facing SLI and SLO reporting, we need a way to exclude any service interruptions that are caused by a 3rd party. This exclusion should be automatic to reduce the risk of human error during calculation.

High Level Goals

  • We’ve got a well-defined source of truth for adjusted SLIs and SLOs.

  • SLOs and remaining error budget can be viewed at any time without doubt about the correctness of the numbers.

  • 3rd party outages are detected automatically where possible, but can also be entered manually.

Non-Goals

  • Pretty user interface

  • Automatic reporting of SLO burndown to customers

Implementation

We write a new service that stores external downtime windows and calculates the adjusted SLIs. The service provides a simple REST API to manage 3rd party downtime windows. Downtime windows are stored in a simple database.

The service queries our existing monitoring (Mimir) for the SLO monitoring data, and adjusts the results to exclude any service interruptions that fall into a 3rd party downtime window. A simple query API can be used to query for the updated SLOs.

The service runs on VSHN’s centralized infrastructure, same as VSHN’s central monitoring. Thanks to this, the service doesn’t need to be exposed to the internet.

Service architecture

The SLO reporting service consists of three architectural components: the downtime window API, the data querying API, and the RSS transformer. While all components are largely independent and could easily be split into separate services, doing so brings no real benefit, since we don’t need any of the advantages of a microservice-style architecture.

However, separation of concerns should be maintained within the codebase.

Authentication

Authentication of requests is kept simple with a static username and password via HTTP basic authentication.

The password and username are provided to the service via command line argument or environment variable.
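As an illustration of the check described above, the following is a minimal sketch in Python of validating an HTTP basic authentication header against a static username and password. The function name `check_basic_auth` is hypothetical, and how the expected credentials are loaded (flag or environment variable) is left to the caller.

```python
import base64
import hmac
from typing import Optional


def check_basic_auth(header: Optional[str], username: str, password: str) -> bool:
    """Return True if the Authorization header carries the expected credentials."""
    if not header or not header.startswith("Basic "):
        return False
    try:
        decoded = base64.b64decode(header[len("Basic "):], validate=True).decode("utf-8")
    except (ValueError, UnicodeDecodeError):
        # Not valid base64 or not valid UTF-8: treat as unauthenticated.
        return False
    user, sep, pw = decoded.partition(":")
    if not sep:
        return False
    # compare_digest avoids leaking information through comparison timing.
    user_ok = hmac.compare_digest(user.encode(), username.encode())
    pw_ok = hmac.compare_digest(pw.encode(), password.encode())
    return user_ok and pw_ok
```

A request handler would call this with the raw `Authorization` header value and reject the request with a 401 status if it returns False.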

Downtime Window API

The SLO reporting service provides a simple REST API to create, manage and delete 3rd-party downtime windows.

A 3rd-party downtime window has the following properties:

{
  "ID": "15a45185-3e79-4d22-8962-15fc31944d66", (1)
  "StartTime": "2025-07-08T00:00:00Z", (2)
  "EndTime": "2025-07-08T01:00:00Z", (3)
  "Title": "DNS interruption", (4)
  "Description": "DNS queries for zone XYZ could not be resolved.", (5)
  "ExternalID": "12345", (6)
  "ExternalLink": "https://status.somecsp.com/incidents/12345", (7)
  "Affects": [{ (8)
    "cloud": "SomeCSP", (9)
    "region": "SomeCSPRegion"
  }]
}
1 Unique identifier of the downtime window. Set by the SLO reporting service upon creation of a downtime window.
2 Timestamp of the start of the downtime
3 Timestamp of the end of the downtime
4 (optional) Short title describing the nature of the downtime
5 (optional) Longer description of the nature of the downtime
6 (optional) Unique identifier of this outage time window in a 3rd party system. Used for deduplication. Must be unique if set.
7 (optional) Link to further information on this outage provided by 3rd party.
8 List of selectors that determine which clusters are affected by the downtime. An empty list matches no clusters. A cluster is matched if any selector in the list matches its properties.
9 Each selector contains a number of properties. These correspond to Lieutenant cluster facts. The key designates the fact, and the value corresponds to a possible value of the fact. A selector with no fields matches every cluster. A cluster is matched if for every key in the selector, the corresponding cluster fact matches the value specified in the selector.
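
The matching rules for Affects can be sketched as follows. This is an illustrative Python version, not the service's actual code: the function names are hypothetical, and cluster facts are assumed to be available as a flat dictionary of strings.

```python
from typing import Mapping, Sequence


def selector_matches(selector: Mapping[str, str], facts: Mapping[str, str]) -> bool:
    """A selector matches if every key it names has the same value in the facts.

    An empty selector has no conditions, so it matches every cluster.
    """
    return all(facts.get(key) == value for key, value in selector.items())


def window_affects_cluster(affects: Sequence[Mapping[str, str]],
                           facts: Mapping[str, str]) -> bool:
    """A window affects a cluster if any of its selectors matches.

    An empty list has no selectors, so it matches no cluster.
    """
    return any(selector_matches(selector, facts) for selector in affects)
```

For example, with facts `{"cloud": "SomeCSP", "region": "SomeCSPRegion"}`, the list `[{"cloud": "SomeCSP"}]` matches, `[{"cloud": "OtherCSP"}]` doesn't, `[{}]` matches, and `[]` doesn't.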

In addition to providing CRUD endpoints for maintaining the downtime windows, the service also provides an interface for serving all relevant downtime windows given just a cluster ID. This is achieved by querying the Lieutenant API in the background and retrieving the cluster facts from there, which can then be used to find matching downtime windows.

If a new time window is created with the same non-null ExternalID as an existing time window, the existing window is updated instead.

API endpoints

The REST API endpoints of the Downtime Window API also correspond to the programmatic interface of this component within the SLO reporting service.

GET /downtime/

Lists all available downtime windows.

Query parameters:

  • from: ISO timestamp defining the start time from which to include downtime windows.

  • to: ISO timestamp defining the end time until which to include downtime windows.

The result contains all downtime windows that overlap with the time interval [from, to]. Notably, this also includes downtime windows that start before from but end after from (and conversely for to), as well as downtime windows that completely encompass the specified time interval.
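
The overlap rule reduces to a single comparison per boundary. The sketch below shows it in Python; the helper names are hypothetical, and it assumes that windows merely touching the boundary of the closed interval [from, to] count as overlapping.

```python
from datetime import datetime


def parse_ts(value: str) -> datetime:
    """Parse an ISO timestamp; a trailing "Z" is rewritten for older Python versions."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def overlaps(window_start: datetime, window_end: datetime,
             query_from: datetime, query_to: datetime) -> bool:
    """True if [window_start, window_end] shares at least one instant with [query_from, query_to]."""
    return window_start <= query_to and window_end >= query_from
```

This single condition covers all cases named above: windows that start before from, windows that end after to, and windows that completely encompass the interval.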

Returns:

A list of downtime window records (as defined above).

GET /downtime/cluster/:cluster-id

Lists all available downtime windows that apply to a specific cluster (according to the match rules in Affects).

Path parameters:

  • cluster-id: Lieutenant ID of the cluster for which to retrieve downtime windows.

Query parameters:

  • from: ISO timestamp defining the start time from which to include downtime windows.

  • to: ISO timestamp defining the end time until which to include downtime windows.

The result contains all downtime windows that overlap with the time interval [from, to]. Notably, this also includes downtime windows that start before from but end after from (and conversely for to), as well as downtime windows that completely encompass the specified time interval.

Returns:

A list of downtime window records (as defined above).

POST /downtime

Creates a new downtime window.

Body parameters:

The body of the request corresponds to the JSON example given above, minus the ID property. Any parameters marked as optional may be omitted.

If an ExternalID is provided in the request body, and an existing downtime window shares the same ExternalID, then the existing window is updated (replaced) instead of a new one being created. Otherwise, a new record is created from the request body and assigned a new random ID.
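
The create-or-replace behaviour can be sketched with an in-memory store. This is a simplified illustration in Python; the class `DowntimeStore` is hypothetical and stands in for the database, and random IDs are assumed to be UUIDs as in the example record.

```python
import uuid


class DowntimeStore:
    """In-memory stand-in for the downtime window database."""

    def __init__(self):
        self._by_id = {}

    def create(self, body: dict) -> dict:
        external_id = body.get("ExternalID")
        if external_id is not None:
            for existing in self._by_id.values():
                if existing.get("ExternalID") == external_id:
                    # Same ExternalID: replace the record but keep its ID.
                    updated = {**body, "ID": existing["ID"]}
                    self._by_id[existing["ID"]] = updated
                    return updated
        # No ExternalID, or no match: store a new record under a fresh random ID.
        record = {**body, "ID": str(uuid.uuid4())}
        self._by_id[record["ID"]] = record
        return record

    def all(self) -> list:
        return list(self._by_id.values())
```

Posting the same ExternalID twice leaves a single record whose ID is unchanged, which is what lets an automated importer re-submit entries without creating duplicates.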

Returns:

The newly created record, including ID.

POST /downtime/:id

Updates (replaces) an existing downtime window.

Path parameters:

  • id: ID of the existing downtime window record to update.

Body parameters:

The body of the request corresponds to the JSON example given above. Any parameters marked as optional may be omitted.

If the ExternalID property is modified, a check must be made to ensure the new ExternalID doesn’t conflict with any existing record. If a conflict is found, return an error 400.

Returns:

The newly updated record, including ID.

PATCH /downtime/:id

Updates an existing downtime window, supporting partial updates.

Path parameters:

  • id: ID of the existing downtime window record to update.

Body parameters:

The body of the request corresponds to the JSON example given above. Any parameter may be omitted.

If the ExternalID property is modified, a check must be made to ensure the new ExternalID doesn’t conflict with any existing record. If a conflict is found, return an error 400.
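
A partial update with the ExternalID conflict check might look like the sketch below. It is written in Python under a few assumptions: records live in a dictionary keyed by ID, the names `apply_patch` and `ExternalIDConflict` are hypothetical, and a property present in the request body always overwrites the stored value.

```python
class ExternalIDConflict(Exception):
    """Raised when another record already uses the requested ExternalID.

    An HTTP handler would translate this into a 400 response.
    """


def apply_patch(records: dict, record_id: str, patch: dict) -> dict:
    current = records[record_id]
    # Properties omitted from the patch keep their stored value; the ID never changes.
    merged = {**current, **patch, "ID": record_id}
    new_external_id = merged.get("ExternalID")
    if new_external_id is not None and new_external_id != current.get("ExternalID"):
        for other_id, other in records.items():
            if other_id != record_id and other.get("ExternalID") == new_external_id:
                raise ExternalIDConflict(new_external_id)
    records[record_id] = merged
    return merged
```

The conflict check runs before the record is stored, so a rejected request leaves the existing data untouched.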

Returns:

The newly updated record, including ID.

Data Querying API

The SLO reporting service provides a simple querying API to retrieve adjusted SLO data for a specific cluster and time window. The querying API is effectively a proxy for the Mimir querying API.

The querying API is capable of providing the following for each customer-facing SLO:

  • Adjusted SLI Error Rate: this corresponds to the slo:sli_error:ratio_rate1h metric, but with every data point inside a relevant downtime window set to zero.

  • Service Level Objective: this corresponds to slo:objective:ratio. Since this metric shouldn’t change frequently, it’s sampled only once at the end of the requested time window.

  • Error Budget: this corresponds to slo:error_budget:ratio. Since this metric shouldn’t change frequently, it’s sampled only once at the end of the requested time window.

For each query, the service returns all the datapoints in the timeseries, at a granularity of 1 per hour. It’s important to keep the granularity of 1 per hour, as otherwise the cumulative sum of error rate samples no longer corresponds to the total error.
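
The adjustment itself is a small transformation over the hourly samples. The following Python sketch illustrates it; `adjust_error_rates` is a hypothetical name, and it assumes a sample "falls inside" a window when its timestamp lies between the window's StartTime and EndTime, inclusive.

```python
from datetime import datetime


def parse_ts(value: str) -> datetime:
    """Parse an ISO timestamp; a trailing "Z" is rewritten for older Python versions."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def adjust_error_rates(data_points: list, windows: list) -> list:
    """Return a copy of the hourly samples with samples inside any window set to zero."""
    adjusted = []
    for point in data_points:
        ts = parse_ts(point["timestamp"])
        in_window = any(
            parse_ts(w["StartTime"]) <= ts <= parse_ts(w["EndTime"]) for w in windows
        )
        rate = 0.0 if in_window else point["error_rate_1h"]
        adjusted.append({**point, "error_rate_1h": rate})
    return adjusted
```

Because there is exactly one sample per hour, summing `error_rate_1h` over the adjusted list gives the total error for the period with 3rd party downtime excluded.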

The Querying API accepts the following query parameters:

  • from: Timestamp of the start of the timeframe for which data is delivered.

  • to: Timestamp of the end of said timeframe.

  • cluster_id: ID of the cluster for which data is requested.

  • filter: (Optional) Comma-separated list of (urlencoded) PromQL label matchers that are used when querying Mimir. Can be used to narrow down the list of SLI/SLO pairs that are queried. Example: ?filter=sloth_service%3D~%22customer-facing.%2A%22 (sloth_service=~"customer-facing.*")

Accepting these as query parameters (as opposed to body parameters) will simplify eventual integration with Grafana.
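
To show how a client might assemble such a request, here is a small Python sketch using only the standard library. The function name and the base URL are hypothetical, since this document doesn't fix the path of the querying endpoint.

```python
from urllib.parse import quote, urlencode


def build_query_url(base_url: str, cluster_id: str, start: str, end: str,
                    matchers: tuple = ()) -> str:
    """Build a Data Querying API URL with percent-encoded query parameters."""
    params = {"from": start, "to": end, "cluster_id": cluster_id}
    if matchers:
        # Matchers are joined with commas, then percent-encoded as a whole.
        params["filter"] = ",".join(matchers)
    # quote_via=quote encodes spaces as %20 rather than "+".
    return f"{base_url}?{urlencode(params, quote_via=quote)}"
```

Passing the matcher `sloth_service=~"customer-facing.*"` produces the same encoded filter value as the example above.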

Sample query response:

{
  "cluster_id": "c-appuio-cloudscale-lpg-2", (1)
  "sli_data": {
    "customer-facing-ingress": { (2)
      "objective": 0.999, (3)
      "error_budget": 4, (4)
      "data_points": [
        {
          "timestamp": "2025-07-08T12:00:00Z", (5)
          "error_rate_1h": 0 (6)
        },
        {
          "timestamp": "2025-07-08T13:00:00Z",
          "error_rate_1h": 0.1
        },
        {
          "timestamp": "2025-07-08T14:00:00Z",
          "error_rate_1h": 0.3
        },
        {
          "timestamp": "2025-07-08T15:00:00Z",
          "error_rate_1h": 0.0
        }
      ]
    },
    "customer-facing-api": {
      "objective": 0.999,
      "error_budget": 1.8,
      "data_points": [
        {
          "timestamp": "2025-07-08T12:00:00Z",
          "error_rate_1h": 0
        },
        {
          "timestamp": "2025-07-08T13:00:00Z",
          "error_rate_1h": 0.4
        },
        {
          "timestamp": "2025-07-08T14:00:00Z",
          "error_rate_1h": 0.2
        },
        {
          "timestamp": "2025-07-08T15:00:00Z",
          "error_rate_1h": 0.0
        }
      ]

    }
  }
}
1 ID of the cluster which is being queried. Currently redundant, but included in case we wish to extend the API to return data for multiple clusters in one go.
2 Each entry in sli_data corresponds to one specific SLI/SLO pair. The dictionary key is the name of the pair, which can be derived from the label sloth_id.
3 Service level objective as a ratio (for example, 0.999 corresponds to 99.9% availability). Corresponds to the slo:objective:ratio metric in Mimir. Should be sampled at the end of the time period.
4 Error budget (count) for the SLO. Corresponds to the slo:error_budget:ratio metric in Mimir. Should be sampled at the end of the time period.
5 ISO timestamp of the data point.
6 1-hour average error rate for this SLI. Corresponds to the slo:sli_error:ratio_rate1h metric in Mimir. Any data point that falls within a downtime window should be zeroed.

If needed, further values can be added to each entry of the data_points list, each under its own key.

RSS transformer

The final component of the service is used to automatically ingest RSS feeds, parse the information therein, and store it as downtime windows in the Downtime Window API service.

RSS feeds are only loosely standardized and it’s hard to generalize a field mapping for this specific use of RSS. For that reason, the transformation from RSS to JSON is defined in Jsonnet.

The available RSS feeds as well as their corresponding transformations are configured statically. The SLO reporting service reads a configuration file, whose path is provided via command line or environment variable. The config file contains structured data listing the RSS feeds to be ingested, including their URLs and their transformation Jsonnet.

The Jsonnet is rendered for each RSS entry, with the full structured content of the RSS entry provided as context. The Jsonnet, when rendered, must result in a valid Downtime Window API record (as defined above).

Existing libraries can be used for parsing XML and for rendering Jsonnet, keeping this component relatively simple.

The RSS feed is fetched, transformed and stored at a regular time interval, which is set globally via command line parameter or environment variable.

The RSS transformer shouldn’t need to handle deduplication of entries; instead it should make sure each entry has a consistent ExternalID.
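
The ingestion side of this component can be sketched in Python with the standard library XML parser. The sketch covers only the steps that don't depend on a particular feed: turning each RSS item into structured data (the context that would be handed to the Jsonnet transformation) and deriving a stable ExternalID. Both function names are hypothetical, and preferring guid over link is an assumption. The Jsonnet rendering step is omitted.

```python
import xml.etree.ElementTree as ET


def rss_entries(xml_text: str) -> list:
    """Return every <item> of an RSS 2.0 feed as a flat dict of tag -> text."""
    root = ET.fromstring(xml_text)
    return [
        {child.tag: (child.text or "").strip() for child in item}
        for item in root.iter("item")
    ]


def stable_external_id(entry: dict) -> str:
    """Pick an identifier that stays the same every time the feed is fetched."""
    for field in ("guid", "link"):
        if entry.get(field):
            return entry[field]
    raise ValueError("entry has neither guid nor link; cannot derive a stable ExternalID")
```

As long as the same entry always yields the same ExternalID, re-posting it on every fetch interval updates the existing downtime window rather than creating a duplicate.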

Access to Data

The adjusted SLI/SLO data provided by the SLO reporting service can be integrated into Grafana dashboards via the Grafana Infinity data source plugin.

The plugin queries the Data Querying API, and can apply arbitrary transformations to the result.

For example, the following JSONata transform extracts the rate, cumulative sum (sum_over_time equivalent), error budget burndown/burnup, and cluster availability percentage for each SLO:

$sort($map($keys($.sli_data), function($key) {(
    /* JSONata has no return statement: a block evaluates to its last expression. */
    {
        $key : $map($lookup($$.sli_data, $key).data_points, function($v, $i, $a) {(
            /* Running total of all error rate samples up to and including this one. */
            $cumulative := $sum($filter($a, function($sv, $j) {
                    $j <= $i
                }).error_rate_1h);
            $objective := $lookup($$.sli_data, $key).objective;
            $error_budget := $lookup($$.sli_data, $key).error_budget;
            {
                "key": $key,
                "timestamp": $v.timestamp,
                "error_rate_1h": $v.error_rate_1h,
                "cumulative": $cumulative,
                "objective": $objective,
                "error_budget": $error_budget,
                "burnup": $cumulative / $error_budget,
                "burndown": 1 - $cumulative / $error_budget,
                "availability": $objective + (1-$objective) * (1 - $cumulative / $error_budget)
            }
        )})
    }
)}).*, function($l,$r){
    /* Sort all rows by ascending timestamp. */
    $l.timestamp > $r.timestamp
})
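
For readers less familiar with the transform syntax, the same arithmetic can be written out in Python. This is only an equivalent sketch for checking the formulas, not something the dashboards use; the function name `burn_series` is hypothetical.

```python
from itertools import accumulate


def burn_series(key: str, slo: dict) -> list:
    """Compute cumulative error, burnup, burndown and availability per data point."""
    objective = slo["objective"]
    error_budget = slo["error_budget"]
    rates = [point["error_rate_1h"] for point in slo["data_points"]]
    rows = []
    # accumulate() yields the running sum, matching the sum over all j <= i.
    for point, cumulative in zip(slo["data_points"], accumulate(rates)):
        burnup = cumulative / error_budget
        rows.append({
            "key": key,
            "timestamp": point["timestamp"],
            "error_rate_1h": point["error_rate_1h"],
            "cumulative": cumulative,
            "burnup": burnup,
            "burndown": 1 - burnup,
            "availability": objective + (1 - objective) * (1 - burnup),
        })
    return rows
```

With the customer-facing-ingress values from the sample response (objective 0.999, error budget 4, hourly rates 0, 0.1, 0.3, 0), the last data point has a cumulative error of about 0.4, a burnup of about 0.1, and an availability of about 0.9999.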

Since it’s only possible to query one cluster at a time, it’s not possible to create single panels that show data for all clusters on the same graph. However, it’s possible to create a repeat panel that automatically generates one panel per cluster, which should be sufficient for an overview.

If desired, the Data Querying API can be extended to provide data for multiple clusters in a single API response. With that, combining data from multiple clusters in a single Grafana panel becomes possible.