Service Level Indicator (SLI)

APPUiO Managed OpenShift comes with a collection of service level indicators (SLIs). This document defines and explains these SLIs. All of the SLIs are in the scope of the "Guaranteed Availability" Service Level.

We use the SLIs and Multiwindow, Mulit-Burn-Rate Alerts as the basis of our on-call alerting.

Customer Facing

The customer facing SLIs are the basis of our SLAs.

Ingress

Working cluster ingress is a core requirement for a Kubernetes cluster. If the workloads running on the cluster aren’t accessible, it might as well be down from a user perspective.

We see this as the most important SLI and the one that should be monitored most closely. This is why we base our SLAs on this SLI.

The customer facing ingress SLI is a combination of two SLIs:

Ingress Canary
OpenShift Ingress HAProxy back-end metrics

After three minutes without a successful canary request AND without a successful HAProxy back-end request, based on customer workload, we start counting towards the error budget. Combining the two SLIs allows high fidelity alerting not dependent on customer workload or OpenShift canary configuration.

Ingress

Working cluster ingress is a core requirement for a Kubernetes cluster. If the workloads running on the cluster aren’t accessible, it might as well be down from a user perspective.

Canary

HTTP probes to a canary application

Probes are sent every minute from the ingress operator, inside the cluster, to the external address of the canary target.

This means it will send a request to the public floating IP of the load balancers, which will forward the request to one of the ingress controller running on the infrastructure nodes, which will then forward the request to one of the canary targets, which runs on every worker and infrastructure node.

This setup should approximate the cluster ingress uptime.

As a side effect it also measures out-bound connection issues, which shouldn’t be a part of an ingress SLO. However, the alternative of using an external probe source also measures issues that shouldn’t be part of the SLO, so we chose the in-cluster probe source for simplicity.

Kubernetes API

The Kubernetes API is the main way users interact with the cluster itself. If the API isn’t available, users can’t change configuration or run new workloads and existing deployments will quickly degrade.

A misbehaving Kubernetes API directly impacts the service level.

Request Error Rate

Requests to the Kubernetes API server succeed or are invalid

This is measured directly at the API server through the following metrics.

# The number of failed valid API requests
apiserver_request_total{code=~"(5..|429)"}

# All API requests
apiserver_request_total

We only look for HTTP 5xx errors, which indicate a server side error, and HTTP error 429, which indicates that the API server is overloaded.

Uptime

HTTP probes to the Kubernetes API server succeed

Probes are sent every 10 seconds from a blackbox exporter inside the cluster to the readiness endpoint of the Kubernetes API server.

This SLI approximates a user’s ability to reach at least one API server instance and the API server’s uptime.

Complete outages measured by this SLI can’t be measured by the error rate SLI.

Workload Schedulability

We define Workload Schedulability as the ability to start and successfully run new workloads on the cluster. This ability is essential and directly impacts the service level.

Canary

Canary pods start successfully

A controller starts a known good canary pod every minute and checks if it successfully started after 3 minutes.

This SLI acts as a proxy to measure if users are able to start new workloads and should reveal issues with the scheduler, cluster capacity, and more.

Storage

Persistent storage is a key component of a feature complete Kubernetes cluster. Any storage issues directly impacts the service level for users.

CSI Operations

CSI operations complete successfully

CSI operations are any interactions of the kubelet or controller-manager with the CSI provider. This includes creating, deleting, mounting, unmounting, or resizing a persistent volume.

We measure these interactions using the following metrics, reported by the kubelets and the controller-manager:

# The number of failed csi operations
storage_operation_duration_seconds_count{
  volume_plugin=~"kubernetes.io/csi.+",status="fail-unknown"
}

# All csi operations
storage_operation_duration_seconds_count{volume_plugin=~"kubernetes.io/csi.+"}

This SLI approximates the user experience of interacting with PVs and PVCs. It doesn’t measure any performance issues with the underlying storage.

Cluster Network

Reliable cluster networking is essential for nearly every workload. Without it, users can’t reliably access their workload and even moderate packet loss can negatively impact deployments such as databases.

Packet Loss

ICMP pings between canary pods succeed

A network canary daemonset starts a canary pod on every node. These canaries continuously ping every other pod in the daemonset and report any packet loss. Pings are set every second and the metrics are scraped directly from the canary pods.

This SLI approximates the overall packet loss of the cluster network.