Service Level Objectives

APPUiO Managed OpenShift 4 comes with a collection of service level objectives (SLOs). This document defines and explains these SLOs. An APPUiO Managed cluster should meet these objectives to provide the expected service level to our customers.

We use these SLOs and multiwindow, multi-burn-rate alerts as the basis of our on-call alerting.
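As an illustration, a multiwindow, multi-burn-rate alert only fires when the error budget burns too fast on both a long and a short window, which keeps paging responsive without flapping. A minimal PromQL sketch for a 99.9% objective over an assumed 30-day window, using a hypothetical recording rule slo:request_errors:ratio_rate1h (and a 5m variant):

# Hypothetical sketch: page when the error budget of a 99.9% SLO
# burns 14.4x faster than sustainable, on both windows.
(
  slo:request_errors:ratio_rate1h > (14.4 * 0.001)
and
  slo:request_errors:ratio_rate5m > (14.4 * 0.001)
)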

These are internal service level objectives, not service level agreements. We don’t guarantee to meet these objectives at all times.


Cluster Ingress

Working cluster ingress is a core requirement for a Kubernetes cluster. If the workloads running on the cluster aren’t accessible, the cluster might as well be down from a user’s perspective.


99.75% of all HTTP probes to a canary application succeed

Probes are sent every minute from the ingress operator, inside the cluster, to the external address of the canary target.

Each probe sends a request to the public floating IP of the load balancers, which forward the request to one of the ingress controllers running on the infrastructure nodes, which in turn forwards the request to one of the canary targets, which run on every worker and infrastructure node.


This setup should approximate the cluster ingress uptime.

As a side effect it also measures outbound connection issues, which shouldn’t be part of an ingress SLO. However, the alternative of using an external probe source also measures issues that shouldn’t be part of the SLO, so we chose the in-cluster probe source for simplicity.
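A minimal sketch of how this SLI could be computed, assuming the canary results were exposed as a counter named ingress_canary_checks_total with a result label (metric name and labels are illustrative, not necessarily the ingress operator’s actual series):

# Fraction of successful canary probes over an assumed 30-day SLO window
sum(rate(ingress_canary_checks_total{result="success"}[30d]))
/
sum(rate(ingress_canary_checks_total[30d]))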

Kubernetes API

The Kubernetes API is the main way users interact with the cluster itself. If the API isn’t available, users can’t change configuration or run new workloads, and existing deployments will quickly degrade.

A misbehaving Kubernetes API directly impacts the service level.

Request Error Rate

99.9% of all requests to the Kubernetes API server succeed or are invalid

This is measured directly at the API server through the following metrics. The queries shown use the standard apiserver_request_total series; the exact label filters are an assumption:

# The number of failed valid API requests
apiserver_request_total{code=~"(5..|429)"}

# All API requests
apiserver_request_total

We only look for HTTP 5xx errors, which indicate a server-side error, and HTTP error 429, which indicates that the API server is overloaded.
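Combined into a single error ratio, which the SLO requires to stay below 0.1% (same label-filter assumption as above, 30-day window assumed):

# Share of failed valid requests among all API requests
sum(rate(apiserver_request_total{code=~"(5..|429)"}[30d]))
/
sum(rate(apiserver_request_total[30d]))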


Probe Success Rate

99.9% of all HTTP probes to the Kubernetes API server succeed

Probes are sent every 10 seconds from a blackbox exporter inside the cluster to the readiness endpoint (/readyz) of the Kubernetes API server.

This SLI approximates a user’s ability to reach at least one API server instance and the API server’s uptime.

A complete outage is caught by this SLI but not by the request error rate SLI: if no API server instance can respond, no request metrics are reported at all.
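Because probe_success from the blackbox exporter is a 0/1 gauge, the SLI reduces to a simple average; a sketch, with the job label as an assumption:

# Fraction of successful probes to the API server readiness endpoint
avg_over_time(probe_success{job="apiserver"}[30d])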

Workload Schedulability

We define Workload Schedulability as the ability to start and successfully run new workloads on the cluster. This ability is essential and directly impacts the service level.


99.75% of canary pods start successfully

A controller starts a known-good canary pod every minute and checks after 3 minutes whether it started successfully.

This SLI acts as a proxy to measure if users are able to start new workloads and should reveal issues with the scheduler, cluster capacity, and more.
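The SLI follows the same ratio pattern as the other canary-based objectives; a sketch assuming a hypothetical counter canary_pod_starts_total with an outcome label:

# Fraction of canary pods that started successfully within the deadline
sum(rate(canary_pod_starts_total{outcome="success"}[30d]))
/
sum(rate(canary_pod_starts_total[30d]))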


Storage

Persistent storage is a key component of a feature-complete Kubernetes cluster. Any storage issue directly impacts the service level for users.

CSI Operations

99.5% of all CSI operations complete successfully

CSI operations are any interactions of the kubelet or controller-manager with the CSI provider. This includes creating, deleting, mounting, unmounting, or resizing a persistent volume.

We measure these interactions using the following metrics, reported by the kubelets and the controller-manager. The queries shown use the kubelet’s csi_operations_seconds histogram; the exact metric and label filters are an assumption:

# The number of failed CSI operations
csi_operations_seconds_count{grpc_status_code!="OK"}

# All CSI operations
csi_operations_seconds_count
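Combined into an error ratio, which must stay below 0.5% (same metric assumption as above):

# Share of failed operations among all CSI operations
sum(rate(csi_operations_seconds_count{grpc_status_code!="OK"}[30d]))
/
sum(rate(csi_operations_seconds_count[30d]))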

This SLI approximates the user experience of interacting with PVs and PVCs. It doesn’t measure any performance issues with the underlying storage.

Cluster Network

Reliable cluster networking is essential for nearly every workload. Without it, users can’t reliably access their workloads, and even moderate packet loss can negatively impact deployments such as databases.

Packet Loss

99.5% of all ICMP pings between canary pods succeed

A network canary daemonset starts a canary pod on every node. These canaries continuously ping every other pod in the daemonset and report any packet loss. Pings are sent every second and the metrics are scraped directly from the canary pods.
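A sketch of the resulting SLI, assuming the canaries export hypothetical counters for sent and lost pings:

# Fraction of pings between canary pods that succeed; must stay above 99.5%
1 - (
  sum(rate(network_canary_pings_lost_total[30d]))
  /
  sum(rate(network_canary_pings_sent_total[30d]))
)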

This SLI approximates the overall packet loss of the cluster network.