Service Level Indicator (SLI)
APPUiO Managed OpenShift comes with a collection of service level indicators (SLIs). This document defines and explains these SLIs. All of the SLIs are in the scope of the "Guaranteed Availability" Service Level.
We use the SLIs and Multiwindow, Multi-Burn-Rate Alerts as the basis of our on-call alerting.
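As an illustration of how a multiwindow, multi-burn-rate condition can be built on top of an SLI, the following sketch uses the API request metrics defined later in this document. The 99.9% objective, the window pairs (1h/5m and 6h/30m), and the burn-rate factors (14.4 and 6) are assumptions borrowed from commonly published examples; they aren't necessarily the values used for APPUiO Managed OpenShift.

```promql
# Assumption: 99.9% objective, so the error budget is 0.001.
# Fast burn: the error ratio exceeds 14.4x the budget over both 1h and 5m.
(
    sum(rate(apiserver_request_total{code=~"(5..|429)"}[1h]))
  / sum(rate(apiserver_request_total[1h]))
  > (14.4 * 0.001)
and
    sum(rate(apiserver_request_total{code=~"(5..|429)"}[5m]))
  / sum(rate(apiserver_request_total[5m]))
  > (14.4 * 0.001)
)
or
# Slow burn: the error ratio exceeds 6x the budget over both 6h and 30m.
(
    sum(rate(apiserver_request_total{code=~"(5..|429)"}[6h]))
  / sum(rate(apiserver_request_total[6h]))
  > (6 * 0.001)
and
    sum(rate(apiserver_request_total{code=~"(5..|429)"}[30m]))
  / sum(rate(apiserver_request_total[30m]))
  > (6 * 0.001)
)
```

Requiring the long and the short window to exceed the threshold together means the condition fires only while the budget is still being burned, and clears soon after the error ratio recovers.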
Ingress
Working cluster ingress is a core requirement for a Kubernetes cluster. If the workloads running on the cluster aren’t accessible, the cluster might as well be down from a user’s perspective.
Canary
Probes are sent every minute from the ingress operator, inside the cluster, to the external address of the canary target.
Each probe is sent to the public floating IP of the load balancers, which forward it to one of the ingress controllers running on the infrastructure nodes. The ingress controller then forwards the request to one of the canary targets, which run on every worker and infrastructure node.
This setup should approximate the cluster ingress uptime.
As a side effect, it also measures outbound connection issues, which shouldn’t be part of an ingress SLO. However, the alternative of using an external probe source would also measure issues that shouldn’t be part of the SLO, so we chose the in-cluster probe source for simplicity.
Kubernetes API
The Kubernetes API is the main way users interact with the cluster itself. If the API isn’t available, users can’t change configuration or run new workloads, and existing deployments will quickly degrade.
A misbehaving Kubernetes API directly impacts the service level.
Request Error Rate
This is measured directly at the API server through the following metrics.
# The number of failed valid API requests
apiserver_request_total{code=~"(5..|429)"}
# All API requests
apiserver_request_total
We only count HTTP 5xx errors, which indicate a server-side error, and HTTP error 429, which indicates that the API server is overloaded.
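The two metrics above can be combined into a single error ratio. A minimal sketch, assuming a 5-minute rate window (the window actually used isn't stated here):

```promql
# Fraction of API requests that failed with 5xx or 429 over the last 5 minutes.
# Label matchers are fully anchored, so "5.." matches exactly the codes 500-599.
sum(rate(apiserver_request_total{code=~"(5..|429)"}[5m]))
/
sum(rate(apiserver_request_total[5m]))
```

A result of 0.01 would mean that 1% of all API requests in the window failed. Summing before dividing weights every request equally, regardless of which API server instance handled it.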
Uptime
Probes are sent every 10 seconds from a blackbox exporter inside the cluster to the readiness endpoint of the Kubernetes API server.
This SLI approximates a user’s ability to reach at least one API server instance and the API server’s uptime.
Complete outages measured by this SLI can’t be measured by the error rate SLI, because an API server that is down serves no requests and therefore records no errors.
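A probe-based SLI of this kind can be expressed with the probe_success metric, which the blackbox exporter sets to 1 for a successful probe and 0 for a failed one. The following is a sketch only: the job label value and the 5-minute window are hypothetical and need to be adapted to the actual probe configuration.

```promql
# Fraction of failed readiness probes over the last 5 minutes.
# Assumption: the probe is scraped under the hypothetical job "probe-k8s-api".
1 - avg_over_time(probe_success{job="probe-k8s-api"}[5m])
```

With one probe every 10 seconds, a 5-minute window covers roughly 30 samples, so a single failed probe shows up as a failure ratio of about 0.033.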
Workload Schedulability
We define Workload Schedulability as the ability to start and successfully run new workloads on the cluster. This ability is essential and directly impacts the service level.
Storage
Persistent storage is a key component of a feature-complete Kubernetes cluster. Any storage issue directly impacts the service level for users.
CSI Operations
CSI operations are any interactions of the kubelet or controller-manager with the CSI provider. This includes creating, deleting, mounting, unmounting, or resizing a persistent volume.
We measure these interactions using the following metrics, reported by the kubelets and the controller-manager:
# The number of failed CSI operations
storage_operation_duration_seconds_count{
  volume_plugin=~"kubernetes.io/csi.+",status="fail-unknown"
}
# All CSI operations
storage_operation_duration_seconds_count{volume_plugin=~"kubernetes.io/csi.+"}
This SLI approximates the user experience of interacting with PVs and PVCs. It doesn’t measure any performance issues with the underlying storage.
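As with the API request metrics, the two selectors can be combined into a failure ratio. A minimal sketch, assuming a 5-minute rate window (an illustrative choice, not a documented value):

```promql
# Fraction of CSI operations that failed over the last 5 minutes,
# summed across all kubelets and the controller-manager.
sum(rate(storage_operation_duration_seconds_count{volume_plugin=~"kubernetes.io/csi.+",status="fail-unknown"}[5m]))
/
sum(rate(storage_operation_duration_seconds_count{volume_plugin=~"kubernetes.io/csi.+"}[5m]))
```

If no CSI operations happened in the window, the denominator is zero and the query doesn't produce a usable ratio, so any alert built on it needs to tolerate quiet periods.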
Cluster Network
Reliable cluster networking is essential for nearly every workload. Without it, users can’t reliably access their workloads, and even moderate packet loss can negatively impact deployments such as databases.
Packet Loss
A network canary daemonset starts a canary pod on every node. These canaries continuously ping every other pod in the daemonset and report any packet loss. Pings are sent every second and the metrics are scraped directly from the canary pods.
This SLI approximates the overall packet loss of the cluster network.