Customer-facing SLOs
Problem
VSHN has accumulated a historic mess of different SLAs sold to different customers. Neither account managers nor engineers have an overview. This leads to confusion, extra work, and frustration on both sides. The OpenShift team also has its own, very technical SLOs, which aren’t customer-facing. Those SLOs are too fine-grained for customers, and some of them measure things that are outside our control.
We need to define a set of standardized, customer-facing SLOs that are easy to understand and measure.
Proposals
Proposal 1
Measure OpenShift API server availability and response time.
We’ve got an internal SLO for the OpenShift API server which works well for us. We can use this as a basis for our customer-facing SLO.
Proposal 2
Measure ingress availability.
OpenShift already has a built-in ingress canary route and also exposes haproxy metrics for non-synthetic traffic.
We can use these to measure the availability of the ingress.
Because the canary route is excluded from the haproxy metrics, and the haproxy metrics depend on the workload itself, we should measure availability using both signals.
We propose the following rule: a period of 3 minutes with no successful request and no successful canary request (measured with a blackbox exporter) counts against the error budget.
haproxy metrics:
absent_over_time((sum(rate(haproxy_frontend_http_responses_total{code=~"[1-4]xx"}[1m])) > 0)[3m:])
Canary route:
absent_over_time((ingress_canary_route_reachable > 0)[3m:])
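The rule above can be illustrated with a small simulation. This is a minimal sketch and not the production rule. It assumes one sample per minute and, as a simplification, only evaluates minutes that have a full trailing 3-minute window. The names `Minute`, `bad_minutes`, and `availability` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass

WINDOW_MINUTES = 3  # window length taken from the proposal


@dataclass
class Minute:
    # Assumption: True if haproxy saw at least one 1xx-4xx response this minute.
    workload_ok: bool
    # Assumption: True if the canary probe succeeded this minute.
    canary_ok: bool


def bad_minutes(samples, window=WINDOW_MINUTES):
    """Return one flag per minute: True if the trailing window had no success at all."""
    flags = []
    for i in range(len(samples)):
        recent = samples[max(0, i - window + 1): i + 1]
        full_window = len(recent) == window  # simplification: skip partial windows
        any_success = any(m.workload_ok or m.canary_ok for m in recent)
        flags.append(full_window and not any_success)
    return flags


def availability(samples, window=WINDOW_MINUTES):
    """Fraction of minutes that do not count against the error budget."""
    flags = bad_minutes(samples, window)
    return 1 - sum(flags) / len(flags)


if __name__ == "__main__":
    up = Minute(workload_ok=True, canary_ok=True)
    down = Minute(workload_ok=False, canary_ok=False)
    # Ten minutes with a five-minute total outage in the middle.
    timeline = [up, up, down, down, down, down, down, up, up, up]
    print(sum(bad_minutes(timeline)), "of", len(timeline), "minutes count against the budget")
```

Note that a minute with no workload traffic but a healthy canary never counts against the budget in this sketch, which is why both signals are combined.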
Other Ideas
Upgrade success rate
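Too few data points (fewer than 25 a year) to measure this SLO. With so few samples, a single failed upgrade moves the rate by more than four percentage points.
It can still be tracked on a manual dashboard for marketing.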
Upgrade time
Too few data points (fewer than 25 a year) to measure this SLO. Upgrade time also depends purely on customer workload and the underlying infrastructure.
It can still be tracked on a manual dashboard for marketing.
Rationale
Ingress availability (Proposal 2) is the most relevant SLO for our customers.
It’s slightly more difficult to measure than the API server SLO, but we can use the existing haproxy metrics and the canary route to measure it.
We already have an Ingress canary SLO for internal use, which so far has had no issues and produces high-fidelity alerts.