Alert rule: GuaranteedUptimeTarget
Overview
This alert fires when more than 40% of SLI probes to a VSHN managed service fail over a 5-minute window, and the service has sla=guaranteed.
It covers all VSHN services (PostgreSQL, Redis, Keycloak, MariaDB, Minio, Nextcloud, etc.).
The alert name includes the service: VSHNPostgreSQLSla, VSHNRedisSlaHA, etc.
The SlaHA variant fires for high-availability instances.
The SLI Exporter probes each service every second and records results as appcat_probes_seconds_count with a reason label.
Possible reasons: fail-timeout (probe exceeded timeout), fail-unknown (generic failure).
Steps for Debugging
Extract variables from the alert labels:
SERVICE='<service-from-alert>' # VSHNPostgreSQL, VSHNRedis
NAME='<name-from-alert>'
CLAIM_NAMESPACE='<claim_namespace-from-alert>'
INSTANCE_NAMESPACE='<instance_namespace-from-alert>'
REASON='<reason-from-alert>' # fail-timeout or fail-unknown
-
fail-timeout: the service is reachable but not responding within the probe timeout. Check for resource pressure, hung processes, or network issues between the SLI Exporter and the instance. -
fail-unknown: generic probe failure. Check the SLI Exporter logs for the specific error.
kubectl --as=system:admin -n syn-appcat-slos logs -l control-plane=controller-manager -c manager --tail=200 | grep $NAME
kubectl --as=system:admin -n $INSTANCE_NAMESPACE get pods
kubectl --as=system:admin -n $INSTANCE_NAMESPACE get events --sort-by=.lastTimestamp | tail -20
kubectl --as=system:admin -n $INSTANCE_NAMESPACE logs <failing-pod> --tail=100
XR_KIND="xvshn$(echo $SERVICE | tr '[:upper:]' '[:lower:]' | sed 's/vshn//')"
# xvshnpostgresql, xvshnredis, xvshnkeycloak
kubectl --as=system:admin get objects | grep $NAME
kubectl --as=system:admin get $XR_KIND | grep $NAME
-
VSHNPostgreSQL (CNPG)
-
kubectl --as=system:admin -n syn-cnpg-system get pods -
kubectl --as=system:admin -n syn-cnpg-system logs deployments/appcat-cloudnative-pg --tail=50
-
-
VSHNPostgreSQL (StackGres)
-
kubectl --as=system:admin -n syn-stackgres-operator get pod -
kubectl --as=system:admin -n syn-stackgres-operator logs deployments/stackgres-operator --tail=50
-
-
VSHNRedis, VSHNKeycloak, VSHNNextcloud, VSHNMariaDB and the rest
-
kubectl --as=system:admin -n syn-crossplane logs -l 'pkg.crossplane.io/provider=provider-helm' --tail=50
-
Steps for Remediation
Instance pods are crashed or OOMKilled:
Restart the failing workload and check resource limits:
kubectl --as=system:admin -n $INSTANCE_NAMESPACE rollout restart statefulset
kubectl --as=system:admin -n $INSTANCE_NAMESPACE rollout restart deployment
Network policy blocking probe traffic:
Check if a NetworkPolicy in $INSTANCE_NAMESPACE is blocking ingress from syn-appcat-slos.
Transient overload (fail-timeout):
Check node-level resource pressure and pod resource usage. If the instance recovers, the alert will resolve automatically once the failure rate drops below 40%.
Credentials issue:
Connection secrets in the claim namespace have type connection.crossplane.io/v1alpha1.
Verify the secret exists and contains valid credentials:
kubectl --as=system:admin -n $CLAIM_NAMESPACE get secret --field-selector type=connection.crossplane.io/v1alpha1