Alert rule: GuaranteedUptimeTarget

Overview

This alert fires when more than 40% of SLI probes to a VSHN managed service fail over a 5-minute window, and the service has sla=guaranteed. It covers all VSHN services (PostgreSQL, Redis, Keycloak, MariaDB, Minio, Nextcloud, etc.).

The alert name includes the service: VSHNPostgreSQLSla, VSHNRedisSlaHA, etc. The SlaHA variant fires for high-availability instances.

The SLI Exporter probes each service every second and records results as appcat_probes_seconds_count with a reason label. Possible reasons: fail-timeout (probe exceeded timeout), fail-unknown (generic failure).

Steps for Debugging

Extract variables from the alert labels:

SERVICE='<service-from-alert>'               # VSHNPostgreSQL, VSHNRedis
NAME='<name-from-alert>'
CLAIM_NAMESPACE='<claim_namespace-from-alert>'
INSTANCE_NAMESPACE='<instance_namespace-from-alert>'
REASON='<reason-from-alert>'                 # fail-timeout or fail-unknown
Check the reason label first
  • fail-timeout: the service is reachable but not responding within the probe timeout. Check for resource pressure, hung processes, or network issues between the SLI Exporter and the instance.

  • fail-unknown: generic probe failure. Check the SLI Exporter logs for the specific error.

Check SLI Exporter logs
kubectl --as=system:admin -n syn-appcat-slos logs -l control-plane=controller-manager -c manager --tail=200 | grep $NAME
Check the instance pods
kubectl --as=system:admin -n $INSTANCE_NAMESPACE get pods
kubectl --as=system:admin -n $INSTANCE_NAMESPACE get events --sort-by=.lastTimestamp | tail -20
kubectl --as=system:admin -n $INSTANCE_NAMESPACE logs <failing-pod> --tail=100
Check Crossplane objects and composite
XR_KIND="xvshn$(echo $SERVICE | tr '[:upper:]' '[:lower:]' | sed 's/vshn//')"
# xvshnpostgresql, xvshnredis, xvshnkeycloak
kubectl --as=system:admin get objects | grep $NAME
kubectl --as=system:admin get $XR_KIND | grep $NAME
Check providers responsible for the service
  • VSHNPostgreSQL (CNPG)

    • kubectl --as=system:admin -n syn-cnpg-system get pods

    • kubectl --as=system:admin -n syn-cnpg-system logs deployments/appcat-cloudnative-pg --tail=50

  • VSHNPostgreSQL (StackGres)

    • kubectl --as=system:admin -n syn-stackgres-operator get pod

    • kubectl --as=system:admin -n syn-stackgres-operator logs deployments/stackgres-operator --tail=50

  • VSHNRedis, VSHNKeycloak, VSHNNextcloud, VSHNMariaDB and the rest

    • kubectl --as=system:admin -n syn-crossplane logs -l 'pkg.crossplane.io/provider=provider-helm' --tail=50

Steps for Remediation

Instance pods are crashed or OOMKilled:

Restart the failing workload and check resource limits:

kubectl --as=system:admin -n $INSTANCE_NAMESPACE rollout restart statefulset
kubectl --as=system:admin -n $INSTANCE_NAMESPACE rollout restart deployment

Network policy blocking probe traffic:

Check if a NetworkPolicy in $INSTANCE_NAMESPACE is blocking ingress from syn-appcat-slos.

Transient overload (fail-timeout):

Check node-level resource pressure and pod resource usage. If the instance recovers, the alert will resolve automatically once the failure rate drops below 40%.

Credentials issue:

Connection secrets in the claim namespace have type connection.crossplane.io/v1alpha1. Verify the secret exists and contains valid credentials:

kubectl --as=system:admin -n $CLAIM_NAMESPACE get secret --field-selector type=connection.crossplane.io/v1alpha1