# Alert rule: CNPGClusterOffline

## Overview

This alert triggers when all CNPG collectors in a namespace report `cnpg_collector_up == 0`, indicating that the cluster is unhealthy or that the monitoring exporter has failed on all instances.
This alert relies on the `cnpg_collector_up` metric being present. If all pods are killed and the metric is absent entirely, the alert will not fire; in that case, investigate manually.
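For reference, the alert condition described above could be expressed as a Prometheus rule along these lines. This is a sketch, not the deployed rule definition: the group name, `for` duration, and labels are assumptions, and the `max by (namespace)` aggregation is one way to encode "all collectors in the namespace report 0".

```yaml
# Hypothetical shape of the alert rule; only the metric and threshold
# come from the runbook text above.
groups:
  - name: cnpg
    rules:
      - alert: CNPGClusterOffline
        # Fires only when every cnpg_collector_up series in the
        # namespace is 0 (the maximum across instances is 0).
        expr: max by (namespace) (cnpg_collector_up) == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: CNPG cluster offline in {{ $labels.namespace }}
```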
## Steps for Debugging
- Step one

  Identify the affected namespace from the alert and set it as a variable:

  ```shell
  INSTANCE_NAMESPACE='<instance-namespace>'
  ```

- Step two

  Check if any PostgreSQL pods are running:

  ```shell
  kubectl get pods -n $INSTANCE_NAMESPACE
  ```

- Step three

  Check the CNPG cluster resource for status conditions:

  ```shell
  kubectl describe cluster postgresql -n $INSTANCE_NAMESPACE
  ```

- Step four

  If pods exist but the collector is not healthy, check the pod logs for errors:

  ```shell
  kubectl logs -n $INSTANCE_NAMESPACE -l cnpg.io/cluster=postgresql --prefix | grep -i "collector\|error\|fatal"
  ```

- Step five

  If no pods are running, check recent events and node pressure:

  ```shell
  kubectl get events -n $INSTANCE_NAMESPACE --sort-by='.lastTimestamp'
  kubectl describe nodes | grep -A5 "Conditions:"
  ```

- Step six

  If the cluster is hibernated, check for the hibernation annotation:

  ```shell
  kubectl get cluster postgresql -n $INSTANCE_NAMESPACE -o jsonpath='{.metadata.annotations}'
  ```
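The steps above can be collected into a single helper for repeated use. This is a sketch: the function names are invented here, and it assumes the Cluster resource is named `postgresql` as in the commands above.

```shell
#!/usr/bin/env bash
# Sketch: the six debugging steps above wrapped in one function.
# Assumes the Cluster resource is named "postgresql"; adjust as needed.

# Step four's log filter as a standalone function: keep only lines
# mentioning the collector, errors, or fatal conditions.
filter_collector_errors() {
  grep -i -E 'collector|error|fatal'
}

debug_cnpg_offline() {
  local ns="$1"                                            # step one
  kubectl get pods -n "$ns"                                # step two
  kubectl describe cluster postgresql -n "$ns"             # step three
  kubectl logs -n "$ns" -l cnpg.io/cluster=postgresql --prefix \
    | filter_collector_errors || true                      # step four
  kubectl get events -n "$ns" --sort-by='.lastTimestamp'   # step five
  kubectl describe nodes | grep -A5 "Conditions:"
  kubectl get cluster postgresql -n "$ns" \
    -o jsonpath='{.metadata.annotations}'                  # step six
}

# usage: debug_cnpg_offline "$INSTANCE_NAMESPACE"
```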