Alert rule: CNPGClusterHAWarning

Overview

This alert triggers when a CloudNativePG (CNPG) PostgreSQL HA cluster has fewer than 2 streaming replicas connected to the primary. The cluster is still operational, but redundancy is reduced: if the primary fails, fewer standbys than expected are available to take over.

This alert only fires for clusters that have replicas configured. Standalone single-instance deployments are excluded.

Steps for Debugging

Step one

Identify the affected namespace from the alert. Set it as a variable.

INSTANCE_NAMESPACE='<instance-namespace>'
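Because every later command depends on this variable, it can help to fail early if it is empty or still holds the placeholder. The following is a minimal sketch; require_namespace is a hypothetical helper name, not part of kubectl or CNPG.

```shell
# Hypothetical guard: return non-zero if INSTANCE_NAMESPACE is empty
# or still looks like the '<instance-namespace>' placeholder.
require_namespace() {
  case "${INSTANCE_NAMESPACE:-}" in
    ''|'<'*'>')
      echo "INSTANCE_NAMESPACE is not set to a real namespace" >&2
      return 1
      ;;
  esac
}
```

Calling require_namespace before the steps below stops the session from querying the wrong namespace by accident.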

Step two

Check the overall cluster status and how many instances are ready.

kubectl get cluster postgresql -n "$INSTANCE_NAMESPACE"
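To turn the status output into a direct comparison, the desired and ready instance counts can be read from the Cluster resource and compared. This is a sketch only: instance_gap is a hypothetical helper, and the wiring assumes the Cluster resource exposes .spec.instances and .status.readyInstances.

```shell
# Hypothetical helper: compare desired vs ready instance counts.
# An empty or missing ready count is treated as 0.
instance_gap() {
  desired=$1
  ready=${2:-0}
  missing=$((desired - ready))
  if [ "$missing" -le 0 ]; then
    echo "ok: $ready/$desired instances ready"
  else
    echo "degraded: $ready/$desired instances ready, $missing missing"
  fi
}

# Example wiring (requires kubectl access to the cluster):
# desired=$(kubectl get cluster postgresql -n "$INSTANCE_NAMESPACE" -o jsonpath='{.spec.instances}')
# ready=$(kubectl get cluster postgresql -n "$INSTANCE_NAMESPACE" -o jsonpath='{.status.readyInstances}')
# instance_gap "$desired" "$ready"
```

A "degraded" result here tells you how many instances are missing; the following steps identify which ones and why.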

Step three

Check which pods are running and their roles.

kubectl get pods -n "$INSTANCE_NAMESPACE" -l cnpg.io/cluster=postgresql -L cnpg.io/instanceRole
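When the pod list is long, it may be quicker to filter it down to pods that are not fully ready. The sketch below assumes the default kubectl column order (NAME, READY, STATUS) with headers suppressed; unhealthy_pods is a hypothetical helper name.

```shell
# Hypothetical helper: read "kubectl get pods --no-headers" output on stdin
# and print pods whose STATUS is not Running or whose READY count is incomplete.
unhealthy_pods() {
  awk '{
    split($2, r, "/")                      # READY column, e.g. "0/1"
    if ($3 != "Running" || r[1] != r[2])
      print $1, $2, $3
  }'
}

# Example wiring (requires kubectl access to the cluster):
# kubectl get pods -n "$INSTANCE_NAMESPACE" -l cnpg.io/cluster=postgresql --no-headers | unhealthy_pods
```

Only the first three columns are parsed, so extra text in later columns does not affect the result.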

Step four

Resolve the primary pod name and check the replication status.

PRIMARY=$(kubectl get cluster postgresql -n "$INSTANCE_NAMESPACE" -o jsonpath='{.status.currentPrimary}')
kubectl exec -n "$INSTANCE_NAMESPACE" "$PRIMARY" -- psql -U postgres -c "SELECT application_name, state, sync_state FROM pg_stat_replication;"
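The query result can be checked against the threshold of 2 streaming replicas described in the overview. This is a rough sketch that assumes the alert counts only rows whose state is exactly "streaming"; streaming_check is a hypothetical helper name.

```shell
# Hypothetical helper: read one replication state per line on stdin
# and report whether at least 2 replicas are in the "streaming" state.
streaming_check() {
  n=$(grep -c '^streaming$' || true)   # grep exits 1 on zero matches but still prints 0
  if [ "$n" -lt 2 ]; then
    echo "WARNING: $n streaming replica(s), expected at least 2"
    return 1
  fi
  echo "OK: $n streaming replicas"
}

# Example wiring (requires kubectl access to the cluster; -At gives unaligned, tuples-only output):
# kubectl exec -n "$INSTANCE_NAMESPACE" "$PRIMARY" -- \
#   psql -U postgres -At -c "SELECT state FROM pg_stat_replication;" | streaming_check
```

A replica in a state such as catchup is connected but still behind, so it is not counted as streaming here.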

Step five

Describe the replica pods and check their logs to identify scheduling or crash issues. When a label selector is used, kubectl logs returns only the last 10 lines per pod by default, so set --tail explicitly.

kubectl describe pods -n "$INSTANCE_NAMESPACE" -l cnpg.io/cluster=postgresql,cnpg.io/instanceRole=replica
kubectl logs -n "$INSTANCE_NAMESPACE" -l cnpg.io/cluster=postgresql,cnpg.io/instanceRole=replica --prefix --tail=500 | grep -iE 'error|fatal|replication|receiver'
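To see at a glance which replica is producing the most errors, the prefixed log lines can be grouped by pod. A minimal sketch, assuming the --prefix output starts each line with a "[pod/<name>/<container>]" token; errors_per_pod is a hypothetical helper name.

```shell
# Hypothetical helper: read "kubectl logs --prefix" output on stdin and
# print the number of error/fatal lines per pod prefix, highest first.
errors_per_pod() {
  grep -iE 'error|fatal' \
    | awk '{ count[$1]++ } END { for (p in count) print count[p], p }' \
    | sort -rn
}

# Example wiring (requires kubectl access to the cluster):
# kubectl logs -n "$INSTANCE_NAMESPACE" -l cnpg.io/cluster=postgresql,cnpg.io/instanceRole=replica \
#   --prefix --tail=500 | errors_per_pod
```

The pod at the top of the list is usually the best starting point for reading the full log.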