Alert rules: VSHNRedis

Overview

These alerts cover Redis HA issues in AppCat services. They trigger in the following cases:

VSHNRedisNotMaster - The redis-master Service has not pointed to the elected master for 5 minutes.
VSHNRedisQuorumNotOk - The Sentinel quorum check has been failing for 5 minutes.
VSHNRedisQuorumFlapping - The quorum state has flapped repeatedly within 10 minutes.

A Redis HA setup has 1 master and 2 replicas. Each pod runs a Sentinel sidecar that monitors and manages failover.

Failover is automatic. If the master goes down, Sentinel promotes a replica to become the new master.

Service behavior:

redis-master → Always points to the current master.
redis-headless → Resolves to all Redis pods, so a connection may land on a read-only replica. It also exposes Sentinel on port 26379.

Ports:

6379 → Redis (read/write if master, read-only if replica)
26379 → Sentinel

Steps for Debugging

Check pod health

kubectl -n $instanceNamespace get pods
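Each Redis pod runs two containers (Redis and the Sentinel sidecar), so a healthy pod reports two ready flags. The following is a minimal sketch for listing only pods that are not fully ready; `not_ready_pods` is a hypothetical helper, and it assumes input produced by the jsonpath query in its usage comment.

```shell
# not_ready_pods (hypothetical helper): reads lines of the form
#   <pod> <ready> <ready> ...
# and prints pods that have a non-ready container or no container status yet.
not_ready_pods() {
  awk 'NF < 2 { print $1; next }
       { for (i = 2; i <= NF; i++) if ($i != "true") { print $1; break } }'
}

# Usage (requires cluster access):
# kubectl -n "$instanceNamespace" get pods \
#   -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.containerStatuses[*].ready}{"\n"}{end}' \
#   | not_ready_pods
```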

Connect to Sentinel

# The single quotes make the shell inside the pod, not your local shell, read $REDIS_PASSWORD_FILE.
kubectl exec -n $instanceNamespace -it $redisPod -- \
  bash -c 'redis-cli \
    -h 127.0.0.1 \
    -p 26379 \
    --tls \
    --cert /opt/bitnami/redis/certs/tls.crt \
    --key /opt/bitnami/redis/certs/tls.key \
    --cacert /opt/bitnami/redis/certs/ca.crt \
    -a "$( < "$REDIS_PASSWORD_FILE" )"'
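Because the same connection flags repeat throughout this runbook, a small wrapper can help for one-shot commands. This is a sketch, not part of AppCat: `rcli` is a hypothetical name, it assumes `$instanceNamespace` is set in your local shell and `REDIS_PASSWORD_FILE` is set inside the Redis container, and it passes the password through redis-cli's `REDISCLI_AUTH` environment variable instead of `-a`, which avoids redis-cli's warning about passwords on the command line.

```shell
# rcli (hypothetical helper): run a one-shot redis-cli command inside a pod.
# Usage: rcli <pod> <port> [redis-cli arguments...]
# Assumes REDIS_PASSWORD_FILE is set inside the container, not locally.
rcli() {
  local pod="$1" port="$2"
  shift 2
  # The script is single-quoted so the pod's bash expands the password file;
  # bash -c receives "rcli" as $0, the port as $1 and the command after it.
  kubectl exec -n "$instanceNamespace" "$pod" -- bash -c '
    port="$1"
    shift
    export REDISCLI_AUTH="$(< "$REDIS_PASSWORD_FILE")"
    exec redis-cli -h 127.0.0.1 -p "$port" --tls \
      --cert /opt/bitnami/redis/certs/tls.crt \
      --key /opt/bitnami/redis/certs/tls.key \
      --cacert /opt/bitnami/redis/certs/ca.crt "$@"
  ' rcli "$port" "$@"
}

# Example: rcli "$redisPod" 26379 SENTINEL get-master-addr-by-name mymaster
```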

Check current master:

127.0.0.1:26379> SENTINEL get-master-addr-by-name mymaster
1) "redis-master-0.redis-headless.$namespace.svc.cluster.local"
2) "6379"
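VSHNRedisNotMaster means the redis-master Service does not point to the master Sentinel reports. A minimal sketch for comparing the two, assuming Sentinel announces hostnames as in the output above (so the first DNS label is the pod name) and that the redis-master Service has Endpoints whose `targetRef` names the backing pod. Both function names are hypothetical.

```shell
# master_pod_from_host (hypothetical): strip the DNS suffix from the host
# Sentinel reports, e.g. redis-master-0.redis-headless... -> redis-master-0.
master_pod_from_host() {
  printf '%s\n' "${1%%.*}"
}

# service_points_to_master (hypothetical): succeed only if the Service's
# endpoint pods are exactly the pod Sentinel reports as master.
service_points_to_master() {
  local sentinel_host="$1" endpoint_pods="$2"
  [ "$endpoint_pods" = "$(master_pod_from_host "$sentinel_host")" ]
}

# Usage (requires cluster access):
# endpoint_pods="$(kubectl -n "$instanceNamespace" get endpoints redis-master \
#   -o jsonpath='{.subsets[*].addresses[*].targetRef.name}')"
# service_points_to_master "<host from get-master-addr-by-name>" "$endpoint_pods"
```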

Inspect Sentinel’s view of the master

127.0.0.1:26379> SENTINEL master mymaster
 1) "name"
 2) "mymaster"
 3) "ip"
 4) "redis-master-0.redis-headless.$namespace.svc.cluster.local"
 5) "port"
 6) "6379"
 7) "runid"
 8) "562b76a318f63f2dce5b21049e3f82c18798e3cb"
 9) "flags"
10) "master"
11) "link-pending-commands"
...
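When its output is not a terminal (for example, when piped), redis-cli prints array replies one element per line without the `1)` prefixes, so keys and values alternate. A sketch of a hypothetical `sentinel_field` helper that extracts one value, such as `flags`, from that output. A healthy master shows `master`; flags such as `s_down` or `o_down` mean Sentinel considers it down.

```shell
# sentinel_field (hypothetical): read alternating key/value lines on stdin
# (odd lines are keys, even lines are values) and print the value for $1.
sentinel_field() {
  awk -v key="$1" 'NR % 2 == 1 { k = $0; next } k == key { print; exit }'
}
```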

Check Sentinel quorum health

127.0.0.1:26379> SENTINEL CKQUORUM mymaster
OK 3 usable Sentinels. Quorum and failover authorization can be reached

Expected: the output confirms that enough Sentinels are reachable to reach quorum and authorize a failover. If quorum cannot be reached, Sentinel will not perform an automatic failover.
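For scripting, a hypothetical `quorum_ok` helper can gate on the CKQUORUM reply. This sketch treats any reply starting with `OK ` as healthy and anything else as failing, and prints the usable Sentinel count parsed from the reply text shown above.

```shell
# quorum_ok (hypothetical): read a CKQUORUM reply on stdin; succeed and print
# the usable Sentinel count only if the reply starts with "OK ".
quorum_ok() {
  local reply
  reply="$(cat)"
  case "$reply" in
    OK\ *) ;;
    *) return 1 ;;
  esac
  printf '%s\n' "$reply" | sed -n 's/^OK \([0-9][0-9]*\) usable.*/\1/p'
}
```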

Connect to Redis

kubectl exec -n $instanceNamespace -it $redisPod -- \
  bash -c 'redis-cli \
    -h 127.0.0.1 \
    -p 6379 \
    --tls \
    --cert /opt/bitnami/redis/certs/tls.crt \
    --key /opt/bitnami/redis/certs/tls.key \
    --cacert /opt/bitnami/redis/certs/ca.crt \
    -a "$( < "$REDIS_PASSWORD_FILE" )"'

Check replication state:

127.0.0.1:6379> INFO replication
# Replication
role:master
connected_slaves:2
slave0:ip=redis-replicas-0.redis-headless.$namespace.svc.cluster.local,port=6379,state=online
slave1:ip=redis-replicas-1.redis-headless.$namespace.svc.cluster.local,port=6379,state=online
master_failover_state:no-failover

Expected:

role:master on exactly one pod
All replicas listed with state=online
master_failover_state:no-failover during normal operation
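A minimal sketch that applies these expectations to `INFO replication` output taken from the master. The function name `check_replication` is hypothetical. `INFO` separates lines with `\r\n`, so carriage returns are stripped before parsing.

```shell
# check_replication (hypothetical): read INFO replication output on stdin,
# print a one-line summary, and succeed only if the pod is a master whose
# listed replicas are all online and no failover is in progress.
check_replication() {
  tr -d '\r' | awk -F: '
    $1 == "role"                            { role = $2 }
    $1 == "connected_slaves"                { connected = $2 }
    $1 ~ /^slave[0-9]+$/ && /state=online/  { online++ }
    $1 == "master_failover_state"           { failover = $2 }
    END {
      printf "role=%s connected=%d online=%d failover=%s\n", role, connected, online, failover
      ok = (role == "master" && online + 0 == connected + 0 && failover == "no-failover")
      exit (ok ? 0 : 1)
    }'
}
```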

Steps for Remediation

If no master is elected (VSHNRedisNotMaster), restart the Redis StatefulSet to trigger a new master election:

kubectl -n $instanceNamespace rollout restart statefulset redis-node

Verify a new master via Sentinel:

127.0.0.1:26379> SENTINEL get-master-addr-by-name mymaster
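The restart takes a while, so polling is more reliable than a single check. A sketch of a hypothetical generic retry helper, `wait_until`; the usage comment assumes the StatefulSet name from the restart command above, and the check you pass in must exit non-zero until Sentinel reports a master.

```shell
# wait_until (hypothetical): run a command up to <attempts> times, sleeping
# <delay> seconds between tries; succeed on the first successful run.
wait_until() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    # No sleep after the final attempt.
    if [ "$i" -lt "$attempts" ]; then sleep "$delay"; fi
    i=$((i + 1))
  done
  return 1
}

# Usage (requires cluster access):
# kubectl -n "$instanceNamespace" rollout status statefulset redis-node --timeout=10m
# wait_until 30 10 <check that exits 0 once get-master-addr-by-name returns a host>
```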

If quorum is not OK (VSHNRedisQuorumNotOk):

Confirm that at least 2 of the 3 Sentinels are running and healthy.
Restart pods whose Sentinel sidecar is unhealthy, if needed.
Verify quorum with:

127.0.0.1:26379> SENTINEL CKQUORUM mymaster
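The 2-of-3 figure follows from how Sentinel authorizes a failover: besides the configured quorum agreeing that the master is down, a majority of all Sentinels, floor(N/2) + 1, must be reachable to elect the Sentinel that performs the failover. With N = 3 that is 2. A small sketch of the arithmetic, with hypothetical function names:

```shell
# sentinel_majority (hypothetical): Sentinels needed to authorize a failover,
# floor(N / 2) + 1 for N known Sentinels.
sentinel_majority() {
  echo $(( $1 / 2 + 1 ))
}

# can_authorize_failover (hypothetical): <usable> <total>
can_authorize_failover() {
  [ "$1" -ge "$(sentinel_majority "$2")" ]
}
```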

If quorum is flapping (VSHNRedisQuorumFlapping):

Inspect the Sentinel logs for repeated promotions and demotions.
Verify pod-to-pod networking and node stability.
Check whether the instability is tied to a specific node.
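Sentinel logs events such as `+sdown`, `-sdown`, `+odown` and `+switch-master`; several `+switch-master` events in a short window indicate flapping. A sketch of a hypothetical `count_sentinel_events` helper that tallies them from log lines. The container name `sentinel` in the usage comment is an assumption based on the Bitnami chart layout; check the pod spec for the actual name.

```shell
# count_sentinel_events (hypothetical): read Sentinel log lines on stdin and
# print "<event> <count>" for sdown/odown/switch-master events, sorted.
count_sentinel_events() {
  awk '{
    for (i = 1; i <= NF; i++)
      if ($i ~ /^[+-](switch-master|sdown|odown)$/) count[$i]++
  } END { for (e in count) print e, count[e] }' | LC_ALL=C sort
}

# Usage (requires cluster access; container name is an assumption):
# kubectl -n "$instanceNamespace" logs "$redisPod" -c sentinel --since=10m \
#   | count_sentinel_events
```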

Manual Failover

If Sentinel fails to promote a new master automatically, you can trigger a manual failover. Only do this if quorum is healthy. Check first:

127.0.0.1:26379> SENTINEL CKQUORUM mymaster
OK 3 usable Sentinels. Quorum and failover authorization can be reached

Then trigger failover:

kubectl exec -n $instanceNamespace -it $redisPod -- \
  bash -c 'redis-cli \
    -h 127.0.0.1 \
    -p 26379 \
    --tls \
    --cert /opt/bitnami/redis/certs/tls.crt \
    --key /opt/bitnami/redis/certs/tls.key \
    --cacert /opt/bitnami/redis/certs/ca.crt \
    -a "$( < "$REDIS_PASSWORD_FILE" )"'

127.0.0.1:26379> SENTINEL failover mymaster

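This forces the connected Sentinel to fail over immediately, as if the master were unreachable and without asking the other Sentinels for agreement; the other Sentinels then pick up the new configuration. The old master rejoins as a replica when it comes back online.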
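After the failover, exactly one pod should report `role:master`. A sketch of a hypothetical `single_master` check; it expects one `<pod> <role>` line per pod, which you can assemble by running `INFO replication` on each pod and reading its `role:` field.

```shell
# single_master (hypothetical): read "<pod> <role>" lines on stdin; succeed
# and print the master pod only if exactly one pod reports role "master".
single_master() {
  awk '$2 == "master" { n++; m = $1 }
       END { if (n == 1) print m; exit (n == 1 ? 0 : 1) }'
}
```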
