Alert Group: Syn-Prometheus
Alert Rule: SYN_PrometheusRemoteWriteDesiredShards
Overview
This alert may indicate that the remote write receiver isn’t accepting metrics due to an internal problem or that there’s a network issue between Prometheus and the remote write receiver. If the remote write receiver is a Mimir instance, the root cause may be that the ngnix in front of the Mimir components has stale pod IPs in its DNS cache.
Investigate
-
Check that the remote write receiver’s endpoint is reachable
kubectl -n openshift-monitoring --as=cluster-admin exec -it prometheus-k8s-0 -- curl <remote-write-endpoint>
-
Check the remote write receiver for any issues. If the remote write receiver which has trouble is the VSHN central metrics Mimir instance, check the Mimir nginx pod logs on APPUiO Cloud cloudscale.ch LPG 2.
kubectl -n vshn-appuio-mimir logs deploy/vshn-appuio-mimir-nginx --tail=200
Errors like the following indicate that nginx’s DNS cache contains stale pod IPs.
2022/12/13 08:50:31 [error] 9#9: *2748893 vshn-appuio-mimir-distributor-headless.vshn-appuio-mimir.svc.cluster.local could not be resolved (110: Operation timed out), client: 10.128.10.35, server: , request: "POST /api/v1/push HTTP/1.1", host: "metrics-receive.appuio.net"