Alert Group: Syn-Prometheus

Alert Rule: SYN_PrometheusRemoteWriteBehind

Alert Rule: SYN_PrometheusRemoteWriteDesiredShards

Overview

This alert may indicate that the remote write receiver isn’t accepting metrics due to an internal problem or that there’s a network issue between Prometheus and the remote write receiver. If the remote write receiver is a Mimir instance, the root cause may be that the ngnix in front of the Mimir components has stale pod IPs in its DNS cache.

Investigate

  • Check that the remote write receiver’s endpoint is reachable

    kubectl -n openshift-monitoring --as=cluster-admin exec -it prometheus-k8s-0 -- curl <remote-write-endpoint>
  • Check the remote write receiver for any issues. If the remote write receiver which has trouble is the VSHN central metrics Mimir instance, check the Mimir nginx pod logs on APPUiO Cloud cloudscale.ch LPG 2.

    kubectl -n vshn-appuio-mimir logs deploy/vshn-appuio-mimir-nginx --tail=200

    Errors like the following indicate that nginx’s DNS cache contains stale pod IPs.

    2022/12/13 08:50:31 [error] 9#9: *2748893 vshn-appuio-mimir-distributor-headless.vshn-appuio-mimir.svc.cluster.local could not be resolved (110: Operation timed out), client: 10.128.10.35, server: , request: "POST /api/v1/push HTTP/1.1", host: "metrics-receive.appuio.net"

Resolve

If you’ve identified that the Mimir nginx is the cause of the issue, restart the nginx pod.

k -n vshn-appuio-mimir delete po -l app.kubernetes.io/component=nginx