Monitoring and Metrics for Converged VSHN Services
Following our decision for PostgreSQL, we will use the same SLI exporter to monitor the SLIs of the other services.
The exporter will be extended with the logic needed to check the SLIs of other services as well. To check a database, the SLI exporter will perform a trivial query. What that query looks like depends on the database; if the database already provides a health endpoint, that endpoint should be used instead. If the query is not successful, the exporter treats it as downtime.
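The probing behaviour described above could be sketched roughly as follows. This is an illustration only, not the exporter's actual code: the names ProbeStats, run_probe, sqlite_trivial_query and broken_check are hypothetical, and an in-memory SQLite database stands in for a real service so that the example is self-contained.

```python
import sqlite3
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProbeStats:
    """Running totals for one monitored service instance (hypothetical type)."""
    total: int = 0
    failed: int = 0

    def availability(self) -> float:
        # With no probes recorded yet, report full availability.
        if self.total == 0:
            return 1.0
        return 1.0 - self.failed / self.total


def run_probe(check: Callable[[], None], stats: ProbeStats) -> bool:
    """Run one check and record it; any exception counts as downtime."""
    stats.total += 1
    try:
        check()
    except Exception:
        stats.failed += 1
        return False
    return True


def sqlite_trivial_query() -> None:
    # Stand-in for a real database: connect and run a trivial query.
    conn = sqlite3.connect(":memory:")
    try:
        row = conn.execute("SELECT 1").fetchone()
        if row != (1,):
            raise RuntimeError("unexpected query result")
    finally:
        conn.close()


def broken_check() -> None:
    # Simulates a service that cannot be reached.
    raise ConnectionError("database unreachable")


if __name__ == "__main__":
    stats = ProbeStats()
    run_probe(sqlite_trivial_query, stats)
    run_probe(broken_check, stats)
    print(f"probes={stats.total} failed={stats.failed}")
```

A service that exposes its own health endpoint would plug in a different check function in place of the trivial query; the recording logic stays the same.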
For general application metrics, any exporter that is already included with the given solution should be used (for example, if a Redis Helm chart brings the Redis exporter with it). Similarly, existing Grafana dashboards should be leveraged and adjusted to our needs. These metrics also provide the basis for capacity alerting.
SLA Exception Handling
Some failures are exceptions that do not count against the SLA. To catch as many of these exceptions as possible, a combination of the custom exporter and Schedar’s SLA reporting tool will be necessary. Some exceptions can potentially be caught via the reporter; others can only be caught and identified as such after the metrics have already been written.
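As a rough illustration of handling exceptions after the metrics have been written, the sketch below recomputes availability while ignoring samples that fall into known exception windows. It rests on assumptions not made in this document: that exceptions can be expressed as time windows, and that recorded metrics are available as timestamped up/down samples. The names Sample, in_any_window and sla_availability are hypothetical and do not describe the actual reporting tool.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Sample:
    """One recorded probe result (hypothetical shape)."""
    timestamp: int  # seconds since epoch
    up: bool


# An exception window as a half-open interval [start, end) in seconds.
Window = Tuple[int, int]


def in_any_window(timestamp: int, windows: List[Window]) -> bool:
    """Return True if the timestamp lies inside any exception window."""
    return any(start <= timestamp < end for start, end in windows)


def sla_availability(samples: List[Sample], windows: List[Window]) -> float:
    """Share of successful samples, ignoring those inside exception windows."""
    counted = [s for s in samples if not in_any_window(s.timestamp, windows)]
    if not counted:
        # Nothing left to measure: treat the period as fully available.
        return 1.0
    return sum(1 for s in counted if s.up) / len(counted)


if __name__ == "__main__":
    samples = [
        Sample(0, True),
        Sample(60, False),   # failure inside the exception window below
        Sample(120, False),  # failure that still counts against the SLA
        Sample(180, True),
    ]
    print(sla_availability(samples, []))
    print(sla_availability(samples, [(60, 120)]))
```

Dropping every sample inside a window, rather than only the failed ones, keeps the calculation neutral for that period; whether that matches the intended SLA semantics is an open design choice.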