Monitoring and Metrics for Converged VSHN Services
The exporter will be extended with the logic needed to check the SLIs of other services as well. To check databases, the SLI exporter will perform a trivial query. What that query looks like depends on the database. If the database already provides a health endpoint, that endpoint should be used instead. If the query is not successful, the exporter treats it as downtime.
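The probing logic described above can be sketched roughly as follows. This is only an illustration, not the exporter's actual implementation: the names `ProbeResult`, `probe`, and `sqlite_trivial_query` are hypothetical, and SQLite stands in for whichever database is being checked. The point it shows is that any failure of the trivial query is recorded as downtime.

```python
import sqlite3
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProbeResult:
    up: bool            # False is what the exporter would count as downtime
    duration_s: float   # how long the probe took
    error: str = ""     # failure reason, empty on success


def probe(check: Callable[[], None]) -> ProbeResult:
    """Run one probe; any exception raised by the check counts as downtime."""
    start = time.monotonic()
    try:
        check()
    except Exception as exc:
        return ProbeResult(False, time.monotonic() - start, str(exc))
    return ProbeResult(True, time.monotonic() - start)


def sqlite_trivial_query(path: str) -> Callable[[], None]:
    """Build a check that runs a trivial query (SQLite is an assumed stand-in)."""
    def check() -> None:
        conn = sqlite3.connect(path)
        try:
            row = conn.execute("SELECT 1").fetchone()
            if row != (1,):
                raise RuntimeError(f"unexpected result: {row!r}")
        finally:
            conn.close()
    return check


# A reachable database is reported as up, an unreachable one as down.
healthy = probe(sqlite_trivial_query(":memory:"))
broken = probe(sqlite_trivial_query("/nonexistent-dir/db.sqlite"))
print(healthy.up, broken.up)
```

For a database that offers its own health endpoint, the same `probe` function could wrap a check against that endpoint instead of a query.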
For general application metrics, any exporter already included with the given solution should be used (for example, if a Redis Helm chart ships the Redis exporter). Similarly, existing Grafana dashboards should be leveraged and adjusted to our needs. These metrics also provide the basis for capacity alerting.
There are exceptions to which the SLA does not apply. To catch as many of these exceptions as possible, a combination of the custom exporter and Schedar’s SLA reporting tool will be necessary. Some exceptions can potentially be caught by the reporter; others can only be caught and identified as such after the metrics have already been written.
SLO alerting is routed to VSHN and handled by whoever is Responsible Ops. To that end, the cluster monitoring will be leveraged, and labels ensure that the alerts are routed correctly.
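As a rough illustration of label-based routing, the sketch below picks a receiver for an alert by matching its labels against an ordered list of routes. Everything in it is an assumption made for the example: the function `route_alert`, the label names `team` and `severity`, and the receiver names are hypothetical, and the first-match logic is a simplification rather than the behavior of the actual monitoring stack.

```python
from typing import Mapping, Sequence, Tuple

# A route is (labels that must all match, receiver name).
Route = Tuple[Mapping[str, str], str]


def route_alert(labels: Mapping[str, str],
                routes: Sequence[Route],
                default: str) -> str:
    """Return the receiver of the first route whose labels all match."""
    for matchers, receiver in routes:
        if all(labels.get(key) == value for key, value in matchers.items()):
            return receiver
    return default


# Hypothetical routing table: label names and receivers are illustrative only.
ROUTES: Sequence[Route] = [
    ({"team": "vshn", "severity": "critical"}, "responsible-ops-pager"),
    ({"team": "vshn"}, "responsible-ops-tickets"),
]

slo_alert = {"alertname": "SLOBurnRate", "team": "vshn", "severity": "critical"}
other_alert = {"alertname": "DiskFull", "team": "customer"}

print(route_alert(slo_alert, ROUTES, "cluster-default"))
print(route_alert(other_alert, ROUTES, "cluster-default"))
```

Ordering the routes from most to least specific keeps the matching logic trivial, at the cost of having to maintain that order by hand.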