Multi-Tenancy

This document explains the implementation of multi-tenancy in the centralized monitoring.

Basics: X-Scope-OrgID

Multi-tenancy in Mimir is implemented by setting the X-Scope-OrgID header in all HTTP requests when writing data to and reading data from Mimir.

The header value is the organization’s unique Keycloak group name: a human-readable, all-lowercase, single-word identifier. Don’t confuse it with the Keycloak group ID, which is a database row identifier, or with the organization’s displayName, which contains the proper name of the organization.
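As a rough illustration of a tenant-scoped read, the following Python sketch builds (but does not send) a query request with the header set. The base URL, the /prometheus/api/v1/query path prefix, the function name, and the tenant name pattern are assumptions made for this example, not details taken from the actual setup.

```python
import re
import urllib.parse
import urllib.request

# Assumption: tenant names are all-lowercase single words. The exact
# character set allowed by Keycloak is a guess made for this sketch.
TENANT_RE = re.compile(r"^[a-z0-9][a-z0-9_-]*$")


def build_query_request(base_url: str, tenant: str, promql: str) -> urllib.request.Request:
    """Build (but do not send) an instant-query request scoped to one tenant."""
    if not TENANT_RE.match(tenant):
        raise ValueError(f"not a valid tenant name: {tenant!r}")
    query = urllib.parse.urlencode({"query": promql})
    # Assumption: Mimir serves the Prometheus-compatible API under /prometheus.
    url = f"{base_url.rstrip('/')}/prometheus/api/v1/query?{query}"
    # The header selects the tenant whose data the query is evaluated against.
    return urllib.request.Request(url, headers={"X-Scope-OrgID": tenant})


# Hypothetical host name; "vshn" is the example organization used in this document.
req = build_query_request("https://mimir.example.com", "vshn", "up")
```

Rejecting names that are not lowercase single words is a cheap guard against accidentally passing a displayName instead of the Keycloak group name.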

Making sure that this header is set to the correct value in all use cases poses a few challenges, which are explained below.

Note that sometimes the X-Scope-OrgID header is used to isolate metrics not by organization, but by use case (e.g. billing). This document ignores those cases.

Prometheus Remote Writes

Prometheus can be configured to set the X-Scope-OrgID header when sending data to Mimir. This configuration may look something like this:

prometheus:
  _remoteWrite:
    mymetrics:
      url: [...]
      headers:
        "X-Scope-OrgID": somecustomername
      writeRelabelConfigs:
        [...]

This configuration works well when a single Prometheus instance is responsible for exactly one customer, as is the case, for example, on an APPUiO Managed cluster.

Prometheus Remote Writes: Multiple Organizations

There are situations in which a single Prometheus instance is responsible for multiple organizations. A typical example is the openshift-user-workload-monitoring Prometheus instance in an APPUiO Cloud cluster. This Prometheus instance handles metrics from many namespaces which can belong to different organizations.

Prometheus itself cannot set the X-Scope-OrgID header dynamically, so a proxy server is needed to rewrite the HTTP requests. The cortex-tenant-ns-label proxy does this job. It is a fork of the cortex-tenant proxy, which does a similar but not identical job.

The cortex-tenant-ns-label proxy reads all namespaces and their annotations from the Kubernetes API and converts them into a lookup table (namespace → organization name), which is refreshed regularly. The proxy receives remote writes from Prometheus, analyzes them, separates the metrics by their namespace label, and forwards them to Mimir with the correct X-Scope-OrgID header from the lookup table. A single Prometheus remote write HTTP request likely needs to be split into multiple HTTP requests to Mimir, because the former can mix metrics from different organizations while the latter can’t.
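The lookup and split steps can be sketched in a few lines of Python. This is only an illustration of the idea, not the proxy's actual code: the annotation key, the data shapes, and the policy of dropping series with an unknown namespace are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical annotation key; the key the real proxy reads is not stated here.
ORG_ANNOTATION = "example.com/organization"


def build_lookup(namespaces):
    """Map namespace name -> organization name from namespace annotations."""
    lookup = {}
    for ns in namespaces:
        org = ns.get("annotations", {}).get(ORG_ANNOTATION)
        if org:
            lookup[ns["name"]] = org
    return lookup


def split_by_tenant(series, lookup):
    """Group time series by the organization owning their namespace label.

    Series without a namespace label, or with a namespace that is not in
    the lookup table, are dropped. That policy is an assumption of this sketch.
    """
    batches = defaultdict(list)
    for ts in series:
        org = lookup.get(ts["labels"].get("namespace"))
        if org is not None:
            batches[org].append(ts)
    return dict(batches)


namespaces = [
    {"name": "shop-prod", "annotations": {ORG_ANNOTATION: "acme"}},
    {"name": "blog", "annotations": {ORG_ANNOTATION: "vshn"}},
    {"name": "scratch", "annotations": {}},
]
series = [
    {"labels": {"__name__": "up", "namespace": "shop-prod"}, "samples": [(1, 1.0)]},
    {"labels": {"__name__": "up", "namespace": "blog"}, "samples": [(1, 1.0)]},
    {"labels": {"__name__": "up", "namespace": "scratch"}, "samples": [(1, 0.0)]},
]

lookup = build_lookup(namespaces)
batches = split_by_tenant(series, lookup)
# Each key in `batches` would become one request to Mimir, sent with
# X-Scope-OrgID set to that key.
```

One incoming request with three series turns into two outgoing batches here, one per organization, which is why a single remote write may fan out into several requests.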

There is one cortex-tenant-ns-label proxy per APPUiO Cloud cluster, installed via the SYN component component-cortex-tenant-ns-label. Prometheus remote writes are configured as usual, without the X-Scope-OrgID header, but with the destination URL set to the cortex-tenant-ns-label proxy.

Grafana

Grafana is partially multi-tenant capable through its support for multiple organizations. There are many details to consider, though.

Grafana cannot filter the data it receives from Mimir, so we have to make sure that Mimir only returns data for the organization that is currently selected in Grafana. This requires setting the correct X-Scope-OrgID header when querying Mimir, which can be achieved with a dedicated data source per organization that has this header set to the organization name. As a result, we can’t give users admin privileges, even in their own Grafana organization, because that would allow them to change the X-Scope-OrgID header value and see other organizations’ data. Organization users must therefore remain restricted to the editor and viewer roles, and this restriction must be enforced. Editor permissions are sufficient to create and change dashboards, so this should not be a problem in practice.
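To make the per-organization data source more concrete, here is a minimal Python sketch that builds a data source definition of the kind Grafana accepts, using Grafana's httpHeaderName1 / httpHeaderValue1 convention for custom headers. The function name, data source name, and URL are hypothetical, and the operator mentioned in this document may construct its data sources differently.

```python
def mimir_datasource(org_name: str, mimir_url: str) -> dict:
    """Build a Grafana data source definition pinned to a single tenant."""
    return {
        "name": "Mimir",  # hypothetical data source name
        "type": "prometheus",
        "access": "proxy",  # Grafana's backend sends the request, not the browser
        "url": mimir_url,
        # Custom header name goes into jsonData ...
        "jsonData": {"httpHeaderName1": "X-Scope-OrgID"},
        # ... and its value into secureJsonData, which Grafana stores encrypted.
        "secureJsonData": {"httpHeaderValue1": org_name},
    }


# Hypothetical URL; "vshn" is the example organization used in this document.
ds = mimir_datasource("vshn", "https://mimir.example.com/prometheus")
```

Because the header value lives in the data source definition, anyone who can edit data sources can change the tenant, which is the reason for withholding the admin role.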

Grafana does not offer a way to store the Keycloak organization name, which is a problem because we need to reliably link Grafana organizations to Keycloak groups (which represent organizations). We solve this by using a particular naming scheme for the Grafana organization: [keycloak organization name] - [organization display name], e.g. vshn - VSHN AG. Since the Keycloak organization name cannot contain spaces, this scheme is unambiguous.
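The naming scheme can be encoded and decoded as in the following Python sketch. The function names are made up for this example; the point is that splitting at the first " - " always recovers the Keycloak name, even when the display name itself contains " - ".

```python
SEPARATOR = " - "


def grafana_org_name(keycloak_name: str, display_name: str) -> str:
    """Compose the Grafana organization name from its two parts."""
    if not keycloak_name or " " in keycloak_name:
        raise ValueError(f"invalid Keycloak organization name: {keycloak_name!r}")
    return f"{keycloak_name}{SEPARATOR}{display_name}"


def parse_grafana_org_name(name: str) -> tuple[str, str]:
    """Split a Grafana organization name into (keycloak_name, display_name)."""
    # The Keycloak name has no spaces, so the first separator is the real one.
    keycloak_name, sep, display_name = name.partition(SEPARATOR)
    if not sep or not keycloak_name or " " in keycloak_name:
        raise ValueError(f"name does not follow the scheme: {name!r}")
    return keycloak_name, display_name


parsed = parse_grafana_org_name("vshn - VSHN AG")
tricky = parse_grafana_org_name("acme - Acme - Tools Inc.")
```

Names that do not follow the scheme, such as a default organization without a separator, raise an error in this sketch rather than being guessed at.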

If we want to provide some standard dashboards to all organizations, these need to be set up in every organization separately. There are also situations in which Grafana modifies a dashboard, and without further measures any sync process that sets up these dashboards would revert the changes.

This complex setup needs to be maintained automatically. This is the job of the grafana-organizations-operator (installed via component-grafana-organizations-operator) where you will also find more details on how this works.