ADR 0054 - VSHN Managed Forgejo Runners

Author	Mike Ditton
Owner	Schedar
Reviewers	Schedar
Date Created	2026-05-20
Date Updated	2026-06-23
Status	rejected
Tags	forgejo,ci,code-hosting

Context

Managed Forgejo currently ships without runners. Customers have to bring their own to run Forgejo Actions, which is a gap in our offering. We want to provide managed runners as part of the service.

This leaves us with three design questions:

How is the runner classified and where does it live? Is it a nested service or an AddOn? Is it a single shared pool or one runner per instance? And if per instance, does it live in the Forgejo instance namespace or in its own dedicated namespace?
How is the runner registered with Forgejo? Forgejo has to issue a registration credential, and that credential has to reach the runner’s config. We can drive this with a dedicated Crossplane provider or with provider-http.
How are jobs executed and isolated? The runner runs arbitrary customer CI workloads, so we need to decide on the executor backend and the isolation model.

Requirements

CI jobs from one customer must not be able to access another customer’s workloads or data.
Runner must support arbitrary CI workloads including Docker builds.
Runner and job containers must run unprivileged and be compatible with OpenShift restricted SCCs.
Resource usage must be isolatable from the Forgejo workload.

Solutions

Helm chart

The official Forgejo runner Helm chart is used to deploy the runner. It is the officially maintained chart and is structurally identical to the wrenix chart used in the PoC, which was its upstream before the Forgejo project took it over.

It exposes everything we need:

runner.config.existingSecret: points the runner at a pre-built .runner credential file, skipping the chart’s own registration job. This is what makes the provider-http registration approach work.
dind.* values: built-in Docker-in-Docker sidecar support.
runner.config.file.container.privileged: defaults to false for job containers.
securityContext.privileged: defaults to true (required for DinD); must be set to false for the rootless option.
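
As an illustration, the values for the rootless variant could look roughly like the sketch below. The nesting is inferred from the dotted key paths listed above and should be checked against the chart’s own values.yaml. The Secret name forgejo-runner-credentials is a placeholder, not something the chart or the PoC defines.

```yaml
# values.yaml sketch for the rootless option; only keys named above are used.
runner:
  config:
    # Pre-built Secret holding the .runner credential file (placeholder name).
    # Setting this skips the chart's own registration job.
    existingSecret: forgejo-runner-credentials
    file:
      container:
        # Job containers stay unprivileged (chart default).
        privileged: false
securityContext:
  # Chart default is true (required for DinD); false for the rootless option.
  privileged: false
```

The dind.* values are left out on purpose: they only apply to the Docker-in-Docker variant.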

The chart can be forked if we need to tweak it for our use case in the future, similar to how we handle other third-party charts.

Runner classification and topology

The runner is modelled as an AddOn, not a nested service. A nested service is a component that the outer service requires and cannot run without, and that is not billed separately. PostgreSQL in Keycloak and Nextcloud follows this pattern. The runner is the inverse: Forgejo runs perfectly well without it, enabling it is optional, and it carries its own billing. That makes it an AddOn, the same model used for Collabora. The runner cannot function without a Forgejo instance, so it is tightly coupled, but coupling alone does not make something a nested service. Optionality and separate billing are what make it an AddOn.

This classification is independent of where the runner Pod is placed: the same AddOn can live in the Forgejo instance namespace or in its own namespace. The options below cover the shared-vs-per-instance choice, the per-instance placement choice, and, for completeness, the out-of-cluster alternative.

Shared pool: a single runner pool shared across all customer instances.
Instance namespace: a per-instance runner AddOn deployed alongside the Forgejo workload in the instance namespace.
Dedicated namespace: a per-instance runner AddOn deployed in its own namespace.
CSP compute instance: a dedicated VM provisioned per instance on the CSP via Crossplane, with the runner hosted on the VM rather than in the cluster. A Crossplane CSP provider provisions the VM (see Provisioning the compute instance), and the forgejo-runner is installed on it as a systemd-managed binary or via Docker Compose (see Runner installation on the VM). Registration reuses the same provider-http admin-API flow: the pre-generated UUID/token pair is templated into the runner’s config so it comes up non-interactively. The runner initiates all connections to the Forgejo server, so the VM can live in an isolated, firewalled subnet and does not need to be directly reachable.

Criteria	Shared pool	Instance namespace	Dedicated namespace	CSP compute instance
Tenant isolation	❌ jobs from different customers share infrastructure	✅ per-instance namespace boundary	✅ per-instance namespace boundary	✅ strongest: dedicated VM, OS-level boundary; an escape only reaches a disposable VM, not shared cluster nodes
Billing	❌ usage must be tracked across a shared resource	✅ maps cleanly to the instance	✅ maps cleanly to the instance	✅ maps cleanly to the instance, but adds separate CSP compute cost
Resource limits	⚠️ shared across all tenants	⚠️ coupled with the Forgejo workload	✅ independent `LimitRange`/`ResourceQuota`	✅ fully independent: own VM sizing, zero impact on the cluster
Multiple runners per instance	❌	⚠️ clutters the instance namespace	✅ natural to add	⚠️ natural (more VMs) but each adds cost and provisioning latency
Bootstrap effort	✅ single deployment	✅ no extra namespace wiring	⚠️ must provision and wire the dedicated namespace	❌ highest: provision a VM on the CSP, install/register the runner, wire networking; depends on a CSP Crossplane provider
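
To make the "independent LimitRange/ResourceQuota" cell concrete, a dedicated runner namespace could carry a quota along the lines of the sketch below. The namespace name and all numbers are hypothetical; real values would follow from the runner sizes we end up offering.

```yaml
# Sketch: quota scoped to a dedicated runner namespace (names and numbers are placeholders).
apiVersion: v1
kind: ResourceQuota
metadata:
  name: runner-quota
  namespace: forgejo-runner-example
spec:
  hard:
    # Aggregate caps across all Pods in the namespace.
    requests.cpu: "2"
    requests.memory: 4Gi
    limits.cpu: "4"
    limits.memory: 8Gi
    # Aggregate cap across all PersistentVolumeClaims in the namespace.
    requests.storage: 50Gi
```

In the instance namespace the same object would also count the Forgejo workload against the quota, which is the coupling the table refers to.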

The CSP compute option also unblocks privileged / Docker-in-Docker execution. On a shared cluster, a privileged job or a container escape reaches the node, and potentially other tenants and the control plane, so it must be avoided (see Job execution and isolation). On a dedicated VM, ideally ephemeral (--ephemeral, one job per runner), the blast radius of an escape is the disposable VM itself. That makes container.privileged = true and DinD an acceptable trade-off for builds that genuinely need them. The cost is provisioning latency, recurring CSP compute spend, and a dependency on a per-CSP provisioning provider. The one remaining security item is network segmentation: an isolated subnet plus firewall, so an escaped job cannot reach internal services.

This option has two sub-decisions of its own, both evaluated below: how the runner is installed on the VM, and how the VM itself is provisioned per CSP.

Runner installation on the VM

Forgejo supports installing the runner as an OS-level binary or running it as a container. In both cases registration is decoupled from the runner process: the credential can be pre-generated, which is what lets us reuse the provider-http admin-API flow.

Registration writes a UUID/token pair into the runner config’s server.connections.forgejo section. The pair can be generated server-side ahead of time (forgejo forgejo-cli actions register --secret …, or the admin API) and templated in, so the runner starts non-interactively without the forgejo-runner register prompt. The deprecated interactive register command still exists but is not needed. For throwaway per-instance VMs the recommended mode is --ephemeral (one job per runner, enforced by Forgejo).

Running Docker builds is a requirement, so the VM needs a Docker daemon either way. The two options below differ only in how that daemon is provided and how the runner process itself is managed, not in whether a daemon exists.

Option A: Package / binary + host Docker daemon

A single static forgejo-runner binary runs on the VM as a dedicated runner user under a systemd unit (forgejo-runner daemon -c runner-config.yml). The config is produced with forgejo-runner generate-config and passed explicitly with -c/--config, since it is not discovered automatically. Docker is installed on the VM as a normal host daemon (via cloud-init or the golden image), and the runner uses that daemon directly for docker-label jobs and builds.
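
A cloud-init sketch of this setup is shown below, assuming the forgejo-runner binary, the Docker package and the generated config are already present on the golden image. The file paths, the unit name and the user name runner are assumptions for illustration only.

```yaml
#cloud-config
# Sketch only: paths, unit name and user name are assumptions.
groups:
  - docker
users:
  - name: runner
    system: true
    shell: /usr/sbin/nologin
    # Membership in the docker group gives the runner access to the host daemon.
    groups: [docker]
write_files:
  - path: /etc/systemd/system/forgejo-runner.service
    permissions: "0644"
    content: |
      [Unit]
      Description=Forgejo runner
      After=docker.service network-online.target
      Wants=network-online.target
      Requires=docker.service

      [Service]
      User=runner
      # The config file is passed explicitly; it is not discovered automatically.
      ExecStart=/usr/local/bin/forgejo-runner daemon -c /etc/forgejo-runner/runner-config.yml
      Restart=on-failure

      [Install]
      WantedBy=multi-user.target
runcmd:
  - systemctl daemon-reload
  - systemctl enable --now forgejo-runner.service
```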

The official "packaging" docs only cover NixOS (services.gitea-actions-runner.*); for a generic Linux VM the binary-installation path is the relevant one.

Advantages:

Fewest layers: the runner talks to the host Docker daemon directly, so builds run at native performance with no nested daemon.
Minimal moving parts to image and operate: one binary, one systemd unit, one config file, plus the host Docker package.

Disadvantages:

We own forgejo-runner (and Docker) updates/patching on the VM.
Jobs are given access to the host Docker daemon. On a shared host this would be unacceptable, but on a single-tenant disposable VM the whole VM is the trust boundary, so it is acceptable.

Option B: Docker Compose with DinD

The published docker-compose.yml runs two services: a docker:dind daemon and the forgejo/runner image (as non-root 1001:1001), pointed at the DinD daemon via DOCKER_HOST. Config is generated with docker run --rm … forgejo-runner generate-config. The VM only needs Docker and the compose file; the build daemon is the bundled DinD container.
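
The shape of that stack can be sketched as follows. This is reconstructed from the description above, not copied from the published file. The runner image reference and tag, the daemon port, the ./data mount and the runner command line are assumptions to verify against the upstream compose file.

```yaml
# Sketch reconstructed from the description above; not the published file.
services:
  docker-in-docker:
    image: docker:dind
    privileged: true
    # Plaintext TCP as in the stock file; isolation relies on the compose network.
    command: ["dockerd", "-H", "tcp://0.0.0.0:2375", "--tls=false"]
    restart: unless-stopped

  forgejo-runner:
    # Placeholder image reference; pin to a released runner version.
    image: forgejo/runner:PINNED_VERSION
    # Non-root user, as described above.
    user: "1001:1001"
    depends_on:
      - docker-in-docker
    environment:
      # Point the runner at the bundled DinD daemon instead of a host socket.
      DOCKER_HOST: tcp://docker-in-docker:2375
    volumes:
      # Hypothetical host directory holding the generated config and runner state.
      - ./data:/data
    command: ["forgejo-runner", "daemon", "--config", "config.yml"]
    restart: unless-stopped
```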

Advantages:

Self-contained, fully pinned (runner image + DinD image); the runner and its build daemon are reproducible artifacts rather than host packages.
The build daemon is a DinD container separate from the host daemon.

Disadvantages:

Extra nesting (DinD), and the stock compose exposes the daemon over plaintext TCP (--tls=false), relying on the compose network for isolation.
Still requires Docker installed on the VM to run the stack.

Both are viable; the choice is host daemon vs. bundled DinD, not "needs Docker or not." Option A is the leaner default: on a single-tenant disposable VM, letting jobs use the host Docker daemon is acceptable because the VM is the blast radius, and it avoids DinD nesting. Option B is preferable when we want the runner and its build daemon shipped as one self-contained, version-pinned artifact. Either way, because the VM is dedicated and disposable, the privileged/DinD risks that rule these out on the shared cluster are acceptable here. The compensating control is subnet isolation plus firewall, and docker-socket automount into job containers (container.docker_host = automount) is still avoided.

Provisioning the compute instance

The VM has to be created on each CSP through Crossplane. The provider landscape differs per CSP, so there is no single answer. The options below range from official native providers to a Terraform-bridge fallback.

CSP / approach	Provider	Maintenance	Notes
Exoscale	exoscale/provider-exoscale (official)	✅ vendor-maintained, active (last commit 2026-05; tag v0.1.0)	Upjet v2 over the official Terraform provider. Exposes `Instance`, `SSHKey`, `SecurityGroup(Rule)`, `PrivateNetwork`, `ElasticIP`, `BlockStorageVolume`, everything a runner VM needs. Pre-1.0, tiny user base, but the safest native bet.
cloudscale	onzack/provider-cloudscale (third party)	⚠️ third-party (onzack AG, not cloudscale.ch); dormant (last commit 2025-12, ~6 mo stale); no tagged releases	Upjet v2 (RC tooling) over the community Terraform provider. VM kind is `Server`, plus `ServerGroup`, `Network`, `Subnet`, `FloatingIP`, `Volume`; no dedicated SSH-key CRD (keys via user-data). Functional but carries clear bus-factor risk.
OpenStack-based CSPs	crossplane-contrib/provider-openstack (official Crossplane org)	⚠️ under the `crossplane-contrib` org (same org as the upjet AWS/Azure/GCP providers), maintained by Crossplane community contributors rather than an OpenStack vendor; small (~65★, v0.x) but actively released (v0.9.0, 2026-03); one ~13-month stall in its history	Upjet over `terraform-provider-openstack`. One provider deployment + per-CSP `ProviderConfig` covers every OpenStack CSP we run. Full coverage: `InstanceV2`, `KeyPairV2`, `NetworkV2`/`SubnetV2`/`PortV2`, `SecGroupV2(Rule)`, `FloatingIPV2`, `VolumeV3`. Caveat: all CRDs are `v1alpha1` and pre-1.0, so pin versions and smoke-test reconcile/drift on one CSP first.
CSPs without a maintained provider	upbound/provider-opentofu (fallback)	✅ Upbound-maintained, active (v1.1.3, 2026-05); API still `v1beta1`	"Terraform-in-a-pod": a single `Workspace` MR runs an OpenTofu module, surfacing outputs (VM IP/ID) as a connection secret. Covers any CSP with a Terraform provider with no native-provider development. See trade-offs below.

For Exoscale the official native provider is the clear choice. For the OpenStack-based CSPs provider-openstack is attractive: a single provider covers many CSPs at once, and the resource coverage genuinely fits. The reservation is that, although it sits in the official crossplane-contrib org, it is maintained by community contributors rather than an OpenStack vendor, and it is still small and pre-1.0 (all v1alpha1). It therefore warrants a pin-and-monitor stance and a validation pass before committing. cloudscale has a native provider, but it is third-party and currently dormant.

provider-opentofu is a deliberate fallback for CSPs that have no maintained native provider. It reuses mature Terraform providers and still presents a Crossplane-shaped interface (XR composition, connection-secret outputs), but it accepts "Terraform-in-a-pod" semantics: a coarse single-Workspace resource model rather than first-class MRs, plan-based periodic drift reconciliation, and you-own-the-state. It does not persist state by default, so a remote backend (for example the Kubernetes Secret backend) is mandatory. Running arbitrary Terraform with broad CSP credentials inside the controller pod is also a larger security surface. Upbound itself frames it as a transition or bridge rather than a permanent substitute. The recommended posture is therefore to use it only where no native provider exists, mandate a persistent backend with locking, scope ProviderConfig credentials per CSP, and migrate to native providers as they mature.

Out of scope

This whole option was raised on the PR: spin up dedicated runner VMs on a cloud provider instead of running runners and jobs inside Kubernetes. It is worth having evaluated, but it is deliberately not pursued. Provisioning and maintaining per-instance VMs is an operational surface AppCat / Schedar does not otherwise own: golden images, OS and runner patching, lifecycle, networking, and a per-CSP provisioning provider (with the provider-openstack/provider-opentofu caveats above). Every other topology option keeps the runner inside the cluster, where the existing tooling and patterns already apply. We therefore treat the CSP compute instance as a fallback, to revisit only if a concrete requirement makes it unavoidable (for example workloads that genuinely cannot run unprivileged in-cluster), rather than a path we open now.

Job concurrency and capacity

A natural concern is whether the runner spawns additional job Pods, and if so whether their number per instance must be capped, since this drives capacity management and resource billing.

With the forgejo-runner chart, the runner does not spawn extra Pods per job. A CI run is confined to the pre-provisioned runner Pod(s): jobs execute inside the running runner, and when all runner capacity is busy, further jobs queue until a runner frees up rather than scaling out new Pods. (Whether multiple jobs can share a single Pod concurrently depends on the runner’s capacity setting; in the PoC the chart ran jobs within the provisioned Pod rather than fanning out.)

This bounds resource and storage usage to the provisioned runner Pods, which keeps capacity and billing straightforward. Concurrency is determined by the configured runner capacity and the number of runner Pods per instance, both fixed at provisioning time, instead of an unbounded pool of dynamically spawned job Pods. Customers therefore pick from a small set of pre-defined runner sizes (CPU / memory / storage), and the instance’s runner footprint is known up front.
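
As a worked example of how a runner size translates into concurrency, the sketch below assumes two runner Pods with a capacity of two jobs each, giving at most 2 x 2 = 4 concurrent jobs; a fifth job queues until a slot frees up. The replicaCount key is an assumption about the chart, and placing the runner’s capacity setting under runner.config.file is inferred from the key paths in the Helm chart section.

```yaml
# Illustrative sizing: 2 runner Pods x capacity 2 = at most 4 concurrent jobs.
replicaCount: 2        # assumption: the chart may name this value differently
runner:
  config:
    file:
      runner:
        # Number of jobs a single runner process executes concurrently.
        capacity: 2
```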

Registration mechanism

Option A: Custom Forgejo Crossplane provider

A dedicated provider modelling runners (and other Forgejo objects) as first-class managed resources.

Advantages:

Proper reconciliation and a clean resource model for Forgejo objects in general.

Disadvantages:

Significant code to build and maintain for what is essentially a single API call.

Option B: Provider-http

A composition step uses provider-http to call the Forgejo admin API, register a runner, and read the token back from the response. The composition renders the .runner config secret from that token and points the forgejo-runner Helm chart at it, which skips the chart’s own registration step.
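
A minimal sketch of such a call as a standalone provider-http resource is shown below; in the composition it would be a composed resource. The field names follow the provider-http DisposableRequest API as we understand it and should be checked against the installed provider version. The URL path, the response field .body.token, and all Secret, namespace and ProviderConfig names are placeholders.

```yaml
# Sketch only: endpoint, response path and all names are placeholders.
apiVersion: http.crossplane.io/v1alpha2
kind: DisposableRequest
metadata:
  name: register-forgejo-runner
spec:
  forProvider:
    # Placeholder for the Forgejo admin API endpoint used for runner registration.
    url: https://forgejo.example.com/REGISTRATION_ENDPOINT
    method: POST
    headers:
      Authorization:
        # provider-http resolves {{ name:namespace:key }} from a Kubernetes Secret.
        - "token {{ forgejo-admin-token:forgejo-instance:token }}"
    secretInjectionConfigs:
      # Copy the returned credential into a Secret the composition can render
      # into the .runner config secret.
      - secretRef:
          name: forgejo-runner-registration
          namespace: forgejo-instance
        secretKey: token
        # Assumed location of the token in the response body.
        responsePath: .body.token
  providerConfigRef:
    name: http-conf
```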

Advantages:

Much less effort than a custom provider, and enough to drive the whole flow declaratively.
No new provider to maintain.

Disadvantages:

Not a reconciled resource model, so it only fits the narrow runner-bootstrap use case.

Job execution and isolation

The runner executes arbitrary customer CI workloads, so the executor backend is a security-relevant choice. The runner runs as a Kubernetes Pod, so host and LXC executors are not viable: host execution requires direct host access, and LXC needs kernel-level container nesting, both incompatible with restricted SCCs on a shared cluster. The real choice is between Docker-in-Docker and rootless container execution.

Forgejo’s official security guidance and docker access documentation are the reference for the evaluation below. Docker socket automount (container.docker_host = automount) is ruled out in both options, since Forgejo’s docs classify it as offering "no security isolation."

Option A: Docker-in-Docker (DinD)

A privileged Docker daemon runs as a sidecar alongside the runner. Job containers connect to that daemon instead of the host daemon.

Advantages:

Straightforward to configure; well-documented in Forgejo’s docs.
Job containers are isolated from the host Docker daemon.

Disadvantages:

The DinD sidecar requires privileged: true, conflicting with restricted SCCs and meaning a container escape reaches the node.
Concurrent jobs share the same daemon; they can see each other’s containers and left-over artifacts.
Resource constraints on the runner pod have no effect on containers spawned inside the DinD daemon.

DinD does not provide a hard security boundary for job containers

Forgejo’s docs note that setting container.privileged = false on the job container reduces the attack surface. This is not a hard security guarantee.

The runner pod itself runs with privileged: true. The nested dockerd (DinD sidecar) runs inside that privileged pod, so all containers it spawns operate within the blast radius of that privileged pod.

Additionally, if a job container can reach the DinD socket (/var/run/docker.sock of the nested daemon), it can spawn a new container with elevated privileges:

docker run --privileged -v /:/host alpine chroot /host

This means:

A compromised job can access the host filesystem, load kernel modules, and read other pods' secrets from /proc.
On a shared node, all other tenants co-scheduled there are reachable.
There is no hypervisor boundary - all containers share the host kernel, so kernel exploits are not contained.

The only true isolation boundary is the privileged runner pod itself. Anything running inside it - including unprivileged job containers - is within the blast radius of a node compromise. This makes DinD on a shared cluster an unacceptable risk without additional VM-level isolation.

Option B: Rootless Docker / Podman

Job containers run via a rootless Docker or Podman daemon inside the runner pod. No privileged containers are required.

Advantages:

Runner pod stays unprivileged; compatible with restricted SCCs and Kubernetes Pod Security Standards.
Follows Forgejo’s recommendation for unprivileged runners; container.privileged = false enforced by default.

Disadvantages:

More complex to configure than DinD.

Decision

Per-instance runner as an optional, separately billed AddOn in the Forgejo instance namespace, registered via provider-http (registration Option B).

The runner is classified as an AddOn rather than a nested service. It is optional and separately billed, which is what distinguishes an AddOn from a required, non-billed nested service like PostgreSQL.

For placement there are two defensible choices. One option is a dedicated namespace, which gives cleaner resource isolation and billing boundaries and would keep the door open to a future runner-only AddOn where the customer brings their own Forgejo instance. We instead place the runner in the Forgejo instance namespace, for simplicity. The runner is tightly coupled to Forgejo and cannot run without it, so co-locating it avoids the extra namespace wiring and keeps the AddOn next to the workload it serves. Classification as an AddOn is independent of placement. If a dedicated namespace later proves worthwhile (for example for the standalone runner case, or for stricter resource boundaries), we can move it without revisiting the rest of this decision (see Standalone runner with a customer-provided Forgejo).

Namespace isolation is sufficient on the platforms where we offer the AddOn, which is also why we prefer the in-cluster approach over CSP VMs. The AddOn is offered on Servala and Managed OpenShift, but not on APPUiO. The isolation story differs per platform:

On Servala, we additionally isolate the runners with a hardened RuntimeClass (gVisor) set on the runner Pod. This puts a syscall-sandbox boundary around the job containers, on top of the Talos-based hardening that already goes beyond vanilla Kubernetes. The initial version does not add a separate node pool; we may introduce one later to separate the runners (a bursty, interruptible workload) from platform and other-customer workloads.
On APPUiO, the runner AddOn is not offered. APPUiO is shared, multi-tenant infrastructure where arbitrary customer CI workloads pose a resource-starvation risk to co-tenants and a security concern that the existing namespace isolation, SCCs and quotas do not fully neutralise. We therefore deliberately restrict the service to Servala and Managed OpenShift rather than expose shared APPUiO nodes to untrusted CI.
On Managed OpenShift, the cluster belongs to the customer, who can deploy whatever they like on it, so runner isolation is not our concern.
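
For the Servala case, the RuntimeClass wiring can be sketched as below. The handler name runsc is the usual gVisor handler and is assumed here, as are the object names and the image reference. How the runner chart exposes runtimeClassName is not covered by this sketch.

```yaml
# Sketch: gVisor RuntimeClass plus a Pod that opts into it (names are placeholders).
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
# Must match the runtime handler configured in the node's container runtime.
handler: runsc
---
apiVersion: v1
kind: Pod
metadata:
  name: forgejo-runner-example
spec:
  # All containers in this Pod run inside the gVisor sandbox.
  runtimeClassName: gvisor
  containers:
    - name: runner
      image: registry.example/forgejo-runner:placeholder
```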

Combined with the runner being confined to its pre-provisioned Pod(s), this covers the requirement. The CSP compute instance option is explicitly not chosen: it keeps the runner out of the cluster, but at the cost of owning VM provisioning and maintenance, an operational can of worms AppCat / Schedar does not otherwise carry (see the out-of-scope note under Provisioning the compute instance). It remains a fallback in case running unprivileged in-cluster ever proves insufficient.

For registration, a single admin-API call to obtain the runner credential does not justify the effort of building and maintaining a custom provider, so provider-http is the better fit. Should we later need full, reconciled management of Forgejo objects, a custom provider can be reconsidered in a separate decision.

Proof of concept

We built a proof of concept (appcat#684) to validate the registration flow and the Helm chart integration. It deploys into the Forgejo instance namespace, matching the placement decided here, and uses the registration mechanism we are deciding on. provider-http registers the runner against the Forgejo admin API, the composition reads the returned token and writes it into a .runner config secret, and the forgejo-runner Helm chart consumes that secret.

What remains is to expose runner enablement and sizing on the claim instead of always provisioning one, and, on Servala, to wire up the gVisor RuntimeClass for the job containers.

Rejection rationale

During implementation (appcat#684) a security review identified that the design cannot be made safe on a shared cluster without infrastructure prerequisites that were not in place.

The core problem is that any runner topology that requires a privileged pod, including DinD, extends the trust boundary to the whole node, which is unacceptable on a shared cluster:

A privileged pod has full access to the node: host filesystem, other pods' secrets via /proc, network traffic on the node, and the ability to load kernel modules.
The DinD sidecar does not isolate job containers from this blast radius (see the DinD security note above).
Setting container.privileged = false on job containers reduces accidental exposure but is not a hard boundary.
Rootless Docker/Podman (Option B) removes the privileged pod requirement and is the correct executor choice, but alone it does not contain a determined attacker exploiting a kernel vulnerability, since all containers still share the host kernel.
Without stronger isolation (a gVisor syscall sandbox, Kata Containers VMs, or a dedicated cluster), the shared kernel remains an unresolved gap.

The feature is therefore paused rather than proceeding with an unresolved security gap.

Path forward

The following options, listed roughly in order of increasing effort and isolation strength, would unblock the feature:

Option	Description	Effort	Isolation strength
Dedicated tainted node pool	Taint a pool of nodes exclusively for runner pods so no other tenant workloads are co-scheduled. Limits the blast radius of a node escape to runner nodes only. Necessary baseline for any option, but not sufficient alone.	Low	Partial
gVisor `RuntimeClass` on runner nodes	Deploy gVisor on the runner node pool and assign a `RuntimeClass` to runner pods. gVisor intercepts syscalls with a user-space kernel, containing kernel exploits inside the sandbox.	Medium	Strong (syscall sandbox)
Kata Containers on runner nodes	Replace the container runtime on runner nodes with Kata Containers (QEMU/Cloud Hypervisor/Firecracker). Each pod runs in a lightweight VM; container escapes do not reach the host kernel. Stronger than gVisor; higher overhead and more operational complexity.	High	Very strong (VM boundary)
Separate CI cluster	Run a dedicated Kubernetes cluster exclusively for CI workloads, not shared with customer data or production services. Privileged pods are acceptable because the blast radius is contained to the CI cluster.	High	Strong (cluster boundary)
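
The baseline option, a dedicated tainted node pool, can be sketched as follows. The taint and label key workload=ci-runner, the Pod name and the image are hypothetical. The node-side commands are shown as comments because they are applied out of band.

```yaml
# Node side, applied out of band (hypothetical key/value):
#   kubectl taint nodes <node> workload=ci-runner:NoSchedule
#   kubectl label nodes <node> workload=ci-runner
apiVersion: v1
kind: Pod
metadata:
  name: forgejo-runner-example
spec:
  # Pin the runner to the labelled pool ...
  nodeSelector:
    workload: ci-runner
  # ... and allow it onto the tainted nodes that repel everything else.
  tolerations:
    - key: workload
      operator: Equal
      value: ci-runner
      effect: NoSchedule
  containers:
    - name: runner
      image: registry.example/forgejo-runner:placeholder
```

The taint keeps other workloads off the pool, while the nodeSelector keeps runners off all other nodes; both halves are needed for the separation described in the table.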

Consequences

The runner is an optional, separately billed AddOn; Forgejo instances without it are unaffected.
Each Forgejo instance gets an isolated runner; no cross-tenant job execution is possible.
The runner lives in the Forgejo instance namespace; billing still maps cleanly to the instance, while resource limits are shared with the Forgejo workload. A dedicated namespace remains a future option if stricter separation is needed.
Capacity is bounded: jobs run in the pre-provisioned runner Pod(s) and queue when busy, so concurrency and resource usage are fixed by the chosen runner size and count rather than a dynamic pool of job Pods.
The AddOn is offered on Servala and Managed OpenShift only; it is not offered on APPUiO, whose shared, multi-tenant infrastructure makes arbitrary customer CI a resource-starvation and security risk to co-tenants.
Runner isolation is platform-specific: on Servala job containers run under a gVisor RuntimeClass; on Managed OpenShift the cluster is the customer’s own.
Moving the runner to a dedicated namespace later remains possible, which would keep open a future standalone runner where a customer brings their own Forgejo instance (see Standalone runner with a customer-provided Forgejo). That use case cannot reuse the provider-http admin-API registration and would instead consume a customer-provided registration token via a claim secretRef; it is deliberately out of scope here.
No custom Forgejo provider is needed for runner support; the provider-http Crossplane provider must be installed on clusters running managed Forgejo.
Multiple runners per instance, and project-scoped runners, are natural extensions of this approach rather than redesigns.

Open questions and operational considerations

The following points were raised in review. Some are decided here; others are explicitly deferred to the implementation phase and listed so they are not lost.

Runtime isolation hardening

Namespace isolation plus rootless, confined-Pod execution (see Job execution and isolation) is the decided baseline and meets the requirement on our target platforms (see the Decision). The following are optional defence-in-depth we can layer on if we choose to harden further, not prerequisites:

Dedicated node pool: schedule runner/job workloads onto their own node pool via taints and tolerations, keeping arbitrary CI off the nodes that run platform and other-customer workloads. Not in the initial version on Servala; a candidate for later given the runners' bursty profile.
Sandboxed runtime: a hardened RuntimeClass to put a syscall-sandbox or lightweight-VM boundary around job containers. On Servala this is part of the decision (gVisor, see the Decision); a lightweight-VM boundary (Kata Containers) remains a further option. Not applicable on APPUiO (the AddOn is not offered there) or Managed OpenShift (customer cluster).
Minimal ServiceAccount: the runner needs no Kubernetes API access to execute jobs, so its ServiceAccount should carry no RBAC and automountServiceAccountToken should be disabled. (Cheap; worth doing regardless.)
Spot/preemptible capacity: CI is interruptible, so spot nodes (or spot VMs in the CSP fallback) are a cost lever worth considering.
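
As a rough illustration of how the Pod-level measures above combine, the sketch below applies them to a Pod spec held as a plain dictionary. The Pod spec field names (`automountServiceAccountToken`, `runtimeClassName`, `tolerations`, `nodeSelector`) are standard Kubernetes fields. The RuntimeClass name `gvisor`, the taint key `example.com/ci-pool`, and the assumption that pool nodes carry a label matching the taint are hypothetical. In practice these values would be set through the chart or the composition rather than in Python.

```python
def harden_pod_spec(pod_spec, runtime_class=None, pool_taint=None):
    """Return a hardened copy of a Pod spec dict; the input is not modified."""
    spec = dict(pod_spec)
    # Minimal ServiceAccount: do not mount an API token into the Pod.
    spec["automountServiceAccountToken"] = False
    if runtime_class is not None:
        # Sandboxed runtime: run containers under the given RuntimeClass.
        spec["runtimeClassName"] = runtime_class
    if pool_taint is not None:
        key, value = pool_taint
        # Dedicated node pool: tolerate the pool's taint ...
        spec["tolerations"] = list(spec.get("tolerations", [])) + [
            {"key": key, "operator": "Equal", "value": value, "effect": "NoSchedule"}
        ]
        # ... and select the pool's nodes (assumes a node label with the same key/value).
        spec["nodeSelector"] = {**spec.get("nodeSelector", {}), key: value}
    return spec


base = {"containers": [{"name": "runner", "image": "example.invalid/runner"}]}
spec = harden_pod_spec(
    base,
    runtime_class="gvisor",                     # hypothetical RuntimeClass name
    pool_taint=("example.com/ci-pool", "true"),  # hypothetical taint key/value
)
```

The toleration only permits scheduling onto the tainted pool. The node selector is what keeps the workload there, which is why the sketch sets both.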

Egress, caching and storage

CI jobs pull large volumes from external sources: npm, PyPI, Maven, Go modules, ad-hoc curl … | sh. Spegel covers container image layers on Servala, but not these other artifacts. Two gaps to address in implementation:

Pull-through caches / proxy for the common package ecosystems, to cut egress cost and speed up builds.
Persistent build / layer cache: without a persistent volume the image-layer and dependency caches are lost on every Pod restart, so customers get a cold build after each restart.

Storage sizing for this cache ties into the pre-defined runner sizes from Job concurrency and capacity.

Lifecycle

Deregistration: deleting the runner must also remove its Forgejo registration, otherwise a dangling runner entry is left behind. provider-http can issue the DELETE call, so the composition should deregister on teardown, the mirror of the registration step.
Upgrade / drain: a chart upgrade restarts the runner Pod and kills in-flight jobs. We need a drain story (for example ephemeral runners that exit cleanly after their current job, or cordoning the runner before upgrade).
Version compatibility: maintain a runner/Forgejo version-compatibility matrix.
Token rotation: the registration credential should be rotatable.
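
The deregistration item above implies that teardown should be safe to retry. The sketch below shows one way to reason about that. It only illustrates the HTTP call that provider-http would issue declaratively. The `DELETE` path with a runner ID is an assumption that mirrors the registration endpoint named in this ADR and has not been checked against the Forgejo API. Treating `404` as success is a suggested convention, not a decided behaviour.

```python
import urllib.request


def build_deregistration_request(base_url, runner_id, admin_token):
    """Build (but do not send) the DELETE request that removes a runner entry.

    The path is an assumption mirroring the registration endpoint.
    """
    url = f"{base_url.rstrip('/')}/api/v1/admin/actions/runners/{runner_id}"
    return urllib.request.Request(
        url,
        method="DELETE",
        headers={"Authorization": f"token {admin_token}"},
    )


def deregistration_done(status_code):
    """Decide whether teardown may proceed after a deregistration attempt.

    2xx means the entry was removed now; 404 means it was already gone.
    Anything else should be retried so no dangling runner entry is left.
    """
    return 200 <= status_code < 300 or status_code == 404


req = build_deregistration_request("http://forgejo.example.invalid:3000", 42, "s3cr3t")
print(req.get_method(), req.full_url)
```

Accepting `404` makes a repeated teardown a no-op, so a reconcile loop that retries after a partial failure does not get stuck on a runner that is already deregistered.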

Observability and SLO boundary

Customers need access to their job logs (surfaced in the Forgejo UI) and to runner metrics (exposed via the runner's metrics endpoint). No SLOs exist on Servala yet, but when they are defined the responsibility line should be drawn explicitly: runner availability is on us, job success is on the customer. Keeping that distinction clear avoids support-ticket ambiguity.

Product integration

Enablement: the runner is enabled as an optional add-on on the Codey/Forgejo claim (a toggle plus sizing), not provisioned by default.
Compute plans: map runner sizing onto the existing Servala compute-plan concept (CPU/RAM) plus a storage size, rather than inventing a separate sizing mechanism.

Standalone runner with a customer-provided Forgejo

Review raised a possible future use case: offering the runner as a standalone service where the customer brings their own Forgejo instance, managed by them or a third party, rather than a VSHN-managed one. This is out of scope for this ADR, which scopes the runner as an AddOn to a VSHN-managed Forgejo instance, and the decision above is unchanged. It is captured here so the constraint it imposes is not lost.

The registration mechanism decided here does not extend to that use case. provider-http registers the runner by calling the Forgejo admin API (POST /api/v1/admin/actions/runners) with admin credentials AppCat controls, against the in-cluster instance reached over internal service DNS (<instance>-http.<namespace>.svc:3000). For a customer-owned instance we have neither admin credentials nor any business mutating an instance we do not operate, so server-side registration is off the table.
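
To make the address and call shape concrete, the following sketch derives the in-cluster URL from the service DNS pattern above and builds the registration request without sending it. The endpoint path and DNS pattern are taken from this ADR. The `http` scheme, the payload field `name`, and the instance and namespace values are placeholders, since the request body is not specified here. In the actual design provider-http issues this call, not Python.

```python
import json
import urllib.request


def internal_forgejo_url(instance, namespace, port=3000):
    """Base URL of the in-cluster Forgejo service (scheme is an assumption)."""
    return f"http://{instance}-http.{namespace}.svc:{port}"


def build_registration_request(instance, namespace, admin_token, payload):
    """Build (but do not send) the admin-API registration request."""
    url = internal_forgejo_url(instance, namespace) + "/api/v1/admin/actions/runners"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"token {admin_token}",  # admin credential AppCat controls
            "Content-Type": "application/json",
        },
    )


# Hypothetical instance, namespace and payload, for illustration only.
reg = build_registration_request(
    "codey", "vshn-forgejo-demo", "s3cr3t", {"name": "runner-1"}
)
print(reg.full_url)
```

Both inputs the function needs, an admin token and an instance reachable under AppCat-controlled service DNS, are exactly what is missing for a customer-owned instance.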

The natural fit is to invert the flow: instead of AppCat registering the runner, the customer generates a runner registration token on their own Forgejo and hands it to us. The claim would carry the instance’s external URL plus a secretRef (a corev1.SecretKeySelector, mirroring the existing UnmanagedBucket and Keycloak custom-mount patterns in AppCat) pointing at a Secret that holds the token. The composition reads the token, renders the .runner config Secret directly, and points the forgejo-runner chart at it via runner.config.existingSecret. This is the same chart hook the AddOn already uses; only the provider-http registration step is skipped.
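
The inverted flow can be sketched as a pure function from claim inputs to a Secret manifest. This is a minimal sketch under several assumptions: `SecretKeySelector` mirrors only the `name` and `key` fields of corev1.SecretKeySelector, the `secrets` dictionary stands in for a lookup in the claim namespace, the `https` requirement and the Secret name are illustrative choices, and the two fields inside the `.runner` payload are placeholders. The real file format is defined by forgejo-runner and would need to be confirmed during implementation.

```python
import json
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class SecretKeySelector:
    """Mirrors the name/key fields of corev1.SecretKeySelector."""
    name: str
    key: str


def render_runner_secret(secret_ref, forgejo_url, secrets,
                         secret_name="forgejo-runner-config"):
    """Render a Secret manifest holding a `.runner` payload.

    `secrets` maps Secret name -> {key: value} and simulates the lookup
    the composition would perform. The payload fields are placeholders.
    """
    parsed = urlparse(forgejo_url)
    if parsed.scheme != "https" or not parsed.hostname:
        raise ValueError("forgejo_url must be an https URL with a hostname")
    try:
        token = secrets[secret_ref.name][secret_ref.key]
    except KeyError as exc:
        raise LookupError(
            f"secretRef {secret_ref.name}/{secret_ref.key} not found"
        ) from exc
    runner_file = json.dumps({"address": forgejo_url, "token": token})
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": secret_name},
        "stringData": {".runner": runner_file},
    }


manifest = render_runner_secret(
    SecretKeySelector(name="customer-token", key="token"),
    "https://git.example.com",
    {"customer-token": {"token": "abc123"}},
)
```

The chart would then be pointed at `forgejo-runner-config` via `runner.config.existingSecret`, with no call to the customer's Forgejo API from our side.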

This has knock-on consequences that confirm it belongs in a separate decision rather than this one:

It is a standalone service, not an AddOn coupled to a co-located Forgejo, so it would not live in a Forgejo instance namespace. This reinforces keeping a dedicated namespace viable (see the Decision), since the runner would have no instance namespace to share.
The address the runner connects to becomes the customer-provided FQDN of their Forgejo instance, reached over public ingress rather than internal cluster DNS.
The registration lifecycle (deregistration, token rotation) shifts to the customer’s side; AppCat only consumes the token it is given.

If this use case is pursued, it should be a new ADR building on the secret-ref idea above, not a modification of this AddOn.