ADR 0054 - VSHN Managed Forgejo Runners
| Author | Mike Ditton |
|---|---|
| Owner | Schedar |
| Reviewers | Schedar |
| Date Created | 2026-05-20 |
| Date Updated | 2026-06-02 |
| Status | draft |
| Tags | forgejo,ci,code-hosting |
Summary
We offer per-instance Forgejo Actions runners as an optional, separately billed AddOn, following the Collabora model.
Each runner is deployed in the Forgejo instance namespace for simplicity.
Registration uses |
Context
Managed Forgejo currently ships without runners. Customers have to bring their own to run Forgejo Actions, which is a gap in our offering. We want to provide managed runners as part of the service.
This leaves us with three design questions:
-
How is the runner classified and where does it live? Is it a nested service or an AddOn? Is it a single shared pool or one runner per instance? And if per instance, does it live in the Forgejo instance namespace or in its own dedicated namespace?
-
How is the runner registered with Forgejo? Forgejo has to issue a registration credential, and that credential has to reach the runner’s config. We can drive this with a dedicated Crossplane provider or with provider-http.
-
How are jobs executed and isolated? The runner runs arbitrary customer CI workloads, so we need to decide on the executor backend and the isolation model.
Requirements
-
CI jobs from one customer must not be able to access another customer’s workloads or data.
-
Runner must support arbitrary CI workloads including Docker builds.
-
Runner and job containers must run unprivileged; compatible with OpenShift restricted SCCs.
-
Resource usage must be isolatable from the Forgejo workload.
Solutions
Helm chart
The official Forgejo runner Helm chart is used to deploy the runner. It is the officially maintained chart and is structurally identical to the wrenix chart used in the PoC, which was its upstream before the Forgejo project took it over.
It exposes everything we need:
-
runner.config.existingSecret: points the runner at a pre-built .runner credential file, skipping the chart’s own registration job. This is what makes the provider-http registration approach work.
-
dind.* values: built-in Docker-in-Docker sidecar support.
-
runner.config.file.container.privileged: defaults to false for job containers.
-
securityContext.privileged: defaults to true (required for DinD); must be set to false for the rootless option.
The chart can be forked if we need to tweak it for our use case in the future, similar to how we handle other third-party charts.
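The overrides for the rootless setup can be sketched as a small values builder. This is a minimal illustration, not the composition's actual code: it uses only the value paths named above, the Secret name is hypothetical, and the dind.* sub-keys are left out because this ADR does not enumerate them.

```python
from typing import Any


def set_path(values: dict[str, Any], dotted_key: str, value: Any) -> None:
    """Set a nested value from a dotted Helm-style key such as 'a.b.c'."""
    node = values
    *parents, leaf = dotted_key.split(".")
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value


def rootless_runner_values(existing_secret: str) -> dict[str, Any]:
    """Build chart overrides for the unprivileged (rootless) setup."""
    values: dict[str, Any] = {}
    # Use the pre-built .runner credential instead of the chart's registration job.
    set_path(values, "runner.config.existingSecret", existing_secret)
    # Job containers stay unprivileged (this is also the chart default).
    set_path(values, "runner.config.file.container.privileged", False)
    # The chart default is true (needed for DinD); the rootless option needs false.
    set_path(values, "securityContext.privileged", False)
    return values


# "forgejo-runner-credential" is a hypothetical Secret name.
values = rootless_runner_values("forgejo-runner-credential")
```

The resulting nested dictionary can be serialised to YAML and passed to the chart as a values file.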
Runner classification and topology
The runner is modelled as an AddOn, not a nested service. A nested service is a part the outer service requires and cannot run without, and that is not billed separately. PostgreSQL in Keycloak and Nextcloud follows this pattern. The runner is the inverse: Forgejo runs perfectly well without it, enabling it is optional, and it carries its own billing. That makes it an AddOn, the same model used for Collabora. The runner cannot function without a Forgejo instance, so it is tightly coupled, but coupling alone does not make something a nested service. Optionality and separate billing are what make it an AddOn.
This classification is independent of where the runner Pod is placed: the same AddOn can live in the Forgejo instance namespace or in its own namespace. The options below cover the shared-vs-per-instance choice, the per-instance placement choice, and, for completeness, the out-of-cluster alternative.
-
Shared pool: a single runner pool shared across all customer instances.
-
Instance namespace: a per-instance runner AddOn deployed alongside the Forgejo workload in the instance namespace.
-
Dedicated namespace: a per-instance runner AddOn deployed in its own namespace.
-
CSP compute instance: a dedicated VM provisioned per instance on the CSP via Crossplane, with the runner hosted on the VM rather than in the cluster. A Crossplane CSP provider provisions the VM (see Provisioning the compute instance), and the forgejo-runner is installed on it as a systemd-managed binary or via Docker Compose (see Runner installation on the VM). Registration reuses the same provider-http admin-API flow: the pre-generated UUID/token pair is templated into the runner’s config so it comes up non-interactively. The runner talks back to the Forgejo server over the network, so the VM lives in an isolated, firewalled subnet rather than being directly reachable.
| Criteria | Shared pool | Instance namespace | Dedicated namespace | CSP compute instance |
|---|---|---|---|---|
| Tenant isolation | ❌ jobs from different customers share infrastructure | ✅ per-instance namespace boundary | ✅ per-instance namespace boundary | ✅ strongest: dedicated VM, OS-level boundary; an escape only reaches a disposable VM, not shared cluster nodes |
| Billing | ❌ usage must be tracked across a shared resource | ✅ maps cleanly to the instance | ✅ maps cleanly to the instance | ✅ maps cleanly to the instance, but adds separate CSP compute cost |
| Resource limits | ⚠️ shared across all tenants | ⚠️ coupled with the Forgejo workload | ✅ independent | ✅ fully independent: own VM sizing, zero impact on the cluster |
| Multiple runners per instance | ❌ | ⚠️ clutters the instance namespace | ✅ natural to add | ⚠️ natural (more VMs) but each adds cost and provisioning latency |
| Bootstrap effort | ✅ single deployment | ✅ no extra namespace wiring | ⚠️ must provision and wire the dedicated namespace | ❌ highest: provision a VM on the CSP, install/register the runner, wire networking; depends on a CSP Crossplane provider |
The CSP compute option also unblocks privileged / Docker-in-Docker execution.
On a shared cluster, a privileged job or a container escape reaches the node, and potentially other tenants and the control plane, so it must be avoided (see Job execution and isolation).
On a dedicated VM, ideally ephemeral (--ephemeral, one job per runner), the blast radius of an escape is the disposable VM itself.
That makes container.privileged = true and DinD an acceptable trade-off for builds that genuinely need them.
The cost is provisioning latency, recurring CSP compute spend, and a dependency on a per-CSP provisioning provider.
The one remaining security item is network segmentation: an isolated subnet plus firewall, so an escaped job cannot reach internal services.
This option has two sub-decisions of its own, both evaluated below: how the runner is installed on the VM, and how the VM itself is provisioned per CSP.
Runner installation on the VM
Forgejo supports installing the runner as an OS-level binary or running it as a container.
In both cases registration is decoupled from the runner and can be pre-generated, which is what lets us reuse the provider-http admin-API flow.
Registration writes a UUID/token pair into the runner config’s server.connections.forgejo section.
The pair can be generated server-side ahead of time (forgejo forgejo-cli actions register --secret …, or the admin API) and templated in, so the runner starts non-interactively without the forgejo-runner register prompt.
The deprecated interactive register command still exists but is not needed.
For throwaway per-instance VMs the recommended mode is --ephemeral (one job per runner, enforced by Forgejo).
Running Docker builds is a requirement, so the VM needs a Docker daemon either way. The two options below differ only in how that daemon is provided and how the runner process itself is managed, not in whether a daemon exists.
Option A: Package / binary + host Docker daemon
A single static forgejo-runner binary runs on the VM as a dedicated runner user under a systemd unit (forgejo-runner daemon -c runner-config.yml).
The config is produced with forgejo-runner generate-config and passed explicitly with -c/--config, since it is not discovered automatically.
Docker is installed on the VM as a normal host daemon (via cloud-init or the golden image), and the runner uses that daemon directly for docker-label jobs and builds.
The official "packaging" docs only cover NixOS (services.gitea-actions-runner.*); for a generic Linux VM the binary-installation path is the relevant one.
Advantages:
-
Fewest layers: the runner talks to the host Docker daemon directly, so builds run at native performance with no nested daemon.
-
Minimal moving parts to image and operate: one binary, one systemd unit, one config file, plus the host Docker package.
Disadvantages:
-
We own forgejo-runner (and Docker) updates/patching on the VM.
-
Jobs are given access to the host Docker daemon. On a shared host this would be unacceptable, but on a single-tenant disposable VM the whole VM is the trust boundary, so it is acceptable.
Option B: Docker Compose with DinD
The published docker-compose.yml runs two services: a docker:dind daemon and the forgejo/runner image (as non-root 1001:1001), pointed at the DinD daemon via DOCKER_HOST.
Config is generated with docker run --rm … forgejo-runner generate-config.
The VM only needs Docker and the compose file; the build daemon is the bundled DinD container.
Advantages:
-
Self-contained, fully pinned (runner image + DinD image); the runner and its build daemon are reproducible artifacts rather than host packages.
-
The build daemon is a DinD container separate from the host daemon.
Disadvantages:
-
Extra nesting (DinD), and the stock compose exposes the daemon over plaintext TCP (--tls=false), relying on the compose network for isolation.
-
Still requires Docker installed on the VM to run the stack.
Both are viable; the choice is host daemon vs. bundled DinD, not "needs Docker or not."
Option A is the leaner default: on a single-tenant disposable VM, letting jobs use the host Docker daemon is acceptable because the VM is the blast radius, and it avoids DinD nesting.
Option B is preferable when we want the runner and its build daemon shipped as one self-contained, version-pinned artifact.
Either way, because the VM is dedicated and disposable, the privileged/DinD risks that rule these out on the shared cluster are acceptable here.
The compensating control is subnet isolation plus firewall, and docker-socket automount into job containers (container.docker_host = automount) is still avoided.
Provisioning the compute instance
The VM has to be created on each CSP through Crossplane. The provider landscape differs per CSP, so there is no single answer. The options below range from official native providers to a Terraform-bridge fallback.
| CSP / approach | Provider | Maintenance | Notes |
|---|---|---|---|
| Exoscale | exoscale/provider-exoscale (official) | ✅ vendor-maintained, active (last commit 2026-05; tag v0.1.0) | Upjet v2 over the official Terraform provider. Exposes |
| cloudscale | onzack/provider-cloudscale (third party) | ⚠️ third-party (onzack AG, not cloudscale.ch); dormant (last commit 2025-12, ~6 mo stale); no tagged releases | Upjet v2 (RC tooling) over the community Terraform provider. VM kind is |
| OpenStack-based CSPs | crossplane-contrib/provider-openstack (official Crossplane org) | ⚠️ under the | Upjet over |
| CSPs without a maintained provider | upbound/provider-opentofu (fallback) | ✅ Upbound-maintained, active (v1.1.3, 2026-05); API still | "Terraform-in-a-pod": a single |
For Exoscale the official native provider is the clear choice.
For the OpenStack-based CSPs provider-openstack is attractive: a single provider covers many CSPs at once, and the resource coverage genuinely fits.
The reservation is that, although it sits in the official crossplane-contrib org, it is maintained by community contributors rather than an OpenStack vendor, and it is still small and pre-1.0 (all v1alpha1).
It therefore warrants a pin-and-monitor stance and a validation pass before committing.
cloudscale has a native provider, but it is third-party and currently dormant.
provider-opentofu is a deliberate fallback for CSPs that have no maintained native provider.
It reuses mature Terraform providers and still presents a Crossplane-shaped interface (XR composition, connection-secret outputs), but it accepts "Terraform-in-a-pod" semantics: a coarse single-Workspace resource model rather than first-class MRs, plan-based periodic drift reconciliation, and you-own-the-state.
It does not persist state by default, so a remote backend (for example the Kubernetes Secret backend) is mandatory.
Running arbitrary Terraform with broad CSP credentials inside the controller pod is also a larger security surface.
Upbound itself frames it as a transition or bridge rather than a permanent substitute.
The recommended posture is therefore to use it only where no native provider exists, mandate a persistent backend with locking, scope ProviderConfig credentials per CSP, and migrate to native providers as they mature.
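The selection logic described above can be sketched as a lookup with a fallback. This is an illustration only: the CSP labels used as keys and the needs_validation flag are assumptions introduced here, and whether a dormant third-party provider is acceptable remains a judgement this sketch does not settle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderChoice:
    provider: str
    native: bool
    needs_validation: bool  # pin-and-monitor / validation pass before committing


# Providers as listed in the comparison table; keys are illustrative CSP labels.
NATIVE_PROVIDERS = {
    "exoscale": ProviderChoice("exoscale/provider-exoscale", True, False),
    "cloudscale": ProviderChoice("onzack/provider-cloudscale", True, True),
    "openstack": ProviderChoice("crossplane-contrib/provider-openstack", True, True),
}

# Used only where no native provider exists.
FALLBACK = ProviderChoice("upbound/provider-opentofu", False, True)


def choose_provider(csp: str) -> ProviderChoice:
    """Prefer a native provider; fall back to provider-opentofu otherwise."""
    return NATIVE_PROVIDERS.get(csp.lower(), FALLBACK)
```

Keeping the fallback as the default branch mirrors the posture of migrating to native providers as they mature: adding an entry to the table is enough to move a CSP off the Terraform bridge.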
Out of scope
This whole option was raised on the PR: spin up dedicated runner VMs on a cloud provider instead of running runners and jobs inside Kubernetes.
It is worth having evaluated, but it is deliberately not pursued.
Provisioning and maintaining per-instance VMs is an operational surface AppCat / Schedar does not otherwise own: golden images, OS and runner patching, lifecycle, networking, and a per-CSP provisioning provider (with the |
Job concurrency and capacity
A natural concern is whether the runner spawns additional job Pods, and if so whether their number per instance must be capped, since this drives capacity management and resource billing.
With the forgejo-runner chart, the runner does not spawn extra Pods per job.
A CI run is confined to the pre-provisioned runner Pod(s): jobs execute inside the running runner, and when all runner capacity is busy, further jobs queue until a runner frees up rather than scaling out new Pods.
(Whether multiple jobs can share a single Pod concurrently depends on the runner’s capacity setting; in the PoC the chart ran jobs within the provisioned Pod rather than fanning out.)
This bounds resource and storage usage to the provisioned runner Pods, which keeps capacity and billing straightforward.
Concurrency is determined by the configured runner capacity and the number of runner Pods per instance, both fixed at provisioning time, instead of an unbounded pool of dynamically spawned job Pods.
Customers therefore pick from a small set of pre-defined runner sizes (CPU / memory / storage), and the instance’s runner footprint is known up front.
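The queueing behaviour can be illustrated with a small simulation. It is a simplified model under stated assumptions: all jobs are submitted at the same time, job durations are known, and the number of parallel slots is simply the number of runner Pods times the configured capacity. None of the names below come from the runner itself.

```python
import heapq


def simulate_queue(durations: list[int], pods: int, capacity: int) -> list[tuple[int, int]]:
    """Return (start, end) for each job, in submission order.

    All jobs are submitted at t=0. Concurrency is fixed at pods * capacity slots;
    a job that finds no free slot waits for the earliest slot to free up.
    """
    slots = pods * capacity
    if slots < 1:
        raise ValueError("at least one runner slot is required")
    free_at = [0] * slots  # min-heap of the times at which each slot becomes free
    heapq.heapify(free_at)
    schedule = []
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest available slot
        end = start + duration
        heapq.heappush(free_at, end)
        schedule.append((start, end))
    return schedule


# One runner Pod with capacity 2, four jobs of 5, 3, 4 and 2 time units.
schedule = simulate_queue([5, 3, 4, 2], pods=1, capacity=2)
# schedule == [(0, 5), (0, 3), (3, 7), (5, 7)]
```

The third and fourth jobs wait for a slot instead of triggering new Pods, so at most two jobs ever run at once and the resource footprint stays what was provisioned.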
Registration mechanism
Option A: Custom Forgejo Crossplane provider
A dedicated provider modelling runners (and other Forgejo objects) as first-class managed resources.
Advantages:
-
Proper reconciliation and a clean resource model for Forgejo objects in general.
Disadvantages:
-
Significant code to build and maintain for what is essentially a single API call.
Option B: Provider-http
A composition step uses provider-http to call the Forgejo admin API, register a runner, and read the token back from the response.
The composition renders the .runner config secret from that token and points the forgejo-runner Helm chart at it, which skips the chart’s own registration step.
Advantages:
-
Much less effort than a custom provider, and enough to drive the whole flow declaratively.
-
No new provider to maintain.
Disadvantages:
-
Not a reconciled resource model, so it only fits the narrow runner-bootstrap use case.
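The Option B flow can be sketched in plain Python to show what the composition asks provider-http to do. This is a hedged illustration, not the provider-http resource itself. The endpoint path and in-cluster service DNS pattern are the ones given later in this ADR; the http scheme, the `Authorization: token …` header, the request payload, the name of the response field holding the token, and the content of the rendered .runner file are all assumptions.

```python
import json
import urllib.request


def build_registration_request(instance: str, namespace: str, admin_token: str,
                               payload: dict) -> urllib.request.Request:
    """Build (but do not send) the admin-API call that registers a runner."""
    # Service DNS pattern and endpoint path as stated in this ADR; scheme assumed.
    url = f"http://{instance}-http.{namespace}.svc:3000/api/v1/admin/actions/runners"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),  # payload schema is an assumption
        method="POST",
        headers={
            "Authorization": f"token {admin_token}",  # assumed auth header format
            "Content-Type": "application/json",
        },
    )


def render_runner_secret(response_body: bytes, token_field: str = "token") -> dict:
    """Turn the API response into the data for the .runner config Secret."""
    token = json.loads(response_body)[token_field]  # field name is an assumption
    # The real .runner file carries more fields; only the token is shown here.
    return {".runner": json.dumps({"token": token})}
```

The two functions correspond to the two composition steps: the HTTP call that obtains the credential, and the rendering of the Secret that the chart consumes through runner.config.existingSecret.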
Job execution and isolation
The runner executes arbitrary customer CI workloads, so the executor backend is a security-relevant choice. The runner runs as a Kubernetes Pod, so host and LXC executors are not viable: host execution requires direct host access, and LXC needs kernel-level container nesting, both incompatible with restricted SCCs on a shared cluster. The real choice is between Docker-in-Docker and rootless container execution.
Forgejo’s official security guidance and docker access documentation are the reference for the evaluation below.
Docker socket automount (container.docker_host = automount) is ruled out in both options, since Forgejo’s docs classify it as offering "no security isolation."
Option A: Docker-in-Docker (DinD)
A privileged Docker daemon runs as a sidecar alongside the runner. Job containers connect to that daemon instead of the host daemon.
Advantages:
-
Straightforward to configure; well-documented in Forgejo’s docs.
-
Job containers are isolated from the host Docker daemon.
Disadvantages:
-
The DinD sidecar requires privileged: true, conflicting with restricted SCCs and meaning a container escape reaches the node.
-
Concurrent jobs share the same daemon; they can see each other’s containers and left-over artifacts.
-
Resource constraints on the runner pod have no effect on containers spawned inside the DinD daemon.
Option B: Rootless Docker / Podman
Job containers run via a rootless Docker or Podman daemon inside the runner pod. No privileged containers are required.
Advantages:
-
Runner pod stays unprivileged; compatible with restricted SCCs and Kubernetes Pod Security Standards.
-
Follows Forgejo’s recommendation for unprivileged runners; container.privileged = false enforced by default.
Disadvantages:
-
More complex to configure than DinD.
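The difference between the two options can be expressed as a small policy check. This is a sketch under assumptions: the flat dictionary with dotted keys is an illustrative view of the settings named in this ADR, not the chart's actual values layout or the runner's config schema.

```python
def isolation_violations(settings: dict) -> list[str]:
    """List the ways a runner configuration breaks the isolation requirements."""
    problems = []
    if settings.get("securityContext.privileged", False):
        problems.append("runner or daemon container is privileged (restricted SCCs forbid this)")
    if settings.get("container.privileged", False):
        problems.append("job containers are privileged")
    if settings.get("container.docker_host") == "automount":
        problems.append("docker socket automount offers no security isolation")
    return problems


# Option A needs a privileged sidecar; Option B does not.
dind = {"securityContext.privileged": True, "container.privileged": False}
rootless = {"securityContext.privileged": False, "container.privileged": False}

dind_problems = isolation_violations(dind)
rootless_problems = isolation_violations(rootless)
```

Under this check the DinD configuration reports one violation and the rootless configuration reports none, which matches the requirement that runner and job containers run unprivileged.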
Decision
Per-instance runner as an optional, separately billed AddOn in the Forgejo instance namespace, registered via provider-http (registration Option B).
The runner is classified as an AddOn rather than a nested service. It is optional and separately billed, which is what distinguishes an AddOn from a required, non-billed nested service like PostgreSQL.
For placement there are two defensible choices. One option is a dedicated namespace, which gives cleaner resource isolation and billing boundaries and would keep the door open to a future runner-only AddOn where the customer brings their own Forgejo instance. We instead place the runner in the Forgejo instance namespace, for simplicity. The runner is tightly coupled to Forgejo and cannot run without it, so co-locating it avoids the extra namespace wiring and keeps the AddOn next to the workload it serves. Classification as an AddOn is independent of placement. If a dedicated namespace later proves worthwhile (for example for the standalone runner case, or for stricter resource boundaries), we can move it without revisiting the rest of this decision (see Standalone runner with a customer-provided Forgejo).
Namespace isolation is sufficient on the platforms where we offer the AddOn, which is also why we prefer the in-cluster approach over CSP VMs. The AddOn is offered on Servala and Managed OpenShift, but not on APPUiO. The isolation story differs per platform:
-
On Servala, we additionally isolate the runners with a hardened RuntimeClass (gVisor), assigned via the runner Pod’s runtime class, putting a syscall-sandbox boundary around the job containers on top of the Talos-based hardening that already goes beyond vanilla Kubernetes. The initial version does not add a separate node pool; we may introduce one later to separate the runners (a bursty, interruptible workload) from platform and other-customer workloads.
-
On APPUiO, the runner AddOn is not offered. APPUiO is shared, multi-tenant infrastructure where arbitrary customer CI workloads pose a resource-starvation risk to co-tenants and a security concern that the existing namespace isolation, SCCs and quotas do not fully neutralise. We therefore deliberately restrict the service to Servala and Managed OpenShift rather than expose shared APPUiO nodes to untrusted CI.
-
On Managed OpenShift, the cluster belongs to the customer, who can deploy whatever they like on it, so runner isolation is not our concern.
Combined with the runner being confined to its pre-provisioned Pod(s), this covers the requirement. The CSP compute instance option is explicitly not chosen: it keeps the runner out of the cluster, but at the cost of owning VM provisioning and maintenance, an operational can of worms AppCat / Schedar does not otherwise carry (see the out-of-scope note under Provisioning the compute instance). It remains a fallback should unprivileged in-cluster execution ever prove insufficient.
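The per-platform rules can be summarised in a few lines. This is a sketch, assuming a hypothetical RuntimeClass name of "gvisor" and illustrative platform labels; the actual names are not specified in this ADR.

```python
from enum import Enum
from typing import Optional


class Platform(Enum):
    SERVALA = "servala"
    MANAGED_OPENSHIFT = "managed-openshift"
    APPUIO = "appuio"


def runner_runtime_class(platform: Platform) -> Optional[str]:
    """Return the RuntimeClass for runner Pods, or None if none is applied.

    Raises ValueError on platforms where the AddOn is not offered at all.
    """
    if platform is Platform.APPUIO:
        raise ValueError("runner AddOn is not offered on APPUiO")
    if platform is Platform.SERVALA:
        return "gvisor"  # hypothetical RuntimeClass name
    return None  # Managed OpenShift: the customer's own cluster
```

Rejecting APPUiO outright, rather than returning a weaker isolation setting, reflects that the restriction is a product decision and not a configuration variant.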
For registration, a single admin-API call to obtain the runner credential does not justify the effort of building and maintaining a custom provider, so provider-http is the better fit.
Should we later need full, reconciled management of Forgejo objects, a custom provider can be reconsidered in a separate decision.
Proof of concept
We built a proof of concept (appcat#684) to validate the registration flow and the Helm chart integration.
It deploys into the Forgejo instance namespace, matching the placement decided here, and uses the registration mechanism we are deciding on.
provider-http registers the runner against the Forgejo admin API, the composition reads the returned token and writes it into a .runner config secret, and the forgejo-runner Helm chart consumes that secret.
What remains is to expose runner enablement and sizing on the claim instead of always provisioning one, and, on Servala, to wire up the gVisor RuntimeClass for the job containers.
Consequences
-
The runner is an optional, separately billed AddOn; Forgejo instances without it are unaffected.
-
Each Forgejo instance gets an isolated runner; no cross-tenant job execution is possible.
-
The runner lives in the Forgejo instance namespace; billing still maps cleanly to the instance, while resource limits are shared with the Forgejo workload. A dedicated namespace remains a future option if stricter separation is needed.
-
Capacity is bounded: jobs run in the pre-provisioned runner Pod(s) and queue when busy, so concurrency and resource usage are fixed by the chosen runner size and count rather than a dynamic pool of job Pods.
-
The AddOn is offered on Servala and Managed OpenShift only; it is not offered on APPUiO, whose shared, multi-tenant infrastructure makes arbitrary customer CI a resource-starvation and security risk to co-tenants.
-
Runner isolation is platform-specific: on Servala job containers run under a gVisor RuntimeClass; on Managed OpenShift the cluster is the customer’s own.
-
Moving the runner to a dedicated namespace later remains possible, which would keep open a future standalone runner where a customer brings their own Forgejo instance (see Standalone runner with a customer-provided Forgejo). That use case cannot reuse the provider-http admin-API registration and would instead consume a customer-provided registration token via a claim secretRef; it is deliberately out of scope here.
-
No custom Forgejo provider is needed for runner support; the provider-http Crossplane provider must be installed on clusters running managed Forgejo.
-
Multiple runners per instance, and project-scoped runners, are natural extensions of this approach rather than redesigns.
Open questions and operational considerations
The following points were raised in review. Some are decided here; others are explicitly deferred to the implementation phase and listed so they are not lost.
Runtime isolation hardening
Namespace isolation plus rootless, confined-Pod execution (see Job execution and isolation) is the decided baseline and meets the requirement on our target platforms (see the Decision). The following are optional defence-in-depth we can layer on if we choose to harden further, not prerequisites:
-
Dedicated node pool: schedule runner/job workloads onto their own node pool via taints + tolerations, keeping arbitrary CI off the nodes that run platform and other-customer workloads. Not in the initial version on Servala; a candidate for later given the runners' bursty profile.
-
Sandboxed runtime: a hardened RuntimeClass to put a syscall-sandbox or lightweight-VM boundary around job containers. On Servala this is part of the decision (gVisor, see the Decision); a lightweight-VM boundary (Kata Containers) remains a further option. Not applicable on APPUiO (the AddOn is not offered there) or Managed OpenShift (customer cluster).
-
Minimal ServiceAccount: the runner needs no Kubernetes API access to execute jobs, so its ServiceAccount should carry no RBAC and automountServiceAccountToken should be disabled. (Cheap; worth doing regardless.)
-
Spot/preemptible capacity: CI is interruptible, so spot nodes (or spot VMs in the CSP fallback) are a cost lever worth considering.
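How these measures land on the runner Pod can be sketched as a transformation of a Pod spec. The field names (automountServiceAccountToken, runtimeClassName, tolerations) are standard Kubernetes Pod spec fields; the function itself, the taint key and the RuntimeClass name are hypothetical, and whether the chart exposes each field directly is not established here.

```python
import copy
from typing import Optional


def harden_pod_spec(pod_spec: dict, runtime_class: Optional[str] = None,
                    pool_taint_key: Optional[str] = None) -> dict:
    """Return a copy of a Pod spec with the defence-in-depth settings applied."""
    spec = copy.deepcopy(pod_spec)
    # The runner needs no Kubernetes API access, so do not mount a token.
    spec["automountServiceAccountToken"] = False
    if runtime_class is not None:
        # Sandboxed runtime, for example a gVisor- or Kata-backed RuntimeClass.
        spec["runtimeClassName"] = runtime_class
    if pool_taint_key is not None:
        # Tolerate the taint that reserves the dedicated runner node pool.
        spec.setdefault("tolerations", []).append(
            {"key": pool_taint_key, "operator": "Exists", "effect": "NoSchedule"}
        )
    return spec


base = {"containers": [{"name": "runner"}]}
# "gvisor" and "runner-pool" are hypothetical names.
hardened = harden_pod_spec(base, runtime_class="gvisor", pool_taint_key="runner-pool")
```

A toleration only allows scheduling onto the tainted pool; pinning runners to it would additionally need a node selector or affinity rule, which is omitted from this sketch.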
Egress, caching and storage
CI jobs pull large volumes from external sources: npm, PyPI, Maven, Go modules, ad-hoc curl … | sh.
Spegel covers container image layers on Servala, but not these other artifacts.
Two gaps to address in implementation:
-
Pull-through caches / proxy for the common package ecosystems, to cut egress cost and speed up builds.
-
Persistent build / layer cache: without a persistent volume the image-layer and dependency caches evaporate on every Pod restart, giving customers cold builds constantly.
Storage sizing for this cache ties into the pre-defined runner sizes from Job concurrency and capacity.
Lifecycle
-
Deregistration: deleting the runner must also remove its Forgejo registration, otherwise a dangling runner entry is left behind. provider-http can issue the DELETE call, so the composition should deregister on teardown, the mirror of the registration step.
-
Upgrade / drain: a chart upgrade restarts the runner Pod and kills in-flight jobs. We need a drain story (for example ephemeral runners that exit cleanly after their current job, or cordoning the runner before upgrade).
-
Version compatibility: maintain a runner/Forgejo version-compatibility matrix.
-
Token rotation: the registration credential should be rotatable.
Observability and SLO boundary
Customers need access to their job logs (surfaced in the Forgejo UI) and runner metrics (runner metrics endpoint). No SLOs exist on Servala yet, but when they are defined the responsibility line should be drawn explicitly: runner availability is on us, job success is on the customer. Keeping that distinction clear avoids support-ticket ambiguity.
Product integration
-
Enablement: the runner is enabled as an optional add-on on the Codey/Forgejo claim (a toggle plus sizing), not provisioned by default.
-
Compute plans: map runner sizing onto the existing Servala compute-plan concept (CPU/RAM) plus a storage size, rather than inventing a separate sizing mechanism.
Standalone runner with a customer-provided Forgejo
Review raised a possible future use case: offering the runner as a standalone service where the customer brings their own Forgejo instance, managed by them or a third party, rather than a VSHN-managed one. This is out of scope for this ADR, which scopes the runner as an AddOn to a VSHN-managed Forgejo instance, and the decision above is unchanged. It is captured here so the constraint it imposes is not lost.
The registration mechanism decided here does not extend to that use case.
provider-http registers the runner by calling the Forgejo admin API (POST /api/v1/admin/actions/runners) with admin credentials AppCat controls, against the in-cluster instance reached over internal service DNS (<instance>-http.<namespace>.svc:3000).
For a customer-owned instance we have neither admin credentials nor any business mutating an instance we do not operate, so server-side registration is off the table.
The natural fit is to invert the flow: instead of AppCat registering the runner, the customer generates a runner registration token on their own Forgejo and hands it to us.
The claim would carry a secretRef (a corev1.SecretKeySelector, mirroring the existing UnmanagedBucket and Keycloak custom-mount patterns in AppCat) pointing at a Secret with that token, plus the instance’s external URL.
The composition reads the token and renders the .runner config Secret directly, then points the forgejo-runner chart at it via runner.config.existingSecret, the same chart hook the AddOn already uses, just skipping the provider-http registration step entirely.
This has knock-on consequences that confirm it belongs in a separate decision rather than this one:
-
It is a standalone service, not an AddOn coupled to a co-located Forgejo, so it would not live in a Forgejo instance namespace. This reinforces keeping a dedicated namespace viable (see the Decision), since the runner would have no instance namespace to share.
-
The runner address becomes a customer-provided FQDN over public ingress rather than internal cluster DNS.
-
The registration lifecycle (deregistration, token rotation) shifts to the customer’s side; AppCat only consumes the token it is given.
If this use case is pursued, it should be a new ADR building on the secret-ref idea above, not a modification of this AddOn.