ADR 0052 - Rework initial maintenance
| Author | Nicolas Bigler |
|---|---|
| Owner | Schedar |
| Reviewers | |
| Date Created | 2026-04-24 |
| Date Updated | 2026-04-24 |
| Status | draft |
| Tags | framework,service,maintenance,provisioning |
Summary
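Initial maintenance runs immediately after provisioning and currently does three things: it resolves the image tag, pins the composition revision, and runs the remaining maintenance side effects.
The pod restarts triggered during instance bootstrap can corrupt the state of certain services (e.g. MariaDB). The decision is a two-step rollout: a short-term pod-readiness check in initial maintenance (Option E), followed by the long-term removal of initial maintenance in favour of an image catalogue, a revision-selector controller and a dedicated autoUpdate controller.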
Problem
Services are provisioned with only the major image tag (PostgreSQL 16, Redis 7, …) because the composition only knows the major version from the claim.
To ensure instances run on the latest patch and on the right production revision from day one, AppCat triggers an initial maintenance job immediately after provisioning.
That job does three distinct things at once:
1. Resolve the image tag: upgrade the running pods from the major tag to the latest full semver.
2. Pin the composition revision: set `spec.compositionRevisionSelector` to the current production revision label, and set `spec.compositionUpdatePolicy=Automatic`. Without this pin, Crossplane floats the claim to every newly published revision, which is how test instances behave. New claims must be pinned to be production instances (unless the `autoUpdate` label is set).
3. Run the rest of the maintenance side effects: EOL status, StackGres/CNPG-specific ops, registry auth probes, `autoUpdate` label handling.
Running step 1 while the instance is still bootstrapping is unsafe:
- Uncontrolled restarts: pods that have not finished their first-time init sequence (for example cluster bootstrap) get restarted during the bootstrap process.
- State corruption: observed across several services, most notably on MariaDB.
- Provisioning latency: every new instance restarts unnecessarily, which delays provisioning. This is especially noticeable on services with long start-up times (for example Keycloak).
Steps 2 and 3 do not cause corruption, but they are coupled to step 1: if we drop initial maintenance wholesale, new claims would be unpinned (floating on every new revision, that is test-instance semantics) and the autoUpdate fast-path would be lost.
Any replacement must therefore:
- Deliver the correct full semver image tag to the pod without a restart-after-bootstrap cycle.
- Pin new claims to the current production composition revision on creation.
- Honour `metadata.appcat.vshn.io/autoUpdate` transitions immediately, not at the next maintenance window (this flag switches the instance from the prod revision pin to floating latest, that is prod→test).
- Preserve the remaining side effects (EOL, registry auth, service-specific ops) or move them explicitly elsewhere.
Current State
- `pkg/comp-functions/functions/common/maintenance/maintenance.go` creates a one-time `Job` (`<composite>-initial-maintenance`) on first reconcile and keeps it in the desired state for 30 minutes after completion.
- Per-service maintenance under `pkg/maintenance/` (`postgresql.go`, `postgresqlcnpg.go`, `redis.go`, `mariadb.go`, `keycloak.go`, `nextcloud.go`, `forgejo.go`, `minio.go`) runs the same code paths for initial and scheduled maintenance:
  - security upgrade / minor version upgrade for SGCluster/CNPG/Helm releases (the part that restarts pods)
  - `ReleaseLatest` via `pkg/maintenance/release/release.go`: sets composition revision selector + update policy on the claim.
  - EOL status propagation, repack, vacuum, etc.
- Composition revisions are labelled with `metadata.appcat.vshn.io/revision`. The `ReleaseLatest` flow picks the latest revision older than `minimumRevisionAge` and writes that label into the claim’s `compositionRevisionSelector`.
- Claims created by users have no revision selector until initial maintenance sets one.
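The revision pick described in the last two points can be sketched as follows. This is only an illustration, not the code in `release.go`: the `Revision` struct, `pickRevision`, the seven-day `minimumRevisionAge` value and the sample labels are all hypothetical, and the sketch assumes that "latest" means most recently created.

```go
package main

import (
	"fmt"
	"time"
)

// Revision is a hypothetical, simplified view of a composition revision.
type Revision struct {
	Label   string    // value of the metadata.appcat.vshn.io/revision label
	Created time.Time // creation timestamp of the revision
}

// minimumRevisionAge is an assumed value, chosen for illustration only.
const minimumRevisionAge = 7 * 24 * time.Hour

// pickRevision returns the most recently created revision that is at least
// minAge old at time now. ok is false if no revision qualifies.
func pickRevision(revs []Revision, minAge time.Duration, now time.Time) (best Revision, ok bool) {
	for _, r := range revs {
		if now.Sub(r.Created) < minAge {
			continue // too young to be treated as a production revision
		}
		if !ok || r.Created.After(best.Created) {
			best, ok = r, true
		}
	}
	return best, ok
}

// exampleNow and exampleRevisions are made-up sample data.
var exampleNow = time.Date(2026, 4, 24, 0, 0, 0, 0, time.UTC)

func exampleRevisions() []Revision {
	day := 24 * time.Hour
	return []Revision{
		{Label: "rev-a", Created: exampleNow.Add(-30 * day)},
		{Label: "rev-b", Created: exampleNow.Add(-10 * day)},
		{Label: "rev-c", Created: exampleNow.Add(-1 * day)},
	}
}

func main() {
	if r, ok := pickRevision(exampleRevisions(), minimumRevisionAge, exampleNow); ok {
		fmt.Println("pin claim to revision", r.Label) // prints "pin claim to revision rev-b"
	}
}
```

With the sample data, `rev-c` is skipped because it is only one day old, so the claim would be pinned to `rev-b`.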
Solutions
Option A: Image catalogue with weekly updater
Description:
A cluster-scoped ImageCatalogue CR (or ConfigMap) holds service + majorVersion → latest full semver plus compatibility constraints.
A CronJob refreshes it weekly by querying upstream registries / release feeds.
Composition functions read the catalogue when resolving the service image tag, so new instances deploy with the correct full semver on the first reconcile.
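The core of such an updater is choosing the highest full semver for a given major from a list of registry tags. The sketch below shows one way to do that with the standard library only; `parseSemver`, `latestForMajor` and the tag list are hypothetical, and the parser deliberately accepts only plain `MAJOR.MINOR.PATCH` tags.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseSemver parses a plain "MAJOR.MINOR.PATCH" tag. Tags with prefixes,
// suffixes or pre-release parts are rejected in this simplified sketch.
func parseSemver(tag string) (v [3]int, ok bool) {
	parts := strings.Split(tag, ".")
	if len(parts) != 3 {
		return v, false
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return v, false
		}
		v[i] = n
	}
	return v, true
}

// latestForMajor returns the highest full semver tag with the given major
// version, or "" if none of the tags match.
func latestForMajor(tags []string, major int) string {
	best, bestTag := [3]int{}, ""
	for _, t := range tags {
		v, ok := parseSemver(t)
		if !ok || v[0] != major {
			continue
		}
		// Compare minor first, then patch, numerically (so 10 > 9).
		if bestTag == "" || v[1] > best[1] || (v[1] == best[1] && v[2] > best[2]) {
			best, bestTag = v, t
		}
	}
	return bestTag
}

func main() {
	// Made-up tag list, as a registry might return it.
	tags := []string{"7", "7.2.4", "7.10.1", "7.9.12", "8.0.0", "latest"}
	fmt.Println(latestForMajor(tags, 7)) // prints "7.10.1"
}
```

Comparing the parsed numbers rather than the raw strings matters here: a string comparison would rank `7.9.12` above `7.10.1`.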
Advantages:
- Deterministic: functions stay offline, no registry calls at render time.
- Respects Crossplane guidance: no external HTTP from composition functions.
- Central source of truth: one place to enforce compatibility matrices (CNPG operator vs. PG minor, StackGres vs. PG minor).
- Single mechanism across all services.
Disadvantages:
- Up to 7 days stale.
- New component: CRD + controller + tests + ops.
- Compatibility modelling is non-trivial.
- Does not solve revision pinning on its own: must be combined with C or H.
Option B: Native Crossplane EnvironmentConfig as catalogue
Description:
Instead of a custom ImageCatalogue CRD (Option A), store the service + majorVersion → latest full semver mapping in one or more apiextensions.crossplane.io/EnvironmentConfig resources.
Compositions already support reading EnvironmentConfig via the function pipeline, so the resolved tag is available at render time natively.
A weekly CronJob (or the CronOperation from Option D) patches the EnvironmentConfig.
Advantages:
- No new CRD: reuses a Crossplane-native primitive.
- First-class composition integration: no custom reader code in functions.
- Smaller surface: one CR to update, no controller-runtime reconciler for the catalogue itself.
Disadvantages:
- Up to 7 days stale.
- Schema flexibility limited: `EnvironmentConfig` is a flat key-value bag. Compatibility matrices and structured constraints need to be encoded as conventions.
- Validation: no CRD means no `kubebuilder` validation on the catalogue contents. Validation lives in the updater and the consuming function.
- Does not solve revision pinning on its own: must be combined with C or H.
Option C: Revision-selector controller
Description:
A dedicated controller pins new claims to the current production revision at creation time, replacing the bootstrap-time ReleaseLatest call currently made from initial maintenance.
- On claim create the controller sets `compositionRevisionSelector` + `compositionUpdatePolicy` so the claim starts pinned to the current production revision.
- Revision rotation on existing instances stays in the regular maintenance CronJob. Revision changes must only happen inside the instance’s maintenance window.
Image tag resolution is not part of this option. This is handled by the composition functions reading an image catalogue (Option A / Option B).
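The pinning rule of such a controller can be sketched as a pure function. The `Claim` struct below is a hypothetical, trimmed-down stand-in for the fields named in this ADR, not the Crossplane API type, and `pinOnCreate` is an illustrative name; a real controller would wrap this logic in a controller-runtime reconciler.

```go
package main

import "fmt"

// Claim is a hypothetical stand-in holding only the fields this ADR names.
type Claim struct {
	Labels                      map[string]string
	CompositionRevisionSelector map[string]string // matchLabels of the selector
	CompositionUpdatePolicy     string
}

const (
	revisionLabel   = "metadata.appcat.vshn.io/revision"
	autoUpdateLabel = "metadata.appcat.vshn.io/autoUpdate"
)

// pinOnCreate pins a claim that has no selector yet to prodRevision.
// It returns true if the claim was changed.
func pinOnCreate(c *Claim, prodRevision string) bool {
	if c.Labels[autoUpdateLabel] == "true" {
		return false // floating instance, leave it unpinned
	}
	if len(c.CompositionRevisionSelector) > 0 {
		return false // already pinned; rotation belongs to scheduled maintenance
	}
	c.CompositionRevisionSelector = map[string]string{revisionLabel: prodRevision}
	c.CompositionUpdatePolicy = "Automatic"
	return true
}

func main() {
	c := &Claim{}
	changed := pinOnCreate(c, "rev-b") // "rev-b" is a made-up revision label
	fmt.Println(changed, c.CompositionRevisionSelector, c.CompositionUpdatePolicy)
}
```

Because an already pinned claim is left untouched, the function never rotates an existing instance, which keeps revision changes inside the maintenance window.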
Advantages:
- Solves pinning at claim creation: new claims pinned immediately, no dependency on an initial-maintenance job.
- Respects maintenance windows: existing instances are only rotated by scheduled maintenance, never ad-hoc by this controller.
- Works with `autoUpdate`: label transitions flow through a sibling controller that unpins / repins immediately (the fast-path initial maintenance currently owns).
- Fits the existing model: reuses composition revisions, `metadata.appcat.vshn.io/revision` label, update policy.
Disadvantages:
- New controller: one more reconciler to own and monitor.
- Race window on create: claim may briefly float before the controller reconciles (mitigation: default the composition to a policy that prefers the current revision until the selector is written, or reconcile-on-create).
Option D: Crossplane Operations (CronOperation / WatchOperation)
Description:
Replace the dedicated controller of Option C with Crossplane’s WatchOperation: it reacts to claim create/update events and writes compositionRevisionSelector + compositionUpdatePolicy, handling autoUpdate label transitions in the same pipeline.
Same pipeline model as the regular composition functions, same SDK, same deployment path.
Advantages:
- No new controller code: reuses the function framework the team already owns.
- Tightly integrated with Crossplane: direct access to claim/composite state and the composition revision API.
- Declarative: operations defined as CRs, not Go reconcilers.
- Aligns with ADR0047 phase 2: the deferred decision explicitly reserves space for evaluating Operations as complexity grows.
Disadvantages:
- Still alpha: `CronOperation` / `WatchOperation` are alpha in Crossplane. Production use is a risk vs. a plain controller-runtime controller.
- Maintenance logic embedded in function code: same ergonomic downsides as regular composition functions.
- No multi-step workflows: fine for this use case, but means a future rollout/rollback need still falls back to CronJob or Workflows (per ADR0047).
Option E: Pod-readiness check in initial maintenance
Description:
Keep initial maintenance but, before triggering any upgrade that restarts pods, wait until the instance has finished bootstrapping:
SGCluster ClusterIsReady, CNPG Cluster.Status.Phase == "Cluster in healthy state" with ReadyInstances == Instances, Helm releases deployed with all workloads Ready.
Steps 2 (revision pinning) and 3 (side effects) are unchanged.
Advantages:
- Minimal change: one readiness predicate per service in `pkg/maintenance/<service>.go`.
- Ships quickly: a safe stop-gap that fixes the corruption symptom.
- Leaves revision pinning flow intact: no regression on production semantics.
Disadvantages:
- Does not remove the coupling: initial maintenance is still on the critical path.
- Provisioning is still slow: new instances still restart unnecessarily.
- Per-service readiness logic needs to be maintained.
Option F: HTTP client in composition functions
Description:
Composition functions call upstream (Docker Hub, GHCR, StackGres API) to resolve the latest patch for the requested major version at render time.
Advantages:
- Always fresh.
Disadvantages:
- Against Crossplane best practices: functions must be deterministic.
- External dependency on the render path: registry outage degrades every reconcile.
- Non-reproducible renders.
- Rate limits become a production concern.
- Does not address revision pinning.
Rejected on principle.
Option G: Deploy with replicas=0, scale up after image update
Description:
Composition functions deploy the workload with replicas=0.
An init step (maintenance-like) resolves the image tag, patches the workload, and then scales up.
Advantages:
- No restart during bootstrap.
Disadvantages:
- CNPG has no `replicas=0`: `Cluster.spec.instances` minimum is 1, needs a custom solution.
- Tight two-component coupling: provisioning only works if maintenance is healthy; blast radius of a maintenance bug grows to every new instance.
- Per-service special cases: StackGres, Helm-based services, Garage cluster bootstrap all differ.
- Confusing UX: users see a service "stuck" at 0 replicas.
- Does not address revision pinning.
Option H: Mutating admission webhook
Description:
A mutating webhook on the claim resolves majorVersion → full semver at admission time (reading the catalogue from Option A), and also writes the current production revision selector + update policy onto the claim.
Subsequent revision rotation still needs another mechanism (maintenance or a controller) to update the selector.
Advantages:
- Happens before any pod starts: no bootstrap restart.
- Simple per service: one mutation rule per kind.
- Covers both image tag and revision pinning at creation time.
Disadvantages:
- Webhook on the critical path: admission failures block every create/update of the claim.
- Still needs a catalogue and a rotation mechanism: does not stand alone.
- Reconciliation drift: a tag written into the claim does not re-resolve on later patches unless another mechanism owns updates (overlap with Option C).
- User-visible mutation: users see a semver they did not write.
Option I: Drop initial maintenance + dedicated autoUpdate controller (standalone)
Description:
Remove initial maintenance entirely and only add a small controller that reacts to autoUpdate label transitions.
Accept that new instances run the major tag and float on the default revision until the next maintenance window.
Advantages:
- Simplest possible change.
Disadvantages:
- Leaves patches unapplied for up to a week: unacceptable for security fixes.
- Leaves new claims unpinned: production instances behave like test instances until first maintenance.
- Only viable when combined with Option A/B (image catalogue consumed by composition functions) for the image tag and Option C for pinning.
Option J: Reuse image-reflector-controller (Flux) as the updater
Description:
Instead of writing the weekly catalogue updater from scratch, use Flux’s image-reflector-controller (ImageRepository + ImagePolicy) to scan upstream registries and resolve the latest semver matching a given major.
The selected tag is written to the catalogue (Option A) or EnvironmentConfig (Option B) by a small glue controller, or by an ImageUpdateAutomation-style process.
Advantages:
- Battle-tested registry scanning: auth, pagination, rate-limit handling already solved.
- Rich semver filtering: `ImagePolicy` supports semver ranges, regex, numeric filters natively.
- Cuts build-from-scratch cost: no need to re-implement registry clients per registry type.
Disadvantages:
- Flux dependency: adds `image-reflector-controller` as an operational dependency, even for clusters that do not otherwise run Flux.
- Glue layer still required: Flux writes status on `ImagePolicy`, but translating that into catalogue entries is our code.
- Operational ownership: new controller to monitor, upgrade, secure.
Decision
Two-step rollout.
Step 1: Short term (stop the corruption): adopt Option E. Add a per-service pod-readiness precondition to the initial maintenance path so upgrades only run once the instance has finished bootstrapping. Revision pinning and side effects continue to run as today. This ships in the next maintenance release.
Step 2: Long term (remove initial maintenance): adopt Option A / Option B (image catalogue as the storage primitive) consumed by the composition functions, plus Option C (revision-selector controller) and a dedicated autoUpdate controller.
Concretely:
- An image catalogue holds `service + majorVersion → latest full semver`, refreshed by a weekly updater.
- Composition functions read the catalogue on every reconcile and write the resolved full semver onto the composite (and its managed resources) at render time. New instances render with the correct tag on the first reconcile.
- A revision-selector controller sets `compositionRevisionSelector` + `compositionUpdatePolicy` on new claims at creation time, pinning them to the current production revision. Revision rotation on existing instances stays in the regular maintenance CronJob so it only happens inside the instance’s maintenance window.
- The `autoUpdate` controller watches the `metadata.appcat.vshn.io/autoUpdate` label on claims/composites and, on transitions, removes (label=true) or restores (label=false) the revision pin immediately, taking over the fast-path that initial maintenance currently owns.
- The remaining side effects of initial maintenance (registry auth probe, service-specific ops) are audited and either migrated to the regular maintenance CronJob or moved into dedicated controllers.
- Once the above are in place per service, the initial maintenance job is removed from `pkg/comp-functions/functions/common/maintenance/maintenance.go`.
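The label-driven transition of the autoUpdate controller can be sketched as a small, idempotent state function. The `Pin` type and `reconcileAutoUpdate` are hypothetical names, and restoring the pin to the current production revision when the label is switched back is an assumption; the exact semantics are left to the follow-up design note.

```go
package main

import "fmt"

const revisionLabel = "metadata.appcat.vshn.io/revision"

// Pin is a hypothetical, simplified view of the pinning state of a claim.
type Pin struct {
	Selector map[string]string // nil means the claim floats on the latest revision
}

// reconcileAutoUpdate applies the transition for the given label value:
// "true" removes the pin, anything else restores it to prodRevision.
// Calling it repeatedly with the same inputs yields the same result.
func reconcileAutoUpdate(p Pin, labelValue, prodRevision string) Pin {
	if labelValue == "true" {
		return Pin{} // prod -> test: float on every new revision
	}
	if p.Selector[revisionLabel] != "" {
		return p // already pinned, keep the current revision
	}
	return Pin{Selector: map[string]string{revisionLabel: prodRevision}}
}

func main() {
	// "rev-b" is a made-up revision label.
	pinned := reconcileAutoUpdate(Pin{}, "false", "rev-b")
	floating := reconcileAutoUpdate(pinned, "true", "rev-b")
	fmt.Println(pinned.Selector, floating.Selector)
}
```

An already pinned claim keeps its revision even if a newer production revision exists, so rotation stays with the scheduled maintenance CronJob.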
Implementation variants to evaluate during Step 2 PoCs:
- Catalogue storage: custom `ImageCatalogue` CRD (Option A) vs. `EnvironmentConfig` (Option B). Default to `EnvironmentConfig` unless validation needs force a CRD.
- Catalogue updater: custom CronJob (Option A) vs. Flux `image-reflector-controller` (Option J). Default to the custom CronJob unless the registry-scanning surface grows enough to justify the Flux dependency.
- Controllers for revision pinning and `autoUpdate`: plain controller-runtime controllers (Option C) vs. Crossplane `WatchOperation` / `CronOperation` (Option D). Default to plain controllers unless we agree that using the alpha version of Crossplane Operations is acceptable for us.
Options F, G, I standalone are rejected for the reasons listed above. Option H is not chosen but its selector-writing mutation remains a fallback if the Step-2 controller-based pinning turns out to race with Crossplane’s revision selection on claim creation.
Rationale
- Root cause vs. symptom: the corruption is a symptom of "resolve the version after the pods exist". Catalogue-driven resolution in composition functions moves version resolution to render time, before any pods are created.
- Separates image tag from pinning: catalogue-driven resolution in composition functions delivers the correct full semver at provisioning; the revision-selector controller independently delivers the correct revision selector at provisioning. Neither requires an admission-time hack.
- Reuses existing building blocks: composition revisions, revision labels, update policies and `ReleaseLatest` semantics already exist (ADR0030). Option C extends them rather than duplicating them.
- Respects Crossplane best practices: no HTTP calls from composition functions (rules out F), no admission-time registry lookups (constrains H).
- Preserves the `autoUpdate` fast-path: moving it into a dedicated controller keeps the immediate prod→test behaviour without requiring the rest of the maintenance flow to run.
- Staged migration: Option E is deliberately a stop-gap so production is safe while Option C rolls out per service.
- Implementation flexibility preserved: Options B, D, J are called out as alternative implementations for the catalogue, the updater, and the controllers. They do not change the architectural decision and will be settled by the Step-2 PoCs.
Consequences
Immediate:
- Initial maintenance still runs, but waits for pod readiness before triggering upgrades.
- Provisioning becomes slightly slower (observable wait) but no longer corrupts instances.
- Each service in `pkg/maintenance/` gains an `isSafeToRestart(ctx)` predicate:
  - StackGres: `SGCluster.Status.Conditions[ClusterIsReady] == True` and no in-flight `SGDbOps`.
  - CNPG: `Cluster.Status.Phase == "Cluster in healthy state"` and `ReadyInstances == Instances`.
  - Helm-based (Keycloak, Nextcloud, Forgejo, Garage, MariaDB, Redis): all managed workloads at desired replica count and `Ready`.
- Revision pinning and `autoUpdate` handling continue to run via initial maintenance.
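The CNPG and Helm-based predicates above can be sketched as pure checks over already-fetched status. The structs below are hypothetical minimal views, not the operator API types, and the functions omit the `ctx` and client plumbing a real `isSafeToRestart(ctx)` would need. Treating a cluster with zero instances or an empty workload list as not ready is an extra assumption, added so that a not-yet-created instance is never reported as safe.

```go
package main

import "fmt"

// CNPGStatus is a hypothetical, minimal view of the CNPG Cluster status.
type CNPGStatus struct {
	Phase          string
	Instances      int
	ReadyInstances int
}

const cnpgHealthyPhase = "Cluster in healthy state"

// cnpgSafeToRestart reports whether the cluster has finished bootstrapping.
func cnpgSafeToRestart(s CNPGStatus) bool {
	return s.Phase == cnpgHealthyPhase && s.Instances > 0 && s.ReadyInstances == s.Instances
}

// Workload is a hypothetical view of a workload managed by a Helm release.
type Workload struct {
	Name            string
	DesiredReplicas int
	ReadyReplicas   int
}

// helmSafeToRestart reports whether all workloads are at their desired
// replica count and ready. On false it names the first blocking workload,
// or "" if there are no workloads at all.
func helmSafeToRestart(ws []Workload) (bool, string) {
	if len(ws) == 0 {
		return false, "" // nothing deployed yet, still bootstrapping
	}
	for _, w := range ws {
		if w.ReadyReplicas != w.DesiredReplicas {
			return false, w.Name
		}
	}
	return true, ""
}

func main() {
	fmt.Println(cnpgSafeToRestart(CNPGStatus{Phase: cnpgHealthyPhase, Instances: 3, ReadyInstances: 2}))
	fmt.Println(helmSafeToRestart([]Workload{{Name: "keycloak", DesiredReplicas: 2, ReadyReplicas: 2}}))
}
```

Returning the blocking workload name gives the maintenance job something concrete to log while it waits.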
Mid-term:
- Initial maintenance job removed from composition functions; `pkg/comp-functions/functions/common/maintenance/maintenance.go` loses its initial-maintenance branch.
- Composition functions resolve the service image tag from the image catalogue at render time.
- A revision-selector controller owns `compositionRevisionSelector` and `compositionUpdatePolicy` on claims at creation time.
- An `autoUpdate` controller owns label-driven transitions.
- `ReleaseLatest` in `pkg/maintenance/release/release.go` is reduced to the scheduled-maintenance path (revision rotation on existing instances stays there so it respects the instance’s maintenance window).
Risks and follow-ups:
- Catalogue availability: composition functions now depend on the catalogue being present and populated. Functions must handle a missing or stale catalogue deterministically (fallback to major tag, or fail render explicitly).
- Compatibility matrix: catalogue must encode operator/service compatibility. Initial version can be static per release and grown over time.
- Hidden dependencies on initial maintenance: EOL detection, registry auth probing, and early `ReleaseLatest` calls all live in the same path. Follow-up tickets must audit each per service and either move or drop them.
- Race: claim created before revision-selector controller reconciles: instance could float for a short window before being pinned. Mitigated by defaulting the update policy in the composition to a mode that prefers the current revision until the controller writes an explicit selector, or by making the controller reconcile-on-create.
- autoUpdate synchrony semantics: whether the controller should block on revision switch or be eventually consistent needs a separate design note.
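The two handling strategies named under catalogue availability (fall back to the major tag, or fail the render) can be sketched as an explicit policy. All names below are hypothetical, the 14-day `maxCatalogueAge` is an assumed threshold, and `now` is passed in by the caller so the result depends only on the function's inputs.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Entry is a hypothetical catalogue entry for one service and major version.
type Entry struct {
	Tag       string    // resolved full semver
	UpdatedAt time.Time // last refresh by the updater
}

// Policy selects what happens when the entry is missing or stale.
type Policy int

const (
	FallbackToMajor Policy = iota // render with the major tag
	FailRender                    // return an error and fail the render
)

// maxCatalogueAge is an assumed staleness threshold for illustration.
const maxCatalogueAge = 14 * 24 * time.Hour

var ErrCatalogue = errors.New("catalogue entry missing or stale")

// resolveImageTag returns the catalogue tag if the entry is present and
// fresh, and otherwise applies the given policy.
func resolveImageTag(e *Entry, major string, maxAge time.Duration, now time.Time, p Policy) (string, error) {
	if e != nil && e.Tag != "" && now.Sub(e.UpdatedAt) <= maxAge {
		return e.Tag, nil
	}
	if p == FallbackToMajor {
		return major, nil
	}
	return "", ErrCatalogue
}

// exampleNow is made-up sample data.
var exampleNow = time.Date(2026, 4, 24, 0, 0, 0, 0, time.UTC)

func main() {
	fresh := &Entry{Tag: "7.2.4", UpdatedAt: exampleNow.Add(-24 * time.Hour)}
	fmt.Println(resolveImageTag(fresh, "7", maxCatalogueAge, exampleNow, FailRender))
	fmt.Println(resolveImageTag(nil, "7", maxCatalogueAge, exampleNow, FallbackToMajor))
	fmt.Println(resolveImageTag(nil, "7", maxCatalogueAge, exampleNow, FailRender))
}
```

Making the policy an explicit argument keeps the choice visible per service instead of burying it in the lookup code.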
Follow-up tickets (to be created):
- Implement per-service pod-readiness check in initial maintenance (Option E).
- Design + implement `ImageCatalogue` CRD and weekly updater (Option A).
- Extend composition functions to read the image catalogue and write the resolved full semver onto the composite at render time.
- Implement revision-selector controller (sets/reconciles `compositionRevisionSelector` and `compositionUpdatePolicy` on claims).
- Implement `autoUpdate` label controller; remove the corresponding logic from `pkg/maintenance/release/release.go`.
- Audit non-version side effects of initial maintenance (EOL, registry auth, service-specific ops) and migrate each.
- Remove initial maintenance job creation from `pkg/comp-functions/functions/common/maintenance/maintenance.go` once per-service migration is complete.
- PoC: pod-readiness predicate for StackGres + CNPG (validates Option E before rollout).
- PoC: revision-selector controller writing selector on claim create (validates Option C pinning).
- PoC: catalogue as `EnvironmentConfig` consumed by a composition function (validates Option B viability before committing to a custom CRD).
- PoC: `image-reflector-controller` driving the catalogue (validates Option J vs. a custom CronJob updater).
- Spike: implement the pinning logic as a `WatchOperation` (validates Option D against a plain controller-runtime controller).