ADR 0047 - Service Maintenance and Upgrades (Framework 2.0)

Author

Gabriel Saratura

Owner

Schedar

Reviewers

Schedar

Date Created

Date Updated

Status

draft

Tags

framework,framework2,maintenance,upgrades,crossplane,workflows,cronoperations

Summary

This ADR selects Kubernetes-native CronJobs to orchestrate maintenance and upgrades, with the option to adopt Argo Workflows or Crossplane Operations later as complexity grows.

Problem

We need automated maintenance workflows for:

  • Service instance upgrades

  • Backup and restore orchestration

  • Version checks and notifications

  • Maintenance window enforcement

  • Maintenance suspension

Current State

  • CronJob-based maintenance

  • VersionHandler tracks claim/composite/instance relationships

  • Manual chart version bumps in class/defaults.yml

  • Composition revisions for versioning

  • Revision policy: manual vs automatic

  • Maintenance CronJob switches revision during maintenance window

  • Configurable delayed production rollout

  • Hotfix job for urgent updates

  • Custom rollback scripts

Solutions

Option A: CronJob Pattern

Description:

Kubernetes CronJobs execute scheduled maintenance tasks at specified intervals. Jobs run containers that perform version checks, apply composition revision updates, and trigger service upgrades during maintenance windows.
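
As a rough illustration of this pattern, a maintenance CronJob might look like the sketch below. The names, namespace, schedule, service account, image, and arguments are all placeholders and not taken from component-appcat; only the CronJob fields themselves are standard Kubernetes API.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: redis-maintenance               # hypothetical name
  namespace: example-instance-namespace # hypothetical namespace
spec:
  schedule: "0 22 * * 2"                # placeholder window start: Tuesdays at 22:00
  timeZone: "Etc/UTC"
  suspend: false                        # set to true to suspend maintenance
  concurrencyPolicy: Forbid             # never run two maintenance jobs at once
  startingDeadlineSeconds: 3600         # skip the run if it cannot start within an hour
  successfulJobsHistoryLimit: 3         # older finished Jobs and their pod logs are pruned
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      backoffLimit: 1
      activeDeadlineSeconds: 3600       # abort a run that overruns the window
      template:
        spec:
          restartPolicy: Never
          serviceAccountName: maintenance   # hypothetical; needs RBAC to patch the composite
          containers:
          - name: maintenance
            image: registry.example.com/appcat/maintenance:v0.0.0  # placeholder image
            args: ["maintenance", "--service", "redis"]            # hypothetical CLI
```

In this sketch, the schedule and startingDeadlineSeconds approximate maintenance window enforcement, and the suspend field covers maintenance suspension.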

Advantages:

  • Proven Pattern: Used in component-appcat, works in production

  • Simple: Standard Kubernetes CronJob, easy to understand

  • No Additional Dependencies: Just a CronJob controller (built into K8s)

  • Easy Debugging: Job logs are available through kubectl logs

Disadvantages:

  • Limited Observability: Job logs are lost once finished Job pods are cleaned up (long-term history needs log aggregation)

  • No Complex Workflows: Difficult to express multi-step workflows (backup → upgrade → release → verify)

  • Sequential Only: Can’t easily parallelize tasks

Option B: Argo Workflows

Description:

Argo Workflows is a Kubernetes-native workflow engine that orchestrates multi-step operations as directed acyclic graphs (DAGs). Workflows define sequential or parallel steps with dependencies, conditional execution, and retry logic. Each workflow runs as pods executing container-based tasks.

Example Workflow:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: redis-upgrade-workflow
spec:
  entrypoint: upgrade-with-backup
  templates:
  - name: upgrade-with-backup
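    # "steps" is a list of lists: each outer entry runs sequentially.
    # The referenced templates (k8up-backup, update-revision, ...) are
    # assumed to be defined elsewhere, for example as WorkflowTemplates.
    steps:
    # Step 1: Backup
    - - name: create-backup
        template: k8up-backup

    # Step 2: Perform upgrade
    - - name: upgrade-instance
        template: update-revision

    # Step 3: Perform release
    - - name: release-instance
        template: rollout-revision

    # Step 4: Post-upgrade validation
    - - name: run-smoke-tests
        template: smoke-tests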

Advantages:

  • Complex Workflows: Multi-step workflows with dependencies (backup before upgrade)

  • Excellent Observability: Workflow UI shows step progress, logs, errors

  • Retry Logic: Built-in retry with exponential backoff

  • Conditional Steps: Skip steps based on conditions (for example, skip backup if a recent backup exists)

  • DAG Support: Parallel execution of independent tasks

  • Audit Trail: Workflow history persisted, easy to review what happened

  • Composition Function Integration: Workflows can be rendered by the composition function as part of each instance

  • Generic WorkflowTemplates: WorkflowTemplates can be defined cluster-wide in KCL and easily referenced in Workflows

  • Convenient UI: ServiceOperators can handle maintenance directly in the Argo Workflows UI
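
To make the retry, conditional, and DAG points above more concrete, here is a minimal sketch of a DAG-based Workflow. The template names, images, and the backup check are hypothetical stand-ins; a real workflow would call K8up and the revision update logic instead of echo commands.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: redis-upgrade-
spec:
  entrypoint: upgrade
  templates:
  - name: upgrade
    dag:
      tasks:
      - name: check-backup
        template: recent-backup-check
      # Conditional step: only back up when no recent backup was found
      - name: create-backup
        template: run-action
        arguments:
          parameters: [{name: action, value: backup}]
        depends: check-backup
        when: "{{tasks.check-backup.outputs.result}} == missing"
      # Runs whether the backup succeeded or was skipped
      - name: upgrade-instance
        template: run-action
        arguments:
          parameters: [{name: action, value: upgrade}]
        depends: "create-backup.Succeeded || create-backup.Skipped"
      # The next two tasks share one dependency, so they run in parallel
      - name: run-smoke-tests
        template: run-action
        arguments:
          parameters: [{name: action, value: smoke-tests}]
        depends: upgrade-instance
      - name: notify
        template: run-action
        arguments:
          parameters: [{name: action, value: notify}]
        depends: upgrade-instance

  # Hypothetical check; prints "missing" or "recent" as the step result
  - name: recent-backup-check
    script:
      image: alpine:3.20
      command: [sh]
      source: |
        echo missing

  # Placeholder task with retry and exponential backoff
  - name: run-action
    inputs:
      parameters:
      - name: action
    retryStrategy:
      limit: "3"
      retryPolicy: OnFailure
      backoff:
        duration: "30s"
        factor: "2"
        maxDuration: "10m"
    container:
      image: alpine:3.20
      command: [sh, -c]
      args: ["echo running {{inputs.parameters.action}}"]
```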

Disadvantages:

  • Additional Dependency: Requires Argo Workflows installation and management

  • Learning Curve: Schedar team members need to learn the workflow DSL (YAML-based)

  • Operational Overhead: Another operator to monitor, upgrade, and secure

  • Complexity: Might be overkill for simple version checks

Option C: Crossplane WatchOperation / CronOperation Functions

Description:

Crossplane Operations (CronOperation, WatchOperation) run function pipelines on a schedule or in response to resource changes. They reuse the function model of composition pipelines but are defined as separate resources rather than as steps of a Composition, so maintenance logic runs inside Crossplane without an external scheduler.

Example:

# Normal rendering stays in the Composition
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: vshnredis.vshn.appcat.vshn.io
spec:
  mode: Pipeline
  pipeline:
  - step: redis
    functionRef:
      name: function-appcat
---
# Maintenance operation (runs periodically); alpha API, fields may change
apiVersion: ops.crossplane.io/v1alpha1
kind: CronOperation
metadata:
  name: maintenance-cron
spec:
  schedule: "0 22 * * 2"  # placeholder schedule, not an actual maintenance window
  concurrencyPolicy: Forbid
  operationTemplate:
    spec:
      mode: Pipeline
      pipeline:
      - step: maintenance-check
        functionRef:
          name: function-appcat-maintenance
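
The event-driven variant named in this option, WatchOperation, could be sketched roughly as follows. The field names reflect the alpha ops.crossplane.io API as understood at the time of writing and should be checked against the installed Crossplane version. The watched kind, the label, and the resource name are assumptions for illustration.

```yaml
apiVersion: ops.crossplane.io/v1alpha1
kind: WatchOperation
metadata:
  name: maintenance-on-change            # hypothetical name
spec:
  # Run the pipeline whenever a matching resource changes
  watch:
    apiVersion: vshn.appcat.vshn.io/v1   # assumed group/version of the composite
    kind: XVSHNRedis                     # assumed composite kind
    matchLabels:
      example.appcat.vshn.io/maintenance: "enabled"   # hypothetical label
  concurrencyPolicy: Forbid
  operationTemplate:
    spec:
      mode: Pipeline
      pipeline:
      - step: maintenance-check
        functionRef:
          name: function-appcat-maintenance
```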

Advantages:

  • No Additional Dependencies: Built into Crossplane

  • Tightly Integrated: Direct access to composed resources and claim state

  • Declarative: Same function model as composition

Disadvantages:

  • Experimental: CronOperation and WatchOperation are still alpha in Crossplane

  • Complex Logic in Go: Maintenance logic embedded in composition function code

  • No Multi-Step Workflows: Can’t express backup → upgrade → verify easily

Decision

Use the CronJob pattern (Option A). In Phase 2, reevaluate Crossplane Operations (Option C) versus Argo Workflows (Option B) for complex orchestration tasks such as releases and rollbacks.

Rationale

  1. Proven Pattern: CronJob-based maintenance works in component-appcat production.

  2. Simplicity: No additional dependencies, standard Kubernetes pattern.

  3. Composition Revisions: Leverage the ADR 0030 pattern (revision selector + automatic policy) for upgrades during maintenance windows.

  4. Good Enough: Version checks and revision updates don’t require complex multi-step workflows yet.
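
Point 3 relies on Crossplane's composition revision selection. ADR 0030 is not reproduced here, so the following is only a generic sketch of a composite that combines a revision selector with the automatic update policy. The composite kind, resource name, and label key and value are assumptions. Newer Crossplane versions may nest these fields differently (for example under spec.crossplane).

```yaml
apiVersion: vshn.appcat.vshn.io/v1
kind: XVSHNRedis                         # assumed composite kind
metadata:
  name: example-redis                    # hypothetical instance name
spec:
  compositionRef:
    name: vshnredis.vshn.appcat.vshn.io
  # Automatic: follow the newest revision that matches the selector
  compositionUpdatePolicy: Automatic
  compositionRevisionSelector:
    matchLabels:
      # Hypothetical label; the maintenance CronJob would change this value
      # during the maintenance window to move the instance to a new revision
      example.appcat.vshn.io/revision: "v1"
```

With this layout, an upgrade is a single patch to the selector label, which keeps the maintenance job itself small.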

Phase 2 Deferred Decision:

When CronJob maintenance becomes insufficient for complex multi-step workflows (backup → upgrade → verify → rollback):

  • Option B (Argo Workflows): Production-ready workflow engine with DAG orchestration and dedicated UI

  • Option C (Crossplane Operations): Native Crossplane integration, but currently in alpha

Reevaluate based on Crossplane Operations maturity, actual workflow complexity needs, and operational burden.