Service Operator

Service Operators manage service instances across all tenants. They troubleshoot issues, perform maintenance, and respond to incidents with cluster-wide visibility and control.

Requirements

As a Service Operator, I can:

  • Monitor all instances, with cluster-wide visibility into health, SLI metrics, and alerts

  • Access the cluster directly with kubectl for investigation and remediation

  • View a composite status that shows health, errors, and resource states

  • Route alerts to incident management with runbook links for standardized response

  • Follow service-specific troubleshooting procedures documented in runbooks

  • Coordinate upgrades during maintenance windows

  • Restore service instances from backups when needed

  • Execute emergency procedures during incidents
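
To make the composite status requirement above more concrete, here is a minimal sketch of how per-resource states might be rolled up into one status with health, errors, and resource states. The names ResourceState, CompositeStatus, and composite_status, and the health values "Healthy", "Degraded", "Unhealthy", and "Unknown", are assumptions made for illustration; they are not defined by this document or by any specific platform.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ResourceState:
    """Hypothetical per-resource state reported for a service instance."""
    name: str
    ready: bool
    error: str = ""


@dataclass
class CompositeStatus:
    """Hypothetical rolled-up view: health, errors, and resource states."""
    health: str
    errors: List[str] = field(default_factory=list)
    resources: Dict[str, str] = field(default_factory=dict)


def composite_status(resources: List[ResourceState]) -> CompositeStatus:
    """Roll individual resource states up into a single composite status.

    Assumed rules (illustrative only):
      - no resources reported        -> "Unknown"
      - any resource reports an error -> "Unhealthy"
      - any resource is not ready     -> "Degraded"
      - otherwise                     -> "Healthy"
    """
    if not resources:
        return CompositeStatus(health="Unknown")

    errors = [f"{r.name}: {r.error}" for r in resources if r.error]
    states = {r.name: "Ready" if r.ready else "NotReady" for r in resources}

    if errors:
        health = "Unhealthy"
    elif any(not r.ready for r in resources):
        health = "Degraded"
    else:
        health = "Healthy"

    return CompositeStatus(health=health, errors=errors, resources=states)
```

An operator-facing dashboard or CLI could call composite_status once per instance and sort by health so that unhealthy instances surface first. The actual rules for what counts as degraded or unhealthy would depend on the service in question.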