Alert rule: CrossplaneDown

Overview

This alert fires when any Crossplane controller pod in the Crossplane namespace is not reachable by Prometheus for 10 minutes.

Crossplane is the core orchestration engine for AppCat services. When it is down, no new service instances can be provisioned, existing compositions stop reconciling, and any pending changes (scale, update, delete) are blocked.

The alert fires when up{namespace="…", job=~"^crossplane-.+$"} != 1 for 10 minutes.

This alert is only active on clusters where appcat.crossplane.monitoring.enabled: true. It is disabled by default.

Steps for Debugging

Set the namespace from the alert labels:

NAMESPACE='syn-crossplane'
POD='<pod-from-alert>'

Check pod status:

kubectl --as=system:admin -n $NAMESPACE get pods -l release=appcat
kubectl --as=system:admin -n $NAMESPACE describe pod $POD

Check pod logs:

kubectl --as=system:admin -n $NAMESPACE logs $POD --tail=100
kubectl --as=system:admin -n $NAMESPACE logs $POD --previous --tail=100

If the pod is not running, check node pressure or eviction events:

kubectl --as=system:admin -n $NAMESPACE get events --sort-by=.lastTimestamp | tail -30
kubectl --as=system:admin get nodes

Check if the Crossplane deployment is healthy:

kubectl --as=system:admin -n $NAMESPACE get deployment
kubectl --as=system:admin -n $NAMESPACE rollout status deployment/crossplane
kubectl --as=system:admin -n $NAMESPACE rollout status deployment/crossplane-rbac-manager

Verify the Prometheus scrape target is configured:

kubectl --as=system:admin -n $NAMESPACE get servicemonitor crossplane
kubectl --as=system:admin -n $NAMESPACE get service crossplane-metrics

Steps for Remediation

Pod is crash-looping:

Check logs for the root cause (missing CRD, permission denied, OOM).
If OOMKilled, increase memory limits via the component parameters.
If a CRD is missing, check that the Crossplane ArgoCD application is fully synced.

Pod stuck in Pending:

Check node resources and taints.
Verify PodDisruptionBudget is not blocking eviction.

Deployment scaled to 0:

Check why it was scaled down (ArgoCD application health, recent Commodore push). As a temporary measure to restore service while investigating, scale it back up manually - note that ArgoCD will revert this if autosync is enabled:

kubectl --as=system:admin -n $NAMESPACE scale deployment/crossplane --replicas=1