Migrate to Cilium CNI
Prepare for migration
Make sure that your $KUBECONFIG points to the cluster you want to migrate before starting.
-
Create alertmanager silence
silence_id=$(kubectl --as=cluster-admin -n openshift-monitoring exec \
  sts/alertmanager-main -- amtool --alertmanager.url=http://localhost:9093 \
  silence add alertname!=Watchdog --duration="3h" -c "cilium migration")
echo $silence_id
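To confirm the silence is in place before proceeding (an optional check, not part of the original procedure), you can list the active silences:
# Should list the silence created above, matching alertname!=Watchdog
kubectl --as=cluster-admin -n openshift-monitoring exec \
  sts/alertmanager-main -- amtool --alertmanager.url=http://localhost:9093 \
  silence query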
-
Select cluster
export CLUSTER_ID=c-cluster-id-1234 (1)
export COMMODORE_API_URL=https://api.syn.vshn.net (2)
export TENANT_ID=$(curl -sH "Authorization: Bearer $(commodore fetch-token)" \
  "${COMMODORE_API_URL}/clusters/${CLUSTER_ID}" | jq -r '.tenant')
1 Replace with the Project Syn cluster ID of the cluster to migrate
2 Replace with the Lieutenant API on which the cluster is registered
-
Disable ArgoCD auto sync for the root app and for components openshift4-nodes and openshift-upgrade-controller
kubectl --as=cluster-admin -n syn patch apps root --type=json \
  -p '[{"op":"replace", "path":"/spec/syncPolicy", "value": {}}]'
kubectl --as=cluster-admin -n syn patch apps openshift4-nodes --type=json \
  -p '[{"op":"replace", "path":"/spec/syncPolicy", "value": {}}]'
kubectl --as=cluster-admin -n syn patch apps openshift-upgrade-controller --type=json \
  -p '[{"op":"replace", "path":"/spec/syncPolicy", "value": {}}]'
-
Disable the cluster-network-operator. This is necessary to ensure that we can migrate to Cilium without the cluster-network-operator trying to interfere. We also need to scale down the upgrade controller, so that we can patch the ClusterVersion object.
kubectl --as=cluster-admin -n appuio-openshift-upgrade-controller \
  scale deployment openshift-upgrade-controller-controller-manager --replicas=0
kubectl --as=cluster-admin patch clusterversion version \
  --type=merge \
  -p '{"spec":{"overrides":[
    {
      "kind": "Deployment",
      "group": "apps",
      "name": "network-operator",
      "namespace": "openshift-network-operator",
      "unmanaged": true
    }
  ]}}'
kubectl --as=cluster-admin -n openshift-network-operator \
  scale deploy network-operator --replicas=0
-
Verify that the network operator has been scaled down.
kubectl -n openshift-network-operator get pods (1)
1 This should return No resources found in openshift-network-operator namespace.
If the operator is still running, check the following conditions (a quick check is sketched after this list):
-
The APPUiO OpenShift upgrade controller must be scaled down.
-
The ClusterVersion object must have an override to make the network operator deployment unmanaged.
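A quick way to check both conditions (an optional sketch, using the resource names from the steps above):
# The upgrade controller deployment should report 0/0 replicas
kubectl --as=cluster-admin -n appuio-openshift-upgrade-controller \
  get deployment openshift-upgrade-controller-controller-manager
# The override making the network-operator deployment unmanaged should be listed here
kubectl --as=cluster-admin get clusterversion version -o jsonpath='{.spec.overrides}'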
-
Remove network operator applied state
kubectl --as=cluster-admin -n openshift-network-operator \
  delete configmap applied-cluster
-
Pause all machine config pools
for mcp in $(kubectl get mcp -o name); do
  kubectl --as=cluster-admin patch $mcp --type=merge -p '{"spec": {"paused": true}}'
done
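To verify that every pool is actually paused before continuing (an optional check), list the paused flag per pool:
# Each pool should report PAUSED=true
kubectl get mcp -o custom-columns=NAME:.metadata.name,PAUSED:.spec.paused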
Migrate to Cilium
-
Get local cluster working directory
commodore catalog compile "$CLUSTER_ID" (1)
1 We recommend switching to an empty directory to run this command. Alternatively, switch to your existing directory for the cluster.
-
Enable component cilium
pushd inventory/classes/"${TENANT_ID}"
yq -i '.applications += "cilium"' "${CLUSTER_ID}.yml"
-
Update upstreamRules for monitoring
yq -i ".parameters.openshift4_monitoring.upstreamRules.networkPlugin = \"cilium\"" \
  "${CLUSTER_ID}.yml"
-
Update component networkpolicy config
yq eval -i '.parameters.networkpolicy.networkPlugin = "cilium"' \
  "${CLUSTER_ID}.yml"
yq eval -i '.parameters.networkpolicy.ignoredNamespaces = ["openshift-oauth-apiserver"]' \
  "${CLUSTER_ID}.yml"
-
Verify that the cluster’s api-int DNS record exists
export CLUSTER_DOMAIN=$(kubectl get dns cluster -ojsonpath='{.spec.baseDomain}')
kubectl --as=cluster-admin -n openshift-dns exec ds/node-resolver -- dig +short api-int.${CLUSTER_DOMAIN}
The command should always return a valid record for api-int. If it doesn’t, check that the OpenShift DNS cluster operator is healthy, and for clusters on vSphere double-check that the record is resolved by the internal DNS. You can see more details about the lookup by omitting the +short flag for the dig command.
-
Configure component cilium
Configure the cluster Pod and Service CIDRs
POD_CIDR=$(kubectl get network.config cluster \
  -o jsonpath='{.spec.clusterNetwork[0].cidr}')
HOST_PREFIX=$(kubectl get network.config cluster \
  -o jsonpath='{.spec.clusterNetwork[0].hostPrefix}')
yq -i '.parameters.cilium.cilium_helm_values.ipam.operator.clusterPoolIPv4MaskSize = "'"${HOST_PREFIX}"'"' \
  "${CLUSTER_ID}.yml"
yq -i '.parameters.cilium.cilium_helm_values.ipam.operator.clusterPoolIPv4PodCIDRList = "'"${POD_CIDR}"'"' \
  "${CLUSTER_ID}.yml"
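To double-check the rendered configuration before committing (optional, not part of the original procedure), print the IPAM section of the cluster config:
# Should show the mask size and Pod CIDR captured above
yq '.parameters.cilium.cilium_helm_values.ipam' "${CLUSTER_ID}.yml"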
-
Commit changes
git commit -am "Migrate ${CLUSTER_ID} to Cilium"
git push origin master
popd
-
Compile catalog
commodore catalog compile "${CLUSTER_ID}"
-
Patch cluster network config
Only execute this step after you’ve paused all machine config pools. Otherwise, nodes may reboot into a state where they’re stuck in NotReady.
kubectl --as=cluster-admin patch network.config cluster \
  --type=merge -p '{"spec":{"networkType":"Cilium"},"status":null}'
kubectl --as=cluster-admin patch network.operator cluster \
  --type=merge -p '{"spec":{"defaultNetwork":{"type":"Cilium"}},"status":null}'
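To confirm that both patches were accepted (an optional check), query the configured network type:
# Both commands should now print "Cilium"
kubectl get network.config cluster -o jsonpath='{.spec.networkType}{"\n"}'
kubectl get network.operator cluster -o jsonpath='{.spec.defaultNetwork.type}{"\n"}'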
-
Apply the Cilium manifests. We need to execute the apply twice, since the first apply will fail to create the CiliumConfig resource.
kubectl --as=cluster-admin apply -n cilium -Rf catalog/manifests/cilium/
kubectl --as=cluster-admin apply -n cilium -Rf catalog/manifests/cilium/
-
Wait until Cilium CNI is up and running
kubectl -n cilium get pods -w
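If you prefer a blocking wait over watching the pod list, something like the following works as well (a sketch; the timeout is an arbitrary choice):
# Blocks until all pods in the cilium namespace are Ready, or fails after 15 minutes
kubectl -n cilium wait --for=condition=Ready pods --all --timeout=15m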
Finalize migration
-
Re-enable cluster network operator
This will remove the previously active CNI plugin and will deploy the kube-proxy daemonset. As soon as you complete this step, existing pods may go into CrashLoopBackOff since they were started with CNI IPs managed by the old network plugin.
kubectl --as=cluster-admin -n openshift-network-operator \
  scale deployment network-operator --replicas=1
kubectl --as=cluster-admin patch clusterversion version \
  --type=merge -p '{"spec":{"overrides":null}}'
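To spot affected workloads (an optional check; the original procedure doesn’t prescribe a specific recovery, but deleting such pods lets their controllers recreate them with Cilium-managed IPs):
# List pods stuck in CrashLoopBackOff across all namespaces
kubectl --as=cluster-admin get pods -A | grep CrashLoopBackOff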
-
Unpause MCPs
for mcp in $(kubectl get mcp -o name); do
  kubectl --as=cluster-admin patch $mcp --type=merge -p '{"spec":{"paused":false}}'
done
You may need to grab the cluster-admin credentials to complete this step since the OpenShift OAuth components may be unavailable until they’re restarted with Cilium-managed IPs.
You may want to restart the multus daemonset once the old CNI pods are removed.
kubectl --as=cluster-admin -n openshift-multus rollout restart ds/multus
It may be necessary to force drain nodes manually to allow the machine-config-operator to reboot the nodes. If necessary, use
kubectl --as=cluster-admin drain --ignore-daemonsets --delete-emptydir-data --force --disable-eviction
to circumvent PDB violations. Start with a master node, and ensure that the machine-config-operator is running on that master node after it’s been drained and rebooted. An illustrative example follows below.
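As an illustration only (the node name below is a placeholder), a manual force drain of a single master node could look like this:
# <node-name> is a placeholder; pick the master node you want to start with
kubectl --as=cluster-admin drain <node-name> \
  --ignore-daemonsets --delete-emptydir-data --force --disable-eviction
# After the node has been drained and rebooted, confirm it's Ready and that the
# machine-config-operator is running on it again
kubectl get node <node-name>
kubectl -n openshift-machine-config-operator get pods -o wide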
-
Compile and push catalog
commodore catalog compile "${CLUSTER_ID}" --push
-
Re-enable ArgoCD auto sync
kubectl --as=cluster-admin -n syn patch apps root --type=json \
  -p '[{
    "op":"replace",
    "path":"/spec/syncPolicy",
    "value": {"automated": {"prune": true, "selfHeal": true}}
  }]'
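Once the migration is verified, you may also want to expire the Alertmanager silence created during preparation (a sketch assuming $silence_id is still set in your shell from that step):
kubectl --as=cluster-admin -n openshift-monitoring exec \
  sts/alertmanager-main -- amtool --alertmanager.url=http://localhost:9093 \
  silence expire "$silence_id"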