Migrate to Cilium CNI
Prepare for migration
Make sure that your $KUBECONFIG points to the cluster you want to migrate before starting.
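Before you start, it can help to confirm which cluster and user your kubeconfig currently points at. This check is an optional addition to the runbook:

```bash
# Optional: confirm the kubeconfig points at the cluster you intend to migrate.
kubectl config current-context
oc whoami --show-server
```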
- Create alertmanager silence

  ```bash
  silence_id=$(kubectl --as=cluster-admin -n openshift-monitoring exec \
    sts/alertmanager-main -- amtool --alertmanager.url=http://localhost:9093 \
    silence add alertname!=Watchdog --duration="3h" -c "cilium migration" -a "$(oc whoami)")
  echo $silence_id
  ```
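  To verify that the silence is active before continuing, you can query Alertmanager through the same pod. This check is an optional addition to the documented steps:

  ```bash
  # Optional: list active silences; the ID printed above should show up here.
  kubectl --as=cluster-admin -n openshift-monitoring exec sts/alertmanager-main -- \
    amtool --alertmanager.url=http://localhost:9093 silence query
  ```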
- Select cluster

  ```bash
  export CLUSTER_ID=c-cluster-id-1234 # (1)
  export COMMODORE_API_URL=https://api.syn.vshn.net # (2)
  export TENANT_ID=$(curl -sH "Authorization: Bearer $(commodore fetch-token)" \
    "${COMMODORE_API_URL}/clusters/${CLUSTER_ID}" | jq -r '.tenant')
  ```

  1. Replace with the Project Syn cluster ID of the cluster to migrate
  2. Replace with the Lieutenant API on which the cluster is registered
- Disable ArgoCD auto sync for components `openshift4-nodes` and `openshift-upgrade-controller`

  ```bash
  kubectl --as=cluster-admin -n syn patch apps root --type=json \
    -p '[{"op":"replace", "path":"/spec/syncPolicy", "value": {}}]'
  kubectl --as=cluster-admin -n syn patch apps openshift4-nodes --type=json \
    -p '[{"op":"replace", "path":"/spec/syncPolicy", "value": {}}]'
  kubectl --as=cluster-admin -n syn patch apps openshift-upgrade-controller --type=json \
    -p '[{"op":"replace", "path":"/spec/syncPolicy", "value": {}}]'
  ```
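  If you want to confirm that auto sync is really disabled, the sync policy of the patched applications should now be empty. This read-back is an addition to the original steps:

  ```bash
  # Optional: an empty result (or '{}') means automated sync is disabled for the app.
  for app in root openshift4-nodes openshift-upgrade-controller; do
    kubectl --as=cluster-admin -n syn get app "$app" -o jsonpath='{.spec.syncPolicy}{"\n"}'
  done
  ```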
- Disable the cluster-network-operator. This is necessary to ensure that we can migrate to Cilium without the cluster-network-operator trying to interfere. We also need to scale down the upgrade controller, so that we can patch the `ClusterVersion` object.

  ```bash
  kubectl --as=cluster-admin -n appuio-openshift-upgrade-controller \
    scale deployment openshift-upgrade-controller-controller-manager --replicas=0

  kubectl --as=cluster-admin patch clusterversion version \
    --type=merge \
    -p '{"spec":{"overrides":[
      {
        "kind": "Deployment",
        "group": "apps",
        "name": "network-operator",
        "namespace": "openshift-network-operator",
        "unmanaged": true
      }
    ]}}'

  kubectl --as=cluster-admin -n openshift-network-operator \
    scale deploy network-operator --replicas=0
  ```
- Verify that the network operator has been scaled down.

  ```bash
  kubectl -n openshift-network-operator get pods # (1)
  ```

  1. This should return `No resources found in openshift-network-operator namespace.`

  If the operator is still running, check the following conditions:

  - The APPUiO OpenShift upgrade controller must be scaled down.
  - The `ClusterVersion` object must have an override to make the network operator deployment unmanaged.
- Remove network operator applied state

  ```bash
  kubectl --as=cluster-admin -n openshift-network-operator \
    delete configmap applied-cluster
  ```
- Pause all machine config pools

  ```bash
  for mcp in $(kubectl get mcp -o name); do
    kubectl --as=cluster-admin patch $mcp --type=merge -p '{"spec": {"paused": true}}'
  done
  ```
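  To confirm that every pool is actually paused before continuing, you can read the flag back. This read-back is an optional addition:

  ```bash
  # Optional: every pool should report "true" in the PAUSED column.
  kubectl get mcp -o custom-columns=NAME:.metadata.name,PAUSED:.spec.paused
  ```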
Migrate to Cilium
- Get local cluster working directory

  ```bash
  commodore catalog compile "$CLUSTER_ID" # (1)
  ```

  1. We recommend switching to an empty directory to run this command. Alternatively, switch to your existing directory for the cluster.
- Enable component `cilium`

  ```bash
  pushd inventory/classes/"${TENANT_ID}"

  yq -i '.applications += "cilium"' "${CLUSTER_ID}.yml"
  ```
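  As a quick sanity check (not part of the original instructions), the applications array in the cluster file should now contain cilium:

  ```bash
  # Optional: "cilium" should appear in the printed list.
  yq '.applications' "${CLUSTER_ID}.yml"
  ```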
- Update `upstreamRules` for monitoring

  ```bash
  yq -i ".parameters.openshift4_monitoring.upstreamRules.networkPlugin = \"cilium\"" \
    "${CLUSTER_ID}.yml"
  ```
- Update component `networkpolicy` config

  ```bash
  yq eval -i '.parameters.networkpolicy.networkPlugin = "cilium"' \
    "${CLUSTER_ID}.yml"
  ```
- Verify that the cluster's `api-int` DNS record exists

  ```bash
  export CLUSTER_DOMAIN=$(kubectl get dns cluster -ojsonpath='{.spec.baseDomain}')

  kubectl --as=cluster-admin -n openshift-dns exec ds/node-resolver -- \
    dig +short api-int.${CLUSTER_DOMAIN}
  ```

  The command should always return a valid record for `api-int`. If it doesn't, check that the OpenShift DNS cluster operator is healthy, and for clusters on vSphere double-check that the record is resolved by the internal DNS. You can see more details about the lookup by omitting the `+short` flag for the `dig` command.
- Configure component `cilium`: set the cluster Pod CIDR and host prefix

  ```bash
  POD_CIDR=$(kubectl get network.config cluster \
    -o jsonpath='{.spec.clusterNetwork[0].cidr}')
  HOST_PREFIX=$(kubectl get network.config cluster \
    -o jsonpath='{.spec.clusterNetwork[0].hostPrefix}')

  yq -i ".parameters.cilium.cilium_helm_values.ipam.operator.clusterPoolIPv4MaskSize = ${HOST_PREFIX}" \
    "${CLUSTER_ID}.yml"
  yq -i '.parameters.cilium.cilium_helm_values.ipam.operator.clusterPoolIPv4PodCIDRList = [ "'"${POD_CIDR}"'" ]' \
    "${CLUSTER_ID}.yml"
  ```
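  To make sure the values ended up where the component expects them, you can read them back from the cluster file. This read-back is an optional addition to the documented steps:

  ```bash
  # Optional: print the rendered IPAM settings from the cluster config.
  yq '.parameters.cilium.cilium_helm_values.ipam.operator' "${CLUSTER_ID}.yml"
  ```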
- Commit changes

  ```bash
  git commit -am "Migrate ${CLUSTER_ID} to Cilium"
  git push origin master

  popd
  ```
- Compile catalog

  ```bash
  commodore catalog compile "${CLUSTER_ID}"
  ```
- Patch cluster network config

  Only execute this step after you've paused all machine config pools. Otherwise, nodes may reboot into a state where they're stuck in `NotReady`.

  ```bash
  kubectl --as=cluster-admin patch network.config cluster \
    --type=merge -p '{"spec":{"networkType":"Cilium"},"status":null}'
  kubectl --as=cluster-admin patch network.operator cluster \
    --type=merge -p '{"spec":{"defaultNetwork":{"type":"Cilium"}},"status":null}'
  ```
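  You can confirm that both objects now reference Cilium before applying the manifests; this verification isn't part of the original runbook:

  ```bash
  # Optional: both commands should print "Cilium".
  kubectl get network.config cluster -o jsonpath='{.spec.networkType}{"\n"}'
  kubectl get network.operator cluster -o jsonpath='{.spec.defaultNetwork.type}{"\n"}'
  ```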
- Apply Cilium manifests. We need to execute the `apply` twice, since the first apply will fail to create the `CiliumConfig` resource (its CRD doesn't exist yet at that point).

  ```bash
  kubectl --as=cluster-admin apply -n cilium -Rf catalog/manifests/cilium/
  kubectl --as=cluster-admin apply -n cilium -Rf catalog/manifests/cilium/
  ```
- Wait until Cilium CNI is up and running

  ```bash
  kubectl -n cilium get pods -w
  ```
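  Once all pods are Running, you can additionally ask the Cilium agent for its own health view. This exec-based check is an optional addition; the container name `cilium-agent` is the upstream default and may differ in your deployment:

  ```bash
  # Optional: ask one Cilium agent for its health summary.
  kubectl --as=cluster-admin -n cilium exec ds/cilium -c cilium-agent -- cilium status --brief
  ```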
- Apply the updated default networkpolicy `SyncConfig`

  This should avoid issues when draining and rebooting nodes, such as pods failing to be created because mutating admission webhooks time out.

  ```bash
  kubectl --as=cluster-admin -n syn-espejo apply -f catalog/manifests/networkpolicy/10_default_networkpolicies.yaml
  ```
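  If you want to confirm the sync configuration was applied, you can list the SyncConfig objects in the namespace. This check isn't part of the original steps, and assumes the espejo SyncConfig CRD is installed (the `syncconfigs` resource name is an assumption):

  ```bash
  # Optional: the default networkpolicy SyncConfig should be listed here.
  kubectl --as=cluster-admin -n syn-espejo get syncconfigs
  ```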
Finalize migration
- Re-enable cluster network operator

  This will remove the previously active CNI plugin and will deploy the kube-proxy daemonset. As soon as you complete this step, existing pods may go into `CrashLoopBackOff` since they were started with CNI IPs managed by the old network plugin.

  ```bash
  kubectl --as=cluster-admin -n openshift-network-operator \
    scale deployment network-operator --replicas=1

  kubectl --as=cluster-admin patch clusterversion version \
    --type=merge -p '{"spec":{"overrides":null}}'
  ```
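  A quick way to see that the operator is back and reconciling is to check its pod and the network cluster operator status; this check is an optional addition:

  ```bash
  # Optional: the operator pod should be Running again,
  # and the network cluster operator should start progressing.
  kubectl -n openshift-network-operator get pods
  kubectl get clusteroperator network
  ```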
- Unpause MCPs

  ```bash
  for mcp in $(kubectl get mcp -o name); do
    kubectl --as=cluster-admin patch $mcp --type=merge -p '{"spec":{"paused":false}}'
  done
  ```

  You may need to grab the cluster-admin credentials to complete this step, since the OpenShift OAuth components may be unavailable until they're restarted with Cilium-managed IPs.
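  The machine config operator will now roll out the new rendered config and reboot nodes pool by pool; you can follow the progress with the command below (an optional addition):

  ```bash
  # Optional: watch the pools until UPDATED is True for all of them.
  kubectl get mcp -w
  ```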
  You may want to restart the multus daemonset once the old CNI pods are removed.

  ```bash
  kubectl --as=cluster-admin -n openshift-multus rollout restart ds/multus
  ```

  It may be necessary to force drain nodes manually to allow the machine-config-operator to reboot the nodes. Use `kubectl --as=cluster-admin drain --ignore-daemonsets --delete-emptydir-data --force --disable-eviction` to circumvent PDB violations if necessary. Start with a master node, and ensure that the machine-config-operator is running on that master node after it's been drained and rebooted.
- Compile and push catalog

  ```bash
  commodore catalog compile "${CLUSTER_ID}" --push
  ```
- Re-enable ArgoCD auto sync

  ```bash
  kubectl --as=cluster-admin -n syn patch apps root --type=json \
    -p '[{
      "op":"replace",
      "path":"/spec/syncPolicy",
      "value": {"automated": {"prune": true, "selfHeal": true}}
    }]'
  ```
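  Re-enabling auto sync on the root app should let ArgoCD restore the sync policies of the other applications as well. As a final, optional check you can confirm the apps are healthy and expire the Alertmanager silence created at the beginning (this cleanup isn't spelled out in the original steps and reuses the `$silence_id` variable from the same shell):

  ```bash
  # Optional: confirm the applications report Synced/Healthy again.
  kubectl --as=cluster-admin -n syn get apps

  # Optional: expire the silence created at the start of the migration.
  kubectl --as=cluster-admin -n openshift-monitoring exec sts/alertmanager-main -- \
    amtool --alertmanager.url=http://localhost:9093 silence expire "$silence_id"
  ```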