Remove a worker node (instance pool)
Steps to remove a worker node from an OpenShift 4 cluster on Exoscale which uses instance pools.
Starting situation
- You already have an OpenShift 4 cluster on Exoscale
- Your cluster uses instance pools for the worker and infra nodes
- You have admin-level access to the cluster
- You want to remove an existing worker node from the cluster
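If you want to verify the admin-level access up front, kubectl can check it directly. This is an optional sketch, assuming your kubeconfig already points at the cluster in question:

# Should print "yes" if you have (impersonated) cluster-admin access
kubectl --as=cluster-admin auth can-i '*' '*'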
High-level overview
- We drain the node.
- Then we remove it from Kubernetes.
- Finally we remove the associated VM from the instance pool.
Prerequisites
The following CLI utilities need to be available locally:
- docker
- curl
- kubectl
- oc
- exo (Exoscale CLI, >= v1.28.0)
- vault (Vault CLI)
- commodore, see Running Commodore
- jq
- yq (YAML processor, version 4 or higher)
- macOS: gdate from GNU coreutils, install with brew install coreutils
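Most of these tools fail loudly when missing, but a quick preflight check can save a round trip. A minimal sketch, assuming a POSIX shell:

# Check that the required tools are on the PATH
for tool in docker curl kubectl oc exo vault commodore jq yq; do
  command -v "$tool" >/dev/null || echo "missing: $tool"
done
# The Exoscale CLI must be >= v1.28.0, yq must be version 4 or higher
exo version
yq --version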
Prepare local environment
- Create local directory to work in

We strongly recommend creating an empty directory, unless you already have a work directory for the cluster you're about to work on. This guide will run Commodore in the directory created in this step.

export WORK_DIR=/path/to/work/dir
mkdir -p "${WORK_DIR}"
pushd "${WORK_DIR}"
- Configure API access

Access to cloud API

export EXOSCALE_API_KEY=<exoscale-key> (1)
export EXOSCALE_API_SECRET=<exoscale-secret>
export EXOSCALE_ZONE=<exoscale-zone> (2)
export EXOSCALE_S3_ENDPOINT="sos-${EXOSCALE_ZONE}.exo.io"

(1) We recommend using the IAMv3 role called Owner for the API Key. This role gives full access to the project.
(2) All lower case. For example ch-dk-2.

Access to VSHN GitLab

# From https://git.vshn.net/-/user_settings/personal_access_tokens, "api" scope is sufficient
export GITLAB_TOKEN=<gitlab-api-token>
export GITLAB_USER=<gitlab-user-name>

Access to VSHN Lieutenant

# For example: https://api.syn.vshn.net
# IMPORTANT: do NOT add a trailing `/`. Commands below will fail.
export COMMODORE_API_URL=<lieutenant-api-endpoint>

# Set Project Syn cluster and tenant ID
export CLUSTER_ID=<lieutenant-cluster-id> # Looks like: c-<something>
export TENANT_ID=$(curl -sH "Authorization: Bearer $(commodore fetch-token)" ${COMMODORE_API_URL}/clusters/${CLUSTER_ID} | jq -r .tenant)

Configuration for hieradata commits

export GIT_AUTHOR_NAME=$(git config --global user.name)
export GIT_AUTHOR_EMAIL=$(git config --global user.email)
export TF_VAR_control_vshn_net_token=<control-vshn-net-token> # use your personal SERVERS API token from https://control.vshn.net/tokens
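Before continuing, it can be worth confirming that the credentials actually work. A minimal sketch reusing the variables exported above; the jq filter .id is an assumption about the Lieutenant response, adjust if your API returns a different field:

# Should list the instances in the configured zone
exo compute instance list -z "${EXOSCALE_ZONE}"
# Should print the cluster ID back if Lieutenant access works
curl -sH "Authorization: Bearer $(commodore fetch-token)" \
  "${COMMODORE_API_URL}/clusters/${CLUSTER_ID}" | jq -r .id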
- Get required tokens from Vault

Connect with Vault

export VAULT_ADDR=https://vault-prod.syn.vshn.net
vault login -method=oidc

Grab the LB hieradata repo token from Vault

export HIERADATA_REPO_SECRET=$(vault kv get \
  -format=json "clusters/kv/lbaas/hieradata_repo_token" | jq '.data.data')
export HIERADATA_REPO_USER=$(echo "${HIERADATA_REPO_SECRET}" | jq -r '.user')
export HIERADATA_REPO_TOKEN=$(echo "${HIERADATA_REPO_SECRET}" | jq -r '.token')
- Compile the catalog for the cluster. Having the catalog available locally enables us to run Terraform for the cluster to make any required changes.
commodore catalog compile "${CLUSTER_ID}"
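This checks out the cluster catalog and its dependencies into the work directory. The Terraform configuration used in the next section should now be available locally:

# Optional: confirm the Terraform configuration was compiled
ls catalog/manifests/openshift4-terraform/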
Prepare Terraform environment
- Configure Terraform secrets

cat <<EOF > ./terraform.env
EXOSCALE_API_KEY
EXOSCALE_API_SECRET
TF_VAR_control_vshn_net_token
GIT_AUTHOR_NAME
GIT_AUTHOR_EMAIL
HIERADATA_REPO_TOKEN
EOF
- Set up Terraform
Prepare Terraform execution environment

# Set terraform image and tag to be used
tf_image=$(\
  yq eval ".parameters.openshift4_terraform.images.terraform.image" \
  dependencies/openshift4-terraform/class/defaults.yml)
tf_tag=$(\
  yq eval ".parameters.openshift4_terraform.images.terraform.tag" \
  dependencies/openshift4-terraform/class/defaults.yml)

# Generate the terraform alias
base_dir=$(pwd)

alias terraform='touch .terraformrc; docker run -it --rm \
  -e REAL_UID=$(id -u) \
  -e TF_CLI_CONFIG_FILE=/tf/.terraformrc \
  --env-file ${base_dir}/terraform.env \
  -w /tf \
  -v $(pwd):/tf \
  --ulimit memlock=-1 \
  "${tf_image}:${tf_tag}" /tf/terraform.sh'

export GITLAB_REPOSITORY_URL=$(curl -sH "Authorization: Bearer $(commodore fetch-token)" ${COMMODORE_API_URL}/clusters/${CLUSTER_ID} | jq -r '.gitRepo.url' | sed 's|ssh://||; s|/|:|')
export GITLAB_REPOSITORY_NAME=${GITLAB_REPOSITORY_URL##*/}
export GITLAB_CATALOG_PROJECT_ID=$(curl -sH "Authorization: Bearer ${GITLAB_TOKEN}" "https://git.vshn.net/api/v4/projects?simple=true&search=${GITLAB_REPOSITORY_NAME/.git}" | jq -r ".[] | select(.ssh_url_to_repo == \"${GITLAB_REPOSITORY_URL}\") | .id")
export GITLAB_STATE_URL="https://git.vshn.net/api/v4/projects/${GITLAB_CATALOG_PROJECT_ID}/terraform/state/cluster"

pushd catalog/manifests/openshift4-terraform/
Initialize Terraform

terraform init \
  "-backend-config=address=${GITLAB_STATE_URL}" \
  "-backend-config=lock_address=${GITLAB_STATE_URL}/lock" \
  "-backend-config=unlock_address=${GITLAB_STATE_URL}/lock" \
  "-backend-config=username=${GITLAB_USER}" \
  "-backend-config=password=${GITLAB_TOKEN}" \
  "-backend-config=lock_method=POST" \
  "-backend-config=unlock_method=DELETE" \
  "-backend-config=retry_wait_min=5"
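Optionally, you can run a plan at this point to confirm that the Terraform state and the cluster infrastructure are in sync before making any changes; an empty plan is the expected baseline. This is an optional check, not part of the required procedure:

# Expect "No changes" if state and infrastructure match
terraform plan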
Drain and Remove Node
- Select a node to remove. With instance pools, we can remove any node.
export NODE_TO_REMOVE=<node name>
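To pick a node, list the current workers first. A minimal sketch, assuming the regular worker nodes carry the node-role.kubernetes.io/worker label:

# Pick one of the listed worker nodes
kubectl --as=cluster-admin get nodes -l node-role.kubernetes.io/worker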
- If you are working on a production cluster, you need to schedule the node drain for the next maintenance window.
- If you are working on a non-production cluster, you may drain and remove the node immediately.
Schedule node drain (production clusters)
- Create an adhoc-config for the UpgradeJobHook that will drain the node.
pushd "../../../inventory/classes/$TENANT_ID"

cat > manifests/$CLUSTER_ID/drain_node_hook.yaml <<EOF
---
apiVersion: managedupgrade.appuio.io/v1beta1
kind: UpgradeJobHook
metadata:
  name: drain-node
  namespace: appuio-openshift-upgrade-controller
spec:
  events:
    - Finish
  selector:
    matchLabels:
      appuio-managed-upgrade: "true"
  run: Next
  template:
    spec:
      template:
        spec:
          containers:
            - args:
                - -c
                - |
                  #!/bin/sh
                  set -e
                  oc adm drain ${NODE_TO_REMOVE} --delete-emptydir-data --ignore-daemonsets
              command:
                - sh
              image: quay.io/appuio/oc:v4.13
              name: remove-nodes
              env:
                - name: HOME
                  value: /export
              volumeMounts:
                - mountPath: /export
                  name: export
              workingDir: /export
          restartPolicy: Never
          volumes:
            - emptyDir: {}
              name: export
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: drain-nodes-upgrade-controller
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
  - kind: ServiceAccount
    name: default
    namespace: appuio-openshift-upgrade-controller
EOF

git commit -am "Schedule drain of node ${NODE_TO_REMOVE} on cluster $CLUSTER_ID"
git push
popd
- Wait until after the next maintenance window.
- Confirm the node has been drained.
kubectl get node ${NODE_TO_REMOVE}
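The node should show SchedulingDisabled in its status once the hook has run. To double-check that the workload has actually been moved, you can also list the pods still running on the node; apart from daemonset-managed pods, the list should be empty:

# Remaining pods should essentially be daemonset-managed ones
kubectl get pods --all-namespaces --field-selector spec.nodeName=${NODE_TO_REMOVE}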
- Clean up the UpgradeJobHook
# If you come back to this after the maintenance window, repeat the
# "Prepare local environment" and "Prepare Terraform environment" steps first.
pushd "../../../inventory/classes/$TENANT_ID"
rm manifests/$CLUSTER_ID/drain_node_hook.yaml
git commit -am "Remove UpgradeJobHook to drain node ${NODE_TO_REMOVE} on cluster $CLUSTER_ID"
git push
popd
- Delete the node(s) from the cluster

for node in $(echo -n ${NODE_TO_REMOVE}); do
  kubectl --as=cluster-admin delete node "${node}"
done
Drain and remove node immediately
- Drain the node(s)

for node in $(echo -n ${NODE_TO_REMOVE}); do
  kubectl --as=cluster-admin drain "${node}" \
    --delete-emptydir-data --ignore-daemonsets
done
- Delete the node(s) from the cluster

for node in $(echo -n ${NODE_TO_REMOVE}); do
  kubectl --as=cluster-admin delete node "${node}"
done
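Whichever path you followed, the node should now be gone from the Kubernetes API. An optional check before moving on:

# Expect an "Error from server (NotFound)" for the removed node
kubectl --as=cluster-admin get node ${NODE_TO_REMOVE}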
Update Cluster Config
- Update cluster config.
# Run this from ${WORK_DIR}
pushd "inventory/classes/${TENANT_ID}/"

yq eval -i ".parameters.openshift4_terraform.terraform_variables.worker_count -= 1" \
  ${CLUSTER_ID}.yml
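If you want to confirm that the decrement landed in the right place, print the value before committing (optional check):

# Show the updated worker count
yq eval ".parameters.openshift4_terraform.terraform_variables.worker_count" \
  ${CLUSTER_ID}.yml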
- Review and commit

# Have a look at the file ${CLUSTER_ID}.yml.

git commit -a -m "Remove worker node from cluster ${CLUSTER_ID}"
git push
popd
- Compile and push the cluster catalog
commodore catalog compile ${CLUSTER_ID} --push -i
Remove VM
- Evict the VM(s) from the instance pool
We’re going through all worker instance pools to find the pool containing the node(s) to remove. This ensures that we can apply the step as-is on clusters on dedicated hypervisors which may have multiple worker instance pools.
instancepool_names=$(exo compute instance-pool list -Ojson | \
  jq --arg ip_group "worker" -r \
  '.[]|select(.name|contains($ip_group))|.name')

for node in $(echo -n ${NODE_TO_REMOVE}); do
  for pool_name in ${instancepool_names}; do
    has_node=$(exo compute instance-pool show "${pool_name}" -Ojson | \
      jq --arg node "${node}" -r '.instances|index($node)!=null')
    if [ "$has_node" == "true" ]; then
      exo compute instance-pool evict "${pool_name}" "${node}" -z "$EXOSCALE_ZONE"
      break
    fi
  done
done
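You can verify the eviction by checking that the node no longer appears in any of the worker instance pools. A sketch reusing the variables and jq fields from the loop above; the exact layout of the exo -Ojson output may vary slightly between CLI versions:

# Expect "false" for every pool once the eviction has completed
for node in $(echo -n ${NODE_TO_REMOVE}); do
  for pool_name in ${instancepool_names}; do
    echo -n "${pool_name} still contains ${node}: "
    exo compute instance-pool show "${pool_name}" -Ojson | \
      jq --arg node "${node}" -r '.instances|index($node) != null'
  done
done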
- Run Terraform to update the state with the new instance pool size
There shouldn't be any changes since instance-pool evict reduces the instance pool size by one.

Ensure that you're still in directory ${WORK_DIR}/catalog/manifests/openshift4-terraform before executing this command.

terraform apply
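As a final check, the number of worker nodes in Kubernetes and the size of the worker instance pools should be consistent again. An optional sketch, assuming the same worker label and pool naming as in the previous steps:

# Compare the remaining worker nodes with the instance pool sizes
kubectl --as=cluster-admin get nodes -l node-role.kubernetes.io/worker
exo compute instance-pool list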