Remove a worker node

Steps to remove a worker node from an OpenShift 4 cluster on cloudscale.ch.

Starting situation

  • You already have an OpenShift 4 cluster on cloudscale.ch

  • You have admin-level access to the cluster

  • You want to remove an existing worker node from the cluster

High-level overview

  • First we identify the correct node to remove and drain it.

  • Then we remove it from Kubernetes.

  • Finally we remove the associated VMs.

Prerequisites

The following CLI utilities need to be available locally:

  • commodore
  • docker
  • curl
  • jq
  • yq (version 4 or later)
  • vault (Vault CLI)
  • kubectl
  • git
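
To double-check that everything is installed before you start, a minimal shell sketch (the tool list simply mirrors the commands used throughout this guide):

    for tool in commodore docker curl jq yq vault kubectl git; do
      command -v "$tool" >/dev/null || echo "missing: $tool"
    done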

Prepare local environment

  1. Create local directory to work in

    We strongly recommend creating an empty directory, unless you already have a work directory for the cluster you’re about to work on. This guide will run Commodore in the directory created in this step.

    export WORK_DIR=/path/to/work/dir
    mkdir -p "${WORK_DIR}"
    pushd "${WORK_DIR}"
  2. Configure API access

    Access to cloud API
    # From https://control.cloudscale.ch/service/<your-project>/api-token
    export CLOUDSCALE_API_TOKEN=<cloudscale-api-token>
    Access to VSHN GitLab
    # From https://git.vshn.net/-/profile/personal_access_tokens, "api" scope is sufficient
    export GITLAB_TOKEN=<gitlab-api-token>
    export GITLAB_USER=<gitlab-user-name>
    Access to VSHN Lieutenant
    # For example: https://api.syn.vshn.net
    # IMPORTANT: do NOT add a trailing `/`. Commands below will fail.
    export COMMODORE_API_URL=<lieutenant-api-endpoint>

    # Set Project Syn cluster and tenant ID
    export CLUSTER_ID=<lieutenant-cluster-id> # Looks like: c-<something>
    export TENANT_ID=$(curl -sH "Authorization: Bearer $(commodore fetch-token)" ${COMMODORE_API_URL}/clusters/${CLUSTER_ID} | jq -r .tenant)
    Configuration for hieradata commits
    export GIT_AUTHOR_NAME=$(git config --global user.name)
    export GIT_AUTHOR_EMAIL=$(git config --global user.email)
    export TF_VAR_control_vshn_net_token=<control-vshn-net-token> # use your personal SERVERS API token from https://control.vshn.net/tokens
  3. Get required tokens from Vault

    Connect with Vault
    export VAULT_ADDR=https://vault-prod.syn.vshn.net
    vault login -method=oidc
    Grab the LB hieradata repo token from Vault
    export HIERADATA_REPO_SECRET=$(vault kv get \
      -format=json "clusters/kv/lbaas/hieradata_repo_token" | jq '.data.data')
    export HIERADATA_REPO_USER=$(echo "${HIERADATA_REPO_SECRET}" | jq -r '.user')
    export HIERADATA_REPO_TOKEN=$(echo "${HIERADATA_REPO_SECRET}" | jq -r '.token')
    Get Floaty credentials
    export TF_VAR_lb_cloudscale_api_secret=$(vault kv get \
      -format=json "clusters/kv/${TENANT_ID}/${CLUSTER_ID}/floaty" | jq -r '.data.data.iam_secret')
  4. Compile the catalog for the cluster. Having the catalog available locally enables us to run Terraform for the cluster to make any required changes.

    commodore catalog compile "${CLUSTER_ID}"

Prepare Terraform environment

  1. Configure Terraform secrets

    cat <<EOF > ./terraform.env
    CLOUDSCALE_API_TOKEN
    TF_VAR_ignition_bootstrap
    TF_VAR_lb_cloudscale_api_secret
    TF_VAR_control_vshn_net_token
    GIT_AUTHOR_NAME
    GIT_AUTHOR_EMAIL
    HIERADATA_REPO_TOKEN
    EOF
  2. Set up Terraform

    Prepare Terraform execution environment
    # Set terraform image and tag to be used
    tf_image=$(\
      yq eval ".parameters.openshift4_terraform.images.terraform.image" \
      dependencies/openshift4-terraform/class/defaults.yml)
    tf_tag=$(\
      yq eval ".parameters.openshift4_terraform.images.terraform.tag" \
      dependencies/openshift4-terraform/class/defaults.yml)
    
    # Generate the terraform alias
    base_dir=$(pwd)
    alias terraform='docker run -it --rm \
      -e REAL_UID=$(id -u) \
      --env-file ${base_dir}/terraform.env \
      -w /tf \
      -v $(pwd):/tf \
      --ulimit memlock=-1 \
      "${tf_image}:${tf_tag}" /tf/terraform.sh'
    
    export GITLAB_REPOSITORY_URL=$(curl -sH "Authorization: Bearer $(commodore fetch-token)" ${COMMODORE_API_URL}/clusters/${CLUSTER_ID} | jq -r '.gitRepo.url' | sed 's|ssh://||; s|/|:|')
    export GITLAB_REPOSITORY_NAME=${GITLAB_REPOSITORY_URL##*/}
    export GITLAB_CATALOG_PROJECT_ID=$(curl -sH "Authorization: Bearer ${GITLAB_TOKEN}" "https://git.vshn.net/api/v4/projects?simple=true&search=${GITLAB_REPOSITORY_NAME/.git}" | jq -r ".[] | select(.ssh_url_to_repo == \"${GITLAB_REPOSITORY_URL}\") | .id")
    export GITLAB_STATE_URL="https://git.vshn.net/api/v4/projects/${GITLAB_CATALOG_PROJECT_ID}/terraform/state/cluster"
    
    pushd catalog/manifests/openshift4-terraform/
    Initialize Terraform
    terraform init \
      "-backend-config=address=${GITLAB_STATE_URL}" \
      "-backend-config=lock_address=${GITLAB_STATE_URL}/lock" \
      "-backend-config=unlock_address=${GITLAB_STATE_URL}/lock" \
      "-backend-config=username=${GITLAB_USER}" \
      "-backend-config=password=${GITLAB_TOKEN}" \
      "-backend-config=lock_method=POST" \
      "-backend-config=unlock_method=DELETE" \
      "-backend-config=retry_wait_min=5"

Drain and Remove Node

  • Find the node you want to remove. It must be the node with the highest Terraform index, since Terraform removes instances from the end of the list when the worker count is reduced.

    # Grab JSON copy of current Terraform state
    terraform state pull > .tfstate.json
    
    export NODE_TO_REMOVE=$(jq -r \
      '.resources[] |
       select(.module=="module.cluster.module.worker" and .type=="cloudscale_server") |
       .instances[.instances|length-1] |
       .attributes.name | split(".") | first' \
       .tfstate.json)
    echo $NODE_TO_REMOVE
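
    # Optional sanity check: confirm that the selected node actually exists in
    # the cluster and see which pods are still running on it (impersonation as
    # used elsewhere in this guide).
    kubectl --as=cluster-admin get node "${NODE_TO_REMOVE}"
    kubectl --as=cluster-admin get pods --all-namespaces \
      --field-selector spec.nodeName="${NODE_TO_REMOVE}"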
  • If you are working on a production cluster, you need to schedule the node drain for the next maintenance window.

  • If you are working on a non-production cluster, you may drain and remove the node immediately.

Schedule node drain (production clusters)

  1. Create an adhoc-config for the UpgradeJobHook that will drain the node.

    pushd "../../../inventory/classes/$TENANT_ID"
    cat > manifests/$CLUSTER_ID/drain_node_hook <<EOF
    ---
    apiVersion: managedupgrade.appuio.io/v1beta1
    kind: UpgradeJobHook
    metadata:
      name: drain-node
      namespace: appuio-openshift-upgrade-controller
    spec:
      events:
        - Finish
      selector:
        matchLabels:
          appuio-managed-upgrade: "true"
      run: Next
      template:
        spec:
          template:
            spec:
              containers:
                - args:
                    - -c
                    - |
                      #!/bin/sh
                      set -e
                      oc adm drain ${NODE_TO_REMOVE} --delete-emptydir-data --ignore-daemonsets
                  command:
                    - sh
                  image: quay.io/appuio/oc:v4.13
                  name: remove-nodes
                  env:
                    - name: HOME
                      value: /export
                  volumeMounts:
                    - mountPath: /export
                      name: export
                  workingDir: /export
              restartPolicy: Never
              volumes:
                - emptyDir: {}
                  name: export
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: drain-nodes-upgrade-controller
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: cluster-admin
    subjects:
      - kind: ServiceAccount
        name: default
        namespace: appuio-openshift-upgrade-controller
    EOF
    
    git commit -am "Schedule drain of node ${NODE_TO_REMOVE} on cluster $CLUSTER_ID"
    git push
    popd
  2. Wait until after the next maintenance window.

  3. Confirm the node has been drained.

    kubectl get node ${NODE_TO_REMOVE}
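    # A drained node shows "Ready,SchedulingDisabled" in the STATUS column.
    # Optionally verify that only daemonset-managed pods remain on it:
    kubectl get pods --all-namespaces --field-selector spec.nodeName=${NODE_TO_REMOVE}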
  4. Clean up UpgradeJobHook

    # After re-running the steps in "Prepare local environment" and "Prepare Terraform environment":
    pushd "../../../inventory/classes/$TENANT_ID"
    rm manifests/$CLUSTER_ID/drain_node_hook.yaml
    git commit -am "Remove UpgradeJobHook to drain node ${NODE_TO_REMOVE} on cluster $CLUSTER_ID"
    git push
    popd
  5. Delete the node(s) from the cluster

    for node in $(echo -n ${NODE_TO_REMOVE}); do
      kubectl --as=cluster-admin delete node "${node}"
    done
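
    # The deleted node(s) should no longer show up in the node list:
    kubectl --as=cluster-admin get nodes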

Drain and remove node immediately

  1. Drain the node(s)

    for node in $(echo -n ${NODE_TO_REMOVE}); do
      kubectl --as=cluster-admin drain "${node}" \
        --delete-emptydir-data --ignore-daemonsets
    done
  2. Delete the node(s) from the cluster

    for node in $(echo -n ${NODE_TO_REMOVE}); do
      kubectl --as=cluster-admin delete node "${node}"
    done
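
    # The deleted node(s) should no longer show up in the node list:
    kubectl --as=cluster-admin get nodes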

Update Cluster Config

  1. Update cluster config.

    pushd "inventory/classes/${TENANT_ID}/"
    
    yq eval -i ".parameters.openshift4_terraform.terraform_variables.worker_count -= 1" \
      ${CLUSTER_ID}.yml
  2. Review and commit

    # Have a look at the file ${CLUSTER_ID}.yml.
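    # For example:
    git diff ${CLUSTER_ID}.yml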
    
    git commit -a -m "Remove worker node from cluster ${CLUSTER_ID}"
    git push
    
    popd
  3. Compile and push cluster catalog

    commodore catalog compile ${CLUSTER_ID} --push -i

Remove VM

  1. Remove the node(s) by applying Terraform

    Verify that the hostname(s) of the node(s) to be deleted match ${NODE_TO_REMOVE}.

    Ensure that you’re still in directory ${WORK_DIR}/catalog/manifests/openshift4-terraform before executing this command.
    terraform apply
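
    terraform apply prints the execution plan and waits for confirmation before destroying anything. If you prefer a read-only preview first, a plan-only run works with the same alias; a minimal sketch:
    # Expect exactly one cloudscale_server (the one matching ${NODE_TO_REMOVE})
    # plus its directly attached resources to be planned for destruction.
    terraform plan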