Update compute flavors

Steps to change compute flavors for one or more node groups on cloudscale.ch.

Starting situation

  • You already have an OpenShift 4 cluster on cloudscale.ch

  • You have admin-level access to the cluster

  • You want to change compute flavors of one or more node groups

Prerequisites

The following CLI utilities need to be available locally:

  • commodore
  • curl
  • docker
  • git
  • jq
  • oc
  • yq (version 4 or later)

Update Cluster Config

  1. Update cluster config in syn-tenant-repo on a new branch.

    export CLUSTER_ID=
    export INFRA_FLAVOR=plus-X-Y
    export WORKER_FLAVOR=plus-X-Y
    export STORAGE_FLAVOR=plus-X-Y
    
    git checkout -b update-compute-flavors
    
    yq eval -i ".parameters.openshift4_terraform.terraform_variables.infra_flavor = \"${INFRA_FLAVOR}\"" \
      ${CLUSTER_ID}.yml
    yq eval -i ".parameters.openshift4_terraform.terraform_variables.worker_flavor = \"${WORKER_FLAVOR}\"" \
      ${CLUSTER_ID}.yml
    yq eval -i ".parameters.openshift4_terraform.terraform_variables.additional_worker_groups.storage.flavor = \"${STORAGE_FLAVOR}\"" \
      ${CLUSTER_ID}.yml
    Make sure at least version 3.12 of terraform-openshift4-cloudscale is used, so that the flavors for the master and loadbalancer nodes are updated as well.
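    To double-check the edits before committing, read the same keys back, for example:

    yq eval '.parameters.openshift4_terraform.terraform_variables.infra_flavor' ${CLUSTER_ID}.yml
    yq eval '.parameters.openshift4_terraform.terraform_variables.worker_flavor' ${CLUSTER_ID}.yml
    yq eval '.parameters.openshift4_terraform.terraform_variables.additional_worker_groups.storage.flavor' ${CLUSTER_ID}.yml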
  2. Commit the change and create a merge request for review

    git commit -a -m "Update compute flavors on cluster ${CLUSTER_ID}"
    git push -u origin update-compute-flavors
  3. Compile and push the cluster catalog.
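    One way to do this is directly with Commodore (a sketch, typically run after the MR from the previous step has been merged; --push pushes the compiled catalog, -i shows the catalog diff for confirmation before pushing):

    commodore catalog compile "${CLUSTER_ID}" --push -i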

Prepare Terraform environment

  1. Configure API access

    export COMMODORE_API_URL=https://api.syn.vshn.net # Replace with the API URL of the desired Lieutenant instance

    # Set Project Syn cluster and tenant ID
    export CLUSTER_ID=<lieutenant-cluster-id> # Looks like: c-cluster-id-1234
    export TENANT_ID=$(curl -sH "Authorization: Bearer $(commodore fetch-token)" ${COMMODORE_API_URL}/clusters/${CLUSTER_ID} | jq -r .tenant)
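    As a quick sanity check, confirm that the tenant lookup succeeded (a trivial sketch):

    echo "Cluster ${CLUSTER_ID} belongs to tenant ${TENANT_ID}"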
  2. Create a local directory to work in and compile the cluster catalog

    export WORK_DIR=/path/to/work/dir
    mkdir -p "${WORK_DIR}"
    pushd "${WORK_DIR}"
    
    commodore catalog compile "${CLUSTER_ID}"

    We strongly recommend creating an empty directory, unless you already have a work directory for the cluster you’re about to work on. This guide will run Commodore in the directory created in this step.

  3. Configure Terraform environment

    export CLOUDSCALE_API_TOKEN=
    export GITLAB_USER=
    export GITLAB_TOKEN=
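    Before continuing, verify that none of these variables are empty (a small sketch; an empty value would only surface later as an authentication error):

    for var in CLOUDSCALE_API_TOKEN GITLAB_USER GITLAB_TOKEN; do
      [ -n "${!var}" ] || echo "WARNING: ${var} is not set"
    done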
  4. Configure Terraform secrets

    cat <<EOF > ./terraform.env
    CLOUDSCALE_API_TOKEN
    TF_VAR_ignition_bootstrap
    TF_VAR_lb_cloudscale_api_secret
    TF_VAR_control_vshn_net_token
    GIT_AUTHOR_NAME
    GIT_AUTHOR_EMAIL
    HIERADATA_REPO_TOKEN
    EOF
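    The file lists only variable names; docker run --env-file forwards the value of each listed variable from the current shell environment. If they aren't set yet, GIT_AUTHOR_NAME and GIT_AUTHOR_EMAIL can be taken from your Git configuration (a sketch):

    export GIT_AUTHOR_NAME="$(git config user.name)"
    export GIT_AUTHOR_EMAIL="$(git config user.email)"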
  5. Setup Terraform

    Prepare Terraform execution environment
    # Set terraform image and tag to be used
    tf_image=$(\
      yq eval ".parameters.openshift4_terraform.images.terraform.image" \
      dependencies/openshift4-terraform/class/defaults.yml)
    tf_tag=$(\
      yq eval ".parameters.openshift4_terraform.images.terraform.tag" \
      dependencies/openshift4-terraform/class/defaults.yml)
    
    # Generate the terraform alias
    base_dir=$(pwd)
    alias terraform='docker run --rm \
      -e REAL_UID=$(id -u) \
      --env-file ${base_dir}/terraform.env \
      -w /tf \
      -v $(pwd):/tf \
      --ulimit memlock=-1 \
      "${tf_image}:${tf_tag}" /tf/terraform.sh'
    
    export GITLAB_REPOSITORY_URL=$(curl -sH "Authorization: Bearer $(commodore fetch-token)" ${COMMODORE_API_URL}/clusters/${CLUSTER_ID} | jq -r '.gitRepo.url' | sed 's|ssh://||; s|/|:|')
    export GITLAB_REPOSITORY_NAME=${GITLAB_REPOSITORY_URL##*/}
    export GITLAB_CATALOG_PROJECT_ID=$(curl -sH "Authorization: Bearer ${GITLAB_TOKEN}" "https://git.vshn.net/api/v4/projects?simple=true&search=${GITLAB_REPOSITORY_NAME/.git}" | jq -r ".[] | select(.ssh_url_to_repo == \"${GITLAB_REPOSITORY_URL}\") | .id")
    export GITLAB_STATE_URL="https://git.vshn.net/api/v4/projects/${GITLAB_CATALOG_PROJECT_ID}/terraform/state/cluster"
    
    pushd catalog/manifests/openshift4-terraform/
    Initialize Terraform
    terraform init \
      "-backend-config=address=${GITLAB_STATE_URL}" \
      "-backend-config=lock_address=${GITLAB_STATE_URL}/lock" \
      "-backend-config=unlock_address=${GITLAB_STATE_URL}/lock" \
      "-backend-config=username=${GITLAB_USER}" \
      "-backend-config=password=${GITLAB_TOKEN}" \
      "-backend-config=lock_method=POST" \
      "-backend-config=unlock_method=DELETE" \
      "-backend-config=retry_wait_min=5"

Update compute flavors

  1. Verify the output of the Terraform plan step (for example, check the output of the Terraform CI/CD pipeline in the cluster catalog)

  2. Make sure you are logged in to the correct cluster

    oc cluster-info
  3. Allow Terraform to stop nodes

    sed -i '/^resource "cloudscale_server" "node"/a   allow_stopping_for_update = true' .terraform/modules/cluster/modules/node-group/main.tf
    sed -i '/^resource "cloudscale_server" "lb"/a   allow_stopping_for_update = true' .terraform/modules/cluster.lb/modules/vshn-lbaas-cloudscale/main.tf

Update master nodes

  1. Update master node flavors one at a time

    terraform state list module.cluster.module.master.cloudscale_server.node | while read node; do
      # Look up the node's hostname from the Terraform state
      terraform show -json | jq --raw-output --arg node "$node" \
        '.values.root_module.child_modules[].child_modules[].resources[] | select(.address == $node) | .values.name | split(".")[0]' |
      while read name; do
        # Drain the node, resize it with Terraform, uncordon it, and wait until all nodes are ready again
        oc --as=cluster-admin adm drain "$name" --delete-emptydir-data --ignore-daemonsets --force
        terraform apply -target "$node" -auto-approve
        oc --as=cluster-admin adm uncordon "$name"
        oc wait --timeout=300s node --all --for condition=ready
      done
    done
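    After the loop finishes, it's worth confirming that the control plane is healthy again (a sketch; on newer OpenShift versions the node label is node-role.kubernetes.io/control-plane):

    oc --as=cluster-admin get nodes -l node-role.kubernetes.io/master
    oc --as=cluster-admin get clusteroperators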

Update infra nodes

  1. Update infra node flavors one at a time

    terraform state list module.cluster.module.infra.cloudscale_server.node | while read node; do
      # Look up the node's hostname from the Terraform state
      terraform show -json | jq --raw-output --arg node "$node" \
        '.values.root_module.child_modules[].child_modules[].resources[] | select(.address == $node) | .values.name | split(".")[0]' |
      while read name; do
        # Drain the node, resize it with Terraform, uncordon it, and wait until all nodes are ready again
        oc --as=cluster-admin adm drain "$name" --delete-emptydir-data --ignore-daemonsets --force
        terraform apply -target "$node" -auto-approve
        oc --as=cluster-admin adm uncordon "$name"
        oc wait --timeout=300s node --all --for condition=ready
      done
    done
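    Optionally check that the infra nodes are ready again and that the ingress routers have been rescheduled (a sketch; assumes the default infra node label and the openshift-ingress namespace):

    oc --as=cluster-admin get nodes -l node-role.kubernetes.io/infra
    oc --as=cluster-admin -n openshift-ingress get pods -o wide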

Update worker nodes

  1. Update worker node flavors one at a time

    terraform state list module.cluster.module.worker.cloudscale_server.node | while read node; do
      # Look up the node's hostname from the Terraform state
      terraform show -json | jq --raw-output --arg node "$node" \
        '.values.root_module.child_modules[].child_modules[].resources[] | select(.address == $node) | .values.name | split(".")[0]' |
      while read name; do
        # Drain the node, resize it with Terraform, uncordon it, and wait until all nodes are ready again
        oc --as=cluster-admin adm drain "$name" --delete-emptydir-data --ignore-daemonsets --force
        terraform apply -target "$node" -auto-approve
        oc --as=cluster-admin adm uncordon "$name"
        oc wait --timeout=300s node --all --for condition=ready
      done
    done
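    Once all three groups have been resized, the new flavors can be cross-checked against the cloudscale.ch API (a sketch listing every server in the project with its flavor slug):

    curl -sH "Authorization: Bearer ${CLOUDSCALE_API_TOKEN}" https://api.cloudscale.ch/v1/servers | \
      jq -r '.[] | "\(.name) \(.flavor.slug)"'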

Update storage nodes

  1. Because the storage cluster needs time to recover between node restarts, check the storage cluster's health manually between the individual flavor updates.

    terraform state list 'module.cluster.module.additional_worker["storage"].cloudscale_server.node' | while read node; do
      # Print each storage node's hostname followed by its Terraform resource address
      terraform show -json | jq --raw-output --arg node "$node" \
        '.values.root_module.child_modules[].child_modules[].resources[] | select(.address == $node) | .values.name | split(".")[0]'
      echo $node
    done
    # For each of the storage nodes, set $name and $node to one of the
    # name/address pairs printed above, then run:
    oc --as=cluster-admin adm drain $name --delete-emptydir-data --ignore-daemonsets --force
    terraform apply -target $node -auto-approve
    oc --as=cluster-admin adm uncordon $name
    # Make sure the storage cluster is healthy before proceeding with the next node
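    If the storage cluster is Rook Ceph, its health can be checked from the toolbox pod (a sketch; the namespace syn-rook-ceph-cluster is an assumption, adjust it to your setup):

    oc --as=cluster-admin -n syn-rook-ceph-cluster exec -it deploy/rook-ceph-tools -- ceph status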

Update loadbalancer nodes

  1. Update loadbalancer flavors one node at a time

    terraform apply -target 'module.cluster.module.lb.cloudscale_server.lb[0]' -auto-approve
    # Wait until lb[0] is up and running
    terraform apply -target 'module.cluster.module.lb.cloudscale_server.lb[1]' -auto-approve
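    To check whether the resized loadbalancer is up again, you can query the cloudscale.ch API for its status (a sketch; filters on server names containing "lb"):

    curl -sH "Authorization: Bearer ${CLOUDSCALE_API_TOKEN}" https://api.cloudscale.ch/v1/servers | \
      jq -r '.[] | select(.name | test("lb")) | "\(.name) \(.status)"'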

Cleanup

  1. Run the Terraform CI/CD pipeline in the cluster catalog, verify the plan output, and make sure allow_stopping_for_update = true is removed from the Terraform state.

  2. Apply the Terraform plan if there are no unexpected changes.