Install OpenShift 4 on cloudscale.ch

Steps to install an OpenShift 4 cluster on cloudscale.ch.

These steps follow the Installing a cluster on bare metal docs to set up a user provisioned installation (UPI). Terraform is used to provision the cloud infrastructure.

Unless noted otherwise, the commands are idempotent and can be retried if a step fails.

The certificates created during bootstrap are only valid for 24 hours, so make sure you complete these steps within that time.

This how-to guide is still a work in progress and will change. It’s currently very specific to VSHN and needs further changes to be more generic.

Starting situation

  • You already have a Tenant and its git repository

  • You have a CCSP Red Hat login and are logged into Red Hat OpenShift Cluster Manager

    Don’t use your personal account to log in to the cluster manager for installation.
  • You want to register a new cluster in Lieutenant and are about to install OpenShift 4 on cloudscale.ch

Prerequisites

Make sure that openshift-install and the RHCOS image have the same minor version. Otherwise, Ignition will fail.
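
The version requirement above can be checked before starting. The following is a minimal sketch, assuming both versions are available as plain strings such as 4.16.3. The helper names minor_of and same_minor are hypothetical and not part of openshift-install.

```shell
# Hypothetical helpers, not part of openshift-install or the RHCOS tooling.

# Reduce a full version such as "4.16.3" to its minor release "4.16".
minor_of() {
  echo "$1" | awk -F. '{ print $1 "." $2 }'
}

# Succeed only if both versions belong to the same minor release.
same_minor() {
  [ "$(minor_of "$1")" = "$(minor_of "$2")" ]
}

# Example with literal versions; substitute your installer and image versions.
if same_minor "4.16.3" "4.16.0"; then
  echo "installer and RHCOS image match"
else
  echo "minor version mismatch" >&2
fi
```

The installer version can be read from `openshift-install version`. Check the output format of your installer build before parsing it.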

Cluster Installation

Register the new OpenShift 4 cluster in Lieutenant.

Lieutenant API endpoint

Use the following endpoint for Lieutenant:

Set cluster facts

For customer clusters, set the following cluster facts in Lieutenant:

  • access_policy: Access-Policy of the cluster, such as regular or swissonly

  • service_level: Name of the service level agreement for this cluster, such as guaranteed-availability

  • sales_order: Name of the sales order to which the cluster is billed, such as S10000

  • release_channel: Name of the syn component release channel to use, such as stable

  • cilium_addons: Comma-separated list of cilium addons the customer gets billed for, such as advanced_networking or tetragon. Set to NONE if no addons should be billed.

Set up Keycloak service

  1. Create a Keycloak service

    Use control.vshn.net/vshn/services/_create to create a service. The name and ID must be the cluster’s name. For the optional URL, use the OpenShift console URL.

Configure input

Create 2 new cloudscale API tokens with read+write permissions and name them <cluster_id> and <cluster_id>_floaty on control.cloudscale.ch/service/<your-project>/api-token.

Access to cloud API
export CLOUDSCALE_API_TOKEN=<cloudscale-api-token>
export TF_VAR_lb_cloudscale_api_secret=<cloudscale-api-token-for-Floaty>
Access to VSHN GitLab
# From https://git.vshn.net/-/user_settings/personal_access_tokens, "api" scope is sufficient
export GITLAB_TOKEN=<gitlab-api-token>
export GITLAB_USER=<gitlab-user-name>
Access to VSHN Lieutenant
# For example: https://api.syn.vshn.net
# IMPORTANT: do NOT add a trailing `/`. Commands below will fail.
export COMMODORE_API_URL=<lieutenant-api-endpoint>

# Set Project Syn cluster and tenant ID
export CLUSTER_ID=<lieutenant-cluster-id> # Looks like: c-<something>
export TENANT_ID=$(curl -sH "Authorization: Bearer $(commodore fetch-token)" ${COMMODORE_API_URL}/clusters/${CLUSTER_ID} | jq -r .tenant)
Configuration for hieradata commits
export GIT_AUTHOR_NAME=$(git config --global user.name)
export GIT_AUTHOR_EMAIL=$(git config --global user.email)
export TF_VAR_control_vshn_net_token=<control-vshn-net-token> # use your personal SERVERS API token from https://control.vshn.net/tokens
OpenShift configuration
export BASE_DOMAIN=<your-base-domain> # customer-provided base domain without cluster name, e.g. "zrh.customer.vshnmanaged.net"
export PULL_SECRET='<redhat-pull-secret>' # As copied from https://cloud.redhat.com/openshift/install/pull-secret "Copy pull secret". value must be inside quotes.

For BASE_DOMAIN explanation, see DNS Scheme.
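
Since later commands fail on a missing variable or on a trailing slash in COMMODORE_API_URL, a pre-flight check can save a retry. This is a minimal sketch. The helper names require_vars and no_trailing_slash are hypothetical, and the list of variables to check is up to you.

```shell
# Hypothetical pre-flight helpers; the names are illustrative.

# Report every unset or empty variable among the given names; fail if any.
require_vars() {
  rv_status=0
  for rv_name in "$@"; do
    # Indirect lookup of the variable named in rv_name.
    eval "rv_value=\${${rv_name}:-}"
    if [ -z "${rv_value}" ]; then
      echo "missing: ${rv_name}" >&2
      rv_status=1
    fi
  done
  return "${rv_status}"
}

# Fail if the given URL ends with a slash.
no_trailing_slash() {
  case "$1" in
    */) return 1 ;;
    *)  return 0 ;;
  esac
}

# Possible usage once the exports above are in place:
#   require_vars CLOUDSCALE_API_TOKEN GITLAB_TOKEN GITLAB_USER \
#     COMMODORE_API_URL CLUSTER_ID TENANT_ID BASE_DOMAIN PULL_SECRET
#   no_trailing_slash "${COMMODORE_API_URL}" || echo "remove the trailing /" >&2
```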

Set up S3 bucket for cluster bootstrap

  1. Create S3 bucket

    1. If a bucket user already exists for this cluster:

      # Use already existing bucket user
      response=$(curl -sH "Authorization: Bearer ${CLOUDSCALE_API_TOKEN}" \
        https://api.cloudscale.ch/v1/objects-users | \
        jq -e ".[] | select(.display_name == \"${CLUSTER_ID}\")")
    2. To create a new bucket user:

      # Create a new user
      response=$(curl -sH "Authorization: Bearer ${CLOUDSCALE_API_TOKEN}" \
        -F display_name=${CLUSTER_ID} \
        https://api.cloudscale.ch/v1/objects-users)
  2. Configure the Minio client

    export REGION=$(curl -sH "Authorization: Bearer $(commodore fetch-token)" ${COMMODORE_API_URL}/clusters/${CLUSTER_ID} | jq -r .facts.region)
    mc config host add \
      "${CLUSTER_ID}" "https://objects.${REGION}.cloudscale.ch" \
      $(echo $response | jq -r '.keys[0].access_key') \
      $(echo $response | jq -r '.keys[0].secret_key')
    
    mc mb --ignore-existing \
      "${CLUSTER_ID}/${CLUSTER_ID}-bootstrap-ignition"

Upload Red Hat CoreOS image

  1. Export the Authorization header for the cloudscale.ch API.

    export AUTH_HEADER="Authorization: Bearer ${CLOUDSCALE_API_TOKEN}"

    The variable CLOUDSCALE_API_TOKEN could be used directly. Exporting the variable AUTH_HEADER is done to be compatible with the cloudscale.ch API documentation.

  2. Check if image already exists in the correct zone

    curl -sH "$AUTH_HEADER" https://api.cloudscale.ch/v1/custom-images | jq -r '.[] | select(.slug == "rhcos-4.16") | .zones[].slug'

    If the zone you’re installing into is printed, you can skip the remaining steps and jump directly to the next section.

  3. Fetch the latest Red Hat CoreOS image

    curl -L https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/4.16/4.16.3/rhcos-4.16.3-x86_64-openstack.x86_64.qcow2.gz | gzip -d > rhcos-4.16.qcow2
  4. Upload the image to S3 and make it public

    mc cp rhcos-4.16.qcow2 "${CLUSTER_ID}/${CLUSTER_ID}-bootstrap-ignition/"
    mc anonymous set download "${CLUSTER_ID}/${CLUSTER_ID}-bootstrap-ignition/rhcos-4.16.qcow2"

    You can check that the download policy is applied successfully with

    mc anonymous get "${CLUSTER_ID}/${CLUSTER_ID}-bootstrap-ignition/rhcos-4.16.qcow2"

    The output should be

    `Access permission for `[…]-bootstrap-ignition/rhcos-4.16.qcow2` is `download``
  5. Import the image to cloudscale.ch

    curl -i -H "$AUTH_HEADER" \
      -F url="$(mc share download --json "${CLUSTER_ID}/${CLUSTER_ID}-bootstrap-ignition/rhcos-4.16.qcow2" | jq -r .url)" \
      -F name='RHCOS 4.16' \
      -F zones="${REGION}1" \
      -F slug=rhcos-4.16 \
      -F source_format=qcow2 \
      -F user_data_handling=pass-through \
      https://api.cloudscale.ch/v1/custom-images/import
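
The check in step 2 prints one zone slug per line, so its result can be tested in a script instead of read by eye. This is a minimal sketch, assuming the target zone is "${REGION}1" as used in the import call. The helper name image_in_zone is hypothetical.

```shell
# Hypothetical helper: succeed if the wanted zone slug appears on stdin,
# one slug per line, as printed by the jq filter in step 2.
image_in_zone() {
  grep -qxF -- "$1"
}

# Possible usage with the command from step 2:
#   curl -sH "$AUTH_HEADER" https://api.cloudscale.ch/v1/custom-images |
#     jq -r '.[] | select(.slug == "rhcos-4.16") | .zones[].slug' |
#     image_in_zone "${REGION}1" && echo "image already imported"
```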

Set secrets in Vault

Connect with Vault
export VAULT_ADDR=https://vault-prod.syn.vshn.net
vault login -method=oidc
Store various secrets in Vault
# Set the cloudscale.ch access secrets
vault kv put clusters/kv/${TENANT_ID}/${CLUSTER_ID}/cloudscale \
  token=${CLOUDSCALE_API_TOKEN} \
  s3_access_key=$(mc config host ls ${CLUSTER_ID} -json | jq -r .accessKey) \
  s3_secret_key=$(mc config host ls ${CLUSTER_ID} -json | jq -r .secretKey)

# Put LB API key in Vault
vault kv put clusters/kv/${TENANT_ID}/${CLUSTER_ID}/floaty \
  iam_secret=${TF_VAR_lb_cloudscale_api_secret}

# Generate an HTTP secret for the registry
vault kv put clusters/kv/${TENANT_ID}/${CLUSTER_ID}/registry \
  httpSecret=$(LC_ALL=C tr -cd "A-Za-z0-9" </dev/urandom | head -c 128)

# Generate a master password for K8up backups
vault kv put clusters/kv/${TENANT_ID}/${CLUSTER_ID}/global-backup \
  password=$(LC_ALL=C tr -cd "A-Za-z0-9" </dev/urandom | head -c 32)

# Generate a password for the cluster object backups
vault kv put clusters/kv/${TENANT_ID}/${CLUSTER_ID}/cluster-backup \
  password=$(LC_ALL=C tr -cd "A-Za-z0-9" </dev/urandom | head -c 32)
Grab the LB hieradata repo token from Vault
export HIERADATA_REPO_SECRET=$(vault kv get \
  -format=json "clusters/kv/lbaas/hieradata_repo_token" | jq '.data.data')
export HIERADATA_REPO_USER=$(echo "${HIERADATA_REPO_SECRET}" | jq -r '.user')
export HIERADATA_REPO_TOKEN=$(echo "${HIERADATA_REPO_SECRET}" | jq -r '.token')
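
The generated secrets above all use the same pipeline with a different length. As a minimal sketch, it can be wrapped in a function. The name gen_secret is hypothetical.

```shell
# Hypothetical wrapper around the pipeline used above.
# Prints N random characters from [A-Za-z0-9] without a trailing newline.
gen_secret() {
  LC_ALL=C tr -cd 'A-Za-z0-9' </dev/urandom | head -c "$1"
}

# Possible usage:
#   vault kv put clusters/kv/${TENANT_ID}/${CLUSTER_ID}/registry \
#     httpSecret=$(gen_secret 128)
```

Note that tr is terminated once head has read enough bytes. In a shell running with `set -o pipefail`, the pipeline can therefore report a non-zero status even though the output is complete.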

Prepare Cluster Repository

Starting with this section, we recommend that you change into a clean directory (for example a directory in your home).

Check Running Commodore for details on how to run commodore.

  1. Prepare Commodore inventory.

    mkdir -p inventory/classes/
    git clone $(curl -sH"Authorization: Bearer $(commodore fetch-token)" "${COMMODORE_API_URL}/tenants/${TENANT_ID}" | jq -r '.gitRepo.url') inventory/classes/${TENANT_ID}
  2. Configure the cluster’s domain in Project Syn

    export CLUSTER_DOMAIN="${CLUSTER_ID}.${BASE_DOMAIN}" # (1)
    1 Adjust this as necessary if you’re using a non-standard cluster domain.

    The cluster domain configured here must be correct. The value is used to configure how Cilium connects to the cluster’s K8s API.

    pushd "inventory/classes/${TENANT_ID}/"
    
    yq eval -i ".parameters.openshift.baseDomain = \"${CLUSTER_DOMAIN}\"" \
      ${CLUSTER_ID}.yml
    
    git commit -a -m "Configure cluster domain for ${CLUSTER_ID}"
  3. Include openshift4.yml in the cluster’s config if it exists

    For some tenants, this may already configure some of the settings shown in this how-to.
    if ls openshift4.y*ml 1>/dev/null 2>&1; then
        yq eval -i '.classes += ".openshift4"' ${CLUSTER_ID}.yml;
        git commit -a -m "Include openshift4 class for ${CLUSTER_ID}"
    fi
  4. Add Cilium to cluster configuration

    These instructions assume that Cilium is configured to use api-int.${CLUSTER_DOMAIN}:6443 to connect to the cluster’s K8s API. To ensure that that’s the case, add the configuration shown below somewhere in the Project Syn config hierarchy.

    parameters:
      cilium:
        cilium_helm_values:
          k8sServiceHost: api-int.${openshift:baseDomain}
          k8sServicePort: "6443"

    For VSHN, this configuration is set in the Commodore global defaults (internal).

    yq eval -i '.applications += ["cilium"]' ${CLUSTER_ID}.yml
    
    yq eval -i '.parameters.networkpolicy.networkPlugin = "cilium"' ${CLUSTER_ID}.yml
    yq eval -i '.parameters.networkpolicy.ignoredNamespaces = ["openshift-oauth-apiserver"]' ${CLUSTER_ID}.yml
    
    yq eval -i '.parameters.openshift4_monitoring.upstreamRules.networkPlugin = "cilium"' ${CLUSTER_ID}.yml
    
    yq eval -i '.parameters.openshift.infraID = "TO_BE_DEFINED"' ${CLUSTER_ID}.yml
    yq eval -i '.parameters.openshift.clusterID = "TO_BE_DEFINED"' ${CLUSTER_ID}.yml
    
    git commit -a -m "Add Cilium addon to ${CLUSTER_ID}"
    
    git push
    popd
  5. Compile catalog

    commodore catalog compile ${CLUSTER_ID} --push -i \
      --dynamic-fact kubernetesVersion.major=$(echo "1.29" | awk -F. '{print $1}') \
      --dynamic-fact kubernetesVersion.minor=$(echo "1.29" | awk -F. '{print $2}') \
      --dynamic-fact openshiftVersion.Major=$(echo "4.16" | awk -F. '{print $1}') \
      --dynamic-fact openshiftVersion.Minor=$(echo "4.16" | awk -F. '{print $2}')
    This commodore call requires Commodore v1.5.0 or newer. Please make sure to update your local installation.
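
The compile command in step 5 splits each version string at the dot to fill the dynamic facts. The following sketch shows that split in isolation. The helper name dynamic_fact_args is hypothetical and the versions are the ones used in this guide.

```shell
# Hypothetical helper: print the --dynamic-fact flags for the given
# Kubernetes and OpenShift versions, one flag per line.
dynamic_fact_args() {
  dfa_k8s_major=$(echo "$1" | awk -F. '{ print $1 }')
  dfa_k8s_minor=$(echo "$1" | awk -F. '{ print $2 }')
  dfa_ocp_major=$(echo "$2" | awk -F. '{ print $1 }')
  dfa_ocp_minor=$(echo "$2" | awk -F. '{ print $2 }')
  printf '%s\n' \
    "--dynamic-fact kubernetesVersion.major=${dfa_k8s_major}" \
    "--dynamic-fact kubernetesVersion.minor=${dfa_k8s_minor}" \
    "--dynamic-fact openshiftVersion.Major=${dfa_ocp_major}" \
    "--dynamic-fact openshiftVersion.Minor=${dfa_ocp_minor}"
}

# Show the flags for Kubernetes 1.29 and OpenShift 4.16.
dynamic_fact_args "1.29" "4.16"
```

Because the values contain no whitespace, the output could be passed unquoted, for example `commodore catalog compile ${CLUSTER_ID} --push -i $(dynamic_fact_args 1.29 4.16)`.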

Configure the OpenShift Installer

  1. Generate SSH key

    We generate a unique SSH key pair for the cluster as this gives us troubleshooting access.

    SSH_PRIVATE_KEY="$(pwd)/ssh_$CLUSTER_ID"
    export SSH_PUBLIC_KEY="${SSH_PRIVATE_KEY}.pub"
    
    ssh-keygen -C "vault@$CLUSTER_ID" -t ed25519 -f $SSH_PRIVATE_KEY -N ''
    
    BASE64_NO_WRAP='base64'
    if [[ "$OSTYPE" == "linux"* ]]; then
      BASE64_NO_WRAP='base64 --wrap 0'
    fi
    
    vault kv put clusters/kv/${TENANT_ID}/${CLUSTER_ID}/cloudscale/ssh \
      private_key=$(cat $SSH_PRIVATE_KEY | eval "$BASE64_NO_WRAP")
    
    ssh-add $SSH_PRIVATE_KEY
  2. Prepare install-config.yaml

    You can add more options to the install-config.yaml file. Have a look at the config example for more information.

    For example, you could change the SDN from the default to one that a customer requires for specific network reasons.

    export INSTALLER_DIR="$(pwd)/target"
    mkdir -p "${INSTALLER_DIR}"
    
    cat > "${INSTALLER_DIR}/install-config.yaml" <<EOF
    apiVersion: v1
    metadata:
      name: ${CLUSTER_ID} # (1)
    baseDomain: ${BASE_DOMAIN} # (1)
    platform:
      external:
        platformName: cloudscale
        cloudControllerManager: External
    networking:
      networkType: Cilium
    pullSecret: |
      ${PULL_SECRET}
    sshKey: "$(cat $SSH_PUBLIC_KEY)"
    EOF
    1 Make sure that the values here match the value of $CLUSTER_DOMAIN when combined as <metadata.name>.<baseDomain>. Otherwise, the installation will most likely fail.

    If setting custom CIDR for the OpenShift networking, the corresponding values should be updated in your Commodore cluster definitions. See Cilium Component Defaults and Parameter Reference. Verify with less catalog/manifests/cilium/olm/*ciliumconfig.yaml.
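
Because a mismatch between install-config.yaml and $CLUSTER_DOMAIN is only detected after the manifests are rendered, it can help to compare the values beforehand. This is a minimal sketch. The helper name check_cluster_domain is hypothetical.

```shell
# Hypothetical pre-check: combine <metadata.name> and <baseDomain> and
# compare the result with the expected cluster domain.
check_cluster_domain() {
  ccd_got="$1.$2"
  if [ "${ccd_got}" != "$3" ]; then
    echo "install-config yields '${ccd_got}', want '$3'" >&2
    return 1
  fi
}

# Possible usage before running the installer:
#   check_cluster_domain "${CLUSTER_ID}" "${BASE_DOMAIN}" "${CLUSTER_DOMAIN}"
```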

Run the OpenShift Installer

The steps in this section aren’t idempotent and have to be completed uninterrupted in one go. If you have to recreate the install config or any of the generated manifests you need to rerun all of the subsequent steps.
  1. Render install manifests (this will consume the install-config.yaml)

    openshift-install --dir "${INSTALLER_DIR}" \
      create manifests
    1. If you want to change the default "apps" domain for the cluster:

      yq e -i '.spec.domain = "apps.example.com"' \
        "${INSTALLER_DIR}/manifests/cluster-ingress-02-config.yml"
  2. Copy pre-rendered extra machine configs

    machineconfigs=catalog/manifests/openshift4-nodes/10_machineconfigs.yaml
    if [ -f $machineconfigs ];  then
      yq --no-doc -s \
        "\"${INSTALLER_DIR}/openshift/99x_openshift-machineconfig_\" + .metadata.name" \
        $machineconfigs
    fi
  3. Copy cloud-controller-manager manifests

    for f in catalog/manifests/cloudscale-cloud-controller-manager/*; do
      file=$(basename $f)
      # Split resources into individual files
      yq --no-doc -s \
        "\"${INSTALLER_DIR}/manifests/cloudscale-cloud-controller-manager_${file/.yaml}_\" + \$index + \"_\" + (.kind|downcase)" \
        $f
    done
    yq -i e ".stringData.access-token=\"${CLOUDSCALE_API_TOKEN}\"" \
      ${INSTALLER_DIR}/manifests/cloudscale-cloud-controller-manager_01_secret_0_secret.yml
  4. Copy pre-rendered Cilium manifests

    cp catalog/manifests/cilium/olm/* ${INSTALLER_DIR}/manifests/
  5. Verify that the generated cluster domain matches the desired cluster domain

    GEN_CLUSTER_DOMAIN=$(yq e '.spec.baseDomain' \
      "${INSTALLER_DIR}/manifests/cluster-dns-02-config.yml")
    if [ "$GEN_CLUSTER_DOMAIN" != "$CLUSTER_DOMAIN" ]; then
      echo -e "\033[0;31mGenerated cluster domain doesn't match expected cluster domain: Got '$GEN_CLUSTER_DOMAIN', want '$CLUSTER_DOMAIN'\033[0;0m"
    else
      echo -e "\033[0;32mGenerated cluster domain matches expected cluster domain.\033[0;0m"
    fi
  6. Prepare install manifests and ignition config

    openshift-install --dir "${INSTALLER_DIR}" \
      create ignition-configs
  7. Upload ignition config

    mc cp "${INSTALLER_DIR}/bootstrap.ign" "${CLUSTER_ID}/${CLUSTER_ID}-bootstrap-ignition/"
    
    export TF_VAR_ignition_bootstrap=$(mc share download \
      --json --expire=4h \
      "${CLUSTER_ID}/${CLUSTER_ID}-bootstrap-ignition/bootstrap.ign" | jq -r '.share')

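Terraform only receives a usable bootstrap URL if the last command in step 7 succeeded. With the -r flag, jq prints the string null when the requested key is absent, so an empty check alone isn't enough. This is a minimal sketch, assuming the share URL uses https as configured for the Minio client. The helper name valid_share_url is hypothetical.

```shell
# Hypothetical check: accept only a non-empty https:// URL.
# This rejects both an empty value and the literal string "null".
valid_share_url() {
  case "$1" in
    https://?*) return 0 ;;
    *)          return 1 ;;
  esac
}

# Possible usage after step 7:
#   valid_share_url "${TF_VAR_ignition_bootstrap}" ||
#     echo "bootstrap ignition URL is missing" >&2
```
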
Terraform Cluster Config

  1. Switch to the tenant repo

    pushd "inventory/classes/${TENANT_ID}/"
  2. Include no-opsgenie class to prevent monitoring noise during cluster setup

    yq eval -i '.classes += "global.distribution.openshift4.no-opsgenie"' ${CLUSTER_ID}.yml;
  3. Update cluster config

    yq eval -i ".parameters.openshift.infraID = \"$(jq -r .infraID "${INSTALLER_DIR}/metadata.json")\"" \
      ${CLUSTER_ID}.yml
    
    yq eval -i ".parameters.openshift.clusterID = \"$(jq -r .clusterID "${INSTALLER_DIR}/metadata.json")\"" \
      ${CLUSTER_ID}.yml
    
    yq eval -i ".parameters.openshift.ssh_key = \"$(cat ${SSH_PUBLIC_KEY})\"" \
      ${CLUSTER_ID}.yml

    If you use a custom "apps" domain, make sure to set parameters.openshift.appsDomain accordingly.

    APPS_DOMAIN=your.custom.apps.domain
    yq eval -i ".parameters.openshift.appsDomain = \"${APPS_DOMAIN}\"" \
      ${CLUSTER_ID}.yml

    By default, the cluster’s update channel is derived from the cluster’s reported OpenShift version. If you want to use a custom update channel, make sure to set parameters.openshift4_version.spec.channel accordingly.

    # Configure the OpenShift update channel as `fast`
    yq eval -i ".parameters.openshift4_version.spec.channel = \"fast-4.16\"" \
      ${CLUSTER_ID}.yml
  4. Set team responsible for handling Icinga alerts

    # use lower case for team name.
    # e.g. TEAM=aldebaran
    TEAM=<team-name>
  5. Prepare Terraform cluster config

    CA_CERT=$(jq -r '.ignition.security.tls.certificateAuthorities[0].source' \
      "${INSTALLER_DIR}/master.ign" | \
      awk -F ',' '{ print $2 }' | \
      base64 --decode)
    
    yq eval -i ".parameters.openshift4_terraform.terraform_variables.base_domain = \"${BASE_DOMAIN}\"" \
      ${CLUSTER_ID}.yml
    
    yq eval -i ".parameters.openshift4_terraform.terraform_variables.ignition_ca = \"${CA_CERT}\"" \
      ${CLUSTER_ID}.yml
    
    yq eval -i ".parameters.openshift4_terraform.terraform_variables.ssh_keys = [\"$(cat ${SSH_PUBLIC_KEY})\"]" \
      ${CLUSTER_ID}.yml
    
    yq eval -i ".parameters.openshift4_terraform.terraform_variables.team = \"${TEAM}\"" \
      ${CLUSTER_ID}.yml
    
    yq eval -i ".parameters.openshift4_terraform.terraform_variables.hieradata_repo_user = \"${HIERADATA_REPO_USER}\"" \
      ${CLUSTER_ID}.yml
  6. Configure cloudscale.ch-specific Terraform variables

    yq eval -i ".parameters.openshift4_terraform.terraform_variables.image_slug = \"custom:rhcos-4.16\"" \
      ${CLUSTER_ID}.yml
  7. Prepare cloudscale machine-api provider

    yq eval -i ".parameters.openshift4_terraform.terraform_variables.worker_count = 0" \
      ${CLUSTER_ID}.yml
    
    yq eval -i ".parameters.openshift4_terraform.terraform_variables.infra_count = 0" \
      ${CLUSTER_ID}.yml
    
    
    yq -i '.applications += "machine-api-provider-cloudscale"' \
      ${CLUSTER_ID}.yml
    yq eval -i ".parameters.openshift4_terraform.terraform_variables.make_worker_adoptable_by_provider = true" \
      ${CLUSTER_ID}.yml
    yq eval -i '.parameters.machine_api_provider_cloudscale.secrets["cloudscale-user-data"].stringData.ignitionCA = "${openshift4_terraform:terraform_variables:ignition_ca}"' \
      ${CLUSTER_ID}.yml

You now have the option to further customize the cluster by editing terraform_variables. Most importantly you have the option to change node sizes or add additional specialized worker nodes.

Please look at the configuration reference for the available options.
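
The CA_CERT pipeline in the "Prepare Terraform cluster config" step takes a base64 data URL, keeps the part after the comma, and decodes it. The following sketch shows the same steps on a harmless sample value. The helper name decode_data_url and the sample URL are hypothetical, and the prefix format of the real value in master.ign is an assumption.

```shell
# Hypothetical helper mirroring the CA_CERT pipeline: take a base64 data URL,
# keep the payload after the first comma, and decode it.
decode_data_url() {
  echo "$1" | awk -F ',' '{ print $2 }' | base64 --decode
}

# Sample value only; the real input comes from master.ign.
decode_data_url 'data:text/plain;charset=utf-8;base64,aGVsbG8='
```

If the decoded value doesn't start with a PEM header, check the source field in master.ign before writing it to the cluster config.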

Commit changes and compile cluster catalog

  1. Review changes. Have a look at the file ${CLUSTER_ID}.yml. Override default parameters or add more component configurations as required for your cluster.

  2. Commit changes

    git commit -a -m "Setup cluster ${CLUSTER_ID}"
    git push
    
    popd
  3. Compile and push cluster catalog

    commodore catalog compile ${CLUSTER_ID} --push -i \
      --dynamic-fact kubernetesVersion.major=$(echo "1.29" | awk -F. '{print $1}') \
      --dynamic-fact kubernetesVersion.minor=$(echo "1.29" | awk -F. '{print $2}') \
      --dynamic-fact openshiftVersion.Major=$(echo "4.16" | awk -F. '{print $1}') \
      --dynamic-fact openshiftVersion.Minor=$(echo "4.16" | awk -F. '{print $2}')
    This commodore call requires Commodore v1.5.0 or newer. Please make sure to update your local installation.

Provision Infrastructure

  1. Configure Terraform secrets

    cat <<EOF > ./terraform.env
    CLOUDSCALE_API_TOKEN
    TF_VAR_ignition_bootstrap
    TF_VAR_lb_cloudscale_api_secret
    TF_VAR_control_vshn_net_token
    GIT_AUTHOR_NAME
    GIT_AUTHOR_EMAIL
    HIERADATA_REPO_TOKEN
    EOF
  2. Set up Terraform

    Prepare Terraform execution environment
    # Set terraform image and tag to be used
    tf_image=$(\
      yq eval ".parameters.openshift4_terraform.images.terraform.image" \
      dependencies/openshift4-terraform/class/defaults.yml)
    tf_tag=$(\
      yq eval ".parameters.openshift4_terraform.images.terraform.tag" \
      dependencies/openshift4-terraform/class/defaults.yml)
    
    # Generate the terraform alias
    base_dir=$(pwd)
    alias terraform='touch .terraformrc; docker run -it --rm \
      -e REAL_UID=$(id -u) \
      -e TF_CLI_CONFIG_FILE=/tf/.terraformrc \
      --env-file ${base_dir}/terraform.env \
      -w /tf \
      -v $(pwd):/tf \
      --ulimit memlock=-1 \
      "${tf_image}:${tf_tag}" /tf/terraform.sh'
    
    export GITLAB_REPOSITORY_URL=$(curl -sH "Authorization: Bearer $(commodore fetch-token)" ${COMMODORE_API_URL}/clusters/${CLUSTER_ID} | jq -r '.gitRepo.url' | sed 's|ssh://||; s|/|:|')
    export GITLAB_REPOSITORY_NAME=${GITLAB_REPOSITORY_URL##*/}
    export GITLAB_CATALOG_PROJECT_ID=$(curl -sH "Authorization: Bearer ${GITLAB_TOKEN}" "https://git.vshn.net/api/v4/projects?simple=true&search=${GITLAB_REPOSITORY_NAME/.git}" | jq -r ".[] | select(.ssh_url_to_repo == \"${GITLAB_REPOSITORY_URL}\") | .id")
    export GITLAB_STATE_URL="https://git.vshn.net/api/v4/projects/${GITLAB_CATALOG_PROJECT_ID}/terraform/state/cluster"
    
    pushd catalog/manifests/openshift4-terraform/
    Initialize Terraform
    terraform init \
      "-backend-config=address=${GITLAB_STATE_URL}" \
      "-backend-config=lock_address=${GITLAB_STATE_URL}/lock" \
      "-backend-config=unlock_address=${GITLAB_STATE_URL}/lock" \
      "-backend-config=username=${GITLAB_USER}" \
      "-backend-config=password=${GITLAB_TOKEN}" \
      "-backend-config=lock_method=POST" \
      "-backend-config=unlock_method=DELETE" \
      "-backend-config=retry_wait_min=5"
  3. Create LB hieradata

    cat > override.tf <<EOF
    module "cluster" {
      bootstrap_count          = 0
      master_count             = 0
      infra_count              = 0
      worker_count             = 0
      additional_worker_groups = {}
    }
    EOF
    terraform apply -target "module.cluster.module.lb.module.hiera"
  4. Review and merge the LB hieradata MR (listed in Terraform output hieradata_mr) and wait until the deploy pipeline after the merge is completed.

  5. Create LBs

    terraform apply
  6. Set up the DNS records shown in the output variable dns_entries from the previous step in the cluster’s parent zone. If you use a custom apps domain, make the necessary changes to the DNS record for *.apps.

  7. Make LB FQDNs available for later steps

    Store LB FQDNs in environment
    declare -a LB_FQDNS
    for id in 1 2; do
      LB_FQDNS[$id]=$(terraform state show "module.cluster.module.lb.cloudscale_server.lb[$(expr $id - 1)]" | grep fqdn | awk '{print $2}' | tr -d ' "\r\n')
    done
    Verify FQDNs
    for lb in "${LB_FQDNS[@]}"; do echo $lb; done
  8. Check LB connectivity

    for lb in "${LB_FQDNS[@]}"; do
      ping -c1 "${lb}"
    done
  9. Wait until LBs are fully initialized by Puppet

    # Wait for Puppet provisioning to complete
    while true; do
      curl --connect-timeout 1 "http://api.${CLUSTER_DOMAIN}:6443" &>/dev/null
      if [ $? -eq 52 ]; then
        echo -e "\nHAproxy up"
        break
      else
        echo -n "."
        sleep 5
      fi
    done
    # Update sshop config, see https://wiki.vshn.net/pages/viewpage.action?pageId=40108094
    sshop_update
    # Check that you can access the LBs using your usual SSH config
    for lb in "${LB_FQDNS[@]}"; do
      ssh "${lb}" hostname -f
    done

    While you’re waiting for the LBs to be provisioned, you can check the cloud-init logs with the following SSH commands

    ssh ubuntu@"${LB_FQDNS[1]}" tail -f /var/log/cloud-init-output.log
    ssh ubuntu@"${LB_FQDNS[2]}" tail -f /var/log/cloud-init-output.log
  10. Check the "Server created" tickets for the LBs and link them to the cluster setup ticket.

  11. Deploy bootstrap node

    cat > override.tf <<EOF
    module "cluster" {
      bootstrap_count          = 1
      master_count             = 0
      infra_count              = 0
      worker_count             = 0
      additional_worker_groups = {}
    }
    EOF
    terraform apply
  12. Review and merge the LB hieradata MR (listed in Terraform output hieradata_mr) and run Puppet on the LBs after the deploy job has completed

    for fqdn in "${LB_FQDNS[@]}"; do
      ssh "${fqdn}" sudo puppetctl run
    done
  13. Wait for bootstrap API to come up

    API_URL=$(yq e '.clusters[0].cluster.server' "${INSTALLER_DIR}/auth/kubeconfig")
    while ! curl --connect-timeout 1 "${API_URL}/healthz" -k &>/dev/null; do
      echo -n "."
      sleep 5
    done && echo -e "\nAPI is up"
  14. Patch Cilium config to allow control plane bootstrap to succeed

    We need to temporarily adjust the Cilium config to not use full kube-proxy replacement, since we currently don’t have a way to disable the initial OpenShift-managed kube-proxy deployment. Additionally, because the cloudscale Cloud Controller Manager accesses the K8s API via service IP, we need to configure Cilium to provide partial kube-proxy replacement. This allows the CCM to start and untaint the control plane nodes so that other pods can be scheduled.

    export KUBECONFIG="${INSTALLER_DIR}/auth/kubeconfig"
    
    while ! kubectl get ciliumconfig -A &>/dev/null; do
      echo -n "."
      sleep 2
    done && echo -e "\nCiliumConfig CR is present"
    
    kubectl patch -n cilium ciliumconfig cilium-enterprise --type=merge \
     -p '{
      "spec": {
        "cilium": {
          "kubeProxyReplacement": "false",
          "nodePort": {
            "enabled": true
          },
          "socketLB": {
            "enabled": true
          },
          "sessionAffinity": true,
          "externalIPs": {
            "enabled": true
          },
          "hostPort": {
            "enabled": true
          }
        }
      }
     }'
  15. Deploy control plane nodes

    cat > override.tf <<EOF
    module "cluster" {
      bootstrap_count          = 1
      infra_count              = 0
      worker_count             = 0
      additional_worker_groups = {}
    }
    EOF
    terraform apply
  16. Add the DNS records for etcd shown in output variable dns_entries from the previous step to the cluster’s parent zone

  17. Apply the manifests for the cloudscale machine-api provider

    kapitan refs --reveal --refs-path ../../refs ../machine-api-provider-cloudscale/00_secrets.yaml | kubectl apply -f -
    
    kubectl apply -f ../machine-api-provider-cloudscale/10_clusterRoleBinding.yaml
    
    kubectl apply -f ../machine-api-provider-cloudscale/10_serviceAccount.yaml
    
    kubectl apply -f ../machine-api-provider-cloudscale/11_deployment.yaml
  18. Apply the machinesets from Terraform

    terraform output -raw worker-machineset_yml | grep -vP '^(│|╵|╷|There are some problems with the CLI configuration)' | yq -P > worker-machineset.yml
    head worker-machineset.yml
    kubectl apply -f worker-machineset.yml
    
    terraform output -raw infra-machineset_yml | grep -vP '^(│|╵|╷|There are some problems with the CLI configuration)' | yq -P > infra-machineset.yml
    head infra-machineset.yml
    kubectl apply -f infra-machineset.yml
  19. Wait for bootstrap to complete

    openshift-install --dir "${INSTALLER_DIR}" \
      wait-for bootstrap-complete --log-level debug

    If you’re using a CNI other than Cilium you may need to remove the following taint from the nodes to allow the network to come up:

    kubectl taint no --all node.cloudprovider.kubernetes.io/uninitialized:NoSchedule-

    Once the bootstrap is complete, taint the master nodes again to ensure that they’re properly initialized by the cloud-controller-manager.

    kubectl taint no -l node-role.kubernetes.io/master node.cloudprovider.kubernetes.io/uninitialized=:NoSchedule
  20. Remove bootstrap node

    rm override.tf
    terraform apply
    
    popd
  21. Review and merge the LB hieradata MR (listed in Terraform output hieradata_mr) and run Puppet on the LBs after the deploy job has completed

    for fqdn in "${LB_FQDNS[@]}"; do
      ssh "${fqdn}" sudo puppetctl run
    done
  22. Scale up the infra and worker machinesets

    INFRA_NODES=4 # adjust to desired number of infra nodes
    WORKER_NODES=3 # adjust to desired number of worker nodes
    kubectl scale machineset -nopenshift-machine-api infra --replicas="${INFRA_NODES}"
    kubectl scale machineset -nopenshift-machine-api worker --replicas="${WORKER_NODES}"
  23. Disable OpenShift kube-proxy deployment and revert Cilium patch

    kubectl patch network.operator cluster --type=merge \
      -p '{"spec":{"deployKubeProxy":false}}'
    kubectl -n cilium replace -f catalog/manifests/cilium/olm/cluster-network-07-cilium-ciliumconfig.yaml
    while ! kubectl -n cilium get cm cilium-config -oyaml | grep 'kube-proxy-replacement: "true"' &>/dev/null; do
      echo -n "."
      sleep 2
    done && echo -e "\nCilium config updated"
    kubectl -n cilium rollout restart ds/cilium
  24. Add Infra Node IPs to LB Hieradata

    git clone git@git.vshn.net:appuio/appuio_hieradata.git
    
    pushd appuio_hieradata/lbaas
    
    kubectl get node -l "node-role.kubernetes.io/infra" -oyaml | yq '.items[].status.addresses | filter(.type == "InternalIP") | map(.address)' > ips.yml
    
    yq -i '."profile_openshift4_gateway::backends".router = load("ips.yml")' "${CLUSTER_ID}.yaml"
    
    rm ips.yml
    
    git commit -am "Add infra nodes as backends for ${CLUSTER_ID}."
    git push
    popd
  25. Enable proxy protocol on ingress controller

    kubectl -n openshift-ingress-operator patch ingresscontroller default --type=json \
      -p '[{
        "op":"replace",
        "path":"/spec/endpointPublishingStrategy",
        "value": {"type": "HostNetwork", "hostNetwork": {"protocol": "PROXY"}}
      }]'

    This step isn’t necessary if you’ve disabled the proxy protocol on the load-balancers manually during setup.

    By default, PROXY protocol is enabled through the VSHN Commodore global defaults.

  26. Wait for installation to complete

    openshift-install --dir ${INSTALLER_DIR} \
      wait-for install-complete --log-level debug
  27. Create secret with S3 credentials for the registry

    oc create secret generic image-registry-private-configuration-user \
    --namespace openshift-image-registry \
    --from-literal=REGISTRY_STORAGE_S3_ACCESSKEY=$(mc config host ls ${CLUSTER_ID} -json | jq -r .accessKey) \
    --from-literal=REGISTRY_STORAGE_S3_SECRETKEY=$(mc config host ls ${CLUSTER_ID} -json | jq -r .secretKey)

    If the registry S3 credentials are created too long after the initial cluster setup, it’s possible that the openshift-samples operator has disabled itself because it couldn’t find a working in-cluster registry.

    If the samples operator is disabled, no templates and builder images will be available on the cluster.

    You can check the samples-operator’s state with the following command:

    kubectl get config.samples cluster -ojsonpath='{.spec.managementState}'

    If the command returns Removed, verify that the in-cluster registry pods are now running, and enable the samples operator again:

    kubectl patch config.samples cluster -p '{"spec":{"managementState":"Managed"}}'

    See the upstream documentation for more details on the samples operator.
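
Several steps in this section (13, 14 and 23) poll in an endless loop until a command succeeds. As a minimal sketch, the same pattern can be written once with an upper bound on the number of attempts, so that a stuck step fails visibly. The helper name wait_until is hypothetical.

```shell
# Hypothetical helper: run a command until it succeeds, pausing INTERVAL
# seconds between tries and giving up after ATTEMPTS tries.
# Usage: wait_until ATTEMPTS INTERVAL COMMAND [ARGS...]
wait_until() {
  wu_attempts="$1"
  wu_interval="$2"
  shift 2
  wu_try=1
  while ! "$@" >/dev/null 2>&1; do
    if [ "${wu_try}" -ge "${wu_attempts}" ]; then
      return 1
    fi
    wu_try=$((wu_try + 1))
    sleep "${wu_interval}"
  done
  return 0
}

# Possible usage for step 13, allowing roughly 30 minutes:
#   wait_until 360 5 curl --connect-timeout 1 -k "${API_URL}/healthz" &&
#     echo "API is up"
```

The Puppet wait in step 9 checks for a specific curl exit code rather than success, so it would need a small wrapper function before it fits this helper.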

Set up acme-dns CNAME records for the cluster

You can skip this section if you’re not using Let’s Encrypt for the cluster’s API and default wildcard certificates.
  1. Extract the acme-dns subdomain for the cluster after cert-manager has been deployed via Project Syn.

    fulldomain=$(kubectl -n syn-cert-manager \
      get secret acme-dns-client \
      -o jsonpath='{.data.acmedns\.json}' | \
      base64 -d  | \
      jq -r '[.[]][0].fulldomain')
    echo "$fulldomain"
  2. Add the following CNAME records to the cluster’s DNS zone

    The _acme-challenge records must be created in the same zone as the cluster’s api and apps records respectively.

    $ORIGIN <cluster-zone> (2)
    _acme-challenge.api  IN CNAME <fulldomain>. (1)
    $ORIGIN <apps-base-domain> (3)
    _acme-challenge.apps IN CNAME <fulldomain>. (1)
    1 Replace <fulldomain> with the output of the previous step.
    2 The _acme-challenge.api record must be created in the same origin as the api record.
    3 The _acme-challenge.apps record must be created in the same origin as the apps record.
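
To avoid copy-paste mistakes, the two records can be rendered from shell variables. The following is a sketch under stated assumptions: the function name render_acme_cnames and its three positional arguments are hypothetical, and the helper normalizes each name to end in exactly one trailing dot so that the names are absolute.

```shell
# Hypothetical helper: print the two CNAME records in zone-file syntax.
# $1: zone holding the api record
# $2: zone holding the apps record
# $3: acme-dns fulldomain from the previous step
render_acme_cnames() {
  local cluster_zone apps_base_domain fulldomain
  # Strip an optional trailing dot; it is added back exactly once below
  cluster_zone="${1%.}"
  apps_base_domain="${2%.}"
  fulldomain="${3%.}"
  printf '$ORIGIN %s.\n' "${cluster_zone}"
  printf '_acme-challenge.api  IN CNAME %s.\n' "${fulldomain}"
  printf '$ORIGIN %s.\n' "${apps_base_domain}"
  printf '_acme-challenge.apps IN CNAME %s.\n' "${fulldomain}"
}
```

For example, render_acme_cnames "<cluster-zone>" "<apps-base-domain>" "$fulldomain" prints the records with the placeholders filled in.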

Ensure emergency admin access to the cluster

  1. Check that emergency credentials were uploaded and are accessible:

    emergency-credentials-receive "${CLUSTER_ID}"
    # Follow the instructions to use the downloaded kubeconfig file

    You need to be in the Passbolt group VSHN On-Call.

    If the command fails, check if the controller is already deployed, running, and if the credentials are uploaded:

    kubectl -n appuio-emergency-credentials-controller get emergencyaccounts.cluster.appuio.io -o=jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.lastTokenCreationTimestamp}{"\n"}{end}'
  2. Follow the instructions from emergency-credentials-receive to use the downloaded kubeconfig file.

    export KUBECONFIG="em-${CLUSTER_ID}"
    kubectl get nodes
    oc whoami # should output system:serviceaccount:appuio-emergency-credentials-controller:*
  3. Invalidate the 10-year admin kubeconfig.

    kubectl -n openshift-config patch cm admin-kubeconfig-client-ca --type=merge -p '{"data": {"ca-bundle.crt": ""}}'
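
Invalidating the admin kubeconfig while still logged in with it would cut off access, so it can help to guard step 3 with the identity check from step 2. This is a minimal sketch: both function names are hypothetical, and the check only looks at the user name prefix printed by oc whoami.

```shell
# Hypothetical check: succeeds only if the user name is a service account
# in the emergency credentials controller namespace.
is_emergency_account() {
  case "$1" in
    system:serviceaccount:appuio-emergency-credentials-controller:?*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical wrapper around step 3: refuse to invalidate the admin
# kubeconfig unless the current session uses the emergency credentials.
invalidate_admin_kubeconfig() {
  if ! is_emergency_account "$(oc whoami)"; then
    echo "not logged in with emergency credentials, aborting" >&2
    return 1
  fi
  kubectl -n openshift-config patch cm admin-kubeconfig-client-ca \
    --type=merge -p '{"data": {"ca-bundle.crt": ""}}'
}
```

Call invalidate_admin_kubeconfig after exporting KUBECONFIG as shown in step 2.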

Enable Opsgenie alerting

  1. Create the standard silence for alerts that don’t have the syn label

    oc --as cluster-admin -n openshift-monitoring create job --from=cronjob/silence silence-manual
    oc --as cluster-admin wait -n openshift-monitoring --for=condition=complete job/silence-manual
    oc --as cluster-admin -n openshift-monitoring delete job/silence-manual
  2. Check the remaining active alerts and address them where necessary

    kubectl --as=cluster-admin -n openshift-monitoring exec sts/alertmanager-main -- \
        amtool --alertmanager.url=http://localhost:9093 alert --active
  3. Remove the "no-opsgenie" class from the cluster’s configuration

    pushd "inventory/classes/${TENANT_ID}/"
    yq eval -i 'del(.classes[] | select(. == "*.no-opsgenie"))' ${CLUSTER_ID}.yml
    git commit -a -m "Enable opsgenie alerting on cluster ${CLUSTER_ID}"
    git push
    popd
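
Before committing the change from step 3, it can be worth confirming that the yq expression really removed the class. The following sketch uses a plain text match instead of parsing the YAML, which is a simplification. The function name assert_opsgenie_enabled is hypothetical.

```shell
# Hypothetical check: fail if the cluster file is missing or still
# references a no-opsgenie class (plain text match, not a YAML parse).
# $1: path to the cluster's class file, for example ${CLUSTER_ID}.yml
assert_opsgenie_enabled() {
  if [ ! -f "$1" ]; then
    echo "$1 not found" >&2
    return 2
  fi
  if grep -q 'no-opsgenie' "$1"; then
    echo "$1 still contains a no-opsgenie class" >&2
    return 1
  fi
}
```

Run assert_opsgenie_enabled "${CLUSTER_ID}.yml" in the tenant directory between the yq and git commit commands.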

Configure access for registry bucket

OpenShift configures a PublicAccessBlockConfiguration on the registry bucket. Ceph currently has a bug that prevents pushing objects into the S3 bucket when this configuration is set.

The error message in the docker-registry logs is `s3aws: AccessDenied: \n\tstatus code: 403, request id: tx00000000000003ea93fa6-00112504a0-4fa9e750e-rma1, host id: `.

See tracker.ceph.com/issues/49135 for more information.

  1. Install the aws cli tool

    pip install awscli
  2. Check the current S3 bucket configuration after openshift4-registry has been deployed via Project Syn.

    export AWS_ACCESS_KEY_ID=$(mc config host ls ${CLUSTER_ID} -json | jq -r .accessKey)
    export AWS_SECRET_ACCESS_KEY=$(mc config host ls ${CLUSTER_ID} -json | jq -r .secretKey)
    export REGION=$(curl -sH "Authorization: Bearer $(commodore fetch-token)" ${COMMODORE_API_URL}/clusters/${CLUSTER_ID} | jq -r .facts.region)
    aws --endpoint-url "https://objects.${REGION}.cloudscale.ch" s3api get-public-access-block --bucket "${CLUSTER_ID}-image-registry"
  3. Set BlockPublicAcls to false

    aws s3api put-public-access-block --endpoint-url "https://objects.${REGION}.cloudscale.ch" --bucket "${CLUSTER_ID}-image-registry" --public-access-block-configuration BlockPublicAcls=false
  4. Verify that BlockPublicAcls is set to false

    aws s3api get-public-access-block --endpoint-url "https://objects.${REGION}.cloudscale.ch" --bucket "${CLUSTER_ID}-image-registry"

    The final configuration should look like this:

    {
        "PublicAccessBlockConfiguration": {
            "BlockPublicAcls": false,
            "IgnorePublicAcls": false,
            "BlockPublicPolicy": false,
            "RestrictPublicBuckets": false
        }
    }
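
The verification in step 4 can also be scripted instead of reading the JSON by eye. This is a minimal sketch: the function name block_public_acls_disabled is hypothetical, and it assumes the key and its value appear on the same line, as in the output shown above. It uses a text match instead of a JSON parser.

```shell
# Hypothetical check: reads the output of get-public-access-block on stdin
# and succeeds only if BlockPublicAcls is reported as false.
# Assumes key and value are on the same line of the JSON output.
block_public_acls_disabled() {
  grep -Eq '"BlockPublicAcls"[[:space:]]*:[[:space:]]*false'
}

# Usage sketch, with REGION and CLUSTER_ID set as in step 2:
#   aws s3api get-public-access-block \
#     --endpoint-url "https://objects.${REGION}.cloudscale.ch" \
#     --bucket "${CLUSTER_ID}-image-registry" \
#     | block_public_acls_disabled && echo "BlockPublicAcls is false"
```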

Finalize installation

  1. Configure the apt-dater groups for the LBs.

    git clone git@git.vshn.net:vshn-puppet/nodes_hieradata.git
    pushd nodes_hieradata
    cat >"${LB_FQDNS[1]}.yaml" <<EOF
    ---
    s_apt_dater::host::group: '2200_20_night_main'
    EOF
    cat >"${LB_FQDNS[2]}.yaml" <<EOF
    ---
    s_apt_dater::host::group: '2200_40_night_second'
    EOF
    git add *.yaml
    git commit -m "Configure apt-dater groups for LBs for OCP4 cluster ${CLUSTER_ID}"
    git push origin master
    popd

    This how-to defaults to the night maintenance window on Tuesday at 22:00. Adjust the apt-dater groups according to the documented groups (VSHN-internal only) if the cluster requires a different maintenance window.

  2. Wait for the deploy job on the nodes_hieradata repository to complete, then run Puppet on the LBs to update the apt-dater groups.

    for fqdn in "${LB_FQDNS[@]}"; do
      ssh "${fqdn}" sudo puppetctl run
    done
  3. Delete local config files

    rm -r "${INSTALLER_DIR:?}/"
  4. Remove bootstrap bucket

    mc rm -r --force "${CLUSTER_ID}/${CLUSTER_ID}-bootstrap-ignition"
    mc rb "${CLUSTER_ID}/${CLUSTER_ID}-bootstrap-ignition"
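
If the cluster needs a different maintenance window, the two hieradata files from step 1 can be written with a small helper so that only the group names change. This is a sketch: the function name write_apt_dater_group is hypothetical, and the group names in the usage comment are the defaults used in this how-to.

```shell
# Hypothetical helper: write the apt-dater group hieradata for one LB
# into the current directory.
# $1: FQDN of the LB, $2: apt-dater group
write_apt_dater_group() {
  cat >"$1.yaml" <<EOF
---
s_apt_dater::host::group: '$2'
EOF
}

# Usage sketch, run inside the nodes_hieradata checkout:
#   write_apt_dater_group "${LB_FQDNS[1]}" '2200_20_night_main'
#   write_apt_dater_group "${LB_FQDNS[2]}" '2200_40_night_second'
```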

Post tasks

VSHN

  1. Enable automated upgrades

  2. Add the cluster to the maintenance template, if necessary

Generic