Managed MachineSets for Cloudscale Provider
Problem
We created a cloudscale Machine-API provider for OpenShift 4 as decided in Custom Machine API Provider. The provider makes it possible to manage every node type in an OpenShift 4 cluster through MachineSets. The provider runs on the control plane nodes, but we haven’t yet found a feasible way to run it on bootstrap nodes. Some configuration is still based on Puppet or Terraform.
Proposals
Option 1: Manage worker nodes
We manage only the worker nodes with the Machine-API provider. After installing the control plane nodes, the infra nodes, and any additional nodes (for example, storage nodes), we create a MachineSet for the worker nodes.
This allows us to implement the customer-requested AutoScale feature and makes it easy to replace failing worker nodes. It doesn’t help us replace other node types, such as infra nodes.
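As a rough illustration of how autoscaling could attach to a worker MachineSet, the following sketch builds a MachineAutoscaler manifest in Python. It assumes the AutoScale feature is built on OpenShift's standard MachineAutoscaler resource (which also needs a ClusterAutoscaler resource in the cluster to take effect). The MachineSet name `app` and the replica bounds are hypothetical values chosen for the example.

```python
import json


def machine_autoscaler(machineset_name, min_replicas, max_replicas,
                       namespace="openshift-machine-api"):
    """Build a MachineAutoscaler manifest that targets a single MachineSet."""
    if not 0 <= min_replicas <= max_replicas:
        raise ValueError("need 0 <= min_replicas <= max_replicas")
    return {
        "apiVersion": "autoscaling.openshift.io/v1beta1",
        "kind": "MachineAutoscaler",
        "metadata": {"name": machineset_name, "namespace": namespace},
        "spec": {
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            # The autoscaler adjusts spec.replicas of this MachineSet.
            "scaleTargetRef": {
                "apiVersion": "machine.openshift.io/v1beta1",
                "kind": "MachineSet",
                "name": machineset_name,
            },
        },
    }


if __name__ == "__main__":
    # "app" is a hypothetical name for the worker MachineSet.
    print(json.dumps(machine_autoscaler("app", 3, 6), indent=2))
```

The JSON output can be applied with the usual cluster tooling, because Kubernetes accepts JSON manifests as well as YAML.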
Option 2: Manage all nodes except control plane nodes
We manage all nodes except the control plane nodes with the Machine-API provider. After installing the control plane nodes, we use MachineSets to scale up the worker nodes, infra nodes, and any additional nodes (such as storage nodes).
This allows us to implement the customer-requested AutoScale feature and makes it easy to replace failing nodes.
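To make the shape of this option more concrete, here is a minimal Python sketch that generates one MachineSet manifest per non-control-plane role. The metadata and selector layout follows the generic OpenShift MachineSet API. The cluster ID, the role names, the replica counts, and the empty `provider_spec` are assumptions for illustration. The actual fields of the cloudscale provider spec aren't described here, so they are left out rather than guessed.

```python
import json

# Hypothetical roles and replica counts; control plane nodes are
# deliberately absent because they stay outside the provider's scope.
ROLE_REPLICAS = {"worker": 3, "infra": 3, "storage": 3}


def machine_set(cluster_id, role, replicas, provider_spec):
    """Build a MachineSet manifest for one node role."""
    name = f"{cluster_id}-{role}"
    cluster_label = {"machine.openshift.io/cluster-api-cluster": cluster_id}
    selector = {
        **cluster_label,
        "machine.openshift.io/cluster-api-machineset": name,
    }
    return {
        "apiVersion": "machine.openshift.io/v1beta1",
        "kind": "MachineSet",
        "metadata": {
            "name": name,
            "namespace": "openshift-machine-api",
            "labels": cluster_label,
        },
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": selector},
            "template": {
                # Machine labels must match the selector above.
                "metadata": {"labels": dict(selector)},
                "spec": {
                    # Labels copied onto the resulting Node object.
                    "metadata": {
                        "labels": {f"node-role.kubernetes.io/{role}": ""}
                    },
                    # Provider-specific settings; opaque in this sketch.
                    "providerSpec": {"value": provider_spec},
                },
            },
        },
    }


if __name__ == "__main__":
    for role, replicas in ROLE_REPLICAS.items():
        # "c-example" is a hypothetical cluster ID.
        print(json.dumps(machine_set("c-example", role, replicas, {}), indent=2))
```

Keeping one MachineSet per role means each node type can be scaled or replaced independently of the others.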
Control plane nodes aren’t managed by the Machine-API provider because they aren’t expected to be replaced often. Control plane nodes require configuration in the VSHN DNS zone and can’t be easily replaced anyway. There is no easy, intuitive way to bootstrap the control plane nodes with the Machine-API provider, since the provider itself runs on those nodes.
Replacing nodes still requires care: the correct node replacement procedures must be followed, such as updating the router back-end configuration for infra nodes or rebalancing storage nodes.
The router back-end configuration will need to be automated, regardless of this issue, as soon as we roll out the new cloudscale load balancers.
Option 3: Manage all nodes
We manage all nodes, including the control plane nodes, with the Machine-API provider. This requires us to find a way to run the provider on the bootstrap nodes or on the engineer’s device.
This allows us to implement the customer-requested AutoScale feature and makes it easy to replace failing nodes.
Replacing control plane nodes has been tested and works, thanks to the PodDisruptionBudgets in the OpenShift 4 distribution. Some caution is required when updating the internal VSHN DNS zone configuration to reflect the new control plane nodes.
We can most likely replace the DNS zone configuration after we introduce the new cloudscale load balancers.
Rationale
We decided to go with Option 2 because it provides the customer-requested AutoScale feature and allows us to easily replace most types of failing nodes. Since the provider is fairly new, we want to start with a smaller scope and expand it later. Setting up control plane nodes with a provider isn’t straightforward, because the provider itself runs on those nodes. With the introduction of the new cloudscale load balancers, we might revisit this decision.