Cluster Access Tooling

This document proposes new cluster access tooling that aims to provide a user-friendly way to access our managed OpenShift clusters that don’t expose a public API.

The proposal consists of a SOCKS5 proxy that routes connections and accompanying CLI tooling. The proxy is designed to stay in the background and be transparent. CLI tools such as curl, as well as browsers, can statically add the proxy as a hop.

The tooling gets all required information from VSHN SSHOP config and the Lieutenant inventory.

Problem Statement

A growing part of our managed OpenShift clusters are in private networks without direct access from the public internet. Those clusters are accessed through ssh jump hosts that proxy connections to the cluster using dynamic port forwarding (ssh -D).

The current solution is to set up a SOCKS proxy per cluster, sometimes automatically using direnv, and then use that proxy to access the cluster using oc or kubectl. Since each proxy runs on a different local port, users have to keep track of which proxy corresponds to which cluster. In Firefox this routing is done using the awesome FoxyProxy addon; in other applications, users have to set up routing rules manually.

With the growing number of extra services sold on top of our managed OpenShift clusters, the number of other teams and users within VSHN needing access to the clusters is growing. The solution works well enough for our small team, but it’s not very user-friendly if you don’t use it on a daily basis.

We need a simple, user-friendly, and scalable solution to access our managed OpenShift clusters in private networks.

High Level Goals

  • An easy-to-use tool to access clusters in private networks.

  • Users should be able to access clusters in private networks without fiddling with Firefox addons or manual routing rules.

  • The solution should integrate well with our existing ssh jump host setup.

Non-Goals

  • Replacing ssh jump hosts with a different solution.

  • Replacing cluster authentication.

Constraints

  • The solution must work with our existing ssh jump host setup.

  • No additional tunnels or external connections through customer firewalls. Additional tunnels might not be approved by customers and would further increase the sprawl of access solutions VSHN has to maintain.

Interfaces

User

The user interacts with the cluster access tooling through a command line interface.

Routing information

Routing information is queried from our central inventory, which is managed using Project Syn and can be queried through the Lieutenant API.

See Routing information retrieval for more information.

SSH configuration

The tooling uses the user’s SSH configuration, which includes the SSH jump host (sshop) configuration.

The tooling reads this configuration to determine how to connect to the jump hosts and clusters.

HTTP based access

The tooling sets up a local SOCKS5 proxy listening on localhost.

The SOCKS5 proxy is transparent to the user: traffic to cluster endpoints is routed through the appropriate jump hosts, while all other traffic is connected directly to its destination. Direct connections shouldn’t affect speed or battery life, as they are handed off to the kernel without going through the proxy.

The proxy can stay running in the background.

VPNs and other network access solutions

We decided not to deal with VPNs directly, as there is no stable cross-platform solution for managing the VSHN VPN.

Clusters that require VPN access are routed through an additional VSHN management jump host that has access to the VPN, so users don’t have to worry about setting up the VPN themselves.

OpenShift CLI (oc) / Kubernetes CLI (kubectl) / curl / other CLI tools

The user must set the *_PROXY shell environment variables to access the clusters in private networks.

export HTTP_PROXY=socks5h://localhost:12000 (1)
export HTTPS_PROXY=socks5h://localhost:12000

curl https://api.c-bettersmarter-prod01.vshnmanaged.net:6443
1 socks5h is used to ensure that DNS resolution is also done through the proxy, which is necessary for non-public DNS names. The port is an example; the actual port is configurable.

Firefox and other web browsers

The user must configure their web browser to use the SOCKS5 proxy to access the clusters in private networks.

Because the proxy can stay running in the background, the browser can be configured to use it permanently, so users don’t have to fiddle with browser settings every time they want to access a cluster in a private network.

DNS should also be routed through the proxy to ensure that cluster endpoints that aren’t publicly resolvable can be accessed. Firefox can be configured to do this by checking the "Proxy DNS when using SOCKS v5" option in the proxy settings.

Cluster Authentication

Cluster authentication isn’t handled by the cluster access tooling, but the tooling should work well with our existing cluster authentication solutions.

Emergency Credentials

Cluster Emergency Credentials should be integrated into the cluster access tooling to allow users to easily retrieve emergency credentials for clusters in private networks.

Implementation

The cluster access tooling is implemented as a command line tool written in Go. Go supports cross-compilation and static binaries, and has good libraries for SSH and HTTP.

It can run as a daemon in the background to keep the SOCKS5 proxy running, or interactively as a CLI. The interactive CLI can be used, among other things, to update routing information, test jump host connectivity, or retrieve emergency credentials.

SOCKS5 Proxy

The tool sets up a local SOCKS5 proxy on localhost that routes traffic to cluster endpoints through the appropriate jump hosts based on the routing information retrieved from the Lieutenant API. All other traffic is directly connected to the destination. The implementation should hand off non-routed traffic to the kernel to avoid unnecessary overhead.
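How the proxy decides where to send a connection can be sketched with a custom dial function. The following sketch uses the community github.com/armon/go-socks5 library; the library choice and the two helper functions are assumptions of this sketch, not part of the proposal.

SOCKS5 proxy routing sketch
package main

import (
	"context"
	"errors"
	"log"
	"net"

	socks5 "github.com/armon/go-socks5"
)

// dialViaJumpHost stands in for the SSH tunnel dial described under SSH
// Tunneling below; name and signature are illustrative.
func dialViaJumpHost(ctx context.Context, jumpHost, network, addr string) (net.Conn, error) {
	return nil, errors.New("see SSH Tunneling")
}

// lookupJumpHost is a hypothetical router returning the jump host for a
// routed endpoint, or ok=false for traffic that should connect directly.
func lookupJumpHost(host string) (jumpHost string, ok bool) {
	return "", false
}

func main() {
	conf := &socks5.Config{
		// The dial function decides per connection whether to tunnel
		// through a jump host or connect directly.
		Dial: func(ctx context.Context, network, addr string) (net.Conn, error) {
			host, _, err := net.SplitHostPort(addr)
			if err != nil {
				return nil, err
			}
			if jumpHost, ok := lookupJumpHost(host); ok {
				return dialViaJumpHost(ctx, jumpHost, network, addr)
			}
			var d net.Dialer
			return d.DialContext(ctx, network, addr)
		},
	}
	server, err := socks5.New(conf)
	if err != nil {
		log.Fatal(err)
	}
	// Listen on localhost only; see the note on authentication below.
	log.Fatal(server.ListenAndServe("tcp", "127.0.0.1:12000"))
}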

The list of routed endpoints should be searched by suffix so that subdomains of cluster endpoints are also routed through the proxy without needing to list every possible subdomain in the routing information. The list should be searched by longest suffix first to allow for more specific routing rules to take precedence over more general ones.

Exclusion rules take precedence over inclusion rules of the same or lower specificity: if a domain matches an exclusion rule that is at least as specific as any matching inclusion rule, it’s routed directly to the destination instead of through the proxy. A minimal sketch of the matching logic follows the example list below.

An example list of routed endpoints:

c-bettersmarter-prod01.vshnmanaged.net
a.storage.bettersmarter.ch
b.storage.bettersmarter.ch
vcenter.bettersmarter.ch
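A minimal sketch of the matching logic, assuming label-aligned suffix matching; the function names are illustrative and the suffix sets mirror the examples in this document.

Suffix matching sketch
package main

import (
	"fmt"
	"strings"
)

// matchesSuffix reports whether host equals suffix or is a subdomain of
// it. Matching is label-aligned: "a.example.com" matches "example.com",
// but "notexample.com" does not.
func matchesSuffix(host, suffix string) bool {
	return host == suffix || strings.HasSuffix(host, "."+suffix)
}

// longestMatch returns the longest (most specific) matching suffix.
func longestMatch(host string, suffixes []string) (best string, ok bool) {
	for _, s := range suffixes {
		if matchesSuffix(host, s) && len(s) > len(best) {
			best = s
		}
	}
	return best, best != ""
}

// shouldRoute applies the precedence rules: the most specific matching
// suffix wins, and on equal specificity the exclusion takes precedence.
func shouldRoute(host string, include, exclude []string) bool {
	inc, incOK := longestMatch(host, include)
	exc, excOK := longestMatch(host, exclude)
	switch {
	case !incOK:
		return false // not a routed endpoint, connect directly
	case !excOK:
		return true
	default:
		return len(inc) > len(exc)
	}
}

func main() {
	include := []string{"c-bettersmarter-prod01.vshnmanaged.net", "storage.bettersmarter.ch", "vcenter.bettersmarter.ch"}
	exclude := []string{"apps.c-bettersmarter-prod01.vshnmanaged.net"}
	fmt.Println(shouldRoute("api.c-bettersmarter-prod01.vshnmanaged.net", include, exclude))          // true: routed
	fmt.Println(shouldRoute("grafana.apps.c-bettersmarter-prod01.vshnmanaged.net", include, exclude)) // false: excluded
	fmt.Println(shouldRoute("example.com", include, exclude))                                         // false: direct
}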

A POC implementation of the SOCKS5 proxy demonstrated no noticeable overhead for new TCP connections and zero overhead for established connections.

An optional username and password can be set for the SOCKS5 proxy to prevent unauthorized access. We don’t consider authentication necessary by default: the proxy only listens on localhost and only tunnels to cluster endpoints, so it shouldn’t pose a security risk.

SSH Tunneling

SSH connections are set up when first required and kept alive as long as the proxy is running to avoid the overhead of setting up new SSH connections for every request. Connections are monitored and automatically re-established if they drop due to network issues or if the machine goes to sleep.

The tooling uses the golang.org/x/crypto/ssh package to set up SSH tunnels to the jump hosts and clusters. This gives us full control over the tunnels: we don’t have to worry about managing ssh processes and ports, and we can easily keep the tunnels alive in the background. VSHN SSH configuration is fairly standardized, so we can implement support for the features we use without supporting every possible SSH configuration option. If a user has a complex SSH configuration that the tooling doesn’t support, we can throw an error early to avoid unexpected behavior.
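As an illustration, a minimal sketch of dialing a cluster endpoint through a single jump host with golang.org/x/crypto/ssh, authenticating via the user’s ssh-agent. The user name, endpoints, and the skipped host key verification are placeholders for this sketch.

SSH tunnel dialing sketch
package main

import (
	"log"
	"net"
	"os"

	"golang.org/x/crypto/ssh"
	"golang.org/x/crypto/ssh/agent"
)

func main() {
	// Authenticate with the user's ssh-agent, like the ssh CLI would.
	sock, err := net.Dial("unix", os.Getenv("SSH_AUTH_SOCK"))
	if err != nil {
		log.Fatal(err)
	}
	ag := agent.NewClient(sock)
	config := &ssh.ClientConfig{
		User: "vshn", // illustrative user name
		Auth: []ssh.AuthMethod{ssh.PublicKeysCallback(ag.Signers)},
		// A real implementation must verify host keys, e.g. against
		// the user's known_hosts; skipped here for brevity.
		HostKeyCallback: ssh.InsecureIgnoreHostKey(),
	}

	// First hop: the jump host from the cluster fact.
	jump, err := ssh.Dial("tcp", "management4.rma1.bettersmarter.vshnmanaged.net:22", config)
	if err != nil {
		log.Fatal(err)
	}

	// Tunnel a TCP connection to the cluster API through the jump host.
	// For longer chains, wrap the returned net.Conn in another
	// ssh.NewClientConn and repeat per hop.
	conn, err := jump.Dial("tcp", "api.c-bettersmarter-prod01.vshnmanaged.net:6443")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	// conn is a plain net.Conn to the cluster API, ready to be handed
	// to the SOCKS5 proxy as the upstream connection.
}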

We evaluated using the ssh command line tool with -D for dynamic port forwarding. Its main advantage is that it uses the user’s existing SSH configuration without us having to re-implement SSH features in Go. That advantage is also the biggest disadvantage: a complex SSH configuration can interact with the tooling in unexpected ways that we can’t detect or control. Another issue is port management: each ssh -D command requires a free local port, which can lead to conflicts and makes the proxies harder to manage. Keeping the tunnels alive through the command line tool can also be difficult, especially with ControlMaster and ControlPersist options.

The Go SSH tooling allows us to create SSH (and SSH agent) servers and clients, so we can easily end-to-end test the SSH tunneling functionality without setting up actual jump hosts and clusters.

A full POC has been implemented using the Go SSH tooling and was able to connect to all our managed clusters through chains of up to three jump hosts. All SSH features we use in our VSHN SSH configuration are supported, and the implementation performs well and feels stable, even when changing networks or under load.

Routing information retrieval

Routing information is queried from our central inventory, which is managed using Project Syn and can be queried through the Lieutenant API.

Clusters that require a jump host for access have an additional cluster fact that specifies which jump host to use.

apiVersion: syn.tools/v1alpha1
kind: Cluster
metadata:
  name: c-bettersmarter-prod01
spec:
  displayName: BetterSmarter Prod01
  facts:
    jumphost: management4.rma1.bettersmarter.vshnmanaged.net (1)
    jumphostDomains: vcenter.bettersmarter.ch,storage.bettersmarter.ch (2)
    jumphostSkipDomains: apps.c-bettersmarter-prod01.vshnmanaged.net (3)
status:
  facts: (4)
    openshiftBaseDomain: c-bettersmarter-prod01.vshnmanaged.net
    openshiftAppsDomain: apps.c-bettersmarter-prod01.vshnmanaged.net
    openshiftApiURL: https://api.c-bettersmarter-prod01.vshnmanaged.net:6443
    openshiftConsoleURL: https://console.apps.c-bettersmarter-prod01.vshnmanaged.net
1 The jumphost fact specifies the jump host to use for accessing the cluster. Missing or empty jumphost means the cluster is directly accessible and doesn’t require a jump host.
2 The jumphostDomains fact specifies domains that should be routed through the jump host. Domains are comma-separated suffixes. Domains from .status.facts (dynamic facts) are added to this list.
3 The jumphostSkipDomains fact specifies domains that should be excluded from routing through the jump host, even if they match the jumphostDomains list. Domains are comma-separated suffixes.
4 Dynamic facts are retrieved from the cluster itself, such as the API URL and console URL. Those facts are added to the list of domains to route through the jump host, so users don’t have to manually add them to the jumphostDomains fact. The domain suffix list generated from this cluster would look like this:
exclude:
- apps.c-bettersmarter-prod01.vshnmanaged.net
include:
- console.apps.c-bettersmarter-prod01.vshnmanaged.net
- apps.c-bettersmarter-prod01.vshnmanaged.net
- api.c-bettersmarter-prod01.vshnmanaged.net
- c-bettersmarter-prod01.vshnmanaged.net
- storage.bettersmarter.ch
- vcenter.bettersmarter.ch
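A sketch of how the suffix lists above could be derived from the cluster facts. The fact keys are taken from the example; the function name and the URL-host extraction are assumptions of this sketch.

Suffix list generation sketch
package main

import (
	"fmt"
	"net/url"
	"strings"
)

func buildSuffixLists(facts, statusFacts map[string]string) (include, exclude []string) {
	split := func(s string) []string {
		if s == "" {
			return nil
		}
		return strings.Split(s, ",")
	}
	include = split(facts["jumphostDomains"])
	exclude = split(facts["jumphostSkipDomains"])
	// Plain-domain dynamic facts are added to the include list directly.
	for _, key := range []string{"openshiftBaseDomain", "openshiftAppsDomain"} {
		if d := statusFacts[key]; d != "" {
			include = append(include, d)
		}
	}
	// URL-valued dynamic facts contribute their host name.
	for _, key := range []string{"openshiftApiURL", "openshiftConsoleURL"} {
		if u, err := url.Parse(statusFacts[key]); err == nil && u.Hostname() != "" {
			include = append(include, u.Hostname())
		}
	}
	return include, exclude
}

func main() {
	include, exclude := buildSuffixLists(
		map[string]string{
			"jumphostDomains":     "vcenter.bettersmarter.ch,storage.bettersmarter.ch",
			"jumphostSkipDomains": "apps.c-bettersmarter-prod01.vshnmanaged.net",
		},
		map[string]string{
			"openshiftBaseDomain": "c-bettersmarter-prod01.vshnmanaged.net",
			"openshiftAppsDomain": "apps.c-bettersmarter-prod01.vshnmanaged.net",
			"openshiftApiURL":     "https://api.c-bettersmarter-prod01.vshnmanaged.net:6443",
			"openshiftConsoleURL": "https://console.apps.c-bettersmarter-prod01.vshnmanaged.net",
		},
	)
	fmt.Println("include:", include)
	fmt.Println("exclude:", exclude)
}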

Information downloaded from the Lieutenant API is cached locally to allow the tooling to continue working even if the Lieutenant API is temporarily unavailable. New information is downloaded using the update sub-command. The information isn’t expected to change frequently, and we don’t set any TTLs.

Only the last target jump host is specified in the cluster fact; the tooling looks up the routing information for that jump host by querying the SSH configuration. This way the tooling can support an arbitrary number of jump hosts in a chain without specifying them in the routing information.

The following SSH options should be supported to specify jump host chains:

Host management4.rma1.bettersmarter.vshnmanaged.net
  ProxyJump vpn1.rma1.bettersmarter.vshnmanaged.net

Host vpn1.rma1.bettersmarter.vshnmanaged.net
  ProxyCommand ssh -W vpn1.rma1.bettersmarter.vshnmanaged.net:22 -- vpn5.corp.vshn.net
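ProxyJump chains could, for example, be resolved with an SSH config parser such as github.com/kevinburke/ssh_config. The library choice is an assumption of this sketch, and ProxyCommand hops would need extra parsing that is omitted here.

Jump host chain resolution sketch
package main

import (
	"fmt"

	ssh_config "github.com/kevinburke/ssh_config"
)

// jumpChain resolves the chain of hops for a host by following
// ProxyJump directives in the user's SSH configuration, target first.
func jumpChain(host string) []string {
	chain := []string{host}
	for {
		next := ssh_config.Get(host, "ProxyJump")
		if next == "" || next == "none" {
			return chain
		}
		chain = append(chain, next)
		host = next
	}
}

func main() {
	// With the example configuration above, this prints the target
	// host followed by its ProxyJump hop.
	fmt.Println(jumpChain("management4.rma1.bettersmarter.vshnmanaged.net"))
}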

The tooling authenticates to the Lieutenant API using OIDC with the same flow as commodore fetch-token.

The proxy reloads the routing information and invalidates caches on SIGHUP.
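A minimal sketch of the reload handling with the standard os/signal package; the reload function is a placeholder.

SIGHUP reload sketch
package main

import (
	"log"
	"os"
	"os/signal"
	"syscall"
)

// reloadRoutingInformation is a placeholder: re-read the cached
// inventory and rebuild the suffix lists.
func reloadRoutingInformation() {}

func main() {
	hup := make(chan os.Signal, 1)
	signal.Notify(hup, syscall.SIGHUP)
	go func() {
		for range hup {
			log.Println("SIGHUP received, reloading routing information")
			reloadRoutingInformation()
		}
	}()
	select {} // the proxy serving loop is elided in this sketch
}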

Running as a Daemon

The tooling doesn’t implement any special daemonization logic; we use standard process management tools, such as systemd on Linux and launchd on macOS, to run the proxy in the background.

systemd

The tool should provide a systemd user service file that can be used to run the proxy in the background on Linux.

User services get the user’s SSH configuration and ssh-agent. We tested this with GNOME-based desktop environments (Fedora 43, Ubuntu 26.04).

systemd user service file example
# ~/.config/systemd/user/kharon.service
[Unit]
Description=Smart Access cluster proxy

[Service]
Environment="SSH_AUTH_SOCK=%t/gcr/ssh" (1)
ExecStart=%h/workspace/kharon/kharon %h/workspace/kharon/domain_jumphost_mapping.json

[Install]
WantedBy=default.target
1 systemd provides a socket listener that starts the gnome-keyring daemon on the first connection to the socket.

launchd

The tool should provide a launchd user service file that can be used to run the proxy in the background on macOS.

User services get the user’s SSH configuration and ssh-agent. We tested this on macOS Tahoe 26.4.

launchd user service file example
<!-- ~/Library/LaunchAgents/io.vshn.SmartAccess.plist -->
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
	<key>Label</key>
	<string>io.vshn.SmartAccess</string>
	<key>LimitLoadToSessionType</key>
	<string>Aqua</string>
	<key>ProgramArguments</key>
	<array>
		<string>/Users/USER/workspace/kharon/kharon</string> (1)
		<string>/Users/USER/workspace/kharon/domain_jumphost_mapping.json</string>
	</array>
	<key>KeepAlive</key>
	<true/>
	<key>StandardOutPath</key>
	<string>/Users/USER/Library/Logs/io.vshn.SmartAccess.out.log</string>
	<key>StandardErrorPath</key>
	<string>/Users/USER/Library/Logs/io.vshn.SmartAccess.err.log</string>
	<key>ProcessType</key>
	<string>Interactive</string>
</dict>
</plist>
1 launchd requires hardcoded paths. The installer should replace USER with the actual username and set the correct paths for the binary and log files.

CLI

service-install

The service-install sub-command installs the necessary files to run the proxy as a daemon using systemd on Linux and launchd on macOS. It checks which OS it’s running on and installs the appropriate service file.

update

The update sub-command updates the routing information by querying the Lieutenant API and caching the information locally. See Routing information retrieval for more information.

After updating the information, the command sends a SIGHUP signal to the proxy process so it reloads the routing information and invalidates its caches.
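A sketch of the signaling side, assuming the daemon writes its PID to a well-known file; the pidfile path is illustrative.

SIGHUP signaling sketch
package main

import (
	"bytes"
	"log"
	"os"
	"strconv"
	"syscall"
)

func main() {
	// Read the PID written by the running proxy daemon.
	raw, err := os.ReadFile(os.ExpandEnv("$XDG_RUNTIME_DIR/kharon.pid"))
	if err != nil {
		log.Fatal(err)
	}
	pid, err := strconv.Atoi(string(bytes.TrimSpace(raw)))
	if err != nil {
		log.Fatal(err)
	}
	proc, err := os.FindProcess(pid)
	if err != nil {
		log.Fatal(err)
	}
	// Ask the proxy to reload routing information and drop caches.
	if err := proc.Signal(syscall.SIGHUP); err != nil {
		log.Fatal(err)
	}
}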

proxy

The proxy sub-command sets up a local SOCKS5 proxy that routes traffic to cluster endpoints through the appropriate jump hosts based on the routing information retrieved from the Lieutenant API. All other traffic is directly connected to the destination.

The command fails fast if the routing information isn’t available, so users are expected to run update before running proxy for the first time, and then run update periodically to keep the routing information up to date.

See SOCKS5 Proxy for more information.

test

The test sub-command tests the connectivity to all jump hosts and clusters based on the routing information retrieved from the Lieutenant API. This can be used to verify that the routing information is correct and that the jump hosts and clusters are accessible.

kubeconfig

The kubeconfig sub-command generates a kubeconfig for all known clusters based on the routing information retrieved from the Lieutenant API and writes it to the specified file, or to standard output if no file is specified.

While the primary use case for the tooling is access to OpenShift clusters, we aim to also support Kubernetes clusters using int128/kubelogin for authentication. A JSON serialized kubeconfig.v1.AuthInfo object can be provided as a cluster fact which the tooling then uses to generate the kubeconfig.
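A sketch of the generation step using the clientcmd API from k8s.io/client-go. Pointing clients at the local SOCKS5 proxy through the kubeconfig proxy-url field is one possible design, shown here as an assumption rather than decided behavior; the AuthInfo would come from the cluster fact described above.

kubeconfig generation sketch
package main

import (
	"log"

	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/tools/clientcmd/api"
)

func main() {
	cfg := api.NewConfig()
	cfg.Clusters["c-bettersmarter-prod01"] = &api.Cluster{
		// Server comes from the openshiftApiURL dynamic fact.
		Server: "https://api.c-bettersmarter-prod01.vshnmanaged.net:6443",
		// Route clients through the local SOCKS5 proxy; the port
		// matches the proxy example earlier in this document.
		ProxyURL: "socks5://localhost:12000",
	}
	cfg.Contexts["c-bettersmarter-prod01"] = &api.Context{
		Cluster:  "c-bettersmarter-prod01",
		AuthInfo: "c-bettersmarter-prod01",
	}
	// Deserialized from the kubeconfig.v1.AuthInfo cluster fact in a
	// real implementation; left empty in this sketch.
	cfg.AuthInfos["c-bettersmarter-prod01"] = &api.AuthInfo{}
	cfg.CurrentContext = "c-bettersmarter-prod01"

	if err := clientcmd.WriteToFile(*cfg, "kubeconfig"); err != nil {
		log.Fatal(err)
	}
}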

clusters

The clusters sub-command lists the clusters based on the information retrieved from the Lieutenant API.

clusters CLUSTER_NAME kubeconfig

The clusters CLUSTER_NAME kubeconfig sub-command retrieves the kubeconfig for the specified cluster and writes it to the specified file or standard output if no file is specified.

clusters CLUSTER_NAME emergency-credentials

The clusters CLUSTER_NAME emergency-credentials sub-command retrieves the emergency credentials kubeconfig for the specified cluster and writes it to the specified file or standard output if no file is specified.

See Cluster Emergency Credentials for implementation details of the emergency credentials retrieval.

clusters CLUSTER_NAME shell

We might try auto login to the cluster using oc login before opening the shell.

The clusters CLUSTER_NAME shell sub-command opens a shell with the kubeconfig set up to access the specified cluster. The command sets an environment variable so that shell startup scripts, such as .bashrc or .zshrc, can detect the command and do their own setup.

Interactive shell with kubeconfig example
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
)

func main() {
	kubeconfig := os.ExpandEnv("$XDG_CACHE_HOME/tool/c-cluster1/kubeconfig") (1)
	shell, ok := os.LookupEnv("SHELL")
	if !ok {
		fmt.Println("Failed to determine the shell to use. Please set the SHELL environment variable.")
		os.Exit(1)
	}
	cmd := exec.Command(shell)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.Env = os.Environ()
	cmd.Env = append(cmd.Env, fmt.Sprintf("KUBECONFIG=%s", kubeconfig))
	// Detection variable for shell startup scripts, as described above;
	// the variable name is illustrative.
	cmd.Env = append(cmd.Env, "KHARON_CLUSTER=c-cluster1")
	if err := cmd.Run(); err != nil {
		// Propagate the shell's exit code if it exited non-zero.
		var exitErr *exec.ExitError
		if errors.As(err, &exitErr) {
			os.Exit(exitErr.ExitCode())
		}
		fmt.Println("Failed to start the shell:", err)
		os.Exit(1)
	}
}
1 The kubeconfig file should be a per-cluster or temporary file, not a single file for all clusters, to avoid conflicts when running multiple shells for different clusters at the same time.

clusters CLUSTER_NAME kubectl

We might wrap the command instead of embedding it.

We might try auto login to the cluster using oc login before running the command.

The clusters CLUSTER_NAME kubectl sub-command embeds kubectl and allows the user to run kubectl commands against the specified cluster without needing to set up a kubeconfig file.

clusters CLUSTER_NAME oc

The clusters CLUSTER_NAME oc sub-command embeds oc and allows the user to run oc commands against the specified cluster without needing to set up a kubeconfig file.

We might wrap the command instead of embedding it.

We’re not yet 100% sure if we can embed oc. It looks embeddable at first glance, but we need to do more testing to be sure.