Kubernetes Resource Management
Scheduling and resource management are topics many Kubernetes users struggle with, even though understanding them and configuring your workload correctly is vital for optimal resource usage and application availability. In this article, we’ll explain what scheduling and resource management are, how you configure and use them, and go into some best practices.
- Target audience
This is a technical article targeting developers deploying applications onto Kubernetes, as well as cluster administrators.
When creating a Pod in Kubernetes, you can specify the resource requirements of its containers. This is done using two concepts called requests and limits:
|Resource requests and limits are defined on a Container level, however since a Pod is the smallest schedulable unit we use the term "a Pod’s resources" in this article. A Pod’s resources are simply the sum of its Containers' resources.|
Requests: The amount of resources that a container is guaranteed to have available. When a Pod is running on a Node, those resources are reserved for that Pod.
Limits: As the name implies, a limit on how much of a given resource the container may consume for short periods of time. We’ll explain what happens when a container exceeds this limit later in this article.
The two resource types that can be configured are CPU and Memory.
Resource requests and limits for CPU are measured in "CPU units". One CPU (vCPU/Core on cloud providers, hyperthread on bare metal) is equivalent to 1 CPU unit.
CPU units are always measured as an absolute quantity, not as a relative one. So "1 CPU unit" is the same amount of CPU on a single core system as it is on a 256 core machine. However, the single core system will only have a capacity of one CPU unit (we’ll come to that later), while the 256 core machine will have a capacity of 256 CPU units.
CPU requests and limits can be expressed in mCPU (milli CPU), often referred to as "millicores". Each CPU can be divided into 1000 mCPU (because, you know, that’s what "milli" means).
500m: half a CPU
1: one CPU
100m: one tenth of a CPU
The smallest allowed precision is
Resource requests and limits for Memory are measured in bytes. You can use the following suffixes: k, M, G, T, P, E (powers of ten) and Ki, Mi, Gi, Ti, Pi, Ei (powers of two):
… and so on
Usually the "power of two" suffixes (Ki, Mi, Gi, …) are used, so if you’re unsure what to use, stick to them.
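To make the CPU and Memory notations above concrete, here is a minimal Python sketch that converts such quantity strings into millicores and bytes. The function names `parse_cpu` and `parse_memory` are made up for this article and are not part of any Kubernetes client library. The sketch only handles the simple forms discussed here and assumes the lowercase `k` suffix for "kilo".

```python
# Multipliers (in bytes) for the decimal and binary memory suffixes.
DECIMAL = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12, "P": 10**15, "E": 10**18}
BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40, "Pi": 2**50, "Ei": 2**60}


def parse_cpu(quantity: str) -> int:
    """Return a CPU quantity such as "500m" or "1" in millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def parse_memory(quantity: str) -> int:
    """Return a memory quantity such as "128Mi" or "1G" in bytes."""
    # Binary suffixes are checked first; a plain number is already in bytes.
    for suffix, factor in {**BINARY, **DECIMAL}.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)


print(parse_cpu("500m"), parse_memory("128Mi"), parse_memory("1G"))
```

Note the difference between `1G` (1,000,000,000 bytes) and `1Gi` (1,073,741,824 bytes), which is one reason to stick to one family of suffixes.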
Configuring resource requests & limits is done by setting the .spec.containers.resources field on a container spec:
apiVersion: v1
kind: Pod
metadata:
  name: resource-example
spec:
  containers:
  - name: app
    image: app
    resources:
      requests:
        cpu: "100m"      # CPU request: one tenth of a CPU
        memory: "128Mi"  # Memory request: 128 mebibytes
      limits:
        cpu: "1"         # CPU limit: one full CPU
        memory: "1Gi"    # Memory limit: 1 gibibyte
Since pods usually are created by Deployments (or DeploymentConfigs if you are using OpenShift), you would instead set the deployment’s
It is not necessary to set all of the values. For example, it’s possible to configure only Memory requests and CPU limits.
|On APPUiO Public clusters, we enforce the usage of resource requests and limits using LimitRanges. They define the range of possible values as well as default values that will be applied if you do NOT specify any resource requests or limits.|
In order to understand resource management properly, we first have to understand how kube-scheduler, the default scheduler for Kubernetes, works.
In Kubernetes, scheduling refers to making sure that Pods are matched to Nodes so that Kubelet can run them.
The job of the scheduler is to take new Pods and assign them to a Node in the cluster.
Whenever kube-scheduler sees a new Pod that is not assigned to a Node (indicated by the fact that the Pod’s .spec.nodeName is not set), it assigns the Pod to a Node in two phases:
Filtering: During this phase, the scheduler determines which nodes are eligible for the Pod to be scheduled on. In the beginning, all nodes are candidates. The scheduler then applies various filter plugins, for example: Does the Node match the Pod’s nodeSelector? Does the Node have sufficient resources available? Does the Node have any taints that are not tolerated by the Pod? Is the Node marked as unschedulable? Does the Pod request any special features, for example a GPU?
If no Nodes are left after this step, the Pod will not be assigned to a Node and stays in the "Pending" state. An Event is added to the Pod explaining why scheduling failed.
If a pod stays in "Pending", use kubectl describe pod/<POD> and check the "Events" section to see why it failed.
Scheduling policy predicates can be used to configure the Filtering step of scheduling.
Scoring: In the second phase, the remaining Nodes are ranked. Again, various scoring plugins are used.
The default configuration tries to spread the workload as evenly as possible across the cluster, minimizing the impact of a node becoming unavailable.
Once these two steps are completed, the scheduler will assign the Pod to the highest-ranking Node, and the Kubelet on that node will spin up its containers.
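The two phases can be illustrated with a small Python sketch. This is a toy model and not how kube-scheduler is implemented: the `Node` and `Pod` classes, their fields, and the scoring rule (prefer the Node with the most CPU left over) are all assumptions made for illustration, and only three of the filters mentioned above are modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    labels: dict
    free_cpu_m: int   # allocatable CPU minus requests of running Pods (millicores)
    free_mem: int     # allocatable memory minus requests of running Pods (bytes)
    unschedulable: bool = False


@dataclass
class Pod:
    cpu_request_m: int
    mem_request: int
    node_selector: dict = field(default_factory=dict)


def filter_nodes(pod, nodes):
    """Phase 1 (filtering): keep only Nodes the Pod could run on."""
    return [
        n for n in nodes
        if not n.unschedulable
        and all(n.labels.get(k) == v for k, v in pod.node_selector.items())
        and n.free_cpu_m >= pod.cpu_request_m
        and n.free_mem >= pod.mem_request
    ]


def score(pod, node):
    """Phase 2 (scoring): toy rule that favours the emptiest Node."""
    return node.free_cpu_m - pod.cpu_request_m


def schedule(pod, nodes):
    """Return the chosen Node's name, or None if the Pod stays Pending."""
    candidates = filter_nodes(pod, nodes)
    if not candidates:
        return None
    return max(candidates, key=lambda n: score(pod, n)).name
```

Returning `None` corresponds to the "Pending" case: no Node survived the filtering phase, so there is nothing to score.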
As we can see, both the Filtering and Scoring phases of scheduling take "resources" into consideration, so let’s have a look at them next.
The two most important resources are CPU and Memory (RAM). Kubernetes tracks other resources as well (like disk space, available PIDs or network ports) but we’ll focus on these two.
Upon startup, the Kubelet determines how many resources are available on the system it runs on. This is called the node’s capacity. Next, it reserves a certain amount of CPU and Memory for itself and the system. What’s left is called the Node’s allocatable resources. The Kubelet will communicate this information back to the control plane.
If you are cluster-admin, you can view a Node’s resources using the
|For APPUiO clusters, the Kubelet reserves a total of 4Gi Memory and 400mCPU for itself and the system.|
During scheduling, this information is used to determine whether a Pod would "fit" onto a Node: the scheduler takes the Node’s allocatable resources and subtracts the requests of all Pods already running on the Node. If the remaining resources are greater than or equal to the requests of the Pod, it fits.
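The arithmetic behind this "fit" check is simple enough to write down. The following Python sketch uses a hypothetical Node with 4 CPUs and 16Gi of Memory together with the 400m / 4Gi reservation mentioned above. The helper names `allocatable` and `fits` are invented for this example.

```python
GI = 2**30  # one gibibyte in bytes


def allocatable(capacity, reserved):
    """Allocatable resources = capacity minus system/Kubelet reservation."""
    return {resource: capacity[resource] - reserved[resource] for resource in capacity}


def fits(pod_requests, node_allocatable, running_requests):
    """True if the Pod's requests fit into what is left on the Node."""
    for resource, total in node_allocatable.items():
        used = sum(p.get(resource, 0) for p in running_requests)
        if pod_requests.get(resource, 0) > total - used:
            return False
    return True


# Hypothetical Node: CPU in millicores, Memory in bytes.
node = allocatable({"cpu": 4000, "memory": 16 * GI}, {"cpu": 400, "memory": 4 * GI})
# Requests of the Pods already running on that Node.
running = [{"cpu": 1000, "memory": 4 * GI}, {"cpu": 2000, "memory": 6 * GI}]

print(fits({"cpu": 500, "memory": 1 * GI}, node, running))  # True
print(fits({"cpu": 500, "memory": 3 * GI}, node, running))  # False
```

Only requests enter this calculation. The actual usage of the running Pods and their limits play no role in whether a new Pod fits.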
Before we look into what happens when a node runs out of a resource, we first have to cover another concept: Quality of Service (QoS) classes.
Kubernetes knows three QoS classes: "Guaranteed", "Burstable" and "BestEffort".
When a Pod starts, its QoS class is determined based on the resource requests and limits of its containers:
Guaranteed is assigned when every container has both requests and limits set for both CPU and Memory, and, for each container, the requests and limits have the same values.
The Pod is guaranteed to have the resources it has requested available.
Burstable is assigned when a Pod does not qualify for the "Guaranteed" QoS class, but at least one container has a CPU or Memory request or limit set.
The Pod has its requested resources available, but may use more resources for a short period (aka burst).
BestEffort is assigned to Pods that have no requests or limits set at all.
The Pod may use resources available on a best effort basis.
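The rules for the three classes can be summarised in a few lines of Python. This is a sketch of the rules as described in this article, not the Kubelet's code: `qos_class` is a made-up helper, containers are plain dictionaries shaped like the `resources` field of a manifest, and values are compared as written (so `"1"` and `"1000m"` would count as different).

```python
def qos_class(containers):
    """Derive the QoS class from a list of container resource settings.

    Each container is a dict with optional "requests" and "limits" dicts,
    each of which may contain "cpu" and "memory".
    """
    anything_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            anything_set = True
        for resource in ("cpu", "memory"):
            # Guaranteed needs request and limit present and identical.
            if (
                resource not in requests
                or resource not in limits
                or requests[resource] != limits[resource]
            ):
                guaranteed = False
    if not anything_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


equal = {"requests": {"cpu": "1", "memory": "1Gi"}, "limits": {"cpu": "1", "memory": "1Gi"}}
burst = {"requests": {"cpu": "100m", "memory": "128Mi"}, "limits": {"cpu": "1", "memory": "1Gi"}}

print(qos_class([equal]))      # Guaranteed
print(qos_class([burst]))      # Burstable
print(qos_class([{}]))         # BestEffort
```

A single container without matching requests and limits is enough to drop the whole Pod out of the "Guaranteed" class.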
CPU is a so-called "compressible" resource. This means that when a container tries to exceed its CPU limit, it will simply be throttled. A container with a CPU limit of "100m" cannot use more than 0.1 seconds of CPU time each second.
Memory, on the other hand, is not "compressible", so when a container exceeds its memory limit, it will be terminated (and restarted of course).
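The CPU throttling described above can be expressed as simple arithmetic. On Linux, CPU limits are typically enforced through the CFS bandwidth controller, which grants a container a quota of CPU time per scheduling period. The sketch below assumes the common default period of 100 ms. The function names are invented for this example.

```python
CFS_PERIOD_US = 100_000  # assumed default period: 100 ms, in microseconds


def cpu_quota_us(limit_millicores, period_us=CFS_PERIOD_US):
    """CPU time (in microseconds) a container may use per period."""
    return limit_millicores * period_us // 1000


def cpu_seconds_per_second(limit_millicores):
    """CPU time (in seconds) a container may use per wall-clock second."""
    return limit_millicores / 1000


print(cpu_quota_us(100))            # 10000, i.e. 10 ms of CPU per 100 ms
print(cpu_seconds_per_second(100))  # 0.1
```

Once a container has used up its quota within a period, it has to wait for the next period before it gets CPU time again, which is what shows up as throttling.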
Again, since CPU is a "compressible" resource, the Kubelet does not act on CPU starvation. Each container will have the CPU resources available that it requested - yes, this means that "BestEffort" Pods really get into a tight spot…
Running out of memory, however, triggers an eviction. While evictions (and how they can be configured) would fill a whole blog post of their own, they usually end with Pods being terminated and moved to different nodes. This is where the QoS classes play an important role: they decide who gets killed.
First in line are pods that exceed their memory requests. They are ranked by their memory usage in relation to their memory requests. Since "BestEffort" pods do not have any requests at all, they will be killed first. However, "Burstable" Pods might also be killed if they exceed their requests.
Since "Guaranteed" pods cannot exceed their requests (because they are equal to their limits), they are never killed because of another pod’s resource usage.
However, in the rare case that system services on a node (not running in Kubernetes) use more resources than were reserved for them (see "resource reservations" in "Resources and scheduling"), even "Burstable" or "Guaranteed" pods will be killed.
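The ranking described above can be sketched in Python as follows. This is one plausible reading of "memory usage in relation to memory requests" (bytes used above the request) and ignores other factors a real Kubelet may take into account. The function `eviction_order` and the example Pods are hypothetical.

```python
MI = 2**20  # one mebibyte in bytes


def eviction_order(pods):
    """Order Pods from "killed first" to "killed last".

    pods: list of (name, memory_usage_bytes, memory_request_bytes) tuples.
    A Pod without a memory request is passed with a request of 0.
    """
    def sort_key(pod):
        _, usage, request = pod
        exceeds_request = usage > request
        # Pods above their request sort first; larger overshoot sorts earlier.
        return (not exceeds_request, -(usage - request))

    return [name for name, _, _ in sorted(pods, key=sort_key)]


pods = [
    ("guaranteed", 500 * MI, 512 * MI),  # stays within its request
    ("burstable", 300 * MI, 128 * MI),   # 172Mi above its request
    ("besteffort", 200 * MI, 0),         # no request, so all usage counts
]
print(eviction_order(pods))  # ['besteffort', 'burstable', 'guaranteed']
```

In this model, a Pod that stays within its memory request always ranks behind every Pod that exceeds its request, which is the protection the "Guaranteed" class relies on.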
You should now have a fairly good understanding of how scheduling works on Kubernetes. To conclude, we want to share a few best practices:
Use requests and limits extensively - it helps the scheduler to distribute your workload more evenly across your cluster.
Use QoS classes to your advantage, for example by making sure all production workloads are assigned a "Guaranteed" QoS class. This means that in an out-of-resource situation, your production workloads are the last in line to be killed.
For cluster administrators, there are some more points:
Plan AT LEAST one node worth of "resources" as "leftover". This allows your cluster to tolerate the loss of a node, both planned (during maintenance) and unplanned (node crashes).