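Resource Reservation vs Eviction vs OOM
Reposted from: https://www.kancloud.cn/pshizhsysu/kubernetes/2206258
I. Introduction to resource reservation
The kubelet has several flags that reserve resources for system processes, so that pods cannot exhaust the node's compute resources and leave the system itself unable to operate normally.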
--enforce-node-allocatable
--system-reserved
--system-reserved-cgroup
--kube-reserved
--kube-reserved-cgroup
--eviction-hard
Allocatable
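Kubernetes 1.6 introduced the Node Allocatable feature, which lets us control how much of each node's resources can be allocated to pods.
The diagram below is borrowed from the official documentation: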
          capacity
------------------------------
|       kube-reserved        |
------------------------------
|      system-reserved       |
------------------------------
|     eviction-threshold     |
------------------------------
|        allocatable         |
|    (available for pods)    |
------------------------------
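Capacity is the node's total capacity. Allocatable is computed as:
allocatable = capacity - kube_reserved - system_reserved - eviction_threshold
Once the kubelet has started, the node's allocatable value is fixed; it does not change as pods are created or destroyed.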
- allocatable vs requests vs limits
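In a pod's YAML file we can set requests and limits for the pod. Limits have little to do with allocatable, but requests are closely tied to it.
The sum of the requests of all pods scheduled onto a node cannot exceed that node's allocatable. The sum of limits has no upper bound.
For example, if a node's memory allocatable is 10Gi and three pods (each with requests.memory=3Gi) have already been scheduled onto it, a fourth such pod cannot be scheduled there, even if the node has more than 3Gi of free memory.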
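The rule in the example above can be sketched in a few lines of Python. This is only a toy model of the memory check, not the scheduler's actual code; the `Node` class and its method names are hypothetical and exist purely for illustration.

```python
from dataclasses import dataclass, field

GI = 1024 ** 3  # bytes in one Gi


@dataclass
class Node:
    """Toy node that tracks only memory allocatable and scheduled requests."""
    allocatable: int
    requests: list = field(default_factory=list)

    def fits(self, request):
        # Only requests are compared with allocatable; actual free memory
        # and pod limits play no role in this check.
        return sum(self.requests) + request <= self.allocatable

    def schedule(self, request):
        if not self.fits(request):
            return False
        self.requests.append(request)
        return True


node = Node(allocatable=10 * GI)
results = [node.schedule(3 * GI) for _ in range(4)]
print(results)  # [True, True, True, False]
```

The first three pods add up to 9Gi of requests, so the fourth 3Gi request no longer fits under the 10Gi allocatable.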
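Resource reservation: without the cgroup flags
Suppose we need to reserve some resources for the system. We can configure the following kubelet flags (here we do not set the corresponding cgroup flags):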
--enforce-node-allocatable=pods
--kube-reserved=memory=...
--system-reserved=memory=...
--eviction-hard=...
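As mentioned above, the sum of the requests of the pods on a node cannot exceed allocatable.
With the four flags above set, the total resources actually used by all pods on the node will not exceed capacity - kube_reserved - system_reserved.
We can verify this with an experiment.
1. Flag settings
The kubelet is started with the following flags: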
/usr/bin/kubelet --address=0.0.0.0 --allow-privileged=true --cluster-dns=10.254.0.10 --cluster-domain=kube.local --fail-swap-on=true --hostname-override=192.168.1.101 --kubeconfig=/etc/kubernetes/kubeconfig --pod-infra-container-image=10.142.232.115:8021/library/pause:latest --port=10250 --enforce-node-allocatable=pods --kube-reserved=memory=1Gi --system-reserved=memory=1Gi --cgroup-driver=cgroupfs --eviction-hard=memory.available<100Mi
2. Check capacity and allocatable
The node's capacity and allocatable values are as follows:
Capacity:
cpu: 2
memory: 4016436Ki (about 3.83Gi)
pods: 110
Allocatable:
cpu: 2
memory: 1816884Ki (about 1.73Gi)
pods: 110
We can compute allocatable ourselves, and it matches the value above exactly:
allocatable = capacity - kube_reserved - system_reserved - eviction_hard
1816884Ki = 4016436Ki - 1*1024*1024Ki - 1*1024*1024Ki - 100*1024Ki
We can check the node's total memory with the free command; it matches capacity:
$ free -k
total used free shared buff/cache available
Mem: 4016436 1224372 2234872 17100 557192 2453156
Swap: 0 0 0
3. Check the kubepods cgroup
Check the memory limit of the kubepods cgroup. This value is the upper bound on the memory that all pods on the node can use:
$ cat /sys/fs/cgroup/memory/kubepods/memory.limit_in_bytes
1965346816
1965346816 Bytes = 1919284Ki = allocatable + 100Mi
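From the calculation above, the upper bound on what pods can actually use on the node is:
kubepods/memory.limit_in_bytes = capacity - kube_reserved - system_reserved
Note: by this formula, the total memory available to all pods on a node is independent of eviction-hard.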
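The two formulas can be checked against the numbers from this experiment with a short Python sketch. The function name `node_limits` is made up for illustration; the inputs are the capacity and flag values shown above.

```python
KI = 1024
MI = 1024 ** 2
GI = 1024 ** 3


def node_limits(capacity, kube_reserved, system_reserved, eviction_hard):
    """Return (allocatable, kubepods_limit) in bytes for memory."""
    # The kubepods cgroup limit ignores the eviction threshold.
    kubepods_limit = capacity - kube_reserved - system_reserved
    # Allocatable additionally subtracts the hard eviction threshold.
    allocatable = kubepods_limit - eviction_hard
    return allocatable, kubepods_limit


allocatable, kubepods_limit = node_limits(
    capacity=4016436 * KI,
    kube_reserved=1 * GI,
    system_reserved=1 * GI,
    eviction_hard=100 * MI,
)
print(allocatable // KI)  # 1816884
print(kubepods_limit)     # 1965346816
```

Both results agree with the node's reported Allocatable (1816884Ki) and with the value read from memory.limit_in_bytes.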
4. Check free memory
Checking memory usage shows 2.3Gi of memory available:
$ free -h
total used free shared buff/cache available
Mem: 3.8G 1.2G 2.1G 16M 544M 2.3G
Swap: 0B 0B 0B
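5. Create pods
At this point 2.3Gi of memory is available, allocatable is 1.73Gi, and the kubepods limit is 1.83Gi.
We create a pod with a memory request of 0.1Gi and a limit of 20Gi that actually consumes 1Gi. In theory this pod should be created successfully, and in practice it was:
Note: the YAML file and the memory-consuming script are in the appendix of this article.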
$ kubectl get pod
NAME READY STATUS RESTARTS AGE
centos-659755bf78-jdlrc 1/1 Running 0 44s
Check the node's memory usage:
$ free -h
total used free shared buff/cache available
Mem: 3.8G 2.2G 1.1G 16M 546M 1.3G
Swap: 0B 0B 0B
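Now 1.3Gi of memory is available, the node has 1.63Gi of requests left, and the kubepods limit has 0.83Gi left.
We create another identical pod. We expect it to be scheduled successfully, but because it needs 1Gi of actual memory, more than the remaining 0.83Gi, it should hit OOM. The experiment confirms this: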
$ kubectl get pod
NAME READY STATUS RESTARTS AGE
centos-659755bf78-j8wjv 0/1 OOMKilled 0 5s
centos-659755bf78-jdlrc 1/1 Running 1 1m
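The outcome of step 5 follows from comparing actual usage, not requests, with the kubepods limit. The sketch below is a simplified model under that assumption; `fits_under_kubepods` is a hypothetical helper, and the real kernel OOM killer chooses its victim itself, so it is not necessarily the newest pod that gets killed (in the output above the first pod also shows one restart).

```python
GI = 1024 ** 3
KUBEPODS_LIMIT = 1965346816  # value of kubepods/memory.limit_in_bytes above


def fits_under_kubepods(running_usage, new_usage, limit=KUBEPODS_LIMIT):
    """True if total actual pod memory stays within the kubepods limit."""
    return sum(running_usage) + new_usage <= limit


first_pod_ok = fits_under_kubepods([], 1 * GI)
second_pod_ok = fits_under_kubepods([1 * GI], 1 * GI)
print(first_pod_ok)   # True
print(second_pod_ok)  # False
```

Two pods of 1Gi each would need 2147483648 bytes, which is above the 1965346816-byte limit, so an OOM kill inside the kubepods cgroup is expected even though both pods passed the requests check.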
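Resource reservation: with the corresponding cgroups set
If the --system-reserved-cgroup and --kube-reserved-cgroup flags are also set (and kube-reserved and system-reserved are listed in --enforce-node-allocatable, as in the summary), the upper bound on what pods can actually use does not change (the kubepods memory.limit_in_bytes stays the same), but system processes and kube processes also become subject to resource limits. If system processes exceed their reservation, they are killed by the cgroup.
If these two flags are not set, system processes can use more than the reserved amount.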
Configuration recommendations
Setting the following four kubelet flags is enough:
--enforce-node-allocatable=pods
--kube-reserved=cpu=xx,memory=xx,ephemeral-storage=xx
--system-reserved=cpu=xx,memory=xx,ephemeral-storage=xx
--eviction-hard=memory.available<10%,nodefs.available<10%
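In general we do not want resource utilization to exceed 70%, so kube-reserved, system-reserved, and eviction-hard should each be set to 10%. Because kube-reserved and system-reserved do not accept percentages, they have to be given as absolute values.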
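Since the two reservations must be absolute values, the 10% has to be converted by hand. The following Python sketch shows one way to do the conversion for memory only; the helper name `reservation_flags` and the choice to round down to whole Mi are assumptions made for illustration.

```python
def reservation_flags(capacity_ki, percent=10):
    """Build memory-only kubelet flags that reserve `percent` of capacity each."""
    # Integer arithmetic: take the percentage in Ki, then round down to Mi.
    reserved_mi = capacity_ki * percent // 100 // 1024
    return [
        "--enforce-node-allocatable=pods",
        f"--kube-reserved=memory={reserved_mi}Mi",
        f"--system-reserved=memory={reserved_mi}Mi",
        f"--eviction-hard=memory.available<{percent}%",
    ]


flags = reservation_flags(4016436)  # capacity of the test node, in Ki
print("\n".join(flags))
```

For the 4016436Ki node used earlier this yields 392Mi for each reservation. CPU and ephemeral-storage values would need the same kind of conversion from the node's own capacity.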
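Summary
- A node's allocatable is fixed once the kubelet has started; it does not change as pods are created or deleted.
- When pods set resources.requests, the sum of resources.requests of the pods scheduled onto a node never exceeds the node's allocatable. The sum of the pods' resources.limits, however, can exceed it.
- Whether a pod can be scheduled onto a node depends on whether the pod's resources.requests fits within the node's remaining requests (allocatable minus the requests already scheduled), not on how much of the node's resources is actually free. Even if the free resources are less than the pod's requests, the pod can still be scheduled onto that node.
- When a pod's actual memory usage exceeds its limits, docker (in fact the cgroup) kills the offending process in the pod (OOM). When CPU usage reaches the limit, the process is not killed; it is throttled and has to wait for CPU.
- With only the following four flags set, resources are effectively reserved for the system: the pods' actual usage cannot exceed capacity - kube_reserved - system_reserved, because that is the value of memory.limit_in_bytes in the kubepods cgroup. Even if the system does not use up its reserved share, pods cannot use it. If the system exceeds its reserved share, system processes can take resources from the pods' share, i.e. there is no limit on what the system uses.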
--enforce-node-allocatable=pods
--kube-reserved=memory=...
--system-reserved=memory=...
--eviction-hard=...
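- If, in addition to the four flags above, the corresponding cgroups are also set (as below), then not only is the pods' usage capped by the kubepods limit, the system's usage is also capped by its reservation. When system processes exceed the reservation, they too are killed by the cgroup. For that reason the previous four-flag setup is the recommended one.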
--enforce-node-allocatable=pods,kube-reserved,system-reserved
--kube-reserved=memory=...
--kube-reserved-cgroup=...
--system-reserved=memory=...
--system-reserved-cgroup=...
--eviction-hard=...
- allocatable and the kubepods limit are not the same value; they differ by eviction_hard:
allocatable = capacity - kube_reserved - system_reserved - eviction_hard
kubepods.limit = capacity - kube_reserved - system_reserved
Appendix
- centos.yaml
apiVersion: apps/v1beta2
kind: Deployment
metadata:
  name: centos
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      name: centos
  template:
    metadata:
      labels:
        name: centos
    spec:
      volumes:
      - name: volume1
        hostPath:
          path: /home/docker/yaml/mem.py
      containers:
      - name: centos
        image: 10.142.232.115:8021/library/centos:centos7
        command:
        - python
        - /mem.py
        - 1GB
        volumeMounts:
        - mountPath: /mem.py
          name: volume1
        resources:
          requests:
            memory: 0.1Gi
          limits:
            memory: 20Gi
- mem.py
import re
import sys
import time


def print_help():
    print('Usage: ')
    print('  python mem.py 100MB')
    print('  python mem.py 1GB')


if __name__ == "__main__":
    if len(sys.argv) == 2:
        # Accept e.g. 100MB or 1GB (case-insensitive); at least one digit is required.
        pattern = re.compile(r'^(\d+)([MG]B)$')
        match = pattern.match(sys.argv[1].upper())
        if match:
            num = int(match.group(1))
            unit = match.group(2)
            if unit == 'MB':
                s = ' ' * (num * 1024 * 1024)
            else:
                s = ' ' * (num * 1024 * 1024 * 1024)
            # Keep the string alive so the memory stays in use.
            time.sleep(10000)
        else:
            print_help()
    else:
        print_help()
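II. imagefs and nodefs
The kubelet can manage disk usage, but only for two partitions, nodefs and imagefs:
- imagefs: the partition holding the docker installation directory
- nodefs: the partition holding the directory given by the kubelet flag --root-dir (default /var/lib/kubelet)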
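To see how full the partition behind a given directory is, a rough check can be done with Python's standard library. This is only an approximation of what `df` shows and is not how the kubelet itself gathers its statistics; the helper names are hypothetical, and the two paths are the ones used on the test node in this article.

```python
import os
import shutil


def nearest_existing(path):
    """Walk up from path until an existing directory is found."""
    path = os.path.abspath(path)
    while not os.path.exists(path):
        path = os.path.dirname(path)
    return path


def partition_usage(path):
    """Return (used_percent, available_percent) of the filesystem holding path."""
    usage = shutil.disk_usage(nearest_existing(path))
    used_pct = 100.0 * usage.used / usage.total
    avail_pct = 100.0 * usage.free / usage.total
    return used_pct, avail_pct


for label, path in (("nodefs", "/var/lib/kubelet"), ("imagefs", "/app/docker")):
    used, avail = partition_usage(path)
    print(f"{label}: {path} used={used:.0f}% available={avail:.0f}%")
```

If a directory does not exist on the machine where the sketch runs, the nearest existing parent is measured instead, which still identifies the partition the directory would live on.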
Next, let's verify this understanding of imagefs and nodefs.
Prerequisites
The k8s cluster runs version 1.8.6:
$ kubectl get node
NAME STATUS ROLES AGE VERSION
10.142.232.161 Ready <none> 263d v1.8.6
10.142.232.162 NotReady <none> 263d v1.8.6
10.142.232.163 Ready,SchedulingDisabled <none> 227d v1.8.6
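On 10.142.232.161, docker is installed under /app/docker, and the kubelet's --root-dir is not set, so the default /var/lib/kubelet is used. /app is a separate disk at 70% usage; / is a separate disk at 57% usage. The imagefs and nodefs thresholds are both currently set to 80% usage (available<20%), as shown below: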
$ df -hT
Filesystem Type Size Used Avail Use% Mounted on
devtmpfs devtmpfs 16G 0 16G 0% /dev
tmpfs tmpfs 16G 0 16G 0% /dev/shm
tmpfs tmpfs 16G 1.7G 15G 11% /run
tmpfs tmpfs 16G 0 16G 0% /sys/fs/cgroup
/dev/mapper/centos-root xfs 45G 26G 20G 57% /
/dev/xvda1 xfs 497M 254M 243M 52% /boot
/dev/xvde xfs 150G 105G 46G 70% /app
$ ps -ef | grep kubelet
root 125179 1 37 17:50 ? 00:00:01 /usr/bin/kubelet --address=0.0.0.0 --allow-privileged=true --cluster-dns=10.254.0.10 --cluster-domain=kube.local --fail-swap-on=false --hostname-override=10.142.232.161 --kubeconfig=/etc/kubernetes/kubeconfig --pod-infra-container-image=10.142.233.76:8021/library/pause:latest --port=10250 --enforce-node-allocatable=pods --eviction-hard=memory.available<20%,nodefs.inodesFree<20%,imagefs.inodesFree<20%,nodefs.available<20%,imagefs.available<20% --network-plugin=cni
At this point node 10.142.232.161 reports no disk errors:
$ kubectl describe node 10.142.232.161
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 18s kubelet, 10.142.232.161 Starting kubelet.
Normal NodeAllocatableEnforced 18s kubelet, 10.142.232.161 Updated Node Allocatable limit across pods
Normal NodeHasSufficientDisk 18s kubelet, 10.142.232.161 Node 10.142.232.161 status is now: NodeHasSufficientDisk
Normal NodeHasSufficientMemory 18s kubelet, 10.142.232.161 Node 10.142.232.161 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 18s kubelet, 10.142.232.161 Node 10.142.232.161 status is now: NodeHasNoDiskPressure
Normal NodeNotReady 18s kubelet, 10.142.232.161 Node 10.142.232.161 status is now: NodeNotReady
Normal NodeReady 8s kubelet, 10.142.232.161 Node 10.142.232.161 status is now: NodeReady
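Verification plan
- Verify that imagefs is the partition holding /app/docker (the /app partition is at 70% usage)
  - Set the imagefs threshold to 60%; the node should report imagefs over the threshold
  - Set the imagefs threshold to 80%; the node should be healthy
- Verify that nodefs is the partition holding /var/lib/kubelet (the / partition is at 57% usage)
  - Set the nodefs threshold to 50%; the node should report nodefs over the threshold
  - Set the nodefs threshold to 60%; the node should be healthy
- Change the kubelet flag --root-dir to /app/kubelet
  - Set the imagefs threshold to 80% and the nodefs threshold to 60%; nodefs should be reported over the threshold
  - Set the imagefs threshold to 60% and the nodefs threshold to 80%; imagefs should be reported over the threshold
  - Set the imagefs threshold to 60% and the nodefs threshold to 60%; both should be reported over the threshold
  - Set the imagefs threshold to 80% and the nodefs threshold to 80%; the node should be healthy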
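The plan talks about usage thresholds (60%, 80%), while --eviction-hard is written in terms of available space (available<40%, available<20%). The Python sketch below makes that mapping explicit and predicts which signals should fire. It is a simplified model that handles percentage thresholds only and treats the `df` usage figure as the input; the function names are hypothetical.

```python
def parse_eviction_hard(value):
    """Parse 'signal<NN%,...' into {signal: minimum available percent}."""
    thresholds = {}
    for item in value.split(","):
        signal, _, quantity = item.partition("<")
        thresholds[signal] = float(quantity.rstrip("%"))
    return thresholds


def signals_over_threshold(eviction_hard, used_percent):
    """Return the signals whose available share is below the configured minimum."""
    thresholds = parse_eviction_hard(eviction_hard)
    return sorted(
        signal
        for signal, min_available in thresholds.items()
        if signal in used_percent and 100 - used_percent[signal] < min_available
    )


# / is 57% used (nodefs) and /app is 70% used (imagefs) on the test node.
used = {"nodefs.available": 57, "imagefs.available": 70}
print(signals_over_threshold("nodefs.available<20%,imagefs.available<40%", used))
# ['imagefs.available']
```

With imagefs at 70% usage only 30% is available, which is below the 40% minimum, so a 60% usage threshold is expected to trigger; nodefs still has 43% available and stays above its 20% minimum.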
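Verification steps
1. Verify that imagefs is the partition holding /app/docker
1.1 Set the imagefs threshold to 60%; the node should report imagefs over the threshold
As shown below, we set the imagefs threshold to 60% (imagefs.available<40%):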
$ ps -ef | grep kubelet
root 41234 1 72 18:17 ? 00:00:02 /usr/bin/kubelet --address=0.0.0.0 --allow-privileged=true --cluster-dns=10.254.0.10 --cluster-domain=kube.local --fail-swap-on=false --hostname-override=10.142.232.161 --kubeconfig=/etc/kubernetes/kubeconfig --pod-infra-container-image=10.142.233.76:8021/library/pause:latest --port=10250 --enforce-node-allocatable=pods --eviction-hard=memory.available<20%,nodefs.inodesFree<20%,imagefs.inodesFree<20%,nodefs.available<20%,imagefs.available<40% --network-plugin=cni
Then we check the node's status. It reports Attempting to reclaim imagefs, meaning the kubelet is trying to reclaim space on imagefs:
$ kubectl describe node 10.142.232.161
...
Normal NodeAllocatableEnforced 1m kubelet, 10.142.232.161 Updated Node Allocatable limit across pods
Normal Starting 1m kubelet, 10.142.232.161 Starting kubelet.
Normal NodeHasSufficientDisk 1m kubelet, 10.142.232.161 Node 10.142.232.161 status is now: NodeHasSufficientDisk
Normal NodeHasSufficientMemory 1m kubelet, 10.142.232.161 Node 10.142.232.161 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 1m kubelet, 10.142.232.161 Node 10.142.232.161 status is now: NodeHasNoDiskPressure
Normal NodeNotReady 1m kubelet, 10.142.232.161 Node 10.142.232.161 status is now: NodeNotReady
Normal NodeHasDiskPressure 1m kubelet, 10.142.232.161 Node 10.142.232.161 status is now: NodeHasDiskPressure
Normal NodeReady 1m kubelet, 10.142.232.161 Node 10.142.232.161 status is now: NodeReady
Warning EvictionThresholdMet 18s (x4 over 1m) kubelet, 10.142.232.161 Attempting to reclaim imagefs
1.2 Set the imagefs threshold to 80%; the node should be healthy
We set the imagefs threshold to 80% (imagefs.available<20%):
$ ps -ef | grep kubelet
root 51402 1 19 18:24 ? 00:00:06 /usr/bin/kubelet --address=0.0.0.0 --allow-privileged=true --cluster-dns=10.254.0.10 --cluster-domain=kube.local --fail-swap-on=false --hostname-override=10.142.232.161 --kubeconfig=/etc/kubernetes/kubeconfig --pod-infra-container-image=10.142.233.76:8021/library/pause:latest --port=10250 --enforce-node-allocatable=pods --eviction-hard=memory.available<20%,nodefs.inodesFree<20%,imagefs.inodesFree<20%,nodefs.available<20%,imagefs.available<20% --network-plugin=cni
Then we check the node's status again. NodeHasNoDiskPressure shows that imagefs usage no longer exceeds the threshold:
$ kubectl describe node 10.142.232.161
...
Warning EvictionThresholdMet 6m (x22 over 11m) kubelet, 10.142.232.161 Attempting to reclaim imagefs
Normal Starting 5m kubelet, 10.142.232.161 Starting kubelet.
Normal NodeAllocatableEnforced 5m kubelet, 10.142.232.161 Updated Node Allocatable limit across pods
Normal NodeHasSufficientDisk 5m kubelet, 10.142.232.161 Node 10.142.232.161 status is now: NodeHasSufficientDisk
Normal NodeHasSufficientMemory 5m kubelet, 10.142.232.161 Node 10.142.232.161 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 5m (x2 over 5m) kubelet, 10.142.232.161 Node 10.142.232.161 status is now: NodeHasNoDiskPressure
Normal NodeNotReady 5m kubelet, 10.142.232.161 Node 10.142.232.161 status is now: NodeNotReady
Normal NodeReady 4m kubelet, 10.142.232.161 Node 10.142.232.161 status is now: NodeReady
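2. Verify that nodefs is the partition holding /var/lib/kubelet (the / partition is at 57% usage)
2.1 Set the nodefs threshold to 50%; the node should report nodefs over the threshold
Set the nodefs threshold to 50% (nodefs.available<50%):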
$ ps -ef | grep kubelet
root 72575 1 59 18:35 ? 00:00:04 /usr/bin/kubelet --address=0.0.0.0 --allow-privileged=true --cluster-dns=10.254.0.10 --cluster-domain=kube.local --fail-swap-on=false --hostname-override=10.142.232.161 --kubeconfig=/etc/kubernetes/kubeconfig --pod-infra-container-image=10.142.233.76:8021/library/pause:latest --port=10250 --enforce-node-allocatable=pods --eviction-hard=memory.available<20%,nodefs.inodesFree<20%,imagefs.inodesFree<20%,nodefs.available<50%,imagefs.available<20% --network-plugin=cni
Checking the node's status shows Attempting to reclaim nodefs, meaning the kubelet is trying to reclaim space on nodefs, i.e. nodefs is over the threshold:
$ kubectl describe node 10.142.232.161
...
Normal Starting 1m kubelet, 10.142.232.161 Starting kubelet.
Normal NodeAllocatableEnforced 1m kubelet, 10.142.232.161 Updated Node Allocatable limit across pods
Normal NodeHasSufficientDisk 1m kubelet, 10.142.232.161 Node 10.142.232.161 status is now: NodeHasSufficientDisk
Normal NodeHasSufficientMemory 1m kubelet, 10.142.232.161 Node 10.142.232.161 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 1m kubelet, 10.142.232.161 Node 10.142.232.161 status is now: NodeHasNoDiskPressure
Normal NodeNotReady 1m kubelet, 10.142.232.161 Node 10.142.232.161 status is now: NodeNotReady
Normal NodeHasDiskPressure 53s kubelet, 10.142.232.161 Node 10.142.232.161 status is now: NodeHasDiskPressure
Normal NodeReady 53s kubelet, 10.142.232.161 Node 10.142.232.161 status is now: NodeReady
Warning EvictionThresholdMet 2s (x5 over 1m) kubelet, 10.142.232.161 Attempting to reclaim nodefs
2.2 Set the nodefs threshold to 60%; the node should be healthy
Set the nodefs threshold to 60% (nodefs.available<40%):
$ ps -ef | grep kubelet
root 78664 1 31 18:38 ? 00:00:02 /usr/bin/kubelet --address=0.0.0.0 --allow-privileged=true --cluster-dns=10.254.0.10 --cluster-domain=kube.local --fail-swap-on=false --hostname-override=10.142.232.161 --kubeconfig=/etc/kubernetes/kubeconfig --pod-infra-container-image=10.142.233.76:8021/library/pause:latest --port=10250 --enforce-node-allocatable=pods --eviction-hard=memory.available<20%,nodefs.inodesFree<20%,imagefs.inodesFree<20%,nodefs.available<40%,imagefs.available<20% --network-plugin=cni
The node's status is now back to normal:
$ kubectl describe node 10.142.232.161
...
Normal Starting 2m kubelet, 10.142.232.161 Starting kubelet.
Normal NodeReady 1m kubelet, 10.142.232.161 Node 10.142.232.161 status is now: NodeReady
3. Change the kubelet flag --root-dir to /app/kubelet
The default values of the following flags all depend on /var/lib/kubelet:
--root-dir # default: /var/lib/kubelet
--seccomp-profile-root # default: /var/lib/kubelet/seccomp
--cert-dir # default: /var/lib/kubelet/pki
--kubeconfig # default: /var/lib/kubelet/kubeconfig
To stop using the /var/lib/kubelet directory altogether, all four flags must be set explicitly, as follows:
--root-dir=/app/kubelet
--seccomp-profile-root=/app/kubelet/seccomp
--cert-dir=/app/kubelet/pki
--kubeconfig=/etc/kubernetes/kubeconfig
3.1 Set the imagefs threshold to 80% and the nodefs threshold to 60%; nodefs should be reported over the threshold
$ ps -ef | grep kubelet
root 14423 1 10 19:28 ? 00:00:34 /usr/bin/kubelet --address=0.0.0.0 --allow-privileged=true --cluster-dns=10.254.0.10 --cluster-domain=kube.local --fail-swap-on=false --hostname-override=10.142.232.161 --kubeconfig=/etc/kubernetes/kubeconfig --pod-infra-container-image=10.142.233.76:8021/library/pause:latest --port=10250 --enforce-node-allocatable=pods --eviction-hard=memory.available<20%,nodefs.inodesFree<20%,imagefs.inodesFree<20%,nodefs.available<40%,imagefs.available<20% --root-dir=/app/kubelet --seccomp-profile-root=/app/kubelet/seccomp --cert-dir=/app/kubelet/pki --network-plugin=cni
The node's status reports only Attempting to reclaim nodefs, i.e. nodefs is over the threshold:
$ kubectl describe node 10.142.232.161
...
Normal NodeHasDiskPressure 3m kubelet, 10.142.232.161 Node 10.142.232.161 status is now: NodeHasDiskPressure
Normal NodeReady 3m kubelet, 10.142.232.161 Node 10.142.232.161 status is now: NodeReady
Normal Starting 3m kube-proxy, 10.142.232.161 Starting kube-proxy.
Warning EvictionThresholdMet 27s (x15 over 3m) kubelet, 10.142.232.161 Attempting to reclaim nodefs
3.2 Set the imagefs threshold to 60% and the nodefs threshold to 80%; imagefs should be reported over the threshold
$ ps -ef |grep kubelet
root 21381 1 30 19:36 ? 00:00:02 /usr/bin/kubelet --address=0.0.0.0 --allow-privileged=true --cluster-dns=10.254.0.10 --cluster-domain=kube.local --fail-swap-on=false --hostname-override=10.142.232.161 --kubeconfig=/etc/kubernetes/kubeconfig --pod-infra-container-image=10.142.233.76:8021/library/pause:latest --port=10250 --enforce-node-allocatable=pods --eviction-hard=memory.available<20%,nodefs.inodesFree<20%,imagefs.inodesFree<20%,nodefs.available<20%,imagefs.available<40% --root-dir=/app/kubelet --seccomp-profile-root=/app/kubelet/seccomp --cert-dir=/app/kubelet/pki --network-plugin=cni
The node's status reports only imagefs over the threshold:
$ kubectl describe node 10.142.232.161
...
Normal Starting 1m kubelet, 10.142.232.161 Starting kubelet.
Normal NodeAllocatableEnforced 1m kubelet, 10.142.232.161 Updated Node Allocatable limit across pods
Normal NodeHasSufficientDisk 1m kubelet, 10.142.232.161 Node 10.142.232.161 status is now: NodeHasSufficientDisk
Normal NodeNotReady 1m kubelet, 10.142.232.161 Node 10.142.232.161 status is now: NodeNotReady
Normal NodeHasNoDiskPressure 1m (x2 over 1m) kubelet, 10.142.232.161 Node 10.142.232.161 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientMemory 1m kubelet, 10.142.232.161 Node 10.142.232.161 status is now: NodeHasSufficientMemory
Normal NodeReady 1m kubelet, 10.142.232.161 Node 10.142.232.161 status is now: NodeReady
Normal NodeHasDiskPressure 1m kubelet, 10.142.232.161 Node 10.142.232.161 status is now: NodeHasDiskPressure
Warning EvictionThresholdMet 11s (x5 over 1m) kubelet, 10.142.232.161 Attempting to reclaim imagefs
3.3 Set the imagefs threshold to 60% and the nodefs threshold to 60%; both should be reported over the threshold
$ ps -ef | grep kubelet
root 24524 1 33 19:39 ? 00:00:01 /usr/bin/kubelet --address=0.0.0.0 --allow-privileged=true --cluster-dns=10.254.0.10 --cluster-domain=kube.local --fail-swap-on=false --hostname-override=10.142.232.161 --kubeconfig=/etc/kubernetes/kubeconfig --pod-infra-container-image=10.142.233.76:8021/library/pause:latest --port=10250 --enforce-node-allocatable=pods --eviction-hard=memory.available<20%,nodefs.inodesFree<20%,imagefs.inodesFree<20%,nodefs.available<40%,imagefs.available<40% --root-dir=/app/kubelet --seccomp-profile-root=/app/kubelet/seccomp --cert-dir=/app/kubelet/pki --network-plugin=cni
The node's status shows that imagefs and nodefs are indeed both over the threshold:
$ kubectl describe node 10.142.232.161
...
Normal Starting 1m kubelet, 10.142.232.161 Starting kubelet.
Normal NodeAllocatableEnforced 1m kubelet, 10.142.232.161 Updated Node Allocatable limit across pods
Normal NodeHasSufficientDisk 1m kubelet, 10.142.232.161 Node 10.142.232.161 status is now: NodeHasSufficientDisk
Normal NodeHasSufficientMemory 1m kubelet, 10.142.232.161 Node 10.142.232.161 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 1m (x2 over 1m) kubelet, 10.142.232.161 Node 10.142.232.161 status is now: NodeHasNoDiskPressure
Normal NodeNotReady 1m kubelet, 10.142.232.161 Node 10.142.232.161 status is now: NodeNotReady
Normal NodeHasDiskPressure 1m kubelet, 10.142.232.161 Node 10.142.232.161 status is now: NodeHasDiskPressure
Normal NodeReady 1m kubelet, 10.142.232.161 Node 10.142.232.161 status is now: NodeReady
Warning EvictionThresholdMet 14s kubelet, 10.142.232.161 Attempting to reclaim imagefs
Warning EvictionThresholdMet 4s (x8 over 1m) kubelet, 10.142.232.161 Attempting to reclaim nodefs
3.4 Set the imagefs threshold to 80% and the nodefs threshold to 80%; the node should be healthy
$ ps -ef | grep kubelet
root 27869 1 30 19:43 ? 00:00:01 /usr/bin/kubelet --address=0.0.0.0 --allow-privileged=true --cluster-dns=10.254.0.10 --cluster-domain=kube.local --fail-swap-on=false --hostname-override=10.142.232.161 --kubeconfig=/etc/kubernetes/kubeconfig --pod-infra-container-image=10.142.233.76:8021/library/pause:latest --port=10250 --enforce-node-allocatable=pods --eviction-hard=memory.available<20%,nodefs.inodesFree<20%,imagefs.inodesFree<20%,nodefs.available<20%,imagefs.available<20% --root-dir=/app/kubelet --seccomp-profile-root=/app/kubelet/seccomp --cert-dir=/app/kubelet/pki --network-plugin=cni
The node's status indeed no longer reports errors for imagefs or nodefs:
$ kubectl describe node 10.142.232.161
...
Normal Starting 1m kubelet, 10.142.232.161 Starting kubelet.
Normal NodeHasSufficientDisk 1m kubelet, 10.142.232.161 Node 10.142.232.161 status is now: NodeHasSufficientDisk
Normal NodeHasSufficientMemory 1m kubelet, 10.142.232.161 Node 10.142.232.161 status is now: NodeHasSufficientMemory
Normal NodeNotReady 1m kubelet, 10.142.232.161 Node 10.142.232.161 status is now: NodeNotReady
Normal NodeAllocatableEnforced 1m kubelet, 10.142.232.161 Updated Node Allocatable limit across pods
Normal NodeReady 1m kubelet, 10.142.232.161 Node 10.142.232.161 status is now: NodeReady
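Summary
1. nodefs is the partition holding the --root-dir directory; imagefs is the partition holding the docker installation directory.
2. It is recommended that nodefs and imagefs share one partition, but that partition should be made fairly large.
3. When nodefs and imagefs share one partition, the kubelet's other flags --root-dir, --cert-dir ...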
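III. Resource reservation vs eviction vs OOM
Three concepts need to be kept apart: resource reservation, eviction, and OOM.
- Resource reservation: affects the node's Allocatable value.
- Eviction: when the kubelet evicts pods, it decides based only on the --eviction-hard flag (for the supported signals, see the referenced article); system-reserved and the other reservation flags play no part.
- OOM: when a process's memory exceeds its limit, the process is killed by docker (the cgroup). A container can hit OOM in two ways: the memory used by the container exceeds its own limit, or the total memory used by all pods exceeds /sys/fs/cgroup/memory/kubepods/memory.limit_in_bytes.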
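The distinction can be summed up as a small decision function. This is a deliberately simplified model of the three cases described above, not the behavior of the kernel or the kubelet; the function name, its parameters, and the order of the checks are all assumptions made for illustration.

```python
GI = 1024 ** 3
MI = 1024 ** 2


def memory_outcome(container_wants, container_limit, pods_want, kubepods_limit,
                   node_available, eviction_threshold):
    """Classify what is expected to happen for a given memory demand (bytes)."""
    if container_wants > container_limit:
        return "OOM: container exceeded its own limit"
    if pods_want > kubepods_limit:
        return "OOM: pods exceeded kubepods/memory.limit_in_bytes"
    if node_available < eviction_threshold:
        return "eviction: memory.available below --eviction-hard"
    return "ok"


# Second pod from the experiment in part I: two pods of 1Gi each, limit 20Gi.
outcome = memory_outcome(
    container_wants=1 * GI,
    container_limit=20 * GI,
    pods_want=2 * GI,
    kubepods_limit=1965346816,
    node_available=int(1.3 * GI),
    eviction_threshold=100 * MI,
)
print(outcome)  # OOM: pods exceeded kubepods/memory.limit_in_bytes
```

Resource reservation does not appear as a separate branch: it only determines the value of `kubepods_limit` (and of allocatable, which matters at scheduling time).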