Cgroup Principles and Usage

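1. What is Cgroup

cgroups (short for control groups) is a Linux kernel feature that limits, accounts for, and isolates the resource usage (CPU, memory, disk I/O, network, and so on) of a group of processes.

Resource limiting: a cgroup can be configured to cap how much of a particular resource (for example memory or CPU) its processes may use.

Prioritization: when processes contend for a resource, you can control how much of it (CPU, disk, or network) the processes in one cgroup receive relative to those in another cgroup.

Accounting: resource usage is monitored and reported at the cgroup level.

Control: a single command can change the state (frozen, stopped, or restarted) of every process in a cgroup.

The cgroups implementation rests on four core concepts: subsystem, control group, hierarchy, and task.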

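Control group (cgroup)
An association between a set of processes and a set of subsystems with their parameters. For example, if a process uses the CPU subsystem to limit its CPU time, the association between that process and the CPU subsystem is a control group.

Hierarchy
A set of control groups arranged as a tree. The tree gives control groups parent-child relationships, and a child control group inherits the attributes of its parent by default. For example, suppose control group c1 limits CPU usage to 1 core, and another control group c2 needs the same 1-core CPU limit plus a 2G memory limit. c2 can be created as a child of c1 and inherit the CPU limit instead of defining it again.

Subsystem

A kernel component, also called a controller, that schedules and controls one class of resource. For example, the memory subsystem limits memory usage, and the CPU subsystem limits CPU time. Subsystems are what actually enforce the limits on each class of resource.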

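The cgroup subsystems are visible under /sys/fs/cgroup/:

cpu: uses the scheduler to control tasks' access to the CPU
cpuacct: automatically generates reports on the CPU resources used by the tasks in a cgroup
cpuset: assigns dedicated CPUs and memory nodes to the tasks in a cgroup
blkio: sets input/output limits on block devices, such as physical drives
devices: allows or denies access to devices by the tasks in a cgroup
freezer: suspends or resumes the tasks in a cgroup
pids: limits the number of tasks
memory: sets limits on the memory used by the tasks in a cgroup and automatically generates reports on their memory usage
perf_event: allows the tasks in a cgroup to be performance-monitored as a unit
net_cls: tags network packets with a class identifier so that the Linux traffic controller can identify packets originating from a particular cgroup; Docker does not use it directly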

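Task

In cgroups, a task is a process. A task can be a member of several cgroups, provided those cgroups are in different hierarchies. A child process automatically becomes a member of its parent's cgroup and can be moved to a different cgroup as needed.

In essence, a cgroup controls how much of a key resource (CPU, memory, network, or disk I/O) a process or a group of processes can access or use. A container usually runs several processes that need to be controlled as a unit, which makes cgroups a key building block of containers. Kubernetes uses cgroups to enforce resource requests and limits, and the corresponding QoS classes, at the pod level.

The figure below illustrates how, after a given share of the available system resources has been assigned to one cgroup (cgroup-1 in this example), the remaining resources are distributed among the other cgroups (and individual processes) on the system.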

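1.1 Subsystem interfaces and parameters

1.1.1 cpu subsystem: limits the CPU utilization of processes

cpu.shares: relative CPU weight. An integer value that adjusts the share of CPU time a cgroup receives. For example, given two cgroups (call them CPU1 and CPU2) with cpu.shares set to 100 for CPU1 and 200 for CPU2, CPU2 receives twice as much CPU time as CPU1 when both are competing for the CPU. The value of cpu.shares must be 2 or higher.

cpu.cfs_period_us: the length of the CPU accounting period, in microseconds. The maximum is 1 second and the minimum is 1000 microseconds. On a single-CPU system, to let the tasks in a cgroup use 0.2 seconds of CPU time in every 1-second period, set cpu.cfs_quota_us to 200000 and cpu.cfs_period_us to 1000000.

cpu.cfs_quota_us: the maximum CPU time, in microseconds, available within one period (the value of cpu.cfs_period_us). cpu.cfs_quota_us may be larger than cpu.cfs_period_us. For example, on a dual-CPU system, to let the processes in a cgroup fully use both CPUs, set cpu.cfs_quota_us to 200000 and cpu.cfs_period_us to 100000.

Setting cpu.cfs_quota_us to -1 means no limit; this is also the default.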
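
The quota arithmetic described above can be sketched as a small helper. This is only an illustration: the function name quota_for_cpus is made up here, and the 100000-microsecond default period is an assumption that should be checked against cpu.cfs_period_us on the target system.

```shell
#!/bin/bash
# quota_for_cpus CPUS [PERIOD_US]   (hypothetical helper, not part of any tool)
# Prints the cpu.cfs_quota_us value that allows CPUS worth of CPU time
# per period. PERIOD_US defaults to 100000 (assumed default period).
quota_for_cpus() {
    local cpus=$1
    local period=${2:-100000}
    # awk handles fractional CPU counts such as 0.5; +0.5 rounds to nearest
    awk -v c="$cpus" -v p="$period" 'BEGIN { printf "%d\n", c * p + 0.5 }'
}

quota_for_cpus 0.2 1000000   # 0.2 s of CPU time in every 1 s period
quota_for_cpus 2 100000      # two full CPUs with a 100 ms period
```

Both calls print 200000, matching the two examples in the parameter descriptions.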

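1.1.2 cpuacct subsystem: reports the CPU usage of each cgroup

cpuacct.stat: CPU time spent in user mode and in kernel mode by all tasks in the cgroup

cpuacct.usage: total CPU time used by all tasks in the cgroup (nanoseconds)

cpuacct.usage_percpu: CPU time used by all tasks in the cgroup on each CPU (nanoseconds)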

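1.1.3 cpuset subsystem: assigns specific CPUs and memory nodes to a group of processes

cpuset.cpus: the list of CPUs that processes in the cgroup may use. For example, 0-2,16 means the 4 CPUs 0, 1, 2, and 16.

cpuset.mems: the list of memory nodes that processes in the cgroup may use. For example, 0-2,16 means the 4 nodes 0, 1, 2, and 16.

cpuset.memory_migrate: whether pages are migrated when cpuset.mems changes (default 0, do not migrate; 1, migrate).

cpuset.cpu_exclusive: whether the cgroup has exclusive use of the CPUs in cpuset.cpus (default 0, shared; 1, exclusive). When set to 1, the cpuset.cpus of other cgroups cannot contain any of the CPUs listed here.

cpuset.mem_exclusive: whether the cgroup has exclusive use of its memory nodes (default 0, shared; 1, exclusive).

cpuset.mem_hardwall: whether memory allocations for tasks in the cgroup are isolated (default 0, not isolated; 1, isolated, so that each user's tasks get their own separate allocation).

cpuset.sched_load_balance: whether the CPU load of the cgroup is balanced across the CPUs in the cpuset (default 1, load balancing enabled; 0, disabled).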
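
The list syntax used by cpuset.cpus and cpuset.mems can be illustrated with a short bash sketch. The function name expand_cpu_list is hypothetical, and the code assumes well-formed input (numbers and ascending ranges only).

```shell
#!/bin/bash
# expand_cpu_list LIST   (hypothetical helper)
# Expands a cpuset-style list such as "0-2,16" into one ID per word.
expand_cpu_list() {
    local list=$1 part i
    local out=()
    local IFS=,
    for part in $list; do              # split the list on commas
        if [[ $part == *-* ]]; then    # a range such as 0-2
            for ((i = ${part%-*}; i <= ${part#*-}; i++)); do
                out+=("$i")
            done
        else                           # a single ID such as 16
            out+=("$part")
        fi
    done
    IFS=' '
    echo "${out[*]}"                   # join the IDs with spaces
}

expand_cpu_list 0-2,16
```

The call prints 0 1 2 16, the same 4 CPUs as in the cpuset.cpus example.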

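1.1.4 memory subsystem: caps the memory a cgroup can use

memory.limit_in_bytes: the maximum amount of memory. A unit suffix (k/K, m/M, g/G) may be given; without a suffix the value is in bytes.

memory.soft_limit_in_bytes: unlike memory.limit_in_bytes, this limit does not stop processes from using more memory than the limit. Instead, when the system runs short of memory, memory held by processes over the limit is reclaimed first, pushing usage back toward the limit. The value should be lower than memory.limit_in_bytes.

memory.stat: memory usage statistics. Size fields are in bytes; fields such as pgfault are event counts.

memory.memsw.limit_in_bytes: the maximum amount of memory plus swap.

memory.oom_control: whether a process is killed when the cgroup runs out of memory. The default, 0, kills; when set to 1, the process is put to sleep and woken once enough memory is available.

memory.force_empty: writing 0 reclaims all memory pages of the group. It can only be used when the group has no tasks.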
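
To make the unit suffixes concrete, the following sketch converts a value such as 200M into bytes. The helper name to_bytes is made up for illustration, and it assumes the suffixes are powers of 1024, which is consistent with the 209715200 value that appears later for a 200M limit.

```shell
#!/bin/bash
# to_bytes VALUE   (hypothetical helper)
# Converts a size with an optional k/K, m/M or g/G suffix into bytes,
# assuming binary (1024-based) multiples.
to_bytes() {
    local value=$1 number unit
    number=${value%[kKmMgG]}     # strip one trailing suffix character, if any
    unit=${value#"$number"}      # whatever was stripped ("" if nothing)
    case $unit in
        k|K) echo $((number * 1024)) ;;
        m|M) echo $((number * 1024 * 1024)) ;;
        g|G) echo $((number * 1024 * 1024 * 1024)) ;;
        "")  echo "$number" ;;
    esac
}

to_bytes 200M
```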

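1.1.5 blkio subsystem: limits a cgroup's use of I/O

blkio.weight: a weight in the range [100, 1000]. It is a proportional share, not an absolute bandwidth, so it only takes effect when different cgroups contend for the same block device.

blkio.weight_device: a weight for a specific device. It overrides blkio.weight for that device.

blkio.throttle.read_bps_device: upper limit on read bandwidth (bytes per second) for a specific device

blkio.throttle.write_bps_device: upper limit on write bandwidth (bytes per second) for a specific device

blkio.throttle.read_iops_device: upper limit on read operations per second (IOPS) for a specific device

blkio.throttle.write_iops_device: upper limit on write operations per second (IOPS) for a specific device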

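1.1.6 devices subsystem: restricts which devices the processes in a cgroup can access

devices.allow: devices that may be accessed. Each entry has 4 fields: type (device type), major (major device number), minor (minor device number), and access (access mode).

type

a: all devices, both character and block
b: block device
c: character device

major, minor

The major and minor device numbers, written as major:minor.

access

r: read
w: write
m: create device files that do not yet exist (mknod)

devices.deny: devices that may not be accessed, in the same format as devices.allow

devices.list: shows the devices that tasks in the cgroup are currently allowed to access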
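
As a rough illustration of the entry format, the sketch below checks whether a string looks like a devices.allow entry before it is written. The function name valid_device_rule is hypothetical and the check is deliberately loose: it assumes the form "type major:minor access", with * accepted as a wildcard for major or minor.

```shell
#!/bin/bash
# valid_device_rule ENTRY   (hypothetical helper)
# Returns 0 if ENTRY looks like "type major:minor access", e.g. "c 1:3 rwm".
# Loose check only: it does not reject repeated access letters.
valid_device_rule() {
    local re='^[abc] ([0-9]+|\*):([0-9]+|\*) [rwm]{1,3}$'
    [[ $1 =~ $re ]]
}

valid_device_rule "c 1:3 rwm" && echo "c 1:3 rwm looks valid"
valid_device_rule "x 1:3 rw"  || echo "x 1:3 rw is rejected"
```

An entry that passes the check could then be written to the devices.allow file of the target cgroup.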

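1.1.7 freezer subsystem: suspends or resumes tasks

freezer.state: the state of the processes in the cgroup

FROZEN: the processes are suspended

FREEZING: the processes are in the middle of being suspended

THAWED: the processes are running (resumed)

1. Suspending a process also suspends its child processes.
2. A process cannot be moved into a cgroup that is in the FROZEN state.
3. Only FROZEN and THAWED can be written to freezer.state; FREEZING cannot.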
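
The third rule can be captured in a small wrapper that refuses to write anything other than FROZEN or THAWED. This is a sketch under assumptions: set_freezer_state is a made-up name, and the directory passed in is expected to be a freezer cgroup directory such as one under /sys/fs/cgroup/freezer/ (any directory works for a dry run).

```shell
#!/bin/bash
# set_freezer_state DIR STATE   (hypothetical helper)
# Writes STATE to DIR/freezer.state, accepting only FROZEN or THAWED.
set_freezer_state() {
    local dir=$1 state=$2
    case $state in
        FROZEN|THAWED)
            echo "$state" > "$dir/freezer.state"
            ;;
        *)
            echo "refusing to write '$state': only FROZEN or THAWED are accepted" >&2
            return 1
            ;;
    esac
}

# Example (needs root and an existing freezer cgroup; the path is illustrative):
#   set_freezer_state /sys/fs/cgroup/freezer/zz FROZEN
#   set_freezer_state /sys/fs/cgroup/freezer/zz THAWED
```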

1.2 Basic Cgroup usage

1.2.1 Manual usage

Create the test.sh script

#!/bin/bash
while true;do
	echo "1"
done

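1. Create a subdirectory under the cgroup subsystem
2. Set the resource quota
3. Write the PID of the process to be limited into the subdirectory

Take a CPU limit as an example and restrict test.sh to at most 0.5 CPU (a quota of 50000 microseconds against the default 100000-microsecond period). One point worth knowing is the difference between tasks and cgroup.procs. According to the official documentation, writing a PID to cgroup.procs moves every thread in that PID's thread group into the cgroup, and child processes forked afterwards join it automatically. Writing a PID to tasks moves only that one thread.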

root@test:~# mkdir -p /sys/fs/cgroup/cpu/zz/
root@test:~# echo 50000 > /sys/fs/cgroup/cpu/zz/cpu.cfs_quota_us
root@test:~# echo PID > /sys/fs/cgroup/cpu/zz/cgroup.procs

Run the test.sh script and verify the limit (PID above stands for the process ID of the running script)

root@test:~# bash test.sh
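
The three manual steps can be wrapped in one function. This is a sketch, not an established tool: the name limit_cpu and its argument order are invented here, and the hierarchy root is passed in as a parameter so the function can be tried against a scratch directory before pointing it at /sys/fs/cgroup/cpu.

```shell
#!/bin/bash
# limit_cpu ROOT NAME QUOTA_US PID   (hypothetical helper)
# 1. creates the cgroup directory ROOT/NAME
# 2. writes QUOTA_US to cpu.cfs_quota_us
# 3. writes PID to cgroup.procs
limit_cpu() {
    local root=$1 name=$2 quota=$3 pid=$4
    local dir="$root/$name"
    mkdir -p "$dir" || return 1
    echo "$quota" > "$dir/cpu.cfs_quota_us" || return 1
    echo "$pid" > "$dir/cgroup.procs"
}

# Example (needs root and a cgroup v1 cpu hierarchy):
#   bash test.sh > /dev/null &
#   limit_cpu /sys/fs/cgroup/cpu zz 50000 $!
```

Starting the script in the background first makes its PID available as $! for the third step.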

1.2.2 Using the cgroup tool set

Install the cgroup tools

# CentOS
yum -y install  libcgroup libcgroup-tools

# Ubuntu
apt install cgroup-tools

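cgcreate creates, cgdelete deletes, cgget queries, cgset sets parameters, cgexec runs a command in a cgroup, and so on

# Create the control group zz in the cpu subsystem
root@test:~# cgcreate -g cpu:zz

# Set a cpu subsystem parameter
root@test:~# cgset -r cpu.cfs_quota_us=50000 /zz

# Run a command inside the cgroup
root@test:~# cgexec  -g cpu:zz df -h -t ext4

# Move a process into the cpu control group
root@test:~# cgclassify -g cpu:zz 3661

# Delete the cpu control group
root@test:~# cgdelete -g cpu:zz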

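1.2.3 Counting cgroups

/proc/cgroups shows which subsystems are mounted on the system and how many cgroups each one has.
The hierarchy column is an ID assigned in mount order. num_cgroups is the number of cgroup directories under that subsystem, which can be checked with the find command.
Kernel entry point: proc_cgroupstats_show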

root@test:~# cat /proc/cgroups  |column  -t
#subsys_name  hierarchy  num_cgroups  enabled
cpuset        9          1            1
cpu           4          48           1
cpuacct       4          48           1
blkio         8          48           1
memory        7          105          1
devices       3          48           1
freezer       10         2            1
net_cls       5          1            1
perf_event    12         1            1
net_prio      5          1            1
hugetlb       2          1            1
pids          6          55           1
rdma          11         1            1
root@test:~# find /sys/fs/cgroup/cpu/ -type d  | wc -l
48
root@test:~# find /sys/fs/cgroup/blkio/ -type d  | wc -l
48
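
The same table can be reduced to name and count pairs with awk. The helper below is a sketch (cgroup_counts is a made-up name) and assumes the four-column layout shown above: subsys_name, hierarchy, num_cgroups, enabled.

```shell
#!/bin/bash
# cgroup_counts [FILE]   (hypothetical helper)
# Prints "subsystem num_cgroups" for every enabled subsystem.
# FILE defaults to /proc/cgroups; a saved copy can be passed instead.
cgroup_counts() {
    # Skip the "#subsys_name ..." header and keep rows whose enabled column is 1
    awk '!/^#/ && $4 == 1 { print $1, $3 }' "${1:-/proc/cgroups}"
}

# Example:
#   cgroup_counts | sort -k2,2nr | head -3    # the three largest hierarchies
```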

1.2.4 Viewing the cgroup information of a process

root@test:~# cat /proc/3661/cgroup 
12:perf_event:/
11:rdma:/
10:freezer:/
9:cpuset:/
8:blkio:/user.slice
7:memory:/user.slice/user-0.slice/session-6.scope
6:pids:/user.slice/user-0.slice/session-6.scope
5:net_cls,net_prio:/
4:cpu,cpuacct:/
3:devices:/user.slice
2:hugetlb:/
1:name=systemd:/user.slice/user-0.slice/session-6.scope
0::/user.slice/user-0.slice/session-6.scope


root@test:~# ps -o cgroup 3661
CGROUP
8:blkio:/user.slice,7:memory:/user.slice/user-0.slice/session-6.scope,6:pids:/user.slice/user-0.slice/session-6.scope,3:devices:/user.slice,1:name=systemd:/user.slice/user-0.slice/session-6.scope,0::/user.slic
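
Each line of /proc/PID/cgroup has the form hierarchy-ID:controller-list:path, so the path for one controller can be pulled out with awk. The function name cgroup_path_of is hypothetical; the sketch assumes controllers that share a hierarchy are comma-separated, as in the cpu,cpuacct line above.

```shell
#!/bin/bash
# cgroup_path_of CONTROLLER [FILE]   (hypothetical helper)
# Prints the cgroup path of CONTROLLER from a /proc/PID/cgroup style file.
# FILE defaults to /proc/self/cgroup.
cgroup_path_of() {
    local controller=$1 file=${2:-/proc/self/cgroup}
    awk -F: -v want="$controller" '
        {
            n = split($2, names, ",")      # e.g. "cpu,cpuacct" -> cpu, cpuacct
            for (i = 1; i <= n; i++)
                if (names[i] == want) {
                    # rejoin fields 3.. in case the path itself contains ":"
                    path = $3
                    for (j = 4; j <= NF; j++) path = path ":" $j
                    print path
                    exit
                }
        }' "$file"
}

# Example:
#   cgroup_path_of memory /proc/3661/cgroup
```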

1.3 Controlling cgroups with systemd

Step 1: create the slice and the service

Use systemd to create and start test.service

root@test:~# cat /usr/libexec/test.sh 
#!/bin/bash
while true;do
	echo "1"
done
root@test:~# chmod +x /usr/libexec/test.sh

Create the unit file for test.service

root@test:~# vim /etc/systemd/system/test.service
[Unit]
Description=test
ConditionFileIsExecutable=/usr/libexec/test.sh
[Service]
Type=simple
ExecStart=/usr/libexec/test.sh
[Install]
WantedBy=multi-user.target

Start the test service

root@test:~# systemctl start test
root@test:~# systemctl status test
● test.service - test
     Loaded: loaded (/etc/systemd/system/test.service; disabled; vendor preset: enabled)
     Active: active (running) since Mon 2022-06-20 10:24:53 CST; 5s ago
   Main PID: 1810 (test.sh)
      Tasks: 1 (limit: 4612)
     Memory: 560.0K
     CGroup: /system.slice/test.service
             └─1810 /bin/bash /usr/libexec/test.sh

Jun 20 10:24:53 test test.sh[1810]: 1
Jun 20 10:24:53 test test.sh[1810]: 1
Jun 20 10:24:53 test test.sh[1810]: 1
Jun 20 10:24:53 test test.sh[1810]: 1
Jun 20 10:24:53 test test.sh[1810]: 1
Jun 20 10:24:53 test test.sh[1810]: 1
Jun 20 10:24:53 test test.sh[1810]: 1
Jun 20 10:24:53 test test.sh[1810]: 1
Jun 20 10:24:53 test test.sh[1810]: 1
Jun 20 10:24:53 test test.sh[1810]: 1

The test service saturates a CPU

root@test:~# mpstat -P ALL 1 10
Linux 3.10.0-327.alx2000.alxos7.x86_64 (localhost)  08/26/2016  _x86_64_  (24 CPU)
12:10:30 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
12:10:31 PM  all    50.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   50.00
12:10:31 PM    0    99.80    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  00.00
12:10:31 PM    1    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

Because test.service was started by systemd, systemd-cgls lists it under system.slice. If you run /usr/libexec/test.sh directly, without going through systemd, systemd-cgls lists the process under user.slice in the cgroup tree.

root@test:~# systemd-cgls 
└─system.slice 
  ├─test.service 
  │ └─1810 /bin/bash /usr/libexec/test.sh
.....................

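Step 2: control process resources with cgroups

First, determine which branch of the cgroup tree the test service belongs to. Since the configuration has not been changed, the test service is in system.slice.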

root@test:~# systemctl show test
Slice=system.slice
ControlGroup=/system.slice/test.service

Change the slice the service belongs to

root@test:~# vim /etc/systemd/system/test.service
[Unit]
Description=test
ConditionFileIsExecutable=/usr/libexec/test.sh
[Service]
Type=simple
ExecStart=/usr/libexec/test.sh
Slice=zhrx.slice
[Install]
WantedBy=multi-user.target


root@test:~# systemctl daemon-reload
root@test:~# systemctl restart test

root@test:~# systemd-cgls 
├─zhrx.slice 
│ └─test.service 
│   └─31661 /bin/bash /usr/libexec/test.sh

At this point, however, no cgroup controller is in use for zhrx.slice yet

root@test:~# lscgroup |grep zhrx.slice
root@test:~# lscgroup |grep test.service

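Add CPUAccounting=yes to /etc/systemd/system/test.service. This declares that zhrx.slice, and test.service under it, will start using the cpu,cpuacct resource management of cgroups. After systemctl daemon-reload and a restart of the service, both appear in the lscgroup output.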

root@test:~# lscgroup |grep zhrx.slice
cpu,cpuacct:/zhrx.slice
cpu,cpuacct:/zhrx.slice/test.service

root@test:~# lscgroup |grep test.service
cpu,cpuacct:/zhrx.slice/test.service

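Even so, test.service still uses 100% of a CPU, because the two parameters shown below are still at their defaults. cpu.cfs_period_us and cpu.cfs_quota_us together limit how much CPU time all processes in the group can use per period. cfs here stands for Completely Fair Scheduler. cpu.cfs_period_us is the length of the period and defaults to 100000, that is, 100 milliseconds. cpu.cfs_quota_us is the CPU time that can be used within each period and defaults to -1, meaning unlimited.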

root@test:~# cat /sys/fs/cgroup/cpu/zhrx.slice/test.service/cpu.cfs_period_us
100000
root@test:~# cat /sys/fs/cgroup/cpu/zhrx.slice/test.service/cpu.cfs_quota_us 
-1

So the following 2 steps immediately bring the CPU usage of test.service down to 50%

root@test:~# ps aux | grep test
root       46977 44.4  0.0   6972  3064 ?        Ss   11:36   1:38 /bin/bash /usr/libexec/test.sh
root       47026  0.0  0.0   8160   720 pts/1    S+   11:40   0:00 grep --color=auto test
root@test:~# echo 50000 > /sys/fs/cgroup/cpu/zhrx.slice/test.service/cpu.cfs_quota_us
root@test:~# echo 46977 > /sys/fs/cgroup/cpu/zhrx.slice/test.service/tasks

root@test:~# mpstat -P ALL 1 10
Linux 3.10.0-327.alx2000.alxos7.x86_64 (localhost)  08/26/2016  _x86_64_  (24 CPU)
12:10:30 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
12:10:31 PM  all    25.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   75.00
12:10:31 PM    0    50.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  50.00
12:10:31 PM    1    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

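Next, consider how to manage resources with cgroups through the systemd unit file.

Step 3: controlling cgroups from systemd

How systemd uses cgroups puzzles many people. systemd exposes cgroup functionality through settings in the unit file. For example, to bring test.service under the cpu, memory, and block I/O resource management of cgroups, the settings needed are: CPUAccounting=yes MemoryAccounting=yes TasksAccounting=yes BlockIOAccounting=yes

These settings are described in detail in man systemd.resource-control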

root@test:~# vim /etc/systemd/system/test.service 
[Unit]
Description=test
ConditionFileIsExecutable=/usr/libexec/test.sh
[Service]
Type=simple
ExecStart=/usr/libexec/test.sh
Slice=zhrx.slice
CPUAccounting=yes
MemoryAccounting=yes
TasksAccounting=yes
BlockIOAccounting=yes
[Install]
WantedBy=multi-user.target

Check that test.service and zhrx.slice now appear in the cgroup tree

root@test:~# lscgroup | grep zhrx
cpu,cpuacct:/zhrx.slice
cpu,cpuacct:/zhrx.slice/test.service
memory:/zhrx.slice
memory:/zhrx.slice/test.service
devices:/zhrx.slice
blkio:/zhrx.slice
blkio:/zhrx.slice/test.service
pids:/zhrx.slice
pids:/zhrx.slice/test.service

root@test:~# lscgroup | grep test.service
cpu,cpuacct:/zhrx.slice/test.service
memory:/zhrx.slice/test.service
blkio:/zhrx.slice/test.service
pids:/zhrx.slice/test.service

The cgroup information also shows up in systemctl status test

root@test:~# systemctl status test
● test.service - test
     Loaded: loaded (/etc/systemd/system/test.service; disabled; vendor preset: enabled)
     Active: active (running) since Mon 2022-06-20 13:29:04 CST; 1min 1s ago
   Main PID: 47210 (test.sh)
      Tasks: 1 (limit: 4612)
     Memory: 460.0K
        CPU: 27.767s
     CGroup: /zhrx.slice/test.service
             └─47210 /bin/bash /usr/libexec/test.sh

Jun 20 13:30:04 test test.sh[47210]: 1
Jun 20 13:30:04 test test.sh[47210]: 1
Jun 20 13:30:04 test test.sh[47210]: 1

Practical examples

1. Limiting CPU: cpu.shares

test.service

root@test:/usr/libexec# systemctl cat test
# /etc/systemd/system/test.service
[Unit]
Description=test
ConditionFileIsExecutable=/usr/libexec/test.sh
[Service]
Type=simple
ExecStart=/usr/libexec/test.sh
Slice=zhrx.slice
CPUAccounting=yes
MemoryAccounting=yes
TasksAccounting=yes
BlockIOAccounting=yes
[Install]
WantedBy=multi-user.target

test2.service

root@test:/usr/libexec# systemctl cat test2
# /etc/systemd/system/test2.service
[Unit]
Description=ee
ConditionFileIsExecutable=/usr/libexec/test2.sh
[Service]
Type=simple
ExecStart=/usr/libexec/test.sh
Slice=zhrx.slice
CPUAccounting=yes
MemoryAccounting=yes
TasksAccounting=yes
BlockIOAccounting=yes
[Install]
WantedBy=multi-user.target

By default, cpu.shares is 1024 everywhere

root@test:~# cat /sys/fs/cgroup/cpu/zhrx.slice/cpu.shares 
1024
root@test:~# cat /sys/fs/cgroup/cpu/zhrx.slice/test.service/cpu.shares 
1024
root@test:~# cat /sys/fs/cgroup/cpu/zhrx.slice/test2.service/cpu.shares 
1024

mpstat -P ALL 1 2: two CPU cores are saturated

Linux 3.10.0-327.alx2000.alxos7.x86_64 (localhost)      09/18/2016      _x86_64_        (24 CPU)
08:32:09 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
08:32:10 PM  all    100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   00.00
08:32:10 PM    0    100.00    0.00    0.99    0.00    0.00    0.00    0.00    0.00    0.00   00.00
08:32:10 PM    1    100.00    0.00    0.99    0.00    0.00    0.00    0.00    0.00

cpu.shares does not limit the absolute CPU time a process can use; it controls the relative allocation between groups
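
The relative nature of cpu.shares can be shown with a little arithmetic. The sketch below uses a made-up helper, cpu_share_pct, and assumes every listed group is busy and competing for the same CPUs; an idle group's share is simply available to the others.

```shell
#!/bin/bash
# cpu_share_pct TARGET SHARES...   (hypothetical helper)
# TARGET is the cpu.shares value of the group of interest; SHARES... lists
# the cpu.shares of all competing groups, including that group.
# Prints the percentage of CPU time the target gets under full contention.
cpu_share_pct() {
    local target=$1 total=0 s
    shift
    for s in "$@"; do
        total=$((total + s))
    done
    awk -v t="$target" -v sum="$total" 'BEGIN { printf "%.1f\n", t * 100 / sum }'
}

cpu_share_pct 1024 1024 1024   # two services with the default 1024 shares
cpu_share_pct 200 100 200      # the CPU1/CPU2 example from the cpu.shares description
```

The two calls print 50.0 and 66.7.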

2. Limiting CPU: CPUQuota=40%

As shown below, setting only CPUAccounting=yes MemoryAccounting=yes TasksAccounting=yes BlockIOAccounting=yes turns on accounting but is not enough; the service's resource usage still has to be limited.

root@test:~# cat /etc/systemd/system/test.service 
[Unit]
Description=test
ConditionFileIsExecutable=/usr/libexec/test.sh
[Service]
Type=simple
ExecStart=/usr/libexec/test.sh
Slice=zhrx.slice
CPUAccounting=yes
MemoryAccounting=yes
TasksAccounting=yes
BlockIOAccounting=yes
[Install]
WantedBy=multi-user.target

As seen earlier, test.service consumes 100% of one CPU. Now limit it by adding the setting CPUQuota=40%

root@test:~# cat /etc/systemd/system/test.service
[Unit]
Description=test
ConditionFileIsExecutable=/usr/libexec/test.sh
[Service]
Type=simple
ExecStart=/usr/libexec/test.sh
Slice=zhrx.slice
CPUAccounting=yes
CPUQuota=40%
MemoryAccounting=yes
TasksAccounting=yes
BlockIOAccounting=yes
[Install]
WantedBy=multi-user.target

root@test:~# systemctl daemon-reload

root@test:~# systemctl restart test.service

As shown below, test.service can now use at most 40% of a single CPU.

root@test:~# mpstat -P ALL 1 3
Linux 3.10.0-327.alx2000.alxos7.x86_64 (localhost)      09/18/2016      _x86_64_        (24 CPU)
05:28:43 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
05:28:44 PM  all    20.00    0.00    0.08    0.00    0.00    0.00    0.00    0.00    0.00   80.00
05:28:44 PM    0    40.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  60.00
05:28:44 PM    1    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
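
Besides mpstat, the limit can be cross-checked by reading the quota and period back from the service's cgroup directory. The following is a sketch with a hypothetical helper name, effective_cpu_percent; it assumes a cgroup v1 cpu hierarchy and a directory such as /sys/fs/cgroup/cpu/zhrx.slice/test.service.

```shell
#!/bin/bash
# effective_cpu_percent DIR   (hypothetical helper)
# Reads cpu.cfs_quota_us and cpu.cfs_period_us from DIR and prints the
# limit as a percentage of one CPU, or "unlimited" when the quota is -1.
effective_cpu_percent() {
    local dir=$1 quota period
    quota=$(cat "$dir/cpu.cfs_quota_us") || return 1
    period=$(cat "$dir/cpu.cfs_period_us") || return 1
    if [ "$quota" -lt 0 ]; then
        echo "unlimited"
    else
        echo "$((quota * 100 / period))%"
    fi
}

# Example (path assumed from the earlier steps):
#   effective_cpu_percent /sys/fs/cgroup/cpu/zhrx.slice/test.service
```

A quota of 40000 against a period of 100000 is reported as 40%.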

3. Limiting memory

A script whose memory usage grows rapidly

root@test:~# cat /usr/libexec/dd.sh 
#!/bin/bash
x="a"
while true;do
	x=$x$x
done

By default memory usage is not limited, so the script runs until memory is exhausted and allocation fails

root@test:~# bash /usr/libexec/dd.sh
/usr/libexec/dd.sh: xrealloc: cannot allocate 18446744071562068096 bytes

Limit memory usage to at most 200M

root@test:~# cat /etc/systemd/system/dd.service 
[Unit]
Description=dd
ConditionFileIsExecutable=/usr/libexec/dd.sh
[Service]
Type=simple
ExecStart=/usr/libexec/dd.sh
Slice=zhrx.slice
CPUAccounting=yes
CPUQuota=40%
MemoryAccounting=yes
MemoryMax=200M
TasksAccounting=yes
BlockIOAccounting=yes
[Install]
WantedBy=multi-user.target

root@test:~# cat /sys/fs/cgroup/memory/zhrx.slice/dd.service/memory.limit_in_bytes
209715200

As shown below, the effect is obvious: memory usage has reached 200M

root@test:~# systemctl status dd
● dd.service - dd
   Loaded: loaded (/etc/systemd/system/dd.service; disabled; vendor preset: disabled)
   Active: active (running) since Sun 2016-09-18 19:44:42 CST; 27s ago
 Main PID: 82182 (dd)
   Memory: 199.8M (limit: 200.0M)
   CGroup: /zhrx.slice/dd.service
           └─82182 /usr/bin/bash /usr/libexec/dd

The service was not OOM-killed immediately; it was killed after a short while.

root@test:~# systemctl status dd
● dd.service - dd
     Loaded: loaded (/etc/systemd/system/dd.service; disabled; vendor preset: enabled)
     Active: failed (Result: signal) since Mon 2022-06-20 14:27:19 CST; 54s ago
    Process: 50025 ExecStart=/usr/libexec/dd.sh (code=killed, signal=KILL)
   Main PID: 50025 (code=killed, signal=KILL)
        CPU: 741ms

Jun 20 14:27:04 test systemd[1]: Started dd.
Jun 20 14:27:19 test systemd[1]: dd.service: Main process exited, code=killed, status=9/KILL
Jun 20 14:27:19 test systemd[1]: dd.service: Failed with result 'signal'.

The log confirms that it was indeed OOM-killed

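Common systemd settings for controlling cgroups

Slice=jenkins.slice     # A unit file with the ".slice" suffix describes a slice unit, which wraps a control group that manages the resource usage of a group of processes. Such units implement resource control by creating a node in the Linux cgroup (Control Group) tree
CPUAccounting=yes       # "yes" turns on CPU usage accounting for this unit. Note that this also implicitly turns on CPU accounting for all units in the slice this unit belongs to and in its parent slices
CPUQuotaPeriodSec=      # The period over which the CPU time quota given by CPUQuota= is measured. Takes a duration in seconds, with an optional suffix such as "ms" for milliseconds
CPUQuota=10%            # CPU time quota for the processes of this unit. Must be a percentage ending in "%", giving the maximum share of a single CPU's total time that the unit may use
AllowedCPUs=            # Restricts processes to run on specific CPUs. Takes a list of CPU indices or ranges separated by spaces or commas
MemoryAccounting=yes    # "yes" turns on memory usage accounting for this unit
TasksAccounting=yes     # "yes" turns on task accounting for this unit (kernel-space threads plus user-space processes)
TasksMax=infinity       # Limit on the total number of tasks in this unit
IOAccounting=yes        # "yes" turns on block device I/O accounting for this unit
IPAccounting=yes        # Whether to turn on network traffic accounting for this unit. For non-socket units, "yes" counts the traffic (sent and received) on all IPv4 and IPv6 sockets created by any process in the unit
TimeoutSec=0            # Shorthand that sets both TimeoutStartSec= and TimeoutStopSec=, the time to wait for the service to start and to stop; 0 disables the timeout
Restart=always          # Restart the service in every case unless it was stopped with systemctl stop; the default is no
RestartSec=5            # Delay before restarting
StartLimitInterval=0    # Unlimited restarts. By default a service that restarts more than 5 times within 10 seconds is not restarted again; 0 removes the limit
KillMode=control-group  # How processes are killed when the unit is stopped. control-group kills all processes in the unit's cgroup (for service units, after the ExecStop= action has run)
LimitNOFILE=102400      # Number of file descriptors, ulimit -n
LimitNPROC=102400       # Number of processes, ulimit -u
LimitCORE=infinity      # Core file size, ulimit -c
LimitNICE=10            # nice limit; nice levels range from -20 (highest priority) to 19 (lowest)
OOMScoreAdjust=600      # Priority for being killed when memory runs out; an integer from -1000 (never killed) to 1000 (killed first)
MemoryMin=512M          # Guaranteed minimum memory for the processes of this unit
MemoryLow=750M          # Best-effort guarantee of how much memory the processes of this unit can use at least
MemoryHigh=900M         # Best-effort limit on how much memory the processes of this unit can use at most
MemoryMax=1G            # Hard limit on how much memory the processes of this unit can use. It cannot be exceeded; hitting it gets processes killed for lack of memory
MemorySwapMax=100M      # Hard limit on how much swap space the processes of this unit can use
CPUWeight=100           # Applies during normal system operation
StartupCPUWeight=1024   # Gives the service a different CPU priority during system boot than at runtime
IOWeight=100            # Applies during normal system operation
StartupIOWeight=1024    # Applies only during system boot
# IP-address-based access control for the AF_INET and AF_INET6 sockets of this unit.
# Both lists are empty by default for every unit. The allow and deny lists are applied as follows:
#   if the peer address matches an entry in IPAddressAllow=, access is allowed
#   otherwise, if the peer address matches an entry in IPAddressDeny=, access is denied
#   otherwise, access is allowed
IPAddressAllow=any
IPAddressDeny=192.168.0.0/24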

posted @ 2022-06-20 18:08 zhrx