Control Group v2: Controllers (translated by ChatGPT)




The "cpu" controller regulates the distribution of CPU cycles. This controller implements weight and absolute bandwidth limit models for the normal scheduling policy and an absolute bandwidth allocation model for the realtime scheduling policy.

In all the above models, cycle distribution is defined only on a temporal basis and does not account for the frequency at which tasks are executed. The (optional) utilization clamping support allows hinting the schedutil cpufreq governor about the minimum desired frequency which should always be provided by a CPU, as well as the maximum desired frequency which should not be exceeded by a CPU.

For background on utilization clamping, see the article "Linux Kernel Utilization Clamping简介" (an introduction to Linux kernel utilization clamping).

WARNING: cgroup2 doesn't yet support control of realtime processes and the cpu controller can only be enabled when all RT processes are in the root cgroup. Be aware that system management software may already have placed RT processes into non-root cgroups during the system boot process, and these processes may need to be moved to the root cgroup before the cpu controller can be enabled.
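
One way to find the RT processes the warning refers to is to ask the scheduler for each PID's policy. The sketch below is only an illustration under stated assumptions: the helper names are hypothetical, SCHED_FIFO and SCHED_RR are treated as the realtime policies, and only the main thread of each process is inspected.

```python
import os

# Policies treated as realtime in this sketch.
REALTIME_POLICIES = {os.SCHED_FIFO, os.SCHED_RR}
# Flag that may be OR-ed into the reported policy value.
RESET_ON_FORK = getattr(os, "SCHED_RESET_ON_FORK", 0)


def is_realtime_policy(policy: int) -> bool:
    """Return True if a scheduling policy value is a realtime one."""
    return (policy & ~RESET_ON_FORK) in REALTIME_POLICIES


def realtime_pids() -> list:
    """List PIDs under /proc whose main thread uses a realtime policy."""
    try:
        entries = os.listdir("/proc")
    except FileNotFoundError:
        return []
    pids = []
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            policy = os.sched_getscheduler(int(entry))
        except OSError:
            continue  # the process exited or cannot be inspected
        if is_realtime_policy(policy):
            pids.append(int(entry))
    return sorted(pids)


if __name__ == "__main__":
    print(realtime_pids())
```

Each PID found this way could then be moved by writing it to the root cgroup's cgroup.procs file before the cpu controller is enabled.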

CPU Interface Files


All time durations are in microseconds.

  • cpu.stat
    A read-only flat-keyed file. This file exists whether the controller is enabled or not.

    It always reports the following three stats:

    • usage_usec
    • user_usec
    • system_usec

    and the following five when the controller is enabled:

    • nr_periods
    • nr_throttled
    • throttled_usec
    • nr_bursts
    • burst_usec

  • cpu.weight
    A read-write single value file which exists on non-root cgroups. The default is "100".

    The weight is in the range [1, 10000].

  • cpu.weight.nice
    A read-write single value file which exists on non-root cgroups. The default is "0".

    The nice value is in the range [-20, 19].

    This interface file is an alternative interface for "cpu.weight" and allows reading and setting weight using the same values used by nice(2). Because the range is smaller and granularity is coarser for the nice values, the read value is the closest approximation of the current weight.

  • cpu.max
    A read-write two value file which exists on non-root cgroups. The default is "max 100000".

    The maximum bandwidth limit. It's in the following format:

      $MAX $PERIOD

    which indicates that the group may consume up to $MAX in each $PERIOD duration. "max" for $MAX indicates no limit. If only one number is written, $MAX is updated.

  • cpu.max.burst
    A read-write single value file which exists on non-root cgroups. The default is "0".

    The burst is in the range [0, $MAX].

  • cpu.pressure
    A read-write nested-keyed file.

    Shows pressure stall information for CPU. See Documentation/accounting/psi.rst for details.

  • cpu.uclamp.min
    A read-write single value file which exists on non-root cgroups. The default is "0", i.e. no utilization boosting.

    The requested minimum utilization (protection) as a percentage rational number, e.g. 12.34 for 12.34%.

    This interface allows reading and setting minimum utilization clamp values, similar to sched_setattr(2). This minimum utilization value is used to clamp the task-specific minimum utilization clamp.

    The requested minimum utilization (protection) is always capped by the current value for the maximum utilization (limit), i.e. cpu.uclamp.max.

  • cpu.uclamp.max
    A read-write single value file which exists on non-root cgroups. The default is "max", i.e. no utilization capping.

    The requested maximum utilization (limit) as a percentage rational number, e.g. 98.76 for 98.76%.

    This interface allows reading and setting maximum utilization clamp values, similar to sched_setattr(2). This maximum utilization value is used to clamp the task-specific maximum utilization clamp.
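
As a rough illustration of the cpu.max format described above, the following sketch parses a "$MAX $PERIOD" string and converts it into an equivalent number of CPUs. The function names are hypothetical, and the code works on the text of the file rather than on a mounted cgroup, so it can be tried anywhere.

```python
from typing import Optional, Tuple


def parse_cpu_max(text: str) -> Tuple[Optional[int], int]:
    """Parse the contents of cpu.max into (max_usec, period_usec).

    max_usec is None when the file reports "max" (no limit).
    """
    max_field, period_field = text.split()
    max_usec = None if max_field == "max" else int(max_field)
    return max_usec, int(period_field)


def cpu_limit(text: str) -> Optional[float]:
    """Return the limit as a number of CPUs, or None if unlimited."""
    max_usec, period_usec = parse_cpu_max(text)
    if max_usec is None:
        return None
    return max_usec / period_usec


if __name__ == "__main__":
    print(cpu_limit("max 100000"))    # default value: prints None
    print(cpu_limit("50000 100000"))  # half a CPU per period: prints 0.5
```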


The "memory" controller regulates distribution of memory. Memory is stateful and implements both limit and protection models. Due to the intertwining between memory usage and reclaim pressure and the stateful nature of memory, the distribution model is relatively complex.

While not completely water-tight, all major memory usages by a given cgroup are tracked so that the total memory consumption can be accounted and controlled to a reasonable extent. Currently, the following types of memory usages are tracked.

  • Userland memory - page cache and anonymous memory.

  • Kernel data structures such as dentries and inodes.

  • TCP socket buffers.

The above list may expand in the future for better coverage.

Memory Interface Files


All memory amounts are in bytes. If a value which is not aligned to PAGE_SIZE is written, the value may be rounded up to the closest PAGE_SIZE multiple when read back.
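
The rounding described above is plain integer arithmetic. A minimal sketch, assuming the page size of the running system as reported by Python's mmap.PAGESIZE; the helper name is hypothetical, and the text above only says the value may be rounded, so this shows the upper bound of what can be read back.

```python
import mmap


def round_up_to_page(value: int, page_size: int = mmap.PAGESIZE) -> int:
    """Round value up to the next multiple of page_size."""
    return -(-value // page_size) * page_size


if __name__ == "__main__":
    # With 4096-byte pages, 5000 bytes rounds up to two pages.
    print(round_up_to_page(5000, 4096))  # prints 8192
```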

  • memory.current
    A read-only single value file which exists on non-root cgroups.

    The total amount of memory currently being used by the cgroup and its descendants.

  • memory.min
    A read-write single value file which exists on non-root cgroups. The default is "0".

    Hard memory protection. If the memory usage of a cgroup is within its effective min boundary, the cgroup's memory won't be reclaimed under any conditions. If there is no unprotected reclaimable memory available, OOM killer is invoked. Above the effective min boundary (or effective low boundary if it is higher), pages are reclaimed proportionally to the overage, reducing reclaim pressure for smaller overages.

    Effective min boundary is limited by memory.min values of all ancestor cgroups. If there is memory.min overcommitment (child cgroup or cgroups are requiring more protected memory than parent will allow), then each child cgroup will get the part of parent's protection proportional to its actual memory usage below memory.min.

    Putting more memory than generally available under this protection is discouraged and may lead to constant OOMs.

    If a memory cgroup is not populated with processes, its memory.min is ignored.

  • memory.low
    A read-write single value file which exists on non-root cgroups. The default is "0".

    Best-effort memory protection. If the memory usage of a cgroup is within its effective low boundary, the cgroup's memory won't be reclaimed unless there is no reclaimable memory available in unprotected cgroups. Above the effective low boundary (or effective min boundary if it is higher), pages are reclaimed proportionally to the overage, reducing reclaim pressure for smaller overages.

    Effective low boundary is limited by memory.low values of all ancestor cgroups. If there is memory.low overcommitment (child cgroup or cgroups are requiring more protected memory than parent will allow), then each child cgroup will get the part of parent's protection proportional to its actual memory usage below memory.low.

    Putting more memory than generally available under this protection is discouraged.

  • memory.high
    A read-write single value file which exists on non-root cgroups. The default is "max".

    Memory usage throttle limit. If a cgroup's usage goes over the high boundary, the processes of the cgroup are throttled and put under heavy reclaim pressure.

    Going over the high limit never invokes the OOM killer and under extreme conditions the limit may be breached. The high limit should be used in scenarios where an external process monitors the limited cgroup to alleviate heavy reclaim pressure.

  • memory.max
    A read-write single value file which exists on non-root cgroups. The default is "max".

    Memory usage hard limit. This is the main mechanism to limit memory usage of a cgroup. If a cgroup's memory usage reaches this limit and can't be reduced, the OOM killer is invoked in the cgroup. Under certain circumstances, the usage may go over the limit temporarily.

    In the default configuration, regular 0-order allocations always succeed unless the OOM killer chooses the current task as a victim.

    Some kinds of allocations don't invoke the OOM killer. The caller could retry them differently, return -ENOMEM to userspace, or silently ignore the failure in cases like disk readahead.

  • memory.reclaim
    A write-only nested-keyed file which exists for all cgroups.

    This is a simple interface to trigger memory reclaim in the target cgroup.

    This file accepts a single key, the number of bytes to reclaim. No nested keys are currently supported.

    Example:

      echo "1G" > memory.reclaim

    The interface can be later extended with nested keys to configure the reclaim behavior, for example to specify the type of memory to reclaim from (anon, file, ..).

    Please note that the kernel can over or under reclaim from the target cgroup. If fewer bytes are reclaimed than the specified amount, -EAGAIN is returned.

    Please note that the proactive reclaim (triggered by this interface) is not meant to indicate memory pressure on the memory cgroup. Therefore socket memory balancing triggered by the memory reclaim normally is not exercised in this case. This means that the networking layer will not adapt based on reclaim induced by memory.reclaim.

  • memory.peak
    A read-only single value file which exists on non-root cgroups.

    The max memory usage recorded for the cgroup and its descendants since the creation of the cgroup.

  • memory.oom.group
    A read-write single value file which exists on non-root cgroups. The default value is "0".

    Determines whether the cgroup should be treated as an indivisible workload by the OOM killer. If set, all tasks belonging to the cgroup or to its descendants (if the memory cgroup is not a leaf cgroup) are killed together or not at all. This can be used to avoid partial kills to guarantee workload integrity.

    Tasks with the OOM protection (oom_score_adj set to -1000) are treated as an exception and are never killed.

    If the OOM killer is invoked in a cgroup, it's not going to kill any tasks outside of this cgroup, regardless of the memory.oom.group values of ancestor cgroups.

  • memory.events
    A read-only flat-keyed file which exists on non-root cgroups. The following entries are defined. Unless specified otherwise, a value change in this file generates a file modified event.

    Note that all fields in this file are hierarchical and the file modified event can be generated due to an event down the hierarchy. For the local events at the cgroup level see memory.events.local.

    • low
      The number of times the cgroup is reclaimed due to high memory pressure even though its usage is under the low boundary. This usually indicates that the low boundary is over-committed.

    • high
      The number of times processes of the cgroup are throttled and routed to perform direct memory reclaim because the high memory boundary was exceeded. For a cgroup whose memory usage is capped by the high limit rather than global memory pressure, this event's occurrences are expected.

    • max
      The number of times the cgroup's memory usage was about to go over the max boundary. If direct reclaim fails to bring it down, the cgroup goes to OOM state.

    • oom
      The number of times the cgroup's memory usage reached the limit and allocation was about to fail.

      This event is not raised if the OOM killer is not considered as an option, e.g. for failed high-order allocations or if the caller asked to not retry attempts.

    • oom_kill
      The number of processes belonging to this cgroup killed by any kind of OOM killer.

    • oom_group_kill
      The number of times a group OOM has occurred.

  • memory.events.local
    Similar to memory.events but the fields in the file are local to the cgroup i.e. not hierarchical. The file modified event generated on this file reflects only the local events.

  • memory.stat
    A read-only flat-keyed file which exists on non-root cgroups.

    This breaks down the cgroup's memory footprint into different types of memory, type-specific details, and other information on the state and past events of the memory management system.

    All memory amounts are in bytes.

    The entries are ordered to be human readable, and new entries can show up in the middle. Don't rely on items remaining in a fixed position; use the keys to look up specific values!

    If an entry has no per-node counter (that is, it does not show in memory.numa_stat), it is tagged 'npn' (non-per-node) below.

    • anon
      Amount of memory used in anonymous mappings such as brk(), sbrk(), and mmap(MAP_ANONYMOUS).

    • file
      Amount of memory used to cache filesystem data, including tmpfs and shared memory.

    • kernel (npn)
      Amount of total kernel memory, including (kernel_stack, pagetables, percpu, vmalloc, slab) in addition to other kernel memory use cases.

    • kernel_stack
      Amount of memory allocated to kernel stacks.

    • pagetables
      Amount of memory allocated for page tables.

    • sec_pagetables
      Amount of memory allocated for secondary page tables. This currently includes KVM mmu allocations on x86 and arm64.

    • percpu (npn)
      Amount of memory used for storing per-cpu kernel data structures.

    • sock (npn)
      Amount of memory used in network transmission buffers.

    • vmalloc (npn)
      Amount of memory used for vmap backed memory.

    • shmem
      Amount of cached filesystem data that is swap-backed, such as tmpfs, shm segments, and shared anonymous mmap()s.

    • zswap
      Amount of memory consumed by the zswap compression backend.

    • zswapped
      Amount of application memory swapped out to zswap.

    • file_mapped
      Amount of cached filesystem data mapped with mmap().

    • file_dirty
      Amount of cached filesystem data that was modified but not yet written back to disk

    • file_writeback
      Amount of cached filesystem data that was modified and is currently being written back to disk

    • swapcached
      Amount of swap cached in memory. The swapcache is accounted against both memory and swap usage.

    • anon_thp
      Amount of memory used in anonymous mappings backed by transparent hugepages

    • file_thp
      Amount of cached filesystem data backed by transparent hugepages

    • shmem_thp
      Amount of shm, tmpfs, and shared anonymous mmap()s backed by transparent hugepages.

    • inactive_anon, active_anon, inactive_file, active_file, unevictable
      Amount of memory, swap-backed and filesystem-backed, on the internal memory management lists used by the page reclaim algorithm.

      As these represent internal list state (e.g. shmem pages are on anon memory management lists), inactive_foo + active_foo may not be equal to the value for the foo counter, since the foo counter is type-based, not list-based.

    • slab_reclaimable
      Part of "slab" that might be reclaimed, such as dentries and inodes.

    • slab_unreclaimable
      Part of "slab" that cannot be reclaimed on memory pressure.

    • slab (npn)
      Amount of memory used for storing in-kernel data structures.

    • workingset_refault_anon
      Number of refaults of previously evicted anonymous pages.

    • workingset_refault_file
      Number of refaults of previously evicted file pages.

    • workingset_activate_anon
      Number of refaulted anonymous pages that were immediately activated.

    • workingset_activate_file
      Number of refaulted file pages that were immediately activated.

    • workingset_restore_anon
      Number of restored anonymous pages which have been detected as an active workingset before they got reclaimed.

    • workingset_restore_file
      Number of restored file pages which have been detected as an active workingset before they got reclaimed.

    • workingset_nodereclaim
      Number of times a shadow node has been reclaimed

    • pgscan (npn)
      Amount of scanned pages (in an inactive LRU list).

    • pgsteal (npn)
      Amount of reclaimed pages.

    • pgscan_kswapd (npn)
      Amount of scanned pages by kswapd (in an inactive LRU list).

    • pgscan_direct (npn)
      Amount of scanned pages directly (in an inactive LRU list).

    • pgscan_khugepaged (npn)
      Amount of scanned pages by khugepaged (in an inactive LRU list).

    • pgsteal_kswapd (npn)
      Amount of reclaimed pages by kswapd.

    • pgsteal_direct (npn)
      Amount of reclaimed pages directly.

    • pgsteal_khugepaged (npn)
      Amount of reclaimed pages by khugepaged.

    • pgfault (npn)
      Total number of page faults incurred

    • pgmajfault (npn)
      Number of major page faults incurred

    • pgrefill (npn)
      Amount of scanned pages (in an active LRU list).

    • pgactivate (npn)
      Amount of pages moved to the active LRU list.

    • pgdeactivate (npn)
      Amount of pages moved to the inactive LRU list.

    • pglazyfree (npn)
      Amount of pages postponed to be freed under memory pressure

    • pglazyfreed (npn)
      Amount of reclaimed lazyfree pages

    • thp_fault_alloc (npn)
      Number of transparent hugepages which were allocated to satisfy a page fault. This counter is not present when CONFIG_TRANSPARENT_HUGEPAGE is not set.

    • thp_collapse_alloc (npn)
      Number of transparent hugepages which were allocated to allow collapsing an existing range of pages. This counter is not present when CONFIG_TRANSPARENT_HUGEPAGE is not set.

    • thp_swpout (npn)
      Number of transparent hugepages which were swapped out in one piece without splitting.

    • thp_swpout_fallback (npn)
      Number of transparent hugepages which were split before swapout, usually because the kernel failed to allocate contiguous swap space for the huge page.

  • memory.numa_stat
    A read-only nested-keyed file which exists on non-root cgroups.

    This breaks down the cgroup's memory footprint into different types of memory, type-specific details, and other information per node on the state of the memory management system.

    This is useful for providing visibility into the NUMA locality information within a memcg since the pages are allowed to be allocated from any physical node. One use case is evaluating application performance by combining this information with the application's CPU allocation.

    All memory amounts are in bytes.

    The output format of memory.numa_stat is:

      type N0=<bytes in node 0> N1=<bytes in node 1> ...

    The entries are ordered to be human readable, and new entries can show up in the middle. Don't rely on items remaining in a fixed position; use the keys to look up specific values!

    For the meaning of each entry, refer to memory.stat.

  • memory.swap.current
    A read-only single value file which exists on non-root cgroups.

    The total amount of swap currently being used by the cgroup and its descendants.

  • memory.swap.high
    A read-write single value file which exists on non-root cgroups. The default is "max".

    Swap usage throttle limit. If a cgroup's swap usage exceeds this limit, all its further allocations will be throttled to allow userspace to implement custom out-of-memory procedures.

    This limit marks a point of no return for the cgroup. It is NOT designed to manage the amount of swapping a workload does during regular operation. Compare to memory.swap.max, which prohibits swapping past a set amount, but lets the cgroup continue unimpeded as long as other memory can be reclaimed.

    Healthy workloads are not expected to reach this limit.

  • memory.swap.peak
    A read-only single value file which exists on non-root cgroups.

    The max swap usage recorded for the cgroup and its descendants since the creation of the cgroup.

  • memory.swap.max
    A read-write single value file which exists on non-root cgroups. The default is "max".

    Swap usage hard limit. If a cgroup's swap usage reaches this limit, anonymous memory of the cgroup will not be swapped out.

  • memory.swap.events
    A read-only flat-keyed file which exists on non-root cgroups. The following entries are defined. Unless specified otherwise, a value change in this file generates a file modified event.

    • high
      The number of times the cgroup's swap usage was over the high threshold.

    • max
      The number of times the cgroup's swap usage was about to go over the max boundary and swap allocation failed.

    • fail
      The number of times swap allocation failed either because of running out of swap system-wide or max limit.

    When reduced under the current usage, the existing swap entries are reclaimed gradually and the swap usage may stay higher than the limit for an extended period of time. This reduces the impact on the workload and memory management.

  • memory.zswap.current
    A read-only single value file which exists on non-root cgroups.

    The total amount of memory consumed by the zswap compression backend.

  • memory.zswap.max
    A read-write single value file which exists on non-root cgroups. The default is "max".

    Zswap usage hard limit. If a cgroup's zswap pool reaches this limit, it will refuse to take any more stores before existing entries fault back in or are written out to disk.

  • memory.pressure
    A read-only nested-keyed file.

    Shows pressure stall information for memory. See Documentation/accounting/psi.rst for details.
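
Since entries in memory.stat and memory.numa_stat may appear in any position, readers of these files are expected to look values up by key. The sketch below shows one way to do that for the flat-keyed and nested-keyed layouts described above. The function names are hypothetical, and the sample values in the usage section are made up for illustration.

```python
from typing import Dict


def parse_flat_keyed(text: str) -> Dict[str, int]:
    """Parse a flat-keyed file such as memory.stat into {key: value}."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, value = line.split()
        result[key] = int(value)
    return result


def parse_numa_stat(text: str) -> Dict[str, Dict[int, int]]:
    """Parse memory.numa_stat lines of the form 'type N0=<bytes> N1=<bytes>'."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        per_node = {}
        for field in fields[1:]:
            node, _, amount = field.partition("=")
            per_node[int(node[1:])] = int(amount)  # "N0" becomes node 0
        result[fields[0]] = per_node
    return result


if __name__ == "__main__":
    stat = parse_flat_keyed("anon 4096\nfile 8192\nshmem 0\n")
    print(stat["file"])     # prints 8192
    numa = parse_numa_stat("anon N0=4096 N1=0\nfile N0=4096 N1=4096\n")
    print(numa["anon"][1])  # prints 0
```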

Usage Guidelines


"memory.high" is the main mechanism to control memory usage. Over-committing on high limit (sum of high limits > available memory) and letting global memory pressure distribute memory according to usage is a viable strategy.

Because breach of the high limit doesn't trigger the OOM killer but throttles the offending cgroup, a management agent has ample opportunities to monitor and take appropriate actions such as granting more memory or terminating the workload.
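
A management agent of this kind could, for instance, compare two reads of memory.events and react to growth in the "high" counter. The sketch below is a hypothetical policy, not a recommended one: the function names, the threshold of 10 events, and the action labels are all assumptions made for illustration.

```python
from typing import Dict


def parse_events(text: str) -> Dict[str, int]:
    """Parse memory.events (flat-keyed) into {event: count}."""
    events = {}
    for line in text.splitlines():
        if line.strip():
            key, value = line.split()
            events[key] = int(value)
    return events


def decide(previous: Dict[str, int], current: Dict[str, int],
           raise_threshold: int = 10) -> str:
    """Pick an action from the growth of the "high" counter between two reads.

    Assumed policy: do nothing if the cgroup was not throttled, keep
    watching for a few events, and suggest raising memory.high once the
    number of new events reaches raise_threshold.
    """
    delta = current.get("high", 0) - previous.get("high", 0)
    if delta == 0:
        return "ok"
    if delta < raise_threshold:
        return "watch"
    return "raise-high"


if __name__ == "__main__":
    before = parse_events("low 0\nhigh 3\nmax 0\noom 0\noom_kill 0\n")
    after = parse_events("low 0\nhigh 20\nmax 0\noom 0\noom_kill 0\n")
    print(decide(before, after))  # 17 new events: prints raise-high
```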

Determining whether a cgroup has enough memory is not trivial as memory usage doesn't indicate whether the workload can benefit from more memory. For example, a workload which writes data received from the network to a file can use all available memory but can also perform just as well with a small amount of memory. A measure of memory pressure - how much the workload is being impacted due to lack of memory - is necessary to determine whether a workload needs more memory; unfortunately, a memory pressure monitoring mechanism isn't implemented yet.

Memory Ownership


A memory area is charged to the cgroup which instantiated it and stays charged to the cgroup until the area is released. Migrating a process to a different cgroup doesn't move the memory usages that it instantiated while in the previous cgroup to the new cgroup.

A memory area may be used by processes belonging to different cgroups. Which cgroup the area will be charged to is indeterminate; however, over time, the memory area is likely to end up in a cgroup which has enough memory allowance to avoid high reclaim pressure.

If a cgroup sweeps a considerable amount of memory which is expected to be accessed repeatedly by other cgroups, it may make sense to use POSIX_FADV_DONTNEED to relinquish the ownership of memory areas belonging to the affected files to ensure correct memory ownership.
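
As an illustration of the POSIX_FADV_DONTNEED suggestion above, a process that has just swept through a file could give up its cached pages as sketched below. The helper name is hypothetical, and the sketch assumes a Linux system where os.posix_fadvise is available.

```python
import os


def drop_file_cache(path: str) -> None:
    """Ask the kernel to drop the cached pages of path."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Pages not yet written out are unaffected by the advice,
        # so flush them first.
        os.fsync(fd)
        # offset=0 and len=0 cover the whole file.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
```

The advice is only a hint, so the kernel may keep some pages; the file contents themselves are not changed.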


The "io" controller regulates the distribution of IO resources. This controller implements both weight based and absolute bandwidth or IOPS limit distribution; however, weight based distribution is available only if cfq-iosched is in use and neither scheme is available for blk-mq devices.

IO Interface Files


  • io.stat
    A read-only nested-keyed file.

    Lines are keyed by $MAJ:$MIN device numbers and not ordered. The following nested keys are defined.

    rbytes    Bytes read
    wbytes    Bytes written
    rios      Number of read IOs
    wios      Number of write IOs
    dbytes    Bytes discarded
    dios      Number of discard IOs

    An example read output follows:

        8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0
        8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 dbytes=50331648 dios=3021
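
The io.stat layout shown above can be read into a dictionary keyed by device number. This is a minimal sketch with a hypothetical function name, and it assumes every nested value is an integer, as is the case for the six keys listed above.

```python
from typing import Dict


def parse_io_stat(text: str) -> Dict[str, Dict[str, int]]:
    """Parse io.stat into {"MAJ:MIN": {key: value}}."""
    devices = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        stats = {}
        for field in fields[1:]:
            key, _, value = field.partition("=")
            stats[key] = int(value)
        devices[fields[0]] = stats
    return devices


if __name__ == "__main__":
    sample = (
        "8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0\n"
        "8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 "
        "dbytes=50331648 dios=3021\n"
    )
    stats = parse_io_stat(sample)
    print(stats["8:16"]["wios"])  # prints 353
```
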
  • io.cost.qos
    A read-write nested-keyed file which exists only on the root cgroup.

    This file configures the Quality of Service of the IO cost model based controller (CONFIG_BLK_CGROUP_IOCOST) which currently implements "io.weight" proportional control. Lines are keyed by $MAJ:$MIN device numbers and not ordered. The line for a given device is populated on the first write for the device on "io.cost.qos" or "io.cost.model". The following nested keys are defined.

    enable    Weight-based control enable
    ctrl      "auto" or "user"
    rpct      Read latency percentile [0, 100]
    rlat      Read latency threshold
    wpct      Write latency percentile [0, 100]
    wlat      Write latency threshold
    min       Minimum scaling percentage [1, 10000]
    max       Maximum scaling percentage [1, 10000]

    The controller is disabled by default and can be enabled by setting "enable" to 1. "rpct" and "wpct" parameters default to zero and the controller uses internal device saturation state to adjust the overall IO rate between "min" and "max".

    When a better control quality is needed, latency QoS parameters can be configured. For example:

    8:16 enable=1 ctrl=auto rpct=95.00 rlat=75000 wpct=95.00 wlat=150000 min=50.00 max=150.0

    shows that on sdb, the controller is enabled, will consider the device saturated if the 95th percentile of read completion latencies is above 75ms or write 150ms, and adjust the overall IO issue rate between 50% and 150% accordingly.

    The lower the saturation point, the better the latency QoS at the cost of aggregate bandwidth. The narrower the allowed adjustment range between "min" and "max", the more conformant to the cost model the IO behavior. Note that the IO issue base rate may be far off from 100% and setting "min" and "max" blindly can lead to a significant loss of device capacity or control quality. "min" and "max" are useful for regulating devices which show wide temporary behavior changes, e.g. an SSD which accepts writes at the line speed for a while and then completely stalls for multiple seconds.

    When "ctrl" is "auto", the parameters are controlled by the kernel and may change automatically. Setting "ctrl" to "user" or setting any of the percentile and latency parameters puts it into "user" mode and disables the automatic changes. The automatic mode can be restored by setting "ctrl" to "auto".
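
As a reading aid for the QoS example line above, here is a small sketch that parses it and restates the saturation condition in milliseconds. It assumes, as the explanation of the example implies, that "rlat" and "wlat" are in microseconds; the function names are hypothetical.

```python
def parse_qos_line(line):
    """Split one io.cost.qos line into (device, {key: value})."""
    device, *pairs = line.split()
    params = {}
    for pair in pairs:
        key, _, value = pair.partition("=")
        # "ctrl" is a word ("auto" or "user"); every other key is numeric.
        params[key] = value if key == "ctrl" else float(value)
    return device, params


def describe_saturation(params):
    """Restate the latency thresholds (assumed microseconds) in milliseconds."""
    return (
        f"read p{params['rpct']:g} > {params['rlat'] / 1000:g} ms or "
        f"write p{params['wpct']:g} > {params['wlat'] / 1000:g} ms"
    )


device, params = parse_qos_line(
    "8:16 enable=1 ctrl=auto rpct=95.00 rlat=75000 wpct=95.00 wlat=150000 min=50.00 max=150.0"
)
print(device, describe_saturation(params))
```
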

  • io.cost.model
    A read-write nested-keyed file which exists only on the root cgroup.

    This file configures the cost model of the IO cost model based controller (CONFIG_BLK_CGROUP_IOCOST) which currently implements "io.weight" proportional control. Lines are keyed by $MAJ:$MIN device numbers and not ordered. The line for a given device is populated on the first write for the device on "io.cost.qos" or "io.cost.model". The following nested keys are defined.

    ctrl "auto" or "user"
    model The cost model in use - "linear"

    When "ctrl" is "auto", the kernel may change all parameters dynamically. When "ctrl" is set to "user" or any other parameters are written to, "ctrl" becomes "user" and the automatic changes are disabled.

    When "model" is "linear", the following model parameters are defined.

    [r|w]bps         The maximum sequential IO throughput
    [r|w]seqiops     The maximum 4k sequential IOs per second
    [r|w]randiops    The maximum 4k random IOs per second

    From the above, the builtin linear model determines the base costs of a sequential and random IO and the cost coefficient for the IO size. While simple, this model can cover most common device classes acceptably.

    The IO cost model isn't expected to be accurate in absolute sense and is scaled to the device behavior dynamically.

    If needed, tools/cgroup/ can be used to generate device-specific coefficients.

  • io.weight
    A read-write flat-keyed file which exists on non-root cgroups. The default is "default 100".

    The first line is the default weight applied to devices without specific override. The rest are overrides keyed by $MAJ:$MIN device numbers and not ordered. The weights are in the range [1, 10000] and specify the relative amount of IO time the cgroup can use in relation to its siblings.

    The default weight can be updated by writing either "default $WEIGHT" or simply "$WEIGHT". Overrides can be set by writing "$MAJ:$MIN $WEIGHT" and unset by writing "$MAJ:$MIN default".

    An example read output follows:

        default 100
        8:16 200
        8:0 50
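
To make "relative amount of IO time" concrete, the sketch below resolves the weight that applies to a device and computes an idealized share against siblings. It assumes every sibling is keeping the device busy, which is a simplification rather than a description of the controller's internals; all function names are hypothetical.

```python
def parse_io_weight(text):
    """Return (default_weight, {device: override_weight}) from io.weight content."""
    default, overrides = 100, {}
    for line in text.splitlines():
        key, value = line.split()
        if key == "default":
            default = int(value)
        else:
            overrides[key] = int(value)
    return default, overrides


def weight_for(parsed, device):
    """Weight that applies to a device: its override if present, else the default."""
    default, overrides = parsed
    return overrides.get(device, default)


def relative_share(own_weight, sibling_weights):
    """Idealized fraction of IO time when this cgroup and all siblings are busy."""
    return own_weight / (own_weight + sum(sibling_weights))


parsed = parse_io_weight("default 100\n8:16 200\n8:0 50")
# On device 8:16 this cgroup weighs 200; one sibling at the default 100.
share = relative_share(weight_for(parsed, "8:16"), [100])
print(weight_for(parsed, "8:16"), round(share, 3))
```
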
  • io.max
    A read-write nested-keyed file which exists on non-root cgroups.

    BPS and IOPS based IO limit. Lines are keyed by $MAJ:$MIN device numbers and not ordered. The following nested keys are defined.

    rbps     Max read bytes per second
    wbps     Max write bytes per second
    riops    Max read IO operations per second
    wiops    Max write IO operations per second

    When writing, any number of nested key-value pairs can be specified in any order. "max" can be specified as the value to remove a specific limit. If the same key is specified multiple times, the outcome is undefined.

    BPS and IOPS are measured in each IO direction and IOs are delayed if the limit is reached. Temporary bursts are allowed.

    Setting the read limit to 2M BPS and the write limit to 120 IOPS for 8:16:

    echo "8:16 rbps=2097152 wiops=120" > io.max

    Reading returns the following:

    8:16 rbps=2097152 wbps=max riops=max wiops=120

    Write IOPS limit can be removed by writing the following:

    echo "8:16 wiops=max" > io.max

    Reading now returns the following:

    8:16 rbps=2097152 wbps=max riops=max wiops=max
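
The write semantics walked through above (only the keys you name change, and "max" removes a limit) can be modelled in a few lines. This is a user-space sketch for reasoning about what a write will do, not kernel code; the helper names are hypothetical.

```python
IO_MAX_KEYS = ("rbps", "wbps", "riops", "wiops")


def apply_io_max_write(limits, line):
    """Model one write to io.max; returns a new {device: {key: limit}} mapping."""
    device, *pairs = line.split()
    # A device without an entry starts with every limit unset ("max").
    current = dict(limits.get(device, {key: "max" for key in IO_MAX_KEYS}))
    for pair in pairs:
        key, _, value = pair.partition("=")
        if key not in IO_MAX_KEYS:
            raise ValueError(f"unknown key: {key}")
        current[key] = "max" if value == "max" else int(value)
    updated = dict(limits)
    updated[device] = current
    return updated


def format_io_max(limits):
    """Render the mapping in the same layout as the read examples above."""
    return "\n".join(
        device + " " + " ".join(f"{key}={values[key]}" for key in IO_MAX_KEYS)
        for device, values in limits.items()
    )


limits = apply_io_max_write({}, "8:16 rbps=2097152 wiops=120")
print(format_io_max(limits))
limits = apply_io_max_write(limits, "8:16 wiops=max")
print(format_io_max(limits))
```
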

  • io.pressure
    A read-only nested-keyed file.

    Shows pressure stall information for IO. See Documentation/accounting/psi.rst for details.


Page cache is dirtied through buffered writes and shared mmaps and written asynchronously to the backing filesystem by the writeback mechanism. Writeback sits between the memory and IO domains and regulates the proportion of dirty memory by balancing dirtying and write IOs.

The io controller, in conjunction with the memory controller, implements control of page cache writeback IOs. The memory controller defines the memory domain that dirty memory ratio is calculated and maintained for and the io controller defines the io domain which writes out dirty pages for the memory domain. Both system-wide and per-cgroup dirty memory states are examined and the more restrictive of the two is enforced.

cgroup writeback requires explicit support from the underlying filesystem. Currently, cgroup writeback is implemented on ext2, ext4, btrfs, f2fs, and xfs. On other filesystems, all writeback IOs are attributed to the root cgroup.

There are inherent differences in memory and writeback management which affect how cgroup ownership is tracked. Memory is tracked per page while writeback is tracked per inode. For the purpose of writeback, an inode is assigned to a cgroup and all IO requests to write dirty pages from the inode are attributed to that cgroup.

As cgroup ownership for memory is tracked per page, there can be pages which are associated with different cgroups than the one the inode is associated with. These are called foreign pages. The writeback constantly keeps track of foreign pages and, if a particular foreign cgroup becomes the majority over a certain period of time, switches the ownership of the inode to that cgroup.


While this model is enough for most use cases where a given inode is mostly dirtied by a single cgroup even when the main writing cgroup changes over time, use cases where multiple cgroups write to a single inode simultaneously are not supported well. In such circumstances, a significant portion of IOs are likely to be attributed incorrectly. As memory controller assigns page ownership on the first use and doesn't update it until the page is released, even if writeback strictly follows page ownership, multiple cgroups dirtying overlapping areas wouldn't work as expected. It's recommended to avoid such usage patterns.

The sysctl knobs which affect writeback behavior are applied to cgroup writeback as follows.

  • vm.dirty_background_ratio, vm.dirty_ratio
    These ratios apply the same to cgroup writeback with the amount of available memory capped by limits imposed by the memory controller and system-wide clean memory.

  • vm.dirty_background_bytes, vm.dirty_bytes
    For cgroup writeback, this is calculated into ratio against total available memory and applied the same way as vm.dirty[_background]_ratio.

IO Latency


This is a cgroup v2 controller for IO workload protection. You provide a group with a latency target, and if the average latency exceeds that target the controller will throttle any peers that have a higher (looser) latency target than the protected workload.

The limits are only applied at the peer level in the hierarchy. This means that in the diagram below, only groups A, B, and C will influence each other, and groups D and F will influence each other. Group G will influence nobody:

             root
  /          |            \
  A          B            C
 /  \        |
D    F       G

So the ideal way to configure this is to set io.latency in groups A, B, and C. Generally you do not want to set a value lower than the latency your device supports. Experiment to find the value that works best for your workload. Start at higher than the expected latency for your device and watch the avg_lat value in io.stat for your workload group to get an idea of the latency you see during normal operation. Use the avg_lat value as a basis for your real setting, setting at 10-15% higher than the value in io.stat.
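
The tuning advice above amounts to simple arithmetic. The sketch below takes an observed avg_lat and produces a target 10-15% higher, formatted as the "MAJOR:MINOR target=<microseconds>" line that io.latency accepts. It assumes avg_lat is reported in microseconds, the same unit as the target; the function names are hypothetical.

```python
def suggest_latency_target(avg_lat_usec, headroom_percent=15):
    """Return a target `headroom_percent` above the observed avg_lat (integer usec)."""
    if not 10 <= headroom_percent <= 15:
        raise ValueError("the text recommends 10-15% headroom")
    return avg_lat_usec * (100 + headroom_percent) // 100


def io_latency_line(device, target_usec):
    """Format the line to write to io.latency."""
    return f"{device} target={target_usec}"


# Example: avg_lat of 2000 observed for the workload group on device 8:16.
target = suggest_latency_target(2000)
print(io_latency_line("8:16", target))
```
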

How IO Latency Throttling Works


io.latency is work conserving; so as long as everybody is meeting their latency target the controller doesn't do anything. Once a group starts missing its target it begins throttling any peer group that has a higher target than itself. This throttling takes 2 forms:

  • Queue depth throttling. This is the number of outstanding IOs a group is allowed to have. We will clamp down relatively quickly, starting at no limit and going all the way down to 1 IO at a time.

  • Artificial delay induction. There are certain types of IO that cannot be throttled without possibly adversely affecting higher priority groups. This includes swapping and metadata IO. These types of IO are allowed to occur normally, however they are "charged" to the originating group. If the originating group is being throttled you will see the use_delay and delay fields in io.stat increase. The delay value is how many microseconds are being added to any process that runs in this group. Because this number can grow quite large if there is a lot of swapping or metadata IO occurring, we limit the individual delay events to 1 second at a time.

Once the victimized group starts meeting its latency target again it will start unthrottling any peer groups that were throttled previously. If the victimized group simply stops doing IO the global counter will unthrottle appropriately.

IO Latency Interface Files


  • io.latency
    This takes a format similar to that of the other controllers.

    "MAJOR:MINOR target=<target time in microseconds>"

  • io.stat
    If the controller is enabled you will see extra stats in io.stat in addition to the normal ones.

    • depth
      This is the current queue depth for the group.

    • avg_lat
      This is an exponential moving average with a decay rate of 1/exp bound by the sampling interval. The decay rate interval can be calculated by multiplying the win value in io.stat by the corresponding number of samples based on the win value.

    • win
      The sampling window size in milliseconds. This is the minimum duration of time between evaluation events. Windows only elapse with IO activity. Idle periods extend the most recent window.


IO Priority


A single attribute controls the behavior of the I/O priority cgroup policy, namely the io.prio.class attribute. The following values are accepted for that attribute:

  • no-change
    Do not modify the I/O priority class.

  • promote-to-rt
    For requests that have a non-RT I/O priority class, change it into RT. Also change the priority level of these requests to 4. Do not modify the I/O priority of requests that have priority class RT.

  • restrict-to-be
    For requests that do not have an I/O priority class or that have I/O priority class RT, change it into BE. Also change the priority level of these requests to 0. Do not modify the I/O priority class of requests that have priority class IDLE.

  • idle
    Change the I/O priority class of all requests into IDLE, the lowest I/O priority class.

  • none-to-rt
    Deprecated. Just an alias for promote-to-rt.

The following numerical values are associated with the I/O priority policies:

no-change 0
promote-to-rt 1
restrict-to-be 2
idle 3

The numerical value that corresponds to each I/O priority class is as follows:

IOPRIO_CLASS_NONE 0
IOPRIO_CLASS_RT (real-time) 1
IOPRIO_CLASS_BE (best effort) 2
IOPRIO_CLASS_IDLE 3

The algorithm to set the I/O priority class for a request is as follows:

  • If I/O priority class policy is promote-to-rt, change the request I/O priority class to IOPRIO_CLASS_RT and change the request I/O priority level to 4.

  • If I/O priority class policy is not promote-to-rt, translate the I/O priority class policy into a number, then change the request I/O priority class into the maximum of the I/O priority class policy number and the numerical I/O priority class.
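
The two rules above can be sketched as a small function. This is an illustrative model, not the kernel implementation. It folds in the per-policy descriptions earlier in this section: promote-to-rt leaves requests that are already RT untouched, and restrict-to-be resets the level to 0 when it changes the class. Applying the same level reset for the idle policy is an assumption of this sketch.

```python
IOPRIO_CLASS_NONE, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE = 0, 1, 2, 3

POLICY_NUMBER = {"no-change": 0, "promote-to-rt": 1, "restrict-to-be": 2, "idle": 3}


def apply_prio_policy(policy, req_class, req_level):
    """Return the (class, level) a request ends up with under a cgroup policy."""
    if policy == "none-to-rt":  # deprecated alias
        policy = "promote-to-rt"
    if policy == "promote-to-rt":
        if req_class == IOPRIO_CLASS_RT:
            return req_class, req_level  # already RT: left unmodified
        return IOPRIO_CLASS_RT, 4
    new_class = max(POLICY_NUMBER[policy], req_class)
    if new_class == req_class:
        return req_class, req_level
    # Class was changed by the policy; level reset to 0 (assumed for "idle").
    return new_class, 0


print(apply_prio_policy("restrict-to-be", IOPRIO_CLASS_RT, 3))
```

Because larger class numbers mean lower priority, taking the maximum can only demote a request, which is why promote-to-rt needs its own rule.
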



The process number controller is used to allow a cgroup to stop any new tasks from being fork()'d or clone()'d after a specified limit is reached.

The number of tasks in a cgroup can be exhausted in ways which other controllers cannot prevent, thus warranting its own controller. For example, a fork bomb is likely to exhaust the number of tasks before hitting memory restrictions.

Note that PIDs used in this controller refer to TIDs, process IDs as used by the kernel.

PID Interface Files


  • pids.max
    A read-write single value file which exists on non-root cgroups. The default is "max".

    Hard limit of number of processes.

  • pids.current
    A read-only single value file which exists on all cgroups.

    The number of processes currently in the cgroup and its descendants.

Organisational operations are not blocked by cgroup policies, so it is possible to have pids.current > pids.max. This can be done by either setting the limit to be smaller than pids.current, or attaching enough processes to the cgroup such that pids.current is larger than pids.max. However, it is not possible to violate a cgroup PID policy through fork() or clone(). These will return -EAGAIN if the creation of a new process would cause a cgroup policy to be violated.
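
The difference between fork() and organisational operations can be illustrated with a toy model. The class below is purely a user-space sketch with hypothetical names; it only mirrors the rules stated above, including the fact that pids.current counts a cgroup and its descendants.

```python
import errno


class PidsCgroup:
    """Toy model of pids.max / pids.current for one cgroup in a hierarchy."""

    def __init__(self, limit=None, parent=None):
        self.limit = limit      # None models "max"
        self.parent = parent
        self.current = 0        # tasks in this cgroup and its descendants

    def _self_and_ancestors(self):
        node = self
        while node is not None:
            yield node
            node = node.parent

    def fork(self):
        """Return 0, or -EAGAIN if any level's limit would be exceeded."""
        for node in self._self_and_ancestors():
            if node.limit is not None and node.current + 1 > node.limit:
                return -errno.EAGAIN
        for node in self._self_and_ancestors():
            node.current += 1
        return 0

    def attach(self, count=1):
        """Organisational operation: never blocked by the limit."""
        for node in self._self_and_ancestors():
            node.current += count


cg = PidsCgroup(limit=2)
results = [cg.fork(), cg.fork(), cg.fork()]
cg.attach(3)  # allowed even though it pushes current above the limit
print(results, cg.current)
```
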


The "cpuset" controller provides a mechanism for constraining the CPU and memory node placement of tasks to only the resources specified in the cpuset interface files in a task's current cgroup. This is especially valuable on large NUMA systems where placing jobs on properly sized subsets of the systems with careful processor and memory placement to reduce cross-node memory access and contention can improve overall system performance.

The "cpuset" controller is hierarchical. That means the controller cannot use CPUs or memory nodes not allowed in its parent.

Cpuset Interface Files


  • cpuset.cpus
    A read-write multiple values file which exists on non-root cpuset-enabled cgroups.

    It lists the requested CPUs to be used by tasks within this cgroup. The actual list of CPUs to be granted, however, is subjected to constraints imposed by its parent and can differ from the requested CPUs.

    The CPU numbers are comma-separated numbers or ranges. For example:

        # cat cpuset.cpus
        0-4,6,8-10

    An empty value indicates that the cgroup is using the same setting as the nearest cgroup ancestor with a non-empty "cpuset.cpus" or all the available CPUs if none is found.

    The value of "cpuset.cpus" stays constant until the next update and won't be affected by any CPU hotplug events.
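
The list syntax used by "cpuset.cpus" (and likewise "cpuset.mems") is straightforward to parse and to produce. A minimal sketch with hypothetical helper names:

```python
def parse_cpu_list(text):
    """Parse a list such as "0-4,6,8-10" into a set of integers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty value yields an empty set
        first, sep, last = part.partition("-")
        if sep:
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(first))
    return cpus


def format_cpu_list(cpus):
    """Render a set of integers back into the comma/range notation."""
    ranges = []
    for cpu in sorted(cpus):
        if ranges and cpu == ranges[-1][1] + 1:
            ranges[-1][1] = cpu  # extend the current run
        else:
            ranges.append([cpu, cpu])
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


cpus = parse_cpu_list("0-4,6,8-10")
print(sorted(cpus), format_cpu_list(cpus))
```
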

  • cpuset.cpus.effective
    A read-only multiple values file which exists on all cpuset-enabled cgroups.

    It lists the onlined CPUs that are actually granted to this cgroup by its parent. These CPUs are allowed to be used by tasks within the current cgroup.

    If "cpuset.cpus" is empty, the "cpuset.cpus.effective" file shows all the CPUs from the parent cgroup that can be available to be used by this cgroup. Otherwise, it should be a subset of "cpuset.cpus" unless none of the CPUs listed in "cpuset.cpus" can be granted. In this case, it will be treated just like an empty "cpuset.cpus".

    Its value will be affected by CPU hotplug events.
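
The relationship between the requested and the effective CPU sets described above can be summarised as a set computation. This is a simplified sketch that ignores partitions and exclusive CPUs; the function and parameter names are hypothetical.

```python
def effective_cpus(requested, parent_effective, online):
    """Approximate cpuset.cpus.effective from cpuset.cpus and the parent's grant.

    requested        -- contents of cpuset.cpus (empty set if unset)
    parent_effective -- the parent's cpuset.cpus.effective
    online           -- CPUs currently online (hotplug changes this)
    """
    available = parent_effective & online
    granted = requested & available
    # Empty request, or nothing requested can be granted: fall back to the parent.
    return granted if granted else available


print(sorted(effective_cpus({0, 1, 2, 3}, {2, 3, 4, 5}, set(range(8)))))
```

Taking CPU 3 offline in this example would shrink the result to CPU 2 alone, while the value of "cpuset.cpus" itself stays unchanged.
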

  • cpuset.mems
    A read-write multiple values file which exists on non-root cpuset-enabled cgroups.

    It lists the requested memory nodes to be used by tasks within this cgroup. The actual list of memory nodes granted, however, is subjected to constraints imposed by its parent and can differ from the requested memory nodes.

    The memory node numbers are comma-separated numbers or ranges. For example:
    # cat cpuset.mems 0-1,3
    An empty value indicates that the cgroup is using the same setting as the nearest cgroup ancestor with a non-empty "cpuset.mems" or all the available memory nodes if none is found.

    The value of "cpuset.mems" stays constant until the next update and won't be affected by any memory nodes hotplug events.

    Setting a non-empty value to "cpuset.mems" causes memory of tasks within the cgroup to be migrated to the designated nodes if they are currently using memory outside of the designated nodes.

    There is a cost for this memory migration. The migration may not be complete and some memory pages may be left behind. So it is recommended that "cpuset.mems" should be set properly before spawning new tasks into the cpuset. Even if there is a need to change "cpuset.mems" with active tasks, it shouldn't be done frequently.

  • cpuset.mems.effective
    A read-only multiple values file which exists on all cpuset-enabled cgroups.

    It lists the onlined memory nodes that are actually granted to this cgroup by its parent. These memory nodes are allowed to be used by tasks within the current cgroup.

    If "cpuset.mems" is empty, it shows all the memory nodes from the parent cgroup that will be available to be used by this cgroup. Otherwise, it should be a subset of "cpuset.mems" unless none of the memory nodes listed in "cpuset.mems" can be granted. In this case, it will be treated just like an empty "cpuset.mems".

    Its value will be affected by memory nodes hotplug events.

  • cpuset.cpus.exclusive
    A read-write multiple values file which exists on non-root cpuset-enabled cgroups.

    It lists all the exclusive CPUs that are allowed to be used to create a new cpuset partition. Its value is not used unless the cgroup becomes a valid partition root. See the "cpuset.cpus.partition" section below for a description of what a cpuset partition is.

    When the cgroup becomes a partition root, the actual exclusive CPUs that are allocated to that partition are listed in "cpuset.cpus.exclusive.effective" which may be different from "cpuset.cpus.exclusive". If "cpuset.cpus.exclusive" has previously been set, "cpuset.cpus.exclusive.effective" is always a subset of it.

    Users can manually set it to a value that is different from "cpuset.cpus". The only constraint in setting it is that the list of CPUs must be exclusive with respect to its sibling.

    For a parent cgroup, any one of its exclusive CPUs can only be distributed to at most one of its child cgroups. Having an exclusive CPU appearing in two or more of its child cgroups is not allowed (the exclusivity rule). A value that violates the exclusivity rule will be rejected with a write error.

    The root cgroup is a partition root and all its available CPUs are in its exclusive CPU set.
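
The exclusivity rule can be checked ahead of a write by comparing the intended value with the siblings' settings. A minimal sketch with a hypothetical function name; it models only the sibling check stated above, not every validation the kernel performs.

```python
def exclusivity_conflicts(new_value, sibling_values):
    """Return the CPUs in new_value that a sibling already lists as exclusive."""
    taken = set().union(*sibling_values)
    return new_value & taken


conflicts = exclusivity_conflicts({2, 3}, [{0, 1}, {3, 4}])
print(sorted(conflicts))  # a non-empty result means the write would be rejected
```
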

  • cpuset.cpus.exclusive.effective
    A read-only multiple values file which exists on all non-root cpuset-enabled cgroups.

    This file shows the effective set of exclusive CPUs that can be used to create a partition root. The content of this file will always be a subset of "cpuset.cpus" and its parent's "cpuset.cpus.exclusive.effective" if its parent is not the root cgroup. It will also be a subset of "cpuset.cpus.exclusive" if it is set. If "cpuset.cpus.exclusive" is not set, it is treated as having an implicit value of "cpuset.cpus" when a local partition is formed.

  • cpuset.cpus.partition
    A read-write single value file which exists on non-root cpuset-enabled cgroups. This flag is owned by the parent cgroup and is not delegatable.

    It accepts only the following input values when written to.

    "member"      Non-root member of a partition
    "root"        Partition root
    "isolated"    Partition root without load balancing

    A cpuset partition is a collection of cpuset-enabled cgroups with a partition root at the top of the hierarchy and its descendants except those that are separate partition roots themselves and their descendants. A partition has exclusive access to the set of exclusive CPUs allocated to it. Other cgroups outside of that partition cannot use any CPUs in that set.

    There are two types of partitions - local and remote. A local partition is one whose parent cgroup is also a valid partition root. A remote partition is one whose parent cgroup is not a valid partition root itself. Writing to "cpuset.cpus.exclusive" is optional for the creation of a local partition as its "cpuset.cpus.exclusive" file will assume an implicit value that is the same as "cpuset.cpus" if it is not set. Writing the proper "cpuset.cpus.exclusive" values down the cgroup hierarchy before the target partition root is mandatory for the creation of a remote partition.

    Currently, a remote partition cannot be created under a local partition. All the ancestors of a remote partition root except the root cgroup cannot be a partition root.

    The root cgroup is always a partition root and its state cannot be changed. All other non-root cgroups start out as "member".

    When set to "root", the current cgroup is the root of a new partition or scheduling domain. The set of exclusive CPUs is determined by the value of its "cpuset.cpus.exclusive.effective".

    When set to "isolated", the CPUs in that partition will be in an isolated state without any load balancing from the scheduler. Tasks placed in such a partition with multiple CPUs should be carefully distributed and bound to each of the individual CPUs for optimal performance.

    A partition root ("root" or "isolated") can be in one of the two possible states - valid or invalid. An invalid partition root is in a degraded state where some state information may be retained, but behaves more like a "member".

    All possible state transitions among "member", "root" and "isolated" are allowed.

    On read, the "cpuset.cpus.partition" file can show the following values.

    "member"                         Non-root member of a partition
    "root"                           Partition root
    "isolated"                       Partition root without load balancing
    "root invalid (<reason>)"        Invalid partition root
    "isolated invalid (<reason>)"    Invalid isolated partition root

    In the case of an invalid partition root, a descriptive string on why the partition is invalid is included within parentheses.

    For a local partition root to be valid, the following conditions must be met.

    1. The parent cgroup is a valid partition root.

    2. The "cpuset.cpus.exclusive.effective" file cannot be empty, though it may contain offline CPUs.

    3. The "cpuset.cpus.effective" cannot be empty unless there is no task associated with this partition.

    For a remote partition root to be valid, all the above conditions except the first one must be met.

    External events like hotplug or changes to "cpuset.cpus" or "cpuset.cpus.exclusive" can cause a valid partition root to become invalid and vice versa. Note that a task cannot be moved to a cgroup with empty "cpuset.cpus.effective".

    A valid non-root parent partition may distribute out all its CPUs to its child local partitions when there is no task associated with it.

    Care must be taken to change a valid partition root to "member" as all its child local partitions, if present, will become invalid causing disruption to tasks running in those child partitions. These inactivated partitions could be recovered if their parent is switched back to a partition root with a proper value in "cpuset.cpus" or "cpuset.cpus.exclusive".

    Poll and inotify events are triggered whenever the state of "cpuset.cpus.partition" changes. That includes changes caused by write to "cpuset.cpus.partition", cpu hotplug or other changes that modify the validity status of the partition. This will allow user space agents to monitor unexpected changes to "cpuset.cpus.partition" without the need to do continuous polling.

    A user can pre-configure certain CPUs to an isolated state with load balancing disabled at boot time with the "isolcpus" kernel boot command line option. If those CPUs are to be put into a partition, they have to be used in an isolated partition.
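
The validity conditions for local and remote partition roots listed in the "cpuset.cpus.partition" entry can be expressed as a short check. This is a sketch of those three conditions only; the function name and the reason strings are hypothetical and do not match what the kernel prints inside the parentheses.

```python
def partition_root_validity(kind, parent_is_valid_root,
                            exclusive_effective, cpus_effective, has_tasks):
    """Return (is_valid, reason) for a would-be partition root.

    kind is "local" or "remote"; the CPU arguments are sets of CPU numbers.
    """
    # Condition 1 applies to local partitions only.
    if kind == "local" and not parent_is_valid_root:
        return False, "parent is not a valid partition root"
    # Condition 2: may contain offline CPUs, but must not be empty.
    if not exclusive_effective:
        return False, "cpuset.cpus.exclusive.effective is empty"
    # Condition 3: empty effective CPUs are tolerated only without tasks.
    if not cpus_effective and has_tasks:
        return False, "cpuset.cpus.effective is empty while tasks are present"
    return True, ""


print(partition_root_validity("local", True, {2, 3}, {2, 3}, True))
```
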

Device controller


Device controller manages access to device files. It includes both creation of new device files (using mknod), and access to the existing device files.

The cgroup v2 device controller has no interface files and is implemented on top of cgroup BPF. To control access to device files, a user may create bpf programs of type BPF_PROG_TYPE_CGROUP_DEVICE and attach them to cgroups with the BPF_CGROUP_DEVICE flag. On an attempt to access a device file, the corresponding BPF programs will be executed, and depending on the return value the attempt will succeed or fail with -EPERM.

A BPF_PROG_TYPE_CGROUP_DEVICE program takes a pointer to the bpf_cgroup_dev_ctx structure, which describes the device access attempt: access type (mknod/read/write) and device (type, major and minor numbers). If the program returns 0, the attempt fails with -EPERM, otherwise it succeeds.

An example of BPF_PROG_TYPE_CGROUP_DEVICE program may be found in tools/testing/selftests/bpf/progs/dev_cgroup.c in the kernel source tree.


The "rdma" controller regulates the distribution and accounting of RDMA resources.

RDMA Interface Files


  • rdma.max
    A read-write nested-keyed file that exists for all cgroups except root and describes the currently configured resource limit for an RDMA/IB device.

    Lines are keyed by device name and are not ordered. Each line contains space separated resource name and its configured limit that can be distributed.

    The following nested keys are defined.

    hca_handle    Maximum number of HCA Handles
    hca_object    Maximum number of HCA Objects

    An example for mlx4 and ocrdma device follows:

        mlx4_0 hca_handle=2 hca_object=2000
        ocrdma1 hca_handle=3 hca_object=max
  • rdma.current
    A read-only file that describes current resource usage. It exists for all cgroups except root.

    An example for mlx4 and ocrdma device follows:

        mlx4_0 hca_handle=1 hca_object=20
        ocrdma1 hca_handle=1 hca_object=23


The HugeTLB controller allows limiting the HugeTLB usage per control group and enforces the controller limit during page fault.

HugeTLB Interface Files

  • hugetlb.<hugepagesize>.current
    Show current usage for "hugepagesize" hugetlb. It exists for all cgroups except root.

  • hugetlb.<hugepagesize>.max
    Set/show the hard limit of "hugepagesize" hugetlb usage. The default value is "max". It exists for all cgroups except root.

  • hugetlb.<hugepagesize>.events
    A read-only flat-keyed file which exists on non-root cgroups.

    • max
      The number of allocation failures due to the HugeTLB limit
  • hugetlb.<hugepagesize>.events.local
    Similar to hugetlb.<hugepagesize>.events but the fields in the file are local to the cgroup i.e. not hierarchical. The file modified event generated on this file reflects only the local events.

  • hugetlb.<hugepagesize>.numa_stat
    Similar to memory.numa_stat, it shows the numa information of the hugetlb pages of <hugepagesize> in this cgroup. Only active in use hugetlb pages are included. The per-node values are in bytes.



The Miscellaneous cgroup provides the resource limiting and tracking mechanism for scalar resources which cannot be abstracted like the other cgroup resources. The controller is enabled by the CONFIG_CGROUP_MISC config option.

A resource can be added to the controller via enum misc_res_type{} in the include/linux/misc_cgroup.h file and the corresponding name via misc_res_name[] in the kernel/cgroup/misc.c file. The provider of the resource must set its capacity prior to using the resource by calling misc_cg_set_capacity().

Once a capacity is set, the resource usage can be updated using the charge and uncharge APIs. All of the APIs to interact with the misc controller are in include/linux/misc_cgroup.h.

Misc Interface Files


The miscellaneous controller provides the interface files below. If two misc resources (res_a and res_b) are registered, then:

  • misc.capacity
    A read-only flat-keyed file shown only in the root cgroup. It shows miscellaneous scalar resources available on the platform along with their quantities:

        $ cat misc.capacity
        res_a 50
        res_b 10
  • misc.current
    A read-only flat-keyed file shown in all cgroups. It shows the current usage of the resources in the cgroup and its children:

        $ cat misc.current
        res_a 3
        res_b 0
  • misc.max
    A read-write flat-keyed file shown in non-root cgroups. It holds the allowed maximum usage of the resources in the cgroup and its children:

        $ cat misc.max
        res_a max
        res_b 4

    Limit can be set by:

        # echo res_a 1 > misc.max

    Limit can be set to max by:

        # echo res_a max > misc.max

    Limits can be set higher than the capacity value in the misc.capacity file.

  • misc.events
    A read-only flat-keyed file which exists on non-root cgroups. The following entries are defined. Unless specified otherwise, a value change in this file generates a file modified event. All fields in this file are hierarchical.

    • max
      The number of times the cgroup's resource usage was about to go over the max boundary.
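
How the limit, the usage and the "max" event counter interact can be pictured with a toy model. The class below is a user-space sketch with hypothetical names; it is not the kernel's charge API and it ignores the capacity check. It assumes a failed charge is counted at the cgroup whose limit was hit and at each of its ancestors, in line with the statement that the event fields are hierarchical.

```python
class MiscCgroup:
    """Toy model of misc.max, misc.current and the "max" event counter."""

    def __init__(self, parent=None):
        self.parent = parent
        self.max = {}         # resource -> limit; a missing key models "max"
        self.current = {}     # resource -> usage in this cgroup and its children
        self.events_max = {}  # resource -> times usage was about to exceed max

    def _self_and_ancestors(self):
        node = self
        while node is not None:
            yield node
            node = node.parent

    def try_charge(self, resource, amount):
        """Charge `amount` if no level's limit would be exceeded."""
        for node in self._self_and_ancestors():
            limit = node.max.get(resource)
            if limit is not None and node.current.get(resource, 0) + amount > limit:
                for up in node._self_and_ancestors():  # hierarchical event
                    up.events_max[resource] = up.events_max.get(resource, 0) + 1
                return False
        for node in self._self_and_ancestors():
            node.current[resource] = node.current.get(resource, 0) + amount
        return True


root = MiscCgroup()
child = MiscCgroup(parent=root)
child.max["res_a"] = 1  # same effect as: echo res_a 1 > misc.max
outcomes = [child.try_charge("res_a", 1), child.try_charge("res_a", 1)]
print(outcomes, child.current, child.events_max)
```
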

Migration and Ownership


A miscellaneous scalar resource is charged to the cgroup in which it is used first, and stays charged to that cgroup until that resource is freed. Migrating a process to a different cgroup does not move the charge to the destination cgroup where the process has moved.

For usage of the misc cgroup, see the LWN article on the newly added misc cgroup.




The perf_event controller, if not mounted on a legacy hierarchy, is automatically enabled on the v2 hierarchy so that perf events can always be filtered by cgroup v2 path. The controller can still be moved to a legacy hierarchy after the v2 hierarchy is populated.

Non-normative information


This section contains information that isn't considered to be a part of the stable kernel API and so is subject to change.

CPU controller root cgroup process behaviour


When distributing CPU cycles in the root cgroup each thread in this cgroup is treated as if it was hosted in a separate child cgroup of the root cgroup. This child cgroup weight is dependent on its thread nice level.

For details of this mapping see sched_prio_to_weight array in kernel/sched/core.c file (values from this array should be scaled appropriately so the neutral - nice 0 - value is 100 instead of 1024).
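
The scaling mentioned above can be sketched as follows. The table reproduces the sched_prio_to_weight values as they appear in kernel/sched/core.c at the time of writing; verify them against your own kernel tree. The plain linear scaling by 100/1024 is an assumption about what "scaled appropriately" means, and the kernel's exact rounding is not modelled.

```python
# Index 0 corresponds to nice -20, index 39 to nice 19.
SCHED_PRIO_TO_WEIGHT = [
    88761, 71755, 56483, 46273, 36291,  # nice -20 .. -16
    29154, 23254, 18705, 14949, 11916,  # nice -15 .. -11
    9548, 7620, 6100, 4904, 3906,       # nice -10 .. -6
    3121, 2501, 1991, 1586, 1277,       # nice  -5 .. -1
    1024, 820, 655, 526, 423,           # nice   0 .. 4
    335, 272, 215, 172, 137,            # nice   5 .. 9
    110, 87, 70, 56, 45,                # nice  10 .. 14
    36, 29, 23, 18, 15,                 # nice  15 .. 19
]


def nice_to_cgroup_weight(nice):
    """Approximate cgroup weight of a root-cgroup thread with this nice level."""
    if not -20 <= nice <= 19:
        raise ValueError("nice must be in [-20, 19]")
    return SCHED_PRIO_TO_WEIGHT[nice + 20] * 100 / 1024


print(nice_to_cgroup_weight(0), round(nice_to_cgroup_weight(5), 1))
```
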

IO controller root cgroup process behaviour


Root cgroup processes are hosted in an implicit leaf child node. When distributing IO resources this implicit child node is taken into account as if it was a normal child cgroup of the root cgroup with a weight value of 200.

Posted 2023-12-07 20:27 by 摩斯电码