What to Do When CPU Usage Is Too High

CPU Usage Metrics

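user (usually abbreviated us): CPU time spent in user mode. Note that it excludes the nice time below but includes guest time.
nice (usually abbreviated ni): low-priority user-mode CPU time, that is, CPU time used by processes whose nice value has been adjusted to between 1 and 19. Note that nice values range from -20 to 19, and a larger value means a lower priority.
system (usually abbreviated sys; top shows it as sy): CPU time spent in kernel mode.
idle (usually abbreviated id): idle time. Note that it excludes time spent waiting for I/O (iowait).
iowait (usually abbreviated wa): CPU time spent waiting for I/O.
irq (usually abbreviated hi): CPU time spent handling hard interrupts.
softirq (usually abbreviated si): CPU time spent handling soft interrupts.
steal (usually abbreviated st): when the system runs inside a virtual machine, the CPU time taken by other virtual machines.
guest (usually abbreviated guest): time spent running other operating systems through virtualization, that is, CPU time spent running virtual machines.
guest_nice (usually abbreviated gnice): time spent running virtual machines at low priority.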
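
The fields above are cumulative counters, so a usage percentage only makes sense as the difference between two snapshots. The following is a minimal Python sketch of that calculation, assuming the field order of the aggregate "cpu" line in /proc/stat. The two sample lines are made-up values, not taken from a real machine, and the function names are illustrative only.

```python
# Field order of the aggregate "cpu" line in /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest", "guest_nice")


def parse_cpu_line(line):
    """Turn one 'cpu ...' line from /proc/stat into a {field: ticks} dict."""
    parts = line.split()
    return dict(zip(FIELDS, map(int, parts[1:])))


def cpu_percentages(before, after):
    """Percentage of CPU time spent in each state between two snapshots."""
    delta = {name: after.get(name, 0) - before.get(name, 0) for name in FIELDS}
    # guest and guest_nice are already counted inside user and nice,
    # so they are left out of the total to avoid double counting.
    total = sum(value for name, value in delta.items()
                if name not in ("guest", "guest_nice"))
    if total == 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * value / total for name, value in delta.items()}


def usage_percent(before, after):
    """Overall usage as 100 minus idle (iowait is treated as non-idle here)."""
    return 100.0 - cpu_percentages(before, after)["idle"]


# Hypothetical snapshots taken one second apart (made-up numbers).
SAMPLE_BEFORE = "cpu  1000 10 500 8000 50 0 20 0 0 0"
SAMPLE_AFTER = "cpu  1004 10 504 8191 51 0 20 0 0 0"

if __name__ == "__main__":
    a = parse_cpu_line(SAMPLE_BEFORE)
    b = parse_cpu_line(SAMPLE_AFTER)
    print(cpu_percentages(a, b))
    print(usage_percent(a, b))
```

Whether iowait counts as busy or idle is a reporting choice; this sketch treats only the idle field as idle. On a live system the two lines would come from reading /proc/stat twice with a pause in between.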

How to Check CPU Usage

top

top - 15:45:59 up 364 days, 20:43,  0 users,  load average: 0.00, 0.01, 0.00
Tasks: 139 total,   1 running,  95 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.8 us,  1.8 sy,  0.0 ni, 96.0 id,  0.2 wa,  0.0 hi,  0.2 si,  0.0 st
KiB Mem :  3514764 total,   179812 free,  1061072 used,  2273880 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  2100148 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                              
 9951 ubuntu    20   0  989840 108760  36424 S   1.3  3.1   0:23.99 node                                 
30257 root      20   0  588648  20356   4840 S   1.0  0.6 315:41.03 barad_agent                          
11399 root      20   0 1114904 151668  21872 S   0.7  4.3 535:40.93 YDService                            
 9995 ubuntu    20   0 1039160  68076  33532 S   0.3  1.9   0:55.91 node                                 
26555 ubuntu    20   0  108500   4476   3144 S   0.3  0.1   0:01.23 sshd                                 
26615 ubuntu    20   0  978144  89548  38244 S   0.3  2.5   0:08.44 node                                 
    1 root      20   0  225544   7596   4920 S   0.0  0.2  19:35.32 systemd

The third line, %Cpu(s), is the system's CPU usage.

Note, however, that top shows the average across all CPUs by default. Press the number key 1 to switch to per-CPU usage.
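
If the summary line needs to be consumed by a script, it can be split into its eight fields. Here is a small Python sketch, using the %Cpu(s) line from the top output above as input. The function names are illustrative, and the regular expression assumes the English, dot-decimal layout shown here.

```python
import re

# The summary line copied from the top output above.
TOP_CPU_LINE = ("%Cpu(s):  1.8 us,  1.8 sy,  0.0 ni, 96.0 id,  "
                "0.2 wa,  0.0 hi,  0.2 si,  0.0 st")


def parse_top_cpu(line):
    """Return {abbreviation: percent} for a top '%Cpu' summary line."""
    pairs = re.findall(r"([\d.]+)\s+([a-z]{2})\b", line)
    return {name: float(value) for value, name in pairs}


def usage_from_top(line):
    """Non-idle share: 100 minus the id field."""
    return 100.0 - parse_top_cpu(line)["id"]


if __name__ == "__main__":
    print(parse_top_cpu(TOP_CPU_LINE))
    print(usage_from_top(TOP_CPU_LINE))
```

The same parser should also work on the per-CPU lines that appear after pressing 1, since they use the same "value abbreviation" pairs.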

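Below the blank line is live per-process information. Each process has a %CPU column showing its CPU usage, which is the sum of its user-mode and kernel-mode CPU usage: the CPU used in the process's user space plus the CPU used in kernel space through system calls. In a virtualized environment it also includes the CPU spent running virtual machines, because guest time is counted as user time. Time spent waiting in the run queue is not part of %CPU; pidstat reports it separately as %wait.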

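At this point we can see that top does not break a process's CPU usage down into user mode and kernel mode. To see that per-process detail, use pidstat, a tool dedicated to analyzing per-process CPU usage.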

Checking CPU Usage per Process

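The pidstat command shows per-process CPU usage, including:

User-mode CPU usage (%usr);
Kernel-mode CPU usage (%system);
CPU usage for running virtual machines (%guest);
CPU wait percentage, the time spent waiting to run (%wait);
Total CPU usage (%CPU).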

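The Average section at the end of the output also reports the average over the 5 groups of data.

# Print one group of data every second, 5 groups in total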

pidstat 1 5

Linux 4.15.0-180-generic (VM-0-11-ubuntu)       08/04/2023      _x86_64_        (2 CPU)

03:48:38 PM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
03:48:39 PM   500      9995    0.00    0.99    0.00    0.00    0.99     0  node
03:48:39 PM     0     11399    0.00    0.99    0.00    0.00    0.99     0  YDService
03:48:39 PM     0     11521    0.00    0.99    0.00    0.00    0.99     1  sh
03:48:39 PM     0     30257    0.00    0.99    0.00    0.00    0.99     0  barad_agent

03:48:39 PM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
03:48:40 PM   500      9951    0.00    1.00    0.00    0.00    1.00     1  node
03:48:40 PM     0     11399    0.00    1.00    0.00    0.00    1.00     0  YDService
03:48:40 PM   500     16640    1.00    0.00    0.00    0.00    1.00     1  pidstat
03:48:40 PM     0     30257    1.00    0.00    0.00    0.00    1.00     0  barad_agent

03:48:40 PM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
03:48:41 PM   500      9995    1.00    0.00    0.00    0.00    1.00     0  node
03:48:41 PM   500     16640    0.00    1.00    0.00    0.00    1.00     1  pidstat

03:48:41 PM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
03:48:42 PM   111      8846    0.00    1.00    0.00    0.00    1.00     0  ntpd
03:48:42 PM     0     11399    1.00    0.00    0.00    0.00    1.00     0  YDService

03:48:42 PM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
03:48:43 PM     0      7059    0.00    1.00    0.00    0.00    1.00     1  YDLive
03:48:43 PM   500      9995    0.00    1.00    0.00    0.00    1.00     0  node
03:48:43 PM     0     11399    0.00    1.00    0.00    0.00    1.00     0  YDService
03:48:43 PM   500     26615    1.00    0.00    0.00    0.00    1.00     0  node

Average:      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
Average:        0      7059    0.00    0.20    0.00    0.00    0.20     -  YDLive
Average:      111      8846    0.00    0.20    0.00    0.00    0.20     -  ntpd
Average:      500      9951    0.00    0.20    0.00    0.00    0.20     -  node
Average:      500      9995    0.20    0.40    0.00    0.00    0.60     -  node
Average:        0     11399    0.20    0.60    0.00    0.00    0.80     -  YDService
Average:        0     11521    0.00    0.20    0.00    0.00    0.20     -  sh
Average:      500     16640    0.20    0.20    0.00    0.00    0.40     -  pidstat
Average:      500     26615    0.20    0.00    0.00    0.20    0.20     -  node
Average:        0     30257    0.20    0.20    0.00    0.00    0.40     -  barad_agent
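
To pick out the heaviest consumers from output like this, the Average rows can be parsed and sorted by %CPU. Below is a minimal Python sketch that uses the Average section above as sample input. The function names are hypothetical, and the parser assumes the column layout shown here, which may differ between sysstat versions.

```python
# The Average section copied from the pidstat output above.
PIDSTAT_AVERAGE = """\
Average:      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
Average:        0      7059    0.00    0.20    0.00    0.00    0.20     -  YDLive
Average:      111      8846    0.00    0.20    0.00    0.00    0.20     -  ntpd
Average:      500      9951    0.00    0.20    0.00    0.00    0.20     -  node
Average:      500      9995    0.20    0.40    0.00    0.00    0.60     -  node
Average:        0     11399    0.20    0.60    0.00    0.00    0.80     -  YDService
Average:        0     11521    0.00    0.20    0.00    0.00    0.20     -  sh
Average:      500     16640    0.20    0.20    0.00    0.00    0.40     -  pidstat
Average:      500     26615    0.20    0.00    0.00    0.20    0.20     -  node
Average:        0     30257    0.20    0.20    0.00    0.00    0.40     -  barad_agent
"""

NUMERIC = ("%usr", "%system", "%guest", "%wait", "%CPU")


def parse_pidstat_average(text):
    """Parse the 'Average:' rows of pidstat output into a list of dicts."""
    rows, header = [], None
    for line in text.splitlines():
        if not line.startswith("Average:"):
            continue
        cols = line.split()[1:]
        if cols and cols[0] == "UID":
            header = cols          # remember the column names
            continue
        if header is None or len(cols) < len(header):
            continue               # skip rows that do not match the header
        row = dict(zip(header, cols))
        for key in NUMERIC:
            row[key] = float(row[key])
        row["PID"] = int(row["PID"])
        rows.append(row)
    return rows


def top_consumers(rows, n=3):
    """Return the n rows with the highest average %CPU."""
    return sorted(rows, key=lambda r: r["%CPU"], reverse=True)[:n]


if __name__ == "__main__":
    for row in top_consumers(parse_pidstat_average(PIDSTAT_AVERAGE)):
        print(row["PID"], row["Command"], row["%usr"], row["%system"], row["%CPU"])
```

Comparing %usr with %system for the top rows then shows whether a process spends its time in its own code or in the kernel.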

Which Function Is Using the CPU

Use the system's perf tool.

There are two common ways to use perf to analyze CPU performance problems.

perf top

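The first common usage is perf top. Like top, it shows in real time the functions or instructions that consume the most CPU clock, so it can be used to find hot functions. Its interface looks like this: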

sudo perf top

Samples: 3K of event 'cpu-clock', Event count (approx.): 624550087
Overhead  Shared Object             Symbol
   4.77%  [kernel]                  [k] _raw_spin_unlock_irqrestore
   4.10%  perf                      [.] __symbols__insert
   3.19%  perf                      [.] d_print_comp_inner
   2.92%  perf                      [.] rb_next
   2.48%  [kernel]                  [k] __softirqentry_text_start
   2.45%  [kernel]                  [k] __do_page_fault
   2.12%  [kernel]                  [k] finish_task_switch
   1.62%  [kernel]                  [k] do_syscall_64
   1.58%  [kernel]                  [k] clear_page_erms
   1.33%  [kernel]                  [k] unmap_page_range
   1.33%  [kernel]                  [k] flush_tlb_mm_range
   1.20%  perf                      [.] d_print_comp
   1.03%  [kernel]                  [k] filemap_map_pages
   0.99%  [kernel]                  [k] copy_pte_range
   0.94%  [kernel]                  [k] kallsyms_expand_symbol.constprop.1
   0.94%  libc-2.27.so              [.] cfree

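The first line contains three values: the number of samples (Samples), the event type (event), and the total event count (Event count). In this example, perf collected about 3K samples of the cpu-clock event, and the total event count is 624550087.

Pay special attention to the sample count. If it is too small (say, only a dozen or so), the ordering and percentages below have little practical value.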

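Below that is table-style data. Each row has the following parts:

Overhead: the share of all samples attributed to this symbol, shown as a percentage.
Shared Object: the dynamic shared object (DSO) that contains the function or instruction, such as the kernel, a process name, a dynamic library name, or a kernel module name.
The marker in front of the symbol name gives the type of the shared object: [.] means a user-space executable or dynamic library, and [k] means kernel space.
Symbol: the symbol name, that is, the function name. When the function name is unknown, a hexadecimal address is shown instead.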

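Taking the output above as an example, the top entry is the kernel function _raw_spin_unlock_irqrestore at only 4.77%, followed by perf's own __symbols__insert at 4.10%. No single function dominates, which suggests the system has no CPU performance problem.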
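
When the list is long, it can help to total the overhead per shared object before looking at individual symbols. The following Python sketch does that for the first five rows of the perf top output above. The function names are illustrative, and the regular expression assumes the three-part row layout of perf top shown here (perf report adds a Command column, which this sketch does not handle).

```python
import re
from collections import defaultdict

# First five rows copied from the perf top output above.
PERF_TOP_SAMPLE = """\
   4.77%  [kernel]                  [k] _raw_spin_unlock_irqrestore
   4.10%  perf                      [.] __symbols__insert
   3.19%  perf                      [.] d_print_comp_inner
   2.92%  perf                      [.] rb_next
   2.48%  [kernel]                  [k] __softirqentry_text_start
"""

# overhead%, shared object, [type marker], symbol
LINE_RE = re.compile(r"^\s*([\d.]+)%\s+(\S+)\s+\[(.)\]\s+(.+?)\s*$")


def parse_perf_lines(text):
    """Return a list of (overhead, shared_object, kind, symbol) tuples."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            overhead, dso, kind, symbol = match.groups()
            entries.append((float(overhead), dso, kind, symbol))
    return entries


def overhead_by_object(entries):
    """Sum the overhead percentages per shared object."""
    totals = defaultdict(float)
    for overhead, dso, _kind, _symbol in entries:
        totals[dso] += overhead
    return dict(totals)


if __name__ == "__main__":
    entries = parse_perf_lines(PERF_TOP_SAMPLE)
    print(max(entries))                 # the single hottest symbol
    print(overhead_by_object(entries))  # totals per shared object
```

In this sample the hottest single symbol is in the kernel, while perf itself has the larger total across its three symbols, which is consistent with a mostly idle machine where the profiler is the main source of activity.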

perf record & perf report

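The second common usage is perf record together with perf report. Although perf top shows the system's performance information in real time, its drawback is that it does not save any data, so it cannot be used for offline or later analysis. perf record saves the data, and perf report then parses and displays what was saved.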

sudo perf record
^C[ perf record: Woken up 3 times to write data ]
[ perf record: Captured and wrote 1.373 MB perf.data (19206 samples) ]

sudo perf report

Samples: 19K of event 'cpu-clock', Event count (approx.): 4801500000
Overhead  Command       Shared Object        Symbol
  96.64%  swapper       [kernel.kallsyms]    [k] native_safe_halt
   0.14%  swapper       [kernel.kallsyms]    [k] _raw_spin_unlock_irqrestore
   0.11%  swapper       [kernel.kallsyms]    [k] __softirqentry_text_start
   0.09%  swapper       [kernel.kallsyms]    [k] finish_task_switch
   0.07%  barad_agent   python               [.] PyEval_EvalFrameEx
   0.05%  barad_agent   [kernel.kallsyms]    [k] _raw_spin_unlock_irqrestore
   0.04%  YDService     [kernel.kallsyms]    [k] _raw_spin_unlock_irqrestore
   0.04%  barad_agent   [kernel.kallsyms]    [k] __do_page_fault
   0.04%  barad_agent   [kernel.kallsyms]    [k] copy_page
   0.03%  barad_agent   [kernel.kallsyms]    [k] copy_pte_range
   0.03%  barad_agent   [kernel.kallsyms]    [k] unmap_page_range
   0.03%  barad_agent   python               [.] lookdict_string
   0.03%  node          [kernel.kallsyms]    [k] copy_pte_range

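In practice, we often add the -g option to perf top and perf record to enable call-graph sampling, which makes it easier to analyze a performance problem by following the call chain.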
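
For repeatable captures, the record and report steps can be scripted. The sketch below only builds the command lines and prints them; it does not run perf. It assumes the commonly used perf options -o, -g, -a, -p, -i and --stdio, and the helper names, the default duration, and the PID 9995 (borrowed from the node process in the earlier output) are purely illustrative.

```python
import shlex


def perf_record_cmd(pid=None, seconds=10, output="perf.data", call_graph=True):
    """Build a 'perf record' command that samples for a fixed duration."""
    cmd = ["perf", "record", "-o", output]
    if call_graph:
        cmd.append("-g")            # also record call chains
    if pid is None:
        cmd.append("-a")            # no PID given: sample all CPUs
    else:
        cmd += ["-p", str(pid)]     # sample one existing process
    # perf records until the trailing command exits, so 'sleep' sets the duration.
    cmd += ["--", "sleep", str(seconds)]
    return cmd


def perf_report_cmd(input_file="perf.data"):
    """Build a 'perf report' command that prints plain text to stdout."""
    return ["perf", "report", "-i", input_file, "--stdio"]


if __name__ == "__main__":
    # Printing instead of executing: perf usually needs root and may not be installed.
    print(shlex.join(perf_record_cmd(pid=9995, seconds=5)))
    print(shlex.join(perf_report_cmd()))
```

The printed commands can be run with sudo, or passed to subprocess.run once the options have been checked against the installed perf version.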

Summary

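CPU usage is the most intuitive and most commonly used system performance metric, and usually the first one we look at when troubleshooting a performance problem. So it is worth understanding what it means, and in particular telling apart the different kinds of CPU usage: user (%user), nice (%nice), system (%system), I/O wait (%iowait), hard interrupt (%irq), and soft interrupt (%softirq). For example: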

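High user CPU and nice CPU mean that user-mode processes are consuming a lot of CPU, so focus on the performance of those processes.
High system CPU means that kernel mode is consuming a lot of CPU, so focus on kernel threads or system calls.
High I/O wait CPU means that a lot of time is spent waiting for I/O, so check whether the storage system has an I/O problem.
High soft interrupt and hard interrupt CPU mean that interrupt handlers are consuming a lot of CPU, so focus on the interrupt service routines in the kernel.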
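
The four rules above can be written down as a small lookup. This is only a sketch: the 50% threshold is an arbitrary assumption chosen for illustration, the names are hypothetical, and the input is a dict keyed by the abbreviations that top uses.

```python
# Each group maps the top abbreviations it covers to the advice from the list above.
GROUPS = {
    "user+nice": (("us", "ni"), "check the performance of user-mode processes"),
    "system": (("sy",), "check kernel threads and system calls"),
    "iowait": (("wa",), "check the storage system for I/O problems"),
    "irq+softirq": (("hi", "si"), "check the interrupt handlers in the kernel"),
}


def triage(cpu, threshold=50.0):
    """Return (group, share, advice) for every group at or above the threshold.

    The threshold is an assumed value for illustration, not a recommended limit.
    """
    findings = []
    for name, (keys, advice) in GROUPS.items():
        share = sum(cpu.get(key, 0.0) for key in keys)
        if share >= threshold:
            findings.append((name, share, advice))
    return sorted(findings, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Values from the top output earlier in this post: nothing stands out.
    quiet = {"us": 1.8, "sy": 1.8, "ni": 0.0, "id": 96.0,
             "wa": 0.2, "hi": 0.0, "si": 0.2, "st": 0.0}
    print(triage(quiet))
    # Made-up values for a machine busy in user mode.
    busy = {"us": 70.0, "ni": 5.0, "sy": 10.0, "id": 12.0,
            "wa": 2.0, "hi": 0.0, "si": 1.0, "st": 0.0}
    print(triage(busy))
```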

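When you run into rising CPU usage, you can use tools such as top and pidstat to identify where the CPU problem comes from, and then use tools such as perf to track down the specific function causing it.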

posted @ 2023-08-04 16:15 观海云不远
