The PyTorch Profiler and Layer-by-Layer Profiling of PyTorch Models

PyTorch's autograd module includes a profiler that lets you inspect the cost of the different operators in your model — on both the CPU and the GPU.

There are currently two modes: a CPU-only mode implemented with profile, and an nvprof-based mode (which records both CPU and GPU activity) using emit_nvtx.

torch.autograd.profiler.profile(enabled=True, use_cuda=False, record_shapes=False)

A context manager that manages autograd profiler state and holds a summary of results. Under the hood, it just records events of functions being executed in C++ and exposes those events to Python. You can wrap any code in it, and it will only report the runtime of PyTorch functions.

Parameters:

enabled (bool, optional) – Setting this to False makes this context manager a no-op. Default: True.

use_cuda (bool, optional) – Enables timing of CUDA events using the cudaEvent API. Adds approximately 4us of overhead to each tensor operation. Default: False.

record_shapes (bool, optional) – If shape recording is enabled, information about input dimensions will be collected. This lets you see which shapes were used under the hood, and further group results by them using prof.key_averages(group_by_input_shape=True). Note that shape recording may skew your profiling data: for the bottom-most events (in the case of nested function calls) this is most likely negligible, but for higher-level functions the total self cpu time may be artificially inflated because of the shape collection.
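As a minimal sketch of the record_shapes option described above (the tensor sizes here are arbitrary):

```python
import torch

# Arbitrary example tensors; any sizes would do.
x = torch.randn(8, 16, requires_grad=True)
w = torch.randn(16, 4)

with torch.autograd.profiler.profile(record_shapes=True) as prof:
    y = x.mm(w)
    y.sum().backward()

# Grouping by input shape separates calls to the same operator
# that were made with different input sizes.
print(prof.key_averages(group_by_input_shape=True).table(
    sort_by="self_cpu_time_total"))
```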

Example

x = torch.randn((1, 1), requires_grad=True)
with torch.autograd.profiler.profile() as prof:
    for _ in range(100):  # any normal python code, really!
        y = x ** 2
        y.backward()
# NOTE: some columns were removed for brevity
print(prof.key_averages().table(sort_by="self_cpu_time_total"))

 

Results (without GPU):

------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
Name                                        Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls  Input Shapes                         
------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
pow                                         64.76%           3.096ms          64.76%           3.096ms          3.096ms          1                []                                   
struct torch::autograd::GraphRoot           0.37%            17.700us         0.37%            17.700us         17.700us         1                []                                   
PowBackward0                                23.10%           1.104ms          23.10%           1.104ms          1.104ms          1                []                                   
pow                                         1.37%            65.700us         1.37%            65.700us         65.700us         1                []                                   
mul                                         10.11%           483.100us        10.11%           483.100us        483.100us        1                []                                   
mul                                         0.13%            6.200us          0.13%            6.200us          6.200us          1                []                                   
struct torch::autograd::AccumulateGrad      0.14%            6.500us          0.14%            6.500us          6.500us          1                []                                   
detach                                      0.03%            1.500us          0.03%            1.500us          1.500us          1                []                                   
------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
Self CPU time total: 4.780ms

 

Results (with GPU):

------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
Name                                        Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  Input Shapes                         
------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
pow                                         29.13%           3.246ms          29.13%           3.246ms          3.246ms          31.62%           2.866ms          2.866ms          1                []                                   
struct torch::autograd::GraphRoot           0.09%            9.600us          0.09%            9.600us          9.600us          0.02%            2.048us          2.048us          1                []                                   
PowBackward0                                34.12%           3.803ms          34.12%           3.803ms          3.803ms          32.89%           2.982ms          2.982ms          1                []                                   
pow                                         8.53%            950.500us        8.53%            950.500us        950.500us        2.63%            238.592us        238.592us        1                []                                   
mul                                         16.06%           1.789ms          16.06%           1.789ms          1.789ms          19.44%           1.762ms          1.762ms          1                []                                   
mul                                         8.94%            996.700us        8.94%            996.700us        996.700us        10.73%           972.864us        972.864us        1                []                                   
struct torch::autograd::CopyBackwards       1.47%            163.900us        1.47%            163.900us        163.900us        1.31%            118.688us        118.688us        1                []                                   
to                                          1.40%            155.900us        1.40%            155.900us        155.900us        1.27%            114.944us        114.944us        1                []                                   
empty_strided                               0.09%            10.300us         0.09%            10.300us         10.300us         0.01%            1.023us          1.023us          1                []                                   
struct torch::autograd::AccumulateGrad      0.13%            15.000us         0.13%            15.000us         15.000us         0.06%            5.281us          5.281us          1                []                                   
detach                                      0.04%            4.700us          0.04%            4.700us          4.700us          0.02%            1.760us          1.760us          1                []                                   
------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
Self CPU time total: 11.144ms
CUDA time total: 9.066ms
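Beyond printing a table, the collected events can also be exported as a Chrome trace with export_chrome_trace() and browsed in chrome://tracing. A minimal sketch (the output file name here is just an example):

```python
import torch

x = torch.randn((1, 1), requires_grad=True)
with torch.autograd.profiler.profile() as prof:
    for _ in range(100):
        y = x ** 2
        y.backward()

# Write the events out as a Chrome trace; open the file in
# chrome://tracing to browse the timeline. The file name is arbitrary.
prof.export_chrome_trace("profile_trace.json")
```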

 

torch.autograd.profiler.record_function(name)

A context manager/function decorator that adds a label to a block of Python code (or a function) when running the autograd profiler. It is useful when tracing code profiles.

>>> x = torch.randn((1, 1), requires_grad=True)
>>> with torch.autograd.profiler.profile() as prof:
...     y = x ** 2
...     with torch.autograd.profiler.record_function("label-z"): # label the block
...         z = y ** 3
...     y.backward()
...
>>> # NOTE: some columns were removed for brevity
>>> print(prof.key_averages().table(sort_by="self_cpu_time_total"))
-----------------------------------  ---------------  ---------------  ---------------
Name                                 Self CPU total %  CPU time avg     Number of Calls
-----------------------------------  ---------------  ---------------  ---------------
pow                                  60.77%           47.470us         3
mul                                  21.73%           25.465us         2
PowBackward0                         12.03%           121.891us        1
torch::autograd::AccumulateGrad      2.70%            6.324us          1
label-z                              2.13%            12.421us         1
torch::autograd::GraphRoot           0.64%            1.503us          1
-----------------------------------  ---------------  ---------------  ---------------
Self CPU time total: 234.344us
CUDA time total: 0.000us
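record_function can also act as a function decorator, labeling every call of the wrapped function. A minimal sketch (the label "cube" and the decorated function are made up for illustration, and decorator support assumes a reasonably recent PyTorch version):

```python
import torch
from torch.autograd.profiler import profile, record_function

# "cube" is an arbitrary example label; every call to cube() will
# appear under that name in the profiler output.
@record_function("cube")
def cube(t):
    return t ** 3

x = torch.randn((1, 1), requires_grad=True)
with profile() as prof:
    y = x ** 2
    z = cube(y)
    y.backward()

print(prof.key_averages().table(sort_by="self_cpu_time_total"))
```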

 

torch.autograd.profiler.emit_nvtx(enabled=True, record_shapes=False)

A context manager that makes every autograd operation emit an NVTX range.

It is useful when running a program under nvprof:

nvprof --profile-from-start off -o trace_name.prof -- <regular command here>

Unfortunately, there is no way to force nvprof to flush the data it collected to disk, so for CUDA profiling one has to use this context manager to annotate nvprof traces and wait for the process to exit before inspecting them. Then, either NVIDIA Visual Profiler (nvvp) can be used to visualize the timeline, or torch.autograd.profiler.load_nvprof() can load the results for inspection, e.g. in a Python REPL.

>>> with torch.cuda.profiler.profile():
...     model(x) # Warmup CUDA memory allocator and profiler
...     with torch.autograd.profiler.emit_nvtx():
...         model(x)

torch.autograd.profiler.load_nvprof(path)

Opens an nvprof trace file and parses autograd annotations.
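A hedged sketch of loading such a trace back for inspection — it assumes trace_name.prof was produced by the nvprof command above, so the call is guarded by a file-existence check:

```python
import os
import torch

# "trace_name.prof" is the file produced by the nvprof command shown
# earlier; the guard makes this sketch safe to run without it.
if os.path.exists("trace_name.prof"):
    events = torch.autograd.profiler.load_nvprof("trace_name.prof")
    for evt in events:
        print(evt.name)
```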

 

Layer-by-Layer Profiling of PyTorch Models

The torchprof library can be used to profile a PyTorch model layer by layer:

pip install torchprof
import torch
import torchvision
import torchprof

model = torchvision.models.alexnet(pretrained=False).cuda()
x = torch.rand([1, 3, 224, 224]).cuda()

with torchprof.Profile(model, use_cuda=True) as prof:
    model(x)

print(prof.display(show_events=False)) # equivalent to `print(prof)` and `print(prof.display())`

 

Module         | Self CPU total | CPU total | CUDA total | Occurrences
---------------|----------------|-----------|------------|------------
AlexNet        |                |           |            |
├── features   |                |           |            |
│├── 0         |        1.671ms |   6.589ms |    6.701ms |           1
│├── 1         |       62.430us |  62.430us |   63.264us |           1
│├── 2         |       62.909us | 109.948us |  112.640us |           1
│├── 3         |      225.389us | 858.376us |    1.814ms |           1
│├── 4         |       18.999us |  18.999us |   19.456us |           1
│├── 5         |       29.560us |  52.720us |   54.272us |           1
│├── 6         |      136.959us | 511.216us |  707.360us |           1
│├── 7         |       18.480us |  18.480us |   18.624us |           1
│├── 8         |       84.380us | 300.700us |  590.688us |           1
│├── 9         |       18.249us |  18.249us |   17.632us |           1
│├── 10        |       81.289us | 289.946us |  470.016us |           1
│├── 11        |       17.850us |  17.850us |   18.432us |           1
│└── 12        |       29.350us |  52.260us |   52.288us |           1
├── avgpool    |       41.840us |  70.840us |   76.832us |           1
└── classifier |                |           |            |
 ├── 0         |       66.400us | 122.110us |  125.920us |           1
 ├── 1         |      293.658us | 293.658us |  664.704us |           1
 ├── 2         |       17.600us |  17.600us |   18.432us |           1
 ├── 3         |       27.920us |  49.030us |   51.168us |           1
 ├── 4         |       40.590us |  40.590us |  208.672us |           1
 ├── 5         |       17.570us |  17.570us |   18.432us |           1
 └── 6         |       40.489us |  40.489us |   81.920us |           1

To view the low-level operations that occur within each layer, use prof.display(show_events=True):

Module                        | Self CPU total | CPU total | CUDA total | Occurrences
------------------------------|----------------|-----------|------------|------------
AlexNet                       |                |           |            |
├── features                  |                |           |            |
│├── 0                        |                |           |            |
││├── conv2d                  |       13.370us |   1.671ms |    1.698ms |           1
││├── convolution             |       12.730us |   1.658ms |    1.685ms |           1
││├── _convolution            |       30.660us |   1.645ms |    1.673ms |           1
││├── contiguous              |        6.970us |   6.970us |    7.136us |           1
││└── cudnn_convolution       |        1.608ms |   1.608ms |    1.638ms |           1
│├── 1                        |                |           |            |
││└── relu_                   |       62.430us |  62.430us |   63.264us |           1
│├── 2                        |                |           |            |
││├── max_pool2d              |       15.870us |  62.909us |   63.488us |           1
││└── max_pool2d_with_indices |       47.039us |  47.039us |   49.152us |           1
...

The raw PyTorch event lists can be obtained by calling raw() on the profile instance.

trace, event_lists_dict = prof.raw()
print(trace[2])
# Trace(path=('AlexNet', 'features', '0'), leaf=True, module=Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2)))

print(event_lists_dict[trace[2].path][0])
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Name                   Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  Input Shapes
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
conv2d                 0.80%            13.370us         100.00%          1.671ms          1.671ms          25.34%           1.698ms          1.698ms          1                []
convolution            0.76%            12.730us         99.20%           1.658ms          1.658ms          25.15%           1.685ms          1.685ms          1                []
_convolution           1.83%            30.660us         98.44%           1.645ms          1.645ms          24.97%           1.673ms          1.673ms          1                []
contiguous             0.42%            6.970us          0.42%            6.970us          6.970us          0.11%            7.136us          7.136us          1                []
cudnn_convolution      96.19%           1.608ms          96.19%           1.608ms          1.608ms          24.44%           1.638ms          1.638ms          1                []
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Self CPU time total: 1.671ms
CUDA time total: 6.701ms

Layers can be selected individually using the optional paths kwarg; profiling of all other layers is then ignored.

model = torchvision.models.alexnet(pretrained=False)
x = torch.rand([1, 3, 224, 224])

# Layer does not have to be a leaf layer
paths = [("AlexNet", "features", "3"), ("AlexNet", "classifier")]

with torchprof.Profile(model, paths=paths) as prof:
    model(x)

print(prof)

 

Module         | Self CPU total | CPU total | CUDA total | Occurrences
---------------|----------------|-----------|------------|------------
AlexNet        |                |           |            |
├── features   |                |           |            |
│├── 0         |                |           |            |
│├── 1         |                |           |            |
│├── 2         |                |           |            |
│├── 3         |        3.189ms |  12.717ms |    0.000us |           1
│├── 4         |                |           |            |
│├── 5         |                |           |            |
│├── 6         |                |           |            |
│├── 7         |                |           |            |
│├── 8         |                |           |            |
│├── 9         |                |           |            |
│├── 10        |                |           |            |
│├── 11        |                |           |            |
│└── 12        |                |           |            |
├── avgpool    |                |           |            |
└── classifier |       13.403ms |  14.011ms |    0.000us |           1
 ├── 0         |                |           |            |
 ├── 1         |                |           |            |
 ├── 2         |                |           |            |
 ├── 3         |                |           |            |
 ├── 4         |                |           |            |
 ├── 5         |                |           |            |
 └── 6         |                |           |            |

References:

https://pytorch.org/docs/stable/autograd.html#profiler

https://github.com/awwong1/torchprof

 

posted on 2020-07-06 17:55 by 那抹阳光1994