Code Notes 6: Fixing a Hang When Using torch.nn.DataParallel

1

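Here is roughly what happened: I was training with torch.nn.DataParallel, and the run would hang, raising no error and showing no progress at all. The hang can appear right at the start of training, or when training is restarted. Killing the process produces: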

Process finished with exit code 137 (interrupted by signal 9: SIGKILL)

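I then checked GPU usage. The primary GPU (the one the model is loaded onto) already had some memory in use, which means the model had been loaded. The other GPUs used for parallel training showed almost no memory usage, which means no data had reached them, so the problem most likely lies in the DataLoader.
There are plenty of suggested fixes online, but none of them helped in my case; have a look at [1] in case any of them apply to you.
Then I came across a GitHub issue [2], tried the various suggestions in it, and found one that solved my problem.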
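
The GPU memory check described above can also be done from Python rather than from an external monitoring tool. The following is a minimal sketch, not what I originally ran: the helper name gpu_memory_report is made up for illustration, and it assumes a PyTorch version that provides torch.cuda.mem_get_info. The numbers are device-wide, so they include memory held by other processes.

```python
import torch


def gpu_memory_report():
    """Return (device_index, used_MiB, total_MiB) for every visible GPU.

    gpu_memory_report is a hypothetical helper name, not a PyTorch API.
    """
    report = []
    for idx in range(torch.cuda.device_count()):
        # mem_get_info returns (free_bytes, total_bytes) for the whole device.
        free_bytes, total_bytes = torch.cuda.mem_get_info(idx)
        used_mib = (total_bytes - free_bytes) / 2**20
        report.append((idx, used_mib, total_bytes / 2**20))
    return report


report = gpu_memory_report()
for idx, used, total in report:
    print(f"cuda:{idx}: {used:.0f} / {total:.0f} MiB in use")
```

If the primary GPU shows memory in use while the others stay close to idle, that matches the symptom described above. On a machine without CUDA the report is simply empty.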

I recently came across a situation where I need to load many small images. My workstation has a CPU with 22 cores and four GPUs, so I run four experiments with different random seeds, and each experiment uses one separate GPU. I found that the run time of four processes is almost four times the run time of a single process (no parallel benefit).

The model I train is relatively small, and the most time-consuming part actually comes from data loading. I have tried many different approaches, including:

pin_memory =False/True
num_workers = 0/1/8
Increase ulimit
staggering the start of each experiment

Thanks to the system-level diagnosis by @vjorlikowski, we found out that if we set num_workers = 0/1/8, each process will try to use all CPU cores and viciously compete with the others for them.

Solution:
Use export OMP_NUM_THREADS=N, as described here
or use torch.set_num_threads(N), as described here
We set num_workers = 0 and N=5 in our case, as we have 22 cores. The estimated run time of my program is reduced from 12 days to 1.5 days.
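
The environment-variable variant can be sketched in Python as follows. This is only an illustration: the quoted comment does not say how N=5 was chosen, so the integer division below is an assumption that happens to give the same value for 22 cores and four processes, and threads_per_process is a made-up helper name.

```python
import os


def threads_per_process(total_cores: int, num_processes: int) -> int:
    """Split the cores evenly between processes, rounding down (assumed rule)."""
    return max(1, total_cores // num_processes)


# 22 cores shared by four experiments, as in the quoted comment.
N = threads_per_process(total_cores=22, num_processes=4)

# OMP_NUM_THREADS is read when the OpenMP runtime starts, so set it before
# importing torch (or export it in the shell before launching the script).
os.environ["OMP_NUM_THREADS"] = str(N)
print(f"OMP_NUM_THREADS={os.environ['OMP_NUM_THREADS']}")
```

Setting the variable in the launching shell with export OMP_NUM_THREADS=N has the same effect and does not depend on import order.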

This is the line that fixed the hang for me:

torch.set_num_threads(N)
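
For context, here is a minimal sketch of where that call could go in a training script. It is not my actual training code: the tiny linear model, the random tensors, and N = 5 are placeholders chosen for illustration, and num_workers=0 follows the setting from the quoted comment.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

N = 5  # placeholder value; choose it based on the cores available per process
torch.set_num_threads(N)  # limit intra-op CPU threads before any real work starts

# Placeholder data: 64 random samples with 8 features and binary labels.
dataset = TensorDataset(torch.randn(64, 8), torch.randint(0, 2, (64,)))
loader = DataLoader(dataset, batch_size=16, num_workers=0)

model = nn.Linear(8, 2)  # placeholder model
if torch.cuda.is_available():
    # DataParallel replicates the model across all visible GPUs.
    model = nn.DataParallel(model.cuda())
device = next(model.parameters()).device

for inputs, targets in loader:
    outputs = model(inputs.to(device))
    loss = nn.functional.cross_entropy(outputs, targets.to(device))

print(torch.get_num_threads(), loss.item())
```

The call is placed at the top of the script so that the limit is already in effect when the DataLoader and the model are created.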

References

[1] https://3water.com/article/8MTM21NDY22Ljg4
[2] https://github.com/pytorch/pytorch/issues/1355

posted @ 2022-05-07 15:53  The1912