Code Notes 6: Fixing hangs when using torch.dataparallel
Process finished with exit code 137 (interrupted by signal 9: SIGKILL)
I recently came across a situation where I needed to load many small images. My workstation has a 22-core CPU and four GPUs, so I ran four experiments with different random seeds, each on its own GPU. I found that running the four processes together took almost four times as long as running a single process, so there was no benefit from running them in parallel.
The model I train is relatively small, and the most time-consuming part is actually data loading. I tried many different approaches, including:
setting pin_memory = False/True
setting num_workers = 0/1/8
increasing ulimit
staggering the start of each experiment
Thanks to the system-level diagnosis by @vjorlikowski, we found the cause: with any of the num_workers settings we tried (0/1/8), each process tries to use all CPU cores, so the four processes compete viciously with each other for cores. The fix is to cap the number of CPU threads each process may use:
Use export OMP_NUM_THREADS=N, as described here
or use torch.set_num_threads(N), as described here
We set num_workers = 0 and N = 5 in our case: with 22 cores, four processes at 5 threads each use 20 threads in total and no longer oversubscribe the CPU. The estimated run time of my program dropped from 12 days to 1.5 days.
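A minimal sketch of how this thread cap might be applied at the top of each training script. The helper names threads_per_process and limit_cpu_threads are hypothetical, and the even split of cores across processes is an assumption (it happens to give N = 5 for 22 cores and four processes, matching the value above). The torch import is guarded so the snippet also runs where PyTorch is not installed.

```python
import os


def threads_per_process(total_cores: int, n_processes: int) -> int:
    """Hypothetical helper: split cores evenly, keeping at least one thread per process."""
    return max(1, total_cores // n_processes)


def limit_cpu_threads(n_threads: int) -> None:
    """Hypothetical helper: cap the CPU threads used by the current process."""
    # Set the environment variable first; OpenMP runtimes typically read it at start-up,
    # so this should run before the heavy numerical libraries are imported.
    os.environ["OMP_NUM_THREADS"] = str(n_threads)
    try:
        import torch
    except ImportError:
        # PyTorch is not available; only the environment variable is set.
        return
    # Limit the threads PyTorch uses for intra-op parallelism on the CPU.
    torch.set_num_threads(n_threads)


# 22 cores shared by four experiments, as in the setup described above.
N = threads_per_process(total_cores=22, n_processes=4)
limit_cpu_threads(N)
```

Calling limit_cpu_threads once near the start of each experiment should be enough. Setting the variable in the shell before launching, as in the first option above, has the same intent without touching the script.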