
YOLOX (ByteTrack) multi-GPU training hangs at initialization, dist.barrier: a workaround

https://github.com/Megvii-BaseDetection/YOLOX/issues/1289#issuecomment-1409988436

Hey guys. I found a workaround in my case.

(please correct me if I'm mistaken 😉)

only allow IP socket communication

Try setting these variables, either in the launch script or in Python with os.environ[XXX] = ...

export NCCL_LL_THRESHOLD=0
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1

The NCCL_P2P_DISABLE variable disables the peer to peer (P2P) transport, which uses CUDA direct access between GPUs, using NVLink or PCI.

The NCCL_IB_DISABLE variable disables the IB/RoCE transport that is to be used by NCCL. Instead, NCCL will fall back to using IP sockets.
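For the os.environ route, a minimal sketch could look like the following. The helper name apply_nccl_workaround is hypothetical, and the sketch assumes it runs at the very top of the training entry script, before any NCCL communicator is created.

```python
import os

# The three variables from the workaround above.
# Environment values must be strings, not integers.
NCCL_WORKAROUND = {
    "NCCL_LL_THRESHOLD": "0",
    "NCCL_P2P_DISABLE": "1",
    "NCCL_IB_DISABLE": "1",
}


def apply_nccl_workaround(env=os.environ):
    """Write the workaround variables into the given environment mapping.

    Hypothetical helper: assumed to be called before torch.distributed
    is initialized, so that NCCL sees the values when it starts up.
    """
    for key, value in NCCL_WORKAROUND.items():
        env[key] = value
    return env


apply_nccl_workaround()
```

Child processes started afterwards inherit these values, so setting them once in the parent process should be enough in the usual case.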

explanation

For distributed training, every subprocess is initialized through dist.init_process_group over TCP.

So, IP communication seems reasonable.
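As a rough illustration of that initialization step, here is a sketch of a worker that joins a process group through a tcp:// URL and then waits at dist.barrier. The helper names (find_free_port, make_dist_url, init_worker) are hypothetical and not taken from YOLOX, and the sketch assumes a single machine with one GPU per rank.

```python
import socket


def find_free_port():
    """Ask the OS for a currently unused TCP port on this machine."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("", 0))
        return sock.getsockname()[1]


def make_dist_url(host="127.0.0.1"):
    """Build a tcp:// URL usable as init_method for init_process_group."""
    return f"tcp://{host}:{find_free_port()}"


def init_worker(rank, world_size, dist_url):
    """Join the process group and synchronize once (hypothetical worker)."""
    # Imported lazily so the URL helpers above work without PyTorch.
    import torch
    import torch.distributed as dist

    torch.cuda.set_device(rank)  # assumption: one GPU per rank, one machine
    dist.init_process_group(
        backend="nccl",
        init_method=dist_url,  # the rendezvous happens over TCP
        world_size=world_size,
        rank=rank,
    )
    # The barrier is a collective, so it runs over whichever transport
    # NCCL selects (P2P, IB, or sockets); this is where the hang showed up.
    dist.barrier()


dist_url = make_dist_url()
```

Every rank has to receive the same dist_url, so in practice the URL is built once in the parent process and passed to all workers.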

My colleague told me that other machines may not need this kind of disabling and falling back to TCP. I don't know why, either.

caution

NCCL_LL_THRESHOLD is often set to zero. I don't know why.

export NCCL_LL_THRESHOLD=0

Caution ❗ These variables may affect model performance.

https://github.com/NVIDIA/nccl/issues/369#issue-678319427

change of start method

In this commit and earlier, multi-GPU subprocesses are started by launch_by_subprocess, which calls subprocess.Popen.

I use the YOLOX bundled with ByteTrack, which still starts its processes the old way:

launch_by_subprocess(
        sys.argv,
        world_size,
        num_machines,
        machine_rank,
        num_gpus_per_machine,
        dist_url,
        args,
    )
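To show how a Popen-based launcher interacts with the environment variables above, here is a small standalone sketch. It is not the YOLOX implementation: launch_workers, CHILD_CODE, and the RANK variable are hypothetical names chosen for illustration, and the child processes only print what they received instead of training anything.

```python
import os
import subprocess
import sys

# Stand-in for the real training script: each child just reports the
# rank and the NCCL setting it was handed through its environment.
CHILD_CODE = "import os; print(os.environ['RANK'], os.environ['NCCL_P2P_DISABLE'])"


def launch_workers(world_size):
    """Start one child process per rank with subprocess.Popen."""
    procs = []
    for rank in range(world_size):
        env = os.environ.copy()        # children start from the parent's env
        env["RANK"] = str(rank)        # hypothetical per-process variable
        env["NCCL_P2P_DISABLE"] = "1"  # workaround variables from above
        env["NCCL_IB_DISABLE"] = "1"
        procs.append(
            subprocess.Popen(
                [sys.executable, "-c", CHILD_CODE],
                env=env,
                stdout=subprocess.PIPE,
                text=True,
            )
        )
    # Collect each child's output in rank order.
    return [proc.communicate()[0].strip() for proc in procs]


outputs = launch_workers(2)
```

Because each child gets a copy of the environment, variables exported in the shell or set in the parent before launching reach every worker.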

In later commits, the launcher switches to mp.start_processes, and there may be other related, less visible changes.

I haven't checked whether switching to a later version would fix the problem directly.❤️

I'm not familiar with CUDA or NCCL.

However, I think this workaround makes sense, in that it targets the communication between GPUs, and the bug lies in the SYNCHRONIZATION.😃

posted @ 2023-01-31 17:02  ZXYFrank