YOLOX (ByteTrack) multi-GPU training hangs at initialization (dist.barrier): a workaround
https://github.com/Megvii-BaseDetection/YOLOX/issues/1289#issuecomment-1409988436
Hey guys. I found a workaround in my case.
(please correct me if I'm mistaken😉)
Only allow IP socket communication
Try setting these variables, either in the launch script or in Python via os.environ[XXX] = ...
export NCCL_LL_THRESHOLD=0
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
The NCCL_P2P_DISABLE variable disables the peer to peer (P2P) transport, which uses CUDA direct access between GPUs, using NVLink or PCI.
The NCCL_IB_DISABLE variable disables the IB/RoCE transport that NCCL would otherwise use. Instead, NCCL will fall back to using IP sockets.
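For the os.environ route, here is a minimal sketch. The helper name apply_nccl_workaround is my own and not part of YOLOX or ByteTrack. I am also assuming the variables have to be in the environment before NCCL is initialized, so the call should come before dist.init_process_group runs.

```python
import os

# Values copied from the exports above. Assumption: they must be set
# before NCCL is initialized (before dist.init_process_group is called).
NCCL_WORKAROUND_ENV = {
    "NCCL_LL_THRESHOLD": "0",
    "NCCL_P2P_DISABLE": "1",
    "NCCL_IB_DISABLE": "1",
}


def apply_nccl_workaround(env=None):
    """Write the workaround variables into env (default: os.environ).

    Hypothetical helper for illustration only.
    """
    target = os.environ if env is None else env
    target.update(NCCL_WORKAROUND_ENV)
    return target


if __name__ == "__main__":
    apply_nccl_workaround()
    for name in NCCL_WORKAROUND_ENV:
        print(name, "=", os.environ[name])
```

Calling apply_nccl_workaround() at the very top of the training entry script, before any worker process is created, should let the workers inherit the same settings.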
Explanation
For distributed training, the subprocesses are always initialized over TCP by dist.init_process_group, so falling back to IP communication seems reasonable.
My colleague told me that other machines may not need this kind of disabling and falling back to TCP. I do not know why, either.
Caution
NCCL_LL_THRESHOLD is often set to zero. I don't know why.
export NCCL_LL_THRESHOLD=0
Caution ❗ These settings may affect model performance.
https://github.com/NVIDIA/nccl/issues/369#issue-678319427
Change of start method
In this commit and before, multi-GPU subprocesses are started by launch_by_subprocess, which calls subprocess.Popen.
I use YOLOX through ByteTrack, which still uses the old way of starting multiple processes:
launch_by_subprocess(
sys.argv,
world_size,
num_machines,
machine_rank,
num_gpus_per_machine,
dist_url,
args,
)
In subsequent commits, the start method was changed to mp.start_processes, and there may also be other related but less visible changes.
I haven't checked whether switching to a later version would directly fix the problem.❤️
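To illustrate the subprocess.Popen style of launching, here is a rough sketch using only the standard library. It is not the actual launch_by_subprocess from YOLOX. The function name launch_workers, the RANK / WORLD_SIZE / DIST_URL variable names, and the placeholder worker command are all my own assumptions. It only shows that environment variables handed to Popen reach each per-rank child process.

```python
import os
import subprocess
import sys


def launch_workers(world_size, dist_url, extra_env=None):
    """Start one child process per rank and collect what each prints.

    Hypothetical helper for illustration; not YOLOX's launch_by_subprocess.
    """
    # Placeholder worker: a real launcher would run the training script here.
    worker_code = (
        "import os; "
        "print(os.environ['RANK'], os.environ.get('NCCL_P2P_DISABLE'))"
    )
    procs = []
    for rank in range(world_size):
        env = dict(os.environ)        # children inherit the parent's variables
        env.update(extra_env or {})   # e.g. the NCCL_* workaround settings
        env["RANK"] = str(rank)
        env["WORLD_SIZE"] = str(world_size)
        env["DIST_URL"] = dist_url
        procs.append(
            subprocess.Popen(
                [sys.executable, "-c", worker_code],
                env=env,
                stdout=subprocess.PIPE,
                text=True,
            )
        )
    # Wait for every child and return its output, ordered by rank.
    return [p.communicate()[0].strip() for p in procs]


if __name__ == "__main__":
    # The address below is only an example value.
    print(launch_workers(2, "tcp://127.0.0.1:29500", {"NCCL_P2P_DISABLE": "1"}))
```

With this style of launch, exporting the NCCL variables in the shell or setting them in the parent's os.environ before launching should be enough for every worker to see them.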
I'm not familiar with CUDA or NCCL.
However, I think this workaround makes sense, in that it addresses the communication between GPUs, and the bug lies in the SYNCHRONIZATION.😃
Author: JoyFrank
Source: https://www.cnblogs.com/zxyfrank/p/17079835.html
License: This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.