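PyTorch distributed training error: Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 35000

I had been using the fairly old torch 1.8.1. After switching to torch 2.0, training failed with the error "rank 1 and rank 0 both on CUDA device 35000".

Replace the initialization at the start of the main function: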

distributed.init_process_group(backend='nccl', init_method='env://')
device_id, device = opts.local_rank, torch.device(opts.local_rank)
rank, world_size = distributed.get_rank(), distributed.get_world_size()
torch.cuda.set_device(device_id)

with:

torch.distributed.init_process_group("nccl")
rank, world_size = distributed.get_rank(), distributed.get_world_size()
device_id = rank % torch.cuda.device_count()
device = torch.device(device_id)

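This resolves the error: the device index is now derived from the process rank rather than from opts.local_rank, so each process on the node gets its own GPU.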
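One plausible cause (an assumption, not confirmed in this post) is that under the torch 2.0 launcher every process ended up with the same opts.local_rank, so all ranks selected the same GPU. The sketch below shows one way to choose the device index that does not depend on a command-line argument: it prefers the LOCAL_RANK environment variable, which torchrun exports for each worker, and otherwise falls back to rank % device_count as in the replacement code above. The names pick_device_index and init_distributed are hypothetical helpers introduced only for illustration.

```python
import os


def pick_device_index(rank, device_count, env=None):
    """Return the GPU index this process should use.

    Prefers LOCAL_RANK (exported by torchrun for each worker); otherwise
    falls back to rank % device_count, as in the replacement code above.
    """
    env = os.environ if env is None else env
    if device_count <= 0:
        raise ValueError("no CUDA devices visible to this process")
    local_rank = env.get("LOCAL_RANK")
    if local_rank is not None:
        return int(local_rank)
    return rank % device_count


def init_distributed():
    """Hypothetical helper: initialize NCCL and bind this process to one GPU."""
    # Imported lazily so pick_device_index can be used without torch installed.
    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="nccl")
    rank, world_size = dist.get_rank(), dist.get_world_size()
    device_id = pick_device_index(rank, torch.cuda.device_count())
    # Bind the process to its GPU so later CUDA and NCCL calls use it by default.
    torch.cuda.set_device(device_id)
    device = torch.device("cuda", device_id)
    return rank, world_size, device
```

Calling init_distributed() at the start of main would stand in for the initialization lines above. Whether the explicit torch.cuda.set_device call is needed depends on how the rest of the script places tensors, so treat it as an optional safeguard.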

posted @ 2023-09-05 22:29 脂环
