How to modify the MoCo paper code for single-machine multi-GPU training (using torchrun)
Explanation of the main changes
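The code released with Kaiming He's paper Momentum Contrast for Unsupervised Visual Representation Learning is already quite concise, but running it directly for single-machine multi-GPU training is a little cumbersome. I therefore replaced the original approach, which creates the distributed processes by hand with torch.multiprocessing.spawn, with torchrun, which launches the distributed job for us and makes single-machine multi-GPU training simpler to run.
The file that changes most is main_moco.py.
The following input arguments can simply be deleted. With torchrun, world_size and rank are available from environment variables and do not have to be set by hand (world_size = int(os.environ["WORLD_SIZE"]), rank = int(os.environ["RANK"])). dist-url is not needed for single-machine training, and for the dist-backend communication backend the official PyTorch documentation recommends "nccl" for GPU training, so nccl is used as the fixed default: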
parser.add_argument(
    "--world-size",
    default=-1,
    type=int,
    help="number of nodes for distributed training",
)
parser.add_argument(
    "--rank", default=-1, type=int, help="node rank for distributed training"
)
parser.add_argument(
    "--dist-url",
    default="tcp://224.66.41.62:23456",
    type=str,
    help="url used to set up distributed training",
)
parser.add_argument(
    "--dist-backend", default="nccl", type=str, help="distributed backend"
)
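As a side note on the environment variables mentioned above: torchrun exports `WORLD_SIZE`, `RANK`, and `LOCAL_RANK` to every process it launches, and they arrive as strings. The sketch below shows one way to read them; the helper name `read_torchrun_env` and the single-process fallback values are assumptions made for illustration and are not part of the MoCo code.

```python
import os


def read_torchrun_env(environ=None):
    """Return (world_size, rank, local_rank) as ints.

    Hypothetical helper: falls back to a single-process layout when the
    variables are absent, i.e. when the script was not started by torchrun.
    """
    env = os.environ if environ is None else environ
    # Environment variables are strings, so convert them explicitly.
    world_size = int(env.get("WORLD_SIZE", 1))
    rank = int(env.get("RANK", 0))
    local_rank = int(env.get("LOCAL_RANK", 0))
    return world_size, rank, local_rank


if __name__ == "__main__":
    print(read_torchrun_env())
```

On a single machine `RANK` and `LOCAL_RANK` coincide, which is why the modified script can use `LOCAL_RANK` as the GPU index.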
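Next, create a process-group initialization function ddp_setup_torchrun() and call it at the very start of main(); the per-process environment variables set by torchrun can then be read from os.environ. (The rest of the original main() mostly checks whether distributed training is enabled and uses torch.multiprocessing.spawn to start the processes. Since I assume distributed training launched by torchrun, I deleted all of that. The complete code further down shows exactly what was removed, so I will not list it here.)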
def ddp_setup_torchrun():
    dist.init_process_group(backend="nccl")


def main():
    args = parser.parse_args()
    ddp_setup_torchrun()
    args.world_size = int(os.environ["WORLD_SIZE"])
    args.gpu = int(os.environ["LOCAL_RANK"])
    args.rank = int(os.environ["RANK"])
Place the model on each process's GPU and wrap it in DistributedDataParallel:
torch.cuda.set_device(args.gpu)  # bind this process to its own GPU; otherwise every process also allocates memory on GPU 0
torch.cuda.empty_cache()
model.cuda()
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])
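Shard the dataset across the processes. Note the DataLoader's shuffle argument, which is set in the usual way for distributed training: when a DistributedSampler is passed, the DataLoader must not also be given shuffle=True (the sampler and shuffle options are mutually exclusive), and it does not need to be, because DistributedSampler shuffles by default (shuffle=True). In effect the DataLoader uses a sampler that already does the random sampling. Since train_sampler is never None here, shuffle=(train_sampler is None) evaluates to False: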
train_dataset = datasets.ImageFolder(
    traindir, moco.loader.TwoCropsTransform(transforms.Compose(augmentation))
)
train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
train_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=args.batch_size,
    shuffle=(train_sampler is None),
    num_workers=args.workers,
    pin_memory=True,
    sampler=train_sampler,
    drop_last=True,
)
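To make the sharding concrete, here is a minimal standard-library emulation of what a DistributedSampler-style sampler does: shuffle the indices identically on every rank, pad so the total divides evenly, then give each rank a strided slice. This is a sketch, not PyTorch's implementation; the function name `shard_indices` is hypothetical, Python's `random` stands in for the torch generator, and it assumes the dataset has at least as many samples as there are processes.

```python
import math
import random


def shard_indices(dataset_len, world_size, rank, epoch=0, seed=0):
    """Emulate DistributedSampler-style sharding (illustrative only)."""
    indices = list(range(dataset_len))
    # Every rank uses the same seed, so every rank sees the same permutation.
    random.Random(seed + epoch).shuffle(indices)
    per_rank = math.ceil(dataset_len / world_size)
    total = per_rank * world_size
    # Pad by repeating leading samples so that all shards have equal length
    # (assumes dataset_len >= world_size).
    indices += indices[: total - dataset_len]
    # Rank r takes elements r, r + world_size, r + 2 * world_size, ...
    return indices[rank:total:world_size]


if __name__ == "__main__":
    for r in range(2):
        print(r, shard_indices(10, world_size=2, rank=r))
```

Each process therefore iterates over only its own share of the dataset, and the shards together cover every sample.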
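Train the model. Remember to call train_sampler.set_epoch(epoch) at the start of every epoch: DistributedSampler seeds its shuffle with its seed plus the epoch number, so without this call every epoch would reuse the same ordering. It also means the ordering for a given epoch is deterministic, so repeated training runs see the same data in the same epoch: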
for epoch in range(args.start_epoch, args.epochs):
    train_sampler.set_epoch(epoch)
    adjust_learning_rate(optimizer, epoch, args)
    # train for one epoch
    train(train_loader, model, criterion, optimizer, epoch, args)
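The effect of `set_epoch` can be seen with a toy sampler that seeds its shuffle with `seed + epoch`. This is a sketch under stated assumptions: `EpochSeededSampler` is a hypothetical stand-in written with Python's `random` module, not the PyTorch class, and it leaves out the per-rank sharding.

```python
import random


class EpochSeededSampler:
    """Toy sampler whose shuffle order depends only on seed + epoch."""

    def __init__(self, dataset_len, seed=0):
        self.dataset_len = dataset_len
        self.seed = seed
        self.epoch = 0

    def set_epoch(self, epoch):
        self.epoch = epoch

    def __iter__(self):
        order = list(range(self.dataset_len))
        random.Random(self.seed + self.epoch).shuffle(order)
        return iter(order)


if __name__ == "__main__":
    sampler = EpochSeededSampler(8)
    print(list(sampler))  # epoch 0
    print(list(sampler))  # still epoch 0: identical order
    sampler.set_epoch(1)
    print(list(sampler))  # a different permutation
```

Iterating twice without calling `set_epoch` yields the same order, while a second sampler set to the same epoch reproduces the first one's order exactly.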
main_moco_torchrun.py
import argparse
# import builtins
import math
import os
import random
import shutil
import time
import warnings
import moco.builder
import moco.loader
import torch
import torch.backends.cudnn as cudnn
import torch.distributed as dist
# import torch.multiprocessing as mp
import torch.nn as nn
import torch.nn.parallel
import torch.optim
import torch.utils.data
import torch.utils.data.distributed
import torchvision.datasets as datasets
import torchvision.models as models
import torchvision.transforms as transforms
model_names = sorted(
    name
    for name in models.__dict__
    if name.islower() and not name.startswith("__") and callable(models.__dict__[name])
)
parser = argparse.ArgumentParser(description="PyTorch ImageNet Training")
parser.add_argument("data", metavar="DIR", help="path to dataset")
parser.add_argument("--save_path", help="Save the checkpoint file.")
parser.add_argument(
    "-a",
    "--arch",
    metavar="ARCH",
    default="resnet50",
    choices=model_names,
    help="model architecture: " + " | ".join(model_names) + " (default: resnet50)",
)
parser.add_argument(
    "-j",
    "--workers",
    default=24,
    type=int,
    metavar="N",
    help="number of data loading workers (default: 24)",
)
parser.add_argument(
    "--epochs", default=200, type=int, metavar="N", help="number of total epochs to run"
)
parser.add_argument(
    "--start-epoch",
    default=0,
    type=int,
    metavar="N",
    help="manual epoch number (useful on restarts)",
)
parser.add_argument(
    "-b",
    "--batch-size",
    default=64,
    type=int,
    metavar="N",
    help="mini-batch size per GPU (default: 64); each torchrun "
    "process passes this value directly to its own DataLoader",
)
parser.add_argument(
    "--lr",
    "--learning-rate",
    default=0.03,
    type=float,
    metavar="LR",
    help="initial learning rate",
    dest="lr",
)
parser.add_argument(
    "--schedule",
    default=[120, 160],
    nargs="*",
    type=int,
    help="learning rate schedule (when to drop lr by 10x)",
)
parser.add_argument(
    "--momentum", default=0.9, type=float, metavar="M", help="momentum of SGD solver"
)
parser.add_argument(
    "--wd",
    "--weight-decay",
    default=1e-4,
    type=float,
    metavar="W",
    help="weight decay (default: 1e-4)",
    dest="weight_decay",
)
parser.add_argument(
    "-p",
    "--print-freq",
    default=10,
    type=int,
    metavar="N",
    help="print frequency (default: 10)",
)
parser.add_argument(
    "--resume",
    default="",
    type=str,
    metavar="PATH",
    help="path to latest checkpoint (default: none)",
)
parser.add_argument(
    "--seed", default=None, type=int, help="seed for initializing training. "
)

# moco specific configs:
parser.add_argument(
    "--moco-dim", default=128, type=int, help="feature dimension (default: 128)"
)
parser.add_argument(
    "--moco-k",
    default=65536,
    type=int,
    help="queue size; number of negative keys (default: 65536)",
)
parser.add_argument(
    "--moco-m",
    default=0.999,
    type=float,
    help="moco momentum of updating key encoder (default: 0.999)",
)
parser.add_argument(
    "--moco-t", default=0.07, type=float, help="softmax temperature (default: 0.07)"
)

# options for moco v2
parser.add_argument("--mlp", action="store_true", help="use mlp head")
parser.add_argument(
    "--aug-plus", action="store_true", help="use moco v2 data augmentation"
)
parser.add_argument("--cos", action="store_true", help="use cosine lr schedule")


def ddp_setup_torchrun():
    dist.init_process_group(backend="nccl")


def main():
    args = parser.parse_args()
    ddp_setup_torchrun()

    if args.seed is not None:
        random.seed(args.seed)
        torch.manual_seed(args.seed)
        cudnn.deterministic = True
        warnings.warn(
            "You have chosen to seed training. "
            "This will turn on the CUDNN deterministic setting, "
            "which can slow down your training considerably! "
            "You may see unexpected behavior when restarting "
            "from checkpoints."
        )

    args.world_size = int(os.environ["WORLD_SIZE"])
    args.gpu = int(os.environ["LOCAL_RANK"])
    args.rank = int(os.environ["RANK"])
    # ngpus_per_node = torch.cuda.device_count()
    # Simply call main_worker function
    main_worker(args)


def main_worker(args):
    # create model
    print("=> creating model '{}'".format(args.arch))
    model = moco.builder.MoCo(
        models.__dict__[args.arch],
        args.moco_dim,
        args.moco_k,
        args.moco_m,
        args.moco_t,
        args.mlp,
    )
    print(model)

    torch.cuda.set_device(args.gpu)  # master gpu takes up extra memory
    torch.cuda.empty_cache()
    model.cuda()
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])

    # define loss function (criterion) and optimizer
    criterion = nn.CrossEntropyLoss().cuda(args.gpu)
    optimizer = torch.optim.SGD(
        model.parameters(),
        args.lr,
        momentum=args.momentum,
        weight_decay=args.weight_decay,
    )

    # optionally resume from a checkpoint
    if args.resume:
        if os.path.isfile(args.resume):
            print("=> loading checkpoint '{}'".format(args.resume))
            if args.gpu is None:
                checkpoint = torch.load(args.resume)
            else:
                # Map model to be loaded to specified single gpu.
                loc = "cuda:{}".format(args.gpu)
                checkpoint = torch.load(args.resume, map_location=loc)
            args.start_epoch = checkpoint["epoch"]
            model.load_state_dict(checkpoint["state_dict"])
            optimizer.load_state_dict(checkpoint["optimizer"])
            print(
                "=> loaded checkpoint '{}' (epoch {})".format(
                    args.resume, checkpoint["epoch"]
                )
            )
        else:
            print("=> no checkpoint found at '{}'".format(args.resume))

    cudnn.benchmark = True

    # Data loading code
    traindir = os.path.join(args.data, "train")
    normalize = transforms.Normalize(
        mean=[0.6233, 0.3663, 0.2382], std=[0.2812, 0.2396, 0.1967]
    )
    if args.aug_plus:
        # MoCo v2's aug: similar to SimCLR https://arxiv.org/abs/2002.05709
        augmentation = [
            transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
            transforms.RandomApply(
                [transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8  # not strengthened
            ),
            transforms.RandomGrayscale(p=0.2),
            transforms.RandomApply([moco.loader.GaussianBlur([0.1, 2.0])], p=0.5),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            normalize,
        ]
    else:
        # MoCo v1's aug: the same as InstDisc https://arxiv.org/abs/1805.01978
        augmentation = [
            transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
            transforms.RandomGrayscale(p=0.2),
            transforms.ColorJitter(0.4, 0.4, 0.4, 0.4),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            normalize,
        ]

    train_dataset = datasets.ImageFolder(
        traindir, moco.loader.TwoCropsTransform(transforms.Compose(augmentation))
    )
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
    train_loader = torch.utils.data.DataLoader(
        train_dataset,
        batch_size=args.batch_size,
        shuffle=(train_sampler is None),
        num_workers=args.workers,
        pin_memory=True,
        sampler=train_sampler,
        drop_last=True,
    )

    for epoch in range(args.start_epoch, args.epochs):
        train_sampler.set_epoch(epoch)
        adjust_learning_rate(optimizer, epoch, args)
        # train for one epoch
        train(train_loader, model, criterion, optimizer, epoch, args)

    # Save the final model. Only rank 0 writes, so that the processes do not
    # all write the same file at the same time.
    if args.rank == 0:
        save_checkpoint(
            {
                "epoch": args.epochs - 1,
                "arch": args.arch,
                "state_dict": model.state_dict(),
                "optimizer": optimizer.state_dict(),
            },
            is_best=False,
            # filename="checkpoint_{:04d}.pth.tar".format(args.epochs - 1),
            filename=args.save_path + ".pth.tar",
        )


def train(train_loader, model, criterion, optimizer, epoch, args):
    batch_time = AverageMeter("Time", ":6.3f")
    data_time = AverageMeter("Data", ":6.3f")
    losses = AverageMeter("Loss", ":.4e")
    top1 = AverageMeter("Acc@1", ":6.2f")
    top5 = AverageMeter("Acc@5", ":6.2f")
    progress = ProgressMeter(
        len(train_loader),
        [batch_time, data_time, losses, top1, top5],
        prefix="Proc{} Epoch: [{}]".format(args.rank, epoch),
    )

    # switch to train mode
    model.train()

    end = time.time()
    for i, (images, _) in enumerate(train_loader):
        # measure data loading time
        data_time.update(time.time() - end)

        if args.gpu is not None:
            images[0] = images[0].cuda(args.gpu, non_blocking=True)
            images[1] = images[1].cuda(args.gpu, non_blocking=True)

        # compute output
        output, target = model(im_q=images[0], im_k=images[1])
        loss = criterion(output, target)

        # acc1/acc5 are (K+1)-way contrast classifier accuracy
        # measure accuracy and record loss
        acc1, acc5 = accuracy(output, target, topk=(1, 5))
        losses.update(loss.item(), images[0].size(0))
        top1.update(acc1[0], images[0].size(0))
        top5.update(acc5[0], images[0].size(0))

        # compute gradient and do SGD step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # measure elapsed time
        batch_time.update(time.time() - end)
        end = time.time()

        if i % args.print_freq == 0:
            progress.display(i)


def save_checkpoint(state, is_best, filename="checkpoint.pth.tar"):
    torch.save(state, filename)
    if is_best:
        shutil.copyfile(filename, "model_best.pth.tar")


class AverageMeter:
    """Computes and stores the average and current value"""

    def __init__(self, name, fmt=":f"):
        self.name = name
        self.fmt = fmt
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

    def __str__(self):
        fmtstr = "{name} {val" + self.fmt + "} ({avg" + self.fmt + "})"
        return fmtstr.format(**self.__dict__)


class ProgressMeter:
    def __init__(self, num_batches, meters, prefix=""):
        self.batch_fmtstr = self._get_batch_fmtstr(num_batches)
        self.meters = meters
        self.prefix = prefix

    def display(self, batch):
        entries = [self.prefix + self.batch_fmtstr.format(batch)]
        entries += [str(meter) for meter in self.meters]
        print("\t".join(entries))

    def _get_batch_fmtstr(self, num_batches):
        num_digits = len(str(num_batches // 1))
        fmt = "{:" + str(num_digits) + "d}"
        return "[" + fmt + "/" + fmt.format(num_batches) + "]"


def adjust_learning_rate(optimizer, epoch, args):
    """Decay the learning rate based on schedule"""
    lr = args.lr
    if args.cos:  # cosine lr schedule
        lr *= 0.5 * (1.0 + math.cos(math.pi * epoch / args.epochs))
    else:  # stepwise lr schedule
        for milestone in args.schedule:
            lr *= 0.1 if epoch >= milestone else 1.0
    for param_group in optimizer.param_groups:
        param_group["lr"] = lr


def accuracy(output, target, topk=(1,)):
    """Computes the accuracy over the k top predictions for the specified values of k"""
    with torch.no_grad():
        maxk = max(topk)
        batch_size = target.size(0)

        _, pred = output.topk(maxk, 1, True, True)
        pred = pred.t()
        correct = pred.eq(target.view(1, -1).expand_as(pred))

        res = []
        for k in topk:
            correct_k = correct[:k].contiguous().view(-1).float().sum(0, keepdim=True)
            res.append(correct_k.mul_(100.0 / batch_size))
        return res


if __name__ == "__main__":
    main()
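Launch training. I am using a six-GPU machine and only want GPUs 4 and 5. The torchrun options --standalone --nnodes=1 select single-machine training, and --nproc-per-node=2 uses two GPUs. Other arguments to watch when using main_moco:
- The dataset layout is traindir/train/<class>/<images>. This can be changed: to change the path, look for traindir in the code above; for the expected directory structure, see the torchvision.datasets.ImageFolder class.
- moco-k must be an integer multiple of the global batch size. Note that the batch-size set here is the batch size on a single GPU, while Kaiming's code uses all_gather to communicate between processes: each process (GPU) gathers the keys from the other processes and concatenates them, so the batch that ends up in the queue has size global-batch-size = nGPUs * batch-size. For example, I use 2 GPUs with batch-size 64, so global-batch-size is 64 * 2 = 128 and moco-k must be a multiple of 128.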
CUDA_VISIBLE_DEVICES=4,5 torchrun --standalone --nnodes=1 --nproc-per-node=2 main_moco_torchrun.py traindir --save_path <path_to_save_model> --workers 24 --batch-size 64 --moco-k 8448 --aug-plus --cos
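The moco-k constraint described above can be checked before launching. The following is a minimal sketch; `check_moco_k` is a hypothetical helper for illustration and is not part of the MoCo repository.

```python
def check_moco_k(moco_k, per_gpu_batch_size, num_gpus):
    """Return the global batch size, or raise if moco_k is not a multiple of it."""
    # Keys from all GPUs are gathered and concatenated before being enqueued,
    # so the queue advances by the global batch size on every step.
    global_batch_size = per_gpu_batch_size * num_gpus
    if moco_k % global_batch_size != 0:
        raise ValueError(
            "moco_k={} is not a multiple of the global batch size {}".format(
                moco_k, global_batch_size
            )
        )
    return global_batch_size


if __name__ == "__main__":
    # Values from the launch command: 2 GPUs, batch size 64 per GPU.
    print(check_moco_k(8448, per_gpu_batch_size=64, num_gpus=2))
```

With the values from the command, 8448 = 66 * 128, so the check passes; the default 65536 = 512 * 128 would pass as well.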