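Training tricks: splitting a classification model with millions of classes across GPUs
1. Background
Many face recognition algorithms are trained as classifiers. This approach has one serious problem: the final fully connected layer of the model has far too many parameters. Taking a 512-dimensional feature as an example: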
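Number of classes, parameter matrix shape, parameter matrix size (MB, float32):
- 1,000,000 classes: 512 x 1,000,000, 1953 MB
- 2,000,000 classes: 512 x 2,000,000, 3906 MB
- 5,000,000 classes: 512 x 5,000,000, 9765 MB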
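The sizes in the list above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes float32 weights (4 bytes each) and ignores the bias vector; the helper name `fc_weight_mib` is made up for illustration.

```python
BYTES_PER_FLOAT32 = 4
MIB = 1024 ** 2


def fc_weight_mib(num_classes, feature_dim=512):
    """Size in MiB of a float32 weight matrix with num_classes x feature_dim entries."""
    return num_classes * feature_dim * BYTES_PER_FLOAT32 / MIB


for n in (1_000_000, 2_000_000, 5_000_000):
    print(f"{n:>9,d} classes: {fc_weight_mib(n):8.1f} MiB")
```

Note that this counts the weights only. During training the gradient of the layer has the same shape, so the layer needs roughly twice this amount before any optimizer state is added.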
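With even more classes, a consumer GPU such as the 1080 Ti can no longer hold the layer, not to mention the extra memory taken by the intermediate results of forward/backward.
Open-source datasets keep growing. Even without data of your own, open data alone can now push the number of classes to 1 million. Under these conditions it is hard to train on a single card, and the model has to be split.
2. Model splitting
The most obvious way to split the model is to split its largest layer, the fc layer.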
import torch
from torchvision.models import resnet50  # assumed source of resnet50


class facemodel(torch.nn.Module):
    def __init__(self, num_classes):
        super(facemodel, self).__init__()
        # Keep the backbone on GPU 0.
        self.backbone = resnet50().to(torch.device("cuda:0"))
        # Replace the ImageNet head with a 512-d embedding layer.
        self.backbone.fc = torch.nn.Linear(2048, 512, bias=True).to(torch.device("cuda:0"))
        # GPU 0 keeps only 1/6 of the classes, because it also has to hold
        # the forward/backward intermediate results of the backbone.
        self.fc1 = torch.nn.Linear(512, int(num_classes / 6)).to(torch.device("cuda:0"))
        # The remaining classes go to GPU 1.
        self.fc2 = torch.nn.Linear(512, num_classes - int(num_classes / 6)).to(torch.device("cuda:1"))

    def forward(self, x):
        x = self.backbone(x)
        x1 = self.fc1(x)
        x2 = self.fc2(x.to(torch.device("cuda:1")))
        # Move x2 back to GPU 0 so the loss can be computed there.
        return torch.cat([x1, x2.to(torch.device("cuda:0"))], dim=1)
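Splitting the fc layer this way does not change what the model computes: concatenating the outputs of the two partial layers gives the same logits as one large layer whose weight rows are split between them. The sketch below checks this on a tiny example. The sizes (8 features, 10 classes, split at 3) are made up, and the code falls back to the CPU when two GPUs are not available.

```python
import torch


def pick_devices():
    """Use two GPUs when present, otherwise run both halves on the CPU."""
    if torch.cuda.device_count() >= 2:
        return torch.device("cuda:0"), torch.device("cuda:1")
    return torch.device("cpu"), torch.device("cpu")


dev0, dev1 = pick_devices()
torch.manual_seed(0)
feature_dim, num_classes, split = 8, 10, 3  # toy sizes, for illustration only

# Reference: a single fc layer over all classes.
full = torch.nn.Linear(feature_dim, num_classes)

# The same layer cut into two pieces along the class dimension.
fc1 = torch.nn.Linear(feature_dim, split).to(dev0)
fc2 = torch.nn.Linear(feature_dim, num_classes - split).to(dev1)
with torch.no_grad():
    fc1.weight.copy_(full.weight[:split])
    fc1.bias.copy_(full.bias[:split])
    fc2.weight.copy_(full.weight[split:])
    fc2.bias.copy_(full.bias[split:])

x = torch.randn(4, feature_dim)
logits_split = torch.cat([fc1(x.to(dev0)), fc2(x.to(dev1)).to(dev0)], dim=1)
logits_full = full(x)

print(torch.allclose(logits_split.cpu(), logits_full, atol=1e-4))
```

The only cost of the split is the copy of the features to the second device and the copy of the partial logits back.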
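Take a model with 2 million classes as an example:
from torchsummary import summary  # assumed: the output below matches the torchsummary format

net = facemodel(2000000)
summary(net, (3, 224, 224))
The parameter counts of the model are as follows: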
================================================================
Total params: 1,050,557,120
Trainable params: 1,050,557,120
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.57
Forward/backward pass size (MB): 301.82
Params size (MB): 4007.56
Estimated Total Size (MB): 4309.95
----------------------------------------------------------------
As a rough estimate, a single card can in theory fit a batch of (11178 - 4007.56) / 301.82 = 23.76 samples, and two cards twice that, 47.52.
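The estimate above can be written as a small helper, which makes it easy to see how sensitive it is to what gets counted. This is a sketch: the function name and the `param_copies` argument are made up for illustration, and the numbers come from the summary output and the 11178 MiB reported for the card.

```python
def max_batch_size(gpu_mib, params_mib, per_sample_mib, param_copies=1):
    """Samples that fit after reserving param_copies copies of the parameters."""
    return (gpu_mib - param_copies * params_mib) / per_sample_mib


# Same assumptions as the text: only the parameters themselves are reserved.
optimistic = max_batch_size(11178, 4007.56, 301.82)

# Reserving a second copy for the parameter gradients, which have the same size.
with_grads = max_batch_size(11178, 4007.56, 301.82, param_copies=2)

print(f"parameters only:        {optimistic:.2f}")
print(f"parameters + gradients: {with_grads:.2f}")
```

Both figures are upper bounds. They leave out optimizer state, the CUDA context and the memory taken by the logits and the loss, so the batch size that actually fits has to be found by experiment.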
Next, let's see how large a batch_size two cards can actually run.
At this point, memory usage on the two GPUs is:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.78 Driver Version: 410.78 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:01:00.0 Off | N/A |
| 0% 59C P8 20W / 250W | 1531MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:04:00.0 Off | N/A |
| 29% 52C P8 19W / 250W | 3841MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 19447 C /home/dai/py36env/bin/python 1521MiB |
| 1 19447 C /home/dai/py36env/bin/python 3831MiB |
+-----------------------------------------------------------------------------+
Try batch_size=64:
batch_size = 64
img = torch.ones(batch_size,3,224,224).cuda()
out = net(img)
label = torch.ones(batch_size).long().to(torch.device("cuda:0"))
loss = torch.nn.CrossEntropyLoss()(out,label)
loss.backward()
loss.item()
After a backward pass with a batch_size of 64, GPU memory usage looks like this:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.78 Driver Version: 410.78 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:01:00.0 Off | N/A |
| 0% 73C P2 84W / 250W | 9855MiB / 11178MiB | 56% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:04:00.0 Off | N/A |
| 0% 61C P2 79W / 250W | 7505MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 19963 C /home/dai/py36env/bin/python 9845MiB |
| 1 19963 C /home/dai/py36env/bin/python 7495MiB |
+-----------------------------------------------------------------------------+
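So after splitting the model, training can run with a larger batch_size.
However, the memory usage above reveals a problem: forward/backward memory grows by different amounts on the two GPUs, and GPU utilization differs a lot as well. This wastes memory, and having one GPU do the work while the other looks on for long periods may also wear out one of the GPUs.
To address this, we can try a finer-grained split of the model.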
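The imbalance can be quantified directly from the two nvidia-smi snapshots above (model loaded, then after one backward pass at batch_size 64). The sketch below only subtracts the reported numbers; the dictionary names are made up for illustration.

```python
# Memory usage in MiB, copied from the nvidia-smi output above.
idle = {"cuda:0": 1531, "cuda:1": 3841}  # model loaded, no data yet
busy = {"cuda:0": 9855, "cuda:1": 7505}  # after backward with batch_size=64

growth = {gpu: busy[gpu] - idle[gpu] for gpu in idle}
ratio = growth["cuda:0"] / growth["cuda:1"]

for gpu, mib in growth.items():
    print(f"{gpu}: +{mib} MiB")
print(f"cuda:0 grew {ratio:.2f}x as much as cuda:1")
```

GPU 0 holds the smaller share of the fc weights but grows far more, which is consistent with it carrying the backbone activations, the concatenated logits and the loss.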
3. A finer-grained split
We can also split the resnet50 backbone itself across the two GPUs:
class face_model(torch.nn.Module):
    def __init__(self, num_classes):
        super(face_model, self).__init__()
        backbone = resnet50()
        # First half of the backbone on GPU 0.
        self.bottom = torch.nn.Sequential(
            backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool
        ).to(torch.device("cuda:0"))
        self.layer1 = backbone.layer1.to(torch.device("cuda:0"))
        self.layer2 = backbone.layer2.to(torch.device("cuda:0"))
        # Second half of the backbone and the embedding layer on GPU 1.
        self.layer3 = backbone.layer3.to(torch.device("cuda:1"))
        self.layer4 = backbone.layer4.to(torch.device("cuda:1"))
        self.avgpool = backbone.avgpool.to(torch.device("cuda:1"))
        self.fc = torch.nn.Linear(2048, 512, bias=True).to(torch.device("cuda:1"))
        # The classification layer is split evenly between the two GPUs.
        self.fc1 = torch.nn.Linear(in_features=512, out_features=int(num_classes / 2), bias=True).to(torch.device("cuda:0"))
        self.fc2 = torch.nn.Linear(in_features=512, out_features=num_classes - int(num_classes / 2), bias=True).to(torch.device("cuda:1"))

    def forward(self, x):
        x = x.to(torch.device("cuda:0"))
        x = self.bottom(x)
        x = self.layer1(x)
        x = self.layer2(x)
        # Hand the feature map over to GPU 1.
        x = x.to(torch.device("cuda:1"))
        x = self.layer3(x)
        x = self.layer4(x)
        # (N, 2048, 1, 1) -> (N, 2048)
        x = self.avgpool(x).squeeze(3).squeeze(2)
        x = self.fc(x)
        x2 = self.fc2(x)
        x1 = self.fc1(x.to(torch.device("cuda:0")))
        # Gather all logits on GPU 0.
        return torch.cat([x1, x2.to(torch.device("cuda:0"))], dim=1)


net = face_model(2000000)
Note: move networks and tensors with to(device), not cuda(GPUID).
With no data loaded, memory usage is fairly balanced:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.78 Driver Version: 410.78 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:01:00.0 Off | N/A |
| 0% 64C P2 76W / 250W | 2539MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:04:00.0 Off | N/A |
| 29% 62C P2 80W / 250W | 2625MiB / 11178MiB | 62% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 9574 C /home/dai/py36env/bin/python 2529MiB |
| 1 9574 C /home/dai/py36env/bin/python 2615MiB |
+-----------------------------------------------------------------------------+
But as soon as it runs with a batch size of 64, it turns into this:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.78 Driver Version: 410.78 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:01:00.0 Off | N/A |
| 0% 67C P2 81W / 250W | 10945MiB / 11178MiB | 78% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:04:00.0 Off | N/A |
| 31% 62C P2 81W / 250W | 6315MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 9574 C /home/dai/py36env/bin/python 10935MiB |
| 1 9574 C /home/dai/py36env/bin/python 6305MiB |
+-----------------------------------------------------------------------------+
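Both memory and load look very unbalanced. I think this can be addressed in two ways:
- move more of the fc weights to GPU 1;
- distribute the loss computation across the two GPUs.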
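For the first option, it helps to see how many classes and how much fc memory each GPU gets for a given split fraction. The helper below is a sketch under stated assumptions: float32 parameters, a bias per class, and a hypothetical function name `split_fc`. It compares the 1/6 split used in section 2 with the 1/2 split used in section 3.

```python
MIB = 1024 ** 2


def split_fc(num_classes, frac_gpu0, feature_dim=512):
    """Return ((classes, MiB) on GPU 0, (classes, MiB) on GPU 1) for an fc split."""
    n0 = int(num_classes * frac_gpu0)
    n1 = num_classes - n0

    def mib(n):
        # float32 weights (feature_dim per class) plus one bias per class
        return n * (feature_dim + 1) * 4 / MIB

    return (n0, mib(n0)), (n1, mib(n1))


for frac in (1 / 6, 1 / 2):
    (n0, m0), (n1, m1) = split_fc(2_000_000, frac)
    print(f"frac={frac:.3f}  GPU0: {n0:,d} classes, {m0:.0f} MiB  "
          f"GPU1: {n1:,d} classes, {m1:.0f} MiB")
```

These figures cover the parameters only. The gradients take the same amount again on each GPU, so shifting classes to GPU 1 moves twice the listed amount during training.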
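4. Computing the loss on two GPUs
Loss computation in face recognition is often fairly complex, which makes this load imbalance even more pronounced. To ease it, the batch can be cut in half and the loss of each half computed on a different GPU: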
class face_model(torch.nn.Module):
    def __init__(self, num_classes):
        super(face_model, self).__init__()
        backbone = resnet50()
        self.bottom = torch.nn.Sequential(
            backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool
        ).to(torch.device("cuda:0"))
        self.layer1 = backbone.layer1.to(torch.device("cuda:0"))
        self.layer2 = backbone.layer2.to(torch.device("cuda:0"))
        self.layer3 = backbone.layer3.to(torch.device("cuda:1"))
        self.layer4 = backbone.layer4.to(torch.device("cuda:1"))
        self.avgpool = backbone.avgpool.to(torch.device("cuda:1"))
        self.fc = torch.nn.Linear(2048, 512, bias=True).to(torch.device("cuda:1"))
        self.fc1 = torch.nn.Linear(in_features=512, out_features=int(num_classes / 2), bias=True).to(torch.device("cuda:0"))
        self.fc2 = torch.nn.Linear(in_features=512, out_features=num_classes - int(num_classes / 2), bias=True).to(torch.device("cuda:1"))

    def forward(self, x, label):
        x = x.to(torch.device("cuda:0"))
        x = self.bottom(x)
        x = self.layer1(x)
        x = self.layer2(x)
        x = x.to(torch.device("cuda:1"))
        x = self.layer3(x)
        x = self.layer4(x)
        x = self.avgpool(x).squeeze(3).squeeze(2)
        x = self.fc(x)
        x2 = self.fc2(x)
        x1 = self.fc1(x.to(torch.device("cuda:0")))
        # Full logits for the whole batch, gathered on GPU 0.
        x = torch.cat([x1, x2.to(torch.device("cuda:0"))], dim=1)
        # First half of the batch: loss on GPU 0.
        loss1 = torch.nn.CrossEntropyLoss()(x[:len(label)//2], label[:len(label)//2].to(torch.device("cuda:0")))
        # Second half of the batch: logits and labels go to GPU 1, loss on GPU 1.
        loss2 = torch.nn.CrossEntropyLoss()(x[len(label)//2:].to(torch.device("cuda:1")), label[len(label)//2:].to(torch.device("cuda:1")))
        # Average the two half-batch losses on GPU 0.
        return (loss1 + loss2.to(torch.device("cuda:0"))) / 2


net = face_model(2000000)
As the GPU information below shows, after spreading out the loss, the memory allocation improves slightly and GPU utilization looks somewhat more normal.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.78 Driver Version: 410.78 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:01:00.0 Off | N/A |
| 0% 86C P2 166W / 250W | 10701MiB / 11178MiB | 43% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:04:00.0 Off | N/A |
| 34% 62C P2 81W / 250W | 7053MiB / 11178MiB | 74% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 11743 C /home/dai/py36env/bin/python 10691MiB |
| 1 11743 C /home/dai/py36env/bin/python 7043MiB |
+-----------------------------------------------------------------------------+
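One detail of the forward pass in this section is worth checking: (loss1 + loss2) / 2 equals the mean loss over the whole batch only when the two halves have the same size, that is, when the batch size is even. The pure-Python sketch below illustrates this with a hand-written cross entropy. All function names are made up for illustration, and a batch of at least 2 samples is assumed.

```python
import math


def cross_entropy_mean(logits, labels):
    """Mean cross entropy over a batch, via a numerically stable log-sum-exp."""
    total = 0.0
    for row, y in zip(logits, labels):
        m = max(row)
        lse = m + math.log(sum(math.exp(v - m) for v in row))
        total += lse - row[y]
    return total / len(labels)


def half_split_loss(logits, labels):
    """Average of the two half-batch means, as in the forward pass above."""
    h = len(labels) // 2
    loss1 = cross_entropy_mean(logits[:h], labels[:h])
    loss2 = cross_entropy_mean(logits[h:], labels[h:])
    return (loss1 + loss2) / 2


def weighted_split_loss(logits, labels):
    """Weight each half by its size so odd batch sizes also give the batch mean."""
    n = len(labels)
    h = n // 2
    loss1 = cross_entropy_mean(logits[:h], labels[:h])
    loss2 = cross_entropy_mean(logits[h:], labels[h:])
    return (h * loss1 + (n - h) * loss2) / n
```

With a fixed even batch size such as 64 the simple average is exact. The weighted form only matters for a final, smaller batch with an odd number of samples.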
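5. Model speed
After the model is split, there are many more data transfers, so training speed naturally drops quite a bit. The speed can be improved by taking advantage of the asynchronous execution between PyTorch's frontend and backend. For details, see: Model Parallel Best Practices (PyTorch)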
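One common way to use that asynchrony is to cut each batch into micro-batches, so that the first GPU can start on the next micro-batch while the second GPU is still busy with the previous one. The following is a minimal sketch of this idea, not the author's code: the class name, the layer sizes and `split_size` are hypothetical, and the stages are tiny Linear layers so that the example also runs on a CPU.

```python
import torch


class PipelinedTwoStage(torch.nn.Module):
    """Two-stage model that feeds the batch through in micro-batches."""

    def __init__(self, dev0, dev1, split_size=2):
        super().__init__()
        self.dev0, self.dev1, self.split_size = dev0, dev1, split_size
        self.stage1 = torch.nn.Sequential(
            torch.nn.Linear(16, 32), torch.nn.ReLU()
        ).to(dev0)
        self.stage2 = torch.nn.Linear(32, 4).to(dev1)

    def forward(self, x):
        splits = iter(x.to(self.dev0).split(self.split_size, dim=0))
        # Run the first micro-batch through stage 1 and hand it to device 1.
        prev = self.stage1(next(splits)).to(self.dev1)
        outputs = []
        for chunk in splits:
            # Stage 2 on the previous micro-batch (device 1) ...
            outputs.append(self.stage2(prev))
            # ... while stage 1 is queued for the next one (device 0).
            prev = self.stage1(chunk).to(self.dev1)
        outputs.append(self.stage2(prev))
        return torch.cat(outputs, dim=0)


# Fall back to the CPU when two GPUs are not available.
if torch.cuda.device_count() >= 2:
    d0, d1 = torch.device("cuda:0"), torch.device("cuda:1")
else:
    d0 = d1 = torch.device("cpu")

model = PipelinedTwoStage(d0, d1, split_size=2)
out = model(torch.randn(5, 16))
print(out.shape)
```

On a CPU the two stages simply run one after the other, so the sketch only demonstrates the control flow. With the resnet50 split from section 3, note that BatchNorm statistics are computed per micro-batch, so results are not bit-identical to a full-batch pass, and the best `split_size` has to be found by measurement.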