Learning PyTorch

Tensor

PyTorch tensors come in 10 basic data types, listed here by their CPU tensor class:

torch.FloatTensor (32-bit floating point)			# torch.float32
torch.DoubleTensor (64-bit floating point)			# torch.float64 (double)
torch.HalfTensor (16-bit floating point, variant 1)		# torch.float16 (half)
torch.BFloat16Tensor (16-bit floating point, variant 2)		# torch.bfloat16
torch.ByteTensor (8-bit integer (unsigned))			# torch.uint8
torch.CharTensor (8-bit integer (signed))			# torch.int8
torch.ShortTensor (16-bit integer (signed))			# torch.int16 (short)
torch.IntTensor (32-bit integer (signed))			# torch.int32 (int)
torch.LongTensor (64-bit integer (signed))			# torch.int64 (long)
torch.BoolTensor (Boolean)					# torch.bool
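The correspondence between the legacy tensor class names and the `torch.dtype` objects can be checked directly. The snippet below is a small sketch, assuming the default dtype has not been changed with `torch.set_default_dtype`; the `pairs` dictionary is only an illustrative name.

```python
import torch

# Map each dtype to the legacy CPU tensor class name listed above.
pairs = {
    torch.float32: "torch.FloatTensor",
    torch.float64: "torch.DoubleTensor",
    torch.float16: "torch.HalfTensor",
    torch.bfloat16: "torch.BFloat16Tensor",
    torch.uint8: "torch.ByteTensor",
    torch.int8: "torch.CharTensor",
    torch.int16: "torch.ShortTensor",
    torch.int32: "torch.IntTensor",
    torch.int64: "torch.LongTensor",
    torch.bool: "torch.BoolTensor",
}

# Tensor.type() returns the class name as a string for a CPU tensor.
names = {dtype: torch.zeros(1, dtype=dtype).type() for dtype in pairs}
for dtype, name in names.items():
    print(dtype, "->", name)

# A tensor created without an explicit dtype uses the default (float32).
default_dtype = torch.zeros(2, 3).dtype
```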

Ways to create a Tensor

From existing data

torch.tensor()
torch.as_tensor()
torch.from_numpy()

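All three functions create a Tensor from existing data and inherit the data type of the input. torch.tensor() and torch.as_tensor() accept Python lists or NumPy arrays, while torch.from_numpy() accepts only NumPy arrays.
torch.from_numpy() always shares memory with the input array, and torch.as_tensor() shares it when the input is a NumPy array that needs no dtype or device conversion, so modifying one also changes the other. torch.tensor() always copies the data.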

# The input data is int64, so the resulting Tensor has dtype=torch.int64
t1 = torch.tensor([[1,2,3], [4,5,6]])
# dtype and device can be overridden through keyword arguments
>>> t = torch.tensor([[1,2,3], [4,5,6]], dtype=torch.float64, device=torch.device("cuda", 2))
>>> t
tensor([[1., 2., 3.],
        [4., 5., 6.]], device='cuda:2', dtype=torch.float64)
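The memory-sharing behaviour of the three constructors can be observed by modifying the source array after creating the tensors. This is a minimal sketch on CPU with NumPy installed; the variable names are illustrative only.

```python
import numpy as np
import torch

arr = np.array([1, 2, 3], dtype=np.int64)

copied = torch.tensor(arr)      # always copies the data
shared = torch.from_numpy(arr)  # always shares memory with arr
alias = torch.as_tensor(arr)    # shares memory here: no dtype/device change needed

arr[0] = 100                    # modify the NumPy array in place

values = (copied[0].item(), shared[0].item(), alias[0].item())
print(values)  # (1, 100, 100)
```

Only the copy made by `torch.tensor()` keeps the old value; the other two tensors see the change because they view the same buffer.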

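A Tensor created with the torch.Tensor() constructor does not inherit the data type of its input; it always uses the default tensor type, torch.FloatTensor (torch.float32).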
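The difference between the two constructors is easy to see on the same integer list. A short sketch, assuming the default dtype is still float32:

```python
import torch

a = torch.tensor([1, 2, 3])  # dtype inferred from the data
b = torch.Tensor([1, 2, 3])  # dtype taken from the default tensor type

print(a.dtype)  # torch.int64
print(b.dtype)  # torch.float32
```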

From a shape

Building a model

Standard construction

import torch
import torch.nn as nn
import torch.nn.functional as F


class Model1(nn.Module):
    def __init__(self):
        super(Model1,self).__init__()
        self.conv1=nn.Conv2d(3,6,5)
        self.pool=nn.MaxPool2d(2,2)
        self.conv2=nn.Conv2d(6,16,5)
        self.fc1=nn.Linear(16*5*5,120)
        self.fc5=nn.Linear(120,84)
        self.fc3=nn.Linear(84,10)
        # Layers that are defined but never used in forward
        self.relu_a = nn.ReLU()
        self.conv_a = nn.Conv2d(3,6,5)

    def forward(self,x):
        x=self.pool(F.relu(self.conv1(x)))
        x=self.pool(F.relu(self.conv2(x)))
        x=x.view(-1,16*5*5)
        x=F.relu(self.fc1(x))
        x=F.relu(self.fc5(x))  # the second linear layer is registered as fc5 in __init__
        x=self.fc3(x)
        return x

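Printing the model with print(model1) gives the output below. Every submodule assigned in __init__ appears in the structure, whether or not it is used in forward and whether or not it has learnable parameters. The order follows the order of definition, not the order of execution.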

Model1(
  (conv1): Conv2d(3, 6, kernel_size=(5, 5), stride=(1, 1))
  (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=400, out_features=120, bias=True)
  (fc5): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
  (relu_a): ReLU()
  (conv_a): Conv2d(3, 6, kernel_size=(5, 5), stride=(1, 1))
)

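Printing the model's state_dict, with the key on the left and the value's size() on the right, shows only those parts of the structure above that have learnable parameters (or registered buffers):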

model1.state_dict:
conv1.weight     torch.Size([6, 3, 5, 5])
conv1.bias       torch.Size([6])
conv2.weight     torch.Size([16, 6, 5, 5])
conv2.bias       torch.Size([16])
fc1.weight       torch.Size([120, 400])
fc1.bias         torch.Size([120])
fc5.weight       torch.Size([84, 120])
fc5.bias         torch.Size([84])
fc3.weight       torch.Size([10, 84])
fc3.bias         torch.Size([10])
conv_a.weight    torch.Size([6, 3, 5, 5])
conv_a.bias      torch.Size([6])
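A listing like the one above can be produced by iterating over `state_dict()`. The sketch below uses a much smaller, hypothetical module (`Tiny`) to show the same two effects: modules without parameters appear in the structure but not in the state_dict, and unused layers still appear in both.

```python
import torch.nn as nn


class Tiny(nn.Module):  # hypothetical stand-in for Model1
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)
        self.act = nn.ReLU()           # no learnable parameters
        self.unused = nn.Linear(4, 2)  # never called in forward

    def forward(self, x):
        return self.act(self.fc(x))


model = Tiny()

# Submodules in definition order, as print(model) would list them.
children = [name for name, _ in model.named_children()]

# state_dict keys: only entries that carry parameters (or buffers).
keys = list(model.state_dict().keys())
for name, value in model.state_dict().items():
    print(f"{name}\t{value.size()}")
```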

Building with Sequential()

class Model2(nn.Module):
    def __init__(self) -> None:
        super(Model2, self).__init__()

        self.backbone = nn.Sequential(
            nn.Conv2d(3,6,5),
            nn.ReLU(),
            nn.MaxPool2d(2,2),
            nn.Conv2d(6,16,5),
            nn.ReLU(),
            nn.MaxPool2d(2,2),
        )
        
        self.fc1=nn.Linear(16*5*5,120)
        self.fc5=nn.Linear(120,84)
        self.fc3=nn.Linear(84,10)

    def forward(self,x):
        x=self.backbone(x)
        x=x.view(-1,16*5*5)
        x=F.relu(self.fc1(x))
        x=F.relu(self.fc5(x))  # the second linear layer is registered as fc5 in __init__
        x=self.fc3(x)
        return x

Printed model structure:

model2's structure:
Model2(
  (backbone): Sequential(
    (0): Conv2d(3, 6, kernel_size=(5, 5), stride=(1, 1))
    (1): ReLU()
    (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
    (4): ReLU()
    (5): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (fc1): Linear(in_features=400, out_features=120, bias=True)
  (fc5): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
)

Printed state_dict:

backbone.0.weight        torch.Size([6, 3, 5, 5])
backbone.0.bias          torch.Size([6])
backbone.3.weight        torch.Size([16, 6, 5, 5])
backbone.3.bias          torch.Size([16])
fc1.weight       torch.Size([120, 400])
fc1.bias         torch.Size([120])
fc5.weight       torch.Size([84, 120])
fc5.bias         torch.Size([84])
fc3.weight       torch.Size([10, 84])
fc3.bias         torch.Size([10])
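The keys `backbone.0` and `backbone.3` come from the positional index of each layer inside the `nn.Sequential`; indices 1, 2, 4 and 5 are missing because ReLU and MaxPool2d have no parameters. As an optional variation that is not part of the original example, passing an `OrderedDict` gives the layers names instead of indices. A small sketch:

```python
from collections import OrderedDict

import torch.nn as nn

# Positional form: state_dict keys use the layer index.
indexed = nn.Sequential(nn.Conv2d(3, 6, 5), nn.ReLU(), nn.MaxPool2d(2, 2))

# Named form: state_dict keys use the names given in the OrderedDict.
named = nn.Sequential(OrderedDict([
    ("conv", nn.Conv2d(3, 6, 5)),
    ("relu", nn.ReLU()),
    ("pool", nn.MaxPool2d(2, 2)),
]))

indexed_keys = list(indexed.state_dict().keys())
named_keys = list(named.state_dict().keys())
print(indexed_keys)  # ['0.weight', '0.bias']
print(named_keys)    # ['conv.weight', 'conv.bias']
```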

ModelEMA(Exponential Moving Average)

When updating the model parameters

AMP(Automatic Mixed Precision)

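Official PyTorch tutorial (in English)
Models and data created in PyTorch default to torch.float32. Some ops, such as linear and convolution layers, run faster in float16, while others need to run in float32. Mixed precision tries to pick the most suitable dtype for each op.
Complete mixed-precision training usually combines autocast and GradScaler, as in the following code: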

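# Imports needed by this snippet (Net, loss_fn, epochs and data are defined elsewhere)
import torch.optim as optim
from torch.cuda.amp import GradScaler, autocast

# Creates model and optimizer in default precision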
model = Net().cuda()
optimizer = optim.SGD(model.parameters(), ...)

# Creates a GradScaler once at the beginning of training.
scaler = GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()

        # Runs the forward pass with autocasting.
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)

        # Scales loss.  Calls backward() on scaled loss to create scaled gradients.
        # Backward passes under autocast are not recommended.
        # Backward ops run in the same dtype autocast chose for corresponding forward ops.
        scaler.scale(loss).backward()

        # scaler.step() first unscales the gradients of the optimizer's assigned params.
        # If these gradients do not contain infs or NaNs, optimizer.step() is then called,
        # otherwise, optimizer.step() is skipped.
        scaler.step(optimizer)

        # Updates the scale for next iteration.
        scaler.update()
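The loop above leaves `Net`, `loss_fn` and the data undefined. The following is a self-contained sketch of the same pattern with hypothetical stand-ins (a single linear layer, random data, MSE loss). It assumes PyTorch 1.10 or later for `torch.autocast`, and enables AMP only when a GPU is present so that it also runs on CPU; recent releases may emit a deprecation warning for `torch.cuda.amp.GradScaler`.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"  # assumption: fall back to plain float32 on CPU

model = nn.Linear(8, 2).to(device)  # hypothetical stand-in for Net()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

inputs = torch.randn(16, 8, device=device)   # random stand-in data
targets = torch.randn(16, 2, device=device)
before = model.weight.detach().clone()

for _ in range(3):
    optimizer.zero_grad()
    # Forward pass and loss computation run under autocast.
    with torch.autocast(device_type=device, enabled=use_amp):
        loss = loss_fn(model(inputs), targets)
    # Backward pass runs outside autocast, on the scaled loss.
    scaler.scale(loss).backward()
    scaler.step(optimizer)  # unscales, then steps unless inf/NaN was found
    scaler.update()         # adjusts the scale factor for the next iteration

print(loss.item())
```

With `enabled=False`, both `autocast` and `GradScaler` become no-ops, so the same training loop serves both the mixed-precision and the full-precision case.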

autocast

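Ops inside a region controlled by autocast() run in a dtype chosen per op, which improves performance while maintaining accuracy. The list of ops and the dtype each one is cast to is given in the official autocast documentation.
Inside an autocast-enabled region, tensors may be of any dtype. Autocast handles the casting itself, so you should not call half() or bfloat16() on the model or its inputs when using it.
autocast() should wrap only the forward pass and the loss computation; wrapping the backward pass is not recommended. Backward ops run in the same dtype that autocast chose for the corresponding forward ops.
Tensors produced inside an autocast-enabled region may be float16, so watch out for dtype mismatches when using them in an autocast-disabled region.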
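The dtype mismatch mentioned above can be handled by casting the low-precision result back to float32 after leaving the autocast region. This sketch adapts the pattern from the official documentation; it assumes PyTorch 1.10 or later and uses bfloat16 on CPU so that it also runs without a GPU.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
# Assumption: float16 on GPU, bfloat16 on CPU.
low = torch.float16 if device == "cuda" else torch.bfloat16

a = torch.rand(4, 4, device=device)  # float32 inputs
b = torch.rand(4, 4, device=device)

with torch.autocast(device_type=device, dtype=low):
    c = torch.mm(a, b)  # mm is autocast to the low-precision dtype

# Outside the region, a (float32) and c (low precision) no longer match,
# so c is cast back explicitly before being combined with a.
d = torch.mm(a, c.float())

print(c.dtype, d.dtype)
```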

autocast can also be used as a decorator:

class AutocastModel(nn.Module):
    ...
    @autocast()
    def forward(self, input):
        ...

GradScaler

References: a Zhihu article, and the Gradient Scaling section of the official PyTorch documentation

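If the forward pass of an op has float16 inputs, its backward pass produces float16 gradients. When a gradient has a very small magnitude, float16 may be unable to represent it and it is flushed to zero. This is called underflow, and the update to the corresponding parameter is lost.
To avoid underflow, gradient scaling multiplies the network's loss by a scale factor and backpropagates this scaled loss. The resulting gradients are multiplied by the same scale factor, so their magnitude is enlarged and they are no longer flushed to zero.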
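The effect of underflow and of scaling can be seen on a single number. The sketch below uses 1e-8 as an illustrative gradient value and 65536 (2**16, the default `init_scale` of `GradScaler`) as the scale factor.

```python
import torch

grad = torch.tensor(1e-8)  # a tiny float32 gradient value (illustrative)
scale = 65536.0            # 2**16, GradScaler's default initial scale

unscaled_half = grad.half()          # too small for float16: flushed to 0
scaled_half = (grad * scale).half()  # about 6.55e-4, representable in float16

# Dividing by the scale in float32 recovers the original magnitude.
restored = scaled_half.float() / scale

print(unscaled_half.item(), scaled_half.item(), restored.item())
```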
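Before the parameters are updated, the scaled gradients must first be divided by the same scale factor (unscaled). If the unscaled gradients contain no inf or NaN, optimizer.step() is executed; otherwise optimizer.step() is skipped for that iteration.
The scale factor is estimated dynamically at every iteration. To avoid gradient underflow it should be as large as possible, but if it is too large, half-precision tensors overflow easily. The update policy is as follows.
Each call to scaler.step(optimizer) checks whether the gradients contain inf or NaN:

If they do, the weight update for that batch is skipped and scaler.update() shrinks the scale factor (multiplies it by backoff_factor).
If they do not, the parameters are updated normally, and after enough consecutive clean iterations (growth_interval) scaler.update() enlarges the scale factor (multiplies it by growth_factor).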
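The update policy can be modelled in a few lines of plain Python. This is a simplified sketch of the rule described above, not the library's implementation; `update_scale` is a hypothetical helper, and the default values mirror the documented `GradScaler` defaults (`growth_factor=2.0`, `backoff_factor=0.5`, `growth_interval=2000`).

```python
def update_scale(scale, found_inf, good_steps,
                 growth_factor=2.0, backoff_factor=0.5, growth_interval=2000):
    """Return (new_scale, new_good_steps) after one training iteration."""
    if found_inf:
        # Overflow: shrink the scale and restart the clean-step counter.
        return scale * backoff_factor, 0
    good_steps += 1
    if good_steps == growth_interval:
        # Enough consecutive clean steps: grow the scale.
        return scale * growth_factor, 0
    return scale, good_steps


scale, good = 65536.0, 0

# One iteration with inf/NaN gradients halves the scale.
scale, good = update_scale(scale, True, good)
after_overflow = scale

# 2000 consecutive clean iterations double it again.
for _ in range(2000):
    scale, good = update_scale(scale, False, good)

print(after_overflow, scale)  # 32768.0 65536.0
```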

fine tune

During training, a checkpoint must always save at least epoch, model and optimizer so that training can be resumed later.

ckpt = {'epoch': epoch,
        'best_fitness': best_fitness,
        'training_results': f.read(),
        'model': ema.ema.module if hasattr(ema, 'module') else ema.ema,
        'optimizer': None if final_epoch else optimizer.state_dict()}
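A round trip through such a checkpoint might look like the sketch below. It differs from the dictionary above in two labelled ways: it stores `model.state_dict()` rather than the model object, and it writes to an in-memory buffer instead of a file. The model and optimizer are hypothetical stand-ins.

```python
import io

import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # hypothetical stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# One training step so the optimizer holds momentum buffers worth saving.
model(torch.randn(3, 4)).sum().backward()
optimizer.step()

ckpt = {
    "epoch": 5,
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
}
buffer = io.BytesIO()  # in-memory stand-in for a checkpoint file
torch.save(ckpt, buffer)
buffer.seek(0)

# Resume: rebuild the objects, then load both state dicts.
restored = torch.load(buffer)
new_model = nn.Linear(4, 2)
new_optimizer = torch.optim.SGD(new_model.parameters(), lr=0.01, momentum=0.9)
new_model.load_state_dict(restored["model"])
new_optimizer.load_state_dict(restored["optimizer"])
start_epoch = restored["epoch"] + 1
```

Restoring the optimizer state matters for optimizers with internal statistics such as momentum, which would otherwise restart from zero.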

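To resume training, both the model and the optimizer must be loaded.
Pay particular attention to the following:

Whether the opt.epoch argument means additional epochs or the overall total including the previous run, since this affects the learning rate;
Learning-rate scheduling: some schedules depend on the total number of epochs, so when a new epoch count is supplied, the learning rate may jump abruptly when training resumes;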

start_epoch = ckpt['epoch'] + 1
# Learning-rate factor that depends on the total number of epochs
lf = lambda x: (((1 + math.cos(x * math.pi / epochs)) / 2) ** 1.0) * 0.8 + 0.2  # cosine
scheduler = lr_scheduler.LambdaLR(optimizer, lr_lambda=lf)
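The jump in learning rate can be quantified with the same cosine formula. The sketch below wraps it in a hypothetical helper `lr_factor` and compares the factor at epoch 50 when the total is 100 epochs with the factor at the same epoch after the total is changed to 200.

```python
import math


def lr_factor(x, epochs):
    """Cosine learning-rate factor from the lambda above, decaying from 1.0 to 0.2."""
    return ((1 + math.cos(x * math.pi / epochs)) / 2) * 0.8 + 0.2


# Original run: 100 epochs in total, stopped at epoch 50.
factor_old = lr_factor(50, 100)  # about 0.6

# Resumed run with the total changed to 200 epochs, same epoch index.
factor_new = lr_factor(50, 200)  # about 0.883

print(round(factor_old, 3), round(factor_new, 3))
```

Under these assumptions the multiplier jumps from about 0.6 to about 0.88 at the moment training resumes, which is the discontinuity the note above warns about.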

posted @ 2022-03-14 19:34 小鸟飞飞11
