李宏毅 Machine Learning Course Notes - 2022 HW4 (self-attention, transformer) Strong Baseline
The easiest homework so far.
The task: you are given speech from a set of (600) speakers and have to decide which speaker each utterance in the test set belongs to.
A human utterance is a sequence, so self-attention plus an FC layer can produce the class; that is exactly the Transformer encoder!
Code: https://colab.research.google.com/drive/18TTUpKwubAIiI_5JTbpOXB4afscbyOHn?usp=sharing
Problem analysis
The easiest episode so far. One-sentence solution: run the sample code as-is to pass the simple baseline, increase the number of epochs and d_model to pass medium, and swap the encoder for a Conformer to pass strong.
Code analysis
Let's look at the classifier:
import torch
import torch.nn as nn
import torch.nn.functional as F

# Install the conformer package (a Colab shell command).
!pip install conformer
from conformer import ConformerBlock


class Classifier(nn.Module):
    def __init__(self, d_model=224, n_spks=600, dropout=0.1):
        super().__init__()
        # Project the feature dimension from that of the input (40-dim mel features) into d_model.
        self.prenet = nn.Linear(40, d_model)
        # The commented-out encoder below is the one used for the medium baseline.
        # self.encoder_layer = nn.TransformerEncoderLayer(
        #     d_model=d_model, dim_feedforward=256, nhead=2
        # )
        # self.encoder = nn.TransformerEncoder(self.encoder_layer, num_layers=2)
        # For the strong baseline, use a Conformer block instead.
        self.encoder = ConformerBlock(
            dim=d_model,
            dim_head=4,
            heads=4,
            ff_mult=4,
            conv_expansion_factor=2,
            conv_kernel_size=20,
            attn_dropout=dropout,
            ff_dropout=dropout,
            conv_dropout=dropout,
        )
        # Project the feature dimension from d_model into the number of speakers.
        self.pred_layer = nn.Sequential(
            nn.BatchNorm1d(d_model),
            # nn.Linear(d_model, d_model),
            # nn.ReLU(),
            nn.Linear(d_model, n_spks),
        )
    def forward(self, mels):
        """
        args:
            mels: (batch size, length, 40)
        return:
            out: (batch size, n_spks)
        """
        # out: (batch size, length, d_model)
        out = self.prenet(mels)
        # ConformerBlock expects (batch size, length, d_model), so `out` can be fed in directly.
        # nn.TransformerEncoder (the medium-baseline encoder) expects (length, batch size, d_model)
        # by default, so re-enable the permute/transpose below if you switch back to it.
        # out = out.permute(1, 0, 2)
        out = self.encoder(out)
        # out = out.transpose(0, 1)
        # Mean pooling over the length (time) dimension.
        # stats: (batch size, d_model)
        stats = out.mean(dim=1)
        # out: (batch size, n_spks)
        out = self.pred_layer(stats)
        return out
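A quick sanity check of the shapes (a sketch, assuming the conformer package is installed; the batch size and sequence length here are made up):

model = Classifier(d_model=224, n_spks=600)
mels = torch.randn(4, 128, 40)   # (batch size, length, 40-dim mel features)
logits = model(mels)
print(logits.shape)              # torch.Size([4, 600]): one logit per speaker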
The prenet at the beginning plays the role of the Transformer's input-embedding step (the point at which a positional encoding would normally be added; this prenet adds none). It widens each frame of the original sequence to d_model so that it matches what self-attention expects. Since the final output covers 600 speaker classes, a \(d_{model}\) a bit above 200 (224 in the code) was found to work well.
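Concretely, the prenet is just a frame-wise linear projection; a tiny illustration (shapes only):

prenet = nn.Linear(40, 224)        # 40-dim mel features -> d_model
frames = torch.randn(4, 128, 40)   # (batch size, length, 40)
print(prenet(frames).shape)        # torch.Size([4, 128, 224]); every frame is projected independently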
Next comes the inside of the encoder. The encoder is essentially \(N\) stacked layers of (multi-head self-attention + FC, wrapped with add & norm); each layer is one self.encoder_layer, and dim_feedforward=256 is the number of hidden units in that FC (feed-forward) layer, as the sketch below spells out.
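A minimal sketch of that medium-baseline encoder, built with the same hyperparameters as the commented-out code (the batch size and length are illustrative):

d_model = 224
encoder_layer = nn.TransformerEncoderLayer(
    d_model=d_model, dim_feedforward=256, nhead=2   # 256 hidden units in the feed-forward (FC) sub-layer
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)   # N = 2 stacked copies of the layer

x = torch.randn(128, 4, d_model)   # (length, batch size, d_model), the default layout without batch_first=True
print(encoder(x).shape)            # torch.Size([128, 4, 224])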
After the encoder, mean pooling over the time dimension plus an FC layer with output dimension 600 gives the final prediction.
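The pooling step itself is just an average over the time axis; a small illustration (shapes only):

encoded = torch.randn(4, 128, 224)   # (batch size, length, d_model) coming out of the encoder
pooled = encoded.mean(dim=1)         # (batch size, d_model): one fixed-size vector per utterance
pred_layer = nn.Sequential(nn.BatchNorm1d(224), nn.Linear(224, 600))
print(pred_layer(pooled).shape)      # torch.Size([4, 600])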
One more thing: this code finally uses the automatically adjusted learning rate mentioned earlier, here a cosine schedule with warm-up. Note that every optimizer.step() must be followed by a scheduler.step(); a usage sketch is given after the function below.
import math
import torch
from torch.optim import Optimizer
from torch.optim.lr_scheduler import LambdaLR


def get_cosine_schedule_with_warmup(
    optimizer: Optimizer,
    num_warmup_steps: int,
    num_training_steps: int,
    num_cycles: float = 0.5,
    last_epoch: int = -1,
):
    def lr_lambda(current_step):
        # Linear warm-up: scale the learning rate from 0 up to its full value.
        if current_step < num_warmup_steps:
            return float(current_step) / float(max(1, num_warmup_steps))
        # Cosine decay after the warm-up phase.
        progress = float(current_step - num_warmup_steps) / float(
            max(1, num_training_steps - num_warmup_steps)
        )
        return max(
            0.0, 0.5 * (1.0 + math.cos(math.pi * float(num_cycles) * 2.0 * progress))
        )

    return LambdaLR(optimizer, lr_lambda, last_epoch)
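For reference, a minimal training-loop sketch of how the schedule is driven; the optimizer, learning rate, step counts, and train_loader below are placeholders, not the assignment's exact settings:

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)       # placeholder optimizer / learning rate
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=1000, num_training_steps=70000   # placeholder step counts
)
for mels, labels in train_loader:                                # train_loader: your DataLoader (placeholder)
    loss = F.cross_entropy(model(mels), labels)
    loss.backward()
    optimizer.step()
    scheduler.step()        # advance the learning-rate schedule right after every optimizer.step()
    optimizer.zero_grad()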