Hung-yi Lee's "Machine Learning" Notes - 2022 HW4 (self-attention, transformer) Strong Baseline

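The easiest homework so far.
The task, roughly: you are given speech recordings from a set of speakers (600 of them) and have to decide which speaker each utterance in the test set belongs to.
A person's speech is a sequence, so self-attention + FC can produce the class. That is exactly the encoder of a transformer!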

Code: https://colab.research.google.com/drive/18TTUpKwubAIiI_5JTbpOXB4afscbyOHn?usp=sharing

Problem analysis

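The easiest one yet. The solution in one sentence: run the code as-is to pass simple, increase the number of epochs and d_model to pass medium, and switch to a conformer to pass strong.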

Code analysis

Let's look at the classifier:

import torch
import torch.nn as nn
import torch.nn.functional as F
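# Notebook-only shell command (Colab/Jupyter); elsewhere run `pip install conformer` in a terminal.
!pip install conformer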
from conformer import ConformerBlock

class Classifier(nn.Module):
	def __init__(self, d_model=224, n_spks=600, dropout=0.1):
		super().__init__()
		# Project the dimension of features from that of input into d_model.
		self.prenet = nn.Linear(40, d_model)
		# The commented-out lines below are the encoder used for medium.
		# self.encoder_layer = nn.TransformerEncoderLayer(
		# 	d_model=d_model, dim_feedforward=256, nhead=2
		# )
		# self.encoder = nn.TransformerEncoder(self.encoder_layer, num_layers=2)
		self.encoder = ConformerBlock(
				dim = d_model,
				dim_head = 4,
				heads = 4,
				ff_mult = 4,
				conv_expansion_factor = 2,
				conv_kernel_size = 20,
				attn_dropout = dropout,
				ff_dropout = dropout,
				conv_dropout = dropout,
		)

		# Project the dimension of features from d_model into the number of speakers.
		self.pred_layer = nn.Sequential(
			nn.BatchNorm1d(d_model),
			# nn.Linear(d_model, d_model),
			# nn.ReLU(),
			nn.Linear(d_model, n_spks),
		)

	def forward(self, mels):
		"""
		args:
			mels: (batch size, length, 40)
		return:
			out: (batch size, n_spks)
		"""
		# out: (batch size, length, d_model)
		out = self.prenet(mels)
		# ConformerBlock (from the `conformer` package) takes batch-first input,
		# (batch size, length, d_model), so no permute is needed here.
		# If you switch back to the nn.TransformerEncoder above (batch_first=False),
		# permute to (length, batch size, d_model) before the encoder and transpose back after it.
		out = self.encoder(out)
		# mean pooling
		stats = out.mean(dim=1)

		# out: (batch, n_spks)
		out = self.pred_layer(stats)
		return out

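The prenet at the start plays the role of the transformer's input embedding rather than a positional encoding: it is a plain linear projection with nothing position-dependent added. It widens each 40-dimensional mel frame to d_model so that the features are wide enough for self-attention. Since the final output has 600 classes, I found that a d_model a bit above 200 works well (the code uses 224).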
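
A minimal sketch of what the prenet does to tensor shapes. The batch size (8) and sequence length (128) are made-up values for illustration, not the homework's settings:

```python
import torch
import torch.nn as nn

d_model = 224  # same value as the Classifier default
prenet = nn.Linear(40, d_model)

# Dummy batch: 8 utterances, 128 frames each, 40 mel bins per frame
# (8 and 128 are illustrative assumptions).
mels = torch.randn(8, 128, 40)

# nn.Linear acts on the last dimension only, so each frame is projected
# independently and no position information is added.
out = prenet(mels)
print(out.shape)
```

Only the last dimension changes (40 to d_model); batch size and sequence length pass through untouched.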
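Next comes the encoder itself. In the medium version, the encoder is essentially \(N\) stacked layers of (multi-head self-attention + FC, each wrapped in add & norm). Each layer is one self.encoder_layer, and dim_feedforward=256 is the number of hidden units in the FC layer.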
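
As a rough sketch, the medium encoder can be built and probed in isolation like this. The hyperparameters are the ones in the commented-out code; the dummy input sizes (128 frames, batch of 8) are assumptions for illustration:

```python
import torch
import torch.nn as nn

d_model = 224
encoder_layer = nn.TransformerEncoderLayer(
    d_model=d_model, dim_feedforward=256, nhead=2
)
# N = 2 stacked copies of the layer.
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

# With the default batch_first=False the input is (length, batch, d_model).
x = torch.randn(128, 8, d_model)
y = encoder(x)
print(y.shape)  # same shape as the input

# The FC part of each layer is Linear(d_model -> 256) then Linear(256 -> d_model).
print(encoder_layer.linear1.out_features, encoder_layer.linear2.in_features)
```

The encoder keeps the shape of its input; only the content of each frame vector changes.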
After the encoder, mean pooling over the time axis followed by an FC layer with output dimension 600 gives the final output.
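
A small sketch of this last stage, assuming a dummy encoder output of 8 utterances with 128 frames each (both numbers are illustrative):

```python
import torch
import torch.nn as nn

d_model, n_spks = 224, 600
pred_layer = nn.Sequential(
    nn.BatchNorm1d(d_model),
    nn.Linear(d_model, n_spks),
)

# Stand-in for the encoder output: (batch, length, d_model).
encoded = torch.randn(8, 128, d_model)

# Mean pooling over the time axis collapses each utterance to one vector.
stats = encoded.mean(dim=1)    # (batch, d_model)
logits = pred_layer(stats)     # (batch, n_spks)
pred = logits.argmax(dim=-1)   # predicted speaker index per utterance
print(stats.shape, logits.shape, pred.shape)
```

Mean pooling is what lets utterances of different lengths map to a fixed-size vector before classification.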

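One more point: this homework's code finally uses the automatically adjusted learning rate mentioned earlier, here a cosine schedule with warmup. Note that every optimizer.step() must be followed by a step() call on the scheduler.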

import math

import torch
from torch.optim import Optimizer
from torch.optim.lr_scheduler import LambdaLR


def get_cosine_schedule_with_warmup(
	optimizer: Optimizer,
	num_warmup_steps: int,
	num_training_steps: int,
	num_cycles: float = 0.5,
	last_epoch: int = -1,
):
	def lr_lambda(current_step):
		# Warmup
		if current_step < num_warmup_steps:
			return float(current_step) / float(max(1, num_warmup_steps))
		# Cosine decay after warmup
		progress = float(current_step - num_warmup_steps) / float(
			max(1, num_training_steps - num_warmup_steps)
		)
		return max(
			0.0, 0.5 * (1.0 + math.cos(math.pi * float(num_cycles) * 2.0 * progress))
		)

	return LambdaLR(optimizer, lr_lambda, last_epoch)
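
To show the call order in a training loop, here is a self-contained sketch. It inlines the same warmup + cosine formula (with num_cycles=0.5) so it runs on its own. The model, the dummy loss, and the step counts (10 warmup steps out of 100) are stand-ins for illustration, not the homework's settings:

```python
import math
import torch
from torch.optim.lr_scheduler import LambdaLR

warmup_steps, total_steps = 10, 100  # illustrative values only

def lr_lambda(step):
    # Linear warmup, then half a cosine cycle (num_cycles=0.5) down to 0.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

model = torch.nn.Linear(40, 600)  # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = LambdaLR(optimizer, lr_lambda)

lrs = []
for step in range(total_steps):
    loss = model(torch.randn(4, 40)).sum()  # dummy loss
    loss.backward()
    optimizer.step()       # update the weights first ...
    scheduler.step()       # ... then advance the learning-rate schedule
    optimizer.zero_grad()
    lrs.append(scheduler.get_last_lr()[0])

print(lrs[0], max(lrs), lrs[-1])
```

The learning rate climbs linearly to the base value at the end of warmup and then decays along the cosine curve towards zero by the last step.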
posted @ 2024-01-31 22:44  SkyRainWind