Paper Reading Notes
1. LSTM and BiLSTM Basics
LSTM
Long Short-Term Memory networks (LSTMs) are, in essence, designed to remember information over long stretches of a sequence.
Whereas a plain RNN has a single tanh layer (acting as the activation), an LSTM replaces that single tanh layer with four interacting layers.
The cell state is carried from one step to the next; along the way, three kinds of gates add information to or remove it from the cell state, keeping what is wanted and discarding the rest.
These are the forget gate, the input gate, and the output gate.
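As a rough sketch of how those pieces fit together (the standard LSTM formulation written out by hand, not taken from any particular implementation), a single LSTM step computes the three gates and updates the cell state like this:

import torch

# One LSTM time step written out explicitly (a sketch of the standard formulation)
def lstm_step(x, h_prev, c_prev, W, U, b):
    z = x @ W + h_prev @ U + b                       # the four "interacting layers" computed at once
    i, f, o, g = z.chunk(4, dim=-1)
    i, f, o = i.sigmoid(), f.sigmoid(), o.sigmoid()  # input, forget, output gates
    g = g.tanh()                                     # candidate cell update
    c = f * c_prev + i * g                           # forget old information, add new information
    h = o * c.tanh()                                 # expose part of the cell state as the hidden state
    return h, c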
Reference: LSTM basics
BiLSTM
BiLSTM is short for Bi-directional Long Short-Term Memory; it combines a forward LSTM with a backward LSTM.
Compared with a plain LSTM, a bidirectional LSTM (BiLSTM) captures information from both directions, models sequences better, and helps mitigate the vanishing-gradient problem.
Reference: BiLSTM basics and its advantages over LSTM
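A minimal PyTorch illustration of the bidirectional point (a sketch with arbitrary sizes; only the shapes matter):

import torch
import torch.nn as nn

# bidirectional=True runs one LSTM forward and one backward over the sequence
# and concatenates their hidden states, so the output feature dimension doubles
rnn = nn.LSTM(input_size=6, hidden_size=4, bidirectional=True, batch_first=True)
x = torch.randn(3, 7, 6)   # batch of 3 sequences, length 7, 6 features each
out, (h, c) = rnn(x)
print(out.shape)           # torch.Size([3, 7, 8])  -> hidden_size * 2
print(h.shape)             # torch.Size([2, 3, 4])  -> num_directions * num_layers, batch, hidden_size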
2. Case 1: A Simple BiLSTM Example
Goal:
Given a long sentence, predict the next word from the words that have appeared so far.
The code is as follows:
Case 1 code
# Imports
import torch
import numpy as np
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as Data
dtype = torch.FloatTensor
Prepare the data
sentence = (
'GitHub Actions makes it easy to automate all your software workflows '
'from continuous integration and delivery to issue triage and more'
)
word2idx = {w: i for i, w in enumerate(list(set(sentence.split())))}
idx2word = {i: w for i, w in enumerate(list(set(sentence.split())))}
n_class = len(word2idx) # classification problem
max_len = len(sentence.split())
n_hidden = 5
Preprocess the data, build the dataset, and define the dataloader
def make_data(sentence):
    input_batch = []
    target_batch = []
    words = sentence.split()
    for i in range(max_len - 1):
        input = [word2idx[n] for n in words[:(i + 1)]]
        input = input + [0] * (max_len - len(input))
        target = word2idx[words[i + 1]]
        input_batch.append(np.eye(n_class)[input])
        target_batch.append(target)
    return torch.Tensor(input_batch), torch.LongTensor(target_batch)
# input_batch: [max_len - 1, max_len, n_class]
input_batch, target_batch = make_data(sentence)
dataset = Data.TensorDataset(input_batch, target_batch)
loader = Data.DataLoader(dataset, 16, True)
Define the network architecture
class BiLSTM(nn.Module):
    def __init__(self):
        super(BiLSTM, self).__init__()
        self.lstm = nn.LSTM(input_size=n_class, hidden_size=n_hidden, bidirectional=True)
        # fc
        self.fc = nn.Linear(n_hidden * 2, n_class)

    def forward(self, X):
        # X: [batch_size, max_len, n_class]
        batch_size = X.shape[0]
        input = X.transpose(0, 1)  # input : [max_len, batch_size, n_class]
        hidden_state = torch.randn(1*2, batch_size, n_hidden)  # [num_layers(=1) * num_directions(=2), batch_size, n_hidden]
        cell_state = torch.randn(1*2, batch_size, n_hidden)  # [num_layers(=1) * num_directions(=2), batch_size, n_hidden]
        outputs, (_, _) = self.lstm(input, (hidden_state, cell_state))
        outputs = outputs[-1]  # [batch_size, n_hidden * 2]
        model = self.fc(outputs)  # model : [batch_size, n_class]
        return model

model = BiLSTM()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
Training
for epoch in range(10000):
    for x, y in loader:
        pred = model(x)
        loss = criterion(pred, y)
        if (epoch + 1) % 1000 == 0:
            print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.6f}'.format(loss))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
Pred
predict = model(input_batch).data.max(1, keepdim=True)[1]
print(sentence)
print([idx2word[n.item()] for n in predict.squeeze()])
Code analysis:
A sentence is provided; in this case it is "GitHub Actions makes it easy to automate all your software workflows from continuous integration and delivery to issue triage and more".
The word-to-index mapping is built by the following code:
sentence = (
'GitHub Actions makes it easy to automate all your software workflows '
'from continuous integration and delivery to issue triage and more'
)
word2idx = {w: i for i, w in enumerate(list(set(sentence.split())))}
idx2word = {i: w for i, w in enumerate(list(set(sentence.split())))}
n_class = len(word2idx) # classification problem
max_len = len(sentence.split())
n_hidden = 5
The data preprocessing is done by make_data: in short, each prefix of the sentence is turned into a list of word indices and appended to input_batch, and during this step
input = input + [0] * (max_len - len(input))
pads each input with zeros so that its length always matches the number of words in the original sentence.
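As a quick sanity check of make_data (a sketch; the concrete indices depend on the ordering produced by set(), but the shapes are fixed by the sentence):

# The example sentence has 21 words, 19 of them unique, so:
input_batch, target_batch = make_data(sentence)
print(input_batch.shape)   # torch.Size([20, 21, 19]) -> [max_len - 1, max_len, n_class]
print(target_batch.shape)  # torch.Size([20])         -> one next-word target per prefix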
The BiLSTM architecture is implemented by the following code:
class BiLSTM(nn.Module):
def __init__(self):
super(BiLSTM, self).__init__()
self.lstm = nn.LSTM(input_size=n_class, hidden_size=n_hidden, bidirectional=True)
# fc
self.fc = nn.Linear(n_hidden * 2, n_class)
def forward(self, X):
# X: [batch_size, max_len, n_class]
batch_size = X.shape[0]
input = X.transpose(0, 1) # input : [max_len, batch_size, n_class]
hidden_state = torch.randn(1*2, batch_size, n_hidden) # [num_layers(=1) * num_directions(=2), batch_size, n_hidden]
cell_state = torch.randn(1*2, batch_size, n_hidden) # [num_layers(=1) * num_directions(=2), batch_size, n_hidden]
outputs, (_, _) = self.lstm(input, (hidden_state, cell_state))
outputs = outputs[-1] # [batch_size, n_hidden * 2]
model = self.fc(outputs) # model : [batch_size, n_class]
return model
model = BiLSTM()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
Here, self.lstm = nn.LSTM(input_size=n_class, hidden_size=n_hidden, bidirectional=True)
defines a bidirectional long short-term memory network (BiLSTM): input_size is the dimensionality of the input features, here the vocabulary size; hidden_size is the dimensionality of the hidden state, i.e., the number of hidden units; and bidirectional=True makes the LSTM bidirectional.
self.fc = nn.Linear(n_hidden * 2, n_class)
defines a linear (fully connected) layer that maps the BiLSTM output to the class space; n_hidden * 2 is needed because the BiLSTM is bidirectional, so its output dimension is twice the hidden size.
def forward(self, X)
defines the forward pass. X is the input, with shape [batch_size, max_len, n_class], where batch_size is the number of sequences in the batch, max_len is the maximum length of the input sequence, and n_class is the vocabulary size.
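To make those shapes concrete, a small smoke test of the model (a sketch; the batch content is all zeros and only the shapes are meaningful):

# Feed a fake batch through the BiLSTM and check the output shape
dummy = torch.zeros(3, max_len, n_class)   # 3 one-hot encoded sequences
out = model(dummy)
print(out.shape)                           # torch.Size([3, 19]) -> [batch_size, n_class]
# Inside forward, the outputs of nn.LSTM have shape [max_len, batch_size, n_hidden * 2];
# outputs[-1] keeps only the last time step before the fully connected layer.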
Results:
Training results after 10,000 epochs
Epoch: 1000 cost = 1.908870
Epoch: 1000 cost = 2.037666
Epoch: 2000 cost = 1.539054
Epoch: 2000 cost = 1.440200
Epoch: 3000 cost = 1.282757
Epoch: 3000 cost = 1.038417
Epoch: 4000 cost = 1.185625
Epoch: 4000 cost = 0.816874
Epoch: 5000 cost = 0.891233
Epoch: 5000 cost = 0.989744
Epoch: 6000 cost = 0.977199
Epoch: 6000 cost = 0.270807
Epoch: 7000 cost = 0.704010
Epoch: 7000 cost = 0.908530
Epoch: 8000 cost = 0.570496
Epoch: 8000 cost = 0.628990
Epoch: 9000 cost = 0.463704
Epoch: 9000 cost = 0.829284
Epoch: 10000 cost = 0.400181
Epoch: 10000 cost = 0.936139
GitHub Actions makes it easy to automate all your software workflows from continuous integration and delivery to issue triage and more
['makes', 'makes', 'makes', 'makes', 'to', 'automate', 'automate', 'your', 'software', 'from', 'from', 'continuous', 'integration', 'and', 'delivery', 'to', 'issue', 'triage', 'and', 'more']
Training results after 100,000 epochs
……
Epoch: 95000 cost = 0.000001
Epoch: 96000 cost = 0.043150
Epoch: 96000 cost = 0.175581
Epoch: 97000 cost = 0.086760
Epoch: 97000 cost = 0.000001
Epoch: 98000 cost = 0.045193
Epoch: 98000 cost = 0.171063
Epoch: 99000 cost = 0.086749
Epoch: 99000 cost = 0.000000
Epoch: 100000 cost = 0.086481
Epoch: 100000 cost = 0.000000
GitHub Actions makes it easy to automate all your software workflows from continuous integration and delivery to issue triage and more
['Actions', 'makes', 'it', 'easy', 'to', 'automate', 'all', 'your', 'workflows', 'workflows', 'from', 'continuous', 'integration', 'and', 'delivery', 'to', 'issue', 'triage', 'and', 'more']
Process finished with exit code 0
Follow-up thoughts:
Running the original code produces a warning: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at C:\cb\pytorch_1000000000000\work\torch\csrc\utils\tensor_new.cpp:278.) return torch.Tensor(input_batch), torch.LongTensor(target_batch)
To address this, add the following code inside make_data, right before the return:
# Convert input_batch and target_batch to numpy arrays
input_batch = np.array(input_batch)
target_batch = np.array(target_batch)
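Putting it together, the patched make_data would look like this (identical logic, with the conversion added before the return):

def make_data(sentence):
    input_batch, target_batch = [], []
    words = sentence.split()
    for i in range(max_len - 1):
        input = [word2idx[n] for n in words[:(i + 1)]]
        input = input + [0] * (max_len - len(input))
        target = word2idx[words[i + 1]]
        input_batch.append(np.eye(n_class)[input])
        target_batch.append(target)
    # Stack the lists of numpy arrays into single ndarrays before building tensors
    input_batch = np.array(input_batch)
    target_batch = np.array(target_batch)
    return torch.Tensor(input_batch), torch.LongTensor(target_batch)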
This not only removes the warning, but also gives better results than before with only 10,000 epochs of training.
Training results after 10,000 epochs (after the fix)
……
Epoch: 8000 cost = 0.216420
Epoch: 9000 cost = 0.332975
Epoch: 9000 cost = 0.043033
Epoch: 10000 cost = 0.251898
Epoch: 10000 cost = 0.209669
GitHub Actions makes it easy to automate all your software workflows from continuous integration and delivery to issue triage and more
['Actions', 'makes', 'it', 'it', 'to', 'automate', 'all', 'your', 'software', 'workflows', 'workflows', 'continuous', 'integration', 'and', 'delivery', 'to', 'issue', 'triage', 'and', 'more']
Process finished with exit code 0
3. Case 2: BERT + BiLSTM
Goal:
Named Entity Recognition (NER)
Given a sentence, detect the named entities it contains (Chinese named entities).
The code is as follows:
See "命名实体识别_中文.ipynb" or the executable script "bertjupyter.py" in the attached package.
Code analysis & result analysis:
from transformers import AutoTokenizer
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained('hfl/rbt6')
print(tokenizer)
# Tokenization test
tokenizer.batch_encode_plus(
[[
'海', '钓', '比', '赛', '地', '点', '在', '厦', '门', '与', '金', '门', '之', '间',
'的', '海', '域', '。'
],
[
'这', '座', '依', '山', '傍', '水', '的', '博', '物', '馆', '由', '国', '内', '一',
'流', '的', '设', '计', '师', '主', '持', '设', '计', ',', '整', '个', '建', '筑',
'群', '精', '美', '而', '恢', '宏', '。'
]],
truncation=True,
padding=True,
return_tensors='pt',
is_split_into_words=True)
Loading the tokenizer this way also works from within the domestic (mainland China) network.
is_split_into_words=True
is set to True because the sentences in this case come already split into individual characters, so the tokenizer does not need to tokenize them again.
return_tensors='pt'
makes the tokenizer return the encoded values as PyTorch tensors.
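For reference, the object returned by batch_encode_plus behaves like a dict of tensors; a quick inspection (a sketch reusing the tokenizer loaded above) looks like this:

# Inspect what the tokenizer returns for a small pre-split batch
encoded = tokenizer.batch_encode_plus(
    [['海', '钓', '比', '赛'], ['厦', '门']],
    truncation=True,
    padding=True,
    return_tensors='pt',
    is_split_into_words=True)
for k, v in encoded.items():
    print(k, v.shape)   # input_ids, token_type_ids, attention_mask, each [2, longest_len_in_batch]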
import torch
from datasets import load_dataset, load_from_disk
class Dataset(torch.utils.data.Dataset):
def __init__(self, split):
#names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
# Load the dataset online
#dataset = load_dataset(path='peoples_daily_ner', split=split)
# Load the dataset from local disk
dataset = load_from_disk(dataset_path='./data')[split]
# Filter out sentences that are too long
def f(data):
return len(data['tokens']) <= 512 - 2
dataset = dataset.filter(f)
self.dataset = dataset
def __len__(self):
return len(self.dataset)
def __getitem__(self, i):
tokens = self.dataset[i]['tokens']
labels = self.dataset[i]['ner_tags']
return tokens, labels
dataset = Dataset('train')
tokens, labels = dataset[0]
len(dataset), tokens, labels
This builds the dataset and maps each character of a sentence to a numeric tag, with names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC'].
In this label scheme, 0 ('O') means the character is an ordinary token, 1 ('B-PER') marks the beginning of a person name and 2 ('I-PER') a character inside one; for example, the tag sequence 1 2 2 corresponds to a three-character person name.
Likewise, 3 and 4 mark the beginning and inside of an organization name, and 5 and 6 the beginning and inside of a location.
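As a small illustration of how such tag sequences map back to entities (a hypothetical helper, not part of the original notebook):

# Hypothetical helper: turn a tag sequence into (start, end, entity_type) spans
names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']

def decode_spans(tags):
    spans, start, ent_type = [], None, None
    for i, t in enumerate(tags + [0]):                    # trailing 'O' closes any open span
        label = names[t] if 0 <= t < len(names) else 'O'  # out-of-range tags (e.g. padding 7) count as 'O'
        if not label.startswith('I-') and start is not None:
            spans.append((start, i, ent_type))
            start = None
        if label.startswith('B-'):
            start, ent_type = i, label[2:]
    return spans

print(decode_spans([1, 2, 2, 0, 5, 6]))   # [(0, 3, 'PER'), (4, 6, 'LOC')]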
# Collate function
def collate_fn(data):
tokens = [i[0] for i in data]
labels = [i[1] for i in data]
inputs = tokenizer.batch_encode_plus(tokens,
truncation=True,
padding=True,
return_tensors='pt',
is_split_into_words=True)
lens = inputs['input_ids'].shape[1]
for i in range(len(labels)):
labels[i] = [7] + labels[i]
labels[i] += [7] * lens
labels[i] = labels[i][:lens]
return inputs, torch.LongTensor(labels)
# Data loader
loader = torch.utils.data.DataLoader(dataset=dataset,
batch_size=16,
collate_fn=collate_fn,
shuffle=True,
drop_last=True)
# Inspect a data sample
for i, (inputs, labels) in enumerate(loader):
break
print(len(loader))
print(tokenizer.decode(inputs['input_ids'][0]))
print(labels[0])
for k, v in inputs.items():
print(k, v.shape)
These lines serve the same purpose as the padding in Case 1. The three statements
labels[i] = [7] + labels[i]
labels[i] += [7] * lens
labels[i] = labels[i][:lens]
respectively prepend a padding tag ([7]) at the start of each label sequence (for the [CLS] token), append padding tags at the end as well, and then truncate to the length of the longest encoded sentence in the batch, producing a batch of label sequences of equal length.
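A tiny worked example of those three lines (lens = 6 is just an illustrative value here):

# Worked example of the label padding/truncation, with lens = 6 for illustration
lens = 6
label = [1, 2, 2, 0]     # B-PER, I-PER, I-PER, O
label = [7] + label      # [7, 1, 2, 2, 0]                  -> padding tag for [CLS]
label += [7] * lens      # [7, 1, 2, 2, 0, 7, 7, 7, 7, 7, 7]
label = label[:lens]     # [7, 1, 2, 2, 0, 7]               -> same length as the encoded input
print(label)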
Running this code gives the following output:
1303
[CLS] 一 个 人 只 有 首 先 对 祖 国 有 一 个 感 性 化 、 具 象 化 的 认 识 , 才 会 更 加 热 爱 祖 国 。 [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
tensor([7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7])
input_ids torch.Size([16, 115])
token_type_ids torch.Size([16, 115])
attention_mask torch.Size([16, 115])
From this we can observe:
- The loader contains 1303 batches; shown here is the first sample of one batch.
- Characters outside the tokenizer's vocabulary would be shown as [UNK]; none appear in this particular sample.
- The [7] values in the label tensor come from the head/tail padding; the long run of 7s at the end depends on the gap between this sample's length and the longest sample in the batch.
- Every character of the original sentence is tagged 0, because each is an ordinary token rather than one of the entity types we want to recognize (person name, organization, location); note that a different sentence may appear on a re-run, since the loader shuffles the data.
# Define the downstream model
class Model(torch.nn.Module):
def __init__(self):
super().__init__()
self.tuneing = False
self.pretrained = None
self.rnn = torch.nn.GRU(768, 768,batch_first=True)
self.fc = torch.nn.Linear(768, 8)
def forward(self, inputs):
if self.tuneing:
out = self.pretrained(**inputs).last_hidden_state
else:
with torch.no_grad():
out = pretrained(**inputs).last_hidden_state
out, _ = self.rnn(out)
out = self.fc(out).softmax(dim=2)
return out
def fine_tuneing(self, tuneing):
self.tuneing = tuneing
if tuneing:
for i in pretrained.parameters():
i.requires_grad = True
pretrained.train()
self.pretrained = pretrained
else:
for i in pretrained.parameters():
i.requires_grad_(False)
pretrained.eval()
self.pretrained = None
model = Model()
model(inputs).shape
A downstream model typically takes the representations produced by a pretrained model as input and fine-tunes or further trains on top of them for a specific task. For example, pretrained language models such as BERT, GPT, and RoBERTa can serve as the backbone of a downstream model: a task-specific network is built on top of them and trained with supervised learning (or other methods) to solve various NLP tasks.
In this downstream model the key attributes are tuneing and pretrained. self.pretrained = None
means the pretrained model is not (initially) part of the downstream model.
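The Model code refers to a global pretrained object whose definition is not shown in this excerpt; presumably it is loaded roughly like this (a sketch, assuming the same hfl/rbt6 checkpoint used for the tokenizer):

from transformers import AutoModel

# Load the pretrained encoder that serves as the (initially frozen) feature extractor
pretrained = AutoModel.from_pretrained('hfl/rbt6')

# Freeze it so that, outside of tuning mode, no gradients flow into the encoder
for param in pretrained.parameters():
    param.requires_grad_(False)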
forward is defined so that, in tuning mode (self.tuneing = True),
the model uses its own copy of the pretrained model; otherwise it calls the external pretrained model inside torch.no_grad(), so no gradients are computed for it.
The function fine_tuneing
is defined to control whether gradients need to be computed for the pretrained parameters.
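A minimal usage sketch of that switch (assuming pretrained has been loaded as above):

# Stage 1: train only the GRU + linear head while the encoder stays frozen
model.fine_tuneing(False)

# Stage 2: unfreeze the encoder and fine-tune everything end to end
model.fine_tuneing(True)
print(sum(p.numel() for p in model.parameters() if p.requires_grad))   # trainable parameter count grows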
Follow-up thoughts
Outstanding issue: the model in this case was trained on the CPU, so training was not intensive enough and the model was not trained to full convergence.
This code was found by searching for BiLSTM and is described as BERT + BiLSTM, but after working through it I think it does not actually use a BiLSTM: the recurrent layer is a unidirectional torch.nn.GRU on top of BERT, so only BERT is really used here. This needs further investigation.
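If the downstream model were to actually use a BiLSTM, one minimal variant (a sketch, not the original code) would replace the GRU like this; the hidden size is halved so the concatenated forward+backward output stays 768-wide:

import torch

# Sketch: a downstream head with a BiLSTM instead of the unidirectional GRU
class BiLSTMHead(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = torch.nn.LSTM(768, 384, batch_first=True, bidirectional=True)
        self.fc = torch.nn.Linear(384 * 2, 8)

    def forward(self, hidden_states):            # [batch, seq_len, 768] from BERT
        out, _ = self.rnn(hidden_states)          # [batch, seq_len, 384 * 2]
        return self.fc(out).softmax(dim=2)        # [batch, seq_len, 8] tag probabilities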
4. Case 3: Sentiment Analysis with a BiLSTM
Goal:
Use PyTorch to build a BiLSTM for sentiment analysis.
The code is as follows:
See the package "Pytorch4NLP-main".
Code analysis & result analysis:
Core BiLSTM code
class Model(nn.Module):
def __init__(self, embed, config):
super().__init__()
self.embedding = nn.Embedding.from_pretrained(embed, freeze=False)
self.LSTM = nn.LSTM(config.embed_size, config.lstm_hidden_size,
num_layers=config.num_layers, batch_first=True,
bidirectional=True)
# The LSTM is bidirectional, so multiply by 2
self.ffn = nn.Linear(config.lstm_hidden_size * 2,
config.dense_hidden_size)
self.relu = nn.ReLU()
self.classifier = nn.Linear(config.dense_hidden_size,
config.num_outputs)
def forward(self, inputs):
# shape: (batch_size, max_seq_length, embed_size)
embed = self.embedding(inputs)
# shape: (batch_size, max_seq_length, lstm_hidden_size * 2)
lstm_hidden_states, _ = self.LSTM(embed)
# Hidden state of the LSTM at the last time step, i.e., the sentence vector
# shape: (batch, lstm_hidden_size * 2)
lstm_hidden_states = lstm_hidden_states[:, -1, :]
# shape: (batch, dense_hidden_size)
ffn_outputs = self.relu(self.ffn(lstm_hidden_states))
# shape: (batch, num_outputs)
logits = self.classifier(ffn_outputs)
return logits
The data-loading part of this code (importing the dataset and training set) currently raises errors and is still being fixed, but the core BiLSTM code shown above is fine.
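Since the data-loading part is still broken, the core model can at least be smoke-tested in isolation with random embeddings (a sketch; the config field names are taken from the code above, the values are arbitrary):

import torch
from types import SimpleNamespace

# Arbitrary sizes; only the field names match what Model expects
config = SimpleNamespace(embed_size=100, lstm_hidden_size=128, num_layers=1,
                         dense_hidden_size=64, num_outputs=2)
embed = torch.randn(5000, config.embed_size)    # stand-in for a pretrained embedding matrix

model = Model(embed, config)
dummy_batch = torch.randint(0, 5000, (8, 32))   # 8 sequences of 32 token ids each
print(model(dummy_batch).shape)                 # torch.Size([8, 2])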