Paper Reading

1. LSTM and BiLSTM Basics


LSTM

The essence of Long Short-Term Memory networks (LSTM) is that they can remember information over long time spans.

Whereas a vanilla RNN repeats a single tanh layer (acting as the activation), an LSTM replaces that single tanh layer with four interacting layers.

The cell state is carried from one step to the next, and along the way three kinds of gates add information to or remove information from the cell state, so that the network keeps the information it needs and discards the rest.

Forget gate, input gate, output gate
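To make the gate mechanics concrete, here is a minimal sketch of a single LSTM cell step written out by hand (all tensor names and sizes are illustrative and biases are omitted; this is not taken from any specific implementation):

import torch

# One LSTM cell step written out by hand; sizes are illustrative.
input_size, hidden_size = 4, 3
x_t = torch.randn(1, input_size)       # current input
h_prev = torch.zeros(1, hidden_size)   # previous hidden state
c_prev = torch.zeros(1, hidden_size)   # previous cell state
xh = torch.cat([x_t, h_prev], dim=1)   # the cell sees the input and the previous hidden state together

# Four interacting layers: forget gate, input gate, candidate content, output gate
W_f, W_i, W_c, W_o = (torch.randn(input_size + hidden_size, hidden_size) for _ in range(4))

f_t = torch.sigmoid(xh @ W_f)          # forget gate: how much of the old cell state to keep
i_t = torch.sigmoid(xh @ W_i)          # input gate: how much new information to write
c_hat = torch.tanh(xh @ W_c)           # candidate cell content
o_t = torch.sigmoid(xh @ W_o)          # output gate: how much of the cell state to expose

c_t = f_t * c_prev + i_t * c_hat       # updated cell state
h_t = o_t * torch.tanh(c_t)            # new hidden state passed to the next step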

LSTM basics

BiLSTM

BiLSTM stands for Bi-directional Long Short-Term Memory; it is composed of a forward LSTM and a backward LSTM.

Compared with a unidirectional LSTM, a bidirectional LSTM (BiLSTM) captures context in both directions and therefore models sequences better; it also inherits the LSTM gating that mitigates the vanishing-gradient problem.
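A quick shape check in PyTorch (a standalone sketch, separate from the case code below) shows how the two directions are concatenated:

import torch
import torch.nn as nn

# A bidirectional LSTM concatenates the forward and backward hidden states,
# so the feature dimension of the output is 2 * hidden_size.
lstm = nn.LSTM(input_size=10, hidden_size=5, bidirectional=True)

x = torch.randn(7, 3, 10)            # [seq_len, batch_size, input_size]
output, (h_n, c_n) = lstm(x)

print(output.shape)  # torch.Size([7, 3, 10])  -> seq_len, batch, 2 * hidden_size
print(h_n.shape)     # torch.Size([2, 3, 5])   -> num_layers * 2 directions, batch, hidden_size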

BiLSTM basics & advantages over LSTM

2. A Simple BiLSTM: Case 1


Goal:

Given a long sentence, predict the next word from the words that have already appeared.

Code:

Case 1 code
 
# Import libraries
import torch
import numpy as np
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as Data

dtype = torch.FloatTensor

Prepare the data

sentence = (
'GitHub Actions makes it easy to automate all your software workflows '
'from continuous integration and delivery to issue triage and more'
)

word2idx = {w: i for i, w in enumerate(list(set(sentence.split())))}
idx2word = {i: w for i, w in enumerate(list(set(sentence.split())))}
n_class = len(word2idx) # classification problem
max_len = len(sentence.split())
n_hidden = 5

Preprocess the data, build the dataset, and define the dataloader

def make_data(sentence):
    input_batch = []
    target_batch = []

    words = sentence.split()
    for i in range(max_len - 1):
        input = [word2idx[n] for n in words[:(i + 1)]]
        input = input + [0] * (max_len - len(input))
        target = word2idx[words[i + 1]]
        input_batch.append(np.eye(n_class)[input])
        target_batch.append(target)

    return torch.Tensor(input_batch), torch.LongTensor(target_batch)

input_batch: [max_len - 1, max_len, n_class]

input_batch, target_batch = make_data(sentence)
dataset = Data.TensorDataset(input_batch, target_batch)
loader = Data.DataLoader(dataset, 16, True)

Define the network architecture

class BiLSTM(nn.Module):
    def __init__(self):
        super(BiLSTM, self).__init__()
        self.lstm = nn.LSTM(input_size=n_class, hidden_size=n_hidden, bidirectional=True)
        # fc
        self.fc = nn.Linear(n_hidden * 2, n_class)

    def forward(self, X):
        # X: [batch_size, max_len, n_class]
        batch_size = X.shape[0]
        input = X.transpose(0, 1)  # input : [max_len, batch_size, n_class]

        hidden_state = torch.randn(1*2, batch_size, n_hidden)  # [num_layers(=1) * num_directions(=2), batch_size, n_hidden]
        cell_state = torch.randn(1*2, batch_size, n_hidden)    # [num_layers(=1) * num_directions(=2), batch_size, n_hidden]

        outputs, (_, _) = self.lstm(input, (hidden_state, cell_state))
        outputs = outputs[-1]  # [batch_size, n_hidden * 2]
        model = self.fc(outputs)  # model : [batch_size, n_class]
        return model

model = BiLSTM()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

Training

for epoch in range(10000):
    for x, y in loader:
        pred = model(x)
        loss = criterion(pred, y)
        if (epoch + 1) % 1000 == 0:
            print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.6f}'.format(loss))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Prediction

predict = model(input_batch).data.max(1, keepdim=True)[1]
print(sentence)
print([idx2word[n.item()] for n in predict.squeeze()])


Code analysis:

We provide a sentence; in this case it is: GitHub Actions makes it easy to automate all your software workflows from continuous integration and delivery to issue triage and more

The mapping between words and indices is built by the following code:

sentence = (
    'GitHub Actions makes it easy to automate all your software workflows '
    'from continuous integration and delivery to issue triage and more'
)

word2idx = {w: i for i, w in enumerate(list(set(sentence.split())))}
idx2word = {i: w for i, w in enumerate(list(set(sentence.split())))}
n_class = len(word2idx) # classification problem
max_len = len(sentence.split())
n_hidden = 5

The data preprocessing is defined by the make_data function shown above. In short, for every prefix of the sentence it stores the indices of the words seen so far as one training input; during this process, input = input + [0] * (max_len - len(input)) pads each input with zeros so that its length always equals the number of words in the original sentence.
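For intuition, here is a quick check of the first training pair (a small sketch that assumes the code above has already been run):

# The first input is the one-word prefix ['GitHub'], one-hot encoded and
# zero-padded to max_len; its target is the index of the next word, 'Actions'.
input_batch, target_batch = make_data(sentence)

print(input_batch.shape)                  # [max_len - 1, max_len, n_class]
print(target_batch.shape)                 # [max_len - 1]
print(idx2word[target_batch[0].item()])   # 'Actions'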

The BiLSTM architecture is implemented by the following code:

class BiLSTM(nn.Module):
    def __init__(self):
        super(BiLSTM, self).__init__()
        self.lstm = nn.LSTM(input_size=n_class, hidden_size=n_hidden, bidirectional=True)
        # fc
        self.fc = nn.Linear(n_hidden * 2, n_class)

    def forward(self, X):
        # X: [batch_size, max_len, n_class]
        batch_size = X.shape[0]
        input = X.transpose(0, 1)  # input : [max_len, batch_size, n_class]

        hidden_state = torch.randn(1*2, batch_size, n_hidden)   # [num_layers(=1) * num_directions(=2), batch_size, n_hidden]
        cell_state = torch.randn(1*2, batch_size, n_hidden)     # [num_layers(=1) * num_directions(=2), batch_size, n_hidden]

        outputs, (_, _) = self.lstm(input, (hidden_state, cell_state))
        outputs = outputs[-1]  # [batch_size, n_hidden * 2]
        model = self.fc(outputs)  # model : [batch_size, n_class]
        return model

model = BiLSTM()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

Here self.lstm = nn.LSTM(input_size=n_class, hidden_size=n_hidden, bidirectional=True) defines a bidirectional LSTM (BiLSTM). input_size is the dimensionality of the input features, here the vocabulary size; hidden_size is the dimensionality of the hidden state, i.e. the number of hidden units; bidirectional=True makes the LSTM bidirectional.

self.fc = nn.Linear(n_hidden * 2, n_class) defines a linear (fully connected) layer that maps the BiLSTM output to the class space. The input dimension is n_hidden * 2 because the BiLSTM is bidirectional, so its hidden representation is twice the original hidden size.

def forward(self, X) defines the forward pass. X is the input data with shape [batch_size, max_len, n_class], where batch_size is the number of samples in the batch, max_len is the maximum length of the input sequence, and n_class is the vocabulary size.
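A quick shape trace with a dummy batch (an illustrative sketch, assuming the definitions above) confirms the dimensions at each step:

# Feed a dummy batch of 3 one-hot sequences through the model and check shapes.
dummy = torch.zeros(3, max_len, n_class)  # [batch_size, max_len, n_class]
out = BiLSTM()(dummy)

# Inside forward:
#   X.transpose(0, 1)  -> [max_len, 3, n_class]
#   lstm outputs       -> [max_len, 3, n_hidden * 2]
#   outputs[-1]        -> [3, n_hidden * 2]   (last time step)
#   self.fc            -> [3, n_class]
print(out.shape)  # torch.Size([3, n_class])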

Results:

Training results after 10,000 epochs
 
Epoch: 1000 cost = 1.908870
Epoch: 1000 cost = 2.037666
Epoch: 2000 cost = 1.539054
Epoch: 2000 cost = 1.440200
Epoch: 3000 cost = 1.282757
Epoch: 3000 cost = 1.038417
Epoch: 4000 cost = 1.185625
Epoch: 4000 cost = 0.816874
Epoch: 5000 cost = 0.891233
Epoch: 5000 cost = 0.989744
Epoch: 6000 cost = 0.977199
Epoch: 6000 cost = 0.270807
Epoch: 7000 cost = 0.704010
Epoch: 7000 cost = 0.908530
Epoch: 8000 cost = 0.570496
Epoch: 8000 cost = 0.628990
Epoch: 9000 cost = 0.463704
Epoch: 9000 cost = 0.829284
Epoch: 10000 cost = 0.400181
Epoch: 10000 cost = 0.936139
GitHub Actions makes it easy to automate all your software workflows from continuous integration and delivery to issue triage and more
['makes', 'makes', 'makes', 'makes', 'to', 'automate', 'automate', 'your', 'software', 'from', 'from', 'continuous', 'integration', 'and', 'delivery', 'to', 'issue', 'triage', 'and', 'more']
  
The predictions at the start of the sentence are not ideal: with only a few words provided, even a bidirectional LSTM has too little context to work with. As the prediction point moves toward the middle of the sentence and more words become available, this problem disappears and the results improve considerably.
Training results after 100,000 epochs
 
……
Epoch: 95000 cost = 0.000001
Epoch: 96000 cost = 0.043150
Epoch: 96000 cost = 0.175581
Epoch: 97000 cost = 0.086760
Epoch: 97000 cost = 0.000001
Epoch: 98000 cost = 0.045193
Epoch: 98000 cost = 0.171063
Epoch: 99000 cost = 0.086749
Epoch: 99000 cost = 0.000000
Epoch: 100000 cost = 0.086481
Epoch: 100000 cost = 0.000000
GitHub Actions makes it easy to automate all your software workflows from continuous integration and delivery to issue triage and more
['Actions', 'makes', 'it', 'easy', 'to', 'automate', 'all', 'your', 'workflows', 'workflows', 'from', 'continuous', 'integration', 'and', 'delivery', 'to', 'issue', 'triage', 'and', 'more']

Process finished with exit code 0

After increasing the number of epochs to 100,000 the results improve dramatically: only one word is predicted incorrectly, and the loss is roughly 0.000000.

Follow-up thoughts:

Running the original code produces a warning: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at C:\cb\pytorch_1000000000000\work\torch\csrc\utils\tensor_new.cpp:278.) return torch.Tensor(input_batch), torch.LongTensor(target_batch)


To address this, the following code is added to make_data just before the return statement:

    # Convert input_batch and target_batch to numpy arrays
    input_batch = np.array(input_batch)
    target_batch = np.array(target_batch)
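Put together, a sketch of the full modified make_data with the conversion in place (the rest of the function is unchanged):

def make_data(sentence):
    input_batch = []
    target_batch = []

    words = sentence.split()
    for i in range(max_len - 1):
        input = [word2idx[n] for n in words[:(i + 1)]]
        input = input + [0] * (max_len - len(input))
        target = word2idx[words[i + 1]]
        input_batch.append(np.eye(n_class)[input])
        target_batch.append(target)

    # Convert the lists of numpy arrays into single numpy arrays before
    # building the tensors; this silences the warning and speeds things up.
    input_batch = np.array(input_batch)
    target_batch = np.array(target_batch)

    return torch.Tensor(input_batch), torch.LongTensor(target_batch)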

This not only removes the warning; with only 10,000 epochs it also produces better results than before.

Training results after 10,000 epochs (after the fix)
 
……
Epoch: 8000 cost = 0.216420
Epoch: 9000 cost = 0.332975
Epoch: 9000 cost = 0.043033
Epoch: 10000 cost = 0.251898
Epoch: 10000 cost = 0.209669
GitHub Actions makes it easy to automate all your software workflows from continuous integration and delivery to issue triage and more
['Actions', 'makes', 'it', 'it', 'to', 'automate', 'all', 'your', 'software', 'workflows', 'workflows', 'continuous', 'integration', 'and', 'delivery', 'to', 'issue', 'triage', 'and', 'more']

Process finished with exit code 0


3. BERT + BiLSTM: Case 2


Goal:

Named Entity Recognition (NER)

Given a sentence, detect the named entities in it; here the entities are in Chinese.

Code:

See the notebook "命名实体识别_中文.ipynb" or the script "bertjupyter.py" in the attached files.

Code analysis & results:

from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained('hfl/rbt6')

print(tokenizer)

# Tokenization test
tokenizer.batch_encode_plus(
    [[
        '海', '钓', '比', '赛', '地', '点', '在', '厦', '门', '与', '金', '门', '之', '间',
        '的', '海', '域', '。'
    ],
     [
         '这', '座', '依', '山', '傍', '水', '的', '博', '物', '馆', '由', '国', '内', '一',
         '流', '的', '设', '计', '师', '主', '持', '设', '计', ',', '整', '个', '建', '筑',
         '群', '精', '美', '而', '恢', '宏', '。'
     ]],
    truncation=True,
    padding=True,
    return_tensors='pt',
    is_split_into_words=True)

Loading the tokenizer above also works from within the Chinese domestic network.

is_split_into_words=True is set because the sentences in this case are already split into characters, so the tokenizer does not need to tokenize them again.

return_tensors='pt' makes the tokenizer return the encoded values as PyTorch tensors.

import torch
from datasets import load_dataset, load_from_disk


class Dataset(torch.utils.data.Dataset):
    def __init__(self, split):
        #names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']

        #Load the dataset online
        #dataset = load_dataset(path='peoples_daily_ner', split=split)

        #Load the dataset from local disk
        dataset = load_from_disk(dataset_path='./data')[split]

        #Filter out sentences that are too long
        def f(data):
            return len(data['tokens']) <= 512 - 2

        dataset = dataset.filter(f)

        self.dataset = dataset

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, i):
        tokens = self.dataset[i]['tokens']
        labels = self.dataset[i]['ner_tags']

        return tokens, labels


dataset = Dataset('train')

tokens, labels = dataset[0]

len(dataset), tokens, labels

This builds the dataset and maps each character of a sentence to a numeric tag. In the label set names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC'], tag 0 ('O') means the character is an ordinary token that is not part of any entity; tag 1 (B-PER) marks the beginning of a person name and tag 2 (I-PER) the inside of a person name, so for example the tag sequence 1 2 2 represents a three-character person name.

Likewise, tags 3 and 4 mark the beginning and inside of an organization name, and tags 5 and 6 the beginning and inside of a location.
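As a small illustration (the sentence and tags below are made up for this example, not taken from the dataset), the tag ids map back to labels like this:

# Hypothetical example: decode a tag-id sequence back to its labels.
names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']

tokens = ['王', '小', '明', '在', '北', '京']   # made-up sentence
tags   = [1,    2,    2,    0,   5,    6]       # 1 2 2 = three-character person name, 5 6 = two-character location

for token, tag in zip(tokens, tags):
    print(token, names[tag])
# 王 B-PER
# 小 I-PER
# 明 I-PER
# 在 O
# 北 B-LOC
# 京 I-LOC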

#Collate function
def collate_fn(data):
    tokens = [i[0] for i in data]
    labels = [i[1] for i in data]

    inputs = tokenizer.batch_encode_plus(tokens,
                                         truncation=True,
                                         padding=True,
                                         return_tensors='pt',
                                         is_split_into_words=True)

    lens = inputs['input_ids'].shape[1]

    for i in range(len(labels)):
        labels[i] = [7] + labels[i]
        labels[i] += [7] * lens
        labels[i] = labels[i][:lens]

    return inputs, torch.LongTensor(labels)


#Data loader
loader = torch.utils.data.DataLoader(dataset=dataset,
                                     batch_size=16,
                                     collate_fn=collate_fn,
                                     shuffle=True,
                                     drop_last=True)

#Inspect a data sample
for i, (inputs, labels) in enumerate(loader):
    break

print(len(loader))
print(tokenizer.decode(inputs['input_ids'][0]))
print(labels[0])

for k, v in inputs.items():
    print(k, v.shape)

This serves the same purpose as the padding in Case 1. The three lines labels[i] = [7] + labels[i], labels[i] += [7] * lens and labels[i] = labels[i][:lens] respectively prepend a padding tag (here 7) at the start of each label sequence, append padding tags at the end, and then truncate to the length of the longest encoded sentence in the batch, which yields a batch of label sequences of equal length.
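A small made-up example of what happens to one label sequence (the numbers are chosen only for illustration):

# Suppose the encoded batch is lens = 8 tokens wide and one sentence has 5 labels.
lens = 8
labels_i = [0, 1, 2, 2, 0]

labels_i = [7] + labels_i     # [7, 0, 1, 2, 2, 0]                        pad for the leading special token
labels_i += [7] * lens        # [7, 0, 1, 2, 2, 0, 7, 7, 7, 7, 7, 7, 7, 7]
labels_i = labels_i[:lens]    # [7, 0, 1, 2, 2, 0, 7, 7]                  truncated to the batch width
print(labels_i)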

Running this code produces the following output:

1303
[CLS] 一 个 人 只 有 首 先 对 祖 国 有 一 个 感 性 化 、 具 象 化 的 认 识 , 才 会 更 加 热 爱 祖 国 。 [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
tensor([7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
        7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
        7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
        7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7])
input_ids torch.Size([16, 115])
token_type_ids torch.Size([16, 115])
attention_mask torch.Size([16, 115])

From this output we can see:

  1. The loader contains 1303 batches of data; what is shown above is the first sample of one batch.
  2. Characters outside the tokenizer's vocabulary would be shown as [UNK]; none happen to appear in this example.
  3. The 7s in the label tensor are the padding added at the beginning and end; the number of trailing 7s depends on the gap between the longest sentence in the batch and the current one.
  4. Every character of the original sentence is tagged 0, because each one is an ordinary token rather than part of a person name, organization, or location that we want to recognize; since the loader shuffles, the sentence shown may change on another run.

#Define the downstream model
class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.tuneing = False
        self.pretrained = None

        self.rnn = torch.nn.GRU(768, 768,batch_first=True)
        self.fc = torch.nn.Linear(768, 8)

    def forward(self, inputs):
        if self.tuneing:
            out = self.pretrained(**inputs).last_hidden_state
        else:
            with torch.no_grad():
                out = pretrained(**inputs).last_hidden_state

        out, _ = self.rnn(out)

        out = self.fc(out).softmax(dim=2)

        return out

    def fine_tuneing(self, tuneing):
        self.tuneing = tuneing
        if tuneing:
            for i in pretrained.parameters():
                i.requires_grad = True

            pretrained.train()
            self.pretrained = pretrained
        else:
            for i in pretrained.parameters():
                i.requires_grad_(False)

            pretrained.eval()
            self.pretrained = None


model = Model()

model(inputs).shape

A downstream-task model typically takes the representations produced by a pretrained model as input and fine-tunes or further trains on top of them to fit a specific task. For example, pretrained language models such as BERT, GPT, and RoBERTa can serve as the backbone of a downstream model: a task-specific network is built on top of them and trained with supervised learning (or other methods) to solve various NLP tasks.

In the downstream model of this case the key attributes are tuneing and pretrained; self.pretrained = None means that, by default, the pretrained model is not part of the downstream model.

forward is defined so that in tuning mode (self.tuneing = True) the model calls its own copy of the pretrained model, whereas otherwise it calls the external pretrained model with gradients disabled.

The fine_tuneing method controls whether gradients are computed for the pretrained model.
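A usage sketch of the two modes (this assumes the pretrained encoder is loaded from the same 'hfl/rbt6' checkpoint as the tokenizer, which the notebook presumably does before calling the model; the exact loading line here is my assumption):

from transformers import AutoModel

# Load the pretrained encoder that Model refers to as `pretrained`.
pretrained = AutoModel.from_pretrained('hfl/rbt6')

model = Model()

# Feature-extraction mode: BERT parameters stay frozen, only the GRU + fc train.
model.fine_tuneing(False)
out = model(inputs)            # [batch_size, seq_len, 8]

# Fine-tuning mode: BERT parameters are unfrozen and trained together.
model.fine_tuneing(True)
out = model(inputs)
print(out.shape)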

Follow-up thoughts

Open issue: the model in this case was trained on the CPU, so the training was limited and the model was not trained to full convergence.

This code was found by searching for BiLSTM and is described as BERT + BiLSTM, but after working through it I think it does not actually use a BiLSTM: the downstream model only puts a unidirectional GRU on top of BERT. This needs further investigation.
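If we did want the downstream model to use a BiLSTM, a minimal sketch of the change (my own modification, not part of the original code) would be:

import torch

# Replace the unidirectional GRU with a bidirectional LSTM. Because the two
# directions are concatenated, the classifier input becomes 768 * 2.
rnn = torch.nn.LSTM(768, 768, batch_first=True, bidirectional=True)
fc = torch.nn.Linear(768 * 2, 8)

# In forward, the call pattern stays the same:
#   out, _ = rnn(out)        # out: [batch_size, seq_len, 768 * 2]
#   out = fc(out).softmax(dim=2)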

4. BiLSTM for Sentiment Analysis: Case 3


Goal:

Use PyTorch to build a BiLSTM for sentiment analysis.

Code:

See the attached folder "Pytorch4NLP-main".

Code analysis & results:

Core BiLSTM code:

class Model(nn.Module):

    def __init__(self, embed, config):
        super().__init__()
        self.embedding = nn.Embedding.from_pretrained(embed, freeze=False)
        self.LSTM = nn.LSTM(config.embed_size, config.lstm_hidden_size,
                            num_layers=config.num_layers, batch_first=True,
                            bidirectional=True)
        # Multiply by 2 because the LSTM is bidirectional
        self.ffn = nn.Linear(config.lstm_hidden_size * 2,
                             config.dense_hidden_size)
        self.relu = nn.ReLU()
        self.classifier = nn.Linear(config.dense_hidden_size,
                                    config.num_outputs)

    def forward(self, inputs):
        # shape: (batch_size, max_seq_length, embed_size)
        embed = self.embedding(inputs)
        # shape: (batch_size, max_seq_length, lstm_hidden_size * 2)
        lstm_hidden_states, _ = self.LSTM(embed)
        # Hidden state at the last time step of the LSTM, i.e. the sentence vector
        # shape: (batch, lstm_hidden_size * 2)
        lstm_hidden_states = lstm_hidden_states[:, -1, :]
        # shape: (batch, dense_hidden_size)
        ffn_outputs = self.relu(self.ffn(lstm_hidden_states))
        # shape: (batch, num_outputs)
        logits = self.classifier(ffn_outputs)

        return logits

The part of this code that loads the dataset and training set currently raises an error and is still being fixed, but the core BiLSTM model code works fine.
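As a sanity check while the data loading is being fixed, here is a minimal smoke test of the model with random embeddings; the Config class and all sizes below are made up for the sketch, only the field names match the code above:

import torch

# Hypothetical config with the fields Model expects; the values are arbitrary.
class Config:
    embed_size = 50
    lstm_hidden_size = 64
    num_layers = 1
    dense_hidden_size = 32
    num_outputs = 2   # e.g. positive / negative

config = Config()
vocab_size = 1000
embed = torch.randn(vocab_size, config.embed_size)    # stand-in for pretrained word vectors

model = Model(embed, config)
dummy_inputs = torch.randint(0, vocab_size, (4, 20))  # [batch_size, max_seq_length]
logits = model(dummy_inputs)
print(logits.shape)   # torch.Size([4, 2])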
