Paper Reading

1. LSTM and BiLSTM Basics


LSTM

The essence of Long Short-Term Memory networks (LSTM) is that they can remember information over long time spans.

Whereas a vanilla RNN repeats a single tanh layer (acting as the activation), an LSTM replaces that single tanh layer with four interacting layers.

The cell state is carried from one step to the next, and along the way three kinds of gates add information to or remove information from the cell state, so that the network keeps the information it needs and discards the rest.

Forget gate, input gate, output gate
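To make the gate mechanics concrete, here is a minimal sketch of a single LSTM cell step written out by hand (all tensor names and sizes are illustrative and biases are omitted; this is not taken from any specific implementation):

import torch

# One LSTM cell step written out by hand; sizes are illustrative.
input_size, hidden_size = 4, 3
x_t = torch.randn(1, input_size)       # current input
h_prev = torch.zeros(1, hidden_size)   # previous hidden state
c_prev = torch.zeros(1, hidden_size)   # previous cell state
xh = torch.cat([x_t, h_prev], dim=1)   # the cell sees the input and the previous hidden state together

# Four interacting layers: forget gate, input gate, candidate content, output gate
W_f, W_i, W_c, W_o = (torch.randn(input_size + hidden_size, hidden_size) for _ in range(4))

f_t = torch.sigmoid(xh @ W_f)          # forget gate: how much of the old cell state to keep
i_t = torch.sigmoid(xh @ W_i)          # input gate: how much new information to write
c_hat = torch.tanh(xh @ W_c)           # candidate cell content
o_t = torch.sigmoid(xh @ W_o)          # output gate: how much of the cell state to expose

c_t = f_t * c_prev + i_t * c_hat       # updated cell state
h_t = o_t * torch.tanh(c_t)            # new hidden state passed to the next step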

LSTM basics

BiLSTM

BiLSTM stands for Bi-directional Long Short-Term Memory; it is composed of a forward LSTM and a backward LSTM.

Compared with a unidirectional LSTM, a bidirectional LSTM (BiLSTM) captures context in both directions and therefore models sequences better; it also inherits the LSTM gating that mitigates the vanishing-gradient problem.
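A quick shape check in PyTorch (a standalone sketch, separate from the case code below) shows how the two directions are concatenated:

import torch
import torch.nn as nn

# A bidirectional LSTM concatenates the forward and backward hidden states,
# so the feature dimension of the output is 2 * hidden_size.
lstm = nn.LSTM(input_size=10, hidden_size=5, bidirectional=True)

x = torch.randn(7, 3, 10)            # [seq_len, batch_size, input_size]
output, (h_n, c_n) = lstm(x)

print(output.shape)  # torch.Size([7, 3, 10])  -> seq_len, batch, 2 * hidden_size
print(h_n.shape)     # torch.Size([2, 3, 5])   -> num_layers * 2 directions, batch, hidden_size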

BiLSTM basics & advantages over LSTM

2. A Simple BiLSTM: Case 1


Goal:

Given a long sentence, predict the next word from the words that have already appeared.

Code:

Case 1 code
 
# Import libraries
import torch
import numpy as np
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as Data

dtype = torch.FloatTensor

Prepare the data

sentence = (
'GitHub Actions makes it easy to automate all your software workflows '
'from continuous integration and delivery to issue triage and more'
)

word2idx = {w: i for i, w in enumerate(list(set(sentence.split())))}
idx2word = {i: w for i, w in enumerate(list(set(sentence.split())))}
n_class = len(word2idx) # classification problem
max_len = len(sentence.split())
n_hidden = 5

Preprocess the data, build the dataset, and define the dataloader

def make_data(sentence):
    input_batch = []
    target_batch = []

    words = sentence.split()
    for i in range(max_len - 1):
        input = [word2idx[n] for n in words[:(i + 1)]]
        input = input + [0] * (max_len - len(input))
        target = word2idx[words[i + 1]]
        input_batch.append(np.eye(n_class)[input])
        target_batch.append(target)

    return torch.Tensor(input_batch), torch.LongTensor(target_batch)

input_batch: [max_len - 1, max_len, n_class]

input_batch, target_batch = make_data(sentence)
dataset = Data.TensorDataset(input_batch, target_batch)
loader = Data.DataLoader(dataset, 16, True)

Define the network architecture

class BiLSTM(nn.Module):
    def __init__(self):
        super(BiLSTM, self).__init__()
        self.lstm = nn.LSTM(input_size=n_class, hidden_size=n_hidden, bidirectional=True)
        # fc
        self.fc = nn.Linear(n_hidden * 2, n_class)

    def forward(self, X):
        # X: [batch_size, max_len, n_class]
        batch_size = X.shape[0]
        input = X.transpose(0, 1)  # input : [max_len, batch_size, n_class]

        hidden_state = torch.randn(1*2, batch_size, n_hidden)  # [num_layers(=1) * num_directions(=2), batch_size, n_hidden]
        cell_state = torch.randn(1*2, batch_size, n_hidden)    # [num_layers(=1) * num_directions(=2), batch_size, n_hidden]

        outputs, (_, _) = self.lstm(input, (hidden_state, cell_state))
        outputs = outputs[-1]  # [batch_size, n_hidden * 2]
        model = self.fc(outputs)  # model : [batch_size, n_class]
        return model

model = BiLSTM()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

Training

for epoch in range(10000):
    for x, y in loader:
        pred = model(x)
        loss = criterion(pred, y)
        if (epoch + 1) % 1000 == 0:
            print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.6f}'.format(loss))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Prediction

predict = model(input_batch).data.max(1, keepdim=True)[1]
print(sentence)
print([idx2word[n.item()] for n in predict.squeeze()])


Code analysis:

We provide a sentence; in this case it is: GitHub Actions makes it easy to automate all your software workflows from continuous integration and delivery to issue triage and more

The mapping between words and indices is built by the following code:

sentence = (
    'GitHub Actions makes it easy to automate all your software workflows '
    'from continuous integration and delivery to issue triage and more'
)

word2idx = {w: i for i, w in enumerate(list(set(sentence.split())))}
idx2word = {i: w for i, w in enumerate(list(set(sentence.split())))}
n_class = len(word2idx) # classification problem
max_len = len(sentence.split())
n_hidden = 5

The data preprocessing is defined by the make_data function shown above. In short, for every prefix of the sentence it stores the indices of the words seen so far as one training input; during this process, input = input + [0] * (max_len - len(input)) pads each input with zeros so that its length always equals the number of words in the original sentence.
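For intuition, here is a quick check of the first training pair (a small sketch that assumes the code above has already been run):

# The first input is the one-word prefix ['GitHub'], one-hot encoded and
# zero-padded to max_len; its target is the index of the next word, 'Actions'.
input_batch, target_batch = make_data(sentence)

print(input_batch.shape)                  # [max_len - 1, max_len, n_class]
print(target_batch.shape)                 # [max_len - 1]
print(idx2word[target_batch[0].item()])   # 'Actions'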

The BiLSTM architecture is implemented by the following code:

class BiLSTM(nn.Module):
    def __init__(self):
        super(BiLSTM, self).__init__()
        self.lstm = nn.LSTM(input_size=n_class, hidden_size=n_hidden, bidirectional=True)
        # fc
        self.fc = nn.Linear(n_hidden * 2, n_class)

    def forward(self, X):
        # X: [batch_size, max_len, n_class]
        batch_size = X.shape[0]
        input = X.transpose(0, 1)  # input : [max_len, batch_size, n_class]

        hidden_state = torch.randn(1*2, batch_size, n_hidden)   # [num_layers(=1) * num_directions(=2), batch_size, n_hidden]
        cell_state = torch.randn(1*2, batch_size, n_hidden)     # [num_layers(=1) * num_directions(=2), batch_size, n_hidden]

        outputs, (_, _) = self.lstm(input, (hidden_state, cell_state))
        outputs = outputs[-1]  # [batch_size, n_hidden * 2]
        model = self.fc(outputs)  # model : [batch_size, n_class]
        return model

model = BiLSTM()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

Here self.lstm = nn.LSTM(input_size=n_class, hidden_size=n_hidden, bidirectional=True) defines a bidirectional LSTM (BiLSTM). input_size is the dimensionality of the input features, here the vocabulary size; hidden_size is the dimensionality of the hidden state, i.e. the number of hidden units; bidirectional=True makes the LSTM bidirectional.

self.fc = nn.Linear(n_hidden * 2, n_class) defines a linear (fully connected) layer that maps the BiLSTM output to the class space. The input dimension is n_hidden * 2 because the BiLSTM is bidirectional, so its hidden representation is twice the original hidden size.

def forward(self, X) defines the forward pass. X is the input data with shape [batch_size, max_len, n_class], where batch_size is the number of samples in the batch, max_len is the maximum length of the input sequence, and n_class is the vocabulary size.
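A quick shape trace with a dummy batch (an illustrative sketch, assuming the definitions above) confirms the dimensions at each step:

# Feed a dummy batch of 3 one-hot sequences through the model and check shapes.
dummy = torch.zeros(3, max_len, n_class)  # [batch_size, max_len, n_class]
out = BiLSTM()(dummy)

# Inside forward:
#   X.transpose(0, 1)  -> [max_len, 3, n_class]
#   lstm outputs       -> [max_len, 3, n_hidden * 2]
#   outputs[-1]        -> [3, n_hidden * 2]   (last time step)
#   self.fc            -> [3, n_class]
print(out.shape)  # torch.Size([3, n_class])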

Results:

Training results after 10,000 epochs
 
Epoch: 1000 cost = 1.908870
Epoch: 1000 cost = 2.037666
Epoch: 2000 cost = 1.539054
Epoch: 2000 cost = 1.440200
Epoch: 3000 cost = 1.282757
Epoch: 3000 cost = 1.038417
Epoch: 4000 cost = 1.185625
Epoch: 4000 cost = 0.816874
Epoch: 5000 cost = 0.891233
Epoch: 5000 cost = 0.989744
Epoch: 6000 cost = 0.977199
Epoch: 6000 cost = 0.270807
Epoch: 7000 cost = 0.704010
Epoch: 7000 cost = 0.908530
Epoch: 8000 cost = 0.570496
Epoch: 8000 cost = 0.628990
Epoch: 9000 cost = 0.463704
Epoch: 9000 cost = 0.829284
Epoch: 10000 cost = 0.400181
Epoch: 10000 cost = 0.936139
GitHub Actions makes it easy to automate all your software workflows from continuous integration and delivery to issue triage and more
['makes', 'makes', 'makes', 'makes', 'to', 'automate', 'automate', 'your', 'software', 'from', 'from', 'continuous', 'integration', 'and', 'delivery', 'to', 'issue', 'triage', 'and', 'more']
  
The predictions at the start of the sentence are not ideal: with only a few words provided, even a bidirectional LSTM has too little context to work with. As the prediction point moves toward the middle of the sentence and more words become available, this problem disappears and the results improve considerably.
Training results after 100,000 epochs
 
……
Epoch: 95000 cost = 0.000001
Epoch: 96000 cost = 0.043150
Epoch: 96000 cost = 0.175581
Epoch: 97000 cost = 0.086760
Epoch: 97000 cost = 0.000001
Epoch: 98000 cost = 0.045193
Epoch: 98000 cost = 0.171063
Epoch: 99000 cost = 0.086749
Epoch: 99000 cost = 0.000000
Epoch: 100000 cost = 0.086481
Epoch: 100000 cost = 0.000000
GitHub Actions makes it easy to automate all your software workflows from continuous integration and delivery to issue triage and more
['Actions', 'makes', 'it', 'easy', 'to', 'automate', 'all', 'your', 'workflows', 'workflows', 'from', 'continuous', 'integration', 'and', 'delivery', 'to', 'issue', 'triage', 'and', 'more']

Process finished with exit code 0

After increasing the number of epochs to 100,000 the results improve dramatically: only one word is predicted incorrectly, and the loss is roughly 0.000000.

Follow-up thoughts:

Running the original code produces a warning: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at C:\cb\pytorch_1000000000000\work\torch\csrc\utils\tensor_new.cpp:278.) return torch.Tensor(input_batch), torch.LongTensor(target_batch)


To address this, the following code is added to make_data just before the return statement:

    # Convert input_batch and target_batch to numpy arrays
    input_batch = np.array(input_batch)
    target_batch = np.array(target_batch)
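Put together, a sketch of the full modified make_data with the conversion in place (the rest of the function is unchanged):

def make_data(sentence):
    input_batch = []
    target_batch = []

    words = sentence.split()
    for i in range(max_len - 1):
        input = [word2idx[n] for n in words[:(i + 1)]]
        input = input + [0] * (max_len - len(input))
        target = word2idx[words[i + 1]]
        input_batch.append(np.eye(n_class)[input])
        target_batch.append(target)

    # Convert the lists of numpy arrays into single numpy arrays before
    # building the tensors; this silences the warning and speeds things up.
    input_batch = np.array(input_batch)
    target_batch = np.array(target_batch)

    return torch.Tensor(input_batch), torch.LongTensor(target_batch)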

This not only removes the warning; with only 10,000 epochs it also produces better results than before.

Training results after 10,000 epochs (after the fix)
 
……
Epoch: 8000 cost = 0.216420
Epoch: 9000 cost = 0.332975
Epoch: 9000 cost = 0.043033
Epoch: 10000 cost = 0.251898
Epoch: 10000 cost = 0.209669
GitHub Actions makes it easy to automate all your software workflows from continuous integration and delivery to issue triage and more
['Actions', 'makes', 'it', 'it', 'to', 'automate', 'all', 'your', 'software', 'workflows', 'workflows', 'continuous', 'integration', 'and', 'delivery', 'to', 'issue', 'triage', 'and', 'more']

Process finished with exit code 0


3. BERT + BiLSTM: Case 2


Goal:

Named Entity Recognition (NER)

Given a sentence, detect the named entities in it; here the entities are in Chinese.

Code:

See the notebook "命名实体识别_中文.ipynb" or the script "bertjupyter.py" in the attached files.

Code analysis & results:

from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained('hfl/rbt6')

print(tokenizer)

# Tokenization test
tokenizer.batch_encode_plus(
    [[
        '海', '钓', '比', '赛', '地', '点', '在', '厦', '门', '与', '金', '门', '之', '间',
        '的', '海', '域', '。'
    ],
     [
         '这', '座', '依', '山', '傍', '水', '的', '博', '物', '馆', '由', '国', '内', '一',
         '流', '的', '设', '计', '师', '主', '持', '设', '计', ',', '整', '个', '建', '筑',
         '群', '精', '美', '而', '恢', '宏', '。'
     ]],
    truncation=True,
    padding=True,
    return_tensors='pt',
    is_split_into_words=True)

Loading the tokenizer above also works from within the Chinese domestic network.

is_split_into_words=True is set because the sentences in this case are already split into characters, so the tokenizer does not need to tokenize them again.

return_tensors='pt' makes the tokenizer return the encoded values as PyTorch tensors.

import torch
from datasets import load_dataset, load_from_disk


class Dataset(torch.utils.data.Dataset):
    def __init__(self, split):
        #names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']

        #Load the dataset online
        #dataset = load_dataset(path='peoples_daily_ner', split=split)

        #Load the dataset from local disk
        dataset = load_from_disk(dataset_path='./data')[split]

        #Filter out sentences that are too long
        def f(data):
            return len(data['tokens']) <= 512 - 2

        dataset = dataset.filter(f)

        self.dataset = dataset

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, i):
        tokens = self.dataset[i]['tokens']
        labels = self.dataset[i]['ner_tags']

        return tokens, labels


dataset = Dataset('train')

tokens, labels = dataset[0]

len(dataset), tokens, labels

This builds the dataset and maps each character of a sentence to a numeric tag. In the label set names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC'], tag 0 ('O') means the character is an ordinary token that is not part of any entity; tag 1 (B-PER) marks the beginning of a person name and tag 2 (I-PER) the inside of a person name, so for example the tag sequence 1 2 2 represents a three-character person name.

Likewise, tags 3 and 4 mark the beginning and inside of an organization name, and tags 5 and 6 the beginning and inside of a location.
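As a small illustration (the sentence and tags below are made up for this example, not taken from the dataset), the tag ids map back to labels like this:

# Hypothetical example: decode a tag-id sequence back to its labels.
names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']

tokens = ['王', '小', '明', '在', '北', '京']   # made-up sentence
tags   = [1,    2,    2,    0,   5,    6]       # 1 2 2 = three-character person name, 5 6 = two-character location

for token, tag in zip(tokens, tags):
    print(token, names[tag])
# 王 B-PER
# 小 I-PER
# 明 I-PER
# 在 O
# 北 B-LOC
# 京 I-LOC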

#Collate function
def collate_fn(data):
    tokens = [i[0] for i in data]
    labels = [i[1] for i in data]

    inputs = tokenizer.batch_encode_plus(tokens,
                                         truncation=True,
                                         padding=True,
                                         return_tensors='pt',
                                         is_split_into_words=True)

    lens = inputs['input_ids'].shape[1]

    for i in range(len(labels)):
        labels[i] = [7] + labels[i]
        labels[i] += [7] * lens
        labels[i] = labels[i][:lens]

    return inputs, torch.LongTensor(labels)


#Data loader
loader = torch.utils.data.DataLoader(dataset=dataset,
                                     batch_size=16,
                                     collate_fn=collate_fn,
                                     shuffle=True,
                                     drop_last=True)

#Inspect a data sample
for i, (inputs, labels) in enumerate(loader):
    break

print(len(loader))
print(tokenizer.decode(inputs['input_ids'][0]))
print(labels[0])

for k, v in inputs.items():
    print(k, v.shape)

This serves the same purpose as the padding in Case 1. The three lines labels[i] = [7] + labels[i], labels[i] += [7] * lens and labels[i] = labels[i][:lens] respectively prepend a padding tag (here 7) at the start of each label sequence, append padding tags at the end, and then truncate to the length of the longest encoded sentence in the batch, which yields a batch of label sequences of equal length.
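A small made-up example of what happens to one label sequence (the numbers are chosen only for illustration):

# Suppose the encoded batch is lens = 8 tokens wide and one sentence has 5 labels.
lens = 8
labels_i = [0, 1, 2, 2, 0]

labels_i = [7] + labels_i     # [7, 0, 1, 2, 2, 0]                        pad for the leading special token
labels_i += [7] * lens        # [7, 0, 1, 2, 2, 0, 7, 7, 7, 7, 7, 7, 7, 7]
labels_i = labels_i[:lens]    # [7, 0, 1, 2, 2, 0, 7, 7]                  truncated to the batch width
print(labels_i)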

Running this code produces the following output:

1303
[CLS] 一 个 人 只 有 首 先 对 祖 国 有 一 个 感 性 化 、 具 象 化 的 认 识 , 才 会 更 加 热 爱 祖 国 。 [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
tensor([7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
        7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
        7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
        7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7])
input_ids torch.Size([16, 115])
token_type_ids torch.Size([16, 115])
attention_mask torch.Size([16, 115])

From this output we can see:

  1. The loader contains 1303 batches of data; what is shown above is the first sample of one batch.
  2. Characters outside the tokenizer's vocabulary would be shown as [UNK]; none happen to appear in this example.
  3. The 7s in the label tensor are the padding added at the beginning and end; the number of trailing 7s depends on the gap between the longest sentence in the batch and the current one.
  4. Every character of the original sentence is tagged 0, because each one is an ordinary token rather than part of a person name, organization, or location that we want to recognize; since the loader shuffles, the sentence shown may change on another run.

#Define the downstream model
class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.tuneing = False
        self.pretrained = None

        self.rnn = torch.nn.GRU(768, 768,batch_first=True)
        self.fc = torch.nn.Linear(768, 8)

    def forward(self, inputs):
        if self.tuneing:
            out = self.pretrained(**inputs).last_hidden_state
        else:
            with torch.no_grad():
                out = pretrained(**inputs).last_hidden_state

        out, _ = self.rnn(out)

        out = self.fc(out).softmax(dim=2)

        return out

    def fine_tuneing(self, tuneing):
        self.tuneing = tuneing
        if tuneing:
            for i in pretrained.parameters():
                i.requires_grad = True

            pretrained.train()
            self.pretrained = pretrained
        else:
            for i in pretrained.parameters():
                i.requires_grad_(False)

            pretrained.eval()
            self.pretrained = None


model = Model()

model(inputs).shape

A downstream-task model typically takes the representations produced by a pretrained model as input and fine-tunes or further trains on top of them to fit a specific task. For example, pretrained language models such as BERT, GPT, and RoBERTa can serve as the backbone of a downstream model: a task-specific network is built on top of them and trained with supervised learning (or other methods) to solve various NLP tasks.

In the downstream model of this case the key attributes are tuneing and pretrained; self.pretrained = None means that, by default, the pretrained model is not part of the downstream model.

forward is defined so that in tuning mode (self.tuneing = True) the model calls its own copy of the pretrained model, whereas otherwise it calls the external pretrained model with gradients disabled.

The fine_tuneing method controls whether gradients are computed for the pretrained model.
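A usage sketch of the two modes (this assumes the pretrained encoder is loaded from the same 'hfl/rbt6' checkpoint as the tokenizer, which the notebook presumably does before calling the model; the exact loading line here is my assumption):

from transformers import AutoModel

# Load the pretrained encoder that Model refers to as `pretrained`.
pretrained = AutoModel.from_pretrained('hfl/rbt6')

model = Model()

# Feature-extraction mode: BERT parameters stay frozen, only the GRU + fc train.
model.fine_tuneing(False)
out = model(inputs)            # [batch_size, seq_len, 8]

# Fine-tuning mode: BERT parameters are unfrozen and trained together.
model.fine_tuneing(True)
out = model(inputs)
print(out.shape)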

Follow-up thoughts

Open issue: the model in this case was trained on the CPU, so the training was limited and the model was not trained to full convergence.

This code was found by searching for BiLSTM and is described as BERT + BiLSTM, but after working through it I think it does not actually use a BiLSTM: the downstream model only puts a unidirectional GRU on top of BERT. This needs further investigation.
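If we did want the downstream model to use a BiLSTM, a minimal sketch of the change (my own modification, not part of the original code) would be:

import torch

# Replace the unidirectional GRU with a bidirectional LSTM. Because the two
# directions are concatenated, the classifier input becomes 768 * 2.
rnn = torch.nn.LSTM(768, 768, batch_first=True, bidirectional=True)
fc = torch.nn.Linear(768 * 2, 8)

# In forward, the call pattern stays the same:
#   out, _ = rnn(out)        # out: [batch_size, seq_len, 768 * 2]
#   out = fc(out).softmax(dim=2)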

4. BiLSTM for Sentiment Analysis: Case 3


Goal:

Use PyTorch to build a BiLSTM for sentiment analysis.

Code:

See the attached folder "Pytorch4NLP-main".

Code analysis & results:

Core BiLSTM code:

class Model(nn.Module):

    def __init__(self, embed, config):
        super().__init__()
        self.embedding = nn.Embedding.from_pretrained(embed, freeze=False)
        self.LSTM = nn.LSTM(config.embed_size, config.lstm_hidden_size,
                            num_layers=config.num_layers, batch_first=True,
                            bidirectional=True)
        # Multiply by 2 because the LSTM is bidirectional
        self.ffn = nn.Linear(config.lstm_hidden_size * 2,
                             config.dense_hidden_size)
        self.relu = nn.ReLU()
        self.classifier = nn.Linear(config.dense_hidden_size,
                                    config.num_outputs)

    def forward(self, inputs):
        # shape: (batch_size, max_seq_length, embed_size)
        embed = self.embedding(inputs)
        # shape: (batch_size, max_seq_length, lstm_hidden_size * 2)
        lstm_hidden_states, _ = self.LSTM(embed)
        # Hidden state at the last time step of the LSTM, i.e. the sentence vector
        # shape: (batch, lstm_hidden_size * 2)
        lstm_hidden_states = lstm_hidden_states[:, -1, :]
        # shape: (batch, dense_hidden_size)
        ffn_outputs = self.relu(self.ffn(lstm_hidden_states))
        # shape: (batch, num_outputs)
        logits = self.classifier(ffn_outputs)

        return logits

The part of this code that loads the dataset and training set currently raises an error and is still being fixed, but the core BiLSTM model code works fine.
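As a sanity check while the data loading is being fixed, here is a minimal smoke test of the model with random embeddings; the Config class and all sizes below are made up for the sketch, only the field names match the code above:

import torch

# Hypothetical config with the fields Model expects; the values are arbitrary.
class Config:
    embed_size = 50
    lstm_hidden_size = 64
    num_layers = 1
    dense_hidden_size = 32
    num_outputs = 2   # e.g. positive / negative

config = Config()
vocab_size = 1000
embed = torch.randn(vocab_size, config.embed_size)    # stand-in for pretrained word vectors

model = Model(embed, config)
dummy_inputs = torch.randint(0, vocab_size, (4, 20))  # [batch_size, max_seq_length]
logits = model(dummy_inputs)
print(logits.shape)   # torch.Size([4, 2])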
