GPT model pre-training and fine-tuning: an end-to-end walkthrough of the data and code

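Until now I had only worked with the encoder side of the Transformer and had never used or trained a decoder. Generative models have taken off recently, and I have also been reading about prompting, which suggests that most encoder-style tasks can be recast as decoder (generation) tasks. So here I study GPT-2 and organize what I learn.
The encoder discussed below is modeled on BERT. For BERT, the tokenizer in the transformers library usually converts a batch into three tensors: input_ids, attention_mask and token_type_ids. Pay attention to attention_mask: in BERT it mainly masks out the padded part of a sentence. It does not cover what a generative model with unidirectional attention needs, namely masking the later tokens in the Q*K^T matrix. BERT's attention is bidirectional, so I had not looked closely at this before.
First, the masking strategy in BERT. The attention_mask we pass in usually has shape (batch_size, seq_len), the same as input_ids. BERT processes the mask by calling get_extended_attention_mask in modeling_utils. The part of interest is the helper it uses when the model is configured as a decoder: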

    def create_extended_attention_mask_for_decoder(input_shape, attention_mask, device=None):
        if device is not None:
            warnings.warn("The `device` argument is deprecated and will be removed in v5 of Transformers.", FutureWarning)
        else:
            device = attention_mask.device
        batch_size, seq_length = input_shape
        seq_ids = torch.arange(seq_length, device=device)
        causal_mask = seq_ids[None, None, :].repeat(batch_size, seq_length, 1) <= seq_ids[None, :, None]
        # in case past_key_values are used we need to add a prefix ones mask to the causal mask
        # causal and attention masks must have same type with pytorch version < 1.3
        causal_mask = causal_mask.to(attention_mask.dtype)

        if causal_mask.shape[1] < attention_mask.shape[1]:
            prefix_seq_len = attention_mask.shape[1] - causal_mask.shape[1]
            causal_mask = torch.cat(
                [
                    torch.ones((batch_size, seq_length, prefix_seq_len), device=device, dtype=causal_mask.dtype),
                    causal_mask,
                ],
                axis=-1,
            )

        extended_attention_mask = causal_mask[:, None, :, :] * attention_mask[:, None, None, :]
        return extended_attention_mask

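I will only list the shape of each tensor here. (As an aside: torch has several operations for changing a tensor's shape, such as view, squeeze and unsqueeze. A convenient way to unsqueeze is to index with None at the position where you want the new dimension.)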

input_shape: (batch_size, seq_length)
seq_ids: (seq_length,)  # increasing sequence 0, 1, ..., seq_length-1
causal_mask: (1, 1, seq_length) repeated to (batch_size, seq_length, seq_length), then compared with <= against (1, seq_length, 1), giving (batch_size, seq_length, seq_length)
extended_attention_mask: (batch_size, 1, seq_length, seq_length) * (batch_size, 1, 1, seq_length) = (batch_size, 1, seq_length, seq_length)
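
The shapes above can be checked on a tiny batch. The following is a minimal sketch, assuming a batch of two sequences of length 4 where the second one ends with a single padding token; the sizes and the mask values are made up for illustration.

```python
import torch

batch_size, seq_length = 2, 4
seq_ids = torch.arange(seq_length)  # shape (4,): 0, 1, 2, 3

# Indexing with None inserts a size-1 axis, just like unsqueeze.
assert seq_ids[None, :].shape == seq_ids.unsqueeze(0).shape == (1, 4)

# Entry [b, i, j] is True when j <= i, i.e. a lower-triangular matrix per example.
causal_mask = seq_ids[None, None, :].repeat(batch_size, seq_length, 1) <= seq_ids[None, :, None]
causal_mask = causal_mask.long()

# Illustrative padding mask: the second example has one padding token at the end.
attention_mask = torch.tensor([[1, 1, 1, 1],
                               [1, 1, 1, 0]])

# Broadcast the padding mask over the query axis and multiply.
extended = causal_mask[:, None, :, :] * attention_mask[:, None, None, :]

print(extended.shape)  # torch.Size([2, 1, 4, 4])
print(extended[1, 0])  # lower triangle with the last column zeroed out
```

In the second example the last column is all zeros, so no query position can attend to the padding token.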

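The last step multiplies the lower-triangular mask by attention_mask, which removes the influence of padding tokens.
For an encoder mask, it is simply:
extended_attention_mask = attention_mask[:, None, None, :]
Every extended_attention_mask then goes through one more transformation:
extended_attention_mask = (1.0 - extended_attention_mask) * torch.finfo(dtype).min
Why? Because the Q*K^T score matrix is later added to the mask. Positions that were 1 in the mask become 0, and positions that were 0 become the most negative value the dtype can represent (effectively negative infinity). Adding the mask therefore leaves the scores of kept positions unchanged, while masked positions end up with a probability of essentially zero after softmax. The addition happens after scaling and before softmax: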

        attention_scores = attention_scores / math.sqrt(self.attention_head_size)
        if attention_mask is not None:
            # Apply the attention mask is (precomputed for all layers in BertModel forward() function)
            attention_scores = attention_scores + attention_mask

        # Normalize the attention scores to probabilities.
        attention_probs = nn.functional.softmax(attention_scores, dim=-1)
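
To see the additive mask at work, here is a minimal sketch with one query and three keys, where the last key is assumed to be padding. The score values are made up for illustration.

```python
import torch

# Scores for (batch=1, heads=1, query=1, key=3); values are illustrative.
scores = torch.tensor([[[[1.0, 2.0, 3.0]]]])
attention_mask = torch.tensor([[1, 1, 0]])  # the last key is padding

# Encoder-style extended mask: (batch, 1, 1, key_length).
extended = attention_mask[:, None, None, :].to(scores.dtype)
# 1 -> 0 (keep), 0 -> most negative float (mask out).
extended = (1.0 - extended) * torch.finfo(scores.dtype).min

probs = torch.softmax(scores + extended, dim=-1)
print(probs)  # the padded key receives probability 0
```

The two kept positions get exactly the probabilities that a softmax over [1.0, 2.0] alone would give, even though the padded key had the highest raw score.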

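That covers BERT's mask code. Now let's look at how GPT-2 does it.
GPT-2 applies the unidirectional (causal) mask by default. The only exception is a cross-attention layer (the is_cross_attention check below), which attends over encoder outputs and takes its mask from encoder_attention_mask instead.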

        if not self.is_cross_attention:
            # if only "normal" attention layer implements causal mask
            query_length, key_length = query.size(-2), key.size(-2)
            causal_mask = self.bias[:, :, key_length - query_length : key_length, :key_length].to(torch.bool)
            mask_value = torch.finfo(attn_weights.dtype).min
            # Need to be a tensor, otherwise we get error: `RuntimeError: expected scalar type float but found double`.
            # Need to be on the same device, otherwise `RuntimeError: ..., x and y to be on the same device`
            mask_value = torch.tensor(mask_value, dtype=attn_weights.dtype).to(attn_weights.device)
            attn_weights = torch.where(causal_mask, attn_weights, mask_value)

After that, attention_mask is still added to the score matrix:

        if attention_mask is not None:
            # Apply the attention mask
            attn_weights = attn_weights + attention_mask
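
The slice `key_length - query_length : key_length` matters when cached keys are used: during step-by-step generation the query is a single new token while the keys cover the whole history. Below is a minimal sketch of this logic outside the model. The `bias` tensor stands in for the registered buffer, `max_positions = 8` is an arbitrary choice, and `apply_causal_mask` is a hypothetical helper name, not a transformers function.

```python
import torch

max_positions = 8  # illustrative; the real value comes from the model config
# Stand-in for the `bias` buffer: lower-triangular, shape (1, 1, max_positions, max_positions).
bias = torch.tril(torch.ones(max_positions, max_positions, dtype=torch.bool))[None, None]


def apply_causal_mask(attn_weights):
    """attn_weights: (batch, heads, query_length, key_length)."""
    query_length, key_length = attn_weights.size(-2), attn_weights.size(-1)
    # Take the rows belonging to the last `query_length` positions.
    causal_mask = bias[:, :, key_length - query_length : key_length, :key_length]
    mask_value = torch.full([], torch.finfo(attn_weights.dtype).min, dtype=attn_weights.dtype)
    return torch.where(causal_mask, attn_weights, mask_value)


# Full sequence: 4 queries against 4 keys, the upper triangle is masked.
full = apply_causal_mask(torch.zeros(1, 1, 4, 4))
# Cached generation step: 1 new query against 4 keys, nothing is masked.
step = apply_causal_mask(torch.zeros(1, 1, 1, 4))
print(full[0, 0])
print(step[0, 0])
```

With a single new query the selected row of the triangle is entirely True, so the newest token can see every cached key.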

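In short, that is the difference between the two models' masking: BERT folds everything into a single additive mask, while GPT-2 applies the causal mask inside the attention layer and then adds the padding mask on top.
Next, the GPT-2 models in transformers. For text generation we usually call GPT2LMHeadModel, which adds a projection layer from the hidden size to vocabulary_size on top of the GPT2Model described above.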

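Let's first run a simple model based on GPT2LMHeadModel from Hugging Face.
A short introduction: the main GPT classes in Hugging Face are GPT2Model and GPT2LMHeadModel. Most of the other variants add a head layer on top of GPT2Model. GPT2Model itself only outputs hidden-state vectors that incorporate the preceding context;
GPT2LMHeadModel maps those vectors to vocab_size. Here is some simple prediction code: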

import torch
from transformers import BertTokenizer, GPT2LMHeadModel

# "model" is a local directory holding the downloaded weights, config and vocab.txt
model = GPT2LMHeadModel.from_pretrained("model")
tokenizer = BertTokenizer(vocab_file="model/vocab.txt")


def predict(inputs_text, max_length):
    """Greedy decoding: append the most likely next token max_length times."""
    model.eval()
    input_ids = tokenizer.encode(inputs_text)
    input_ids = input_ids[:-1]  # drop the trailing [SEP] so the model continues the text
    with torch.no_grad():  # inference only, no gradients needed
        for _ in range(max_length):
            outputs = model(input_ids=torch.tensor([input_ids]))
            # logits: (batch, seq_len, vocab_size); use the last position only
            last_token_id = int(torch.argmax(outputs.logits[0, -1]))
            last_token = tokenizer.convert_ids_to_tokens(last_token_id)
            inputs_text += last_token
            input_ids.append(last_token_id)
    print(inputs_text)


if __name__ == "__main__":
    inputs_text = "知我者"
    max_length = 30
    predict(inputs_text, max_length)

I downloaded a classical Chinese model for this; download link: https://github.com/Morizeyao/GPT2-Chinese
It runs as-is. The output of the code above:
知我者其惟春秋乎，罪我者其惟春秋乎。春秋，天子之事也。天子之事，则有天子之事，有诸侯之事，有大夫之事。诸侯

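A note on how Hugging Face computes the loss: the logits and labels come from the same text and are shifted by one position relative to each other: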

        if labels is not None:
            # Shift so that tokens < n predict n
            shift_logits = lm_logits[..., :-1, :].contiguous()
            shift_labels = labels[..., 1:].contiguous()
            # Flatten the tokens
            loss_fct = CrossEntropyLoss()
            loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))

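Conveniently, the forward pass is the same for training and prediction; for training you only need to backpropagate the loss. The loss can also be computed with the unwanted positions masked out. With that, a simple GPT model is up and running.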
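
One way to mask out unwanted positions is to set their labels to -100, the default ignore_index of CrossEntropyLoss. The following is a minimal sketch with random logits standing in for model output; the vocabulary size, the token ids and the use of id 0 for padding are assumptions made for illustration.

```python
import torch
from torch.nn import CrossEntropyLoss

torch.manual_seed(0)
vocab_size = 10  # illustrative
input_ids = torch.tensor([[2, 5, 7, 3, 0, 0]])       # last two ids are padding (assumed id 0)
attention_mask = torch.tensor([[1, 1, 1, 1, 0, 0]])
lm_logits = torch.randn(1, 6, vocab_size)            # stand-in for the model's logits

# Labels are the inputs themselves; padded positions are ignored by the loss.
labels = input_ids.clone()
labels[attention_mask == 0] = -100

# Same shifting as in the excerpt above: position t predicts token t + 1.
shift_logits = lm_logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
loss = CrossEntropyLoss()(shift_logits.view(-1, vocab_size), shift_labels.view(-1))

# Reference value: average over the three real targets only.
manual = CrossEntropyLoss()(lm_logits[0, :3], input_ids[0, 1:4])
print(loss.item(), manual.item())
```

The same trick can hide a prompt or the first speaker's turn from the loss, so that only the tokens the model is supposed to produce contribute to the gradient.
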
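The input format is set up differently for different tasks. For dialogue, for example:
Training data format: CLS + sentence 1 + SEP + sentence 2 + SEP + ...
Prediction data format: input: CLS + sentence + SEP; prediction: generate token by token until SEP appears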
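
The two dialogue formats above can be sketched in plain Python. The ids 101 and 102 for CLS and SEP are assumptions (read the real ones from your tokenizer), and `build_training_ids`, `generate_reply` and `next_token_fn` are hypothetical names; `next_token_fn` stands for one model forward pass plus a token choice.

```python
CLS_ID, SEP_ID = 101, 102  # assumed ids; use tokenizer.cls_token_id / sep_token_id in practice


def build_training_ids(utterances):
    """CLS + sentence 1 + SEP + sentence 2 + SEP + ..."""
    ids = [CLS_ID]
    for utterance in utterances:
        ids.extend(utterance)
        ids.append(SEP_ID)
    return ids


def generate_reply(prompt_ids, next_token_fn, max_length=20):
    """Feed the growing sequence back in until SEP is produced."""
    ids = list(prompt_ids)
    reply = []
    for _ in range(max_length):
        token = next_token_fn(ids)
        if token == SEP_ID:
            break
        reply.append(token)
        ids.append(token)
    return reply


# Prediction input: CLS + sentence + SEP
history = build_training_ids([[11, 12, 13]])
# A scripted stand-in for the model, so the loop can be run without weights.
scripted = iter([21, 22, SEP_ID, 23])
reply = generate_reply(history, lambda ids: next(scripted))
print(history)  # [101, 11, 12, 13, 102]
print(reply)    # [21, 22]
```
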
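Open-source datasets: https://github.com/brightmart/nlp_chinese_corpus
Common evaluation metrics for generation tasks are ROUGE, BLEU and METEOR. ROUGE emphasizes recall, while BLEU emphasizes precision. As a result, BLEU tends to favor short generated text and ROUGE tends to favor long generated text. METEOR additionally brings in external knowledge.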
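
The precision versus recall distinction can be illustrated with plain n-gram counting. This is only a sketch of the core idea, not the full metrics: real BLEU combines several n-gram orders and adds a brevity penalty, and ROUGE has several variants. The function names and the example sentences are made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap(candidate, reference, n):
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    matched = sum((cand & ref).values())  # clipped matches
    return matched, sum(cand.values()), sum(ref.values())


def ngram_precision(candidate, reference, n=1):
    """BLEU-style: matched n-grams divided by n-grams in the candidate."""
    matched, cand_total, _ = overlap(candidate, reference, n)
    return matched / cand_total if cand_total else 0.0


def ngram_recall(candidate, reference, n=1):
    """ROUGE-N-style: matched n-grams divided by n-grams in the reference."""
    matched, _, ref_total = overlap(candidate, reference, n)
    return matched / ref_total if ref_total else 0.0


reference = "the cat sat on the mat".split()
candidate = "the cat".split()
print(ngram_precision(candidate, reference))  # 1.0
print(ngram_recall(candidate, reference))     # 0.333...
```

A very short candidate scores perfectly on precision but poorly on recall, which is the intuition behind the short-text versus long-text tendency described above.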