[Translation] Text Generation with GPT-2

Translated from: https://www.modeldifferently.com/en/2021/12/generaci%C3%B3n-de-fake-news-con-gpt-2/

(Note: permission for this translation was not requested from the original author; it is purely for personal study.)

 

In this post we will see how to generate text with models based on the Transformer architecture, and we will use that knowledge to show how fake news can be created. The goal is to demonstrate, through a practical example, how these models work and how to use them.

We start with a theoretical introduction to text generation models, followed by an introduction to HuggingFace Transformers, the Python library we will use throughout the rest of the post. We then focus on the GPT-2 model and on how to use the interfaces available in HuggingFace Transformers, both to generate text with the pre-trained models and to re-train them on our own texts. Finally, we look at the ethical risks of using these models carelessly: they have been trained on text from the internet and have learned the biases present there.

 

 

Text generation models began to be developed decades ago, long before the deep learning boom. The purpose of these models is to predict a word, or a sequence of words, given an input text. Put simply, using the text as input, the model produces a probability distribution over the vocabulary it knows and selects the next word based on it.

 

Early text generation models were trained using Markov chains, where each word is a state of the chain and the probability of the next word (given the previous one) is estimated from how often the two words appear consecutively in the training text. Later, recurrent neural networks (RNNs), which can retain more contextual information, were used, as well as LSTMs, a type of RNN with better long-term memory. However, these networks are limited in how much they can remember and are also difficult to train, so they are not well suited to generating long texts.
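
As a quick illustration of the idea (a minimal sketch, not from the original post; the corpus and function names are invented), a first-order Markov chain generator can be written in a few lines:

import random
from collections import defaultdict

def train_markov(text):
    # store, for every word, the list of words observed right after it
    transitions = defaultdict(list)
    words = text.split()
    for current_word, next_word in zip(words, words[1:]):
        transitions[current_word].append(next_word)
    return transitions

def generate_markov(transitions, start_word, length=10):
    word = start_word
    output = [word]
    for _ in range(length):
        candidates = transitions.get(word)
        if not candidates:
            break
        # sampling from the list reproduces the bigram frequencies
        word = random.choice(candidates)
        output.append(word)
    return ' '.join(output)

chain = train_markov("the dog chased the cat and the cat chased the mouse")
print(generate_markov(chain, 'the'))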

In 2017, Google proposed a new architecture called the Transformer in its paper ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762). It is the architecture on which today's text generation models are based, such as GPT-2 and GPT-3, BERT or Transformer-XL.

In this post we focus on generating text with GPT-2, a text generation model based on the Transformer architecture and created by OpenAI in February 2019. Note that GPT-2 is an autoregressive model: it generates one token per iteration. The model is also available in several sizes, which differ in the dimensionality of their embeddings.

 

 

HuggingFace Transformers is a Python library that lets you download pre-trained models for tasks such as:

  • natural language understanding, such as sentiment analysis;
  • natural language generation, such as text generation or translation.

Among other models, it provides the four versions of GPT-2 trained and released by OpenAI, together with an easy-to-use interface.
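
For example, the high-level pipeline API covers several of these tasks out of the box; a minimal sketch for sentiment analysis (the default model it downloads depends on the library version):

from transformers import pipeline

# downloads a default pre-trained sentiment-analysis model on first use
classifier = pipeline('sentiment-analysis')
print(classifier('HuggingFace Transformers makes text generation easy'))
# expected output is something like [{'label': 'POSITIVE', 'score': 0.99}]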

There are three main concepts, or classes, in the library that we will use in this post:

Tokenizer: stores the vocabulary of each model and provides methods to encode and decode strings into the token-embedding indices used as model inputs, and back.

Configuration: contains the parameters needed to build a model. It is not required when using a pre-trained model as-is.

Model: the PyTorch or Keras model used to work with the pre-trained models included in the library.

 

 

 

First, let's import all the packages we are going to use. Specifically, these are the package versions:

transformers==4.4.2
datasets==1.5.0
nlp
colorama==0.4.4
torch==1.9.1

 

import torch, os, re, pandas as pd, json
from sklearn.model_selection import train_test_split
from transformers import DataCollatorForLanguageModeling, DataCollatorWithPadding, GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments, AutoConfig
from datasets import Dataset

 

def pretty_print(text, max_len_line=100):
    # utility to wrap long generated texts at roughly max_len_line characters per printed line
    words = text.split(' ')
    len_line = 0
    line = ''
    for w in words:
        if w == '\n':
            print(line)
            line = ''
            continue
        if (len(line) + len(w)) > max_len_line:
            print(line)
            line = ''
        line += ' ' + w
    print(line)

 

Next, we define whether to run the model on CPU or GPU. We use torch to check whether CUDA is available and, if so, use the GPU; otherwise we default to the CPU.

if torch.cuda.is_available():  
    dev = "cuda:0" 
else:  
    dev = "cpu"  
device = torch.device(dev)  

  

The first step is to load the model and the tokenizer it will use. We do both through the GPT-2 classes available in HuggingFace Transformers: GPT2LMHeadModel and GPT2Tokenizer. In both cases we must specify which model version to use; the four model sizes released by OpenAI are available:

  • 'gpt2'
  • 'gpt2-medium'
  • 'gpt2-large'
  • 'gpt2-xl'

 

# We load the model
base_model = GPT2LMHeadModel.from_pretrained('gpt2')
# options: ['gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl']

Once the model is loaded, we can explore its parameters and architecture:

base_model.num_parameters

Output:

<bound method ModuleUtilsMixin.num_parameters of GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (2): Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (3): Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (4): Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (5): Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (6): Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (7): Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (8): Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (9): Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (10): Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (11): Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)>

The tokenizer has three functions:

  • It splits the input text into tokens, which do not necessarily coincide with words, and encodes and decodes those tokens into the input IDs used by the model, and vice versa.
  • It allows new tokens to be added to the vocabulary.
  • It manages special tokens such as masks, beginning of text, end of text, separators, etc.

Through a tokenizer instance we can explore the vocabulary (get_vocab) and check its size, as well as tokenize different texts to see how it works.

# We load the tokenizer
base_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

 

print('Words in vocabulary: ', base_tokenizer.vocab_size)

Output:

Words in vocabulary:  50257

Run:

vocabulary = base_tokenizer.get_vocab()
vocabulary['Hi']

Output:

17250

Run:

text = "Hi, I'm Victor and I work as a Data Scientist"
base_tokenizer.tokenize(text)

Output:

['Hi',
 ',',
 'ĠI',
 "'m",
 'ĠVictor',
 'Ġand',
 'ĠI',
 'Ġwork',
 'Ġas',
 'Ġa',
 'ĠData',
 'ĠScientist']

To prepare the text and convert it into the format the model expects, we use the encode function, specifying the kind of tensors we want it to return: PyTorch or TensorFlow.

Run:

text_ids = base_tokenizer.encode(text, return_tensors = 'pt')
text_ids

# tensorflow
#text_ids = base_tokenizer.encode(text, return_tensors = 'tf')

Output:

tensor([[17250,    11,   314,  1101, 12622,   290,   314,   670,   355,   257,
          6060, 33374]])

With all of the above we can already generate text: we have a tokenized text and a pre-trained model, so we can call the generate function, passing the tokenized text as input.

 

text = "I work as a data scientist"
text_ids = base_tokenizer.encode(text, return_tensors = 'pt')

generated_text_samples = base_model.generate(
    text_ids
)
generated_text_samples

  

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


tensor([[   40,   670,   355,   257,  1366, 11444,   379,   262,  2059,   286,
          3442,    11, 14727,    13,   198,   198,     1,    40,  1101,   407]])

Since the output is again a tensor, we have to decode it token by token using the tokenizer:

for i, beam in enumerate(generated_text_samples):
    print(f"{i}: {base_tokenizer.decode(beam, skip_special_tokens=True)}")
    print()

Output:

0: I work as a data scientist at the University of California, Berkeley.

"I'm not

However, it is important to mention the decoding method (the way the next word is chosen given a phrase), because the quality of the generated text varies significantly with it. It can be configured through the parameters passed to the generate function.

The simplest method, greedy search, selects the word with the highest probability among all candidates at each step. It requires no extra parameters and is the method used by default.
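
To see what greedy decoding does at each step, here is a minimal sketch of a single step done by hand (assuming the base_model and base_tokenizer loaded above, and the text_ids encoded earlier); generate simply repeats this loop, appending the chosen token to the input:

with torch.no_grad():
    outputs = base_model(text_ids)

# logits has shape (batch, sequence_length, vocab_size); we take the
# distribution predicted for the position after the last input token
next_token_logits = outputs.logits[:, -1, :]
next_token_id = torch.argmax(next_token_logits, dim=-1)

print(base_tokenizer.decode(next_token_id))  # the single most probable next token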

 

 

 

generated_text_samples = base_model.generate(
    text_ids,
    max_length= 100,
)

for i, beam in enumerate(generated_text_samples):
    print(f"{i}: {base_tokenizer.decode(beam, skip_special_tokens=True)}")
    print()

Output:

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


0: I work as a data scientist at the University of California, Berkeley.

"I'm not a scientist, but I'm a data scientist," he said. "I'm not a data scientist, 
but I'm a data scientist."

He said he's not sure how much of the data he's collecting is from the government, 
but he's confident that it's not too much.

"I'm not going to be able to do that," he said. "I'm

This is a deterministic generation method: if we generate text again with the same prompt, the resulting text will be identical.

# text generation example
generated_text_samples = base_model.generate(
    text_ids,
    max_length= 100,
)

for i, beam in enumerate(generated_text_samples):
  print(f"{i}: {base_tokenizer.decode(beam, skip_special_tokens=True)}")
  print()

Output:

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


0: I work as a data scientist at the University of California, Berkeley.

"I'm not a scientist, but I'm a data scientist," he said. "I'm not a data scientist, 
but I'm a data scientist."

He said he's not sure how much of the data he's collecting is from the government, 
but he's confident that it's not too much.

"I'm not going to be able to do that," he said. "I'm

 

This method has the following problems:

  • It is deterministic.
  • It can get stuck in loops and repeat the same words.
  • It does not consider high-probability words that come after a low-probability word.

Beam search keeps in memory the B sequences with the highest probability at each step and finally selects the sequence with the highest overall probability. The parameter B corresponds to num_beams:

 

 

 

Run:

# text generation example
generated_text_samples = base_model.generate(
    text_ids,
    max_length= 50,  
    num_beams=5,
    num_return_sequences= 5,
    early_stopping=True 
)

for i, beam in enumerate(generated_text_samples):
  pretty_print(f"{i}: {base_tokenizer.decode(beam, skip_special_tokens=True)}")
  print()

 

Output:

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


 0: I work as a data scientist at the University of California, Berkeley, and I've been 
 working on this project for a long time. I've been working on this project for a long time. 
 I've been working on this project for a long time

 1: I work as a data scientist at the University of California, Berkeley, and I've been
 working on this for a long time. I've been working on this for a long time. I've been 
 working on this for a long time.



 2: I work as a data scientist at the University of California, Berkeley, and I've been 
 working on this for a long time. I've been working on this for a long time. I've been 
 working on this for a long time. I've

 3: I work as a data scientist at the University of California, Berkeley, and I've been 
 working on this for a long time. I've been working on this for a long time. I've been 
 working on this for a long time. I'm

 4: I work as a data scientist at the University of California, Berkeley, and I've been 
 working on this for a long time. I've been working on this for a long time. I've been 
 working on this for a long time. It's

To avoid generating repeated text, we can set a parameter that prevents n-grams of a given length from being repeated (no_repeat_ngram_size):

Run:

# text generation example
generated_text_samples = base_model.generate(
    text_ids,
    max_length= 50,  
    num_beams=5,
    no_repeat_ngram_size=2,
    num_return_sequences= 5,
    early_stopping=True 
)

for i, beam in enumerate(generated_text_samples):
  print(f"{i}: {base_tokenizer.decode(beam, skip_special_tokens=True)}")
  print()

Output:

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


0: I work as a data scientist at the University of California, Berkeley, 
and I've been working on this for a long time.

I have a lot of work to do, but I want to share it with you because I think it's

1: I work as a data scientist at the University of California, Berkeley, 
and I've been working on this for a long time.

I have a lot of work to do, but I want to share with you some of the things that I

2: I work as a data scientist at the University of California, Berkeley, 
and I've been working on this for a long time.

I have a lot of work to do, but I want to share it with you because it's important to

3: I work as a data scientist at the University of California, Berkeley, 
and I've been working on this for a long time.

I have a lot of work to do, but I want to share it with you because it's important.

4: I work as a data scientist at the University of California, Berkeley, 
and I've been working on this for a long time.

I have a lot of work to do, but I want to share it with you because I think it is

Problems:

  • It generates repetitive sequences that are difficult to control.
  • Humans do not always use such deterministic language, as explained by Ari Holtzman et al. (2019). In their study they compared the probability of the words chosen by humans with that of the words chosen by beam search, and observed that the latter are more probable and show less variation.

With sampling, the next word is chosen at random according to the conditional probability distribution given the previous words.

 

 

In addition, the temperature of the distribution can be adjusted to increase the probability of drawing one of the most likely words.
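
As a small, self-contained sketch (toy numbers, not from the original post) of what the temperature does: the logits are divided by the temperature before the softmax, so values below 1 sharpen the distribution and values above 1 flatten it.

import torch
import torch.nn.functional as F

logits = torch.tensor([3.0, 2.0, 1.0, 0.5])   # toy scores for four candidate words

for temperature in [1.0, 0.7, 1.5]:
    probs = F.softmax(logits / temperature, dim=-1)
    print(temperature, [round(p, 3) for p in probs.tolist()])

# lower temperature -> the probability mass concentrates on the top word,
# higher temperature -> the distribution becomes more uniform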

 

 

Two sampling methods work particularly well: top-K sampling and top-P sampling. In both cases, the do_sample parameter must be set to True.

Run:

# text generation example
generated_text_samples = base_model.generate(
    text_ids,
    max_length= 50,  
    do_sample=True,  
    top_k=0,
    num_return_sequences= 5
)

for i, beam in enumerate(generated_text_samples):
  pretty_print(f"{i}: {base_tokenizer.decode(beam, skip_special_tokens=True)}")
  print()

 

Output:

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


 0: I work as a data scientist and distributor. Compute data knowledge on production 
 companies with an approach that allows you to explore high-level, high performance 
 models that saw some crashes.

 1: I work as a data scientist on several city and county offices and I've spent many 
 many hours with people that happen to like me. When my leg is broken I can get medical 
 indeterminate problems. My

 2: I work as a data scientist at AlecGenetics, both in my prefrontal cortex and my scalp. 
 For example, batting average since I was 8-year-old was more than 500 points higher. 
 Soon after,

 3: I work as a data scientist in a number of sectors, including SAP, Google, CID, Apple, 
 best practices, clustering, mapping and data manipulation". However, Simon is breaking 
 out soon and has

 4: I work as a data scientist at Stratfor, where I've talked about it on some good lengths 
 at a number of shows. My primary focus and foremost goal was to return many messages out of 
 groups and

Now we can try adjusting the temperature parameter.

Run:

# text generation example
generated_text_samples = base_model.generate(
    text_ids,
    max_length= 50,  
    do_sample=True,  
    top_k=0,
    temperature=0.9,
    num_return_sequences= 5
)

for i, beam in enumerate(generated_text_samples):
  pretty_print(f"{i}: {base_tokenizer.decode(beam, skip_special_tokens=True)}")
  print()

Output:

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


 0: I work as a data scientist. All of my other fields come from years of research, so I 
 know they benefit real physical scientists. So I'm a kind of a crusader for open data that's 
 going to do away

 1: I work as a data scientist at Stardust Research. Blu gives me a home computer in my 
 garden. She owns a double-track truck, a touring dog, and some furniture. I've been a 
 homeless man all my life.

 2: I work as a data scientist at a group that solicits and enables data scientists in 
 their field to understand, properly question and engage with data scientists. I did that 
 with the data scientists

 3: I work as a data scientist at a global health company, and I told them it was foolproof 
 to shut down a website they had started. They asked me how they could know who I was. 
 Knowing that when a

 4: I work as a data scientist and lead a Data Analytics team on the IBM Distributed Systems 
 Group, studying the use of remote workers in data science data and exploration.

Alton has access to every

The first method, top-K sampling, randomly chooses the next word among the K words with the highest probability in the distribution. For example, if we want to generate text starting from the word "The" and K is 6, the next word would be chosen at random among nice, dog, car, woman, man and people.
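
A minimal sketch of the top-K filtering step itself, with made-up probabilities for illustration:

import torch

# toy distribution over eight candidate next words after "The"
words = ['nice', 'dog', 'car', 'woman', 'man', 'people', 'house', 'guy']
probs = torch.tensor([0.30, 0.25, 0.15, 0.10, 0.08, 0.05, 0.04, 0.03])

k = 6
top_probs, top_idx = torch.topk(probs, k)
top_probs = top_probs / top_probs.sum()                # renormalize over the K kept words
choice = top_idx[torch.multinomial(top_probs, 1)].item()
print(words[choice])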

 

 

Let's try specifying the top_k parameter:

Run:

# text generation example
generated_text_samples = base_model.generate(
    text_ids,
    max_length= 50,  
    do_sample=True,  
    top_k=25,
    num_return_sequences= 5
)

for i, beam in enumerate(generated_text_samples):
  pretty_print(f"{i}: {base_tokenizer.decode(beam, skip_special_tokens=True)}")
  print()

Output:

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


 0: I work as a data scientist and as a technical writer and I do not publish my writing 
 here for

 1: I work as a data scientist for some other companies and they asked me whether I felt 
 the same way. I've worked for the government of India for 20 years and they didn't give 
 me any sort of

 2: I work as a data scientist in a few small firms. In the fall of 2011, I ran the company's
 customer support team through some of our most complicated systems: the customer support 
 system was

 3: I work as a data scientist at the National Center for Missing and Exploited Children at
 Children's Hospital in Philadelphia. She has been following the cases for many months 
 and has been

 4: I work as a data scientist at the Center for Digital Economy, a nonprofit in Washington, 
 D.C.—and I'm currently working at one of the largest companies in Silicon Valley—and I've 
 got the idea for

In the case of top-P, also known as nucleus sampling, the next word is again chosen at random from the conditional probability distribution, but the set of candidate words is built by adding words until their cumulative probability reaches p. Continuing with the previous example, if instead of fixing the number of candidate words we keep the words whose cumulative probability reaches 94%, the set of options grows.
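
And a matching sketch of the nucleus (top-P) filtering step, with the same made-up probabilities:

import torch

words = ['nice', 'dog', 'car', 'woman', 'man', 'people', 'house', 'guy']
probs = torch.tensor([0.30, 0.25, 0.15, 0.10, 0.08, 0.05, 0.04, 0.03])

p = 0.94
sorted_probs, sorted_idx = torch.sort(probs, descending=True)
cumulative = torch.cumsum(sorted_probs, dim=-1)
nucleus_size = int((cumulative < p).sum().item()) + 1  # smallest set whose mass reaches p

nucleus_probs = sorted_probs[:nucleus_size] / sorted_probs[:nucleus_size].sum()
choice = sorted_idx[torch.multinomial(nucleus_probs, 1)].item()
print(words[choice])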

 

 

This method is controlled by the top_p parameter, which accepts values between 0 and 1:

Run:

# text generation example
generated_text_samples = base_model.generate(
    text_ids,
    max_length= 50,  
    do_sample=True,  
    top_k=0,
    top_p=0.92,
    num_return_sequences= 5
)

for i, beam in enumerate(generated_text_samples):
  pretty_print(f"{i}: {base_tokenizer.decode(beam, skip_special_tokens=True)}")
  print()

 

Output:

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


 0: I work as a data scientist for a leading US non-profit that focuses on the medical 
 care of Africa, specifically the Transitional Federated Republic of the Democratic 
 Republic of the Congo

 1: I work as a data scientist who also studies wireless. I tend to care about wireless
 fact-checking, which means I'm able to add factors to the rest of the experts' opinions 
 as well. We

 2: I work as a data scientist at the paper's UMass Amherst Initiative on Media Research, 
 a company specializing in peer-reviewed, high-quality research on certain types of media. 
 In addition to

 3: I work as a data scientist with a good point of view, and it'll never be easy to put 
 my head down and go make notes here if you don't see it. At the same time, my head is the 
 only one that does

 4: I work as a data scientist, only a few of my colleagues are. I have worked on a lot of 
 cold and hard data, but only a few people who really know any of those areas are able to 
 tell me what you're

All of the methods above can also be combined. In the following example we adjust the temperature of the distribution and define both K and P. The most restrictive criterion applies: if the cumulative probability of the top K words is greater than P, the chosen word is sampled only from the words that accumulate P, and vice versa.

Run:

# text generation example
generated_text_samples = base_model.generate(
    text_ids,
    max_length= 50,  
    do_sample=True,  
    top_k=100,
    top_p=0.92,
    temperature=0.8,
    repetition_penalty= 1.5,
    num_return_sequences= 5
)

for i, beam in enumerate(generated_text_samples):
  pretty_print(f"{i}: {base_tokenizer.decode(beam, skip_special_tokens=True)}")
  print()

 

Output:

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


 0: I work as a data scientist for an independent company. I've spent the last two years 
 working with developers to figure out how their apps are organized and being able write 
 better code than ever

 1: I work as a data scientist at Google. My job is to provide research for the company, 
 and I'm

 2: I work as a data scientist, and this is my job. I want to stay in the business of 
 helping people improve their skills," he said at his confirmation hearing on Monday 
 night for Assistant Secretary

 3: I work as a data scientist and I try to understand how the world works, so it's great 
 when you

 4: I work as a data scientist in Washington, D.C., and I've been on the receiving end of 
 hate mail lately," he said."My job is to help people understand what's wrong with their 
 communities before

With what we have learned, we create a function that wraps the tokenization of the input text, the text generation with GPT-2 and the decoding of the output, so that we can generate text with a single call:

 

def generate_n_text_samples(model, tokenizer, input_text, device, n_samples = 5):
    text_ids = tokenizer.encode(input_text, return_tensors = 'pt')
    text_ids = text_ids.to(device)
    model = model.to(device)

    generated_text_samples = model.generate(
        text_ids, 
        max_length= 100,  
        num_return_sequences= n_samples,
        no_repeat_ngram_size= 2,
        repetition_penalty= 1.5,
        top_p= 0.92,
        temperature= .85,
        do_sample= True,
        top_k= 125,
        early_stopping= True
    )
    gen_text = []
    for t in generated_text_samples:
        text = tokenizer.decode(t, skip_special_tokens=True)
        gen_text.append(text)

    return gen_text

 

GPT-2 was trained on generic text downloaded from the internet (Wikipedia, Reddit, etc.), so if we want the generated text to have a specific structure, or its content to focus on a specific topic, the pre-trained models available in Transformers are not enough. To achieve this we can fine-tune the model, that is, re-train the pre-trained model on a dataset containing the desired topic or text structure.

Fine-tuning lets us control the structure and topic of the generated text through the input dataset. It also requires far less data than training GPT-2 from scratch, which makes it much cheaper.

 

Fine-tuning involves three steps:

  1. Obtain the data.
  2. Process the data to add beginning-of-text and end-of-text tokens (or whichever tokens the desired type of text requires).
  3. Train the base model with this new data.

We are going to generate text in news format (fake news): headline + article. This requires two models:

  1. A headline-generation model, obtained by fine-tuning GPT-2 small on headlines from several newspapers.
  2. An article-generation model, obtained by fine-tuning GPT-2 small on headline + article pairs, so that given a headline it generates the first sentences of the article.
 
 

We define the start and end tokens for the headlines and add them in two places:

  1. to the tokenizer, as special tokens;
  2. to the configuration of the pre-trained model when it is loaded.

Run:
# the eos and bos tokens are defined
bos = '<|endoftext|>'
eos = '<|EOS|>'
pad = '<|pad|>'

special_tokens_dict = {'eos_token': eos, 'bos_token': bos, 'pad_token': pad}

# the new token is added to the tokenizer
num_added_toks = base_tokenizer.add_special_tokens(special_tokens_dict)

# the model config to which we add the special tokens
config = AutoConfig.from_pretrained('gpt2', 
                                    bos_token_id=base_tokenizer.bos_token_id,
                                    eos_token_id=base_tokenizer.eos_token_id,
                                    pad_token_id=base_tokenizer.pad_token_id,
                                    output_hidden_states=False)

# the pre-trained model is loaded with the custom configuration
base_model = GPT2LMHeadModel.from_pretrained('gpt2', config=config)

# the model embedding is resized
base_model.resize_token_embeddings(len(base_tokenizer))

Output:

Embedding(50259, 768)

Data processing in this case consists of three steps:

  • cleaning the dataset;
  • adding the start and end tokens to each headline;
  • generating the tokenized dataset that we will pass to the model for training.

Cleaning and processing the text before training the model is very important, because noise in the data can make the re-trained model generate worse text than the default model. We therefore filter the headlines:

  • empty or null headlines are removed;
  • the publication name is removed from headlines that contain it;
  • headlines with fewer than 8 words are discarded;
  • duplicate headlines are dropped.
 
filepath= './data/articles1.csv'
df = pd.read_csv(filepath, encoding = 'utf-8', usecols=['title', 'publication'])\
                    .rename(columns={'title': 'text'})

pd.set_option("display.max_colwidth", None)
df.head(5)
 

text                                                                                          publication
House Republicans Fret About Winning Their Health Care Suit - The New York Times             New York Times
Rift Between Officers and Residents as Killings Persist in South Bronx - The New York Times  New York Times
def remove_publication_headline(headline, publication):
    # publication col doesn't match exactly with newspaper in title col
    if str(publication) in str(headline):
        headline = headline.split(' - ')[0]
    return headline

def process_headlines(df, text_colname):
  
    # Remove empty and null rows
    titulo_vacio = (df['text'].str.len() == 0) | df['text'].isna()
    df = df[~titulo_vacio]

    # Remove publication name from title
    df['text'] = df.apply(lambda row: remove_publication_headline(row['text'], row['publication']), axis = 1)

    # Remove headlines with less than 8 words
    titlos_len_ge8 = (df['text'].str.split().apply(lambda x: len(x)) >= 8)
    df = df[titlos_len_ge8]

    # Drop duplicates
    text_df = df.drop_duplicates(subset = [text_colname])\
                [[text_colname]]

    return text_df
    
df = process_headlines(df, 'text')

 

Once our dataset is clean, we can add the start and end tokens to the headlines. We then split the dataset into training and validation sets.

Run:
df['text'] = bos + ' ' + df['text'] + ' ' + eos

df_train, df_val = train_test_split(df, train_size = 0.9, random_state = 77)
print(f'There are {len(df_train)} headlines for training and {len(df_val)} for validation')

Output:

There are 36380 headlines for training and 4043 for validation

 

Now we can generate the datasets used internally by HuggingFace Transformers, loading them directly from the pandas dataframes. In our case, when generating the datasets we drop every column except the headline text, since the others are unnecessary and would take up memory.
 
# we load the datasets directly from a pandas df
train_dataset = Dataset.from_pandas(df_train[['text']])
val_dataset = Dataset.from_pandas(df_val[['text']])

 

Finally, we tokenize the datasets so they can be used as training data. We use padding=True to add padding tokens at the end of the texts so that they all have the same length.

 
def tokenize_function(examples):
    return base_tokenizer(examples['text'], padding=True)


tokenized_train_dataset = train_dataset.map(
    tokenize_function,
    batched=True,
    num_proc=5,
    remove_columns=['text'],
)
tokenized_val_dataset = val_dataset.map(
    tokenize_function,
    batched=True,
    num_proc=5,
    remove_columns=['text'],
)

 

Run:

# Example of the result of the tokenization process with padding
base_tokenizer.decode(tokenized_train_dataset['input_ids'][0])

 

Output:

'<|endoftext|> Donald Trump: Hillary Clinton ’Opened the Pandora’s Box of Radical Islam’ 
<|EOS|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> 
<|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> 
<|pad|> <|pad|> <|pad|>'

 

To train the model, HuggingFace Transformers provides an API that abstracts away the most complex details of the process. We only need to instantiate the TrainingArguments class with the desired parameter values and pass it as an argument to the Trainer class. For GPT-2 it is also recommended to instantiate the DataCollatorForLanguageModeling class, which is in charge of building the batches used during training.

In our case we keep almost all training parameters at their default values, changing only the number of epochs and the batch size. For more details, all the parameters are explained in the documentation of the TrainingArguments class. We instantiate the data collator with our custom tokenizer and disable the masked language modeling option.

 
Run:
model_headlines_path = './model_headlines_news'

training_args = TrainingArguments(
    output_dir=model_headlines_path,          # output directory
    num_train_epochs=6,              # total # of training epochs
    per_device_train_batch_size=32,  # batch size per device during training
    per_device_eval_batch_size=16,   # batch size for evaluation
    warmup_steps=200,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir=model_headlines_path,            # directory for storing logs
    prediction_loss_only=True,
    save_steps=10000 
)

And the data collator:

data_collator = DataCollatorForLanguageModeling(
        tokenizer=base_tokenizer,
        mlm=False
    )

 

Finally, we instantiate the Trainer class, passing it the pre-trained base model, the training arguments, the data collator and the training and evaluation datasets. To start training we just call the train method of the class, which displays the training progress on screen. In addition, if we defined the save_steps parameter in the training arguments, a checkpoint of the model is saved automatically every time that number of steps is reached, and training can be resumed from it.

Run:
trainer = Trainer(
    model=base_model,                         # the instantiated  Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    data_collator=data_collator,
    train_dataset=tokenized_train_dataset,         # training dataset
    eval_dataset=tokenized_val_dataset            # evaluation dataset
)
trainer.train()
Output:
TrainOutput(global_step=6822, training_loss=3.793479129877373, 
metrics={'train_runtime': 6091.6768, 
'train_samples_per_second': 1.12, 'total_flos': 8435632813664256.0, 'epoch': 6.0, 
'init_mem_cpu_alloc_delta': 335156, 'init_mem_gpu_alloc_delta': 511148032, 
'init_mem_cpu_peaked_delta': 18306, 'init_mem_gpu_peaked_delta': 0, 
'train_mem_cpu_alloc_delta': 1265954, 'train_mem_gpu_alloc_delta': 1501261312, 
'train_mem_cpu_peaked_delta': 2107868, 'train_mem_gpu_peaked_delta': 3895132160})

 

 

After training, we save the model. It will be stored in the folder we specified in the TrainingArguments class.

 
trainer.save_model()
base_tokenizer.save_pretrained(model_headlines_path)

 

Output:

('./model_headlines_news/tokenizer_config.json',
 './model_headlines_news/special_tokens_map.json',
 './model_headlines_news/vocab.json',
 './model_headlines_news/merges.txt',
 './model_headlines_news/added_tokens.json')

 

 

In addition, since we passed an evaluation set, we can evaluate the model's metrics on that dataset and compare them with the metrics on the training set.

 
trainer.evaluate()

Output:

{'epoch': 6.0,
 'eval_loss': 3.579979181289673,
 'eval_mem_cpu_alloc_delta': 105120,
 'eval_mem_cpu_peaked_delta': 151664,
 'eval_mem_gpu_alloc_delta': 0,
 'eval_mem_gpu_peaked_delta': 534739968,
 'eval_runtime': 33.801,
 'eval_samples_per_second': 119.612}
With the trained model we can try to generate news headlines! In this case we use the start token (bos_token) as the prompt for the model. We can see that the generated texts have headline-like length and that they all relate to the topics of the headlines in our dataset. As expected, Trump and Hillary Clinton star in most of them. This means our model has successfully learned the structure and topics of our dataset.
 
 
# trained model loading
model_headlines_path = './model_headlines_news'


headlines_model = GPT2LMHeadModel.from_pretrained(model_headlines_path)
headlines_tokenizer = GPT2Tokenizer.from_pretrained(model_headlines_path)

device = "cuda:0"

input_text = headlines_tokenizer.bos_token

headlines = generate_n_text_samples(headlines_model, headlines_tokenizer, 
                                    input_text, device, n_samples = 10)
for h in headlines:
    print(h)
    print()

Output:

WikiLeaks: Clinton Foundation Adopted By Goldman Sachs, But Has ‘Little Effect’ on 
 Its Business

 Marco Rubio Defends Trump’s ‘Deplorables, Stupid People” Inaugural Address

 President Donald Trump and the Hill: Obamacare, a Fight, but I’m Fighting

 Texas School Shooting: ’I Am One of the Victims”

 Donald Trump’s Executive Action Plan Would Leave 8 Million Illegal Immigrants to Stay in 
 America

Next we repeat the process to generate the beginning of a news article from a headline. The overall process is the same: load the pre-trained model and tokenizer, add the required special tokens, load and process the data, generate the datasets and train. The main difference lies in how the data is processed so that the model learns to generate the content from the headline. We achieve this by adding a new separator token and concatenating the headline and the article content, separated by that token:

 
bos_token <title> sep_token <content> eos_token

 

This way, when generating text, we pass the model a prompt containing the headline and the separator, and the model has learned to generate text related to the headline.
 
 
# special tokens are defined
bos = '<|endoftext|>'
eos = '<|EOS|>'
body = '<|body|>'
additional_special_tokens = [body]

special_tokens_dict = {'eos_token': eos, 'bos_token': bos, 'pad_token': '<pad>',
                       'sep_token': body} 
                      #  'additional_special_tokens':additional_special_tokens}

# the new token is added to the tokenizer
num_added_toks = base_tokenizer.add_special_tokens(special_tokens_dict)

# model configuration to which we add the special tokens
config = AutoConfig.from_pretrained('gpt2', 
                                    bos_token_id=base_tokenizer.bos_token_id,
                                    eos_token_id=base_tokenizer.eos_token_id,
                                    pad_token_id=base_tokenizer.pad_token_id,
                                    sep_token_id=base_tokenizer.sep_token_id,
                                    output_hidden_states=False)

# we load the pre-trained model with custom settings
base_model = GPT2LMHeadModel.from_pretrained('gpt2', config=config)

# model embeding resizing
base_model.resize_token_embeddings(len(base_tokenizer))

 

We filter the headlines and articles:

  • rows with an empty or null headline or content are removed;
  • the publication name is removed from headlines that contain it;
  • headlines with fewer than 8 words are discarded;
  • duplicate headlines are dropped;
  • only the first 100 words of each article are kept.

We process the text by adding the separator and the special tokens as described above.

 
df = []
for filepath in ['./data/articles1.csv', './data/articles2.csv']:
    news_df = pd.read_csv(filepath, encoding = 'utf-8')
    df.append(news_df)
news_df = pd.concat(df, axis=0)

def remove_publication_headline(headline, publication):
    # publication col doesn't match exactly with newspaper in title col
    if str(publication) in str(headline):
        headline = headline.split(' - ')[0]
    return headline

  
def process_headlines_articles(df, title_col, content_col):
    # Remove rows with empty or null title or content
    titulo_vacio = (df[title_col].str.len() == 0) | df[title_col].isna()
    contenido_vacio = (df[content_col].str.len() == 0) | df[content_col].isna()
    df = df[~titulo_vacio & ~contenido_vacio]

    # Remove publication name from title
    df[title_col] = df.apply(lambda row: remove_publication_headline(row[title_col], row['publication']), axis = 1)

    # Remove headlines with less than 8 words
    titlos_len_ge8 = (df[title_col].str.split().apply(lambda x: len(x)) >= 8)
    df = df[titlos_len_ge8]

    # Keep the first 100 words from the content
    df[content_col] = df[content_col].str.split(' ').apply(lambda x: ' '.join(x[:100]))

    # Drop duplicates and keep only the title and content columns
    text_df = df.drop_duplicates(subset = [title_col])\
                [[title_col, content_col]]

    return text_df

# Data cleansing
news_df = process_headlines_articles(news_df, title_col='title', content_col='content')

# We add the tokens
prepare_text = lambda x: ' '.join([bos, x['title'], body, x['content'], eos])
news_df['text'] = news_df.apply(prepare_text, axis=1)

# Split in train and test
df_train_news, df_val_news = train_test_split(news_df, train_size = 0.9, random_state = 77)

# we load the datasets from pandas df
train_dataset = Dataset.from_pandas(df_train_news[['text']])
val_dataset = Dataset.from_pandas(df_val_news[['text']])

# tokenization
tokenized_train_dataset = train_dataset.map(
    tokenize_function,
    batched=True,
    num_proc=1
)

tokenized_val_dataset = val_dataset.map(
    tokenize_function,
    batched=True,
    num_proc=1
)

We train in the same way as for the headlines. In this case, training was interrupted before completion and then resumed from the last saved checkpoint.

 
model_articles_path = './news-articles_v4'

training_args = TrainingArguments(
    output_dir=model_articles_path,          # output directory
    num_train_epochs=2,              # total # of training epochs
    per_device_train_batch_size=5,  # batch size per device during training
    per_device_eval_batch_size=32,   # batch size for evaluation
    warmup_steps=200,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir=model_articles_path,            # directory for storing logs
    prediction_loss_only=True,
    save_steps=10000
)

data_collator = DataCollatorForLanguageModeling(
        tokenizer=base_tokenizer,
        mlm=False
    )

trainer = Trainer(
    model=base_model,                         # the instantiated  Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    data_collator=data_collator,
    train_dataset=tokenized_train_dataset,         # training dataset
    eval_dataset=tokenized_val_dataset,            # evaluation dataset
    
)

To train:

trainer.train()

or, to resume from the last checkpoint:

trainer.train(resume_from_checkpoint=True)

 

Output:

 
TrainOutput(global_step=27980, training_loss=0.7775315323584927, 
metrics={'train_runtime': 2335.4803, 
'train_samples_per_second': 11.98, 'total_flos': 1.698689707771392e+16, 'epoch': 2.0, 
'init_mem_cpu_alloc_delta': 333850, 'init_mem_gpu_alloc_delta': 511148032, 
'init_mem_cpu_peaked_delta': 18306, 'init_mem_gpu_peaked_delta': 0, 
'train_mem_cpu_alloc_delta': 1500091, 'train_mem_gpu_alloc_delta': 2012204032,
 'train_mem_cpu_peaked_delta': 3292110, 'train_mem_gpu_peaked_delta': 4292523008})

 

We save the model:

trainer.save_model()
base_tokenizer.save_pretrained(model_articles_path)

 

We now have the final result: fake news from 2016. Starting from the headlines generated earlier, we generate the beginning of an article.
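
A minimal sketch of how such a prompt could be built, assuming the fine-tuned article model and tokenizer saved above in model_articles_path and reusing the generate_n_text_samples helper defined earlier (the headline string is one of the generated examples):

articles_model = GPT2LMHeadModel.from_pretrained(model_articles_path)
articles_tokenizer = GPT2Tokenizer.from_pretrained(model_articles_path)

headline = "WikiLeaks: Clinton Foundation Adopted By Goldman Sachs, But Has 'Little Effect' on Its Business"

# the prompt follows the training format: bos_token <title> sep_token
prompt = ' '.join([articles_tokenizer.bos_token, headline, articles_tokenizer.sep_token])

articles = generate_n_text_samples(articles_model, articles_tokenizer, prompt,
                                   device, n_samples=1)
print(articles[0])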

 WikiLeaks: Clinton Foundation Adopted By Goldman Sachs, But Has ‘Little Effect’ on
Its Business

  WikiLeaks has published details of the new Hillary for America campaign ad that is being 
  used to push back against Democratic presidential nominee former Secretary and 2016 
  Democratic National Committee (DNC) candidate Bernie Sanders. [Sanders Campaign is 
  spending $250 million promoting his unsuccessful bid at this year U. S District Court 
  in San Francisco where a judge has rejected three

 Marco Rubio Defends Trump’s ‘Deplorables, Stupid People” Inaugural Address

  On Wednesday night in Cleveland as part of his Republican presidential campaign for the 
  presidency he addressed Donald J. Trump and other Democrats who are engaged with him over 
  immigration reform or trade deals while praising Mr.[  During their commencement address 
  at Ohio State University on Thursday morning several speakers offered him lessons to learn 
  about American values during that

 

The quality is not outstanding; most of these articles would not look real to a human reader, either because the content is incoherent or because the structure is odd. For example, brackets appear at random and are never closed, and quotes are opened and closed with different characters. However, considering the limited resources (these models were trained on free Google Colab and the smallest GPT-2 was used), the results are remarkable. With better infrastructure and larger models it is certainly possible to achieve believable results, as we have already seen happen in recent years.
 
 

GPT-2 was trained on a huge amount of text collected from the internet, so it has learned the biases and language present on those websites. Models like GPT-2 show problems such as:

  • bias against minorities and historically discriminated social groups;
  • toxic language: sexist, violent, offensive, etc.

This is also one of the main reasons why OpenAI did not publicly release GPT-3.

The community is doing a great deal of work to develop reliable and safe methods for removing toxicity from generated text.

We recommend reading the following paper, which contains an extensive review of existing approaches for detoxifying models: https://arxiv.org/pdf/2009.11462.pdf

 
 
  • We have explained and demonstrated how to generate text with pre-trained GPT-2 models and how to re-train (fine-tune) GPT-2 to generate text with a specific structure and topic.
  • GPT-3
 
 