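Translated from: https://www.modeldifferently.com/en/2021/12/generaci%C3%B3n-de-fake-news-con-gpt-2/
We start with a theoretical introduction to text generation models, followed by an introduction to HuggingFace Transformers, the Python library used throughout the rest of this article. We then focus on the GPT-2 model and on how to use the interfaces available in HuggingFace Transformers, both to generate text with the pre-trained models and to retrain them on our own text. Finally, we look at the ethical risks that come with using these models carelessly: they were trained on text from the Internet and have learned the biases present there.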
1. Text generation models
1.1 Introduction to text generation models
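In 2017, Google introduced a new architecture called the Transformer in its paper ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762). It is the architecture on which today's text generation models are based, such as GPT-2 and GPT-3, BERT, or Transformer XL.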
1.2 Huggingface Transformers
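Huggingface Transformers is a Python library that downloads pre-trained models for tasks such as:
- Natural language understanding, for example sentiment analysis.
- Natural language generation, for example text generation or text translation.

It also provides the four versions of GPT-2 trained and released by OpenAI, behind interfaces that are easy to use.

Model: the PyTorch or Keras model, used to work with the pre-trained models in the library.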
2. Setup
    transformers==4.4.2
    datasets==1.5.0
    nlp
    colorama==0.4.4
    torch==1.9.1

    import json
    import os
    import re

    import pandas as pd
    import torch
    from datasets import Dataset
    from sklearn.model_selection import train_test_split
    from transformers import (
        AutoConfig,
        DataCollatorForLanguageModeling,
        DataCollatorWithPadding,
        GPT2LMHeadModel,
        GPT2Tokenizer,
        Trainer,
        TrainingArguments,
    )

    def pretty_print(text, max_len_line=100):
        # Print the text word by word, starting a new line once the
        # current one would exceed max_len_line characters
        words = text.split(' ')
        line = ''
        for w in words:
            if w == '\n':
                print(line)
                line = ''
                continue
            if (len(line) + len(w)) > max_len_line:
                print(line)
                line = ''
            line += ' ' + w
        print(line)
Next, we decide whether to run the model on the CPU or on the GPU. We use torch to check whether CUDA is available: if it is, we use the GPU; otherwise we fall back to the CPU.

    if torch.cuda.is_available():
        dev = "cuda:0"
    else:
        dev = "cpu"
    device = torch.device(dev)
3. Text generation with GPT-2
3.1 Model and tokenizer loading
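The first step is to load the model and the tokenizer it will use. Both are loaded through the GPT-2 classes that Huggingface Transformers exposes, GPT2LMHeadModel and GPT2Tokenizer. In both cases you must specify which version of the model to use; all four model sizes released by OpenAI are available: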
- 'gpt2'
- 'gpt2-medium'
- 'gpt2-large'
- 'gpt2-xl'
    # We load the model
    base_model = GPT2LMHeadModel.from_pretrained('gpt2')
    # options: ['gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl']

    # Written without parentheses, this displays the bound method
    # together with the full model architecture
    base_model.num_parameters
    # (wte): Embedding(50257, 768)
    # (wpe): Embedding(1024, 768)
    <bound method ModuleUtilsMixin.num_parameters of GPT2LMHeadModel(
      (transformer): GPT2Model(
        (wte): Embedding(50257, 768)
        (wpe): Embedding(1024, 768)
        (drop): Dropout(p=0.1, inplace=False)
        (h): ModuleList(
          (0): Block(
            (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (attn): Attention(
              (c_attn): Conv1D()
              (c_proj): Conv1D()
              (attn_dropout): Dropout(p=0.1, inplace=False)
              (resid_dropout): Dropout(p=0.1, inplace=False)
            )
            (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (mlp): MLP(
              (c_fc): Conv1D()
              (c_proj): Conv1D()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          ... blocks (1) to (11) repeat the structure of block (0) ...
        )
        (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      )
      (lm_head): Linear(in_features=768, out_features=50257, bias=False)
    )>
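- It splits the input text into tokens, which do not necessarily coincide with words, encodes those tokens into the input IDs the model expects, and decodes IDs back into tokens.
- It allows new tokens to be added to the vocabulary.
- It manages special tokens, such as masks, beginning of text, end of text, special separators, and so on.

Through the tokenizer instance we can explore the vocabulary (get_vocab) and check its size, and tokenize different texts to see how it works.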
    # We load the tokenizer
    base_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

    print('Words in vocabulary: ', base_tokenizer.vocab_size)

    Words in vocabulary:  50257

    vocabulary = base_tokenizer.get_vocab()
    vocabulary['Hi']

    text = "Hi, I'm Victor and I work as a Data Scientist"
    base_tokenizer.tokenize(text)
['Hi', ',', 'ĠI', "'m", 'ĠVictor', 'Ġand', 'ĠI', 'Ġwork', 'Ġas', 'Ġa', 'ĠData', 'ĠScientist']
    text_ids = base_tokenizer.encode(text, return_tensors = 'pt')
    text_ids
    # tensorflow
    # text_ids = base_tokenizer.encode(text, return_tensors = 'tf')
tensor([[17250, 11, 314, 1101, 12622, 290, 314, 670, 355, 257, 6060, 33374]])
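One detail in the tokenizer output above is worth spelling out: in GPT-2's byte-level BPE vocabulary, the character `Ġ` marks a token that begins with a space. The sketch below uses only the token list printed above and a hypothetical helper, `tokens_to_text`, to show how that convention lets the original string be rebuilt. It is a simplified illustration, not the tokenizer's actual decoding routine.

```python
# Tokens copied from the tokenizer output shown above
tokens = ['Hi', ',', 'ĠI', "'m", 'ĠVictor', 'Ġand', 'ĠI', 'Ġwork',
          'Ġas', 'Ġa', 'ĠData', 'ĠScientist']

def tokens_to_text(tokens):
    # 'Ġ' stands for the space that preceded the token in the original text
    return ''.join(tokens).replace('Ġ', ' ')

rebuilt = tokens_to_text(tokens)
print(rebuilt)
```

Note that the twelve tokens line up one to one with the twelve IDs in the tensor above, and that punctuation and the contraction `'m` become tokens of their own.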
3.2 Decoding methods and parameters
    text = "I work as a data scientist"
    text_ids = base_tokenizer.encode(text, return_tensors = 'pt')

    generated_text_samples = base_model.generate(
        text_ids
    )
    generated_text_samples

    Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
    tensor([[ 40, 670, 355, 257, 1366, 11444, 379, 262, 2059, 286, 3442, 11, 14727, 13, 198, 198, 1, 40, 1101, 407]])

    for i, beam in enumerate(generated_text_samples):
        print(f"{i}: {base_tokenizer.decode(beam, skip_special_tokens=True)}")
        print()

    0: I work as a data scientist at the University of California, Berkeley. "I'm not
Greedy Search
    generated_text_samples = base_model.generate(
        text_ids,
        max_length= 100,
    )

    for i, beam in enumerate(generated_text_samples):
        print(f"{i}: {base_tokenizer.decode(beam, skip_special_tokens=True)}")
        print()

    Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
    0: I work as a data scientist at the University of California, Berkeley. "I'm not a scientist, but I'm a data scientist," he said. "I'm not a data scientist, but I'm a data scientist." He said he's not sure how much of the data he's collecting is from the government, but he's confident that it's not too much. "I'm not going to be able to do that," he said. "I'm
This is a deterministic generation method: if we generate text again from the same prompt, we get exactly the same text.

    # text generation example
    generated_text_samples = base_model.generate(
        text_ids,
        max_length= 100,
    )

    for i, beam in enumerate(generated_text_samples):
        print(f"{i}: {base_tokenizer.decode(beam, skip_special_tokens=True)}")
        print()

    Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
    0: I work as a data scientist at the University of California, Berkeley. "I'm not a scientist, but I'm a data scientist," he said. "I'm not a data scientist, but I'm a data scientist." He said he's not sure how much of the data he's collecting is from the government, but he's confident that it's not too much. "I'm not going to be able to do that," he said. "I'm
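- It is deterministic.
- It can get stuck in loops and repeat the same words.
- It does not take into account high-probability words that come after a low-probability word.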
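The points above can be reproduced without a neural network. The following is a minimal sketch of greedy search over a hand-written table of next-token probabilities. The table `NEXT_TOKEN_PROBS`, its numbers, and the function `greedy_decode` are all hypothetical and exist only for illustration; GPT-2 computes such probabilities from the whole preceding context, not just the last token.

```python
# Hypothetical next-token probabilities, keyed by the previous token
NEXT_TOKEN_PROBS = {
    "The":  {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog":  {"has": 0.9, "runs": 0.05, "and": 0.05},
    "car":  {"is": 0.6, "drives": 0.4},
}

def greedy_decode(start, steps):
    # At every step, pick the single most likely next token
    sequence, prob = [start], 1.0
    for _ in range(steps):
        candidates = NEXT_TOKEN_PROBS[sequence[-1]]
        token = max(candidates, key=candidates.get)
        prob *= candidates[token]
        sequence.append(token)
    return sequence, prob

greedy_sequence, greedy_prob = greedy_decode("The", steps=2)
print(greedy_sequence, greedy_prob)
```

Calling `greedy_decode` twice gives the same result, which is the determinism noted above. It also commits to "nice" at the first step and therefore never sees that "dog" is followed by the very likely "has", so the sequence it returns (probability 0.5 × 0.4) is less probable than "The dog has" (0.4 × 0.9).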
Beam Search
    # text generation example
    generated_text_samples = base_model.generate(
        text_ids,
        max_length= 50,
        num_beams=5,
        num_return_sequences= 5,
        early_stopping=True
    )

    for i, beam in enumerate(generated_text_samples):
        pretty_print(f"{i}: {base_tokenizer.decode(beam, skip_special_tokens=True)}")
        print()

    Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
    0: I work as a data scientist at the University of California, Berkeley, and I've been working on this project for a long time. I've been working on this project for a long time. I've been working on this project for a long time

    1: I work as a data scientist at the University of California, Berkeley, and I've been working on this for a long time. I've been working on this for a long time. I've been working on this for a long time.

    2: I work as a data scientist at the University of California, Berkeley, and I've been working on this for a long time. I've been working on this for a long time. I've been working on this for a long time. I've

    3: I work as a data scientist at the University of California, Berkeley, and I've been working on this for a long time. I've been working on this for a long time. I've been working on this for a long time. I'm

    4: I work as a data scientist at the University of California, Berkeley, and I've been working on this for a long time. I've been working on this for a long time. I've been working on this for a long time. It's
    # text generation example
    generated_text_samples = base_model.generate(
        text_ids,
        max_length= 50,
        num_beams=5,
        no_repeat_ngram_size=2,
        num_return_sequences= 5,
        early_stopping=True
    )

    for i, beam in enumerate(generated_text_samples):
        print(f"{i}: {base_tokenizer.decode(beam, skip_special_tokens=True)}")
        print()

    Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
    0: I work as a data scientist at the University of California, Berkeley, and I've been working on this for a long time. I have a lot of work to do, but I want to share it with you because I think it's

    1: I work as a data scientist at the University of California, Berkeley, and I've been working on this for a long time. I have a lot of work to do, but I want to share with you some of the things that I

    2: I work as a data scientist at the University of California, Berkeley, and I've been working on this for a long time. I have a lot of work to do, but I want to share it with you because it's important to

    3: I work as a data scientist at the University of California, Berkeley, and I've been working on this for a long time. I have a lot of work to do, but I want to share it with you because it's important.

    4: I work as a data scientist at the University of California, Berkeley, and I've been working on this for a long time. I have a lot of work to do, but I want to share it with you because I think it is
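The `no_repeat_ngram_size=2` argument used above is what removes the repeated sentences seen in the first beam search output. The idea can be sketched as follows: before choosing the next token, find every token that would complete an n-gram already present in the generated text, and forbid it. The function `banned_next_tokens` is a hypothetical name, and the sketch works on words for readability, whereas the library works on token IDs.

```python
def banned_next_tokens(generated, ngram_size):
    # Tokens that would complete an n-gram already present in `generated`
    if len(generated) < ngram_size:
        return set()
    # The last (ngram_size - 1) tokens form the start of the next n-gram
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned

generated = "I have been working on this . I".split()
banned = banned_next_tokens(generated, ngram_size=2)
print(banned)
```

Here the text ends in "I", and the bigram "I have" has already appeared, so "have" is banned as the next word. With a size of 2 this is quite strict: no pair of tokens can ever occur twice, which can also block legitimate repetitions such as a name.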
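- It generates repetitive sequences, which are hard to control.
- Humans do not always use such deterministic language, as Ari Holtzman et al. (2019) explain. In their study they compared the probabilities of the words chosen by humans with those of the words chosen by beam search, and observed that the latter were higher and varied less.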
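To make the mechanics of beam search concrete before moving on, here is a minimal sketch over a small hand-written table of next-token probabilities. The table `NEXT_TOKEN_PROBS`, its numbers, and the function `beam_search` are hypothetical and purely illustrative; a real model conditions on the full context and the library works with log-probabilities.

```python
# Hypothetical next-token probabilities, keyed by the previous token
NEXT_TOKEN_PROBS = {
    "The":  {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog":  {"has": 0.9, "runs": 0.05, "and": 0.05},
    "car":  {"is": 0.6, "drives": 0.4},
}

def beam_search(start, steps, num_beams):
    # Each beam is a (sequence, probability) pair
    beams = [([start], 1.0)]
    for _ in range(steps):
        expanded = []
        for sequence, prob in beams:
            for token, p in NEXT_TOKEN_PROBS[sequence[-1]].items():
                expanded.append((sequence + [token], prob * p))
        # Keep only the num_beams most probable partial sequences
        expanded.sort(key=lambda beam: beam[1], reverse=True)
        beams = expanded[:num_beams]
    return beams

# The toy table only covers two steps after "The"
beams = beam_search("The", steps=2, num_beams=2)
best_sequence, best_prob = beams[0]
print(best_sequence, best_prob)
```

With two beams, the search keeps "The dog" alive after the first step even though "The nice" is more likely, and so it finds "The dog has". With `num_beams=1` the same function reduces to greedy search. Both are still deterministic, which is the limitation raised in the second point above.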
    # text generation example
    generated_text_samples = base_model.generate(
        text_ids,
        max_length= 50,
        do_sample=True,
        top_k=0,
        num_return_sequences= 5
    )

    for i, beam in enumerate(generated_text_samples):
        pretty_print(f"{i}: {base_tokenizer.decode(beam, skip_special_tokens=True)}")
        print()

    Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
    0: I work as a data scientist and distributor. Compute data knowledge on production companies with an approach that allows you to explore high-level, high performance models that saw some crashes.

    1: I work as a data scientist on several city and county offices and I've spent many many hours with people that happen to like me. When my leg is broken I can get medical indeterminate problems. My

    2: I work as a data scientist at AlecGenetics, both in my prefrontal cortex and my scalp. For example, batting average since I was 8-year-old was more than 500 points higher. Soon after,

    3: I work as a data scientist in a number of sectors, including SAP, Google, CID, Apple, best practices, clustering, mapping and data manipulation". However, Simon is breaking out soon and has

    4: I work as a data scientist at Stratfor, where I've talked about it on some good lengths at a number of shows. My primary focus and foremost goal was to return many messages out of groups and
    # text generation example
    generated_text_samples = base_model.generate(
        text_ids,
        max_length= 50,
        do_sample=True,
        top_k=0,
        temperature=0.9,
        num_return_sequences= 5
    )

    for i, beam in enumerate(generated_text_samples):
        pretty_print(f"{i}: {base_tokenizer.decode(beam, skip_special_tokens=True)}")
        print()

    Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
    0: I work as a data scientist. All of my other fields come from years of research, so I know they benefit real physical scientists. So I'm a kind of a crusader for open data that's going to do away

    1: I work as a data scientist at Stardust Research. Blu gives me a home computer in my garden. She owns a double-track truck, a touring dog, and some furniture. I've been a homeless man all my life.

    2: I work as a data scientist at a group that solicits and enables data scientists in their field to understand, properly question and engage with data scientists. I did that with the data scientists

    3: I work as a data scientist at a global health company, and I told them it was foolproof to shut down a website they had started. They asked me how they could know who I was. Knowing that when a

    4: I work as a data scientist and lead a Data Analytics team on the IBM Distributed Systems Group, studying the use of remote workers in data science data and exploration. Alton has access to every
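The only difference between the last two calls is `temperature=0.9`. A rough way to see what that parameter does: the model's raw scores (logits) are divided by the temperature before being turned into probabilities, so values below 1 sharpen the distribution and make unlikely tokens rarer. The sketch below assumes four made-up logits and a hypothetical helper, `softmax_with_temperature`.

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide the logits by the temperature before normalising
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

logits = [2.0, 1.0, 0.5, 0.1]  # hypothetical scores for four candidate tokens
for temperature in (1.0, 0.9, 0.5):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

As the temperature drops, more of the probability mass moves to the top-scoring token; in the limit of a temperature close to 0, sampling behaves like greedy search.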
Top-K Sampling

    # text generation example
    generated_text_samples = base_model.generate(
        text_ids,
        max_length= 50,
        do_sample=True,
        top_k=25,
        num_return_sequences= 5
    )

    for i, beam in enumerate(generated_text_samples):
        pretty_print(f"{i}: {base_tokenizer.decode(beam, skip_special_tokens=True)}")
        print()

    Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
    0: I work as a data scientist and as a technical writer and I do not publish my writing here for

    1: I work as a data scientist for some other companies and they asked me whether I felt the same way. I've worked for the government of India for 20 years and they didn't give me any sort of

    2: I work as a data scientist in a few small firms. In the fall of 2011, I ran the company's customer support team through some of our most complicated systems: the customer support system was

    3: I work as a data scientist at the National Center for Missing and Exploited Children at Children's Hospital in Philadelphia. She has been following the cases for many months and has been

    4: I work as a data scientist at the Center for Digital Economy, a nonprofit in Washington, D.C.—and I'm currently working at one of the largest companies in Silicon Valley—and I've got the idea for
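With `top_k=25`, sampling is restricted at each step to the 25 most likely tokens. The sketch below shows the same filtering on a five-token distribution with `k=3`. The probabilities, the helper `top_k_filter`, and the fixed random seed are assumptions made for illustration.

```python
import random

def top_k_filter(probs, k):
    # Keep the k most likely tokens and renormalise their probabilities
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

# Hypothetical next-token distribution
probs = {"at": 0.40, "for": 0.25, "in": 0.20, "and": 0.10, "banana": 0.05}
filtered = top_k_filter(probs, k=3)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
tokens, weights = zip(*filtered.items())
sample = rng.choices(tokens, weights=weights, k=1)[0]
print(filtered, sample)
```

The unlikely tokens can no longer be drawn at all, which removes the most incoherent continuations. The weakness of a fixed `k` is that it ignores the shape of the distribution: it keeps the same number of candidates whether the model is very sure or very unsure.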
Top-p (nucleus) sampling
    # text generation example
    generated_text_samples = base_model.generate(
        text_ids,
        max_length= 50,
        do_sample=True,
        top_k=0,
        top_p=0.92,
        num_return_sequences= 5
    )

    for i, beam in enumerate(generated_text_samples):
        pretty_print(f"{i}: {base_tokenizer.decode(beam, skip_special_tokens=True)}")
        print()

    Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
    0: I work as a data scientist for a leading US non-profit that focuses on the medical care of Africa, specifically the Transitional Federated Republic of the Democratic Republic of the Congo

    1: I work as a data scientist who also studies wireless. I tend to care about wireless fact-checking, which means I'm able to add factors to the rest of the experts' opinions as well. We

    2: I work as a data scientist at the paper's UMass Amherst Initiative on Media Research, a company specializing in peer-reviewed, high-quality research on certain types of media. In addition to

    3: I work as a data scientist with a good point of view, and it'll never be easy to put my head down and go make notes here if you don't see it. At the same time, my head is the only one that does

    4: I work as a data scientist, only a few of my colleagues are. I have worked on a lot of cold and hard data, but only a few people who really know any of those areas are able to tell me what you're
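With `top_p=0.92`, the candidate set is no longer a fixed size: it is the smallest group of most likely tokens whose probabilities add up to at least 0.92. The following sketch applies that rule to two made-up distributions, one peaked and one flat, using a hypothetical helper `top_p_filter`.

```python
def top_p_filter(probs, top_p):
    # Keep the smallest set of most likely tokens whose total probability reaches top_p
    kept, cumulative = {}, 0.0
    for token, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalise so the kept probabilities sum to 1
    return {token: p / cumulative for token, p in kept.items()}

# Hypothetical next-token distributions
peaked = {"at": 0.95, "for": 0.03, "in": 0.01, "and": 0.01}
flat = {"at": 0.50, "for": 0.30, "in": 0.15, "and": 0.05}

print(top_p_filter(peaked, 0.92))
print(top_p_filter(flat, 0.92))
```

When the model is confident (the peaked case) only one token survives, and when it is uncertain (the flat case) three do. This adaptivity is the main difference from Top-K sampling.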
    # text generation example
    generated_text_samples = base_model.generate(
        text_ids,
        max_length= 50,
        do_sample=True,
        top_k=100,
        top_p=0.92,
        temperature=0.8,
        repetition_penalty= 1.5,
        num_return_sequences= 5
    )

    for i, beam in enumerate(generated_text_samples):
        pretty_print(f"{i}: {base_tokenizer.decode(beam, skip_special_tokens=True)}")
        print()

    Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
    0: I work as a data scientist for an independent company. I've spent the last two years working with developers to figure out how their apps are organized and being able write better code than ever

    1: I work as a data scientist at Google. My job is to provide research for the company, and I'm

    2: I work as a data scientist, and this is my job. I want to stay in the business of helping people improve their skills," he said at his confirmation hearing on Monday night for Assistant Secretary

    3: I work as a data scientist and I try to understand how the world works, so it's great when you

    4: I work as a data scientist in Washington, D.C., and I've been on the receiving end of hate mail lately," he said."My job is to help people understand what's wrong with their communities before
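The call above combines the previous parameters with `repetition_penalty=1.5`. As we understand it, the penalty lowers the score of every token that has already been generated: positive scores are divided by the penalty and negative scores are multiplied by it. The sketch below applies that rule to three made-up logits; the helper name `apply_repetition_penalty` and the use of words instead of token IDs are assumptions for readability.

```python
def apply_repetition_penalty(logits, generated, penalty):
    # Make tokens that were already generated less likely
    adjusted = dict(logits)
    for token in set(generated):
        if token not in adjusted:
            continue
        score = adjusted[token]
        # Positive scores shrink; negative scores move further from zero
        adjusted[token] = score / penalty if score > 0 else score * penalty
    return adjusted

logits = {"data": 2.0, "scientist": -1.0, "work": 0.5}  # hypothetical scores
adjusted = apply_repetition_penalty(
    logits, generated=["data", "scientist"], penalty=1.5
)
print(adjusted)
```

A penalty of 1.0 leaves the scores unchanged. Unlike `no_repeat_ngram_size`, which forbids repeats outright, the penalty only makes them less likely, so a repeated word can still be chosen when the model strongly prefers it.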
    def generate_n_text_samples(model, tokenizer, input_text, device, n_samples = 5):
        text_ids = tokenizer.encode(input_text, return_tensors = 'pt')
        text_ids = text_ids.to(device)
        model = model.to(device)

        generated_text_samples = model.generate(
            text_ids,
            max_length= 100,
            num_return_sequences= n_samples,
            no_repeat_ngram_size= 2,
            repetition_penalty= 1.5,
            top_p= 0.92,
            temperature= .85,
            do_sample= True,
            top_k= 125,
            early_stopping= True
        )
        gen_text = []
        for t in generated_text_samples:
            text = tokenizer.decode(t, skip_special_tokens=True)
            gen_text.append(text)

        return gen_text

4. Fine-tuning: How to generate fake news
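GPT-2 was trained on general-purpose text collected from the Internet (OpenAI's WebText corpus, built from web pages linked on Reddit), so if we want the generated text to follow a particular structure or to focus on a particular topic, the pre-trained models available in Transformers are not enough. To achieve this, the model can be fine-tuned: that is, retrained, optionally with some additional layers, on a dataset that contains the desired topic or text structure.
- Obtain the data.
- Process the data to add start-of-text and end-of-text tokens (or whichever tokens the kind of text to be generated requires).
- Train the base model on the new data.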
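The second step, adding start and end tokens, is simple string handling. A minimal sketch, assuming the token strings used later in this article and two made-up headlines (the helper name `add_markers` is hypothetical):

```python
BOS = '<|endoftext|>'  # start-of-text marker, as used later in this article
EOS = '<|EOS|>'        # end-of-text marker, as used later in this article

def add_markers(text, bos=BOS, eos=EOS):
    # Wrap one training example in start and end tokens
    return f"{bos} {text.strip()} {eos}"

# Made-up headlines, used only to illustrate the format
raw_headlines = [
    "  Hypothetical headline about a city council vote on a new park ",
    "Another made-up headline used only to illustrate the format",
]
training_texts = [add_markers(h) for h in raw_headlines]
for t in training_texts:
    print(t)
```

Because every training example ends with the same end token, the fine-tuned model learns to emit it when a headline is complete, which gives generation a natural stopping point.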
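We will generate text in news format (fake news): headline + article. This requires two models:
- A headline generation model, obtained by fine-tuning GPT-2 small on headlines from various newspapers.
- An article generation model, obtained by fine-tuning GPT-2 small on headlines and articles, so that it generates the first few sentences of an article given a headline.
4.1 Fine-tuning to generate headlines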
Loading the tokenizer and model with special tokens
- Add them to the tokenizer as special tokens.
- Add them to the configuration of the pre-trained model when it is loaded.

    # the eos and bos tokens are defined
    bos = '<|endoftext|>'
    eos = '<|EOS|>'
    pad = '<|pad|>'

    special_tokens_dict = {'eos_token': eos, 'bos_token': bos, 'pad_token': pad}

    # the new token is added to the tokenizer
    num_added_toks = base_tokenizer.add_special_tokens(special_tokens_dict)

    # the model config to which we add the special tokens
    config = AutoConfig.from_pretrained('gpt2',
                                        bos_token_id=base_tokenizer.bos_token_id,
                                        eos_token_id=base_tokenizer.eos_token_id,
                                        pad_token_id=base_tokenizer.pad_token_id,
                                        output_hidden_states=False)

    # the pre-trained model is loaded with the custom configuration
    base_model = GPT2LMHeadModel.from_pretrained('gpt2', config=config)

    # the model embedding is resized
    base_model.resize_token_embeddings(len(base_tokenizer))
Embedding(50259, 768)
Data loading and processing
- Clean the dataset.
- Add the start and end tokens to the headlines.
- Build the tokenized dataset that we can pass to the model for training.

- Empty or null values.
- Remove the publication name that appears in the headline.
- Discard headlines with fewer than 8 words.
- Discard duplicate headlines.

    filepath= './data/articles1.csv'
    df = pd.read_csv(filepath, encoding = 'utf-8', usecols=['title', 'publication'])\
        .rename(columns={'title': 'text'})

    pd.set_option("display.max_colwidth", None)
    df.head(5)

| text | publication |
| --- | --- |
| House Republicans Fret About Winning Their Health Care Suit - The New York Times | New York Times |
| Rift Between Officers and Residents as Killings Persist in South Bronx - The New York Times | New York Times |
    def remove_publication_headline(headline, publication):
        # publication col doesn't match exactly with newspaper in title col
        if str(publication) in str(headline):
            headline = headline.split(' - ')[0]
        return headline

    def process_headlines(df, text_colname):
        # Remove empty and null rows
        titulo_vacio = (df[text_colname].str.len() == 0) | df[text_colname].isna()
        df = df[~titulo_vacio].copy()  # copy to avoid modifying a view of the original

        # Remove publication name from title
        df[text_colname] = df.apply(lambda row: remove_publication_headline(row[text_colname], row['publication']), axis = 1)

        # Remove headlines with less than 8 words
        titlos_len_ge8 = (df[text_colname].str.split().apply(lambda x: len(x)) >= 8)
        df = df[titlos_len_ge8]

        # Drop duplicates
        text_df = df.drop_duplicates(subset = [text_colname])[[text_colname]]

        return text_df

    df = process_headlines(df, 'text')

    df['text'] = bos + ' ' + df['text'] + ' ' + eos

    df_train, df_val = train_test_split(df, train_size = 0.9, random_state = 77)
    print(f'There are {len(df_train)} headlines for training and {len(df_val)} for validation')

    There are 36380 headlines for training and 4043 for validation

    # we load the datasets directly from a pandas df
    train_dataset = Dataset.from_pandas(df_train[['text']])
    val_dataset = Dataset.from_pandas(df_val[['text']])
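Finally, we tokenize the datasets so that they can be used as training data. We pass padding=True so that padding tokens are appended to the end of the shorter texts, giving all the texts tokenized together the same length.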
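What padding does can be sketched in a few lines of plain Python. The token IDs below are assumptions: 50256 is GPT-2's `<|endoftext|>`, while 50257 and 50258 are the IDs the two newly added tokens would plausibly receive, and the helper `pad_batch` is hypothetical.

```python
def pad_batch(batch, pad_id):
    # Right-pad every sequence to the length of the longest one in the batch
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return input_ids, attention_mask

PAD_ID = 50258  # assumed id of the '<|pad|>' token
batch = [[50256, 11, 12, 13, 50257], [50256, 21, 50257]]  # made-up token ids
input_ids, attention_mask = pad_batch(batch, PAD_ID)
print(input_ids)
print(attention_mask)
```

The attention mask is what lets the model ignore the padded positions; the tokenizer returns it alongside `input_ids`.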
    def tokenize_function(examples):
        return base_tokenizer(examples['text'], padding=True)

    tokenized_train_dataset = train_dataset.map(
        tokenize_function,
        batched=True,
        num_proc=5,
        remove_columns=['text'],
    )
    tokenized_val_dataset = val_dataset.map(
        tokenize_function,
        batched=True,
        num_proc=5,
        remove_columns=['text'],
    )

    # Example of the result of the tokenization process with padding
    base_tokenizer.decode(tokenized_train_dataset['input_ids'][0])
'<|endoftext|> Donald Trump: Hillary Clinton ’Opened the Pandora’s Box of Radical Islam’ <|EOS|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|> <|pad|>'
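For training, HuggingFace Transformers provides an API that hides the most complex details of the process from the user. It is enough to instantiate the TrainingArguments class with the desired parameter values and pass it to the Trainer class. For GPT-2 it is also advisable to instantiate the DataCollatorForLanguageModeling class, which is responsible for building the subsets (batches) used for training.
In our case we leave almost all training parameters at their default values and change only the number of epochs and the batch size. For more detail, every parameter of the interface is explained in the documentation of the TrainingArguments class. We instantiate the data collator by passing it our customized tokenizer and turning off the "masked language modeling" option.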
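Turning off masked language modeling means the model is trained to predict each next token (causal language modeling). To our understanding, the collator then simply copies `input_ids` into `labels` and replaces padding positions with -100, the value that PyTorch's cross-entropy loss ignores by default; the one-position shift between inputs and targets happens inside the model. A plain-Python sketch of that labelling, with assumed token IDs and a hypothetical helper name:

```python
IGNORE_INDEX = -100  # label value that the loss function skips

def causal_lm_labels(input_ids, pad_id):
    # Labels are a copy of the inputs, with padding positions masked out
    return [
        [IGNORE_INDEX if token == pad_id else token for token in sequence]
        for sequence in input_ids
    ]

PAD_ID = 50258  # assumed id of the '<|pad|>' token
input_ids = [[50256, 21, 50257, 50258, 50258]]  # made-up, already padded
labels = causal_lm_labels(input_ids, PAD_ID)
print(labels)
```

This is why the pad token must be distinct from the end-of-text token here: if both shared an ID, the end token would be masked out of the loss as well and the model would never learn when to stop.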
    model_headlines_path = './model_headlines_news'

    training_args = TrainingArguments(
        output_dir=model_headlines_path,          # output directory
        num_train_epochs=6,                       # total # of training epochs
        per_device_train_batch_size=32,           # batch size per device during training
        per_device_eval_batch_size=16,            # batch size for evaluation
        warmup_steps=200,                         # number of warmup steps for learning rate scheduler
        weight_decay=0.01,                        # strength of weight decay
        logging_dir=model_headlines_path,         # directory for storing logs
        prediction_loss_only=True,
        save_steps=10000
    )

    data_collator = DataCollatorForLanguageModeling(
        tokenizer=base_tokenizer,
        mlm=False
    )

    trainer = Trainer(
        model=base_model,                         # the instantiated Transformers model to be trained
        args=training_args,                       # training arguments, defined above
        data_collator=data_collator,
        train_dataset=tokenized_train_dataset,    # training dataset
        eval_dataset=tokenized_val_dataset        # evaluation dataset
    )

    trainer.train()
TrainOutput(global_step=6822, training_loss=3.793479129877373, metrics={'train_runtime': 6091.6768, 'train_samples_per_second': 1.12, 'total_flos': 8435632813664256.0, 'epoch': 6.0, 'init_mem_cpu_alloc_delta': 335156, 'init_mem_gpu_alloc_delta': 511148032, 'init_mem_cpu_peaked_delta': 18306, 'init_mem_gpu_peaked_delta': 0, 'train_mem_cpu_alloc_delta': 1265954, 'train_mem_gpu_alloc_delta': 1501261312, 'train_mem_cpu_peaked_delta': 2107868, 'train_mem_gpu_peaked_delta': 3895132160})
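The `global_step` value in this output can be checked by hand from the numbers already given: the size of the training split, the batch size, and the number of epochs. The short calculation below assumes training on a single device without gradient accumulation.

```python
import math

train_examples = 36380  # headlines in the training split, reported above
batch_size = 32         # per_device_train_batch_size
epochs = 6              # num_train_epochs

# Assumes a single device and no gradient accumulation
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)
```

This gives 1137 steps per epoch and 6822 steps in total, which matches `global_step=6822` in the output above.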
('./model_headlines_news/tokenizer_config.json', './model_headlines_news/special_tokens_map.json', './model_headlines_news/vocab.json', './model_headlines_news/merges.txt', './model_headlines_news/added_tokens.json')
{'epoch': 6.0, 'eval_loss': 3.579979181289673, 'eval_mem_cpu_alloc_delta': 105120, 'eval_mem_cpu_peaked_delta': 151664, 'eval_mem_gpu_alloc_delta': 0, 'eval_mem_gpu_peaked_delta': 534739968, 'eval_runtime': 33.801, 'eval_samples_per_second': 119.612}
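One common way to read an evaluation loss like this is to convert it to perplexity, the exponential of the average cross-entropy loss. This interpretation is not part of the original experiment; the snippet is only a sketch that assumes `eval_loss` is the mean per-token cross-entropy in nats.

```python
import math

eval_loss = 3.579979181289673  # 'eval_loss' from the evaluation output above
perplexity = math.exp(eval_loss)
print(f"perplexity: {perplexity:.1f}")
```

Loosely speaking, a lower perplexity means the model is less "surprised" by the validation headlines.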
Headline generation
    # trained model loading
    model_headlines_path = './model_headlines_news'

    headlines_model = GPT2LMHeadModel.from_pretrained(model_headlines_path)
    headlines_tokenizer = GPT2Tokenizer.from_pretrained(model_headlines_path)

    device = "cuda:0"

    input_text = headlines_tokenizer.bos_token

    headlines = generate_n_text_samples(headlines_model, headlines_tokenizer,
                                        input_text, device, n_samples = 10)
    for h in headlines:
        print(h)
        print()

    WikiLeaks: Clinton Foundation Adopted By Goldman Sachs, But Has ‘Little Effect’ on Its Business

    Marco Rubio Defends Trump’s ‘Deplorables, Stupid People” Inaugural Address

    President Donald Trump and the Hill: Obamacare, a Fight, but I’m Fighting

    Texas School Shooting: ’I Am One of the Victims”

    Donald Trump’s Executive Action Plan Would Leave 8 Million Illegal Immigrants to Stay in America
4.2 Fine-tuning to generate articles from headlines
bos_token <title> sep_token <content> eos_token
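The format above can be made concrete with a few helper functions. This is a minimal sketch that assumes the token strings defined in this section's code (`<|endoftext|>`, `<|body|>`, `<|EOS|>`); the helper names and the sample title and content are hypothetical.

```python
BOS = '<|endoftext|>'
BODY = '<|body|>'  # separator between title and content
EOS = '<|EOS|>'

def build_example(title, content):
    # Training text: bos_token <title> sep_token <content> eos_token
    return ' '.join([BOS, title, BODY, content, EOS])

def build_prompt(title):
    # At generation time the model is given everything up to the separator
    return ' '.join([BOS, title, BODY])

def split_generated(text):
    # Recover (title, content) from a string in the training format
    without_markers = text.replace(BOS, '').replace(EOS, '')
    title, _, content = without_markers.partition(BODY)
    return title.strip(), content.strip()

example = build_example("A made-up headline", "A made-up first sentence.")
print(example)
print(split_generated(example))
```

The prompt is a prefix of the training example, so a model fine-tuned on this format continues a prompt with the article body and then the end token. One practical caveat: decoding with `skip_special_tokens=True` would drop the separator, so splitting title from body this way requires keeping special tokens in the decoded text.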
Load tokenizer and model with special tokens
    # special tokens are defined
    bos = '<|endoftext|>'
    eos = '<|EOS|>'
    body = '<|body|>'
    additional_special_tokens = [body]

    special_tokens_dict = {'eos_token': eos, 'bos_token': bos, 'pad_token': '<pad>',
                           'sep_token': body}
                           # 'additional_special_tokens':additional_special_tokens}

    # the new token is added to the tokenizer
    num_added_toks = base_tokenizer.add_special_tokens(special_tokens_dict)

    # model configuration to which we add the special tokens
    config = AutoConfig.from_pretrained('gpt2',
                                        bos_token_id=base_tokenizer.bos_token_id,
                                        eos_token_id=base_tokenizer.eos_token_id,
                                        pad_token_id=base_tokenizer.pad_token_id,
                                        sep_token_id=base_tokenizer.sep_token_id,
                                        output_hidden_states=False)

    # we load the pre-trained model with custom settings
    base_model = GPT2LMHeadModel.from_pretrained('gpt2', config=config)

    # model embedding resizing
    base_model.resize_token_embeddings(len(base_tokenizer))
Data loading and processing
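- Empty or null values.
- We remove the publication name from the headlines that contain it.
- We discard headlines with fewer than 8 words.
- We discard duplicate headlines.
- We keep the first 100 words of each article.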
    df = []
    for filepath in ['./data/articles1.csv', './data/articles2.csv']:
        news_df = pd.read_csv(filepath, encoding = 'utf-8')
        df.append(news_df)
    news_df = pd.concat(df, axis=0)

    def remove_publication_headline(headline, publication):
        # publication col doesn't match exactly with newspaper in title col
        if str(publication) in str(headline):
            headline = headline.split(' - ')[0]
        return headline

    def process_headlines_articles(df, title_col, content_col):
        # Remove rows with empty or null title or content
        titulo_vacio = (df[title_col].str.len() == 0) | df[title_col].isna()
        contenido_vacio = (df[content_col].str.len() == 0) | df[content_col].isna()
        df = df[~titulo_vacio & ~contenido_vacio].copy()

        # Remove publication name from title
        df[title_col] = df.apply(lambda row: remove_publication_headline(row[title_col], row['publication']), axis = 1)

        # Remove headlines with less than 8 words
        titlos_len_ge8 = (df[title_col].str.split().apply(lambda x: len(x)) >= 8)
        df = df[titlos_len_ge8].copy()

        # Keep the first 100 words from the content
        df[content_col] = df[content_col].str.split(' ').apply(lambda x: ' '.join(x[:100]))

        # Drop duplicate headlines, keeping the columns needed to build the training text
        text_df = df.drop_duplicates(subset = [title_col])[[title_col, content_col]]

        return text_df

    # Data cleansing
    news_df = process_headlines_articles(news_df, title_col='title', content_col='content')

    # We add the tokens
    prepare_text = lambda x: ' '.join([bos, x['title'], body, x['content'], eos])
    news_df['text'] = news_df.apply(prepare_text, axis=1)

    # Split in train and test
    df_train_news, df_val_news = train_test_split(news_df, train_size = 0.9, random_state = 77)

    # we load the datasets from pandas df
    train_dataset = Dataset.from_pandas(df_train_news[['text']])
    val_dataset = Dataset.from_pandas(df_val_news[['text']])

    # tokenization
    tokenized_train_dataset = train_dataset.map(
        tokenize_function,
        batched=True,
        num_proc=1
    )
    tokenized_val_dataset = val_dataset.map(
        tokenize_function,
        batched=True,
        num_proc=1
    )
    model_articles_path = './news-articles_v4'

    training_args = TrainingArguments(
        output_dir=model_articles_path,           # output directory
        num_train_epochs=2,                       # total # of training epochs
        per_device_train_batch_size=5,            # batch size per device during training
        per_device_eval_batch_size=32,            # batch size for evaluation
        warmup_steps=200,                         # number of warmup steps for learning rate scheduler
        weight_decay=0.01,                        # strength of weight decay
        logging_dir=model_articles_path,          # directory for storing logs
        prediction_loss_only=True,
        save_steps=10000
    )

    data_collator = DataCollatorForLanguageModeling(
        tokenizer=base_tokenizer,
        mlm=False
    )

    trainer = Trainer(
        model=base_model,                         # the instantiated Transformers model to be trained
        args=training_args,                       # training arguments, defined above
        data_collator=data_collator,
        train_dataset=tokenized_train_dataset,    # training dataset
        eval_dataset=tokenized_val_dataset,       # evaluation dataset
    )

    # start training (this call produces the TrainOutput shown next)
    trainer.train()
TrainOutput(global_step=27980, training_loss=0.7775315323584927, metrics={'train_runtime': 2335.4803, 'train_samples_per_second': 11.98, 'total_flos': 1.698689707771392e+16, 'epoch': 2.0, 'init_mem_cpu_alloc_delta': 333850, 'init_mem_gpu_alloc_delta': 511148032, 'init_mem_cpu_peaked_delta': 18306, 'init_mem_gpu_peaked_delta': 0, 'train_mem_cpu_alloc_delta': 1500091, 'train_mem_gpu_alloc_delta': 2012204032, 'train_mem_cpu_peaked_delta': 3292110, 'train_mem_gpu_peaked_delta': 4292523008})
WikiLeaks: Clinton Foundation Adopted By Goldman Sachs, But Has ‘Little Effect’ on Its Business
WikiLeaks has published details of the new Hillary for America campaign ad that is being used to push back against Democratic presidential nominee former Secretary and 2016 Democratic National Committee (DNC) candidate Bernie Sanders. [Sanders Campaign is spending $250 million promoting his unsuccessful bid at this year U. S District Court in San Francisco where a judge has rejected three

Marco Rubio Defends Trump’s ‘Deplorables, Stupid People” Inaugural Address
On Wednesday night in Cleveland as part of his Republican presidential campaign for the presidency he addressed Donald J. Trump and other Democrats who are engaged with him over immigration reform or trade deals while praising Mr.[ During their commencement address at Ohio State University on Thursday morning several speakers offered him lessons to learn about American values during that
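The quality is not outstanding: most of these articles would not look real to a human reader, either because the content is incoherent or because the structure is odd. For example, square brackets are opened at random and never closed, and quotation marks open and close with different characters. Still, given the resource constraints (the models were trained on the free tier of Google Colab, and the smallest GPT-2 model was used), the results are remarkable. With better infrastructure and larger models, credible results are certainly achievable, as has already been shown over the past few years.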
5. Biases and toxic language in base GPT-2
- Bias against minority or historically discriminated social groups.
- Toxic language: sexist, violent, offensive, and so on.
6. Conclusions and next steps
- explained and demonstrated how to generate text and how to train GPT-2 to generate the text
- GPT-3