Part 2: Summary of the Tasks
This part presents the most common tasks. The available models allow for many different configurations and are highly versatile across use cases. The simplest ones are presented here, showing how to use them on tasks such as question answering, sequence classification, and named entity recognition.
These examples leverage the auto-model classes, which instantiate a model from a given checkpoint, automatically selecting the correct model architecture.
For a model to perform well on a task, it must be loaded from a checkpoint that matches that task. These checkpoints are usually pretrained on a large corpus and then fine-tuned on a specific task. This implies the following:
- Not all models have been fine-tuned on all tasks. If you want to fine-tune a model on a specific task, you can use one of the run_task.py scripts in the examples directory on GitHub as a reference.
- A fine-tuned model was fine-tuned on one specific dataset, which may not cover your use case or domain. You can therefore use the example scripts to fine-tune your own model, or write your own training script.
To run inference on these tasks, the library provides several mechanisms:
- Pipelines: very easy-to-use abstractions that solve a task in as little as two lines of code.
- Direct model use: less abstraction, but more flexibility and power through direct access to the tokenizer, with full inference capability.
Both approaches are demonstrated below.
All of the tasks presented below come with pretrained checkpoints fine-tuned on task-specific datasets. Loading a checkpoint that was not fine-tuned on a specific task loads only the base transformer layers without the task-specific head; the weights of the task-specific head are then initialized randomly.
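As a minimal sketch of what this looks like in practice (the checkpoint name below is purely illustrative), loading a base checkpoint into a task-specific auto class loads the transformer weights and leaves the task head randomly initialized:
from transformers import AutoModelForSequenceClassification
# bert-base-cased is a base checkpoint without a sequence classification head;
# the transformer weights are loaded, and the library warns that the head
# weights are newly (randomly) initialized and should be fine-tuned before use.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)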
Sequence Classification
Sequence classification is the task of classifying sequences into a given number of classes. An example of sequence classification is the GLUE benchmark. If you want to fine-tune a model on GLUE, you can use the run_glue.py script as a reference.
Here is an example of sentiment analysis using pipelines: identifying whether a text is positive or negative. It uses a model fine-tuned on SST-2 (a GLUE task).
This returns a "POSITIVE" or "NEGATIVE" label together with a score:
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
result = classifier("I hate you")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
'''
output:
label: NEGATIVE, with score: 0.9991
'''
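The pipeline above picks a default checkpoint for the sentiment-analysis task. A checkpoint can also be passed explicitly; as a sketch, the one below is the usual default for this task, but any sequence classification checkpoint works:
classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
result = classifier("I love you")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
# expected: a POSITIVE label with a score close to 1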
Here is an example of sequence classification using a model directly: determining whether two texts are paraphrases of each other. The process is as follows:
- Instantiate a tokenizer and a model from the checkpoint name. The model here is BERT, and its weights are loaded from the checkpoint.
- Build a sequence from the two sentences, with the correct separators, token type IDs, and attention masks (all created automatically by the tokenizer).
- Pass the sequence through the model so that it is classified into one of two classes: 0 (not a paraphrase) and 1 (is a paraphrase).
- Compute the softmax of the result to get probabilities over the classes.
- Print the results.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")
classes = ["not paraphrase", "is paraphrase"]
sequence_0 = "The company HuggingFace is based in New York City"
sequence_1 = "Apples are especially bad for your health"
sequence_2 = "HuggingFace's headquarters are situated in Manhattan"
# The tokenizer will automatically add any model specific separators (i.e. <CLS> and <SEP>) and tokens to
# the sequence, as well as compute the attention masks.
paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="pt")
not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="pt")
paraphrase_classification_logits = model(**paraphrase).logits
not_paraphrase_classification_logits = model(**not_paraphrase).logits
paraphrase_results = torch.softmax(paraphrase_classification_logits, dim=1).tolist()[0]
not_paraphrase_results = torch.softmax(not_paraphrase_classification_logits, dim=1).tolist()[0]
# Should be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%")
'''
output:
not paraphrase: 10%
is paraphrase: 90%
'''
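The example computes not_paraphrase_results but never prints them; the same loop over the non-paraphrase pair completes the check:
# Should not be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(not_paraphrase_results[i] * 100))}%")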
Extractive Question Answering
Extractive question answering is the task of extracting an answer from a text given a question. A classic question answering dataset is the SQuAD dataset. If you want to fine-tune a model on the SQuAD task, you can use the run_qa.py script as a reference.
Here is how to do question answering using pipelines. The result contains the answer text, a confidence score, and the position of the answer (start and end indices):
from transformers import pipeline
question_answerer = pipeline("question-answering")
context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the examples/pytorch/question-answering/run_squad.py script.
"""
result = question_answerer(question="What is extractive question answering?", context=context)
print(
    f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}"
)
'''
output:
Answer: 'the task of extracting an answer from a text given a question', score: 0.6177, start: 34, end: 95
'''
Here is an example of question answering using a model directly. The process is as follows:
- Instantiate a tokenizer and a model from the checkpoint name. The model here is BERT, and its weights are loaded from the checkpoint.
- Define a text and a few questions.
- Iterate over the questions, building a sequence from the text and the current question, with the correct separators, token type IDs, and attention masks.
- Pass the sequence through the model. The output is a pair of scores over all tokens of the sequence, for both the start and the end position of the answer.
- Compute the softmax of the results to get probabilities (see the sketch after the example below).
- Fetch the tokens corresponding to the identified start and end positions and convert them to a string.
- Print the results.
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch
tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
text = r"""
🤗 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose
architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural
Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between
TensorFlow 2.0 and PyTorch.
"""
questions = [
    "How many pretrained models are available in 🤗 Transformers?",
    "What does 🤗 Transformers provide?",
    "🤗 Transformers provides interoperability between which frameworks?",
]
for question in questions:
    inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]
    outputs = model(**inputs)
    answer_start_scores = outputs.start_logits
    answer_end_scores = outputs.end_logits
    # Get the most likely beginning of answer with the argmax of the score
    answer_start = torch.argmax(answer_start_scores)
    # Get the most likely end of answer with the argmax of the score
    answer_end = torch.argmax(answer_end_scores) + 1
    answer = tokenizer.convert_tokens_to_string(
        tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
    )
    print(f"Question: {question}")
    print(f"Answer: {answer}")
'''
output:
Question: How many pretrained models are available in 🤗 Transformers?
Answer: over 32 +
Question: What does 🤗 Transformers provide?
Answer: general - purpose architectures
Question: 🤗 Transformers provides interoperability between which frameworks?
Answer: tensorflow 2. 0 and pytorch
'''
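As noted in the step list, the probabilities come from a softmax, while the code above takes the argmax of the raw logits (which selects the same positions). As a minimal sketch, the probability of the span chosen for the last question can be recovered like this:
start_probs = torch.softmax(answer_start_scores, dim=-1)
end_probs = torch.softmax(answer_end_scores, dim=-1)
# probability that the answer starts and ends at the selected positions
span_probability = start_probs[0, answer_start] * end_probs[0, answer_end - 1]
print(f"Span probability: {span_probability.item():.4f}")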
Language Modeling
Language modeling is the task of fitting a model to a corpus, which can be domain-specific. All popular transformer-based models are trained with a variant of language modeling: BERT with masked language modeling, GPT-2 with causal language modeling.
Language modeling is also useful outside of pretraining, for example to shift the model distribution towards a specific domain: take a model trained on a large corpus and fine-tune it on a new dataset or a specific domain.
Masked Language Modeling
Masked language modeling is the task of masking some of the tokens in a text and having the model predict them. This allows the model to attend to the text both to the left and to the right of the mask. Such training creates a strong basis for downstream tasks that require bidirectional context, such as SQuAD. If you want to fine-tune a model with masked language modeling, you can refer to the run_mlm.py script.
Here is an example of masked language modeling using pipelines:
from transformers import pipeline
unmasker = pipeline("fill-mask")
The model returns the filled-in sequences, each with a confidence score and the ID and string of the filled-in token:
from pprint import pprint
# pprint pretty-prints data structures across multiple lines, making the output easier to read
pprint(
    unmasker(
        f"HuggingFace is creating a {unmasker.tokenizer.mask_token} that the community uses to solve NLP tasks."
    )
)
'''
output:
[{'score': 0.1793,
'sequence': 'HuggingFace is creating a tool that the community uses to solve '
'NLP tasks.',
'token': 3944,
'token_str': ' tool'},
{'score': 0.1135,
'sequence': 'HuggingFace is creating a framework that the community uses to '
'solve NLP tasks.',
'token': 7208,
'token_str': ' framework'},
{'score': 0.0524,
'sequence': 'HuggingFace is creating a library that the community uses to '
'solve NLP tasks.',
'token': 5560,
'token_str': ' library'},
{'score': 0.0349,
'sequence': 'HuggingFace is creating a database that the community uses to '
'solve NLP tasks.',
'token': 8503,
'token_str': ' database'},
{'score': 0.0286,
'sequence': 'HuggingFace is creating a prototype that the community uses to '
'solve NLP tasks.',
'token': 17715,
'token_str': ' prototype'}]
'''
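The fill-mask pipeline returns 5 candidates by default; in recent versions of the library, the number of candidates can be controlled with the top_k argument:
pprint(
    unmasker(
        f"HuggingFace is creating a {unmasker.tokenizer.mask_token} that the community uses to solve NLP tasks.",
        top_k=2,
    )
)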
Here is an example of masked language modeling using a model directly. The process is as follows:
- Instantiate a tokenizer and a model from the checkpoint name. The model here is DistilBERT, and its weights are loaded from the checkpoint.
- Define a sequence containing a masked token by replacing a word with tokenizer.mask_token.
- Encode the sequence into a list of IDs and find the position of the masked token in that list.
- Retrieve the predictions at the index of the masked token: this tensor has the same size as the vocabulary, with a score for every token. The model assigns higher scores to tokens it deems more probable.
- Retrieve the top 5 tokens using the topk method.
- Replace the mask token with the predicted tokens and print the results.
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-cased")
sequence = (
    "Distilled models are smaller than the models they mimic. Using them instead of the large "
    f"versions would help {tokenizer.mask_token} our carbon footprint."
)
inputs = tokenizer(sequence, return_tensors="pt")
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
token_logits = model(**inputs).logits
mask_token_logits = token_logits[0, mask_token_index, :]
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
for token in top_5_tokens:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))
'''
output:
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help reduce our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help increase our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help decrease our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help offset our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help improve our carbon footprint.
'''
Causal Language Modeling
Causal language modeling is the task of predicting the token that follows a sequence of tokens. In this setting, the model only attends to the text on the left. Such training is particularly interesting for generation tasks. If you want to fine-tune a model with causal language modeling, you can refer to the run_clm.py script.
Usually, the next token is predicted by sampling from the logits of the last hidden state that the model produces for the input sequence.
Here is an example of causal language modeling using a model directly. It samples the token following the input sequence using the top_k_top_p_filtering() function.
from transformers import AutoModelForCausalLM, AutoTokenizer, top_k_top_p_filtering
import torch
from torch import nn
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
sequence = f"Hugging Face is based in DUMBO, New York City, and"
inputs = tokenizer(sequence, return_tensors="pt")
input_ids = inputs["input_ids"]
# get logits of last hidden state
next_token_logits = model(**inputs).logits[:, -1, :]
# filter
filtered_next_token_logits = top_k_top_p_filtering(next_token_logits, top_k=50, top_p=1.0)
# sample
probs = nn.functional.softmax(filtered_next_token_logits, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
generated = torch.cat([input_ids, next_token], dim=-1)
resulting_string = tokenizer.decode(generated.tolist()[0])
print(resulting_string)
'''
output:
Hugging Face is based in DUMBO, New York City, and ...
'''
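Repeating this single sampling step in a loop yields multi-token generation. Here is a minimal sketch reusing the model, tokenizer, and filtering function from the example above; in practice the generate() method, shown in the next section, does this for you with many more options:
generated = input_ids
for _ in range(10):  # sample 10 additional tokens, one at a time
    next_token_logits = model(input_ids=generated).logits[:, -1, :]
    filtered_logits = top_k_top_p_filtering(next_token_logits, top_k=50, top_p=1.0)
    probs = nn.functional.softmax(filtered_logits, dim=-1)
    next_token = torch.multinomial(probs, num_samples=1)
    generated = torch.cat([generated, next_token], dim=-1)
print(tokenizer.decode(generated.tolist()[0]))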
Text Generation
In text generation (also called open-ended text generation), the goal is to create coherent text that continues from a given context. The following example shows how GPT-2 can be used in a pipeline to generate text:
from transformers import pipeline
text_generator = pipeline("text-generation")
print(text_generator("As far as I am concerned, I will", max_length=50, do_sample=False))
'''
output:
[{'generated_text': 'As far as I am concerned, I will be the first to admit that I am not a fan of the idea of a
"free market." I think that the idea of a free market is a bit of a stretch. I think that the idea'}]
'''
Here, the model generates a random text with a maximum length of 50 tokens from the context. Behind the scenes, the pipeline calls the PreTrainedModel.generate() method to generate text. The default arguments of this method can be overridden in the pipeline, as shown above with max_length and do_sample.
Here is an example of text generation using XLNet and its tokenizer, which includes calling generate() directly:
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("xlnet-base-cased")
tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")
# Padding text helps XLNet with short prompts - proposed by Aman Rusia in https://github.com/rusiaaman/XLNet-gen#methodology
PADDING_TEXT = """In 1991, the remains of Russian Tsar Nicholas II and his family
(except for Alexei and Maria) are discovered.
The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the
remainder of the story. 1883 Western Siberia,
a young Grigori Rasputin is asked by his father and a group of men to perform magic.
Rasputin has a vision and denounces one of the men as a horse thief. Although his
father initially slaps him for making such an accusation, Rasputin watches as the
man is chased outside and beaten. Twenty years later, Rasputin sees a vision of
the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous,
with people, even a bishop, begging for his blessing. <eod> </s> <eos>"""
prompt = "Today the weather is really nice and I am planning on "
inputs = tokenizer(PADDING_TEXT + prompt, add_special_tokens=False, return_tensors="pt")["input_ids"]
prompt_length = len(tokenizer.decode(inputs[0]))
outputs = model.generate(inputs, max_length=250, do_sample=True, top_p=0.95, top_k=60)
generated = prompt + tokenizer.decode(outputs[0])[prompt_length + 1 :]
print(generated)
Text generation is currently possible with GPT-2, OpenAI-GPT, CTRL, XLNet, Transfo-XL, and Reformer. GPT-2 is usually a good choice for open-ended text generation because it was trained on millions of web pages with a causal language modeling objective.
Named Entity Recognition
Named entity recognition (NER) is the task of classifying tokens into given classes, for example identifying a token as a person, an organization, or a location. A typical NER dataset is the CoNLL-2003 dataset. If you want to fine-tune a model on a NER task, you can refer to the run_ner.py script.
Here is an example of named entity recognition using pipelines, specifically classifying tokens into the following 9 classes:
- O: not a named entity
- B-MIS: beginning of a miscellaneous entity
- I-MIS: miscellaneous entity
- B-PER: beginning of a person's name
- I-PER: person's name
- B-ORG: beginning of an organization
- I-ORG: organization
- B-LOC: beginning of a location
- I-LOC: location
from transformers import pipeline
ner_pipe = pipeline("ner")
sequence = """Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO,
therefore very close to the Manhattan Bridge which is visible from the window."""
for entity in ner_pipe(sequence):
    print(entity)
'''
output:
{'entity': 'I-ORG', 'score': 0.9996, 'index': 1, 'word': 'Hu', 'start': 0, 'end': 2}
{'entity': 'I-ORG', 'score': 0.9910, 'index': 2, 'word': '##gging', 'start': 2, 'end': 7}
{'entity': 'I-ORG', 'score': 0.9982, 'index': 3, 'word': 'Face', 'start': 8, 'end': 12}
{'entity': 'I-ORG', 'score': 0.9995, 'index': 4, 'word': 'Inc', 'start': 13, 'end': 16}
{'entity': 'I-LOC', 'score': 0.9994, 'index': 11, 'word': 'New', 'start': 40, 'end': 43}
{'entity': 'I-LOC', 'score': 0.9993, 'index': 12, 'word': 'York', 'start': 44, 'end': 48}
{'entity': 'I-LOC', 'score': 0.9994, 'index': 13, 'word': 'City', 'start': 49, 'end': 53}
{'entity': 'I-LOC', 'score': 0.9863, 'index': 19, 'word': 'D', 'start': 79, 'end': 80}
{'entity': 'I-LOC', 'score': 0.9514, 'index': 20, 'word': '##UM', 'start': 80, 'end': 82}
{'entity': 'I-LOC', 'score': 0.9337, 'index': 21, 'word': '##BO', 'start': 82, 'end': 84}
{'entity': 'I-LOC', 'score': 0.9762, 'index': 28, 'word': 'Manhattan', 'start': 114, 'end': 123}
{'entity': 'I-LOC', 'score': 0.9915, 'index': 29, 'word': 'Bridge', 'start': 124, 'end': 130}
'''
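Instead of per-subword predictions, recent versions of the pipeline can merge subwords back into whole entities by passing an aggregation strategy (the result keys then differ slightly, e.g. entity_group instead of entity):
ner_grouped = pipeline("ner", aggregation_strategy="simple")
for entity in ner_grouped(sequence):
    print(entity)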
Here is an example of named entity recognition using a model directly. The process is as follows:
- Instantiate a tokenizer and a model from the checkpoint name. The model here is BERT, and its weights are loaded from the checkpoint.
- Define a sequence containing well-known entities, such as Hugging Face as an organization and New York City as a location.
- Tokenize the sequence and encode it into IDs.
- Pass the sequence through the model, which outputs a score for each of the classes for every token; the softmax of these scores gives the per-class probabilities (see the sketch after the example below).
- Print each token together with its predicted class.
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch
model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
sequence = (
    "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, "
    "therefore very close to the Manhattan Bridge."
)
inputs = tokenizer(sequence, return_tensors="pt")
tokens = inputs.tokens()
outputs = model(**inputs).logits
predictions = torch.argmax(outputs, dim=2)
for token, prediction in zip(tokens, predictions[0].numpy()):
    print((token, model.config.id2label[prediction]))
'''
output:
('[CLS]', 'O')
('Hu', 'I-ORG')
('##gging', 'I-ORG')
('Face', 'I-ORG')
('Inc', 'I-ORG')
('.', 'O')
('is', 'O')
('a', 'O')
('company', 'O')
('based', 'O')
('in', 'O')
('New', 'I-LOC')
('York', 'I-LOC')
('City', 'I-LOC')
('.', 'O')
('Its', 'O')
('headquarters', 'O')
('are', 'O')
('in', 'O')
('D', 'I-LOC')
('##UM', 'I-LOC')
('##BO', 'I-LOC')
(',', 'O')
('therefore', 'O')
('very', 'O')
('close', 'O')
('to', 'O')
('the', 'O')
('Manhattan', 'I-LOC')
('Bridge', 'I-LOC')
('.', 'O')
('[SEP]', 'O')
'''
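As in the question answering example, the step list mentions softmax probabilities while the code takes the argmax of the logits directly; both select the same class. As a sketch, the probability of each predicted class can be recovered like this:
probabilities = torch.softmax(outputs, dim=2)[0]
# print each token with its predicted class and the probability of that class
for token, probs, prediction in zip(tokens, probabilities, predictions[0].numpy()):
    print((token, model.config.id2label[prediction], round(probs[prediction].item(), 4)))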
Summarization
Summarization is the task of condensing a document or an article into a shorter text. If you want to fine-tune a model on a summarization task, you can refer to the run_summarization.py script.
An example of a summarization dataset is the CNN/Daily Mail dataset, which consists of long news articles.
Here is an example of summarization using pipelines, with a Bart model fine-tuned on CNN/Daily Mail.
from transformers import pipeline
summarizer = pipeline("summarization")
ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison. Her next court appearance is scheduled for May 18.
"""
print(summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False))
'''
output:
[{'summary_text': ' Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in
the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and
2002 . At one time, she was married to eight men at once, prosecutors say .'}]
'''
Here is an example of summarization using a model directly. The process is as follows:
- Instantiate a tokenizer and a model from the checkpoint name. Summarization is usually done with an encoder-decoder model, such as Bart or T5.
- Define the article that should be summarized.
- Add the special prefix "summarize: ".
- Use the PreTrainedModel.generate() method to generate the summary.
This example uses Google's T5 model. Even though it was pretrained on a multi-task mixture of datasets, it yields very good results.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")
# T5 uses a max_length of 512 so we cut the article to 512 tokens.
inputs = tokenizer("summarize: " + ARTICLE, return_tensors="pt", max_length=512, truncation=True)
outputs = model.generate(
    inputs["input_ids"], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True
)
print(tokenizer.decode(outputs[0]))
'''
output:
<pad> prosecutors say the marriages were part of an immigration scam. if convicted, barrientos faces two criminal
counts of "offering a false instrument for filing in the first degree" she has been married 10 times, nine of them
between 1999 and 2002.</s>
'''
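The <pad> and </s> markers in the output are special tokens; they can be stripped during decoding:
print(tokenizer.decode(outputs[0], skip_special_tokens=True))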
Translation
Translation is the task of translating text from one language to another. If you want to fine-tune a model on a translation task, you can refer to the run_translation.py script.
A typical dataset for translation is the WMT English to German dataset, which takes English text as input and the equivalent German text as output.
Here is an example of translation using pipelines. It uses a T5 model pretrained on a multi-task mixture of datasets, which already achieves good translation results.
from transformers import pipeline
translator = pipeline("translation_en_to_de")
print(translator("Hugging Face is a technology company based in New York and Paris", max_length=40))
'''
output:
[{'translation_text': 'Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.'}]
'''
Because the translation pipeline depends on the PreTrainedModel.generate() method, its default arguments can be overridden in the pipeline, as shown above with max_length.
Here is an example of translation using a model and a tokenizer. The process is as follows:
- Instantiate a tokenizer and a model from the checkpoint name. Translation is usually done with an encoder-decoder model, such as Bart or T5.
- Define the sentence that should be translated.
- Add the special prefix "translate English to German: ".
- Use the PreTrainedModel.generate() method to generate the translation.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")
inputs = tokenizer(
    "translate English to German: Hugging Face is a technology company based in New York and Paris",
    return_tensors="pt",
)
outputs = model.generate(inputs["input_ids"], max_length=40, num_beams=4, early_stopping=True)
print(tokenizer.decode(outputs[0]))
'''
output:
<pad> Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.</s>
'''
We get the same translation as with the pipeline example.