Getting Started with Hugging Face
Basic Hugging Face functions
- tokenizer.tokenize(text): returns a list; splits the sequence into the tokens available in the tokenizer's vocabulary. For Chinese this yields individual characters, for English it yields subwords.
- tokenizer(text1, text2, ...) is equivalent to tokenizer.encode_plus(text1, text2, ...): when the two texts are passed as separate arguments, they are encoded into a single input_ids sequence with [CLS] and [SEP] tokens added as separators, e.g. [CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP]. Passing a list instead, tokenizer([text1, text2]), treats each sentence as an independent example.
- tokenizer.convert_ids_to_tokens(input_ids): converts the ids back into a list of token strings.
- tokenizer.decode(input_ids): returns a string, equivalent to ' '.join(tokenizer.convert_ids_to_tokens(input_ids)).
Common parameters of tokenizer(text1, text2, ...):
- add_special_tokens=True: add '[CLS]' and '[SEP]'.
- max_length=256: pad and truncate all sentences to this length.
- pad_to_max_length=True: pad up to max_length (recent versions use padding='max_length' together with truncation=True instead).
- return_attention_mask=True: construct the attention_mask.
- return_tensors='pt': return PyTorch tensors.
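A minimal sketch of a call combining these options (the sentence strings are placeholders; padding='max_length' plus truncation=True is the newer spelling of pad_to_max_length=True):
encoded = tokenizer(
    "这是第一句。", "这是第二句。",
    add_special_tokens=True,       # add [CLS] and [SEP]
    max_length=256,                # target length for padding/truncation
    padding='max_length',
    truncation=True,
    return_attention_mask=True,    # construct the attention_mask
    return_tensors='pt'            # return PyTorch tensors
)
print(encoded['input_ids'].shape)  # torch.Size([1, 256])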
Outputs of tokenizer(text1, text2, ...):
- input_ids: list; the vocabulary id of each token. This is the only input that must always be passed to the model.
- attention_mask: list; optional, used when batching sequences together. It tells the model which tokens should be attended to and which should not. See the example below.
- token_type_ids: list; used by models that do sequence classification or QA and therefore need two different sequences joined into a single "input_ids" entry, which is usually done with the help of special tokens such as the classifier token ([CLS]) and the separator token ([SEP]).
attention_mask example
When two sentences have different lengths, we either truncate the longer one or pad the shorter one. Below, 0s are appended to the input_ids of the first (shorter) sentence so that it matches the length of the second; in the corresponding attention_mask, 1 marks the positions to attend to and the padded positions are 0, telling the model not to attend to them.
sequence_a = "这是一个短的句子。"
sequence_b = "这是一个长的句子,至少比第一个句子长。"
encoded_sequence_a = tokenizer(sequence_a)["input_ids"]
encoded_sequence_b = tokenizer(sequence_b)["input_ids"]
len(encoded_sequence_a), len(encoded_sequence_b) # (11, 21)
padded_sequences = tokenizer([sequence_a, sequence_b], padding=True) # the two sentences are treated as independent examples
print(padded_sequences)
{'input_ids': [[101, 6821, 3221, 671, 702, 4764, 4638, 1368, 2094, 511, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 6821, 3221, 671, 702, 7270, 4638, 1368, 2094, 8024, 5635, 2208, 3683, 5018, 671, 702, 1368, 2094, 7270, 511, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
About position_ids
- Unlike RNNs, which consume tokens in order and therefore have each token's position built in, transformers do not know the position of each token (attention itself is order-agnostic). The model therefore uses position_ids to identify each token's position in the token list.
- position_ids is an optional argument. If no position_ids are passed to the model, they are created automatically as absolute positional embeddings.
- Absolute positional embeddings are selected in the range [0, config.max_position_embeddings - 1]. Some models use other kinds of positional embeddings, such as sinusoidal position embeddings or relative position embeddings (the positional encoding formula in the original attention paper is itself an absolute scheme).
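A minimal sketch of passing explicit position_ids to a BERT model; this is normally unnecessary, because the default absolute positions the model creates are exactly the ones built below (the checkpoint name is just an example):
from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")

inputs = tokenizer("一个简单的例子", return_tensors="pt")
seq_length = inputs["input_ids"].shape[1]
position_ids = torch.arange(seq_length).unsqueeze(0)  # absolute positions 0 .. seq_length-1
outputs = model(**inputs, position_ids=position_ids)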
About labels
The labels hold the ground truth; they are an optional argument, and when they are passed the model computes the loss.
The exact shape the labels should take differs from model to model:
- For sequence classification models (e.g. BertForSequenceClassification), the model expects a tensor of dimension (batch_size), with each value of the batch corresponding to the expected label of the entire sequence (see the sketch after this list).
- For token classification models (e.g. BertForTokenClassification), the model expects a tensor of dimension (batch_size, seq_length), with each value corresponding to the expected label of each individual token.
- For masked language modeling (e.g. BertForMaskedLM), the model expects a tensor of dimension (batch_size, seq_length), with each value corresponding to the expected label of each individual token: the labels are the token IDs of the masked tokens, and the remaining positions are ignored (usually set to -100).
- For sequence-to-sequence tasks (e.g. BartForConditionalGeneration, MBartForConditionalGeneration), the model expects a tensor of dimension (batch_size, tgt_seq_length), with each value corresponding to the target sequence associated with each input sequence. During training, BART and T5 make the appropriate decoder_input_ids and decoder attention masks internally, so they usually do not need to be provided. This does not apply to models that use the generic Encoder-Decoder framework.
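A minimal sketch of passing labels to a sequence classification model so that it returns the loss (the checkpoint name, num_labels and the label value are placeholders):
from transformers import BertTokenizer, BertForSequenceClassification
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2, return_dict=True)

inputs = tokenizer("这部电影很好看", return_tensors="pt")
labels = torch.tensor([1])                 # shape (batch_size,)
outputs = model(**inputs, labels=labels)
print(outputs.loss, outputs.logits.shape)  # scalar loss and logits of shape (1, 2)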
About decoder_input_ids
Encoder-decoder models (e.g. BART, T5) create their own decoder_input_ids from the labels that are passed in. For such models, passing the labels is the preferred way to handle training.
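A minimal sketch, assuming a T5 checkpoint: the target token IDs are passed as labels and the model shifts them internally to build the decoder_input_ids.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small", return_dict=True)

input_ids = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt").input_ids
labels = tokenizer("Das Haus ist wunderbar.", return_tensors="pt").input_ids

outputs = model(input_ids=input_ids, labels=labels)  # no decoder_input_ids passed
print(outputs.loss)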
About feed forward chunking
Roughly: in the Transformer encoder, the self-attention layer is usually followed by a feed-forward network, and holding that feed-forward network's intermediate activations for a whole [batch_size, sequence_length] input costs a lot of memory. The authors of the paper therefore proposed a mathematically equivalent scheme that processes the feed-forward computation in chunks; my understanding is that the chunking is along the sequence_length dimension, since tokens at different positions are transformed with the same feed-forward weights anyway and can be processed independently.
For models that use apply_chunking_to_forward(), the chunk_size defines the number of output embeddings that are computed in parallel, trading memory for time. If chunk_size is set to 0, no feed forward chunking is performed.
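A minimal sketch of turning chunking on through the config: chunk_size_feed_forward is the config attribute the model hands to apply_chunking_to_forward(); it defaults to 0 (no chunking).
from transformers import BertConfig, BertModel

config = BertConfig()                  # default BERT config
config.chunk_size_feed_forward = 4     # process the feed-forward pass 4 output embeddings at a time
model = BertModel(config)              # randomly initialised model using chunked feed forward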
A worked model example
(Adapted from the post "HuggingFace-transformers系列的介绍以及在下游任务中的使用".)
!source activate wiki
import transformers
from pprint import pprint
MODEL_PATH = r'/home/sxj/jupyter/AL_BAGs/chinese-bert-wwm'
# a. Load the tokenizer from the vocabulary file
tokenizer = transformers.BertTokenizer.from_pretrained(r'/home/sxj/jupyter/AL_BAGs/chinese-bert-wwm/vocab.txt')
# b. Load the config
model_config = transformers.BertConfig.from_pretrained(MODEL_PATH)
# Modify the config so the model also returns all hidden states and attentions
model_config.output_hidden_states = True
model_config.output_attentions = True
# Load the model from the local path with the modified config
model = transformers.BertModel.from_pretrained(MODEL_PATH,config = model_config)
# sequence = "A Titan RTX has 24GB of VRAM"
# ['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M']
sequence = "伊丽莎白为约克公爵及公爵夫人"
tokenizer.tokenize(sequence)
['伊', '丽', '莎', '白', '为', '约', '克', '公', '爵', '及', '公', '爵', '夫', '人']
# Single sentence: encode returns only input_ids
tokenizer.encode("我 爱你")
[101, 2769, 4263, 872, 102]
# Sentence pair: encode_plus returns all the encoding information
sen_code = tokenizer.encode_plus("我爱你", "不是他")
pprint(sen_code)
assert tokenizer("我爱你", "不是他") == tokenizer.encode_plus("我爱你", "不是他")
{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1],
'input_ids': [101, 2769, 4263, 872, 102, 679, 3221, 800, 102],
'token_type_ids': [0, 0, 0, 0, 0, 1, 1, 1, 1]}
print(tokenizer.convert_ids_to_tokens(sen_code['input_ids']))
print(tokenizer.decode(sen_code['input_ids']))
assert ' '.join(tokenizer.convert_ids_to_tokens(sen_code['input_ids'])) == tokenizer.decode(sen_code['input_ids'])
['[CLS]', '我', '爱', '你', '[SEP]', '不', '是', '他', '[SEP]']
[CLS] 我 爱 你 [SEP] 不 是 他 [SEP]
import torch
model.eval()  # put the model in evaluation mode
input_ids = torch.tensor([sen_code['input_ids']])  # add a batch dimension and convert to a tensor
token_type_ids = torch.tensor([sen_code['token_type_ids']])
# Move the model and the data to CUDA; if no GPU is available, keep 'cpu'
device = 'cpu'
tokens_tensor = input_ids.to(device)
segments_tensors = token_type_ids.to(device)
model.to(device)
BertModel(
(embeddings): BertEmbeddings(
(word_embeddings): Embedding(21128, 768, padding_idx=0)
(position_embeddings): Embedding(512, 768)
(token_type_embeddings): Embedding(2, 768)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(encoder): BertEncoder(
(layer): ModuleList(
(0): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
      (1)-(11): 11 more BertLayer blocks, identical in structure to (0)
)
)
(pooler): BertPooler(
(dense): Linear(in_features=768, out_features=768, bias=True)
(activation): Tanh()
)
)
# Run the encoding
with torch.no_grad():
    # See the models docstrings for the detail of the inputs
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)  # the tokenizer output can also be fed directly: inputs = tokenizer(question); model(**inputs)
    # Transformers models always output tuples.
    # See the models docstrings for the detail of all the outputs
    # In our case, the first element is the hidden state of the last layer of the Bert model
    encoded_layers = outputs
# encoded_layers is the final encoding result
'''
sequence_output:  torch.Size([1, 9, 768])
pooled_output:    torch.Size([1, 768])
(hidden_states):  tuple of 13 * torch.Size([1, 9, 768])
(attentions):     tuple of 12 * torch.Size([1, 12, 9, 9])
'''
len(encoded_layers),(encoded_layers[0].shape),(encoded_layers[1].shape),len(encoded_layers[2]),(encoded_layers[2][0].shape),len(encoded_layers[3]),(encoded_layers[3][0].shape)
(4,
torch.Size([1, 9, 768]),
torch.Size([1, 768]),
13,
torch.Size([1, 9, 768]),
12,
torch.Size([1, 12, 9, 9]))
The most common use cases
- AutoConfig, AutoModel, AutoTokenizer: automatically retrieve the relevant class given the name/path of the pretrained weights/config/vocabulary.
- Not every model has been fine-tuned on every task, and fine-tuned models were tuned on a specific dataset that may not overlap with your use case or domain; you can use the example scripts to fine-tune a model, or write your own training script.
For running inference on a given task, the library provides several mechanisms:
- Pipelines: very easy-to-use abstractions, which require as little as two lines of code.
- Direct model use: less abstraction, but more flexibility and power via direct access to a tokenizer (PyTorch/TensorFlow) and the full inference capacity.
1. sequence classification
An example of sequence classification is the GLUE benchmark, which is entirely based on that task. If you want to fine-tune a model on a GLUE sequence classification task, you can use the run_glue.py, run_pl_glue.py or run_tf_glue.py scripts.
This example uses pipelines for sentiment analysis: deciding whether a sentence is positive or negative. It leverages a model fine-tuned on SST-2, which is a GLUE task.
from transformers import pipeline
nlp = pipeline("sentiment-analysis")
result = nlp("我恨你")
print(result)
print(f"label: {result[0]['label']}, with score: {round(result[0]['score'], 4)}")
[{'label': 'NEGATIVE', 'score': 0.7413254976272583}]
label: NEGATIVE, with score: 0.7413
The next sequence classification example uses a model to determine whether two sequences are paraphrases of each other (i.e. express the same meaning). The process is as follows:
- Instantiate a tokenizer and a model from the checkpoint name; a BERT model is used here.
- Build a sequence from the two sentences.
- Pass the sequence to the model so that it is classified as either 0 (not a paraphrase) or 1 (is a paraphrase).
- Compute the softmax of the result to get probabilities over the classes.
- Print the results.
- What are logits? Logits and softmax both live at the output layer: logits = tf.matmul(X, W) + bias, and softmax then normalises the logits, Y_pred = tf.nn.softmax(logits, name='Y_pred') (see the PyTorch sketch below).
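The same relationship in PyTorch terms, which is what the snippet below relies on (the shapes are illustrative only):
import torch
X = torch.randn(1, 768)               # pooled features for one example
W, bias = torch.randn(768, 2), torch.randn(2)
logits = X @ W + bias                 # raw, unnormalised scores, shape (1, 2)
probs = torch.softmax(logits, dim=1)  # class probabilities, each row sums to 1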
# Full code: https://huggingface.co/transformers/task_summary.html ; a few key points below
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
tokenizer = AutoTokenizer.from_pretrained(r"./bert-base-cased-finetuned-mrpc/")
model = AutoModelForSequenceClassification.from_pretrained(r"./bert-base-cased-finetuned-mrpc/", return_dict=True)
classes = ["not paraphrase", "is paraphrase"]
sequence_0 = "The company HuggingFace is based in New York City"
sequence_1 = "Apples are especially bad for your health"
sequence_2 = "HuggingFace's headquarters are situated in Manhattan"
paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="pt") # Return pytorch tensors
not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="pt")
paraphrase_classification_logits = model(**paraphrase).logits
not_paraphrase_classification_logits = model(**not_paraphrase).logits
print(paraphrase_classification_logits)
print(torch.softmax(paraphrase_classification_logits, dim=1).tolist()) # softmax over dimension 1
paraphrase_results = torch.softmax(paraphrase_classification_logits, dim=1).tolist()[0]
not_paraphrase_results = torch.softmax(not_paraphrase_classification_logits, dim=1).tolist()[0]
for i in range(len(classes)):
print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%")
for i in range(len(classes)):
print(f"{classes[i]}: {int(round(not_paraphrase_results[i] * 100))}%")
tensor([[-0.3495, 1.9004]], grad_fn=<AddmmBackward>)
[[0.09536301344633102, 0.9046369791030884]]
not paraphrase: 10%
is paraphrase: 90%
not paraphrase: 94%
is paraphrase: 6%
2. Extractive Question Answering
Extractive question answering is the task of extracting an answer from a text given a question. An example QA dataset is the SQuAD dataset, which is entirely based on that task. If you want to fine-tune a model on SQuAD, you can run the script provided in the original documentation.
from transformers import pipeline
nlp = pipeline('question-answering')
context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question.
An example of a question answering dataset is the SQuAD dataset, which is entirely based on that task.
If you would like to fine-tune a model on a SQuAD task, you may leverage the examples/question-answering/run_squad.py script.
"""
result = nlp(question="What is extractive question answering?", context=context)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")
# Answer: 'the task of extracting an answer from a text given a question.', score: 0.6226, start: 34, end: 96
The process for doing QA with a model and a tokenizer is:
- Instantiate a tokenizer and a model from the checkpoint name; a BERT model is used here.
- Define a text and a few questions.
- Iterate over the questions and build a sequence from the text and each question.
- Pass the sequence to the model, which outputs two scores for every token in the sequence (both text and question): the score of that position being the start and the end of the answer.
- Compute the softmax of the scores to get probabilities.
- Convert the tokens between the most probable start and end positions into the answer and print it (see the full sketch after the fragments below).
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad", return_dict=True)
torch.argmax(input, dim=None, keepdim=False) returns the index of the maximum value along the given dimension, e.g. answer_end = torch.argmax(answer_end_scores) + 1
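The original post only quotes fragments of the QA code; below is a minimal sketch of the full loop following the steps above (the text and question strings are made-up placeholders, the checkpoint is the SQuAD-fine-tuned BERT from the fragment, and argmax is applied to the raw scores rather than the softmax probabilities, which selects the same positions):
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad", return_dict=True)

text = "Hugging Face Transformers provides general-purpose architectures for natural language understanding and generation."
questions = ["What does Hugging Face Transformers provide?"]

for question in questions:
    inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]
    outputs = model(**inputs)
    answer_start = torch.argmax(outputs.start_logits)   # most likely start token
    answer_end = torch.argmax(outputs.end_logits) + 1   # most likely end token (exclusive)
    answer = tokenizer.convert_tokens_to_string(
        tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
    print(f"Question: {question}\nAnswer: {answer}")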
3. language modeling
BERT is pretrained with masked language modeling, and GPT-2 is pretrained with causal language modeling. Language modeling is also useful outside of pretraining, e.g. to shift a model to a specific domain: take an LM pretrained on a very large corpus and fine-tune it on a news dataset or on scientific papers, e.g. LysandreJik/arxiv-nlp.
3.1 Masked LM
Masking lets the model attend to the context both to the left and to the right of the masked token. Masked LM therefore creates a solid basis for downstream tasks that require a bidirectional context, such as SQuAD.
from transformers import pipeline
from pprint import pprint
nlp = pipeline("fill-mask")
pprint(nlp(f"HuggingFace is creating a {nlp.tokenizer.mask_token} that the community uses to solve NLP tasks."))
The process for masked LM is:
- Instantiate a tokenizer and a model from the checkpoint name; a DistilBERT model is used here.
- Define a sequence with a masked token, placing tokenizer.mask_token instead of a word.
- Encode that sequence into a list of IDs and find the position of the masked token in that list.
- Retrieve the predictions at the index of the masked token: this tensor has the same size as the vocabulary, i.e. a score for every token in the vocabulary.
- Retrieve the top 5 tokens using PyTorch's torch.topk.
from transformers import AutoModelWithLMHead, AutoTokenizer
import torch
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
model = AutoModelWithLMHead.from_pretrained("distilbert-base-cased", return_dict=True)
sequence = f"Distilled models are smaller than the models they mimic. Using them instead of the large versions would help {tokenizer.mask_token} our carbon footprint."
input = tokenizer.encode(sequence, return_tensors="pt")
mask_token_index = torch.where(input == tokenizer.mask_token_id)[1]
token_logits = model(input).logits
mask_token_logits = token_logits[0, mask_token_index, :]
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
for token in top_5_tokens:
print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))
3.2 Causal Language Modeling
Causal language modeling predicts the tokens of a sequence from left to right, attending only to the context to the left of the current token; this training setup is particularly interesting for generation tasks.
Usually, the next token is sampled from the logits of the last hidden state the model produces from the input sequence. The example below uses top_k_top_p_filtering() to filter the next-token logits before sampling the next token.
top_k_top_p_filtering(logits, top_k=0, top_p=1.0, filter_value=-float("Inf"), min_tokens_to_keep=1):
""" Filter a distribution of logits using top-k and/or nucleus (top-p) filtering
Args:
logits: logits distribution shape (batch size, vocabulary size)
if top_k > 0: keep only top k tokens with highest probability (top-k filtering).
if top_p < 1.0: keep the top tokens with cumulative probability >= top_p (nucleus filtering).
Nucleus filtering is described in Holtzman et al. (http://arxiv.org/abs/1904.09751)
Make sure we keep at least min_tokens_to_keep per batch example in the output
From: https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317
"""
torch.multinomial(input, num_samples, replacement=False, out=None): draws num_samples indices from each row of input, treating each row as a (relative) probability distribution; apply softmax to the logits first so that all entries are positive.
from transformers import AutoModelWithLMHead, AutoTokenizer, top_k_top_p_filtering
import torch
from torch.nn import functional as F
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelWithLMHead.from_pretrained("gpt2", return_dict=True)
sequence = f"Hugging Face is based in DUMBO, New York City, and "
input_ids = tokenizer.encode(sequence, return_tensors="pt")
# get logits of last hidden state
next_token_logits = model(input_ids).logits[:, -1, :]
# filter
filtered_next_token_logits = top_k_top_p_filtering(next_token_logits, top_k=50, top_p=1.0)
# sampling adds some randomness so the generated text is less mechanical, while the filtering keeps it from drifting into odd tokens
probs = F.softmax(filtered_next_token_logits, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
generated = torch.cat([input_ids, next_token], dim=-1)
resulting_string = tokenizer.decode(generated.tolist()[0])
print(resulting_string)
# Hugging Face is based in DUMBO, New York City, and has
3.3 Text Generation
Text generation (also called open-ended text generation) continues a given context. The following example shows how GPT-2 can be used in a pipeline to generate text. By default, all models apply top-k sampling when used in pipelines, as set in their respective configs (see, for example, the gpt-2 config).
The model generates a somewhat random continuation. The default arguments of PreTrainedModel.generate() can be overridden directly in the pipeline, such as max_length below.
from transformers import pipeline
text_generator = pipeline("text-generation")
print(text_generator("As far as I am concerned, I will", max_length=50, do_sample=False))
[{'generated_text': 'As far as I am concerned, I will be the first to admit that I am not a fan of the idea of a "free market." \
I think that the idea of a free market is a bit of a stretch. I think that the idea'}]
There is also an example of text generation using XLNet and its tokenizer.
In general, text generation is available for GPT-2, OpenAI-GPT, CTRL, XLNet, Transfo-XL and Reformer in PyTorch. As the official example shows, XLNet and Transfo-XL usually need to be padded to work well. GPT-2 is usually a good choice for open-ended text generation because it was trained with a causal language modeling objective on millions of web pages.
For more information on how to apply different decoding strategies to text generation, see the blog post on decoding strategies.
from transformers import AutoModelWithLMHead, AutoTokenizer
model = AutoModelWithLMHead.from_pretrained("xlnet-base-cased", return_dict=True)
tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")
PADDING_TEXT = """In 1991, the remains of Russian Tsar Nicholas II and his family
(except for Alexei and Maria) are discovered.
The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the
remainder of the story. 1883 Western Siberia,
a young Grigori Rasputin is asked by his father and a group of men to perform magic.
Rasputin has a vision and denounces one of the men as a horse thief. Although his
father initially slaps him for making such an accusation, Rasputin watches as the
man is chased outside and beaten. Twenty years later, Rasputin sees a vision of
the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous,
with people, even a bishop, begging for his blessing. <eod> </s> <eos>"""
prompt = "Today the weather is really nice and I am planning on "
inputs = tokenizer.encode(PADDING_TEXT + prompt, add_special_tokens=False, return_tensors="pt")
prompt_length = len(tokenizer.decode(inputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True))
outputs = model.generate(inputs, max_length=250, do_sample=True, top_p=0.95, top_k=60) # decoding strategy; see the blog post mentioned above
generated = prompt + tokenizer.decode(outputs[0])[prompt_length:] # prepend the prompt and keep only the newly generated continuation (the padding text is dropped)
4. named entity recognition
Named entity recognition (NER) is the task of assigning a class to each token. An example NER dataset is the CoNLL-2003 dataset, which is entirely based on that task. If you want to fine-tune a model on an NER task, you can run the run_ner.py (PyTorch), run_pl_ner.py (leveraging pytorch-lightning) or run_tf_ner.py (TensorFlow) scripts.
Below is an example of using pipelines for NER, where every token is classified into one of the following 9 classes. (A note on I- vs B-: any token tagged I-PER belongs to a PERSON entity, and consecutive I-PER tokens can later be merged into a single PERSON; but if two PERSON entities were directly adjacent they could not be told apart, hence B-PER, which marks the beginning of a new entity right after another one.)
- O, Outside of a named entity
- B-MISC, Beginning of a miscellaneous entity right after another miscellaneous entity
- I-MISC, Miscellaneous entity
- B-PER, Beginning of a person's name right after another person's name
- I-PER, Person's name
- B-ORG, Beginning of an organisation right after another organisation
- I-ORG, Organisation
- B-LOC, Beginning of a location right after another location
- I-LOC, Location
It leverages a fine-tuned model on CoNLL-2003, fine-tuned by @stefan-it from dbmdz.
from transformers import pipeline
nlp = pipeline("ner")
sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very close to the Manhattan Bridge which is visible from the window."
print(nlp(sequence))
[
{'word': 'Hu', 'score': 0.9995632767677307, 'entity': 'I-ORG'},
{'word': '##gging', 'score': 0.9915938973426819, 'entity': 'I-ORG'},
{'word': 'Face', 'score': 0.9982671737670898, 'entity': 'I-ORG'},
{'word': 'Inc', 'score': 0.9994403719902039, 'entity': 'I-ORG'},
{'word': 'New', 'score': 0.9994346499443054, 'entity': 'I-LOC'},
{'word': 'York', 'score': 0.9993270635604858, 'entity': 'I-LOC'},
{'word': 'City', 'score': 0.9993864893913269, 'entity': 'I-LOC'},
{'word': 'D', 'score': 0.9825621843338013, 'entity': 'I-LOC'},
{'word': '##UM', 'score': 0.936983048915863, 'entity': 'I-LOC'},
{'word': '##BO', 'score': 0.8987102508544922, 'entity': 'I-LOC'},
{'word': 'Manhattan', 'score': 0.9758241176605225, 'entity': 'I-LOC'},
{'word': 'Bridge', 'score': 0.990249514579773, 'entity': 'I-LOC'}
]
# "Hugging Face" --- organisation
# "New York City", "DUMBO", "Manhattan Bridge" --- locations
The process for doing NER with a model and a tokenizer is:
- Instantiate a model and a tokenizer from the checkpoint name; a BERT model is used here, with weights loaded from the checkpoint.
- Define the label list the model was trained to classify tokens into.
- Define a sentence with known entities.
- Split the words into tokens so that they can be mapped to the predictions. A small trick is used: the whole sequence is first fully encoded and decoded, which leaves a string containing the special tokens, and that string is then tokenized.
- Encode the sequence into IDs (special tokens are added automatically).
- Retrieve the predictions by taking, for each token, the class with the highest of the 9 scores the model outputs.
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch
model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english", return_dict=True)
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
label_list = [
"O", # Outside of a named entity
"B-MISC", # Beginning of a miscellaneous entity right after another miscellaneous entity
"I-MISC", # Miscellaneous entity
"B-PER", # Beginning of a person's name right after another person's name
"I-PER", # Person's name
"B-ORG", # Beginning of an organisation right after another organisation
"I-ORG", # Organisation
"B-LOC", # Beginning of a location right after another location
"I-LOC" # Location
]
sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very close to the Manhattan Bridge."
# Bit of a hack to get the tokens with the special tokens
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence))) # encode: ids with special tokens added; decode: merge back into one string; tokenize: split again
# equivalent to tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(sequence)), which is more convenient
inputs = tokenizer.encode(sequence, return_tensors="pt")
outputs = model(inputs).logits
predictions = torch.argmax(outputs, dim=2)
print([(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].numpy())])
# [('[CLS]', 'O'), ('Hu', 'I-ORG'), ('##gging', 'I-ORG'), ('Face', 'I-ORG'), ('Inc', 'I-ORG'), ('.', 'O'), ('is', 'O'), ('a', 'O'), ('company', 'O'), ('based', 'O'), ('in', 'O'), ('New', 'I-LOC'), ('York', 'I-LOC'), ('City', 'I-LOC'), ('.', 'O'), ('Its', 'O'), ('headquarters', 'O'), ('are', 'O'), ('in', 'O'), ('D', 'I-LOC'), ('##UM', 'I-LOC'), ('##BO', 'I-LOC'), (',', 'O'), ('therefore', 'O'), ('very', 'O'), ('##c', 'O'), ('##lose', 'O'), ('to', 'O'), ('the', 'O'), ('Manhattan', 'I-LOC'), ('Bridge', 'I-LOC'), ('.', 'O'), ('[SEP]', 'O')]
5. Summarization
Summarization is the task of producing a summary of a document or an article.
An example summarization dataset is the CNN / Daily Mail dataset, which consists of long news articles and was created precisely for summarization. If you want to fine-tune a model on a summarization task, the documentation describes several approaches.
The process for doing summarization with a model and a tokenizer is:
- Instantiate a model and a tokenizer from the checkpoint name; summarization is usually done with an encoder-decoder model such as BART or T5.
- Define the article that should be summarized.
- Add the T5-specific prefix "summarize: ".
- Use the PreTrainedModel.generate() method to generate the summary.
from transformers import AutoModelWithLMHead, AutoTokenizer
model = AutoModelWithLMHead.from_pretrained("t5-base", return_dict=True)
tokenizer = AutoTokenizer.from_pretrained("t5-base")
# T5 uses a max_length of 512 so we cut the article to 512 tokens.
inputs = tokenizer.encode("summarize: " + ARTICLE, return_tensors="pt", max_length=512)
outputs = model.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
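The snippet above stops at generate(), and ARTICLE (the article text to be summarized) is not defined in the original. A minimal completion, assuming ARTICLE holds the article string and outputs comes from the call above:
print(tokenizer.decode(outputs[0], skip_special_tokens=True))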
6. Translation
Translation is the task of translating a text from one language to another. An example dataset is the WMT English to German dataset. If you want to fine-tune a model on a translation task, the documentation describes several approaches.
The example below uses pipelines for translation. It leverages a T5 model that was only pretrained on a multi-task mixture (which includes WMT), yet it produces impressive translation results.
from transformers import pipeline
translator = pipeline("translation_en_to_de")
print(translator("Hugging Face is a technology company based in New York and Paris", max_length=40))
# [{'translation_text': 'Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.'}]
The process for doing translation with a model and a tokenizer is:
- Instantiate a tokenizer and a model from the checkpoint name; translation is usually done with an encoder-decoder model such as BART or T5.
- Define the article that should be translated.
- Add the T5-specific prefix "translate English to German: ".
- Use the PreTrainedModel.generate() method to perform the translation.
from transformers import AutoModelWithLMHead, AutoTokenizer
model = AutoModelWithLMHead.from_pretrained("t5-base", return_dict=True)
tokenizer = AutoTokenizer.from_pretrained("t5-base")
inputs = tokenizer.encode("translate English to German: Hugging Face is a technology company based in New York and Paris", return_tensors="pt")
outputs = model.generate(inputs, max_length=40, num_beams=4, early_stopping=True)
print(tokenizer.decode(outputs[0]))
# Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.