Basic Usage of Hugging Face 🤗 Transformers

This post walks through the basic usage of Hugging Face 🤗 Transformers.
Original source: https://huggingface.co/transformers/quicktour.html


Using the transformers library requires two components: a tokenizer and a model.
Both can be downloaded and instantiated with .from_pretrained(name).

1. Instantiating the tokenizer and model:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
pt_model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
Or:
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = DistilBertForSequenceClassification.from_pretrained(model_name)
tokenizer = DistilBertTokenizer.from_pretrained(model_name)

2. Tokenizer
The tokenizer does two things:
1. Splits the input text into tokens.
2. Converts each token into a unique integer ID.

pt_batch = tokenizer(
    ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."],
    padding=True,
    truncation=True,
    max_length=5,
    return_tensors="pt"
)
print(pt_batch)

When a list of sentences is passed as a batch: padding=True pads all sentences to the same length, and truncation=True truncates any sentence longer than max_length down to max_length.
return_tensors="pt" returns PyTorch tensors; return_tensors="tf" returns TensorFlow tensors.

Output:
{'input_ids': tensor([[ 101, 2057, 2024, 2200,  102], [ 101, 2057, 3246, 2017,  102]]),
'attention_mask': tensor([[1, 1, 1, 1, 1], [1, 1, 1, 1, 1]])}
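
To see what truncation to max_length=5 actually kept, you can decode the truncated IDs back to text (a quick check, not part of the original tutorial):

# decode each truncated row; only the first few tokens plus the special
# [CLS]/[SEP] tokens survive max_length=5
for ids in pt_batch["input_ids"]:
    print(tokenizer.decode(ids.tolist()))
# e.g. "[CLS] we are very [SEP]" for the first sentence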

3. Model
Feed the tokenized batch into the model:
For a PyTorch model, you need to unpack the dictionary by adding **
pt_outputs = pt_model(**pt_batch)
print(pt_outputs)
Output:
SequenceClassifierOutput(loss=None, logits=tensor([[-4.0833,  4.3364],
        [ 0.0818, -0.0418]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)

In 🤗 Transformers, all outputs are tuples. Here, we get a tuple with just the final activations of the model.
PyTorch model outputs are special dataclasses, so you can get autocompletion for their attributes in an IDE.
They also behave like a tuple or a dictionary (e.g., you can index with an integer, a slice, or a string), in which case the attributes that are not set (those with None values) are ignored.
So you can access the logits directly with pt_outputs[0].
Output:
tensor([[-4.0833,  4.3364],
        [ 0.0818, -0.0418]], grad_fn=<AddmmBackward>)
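
The attribute, string-key, and integer-index access styles all return the same tensor; a quick check (a sketch, assuming the pt_outputs from the call above):

import torch

# attribute, dictionary key, and integer index are interchangeable
logits_by_attr = pt_outputs.logits
logits_by_key = pt_outputs["logits"]
logits_by_index = pt_outputs[0]
print(torch.equal(logits_by_attr, logits_by_key) and torch.equal(logits_by_attr, logits_by_index))  # True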

All 🤗 Transformers models (PyTorch or TensorFlow) return the activations of the model before the final activation function (like SoftMax)
since this final activation function is often fused with the loss.
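
Because the activation is fused with the loss, you can pass ground-truth labels together with the batch and get that loss back directly. A minimal sketch (the label values are only an illustration; for this SST-2 checkpoint, 1 means positive):

import torch

labels = torch.tensor([1, 0])  # assumed targets: first sentence positive, second negative
pt_outputs_with_loss = pt_model(**pt_batch, labels=labels)
# the loss field is now populated (cross-entropy over the raw logits)
print(pt_outputs_with_loss.loss)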

You can also ask the model to return the intermediate hidden states and attentions:
pt_outputs = pt_model(**pt_batch, output_hidden_states=True, output_attentions=True)
all_hidden_states, all_attentions = pt_outputs[-2:]
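
Both extras come back as tuples of tensors, one entry per layer (the hidden states also include the embedding output). A quick shape check, assuming the DistilBERT model above:

# hidden states: shape (batch_size, sequence_length, hidden_size) per entry
print(len(all_hidden_states), all_hidden_states[0].shape)
# attentions: shape (batch_size, num_heads, sequence_length, sequence_length) per entry
print(len(all_attentions), all_attentions[0].shape)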

4. Activation function
Finally, apply the activation function:
import torch.nn.functional as F
pt_predictions = F.softmax(pt_outputs[0], dim=-1)
print(pt_predictions)
Output:
tensor([[2.2043e-04, 9.9978e-01],
        [5.3086e-01, 4.6914e-01]], grad_fn=<SoftmaxBackward>)
As you can see, the first sentence, "We are very happy to show you the 🤗 Transformers library.", clearly leans toward the second label (i.e. positive),
while the second sentence, "We hope you don't hate it.", is nearly a toss-up.
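
To turn these probabilities into readable labels, take the argmax and look it up in the model config. A small sketch using the id2label mapping that pretrained classification checkpoints carry:

import torch

# pick the most probable class per sentence and map it to its label name
predicted_ids = torch.argmax(pt_predictions, dim=-1)
for i in predicted_ids.tolist():
    print(pt_model.config.id2label[i])
# expected: POSITIVE for the first sentence, NEGATIVE for the second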


5. Customizing the model
from transformers import DistilBertConfig, DistilBertTokenizer, DistilBertForSequenceClassification
model_name = "distilbert-base-uncased"
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=10)
tokenizer = DistilBertTokenizer.from_pretrained(model_name)

Or:
from transformers import DistilBertConfig, DistilBertTokenizer, DistilBertForSequenceClassification
config = DistilBertConfig(n_heads=8, dim=512, hidden_dim=4*512)
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification(config)
After such changes you need to either retrain from scratch (if the architecture changed substantially) or fine-tune (if only the top head changed).
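
A customized or fine-tuned model can be saved locally and reloaded later with save_pretrained()/from_pretrained(). A minimal sketch (the directory name is just an example):

save_directory = "./my_distilbert"  # hypothetical local path

# save both pieces so they can be reloaded together
tokenizer.save_pretrained(save_directory)
model.save_pretrained(save_directory)

# reload from disk instead of downloading from the hub
tokenizer = DistilBertTokenizer.from_pretrained(save_directory)
model = DistilBertForSequenceClassification.from_pretrained(save_directory)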


Main concepts

The library is built around three types of classes for each model:

  • Model classes such as BertModel, which are 30+ PyTorch models (torch.nn.Module) or Keras models (tf.keras.Model) that work with the pretrained weights provided in the library.

  • Configuration classes such as BertConfig, which store all the parameters required to build a model. You don’t always need to instantiate these yourself. In particular, if you are using a pretrained model without any modification, creating the model will automatically take care of instantiating the configuration (which is part of the model).

  • Tokenizer classes such as BertTokenizer, which store the vocabulary for each model and provide methods for encoding/decoding strings in a list of token embeddings indices to be fed to a model.

All these classes can be instantiated from pretrained instances and saved locally using two methods:

  • from_pretrained() lets you instantiate a model/configuration/tokenizer from a pretrained version either provided by the library itself (the supported models are provided in the list here) or stored locally (or on a server) by the user,

  • save_pretrained() lets you save a model/configuration/tokenizer locally so that it can be reloaded using from_pretrained()

On top of those three base classes, the library provides two APIs:

  • pipeline() for quickly using a model (plus its associated tokenizer and configuration) on a given task (see the sketch below),

  • Trainer() to quickly train or fine-tune a given model.
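
The pipeline() API wraps all of the steps above (tokenize, run the model, apply the final activation) into a single call. A minimal sketch for the sentiment-analysis task used throughout this post:

from transformers import pipeline

# downloads a default sentiment-analysis model and tokenizer on first use
classifier = pipeline("sentiment-analysis")
print(classifier("We are very happy to show you the 🤗 Transformers library."))
# e.g. [{'label': 'POSITIVE', 'score': 0.9998}]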



Some terminology:

Input IDs:

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
sequence = "A Titan RTX has 24GB of VRAM"
print(tokenizer.tokenize(sequence))
t1 = tokenizer(sequence)
print(t1)
t2 = tokenizer.decode(t1["input_ids"])
print(t2)

['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M']
{'input_ids': [101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
[CLS] A Titan RTX has 24GB of VRAM [SEP]



Token Type IDs:

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
sequence_a = "HuggingFace is based in NYC"
sequence_b = "Where is HuggingFace based?"
encoded_dict = tokenizer(sequence_a, sequence_b)
decoded = tokenizer.decode(encoded_dict["input_ids"])
print(encoded_dict)



{'input_ids': [101, 20164, 10932, 2271, 7954, 1110, 1359, 1107, 17520, 102, 2777, 1110, 20164, 10932, 2271, 7954, 1359, 136, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
The 0/1 values in token_type_ids are what distinguish the two sentences.


Position IDs:

the position IDs (position_ids) are used by the model to identify each token’s position in the list of tokens.

They are an optional parameter. If no position_ids are passed to the model, the IDs are automatically created as absolute positional embeddings.

Absolute positional embeddings are selected in the range [0, config.max_position_embeddings - 1].
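
Passing position IDs explicitly is rarely necessary, but it can be done. A sketch using the bert-base-cased checkpoint from the examples above:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertModel.from_pretrained("bert-base-cased")

inputs = tokenizer("A Titan RTX has 24GB of VRAM", return_tensors="pt")
seq_len = inputs["input_ids"].shape[1]

# explicit absolute positions 0..seq_len-1, the same default the model would create
position_ids = torch.arange(seq_len).unsqueeze(0)
outputs = model(**inputs, position_ids=position_ids)
print(outputs.last_hidden_state.shape)  # (1, seq_len, hidden_size)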

 

Labels:

These are the ground-truth targets, used to compute the loss.

 

Decoder input IDs:

The input IDs of labels that will be fed to the decoder.

Most encoder-decoder models (BART, T5) create their decoder_input_ids on their own from the labels. In such models, passing the labels is the preferred way to handle training.
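
For example, with T5 you normally pass only the labels and the model builds decoder_input_ids from them internally. A sketch, assuming the t5-small checkpoint:

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

input_ids = tokenizer("translate English to German: My dog is cute", return_tensors="pt").input_ids
labels = tokenizer("Mein Hund ist süß", return_tensors="pt").input_ids

# no decoder_input_ids passed: the model shifts the labels right internally
outputs = model(input_ids=input_ids, labels=labels)
print(outputs.loss)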

 

 

Some useful points from the "Using 🤗 Transformers" section of the docs:

1. Preprocessing data

For a pair of sentences:
>>> encoded_input = tokenizer("How old are you?", "I'm 6 years old")
>>> print(encoded_input)
{'input_ids': [101, 1731, 1385, 1132, 1128, 136, 102, 146, 112, 182, 127, 1201, 1385, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


>>> batch_sentences = ["Hello I'm a single sentence",
...                    "And another sentence",
...                    "And the very very last one"]
>>> batch_of_second_sentences = ["I'm a sentence that goes with the first sentence",
...                              "And I should be encoded with the second sentence",
...                              "And I go with the very last one"]
>>> encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences)
>>> print(encoded_inputs)
{'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102, 146, 112, 182, 170, 5650, 1115, 2947, 1114, 1103, 1148, 5650, 102],
               [101, 1262, 1330, 5650, 102, 1262, 146, 1431, 1129, 12544, 1114, 1103, 1248, 5650, 102],
               [101, 1262, 1103, 1304, 1304, 1314, 1141, 102, 1262, 146, 1301, 1114, 1103, 1304, 1314, 1141, 102]],
'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                   [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                   [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                   [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                   [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}


For pre-tokenized inputs:

If you want to use pre-tokenized inputs, just set is_split_into_words=True when passing your inputs to the tokenizer. For instance, we have:

>>> encoded_input = tokenizer(["Hello", "I'm", "a", "single", "sentence"], is_split_into_words=True)
>>> print(encoded_input)
{'input_ids': [101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}


2. Fine-tuning
The library also includes a number of task-specific final layers or ‘heads’
whose weights are instantiated randomly when not present in the specified pre-trained model.
For example, instantiating a model with
BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
will create a BERT model instance with encoder weights
copied from the bert-base-uncased model and a randomly initialized sequence classification head
on top of the encoder with an output size of 2.
Models are initialized in eval mode by default. We can call model.train() to put it in train mode.

import torch
from transformers import BertForSequenceClassification
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
# set to train mode (default is eval mode)
model.train()

from transformers import AdamW
optimizer = AdamW(model.parameters(), lr=1e-5)
no_decay = ['bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
{'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
{'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
optimizer = AdamW(optimizer_grouped_parameters, lr=1e-5)

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text_batch = ["I love Pixar.", "I don't care for Pixar."]
encoding = tokenizer(text_batch, return_tensors='pt', padding=True, truncation=True)
input_ids = encoding['input_ids']
attention_mask = encoding['attention_mask']
labels = torch.tensor([1,0])
outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
loss = outputs.loss

# these two lines perform one training step
loss.backward()
optimizer.step()
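
A real training run also needs to clear gradients and iterate over batches and epochs. A minimal loop sketch (the DataLoader construction and epoch count are illustrative, not from the original post):

from torch.utils.data import DataLoader, TensorDataset

# wrap the encoded toy batch in a DataLoader (batch size chosen arbitrarily)
dataset = TensorDataset(input_ids, attention_mask, labels)
loader = DataLoader(dataset, batch_size=2, shuffle=True)

model.train()
for epoch in range(3):  # assumed number of epochs
    for batch_input_ids, batch_attention_mask, batch_labels in loader:
        optimizer.zero_grad()  # clear gradients from the previous step
        outputs = model(batch_input_ids, attention_mask=batch_attention_mask, labels=batch_labels)
        outputs.loss.backward()
        optimizer.step()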

You can also use the Trainer that ships with the Transformers library:
from transformers import BertForSequenceClassification, Trainer, TrainingArguments


A complete fine-tuning example:
from pathlib import Path
import os

# get data & labels from the IMDb folder layout (pos/ and neg/ subfolders)
def read_imdb_split(split_dir):
    texts = []
    labels = []
    for folder in ["pos", "neg"]:
        path = os.path.join(split_dir, folder)
        for file in os.listdir(path):
            with open(os.path.join(path, file), 'r', encoding='utf-8') as f:
                texts.append(f.read())
            labels.append(0 if folder == "pos" else 1)

    return texts, labels

train_texts, train_labels = read_imdb_split('aclImdb/train')
test_texts, test_labels = read_imdb_split('aclImdb/test')

# create validation set
from sklearn.model_selection import train_test_split
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=0.2)

# tokenize
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)

# create a torch dataset
import torch
from torch.utils.data import Dataset

class IMDbDataset(Dataset):
    # A Dataset subclass must define two methods:
    # __len__: returns the size of the dataset
    # __getitem__: returns one sample (and its label) for a given index
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # x.__getitem__(idx) <==> x[idx]
        # the encodings contain several fields (input_ids, attention_mask, ...),
        # so iterating over key/value pairs covers all of them
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["label"] = torch.tensor(self.labels[idx])
        return item

train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset = IMDbDataset(val_encodings, val_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)

# fine-tune with Trainer
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=1,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=2,    # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

trainer = Trainer(
    model=model,                  # the instantiated 🤗 Transformers model to be trained
    args=training_args,           # training arguments, defined above
    train_dataset=train_dataset,  # training dataset
    eval_dataset=val_dataset      # evaluation dataset
)
# calling train() is all it takes to fine-tune the model
trainer.train()

model.save_pretrained("./save_pretrained")


# eval
# test_pred = model(test_encodings)
# print(test_pred[0:100])
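
The commented-out call above would not run as written (the tokenizer output is a dict of Python lists, while the model expects tensors). A sketch of evaluating on the test set through the Trainer instead (the accuracy computation is illustrative, not part of the original post):

import numpy as np

# run the fine-tuned model over the held-out test set
predictions = trainer.predict(test_dataset)
predicted_labels = np.argmax(predictions.predictions, axis=-1)
accuracy = (predicted_labels == np.array(test_labels)).mean()
print(f"test accuracy: {accuracy:.4f}")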





