bert 文本分类问题 - 君莫笑L

本篇文章，使用pytorch框架微调bert

bert官方文档：https://huggingface.co/transformers/model_doc/bert.html

bert文件：https://github.com/huggingface/transformers

这里有一篇文章可以很好的了解bert:https://leemeng.tw/attack_on_bert_transfer_learning_in_nlp.html

bert预训练模型下载地址（可以用迅雷下载）：

PRETRAINED_VOCAB_ARCHIVE_MAP = {
'bert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt",
'bert-large-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt",
'bert-base-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt",
'bert-large-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt",
'bert-base-multilingual-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased-vocab.txt",
'bert-base-multilingual-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-vocab.txt",
'bert-base-chinese': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-vocab.txt",
}

PRETRAINED_MODEL_ARCHIVE_MAP = {
    'bert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz",
    'bert-large-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased.tar.gz",
    'bert-base-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased.tar.gz",
    'bert-large-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased.tar.gz",
    'bert-base-multilingual-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased.tar.gz",
    'bert-base-multilingual-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased.tar.gz",
    'bert-base-chinese': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese.tar.gz",
}

中文分类与英文分类主要区别在于所调用的预训练模型是英文的还是中文的（个人感觉）。比如，“bert-base-uncased”是英文的预训练模型，“bert-base-chinese”是中文模型。

通过下面源码中的例子，我们可以清楚的知道预训练分词的调用方式

Examples::

            # We can't instantiate directly the base class `PreTrainedTokenizer` so let's show our examples on a derived class: BertTokenizer

            # Download vocabulary from S3 and cache.
            tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

            # Download vocabulary from S3 (user-uploaded) and cache.
            tokenizer = BertTokenizer.from_pretrained('dbmdz/bert-base-german-cased')

            # If vocabulary files are in a directory (e.g. tokenizer was saved using `save_pretrained('./test/saved_model/')`)
            tokenizer = BertTokenizer.from_pretrained('./test/saved_model/')

            # If the tokenizer uses a single vocabulary file, you can point directly to this file
            tokenizer = BertTokenizer.from_pretrained('./test/saved_model/my_vocab.txt')

            # You can link tokens to special vocabulary when instantiating
            tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', unk_token='<unk>')
            # You should be sure '<unk>' is in the vocabulary when doing that.
            # Otherwise use tokenizer.add_special_tokens({'unk_token': '<unk>'}) instead)
            assert tokenizer.unk_token == '<unk>'

        """

通过下面的BertForSequenceClassification的源码，我们可以发现，多分类与二分类模型上的区别在于 self.num_labels = config.num_labels 这句话，config是哪来的呢？

config参数来自于：

modelConfig = BertConfig.from_pretrained('bert-base-uncased/bert_config.json')

model = BertForSequenceClassification.from_pretrained('./bert-base-uncased/pytorch_model.bin', config=modelConfig)

通过上面两句话我们可以知道，通过修改bert_config.json文件中的“num_labels”的值就可以了。

@add_start_docstrings("""Bert Model transformer with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for GLUE tasks. """,
    BERT_START_DOCSTRING, BERT_INPUTS_DOCSTRING)
class BertForSequenceClassification(BertPreTrainedModel):
    r"""
        **labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``:
            Labels for computing the sequence classification/regression loss.
            Indices should be in ``[0, ..., config.num_labels - 1]``.
            If ``config.num_labels == 1`` a regression loss is computed (Mean-Square loss),
            If ``config.num_labels > 1`` a classification loss is computed (Cross-Entropy).
    Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
        **loss**: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:
            Classification (or regression if config.num_labels==1) loss.
        **logits**: ``torch.FloatTensor`` of shape ``(batch_size, config.num_labels)``
            Classification (or regression if config.num_labels==1) scores (before SoftMax).
        **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
            list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
            of shape ``(batch_size, sequence_length, hidden_size)``:
            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
        **attentions**: (`optional`, returned when ``config.output_attentions=True``)
            list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
    Examples:
        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
        model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
        outputs = model(input_ids, labels=labels)
        loss, logits = outputs[:2]
    """
    def __init__(self, config):
        super(BertForSequenceClassification, self).__init__(config)
        self.num_labels = config.num_labels
 
        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, self.config.num_labels)
 
        self.init_weights()
 
    def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None,
                position_ids=None, head_mask=None):
        outputs = self.bert(input_ids, position_ids=position_ids, token_type_ids=token_type_ids,
                            attention_mask=attention_mask, head_mask=head_mask)
        pooled_output = outputs[1]
 
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)
 
        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here
 
        if labels is not None:
            if self.num_labels == 1:
                #  We are doing regression
                loss_fct = MSELoss()
                loss = loss_fct(logits.view(-1), labels.view(-1))
            else:
                loss_fct = CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            outputs = (loss,) + outputs
 
        return outputs  # (loss), logits, (hidden_states), (attentions)

还有另一种方法可以将多分类变成多分类问题，就是将bert的最后一层进行重写

https://blog.csdn.net/weixin_41519463/article/details/100863313

另一位大佬实现多文本分类，可以参考https://github.com/649453932/Bert-Chinese-Text-Classification-Pytorch

# 使用 transformers 提供的序列分类模型 BertForSequenceClassification
from transformers import BertForSequenceClassification

PRETRAINED_MODEL_NAME = "bert-base-chinese"
NUM_LABELS = 3

model = BertForSequenceClassification.from_pretrained(
    PRETRAINED_MODEL_NAME, num_labels=NUM_LABELS)

# 在 BertModel 基础上，自己添加 Dropout 层和 线性分类层，实际效果等价于 transformers 提供的 BertForSequenceClassification
class Model(nn.Module):

    def __init__(self, config):
        super(Model, self).__init__()
        self.bert = BertModel.from_pretrained(config.bert_path)
        for param in self.bert.parameters():
            param.requires_grad = True
        self.dropout = nn.Dropout(0.3)
        self.fc = nn.Linear(config.hidden_size, config.num_classes)

    def forward(self, x):
        context = x[0]  # 输入的句子
        mask = x[2]  # 对padding部分进行mask
        seg_ids = x[3]  # 句子间的分隔情况
        _, pooled = self.bert(input_ids=context, token_type_ids=seg_ids,  attention_mask=mask, output_all_encoded_layers=False)
        out = self.dropout(pooled)
        out = self.fc(out)
        return out
 
 model = Model(config)

通过BertForSequenceClassification中 from_pretrained函数中的example我们可以清楚的知道了预训练模型的调用方法

Examples::

            # For example purposes. Not runnable.
            model = BertModel.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.
            model = BertModel.from_pretrained('./test/saved_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`
            model = BertModel.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading
            assert model.config.output_attention == True
            # Loading from a TF checkpoint file instead of a PyTorch model (slower)
            config = BertConfig.from_json_file('./tf_model/my_tf_model_config.json')
            model = BertModel.from_pretrained('./tf_model/my_tf_checkpoint.ckpt.index', from_tf=True, config=config)

从下面开始，我通过AI研习社金融评论分类为案例，对bert简单应用。

1、导入头文件

import pandas as pd
import numpy as np
from transformers import BertTokenizer,AdamW, BertConfig,  BertForSequenceClassification
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

from tqdm import trange

　from torchviz import make_dot #这是一个可视化工具，感兴趣的可以百度一下

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

2、数据处理将词转化成词向量：

train_data = pd.read_csv("./input/train.csv")


#提取句子并处理
sentences = ['[CLS]' + sent + '[SEP]' for sent in train_data.text.values]
label = train_data.label.values
# print(len(train_data))
# print(train_data.text.values)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
tokenized_sents = [tokenizer.tokenize(sent) for sent in sentences]
# print(tokenized_sents[0])


#将分割后的句子转化成数字  word-->idx
MAX_LEN = 512
input_ids = [tokenizer.convert_tokens_to_ids(sent) for sent in tokenized_sents]

3、做padding和 truncating 形成相同大小的长度：

'''
#自己手动做pad,truncating表示大于最大长度截断
def pad_sequences(inputs, max_l):
    def pad(x):
        return x[:max_l] if len(x) > max_l else x+[0]*(max_l-len(x))
    feature = np.array([pad(x) for x in input_ids], dtype=np.long)
return feature

 '''
#做PADDING,这里使用keras的包做pad，truncating
input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype='long', truncating='post', padding='post')
# input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, truncating='post', padding='post')
# print(input_ids[0])

4、建立attention_mask, bert模型中必须要传入mask：

attention_mask=[]
for seq in input_ids:
    seq_mak = [float(i>0) for i in seq]
    attention_mask.append(seq_mak)

5、划分训练集和验证集：

train_inputs, validation_inputs, train_labels, validation_labels = train_test_split(input_ids,label,
                                                                    random_state=2020,test_size=0.1)

train_masks, valid_masks, _, _ = train_test_split(attention_mask, input_ids,random_state=2020,test_size=0.1)

6、生成dataloader数据迭代器：

#转成tensor
train_inputs = torch.LongTensor(train_inputs)
valid_inputs =torch.LongTensor(validation_inputs)
train_labels = torch.tensor(train_labels)
valid_labels = torch.tensor(validation_labels)
train_masks = torch.tensor(train_masks)
valid_masks = torch.tensor(valid_masks)

#生成dataloader
batch_size=16
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_dataloader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
valid_data = TensorDataset(valid_inputs, valid_masks, valid_labels)
valid_dataloader =DataLoader(valid_data, batch_size=batch_size)

7、加载模型：

#加载模型
modelConfig = BertConfig.from_pretrained('bert-base-uncased/bert_config.json')
model = BertForSequenceClassification.from_pretrained('./bert-base-uncased/pytorch_model.bin', config=modelConfig)

# model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
# fc_features = model.classifier.in_features
# model.classifier=nn.Linear(fc_features, 11)

model.to(device)

8、加添优化器：

定义优化器，注意BertAdam、AdamW是不同版本的adam优化方法，版本更新太快，知道使用就行，定义需要weight decay的参数
‘gamma’, ‘beta’ 是值LayerNormal层的，不要decay，直接训练即可。其他参数除去bias，均使用weight decay的方法进行训练
weight decay可以简单理解在Adam上的一个优化的基础上成使用L2正则（AdamW）。

param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'gamma', 'beta']   #这三个变量不做梯度微调
optimizer_grouped_parameters = [
    {'params':[p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
    'weight_decay_rate':0.01},
    {'params':[p for n, p in param_optimizer if  any(nd in n for nd in no_decay)],
    'weight_decay_rate':0.0},
]

optimizer = AdamW(optimizer_grouped_parameters, lr=2e-5)

9、微调训练：

接下来就是训练部分了,注意在训练是传入label，模型可以直接得到loss,如果不传入label，便只有一个logits

#计算准确率
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    label_flat = label.flatten()
    return np.sum(pred_flat == label_flat)/len(label_flat)

train_loss_set = []
epochs=10
for _ in trange(epochs, desc='Epoch'):
    model.train()
    tr_loss= 0.0 
    nb_tr_example, nb_tr_steps = 0, 0
    for step, batch in enumerate(train_dataloader):
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch
        optimizer.zero_grad()
         #取第一个位置，BertForSequenceClassification第一个位置是Loss，第二个位置是[CLS]的logits
        loss = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)[0]
        train_loss_set.append(loss.item())
        loss.backward()
        optimizer.step()


        tr_loss+=loss.item()
        nb_tr_example+=b_input_ids.size(0)
        nb_tr_steps+=1

    print("Train loss: {}".format(tr_loss / nb_tr_steps))

（如果想要可视化参数可以执行）

   '''
    y = model(x)
    vis_graph = make_dot(y, params=dict(list(model.named_parameters()) + [('x', x)]))
    vise_graph.view()
    '''

10、评估：

#模型评估
    model.eval()
    eval_loss, eval_accuracy = 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0
    for batch in valid_dataloader:
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch
        with torch.no_grad():
            logits = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)[0]
        
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()
        tmp_eval_accuracy = flat_accuracy(logits, label_ids)
        eval_accuracy += tmp_eval_accuracy
        nb_eval_steps += 1
    print("Validation Accuracy: {}".format(eval_accuracy / nb_eval_steps))

11、预测

test_df = pd.read_csv("./input/test.csv")

test_sencents = [[['CLS']+ sent + ['SEP']] for sent in test_df.text.values] 
test_tokenizers = [tokenizer.tokenize(sent) for sent in test_sencents]

test_input_ids = [tokenizer.convert_tokens_to_ids(token) for token in test_tokenizers]

input_ids = pad_sequences(test_input_ids, maxlen=MAX_LEN, dtype='long', truncating='post', padding='post')
#建立mask
test_mask=[]
for seq in input_ids:
    seq_mak = [float(i>0) for i in seq]
    test_mask.append(seq_mak)

test_label=[]

for in_ids, mask, in input_ids, test_mask:
    in_ids=torch.LongTensor(in_ids).to(device)
    mask = torch.tensor(mask).to(device)
    logits_str = model(in_ids, token_type_ids=None, attention_mask=mask)[0]
    test_label.append(np.argmax(logits_str.detach().cpu().numpy(), axis=1))
bert_Test= pd.DataFrame(test_df.index,columns=['id'])
bert_Test['label']=test_label
bert_Test.to_csv('bert_test.csv', index=None, header=None)

本案例借鉴：

https://blog.csdn.net/SZU_Hadooper/article/details/101697580

https://blog.csdn.net/weixin_41519463/article/details/100863313

https://github.com/huggingface/transformers

https://zhuanlan.zhihu.com/p/56103665

https://github.com/real-brilliant/bert_chinese_pytorch

https://blog.csdn.net/Real_Brilliant/article/details/84880528

https://github.com/649453932/Bert-Chinese-Text-Classification-Pytorch

https://blog.csdn.net/Ascn_/article/details/106194326

发表于 2020-06-05 22:36 君莫笑L 阅读(4282) 评论(0) 编辑收藏举报