PyTorch: Fine-Tuning a Pretrained BERT Model for Chinese Text Classification
My budget laptop can't handle this, so the code below was run on Google Colab.
Corpus link: https://pan.baidu.com/s/1YxGGYmeByuAlRdAVov_ZLg
Extraction code: tzao
neg.txt and pos.txt each contain 5,000 hotel reviews, one review per line.
Install the transformers library
!pip install transformers
Import packages and set the hyperparameters
import numpy as np
import random
import torch
import matplotlib.pyplot as plt
from torch.nn.utils import clip_grad_norm_
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from transformers import BertTokenizer, BertForSequenceClassification, AdamW
from transformers import get_linear_schedule_with_warmup

SEED = 123
BATCH_SIZE = 16
LEARNING_RATE = 2e-5
WEIGHT_DECAY = 1e-2
EPSILON = 1e-8

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
1. Data Preprocessing
1.1 Reading the files
def readfile(filename):
    with open(filename, encoding="utf-8") as f:
        content = f.readlines()
    return content

pos_text, neg_text = readfile('hotel/pos.txt'), readfile('hotel/neg.txt')
sentences = pos_text + neg_text

# Set the labels: 1 for positive, 0 for negative
pos_targets = np.ones((len(pos_text)))
neg_targets = np.zeros((len(neg_text)))
targets = np.concatenate((pos_targets, neg_targets), axis=0).reshape(-1, 1)  # (10000, 1)
total_targets = torch.tensor(targets)
Tip: calling readfile raised UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbe in position 0.
Fix: open the txt files in Notepad++, click Encoding on the toolbar, and convert them to UTF-8.
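Alternatively, the conversion can be done in Python. A minimal sketch, assuming the files are GBK-encoded (a common legacy encoding for Chinese text, consistent with the 0xbe byte); convert_to_utf8 is a hypothetical helper name:

def convert_to_utf8(filename):
    # Assumption: the source file is GBK-encoded; adjust the codec if it is something else
    with open(filename, encoding="gbk") as f:
        content = f.read()
    with open(filename, "w", encoding="utf-8") as f:
        f.write(content)

convert_to_utf8('hotel/pos.txt')
convert_to_utf8('hotel/neg.txt')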
1.2 Encoding with BertTokenizer to convert each sentence to token IDs
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese', cache_dir="E:/transformer_file/")
print(pos_text[2])
print(tokenizer.tokenize(pos_text[2]))
print(tokenizer.encode(pos_text[2]))
print(tokenizer.convert_ids_to_tokens(tokenizer.encode(pos_text[2])))
不错,下次还考虑入住。交通也方便,在餐厅吃的也不错。
['不', '错', ',', '下', '次', '还', '考', '虑', '入', '住', '。', '交', '通', '也', '方', '便', ',', '在', '餐', '厅', '吃', '的', '也', '不', '错', '。']
[101, 679, 7231, 8024, 678, 3613, 6820, 5440, 5991, 1057, 857, 511, 769, 6858, 738, 3175, 912, 8024, 1762, 7623, 1324, 1391, 4638, 738, 679, 7231, 511, 102]
['[CLS]', '不', '错', ',', '下', '次', '还', '考', '虑', '入', '住', '。', '交', '通', '也', '方', '便', ',', '在', '餐', '厅', '吃', '的', '也', '不', '错', '。', '[SEP]']
A little processing makes every sentence the same length:
# Convert each sentence to IDs: truncate above 126 characters, pad below,
# and add the two special tokens at head and tail for a total length of 128
def convert_text_to_token(tokenizer, sentence, limit_size=126):

    tokens = tokenizer.encode(sentence[:limit_size])  # truncate directly
    if len(tokens) < limit_size + 2:                  # pad (the PAD token's index is 0)
        tokens.extend([0] * (limit_size + 2 - len(tokens)))
    return tokens

input_ids = [convert_text_to_token(tokenizer, sen) for sen in sentences]

input_tokens = torch.tensor(input_ids)
print(input_tokens.shape)  # torch.Size([10000, 128])
1.3 attention_masks: 0 at PAD positions in a text, 1 everywhere else
# Build the masks
def attention_masks(input_ids):
    atten_masks = []
    for seq in input_ids:
        seq_mask = [float(i > 0) for i in seq]
        atten_masks.append(seq_mask)
    return atten_masks

atten_masks = attention_masks(input_ids)
attention_tokens = torch.tensor(atten_masks)
Building input_ids and atten_masks here serves the same purpose as the input_ids and attention_mask returned by the .encode_plus function mentioned in the previous section. token_type_ids is irrelevant to this task; it is used for tasks where each training example consists of two sentences (such as question answering).
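For reference, a minimal sketch of the equivalent encode_plus call (these arguments are standard in recent transformers versions; older versions use pad_to_max_length=True instead of the padding argument):

enc = tokenizer.encode_plus(pos_text[2], max_length=128, padding='max_length', truncation=True)
print(enc['input_ids'][:10])       # should match what convert_text_to_token produces
print(enc['attention_mask'][:10])  # 1 for real tokens, 0 for padding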
1.4 Splitting into training and test sets
The random_state and test_size values of the two split calls must be identical so that train_inputs and train_masks remain in one-to-one correspondence (a safer alternative follows the output below).
from sklearn.model_selection import train_test_split

train_inputs, test_inputs, train_labels, test_labels = train_test_split(input_tokens, total_targets, random_state=666, test_size=0.2)
train_masks, test_masks, _, _ = train_test_split(attention_tokens, input_tokens, random_state=666, test_size=0.2)
print(train_inputs.shape, test_inputs.shape)  # torch.Size([8000, 128]) torch.Size([2000, 128])
print(train_masks.shape)                      # torch.Size([8000, 128]), same shape as train_inputs

print(train_inputs[0])
print(train_masks[0])
tensor([ 101, 2769, 6370, 4638, 3221, 10189, 1039, 4638, 117, 852, 2769, 6230, 2533, 8821, 1039, 4638, 7599, 3419, 3291, 1962, 671, 763, 117, 3300, 671, 2476, 1377, 809, 1288, 1309, 4638, 3763, 1355, 119, 2456, 6379, 1920, 2157, 6370, 3249, 6858, 7313, 106, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
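A safer alternative (a sketch, not in the original code): pass all three tensors to a single train_test_split call, which applies one shared shuffle, so alignment is guaranteed without keeping arguments in sync by hand:

train_inputs, test_inputs, train_masks, test_masks, train_labels, test_labels = train_test_split(
    input_tokens, attention_tokens, total_targets, random_state=666, test_size=0.2)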
1.5 Creating DataLoaders to fetch one batch of data at a time
TensorDataset packs tensors together, much like Python's zip. It indexes along the first dimension of each tensor, so every tensor passed in must have the same first-dimension size, and all arguments to TensorDataset must be tensors.
SequentialSampler samples the dataset in order.
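A tiny illustration of the zip-like indexing, using toy tensors:

ds = TensorDataset(torch.arange(6).view(3, 2), torch.tensor([0, 1, 0]))
print(len(ds))  # 3, the shared first-dimension size
print(ds[0])    # (tensor([0, 1]), tensor(0)): one slice from each packed tensor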
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=BATCH_SIZE)

test_data = TensorDataset(test_inputs, test_masks, test_labels)
test_sampler = SequentialSampler(test_data)
test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=BATCH_SIZE)
Take a look at what train_dataloader yields:
for i, (train, mask, label) in enumerate(train_dataloader):
    print(train.shape, mask.shape, label.shape)  # torch.Size([16, 128]) torch.Size([16, 128]) torch.Size([16, 1])
    break
print('len(train_dataloader)=', len(train_dataloader))  # 500
2. Creating the Model and Optimizer
Create the model
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)  # num_labels=2: two classes, positive and negative reviews
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
Define the optimizer
The eps parameter is a term added to the denominator to improve numerical stability (default: 1e-8).
optimizer = AdamW(model.parameters(), lr=LEARNING_RATE, eps=EPSILON)
A more general pattern: apply no weight decay to bias and LayerNorm.weight parameters
no_decay = ['bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': WEIGHT_DECAY},
    {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
optimizer = AdamW(optimizer_grouped_parameters, lr=LEARNING_RATE, eps=EPSILON)
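As a sanity check on the grouping (a sketch, not in the original code), every named parameter should land in exactly one of the two groups:

decay_names = [n for n, _ in model.named_parameters() if not any(nd in n for nd in no_decay)]
no_decay_names = [n for n, _ in model.named_parameters() if any(nd in n for nd in no_decay)]
# the two groups together should cover all parameters exactly once
assert len(decay_names) + len(no_decay_names) == len(list(model.named_parameters()))
print(len(decay_names), len(no_decay_names))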
Learning-rate scheduling with warmup lets training start from a small learning rate (here num_warmup_steps=0, so the schedule is pure linear decay):
epochs = 2
# number of training steps: [number of batches] x [number of epochs]
total_steps = len(train_dataloader) * epochs

# set up the learning rate scheduler
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)
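Since matplotlib is already imported, the schedule can be visualized with a throwaway optimizer and scheduler so the real ones stay untouched (a sketch; get_last_lr requires a reasonably recent PyTorch):

# dummy parameter so the throwaway optimizer has something to hold
tmp_opt = AdamW([torch.nn.Parameter(torch.zeros(1))], lr=LEARNING_RATE)
tmp_sched = get_linear_schedule_with_warmup(tmp_opt, num_warmup_steps=0, num_training_steps=total_steps)
lrs = []
for _ in range(total_steps):
    lrs.append(tmp_sched.get_last_lr()[0])
    tmp_opt.step()
    tmp_sched.step()
plt.plot(lrs)  # linear decay from LEARNING_RATE to 0 (no warmup steps here)
plt.xlabel('step')
plt.ylabel('learning rate')
plt.show()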
3. Training and Evaluating the Model
3.1 Model accuracy
def binary_acc(preds, labels):  # preds.shape=(16, 2), labels.shape=torch.Size([16, 1])
    # both arguments to eq have shape torch.Size([16])
    correct = torch.eq(torch.max(preds, dim=1)[1], labels.flatten()).float()
    acc = correct.sum().item() / len(correct)
    return acc
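A quick check with made-up logits (not real model output):

preds = torch.tensor([[2.0, -1.0], [0.5, 1.5]])  # per-row argmax -> [0, 1]
labels = torch.tensor([[0], [0]])
print(binary_acc(preds, labels))  # 0.5: the first prediction matches its label, the second does not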
3.2 Measuring running time
import time
import datetime

def format_time(elapsed):
    elapsed_rounded = int(round((elapsed)))
    return str(datetime.timedelta(seconds=elapsed_rounded))  # returns a string in hh:mm:ss form
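For example:

print(format_time(3661.4))  # 1:01:01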
3.3 Training the model
- Arguments passed to model must be tensors.
- nn.utils.clip_grad_norm_(parameters, max_norm, norm_type=2) is used to counter exploding gradients during training. Its arguments are (network parameters, maximum gradient norm, norm type); the L2 norm is the default. See the toy sketch below.

Tip: note that clipping is only applied during training, not during testing.
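A toy demonstration of the rescaling behavior (made-up gradient values):

p = torch.nn.Parameter(torch.tensor([3.0, 4.0]))
p.grad = torch.tensor([3.0, 4.0])   # L2 norm = 5
clip_grad_norm_([p], max_norm=1.0)
print(p.grad)                       # tensor([0.6000, 0.8000]): rescaled so the total norm is 1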
def train(model, optimizer):
    t0 = time.time()
    avg_loss, avg_acc = [], []

    model.train()
    for step, batch in enumerate(train_dataloader):

        # report elapsed time every 40 batches
        if step % 40 == 0 and not step == 0:
            elapsed = format_time(time.time() - t0)
            print(' Batch {:>5,} of {:>5,}. Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))

        b_input_ids, b_input_mask, b_labels = batch[0].long().to(device), batch[1].long().to(device), batch[2].long().to(device)

        output = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
        loss, logits = output[0], output[1]

        avg_loss.append(loss.item())

        acc = binary_acc(logits, b_labels)
        avg_acc.append(acc)

        optimizer.zero_grad()
        loss.backward()
        clip_grad_norm_(model.parameters(), 1.0)  # rescale gradients whose total norm exceeds 1.0, to prevent exploding gradients
        optimizer.step()   # update model parameters
        scheduler.step()   # update the learning rate

    avg_acc = np.array(avg_acc).mean()
    avg_loss = np.array(avg_loss).mean()
    return avg_loss, avg_acc
Here output is a tuple: element 0 is the loss, and element 1 holds, for each example in the batch, the model's scores (logits) for the negative and positive classes:
(tensor(0.0210, device='cuda:0', grad_fn=<NllLossBackward>), tensor([[-2.9815, 2.6931], [-3.2380, 3.1935], [-3.0775, 3.0713], [ 3.0191, -2.3689], [ 3.1146, -2.7957], [ 3.7798, -2.7410], [-0.3273, 0.8227], [ 2.5012, -1.5535], [-3.0231, 3.0162], [ 3.4146, -2.5582], [ 3.3104, -2.2134], [ 3.3776, -2.5190], [-2.6513, 2.5108], [-3.3691, 2.9516], [ 3.2397, -2.0473], [-2.8622, 2.7395]], device='cuda:0', grad_fn=<AddmmBackward>))
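Note that these values are raw logits, not probabilities; softmax converts them. A sketch using the first row above:

import torch.nn.functional as F
logits = torch.tensor([[-2.9815, 2.6931]])
print(F.softmax(logits, dim=1))  # roughly [[0.0034, 0.9966]]: confidently positive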
3.4 Evaluating the model
No labels are passed when calling the model here.
def evaluate(model):
    avg_acc = []
    model.eval()  # switch to evaluation mode

    with torch.no_grad():
        for batch in test_dataloader:
            b_input_ids, b_input_mask, b_labels = batch[0].long().to(device), batch[1].long().to(device), batch[2].long().to(device)

            output = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)

            acc = binary_acc(output[0], b_labels)
            avg_acc.append(acc)

    avg_acc = np.array(avg_acc).mean()
    return avg_acc
Here output is a tuple whose element 0 holds, for each example in the batch, the logits for the negative and positive classes:
(tensor([[ 3.8217, -2.7516], [ 2.7585, -2.0853], [-2.9317, 2.9092], [-3.3724, 3.2597], [-2.8692, 2.6741], [-3.2784, 2.9276], [ 3.4946, -2.8895], [ 3.7855, -2.8623], [-2.2249, 2.4336], [-2.4257, 2.4606], [ 3.3996, -2.5760], [-3.1986, 3.0841], [ 3.6883, -2.9492], [ 3.2883, -2.3600], [ 2.6723, -2.0778], [-3.1868, 3.1106]], device='cuda:0'),)
3.5 Running training and evaluation
for epoch in range(epochs):

    train_loss, train_acc = train(model, optimizer)
    print('epoch={}, train accuracy={}, loss={}'.format(epoch, train_acc, train_loss))
    test_acc = evaluate(model)
    print('epoch={}, test accuracy={}'.format(epoch, test_acc))
The run produces:
Batch 40 of 500. Elapsed: 0:00:14.
Batch 80 of 500. Elapsed: 0:00:28.
Batch 120 of 500. Elapsed: 0:00:42.
Batch 160 of 500. Elapsed: 0:00:57.
Batch 200 of 500. Elapsed: 0:01:12.
Batch 240 of 500. Elapsed: 0:01:26.
Batch 280 of 500. Elapsed: 0:01:41.
Batch 320 of 500. Elapsed: 0:01:56.
Batch 360 of 500. Elapsed: 0:02:11.
Batch 400 of 500. Elapsed: 0:02:26.
Batch 440 of 500. Elapsed: 0:02:42.
Batch 480 of 500. Elapsed: 0:02:57.
epoch=0, train accuracy=0.9015, loss=0.2549531048182398
epoch=0, test accuracy=0.9285
Batch 40 of 500. Elapsed: 0:00:16.
Batch 80 of 500. Elapsed: 0:00:31.
Batch 120 of 500. Elapsed: 0:00:47.
Batch 160 of 500. Elapsed: 0:01:03.
Batch 200 of 500. Elapsed: 0:01:18.
Batch 240 of 500. Elapsed: 0:01:34.
Batch 280 of 500. Elapsed: 0:01:50.
Batch 320 of 500. Elapsed: 0:02:06.
Batch 360 of 500. Elapsed: 0:02:22.
Batch 400 of 500. Elapsed: 0:02:37.
Batch 440 of 500. Elapsed: 0:02:53.
Batch 480 of 500. Elapsed: 0:03:09.
epoch=1, train accuracy=0.9595, loss=0.14357946291333065
epoch=1, test accuracy=0.939
4. Prediction
def predict(sen):

    input_id = convert_text_to_token(tokenizer, sen)
    input_token = torch.tensor(input_id).long().to(device)        # torch.Size([128])

    atten_mask = [float(i > 0) for i in input_id]
    attention_token = torch.tensor(atten_mask).long().to(device)  # torch.Size([128])

    # reshape torch.Size([128]) -> torch.Size([1, 128]); otherwise the model raises an error
    output = model(input_token.view(1, -1), token_type_ids=None, attention_mask=attention_token.view(1, -1))
    print(output[0])

    return torch.max(output[0], dim=1)[1]

label = predict('酒店位置难找,环境不太好,隔音差,下次不会再来的。')  # "hard to find, so-so surroundings, poor soundproofing, won't return"
print('positive' if label == 1 else 'negative')

label = predict('酒店还可以,接待人员很热情,卫生合格,空间也比较大,不足的地方就是没有窗户')  # "decent hotel, warm staff, clean, spacious; downside: no window"
print('positive' if label == 1 else 'negative')

label = predict('"服务各方面没有不周到的地方, 各方面没有没想到的细节"')  # double negatives: "no aspect of the service was less than thorough"
print('positive' if label == 1 else 'negative')
tensor([[ 3.5719, -2.7315]], device='cuda:0', grad_fn=<AddmmBackward>)
negative
tensor([[-2.7998, 2.8675]], device='cuda:0', grad_fn=<AddmmBackward>)
positive
tensor([[-1.9614, 1.5925]], device='cuda:0', grad_fn=<AddmmBackward>)
positive
Performance is decent; even an awkwardly worded sentence like the third one, with its double negatives, is classified correctly.