Fine-Tuning a Small Model with the HF Trainer
This post documents common usage patterns of HuggingFace's Trainer classes.
A Minimal SFTTrainer Example
HuggingFace's Trainer classes dramatically reduce the amount of work needed for pre-training and fine-tuning. How simple can it get? Take the task individual users run into most often, supervised fine-tuning of a language model: all we need to do is define an SFTTrainer, hand it the model we want to train and a dataset, and the fine-tuning job can run right away.
'''
The simplest way to supervised-finetune a small LM by SFTTrainer
Environment:
    transformers==4.43.3
    datasets==2.20.0
    trl==0.9.6
'''
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import Dataset
from trl import SFTTrainer

model_path = 'Qwen/Qwen2-0.5B'
model = AutoModelForCausalLM.from_pretrained(model_path)

corpus = [
    {'prompt': 'calculate 24 x 99', 'completion': '24 x 99 = 2376'},
    {'prompt': 'Which number is greater, 70 or 68?', 'completion': '70 is greater than 68.'},
    {'prompt': 'How many vertices in a tetrahedron?', 'completion': 'A tetrahedron has 4 vertices.'},
]
dataset = Dataset.from_list(corpus)

trainer = SFTTrainer(model, train_dataset=dataset)  # pass the model and dataset we want to train on
trainer.train()                                     # and the fine-tuning run starts directly
Only two arguments are strictly required to use the Trainer:
- model
- train_dataset
That's right: everything else is icing on the cake; only these two are indispensable. Of course, the reason fine-tuning can be this effortless is that SFTTrainer does a lot of work for us behind the scenes. To see exactly what, compare the code above with the full code in the earlier post 【一步一步微调小模型】.
Note that our code can be this simple largely because SFTTrainer also preprocesses the dataset for us. It supports two dataset formats. The first is the one shown in the code above: key-value pairs made up of prompt and completion:
{"prompt": "How are you", "completion": "I am fine, thank you."} {"prompt": "What is the capital of France?", "completion": "It's Paris."} {"prompt": "有志者事竟成", "completion": "Where there's a will, there's a way"}
This format covers single-turn, question-and-answer style tasks, including Q&A, translation, and summarization. The other format is a more flexible multi-turn conversational format, where every training sample starts with a 'messages' key followed by a list of user and assistant turns:
{"messages": [{"role": "system", "content": "You are helpful"}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "It's Paris."} {"role": "user", "content": "and how about Japan?"}, {"role": "assistant", "content": "It's Tokyo."}]} {"messages": [{"role": "system", "content": "You are helpful"}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "..."}]}
These two data formats are enough to cover almost any fine-tuning need. For example, suppose I want to train/fine-tune Qwen2-0.5B on a math corpus so that it can answer math-related questions, using Microsoft's orca-math-word-problems-200k dataset, which consists of question and answer fields:
>>> from datasets import load_dataset
>>> dataset = load_dataset('microsoft/orca-math-word-problems-200k', split='train')
>>> print(dataset)
Dataset({
    features: ['question', 'answer'],
    num_rows: 200035
})
Concretely, each record is just a question string paired with its answer string.
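To peek at an actual record, you can print the first sample directly (output omitted here):

# Print the first record; it is a plain dict with 'question' and 'answer' strings.
print(dataset[0]['question'])
print(dataset[0]['answer'])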
The dataset's keys are not prompt and completion, but all we need to do is rename them, as in the code below:
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import SFTTrainer

model_path = 'Qwen/Qwen2-0.5B'
data_path = 'microsoft/orca-math-word-problems-200k'

model = AutoModelForCausalLM.from_pretrained(model_path)
dataset = load_dataset(data_path, split='train')
dataset = dataset.rename_column('question', 'prompt')    # rename dataset features to "prompt" and "completion"
dataset = dataset.rename_column('answer', 'completion')  # to fit in the SFTTrainer

trainer = SFTTrainer(model, train_dataset=dataset)
trainer.train()
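Before committing to all 200k rows, it can be worth running a quick smoke test on a small slice; a sketch under the same setup as above (the slice size of 1000 is arbitrary):

# Optional: fine-tune on a small slice first to confirm everything runs end to end.
small_ds = dataset.select(range(1000))
smoke_trainer = SFTTrainer(model, train_dataset=small_ds)
smoke_trainer.train()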
Common SFTTrainer Usage Patterns
What if the data for our fine-tuning task is so unusual that it cannot be expressed in either of these two formats (unlikely as I think that is)? In that case we fall back to converting each training sample into the one format a language model is guaranteed to accept: a string. Concretely, we write a function that takes training samples as input and returns the converted strings, and we pass this function to the trainer when we define it. In the code below, to_prompts_fn converts every training sample of the orca-math-word-problems-200k dataset into a string:
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import SFTTrainer

model_path = 'Qwen/Qwen2-0.5B'
data_path = 'microsoft/orca-math-word-problems-200k'

model = AutoModelForCausalLM.from_pretrained(model_path)
dataset = load_dataset(data_path, split='train')

def to_prompts_fn(batch) -> list[str]:
    '''take a batch of training samples, return a list of strings'''
    output_texts = []
    for i in range(len(batch['question'])):
        text = f"### Question: {batch['question'][i]}\n ### Answer: {batch['answer'][i]}"
        output_texts.append(text)
    return output_texts

trainer = SFTTrainer(model, train_dataset=dataset, formatting_func=to_prompts_fn)
trainer.train()
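To sanity-check the formatting before training, the function can be run on a small batch pulled straight from the dataset (output not shown):

# dataset[:2] is a dict of lists: {'question': [...], 'answer': [...]}
batch = dataset[:2]
print(to_prompts_fn(batch)[0])  # the first formatted training string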
SFTTrainer saves us a lot of effort, but sometimes we want tighter control over the training/fine-tuning process: adjusting the learning rate, changing the batch size, printing training metrics every few steps, evaluating the model on held-out data, saving checkpoints, and so on. For more control we simply pass more arguments to the trainer, and all of them can be collected in an SFTConfig that is then handed to the trainer. The example below shows some commonly used options, including how to adjust the batch size and how to empty the GPU cache frequently to avoid CUDA out-of-memory errors, and it also provides an evaluation dataset for monitoring the model on held-out data.
'''
Common usage of SFTTrainer and SFTConfig to finetune a small LM
'''
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

model_path = 'Qwen/Qwen2-0.5B'
data_path = 'microsoft/orca-math-word-problems-200k'
save_path = '/home/zrq96/checkpoints/qwen-0.5B-math-sft42'  # where checkpoints are written

model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
dataset = load_dataset(data_path, split='train')
dataset = dataset.rename_column('question', 'prompt')
dataset = dataset.rename_column('answer', 'completion')
splited = dataset.train_test_split(test_size=0.01)

sft_config = SFTConfig(
    output_dir=save_path,
    # max length of the total sequence
    max_seq_length=min(tokenizer.model_max_length, 2048),
    per_device_train_batch_size=4,  # by default 8
    learning_rate=1e-4,             # by default 5e-5
    weight_decay=0.1,               # by default 0.0
    num_train_epochs=2,             # by default 3
    logging_steps=50,               # by default 500
    save_steps=100,                 # by default 500
    torch_empty_cache_steps=10,     # empty GPU cache every 10 steps
    eval_strategy='steps',          # by default 'no'
    eval_steps=100,
)

trainer = SFTTrainer(
    model,
    args=sft_config,
    train_dataset=splited['train'],
    eval_dataset=splited['test'],
)
trainer.train()
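One practical note: if a long run like this is interrupted, the Trainer API can resume from the most recent checkpoint written under output_dir. A minimal sketch, assuming at least one checkpoint has already been saved there:

# Resume from the latest checkpoint under save_path (requires one to exist).
trainer.train(resume_from_checkpoint=True)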
Using the Plain Trainer
SFTTrainer is this effortless because it does a lot of work for us, and the biggest part of that work is preparing the dataset. It is fair to say that in LLM programming, and in today's AI work in general, data processing accounts for more than half of the effort. SFTTrainer is a subclass of Trainer, and Trainer is the parent class of all HuggingFace trainers, so we can use Trainer to do everything SFTTrainer can do. Below we demonstrate how to fine-tune with the more general Trainer, which along the way shows exactly what data processing SFTTrainer was doing for us. Once we understand this process, we can even build custom trainers of our own. According to the Trainer documentation, its usage is similar to SFTTrainer's (a bit backwards to describe the parent in terms of the child, but anyway...): pass in the model to be trained/fine-tuned, the dataset, and optionally some training arguments:
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained('Qwen/Qwen2-0.5B')
dataset = ...

training_args = TrainingArguments(output_dir='./save_path')
trainer = Trainer(model, train_dataset=dataset, args=training_args)
trainer.train()
The biggest difference from SFTTrainer is the data formats it supports. SFTTrainer accepts several intuitive formats, but a Trainer dataset must be designed strictly around the inputs that model.forward accepts; in other words, Trainer feeds the samples from the dataset directly to the model, so every sample in our dataset must already be something the model can take as-is. To preprocess the data properly, we therefore need to look at exactly what inputs the model we are training can accept. In our case the model is Qwen2ForCausalLM, whose forward signature is:
def forward(
    self,
    input_ids: torch.LongTensor = None,
    attention_mask: Optional[torch.Tensor] = None,
    position_ids: Optional[torch.LongTensor] = None,
    past_key_values: Optional[List[torch.FloatTensor]] = None,
    inputs_embeds: Optional[torch.FloatTensor] = None,
    labels: Optional[torch.LongTensor] = None,
    use_cache: Optional[bool] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
    cache_position: Optional[torch.LongTensor] = None,
) -> Union[Tuple, CausalLMOutputWithPast]:
    ...
Of these, input_ids: torch.LongTensor is mandatory. Because we are training/fine-tuning, labels: Optional[torch.LongTensor] is also required rather than optional. So our dataset should contain samples with input_ids and labels, both of type torch.LongTensor:
>>> from datasets import load_dataset
>>> dataset = load_dataset('microsoft/orca-math-word-problems-200k', split='train')
>>> def preprocess_data(x: dict) -> dict:
...     # to be implemented
...     ...
>>>
>>> new_ds = dataset.map(preprocess_data, remove_columns=['question', 'answer'])
>>> print(new_ds)
Dataset({
    features: ['input_ids', 'labels'],
    num_rows: 200035
})
We will use a preprocess_data function to turn every sample of the original dataset into a sample the model can accept. All it has to do is concatenate the question and answer text, tokenize it, and pad or truncate if needed, which gives us input_ids. Since a CausalLM's forward pass always predicts the next token, labels is just input_ids shifted left by one token. However, because the model performs this shift itself, our labels end up being nothing more than a copy of input_ids, as the short check below illustrates.
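As a standalone sanity check (not part of the fine-tuning script), we can pass an unshifted copy of input_ids as labels and then reproduce the reported loss by shifting manually; the two numbers should agree. The example prompt here is arbitrary:

import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2-0.5B')
model = AutoModelForCausalLM.from_pretrained('Qwen/Qwen2-0.5B')

enc = tokenizer("### Question: 1 + 1 = ?\n ### Answer: 2", return_tensors='pt')
input_ids = enc['input_ids']

# Pass an unshifted copy of input_ids as labels; the model shifts them internally.
out = model(input_ids=input_ids, labels=input_ids.clone())

# Reproduce the loss manually: predict token t+1 from tokens up to t.
logits = out.logits[:, :-1, :]
targets = input_ids[:, 1:]
manual_loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

print(out.loss.item(), manual_loss.item())  # should match up to floating-point error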

Our preprocessing function therefore only needs to turn the text into input_ids and make a copy of them as labels:
def preprocess_data(x: dict) -> dict:
    '''
    take a training sample and return a preprocessed sample
    with the keys that the model expects, in our case:
    - input_ids: the tokenized input
    - labels: a copy of input_ids (the model shifts them internally)
    and optionally:
    - attention_mask: a mask indicating which tokens should be attended to
    - position_ids: the position of each token in the input
    - ...
    '''
    text = f"### Question: {x['question']}\n ### Answer: {x['answer']}"
    tokenized = tokenizer(text, return_tensors='pt', truncation=True,
                          padding='max_length',
                          max_length=1024)  # raise to 2048 or even 4096 if memory allows, or use a data collator instead of padding
    return {
        'input_ids': tokenized['input_ids'][0],
        'labels': tokenized['input_ids'][0].clone(),
    }
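A quick way to check the function before mapping the whole dataset, assuming the tokenizer and dataset from earlier are already loaded:

# Run the preprocessing on one raw sample and check the output lengths.
sample = dataset[0]            # {'question': ..., 'answer': ...}
processed = preprocess_data(sample)
print(len(processed['input_ids']), len(processed['labels']))  # both 1024 after padding/truncation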
Putting it all together, here is the complete code for fine-tuning the model with Trainer:
'''
Supervised-FineTuning a small LM by the vanilla Trainer
'''
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from transformers import TrainingArguments, Trainer

model_path = 'Qwen/Qwen2-0.5B'
data_path = 'microsoft/orca-math-word-problems-200k'
save_path = '/home/ricky/checkpoints/qwen-0.5B-math-sft42'

model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
dataset = load_dataset(data_path, split='train')

def preprocess_data(x: dict) -> dict:
    '''
    take a training sample and return a preprocessed sample
    with the keys that the model expects, in our case:
    - input_ids: the tokenized input
    - labels: a copy of input_ids (the model shifts them internally)
    and optionally:
    - attention_mask: a mask indicating which tokens should be attended to
    - position_ids: the position of each token in the input
    - ...
    '''
    text = f"### Question: {x['question']}\n ### Answer: {x['answer']}"
    tokenized = tokenizer(text, return_tensors='pt', truncation=True,
                          padding='max_length', max_length=1024)
    return {
        'input_ids': tokenized['input_ids'][0],
        'labels': tokenized['input_ids'][0].clone(),
    }

new_ds = dataset.map(
    function=preprocess_data,               # map all samples with this function
    remove_columns=['question', 'answer'],  # drop the original text columns
    num_proc=4,                             # use 4 processes to speed up
)
splited = new_ds.train_test_split(test_size=0.01)

training_args = TrainingArguments(
    output_dir=save_path,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    torch_empty_cache_steps=2,
    num_train_epochs=2,     # by default 3
    logging_steps=50,       # by default 500
    save_steps=100,
    eval_strategy='steps',  # by default 'no'
    eval_steps=100,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=splited['train'],
    eval_dataset=splited['test'],
)
trainer.train()
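After training finishes, the checkpoints under save_path can be loaded like any other model for a quick qualitative check. The checkpoint directory name below (checkpoint-100) is only an example; use whichever step your run actually saved. Since the tokenizer was not passed to the Trainer, it is loaded from the base model:

from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_path = 'Qwen/Qwen2-0.5B'
ckpt_path = '/home/ricky/checkpoints/qwen-0.5B-math-sft42/checkpoint-100'  # example checkpoint dir

model = AutoModelForCausalLM.from_pretrained(ckpt_path)
tokenizer = AutoTokenizer.from_pretrained(base_model_path)  # the tokenizer is unchanged by fine-tuning

# Use the same prompt template as in training.
prompt = "### Question: What is 15% of 80?\n ### Answer:"
inputs = tokenizer(prompt, return_tensors='pt')
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))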