1 References
【Official guide】https://qwen.readthedocs.io/en/latest/
【ModelScope training】https://modelscope.cn/docs/%E4%BD%BF%E7%94%A8Tuners
【CUDA download and install tutorial】https://blog.csdn.net/changyana/article/details/135876568
【cuDNN installation】https://developer.nvidia.com/rdp/cudnn-archive
【PyTorch installation】https://pytorch.org/
【Ollama installation】https://ollama.com/download
2 Base Environment
2.1 Installing CUDA
【Check the GPU driver】nvidia-smi
【Verify the CUDA installation】nvcc -V
First check the highest CUDA version supported by the NVIDIA driver (reported by nvidia-smi), then download the matching version of the CUDA Toolkit.
2.2 Installing cuDNN
Unzip the downloaded cuDNN archive. Inside the extracted cuDNN folder, copy the contents of the bin, include, and lib folders into the corresponding bin, include, and lib folders of the CUDA installation directory; that completes the installation.
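A minimal sketch of that copy step, assuming a Windows-style layout; the two paths below are placeholders and must be replaced with your actual cuDNN extract directory and CUDA install directory:

import shutil
from pathlib import Path

# Placeholder paths -- adjust to your actual cuDNN extract dir and CUDA install dir
cudnn_dir = Path(r"C:\tools\cudnn-windows-x86_64-8.9.7")
cuda_dir = Path(r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1")

for sub in ("bin", "include", "lib"):
    src, dst = cudnn_dir / sub, cuda_dir / sub
    for f in src.rglob("*"):
        if f.is_file():
            target = dst / f.relative_to(src)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(f, target)  # copy each cuDNN file into the matching CUDA folder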
2.3 Installing PyTorch

# Verify that the installed PyTorch build can see CUDA
import torch
print(torch.__version__)
print(torch.cuda.is_available())
3 Running Models with Ollama
Ollama is a simple and convenient way to run models locally. Install it directly from the Ollama website; the supported models can be browsed at https://ollama.com/search?q=qwen
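Once a Qwen model has been pulled (for example ollama pull qwen2.5:0.5b, assuming that tag is available in the Ollama library), it can be called through the local REST API that Ollama exposes on port 11434. A minimal sketch:

import json
import urllib.request

# Assumes the Ollama service is running and `ollama pull qwen2.5:0.5b` has already completed
payload = {"model": "qwen2.5:0.5b", "prompt": "Briefly introduce the Qwen model family.", "stream": False}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])  # the generated completion text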
4 Official Qwen Guide
The official Qwen documentation walks through every supported deployment and fine-tuning option in detail, and is considerably clearer than piecing the same information together from assorted blog posts.
5 Custom Training
5.1 Following the ModelScope Guide
# Customize the ModelScope model cache path
export MODELSCOPE_CACHE=/Users/kuliuheng/workspace/aiWorkspace/Qwen
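The same cache path can also be set from Python instead of a shell export, as long as it happens before modelscope is imported (a sketch; the path is simply the one used above):

import os

# Set the cache location before importing modelscope so downloads land under this directory
os.environ["MODELSCOPE_CACHE"] = "/Users/kuliuheng/workspace/aiWorkspace/Qwen"

from modelscope import AutoModelForCausalLM  # subsequent downloads use the path above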

# A100 18G memory
from swift import Seq2SeqTrainer, Seq2SeqTrainingArguments
from modelscope import MsDataset, AutoTokenizer
from modelscope import AutoModelForCausalLM
from swift import Swift, LoraConfig
from swift.llm import get_template, TemplateType
import torch

pretrained_model = 'qwen/Qwen2.5-0.5B-Instruct'


def encode(example):
    # Build a chat-format training sample from an Alpaca-style record
    inst, inp, output = example['instruction'], example.get('input', None), example['output']
    if output is None:
        return {}
    if inp is None or len(inp) == 0:
        q = inst
    else:
        q = f'{inst}\n{inp}'
    example, kwargs = template.encode({'query': q, 'response': output})
    return example


if __name__ == '__main__':
    # Load the model
    model = AutoModelForCausalLM.from_pretrained(pretrained_model, torch_dtype=torch.bfloat16, device_map='auto', trust_remote_code=True)
    lora_config = LoraConfig(
        r=8,
        bias='none',
        task_type="CAUSAL_LM",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        lora_alpha=32,
        lora_dropout=0.05)
    model = Swift.prepare_model(model, lora_config)
    tokenizer = AutoTokenizer.from_pretrained(pretrained_model, trust_remote_code=True)
    dataset = MsDataset.load('AI-ModelScope/alpaca-gpt4-data-en', split='train')
    template = get_template(TemplateType.chatglm3, tokenizer, max_length=1024)

    dataset = dataset.map(encode).filter(lambda e: e.get('input_ids'))
    dataset = dataset.train_test_split(test_size=0.001)

    train_dataset, val_dataset = dataset['train'], dataset['test']

    train_args = Seq2SeqTrainingArguments(
        output_dir='output',
        learning_rate=1e-4,
        num_train_epochs=2,
        eval_steps=500,
        save_steps=500,
        evaluation_strategy='steps',
        save_strategy='steps',
        dataloader_num_workers=4,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        logging_steps=10,
    )

    trainer = Seq2SeqTrainer(
        model=model,
        args=train_args,
        data_collator=template.data_collator,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        tokenizer=tokenizer)

    trainer.train()
(1) The official sample code has no `if __name__ == '__main__':` entry point. Running it as-is raised an error saying that a child process had started before the main process finished its initialization, so a main-function guard is added here.
(2) The official code was not written against the 'qwen/Qwen2.5-0.5B-Instruct' model, so target_modules fails at runtime; the names given must be modules that actually exist in the model. A useful trick is to print the model object, which shows the real layer hierarchy:
from modelscope import AutoModelForCausalLM

model_name = 'qwen/Qwen2.5-0.5B-Instruct'
model = AutoModelForCausalLM.from_pretrained(model_name)
print(model)
This produces the following output:

Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(151936, 896)
    (layers): ModuleList(
      (0-23): 24 x Qwen2DecoderLayer(
        (self_attn): Qwen2SdpaAttention(
          (q_proj): Linear(in_features=896, out_features=896, bias=True)
          (k_proj): Linear(in_features=896, out_features=128, bias=True)
          (v_proj): Linear(in_features=896, out_features=128, bias=True)
          (o_proj): Linear(in_features=896, out_features=896, bias=False)
          (rotary_emb): Qwen2RotaryEmbedding()
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=896, out_features=4864, bias=False)
          (up_proj): Linear(in_features=896, out_features=4864, bias=False)
          (down_proj): Linear(in_features=4864, out_features=896, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm((896,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((896,), eps=1e-06)
      )
    )
    (norm): Qwen2RMSNorm((896,), eps=1e-06)
    (rotary_emb): Qwen2RotaryEmbedding()
  )
  (lm_head): Linear(in_features=896, out_features=151936, bias=False)
)
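If the printed tree is awkward to scan, the candidate LoRA target names can also be collected programmatically. A small sketch, reusing the model object loaded above:

import torch.nn as nn

# Leaf names of all Linear layers -- the valid candidates for LoraConfig's target_modules
linear_names = {name.split('.')[-1]
                for name, module in model.named_modules()
                if isinstance(module, nn.Linear)}
print(sorted(linear_names))  # expect q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, lm_head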
After adjusting the target module names the code does run, but on an Apple M3 MacBook it is really quite slow:
[INFO:swift] Successfully registered `/Users/kuliuheng/workspace/aiWorkspace/Qwen/testMS/.venv/lib/python3.10/site-packages/swift/llm/data/dataset_info.json`
[INFO:swift] No vLLM installed, if you are using vLLM, you will get `ImportError: cannot import name 'get_vllm_engine' from 'swift.llm'`
[INFO:swift] No LMDeploy installed, if you are using LMDeploy, you will get `ImportError: cannot import name 'prepare_lmdeploy_engine_template' from 'swift.llm'`
Train: 0%| | 10/6492 [03:27<42:38:59, 23.69s/it]{'loss': 20.66802063, 'acc': 0.66078668, 'grad_norm': 30.34488869, 'learning_rate': 9.985e-05, 'memory(GiB)': 0, 'train_speed(iter/s)': 0.048214, 'epoch': 0.0, 'global_step/max_steps': '10/6492', 'percentage': '0.15%', 'elapsed_time': '3m 27s', 'remaining_time': '1d 13h 21m 21s'}
Train: 0%| | 20/6492 [23:05<477:25:15, 265.56s/it]{'loss': 21.01838379, 'acc': 0.66624489, 'grad_norm': 23.78275299, 'learning_rate': 9.969e-05, 'memory(GiB)': 0, 'train_speed(iter/s)': 0.014436, 'epoch': 0.01, 'global_step/max_steps': '20/6492', 'percentage': '0.31%', 'elapsed_time': '23m 5s', 'remaining_time': '5d 4h 31m 21s'}
Train: 0%| | 30/6492 [29:48<66:31:55, 37.07s/it]{'loss': 20.372052, 'acc': 0.67057648, 'grad_norm': 38.68712616, 'learning_rate': 9.954e-05, 'memory(GiB)': 0, 'train_speed(iter/s)': 0.016769, 'epoch': 0.01, 'global_step/max_steps': '30/6492', 'percentage': '0.46%', 'elapsed_time': '29m 48s', 'remaining_time': '4d 11h 2m 20s'}
Train: 1%| | 40/6492 [36:00<62:35:16, 34.92s/it]{'loss': 20.92590179, 'acc': 0.66806035, 'grad_norm': 38.17282486, 'learning_rate': 9.938e-05, 'memory(GiB)': 0, 'train_speed(iter/s)': 0.018514, 'epoch': 0.01, 'global_step/max_steps': '40/6492', 'percentage': '0.62%', 'elapsed_time': '36m 0s', 'remaining_time': '4d 0h 48m 3s'}
Train: 1%| | 50/6492 [42:23<60:03:47, 33.57s/it]{'loss': 19.25114594, 'acc': 0.68523092, 'grad_norm': 37.24295807, 'learning_rate': 9.923e-05, 'memory(GiB)': 0, 'train_speed(iter/s)': 0.01966, 'epoch': 0.02, 'global_step/max_steps': '50/6492', 'percentage': '0.77%', 'elapsed_time': '42m 23s', 'remaining_time': '3d 19h 1m 0s'}
Train: 1%| | 60/6492 [47:45<54:01:41, 30.24s/it]{'loss': 19.54689178, 'acc': 0.69552717, 'grad_norm': 27.87804794, 'learning_rate': 9.908e-05, 'memory(GiB)': 0, 'train_speed(iter/s)': 0.020941, 'epoch': 0.02, 'global_step/max_steps': '60/6492', 'percentage': '0.92%', 'elapsed_time': '47m 45s', 'remaining_time': '3d 13h 19m 3s'}
Train: 1%| | 65/6492 [50:38<64:46:05, 36.28s/it]
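The memory(GiB): 0 field in the log hints that training is not running on a CUDA device. A quick way to confirm which backend the model actually ended up on (a sketch; assumes the model object from the training script above):

import torch

# On Apple Silicon there is no CUDA; MPS may be available, but the trainer can still fall back to CPU
print("cuda available:", torch.cuda.is_available())
print("mps available:", torch.backends.mps.is_available())
print("model device:", next(model.parameters()).device)  # where device_map='auto' actually placed the weights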