Using Transformers Optimum
Introduction
🤗 Optimum is an extension of 🤗 Transformers that provides a set of performance optimization tools for training and running models on target hardware with maximum efficiency.
Getting Started
ONNX is currently the most broadly supported format, so we will only cover ONNX Runtime here.
🤗 Optimum provides an integration with ONNX Runtime, a cross-platform, high-performance execution engine for ONNX models.
Installation
pip install optimum[onnxruntime-gpu]
To avoid conflicts between onnxruntime and onnxruntime-gpu, make sure the onnxruntime package is not installed (run pip uninstall onnxruntime) before installing Optimum.
Exporting a Transformers model to ONNX
A Transformers model can be loaded with the ORTModelForXXX classes. Note that if the model comes from Transformers (i.e., it is a regular PyTorch checkpoint rather than an already exported ONNX model), you need to pass from_transformers=True so that it is exported to ONNX on the fly.
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer
model_checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
save_directory = "tmp/onnx/"
# Load a model from transformers and export it to ONNX
ort_model = ORTModelForSequenceClassification.from_pretrained(model_checkpoint, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
# Save the onnx model and tokenizer
ort_model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)
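After the export, it is worth checking that the ONNX model produces essentially the same outputs as the original PyTorch checkpoint. A minimal sketch, assuming PyTorch is installed and reusing ort_model and tokenizer from above (pt_model is just a local name for the reference model, not part of Optimum):

```python
import torch
from transformers import AutoModelForSequenceClassification

# Reference PyTorch model for comparison
pt_model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)
inputs = tokenizer("I love burritos!", return_tensors="pt")

with torch.no_grad():
    pt_logits = pt_model(**inputs).logits
ort_logits = ort_model(**inputs).logits  # the ORT model accepts the same tensor inputs

# The difference should be tiny (numerical noise from the export)
print(torch.max(torch.abs(pt_logits - ort_logits)))
```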
After saving to ONNX, you can load the model back with the same ORTModelForXXX class and then run the task with a pipeline.
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import pipeline, AutoTokenizer
# Load the exported ONNX model back from save_directory
model = ORTModelForSequenceClassification.from_pretrained(save_directory)
tokenizer = AutoTokenizer.from_pretrained(save_directory)
cls_pipeline = pipeline("text-classification", model=model, tokenizer=tokenizer)
results = cls_pipeline("I love burritos!")
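Optimum can also quantize the exported model through ORTQuantizer; the quantized file (typically model_quantized.onnx) is written next to the exported model and can be loaded by passing an explicit file_name to from_pretrained. A rough sketch, assuming a CPU with AVX512-VNNI support (AutoQuantizationConfig offers presets for other instruction sets as well):

```python
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Dynamic int8 quantization of the exported model
quantizer = ORTQuantizer.from_pretrained(ort_model)
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir=save_directory, quantization_config=dqconfig)

# Load the quantized file explicitly (file name assumed; check save_directory)
quantized_model = ORTModelForSequenceClassification.from_pretrained(
    save_directory, file_name="model_quantized.onnx"
)
```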
Optimizing the ONNX model
The model can be optimized with ORTOptimizer, with the optimization parameters configured through an OptimizationConfig. This exports the ONNX model, optimizes the graph, and can apply fp16 conversion and other optimizations.
from optimum.onnxruntime import ORTOptimizer, ORTModelForSequenceClassification
from optimum.onnxruntime.configuration import OptimizationConfig
model_id = "distilbert-base-uncased-finetuned-sst-2-english"
save_dir = "/tmp/outputs"
model = ORTModelForSequenceClassification.from_pretrained(model_id, from_transformers=True)
optimizer = ORTOptimizer.from_pretrained(model)
optimization_config = OptimizationConfig(
    optimization_level=2,
    optimize_with_onnxruntime_only=False,
    optimize_for_gpu=False,
)
optimizer.optimize(save_dir=save_dir, optimization_config=optimization_config)
OptimizationConfig parameters:
- optimization_level: ONNX Runtime graph optimization level (0 disables all optimizations, 1 enables basic ones, 2 adds extended fusions, 99 enables all optimizations)
- optimize_for_gpu: whether to optimize for GPU execution
- fp16: whether to convert the model to half precision, which also reduces the model size
Using pipelines
To use an ONNX Runtime model in a pipeline, simply pass accelerator="ort".
from optimum.pipelines import pipeline
classifier = pipeline(task="text-classification", accelerator="ort")
classifier("I like you. I love you.")
[{'label': 'POSITIVE', 'score': 0.9998838901519775}]
Of course, we can also use a model that has already been converted and optimized, such as the one produced by the optimization step above; here a ready-made ORT checkpoint from the Hub is loaded.
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForQuestionAnswering
from optimum.pipelines import pipeline
tokenizer = AutoTokenizer.from_pretrained("optimum/roberta-base-squad2")
# Loading already converted and optimized ORT checkpoint for inference
model = ORTModelForQuestionAnswering.from_pretrained("optimum/roberta-base-squad2")
onnx_qa = pipeline("question-answering", model=model, tokenizer=tokenizer)
question = "What's my name?"
context = "My name is Philipp and I live in Nuremberg."
pred = onnx_qa(question=question, context=context)
A real-world test
Let's load PERT, a machine reading comprehension model released by HFL (the joint laboratory of Harbin Institute of Technology and iFLYTEK).
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline
from optimum.pipelines import pipeline as ortpipeline
from optimum.onnxruntime.configuration import OptimizationConfig
from optimum.onnxruntime.optimization import ORTOptimizer
from optimum.onnxruntime import ORTModelForQuestionAnswering
import time
First, load the original model.
model_id = "hfl/chinese-pert-base-mrc"
# Run inference with the original Transformers pipeline
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForQuestionAnswering.from_pretrained(model_id)
pipeline_qa = pipeline('question-answering', model=model, tokenizer=tokenizer)
QA_input = {'question': "著名诗歌《假如生活欺骗了你》的作者是",
'context': "普希金从那里学习人民的语言,吸取了许多有益的养料,这一切对普希金后来的创作产生了很大的影响。这两年里,普希金创作了不少优秀的作品,如《囚徒》、《致大海》、《致凯恩》和《假如生活欺骗了你》等几十首抒情诗,叙事诗《努林伯爵》,历史剧《鲍里斯·戈都诺夫》,以及《叶甫盖尼·奥涅金》前六章。"}
pipeline_qa(QA_input)  # warm-up run
start = time.time()
print(pipeline_qa(QA_input))
print(f"Unoptimized model time: {time.time() - start}")
{'score': 0.5144545435905457, 'start': 0, 'end': 3, 'answer': '普希金'}
Unoptimized model time: 0.10119819641113281
Next, run the same task through ONNX Runtime (ORT).
# Load the same checkpoint as an ORT model on the GPU
ortmodel = ORTModelForQuestionAnswering.from_pretrained(model_id, from_transformers=True, provider="CUDAExecutionProvider")
ort_model_qa = ortpipeline(
"question-answering", model=ortmodel, tokenizer=tokenizer, device=0)
ort_model_qa(QA_input)  # warm-up run
start = time.time()
print(ort_model_qa(QA_input))
print(f"ORT model time: {time.time() - start}")
{'score': 0.5144545435905457, 'start': 0, 'end': 3, 'answer': '普希金'}
ORT model time: 0.09807920455932617
Finally, use the fp16-optimized ONNX model.
# Optimizer configuration
optimization_config = OptimizationConfig(
    optimization_level=2,   # optimization level
    optimize_for_gpu=True,  # optimize for GPU
    fp16=True               # convert to half precision
)
save_dir = "/tmp/outputs"  # output directory for the optimized model (same path as in the earlier example)
optimizer = ORTOptimizer.from_pretrained(ortmodel)
optimizer.optimize(save_dir=save_dir, optimization_config=optimization_config)
# Load the optimized model
ort_opt_model = ORTModelForQuestionAnswering.from_pretrained(
save_dir, file_name="model-optimized.onnx")
opt_onnx_qa = ortpipeline("question-answering",
model=ort_opt_model, tokenizer=tokenizer, device=0)
opt_onnx_qa(QA_input)  # warm-up run
start = time.time()
print(opt_onnx_qa(QA_input))
print(f"Optimized ONNX model time: {time.time() - start}")
{'score': 0.5168349146842957, 'start': 0, 'end': 3, 'answer': '普希金'}
Optimized ONNX model time: 0.009712934494018555
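The single-run timings above are fairly noisy (on GPU in particular, the first call includes session warm-up), so it is better to average over many runs before drawing conclusions. A small sketch of such a loop, reusing the three pipelines defined above; the benchmark helper is plain Python, not part of Optimum:

```python
def benchmark(qa_pipeline, inputs, n_runs=50):
    """Average latency of a question-answering pipeline over n_runs calls."""
    qa_pipeline(inputs)  # warm-up, excluded from the measurement
    start = time.time()
    for _ in range(n_runs):
        qa_pipeline(inputs)
    return (time.time() - start) / n_runs

print(f"Original pipeline:      {benchmark(pipeline_qa, QA_input):.4f} s/query")
print(f"ORT pipeline:           {benchmark(ort_model_qa, QA_input):.4f} s/query")
print(f"Optimized ORT pipeline: {benchmark(opt_onnx_qa, QA_input):.4f} s/query")
```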