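Using Transformers Optimum

Introduction

🤗 Optimum is an extension of 🤗 Transformers. It provides a set of performance optimization tools for training and running models on the target hardware with maximum efficiency.

Getting started

ONNX is currently the most general-purpose option, so this post only covers ONNX Runtime.

🤗 Optimum integrates with ONNX Runtime, a cross-platform, high-performance execution engine for ONNX models.

Installation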

pip install optimum[onnxruntime-gpu]

To avoid conflicts between onnxruntime and onnxruntime-gpu, make sure the onnxruntime package is not installed before installing Optimum by running pip uninstall onnxruntime.
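
The conflict check can also be scripted. The following is a minimal sketch that uses only the standard library; the helper name conflicting_ort_packages is made up for this illustration, and it assumes the two distributions are registered under the names onnxruntime and onnxruntime-gpu.

```python
from importlib import metadata


def conflicting_ort_packages(installed=None):
    """Return the ONNX Runtime distributions that are installed side by side.

    `installed` is an optional iterable of distribution names (handy for tests);
    by default the names are read from the current environment.
    """
    if installed is None:
        installed = [dist.metadata["Name"] for dist in metadata.distributions()]
    # Normalize case and underscores so "onnxruntime_gpu" also matches.
    names = {name.lower().replace("_", "-") for name in installed if name}
    return sorted(names & {"onnxruntime", "onnxruntime-gpu"})


found = conflicting_ort_packages()
if len(found) > 1:
    print("Both packages are installed; run: pip uninstall onnxruntime")
else:
    print("No conflict detected:", found)
```

If both names show up, uninstalling onnxruntime before installing optimum[onnxruntime-gpu] follows the advice above.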

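Exporting a Transformers model to ONNX

You can load a Transformers model with the ORTModelForXXX classes. Note that if the checkpoint comes from Transformers, you need to pass from_transformers=True so that it is exported to ONNX on load (newer Optimum releases replace this argument with export=True).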

from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

model_checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
save_directory = "tmp/onnx/"
# Load a model from transformers and export it to ONNX
ort_model = ORTModelForSequenceClassification.from_pretrained(model_checkpoint, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
# Save the onnx model and tokenizer
ort_model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)

Once saved as ONNX, the model can be loaded again through ORTModelForXXX and then run on a task with pipeline.

from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import pipeline, AutoTokenizer
# save_pretrained() above wrote the exported graph as model.onnx; no quantized file exists yet
model = ORTModelForSequenceClassification.from_pretrained(save_directory, file_name="model.onnx")
tokenizer = AutoTokenizer.from_pretrained(save_directory)
cls_pipeline = pipeline("text-classification", model=model, tokenizer=tokenizer)
results = cls_pipeline("I love burritos!")
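
The file_name argument has to match a file that actually exists in the save directory, so it can help to look before loading. Here is a small sketch using only pathlib; list_onnx_files is a hypothetical helper, not part of Optimum.

```python
from pathlib import Path


def list_onnx_files(directory):
    """Return the sorted names of the .onnx files directly inside `directory`."""
    path = Path(directory)
    if not path.is_dir():
        return []
    return sorted(p.name for p in path.glob("*.onnx"))


# "tmp/onnx/" is the save_directory used in the export example.
print(list_onnx_files("tmp/onnx/"))
```

Whatever name is printed is the value to pass as file_name.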

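ONNX model optimization

ORTOptimizer optimizes a model, and OptimizationConfig holds the optimization parameters. Together they can take an exported ONNX model, optimize its graph, and apply further optimizations such as fp16 conversion.
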
from optimum.onnxruntime import ORTOptimizer, ORTModelForSequenceClassification
from optimum.onnxruntime.configuration import OptimizationConfig

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
save_dir = "/tmp/outputs"

model = ORTModelForSequenceClassification.from_pretrained(model_id, from_transformers=True)

optimizer = ORTOptimizer.from_pretrained(model)

optimization_config = OptimizationConfig(
    optimization_level=2,
    optimize_with_onnxruntime_only=False,
    optimize_for_gpu=False,
)

optimizer.optimize(save_dir=save_dir, optimization_config=optimization_config)

OptimizationConfig

  • optimization_level: the optimization level
  • optimize_for_gpu: whether to optimize for GPU
  • fp16: whether to convert to half precision; enabling it reduces the model size
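
To see why fp16 shrinks the model, a back-of-the-envelope estimate is enough: each weight takes 4 bytes in float32 and 2 bytes in float16. The sketch below is only arithmetic; the parameter count is a made-up round number, not the size of any model in this post.

```python
def estimated_weight_size_mib(num_parameters, bytes_per_parameter):
    """Approximate size of the raw weights in MiB (ignores graph metadata)."""
    return num_parameters * bytes_per_parameter / (1024 ** 2)


# Hypothetical parameter count, chosen only for illustration.
num_parameters = 100_000_000

fp32_mib = estimated_weight_size_mib(num_parameters, 4)  # float32: 4 bytes per weight
fp16_mib = estimated_weight_size_mib(num_parameters, 2)  # float16: 2 bytes per weight
print(f"fp32: {fp32_mib:.1f} MiB, fp16: {fp16_mib:.1f} MiB")
```

The real file on disk will differ somewhat, since an ONNX file also stores the graph structure and some tensors may stay in float32.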

Using pipelines

To use ORT in a pipeline, just pass accelerator="ort".

from optimum.pipelines import pipeline
classifier = pipeline(task="text-classification", accelerator="ort")
classifier("I like you. I love you.")

[{'label': 'POSITIVE', 'score': 0.9998838901519775}]

Of course, we can also pass in an already optimized model like the one above.

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForQuestionAnswering
from optimum.pipelines import pipeline

tokenizer = AutoTokenizer.from_pretrained("optimum/roberta-base-squad2")
# Loading already converted and optimized ORT checkpoint for inference
model = ORTModelForQuestionAnswering.from_pretrained("optimum/roberta-base-squad2")

onnx_qa = pipeline("question-answering", model=model, tokenizer=tokenizer)
question = "What's my name?"
context = "My name is Philipp and I live in Nuremberg."

pred = onnx_qa(question=question, context=context)

A real-world test

Let's load the PERT reading comprehension model released by the Joint Laboratory of HIT and iFLYTEK Research (HFL).

from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline
from optimum.pipelines import pipeline as ortpipeline
from optimum.onnxruntime.configuration import OptimizationConfig
from optimum.onnxruntime.optimization import ORTOptimizer
from optimum.onnxruntime import ORTModelForQuestionAnswering
import time

First, load the original model

model_id = "hfl/chinese-pert-base-mrc"

# Use the original Transformers pipeline
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForQuestionAnswering.from_pretrained(model_id)
pipeline_qa = pipeline('question-answering', model=model, tokenizer=tokenizer)
# The model is Chinese, so the question and context stay in Chinese.
# The question asks who wrote the poem "If by Life You Were Deceived".
QA_input = {'question': "著名诗歌《假如生活欺骗了你》的作者是",
            'context': "普希金从那里学习人民的语言,吸取了许多有益的养料,这一切对普希金后来的创作产生了很大的影响。这两年里,普希金创作了不少优秀的作品,如《囚徒》、《致大海》、《致凯恩》和《假如生活欺骗了你》等几十首抒情诗,叙事诗《努林伯爵》,历史剧《鲍里斯·戈都诺夫》,以及《叶甫盖尼·奥涅金》前六章。"}


# Warm-up call, then time a single call
pipeline_qa(QA_input)
start = time.time()
print(pipeline_qa(QA_input))
print(f"Unoptimized model time: {time.time() - start}")

{'score': 0.5144545435905457, 'start': 0, 'end': 3, 'answer': '普希金'}
Unoptimized model time: 0.10119819641113281
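
The timing above measures a single call after one warm-up, which can be noisy. A slightly more robust measurement repeats the call and reports the mean and median. The sketch below is one way to do that with the standard library; benchmark and summarize are hypothetical helpers, and the dummy workload stands in for pipeline_qa so the snippet runs without a model.

```python
import statistics
import time


def benchmark(fn, *args, warmup=3, runs=20):
    """Call fn(*args) `warmup` times untimed, then return per-call timings in seconds."""
    for _ in range(warmup):
        fn(*args)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Return (mean, median) of a list of timings."""
    return statistics.mean(timings), statistics.median(timings)


# Stand-in workload; with a real model this would be pipeline_qa and QA_input.
def dummy_workload(n):
    return sum(i * i for i in range(n))


mean_s, median_s = summarize(benchmark(dummy_workload, 10_000))
print(f"mean: {mean_s * 1000:.3f} ms, median: {median_s * 1000:.3f} ms")
```

With the pipelines from this post, the call would look like benchmark(pipeline_qa, QA_input).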

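Next, use ORT

# ORT model, exported from the Transformers checkpoint and run on the GPU
ortmodel = ORTModelForQuestionAnswering.from_pretrained(
    model_id, from_transformers=True, provider="CUDAExecutionProvider")
ort_model_qa = ortpipeline(
    "question-answering", model=ortmodel, tokenizer=tokenizer, device=0)

# Warm up, then time the ORT pipeline
ort_model_qa(QA_input)
start = time.time()
print(ort_model_qa(QA_input))
print(f"ORT model time: {time.time() - start}")

Recorded output (the listing as originally posted called pipeline_qa at this step, so these figures may reflect the unoptimized pipeline):

{'score': 0.5144545435905457, 'start': 0, 'end': 3, 'answer': '普希金'}
ORT model time: 0.09807920455932617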
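
The ORT model above is loaded with provider="CUDAExecutionProvider", which only works when the installed ONNX Runtime build offers that provider. A defensive sketch is shown below: pick_provider is a hypothetical helper, while onnxruntime.get_available_providers() is the ONNX Runtime call that lists the providers of the installed build. The import is guarded so the snippet still runs where onnxruntime is missing.

```python
def pick_provider(available, preferred="CUDAExecutionProvider"):
    """Return `preferred` if it is listed in `available`, else fall back to CPU."""
    return preferred if preferred in available else "CPUExecutionProvider"


try:
    import onnxruntime

    available = onnxruntime.get_available_providers()
except ImportError:
    # onnxruntime is not installed here; assume CPU only for this demo.
    available = ["CPUExecutionProvider"]

print("Using provider:", pick_provider(available))
```

The returned string can be passed as the provider argument of from_pretrained. A provider being listed does not guarantee that the CUDA libraries load correctly at runtime, so this is a first check rather than a full one.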

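Finally, use the ONNX model after fp16 optimization

# Optimizer configuration
save_dir = "/tmp/outputs"  # same output directory as in the optimization example above
optimization_config = OptimizationConfig(
    optimization_level=2,   # optimization level
    optimize_for_gpu=True,  # optimize for GPU
    fp16=True               # convert to half precision
)
optimizer = ORTOptimizer.from_pretrained(ortmodel)
optimizer.optimize(save_dir=save_dir,
                   optimization_config=optimization_config)

# Load the optimized model.
# optimize() is assumed to write model_optimized.onnx (its default "optimized" suffix);
# adjust file_name if the file in save_dir is named differently.
ort_opt_model = ORTModelForQuestionAnswering.from_pretrained(
    save_dir, file_name="model_optimized.onnx")
opt_onnx_qa = ortpipeline("question-answering",
                          model=ort_opt_model, tokenizer=tokenizer, device=0)
# Warm up, then time the optimized pipeline
opt_onnx_qa(QA_input)
start = time.time()
print(opt_onnx_qa(QA_input))
print(f"Optimized ONNX model time: {time.time() - start}")

{'score': 0.5168349146842957, 'start': 0, 'end': 3, 'answer': '普希金'}
Optimized ONNX model time: 0.009712934494018555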
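
To put the numbers in perspective, the sketch below computes the speedup of the fp16-optimized model over the original pipeline and checks that the two predictions still agree. The values are copied from the outputs printed in this post; the helper names and the 0.01 score tolerance are arbitrary choices for illustration.

```python
# Values copied from the outputs printed in this post.
baseline = {"score": 0.5144545435905457, "answer": "普希金",
            "seconds": 0.10119819641113281}
optimized = {"score": 0.5168349146842957, "answer": "普希金",
             "seconds": 0.009712934494018555}


def speedup(baseline_seconds, optimized_seconds):
    """How many times faster the optimized run is than the baseline."""
    return baseline_seconds / optimized_seconds


def predictions_agree(a, b, score_tolerance=0.01):
    """Same answer text and scores within an (arbitrarily chosen) tolerance."""
    return a["answer"] == b["answer"] and abs(a["score"] - b["score"]) <= score_tolerance


print(f"speedup: {speedup(baseline['seconds'], optimized['seconds']):.1f}x")
print("predictions agree:", predictions_agree(baseline, optimized))
```

On these single-run figures the optimized model is roughly ten times faster, and the small shift in score is what one would expect from half-precision arithmetic.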

Posted 2022-12-07 16:42 by JadePeng