TensorRT trtexec usage guide
TensorRT Command-Line Wrapper: trtexec
Description
Included in the samples directory is a command line wrapper tool, called trtexec. trtexec is a tool to quickly utilize TensorRT without having to develop your own application. The trtexec tool has two main purposes:
- It's useful for benchmarking networks on random data.
- It's useful for generating serialized engines from models.
Benchmarking network - If you have a model saved as a UFF file, ONNX file, or if you have a network description in a Caffe prototxt format, you can use the trtexec tool to test the performance of running inference on your network using TensorRT. The trtexec tool has many options for specifying inputs and outputs, iterations for performance timing, precision allowed, and other options.
Serialized engine generation - If you generate a saved serialized engine file, you can pull it into another application that runs inference. For example, you can use the TensorRT Laboratory to run the engine with multiple execution contexts from multiple threads in a fully pipelined asynchronous way to test parallel inference performance. There are some caveats, for example, if you used a Caffe prototxt file and a model is not supplied, random weights are generated. Also, in INT8 mode, random weights are used, meaning trtexec does not provide calibration capability.
Building trtexec
trtexec can be used to build engines, using different TensorRT features (see command line arguments), and run inference. trtexec also measures and reports execution time and can be used to understand performance and possibly locate bottlenecks.
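Compile this sample by running make in the <TensorRT root directory>/samples/trtexec directory:
cd <TensorRT root directory>/samples/trtexec
make
Where <TensorRT root directory> is the directory in which TensorRT is installed.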
Using trtexec
trtexec can build engines from models in Caffe, UFF, or ONNX format.
Example 1: Simple MNIST model from Caffe
The example below shows how to load a model description and its weights, build the engine that is optimized for batch size 16, and save it to a file.
trtexec --deploy=/path/to/mnist.prototxt --model=/path/to/mnist.caffemodel --output=prob --batch=16 --saveEngine=mnist16.trt
Then, the same engine can be used for benchmarking; the example below shows how to load the engine and run inference on batch 16 inputs (randomly generated).
trtexec --loadEngine=mnist16.trt --batch=16
Example 2: Profiling a custom layer
You can profile a custom layer using the IPluginRegistry for the plugins and trtexec. You’ll need to first register the plugin with IPluginRegistry.
If you are using TensorRT shipped plugins, you should load the libnvinfer_plugin.so file, as these plugins are pre-registered.
If you have your own plugin, then it has to be registered explicitly. The following macro can be used to register the plugin creator YourPluginCreator with the IPluginRegistry.
REGISTER_TENSORRT_PLUGIN(YourPluginCreator);
Example 3: Running a network on DLA
To run the AlexNet network on NVIDIA DLA (Deep Learning Accelerator) using trtexec in FP16 mode, issue:
./trtexec --deploy=data/AlexNet/AlexNet_N2.prototxt --output=prob --useDLACore=1 --fp16 --allowGPUFallback
To run the AlexNet network on DLA using trtexec in INT8 mode, issue:
./trtexec --deploy=data/AlexNet/AlexNet_N2.prototxt --output=prob --useDLACore=1 --int8 --allowGPUFallback
To run the MNIST network on DLA using trtexec, issue:
./trtexec --deploy=data/mnist/mnist.prototxt --output=prob --useDLACore=0 --fp16 --allowGPUFallback
For more information about DLA, see Working With DLA.
Example 4: Running an ONNX model with full dimensions and dynamic shapes
To run an ONNX model in full-dimensions mode with static input shapes:
./trtexec --onnx=model.onnx
The following examples assume an ONNX model with one dynamic input named input and dimensions [-1, 3, 244, 244].
To run an ONNX model in full-dimensions mode with a given input shape:
./trtexec --onnx=model.onnx --shapes=input:32x3x244x244
To benchmark your ONNX model with a range of possible input shapes:
./trtexec --onnx=model.onnx --minShapes=input:1x3x244x244 --optShapes=input:16x3x244x244 --maxShapes=input:32x3x244x244 --shapes=input:5x3x244x244
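The shape arguments above are easy to mistype when a model has several inputs. The following is a minimal Python sketch, not part of trtexec, that renders the --minShapes/--optShapes/--maxShapes arguments from dictionaries and checks that min <= opt <= max holds in every dimension. The helper names format_shape_spec and profile_args are hypothetical and exist only for this illustration.

```python
def format_shape_spec(shapes):
    """Render {"input": (1, 3, 244, 244)} as "input:1x3x244x244"."""
    return ",".join(
        f"{name}:{'x'.join(str(d) for d in dims)}" for name, dims in shapes.items()
    )


def profile_args(min_shapes, opt_shapes, max_shapes):
    """Return the three trtexec shape arguments for one optimization profile."""
    for name, opt in opt_shapes.items():
        lo, hi = min_shapes[name], max_shapes[name]
        if not (len(lo) == len(opt) == len(hi)):
            raise ValueError(f"{name}: min/opt/max must have the same rank")
        if not all(a <= b <= c for a, b, c in zip(lo, opt, hi)):
            raise ValueError(f"{name}: expected min <= opt <= max in every dimension")
    return [
        f"--minShapes={format_shape_spec(min_shapes)}",
        f"--optShapes={format_shape_spec(opt_shapes)}",
        f"--maxShapes={format_shape_spec(max_shapes)}",
    ]


# Same profile as the command above; "input" is the model's input name.
args = profile_args(
    {"input": (1, 3, 244, 244)},
    {"input": (16, 3, 244, 244)},
    {"input": (32, 3, 244, 244)},
)
print(" ".join(["./trtexec", "--onnx=model.onnx", *args]))
```

The printed line can be pasted into a shell, or the argument list can be passed to subprocess.run together with the path to the trtexec binary.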
Example 5: Collecting and printing a timing trace
When running, trtexec prints the measured performance, but can also export the measurement trace to a json file:
./trtexec --deploy=data/AlexNet/AlexNet_N2.prototxt --output=prob --exportTimes=trace.json
Once the trace is stored in a file, it can be printed using the tracer.py utility. This tool prints timestamps and duration of input, compute, and output, in different forms:
./tracer.py trace.json
Similarly, profiles can also be printed and stored in a json file. The utility profiler.py can be used to read and print the profile from a json file.
Example 6: Tune throughput with multi-streaming
Tuning throughput may require running multiple concurrent streams of execution. This is the case, for example, when the achieved latency is well within the desired threshold, so throughput can be increased even at the expense of some latency. For example, save engines for batch sizes 1 and 2, and assume that both execute within 2 ms, the latency threshold:
trtexec --deploy=GoogleNet_N2.prototxt --output=prob --batch=1 --saveEngine=g1.trt --int8 --buildOnly
trtexec --deploy=GoogleNet_N2.prototxt --output=prob --batch=2 --saveEngine=g2.trt --int8 --buildOnly
Now, the saved engines can be tried to find the batch/streams combination that maximizes throughput while keeping latency below 2 ms:
trtexec --loadEngine=g1.trt --batch=1 --streams=2
trtexec --loadEngine=g1.trt --batch=1 --streams=3
trtexec --loadEngine=g1.trt --batch=1 --streams=4
trtexec --loadEngine=g2.trt --batch=2 --streams=2
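Once each of the runs above has reported its latency and throughput, picking the winner is a small filtering step. The Python sketch below shows one way to do it. The dictionary keys (latency_ms, images_per_s) and the numbers are made-up placeholders for illustration, not real trtexec measurements, and the values have to be copied by hand from the trtexec reports.

```python
def best_config(runs, latency_limit_ms):
    """Return the highest-throughput run whose latency is within the limit."""
    within_limit = [r for r in runs if r["latency_ms"] <= latency_limit_ms]
    if not within_limit:
        return None  # no combination satisfies the latency threshold
    return max(within_limit, key=lambda r: r["images_per_s"])


# Placeholder values only: replace them with the figures reported by trtexec.
runs = [
    {"engine": "g1.trt", "batch": 1, "streams": 2, "latency_ms": 1.2, "images_per_s": 1600.0},
    {"engine": "g1.trt", "batch": 1, "streams": 3, "latency_ms": 1.7, "images_per_s": 1750.0},
    {"engine": "g1.trt", "batch": 1, "streams": 4, "latency_ms": 2.3, "images_per_s": 1800.0},
    {"engine": "g2.trt", "batch": 2, "streams": 2, "latency_ms": 1.9, "images_per_s": 2000.0},
]

best = best_config(runs, latency_limit_ms=2.0)
print(best["engine"], best["batch"], best["streams"])
```

With these placeholder numbers, the run with four streams is discarded because it exceeds the 2 ms threshold, even though its throughput is higher than that of the two- and three-stream runs of the same engine.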
Tool command line arguments
To see the full list of available options and their descriptions, issue the ./trtexec --help command.
Note: Specifying the --safe parameter turns the safety mode switch ON. By default, the --safe parameter is not specified; the safety mode switch is OFF. The layers and parameters that are contained within the --safe subset are restricted if the switch is set to ON. The switch is used for prototyping the safety restricted flows until the TensorRT safety runtime is made available. For more information, see the Working With Automotive Safety section in the TensorRT Developer Guide.
&&&& RUNNING TensorRT.trtexec [TensorRT v8401] # trtexec --help
=== Model Options ===
--uff=<file> UFF model
--onnx=<file> ONNX model
--model=<file> Caffe model (default = no model, random weights used)
--deploy=<file> Caffe prototxt file
--output=<name>[,<name>]* Output names (it can be specified multiple times); at least one output is required for UFF and Caffe
--uffInput=<name>,X,Y,Z Input blob name and its dimensions (X,Y,Z=C,H,W), it can be specified multiple times; at least one is required for UFF models
--uffNHWC Set if inputs are in the NHWC layout instead of NCHW (use X,Y,Z=H,W,C order in --uffInput)
=== Build Options ===
--maxBatch Set max batch size and build an implicit batch engine (default = same size as --batch)
This option should not be used when the input model is ONNX or when dynamic shapes are provided.
--minShapes=spec Build with dynamic shapes using a profile with the min shapes provided
--optShapes=spec Build with dynamic shapes using a profile with the opt shapes provided
--maxShapes=spec Build with dynamic shapes using a profile with the max shapes provided
--minShapesCalib=spec Calibrate with dynamic shapes using a profile with the min shapes provided
--optShapesCalib=spec Calibrate with dynamic shapes using a profile with the opt shapes provided
--maxShapesCalib=spec Calibrate with dynamic shapes using a profile with the max shapes provided
Note: All three of min, opt and max shapes must be supplied.
However, if only opt shapes is supplied then it will be expanded so
that min shapes and max shapes are set to the same values as opt shapes.
Input names can be wrapped with escaped single quotes (ex: \'Input:0\').
Example input shapes spec: input0:1x3x256x256,input1:1x3x128x128
Each input shape is supplied as a key-value pair where key is the input name and
value is the dimensions (including the batch dimension) to be used for that input.
Each key-value pair has the key and value separated using a colon (:).
Multiple input shapes can be provided via comma-separated key-value pairs.
--inputIOFormats=spec Type and format of each of the input tensors (default = all inputs in fp32:chw)
See --outputIOFormats help for the grammar of type and format list.
Note: If this option is specified, please set comma-separated types and formats for all
inputs following the same order as network inputs ID (even if only one input
needs specifying IO format) or set the type and format once for broadcasting.
--outputIOFormats=spec Type and format of each of the output tensors (default = all outputs in fp32:chw)
Note: If this option is specified, please set comma-separated types and formats for all
outputs following the same order as network outputs ID (even if only one output
needs specifying IO format) or set the type and format once for broadcasting.
IO Formats: spec ::= IOfmt[","spec]
IOfmt ::= type:fmt
type ::= "fp32"|"fp16"|"int32"|"int8"
fmt ::= ("chw"|"chw2"|"chw4"|"hwc8"|"chw16"|"chw32"|"dhwc8"|
"cdhw32"|"hwc"|"dla_linear"|"dla_hwc4")["+"fmt]
--workspace=N Set workspace size in MiB.
--memPoolSize=poolspec Specify the size constraints of the designated memory pool(s) in MiB.
Note: Also accepts decimal sizes, e.g. 0.25MiB. Will be rounded down to the nearest integer bytes.
Pool constraint: poolspec ::= poolfmt[","poolspec]
poolfmt ::= pool:sizeInMiB
pool ::= "workspace"|"dlaSRAM"|"dlaLocalDRAM"|"dlaGlobalDRAM"
--profilingVerbosity=mode Specify profiling verbosity. mode ::= layer_names_only|detailed|none (default = layer_names_only)
--minTiming=M Set the minimum number of iterations used in kernel selection (default = 1)
--avgTiming=M Set the number of times averaged in each iteration for kernel selection (default = 8)
--refit Mark the engine as refittable. This will allow the inspection of refittable layers
and weights within the engine.
--sparsity=spec Control sparsity (default = disabled).
Sparsity: spec ::= "disable", "enable", "force"
Note: Description about each of these options is as below
disable = do not enable sparse tactics in the builder (this is the default)
enable = enable sparse tactics in the builder (but these tactics will only be
considered if the weights have the right sparsity pattern)
force = enable sparse tactics in the builder and force-overwrite the weights to have
a sparsity pattern (even if you loaded a model yourself)
--noTF32 Disable tf32 precision (default is to enable tf32, in addition to fp32)
--fp16 Enable fp16 precision, in addition to fp32 (default = disabled)
--int8 Enable int8 precision, in addition to fp32 (default = disabled)
--best Enable all precisions to achieve the best performance (default = disabled)
--directIO Avoid reformatting at network boundaries. (default = disabled)
--precisionConstraints=spec Control precision constraint setting. (default = none)
Precision Constaints: spec ::= "none" | "obey" | "prefer"
none = no constraints
prefer = meet precision constraints set by --layerPrecisions/--layerOutputTypes if possible
obey = meet precision constraints set by --layerPrecisions/--layerOutputTypes or fail
otherwise
--layerPrecisions=spec Control per-layer precision constraints. Effective only when precisionConstraints is set to
"obey" or "prefer". (default = none)
The specs are read left-to-right, and later ones override earlier ones. "*" can be used as a
layerName to specify the default precision for all the unspecified layers.
Per-layer precision spec ::= layerPrecision[","spec]
layerPrecision ::= layerName":"precision
precision ::= "fp32"|"fp16"|"int32"|"int8"
--layerOutputTypes=spec Control per-layer output type constraints. Effective only when precisionConstraints is set to
"obey" or "prefer". (default = none)
The specs are read left-to-right, and later ones override earlier ones. "*" can be used as a
layerName to specify the default precision for all the unspecified layers. If a layer has more than
one output, then multiple types separated by "+" can be provided for this layer.
Per-layer output type spec ::= layerOutputTypes[","spec]
layerOutputTypes ::= layerName":"type
type ::= "fp32"|"fp16"|"int32"|"int8"["+"type]
--calib=<file> Read INT8 calibration cache file
--safe Enable build safety certified engine
--consistency Perform consistency checking on safety certified engine
--restricted Enable safety scope checking with kSAFETY_SCOPE build flag
--saveEngine=<file> Save the serialized engine
--loadEngine=<file> Load a serialized engine
--tacticSources=tactics Specify the tactics to be used by adding (+) or removing (-) tactics from the default
tactic sources (default = all available tactics).
Note: Currently only cuDNN, cuBLAS, cuBLAS-LT, and edge mask convolutions are listed as optional
tactics.
Tactic Sources: tactics ::= [","tactic]
tactic ::= (+|-)lib
lib ::= "CUBLAS"|"CUBLAS_LT"|"CUDNN"|"EDGE_MASK_CONVOLUTIONS"
For example, to disable cudnn and enable cublas: --tacticSources=-CUDNN,+CUBLAS
--noBuilderCache Disable timing cache in builder (default is to enable timing cache)
--timingCacheFile=<file> Save/load the serialized global timing cache
=== Inference Options ===
--batch=N Set batch size for implicit batch engines (default = 1)
This option should not be used when the engine is built from an ONNX model or when dynamic
shapes are provided when the engine is built.
--shapes=spec Set input shapes for dynamic shapes inference inputs.
Note: Input names can be wrapped with escaped single quotes (ex: \'Input:0\').
Example input shapes spec: input0:1x3x256x256, input1:1x3x128x128
Each input shape is supplied as a key-value pair where key is the input name and
value is the dimensions (including the batch dimension) to be used for that input.
Each key-value pair has the key and value separated using a colon (:).
Multiple input shapes can be provided via comma-separated key-value pairs.
--loadInputs=spec Load input values from files (default = generate random inputs). Input names can be wrapped with single quotes (ex: 'Input:0')
Input values spec ::= Ival[","spec]
Ival ::= name":"file
--iterations=N Run at least N inference iterations (default = 10)
--warmUp=N Run for N milliseconds to warmup before measuring performance (default = 200)
--duration=N Run performance measurements for at least N seconds wallclock time (default = 3)
--sleepTime=N Delay inference start with a gap of N milliseconds between launch and compute (default = 0)
--idleTime=N Sleep N milliseconds between two continuous iterations(default = 0)
--streams=N Instantiate N engines to use concurrently (default = 1)
--exposeDMA Serialize DMA transfers to and from device (default = disabled).
--noDataTransfers Disable DMA transfers to and from device (default = enabled).
--useManagedMemory Use managed memory instead of separate host and device allocations (default = disabled).
--useSpinWait Actively synchronize on GPU events. This option may decrease synchronization time but increase CPU usage and power (default = disabled)
--threads Enable multithreading to drive engines with independent threads or speed up refitting (default = disabled)
--useCudaGraph Use CUDA graph to capture engine execution and then launch inference (default = disabled).
This flag may be ignored if the graph capture fails.
--timeDeserialize Time the amount of time it takes to deserialize the network and exit.
--timeRefit Time the amount of time it takes to refit the engine before inference.
--separateProfileRun Do not attach the profiler in the benchmark run; if profiling is enabled, a second profile run will be executed (default = disabled)
--buildOnly Exit after the engine has been built and skip inference perf measurement (default = disabled)
=== Build and Inference Batch Options ===
When using implicit batch, the max batch size of the engine, if not given,
is set to the inference batch size;
when using explicit batch, if shapes are specified only for inference, they
will be used also as min/opt/max in the build profile; if shapes are
specified only for the build, the opt shapes will be used also for inference;
if both are specified, they must be compatible; and if explicit batch is
enabled but neither is specified, the model must provide complete static
dimensions, including batch size, for all inputs
Using ONNX models automatically forces explicit batch.
=== Reporting Options ===
--verbose Use verbose logging (default = false)
--avgRuns=N Report performance measurements averaged over N consecutive iterations (default = 10)
--percentile=P Report performance for the P percentage (0<=P<=100, 0 representing max perf, and 100 representing min perf; (default = 99%)
--dumpRefit Print the refittable layers and weights from a refittable engine
--dumpOutput Print the output tensor(s) of the last inference iteration (default = disabled)
--dumpProfile Print profile information per layer (default = disabled)
--dumpLayerInfo Print layer information of the engine to console (default = disabled)
--exportTimes=<file> Write the timing results in a json file (default = disabled)
--exportOutput=<file> Write the output tensors to a json file (default = disabled)
--exportProfile=<file> Write the profile information per layer in a json file (default = disabled)
--exportLayerInfo=<file> Write the layer information of the engine in a json file (default = disabled)
=== System Options ===
--device=N Select cuda device N (default = 0)
--useDLACore=N Select DLA core N for layers that support DLA (default = none)
--allowGPUFallback When DLA is enabled, allow GPU fallback for unsupported layers (default = disabled)
--plugins Plugin library (.so) to load (can be specified multiple times)
=== Help ===
--help, -h Print this message
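trtexec parameter reference
Note: the list below appears to follow an older trtexec release than the v8401 help output above, so some entries (for example --explicitBatch and --nvtxMode) and some default values differ from it.
1.1 Model Options
--uff : UFF model file name
--onnx : ONNX model file name
--model : Caffe model file name (default = no model, random weights used)
--deploy : Caffe prototxt file name
--output : Output names (can be specified multiple times); at least one output is required for UFF and Caffe
--uffInput : Input blob name and its dimensions (X,Y,Z=C,H,W); can be specified multiple times; at least one is required for UFF models
--uffNHWC : Set if inputs are in the NHWC layout instead of NCHW (use X,Y,Z=H,W,C order in --uffInput)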
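1.2 Build Options
--maxBatch : Set max batch size and build an implicit batch engine (default = 1)
--explicitBatch : Use explicit batch sizes when building the engine (default = implicit)
--minShapes=spec : Build with dynamic shapes using a profile with the min shapes provided
--optShapes=spec : Build with dynamic shapes using a profile with the opt shapes provided
--maxShapes=spec : Build with dynamic shapes using a profile with the max shapes provided
--minShapesCalib=spec : Calibrate with dynamic shapes using a profile with the min shapes provided
--optShapesCalib=spec : Calibrate with dynamic shapes using a profile with the opt shapes provided
--maxShapesCalib=spec : Calibrate with dynamic shapes using a profile with the max shapes provided
Note: All three of min, opt and max shapes must be supplied. However, if only opt shapes is supplied, it is expanded so that min shapes and max shapes are set to the same values as opt shapes. In addition, using dynamic shapes implies explicit batch. Input names can be wrapped with escaped single quotes (ex: \'Input:0\'). Example input shapes spec: input0:1x3x256x256,input1:1x3x128x128. Each input shape is supplied as a key-value pair where the key is the input name and the value is the dimensions (including the batch dimension) to be used for that input. Each key-value pair has the key and value separated by a colon (:). Multiple input shapes can be provided via comma-separated key-value pairs.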
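To make the spec format concrete, here is a minimal Python sketch that parses a shapes spec string into a dictionary. It is an illustration of the key-value grammar described in the note, not trtexec's own parser. The function name parse_shape_spec is hypothetical, and the sketch assumes that input names contain no commas.

```python
def parse_shape_spec(spec):
    """Parse "input0:1x3x256x256,input1:1x3x128x128" into {name: dims}."""
    shapes = {}
    for pair in spec.split(","):
        # Split on the LAST colon so that names such as 'Input:0' stay intact.
        name, _, dims = pair.strip().rpartition(":")
        if not name or not dims:
            raise ValueError(f"expected name:dims, got {pair!r}")
        name = name.strip("'")  # drop optional single quotes around the name
        shapes[name] = tuple(int(d) for d in dims.split("x"))
    return shapes


print(parse_shape_spec("input0:1x3x256x256,input1:1x3x128x128"))
print(parse_shape_spec("'Input:0':1x3x224x224"))
```

Splitting on the last colon is the design choice that matters here: it is what allows a quoted name containing a colon to be used as a key.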
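--inputIOFormats=spec : Type and format of each of the input tensors (default = all inputs in fp32:chw)
Note: If this option is specified, set comma-separated types and formats for all inputs in the same order as the network input IDs (even if only one input needs an IO format), or set the type and format once for broadcasting.
--outputIOFormats=spec : Type and format of each of the output tensors (default = all outputs in fp32:chw)
Note: If this option is specified, set comma-separated types and formats for all outputs in the same order as the network output IDs (even if only one output needs an IO format), or set the type and format once for broadcasting.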
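The IO format grammar printed in the --help output above (type:fmt entries separated by commas, with alternative formats joined by "+") can be checked before a long build is launched. The Python sketch below validates a spec against the type and format names listed in that help output. The function name parse_io_formats is hypothetical, and this is not the check trtexec itself performs.

```python
# Type and format names copied from the "IO Formats" grammar in the help output.
TYPES = {"fp32", "fp16", "int32", "int8"}
FORMATS = {
    "chw", "chw2", "chw4", "hwc8", "chw16", "chw32",
    "dhwc8", "cdhw32", "hwc", "dla_linear", "dla_hwc4",
}


def parse_io_formats(spec):
    """Parse "fp16:chw16,int8:chw4+chw32" into [(type, [formats]), ...]."""
    result = []
    for item in spec.split(","):
        dtype, sep, fmts = item.strip().partition(":")
        formats = fmts.split("+")
        if not sep or dtype not in TYPES or any(f not in FORMATS for f in formats):
            raise ValueError(f"invalid IO format: {item!r}")
        result.append((dtype, formats))
    return result


print(parse_io_formats("fp16:chw16,int8:chw4+chw32"))
```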
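--workspace=N : Set workspace size in megabytes (default = 16)
--noBuilderCache : Disable timing cache in builder (default is to enable timing cache)
--nvtxMode=mode : Specify NVTX annotation verbosity. mode ::= default|verbose|none
--minTiming=M : Set the minimum number of iterations used in kernel selection (default = 1)
--avgTiming=M : Set the number of times averaged in each iteration for kernel selection (default = 8)
--noTF32 : Disable tf32 precision (default is to enable tf32, in addition to fp32)
--refit : Mark the engine as refittable. This allows the inspection of refittable layers and weights within the engine.
--fp16 : Enable fp16 precision, in addition to fp32 (default = disabled)
--int8 : Enable int8 precision, in addition to fp32 (default = disabled)
--best : Enable all precisions to achieve the best performance (default = disabled)
--calib=<file> : Read INT8 calibration cache file
--safe : Only test the functionality available in safety restricted flows
--saveEngine=<file> : File name to save the serialized engine to
--loadEngine=<file> : File name to load a serialized engine from
--tacticSources=tactics : Specify the tactics to be used by adding (+) or removing (-) tactics from the default tactic sources (default = all available tactics).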
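1.3 Inference Options
--batch=N : Set batch size for implicit batch engines (default = 1)
--shapes=spec : Set input shapes for dynamic shapes inference inputs.
Note: Using dynamic shapes implies explicit batch. Input names can be wrapped with escaped single quotes (ex: \'Input:0\'). Example input shapes spec: input0:1x3x256x256, input1:1x3x128x128. Each input shape is supplied as a key-value pair where the key is the input name and the value is the dimensions (including the batch dimension) to be used for that input. Each key-value pair has the key and value separated by a colon (:). Multiple input shapes can be provided via comma-separated key-value pairs.
--loadInputs=spec : Load input values from files (default = generate random inputs). Input names can be wrapped with single quotes (ex: 'Input:0')
--iterations=N : Run at least N inference iterations (default = 10)
--warmUp=N : Run for N milliseconds to warm up before measuring performance (default = 200)
--duration=N : Run performance measurements for at least N seconds of wallclock time (default = 3)
--sleepTime=N : Delay inference start with a gap of N milliseconds between launch and compute (default = 0)
--streams=N : Instantiate N engines to use concurrently (default = 1)
--exposeDMA : Serialize DMA transfers to and from device (default = disabled)
--noDataTransfers : Do not transfer data to and from the device during inference (default = disabled)
--useSpinWait : Actively synchronize on GPU events. This option may decrease synchronization time but increase CPU usage and power (default = disabled)
--threads : Enable multithreading to drive engines with independent threads (default = disabled)
--useCudaGraph : Use CUDA graph to capture engine execution and then launch inference (default = disabled)
--separateProfileRun : Do not attach the profiler in the benchmark run; if profiling is enabled, a second profile run will be executed (default = disabled)
--buildOnly : Skip inference performance measurement (default = disabled)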
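1.4 Build and Inference Batch Options
When using implicit batch, the max batch size of the engine, if not given, is set to the inference batch size. When using explicit batch, if shapes are specified only for inference, they are also used as min/opt/max in the build profile; if shapes are specified only for the build, the opt shapes are also used for inference; if both are specified, they must be compatible; and if explicit batch is enabled but neither is specified, the model must provide complete static dimensions, including batch size, for all inputs.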
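The explicit-batch rules in the paragraph above amount to a small decision procedure. The Python sketch below restates them for a single input. The function name resolve_shapes is hypothetical, and treating "compatible" as "each inference dimension lies between the min and max of the profile" is an assumption of this sketch, not something the help text spells out.

```python
def resolve_shapes(build=None, inference=None):
    """Apply the explicit-batch fallback rules for one input.

    build is a (min, opt, max) triple of shapes or None;
    inference is a single shape or None.
    """
    if build is None and inference is None:
        # Neither given: the model itself must carry complete static dimensions.
        return None, None
    if build is None:
        build = (inference, inference, inference)  # inference shape used as min/opt/max
    if inference is None:
        inference = build[1]  # opt shape used for inference
    lo, _, hi = build
    # Assumption: "compatible" means min <= inference <= max in every dimension.
    if len(inference) != len(lo) or not all(
        a <= d <= b for a, d, b in zip(lo, inference, hi)
    ):
        raise ValueError("inference shape is outside the build profile")
    return build, inference


profile = ((1, 3, 244, 244), (16, 3, 244, 244), (32, 3, 244, 244))
print(resolve_shapes(build=profile))                 # inference falls back to opt
print(resolve_shapes(inference=(5, 3, 244, 244)))    # profile collapses to one shape
```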
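1.5 Reporting Options
--verbose : Use verbose logging (default = false)
--avgRuns=N : Report performance measurements averaged over N consecutive iterations (default = 10)
--percentile=P : Report performance for the P percentage (0<=P<=100, 0 representing max perf, and 100 representing min perf; default = 99%)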
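To illustrate what the percentile means, the Python sketch below computes a nearest-rank percentile over a list of latencies, so that P = 0 returns the fastest iteration (max perf) and P = 100 the slowest (min perf). This is an assumed definition chosen for illustration; trtexec's own computation may differ in how it rounds or interpolates. The function name percentile_latency is hypothetical.

```python
import math


def percentile_latency(latencies_ms, p):
    """Nearest-rank percentile: the latency that p percent of iterations do not exceed."""
    if not latencies_ms:
        raise ValueError("no measurements")
    if not 0 <= p <= 100:
        raise ValueError("p must be within [0, 100]")
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(p / 100 * len(ordered)))  # 1-based rank
    return ordered[rank - 1]


samples = [4.0, 1.0, 3.0, 2.0]  # placeholder latencies in milliseconds
print(percentile_latency(samples, 0), percentile_latency(samples, 99))
```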
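--dumpRefit : Print the refittable layers and weights from a refittable engine
--dumpOutput : Print the output tensor(s) of the last inference iteration (default = disabled)
--dumpProfile : Print profile information per layer (default = disabled)
--exportTimes=<file> : Write the timing results to a json file (default = disabled)
--exportOutput=<file> : Write the output tensors to a json file (default = disabled)
--exportProfile=<file> : Write the profile information per layer to a json file (default = disabled)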
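1.6 System Options
--device=N : Select cuda device N (default = 0)
--useDLACore=N : Select DLA core N for layers that support DLA (default = none)
--allowGPUFallback : When DLA is enabled, allow GPU fallback for unsupported layers (default = disabled)
--plugins : Plugin library (.so) to load (can be specified multiple times)
1.7 Help
--help, -h : Print the help message above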
References:
https://github.com/NVIDIA/TensorRT/tree/master/samples/trtexec
https://blog.csdn.net/qq_29007291/article/details/116135737
https://blog.csdn.net/HW140701/article/details/120360642
https://developer.nvidia.com/zh-cn/blog/tensorrt-trtexec-cn/
https://www.ccoderun.ca/programming/doxygen/tensorrt/md_TensorRT_samples_opensource_trtexec_README.html