Auto-tuning a Convolutional Network for Mobile GPU
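Auto-tuning for a specific device is critical for getting the best performance. This is a tutorial on how to tune a whole convolutional network.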
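The operator implementations for Mobile GPU in TVM are written as templates. A template has many tunable knobs (tile factor, vectorization, unrolling, etc.). We will tune all convolution, depthwise convolution and dense operators in the neural network. After tuning, we get a log file that stores the best knob values for all required operators. When the TVM compiler compiles these operators, it queries this log file to get the best knob values.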
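To make the idea of a knob space concrete, here is a minimal sketch in plain Python. The knob names and candidate values are invented for illustration and are not taken from any TVM template; the point is only that the number of configurations is the product of the number of candidates per knob.

```python
import itertools

# Hypothetical knobs and candidate values, chosen only for illustration.
KNOBS = {
    "tile_x": [1, 2, 4, 8],
    "vectorize": [1, 2, 4],
    "unroll": [False, True],
}


def config_space(knobs):
    """Enumerate every combination of knob values as a list of dicts."""
    names = list(knobs)
    candidates = (knobs[name] for name in names)
    return [dict(zip(names, values)) for values in itertools.product(*candidates)]


if __name__ == "__main__":
    space = config_space(KNOBS)
    print(len(space), "configurations, first one:", space[0])
```

A tuner searches a space like this for the configuration with the best measured speed, and the log file records the winner for each operator.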
Pre-tuned parameters for some ARM devices have also been released. See the Mobile GPU Benchmark for details.
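Note that this tutorial will not run on Windows or recent versions of macOS. To get it to run, you will need to wrap the body of this tutorial in an if __name__ == "__main__": block.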
Install dependencies
To use the autotvm package in tvm, we need to install some extra dependencies (change "3" to "2" if you use python2):
pip3 install --user psutil xgboost tornado
To make TVM run faster during tuning, it is recommended to use cython as the FFI of tvm. In the root directory of tvm, execute (change "3" to "2" if you use python2):
pip3 install --user cython
sudo make cython3
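Before moving on, it can help to confirm that the Python packages installed above are visible to the interpreter you will run the tuning with. The helper below is a small sketch using only the standard library; the function name missing_packages is ours, and the package list is simply copied from the pip command above.

```python
import importlib.util

# Package names taken from the pip command above.
REQUIRED = ("psutil", "xgboost", "tornado")


def missing_packages(names=REQUIRED):
    """Return the names that cannot be found on the current Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    print("missing:", missing if missing else "none")
```

This only checks that the packages can be located; it does not verify their versions.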
Now return to the python code and import the packages.
import os
import numpy as np
import tvm
from tvm import relay, autotvm
import tvm.relay.testing
from tvm.autotvm.tuner import XGBTuner, GATuner, RandomTuner, GridSearchTuner
from tvm.contrib.utils import tempdir
import tvm.contrib.graph_runtime as runtime
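Define network
First we need to define the network in the Relay frontend API. We can load some pre-defined networks from relay.testing. We can also load models from MXNet, ONNX and TensorFlow.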
def get_network(name, batch_size):
    """Get the symbol definition and random weight of a network"""
    # NOTE: `dtype` is the module-level variable set in the tuning options below.
    input_shape = (batch_size, 3, 224, 224)
    output_shape = (batch_size, 1000)

    if "resnet" in name:
        n_layer = int(name.split("-")[1])
        mod, params = relay.testing.resnet.get_workload(
            num_layers=n_layer, batch_size=batch_size, dtype=dtype
        )
    elif "vgg" in name:
        n_layer = int(name.split("-")[1])
        mod, params = relay.testing.vgg.get_workload(
            num_layers=n_layer, batch_size=batch_size, dtype=dtype
        )
    elif name == "mobilenet":
        mod, params = relay.testing.mobilenet.get_workload(batch_size=batch_size, dtype=dtype)
    elif name == "squeezenet_v1.1":
        mod, params = relay.testing.squeezenet.get_workload(
            batch_size=batch_size, version="1.1", dtype=dtype
        )
    elif name == "inception_v3":
        input_shape = (batch_size, 3, 299, 299)
        mod, params = relay.testing.inception_v3.get_workload(batch_size=batch_size, dtype=dtype)
    elif name == "mxnet":
        # an example for mxnet model
        from mxnet.gluon.model_zoo.vision import get_model

        block = get_model("resnet18_v1", pretrained=True)
        mod, params = relay.frontend.from_mxnet(block, shape={"data": input_shape}, dtype=dtype)
        net = mod["main"]
        net = relay.Function(
            net.params, relay.nn.softmax(net.body), None, net.type_params, net.attrs
        )
        mod = tvm.IRModule.from_expr(net)
    else:
        raise ValueError("Unsupported network: " + name)

    return mod, params, input_shape, output_shape
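The resnet and vgg branches above expect names of the form "family-layers", such as "resnet-18"; a bare "resnet" would fail with an IndexError at name.split("-")[1]. The sketch below shows that naming convention in isolation. The helper split_network_name is hypothetical and is not part of the tutorial or of TVM.

```python
def split_network_name(name):
    """Split a name such as "resnet-18" into ("resnet", 18).

    Names without a numeric suffix, such as "mobilenet", return (name, None).
    """
    family, sep, suffix = name.partition("-")
    if sep and suffix.isdigit():
        return family, int(suffix)
    return name, None


if __name__ == "__main__":
    for example in ("resnet-18", "vgg-16", "mobilenet", "squeezenet_v1.1"):
        print(example, "->", split_network_name(example))
```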
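Start RPC Tracker
TVM uses an RPC session to communicate with ARM boards. During tuning, the tuner sends the generated code to the board and measures the speed of the code on the board.
To scale up the tuning, TVM uses an RPC Tracker to manage distributed devices. The RPC Tracker is a centralized controller node. We can register all devices to the tracker. For example, if we have 10 phones, we can register all of them to the tracker and run 10 measurements in parallel, which speeds up the tuning process.
To start an RPC tracker, run this command on the host machine. The tracker is required during the whole tuning process, so we need to open a new terminal for this command: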
python -m tvm.exec.rpc_tracker --host=0.0.0.0 --port=9190
The expected output is
INFO:RPCTracker:bind to 0.0.0.0:9190
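If the tracker runs on another machine, a quick way to check that its port is reachable from where you are is a plain TCP connection attempt. The following is a sketch using only the standard library; tracker_reachable is a name we made up, and a successful connection only shows that something is listening on the port, not that it is a TVM tracker.

```python
import socket


def tracker_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 9190 is the port used in the tracker command above.
    print(tracker_reachable("127.0.0.1", 9190))
```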
Register devices to RPC Tracker
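Now we can register our devices to the tracker. The first step is to build the TVM runtime for the ARM devices.
- For Linux: build the TVM runtime on the device, then register the device to the tracker with
python -m tvm.exec.rpc_server --tracker=[HOST_IP]:9190 --key=rk3399
(replace [HOST_IP] with the IP address of your host machine)
- For Android: follow this readme page to install the TVM RPC APK on the Android device. Make sure you can pass the android RPC test. Then your device is already registered. During tuning, you have to go to developer options, enable "keep screen awake while charging", and keep the phone charging so that it stays stable.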
After registering the devices, we can confirm the registration by querying rpc_tracker.
python -m tvm.exec.query_rpc_tracker --host=0.0.0.0 --port=9190
For example, if we have 2 Huawei mate10 pro, 11 Raspberry Pi 3B and 2 rk3399, the output can be
Queue Status
----------------------------------
key          total  free  pending
----------------------------------
mate10pro    2      2     0
rk3399       2      2     0
rpi3b        11     11    0
----------------------------------
You can register multiple devices to the tracker to speed up the measurements during tuning.
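When many devices are registered, it can be convenient to turn the queue status into a dictionary, for example to check that the expected number of boards is free before starting a long run. The parser below is a sketch that assumes data rows look like the sample above (a key followed by three integer counters); the real command output may contain additional sections, which this parser simply ignores.

```python
def parse_queue_status(text):
    """Parse queue-status rows into {key: {"total", "free", "pending"}}."""
    status = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have a key followed by exactly three integer counters.
        if len(fields) == 4 and all(f.isdigit() for f in fields[1:]):
            key, total, free, pending = fields
            status[key] = {
                "total": int(total),
                "free": int(free),
                "pending": int(pending),
            }
    return status


# Sample text copied from the example output above.
SAMPLE = """\
Queue Status
----------------------------------
key          total  free  pending
----------------------------------
mate10pro    2      2     0
rk3399       2      2     0
rpi3b        11     11    0
----------------------------------
"""

if __name__ == "__main__":
    for key, counts in parse_queue_status(SAMPLE).items():
        print(key, counts)
```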
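Set Tuning Options
Before tuning, we should apply some configurations. Here an RK3399 board is used as the example. In your setup, you should modify the target and device_key accordingly. Set use_android to True if you use an Android phone.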
#### DEVICE CONFIG ####

target = tvm.target.Target("opencl -device=mali")

# Replace "aarch64-linux-gnu" with the correct target of your board.
# This target host is used for cross compilation. You can query it by :code:`gcc -v` on your device.
target_host = "llvm -mtriple=aarch64-linux-gnu"

# Also replace this with the device key in your tracker
device_key = "rk3399"

# Set this to True if you use android phone
use_android = False

#### TUNING OPTION ####
network = "resnet-18"
log_file = "%s.%s.log" % (device_key, network)
dtype = "float32"

tuning_option = {
    "log_filename": log_file,
    "tuner": "xgb",
    "n_trial": 1000,
    "early_stopping": 450,
    "measure_option": autotvm.measure_option(
        builder=autotvm.LocalBuilder(build_func="ndk" if use_android else "default"),
        runner=autotvm.RPCRunner(
            device_key,
            host="0.0.0.0",
            port=9190,
            number=10,
            timeout=5,
        ),
    ),
}
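Note
How to set tuning options
In general, the default values provided here work well. If you have a large enough time budget, you can set n_trial and early_stopping to larger values, which makes the tuning run longer. If your device runs very slowly or your conv2d operators have many GFLOPs, consider setting timeout to a larger value.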
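To see how n_trial and early_stopping interact, here is a toy model in plain Python. It is not AutoTVM's implementation; it assumes one plausible reading of the options, namely that a task stops after n_trial trials, or earlier once the best score has not improved for early_stopping consecutive trials. The function name run_trials and the score list are made up for illustration.

```python
def run_trials(scores, n_trial, early_stopping=None):
    """Walk through `scores` and return (best_score, trials_used).

    Stops after `n_trial` trials, or earlier when the best score has not
    improved during the last `early_stopping` trials.
    """
    best, best_iter, used = float("-inf"), 0, 0
    for i, score in enumerate(scores[:n_trial]):
        used = i + 1
        if score > best:
            best, best_iter = score, i
        if early_stopping is not None and i >= best_iter + early_stopping:
            break
    return best, used


if __name__ == "__main__":
    measured = [1, 5, 2, 3, 4, 9]  # pretend GFLOPS of successive trials
    print(run_trials(measured, n_trial=6, early_stopping=3))
    print(run_trials(measured, n_trial=6))
```

In this toy run, early stopping gives up before reaching the best configuration, which is the trade-off behind choosing a larger early_stopping when time allows.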
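Begin Tuning
Now we can extract tuning tasks from the network and begin tuning. Here, we provide a simple utility function to tune a list of tasks. This function is just an initial implementation that tunes the tasks in sequential order. A more sophisticated tuning scheduler will be introduced in the future.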
# You can skip the implementation of this function for this tutorial.
def tune_tasks(
    tasks,
    measure_option,
    tuner="xgb",
    n_trial=1000,
    early_stopping=None,
    log_filename="tuning.log",
    use_transfer_learning=True,
):
    # create tmp log file
    tmp_log_file = log_filename + ".tmp"
    if os.path.exists(tmp_log_file):
        os.remove(tmp_log_file)

    for i, tsk in enumerate(reversed(tasks)):
        prefix = "[Task %2d/%2d] " % (i + 1, len(tasks))

        # create tuner
        if tuner == "xgb" or tuner == "xgb-rank":
            tuner_obj = XGBTuner(tsk, loss_type="rank")
        elif tuner == "ga":
            tuner_obj = GATuner(tsk, pop_size=50)
        elif tuner == "random":
            tuner_obj = RandomTuner(tsk)
        elif tuner == "gridsearch":
            tuner_obj = GridSearchTuner(tsk)
        else:
            raise ValueError("Invalid tuner: " + tuner)

        if use_transfer_learning:
            if os.path.isfile(tmp_log_file):
                tuner_obj.load_history(autotvm.record.load_from_file(tmp_log_file))

        # do tuning
        tsk_trial = min(n_trial, len(tsk.config_space))
        tuner_obj.tune(
            n_trial=tsk_trial,
            early_stopping=early_stopping,
            measure_option=measure_option,
            callbacks=[
                autotvm.callback.progress_bar(tsk_trial, prefix=prefix),
                autotvm.callback.log_to_file(tmp_log_file),
            ],
        )

    # pick best records to a cache file
    autotvm.record.pick_best(tmp_log_file, log_filename)
    os.remove(tmp_log_file)
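The last step of tune_tasks reduces the temporary log, which holds every measurement, to one best record per operator. The idea can be sketched without TVM as follows. The record layout (workload, config, cost in seconds) and the name keep_best_per_workload are invented for this illustration and do not reflect the actual AutoTVM log format.

```python
def keep_best_per_workload(records):
    """Keep the lowest-cost record for each workload.

    `records` is an iterable of (workload, config, cost_seconds) tuples.
    Returns {workload: (config, cost_seconds)}.
    """
    best = {}
    for workload, config, cost in records:
        if workload not in best or cost < best[workload][1]:
            best[workload] = (config, cost)
    return best


if __name__ == "__main__":
    # Made-up measurements for two workloads.
    measurements = [
        ("conv2d_a", {"tile_x": 2}, 0.0041),
        ("conv2d_a", {"tile_x": 4}, 0.0029),
        ("conv2d_b", {"tile_x": 8}, 0.0100),
        ("conv2d_a", {"tile_x": 8}, 0.0035),
    ]
    for workload, (config, cost) in keep_best_per_workload(measurements).items():
        print(workload, config, cost)
```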
Finally, we launch the tuning jobs and evaluate the end-to-end performance.
def tune_and_evaluate(tuning_opt):
    # extract workloads from relay program
    print("Extract tasks...")
    mod, params, input_shape, _ = get_network(network, batch_size=1)
    tasks = autotvm.task.extract_from_program(
        mod["main"],
        target=target,
        target_host=target_host,
        params=params,
        ops=(relay.op.get("nn.conv2d"),),
    )

    # run tuning tasks
    print("Tuning...")
    tune_tasks(tasks, **tuning_opt)

    # compile kernels with history best records
    with autotvm.apply_history_best(log_file):
        print("Compile...")
        with tvm.transform.PassContext(opt_level=3):
            lib = relay.build_module.build(
                mod, target=target, params=params, target_host=target_host
            )

        # export library
        tmp = tempdir()
        if use_android:
            from tvm.contrib import ndk

            filename = "net.so"
            lib.export_library(tmp.relpath(filename), ndk.create_shared)
        else:
            filename = "net.tar"
            lib.export_library(tmp.relpath(filename))

        # upload module to device
        print("Upload...")
        remote = autotvm.measure.request_remote(device_key, "0.0.0.0", 9190, timeout=10000)
        remote.upload(tmp.relpath(filename))
        rlib = remote.load_module(filename)

        # upload parameters to device
        ctx = remote.context(str(target), 0)
        module = runtime.GraphModule(rlib["default"](ctx))
        data_tvm = tvm.nd.array((np.random.uniform(size=input_shape)).astype(dtype))
        module.set_input("data", data_tvm)

        # evaluate
        print("Evaluate inference time cost...")
        ftimer = module.module.time_evaluator("run", ctx, number=1, repeat=30)
        prof_res = np.array(ftimer().results) * 1000  # convert to millisecond
        print(
            "Mean inference time (std dev): %.2f ms (%.2f ms)"
            % (np.mean(prof_res), np.std(prof_res))
        )


# We do not run the tuning in our webpage server since it takes too long.
# Uncomment the following line to run it by yourself.

# tune_and_evaluate(tuning_option)
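The final print statement reports the mean and standard deviation of the repeated timings, after converting seconds to milliseconds. The same arithmetic can be sketched with the standard library alone; summarize_ms is a hypothetical helper, and the timing values are made up. Note that np.std defaults to the population standard deviation, which corresponds to statistics.pstdev.

```python
import statistics


def summarize_ms(results_s):
    """Return (mean, population std dev) in milliseconds for timings in seconds."""
    times_ms = [t * 1000.0 for t in results_s]
    return statistics.fmean(times_ms), statistics.pstdev(times_ms)


if __name__ == "__main__":
    # Made-up timings in seconds, standing in for ftimer().results.
    mean_ms, std_ms = summarize_ms([0.127, 0.131, 0.125, 0.140])
    print("Mean inference time (std dev): %.2f ms (%.2f ms)" % (mean_ms, std_ms))
```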
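Sample Output
The tuning needs to compile many programs and extract features from them, so a high-performance CPU is recommended. One sample output is listed below. It takes about 3 hours on a 32T AMD Ryzen Threadripper.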
Extract tasks...
Tuning...
[Task 1/17] Current/Best: 25.30/ 39.12 GFLOPS | Progress: (992/1000) | 751.22 s Done.
[Task 2/17] Current/Best: 40.70/ 45.50 GFLOPS | Progress: (736/1000) | 545.46 s Done.
[Task 3/17] Current/Best: 38.83/ 42.35 GFLOPS | Progress: (992/1000) | 1549.85 s Done.
[Task 4/17] Current/Best: 23.31/ 31.02 GFLOPS | Progress: (640/1000) | 1059.31 s Done.
[Task 5/17] Current/Best: 0.06/ 2.34 GFLOPS | Progress: (544/1000) | 305.45 s Done.
[Task 6/17] Current/Best: 10.97/ 17.20 GFLOPS | Progress: (992/1000) | 1050.00 s Done.
[Task 7/17] Current/Best: 8.98/ 10.94 GFLOPS | Progress: (928/1000) | 421.36 s Done.
[Task 8/17] Current/Best: 4.48/ 14.86 GFLOPS | Progress: (704/1000) | 582.60 s Done.
[Task 9/17] Current/Best: 10.30/ 25.99 GFLOPS | Progress: (864/1000) | 899.85 s Done.
[Task 10/17] Current/Best: 11.73/ 12.52 GFLOPS | Progress: (608/1000) | 304.85 s Done.
[Task 11/17] Current/Best: 15.26/ 18.68 GFLOPS | Progress: (800/1000) | 747.52 s Done.
[Task 12/17] Current/Best: 17.48/ 26.71 GFLOPS | Progress: (1000/1000) | 1166.40 s Done.
[Task 13/17] Current/Best: 0.96/ 11.43 GFLOPS | Progress: (960/1000) | 611.65 s Done.
[Task 14/17] Current/Best: 17.88/ 20.22 GFLOPS | Progress: (672/1000) | 670.29 s Done.
[Task 15/17] Current/Best: 11.62/ 13.98 GFLOPS | Progress: (736/1000) | 449.25 s Done.
[Task 16/17] Current/Best: 19.90/ 23.83 GFLOPS | Progress: (608/1000) | 708.64 s Done.
[Task 17/17] Current/Best: 17.98/ 22.75 GFLOPS | Progress: (736/1000) | 1122.60 s Done.
Compile...
Upload...
Evaluate inference time cost...
Mean inference time (std dev): 128.05 ms (7.74 ms)
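To compare runs or to find the tasks that dominate the tuning time, the progress lines can be parsed with a regular expression. The sketch below assumes lines shaped like the sample output above; the pattern and the helper parse_progress are ours, and other TVM versions may print a different layout.

```python
import re

# Matches lines such as:
# [Task 1/17] Current/Best: 25.30/ 39.12 GFLOPS | Progress: (992/1000) | 751.22 s Done.
PROGRESS = re.compile(
    r"\[Task\s+(\d+)/\s*(\d+)\].*?Progress: \((\d+)/(\d+)\) \| ([\d.]+) s"
)


def parse_progress(text):
    """Return a list of (task_index, trials_done, seconds) tuples."""
    rows = []
    for line in text.splitlines():
        match = PROGRESS.search(line)
        if match:
            task, _total_tasks, done, _budget, seconds = match.groups()
            rows.append((int(task), int(done), float(seconds)))
    return rows


# Three lines copied from the sample output above.
SAMPLE_LOG = """\
[Task 1/17] Current/Best: 25.30/ 39.12 GFLOPS | Progress: (992/1000) | 751.22 s Done.
[Task 2/17] Current/Best: 40.70/ 45.50 GFLOPS | Progress: (736/1000) | 545.46 s Done.
[Task 3/17] Current/Best: 38.83/ 42.35 GFLOPS | Progress: (992/1000) | 1549.85 s Done.
"""

if __name__ == "__main__":
    rows = parse_progress(SAMPLE_LOG)
    total_seconds = sum(seconds for _, _, seconds in rows)
    print("tasks parsed:", len(rows), "total hours: %.2f" % (total_seconds / 3600))
```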
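Note
Experiencing difficulties?
The auto-tuning module is error-prone. If you always see "0.00/ 0.00 GFLOPS", then something must be wrong.
First, make sure you set the correct configuration for your device. Then, you can print debug information by adding these lines at the beginning of the script. They print every measurement result, where you can find useful error messages.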
import logging
logging.getLogger('autotvm').setLevel(logging.DEBUG)
Finally, always feel free to ask the community for help at https://discuss.tvm.apache.org.