Analysis of TVM model deployment with C++
Official TVM C++ deployment tutorials:
https://github.com/apache/tvm/tree/main/apps/howto_deploy
https://tvm.apache.org/docs/how_to/deploy/cpp_deploy.html
The official docs say that running the run_example.sh script completes the deployment.
Deploy TVM Module using C++ API
An example of deploying a TVM module is provided in apps/howto_deploy. To run the example, use the following commands:
cd apps/howto_deploy
./run_example.sh
Get the TVM runtime library
The only thing required is to link against the TVM runtime on the target platform. TVM provides a minimum runtime, which costs around 300 KB to 600 KB depending on how many modules are used. In most cases, the libtvm_runtime.so that comes with the build is enough.
If building libtvm_runtime proves difficult, check out tvm_runtime_pack.cc. It is an all-in-one example that provides the TVM runtime; it can be compiled with your own build system and included in your project.
You can also check out the example applications built with TVM on iOS, Android, and other platforms.
Dynamic library vs. system module
TVM provides two ways to use a compiled library; see prepare_test_libs.py for how the libraries are generated and cpp_deploy.cc for how they are used:
Store the library as a shared library and load it into the project dynamically.
Bundle the compiled library into the project in system module mode.
Dynamic loading is more flexible and can load new modules on the fly; the system module is a more static approach that can be used where dynamic library loading is forbidden. The two approaches are condensed in the sketch below.
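Condensed from the cpp_deploy.cc listing below (the module path and the "addone"/"addonesys" function names follow that example), the two approaches look roughly like this:

#include <tvm/runtime/module.h>
#include <tvm/runtime/packed_func.h>
#include <tvm/runtime/registry.h>

// Condensed illustration of the two ways to obtain a compiled module.
void LoadBothWays() {
  // 1. Dynamic loading: the compiled library is a shared object on disk.
  tvm::runtime::Module mod_dylib =
      tvm::runtime::Module::LoadFromFile("lib/test_addone_dll.so");
  tvm::runtime::PackedFunc addone = mod_dylib.GetFunction("addone");

  // 2. System module: the library was linked into this executable, so it is
  //    fetched from the global registry instead of being loaded from a file.
  tvm::runtime::Module mod_syslib =
      (*tvm::runtime::Registry::Get("runtime.SystemLib"))();
  tvm::runtime::PackedFunc addonesys = mod_syslib.GetFunction("addonesys");
}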
C++ deployment code
https://github.com/apache/tvm/blob/main/apps/howto_deploy/cpp_deploy.cc
/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied.  See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */

/*!
 * \brief Example code on load and run TVM module.s
 * \file cpp_deploy.cc
 */
#include <dlpack/dlpack.h>
#include <tvm/runtime/module.h>
#include <tvm/runtime/packed_func.h>
#include <tvm/runtime/registry.h>

#include <cstdio>

void Verify(tvm::runtime::Module mod, std::string fname) {
  // Get the function from the module.
  tvm::runtime::PackedFunc f = mod.GetFunction(fname);
  ICHECK(f != nullptr);
  // Allocate the DLPack data structures.
  //
  // Note that we use TVM runtime API to allocate the DLTensor in this example.
  // TVM accept DLPack compatible DLTensors, so function can be invoked
  // as long as we pass correct pointer to DLTensor array.
  //
  // For more information please refer to dlpack.
  // One thing to notice is that DLPack contains alignment requirement for
  // the data pointer and TVM takes advantage of that.
  // If you plan to use your customized data container, please
  // make sure the DLTensor you pass in meet the alignment requirement.
  //
  DLTensor* x;
  DLTensor* y;
  int ndim = 1;
  int dtype_code = kDLFloat;
  int dtype_bits = 32;
  int dtype_lanes = 1;
  int device_type = kDLCPU;
  int device_id = 0;
  int64_t shape[1] = {10};
  TVMArrayAlloc(shape, ndim, dtype_code, dtype_bits, dtype_lanes, device_type, device_id, &x);
  TVMArrayAlloc(shape, ndim, dtype_code, dtype_bits, dtype_lanes, device_type, device_id, &y);
  for (int i = 0; i < shape[0]; ++i) {
    static_cast<float*>(x->data)[i] = i;
  }
  // Invoke the function
  // PackedFunc is a function that can be invoked via positional argument.
  // The signature of the function is specified in tvm.build
  f(x, y);
  // Print out the output
  for (int i = 0; i < shape[0]; ++i) {
    ICHECK_EQ(static_cast<float*>(y->data)[i], i + 1.0f);
  }
  LOG(INFO) << "Finish verification...";
  TVMArrayFree(x);
  TVMArrayFree(y);
}

void DeploySingleOp() {
  // Normally we can directly
  tvm::runtime::Module mod_dylib = tvm::runtime::Module::LoadFromFile("lib/test_addone_dll.so");
  LOG(INFO) << "Verify dynamic loading from test_addone_dll.so";
  Verify(mod_dylib, "addone");
  // For libraries that are directly packed as system lib and linked together with the app
  // We can directly use GetSystemLib to get the system wide library.
  LOG(INFO) << "Verify load function from system lib";
  tvm::runtime::Module mod_syslib = (*tvm::runtime::Registry::Get("runtime.SystemLib"))();
  Verify(mod_syslib, "addonesys");
}

void DeployGraphExecutor() {
  LOG(INFO) << "Running graph executor...";
  // load in the library
  DLDevice dev{kDLCPU, 0};
  tvm::runtime::Module mod_factory = tvm::runtime::Module::LoadFromFile("lib/test_relay_add.so");
  // create the graph executor module
  tvm::runtime::Module gmod = mod_factory.GetFunction("default")(dev);
  tvm::runtime::PackedFunc set_input = gmod.GetFunction("set_input");
  tvm::runtime::PackedFunc get_output = gmod.GetFunction("get_output");
  tvm::runtime::PackedFunc run = gmod.GetFunction("run");

  // Use the C++ API
  tvm::runtime::NDArray x = tvm::runtime::NDArray::Empty({2, 2}, DLDataType{kDLFloat, 32, 1}, dev);
  tvm::runtime::NDArray y = tvm::runtime::NDArray::Empty({2, 2}, DLDataType{kDLFloat, 32, 1}, dev);

  for (int i = 0; i < 2; ++i) {
    for (int j = 0; j < 2; ++j) {
      static_cast<float*>(x->data)[i * 2 + j] = i * 2 + j;
    }
  }
  // set the right input
  set_input("x", x);
  // run the code
  run();
  // get the output
  get_output(0, y);

  for (int i = 0; i < 2; ++i) {
    for (int j = 0; j < 2; ++j) {
      ICHECK_EQ(static_cast<float*>(y->data)[i * 2 + j], i * 2 + j + 1);
    }
  }
}

int main(void) {
  DeploySingleOp();
  DeployGraphExecutor();
  return 0;
}
Makefile
https://github.com/apache/tvm/blob/main/apps/howto_deploy/Makefile
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.

# Makefile Example to deploy TVM modules.
TVM_ROOT=$(shell cd ../..; pwd)
DMLC_CORE=${TVM_ROOT}/3rdparty/dmlc-core

PKG_CFLAGS = -std=c++14 -O2 -fPIC\
	-I${TVM_ROOT}/include\
	-I${DMLC_CORE}/include\
	-I${TVM_ROOT}/3rdparty/dlpack/include\
	-DDMLC_USE_LOGGING_LIBRARY=\<tvm/runtime/logging.h\>

PKG_LDFLAGS = -L${TVM_ROOT}/build -ldl -pthread

.PHONY: clean all

all: lib/cpp_deploy_pack lib/cpp_deploy_normal

# Build rule for all in one TVM package library
lib/libtvm_runtime_pack.o: tvm_runtime_pack.cc
	@mkdir -p $(@D)
	$(CXX) -c $(PKG_CFLAGS) -o $@ $^

# The code library built by TVM
lib/test_addone_sys.o: prepare_test_libs.py
	@mkdir -p $(@D)
	python3 prepare_test_libs.py

# Deploy using the all in one TVM package library
lib/cpp_deploy_pack: cpp_deploy.cc lib/test_addone_sys.o lib/libtvm_runtime_pack.o
	@mkdir -p $(@D)
	$(CXX) $(PKG_CFLAGS) -o $@ $^ $(PKG_LDFLAGS)

# Deploy using pre-built libtvm_runtime.so
lib/cpp_deploy_normal: cpp_deploy.cc lib/test_addone_sys.o
	@mkdir -p $(@D)
	$(CXX) $(PKG_CFLAGS) -o $@ $^ -ltvm_runtime $(PKG_LDFLAGS)

clean:
Reading the Makefile together with the run_example.sh script:
The script first creates the lib directory and then runs sudo make; what make actually does is defined by the Makefile.
make first builds the all-in-one runtime object libtvm_runtime_pack.o in the lib folder.
It then runs prepare_test_libs.py to compile the models into three libraries, test_addone_dll.so, test_addone_sys.o and test_relay_add.so, which are used by cpp_deploy.cc, and finally produces two executables, cpp_deploy_pack and cpp_deploy_normal.
The goal is to take a deep learning network written in another framework, convert it into a .so file with TVM, deploy it with C++, and call it on the GPU. Below is the CPU-style deployment code:
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
/*!
* \brief Example code on load and run TVM module.s
* \file cpp_deploy.cc
*/
#include <dlpack/dlpack.h>
#include <tvm/runtime/module.h>
#include <tvm/runtime/packed_func.h>
#include <tvm/runtime/registry.h>
#include <cstdio>
#include <fstream>
#include <sstream>
#include <string>
using namespace std;
template <class Type>
Type stringToNum(const string& str) {
  istringstream iss(str);
  Type num;
  iss >> num;
  return num;
}

void DeployGraphRuntime() {
  ifstream in("/home/aiteam/tiwang/tvm-tfs-gpu-bkp/data.txt");
  //int image[784];
  string s;
  int image_index = 0;
  /*
  while(getline(in,s))
  {
    image[i]=stringToNum<int>(s);
    ++i;
  }*/
  LOG(INFO) << "Running graph runtime...";
  // load in the library
  DLContext ctx{kDLGPU, 0};
  tvm::runtime::Module mod_factory = tvm::runtime::Module::LoadFromFile("/home/aiteam/tiwang/tvm-tfs-gpu-bkp/model.so");
  // create the graph runtime module
  tvm::runtime::Module gmod = mod_factory.GetFunction("default")(ctx);
  tvm::runtime::PackedFunc set_input = gmod.GetFunction("set_input");
  tvm::runtime::PackedFunc get_output = gmod.GetFunction("get_output");
  tvm::runtime::PackedFunc run = gmod.GetFunction("run");
  // Use the C++ API
  tvm::runtime::NDArray x = tvm::runtime::NDArray::Empty({1, 784}, DLDataType{kDLFloat, 32, 1}, ctx);
  tvm::runtime::NDArray y = tvm::runtime::NDArray::Empty({1, 10}, DLDataType{kDLFloat, 32, 1}, ctx);
  while (getline(in, s)) {
    static_cast<float*>(x->data)[image_index] = ((float)stringToNum<int>(s)) / 255;
    image_index++;
  }
  // set the right input
  set_input("x", x);
  // run the code
  run();
  // get the output
  get_output(0, y);
  for (int i = 0; i < 10; ++i) {
    LOG(INFO) << static_cast<float*>(y->data)[i];
  }
  /*
  for (int i = 0; i < 2; ++i) {
    for (int j = 0; j < 2; ++j) {
      ICHECK_EQ(static_cast<float*>(y->data)[i * 2 + j], i * 2 + j + 1);
    }
  }*/
}

int main(void) {
  //DeploySingleOp();
  DeployGraphRuntime();
  return 0;
}
The idea is simple: read the data in, call set_input, then run, then get_output. However, after changing the target to cuda, the program does not run successfully on the GPU and crashes with a core dump.
The reason is that to run the model on the GPU, memory has to be allocated on the GPU and the input data copied over to it; the code above does neither (it writes straight into x->data), so it crashes at runtime.
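As a minimal sketch of the missing step (assuming the same (1, 784) float32 input, CUDA device 0, and an illustrative helper name StageInputOnGpu; kDLGPU matches the enum name used in the code above), the input tensor is allocated on the GPU and filled through the byte-copy API instead of writing x->data directly:

#include <dlpack/dlpack.h>
#include <tvm/runtime/c_runtime_api.h>

// Sketch only: allocate a (1, 784) float32 tensor on the GPU (kDLGPU, device 0)
// and copy 784 host floats into it. Writing static_cast<float*>(x->data)[i]
// here would dereference device memory from the CPU, which is what crashes.
void StageInputOnGpu(float* host_img) {
  DLTensor* x = nullptr;
  int64_t shape[2] = {1, 784};
  TVMArrayAlloc(shape, 2, kDLFloat, 32, 1, kDLGPU, 0, &x);
  TVMArrayCopyFromBytes(x, host_img, 784 * sizeof(float));
  // ... pass x to set_input("x", x), run(), get_output(0, y),
  // then copy y back to the host with TVMArrayCopyToBytes the same way.
  TVMArrayFree(x);
}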
Below is the complete process of TVM C++ deployment on the GPU. The deep learning model is an MNIST handwritten-digit recognition network written in Keras and saved in pb format; the model code is omitted, and the pb file is read and converted directly. The model input is (1, 784) and the output is (1, 10).
Imports
import tvm
from tvm import te
from tvm import relay
# os and numpy
import numpy as np
import os.path
# Tensorflow imports
import tensorflow as tf
try:
    tf_compat_v1 = tf.compat.v1
except ImportError:
    tf_compat_v1 = tf
# Tensorflow utility functions
import tvm.relay.testing.tf as tf_testing
from tvm.contrib import graph_runtime
Parameter settings
#cpu
#target = "llvm"
#target_host = "llvm"
#layout = None
#ctx = tvm.cpu(0)
#gpu
target = "cuda"
target_host = 'llvm'
layout = "NCHW"
ctx = tvm.gpu(0)
Data preparation
from tensorflow.python.keras.datasets import mnist
from tensorflow.python.keras.utils import np_utils
(x_train,y_train),(x_test,y_test)=mnist.load_data()
x_test1=x_test.reshape(x_test.shape[0],x_test.shape[1]*x_test.shape[2])
print(x_train.shape,x_test.shape)
print(y_train.shape,y_test.shape)
x_train=x_train.reshape(x_train.shape[0],x_train.shape[1]*x_train.shape[2])
x_test=x_test.reshape(x_test.shape[0],x_test.shape[1]*x_test.shape[2])
x_train=x_train/255
x_test=x_test/255
y_train=np_utils.to_categorical(y_train)
y_test=np_utils.to_categorical(y_test)
print(x_train.shape,x_test.shape)
print(y_train.shape,y_test.shape)
with open("data.txt", 'w') as wf:
    for i in range(784):
        wf.write(str(x_test1[12][i]))
        wf.write('\n')
Load the model
with tf_compat_v1.gfile.GFile('./frozen_models/simple_frozen_graph.pb', "rb") as f:
    graph_def = tf_compat_v1.GraphDef()
    graph_def.ParseFromString(f.read())
    graph = tf.import_graph_def(graph_def, name="")
    # Call the utility to import the graph definition into default graph.
    graph_def = tf_testing.ProcessGraphDefParam(graph_def)
    # Add shapes to the graph.
    config = tf.compat.v1.ConfigProto(allow_soft_placement=True)
    with tf_compat_v1.Session() as sess:
        graph_def = tf_testing.AddShapesToGraphDef(sess, "Identity")

tensor_name_list = [tensor.name for tensor in tf.compat.v1.get_default_graph().as_graph_def().node]
for tensor_name in tensor_name_list:
    print(tensor_name, '\n')
Build
shape_dict = {"x": x_train[0:1].shape}
print(shape_dict)
dtype_dict = {"x": "uint8"}
mod, params = relay.frontend.from_tensorflow(graph_def, layout=layout, shape=shape_dict)
print("Tensorflow protobuf imported to relay frontend.")
Compile into a TVM model
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, target_host=target_host, params=params)
Test whether the TVM model works
from tvm.contrib import graph_runtime
tt=np.zeros([1,784])
i=0
file=open("data.txt")
while 1:
    line = file.readline()
    if not line:
        break
    tt[0][i] = int(line)
    i += 1
file.close()
dtype = "float32"
m = graph_runtime.GraphModule(lib["default"](ctx))
# set inputs
m.set_input("x", tvm.nd.array(tt.astype(dtype)))
# execute
m.run()
# get outputs
tvm_output = m.get_output(0, tvm.nd.empty(((1, 10)), "float32"))
print(tvm_output.shape,tvm_output)
Save the model
from tvm.contrib import utils
temp=utils.tempdir()
path_lib=temp.relpath("/home/aiteam/test_code/model.so")
print(path_lib)
lib.export_library(path_lib)
print(temp.listdir())
Then go to the tvm/apps/howto_deploy directory and modify tvm_runtime_pack.cc, adding the following includes:
#include "../../src/runtime/cuda/cuda_device_api.cc"
#include "../../src/runtime/cuda/cuda_module.cc"
Then write a separate .cc file for your own deployment code and modify the Makefile to build it.
The file is named cpp_deploy_bkp.cc.
The modified Makefile:
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
# Makefile Example to deploy TVM modules.
TVM_ROOT=$(shell cd ../..; pwd)
DMLC_CORE=${TVM_ROOT}/3rdparty/dmlc-core
PKG_CFLAGS = -std=c++14 -g -fPIC\
	-I${TVM_ROOT}/include\
	-I${DMLC_CORE}/include\
	-I${TVM_ROOT}/3rdparty/dlpack/include\
	-I/usr/local/cuda/include

PKG_LDFLAGS = -L${TVM_ROOT}/build -ldl -pthread -L/usr/local/cuda/lib64 -lcudart -lcuda

.PHONY: clean all

all: lib/libtvm_runtime_pack.o lib/cpp_deploy_pack
#all: lib/cpp_deploy_pack lib/cpp_deploy_normal

# Build rule for all in one TVM package library
.PHONY: lib/libtvm_runtime_pack.o
lib/libtvm_runtime_pack.o: tvm_runtime_pack.cc
	@mkdir -p $(@D)
	$(CXX) -c $(PKG_CFLAGS) -o $@ $^ $(PKG_LDFLAGS)

# Deploy using the all in one TVM package library
.PHONY: lib/cpp_deploy_pack
lib/cpp_deploy_pack: cpp_deploy_bkp.cc lib/libtvm_runtime_pack.o
	@mkdir -p $(@D)
	$(CXX) $(PKG_CFLAGS) -o $@ $^ $(PKG_LDFLAGS)
The CUDA include directory and the CUDA dynamic library directory have to be added here.
cpp_deploy_bkp.cc:
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
/*!
* \brief Example code on load and run TVM module.s
* \file cpp_deploy.cc
*/
#include <dlpack/dlpack.h>
#include <tvm/runtime/module.h>
#include <tvm/runtime/packed_func.h>
#include <tvm/runtime/registry.h>
#include <cstdio>
#include <fstream>
#include <sstream>
#include <string>
using namespace std;
template <class Type>
Type stringToNum(const string& str) {
  istringstream iss(str);
  Type num;
  iss >> num;
  return num;
}

void DeployGraphRuntime() {
  constexpr int dtype_code = 2U;   // kDLFloat
  constexpr int dtype_bits = 32;
  constexpr int dtype_lines = 1;
  constexpr int device_type = 2;   // kDLGPU
  constexpr int device_id = 0;
  int ndim = 2;
  int64_t in_shape[2] = {1, 784};
  int64_t out_shape[2] = {1, 10};
  DLTensor* DLTX = nullptr;
  DLTensor* DLTY = nullptr;
  TVMArrayAlloc(in_shape, ndim, dtype_code, dtype_bits, dtype_lines, device_type, device_id, &DLTX);
  TVMArrayAlloc(out_shape, ndim, dtype_code, dtype_bits, dtype_lines, device_type, device_id, &DLTY);
  float img[784];
  float rslt[10];
  ifstream in("/home/aiteam/tiwang/data.txt");
  //int image[784];
  string s;
  int image_index = 0;
  /*
  while(getline(in,s))
  {
    image[i]=stringToNum<int>(s);
    ++i;
  }*/
  bool enabled = tvm::runtime::RuntimeEnabled("cuda");
  if (!enabled) {
    LOG(INFO) << "Skip heterogeneous test because cuda is not enabled." << "\n";
    return;
  }
  LOG(INFO) << "Running graph runtime...";
  // load in the library
  DLContext ctx{kDLGPU, 0};
  tvm::runtime::Module mod_factory = tvm::runtime::Module::LoadFromFile("/home/aiteam/test_code/model.so");
  // create the graph runtime module
  tvm::runtime::Module gmod = mod_factory.GetFunction("default")(ctx);
  tvm::runtime::PackedFunc set_input = gmod.GetFunction("set_input");
  tvm::runtime::PackedFunc get_output = gmod.GetFunction("get_output");
  tvm::runtime::PackedFunc run = gmod.GetFunction("run");
  // Use the C++ API
  while (getline(in, s)) {
    if (image_index % 28 == 0)
      printf("\n");
    //static_cast<float*>(x->data)[image_index]=((float)stringToNum<int>(s))/255;
    img[image_index] = ((float)stringToNum<int>(s)) / 255;
    int a = stringToNum<int>(s);
    printf("%4d", a);
    image_index++;
  }
  TVMArrayCopyFromBytes(DLTX, &img[0], image_index * sizeof(float));
  // set the right input
  set_input("x", DLTX);
  // run the code
  run();
  // get the output
  get_output(0, DLTY);
  TVMArrayCopyToBytes(DLTY, &rslt[0], 10 * sizeof(float));
  for (int i = 0; i < 10; ++i) {
    LOG(INFO) << rslt[i];
    //LOG(INFO)<<static_cast<float*>(y->data)[i];
  }
}

int main(void) {
  //DeploySingleOp();
  DeployGraphRuntime();
  return 0;
}
Compared with the earlier CPU deployment code, the GPU deployment adds an explicit tensor-copy step.
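For reference only, the same staging can also be written with the C++ NDArray API instead of raw DLTensors. This is a sketch under the assumptions of the code above ((1, 784) input, (1, 10) output, CUDA device 0, packed functions obtained from the loaded module); it is not part of the original deployment code:

#include <dlpack/dlpack.h>
#include <tvm/runtime/ndarray.h>
#include <tvm/runtime/packed_func.h>

// Sketch: stage input/output through NDArray on the GPU. CopyFromBytes /
// CopyToBytes perform the host<->device transfers, so ->data is never touched.
void RunWithNDArray(tvm::runtime::PackedFunc set_input,
                    tvm::runtime::PackedFunc run,
                    tvm::runtime::PackedFunc get_output,
                    const float* host_in, float* host_out) {
  DLContext ctx{kDLGPU, 0};
  tvm::runtime::NDArray x =
      tvm::runtime::NDArray::Empty({1, 784}, DLDataType{kDLFloat, 32, 1}, ctx);
  tvm::runtime::NDArray y =
      tvm::runtime::NDArray::Empty({1, 10}, DLDataType{kDLFloat, 32, 1}, ctx);
  x.CopyFromBytes(host_in, 784 * sizeof(float));
  set_input("x", x);
  run();
  get_output(0, y);
  y.CopyToBytes(host_out, 10 * sizeof(float));
}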
Reference:
https://discuss.tvm.apache.org/t/deploy-nnvm-module-using-c-on-gpu-using-opencl-target/229
Final result
First run sudo make in the tvm/apps/howto_deploy directory.
The build succeeds; run the executable with ./lib/cpp_deploy_pack.
Reference links:
https://www.cnblogs.com/wangtianning1223/p/14662970.html
https://github.com/apache/tvm/tree/main/apps/howto_deploy
https://discuss.tvm.apache.org/t/deploy-nnvm-module-using-c-on-gpu-using-opencl-target/229/17
https://discuss.tvm.apache.org/t/performance-tvm-pytorch-bert-on-cpu/10200/10
[Performance] TVM - pytorch BERT on CPU
@masahi I add ONNX for the experiments in the following, and it seems using ONNX-runtime gets the best performance regardless of the sequence length (without tuning). I use ONNX-runtime with GraphOptimizationLevel.ORT_ENABLE_ALL, as shown in this link. Besides, I plot the IR graph for ONNX, which is quite complicated.
Experiment 1 - Different Target
Experiment 3 - Different Length
- ONNX IR Graph: drive link
Also, I have some questions about AutoSchedule tuning.
- Q1: @merrymercy I was confused about whether, when I use AutoSchedule to tune TVM, I can use a target like llvm -libs=cblas or I should use only llvm. I found this will give different tasks to tune.
- Q2: @comaniac I think MXNet IR is more friendly than Pytorch IR for AutoSchedule tuning. I set the same parameters for tuning but Pytorch cannot get the same results as MXNet (16ms for seq_len=128). The following are their tuning tasks, and they seem quite different due to the different IR graphs. I am still working on where the problem comes from, the TVM front-end or the original code implementation. But I think maybe TVM should have some transforms to generate similar IR graphs even when coming from different frameworks.
o 1st difference: MXNet will use nn.bias_add() and Pytorch will use relay.add(), which causes the tuning tasks to not include this operation. (tasks 0, 1, 2, 6)
o 2nd difference: Their attention softmax operations have different shapes, but I think this doesn't cause too much latency difference (task 4)
# Tasks for Pytorch AutoSchedule Tuning (Target = llvm)
========== Task 0 (workload key: ["61f56dfd63fda28bc8bcf85739c8e9e3", 128, 3072, 768, 3072, 128, 768]) ==========
placeholder = PLACEHOLDER [128, 3072]
placeholder = PLACEHOLDER [768, 3072]
T_dense(i, j) += (placeholder[i, k]*placeholder[j, k])
========== Task 1 (workload key: ["61f56dfd63fda28bc8bcf85739c8e9e3", 128, 768, 3072, 768, 128, 3072]) ==========
placeholder = PLACEHOLDER [128, 768]
placeholder = PLACEHOLDER [3072, 768]
T_dense(i, j) += (placeholder[i, k]*placeholder[j, k])
========== Task 2 (workload key: ["61f56dfd63fda28bc8bcf85739c8e9e3", 128, 768, 768, 768, 128, 768]) ==========
placeholder = PLACEHOLDER [128, 768]
placeholder = PLACEHOLDER [768, 768]
T_dense(i, j) += (placeholder[i, k]*placeholder[j, k])
========== Task 3 (workload key: ["d2a28fdf41e83222456f5a6e5bf8a24a", 12, 128, 128, 12, 64, 128, 12, 128, 64]) ==========
placeholder = PLACEHOLDER [12, 128, 128]
placeholder = PLACEHOLDER [12, 64, 128]
compute(b, i, j) += (placeholder[b, i, k]*placeholder[b, j, k])
========== Task 4 (workload key: ["868c2771b1610bdac0ac73167691f4eb", 1, 12, 128, 128, 1, 12, 128, 128]) ==========
placeholder = PLACEHOLDER [1, 12, 128, 128]
T_softmax_maxelem(i0, i1, i2) max= placeholder[i0, i1, i2, k]
T_softmax_delta(i0, i1, i2, i3) = (placeholder[i0, i1, i2, i3] - T_softmax_maxelem[i0, i1, i2])
T_fast_exp(ax0, ax1, ax2, ax3) = max((tir.reinterpret(tir.shift_left(int32((tir.floor(((max(min(T_softmax_delta[ax0, ax1, ax2, a ..(OMITTED).. max_delta[ax0, ax1, ax2, ax3], 88.3763f), -88.3763f)*1.4427f) + 0.5f))*0.693147f))) + 1f)), T_softmax_delta[ax0, ax1, ax2, ax3])
T_softmax_expsum(i0, i1, i2) += T_fast_exp[i0, i1, i2, k]
T_softmax_norm(i0, i1, i2, i3) = (T_fast_exp[i0, i1, i2, i3]/T_softmax_expsum[i0, i1, i2])
========== Task 5 (workload key: ["d2a28fdf41e83222456f5a6e5bf8a24a", 12, 128, 64, 12, 128, 64, 12, 128, 128]) ==========
placeholder = PLACEHOLDER [12, 128, 64]
placeholder = PLACEHOLDER [12, 128, 64]
compute(b, i, j) += (placeholder[b, i, k]*placeholder[b, j, k])
========== Task 6 (workload key: ["61f56dfd63fda28bc8bcf85739c8e9e3", 128, 768, 2304, 768, 128, 2304]) ==========
placeholder = PLACEHOLDER [128, 768]
placeholder = PLACEHOLDER [2304, 768]
T_dense(i, j) += (placeholder[i, k]*placeholder[j, k])
========== Task 7 (workload key: ["2dde9ffcbf97381c0f0307643e09dac5", 1, 128, 768, 1, 128, 1]) ==========
placeholder = PLACEHOLDER [1, 128, 768]
placeholder_red(ax0, ax1, ax2) += placeholder[ax0, ax1, k2]
T_divide(ax0, ax1, ax2) = (placeholder_red[ax0, ax1, ax2]/768f)
========== Task 8 (workload key: ["dde89265d3f1a59075cee648386eac1e", 1, 128, 768, 1, 128, 1, 1, 128, 1]) ==========
placeholder = PLACEHOLDER [1, 128, 768]
placeholder = PLACEHOLDER [1, 128, 1]
T_subtract(ax0, ax1, ax2) = (placeholder[ax0, ax1, ax2] - placeholder[ax0, ax1, 0])
T_multiply(ax0, ax1, ax2) = (T_subtract[ax0, ax1, ax2]*T_subtract[ax0, ax1, ax2])
T_multiply_red(ax0, ax1, ax2) += T_multiply[ax0, ax1, k2]
T_divide(ax0, ax1, ax2) = (T_multiply_red[ax0, ax1, ax2]/768f)
========== Task 9 (workload key: ["9e3bd222d4f8d250aeadf2fef0b15f2b", 1, 768, 768, 768, 768, 1, 768]) ==========
placeholder = PLACEHOLDER [1, 768]
placeholder = PLACEHOLDER [768, 768]
T_dense(i, j) += (placeholder[i, k]*placeholder[j, k])
placeholder = PLACEHOLDER [768]
T_add(ax0, ax1) = (T_dense[ax0, ax1] + placeholder[ax1])
T_minimum(ax0, ax1) = min(T_add[ax0, ax1], 9f)
T_maximum(ax0, ax1) = max(T_minimum[ax0, ax1], -9f)
T_fast_tanh(ax0, ax1) = ((T_maximum[ax0, ax1]*(((T_maximum[ax0, ax1]*T_maximum[ax0, ax1])*(((T_maximum[ax0, ax1]*T_maximum[ax0, ..(OMITTED).. *T_maximum[ax0, ax1])*(((T_maximum[ax0, ax1]*T_maximum[ax0, ax1])*1.19826e-06f) + 0.000118535f)) + 0.00226843f)) + 0.00489353f))
# Tasks for MXNet AutoSchedule Tuning (Target = llvm)
========== Task 0 (workload key: ["9847f8cc0b305137f49f2c5c0c8ab25d", 128, 3072, 768, 3072, 768, 128, 768]) ==========
placeholder = PLACEHOLDER [128, 3072]
placeholder = PLACEHOLDER [768, 3072]
T_dense(i, j) += (placeholder[i, k]*placeholder[j, k])
placeholder = PLACEHOLDER [768]
T_add(ax0, ax1) = (T_dense[ax0, ax1] + placeholder[ax1])
========== Task 1 (workload key: ["9847f8cc0b305137f49f2c5c0c8ab25d", 128, 768, 3072, 768, 3072, 128, 3072]) ==========
placeholder = PLACEHOLDER [128, 768]
placeholder = PLACEHOLDER [3072, 768]
T_dense(i, j) += (placeholder[i, k]*placeholder[j, k])
placeholder = PLACEHOLDER [3072]
T_add(ax0, ax1) = (T_dense[ax0, ax1] + placeholder[ax1])
========== Task 2 (workload key: ["9847f8cc0b305137f49f2c5c0c8ab25d", 128, 768, 768, 768, 768, 128, 768]) ==========
placeholder = PLACEHOLDER [128, 768]
placeholder = PLACEHOLDER [768, 768]
T_dense(i, j) += (placeholder[i, k]*placeholder[j, k])
placeholder = PLACEHOLDER [768]
T_add(ax0, ax1) = (T_dense[ax0, ax1] + placeholder[ax1])
========== Task 3 (workload key: ["d2a28fdf41e83222456f5a6e5bf8a24a", 12, 128, 128, 12, 64, 128, 12, 128, 64]) ==========
placeholder = PLACEHOLDER [12, 128, 128]
placeholder = PLACEHOLDER [12, 64, 128]
compute(b, i, j) += (placeholder[b, i, k]*placeholder[b, j, k])
========== Task 4 (workload key: ["4b5e216f8244b4e8f7b6543c4a9087e5", 1536, 128, 1536, 128]) ==========
placeholder = PLACEHOLDER [1536, 128]
T_softmax_maxelem(i0) max= placeholder[i0, k]
T_softmax_delta(i0, i1) = (placeholder[i0, i1] - T_softmax_maxelem[i0])
T_fast_exp(ax0, ax1) = max((tir.reinterpret(tir.shift_left(int32((tir.floor(((max(min(T_softmax_delta[ax0, ax1], 88.3763f), -88. ..(OMITTED).. oor(((max(min(T_softmax_delta[ax0, ax1], 88.3763f), -88.3763f)*1.4427f) + 0.5f))*0.693147f))) + 1f)), T_softmax_delta[ax0, ax1])
T_softmax_expsum(i0) += T_fast_exp[i0, k]
T_softmax_norm(i0, i1) = (T_fast_exp[i0, i1]/T_softmax_expsum[i0])
========== Task 5 (workload key: ["d2a28fdf41e83222456f5a6e5bf8a24a", 12, 128, 64, 12, 128, 64, 12, 128, 128]) ==========
placeholder = PLACEHOLDER [12, 128, 64]
placeholder = PLACEHOLDER [12, 128, 64]
compute(b, i, j) += (placeholder[b, i, k]*placeholder[b, j, k])
========== Task 6 (workload key: ["9847f8cc0b305137f49f2c5c0c8ab25d", 128, 768, 2304, 768, 2304, 128, 2304]) ==========
placeholder = PLACEHOLDER [128, 768]
placeholder = PLACEHOLDER [2304, 768]
T_dense(i, j) += (placeholder[i, k]*placeholder[j, k])
placeholder = PLACEHOLDER [2304]
T_add(ax0, ax1) = (T_dense[ax0, ax1] + placeholder[ax1])
========== Task 7 (workload key: ["2dde9ffcbf97381c0f0307643e09dac5", 128, 1, 768, 128, 1, 1]) ==========
placeholder = PLACEHOLDER [128, 1, 768]
placeholder_red(ax0, ax1, ax2) += placeholder[ax0, ax1, k2]
T_divide(ax0, ax1, ax2) = (placeholder_red[ax0, ax1, ax2]/768f)
========== Task 8 (workload key: ["dde89265d3f1a59075cee648386eac1e", 128, 1, 768, 128, 1, 1, 128, 1, 1]) ==========
placeholder = PLACEHOLDER [128, 1, 768]
placeholder = PLACEHOLDER [128, 1, 1]
T_subtract(ax0, ax1, ax2) = (placeholder[ax0, ax1, ax2] - placeholder[ax0, ax1, 0])
T_multiply(ax0, ax1, ax2) = (T_subtract[ax0, ax1, ax2]*T_subtract[ax0, ax1, ax2])
T_multiply_red(ax0, ax1, ax2) += T_multiply[ax0, ax1, k2]
T_divide(ax0, ax1, ax2) = (T_multiply_red[ax0, ax1, ax2]/768f)
========== Task 9 (workload key: ["9e3bd222d4f8d250aeadf2fef0b15f2b", 1, 768, 768, 768, 768, 1, 768]) ==========
placeholder = PLACEHOLDER [1, 768]
placeholder = PLACEHOLDER [768, 768]
T_dense(i, j) += (placeholder[i, k]*placeholder[j, k])
placeholder = PLACEHOLDER [768]
T_add(ax0, ax1) = (T_dense[ax0, ax1] + placeholder[ax1])
T_minimum(ax0, ax1) = min(T_add[ax0, ax1], 9f)
T_maximum(ax0, ax1) = max(T_minimum[ax0, ax1], -9f)
T_fast_tanh(ax0, ax1) = ((T_maximum[ax0, ax1]*(((T_maximum[ax0, ax1]*T_maximum[ax0, ax1])*(((T_maximum[ax0, ax1]*T_maximum[ax0, ..(OMITTED).. *T_maximum[ax0, ax1])*(((T_maximum[ax0, ax1]*T_maximum[ax0, ax1])*1.19826e-06f) + 0.000118535f)) + 0.00226843f)) + 0.00489353f))
Sorry for my many questions. I'll do my best to do more experiments and figure out why Pytorch AutoSchedule is not working as well as MXNet and why TVM is not working as expected when the sequence length increases.
Thanks for the plentiful information.
For Q1, when you extract tasks with llvm -mcpu=skylake-avx512 -libs=cblas, some operators (i.e., dense) will be offloaded to cblas. It means those operators won't be compiled by the TVM codegen, so AutoScheduler won't see and tune them.
For Q2, the two differences you pointed out seem not really impactful. Maybe you can try to use the debugger to compare the latency breakdown between the two models: Debugger — tvm 0.8.dev0 documentation
@comaniac I follow your instructions to use the debugger and compare the latency in the following IR graphs (latency > 100us in orange). And I find maybe some operations are hard to tune, such as fused_nn_contrib_dense_pack. All my experiments are done with opt_level=3, required_pass=["FastMath"].
Experiment 1 - Compare BERT on MXNet (68.4ms) and Pytorch (122.9ms)
- MXNet Debug IR Graph: drive link
- Pytorch Debug IR Graph: drive link
- From the above graphs, we can find MXNet uses fused_nn_contrib_dense_pack_add while Pytorch uses the fused_nn_contrib_dense_pack operation. This happens for all FC layers (4 in one transformer block), and I use the latency of the first block as an example.
o FC for Query, Key, and Value (M: 503us, P: 1314us)
o FC after self-attention (M: 202us, P: 651us)
o FC after layer normalization (M: 798us, P: 2578us)
o FC after GELU (M: 786us, P: 2572us)
Experiment 2 - Compare BERT on MXNet (68.4ms) and MXNet-tune (autoschedule) (15.4ms)
- MXNet Debug IR Graph: drive link
- MXNet-tune Debug IR Graph: drive link
- From the above graphs, it is easy to find we can reduce most of the dense and batch_matmul operations' latency. I take the first transformer block as an example.
o FC for Query, Key, and Value (M: 503us, M-tune: 229us)
o FC after self-attention (M: 202us, M-tune: 99us)
o FC after layer normalization (M: 798us, M-tune: 292us)
o FC after GELU (M: 786us, M-tune: 367us)
o batch_matmul for Query and Key (M: 1828us, M-tune: 41us)
o batch_matmul for Attention and Value (M: 1312us, M-tune: 29us)
Experiment 3 - Compare BERT on Pytorch (122.9ms) and Pytorch-tune (autoschedule) (64.2ms)
- Pytorch Debug IR Graph: drive link
- Pytorch-tune Debug IR Graph: drive link
- From the above graphs, we can find we only reduce the latency of batch_matmul but not the dense operations. I take the first transformer block as an example.
o FC for Query, Key, and Value (P: 1314us, P-tune: 1276us)
o FC after self-attention (P: 651us, P-tune: 403us)
o FC after layer normalization (P: 2578us, P-tune: 1779us)
o FC after GELU (P: 2572us, P-tune: 1684us)
o batch_matmul for Query and Key (P: 1624us, P-tune: 78us)
o batch_matmul for Attention and Value (P: 1265us, P-tune: 70us)
Therefore, I suspect the problem comes from the fused_nn_contrib_dense_pack operation, but I don't know why this is slower than fused_nn_contrib_dense_pack_add, even though the latter has an additional add operation for adding the bias. If you want the whole debug files (json and logs), I provide them in this drive link.
@comaniac I changed the TVM version to commit id 91e07e1f3a7 (Feb. 5, 2021), which is the same as this repo, and the problem is solved because fused_nn_batch_matmul is now used for all FC (dense) layers rather than fused_nn_contrib_dense_pack. I think the problem comes from the PR you provided, which also causes that I cannot use -libs=mkl. I have the debug IR graphs in the following, and now Pytorch can speed up from the Pytorch script's 26.63ms to 17.36ms after tuning with autoschedule.
- Pytorch Debug IR Graph: drive link
- Pytorch-tune Debug IR Graph: drive link
Hmm, I don't think that PR is the root cause of using batch_matmul in the BERT model though. It might be due to this PR that uses dense instead of batch_matmul when the input shapes are 2D and 3D:
Fix PyTorch matmul conversion when given (2-dim, N-dim) input pair (apache:main ← yuchaoli:main, opened Apr 14, 2021): "This PR changes the matmul conversion in PyTorch when given (2-dim, N-dim) input …"
There's a known issue that TVM's dense op and batch_matmul op with Y = X * W^T have bad performance in some models.
There are several matmul & batch_matmul ops in BERT that take data tensors as both input and weight (e.g. those in multi-head attention) rather than using a const parameter as the weight. In such situations, we would see some explicit transposes inserted when the model is imported from TensorFlow or Pytorch (they use X * W for matmul by default). MXNet, as far as I know, uses X * W^T by default.
The PR you found looks like it creates a special schedule for dense + transpose. I'm not sure if that's the key to the performance improvement you got, because it is written for AutoTVM and AutoScheduler will never use these manual schedules. You can do a more detailed analysis of those dense/batch_matmul ops' layouts and shapes.
I would agree with @comaniac that the mis-conversion from dense to batch_matmul caused some waste of computation before.
Will auto-schedule work in the case of "llvm -libs=cblas"? E.g. a network may contain matmul ops (leveraging cblas) and other ops (leveraging auto-schedule to get performance)?
Deploy NNVM module using C++ on GPU using OpenCL target
@masahi Thanks for the sample.
In your sample code, is "tvm_input" the CPU byte array copied to "x" (the GPU array)?
That is, is it TVMArrayCopyFromBytes(destination, source, size)?
for (int i = 0; i < n_samples; ++i) {
  TVMArrayCopyFromBytes(x, &tvm_input[i * in_size], in_size * sizeof(float));
  set_input(input_name.c_str(), x);
  run();
  get_output(0, y);
  TVMArrayCopyToBytes(y, &tvm_output[i * out_size], out_size * sizeof(float));
}
yes, source is on cpu and x is on gpu. In my code, tvm_input should contain input data coming from your input image, for example.
@masahi
Instead of tvm_input I created a DLTensor 'z' of device type = kDLCPU for storing the input image.
I copied the byte array as shown below:
https://gist.github.com/rajh619/74538a7b3a7e1b89a2ae89db5ab24054
But I couldn't find the copied bytes in the destination (x->data)?
If you already have your data in a DLTensor, you should use TVMArrayCopyFromTo.
@masahi
OK, now I tried with TVMArrayCopyFromTo:
TVMArrayCopyFromTo(z, x, nullptr);
The same issue happens, I couldn't find the bytes copied to x->data.
I think x->data should be the same as z->data (the image data). Please correct me if I am wrong?
if x is on GPU, you should NEVER touch x->data. You either get segfault or complete junk.
If you want to access x->data, copy x to CPU first.
@masahi Thanks, got it working. I copied back from GPU to CPU (x->data to k->data) and validated the data.
After executing run(), I was able to get the output to the CPU in two ways:
- allocate a tvm array for the output tensor "y" with device type CPU (1), then tvm_output(0, y); y->data contains the output (I think internally tvm copies the output from the device to the cpu host?)
- allocate a tvm array for the output tensor "y" with device type GPU (4), tvm_output(0, y), then copy bytes from GPU to CPU into out_vector[] (similar to your sample code).
Which of the two is the right way to extract the output?
The answer is 2.
See my sample code. The output y is GPU tensor. I copy y to tvm_output, which is on cpu.
Hi @masahi, I am still getting a segfault even when I use your sample code; after allocating memory I am doing memset to 0:
DLTensor* x = nullptr;
DLTensor* y = nullptr;
const int in_ndim = 4;
const int out_ndim = in_ndim;
const int num_slice = 1;
const int num_class = 4;
const int shrink_size[] = { 256, 256 };
const int64_t in_shape[] = { num_slice, 1, shrink_size[0], shrink_size[1] };
const int64_t out_shape[] = { num_slice, num_class, shrink_size[0], shrink_size[1] };
TVMArrayAlloc(in_shape, in_ndim, dtype_code, dtype_bits, dtype_lanes, device_type, device_id, &x);
TVMArrayAlloc(out_shape, out_ndim, dtype_code, dtype_bits, dtype_lanes, device_type, device_id, &y);
memset(x->data, 0, 4265265);
It is happening for CUDA and OpenCL; for llvm it's working fine.
you can’t use memset on GPU memory.
Hi @masahi, I still cannot get it working; it still throws a memory error.
void FR_TVM_Deploy::forward(float* imgData)
{
  int in_size = (1 * 64 * 64 * 3 * 4);
  constexpr int dtype_code = kDLFloat;
  constexpr int dtype_bits = 32;
  constexpr int dtype_lanes = 1;
  constexpr int device_type = kDLCPU;
  constexpr int device_id = 0;
  constexpr int in_ndim = 4;
  const int64_t in_shape[in_ndim] = {1, 64, 64, 3};
  // Allocating memory to the DLTensor object
  TVMArrayAlloc(in_shape, in_ndim, dtype_code, dtype_bits, dtype_lanes, device_type, device_id, &input);
  TVMArrayCopyFromBytes(input, imgData, in_size);
  // Get global function module for graph runtime
  tvm::runtime::Module* mod = (tvm::runtime::Module*)handle;
  // get the function from the module (set input data)
  tvm::runtime::PackedFunc set_input = mod->GetFunction("set_input");
  set_input("input", input);
  // get the function from the module (run it)
  tvm::runtime::PackedFunc run = mod->GetFunction("run");
  run();
  int out_ndim = 2;
  int64_t out_shape[2] = {1, 256};
  TVMArrayAlloc(out_shape, out_ndim, dtype_code, dtype_bits, dtype_lanes, device_type, device_id, &output);
  // get the function from the module (get output data)
  tvm::runtime::PackedFunc get_output = mod->GetFunction("get_output");
  get_output(0, output);
  size_t out_size = out_shape[0] * out_shape[1];
  std::vector<float> tvm_output(out_size, 0);
  TVMArrayCopyToBytes(output, &tvm_output[out_size], out_size);
  TVMArrayFree(input);
  TVMArrayFree(output);
}
When I print the tvm_output vector I am getting all 0's, meaning the output is coming out as 0; in the llvm case I am getting the correct output. Here I am printing the vector tvm_output in a loop; is there any other way to check the output?