OpenCL-Based YOLOv3 Post-Processing Optimization (Repost)

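Performance optimization of YOLOv3 post-processing code with OpenCL
Background:
I don't yet understand the overall YOLOv3 architecture very well. I have read many tutorials online, but they seemed complicated to me, probably because I have no feel for much of the terminology. However, the YOLOv3 code is all written in C, and C code is not a problem for me: identify the issue, work out a fix, and change the code.
The YOLOv3 code I received may not be the standard YOLOv3, but the overall processing logic is the same.
Two parts of the YOLOv3 detection stage are time-consuming:
The first hotspot: the network output is stored as uint8, so it must be converted to float32 before it can be processed. On the CPU this conversion runs in a for loop, one element at a time. With large outputs, such as the typical tensor sizes 13×13×255, 26×26×255 and 52×52×255, converting element by element is clearly slow. Measured on our platform: 80 ms.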
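The second hotspot: after the conversion to float, the data is processed further (I don't know the proper term for this step). Its main job is to locate the coordinate positions in the data and apply an exp-based operation to the values that follow, again calling exp once per element in a for loop. Exponentials on the CPU are even less efficient. Measured: 114 ms.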

Both steps can obviously run in parallel on the GPU, which cuts the time dramatically: the computation finishes within a few milliseconds.

After analyzing the code logic, I found that these two parts can be merged, so a single kernel is enough for the whole computation.

#define TENSOR_NUM 3

/* Per-tensor state for the three output tensors */
vsi_nn_tensor_t *tensor[TENSOR_NUM];
uint8_t *tensor_data[TENSOR_NUM];        /* raw tensor data (uint8) */
float *convert_result[TENSOR_NUM];       /* raw data converted uint8 -> float */
float *calculation_result[TENSOR_NUM];   /* result of the exp computation */
/* Element count per tensor. Bug fixed: this was previously uninitialized,
   which corrupted the size computation below and made the allocation fail. */
uint32_t data_num[TENSOR_NUM] = {1, 1, 1};

for (int i = 0; i < TENSOR_NUM; i++)
{
    tensor[i] = vsi_nn_GetTensor(graph, graph->output.tensors[i]);
    /* Element count = product of all dimension sizes */
    for (uint32_t j = 0; j < tensor[i]->attr.dim_num; j++)
    {
        data_num[i] *= tensor[i]->attr.size[j];
    }
    tensor_data[i] = (uint8_t *)vsi_nn_ConvertTensorToData(graph, tensor[i]);
    /* Allocate the output buffers (freed at the end of this snippet) */
    calculation_result[i] = (float *)malloc(sizeof(float) * data_num[i]);
    convert_result[i] = (float *)malloc(sizeof(float) * data_num[i]);
}

/* OpenCL init */
pcl_controller controller = (pcl_controller)malloc(sizeof(cl_controller));
memset(controller, 0, sizeof(cl_controller));
cl_controller_init(controller);

/* OpenCL run: one kernel launch per output tensor */
for (int i = 0; i < TENSOR_NUM; i++)
{
    cl_controller_set(controller,                        /* OpenCL controller */
                      tensor_data[i],                    /* input buffer */
                      convert_result[i],
                      calculation_result[i],
                      data_num[i],                       /* number of elements */
                      tensor[i]->attr.dtype.zero_point,  /* dequantization parameter */
                      tensor[i]->attr.dtype.scale);      /* dequantization parameter */
    cl_controller_run(controller, data_num[i]);          /* run the OpenCL kernel */
    cl_controller_get(controller,                        /* OpenCL controller */
                      convert_result[i],                 /* type-conversion result */
                      calculation_result[i],             /* exp result */
                      data_num[i]);                      /* number of elements */
}

/* OpenCL cleanup */
cl_controller_clean(controller);
/* Note: controller itself was malloc'd above; free it here unless
   cl_controller_clean() already does. */

status = show_result(graph, image_name, convert_result, calculation_result);

/* Free the buffers */
for (int i = 0; i < TENSOR_NUM; i++)
{
    free(calculation_result[i]);
    free(convert_result[i]);
    vsi_nn_Free(tensor_data[i]);
}

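__kernel void fun_exp(__global const unsigned char *data,  /* quantized uint8 input */
                      __global float *convert_data,        /* dequantized float output */
                      __global float *result,              /* sigmoid output */
                      int zero_point,
                      float scale)
{
    /* One work item per element. There is no bounds check, so the
       global work size must equal the element count. */
    int gid = get_global_id(0);

    /* Dequantize: real = (q - zero_point) * scale */
    float x = ((float)data[gid] - (float)zero_point) * scale;
    convert_data[gid] = x;
    /* Sigmoid of the dequantized value */
    result[gid] = 1.f / (1.f + exp(-x));
}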
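When moving a computation like this to the GPU, it helps to compare the values read back against a CPU-computed reference, because the device's exp may differ from the host's in the last bits. Below is a minimal sketch of such a check. The helper name `count_mismatches` and the idea of a tolerance-based comparison are my additions and do not appear in the original code.

```c
#include <stddef.h>

/* Count the elements where |got - ref| exceeds tol.
 * If first_bad is non-NULL and a mismatch exists, the index of the
 * first mismatch is stored there. */
static size_t count_mismatches(const float *got, const float *ref,
                               size_t n, float tol, size_t *first_bad)
{
    size_t bad = 0;
    for (size_t i = 0; i < n; i++) {
        float d = got[i] - ref[i];
        if (d < 0.0f)
            d = -d;
        /* !(d <= tol) also flags NaN, which a plain d > tol test would miss */
        if (!(d <= tol)) {
            if (bad == 0 && first_bad != NULL)
                *first_bad = i;
            bad++;
        }
    }
    return bad;
}
```

A return value of 0 for both `convert_result[i]` and `calculation_result[i]` against their CPU counterparts would indicate that the kernel reproduces the original loops. A tolerance around 1e-5 is a plausible starting point for sigmoid outputs, but the right value depends on the device.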

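The kernel is simple: it converts the uint8 data to float and stores it in memory, then stores the result of the exponential operation (the sigmoid 1/(1+exp(-x))) in another array. The host then reads both arrays back.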

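I haven't been working with YOLOv3 and OpenCL for long, so the code is rough, but it achieves the goal. The two parts above originally took about 200 ms in total; after the OpenCL optimization they take only 24 ms. That figure is not GPU compute time, since the GPU finishes within a few milliseconds. Most of it is spent on the CPU side in operations such as memcpy of the results. I also wrote a memcpy kernel, but it didn't help in practice, so I left it out of the code.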
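To see how the total splits between kernel execution and host-side copies, each stage can be bracketed with a wall-clock timer. The helper below is a small sketch using only standard C11. The name `wall_ms` is hypothetical, and the original article does not say how its timings were taken.

```c
#include <time.h>

/* Current wall-clock time in milliseconds (C11 timespec_get).
 * This is calendar time, not a monotonic clock, so a system clock
 * adjustment during a measurement would distort the result. */
static double wall_ms(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec * 1000.0 + (double)ts.tv_nsec / 1.0e6;
}
```

Reading `wall_ms()` before and after `cl_controller_set`, `cl_controller_run` and `cl_controller_get` and subtracting would give a per-stage breakdown. If the run call returns before the kernel has finished, its cost will show up in whichever later call blocks on it.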

Result: 200 ms -> 24 ms, about 8x faster!

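There is an official GPU version of YOLOv3, but it is implemented in CUDA. It is claimed to be 500 times as efficient as the CPU overall. I have no test environment for it, so I did not test it. I also found a version of YOLOv3 implemented entirely in OpenCL, but I have no environment to run it either. I can share it with anyone interested.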
————————————————
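Copyright notice: this is an original article by CSDN blogger 「疯狂的蕉尼基」, licensed under CC 4.0 BY-SA. Please include the original source link and this notice when reposting.
Original link: https://blog.csdn.net/qq_38505858/article/details/121978416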

posted @ 2022-12-06 16:06 eastgeneral
