程序项目代做,有需求私信(vue、React、Java、爬虫、电路板设计、嵌入式linux等)

Rockchip RK3588 - OpenCL环境搭建

在上一节《Rockchip RK3588 - 基于Qt的视频监控和控制系统 》,我们介绍了实时监控的实现,在实时监控中我们需要将分辨率为1920x1080的图像缩放为指定窗口大小的图像,当采样帧率比较高时,会占用大量的CPU资源;

root@NanoPC-T6:/opt/qt-project/FloatVideo-TouchScreen# export DISPLAY=:0.0;./FloatVideo-TouchScreen -size 0.8
# 打开新的终端
root@NanoPC-T6:~# top
任务: 278 total,   2 running, 276 sleeping,   0 stopped,   0 zombie
%Cpu(s): 36.0 us,  1.9 sy,  0.0 ni, 62.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  15953.1 total,  14749.5 free,    662.4 used,    541.2 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  14995.4 avail Mem

 进程号 USER      PR  NI    VIRT    RES    SHR    %CPU  %MEM     TIME+ COMMAND
   1513 root      20   0 2876120 127804  72488 S 270.9   0.8   0:27.46 FloatVideo-Touc
    864 root      20   0 3345456 238548 186388 S  14.9   1.5   0:03.11 Xorg
   1251 pi        20   0 1861028  76424  57116 S   2.6   0.5   0:00.96 xfwm4
   ......

那么我们是不是可以通过GPU来实现图像的缩放呢,在RK3588上可以使用OpenCL接口进行GPU加速。

一、OpenCL环境搭建

OpenCLOpen Computing Language开放计算语言)是一种开放的、免版税的标准,用于超级计算机、云服务器、个人计算机、移动设备和嵌入式平台中各种加速器的跨平台并行编程。

OpenCL是由Khronos Group创建和管理的。OpenCL使应用程序能够使用系统或设备中的并行处理能力,从而使应用程序运行得更快、更流畅。

1.1 工作原理

OpenCL是一种编程框架和运行时,它使程序员能够创建称为内核程序(或内核)的小程序,这些程序可以在系统中的任何处理器上并行编译和执行。处理器可以是不同类型的任意组合,包括CPUGPUDSPFPGA或张量处理器,这就是为什么OpenCL经常被称为异构并行编程的解决方案。

OpenCL框架包含两个API

  • platform layer API:在主机CPU上运行,首先用于使程序能够发现系统中可用的并行处理器或计算设备。通过查询哪些计算设备可用,应用程序可以在不同的系统上便携地运行—适应加速器硬件的不同组合。一旦发现了计算设备,platform layer API就允许应用程序选择并初始化它想要使用的设备;
  • Runtime API:它使应用程序的内核程序能够为它们将要运行的计算设备编译,并行加载到这些处理器上并执行。一旦内核程序完成执行,将使用Runtime API收集结果;

为了更好适用于不同的处理器,OpenCL抽象出来了四大模型:

  • 平台模型:描述了OpenCL如何理解拓扑连接系统中的计算资源,对不同硬件及软件实现抽象,方便应用于不同设备;
  • 内存模型:对硬件的各种内存器进行了抽象;
  • 执行模型:程序是如何在硬件上执行的;
  • 编程模型:数据并行和任务并行;

1.2 平台模型

OpenCL中,需要一个主机处理器(Host),一般为CPU。而其它的硬件处理器(多核CPU/GPU/DSP等)被抽象成Compute Device

  • 每个Compute Device包含多个Compute Unit
  • 每个Compute Unit又包含多个Processing Elements(处理单元)。

举例说明:计算设备可以是GPU,计算单元对应于GPU内部的流多处理器(streaming multiprocessors(SMs)),处理单元对应于每个SM内部的单个流处理器。处理器通常通过共享指令调度和内存资源,以及增加本地处理器间通信,将处理单元分组为计算单元,以提高实现效率。

1.3 内存模型

OpenCL内存模型定义了如何访问和共享不同内核和处理单元之间的数据。

1.3.1 内存类型

OpenCL支持以下内存类型:

  • Global memory: 全局内存对在上下文中执行的所有工作项可访问,主机可以使用__global关键字读取、写入和映射命令访问全局内存,在单个工作组中,全局内存是一致的;
  • Constant memory:常量内存是用于主机分配和初始化的对象的内存区域, 所有工作项都可以以只读方式访问常量内存;
  • Local memory: 本地内存是特定于工作组的,工作组中的工作项可以访问本地内存;使用__local关键字进行访问,对于工作组中的所有工作项来说,本地内存是一致的;
  • Private memory:私有内存是特定于工作项的,其他工作项无法访问私有内存;
1.3.2 内存模型

OpenCL内存模型如下:

1.4 执行模型

OpenCL执行模型包括主机应用程序、上下文(context)和OpenCL内核的操作。

主机应用程序使用OpenCL命令队列将kernel和数据传输函数发送到设备以执行。

通过将命令入队到命令队列(Command Queues)中,kernel和数据传输函数可以与应用程序主机代码并行异步执行。

1.4.1 主机应用程序

主机应用程序在应用处理器上运行。主机应用程序通过为以下命令设置命令队列来管理内核的执行:

  • 内存命令;
  • 内核执行命令;
  • 同步操作;
1.4.2 上下文

主机应用程序为内核定义上下文。上下文包括:

  • 计算设备(Compute devices);

  • 内核(Kernels):OpenCL核心计算部分,类似C语言的代码。在需要设备执行计算任务时,数据会被推送到Compute Device,然后Compute Device的计算单元会并发执行内核程序;

  • 程序对象(Programs):Kernels的集合,OpenCL中可以使用cl_program表示;

  • 内存对象(Memory Objects.);

1.4.3 OpenCL内核的操作

Kernels在计算设备上运行。kernel是一段代码,在计算设备上与其它内核并行执行。内核的操作按以下顺序进行:

  • Kernels在主机应用程序中定义;
  • 主机应用程序将kernel提交给计算设备执行。计算设备可以是应用处理器、GPU或其它类型的处理器;
  • 当主机应用程序发出提交kernel的命令时,OpenCL创建工作项的NDRange
  • 对于NDRange中的每个元素,创建kernel的一个实例。这使得每个元素可以独立并行地进行处理。

1.5 OpenCL计算流程

对于OpenCl,利用显卡计算时,需要经历如下步骤:

  • 主机应用程序进行设备初始化(获取平台和设备id,创建上下文和命令队列);
  • 编写并编译kernel(读取内核文件->创建program对象->编译程序->创建内核) ;
  • 主机应用程序准备数据并传入设备(准备主机端数据,创建设备端内存对象并拷贝主机端数据);
  • 主机应用程序将kernel提交给设备执行(传入kernel函数参数, 启动kernel函数);
  • 将结果拷贝回主机应用程序;
  • 后续处理;
  • 释放资源。

二、OpenCL环境搭建

一个完整的OpenCL框架,从内核层到用户层,可分为四部分:

  • 内核层GPU驱动;
  • 用户层动态库;
  • 头文件;
  • 应用程序;

在《Rockchip RK3399 - Mali-T860 GPU驱动(mesa+Panfrost》文章中我们提到GPU驱动一般分为两部分:

  • 一小部分在linux内核中;
  • 另外一大部分在Userspace,在Userspace的部分向下操作内核中的驱动,向上对应用层提供标准的 API 接口,例如:
    • OpenGL ES 1.1、2.0、3.0、3.1、3.2
    • OpenCL 1.1、1.2、2.0
    • Vulkan 1.0
    • RenderScript(受支持的API列表因二进制和GPU类型而异);

Mail GPU IP提供商ARM公司只开放了内核部分驱动,而且这部分驱动还没有按照linux kernel的规范以DRM的框架去实现,此外Userspace部分ARM没有开源,只是以库的形式提供给购买了Mali GPU授权的SoC厂商。

2.1 内核层GPU驱动

RK3588为例,搭载了Mail-G610。内核层GPU驱动这一部分,不需要自己移植,我们开发板所使用的的友善linux kernel 6.1已移植。

接下来的内容仅仅是作为扩展,不感兴趣忽略即可。ARM Mail G610采用的Bifrost 驱动微架构,如果需要自己移植内核层GPU驱动的话。可以从Open Source Bifrost Mali 3rd Gen GPU Architecture Kernel Drivers下载最新版本内核层GPU驱动。比如这里我下载了BX304L01B-SW-99002-r47p0-01eac0.tar

关于如何安装该驱动可以参考:

2.1.1 驱动入口

解压后,可以看到驱动入口文件在drivers/gpu/arm/midgard/mali_kbase_core_linux.c ,在该文件我们可以看到支持的GPU型号;

static const struct of_device_id kbase_dt_ids[] = { { .compatible = "arm,malit6xx" },
                                                    { .compatible = "arm,mali-midgard" },
                                                    { .compatible = "arm,mali-bifrost" },
                                                    { .compatible = "arm,mali-valhall" },
                                                    { /* sentinel */ } };
MODULE_DEVICE_TABLE(of, kbase_dt_ids);

static struct platform_driver kbase_platform_driver = {
        .probe = kbase_platform_device_probe,
        .remove = kbase_platform_device_remove,
        .driver = {
                   .name = kbase_drv_name,
                   .pm = &kbase_pm_ops,
                   .of_match_table = of_match_ptr(kbase_dt_ids),
                   .probe_type = PROBE_PREFER_ASYNCHRONOUS,
        },
};

module_platform_driver(kbase_platform_driver);
2.1.2 gpu设备节点

NanoPC上安装运行debian操作系统后,使用以下命令检查Mali GPU的设备树节点:

root@NanoPC-T6:~# apt install device-tree-compiler -y
root@NanoPC-T6:~# dtc -I fs /proc/device-tree | grep mali

arch/arm64/boot/dts/rockchip/rk3588s.dtsi中我们可以定位到gpu设备节点;

gpu: gpu@fb000000 {
		compatible = "arm,mali-bifrost";
		reg = <0x0 0xfb000000 0x0 0x200000>;
		interrupts = <GIC_SPI 94 IRQ_TYPE_LEVEL_HIGH>,
					 <GIC_SPI 93 IRQ_TYPE_LEVEL_HIGH>,
					 <GIC_SPI 92 IRQ_TYPE_LEVEL_HIGH>;
		interrupt-names = "GPU", "MMU", "JOB";

		clocks = <&scmi_clk SCMI_CLK_GPU>, <&cru CLK_GPU_COREGROUP>,
				 <&cru CLK_GPU_STACKS>, <&cru CLK_GPU>;
		clock-names = "clk_mali", "clk_gpu_coregroup",
					  "clk_gpu_stacks", "clk_gpu";
		assigned-clocks = <&scmi_clk SCMI_CLK_GPU>;
		assigned-clock-rates = <200000000>;
		power-domains = <&power RK3588_PD_GPU>;
		operating-points-v2 = <&gpu_opp_table>;
		#cooling-cells = <2>;
		dynamic-power-coefficient = <2982>;

		upthreshold = <30>;
		downdifferential = <10>;

		status = "disabled";
};

arch/arm64/boot/dts/rockchip/rk3588-nanopi6-common.dtsi中可以定位到:

&gpu {
        mali-supply = <&vdd_gpu_s0>;
        mem-supply = <&vdd_gpu_mem_s0>;
        upthreshold = <60>;
        downdifferential = <30>;
        status = "okay";
};

其中:

  • compatible:说明了设备兼容的驱动名称,即"arm,mali-bifrost";可以看到arm,mali-bifrost是和panfrost驱动相匹配的,因此会执行驱动的.probe函数,这里就不深入研究了;
  • reg:指定了寄存器的基地址和大小,即基地址0xfb000000,大小为0x200000
  • interruptsinterrupt-names:分别指定了该设备所使用的中断号和中断的名称;
  • clocks:指定了使用哪个时钟控制器(CRU)提供GPU时钟;
  • power-domains:用于指定设备所属的电源域,即RK3588_PD_GPU
  • mali-supply:指定了GPU设备使用的电源管脚;
  • status:指定GPU设备的状态("okay" 表示设备正常工作);

2.2 用户层动态库

寻找官方(Mali ARM/Rockchip )提供的用户层动态库libmali.so

2.2.1 Mali ARM官方下载安装libmali.so

通过浏览器进入Mali ARM官网:https://developer.arm.com/downloads/-/mali-drivers/user-space

寻找官方提供的用户层动态库libmali.solibmali.so一般会有不同的版本(X11fbdevWayland等),其提供了OpenGL ESEGLOpenCL接口。

不过不幸的是:Mail ARM官网并没有看到适用于RK3588的用户层动态库,但是RK3288的倒是有,这里我们就以RK3288为例:

下载后,解压缩可以看到:

注意:上图中libEGL.solibOpenCL.solibGLESv2.so等库大小均为0,不难猜测libmail.so应该提供了OpenGL ESEGLOpenCL接口。

libmali.so存放在ARM/usr/lib/,同时建立软链接libOpenCL.so指向libmali.so

root@NanoPC-T6:~# ln -s /usr/lib/libmali.so /usr/lib/libOpenCL.so
2.2.2 Rockchip官方提供的libmali.so

我们使用的友善提供的debian文件系统已经安装了libmali.so,该用户层动态库是由Rockchip官方提供的。

如何来查看是否已经安装了OpenCL库和驱动,可以通过如下命令检查是否已经安装了libmali.so

root@NanoPC-T6:~# find /usr -name libmali.so
/usr/lib/aarch64-linux-gnu/libmali.so
root@NanoPC-T6:~# strings /usr/lib/aarch64-linux-gnu/libmali.so | grep Mali-G610
Mali-G610

root@NanoPC-T6:~# strings /usr/lib/aarch64-linux-gnu/libmali.so | grep cl
.....
clReleaseCommandBufferKHR
clReleaseCommandQueue
clReleaseContext
clReleaseDevice
clReleaseEvent
clReleaseKernel
clReleaseMemObject
.....

root@NanoPC-T6:~# ls -l /usr/lib/aarch64-linux-gnu/libmali.so
lrwxrwxrwx 1 root root 12  7月 29  2020 /usr/lib/aarch64-linux-gnu/libmali.so -> libmali.so.1

其中/usr/lib/aarch64-linux-gnu/libmali.solibmali.so库的路径,Mali-G610Mali GPU驱动的版本号。

如果命令输出为空,则说明该库不是Mali GPU驱动库。如果输出包含Mali-G610 字符串,则说明该库是Mali GPU驱动库,并且版本号为Mali-G610

此外在/usr/lib/aarch64-linux-gnu目录下包含单独的OpenGL ESEGLOpenCL库;

root@NanoPC-T6:/opt# ls -l /usr/lib/aarch64-linux-gnu/libOpenCL*
lrwxrwxrwx 1 root root    18  1月 12  2021 /usr/lib/aarch64-linux-gnu/libOpenCL.so.1 -> libOpenCL.so.1.0.0
-rw-r--r-- 1 root root 60856  1月 12  2021 /usr/lib/aarch64-linux-gnu/libOpenCL.so.1.0.0

root@NanoPC-T6:/opt# strings /usr/lib/aarch64-linux-gnu/libOpenCL.so.1.0.0 | grep cl
fclose
closedir
dlclose
clGetExtensionFunctionAddress
clGetPlatformIDs
clCreateContext
clCreateContextFromType
clGetGLContextInfoKHR
......

root@NanoPC-T6:/opt# ls -l /usr/lib/aarch64-linux-gnu/libEGL*
lrwxrwxrwx 1 root root     20  3月 25  2021 /usr/lib/aarch64-linux-gnu/libEGL_mesa.so.0 -> libEGL_mesa.so.0.0.0
-rw-r--r-- 1 root root 259072  3月 25  2021 /usr/lib/aarch64-linux-gnu/libEGL_mesa.so.0.0.0
lrwxrwxrwx 1 root root     11  7月 29  2020 /usr/lib/aarch64-linux-gnu/libEGL.so -> libEGL.so.1
lrwxrwxrwx 1 root root     15  7月 29  2020 /usr/lib/aarch64-linux-gnu/libEGL.so.1 -> libEGL.so.1.1.0
-rw-r--r-- 1 root root  84416  7月 29  2020 /usr/lib/aarch64-linux-gnu/libEGL.so.1.1.0
......

也可以通过如下clinfo命令查看是否已经安装OpenCL库,如果出现下图所示界面,则系统已经安装;

root@NanoPC-T6:~# aptitude install clinfo
root@NanoPC-T6:~# clinfo
arm_release_ver: g13p0-01eac0, rk_so_ver: 10
Number of platforms                               1
  Platform Name                                   ARM Platform
  Platform Vendor                                 ARM
  Platform Version                                OpenCL 3.0 v1.g13p0-01eac0.a8b6f0c7e1f83c654c60d1775112dbe4
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics           
......

Number of devices                                 1
  Device Name                                     Mali-G610 r0p0
  Device Vendor                                   ARM
  Device Vendor ID                                0xa8670000
  Device Version                                  OpenCL 3.0 v1.g13p0-01eac0.a8b6f0c7e1f83c654c60d1775112dbe4
  Device UUID                                     000067a8-0100-0000-0000-000000000000
  Driver UUID                                     13833dc2-ecef-4e5b-0159-38fdaf75bfde
  Valid Device LUID                               No
  Device LUID                                     0000-000000000000
  Device Node Mask                                0
  Device Numeric Version                          0xc00000 (3.0.0)
  Driver Version                                  3.0
  Device OpenCL C Version                         OpenCL C 3.0 v1.g13p0-01eac0.a8b6f0c7e1f83c654c60d1775112dbe4
  Device OpenCL C all versions                    OpenCL C                                                         0x400000 (1.0
......

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  ARM Platform
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   Success [ARM]
  clCreateContext(NULL, ...) [default]            Success [ARM]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  Success (1)
    Platform Name                                 ARM Platform
    Device Name                                   Mali-G610 r0p0            # GPU型号
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  Success (1)
    Platform Name                                 ARM Platform
    Device Name                                   Mali-G610 r0p0
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (1)
    Platform Name                                 ARM Platform
    Device Name                                   Mali-G610 r0p0

ICD loader properties
  ICD loader Name                                 OpenCL ICD Loader
  ICD loader Vendor                               OCL Icd free software
  ICD loader Version                              2.2.14
  ICD loader Profile                              OpenCL 3.0

其中arm_release_ver: g13p0-01eac0, rk_so_ver: 10为驱动版本信息。

接着我们需要将建立软链接libOpenCL.so指向libmali.so

root@NanoPC-T6:~# ln -s /usr/lib/aarch64-linux-gnu/libmali.so /usr/lib/aarch64-linux-gnu/libOpenCL.so
root@NanoPC-T6:~# ls -l /usr/lib/aarch64-linux-gnu/libOpenCL.so
lrwxrwxrwx 1 root root 37  1月 16 23:43 /usr/lib/aarch64-linux-gnu/libOpenCL.so -> /usr/lib/aarch64-linux-gnu/libmali.so

2.3 安装头文件

从官网下载头文件OpenCL-Headers

root@NanoPC-T6:/opt# git clone https://github.com/extdomains/github.com/KhronosGroup/OpenCL-Headers.git

运行以下命令来配置构建过程,并指定安装路径为/usr

root@NanoPC-T6:/opt/OpenCL-Headers# cmake -S . -B build -DCMAKE_INSTALL_PREFIX=/usr
-- The C compiler identification is GNU 10.2.1
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- The CXX compiler identification is GNU 10.2.1
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Python3: /usr/bin/python3.9 (found version "3.9.2") found components: Interpreter
-- Configuring done
-- Generating done
-- Build files have been written to: /opt/OpenCL-Headers/build

其中:

  • -S .:指定源代码目录的路径;
  • -B build:指定构建目录的路径;
  • -DCMAKE_INSTALL_PREFIX=/usr:指定cmake执行install 目标时,安装的路径前缀;

如上命令会让cmake.目录下查找CMakeLists.txt文件,并在./build目录下生成Makefile文件。

接着运行以下命令在./build目录下执行构建操作,只构建install目标,将生成的文件安装到指定的位置;

root@NanoPC-T6:/opt/OpenCL-Headers# cmake --build build --target install
Scanning dependencies of target headers_c_200
[  0%] Building C object tests/lang_c/CMakeFiles/headers_c_200.dir/__/test_headers.c.o
[  0%] Linking C executable headers_c_200
[  0%] Built target headers_c_200
Scanning dependencies of target headers_c_120
[  1%] Building C object tests/lang_c/CMakeFiles/headers_c_120.dir/__/test_headers.c.o
[  1%] Linking C executable headers_c_120
[  1%] Built target headers_c_120
Scanning dependencies of target cl_version_h_c_300
[  1%] Building C object tests/lang_c/CMakeFiles/cl_version_h_c_300.dir/__/test_cl_version.h.c.o
[  2%] Linking C executable cl_version_h_c_300
.......
[ 99%] Built target cl_egl_h_cpp_100
Scanning dependencies of target cl_gl_h_cpp_120
[100%] Building CXX object tests/lang_cpp/CMakeFiles/cl_gl_h_cpp_120.dir/test_cl_gl.h.cpp.o
[100%] Linking CXX executable cl_gl_h_cpp_120
[100%] Built target cl_gl_h_cpp_120
Install the project...
-- Install configuration: ""
-- Installing: /usr/include/CL
-- Installing: /usr/include/CL/opencl.h
-- Installing: /usr/include/CL/cl_egl.h
-- Installing: /usr/include/CL/cl_ext_intel.h
-- Installing: /usr/include/CL/cl_layer.h
-- Installing: /usr/include/CL/cl_platform.h
-- Installing: /usr/include/CL/cl_d3d10.h
-- Installing: /usr/include/CL/cl_va_api_media_sharing_intel.h
-- Installing: /usr/include/CL/cl_icd.h
-- Installing: /usr/include/CL/cl.h
-- Installing: /usr/include/CL/cl_function_types.h
-- Installing: /usr/include/CL/cl_dx9_media_sharing.h
-- Installing: /usr/include/CL/cl_dx9_media_sharing_intel.h
-- Installing: /usr/include/CL/cl_gl_ext.h
-- Installing: /usr/include/CL/cl_d3d11.h
-- Installing: /usr/include/CL/cl_version.h
-- Installing: /usr/include/CL/cl_half.h
-- Installing: /usr/include/CL/cl_ext.h
-- Installing: /usr/include/CL/cl_gl.h
-- Installing: /usr/share/cmake/OpenCLHeaders/OpenCLHeadersTargets.cmake
-- Installing: /usr/share/cmake/OpenCLHeaders/OpenCLHeadersConfig.cmake
-- Installing: /usr/share/cmake/OpenCLHeaders/OpenCLHeadersConfigVersion.cmake
-- Installing: /usr/share/pkgconfig/OpenCL-Headers.pc

头文件已经安装到/usr/include/CL目录下:

root@NanoPC-T6:/opt/OpenCL-Headers# ls -l /usr/include/CL
总用量 392
-rw-r--r-- 1 root root   8057  1月 15 00:10 cl_d3d10.h
-rw-r--r-- 1 root root   8095  1月 15 00:10 cl_d3d11.h
-rw-r--r-- 1 root root  12246  1月 15 00:10 cl_dx9_media_sharing.h
-rw-r--r-- 1 root root    959  1月 15 00:10 cl_dx9_media_sharing_intel.h
-rw-r--r-- 1 root root   5672  1月 15 00:10 cl_egl.h
-rw-r--r-- 1 root root 127490  1月 15 00:10 cl_ext.h
-rw-r--r-- 1 root root    902  1月 15 00:10 cl_ext_intel.h
-rw-r--r-- 1 root root  33387  1月 15 00:10 cl_function_types.h
-rw-r--r-- 1 root root    905  1月 15 00:10 cl_gl_ext.h
-rw-r--r-- 1 root root  12040  1月 15 00:10 cl_gl.h
-rw-r--r-- 1 root root  81631  1月 15 00:10 cl.h
-rw-r--r-- 1 root root  10430  1月 15 00:10 cl_half.h
-rw-r--r-- 1 root root  11505  1月 15 00:10 cl_icd.h
-rw-r--r-- 1 root root   3544  1月 15 00:10 cl_layer.h
-rw-r--r-- 1 root root  43430  1月 15 00:10 cl_platform.h
-rw-r--r-- 1 root root   7090  1月 15 00:10 cl_va_api_media_sharing_intel.h
-rw-r--r-- 1 root root   3125  1月 15 00:10 cl_version.h
-rw-r--r-- 1 root root    970  1月 15 00:10 opencl.h

三、OpenCL测试

此时已经有动态库和头文件,可以进行测试了。在/opt/目录下创建opencl-project文件夹;

root@NanoPC-T6:/opt# mkdir opencl-project

接着创建platform文件夹;

root@NanoPC-T6:/opt# cd opencl-project/
root@NanoPC-T6:/opt/opencl-project# mkdir platform
root@NanoPC-T6:/opt/opencl-project# cd platform

3.1 platform.cpp

/opt/opencl-project/platform目录下编写测试代码platform.cpp

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

#define MAX_PLATFORMS 10
#define MAX_DEVICES 10

int main() {
    cl_platform_id platforms[MAX_PLATFORMS];
    cl_device_id devices[MAX_DEVICES];
    cl_uint num_platforms, num_devices;
    cl_context context;
    cl_command_queue command_queue;
    cl_program program;
    cl_kernel kernel;
    cl_int ret;

    // 获取平台数量
    ret = clGetPlatformIDs(MAX_PLATFORMS, platforms, &num_platforms);
    if (ret != CL_SUCCESS) {
        printf("Failed to get platform IDs\n");
        return -1;
    }

    printf("Number of platforms: %u\n", num_platforms);

    // 遍历打印平台信息
    for (cl_uint i = 0; i < num_platforms; i++) {
        char platform_name[128];
        char platform_vendor[128];

        ret = clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME, sizeof(platform_name), platform_name, NULL);
        if (ret != CL_SUCCESS) {
            printf("Failed to get platform name for platform %u\n", i);
        }

        ret = clGetPlatformInfo(platforms[i], CL_PLATFORM_VENDOR, sizeof(platform_vendor), platform_vendor, NULL);
        if (ret != CL_SUCCESS) {
            printf("Failed to get platform vendor for platform %u\n", i);
        }

        printf("Platform %u:\n", i);
        printf("    Name: %s\n", platform_name);
        printf("    Vendor: %s\n", platform_vendor);
        printf("\n");
    }

    // 获取设备数量
    ret = clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_GPU, MAX_DEVICES, devices, &num_devices);
    if (ret != CL_SUCCESS) {
        printf("Failed to get device IDs\n");
        return -1;
    }

    // 创建OpenCL上下文
    context = clCreateContext(NULL, num_devices, devices, NULL, NULL, &ret);
    if (ret != CL_SUCCESS) {
        printf("Failed to create context\n");
        return -1;
    }

    // 创建命令队列
    command_queue = clCreateCommandQueue(context, devices[0], 0, &ret);
    if (ret != CL_SUCCESS) {
        printf("Failed to create command queue\n");
        return -1;
    }

    // 定义和构建OpenCL内核
    const char *kernel_source = "__kernel void hello_world() {\n"
                                "    printf(\"Hello, World!\\n\");\n"
                                "}\n";
    program = clCreateProgramWithSource(context, 1, &kernel_source, NULL, &ret);
    if (ret != CL_SUCCESS) {
        printf("Failed to create program\n");
        return -1;
    }

    ret = clBuildProgram(program, num_devices, devices, NULL, NULL, NULL);
    if (ret != CL_SUCCESS) {
        printf("Failed to build program\n");
        return -1;
    }

    // 创建OpenCL内核对象
    kernel = clCreateKernel(program, "hello_world", &ret);
    if (ret != CL_SUCCESS) {
        printf("Failed to create kernel\n");
        return -1;
    }

    // 执行内核函数
    ret = clEnqueueTask(command_queue, kernel, 0, NULL, NULL);
    if (ret != CL_SUCCESS) {
        printf("Failed to enqueue task\n");
        return -1;
    }

    // 等待执行完成
    ret = clFinish(command_queue);
    if (ret != CL_SUCCESS) {
        printf("Failed to finish execution\n");
        return -1;
    }

    printf("Kernel executed successfully\n");

    // 清理资源
    ret = clReleaseKernel(kernel);
    ret = clReleaseProgram(program);
    ret = clReleaseCommandQueue(command_queue);
    ret = clReleaseContext(context);

    return 0;
}

3.2 编译

这里我们介绍两种源码编译的方式。

3.2.1 直接编译

我们可以直接执行如下编译命令:

root@NanoPC-T6:/opt/opencl-project/platform# gcc platform.cpp -o platform -lmali

-lmail用于链接libmali.so库文件,-l选项指定要链接的库文件名,并在文件名前加上lib.so的前缀和后缀。所以-lmali告诉编译器要链接的库文件名为libmali.so

那么编译器如何知道libmali.so在哪里的呢?

  • 首先搜索预定义的默认路径,如/usr/lib/usr/local/lib等;
  • 如果共享库没有在这些路径中找到,则会搜索在/etc/ld.so.conf/etc/ld.so.conf.d目录中指定的路径。这些路径可以包含自定义共享库路径,比如:
root@NanoPC-T6:/opt/opencl-project/platform# ls -l /etc/ld.so.conf.d/
总用量 12
-rw-r--r-- 1 root root  32  7月 29  2020 00-aarch64-mali.conf
-rw-r--r-- 1 root root 103  4月 20  2023 aarch64-linux-gnu.conf
-rw-r--r-- 1 root root  44  9月 23  2022 libc.conf
root@NanoPC-T6:/opt/opencl-project/platform#  cat /etc/ld.so.conf.d/aarch64-linux-gnu.conf
# Multiarch support
/usr/local/lib/aarch64-linux-gnu
/lib/aarch64-linux-gnu
/usr/lib/aarch64-linux-gnu     # 该路径下有libmali.so库文件
3.2.2 cmake编译

当然也可以使用cmake进行编译platform.cpp,接下来我们介绍cmake编译配置。

(1) 在/opt/opencl-project/platform目录下创建CMakeLists.txt

cmake_minimum_required(VERSION 3.0)
cmake_policy(VERSION 3.0...3.18.4)
project(proj)
add_executable(platform platform.cpp)
#寻找OpenCL库  /usr/share/cmake-3.18/Modules/FindOpenCL.cmake
find_package(OpenCL REQUIRED)
#打印调试信息
MESSAGE(STATUS "Project: ${PROJECT_NAME}")
MESSAGE(STATUS "OpenCL library status:")
MESSAGE(STATUS "    version: ${OpenCL_VERSION_STRING}")
MESSAGE(STATUS "    libraries: ${OpenCL_LIBRARY}")
MESSAGE(STATUS "    include path: ${OpenCL_INCLUDE_DIR}")

target_link_libraries(platform PRIVATE OpenCL::OpenCL)

(2) 配置构建过程:

root@NanoPC-T6:/opt/opencl-project/platform#  cmake -S . -B build
-- The C compiler identification is GNU 10.2.1
-- The CXX compiler identification is GNU 10.2.1
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for CL_VERSION_2_2
-- Looking for CL_VERSION_2_2 - found
-- Found OpenCL: /usr/lib/aarch64-linux-gnu/libOpenCL.so (found version "2.2")
-- Project: proj
-- OpenCL library status:
--     version: 2.2 
--     libraries: /usr/lib/aarch64-linux-gnu/libOpenCL.so         # 库文件路径
--     include path: /usr/include  # 头文件路径
-- Configuring done
-- Generating done
-- Build files have been written to: /opt/opencl-project/platform/build

其中:

  • -S .:选项指定源代码目录的路径,CMake将在该路径下查找CMakeLists.txt文件;
  • -B build:选项指定构建目录的路径;

实际上我们使用的版本是OpenCL 3.0,这里判定为2.2版本是因为cmake version 3.18.4 FindOpenCL.cmake能够识别的最大版本为2.2,其通过在CL/cl.h文件查找CL_VERSION_${VERSION}宏来判定安装的版本的。

可以通过修改/usr/share/cmake-3.18/Modules/FindOpenCL.cmake解决这个问题:

foreach(VERSION "3_0" "2_2" "2_1" "2_0" "1_2" "1_1" "1_0")

(3) 执行构建操作,生成可执行程序platform

root@NanoPC-T6:/opt/OpenCL-Headers/exmaples# cmake --build build
Scanning dependencies of target platform
[ 50%] Building CXX object CMakeFiles/platform.dir/platform.cpp.o
In file included from /usr/include/CL/cl.h:20,
                 from /usr/include/CL/opencl.h:24,
                 from /opt/OpenCL-Headers/exmaples/platform.cpp:1:
/usr/include/CL/cl_version.h:22:104: note: ‘#pragma message: cl_version.h: CL_TARGET_OPENCL_VERSION is not defined. Defaulting to 300 (OpenCL 3.0)’
   22 | #pragma message("cl_version.h: CL_TARGET_OPENCL_VERSION is not defined. Defaulting to 300 (OpenCL 3.0)")
      |                                                                                                        ^
[100%] Linking CXX executable platform
[100%] Built target platform

执行程序:

root@NanoPC-T6:/opt/opencl-project/platform# ls -l build/
总用量 48
-rw-r--r-- 1 root root 14229  1月 16 23:45 CMakeCache.txt
drwxr-xr-x 5 root root  4096  1月 16 23:46 CMakeFiles
-rw-r--r-- 1 root root  1632  1月 16 23:45 cmake_install.cmake
-rw-r--r-- 1 root root  5253  1月 16 23:45 Makefile
-rwxr-xr-x 1 root root 14248  1月 16 23:46 platform

root@NanoPC-T6:/opt/opencl-project/platform# ./build/platform
arm_release_ver: g13p0-01eac0, rk_so_ver: 10
Number of platforms: 1
Platform 0:
    Name: ARM Platform
    Vendor: ARM

Kernel executed successfully

四、OpenCV测试

4.1 OCL介绍

OpenCV2011年开始与AMD合作加入OpenCL加速。因此,OpenCV-2.4.3版本包含了新的ocl模块,其中包含了一些现有OpenCV算法的OpenCL实现。也就是说,当客户端机器上有OpenCL运行时和兼容设备时,用户可以调用cv::ocl::resize()来使用加速的代码,而不是使用cv::resize()。在接下来的三年中,越来越多的函数和类被添加到ocl模块中;但它始终是一个独立的API,与OpenCV-2.x中的主要面向CPUAPI并存。

OpenCV-3.x中,架构概念已经改变为所谓的Transparent API(T-API)。在新架构中,一个单独的OpenCL加速的cv::ocl::resize() 已经从外部API中移除,而成为常规cv::resize()中的一个分支。这个分支在性能角度上自动调用,并在可行且有意义时进行优化。T-API的实现得到了AMDIntel公司的赞助。

4.1.1 代码示例

Regular CPU code

// initialization
VideoCapture vcap(...);
CascadeClassifier fd("haar_ff.xml");
Mat frame, frameGray;
vector<rect> faces;
for(;;){
  // processing loop
  vcap >> frame;
  cvtColor(frame, frameGray, BGR2GRAY);
  equalizeHist(frameGray, frameGray);
  fd.detectMultiScale(frameGray, faces, ...);
  // draw rectangles …
  // show image …
}

OpenCL-aware code OpenCV-2.x

// initialization
VideoCapture vcap(...);
ocl::OclCascadeClassifier fd("haar_ff.xml");
ocl::oclMat frame, frameGray;
Mat frameCpu;
vector<rect> faces;
for(;;){
  // processing loop
  vcap >> frameCpu;
  frame = frameCpu;
  ocl::cvtColor(frame, frameGray, BGR2GRAY);
  ocl::equalizeHist(frameGray, frameGray);
  fd.detectMultiScale(frameGray, faces, ...);
  // draw rectangles …
  // show image …
}

OpenCL-aware code OpenCV-3.x

// initialization
VideoCapture vcap(...);
CascadeClassifier fd("haar_ff.xml");
UMat frame, frameGray;
vector<rect> faces;
for(;;){
  // processing loop
  vcap >> frame;
  cvtColor(frame, frameGray, BGR2GRAY);
  equalizeHist(frameGray, frameGray);
  fd.detectMultiScale(frameGray, faces, ...);
  // draw rectangles …
  // show image …
}

相比于OpenCV-2.xOpenCV-3.x封装了一个新的数据类型cv::UMat,这个数据类型能够无缝对接OpenCV的普通接口,从而最少的改动代码而最大的完成OpenCL平台的加速功能。

4.1.2 原理

如上面代码所示,只需要将原来的Mat格式换为UMat格式就可以实现OpenCV函数在OpenCL设备上加速运行,而这其中具体实施的基本原理是什么呢?接下来看一下其底层实现的基本原理,具体参看OpenCV 3.1OpenCL部分实现的源代码(注意:每个版本代码可能略有差异):

上图中表明了,当你使用的数据类型是UMat {dst.isUmat()},并且开启了OpenCL使能{useOpenCL()},那么OpenCV的接口将会跳转到OpenCL支持的设备中进行加速运行,当然你需要注意的是,在第一次使用OpenCL加速程序时,OpenCL需要编译生成对应平台的Kernel代码,而编译是需要花费大量的时间的,因此初次运行需要比较长的时间。

4.2 官方opencv-ocl

/opt/opencl-project目录下新建opencv-ocl项目,源码位于:https://521github.com/opencv/opencv/tree/3.4.0/samples/opencl

4.2.1 main.cpp
点击查看代码
/*
// The example of interoperability between OpenCL and OpenCV.
// This will loop through frames of video either from input media file
// or camera device and do processing of these data in OpenCL and then
// in OpenCV. In OpenCL it does inversion of pixels in left half of frame and
// in OpenCV it does bluring in the right half of frame.
*/
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <fstream>
#include <string>
#include <sstream>
#include <iomanip>
#include <stdexcept>

#define CL_USE_DEPRECATED_OPENCL_2_0_APIS // eliminate build warning

#if __APPLE__
#include <OpenCL/cl.h>
#else
#include <CL/cl.h>
#endif

#include <opencv2/core/ocl.hpp>
#include <opencv2/core/utility.hpp>
#include <opencv2/video.hpp>
#include <opencv2/highgui.hpp>
#include <opencv2/imgproc.hpp>


using namespace std;
using namespace cv;

namespace opencl {

class PlatformInfo
{
public:
    PlatformInfo()
    {}

    ~PlatformInfo()
    {}

    cl_int QueryInfo(cl_platform_id id)
    {
        query_param(id, CL_PLATFORM_PROFILE, m_profile);
        query_param(id, CL_PLATFORM_VERSION, m_version);
        query_param(id, CL_PLATFORM_NAME, m_name);
        query_param(id, CL_PLATFORM_VENDOR, m_vendor);
        query_param(id, CL_PLATFORM_EXTENSIONS, m_extensions);
        return CL_SUCCESS;
    }

    std::string Profile()    { return m_profile; }
    std::string Version()    { return m_version; }
    std::string Name()       { return m_name; }
    std::string Vendor()     { return m_vendor; }
    std::string Extensions() { return m_extensions; }

private:
    cl_int query_param(cl_platform_id id, cl_platform_info param, std::string& paramStr)
    {
        cl_int res;

        size_t psize;
        cv::AutoBuffer<char> buf;

        res = clGetPlatformInfo(id, param, 0, 0, &psize);
        if (CL_SUCCESS != res)
            throw std::runtime_error(std::string("clGetPlatformInfo failed"));

        buf.resize(psize);
        res = clGetPlatformInfo(id, param, psize, buf, 0);
        if (CL_SUCCESS != res)
            throw std::runtime_error(std::string("clGetPlatformInfo failed"));

        // just in case, ensure trailing zero for ASCIIZ string
        buf[psize] = 0;

        paramStr = buf;

        return CL_SUCCESS;
    }

private:
    std::string m_profile;
    std::string m_version;
    std::string m_name;
    std::string m_vendor;
    std::string m_extensions;
};


class DeviceInfo
{
public:
    DeviceInfo()
    {}

    ~DeviceInfo()
    {}

    cl_int QueryInfo(cl_device_id id)
    {
        query_param(id, CL_DEVICE_TYPE, m_type);
        query_param(id, CL_DEVICE_VENDOR_ID, m_vendor_id);
        query_param(id, CL_DEVICE_MAX_COMPUTE_UNITS, m_max_compute_units);
        query_param(id, CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS, m_max_work_item_dimensions);
        query_param(id, CL_DEVICE_MAX_WORK_ITEM_SIZES, m_max_work_item_sizes);
        query_param(id, CL_DEVICE_MAX_WORK_GROUP_SIZE, m_max_work_group_size);
        query_param(id, CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR, m_preferred_vector_width_char);
        query_param(id, CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT, m_preferred_vector_width_short);
        query_param(id, CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT, m_preferred_vector_width_int);
        query_param(id, CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG, m_preferred_vector_width_long);
        query_param(id, CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT, m_preferred_vector_width_float);
        query_param(id, CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE, m_preferred_vector_width_double);
#if defined(CL_VERSION_1_1)
        query_param(id, CL_DEVICE_PREFERRED_VECTOR_WIDTH_HALF, m_preferred_vector_width_half);
        query_param(id, CL_DEVICE_NATIVE_VECTOR_WIDTH_CHAR, m_native_vector_width_char);
        query_param(id, CL_DEVICE_NATIVE_VECTOR_WIDTH_SHORT, m_native_vector_width_short);
        query_param(id, CL_DEVICE_NATIVE_VECTOR_WIDTH_INT, m_native_vector_width_int);
        query_param(id, CL_DEVICE_NATIVE_VECTOR_WIDTH_LONG, m_native_vector_width_long);
        query_param(id, CL_DEVICE_NATIVE_VECTOR_WIDTH_FLOAT, m_native_vector_width_float);
        query_param(id, CL_DEVICE_NATIVE_VECTOR_WIDTH_DOUBLE, m_native_vector_width_double);
        query_param(id, CL_DEVICE_NATIVE_VECTOR_WIDTH_HALF, m_native_vector_width_half);
#endif
        query_param(id, CL_DEVICE_MAX_CLOCK_FREQUENCY, m_max_clock_frequency);
        query_param(id, CL_DEVICE_ADDRESS_BITS, m_address_bits);
        query_param(id, CL_DEVICE_MAX_MEM_ALLOC_SIZE, m_max_mem_alloc_size);
        query_param(id, CL_DEVICE_IMAGE_SUPPORT, m_image_support);
        query_param(id, CL_DEVICE_MAX_READ_IMAGE_ARGS, m_max_read_image_args);
        query_param(id, CL_DEVICE_MAX_WRITE_IMAGE_ARGS, m_max_write_image_args);
#if defined(CL_VERSION_2_0)
        query_param(id, CL_DEVICE_MAX_READ_WRITE_IMAGE_ARGS, m_max_read_write_image_args);
#endif
        query_param(id, CL_DEVICE_IMAGE2D_MAX_WIDTH, m_image2d_max_width);
        query_param(id, CL_DEVICE_IMAGE2D_MAX_HEIGHT, m_image2d_max_height);
        query_param(id, CL_DEVICE_IMAGE3D_MAX_WIDTH, m_image3d_max_width);
        query_param(id, CL_DEVICE_IMAGE3D_MAX_HEIGHT, m_image3d_max_height);
        query_param(id, CL_DEVICE_IMAGE3D_MAX_DEPTH, m_image3d_max_depth);
#if defined(CL_VERSION_1_2)
        query_param(id, CL_DEVICE_IMAGE_MAX_BUFFER_SIZE, m_image_max_buffer_size);
        query_param(id, CL_DEVICE_IMAGE_MAX_ARRAY_SIZE, m_image_max_array_size);
#endif
        query_param(id, CL_DEVICE_MAX_SAMPLERS, m_max_samplers);
#if defined(CL_VERSION_1_2)
        query_param(id, CL_DEVICE_IMAGE_PITCH_ALIGNMENT, m_image_pitch_alignment);
        query_param(id, CL_DEVICE_IMAGE_BASE_ADDRESS_ALIGNMENT, m_image_base_address_alignment);
#endif
#if defined(CL_VERSION_2_0)
        query_param(id, CL_DEVICE_MAX_PIPE_ARGS, m_max_pipe_args);
        query_param(id, CL_DEVICE_PIPE_MAX_ACTIVE_RESERVATIONS, m_pipe_max_active_reservations);
        query_param(id, CL_DEVICE_PIPE_MAX_PACKET_SIZE, m_pipe_max_packet_size);
#endif
        query_param(id, CL_DEVICE_MAX_PARAMETER_SIZE, m_max_parameter_size);
        query_param(id, CL_DEVICE_MEM_BASE_ADDR_ALIGN, m_mem_base_addr_align);
        query_param(id, CL_DEVICE_SINGLE_FP_CONFIG, m_single_fp_config);
#if defined(CL_VERSION_1_2)
        query_param(id, CL_DEVICE_DOUBLE_FP_CONFIG, m_double_fp_config);
#endif
        query_param(id, CL_DEVICE_GLOBAL_MEM_CACHE_TYPE, m_global_mem_cache_type);
        query_param(id, CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE, m_global_mem_cacheline_size);
        query_param(id, CL_DEVICE_GLOBAL_MEM_CACHE_SIZE, m_global_mem_cache_size);
        query_param(id, CL_DEVICE_GLOBAL_MEM_SIZE, m_global_mem_size);
        query_param(id, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE, m_max_constant_buffer_size);
        query_param(id, CL_DEVICE_MAX_CONSTANT_ARGS, m_max_constant_args);
#if defined(CL_VERSION_2_0)
        query_param(id, CL_DEVICE_MAX_GLOBAL_VARIABLE_SIZE, m_max_global_variable_size);
        query_param(id, CL_DEVICE_GLOBAL_VARIABLE_PREFERRED_TOTAL_SIZE, m_global_variable_preferred_total_size);
#endif
        query_param(id, CL_DEVICE_LOCAL_MEM_TYPE, m_local_mem_type);
        query_param(id, CL_DEVICE_LOCAL_MEM_SIZE, m_local_mem_size);
        query_param(id, CL_DEVICE_ERROR_CORRECTION_SUPPORT, m_error_correction_support);
#if defined(CL_VERSION_1_1)
        query_param(id, CL_DEVICE_HOST_UNIFIED_MEMORY, m_host_unified_memory);
#endif
        query_param(id, CL_DEVICE_PROFILING_TIMER_RESOLUTION, m_profiling_timer_resolution);
        query_param(id, CL_DEVICE_ENDIAN_LITTLE, m_endian_little);
        query_param(id, CL_DEVICE_AVAILABLE, m_available);
        query_param(id, CL_DEVICE_COMPILER_AVAILABLE, m_compiler_available);
#if defined(CL_VERSION_1_2)
        query_param(id, CL_DEVICE_LINKER_AVAILABLE, m_linker_available);
#endif
        query_param(id, CL_DEVICE_EXECUTION_CAPABILITIES, m_execution_capabilities);
        query_param(id, CL_DEVICE_QUEUE_PROPERTIES, m_queue_properties);
#if defined(CL_VERSION_2_0)
        query_param(id, CL_DEVICE_QUEUE_ON_HOST_PROPERTIES, m_queue_on_host_properties);
        query_param(id, CL_DEVICE_QUEUE_ON_DEVICE_PROPERTIES, m_queue_on_device_properties);
        query_param(id, CL_DEVICE_QUEUE_ON_DEVICE_PREFERRED_SIZE, m_queue_on_device_preferred_size);
        query_param(id, CL_DEVICE_QUEUE_ON_DEVICE_MAX_SIZE, m_queue_on_device_max_size);
        query_param(id, CL_DEVICE_MAX_ON_DEVICE_QUEUES, m_max_on_device_queues);
        query_param(id, CL_DEVICE_MAX_ON_DEVICE_EVENTS, m_max_on_device_events);
#endif
#if defined(CL_VERSION_1_2)
        query_param(id, CL_DEVICE_BUILT_IN_KERNELS, m_built_in_kernels);
#endif
        query_param(id, CL_DEVICE_PLATFORM, m_platform);
        query_param(id, CL_DEVICE_NAME, m_name);
        query_param(id, CL_DEVICE_VENDOR, m_vendor);
        query_param(id, CL_DRIVER_VERSION, m_driver_version);
        query_param(id, CL_DEVICE_PROFILE, m_profile);
        query_param(id, CL_DEVICE_VERSION, m_version);
#if defined(CL_VERSION_1_1)
        query_param(id, CL_DEVICE_OPENCL_C_VERSION, m_opencl_c_version);
#endif
        query_param(id, CL_DEVICE_EXTENSIONS, m_extensions);
#if defined(CL_VERSION_1_2)
        query_param(id, CL_DEVICE_PRINTF_BUFFER_SIZE, m_printf_buffer_size);
        query_param(id, CL_DEVICE_PREFERRED_INTEROP_USER_SYNC, m_preferred_interop_user_sync);
        query_param(id, CL_DEVICE_PARENT_DEVICE, m_parent_device);
        query_param(id, CL_DEVICE_PARTITION_MAX_SUB_DEVICES, m_partition_max_sub_devices);
        query_param(id, CL_DEVICE_PARTITION_PROPERTIES, m_partition_properties);
        query_param(id, CL_DEVICE_PARTITION_AFFINITY_DOMAIN, m_partition_affinity_domain);
        query_param(id, CL_DEVICE_PARTITION_TYPE, m_partition_type);
        query_param(id, CL_DEVICE_REFERENCE_COUNT, m_reference_count);
#endif
        return CL_SUCCESS;
    }

    std::string Name() { return m_name; }

private:
    template<typename T>
    cl_int query_param(cl_device_id id, cl_device_info param, T& value)
    {
        cl_int res;
        size_t size = 0;

        res = clGetDeviceInfo(id, param, 0, 0, &size);
        if (CL_SUCCESS != res && size != 0)
            throw std::runtime_error(std::string("clGetDeviceInfo failed"));

        if (0 == size)
            return CL_SUCCESS;

        if (sizeof(T) != size)
            throw std::runtime_error(std::string("clGetDeviceInfo: param size mismatch"));

        res = clGetDeviceInfo(id, param, size, &value, 0);
        if (CL_SUCCESS != res)
            throw std::runtime_error(std::string("clGetDeviceInfo failed"));

        return CL_SUCCESS;
    }

    template<typename T>
    cl_int query_param(cl_device_id id, cl_device_info param, std::vector<T>& value)
    {
        cl_int res;
        size_t size;

        res = clGetDeviceInfo(id, param, 0, 0, &size);
        if (CL_SUCCESS != res)
            throw std::runtime_error(std::string("clGetDeviceInfo failed"));

        if (0 == size)
            return CL_SUCCESS;

        value.resize(size / sizeof(T));

        res = clGetDeviceInfo(id, param, size, &value[0], 0);
        if (CL_SUCCESS != res)
            throw std::runtime_error(std::string("clGetDeviceInfo failed"));

        return CL_SUCCESS;
    }

    cl_int query_param(cl_device_id id, cl_device_info param, std::string& value)
    {
        cl_int res;
        size_t size;

        res = clGetDeviceInfo(id, param, 0, 0, &size);
        if (CL_SUCCESS != res)
            throw std::runtime_error(std::string("clGetDeviceInfo failed"));

        value.resize(size + 1);

        res = clGetDeviceInfo(id, param, size, &value[0], 0);
        if (CL_SUCCESS != res)
            throw std::runtime_error(std::string("clGetDeviceInfo failed"));

        // just in case, ensure trailing zero for ASCIIZ string
        value[size] = 0;

        return CL_SUCCESS;
    }

private:
    cl_device_type                            m_type;
    cl_uint                                   m_vendor_id;
    cl_uint                                   m_max_compute_units;
    cl_uint                                   m_max_work_item_dimensions;
    std::vector<size_t>                       m_max_work_item_sizes;
    size_t                                    m_max_work_group_size;
    cl_uint                                   m_preferred_vector_width_char;
    cl_uint                                   m_preferred_vector_width_short;
    cl_uint                                   m_preferred_vector_width_int;
    cl_uint                                   m_preferred_vector_width_long;
    cl_uint                                   m_preferred_vector_width_float;
    cl_uint                                   m_preferred_vector_width_double;
#if defined(CL_VERSION_1_1)
    cl_uint                                   m_preferred_vector_width_half;
    cl_uint                                   m_native_vector_width_char;
    cl_uint                                   m_native_vector_width_short;
    cl_uint                                   m_native_vector_width_int;
    cl_uint                                   m_native_vector_width_long;
    cl_uint                                   m_native_vector_width_float;
    cl_uint                                   m_native_vector_width_double;
    cl_uint                                   m_native_vector_width_half;
#endif
    cl_uint                                   m_max_clock_frequency;
    cl_uint                                   m_address_bits;
    cl_ulong                                  m_max_mem_alloc_size;
    cl_bool                                   m_image_support;
    cl_uint                                   m_max_read_image_args;
    cl_uint                                   m_max_write_image_args;
#if defined(CL_VERSION_2_0)
    cl_uint                                   m_max_read_write_image_args;
#endif
    size_t                                    m_image2d_max_width;
    size_t                                    m_image2d_max_height;
    size_t                                    m_image3d_max_width;
    size_t                                    m_image3d_max_height;
    size_t                                    m_image3d_max_depth;
#if defined(CL_VERSION_1_2)
    size_t                                    m_image_max_buffer_size;
    size_t                                    m_image_max_array_size;
#endif
    cl_uint                                   m_max_samplers;
#if defined(CL_VERSION_1_2)
    cl_uint                                   m_image_pitch_alignment;
    cl_uint                                   m_image_base_address_alignment;
#endif
#if defined(CL_VERSION_2_0)
    cl_uint                                   m_max_pipe_args;
    cl_uint                                   m_pipe_max_active_reservations;
    cl_uint                                   m_pipe_max_packet_size;
#endif
    size_t                                    m_max_parameter_size;
    cl_uint                                   m_mem_base_addr_align;
    cl_device_fp_config                       m_single_fp_config;
#if defined(CL_VERSION_1_2)
    cl_device_fp_config                       m_double_fp_config;
#endif
    cl_device_mem_cache_type                  m_global_mem_cache_type;
    cl_uint                                   m_global_mem_cacheline_size;
    cl_ulong                                  m_global_mem_cache_size;
    cl_ulong                                  m_global_mem_size;
    cl_ulong                                  m_max_constant_buffer_size;
    cl_uint                                   m_max_constant_args;
#if defined(CL_VERSION_2_0)
    size_t                                    m_max_global_variable_size;
    size_t                                    m_global_variable_preferred_total_size;
#endif
    cl_device_local_mem_type                  m_local_mem_type;
    cl_ulong                                  m_local_mem_size;
    cl_bool                                   m_error_correction_support;
#if defined(CL_VERSION_1_1)
    cl_bool                                   m_host_unified_memory;
#endif
    size_t                                    m_profiling_timer_resolution;
    cl_bool                                   m_endian_little;
    cl_bool                                   m_available;
    cl_bool                                   m_compiler_available;
#if defined(CL_VERSION_1_2)
    cl_bool                                   m_linker_available;
#endif
    cl_device_exec_capabilities               m_execution_capabilities;
    cl_command_queue_properties               m_queue_properties;
#if defined(CL_VERSION_2_0)
    cl_command_queue_properties               m_queue_on_host_properties;
    cl_command_queue_properties               m_queue_on_device_properties;
    cl_uint                                   m_queue_on_device_preferred_size;
    cl_uint                                   m_queue_on_device_max_size;
    cl_uint                                   m_max_on_device_queues;
    cl_uint                                   m_max_on_device_events;
#endif
#if defined(CL_VERSION_1_2)
    std::string                               m_built_in_kernels;
#endif
    cl_platform_id                            m_platform;
    std::string                               m_name;
    std::string                               m_vendor;
    std::string                               m_driver_version;
    std::string                               m_profile;
    std::string                               m_version;
#if defined(CL_VERSION_1_1)
    std::string                               m_opencl_c_version;
#endif
    std::string                               m_extensions;
#if defined(CL_VERSION_1_2)
    size_t                                    m_printf_buffer_size;
    cl_bool                                   m_preferred_interop_user_sync;
    cl_device_id                              m_parent_device;
    cl_uint                                   m_partition_max_sub_devices;
    std::vector<cl_device_partition_property> m_partition_properties;
    cl_device_affinity_domain                 m_partition_affinity_domain;
    std::vector<cl_device_partition_property> m_partition_type;
    cl_uint                                   m_reference_count;
#endif
};

} // namespace opencl


class App
{
public:
    App(CommandLineParser& cmd);
    ~App();

    int initOpenCL();
    int initVideoSource();

    int process_frame_with_open_cl(cv::Mat& frame, bool use_buffer, cl_mem* cl_buffer);
    int process_cl_buffer_with_opencv(cl_mem buffer, size_t step, int rows, int cols, int type, cv::UMat& u);
    int process_cl_image_with_opencv(cl_mem image, cv::UMat& u);

    int run();

    bool isRunning() { return m_running; }
    bool doProcess() { return m_process; }
    bool useBuffer() { return m_use_buffer; }

    void setRunning(bool running)      { m_running = running; }
    void setDoProcess(bool process)    { m_process = process; }
    void setUseBuffer(bool use_buffer) { m_use_buffer = use_buffer; }

protected:
    bool nextFrame(cv::Mat& frame) { return m_cap.read(frame); }
    void handleKey(char key);
    void timerStart();
    void timerEnd();
    std::string timeStr() const;
    std::string message() const;

private:
    bool                        m_running;
    bool                        m_process;
    bool                        m_use_buffer;

    int64                       m_t0;
    int64                       m_t1;
    float                       m_time;
    float                       m_frequency;

    string                      m_file_name;
    int                         m_camera_id;
    cv::VideoCapture            m_cap;
    cv::Mat                     m_frame;
    cv::Mat                     m_frameGray;

    opencl::PlatformInfo        m_platformInfo;
    opencl::DeviceInfo          m_deviceInfo;
    std::vector<cl_platform_id> m_platform_ids;
    cl_context                  m_context;
    cl_device_id                m_device_id;
    cl_command_queue            m_queue;
    cl_program                  m_program;
    cl_kernel                   m_kernelBuf;
    cl_kernel                   m_kernelImg;
    cl_mem                      m_img_src; // used as src in case processing of cl image
    cl_mem                      m_mem_obj;
    cl_event                    m_event;
};


App::App(CommandLineParser& cmd)
{
    cout << "\nPress ESC to exit\n" << endl;
    cout << "\n      'p' to toggle ON/OFF processing\n" << endl;
    cout << "\n       SPACE to switch between OpenCL buffer/image\n" << endl;

    m_camera_id  = cmd.get<int>("camera");
    m_file_name  = cmd.get<string>("video");

    m_running    = false;
    m_process    = false;
    m_use_buffer = false;

    m_t0         = 0;
    m_t1         = 0;
    m_time       = 0.0;
    m_frequency  = (float)cv::getTickFrequency();

    m_context    = 0;
    m_device_id  = 0;
    m_queue      = 0;
    m_program    = 0;
    m_kernelBuf  = 0;
    m_kernelImg  = 0;
    m_img_src    = 0;
    m_mem_obj    = 0;
    m_event      = 0;
} // ctor


App::~App()
{
    if (m_queue)
    {
        clFinish(m_queue);
        clReleaseCommandQueue(m_queue);
        m_queue = 0;
    }

    if (m_program)
    {
        clReleaseProgram(m_program);
        m_program = 0;
    }

    if (m_img_src)
    {
        clReleaseMemObject(m_img_src);
        m_img_src = 0;
    }

    if (m_mem_obj)
    {
        clReleaseMemObject(m_mem_obj);
        m_mem_obj = 0;
    }

    if (m_event)
    {
        clReleaseEvent(m_event);
    }

    if (m_kernelBuf)
    {
        clReleaseKernel(m_kernelBuf);
        m_kernelBuf = 0;
    }

    if (m_kernelImg)
    {
        clReleaseKernel(m_kernelImg);
        m_kernelImg = 0;
    }

    if (m_device_id)
    {
        clReleaseDevice(m_device_id);
        m_device_id = 0;
    }

    if (m_context)
    {
        clReleaseContext(m_context);
        m_context = 0;
    }
} // dtor


int App::initOpenCL()
{
    cl_int res = CL_SUCCESS;
    cl_uint num_entries = 0;

    res = clGetPlatformIDs(0, 0, &num_entries);
    if (CL_SUCCESS != res)
        return -1;

    m_platform_ids.resize(num_entries);

    res = clGetPlatformIDs(num_entries, &m_platform_ids[0], 0);
    if (CL_SUCCESS != res)
        return -1;

    unsigned int i;

    // create context from first platform with GPU device
    for (i = 0; i < m_platform_ids.size(); i++)
    {
        cl_context_properties props[] =
        {
            CL_CONTEXT_PLATFORM,
            (cl_context_properties)(m_platform_ids[i]),
            0
        };

        m_context = clCreateContextFromType(props, CL_DEVICE_TYPE_GPU, 0, 0, &res);
        if (0 == m_context || CL_SUCCESS != res)
            continue;

        res = clGetContextInfo(m_context, CL_CONTEXT_DEVICES, sizeof(cl_device_id), &m_device_id, 0);
        if (CL_SUCCESS != res)
            return -1;

        m_queue = clCreateCommandQueue(m_context, m_device_id, 0, &res);
        if (0 == m_queue || CL_SUCCESS != res)
            return -1;

        const char* kernelSrc =
            "__kernel "
            "void bitwise_inv_buf_8uC1("
            "    __global unsigned char* pSrcDst,"
            "             int            srcDstStep,"
            "             int            rows,"
            "             int            cols)"
            "{"
            "    int x = get_global_id(0);"
            "    int y = get_global_id(1);"
            "    int idx = mad24(y, srcDstStep, x);"
            "    pSrcDst[idx] = ~pSrcDst[idx];"
            "}"
            "__kernel "
            "void bitwise_inv_img_8uC1("
            "    read_only  image2d_t srcImg,"
            "    write_only image2d_t dstImg)"
            "{"
            "    int x = get_global_id(0);"
            "    int y = get_global_id(1);"
            "    int2 coord = (int2)(x, y);"
            "    uint4 val = read_imageui(srcImg, coord);"
            "    val.x = (~val.x) & 0x000000FF;"
            "    write_imageui(dstImg, coord, val);"
            "}";
        size_t len = strlen(kernelSrc);
        m_program = clCreateProgramWithSource(m_context, 1, &kernelSrc, &len, &res);
        if (0 == m_program || CL_SUCCESS != res)
            return -1;

        res = clBuildProgram(m_program, 1, &m_device_id, 0, 0, 0);
        if (CL_SUCCESS != res)
            return -1;

        m_kernelBuf = clCreateKernel(m_program, "bitwise_inv_buf_8uC1", &res);
        if (0 == m_kernelBuf || CL_SUCCESS != res)
            return -1;

        m_kernelImg = clCreateKernel(m_program, "bitwise_inv_img_8uC1", &res);
        if (0 == m_kernelImg || CL_SUCCESS != res)
            return -1;

        m_platformInfo.QueryInfo(m_platform_ids[i]);
        m_deviceInfo.QueryInfo(m_device_id);

        // attach OpenCL context to OpenCV
        cv::ocl::attachContext(m_platformInfo.Name(), m_platform_ids[i], m_context, m_device_id);

        break;
    }

    return m_context != 0 ? CL_SUCCESS : -1;
} // initOpenCL()


int App::initVideoSource()
{
    try
    {
        if (!m_file_name.empty() && m_camera_id == -1)
        {
            m_cap.open(m_file_name.c_str());
            if (!m_cap.isOpened())
                throw std::runtime_error(std::string("can't open video file: " + m_file_name));
        }
        else if (m_camera_id != -1)
        {
            m_cap.open(m_camera_id);
            if (!m_cap.isOpened())
            {
                std::stringstream msg;
                msg << "can't open camera: " << m_camera_id;
                throw std::runtime_error(msg.str());
            }
        }
        else
            throw std::runtime_error(std::string("specify video source"));
    }

    catch (std::exception e)
    {
        cerr << "ERROR: " << e.what() << std::endl;
        return -1;
    }

    return 0;
} // initVideoSource()


// this function is an example of "typical" OpenCL processing pipeline
// It creates OpenCL buffer or image, depending on use_buffer flag,
// from input media frame and process these data
// (inverts each pixel value in half of frame) with OpenCL kernel
int App::process_frame_with_open_cl(cv::Mat& frame, bool use_buffer, cl_mem* mem_obj)
{
    cl_int res = CL_SUCCESS;

    CV_Assert(mem_obj);

    cl_kernel kernel = 0;
    cl_mem mem = mem_obj[0];

    if (0 == mem || 0 == m_img_src)
    {
        // allocate/delete cl memory objects every frame for the simplicity.
        // in real applicaton more efficient pipeline can be built.

        if (use_buffer)
        {
            cl_mem_flags flags = CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR;

            mem = clCreateBuffer(m_context, flags, frame.total(), frame.ptr(), &res);
            if (0 == mem || CL_SUCCESS != res)
                return -1;

            res = clSetKernelArg(m_kernelBuf, 0, sizeof(cl_mem), &mem);
            if (CL_SUCCESS != res)
                return -1;

            res = clSetKernelArg(m_kernelBuf, 1, sizeof(int), &frame.step[0]);
            if (CL_SUCCESS != res)
                return -1;

            res = clSetKernelArg(m_kernelBuf, 2, sizeof(int), &frame.rows);
            if (CL_SUCCESS != res)
                return -1;

            int cols2 = frame.cols / 2;
            res = clSetKernelArg(m_kernelBuf, 3, sizeof(int), &cols2);
            if (CL_SUCCESS != res)
                return -1;

            kernel = m_kernelBuf;
        }
        else
        {
            cl_mem_flags flags_src = CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR;

            cl_image_format fmt;
            fmt.image_channel_order     = CL_R;
            fmt.image_channel_data_type = CL_UNSIGNED_INT8;

            cl_image_desc desc_src;
            desc_src.image_type        = CL_MEM_OBJECT_IMAGE2D;
            desc_src.image_width       = frame.cols;
            desc_src.image_height      = frame.rows;
            desc_src.image_depth       = 0;
            desc_src.image_array_size  = 0;
            desc_src.image_row_pitch   = frame.step[0];
            desc_src.image_slice_pitch = 0;
            desc_src.num_mip_levels    = 0;
            desc_src.num_samples       = 0;
            desc_src.buffer            = 0;
            m_img_src = clCreateImage(m_context, flags_src, &fmt, &desc_src, frame.ptr(), &res);
            if (0 == m_img_src || CL_SUCCESS != res)
                return -1;

            cl_mem_flags flags_dst = CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR;

            cl_image_desc desc_dst;
            desc_dst.image_type        = CL_MEM_OBJECT_IMAGE2D;
            desc_dst.image_width       = frame.cols;
            desc_dst.image_height      = frame.rows;
            desc_dst.image_depth       = 0;
            desc_dst.image_array_size  = 0;
            desc_dst.image_row_pitch   = 0;
            desc_dst.image_slice_pitch = 0;
            desc_dst.num_mip_levels    = 0;
            desc_dst.num_samples       = 0;
            desc_dst.buffer            = 0;
            mem = clCreateImage(m_context, flags_dst, &fmt, &desc_dst, 0, &res);
            if (0 == mem || CL_SUCCESS != res)
                return -1;

            size_t origin[] = { 0, 0, 0 };
            size_t region[] = { (size_t)frame.cols, (size_t)frame.rows, 1 };
            res = clEnqueueCopyImage(m_queue, m_img_src, mem, origin, origin, region, 0, 0, &m_event);
            if (CL_SUCCESS != res)
                return -1;

            res = clWaitForEvents(1, &m_event);
            if (CL_SUCCESS != res)
                return -1;

            res = clSetKernelArg(m_kernelImg, 0, sizeof(cl_mem), &m_img_src);
            if (CL_SUCCESS != res)
                return -1;

            res = clSetKernelArg(m_kernelImg, 1, sizeof(cl_mem), &mem);
            if (CL_SUCCESS != res)
                return -1;

            kernel = m_kernelImg;
        }
    }

    m_event = clCreateUserEvent(m_context, &res);
    if (0 == m_event || CL_SUCCESS != res)
        return -1;

    // process left half of frame in OpenCL
    size_t size[] = { (size_t)frame.cols / 2, (size_t)frame.rows };
    res = clEnqueueNDRangeKernel(m_queue, kernel, 2, 0, size, 0, 0, 0, &m_event);
    if (CL_SUCCESS != res)
        return -1;

    res = clWaitForEvents(1, &m_event);
    if (CL_SUCCESS != res)
        return - 1;

    mem_obj[0] = mem;

    return  0;
}


// this function is an example of interoperability between OpenCL buffer
// and OpenCV UMat objects. It converts (without copying data) OpenCL buffer
// to OpenCV UMat and then do blur on these data
int App::process_cl_buffer_with_opencv(cl_mem buffer, size_t step, int rows, int cols, int type, cv::UMat& u)
{
    cv::ocl::convertFromBuffer(buffer, step, rows, cols, type, u);

    // process right half of frame in OpenCV
    cv::Point pt(u.cols / 2, 0);
    cv::Size  sz(u.cols / 2, u.rows);
    cv::Rect roi(pt, sz);
    cv::UMat uroi(u, roi);
    cv::blur(uroi, uroi, cv::Size(7, 7), cv::Point(-3, -3));

    if (buffer)
        clReleaseMemObject(buffer);
    m_mem_obj = 0;

    return 0;
}


// this function is an example of interoperability between OpenCL image
// and OpenCV UMat objects. It converts OpenCL image
// to OpenCV UMat and then do blur on these data
int App::process_cl_image_with_opencv(cl_mem image, cv::UMat& u)
{
    cv::ocl::convertFromImage(image, u);

    // process right half of frame in OpenCV
    cv::Point pt(u.cols / 2, 0);
    cv::Size  sz(u.cols / 2, u.rows);
    cv::Rect roi(pt, sz);
    cv::UMat uroi(u, roi);
    cv::blur(uroi, uroi, cv::Size(7, 7), cv::Point(-3, -3));

    if (image)
        clReleaseMemObject(image);
    m_mem_obj = 0;

    if (m_img_src)
        clReleaseMemObject(m_img_src);
    m_img_src = 0;

    return 0;
}


int App::run()
{
    if (0 != initOpenCL())
        return -1;

    if (0 != initVideoSource())
        return -1;

    Mat img_to_show;

    // set running state until ESC pressed
    setRunning(true);
    // set process flag to show some data processing
    // can be toggled on/off by 'p' button
    setDoProcess(true);
    // set use buffer flag,
    // when it is set to true, will demo interop opencl buffer and cv::Umat,
    // otherwise demo interop opencl image and cv::UMat
    // can be switched on/of by SPACE button
    setUseBuffer(true);

    // Iterate over all frames
    while (isRunning() && nextFrame(m_frame))
    {
        cv::cvtColor(m_frame, m_frameGray, COLOR_BGR2GRAY);

        UMat uframe;

        // work
        timerStart();

        if (doProcess())
        {
            process_frame_with_open_cl(m_frameGray, useBuffer(), &m_mem_obj);

            if (useBuffer())
                process_cl_buffer_with_opencv(
                    m_mem_obj, m_frameGray.step[0], m_frameGray.rows, m_frameGray.cols, m_frameGray.type(), uframe);
            else
                process_cl_image_with_opencv(m_mem_obj, uframe);
        }
        else
        {
            m_frameGray.copyTo(uframe);
        }

        timerEnd();

        uframe.copyTo(img_to_show);

        putText(img_to_show, "Version : " + m_platformInfo.Version(), Point(5, 30), FONT_HERSHEY_SIMPLEX, 1., Scalar(255, 100, 0), 2);
        putText(img_to_show, "Name : " + m_platformInfo.Name(), Point(5, 60), FONT_HERSHEY_SIMPLEX, 1., Scalar(255, 100, 0), 2);
        putText(img_to_show, "Device : " + m_deviceInfo.Name(), Point(5, 90), FONT_HERSHEY_SIMPLEX, 1., Scalar(255, 100, 0), 2);
        cv::String memtype = useBuffer() ? "buffer" : "image";
        putText(img_to_show, "interop with OpenCL " + memtype, Point(5, 120), FONT_HERSHEY_SIMPLEX, 1., Scalar(255, 100, 0), 2);
        putText(img_to_show, "Time : " + timeStr() + " msec", Point(5, 150), FONT_HERSHEY_SIMPLEX, 1., Scalar(255, 100, 0), 2);

        imshow("opencl_interop", img_to_show);

        handleKey((char)waitKey(3));
    }

    return 0;
}


void App::handleKey(char key)
{
    switch (key)
    {
    case 27:
        setRunning(false);
        break;

    case ' ':
        setUseBuffer(!useBuffer());
        break;

    case 'p':
    case 'P':
        setDoProcess( !doProcess() );
        break;

    default:
        break;
    }
}


inline void App::timerStart()
{
    m_t0 = getTickCount();
}


inline void App::timerEnd()
{
    m_t1 = getTickCount();
    int64 delta = m_t1 - m_t0;
    m_time = (delta / m_frequency) * 1000; // units msec
}


inline string App::timeStr() const
{
    stringstream ss;
    ss << std::fixed << std::setprecision(1) << m_time;
    return ss.str();
}


int main(int argc, char** argv)
{
    const char* keys =
        "{ help h ?    |          | print help message }"
        "{ camera c    | -1       | use camera as input }"
        "{ video  v    |          | use video as input }";

    CommandLineParser cmd(argc, argv, keys);
    if (cmd.has("help"))
    {
        cmd.printMessage();
        return EXIT_SUCCESS;
    }

    App app(cmd);

    try
    {
        app.run();
    }

    catch (const cv::Exception& e)
    {
        cout << "error: " << e.what() << endl;
        return 1;
    }

    catch (const std::exception& e)
    {
        cout << "error: " << e.what() << endl;
        return 1;
    }

    catch (...)
    {
        cout << "unknown exception" << endl;
        return 1;
    }

    return EXIT_SUCCESS;
} // main()
4.2.2 Makefile
TARGET = main
CXX = g++
CFLAGS += -I/usr/include -I/usr/local/include/opencv -I/usr/local/include/opencv2 -L/usr/lib  -L/usr/local/lib -L/lib -std=c++98

CFLAGS +=  -lopencv_core -lopencv_objdetect -lopencv_highgui -lopencv_videoio -lopencv_imgcodecs -lopencv_imgproc -lOpenCL -lpthread -lrt

all:
    @$(CXX)  $(TARGET).cpp -o $(TARGET) $(CFLAGS)
clean:
    rm -rf  $(TARGET)
4.2.3 编译运行
root@NanoPC-T6:/opt/opencl-project/opencv-ocl# make 
root@NanoPC-T6:/opt/opencl-project/opencv-ocl# ./main -c

注意:此处测试时,我们使用的OpenCV版本为3.4.14版本,安装步骤具体参考《Rockchip RK3588 - linuxQtopencv交叉编译环境搭建》。

如下图所示:

4.3 简单demo

由于官方给出的代码opencv-ocl,有很多和OpenCL平台相关的处理部分,看起的比较冗杂。因此这里新写一个新的基于Opencv 3.x +OCL程序的Demo框架。

/opt/opencl-project目录下新建opencv-ocl-demo项目。

4.3.1 main.cpp
#include <iostream>
#include <opencv2/opencv.hpp>
#include <opencv2/core/ocl.hpp>

using namespace std;
using namespace cv;
using namespace cv::ocl;

#define GPU 1

int main()
{
    double t = 0.0;

    // 视频捕获 这里我外接了一个摄像头,对应的设备为/dev/video1
    VideoCapture vcap(1);

    if (!vcap.isOpened()) {
        cout << "Can not open video device" << std::endl;
        return -1;
    }

    std::vector<cv::ocl::PlatformInfo> plats;
    cv::ocl::getPlatfomsInfo(plats);
    const cv::ocl::PlatformInfo *platform = &plats[0];
    cout << "Platform Name:" << platform->name().c_str() << endl;

    cv::ocl::Device dev;
    platform->getDevice(dev,0);
    cout << "Device name:" << dev.name().c_str() << endl;    

#if GPU
    cv::ocl::setUseOpenCL(true);
    cout << "Use the OpenCL Deivice?" << cv::ocl::useOpenCL() << endl;

    UMat frame, grayFrame, edges;
    for(;;){
        // 读取视频帧
        vcap.read(frame); 
        t = (double)cv::getTickCount();    
        cv::cvtColor(frame,grayFrame,cv::COLOR_BGR2GRAY);        
        // 进行边缘检测
        cv::Canny(grayFrame, edges, 50, 150); 
        t = ((double)cv::getTickCount() - t) / cv::getTickFrequency();
        std::cout << "GPU Time Cost:" << t << "s" << std::endl;
        
        putText(edges, "Platform : " + platform->name(), Point(5, 30), FONT_HERSHEY_SIMPLEX, 1., Scalar(255, 100, 0), 2);
        putText(edges, "Device : " + dev.name(), Point(5, 60), FONT_HERSHEY_SIMPLEX, 1., Scalar(255, 100, 0), 2);    
        putText(edges, "Time : " + std::to_string(t) + " s", Point(5, 90), FONT_HERSHEY_SIMPLEX, 1., Scalar(255, 100, 0), 2);

        // 显示边缘图像
        cv::imshow("Edges", edges); 

        int key = waitKey(30);
        if (key == 27 || key == 'q' || key == 'Q')
            break;
    }
#else
    cv::ocl::setUseOpenCL(false);
    cout << "Use the OpenCL Deivice?" << cv::ocl::useOpenCL() << endl;
    Mat frame, grayFrame, edges;
    for(;;){
        // 读取视频帧
        vcap.read(frame); 
        t = (double)cv::getTickCount();
        cv::cvtColor(frame,grayFrame,cv::COLOR_RGB2GRAY);
        // 进行边缘检测
        cv::Canny(grayFrame, edges, 50, 150); 
        t = ((double)cv::getTickCount() - t) / cv::getTickFrequency();
        std::cout << "CPU Time Cost:" << t << "s" << std::endl;
        
        putText(edges, "Time : " + std::to_string(t) + " s", Point(5, 30), FONT_HERSHEY_SIMPLEX, 1., Scalar(255, 100, 0), 2);

        // 显示边缘图像
        cv::imshow("Edges", edges); 

        int key = waitKey(30);
        if (key == 27 || key == 'q' || key == 'Q')
            break;
    }
#endif
    // 关闭所有窗口
    cv::destroyAllWindows(); 
    return 0;
}
4.3.2 CMakeLists.txt
# cmake needs this line
cmake_minimum_required(VERSION 3.1)

# Define project name
project(opencv_example_project)

# Find OpenCV, you may need to set OpenCV_DIR variable
# to the absolute path to the directory containing OpenCVConfig.cmake file
# via the command line or GUI
find_package(OpenCV REQUIRED)

# If the package has been found, several variables will
# be set, you can find the full list with descriptions
# in the OpenCVConfig.cmake file.
# Print some message showing some of them
message(STATUS "OpenCV library status:")
message(STATUS "    config: ${OpenCV_DIR}")
message(STATUS "    version: ${OpenCV_VERSION}")
message(STATUS "    libraries: ${OpenCV_LIBS}")
message(STATUS "    include path: ${OpenCV_INCLUDE_DIRS}")

# Declare the executable target built from your sources
add_executable(opencv_example main.cpp)

# Link your application with OpenCV libraries
target_link_libraries(opencv_example PRIVATE ${OpenCV_LIBS})
4.3.3 GPU编译运行

此处我将USB摄像头连接到开发板的USB接口上。当宏GPU被设置为1时,编译运行:

root@NanoPC-T6:/opt/opencl-project/opencv-ocl-demo# cmake -S . -B build
-- OpenCV library status:
--     config: /usr/local/lib/cmake/opencv4
--     version: 4.9.0
--     libraries: opencv_calib3d;opencv_core;opencv_dnn;opencv_features2d;opencv_flann;opencv_gapi;opencv_highgui;opencv_imgcodecs;opencv_imgproc;opencv_ml;opencv_objdetect;opencv_photo;opencv_stitching;opencv_video;opencv_videoio
--     include path: /usr/local/include/opencv4
-- Configuring done (0.0s)
-- Generating done (0.0s)
-- Build files have been written to: /opt/opencl-project/opencv-ocl-demo/build
root@NanoPC-T6:/opt/opencl-project/opencv-ocl-demo# cmake --build build
root@NanoPC-T6:/opt/opencl-project/opencv-ocl-demo# ./build/opencv_example
[ WARN:0@1.474] global cap_gstreamer.cpp:1777 open OpenCV | GStreamer warning: Cannot query video position: status=0, value=-1, duration=-1
arm_release_ver: g13p0-01eac0, rk_so_ver: 10
Platform Name:ARM Platform
Device name:Mali-G610 r0p0
Use the OpenCL Deivice?1
GPU Time Cost:0.001845s

(opencv_example:2522): dbind-WARNING **: 22:51:47.821: AT-SPI: Error retrieving accessibility bus address: org.freedesktop.DBus.Error.ServiceUnknown: The name org.a11y.Bus was not provided by any .service files
GPU Time Cost:0.00103741s
GPU Time Cost:0.000427856s
GPU Time Cost:0.000888377s
GPU Time Cost:0.00085542s
GPU Time Cost:0.000405981s
GPU Time Cost:0.00041269s
GPU Time Cost:0.000391691s
GPU Time Cost:0.000828004s
GPU Time Cost:0.000351734s
GPU Time Cost:0.000803214s
GPU Time Cost:0.000826546s
GPU Time Cost:0.000896543s
GPU Time Cost:0.000805548s
GPU Time Cost:0.000365442s
GPU Time Cost:0.000339485s
......

并且显示图像:

注意:此处测试时,我们使用的OpenCV版本为4.9.0版本。

4.3.4 CPU编译运行

修改宏GPU为0时,编译运行:

root@NanoPC-T6:/opt/opencl-project/opencv-ocl-demo# cmake -S . -B build
-- OpenCV library status:
--     config: /usr/local/lib/cmake/opencv4
--     version: 4.9.0
--     libraries: opencv_calib3d;opencv_core;opencv_dnn;opencv_features2d;opencv_flann;opencv_gapi;opencv_highgui;opencv_imgcodecs;opencv_imgproc;opencv_ml;opencv_objdetect;opencv_photo;opencv_stitching;opencv_video;opencv_videoio
--     include path: /usr/local/include/opencv4
-- Configuring done (0.0s)
-- Generating done (0.0s)
-- Build files have been written to: /opt/opencl-project/opencv-ocl-demo/build
root@NanoPC-T6:/opt/opencl-project/opencv-ocl-demo# cmake --build build
root@NanoPC-T6:/opt/opencl-project/opencv-ocl-demo# ./build/opencv_example
[ WARN:0@1.451] global cap_gstreamer.cpp:1777 open OpenCV | GStreamer warning: Cannot query video position: status=0, value=-1, duration=-1
CPU Time Cost:0.0155965s

(opencv_example:2592): dbind-WARNING **: 22:57:10.363: AT-SPI: Error retrieving accessibility bus address: org.freedesktop.DBus.Error.ServiceUnknown: The name org.a11y.Bus was not provided by any .service files
CPU Time Cost:0.0025578s
CPU Time Cost:0.00177938s
CPU Time Cost:0.00147752s
CPU Time Cost:0.00146177s
CPU Time Cost:0.00188466s
CPU Time Cost:0.00146585s
CPU Time Cost:0.00181409s
CPU Time Cost:0.00186921s
CPU Time Cost:0.00135765s
CPU Time Cost:0.00174263s
CPU Time Cost:0.00191091s
CPU Time Cost:0.00195524s
CPU Time Cost:0.00137194s
CPU Time Cost:0.00190333s
CPU Time Cost:0.00188554s
......

并且显示图像:

根据运行的结果可以看出,CPU运行时长远远大于OpenCL平台加速后的运行时长,因此能够明显体现出加速的效果。

4.3.5 异常处理

如果程序运行出现如下错误:

[ WARN:0] OpenCV | GStreamer warning: Cannot query video position: status=0, value=-1, duration=-1
arm_release_ver: g13p0-01eac0, rk_so_ver: 10
Platform Name:ARM Platform
Device name:Mali-G610 r0p0
Use the OpenCL Deivice?1
OpenCL program build log: imgproc/color_rgb
Status -11: CL_BUILD_PROGRAM_FAILURE
-D depth=0 -D scn=3 -D PIX_PER_WI_Y=1 -D dcn=1 -D bidx=0 -D STRIPE_SIZE=1
<built-in>:131:9: error: expected member name or ';' after declaration specifiers
int32_t depth;
~~~~~~~ ^
<built-in>:1:15: note: expanded from here
#define depth 0
              ^

<built-in>:131:8: error: expected ';' at end of declaration list
int32_t depth;
       ^

error: Compiler frontend failed (error code 63)

该错误是是由于Conflict between OpenCV's local parameters and OpenCL's macro definitions #24645,大致意思就是OpenCL中定义的宏变量和OpenCV局部参数发生了冲突,这里我就不深究了。

OpenCV4.9.0版本已经修复了这个问题,可以尝试下载该版本测试,记得首先移除之前安装的OpenCV 3.4.14

root@NanoPC-T6:/opt/opencv-4.9.0/build# rm -rf  /usr/local/lib/libopencv_*
root@NanoPC-T6:/opt/opencv-4.9.0/build# sudo cmake .. -DWITH_CAMV4L2=ON
root@NanoPC-T6:/opt/opencv-4.9.0/build# sudo make install -j7
root@NanoPC-T6:/opt/opencv-4.9.0/build# sudo /sbin/ldconfig -v

五、FloatVideo-TouchScreen优化

如果FloatVideo-TouchScreen项目需要使用OpenCL进行加速,首先我们需要将OpenCV版本从3.4.14升级到4.9.0,接着按照如下步骤进行调整。

5.1 float-video-touch-screen.pro

修改qmake配置文件:

QT       += core gui

# 编译生成目标文件名称
TARGET = FloatVideo-TouchScreen

greaterThan(QT_MAJOR_VERSION, 4): QT += widgets

CONFIG += c++11

# You can make your code fail to compile if it uses deprecated APIs.
# In order to do so, uncomment the following line.
#DEFINES += QT_DISABLE_DEPRECATED_BEFORE=0x060000    # disables all the APIs deprecated before Qt 6.0.0

SOURCES += \
    main.cpp \
    mainwindow.cpp \
    src/source/camerathread.cpp \
    src/source/usbthread.cpp

HEADERS += \
    mainwindow.h \
    src/include/camerathread.h \
    src/include/usbthread.h

FORMS += \
    mainwindow.ui

TRANSLATIONS += \
    float-video-touch-screen_zh_CN.ts

INCLUDEPATH += /usr/include/libusb-1.0 \
               /usr/local/include/opencv \
               /usr/local/include/opencv4 \  #opencv 4.9.0
               #/usr/local/include/opencv2 \ #opencv 3.4.14
               /usr/include/libusb-1.0 \
               $$PWD/src/include \
               $$PWD/lib/include \
               /usr/include

LIBS += /usr/local/lib/libopencv_calib3d.so \
        /usr/local/lib/libopencv_highgui.so \
        /usr/local/lib/libopencv_core.so    \
        /usr/local/lib/libopencv_dnn.so    \
        /usr/local/lib/libopencv_features2d.so \
        /usr/local/lib/libopencv_flann.so \
        /usr/local/lib/libopencv_ml.so \
        /usr/local/lib/libopencv_imgproc.so \
        /usr/local/lib/libopencv_imgcodecs.so \
        /usr/local/lib/libopencv_objdetect.so \
        /usr/local/lib/libopencv_video.so \
        /usr/local/lib/libopencv_videoio.so \
        /usr/local/lib/libopencv_stitching.so \
        /usr/lib/aarch64-linux-gnu/libusb-1.0.so \
        /usr/lib/aarch64-linux-gnu/libudev.so \
        $$PWD/lib/libScreen.so

# Default rules for deployment.
qnx: target.path = /tmp/$${TARGET}/bin
else: unix:!android: target.path = /opt/$${TARGET}/bin
!isEmpty(target.path): INSTALLS += target

DISTFILES += \
    config/hid.txt \
    tools/README/REMDME.md \
    tools/README/gitee.md \

5.2 源码调整

此处需要对多处源码进行调整,这里就不一一展示代码了。

5.3 编译运行

可以通过如下命令进行编译,也可以通过Qt Creator工具进行编译;

root@NanoPC-T6:/opt/qt-project/FloatVideo-TouchScreen/lib# cd ..
root@NanoPC-T6:/opt/qt-project/FloatVideo-TouchScreen# mkdir build
root@NanoPC-T6:/opt/qt-project/FloatVideo-TouchScreen# cd build 
root@NanoPC-T6:/opt/qt-project/FloatVideo-TouchScreen/build# qmake ..
root@NanoPC-T6:/opt/qt-project/FloatVideo-TouchScreen/build# ls -l
root@NanoPC-T6:/opt/qt-project/FloatVideo-TouchScreen/build# make

运行程序:

root@NanoPC-T6:/opt/qt-project/FloatVideo-TouchScreen/build# export DISPLAY=:0.0;./FloatVideo-TouchScreen -size 0.8
# 打开新的终端
root@NanoPC-T6:~# top
任务: 257 total,   1 running, 256 sleeping,   0 stopped,   0 zombie
%Cpu(s): 12.4 us,  0.8 sy,  0.0 ni, 86.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  15948.7 total,  14260.9 free,    643.7 used,   1044.1 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  14980.9 avail Mem

 进程号 USER      PR  NI    VIRT    RES    SHR    %CPU  %MEM     TIME+ COMMAND
   7056 root      20   0 2490876 138316 102356 S  87.0   0.8   0:21.99 FloatVideo-Touc
    831 root      20   0 3785368 263928 207016 S  14.6   1.6   0:54.98 Xorg
   1217 pi        20   0 1862424  77360  57972 S   1.7   0.5   0:07.28 xfwm4
   6888 root       0 -20       0      0      0 I   0.7   0.0   0:00.28 kworker/u17:1-mali_kbase_csf_sync_upd
   6908 root       0 -20       0      0      0 D   0.7   0.0   0:00.46 kworker/u17:0+csf_scheduler_wq
   5065 root       0 -20       0      0      0 I   0.3   0.0   0:00.07 kworker/7:2H-events_highpri
   6901 root      20   0       0      0      0 I   0.3   0.0   0:00.11 kworker/u16:4-events_unbound
   6930 root      20   0       0      0      0 D   0.3   0.0   0:00.48 kworker/4:1+events
   7105 root      20   0       0      0      0 I   0.3   0.0   0:00.03 kworker/5:1-events
   7106 root      20   0   12884   3656   2864 R   0.3   0.0   0:00.07 top
   7109 root       0 -20       0      0      0 I   0.3   0.0   0:00.01 kworker/2:2H-events_highpri
   ......

可以看到CPU占用已经明显下降。此时查看GPU使用率:

root@NanoPC-T6:~# cat /sys/class/devfreq/fb000000.gpu/load
27@300000000Hz

六、代码下载

大奥特曼打小怪兽/opencl-project

参考文章

[1] RK3588实战:调用npu加速,yolov5识别图像、ffmpeg发送到rtmp服务器

[2] 嵌入式AI应用开发实战指南—基于LubanCat-RK系列板卡

[3] RK3588边缘计算

[4] OpenCL学习笔记(四)手动编译开发库(ubuntu+gcc+rk3588

[5] 如何在RK3399中使用opencl并安装QT开发

[6] Arm Mali GPU OpenCL Developer Guide

[7] 什么是OpenCL

[8] 高性能计算

[9] OpenCL练习(一):使用OpenCL+OpenCV进行RGB转灰度图

[10] https://opencv.org/opencl

[11] https://github.com/opencv/opencv/wiki/OpenCL-optimizations

[12] OpenCV OpenCL support

[13] 一、Opencv-OCL编程基础

posted @ 2024-01-15 00:25  大奥特曼打小怪兽  阅读(3177)  评论(0编辑  收藏  举报
如果有任何技术小问题,欢迎大家交流沟通,共同进步