Choosing a transfer path for OpenCL memory objects
For an application, choosing the right memory object transfer path can significantly improve performance.
Let's first look at some buffer bandwidth examples:
1. clEnqueueWriteBuffer() and clEnqueueReadBuffer()
If the application has already allocated the memory via malloc or mmap, CL_MEM_USE_HOST_PTR is an ideal choice.
There are two ways to use this path:
The first:
a. pinnedBuffer = clCreateBuffer( CL_MEM_ALLOC_HOST_PTR or CL_MEM_USE_HOST_PTR )
b. deviceBuffer = clCreateBuffer()
c. void *pinnedMemory = clEnqueueMapBuffer( pinnedBuffer )
d. clEnqueueRead/WriteBuffer( deviceBuffer, pinnedMemory )
e. clEnqueueUnmapMemObject( pinnedBuffer, pinnedMemory )
The pinning cost is incurred at step a; step d incurs no pinning cost. Typically, an application performs steps a, b, c, and e once, and then repeatedly reads or modifies the data in pinnedMemory, followed by step d.
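A minimal C sketch of this first sequence, for the host-to-device direction, is shown below. It assumes an already created context ctx, command queue queue, source pointer src, and size nbytes; these names are illustrative and error handling is omitted.

#include <CL/cl.h>
#include <string.h>

/* Sketch: stage a host->device write through a pre-pinned buffer. */
void write_via_prepinned(cl_context ctx, cl_command_queue queue,
                         const void *src, size_t nbytes)
{
    cl_int err;

    /* a. pinned staging buffer (the pinning cost is paid here) */
    cl_mem pinnedBuffer = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR,
                                         nbytes, NULL, &err);
    /* b. regular device buffer */
    cl_mem deviceBuffer = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                         nbytes, NULL, &err);

    /* c. map the pinned buffer to obtain a host pointer */
    void *pinnedMemory = clEnqueueMapBuffer(queue, pinnedBuffer, CL_TRUE,
                                            CL_MAP_WRITE, 0, nbytes,
                                            0, NULL, NULL, &err);
    memcpy(pinnedMemory, src, nbytes);

    /* d. transfer from the already-pinned pointer: no pinning cost here */
    clEnqueueWriteBuffer(queue, deviceBuffer, CL_TRUE, 0, nbytes,
                         pinnedMemory, 0, NULL, NULL);

    /* e. unmap the pinned staging buffer */
    clEnqueueUnmapMemObject(queue, pinnedBuffer, pinnedMemory, 0, NULL, NULL);
    clFinish(queue);

    clReleaseMemObject(pinnedBuffer);
    clReleaseMemObject(deviceBuffer);
}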
The second:
clEnqueueRead/WriteBuffer is used directly on the user's memory buffer. Before the (host->device) copy, the runtime must first pin (page-lock) the memory; only then is the transfer performed. This path reaches roughly 2/3 of the peak interconnect bandwidth.
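For contrast, a sketch of this direct path, under the same assumptions as the previous sketch (ctx, queue, src, nbytes and the same headers):

/* Sketch: direct write from ordinary malloc'd memory. The runtime pins
   (page-locks) the pages before the copy, which is why this path reaches
   only about 2/3 of peak interconnect bandwidth. */
void write_direct(cl_context ctx, cl_command_queue queue,
                  const void *src, size_t nbytes)
{
    cl_int err;
    cl_mem deviceBuffer = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                         nbytes, NULL, &err);
    clEnqueueWriteBuffer(queue, deviceBuffer, CL_TRUE, 0, nbytes,
                         src, 0, NULL, NULL);
    clReleaseMemObject(deviceBuffer);
}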
2. clEnqueueCopyBuffer() on a pre-pinned host buffer
Similar to option 1, clEnqueueCopyBuffer transfers data through a pre-pinned buffer at peak interconnect bandwidth:
a. pinnedBuffer = clCreateBuffer( CL_MEM_ALLOC_HOST_PTR or CL_MEM_USE_HOST_PTR )
b. deviceBuffer = clCreateBuffer()
c. void *memory = clEnqueueMapBuffer( pinnedBuffer )
d. Application writes or modifies memory.
e. clEnqueueUnmapMemObject( pinnedBuffer, memory )
f. clEnqueueCopyBuffer( pinnedBuffer, deviceBuffer )
Or, for the reverse direction (device to host):
g. clEnqueueCopyBuffer( deviceBuffer, pinnedBuffer )
h. void *memory = clEnqueueMapBuffer( pinnedBuffer )
i. Application reads memory.
j. clEnqueueUnmapMemObject( pinnedBuffer, memory )
Since the pinned memory resides in host memory, the clEnqueueMapBuffer() and clEnqueueUnmapMemObject() calls do not trigger any data transfer, and the CPU can operate on these pinned buffers at host memory bandwidth.
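A possible C sketch of the host-to-device direction (steps a through f), again assuming ctx, queue, src, nbytes and the headers from the first sketch:

/* Sketch: host -> device transfer through a pre-pinned staging buffer
   using clEnqueueCopyBuffer. */
void copy_via_prepinned(cl_context ctx, cl_command_queue queue,
                        const void *src, size_t nbytes)
{
    cl_int err;
    cl_mem pinnedBuffer = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR,
                                         nbytes, NULL, &err);      /* a */
    cl_mem deviceBuffer = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                         nbytes, NULL, &err);      /* b */

    void *memory = clEnqueueMapBuffer(queue, pinnedBuffer, CL_TRUE,
                                      CL_MAP_WRITE, 0, nbytes,
                                      0, NULL, NULL, &err);        /* c */
    memcpy(memory, src, nbytes);                                   /* d */
    clEnqueueUnmapMemObject(queue, pinnedBuffer, memory,
                            0, NULL, NULL);                        /* e */

    /* f. the copy engine moves the data at peak interconnect bandwidth */
    clEnqueueCopyBuffer(queue, pinnedBuffer, deviceBuffer,
                        0, 0, nbytes, 0, NULL, NULL);
    clFinish(queue);

    clReleaseMemObject(pinnedBuffer);
    clReleaseMemObject(deviceBuffer);
}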
3. clEnqueueMapBuffer() and clEnqueueUnmapMemObject() on a device buffer
For buffers previously allocated via malloc or mmap, the total transfer cost includes, besides the interconnect transfer, a memcpy that copies the data into the mapped device buffer.
a. Data transfer from host to device buffer.
1. ptr = clEnqueueMapBuffer( .., buf, .., CL_MAP_WRITE, .. )
Since the buffer is mapped write-only, no data is transferred from the device to the host, so the map cost is low. A pointer to a pinned host buffer is returned.
2. The application fills the host buffer via memset( ptr ), memcpy( ptr, srcptr ), fread( ptr ), or direct CPU writes. These operations read and write at full host memory bandwidth.
3. clEnqueueUnmapMemObject( .., buf, ptr, .. )
The pre-pinned buffer is transferred to the GPU device at peak interconnect bandwidth.
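A sketch of this host-to-device map/unmap sequence, under the same assumptions (ctx, queue, src, nbytes):

/* Sketch: fill a device buffer through clEnqueueMapBuffer with CL_MAP_WRITE. */
void fill_device_buffer(cl_context ctx, cl_command_queue queue,
                        const void *src, size_t nbytes)
{
    cl_int err;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, nbytes, NULL, &err);

    /* 1. low-cost map: nothing is read back because of CL_MAP_WRITE */
    void *ptr = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                   0, nbytes, 0, NULL, NULL, &err);

    /* 2. write into the pinned host pointer at host memory bandwidth */
    memcpy(ptr, src, nbytes);

    /* 3. unmap: data travels to the device at peak interconnect bandwidth */
    clEnqueueUnmapMemObject(queue, buf, ptr, 0, NULL, NULL);
    clFinish(queue);
    clReleaseMemObject(buf);
}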
b. Data transfer from device buffer to host.
1. ptr = clEnqueueMapBuffer(.., buf, .., CL_MAP_READ, .. )
This call initiates a device-to-host transfer. The data is moved at peak interconnect bandwidth into a pre-pinned temporary buffer, and a pointer to that pinned memory is returned.
2. The application reads and processes the data, or executes memcpy( dstptr, ptr ), fwrite( ptr ), or a similar function. Since the buffer resides in host memory, these operations run at host memory bandwidth.
3. clEnqueueUnmapMemObject( .., buf, ptr, .. )
Since the buffer was mapped read-only, no actual data transfer takes place, so the unmap operation is low cost.
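And a sketch of the device-to-host direction, assuming the device buffer buf already holds the results, dst is a host destination, and queue/nbytes are as before:

/* Sketch: read a device buffer back through clEnqueueMapBuffer with CL_MAP_READ. */
void read_device_buffer(cl_command_queue queue, cl_mem buf,
                        void *dst, size_t nbytes)
{
    cl_int err;

    /* 1. the map triggers a device->host transfer into a pre-pinned staging area */
    void *ptr = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_READ,
                                   0, nbytes, 0, NULL, NULL, &err);

    /* 2. consume the data at host memory bandwidth */
    memcpy(dst, ptr, nbytes);

    /* 3. low-cost unmap: the mapping was read-only, nothing is written back */
    clEnqueueUnmapMemObject(queue, buf, ptr, 0, NULL, NULL);
    clFinish(queue);
}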
4. Host direct access to a device zero copy buffer
This access pattern allows data transfer and GPU computation to overlap, which is useful for sparse writes or updates.
a. A zero copy device buffer is created with the following call:
buf = clCreateBuffer ( .., CL_MEM_USE_PERSISTENT_MEM_AMD, .. )
The CPU can access this buffer directly through the uncached write-combine (WC) path. A common pattern is double buffering: while the GPU processes the data in one buffer, the CPU fills the other.
A zero copy device buffer can also be used for sparse updates, such as assembling sub-rows of a larger matrix into a smaller, contiguous block for GPU processing. Due to the WC path, it is a good design choice to align writes to the cache line size and to pick the write block size as large as possible.
b. Transfer from the host to the device.
1. ptr = clEnqueueMapBuffer( .., buf, .., CL_MAP_WRITE, .. )
This operation is low cost because the zero copy device buffer is directly mapped into the host address space.
2. The application transfers data via memset( ptr ), memcpy( ptr, srcptr ), or direct CPU writes.
The CPU writes directly across the interconnect into the zero copy device buffer. Depending on the chipset, the bandwidth can be of the same order of magnitude as the interconnect bandwidth, although it typically is lower than peak.
3. clEnqueueUnmapMemObject( .., buf, ptr, .. )
As with the preceding map, this operation is low cost because the buffer continues to reside on the device.
c. If the buffer content must be read back later, use clEnqueueReadBuffer( .., buf, ..) or clEnqueueCopyBuffer( .., buf, zero copy host buffer, .. ).
This bypasses slow host reads through the uncached path.
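A sketch of the host write path to a zero copy device buffer. CL_MEM_USE_PERSISTENT_MEM_AMD is an AMD-specific flag declared in CL/cl_ext.h; the remaining names follow the earlier sketches.

#include <CL/cl_ext.h>   /* CL_MEM_USE_PERSISTENT_MEM_AMD */

/* Sketch: host writes directly into a zero copy device buffer. */
void write_zero_copy_device(cl_context ctx, cl_command_queue queue,
                            const void *src, size_t nbytes)
{
    cl_int err;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_USE_PERSISTENT_MEM_AMD,
                                nbytes, NULL, &err);

    /* 1. cheap map: the device buffer is mapped into the host address space */
    void *ptr = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                   0, nbytes, 0, NULL, NULL, &err);

    /* 2. CPU writes cross the interconnect through the uncached WC path;
          write sequentially in large, cache-line-aligned blocks */
    memcpy(ptr, src, nbytes);

    /* 3. cheap unmap: the data already resides on the device */
    clEnqueueUnmapMemObject(queue, buf, ptr, 0, NULL, NULL);
    clFinish(queue);

    /* read the contents back later with clEnqueueReadBuffer()/clEnqueueCopyBuffer(),
       not through the mapped pointer */
    clReleaseMemObject(buf);
}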
5. GPU direct access to host zero copy memory
This option allows direct reads or writes of host memory by the GPU. A GPU kernel can import data from the host without explicit transfer, and write data directly back to host memory. An ideal use is to perform small I/Os straight from the kernel, or to integrate the transfer latency directly into the kernel execution time.
a. The application creates a zero copy host buffer.
buf = clCreateBuffer( .., CL_MEM_ALLOC_HOST_PTR, .. )
b. Next, the application modifies or reads the zero copy host buffer.
1. ptr = clEnqueueMapBuffer( .., buf, .., CL_MAP_READ | CL_MAP_WRITE, .. )
This operation is very low cost because it is a map of a buffer already residing in host memory.
2. The application modifies the data through memset( ptr ), memcpy( in either direction ), sparse or dense CPU reads or writes. Since the application is modifying a host buffer, these operations take place at host memory bandwidth.
3. clEnqueueUnmapMemObject( .., buf, ptr, .. )
As with the preceding map, this operation is very low cost because the buffer continues to reside in host memory.
c. The application runs clEnqueueNDRangeKernel(), using buffers of this type as input or output. GPU kernel reads and writes go across the interconnect to host memory, and the data transfer becomes part of the kernel execution.
The achievable bandwidth depends on the platform and chipset, but can be of the same order of magnitude as the peak interconnect bandwidth.
For discrete graphics cards, it is important to note that the resulting GPU kernel bandwidth is an order of magnitude lower than that of a kernel accessing a regular device buffer located on the device.
d. Following kernel execution, the application can access data in the host buffer in the same manner as described above.
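A sketch tying these steps together for a zero copy host buffer consumed by a kernel. The cl_kernel object, its argument layout, and the element count are assumed to exist and are purely illustrative; the same headers and setup as the earlier sketches apply.

/* Sketch: a kernel reads from / writes to a zero copy host buffer directly. */
void run_on_zero_copy_host(cl_context ctx, cl_command_queue queue,
                           cl_kernel kernel, size_t nelems)
{
    cl_int err;
    size_t nbytes = nelems * sizeof(float);

    /* a. zero copy host buffer */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR, nbytes, NULL, &err);

    /* b. fill it on the CPU at host memory bandwidth */
    float *ptr = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE,
                                             CL_MAP_READ | CL_MAP_WRITE,
                                             0, nbytes, 0, NULL, NULL, &err);
    for (size_t i = 0; i < nelems; ++i)
        ptr[i] = (float)i;
    clEnqueueUnmapMemObject(queue, buf, ptr, 0, NULL, NULL);

    /* c. kernel accesses host memory across the interconnect;
          the transfer cost is folded into the kernel execution time */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &nelems, NULL, 0, NULL, NULL);
    clFinish(queue);

    /* d. results can be read back on the host exactly as in step b */
    clReleaseMemObject(buf);
}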