cuda FORTRAN 统一内存 managed
这里是官方的一篇很好的帖子,里面有实例和讲解 managed memory 的使用方法。
http://www.pgroup.com/lit/articles/insider/v6n1a2.htm
可以直接看官方的文档,排版比较好,为了防止以后这篇文档丢失,我复制在下面,有机会再整理一下。
CUDA Fortran Managed Memory
Note: This update to an article originally published in August 2014 reflects changes effective with PGI version 15.9. If you're using any earlier PGI version, please refer to the original article.
In PGI 14.7 we are introducing the biggest productivity improvement to CUDA Fortran since we rolled out CUF kernels roughly four years ago. Using managed memory simplifies many coding tasks, makes source code cleaner, and enables a unified view of complicated data structures across host and device memories.
A good explanation of Unified Memory can be found on the NVIDIA Parallel Forall blog, and Appendix J of the CUDA C Programming Guide gives a detail explanation. Unified Memory provides a way to obtain a single pointer to user allocated data that can be used in both host and device code. Unlike zero-copy memory, managed memory is not pinned and static, but migrates between the host and device on access. The system (driver and OS) controls the physical page movement. Specifically, in CUDA C/C++, managed memory usage consists of replacing calls to cudaMalloc() with cudaMallocManaged(), and removing explicit uses of cudaMemcpy() to transfer data between host and device.
In CUDA Fortran, we've added the managed keyword to the language, which can be used in host code similarly to the device keyword. Here's an example of how we've used it to simplify the host code in our sgemm test, a CUDA Fortran example we've used frequently here at PGI:
Here's the code before CUDA managed data:
! matrix data real, allocatable, dimension(:,:) :: A, B, C, gold real, allocatable, device, dimension(:,:) :: dA, dB, dC . . . allocate(A(N,N)) allocate(B(N,N)) allocate(C(N,N)) allocate(gold(N,N)) call random_number(A) call random_number(B) allocate(dA(N,N)) allocate(dB(N,N)) allocate(dC(N,N)) dA = A dB = B dC = 0.0 alpha = 1 beta = 0 m = N k = N blocks = dim3(N/256, N/16, 1) threads = dim3(16, 16, 1) call sgemm_cpu(A, B, gold, m, N, k, alpha, beta) ! timing experiment time = 0.0 istat = cudaEventRecord(start, 0) do j = 1, NREPS call sgemmNN_16x16<<<blocks, threads="">>>(dA, dB, dC, m, N, k, alpha, beta) end do istat = cudaEventRecord(stop, 0) istat = cudaDeviceSynchronize() istat = cudaEventElapsedTime(time, start, stop) time = time / (NREPS*1.0e3) C = dC
Now, here's the same code using managed data:
! matrix data real, managed, allocatable, dimension(:,:) :: A, B, C real, allocatable, dimension(:,:) :: gold . . . allocate(A(N,N)) allocate(B(N,N)) allocate(C(N,N)) allocate(gold(N,N)) call random_number(A) call random_number(B) C = 0.0 alpha = 1 beta = 0 m = N k = N blocks = dim3(N/256, N/16, 1) threads = dim3(16, 16, 1) call sgemm_cpu(A, B, gold, m, N, k, alpha, beta) ! timing experiment time = 0.0 istat = cudaEventRecord(start, 0) do j = 1, NREPS call sgemmNN_16x16<<<blocks, threads="">>>(A, B, C, m, N, k, alpha, beta) end do istat = cudaEventRecord(stop, 0) istat = cudaDeviceSynchronize() istat = cudaEventElapsedTime(time, start, stop) time = time / (NREPS*1.0e3)
As you can see, we no longer need two copies of the arrays A, B, and C. The data movement still happens, but rather than being explicit, now it is controlled by the unified memory management system behind the scenes, much like the way an OS manages virtual memory.
If your CUDA Fortran program is bogged down with two copies of variables, like variable_name for the host, and variable_name_d for the device, then managed memory might be the answer for you.
In this sgemm case, the performance impact of using managed memory is negligible. In other cases, the use of managed memory may result in unwanted transfers between host and device without any of the look-ahead, pre-fetching, or overlap operations that you might find in a tuned CUDA code. Be sure to check wall-clock time; any extra overhead will be attributed to system time by the Linux time command. Remember, you can still use all the tools CUDA provides to tune performance including the use of streams and asynchronous copies where needed to optimize the overlap of computation and data movement.
In our limited testing, we've seen 0–20% performance degradation when doing a quick replacement of the device attribute with the managed attribute followed by removing the data transfer assignment statements. In one purposely poorly done effort, we saw code run about twice as slow but we don't think that's a representative example. I would expect most users will see the former behavior on current hardware and CUDA 6.0.
Managed Data Argument Matching
You can pass managed arrays to kernels which declare their dummy arguments as device arrays so your global subroutines do not require a rewrite. In fact, we've implemented argument matching so that managed actual arguments can match managed, device, and host dummy arguments, in that order. If you declare your dummy argument to be managed, it is only valid to pass a managed actual argument to it; anything else will result in a compilation error as long as the interface is explicit.
module minit interface init module procedure inith, initm, initd end interface contains subroutine initd(a) integer, device :: a(:) print *,"initd" a(:) = z'deadbeef' end subroutine subroutine inith(a) integer :: a(:) print *,"inith" a(:) = z'deadbeef' end subroutine subroutine initm(a) integer, managed :: a(:) print *,"initm" if (size(a).le.1000) then call inith(a) else call initd(a) end if end subroutine end module program t use minit integer, parameter :: n = 10 integer, managed :: m(n) integer, device :: d(n) integer :: h(n) call init(m) call init(d) call init(h) end program
Here, we've overloaded the init function to handle managed, device, or host data. The managed initfunction, in turn, can call either the device or host version depending on some condition. The same rules apply for calling global subroutines.
Managed Data Runtime Visibility and Access
There are no limitations on the availability of managed memory from the device side. For example you can initiate managed data operations on the host and then launch a kernel. When that kernel executes, you are guaranteed to access the updated data on the device.
When a kernel is executing, however, no host accesses of globally visible managed data are permitted. There is just no way currently for the system (driver and host OS) to query, interrupt, and/or synchronize the device sufficiently without some help from the programmer. Using CUDA 6.0, you should put explicit calls to cudaDeviceSynchronize() after your kernel launches and before you try to access the managed data areas from the host. Not doing this may lead to segmentation faults in your application that can be very difficult to track down.
If you suspect that seg faults are due to illegal managed data accesses, one way to make sure is to run under cuda-memcheck. You will see an error like this:
brentl@sb-leback:~/tmp> ./a.out Segmentation fault brentl@sb-leback:~/tmp> cuda-memcheck ./a.out ========= CUDA-MEMCHECK ========= Error: process didn't terminate successfully ========= The application may have hit an error when dereferencing Unified Memory from the host. Please rerun the application under cuda-gdb or Nsight Eclipse Edition to catch host side errors. ========= Internal error (20) ========= No CUDA-MEMCHECK results found
Advanced users will want to understand whether high-level constructs like assignment are accessing the managed data from the host or from the device. We will discuss that in detail in the Assignment section below.
Managed CUDA Fortran API Routines
Just like we have CUDA Fortran wrappers for the CUDA C API for things like cudaMalloc, cudaMemset, and cudaMemcpy, we have added interfaces and wrappers to the new managed memory API routines.
cudaMallocManaged() can be called with an allocatable managed array just as you can call cudaMalloc with an allocatable device array. There are two additional arguments besides the array to allocate. An integer count, to specify the number of elements to allocate, and flags which specify the visibility of the managed data. By default, the managed data is visible to all kernels running on all streams. This behavior can be changed with the next runtime routine we'll introduce.
cudaStreamAttachMemAsync() allows stream-level control of managed data, and can allow host code to access managed data not currently involved in a running kernel.
In addition to these two CUDA calls, in CUDA Fortran we've added some additional functions to make life easier if you mix OpenMP and CUDA Fortran:
cudaforSetDefaultStream() allows you to set the default stream for high-level language constructs on a per-thread basis. Inputs are the stream number, or alternatively, a device or managed object and a stream number. The default stream, when set, applies to allocation/stream attach operations on managed data, data movement operations on both managed and device data, and the new intrinsic reduction functions. This stream must still be set explicitly on CUDA kernel launches by the programmer.
Managed Data with Derived Types
One area where CUDA Fortran managed data really shines is with derived types. Before, using device data, derived types were awkward because a derived type containing allocatable device arrays could not itself easily have the device attribute. So, in effect, some fields of the derived type were resident on the device while other fields were resident on the host. Managing that programmatically was cumbersome as the types became larger and more deeply nested.
Now, if your program makes use of derived types in its computational core, managed data seems like a great way to go. Only the data that is required will be moved on demand. If large areas of the types are not accessed, they will not incur the overhead of being moved back and forth. Furthermore, putting these types in modules, both the definitions and instantiated entities, means they can be used in both host and device code and don't need to be passed as arguments.
Here is an example, submitted to our User Forum several years ago, that we have not been able to adequately address until now:
module mCuda integer :: num_cm type tCM integer, allocatable, device :: fine(:) real, allocatable, device :: mat_matrix(:,:,:) real, allocatable, device :: src_matrix(:,:,:) end type type(tCM), allocatable, device :: cm_list(:) end module program test use mCuda integer i num_cm=10 allocate (cm_list(num_cm)) do i=1, num_cm allocate(cm_list(i)%fine(3)) cm_list(i)%fine=10 enddo end
In this example the problem has been that the array of derived types, cm_list, has to be allocated and set up on the host, but we want it to be accessed primarily on the device. The Fortran device code that accesses the derived type, like mat_matrix, is most flexible and most clearly expressed when you use the entire structure, i.e. something like cm_list(i)%mat_matrix(j,k,12). It is up to the compiler (our job) to optimize away common sub-expressions in the generated code.
So, first off, let's just change the outer declaration to use managed data:
type(tCM), allocatable, managed :: cm_list(:)
When we do that, what we see when we compile and run this code is:
# Make sure to specify CUDA 6.0 on the compile line: % pgfortran -Mcuda=cuda6.0 userforum.cuf % ./a.out Segmentation fault
This falls under the illegal managed data access rules described above. When we execute this loop:
do i=1, num_cm allocate(cm_list(i)%fine(3)) cm_list(i)%fine=10 enddo
the first iteration executes successfully. We access the managed data in cm_list to set up the first fine array, cm_list(1)%fine. Next we run a kernel (a memset operation contained in the CUDA Fortran runtime library) to initialize the device array cm_list(i)%fine to the value 10. Kernel launches are asynchronous with respect to the host. So, the host circles around to iteration 2, where we again access the managed data in cm_list to set up for cm_list(2)%fine. This causes the seg fault as we are accessing global managed data while a kernel is executing.
There are at least 3 ways to work around this problem:
- Add synchronization to the loop to ensure the memset kernel is done before accessing managed data.
do i=1, num_cm allocate(cm_list(i)%fine(3)) cm_list(i)%fine=10 istat = cudaDeviceSynchronize() enddo
- Make all the arrays managed. Use cudaforSetDefaultStream to put all the managed arrays in the same stream:
type tCM integer, allocatable, managed :: fine(:) real, allocatable, managed :: mat_matrix(:,:,:) real, allocatable, managed :: src_matrix(:,:,:) end type program test use mCUDA use cudafor integer (kind=cuda_stream_kind) :: str_mng integer i num_cm=10 istat = cudaStreamCreate(str_mng) istat = cudaforSetDefaultStream(str_mng) allocate (cm_list(num_cm)) do i=1, num_cm allocate(cm_list(i)%fine(3)) cm_list(i)%fine=10 enddo end
- Explicitly perform the device array operations on one stream and the managed array operations on another by mapping each allocated entity to the specific stream. This is probably the best performing option:
program test use mCuda use cudafor integer(kind=cuda_stream_kind) :: str_mng, str_dev integer i num_cm=10 istat = cudaStreamCreate(str_mng) istat = cudaStreamCreate(str_dev) allocate(cm_list(num_cm)) istat = cudaforSetDefaultStream(cm_list, str_mng) do i=1, num_cm allocate(cm_list(i)%fine(3)) istat = cudaforSetDefaultStream(cm_list(i)%fine, str_dev) cm_list(i)%fine=10 enddo
Which methods to use are up to you, the programmer, based on ease of use, readability, and performance requirements.
Assignment
CUDA Fortran uses high-level array syntax for moving data between host and device. This feature is carried over to managed data as well. For instance:
real, managed :: a(10000) real :: b(10000) a = 0.0 ! This results in a cudaMemset type operation which launches ! a kernel on the device. Because a is managed, by default ! our runtime inserts a cudaStreamSynchronize call after the ! operation. b = 1.0 ! This, of course, uses host data and runs on the host only ! a = b ! This results in a cudaMemcpy API call, host to device. ! cudaMemcpy is a synchronous operation. b = a ! This results in a cudaMemcpy API call, device to host ! This also is synchronous a(1:10000:2) = b(1:5000) ! Using slice notation of any kind with managed ! data results in the copy happening on the host. istat = cudaforSetDefaultStream(a, mystream) a = b ! This results in an asynchronous cudaMemcpy API call, host to ! device, on stream mystream.
Managed Data in CUF Kernels
Here's an example of managed data used in an OpenMP program with CUF kernels. It also takes advantage of the new default stream API:
program ompcuf use cudafor use omp_lib integer(kind=cuda_stream_kind) :: mystream !$omp parallel private(istat,mystream) istat = cudaStreamCreate(mystream) istat = cudaforSetDefaultStream(mystream) call foo() !$omp end parallel end subroutine foo() use cudafor use omp_lib real, managed :: a(10000) j = omp_get_thread_num() a = real(j) !$cuf kernel do <<< *, *, stream=cudaforGetDefaultStream() >>> do i = 1, 10000 a(i) = a(i) + 1.0 end do istat = cudaStreamSynchronize(cudaforGetDefaultStream()) if (any(a.ne.real(j+1))) then print *,"Found error on ",j else print *,"Looks good on ",j endif end
In the main program, we create a stream for each OMP thread. We add a call to cudaforSetDefaultStream to set the default stream for all subsequent high-level language constructs. So, for instance, in subroutine foo, the array a gets allocated as managed data and bound to the default stream behind the scenes using cudaStreamAttachMemAsync(). In the assignment a = real(j), we also use the thread's default stream in what is basically a cudaMemset operation. We have to explicitly add the default stream in the launch configuration of the CUF kernel, and also use the thread's default stream to synchronize. Once the cudaStreamSynchronize has occurred, this thread can safely access the managed data on the host, in this case in the any() function, even while other threads may be in the middle of their kernel launch.
Conclusion
I've shown a number of ways in which the CUDA 6.0 managed memory feature can make CUDA Fortran programming easier. At this point, managed memory should be viewed as a productivity feature, and not targeted at improved performance. As NVIDIA GPU architectures progress over the next several years, we expect to see better and better performance from the unified memory system. You can start today, and I think you'll find managed memory is a good tool for learning, coding up prototypes, and simplifying complicated data structures which need to be used on both the host and device. Stay tuned; we have a few more ideas on how unified memory can make additional high-level Fortran language features available to CUDA Fortran programmers in the near future.