Introduction
Back in early 2004, NVIDIA released their book on graphics programming: GPU Gems. While most contributors focused on techniques aimed at next-generation 3D games, some were writing for a very different audience. A couple of articles covered non-real-time rendering, and some of those techniques are likely used today in NVIDIA's Gelato product line.
In addition to that, the book ended with a section appropriately named "Beyond Triangles", containing miscellaneous chapters on subjects such as fluid dynamics, but also chapters on "Volume Rendering Techniques" and "3D Ultrasound Visualization". These two subjects, along with others such as image reconstruction, represent some of the workloads found in the highly lucrative visualisation industry.
These two areas, namely visualisation and near-time rendering (via Gelato, for example), represent the primary target markets for NVIDIA's Quadro Plex solutions. In a recent conference call, Jen-Hsun Huang said that they expect this opportunity to represent hundreds of millions of dollars of revenue in the future. In other words, a few innocent chapters in a three-year-old book now look like a tremendous business opportunity.
Then, in GPU Gems 2 (released about one year later, in April 2005), the number of chapters focused on near-time rendering techniques jumped tremendously. What's even more interesting, however, is that about 25% of the book's articles were focused on what NVIDIA called "General-Purpose Computation on GPUs". The world was already beginning to realize the potential performance-per-dollar and performance-per-watt advantages of GPUs for certain workloads.
One could argue that it was the growing popularity of websites such as GPGPU.org which pushed NVIDIA to take this paradigm seriously, and while that certainly must have played a role in their thinking, another significant factor cannot be ignored: NVIDIA already had a very clear idea of what their architectural decisions would be for the DX10 timeframe, and they likely realised the tremendous potential of that architecture for GPGPU workloads.
Many of the key patents behind NVIDIA's unified shader core architecture were filed in late 2003, including one by John Erik Lindholm and Simon Moy. Based on our questioning, Lindholm was in fact the lead project engineer for G80's shader engine. Other related patents were filed later, including one on a 1.5GHz multi-purpose ALU in November 2004, by Ming Y. Siu and Stuart F. Oberman. The latter was also behind AMD's 3DNow! instruction set and contributed significantly to AMD's K6 and K7 FPUs.
Needless to say, it's been a number of years since NVIDIA realised they could have another significant business opportunity based around more generalised GPU computing. What they needed, then, was a way to extend the market while also simplifying the programming model and improving efficiency. They put together a small team of hardware and software engineers dedicated specifically to that problem, and implemented both new hardware features and a new API to expose them: CUDA.
Today, NVIDIA is making the CUDA beta SDK available to the public, which means the fruits of their labour are about to leave the areas of mildly restrictive NDAs and arcane secrets. Of course, we've had access to CUDA for a certain amount of time now, and we've got the full scoop on how it all works and what it all means. So read on!
NVIDIA expects the Professional, Visualization and GPGPU markets to represent a significant part of their addressable market in 2011 and beyond. Needless to say, that'd put a few new sets of tyres on JHH's Ferrari.
Previous Limitations of the GPGPU Model
Considering Microsoft's Direct3D 10 system already implements a significantly more flexible programming model, one might ask why we even need an interface dedicated to GPGPU programming anymore. The first argument against using an existing graphics API is that its implementation is driver-dependent: every new driver version might, somehow, introduce new bugs or change some aspects of the implementation.
But even if we excluded that factor, the problem remains that neither DirectX nor OpenGL was designed with GPGPU as a primary goal, and this limits their performance for such workloads. Perhaps more importantly, however, arbitrary reads and writes to memory that bypass the caching system (or flush it) are still not supported in the Direct3D 10 API.
Unsurprisingly, CUDA natively supports both Gather (arbitrary memory read) and Scatter (arbitrary memory write) operations. Neither writes nor reads are cached.
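To make that concrete, here's a minimal sketch of what gather and scatter look like from a CUDA kernel. The kernel name and index arrays are purely our own illustration, not something out of NVIDIA's SDK:

```cuda
// Hypothetical example: each thread gathers a value from an arbitrary
// location and scatters its result to another arbitrary location.
// On G80, neither the read nor the write goes through a cache.
__global__ void gather_scatter(const float *in, float *out,
                               const int *src_idx, const int *dst_idx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[dst_idx[i]] = 2.0f * in[src_idx[i]];   // gather, compute, scatter
}
```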
Let us take this opportunity to remind everyone that AMD also introduced their own GPGPU interface last year, CTM ("Close To the Metal"), which is currently hardware accelerated by their R(V)5-series architecture. CTM also supports gather and scatter operations. Excluding the architectural differences between G80 and R580, it might thus appear quite similar to CUDA - it is, however, quite different.
The idea behind CTM is that there is efficiency to be gained by giving an experienced programmer more direct control to the underlying hardware. It is, thus, fundamentally assembly language. CUDA on the other hand aims to simplify GPGPU programming by exposing the system via a standard implementation of the C language. At this point in time, the underlying assembly language output (also known as "NVAsc") is not exposed to the application developer.
Currently, the only public backend for CTM is Brook, which abstracts the base interface (Direct3D, OpenGL, CTM and variants thereof) and exposes a streaming-like programming model, which fits last-generation hardware pretty much perfectly. The catch, sadly, is that this backend does not expose scatter. Thus, if you need scatter and want to benefit from CTM, you are pretty much forced to program in assembly language for now.
CTM does have some very nice potential anyway, depending on how it evolves for the R600 and how the backends develop. It is not today's subject, however, but trust us to investigate and compare both interfaces in the future. Ease of use, price and performance (per watt?) would certainly be among the factors we'd like to consider.
Back to the subject at hand, there is a significant facet of CUDA we haven't touched upon yet: efficiency and addressable market. One of NVIDIA's key goals with CUDA was to make it usable for a greater variety of algorithms, while also reducing the amount of CPU interaction necessary in general. Their solution? Thread synchronisation and data sharing. Good, fast and efficient local synchronisation.
New Concepts Behind CUDA, Part 1
The parallel data cache (PDC), also known as "shared memory", is one of the key elements of the G80 architecture that allows for more efficient local synchronisation primitives. Based on our information, it likely also is a key element of NVIDIA's architecture for DirectX 10 Geometry Shader acceleration.
Before we go any further, it might be a good idea to introduce you to the architecture proposed by NVIDIA for G8x and CUDA. It is, amusingly enough, extremely similar in principle to one of the diagrams present in at least one of the NVIDIA patents filed back in late 2003. You can't help but respect (or want to horribly maim!) David Kirk for misleading the entire graphics community about NVIDIA's design choices for so many years.
Processors, multiprocessors, and more processors
In G80's implementation of this architecture, there are 8 processors (ALUs) per multiprocessor, and 16 multiprocessors per chip. In our architecture article, we talked in terms of "clusters"; based on this diagram, we can further say that every cluster has 2 multiprocessors in it. Also, as this diagram indicates, it doesn't seem impossible to us that there is at least one register bank per processor.
We would tend to believe that there are 8 independent texture units which work on quads, rather than 16 half-quad units. And considering the multiprocessors are most likely directly linked to texture units, the number of multiprocessors per texture unit has to be an integer. Also remember that those samplers are more capable on the graphics side than they are with CUDA (where you have 64 bilerps/clock of INT8 filtering to burn), since more than bilinear filtering is exposed there. More on that later.
Getting back to the diagram, it can be seen that there is one pool of shared memory per multiprocessor. And besides device memory, there is no way to "communicate" between the different multiprocessors; in fact, no native synchronisation primitives exist to simplify that. This is why we refer to G80's synchronisation functionality as "local" - it does not extend over the entire chip. On the other hand, it is incredibly efficient at what it does allow you to do.
So, what exactly is the PDC, anyway? Based on our best understanding of the matter, each block of shared memory represents 16 banks of single-ported SRAM. Each bank has 1KiB of storage and a bandwidth of 32 bits per clock cycle. Furthermore, since there are 16 multiprocessors on a G80, that aggregates to a total storage of 256KiB and bandwidth of more than 675GiB/s. For all intents and purposes, it can be seen as a logical and highly flexible extension of the register file.
In addition to helping with synchronisation, the parallel data cache can also save you bandwidth. The paradigm in this usage scenario is arguably the same as for Cell, where you manually load data into an SPE's local store and then hopefully reuse it several times, minimizing accesses to DRAM. The parallel data cache is several times smaller than a local store, but the fundamental idea is the same.
In this completely arbitrary example, only 4 memory reads to DRAM are required with the parallel data cache, instead of 6 for previous GPGPU implementations. This can result in better efficiency than an automatic cache, but requires more programmer magic to make it work.
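To give a rough idea of what this looks like in code, here is a sketch of a trivial 3-tap filter in which each block stages its slice of the input in shared memory once, and every thread then reuses its neighbours' values from on-chip storage rather than re-reading them from DRAM. The kernel and its assumed 256-thread block size are our own, purely for illustration:

```cuda
// Our own sketch: stage a tile (plus a one-element halo on each side) in
// shared memory, then compute from on-chip storage. Launch with 256
// threads per block, e.g. blur3<<<n / 256, 256>>>(in, out, n).
__global__ void blur3(const float *in, float *out, int n)
{
    __shared__ float tile[256 + 2];
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1;

    if (gid < n)
        tile[lid] = in[gid];                          // one DRAM read per element
    if (threadIdx.x == 0 && gid > 0)
        tile[0] = in[gid - 1];                        // left halo
    if (threadIdx.x == blockDim.x - 1 && gid < n - 1)
        tile[lid + 1] = in[gid + 1];                  // right halo
    __syncthreads();                                  // make the tile visible to the whole block

    if (gid > 0 && gid < n - 1)
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}
```

Each input element is read from DRAM roughly once but used by up to three threads, which is exactly the kind of reuse an automatic cache would otherwise have to discover on its own.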
It is worth noting that the parallel data cache allows for communication inside a group of 16 threads, without any form of explicit synchronisation. This is what we would like to refer to as "super-local implicit synchronisation", and it is most likely the architectural feature used in Pixel Shaders to implement the ddx and ddy instructions with high throughput and good efficiency.
Synchronising execution and data transfers for a larger group of threads isn't exactly 'difficult' either, but it does require explicit synchronisation. So, let's see how that works and what it implies.
New Concepts Behind CUDA, Part 2
Let's first look at how NVIDIA describes CUDA's programming model. Honestly, this single figure pretty much summarises it all:
"Multiple Cores, Multiple Pipes, Multiple Threads - Do we have more parallelism than we can handle?"
-- David Kirk, NVIDIA
The term 'kernel' comes from the streaming processor world. The CPU invokes a "kernel" to be executed on the GPU, and the kernel is subdivided into grids, blocks and threads. Furthermore, what is not listed on that figure is that threads within a block are further grouped into 'warps'.
While this might seem quite messy at first glance, it really is not. Warps correspond to the groups of threads that are scheduled together, which implies that branching ideally should be fully coherent within a single warp for maximal performance. Remember that the branch coherence of G80 is 32 threads.
Warps are then grouped in blocks. The reasoning behind that is that a single block is guaranteed to execute on the same multiprocessor, which allows threads within the same block to communicate and synchronise with each other. The application programmer can specify how many threads should be present in a single block - this affects latency tolerance and the maximum number of available registers, but also the amount of shared memory available. We'll get back to that shortly.
All the blocks are then grouped into a single grid, which thus includes all the threads dispatched to the GPU in a single kernel, or 'draw call' as some may wish to think of it. At any time, the GPU program has access to the unique identifiers of the block and thread it is currently running. On that note, it should be said that invoking a kernel blocks the CPU; the call is synchronous. If you want to run multiple concurrent kernels (which is actually required for multi-GPU configurations), you need to make use of multithreading on the CPU.
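In code, that whole hierarchy boils down to surprisingly little. The toy example below (our own, not NVIDIA's) shows a kernel reading its block and thread identifiers, and the host-side launch that specifies the grid and block dimensions:

```cuda
// Each thread derives a unique global index from the identifiers
// exposed by the runtime: blockIdx, blockDim and threadIdx.
__global__ void scale(float *data, float k, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= k;
}

// Host side: 256 threads per block, and enough blocks to cover n elements.
// As described above, the launch is synchronous in the current beta, so it
// blocks the CPU until the kernel has completed.
void scale_on_gpu(float *d_data, float k, int n)
{
    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);
    scale<<<grid, block>>>(d_data, k, n);
}
```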
Synchronisation
By now, you're most likely expecting about a gazillion variables and functions related to synchronisation primitives. Well, no, it's all done through a single function call, and it doesn't even take parameters!
__syncthreads() simply allows you to set a synchronisation point, and no code after that point will be executed (barring, perhaps, code related neither to the shared memory nor the device memory) until all threads in the block have finished executing all previous instructions. This makes it possible to efficiently prevent all possible read-after-write hazards for shared memory, as well as for device memory if other blocks aren't interfering with the same memory locations.
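Here's a minimal sketch of the usual pattern (again our own example, not NVIDIA's): every thread writes to shared memory, the block synchronises once, and only then does any thread read data written by another thread:

```cuda
// Reverse the elements handled by one block. Without the barrier, a thread
// could read an element its neighbour has not written yet (a read-after-write
// hazard). Assumes 256 threads per block and data covering the whole grid.
__global__ void reverse_in_block(float *data)
{
    __shared__ float s[256];
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;

    s[t] = data[i];                      // every thread writes one element
    __syncthreads();                     // wait for the whole block to finish writing
    data[i] = s[blockDim.x - 1 - t];     // now it's safe to read another thread's element
}
```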
In theory, the instruction is nearly free; it doesn't take more time than a single scalar operation of any other type (2 cycles for 32 threads), but there's a small catch. To understand the problem, consider that the ALUs alone likely have 10 stages (which would mean 5 stages from the scheduler's point of view, since it runs at half the frequency). When you are synchronising, what you want to do is make sure that all threads are at the same point of execution. What this effectively means is that you are flushing the pipeline!
When flushing a pipeline, you are effectively losing at least as many cycles of execution as there are stages in that pipeline. And that's without even taking memory latency into consideration, which would complicate things further. For example, if a memory read isn't completed for one warp, all other warps have to keep waiting for it while trying to synchronise! So, needless to say, this is a Bad Thing. Can it be completely fixed? No. Can it be mostly fixed? Yes.
The basic idea is that if you have two blocks running per multiprocessor, then you'll hopefully still have enough threads running to get decent efficiency while the other block is synchronising. You could also run more blocks per multiprocessor (=> fewer threads per block, all things being equal!); for example, if you had 8 blocks running on a single multiprocessor, and synchronisations were rare enough to practically never happen at the same time, you'd still have 87.5% of your threads available to hide memory latency.
So that's how it works for local synchronisation. If you *absolutely* need global synchronisation, remember you could just divide your workload into multiple kernels and let the CPU do the synchronisation. If that doesn't work in your specific case, or it wouldn't be efficient enough, then it is still possible for threads to communicate with each other via device memory, pretty much like you could do it on the CPU. Just make sure you know which blocks are running at a given time for your target GPU architecture, since the number of multiprocessors could vary!
The techniques to prevent read-after-write hazards in this case are the same as those that apply to CPUs, although for the latter, they can sometimes work implicitly through a shared L2 cache. In CUDA's case, there is no memory read/write caching at all, so you need to incur the entire memory latency several times to make global synchronisation work. Obviously, in an ideal world, you'd have both very efficient local synchronisation and L2-based global synchronisation, but it looks like we aren't there yet with G80.
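The kernel-splitting approach is usually the simpler of the two, and it looks like this in practice (the kernels below are hypothetical placeholders of our own making):

```cuda
// First pass produces intermediate results, second pass consumes them.
__global__ void pass_one(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;
}

__global__ void pass_two(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;
}

// The CPU acts as the global barrier: the second kernel is only launched
// once the first one has completed, so no block of pass_two can ever see
// partially written results from pass_one.
void run_both_passes(float *d_buf, int n)
{
    dim3 block(256), grid((n + 255) / 256);
    pass_one<<<grid, block>>>(d_buf, n);
    pass_two<<<grid, block>>>(d_buf, n);
}
```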
Putting it All Together
So, now that we know what CUDA is and what it brings to the table, let's quickly summarize:
- CUDA exposes the NVIDIA G80 architecture through a language extremely close to ANSI C, with extensions to that language to expose some of the GPU-specific functionality. This is in opposition to AMD's CTM, which is an assembly language construct that aims to be exposed through third-party backends. The two are thus not directly comparable at this time.
- The G80 has 16 independent 'multiprocessors', each composed of 8 'processors' (ALUs) that run at twice the clock rate of the scheduler. Each multiprocessor has a 16KiB pool of "shared memory" and each processor has its own register file bank. For obvious reasons, the GeForce 8800 GTS only has 12 multiprocessors activated.
- Threads are grouped in warps (-> branch coherence), which are further grouped in blocks. All the warps composing a block are guaranteed to run on the same multiprocessor, and can thus take advantage of shared memory and local synchronization.
- Shared memory (aka the parallel data cache) allows the programmer to manually minimize the number of DRAM accesses necessary by reusing the same data multiple times. If you are familiar with CELL's architecture, you can think of it as a mini-Local Store. If there is more than one block running per multiprocessor, then only a part of the shared memory is available to each block.
- Local synchronization is extremely efficient as long as there is more than one block running per multiprocessor. The parallel data cache is also the scheme used for communication when synchronization is occurring. This makes it possible to reduce the number of passes (=> CPU interference) and improve efficiency.
It should also be noted that it is possible for CUDA to interface directly with OpenGL and DirectX. Furthermore, the texture units are exposed in CUDA, and unlike normal memory read/writes, they are cached - which might be a good reason to use them for some things, since that cache is idling otherwise. Sadly, only bilinear filtering is exposed at this time, most likely because anisotropic filtering and trilinear would require access to derivatives and mipmap chains in CUDA. We'd still enjoy that functionality, for example to accelerate deferred rendering passes, but it could be argued that it would needlessly complicate the API for those using CUDA for non-rendering workloads, which really is the vast majority of the target audience.
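For the curious, here is a rough sketch of what the cached texture path looks like, using the runtime's texture reference mechanism; the exact names and bindings in the beta may differ slightly from what we show here:

```cuda
#include <cuda_runtime.h>

// Reading device memory through the texture path is cached, unlike a plain
// pointer dereference. Only a 1D unfiltered fetch is shown here.
texture<float, 1, cudaReadModeElementType> in_tex;

__global__ void copy_via_texture(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(in_tex, i);   // goes through the texture cache
}

// Host side: bind the linear buffer to the texture reference, then launch.
void launch_copy(const float *d_in, float *d_out, int n)
{
    cudaBindTexture(0, in_tex, d_in, n * sizeof(float));
    copy_via_texture<<<(n + 255) / 256, 256>>>(d_out, n);
    cudaUnbindTexture(in_tex);
}
```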
Advantages & Intended Market
At first glance, CUDA and GPGPU in general are primarily aimed at the scientific market. There's a catch there, though; right now, double precision computations (FP64) aren't supported. The good news, however, is that both NVIDIA and AMD have pledged support for this by year's end. Not at full speed, of course, but performance should remain very impressive anyway.
So, what is CUDA currently good for? Plenty of things, as it turns out. Here's the marketing graph NVIDIA used at the G80 Editors' Day:
This graph compares a single GeForce 8800 to a 2.66GHz Conroe.
And yes, that *is* actually a 197x speed-up you're seeing there for finance!
The technical reasons behind some of those speed-ups are quite varied. For physics and wave equations, it's quite possible that it's mostly related to the number of GFlops the chip is capable of. For biological sequence matching, we honestly aren't too sure ourselves. Matrix Numerics and Finance are two interesting cases to look at, though. Another thing we could look at is the performance of the CUDA FFT and BLAS libraries, but that goes beyond the scope of this article.
Matrix Numerics benefit quite nicely from the parallel data cache, while Finance benefits from the G80's excellent special-function performance. There is an excellent chapter on the latter subject in GPU Gems 2, named "Options Pricing on the GPU". The ratio of special-function operations is very high, and GPUs are both extremely fast and precise for those nowadays. Although it's likely things are being compared with different levels of precision for the CPU and the GPU, the performance difference would remain ludicrously high even with lower CPU precision.
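To give an idea of why, here is a heavily simplified sketch of a Black-Scholes call pricing kernel (our own, not NVIDIA's sample): nearly every operation in it is a logarithm, exponential, square root or error function, which is exactly the kind of work G80's special-function hardware chews through.

```cuda
// One European call option priced per thread; S = spot, K = strike,
// T = time to expiry, r = risk-free rate, v = volatility.
__global__ void black_scholes_call(const float *S, const float *K,
                                   const float *T, float r, float v,
                                   float *price, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float sqrtT = sqrtf(T[i]);
    float d1 = (logf(S[i] / K[i]) + (r + 0.5f * v * v) * T[i]) / (v * sqrtT);
    float d2 = d1 - v * sqrtT;
    // Cumulative normal distribution expressed via the error function.
    float Nd1 = 0.5f * erfcf(-d1 * 0.70710678f);
    float Nd2 = 0.5f * erfcf(-d2 * 0.70710678f);
    price[i] = S[i] * Nd1 - K[i] * expf(-r * T[i]) * Nd2;
}
```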
As for Matrix Numerics, NVIDIA's example to showcase the efficiency potential of the parallel data cache and synchronization actually is matrix multiplication. It turns out that by cleverly using those two factors, it is possible to drastically reduce the bandwidth requirements, thus providing significant performance gains. Since the data for each "sub-matrix" is loaded in shared memory by all the threads at the same time, synchronization is necessary before computation can begin.
Matrix multiplication significantly benefits from the parallel data cache by working on one "sub-matrix" at a time, thus achieving higher bandwidth efficiency than traditional implementations.
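The structure of that example is simple enough to sketch here; this is our own simplified version (square matrices, with dimensions assumed to be a multiple of the 16x16 tile):

```cuda
#define TILE 16

// C = A * B, computed one 16x16 tile at a time. Each thread loads exactly
// one element of each sub-matrix into shared memory, and every loaded
// element is then reused 16 times from on-chip storage.
__global__ void matmul_tiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int k = 0; k < n; k += TILE) {
        As[threadIdx.y][threadIdx.x] = A[row * n + (k + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(k + threadIdx.y) * n + col];
        __syncthreads();                         // wait until the whole tile is loaded

        for (int j = 0; j < TILE; ++j)
            acc += As[threadIdx.y][j] * Bs[j][threadIdx.x];
        __syncthreads();                         // don't overwrite the tile too early
    }
    C[row * n + col] = acc;
}
```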
Remaining Limitations & Future
As we said previously, FP64 computations are not currently supported on any available GPUs, but this is supposed to change in 2007. There remain a number of other limitations, however, and here's a list that should hopefully cover most of them:
- Recursive functions are not supported, at all. This is a scheduler limitation, as there currently are no real "functions" on the hardware side of things, and even if there were, there is no stack to push/pop arguments from either, unless you want to use uncached global memory for that - which is unlikely to be super-fast.
- There is no efficient way to do "global synchronisation" on the GPU, which likely forces you to divide the kernel and do synchronisation on the CPU. Given the variable number of multiprocessors and other factors, there may not even be a perfect solution to this problem.
- There are various deviations from the IEEE-754 standard, even though the precision level is identical. For example, neither denormals nor signalling NaNs are supported. The rounding mode also cannot be changed, and division/square root are not implemented in a fully standard way.
- Functions cannot have a variable number of arguments. Same problem as for recursion.
- Conversion of floating point numbers to integers is done differently than on x86 CPUs.
- The bus bandwidth and latency between the CPU and the GPU may be a bottleneck.
Many of these problems will likely be resolved, or at least improved upon, in the future. For example, PCI Express 2.0 should debut in a few months, while you'd also expect there to be some improvements for global synchronisation in the future. Other things, including support for denormals and signalling NaNs, likely won't be seen for a while longer, if ever. Using an on-chip or cached stack would also be a tad ridiculous, given how many threads are in flight at any given time.
Going forward, however, the prospects of GPU computing are very exciting. In addition to further feature-set improvements, we should expect performance gains that significantly outpace Moore's Law in the next few years. Both transistor counts and clock speed increases will provide significant improvements and, in addition to that, the percentage of transistors dedicated to arithmetic operations is bound to increase further.
A good example of this trend is the R580 which, compared to the R520, increased its transistor count and die size by only 22%, but tripled its number of ALUs and the size of its register file. The G80 most likely already has a higher ALU ratio than the R520, but we believe that there remains plenty of room for NVIDIA to double or triple that proportion in the coming years. As such, barring any architectural revolution on the CPU front, the performance advantage of GPGPU solutions is likely to grow in the future, rather than shrink.
Closing Remarks
Today, NVIDIA publicly released the beta for CUDA, as well as the related FFT and BLAS libraries. Because of time constraints and the lack of a timely heads up on the beta's target launch date, we will not be looking at how to program in CUDA today, nor will we examine performance for real-world algorithms. Expect more on that from us in the coming days and weeks.
For now, hopefully this preview will have given you a good grasp of what CUDA is all about, and what it brings to the table. These certainly are exciting times for those interested in the GPGPU paradigm, and based on what's coming in the near future, we can't help but be extremely excited by the future prospects of that industry segment. It will also be very interesting to see what AMD brings to the table in this area when they finally launch their next-generation R600 architecture.
Finally, we would like to take this opportunity to urge NVIDIA (and AMD) not to artificially restrict GPGPU development on consumer cards. While we fully understand the push for higher average selling prices (ASPs) in the GPGPU market, we would like to point out that emerging markets require developer innovation. Allowing anyone with some coding skills and a cool idea to experiment on real hardware can be a key catalyst for short-term and long-term innovation.
An exaggerated focus on the high-ASP part of the market would hinder that innovation, and reduce the perceived performance/dollar advantage which makes current GPGPU solutions so attractive. Furthermore, such limitations would completely block GPU computing from becoming a full part of AAA game middleware solutions and other non-game consumer applications. Applications that would be willing to implement CUDA and CTM paths instead of a single DX10 path should be able to do so.
While we do not have any specific reason to believe NVIDIA and AMD are considering artificially limiting their GPGPU solutions in the consumer space, this is so important from so many points of view that we absolutely had to point it out here. Beyond the level of support provided, professional solutions should only be differentiated by their FP64 performance, memory sizes, multi-GPU functionality, and other similar factors - not by their feature-set. Most server CPUs have been differentiating themselves based on even less than that and, apparently, they're doing just fine!
Navier-Stokes equations computed 100% on the GPU for real-time smoke simulation