Reducing the Compute Cost of Reverse Time Migration
Reverse time migration (RTM) is a powerful seismic migration technique, providing geophysicists with the ability to create accurate 3D images of the subsurface. Steep dips? Complex salt structure? High velocity contrast? No problem. By splitting the upgoing and downgoing wavefields and combining them with an accurate velocity model, RTM can image even the most complex geologic formations.
The algorithm migrates each shot independently using this basic workflow:
- Compute the downgoing wavefield.
- Reconstruct the upgoing wavefield and reverse it in time.
- Correlate the up and down wavefields at each image point (see the sketch after this list).
- Repeat for all shots and combine the results into a 3D image.
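For concreteness, the imaging condition in the third step is a zero-lag cross-correlation of the downgoing and upgoing wavefields at every grid point. The CUDA kernel below is a minimal sketch of that accumulation; the array names (`src_wf`, `rcv_wf`, `image`) and the flattened 1D indexing are illustrative assumptions rather than code from NVIDIA's examples.

```cuda
// Minimal sketch of the RTM imaging condition (zero-lag cross-correlation).
// At each time step, the product of the downgoing (source) and upgoing
// (receiver) wavefields is accumulated into the image. Array names and the
// flattened 1D indexing are illustrative only.
__global__ void imaging_condition(const float* __restrict__ src_wf,
                                  const float* __restrict__ rcv_wf,
                                  float* __restrict__ image,
                                  int n_points)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_points) {
        image[i] += src_wf[i] * rcv_wf[i];
    }
}

// Host-side launch for one time step (repeated for every step of every shot):
//   int threads = 256;
//   int blocks  = (n_points + threads - 1) / threads;
//   imaging_condition<<<blocks, threads>>>(d_src, d_rcv, d_image, n_points);
```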
While the algorithm is simple in concept, its computational cost made RTM economically unviable until the early 2010s, when parallel processing on NVIDIA GPUs dramatically reduced the migration time and hardware footprint required.
Reducing RTM costs by increasing computational efficiency
There are several factors driving the computational requirements of tilted transversely isotropic (TTI) RTM. One is the calculation of first, second, and cross-derivatives along x, y, and z. Earlier GPU generations, such as Fermi and Kepler, had limited streaming multiprocessor (SM) resources, shared memory, and compiler technology.
Paulius Micikevicius famously overcame these issues by splitting the derivative calculations into two or three passes, with each pass computing a set of derivatives. This major breakthrough allowed seismic processors to run RTM in an economical and time-efficient manner. However, each pass requires a round-trip to memory. Each round-trip to memory hinders performance and drives up costs.
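To make the cost of those passes concrete, the sketch below splits a simple 2D Laplacian into two kernels, each of which reads the wavefield from global memory and writes an intermediate result back. The second-order stencils, grid layout, and kernel names are assumptions chosen for brevity; a production TTI kernel uses much wider stencils and many more derivative terms, which only amplifies the memory traffic.

```cuda
// Simplified two-pass derivative computation. Each pass is a separate kernel,
// so the wavefield p (and the scratch array) makes an extra round trip
// through global memory. Second-order stencils and a 2D nx-by-nz grid are
// used only to keep the sketch short.

// Pass 1: second derivative along x, written to a scratch array.
__global__ void d2dx2_pass(const float* __restrict__ p,
                           float* __restrict__ scratch,
                           int nx, int nz, float inv_dx2)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iz = blockIdx.y * blockDim.y + threadIdx.y;
    if (ix > 0 && ix < nx - 1 && iz < nz) {
        int idx = iz * nx + ix;
        scratch[idx] = (p[idx - 1] - 2.0f * p[idx] + p[idx + 1]) * inv_dx2;
    }
}

// Pass 2: second derivative along z, combined with the pass-1 result.
// Both p and scratch have to be read back from global memory here.
__global__ void d2dz2_pass(const float* __restrict__ p,
                           const float* __restrict__ scratch,
                           float* __restrict__ laplacian,
                           int nx, int nz, float inv_dz2)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iz = blockIdx.y * blockDim.y + threadIdx.y;
    if (ix > 0 && ix < nx - 1 && iz > 0 && iz < nz - 1) {
        int idx = iz * nx + ix;
        float d2dz2 = (p[idx - nx] - 2.0f * p[idx] + p[idx + nx]) * inv_dz2;
        laplacian[idx] = scratch[idx] + d2dz2;
    }
}
```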
While multi-pass RTM was the best you could do in 2012, you can do much better today with the NVIDIA Volta or NVIDIA Ampere Architecture generations. If your RTM kernel hasn’t been tuned since the days of Paulius, you are leaving significant value on the table.
Moving to a one-pass RTM
A one-pass TTI RTM kernel reads the wavefield once, computes all necessary derivatives, and writes the updated wavefields to global memory once. By eliminating multiple read/write round trips to memory, this implementation dramatically increases the performance gained on GPUs. It also helps the algorithm scale linearly across multiple GPUs in a node. Figures 1-3 show the performance and strong scaling gained by reducing the number of passes on A100, T4, and V100 GPUs, respectively.
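Under the same simplifying assumptions as the two-pass sketch above, a fused one-pass version might look like the following: each thread block stages a tile of the wavefield in shared memory once, computes both derivatives from that tile, and writes a single result back to global memory. A real one-pass TTI kernel also fuses the first and cross-derivatives and uses higher-order stencils, but the memory-traffic pattern is the same.

```cuda
// Fused one-pass version of the same computation: each block stages a tile
// of the wavefield in shared memory once, computes both derivatives from the
// tile, and writes a single result per point back to global memory.
#define TILE 16

__global__ void laplacian_one_pass(const float* __restrict__ p,
                                   float* __restrict__ laplacian,
                                   int nx, int nz,
                                   float inv_dx2, float inv_dz2)
{
    // Tile with a one-point halo on each side for the second-order stencil.
    __shared__ float tile[TILE + 2][TILE + 2];

    int ix = blockIdx.x * TILE + threadIdx.x;
    int iz = blockIdx.y * TILE + threadIdx.y;
    int tx = threadIdx.x + 1;
    int tz = threadIdx.y + 1;
    bool inside = (ix < nx) && (iz < nz);
    int idx = iz * nx + ix;

    if (inside) {
        // One global read per interior point, plus halo reads at tile edges.
        tile[tz][tx] = p[idx];
        if (threadIdx.x == 0        && ix > 0)      tile[tz][0]        = p[idx - 1];
        if (threadIdx.x == TILE - 1 && ix < nx - 1) tile[tz][TILE + 1] = p[idx + 1];
        if (threadIdx.y == 0        && iz > 0)      tile[0][tx]        = p[idx - nx];
        if (threadIdx.y == TILE - 1 && iz < nz - 1) tile[TILE + 1][tx] = p[idx + nx];
    }
    __syncthreads();

    if (inside && ix > 0 && ix < nx - 1 && iz > 0 && iz < nz - 1) {
        float d2dx2 = (tile[tz][tx - 1] - 2.0f * tile[tz][tx] + tile[tz][tx + 1]) * inv_dx2;
        float d2dz2 = (tile[tz - 1][tx] - 2.0f * tile[tz][tx] + tile[tz + 1][tx]) * inv_dz2;
        // One global write instead of one per pass.
        laplacian[idx] = d2dx2 + d2dz2;
    }
}

// Launch with one thread per grid point:
//   dim3 block(TILE, TILE);
//   dim3 grid((nx + TILE - 1) / TILE, (nz + TILE - 1) / TILE);
//   laplacian_one_pass<<<grid, block>>>(d_p, d_lap, nx, nz, inv_dx2, inv_dz2);
```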
For seismic processing in the cloud, T4 provides a particularly good price/performance solution. On-premises servers for seismic processing typically have four to eight V100 or A100 GPUs per node. For these configurations, reducing the number of passes from three to one improves RTM kernel performance by 78-98%!
Figure 1. A100 performance on multi- and single-pass TTI RTM with linear scaling.
Figure 2. T4 performance on multi- and single-pass TTI RTM with linear scaling.
Figure 3. V100 performance on multi- and single-pass RTM with linear scaling.
Conclusion
Reducing the number of passes in your RTM kernel can dramatically improve code performance and decrease costs. To make the development easier, NVIDIA has developed a collection of code examples showing how to implement a GPU-accelerated RTM using best practices.
Of course, the number of passes in an RTM kernel is only one piece of the puzzle. There are several other tricks shown in the example code to further increase performance, such as compression.
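As one illustration of what compression can mean here, forward-wavefield snapshots are often stored at reduced precision so that the backward pass can replay them with a smaller memory footprint. The FP16 round trip below is a generic sketch of that idea, not code taken from the NVIDIA examples.

```cuda
#include <cuda_fp16.h>

// Generic sketch of wavefield-snapshot compression: the forward wavefield is
// stored in FP16 to roughly halve the snapshot memory footprint and traffic,
// then expanded back to FP32 when it is replayed during the backward pass.
__global__ void compress_snapshot(const float* __restrict__ wf,
                                  __half* __restrict__ snapshot,
                                  int n_points)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_points)
        snapshot[i] = __float2half_rn(wf[i]);
}

__global__ void decompress_snapshot(const __half* __restrict__ snapshot,
                                    float* __restrict__ wf,
                                    int n_points)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_points)
        wf[i] = __half2float(snapshot[i]);
}
```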