[RDMA] Tips and Tricks to Optimize Your RDMA Code
Table of Contents
When posting multiple WRs, post them as a list in one call
Avoid using atomic operations
Read multiple Work Completions at once
Set processor affinity for a certain task or process
Work with local NUMA node
Work with cache-line aligned buffers
Avoid getting into retransmission flows
Find the best MTU for the RDMA device
Work with multiple outstanding Send Requests
Configure the Queue Pair to allow several RDMA Reads and Atomics in parallel
Work with selective signaling in the Send Queue
Read Work Completions by polling
Use low values in QP's timeout and min_rnr_timer
If immediate data is used, use RDMA Write with immediate instead of Send with immediate
Reducing memory consumption
Register physical contiguous memory
Reduce the size of the used Queues to the minimum
Reducing CPU consumption
Work with Work Completion events
Work with solicited events in Responder side
Increase the scalability
Use Unreliable Datagram (UD) QP
Theoretical Background for RDMA Performance Optimization
Excerpted from: A Brief Discussion of RDMA Performance Optimization (Part 1), https://zhuanlan.zhihu.com/p/522332998
2. The Hardware Execution Model Behind the Core Concepts
Let's first introduce some of RDMA's core concepts. I do not intend to cover the APIs or how to call them; instead, we will focus on the hardware execution model and principles behind these concepts, because understanding them is the key to writing high-performance RDMA programs.
Memory Region
The RDMA NIC (RNIC below) reads and writes system memory through DMA. Since DMA can only access physical addresses, the RNIC must keep a mapping table from the virtual addresses of the target memory region to their physical addresses; this table is stored in the RNIC's Memory Translation Table (MTT). In addition, since RDMA access is DMA-based and does not handle page faults, we must also ensure that the target memory region is page-locked so that the operating system cannot swap those pages out.
To summarize: when we access a block of memory with RDMA, that memory must first be page-locked, and we must hand the virtual-to-physical mapping table of this block to the RNIC for later lookups. This process is called Memory Registration, and the registered block is a Memory Region. When registering the memory we also specify its access permissions; the RNIC stores this permission information in its Memory Protection Tables (MPT) and uses it to validate incoming requests.
The MTT and MPT reside in host memory, but the RNIC caches them in its SRAM. When the RNIC receives a READ/WRITE request, it first looks up, in the SRAM cache, the physical address corresponding to the requested target address and the access permissions of that address. On a cache hit, it operates directly via DMA; on a miss, it must issue requests over PCIe to look up the MTT and MPT in host memory, which adds considerable overhead. When your application performs a large number of fine-grained memory accesses, the penalty of MTT/MPT misses in the RNIC SRAM can be fatal.
Registering a Memory Region is a time-consuming operation, but in most cases we only need to do it once, or a few times, at startup. There are also approaches that avoid MR registration by relying on on-demand paging, such as AWS's EFA protocol; I won't expand on that today, since it really belongs to the Unified Memory topic. I may later cover it together with GPU UVM, because the core principles are the same.
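For reference, here is a minimal sketch of registering a Memory Region through the verbs API. It assumes an existing protection domain `pd`; the helper name, buffer size and access flags are illustrative:

```c
#include <stdlib.h>
#include <infiniband/verbs.h>

/* Register a buffer as a Memory Region: the kernel pins (page-locks) the
 * pages, and the RNIC keeps the virtual-to-physical mapping (MTT) plus the
 * access permissions (MPT) for this region. */
struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len)
{
    void *buf;

    /* Page-align the buffer so the registration does not span extra pages. */
    if (posix_memalign(&buf, 4096, len))
        return NULL;

    /* The access flags become the MPT entry checked on every request. */
    return ibv_reg_mr(pd, buf, len,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ |
                      IBV_ACCESS_REMOTE_WRITE);
}
```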
RDMA Verbs
Users issue commands to the RNIC through the RDMA Verbs API. Verbs are divided into Memory Verbs and Message Verbs: Memory Verbs are mainly READ, WRITE and the ATOMIC operations, while Message Verbs mainly include SEND and RECV. Memory Verbs truly bypass both the remote CPU and the kernel, so they generally perform well. Message Verbs require the Responder's CPU to participate; they are more flexible, but their performance is generally worse than Memory Verbs.
Queue Pair
RDMA hosts communicate through Queue Pairs (QPs). A QP consists of a Send Queue (SQ) and a Receive Queue (RQ), with an associated Send Completion Queue (SCQ) and Receive Completion Queue (RCQ). To issue a request, the user wraps it in a Work Queue Element (WQE) and posts it to the SQ; the RDMA NIC then transmits the WQE, and when it completes, a Completion Queue Element (CQE) is placed in the corresponding SCQ. The user can poll the CQE from the SCQ and check its status to confirm whether the WQE completed successfully. Note that different QPs can share CQs to reduce SRAM consumption.
Next, let's look more closely at what lies behind QPs.
First, once we create a QP, the system must keep state for it: the QP's metadata, congestion-control state, and so on. Excluding the WQEs, MTT and MPT, a QP corresponds to roughly 375 B of state data. Back when RNIC SRAM was small this was a heavy storage burden, so earlier RDMA work studied QP sharing, where different worker threads share QPs to relieve the metadata storage pressure, at some cost in performance [1]. Modern RNICs have much larger SRAM (Mellanox CX4/CX5-series NICs have about 2 MB), so on new NICs few people worry about the storage overhead of QPs anymore, unless you create thousands or tens of thousands of them.
Second, an RNIC contains multiple Processing Units (PUs) [2]. Since requests within one QP must be processed in order, and to avoid cross-PU synchronization, it is common to assume that one QP is handled by one PU. We can therefore create multiple QPs per thread to accelerate data processing and keep the PUs from becoming the bottleneck of an RDMA program [3].
3. RDMA Performance Optimization
RDMA performance optimization is as complex, or as simple, as you make it. The simple part is that there are not many design choices to make at the software level, because the performance ceiling is set by the hardware. To get as close to that ceiling as possible, the core idea is simply to access data in the way that is friendliest to the hardware; no particularly sophisticated algorithms are involved. When you want high performance, learn more about the hardware. Following the three core concepts introduced above, we go through the optimization lessons one by one.
3.1 Mind the cost of address translation
As mentioned earlier, when the target address of a request misses the MTT/MPT cache in RNIC SRAM, the RNIC must go through PCIe to look up the MTT and MPT in host memory, which is expensive. The cost is especially visible for high fan-out, fine-grained access patterns. There are two main existing ways to optimize this:
- Large pages: whether in the MTT or in the OS page table, virtual-to-physical mappings are kept at page granularity, i.e. one page corresponds to one MTT entry or one Page Table Entry (PTE). Using large pages effectively shrinks the MTT, which raises the hit rate of the RNIC's MTT cache.
- Contiguous memory + PA-MR [4, 5]: newer CX-series NICs let users access memory by physical address. To avoid maintaining a heavyweight page table, we can allocate one large physically contiguous block through the Linux CMA API; the MTT then has a single entry, guaranteeing a 100% cache hit rate. Note, however, that PA-MR bypasses access-permission checks, which is a security hazard, so use it with care.
There are other optimizations as well; in our recent work we propose a new way to improve address-translation performance, which I will introduce once the work is open-sourced.
3.2 Mind the RNIC PU/QP execution model
One QP per PU is our simple model of how the RNIC executes. Under this model, we need multiple QPs to exploit the parallelism of multiple PUs, and we should also shape our operations to minimize synchronization between PUs, which hurts performance considerably.
3.3 RDMA Verbs
As for choosing among the Verbs, my personal experience is: prefer READ/WRITE, and consider SEND/RECV where CPU involvement and batched processing logic are needed. Much prior work builds message-passing semantics on top of READ/WRITE [1, 6, 7] and is well worth studying.
Also, a READ/WRITE WQE can be flagged as SIGNALED or not; an unsignaled WQE produces no CQE when it completes. A common trick follows: when posting K consecutive READ/WRITE requests on one QP, mark only the last one SIGNALED and leave the rest UNSIGNALED. Since a QP executes in order, completion of the last WQE implies that all earlier WQEs have executed as well. Whether they executed successfully, however, must be confirmed in an application-specific way.
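A minimal sketch of this trick, with a hypothetical helper: it assumes the QP was created with `sq_sig_all = 0`, that the caller prepared one SGE per request, and that `remote_addr`/`rkey` describe a registered remote buffer:

```c
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Post k RDMA Writes; only the last one is SIGNALED, so the SCQ receives a
 * single CQE, and its arrival implies the earlier WQEs finished too
 * (execution within one QP is in order). */
static int post_writes_signal_last(struct ibv_qp *qp, struct ibv_sge *sge,
                                   int k, uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_send_wr wr[k], *bad_wr = NULL;

    memset(wr, 0, sizeof(wr));
    for (int i = 0; i < k; i++) {
        wr[i].wr_id = i;
        wr[i].opcode = IBV_WR_RDMA_WRITE;
        wr[i].sg_list = &sge[i];
        wr[i].num_sge = 1;
        wr[i].wr.rdma.remote_addr = remote_addr + (uint64_t)i * sge[i].length;
        wr[i].wr.rdma.rkey = rkey;
        /* Only the last WR generates a CQE. */
        wr[i].send_flags = (i == k - 1) ? IBV_SEND_SIGNALED : 0;
        wr[i].next = (i == k - 1) ? NULL : &wr[i + 1];
    }
    return ibv_post_send(qp, &wr[0], &bad_wr);
}
```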
4. RNIC + X
The classic way to use an RNIC is RNIC + system memory, i.e. accessing host memory directly through the RNIC. With the growth of GP-GPUs and NVM, however, accessing GPU memory or NVM directly through the RNIC has become mature and popular: RDMA + GPU greatly accelerates GPU-to-GPU communication, and RDMA + NVM greatly expands memory capacity and reduces the need for network communication. This topic touches both the hardware and the OS virtual-memory machinery, and explaining it properly takes some space, so we defer it to the next post.
5. Summary
This article introduced some RDMA fundamentals and the principles behind them, and on that basis covered common RDMA performance-optimization techniques. The next article will cover RNIC + X, including RNIC + GPU and RNIC + NVM; interested readers, stay tuned.
6. References
[1] Dragojević, A., Narayanan, D., Castro, M., & Hodson, O. (2014). FaRM: Fast Remote Memory. In 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14) (pp. 401-414).
[2] Hauger, S., Wild, T., Mutter, A., Kirstaedter, A., Karras, K., Ohlendorf, R., ... & Scharf, J. (2009, May). Packet processing at 100 Gbps and beyond - challenges and perspectives. In 2009 ITG Symposium on Photonic Networks (pp. 1-10). VDE.
[3] Kalia, A., Kaminsky, M., & Andersen, D. G. (2016). Design guidelines for high performance RDMA systems. In 2016 USENIX Annual Technical Conference (USENIX ATC 16) (pp. 437-450).
[4] Physical Address Memory Allocation
[5] Contiguous Memory Allocator - CMA (Linux) | Toradex Developer Center
[6] Kalia, A., Kaminsky, M., & Andersen, D. G. (2014, August). Using RDMA efficiently for key-value services. In Proceedings of the 2014 ACM Conference on SIGCOMM (pp. 295-306).
[7] https://github.com/pytorch/tensorpipe
Tips and Tricks to Optimize Your RDMA Code
RDMA is used in many places, mainly because of the high performance it makes possible. In this post, I will provide tips and tricks on how to optimize RDMA code in several aspects.
General tips
Avoid using control operations in the data path
Unlike data operations, which stay in the same context they were called in (i.e. do not perform a context switch) and are written in an optimized way, control operations (all the create/destroy/query/modify verbs) are very expensive because:
- Most of the time, they perform a context switch
- Sometimes they allocate or free dynamic memory
- Sometimes they involve accessing the RDMA device
As a general rule of thumb, one should avoid calling control operations in the data path, or at least reduce their use there.
The following verbs are considered data operations:
- ibv_post_send()
- ibv_post_recv()
- ibv_post_srq_recv()
- ibv_poll_cq()
- ibv_req_notify_cq()
When posting multiple WRs, post them as a list in one call
When posting several Work Requests with one of the ibv_post_*() verbs, posting them as a linked list in a single call, instead of one call per Work Request, will provide better performance, since it allows the low-level driver to perform optimizations.
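A minimal sketch of the pattern, using Receive Requests as the example (the same `next`-chaining applies to ibv_post_send()); the helper name is illustrative and the SGEs are assumed to be prepared by the caller:

```c
#include <string.h>
#include <infiniband/verbs.h>

/* Post n Receive Requests in ONE ibv_post_recv() call by chaining them
 * through the 'next' pointer, instead of n separate calls. */
static int post_recv_list(struct ibv_qp *qp, struct ibv_sge *sge, int n)
{
    struct ibv_recv_wr wr[n], *bad_wr = NULL;

    memset(wr, 0, sizeof(wr));
    for (int i = 0; i < n; i++) {
        wr[i].wr_id = i;
        wr[i].sg_list = &sge[i];
        wr[i].num_sge = 1;                       /* one S/G entry is cheapest */
        wr[i].next = (i == n - 1) ? NULL : &wr[i + 1];
    }
    /* On failure, bad_wr points at the first WR that could not be posted. */
    return ibv_post_recv(qp, &wr[0], &bad_wr);
}
```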
When using Work Completion events, acknowledge several events in one call
When handling Work Completions using events, acknowledging several completions in one call instead of one call per event will provide better performance, since fewer mutual-exclusion locks are taken.
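A minimal sketch of batched acknowledgment, assuming a single CQ is attached to the completion channel `ch` (the batch size and helper name are illustrative):

```c
#include <infiniband/verbs.h>

#define ACK_BATCH 16

/* Acknowledge completion events in batches: ibv_ack_cq_events() takes a
 * mutex internally, so one call per ACK_BATCH events is cheaper than one
 * call per event. */
static void event_loop(struct ibv_comp_channel *ch)
{
    struct ibv_cq *ev_cq;
    void *ev_ctx;
    unsigned int unacked = 0;

    for (;;) {
        if (ibv_get_cq_event(ch, &ev_cq, &ev_ctx))  /* blocks for an event */
            break;
        /* ... re-arm with ibv_req_notify_cq() and poll ev_cq here ... */
        if (++unacked == ACK_BATCH) {
            ibv_ack_cq_events(ev_cq, unacked);      /* one ack for many */
            unacked = 0;
        }
    }
    if (unacked)
        ibv_ack_cq_events(ev_cq, unacked);          /* ack the remainder */
}
```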
Avoid using many scatter/gather entries
Using several scatter/gather entries in a Work Request (either a Send Request or a Receive Request) means that the RDMA device will read those entries and then read the memory they refer to. Using one scatter/gather entry provides better performance than using more than one.
Avoid using Fence
A Send Request with the fence flag set will be blocked until all prior RDMA Read and Atomic Send Requests have completed. This decreases the BW.
Avoid using atomic operations
Atomic operations allow performing read-modify-write in an atomic way. This usually decreases performance, since it typically involves locking the access to the memory (implementation dependent).
Read multiple Work Completions at once
ibv_poll_cq() allows reading multiple completions at once. If the number of Work Completions returned is less than the number one tried to read, it means that the CQ is now empty and there is no need to check whether more Work Completions are in it.
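A minimal sketch of batch polling; the batch size and helper name are illustrative:

```c
#include <stdio.h>
#include <infiniband/verbs.h>

#define BATCH 16

/* Drain a CQ by reading up to BATCH Work Completions per call.  If
 * ibv_poll_cq() returns fewer than requested, the CQ is now empty and
 * there is no need to poll it again right away. */
static int drain_cq(struct ibv_cq *cq)
{
    struct ibv_wc wc[BATCH];
    int n, total = 0;

    do {
        n = ibv_poll_cq(cq, BATCH, wc);
        if (n < 0)
            return -1;                           /* polling error */
        for (int i = 0; i < n; i++)
            if (wc[i].status != IBV_WC_SUCCESS)
                fprintf(stderr, "WR %llu failed: %s\n",
                        (unsigned long long)wc[i].wr_id,
                        ibv_wc_status_str(wc[i].status));
        total += n;
    } while (n == BATCH);                        /* full batch: maybe more */
    return total;
}
```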
Set processor affinity for a certain task or process
When working with a Symmetric MultiProcessing (SMP) machine, binding processes to specific CPU(s)/core(s) may give better utilization of those CPU(s)/core(s) and thus better performance. Running as many processes as there are CPU(s)/core(s) in the machine and spreading one process to each CPU/core may be a good practice. This can be done with the "taskset" utility.
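Besides taskset, affinity can also be set programmatically. A minimal Linux sketch (the helper name is illustrative):

```c
#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling process to a single core; the same effect as running
 * "taskset -c <core> <cmd>" from the command line. */
static int pin_to_core(int core)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(core, &set);
    /* pid 0 means "the calling process". */
    return sched_setaffinity(0, sizeof(set), &set);
}
```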
Work with local NUMA node
When working on a Non-Uniform Memory Access (NUMA) machine, binding processes to CPU(s)/core(s) that are local to the RDMA device's NUMA node may provide better performance because of faster CPU access. Spreading the processes across all of the local CPU(s)/core(s) may be a good practice.
Work with cache-line aligned buffers
Working with cache-line aligned buffers (in the S/G list, Send Request, Receive Request and data) improves performance compared to working with unaligned memory buffers: it decreases the number of CPU cycles and the number of memory accesses.
Avoid getting into retransmission flows
Retransmission is a performance killer. There are two major reasons for retransmission in RDMA:
- Transport retransmission - the remote QP is not in a state that can process incoming messages, i.e. it has not reached (at least) the RTR state, or has moved to the Error state
- RNR retransmission - a message arrived that should consume a Receive Request on the responder side, but there is no Receive Request in the Receive Queue
Some RDMA devices provide counters that indicate retry flows occurred, but not all of them do.
Setting QP.retry_cnt and QP.rnr_retry to zero will cause a failure (i.e. a Completion with error) when the QP enters those flows; a sketch of this configuration follows below.
However, if retry flows cannot be avoided, use as low a delay between retransmissions as possible.
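A minimal sketch of the fail-fast configuration, shown as part of the RTR-to-RTS transition of an RC QP (helper name and the timeout/PSN values are illustrative):

```c
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* RTR-to-RTS transition of an RC QP with retry_cnt = 0 and rnr_retry = 0:
 * instead of silently retransmitting, the QP fails fast with a Completion
 * with error when a retry flow would start. */
static int set_no_retries(struct ibv_qp *qp, uint32_t sq_psn)
{
    struct ibv_qp_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.qp_state      = IBV_QPS_RTS;
    attr.timeout       = 14;   /* ACK timeout: 4.096 us * 2^14, ~67 ms */
    attr.retry_cnt     = 0;    /* no transport retries */
    attr.rnr_retry     = 0;    /* no RNR retries */
    attr.sq_psn        = sq_psn;
    attr.max_rd_atomic = 1;

    return ibv_modify_qp(qp, &attr,
                         IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                         IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN |
                         IBV_QP_MAX_QP_RD_ATOMIC);
}
```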
Improving the Bandwidth
Find the best MTU for the RDMA device
The MTU value specifies the maximum packet payload size (i.e. excluding the packet headers) that can be sent. As a rule of thumb, since the packet header size is the same for every MTU value, using the maximum available MTU decreases the "price paid" per packet: the percentage of payload data in the total used BW increases. However, some RDMA devices achieve their best performance with MTU values lower than the maximum supported one. One should perform some testing in order to find the best MTU for the specific device one works with.
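As a starting point for that testing, one can query the MTU that the port actually negotiated. A minimal sketch (helper name and fallback value are illustrative); the returned value would then go into ibv_qp_attr.path_mtu during the RTR transition with the IBV_QP_PATH_MTU mask bit set:

```c
#include <infiniband/verbs.h>

/* Query the port and return the MTU negotiated on the link.  Treat this
 * as a starting point: benchmarking may show a lower value is faster on
 * some devices. */
static enum ibv_mtu starting_mtu(struct ibv_context *ctx, uint8_t port)
{
    struct ibv_port_attr pattr;

    if (ibv_query_port(ctx, port, &pattr))
        return IBV_MTU_1024;          /* conservative fallback */
    return pattr.active_mtu;          /* currently active MTU of the port */
}
```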
Use big messages
Sending a few big messages is more effective than sending many small messages. At the application level, one should aggregate data and send big messages over RDMA.
Work with multiple outstanding Send Requests
Working with multiple outstanding Send Requests and keeping the Send Queue always full (i.e. posting a new Send Request for every polled Work Completion) keeps the RDMA device busy and prevents it from going idle.
Configure the Queue Pair to allow several RDMA Reads and Atomics in parallel
If one uses RDMA Read or Atomic operations, it is advised to configure the QP to allow several RDMA Read and Atomic operations in flight, since this provides higher BW.
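A minimal sketch of this configuration as part of the RTR-to-RTS transition (helper name and the value 16 are illustrative; both limits are capped by the device limits reported in ibv_device_attr). The responder-side twin, max_dest_rd_atomic with IBV_QP_MAX_DEST_RD_ATOMIC, is set during the RTR transition:

```c
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Allow several RDMA Reads/Atomics in flight as a requestor by raising
 * max_rd_atomic above the common default of 1. */
static int allow_parallel_reads(struct ibv_qp *qp, uint32_t sq_psn)
{
    struct ibv_qp_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.qp_state      = IBV_QPS_RTS;
    attr.sq_psn        = sq_psn;
    attr.max_rd_atomic = 16;   /* in-flight Reads/Atomics as requestor */
    attr.timeout       = 14;
    attr.retry_cnt     = 7;
    attr.rnr_retry     = 7;

    return ibv_modify_qp(qp, &attr,
                         IBV_QP_STATE | IBV_QP_SQ_PSN | IBV_QP_TIMEOUT |
                         IBV_QP_RETRY_CNT | IBV_QP_RNR_RETRY |
                         IBV_QP_MAX_QP_RD_ATOMIC);
}
```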
Work with selective signaling in the Send Queue
Working with selective signaling in the Send Queue means that not every Send Request produces a Work Completion when it ends; this reduces the number of Work Completions that must be handled (a code sketch of this technique appears in section 3.3 above).
Reducing the latency
Read Work Completions by polling
In order to read Work Completions as soon as they are added to the Completion Queue, polling will provide the best results (rather than working with Work Completion events).
Send small messages as inline
On RDMA devices that support sending data inline, sending small messages inline will provide better latency, since it eliminates the RDMA device's extra read (over the PCIe bus) to fetch the message payload.
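A minimal sketch of an inline send; the helper name is illustrative, and `len` must not exceed the cap.max_inline_data reported for the QP at creation time:

```c
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Send a small message inline: the payload is copied into the WQE itself,
 * so the RNIC does not need a second PCIe read to fetch it. */
static int send_inline(struct ibv_qp *qp, void *buf, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = len,
        .lkey   = 0,                  /* ignored for inline data */
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode     = IBV_WR_SEND;
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.send_flags = IBV_SEND_INLINE | IBV_SEND_SIGNALED;

    return ibv_post_send(qp, &wr, &bad_wr);
}
```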
Use low values in QP's timeout and min_rnr_timer
Using lower values in the QP's timeout and min_rnr_timer means that when something goes wrong and a retry is required (whether because the remote QP does not answer or because it has no outstanding Receive Request), the time waited before a retransmission will be short.
If immediate data is used, use RDMA Write with immediate instead of Send with immediate
When sending a message that carries only immediate data, RDMA Write with immediate provides better performance than Send with immediate, since the latter causes an outstanding posted Receive Request to be read (on the responder side) and not merely consumed.
Reducing memory consumption
Use a Shared Receive Queue (SRQ)
Instead of posting many Receive Requests for each Queue Pair, using an SRQ reduces the total number of outstanding Receive Requests and thus the total consumed memory.
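A minimal sketch of creating an SRQ (queue sizes are illustrative); QPs then attach to it by setting the srq field of ibv_qp_init_attr at creation time:

```c
#include <string.h>
#include <infiniband/verbs.h>

/* One SRQ feeds many QPs: Receive Requests are posted once to the SRQ
 * (via ibv_post_srq_recv()) instead of per QP, so far fewer outstanding
 * receive buffers are needed overall. */
static struct ibv_srq *make_srq(struct ibv_pd *pd)
{
    struct ibv_srq_init_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.attr.max_wr  = 256;   /* outstanding RRs shared by all QPs */
    attr.attr.max_sge = 1;

    return ibv_create_srq(pd, &attr);
}
```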
Register physical contiguous memory
Registering physically contiguous memory, such as huge pages, allows the low-level driver(s) to perform optimizations, since fewer memory address translations are required (compared to a buffer made of 4 KB memory pages).
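A minimal sketch of backing an MR with 2 MB huge pages on Linux; it assumes huge pages were reserved beforehand (e.g. via /proc/sys/vm/nr_hugepages) and that `len` is a multiple of the huge-page size:

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <infiniband/verbs.h>

/* Back an MR with 2 MB huge pages: one huge page covers 512 4 KB pages,
 * so far fewer address-translation entries are needed. */
static struct ibv_mr *reg_hugepage_mr(struct ibv_pd *pd, size_t len)
{
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;

    return ibv_reg_mr(pd, buf, len,
                      IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
}
```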
Reduce the size of the used Queues to the minimum
Creating the various Queues (Queue Pairs, Shared Receive Queues, Completion Queues) may consume a lot of memory. One should set their sizes to the minimum required by the application.
Reducing CPU consumption
Work with Work Completion events
Reading Work Completions using events eliminates constant polling of the CQ, since the RDMA device sends an event whenever a Work Completion is added to the CQ.
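A minimal sketch of one event-driven iteration, assuming `cq` was created with the completion channel `ch` (the helper name is illustrative):

```c
#include <infiniband/verbs.h>

/* Sleep until the RDMA device raises a completion event instead of burning
 * CPU in a poll loop, then drain the CQ. */
static int wait_and_handle(struct ibv_comp_channel *ch, struct ibv_cq *cq)
{
    struct ibv_cq *ev_cq;
    void *ev_ctx;
    struct ibv_wc wc;

    if (ibv_req_notify_cq(cq, 0))               /* arm the CQ (0 = any WC) */
        return -1;
    if (ibv_get_cq_event(ch, &ev_cq, &ev_ctx))  /* blocks until an event */
        return -1;
    ibv_ack_cq_events(ev_cq, 1);
    if (ibv_req_notify_cq(ev_cq, 0))            /* re-arm BEFORE polling   */
        return -1;                              /* to avoid missed events  */
    while (ibv_poll_cq(ev_cq, 1, &wc) > 0)
        /* ... handle wc ... */;
    return 0;
}
```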
Work with solicited events in Responder side
When reading the Work Completions on the Responder side, the solicited event gives the Requestor a good way to hint that now is a good time to read completions. This reduces the total number of handled Work Completions.
Share the same CQ with several Queues
Using the same CQ for several Queues and reducing the total number of CQs eliminates the need to check several CQs to find out whether an outstanding Work Request was completed. This can be done by sharing the same CQ among multiple Send Queues, multiple Receive Queues, or a mix of them.
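A minimal sketch of creating two QPs that share one CQ for both directions (queue sizes and helper name are illustrative); a single ibv_poll_cq() then observes completions from all of them, with wc.wr_id or wc.qp_num telling them apart:

```c
#include <string.h>
#include <infiniband/verbs.h>

/* Create two RC QPs whose Send and Receive Queues all report into the
 * same CQ, so only one CQ has to be checked. */
static void create_shared_cq_qps(struct ibv_pd *pd, struct ibv_cq *cq,
                                 struct ibv_qp **qp1, struct ibv_qp **qp2)
{
    struct ibv_qp_init_attr init;

    memset(&init, 0, sizeof(init));
    init.send_cq = cq;               /* same CQ ...            */
    init.recv_cq = cq;               /* ... for both directions */
    init.qp_type = IBV_QPT_RC;
    init.cap.max_send_wr  = 64;
    init.cap.max_recv_wr  = 64;
    init.cap.max_send_sge = 1;
    init.cap.max_recv_sge = 1;

    *qp1 = ibv_create_qp(pd, &init);
    *qp2 = ibv_create_qp(pd, &init);
}
```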
Increase the scalability
Use collective algorithms
Using collective algorithms reduces the total number of messages that cross the wire, and decreases the total number of messages and resources that each node in a cluster uses. Some RDMA devices provide special collective-offload operations that help reduce CPU utilization.
Use Unreliable Datagram (UD) QP
If every node needs to receive from or send messages to any other node in the subnet, using connected QPs (either Reliable or Unreliable) may be a bad solution, since many QPs must be created on every node. A single UD QP is better, since it can send to and receive messages from any other UD QP in the subnet.
Reduce the level of concurrency (number of QPs)
[RDMA] QP count and RDMA performance (partial translation): https://blog.csdn.net/bandaoyu/article/details/122947096