[MoE] DeepSeek's All-to-All Communication: A Walkthrough of the DeepEP Code



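DeepSeek recently open-sourced a series of MoE optimizations that show how much good AI infrastructure can deliver. DeepEP, released on the second day, optimizes the all-to-all communication used by expert parallelism (EP) in MoE models.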



Within a node, DeepEP communicates through the CUDA IPC interface over NVLink. Across nodes, it uses the NVSHMEM interface over InfiniBand (IB) RDMA.




_buffer = Buffer(group, num_nvl_bytes, num_rdma_bytes)


class Buffer:
    The core expert-parallel (EP) communication buffers for Mixture of Experts (MoE) model, which supports:
        - high-throughput intranode all-to-all (dispatch and combine, using NVLink)
        - high-throughput internode all-to-all (dispatch and combine, using RDMA without AR)
        - low-latency all-to-all (dispatch and combine, using RDMA, AR supported)

        num_sms: the SMs used in high-throughput kernels.
        rank: the local rank number.
        group_size: the number of ranks in the group.
        group: the communication group.
        num_nvl_bytes: the buffer size for intranode NVLink communication.
        num_rdma_bytes: the buffer size for internode (also for intranode with low-latency mode) RDMA communication.
        runtime: the C++ runtime.
    def __init__(self, group: dist.ProcessGroup,
                 num_nvl_bytes: int = 0, num_rdma_bytes: int = 0,
                 low_latency_mode: bool = False, num_qps_per_rank: int = 1) -> None:
        Initialize the communication buffer.

            group: the communication group.
            num_nvl_bytes: the buffer size for intranode NVLink communication.
            num_rdma_bytes: the buffer size for internode (also for intranode with low-latency mode) RDMA communication.
            low_latency_mode: whether to enable low-latency mode.
            num_qps_per_rank: the number of QPs for RDMA, the low-latency mode requires that this number equals
                to the number of local experts.
        # Some less important parts are omitted here
        self.runtime = deep_ep_cpp.Buffer(self.rank, self.group_size, num_nvl_bytes, num_rdma_bytes, low_latency_mode)



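  • int device_id: obtained from cudaGetDevice.
  • int* task_fifo_ptrs[NUM_MAX_NVL_PEERS]: task queues used for intranode IPC communication. They are used later by notify_dispatch, but not by dispatch.
  • cudaIpcMemHandle_t ipc_handles[NUM_MAX_NVL_PEERS]: obtained from cudaIpcGetMemHandle; used to set up intranode IPC communication and to create buffer_ptrs.
  • void* buffer_ptrs[NUM_MAX_NVL_PEERS]: the NVLink buffers used for intranode IPC communication.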


class Buffer:
    def __init__(...):
        # Use torch.distributed (dist) to exchange the device IDs,
        # i.e. the device_id returned by cudaGetDevice
        # Synchronize device IDs
        device_ids = [None, ] * self.group_size
        local_device_id = self.runtime.get_local_device_id()
        dist.all_gather_object(device_ids, local_device_id, group)

        # The IPC handles come from the earlier cudaIpcGetMemHandle call
        # Synchronize IPC handles
        ipc_handles = [None, ] * self.group_size
        local_ipc_handle = self.runtime.get_local_ipc_handle()
        dist.all_gather_object(ipc_handles, local_ipc_handle, group)
        # Synchronize NVSHMEM unique IDs
        # Fetch the root's NVSHMEM unique_id, then share it with every rank
        root_unique_id = None
        if self.runtime.get_num_rdma_ranks() > 1 or low_latency_mode:
            # Some low_latency_mode code is omitted here

            # NOTES: make sure AR (Adaptive Routing) is turned off while running normal kernels, as we cannot verify AR status in the code
            # Synchronize using the root ID
            nvshmem_unique_ids = [None, ] * self.group_size
            if (low_latency_mode and self.rank == 0) or (not low_latency_mode and self.runtime.get_rdma_rank() == 0):
                # Calls nvshmemx_get_uniqueid internally
                root_unique_id = self.runtime.get_local_nvshmem_unique_id()
            dist.all_gather_object(nvshmem_unique_ids, root_unique_id, group)
            root_unique_id = nvshmem_unique_ids[0 if low_latency_mode else self.runtime.get_root_rdma_rank(True)]
        # Every peer's information is now available; next, create the IPC and NVSHMEM structures
        # Make CPP runtime available
        self.runtime.sync(device_ids, ipc_handles, root_unique_id)
        assert self.runtime.is_available()



  • Open the IPC handles

    • cudaIpcOpenMemHandle(&buffer_ptrs[i], ipc_handles[i], cudaIpcMemLazyEnablePeerAccess);
  • Create the task queues task_fifo_ptrs

  • Copy the related variables to the GPU.


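  • Initialize NVSHMEM with internode::init(...), which internally calls

    • nvshmemx_set_attr_uniqueid_args(rank, num_ranks, &root_unique_id, &attr);
      nvshmemx_init_attr(NVSHMEMX_INIT_WITH_UNIQUEID, &attr);
    • In non-low_latency mode, each NVSHMEM communication group consists of the GPUs that share the same nvl rank across all rdma ranks. There are therefore as many groups as there are nvl ranks, each group has as many members as there are rdma ranks, and each group's root sits on the node with rdma rank = 0.

  • Allocate the NVSHMEM shared-memory pointer rdma_buffer_ptr, internally via

    • nvshmem_align(alignment, size);
    • After this, every GPU can build shared buffers on top of rdma_buffer_ptr and communicate through NVSHMEM.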
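
The grouping rule above can be sketched in plain Python. This is only an illustration: it assumes global ranks are numbered node by node (rank = rdma_rank * 8 + nvl_rank), and `split_rank` and `nvshmem_groups` are hypothetical helper names, not part of DeepEP.

```python
from typing import List, Tuple

NUM_MAX_NVL_PEERS = 8  # GPUs per node, matching the constant named in the text


def split_rank(rank: int) -> Tuple[int, int]:
    """Return (rdma_rank, nvl_rank); assumes ranks are numbered node by node."""
    return rank // NUM_MAX_NVL_PEERS, rank % NUM_MAX_NVL_PEERS


def nvshmem_groups(num_ranks: int, low_latency_mode: bool = False) -> List[List[int]]:
    """List the global ranks in each NVSHMEM group (hypothetical helper)."""
    if low_latency_mode:
        # Low-latency mode: a single group containing every GPU, rooted at rank 0.
        return [list(range(num_ranks))]
    # Normal mode: one group per nvl rank, spanning all rdma ranks.
    groups: List[List[int]] = [[] for _ in range(min(num_ranks, NUM_MAX_NVL_PEERS))]
    for rank in range(num_ranks):
        _, nvl_rank = split_rank(rank)
        groups[nvl_rank].append(rank)
    return groups


if __name__ == "__main__":
    for group in nvshmem_groups(16):  # 2 nodes x 8 GPUs
        print(group)
```

With 16 ranks this gives 8 groups of 2 ranks each, and the first member of every group lives on the node with rdma rank 0, which is where the text places the group root.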




def dispatch_forward(x: Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]],
                     topk_idx: torch.Tensor, topk_weights: torch.Tensor,
                     num_experts: int, previous_event: Optional[EventOverlap] = None) -> \
        Tuple[Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]], torch.Tensor, torch.Tensor, List, Tuple, EventOverlap]:
    # NOTES: an optional `previous_event` means a CUDA event captured that you want to make it as a dependency 
    # of the dispatch kernel, it may be useful with communication-computation overlap. For more information, please
    # refer to the docs of `Buffer.dispatch`
    global _buffer

    # Calculate layout before actual dispatch
    num_tokens_per_rank, num_tokens_per_rdma_rank, num_tokens_per_expert, is_token_in_rank, previous_event = \
        _buffer.get_dispatch_layout(topk_idx, num_experts,
                                    previous_event=previous_event, async_finish=True,
                                    allocate_on_comm_stream=previous_event is not None)
    # Do MoE dispatch
    # NOTES: the CPU will wait for GPU's signal to arrive, so this is not compatible with CUDA graph
    # For more advanced usages, please refer to the docs of the `dispatch` function
    recv_x, recv_topk_idx, recv_topk_weights, num_recv_tokens_per_expert_list, handle, event = \
        _buffer.dispatch(x, topk_idx=topk_idx, topk_weights=topk_weights,
                         num_tokens_per_rank=num_tokens_per_rank, num_tokens_per_rdma_rank=num_tokens_per_rdma_rank,
                         is_token_in_rank=is_token_in_rank, num_tokens_per_expert=num_tokens_per_expert,
                         previous_event=previous_event, async_finish=True,
                         allocate_on_comm_stream=True)
    # For event management, please refer to the docs of the `EventOverlap` class
    return recv_x, recv_topk_idx, recv_topk_weights, num_recv_tokens_per_expert_list, handle, event



class Buffer:
    def get_dispatch_layout(self, topk_idx: torch.Tensor, num_experts: int,
                            previous_event: Optional[EventOverlap] = None, async_finish: bool = False,
                            allocate_on_comm_stream: bool = False) -> \
            Tuple[torch.Tensor, Optional[torch.Tensor], torch.Tensor, torch.Tensor, EventOverlap]:
        Calculate the layout required for later communication.

            topk_idx: `[num_tokens, num_topk]`, dtype must be `torch.int64`, the expert indices selected by each token,
                `-1` means no selections.
            num_experts: the number of experts.
            previous_event: the event to wait before actually executing the kernel. If it is not `None`, the kernel
                runs only after this event has finished, which can be used to express dependencies in pipeline parallelism.
            async_finish: the current stream will not wait for the communication kernels to be finished if set.
            allocate_on_comm_stream: control whether all the allocated tensors' ownership to be on the communication stream.

            num_tokens_per_rank: `[num_ranks]` with `torch.int`, the number of tokens to be sent to each rank.
            num_tokens_per_rdma_rank: `[num_rdma_ranks]` with `torch.int`, the number of tokens to be sent to each RDMA
                rank (with the same GPU index), return `None` for intranode settings.
            num_tokens_per_expert: `[num_experts]` with `torch.int`, the number of tokens to be sent to each expert.
            is_token_in_rank: `[num_tokens, num_ranks]` with `torch.bool`, whether a token be sent to a rank.
            event: the event after executing the kernel (valid only if `async_finish` is set).
        num_tokens_per_rank, num_tokens_per_rdma_rank, num_tokens_per_expert, is_token_in_rank, event = \
            self.runtime.get_dispatch_layout(topk_idx, num_experts, getattr(previous_event, 'event', None),
                                             async_finish, allocate_on_comm_stream)
        return num_tokens_per_rank, num_tokens_per_rdma_rank, num_tokens_per_expert, is_token_in_rank, EventOverlap(event)
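
To make the returned tensors concrete, here is a minimal pure-Python reference for the intranode outputs described in the docstring above. It is a sketch under assumptions, not the DeepEP kernel: it uses nested lists instead of tensors, and it assumes experts are split evenly and contiguously across ranks (expert `e` lives on rank `e // (num_experts // num_ranks)`). The name `reference_dispatch_layout` is hypothetical.

```python
from typing import List, Tuple


def reference_dispatch_layout(topk_idx: List[List[int]], num_experts: int, num_ranks: int
                              ) -> Tuple[List[int], List[int], List[List[bool]]]:
    """Compute num_tokens_per_rank, num_tokens_per_expert and is_token_in_rank."""
    experts_per_rank = num_experts // num_ranks  # assumption: even, contiguous split
    num_tokens_per_rank = [0] * num_ranks
    num_tokens_per_expert = [0] * num_experts
    is_token_in_rank = [[False] * num_ranks for _ in topk_idx]
    for token, experts in enumerate(topk_idx):
        for expert in experts:
            if expert < 0:  # -1 means "no selection"
                continue
            num_tokens_per_expert[expert] += 1
            is_token_in_rank[token][expert // experts_per_rank] = True
        # A token is counted once per rank, even if it selects several experts there.
        for rank in range(num_ranks):
            num_tokens_per_rank[rank] += int(is_token_in_rank[token][rank])
    return num_tokens_per_rank, num_tokens_per_expert, is_token_in_rank


if __name__ == "__main__":
    per_rank, per_expert, in_rank = reference_dispatch_layout(
        [[0, 1], [2, -1], [1, 3]], num_experts=4, num_ranks=2)
    print(per_rank)    # [2, 2]
    print(per_expert)  # [1, 2, 1, 1]
```

Token 0 selects two experts on rank 0 but is sent there only once, which is why `num_tokens_per_rank` and `num_tokens_per_expert` can differ.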



class Buffer:
    def dispatch(self, x: Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]],
                 handle: Optional[Tuple] = None,
                 num_tokens_per_rank: Optional[torch.Tensor] = None, num_tokens_per_rdma_rank: Optional[torch.Tensor] = None,
                 is_token_in_rank: Optional[torch.Tensor] = None, num_tokens_per_expert: Optional[torch.Tensor] = None,
                 topk_idx: Optional[torch.Tensor] = None, topk_weights: Optional[torch.Tensor] = None, expert_alignment: int = 1,
                 config: Optional[Config] = None,
                 previous_event: Optional[EventOverlap] = None, async_finish: bool = False,
                 allocate_on_comm_stream: bool = False) -> \
            Tuple[Union[Tuple[torch.Tensor, torch.Tensor], torch.Tensor], Optional[torch.Tensor],
                  Optional[torch.Tensor], List[int], Tuple, EventOverlap]:
        Dispatch tokens to different ranks, both intranode and internode settings are supported.
        Intranode kernels require all the ranks should be visible via NVLink.
        Internode kernels require the ranks in a node should be visible via NVLink, while the ranks with the same GPU
            index should be visible via RDMA. AR must be disabled.

            x: the token data, a `torch.Tensor` or tuple of `torch.Tensor`; for the first type, the shape must be `[num_tokens, hidden]`,
                and type must be `torch.bfloat16`; for the second type, the first element of the tuple must be shaped as
                `[num_tokens, hidden]` with type `torch.float8_e4m3fn`, the second must be `[num_tokens, hidden // 128]`
                 (requiring divisible) with type `torch.float`.
            handle: an optional communication handle, if set, the CPU will reuse the layout information to save some time.
                This can be used for the combine in the backward pass (which is essentially a dispatch).
            num_tokens_per_rank: `[num_ranks]` with `torch.int`, the number of tokens to be sent to each rank.
            num_tokens_per_rdma_rank: `[num_rdma_ranks]` with `torch.int`, the number of tokens to be sent to each RDMA
                rank (with the same GPU index), return `None` for intranode settings.
            is_token_in_rank: `[num_tokens, num_ranks]` with `torch.bool`, whether a token be sent to a rank.
            num_tokens_per_expert: `[num_experts]` with `torch.int`, the number of tokens to be sent to each expert.
            topk_idx: `[num_tokens, num_topk]` with `torch.int64`, the expert indices selected by each token,
                `-1` means no selections.
            topk_weights: `[num_tokens, num_topk]` with `torch.float`, the expert weights of each token to dispatch.
            expert_alignment: align the number of tokens received by each local expert to this variable.
            config: the performance tuning config.
            previous_event: the event to wait before actually executing the kernel.
            async_finish: the current stream will not wait for the communication kernels to be finished if set.
            allocate_on_comm_stream: control whether all the allocated tensors' ownership to be on the communication stream.

            recv_x: received tokens, the same type and tuple as the input `x`, but the number of tokens equals to the
                received token count.
            recv_topk_idx: received expert indices.
            recv_topk_weights: received expert weights.
            num_recv_tokens_per_expert_list: Python list shaped `[num_local_experts]`, the received token count by
                each local expert, aligned to the input `expert_alignment`.
            handle: the returned communication handle.
            event: the event after executing the kernel (valid only if `async_finish` is set).
        # Internode
        if self.runtime.get_num_rdma_ranks() > 1:
            return self.internode_dispatch(x, handle, num_tokens_per_rank, num_tokens_per_rdma_rank, is_token_in_rank, num_tokens_per_expert,
                                           topk_idx, topk_weights, expert_alignment, config, previous_event, async_finish, allocate_on_comm_stream)
        # The intranode_dispatch case is skipped here



std::tuple<torch::Tensor, ...>
Buffer::internode_dispatch(const torch::Tensor& x, ...) {
    bool cached_mode = cached_rank_prefix_matrix.has_value();
    // One channel corresponds to 2 SMs
    const int num_channels = config.num_sms / 2;
    // Set up comm_stream
    // Allocate all tensors on comm stream if set
    // NOTES: do not allocate tensors upfront!
    auto compute_stream = at::cuda::getCurrentCUDAStream();
    if (allocate_on_comm_stream) {
        EP_HOST_ASSERT(previous_event.has_value() and async);
    }

    // Wait previous tasks to be finished
    if (previous_event.has_value()) {
        stream_wait(comm_stream, previous_event.value());
    } else {
        stream_wait(comm_stream, compute_stream);
    }
    if (cached_mode) {
        // If a dispatch was done before, its results can be reused
    } else {
        // Otherwise, the layout has to be computed
    }
    // Wait for notify_dispatch to finish


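  • rdma_channel_prefix_matrix: shape (num_rdma_ranks, num_channels); prefix sums of the number of tokens each channel sends to each RDMA node

  • recv_rdma_rank_prefix_sum: shape (num_rdma_ranks); the number of tokens each RDMA node is to receive

  • gbl_channel_prefix_matrix: shape (num_ranks, num_channels); prefix sums of the number of tokens each channel sends to each GPU

  • recv_gbl_rank_prefix_sum: shape (num_ranks); the number of tokens each GPU is to receive

  • moe_recv_counter: int; the total number of tokens to receive

  • moe_recv_expert_counter: int[NUM_MAX_LOCAL_EXPERTS]; the number of tokens each local expert is to receive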
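
As an illustration of what a "channel prefix matrix" holds, the sketch below builds one from `is_token_in_rank`. It assumes tokens are split into contiguous, equally sized chunks with one chunk per channel; that split, and the helper name `channel_prefix_matrix`, are assumptions made for this example rather than facts taken from the DeepEP source.

```python
from itertools import accumulate
from typing import List


def channel_prefix_matrix(is_token_in_rank: List[List[bool]], num_channels: int) -> List[List[int]]:
    """Return a [num_ranks][num_channels] matrix of inclusive prefix sums.

    Entry [r][c] is the number of tokens sent to rank r by channels 0..c.
    """
    num_tokens = len(is_token_in_rank)
    num_ranks = len(is_token_in_rank[0])
    tokens_per_channel = -(-num_tokens // num_channels)  # ceiling division
    counts = [[0] * num_channels for _ in range(num_ranks)]
    for token, row in enumerate(is_token_in_rank):
        channel = token // tokens_per_channel  # assumption: contiguous split
        for rank, hit in enumerate(row):
            counts[rank][channel] += int(hit)
    return [list(accumulate(row)) for row in counts]


if __name__ == "__main__":
    in_rank = [[True, False], [False, True], [True, True], [True, False]]
    print(channel_prefix_matrix(in_rank, num_channels=2))  # [[1, 3], [1, 2]]
```

The last column of each row is the total number of tokens sent to that rank, and the difference between neighbouring columns gives the share of a single channel.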



	// Launch data dispatch


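In MoE, a token may be sent to several GPUs, and those GPUs may sit on several nodes (DeepSeek-V3 limits each token to at most 4 nodes). A token first travels over the rdma channel from the local GPU to the GPU with the same index on each remote target node. It then travels over the nvl_channel to every target GPU inside that remote node.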
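
The two-hop path can be sketched as follows. The code assumes global ranks are numbered node by node with 8 GPUs per node; `dispatch_hops` is a hypothetical helper that groups a token's final destinations by the relay GPU that receives it over RDMA.

```python
from typing import Dict, List

NUM_MAX_NVL_PEERS = 8  # GPUs per node


def dispatch_hops(src_rank: int, dst_ranks: List[int]) -> Dict[int, List[int]]:
    """Map each relay rank (dst rdma rank, src nvl rank) to its final destinations."""
    _, src_nvl = divmod(src_rank, NUM_MAX_NVL_PEERS)
    hops: Dict[int, List[int]] = {}
    for dst in dst_ranks:
        dst_rdma, _ = divmod(dst, NUM_MAX_NVL_PEERS)
        # Hop 1 (RDMA): same GPU index on the destination node.
        relay = dst_rdma * NUM_MAX_NVL_PEERS + src_nvl
        # Hop 2 (NVLink): from the relay to the final GPU on that node.
        hops.setdefault(relay, []).append(dst)
    return hops


if __name__ == "__main__":
    # A token on rank 2 (node 0, GPU 2) goes to ranks 9 and 12 (node 1) and 17 (node 2).
    print(dispatch_hops(2, [9, 12, 17]))  # {10: [9, 12], 18: [17]}
```

Each destination node receives the token over RDMA only once, no matter how many of its GPUs need it; the fan-out inside the node happens over NVLink.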


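The accompanying figure shows the overall workflow. Note that, for simplicity, it shows only a single token being sent to a single target GPU. In practice each token goes to at most 4 dst rdma ranks and 8 dst nvl ranks. In the figure, yellow boxes are GPUs, solid lines are the paths the data takes, and dashed lines are control information.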



  • dispatch launches num_channels * 2 SMs, with every two SMs serving one channel
  • Each SM has kNumDispatchRDMASenderWarps + 1 + NUM_MAX_NVL_PEERS warps (by default kNumDispatchRDMASenderWarps = 7 and NUM_MAX_NVL_PEERS = 8, so each SM has 16 warps)
  • On the H800, each warp has 32 threads


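  • On the first SM
    • The first 8 warps are RDMAAndNVLForwarder, which move data from RDMA to NVL
    • 1 warp is the ForwarderCoordinator, which coordinates the RDMAAndNVLForwarder warps
  • On the second SM
    • The first 7 warps are RDMASender, which copy data into the RDMA channel
    • The 8th warp is the RDMASenderCoordinator, which coordinates the RDMASender warps
    • The remaining 8 warps are NVLReceivers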
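
The role assignment listed above can be written as a small lookup. This is a sketch under two assumptions that the text does not state: the even SM of each pair is treated as the "first" SM, and the warps on the first SM that are not given a role in the list are reported as unused. `dispatch_warp_role` is a hypothetical name.

```python
K_NUM_DISPATCH_RDMA_SENDER_WARPS = 7
NUM_MAX_NVL_PEERS = 8
WARPS_PER_SM = K_NUM_DISPATCH_RDMA_SENDER_WARPS + 1 + NUM_MAX_NVL_PEERS  # 16


def dispatch_warp_role(sm_id: int, warp_id: int) -> str:
    """Return the role of a warp in the internode dispatch kernel, as described in the text."""
    assert 0 <= warp_id < WARPS_PER_SM
    if sm_id % 2 == 0:  # assumption: the even SM of each channel is the "first" SM
        if warp_id < NUM_MAX_NVL_PEERS:
            return "RDMAAndNVLForwarder"
        if warp_id == NUM_MAX_NVL_PEERS:
            return "ForwarderCoordinator"
        return "unused"  # assumption: the list above assigns no role to these warps
    if warp_id < K_NUM_DISPATCH_RDMA_SENDER_WARPS:
        return "RDMASender"
    if warp_id == K_NUM_DISPATCH_RDMA_SENDER_WARPS:
        return "RDMASenderCoordinator"
    return "NVLReceivers"


if __name__ == "__main__":
    for warp in range(WARPS_PER_SM):
        print(warp, dispatch_warp_role(0, warp), dispatch_warp_role(1, warp))
```

The channel served by an SM is then simply `sm_id // 2`.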



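  • The src rdma rank and the dst rdma rank each keep their own head and tail pointers for the queue, and these pointers have to be kept in sync
  • The src puts data into send_buffer and uses NVSHMEM to send it to the dst's recv_buffer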



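  • RDMASender
    • The 7 RDMASender warps take tokens in turn, one token per warp at a time:
      • Inside a warp, each lane (thread) corresponds to one dst rdma rank
      • If the lane's rank is one of the rdma ranks the token must go to, the lane advances rdma_send_channel_tail and waits for the rdma_channel_head sent back by the remote side, requiring tail - head < num_max_rdma_chunked_recv_tokens (the queue size)
      • Put the token into send_buffer
      • Update rdma_channel_tail
  • RDMASenderCoordinator
    • Each lane handles one dst rdma rank
    • While unsent data remains, poll all rdma ranks:
      • If an rdma_channel has more than num_max_rdma_chunked_send_tokens tokens waiting, send that many tokens from send_buffer to the remote recv_buffer
      • Update the remote rdma_channel_tail += number of tokens sent
  • RDMAAndNVLForwarder
    • Each warp handles one intranode nvl rank (numbering starts from its own rank); these warps process the data arriving on the rdma_channel concurrently
    • While unforwarded data remains
      • Wait until the nvl channel has num_max_nvl_chunked_send_tokens free slots
      • Poll all src rdma ranks and check their rdma_channel_tail for newly arrived tokens
      • Once a src rdma rank with tokens is found, iterate over the received tokens and check whether each one should go to the nvl rank this warp handles
      • If so, copy the token from the rdma_channel's recv_buffer into the nvl_channel
    • Update forward_channel_head = rdma_channel_tail, and nvl_channel_tail += number of tokens processed
  • ForwarderCoordinator
    • Each lane handles one src rdma rank
    • While some RDMAAndNVLForwarder has not finished
      • Poll every nvl rank within the rdma rank; once the forward_channel_head of all 8 RDMAAndNVLForwarder warps has been updated, update the remote rdma_channel_head
  • NVLReceivers
    • Each warp handles one intranode nvl rank (starting from the rank after its own)
    • When nvl_channel_tail has been updated:
      • Iterate over all received tokens and copy them from the nvl_channel into recv_x
      • Update nvl_channel_head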
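
All of these roles rely on the same flow-control idea: a fixed-size ring buffer in which the sender advances tail, the receiver advances head, and the sender may write only while tail - head is smaller than the queue size. The single-process sketch below illustrates just that rule. `ChannelQueue` and `transfer` are hypothetical names, and the real kernels keep head and tail on different GPUs and synchronize them over NVLink or NVSHMEM.

```python
from typing import List, Optional


class ChannelQueue:
    """Toy ring buffer with monotonically increasing head and tail counters."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.slots: List[Optional[int]] = [None] * capacity
        self.head = 0  # advanced by the receiver
        self.tail = 0  # advanced by the sender

    def try_push(self, token: int) -> bool:
        """Sender side: write only if tail - head < capacity."""
        if self.tail - self.head >= self.capacity:
            return False  # queue full, the sender has to wait for head to move
        self.slots[self.tail % self.capacity] = token
        self.tail += 1
        return True

    def pop_all(self) -> List[int]:
        """Receiver side: consume everything in [head, tail), then publish the new head."""
        out = [self.slots[i % self.capacity] for i in range(self.head, self.tail)]
        self.head = self.tail
        return out


def transfer(tokens: List[int], capacity: int) -> List[int]:
    """Move all tokens through a queue of the given capacity, alternating sender and receiver."""
    queue = ChannelQueue(capacity)
    pending, received = list(tokens), []
    while pending or queue.tail != queue.head:
        while pending and queue.try_push(pending[0]):
            pending.pop(0)
        received.extend(queue.pop_all())
    return received


if __name__ == "__main__":
    print(transfer(list(range(10)), capacity=4))
```

Because head and tail only ever increase, the slot index is always `counter % capacity`, and comparing the two counters is enough to tell a full queue from an empty one.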


std::tuple<torch::Tensor, ...>
Buffer::internode_dispatch(const torch::Tensor& x, ...) {
    // After internode::dispatch:
    // in synchronous mode, wait for dispatch to finish;
    // in asynchronous mode, record an event on comm_stream
    // Wait streams
    std::optional<EventHandle> event;
    if (async) {
        event = EventHandle(comm_stream);
        for (auto& t: {x, is_token_in_rank, rank_prefix_matrix, channel_prefix_matrix, recv_x, recv_src_idx, recv_channel_prefix_matrix, send_head}) {
            t.record_stream(comm_stream);
            if (allocate_on_comm_stream)
                t.record_stream(compute_stream);
        }
        // record_stream is then also applied to some other tensors
    } else {
        stream_wait(compute_stream, comm_stream);
    }
    // Switch back compute stream
    if (allocate_on_comm_stream)
        at::cuda::setCurrentCUDAStream(compute_stream);

    // Return values
    return {recv_x, recv_x_scales, recv_topk_idx, recv_topk_weights, num_recv_tokens_per_expert_list, rank_prefix_matrix, channel_prefix_matrix, recv_channel_prefix_matrix, recv_src_idx, send_head, event};
}




class Buffer:
    # noinspection PyTypeChecker
    def combine(self, x: torch.Tensor, handle: Tuple,
                topk_weights: Optional[torch.Tensor] = None,
                config: Optional[Config] = None,
                previous_event: Optional[EventOverlap] = None, async_finish: bool = False,
                allocate_on_comm_stream: bool = False) -> \
            Tuple[torch.Tensor, Optional[torch.Tensor], EventOverlap]:
        Combine (reduce) tokens (addition **without** weights) from different ranks, both intranode and internode
            settings are supported.
        Intranode kernels require all the ranks should be visible via NVLink.
        Internode kernels require the ranks in a node should be visible via NVLink, while the ranks with the same GPU
            index should be visible via RDMA. AR must be disabled.

            x: `[num_tokens, hidden]` with `torch.bfloat16`, the tokens to send for reducing to its original ranks.
            handle: a must-set communication handle, you can obtain this from the dispatch function.
            topk_weights: `[num_tokens, num_topk]` with `torch.float`, the tokens' top-k weights for reducing to its original ranks.
            config: the performance tuning config.
            previous_event: the event to wait before actually executing the kernel.
            async_finish: the current stream will not wait for the communication kernels to be finished if set.
            allocate_on_comm_stream: control whether all the allocated tensors' ownership to be on the communication stream.

            recv_x: the reduced token from its dispatched ranks.
            recv_topk_weights: the reduced top-k weights from its dispatch ranks.
            event: the event after executing the kernel (valid only if `async_finish` is set).
        # Default config
        config = self.get_combine_config(self.group_size) if config is None else config

        # Internode
        if self.runtime.get_num_rdma_ranks() > 1:
            return self.internode_combine(x, handle, topk_weights, config, previous_event, async_finish, allocate_on_comm_stream)
        # The intranode case is omitted




  • combine launches num_channels * 2 SMs, with every two SMs serving one channel
  • Each SM has NUM_MAX_NVL_PEERS + kNumCombineForwarderWarps + 1 warps (by default NUM_MAX_NVL_PEERS = 8 and kNumCombineForwarderWarps = 16, so each SM has 25 warps by default)
  • On the H800, each warp has 32 threads


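  • The first SM handles sending
    • The first 8 warps are NVLSender, responsible for NVL transfers
    • The next 16 warps are NVLAndRDMAForwarder, which forward data from NVL to RDMA
    • One warp is Coordinator (send)
  • The second SM handles receiving
    • 24 warps are RDMAReceiver, which receive RDMA data and write it into combined_x
    • One warp is Coordinator (recv)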



Note: here the nvl_channel creates a separate nvl_buffer for each RDMA rank.


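  • NVLSender
    • Each warp handles one remote nvl rank (the src nvl rank of dispatch)
    • Each lane corresponds to one remote rdma rank (the src rdma rank of dispatch)
    • Each lane reads the range of tokens it is responsible for from gbl_channel_prefix_matrix
    • If any lane still has unsent tokens and the nvl_channel has more than num_max_nvl_chunked_send_tokens free slots, then
      • each such lane puts at most num_max_nvl_chunked_send_tokens tokens into the nvl_channel
    • Update nvl_channel_tail
  • NVLAndRDMAForwarder
    • The warps are divided evenly among the remote rdma ranks (the src rdma ranks of dispatch). The warps that serve the same remote rdma rank are called a warp group.
    • For all tokens not yet forwarded:
      • The first warp of the warp group waits until the rdma_channel has more than num_max_rdma_chunked_send_tokens free slots
      • The warps in the group take turns iterating over the tokens to forward
        • In the dispatch phase, a token is forwarded from (dst rdma rank, src nvl rank) to several (dst rdma rank, dst nvl rank) pairs; in the combine phase, the copies on those dst nvl ranks therefore have to be reduced
        • What NVLAndRDMAForwarder iterates over here is exactly the set of tokens that (dst rdma rank, src nvl rank) had to forward during dispatch. The information left behind by dispatch (combined_nvl_head) tells us which dst nvl ranks hold each token. Once those nvl ranks have sent the required tokens over, they can be reduced.
        • The reduced result is written into the rdma_channel.
        • Update forwarder_nvl_head
      • After the tokens have all been reduced, send the rdma_channel data and update the remote rdma_channel_tail
  • RDMAReceiver
    • The warps take turns iterating over the tokens to combine
      • This is much like NVLAndRDMAForwarder. During dispatch, a token is sent from (src rdma rank, src nvl rank) to several (dst rdma rank, src nvl rank) pairs; in the combine phase, the copies on those dst rdma ranks therefore have to be reduced.
      • What RDMAReceiver iterates over is exactly the set of tokens that originally resided on (src rdma rank, src nvl rank) in dispatch. The information left behind by dispatch (combined_rdma_head) tells us which dst rdma ranks hold each token. Once those rdma ranks have sent the required tokens over, they can be reduced.
      • The reduced result is written into combined_x
      • Update rdma_receiver_rdma_head
  • Coordinator (send)
    • Each lane handles one (dst rdma rank, dst nvl rank) pair from dispatch and coordinates all of its NVLSender warps
    • If the forwarder_nvl_head of every NVLSender has advanced, advance nvl_channel_head
  • Coordinator (recv)
    • Each lane handles one dst rdma rank from dispatch and coordinates all of its RDMAReceiver warps
    • If the rdma_receiver_rdma_head of every RDMAReceiver has advanced, advance the remote rdma_channel_head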
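
The two reduction stages can be illustrated with scalars standing in for hidden vectors. In this sketch, stage 1 plays the part of NVLAndRDMAForwarder (summing the copies held by GPUs of one node) and stage 2 the part of RDMAReceiver (summing one partial result per node). The rank numbering (8 GPUs per node, node by node) and the helper name `two_stage_combine` are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Tuple

NUM_MAX_NVL_PEERS = 8  # GPUs per node


def two_stage_combine(partials: Dict[int, float]) -> Tuple[Dict[int, float], float]:
    """Reduce one token's expert outputs, keyed by the global rank that holds them."""
    per_node: Dict[int, float] = defaultdict(float)
    # Stage 1: inside each destination node, sum over the dst nvl ranks.
    for rank, value in partials.items():
        per_node[rank // NUM_MAX_NVL_PEERS] += value
    # Stage 2: back on the source GPU, sum over the dst rdma ranks.
    total = sum(per_node.values())
    return dict(per_node), total


if __name__ == "__main__":
    # The token was dispatched to ranks 9 and 12 (node 1) and rank 17 (node 2).
    per_node, total = two_stage_combine({9: 1.0, 12: 2.0, 17: 4.0})
    print(per_node)  # {1: 3.0, 2: 4.0}
    print(total)     # 7.0
```

As in the `combine` docstring, this is a plain addition without weights; the point of the first stage is that only one reduced value per node has to cross the RDMA link.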





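In short, ordinary GPUDirect RDMA issues requests through a proxy thread on the CPU, whereas IBGDA issues requests directly from the GPU and can therefore lower latency.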




class Buffer:
    def __init__(...):		
        # Synchronize NVSHMEM unique IDs
        root_unique_id = None
        if self.runtime.get_num_rdma_ranks() > 1 or low_latency_mode:
            # Enable IBGDA for the low latency mode, which refers to "no package forwarding between NVLink and RDMA"
            if low_latency_mode:
                assert num_qps_per_rank > 0
                os.environ['NVSHMEM_DISABLE_P2P'] = '1'
                os.environ['NVSHMEM_IB_ENABLE_IBGDA'] = '1'
                os.environ['NVSHMEM_IBGDA_NIC_HANDLER'] = 'gpu'
                os.environ['NVSHMEM_IBGDA_NUM_RC_PER_PE'] = f'{num_qps_per_rank}'
                # Make sure QP depth is always larger than the number of on-flight WRs, so that we can skip WQ slot check
                os.environ['NVSHMEM_QP_DEPTH'] = '1024'
                # NOTES: NVSHMEM initialization requires at least 256 MiB
                os.environ['NVSHMEM_CUMEM_GRANULARITY'] = f'{2 ** 29}'
            if (low_latency_mode and self.rank == 0) or (not low_latency_mode and self.runtime.get_rdma_rank() == 0):
                root_unique_id = self.runtime.get_local_nvshmem_unique_id()

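The data path in low_latency mode also differs from normal mode. In normal mode, a token is first sent to (dst rdma rank, src nvl rank) and then forwarded to (dst rdma rank, dst nvl rank). low_latency mode skips the forwarding step and sends the data straight to (dst rdma rank, dst nvl rank). As a result, all GPUs belong to a single NVSHMEM communication group, whose root is the GPU with rank = 0.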


class Buffer:
    # noinspection PyTypeChecker
    def low_latency_dispatch(self, x: torch.Tensor, topk_idx: torch.Tensor,
                             num_max_dispatch_tokens_per_rank: int, num_experts: int,
                             async_finish: bool = False, return_recv_hook: bool = False) -> \
            Tuple[Tuple[torch.Tensor, torch.Tensor], torch.Tensor, Tuple, EventOverlap, Callable]:
        A low-latency implementation for dispatching with IBGDA **with implicit FP8 casting**.
        This kernel requires all the ranks (no matter intranode or internode) should be visible via RDMA
            (specifically, IBGDA must be enabled).
        Even for ranks in the same node, NVLink are fully disabled for simplicity.
        Warning: as there are only two buffers, and the returned tensors reuse the buffer, you can not hold more than 2
            low-latency kernels' result tensor at a single moment.

            x: `torch.Tensor` with `torch.bfloat16`, shaped as `[num_tokens, hidden]`, only several hidden shapes are
                supported. The number of tokens to be dispatched must be less than `num_max_dispatch_tokens_per_rank`.
            topk_idx: `torch.Tensor` with `torch.int64`, shaped as `[num_tokens, num_topk]`, only several top-k shapes
                are supported. `-1` indices (not selecting any expert) are supported.
            num_max_dispatch_tokens_per_rank: the maximum number of tokens to dispatch, all the ranks must hold the same value.
            num_experts: the number of all experts.
            async_finish: the current stream will not wait for the communication kernels to be finished if set.
            return_recv_hook: return a receiving hook if set. If set, the kernel will just do the RDMA request issues,
                but **without actually receiving the data**. You must call the received hook to make sure the data's arrival.
                If you not set this flag, the kernel will ensure the data's arrival.

            recv_x: a tuple with received tokens for each expert. The first element is a `torch.Tensor` shaped as
                `[num_local_experts, num_max_dispatch_tokens_per_rank * num_ranks, hidden]` with `torch.float8_e4m3fn`.
                The second tensor is the corresponding scales for the first element with shape
                `[num_local_experts, num_max_dispatch_tokens_per_rank * num_ranks, hidden // 128]` with `torch.float`.
                Notice that, the last-two-dimension of the scaling tensors are in column-major for TMA compatibility.
                Moreover, not all tokens are valid, only some of the `num_max_dispatch_tokens_per_rank * num_ranks` are,
                as we do not synchronize CPU received count with GPU (also not incompatible with CUDA graph).
            recv_count: a tensor shaped `[num_local_experts]` with type `torch.int`, indicating how many tokens each
                expert receive. As mentioned before, all not tokens are valid in `recv_x`.
            handle: the communication handle to be used in the `low_latency_combine` function.
            event: the event after executing the kernel (valid only if `async_finish` is set).
            hook: the receiving hook function (valid only if `return_recv_hook` is set).
        packed_recv_x, packed_recv_x_scales, packed_recv_count, packed_recv_src_info, packed_recv_layout_range, event, hook = \
            self.runtime.low_latency_dispatch(x, topk_idx,
                                              num_max_dispatch_tokens_per_rank, num_experts,
                                              async_finish, return_recv_hook)
        handle = (packed_recv_src_info, packed_recv_layout_range, num_max_dispatch_tokens_per_rank, num_experts)
        tensors_to_record = (x, topk_idx,
                             packed_recv_x, packed_recv_x_scales, packed_recv_count,
                             packed_recv_src_info, packed_recv_layout_range)
        return (packed_recv_x, packed_recv_x_scales), packed_recv_count, handle, \
            EventOverlap(event, tensors_to_record if async_finish else None), hook

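Compared with normal-mode dispatch, low_latency_dispatch offers an extra return_recv_hook option. With return_recv_hook=True, low_latency_dispatch only issues the RDMA requests and does not receive the data; the caller must invoke recv_hook to make sure the data has arrived. The benefit of recv_hook is that the SMs do not have to sit waiting for the receive to complete. For example, in the accompanying figure, once the send requests of dispatch 0 have been issued, the computation of attention 1 can start right away, and the receive for dispatch 0 is performed afterwards.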
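
The calling pattern can be mimicked with a toy function that splits its work into a send phase and a receive phase. This is not the DeepEP API: `toy_low_latency_dispatch` is a hypothetical stand-in that uses plain lists, and it only demonstrates how a returned hook defers the receive until the caller asks for it.

```python
from typing import Callable, List, Optional, Tuple


def toy_low_latency_dispatch(tokens: List[int], return_recv_hook: bool = False
                             ) -> Tuple[List[int], Optional[Callable[[], None]]]:
    """Toy two-phase dispatch: 'send' immediately, 'receive' now or via the hook."""
    in_flight = list(tokens)      # send phase: requests are issued
    recv_buffer: List[int] = []   # returned right away, filled in the receive phase

    def recv_hook() -> None:
        recv_buffer.extend(in_flight)  # receive phase: data becomes visible
        in_flight.clear()

    if return_recv_hook:
        return recv_buffer, recv_hook
    recv_hook()  # without the flag, the call itself completes the receive
    return recv_buffer, None


if __name__ == "__main__":
    recv_x, hook = toy_low_latency_dispatch([1, 2, 3], return_recv_hook=True)
    print(recv_x)  # still empty: only the send phase has run
    # ... other work, such as the next attention computation, would go here ...
    hook()
    print(recv_x)  # now holds the data
```

Any computation placed between the call and `hook()` overlaps with the transfer without occupying compute resources for polling.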


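low_latency mode has no notify_dispatch step, i.e. there is no preliminary round of communication to determine how many tokens the GPUs send each other. Instead, a rank may send at most num_max_dispatch_tokens_per_rank tokens, and on the receiving side every expert prepares a buffer that can hold num_max_dispatch_tokens_per_rank * num_ranks tokens, so the memory overhead is high.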
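
A rough size estimate follows from the `recv_x` shapes in the `low_latency_dispatch` docstring (FP8 data plus one float scale per 128 elements). The function below only counts those two output tensors; the actual RDMA buffers may be larger. The helper name and the example values (8 local experts, 128 tokens per rank, 64 ranks, hidden size 7168) are hypothetical.

```python
def ll_dispatch_recv_bytes(num_local_experts: int, num_max_dispatch_tokens_per_rank: int,
                           num_ranks: int, hidden: int) -> int:
    """Bytes taken by the packed FP8 tokens and their scales on one rank."""
    slots = num_local_experts * num_max_dispatch_tokens_per_rank * num_ranks
    fp8_bytes = slots * hidden                 # torch.float8_e4m3fn: 1 byte per element
    scale_bytes = slots * (hidden // 128) * 4  # torch.float: 4 bytes per 128 elements
    return fp8_bytes + scale_bytes


if __name__ == "__main__":
    size = ll_dispatch_recv_bytes(num_local_experts=8, num_max_dispatch_tokens_per_rank=128,
                                  num_ranks=64, hidden=7168)
    print(size, size / 2 ** 20)  # 484442112 bytes, i.e. 462.0 MiB
```

The size grows with the product of the per-rank token limit and the number of ranks, regardless of how many tokens actually arrive.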


std::tuple<torch::Tensor, torch::Tensor, torch::Tensor, torch::Tensor, torch::Tensor, std::optional<EventHandle>, std::optional<std::function<void()>>>
Buffer::low_latency_dispatch(const torch::Tensor& x, const torch::Tensor& topk_idx,
                             int num_max_dispatch_tokens_per_rank, int num_experts,
                             bool async, bool return_recv_hook) {
    // Kernel launch
    auto next_clean_meta = next_buffer.clean_meta();
    auto launcher = [=](int phases) {
        internode_ll::dispatch(packed_recv_x.data_ptr(), packed_recv_x_scales.data_ptr<float>(),
                               packed_recv_src_info.data_ptr<int>(), packed_recv_layout_range.data_ptr<int64_t>(),
                               buffer.dispatch_rdma_recv_data_buffer, buffer.dispatch_rdma_recv_count_buffer,
                               x.data_ptr(), topk_idx.data_ptr<int64_t>(),
                               next_clean_meta.first, next_clean_meta.second,
                               num_tokens, hidden, num_max_dispatch_tokens_per_rank,
                               num_topk, num_experts, rank, num_ranks,
                               workspace, launch_stream, phases);
    };
    // ...
}




  • Launch num_experts / 3 SMs; each SM has 30 warps, and each warp has 32 threads


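  • First, the warps are split into two kinds:
    • On every SM, the first 29 warps convert tokens to FP8 and send them to the receive buffer of the target expert
    • The 30th warp counts the number of tokens sent to each expert
  • Then, one thread per expert sends that expert's token count to rdma_recv_count on the remote node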


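  • With num_experts / 3 SMs of 30 warps each, every SM serves 3 experts, so each expert gets a group of 10 warps
  • One thread per expert watches for incoming IBGDA messages and reads rdma_recv_count
  • The warps of each expert take turns reading tokens and copying them into recv_x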






posted @ CQzhangyu