Notes on the Intel DSA Specification

https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification

Purpose:

Data movement and data transformation.

Intel DSA replaces Intel® QuickData Technology, which is part of Intel® I/O Acceleration Technology (I/OAT: QD, DCA, RSS, low-latency interrupts …). The kernel patch for it was merged in 2007.

https://www.intel.com/content/www/us/en/wireless-network/accel-technology.html (I/OAT)

https://software.intel.com/en-us/articles/fast-memcpy-using-spdk-and-ioat-dma-engine (Fast memcpy with SPDK and Intel® I/OAT DMA Engine)

QD: async DMA

Use cases:

• Datacenter:

As a data movement offload engine to reduce datacenter tax for memory copying, zeroing, etc. to free up CPU cycles from mundane infrastructure work.

Data movement offload.

• Storage:

Storage appliances use data movement (including CRC generation and Data Integrity Field (DIF) generation) within the node and across nodes using Non-Transparent Bridge (NTB).

Storage data is moved within a node, or across nodes through a Non-Transparent Bridge.

• Networking:

Packet processing pipelines use Intel DSA for data copy. An example usage is virtual switch (vSwitch) offload for inter-VM packet switching.

Data copies inside the packet processing pipeline. This is fairly abstract; how exactly is it done?

• Deduplication:

Memory deduplication requires comparing memory pages for equality, which can be done using Intel DSA memory compare operations.

Memory comparison, used to find and remove duplicates.

• VM Migration and Fast Checkpointing:

VM fast checkpointing and VM migration flows require the VMM to identify a VM’s dirty pages and send them efficiently to the destination machine (with minimal network traffic and latency). Intel DSA delta operations generate diffs of pages, enabling the VMM to send only the delta record to the destination, reducing network bandwidth.

VM live migration, similar to deduplication.
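The compare and delta ideas above can be sketched in software. This is only a model of the concept, not the device's record format: the 8-byte compare granularity, the `(offset, data)` entry layout, and the function names `pages_equal`, `create_delta`, and `apply_delta` are all assumptions made for illustration.

```python
PAGE_SIZE = 4096
CHUNK = 8  # compare granularity assumed for this sketch


def pages_equal(a, b):
    """Model of a memory compare: True when both pages hold the same bytes."""
    return a == b


def create_delta(old, new):
    """Return a list of (offset, data) entries for every chunk that differs."""
    assert len(old) == len(new) and len(old) % CHUNK == 0
    delta = []
    for off in range(0, len(old), CHUNK):
        if old[off:off + CHUNK] != new[off:off + CHUNK]:
            delta.append((off, new[off:off + CHUNK]))
    return delta


def apply_delta(old, delta):
    """Rebuild the new page on the destination from the old page plus the delta."""
    page = bytearray(old)
    for off, data in delta:
        page[off:off + len(data)] = data
    return bytes(page)


# A page that was dirtied in two places.
old_page = bytes(PAGE_SIZE)
dirty = bytearray(old_page)
dirty[100] = 0xFF
dirty[2048:2050] = b"\x01\x02"
new_page = bytes(dirty)

delta = create_delta(old_page, new_page)
restored = apply_delta(old_page, delta)
```

Only the two changed chunks travel to the destination instead of the full 4 KB page, which is where the bandwidth saving in the migration case comes from.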

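To understand which mechanisms the use cases above rely on, the Intel® DSA features need to be studied in detail (mainly Chapter 8, which lists the specific descriptors).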

Block diagram:

 

1. Operations are controlled through memory-mapped registers.

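The registers cover capability, configuration, and portals (work submission registers, used to submit descriptors to a queue). They are laid out in 4K pages and described through BAR0 and BAR1 (see Section 9.1.1).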

2. A generic descriptor describes the work to be run.

Descriptors can be processed as a batch (placed in memory) or individually (placed in a queue).

 

 

A descriptor usually contains a completion record address and a valid bit. There are many descriptor types, for example Drain, Memory Move, Compare, Delta, and CRC.

 

 

A batch descriptor contains the address and length of an array of work descriptors located in host memory. This improves the efficiency of many small transfers.

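The device reads the work descriptor array from host memory, and can be configured to execute the descriptors out of order. The batch descriptor and every work descriptor inside it each have their own completion record address and completion interrupt. Batches cannot be nested.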
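The batch relationship described above can be modeled roughly as follows. This is a sketch under stated assumptions: the class names, the `opcode` strings, and the status value 1 are hypothetical, and a Python list stands in for the array address in host memory.

```python
from dataclasses import dataclass, field


@dataclass
class CompletionRecord:
    status: int = 0  # 0 means "not written by the device yet"


@dataclass
class WorkDescriptor:
    opcode: str
    completion: CompletionRecord = field(default_factory=CompletionRecord)


@dataclass
class BatchDescriptor:
    descriptors: list  # stands in for the address of the array in host memory
    count: int         # number of work descriptors in the array
    completion: CompletionRecord = field(default_factory=CompletionRecord)


def make_batch(work):
    """Build a batch descriptor; nesting a batch inside a batch is rejected."""
    for desc in work:
        if isinstance(desc, BatchDescriptor):
            raise ValueError("batches cannot be nested")
    return BatchDescriptor(descriptors=work, count=len(work))


def process_batch(batch):
    """Model of the device: fetch the array, run each entry, write every record."""
    for desc in batch.descriptors[:batch.count]:
        desc.completion.status = 1  # each work descriptor has its own record
    batch.completion.status = 1     # the batch has a separate record


batch = make_batch([WorkDescriptor("memmove"), WorkDescriptor("compare"),
                    WorkDescriptor("crc")])
process_batch(batch)
```

One submission covers three operations, which is the efficiency gain for small transfers: the per-submission cost is paid once per batch rather than once per descriptor.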

3. A work queue is a piece of storage on the device that holds the descriptors submitted to the device.

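The priority and the size of each queue are configurable. A scheduling algorithm handles the priorities and ensures that high-priority queues do not starve low-priority ones.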

Portals are either unlimited (kernel space) or limited.

Queues are either shared or dedicated.

See Chapter 9 for the configuration details.
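These notes do not describe the device's actual arbitration algorithm. As an illustration of how priorities can be honored without starvation, here is a weighted round-robin sketch; the `dispatch` function, the queue names, and the weights are all hypothetical.

```python
from collections import deque


def dispatch(queues):
    """Weighted round robin over work queues.

    queues maps a queue name to (priority, deque of descriptors). In every
    round each non-empty queue may dispatch up to `priority` descriptors, so
    a high-priority queue gets more slots but never all of them.
    """
    order = []
    by_priority = sorted(queues.items(), key=lambda kv: -kv[1][0])
    while any(q for _, q in queues.values()):
        for _name, (priority, q) in by_priority:
            for _ in range(priority):
                if not q:
                    break
                order.append(q.popleft())
    return order


queues = {
    "high": (3, deque(["h1", "h2", "h3", "h4"])),
    "low": (1, deque(["l1", "l2"])),
}
order = dispatch(queues)
```

The low-priority descriptor `l1` is dispatched before the high-priority queue drains, which is the no-starvation property the notes mention.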

4. The engine is the actual execution unit. It contains a work descriptor processing unit.

The work descriptor processing unit uses the Address Translation Cache and IOMMU for completion record, source, and destination address translations; reads source data; performs the specified operation; and writes the destination data back to memory. When the operation is complete, the engine writes the completion record to the pre-translated completion address and generates an interrupt, if requested by the work descriptor.

5. Engines and queues can be grouped, presumably as an N:M mapping.

6. Descriptor Completion

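This describes how a descriptor completed; completion can be signaled with an interrupt. Two kinds of interrupt storage are supported: the MSI-X table and the device-specific Interrupt Message Storage (IMS) table; see the SIOV specification.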

It can be thought of as a synchronization mechanism.

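The completion record reports how far the operation got. If an error occurs, software can fix the error, resubmit the remaining work with a new descriptor, or finish the remaining work in software.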
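The recovery flow above can be sketched as a retry loop. This is a software model under assumptions: the status values, the `bytes_completed` field name, and the page-fault trigger are illustrative and are not the spec's encodings.

```python
from dataclasses import dataclass

PAGE = 4096
STATUS_SUCCESS = 1     # illustrative values, not the spec's encodings
STATUS_PAGE_FAULT = 2


@dataclass
class Completion:
    status: int = 0
    bytes_completed: int = 0


def device_copy(src, dst, offset, length, mapped_pages):
    """Model of a copy that stops at the first unmapped page and reports progress."""
    done = 0
    while done < length:
        pos = offset + done
        if pos // PAGE not in mapped_pages:
            return Completion(STATUS_PAGE_FAULT, done)
        n = min(length - done, PAGE - pos % PAGE)
        dst[pos:pos + n] = src[pos:pos + n]
        done += n
    return Completion(STATUS_SUCCESS, done)


def copy_with_retry(src, dst, mapped_pages):
    """Resubmit the remaining work after fixing each reported fault."""
    offset, remaining, submissions = 0, len(src), 0
    while remaining:
        rec = device_copy(src, dst, offset, remaining, mapped_pages)
        submissions += 1
        offset += rec.bytes_completed
        remaining -= rec.bytes_completed
        if rec.status == STATUS_PAGE_FAULT:
            mapped_pages.add(offset // PAGE)  # software fixes the fault
    return submissions


src = bytes(range(256)) * 48   # three pages of data
dst = bytearray(len(src))
mapped = {0, 2}                # page 1 is not mapped at first
submissions = copy_with_retry(src, dst, mapped)
```

The second submission starts at the offset reported by the first completion record, so the work already done is not repeated.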

7. Descriptor ordering and fencing

Wait for the completion record or the interrupt.

Use a Drain descriptor or the Drain command, then submit the next descriptor.

Use the Fence flag inside a batch.
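The effect of the Fence flag inside a batch can be illustrated with a small model. It assumes that descriptors between fences may complete in any order and that a fenced descriptor waits for everything before it in the batch; the `execute_batch` function and the shuffle that stands in for out-of-order execution are hypothetical.

```python
import random


def execute_batch(descriptors, rng):
    """Return one possible completion order for a batch.

    descriptors is a list of (name, fence) pairs. A descriptor with
    fence=True starts a new segment, so every earlier descriptor in the
    batch completes before it does.
    """
    order, segment = [], []
    for name, fence in descriptors:
        if fence and segment:
            rng.shuffle(segment)   # earlier work may finish in any order
            order.extend(segment)
            segment = []
        segment.append(name)
    rng.shuffle(segment)
    order.extend(segment)
    return order


batch = [("a", False), ("b", False), ("c", True), ("d", False)]
order = execute_batch(batch, random.Random(0))
```

Whatever the shuffle does, `a` and `b` always complete before `c`, which is the guarantee a producer needs when `c` consumes their output.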

Questions

The relationship between portals and descriptors.

A descriptor is submitted to a queue through a special register called a portal.

Readback 

Presumably this means the host reading from the device.

 

Shared Work Queue (SWQ)
Dedicated Work Queue (DWQ)

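If a user-mode client uses a limited portal, it can ask the kernel-mode driver to submit descriptors on its behalf through an unlimited portal. This helps avoid denial of service and provides a forward progress guarantee.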

Descriptors are submitted to work queues through special registers called portals.

Concepts:
 completion record:

When the operation is complete, the engine writes the completion record to the pre-translated completion address and generates an interrupt, if requested by the work descriptor. 

The completion attributes specify the address to write the completion record and optionally the information needed to generate a completion interrupt. The completion record contains the descriptor's completion status or error information.

The first byte of the completion record is the status byte. Status values written by the device are all nonzero. Software should initialize the status field of the completion record to 0 before submitting the descriptor to be able to tell when the device has written to the completion record. (Initializing the completion record also ensures that it is mapped, so the device is less likely to encounter a page fault when accessing it.)

The Request Completion Record flag indicates to the device that it should write the completion record even if the operation completed successfully. If this flag is not set, the device writes the completion record only if there is an error.
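The status-byte protocol and the Request Completion Record flag can be sketched as a polling loop. This is a software model: the status values, the flag bit, the 32-byte record size, and the timer that plays the device are assumptions made for illustration.

```python
import threading
import time

STATUS_SUCCESS = 1                    # illustrative nonzero status values
STATUS_ERROR = 2
FLAG_REQUEST_COMPLETION_RECORD = 0x1  # hypothetical bit position


def device_complete(record, flags, ok):
    """Model of the device writing the status byte at the end of an operation."""
    if not ok:
        record[0] = STATUS_ERROR      # errors are always reported
    elif flags & FLAG_REQUEST_COMPLETION_RECORD:
        record[0] = STATUS_SUCCESS    # success is reported only on request


def wait_for_completion(record, timeout=1.0):
    """Poll the first byte until the device writes a nonzero status."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if record[0] != 0:
            return record[0]
        time.sleep(0.001)
    return None


# Software zeroes the record before submitting the descriptor.
record = bytearray(32)
timer = threading.Timer(0.01, device_complete,
                        args=(record, FLAG_REQUEST_COMPLETION_RECORD, True))
timer.start()
status = wait_for_completion(record)
timer.join()

# Without the flag, a successful operation leaves the record untouched.
silent = bytearray(32)
device_complete(silent, 0, True)

# An error is written even when the flag is clear.
failed = bytearray(32)
device_complete(failed, 0, False)
```

Because every status the device writes is nonzero, a zeroed record is enough to tell "not finished" from "finished", with no separate valid bit on the completion side.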

 

 REF:   

Intel® Scalable I/O Virtualization Technical Specification  

All Intel® 64 and IA-32 Architectures Software Developer Manuals   

Linux memory initialization

Kernel documentation: Complete virtual memory map with 4-level page tables

PCIe bus addressing

Wikipedia: Input-output memory management unit

kevin  Intel® Scalable I/O Virtualization     

Intel® I/O Acceleration Technology  

Fast memcpy with SPDK and Intel® I/OAT DMA Engine   

linux-dsa/kernel   

PCI device address space

idxd driver for Intel Data Streaming Accelerator      

INTRODUCING THE INTEL® DATA STREAMING ACCELERATOR (INTEL® DSA)    

 

 

 

 

posted @ 2020-05-03 10:14  lvmxh