[RDMA] Configuring PFC on RoCE v1 (a how-to, not a discussion of principles)
"Running RoCE over L2 Network Enabled with PFC": https://www.mellanox.com/related-docs/prod_software/RoCE_with_Priority_Flow_Control_Application_Guide.pdf
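Environment:
Two hosts, each with one dual-port 40 Gbps ConnectX-3 NIC (driver version 4.1-1.0.2.0, OS Ubuntu 16.04).
One 32-port Mellanox Spectrum SN2700 switch running Onyx 3.6.8102.
PFC background:
PFC: https://blog.csdn.net/bandaoyu/article/details/115346857
Quoting Juniper's introduction to PFC: "Priority-based flow control (PFC), IEEE standard 802.1Qbb, is a link-level flow control mechanism. It is similar to the IEEE 802.3x pause mechanism, but it pauses the traffic of one priority on the link (each priority is a virtual channel, and a single virtual channel is paused) rather than the whole link. PFC lets you selectively pause traffic according to its class."
So compared with IEEE 802.3x, which pauses the whole link, PFC works at a finer granularity: it pauses a single virtual channel. Configuring PFC can therefore be understood as mapping application traffic to one particular priority. Depending on where the traffic is marked, there are two modes: Trust L2 (the PCP field of the VLAN tag) and Trust L3 (the DSCP field of the IP header). Because ConnectX-3 supports only RoCE v1, this article covers only Trust L2.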
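On the end host, the mapping is:
ToS -> skb_priority -> Vlan-qos (also called User Priority, or UP; its value is the PCP field of the VLAN tag) -> tc.
On the switch, the mapping is:
PCP + DEI -> switch-priority -> ingress Port Group (PG). The PG holds the PFC threshold configuration.
This article uses tc 4 and switch-priority 4 as the example.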
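To make the PCP + DEI part of the chain concrete, here is a minimal sketch of how the three fields sit inside the 16-bit 802.1Q Tag Control Information (TCI): PCP in the top 3 bits, DEI in the next bit, and the VLAN ID in the low 12 bits. The function names vlan_tci and tci_pcp are made up for this illustration; the values used are the ones from this article (PCP 4, DEI 0, VLAN 10).

```shell
#!/bin/sh
# Pack PCP (3 bits), DEI (1 bit) and VLAN ID (12 bits) into a 16-bit TCI.
# vlan_tci and tci_pcp are illustrative helper names, not part of any tool.
vlan_tci() {
    pcp=$1; dei=$2; vid=$3
    printf '0x%04x\n' $(( (pcp << 13) | (dei << 12) | vid ))
}

# Recover the PCP (the UP used throughout this article) from a TCI value.
tci_pcp() {
    echo $(( ($1 >> 13) & 7 ))
}

vlan_tci 4 0 10    # prints 0x800a
tci_pcp 0x800a     # prints 4
```

A frame tagged this way is what the switch classifies with its PCP + DEI to switch-priority map.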
Configuration:
First, configure the switch:
0. Enter configuration mode:
switch-6bd534 [standalone: master] > enable
switch-6bd534 [standalone: master] # configure terminal
1. Create the VLAN and set the switch ports to hybrid mode:
switch-6bd534 [standalone: master] (config) # vlan 10
switch-6bd534 [standalone: master] (config vlan 10) # exit
switch-6bd534 [standalone: master] (config) # interface ethernet 1/1-1/32 switchport mode hybrid
switch-6bd534 [standalone: master] (config) # interface ethernet 1/1-1/32 switchport hybrid allowed-vlan add 10
2. Disable global pause (802.3x) flow control on all ports:
switch-6bd534 [standalone: master] (config) # interface ethernet 1/1-1/32 flowcontrol send off force
switch-6bd534 [standalone: master] (config) # interface ethernet 1/1-1/32 flowcontrol receive off force
3. Enable priority 4 and turn on PFC on all ports:
switch-6bd534 [standalone: master] (config) # dcb priority-flow-control enable
This action might cause traffic loss while shutting down a port with priority-flow-control mode on
Type 'yes' to confirm enable pfc globally: yes
switch-6bd534 [standalone: master] (config) # dcb priority-flow-control priority 4 enable
switch-6bd534 [standalone: master] (config) # interface ethernet 1/1-1/32 dcb priority-flow-control mode on force
Note: to disable PFC:
switch-6bd534 [standalone: master] (config) # no dcb priority-flow-control enable
This action might cause traffic loss while shutting down a port with priority-flow-control mode on
Type 'yes' to confirm disable pfc globally: yes
switch-6bd534 [standalone: master] (config) # interface ethernet 1/1-1/32 no dcb priority-flow-control mode force
4. Change the port buffer configuration and map switch-priority to the PG buffer:
switch-6bd534 [standalone: master] (config) # interface ethernet 1/1-1/32 ingress-buffer iPort.pg0 map pool iPool0 type lossy reserved 20K shared alpha 8
switch-6bd534 [standalone: master] (config) # interface ethernet 1/1-1/32 ingress-buffer iPort.pg4 map pool iPool0 type lossless reserved 70K xoff 17K xon 17K shared alpha 2
switch-6bd534 [standalone: master] (config) # interface ethernet 1/1-1/32 egress-buffer ePort.tc4 map pool ePool0 reserved 1500 shared alpha inf
switch-6bd534 [standalone: master] (config) # interface ethernet 1/1-1/32 ingress-buffer iPort.pg4 bind switch-priority 4
5. Map PCP + DEI to switch-priority:
switch-6bd534 [standalone: master] (config) # qos trust L2
switch-6bd534 [standalone: master] (config) # interface ethernet 1/1-1/32 qos map pcp 4 dei 0 to switch-priority 4
The switch side is now configured.
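If a priority other than 4 is wanted, the PFC-related commands from steps 3 and 5 can be generated rather than retyped. The sketch below only prints the command lines used in this article with the port range and priority substituted; onyx_pfc_commands is a made-up helper name, and the buffer commands from step 4 are deliberately left out because their thresholds need tuning per setup. Whether the same syntax holds on other Onyx versions is an assumption to check against the switch documentation.

```shell
#!/bin/sh
# Print the PFC and trust commands from steps 3 and 5 for a given port range
# and priority. Nothing is sent to the switch; the output is meant to be
# reviewed and pasted into the (config) prompt.
# onyx_pfc_commands is an illustrative name, not an Onyx or Mellanox tool.
onyx_pfc_commands() {
    ports=$1; prio=$2
    echo "dcb priority-flow-control enable"
    echo "dcb priority-flow-control priority $prio enable"
    echo "interface ethernet $ports dcb priority-flow-control mode on force"
    echo "qos trust L2"
    echo "interface ethernet $ports qos map pcp $prio dei 0 to switch-priority $prio"
}

onyx_pfc_commands 1/1-1/32 4
```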
Next, configure the end hosts:
1. Set the pfctx and pfcrx parameters:
# vim /etc/modprobe.d/mlx4.conf
Add:
options mlx4_en pfctx=0x10 pfcrx=0x10
Note: pfctx and pfcrx are each an 8-bit bitmap in which bit n enables PFC for priority n. Enabling only priority 4 gives 0x10 (00010000b); enabling only priority 3 gives 0x08 (00001000b).
Note: the mlx4_en parameters are described in section 1.3.1.3, "mlx4_en Parameters", of this PDF: https://www.mellanox.com/related-docs/prod_software/Mellanox_OFED_Linux_User_Manual_v4_1-1_0_2_0.pdf
Then restart the NIC driver:
# /etc/init.d/openibd restart
Verify:
# RX=`cat /sys/module/mlx4_en/parameters/pfcrx`;printf "0x%x\n" $RX
The output should be 0x10.
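Because the bitmap is easy to get wrong (decimal 16 and hexadecimal 0x16 are different values), a small sketch for computing and decoding it may help. pfc_mask builds the mask from a list of priorities and pfc_decode lists the priorities set in a value; both names are made up for this illustration. The sysfs path is the one used in the verification step above and is only read if it exists.

```shell
#!/bin/sh
# Build the pfctx/pfcrx bitmap from a list of priorities (bit n = priority n).
# pfc_mask and pfc_decode are illustrative helper names.
pfc_mask() {
    mask=0
    for prio in "$@"; do
        mask=$(( mask | (1 << prio) ))
    done
    printf '0x%02x\n' "$mask"
}

# List the priorities whose bit is set in a bitmap (decimal or 0x-prefixed).
pfc_decode() {
    value=$(( $1 ))
    out=""
    prio=0
    while [ "$prio" -le 7 ]; do
        if [ $(( (value >> prio) & 1 )) -eq 1 ]; then
            out="${out:+$out }$prio"
        fi
        prio=$(( prio + 1 ))
    done
    echo "$out"
}

pfc_mask 4         # prints 0x10
pfc_mask 3 4       # prints 0x18
pfc_decode 0x16    # prints 1 2 4, i.e. not "priority 4 only"

# On a host with the mlx4_en module loaded, decode the live value.
param=/sys/module/mlx4_en/parameters/pfcrx
if [ -r "$param" ]; then
    pfc_decode "$(cat "$param")"
fi
```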
2. Create the VLAN interface and assign an IP address:
# modprobe 8021q
# vconfig add eth2 10
Added VLAN with VID == 10 to IF -:eth2:-
# ifconfig eth2.10 10.10.10.5/24 up
3. For TCP/IP traffic, map skb_priority to UP, sending every skb_priority to UP 4:
# for i in {0..7}; do vconfig set_egress_map eth2.10 $i 4 ; done
Set egress mapping on device -:eth2.10:- Should be visible in /proc/net/vlan/eth2.10
Set egress mapping on device -:eth2.10:- Should be visible in /proc/net/vlan/eth2.10
Set egress mapping on device -:eth2.10:- Should be visible in /proc/net/vlan/eth2.10
Set egress mapping on device -:eth2.10:- Should be visible in /proc/net/vlan/eth2.10
Set egress mapping on device -:eth2.10:- Should be visible in /proc/net/vlan/eth2.10
Set egress mapping on device -:eth2.10:- Should be visible in /proc/net/vlan/eth2.10
Set egress mapping on device -:eth2.10:- Should be visible in /proc/net/vlan/eth2.10
Set egress mapping on device -:eth2.10:- Should be visible in /proc/net/vlan/eth2.10
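On systems where vconfig is not installed, the same mapping can be expressed with iproute2, whose vlan link type accepts an egress-qos-map list of FROM:TO pairs. The sketch below builds that list and prints the resulting command instead of running it; egress_qos_map is a made-up helper name, and eth2, VLAN 10 and UP 4 are the values from this article.

```shell
#!/bin/sh
# Build an "skb_priority:UP" list that maps skb_priority 0..7 to one UP.
# egress_qos_map is an illustrative helper name.
egress_qos_map() {
    up=$1
    map=""
    skprio=0
    while [ "$skprio" -le 7 ]; do
        map="${map:+$map }$skprio:$up"
        skprio=$(( skprio + 1 ))
    done
    echo "$map"
}

# Print (do not execute) the iproute2 command that creates eth2.10 with
# every skb_priority mapped to UP 4.
echo "ip link add link eth2 name eth2.10 type vlan id 10 egress-qos-map $(egress_qos_map 4)"
```

Either way, the resulting mapping should be visible in /proc/net/vlan/eth2.10, as the vconfig output above points out.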
4. For traffic that bypasses the kernel, i.e. RDMA traffic, map skb_priority to UP, again sending every skb_priority to UP 4:
# tc_wrap.py -i eth2 -u 4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4
skprio2up is available only for RoCE in kernels that don't support set_egress_map
Traffic classes are set to 8
UP 0
UP 1
UP 2
UP 3
UP 4
skprio: 0
skprio: 1
skprio: 2 (tos: 8)
skprio: 3
skprio: 4 (tos: 24)
skprio: 5
skprio: 6 (tos: 16)
skprio: 7
skprio: 8
skprio: 9
skprio: 10
skprio: 11
skprio: 12
skprio: 13
skprio: 14
skprio: 15
skprio: 0 (vlan 10)
skprio: 1 (vlan 10)
skprio: 2 (vlan 10 tos: 8)
skprio: 3 (vlan 10)
skprio: 4 (vlan 10 tos: 24)
skprio: 5 (vlan 10)
skprio: 6 (vlan 10 tos: 16)
skprio: 7 (vlan 10)
UP 5
UP 6
UP 7
5. Map UP to TC: UP 4 goes to TC 4, every other UP goes to the TC with the same number, and PFC is enabled on priority 4:
# mlnx_qos -i eth2 -p 0,1,2,3,4,5,6,7 -f 0,0,0,0,1,0,0,0
Priority trust mode is not supported on your system
Priority trust mode: none
PFC configuration:
priority 0 1 2 3 4 5 6 7
enabled 0 0 0 0 1 0 0 0
tc: 0 ratelimit: unlimited, tsa: vendor
priority: 0
tc: 1 ratelimit: unlimited, tsa: vendor
priority: 1
tc: 2 ratelimit: unlimited, tsa: vendor
priority: 2
tc: 3 ratelimit: unlimited, tsa: vendor
priority: 3
tc: 4 ratelimit: unlimited, tsa: vendor
priority: 4
tc: 5 ratelimit: unlimited, tsa: vendor
priority: 5
tc: 6 ratelimit: unlimited, tsa: vendor
priority: 6
tc: 7 ratelimit: unlimited, tsa: vendor
priority: 7
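The -f argument of mlnx_qos is a comma-separated list of eight 0/1 flags, one per priority, so it carries the same information as the pfctx/pfcrx bitmap in a different form. A minimal sketch for generating it from a list of priorities follows; pfc_flags is a made-up helper name, and the command is only printed, not run.

```shell
#!/bin/sh
# Turn a list of priorities into the eight-flag list used by "mlnx_qos -f".
# pfc_flags is an illustrative helper name.
pfc_flags() {
    flags=""
    prio=0
    while [ "$prio" -le 7 ]; do
        bit=0
        for want in "$@"; do
            if [ "$want" -eq "$prio" ]; then
                bit=1
            fi
        done
        flags="${flags:+$flags,}$bit"
        prio=$(( prio + 1 ))
    done
    echo "$flags"
}

# Print the command from step 5 with the flag list generated for priority 4.
echo "mlnx_qos -i eth2 -p 0,1,2,3,4,5,6,7 -f $(pfc_flags 4)"
```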
All the configuration is now done.
Finally, save the switch configuration so that it survives a reboot:
switch-6bd534 [standalone: master] (config) # write memory
Verification
Test with ib_write_bw (using rdma_cm to establish the connection), with one host as the sender and the other as the receiver.
receiver:
$ ib_write_bw -d mlx4_0 -i 2 -x 2 -S 4 --report_gbits -D 10
sender:
$ ib_write_bw 10.10.10.6 -d mlx4_0 -i 2 -x 2 -S 4 --report_gbits -D 10
Then check on the switch whether PG 4 received the traffic:
switch-6bd534 [standalone: master] (config) # show interfaces ethernet 1/5 counters pg 4
PG 4:
44321827 packets
48853700404 bytes
0 queue depth
0 no buffer discard
0 shared buffer discard
Or check the PFC counters (note that PFC will not necessarily be triggered):
switch-6bd534 [standalone: master] (config) # show interfaces ethernet 1/5 counters pfc prio 4
PFC 4:
Rx:
0 pause packets
0 pause duration
Tx:
18 pause packets
4 pause duration
On the end host, check the priority 4 counters:
$ ethtool -S eth2 | grep prio_4
rx_pause_prio_4: 88
rx_pause_duration_prio_4: 0
rx_pause_transition_prio_4: 0
tx_pause_prio_4: 0
tx_pause_duration_prio_4: 11
tx_pause_transition_prio_4: 44
rx_prio_4_packets: 9155756
rx_prio_4_bytes: 752828084
tx_prio_4_packets: 862787989
tx_prio_4_bytes: 950840867498
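To watch a single counter over time rather than reading the whole list, the "name: value" lines can be filtered with a small helper. This is a sketch that assumes the output format shown above; counter_value is a made-up name, and the sample text is copied from the output in this article so the snippet runs without the NIC.

```shell
#!/bin/sh
# Read "name: value" lines (as printed by ethtool -S) on stdin and print the
# value of one counter. counter_value is an illustrative helper name.
counter_value() {
    awk -v name="$1" '{ key = $1; sub(/:$/, "", key); if (key == name) print $2 }'
}

# Sample lines taken from the output above.
sample='     rx_pause_prio_4: 88
     tx_pause_prio_4: 0
     tx_prio_4_packets: 862787989'

printf '%s\n' "$sample" | counter_value rx_pause_prio_4    # prints 88

# On the host itself this would be:
#   ethtool -S eth2 | counter_value rx_pause_prio_4
```

Sampling the same counter before and after a test run shows whether pause frames arrived during that run.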
References:
HowTo Run RoCE over L2 Enabled with PFC
How to Enable PFC on Mellanox Switches (Spectrum)
HowTo Configure PFC on ConnectX-4
Mellanox support
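Original article: https://blog.csdn.net/u013431916/article/details/82385641
1.3.1.3 mlx4_en Parameters
inline_thold: Threshold for using inline data (int). Packets smaller than the threshold are copied directly into the hardware buffer, which saves a PCI read transaction. The default and maximum value is 104 bytes. (Range: 17-104)
udp_rss: Enable RSS for incoming UDP traffic (uint). Enabled by default. When disabled, no RSS is performed for incoming UDP traffic.
pfctx: Priority-based flow control policy on TX[7:0]. Per-priority bit mask (uint), bits 0-7.
pfcrx: Priority-based flow control policy on RX[7:0]. Per-priority bit mask (uint), bits 0-7.
udev_dev_port_dev_id: Work with dev_id or dev_port when supported by the kernel. Range: 0 <= udev_dev_port_dev_id <= 2 (default = 0).
•0: Work with dev_port if the kernel supports it; otherwise work with dev_id.
•1: Work only with dev_id, regardless of dev_port support.
•2: Work with both dev_id and dev_port (if the kernel supports dev_port). (int)
Intel's notes on PFC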
https://downloadmirror.intel.com/30368/eng/README_irdma_1.4.22.txt
--- Directing RDMA traffic to a traffic class
When using PFC, traffic may be directed to one or more traffic classes (TCs).
Because RDMA traffic bypasses the kernel, Linux traffic control methods like
tc, cgroups, or egress-qos-map can't be used. Instead, set the Type of Service
(ToS) field on your application command line. ToS-to-priority mappings are
hardcoded in Linux as follows:
ToS   Priority
---   --------
0     0
8     2
24    4
16    6
Priorities are then mapped to traffic classes using ETS using lldptool or switch
utilities.
Examples of setting ToS 16 in an application:
ucmatose -t 16
ib_write_bw -t 16
Alternatively, for RoCEv2, ToS may be set for all RoCEv2 traffic using
configfs. For example, to set ToS 16 on device rdma<interface>, port 1:
mkdir /sys/kernel/config/rdma_cm/rdma<interface>
echo 16 > /sys/kernel/config/rdma_cm/rdma<interface>/ports/1/default_roce_tos
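The ToS table in the Intel README matches the tos annotations in the tc_wrap.py output earlier in this article (tos 8 at skprio 2, tos 24 at skprio 4, tos 16 at skprio 6). As a quick lookup, the four listed pairs can be written as a small function; tos_to_prio is a made-up name, and it covers only the values listed in the README, not the kernel's full mapping.

```shell
#!/bin/sh
# Look up the priority for the four ToS values listed in the README table.
# tos_to_prio is an illustrative helper name; any other ToS value is reported
# as unlisted rather than guessed.
tos_to_prio() {
    case "$1" in
        0)  echo 0 ;;
        8)  echo 2 ;;
        24) echo 4 ;;
        16) echo 6 ;;
        *)  echo "unlisted ToS: $1" >&2; return 1 ;;
    esac
}

tos_to_prio 24    # prints 4
```

With the setup in this article, which uses priority 4, the matching application flag would therefore be -t 24, for example ib_write_bw -t 24.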