【RDMA】RoCE Debug Flow for Linux (Debugging RoCE on Linux)
Original article: https://community.mellanox.com/s/article/RoCE-Debug-Flow-for-Linux
*For the recommended RoCE configuration and verification steps, refer to the corresponding Mellanox community article (linked from the original post).
This post provides guidelines on how to debug a RoCE network and how to tune RoCE performance. The flowchart in the original article describes the RoCE troubleshooting process; information on how to run the tests listed in that flowchart can be found in the following sections.
Test #1 - Check RDMA Connectivity using ibv_rc_pingpong
This test verifies that RoCE traffic can be sent between the client and the server sides. This test does not require rdma-cm to be enabled.
To check the RDMA connectivity, follow the steps below.
On the server side
- Find the server’s ibdev(s) using:
- The rdma command, if you are working with the upstream (inbox) driver. The output is a list of the server's InfiniBand devices and their matching netdevs:
# rdma link
1/1: mlx5_0/1: state ACTIVE physical_state LINK_UP netdev enp17s0f0
2/1: mlx5_1/1: state ACTIVE physical_state LINK_UP netdev enp17s0f1
3/1: mlx5_2/1: state ACTIVE physical_state LINK_UP netdev enp134s0f0
4/1: mlx5_3/1: state ACTIVE physical_state LINK_UP netdev enp134s0f1
OR:
- The ibdev2netdev command, if you are working with OFED. The output is a list of the server's InfiniBand devices and their matching netdevs:
# ibdev2netdev
mlx5_0 port 1 ==> enp17s0f0 (Up)
mlx5_1 port 1 ==> enp17s0f1 (Up)
mlx5_2 port 1 ==> enp134s0f0 (Up)
mlx5_3 port 1 ==> enp134s0f1 (Up)
- Find the netdev’s IP address. Select an InfiniBand device from the previous step to be tested, and find the matching netdev’s IP address.
Note: In the examples that follow, the device used is mlx5_1 port 1 (netdev enp17s0f1), obtained from the previous step.
# ip address show dev enp17s0f1
12: enp17s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether ec:0d:9a:ae:11:9d brd ff:ff:ff:ff:ff:ff
inet 12.7.156.240/8 brd 12.255.255.255 scope global enp17s0f1
valid_lft forever preferred_lft forever
inet6 fe80::ee0d:9aff:feae:119d/64 scope link
valid_lft forever preferred_lft forever
- Find the netdev’s GID.
# show_gids mlx5_1
DEV PORT INDEX GID IPv4 VER DEV
--- ---- ----- --- ------------ --- ---
mlx5_1 1 0 fe80:0000:0000:0000:ee0d:9aff:feae:119d v1 enp17s0f1
mlx5_1 1 1 fe80:0000:0000:0000:ee0d:9aff:feae:119d v2 enp17s0f1
mlx5_1 1 2 0000:0000:0000:0000:0000:ffff:0c07:9cf0 12.7.156.240 v1 enp17s0f1
mlx5_1 1 3 0000:0000:0000:0000:0000:ffff:0c07:9cf0 12.7.156.240 v2 enp17s0f1
n_gids_found=4
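If the show_gids script is not available, the same information can be read directly from sysfs. The snippet below is a minimal sketch (device mlx5_1 and port 1 are taken from this example and should be adjusted to your setup); it prints every populated GID entry together with its RoCE version and its backing netdev:
port_dir=/sys/class/infiniband/mlx5_1/ports/1
for gid_file in "$port_dir"/gids/*; do
    idx=$(basename "$gid_file")
    gid=$(cat "$gid_file")
    # Skip unpopulated entries in the GID table
    [ "$gid" = "0000:0000:0000:0000:0000:0000:0000:0000" ] && continue
    ver=$(cat "$port_dir/gid_attrs/types/$idx" 2>/dev/null)    # "IB/RoCE v1" or "RoCE v2"
    ndev=$(cat "$port_dir/gid_attrs/ndevs/$idx" 2>/dev/null)   # netdev backing this GID entry
    echo "index=$idx gid=$gid type=$ver netdev=$ndev"
done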
- Run ibv_rc_pingpong as a server, using the GID index found above (index 3 in this example, the RoCE v2 GID for the IPv4 address), to make sure connectivity is achieved.
# ibv_rc_pingpong -d mlx5_1 -g 3
local address: LID 0x0000, QPN 0x003968, PSN 0x3869d8, GID ::ffff:12.7.156.240
remote address: LID 0x0000, QPN 0x001960, PSN 0x39c9d6, GID ::ffff:12.7.156.239
8192000 bytes in 0.01 seconds = 12475.92 Mbit/sec
1000 iters in 0.01 seconds = 5.25 usec/iter
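Note that the run above uses the ibv_rc_pingpong defaults (4096-byte messages, 1000 iterations). To stress the path harder, the -s (message size) and -n (number of iterations) options described in the man page linked under Extra info below can be used; the values here are only illustrative and should match on both sides:
# ibv_rc_pingpong -d mlx5_1 -g 3 -s 65536 -n 5000                 (server)
# ibv_rc_pingpong -d mlx5_1 -g 3 -s 65536 -n 5000 12.7.156.240    (client)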
On the client side
- Find the client's ibdev(s) using:
- The rdma command, if you are working with the upstream (inbox) driver.
# rdma link
1/1: mlx5_0/1: state ACTIVE physical_state LINK_DOWN netdev enp17s0f0
2/1: mlx5_1/1: state ACTIVE physical_state LINK_UP netdev enp17s0f1
3/1: mlx5_2/1: state ACTIVE physical_state LINK_DOWN netdev enp134s0f0
4/1: mlx5_3/1: state ACTIVE physical_state LINK_DOWN netdev enp134s0f1
OR
- The ibdev2netdev command, if you are working with OFED.
# ibdev2netdev
mlx5_0 port 1 ==> enp17s0f0 (Down)
mlx5_1 port 1 ==> enp17s0f1 (Up)
mlx5_2 port 1 ==> enp134s0f0 (Down)
mlx5_3 port 1 ==> enp134s0f1 (Down)
Note: In the examples that follow, the device used is mlx5_1 port 1 (netdev enp17s0f1).
- Find the client's GIDs using the show_gids command.
# show_gids mlx5_1
DEV PORT INDEX GID IPv4 VER DEV
--- ---- ----- --- ------------ --- ---
mlx5_1 1 0 fe80:0000:0000:0000:ee0d:9aff:feae:11e5 v1 enp17s0f1
mlx5_1 1 1 fe80:0000:0000:0000:ee0d:9aff:feae:11e5 v2 enp17s0f1
mlx5_1 1 2 0000:0000:0000:0000:0000:ffff:0c07:9cef 12.7.156.239 v1 enp17s0f1
mlx5_1 1 3 0000:0000:0000:0000:0000:ffff:0c07:9cef 12.7.156.239 v2 enp17s0f1
n_gids_found=4
- Run ibv_rc_pingpong as a client, specifying the server's IP address and the matching GID index.
[root@l-csi-0124l ~]# ibv_rc_pingpong -d mlx5_1 -g 3 12.7.156.240
local address: LID 0x0000, QPN 0x001960, PSN 0x39c9d6, GID ::ffff:12.7.156.239
remote address: LID 0x0000, QPN 0x003968, PSN 0x3869d8, GID ::ffff:12.7.156.240
8192000 bytes in 0.00 seconds = 14864.14 Mbit/sec
1000 iters in 0.00 seconds = 4.41 usec/iter
Results
Success criteria: the average bandwidth on the client side is larger than 0.
In case the test was completed successfully but you still have no RDMA service, contact Mellanox support with the output of the sysinfo-snapshot tool, which can be downloaded from: https://github.com/Mellanox/linux-sysinfo-snapshot
In case of failure, go to the basic RDMA check (Test #2: Basic RDMA Check below).
Extra info
- More details on the ibv_rc_pingpong command can be found at: https://linux.die.net/man/1/ibv_rc_pingpong
- More details on the show_gids script can be found at: https://community.mellanox.com/s/article/understanding-show-gids-script
- More details on the ibdev2netdev command can be found at: https://community.mellanox.com/s/article/ibdev2netdev
Test #2: Basic RDMA Check
This test verifies some basic preconditions for establishing RDMA traffic.
- Check that RoCE is enabled on both the server and the client sides.
# lspci -D | grep Mellanox
0000:11:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
0000:11:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
0000:86:00.0 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]
0000:86:00.1 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]
# cat /sys/bus/pci/devices/0000\:11\:00.1/roce_enable
1
If RoCE is disabled (roce_enable is set to 0), enable it by writing 1 to the device's roce_enable sysfs entry (use the PCI address found with lspci above):
# echo 1 > /sys/bus/pci/devices/0000\:11\:00.1/roce_enable
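To check all Mellanox PCI functions at once, a small loop such as the sketch below can be used. Note that the roce_enable sysfs attribute is only exposed by adapters and driver versions that support toggling RoCE, so treat this as an illustration rather than a guaranteed interface:
for pcidev in $(lspci -D | awk '/Mellanox/ {print $1}'); do
    attr=/sys/bus/pci/devices/$pcidev/roce_enable
    if [ -f "$attr" ]; then
        # 1 means RoCE is enabled on this function, 0 means it is disabled
        echo "$pcidev roce_enable=$(cat "$attr")"
    else
        echo "$pcidev: roce_enable attribute not exposed"
    fi
done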
- Perform an MTU check.
The MTU shall be guaranteed end-to-end without the need to perform segmentation and reassembly.
2.a. Check the MTU value on the server and the client sides. Verify that the MTU is larger than 1250 bytes:
# ip address show dev enp17s0f1
12: enp17s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether ec:0d:9a:ae:11:9d brd ff:ff:ff:ff:ff:ff
inet 12.7.156.240/8 brd 12.255.255.255 scope global enp17s0f1
valid_lft forever preferred_lft forever
inet6 fe80::ee0d:9aff:feae:119d/64 scope link
valid_lft forever preferred_lft forever
2.b. Perform an end-to-end MTU check by pinging the server with a large payload and fragmentation disallowed (-M do):
# ping -f -c 100 -s 1250 -M do 12.7.156.240
PING 12.7.156.240 (12.7.156.240) 1250(1278) bytes of data.
--- 12.7.156.240 ping statistics ---
100 packets transmitted, 100 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.003/0.003/0.012/0.001 ms, ipg/ewma 0.008/0.003 ms
Success criteria: both tests above passed. If not, correct the MTU size as shown below.
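If the MTU has to be corrected, it can be changed at runtime with the ip command. The sketch below uses the interface from this example and an illustrative MTU of 4200; the value you choose must be supported end to end (host NICs and every switch on the path), and the change does not survive a reboot unless it is also applied in your persistent network configuration:
# ip link set dev enp17s0f1 mtu 4200
# ip link show dev enp17s0f1 | grep -o "mtu [0-9]*"
mtu 4200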
- Check the device info by running ibv_devinfo on the server.
# ibv_devinfo -d mlx5_1 -vvv
hca_id: mlx5_1
transport: InfiniBand (0)
fw_ver: 16.24.1000
…
GID[ 1]: fe80:0000:0000:0000:ee0d:9aff:feae:119d
GID[ 2]: 0000:0000:0000:0000:0000:ffff:0c07:9cf0
GID[ 3]: 0000:0000:0000:0000:0000:ffff:0c07:9cf0
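While reviewing the ibv_devinfo output, it is also worth confirming that the port's link layer is Ethernet, as expected for RoCE, rather than InfiniBand. A quick way to check:
# ibv_devinfo -d mlx5_1 | grep -i link_layer
link_layer: Ethernet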
Success criteria: command succeeded.
Results
If the configuration was updated as a result of this test (for example, an MTU change), the test has completed successfully; in that case, check IP connectivity (Test #3). If the issue still exists, redo the steps in Test #1.
In case of failure (the command returned an error, hung, etc.), contact Mellanox support with the output of the sysinfo-snapshot tool, which can be downloaded from https://github.com/Mellanox/linux-sysinfo-snapshot
Extra info
- More details on the ping command can be found at: https://linux.die.net/man/8/ping
- More details on the show_gids script can be found at: https://community.mellanox.com/s/article/understanding-show-gids-script
- More details on the ibv_devinfo command can be found at: https://linux.die.net/man/1/ibv_devinfo
Test #3: Check IP Connectivity using Ping
This test verifies that IP traffic can be sent between the client and the server sides.
On the server side:
Find the server’s IP address by following the second step in Test #1 above.
On the client side:
Ping the server:
# ping -f -c 100 12.7.156.240
PING 12.7.156.240 (12.7.156.240) 56(84) bytes of data.
--- 12.7.156.240 ping statistics ---
100 packets transmitted, 100 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.002/0.003/0.015/0.002 ms, ipg/ewma 0.007/0.002 ms
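On a host with several interfaces, it can also help to force the ping out of the interface under test so that the ICMP traffic is guaranteed to use the RoCE port rather than another route. The -I option of ping does this (a sketch using the netdev and address from this example):
# ping -I enp17s0f1 -f -c 100 12.7.156.240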
Results
Success criteria: low packet loss; 0% is preferred.
On success, contact Mellanox support with the output of the sysinfo-snapshot tool, which can be downloaded from https://github.com/Mellanox/linux-sysinfo-snapshot
Upon failure, go to: Verify IP and Ethernet Connectivity (Test #4).
Extra info
- More details on the ping command can be found at: https://linux.die.net/man/8/ping
Test #4: Verify IP and Ethernet Connectivity
This test enables you to track down the reason for the lack of IP connectivity. To check for IP and Ethernet connectivity issues, run the following tests.
Test #4.A: IP connectivity problems might be a result of the interface being down. Check the port state and verify that the physical port is up:
# ip address show dev enp17s0f1
12: enp17s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether ec:0d:9a:ae:11:9d brd ff:ff:ff:ff:ff:ff
inet 12.7.156.240/8 brd 12.255.255.255 scope global enp17s0f1
valid_lft forever preferred_lft forever
inet6 fe80::ee0d:9aff:feae:119d/64 scope link
valid_lft forever preferred_lft forever
Test #4.B: Make sure that the number of dropped packets does not increase from one run of the ip command to the next (see the sketch below).
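The per-interface drop counters can be read with ip -s link. A minimal sketch, using the netdev from this example and illustrative temporary file names, is to capture the counters before and after running traffic and compare the values in the dropped columns of the RX and TX sections:
# ip -s link show dev enp17s0f1 > /tmp/stats_before
  (run the traffic test)
# ip -s link show dev enp17s0f1 > /tmp/stats_after
# diff /tmp/stats_before /tmp/stats_after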
Results
Success criteria: the test is done and IP connectivity is restored.
If the issue still exists, contact Mellanox support and provide the output of the sysinfo-snapshot tool, which can be downloaded from https://github.com/Mellanox/linux-sysinfo-snapshot