...
are printed when a tx doesn't get a completion within a certain time. This is not normal in a healthy IB network and may indicate a "QP stuck" issue in the MLNX driver. EX-5704 gives a summary of affected versions.
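As a quick cross-check on a node, the timeout the o2ib LND is actually using can be read from its module parameters (a sketch; assumes the ko2iblnd module is loaded and that the parameter names match this Lustre version):

# Completion timeout (seconds) the o2ib LND waits before declaring a
# tx timed out (path assumes ko2iblnd is loaded)
cat /sys/module/ko2iblnd/parameters/timeout

# Per-NI tunables and state as LNet sees them
lnetctl net show -v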
...
The RDMA timeouts are, of course, on the o2ib (RoCEv2) fabric - expected given the lack of congestion control configured for RoCE.
Anyway, I found that:
ECN was set to tcp_ecn = 2 - so only one-way congestion control was enabled (the switch ports are set to auto, which defaults to tcp_ecn = 2).
PFC is disabled across the cluster.
For RoCEv2 we need tcp_ecn = 1 and PFC enabled to reduce the RDMA timeouts and any other timeouts, slow responses, etc. from the network side of things.
So to completely configure this we need a maintenance window to reboot the switches once the changes are made (particularly for the PFC parameter).
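A minimal sketch of the host-side checks/changes, assuming a MLNX OFED install where mlnx_qos is available; the interface name eth0 and priority 3 are placeholders that must match the switch-side config:

# 2 = ECN only when the peer requests it (one-way); 1 = request + respond
sysctl net.ipv4.tcp_ecn

# Enable two-way ECN (persist via /etc/sysctl.d/ on a real system)
sysctl -w net.ipv4.tcp_ecn=1

# Show the current PFC/trust settings on the HCA port
mlnx_qos -i eth0

# Enable PFC on priority 3 only (a common lossless class for RoCE;
# the priority chosen here must match what the switches enforce)
mlnx_qos -i eth0 --pfc 0,0,0,1,0,0,0,0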
NOTE: we went through the same pain at Monash Uni with RoCEv1, then RoCEv2, and the ECN/PFC congestion control. At this point in time the RDMA errors may occur once every 4 weeks and we get repeated slow RPC response warnings, which are not impacting the cluster (or, more importantly, the jobs). We did have NVIDIA/Mellanox engineers go through the config, strongly suggesting ECN and PFC be enabled, and also making specific switch port changes to the Cumulus-based switches (of course at Samsung this is all HPE switches).
I cannot remember the details of those switch port changes (aside from the ECN/PFC updates) as that was between the Monash Uni network admins and NVIDIA. One other point: after the changes were completed, all the previous network problems we observed literally went away (the Monash Uni cluster is not large: 3 x Lustre filesystems, 5 x ES400s, 1 x nv200 and 1 x 7990, ~2000 client nodes)...
Something else we can do is monitor the hardware counters and the counters on the HCA side of things - this was how the NVIDIA engineers determined packets were getting dropped on a large scale.
For example:
[root@rclosdc1rr40-01 ~]# grep . /sys/class/infiniband/mlx5_1/ports/1/hw_counters/*
[root@rclosdc1rr40-01 ~]# grep . /sys/class/infiniband/mlx5_1/ports/1/counters/*
The counters are more interesting for the Lustre side of things as they contain the packet errors and discards; for physical errors the hw_counters are useful - but in this case there were no issues with the infrastructure...
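A minimal sketch of catching counters that are actively moving between two snapshots (a sketch only; which files exist under hw_counters varies with firmware/driver version, and the device/port should be adjusted per node):

#!/bin/bash
# Snapshot every counter for a port, wait, snapshot again, and print
# only the counters that changed - enough to spot errors/discards
# incrementing at a large rate.
DIR=/sys/class/infiniband/mlx5_1/ports/1
declare -A before
for f in "$DIR"/counters/* "$DIR"/hw_counters/*; do
    before[$f]=$(cat "$f")
done
sleep 30
for f in "${!before[@]}"; do
    now=$(cat "$f")
    [ "$now" != "${before[$f]}" ] && echo "$f: ${before[$f]} -> $now"
done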
...