...

it appears that these messages can sometimes be printed as warnings when the driver cannot allocate a contiguous buffer. The driver then falls back to a fragmented buffer and continues to work, so this is unlikely to be the reason (directly, anyway) for the crash on the client. This is currently being investigated by the MLNX developers. Updates to follow.

RDMA Timeouts

"QP Stuck" issue

Messages similar to the following:

Code Block
LNetError: 17224:0:(o2iblnd_cb.c:3320:kiblnd_check_txs_locked()) Timed out tx: active_txs, 33 seconds
LNetError: 17224:0:(o2iblnd_cb.c:3395:kiblnd_check_conns()) Timed out RDMA with 172.16.16.12@o2ib (35): c: 29, oc: 0, rc: 31

are printed when a tx does not get a completion within a certain time. This is not normal in a healthy IB network and may indicate a "QP stuck" issue in the MLNX driver. EX-5704 gives a summary of the affected versions.
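
The completion deadline enforced by kiblnd_check_txs_locked() comes from the o2iblnd timeout. As a quick check on a node reporting these errors (a sketch only, assuming the module is loaded as ko2iblnd and exposes its timeout as a module parameter on this build):

Code Block
# o2iblnd tx timeout, in seconds (module/parameter name assumed; verify on your build)
cat /sys/module/ko2iblnd/parameters/timeout

# LNet view of the o2ib NI and its peers
lnetctl net show --net o2ib --verbose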

Other examples

"QP stuck" is not the only cause of RDMA timeouts of this kind. DDN-4105 describes a case when there are lots of these timeouts happening in the cluster, but the situation was remedied by lustre-level settings: "at_min=15" and "max_pages_per_rpc=1M"

Here are some observations from Ian Costello posted in that ticket:


Code Block
The RDMA timeouts are of course on the o2ib (RoCEv2) fabric - expected given the lack of congestion control for RoCE.

Anyways found that:
ECN was set to tcp_ecn = 2 - so only one-way congestion control is enabled (the switch ports are set to auto, which defaults to tcp_ecn=2).
PFC is disabled across the cluster

For RoCE v2 we need tcp_ecn = 1 and PFC enabled to reduce the RDMA timeouts and any other timeouts, slow responses etc from the network side of things.

So to completely configure this we need a maintenance window to reboot the switches once the changes are made (particularly the PFC param).

NOTE: we went through the same pain at Monash Uni with RoCEv1, then RoCEv2, and the ECN/PFC congestion control. At this point in time the RDMA errors may occur once every 4 weeks and we get repeated slow rpc response warnings - which is not impacting the cluster (more importantly the jobs). We did have NVIDIA/Mellanox engineers go through the config, strongly suggesting ECN and PFC be enabled, and also making specific switch port changes to the Cumulus based switches (of course at Samsung this is all HPe switches). I cannot remember the details of those switch port changes (aside from the ECN/PFC updates) as that was between the Monash Uni network admins and NVIDIA. One other point: after the changes were completed, all the previous network problems we observed literally went away (the Monash Uni cluster is not large, 3 x lustre filesystems, 5 x ES400's, 1 x nv200 and 1 x 7990, ~2000 client nodes)...

Something else we can do is monitor the hardware counters and the counters on the HCA side of things - this was how the NVIDIA engineers determined packets were getting dropped on a large scale.

such as:

[root@rclosdc1rr40-01 ~]# grep -E "*" /sys/class/infiniband/mlx5_1/ports/1/hw_counters/*
[root@rclosdc1rr40-01 ~]# grep -E "*" /sys/class/infiniband/mlx5_1/ports/1/counters/*

The counters are more interesting for the lustre side of things as they contain the packet errors and discards; for physical errors the hw_counters are useful - but in this case there were no issues with the infrastructure...
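
Building on those observations, a minimal host-side spot check is to confirm the ECN setting and dump the per-port counters referenced above (a sketch only; the mlx5_1/port 1 paths are copied from the example and will differ per node, and RoCE ECN/PFC is ultimately configured on the HCA and switches, e.g. via mlnx_qos, not just through sysctl):

Code Block
# host TCP ECN mode: 1 = request and accept ECN (recommended above); 2 = only accept when the peer requests it
sysctl net.ipv4.tcp_ecn

# print every port counter and hardware counter with its value (device/port path assumed from the example above)
for f in /sys/class/infiniband/mlx5_1/ports/1/counters/* \
         /sys/class/infiniband/mlx5_1/ports/1/hw_counters/*; do
    printf '%s: %s\n' "${f##*/}" "$(cat "$f")"
done

The counters directory typically carries the error and discard fields mentioned above (e.g. port_rcv_errors, port_xmit_discards), while hw_counters covers the lower-level device statistics.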