IB_DEVICE_SG_GAPS_REG

A dump_cqe error was seen when using IB_DEVICE_SG_GAPS_REG on mlx5. This was a bug in the mlx5 driver; the fix is merged in the 4.16 kernel and has been backported to some LTS kernels: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=da343b6d90e11132f1e917d865d88ee35d6e6d00
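
A quick sanity check is to confirm that the running kernel already contains the fix linked above. This is a minimal sketch, assuming root access and a kernel git tree at the placeholder path /usr/src/linux; which LTS/distro kernels actually carry the backport must be verified against the vendor changelog:

# Running kernel version (the fix is in mainline 4.16 and some LTS backports):
uname -r

# If a kernel git tree is available, test whether the commit from the link above is included:
git -C /usr/src/linux merge-base --is-ancestor da343b6d90e11132f1e917d865d88ee35d6e6d00 HEAD \
    && echo "fix present" || echo "fix not present"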

Page allocation failure


 [Tue Mar 20 12:28:32 2018] Call Trace:
 [Tue Mar 20 12:28:32 2018]  [<ffffffff816a6071>] dump_stack+0x19/0x1b
 [Tue Mar 20 12:28:32 2018]  [<ffffffff8118a6f0>] warn_alloc_failed+0x110/0x180
 [Tue Mar 20 12:28:32 2018]  [<ffffffff816a204a>] __alloc_pages_slowpath+0x6b6/0x724
 [Tue Mar 20 12:28:32 2018]  [<ffffffff8118ec85>] __alloc_pages_nodemask+0x405/0x420
 [Tue Mar 20 12:28:32 2018]  [<ffffffff81030e8f>] dma_generic_alloc_coherent+0x8f/0x140
 [Tue Mar 20 12:28:32 2018]  [<ffffffff810645d1>] x86_swiotlb_alloc_coherent+0x21/0x50
 [Tue Mar 20 12:28:32 2018]  [<ffffffffc01213dd>] mlx5_dma_zalloc_coherent_node+0xad/0x110 [mlx5_core]
 [Tue Mar 20 12:28:32 2018]  [<ffffffffc012197e>] mlx5_buf_alloc_node+0x3e/0xa0 [mlx5_core]
 [Tue Mar 20 12:28:32 2018]  [<ffffffffc01219f4>] mlx5_buf_alloc+0x14/0x20 [mlx5_core]
 [Tue Mar 20 12:28:32 2018]  [<ffffffffc047e11d>] create_kernel_qp.isra.65+0x44d/0x76d [mlx5_ib]
 [Tue Mar 20 12:28:32 2018]  [<ffffffffc04646d8>] create_qp_common+0x9e8/0x1660 [mlx5_ib]
 [Tue Mar 20 12:28:32 2018]  [<ffffffff812978ef>] ? debugfs_create_file+0x1f/0x30
 [Tue Mar 20 12:28:32 2018]  [<ffffffffc0119d0b>] ? mlx5_debug_cq_add+0x4b/0x70 [mlx5_core]
 [Tue Mar 20 12:28:32 2018]  [<ffffffffc011f10e>] ? mlx5_core_create_cq+0x1ae/0x230 [mlx5_core]
 [Tue Mar 20 12:28:32 2018]  [<ffffffff811e19f6>] ? kmem_cache_alloc_trace+0x1d6/0x200
 [Tue Mar 20 12:28:32 2018]  [<ffffffffc0466cfd>] ? _mlx5_ib_create_qp+0xfd/0x530 [mlx5_ib]
 [Tue Mar 20 12:28:32 2018]  [<ffffffffc0466d26>] _mlx5_ib_create_qp+0x126/0x530 [mlx5_ib]
 [Tue Mar 20 12:28:32 2018]  [<ffffffffc00bf7e5>] ? backport_kvfree+0x35/0x40 [mlx_compat]
 [Tue Mar 20 12:28:32 2018]  [<ffffffffc0460af0>] ? mlx5_ib_create_cq+0x300/0x4c0 [mlx5_ib]
 [Tue Mar 20 12:28:32 2018]  [<ffffffffc0467140>] mlx5_ib_create_qp+0x10/0x20 [mlx5_ib]
 [Tue Mar 20 12:28:32 2018]  [<ffffffffc059452a>] ib_create_qp+0x7a/0x2f0 [ib_core]
 [Tue Mar 20 12:28:32 2018]  [<ffffffffc098b5d4>] rdma_create_qp+0x34/0xb0 [rdma_cm]
 [Tue Mar 20 12:28:32 2018]  [<ffffffffc09e153f>] kiblnd_create_conn+0xbff/0x1870 [ko2iblnd]
 [Tue Mar 20 12:28:32 2018]  [<ffffffffc0a599da>] ? cfs_percpt_unlock+0x1a/0xb0 [libcfs]
 [Tue Mar 20 12:28:32 2018]  [<ffffffffc09ef6df>] kiblnd_passive_connect+0xa4f/0x1790 [ko2iblnd]
 [Tue Mar 20 12:28:32 2018]  [<ffffffffc098a58c>] ? _cma_attach_to_dev+0x5c/0x70 [rdma_cm]
 [Tue Mar 20 12:28:32 2018]  [<ffffffffc09f0b75>] kiblnd_cm_callback+0x755/0x23a0 [ko2iblnd]
 [Tue Mar 20 12:28:32 2018]  [<ffffffffc098fc36>] cma_req_handler+0x1c6/0x490 [rdma_cm]
 [Tue Mar 20 12:28:32 2018]  [<ffffffffc07cb327>] cm_process_work+0x27/0x120 [ib_cm]
 [Tue Mar 20 12:28:32 2018]  [<ffffffffc07cc16b>] cm_req_handler+0xb0b/0xe30 [ib_cm]
 [Tue Mar 20 12:28:32 2018]  [<ffffffffc07cce55>] cm_work_handler+0x395/0x1306 [ib_cm]
 [Tue Mar 20 12:28:32 2018]  [<ffffffff816ab2f4>] ? __schedule+0x424/0x9b0
 [Tue Mar 20 12:28:32 2018]  [<ffffffff810aa59a>] process_one_work+0x17a/0x440
 [Tue Mar 20 12:28:32 2018]  [<ffffffff810ab266>] worker_thread+0x126/0x3c0
 [Tue Mar 20 12:28:32 2018]  [<ffffffff810ab140>] ? manage_workers.isra.24+0x2a0/0x2a0
 [Tue Mar 20 12:28:32 2018]  [<ffffffff810b270f>] kthread+0xcf/0xe0
 [Tue Mar 20 12:28:32 2018]  [<ffffffff810b2640>] ? insert_kthread_work+0x40/0x40
 [Tue Mar 20 12:28:32 2018]  [<ffffffff816b8798>] ret_from_fork+0x58/0x90
 [Tue Mar 20 12:28:32 2018]  [<ffffffff810b2640>] ? insert_kthread_work+0x40/0x40

It appears that these messages can sometimes be printed as warnings when the driver cannot allocate a physically contiguous buffer. The driver then falls back to a fragmented buffer and continues to work, so I don't believe this is the reason (directly, anyway) for the crash on the client. This is currently being investigated by Mellanox developers; updates to follow.
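
Since the warning comes from a failed attempt to allocate a physically contiguous DMA buffer, a hedged way to gauge how fragmented memory is on an affected node is to look at the buddy allocator state. This is a sketch only; vm.min_free_kbytes is shown as an example knob, not a recommended value:

# Free blocks per order and zone; few entries in the right-hand (high-order)
# columns mean large contiguous allocations are likely to fail and fall back
# to the fragmented-buffer path mentioned above.
cat /proc/buddyinfo

# Memory the kernel keeps in reserve; raising it can reduce fragmentation
# pressure, but the right value is workload dependent.
sysctl vm.min_free_kbytes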

RDMA Timeouts

"QP Stuck" issue

Messages similar to the following:

LNetError: 17224:0:(o2iblnd_cb.c:3320:kiblnd_check_txs_locked()) Timed out tx: active_txs, 33 seconds
LNetError: 17224:0:(o2iblnd_cb.c:3395:kiblnd_check_conns()) Timed out RDMA with 172.16.16.12@o2ib (35): c: 29, oc: 0, rc: 31

are printed when a tx does not get a completion within the expected time. This is not normal in a healthy IB network and may indicate a "QP stuck" issue in the Mellanox driver. EX-5704 gives a summary of affected versions.
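
When chasing these timeouts it helps to record the o2iblnd tuning in effect and the LNet peer/connection state at the time. A minimal sketch, assuming ko2iblnd is loaded and a Lustre version new enough to provide lnetctl:

# Module parameters governing o2iblnd timeouts, credits and queue depth:
grep -H . /sys/module/ko2iblnd/parameters/*

# LNet view of the local NIs and remote peers, including credits in use:
lnetctl net show -v
lnetctl peer show -v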

Other examples

"QP stuck" is not the only cause of RDMA timeouts of this kind. DDN-4105 describes a case when there are lots of these timeouts happening in the cluster, but the situation was remedied by lustre-level settings: "at_min=15" and "max_pages_per_rpc=1M"

Here are some observations from Ian Costello posted in that ticket:


The RDMA timeouts are of course on the o2ib (RoCEv2) fabric - expected given the lack of congestion control for RoCE.

Anyway, we found that:
ECN was set to tcp_ecn = 2 - so only one-way congestion control was enabled (the switch ports were set to auto, which defaults to tcp_ecn = 2).
PFC is disabled across the cluster

For RoCE v2 we need tcp_ecn = 1 and PFC enabled to reduce the RDMA timeouts and any other timeouts, slow responses, etc. from the network side of things.

So to completely configure this we need a maintenance window to reboot the switches once the changes are made (particularly the PFC parameter).

NOTE: we went through the same pain at Monash Uni with RoCEv1 and then RoCEv2 and the ECN/PFC congestion control. At this point in time the RDMA errors may occur once every 4 weeks and we get repeated slow RPC response warnings - which are not impacting the cluster (more importantly, the jobs). We did have NVIDIA/Mellanox engineers go through the config, strongly suggesting ECN and PFC be enabled, and also making specific switch port changes to the Cumulus-based switches (of course at Samsung this is all HPE switches).
I cannot remember the details of those switch port changes (aside from the ECN/PFC updates) as that was between the Monash Uni network admins and NVIDIA. One other point: after the changes were completed, all the previous network problems we observed literally went away (the Monash Uni cluster is not large: 3 x Lustre filesystems, 5 x ES400s, 1 x nv200 and 1 x 7990, ~2000 client nodes)...

Something else we can do is monitor the hardware counters and the counters on the HCA side of things - this was how the NVIDIA engineers determined packets were getting dropped on a large scale.

such as:

[root@rclosdc1rr40-01 ~]# grep -E "*" /sys/class/infiniband/mlx5_1/ports/1/hw_counters/*
[root@rclosdc1rr40-01 ~]# grep -E "*" /sys/class/infiniband/mlx5_1/ports/1/counters/*

The counters are more interesting for the Lustre side of things as they contain the packet errors and discards; for physical errors the hw_counters are useful - but in this case there were no issues with the infrastructure...
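
Building on the quoted notes, here is a hedged sketch of checks that can be scripted without a maintenance window; mlx5_1 port 1 and the 10-second interval are placeholders, and the sysctl shown is the host-side ECN setting referred to above, not the switch configuration:

# Host-side ECN setting (1 = request and honour ECN, 2 = only honour ECN when
# requested by the remote side, which is the default):
sysctl net.ipv4.tcp_ecn

# Watch the port counters for steadily growing error/discard values, which
# point at ongoing drops rather than a one-off event:
watch -d -n 10 'grep -H . /sys/class/infiniband/mlx5_1/ports/1/counters/* /sys/class/infiniband/mlx5_1/ports/1/hw_counters/*'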