IB_DEVICE_SG_GAPS_REG
A dump_cqe error occurred when using IB_DEVICE_SG_GAPS_REG on mlx5. This was a bug in the mlx5 driver. The fix is already merged in the 4.16 kernel and in some LTS kernels: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=da343b6d90e11132f1e917d865d88ee35d6e6d00
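For background, IB_DEVICE_SG_GAPS_REG is the device capability bit a kernel ULP checks before allocating an IB_MR_TYPE_SG_GAPS memory region. A minimal sketch of that check, assuming the upstream ib_verbs API (alloc_sg_gaps_mr is a hypothetical helper name, not code from any particular driver):

#include <linux/err.h>
#include <rdma/ib_verbs.h>

/* Hypothetical helper: only request an SG-gaps MR when the device
 * advertises IB_DEVICE_SG_GAPS_REG in its capability flags. */
static struct ib_mr *alloc_sg_gaps_mr(struct ib_pd *pd, u32 max_num_sg)
{
        struct ib_device *dev = pd->device;

        if (!(dev->attrs.device_cap_flags & IB_DEVICE_SG_GAPS_REG))
                return ERR_PTR(-EOPNOTSUPP);

        /* IB_MR_TYPE_SG_GAPS allows a scatterlist whose elements need
         * not be page-aligned or contiguous. */
        return ib_alloc_mr(pd, IB_MR_TYPE_SG_GAPS, max_num_sg);
}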
Page allocation failure
[Tue Mar 20 12:28:32 2018] Call Trace:
[Tue Mar 20 12:28:32 2018] [<ffffffff816a6071>] dump_stack+0x19/0x1b
[Tue Mar 20 12:28:32 2018] [<ffffffff8118a6f0>] warn_alloc_failed+0x110/0x180
[Tue Mar 20 12:28:32 2018] [<ffffffff816a204a>] __alloc_pages_slowpath+0x6b6/0x724
[Tue Mar 20 12:28:32 2018] [<ffffffff8118ec85>] __alloc_pages_nodemask+0x405/0x420
[Tue Mar 20 12:28:32 2018] [<ffffffff81030e8f>] dma_generic_alloc_coherent+0x8f/0x140
[Tue Mar 20 12:28:32 2018] [<ffffffff810645d1>] x86_swiotlb_alloc_coherent+0x21/0x50
[Tue Mar 20 12:28:32 2018] [<ffffffffc01213dd>] mlx5_dma_zalloc_coherent_node+0xad/0x110 [mlx5_core]
[Tue Mar 20 12:28:32 2018] [<ffffffffc012197e>] mlx5_buf_alloc_node+0x3e/0xa0 [mlx5_core]
[Tue Mar 20 12:28:32 2018] [<ffffffffc01219f4>] mlx5_buf_alloc+0x14/0x20 [mlx5_core]
[Tue Mar 20 12:28:32 2018] [<ffffffffc047e11d>] create_kernel_qp.isra.65+0x44d/0x76d [mlx5_ib]
[Tue Mar 20 12:28:32 2018] [<ffffffffc04646d8>] create_qp_common+0x9e8/0x1660 [mlx5_ib]
[Tue Mar 20 12:28:32 2018] [<ffffffff812978ef>] ? debugfs_create_file+0x1f/0x30
[Tue Mar 20 12:28:32 2018] [<ffffffffc0119d0b>] ? mlx5_debug_cq_add+0x4b/0x70 [mlx5_core]
[Tue Mar 20 12:28:32 2018] [<ffffffffc011f10e>] ? mlx5_core_create_cq+0x1ae/0x230 [mlx5_core]
[Tue Mar 20 12:28:32 2018] [<ffffffff811e19f6>] ? kmem_cache_alloc_trace+0x1d6/0x200
[Tue Mar 20 12:28:32 2018] [<ffffffffc0466cfd>] ? _mlx5_ib_create_qp+0xfd/0x530 [mlx5_ib]
[Tue Mar 20 12:28:32 2018] [<ffffffffc0466d26>] _mlx5_ib_create_qp+0x126/0x530 [mlx5_ib]
[Tue Mar 20 12:28:32 2018] [<ffffffffc00bf7e5>] ? backport_kvfree+0x35/0x40 [mlx_compat]
[Tue Mar 20 12:28:32 2018] [<ffffffffc0460af0>] ? mlx5_ib_create_cq+0x300/0x4c0 [mlx5_ib]
[Tue Mar 20 12:28:32 2018] [<ffffffffc0467140>] mlx5_ib_create_qp+0x10/0x20 [mlx5_ib]
[Tue Mar 20 12:28:32 2018] [<ffffffffc059452a>] ib_create_qp+0x7a/0x2f0 [ib_core]
[Tue Mar 20 12:28:32 2018] [<ffffffffc098b5d4>] rdma_create_qp+0x34/0xb0 [rdma_cm]
[Tue Mar 20 12:28:32 2018] [<ffffffffc09e153f>] kiblnd_create_conn+0xbff/0x1870 [ko2iblnd]
[Tue Mar 20 12:28:32 2018] [<ffffffffc0a599da>] ? cfs_percpt_unlock+0x1a/0xb0 [libcfs]
[Tue Mar 20 12:28:32 2018] [<ffffffffc09ef6df>] kiblnd_passive_connect+0xa4f/0x1790 [ko2iblnd]
[Tue Mar 20 12:28:32 2018] [<ffffffffc098a58c>] ? _cma_attach_to_dev+0x5c/0x70 [rdma_cm]
[Tue Mar 20 12:28:32 2018] [<ffffffffc09f0b75>] kiblnd_cm_callback+0x755/0x23a0 [ko2iblnd]
[Tue Mar 20 12:28:32 2018] [<ffffffffc098fc36>] cma_req_handler+0x1c6/0x490 [rdma_cm]
[Tue Mar 20 12:28:32 2018] [<ffffffffc07cb327>] cm_process_work+0x27/0x120 [ib_cm]
[Tue Mar 20 12:28:32 2018] [<ffffffffc07cc16b>] cm_req_handler+0xb0b/0xe30 [ib_cm]
[Tue Mar 20 12:28:32 2018] [<ffffffffc07cce55>] cm_work_handler+0x395/0x1306 [ib_cm]
[Tue Mar 20 12:28:32 2018] [<ffffffff816ab2f4>] ? __schedule+0x424/0x9b0
[Tue Mar 20 12:28:32 2018] [<ffffffff810aa59a>] process_one_work+0x17a/0x440
[Tue Mar 20 12:28:32 2018] [<ffffffff810ab266>] worker_thread+0x126/0x3c0
[Tue Mar 20 12:28:32 2018] [<ffffffff810ab140>] ? manage_workers.isra.24+0x2a0/0x2a0
[Tue Mar 20 12:28:32 2018] [<ffffffff810b270f>] kthread+0xcf/0xe0
[Tue Mar 20 12:28:32 2018] [<ffffffff810b2640>] ? insert_kthread_work+0x40/0x40
[Tue Mar 20 12:28:32 2018] [<ffffffff816b8798>] ret_from_fork+0x58/0x90
[Tue Mar 20 12:28:32 2018] [<ffffffff810b2640>] ? insert_kthread_work+0x40/0x40
It appears that these messages can sometimes be printed as warnings when the driver cannot allocate a contiguous buffer. In that case the driver falls back to a fragmented buffer and continues to work, so I don't believe this is (directly, anyway) the reason for the crash on the client. This is currently being investigated by the Mellanox developers. Updates to follow.
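For context, the recoverable pattern described above (attempt one contiguous DMA-coherent buffer, warn on failure, fall back to page-sized fragments) looks roughly like the simplified sketch below. struct frag_buf and frag_buf_alloc are illustrative names under that assumption, not the actual mlx5_core code:

#include <linux/dma-mapping.h>
#include <linux/kernel.h>
#include <linux/slab.h>

/* Illustrative only -- not the real mlx5_core implementation. */
struct frag_buf {
        void *direct;           /* non-NULL when one contiguous buffer */
        dma_addr_t direct_dma;
        void **frags;           /* else one PAGE_SIZE fragment per entry */
        dma_addr_t *frags_dma;
        int nfrags;
        size_t size;
};

static int frag_buf_alloc(struct device *dev, size_t size,
                          struct frag_buf *buf)
{
        int i;

        buf->size = size;

        /* Try one physically contiguous allocation first.  If this
         * fails, the mm layer prints a page-allocation-failure warning
         * like the trace above, but the failure itself is recoverable. */
        buf->direct = dma_alloc_coherent(dev, size, &buf->direct_dma,
                                         GFP_KERNEL);
        if (buf->direct)
                return 0;

        /* Fall back to PAGE_SIZE fragments, which only need to be
         * individually contiguous. */
        buf->nfrags = DIV_ROUND_UP(size, PAGE_SIZE);
        buf->frags = kcalloc(buf->nfrags, sizeof(*buf->frags), GFP_KERNEL);
        buf->frags_dma = kcalloc(buf->nfrags, sizeof(*buf->frags_dma),
                                 GFP_KERNEL);
        if (!buf->frags || !buf->frags_dma)
                return -ENOMEM;         /* cleanup elided for brevity */

        for (i = 0; i < buf->nfrags; i++) {
                buf->frags[i] = dma_alloc_coherent(dev, PAGE_SIZE,
                                                   &buf->frags_dma[i],
                                                   GFP_KERNEL);
                if (!buf->frags[i])
                        return -ENOMEM; /* cleanup elided for brevity */
        }
        return 0;
}

Under this reading, the warning in the trace is noise from the failed first attempt rather than a fatal error, which is consistent with the driver continuing to work afterwards.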