IB_DEVICE_SG_GAPS_REG
A dump_cqe error occurred when using IB_DEVICE_SG_GAPS_REG on mlx5. This was a bug in the mlx5 driver. The fix is already merged in the 4.16 kernel and in some LTS kernels: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=da343b6d90e11132f1e917d865d88ee35d6e6d00
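For background, IB_DEVICE_SG_GAPS_REG is the device capability bit a kernel ULP checks before allocating an IB_MR_TYPE_SG_GAPS memory region. A minimal sketch of that check, assuming the upstream ib_verbs API (alloc_sg_gaps_mr is a hypothetical helper name, not code from any particular driver):

#include <linux/err.h>
#include <rdma/ib_verbs.h>

/* Hypothetical helper: only request an SG-gaps MR when the device
 * advertises IB_DEVICE_SG_GAPS_REG in its capability flags. */
static struct ib_mr *alloc_sg_gaps_mr(struct ib_pd *pd, u32 max_num_sg)
{
        struct ib_device *dev = pd->device;

        if (!(dev->attrs.device_cap_flags & IB_DEVICE_SG_GAPS_REG))
                return ERR_PTR(-EOPNOTSUPP);

        /* IB_MR_TYPE_SG_GAPS allows a scatterlist whose elements need
         * not be page-aligned or contiguous. */
        return ib_alloc_mr(pd, IB_MR_TYPE_SG_GAPS, max_num_sg);
}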
Page allocation failure
[Tue Mar 20 12:28:32 2018] Call Trace:
[Tue Mar 20 12:28:32 2018] [<ffffffff816a6071>] dump_stack+0x19/0x1b
[Tue Mar 20 12:28:32 2018] [<ffffffff8118a6f0>] warn_alloc_failed+0x110/0x180
[Tue Mar 20 12:28:32 2018] [<ffffffff816a204a>] __alloc_pages_slowpath+0x6b6/0x724
[Tue Mar 20 12:28:32 2018] [<ffffffff8118ec85>] __alloc_pages_nodemask+0x405/0x420
[Tue Mar 20 12:28:32 2018] [<ffffffff81030e8f>] dma_generic_alloc_coherent+0x8f/0x140
[Tue Mar 20 12:28:32 2018] [<ffffffff810645d1>] x86_swiotlb_alloc_coherent+0x21/0x50
[Tue Mar 20 12:28:32 2018] [<ffffffffc01213dd>] mlx5_dma_zalloc_coherent_node+0xad/0x110 [mlx5_core]
[Tue Mar 20 12:28:32 2018] [<ffffffffc012197e>] mlx5_buf_alloc_node+0x3e/0xa0 [mlx5_core]
[Tue Mar 20 12:28:32 2018] [<ffffffffc01219f4>] mlx5_buf_alloc+0x14/0x20 [mlx5_core]
[Tue Mar 20 12:28:32 2018] [<ffffffffc047e11d>] create_kernel_qp.isra.65+0x44d/0x76d [mlx5_ib]
[Tue Mar 20 12:28:32 2018] [<ffffffffc04646d8>] create_qp_common+0x9e8/0x1660 [mlx5_ib]
[Tue Mar 20 12:28:32 2018] [<ffffffff812978ef>] ? debugfs_create_file+0x1f/0x30
[Tue Mar 20 12:28:32 2018] [<ffffffffc0119d0b>] ? mlx5_debug_cq_add+0x4b/0x70 [mlx5_core]
[Tue Mar 20 12:28:32 2018] [<ffffffffc011f10e>] ? mlx5_core_create_cq+0x1ae/0x230 [mlx5_core]
[Tue Mar 20 12:28:32 2018] [<ffffffff811e19f6>] ? kmem_cache_alloc_trace+0x1d6/0x200
[Tue Mar 20 12:28:32 2018] [<ffffffffc0466cfd>] ? _mlx5_ib_create_qp+0xfd/0x530 [mlx5_ib]
[Tue Mar 20 12:28:32 2018] [<ffffffffc0466d26>] _mlx5_ib_create_qp+0x126/0x530 [mlx5_ib]
[Tue Mar 20 12:28:32 2018] [<ffffffffc00bf7e5>] ? backport_kvfree+0x35/0x40 [mlx_compat]
[Tue Mar 20 12:28:32 2018] [<ffffffffc0460af0>] ? mlx5_ib_create_cq+0x300/0x4c0 [mlx5_ib]
[Tue Mar 20 12:28:32 2018] [<ffffffffc0467140>] mlx5_ib_create_qp+0x10/0x20 [mlx5_ib]
[Tue Mar 20 12:28:32 2018] [<ffffffffc059452a>] ib_create_qp+0x7a/0x2f0 [ib_core]
[Tue Mar 20 12:28:32 2018] [<ffffffffc098b5d4>] rdma_create_qp+0x34/0xb0 [rdma_cm]
[Tue Mar 20 12:28:32 2018] [<ffffffffc09e153f>] kiblnd_create_conn+0xbff/0x1870 [ko2iblnd]
[Tue Mar 20 12:28:32 2018] [<ffffffffc0a599da>] ? cfs_percpt_unlock+0x1a/0xb0 [libcfs]
[Tue Mar 20 12:28:32 2018] [<ffffffffc09ef6df>] kiblnd_passive_connect+0xa4f/0x1790 [ko2iblnd]
[Tue Mar 20 12:28:32 2018] [<ffffffffc098a58c>] ? _cma_attach_to_dev+0x5c/0x70 [rdma_cm]
[Tue Mar 20 12:28:32 2018] [<ffffffffc09f0b75>] kiblnd_cm_callback+0x755/0x23a0 [ko2iblnd]
[Tue Mar 20 12:28:32 2018] [<ffffffffc098fc36>] cma_req_handler+0x1c6/0x490 [rdma_cm]
[Tue Mar 20 12:28:32 2018] [<ffffffffc07cb327>] cm_process_work+0x27/0x120 [ib_cm]
[Tue Mar 20 12:28:32 2018] [<ffffffffc07cc16b>] cm_req_handler+0xb0b/0xe30 [ib_cm]
[Tue Mar 20 12:28:32 2018] [<ffffffffc07cce55>] cm_work_handler+0x395/0x1306 [ib_cm]
[Tue Mar 20 12:28:32 2018] [<ffffffff816ab2f4>] ? __schedule+0x424/0x9b0
[Tue Mar 20 12:28:32 2018] [<ffffffff810aa59a>] process_one_work+0x17a/0x440
[Tue Mar 20 12:28:32 2018] [<ffffffff810ab266>] worker_thread+0x126/0x3c0
[Tue Mar 20 12:28:32 2018] [<ffffffff810ab140>] ? manage_workers.isra.24+0x2a0/0x2a0
[Tue Mar 20 12:28:32 2018] [<ffffffff810b270f>] kthread+0xcf/0xe0
[Tue Mar 20 12:28:32 2018] [<ffffffff810b2640>] ? insert_kthread_work+0x40/0x40
[Tue Mar 20 12:28:32 2018] [<ffffffff816b8798>] ret_from_fork+0x58/0x90
[Tue Mar 20 12:28:32 2018] [<ffffffff810b2640>] ? insert_kthread_work+0x40/0x40
It appears that these messages can sometimes be printed as warnings when the driver cannot allocate a contiguous buffer. In that case the driver falls back to a fragmented buffer and continues to work, so I don't believe this is (directly, anyway) the reason for the crash on the client. This is currently being investigated by the Mellanox developers. Updates to follow.
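For context, the recoverable pattern described above (attempt one contiguous DMA-coherent buffer, warn on failure, fall back to page-sized fragments) looks roughly like the simplified sketch below. struct frag_buf and frag_buf_alloc are illustrative names under that assumption, not the actual mlx5_core code:

#include <linux/dma-mapping.h>
#include <linux/kernel.h>
#include <linux/slab.h>

/* Illustrative only -- not the real mlx5_core implementation. */
struct frag_buf {
        void *direct;           /* non-NULL when one contiguous buffer */
        dma_addr_t direct_dma;
        void **frags;           /* else one PAGE_SIZE fragment per entry */
        dma_addr_t *frags_dma;
        int nfrags;
        size_t size;
};

static int frag_buf_alloc(struct device *dev, size_t size,
                          struct frag_buf *buf)
{
        int i;

        buf->size = size;

        /* Try one physically contiguous allocation first.  If this
         * fails, the mm layer prints a page-allocation-failure warning
         * like the trace above, but the failure itself is recoverable. */
        buf->direct = dma_alloc_coherent(dev, size, &buf->direct_dma,
                                         GFP_KERNEL);
        if (buf->direct)
                return 0;

        /* Fall back to PAGE_SIZE fragments, which only need to be
         * individually contiguous. */
        buf->nfrags = DIV_ROUND_UP(size, PAGE_SIZE);
        buf->frags = kcalloc(buf->nfrags, sizeof(*buf->frags), GFP_KERNEL);
        buf->frags_dma = kcalloc(buf->nfrags, sizeof(*buf->frags_dma),
                                 GFP_KERNEL);
        if (!buf->frags || !buf->frags_dma)
                return -ENOMEM;         /* cleanup elided for brevity */

        for (i = 0; i < buf->nfrags; i++) {
                buf->frags[i] = dma_alloc_coherent(dev, PAGE_SIZE,
                                                   &buf->frags_dma[i],
                                                   GFP_KERNEL);
                if (!buf->frags[i])
                        return -ENOMEM; /* cleanup elided for brevity */
        }
        return 0;
}

Under this reading, the warning in the trace is noise from the failed first attempt rather than a fatal error, which is consistent with the driver continuing to work afterwards.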