Analysis of TID-RDMA issue by Bruno Faccini (from CAM-54)
Part 1
My first crash-dump analysis steps seem to point to something different from LU-9372, because the number of ptlrpc_rqbd buffers for all MDS services is quite low, and the slabs consuming the whole memory are different from the 32k+1K pair seen there:
CACHE            NAME          OBJSIZE  ALLOCATED     TOTAL    SLABS  SSIZE
……………
ffff88017fc07500 kmalloc-1024     1024   97097290  97097632  3034301    32k
ffff88017fc07600 kmalloc-512       512   45939302  45940160   717815    32k
ffff88017fc07700 kmalloc-256       256   48963787  48964224   765066    16k
………………
More to come.
Part 2
At the time of the crash-dump, only 2 threads were trying to allocate more slab memory:
PID: 4128 TASK: ffff881031284e70 CPU: 6 COMMAND: "kiblnd_sd_00_01"
#0 [ffff88103e6c5e48] crash_nmi_callback at ffffffff8104d302
#1 [ffff88103e6c5e58] nmi_handle at ffffffff81690237
#2 [ffff88103e6c5eb0] do_nmi at ffffffff81690443
#3 [ffff88103e6c5ef0] end_repeat_nmi at ffffffff8168f653
[exception RIP: mutex_unlock+20]
RIP: ffffffff8168a9c4 RSP: ffff88102831f760 RFLAGS: 00000202
RAX: 000000000390f5d8 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffa01ec480
RBP: ffff88102831f760 R8: 0000000000000000 R9: 0000000000000000
R10: 0000000000000000 R11: 0000000000000220 R12: 0000000000000000
R13: 0000000000000001 R14: 0000000000000000 R15: ffff88102831f8e0
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000
--- <NMI exception stack> ---
#4 [ffff88102831f760] mutex_unlock at ffffffff8168a9c4
#5 [ffff88102831f768] ttm_pool_shrink_scan at ffffffffa01e67d1 [ttm]
#6 [ffff88102831f798] shrinker2_shrink at ffffffffa0167d1e [drm]
#7 [ffff88102831f7c0] shrink_slab at ffffffff811946e9
#8 [ffff88102831f860] do_try_to_free_pages at ffffffff81197aa2
#9 [ffff88102831f8d8] try_to_free_pages at ffffffff81197cbc
#10 [ffff88102831f970] __alloc_pages_slowpath at ffffffff81682b28
#11 [ffff88102831fa60] __alloc_pages_nodemask at ffffffff8118b645
#12 [ffff88102831fb10] alloc_pages_current at ffffffff811cf94a
#13 [ffff88102831fb58] new_slab at ffffffff811da5cc
#14 [ffff88102831fb90] ___slab_alloc at ffffffff811dbe4c
#15 [ffff88102831fc68] __slab_alloc at ffffffff8168420c
#16 [ffff88102831fca8] kmem_cache_alloc_trace at ffffffff811deb07
#17 [ffff88102831fcf0] lnet_parse at ffffffffa0a43f7c [lnet]
#18 [ffff88102831fd68] kiblnd_handle_rx at ffffffffa0ab29ab [ko2iblnd]
#19 [ffff88102831fdb8] kiblnd_scheduler at ffffffffa0ab8f7c [ko2iblnd]
#20 [ffff88102831fec8] kthread at ffffffff810b0a4f
#21 [ffff88102831ff50] ret_from_fork at ffffffff816977d8
…………….
PID: 169107 TASK: ffff88103c382f10 CPU: 4 COMMAND: "kworker/4:3"
#0 [ffff881ec7457480] __schedule at ffffffff8168c245
#1 [ffff881ec74574e8] schedule at ffffffff8168c899
#2 [ffff881ec74574f8] schedule_timeout at ffffffff8168a214
#3 [ffff881ec74575b0] __alloc_pages_slowpath at ffffffff81682c46
#4 [ffff881ec74576a0] __alloc_pages_nodemask at ffffffff8118b645
#5 [ffff881ec7457750] alloc_pages_current at ffffffff811cf94a
#6 [ffff881ec7457798] new_slab at ffffffff811da5cc
#7 [ffff881ec74577d0] ___slab_alloc at ffffffff811dbe4c
#8 [ffff881ec74578a8] __slab_alloc at ffffffff8168420c
#9 [ffff881ec74578e8] __kmalloc at ffffffff811dd968
#10 [ffff881ec7457928] hfi1_kern_exp_rcv_alloc_flows at ffffffffa034afa4 [hfi1]
#11 [ffff881ec7457968] qp_priv_init at ffffffffa031e87f [hfi1]
#12 [ffff881ec74579b8] rvt_create_qp at ffffffffa0136fb3 [rdmavt]
#13 [ffff881ec7457a50] ib_create_qp at ffffffffa00fea1f [ib_core]
#14 [ffff881ec7457a80] rdma_create_qp at ffffffffa0676584 [rdma_cm]
#15 [ffff881ec7457aa8] kiblnd_create_conn at ffffffffa0aa6217 [ko2iblnd]
#16 [ffff881ec7457b28] kiblnd_passive_connect at ffffffffa0ab4041 [ko2iblnd]
#17 [ffff881ec7457bd0] kiblnd_cm_callback at ffffffffa0ab5405 [ko2iblnd]
#18 [ffff881ec7457c38] cma_req_handler at ffffffffa067ae87 [rdma_cm]
#19 [ffff881ec7457ce8] create_client at ffffffffa064c755 [ib_cm]
#20 [ffff881ec7457d28] encode_cb_sequence4args at ffffffffa064d74d [ib_cm]
#21 [ffff881ec7457da0] nfsd4_process_cb_update at ffffffffa064ddc5 [ib_cm]
#22 [ffff881ec7457e20] process_one_work at ffffffff810a845b
#23 [ffff881ec7457e68] worker_thread at ffffffff810a9296
#24 [ffff881ec7457ec8] kthread at ffffffff810b0a4f
#25 [ffff881ec7457f50] ret_from_fork at ffffffff816977d8
but it seems there is something wrong with the symbols/debuginfo in the second decoding, as some of the return addresses do not match the stack unwinder's resolution; for example, the 0xffffffffa031e87f address is resolved as <hfi1_rc_send_complete+335> !!
But anyway, we are still in the OPA/Omni-Path Host Fabric Interface driver, and maybe the decoding issue comes from the fact that you are using a different driver than the one shipped with your kernel?
And the last stack frame/routine before the kmalloc also looks very interesting:
0xffffffffa034af90: callq  0xffffffff811dd7a0 <__kmalloc>
0xffffffffa034af95: mov    %r12d,%esi
0xffffffffa034af98: mov    %r15,%rdi
0xffffffffa034af9b: mov    %rax,0x40(%rbx)
0xffffffffa034af9f: callq  0xffffffff811dd7a0 <__kmalloc>
0xffffffffa034afa4: lea    0x0(,%r14,4),%rdi
0xffffffffa034afac: mov    %rax,0x48(%rbx)
0xffffffffa034afb0: mov    %r12d,%esi
0xffffffffa034afb3: callq  0xffffffff811dd7a0 <__kmalloc>
0xffffffffa034afb8: cmpq   $0x0,0x30(%rbx)
Seems I will need to find a way to check the OPA/hfi1 kmem allocation requirements...
Part 3
Hmm, ok, some routines from the hfi1 module can now be disassembled more easily, like this hfi1_kern_exp_rcv_alloc_flows() routine, which does a bunch of allocations from its very beginning:
Dump of assembler code for function hfi1_kern_exp_rcv_alloc_flows:
0xffffffffa034aee0 <+0>:   nopl   0x0(%rax,%rax,1)
0xffffffffa034aee5 <+5>:   push   %rbp
0xffffffffa034aee6 <+6>:   mov    $0x8,%eax
0xffffffffa034aeeb <+11>:  mov    %rsp,%rbp
0xffffffffa034aeee <+14>:  push   %r15
0xffffffffa034aef0 <+16>:  push   %r14
0xffffffffa034aef2 <+18>:  push   %r13
0xffffffffa034aef4 <+20>:  push   %r12
0xffffffffa034aef6 <+22>:  mov    %esi,%r12d
0xffffffffa034aef9 <+25>:  or     $0x8000,%r12d
0xffffffffa034af00 <+32>:  push   %rbx
0xffffffffa034af01 <+33>:  mov    %r12d,%esi
0xffffffffa034af04 <+36>:  sub    $0x8,%rsp
0xffffffffa034af08 <+40>:  mov    %ax,0x50(%rdi)
0xffffffffa034af0c <+44>:  mov    %rdi,-0x30(%rbp)
0xffffffffa034af10 <+48>:  mov    $0x3c0,%edi                  <<<<<<< 960 Bytes
0xffffffffa034af15 <+53>:  callq  0xffffffff811dd7a0 <__kmalloc>
0xffffffffa034af1a <+58>:  test   %rax,%rax
0xffffffffa034af1d <+61>:  mov    -0x30(%rbp),%rdx
0xffffffffa034af21 <+65>:  je     0xffffffffa034b010 <hfi1_kern_exp_rcv_alloc_flows+304>
0xffffffffa034af27 <+71>:  xor    %r13d,%r13d
0xffffffffa034af2a <+74>:  cmpw   $0x0,0x50(%rdx)
0xffffffffa034af2f <+79>:  mov    %rax,0x18(%rdx)
0xffffffffa034af33 <+83>:  jne    0xffffffffa034af44 <hfi1_kern_exp_rcv_alloc_flows+100>
0xffffffffa034af35 <+85>:  jmpq   0xffffffffa034afeb <hfi1_kern_exp_rcv_alloc_flows+267>
0xffffffffa034af3a <+90>:  nopw   0x0(%rax,%rax,1)
0xffffffffa034af40 <+96>:  mov    0x18(%rdx),%rax
0xffffffffa034af44 <+100>: movslq %r13d,%rbx
0xffffffffa034af47 <+103>: mov    %r12d,%esi
0xffffffffa034af4a <+106>: mov    %rdx,-0x30(%rbp)
0xffffffffa034af4e <+110>: lea    0x0(,%rbx,8),%rcx
0xffffffffa034af56 <+118>: shl    $0x7,%rbx
0xffffffffa034af5a <+122>: sub    %rcx,%rbx
0xffffffffa034af5d <+125>: add    %rax,%rbx
0xffffffffa034af60 <+128>: mov    0x2c0e6(%rip),%eax  # 0xffffffffa037704c <hfi1_tid_rdma_seg_max_size>
0xffffffffa034af66 <+134>: shr    $0xc,%eax                    <<<<<< 0x40
0xffffffffa034af69 <+137>: mov    %eax,%ecx
0xffffffffa034af6b <+139>: and    $0x1,%ecx
0xffffffffa034af6e <+142>: lea    (%rax,%rcx,1),%r14d          <<<<<< 0x40
0xffffffffa034af72 <+146>: lea    0x0(,%r14,8),%rdi            <<<<<< 0x40 * 8 = 512 Bytes
0xffffffffa034af7a <+154>: mov    %r14,%r15
0xffffffffa034af7d <+157>: shl    $0x4,%r15                    <<<<<< 0x400
0xffffffffa034af81 <+161>: callq  0xffffffff811dd7a0 <__kmalloc>
0xffffffffa034af86 <+166>: mov    %r12d,%esi
0xffffffffa034af89 <+169>: mov    %r15,%rdi                    <<<<<< 1024 Bytes
0xffffffffa034af8c <+172>: mov    %rax,0x30(%rbx)
0xffffffffa034af90 <+176>: callq  0xffffffff811dd7a0 <__kmalloc>
0xffffffffa034af95 <+181>: mov    %r12d,%esi
0xffffffffa034af98 <+184>: mov    %r15,%rdi                    <<<<<< 1024 Bytes
0xffffffffa034af9b <+187>: mov    %rax,0x40(%rbx)
0xffffffffa034af9f <+191>: callq  0xffffffff811dd7a0 <__kmalloc>
0xffffffffa034afa4 <+196>: lea    0x0(,%r14,4),%rdi            <<<<<< 0x40 * 4 = 256 Bytes
0xffffffffa034afac <+204>: mov    %rax,0x48(%rbx)
0xffffffffa034afb0 <+208>: mov    %r12d,%esi
0xffffffffa034afb3 <+211>: callq  0xffffffff811dd7a0 <__kmalloc>
0xffffffffa034afb8 <+216>: cmpq   $0x0,0x30(%rbx)
0xffffffffa034afbd <+221>: mov    %rax,0x58(%rbx)
0xffffffffa034afc1 <+225>: mov    -0x30(%rbp),%rdx
0xffffffffa034afc5 <+229>: je     0xffffffffa034b000 <hfi1_kern_exp_rcv_alloc_flows+288>
0xffffffffa034afc7 <+231>: cmpq   $0x0,0x40(%rbx)
0xffffffffa034afcc <+236>: je     0xffffffffa034b000 <hfi1_kern_exp_rcv_alloc_flows+288>
0xffffffffa034afce <+238>: test   %rax,%rax
0xffffffffa034afd1 <+241>: je     0xffffffffa034b000 <hfi1_kern_exp_rcv_alloc_flows+288>
0xffffffffa034afd3 <+243>: cmpq   $0x0,0x48(%rbx)
0xffffffffa034afd8 <+248>: je     0xffffffffa034b000 <hfi1_kern_exp_rcv_alloc_flows+288>
0xffffffffa034afda <+250>: movzwl 0x50(%rdx),%eax
0xffffffffa034afde <+254>: add    $0x1,%r13d
0xffffffffa034afe2 <+258>: cmp    %r13d,%eax
0xffffffffa034afe5 <+261>: jg     0xffffffffa034af40 <hfi1_kern_exp_rcv_alloc_flows+96>
0xffffffffa034afeb <+267>: add    $0x8,%rsp
0xffffffffa034afef <+271>: xor    %eax,%eax
0xffffffffa034aff1 <+273>: pop    %rbx
0xffffffffa034aff2 <+274>: pop    %r12
0xffffffffa034aff4 <+276>: pop    %r13
0xffffffffa034aff6 <+278>: pop    %r14
0xffffffffa034aff8 <+280>: pop    %r15
0xffffffffa034affa <+282>: pop    %rbp
0xffffffffa034affb <+283>: retq
0xffffffffa034affc <+284>: nopl   0x0(%rax)
0xffffffffa034b000 <+288>: mov    %rbx,%rdi
0xffffffffa034b003 <+291>: mov    %rdx,-0x30(%rbp)
0xffffffffa034b007 <+295>: callq  0xffffffffa0348560 <hfi1_kern_exp_rcv_dealloc>
0xffffffffa034b00c <+300>: mov    -0x30(%rbp),%rdx
0xffffffffa034b010 <+304>: mov    %rdx,%rdi
0xffffffffa034b013 <+307>: callq  0xffffffffa034ae60 <hfi1_kern_exp_rcv_free_flows>
0xffffffffa034b018 <+312>: add    $0x8,%rsp
0xffffffffa034b01c <+316>: mov    $0xfffffff4,%eax
0xffffffffa034b021 <+321>: pop    %rbx
0xffffffffa034b022 <+322>: pop    %r12
0xffffffffa034b024 <+324>: pop    %r13
0xffffffffa034b026 <+326>: pop    %r14
0xffffffffa034b028 <+328>: pop    %r15
0xffffffffa034b02a <+330>: pop    %rbp
0xffffffffa034b02b <+331>: retq
with <hfi1_tid_rdma_seg_max_size> value:
ffffffffa037704c: 00040000 ....
and all these allocations are of the 1024/512/256-byte sizes of the three kmem_caches hogging the whole memory!
And this is confirmed by the hfi1 related source code :
#define TID_RDMA_MAX_SEGMENT_SIZE BIT(18) /* 256 KiB (for now) */
unsigned int hfi1_tid_rdma_seg_max_size = TID_RDMA_MAX_SEGMENT_SIZE;
#define TID_RDMA_MAX_READ_SEGS_PER_REQ 6
#define TID_RDMA_MAX_WRITE_SEGS_PER_REQ 2
struct tid_rdma_flow {
int idx;
u32 tid_qpn;
struct flow_state flow_state;
struct tid_rdma_request *req;
struct page **pages;
u32 npages;
u32 npagesets;
struct tid_rdma_pageset *pagesets;
struct kern_tid_node *tnode;
u32 tnode_cnt;
u32 tidcnt;
u32 *tid_entry;
u32 npkts;
u32 pkt;
u32 tid_idx;
u32 tid_offset;
u32 length;
u32 sent;
}
SIZE: 120
struct tid_rdma_pageset {
dma_addr_t addr;
u16 idx;
u16 count;
}
SIZE: 16
struct kern_tid_node {
struct tid_group *grp;
u8 map;
u8 cnt;
}
SIZE: 16
static void hfi1_kern_exp_rcv_dealloc(struct tid_rdma_flow *flow)
{
kfree(flow->pages);
kfree(flow->pagesets);
kfree(flow->tid_entry);
kfree(flow->tnode);
flow->pages = NULL;
flow->pagesets = NULL;
flow->tid_entry = NULL;
flow->tnode = NULL;
}
static int hfi1_kern_exp_rcv_alloc(struct tid_rdma_flow *flow, gfp_t gfp)
{
u32 npages;
npages = hfi1_tid_rdma_seg_max_size >> PAGE_SHIFT; <<<<< 0x40
if (npages & 1)
npages++;
/*
* Worst case allocations: in the worst case there are no contiguous
* physical chunks so there are N (= npages) pagesets and TID entries,
* also in the worst case each TID comes from a separate TID group so
* there are N TID groups and therefore N TID nodes
*/
flow->pages = kcalloc(npages, sizeof(*flow->pages), gfp); <<<<< 0x40*8=512
flow->pagesets = kcalloc(npages, sizeof(*flow->pagesets), gfp); <<<< 0x40*16=1024
flow->tnode = kcalloc(npages, sizeof(*flow->tnode), gfp); <<<< 0x40*16=1024
flow->tid_entry = kcalloc(npages, sizeof(*flow->tid_entry), gfp); <<<< 0x40*4=256
if (!flow->pages || !flow->pagesets || !flow->tid_entry || !flow->tnode)
goto nomem;
return 0;
nomem:
hfi1_kern_exp_rcv_dealloc(flow);
return -ENOMEM;
}
/* Called at QP destroy time to free TID RDMA resources */
void hfi1_kern_exp_rcv_free_flows(struct tid_rdma_request *req)
{
int i;
for (i = 0; req->flows && i < req->n_max_flows; i++)
hfi1_kern_exp_rcv_dealloc(&req->flows[i]);
kfree(req->flows);
req->flows = NULL;
req->n_max_flows = 0;
req->n_flows = 0;
}
/*
* This is called at QP create time to allocate resources for TID RDMA
* segments/flows. This is done to keep all required memory pre-allocated and
* avoid memory allocation in the data path.
*/
int hfi1_kern_exp_rcv_alloc_flows(struct tid_rdma_request *req, gfp_t gfp)
{
struct tid_rdma_flow *flows;
int i, ret;
u16 nflows;
/* Size of the flow circular buffer is the next higher power of 2 */
nflows = max_t(u16, TID_RDMA_MAX_READ_SEGS_PER_REQ,
TID_RDMA_MAX_WRITE_SEGS_PER_REQ); <<<<<<<< 6
req->n_max_flows = roundup_pow_of_two(nflows + 1); <<<<<<<<< 8
flows = kcalloc(req->n_max_flows, sizeof(*flows), gfp); <<<<<< 8*120=960
if (!flows) {
ret = -ENOMEM;
goto err;
}
req->flows = flows;
for (i = 0; i < req->n_max_flows; i++) {
ret = hfi1_kern_exp_rcv_alloc(&req->flows[i], gfp);
if (ret)
goto err;
}
return 0;
err:
hfi1_kern_exp_rcv_free_flows(req);
return ret;
}
So now, my first guess is that this OOM situation may come from a huge number of OPA QPs being required (> 5M !!!).
Part 4
After spending more time looking into the OPA driver source code, I agree with Amir that these huge kmem allocations (which I had already suspected, as per my previous crash-dump analysis) can be avoided if TID-RDMA is disabled:
C symbol: qp_priv_init
File Function Line
0 hfi1/qp.h <global> 185 int qp_priv_init(struct rvt_dev_info *rdi, struct rvt_qp *qp,
1 rdma/rdma_vt.h <global> 239 int (*qp_priv_init)(struct rvt_dev_info *rdi, struct rvt_qp *qp,
2 hfi1/qp.c qp_priv_init 930 int qp_priv_init(struct rvt_dev_info *rdi, struct rvt_qp *qp,
3 hfi1/verbs.c hfi1_register_ib_device 1986 dd->verbs_dev.rdi.driver_f.qp_priv_init = qp_priv_init;
4 rdmavt/qp.c rvt_create_qp 836 if (rdi->driver_f.qp_priv_init) {
5 rdmavt/qp.c rvt_create_qp 837 err = rdi->driver_f.qp_priv_init(rdi, qp, init_attr,
/**
* rvt_create_qp - create a queue pair for a device
* @ibpd: the protection domain who's device we create the queue pair for
* @init_attr: the attributes of the queue pair
* @udata: user data for libibverbs.so
*
* Queue pair creation is mostly an rvt issue. However, drivers have their own
* unique idea of what queue pair numbers mean. For instance there is a reserved
* range for PSM.
*
* Return: the queue pair on success, otherwise returns an errno.
*
* Called by the ib_create_qp() core verbs function.
*/
struct ib_qp *rvt_create_qp(struct ib_pd *ibpd,
struct ib_qp_init_attr *init_attr,
struct ib_udata *udata)
{
................
return ERR_PTR(-EINVAL);
case IB_QPT_UC:
case IB_QPT_RC:
case IB_QPT_UD:
.................
if (rdi->driver_f.qp_priv_init) {
err = rdi->driver_f.qp_priv_init(rdi, qp, init_attr,
gfp);
if (err) {
ret = ERR_PTR(err);
goto bail_rq_wq;
}
}
..................
C symbol: hfi1_kern_exp_rcv_alloc_flows
File Function Line
0 hfi1/tid_rdma.h <global> 302 int hfi1_kern_exp_rcv_alloc_flows(struct tid_rdma_request *req, gfp_t gfp);
1 hfi1/qp.c qp_priv_init 962 ret = hfi1_kern_exp_rcv_alloc_flows(&priv->tid_req,
2 hfi1/qp.c qp_priv_init 989 ret = hfi1_kern_exp_rcv_alloc_flows(&priv->tid_req,
3 hfi1/tid_rdma.c hfi1_kern_exp_rcv_alloc_flows 1564 int hfi1_kern_exp_rcv_alloc_flows(struct tid_rdma_request *req, gfp_t gfp)
int qp_priv_init(struct rvt_dev_info *rdi, struct rvt_qp *qp,
struct ib_qp_init_attr *init_attr, gfp_t gfp)
{
struct hfi1_qp_priv *qpriv = qp->priv;
int i, ret;
qpriv->rcd = qp_to_rcd(rdi, qp);
spin_lock_init(&qpriv->opfn.lock);
INIT_WORK(&qpriv->opfn.opfn_work, opfn_send_conn_request);
INIT_WORK(&qpriv->tid_rdma.trigger_work, tid_rdma_trigger_resume);
qpriv->r_tid_tail = qp->s_tail_ack_queue;
qpriv->flow_state.psn = 0;
qpriv->flow_state.index = RXE_NUM_TID_FLOWS;
qpriv->flow_state.generation = 0;
qpriv->s_state = TID_OP(WRITE_RESP);
qpriv->s_ack_state = TID_OP(WRITE_DATA);
qpriv->s_tid_cur = HFI1_QP_WQE_INVALID;
qpriv->s_tid_tail = HFI1_QP_WQE_INVALID;
qpriv->r_tid_tail = HFI1_QP_WQE_INVALID;
qpriv->r_tid_ack = HFI1_QP_WQE_INVALID;
atomic_set(&qpriv->n_requests, 0);
atomic_set(&qpriv->n_tid_requests, 0);
if (init_attr->qp_type == IB_QPT_RC && HFI1_CAP_IS_KSET(TID_RDMA)) { <<<<<<<< TID_RDMA must be configured !!!!
for (i = 0; i < qp->s_size; i++) {
struct hfi1_swqe_priv *priv;
struct rvt_swqe *wqe = rvt_get_swqe_ptr(qp, i);
priv = kzalloc(sizeof(*priv), gfp);
if (!priv)
return -ENOMEM;
ret = hfi1_kern_exp_rcv_alloc_flows(&priv->tid_req,
gfp);
.......................
Part 5
Matt, TID_RDMA refers to "Accelerated RDMA", and Wojciech is correct: in order to disable TID_RDMA you should remove the "cap_mask=0x4c09a01cbba" setting from the /etc/modprobe.d/hfi1.conf file on all fabric nodes.
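As a hedged sketch of that change (the sysfs path is the standard module-parameter location; verify the exact options line in your hfi1.conf before editing, since commenting it out drops any other settings on that line too):

```shell
# Show the currently loaded capability mask (TID_RDMA is one of its bits)
cat /sys/module/hfi1/parameters/cap_mask

# Comment out the cap_mask override so the driver default is used
# (which, per the discussion above, leaves TID_RDMA disabled), then
# regenerate the initramfs if hfi1 loads early, and reboot.
sed -i 's/^\(options hfi1 .*cap_mask=[^ ]*.*\)/# \1/' /etc/modprobe.d/hfi1.conf
```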
Also, back on my earlier QP-count estimate: I have done more OPA driver source code analysis and retrieved values from inside the crash-dump, and it in fact appears that, with the default/current parameter settings, the driver allocates the set of 8*(2*1024 + 512 + 256) bytes that I had already identified about >8K times per QP:
int qp_priv_init(struct rvt_dev_info *rdi, struct rvt_qp *qp,
struct ib_qp_init_attr *init_attr, gfp_t gfp)
{
......................
if (init_attr->qp_type == IB_QPT_RC && HFI1_CAP_IS_KSET(TID_RDMA)) {
for (i = 0; i < qp->s_size; i++) {
struct hfi1_swqe_priv *priv;
struct rvt_swqe *wqe = rvt_get_swqe_ptr(qp, i);
priv = kzalloc(sizeof(*priv), gfp);
if (!priv)
return -ENOMEM;
ret = hfi1_kern_exp_rcv_alloc_flows(&priv->tid_req, <<<<<<<<<<
gfp);
if (ret)
return ret;
/*
* Initialize various TID RDMA request variables.
* These variables are "static", which is why they
* can be pre-initialized here before the WRs has
* even been submitted.
* However, non-NULL values for these variables do not
* imply that this WQE has been enabled for TID RDMA.
* Drivers should check the WQE's opcode to determine
* if a request is a TID RDMA one or not.
*/
priv->tid_req.qp = qp;
priv->tid_req.rcd = qpriv->rcd;
priv->tid_req.e.swqe = wqe;
wqe->priv = priv;
}
for (i = 0; i < rvt_max_atomic(rdi); i++) {
struct hfi1_ack_priv *priv;
priv = kzalloc(sizeof(*priv), gfp);
if (!priv)
return -ENOMEM;
ret = hfi1_kern_exp_rcv_alloc_flows(&priv->tid_req, <<<<<<<<<
gfp);
if (ret)
return ret;
priv->tid_req.qp = qp;
priv->tid_req.rcd = qpriv->rcd;
priv->tid_req.e.ack = &qp->s_ack_queue[i];
qp->s_ack_queue[i].priv = priv;
}
}
......................
with
qp->s_size = sqsize = init_attr->cap.max_send_wr + 1 + rdi->dparms.reserved_operations;
where
reserved_operations = 0x1
max_send_wr = 0x2100,
and with
static inline unsigned int rvt_max_atomic(struct rvt_dev_info *rdi)
{
return rdi->dparms.max_rdma_atomic +
rdi->dparms.extra_rdma_atomic + 1;
}
where
max_rdma_atomic = 0x10,
extra_rdma_atomic = 0x8,
So this finally leads to fewer than 600 QPs created, which looks more like a normal value for the potential number of clients (455 < x < 455+768) at the time of the crash.
What does look quite huge is the >182 MB being allocated for each QP when TID_RDMA is enabled, again assuming my analysis is right.
Maybe there are some tunables that could be adjusted to make the TID_RDMA feature usable on a server that must support a lot of concurrent clients/connections, like reducing TID_RDMA_MAX_READ_SEGS_PER_REQ/TID_RDMA_MAX_WRITE_SEGS_PER_REQ to limit the n_max_flows value, or also a way to limit max_send_wr (which looks to be inherited from upper layers)??