Analysis of TID-RDMA issue by Bruno Faccini (from CAM-54)

Part 1

My first crash-dump analysis steps seem to point to something different from LU-9372, because the number of ptlrpc_rqbd buffers for all MDS services is quite low, and the slabs consuming the whole memory are different from the 32K+1K pair:

CACHE            NAME                 OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE
……………
ffff88017fc07500 kmalloc-1024            1024   97097290  97097632  3034301    32k
ffff88017fc07600 kmalloc-512              512   45939302  45940160  717815    32k
ffff88017fc07700 kmalloc-256              256   48963787  48964224  765066    16k
………………
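
As a quick sanity check (assuming the usual crash kmem -s column meanings), SLABS * SSIZE for just these three caches already accounts for:

    kmalloc-1024: 3,034,301 slabs * 32 KiB ~= 92.6 GiB
    kmalloc-512 :   717,815 slabs * 32 KiB ~= 21.9 GiB
    kmalloc-256 :   765,066 slabs * 16 KiB ~= 11.7 GiB
                                     total ~= 126 GiB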

More to come.

Part 2

At the time of the crash dump, there are only 2 threads trying to allocate more memory for slabs:

PID: 4128   TASK: ffff881031284e70  CPU: 6   COMMAND: "kiblnd_sd_00_01"
 #0 [ffff88103e6c5e48] crash_nmi_callback at ffffffff8104d302
 #1 [ffff88103e6c5e58] nmi_handle at ffffffff81690237
 #2 [ffff88103e6c5eb0] do_nmi at ffffffff81690443
 #3 [ffff88103e6c5ef0] end_repeat_nmi at ffffffff8168f653
    [exception RIP: mutex_unlock+20]
    RIP: ffffffff8168a9c4  RSP: ffff88102831f760  RFLAGS: 00000202
    RAX: 000000000390f5d8  RBX: 0000000000000000  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: ffffffffa01ec480
    RBP: ffff88102831f760   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000220  R12: 0000000000000000
    R13: 0000000000000001  R14: 0000000000000000  R15: ffff88102831f8e0
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
--- <NMI exception stack> ---
 #4 [ffff88102831f760] mutex_unlock at ffffffff8168a9c4
 #5 [ffff88102831f768] ttm_pool_shrink_scan at ffffffffa01e67d1 [ttm]
 #6 [ffff88102831f798] shrinker2_shrink at ffffffffa0167d1e [drm]
 #7 [ffff88102831f7c0] shrink_slab at ffffffff811946e9
 #8 [ffff88102831f860] do_try_to_free_pages at ffffffff81197aa2
 #9 [ffff88102831f8d8] try_to_free_pages at ffffffff81197cbc
#10 [ffff88102831f970] __alloc_pages_slowpath at ffffffff81682b28
#11 [ffff88102831fa60] __alloc_pages_nodemask at ffffffff8118b645
#12 [ffff88102831fb10] alloc_pages_current at ffffffff811cf94a
#13 [ffff88102831fb58] new_slab at ffffffff811da5cc
#14 [ffff88102831fb90] ___slab_alloc at ffffffff811dbe4c
#15 [ffff88102831fc68] __slab_alloc at ffffffff8168420c
#16 [ffff88102831fca8] kmem_cache_alloc_trace at ffffffff811deb07
#17 [ffff88102831fcf0] lnet_parse at ffffffffa0a43f7c [lnet]
#18 [ffff88102831fd68] kiblnd_handle_rx at ffffffffa0ab29ab [ko2iblnd]
#19 [ffff88102831fdb8] kiblnd_scheduler at ffffffffa0ab8f7c [ko2iblnd]
#20 [ffff88102831fec8] kthread at ffffffff810b0a4f
#21 [ffff88102831ff50] ret_from_fork at ffffffff816977d8
…………….
PID: 169107  TASK: ffff88103c382f10  CPU: 4   COMMAND: "kworker/4:3"
 #0 [ffff881ec7457480] __schedule at ffffffff8168c245
 #1 [ffff881ec74574e8] schedule at ffffffff8168c899
 #2 [ffff881ec74574f8] schedule_timeout at ffffffff8168a214
 #3 [ffff881ec74575b0] __alloc_pages_slowpath at ffffffff81682c46
 #4 [ffff881ec74576a0] __alloc_pages_nodemask at ffffffff8118b645
 #5 [ffff881ec7457750] alloc_pages_current at ffffffff811cf94a
 #6 [ffff881ec7457798] new_slab at ffffffff811da5cc
 #7 [ffff881ec74577d0] ___slab_alloc at ffffffff811dbe4c
 #8 [ffff881ec74578a8] __slab_alloc at ffffffff8168420c
 #9 [ffff881ec74578e8] __kmalloc at ffffffff811dd968
#10 [ffff881ec7457928] hfi1_kern_exp_rcv_alloc_flows at ffffffffa034afa4 [hfi1]
#11 [ffff881ec7457968] qp_priv_init at ffffffffa031e87f [hfi1]
#12 [ffff881ec74579b8] rvt_create_qp at ffffffffa0136fb3 [rdmavt]
#13 [ffff881ec7457a50] ib_create_qp at ffffffffa00fea1f [ib_core]
#14 [ffff881ec7457a80] rdma_create_qp at ffffffffa0676584 [rdma_cm]
#15 [ffff881ec7457aa8] kiblnd_create_conn at ffffffffa0aa6217 [ko2iblnd]
#16 [ffff881ec7457b28] kiblnd_passive_connect at ffffffffa0ab4041 [ko2iblnd]
#17 [ffff881ec7457bd0] kiblnd_cm_callback at ffffffffa0ab5405 [ko2iblnd]
#18 [ffff881ec7457c38] cma_req_handler at ffffffffa067ae87 [rdma_cm]
#19 [ffff881ec7457ce8] create_client at ffffffffa064c755 [ib_cm]
#20 [ffff881ec7457d28] encode_cb_sequence4args at ffffffffa064d74d [ib_cm]
#21 [ffff881ec7457da0] nfsd4_process_cb_update at ffffffffa064ddc5 [ib_cm]
#22 [ffff881ec7457e20] process_one_work at ffffffff810a845b
#23 [ffff881ec7457e68] worker_thread at ffffffff810a9296
#24 [ffff881ec7457ec8] kthread at ffffffff810b0a4f
#25 [ffff881ec7457f50] ret_from_fork at ffffffff816977d8

But it seems there is something wrong with the symbols/debuginfo in this second decoding, as some of the return addresses do not match the stack unwinder's output; for example, the 0xffffffffa031e87f address is resolved as <hfi1_rc_send_complete+335>!!

But anyway, we are still in the OPA/Omni-Path Host Fabric Interface driver, and maybe the decoding issue comes from the fact that you are using a different driver than the one shipped with your kernel?
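
One way to double-check, assuming the matching hfi1.ko (and its debuginfo) for the module actually loaded on the node is at hand, would be to load its symbols explicitly in crash and re-resolve the address (the path below is hypothetical):

    crash> mod -s hfi1 /path/to/matching/hfi1.ko
    crash> sym ffffffffa031e87f
    crash> dis -rl ffffffffa031e87f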

And the last stack frame/routine before the allocation also looks very interesting:

   0xffffffffa034af90:  callq  0xffffffff811dd7a0 <__kmalloc>
   0xffffffffa034af95:  mov    %r12d,%esi
   0xffffffffa034af98:  mov    %r15,%rdi
   0xffffffffa034af9b:  mov    %rax,0x40(%rbx)
   0xffffffffa034af9f:  callq  0xffffffff811dd7a0 <__kmalloc>
   0xffffffffa034afa4:  lea    0x0(,%r14,4),%rdi
   0xffffffffa034afac:  mov    %rax,0x48(%rbx)
   0xffffffffa034afb0:  mov    %r12d,%esi
   0xffffffffa034afb3:  callq  0xffffffff811dd7a0 <__kmalloc>
   0xffffffffa034afb8:  cmpq   $0x0,0x30(%rbx)

It seems I will need to find a way to check the OPA/hfi1 kmem allocation requirements...

Part 3

Hmm, OK, some routines from the hfi1 module can now be disassembled more easily, like this hfi1_kern_exp_rcv_alloc_flows() routine, which does a bunch of allocations right from its very beginning:

Dump of assembler code for function hfi1_kern_exp_rcv_alloc_flows:
   0xffffffffa034aee0 <+0>:     nopl   0x0(%rax,%rax,1)
   0xffffffffa034aee5 <+5>:     push   %rbp
   0xffffffffa034aee6 <+6>:     mov    $0x8,%eax
   0xffffffffa034aeeb <+11>:    mov    %rsp,%rbp
   0xffffffffa034aeee <+14>:    push   %r15
   0xffffffffa034aef0 <+16>:    push   %r14
   0xffffffffa034aef2 <+18>:    push   %r13
   0xffffffffa034aef4 <+20>:    push   %r12
   0xffffffffa034aef6 <+22>:    mov    %esi,%r12d
   0xffffffffa034aef9 <+25>:    or     $0x8000,%r12d
   0xffffffffa034af00 <+32>:    push   %rbx
   0xffffffffa034af01 <+33>:    mov    %r12d,%esi
   0xffffffffa034af04 <+36>:    sub    $0x8,%rsp
   0xffffffffa034af08 <+40>:    mov    %ax,0x50(%rdi)
   0xffffffffa034af0c <+44>:    mov    %rdi,-0x30(%rbp)
   0xffffffffa034af10 <+48>:    mov    $0x3c0,%edi  <<<<<<< 960 Bytes
   0xffffffffa034af15 <+53>:    callq  0xffffffff811dd7a0 <__kmalloc>
   0xffffffffa034af1a <+58>:    test   %rax,%rax
   0xffffffffa034af1d <+61>:    mov    -0x30(%rbp),%rdx
   0xffffffffa034af21 <+65>:    je     0xffffffffa034b010 <hfi1_kern_exp_rcv_alloc_flows+304>
   0xffffffffa034af27 <+71>:    xor    %r13d,%r13d
   0xffffffffa034af2a <+74>:    cmpw   $0x0,0x50(%rdx)
   0xffffffffa034af2f <+79>:    mov    %rax,0x18(%rdx)
   0xffffffffa034af33 <+83>:    jne    0xffffffffa034af44 <hfi1_kern_exp_rcv_alloc_flows+100>
   0xffffffffa034af35 <+85>:    jmpq   0xffffffffa034afeb <hfi1_kern_exp_rcv_alloc_flows+267>
   0xffffffffa034af3a <+90>:    nopw   0x0(%rax,%rax,1)
   0xffffffffa034af40 <+96>:    mov    0x18(%rdx),%rax
   0xffffffffa034af44 <+100>:   movslq %r13d,%rbx
   0xffffffffa034af47 <+103>:   mov    %r12d,%esi
   0xffffffffa034af4a <+106>:   mov    %rdx,-0x30(%rbp)
   0xffffffffa034af4e <+110>:   lea    0x0(,%rbx,8),%rcx
   0xffffffffa034af56 <+118>:   shl    $0x7,%rbx
   0xffffffffa034af5a <+122>:   sub    %rcx,%rbx
   0xffffffffa034af5d <+125>:   add    %rax,%rbx
   0xffffffffa034af60 <+128>:   mov    0x2c0e6(%rip),%eax        # 0xffffffffa037704c <hfi1_tid_rdma_seg_max_size>
   0xffffffffa034af66 <+134>:   shr    $0xc,%eax   <<<<<< 0x40
   0xffffffffa034af69 <+137>:   mov    %eax,%ecx
   0xffffffffa034af6b <+139>:   and    $0x1,%ecx
   0xffffffffa034af6e <+142>:   lea    (%rax,%rcx,1),%r14d <<<<< 0x40
   0xffffffffa034af72 <+146>:   lea    0x0(,%r14,8),%rdi  <<<<<< 0x40 * 8 = 512 Bytes
   0xffffffffa034af7a <+154>:   mov    %r14,%r15
   0xffffffffa034af7d <+157>:   shl    $0x4,%r15  <<<<<< 0x400
   0xffffffffa034af81 <+161>:   callq  0xffffffff811dd7a0 <__kmalloc>
   0xffffffffa034af86 <+166>:   mov    %r12d,%esi
   0xffffffffa034af89 <+169>:   mov    %r15,%rdi <<<<<< 1024 Bytes
   0xffffffffa034af8c <+172>:   mov    %rax,0x30(%rbx)
   0xffffffffa034af90 <+176>:   callq  0xffffffff811dd7a0 <__kmalloc>
   0xffffffffa034af95 <+181>:   mov    %r12d,%esi
   0xffffffffa034af98 <+184>:   mov    %r15,%rdi <<<<<< 1024 Bytes
   0xffffffffa034af9b <+187>:   mov    %rax,0x40(%rbx)
   0xffffffffa034af9f <+191>:   callq  0xffffffff811dd7a0 <__kmalloc>
   0xffffffffa034afa4 <+196>:   lea    0x0(,%r14,4),%rdi <<<<<<<< 0x40 * 4 = 256 Bytes
   0xffffffffa034afac <+204>:   mov    %rax,0x48(%rbx)
   0xffffffffa034afb0 <+208>:   mov    %r12d,%esi
   0xffffffffa034afb3 <+211>:   callq  0xffffffff811dd7a0 <__kmalloc>
   0xffffffffa034afb8 <+216>:   cmpq   $0x0,0x30(%rbx)
   0xffffffffa034afbd <+221>:   mov    %rax,0x58(%rbx)
   0xffffffffa034afc1 <+225>:   mov    -0x30(%rbp),%rdx
   0xffffffffa034afc5 <+229>:   je     0xffffffffa034b000 <hfi1_kern_exp_rcv_alloc_flows+288>
   0xffffffffa034afc7 <+231>:   cmpq   $0x0,0x40(%rbx)
   0xffffffffa034afcc <+236>:   je     0xffffffffa034b000 <hfi1_kern_exp_rcv_alloc_flows+288>
   0xffffffffa034afce <+238>:   test   %rax,%rax
   0xffffffffa034afd1 <+241>:   je     0xffffffffa034b000 <hfi1_kern_exp_rcv_alloc_flows+288>
   0xffffffffa034afd3 <+243>:   cmpq   $0x0,0x48(%rbx)
   0xffffffffa034afd8 <+248>:   je     0xffffffffa034b000 <hfi1_kern_exp_rcv_alloc_flows+288>
   0xffffffffa034afda <+250>:   movzwl 0x50(%rdx),%eax
   0xffffffffa034afde <+254>:   add    $0x1,%r13d
   0xffffffffa034afe2 <+258>:   cmp    %r13d,%eax
   0xffffffffa034afe5 <+261>:   jg     0xffffffffa034af40 <hfi1_kern_exp_rcv_alloc_flows+96>
   0xffffffffa034afeb <+267>:   add    $0x8,%rsp
   0xffffffffa034afef <+271>:   xor    %eax,%eax
   0xffffffffa034aff1 <+273>:   pop    %rbx
   0xffffffffa034aff2 <+274>:   pop    %r12
   0xffffffffa034aff4 <+276>:   pop    %r13
   0xffffffffa034aff6 <+278>:   pop    %r14
   0xffffffffa034aff8 <+280>:   pop    %r15
   0xffffffffa034affa <+282>:   pop    %rbp
   0xffffffffa034affb <+283>:   retq   
   0xffffffffa034affc <+284>:   nopl   0x0(%rax)
   0xffffffffa034b000 <+288>:   mov    %rbx,%rdi
   0xffffffffa034b003 <+291>:   mov    %rdx,-0x30(%rbp)
   0xffffffffa034b007 <+295>:   callq  0xffffffffa0348560 <hfi1_kern_exp_rcv_dealloc>
   0xffffffffa034b00c <+300>:   mov    -0x30(%rbp),%rdx
   0xffffffffa034b010 <+304>:   mov    %rdx,%rdi
   0xffffffffa034b013 <+307>:   callq  0xffffffffa034ae60 <hfi1_kern_exp_rcv_free_flows>
   0xffffffffa034b018 <+312>:   add    $0x8,%rsp
   0xffffffffa034b01c <+316>:   mov    $0xfffffff4,%eax
   0xffffffffa034b021 <+321>:   pop    %rbx
   0xffffffffa034b022 <+322>:   pop    %r12
   0xffffffffa034b024 <+324>:   pop    %r13
   0xffffffffa034b026 <+326>:   pop    %r14
   0xffffffffa034b028 <+328>:   pop    %r15
   0xffffffffa034b02a <+330>:   pop    %rbp
   0xffffffffa034b02b <+331>:   retq   

with the <hfi1_tid_rdma_seg_max_size> value:
ffffffffa037704c:  00040000                              ....

And all these allocations are of 1024/512/256 bytes, exactly the object sizes of the three kmem_caches hogging the whole memory!
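
Decoding that raw value as a quick check: 0x00040000 = 256 KiB, so npages = 0x40000 >> 12 (PAGE_SHIFT) = 0x40 = 64 pages (already even, so the driver's round-up to an even count changes nothing), and 64 * {8, 16, 16, 4} bytes gives exactly the 512/1024/1024/256-byte requests flagged in the disassembly.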

And this is confirmed by the corresponding hfi1 source code:

#define TID_RDMA_MAX_SEGMENT_SIZE       BIT(18)   /* 256 KiB (for now) */
unsigned int hfi1_tid_rdma_seg_max_size = TID_RDMA_MAX_SEGMENT_SIZE;

#define TID_RDMA_MAX_READ_SEGS_PER_REQ 6
#define TID_RDMA_MAX_WRITE_SEGS_PER_REQ 2

struct tid_rdma_flow {
    int idx;
    u32 tid_qpn;
    struct flow_state flow_state;
    struct tid_rdma_request *req;
    struct page **pages;
    u32 npages;
    u32 npagesets;
    struct tid_rdma_pageset *pagesets;
    struct kern_tid_node *tnode;
    u32 tnode_cnt;
    u32 tidcnt;
    u32 *tid_entry;
    u32 npkts;
    u32 pkt;
    u32 tid_idx;
    u32 tid_offset;
    u32 length;
    u32 sent;
}
SIZE: 120

struct tid_rdma_pageset {
    dma_addr_t addr;
    u16 idx;
    u16 count;
}
SIZE: 16

struct kern_tid_node {
    struct tid_group *grp;
    u8 map;
    u8 cnt;
}
SIZE: 16

static void hfi1_kern_exp_rcv_dealloc(struct tid_rdma_flow *flow)
{
        kfree(flow->pages);
        kfree(flow->pagesets);
        kfree(flow->tid_entry);
        kfree(flow->tnode);
        flow->pages = NULL;
        flow->pagesets = NULL;
        flow->tid_entry = NULL;
        flow->tnode = NULL;
}

static int hfi1_kern_exp_rcv_alloc(struct tid_rdma_flow *flow, gfp_t gfp)
{
        u32 npages;

        npages = hfi1_tid_rdma_seg_max_size >> PAGE_SHIFT; <<<<< 0x40
        if (npages & 1)
                npages++;
        /*
         * Worst case allocations: in the worst case there are no contiguous
         * physical chunks so there are N (= npages) pagesets and TID entries,
         * also in the worst case each TID comes from a separate TID group so
         * there are N TID groups and therefore N TID nodes
         */
        flow->pages = kcalloc(npages, sizeof(*flow->pages), gfp);  <<<<< 0x40*8=512
        flow->pagesets = kcalloc(npages, sizeof(*flow->pagesets), gfp); <<<< 0x40*16=1024
        flow->tnode = kcalloc(npages, sizeof(*flow->tnode), gfp); <<<< 0x40*16=1024
        flow->tid_entry = kcalloc(npages, sizeof(*flow->tid_entry), gfp); <<<< 0x40*4=256
        if (!flow->pages || !flow->pagesets || !flow->tid_entry || !flow->tnode)
                goto nomem;

        return 0;
nomem:  
        hfi1_kern_exp_rcv_dealloc(flow);
        return -ENOMEM;
}

/* Called at QP destroy time to free TID RDMA resources */
void hfi1_kern_exp_rcv_free_flows(struct tid_rdma_request *req)
{
        int i;

        for (i = 0; req->flows && i < req->n_max_flows; i++)
                hfi1_kern_exp_rcv_dealloc(&req->flows[i]);

        kfree(req->flows);
        req->flows = NULL;
        req->n_max_flows = 0;
        req->n_flows = 0;
}

/*
 * This is called at QP create time to allocate resources for TID RDMA
 * segments/flows. This is done to keep all required memory pre-allocated and
 * avoid memory allocation in the data path.
 */
int hfi1_kern_exp_rcv_alloc_flows(struct tid_rdma_request *req, gfp_t gfp)
{
        struct tid_rdma_flow *flows;
        int i, ret;
        u16 nflows;

        /* Size of the flow circular buffer is the next higher power of 2 */
        nflows = max_t(u16, TID_RDMA_MAX_READ_SEGS_PER_REQ,
                       TID_RDMA_MAX_WRITE_SEGS_PER_REQ);  <<<<<<<< 6
        req->n_max_flows = roundup_pow_of_two(nflows + 1); <<<<<<<<< 8
        flows = kcalloc(req->n_max_flows, sizeof(*flows), gfp); <<<<<< 8*120=960
        if (!flows) {
                ret = -ENOMEM;
                goto err;
        }
        req->flows = flows;

        for (i = 0; i < req->n_max_flows; i++) {
                ret = hfi1_kern_exp_rcv_alloc(&req->flows[i], gfp);
                if (ret)
                        goto err;
        }
        return 0;
err:
        hfi1_kern_exp_rcv_free_flows(req);
        return ret;
}
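
To put numbers on this, here is a quick user-space sketch of the same arithmetic (my own back-of-the-envelope check, not driver code; it just replays the sizes derived above):

    /* tid_rdma_math.c - replay the TID RDMA per-request pre-allocation
     * arithmetic with the values quoted above.
     * Build with: gcc -Wall tid_rdma_math.c && ./a.out */
    #include <stdio.h>

    int main(void)
    {
            unsigned long npages = 0x40000 >> 12;   /* seg_max_size >> PAGE_SHIFT = 0x40 */
            unsigned long n_max_flows = 8;          /* roundup_pow_of_two(max(6, 2) + 1) */

            if (npages & 1)                         /* driver rounds up to an even count */
                    npages++;

            unsigned long per_flow = npages * 8     /* pages:     64 * 8  ->  512 (kmalloc-512)  */
                                   + npages * 16    /* pagesets:  64 * 16 -> 1024 (kmalloc-1024) */
                                   + npages * 16    /* tnode:     64 * 16 -> 1024 (kmalloc-1024) */
                                   + npages * 4;    /* tid_entry: 64 * 4  ->  256 (kmalloc-256)  */
            unsigned long per_req = n_max_flows * per_flow
                                  + n_max_flows * 120;  /* the flows array itself: 8 * 120 = 960 */

            printf("per flow: %lu bytes, per request: %lu bytes\n", per_flow, per_req);
            return 0;
    }

which prints "per flow: 2816 bytes, per request: 23488 bytes", i.e. almost 23 KiB pre-allocated per tid_rdma_request, spread over exactly the three kmem_caches from Part 1 (kmalloc-1024 being hit twice per flow).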

So now, my first guess is that this OOM situation could come from a huge number of OPA QPs being required (>5M!!!).

Part 4

After spending more time looking into the OPA driver source code, I agree with Amir that these huge kmem allocations (which I had already suspected, as per my previous crash-dump analysis) can be avoided if TID-RDMA is disabled:

C symbol: qp_priv_init

  File           Function                Line
0 hfi1/qp.h      <global>                 185 int qp_priv_init(struct rvt_dev_info *rdi, struct rvt_qp *qp,
1 rdma/rdma_vt.h <global>                 239 int (*qp_priv_init)(struct rvt_dev_info *rdi, struct rvt_qp *qp,
2 hfi1/qp.c      qp_priv_init             930 int qp_priv_init(struct rvt_dev_info *rdi, struct rvt_qp *qp,
3 hfi1/verbs.c   hfi1_register_ib_device 1986 dd->verbs_dev.rdi.driver_f.qp_priv_init = qp_priv_init;
4 rdmavt/qp.c    rvt_create_qp            836 if (rdi->driver_f.qp_priv_init) {
5 rdmavt/qp.c    rvt_create_qp            837 err = rdi->driver_f.qp_priv_init(rdi, qp, init_attr,

/**
 * rvt_create_qp - create a queue pair for a device
 * @ibpd: the protection domain who's device we create the queue pair for
 * @init_attr: the attributes of the queue pair
 * @udata: user data for libibverbs.so
 *
 * Queue pair creation is mostly an rvt issue. However, drivers have their own
 * unique idea of what queue pair numbers mean. For instance there is a reserved
 * range for PSM.
 *
 * Return: the queue pair on success, otherwise returns an errno.
 *
 * Called by the ib_create_qp() core verbs function.
 */
struct ib_qp *rvt_create_qp(struct ib_pd *ibpd,
                            struct ib_qp_init_attr *init_attr,
                            struct ib_udata *udata)
{
................
                        return ERR_PTR(-EINVAL);
        case IB_QPT_UC:
        case IB_QPT_RC:
        case IB_QPT_UD:
.................
                if (rdi->driver_f.qp_priv_init) {
                        err = rdi->driver_f.qp_priv_init(rdi, qp, init_attr,
                                                         gfp);
                        if (err) {
                                ret = ERR_PTR(err);
                                goto bail_rq_wq;
                        }
                }
..................

C symbol: hfi1_kern_exp_rcv_alloc_flows

  File            Function                      Line
0 hfi1/tid_rdma.h <global>                       302 int hfi1_kern_exp_rcv_alloc_flows(struct tid_rdma_request *req, gfp_t gfp);
1 hfi1/qp.c       qp_priv_init                   962 ret = hfi1_kern_exp_rcv_alloc_flows(&priv->tid_req,
2 hfi1/qp.c       qp_priv_init                   989 ret = hfi1_kern_exp_rcv_alloc_flows(&priv->tid_req,
3 hfi1/tid_rdma.c hfi1_kern_exp_rcv_alloc_flows 1564 int hfi1_kern_exp_rcv_alloc_flows(struct tid_rdma_request *req, gfp_t gfp)

int qp_priv_init(struct rvt_dev_info *rdi, struct rvt_qp *qp,
                 struct ib_qp_init_attr *init_attr, gfp_t gfp)
{
        struct hfi1_qp_priv *qpriv = qp->priv;
        int i, ret;

        qpriv->rcd = qp_to_rcd(rdi, qp);
        spin_lock_init(&qpriv->opfn.lock);
        INIT_WORK(&qpriv->opfn.opfn_work, opfn_send_conn_request);
        INIT_WORK(&qpriv->tid_rdma.trigger_work, tid_rdma_trigger_resume);
        qpriv->r_tid_tail = qp->s_tail_ack_queue;
        qpriv->flow_state.psn = 0;
        qpriv->flow_state.index = RXE_NUM_TID_FLOWS;
        qpriv->flow_state.generation = 0;
        qpriv->s_state = TID_OP(WRITE_RESP);
        qpriv->s_ack_state = TID_OP(WRITE_DATA);
        qpriv->s_tid_cur = HFI1_QP_WQE_INVALID;
        qpriv->s_tid_tail = HFI1_QP_WQE_INVALID;
        qpriv->r_tid_tail = HFI1_QP_WQE_INVALID;
        qpriv->r_tid_ack = HFI1_QP_WQE_INVALID;
        atomic_set(&qpriv->n_requests, 0);
        atomic_set(&qpriv->n_tid_requests, 0);

        if (init_attr->qp_type == IB_QPT_RC && HFI1_CAP_IS_KSET(TID_RDMA)) { <<<<<<<< TID_RDMA must be configured !!!!
                for (i = 0; i < qp->s_size; i++) {
                        struct hfi1_swqe_priv *priv;
                        struct rvt_swqe *wqe = rvt_get_swqe_ptr(qp, i);

                        priv = kzalloc(sizeof(*priv), gfp);
                        if (!priv)
                                return -ENOMEM;

                        ret = hfi1_kern_exp_rcv_alloc_flows(&priv->tid_req,
                                                            gfp);
.......................

Part 5

Matt, TID_RDMA refers to "Accelerated RDMA", and Wojciech is correct: in order to disable TID_RDMA, you should remove the "cap_mask=0x4c09a01cbba" setting from the /etc/modprobe.d/hfi1.conf file on all fabric nodes.
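
For reference, that setting is a standard modprobe options line, i.e. something like the following in /etc/modprobe.d/hfi1.conf (the exact file contents may differ from site to site):

    options hfi1 cap_mask=0x4c09a01cbba

and once the cap_mask= part is removed and the module reloaded (or the node rebooted), hfi1 falls back to its built-in default capabilities, with TID_RDMA no longer set.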

Also, coming back to my earlier evaluation of the number of QPs: I have done more OPA driver source code analysis and value retrieval inside the crash-dump, and it appears that, as per the default/current parameter settings, the driver in fact allocates the set of 8*(2*1024 + 512 + 256) bytes that I had already identified more than 8K times per QP:

int qp_priv_init(struct rvt_dev_info *rdi, struct rvt_qp *qp,
                 struct ib_qp_init_attr *init_attr, gfp_t gfp)
{
......................
        if (init_attr->qp_type == IB_QPT_RC && HFI1_CAP_IS_KSET(TID_RDMA)) {
                for (i = 0; i < qp->s_size; i++) {
                        struct hfi1_swqe_priv *priv;
                        struct rvt_swqe *wqe = rvt_get_swqe_ptr(qp, i);

                        priv = kzalloc(sizeof(*priv), gfp);
                        if (!priv)
                                return -ENOMEM;

                        ret = hfi1_kern_exp_rcv_alloc_flows(&priv->tid_req, <<<<<<<<<<
                                                            gfp);
                        if (ret)
                                return ret;

                        /*
                         * Initialize various TID RDMA request variables.
                         * These variables are "static", which is why they
                         * can be pre-initialized here before the WRs has
                         * even been submitted.
                         * However, non-NULL values for these variables do not
                         * imply that this WQE has been enabled for TID RDMA.
                         * Drivers should check the WQE's opcode to determine
                         * if a request is a TID RDMA one or not.
                         */
                        priv->tid_req.qp = qp;
                        priv->tid_req.rcd = qpriv->rcd;
                        priv->tid_req.e.swqe = wqe;
                        wqe->priv = priv;
                }
                for (i = 0; i < rvt_max_atomic(rdi); i++) {
                        struct hfi1_ack_priv *priv;

                        priv = kzalloc(sizeof(*priv), gfp);
                        if (!priv)
                                return -ENOMEM;

                        ret = hfi1_kern_exp_rcv_alloc_flows(&priv->tid_req, <<<<<<<<<
                                                            gfp);
                        if (ret)
                                return ret;

                        priv->tid_req.qp = qp;
                        priv->tid_req.rcd = qpriv->rcd;
                        priv->tid_req.e.ack = &qp->s_ack_queue[i];
                        qp->s_ack_queue[i].priv = priv;
                }
        }
......................

with
    qp->s_size = sqsize = init_attr->cap.max_send_wr + 1 + rdi->dparms.reserved_operations;
where
    reserved_operations = 0x1
    max_send_wr = 0x2100, 

and with
static inline unsigned int rvt_max_atomic(struct rvt_dev_info *rdi)
{
        return rdi->dparms.max_rdma_atomic +
                rdi->dparms.extra_rdma_atomic + 1;
}
where
    max_rdma_atomic = 0x10, 
    extra_rdma_atomic = 0x8, 
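
Plugging the retrieved values in: qp->s_size = 0x2100 + 1 + 0x1 = 0x2102 = 8450, and rvt_max_atomic() = 0x10 + 0x8 + 1 = 25, so each RC QP runs hfi1_kern_exp_rcv_alloc_flows() 8450 + 25 = 8475 times, hence the ">8K times" figure above.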

So this finally leads to only <600 QPs created, which looks much more like a normal value for the potential number of clients (455 < x < 455+768) at the time of the crash.

What looks quite huge is the >182 MiB allocated for each QP when TID_RDMA is enabled, again if I am right in my analysis.
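
(Checking that figure: 8475 requests * 8 flows * (2*1024 + 512 + 256) bytes = 8475 * 22528 = 190,924,800 bytes ~= 182 MiB per QP, not even counting the 960-byte flows arrays and the hfi1_swqe_priv/hfi1_ack_priv structs themselves.)
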
Maybe there are some tunables that could be adjusted to make the TID_RDMA feature usable on a server that must support a lot of concurrent clients/connections, like reducing TID_RDMA_MAX_READ_SEGS_PER_REQ/TID_RDMA_MAX_WRITE_SEGS_PER_REQ to limit the n_max_flows value, or a way to limit max_send_wr (which looks to be inherited from the upper layers)?
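
For instance (hypothetical numbers, untested): reducing TID_RDMA_MAX_READ_SEGS_PER_REQ from 6 to 3 would give n_max_flows = roundup_pow_of_two(3 + 1) = 4 instead of 8, halving the per-request pre-allocation, and halving max_send_wr would roughly halve the per-QP total again.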

