Overview
The o2iblnd module manages reliable connections using the verbs API, and performs RDMA sends and receives. It interfaces with any hardware that supports verbs, such as Mellanox HCAs (mlx4/mlx5), RoCE and OPA.
Network Startup
When an LNet network is first added, a device is specified by name, for example ib0. The device is looked up by that name and its failover status is determined.
struct net_device *netdev;

netdev = dev_get_by_name(&init_net, ifname);
if (netdev == NULL) {
	dev->ibd_can_failover = 0;
} else {
	dev->ibd_can_failover = !!(netdev->flags & IFF_MASTER);
	dev_put(netdev);
}
Create an RDMA ID
#define kiblnd_rdma_create_id(cb, dev, ps, qpt) \
	rdma_create_id(current->nsproxy->net_ns, cb, dev, ps, qpt)

cmid = kiblnd_rdma_create_id(kiblnd_cm_callback, dev, RDMA_PS_TCP,
			     IB_QPT_RC);
Bind to the IP address. We use IP over IB (IPoIB) for connection establishment.
rc = rdma_bind_addr(cmid, (struct sockaddr *)&addr);
if (rc != 0 || cmid->device == NULL) {
	CERROR("Failed to bind %s:%pI4h to device(%p): %d\n",
	       dev->ibd_ifname, &dev->ibd_ifip,
	       cmid->device, rc);
	rdma_destroy_id(cmid);
	goto out;
}
Protection domain creation
#ifdef HAVE_IB_ALLOC_PD_2ARGS
	pd = ib_alloc_pd(cmid->device, 0);
#else
	pd = ib_alloc_pd(cmid->device);
#endif
Listen for RDMA connections
rc = rdma_listen(cmid, 0);
Memory Registration
FMR or FastReg memory pools are allocated on startup
For FMR pool allocation:
struct ib_fmr_pool_param param = {
	.max_pages_per_fmr = LNET_MAX_IOV,
	.page_shift        = PAGE_SHIFT,
	.access            = (IB_ACCESS_LOCAL_WRITE |
			      IB_ACCESS_REMOTE_WRITE),
	.pool_size         = fps->fps_pool_size,
	.dirty_watermark   = fps->fps_flush_trigger,
	.flush_function    = NULL,
	.flush_arg         = NULL,
	.cache             = !!fps->fps_cache
};

fpo->fmr.fpo_fmr_pool = ib_create_fmr_pool(fpo->fpo_hdev->ibh_pd, &param);
For FastReg allocation:
#ifndef HAVE_IB_MAP_MR_SG
		frd->frd_frpl = ib_alloc_fast_reg_page_list(fpo->fpo_hdev->ibh_ibdev,
							    LNET_MAX_IOV);
		if (IS_ERR(frd->frd_frpl)) {
			rc = PTR_ERR(frd->frd_frpl);
			CERROR("Failed to allocate ib_fast_reg_page_list: %d\n",
			       rc);
			frd->frd_frpl = NULL;
			goto out_middle;
		}
#endif

#ifdef HAVE_IB_ALLOC_FAST_REG_MR
		frd->frd_mr = ib_alloc_fast_reg_mr(fpo->fpo_hdev->ibh_pd,
						   LNET_MAX_IOV);
#else
		/*
		 * it is expected to get here if this is an MLX-5 card.
		 * MLX-4 cards will always use FMR and MLX-5 cards will
		 * always use fast_reg. It turns out that some MLX-5 cards
		 * (possibly due to older FW versions) do not natively support
		 * gaps. So we will need to track them here.
		 */
		frd->frd_mr = ib_alloc_mr(fpo->fpo_hdev->ibh_pd,
#ifdef IB_MR_TYPE_SG_GAPS
					  ((*kiblnd_tunables.kib_use_fastreg_gaps == 1) &&
					   (dev_caps & IBLND_DEV_CAPS_FASTREG_GAPS_SUPPORT)) ?
						IB_MR_TYPE_SG_GAPS :
						IB_MR_TYPE_MEM_REG,
#else
						IB_MR_TYPE_MEM_REG,
#endif
					  LNET_MAX_IOV);
		if ((*kiblnd_tunables.kib_use_fastreg_gaps == 1) &&
		    (dev_caps & IBLND_DEV_CAPS_FASTREG_GAPS_SUPPORT))
			CWARN("using IB_MR_TYPE_SG_GAPS, expect a performance drop\n");
#endif
		if (IS_ERR(frd->frd_mr)) {
			rc = PTR_ERR(frd->frd_mr);
			CERROR("Failed to allocate ib_fast_reg_mr: %d\n", rc);
			frd->frd_mr = NULL;
			goto out_middle;
		}
Active Connection Establishment
Once the groundwork is laid, the LND waits either for requests to perform RDMA operations or for remote connections. The former case is called Active Connection Establishment; this section gives an overview of how it works in the code. The latter is called Passive Connection Establishment and is described in the following section.
When a higher layer requests an RDMA operation, an IOV is passed to the LND. The LND needs to map the memory to be RDMAed in preparation for posting. The maximum RDMA operation the LND performs is 1MB, broken into 256 work requests of 4K each (the page size on x86-64 systems).
The code can be followed here:
kiblnd_setup_rd_iov() or kiblnd_setup_rd_kiov()
Once the memory to be RDMAed is mapped properly (the mapping depends on whether we use FMR or FastReg), the connection establishment process commences.
Step 1: resolve address:
rc = rdma_resolve_addr(cmid,
		       (struct sockaddr *)&srcaddr,
		       (struct sockaddr *)&dstaddr,
		       lnet_get_lnd_timeout() * 1000);
Once we receive RDMA_CM_EVENT_ADDR_RESOLVED we proceed to step 2, resolve route:
rc = rdma_resolve_route(cmid, lnet_get_lnd_timeout() * 1000);
On RDMA_CM_EVENT_ROUTE_RESOLVED we move to step 3: create the CQ and the QP.
#ifdef HAVE_IB_CQ_INIT_ATTR
	cq_attr.cqe = IBLND_CQ_ENTRIES(conn);
	cq_attr.comp_vector = kiblnd_get_completion_vector(conn, cpt);
	cq = ib_create_cq(cmid->device,
			  kiblnd_cq_completion, kiblnd_cq_event, conn,
			  &cq_attr);
#else
	cq = ib_create_cq(cmid->device,
			  kiblnd_cq_completion, kiblnd_cq_event, conn,
			  IBLND_CQ_ENTRIES(conn),
			  kiblnd_get_completion_vector(conn, cpt));
#endif
	...
	rc = ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
	...
	init_qp_attr->event_handler = kiblnd_qp_event;
	init_qp_attr->qp_context = conn;
	init_qp_attr->cap.max_send_sge = *kiblnd_tunables.kib_wrq_sge;
	init_qp_attr->cap.max_recv_sge = 1;
	init_qp_attr->sq_sig_type = IB_SIGNAL_REQ_WR;
	init_qp_attr->qp_type = IB_QPT_RC;
	init_qp_attr->send_cq = cq;
	init_qp_attr->recv_cq = cq;

	conn->ibc_sched = sched;

	do {
		init_qp_attr->cap.max_send_wr = kiblnd_send_wrs(conn);
		init_qp_attr->cap.max_recv_wr = IBLND_RECV_WRS(conn);

		rc = rdma_create_qp(cmid, conn->ibc_hdev->ibh_pd, init_qp_attr);
		if (!rc || conn->ibc_queue_depth < 2)
			break;

		conn->ibc_queue_depth--;
	} while (rc);
The LND has its own protocol, in which messages are exchanged to determine the size of the RDMA that is about to happen. Once that is determined, we initialize the RDMA operation.
wrq->wr.next       = &(wrq + 1)->wr;
wrq->wr.wr_id      = kiblnd_ptr2wreqid(tx, IBLND_WID_RDMA);
wrq->wr.sg_list    = sge;
wrq->wr.opcode     = IB_WR_RDMA_WRITE;
wrq->wr.send_flags = 0;
/* see kiblnd_init_rdma() for more details */
We then post the work request on the QP:
rc = ib_post_send(conn->ibc_cmid->qp, wr, &bad);
Note that the LND never does an RDMA read; it only does RDMA writes. This is due to historical limitations, which may no longer apply to the latest hardware.
Passive Connection Establishment
When the LND receives RDMA_CM_EVENT_CONNECT_REQUEST it proceeds to create the passive side of the connection. It creates the CQ and QP in essentially the same way as shown in the previous section.
Receiving RDMA
When an LND connection is created, a number of buffers, each 4K in size, are posted to the QP to receive incoming RDMAs. Receiving and sending messages in the LND is governed by a credit system to ensure that peers do not overflow the buffers posted on the QP.
Notes on RDMA and QP Timeouts