LU-9983 Problem Statement

Changes in OSP caused non-contiguous buffers to be sent to the o2iblnd. For FMR and FastReg (FRMR), the o2iblnd assumed that only the first fragment can have an offset. Each fragment has a maximum size of PAGE_SIZE (4K on x86_64 systems). Offsets in other fragments were supported only for global memory regions, where each fragment is described to the peer in the kib_rdma_desc_t structure. For FMR and FRMR, however, the o2iblnd used only 1 fragment and relied on FMR/FRMR to understand the rest of the fragments.

FMR does not support non-contiguous data (as far as I know). FRMR supports gaps in the buffers to be RDMAed only if IB_MR_TYPE_SG_GAPS is used when allocating memory regions. However, Cray testing showed that using this option introduced a memory drop.

Originally, LNet did no sanity checking on the buffers before attempting to RDMA them. As a result, when using FMR/FRMR, not all the data would be RDMAed properly and we would end up with strange corrupted data.

This resulted in several fixes which introduced some major behavior changes to the o2iblnd:

The above diagram represents the issue at play. Page 3 is not the last page, yet it has a gap. When this buffer gets RDMAed using FMR/FRMR, some data is lost from page 4 (need to confirm).
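The contiguity rule described above (only the first fragment may carry an offset, and only the last fragment may end short of a page boundary) can be sketched as a small check. This is an illustrative sketch only, not the o2iblnd code: the struct sketch_frag type, the sketch_frags_have_gaps() function and the SKETCH_PAGE_SIZE constant are hypothetical names, and a 4K page size is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u /* assumption: 4K pages, as on x86_64 */

/* Hypothetical fragment descriptor: one entry per page of the buffer. */
struct sketch_frag {
	unsigned int offset; /* byte offset of the data within its page */
	unsigned int len;    /* number of bytes used in that page */
};

/*
 * Return true if the fragment list contains a gap, i.e. it cannot be
 * described by a single starting offset plus a total length:
 *   - every fragment except the first must start at offset 0
 *   - every fragment except the last must run to the end of its page
 */
bool sketch_frags_have_gaps(const struct sketch_frag *frags, size_t nfrags)
{
	size_t i;

	assert(frags != NULL || nfrags == 0);

	for (i = 0; i < nfrags; i++) {
		/* a fragment never extends past its own page */
		assert(frags[i].offset + frags[i].len <= SKETCH_PAGE_SIZE);

		if (i > 0 && frags[i].offset != 0)
			return true;
		if (i + 1 < nfrags &&
		    frags[i].offset + frags[i].len != SKETCH_PAGE_SIZE)
			return true;
	}
	return false;
}
```

With four fragments where the third holds only 2048 bytes, the check reports a gap, which matches the page 3 situation described above. A buffer that starts at offset 100 in its first page and ends partway through its last page is still reported as contiguous.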

Overview

To understand the above fixes we need to have an understanding of how memory is mapped by the o2iblnd.

MD memory passed down to the LND in IOV or KIOV needs to be mapped in order for it to be DMAed.

Let's look at the GET case for IOV with FMR:

One difference between FMR and FastReg is that in FMR the pages are mapped to the FMR pool, so the rd needs to be set to relative addresses. In the FastReg case, wr->wr.wr.fast_reg.page_list points to the physical pages, so the source rd must describe the actual page addresses, not just a relative offset.
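The difference in how the single rd fragment is addressed can be sketched as follows. This is a minimal illustration under stated assumptions, not the o2iblnd implementation: the names sketch_reg_mode, sketch_rdma_frag and sketch_first_frag() are hypothetical, a 4K page size is assumed, and "relative address" is taken here to mean the byte offset within the first mapped page.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ull /* assumption: 4K pages */
#define SKETCH_PAGE_MASK (~(SKETCH_PAGE_SIZE - 1))

/* Hypothetical registration modes, for illustration only. */
enum sketch_reg_mode { SKETCH_FMR, SKETCH_FASTREG };

/* Hypothetical stand-in for the single fragment carried in the rd. */
struct sketch_rdma_frag {
	uint64_t addr; /* address advertised to the peer */
	uint32_t nob;  /* total number of bytes covered */
};

/*
 * Build the single fragment advertised to the peer.
 *   FMR:     address relative to the mapping (offset in the first page)
 *   FastReg: actual DMA address of the start of the buffer
 */
struct sketch_rdma_frag sketch_first_frag(enum sketch_reg_mode mode,
					  uint64_t first_dma_addr,
					  uint32_t total_nob)
{
	struct sketch_rdma_frag frag;

	assert(total_nob > 0);

	if (mode == SKETCH_FMR)
		frag.addr = first_dma_addr & ~SKETCH_PAGE_MASK;
	else
		frag.addr = first_dma_addr;
	frag.nob = total_nob;
	return frag;
}
```

For a buffer whose first byte sits at DMA address 0x12345678, this sketch advertises 0x678 in the FMR case and the full 0x12345678 in the FastReg case. In both cases the fragment covers the total byte count of the transfer.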


Fast Registration

Peer 1 sends the RDMA descriptor (rd) with only 1 fragment and the dma_address of the memory. Peer 2 receives the rd and sets up its own memory for RDMA by mapping it (for the FastReg case). Peer 2 then does an RDMA write to the given remote address. It appears that for FastReg this remote address needs to be the actual DMA address on peer 1. A possible explanation is that when peer 1 maps the memory to DMA data in, it uses the starting address of the memory, and from there it knows how to place the data. So when peer 2 sends, it only needs to supply the start DMA address.

Resources

Slide deck - Verbs Overview