There are two parameters which are negotiated between peers on connection creation:

- The queue depth, derived from the peer_credits tunable.
- The number of supported fragments, derived from the map_on_demand tunable.
The negotiation logic works as follows (same applies to both parameters above):
1. The node sets peer_ni->ibp_max_frags to the configured map_on_demand value, or to IBLND_MAX_RDMA_FRAGS if not configured.
2. ibp_max_frags is copied to conn->ibc_max_frags, which in turn gets propagated to the connection parameters.
3. The peer compares the max_frags from the node with its own max_frags.
4. If the peer's max_frags is >= the node's, the connection is accepted.
5. If the peer's max_frags is < the node's, the connection is rejected and the peer's max_frags value is sent back to the node.
6. The node examines the max_frags set in the rejection; if it is <= the node's, the node re-initiates the connection using the peer's max_frags. Otherwise the connection can not be established.

map_on_demand usage

The map_on_demand value is not used when allocating global memory regions, FMR or FastReg buffers. In that case the maximum number of buffers is allocated.
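To make the handshake concrete, below is a minimal standalone sketch of the steps above in C. The helper names (node_max_frags(), peer_check_max_frags(), node_handle_reject()) and the local IBLND_MAX_RDMA_FRAGS definition are illustrative assumptions, not the actual o2iblnd functions.

    #include <stdbool.h>

    /* Illustrative stand-in for the compile-time maximum; not the real
     * header definition. */
    #define IBLND_MAX_RDMA_FRAGS 256

    /* Active side: choose the max_frags value to advertise.  A
     * map_on_demand of 0 means "not configured". */
    static int node_max_frags(int map_on_demand)
    {
            return map_on_demand ? map_on_demand : IBLND_MAX_RDMA_FRAGS;
    }

    /* Passive side: accept if our limit covers the node's request,
     * otherwise reject and report our own limit back. */
    static bool peer_check_max_frags(int node_frags, int peer_frags,
                                     int *rej_frags)
    {
            if (peer_frags >= node_frags)
                    return true;             /* accept the connection */
            *rej_frags = peer_frags;         /* reject with our value */
            return false;
    }

    /* Active side, on rejection: retry with the peer's value when it
     * is usable, otherwise give up. */
    static bool node_handle_reject(int rej_frags, int node_frags,
                                   int *retry_frags)
    {
            if (rej_frags > node_frags)
                    return false;            /* cannot be established */
            *retry_frags = rej_frags;        /* re-initiate connection */
            return true;
    }

    int main(void)
    {
            int node = node_max_frags(32);   /* node configured 32 */
            int peer = 16, rej, retry;

            if (!peer_check_max_frags(node, peer, &rej) &&
                node_handle_reject(rej, node, &retry))
                    node = retry;            /* connection retried at 16 */
            return node == 16 ? 0 : 1;
    }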
Under RHEL 7.3 and previous kernel versions, when HAVE_IB_GET_DMA_MR is defined, the global memory region is first examined in the call to kiblnd_find_rd_dma_mr(). This function checks whether map_on_demand is configured and whether the number of fragments to be RDMA written exceeds the map_on_demand value. If that's the case, then the global memory region can not be used and the code falls back to using FMR or FastReg if available.
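A simplified sketch of that decision, with assumed type and helper names (can_use_global_mr() is not a Lustre function; the real check lives in kiblnd_find_rd_dma_mr()):

    #include <stdbool.h>

    /* Simplified stand-in for the RDMA descriptor; only the fragment
     * count matters for this check. */
    struct rdma_desc {
            int rd_nfrags;                   /* fragments to RDMA */
    };

    /* Returns true when the global memory region may be used; false
     * means fall back to FMR or FastReg if available. */
    static bool can_use_global_mr(int map_on_demand, const struct rdma_desc *rd)
    {
            /* map_on_demand == 0 means not configured: always use the
             * global MR in that case. */
            if (map_on_demand != 0 && rd->rd_nfrags > map_on_demand)
                    return false;
            return true;
    }

    int main(void)
    {
            struct rdma_desc rd = { .rd_nfrags = 40 };

            /* 40 fragments exceed a map_on_demand of 32: fall back. */
            return can_use_global_mr(32, &rd) ? 1 : 0;
    }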
Both the FMR and FastReg mapping functions reduce the number of fragments, rd_nfrags, to 1. However, this recently changed with the https://review.whamcloud.com/29290/ patch.
There was an inherent assumption in the o2iblnd code that only the first fragment can start at an offset and only the last fragment can end at a non-aligned page boundary. However, with a feature that Di implemented, Lustre can now provide intermediary fragments which start at an offset or end at a non-aligned page boundary. This caused an issue where not all the data that was expected to be written was actually written; the patch above was implemented to address it. The result is that the initial assumption in the code is now invalid.
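To make the broken assumption concrete, here is a hypothetical three-fragment layout; the frag structure and the byte counts are illustrative assumptions, not Lustre's actual descriptors:

    #include <stdio.h>

    #define PAGE_SIZE 4096

    /* Illustrative fragment descriptor: offset into the page and the
     * number of bytes covered. */
    struct frag {
            int offset;
            int len;
    };

    int main(void)
    {
            /* Old assumption: only frag[0] may start at an offset, and
             * only the last fragment may end short of a page boundary. */
            struct frag old_layout[] = { {512, 3584}, {0, 4096}, {0, 1000} };

            /* New behavior: an intermediary fragment may also be
             * partial, e.g. frag[1] ends 1096 bytes short of the page. */
            struct frag new_layout[] = { {512, 3584}, {0, 3000}, {0, 1000} };

            (void)old_layout;
            printf("middle fragment covers %d of %d bytes\n",
                   new_layout[1].len, PAGE_SIZE);
            return 0;
    }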
I will discuss the impact of breaking the assumption, but please note that the code depended on the fact that FMR and FastReg would set the rd_nfrags to 1, which is no longer the case.
Something else to note here: unless both peers have vastly different fabrics with different DMA memory sizes, the limitation imposed by map_on_demand in this case is artificial. Moreover, the observable fact that no sites (that I know of) use map_on_demand in their configuration leads me to believe that there is no use for map_on_demand if the intent is to use the global memory region. And if the intent was to use FMR or FastReg, prior to the above patch, then map_on_demand had no real use except to limit the number of fragments. Remember, FMR and FastReg used to set rd_nfrags to 1, so the limitation imposed by map_on_demand would never be encountered.
After looking at the 2.7 code base, it appears that map_on_demand had two uses:
1. To determine whether the global memory region can be used for an RDMA, or whether to fall back to FMR/FastReg mapping, which reduces rd_nfrags to 1.
2. To size the QP's send work request queue.

NOTE: init_qp_attr->cap.max_send_wr is set to IBLND_SEND_WRS(conn) on connection creation. That macro derives its value from ibc_max_frags, which reflects the negotiated value based on the configured map_on_demand.
max_send_wr: "The maximum number of outstanding Work Requests that can be posted to the Send Queue in that Queue Pair. Value can be [0..dev_cap.max_qp_wr]. There may be RDMA devices that for specific transport types may support less outstanding Work Requests than the maximum reported value."
The main purpose of map_on_demand is to negotiate the size of the work request queue on the opposite sides of the QP. By setting it to, for example, 32, the behavior would be to use global memory regions (for RHEL 7.3 and earlier) for RDMAing buffers which have < 32 fragments, or to use FMR/FastReg for buffers that have >= 32 fragments. When using FMR we need only 1 WR per RDMA transfer. This is because we map the pages to the FMR pool using ib_fmr_pool_map_phys(), which maps the list of pages to a single FMR region and therefore requires only 1 WR to transfer.
When using FastReg we need 1 WR for the RDMA transfer, 1 for the map operation and 1 for the invalidate operation, so 3 in total.
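The resulting work-request accounting can be summarized in a small sketch; wrs_per_transfer() is an illustrative helper, not a Lustre function, but the counts match the text above:

    /* Work requests needed to move one buffer, per mapping strategy. */
    enum mapping { GLOBAL_MR, FMR, FAST_REG };

    static int wrs_per_transfer(enum mapping m, int rd_nfrags)
    {
            switch (m) {
            case FMR:
                    /* ib_fmr_pool_map_phys() collapses the page list
                     * into one region: a single RDMA WR suffices. */
                    return 1;
            case FAST_REG:
                    /* map + RDMA transfer + invalidate */
                    return 3;
            case GLOBAL_MR:
            default:
                    /* one RDMA WR per fragment */
                    return rd_nfrags;
            }
    }

    int main(void)
    {
            /* A 34-fragment buffer: 1 WR with FMR, 3 with FastReg,
             * 34 with the global memory region. */
            return wrs_per_transfer(FMR, 34) == 1 ? 0 : 1;
    }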
The benefit, therefore, that map_on_demand provides is the ability to reduce the size of the QP's send work request queue.
However, given o2iblnd's backwards-compatibility requirements, we need to be able to interoperate with older Lustre versions which use up to 256 fragments. Therefore we decided to remove the map_on_demand configuration and default it to 256. See below for the proposed solution.
This has the advantage of reducing the complexity of the code; however, in many cases it would consume more memory than needed. This has been observed on OPA using TID-RDMA. Look at for more details.
The way the RDMA write is done in the o2iblnd is as follows:
- kiblnd_init_rdma() is called to set up the RDMA write, before eventually ib_post_send() is called.
- kiblnd_init_rdma() ensures that it doesn't write more fragments than the negotiated max_frags.
Here is where things start to break because of the patch identified above. Let's take an example where map_on_demand was set to 32 on both peers. The max_frags negotiated will be 32.
An RDMA write that has rd_nfrags == 34 is then attempted. This will trigger the following error:
    if (tx->tx_nwrq >= conn->ibc_max_frags) {
            CERROR("RDMA has too many fragments for peer_ni %s (%d), "
                   "src idx/frags: %d/%d dst idx/frags: %d/%d\n",
                   libcfs_nid2str(conn->ibc_peer->ibp_nid),
                   conn->ibc_max_frags,
                   srcidx, srcrd->rd_nfrags,
                   dstidx, dstrd->rd_nfrags);
            rc = -EMSGSIZE;
            break;
    }
The real reason for the failure here is that we set up the connection with max_send_wr == 32, but we're trying to create more work requests than that.
Currently, I do not see any reason to set max_send_wr to less than the maximum number of fragments == 256.
One issue to be aware of is: . In this ticket there is a possibility of failing to create the connection if the number of work requests is set too high by boosting the concurrent sends. This ought to be documented.
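As a hedged illustration of that risk: if max_send_wr is derived from the negotiated max_frags and the number of concurrent sends (the real code uses IBLND_SEND_WRS(); the formula below is an assumption for illustration), boosting either value can push the QP past the device's dev_cap.max_qp_wr and connection creation will fail:

    #include <stdio.h>
    #include <stdbool.h>

    /* Assumed sizing formula for illustration only; the real value
     * comes from the IBLND_SEND_WRS() macro. */
    static bool qp_fits(int max_frags, int concurrent_sends, int max_qp_wr)
    {
            int max_send_wr = (max_frags + 1) * concurrent_sends;

            return max_send_wr <= max_qp_wr;
    }

    int main(void)
    {
            /* Hypothetical device cap of 16384 WRs per QP. */
            printf("256 frags, 63 sends: fits=%d\n", qp_fits(256, 63, 16384));
            printf("256 frags, 64 sends: fits=%d\n", qp_fits(256, 64, 16384));
            return 0;
    }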
As usual, backwards compatibility is going to be a problem. If we remove map_on_demand and hardcode it to 256, a downrev peer might have set map_on_demand to something lower, causing a write failure. So we'll need to keep the negotiation logic to deal with downrev peers.
I created a calculator for the o2iblnd tunables. Given the tunable settings of two peers, it calculates any adjustments to the tunables that the o2iblnd will perform, and computes several connection attributes that might be of interest.


The kernel version pivots on 693, which is the RHEL 7.4 release (kernel 3.10.0-693). This is significant because, as of that release, there is no more support for global memory regions, which impacts the calculations. Any version below 693 will use the calculations that assume global memory regions are supported.
The tool is written in Python. Download here.
    yum install python-setuptools
    pip install git+https://github.com/UmSenhorQualquer/pysettings.git
    yum install pyqt4-dev-tools python-qt4
    yum install python-opengl
    pip install visvis

Refer to http://pyforms.readthedocs.io/en/latest/ for more details.
    tar -zxvf lustre_2_10_54_o2iblnd_calc.tar.gz
    cd lustre_2_10_54
    python o2iblnd_tun_gui.py
Even just looking at o2iblnd_tun_calc.py makes it simpler to understand how the different tunables impact each other.