
Negotiation

There are two parameters which are negotiated between peers on connection creation: 

...

  1. On peer creation, set peer_ni->ibp_max_frags to the configured map_on_demand value, or to IBLND_MAX_RDMA_FRAGS if not configured.
  2. On connection creation, propagate ibp_max_frags to conn->ibc_max_frags, which in turn gets propagated to the connection parameters.
  3. The connection parameters are sent to the peer.
  4. The peer compares the max_frags from the node with its own max_frags.
    1. If its local max_frags is greater than or equal to the node's, it accepts the connection.
    2. If its local max_frags is less than the node's, it rejects the connection and sends its own max_frags value back to the node.
    3. The node checks the peer's max_frags set in the rejection; if it is less than or equal to the node's, the node re-initiates the connection using the peer's max_frags. Otherwise the connection cannot be established.
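The negotiation steps above can be sketched as follows; the function and constant names are illustrative stand-ins, not the actual kernel symbols:

```python
# Hedged sketch of the max_frags negotiation described above.
IBLND_MAX_RDMA_FRAGS = 256  # default when map_on_demand is not configured


def peer_max_frags(map_on_demand=None):
    """Step 1: ibp_max_frags comes from configuration, or the maximum."""
    return map_on_demand if map_on_demand else IBLND_MAX_RDMA_FRAGS


def negotiate(node_frags, peer_frags):
    """Steps 3-4: returns the max_frags value the connection ends up with."""
    # Step 4a: the peer accepts if its own max_frags covers the node's.
    if peer_frags >= node_frags:
        return node_frags
    # Steps 4b/4c: the peer rejects and sends back its own value; since
    # that value is lower than the node's, the node can honor it and
    # re-initiates the connection with it.
    return peer_frags
```

For example, a node configured with map_on_demand of 256 connecting to a peer configured with 32 ends up with 32 fragments after the reject/retry round trip.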

...

The map_on_demand value is not used when allocating global memory regions or FMR/FastReg buffers. In this case the maximum number of buffers is allocated.

...

I will discuss the impact of breaking this assumption below, but note that the code depended on FMR and FastReg setting rd_nfrags to 1, which is no longer the case.

Something else to note here: unless the two peers have vastly different fabrics with different DMA memory sizes, the limitation imposed by map_on_demand in this case is artificial. Moreover, no sites (that I know of) use map-on-demand in their configuration, which leads me to believe that there is no use for map_on_demand if the intent is to use the global memory region. And if the intent is to use FMR or FastReg, then prior to the above patch map-on-demand had no real use except to limit the number of fragments. Remember that FMR and FastReg used to set rd_nfrags to 1, so the limitation imposed by map_on_demand would never be encountered.

...

After looking at the 2.7 code base, it appears that map_on_demand had two uses:

  1. Used as a flag to turn on the use of FMR or PMR. It wouldn't really matter whether it was set to 1 or 256, since again in the FMR case rd_nfrags == 1.
  2. Used to allocate the maximum size of the work request queue.

NOTE: init_qp_attr->cap.max_send_wr is set to IBLND_SEND_WRS(conn) on connection creation. That macro derives its value from ibc_max_frags, which reflects the value negotiated based on the configured map_on_demand.
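As a rough sketch of that relationship (the formula here is illustrative only, not the actual IBLND_SEND_WRS() definition, which lives in o2iblnd.h and varies across Lustre versions), it shows why a smaller negotiated max_frags shrinks the send queue:

```python
# Illustrative only: not the real IBLND_SEND_WRS() formula.
def max_send_wr(ibc_max_frags, concurrent_sends):
    """The send queue must hold up to one work request per fragment
    for each concurrently outstanding send."""
    return ibc_max_frags * concurrent_sends
```

For example, dialing map_on_demand down from 256 to 32 with 8 concurrent sends would shrink the queue from 2048 to 256 entries under this model.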

...

Conclusion on map_on_demand

It appears the intended usage of map_on_demand is to negotiate the size of the work request queues on the opposite sides of the QP. By setting it to, for example, 32, the behavior would be to use global memory regions (for RHEL 7.3 and earlier) for RDMAing buffers which have fewer than 32 fragments, or FMR/FastReg for buffers that have 32 or more fragments. When using FMR we need only one work request per RDMA transfer message. This is true because we map the pages to the FMR pool using ib_fmr_pool_map_phys(), which maps the list of pages to a single FMR region that requires only one WR to transfer.

However, when calculating rd_nfrags in kiblnd_map_tx(), no consideration is given to the negotiated max_frags value. The underlying assumption in the code is that if rd_nfrags exceeds the negotiated max_frags, we can use FMR/FastReg, which maps all the fragments into one FMR/FastReg fragment, so the tunable has no real impact when FMR/FastReg is in use. That assumption is now broken by https://review.whamcloud.com/29290/. Given the usage of map_on_demand described above, I find it difficult to understand the necessity of this tunable; it appears only to complicate the code without adding any significant functionality.

When using FastReg we need one work request for the RDMA transfer, one for the map operation and one for the invalidate operation, so three in total.

The benefit, therefore, that map-on-demand provides is the ability to reduce the size of the qp send work requests queue.

However, given the o2iblnd's backwards compatibility requirements, we need to be able to interoperate with older Lustre versions which use up to 256 fragments. Therefore we decided to remove the map-on-demand configuration and default it to 256. See the proposed solution below.

This has the advantage of reducing the complexity of the code; however, in many cases it would consume more memory than needed. This has been observed on OPA using TID-RDMA. See LU-10875 on the HPDD Community Jira for more details.

Proposal

Overview

The way the RDMA write is done in the o2iblnd is as follows:

  1. The sink sets up the DMA memory into which the RDMA write will be done.
  2. The memory is mapped using either global memory regions, FMR or FastReg (since RHEL 7.4, global memory regions are no longer used).
  3. An RDMA descriptor (RD) is filled in, most importantly with the starting address of the local memory for each fragment and the number of fragments to send.
    1. The starting address can be:
      1. zero-based for FMR
      2. the actual physical DMA address for FastReg
  4. The RD is sent to the peer.
  5. The peer maps its source buffers and calls kiblnd_init_rdma() to set up the RDMA write, before ib_post_send() is eventually called.
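The descriptor-building in steps 3-4 can be sketched as follows; build_rd and its field names are hypothetical stand-ins for the kiblnd structures, and the FastReg branch reflects the post-29290 behavior where rd_nfrags is no longer collapsed to 1:

```python
def build_rd(frags, use_fmr):
    """Sketch of filling in an RDMA descriptor (RD).

    frags: list of (dma_addr, nob) pairs for the mapped sink buffer.
    """
    if use_fmr:
        # FMR presents the whole mapping as a single zero-based
        # fragment (step 3a.1 above).
        total = sum(nob for _, nob in frags)
        return {"rd_nfrags": 1, "rd_frags": [(0, total)]}
    # FastReg records the actual physical DMA address of each
    # fragment (step 3a.2 above).
    return {"rd_nfrags": len(frags), "rd_frags": list(frags)}
```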

...

Suppose, however, that an RDMA write is attempted with rd_nfrags == 34. This will trigger the following error:

Code Block
if (tx->tx_nwrq >= conn->ibc_max_frags) {
	CERROR("RDMA has too many fragments for peer_ni %s (%d), "
	       "src idx/frags: %d/%d dst idx/frags: %d/%d\n",
	       libcfs_nid2str(conn->ibc_peer->ibp_nid),
	       conn->ibc_max_frags,
	       srcidx, srcrd->rd_nfrags,
	       dstidx, dstrd->rd_nfrags);
	rc = -EMSGSIZE;
	break;
}
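A minimal model of this check, assuming for simplicity that one work request is queued per fragment (the real kiblnd_init_rdma() also accounts for map and invalidate work requests):

```python
EMSGSIZE = 90  # matching the kernel errno


def init_rdma(nfrags, max_frags):
    """Sketch: queue one WR per fragment, failing like the excerpt above."""
    tx_nwrq = 0
    for _ in range(nfrags):
        if tx_nwrq >= max_frags:
            # "RDMA has too many fragments for peer_ni ..."
            return -EMSGSIZE
        tx_nwrq += 1
    return 0
```

Under this model, an RDMA with 34 fragments against a negotiated limit of 32 fails with -EMSGSIZE, exactly the scenario described above.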

...

One issue to be aware of is LU-7124 on the HPDD Community Jira. In that ticket, connection creation can fail if the number of work requests is set too high by boosting concurrent sends. This ought to be documented.

Backwards Compatibility

As usual, backwards compatibility is going to be a problem. If we remove map-on-demand and hardcode it to 256, a down-rev peer might have set map-on-demand to something lower, causing write failures. So we need to keep the negotiation to deal with down-rev peers.

Tasks

  1. Remove the ability to configure map-on-demand via module tunables or lnetctl/YAML.
  2. Default map_on_demand to 256 in the code and base max_send_wr on a multiple of that constant.
  3. Adjust the rest of the code to handle the removal of map_on_demand.
  4. Keep the ability to dial down the number of fragments if the peer supports a lower number of fragments. I still don't think there is any actual need to set max_send_wr to anything less than a multiple of 256.
    1. The underlying assumption in the code was that FMR and FastReg both used only 1 fragment, which is no longer the case. If the number of fragments in a message is greater than the number supported by the peer (or the connection), what should we do? The only option is to split it into multiple TXs. I contacted Doug Ledford from Red Hat to see if there is a way to handle gaps in the buffers with FMR on MLX4. If we are able to do that, it will greatly reduce the complexity of the code.
    2. Do not remove the actual tunable, for backwards compatibility.
  5. Optimize the case where the fragments have no gaps so that in the FMR case we end up setting rd_nfrags to 1. This will reduce resource usage on the card: fewer work requests are needed.
  6. In the case of gaps, flag the RDMA write as requiring gaps, and when it comes time to map it, check that the connection can support that number of fragments; if it can't, fail with a clear message suggesting that the peer set map-on-demand to 256.
  7. Clean up kiblnd_init_rdma() and remove any unnecessary checks against the maximum number of frags.
  8. Document the interactions between the ko2iblnd module parameters. Currently there is a spider web of dependencies between the different parameters. Each dependency needs to be justified and documented, and removed if unnecessary.
  9. Create a simple calculator to show the impact of changing the parameters.
    1. For example, if you set concurrent_sends to a value X, how many work requests will be created?
      1. This will make it easy to understand a cluster's configuration without having to go through the pain of re-examining the code.
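The fallback noted in task 4a, splitting a message into multiple TXs when it has more fragments than the connection supports, can be sketched as simple chunking (split_tx is a hypothetical helper, not existing code):

```python
def split_tx(frags, max_frags):
    """Split a fragment list into TX-sized chunks of at most max_frags.

    Hypothetical sketch of task 4a's "divide into multiple TXs" option.
    """
    return [frags[i:i + max_frags] for i in range(0, len(frags), max_frags)]
```

For instance, the 34-fragment write from the error scenario above would become two TXs (32 + 2 fragments) against a 32-fragment connection.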

Ticket tracking changes

See LU-10129 on the HPDD Community Jira.

o2iblnd Calculator

I created a calculator for the o2iblnd tunables. Given the tunables of two peers, it calculates any adjustments the o2iblnd will perform and several connection attributes that might be of interest.

[Screenshots of the calculator UI]

The kernel version pivots on 693, which is the RHEL 7.4 release. This is significant because that release removes support for global memory regions, which impacts the calculations.

Any version below 693 will use the calculations assuming that global memory regions are supported.

The tool is written in Python. Download here.

Python Requirements

Refer to http://pyforms.readthedocs.io/en/latest/ for more details.

Running it

Code Block
tar -zxvf lustre_2_10_54_o2iblnd_calc.tar.gz
cd lustre_2_10_54
python o2iblnd_tun_gui.py

Even a quick read of o2iblnd_tun_calc.py makes it easier to understand how the different tunables impact each other.