...

The negotiation logic works as follows (the same applies to both parameters above; a sketch of the decision logic follows the list):

  1. On peer creation, set peer_ni->ibp_max_frags to the configured map_on_demand value, or to IBLND_MAX_RDMA_FRAGS if not configured.
  2. On connection creation, propagate ibp_max_frags to conn->ibc_max_frags, which in turn is propagated to the connection parameters.
  3. The connection parameters are sent to the peer.
  4. The peer compares the max_frags from the node with its own max_frags.
    1. If its local max_frags is >= the node's, then accept the connection.
    2. If its local max_frags is < the node's, then reject and send the peer's max_frags value back to the node.
    3. The node checks the peer's max_frags set in the rejection; if it is <= the node's, then it re-initiates the connection using the peer's max_frags. Otherwise the connection cannot be established.
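
To make the handshake concrete, here is a minimal userspace sketch of the decision logic above. All names are hypothetical (they are not the kiblnd identifiers); only the comparisons mirror steps 4.1 through 4.3.

Code Block
#include <stdbool.h>
#include <stdio.h>

#define IBLND_MAX_RDMA_FRAGS 256   /* default when map_on_demand is unset */

/* Peer side: accept if the local limit covers the node's requested value. */
bool peer_accepts(int peer_max_frags, int node_max_frags, int *rejected_with)
{
	if (peer_max_frags >= node_max_frags)
		return true;              /* step 4.1: accept */
	*rejected_with = peer_max_frags;  /* step 4.2: reject, advertise ours */
	return false;
}

/* Node side: retry with the peer's value if it is within our own limit. */
int negotiate(int node_max_frags, int peer_max_frags)
{
	int rejected_with;

	if (peer_accepts(peer_max_frags, node_max_frags, &rejected_with))
		return node_max_frags;
	if (rejected_with <= node_max_frags)
		return rejected_with;     /* step 4.3: re-initiate with peer's */
	return -1;                        /* connection cannot be established */
}

int main(void)
{
	printf("%d\n", negotiate(IBLND_MAX_RDMA_FRAGS, 32)); /* prints 32 */
	printf("%d\n", negotiate(32, IBLND_MAX_RDMA_FRAGS)); /* prints 32 */
	return 0;
}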

map_on_demand usage

The map_on_demand value is not used when allocating FMR or FastReg buffers; in that case the maximum number of buffers is allocated.

Under RHEL 7.3 and previous kernel versions, when HAVE_IB_GET_DMA_MR is defined, the global memory region is first examined in the call to kiblnd_find_rd_dma_mr(). This function checks whether map_on_demand is configured and whether the number of fragments to be RDMA written exceeds the map_on_demand value. If that is the case, the global memory region cannot be used and the code falls back to FMR or FastReg if available.
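
A paraphrased sketch of that check, assuming a simple fragment-count comparison; the field and function names are illustrative, not the verbatim upstream code:

Code Block
#include <stddef.h>

struct rdma_desc { int rd_nfrags; };

void *find_rd_dma_mr(int map_on_demand, const struct rdma_desc *rd,
		     void *global_mr)
{
	/* If map_on_demand is set and the transfer needs more fragments
	 * than it allows, the global MR cannot be used; the caller falls
	 * back to FMR or FastReg instead. */
	if (map_on_demand > 0 && rd->rd_nfrags > map_on_demand)
		return NULL;
	return global_mr;
}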

Both the FMR and FastReg mapping functions reduce the number of fragments (rd_nfrags) to 1. However, this has recently changed with the https://review.whamcloud.com/29290/ patch.

...

I will discuss the impact of breaking this assumption below, but please note that the code depended on the fact that FMR and FastReg would set rd_nfrags to 1, which is no longer the case.
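
To illustrate what changed, here is a hypothetical before/after contrast of the mapping behaviour; neither function is the actual kiblnd code:

Code Block
struct rdma_desc { int rd_nfrags; };

/* Pre-29290 behaviour: FMR/FastReg collapsed the transfer into a single
 * virtually contiguous mapping, so the map_on_demand limit never hit. */
void map_old(struct rdma_desc *rd)
{
	rd->rd_nfrags = 1;
}

/* Post-29290 behaviour: the fragment count computed from the buffers is
 * preserved, so a transfer can now exceed the negotiated max_frags. */
void map_new(struct rdma_desc *rd)
{
	(void)rd;	/* rd->rd_nfrags left untouched */
}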

Something else to note here: unless both peers have vastly different fabrics with different DMA memory sizes, the limitation imposed by map_on_demand in this case is artificial. Moreover, the observable fact that no sites (that I know of) use map_on_demand in their configuration leads me to believe that there is no use for map_on_demand if the intent is to use the global memory region. And if the intent is to use FMR or FastReg, then prior to the above patch map_on_demand literally had no use. Remember, FMR and FastReg used to set rd_nfrags to 1, so the limitation imposed by map_on_demand would never be encountered.

Legacy

After looking at the 2.7 code base, it appears that the only real use of map_on_demand was as a flag to allow the use of FMR. It wouldn't really matter whether it was set to 1 or 256, since again rd_nfrags == 1.

NOTE: init_qp_attr->cap.max_send_wr is set to IBLND_SEND_WRS(conn) on connection creation. That macro derives its value from ibc_max_frags, which reflects the negotiated value based on the configured map_on_demand.

Code Block
max_send_wr:
The maximum number of outstanding Work Requests that can be posted to the Send Queue in that Queue Pair. Value can be [0..dev_cap.max_qp_wr]. There may be RDMA devices that for specific transport types may support less outstanding Work Requests than the maximum reported value.
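
For illustration only, the shape of the derivation might look like the sketch below; the exact upstream IBLND_SEND_WRS() definition varies across Lustre versions, and ibc_queue_depth is used here as an assumed stand-in for the concurrent-sends factor.

Code Block
struct kib_conn_sketch {
	int ibc_max_frags;	/* negotiated from map_on_demand */
	int ibc_queue_depth;	/* concurrent in-flight messages (assumed) */
};

int send_wrs(const struct kib_conn_sketch *c)
{
	/* hypothetical: one WR per fragment for each in-flight message;
	 * the point is only that max_send_wr scales with ibc_max_frags */
	return (c->ibc_max_frags + 1) * c->ibc_queue_depth;
}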

...

It appears the intended usage of map_on_demand is to control the maximum number of RDMA fragments transferred. However, when calculating rd_nfrags in kiblnd_map_tx(), no consideration is given to the negotiated max_frags value. The underlying assumption in the code is that if rd_nfrags exceeds the negotiated max_frags, we can use FMR/FastReg, which maps all the fragments into one FMR/FastReg fragment; and if we are using FMR/FastReg, this tunable has no real impact. That assumption is now broken by https://review.whamcloud.com/29290/
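
A hypothetical illustration of that gap, assuming a page-based fragment count (none of these names are the upstream code):

Code Block
int map_tx_nfrags(unsigned long nob, unsigned long page_size)
{
	/* one fragment per page touched by the transfer ... */
	int nfrags = (int)((nob + page_size - 1) / page_size);

	/* ... and no clamp or comparison against the negotiated
	 * conn->ibc_max_frags anywhere in this path */
	return nfrags;
}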

Also, given the usage of map_on_demand described above, I find it difficult to understand the necessity of this tunable. It appears only to complicate the code without adding any significant functionality.

...

  1. The sink sets up the DMA memory into which the RDMA write will be done.
  2. The memory is mapped using either global memory regions, FMR, or FastReg (since RHEL 7.4, global memory regions are no longer used).
  3. An RDMA descriptor (RD) is filled in, most notably with the starting address of the local memory and the number of fragments to send.
    1. The starting address can be:
      1. zero-based for FMR
      2. the actual physical DMA address for FastReg
  4. The RD is sent to the peer.
  5. The peer maps its source buffers and calls kiblnd_init_rdma() to set up the RDMA write, before ib_post_send() is eventually called.

kiblnd_init_rdma() ensures that it doesn't write more fragments than the negotiated max_frags.
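
A simplified paraphrase of that guard, assuming one work request per fragment and an illustrative error value (this is not the verbatim upstream loop):

Code Block
#include <errno.h>

int init_rdma(int rd_nfrags, int negotiated_max_frags)
{
	int wrq;

	for (wrq = 0; wrq < rd_nfrags; wrq++) {
		if (wrq >= negotiated_max_frags)
			return -EMSGSIZE; /* "too many fragments" */
		/* build one work request for this fragment */
	}
	return 0;
}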

Here is where things start to break because of the patch identified above. Let's take an example where map_on_demand is set to 32 on both peers. The negotiated max_frags will be 32.

An RDMA write is then attempted that has rd_nfrags == 34. This causes the following error:

...

The real reason for the failure here is that we set up the connection with max_send_wr == 32, but we're trying to create more work requests than that.

Currently, I do not see any reason to set max_send_wr to less than the maximum number of fragments == 256.

...

  1. Remove map_on_demand from the code and have max_send_wr be based on a multiple of a constant: 256.
  2. Adjust the rest of the code to handle the removal of map_on_demand.
  3. Do not remove the actual tunable, for backwards compatibility.
  4. Optimize the case where all the fragments have no gaps, so that in the FMR case we only end up setting rd_nfrags to 1. This will reduce resource usage on the card: fewer work requests.
  5. Clean up kiblnd_init_rdma() and remove any unnecessary checks against the maximum number of frags.
  6. Document the interactions between the ko2iblnd module parameters. Currently there is a spider web of dependencies between the different parameters. Each dependency needs to be justified and documented, and removed if unnecessary.
  7. Create a simple calculator to estimate the impact of changing the parameters (a toy sketch follows this list).
    1. For example, if you set concurrent_sends to a value X, how many work requests will be created?
      1. This will make it easy to understand the configurations on the cluster without having to go through the pain of re-examining the code.
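
As a starting point for item 7, here is a toy calculator; the formula is an assumption (one work request per fragment per concurrent send), not the actual IBLND_SEND_WRS() derivation, so substitute the real one from the target Lustre version before trusting the numbers.

Code Block
#include <stdio.h>
#include <stdlib.h>

#define MAX_RDMA_FRAGS 256	/* the proposed constant from item 1 */

int main(int argc, char **argv)
{
	int concurrent_sends = (argc > 1) ? atoi(argv[1]) : 8;
	long work_requests;

	/* hypothetical: one WR per fragment for each concurrent send */
	work_requests = (long)concurrent_sends * MAX_RDMA_FRAGS;

	printf("concurrent_sends=%d -> max_send_wr=%ld\n",
	       concurrent_sends, work_requests);
	return 0;
}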

...