Negotiation
There are two parameters which are negotiated between peers on connection creation:
...
- On peer creation, set peer_ni->ibp_max_frags to the configured map_on_demand value, or to IBLND_MAX_RDMA_FRAGS if it is not configured.
- On connection creation, propagate ibp_max_frags to conn->ibc_max_frags, which in turn gets propagated to the connection parameters.
- The connection parameters are sent to the peer.
- The peer compares the max_frags from the node with its own max_frags.
  - If its local max_frags is >= the node's, then it accepts the connection.
  - If its local max_frags is < the node's, then it rejects the connection and sends its own max_frags value back to the node.
  - The node checks the peer's max_frags set in the rejection and, if it is <= the node's, re-initiates the connection using the peer's max_frags. Otherwise the connection cannot be established.
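The negotiation steps above can be sketched roughly as follows. This is a simplified model, not the o2iblnd code: every `sketch_` name is hypothetical, `SKETCH_MAX_RDMA_FRAGS` is a stand-in for IBLND_MAX_RDMA_FRAGS with an assumed value of 256, and a map_on_demand of 0 is assumed to mean "not configured".

```c
/* Hypothetical stand-in for IBLND_MAX_RDMA_FRAGS (value assumed). */
#define SKETCH_MAX_RDMA_FRAGS 256

enum sketch_verdict { SKETCH_ACCEPT, SKETCH_REJECT };

struct sketch_reply {
	enum sketch_verdict verdict;
	int peer_max_frags;	/* the peer's own value, returned on reject */
};

/* Peer creation: use map_on_demand if configured, else the maximum.
 * ASSUMPTION: 0 means "not configured". */
static int sketch_initial_max_frags(int map_on_demand)
{
	return map_on_demand > 0 ? map_on_demand : SKETCH_MAX_RDMA_FRAGS;
}

/* Passive side: accept if the local max_frags is >= the requested one,
 * otherwise reject and report the local value. */
static struct sketch_reply sketch_peer_check(int local_max_frags,
					     int requested_max_frags)
{
	struct sketch_reply reply = { SKETCH_ACCEPT, local_max_frags };

	if (local_max_frags < requested_max_frags)
		reply.verdict = SKETCH_REJECT;
	return reply;
}

/* Active side: returns the negotiated max_frags, or -1 if the
 * connection cannot be established. */
static int sketch_negotiate(int node_map_on_demand, int peer_map_on_demand)
{
	int node_frags = sketch_initial_max_frags(node_map_on_demand);
	int peer_frags = sketch_initial_max_frags(peer_map_on_demand);
	struct sketch_reply reply = sketch_peer_check(peer_frags, node_frags);

	if (reply.verdict == SKETCH_ACCEPT)
		return node_frags;

	/* Rejected: retry with the peer's value only if it does not
	 * exceed our own. */
	if (reply.peer_max_frags <= node_frags) {
		reply = sketch_peer_check(peer_frags, reply.peer_max_frags);
		if (reply.verdict == SKETCH_ACCEPT)
			return reply.peer_max_frags;
	}
	return -1;
}
```

In this model `sketch_negotiate(256, 32)` is rejected once and then settles on 32, while `sketch_negotiate(32, 256)` is accepted immediately with 32. Either way the connection ends up with the smaller of the two values.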
map_on_demand usage
The map_on_demand value is not used when allocating FMR or FastReg buffers. In this case the maximum number of buffers is allocated.
...
Something else to note here is that unless both peers have vastly different fabrics with different DMA memory sizes, the limitation imposed by map_on_demand in this case is artificial. Moreover, the observable fact that no sites (that I know of) use map_on_demand in their configuration leads me to believe that there is no use for map_on_demand if the intent is to use the global memory region. And if the intent is to use FMR or FastReg, then prior to the above patch map_on_demand had no use at all. Remember that FMR and FastReg used to set rd_nfrags to 1, so the limitation imposed by map_on_demand would never be encountered.
Legacy
After looking at the 2.7 code base, it appears that the only real use of map_on_demand was as a flag to allow the use of FMR. It wouldn't really matter whether it was set to 1 or 256 since, again, rd_nfrags == 1.
...
max_send_wr: The maximum number of outstanding Work Requests that can be posted to the Send Queue in that Queue Pair. Value can be [0..dev_cap.max_qp_wr]. There may be RDMA devices that for specific transport types may support less outstanding Work Requests than the maximum reported value.
Conclusion on map_on_demand
It appears the intended usage of map_on_demand is to control the maximum number of RDMA fragments transferred. However, when calculating rd_nfrags in kiblnd_map_tx(), no consideration is given to the negotiated max_frags value. The underlying assumption in the code, then, is that if rd_nfrags exceeds the negotiated max_frags we can use FMR/FastReg, which maps all the fragments into 1 FMR/FastReg fragment, and that when FMR/FastReg is in use this tunable has no real impact. That assumption is now broken by https://review.whamcloud.com/29290/.
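To make the assumption concrete, here is a minimal sketch of the fragment count that reaches the wire under the old behaviour, together with the comparison against the negotiated max_frags that the text says is missing. The function names and parameters are hypothetical and do not correspond to anything in kiblnd_map_tx().

```c
/* Fragment count in the RDMA descriptor under the old assumption:
 * FMR/FastReg collapses every mapped fragment into a single one. */
static int sketch_wire_nfrags(int mapped_nfrags, int use_fmr_or_fastreg)
{
	return use_fmr_or_fastreg ? 1 : mapped_nfrags;
}

/* The comparison that is not made when rd_nfrags is calculated:
 * returns 1 if the descriptor would exceed the negotiated limit. */
static int sketch_exceeds_negotiated(int mapped_nfrags,
				     int use_fmr_or_fastreg,
				     int negotiated_max_frags)
{
	return sketch_wire_nfrags(mapped_nfrags, use_fmr_or_fastreg) >
	       negotiated_max_frags;
}
```

With 256 mapped fragments and a negotiated limit of 32, the sketch reports a violation only when FMR/FastReg is not used. Once FMR/FastReg can produce more than one fragment, the first function no longer returns 1 and the limit can be exceeded in that case as well.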
Also, given the usage of map_on_demand described above, I find it difficult to understand the necessity of having this tunable. It appears to only complicate the code without adding any significant functionality.
Proposal
Overview
The way the RDMA write is done in the o2iblnd is as follows:
...
One issue to be aware of is:
[Jira issue link]
Tasks
- Remove map_on_demand from the code and have max_send_wr be based on a multiple of a constant: 256.
- Adjust the rest of the code to handle the removal of map_on_demand.
- Do not remove the actual tunable, for backwards compatibility.
- Optimize the case where all the fragments have no gaps so that in the FMR case we only end up setting rd_nfrags to 1. This will reduce the resource usage on the card: fewer work requests.
- Clean up kiblnd_init_rdma() and remove any unnecessary checks against the maximum number of frags.
- Document the interactions between the ko2iblnd module parameters. Currently there is a spider web of dependencies between the different parameters. Each dependency needs to be justified and documented, or removed if it's unnecessary.
- Create a simple calculator to calculate the impact of changing the parameters.
  - For example, if you set concurrent_sends to a value X, how many work requests will be created?
  - This will be handy to easily understand the configurations on the cluster without having to go through the pain of re-examining the code.
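A first cut of such a calculator could look like the sketch below. It assumes the number of send work requests per connection is (max_frags + 1) * concurrent_sends; that formula is an assumption for illustration and should be checked against the code before relying on the numbers. The function names are hypothetical, and `max_qp_wr` stands for the device limit that bounds max_send_wr.

```c
/* ASSUMPTION: send work requests per connection are computed as
 * (max_frags + 1) * concurrent_sends. Verify against o2iblnd. */
static long sketch_send_wrs(int max_frags, int concurrent_sends)
{
	return ((long)max_frags + 1) * concurrent_sends;
}

/* max_send_wr must stay within [0..max_qp_wr] for the device.
 * Returns 1 if the requested value fits, 0 otherwise. */
static int sketch_fits_device(long send_wrs, long max_qp_wr)
{
	return send_wrs >= 0 && send_wrs <= max_qp_wr;
}
```

Under that assumption, fixing max_frags at 256 and setting concurrent_sends to 8 gives 257 * 8 = 2056 send work requests per connection, which fits a device reporting a max_qp_wr of 16384 but not one reporting 1024.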
...