
...

One issue to be aware of is LU-7124 (HPDD Community Jira): connection creation can fail if the number of work requests is set too high by boosting concurrent_sends. This should be documented.

Backwards Compatibility

As usual, backwards compatibility is going to be a problem. If we remove map_on_demand and hardcode it to 256, a downrev peer that has set map_on_demand to a lower value could see write failures. We therefore need to keep the negotiation to deal with downrev peers.
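A minimal sketch of the negotiation that has to stay, assuming each side advertises the maximum number of RDMA fragments it accepts and the connection settles on the smaller of the two. `IBLND_DEFAULT_FRAGS` and `negotiate_frags()` are hypothetical names, not actual ko2iblnd identifiers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical local default once map_on_demand is no longer tunable. */
#define IBLND_DEFAULT_FRAGS 256

/*
 * Pick the fragment count for a new connection (sketch, not ko2iblnd code).
 * peer_frags: maximum advertised by the peer; a downrev peer may advertise
 *             fewer than 256 because of its map_on_demand setting.
 * Returns true and stores the agreed value in *out, or false if the
 * request should be rejected.
 */
static bool negotiate_frags(int peer_frags, int *out)
{
	assert(out != NULL);
	if (peer_frags <= 0)
		return false; /* malformed request */
	/* Never send more fragments than the peer can accept. */
	*out = peer_frags < IBLND_DEFAULT_FRAGS ? peer_frags : IBLND_DEFAULT_FRAGS;
	return true;
}
```

With a downrev peer limited to 32 fragments, the connection settles on 32 instead of issuing writes that need 256. Clamping is only one option; the real code may instead reject the connection and ask the peer to retry with a lower value.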

Tasks

  1. Remove the ability to configure map_on_demand via tunables or lnetctl/YAML.
  2. Set the map_on_demand default in the code, and base max_send_wr on a multiple of a constant: 256.
  3. Adjust the rest of the code to handle the removal of map_on_demand.
  4. Keep the ability to dial down the number of fragments if the peer supports fewer fragments. I still don't think there is any actual need to set max_send_wr to anything less than a multiple of 256.
    1. The underlying assumption in the code was that FMR and FastReg both used only one fragment, which is no longer the case. If a message has more fragments than the peer (or the connection) supports, what should we do? The only option is to split it into multiple TXs. I contacted Doug Ledford from Redback to see if there is a way to handle gaps in the buffers with FMR on MLX4. If that is possible, it will greatly reduce the complexity of the code.
    Do not remove the actual map_on_demand tunable, for backwards compatibility.
  5. Optimize the case where the fragments have no gaps between them so that, with FMR, rd_nfrags ends up as 1. This reduces resource usage on the card, since fewer work requests are needed.
  6. Clean up kiblnd_init_rdma() and remove any unnecessary checks against the maximum number of frags.
  7. Document the interactions between the ko2iblnd module parameters. Currently there is a spider web of dependencies between the different parameters. Each dependency needs to be justified and documented, or removed if it is unnecessary.
  8. Create a simple calculator to calculate the impact of changing the parameters.
    1. For example, if concurrent_sends is set to X, how many work requests will be created?
      1. This makes it easy to understand a cluster's configuration without having to re-examine the code.
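For the calculator in task 8, a minimal sketch is below. The formula is an assumption, not taken from the ko2iblnd source: each of the concurrent_sends in-flight sends is assumed to need one work request for the LNet message plus one per RDMA fragment. `send_wrs()` and `send_wrs_fit()` are hypothetical helpers; `max_qp_wr` stands for the HCA's per-QP work request limit reported in the device attributes. Check the formula against the current code before trusting the numbers.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical calculator (assumed formula): per send, one WR for the
 * LNet message plus one WR per RDMA fragment, times concurrent_sends.
 */
static long send_wrs(int frags, int concurrent_sends)
{
	assert(frags > 0 && concurrent_sends > 0);
	return (long)(frags + 1) * concurrent_sends;
}

/*
 * True if the requested send queue fits within the HCA's per-QP limit.
 * If it does not, connection setup can fail (the LU-7124 scenario).
 */
static bool send_wrs_fit(int frags, int concurrent_sends, long max_qp_wr)
{
	return send_wrs(frags, concurrent_sends) <= max_qp_wr;
}
```

Under this assumption, 256 fragments with concurrent_sends = 8 gives 257 × 8 = 2056 send WRs per connection, and doubling concurrent_sends doubles that figure.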
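Tasks 5 and 4.1 can be sketched together: first merge fragments that have no gaps between them (so a contiguous buffer collapses to one fragment), then, if the result still exceeds the peer's limit, work out how many TXs the message must be split into. `struct frag`, `coalesce_frags()` and `txs_needed()` are hypothetical; the real code operates on its own RDMA descriptors, and whether FMR can register a merged region as one fragment depends on the hardware (the mlx4 gap question in task 4.1).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fragment descriptor: a DMA address and a length. */
struct frag {
	uint64_t addr;
	uint64_t len;
};

/*
 * Merge fragments in place when each starts exactly where the previous one
 * ended. Returns the new count; a buffer with no gaps collapses to 1.
 */
static size_t coalesce_frags(struct frag *frags, size_t nfrags)
{
	size_t out = 0;

	for (size_t i = 0; i < nfrags; i++) {
		if (out > 0 &&
		    frags[out - 1].addr + frags[out - 1].len == frags[i].addr)
			frags[out - 1].len += frags[i].len; /* extend previous */
		else
			frags[out++] = frags[i]; /* gap: start a new fragment */
	}
	return out;
}

/* Number of TXs needed when the peer accepts at most max_frags per TX. */
static size_t txs_needed(size_t nfrags, size_t max_frags)
{
	assert(max_frags > 0);
	return (nfrags + max_frags - 1) / max_frags; /* round up */
}
```

For example, three adjacent 4 KiB fragments collapse to a single 12 KiB fragment, while 257 fragments sent to a peer limited to 32 per TX need 9 TXs.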

...