The following assumes that the underlying network connectivity has been tested (using iperf, for example) and found to work as expected in terms of reliability and performance.
It is also assumed that LNet routing has been configured, at least to the extent that the necessary routes have been added on the LNet nodes and at least one node has been designated as an LNet router.
Knowledge of lnet_selftest and configuring module parameters via modprobe is also assumed.
Here's a list of commands which the verification procedure is going to rely on:
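(At a minimum, these are the commands used in the sections below; the exact invocations depend on which steps are followed.)
lnetctl net show -v 4
lnetctl set <tiny_buffers|small_buffers|large_buffers> <value>
lst (the lnet_selftest user space utility)
top (to observe LND thread load)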
Two scenarios shall be considered:
In this case, the setup looks like this:
A <-LNet1-> R <-LNet2-> B
Here A and B are Lustre endpoints and R is an LNet router. LNet1 and LNet2 can be any LNets, for example "tcp0" or "o2ib100"; the only requirement is that they are different.
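As an illustration, assuming LNet1 is "tcp", LNet2 is "o2ib100", and placeholder NIDs 192.168.1.2@tcp for A, 10.10.0.3@o2ib100 for B, and 192.168.1.10@tcp / 10.10.0.10@o2ib100 for R's two interfaces, such a topology could be configured roughly as follows:
# on A: reach LNet2 via R
lnetctl route add --net o2ib100 --gateway 192.168.1.10@tcp
# on B: reach LNet1 via R
lnetctl route add --net tcp --gateway 10.10.0.10@o2ib100
# on R: enable forwarding between the two LNets
lnetctl set routing 1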
TODO
This is applicable only to the router node R. The buffers are used to hold the messages being forwarded.
There are "tiny", "small" and "large" buffers. If memory size allows, the number of each should be increased to ((number_of_peers_on_LNet1 x peer_credits_on_LNet1) + (number_of_peers_on_LNet2 x peer_credits_on_LNet2)).
"Small" buffers are 4K bytes, "large" are 1M and "tiny" are only a few bytes. The lnet module parameters which can be used to set these are shown in the example below.
It is also possible to adjust the buffer numbers dynamically via lnetctl: "lnetctl set <tiny_buffers|small_buffers|large_buffers> <value>"
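Assuming the standard lnet module parameter names (tiny_router_buffers, small_router_buffers, large_router_buffers) and a hypothetical router serving 32 peers on LNet1 with peer_credits of 8 and 16 peers on LNet2 with peer_credits of 128, the formula above gives (32 x 8) + (16 x 128) = 2304 buffers of each size. These could be configured statically:
options lnet tiny_router_buffers=2304 small_router_buffers=2304 large_router_buffers=2304
or adjusted at runtime:
lnetctl set tiny_buffers 2304
lnetctl set small_buffers 2304
lnetctl set large_buffers 2304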
If all of the above looks good, lnet_selftest can be used to check LNet performance under load. A few reminders on lnet_selftest usage:
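A minimal session sketch is shown below; the NIDs are placeholders and the lnet_selftest module must be loaded on every node involved, including the node driving the test:
modprobe lnet_selftest
export LST_SESSION=$$
lst new_session rtr_check
lst add_group clients 192.168.1.2@tcp        # node A
lst add_group servers 10.10.0.3@o2ib100      # node B
lst add_batch bulk
lst add_test --batch bulk --from clients --to servers brw write size=1M
lst run bulk
lst stat clients servers                     # interrupt with Ctrl-C once enough samples are seen
lst end_session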
If the performance results are not satisfactory, it may be helpful to isolate the problem to a particular node or LNet, for example:
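Using the routed topology above (placeholder NIDs again), one lnet_selftest session can be confined to LNet1 by testing A against R's LNet1 interface, and a second session confined to LNet2 by testing R's LNet2 interface against B. For the LNet1 leg the group setup would look like:
lst add_group left 192.168.1.2@tcp       # A
lst add_group right 192.168.1.10@tcp     # R's interface on LNet1
If both single-hop legs perform well individually but the end-to-end A-to-B run does not, the router itself is the likely bottleneck; if only one leg is degraded, the problem is isolated to that LNet or the nodes attached to it.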
If at this point the performance results are not satisfactory, there's still a chance that certain parameter adjustments can yield improvements.
This section discusses LND tuning.
A current limitation is that, generally, all NIDs served by a particular LND on a given node must use the same set of LND tunables. With the exception of sockLND conns_per_peer, it is currently not possible to have parameters specific to a NID, so, for example, all "tcpX" NIDs on the same node will be configured the same even though they may be connected to different fabrics.
It is recommended that all NIDs talking to each other over the same LNet have the same set of tunables applied.
For peers on "tcpX" LNets, check the conns_per_peer value in the "lnetctl net show -v 4" output. Heuristically determined optimal settings are: 4 for 100Gbps links and faster, 3 for 50Gbps links, 2 for 5-10Gbps and 1 for anything slower. In some situations, increasing this parameter beyond the recommended value may improve performance further. Note that this can be set per individual tcp NID using lnetctl.
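For example (the interface name and the --conns-per-peer option spelling below are assumptions to be checked against the Lustre version in use):
lnetctl net show -v 4                                    # conns_per_peer is reported with the NI's LND tunables
lnetctl net add --net tcp --if eth1 --conns-per-peer 4   # per-NID value supplied when the NI is configured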
Use "top" to check on the socklnd threads while lnet_selftest (or any other test, e.g. FIO) is running. If the socklnd threads are seen to be fully loaded, it may be beneficial to increase the nscheds value. It makes sense to increase it to a value between conns_per_peer and (conns_per_peer x 2).
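A minimal sketch, assuming conns_per_peer is 4 and the socklnd threads are observed to be pegged (the value is illustrative and takes effect when the module is reloaded):
options ksocklnd nscheds=6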
The default sockLND peer_credits is 8. There's a chance that increasing it, and consequently credits, can improve performance. As shown above, changing these affects the optimal choice of router buffer numbers.
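Illustrative values only (and, per the above, the router buffer numbers should be recomputed after any such change):
options ksocklnd peer_credits=16 credits=1024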
Default tunings for OPA can be found in /etc/modprobe.d/ko2iblnd.conf and are as follows:
options ko2iblnd-opa peer_credits=128 peer_credits_hiw=64 credits=1024 concurrent_sends=256 ntx=2048 map_on_demand=32 fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4
They are applied if an OPA device is detected.
Default tunings for MLNX are hardcoded. The notable differences from the OPA settings are:
peer_credits: 32, peercredits_hiw: 16, concurrent_sends: 64, fmr_pool_size: 512, fmr_flush_trigger: 384, ntx: 512, conns_per_peer: 1
The default peer_credits can be decreased if, for example, it is seen that the remote network is slower and can't keep up. As shown above, changing these affects the optimal choice of router buffer numbers.
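As an illustrative sketch only, where the appropriate value depends on how far the slower side lags behind:
options ko2iblnd peer_credits=16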