The following assumes that the underlying network connectivity has been tested (using iperf, for example) and was found to be working as expected in terms of reliability and performance.

It is also assumed that LNet routing has been configured at least to the extent that the necessary routes have been added on the LNet nodes and at least one node has been designated as an LNet router.

Knowledge of lnet_selftest and configuring module parameters via modprobe is also assumed.

Useful Commands

Here's a list of commands which the verification procedure is going to rely on:

Scenarios

Two scenarios shall be considered:

Single-Hop LNet Routing

In this case, the setup looks like this: 

A <-LNet1-> R <-LNet2-> B

Here, A and B are Lustre endpoints and R is an LNet router. LNetX can be any LNet network, for example "tcp0" or "o2ib100"; the only requirement is that LNet1 and LNet2 are different.
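The requirement that LNet1 and LNet2 differ can be checked mechanically from the NIDs configured on R. The following is a minimal Python sketch, assuming NIDs in the usual address@network form. The function names and the sample addresses are hypothetical, and the sketch treats a network name without a number as network 0 (so "tcp" and "tcp0" count as the same LNet).

```python
def lnet_of(nid):
    """Return the normalized LNet name from a NID such as '10.0.0.1@o2ib100'."""
    _addr, sep, net = nid.rpartition("@")
    if not sep or not net:
        raise ValueError(f"not a NID: {nid!r}")
    base = net.rstrip("0123456789")      # LND type, e.g. 'tcp' or 'o2ib'
    number = net[len(base):] or "0"      # a missing number means network 0
    return f"{base}{int(number)}"


def spans_two_lnets(router_nids):
    """True if the given NIDs belong to at least two different LNets."""
    return len({lnet_of(nid) for nid in router_nids}) >= 2


# Hypothetical NIDs for router R
print(spans_two_lnets(["192.168.1.1@tcp0", "10.10.0.1@o2ib100"]))
print(spans_two_lnets(["192.168.1.1@tcp", "192.168.2.1@tcp0"]))
```

The second call reports that both interfaces are on the same LNet, which would not satisfy the single-hop setup described here.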

Multi-Hop LNet Routing

TODO

LNet Configuration Verification 

Routes

Credits

Router Buffers

This is applicable only to the router node R. The buffers are used to hold the messages being forwarded.

There are "tiny", "small" and "large" buffers. If memory size allows, the number of buffers of each type should be increased to ((number_of_peers_on_LNet1 x peer_credits_on_LNet1) + (number_of_peers_on_LNet2 x peer_credits_on_LNet2)).

"Small" buffers are 4K bytes, "large" are 1M and "tiny" are only a few bytes.
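The buffer-count rule from this section can be worked through as a small calculation. This is a rough Python sketch, not a definitive sizing tool: the peer counts and peer_credits values are made-up inputs, the memory estimate only counts the 4K small and 1M large buffers and ignores any per-buffer overhead, and the module parameter names in the generated options line (tiny_router_buffers, small_router_buffers, large_router_buffers) are assumed to match the installed lnet module and should be verified before use.

```python
def router_buffer_count(peers_lnet1, peer_credits_lnet1,
                        peers_lnet2, peer_credits_lnet2):
    """(peers on LNet1 x peer_credits on LNet1) + (peers on LNet2 x peer_credits on LNet2)."""
    return (peers_lnet1 * peer_credits_lnet1
            + peers_lnet2 * peer_credits_lnet2)


def buffer_memory_mib(count):
    """Approximate memory for `count` small (4K) plus `count` large (1M) buffers, in MiB."""
    small_kib, large_kib = 4, 1024
    return count * (small_kib + large_kib) / 1024


def modprobe_options(count):
    """Format an 'options lnet ...' line that sets all three pools to `count`."""
    return ("options lnet "
            f"tiny_router_buffers={count} "
            f"small_router_buffers={count} "
            f"large_router_buffers={count}")


# Hypothetical example: 100 peers on LNet1, 20 peers on LNet2, peer_credits=8 on both
count = router_buffer_count(100, 8, 20, 8)
print(count)
print(buffer_memory_mib(count))
print(modprobe_options(count))
```

The memory estimate is the part that decides whether "memory size allows": large buffers dominate, at roughly 1 MiB each.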

Basic Connectivity Verification

Load-Testing

If all of the above looks good, lnet_selftest can be used to check LNet performance under load. A few reminders on lnet_selftest usage:

Problem Isolation

If the performance results are not satisfactory, it may be helpful to isolate the problem to a particular node or LNet, for example: 

Tuning SockLND

If the performance results are still not satisfactory at this point, certain parameter adjustments may yet improve them. This section discusses SockLND tuning.

conns_per_peer

For peers on "tcpx" lnets, check the conns_per_peer value in the "lnetctl net show -v 4" output. Heuristically determined optimal settings are 4 for links of 100Gbps and higher, 3 for 50Gbps links, 2 for 5-10Gbps links and 1 for anything slower. In some situations, increasing this parameter beyond the recommended value may help improve performance.
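The heuristic above can be written as a simple lookup. This Python sketch is only an illustration: the function name is hypothetical, and because the text gives no values for speeds between the listed ones (for example 25Gbps or 40Gbps), the sketch assumes each listed speed acts as a lower threshold.

```python
def suggested_conns_per_peer(link_gbps):
    """Map link speed in Gbps to the heuristic conns_per_peer starting value.

    Assumption: speeds between the listed points fall into the lower bracket,
    e.g. 25Gbps and 40Gbps are treated like 5-10Gbps.
    """
    if link_gbps >= 100:
        return 4
    if link_gbps >= 50:
        return 3
    if link_gbps >= 5:
        return 2
    return 1


for speed in (1, 10, 50, 100, 200):
    print(speed, suggested_conns_per_peer(speed))
```

The result is a starting point to compare against the configured value, not a hard limit.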

nscheds

Use "top" to check on the socklnd threads while lnet_selftest (or any other test, e.g. FIO) is running. If the socklnd threads are seen to be fully loaded, it may be beneficial to increase the nscheds value. It makes sense to increase it to a value between conns_per_peer and (conns_per_peer x 2).
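As a quick illustration of the suggested range, the following Python sketch lists the candidate nscheds values for a given conns_per_peer. The function name is hypothetical, and the range is taken as inclusive on both ends, which is an assumption.

```python
def nscheds_candidates(conns_per_peer):
    """Return the nscheds values from conns_per_peer up to 2 x conns_per_peer."""
    if conns_per_peer < 1:
        raise ValueError("conns_per_peer must be at least 1")
    return list(range(conns_per_peer, 2 * conns_per_peer + 1))


# With conns_per_peer=4, try nscheds values in this list while watching "top"
print(nscheds_candidates(4))
```

Stepping through the candidates one at a time and re-running the load test makes it easier to see whether the extra scheduler threads actually help.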

Troubleshooting