Purpose
This document provides guidelines for troubleshooting common LNet issues as seen in the field, with the idea that non-developers should be able to diagnose most such problems and, if a solution is not achievable on the spot, provide developers with adequate detail for a timely investigation.
Overview
From Lustre's point of view, LNet is the layer providing the abstraction of the physical network. The LNet module is an "umbrella" module over a number of LNDs (Lustre Network Drivers), each dealing with a specific network type.
Because Lustre is a distributed file system, many errors in the system may carry the markers of a "network error", and it is not always trivial to determine whether LNet is actually at fault.
Topology
It is important to know how the customer is using the network with Lustre.
- which underlying network types are configured (IB, RoCE, Ethernet/TCP, etc.)
- which LNet types are used (o2ib, tcp, kfi, etc.)
- whether LNet routing is configured
- number of servers and clients in the cluster, per type (there may be o2ib and tcp clients, for example)
- whether the system is single-rail or multi-rail
- which subnets are used (if there are multiple subnets)
Configuration
It should be made clear how Lustre and LNet modules are configured.
- scripts under /etc/modprobe.d
- lnet.service
- dynamic configuration by custom scripts
- combination of the above
Separately, configuration outside of Lustre/LNet should also be checked:
- Linux version
- Driver version (e.g. MOFED)
- Linux network configuration (network interfaces, MTU, routing tables, select sysctl settings)
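For reference, a minimal set of commands to capture most of this information might look like the following (ofed_info is only present when MOFED is installed; the sysctl selection is just an example):
  uname -r                # kernel version
  ofed_info -s            # MOFED version, if MOFED is installed
  ip addr show            # interfaces, IP addresses, MTU
  ip route show           # routing tables
  sysctl net.core.rmem_max net.core.wmem_max   # example sysctl settings of interest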
Consistency
It is important to know whether all nodes of the same type are similar and configured the same. Check the following across the nodes of the same type:
- Lustre version (lctl --version)
- Lustre/LNet configuration
- Linux network configuration
For example, an inconsistent MTU setting in the system is often found to be responsible for bulk transfer failures.
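A quick way to compare nodes is to run the same command across them and consolidate the output, e.g. with pdsh/dshbak. A sketch, assuming pdsh is available and using placeholder node names and interface name:
  pdsh -w client[01-16] "lctl --version" | dshbak -c
  pdsh -w client[01-16] "cat /sys/class/net/ib0/mtu" | dshbak -c
Any node whose output is grouped separately by dshbak -c differs from the rest and should be examined.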
System Logs
Collection
A proper "sosreport" or "showall" package should contain the system/kernel logs. Console logs may also be of interest.
Sometimes custom sosreport packages generated on clients are actually missing the logs. It is worthwhile to check that the logs are included in the package.
Make sure to collect the logs from all nodes participating in the scenario which is being investigated. For example, if the issue is that a particular client is failing to mount Lustre FS, then collect the logs from the client and all of the servers it can use.
Bits for Early Analysis
Startup
When LNet starts and is able to initialize an NI successfully, it logs something like
LNet: Added LNI 10.1.0.101@o2ib
This can be used to check whether the NI is coming up with the expected interface and IP.
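For example, assuming standard logging locations, the startup messages can be found with:
  journalctl -k | grep "Added LNI"
  grep "Added LNI" /var/log/messages    # on systems logging to a file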
Errors
Scanning for LNet and Lustre errors in the logs may be useful to quickly establish the problem area.
Other errors logged in the same time period may also be important, particularly anything from the Linux networking subsystem, e.g. notifications of network interface status changes.
Mellanox Driver Dumps
If Mellanox HW is involved, look for messages from the MLNX driver in the kernel logs. Basically, scan for occurrences of "mlx5_core" or "mlnx" ("5" may be a different number depending on the driver version).
Some output from the Mellanox driver is expected at system startup and may be useful, for example, to verify the driver version. Later on, however, the driver is supposed to be silent if everything is normal, so anything it dumps in or around the problem period may contain key information.
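A simple scan along these lines (the pattern is just an example) usually suffices:
  journalctl -k | grep -i -E "mlx[0-9]+_core|mlnx"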
Lnetctl Outputs
Here are some important lnetctl outputs which should be collected manually if they are not included with showall/sosreport automatically:
- lnetctl global show
- lnetctl stats show
- lnetctl net show -v 4
- lnetctl peer show -v 4 (may produce a large amount of data on large systems with many peers)
- lnetctl route show -v 4
- lnetctl routing show
There's currently no way to trigger a reset of any of the counters shown by lnetctl. For live debugging it is important to take "before" and "after" snapshots and/or use the Linux "watch" command to track changes of a particular stat.
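For example, a before/after comparison and live tracking could look like this (file names are arbitrary):
  lnetctl stats show > stats.before
  # ... reproduce the problem ...
  lnetctl stats show > stats.after
  diff stats.before stats.after
  watch -n 1 lnetctl stats show    # or watch the counters change live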
Analysing Lnetctl Outputs
lnetctl net show
- NI status: up or down. If NI is down, check if it is expected.
- fatal_error: if set to 1, the NI is marked down due to a network event, e.g. "link down" or "IP address not found"
- health value: if the health feature is on and the health value is less than 1000, then errors were detected when sending via this NI, which led to a health score decrease. The NI may be recovering, so it is worth monitoring.
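For orientation, an abbreviated healthy NI in "lnetctl net show -v 4" output looks roughly like the following (the exact field layout varies between Lustre versions):
  net:
      - net type: o2ib
        local NI(s):
          - nid: 10.1.0.101@o2ib
            status: up
            health stats:
                fatal_error: 0
                health value: 1000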
lnetctl route show
- route state: up or down. If route is down, check if it is expected
lnetctl global show
- transaction_timeout, retry_count should be consistent across all nodes
- discovery: it should be clear why this is on or off on a particular node
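These settings can be inspected and, if needed, adjusted with lnetctl. The values below are only examples and should normally match across the cluster:
  lnetctl global show
  lnetctl set transaction_timeout 10
  lnetctl set retry_count 2
  lnetctl set discovery 0      # 1 = on, 0 = off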
Debug Logs
System logs only contain warning and error messages, so there are situations when debug logs are necessary too.
- When debugging a connection issue, make sure to obtain debug logs from both sides of the connection
- For debugging lnet, at least enable net debug logging: "lctl set_param debug=+net"
- If possible, reproduce on a "quiet" system to reduce the amount of data in the log.
- Make sure that debug log buffer is big enough, especially on a busy server, and that there's sufficient memory to hold the buffer.
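Putting the above together, a minimal capture sequence might look like this (buffer size and file name are examples):
  lctl set_param debug=+net        # enable net debug messages
  lctl set_param debug_mb=1024     # enlarge the debug buffer; make sure there is RAM for it
  lctl clear                       # empty the buffer right before reproducing
  # ... reproduce the problem ...
  lctl debug_kernel /tmp/lnet-debug.txt   # dump the buffer to a file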
Scenarios
Connectivity
This section describes the procedure for testing connectivity between nodes running Lustre.
It is a good practice to start troubleshooting with this procedure unless there's already clear evidence of where the problem is.
Underlying Network
LNet relies on the underlying network functioning properly, so this is the first test to run. Use a method appropriate for the specific network type:
- IB: ping, ibping, ib_read_bw, ib_write_bw, iperf
- TCP: ping, iperf
Ping tests are required to pass reliably, and bandwidth tests are required to reach the desired bandwidth. If these tests fail, then LNet/Lustre can't do any better.
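As an illustration, a basic reliability check plus TCP and IB bandwidth checks could be run as follows (addresses are placeholders):
  ping -c 100 <server-ip>        # expect 0% packet loss
  iperf3 -s                      # on the server node (TCP)
  iperf3 -c <server-ip> -t 30    # on the client node (TCP)
  ib_write_bw                    # on the server node (IB)
  ib_write_bw <server-ib-ip>     # on the client node (IB)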
LNet
Once it has been demonstrated that the underlying network appears to be fine, the following can be used to verify LNet connectivity in isolation from Lustre. The Lustre module doesn't even need to be loaded.
- lnetctl ping: this needs to be working reliably, so make sure this can be repeated successfully
- lnet_selftest: this needs to work without errors and reach desired bandwidth
If these tests fail then Lustre can't do any better. At this point the issue has been isolated to the LNet/LND layer and the reproducer is relatively straightforward, so it is time to collect the logs for the developers to investigate.
If these tests pass, then the issue can't be isolated to the LNet/LND layer using simple tools. The reproducer will need to be found with the Lustre modules involved, for example using FIO or a specific sequence of user actions, depending on the case.
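A minimal sketch of both checks, assuming two nodes with the example NIDs 10.1.0.101@o2ib (server) and 10.1.0.102@o2ib (client):
  # LNet ping; repeat to confirm it works reliably:
  lnetctl ping 10.1.0.101@o2ib
  # lnet_selftest bulk read between the two nodes, driven from a console node:
  modprobe lnet_selftest         # needed on all participating nodes
  export LST_SESSION=$$
  lst new_session lnet_check
  lst add_group servers 10.1.0.101@o2ib
  lst add_group clients 10.1.0.102@o2ib
  lst add_batch bulk
  lst add_test --batch bulk --from clients --to servers brw read size=1M
  lst run bulk
  lst stat clients servers       # observe bandwidth; interrupt with Ctrl-C
  lst end_session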
LNet Routing
LNet routing is generally used when LNets of different types need to be bridged. LNet routers are essentially a special use case of a Lustre client. Nothing but the LNet module needs to be loaded on an LNet router node.
Info on LNet routing configuration and tuning is available here: https://wiki.lustre.org/LNet_Router_Config_Guide and here: https://wiki.whamcloud.com/x/jAI_EQ
The strategy for verifying LNet-routed paths is as follows:
- Verify networks underlying the LNets being connected by the LNet router
- Verify each of the LNets connected by the LNet router in isolation
- Use the same procedure as outlined above for the isolated LNet verification, but apply it to nodes assigned to the different LNets being connected by the router
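For example, for a tcp client reaching o2ib servers via a router with NID 192.168.1.4@tcp (all addresses here are placeholders), the routed path can be set up and checked as follows:
  # On the router: enable routing
  lnetctl set routing 1
  # On the client: add the route and verify it
  lnetctl route add --net o2ib --gateway 192.168.1.4@tcp
  lnetctl route show -v 4        # route state should be up
  lnetctl ping 10.1.0.101@o2ib   # ping a server NID across the router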
Multi-Rail
This section contains some troubleshooting tips related specifically to the Multi-rail (MR) systems.
- If the system is using different IP subnets within the same LNet, make sure that Linux routing rules are in place allowing traffic to go between the subnets
- If there are multiple NIs of the same type configured on a node, check packet counters in both lnetctl stats and the Linux per-interface network stats to make sure that traffic is distributed evenly across the NIs (see the commands below)
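The per-NI counters can be compared as follows: lnetctl reports per-NI message/byte statistics, and the Linux per-interface counters provide an independent cross-check:
  lnetctl net show -v 4    # per-NI statistics (send/recv/drop counts)
  ip -s link show          # per-interface packet counters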