Purpose
This document provides guidelines for troubleshooting common LNet issues seen in the field. The goal is for non-developers to be able to diagnose most such problems and, when a solution cannot be found on the spot, to give developers enough detail for a timely investigation.
Overview
From the Lustre point of view, LNet is the layer that abstracts the physical network. The LNet module is an "umbrella" module over a number of LNDs (Lustre Network Drivers), each dealing with a specific network type.
Because Lustre is a distributed file system, many errors in the system can look like a "network error", and it is not always trivial to determine whether LNet is actually at fault.
Topology
It is important to know how the customer is using the network with Lustre:
- which underlying network types are configured (IB, RoCE, Ethernet/TCP, etc.)
- which LNet types are used (o2ib, tcp, kfi, etc.)
- whether LNet routing is configured
- number of servers and clients in the cluster, per type (there may be o2ib and tcp clients, for example)
- whether the system is single-rail or multi-rail
- which subnets are used (if there are multiple subnets)
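The per-type counts in the list above can be derived from the NIDs of the nodes. The following is a minimal sketch, assuming the NIDs have already been gathered into a list (for example from the output of `lctl list_nids` on each node). The helper names and the sample addresses are made up for illustration.

```python
from collections import Counter


def lnet_type(nid):
    """Return the LNet network type of a NID, e.g. 'o2ib' for '10.0.0.1@o2ib1'."""
    _, sep, net = nid.strip().rpartition("@")
    if not sep or not net:
        raise ValueError(f"not a NID: {nid!r}")
    # A network name is the type followed by an optional network number,
    # so dropping trailing digits leaves the type ('tcp1' -> 'tcp').
    return net.rstrip("0123456789")


def count_by_type(nids):
    """Count NIDs per LNet network type."""
    return Counter(lnet_type(n) for n in nids)


# Hypothetical sample data, not taken from a real system.
nids = ["10.0.0.1@o2ib", "10.0.0.2@o2ib", "192.168.1.5@tcp", "192.168.2.9@tcp1"]
for net_type, count in sorted(count_by_type(nids).items()):
    print(net_type, count)
```

Note that a multi-rail node contributes one NID per interface, so NID counts and node counts may differ on such systems.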
Configuration
Establish how the Lustre and LNet modules are configured:
- module options in files under /etc/modprobe.d/
- lnet.service
- dynamic configuration by custom scripts
- a combination of the above
Separately, the configuration outside of Lustre/LNet should also be checked:
- Linux version
- Driver version (e.g. MOFED)
- Linux network configuration (network interfaces, MTU, routing tables, select sysctl settings)
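The items above can be captured in one pass so that the same snapshot can be handed to developers. The following is a rough sketch, not a supported tool. The command list and the keyword filter for module options are assumptions and should be adjusted to what is installed on the node. Commands that are missing are reported rather than treated as fatal.

```python
import glob
import os
import subprocess

# Assumed set of commands for a snapshot; trim or extend as needed.
COMMANDS = [
    ["uname", "-r"],
    ["lctl", "--version"],
    ["lctl", "list_nids"],
    ["lnetctl", "net", "show", "-v"],
    ["lnetctl", "route", "show"],
    ["ip", "addr", "show"],
    ["ip", "route", "show"],
    ["ofed_info", "-s"],  # only present when MOFED is installed
]


def run_command(argv, timeout=30):
    """Run a command and return its output, or a marker string on failure."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"<not available: {exc}>"
    if proc.returncode != 0:
        return f"<exit {proc.returncode}: {proc.stderr.strip()}>"
    return proc.stdout


def modprobe_options(directory="/etc/modprobe.d"):
    """Return 'options' lines that mention lnet, lustre or an LND module."""
    keywords = ("lnet", "lnd", "lustre")  # assumed filter, e.g. matches ko2iblnd
    found = []
    for path in sorted(glob.glob(os.path.join(directory, "*.conf"))):
        with open(path, errors="replace") as fh:
            for line in fh:
                text = line.strip()
                if text.startswith("options") and any(k in text for k in keywords):
                    found.append(f"{path}: {text}")
    return found


if __name__ == "__main__":
    for argv in COMMANDS:
        print("###", " ".join(argv))
        print(run_command(argv))
    print("### module options")
    print("\n".join(modprobe_options()))
```

Running the same script on every node and saving the output per host gives comparable snapshots.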
Consistency
It is important to know whether all nodes of the same type are set up and configured identically. Check the following across the nodes of the same type:
- Lustre version (lctl --version)
- Lustre/LNet configuration
- Linux network configuration
For example, an inconsistent MTU setting across the system is often found to be responsible for bulk transfer failures.
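A consistency check of this kind can be automated once one value per node has been collected (for MTU, for example, from /sys/class/net/<interface>/mtu on each node). The following is a minimal sketch that reports the nodes whose value differs from the most common one. The function name, node names and MTU values are hypothetical.

```python
from collections import Counter


def find_outliers(values_by_node):
    """Return (majority_value, {node: value}) for nodes that differ from the majority."""
    if not values_by_node:
        return None, {}
    # Treat the most common value as the expected setting for this node type.
    majority, _ = Counter(values_by_node.values()).most_common(1)[0]
    outliers = {n: v for n, v in values_by_node.items() if v != majority}
    return majority, outliers


# Hypothetical data: MTU per OSS node.
mtu_by_node = {"oss01": 9000, "oss02": 9000, "oss03": 1500, "oss04": 9000}
majority, outliers = find_outliers(mtu_by_node)
print(f"most common MTU: {majority}; nodes that differ: {outliers}")
```

The same function works for any per-node setting that can be reduced to a single comparable value, such as the Lustre version string. The majority value is only a hint: if most nodes are misconfigured, the outliers are the correct ones.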