Purpose

This document provides guidelines for troubleshooting common LNet issues seen in the field. The goal is that non-developers should be able to diagnose most such problems and, if a solution is not achievable on the spot, provide developers with enough detail for a timely investigation.

Overview

From the Lustre point of view, LNet is the layer that abstracts the physical network. The LNet module acts as an "umbrella" over a number of LNDs (Lustre Network Drivers), each of which handles a specific network type.

Because Lustre is a distributed file system, many errors in the system can look like "network errors", and it is not always trivial to determine whether LNet is actually at fault.


Topology

It is important to know how the customer is using the network with Lustre; a few commands that help collect this information are sketched after the list.

  • which underlying network types are configured (IB, RoCE, Ethernet/TCP, etc.)
  • which LNet types are used (o2ib, tcp, kfi, etc.)
  • whether LNet routing is configured
  • number of servers and clients in the cluster, per type (there may be o2ib and tcp clients, for example)
  • is the system single-rail or multi-rail
  • which subnets are used (if there are multiple subnets)
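A minimal sketch of commands for collecting most of this information on a node, assuming lnetctl and lctl are in the PATH:

  lctl list_nids          # NIDs configured on this node
  lnetctl net show        # LNet networks and local NIs
  lnetctl route show      # LNet routes, if routing is configured
  lnetctl peer show       # known peers (multi-rail peers list multiple NIDs)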

Configuration

It should be made clear how the Lustre and LNet modules are configured; an example static configuration is sketched after the list.

  • scripts under /etc/modprobe.d/ 
  • lnet.service
  • dynamic configuration by custom scripts
  • combination of the above
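For reference, a typical static configuration might look like the sketch below; the file names and values are illustrative rather than taken from any particular system, and the currently applied configuration can always be dumped in YAML form with lnetctl export:

  # /etc/modprobe.d/lustre.conf (example)
  options lnet networks="o2ib0(ib0)"

  # /etc/lnet.conf is the YAML file loaded by lnet.service
  # dump the running configuration for comparison:
  lnetctl export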

Separately, configuration outside of Lustre/LNet should also be checked (see the command sketch after the list):

  • Linux version
  • Driver version (e.g. MOFED)
  • Linux network configuration (network interfaces, MTU, routing tables, select sysctl settings)
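These can usually be checked with standard tools, for example (ofed_info is only present when MOFED is installed, and the sysctl keys are just examples):

  uname -r                                      # kernel version
  ofed_info -s                                  # MOFED version
  ip addr; ip -d link show                      # interfaces and MTU
  ip route                                      # routing tables
  sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem    # selected sysctl settings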

Consistency

It is important to know whether all nodes of the same type are configured the same way. Check the following across nodes of the same type:

  • Lustre version (lctl --version)
  • Lustre/LNet configuration
  • Linux network configuration

For example, an inconsistent MTU setting across the system is often found to be responsible for bulk transfer failures.
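A quick way to compare such values across nodes, assuming pdsh/dshbak are available and the node names and interface name are illustrative:

  pdsh -w client[01-16] 'lctl --version' | dshbak -c
  pdsh -w client[01-16] 'ip -d link show ib0 | grep -o "mtu [0-9]*"' | dshbak -c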

System Logs

Collection

A proper "sosreport" or "showall" package should contain the system/kernel logs. Console logs may also be of interest.

Custom sosreport collections from clients are sometimes missing the logs, so it is worth verifying that the logs are actually included in the package.

Make sure to collect the logs from all nodes participating in the scenario which is being investigated. For example, if the issue is that a particular client is failing to mount Lustre FS, then collect the logs from the client and all of the servers it can use.
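If the logs turn out to be missing, they can be collected manually, for example (the time range is illustrative):

  journalctl -k --since "2 days ago" > kernel.log    # kernel messages via journald
  dmesg -T > dmesg.log                               # current kernel ring buffer with timestamps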

Bits for Analysis

Startup

When LNet starts and is able to initialize an NI successfully, it logs something like

LNet: Added LNI 10.1.0.101@o2ib

This can be used to check whether the NI comes up with the expected network type and IP address.
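For example, assuming the kernel logs are in /var/log/messages:

  grep "Added LNI" /var/log/messages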

Errors

Scanning for LNet and Lustre errors in the logs may be useful to quickly establish the problem area.

Other messages that may be important, if logged in the same time period, are those from the Linux networking stack, e.g. notifications of network interface status changes.
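Assuming the kernel logs are in /var/log/messages, a quick scan could look like the following; the interface-change patterns are illustrative and depend on the driver:

  grep -E "LNetError|LustreError" /var/log/messages
  grep -iE "link is (up|down)|NETDEV" /var/log/messages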

Mellanox Driver Dumps

If Mellanox HW is involved, look for messages from the MLNX driver in the kernel logs. Basically, scan for occurrences of "mlx5_core" or "mlnx" ("5" may be a different number depending on the driver version).

Some output from the Mellanox driver is expected at system startup and can be useful, for example, to verify the driver version. Later on, however, the driver is expected to be silent when everything is normal, so anything it dumps in or around the problem period may contain key information.
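For example, the following matches mlx4 and mlx5 messages in the kernel ring buffer (the exact module name depends on the hardware generation):

  dmesg -T | grep -iE "mlx[0-9]_core|mlnx"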

Lnetctl Outputs

Here are some important lnetctl outputs which should be collected manually if they are not included with showall/sosreport automatically (a simple loop for saving them to files is sketched after the list):

  • lnetctl global show
  • lnetctl stats show
  • lnetctl net show -v 4
  • lnetctl peer show -v 4 (may produce a large amount of data on systems with many peers)
  • lnetctl routing show
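A minimal sketch for saving these outputs to files, assuming a bash shell, lnetctl in the PATH and illustrative file names:

  for cmd in "global show" "stats show" "net show -v 4" "peer show -v 4" "routing show"; do
      lnetctl $cmd > "lnetctl_${cmd// /_}.out" 2>&1    # e.g. lnetctl_net_show_-v_4.out
  done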


Scenarios

Connectivity

LNet Routing

Multi-Rail



