
Purpose

This document provides guidelines for troubleshooting common LNet issues seen in the field. The goal is for non-developers to be able to diagnose most such problems and, when a solution cannot be found on the spot, to give developers enough detail for a timely investigation.

Overview

From the Lustre point of view, LNet is the layer that abstracts the physical network. The LNet module is an "umbrella" module over a number of LNDs (Lustre Network Drivers), each dealing with a specific network type.

Because Lustre is a distributed file system, many errors in the system can carry the markers of a "network error", and it is not always trivial to determine whether LNet is actually at fault.


Topology

It is important to know how the customer is using the network with Lustre. Establish the following:

  • which underlying network types are configured (IB, RoCE, Ethernet/TCP, etc.)
  • which LNet types are used (o2ib, tcp, kfi, etc.)
  • whether LNet routing is configured
  • number of servers and clients in the cluster, per type (there may be o2ib and tcp clients, for example)
  • whether the system is single-rail or multi-rail
  • which subnets are used (if there are multiple subnets)
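When a list of NIDs is already available (for example, copied from node configuration or from command output), the per-type node counts mentioned above can be tallied with a small helper. The following is a minimal sketch, assuming NIDs in the usual `address@network` form such as `10.0.0.1@o2ib1`; the function names and the sample addresses are hypothetical and are not part of any Lustre tool.

```python
from collections import Counter


def split_nid(nid):
    """Split a NID such as '10.0.0.1@o2ib1' into (address, network)."""
    address, sep, network = nid.strip().rpartition("@")
    if not sep or not address or not network:
        raise ValueError(f"not a valid NID: {nid!r}")
    return address, network


def lnd_type(network):
    """Drop the trailing network number: 'o2ib1' -> 'o2ib', 'tcp' -> 'tcp'."""
    return network.rstrip("0123456789")


def count_by_lnd(nids):
    """Count NIDs per LNet type (o2ib, tcp, kfi, ...)."""
    return Counter(lnd_type(split_nid(nid)[1]) for nid in nids)


# Hypothetical sample data, for illustration only.
sample_nids = ["10.0.0.1@o2ib", "10.0.0.2@o2ib1", "192.168.1.5@tcp"]
print(dict(count_by_lnd(sample_nids)))
```

Grouping by LNet type rather than by full network name is a design choice for this sketch; grouping by `split_nid(nid)[1]` instead would separate `o2ib` from `o2ib1`, which is more useful when multiple subnets are in play.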

Configuration

Establish how the Lustre and LNet modules are configured. Common methods are:

  • configuration files under /etc/modprobe.d/
  • lnet.service
  • dynamic configuration by custom scripts
  • a combination of the above
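For the /etc/modprobe.d/ case, it can help to extract only the module options relevant to LNet so they can be compared between nodes. The sketch below parses `options <module> key=value ...` lines from text that has already been read from such a file. The function name is hypothetical, the default module list (`lnet`, `ko2iblnd`, `ksocklnd`) is an assumption about which modules are of interest, and the sample file content is made up for illustration.

```python
import shlex


def parse_modprobe_options(text, modules=("lnet", "ko2iblnd", "ksocklnd")):
    """Return {module: {param: value}} for 'options' lines of the given modules."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        tokens = shlex.split(line)  # handles quoted values
        if len(tokens) < 3 or tokens[0] != "options" or tokens[1] not in modules:
            continue
        params = result.setdefault(tokens[1], {})
        for token in tokens[2:]:
            key, sep, value = token.partition("=")
            if sep:
                params[key] = value
    return result


# Hypothetical file content, for illustration only.
sample = '''
# LNet configuration
options lnet networks="o2ib0(ib0),tcp0(eth0)"
options ko2iblnd peer_credits=32
'''
print(parse_modprobe_options(sample))
```

This only covers static module options; settings applied through lnet.service or custom scripts would need to be collected separately.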

Separately, configuration outside of Lustre/LNet should also be checked:

  • Linux version
  • Driver version (e.g. MOFED)
  • Linux network configuration (network interfaces, MTU, routing tables, select sysctl settings)

Consistency

It is important to know whether all nodes of the same type are similar and configured identically. Check the following across the nodes of the same type:

  • Lustre version (lctl --version)
  • Lustre/LNet configuration
  • Linux network configuration

For example, an inconsistent MTU setting across the system is often found to be responsible for bulk transfer failures.
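Once a setting such as the MTU or the Lustre version has been collected from each node of a given type, the odd ones out can be flagged mechanically. The following is a minimal sketch, assuming the values have already been gathered into a dictionary keyed by node name; the function name, node names, and values are hypothetical.

```python
from collections import Counter


def find_outliers(values_by_node):
    """Return (majority_value, {node: value}) for nodes that differ from the majority."""
    if not values_by_node:
        return None, {}
    majority, _count = Counter(values_by_node.values()).most_common(1)[0]
    outliers = {
        node: value
        for node, value in values_by_node.items()
        if value != majority
    }
    return majority, outliers


# Hypothetical per-node MTU values, for illustration only.
mtus = {"oss01": 9000, "oss02": 9000, "oss03": 1500, "oss04": 9000}
majority, outliers = find_outliers(mtus)
print(majority, outliers)
```

Treating the most common value as the expected one is only a heuristic: with an even split, or when most nodes share the same misconfiguration, the result still needs a human check against the intended configuration.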

Logs


Lnetctl Outputs


Scenarios

Connectivity

LNet Routing

Multi-Rail



