...
How can I verify my routing setup works?
Basic verification
- Run lnetctl peer show on your node (or lnetctl peer show --nid <router nid>) and check that all router NIDs on the same LNet are listed with "status: up". If any router NIDs are listed as "down", try the same connectivity checks between your node and the router as described in "TODO: How can I verify connectivity between nodes on the same LNet".
- If avoid_asym_router_failure is enabled, make sure that the same routers are set up on both sides.
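
The status check in the first bullet can be scripted. The following is a minimal sketch, assuming that the output of lnetctl peer show contains a "nid: <value>" line for each peer NID, followed by a "status: <value>" line, as the text above describes. The exact field names and layout may differ between Lustre versions, and the helper name list_down_nids is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `lnetctl peer show` output on stdin and print
# every NID whose following "status:" line is not "up".
# Assumption: each entry has a "nid: <value>" line followed by a
# "status: <value>" line; adjust the patterns to your version's output.
list_down_nids() {
    awk '
        # Remember the most recent NID seen ("nid: X" or "- nid: X").
        $1 == "nid:"              { nid = $2 }
        $1 == "-" && $2 == "nid:" { nid = $3 }
        # Report the remembered NID when its status is anything but "up".
        $1 == "status:" && $2 != "up" && nid != "" { print nid }
        # A status line closes the current entry.
        $1 == "status:"           { nid = "" }
    '
}

# Usage (on a node with LNet configured):
#   lnetctl peer show | list_down_nids
```

An empty result means no NID was reported with a status other than "up"; it does not by itself prove that every expected router is listed.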
Troubleshooting steps
If basic router setup verification indicates connectivity problems that can't be solved using the means discussed in "TODO: How can I verify connectivity between nodes on the same LNet", use the following procedure.
Preparation
If there are many same-level routers connecting your nodes, try to isolate just one for each hop level (if you have identified which one causes the problem, leave that one). This will require changing routing settings. For example, if the topology is Client - GW[1..N] - Server, leave only GW1 in the router configuration of the server and the client. This can be done using the lnetctl route del command (the changes are not persistent and will be lost when LNet is reloaded).
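
As an illustration of isolating one gateway, the sketch below removes the routes through every listed gateway except the one to keep. The remote network name (tcp2) and the gateway NIDs are placeholders, and the helper name keep_only_gateway is hypothetical. It relies on lnetctl route del taking the remote network with --net and the router NID with --gateway.

```shell
#!/bin/sh
# Hypothetical helper: delete the routes to a remote network through every
# listed gateway except the one to keep.
# usage: keep_only_gateway <remote net> <gateway to keep> <gateway>...
keep_only_gateway() {
    remote_net=$1
    keep=$2
    shift 2
    for gw in "$@"; do
        if [ "$gw" != "$keep" ]; then
            # Not persistent: the route comes back when LNet is reloaded
            # with its original configuration.
            lnetctl route del --net "$remote_net" --gateway "$gw"
        fi
    done
}

# Example with placeholder values (run on both the client and the server,
# each with the remote network and gateway NIDs that apply to that node):
#   keep_only_gateway tcp2 10.0.0.1@tcp1 10.0.0.1@tcp1 10.0.0.2@tcp1 10.0.0.3@tcp1
```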
Make sure that connected nodes can lnetctl ping each other. Note that lnetctl ping needs to be tried multiple times.
For example, if the topology is Client — GW1 — GW2 — Server:
- lnetctl ping Server from the Client. If that fails, then
- lnetctl ping between directly connected nodes: Client to GW1, GW1 to GW2, GW2 to Server. If any of these fails, this is not a routing setup issue but a connectivity issue between specific nodes on the same network. Refer to "TODO: How can I verify connectivity between nodes on the same LNet". If they all succeed, then
- lnetctl ping nodes that are a single hop away: Client to GW2, Server to GW1 in both directions. If any of that fails, then
- Check the routing config for the nodes that fail to communicate.
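
The Client-side part of this procedure can be sketched as a small script. It is an illustration only: the function names, the PING_DELAY variable, and the NIDs are hypothetical, and it assumes that lnetctl ping exits with a non-zero status when the ping fails. The checks that start from GW1, GW2, or the Server still have to be run on those nodes.

```shell
#!/bin/sh
# Hypothetical helper: ping a NID up to N times, succeeding on first reply.
# usage: ping_retry <nid> [attempts]
ping_retry() {
    nid=$1
    attempts=${2:-3}
    i=1
    while [ "$i" -le "$attempts" ]; do
        # Assumption: lnetctl ping returns non-zero when the ping fails.
        if lnetctl ping "$nid" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        # Pause between attempts only if another attempt remains.
        if [ "$i" -le "$attempts" ]; then
            sleep "${PING_DELAY:-1}"
        fi
    done
    return 1
}

# Run on the Client in the Client - GW1 - GW2 - Server example.
# usage: check_from_client <server nid> <gw1 nid> <gw2 nid>
check_from_client() {
    server=$1; gw1=$2; gw2=$3
    if ping_retry "$server"; then
        echo "OK: Client to Server"
        return 0
    fi
    if ! ping_retry "$gw1"; then
        echo "FAIL: Client to GW1 (same-LNet connectivity issue)"
        return 1
    fi
    if ! ping_retry "$gw2"; then
        echo "FAIL: Client to GW2 (check the routing config of these nodes)"
        return 1
    fi
    echo "FAIL: Client to Server (repeat the checks from GW1, GW2 and Server)"
    return 1
}

# Example with placeholder NIDs:
#   check_from_client 10.2.0.5@tcp2 10.0.0.1@tcp 10.1.0.1@tcp1
```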
What is LNet Multi-Rail?
LNet Multi-Rail allows multiple interfaces to be used for sending LNet messages, which boosts performance. A follow-up feature, LNet Resiliency, is currently being worked on and is aimed at increasing resiliency.
...