Multi-Rail has changed how LNet views the world. Prior to Multi-Rail, each NID represented one unique peer on a network; there was no concept of multiple NIDs identifying the same peer. With Multi-Rail, a peer can have multiple NIDs on the same or different networks, and LNet is aware that these NIDs reference the same peer. This creates a disconnect with the routing infrastructure currently in place, as highlighted in two recent LUs (at the time of this writing): and .
The routing infrastructure needs to operate at the peer level rather than the peer NI level.
When adding a route it takes the following form:
lnetctl route add --net <remote network> --gateway <local gateway NID>
The code currently adds this gateway as a standard peer, which is also kept on a gateway list.
Multi-Rail changes the way we deal with peers such that a peer is composed of multiple peer_nis. However, this infrastructure doesn't extend to the routing logic.
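The relationship described above can be illustrated with a minimal C sketch. All type and function names here (`sketch_peer`, `sketch_peer_ni`, `sketch_route`, `sketch_route_by_gateway_nid`) are hypothetical and chosen for illustration only; they are not the LNet definitions. The sketch assumes a route holds a reference to the gateway peer as a whole, so that any of the gateway's NIDs resolves to the same route.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for illustration only; not the LNet definitions. */
typedef uint64_t sketch_nid_t;

#define SKETCH_MAX_PEER_NIS 4

/* One interface (NID) belonging to a peer. */
struct sketch_peer_ni {
	sketch_nid_t nid;
};

/* A peer groups every NID known to reference the same node. */
struct sketch_peer {
	struct sketch_peer_ni nis[SKETCH_MAX_PEER_NIS];
	size_t                num_nis;
};

/* A route points at the gateway peer, not at a single gateway NID. */
struct sketch_route {
	uint32_t            remote_net;
	struct sketch_peer *gateway;
};

/* Return non-zero if nid is one of the peer's interfaces. */
static int sketch_peer_has_nid(const struct sketch_peer *peer,
			       sketch_nid_t nid)
{
	assert(peer != NULL);
	for (size_t i = 0; i < peer->num_nis; i++) {
		if (peer->nis[i].nid == nid)
			return 1;
	}
	return 0;
}

/* Find the route whose gateway owns nid, whichever interface it is. */
static struct sketch_route *
sketch_route_by_gateway_nid(struct sketch_route *routes, size_t count,
			    sketch_nid_t nid)
{
	for (size_t i = 0; i < count; i++) {
		if (sketch_peer_has_nid(routes[i].gateway, nid))
			return &routes[i];
	}
	return NULL;
}
```

With this layout, a gateway that has two NIDs is still one gateway: looking up either NID yields the same route entry, which is the peer-level view the routing code would need.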
The following is a set of proposed changes to align LNet's routing infrastructure with Multi-Rail.
When a gateway goes from dead back to alive again, it should be rediscovered, in case its interface list has changed.
The avoid_asym_router_failure logic needs to be reworked.
These changes will integrate router handling more closely with the Multi-Rail code and will avoid cases where an MR router is not discovered properly, or where a router is identified as dead when it really is not.
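The rule that a router returning from dead to alive should be rediscovered can be sketched as a small state transition in C. The structure and function names (`sketch_gateway`, `sketch_gateway_set_alive`) and the `needs_discovery` flag are assumptions made for this illustration, not existing LNet code; how discovery is actually triggered is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical gateway state for illustration; not an LNet structure. */
struct sketch_gateway {
	bool alive;           /* last known liveness */
	bool needs_discovery; /* set when the interface list may be stale */
};

/*
 * Record a new liveness state. Only a dead -> alive transition
 * requests rediscovery; repeated "alive" reports do not.
 */
static void sketch_gateway_set_alive(struct sketch_gateway *gw, bool alive)
{
	assert(gw != NULL);
	if (alive && !gw->alive)
		gw->needs_discovery = true;
	gw->alive = alive;
}
```

Keying the request on the transition rather than on the alive state avoids rediscovering a gateway every time it is reported alive, while still catching an interface list that changed during the outage.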