A router is a node that has the routing feature turned on, using lnetctl set routing 1
or the equivalent modprobe configuration.
router_ping_timeout + MAX(live_router_check_interval, dead_router_check_interval)
then it's marked down.
avoid_asym_router_failure is set to 1.
A gateway in this context is the peer NI created when adding a route on a node. For example: lnetctl route add --net tcp --gateway <gateway-NID>.
Dealing with that peer-NI is somewhat of a special case.
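The down-marking interval quoted above, router_ping_timeout + MAX(live_router_check_interval, dead_router_check_interval), can be sketched in C as follows. This is a minimal illustration, not the LNet implementation: the struct and function names are hypothetical, and it assumes the interval is measured in seconds from the last time the node was known to be alive, which the text here does not state.

```c
#include <stdbool.h>

/* Hypothetical container for the three tunables named in the text (seconds). */
struct router_check_params {
	int router_ping_timeout;
	int live_router_check_interval;
	int dead_router_check_interval;
};

/* router_ping_timeout + MAX(live_router_check_interval,
 *                           dead_router_check_interval) */
static int router_down_deadline(const struct router_check_params *p)
{
	int max_interval =
		p->live_router_check_interval > p->dead_router_check_interval ?
		p->live_router_check_interval : p->dead_router_check_interval;

	return p->router_ping_timeout + max_interval;
}

/* Assumption: the deadline is counted from last_alive. Returns true once
 * strictly more than the deadline has elapsed. */
static bool router_should_mark_down(const struct router_check_params *p,
				    long last_alive, long now)
{
	return (now - last_alive) > router_down_deadline(p);
}
```

Taking the larger of the two check intervals means the node gets at least one full check cycle, whichever interval is in effect, plus the ping timeout before it is declared down.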
avoid_asym_router_failure is set to 1.
The routing infrastructure currently performs the following functionality:
lpni_last_alive
lpni_timestamp
dead
if there was an error
lnet_parse()
router_ping_timeout + MAX(live_router_check_interval, dead_router_check_interval)
then it's marked down.
avoid_asym_router_failure is set to 1, which it is by default.
Multi-Rail introduced the concept of a peer and a peer NI. A peer can have multiple peer NIs. This changes the semantics of route configuration. Currently a route can be configured as:
lnetctl route add --net <remote net> --gateway <gateway-peer-NID>
The gateway-peer-NID refers to a specific interface on the router. However, with Multi-Rail (MR) enabled on the router, multiple interfaces can be configured on the same network. Therefore, the configuration semantics should be as follows:
lnetctl route add --net <remote net> --gateway <gateway-primary-NID>
When a route is entered, the primary NID specified in the gateway parameter should be discovered immediately. The discovery process will determine all the interfaces available on the router. There could be multiple interfaces on the same network.
A route should only be marked down if all the interfaces on the primary NID's network are down.
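The rule above can be sketched in C as a scan over the gateway's discovered interfaces. This is only an illustration under stated assumptions: struct gw_ni and route_is_alive() are hypothetical names, not LNet data structures or functions, and each interface is reduced to a network name and an alive flag.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified view of one discovered gateway interface. */
struct gw_ni {
	const char *net;	/* network the interface is on, e.g. "tcp" */
	bool alive;		/* current aliveness of this interface */
};

/* The route stays up as long as at least one gateway interface on the
 * primary NID's network is alive; it is down only when all of them are. */
static bool route_is_alive(const struct gw_ni *nis, size_t count,
			   const char *primary_net)
{
	for (size_t i = 0; i < count; i++) {
		if (strcmp(nis[i].net, primary_net) == 0 && nis[i].alive)
			return true;
	}
	return false;
}
```

Interfaces on other networks are ignored in this sketch, so a router with a healthy interface elsewhere still counts as down for this route. A gateway with no interfaces on the primary NID's network is also treated as down here, which is a choice made for the sketch.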
Nodes on different networks will use different primary NIDs to refer to the same router. That is, a primary NID is only a representation of the router on the peer where the route is configured.
The LNet Health/Resiliency feature has added the following features:
The original route code, which implements the requirements outlined above, is no longer in line with the new mechanisms. An effort is needed to bring the router code more in line with the new features.
Some details were documented here: Routing and MR integration