Purpose
Multi-Rail allows LNet to discover and use all configured interfaces of a node. It references a node via its primary NID. This feature carries that concept forward to the routing infrastructure. It brings the following changes:
- There is no need to configure a separate route per gateway interface; one route per gateway is sufficient. Gateway interfaces are selected according to the Multi-Rail selection criteria.
- Routing now relies on LNet Health to keep track of router health.
- Router interfaces are monitored via LNet Health. If an interface fails, other interfaces are used.
- Routing uses LNet discovery to discover gateways at regular intervals.
- A gateway pushes its list of interfaces whenever it detects a change in its state.
This document covers how routing can be configured and the pertinent module parameters.
Configuration
Configuring Routes
lnetctl route add --net <remote network> --gateway <primary NID for the gateway> --hops <number of hops> --priority <route priority>
The primary NID of the gateway is used to identify the gateway to use in the route. The gateway can have multiple interfaces on the same or different networks. The peers using the gateway can reach it on one or more of its interfaces. Multi-Rail routing takes care of managing which interface to use.
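When routes are added from a provisioning script, the command can be assembled programmatically. The following is a minimal sketch, not part of LNet itself: the helper `route_add_command` is hypothetical, the flags are copied from the synopsis above, and the network name, NID, hop count, and priority are made-up example values.

```python
import shlex


def route_add_command(remote_net, gateway_nid, hops, priority):
    """Build the argument list for the `lnetctl route add` synopsis above.

    Only the gateway's primary NID is passed; Multi-Rail routing decides
    which of the gateway's interfaces is actually used.
    """
    return [
        "lnetctl", "route", "add",
        "--net", remote_net,
        "--gateway", gateway_nid,
        "--hops", str(hops),
        "--priority", str(priority),
    ]


# Hypothetical values: reach network tcp2 through a gateway whose
# primary NID is 10.0.0.1@tcp1. Substitute the values for your site.
cmd = route_add_command("tcp2", "10.0.0.1@tcp1", hops=1, priority=0)

# Print the command for review instead of executing it.
print(shlex.join(cmd))
```

Because one route per gateway is enough, a script like this needs one call per gateway rather than one call per gateway interface.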
Configuring Module Parameters
Module Parameter | Usage |
---|---|
check_routers_before_use | Defaults to 0. If set to 1, all routers must be up before the system can proceed. |
avoid_asym_router_failure | Defaults to 1. If set to 1, a route is considered up if and only if there is at least one healthy interface on both the local and the remote interfaces of the gateway. |
alive_router_check_interval | Defaults to 60 seconds. The gateways are discovered every alive_router_check_interval seconds. If a gateway can be reached on multiple networks, the interval per network is alive_router_check_interval / number of networks. |
router_ping_timeout | Defaults to 50 seconds. A gateway is considered dead if no response is received within that timeout. |
router_sensitivity_percentage | Defaults to 100. This parameter defines how sensitive a router is to failure. If set to 100, any gateway failure contributes to all routes using that gateway going down. The lower the value, the more tolerant of failures the system becomes. |
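Two of the rules in the table can be worked through with a short sketch. This is an illustration of the descriptions above, not LNet code: the function names are hypothetical, the exact rounding used by LNet for the per-network interval is not stated here, and reading "local and remote interfaces" as two separate lists of per-interface health flags is an assumption.

```python
def per_network_interval(alive_router_check_interval, num_networks):
    """Interval per network: alive_router_check_interval / number of networks."""
    if num_networks < 1:
        raise ValueError("a gateway must be reachable on at least one network")
    return alive_router_check_interval / num_networks


def route_is_up(local_healthy, remote_healthy):
    """avoid_asym_router_failure=1 rule, as read from the table.

    Each argument is a list of booleans, one per gateway interface on that
    side. The route is up only if both sides have a healthy interface.
    """
    return any(local_healthy) and any(remote_healthy)


# Default interval of 60 seconds, gateway reachable on 3 networks.
print(per_network_interval(60, 3))          # 20.0

# One healthy local interface, but no healthy remote interface.
print(route_is_up([True, False], [False]))  # False
```

With the default of 60 seconds, a gateway reachable on a single network is discovered every 60 seconds, while one reachable on three networks is discovered at a 20-second interval per network.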