Original Pre-Health Requirements
Router Requirements
A router is a node that has the routing feature turned on, either with lnetctl set routing 1
or with the equivalent modprobe configuration.
- Track the timestamp of the last message received on each local NI
  - If the NI hasn't received any traffic for a period of router_ping_timeout + MAX(live_router_check_interval, dead_router_check_interval), it is marked down.
    - This is done so that other nodes using the gateway can mark the route down, given that avoid_asym_router_failure is set to 1.
- Do not send messages to a peer which is marked down.
- Set the peer status to up when messages are received
- When messages are flowing through the router, query the peer NI that a message is destined for once per second to determine whether it has come back up; if it has, set its status to alive.
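The NI timeout rule above can be sketched as follows. This is a minimal illustration, not the LNet implementation: the struct and function names are hypothetical, all values are assumed to be in seconds, and the strict "greater than" comparison is an assumption because the text does not say whether the boundary is inclusive.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical container for the three tunables named above (seconds). */
struct router_tunables {
    long router_ping_timeout;
    long live_router_check_interval;
    long dead_router_check_interval;
};

/* Longest silence tolerated on a local NI before it is marked down:
 * router_ping_timeout + MAX(live, dead router check interval). */
static long ni_down_threshold(const struct router_tunables *t)
{
    assert(t != NULL);

    long max_interval =
        t->live_router_check_interval > t->dead_router_check_interval
            ? t->live_router_check_interval
            : t->dead_router_check_interval;

    return t->router_ping_timeout + max_interval;
}

/* Returns true when the NI has been silent for longer than the threshold.
 * Assumption: strictly greater than; the text does not specify. */
static bool ni_should_be_down(const struct router_tunables *t,
                              long last_rx, long now)
{
    return now - last_rx > ni_down_threshold(t);
}
```

With illustrative values of 50, 60 and 30 seconds, the threshold is 110 seconds, so an NI that last received traffic at time 0 would be marked down from time 111 onward.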
Gateway Requirements
A gateway in this context is the peer NI created when adding a route on a node. For example: lnetctl route add --net tcp --gateway <gateway-NID>.
Dealing with that peer-NI is somewhat of a special case.
- Mark the gateway peer NI as down when the LND fails to send a message
  - Note: although the LND notifications happen for all peer NIs, they are only pertinent on routers or for gateways.
- Mark the gateway peer NI as up when we receive an unsolicited message or when we receive a REPLY for a PING sent from the router checker.
- Mark the route as down if one of the gateway's interfaces, identified by the gateway peer NI, is down, provided avoid_asym_router_failure is set to 1.
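The gateway rules above amount to a small state machine plus a route check. The sketch below is one way to express them, under stated assumptions: the event names, gw_next_state() and route_marked_down() are hypothetical and do not exist in LNet, and the gateway's interface states are modelled as a plain array of booleans.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical events that change the state of a gateway peer NI. */
enum gw_event {
    GW_LND_SEND_FAILED,   /* the LND failed to send a message */
    GW_UNSOLICITED_MSG,   /* an unsolicited message was received */
    GW_PING_REPLY         /* REPLY to a router checker PING arrived */
};

/* Returns the new up/down state of the gateway peer NI after an event. */
static bool gw_next_state(bool currently_up, enum gw_event ev)
{
    switch (ev) {
    case GW_LND_SEND_FAILED:
        return false;
    case GW_UNSOLICITED_MSG:
    case GW_PING_REPLY:
        return true;
    }
    return currently_up; /* unknown event: keep the current state */
}

/* Returns true if the route should be marked down: at least one gateway
 * interface is down and avoid_asym_router_failure is set. */
static bool route_marked_down(const bool *gw_ni_up, size_t n_nis,
                              int avoid_asym_router_failure)
{
    assert(gw_ni_up != NULL || n_nis == 0);

    if (!avoid_asym_router_failure)
        return false;
    for (size_t i = 0; i < n_nis; i++)
        if (!gw_ni_up[i])
            return true;
    return false;
}
```

Note that with avoid_asym_router_failure set to 0, the sketch never marks the route down because of a gateway interface, which matches the condition stated in the last requirement.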
Peer Requirements
- Do not check for peer aliveness when sending a message to a peer.
- Pick a route which has its gateway peer NI marked as up.
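As an illustration of the second requirement, route selection could look like the sketch below. The struct route layout and pick_route() are hypothetical, and picking the first usable route is an assumption made for brevity; the text only requires that the chosen route has its gateway peer NI marked as up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a route: only the gateway state matters. */
struct route {
    const char *gateway_nid; /* illustrative only, e.g. "10.0.0.1@tcp" */
    bool gateway_up;         /* state of the gateway peer NI */
};

/* Returns the first route whose gateway peer NI is marked up,
 * or NULL if no such route exists. */
static const struct route *pick_route(const struct route *routes, size_t n)
{
    assert(routes != NULL || n == 0);

    for (size_t i = 0; i < n; i++)
        if (routes[i].gateway_up)
            return &routes[i];
    return NULL;
}
```

The destination peer itself is never checked for aliveness here, in line with the first requirement; only the gateway state is consulted.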
Implementation Details
The routing infrastructure currently performs the following functions:
- Keep track of the last time the peer was alive, lpni_last_alive
- Keep track of the last time the peer was notified that its state has changed, lpni_timestamp
- The peer can change state under the following conditions:
  - The LND notifies that the peer is down when it fails to send a message to the peer.
    - As an example in o2iblnd:
      - kiblnd_peer_connect_failed() and kiblnd_disconnect_conn() call kiblnd_peer_notify(), which calls lnet_notify() to set the peer to dead if there was an error
  - A message is received in lnet_parse
    - In this case the peer state is set to alive
  - The peer has been dead for longer than the configured peer timeout and its status hasn't been updated either in the process of receiving or sending messages. In other words, the system came up and stayed idle for longer than the configured peer timeout. In this case set the peer state to alive.
  - When the router checker ping is responded to or it fails.
  - If the router checker ping times out.
- This step only concerns routers. Only send the message if the peer is alive, determined as outlined above.
- On the router, if the NI hasn't received any traffic for a period of router_ping_timeout + MAX(live_router_check_interval, dead_router_check_interval), it is marked down.
  - This is done in order for the peers using the router to mark the peer down when avoid_asym_router_failure is set to 1, which it is by default.
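The peer state tracking described in this section can be modelled roughly as below. This is a simplified sketch, not the LNet code: struct peer_ni_sketch, peer_notify() and router_may_send() are hypothetical names, the two time fields only mirror the roles of lpni_last_alive and lpni_timestamp, and times are assumed to be in seconds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the per-peer state described above. */
struct peer_ni_sketch {
    bool alive;
    long last_alive; /* role of lpni_last_alive */
    long timestamp;  /* role of lpni_timestamp: last state notification */
};

/* Records a state notification, e.g. from the LND (dead), from receiving
 * a message (alive), or from the router checker ping result. */
static void peer_notify(struct peer_ni_sketch *p, bool alive, long when)
{
    assert(p != NULL);

    p->alive = alive;
    p->timestamp = when;
    if (alive)
        p->last_alive = when;
}

/* Router-only check before sending. A peer that has been dead for longer
 * than peer_timeout with no status update is set back to alive, following
 * the idle rule above. Returns true if the message may be sent. */
static bool router_may_send(struct peer_ni_sketch *p, long peer_timeout,
                            long now)
{
    assert(p != NULL);

    if (!p->alive && now - p->timestamp > peer_timeout)
        peer_notify(p, true, now);
    return p->alive;
}
```

For example, with a peer timeout of 180 seconds, a peer marked dead at time 10 is still refused at time 100, but is set back to alive and allowed once the check runs at time 191.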