Original Pre-Health Requirements
...
- Keep track of the last time the peer was alive, lpni_last_alive
- Keep track of the last time the peer was notified that its state has changed, lpni_timestamp
- The peer can change state under the following conditions:
  - The LND notifies that the peer is down when it fails to send a message to the peer.
    - As an example in o2iblnd:
      - kiblnd_peer_connect_failed() and kiblnd_disconnect_conn() call kiblnd_peer_notify(), which calls lnet_notify() to set the peer to dead if there was an error
  - A message is received in lnet_parse()
    - In this case the peer state is set to alive only for gateway peer NIs
  - When the router checker ping is responded to or it fails.
  - If the router checker ping times out.
- This step only concerns routers: Only send the message if the peer is alive, determined as outlined above.
- On the router, if the NI hasn't received any traffic for a period of router_ping_timeout + MAX(live_router_check_interval, dead_router_check_interval), then it's marked down.
  - This is done in order for the peers using the router to mark the peer down when avoid_asym_router_failure is set to 1, which it is by default.
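The silence period in the last requirement can be illustrated with a short Python sketch. This is not the LNet implementation: the function name ni_should_be_marked_down is hypothetical, the numeric values in the example call are illustrative only, and treating a silence of exactly the limit as "down" is an assumption. Only the three tunable names come from the text above.

```python
def ni_should_be_marked_down(now, last_traffic,
                             router_ping_timeout,
                             live_router_check_interval,
                             dead_router_check_interval):
    """Return True if a router NI has been silent long enough to be marked down.

    All arguments are in seconds. The function name is hypothetical.
    """
    # Silence limit as described in the requirement:
    # router_ping_timeout + MAX(live_router_check_interval, dead_router_check_interval)
    silence_limit = router_ping_timeout + max(live_router_check_interval,
                                              dead_router_check_interval)
    # Assumption: reaching the limit exactly already counts as down.
    return (now - last_traffic) >= silence_limit


# Illustrative values only: 100s of silence against a 50 + max(60, 60) = 110s limit.
print(ni_should_be_marked_down(now=200, last_traffic=100,
                               router_ping_timeout=50,
                               live_router_check_interval=60,
                               dead_router_check_interval=60))  # prints False
```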
LNet Multi-Rail Routing
Multi-Rail introduced the concept of a peer and a peer NI. A peer can have multiple peer NIs. This changes the semantics of route configuration. Currently a route can be configured as:
...
Nodes on different networks will use different primary NIDs to refer to the same router. That is, a primary NID is only a representation of the router on the peer with the route configured.
Multi-Rail Router Requirements
- Do not put a message on the wire if the health of a peer_ni is below MAX_HEALTH * rtr_sensitivity_percentage
- Attempt to recover an unhealthy peer_ni once per second by pinging it
- The LND shall notify LNet whenever it determines a peer_ni is alive or dead. That will result in the adjustment of the peer_ni's health value.
- LNet shall call an LND API to notify that a peer_ni is dead whenever the peer_ni's health goes below MAX_HEALTH * rtr_sensitivity_percentage
Multi-Rail Route Requirements
- A route is considered down if there are no viable peer_nis on the remote net of the gateway
  - EX: if a route is defined as: lnetctl route add --net tcp2 --gateway 10.10.10.3@tcp, then if 10.10.10.3@tcp has no peer_nis which are healthy on tcp2, that route is dead
- A gateway is considered down under two circumstances:
  - All remote nets reported in the REPLY to the PING are down
  - All local representations of the peer_nis on the remote net have a health value below: MAX_HEALTH * rtr_sensitivity_percentage
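The route-down rule can be sketched in Python as below. This is a simplified model under stated assumptions, not LNet code: the function name route_is_down, the dictionary layout, and the tcp2 NIDs in the example are hypothetical, a peer_ni is treated as viable when its health is not below the threshold, and the threshold is passed in as a plain number rather than derived from MAX_HEALTH * rtr_sensitivity_percentage.

```python
def route_is_down(gateway_peer_nis, remote_net, threshold):
    """Return True if the gateway has no viable peer_ni on remote_net.

    gateway_peer_nis maps a NID string such as "10.10.10.3@tcp" to the
    local health value of that peer_ni. The text after "@" is used as the
    network name exactly as written (no "tcp" == "tcp0" normalization).
    """
    for nid, health in gateway_peer_nis.items():
        net = nid.rsplit("@", 1)[1]
        # Assumption: "viable" means the health is not below the threshold.
        if net == remote_net and health >= threshold:
            return False  # at least one viable peer_ni on the remote net
    return True


# Hypothetical gateway reached via 10.10.10.3@tcp, with two NIs on tcp2.
gateway = {
    "10.10.10.3@tcp": 1000,
    "10.20.20.3@tcp2": 900,
    "10.20.20.4@tcp2": 1000,
}
print(route_is_down(gateway, "tcp2", threshold=1000))  # prints False

gateway["10.20.20.4@tcp2"] = 999
print(route_is_down(gateway, "tcp2", threshold=1000))  # prints True
```

Note that a gateway with no known peer_nis on the remote net also yields a down route in this sketch, which matches the wording of the requirement.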
Configuration
A router can be configured as follows to utilize the new health infrastructure:
lnet_health_sensitivity = 1 ## this will decrement the health of the NI by 1 every time there is a failure to send to that interface
router_sensitivity_percentage = 100 ## this will consider the route down if the NI's health is lower than LNET_MAX_HEALTH_VALUE
- Optionally we can set
retry_count > 0 ## this will attempt to resend a message on a different NI if one is available
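To see how the two settings interact, the following Python sketch counts how many consecutive send failures it takes for an NI's health to fall below the route-down threshold. It rests on several assumptions that this document does not state: LNET_MAX_HEALTH_VALUE is taken to be 1000, the percentage is applied as a division by 100, health starts at the maximum and is floored at 0, and no recovery happens between failures. The function name is hypothetical.

```python
LNET_MAX_HEALTH_VALUE = 1000  # assumed value, not stated in this document


def failures_until_route_down(lnet_health_sensitivity,
                              router_sensitivity_percentage,
                              max_health=LNET_MAX_HEALTH_VALUE):
    """Count consecutive failures until health drops below the threshold.

    Returns None if the health can never fall below the threshold.
    """
    if lnet_health_sensitivity <= 0:
        raise ValueError("this sketch needs a positive sensitivity")
    threshold = max_health * router_sensitivity_percentage / 100
    health = max_health
    failures = 0
    while health >= threshold:
        if health == 0:
            return None  # floored at 0 and still not below the threshold
        # Each failed send decrements the health by the sensitivity.
        health = max(0, health - lnet_health_sensitivity)
        failures += 1
    return failures


print(failures_until_route_down(1, 100))  # prints 1
print(failures_until_route_down(1, 75))   # prints 251
```

Under these assumptions, the configuration shown above (sensitivity 1, percentage 100) takes the route down after a single failed send, while a lower percentage tolerates more failures before the route is considered down.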
Route Selection
Currently a route is selected based on the priority and hops values given to it; after that, the credits for the peer NI are evaluated. With Multi-Rail there should be two evaluation factors in the selection process.
...