...
NOTE: currently we do not know why the peer_ni is marked down. As mentioned above, the tx_timeout can be triggered for several reasons. Some reasons indicate a problem on the peer side, i.e. not receiving a response or a transmit complete. Others could indicate local problems, for example the tx never leaves the queued state. Depending on the reason for the tx_timeout, LNet should react differently in its next round of interface selection.
Peer timeout and recovery model
- On transmit timeout kiblnd notifies LNet that the peer has closed due to an error. This goes through the lnet_notify path.
- The peer aliveness at the LNet layer is set to 0 (dead), and the last alive
- In IBLND, the last alive time of the peer is set whenever a message is received successfully, a message is transmitted successfully, or a connection attempt completes (whether it succeeded or was rejected).
- At the LNet layer, whenever a message is sent to a peer, lnet_peer_is_alive() is called to check whether that peer is alive:
- If the peer is marked dead and the LND reported its death at time X, which is after the last known alive time, then consider the peer currently dead.
- Otherwise consider the peer alive if peer_timeout seconds have not passed since the last time it was alive.
- If the peer_timeout has elapsed, consider the peer dead.
- The issue with this is that we never retry this peer again once the peer_timeout has elapsed.
- If the node is a router, router_ping_timeout defaults to 50, which is less than
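The aliveness rules listed above can be sketched in C roughly as follows. This is a simplified illustration, not the actual lnet_peer_is_alive() code: the struct peer_state type, its field names, and the peer_is_alive() helper are hypothetical, and times are plain seconds rather than kernel time values.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified peer state; field names are illustrative only. */
struct peer_state {
	bool     alive;        /* last state reported through the notify path */
	bool     notified;     /* true once the LND has reported a death */
	uint64_t notify_time;  /* time of the last LND notification, in seconds */
	uint64_t last_alive;   /* last time the LND saw the peer alive, in seconds */
};

/* Returns true if the peer should be treated as alive at time `now`. */
static bool peer_is_alive(const struct peer_state *p, uint64_t now,
			  uint64_t peer_timeout)
{
	/* A death notification newer than the last sign of life wins. */
	if (!p->alive && p->notified && p->notify_time > p->last_alive)
		return false;

	/* Otherwise alive only while peer_timeout seconds have not elapsed. */
	return now < p->last_alive + peer_timeout;
}
```

Note that once the second check fails, nothing in this path refreshes last_alive, which is the "never retried" problem described in the list.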
Health Revisited
There are different scenarios to consider with Health:
- Asynchronous events which indicate that the card is down
- Immediate failures when sending
- Failures reported by the LND
- Failures that occur because the peer is down. This class of failures could instead be moved into the selection algorithm, i.e. do not pick peer_nis that are not alive.
- TX timeout cases.
- Currently the connection is closed and the peer is marked down.
- This behavior should be enhanced to attempt to resend on a different local NI/peer NI, and mark the health of the NI
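One way to picture the proposed TX timeout behavior is the sketch below: the interface that timed out has its health lowered, and the resend picks a different interface that is alive and has the highest health. Everything here is an assumption made for illustration: struct ni, the HEALTH_STEP penalty, and the ni_tx_timeout() and ni_select() helpers are hypothetical names, not LNet APIs, and the same idea would apply to peer NIs.

```c
#include <stdbool.h>
#include <stddef.h>

#define HEALTH_STEP 100  /* assumed penalty per tx timeout, not from the source */

/* Hypothetical view of a network interface for selection purposes. */
struct ni {
	int  id;
	bool alive;
	int  health;
};

/* Lower the health of an interface after a tx timeout, flooring at 0. */
static void ni_tx_timeout(struct ni *ni)
{
	ni->health = ni->health > HEALTH_STEP ? ni->health - HEALTH_STEP : 0;
}

/*
 * Pick the alive interface with the highest health, skipping `exclude`
 * (the interface that just timed out). Returns NULL if none qualifies.
 */
static struct ni *ni_select(struct ni *nis, size_t n, const struct ni *exclude)
{
	struct ni *best = NULL;

	for (size_t i = 0; i < n; i++) {
		if (!nis[i].alive || &nis[i] == exclude)
			continue;
		if (best == NULL || nis[i].health > best->health)
			best = &nis[i];
	}
	return best;
}
```

On a timeout the caller would invoke ni_tx_timeout() on the failed interface and then ni_select() with that interface as `exclude` to choose where to resend.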
...