...

NOTE: currently we don't know why the peer_ni is marked down. As mentioned above, the tx_timeout could be triggered for several reasons. Some reasons indicate a problem on the peer side, i.e. not receiving a response or a transmit complete. Other reasons could indicate local problems, for example the tx never leaves the queued state. Depending on the reason for the tx_timeout, LNet should react differently in its next round of interface selection.
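The distinction in the note can be sketched as a small classification step. This is only an illustration in Python, not LNet code: the enum names and the `suspect_for` helper are hypothetical, and the mapping simply follows the examples given in the note.

```python
from enum import Enum, auto


class TxTimeoutReason(Enum):
    """Hypothetical reasons a tx_timeout could fire (names are assumptions)."""
    NO_RESPONSE = auto()      # no response was received from the peer
    NO_TX_COMPLETE = auto()   # no transmit complete was seen
    STUCK_IN_QUEUE = auto()   # the tx never left the queued state


class Suspect(Enum):
    """Which side the timeout points at."""
    PEER = auto()
    LOCAL = auto()


def suspect_for(reason):
    """Map a timeout reason to the side that should be treated as suspect."""
    if reason is TxTimeoutReason.STUCK_IN_QUEUE:
        # The message never left the local queue, so the local side is suspect.
        return Suspect.LOCAL
    # A missing response or transmit complete points at the peer side.
    return Suspect.PEER
```

With a classification like this, the next round of interface selection could avoid the local NI for `Suspect.LOCAL` and the peer NI for `Suspect.PEER`, instead of always marking the peer_ni down.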

Peer timeout and recovery model
  • On transmit timeout, kiblnd notifies LNet that the peer has closed due to an error. This goes through the lnet_notify path.
  • The peer aliveness at the LNet layer is set to 0 (dead), and the last alive
  • In IBLND, whenever a message is received successfully, transmitted successfully, or a connection is completed (whether it succeeds or is rejected), the last alive time of the peer is set.
  • At the LNet layer, whenever a message is sent to a peer, lnet_peer_is_alive() is called to check whether that peer is alive:
    • If the peer is marked dead and the LND reported its death at a time X that is later than the last known alive time, consider the peer dead.
    • Otherwise, consider the peer alive if fewer than peer_timeout seconds have passed since the last time it was alive.
    • If the peer_timeout has elapsed, consider the peer dead.
      • The issue with this is that the peer is never retried again once the peer_timeout has elapsed.
    • If the node is a router, router_ping_timeout defaults to 50, which is less than 
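The aliveness rules listed above can be sketched as follows. This is a minimal Python illustration of the decision order only, not the actual lnet_peer_is_alive() implementation: the `PeerState` fields and the `peer_is_alive` signature are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PeerState:
    """Hypothetical per-peer bookkeeping (field names are assumptions)."""
    alive: bool                                 # aliveness flag set via the notify path
    last_alive: float                           # last time the LND saw the peer alive
    death_notified_at: Optional[float] = None   # time X of the LND death notification


def peer_is_alive(peer, now, peer_timeout):
    """Apply the rules from the list above, in order."""
    # Rule 1: marked dead, and the death notification is newer than the
    # last known alive time, so the peer is considered dead.
    if (not peer.alive
            and peer.death_notified_at is not None
            and peer.death_notified_at > peer.last_alive):
        return False
    # Rules 2 and 3: alive only while peer_timeout seconds have not yet
    # passed since the last time the peer was known to be alive.
    return (now - peer.last_alive) < peer_timeout
```

Note that once `now - peer.last_alive` exceeds `peer_timeout`, nothing in this check can make the peer alive again; only a fresh update of `last_alive` would, which mirrors the retry problem described in the list.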

 

 

Health Revisited

There are different scenarios to consider with Health:

  1. Asynchronous events which indicate that the card is down.
  2. Immediate failures when sending.
    1. Failures reported by the LND.
    2. Failures that occur because the peer is down. This class of failures could instead be moved into the selection algorithm, i.e. do not pick peer_nis which are not alive.
  3. TX timeout cases.
    1. Currently the connection is closed and the peer is marked down.
    2. This behavior should be enhanced to attempt to resend on a different local NI/peer NI, and mark the health of the NI.
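The proposed enhancement for the TX timeout case could look roughly like the sketch below. It is a hedged Python illustration, not a design decision: the numeric health score, the `send_with_failover` helper, and the choice to lower the score of both the local NI and the peer NI on failure are all assumptions, since the text does not yet say which NI should be marked.

```python
def send_with_failover(pairs, health, try_send):
    """Try each (local_ni, peer_ni) pair, healthiest first.

    pairs:    list of (local_ni, peer_ni) tuples that can reach the peer
    health:   dict mapping NI name to a numeric score (hypothetical; higher is better)
    try_send: callable(local_ni, peer_ni) returning True on success
    Returns the pair that succeeded, or None if every pair failed.
    """
    def score(pair):
        local_ni, peer_ni = pair
        return health.get(local_ni, 0) + health.get(peer_ni, 0)

    # The order is computed once up front; equal scores keep their input order.
    for local_ni, peer_ni in sorted(pairs, key=score, reverse=True):
        if try_send(local_ni, peer_ni):
            return (local_ni, peer_ni)
        # Assumption: a failed attempt lowers the health of both NIs involved.
        health[local_ni] = health.get(local_ni, 0) - 1
        health[peer_ni] = health.get(peer_ni, 0) - 1
    return None
```

Because the health scores persist across calls, a pair that timed out is tried later in the next round of selection rather than being excluded forever.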

...