...

NOTE: currently we don't know why the peer_ni is marked down. As mentioned above, the tx_timeout could be triggered for several reasons. Some reasons indicate a problem on the peer side, e.g. not receiving a response or a transmit complete. Other reasons could indicate local problems, for example the tx never leaves the queued state. Depending on the reason for the tx_timeout, LNet should react differently in its next round of interface selection.
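The distinction above can be sketched as a small classification step. This is only an illustration: the enum names and the classify_tx_timeout() helper are hypothetical and are not part of the existing LNet or LND code. It assumes the stage the tx had reached when the timeout fired is known.

```c
#include <assert.h>

/* Hypothetical stages a tx can be stuck in when tx_timeout fires. */
enum tx_stage {
	TX_STAGE_QUEUED,         /* tx never left the local queue */
	TX_STAGE_AWAIT_COMPLETE, /* tx posted, no transmit complete seen */
	TX_STAGE_AWAIT_RESPONSE  /* tx sent, no response from the peer */
};

/* Which side the timeout most likely points at. */
enum fault_side {
	FAULT_LOCAL,
	FAULT_PEER
};

/* Map the stage at timeout to the side that should be penalized. */
static enum fault_side classify_tx_timeout(enum tx_stage stage)
{
	switch (stage) {
	case TX_STAGE_QUEUED:
		/* Nothing went on the wire, so suspect the local NI. */
		return FAULT_LOCAL;
	case TX_STAGE_AWAIT_COMPLETE:
	case TX_STAGE_AWAIT_RESPONSE:
		/* The tx left the queue but the peer side did not finish it. */
		return FAULT_PEER;
	}
	assert(0 && "unknown tx stage");
	return FAULT_LOCAL;
}
```

With a result like this, the next round of interface selection could avoid the local NI in the FAULT_LOCAL case and avoid the peer NI in the FAULT_PEER case, instead of marking the peer_ni down unconditionally.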

Health Revisited

There are different scenarios to consider with Health:

  1. Asynchronous events which indicate that the card is down
  2. Immediate failures when sending
    1. Failures reported by the LND
    2. Failures that occur because the peer is down. This class of failures could instead be moved into the selection algorithm, i.e. do not pick peer_nis which are not alive.
  3. TX timeout cases.
    1. Currently the connection is closed and the peer is marked down.
    2. This behavior should be enhanced to attempt a resend on a different local NI/peer NI, and to mark the health of the NI.
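A minimal sketch of the selection-side handling described in the list above, under stated assumptions: struct peer_ni_sketch, select_peer_ni() and handle_tx_timeout() are hypothetical names, and representing health as a simple integer that is decremented on a timeout is an assumption of this example, not a decision recorded in this document.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a peer NI for selection purposes. */
struct peer_ni_sketch {
	bool alive;  /* false once the peer NI is considered down */
	int health;  /* assumed numeric health, higher is better */
};

/*
 * Pick the healthiest alive peer NI, skipping index 'failed'
 * (pass -1 to skip nothing). Returns the index, or -1 if none qualifies.
 */
static int select_peer_ni(const struct peer_ni_sketch *nis, size_t count,
			  int failed)
{
	int best = -1;

	assert(nis != NULL || count == 0);
	for (size_t i = 0; i < count; i++) {
		if ((int)i == failed || !nis[i].alive)
			continue;
		if (best < 0 || nis[i].health > nis[best].health)
			best = (int)i;
	}
	return best;
}

/*
 * On a tx timeout: lower the health of the NI that was used, then
 * choose a different NI for the resend instead of marking the peer down.
 */
static int handle_tx_timeout(struct peer_ni_sketch *nis, size_t count,
			     int failed)
{
	assert(failed >= 0 && (size_t)failed < count);
	if (nis[failed].health > 0)
		nis[failed].health--;
	return select_peer_ni(nis, count, failed);
}
```

A return value of -1 from handle_tx_timeout() means no alternative peer NI is available, which is the case the TBD below about recovering from a peer down would need to address.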

TBD - How do we recover from a peer down?

TX Timeouts in the presence of LNet Routers

...