...
NOTE: currently we don't know why the peer_ni is marked down. As mentioned above, the tx_timeout could be triggered for several reasons. Some reasons indicate a problem on the peer side, e.g. not receiving a response or a transmit complete. Other reasons could indicate local problems, for example the tx never leaving the queued state. Depending on the reason for the tx_timeout, LNet should react differently in its next round of interface selection.
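To make the distinction concrete, the following is a minimal sketch of how a recorded timeout reason could be mapped to the side that should be penalized in the next round of selection. All identifiers here (tx_timeout_reason, tx_timeout_blame and their values) are hypothetical and do not exist in LNet; the mapping follows only the examples given in the note above.

```c
#include <assert.h>

/* Hypothetical reasons a tx_timeout can fire. Illustrative names only. */
enum tx_timeout_reason {
	TX_TIMEOUT_QUEUED,		/* tx never left the queued state */
	TX_TIMEOUT_NO_TX_COMPLETE,	/* no transmit complete was received */
	TX_TIMEOUT_NO_RESPONSE,		/* no response was received */
	TX_TIMEOUT_REASON_COUNT
};

/* Which side the next round of interface selection should penalize. */
enum tx_timeout_blame {
	TX_BLAME_LOCAL_NI,
	TX_BLAME_PEER_NI
};

/* Map a timeout reason to the side it most likely implicates. */
static enum tx_timeout_blame tx_timeout_blame(enum tx_timeout_reason reason)
{
	assert(reason < TX_TIMEOUT_REASON_COUNT);

	switch (reason) {
	case TX_TIMEOUT_QUEUED:
		/* The message never left this node: suspect the local NI. */
		return TX_BLAME_LOCAL_NI;
	case TX_TIMEOUT_NO_TX_COMPLETE:
	case TX_TIMEOUT_NO_RESPONSE:
	default:
		/* The message left but nothing came back: suspect the peer. */
		return TX_BLAME_PEER_NI;
	}
}
```

With a classification like this, a timeout on a queued tx would steer selection away from the local NI without marking the peer_ni down.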
Health Revisited
There are different scenarios to consider with Health:
- Asynchronous events which indicate that the card is down
- Immediate failures when sending
- Failures reported by the LND
- Failures that occur because the peer is down. This class of failures could instead be handled in the selection algorithm, i.e. do not pick peer_nis which are not alive.
- TX timeout cases:
  - Currently, the connection is closed and the peer is marked down.
  - This behavior should be enhanced to attempt a resend on a different local NI/peer NI, and to mark the health of the NI accordingly.
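The proposed TX timeout behavior can be sketched as follows: record the failure against the peer NI that timed out, then pick a different, alive peer NI for the resend. This is only an illustration under stated assumptions. The struct, the function names, and the use of a simple integer health counter are all hypothetical and are not taken from the LNet code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical view of a peer NI; the integer health value is an assumption. */
struct sketch_peer_ni {
	bool alive;	/* false once the peer NI is considered down */
	int health;	/* higher is healthier, never below 0 */
};

/* Record a tx timeout against a peer NI by lowering its health. */
static void sketch_note_tx_timeout(struct sketch_peer_ni *pni)
{
	assert(pni != NULL);
	if (pni->health > 0)
		pni->health--;
}

/*
 * Pick the healthiest alive peer NI, skipping the one at index `failed`
 * (pass -1 to skip none). Returns the chosen index, or -1 if none is usable.
 */
static int sketch_pick_peer_ni(const struct sketch_peer_ni *pnis, size_t n,
			       int failed)
{
	int best = -1;
	size_t i;

	for (i = 0; i < n; i++) {
		if ((int)i == failed || !pnis[i].alive)
			continue;
		if (best < 0 || pnis[i].health > pnis[best].health)
			best = (int)i;
	}
	return best;
}
```

A return value of -1 corresponds to the case where no alternative path exists, which is where the open question of peer recovery applies.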
TBD - How do we recover when a peer is down?
TX Timeouts in the presence of LNet Routers
...