...
- On a transmit timeout, kiblnd notifies LNet that the connection to the peer was closed due to an error. This goes through the lnet_notify path.
- The peer aliveness at the LNet layer is set to 0 (dead), and the last alive
- In IBLND, the last alive time of the peer is updated whenever a message is received successfully, a message is transmitted successfully, or a connection attempt completes (whether it succeeded or was rejected).
- At the LNet layer, for a non-router node, lnet_peer_aliveness_enabled() will always return 0:
#define lnet_peer_aliveness_enabled(lp) (the_lnet.ln_routing != 0 && \
                                         ((lp)->lpni_net) && \
                                         (lp)->lpni_net->net_tunables.lct_peer_time_out > 0)
In effect, the aliveness of the peer is not considered at all if the node is not a router.
- This can remain the same since the health of the peer will be considered in lnet_select_pathway() before this is considered.
- In fact, if the logic for the health of the peer is done in lnet_select_pathway(), then the logic in lnet_post_send_locked() can be removed. A peer will always be as healthy as possible by the time the flow hits lnet_post_send_locked().
- If the node is not a router, then a peer will always be tried regardless of its health. If it is a router then
...
- then once every second the peer will be queried to see if it's alive or not.
- TBD: In o2iblnd, kiblnd_query() looks up the peer and then returns the last_alive of the peer. However, there is code "if (peer_ni == NULL) kiblnd_launch_tx(ni, NULL, nid)". This code will attempt to create and connect to the peer, which should allow us to discover if the peer is alive. However, as far as I know, peer_ni is never removed from the hash. So if it is an existing peer which died, the call to kiblnd_launch_tx() will never be made, and we'll never discover if the peer came back to life.
- In socklnd, ksocknal_query() works differently. It actually attempts to connect to the peer again, within a timeout. This leads the router to discover that the peer is healthy and start using it again.
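The difference between the two query behaviors described above can be illustrated with a small toy model. This is only a sketch: struct toy_peer, toy_query_lookup_only() and toy_query_reconnect() are hypothetical names invented for illustration and are not the real o2iblnd or socklnd code. It assumes, as the TBD item suspects, that a dead peer's entry stays in the hash.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a peer entry in an LND peer hash.
 * This is NOT the real kib_peer_ni or ksock_peer structure. */
struct toy_peer {
	int  present;          /* 1 if the entry is still in the hash */
	long last_alive;       /* last time traffic or a connect completed */
	int  connect_attempts; /* number of reconnects launched by queries */
};

/* Models the o2iblnd-style query described above: a reconnect is only
 * launched when the lookup misses. A stale entry for a dead peer is
 * still "found", so no reconnect is ever attempted. */
static long toy_query_lookup_only(struct toy_peer *p)
{
	if (!p->present) {
		p->present = 1;        /* entry created by the launch */
		p->connect_attempts++;
		return 0;              /* nothing known about the peer yet */
	}
	return p->last_alive;
}

/* Models the socklnd-style query described above: every query tries to
 * connect again, so a peer that came back can be rediscovered. */
static long toy_query_reconnect(struct toy_peer *p)
{
	p->present = 1;
	p->connect_attempts++;
	return p->last_alive;
}
```

With a stale entry (present = 1), toy_query_lookup_only() returns the old last_alive and never increments connect_attempts, which is the situation the TBD item worries about. toy_query_reconnect() always launches an attempt.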
Health Revisited
There are different scenarios to consider with Health:
- Asynchronous events which indicate that the card is down
- Immediate failures when sending
- Failures reported by the LND
- Failures that occur because the peer is down. This class of failures could instead be moved into the selection algorithm, i.e. do not pick peer NIs which are not alive.
- TX timeout cases.
- Currently the connection is closed and the peer is marked down.
- This behavior should be enhanced to attempt to resend on a different local NI/peer NI, and to mark the health of the NI that timed out.
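The proposed TX timeout enhancement could look roughly like the sketch below. Everything here is an assumption made for illustration: struct toy_ni, the numeric health value, TOY_HEALTH_PENALTY, and the "pick the healthiest other NI" policy are hypothetical and are not taken from the LNet code or from this design.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_HEALTH_PENALTY 100 /* hypothetical penalty per TX timeout */

/* Hypothetical NI with a numeric health value; higher is healthier. */
struct toy_ni {
	int id;
	int health;
};

/* Mark the health of the NI that timed out, clamping at zero. */
static void toy_ni_tx_timeout(struct toy_ni *ni)
{
	ni->health -= TOY_HEALTH_PENALTY;
	if (ni->health < 0)
		ni->health = 0;
}

/* Pick the healthiest NI other than 'exclude'.
 * Returns NULL if there is no alternative NI. */
static struct toy_ni *toy_select_ni(struct toy_ni *nis, size_t n,
				    const struct toy_ni *exclude)
{
	struct toy_ni *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (&nis[i] == exclude)
			continue;
		if (best == NULL || nis[i].health > best->health)
			best = &nis[i];
	}
	return best;
}

/* On a TX timeout: penalize the failed NI, then choose a different NI
 * for the resend. The caller falls back to failing the message if
 * NULL is returned. */
static struct toy_ni *toy_handle_tx_timeout(struct toy_ni *nis, size_t n,
					    struct toy_ni *failed)
{
	toy_ni_tx_timeout(failed);
	return toy_select_ni(nis, n, failed);
}
```

The same shape would apply to peer NIs. The sketch deliberately leaves out how health is restored, which is the open question in the TBD that follows.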
TBD - How do we recover from a peer down?
TX Timeouts in the presence of LNet Routers
...
Each hop's LNet will make a best effort to get the message to the following hop. There is no feedback mechanism from a router to the originator to inform the originator that a message has failed to send, but I believe one is unnecessary and would probably increase the complexity of the code and the system in general. The rule of thumb should be that each hop only worries about the immediate next hop.
SOCKLND
TBD
LNet Health Version 2.0
TBD