...
- On transmit timeout kiblnd notifies LNet that the peer has closed due to an error. This goes through the lnet_notify path.
- The peer aliveness at the LNet layer is set to 0 (dead), and the last alive
- In IBLND, the last alive time of the peer is set whenever a message is received successfully, a message is transmitted successfully, or a connection attempt completes (whether it succeeded or was rejected).
- At the LNet layer, whenever a message is sent to a peer, LNet checks whether that peer is alive. For a non-router node, lnet_peer_aliveness_enabled() will always return 0. When aliveness checking is enabled, the check works as follows:
- If the peer is marked dead and the LND notified LNet of its death at time X, which is after the last known alive time, then consider the peer currently dead.
- Otherwise, consider the peer alive if peer_timeout seconds have not passed since the last time it was alive.
- If peer_timeout has elapsed, consider the peer dead.
- The issue with this is that the peer will never be retried once peer_timeout has elapsed.
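The aliveness decision described in the bullets above can be sketched in C as follows. This is a minimal illustration, not the LNet implementation: `struct peer_state`, its field names, and `peer_is_alive()` are hypothetical names introduced here, and times are plain seconds.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical, simplified peer state (not the real LNet structure). */
struct peer_state {
    bool   alive;        /* last state reported by the LND through notify */
    time_t notify_time;  /* time X at which the LND reported that state */
    time_t last_alive;   /* last time traffic showed the peer to be alive */
    int    peer_timeout; /* seconds the peer stays "alive" without traffic */
};

/* Returns true if the peer should be considered alive at time 'now'. */
bool peer_is_alive(const struct peer_state *p, time_t now)
{
    /* A death notice newer than the last alive time wins. */
    if (!p->alive && p->notify_time > p->last_alive)
        return false;

    /* Otherwise the peer is alive only until peer_timeout elapses. */
    return now < p->last_alive + p->peer_timeout;
}
```

In this sketch, once `now` passes `last_alive + peer_timeout` the function keeps returning false. If sends are skipped for a dead peer, nothing refreshes `last_alive`, which is the "never retried" problem noted above.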
In case the node is a router, router_ping_timeout defaults to 50, which is less than

    #define lnet_peer_aliveness_enabled(lp) (the_lnet.ln_routing != 0 && \
                                             ((lp)->lpni_net) && \
                                             (lp)->lpni_net->net_tunables.lct_peer_time_out > 0)

In effect, the aliveness of the peer is not considered at all if the node is not a router.
- This can remain the same, since the health of the peer will be considered in lnet_select_pathway() before this check is reached.
- In fact, if the logic for the health of the peer is done in lnet_select_pathway(), then the logic in lnet_post_send_locked() can be removed. A peer will always be as healthy as possible by the time the flow hits lnet_post_send_locked().
- If the node is not a router, then a peer will always be tried regardless of its health. If it is a router then
Health Revisited
There are different scenarios to consider with Health:
...