...
A message is resent after the LND transmit deadline expires or when the LND returns a failure code. Both paths are handled in the same manner, since an expired transmit deadline triggers a call to lnet_finalize(). Both inline and asynchronous errors also end up in lnet_finalize().
Therefore, the minimum desired number of transmits = peer_timeout / LND transmit deadline.
Depending on the frequency of errors, LNet may do more re-transmits. LNet will stop re-transmitting and declare a peer dead if the peer_timeout expires or all the different paths have been tried without success.
In the default case, where the LND transmit timeout is set to 50 seconds and the peer_timeout is set to 180 seconds, LNet will re-transmit 3 times (180 / 50, rounded down) before it declares the peer dead.
peer_timeout can be increased to fit in more re-transmits or LND transmit timeout can be decreased.
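The arithmetic above can be sketched in a few lines of C. This is only an illustration of the formula, not LNet code: the function name min_transmits and its parameters are hypothetical, and it assumes both timeouts are expressed in whole seconds and that the division rounds down.

```c
/*
 * Hypothetical helper (not part of LNet): minimum number of transmits
 * that fit inside peer_timeout when each attempt may take up to one
 * LND transmit deadline. Both values are in seconds.
 */
unsigned int min_transmits(unsigned int peer_timeout,
			   unsigned int lnd_timeout)
{
	/* Guard against division by zero; no deadline means no bound. */
	if (lnd_timeout == 0)
		return 0;

	/* Integer division rounds down: a partial attempt does not count. */
	return peer_timeout / lnd_timeout;
}
```

With the defaults, min_transmits(180, 50) gives 3. Raising peer_timeout to 250 gives 5, and lowering the LND transmit timeout to 30 gives 6, which shows the two ways of fitting in more attempts.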
Shadow gave a presentation at LAD 16 that outlines the best values for all Lustre timeouts. It can be accessed here.
Locking
The MD is always protected by the lnet_res_lock, which is CPT-specific.
...
The MD should be kept intact during the resend procedure. If the resend fails, the MD should be released and the message memory freed.
O2IBLND Detailed Discussion
Overview
There are two types of events to account for:
...
Each hop's LNet will make a best effort to get the message to the following hop. There is no feedback mechanism from a router to the originator to inform the originator that a message has failed to send, but I believe this is unnecessary and would probably increase the complexity of the code and of the system in general. As a rule of thumb, each hop should only worry about the immediate next hop.
SOCKLND Detailed Discussion
TBD