Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

The easiest path is to tell upper layers to resend. For example, PtlRPC has some related logic already. Except that when PtlRPC detects a failure, it disconnects, reconnects, and triggers a recovery operation. This is a fairly heavy-weight process, while the type of resending logic desired is to "just try another path" which differs from what exists today and needs to be implemented for each user.

The LNet Resiliency feature shall ensure that failures that can be dealt with at LNet level are delst with by attempting to resend the message on the available local and remote interfaces with in a timeout provided. LNet shall use a trickle down approach for managing timeouts. the ULP (pltrpc or other upper layer protocol) shall provide a timeout value in the call to LNetPut() or LNetGet(). LNet shall use that as the transaction timeout value to wait for an ACK or REPLY. LNet shall further provide a configuration parameter for the number of retries. The number of retries shall allow the user to specify a maximum number of times LNet shall attempt to resend an unsuccessful message. LNet shall then calculate the message timeout by dividing the transaction timeout with the number of retries. LNet shall pass the calculated message timeout to the LND, which will use it to ensure that the LND protocol completes an LNet message within the message timeout. If the LND is not able to complete the message within the provided timeout it will close the connection and drop all messages on that connection. It will afterword proceed to call into LNet via lnet_finailze() to notify it of the error encountered.

...

LNet should resend the message: b

  1. On LND transmit timeout
  2. On LND connection failure
  3. On LND send failure

...