...

It is not completely obvious how this scheme interacts with Lustre's timeout parameter (the Lustre RPC timeout, from which a number of timeouts are derived).

LNet Health Version 2.0

There are three types of failures that LNet needs to deal with:

...

Each hop's LNet makes a best effort to get the message to the next hop. There is no feedback mechanism by which a router can inform the originator that a message has failed to send, but I believe such a mechanism is unnecessary and would probably increase the complexity of the code and of the system in general. As a rule of thumb, each hop should worry only about its immediate next hop.
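
The hop-by-hop rule can be illustrated with a small model. This is only a sketch in Python, not LNet code: the `Hop` class, the `try_send` hook, `max_attempts`, and `deliver` are all hypothetical names invented for the illustration. Each hop retries only its own send to the next hop, and a failure at a router is never reported back to an earlier hop.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Hop:
    """One node on the path (originator or router). Hypothetical model."""
    name: str
    # Hypothetical hook: True if a single send attempt to the next hop succeeds.
    try_send: Callable[[str], bool]
    max_attempts: int = 2  # assumed local retry budget, not an LNet parameter

    def forward(self, next_hop: str) -> bool:
        """Best effort: retry locally and report the result only locally."""
        for _ in range(self.max_attempts):
            if self.try_send(next_hop):
                return True
        return False


def deliver(path: List[Hop], destination: str) -> Optional[str]:
    """Walk the path one hop at a time.

    Returns None if every hop handed the message on, otherwise the name of
    the hop that dropped it. The return value exists for the observer of the
    simulation only; no upstream hop is notified.
    """
    for i, hop in enumerate(path):
        next_name = path[i + 1].name if i + 1 < len(path) else destination
        if not hop.forward(next_name):
            return hop.name
    return None


# The originator succeeds on its second attempt; the router never succeeds.
attempts = iter([False, True])
originator = Hop("client", lambda nxt: next(attempts))
router = Hop("router", lambda nxt: False)
dropped_at = deliver([originator, router], "server")
```

In this model the originator's own `forward` call returns True, so from its point of view the send succeeded, even though the message is dropped at the router.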

SOCKLND

TBD

...

TBD