...

Interfaces that have soft failures will be demerited, so they will naturally be selected as a last option.
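
The demerit idea above can be sketched in a few lines of C. This is only an illustration under stated assumptions: `struct ni_sketch`, `NI_HEALTH_MAX`, `NI_DEMERIT`, `ni_demerit()` and `ni_select_healthiest()` are hypothetical names invented for this example, and the numeric values are arbitrary. The real selection logic lives in lnet_select_pathway() and takes more criteria into account.

```c
#include <assert.h>
#include <stddef.h>

#define NI_HEALTH_MAX 1000	/* hypothetical "fully healthy" value */
#define NI_DEMERIT     100	/* hypothetical penalty per soft failure */

/* Hypothetical, simplified stand-in for a local network interface. */
struct ni_sketch {
	int nid;	/* illustrative identifier only */
	int health;	/* 0..NI_HEALTH_MAX, higher is healthier */
};

/* Lower the health value after a soft failure, clamping at zero. */
static inline void ni_demerit(struct ni_sketch *ni)
{
	assert(ni != NULL);
	ni->health -= NI_DEMERIT;
	if (ni->health < 0)
		ni->health = 0;
}

/*
 * Return the interface with the highest health value, or NULL if the
 * array is empty. Ties go to the earliest entry, so a demerited
 * interface is only chosen when nothing healthier exists.
 */
static inline struct ni_sketch *ni_select_healthiest(struct ni_sketch *nis,
						     size_t n)
{
	struct ni_sketch *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (best == NULL || nis[i].health > best->health)
			best = &nis[i];
	}
	return best;
}
```

With two interfaces that start at `NI_HEALTH_MAX`, a single call to `ni_demerit()` on the first one is enough for `ni_select_healthiest()` to prefer the second.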

Work Items

  • Refactor lnet_select_pathway() as described above.
  • Health Value Maintenance/Demerit system
  • Selection based on Health Value and not resending over already used interfaces unless none are available.
  • Handling the new events in IBLND and passing them to LNet
  • Handling the new events in SOCKLND and passing them to LNet
  • Adding an LNet-level transaction timeout (or reusing the peer timeout) and cancelling a resend on timeout
  • Handling timeout case in ptlrpc

Patches

  1. Refactor lnet_select_pathway() as described above.
  2. Add health values to local_ni
  3. Modify selection to make use of local_ni health values.
  4. Add explicit constraint in the selection to fail a re-send if no local_ni is in optimal health
  5. Handle explicit port down/up events
  6. Handle local interface failure on send and update health value then resend
  7. Add health values to peer_ni
  8. Add explicit constraint in the selection to fail a re-send if no remote_ni is in optimal health 
  9. Handle remote interface failure on send and update health value then resend
  10. Modify selection to make use of peer_ni health values.
  11. Handle LND tx timeout due to being stuck on the queues for too long.
  12. Handle LND tx timeout due to remote rejection
  13. Handle LND tx timeout due to no tx completion
  14. Add a timeout event towards upper layers (PTLRPC) when a transaction has failed to complete, i.e. when the LNET_ACK_MSG or LNET_REPLY_MSG is not received.
  15. Handle the transaction timeout event in ptlrpc.
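
One possible reading of the re-send constraints in the list above (only interfaces in optimal health qualify, and an already used interface is reused only when no other is available) is sketched below. All names here (`struct ni_state`, `HEALTH_OPTIMAL`, `pick_resend_ni()`) are hypothetical and exist only for this example; how the two rules are combined is an assumption, not something the patch list specifies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define HEALTH_OPTIMAL 1000	/* hypothetical "optimal health" value */

/* Hypothetical per-interface state tracked for one message. */
struct ni_state {
	int health;	/* current health value */
	bool used;	/* already tried for this message */
};

/*
 * Return the index of the interface to use for a re-send, or -1 if the
 * re-send should fail because no interface is in optimal health.
 */
static inline int pick_resend_ni(const struct ni_state *nis, size_t n)
{
	int fallback = -1;
	size_t i;

	assert(nis != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (nis[i].health != HEALTH_OPTIMAL)
			continue;		/* not in optimal health */
		if (!nis[i].used)
			return (int)i;		/* prefer an untried interface */
		if (fallback < 0)
			fallback = (int)i;	/* reuse only as a last resort */
	}
	return fallback;
}
```

The same shape would apply to the peer_ni side; only the list being scanned changes.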

...

[Gliffy diagram: PUT sequence]

As shown in the above diagram, whenever a tx is queued to be sent, or has been posted but has not yet received confirmation, its tx_deadline is still active. The scheduler thread checks the active connections for any transmits that have passed their deadline, then closes those connections and notifies LNet via lnet_notify().
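
The deadline check described above can be sketched as follows. This is a simplified model, not the o2iblnd code: `struct tx_sketch`, `struct conn_sketch` and `check_conns()` are hypothetical, time is a plain counter rather than jiffies (so wrap-around is ignored), and the `notified` flag merely marks the point where real code would call lnet_notify().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical transmit descriptor: only what the deadline check needs. */
struct tx_sketch {
	unsigned long deadline;	/* time after which the tx has timed out */
	bool active;		/* queued or posted, not yet confirmed */
};

/* Hypothetical connection holding an array of transmits. */
struct conn_sketch {
	struct tx_sketch *txs;
	size_t ntx;
	bool closed;
	bool notified;	/* stands in for the lnet_notify() call */
};

/* True if any active tx on the connection has passed its deadline. */
static inline bool conn_has_expired_tx(const struct conn_sketch *c,
				       unsigned long now)
{
	size_t i;

	for (i = 0; i < c->ntx; i++) {
		if (c->txs[i].active && now > c->txs[i].deadline)
			return true;
	}
	return false;
}

/* Close every open connection with an expired tx; return how many. */
static inline size_t check_conns(struct conn_sketch *conns, size_t n,
				 unsigned long now)
{
	size_t i, nclosed = 0;

	assert(conns != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (conns[i].closed || !conn_has_expired_tx(&conns[i], now))
			continue;
		conns[i].closed = true;
		conns[i].notified = true;	/* real code: lnet_notify() */
		nclosed++;
	}
	return nclosed;
}
```

Note that a single expired tx closes the whole connection, which matches the behaviour described in the paragraph above.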

...

[Gliffy diagram: GET Sequence Diagram]

After an o2iblnd tx posted with ib_post_send() completes, a completion event is added to the completion queue. This triggers a call to kiblnd_complete(). If the event is for an IBLND_WID_TX, kiblnd_tx_complete() is called, which in turn calls kiblnd_tx_done() if the tx is not sending, waiting or queued. In this case the tx timeout is cancelled.

...

[Gliffy diagram: o2iblnd TX FSM]

To fully understand how the LND transmit timeout can be used for resends, we first need to understand the transmit life cycle shown above.

...