...
Interfaces which have suffered soft failures will be demerited, so they will naturally be selected only as a last option.
Work Items
- refactor lnet_select_pathway() as described above.
- Health Value Maintenance/Demerit system
- Selection based on Health Value, without resending over already used interfaces unless none are available.
- Handling the new events in IBLND and passing them to LNet
- Handling the new events in SOCKLND and passing them to LNet
- Adding an LNet level transaction timeout (or reusing the peer timeout) and cancelling a resend on timeout
- Handling timeout case in ptlrpc
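
The demerit and selection items above can be illustrated with a minimal sketch. It is not the LNet implementation: `struct sketch_ni`, `sketch_demerit()`, `sketch_select()` and the numeric constants are hypothetical names and values chosen only for illustration. The sketch assumes each interface carries an integer health value, that a soft failure subtracts a fixed amount from it, and that selection prefers the healthiest interface not yet used for the message, falling back to an already used one only when none are available.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical values, not taken from the LNet source. */
#define SKETCH_HEALTH_MAX 1000
#define SKETCH_DEMERIT    100

struct sketch_ni {
	int  health;  /* SKETCH_HEALTH_MAX means optimal health */
	bool used;    /* already tried for this message */
};

/* Demerit an interface after a soft failure, never going below zero. */
static void sketch_demerit(struct sketch_ni *ni)
{
	ni->health -= SKETCH_DEMERIT;
	if (ni->health < 0)
		ni->health = 0;
}

/*
 * Return the index of the healthiest interface that has not yet been
 * used for this message. If every interface was already used, fall
 * back to the healthiest one overall. Returns -1 when n == 0.
 */
static int sketch_select(const struct sketch_ni *nis, size_t n)
{
	int best_unused = -1;
	int best_any = -1;

	assert(nis != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (best_any < 0 || nis[i].health > nis[best_any].health)
			best_any = (int)i;
		if (!nis[i].used &&
		    (best_unused < 0 ||
		     nis[i].health > nis[best_unused].health))
			best_unused = (int)i;
	}
	return best_unused >= 0 ? best_unused : best_any;
}
```

Because a demerited interface has a lower health value than its peers, it ends up being chosen only after the healthier candidates have been tried, which matches the "last option" behaviour described above.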
Patches
- refactor lnet_select_pathway() as described above
- Add health values to local_ni
- Modify selection to make use of local_ni health values.
- Add explicit constraint in the selection to fail a re-send if no local_ni is in optimal health
- Handle explicit port down/up events
- Handle local interface failure on send and update health value then resend
- Add health values to peer_ni
- Add explicit constraint in the selection to fail a re-send if no remote_ni is in optimal health
- Handle remote interface failure on send and update health value then resend
- Modify selection to make use of peer_ni health values.
- Handle LND tx timeout due to being stuck on the queues for too long
- Handle LND tx timeout due to remote rejection
- Handle LND tx timeout due to no tx completion
- Add an Event timeout towards upper layers (PTLRPC) when a transaction has failed to complete, i.e. an LNET_ACK_MSG or LNET_REPLY_MSG is not received.
- Handle the transaction timeout event in ptlrpc.
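
The two "fail a re-send" constraints and the transaction timeout listed above could be combined into a single resend check, sketched below. This is an illustration under stated assumptions rather than the planned code: `sketch_resend_check()`, the `sketch_resend` enum and `SKETCH_HEALTH_OPTIMAL` are hypothetical, health values are modelled as plain integers, and the order in which the conditions are tested is an arbitrary choice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical value standing for "optimal health". */
#define SKETCH_HEALTH_OPTIMAL 1000

enum sketch_resend {
	SKETCH_RESEND_OK,
	SKETCH_RESEND_TIMED_OUT,
	SKETCH_RESEND_NO_HEALTHY_NI,
};

/* True if at least one interface in the array is in optimal health. */
static bool sketch_any_optimal(const int *health, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (health[i] == SKETCH_HEALTH_OPTIMAL)
			return true;
	return false;
}

/*
 * Decide whether a failed message may be resent:
 *  - not once the transaction deadline has passed, and
 *  - not if no local or no remote interface is in optimal health.
 */
static enum sketch_resend
sketch_resend_check(const int *local, size_t nlocal,
		    const int *peer, size_t npeer,
		    long now, long deadline)
{
	assert(local != NULL && peer != NULL);
	if (now >= deadline)
		return SKETCH_RESEND_TIMED_OUT;
	if (!sketch_any_optimal(local, nlocal) ||
	    !sketch_any_optimal(peer, npeer))
		return SKETCH_RESEND_NO_HEALTHY_NI;
	return SKETCH_RESEND_OK;
}
```

Either failure result would be the point at which the timeout or failure event is propagated to the upper layer instead of attempting another send.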
...
[Gliffy Diagram]
As shown in the above diagram, whenever a tx is queued to be sent, or has been posted but has not yet received confirmation, its tx_deadline is still active. The scheduler thread checks the active connections for any transmits which have passed their deadline, then closes those connections and notifies LNet via lnet_notify().
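
The deadline scan described above can be sketched as follows. The structures and functions (`struct dl_tx`, `struct dl_conn`, `dl_check_conns()`) are hypothetical simplifications, time is modelled as a plain `long`, and treating a tx as expired when `now >= deadline` is an assumption. The notification to LNet is only marked by a comment, since the sketch does not model lnet_notify() itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct dl_tx {
	bool active;    /* queued, or posted and awaiting confirmation */
	long deadline;  /* hypothetical time unit, e.g. seconds */
};

struct dl_conn {
	struct dl_tx *txs;
	size_t        ntx;
	bool          closed;
};

/* True if any active tx on the connection has passed its deadline. */
static bool dl_conn_timed_out(const struct dl_conn *conn, long now)
{
	for (size_t i = 0; i < conn->ntx; i++)
		if (conn->txs[i].active && now >= conn->txs[i].deadline)
			return true;
	return false;
}

/*
 * One pass of the periodic check: close every open connection that
 * holds an expired tx. Returns the number of connections closed.
 */
static size_t dl_check_conns(struct dl_conn *conns, size_t n, long now)
{
	size_t closed = 0;

	assert(conns != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (conns[i].closed || !dl_conn_timed_out(&conns[i], now))
			continue;
		conns[i].closed = true;  /* the LND would also notify LNet here */
		closed++;
	}
	return closed;
}
```

Note that in this model a single expired tx closes the whole connection, so every other tx on that connection is affected as well.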
...
[Gliffy Diagram]
After an o2iblnd tx posted with ib_post_send() completes, a completion event is added to the completion queue. This triggers a call to kiblnd_complete(). If the event is an IBLND_WID_TX, then kiblnd_tx_complete() is called, which calls kiblnd_tx_done() if the tx is not sending, waiting or queued. In this case the tx_timeout is closed.
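
The condition under which the tx is finalized can be sketched as below. This is a simplified model, not the o2iblnd code: `struct cq_tx` and `cq_tx_complete()` are hypothetical, the field meanings given in the comments are assumptions, and "sending" is modelled as a count of posted work requests that have not yet completed. Setting `done` stands in for the call to kiblnd_tx_done().

```c
#include <assert.h>
#include <stdbool.h>

struct cq_tx {
	int  sending;  /* posted work requests still awaiting completion */
	bool waiting;  /* assumed: still waiting for a reply from the peer */
	bool queued;   /* assumed: still sitting on a send queue */
	bool done;     /* set once the tx has been finalized */
};

/*
 * Handle one send completion for a tx. The tx is finalized only when
 * it is no longer sending, waiting or queued. Returns true if the tx
 * was finalized by this completion.
 */
static bool cq_tx_complete(struct cq_tx *tx)
{
	bool idle;

	assert(tx->sending > 0);
	tx->sending--;
	idle = tx->sending == 0 && !tx->waiting && !tx->queued;
	if (idle)
		tx->done = true;  /* stands in for kiblnd_tx_done() */
	return idle;
}
```

A tx that is still waiting or queued stays alive after its send completion, so its deadline continues to apply until one of the later events finalizes it.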
...
[Gliffy Diagram]
To fully understand how the LND transmit timeout can be used for resends, we need to understand the transmit life cycle shown above.
...