...
The MD should be kept intact during the resend procedure. If the resend fails, the MD should be released and the message memory freed.
Selection Algorithm with Health
Algorithm Parameters
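Parameter | Value | Value |
SRC NID | Specified (A) | Not specified (B) |
DST NID | Local (1) | Not local (2) |
DST NID | MR (C) | NMR (D) |
Each case below is named by combining one code from each row. For example, A1C means the SRC NID is specified, the DST NID is local, and the destination peer is MR (Multi-Rail); NMR means the peer is not Multi-Rail.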
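The case names can be derived mechanically from the three parameters in the table. The following is a minimal sketch only; the function name `selection_case` and its boolean arguments are hypothetical and are not part of LNet.

```python
def selection_case(src_specified: bool, dst_local: bool, dst_mr: bool) -> str:
    """Build the case label (for example "A1C") from the three table rows."""
    src = "A" if src_specified else "B"   # SRC NID: specified / not specified
    loc = "1" if dst_local else "2"       # DST NID: local / not local
    rail = "C" if dst_mr else "D"         # DST NID: MR / NMR
    return src + loc + rail
```

A caller could use the returned label to dispatch to the matching case described below.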
A1C
- find the local ni given src_nid
- if no local ni found fail
- if local ni found is down, then fail
- find peer identified by the dst_nid
- select the best peer_ni for that peer
- take into account the health of the peer_ni. Demeriting the peer_ni is not enough on its own, because a demerited peer_ni can still be the best of the bunch. We therefore need to keep track of the peer_nis/local_nis a message was sent over, so that we don't revisit the same ones again. This tracking should be part of the message.
- If this is a resend and the resend peer_ni is specified, then select this peer_ni if it is healthy, otherwise continue with the algorithm.
- if this is a resend, do not select the same peer_ni again.
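The peer_ni part of the A1C steps can be sketched as follows. This is an illustration under stated assumptions, not the LNet implementation: the `PeerNI` and `Msg` types, the `MAX_HEALTH` constant, and the field names are all hypothetical; a larger health value is assumed to mean a healthier interface; and "healthy" is assumed to mean full health. The local ni lookup from the first three steps is omitted.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

MAX_HEALTH = 1000  # assumption: health value of a fully healthy interface


@dataclass(frozen=True)
class PeerNI:
    nid: str
    health: int  # assumption: larger means healthier


@dataclass
class Msg:
    is_resend: bool = False
    resend_peer_nid: Optional[str] = None  # resend peer_ni, if specified
    used_peer_nids: Set[str] = field(default_factory=set)  # kept in the message


def select_peer_ni(msg: Msg, peer_nis: List[PeerNI]) -> Optional[PeerNI]:
    """Return the peer_ni to send to, or None if no candidate is left."""
    # On a resend, honour the specified resend peer_ni if it is healthy.
    if msg.is_resend and msg.resend_peer_nid is not None:
        for pni in peer_nis:
            if pni.nid == msg.resend_peer_nid and pni.health == MAX_HEALTH:
                msg.used_peer_nids.add(pni.nid)
                return pni
    # Otherwise pick the healthiest peer_ni, skipping used ones on a resend.
    pool = [p for p in peer_nis
            if not (msg.is_resend and p.nid in msg.used_peer_nids)]
    if not pool:
        return None
    best = max(pool, key=lambda p: p.health)
    msg.used_peer_nids.add(best.nid)
    return best
```

Recording the used NIDs in the message is what prevents a demerited but still "best" peer_ni from being picked again on the next attempt.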
A2C
- find the local ni given src_nid
- if no local ni found fail
- if local ni found is down, then fail
- find router to dst_nid
- find best peer_ni (for the router) to send to
- take into account the health of the peer_ni
- If this is a resend and the resend peer_ni is specified, then select this peer_ni if it is healthy, otherwise continue with the algorithm.
- If this is a resend, do not select the same peer_ni again.
A1D
- find the local ni given src_nid
- if no local_ni found fail
- if local ni found is down, then fail
- find peer_ni using dst_nid
- send to that peer_ni
- If this is a resend then fail, since there are no other possible peer_nis to send to.
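Because both ends are fixed in A1D, the case reduces to a few checks. A minimal sketch, assuming a hypothetical `local_nis` mapping from local NID to an up/down flag (the function name and return convention are illustrative only):

```python
from typing import Dict, Optional, Tuple


def select_a1d(local_nis: Dict[str, bool], src_nid: str, dst_nid: str,
               is_resend: bool) -> Optional[Tuple[str, str]]:
    """Return (src_nid, dst_nid) for case A1D, or None when the send fails.

    local_nis maps a local NID to True (up) or False (down).
    """
    if src_nid not in local_nis:   # no local ni found
        return None
    if not local_nis[src_nid]:     # local ni found is down
        return None
    if is_resend:                  # only one path exists, nothing left to try
        return None
    return (src_nid, dst_nid)
```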
A2D
- find local_ni given the src_nid
- if no local_ni found fail
- if local ni found is down, then fail
- find router to go through to that peer_ni
- send to the NID of that router.
- if this is a resend then fail, since there are no other possible peer_nis to send to.
B1C
- select the best_ni to send from, by going through all the local_nis that can reach any of the networks the peer is on
- consider local_ni health in the selection by selecting the local_ni with the best health value.
- If this is a resend do not select a local_ni that has already been used.
- select the best_peer_ni that can be reached by the best_ni selected in the previous step
- If this is a resend and the resend peer_ni is specified, then select this peer_ni if it is healthy, otherwise continue with the algorithm.
- If this is a resend do not consider a peer_ni that has already been used for sending.
- send the message over that path.
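The two-stage selection in B1C (local_ni first, then a peer_ni reachable from it) can be sketched as below. All names are hypothetical. The sketch assumes that a local_ni can reach a peer_ni when both are on the same network, that a larger health value is better, and that "healthy" means full health (`max_health`).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass(frozen=True)
class NI:
    nid: str
    net: str
    health: int  # assumption: larger means healthier


@dataclass
class Msg:
    is_resend: bool = False
    resend_peer_nid: Optional[str] = None
    used_local: Set[str] = field(default_factory=set)
    used_peer: Set[str] = field(default_factory=set)


def best(nis: List[NI], used: Set[str], is_resend: bool) -> Optional[NI]:
    """Healthiest NI, skipping already used ones when resending."""
    pool = [n for n in nis if not (is_resend and n.nid in used)]
    return max(pool, key=lambda n: n.health) if pool else None


def select_b1c(msg: Msg, local_nis: List[NI], peer_nis: List[NI],
               max_health: int = 1000) -> Optional[Tuple[NI, NI]]:
    # Stage 1: best local_ni among those on any network the peer is on.
    peer_nets = {p.net for p in peer_nis}
    reachable = [l for l in local_nis if l.net in peer_nets]
    lni = best(reachable, msg.used_local, msg.is_resend)
    if lni is None:
        return None
    # Stage 2: best peer_ni reachable from the chosen local_ni.
    on_net = [p for p in peer_nis if p.net == lni.net]
    pni = None
    if msg.is_resend and msg.resend_peer_nid is not None:
        pni = next((p for p in on_net if p.nid == msg.resend_peer_nid
                    and p.health == max_health), None)
    if pni is None:
        pni = best(on_net, msg.used_peer, msg.is_resend)
    if pni is None:
        return None
    msg.used_local.add(lni.nid)
    msg.used_peer.add(pni.nid)
    return (lni, pni)
```

With two local_nis on different networks, an original send followed by a resend uses two disjoint paths, and a second resend fails because no unused local_ni remains.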
B2C
- find the router that can reach the dst_nid
- find the peer for that router. (The peer is MR)
- go to B1C
B1D
- find peer_ni using dst_nid
- If this is a resend and the peer_ni is unhealthy, fail the send
- If this is an original send, then use the peer_ni even if it's not healthy.
- select the best_ni to send from by going through all the local nis that can reach the dst_nid
- consider local_ni health in the selection by selecting the local_ni with the best health value.
- If this is a resend do not select a local_ni that has already been used.
- send over that path
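The difference between an original send and a resend in B1D can be sketched as follows. The function, its arguments, and `MAX_HEALTH` are hypothetical; "unhealthy" is assumed to mean anything below full health.

```python
from typing import Dict, Optional, Set

MAX_HEALTH = 1000  # assumption: health value of a fully healthy interface


def select_b1d(peer_health: int, local_health: Dict[str, int],
               used_local: Set[str], is_resend: bool) -> Optional[str]:
    """Return the NID of the local_ni to send from, or None to fail the send.

    local_health maps each local NID that can reach dst_nid to its health.
    """
    # A resend to an unhealthy peer_ni fails; an original send goes ahead.
    if is_resend and peer_health < MAX_HEALTH:
        return None
    # Best-health local_ni, skipping already used ones on a resend.
    pool = {nid: h for nid, h in local_health.items()
            if not (is_resend and nid in used_local)}
    if not pool:
        return None
    best_nid = max(pool, key=pool.get)
    used_local.add(best_nid)
    return best_nid
```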
B2D
- find the router through which the dst_nid can be reached (router selection already considers router health using the existing mechanism. Currently a router is either alive or dead; this is discovered via router pings and controlled by tunables such as asynchronous route failure)
- If this is a resend and the peer_ni is unhealthy, fail the send
- If this is an original send, then use the peer_ni even if it's not healthy.
- select the best_ni to send from by going through all the local nis that can reach the router NID
- consider local_ni health in the selection by selecting the local_ni with the best health value.
- If this is a resend do not select a local_ni that has already been used.
- send over that path
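The first B2D step, finding a live router for the destination, can be sketched as below. This only illustrates the alive/dead check; the `routes` table layout and function names are hypothetical, and the existing router selection mechanism is not reproduced here. The remaining steps follow the same pattern as B1D with the router NID as the target.

```python
from typing import Dict, List, Optional, Tuple


def net_of(nid: str) -> str:
    """Return the network part of a NID such as '10.0.0.1@tcp1'."""
    return nid.split("@", 1)[1]


def find_router(routes: Dict[str, List[Tuple[str, bool]]],
                dst_nid: str) -> Optional[str]:
    """Return the NID of the first alive router for dst_nid's network.

    routes maps a remote network name to (router_nid, alive) entries.
    """
    for router_nid, alive in routes.get(net_of(dst_nid), []):
        if alive:
            return router_nid
    return None
```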
Work Items
- Health Value Maintenance/Demerit system
- Selection based on Health Value and not resending over already used interfaces
- Handling the new events in IBLND and passing them to LNet
- Handling the new events in SOCKLND and passing them to LNet
- Adding LNet level transaction timeout and cancelling a resend on timeout
- Handling timeout case in ptlrpc
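As an illustration of the first work item, a demerit system could look like the sketch below. The constants, class name, and recovery step are assumptions chosen for the example, not values taken from this design.

```python
MAX_HEALTH = 1000  # assumption: value of a fully healthy interface
DEMERIT = 100      # assumption: penalty applied on each failure


class HealthValue:
    """Bounded health counter for a local_ni or peer_ni (hypothetical)."""

    def __init__(self) -> None:
        self.value = MAX_HEALTH

    def demerit(self) -> None:
        # Lower the health on a failure, never going below zero.
        self.value = max(0, self.value - DEMERIT)

    def recover(self, step: int = 1) -> None:
        # Raise the health after a success, capped at MAX_HEALTH.
        self.value = min(MAX_HEALTH, self.value + step)

    def is_optimal(self) -> bool:
        return self.value == MAX_HEALTH
```

The selection code would then compare `value` across candidate interfaces and use `is_optimal()` for the explicit resend constraints.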
Patches
- Add health values to local_ni
- Modify selection to make use of local_ni health values.
- Add explicit constraint in the selection to fail a re-send if no local_ni is in optimal health
- Handle explicit port down/up events
- Handle local interface failure on send and update health value then resend
- Add health values to peer_ni
- Add explicit constraint in the selection to fail a re-send if no remote_ni is in optimal health
- Handle remote interface failure on send and update health value then resend
- Modify selection to make use of peer_ni health values.
- Handle LND tx timeout due to being stuck on the queues for too long.
- Handle LND tx timeout due to remote rejection
- Handle LND tx timeout due to no tx completion
- Add a timeout event towards the upper layers (PTLRPC) when a transaction has failed to complete, i.e. when the LNET_ACK_MSG or LNET_REPLY_MSG is not received.
- Handle the transaction timeout event in ptlrpc.
...