...

The MD should be kept intact during the resend procedure. If the resend fails, the MD should be released and the message memory freed.

Selection Algorithm with Health

Algorithm Parameters

Parameter   Values
SRC NID     Specified (A)   Not specified (B)
DST NID     local (1)       not local (2)
DST NID     MR (C)          NMR (D)
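Each case name below combines one value from each row of the table, for example A1C for a specified source NID, a local destination, and an MR peer. A minimal sketch of that naming scheme follows; the struct and function names are hypothetical and are not LNet identifiers.

```c
#include <stdbool.h>

/* Hypothetical description of one send request; not an LNet type. */
struct send_params {
    bool src_nid_specified; /* A if true, B otherwise */
    bool dst_nid_local;     /* 1 if true, 2 otherwise */
    bool dst_peer_mr;       /* C if true, D otherwise */
};

/*
 * Write the three-character case label (for example "A1C") into out,
 * followed by a terminating NUL. out must hold at least 4 bytes.
 */
static void send_case_label(const struct send_params *p, char out[4])
{
    out[0] = p->src_nid_specified ? 'A' : 'B';
    out[1] = p->dst_nid_local ? '1' : '2';
    out[2] = p->dst_peer_mr ? 'C' : 'D';
    out[3] = '\0';
}
```

The three parameters are independent, which gives the eight cases described in the rest of this section.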

A1C

  • find the local ni given src_nid
    • if no local ni found fail
    • if local ni found is down, then fail
  • find peer identified by the dst_nid
  • select the best peer_ni for that peer
    • take into account the health of the peer_ni. Demeriting a peer_ni is not enough on its own, because it can still be the best of the bunch. The message therefore needs to keep track of the peer_nis and local_nis it was sent over, so that the same ones are not revisited.
    • If this is a resend and the resend peer_ni is specified, then select this peer_ni if it is healthy, otherwise continue with the algorithm.
    • if this is a resend, do not select the same peer_ni again.
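The peer_ni selection rules for A1C can be sketched as below. This is an illustration under stated assumptions, not the LNet implementation: the struct, the HEALTH_MAX constant, the convention that higher values are healthier, and the reading of "healthy" as "at maximum health" are all hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define HEALTH_MAX 1000 /* assumed value meaning "fully healthy" */

/* Hypothetical stand-in for a peer interface; not the LNet struct. */
struct peer_ni {
    int  health; /* higher is healthier (assumption) */
    bool used;   /* message was already sent over this peer_ni */
};

/*
 * Pick a peer_ni index for a send or a resend.
 * preferred is the index of the specified resend peer_ni, or -1 if none.
 * Returns the chosen index, or -1 if no peer_ni is eligible.
 */
static int select_peer_ni(const struct peer_ni *nis, size_t n,
                          bool resend, int preferred)
{
    int best = -1;
    size_t i;

    /* A specified resend peer_ni is taken if it is healthy. */
    if (resend && preferred >= 0 && (size_t)preferred < n &&
        nis[preferred].health == HEALTH_MAX)
        return preferred;

    for (i = 0; i < n; i++) {
        /* On a resend, never revisit a peer_ni already tried. */
        if (resend && nis[i].used)
            continue;
        if (best < 0 || nis[i].health > nis[best].health)
            best = (int)i;
    }
    return best;
}
```

Tracking the used flag per message, rather than relying on the demerited health value alone, is what prevents a resend from landing on the same interface when it is still the best of the bunch.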

A2C

  • find the local ni given src_nid
    • if no local ni found fail
    • if local ni found is down, then fail
  • find router to dst_nid
  • find best peer_ni (for the router) to send to
    • take into account the health of the peer_ni
    • If this is a resend and the resend peer_ni is specified, then select this peer_ni if it is healthy, otherwise continue with the algorithm.
    • If this is a resend, do not select the same peer_ni again.

A1D

  • find the local ni given src_nid
    • if no local_ni found fail
    • if local ni found is down, then fail
  • find peer_ni using dst_nid
  • send to that peer_ni
    • If this is a resend then fail, since there are no other possible peer_nis to send to.

A2D

  • find local_ni given the src_nid
    • if no local_ni found fail
    • if local ni found is down, then fail
  • find router to go through to that peer_ni
  • send to the NID of that router.
    • if this is a resend then fail, since there are no other possible peer_nis to send to.
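In A1D and A2D both ends of the path are fixed, so the only decisions are whether the local ni is usable and whether the send is a resend. A minimal sketch of those checks follows; the types and the specific errno values are assumptions chosen for illustration and are not taken from the LNet code.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical local interface state; not an LNet type. */
enum ni_state { NI_UP, NI_DOWN };

struct local_ni {
    enum ni_state state;
};

/*
 * local is the result of looking up src_nid, or NULL if none was found.
 * Returns 0 if the send may proceed, or a negative errno value.
 */
static int check_fixed_path_send(const struct local_ni *local, bool resend)
{
    if (local == NULL)
        return -EINVAL;       /* no local ni found */
    if (local->state == NI_DOWN)
        return -ENETDOWN;     /* local ni found is down */
    if (resend)
        return -EHOSTUNREACH; /* NMR: no other peer_ni to try */
    return 0;
}
```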

B1C

  • select the best_ni to send from, by going through all the local_nis that can reach any of the networks the peer is on
    • consider local_ni health in the selection by selecting the local_ni with the best health value.
    • If this is a resend do not select a local_ni that has already been used.
  • select the best_peer_ni that can be reached by the best_ni selected in the previous step
    • If this is a resend and the resend peer_ni is specified, then select this peer_ni if it is healthy, otherwise continue with the algorithm.
    • If this is a resend do not consider a peer_ni that has already been used for sending.
  • send the message over that path.
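The two-step B1C selection can be sketched as follows. The sketch assumes that a local_ni can reach a peer_ni exactly when both sit on the same network, modeled here as an integer id, and it leaves out the "resend peer_ni is specified" shortcut for brevity. All names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical interface record used for both local and peer nis. */
struct ni {
    int  net;    /* network id (assumption: same id means reachable) */
    int  health; /* higher is healthier (assumption) */
    bool used;   /* already used for this message */
};

/* Best-health ni on the given network, skipping used ones on a resend. */
static int best_on_net(const struct ni *nis, size_t n, int net, bool resend)
{
    int best = -1;
    size_t i;

    for (i = 0; i < n; i++) {
        if (nis[i].net != net || (resend && nis[i].used))
            continue;
        if (best < 0 || nis[i].health > nis[best].health)
            best = (int)i;
    }
    return best;
}

/* Fill *l_idx and *p_idx with the chosen path; false if none exists. */
static bool select_path(const struct ni *local, size_t nl,
                        const struct ni *peer, size_t np,
                        bool resend, int *l_idx, int *p_idx)
{
    int best_local = -1, best_peer;
    size_t i;

    /* Step 1: best local_ni among those that reach a peer network. */
    for (i = 0; i < nl; i++) {
        if (resend && local[i].used)
            continue;
        if (best_on_net(peer, np, local[i].net, false) < 0)
            continue; /* the peer has no interface on this network */
        if (best_local < 0 || local[i].health > local[best_local].health)
            best_local = (int)i;
    }
    if (best_local < 0)
        return false;

    /* Step 2: best peer_ni reachable from the chosen local_ni. */
    best_peer = best_on_net(peer, np, local[best_local].net, resend);
    if (best_peer < 0)
        return false;

    *l_idx = best_local;
    *p_idx = best_peer;
    return true;
}
```

After a send, the caller would mark both chosen entries as used so that a later resend picks a different path.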

B2C

  • find the router that can reach the dst_nid
  • find the peer for that router. (The peer is MR)
  • go to B1C

B1D

  • find peer_ni using dst_nid
    • If this is a resend and the peer_ni is unhealthy, fail the send.
    • If this is an original send, then use the peer_ni even if it's not healthy.
  • select the best_ni to send from by going through all the local nis that can reach the dst_nid
    • consider local_ni health in the selection by selecting the local_ni with the best health value.
    • If this is a resend do not select a local_ni that has already been used.
  • send over that path
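For B1D the peer_ni is fixed, so only its health and the choice of local_ni matter. The following sketch shows the difference between an original send and a resend. It assumes that "unhealthy" means below a maximum health value and that the caller has already filtered the local nis down to those that can reach the dst_nid; the names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define HEALTH_MAX 1000 /* assumed value meaning "fully healthy" */

/* Hypothetical local interface record; not the LNet struct. */
struct local_ni {
    int  health; /* higher is healthier (assumption) */
    bool used;   /* already used for this message */
};

/*
 * Returns the index of the local_ni to send from, or -1 if the send
 * must fail. peer_health is the health of the single NMR peer_ni.
 */
static int select_local_ni_nmr(const struct local_ni *local, size_t n,
                               int peer_health, bool resend)
{
    int best = -1;
    size_t i;

    /* A resend to an unhealthy peer_ni fails; an original send proceeds. */
    if (resend && peer_health < HEALTH_MAX)
        return -1;

    for (i = 0; i < n; i++) {
        /* On a resend, skip local nis that were already used. */
        if (resend && local[i].used)
            continue;
        if (best < 0 || local[i].health > local[best].health)
            best = (int)i;
    }
    return best;
}
```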

B2D

  • find the router through which the dst_nid can be reached. Router selection already considers router health using the existing mechanism: a router is currently either alive or dead, as discovered via router pings and controlled by tunables such as asynchronous route failure.
    • If this is a resend and the peer_ni is unhealthy, fail the send.
    • If this is an original send, then use the peer_ni even if it's not healthy.
  • select the best_ni to send from by going through all the local nis that can reach the router NID
    • consider local_ni health in the selection by selecting the local_ni with the best health value.
    • If this is a resend do not select a local_ni that has already been used.
  • send over that path

Work Items

  • Health Value Maintenance/Demerit system
  • Selection based on Health Value and not resending over already used interfaces
  • Handling the new events in IBLND and passing them to LNet
  • Handling the new events in SOCKLND and passing them to LNet
  • Adding LNet level transaction timeout and cancelling a resend on timeout
  • Handling timeout case in ptlrpc

Patches

  1. Add health values to local_ni
  2. Modify selection to make use of local_ni health values.
  3. Add explicit constraint in the selection to fail a resend if no local_ni is in optimal health
  4. Handle explicit port down/up events
  5. Handle local interface failure on send and update health value then resend
  6. Add health values to peer_ni
  7. Add explicit constraint in the selection to fail a resend if no remote_ni is in optimal health
  8. Handle remote interface failure on send and update health value then resend
  9. Modify selection to make use of peer_ni health values.
  10. Handle LND tx timeout due to being stuck on the queues for too long.
  11. Handle LND tx timeout due to remote rejection
  12. Handle LND tx timeout due to no tx completion
  13. Add an event timeout towards the upper layers (PTLRPC) when a transaction has failed to complete, i.e. when the LNET_ACK_MSG or LNET_REPLY_MSG is not received.
  14. Handle the transaction timeout event in ptlrpc.

...