...

    • find the router through which the dst_nid can be reached. Router selection already considers router health through the existing mechanism: a router is currently either alive or dead, as discovered via router pings and controlled by tunables such as asynchronous route failure.
      • If this is a resend and the peer_ni is unhealthy, fail the send.
      • If this is an original send, then use the peer_ni even if it's not healthy.
    • select the best_ni to send from by going through all the local nis that can reach the router NID
      • consider local_ni health in the selection by selecting the local_ni with the best health value.
      • If this is a resend, do not select a local_ni that has already been used.
    • send over that path
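
The steps above can be sketched roughly as follows. This is an illustration only, not the LNet implementation: the `Ni` and `Message` types, the `select_path` function, and the `MAX_HEALTH` threshold are hypothetical names, and the convention that a higher health value means a healthier interface is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple

MAX_HEALTH = 1000  # assumed value that marks a fully healthy interface


@dataclass
class Ni:
    """Hypothetical stand-in for a local_ni or a router peer_ni."""
    nid: str
    health: int  # assumed convention: higher is healthier


@dataclass
class Message:
    """Hypothetical message state needed for path selection."""
    dst_nid: str
    is_resend: bool = False
    used_local_nids: Set[str] = field(default_factory=set)


def select_path(msg: Message, router_ni: Ni,
                local_nis: List[Ni]) -> Optional[Tuple[Ni, Ni]]:
    """Return (local_ni, router_ni), or None if the send must fail."""
    # A resend to an unhealthy router peer_ni fails;
    # an original send uses the peer_ni even if it is not healthy.
    if msg.is_resend and router_ni.health < MAX_HEALTH:
        return None

    # On a resend, skip every local_ni that was already used for this message.
    candidates = [ni for ni in local_nis
                  if not (msg.is_resend and ni.nid in msg.used_local_nids)]
    if not candidates:
        return None

    # Pick the local_ni with the best health value and remember it.
    best = max(candidates, key=lambda ni: ni.health)
    msg.used_local_nids.add(best.nid)
    return best, router_ni
```

Recording the chosen local_ni on the message is one way to honor the "do not reuse on resend" rule; the actual bookkeeping in LNet may differ.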

(Olaf): An attempt to rewrite the above so that it incorporates the single source to NMR destination requirement and highlights commonalities in the logic flow:

  • find route to dst_nid
  • find peer_ni of router
    • if the peer_ni is healthy, use it
    • on the first attempt to send this message, try this peer_ni even if it is unhealthy
    • on a resend, fail if the peer_ni is unhealthy
  • pick the preferred NI for the dst_nid if set
    • otherwise pick a healthy local NI and make it the preferred NI for this dst_nid
  • send over this path
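
A minimal sketch of this rewritten flow, assuming plain dictionaries for the routing table, the health values, and the preferred-NI mapping. All names here (`choose_path`, `HEALTHY`, the dictionary arguments) are hypothetical and chosen for illustration. Failing the send when no healthy local NI exists is also an assumption, since the list above does not cover that case.

```python
from typing import Dict, Optional, Tuple

HEALTHY = 1000  # assumed health value of a fully healthy interface


def choose_path(dst_nid: str,
                is_resend: bool,
                routes: Dict[str, str],        # dst_nid -> router peer_ni NID
                peer_health: Dict[str, int],   # router peer_ni NID -> health
                local_health: Dict[str, int],  # local NI NID -> health
                preferred: Dict[str, str],     # dst_nid -> preferred local NI
                ) -> Optional[Tuple[str, str]]:
    """Return (local_nid, router_nid), or None if the send fails."""
    # Find the route to dst_nid and the peer_ni of its router.
    router_nid = routes.get(dst_nid)
    if router_nid is None:
        return None

    # An unhealthy peer_ni is tried on the first attempt only.
    router_ok = peer_health.get(router_nid, 0) >= HEALTHY
    if not router_ok and is_resend:
        return None

    # Use the preferred NI for this dst_nid if one is set.
    local_nid = preferred.get(dst_nid)
    if local_nid is None:
        # Otherwise pick a healthy local NI and make it the preferred one.
        healthy = [nid for nid, h in local_health.items() if h >= HEALTHY]
        if not healthy:
            return None  # assumption: no healthy local NI means failure
        local_nid = healthy[0]
        preferred[dst_nid] = local_nid

    return local_nid, router_nid
```

Keeping the preferred NI in a per-destination mapping means later sends to the same dst_nid keep using one source NI, which is the single-source behavior the note above asks for.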

Work Items

    • Health Value Maintenance/Demerit system
    • Selection based on Health Value and not resending over already used interfaces
    • Handling the new events in IBLND and passing them to LNet
    • Handling the new events in SOCKLND and passing them to LNet
    • Adding an LNet-level transaction timeout and cancelling a resend on timeout
    • Handling the timeout case in ptlrpc

...