...
- Address is wrong
- Route cannot be determined
- Connection cannot be established
- Connection was rejected due to incompatible parameters
Desired Behavior
- The remote interface health is updated.
- Failure statistics are incremented.
- A resend is issued on a different remote interface if one is available.
- If no other remote interface is present, the send fails.
Implementation Specifics
In all these cases, a different peer_ni should be tried if one exists. lnet_select_pathway() already takes src_nid as a parameter. When resending due to one of these failures, src_nid will be set to the src_nid of the message being resent.
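The resend rule above can be sketched in C as follows. This is a minimal illustration, not the LNet implementation: the `sketch_*` types and `sketch_resend_select()` are hypothetical stand-ins, and the real lnet_select_pathway() signature is not reproduced here. The sketch marks the failed remote interface unhealthy, bumps its failure count, picks a different healthy remote interface, and leaves the message's src_nid unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; not the real LNet structures. */
struct sketch_peer_ni {
	uint64_t nid;       /* remote interface identifier */
	int      healthy;   /* nonzero if the interface is usable */
	unsigned failures;  /* failure statistics for this interface */
};

struct sketch_msg {
	uint64_t src_nid;   /* local NID used by the original send */
	uint64_t dst_nid;   /* remote NID that was last tried */
};

/*
 * Record a failure on the peer NI the message was last sent to, then
 * choose a different, healthy peer NI for the resend.
 *
 * Returns 0 and updates msg->dst_nid on success, or -1 if no other
 * healthy peer NI exists (the send fails). msg->src_nid is never
 * modified, so the resend reuses the original source NID.
 */
static int sketch_resend_select(struct sketch_msg *msg,
				struct sketch_peer_ni *nis, size_t n)
{
	struct sketch_peer_ni *next = NULL;
	size_t i;

	assert(msg != NULL && nis != NULL);

	for (i = 0; i < n; i++) {
		if (nis[i].nid == msg->dst_nid) {
			/* The interface that just failed. */
			nis[i].healthy = 0;
			nis[i].failures++;
		} else if (next == NULL && nis[i].healthy) {
			/* First healthy alternative found. */
			next = &nis[i];
		}
	}

	if (next == NULL)
		return -1;

	msg->dst_nid = next->nid;
	return 0;
}
```

With two peer NIs, the first failure moves the message to the second interface; a later failure on that interface leaves no candidate, so the call returns -1 and the send fails.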
...
Roughly, LNet is analogous to the IP layer and ptlrpc is analogous to the TCP layer.
Reasons for timeout
The discussion here refers to the LND Transmit timeout.
A timeout can occur for several reasons:
- The message is on the sender's queue and is not posted within the timeout.
  - This indicates that the local interface is too busy and is unable to process the messages on its queue.
- The message is posted but the transmit is never completed.
  - The actual culprit cannot be determined in this scenario. It could be a sender issue, a receiver issue, or a network issue.
- The message is posted and the transmit is completed, but the remote never acknowledges.
  - In the IBLND, there are explicit acknowledgements in most cases when the message is received and forwarded to the LNet layer. See below for more details.
  - If an LND message is in the waiting state and does not receive the expected response, this indicates an issue at the remote's LND: either a problem in the lower protocol (IB/TCP), or the notification at the LND layer is not being processed in a timely fashion.
Each of these scenarios can be handled differently.
Desired Behavior
The desired behavior is listed for each of the above scenarios:
Scenario 1
- The connection is closed.
- The local interface health is updated.
- Failure statistics are incremented.
- A resend is issued on a different local interface if one is available.
- If no other local interface is present, or all are in failed mode, the send fails.
Scenario 2
- The connection is closed.
- The local and remote interface health is updated.
- Failure statistics are incremented on both local and remote.
- A resend is issued on a different path altogether if one is available.
- If no other path is present, the send fails.
Scenario 3
- The connection is closed.
- The remote interface health is updated.
- Failure statistics are incremented.
- A resend is issued on a different remote interface if one is available.
- If no other remote interface is present, the send fails.
Note that the behavior outlined is consistent with the explicit error cases identified in the previous section. Only Scenario 2 diverges, because a different path is selected altogether, but the same code structure is still used.
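The three scenarios share one structure and differ only in which side is blamed and what the resend avoids. One way to picture this is a small policy table, sketched below in C. All names (`sketch_tx_timeout`, `sketch_policy`, `sketch_timeout_policy()`) are hypothetical and chosen for illustration; they are not LNet or LND identifiers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the three LND transmit timeout scenarios. */
enum sketch_tx_timeout {
	SKETCH_TO_QUEUED = 1, /* Scenario 1: never posted */
	SKETCH_TO_POSTED,     /* Scenario 2: posted, transmit never completed */
	SKETCH_TO_NO_ACK,     /* Scenario 3: completed, remote never acknowledged */
};

/* What the resend should avoid reusing. */
enum sketch_resend_via {
	SKETCH_VIA_OTHER_LOCAL_NI,
	SKETCH_VIA_OTHER_PATH,
	SKETCH_VIA_OTHER_REMOTE_NI,
};

struct sketch_policy {
	bool close_conn;        /* the connection is closed in every scenario */
	bool local_unhealthy;   /* update local health and failure stats */
	bool remote_unhealthy;  /* update remote health and failure stats */
	enum sketch_resend_via resend_via;
};

/* Map a timeout scenario to the desired behavior listed in the text. */
static struct sketch_policy sketch_timeout_policy(enum sketch_tx_timeout t)
{
	struct sketch_policy p = { .close_conn = true };

	assert(t >= SKETCH_TO_QUEUED && t <= SKETCH_TO_NO_ACK);

	switch (t) {
	case SKETCH_TO_QUEUED:
		p.local_unhealthy = true;
		p.resend_via = SKETCH_VIA_OTHER_LOCAL_NI;
		break;
	case SKETCH_TO_POSTED:
		/* Culprit unknown: blame both ends, change the whole path. */
		p.local_unhealthy = true;
		p.remote_unhealthy = true;
		p.resend_via = SKETCH_VIA_OTHER_PATH;
		break;
	case SKETCH_TO_NO_ACK:
		p.remote_unhealthy = true;
		p.resend_via = SKETCH_VIA_OTHER_REMOTE_NI;
		break;
	}
	return p;
}
```

In this sketch the caller fails the send when the alternative named by `resend_via` does not exist, which matches the last bullet of each scenario.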
Implementation Specifics
O2IBLND
Overview
There are two types of events to account for:
...