...
It is not completely obvious how this scheme interacts with Lustre's timeout
parameter (the Lustre RPC timeout, from which a number of timeouts are derived).
LNet Health
There are three types of failures that LNet needs to deal with:
- Local Interface failure
- Remote Interface failure
- Timeouts
  - LND detected Timeout
  - LNet detected Timeout
Local Interface Failure
TBD
Remote Interface Failure
TBD
Timeouts
LND Detected Timeouts
Upper layers ask LNet to send a GET or a PUT through the LNetGet() and LNetPut() APIs. LNet then calls into the LND to complete the operation. The LND encapsulates the LNet message in an LND-specific message with its own message type; in the o2iblnd, for example, this is kib_msg_t.
When the LND transmits the LND message, it sets a tx_deadline for that particular transmit. This tx_deadline remains active until the remote has confirmed receipt of the message. On the remote, receipt means that the LND has informed LNet of the incoming message via lnet_parse(), and LNet has then called back into the LND layer to receive the message.
Therefore if a tx_deadline is hit, it is safe to assume that the remote end has not received the message. This could be due to the following reasons:
- The message was never posted.
  - LNet should attempt to resend the message from a different local NI, since this NI is unable to process the messages on its queue in a timely fashion.
- The message was posted but never completed.
  - LNet should attempt to resend the message to a different peer_ni, since this peer_ni is unable to complete the message.
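The two cases above can be sketched as a small decision function. This is only an illustration under stated assumptions: the struct, the enum and the function name are hypothetical and are not LNet or LND symbols, and the time unit (seconds) is assumed. Only tx_deadline and the two resend choices come from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration; none of these are LNet or LND symbols. */
enum tx_resend_action {
    TX_RESEND_NONE,            /* deadline not hit, or the transmit completed */
    TX_RESEND_OTHER_LOCAL_NI,  /* never posted: the local NI is not draining its queue */
    TX_RESEND_OTHER_PEER_NI    /* posted but never completed: the peer_ni is not responding */
};

struct tx_state {
    uint64_t tx_deadline;  /* absolute time (assumed seconds) by which the tx must complete */
    bool posted;           /* true once the message was handed to the network */
    bool completed;        /* true once the remote confirmed receipt */
};

/* Decide what to do with a transmit when its deadline is checked at time 'now'. */
enum tx_resend_action tx_check_deadline(const struct tx_state *tx, uint64_t now)
{
    assert(tx != NULL);

    /* Nothing to do if the transmit finished or the deadline is still ahead. */
    if (tx->completed || now < tx->tx_deadline)
        return TX_RESEND_NONE;

    /* Deadline hit: the remote end is assumed not to have the message. */
    return tx->posted ? TX_RESEND_OTHER_PEER_NI : TX_RESEND_OTHER_LOCAL_NI;
}
```

The sketch only classifies the failure. Choosing the actual alternative NI or peer_ni is left out, since the text does not describe how that selection is made.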
By handling the tx_deadline properly we are able to account for all next-hop failures. LNet will then have done its best to ensure that a message has arrived at the immediate next hop.
LNet Detected Timeouts
As mentioned above, at the LNet layer an LNET_MSG_PUT can be told to expect an LNET_MSG_ACK confirming that the LNET_MSG_PUT has been processed by the destination. Similarly, an LNET_MSG_GET expects an LNET_MSG_REPLY confirming that the LNET_MSG_GET has been successfully processed by the destination.
The LNET_MSG_PUT+LNET_MSG_ACK and LNET_MSG_GET+LNET_MSG_REPLY pairs are not covered by the tx_deadline in the LND. If the upper layer does not take precautions, it could wait forever for the LNET_MSG_ACK or LNET_MSG_REPLY. It is therefore reasonable to expect LNet to provide a TIMEOUT event if either of these messages is not received within the expected timeout.
The question is whether LNet should resend the LNET_MSG_PUT or LNET_MSG_GET if it doesn't receive the corresponding response.
Consider the case where there are multiple LNet routers between two nodes, N1 and N2. These routers may be routing between different hardware, for example OPA and MLX. Through the LND, N1 can reliably determine the health of the next hop's interfaces. It cannot, however, reliably determine the health of hops further along the chain. Each node can determine the health of its immediate next hops. Therefore, each node in the path can be trusted to ensure that the message has arrived at the immediate next hop.
If there is a failure along the path and N1 does not receive the expected LNET_MSG_ACK or LNET_MSG_REPLY, even though it knows that the message has been received by its next hop, it has no way to determine where the failure happened. If it decides to resend the message, there is no way to reliably select a reasonable peer_ni, especially since the message has in fact been received properly by the next hop. We could say that N1 will simply try all the peer_nis of the destination. But this will already be done by the node in the chain that is having a problem completing the message with its next hop, so the net effect is the same. If both are implemented, duplication of messages is a certainty.
Furthermore, the responsibility for end-to-end reliability falls on the layers that use LNet. LNet's initial design intent is for it to be a fire-and-forget transport. Ptlrpc's design, however, clearly takes the end-to-end reliability of RPCs into consideration. If LNET_ACK_TIMEOUT and LNET_REPLY_TIMEOUT events are added to LNet, it can report to ptlrpc when the corresponding message is not received within a specific timeout, and ptlrpc can then decide that the RPC has failed.
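As a rough illustration of the proposed timeout events, the following sketch tracks one outstanding PUT or GET and reports which event, if any, would be raised. All type, field and function names here are hypothetical stand-ins. Only the message types and the two proposed event names (referenced in the comments) come from the text, and the deadline unit is assumed to be seconds. Note that the sketch reports the timeout and never resends, in line with the argument above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; only the LNET_* names in the comments come from the text. */
enum rsp_msg_type {
    RSP_MSG_PUT,  /* LNET_MSG_PUT, answered by LNET_MSG_ACK */
    RSP_MSG_GET   /* LNET_MSG_GET, answered by LNET_MSG_REPLY */
};

enum rsp_event {
    RSP_EV_NONE,
    RSP_EV_ACK_TIMEOUT,   /* stands for the proposed LNET_ACK_TIMEOUT */
    RSP_EV_REPLY_TIMEOUT  /* stands for the proposed LNET_REPLY_TIMEOUT */
};

struct rsp_pending {
    enum rsp_msg_type type;
    bool ack_requested;  /* only meaningful for a PUT */
    bool response_seen;  /* set when the ACK or REPLY arrives */
    uint64_t deadline;   /* absolute time (assumed seconds) */
};

/* Return the event to deliver to the upper layer at time 'now'. */
enum rsp_event rsp_check(const struct rsp_pending *p, uint64_t now)
{
    assert(p != NULL);

    if (p->response_seen || now < p->deadline)
        return RSP_EV_NONE;

    if (p->type == RSP_MSG_GET)
        return RSP_EV_REPLY_TIMEOUT;

    /* A PUT that did not ask for an ACK has nothing to wait for. */
    return p->ack_requested ? RSP_EV_ACK_TIMEOUT : RSP_EV_NONE;
}
```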
RPC failure should be defined in two ways:
- One of the messages that compose the RPC has failed or timed out.
  - In this case it is reasonable to assume that the peer is dead or unreachable, and ptlrpc can clean up its state.
- The response to the RPC has not been received.
  - In this case ptlrpc can do what it does right now and initiate a retransmission of the RPC.
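The two failure definitions above could map onto a decision like the one below. This is a sketch under assumptions, not ptlrpc code: the struct, its fields and the function are invented for illustration, and the order in which the conditions are checked (a received response wins over a reported message failure) is a choice made here, not something the text specifies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for illustration; these are not ptlrpc symbols. */
enum rpc_action {
    RPC_DONE,          /* response arrived, nothing to recover */
    RPC_KEEP_WAITING,  /* no failure reported and the RPC timeout has not expired */
    RPC_CLEAN_UP,      /* a message failed or timed out: peer assumed dead or unreachable */
    RPC_RETRANSMIT     /* messages were delivered but no response arrived in time */
};

struct rpc_view {
    bool msg_failed;         /* a message composing the RPC failed or timed out */
    bool response_received;  /* the RPC response has arrived */
    bool response_overdue;   /* the RPC timeout expired without a response */
};

/* Pick the recovery action for one RPC. */
enum rpc_action rpc_decide(const struct rpc_view *rpc)
{
    assert(rpc != NULL);

    if (rpc->response_received)
        return RPC_DONE;
    if (rpc->msg_failed)
        return RPC_CLEAN_UP;
    if (rpc->response_overdue)
        return RPC_RETRANSMIT;
    return RPC_KEEP_WAITING;
}
```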
The argument against this approach is mixed clusters, where not all nodes are MR capable. In this case we cannot rely on intermediary nodes to try all the interfaces of their next hop. However, as the Multi-Rail design assumes, if not all nodes are MR capable, then not all Multi-Rail features are expected to work.
This approach would add the required LNet resiliency and avoid the many corner cases that would otherwise need to be addressed when receiving messages that have already been processed.
O2IBLND
Overview
There are two types of events to account for:
...