...
This approach lends itself to breaking out the selection of the local interface from lnet_select_pathway(), leading to the following logic:
Code Block |
---|
lnet_select_local(peer_netni lnet_get_best_ni(local_net, cur_ni, md_cpt) { local_net = get_local_net(peer_net) for each ni in local_net { health_value = lnet_local_ni_health(ni) /* select the best health value */ if (health_value < best_health_value) continue distance = get_distance(md_cpt, dev_cpt) /* select the shortest distance to the MD */ if (distance < lnet_numa_range) distance = lnet_numa_range if (distance > shortest_distance) continue else if distance < shortest_distance distance = shortest_distance /* select based on the most available credits */ else if ni_credits < best_credits continue /* if all is equal select based on round robin */ else if ni_credits == best_credits if best_ni->ni_seq <= ni->ni_seq continue } } |
Remote Interface Failure
TBD
Timeouts
LND Detected Timeouts
Upper layers request from LNet to send a GET or a PUT via LNetGet() and LNetPut() APIs. LNet then calls into the LND to complete the operation. The LND encapsulates the LNet message into an LND specific message with its own message type. For example in the o2iblnd it is kib_msg_t.
When the LND transmits the LND message it sets a tx_deadline for that particular transmit. This tx_deadline remains active until the remote has confirmed receipt of the message. Receipt of the message at the remote is when LNet is informed that a message has been received by the LND, done via lnet_parse(), then LNet calls back into the LND layer to receive the message.
Therefore if a tx_deadline is hit, it is safe to assume that the remote end has not received the message. This could be due to the following reasons:
- The message was never posted.
- LNet should attempt to resend the message from a different local NI, since this NI is unable to process messages on its queue in a timely fashion
- The message was posted but never completed.
- LNet should attempt to resend the message to a different peer_ni since the peer_ni is unable to complete the message.
By handling the tx_deadline properly we are able to account for all next-hop failures. LNet would've done its best to ensure that a message has arrived at the immediate next hop.
LNet Detected Timeouts
As mentioned above at the LNet layer LNET_MSG_PUT can be told to expect LNET_MSG_ACK to confirm that the LNET_MSG_PUT has been processed by the destination. Similarly LNET_MSG_GET expects an LNET_MSG_REPLY to confirm that the LNET_MSG_GET has been successfully processed by the destination.
The pair LNET_MSG_PUT+LNET_MSG_ACK and LNET_MSG_GET+LNET_MSG_REPLY is not covered by the tx_deadline in the LND. If the upper layer does not take precautions it could wait forever on the LNET_MSG_ACK or LNET_MSG_REPLY. Therefore it is reasonable to expect that LNET provides a TIMEOUT event if either of these messages is not received within the expected timeout.
The question is whether LNet should resend the LNET_MSG_PUT or LNET_MSG_GET if it doesn't receive the corresponding response.
Consider the case where there are multiple LNet routers between two nodes, N1 and N2. These routers can possibly be routing between different Hardware, example OPA and MLX. N1 via the LND can reliably determine the health of the next-hop's interfaces. It can not however reliably determine the health of further hops in the chain. Each node can determine the health of the immediate next-hops. Therefore, each node in the path can be trusted to ensure that the message has arrived at the immediate next hop.
If there is a failure along the path and N1 does not receive the expected LNET_MSG_ACK or LNET_MSG_REPLY, and it knows that the message has been received by its next-hop, it has no way to determine where the failure happened. If it decides to resend the message, then there is no way to reliably select a reasonable peer_ni. Especially considering that the message has in fact been received properly by the next-hop. We can then say that we will simply try all the peer_nis of the destination. But in fact this will already be done by the node in the chain which is encountering a problem completing the message with its next-hop. So the net effect is the same. If both are implemented, then duplication of messages is a certainty.
Furthermore the responsibility of end-to-end reliability falls on the shoulder of layers using LNet. LNet's initial design intent is for it to be a fire and forget transport. Ptlrpc's design, however, clearly takes the end-to-end reliability of RPCs in consideration. By adding an LNET_ACK_TIMEOUT and LNET_REPLY_TIMEOUT events to LNet, that it can then report back up to Ptlrpc in the case when the corresponding message is not received within a specific timeout, then ptlrpc can make a decision that the RPC has failed.
RPC failure should be defined in two ways:
- One of the messages that compose the RPC has failed/timed out
- In this case it is reasonable to assume that the peer is dead/unreachable and ptlrpc can clean up its state.
- The response to the RPC has not been received.
- It can do what it does right now and initiate a retransmission of the RPC.
The argument against this approach is mixed clusters, where not all nodes are MR capable. In this case we can not rely on intermidiary nodes to try all the interfaces of its next-hop. However, as is assumed in the Multi-Rail design if not all nodes are MR capable, then not all Multi-Rail features are expected to work.
This appraoch would add the LNet resiliency required and avoid the many corner cases that will need to be addressed when receiving message which have already been processed.
Resiliency vs. Reliability
There are two concepts that need to stay separate. Reliability of RPC messages and LNet Resiliency. This feature attempts to add LNet Resiliency against local and immediate next hop interface failure. End-to-end reliability is to ensure that upper layer messages, namely RPC messages, are received and processed by the final destination, and take appropriate action in case this does not happen. End-to-end reliability is the responsibility of the application that uses LNet, in this case ptlrpc. Ptlrpc already has a mechanism to ensure this.
/* lnet_select_pathway() will be modified to add a peer_nid parameter. This parameter indicates that the peer_ni is predetermined, and will be identified by the NID provided. The peer_nid parameter it the next-hop NID, which can be the final destination or the next-hop router. If that peer_NID is not healthy then another peer_NID is selected as per the current algorithm. This will force the algorithm to prefer the peer_ni which was selected in the initial message sending. The peer_ni NID is stored in the message. This new parameter extends the concept of the src_nid, which is provided to lnet_select_pathway() to inform it that the local NI is predetermined. */
/* on resend */
enum lnet_error_type {
LNET_LOCAL_NI_DOWN, /* don't use this NI until you get an UP */
LNET_LOCAL_NI_UP, /* start using this NI */
LNET_LOCAL_NI_SEND_TIMOUT, /* demerit this NI so it's not selected immediately, provided there are other healthy interfaces */
LNET_PEER_NI_NO_LISTENER, /* there is no remote listener. demerit the peer_ni and try another NI */
LNET_PEER_NI_ADDR_ERROR, /* The address for the peer_ni is wrong. Don't use this peer_NI */
LNET_PEER_NI_UNREACHABLE, /* temporarily don't use the peer NI */
LNET_PEER_NI_CONNECT_ERROR, /* temporarily don't use the peer NI */
LNET_PEER_NI_CONNECTION_REJECTED /* temporarily don't use the peer NI */
};
static int lnet_handle_send_failure_locked(msg, local_nid, status)
{
switch (status)
/* LNET_LOCAL_NI_DOWN can be received without a message being sent. In this case msg == NULL and it is sufficient to update the health of the local NI */
case LNET_LOCAL_NI_DOWN:
local_ni = lnet_get_local_ni(msg->local_nid)
if (!local_ni)
return
/* flag local NI down */
lnet_set_local_ni_health(DOWN)
if (msg != NULL)
/* resend message to the same peer_ni, but using a different local_ni */
break;
case LNET_LOCAL_NI_UP:
local_ni = lnet_get_local_ni(msg->local_nid)
if (!local_ni)
return
/* flag local NI down */
lnet_set_local_ni_health(UP)
/* This NI will be a candidate for selection in the next message send */
break;
...
}
static int lnet_complete_msg_locked(msg, cpt)
{
status = msg->msg_ev.status
if (status != 0)
rc = lnet_handle_send_failure_locked(msg, status)
if rc == 0
return
/* continue as currently done */
}
|
Remote Interface Failure
A remote interface can be considered problematic under multiple scenarios:
- address is wrong
- Route can not be determined
- Connection can not be established
- Connection was rejected due to incompatible parameters
In all these cases a different peer_ni should be tried if one exists. lnet_select_pathway() already takes src_nid as a parameter. When resending due to one of these failures src_nid will be set to the src_nid in the message that is being resent.
Code Block |
---|
static int lnet_handle_send_failure_locked(msg, local_nid, status)
{
switch (status)
...
case LNET_PEER_NI_ADDR_ERROR:
lpni->stats.stat_addr_err++
goto peer_ni_resend
break
case LNET_PEER_NI_UNREACHABLE:
lpni->stats.stat_unreacheable++
goto peer_ni_resend
break
case LNET_PEER_NI_CONNECT_ERROR:
lpni->stats.stat_connect_err++
goto peer_ni_resend
break
case LNET_PEER_NI_CONNECTION_REJECTED:
lpni->stats.stat_connect_rej++
goto peer_ni_resend
break
default:
/* unexpected failure. failing message */
return
peer_ni_resend
lnet_send(msg, src_nid)
} |
Timeouts
LND Detected Timeouts
Upper layers request from LNet to send a GET or a PUT via LNetGet() and LNetPut() APIs. LNet then calls into the LND to complete the operation. The LND encapsulates the LNet message into an LND specific message with its own message type. For example in the o2iblnd it is kib_msg_t.
When the LND transmits the LND message it sets a tx_deadline for that particular transmit. This tx_deadline remains active until the remote has confirmed receipt of the message. Receipt of the message at the remote is when LNet is informed that a message has been received by the LND, done via lnet_parse(), then LNet calls back into the LND layer to receive the message.
Therefore if a tx_deadline is hit, it is safe to assume that the remote end has not received the message. This could be due to the following reasons:
- The message was never posted.
- LNet should attempt to resend the message from a different local NI, since this NI is unable to process messages on its queue in a timely fashion
- The message was posted but never completed.
- LNet should attempt to resend the message to a different peer_ni since the peer_ni is unable to complete the message.
By handling the tx_deadline properly we are able to account for all next-hop failures. LNet would've done its best to ensure that a message has arrived at the immediate next hop.
LNet Detected Timeouts
As mentioned above at the LNet layer LNET_MSG_PUT can be told to expect LNET_MSG_ACK to confirm that the LNET_MSG_PUT has been processed by the destination. Similarly LNET_MSG_GET expects an LNET_MSG_REPLY to confirm that the LNET_MSG_GET has been successfully processed by the destination.
The pair LNET_MSG_PUT+LNET_MSG_ACK and LNET_MSG_GET+LNET_MSG_REPLY is not covered by the tx_deadline in the LND. If the upper layer does not take precautions it could wait forever on the LNET_MSG_ACK or LNET_MSG_REPLY. Therefore it is reasonable to expect that LNET provides a TIMEOUT event if either of these messages is not received within the expected timeout.
The question is whether LNet should resend the LNET_MSG_PUT or LNET_MSG_GET if it doesn't receive the corresponding response.
Consider the case where there are multiple LNet routers between two nodes, N1 and N2. These routers can possibly be routing between different Hardware, example OPA and MLX. N1 via the LND can reliably determine the health of the next-hop's interfaces. It can not however reliably determine the health of further hops in the chain. Each node can determine the health of the immediate next-hops. Therefore, each node in the path can be trusted to ensure that the message has arrived at the immediate next hop.
If there is a failure along the path and N1 does not receive the expected LNET_MSG_ACK or LNET_MSG_REPLY, and it knows that the message has been received by its next-hop, it has no way to determine where the failure happened. If it decides to resend the message, then there is no way to reliably select a reasonable peer_ni. Especially considering that the message has in fact been received properly by the next-hop. We can then say that we will simply try all the peer_nis of the destination. But in fact this will already be done by the node in the chain which is encountering a problem completing the message with its next-hop. So the net effect is the same. If both are implemented, then duplication of messages is a certainty.
Furthermore the responsibility of end-to-end reliability falls on the shoulder of layers using LNet. LNet's initial design intent is for it to be a fire and forget transport. Ptlrpc's design, however, clearly takes the end-to-end reliability of RPCs in consideration. By adding an LNET_ACK_TIMEOUT and LNET_REPLY_TIMEOUT events to LNet, that it can then report back up to Ptlrpc in the case when the corresponding message is not received within a specific timeout, then ptlrpc can make a decision that the RPC has failed.
RPC failure should be defined in two ways:
- One of the messages that compose the RPC has failed/timed out
- In this case it is reasonable to assume that the peer is dead/unreachable and ptlrpc can clean up its state.
- The response to the RPC has not been received.
- It can do what it does right now and initiate a retransmission of the RPC.
The argument against this approach is mixed clusters, where not all nodes are MR capable. In this case we can not rely on intermidiary nodes to try all the interfaces of its next-hop. However, as is assumed in the Multi-Rail design if not all nodes are MR capable, then not all Multi-Rail features are expected to work.
This appraoch would add the LNet resiliency required and avoid the many corner cases that will need to be addressed when receiving message which have already been processed.
Resiliency vs. Reliability
There are two concepts that need to stay separate. Reliability of RPC messages and LNet Resiliency. This feature attempts to add LNet Resiliency against local and immediate next hop interface failure. End-to-end reliability is to ensure that upper layer messages, namely RPC messages, are received and processed by the final destination, and take appropriate action in case this does not happen. End-to-end reliability is the responsibility of the application that uses LNet, in this case ptlrpc. Ptlrpc already has a mechanism to ensure this.
To clarify the terminology further, LNET MESSAGE should be used to describe one of the following messages:
- LNET_MSG_PUT
- LNET_MSG_GET
- LNET_MSG_ACK
- LNET_MSG_GET
- LNET_MSG_HELLO
LNET TRANSACTION should be used to describe
- LNET_MSG_PUT, LNET_MSG_ACK sequence
- LNET_MSG_GET, LNET_MSG_REPLY sequence
NEXT-HOP should describe a peer that is exactly one hop away.
The role of LNet is to ensure that an LNET MESSAGE arrives at the NEXT-HOP, and to flag when a transaction fails to complete.
Upper layers should ensure that the transaction it requests to initiate completes successfully, and take appropriate action otherwise.
Roughly, LNet would be analogous to the IP layer and ptlrpc is analogous Roughly, LNet would be analogonous to the IP layer and ptlrpc is analogonous to the TCP layer.
O2IBLND
Overview
...