...
```
lnet_ni lnet_get_best_ni(local_net, cur_ni, md_cpt)
{
	local_net = get_local_net(peer_net)
	for each ni in local_net {
		health_value = lnet_local_ni_health(ni)
		/* select the best health value */
		if (health_value < best_health_value)
			continue
		distance = get_distance(md_cpt, dev_cpt)
		/* select the shortest distance to the MD */
		if (distance < lnet_numa_range)
			distance = lnet_numa_range
		if (distance > shortest_distance)
			continue
		else if (distance < shortest_distance)
			shortest_distance = distance
		/* select based on the most available credits */
		else if (ni_credits < best_credits)
			continue
		/* if all is equal select based on round robin */
		else if (ni_credits == best_credits)
			if (best_ni->ni_seq <= ni->ni_seq)
				continue
	}
}

/*
 * lnet_select_pathway() will be modified to add a peer_nid parameter. This parameter indicates that the peer_ni is predetermined,
 * and is identified by the NID provided. The peer_nid parameter is the next-hop NID, which can be the final destination or
 * the next-hop router. If that peer_NID is not healthy then another peer_NID is selected as per the current algorithm. This will
 * force the algorithm to prefer the peer_ni which was selected in the initial message sending. The peer_ni NID is stored in
 * the message. This new parameter extends the concept of the src_nid, which is provided to lnet_select_pathway() to inform it
 * that the local NI is predetermined.
 */

/* on resend */
enum lnet_error_type {
	LNET_LOCAL_NI_DOWN,               /* don't use this NI until you get an UP */
	LNET_LOCAL_NI_UP,                 /* start using this NI */
	LNET_LOCAL_NI_SEND_TIMOUT,        /* demerit this NI so it's not selected immediately, provided there are other healthy interfaces */
	LNET_PEER_NI_NO_LISTENER,         /* there is no remote listener. demerit the peer_ni and try another NI */
	LNET_PEER_NI_ADDR_ERROR,          /* The address for the peer_ni is wrong. Don't use this peer_NI */
	LNET_PEER_NI_UNREACHABLE,         /* temporarily don't use the peer NI */
	LNET_PEER_NI_CONNECT_ERROR,       /* temporarily don't use the peer NI */
	LNET_PEER_NI_CONNECTION_REJECTED  /* temporarily don't use the peer NI */
};

static int lnet_handle_send_failure_locked(msg, local_nid, status)
{
	switch (status)
	/*
	 * LNET_LOCAL_NI_DOWN can be received without a message being sent.
	 * In this case msg == NULL and it is sufficient to update the health
	 * of the local NI
	 */
	case LNET_LOCAL_NI_DOWN:
		local_ni = lnet_get_local_ni(local_nid)
		if (!local_ni)
			return
		/* flag local NI down */
		lnet_set_local_ni_health(DOWN)
		if (msg != NULL)
			/* resend message to the same peer_ni, but using a different local_ni */
		break;
	case LNET_LOCAL_NI_UP:
		LASSERT(!msg);
		local_ni = lnet_get_local_ni(local_nid)
		if (!local_ni)
			return
		/* flag local NI up */
		lnet_set_local_ni_health(UP)
		/* This NI will be a candidate for selection in the next message send */
		break;
	...
}

static int lnet_complete_msg_locked(msg, cpt)
{
	status = msg->msg_ev.status
	if (status != 0)
		rc = lnet_handle_send_failure_locked(msg, status)
		if rc == 0
			return
	/* continue as currently done */
}
```
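The selection order used by lnet_get_best_ni() can be sketched as runnable C. This is a minimal sketch and not the LNet implementation: struct ni_sketch, pick_best_ni() and all field names are hypothetical, and the criteria are assumed to apply in strict priority order (health, then distance clamped to the NUMA range, then credits, then round robin on a sequence counter).

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a local NI; not the LNet struct. */
struct ni_sketch {
	int health;        /* higher is healthier */
	unsigned distance; /* NUMA distance from the MD's CPT to the device */
	int credits;       /* available transmit credits */
	unsigned seq;      /* incremented each time this NI is selected */
};

/* Return the preferred NI, or NULL if the array is empty. */
static struct ni_sketch *
pick_best_ni(struct ni_sketch *nis, size_t n, unsigned numa_range)
{
	struct ni_sketch *best = NULL;
	unsigned best_dist = 0;

	for (size_t i = 0; i < n; i++) {
		struct ni_sketch *ni = &nis[i];
		/* distances inside the NUMA range are treated as equal */
		unsigned dist = ni->distance < numa_range ?
				numa_range : ni->distance;

		if (best != NULL) {
			if (ni->health != best->health) {
				if (ni->health < best->health)
					continue;
			} else if (dist != best_dist) {
				if (dist > best_dist)
					continue;
			} else if (ni->credits != best->credits) {
				if (ni->credits < best->credits)
					continue;
			} else if (best->seq <= ni->seq) {
				/* full tie: keep the less recently used NI */
				continue;
			}
		}
		best = ni;
		best_dist = dist;
	}
	if (best != NULL)
		best->seq++;
	return best;
}
```

With two otherwise identical NIs, repeated calls alternate between them because the selected NI's sequence counter is incremented on every selection.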
...
Therefore if a tx_deadline is hit, it is safe to assume that the remote end has not received the message. This could be due to the following reasons:
- The message was never posted.
  - LNet should attempt to resend the message from a different local NI, since this NI is unable to process messages on its queue in a timely fashion.
- The message was posted but never completed.
  - LNet should attempt to resend the message to a different peer_ni, since the peer_ni is unable to complete the message.
By handling the tx_deadline properly we are able to account for almost all next-hop failures. LNet will have done its best to ensure that a message has arrived at the immediate next hop. The reasons for a timeout are described further below.
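The two tx_deadline cases, and the resend choice each one suggests, can be sketched as follows. The enum and function names are hypothetical and are not LNet APIs; the sketch only encodes the mapping described in the list.

```c
/* Hypothetical: where the message was when its tx_deadline expired. */
enum tx_stage {
	TX_STAGE_QUEUED, /* never posted */
	TX_STAGE_POSTED, /* posted but never completed */
};

/* Hypothetical: what to change on the resend. */
enum resend_choice {
	RESEND_OTHER_LOCAL_NI,
	RESEND_OTHER_PEER_NI,
};

static enum resend_choice resend_choice_for(enum tx_stage stage)
{
	/* never posted: the local NI cannot drain its queue in time */
	if (stage == TX_STAGE_QUEUED)
		return RESEND_OTHER_LOCAL_NI;
	/* posted but not completed: the peer_ni could not complete it */
	return RESEND_OTHER_PEER_NI;
}
```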
LNet Detected Timeouts
As mentioned above, at the LNet layer an LNET_MSG_PUT can be told to expect an LNET_MSG_ACK to confirm that the LNET_MSG_PUT has been processed by the destination. Similarly, an LNET_MSG_GET expects an LNET_MSG_REPLY to confirm that the LNET_MSG_GET has been successfully processed by the destination.
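The request/response pairing can be written down as a small lookup. This is a sketch under assumptions: the sk_ names are hypothetical stand-ins rather than the LNet message type definitions, and the ack_requested flag models the fact that an ACK for a PUT is optional.

```c
#include <stdbool.h>

/* Hypothetical message types for illustration only. */
enum sk_msg_type {
	SK_MSG_NONE,
	SK_MSG_ACK,
	SK_MSG_PUT,
	SK_MSG_GET,
	SK_MSG_REPLY,
};

/* Which message, if any, confirms that `sent` was processed. */
static enum sk_msg_type
sk_expected_response(enum sk_msg_type sent, bool ack_requested)
{
	switch (sent) {
	case SK_MSG_PUT:
		/* a PUT is confirmed by an ACK only when one was requested */
		return ack_requested ? SK_MSG_ACK : SK_MSG_NONE;
	case SK_MSG_GET:
		/* a GET is always confirmed by a REPLY */
		return SK_MSG_REPLY;
	default:
		/* ACK and REPLY are themselves responses */
		return SK_MSG_NONE;
	}
}
```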
...
Furthermore, the responsibility for end-to-end reliability falls on the layers using LNet. LNet's initial design intent is for it to be a fire-and-forget transport. Ptlrpc's design, however, clearly takes the end-to-end reliability of RPCs into consideration. By adding LNET_ACK_TIMEOUT and LNET_REPLY_TIMEOUT events to LNet (or an error status in the current events), LNet can report back up to ptlrpc when the corresponding message is not received within a specific timeout, and ptlrpc can then decide that the RPC has failed.
RPC failure should be defined in two ways:
- One of the messages that compose the RPC has failed or timed out.
  - In this case it is reasonable to assume that the peer is dead or unreachable, and ptlrpc can clean up its state.
- The response to the RPC has not been received.
  - Ptlrpc can do what it does today and initiate a retransmission of the RPC.
In either case, ptlrpc can react to the error status appropriately.
The argument against this approach is mixed clusters, where not all nodes are MR capable. In this case we cannot rely on intermediary nodes to try all the interfaces of their next hop. However, as is assumed in the Multi-Rail design, if not all nodes are MR capable, then not all Multi-Rail features are expected to work.
...
Upper layers should ensure that the transactions they initiate complete successfully, and take appropriate action otherwise. Roughly, LNet is analogous to the IP layer and ptlrpc is analogous to the TCP layer.
Reasons for timeout
The discussion here refers to the LND Transmit timeout.
...
- The message is on the sender's queue and is not posted within the timeout.
  - This indicates that the local interface is too busy and is unable to process the messages on its queue.
- The message is posted but the transmit is never completed.
  - An actual culprit cannot be determined in this scenario. It could be a sender issue, a receiver issue or a network issue.
- The message is posted, the transmit is completed, but the remote never acknowledges.
  - In the IBLND, there are explicit acknowledgements in most cases when the message is received and forwarded to the LNet layer. See below for more details.
  - If an LND message is in the waiting state and does not receive the expected response, this indicates an issue at the remote's LND: either at the lower protocol (IB/TCP), or the notification at the LNet layer is not being processed in a timely fashion.
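The three timeout cases can be told apart by how far the transmit progressed. The sketch below is illustrative only: struct tx_progress, enum suspect and classify_tx_timeout() are hypothetical names, not LND or LNet definitions.

```c
#include <stdbool.h>

/* Hypothetical record of how far a transmit got before timing out. */
struct tx_progress {
	bool posted;    /* left the sender's queue */
	bool completed; /* transmit completion was seen */
	bool acked;     /* remote acknowledged at the LND level */
};

/* Hypothetical: most likely source of the problem. */
enum suspect {
	SUSPECT_NONE,       /* nothing outstanding */
	SUSPECT_LOCAL_NI,   /* local interface too busy */
	SUSPECT_UNKNOWN,    /* sender, receiver or network */
	SUSPECT_REMOTE_LND, /* remote LND or its lower protocol */
};

static enum suspect classify_tx_timeout(const struct tx_progress *p)
{
	if (!p->posted)
		return SUSPECT_LOCAL_NI;
	if (!p->completed)
		return SUSPECT_UNKNOWN;
	if (!p->acked)
		return SUSPECT_REMOTE_LND;
	return SUSPECT_NONE;
}
```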
...
All of these cases should end up calling the lnet_finalize() API with the proper return code. lnet_finalize() will be the funnel where all these events are processed in a consistent manner. When the message is completed via lnet_complete_msg_locked(), the error is checked and the proper behavior as described above is executed.
Peer_timeout
When a GET or a PUT transaction is initiated, an associated deadline needs to be attached to the transaction. This deadline indicates how long LNet should wait for a REPLY or an ACK before it times out the entire transaction.
A new thread is required to check if a transaction deadline has expired.
When a transaction deadline expires, an appropriate event is generated towards PTLRPC.
When the REPLY or the ACK is received, the message is removed from the check queue of the thread and a success event is generated towards PTLRPC.
Within a transaction deadline, if there is a determination that the GET or PUT message failed to be sent to the next hop, then the GET or PUT can be resent.
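The deadline bookkeeping can be sketched as a small queue that a checker thread scans periodically. This is a sketch under assumptions, not the LNet design: the txn names are hypothetical, time is a plain count of seconds, locking is omitted, and the event field stands in for the event that would be generated towards PTLRPC.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical event recorded for the upper layer. */
enum txn_event { TXN_EVENT_NONE, TXN_EVENT_SUCCESS, TXN_EVENT_TIMEOUT };

/* Hypothetical per-transaction state for a GET or PUT. */
struct txn {
	long deadline;        /* absolute time by which the REPLY/ACK is due */
	bool pending;         /* still on the check queue */
	enum txn_event event; /* last event reported for this transaction */
};

/* Tag the transaction with its deadline when the GET or PUT is initiated. */
static void txn_start(struct txn *t, long now, long timeout)
{
	t->deadline = now + timeout;
	t->pending = true;
	t->event = TXN_EVENT_NONE;
}

/* The REPLY or ACK arrived: take it off the queue and report success. */
static void txn_response(struct txn *t)
{
	if (!t->pending)
		return; /* already timed out or already answered */
	t->pending = false;
	t->event = TXN_EVENT_SUCCESS;
}

/* A resend is only allowed while the transaction deadline has not passed. */
static bool txn_may_resend(const struct txn *t, long now)
{
	return t->pending && now < t->deadline;
}

/* One pass of the checker thread; returns how many transactions expired. */
static size_t txn_check(struct txn *queue, size_t n, long now)
{
	size_t expired = 0;

	for (size_t i = 0; i < n; i++) {
		if (queue[i].pending && now >= queue[i].deadline) {
			queue[i].pending = false;
			queue[i].event = TXN_EVENT_TIMEOUT;
			expired++;
		}
	}
	return expired;
}
```

A response that arrives after the deadline is ignored here; whether a late REPLY or ACK should be dropped or still delivered is a policy choice the sketch does not settle.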
Resend Window
Resends are terminated when the peer_timeout for a message expires.
...