...
By design LNet is a lossy connectionless network: there are cases where messages can be dropped without the sender being notified. Here we explore the possibilities of making LNet more resilient, including having it retransmit messages over alternate routes. What can be done in this area is constrained by the design of LNet.
In the following discussion, node will often be the shorthand for local node, while peer will be shorthand for peer node or remote node.
Put, Ack, Get, Reply
Within LNet there are three cases of interest: Put, Put+Ack, and Get+Reply.
...
The protocols used to implement LNet tend to be connection-oriented, and implement some kind of handshake or ack protocol that tells the sender that a message has been received. As long as the LND actually reports errors to LNet (not a given, alas) this means that in practice the sender of a message can reliably determine whether the message was successfully sent to another node. When the destination node is on the same LNet network, this is sufficient to enable LNet itself to detect failures even in the bare Put case. But in a routed configuration this only guarantees that the LNet router received the message, and if the LNet router then fails to forward it, a bare Put will be lost without trace.
...
Adaptive timeouts add an interesting wrinkle to this mechanism: they allow the recipient of a Request to tell the sender to "please wait", informing it that the recipient is alive and working but not able to send the Response before the normal timeout. For LNet the interesting implication is that while this is going on, there will be some traffic between the sender and recipient. However, this traffic may also be in the form of out-of-band information invisible to LNet.
...
The interfaces that LNet provides to the upper layers should work as follows. Set up an MD (Memory Descriptor) to send data from (for a Put) or receive data into (for a Get). An event handler is associated with the MD. Then call LNetGet() or LNetPut() as appropriate.
...
A damaged Reply message will be dropped, and does not result in an LNET_EVENT_REPLY. Effectively the only way for LNET_EVENT_REPLY to have an error status is if LNet detects a timeout before the Reply is received.
LNetPut() LNET_ACK_REQ
The caller of LNetPut() requests an Ack by using LNET_ACK_REQ as the value of the ack parameter.
A Put with an Ack is similar to a Get+Reply pair. The events in this case are LNET_EVENT_SEND and LNET_EVENT_ACK.
...
As with send, LNET_EVENT_ACK is expected to only carry an error indication if there was a timeout before the Ack was received.
LNetPut() LNET_NOACK_REQ
The caller of LNetPut() requests no Ack by using LNET_NOACK_REQ as the value of the ack parameter.
A Put without an Ack will only generate an LNET_EVENT_SEND, which indicates that the MD can now be re-used or discarded.
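The three cases differ only in which events the initiator should expect. The sketch below models that bookkeeping. It is not the LNet API: op_kind, the EV_* bits, expected_events() and op_complete() are hypothetical names chosen for illustration, with the EV_* bits standing in for LNET_EVENT_SEND, LNET_EVENT_ACK and LNET_EVENT_REPLY.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the three cases of interest (not LNet types). */
enum op_kind { OP_PUT_NOACK, OP_PUT_ACK, OP_GET };

/* Hypothetical bits standing in for LNET_EVENT_SEND / _ACK / _REPLY. */
enum { EV_SEND = 1 << 0, EV_ACK = 1 << 1, EV_REPLY = 1 << 2 };

/* Events the initiator expects before the operation is complete. */
unsigned expected_events(enum op_kind kind)
{
    switch (kind) {
    case OP_PUT_NOACK: return EV_SEND;            /* bare Put */
    case OP_PUT_ACK:   return EV_SEND | EV_ACK;   /* Put+Ack */
    case OP_GET:       return EV_SEND | EV_REPLY; /* Get+Reply */
    }
    assert(!"unknown op_kind");
    return 0;
}

/* True once every expected event has been seen. */
bool op_complete(enum op_kind kind, unsigned seen)
{
    unsigned want = expected_events(kind);

    return (seen & want) == want;
}
```

Note that a bare Put is complete as soon as the send event arrives, which is why a dropped bare Put cannot be detected at this level.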
Possible Failures
There are a number of failures we can encounter, only some of which LNet may address.
- Node interface failure: there is some issue with the local interface that prevents it from sending (or receiving) messages.
- Peer interface failure: there is some issue with the remote interface that prevents it from receiving (or sending) messages.
- Path failure: the local interface is working properly, but messages never reach the peer interface.
- Soft peer failure: the peer is properly receiving messages but unresponsive for other reasons.
- Hard peer failure: the peer is down, and unresponsive on all its interfaces.
In a routed LNet configuration these scenarios apply to each hop.
These failures will show up in a number of ways:
- Node interface reports failure. This includes the interface itself being healthy but noting that the cable connecting it to a switch, or the switch port, is somehow not working right.
- Peer interface not reachable. A peer interface that should be reachable from the node interface cannot be reached. Depending on the error this can result in a "fast" error, or a timeout in the LND-level protocol.
- Some peer interfaces on a net not reachable. The node interface appears to be OK, but there are several peer interfaces it cannot talk to.
- All peer interfaces on a net not reachable. The node interface appears to be OK, but cannot talk to any peer interface.
- All interfaces of a peer not reachable. All LNDs report errors when talking to a specific peer, but have no problem talking to other peers.
- Put+Ack or Get+Reply timeout. The LND gives no failure indication, but the Ack or Reply takes too long to arrive.
- Dropped Put. Everything appears to work, except it doesn't.
Let's take a look at what LNet on the node can do in each of these cases.
...
Node Interface Reports Failure
This is the easiest case to work with. The LND can report such a failure to LNet, and LNet then refrains from using this interface for any traffic.
LNet can mark the interface down, and depending on the capabilities of the LND either recheck periodically or wait for the LND to mark the interface up.
...
Peer Interface Not Reachable
The peer interface cannot be reached from the node interface, but the node interface can talk to other peers. If the peer interface can be reached from other node interfaces then we're dealing with some path failure. Otherwise the peer interface may be bad. If there is only a single node interface that can talk to the peer interface, then the node cannot distinguish between these cases.
LNet can mark this particular node/peer interface combination as something to be avoided.
When there are paths from more than one node interface to the peer interface, and none of these work, but other peer interfaces do work, then LNet can mark the peer interface as bad. Recovery could be done by periodically probing the peer interface, maybe using LNet Ping as a poor-man's equivalent of an LNet Control Packet.
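The reasoning above amounts to a small decision table. The following is a minimal sketch of it, assuming LNet can count how many node interfaces have a path to the peer interface and how many of those currently fail. The verdict names and classify_unreachable() are hypothetical and do not exist in LNet.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical verdicts for one peer interface, as seen from this node. */
enum verdict {
    VERDICT_OK,          /* no failing path */
    VERDICT_AVOID_PAIR,  /* path failure: avoid the failing node/peer pair */
    VERDICT_PEER_NI_BAD, /* all paths fail, other peer interfaces work */
    VERDICT_AMBIGUOUS    /* cannot tell path failure from a bad peer NI */
};

/*
 * paths:  node interfaces that should be able to reach the peer interface
 * failed: how many of those currently cannot reach it
 * other_peer_nis_ok: other peer interfaces are still reachable
 */
enum verdict classify_unreachable(int paths, int failed, bool other_peer_nis_ok)
{
    assert(paths >= 1 && failed >= 0 && failed <= paths);

    if (failed == 0)
        return VERDICT_OK;
    if (failed < paths)
        return VERDICT_AVOID_PAIR;   /* another node interface still works */
    if (paths > 1 && other_peer_nis_ok)
        return VERDICT_PEER_NI_BAD;  /* several paths, none of them work */
    return VERDICT_AMBIGUOUS;        /* single path, or nothing else works */
}
```

In the ambiguous case the node/peer interface combination can still be avoided. The verdict only records that the cause cannot be attributed.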
...
Some Peer Interfaces On A Net Not Reachable
Several peer interfaces on a net cannot be reached from a node interface, but the same node interface can talk to other peers. This is a more severe variant of the previous case.
All Peer Interfaces On A Net Not Reachable
All remote interfaces on a net cannot be reached from a local interface. If there are other, working, interfaces connected to the same net then the balance of probability shifts to the local interface being bad, or there is a severe problem with the fabric.
In practice LNet will not detect "all" remote interfaces being down. But it can detect that for a period of time, no traffic was successfully sent from a local interface, and therefore start avoiding that interface as a whole. Recovery would involve periodically probing the interface, maybe using LNet Ping.
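A minimal sketch of the "no successful traffic for a period of time" check described above. The ni_activity structure, the function names and the quiet_period parameter are all hypothetical. The sketch also assumes that an idle interface (no send attempts since the last success) should not be penalized, which the text does not state explicitly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-interface bookkeeping; times are in seconds. */
struct ni_activity {
    long last_attempt; /* most recent send attempt */
    long last_success; /* most recent successful send */
};

/* Record the outcome of one send attempt on this interface. */
void ni_record_send(struct ni_activity *a, long now, bool ok)
{
    assert(a != NULL);
    a->last_attempt = now;
    if (ok)
        a->last_success = now;
}

/*
 * Avoid the interface when sends were attempted after the last success
 * and nothing has succeeded for at least quiet_period seconds.
 */
bool ni_should_avoid(const struct ni_activity *a, long now, long quiet_period)
{
    assert(a != NULL && quiet_period > 0);
    return a->last_attempt > a->last_success &&
           now - a->last_success >= quiet_period;
}
```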
All Interfaces Of A Peer Not Reachable
The peer is likely down. There is little LNet can do here; this is a problem to be handled by upper layers. This includes indicating when LNet should attempt to reconnect.
...
The easiest path is to tell upper layers to resend. For example, PtlRPC has some related logic already. Except that when PtlRPC detects a failure, it disconnects, reconnects, and triggers a recovery operation. This is a fairly heavy-weight process, while the type of resending logic desired is to "just try another path", which differs from what exists today and needs to be implemented for each user.
...
- LND Timeout: LND declares that a message won't arrive.
- IB timeout is (default?) slightly less than 4 seconds
- LND timeout is the timeout module parameter for o2ib and gni, and the sock_timeout module parameter for sock?
- LNet Reply Timeout: LNet declares an Ack/Reply won't arrive. > LND Timeout * (max hops -1)
- Depends on the route!
- LNet Retry Timeout: LNet gives up on retries. > LNet Reply Timeout * max LNet retries
- Depends on the route!
- peer_timeout module parameter: peer is declared dead. Either use for LNet Retry Timeout, or > LNet Retry Timeout. Using the peer_timeout for the LNet Retry Timeout has the advantage of reducing the number of tunable parameters. A disadvantage is that the peer_timeout is currently a per-LND parameter (each LND has its own tunable value), effectively limiting the number of retries to 1 when the LND timed out.
It is not completely obvious how this scheme interacts with Lustre's timeout parameter (the Lustre RPC timeout, from which a number of timeouts are derived), but at first glance it seems that at least peer_timeout < Lustre timeout.
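The ordering constraints in this list can be checked mechanically. The sketch below encodes them exactly as written above. The lnet_timeouts structure and timeouts_consistent() are hypothetical names, and any values passed in are the caller's own, since no defaults are implied here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical container for the timeouts discussed above, in seconds. */
struct lnet_timeouts {
    unsigned lnd;         /* LND Timeout */
    unsigned reply;       /* LNet Reply Timeout */
    unsigned retry;       /* LNet Retry Timeout */
    unsigned peer;        /* peer_timeout */
    unsigned lustre;      /* Lustre timeout */
    unsigned max_hops;    /* depends on the route */
    unsigned max_retries; /* max LNet retries */
};

/* Check the ordering constraints listed in the text. */
bool timeouts_consistent(const struct lnet_timeouts *t)
{
    assert(t != NULL && t->max_hops >= 1);

    return t->reply > t->lnd * (t->max_hops - 1) && /* Reply > LND * (hops - 1) */
           t->retry > t->reply * t->max_retries &&  /* Retry > Reply * retries */
           t->peer >= t->retry &&                   /* peer_timeout = or > Retry */
           t->peer < t->lustre;                     /* peer_timeout < Lustre timeout */
}
```

Because max hops depends on the route, a single set of values can be consistent for one route and inconsistent for another.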
LNet Health Version 2.0
There are three types of failures that LNet needs to deal with:
...
- Synchronously as a return failure to the call to lnd_send()
- Asynchronously as an event that could be detected at a later point.
- These asynchronous events can be the result of a send operation
- They can also be independent of send operations, as failures are detected with the underlying device, for example a "link down" event.
Desired Behavior
When a local interface fails, the following actions should take place:
- The local interface health is updated
- Failure statistics are incremented
- A resend is issued on a different local interface if there is one available.
- If no other local interface is present, or all are in failed mode, then the send fails.
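The four actions above can be sketched as a single function. This is an illustration under stated assumptions, not LNet code: the local_ni structure, its fields and handle_local_ni_failure() are hypothetical, and health is reduced to a boolean for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a local interface. */
struct local_ni {
    bool healthy;      /* false while the interface is in failed mode */
    unsigned failures; /* failure statistics */
};

/*
 * Handle a failure on nis[failed]. Returns the index of another healthy
 * interface to resend on, or -1 if the send has to fail.
 */
int handle_local_ni_failure(struct local_ni *nis, int count, int failed)
{
    assert(nis != NULL && failed >= 0 && failed < count);

    nis[failed].healthy = false; /* update the interface health */
    nis[failed].failures++;      /* increment failure statistics */

    for (int i = 0; i < count; i++)
        if (i != failed && nis[i].healthy)
            return i;            /* resend on a different interface */

    return -1;                   /* none left: the send fails */
}
```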
There are several ways a failed local interface can recover:
- The local interface might signal via the LND that it is well again ("link up").
- The interface is retried because there are no alternative interfaces available; it is the only one, or all have been marked bad.
- A helper thread periodically retries the local interface; if there are alternative interfaces available this will ensure that the interface will be used again if it has recovered, without risking an interruption of the traffic flowing over the working interfaces.
Implementation Specifics
Events that are triggered asynchronously, without initiating a message, such as port down, port up, or RDMA device removed, shall be handled via a new LNet/LND API that shall be added.
In the other cases, lnet_ni_send() calls into the LND via the lnd_send() callback provided. If the return code is a failure, lnet_finalize() is called to finalize the message.
lnet_finalize() takes the return code as an input parameter. The above behavior should be implemented in lnet_finalize() since this is the main entry into the LNet module via the LNDs as well.
lnet_finalize() detaches the MD in preparation of completing the message. Once the MD is detached it can be re-used. Therefore, if we are to re-send the message then the MD shouldn't be detached at this point.
lnet_complete_msg_locked() should be modified to manage the local interface health, and decide whether the message should be resent or not. If the message cannot be resent due to no available local interfaces then the MD can be detached and the message can be freed.
Currently lnet_select_pathway() iterates through all the local interfaces on a particular peer identified by the NID to send to. In this case we would want to restrict the resend to go to the same peer_ni, but on a different local interface.
This approach lends itself to breaking out the selection of the local interface from lnet_select_pathway(), leading to the following logic:
```
lnet_ni
lnet_get_best_ni(local_net, cur_ni, md_cpt)
{
    local_net = get_local_net(peer_net)
    best_ni = NULL

    for each ni in local_net {
        health_value = lnet_local_ni_health(ni)

        /* select the best health value */
        if (health_value < best_health_value)
            continue

        distance = get_distance(md_cpt, dev_cpt)

        /* select the shortest distance to the MD */
        if (distance < lnet_numa_range)
            distance = lnet_numa_range

        if (distance > shortest_distance)
            continue
        else if (distance < shortest_distance)
            shortest_distance = distance
        /* select based on the most available credits */
        else if (ni_credits < best_credits)
            continue
        /* if all is equal select based on round robin */
        else if (ni_credits == best_credits)
            if (best_ni && best_ni->ni_seq <= ni->ni_seq)
                continue

        /* this NI is the best candidate so far */
        best_ni = ni
        best_health_value = health_value
        best_credits = ni_credits
    }

    return best_ni
}

/*
 * lnet_select_pathway() will be modified to add a peer_nid parameter.
 * This parameter indicates that the peer_ni is predetermined, and is
 * identified by the NID provided. The peer_nid parameter is the
 * next-hop NID, which can be the final destination or the next-hop
 * router. If that peer_NID is not healthy then another peer_NID is
 * selected as per the current algorithm. This will force the
 * algorithm to prefer the peer_ni which was selected in the initial
 * message sending. The peer_ni NID is stored in the message. This
 * new parameter extends the concept of the src_nid, which is provided
 * to lnet_select_pathway() to inform it that the local NI is
 * predetermined.
 */

/* on resend */
enum lnet_error_type {
    LNET_LOCAL_NI_DOWN,               /* don't use this NI until you get an UP */
    LNET_LOCAL_NI_UP,                 /* start using this NI */
    LNET_LOCAL_NI_SEND_TIMEOUT,       /* demerit this NI so it's not selected
                                       * immediately, provided there are other
                                       * healthy interfaces */
    LNET_PEER_NI_NO_LISTENER,         /* there is no remote listener. demerit
                                       * the peer_ni and try another NI */
    LNET_PEER_NI_ADDR_ERROR,          /* The address for the peer_ni is wrong.
                                       * Don't use this peer_NI */
    LNET_PEER_NI_UNREACHABLE,         /* temporarily don't use the peer NI */
    LNET_PEER_NI_CONNECT_ERROR,       /* temporarily don't use the peer NI */
    LNET_PEER_NI_CONNECTION_REJECTED  /* temporarily don't use the peer NI */
};

static int
lnet_handle_send_failure_locked(msg, local_nid, status)
{
    switch (status)
    /*
     * LNET_LOCAL_NI_DOWN can be received without a message being sent.
     * In this case msg == NULL and it is sufficient to update the health
     * of the local NI
     */
    case LNET_LOCAL_NI_DOWN:
        LASSERT(!msg);
        /* msg is NULL here, so look the NI up by the local_nid argument */
        local_ni = lnet_get_local_ni(local_nid)
        if (!local_ni)
            return
        /* flag local NI down */
        lnet_set_local_ni_health(local_ni, DOWN)
        break;
    case LNET_LOCAL_NI_UP:
        LASSERT(!msg);
        local_ni = lnet_get_local_ni(local_nid)
        if (!local_ni)
            return
        /* flag local NI up */
        lnet_set_local_ni_health(local_ni, UP)
        /* This NI will be a candidate for selection in the next message send */
        break;
    ...
}

static int
lnet_complete_msg_locked(msg, cpt)
{
    status = msg->msg_ev.status
    if (status != 0)
        rc = lnet_handle_send_failure_locked(msg, msg->local_nid, status)
        if rc == 0
            return
    /* continue as currently done */
}
```
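For readers who want something they can compile, here is a self-contained sketch of the selection order the pseudocode describes: health first, then NUMA distance, then available credits, then round robin on the sequence number. It is not the LNet implementation. The ni structure, its fields and pick_best_ni() are hypothetical, and the sketch assumes that a healthier interface always wins regardless of distance.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical, simplified local interface. */
struct ni {
    int health;   /* higher is healthier */
    int distance; /* NUMA distance to the MD */
    int credits;  /* available send credits */
    unsigned seq; /* bumped on use, for round robin */
};

/* Return the preferred interface, or NULL if there are none. */
const struct ni *pick_best_ni(const struct ni *nis, size_t count, int numa_range)
{
    const struct ni *best = NULL;
    int best_distance = INT_MAX;

    assert(nis != NULL || count == 0);

    for (size_t i = 0; i < count; i++) {
        const struct ni *ni = &nis[i];
        /* distances inside the NUMA range are treated as equal */
        int distance = ni->distance < numa_range ? numa_range : ni->distance;

        if (best != NULL) {
            if (ni->health < best->health)
                continue;
            if (ni->health == best->health) {
                if (distance > best_distance)
                    continue;
                if (distance == best_distance) {
                    if (ni->credits < best->credits)
                        continue;
                    /* all equal: prefer the lower sequence number */
                    if (ni->credits == best->credits && best->seq <= ni->seq)
                        continue;
                }
            }
        }
        best = ni;
        best_distance = distance;
    }
    return best;
}
```

On a resend, the caller would exclude the interface that just failed, or lower its health value, before calling such a function again.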
...
A remote interface can be considered problematic under multiple scenarios:
- Address is wrong
- Route can not be determined
- Connection can not be established
- Connection was rejected due to incompatible parameters
Desired Behavior
When a remote interface fails the following actions take place:
- The remote interface health is updated
- Failure statistics incremented
- A resend is issued on a different remote interface if there is one available.
- If no other remote interface is present then the send fails.
There are several ways a remote interface can recover:
- The remote interface is retried as a destination because there is no alternative available, and no error results.
- The remote interface is periodically probed by a helper thread. An interesting wrinkle is that there is no reason to probe the remote interface unless there is traffic flowing to the peer through other paths. (LNet doesn't care about the state of a remote node that the local node isn't talking to anyway.)
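The probing wrinkle can be sketched as the check a helper thread might run on each pass. Everything here is hypothetical: the peer_ni_probe structure, peer_ni_probe_due(), and the probe_interval and traffic_window parameters are illustration only and do not correspond to existing LNet tunables.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-peer-interface probe state; times are in seconds. */
struct peer_ni_probe {
    bool healthy;    /* false while the remote interface is marked bad */
    long last_probe; /* time of the most recent probe */
};

/*
 * Decide whether the helper thread should probe this remote interface.
 * last_peer_traffic is the last time traffic flowed to the peer over
 * any other path.
 */
bool peer_ni_probe_due(const struct peer_ni_probe *ni, long last_peer_traffic,
                       long now, long probe_interval, long traffic_window)
{
    assert(ni != NULL && probe_interval > 0 && traffic_window > 0);

    if (ni->healthy)
        return false; /* nothing to recover */
    if (now - last_peer_traffic > traffic_window)
        return false; /* not talking to this peer, so don't care */
    return now - ni->last_probe >= probe_interval;
}
```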
Implementation Specifics
In all these cases a different peer_ni should be tried if one exists. lnet_select_pathway() already takes src_nid as a parameter. When resending due to one of these failures src_nid will be set to the src_nid in the message that is being resent.
...
Upper layers request from LNet to send a GET or a PUT via the LNetGet() and LNetPut() APIs. LNet then calls into the LND to complete the operation. The LND encapsulates the LNet message into an LND specific message with its own message type. For example in the o2iblnd it is kib_msg_t.
When the LND transmits the LND message it sets a tx_deadline for that particular transmit. This tx_deadline remains active until the remote has confirmed receipt of the message. Receipt at the remote means that the LND has informed LNet, via lnet_parse(), that a message has been received, and that LNet has then called back into the LND layer to receive the message.
Therefore if a tx_deadline is hit, it is safe to assume that the remote end has not received the message. The reasons are described further below.
By handling the tx_deadline properly we are able to account for almost all next-hop failures. LNet would've done its best to ensure that a message has arrived at the immediate next hop.
The tx_deadline is LND-specific, and derived from the timeout (or sock_timeout) module parameter of the LND.
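The life cycle of a tx_deadline can be sketched as follows. This is a simplified model and not LND code: the tx structure and the tx_start(), tx_confirm() and tx_deadline_hit() helpers are hypothetical, and lnd_timeout stands for whichever module parameter the LND derives its deadline from.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified transmit descriptor; times are in seconds. */
struct tx {
    long deadline;  /* absolute time by which receipt must be confirmed */
    bool confirmed; /* set once the remote has confirmed receipt */
};

/* Arm the deadline when the LND message is transmitted. */
void tx_start(struct tx *tx, long now, long lnd_timeout)
{
    assert(tx != NULL && lnd_timeout > 0);
    tx->deadline = now + lnd_timeout;
    tx->confirmed = false;
}

/* The remote confirmed receipt: the deadline is no longer active. */
void tx_confirm(struct tx *tx)
{
    assert(tx != NULL);
    tx->confirmed = true;
}

/* True if the deadline passed without confirmation from the remote. */
bool tx_deadline_hit(const struct tx *tx, long now)
{
    assert(tx != NULL);
    return !tx->confirmed && now >= tx->deadline;
}
```

When tx_deadline_hit() returns true the message can be treated as not received by the next hop, which makes it a candidate for a resend over another path.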
LNet Detected Timeouts
As mentioned above, at the LNet layer an LNET_MSG_PUT can be told to expect an LNET_MSG_ACK to confirm that the LNET_MSG_PUT has been processed by the destination. Similarly, an LNET_MSG_GET expects an LNET_MSG_REPLY to confirm that the LNET_MSG_GET has been successfully processed by the destination.
...