Table of Contents |
---|
Adding Resiliency to LNet
Introduction
By design LNet is a lossy connectionless network: there are cases where messages can be dropped without the sender being notified. Here we explore the possibilities of making LNet more resilient, including having it retransmit messages over alternate routes. What can be done in this area is constrained by the design of LNet.
Put, Ack, Get, Reply
Within LNet there are three cases of interest: Put, Put+Ack, and Get+Reply.
- Put: The simplest message type is a bare Put message. This message is sent on the wire, and after that no further response is expected by LNet. In terms of error handling, this means that a failure to send can result in an error, but any kind of failure after the message has been sent will result in the message being dropped without notification. There is nothing LNet can do in the latter case, it will be up to the higher layers that use LNet to detect what is going on and take whatever corrective action is required.
- Put+Ack: The sender of a Put can ask from an Ack from the receiver. The Ack message is generated by LNet on the receiving node (the peer), and this is done as soon as the Put has been received. In contrast to a bare put this means LNet on the sender can track whether an Ack arrives, and if it does not promptly arrive it can take some action. Such an action can take two forms: inform the upper layers that the Ack took too long to arrive, or retry within LNet itself.
- Get+Reply: With a Get message, there is always a Reply from the received. Prior to the Get, the sender and receiver arrange an agreement on the MD that the data for the Reply must be obtained from, so LNet on the receiving node can generate the Reply message as soon as the Get has been received. Failure detection and handling is similar to the Put+Ack case.
The protocols used to implement LNet tend to be connection-oriented, and implement some kind of handshake or ack protocol that tells the sender that a message has been received. As long as the LND actually reports errors to LNet (not a given, alas) this means that in practice the sender of a message can reliably determine whether the message was successfully sent to another node. When the destination node is on the same LNet network, this is sufficient to enable LNet itself to detect failures even in the bare Put case. But in a routed configuration this only guarantees that the router received the message, and if the router then fails to forward it, a bare Put will be lost without trace.
PtlRPC
Users of LNet that send bare Put messages must implement their own methods to detect whether a message was lost. The general rule is simple: the recipient of a Put is supposed react somehow, and if the reaction doesn't happen within a set amount of time, the sender assumes that either the Put was lost, or the recipient is in some other kind of trouble.
For our purposes PtlRPC is of interest. PtlRPC messages can be classified as Request+Response pairs. Both a Request and a Response are built from one or more Get or Put messages. A node that sends a PtlRPC Request requires the receiver to send a Response within a set amount of time, and failing this the Request times out and PtlRPC takes corrective action.
Adaptive timeouts add an interesting wrinkle to this mechanism: they allow the recipient of a Request to tell the sender to "please wait", informing it that the recipient is alive and working but not able to send the Response before the normal timeout. For LNet the interesting implication is that while this is going on, there will be some traffic between the sender and recipient. This traffic may also be in the form of out-of-band information invisible to LNet.
LNet Interfaces
The interfaces that LNet provides to the upper layers should work as follows. Set up an MD (Memory Descriptor) to send data from (for a Put) or receive data into (for a Get). An event handler is associated with the MD. Then call LNetGet() or LNetPut() as appropriate.
LNetGet()
If all goes well, the event handler sees two events: LNET_EVENT_SEND
to indicate the Get message was sent, and LNET_EVENT_REPLY
to indicate the Reply message was received. Note that the send event can happen after the reply event (this is actually the typical case).
If sending the Get message failed, LNET_EVENT_SEND
will include an error status, no LNET_EVENT_REPLY
will happen, and clean up must done accordingly. If the return value of LNetGet() indicates an error then sending the message certainly failed, but a 0 return does not imply success, only that no failure has yet been encountered.
A damaged Reply message will be dropped, and does not result in an LNET_EVENT_REPLY
. Effectively the only way for LNET_EVENT_REPLY
to have an error status is if LNet detects a timeout before the Reply is received.
LNetPut() LNET_ACK_REQ
A Put with an Ack is similar to a Get+Reply pair. The events in this case are LNET_EVENT_SEND
and LNET_EVENT_ACK
.
For a Put, the LNET_EVENT_SEND
indicates that the MD is no longer used by the LNet code and the caller is free do discard or re-use it.
As with send, LNET_EVENT_ACK
is expected to only carry an error indication if there was a timeout before the Ack was received.
LNetPut() LNET_NOACK_REQ
A Put without an Ack will only generate an LNET_EVENT_SEND
, which indicates that the MD can now be used.
Possible Failures
There are a number of failures we can encounter, only some of which LNet may address.
- Local interface failure: there is some issue with the local interface that prevents it from sending (or receiving) messages.
- Remote interface failure: there is some issue with the remote interface that prevents it from receiving (or sending) messages.
- Path failure: the local interface is working properly, but messages never reach the remote interface.
- Remote node failure: the remote node is properly receiving messages but unresponsive for other reasons.
In a routed LNet configuration these scenarios apply to each hop.
These failures will show up in a number of ways:
- Local interface reports failure. This can include the interface itself being healthy but it noting that the cable connecting it to a switch, or the switch port, is somehow not working right.
- Remote interface not reachable. A remote interface that should be reachable from the local interface cannot be reached. This can be in the form of a "fast" error, or of a timeout in the LND-level protocol.
- Remote interfaces on a net not reachable. The local interface appears to be OK, but there are several remote interfaces it cannot talk to.
- All remote interfaces on a net not reachable. The local interface appears to be OK, but cannot talk to any remote interface.
- All remote interfaces of a node not reachable. All LNDs report errors when talking to a specific remote node, but have no problem talking to other nodes.
- Put+Ack or Get+Reply timeout. The LND gives no failure indication, but the Ack or Reply takes too long to arrive.
- Dropped Put. Everything appears to work, except it doesn't.
Let's take a look at what LNet can do in each of these cases.
Local Interface Reports Failure
This is the easiest case to work with. The LND can report such a failure to LNet, and LNet then refrains from using the local interface for any traffic.
LNet can mark the interface down, and depending on the capabilies of the LND either recheck periodically or wait for the LND to mark the interface up.
Remote Interface Not Reachable
The remote interface cannot be reached from the local interface, but the local interface can talk to other nodes. If the remote interface can be reached from other local interfaces then we're dealing with some path failure. Otherwise the remote interface may be bad. If there is only a single local interface that can talk to the remote interface, then we cannot distinguish between these cases.
LNet can mark this particular local/remote interface combination as something to be avoided.
When there are paths more than one local interface to the remote interface, and none of these work, but other remote interfaces do work, then LNet can mark the remote interface as bad. Recovery could be done by periodically probing the remote interface, maybe using LNet Ping as a poor-man's equivalent of an LNet Control Packet.
Remote interfaces On A Net Not Reachable
Several remote interfaces on a net cannot be reached from a local interface, but it can talk to other nodes. This is a more severe variant of the previous case.
All Remote Interfaces On A Net Not Reachable
All remote interfaces on a net cannot be reached from a local interface. If there are other, working, interfaces connected to the same net then the balance of probability shifts to the local interface being bad, or there is a severe problem with the fabric.
In practice LNet will not detect "all" remote interfaces being down. But it can detect that for a period of time, no traffic was successfully sent from a local interface, and therefore start avoiding that interface as a whole. Recovery would involve periodically probing the interface, maybe using LNet Ping.
All Remote Interfaces Of A Node Not Reachable
The node is likely down. There is little LNet can do here, this is a problem to be handled by upper layers. This includes indicating when LNet should attempt to reconnect.
LNet might treat this as the "remote interface not reachable" case for all the interfaces of the remote node. That is, without much difference due to apparently all interfaces of the remote node being down, except for a log message indicating this.
Put+Ack Or Get+Reply Timeout
This is the case where the LND does not signal any problem, so the Ack for a Put or Reply for a Get should arrive promptly, with the only delays due to credit-based throttling, and yet it does not do so. Note that this assumes that were possible the LND layer already implements reasonably tight timeouts, so that LNet can assume the problem is somewhere else.
LNet can impose a "reply timeout", and retry over a different path if there is one available. However, if the assumption about the LND is valid, then the implication is that the node is in trouble. So an alternative is to force the upper layers to cope.
One argument for nevertheless implementing this facility in LNet is that it means the upper layers to have to re-invent and re-implement this wheel time and again.
Dropped Put
No problem was signalled by the LND, and there is no Ack that we could time out waiting for. LNet does not have enough information to do anything, so the upper layers must do so instead.
If this case must be made tractable, LNet can be changed to make the Ack non-optional.
Should LNet Resend Messages
When there are multiple paths available for a message, it makes sense to try to resend it on failure. But where should the resending logic be implemented?
The easiest path is to tell upper layers to resend. For example, PtlRPC has some related logic already. Except that PtlRPC detects a failure, it disconnects, reconnects, and triggers recovery. The type of resending logic desired is to "just try another path" which differs from what exists today and needs to be implemented for each user.
The alternative then is to have LNet resend a message. There should be some limit to the number of retries, and a limit to the amount of time spent retrying. Otherwise we are requiring the upper layers to implement a timer on each LNetGet() and LNetPut() call to guarantee progress. This introduces an LNet "retry timeout" (as opposed to the "reply timeout" discussed above) as the amount of time LNet after which LNet gives up.
In terms of timeouts, this then gives us the following relationships, from shortest to longest:
- LND Timeout: LND declares that a message won't arrive.
- IB timeout is (default?) slightly less than 4 seconds
- LNet Reply Timeout: LNet declares an Ack/Reply won't arrive. > 2 * LND Timeout * max hops
- Depends on the route!
- LNet Retry Timeout: LNet gives up on retries. > LNet Reply Timeout * max LNet retries
- Depends on the route!
It is not completely obvious how this scheme interacts with the timeout
parameter (a number of timeouts appear to be derived from that, but I think it isn't the LND timeout), or the peer_timeout
per LND module parameter.
O2IBLND
Overview
There are two types of events to account for:
...
Each hop's LNet will do a best effort in getting the message to the following hop. Unfortunately, there is no feedback mechanism from a router to the originator to inform the originator that a message has failed to send, but I believe this is unnecessary and will probably increase the complexity of the code and the system in general. Rule of thumb should be that each hop only worries about the immediate next hop.
SOCKLND
TBD
LNet Health Version 2.0
TBD