Adding Resiliency to LNet
Introduction
By design LNet is a lossy connectionless network: there are cases where messages can be dropped without the sender being notified. Here we explore the possibilities of making LNet more resilient, including having it retransmit messages over alternate routes. What can be done in this area is constrained by the design of LNet.
In the following discussion, node will often be the shorthand for local node, while peer will be shorthand for peer node or remote node.
Put, Ack, Get, Reply
Within LNet there are three cases of interest: Put, Put+Ack, and Get+Reply.
...
The protocols used to implement LNet tend to be connection-oriented, and implement some kind of handshake or ack protocol that tells the sender that a message has been received. As long as the LND actually reports errors to LNet (not a given, alas) this means that in practice the sender of a message can reliably determine whether the message was successfully sent to another node. When the destination node is on the same LNet network, this is sufficient to enable LNet itself to detect failures even in the bare Put case. But in a routed configuration this only guarantees that the LNet router received the message, and if the LNet router then fails to forward it, a bare Put will be lost without trace.
PtlRPC
Users of LNet that send bare Put messages must implement their own methods to detect whether a message was lost. The general rule is simple: the recipient of a Put is supposed to react somehow, and if the reaction doesn't happen within a set amount of time, the sender assumes that either the Put was lost, or the recipient is in some other kind of trouble.
...
Adaptive timeouts add an interesting wrinkle to this mechanism: they allow the recipient of a Request to tell the sender to "please wait", informing it that the recipient is alive and working but not able to send the Response before the normal timeout. For LNet the interesting implication is that while this is going on, there will be some traffic between the sender and recipient. However, this traffic may also be in the form of out-of-band information invisible to LNet.
LNet Interfaces
The interfaces that LNet provides to the upper layers should work as follows. Set up an MD (Memory Descriptor) to send data from (for a Put) or receive data into (for a Get). An event handler is associated with the MD. Then call LNetGet() or LNetPut() as appropriate.
LNetGet()
If all goes well, the event handler sees two events: LNET_EVENT_SEND to indicate the Get message was sent, and LNET_EVENT_REPLY to indicate the Reply message was received. Note that the send event can happen after the reply event (this is actually the typical case).
...
A damaged Reply message will be dropped, and does not result in an LNET_EVENT_REPLY. Effectively the only way for LNET_EVENT_REPLY to have an error status is if LNet detects a timeout before the Reply is received.
LNetPut() LNET_ACK_REQ
The caller of LNetPut() requests an Ack by using LNET_ACK_REQ as the value of the ack parameter.
...
As with send, LNET_EVENT_ACK is expected to only carry an error indication if there was a timeout before the Ack was received.
LNetPut() LNET_NOACK_REQ
The caller of LNetPut() requests no Ack by using LNET_NOACK_REQ as the value of the ack parameter.
A Put without an Ack will only generate an LNET_EVENT_SEND, which indicates that the MD can now be re-used or discarded.
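The event sequences described for LNetGet() and LNetPut() can be summarised in a small sketch. The names used here (sk_op, sk_md_state, SK_EVENT_*) are hypothetical stand-ins, not the LNet API; the sketch only illustrates that an MD can be considered idle once every expected event has been seen, whatever order the events arrive in.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the LNet event types discussed above. */
enum sk_event {
	SK_EVENT_SEND  = 1 << 0,
	SK_EVENT_ACK   = 1 << 1,
	SK_EVENT_REPLY = 1 << 2,
};

enum sk_op { SK_OP_PUT_NOACK, SK_OP_PUT_ACK, SK_OP_GET };

struct sk_md_state {
	unsigned expected;	/* events that must arrive before the MD is idle */
	unsigned seen;		/* events delivered so far */
	int status;		/* first non-zero status reported, else 0 */
};

/* Record which events the requested operation is expected to generate. */
static void sk_md_init(struct sk_md_state *st, enum sk_op op)
{
	assert(st != NULL);
	st->seen = 0;
	st->status = 0;
	st->expected = SK_EVENT_SEND;
	if (op == SK_OP_PUT_ACK)
		st->expected |= SK_EVENT_ACK;
	else if (op == SK_OP_GET)
		st->expected |= SK_EVENT_REPLY;
}

/* Event handler: returns true once all expected events were seen, in any order. */
static bool sk_md_event(struct sk_md_state *st, enum sk_event ev, int status)
{
	assert(st != NULL);
	st->seen |= (unsigned)ev;
	if (status != 0 && st->status == 0)
		st->status = status;
	return (st->seen & st->expected) == st->expected;
}
```

For a Get, the handler reports completion only after both the send and the reply event were delivered, even when the reply event arrives first.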
Possible Failures
There are a number of failures we can encounter, only some of which LNet may address.
...
Let's take a look at what LNet on the node can do in each of these cases.
Node Interface Reports Failure
This is the easiest case to work with. The LND can report such a failure to LNet, and LNet then refrains from using this interface for any traffic.
LNet can mark the interface down, and depending on the capabilities of the LND either recheck periodically or wait for the LND to mark the interface up.
Peer Interface Not Reachable
The peer interface cannot be reached from the node interface, but the node interface can talk to other peers. If the peer interface can be reached from other node interfaces then we're dealing with some path failure. Otherwise the peer interface may be bad. If there is only a single node interface that can talk to the peer interface, then the node cannot distinguish between these cases.
...
When there are paths from more than one node interface to the peer interface, and none of these work, but other peer interfaces do work, then LNet can mark the peer interface as bad. Recovery could be done by periodically probing the peer interface, maybe using LNet Ping as a poor-man's equivalent of an LNet Control Packet.
Some Peer Interfaces On A Net Not Reachable
Several peer interfaces on a net cannot be reached from a node interface, but the same node interface can talk to other peers. This is a more severe variant of the previous case.
All Peer Interfaces On A Net Not Reachable
All remote interfaces on a net cannot be reached from a local interface. If there are other, working, interfaces connected to the same net then the balance of probability shifts to the local interface being bad, or there is a severe problem with the fabric.
In practice LNet will not detect "all" remote interfaces being down. But it can detect that for a period of time, no traffic was successfully sent from a local interface, and therefore start avoiding that interface as a whole. Recovery would involve periodically probing the interface, maybe using LNet Ping.
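A minimal sketch of this "no successful traffic for a period of time" check follows. The structure, the function names and the use of plain seconds are assumptions made for illustration; the text does not specify how LNet would track this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-interface bookkeeping; times are in seconds. */
struct sk_ni {
	long last_ok;		/* time of the last successful send */
	long last_attempt;	/* time of the last send attempt */
	bool avoided;		/* true while the interface is being avoided */
};

/* Record the outcome of a send (or probe) attempted at time now. */
static void sk_ni_sent(struct sk_ni *ni, long now, bool ok)
{
	assert(ni != NULL);
	ni->last_attempt = now;
	if (ok) {
		ni->last_ok = now;
		ni->avoided = false;	/* a successful send or probe ends avoidance */
	}
}

/*
 * Start avoiding the interface if sends were attempted within the last
 * window seconds but none has succeeded for longer than the window.
 */
static bool sk_ni_check(struct sk_ni *ni, long now, long window)
{
	assert(ni != NULL);
	bool tried_recently = now - ni->last_attempt <= window;
	bool no_success = now - ni->last_ok > window;

	if (tried_recently && no_success)
		ni->avoided = true;
	return ni->avoided;
}
```

An idle interface is not penalised in this sketch, because avoidance only starts when there were recent attempts; a successful periodic probe, for example an LNet Ping, would be fed back through sk_ni_sent() to end the avoidance.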
All Interfaces Of A Peer Not Reachable
The node is likely down. There is little LNet can do here; this is a problem to be handled by upper layers. This includes indicating when LNet should attempt to reconnect.
LNet might treat this as the "remote interface not reachable" case for all the interfaces of the remote node. That is, without much difference due to apparently all interfaces of the remote node being down, except for a log message indicating this.
Put+Ack Or Get+Reply Timeout
This is the case where the LND does not signal any problem, so the Ack for a Put or Reply for a Get should arrive promptly, with the only delays due to credit-based throttling, and yet it does not do so. Note that this assumes that where possible the LND layer already implements reasonably tight timeouts, so that LNet can assume the problem is somewhere else.
...
One argument for nevertheless implementing this facility in LNet is that it spares the upper layers from having to re-invent and re-implement this wheel time and again.
Dropped Put
No problem was signalled by the LND, and there is no Ack that we could time out waiting for. LNet does not have enough information to do anything, so the upper layers must do so instead.
If this case must be made tractable, LNet can be changed to make the Ack non-optional.
Should LNet Resend Messages
When there are multiple paths available for a message, it makes sense to try to resend it on failure. But where should the resending logic be implemented?
...
It is not completely obvious how this scheme interacts with Lustre's timeout parameter (the Lustre RPC timeout, from which a number of timeouts are derived), but at first glance it seems that at least peer_timeout < Lustre timeout.
LNet Health Version 2.0
There are three types of failures that LNet needs to deal with:
- Local Interface failure
- Remote Interface failure
- Timeouts
  - LND detected Timeout
  - LNet detected Timeout
Local Interface Failure
Local interface failures will be detected in one of two ways:
- Synchronously as a return failure to the call to lnd_send()
- Asynchronously as an event that could be detected at a later point.
  - These asynchronous events can be the result of a send operation.
  - They can also be independent of send operations, as failures are detected with the underlying device, for example a "link down" event.
Desired Behavior
When a local interface fails, the following actions should take place:
...
- The local interface might signal via the LND that it is well again ("link up").
- The interface is retried because there are no alternative interfaces available; it is the only one, or all have been marked bad.
- A helper thread periodically retries the local interface; if there are alternative interfaces available this will ensure that the interface will be used again if it has recovered, without risking an interruption of the traffic flowing over the working interfaces.
Implementation Specifics
Events that are triggered asynchronously, without initiating a message, such as port down, port up, rdma device removed, shall be handled via a new LNet/LND API that shall be added.
...
```
lnet_ni
lnet_get_best_ni(local_net, cur_ni, md_cpt)
{
	local_net = get_local_net(peer_net)
	for each ni in local_net {
		health_value = lnet_local_ni_health(ni)
		/* select the best health value */
		if (health_value < best_health_value)
			continue
		distance = get_distance(md_cpt, dev_cpt)
		/* select the shortest distance to the MD */
		if (distance < lnet_numa_range)
			distance = lnet_numa_range
		if (distance > shortest_distance)
			continue
		else if (distance < shortest_distance)
			shortest_distance = distance
		/* select based on the most available credits */
		else if (ni_credits < best_credits)
			continue
		/* if all is equal select based on round robin */
		else if (ni_credits == best_credits)
			if (best_ni->ni_seq <= ni->ni_seq)
				continue
		/* ni is the best candidate seen so far */
		best_ni = ni
		best_health_value = health_value
		best_credits = ni_credits
	}
	return best_ni
}

/*
 * lnet_select_pathway() will be modified to add a peer_nid parameter.
 * This parameter indicates that the peer_ni is predetermined, and is
 * identified by the NID provided. The peer_nid parameter is the
 * next-hop NID, which can be the final destination or the next-hop
 * router. If that peer_NID is not healthy then another peer_NID is
 * selected as per the current algorithm. This will force the
 * algorithm to prefer the peer_ni which was selected in the initial
 * message sending. The peer_ni NID is stored in the message. This
 * new parameter extends the concept of the src_nid, which is provided
 * to lnet_select_pathway() to inform it that the local NI is
 * predetermined.
 */

/* on resend */
enum lnet_error_type {
	LNET_LOCAL_NI_DOWN,		/* don't use this NI until you get an UP */
	LNET_LOCAL_NI_UP,		/* start using this NI */
	LNET_LOCAL_NI_SEND_TIMEOUT,	/* demerit this NI so it's not selected
					 * immediately, provided there are other
					 * healthy interfaces */
	LNET_PEER_NI_NO_LISTENER,	/* there is no remote listener. demerit
					 * the peer_ni and try another NI */
	LNET_PEER_NI_ADDR_ERROR,	/* The address for the peer_ni is wrong.
					 * Don't use this peer_NI */
	LNET_PEER_NI_UNREACHABLE,	/* temporarily don't use the peer NI */
	LNET_PEER_NI_CONNECT_ERROR,	/* temporarily don't use the peer NI */
	LNET_PEER_NI_CONNECTION_REJECTED /* temporarily don't use the peer NI */
};

static int
lnet_handle_send_failure_locked(msg, local_nid, status)
{
	switch (status) {
	/*
	 * LNET_LOCAL_NI_DOWN can be received without a message being sent.
	 * In this case msg == NULL and it is sufficient to update the health
	 * of the local NI
	 */
	case LNET_LOCAL_NI_DOWN:
		LASSERT(!msg);
		/* msg is NULL here, so look up the NI by the local_nid argument */
		local_ni = lnet_get_local_ni(local_nid)
		if (!local_ni)
			return
		/* flag local NI down */
		lnet_set_local_ni_health(DOWN)
		break;
	case LNET_LOCAL_NI_UP:
		LASSERT(!msg);
		local_ni = lnet_get_local_ni(local_nid)
		if (!local_ni)
			return
		/* flag local NI up */
		lnet_set_local_ni_health(UP)
		/* This NI will be a candidate for selection in the next message send */
		break;
	...
	}
}

static int
lnet_complete_msg_locked(msg, cpt)
{
	status = msg->msg_ev.status
	if (status != 0)
		rc = lnet_handle_send_failure_locked(msg, msg->local_nid, status)
		if (rc == 0)
			return
	/* continue as currently done */
}
```
Remote Interface Failure
A remote interface can be considered problematic under multiple scenarios:
- Address is wrong
- Route can not be determined
- Connection can not be established
- Connection was rejected due to incompatible parameters
Desired Behavior
When a remote interface fails the following actions take place:
...
- The remote interface is retried as a destination because there is no alternative available, and no error results.
- The remote interface is periodically probed by a helper thread. An interesting wrinkle is that there is no reason to probe the remote interface unless there is traffic flowing to the peer through other paths. (LNet doesn't care about the state of a remote node that the local node isn't talking to anyway.)
Implementation Specifics
In all these cases a different peer_ni should be tried if one exists. lnet_select_pathway() already takes src_nid as a parameter. When resending due to one of these failures src_nid will be set to the src_nid in the message that is being resent.
```
static int
lnet_handle_send_failure_locked(msg, local_nid, status)
{
	switch (status) {
	...
	case LNET_PEER_NI_ADDR_ERROR:
		lpni->stats.stat_addr_err++
		goto peer_ni_resend
	case LNET_PEER_NI_UNREACHABLE:
		lpni->stats.stat_unreacheable++
		goto peer_ni_resend
	case LNET_PEER_NI_CONNECT_ERROR:
		lpni->stats.stat_connect_err++
		goto peer_ni_resend
	case LNET_PEER_NI_CONNECTION_REJECTED:
		lpni->stats.stat_connect_rej++
		goto peer_ni_resend
	default:
		/* unexpected failure. failing message */
		return
	}

peer_ni_resend:
	lnet_send(msg, src_nid)
}
```
Timeouts
LND Detected Timeouts
Upper layers request from LNet to send a GET or a PUT via the LNetGet() and LNetPut() APIs. LNet then calls into the LND to complete the operation. The LND encapsulates the LNet message into an LND specific message with its own message type. For example in the o2iblnd it is kib_msg_t.
...
The tx_deadline is LND-specific, and derived from the timeout (or sock_timeout) module parameter of the LND.
LNet Detected Timeouts
As mentioned above at the LNet layer LNET_MSG_PUT can be told to expect LNET_MSG_ACK to confirm that the LNET_MSG_PUT has been processed by the destination. Similarly LNET_MSG_GET expects an LNET_MSG_REPLY to confirm that the LNET_MSG_GET has been successfully processed by the destination.
...
This approach would add the LNet resiliency required and avoid the many corner cases that will need to be addressed when receiving messages which have already been processed.
Resiliency vs. Reliability
There are two concepts that need to stay separate: reliability of RPC messages and LNet resiliency. This feature attempts to add LNet resiliency against local and immediate next hop interface failure. End-to-end reliability means ensuring that upper layer messages, namely RPC messages, are received and processed by the final destination, and taking appropriate action in case this does not happen. End-to-end reliability is the responsibility of the application that uses LNet, in this case ptlrpc. Ptlrpc already has a mechanism to ensure this.
...
Upper layers should ensure that the transactions they initiate complete successfully, and take appropriate action otherwise.
Reasons for timeout
The discussion here refers to the LND Transmit timeout.
...
Each of these scenarios can be handled differently.
Desired Behavior
The desired behavior is listed for each of the above scenarios:
Scenario 1 - Message not posted
- Connection is closed
- The local interface health is updated
- Failure statistics incremented
- A resend is issued on a different local interface if there is one available.
- If no other local interface is present, or all are in failed mode, then the send fails.
Scenario 2 - Transmit not completed
- Connection is closed
- The local and remote interface health is updated
- Failure statistics incremented on both local and remote
- A resend is issued on a different path altogether if there is one available.
- If no other path is present then the send fails.
Scenario 3 - No acknowledgement by remote
- Connection is closed
- The remote interface health is updated
- Failure statistics incremented
- A resend is issued on a different remote interface if there is one available.
- If no other remote interface is present then the send fails.
Note that the behavior outlined is consistent with the explicit error cases identified in the previous section. Only Scenario 2 diverges, as a different path is selected altogether, but the same code structure is still used.
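The three scenarios can be condensed into one decision function, sketched here. The type and function names are hypothetical, and reading "a different path altogether" as "change both the local and the remote interface" is an assumption of this sketch rather than something the design states.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the three LND transmit timeout scenarios. */
enum sk_tx_scenario {
	SK_TX_NOT_POSTED = 1,	/* Scenario 1 - Message not posted */
	SK_TX_NOT_COMPLETED = 2,	/* Scenario 2 - Transmit not completed */
	SK_TX_NO_REMOTE_ACK = 3,	/* Scenario 3 - No acknowledgement by remote */
};

struct sk_tx_action {
	bool close_conn;	/* close the connection */
	bool demerit_local;	/* update local health and statistics */
	bool demerit_remote;	/* update remote health and statistics */
	bool change_local;	/* resend from a different local interface */
	bool change_remote;	/* resend to a different remote interface */
};

/* Map a timeout scenario to the actions listed for it above. */
static struct sk_tx_action sk_tx_timeout_action(enum sk_tx_scenario s)
{
	struct sk_tx_action a = { .close_conn = true };

	assert(s >= SK_TX_NOT_POSTED && s <= SK_TX_NO_REMOTE_ACK);
	switch (s) {
	case SK_TX_NOT_POSTED:
		a.demerit_local = true;
		a.change_local = true;
		break;
	case SK_TX_NOT_COMPLETED:
		a.demerit_local = a.demerit_remote = true;
		a.change_local = a.change_remote = true;
		break;
	case SK_TX_NO_REMOTE_ACK:
		a.demerit_remote = true;
		a.change_remote = true;
		break;
	}
	return a;
}
```

In all three scenarios the send fails if the interface that would have to change has no healthy alternative.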
Implementation Specifics
All of these cases should end up calling the lnet_finalize() API with the proper return code. lnet_finalize() will be the funnel where all these events shall be processed in a consistent manner. When the message is completed via lnet_complete_msg_locked(), the error is checked and the proper behavior as described above is executed.
Peer_timeout
In the cases when a GET or a PUT transaction is initiated an associated deadline needs to be tagged to the corresponding transaction. This deadline indicates how long LNet should wait for a REPLY or an ACK before it times out the entire transaction.
...
OW: How is this deadline determined? Naming this section peer_timeout suggests you want to use that? Conceptually we can distinguish between an LNet transaction timeout and an LNet peer timeout.
Resend Window
Resends are terminated when the peer_timeout for a message expires.
...
Alexey Lyashkov made a presentation at LAD 16 that outlines the best values for all Lustre timeouts. It can be accessed here.
Locking
The MD is always protected by the lnet_res_lock, which is CPT specific.
...
The MD should be kept intact during the resend procedure. If there is a failure to resend then the MD should be released and message memory freed.
Selection Algorithm with Health
Algorithm Parameters
Parameter | Values | |
---|---|---|
SRC NID | Specified (A) | Not specified (B) |
DST NID | local (1) | not local (2) |
DST NID | MR (C) | NMR (D) |
...
Note: When sending to a router that scenario boils down to considering the router as the next-hop peer. The final destination peer NIs are no longer considered in the selection. The next-hop can then be MR or non-MR and the code will deal with it accordingly.
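The scenario labels used in the following subsections (A1C, B2D and so on) are simply the concatenation of the three parameter codes from the table. A small helper, with a hypothetical name and signature, makes the encoding explicit; for a routed send the MR flag describes the next-hop router, per the note above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Build the three-character scenario label, for example "A1C".
 * buf must have room for four bytes including the terminator.
 */
static void sk_scenario_label(bool src_specified, bool dst_local,
			      bool next_hop_mr, char buf[4])
{
	assert(buf != NULL);
	buf[0] = src_specified ? 'A' : 'B';	/* SRC NID specified or not */
	buf[1] = dst_local ? '1' : '2';		/* DST NID local or routed */
	buf[2] = next_hop_mr ? 'C' : 'D';	/* next hop MR or non-MR */
	buf[3] = '\0';
}
```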
A1C - src specified, local dst, mr dst
- find the local ni given src_nid
- if no local ni found fail
- if local ni found is down, then fail
- find peer identified by the dst_nid
- select the best peer_ni for that peer
- take into account the health of the peer_ni. (If we just demerit the peer_ni it can still be the best of the bunch, so we need to keep track of the peer_nis/local_nis a message was sent over, so that we don't revisit the same ones again. This should be part of the message.)
- If this is a resend and the resend peer_ni is specified, then select this peer_ni if it is healthy, otherwise continue with the algorithm.
- if this is a resend, do not select the same peer_ni again unless no other peer_nis are available and that peer_ni is not in a HARD_ERROR state.
A2C - src specified, route to dst, mr dst
- find local ni given src_nid
- if no local ni found fail
- if local ni found is down, then fail
- find router to dst_nid
- If no router present then fail.
- find best peer_ni (for the router) to send to
- take into account the health of the peer_ni
- If this is a resend and the resend peer_ni is specified, then select this peer_ni if it is healthy, otherwise continue with the algorithm.
- If this is a resend and the peer_ni is not specified, do not select the same peer_ni again. The original destination NID can be found in the message.
- Keep trying to send to the peer_ni even if it has been used before, as long as it is not in a HARD_ERROR state.
A1D - src specified, local dst, nmr dst
- find local ni given src nid
- if no local_ni found fail
- if local ni found is down, then fail
- find peer_ni using dst_nid
- send to that peer_ni
- If this is a resend retry the send on the peer_ni unless that peer_ni is in a HARD_ERROR state, then fail.
A2D - src specified, route to dst, nmr dst
- find local_ni given the src_nid
- if no local_ni found fail
- if local ni found is down, then fail
- find router to go through to that peer_ni
- send to the NID of that router.
- If this is a resend retry the send on the peer_ni unless that peer_ni is in a HARD_ERROR state, then fail.
B1C - src any, local dst, mr dst
- select the best_ni to send from, by going through all the local_nis that can reach any of the networks the peer is on
- consider local_ni health in the selection by selecting the local_ni with the best health value.
- If this is a resend do not select a local_ni that has already been used.
- select the best_peer_ni that can be reached by the best_ni selected in the previous step
- If this is a resend and the resend peer_ni is specified, then select this peer_ni if it is healthy, otherwise continue with the algorithm.
- If this is a resend and the resend peer_ni is not specified, do not consider a peer_ni that has already been used for sending as long as there are other peer_nis available for selection. Loop around and re-use peer_nis in round robin.
- peer_nis that are selected cannot be in HARD_ERROR state.
- send the message over that path.
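The peer_ni part of the B1C steps can be sketched as follows. This is a simplified illustration with hypothetical types: it does not rank candidates by health value, and when every usable peer_ni has already been tried it re-uses the first one instead of implementing a true round robin.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum sk_state { SK_HEALTHY, SK_SOFT_ERROR, SK_HARD_ERROR };

/* Hypothetical view of a peer_ni for one message. */
struct sk_peer_ni {
	enum sk_state state;
	bool used;	/* already used for an earlier attempt of this message */
};

/*
 * Return the index of the peer_ni to send to, or -1 if none is usable.
 * preferred is the index of the resend peer_ni, or -1 if not specified.
 */
static int sk_select_peer_ni(const struct sk_peer_ni *nis, int count,
			     int preferred)
{
	int fallback = -1;

	assert(nis != NULL || count == 0);
	/* a specified resend peer_ni wins if it is healthy */
	if (preferred >= 0 && preferred < count &&
	    nis[preferred].state == SK_HEALTHY)
		return preferred;

	for (int i = 0; i < count; i++) {
		if (nis[i].state == SK_HARD_ERROR)
			continue;	/* never select HARD_ERROR peer_nis */
		if (!nis[i].used)
			return i;	/* first usable, not yet tried */
		if (fallback < 0)
			fallback = i;	/* remember a candidate for re-use */
	}
	return fallback;
}
```

The used flag corresponds to the per-message record of interfaces already tried, which the design suggests storing in the message itself.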
B2C - src any, route to dst, mr dst
- find the router that can reach the dst_nid
- find the peer for that router. (The peer is MR)
- go to B1C
B1D - src any, local dst, nmr dst
- find peer_ni using dst_nid (non-MR, so this is the only peer_ni candidate)
  - no issue if peer_ni is healthy
  - try this peer_ni even if it is unhealthy if this is the 1st attempt to send this message
  - fail if resending to an unhealthy peer_ni
- pick the preferred local_NI for this peer_ni if set
  - If the preferred local_NI is not healthy, fail sending the message and let the upper layers deal with recovery.
  - otherwise if preferred local_NI is not set, then pick a healthy local NI and make it the preferred NI for this peer_ni
- send over this path
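A minimal sketch of the B1D steps, assuming a hypothetical sk_nmr_peer structure that records the peer_ni health and the index of its preferred local NI:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical state for a non-MR peer, which has a single peer_ni. */
struct sk_nmr_peer {
	bool healthy;		/* health of the only peer_ni */
	int preferred_ni;	/* index of the preferred local NI, -1 if not set */
};

/*
 * Return the index of the local NI to send from, or -1 if the send
 * must fail and recovery is left to the upper layers.
 */
static int sk_b1d_select(struct sk_nmr_peer *peer, const bool *ni_healthy,
			 int ni_count, bool is_resend)
{
	assert(peer != NULL && ni_healthy != NULL);

	/* an unhealthy peer_ni is tolerated on the first attempt only */
	if (!peer->healthy && is_resend)
		return -1;

	if (peer->preferred_ni >= 0) {
		/* a preferred local NI is set: it must be healthy */
		if (peer->preferred_ni >= ni_count ||
		    !ni_healthy[peer->preferred_ni])
			return -1;
		return peer->preferred_ni;
	}

	/* no preference yet: pick a healthy local NI and remember it */
	for (int i = 0; i < ni_count; i++) {
		if (ni_healthy[i]) {
			peer->preferred_ni = i;
			return i;
		}
	}
	return -1;
}
```

Once a preferred local NI is recorded, the sketch never falls back to another local NI, matching the rule that a non-MR peer should keep seeing the same source interface.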
B2D - src any, route dst, nmr dst
- find route to dst_nid
- find peer_ni of router
  - no issue if peer_ni is healthy
  - try this peer_ni even if it is unhealthy if this is the 1st attempt to send this message
  - fail if resending to an unhealthy peer_ni
- pick the preferred local_NI for the dst_nid if set
  - If the preferred local_NI is not healthy, fail sending the message and let the upper layers deal with recovery.
  - otherwise if preferred local_NI is not set, then pick a healthy local NI and make it the preferred NI for this peer_ni
- send over this path
Resend Behavior
LNet will keep attempting to resend a message across different local/remote NIs as long as the interfaces are only in "soft" failure state. Interfaces are demerited when we fail to send over them due to a timeout. This is opposed to a hard failure which is reported by the underlying HW indicating that this interface can no longer be used for sending and receiving.
...
The interfaces which have soft failures will be demerited, so they will naturally be selected as a last option.
Work Items
- Health Value Maintenance/Demerit system
- Selection based on Health Value and not resending over already used interfaces unless none are available.
- Handling the new events in IBLND and passing them to LNet
- Handling the new events in SOCKLND and passing them to LNet
- Adding LNet level transaction timeout (or reuse the peer timeout) and cancelling a resend on timeout
- Handling timeout case in ptlrpc
Patches
- Add health values to local_ni
- Modify selection to make use of local_ni health values.
- Add explicit constraint in the selection to fail a re-send if no local_ni is in optimal health
- Handle explicit port down/up events
- Handle local interface failure on send and update health value then resend
- Add health values to peer_ni
- Add explicit constraint in the selection to fail a re-send if no remote_ni is in optimal health
- Handle remote interface failure on send and update health value then resend
- Modify selection to make use of peer_ni health values.
- Handle LND tx timeout due to being stuck on the queues for too long.
- Handle LND tx timeout due to remote rejection
- Handle LND tx timeout due to no tx completion
- Add an Event timeout towards upper layers (PTLRPC) when a transaction has failed to complete, i.e. when LNET_MSG_ACK or LNET_MSG_REPLY is not received.
- Handle the transaction timeout event in ptlrpc.
O2IBLND Detailed Discussion
Overview
There are two types of events to account for:
...
There is a group of events which indicate a fatal error
RDMA Device Events
Below are the events that could occur on the RDMA device. Highlighted in BOLD RED are the events that should be handled for health purposes.
- IB_EVENT_CQ_ERR
- IB_EVENT_QP_FATAL
- IB_EVENT_QP_REQ_ERR
- IB_EVENT_QP_ACCESS_ERR
- IB_EVENT_COMM_EST
- IB_EVENT_SQ_DRAINED
- IB_EVENT_PATH_MIG
- IB_EVENT_PATH_MIG_ERR
- IB_EVENT_DEVICE_FATAL
- IB_EVENT_PORT_ACTIVE
- IB_EVENT_PORT_ERR
- IB_EVENT_LID_CHANGE
- IB_EVENT_PKEY_CHANGE
- IB_EVENT_SM_CHANGE
- IB_EVENT_SRQ_ERR
- IB_EVENT_SRQ_LIMIT_REACHED
- IB_EVENT_QP_LAST_WQE_REACHED
- IB_EVENT_CLIENT_REREGISTER
- IB_EVENT_GID_CHANGE
Communication Events
Below are the events that could occur on a connection. Highlighted in BOLD RED are the events that should be handled for health purposes.
RDMA_CM_EVENT_ADDR_RESOLVED: Address resolution (rdma_resolve_addr) completed successfully.
RDMA_CM_EVENT_ADDR_ERROR: Address resolution (rdma_resolve_addr) failed.
RDMA_CM_EVENT_ROUTE_RESOLVED: Route resolution (rdma_resolve_route) completed successfully.
RDMA_CM_EVENT_ROUTE_ERROR: Route resolution (rdma_resolve_route) failed.
RDMA_CM_EVENT_CONNECT_REQUEST: Generated on the passive side to notify the user of a new connection request.
RDMA_CM_EVENT_CONNECT_RESPONSE: Generated on the active side to notify the user of a successful response to a connection request. It is only generated on rdma_cm_id's that do not have a QP associated with them.
RDMA_CM_EVENT_CONNECT_ERROR: Indicates that an error has occurred trying to establish a connection. May be generated on the active or passive side of a connection.
RDMA_CM_EVENT_UNREACHABLE: Generated on the active side to notify the user that the remote server is not reachable or unable to respond to a connection request. If this event is generated in response to a UD QP resolution request over InfiniBand, the event status field will contain an errno, if negative, or the status result carried in the IB CM SIDR REP message.
RDMA_CM_EVENT_REJECTED: Indicates that a connection request or response was rejected by the remote end point. The event status field will contain the transport specific reject reason if available. Under InfiniBand, this is the reject reason carried in the IB CM REJ message.
RDMA_CM_EVENT_ESTABLISHED: Indicates that a connection has been established with the remote end point.
RDMA_CM_EVENT_DISCONNECTED: The connection has been disconnected.
RDMA_CM_EVENT_DEVICE_REMOVAL: The local RDMA device associated with the rdma_cm_id has been removed. Upon receiving this event, the user must destroy the related rdma_cm_id.
RDMA_CM_EVENT_MULTICAST_JOIN: The multicast join operation (rdma_join_multicast) completed successfully.
RDMA_CM_EVENT_MULTICAST_ERROR: An error either occurred joining a multicast group, or, if the group had already been joined, on an existing group. The specified multicast group is no longer accessible and should be rejoined, if desired.
RDMA_CM_EVENT_ADDR_CHANGE: The network device associated with this ID through address resolution changed its HW address, e.g. following a bonding failover. This event can serve as a hint for applications that want the links used for their RDMA sessions to align with the network stack.
RDMA_CM_EVENT_TIMEWAIT_EXIT: The QP associated with a connection has exited its timewait state and is now ready to be re-used. After a QP has been disconnected, it is maintained in a timewait state to allow any in flight packets to exit the network. After the timewait state has completed, the rdma_cm will report this event.
Health Handling
Handling Asynchronous Events
- A callback mechanism should be provided by LNet to the LND to report failure events
- Some translation matrix from LND specific errors to LNet specific errors should be created
- Each LND would create the mapping
- Whenever an event occurs that indicates a fatal error on the device, the LNet callback should be called.
- LNet should transition the local NI or remote NI appropriately and take measures to close the connections on that specific device.
Handling Errors on Sends
- If a request to send a message ends in an error, for example a connection error (as seen with the wrong device responding to ARP), then LNet should pick another local device to send from.
- There is a class of errors which indicate a problem in the local NI
- RDMA_CM_EVENT_DEVICE_REMOVAL - This device is no longer present. Should never be used.
- There is a class of errors which indicate a problem in the remote NI
- RDMA_CM_EVENT_ADDR_ERROR - The remote address is erroneous. Should not be used.
- RDMA_CM_EVENT_ADDR_RESOLVED with an error. The remote address can not be resolved.
- RDMA_CM_EVENT_ROUTE_ERROR - No route to remote address. The peer_ni should not be used, but a retry can be done a bit later, after a timeout.
- RDMA_CM_EVENT_UNREACHABLE - Remote side is unreachable. Retry after a while.
- RDMA_CM_EVENT_CONNECT_ERROR - problem with connection. Retry after a while.
- RDMA_CM_EVENT_REJECTED - Remote side is rejecting connection. Retry after a while.
- RDMA_CM_EVENT_DISCONNECTED - Move outstanding operations to a different pair if available.
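The translation matrix from LND specific errors to LNet specific errors could take the form of a lookup table, sketched here. To stay self-contained the sketch defines its own stand-in enums (SK_CM_* mirroring a subset of the RDMA_CM_EVENT_* names, SK_* mirroring the proposed lnet_error_type values) instead of including the real headers. Mapping a route error to the "unreachable" error is an assumption, since the proposed enum has no dedicated route error.

```c
#include <assert.h>

/* Stand-ins for a subset of the RDMA CM events listed above. */
enum sk_cm_event {
	SK_CM_ADDR_ERROR,
	SK_CM_ROUTE_ERROR,
	SK_CM_CONNECT_ERROR,
	SK_CM_UNREACHABLE,
	SK_CM_REJECTED,
	SK_CM_ESTABLISHED,
	SK_CM_DEVICE_REMOVAL,
	SK_CM_EVENT_COUNT
};

/* Stand-ins for the proposed LNet error types; SK_NO_ERROR is hypothetical. */
enum sk_lnet_error {
	SK_NO_ERROR,
	SK_LOCAL_NI_DOWN,
	SK_PEER_NI_ADDR_ERROR,
	SK_PEER_NI_UNREACHABLE,
	SK_PEER_NI_CONNECT_ERROR,
	SK_PEER_NI_CONNECTION_REJECTED
};

/* Translation table owned by the LND; one entry per LND event. */
static const enum sk_lnet_error sk_cm_to_lnet[SK_CM_EVENT_COUNT] = {
	[SK_CM_ADDR_ERROR]     = SK_PEER_NI_ADDR_ERROR,
	[SK_CM_ROUTE_ERROR]    = SK_PEER_NI_UNREACHABLE,	/* assumption */
	[SK_CM_CONNECT_ERROR]  = SK_PEER_NI_CONNECT_ERROR,
	[SK_CM_UNREACHABLE]    = SK_PEER_NI_UNREACHABLE,
	[SK_CM_REJECTED]       = SK_PEER_NI_CONNECTION_REJECTED,
	[SK_CM_ESTABLISHED]    = SK_NO_ERROR,
	[SK_CM_DEVICE_REMOVAL] = SK_LOCAL_NI_DOWN,
};

/* Translate an LND event into the error reported to LNet. */
static enum sk_lnet_error sk_translate(enum sk_cm_event ev)
{
	assert((int)ev >= 0 && (int)ev < SK_CM_EVENT_COUNT);
	return sk_cm_to_lnet[ev];
}
```

Keeping the table inside each LND means LNet only ever sees its own error types, while per-error statistics and logging can still distinguish the causes.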
Handling Timeout
This is probably the trickiest situation. Timeout could occur because of network congestion, or because the remote side is too busy, or because it's dead, or hung, etc.
...
One option to consider is to use the peer_timeout feature to recognize when peer_nis are down, update the peer_ni health information via this mechanism, and let the LND and RPC timeouts take care of further resends.
High Level Design
Callback Mechanism
[Olaf: bear in mind that currently the LND already reports status to LNet through lnet_finalize()]
...
- Although some of the actions LNet will take are the same for different errors, it's still a good idea to keep them separate for statistics and logging.
- on LNET_LOCAL_NI_DOWN set the ni_state to STATE_FAILED. In the selection algorithm this NI will not be picked.
- on LNET_LOCAL_NI_UP set the ni_state to STATE_ACTIVE. In the selection algorithm this NI will be selected.
- Add a state in the peer_ni. This will indicate if it is usable or not.
- on LNET_PEER_NI_ADDR_ERROR set the peer_ni state to FAILED. This peer_ni will not be selected in the selection algorithm.
- Add a health value (int). 0 means it's healthy and available for selection.
- on any LNET_PEER_NI_[UNREACHABLE | CONNECT_ERROR | CONNECTION_REJECTED] decrement this value.
- That value indicates how long before we use it again.
- A time before use in jiffies is stored. The next time we examine this peer_NI for selection, we take a look at that time. If it has been passed we select it, but we do not increment this value. The value is set to 0 only if there is a successful send to this peer_ni.
- The net effect is that if we have a bad peer_ni, the health value will keep getting decremented, which will mean it'll take progressively longer to reuse it.
- This algorithm is in effect only if there are multiple interfaces, and some of them are healthy. If none of them are healthy (i.e. the health value is negative), then select the least unhealthy peer_ni (the one with the greatest health value).
- The same algorithm can be used for local NI selection
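The health value scheme in the list above can be sketched as follows. The structure, the SK_BACKOFF_UNIT constant and the linear back-off are assumptions chosen for illustration; the design only states that a more negative value means a longer wait before the interface is used again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SK_BACKOFF_UNIT 100	/* hypothetical back-off step, in jiffies */

struct sk_health {
	int value;		/* 0 is healthy, negative is demerited */
	long retry_after;	/* earliest time the interface may be retried */
};

/* A send failed: demerit the interface and push its retry time out. */
static void sk_health_fail(struct sk_health *h, long now)
{
	assert(h != NULL);
	h->value--;
	h->retry_after = now + (long)(-h->value) * SK_BACKOFF_UNIT;
}

/* Only a successful send resets the health value to 0. */
static void sk_health_ok(struct sk_health *h)
{
	assert(h != NULL);
	h->value = 0;
	h->retry_after = 0;
}

/*
 * Pick the interface with the greatest health value among those that are
 * healthy or whose retry time has passed. If none qualifies, fall back
 * to the least unhealthy one. Returns -1 only when count is 0.
 */
static int sk_health_select(const struct sk_health *h, int count, long now)
{
	int best = -1, eligible = -1;

	for (int i = 0; i < count; i++) {
		bool usable = h[i].value == 0 || now >= h[i].retry_after;

		if (best < 0 || h[i].value > h[best].value)
			best = i;
		if (usable && (eligible < 0 || h[i].value > h[eligible].value))
			eligible = i;
	}
	return eligible >= 0 ? eligible : best;
}
```

Selecting a demerited interface after its retry time does not raise its health value, so repeated failures make the wait progressively longer until a send succeeds.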
Timeout Handling
LND TX Timeout
PUT
...
- TX timeout can be triggered because the TX remains on one of the outgoing queues for too long. This would indicate that there is something wrong with the local NI. It's either too congested or otherwise problematic. This should result in us trying to resend using a different local NI but possibly to the same peer_ni.
- TX timeout can be triggered because the TX is posted (via ib_post_send()) but it has not completed. In this case we can safely conclude that the peer_ni is either congested or otherwise down. This should result in us trying to resend to a different peer_ni, but potentially using the same local NI.
GET
...
In summary, the tx_timeout serves to ensure that messages which do not require an explicit response from the peer are completed on the tx event added by M|OFED to the completion queue. It also serves to ensure that any messages which require an explicit reply in order to complete receive that reply within the tx_timeout.
PUT and GET in Routed Configuration
When a node receives a PUT request, the O2IBLND calls lnet_parse() to deal with it. lnet_parse() calls lnet_parse_put(), which matches the MD and initiates a receive. This ends up calling into the LND, kiblnd_recv(), which sends an IBLND_MSG_PUT_[ACK|NAK]. This allows the sending peer LND to know that the PUT has been received, and to let go of its TX, as shown below. On receipt of the ACK|NAK, the peer sends an IBLND_MSG_PUT_DONE, and initiates the RDMA operation. Once the tx completes, kiblnd_tx_done() is called, which then calls lnet_finalize(). For the PUT, LNet will end up sending an LNET_MSG_ACK if it needs to (look at lnet_parse_put() for the condition on which LNET_MSG_ACK is sent).
...
At this point (pending further discussion) it is my opinion that it should not. I argue that the decision to have LNet send the LNET_MSG_ACK or LNET_MSG_REPLY implicitly is actually a poor one. These messages are direct responses to direct requests by upper layers like RPC. What should have happened is that when LNet receives an LNET_MSG_[PUT|GET], an event is generated to the requesting layer, and the requesting layer makes another call to LNet to send the LNET_MSG_[ACK|REPLY]. Maybe it was done that way in order not to hold on to resources longer than necessary, but semantically these messages belong to the upper layer. Furthermore, the events generated by these messages are used by the upper layer to determine when to resend the PUT/GET. For these reasons I believe it is a sound decision to task LNet with attempting to send an LNet message over a different local_ni/peer_ni only if this message is not received by the remote end. This situation is caught by the tx_timeout.
O2IBLND TX Lifecycle
[Gliffy diagram]
...
NOTE: currently we don't know why the peer_ni is marked down. As mentioned above, the tx_timeout could be triggered for several reasons. Some reasons indicate a problem on the peer side, i.e. not receiving a response or a transmit completion. Other reasons could indicate local problems, for example the tx never leaving the queued state. Depending on the reason for the tx_timeout, LNet should react differently in its next round of interface selection.
Peer timeout and recovery model
- On transmit timeout kiblnd notifies LNet that the peer has closed due to an error. This goes through the lnet_notify path.
- The peer aliveness at the LNet layer is set to 0 (dead), and the last alive
- In IBLND whenever a message is received successfully, transmitted successfully or a connection is completed (whether it is successful or has been rejected) then the last alive time of the peer is set.
- At the LNet layer for a non router node, lnet_peer_aliveness_enabled() will always return 0:
    #define lnet_peer_aliveness_enabled(lp) (the_lnet.ln_routing != 0 && \
                                             ((lp)->lpni_net) && \
                                             (lp)->lpni_net->net_tunables.lct_peer_time_out > 0)
In effect, the aliveness of the peer is not considered at all if the node is not a router.
- This can remain the same since the health of the peer will be considered in lnet_select_pathway() before this is considered.
- In fact if the logic for the health of the peer is done in lnet_select_pathway(), then the logic in lnet_post_send_locked() can be removed. A peer will always be as healthy as possible by the time the flow hits lnet_post_send_locked()
- If the node is not a router, then a peer will always be tried regardless of its health. If it is a router, then once every second the peer will be queried to see if it's alive or not.
- TBD: In o2iblnd, kiblnd_query() looks up the peer and then returns the last_alive of the peer. However, there is code "if (peer_ni == NULL) kiblnd_launch_tx(ni, NULL, nid)". This code will attempt to create and connect to the peer, which should allow us to discover if the peer is alive. However, as far as I know peer_ni is never removed from the hash. So if it's an already existing peer which died, then the call to kiblnd_launch_tx() will never be made, and we'll never discover if the peer came back to life.
- In socklnd, socknal_query() works differently. It actually attempts to connect to the peer again, within a timeout. This leads the router to discover that the peer is healthy and start using it again.
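The aliveness gating described in this list can be modelled with a short sketch. Everything here is an assumption for illustration: `struct peer_model` and its fields only stand in for the LNet state named in the comments, times are in whole seconds, and the comparison against `last_alive + peer_timeout` is one plausible reading of how the timeout is applied rather than a copy of the LNet logic.

```c
/* Hypothetical stand-in for the state consulted by the aliveness check. */
struct peer_model {
	int  routing;      /* stands in for the_lnet.ln_routing */
	int  peer_timeout; /* stands in for lct_peer_time_out, in seconds */
	long last_alive;   /* last time the LND saw the peer alive */
	long last_query;   /* last time the LND was asked about the peer */
};

/* Mirrors the macro's condition: only routers with a peer timeout check. */
static int aliveness_enabled(const struct peer_model *p)
{
	return p->routing != 0 && p->peer_timeout > 0;
}

/* A non-router always tries the peer; a router honours the timeout. */
static int peer_usable(const struct peer_model *p, long now)
{
	if (!aliveness_enabled(p))
		return 1;
	return now < p->last_alive + p->peer_timeout;
}

/* On a router, query the LND at most once per second. */
static int should_query(struct peer_model *p, long now)
{
	if (!aliveness_enabled(p) || now - p->last_query < 1)
		return 0;
	p->last_query = now;
	return 1;
}
```

In this model a dead peer becomes usable again only when something refreshes `last_alive`, which is why it matters whether the LND query actually attempts a new connection.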
Health Revisited
There are different scenarios to consider with Health:
- Asynchronous events which indicate that the card is down
- Immediate failures when sending
- Failures reported by the LND
- Failures that occur because the peer is down. This class of failures could, however, be moved into the selection algorithm, i.e. do not pick peer_nis which are not alive.
- TX timeout cases.
- Currently connection is closed and peer is marked down.
- This behavior should be enhanced to attempt to resend on a different local NI/peer NI, and mark the health of the NI
TX Timeouts in the presence of LNet Routers
Communication with a router adheres to the above details. Once the current hop is sure that the message has made it to the next hop, LNet shouldn't worry about resends. Resends are only to ensure that the message LNet is tasked to send makes it to the next hop. The upper layer RPC protocol makes sure that RPC messages are retried if necessary.
Each hop's LNet makes a best effort to get the message to the following hop. There is no feedback mechanism from a router to the originator to inform the originator that a message has failed to send, but I believe such a mechanism is unnecessary and would probably increase the complexity of the code and the system in general. The rule of thumb should be that each hop only worries about the immediate next hop.
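The per-hop rule can be illustrated with a minimal sketch in which a hop tries each available local NI/peer NI pair in turn and stops as soon as the next hop has the message. The names `struct path`, `send_fn`, `send_to_next_hop()` and `demo_send()` are hypothetical; `demo_send()` is only a stand-in for the LND, and failing a send because a local NI is down is an assumed test condition.

```c
#include <stddef.h>

/* Hypothetical pairing of a local NI with a peer NI on the next hop. */
struct path {
	int local_ni;
	int peer_ni;
};

/* Assumed send callback: returns 0 once the next hop has the message. */
typedef int (*send_fn)(const struct path *p, void *arg);

/*
 * Try each path in order. Return the index of the path that succeeded,
 * or -1 if every path failed. Nothing is reported back to the originator;
 * recovery beyond the next hop is left to the upper layer.
 */
static int send_to_next_hop(const struct path *paths, size_t n,
			    send_fn send, void *arg)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (send(&paths[i], arg) == 0)
			return (int)i;
	}
	return -1;
}

/* Stand-in for the LND: fails when the path uses the local NI in *arg. */
static int demo_send(const struct path *p, void *arg)
{
	const int *down_local_ni = arg;

	return p->local_ni == *down_local_ni ? -1 : 0;
}
```

For example, with two paths sharing the same peer NI and the first local NI marked down, `send_to_next_hop()` falls through to the second path.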
SOCKLND Detailed Discussion
TBD