Adding Resiliency to LNet

Introduction

By design LNet is a lossy connectionless network: there are cases where messages can be dropped without the sender being notified. Here we explore the possibilities of making LNet more resilient, including having it retransmit messages over alternate routes. What can be done in this area is constrained by the design of LNet.

Put, Ack, Get, Reply

Within LNet there are three cases of interest: Put, Put+Ack, and Get+Reply.

Put: The simplest message type is a bare Put message. This message is sent on the wire, and after that no further response is expected by LNet. In terms of error handling, this means that a failure to send can result in an error, but any kind of failure after the message has been sent will result in the message being dropped without notification. There is nothing LNet can do in the latter case, it will be up to the higher layers that use LNet to detect what is going on and take whatever corrective action is required.
Put+Ack: The sender of a Put can ask from an Ack from the receiver. The Ack message is generated by LNet on the receiving node (the peer), and this is done as soon as the Put has been received. In contrast to a bare put this means LNet on the sender can track whether an Ack arrives, and if it does not promptly arrive it can take some action. Such an action can take two forms: inform the upper layers that the Ack took too long to arrive, or retry within LNet itself.
Get+Reply: With a Get message, there is always a Reply from the received. Prior to the Get, the sender and receiver arrange an agreement on the MD that the data for the Reply must be obtained from, so LNet on the receiving node can generate the Reply message as soon as the Get has been received. Failure detection and handling is similar to the Put+Ack case.

The protocols used to implement LNet tend to be connection-oriented, and implement some kind of handshake or ack protocol that tells the sender that a message has been received. As long as the LND actually reports errors to LNet (not a given, alas) this means that in practice the sender of a message can reliably determine whether the message was successfully sent to another node. When the destination node is on the same LNet network, this is sufficient to enable LNet itself to detect failures even in the bare Put case. But in a routed configuration this only guarantees that the router received the message, and if the router then fails to forward it, a bare Put will be lost without trace.

PtlRPC

Users of LNet that send bare Put messages must implement their own methods to detect whether a message was lost. The general rule is simple: the recipient of a Put is supposed react somehow, and if the reaction doesn't happen within a set amount of time, the sender assumes that either the Put was lost, or the recipient is in some other kind of trouble.

For our purposes PtlRPC is of interest. PtlRPC messages can be classified as Request+Response pairs. Both a Request and a Response are built from one or more Get or Put messages. A node that sends a PtlRPC Request requires the receiver to send a Response within a set amount of time, and failing this the Request times out and PtlRPC takes corrective action.

Adaptive timeouts add an interesting wrinkle to this mechanism: they allow the recipient of a Request to tell the sender to "please wait", informing it that the recipient is alive and working but not able to send the Response before the normal timeout. For LNet the interesting implication is that while this is going on, there will be some traffic between the sender and recipient. This traffic may also be in the form of out-of-band information invisible to LNet.

LNet Interfaces

The interfaces that LNet provides to the upper layers should work as follows. Set up an MD (Memory Descriptor) to send data from (for a Put) or receive data into (for a Get). An event handler is associated with the MD. Then call LNetGet() or LNetPut() as appropriate.

LNetGet()

If all goes well, the event handler sees two events: LNET_EVENT_SEND to indicate the Get message was sent, and LNET_EVENT_REPLY to indicate the Reply message was received. Note that the send event can happen after the reply event (this is actually the typical case).

If sending the Get message failed, LNET_EVENT_SEND will include an error status, no LNET_EVENT_REPLY will happen, and clean up must done accordingly. If the return value of LNetGet() indicates an error then sending the message certainly failed, but a 0 return does not imply success, only that no failure has yet been encountered.

A damaged Reply message will be dropped, and does not result in an LNET_EVENT_REPLY. Effectively the only way for LNET_EVENT_REPLY to have an error status is if LNet detects a timeout before the Reply is received.

LNetPut() LNET_ACK_REQ

A Put with an Ack is similar to a Get+Reply pair. The events in this case are LNET_EVENT_SEND and LNET_EVENT_ACK.

For a Put, the LNET_EVENT_SEND indicates that the MD is no longer used by the LNet code and the caller is free do discard or re-use it.

As with send, LNET_EVENT_ACK is expected to only carry an error indication if there was a timeout before the Ack was received.

LNetPut() LNET_NOACK_REQ

A Put without an Ack will only generate an LNET_EVENT_SEND, which indicates that the MD can now be used.

Possible Failures

There are a number of failures we can encounter, only some of which LNet may address.

Local interface failure: there is some issue with the local interface that prevents it from sending (or receiving) messages.
Remote interface failure: there is some issue with the remote interface that prevents it from receiving (or sending) messages.
Path failure: the local interface is working properly, but messages never reach the remote interface.
Remote node failure: the remote node is properly receiving messages but unresponsive for other reasons.

In a routed LNet configuration these scenarios apply to each hop.

These failures will show up in a number of ways:

Local interface reports failure. This can include the interface itself being healthy but it noting that the cable connecting it to a switch, or the switch port, is somehow not working right.
Remote interface not reachable. A remote interface that should be reachable from the local interface cannot be reached. This can be in the form of a "fast" error, or of a timeout in the LND-level protocol.
Remote interfaces on a net not reachable. The local interface appears to be OK, but there are several remote interfaces it cannot talk to.
All remote interfaces on a net not reachable. The local interface appears to be OK, but cannot talk to any remote interface.
All remote interfaces of a node not reachable. All LNDs report errors when talking to a specific remote node, but have no problem talking to other nodes.
Put+Ack or Get+Reply timeout. The LND gives no failure indication, but the Ack or Reply takes too long to arrive.
Dropped Put. Everything appears to work, except it doesn't.

Let's take a look at what LNet can do in each of these cases.

Local Interface Reports Failure

This is the easiest case to work with. The LND can report such a failure to LNet, and LNet then refrains from using the local interface for any traffic.

LNet can mark the interface down, and depending on the capabilies of the LND either recheck periodically or wait for the LND to mark the interface up.

Remote Interface Not Reachable

The remote interface cannot be reached from the local interface, but the local interface can talk to other nodes. If the remote interface can be reached from other local interfaces then we're dealing with some path failure. Otherwise the remote interface may be bad. If there is only a single local interface that can talk to the remote interface, then we cannot distinguish between these cases.

LNet can mark this particular local/remote interface combination as something to be avoided.

When there are paths more than one local interface to the remote interface, and none of these work, but other remote interfaces do work, then LNet can mark the remote interface as bad. Recovery could be done by periodically probing the remote interface, maybe using LNet Ping as a poor-man's equivalent of an LNet Control Packet.

Remote interfaces On A Net Not Reachable

Several remote interfaces on a net cannot be reached from a local interface, but it can talk to other nodes. This is a more severe variant of the previous case.

All Remote Interfaces On A Net Not Reachable

All remote interfaces on a net cannot be reached from a local interface. If there are other, working, interfaces connected to the same net then the balance of probability shifts to the local interface being bad, or there is a severe problem with the fabric.

In practice LNet will not detect "all" remote interfaces being down. But it can detect that for a period of time, no traffic was successfully sent from a local interface, and therefore start avoiding that interface as a whole. Recovery would involve periodically probing the interface, maybe using LNet Ping.

All Remote Interfaces Of A Node Not Reachable

The node is likely down. There is little LNet can do here, this is a problem to be handled by upper layers. This includes indicating when LNet should attempt to reconnect.

LNet might treat this as the "remote interface not reachable" case for all the interfaces of the remote node. That is, without much difference due to apparently all interfaces of the remote node being down, except for a log message indicating this.

Put+Ack Or Get+Reply Timeout

This is the case where the LND does not signal any problem, so the Ack for a Put or Reply for a Get should arrive promptly, with the only delays due to credit-based throttling, and yet it does not do so. Note that this assumes that were possible the LND layer already implements reasonably tight timeouts, so that LNet can assume the problem is somewhere else.

LNet can impose a "reply timeout", and retry over a different path if there is one available. However, if the assumption about the LND is valid, then the implication is that the node is in trouble. So an alternative is to force the upper layers to cope.

One argument for nevertheless implementing this facility in LNet is that it means the upper layers to have to re-invent and re-implement this wheel time and again.

Dropped Put

No problem was signalled by the LND, and there is no Ack that we could time out waiting for. LNet does not have enough information to do anything, so the upper layers must do so instead.

If this case must be made tractable, LNet can be changed to make the Ack non-optional.

Should LNet Resend Messages

When there are multiple paths available for a message, it makes sense to try to resend it on failure. But where should the resending logic be implemented?

The easiest path is to tell upper layers to resend. For example, PtlRPC has some related logic already. Except that PtlRPC detects a failure, it disconnects, reconnects, and triggers recovery. The type of resending logic desired is to "just try another path" which differs from what exists today and needs to be implemented for each user.

The alternative then is to have LNet resend a message. There should be some limit to the number of retries, and a limit to the amount of time spent retrying. Otherwise we are requiring the upper layers to implement a timer on each LNetGet() and LNetPut() call to guarantee progress. This introduces an LNet "retry timeout" (as opposed to the "reply timeout" discussed above) as the amount of time LNet after which LNet gives up.

In terms of timeouts, this then gives us the following relationships, from shortest to longest:

LND Timeout: LND declares that a message won't arrive.
- IB timeout is (default?) slightly less than 4 seconds
- LND timeout is timeout module parameter for o2ib and gni, sock_timeout module parameter for sock?
LNet Reply Timeout: LNet declares an Ack/Reply won't arrive. > LND Timeout * (max hops -1)
- Depends on the route!
LNet Retry Timeout: LNet gives up on retries. > LNet Reply Timeout * max LNet retries
- Depends on the route!
peer_timeout module parameter: peer is declared dead. Either use for LNet Retry Timeout, or > LNet Retry Timeout.

It is not completely obvious how this scheme interacts with Lustre's timeout parameter (the Lustre RPC timeout, from which a of timeouts are derived).

O2IBLND

Overview

There are two types of events to account for:

Events on the RDMA device itself
Events on the cm_id

Both events should be monitored because they provide information on the health of the device and connection respectively.

ib_register_event_handler() can be used to register a handler to handle events of type 1.

a cm_callback can be register with the cm_id to handle RMDA_CM events.

There is a group of events which indicate a fatal error

RDMA Device Events

Below are the events that could occur on the RDMA device. Highlighted in BOLD RED are the events that should be handled for health purposes.

IB_EVENT_CQ_ERR
IB_EVENT_QP_FATAL
IB_EVENT_QP_REQ_ERR
IB_EVENT_QP_ACCESS_ERR
IB_EVENT_COMM_EST
IB_EVENT_SQ_DRAINED
IB_EVENT_PATH_MIG
IB_EVENT_PATH_MIG_ERR
IB_EVENT_DEVICE_FATAL
IB_EVENT_PORT_ACTIVE
IB_EVENT_PORT_ERR
IB_EVENT_LID_CHANGE
IB_EVENT_PKEY_CHANGE
IB_EVENT_SM_CHANGE
IB_EVENT_SRQ_ERR
IB_EVENT_SRQ_LIMIT_REACHED
IB_EVENT_QP_LAST_WQE_REACHED
IB_EVENT_CLIENT_REREGISTER
IB_EVENT_GID_CHANGE

Communication Events

Below are the events that could occur on a connection. Highlighted in BOLD RED are the events that should be handled for health purposes.

RDMA_CM_EVENT_ADDR_RESOLVED: Address resolution (rdma_resolve_addr) completed successfully.
RDMA_CM_EVENT_ADDR_ERROR: Address resolution (rdma_resolve_addr) failed.
RDMA_CM_EVENT_ROUTE_RESOLVED: Route resolution (rdma_resolve_route) completed successfully.
RDMA_CM_EVENT_ROUTE_ERROR: Route resolution (rdma_resolve_route) failed.
RDMA_CM_EVENT_CONNECT_REQUEST: Generated on the passive side to notify the user of a new connection request.
RDMA_CM_EVENT_CONNECT_RESPONSE: Generated on the active side to notify the user of a successful response to a connection request. It is only generated on rdma_cm_id's that do not have a QP associated with them.
RDMA_CM_EVENT_CONNECT_ERROR: Indicates that an error has occurred trying to establish or a connection. May be generated on the active or passive side of a connection.
RDMA_CM_EVENT_UNREACHABLE: Generated on the active side to notify the user that the remote server is not reachable or unable to respond to a connection request. If this event is generated in response to a UD QP resolution request over InfiniBand, the event status field will contain an errno, if negative, or the status result carried in the IB CM SIDR REP message.
RDMA_CM_EVENT_REJECTED: Indicates that a connection request or response was rejected by the remote end point. The event status field will contain the transport specific reject reason if available. Under InfiniBand, this is the reject reason carried in the IB CM REJ message.
RDMA_CM_EVENT_ESTABLISHED: Indicates that a connection has been established with the remote end point.
RDMA_CM_EVENT_DISCONNECTED: The connection has been disconnected.
RDMA_CM_EVENT_DEVICE_REMOVAL: The local RDMA device associated with the rdma_cm_id has been removed. Upon receiving this event, the user must destroy the related rdma_cm_id.
RDMA_CM_EVENT_MULTICAST_JOIN: The multicast join operation (rdma_join_multicast) completed successfully.
RDMA_CM_EVENT_MULTICAST_ERROR: An error either occurred joining a multicast group, or, if the group had already been joined, on an existing group. The specified multicast group is no longer accessible and should be rejoined, if desired.
RDMA_CM_EVENT_ADDR_CHANGE: The network device associated with this ID through address resolution changed its HW address, eg following of bonding failover. This event can serve as a hint for applications who want the links used for their RDMA sessions to align with the network stack.
RDMA_CM_EVENT_TIMEWAIT_EXIT: The QP associated with a connection has exited its timewait state and is now ready to be re-used. After a QP has been disconnected, it is maintained in a timewait state to allow any in flight packets to exit the network. After the timewait state has completed, the rdma_cm will report this event.

Health Handling

Handling Asynchronous Events

A callback mechanism should be provided by LNet to the LND to report failure events
Some translation matrix from LND specific errors to LNet specific errors should be created
- Each LND would create the mapping
Whenever an event occurs the indicates a fatal error on the device the LNet callback should be called.
LNet should transition the local NI or remote NI appropriately and take measures to close the connections on that specific device.

Handling Errors on Sends

If a request to send a message ends in an error: Example a connection error (as seen with the wrong device responding to ARP), then LNet should pick another local device to send from.
- There are a class of errors which indicate a problem in Local NI
  - RDMA_CM_EVENT_DEVICE_REMOVAL - This device is no longer present. Should never be used.
- There are a class of errors which indicate a problem in remote NI
  - RDMA_CM_EVENT_ADDR_ERROR - The remote address is errnoeous. Should not be used.
  - RDMA_CM_EVENT_ADDR_RESOLVED with an error. The remote address can not be resolved
  - RDMA_CM_EVENT_ROUTE_ERROR - No route to remote address. Should result in the peer_ni not to be used. But a retry can be done a bit later via time.
  - RDMA_CM_EVENT_UNREACHABLE - Remote side is unreachable. Retry after a while.
  - RDMA_CM_EVENT_CONNECT_ERROR - problem with connection. Retry after a while.
  - RDMA_CM_EVENT_REJECTED - Remote side is rejecting connection. Retry after a while.
  - RDMA_CM_EVENT_DISCONNECTED - Move outstanding operations to a different pair if available.

Handling Timeout

This is probably the trickiest situation. Timeout could occur because of network congestion, or because the remote side is too busy, or because it's dead, or hung, etc.

Timeouts are being kept in the LND (o2iblnd) on the transmits. Every transmit which is queued is assigned a deadline. If it expires then the connection on which this transmit is queued, is closed.

peer_timout can be set in routed and non-routed scenario, which provides information on the peer.

Timeouts are also being kept at ptlrpc. These are rpc timeouts.

Refer to section 32.5 in the manual for a description of how RPC timeouts work.

Also refer to section 27.3.7 for LNet Peer Health information.

Given the presence of various timeouts, adding yet another timeout on the message, will further complicate the configuration, and possibly cause further hard to debug issues.

One option to consider is to use the peer_timout feature to recognize when peer_nis are down, and update the peer_ni health information via this mechanism. And let the LND and RPC timeouts take care of further resends.

High Level Design

Callback Mechanism

[Olaf: bear in mind that currently the LND already reports status to LNet through lnet_finalize()]

Add an LNet API:
- lnet_report_lnd_error(lnet_error_type type, int errno, struct lnet_ni *ni, struct lnet_peer_ni *lpni)
  - LND will map its internal errors to lnet_error_type enum.

enum lnet_error_type {
	LNET_LOCAL_NI_DOWN, /* don't use this NI until you get an UP */
	LNET_LOCAL_NI_UP, /* start using this NI */
	LNET_LOCAL_NI_SEND_TIMOUT, /* demerit this NI so it's not selected immediately, provided there are other healthy interfaces */
	LNET_PEER_NI_ADDR_ERROR, /* The address for the peer_ni is wrong. Don't use this peer_NI */
	LNET_PEER_NI_UNREACHABLE, /* temporarily don't use the peer NI */
	LNET_PEER_NI_CONNECT_ERROR, /* temporarily don't use the peer NI */
	LNET_PEER_NI_CONNECTION_REJECTED /* temporarily don't use the peer NI */
};

Although some of the actions LNet will take is the same for different errors, it's still a good idea to keep them separate for statistics and logging.
on LNET_LOCAL_NI_DOWN set the ni_state to STATE_FAILED. In the selection algorithm this NI will not be picked.
on LNET_LOCAL_NI_UP set the ni_state to STATE_ACTIVE. In the selection algorithm this NI will be selected.
Add a state in the peer_ni. This will indicate if it usable or not.
on LNET_PEER_NI_ADDR_ERROR set the peer_ni state to FAILED. This peer_ni will not be selected in the selection algorithm.
Add a health value (int). 0 means it's healthy and available for selection.
on any LNet_PEER_NI_[UNREACHABLE | CONNECT_ERROR | CONNECT_REJECTED] decrement this value.
That value indicates how long before we use it again.
A time before use in jiffies is stored. The next time we examine this peer_NI for selection, we take a look at that time. If it has been passed we select it, but we do not increment this value. The value is set to 0 only if there is a successful send to this peer_ni.
The net effect is that if we have a bad peer_ni, the health value will keep getting decremented, which will mean it'll take progressively longer to reuse it.
This algorithm is in effect only if there are multiple interfaces, and some of them are healthy. If none of them are healthy (IE the health value is negative), then select the least unhealthy peer_ni (the one with greatest health value).
The same algorithm can be used for local NI selection

Timeout Handling

LND TX Timeout

PUT

As shown in the above diagram whenever a tx is queued to be sent or is posted but haven't received confirmation yet, the tx_deadline is still active. The scheduler thread checks the active connections for any transmits which has passed their deadline, and then it closes those connections and notifies LNet via lnet_notify().

The tx timeout is cancelled when in the call kiblnd_tx_done(). This function checks 3 flags: tx_sending, tx_waiting and tx_queued. If all of them are 0 then the tx is closed as completed. The key flag to note is tx_waiting. That flag indicates that the tx is waiting for a reply. It is set to 1 in kiblnd_send, when sending the PUT_REQ or GET_REQ. It is also set when sending the PUT_ACK. All of these messages expect a reply back. When the expected reply is received then tx_waiting is set to 0 and kiblnd_tx_done() is called, which eventually cancels the tx_timeout, by basically removing the tx from the queues being checked for the timeout.

The notification in the LNet layer that the connection has been closed can be used by MR to attempt to resend the message on a different peer_ni.

<TBD: I don't think that LND attempts to automatically reconnect to the peer if the connection gets torn down because of a tx_timeout.>

TX timeout is exactly what we need to determine if the message has been transmitted successfully to the remote side. If it has not been transmitted successfully we can attempt to resend it on different peer_nis until we're either successful or we've exhausted all of the peer_nis.

The reason for the TX timeout is also important:

TX timeout can be triggered because the TX remains on one of the outgoing queues for too long. This would indicate that there is something wrong with the local NI. It's either too congested or otherwise problematic. This should result in us trying to resend using a different local NI but possibly to the same peer_ni.
TX timeout can be triggered because the TX is posted via (ib_post_send()) but it has not completed. In this case we can safely conclude that the peer_ni is either congested or otherwise down. This should result in us trying to resent to a different peer_ni, but potentially using the same local NI.

GET

After the completion of an o2iblnd tx ib_post_send(), a completion event is added to the completion queue. This triggers kiblnd_complete to be called. If this is an IBLND_WID_TX then kiblnd_tx_complete() is called, which calls kiblnd_tx_done() if the tx is not sending, waiting or queued. In this case the tx_timeout is closed.

In summary, the tx_timeout serves to ensure that messages which do not require an explicit response from the peer are completed on the tx event added by M|OFED to the completion queue. And it also serves to ensure that any messages which require an explicit reply to be completed receive that reply within the tx_timout.

PUT and GET in Routed Configuration

When a node receives a PUT request, the O2IBLND calls lnet_parse() to deal with it. lnet_parse() calls lnet_parse_put(), which matches the MD and initiates a receive. This ends up calling into the LND, kiblnd_recv(), which would send an IBLND_MSG_PUT_[ACK|NAK]. This allows the sending peer LND to know that the PUT has been received, and let go of it's TX, as shown below. On receipt of the ACK|NAK, the peer sends a IBLND_MSG_PUT_DONE, and initates the RDMA operation. Once the tx completes, kiblnd_tx_done() is called which will then call lnet_finalize(). For the PUT, LNet will end sending an LNET_MSG_ACK, if it needs to (look at lnet_parse_put() for the condition on which LNET_MSG_ACK is sent).

In the case of a GET, on receipt of IBLND_MSG_GET_REQ, lnet_parse() -> lnet_parse_get() -> kiblnd_recv(). If a there is data to be sent back, then the LND sends and RDMA operation with IBLND_MSG_GET_DONE, or just the DONE.

The point I'm trying to illustrate here is that there are two levels of messages. There are the LND messages which confirm that a single LNET message has been received by the peer. And there there are the LNet level messages, such as LNET_MSG_ACK and LNET_MSG_REPLY. These two in particular are in response to the LNET_MSG_PUT and LNET_MSG_GET respectively. At the LND level IBLND_MSG_IMMEDIATE is used.

In a routed configuration, the entire LND handshake between the peer and the router is completed. However the LNET level messages like LNET_MSG_ACK and LNET_MSG_REPLY are sent by the final destination, and not by the router. The router simply forwards on the message it receives.

The question that the design needs to answer is this: Should LNet be concerned with resending messages if LNET_MSG_ACK or LNET_MSG_REPLY are not received for LNET_MSG_PUT and LNET_MSG_GET respectively?

At this point (pending further discussion) it is my opinion that it should not. I argue that the decision to get LNET to send the LNET_MSG_ACK or LNET_MSG_REPLY implicitly is actually a poor one. These messages are in direct respons to direct requests by upper layers like RPC. What should've been happening is that when LNET receives an LNET_MSG_[PUT|GET], an event should be generated to the requesting layer, and the requesting layer should be doing another call to LNet, to send the LNET_MSG_[ACK|REPLY]. Maybe it was done that way in order no to hold on resources more than it should, but symantically these messages should belong to the upper layer. Furthermore, the events generated by these messages are used by the upper layer to determine when to do the resends of the PUT/GET. For these reasons I believe that it is a sound decision to only task LNet with attempting to send an LNet message over a different local_ni/peer_ni only if this message is not received by the remote end. This situation is caught by the tx_timeout.

O2IBLND TX Lifecycle

In order to understand fully how the LND transmit timeout can be used for resends, we need to have an understanding of the transmit life cycle shown above.

This shows that the timeout depends on the type of request being sent. If the request expects a response back then the tx_timeout covers the entire transaction lifetime. Otherwise it covers up until the transmit complete event is queued on the completion queue.

Currently, if the transmit timeout is triggered the connection is closed to ensure that all RDMA operations have ceased. LNet is notified on error and if the modprobe parameter auto_down is set (which it is by default) the peer is marked down. In lnet_select_pathway() lnet_post_send_locked() is called. One of the checks it does is to make sure that the peer we're trying to send to is alive. If not, message is dropped and -EHOSTUNREACH is returned up the call chain.

In lnet_select_pathway() if lnet_post_send_locked() fails, then we ought to marke the health of the peer and attempt to select a different peer_ni to send to.

NOTE, currently we don't know why the peer_ni is marked down. As mentioned above the tx_timeout could be triggered for several reasons. Some reasons indicate a problem on the peer side, IE not receiving a response or a transmit complete. Other reasons could indicate local problems, for example the tx never leaves the queued state. Depending on the reason for the tx_timeout LNet should react differently in it's next round of interface selection.

Peer timeout and recovery model

On transmit timeout kiblnd notifies LNet that the peer has closed due to an error. This goes through the lnet_notify path.
The peer aliveness at the LNet layer is set to 0 (dead), and the last alive
In IBLND whenever a message is received successfully, transmitted successfully or a connection is completed (whether it is successful or has been rejected) then the last alive time of the peer is set.
At the LNet layer for a non router node, lnet_peer_aliveness_enabled() will always return 0:
- #define lnet_peer_aliveness_enabled(lp) (the_lnet.ln_routing != 0 && \ ((lp)->lpni_net) && \ (lp)->lpni_net->net_tunables.lct_peer_time_out > 0)
- In effect, the aliveness of the peer is not considered at all if the node is not a router.
  - This can remain the same since the health of the peer will be considered in lnet_select_pathway() before this is considered.
  - In fact if the logic for the health of the peer is done in lnet_select_pathway(), then the logic in lnet_post_send_locked() can be removed. A peer will always be as healthy as possible by the time the flow hits lnet_post_send_locked()
If the node is not a router, then a peer will always be tried irregardless of its health. If it is a router then once every second the peer will be queried to see if it's alive or not.
- TBD: In o2iblnd kiblnd_query looks up the peer and then returns the last_alive of hte peer. However, there is code "if (peer_ni == NULL) kiblnd_launch_tx(ni, NULL, nid)". This code will attempt creating and connecting to the peer, which should allow us to discover if the peer is alive. However, as far as I know peer_ni is never removed from the hash. So if it's already an existing peer which died, then the call to kiblnd_launch_tx() will never be made, and we'll never discover if the peer came back to life.
  - In socklnd, socknal_query() works differently. It actually attempts to connect to the peer again, within a timeout. This leads the router to discover that the peer is healthy and start using it again.

Health Revisited

There are different scenarios to consider with Health:

Asynchronous events which indicate that the card is down
Immediate failures when sending
1. 1. Failures reported by the LND
2. Failures that occur because peer is down. Although this class of failures could be moved into the selection algorithm. IE do not pick peers_nis which are not alive.
TX timeout cases.
1. Currently connection is closed and peer is marked down.
2. This behavior should be enhanced to attempt to resend on a different local NI/peer NI, and mark the health of the NI

TX Timeouts in the presence of LNet Routers

Communication with a router adheres to the above details. Once the current hop is sure that the message has made it to the next hop, LNet shouldn't worry about resends. Resends are only to ensure that the message LNet is tasked to send makes it to the next hop. The upper layer RPC protocol makes sure that RPC messages are retried if necessary.

Each hop's LNet will do a best effort in getting the message to the following hop. Unfortunately, there is no feedback mechanism from a router to the originator to inform the originator that a message has failed to send, but I believe this is unnecessary and will probably increase the complexity of the code and the system in general. Rule of thumb should be that each hop only worries about the immediate next hop.

SOCKLND

TBD

LNet Health Version 2.0

TBD