...

[Gliffy diagram: MessageProcessing]

lnet_msg is a structure used to keep information on the data that will be transmitted over the wire. It does not itself go over the wire. lnet_msg is passed to the LND for transmission.

...

A monitor thread monitors and ensures that messages which have expired are finalized. This processing is detailed in later sections.

...

Resiliency vs. Reliability

There are three types of failures that LNet needs to deal with:

  1. Local Interface failure
  2. Remote Interface failure
  3. Timeouts
    1. LND detected Timeout
    2. LNet detected Timeout

Timeouts will be provided by the ULP in the LNetPut() and LNetGet() APIs.

Resend Handling

When a message send fails for one of the reasons outlined above, the behavior should be as follows:

  1. The local or remote interface health is updated
  2. Failure statistics incremented
  3. A resend is issued on a different local interface if there is one available. If none is available, the same interface is attempted again.
  4. The message is resent repeatedly until one of the following criteria is met:
    1. Message is completed successfully.
    2. Retry-count is reached.
    3. Transaction timeout expires.
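
The three termination criteria can be sketched as a single check. This is a minimal illustration, not the LNet implementation: struct resend_state and resend_allowed() are hypothetical names, and times are assumed to be plain seconds.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of the per-message resend state. */
struct resend_state {
	bool     completed;    /* message completed successfully */
	unsigned retries;      /* resends issued so far */
	unsigned retry_count;  /* configured maximum number of resends */
	int64_t  deadline;     /* send time + transaction timeout, in seconds */
};

/* Return true if another resend attempt is still allowed at time 'now'. */
bool resend_allowed(const struct resend_state *s, int64_t now)
{
	if (s->completed)
		return false;  /* criterion 1: message completed */
	if (s->retries >= s->retry_count)
		return false;  /* criterion 2: retry count reached */
	if (now >= s->deadline)
		return false;  /* criterion 3: transaction timeout expired */
	return true;
}
```

Any one of the three conditions is enough to stop resending, so the order of the checks does not affect the result.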

Two new fields will be added to lnet_msg:

  1. msg_status - bit field that indicates the type of failure which requires a resend
  2. msg_deadline - the deadline for the message, calculated as send time + transaction timeout
Code Block
struct lnet_msg {
...
	/* bit field: type of failure that requires a resend */
	__u32			msg_status;
	/* send time + transaction timeout */
	cfs_time_t		msg_deadline;
...
};
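
As an illustration of how the two fields might be used, here is a minimal sketch. The MSG_STATUS_* bit names and the struct msg_tracking stand-in are hypothetical; the design only states that msg_status is a bit field and that the deadline is the send time plus the transaction timeout. Times are assumed to be plain seconds.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical failure bits; the design does not name them. */
#define MSG_STATUS_LOCAL_FAILURE   (1u << 0)
#define MSG_STATUS_REMOTE_FAILURE  (1u << 1)
#define MSG_STATUS_TIMEDOUT        (1u << 2)

/* Stand-in for the two new lnet_msg fields, using plain C types. */
struct msg_tracking {
	uint32_t msg_status;    /* OR of MSG_STATUS_* bits */
	int64_t  msg_deadline;  /* absolute time, in seconds */
};

/* Tag the message at send time: deadline = send time + transaction timeout. */
void msg_tag_deadline(struct msg_tracking *m, int64_t send_time,
		      int64_t transaction_timeout)
{
	m->msg_status = 0;
	m->msg_deadline = send_time + transaction_timeout;
}

/* Record a failure type, as the LND would before calling lnet_finalize(). */
void msg_mark_failure(struct msg_tracking *m, uint32_t bit)
{
	m->msg_status |= bit;
}

/* True once the deadline has passed. */
bool msg_expired(const struct msg_tracking *m, int64_t now)
{
	return now >= m->msg_deadline;
}
```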

When a message encounters one of the errors above, the LND will update the msg_status field appropriately and call lnet_finalize().

lnet_finalize() will check if the message has timed out or if it needs to be resent and will take action on it. lnet_finalize() currently calls lnet_complete_msg_locked() to continue the processing. If the message has not been sent, then lnet_finalize() should call another function to resend, lnet_resend_msg_locked().

lnet_resend_msg_locked() shall queue the message on a resend queue and wake up a thread responsible for resending messages.

The router checker thread, which is always started, will be refactored to handle resending messages.

When a message is initially sent, it is tagged with a deadline and placed on the active queue. If the message is not completed by that deadline, it will be finalized and removed from the active queue. A timeout event will be passed to the ULP.

The monitor thread will wake up every second and check the top of the active queue, i.e. the oldest message on the list. If that message has expired, the thread updates its status to TIMEDOUT and finalizes it. It then moves on to the next message on the list and stops once it finds a message that has not expired.
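
The expiry scan can be sketched as follows. This is a simplified illustration under stated assumptions: struct active_msg, MSG_TIMEDOUT and expire_active() are hypothetical names, the queue is a singly linked list ordered oldest first, and deadlines are assumed to be non-decreasing along the queue (which holds if all messages share one timeout).

```c
#include <stddef.h>
#include <stdint.h>

enum { MSG_OK = 0, MSG_TIMEDOUT = 1 };  /* hypothetical status values */

/* Hypothetical, simplified message: only what the scan needs. */
struct active_msg {
	int64_t            deadline;  /* send time + transaction timeout */
	int                status;
	struct active_msg *next;      /* next-oldest message on the queue */
};

/*
 * Expire messages starting from the head (oldest) of the active queue.
 * Stops at the first message that has not expired and returns the new
 * head. 'expired' receives the number of messages timed out.
 */
struct active_msg *expire_active(struct active_msg *head, int64_t now,
				 unsigned *expired)
{
	*expired = 0;
	while (head != NULL && head->deadline <= now) {
		struct active_msg *next = head->next;

		head->status = MSG_TIMEDOUT;  /* finalization would follow */
		head->next = NULL;            /* unlink from the active queue */
		(*expired)++;
		head = next;
	}
	return head;
}
```

Stopping at the first unexpired message keeps the per-wakeup cost proportional to the number of expired messages rather than the queue length, but it is only correct while the ordering assumption holds.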

If the LND times out and LNet attempts to resend, it will place the message on the resend queue. A message can be on both the active and resend queues.

As shown in the diagram above both lnet_send() and lnet_parse() put messages on the active queue. lnet_finalize() consumes messages off the active queue when it's time to decommit them.

When the LND calls lnet_finalize() on a timed out message, lnet_finalize() will put the message on the resend queue and wake up the monitor thread, which will go through the resend queue in FIFO order, pop the message and call lnet_send() on it.

When the monitor thread wakes up every second it'll perform the following ordered operations:

  1. Timeout all expired messages on the active list.
  2. Resend messages on the resend queue

The assumption is that under normal circumstances the number of re-sends should be low, so the thread will not add any logic to pace out the resend rate, such as what lnet_finalize() does.

It is possible that a message can be on the resend queue when it either completes or times out. In both of these cases it will be removed from the resend queue as well as the active queue and finalized.

The message will continue to be protected by the LNet net CPT lock to ensure mutually exclusive access.

When the message is committed, lnet_msg_commit(), the message cpt is assigned. This cpt value is then used to protect the message in subsequent usages. Relevant to this discussion is when the message is examined in lnet_finalize() and either removed from the active queue or placed on the resend queue.

API Changes

The ULP will provide the transaction timeout value on which LNet will base its own timeout values. In the absence of that LNet will fall back on a configurable transaction timeout value.

This trickle-down approach will simplify the configuration of the LNet Resiliency feature and make the timeout consistent throughout the system, instead of configuring the LND timeout to be much larger than the ptlrpc timeout as it is now. Furthermore, ptlrpc uses a backoff algorithm, which allows it to wait longer for responses. With this trickle-down approach, LNet will be able to cope with that timeout backoff algorithm.

The LNetGet() and LNetPut() APIs will be changed to reflect that.

Code Block
/**
 * Initiate an asynchronous GET operation.
 *
 * On the initiator node, an LNET_EVENT_SEND is logged when the GET request
 * is sent, and an LNET_EVENT_REPLY is logged when the data returned from
 * the target node in the REPLY has been written to local MD.
 * LNET_EVENT_REPLY will have a timeout flag set if the REPLY has not arrived
 * within the timeout provided.
 *
 * On the target node, an LNET_EVENT_GET is logged when the GET request
 * arrives and is accepted into a MD.
 *
 * \param self,target,portal,match_bits,offset See the discussion in LNetPut().
 * \param mdh A handle for the MD that describes the memory into which the
 * requested data will be received. The MD must be "free floating" (See LNetMDBind()).
 * \param timeout The timeout to wait for the REPLY.
 *
 * \retval  0      Success, and only in this case events will be generated
 * and logged to EQ (if it exists) of the MD.
 * \retval -EIO    Simulated failure.
 * \retval -ENOMEM Memory allocation failure.
 * \retval -ENOENT Invalid MD object.
 */
int
LNetGet(lnet_nid_t self, struct lnet_handle_md mdh,
	struct lnet_process_id target, unsigned int portal,
	__u64 match_bits, unsigned int offset, int timeout)

/**
 * Initiate an asynchronous PUT operation.
 *
 * There are several events associated with a PUT: completion of the send on
 * the initiator node (LNET_EVENT_SEND), and when the send completes
 * successfully, the receipt of an acknowledgment (LNET_EVENT_ACK) indicating
 * that the operation was accepted by the target. LNET_EVENT_ACK will have
 * the timeout flag set if an ACK is not received within the timeout provided.
 *
 * The event LNET_EVENT_PUT is used at the target node to indicate the
 * completion of incoming data delivery.
 *
 * The local events will be logged in the EQ associated with the MD pointed to
 * by \a mdh handle. Using a MD without an associated EQ results in these
 * events being discarded. In this case, the caller must have another
 * mechanism (e.g., a higher level protocol) for determining when it is safe
 * to modify the memory region associated with the MD.
 *
 * Note that LNet does not guarantee the order of LNET_EVENT_SEND and
 * LNET_EVENT_ACK, though intuitively ACK should happen after SEND.
 *
 * \param self Indicates the NID of a local interface through which to send
 * the PUT request. Use LNET_NID_ANY to let LNet choose one by itself.
 * \param mdh A handle for the MD that describes the memory to be sent. The MD
 * must be "free floating" (See LNetMDBind()).
 * \param ack Controls whether an acknowledgment is requested.
 * Acknowledgments are only sent when they are requested by the initiating
 * process and the target MD enables them.
 * \param target A process identifier for the target process.
 * \param portal The index in the \a target's portal table.
 * \param match_bits The match bits to use for MD selection at the target
 * process.
 * \param offset The offset into the target MD (only used when the target
 * MD has the LNET_MD_MANAGE_REMOTE option set).
 * \param timeout The timeout to wait for an ACK if one is expected.
 * \param hdr_data 64 bits of user data that can be included in the message
 * header. This data is written to an event queue entry at the target if an
 * EQ is present on the matching MD.
 *
 * \retval  0      Success, and only in this case events will be generated
 * and logged to EQ (if it exists).
 * \retval -EIO    Simulated failure.
 * \retval -ENOMEM Memory allocation failure.
 * \retval -ENOENT Invalid MD object.
 *
 * \see struct lnet_event::hdr_data and lnet_event_kind_t.
 */
int
LNetPut(lnet_nid_t self, struct lnet_handle_md mdh, enum lnet_ack_req ack,
	struct lnet_process_id target, unsigned int portal,
	__u64 match_bits, unsigned int offset, int timeout,
	__u64 hdr_data)

Selection Algorithm

The selection algorithm will be modified to take health into account.

It is possible that a send can fail immediately. In this case we need to take active measures to ensure that we do not enter a tight loop resending until the timeout expires. Such a loop could cause an unexpected spike in CPU consumption.

To do that the last sent time will be kept in the message. If the message is not sent successfully on any of the existing interfaces, then it will be placed on a queue and will be resent after a specific deadline expires. This will be termed a "(re)send procedure". An interval must expire between each (re)send procedure. A (re)send procedure will iterate through all local and remote peers, depending on the source of the send failure. 

The router pinger thread will be refactored to handle resending messages. The router pinger thread is always started, but it only pings gateways if any are configured. Its operation will be expanded to check the pending message queue and re-send messages.

To keep the code simple, the health value of the interface that had the problem will be updated when resending. If we are sending to a non-MR peer, then we will use the same src_nid, otherwise the peer will be confused. The selection algorithm will then consider the health value when it is selecting the NI or peer NI.

There are two aspects to consider when sending a message

  1. The congestion of the local and remote interfaces
  2. The health of the local and remote interfaces

The selection algorithm will take an average of these two values and will determine the best interface to select. To make the comparison proper, the health value of the interface will be set to the same value as the credits on initialization. It will be decremented on failure to send and incremented on successful send.

A health_value_range module parameter will be added to control the sensitivity of the selection. If it is set to 0, then the best interface will be selected. A value higher than 0 will give a range in which to select the interface. If the value is large enough, it will in effect be equivalent to turning off health comparison.
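
One possible reading of the health_value_range parameter is sketched below. It is an assumption, not the design's stated algorithm: interfaces whose health is within the range of the best health are treated as equally healthy, and the one with the most credits is chosen among them. struct ni_view and select_ni() are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical, simplified NI: only the two values compared here. */
struct ni_view {
	int health;   /* health value, higher is healthier */
	int credits;  /* available credits, higher is less congested */
};

/*
 * Pick an NI index. NIs whose health is within 'range' of the best health
 * are treated as equally healthy; among those, the one with the most
 * credits wins. Returns -1 if the array is empty.
 */
int select_ni(const struct ni_view *nis, size_t count, int range)
{
	int best_health, pick = -1;
	size_t i;

	if (count == 0)
		return -1;

	best_health = nis[0].health;
	for (i = 1; i < count; i++)
		if (nis[i].health > best_health)
			best_health = nis[i].health;

	for (i = 0; i < count; i++) {
		if (best_health - nis[i].health > range)
			continue;  /* too unhealthy for this range */
		if (pick < 0 || nis[i].credits > nis[pick].credits)
			pick = (int)i;
	}
	return pick;
}
```

With a range of 0 only the healthiest interfaces are candidates; with a range at least as large as the maximum health value every interface is a candidate, so selection falls back to credits alone, which matches the "turning off health comparison" behavior described above.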

Local Interface Failure

Local interface failures will be detected in one of two ways

  1. Synchronously as a return failure to the call to lnd_send()
  2. Asynchronously as an event that could be detected at a later point.
    1. These asynchronous events can be the result of a send operation
    2. They can also be independent of send operations, as failures are detected with the underlying device, for example a "link down" event.

Synchronous Send Failures

lnet_select_pathway() can fail for the following reasons:

  1. Shutdown in progress
  2. Out of memory
  3. Interrupt signal received
  4. Discovery error
  5. An MD bind failure
    1. -EINVAL
    2. -EHOSTUNREACH
  6. Invalid information given
  7. Message dropped
  8. Aborting message
  9. No route found
  10. Internal failure

Items 1, 2, 5, 6 and 10 are resource errors; it does not make sense to resend the message, as any resend will likely run into the same problem.
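
The classification above could be captured in a small helper. The enum below is hypothetical: its names are invented for illustration and its values simply mirror the positions in the numbered list; LNet itself reports these conditions through errno-style return codes.

```c
#include <stdbool.h>

/* Hypothetical enum; values mirror the numbered list of failure reasons. */
enum send_failure {
	FAIL_SHUTDOWN    = 1,
	FAIL_NOMEM       = 2,
	FAIL_INTERRUPTED = 3,
	FAIL_DISCOVERY   = 4,
	FAIL_MD_BIND     = 5,
	FAIL_INVALID     = 6,
	FAIL_DROPPED     = 7,
	FAIL_ABORTED     = 8,
	FAIL_NO_ROUTE    = 9,
	FAIL_INTERNAL    = 10,
};

/* True for the failures the text classifies as resource errors. */
bool failure_is_resource_error(enum send_failure f)
{
	switch (f) {
	case FAIL_SHUTDOWN:
	case FAIL_NOMEM:
	case FAIL_MD_BIND:
	case FAIL_INVALID:
	case FAIL_INTERNAL:
		return true;   /* a resend would likely hit the same problem */
	default:
		return false;
	}
}
```

The text only states that resource errors should not be resent; whether each of the remaining failures warrants a resend is left open, so the helper deliberately answers only the first question.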

Asynchronous Send Failures

LNet should resend the message:

  1. On LND transmit timeout
  2. On LND connection failure
  3. On LND send failure

LND Interface

LNet shall calculate the message timeout as follows:

message timeout = transaction timeout / retry count

The message timeout will be stored in the lnet_msg structure and passed down to the LND via lnd_send().

LND timeout

An LNet message can be represented by a sequence of LND messages. In the o2iblnd, the PUT and GET are described in the following sequence diagrams.

PUT

[Gliffy diagram: PUT sequence]

GET

[Gliffy diagram: GET Sequence Diagram]

A third type of message that the LND sends is the IBLND_MSG_IMMEDIATE. The data is embedded in the message and posted. There is no handshake in this case.

For the PUT case described in the sequence diagram, the initiator sends two messages:

  1. IBLND_MSG_PUT_REQ
  2. IBLND_MSG_PUT_DONE

Both of these messages are sent using the same tx structure. The tx is allocated and placed on a waiting queue. When the IBLND_MSG_PUT_ACK is received the waiting tx is looked up and used to send the IBLND_MSG_PUT_DONE.

When kiblnd_queue_tx_locked() is called for IBLND_MSG_PUT_REQ it sets the tx_deadline as follows:

Code Block
timeout_ns = *kiblnd_tunables.kib_timeout * NSEC_PER_SEC;
tx->tx_deadline = ktime_add_ns(ktime_get(), timeout_ns);

When kiblnd_queue_tx_locked() is called for IBLND_MSG_PUT_DONE it resets the tx_deadline again.

This presents an obstacle for the LNet Resiliency feature. LNet provides a timeout for the LND as described above. From LNet's perspective this deadline is for the LNet PUT message. However, if we simply use that value for the timeout_ns calculation, then in essence we will be waiting for 2 * LND timeout for the completion of the LNet PUT message. This will mean fewer re-transmits.

Therefore the LND, since it has knowledge of its own protocols, will need to divide the timeout provided by LNet by the number of transmits it needs to do to complete the LNet level message:

  1. LNET_MSG_GET: Requires only IBLND_MSG_GET_REQ. Use the LNet provided timeout as is.
  2. LNET_MSG_PUT: Requires IBLND_MSG_PUT_REQ and IBLND_MSG_PUT_DONE to complete the LNET_MSG_PUT. Use LNet provided timeout / 2
  3. LNET_MSG_GET/LNET_MSG_PUT with < 4K payload: Requires IBLND_MSG_IMMEDIATE. Use LNet provided timeout as is.
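
The three rules can be sketched as one function. This is an illustration only: enum lnet_kind, IMMEDIATE_MAX and lnd_tx_timeout() are hypothetical names, and the 4096-byte threshold is taken from the "< 4K payload" wording above.

```c
#include <stddef.h>

enum lnet_kind { KIND_GET, KIND_PUT };  /* hypothetical stand-ins */

#define IMMEDIATE_MAX 4096  /* assumed threshold for IBLND_MSG_IMMEDIATE */

/*
 * Per-transmit timeout for the o2iblnd, given the timeout LNet passed down.
 * Payloads under 4K go out as a single IBLND_MSG_IMMEDIATE; a GET needs one
 * request; a PUT needs two initiator transmits (request and done).
 */
int lnd_tx_timeout(int lnet_timeout, enum lnet_kind kind, size_t payload)
{
	int transmits = 1;

	if (payload >= IMMEDIATE_MAX && kind == KIND_PUT)
		transmits = 2;

	return lnet_timeout / transmits;
}
```

For example, with an LNet-provided timeout of 10 seconds, a large PUT would get 5 seconds per transmit, while a GET or any small message would keep the full 10 seconds.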

Hard Failure

It's possible that the local interface might get into a hard failure scenario by receiving one of these events from the o2iblnd. socklnd needs to be investigated to determine if there are similar cases:

  • IB_EVENT_DEVICE_FATAL
  • IB_EVENT_PORT_ACTIVE
  • IB_EVENT_PORT_ERR
  • RDMA_CM_EVENT_DEVICE_REMOVAL

In these cases the local interface can no longer be used, so it can not be selected as part of the selection algorithm. If there are no other interfaces available, then no messages can be sent out of the node.

A corresponding event can be received to indicate that the interface is operational again.

A new LNet/LND API will be created to pass these events from the LND to LNet.

Timeouts

LND Detected Timeouts

Upper layers request from LNet to send a GET or a PUT via LNetGet() and LNetPut() APIs. LNet then calls into the LND to complete the operation. The LND encapsulates the LNet message into an LND specific message with its own message type. For example in the o2iblnd it is kib_msg_t.

When the LND transmits the LND message it sets a tx_deadline for that particular transmit. This tx_deadline remains active until the remote has confirmed receipt of the message. Receipt of the message at the remote is when LNet is informed that a message has been received by the LND, done via lnet_parse(), then LNet calls back into the LND layer to receive the message.

Therefore if a tx_deadline is hit, it is safe to assume that the remote end has not received the message. The reasons are described further below.

By handling the tx_deadline properly we are able to account for almost all next-hop failures. LNet would've done its best to ensure that a message has arrived at the immediate next hop.

The tx_deadline is LND-specific, and derived from the timeout (or sock_timeout) module parameter of the LND.

LNet Detected Timeouts

As mentioned above at the LNet layer LNET_MSG_PUT can be told to expect LNET_MSG_ACK to confirm that the LNET_MSG_PUT has been processed by the destination. Similarly LNET_MSG_GET expects an LNET_MSG_REPLY to confirm that the LNET_MSG_GET has been successfully processed by the destination.

The pairs LNET_MSG_PUT+LNET_MSG_ACK and LNET_MSG_GET+LNET_MSG_REPLY are not covered by the tx_deadline in the LND. If the upper layer does not take precautions it could wait forever on the LNET_MSG_ACK or LNET_MSG_REPLY. Therefore it is reasonable to expect that LNet provides a TIMEOUT event if either of these messages is not received within the expected timeout.

The question is whether LNet should resend the LNET_MSG_PUT or LNET_MSG_GET if it doesn't receive the corresponding response.

Consider the case where there are multiple LNet routers between two nodes, N1 and N2. These routers can possibly be routing between different hardware, for example OPA and MLX. N1 via the LND can reliably determine the health of the next-hop's interfaces. It can not however reliably determine the health of further hops in the chain. Each node can determine the health of the immediate next-hops. Therefore, each node in the path can be trusted to ensure that the message has arrived at the immediate next hop.

If there is a failure along the path and N1 does not receive the expected LNET_MSG_ACK or LNET_MSG_REPLY, and it knows that the message has been received by its next-hop, it has no way to determine where the failure happened. If it decides to resend the message, then there is no way to reliably select a reasonable peer_ni. Especially considering that the message has in fact been received properly by the next-hop. We can then say that we will simply try all the peer_nis of the destination. But in fact this will already be done by the node in the chain which is encountering a problem completing the message with its next-hop. So the net effect is the same. If both are implemented, then duplication of messages is a certainty.

Furthermore, the responsibility for end-to-end reliability falls on the layers using LNet. Ptlrpc's design clearly takes the end-to-end reliability of RPCs into consideration. If LNet adds an LNET_ACK_TIMEOUT and LNET_REPLY_TIMEOUT (or adds an error status to the current events), then ptlrpc can react to the error status appropriately.

The argument against this approach is mixed clusters, where not all nodes are MR capable. In this case we can not rely on intermediary nodes to try all the interfaces of their next-hop. However, as is assumed in the Multi-Rail design, if not all nodes are MR capable, then not all Multi-Rail features are expected to work.

This approach would add the LNet resiliency required and avoid the many corner cases that will need to be addressed when receiving messages which have already been processed.

Relationship between different timeouts

There are multiple timeouts kept at different layers of the code. It is important to set the timeout defaults such that they work well together, and to give guidance on how the different timeouts interact.

Looking at timeouts from a bottom up approach:

  1. IB/TCP/GNI re-send timeout
  2. LND transmit timeout
    1. The timeout to wait for before a transmit fails and lnet_finalize() is called with an appropriate error code. This will result in a resend.
  3. transaction timeout
    1. A PUT or a GET can be sent successfully. LNet needs to wait on the ACK/REPLY respectively.
    2. The transaction timeout defines the amount of time to wait before sending a timeout event upwards.
    3. This value is user specified and defaults to the peer_timeout default (180s).
    4. This value can be overridden by the caller of LNetGet()/LNetPut()
  4. Message timeout
    1. The timeout after which LNet abandons resending a message.
  5. Resend interval
    1. The interval between each (re)send procedure.
  6. RPC timeout
    1. ptlrpc level timeouts.
    2. The INITIAL_CONNECT_TIMEOUT is set to 5 sec
    3. ldlm_timeout and obd_timeout are tunables and default to LDLM_TIMEOUT_DEFAULT and OBD_TIMEOUT_DEFAULT.

IB/TCP/GNI re-send timeout < LND transmit timeout < LNet message timeout < LNet transaction timeout < RPC timeout

A retry count can be specified. That's the number of times to resend after the LND transmit timeout expires.

The timeout value before failing an LNET_MSG_[PUT | GET] will be:

message timeout = (retry count * LND transmit timeout) + (resend interval * retry count)

where

retry count = min(retry count, 5)

message timeout <= transaction timeout
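
The formula and its two constraints can be worked through in a short sketch. message_timeout() and MAX_RETRY_COUNT are hypothetical names, all values are in seconds, and clamping to the transaction timeout is just one assumed way of enforcing the last constraint.

```c
#define MAX_RETRY_COUNT 5  /* upper bound on the retry count, per the text */

/*
 * message timeout = (retry count * LND transmit timeout)
 *                 + (resend interval * retry count)
 * with retry count capped at 5 and the result capped at the
 * transaction timeout.
 */
int message_timeout(int retry_count, int lnd_timeout, int resend_interval,
		    int transaction_timeout)
{
	int timeout;

	if (retry_count > MAX_RETRY_COUNT)
		retry_count = MAX_RETRY_COUNT;

	timeout = retry_count * lnd_timeout + retry_count * resend_interval;

	if (timeout > transaction_timeout)
		timeout = transaction_timeout;  /* assumed enforcement */

	return timeout;
}
```

For instance, 3 retries with a 10-second LND transmit timeout and a 1-second resend interval give 3 * 10 + 3 * 1 = 33 seconds, well inside a 180-second transaction timeout. With a 50-second LND transmit timeout and 5 retries the unclamped value is already 250 seconds before any resend interval is added, which exceeds 180 seconds.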

It has been observed that mount could hang for a long time if a discovery ping is not responded to. This could happen if an OST is down while a client mounts the File System. In this case it does not make sense to hold up the mount procedure while discovery is taking place. For some cases, like discovery, the algorithm would specify a timeout different from the configured one.

Other cases where a timeout can be specified which overrides the configured timeout are router ping and manual ping.

One issue to consider is that the LND transmit timeout currently defaults to 50s. If we retry up to five times, we could be held up for 250s, which would be unacceptable.

The question to answer is, does it make sense for the LND transmit timeout to be set to 50s? Even though the IB/TCP/GNI timeout can be long, it might make more sense to pre-empt that communication stack and attempt to resend the message from the LNet layer on a different interface, or even reuse the same interface if only one is available.

Resiliency vs. Reliability

There are two concepts that need to stay separate. Reliability of RPC messages and LNet Resiliency. This feature attempts to add LNet Resiliency against local and immediate next hop interface failure. End-to-end reliability is to ensure that upper layer messages, namely RPC messages, are received and processed by the final destination, and take appropriate action in case this does not happen. End-to-end reliability is the responsibility of the application that uses LNet, in this case ptlrpc. Ptlrpc already has a mechanism to ensure this.

To clarify the terminology further, LNET MESSAGE should be used to describe one of the following messages:

  • LNET_MSG_PUT
  • LNET_MSG_GET
  • LNET_MSG_ACK
  • LNET_MSG_REPLY
  • LNET_MSG_HELLO

LNET TRANSACTION should be used to describe 

  • LNET_MSG_PUT, LNET_MSG_ACK sequence
  • LNET_MSG_GET, LNET_MSG_REPLY sequence

NEXT-HOP should describe a peer that is exactly one hop away.

The role of LNet is to ensure that an LNET MESSAGE arrives at the NEXT-HOP, and to flag when a transaction fails to complete.

Upper layers should ensure that the transactions they initiate complete successfully, and take appropriate action otherwise.

Reasons for timeout

The discussion here refers to the LND Transmit timeout.

Timeouts could occur due to several reasons:

  1. The message is on the sender's queue and is not posted within the timeout
    1. This indicates that the local interface is too busy and is unable to process the messages on its queue.
  2. The message is posted but the transmit is never completed
    1. An actual culprit can not be determined in this scenario. It could be a sender issue, a receiver issue or a network issue.
  3. The message is posted, the transmit is completed, but the remote never acknowledges.
    1. In the IBLND, there are explicit acknowledgements in most cases when the message is received and forwarded to the LNet layer. Look below for more details.
    2. If an LND message is in waiting state and it didn't receive the expected response, then this indicates an issue at the remote's LND, either at the lower protocol, IB/TCP, or the notification at the LNet layer is not being processed in a timely fashion.

Each of these scenarios can be handled differently

Desired Behavior

The desired behavior is listed for each of the above scenarios:

Scenario 1 - Message not posted

  1. Connection is closed
  2. The local interface health is updated
  3. Failure statistics incremented
  4. A resend is issued on a different local interface if there is one available.
  5. If no other local interface is present, or all are in failed mode, then the send fails.

Scenario 2 - Transmit not completed

  1. Connection is closed
  2. The local and remote interface health is updated
  3. Failure statistics incremented on both local and remote
  4. A resend is issued on a different path altogether if there is one available.
  5. If no other path is present then the send fails.

Scenario 3 - No acknowledgement by remote

  1. Connection is closed
  2. The remote interface health is updated
  3. Failure statistics incremented
  4. A resend is issued on a different remote interface if there is one available.
  5. If no other remote interface is present then the send fails.

Note that the behavior outlined is consistent with the explicit error cases identified in the previous section. Only Scenario 2 diverges, as a different path is selected altogether, but the same code structure is still used.
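
The difference between the three scenarios comes down to which side's health and statistics are updated. A minimal sketch, with hypothetical names (enum tx_timeout_reason, struct health_update, health_update_for()):

```c
#include <stdbool.h>

/* Hypothetical names for the three LND transmit timeout scenarios. */
enum tx_timeout_reason {
	TX_NOT_POSTED,     /* scenario 1: message never left the local queue */
	TX_NOT_COMPLETED,  /* scenario 2: posted, transmit never completed */
	TX_NO_ACK,         /* scenario 3: completed, remote never acknowledged */
};

/* Which interfaces get their health and failure statistics updated. */
struct health_update {
	bool local;
	bool remote;
};

struct health_update health_update_for(enum tx_timeout_reason reason)
{
	struct health_update u = { false, false };

	switch (reason) {
	case TX_NOT_POSTED:
		u.local = true;   /* local interface too busy */
		break;
	case TX_NOT_COMPLETED:
		u.local = true;   /* culprit unknown: blame both ends */
		u.remote = true;
		break;
	case TX_NO_ACK:
		u.remote = true;  /* remote LND did not respond */
		break;
	}
	return u;
}
```

The resend target follows the same pattern: a different local interface for scenario 1, a different remote interface for scenario 3, and a different path on both ends for scenario 2.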

Implementation Specifics

All of these cases should end up calling lnet_finalize() API with the proper return code. lnet_finalize() will be the funnel where all these events shall be processed in a consistent manner. When the message is completed via lnet_complete_msg_locked(), the error is checked and the proper behavior as described above is executed.

Peer_timeout

In the cases when a GET or a PUT transaction is initiated an associated deadline needs to be tagged to the corresponding transaction. This deadline indicates how long LNet should wait for a REPLY or an ACK before it times out the entire transaction.

A new thread is required to check if a transaction deadline has expired. OW: Can a timer do this? Or is one timer per message too resource-intensive? If a queue is used, then ideally new messages can simply be added to the tail, with their deadline always >= the current tail. With the queue sorted by deadline the checker thread can look at the deadline of the message at the head of the queue to determine how long it sleeps.

When a transaction deadline expires an appropriate event is generated towards PTLRPC.

When the REPLY or the ACK is received, the message is removed from the check queue of the thread and a success event is generated towards PTLRPC.

Within a transaction deadline, if there is a determination that the GET or PUT message failed to be sent to the next-hop then the GET or PUT can be resent.

OW: How is this deadline determined? Naming this section peer_timeout suggests you want to use that? Conceptually we can distinguish between an LNet transaction timeout and an LNet peer timeout.

Resend Window

Resends are terminated when the peer_timeout for a message expires.

Resends should also terminate if all local_nis and/or peer_nis are in bad health. New messages can still use paths that have less than optimal health.

A message is resent after the LND transmit deadline expires, or on a failure return code. Both these paths are handled in the same manner, since a transmit deadline triggers a call to lnet_finalize(). Both inline and asynchronous errors also end up in lnet_finalize().

Therefore the least number of transmits = peer_timeout / LND transmit deadline.

Depending on the frequency of errors, LNet may do more re-transmits. LNet will stop re-transmitting and declare a peer dead, if the peer_timeout expires or all the different paths have been tried with no success.

In the default case where LND transmit timeout is set to 50 seconds and the peer_timeout is set to 180 seconds, then LNet will re-transmit 3 times before it declares the peer dead.

peer_timeout can be increased to fit in more re-transmits or LND transmit timeout can be decreased.

Alexey Lyashkov made a presentation at LAD 16 that outlines the best values for all Lustre timeouts. It can be accessed here.

Locking

MD is always protected by the lnet_res_lock, which is CPT specific.

Other data structures such as the_lnet.ln_msg_containers, peer_ni, local ni, etc are protected by the lnet_net_lock.

The MD should be kept intact during the resend procedure. If there is a failure to resend then the MD should be released and message memory freed.

Selection Algorithm with Health

Algorithm Parameters

...

Failure Areas

There are three areas of failures that LNet needs to deal with:

  1. Local Interface failure
  2. Remote Interface failure
  3. Timeouts
    1. LND detected Timeout
    2. LNet detected Timeout

Timeout values will be provided by the ULP in the LNetPut() and LNetGet() APIs.

Health Value Updates

Two values will be added:

  1. health_value: Each NI (local and remote) will have a health value
    1. The health_value will be initialized to 1000
      1. 1000 is chosen to allow fine-grained selection between interfaces based on the value. Otherwise it is arbitrary.
    2. When a transient error is detected on an interface, such as a timeout, the health_value is decremented by health_sensitivity
  2. health_sensitivity: This is a global configuration parameter. It determines how long an NI takes to recover or how sensitive a system is to message send failure.
    1. An NI's health_value is decremented by health_sensitivity on a transient error.
    2. An NI is then placed on a queue to recover.
    3. While on the recovery queue, the NI is pinged (or pinged from) once a second.
    4. Every successful ping increments the NI's health_value by 1.
    5. It takes health_sensitivity pings to bring the interface back to its original health status.
    6. If a ping fails during the recovery process the health_value is decremented further by health_sensitivity.
      1. This will ensure that an unstable NI which has frequent errors, will be preferred less.
    7. The health_sensitivity can be set to 0 to turn off health evaluation.
      1. That means that an interface will remain healthy no matter what happens.
      2. Basically turn off NI selection based on health.
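
The update rules above can be sketched as two small helpers. struct ni_health and the function names are hypothetical; the floor at 0 is an assumption, since the text does not say what happens once repeated failures exhaust the value.

```c
#define HEALTH_MAX 1000  /* initial, fully healthy value from the text */

/* Hypothetical per-NI health state. */
struct ni_health {
	int value;
};

/* Transient error or failed recovery ping: drop by health_sensitivity. */
void health_on_failure(struct ni_health *h, int sensitivity)
{
	h->value -= sensitivity;
	if (h->value < 0)
		h->value = 0;  /* assumed floor, not stated in the design */
}

/* Successful recovery ping: regain one point, up to the maximum. */
void health_on_ping_ok(struct ni_health *h)
{
	if (h->value < HEALTH_MAX)
		h->value++;
}
```

With health_sensitivity set to 100, a single failure drops the value to 900 and it takes 100 successful one-second pings to return to 1000. With health_sensitivity set to 0 a failure leaves the value unchanged, which is how the parameter turns health-based selection off.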

Each NI will have a health_value associated with it. Each NI's health value is initialized to 1000

There are two types of errors that could occur on an NI:

  1. Hard failures: These are failures communicated by the underlying device driver to the LND and in turn the LND propagates it up to LNet
  2. Transient failures: These are failures such as timeouts on the system.

Hard Failures

Hard failures only apply to local interfaces, since there is no way to know if a remote interface has encountered one.

It's possible that the local interface might get into a hard failure scenario by receiving one of these events from the o2iblnd. socklnd needs to be investigated to determine if there are similar cases:

  • IB_EVENT_DEVICE_FATAL
  • IB_EVENT_PORT_ACTIVE
  • IB_EVENT_PORT_ERR
  • RDMA_CM_EVENT_DEVICE_REMOVAL

In these cases the local interface cannot be used any longer, so it cannot be selected as part of the selection algorithm. If there are no other interfaces available, then no messages can be sent out of the node.

A corresponding event can be received to indicate that the interface is operational again.

A new LNet/LND API will be created to pass these events from the LND to LNet.

Transient failures

Transient Interface failures will be detected in one of two ways

  1. Synchronously as a return failure to the call to lnd_send()
  2. Asynchronously as an event that could be detected at a later point.
    1. These asynchronous events can be the result of a send operation.

Synchronous Send Failures

lnet_select_pathway() can fail for the following reasons:

  1. Shutdown in progress
  2. Out of memory
  3. Interrupt signal received
  4. Discovery error.
  5. An MD bind failure
    1. -EINVAL
    2. -EHOSTUNREACH
  6. Invalid information given
  7. Message dropped
  8. Aborting message
  9. no route found
  10. Internal failure

Failures 1, 2, 5, 6 and 10 are resource errors; it does not make sense to resend the message, since any resend will likely run into the same problem.
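A minimal sketch of this classification follows. The enum and its constant names are hypothetical and simply mirror the numbering of the list above; they are not existing LNet definitions.

```c
#include <stdbool.h>

/* Hypothetical reasons, numbered to match the list of failures above. */
enum send_failure {
	SEND_FAIL_SHUTDOWN	= 1,
	SEND_FAIL_NOMEM		= 2,
	SEND_FAIL_INTERRUPTED	= 3,
	SEND_FAIL_DISCOVERY	= 4,
	SEND_FAIL_MD_BIND	= 5,
	SEND_FAIL_INVALID_INFO	= 6,
	SEND_FAIL_MSG_DROPPED	= 7,
	SEND_FAIL_MSG_ABORTED	= 8,
	SEND_FAIL_NO_ROUTE	= 9,
	SEND_FAIL_INTERNAL	= 10,
};

/* True for the failures the text calls resource errors (1, 2, 5, 6, 10). */
static bool send_failure_is_resource_error(enum send_failure reason)
{
	switch (reason) {
	case SEND_FAIL_SHUTDOWN:
	case SEND_FAIL_NOMEM:
	case SEND_FAIL_MD_BIND:
	case SEND_FAIL_INVALID_INFO:
	case SEND_FAIL_INTERNAL:
		return true;
	default:
		return false;
	}
}
```

A caller would skip the resend path whenever this predicate returns true.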

Asynchronous Send Failures

LNet should resend the message:

  1. On LND transmit timeout
  2. On LND connection failure
  3. On LND send failure

Resend Handling

When there is a message send failure due to the reasons outlined above, the behavior should be as follows:

  1. The local or remote interface health is decremented
  2. Failure statistics incremented
  3. A resend is issued on a different local interface if there is one available. If there is none available attempt the same interface again.
  4. The message will continuously be resent until one of the following criteria is fulfilled:
    1. Message is completed successfully.
    2. Retry-count is reached
    3. Transaction timeout expires

Two new fields will be added to lnet_msg:

  1. msg_status - bit field that indicates the type of failure which requires a resend
  2. msg_deadline - the deadline for the message calculated by,  send time + transaction timeout


Code Block
struct lnet_msg {
...
	__u32 msg_status;	/* bit field: type of failure which requires a resend */
	ktime_t msg_deadline;	/* send time + transaction timeout */
...
};
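As a rough userspace sketch of how the deadline and the three resend-termination criteria fit together: struct msg_sketch, msg_retry_count, msg_completed and the helper names are hypothetical, and plain nanosecond counters stand in for the kernel time type.

```c
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

/* Hypothetical stand-in for the relevant lnet_msg fields. */
struct msg_sketch {
	int64_t		msg_deadline_ns;	/* send time + transaction timeout */
	unsigned int	msg_retry_count;	/* resends issued so far (assumed field) */
	bool		msg_completed;		/* completed successfully (assumed field) */
};

/* deadline = send time + transaction timeout */
static int64_t msg_deadline(int64_t send_time_ns, int transaction_timeout_sec)
{
	return send_time_ns + (int64_t)transaction_timeout_sec * NSEC_PER_SEC;
}

/*
 * A resend is allowed only while none of the termination criteria holds:
 * message completed, retry count reached, or transaction timeout expired.
 */
static bool msg_may_resend(const struct msg_sketch *msg, int64_t now_ns,
			   unsigned int max_retries)
{
	if (msg->msg_completed)
		return false;
	if (msg->msg_retry_count >= max_retries)
		return false;
	if (now_ns >= msg->msg_deadline_ns)
		return false;
	return true;
}
```

Note that a max_retries of 0 makes msg_may_resend() return false on the first failure, which matches the intent of turning retries off.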

When a message encounters one of the errors above, the LND will update the msg_status field appropriately and call lnet_finalize()

lnet_finalize() will check if the message has timed out or if it needs to be resent and will take action on it. lnet_finalize() currently calls lnet_complete_msg_locked() to continue the processing. If the message has not been sent, then lnet_finalize() should call another function to resend, lnet_resend_msg_locked().

lnet_resend_msg_locked() shall queue the message on a resend queue and wake up a thread responsible for resending messages, the monitor thread portrayed in the above diagram.

When a message is initially sent it is tagged with a deadline and placed on the active queue. If the message is not completed by that deadline it will be finalized and removed from the active queue. A timeout event will be passed to the ULP.

If the LND times out and LNet attempts to resend, it'll place the message on the resend queue. A message can be on both the active and resend queues.

As shown in the diagram above both lnet_send() and lnet_parse() put messages on the active queue. lnet_finalize() consumes messages off the active queue when it's time to decommit them.

When the LND calls lnet_finalize() on a timed out message, lnet_finalize() will put the message on the resend queue and wake up the monitor thread.

The Monitor Thread

The router checker thread will be refactored to fulfill the following responsibilities:

  1. Check the active queue once per second for expired messages
    1. The monitor thread will wake up every second and check the top of the active queue, i.e. the oldest message on the list. If that message has expired it updates its status to TIMEDOUT and finalizes it. Finalizing the message will include removing it from the active queue and the resend queue. It then moves on to the next message on the list and stops once it finds a message that has not expired.
  2. Check if there are any messages to resend on the resend_queue
    1. If there are any messages queued, it'll call lnet_send() on each one.
  3. Check if there are any peers on the local_ni recovery queue.
    1. local_nis are a bit tricky to recover. How do you determine if a local NI is good again? Do we ping a random peer NI on the same network as the local NI? If so, what if that peer NI has a problem of its own? We could be introducing other failure handling not associated with the local NI recovery during its recovery process.
    2. Best approach at this time is to ping itself.
      1. Pinging itself will force the ping message to travel down the entire stack, LND, Verbs/TCP and IB/HFI/IP. This should be sufficient to determine if the interface has recovered from the transient error encountered.
      2. The time delay to recover the interface will also allow for the LNDs queue to empty out under congestion.
  4. Check if there are any peers on the remote_ni recovery queue
    1. ping the remote ni
    2. Unfortunately, that could result in using an unhealthy local NI, but there is no way around that.
      1. In that case we will manage the health_value of the local NI and remote NI as described above.
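The active queue scan in step 1 can be sketched as below. This is a simplified illustration that assumes the queue is a singly linked list ordered oldest first and that a message counts as expired once its deadline is less than or equal to the current time; struct amsg and active_queue_expire() are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

enum { MSG_OK = 0, MSG_TIMEDOUT = 1 };

/* Hypothetical active-queue entry; the oldest message sits at the head. */
struct amsg {
	int64_t		deadline;
	int		status;
	struct amsg	*next;
};

/*
 * Walk the queue from the head, mark expired messages as TIMEDOUT and
 * unlink them, and stop at the first message that has not expired.
 * Returns the number of messages expired in this pass.
 */
static int active_queue_expire(struct amsg **head, int64_t now)
{
	int expired = 0;

	while (*head != NULL && (*head)->deadline <= now) {
		struct amsg *msg = *head;

		*head = msg->next;	/* remove from the active queue */
		msg->next = NULL;
		msg->status = MSG_TIMEDOUT;
		expired++;
	}
	return expired;
}
```

Because the scan stops at the first unexpired message, the cost of each once-per-second pass is proportional to the number of expired messages rather than to the queue length.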

The assumption is that under normal circumstances the number of re-sends should be low, so the thread will not add any logic to pace out the resend rate, such as what lnet_finalize() does.

In case of immediate failures, for example route failure, the message will not make it onto the network. There is a risk that an immediate failure could trigger a burst of resends for the message. This could be exacerbated if there is only one interface in the system.

This will be mitigated by having a maximum retry count. This is a configured value and will cap the number of resends in this case.

Setting the retry count to 0 will turn off retries completely; a message will then fail and the failure will be propagated up on the first failure encountered.

It is possible that a message can be on the resend queue when it either completes or times out. In both of these cases it will be removed from the resend queue as well as the active queue and finalized.

Protection

The message will continue to be protected by the LNet net CPT lock to ensure mutually exclusive access.

When the message is committed, lnet_msg_commit(), the message cpt is assigned. This cpt value is then used to protect the message in subsequent usages. Relevant to this discussion is when the message is examined in lnet_finalize() and in the monitor thread and either removed from the active queue or placed on the resend queue.

API Changes

The ULP will provide the transaction timeout value on which LNet will base its own timeout values. In the absence of that LNet will fall back on a configurable transaction timeout value.

This trickle down approach will simplify the configuration of the LNet Resiliency feature, as well as make the timeout consistent throughout the system, instead of configuring the LND timeout to be much larger than the ptlrpc timeout as it is now. Furthermore, ptlrpc uses a backoff algorithm, which allows it to wait longer for responses. With this trickle down approach, LNet will be able to cope with that timeout backoff algorithm.

The LNetGet() and LNetPut() APIs will be changed to reflect that.

Code Block
/**
 * Initiate an asynchronous GET operation.
 *
 * On the initiator node, an LNET_EVENT_SEND is logged when the GET request
 * is sent, and an LNET_EVENT_REPLY is logged when the data returned from
 * the target node in the REPLY has been written to local MD.
 * LNET_EVENT_REPLY will have a timeout flag set if the REPLY has not arrived
 * within the timeout provided.
 *
 * On the target node, an LNET_EVENT_GET is logged when the GET request
 * arrives and is accepted into a MD.
 *
 * \param self,target,portal,match_bits,offset See the discussion in LNetPut().
 * \param mdh A handle for the MD that describes the memory into which the
 * requested data will be received. The MD must be "free floating" (See LNetMDBind()).
 * \param timeout The timeout to wait for the REPLY.
 *
 * \retval  0      Success, and only in this case events will be generated
 * and logged to EQ (if it exists) of the MD.
 * \retval -EIO    Simulated failure.
 * \retval -ENOMEM Memory allocation failure.
 * \retval -ENOENT Invalid MD object.
 */
int
LNetGet(lnet_nid_t self, struct lnet_handle_md mdh,
	struct lnet_process_id target, unsigned int portal,
	__u64 match_bits, unsigned int offset, int timeout)

/**
 * Initiate an asynchronous PUT operation.
 *
 * There are several events associated with a PUT: completion of the send on
 * the initiator node (LNET_EVENT_SEND), and when the send completes
 * successfully, the receipt of an acknowledgment (LNET_EVENT_ACK) indicating
 * that the operation was accepted by the target. LNET_EVENT_ACK will have
 * the timeout flag set if an ACK is not received within the timeout provided.
 *
 * The event LNET_EVENT_PUT is used at the target node to indicate the
 * completion of incoming data delivery.
 *
 * The local events will be logged in the EQ associated with the MD pointed to
 * by \a mdh handle. Using a MD without an associated EQ results in these
 * events being discarded. In this case, the caller must have another
 * mechanism (e.g., a higher level protocol) for determining when it is safe
 * to modify the memory region associated with the MD.
 *
 * Note that LNet does not guarantee the order of LNET_EVENT_SEND and
 * LNET_EVENT_ACK, though intuitively ACK should happen after SEND.
 *
 * \param self Indicates the NID of a local interface through which to send
 * the PUT request. Use LNET_NID_ANY to let LNet choose one by itself.
 * \param mdh A handle for the MD that describes the memory to be sent. The MD
 * must be "free floating" (See LNetMDBind()).
 * \param ack Controls whether an acknowledgment is requested.
 * Acknowledgments are only sent when they are requested by the initiating
 * process and the target MD enables them.
 * \param target A process identifier for the target process.
 * \param portal The index in the \a target's portal table.
 * \param match_bits The match bits to use for MD selection at the target
 * process.
 * \param offset The offset into the target MD (only used when the target
 * MD has the LNET_MD_MANAGE_REMOTE option set).
 * \param timeout The timeout to wait for an ACK if one is expected.
 * \param hdr_data 64 bits of user data that can be included in the message
 * header. This data is written to an event queue entry at the target if an
 * EQ is present on the matching MD.
 *
 * \retval  0      Success, and only in this case events will be generated
 * and logged to EQ (if it exists).
 * \retval -EIO    Simulated failure.
 * \retval -ENOMEM Memory allocation failure.
 * \retval -ENOENT Invalid MD object.
 *
 * \see struct lnet_event::hdr_data and lnet_event_kind_t.
 */
int
LNetPut(lnet_nid_t self, struct lnet_handle_md mdh, enum lnet_ack_req ack,
	struct lnet_process_id target, unsigned int portal,
	__u64 match_bits, unsigned int offset, int timeout,
	__u64 hdr_data)

Selection Algorithm

The selection algorithm will be modified to take health into account and will operate according to the following logic:

Code Block
for every peer_net in peer {
	local_net = peer_net

	if peer_net is not local
		select a router
		local_net = router->net
 
	for every local_ni on local_net
		check if local_ni has best health_value
		check if local_ni is nearest MD NUMA
		check if local_ni has the most available credits
		check if we need to use round robin selection
		if above criteria are satisfied
			best_ni = local_ni

	for every peer_ni on best_ni->net
		check if peer_ni has best health value
		check if peer_ni has the most available credits
		check if we need to use round robin selection
		if above criteria are satisfied
			best_peer_ni = peer_ni

	send(best_ni, best_peer_ni)
}

The above algorithm will always prefer the NIs that are the most healthy. This is important because dropping even one message will likely result in client evictions. So it is important to always ensure we're using the best path possible.
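One way to read the local NI loop is as a chain of tie-breakers: health first, then NUMA distance, then available credits, then round robin. The sketch below illustrates that ordering only. struct ni_cand, its fields and the seq counter used for round robin are hypothetical and do not reflect the actual LNet structures.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical candidate description used only for this sketch. */
struct ni_cand {
	int		health_value;	/* higher is healthier */
	int		numa_distance;	/* lower is closer to the MD */
	int		credits;	/* higher means more available credits */
	unsigned int	seq;		/* round robin: lower was used less recently */
};

/* Returns true if candidate a should be preferred over candidate b. */
static bool ni_is_better(const struct ni_cand *a, const struct ni_cand *b)
{
	if (a->health_value != b->health_value)
		return a->health_value > b->health_value;
	if (a->numa_distance != b->numa_distance)
		return a->numa_distance < b->numa_distance;
	if (a->credits != b->credits)
		return a->credits > b->credits;
	return a->seq < b->seq;
}

/* Scan all candidates and keep the best one; NULL if there are none. */
static const struct ni_cand *ni_select_best(const struct ni_cand *nis,
					    size_t count)
{
	const struct ni_cand *best = NULL;
	size_t i;

	for (i = 0; i < count; i++) {
		if (best == NULL || ni_is_better(&nis[i], best))
			best = &nis[i];
	}
	return best;
}
```

The same comparison, without the NUMA step, would apply to the peer_ni loop.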

LND Interface

LNet shall calculate the message timeout as follows:

message timeout = transaction timeout / retry count

The message timeout will be stored in the lnet_msg structure and passed down to the LND via lnd_send().
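A small sketch of the calculation follows, with two assumptions that the text does not spell out: a retry count of 0 (retries disabled) gives the single attempt the whole transaction timeout, and the result is never allowed to round down to 0 seconds. The function name is hypothetical.

```c
/*
 * message timeout = transaction timeout / retry count, in seconds.
 * Assumptions: retry_count <= 0 means a single attempt that may use the
 * whole transaction timeout, and the result is at least 1 second.
 */
static int message_timeout_sketch(int transaction_timeout, int retry_count)
{
	int timeout;

	if (retry_count <= 0)
		return transaction_timeout;

	timeout = transaction_timeout / retry_count;
	return timeout > 0 ? timeout : 1;
}
```

For example, a 50 second transaction timeout with a retry count of 5 gives each attempt 10 seconds.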

LND Transmits (o2iblnd specific discussion)

ULP requests from LNet to send a GET or a PUT via LNetGet() and LNetPut() APIs. LNet then calls into the LND to complete the operation. The LND can complete the LNet PUT/GET via a set of LND messages as shown in the diagrams below.

When the LND transmits the LND message it sets a tx_deadline for that particular transmit. If an acknowledgment is expected, this tx_deadline remains active until the remote has confirmed receipt of the message; if no acknowledgment is expected, the tx_deadline is cleared when the tx is completed. Receipt of the message at the remote is the point at which LNet is informed, via lnet_parse(), that a message has been received by the LND; LNet then calls back into the LND layer to receive the message.

By handling the tx_deadline properly we are able to account for almost all next-hop failures. LNet would've done its best to ensure that a message has arrived at the immediate next hop.

The tx_deadline is LND-specific, and derived from the timeout (or sock_timeout) module parameter of the LND.

Gliffy Diagram: o2iblnd TX FSM

LND timeout

PUT

Gliffy Diagram: PUT sequence

GET

Gliffy Diagram: GET Sequence Diagram

A third type of message that the LND sends is the IBLND_MSG_IMMEDIATE. The data is embedded in the message and posted. There is no handshake in this case.

For the PUT case described in the sequence diagram, the initiator sends two messages:

  1. IBLND_MSG_PUT_REQ
  2. IBLND_MSG_PUT_DONE

Both of these messages are sent using the same tx structure. The tx is allocated and placed on a waiting queue. When the IBLND_MSG_PUT_ACK is received the waiting tx is looked up and used to send the IBLND_MSG_PUT_DONE.

When kiblnd_queue_tx_locked() is called for IBLND_MSG_PUT_REQ it sets the tx_deadline as follows:

Code Block
timeout_ns = *kiblnd_tunables.kib_timeout * NSEC_PER_SEC;
tx->tx_deadline = ktime_add_ns(ktime_get(), timeout_ns);

When kiblnd_queue_tx_locked() is called for IBLND_MSG_PUT_DONE it resets the tx_deadline again.

This presents an obstacle for the LNet Resiliency feature. LNet provides a timeout for the LND as described above. From LNet's perspective this deadline is for the LNet PUT message. However, if we simply use that value for the timeout_ns calculation, then in essence we will be waiting for 2 * LND timeout for the completion of the LNet PUT message. This will mean fewer re-transmits.

Therefore the LND, since it has knowledge of its own protocols, will need to divide the timeout provided by LNet by the number of transmits it needs to do to complete the LNet level message:

  1. LNET_MSG_GET: Requires only IBLND_MSG_GET_REQ. Use the LNet provided timeout as is.
  2. LNET_MSG_PUT: Requires IBLND_MSG_PUT_REQ and IBLND_MSG_PUT_DONE to complete the LNET_MSG_PUT. Use LNet provided timeout / 2
  3. LNET_MSG_GET/LNET_MSG_PUT with < 4K payload: Requires IBLND_MSG_IMMEDIATE. Use LNet provided timeout as is.
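The division can be sketched as follows. The enum, the 4096 byte threshold constant and the function names are hypothetical. The threshold is taken from the "< 4K payload" wording above and is not a value read from the o2iblnd code.

```c
#include <stddef.h>

/* Hypothetical names for this sketch only. */
enum lnet_op_sketch { OP_GET, OP_PUT };

#define IMMEDIATE_THRESHOLD 4096	/* "< 4K payload" uses IBLND_MSG_IMMEDIATE */

/* Number of timed LND transmits needed to complete one LNet message. */
static int lnd_transmit_count(enum lnet_op_sketch op, size_t payload_len)
{
	if (payload_len < IMMEDIATE_THRESHOLD)
		return 1;		/* IBLND_MSG_IMMEDIATE */
	if (op == OP_PUT)
		return 2;		/* IBLND_MSG_PUT_REQ + IBLND_MSG_PUT_DONE */
	return 1;			/* IBLND_MSG_GET_REQ */
}

/* Per-transmit LND timeout derived from the LNet provided message timeout. */
static int lnd_timeout_sketch(int msg_timeout, enum lnet_op_sketch op,
			      size_t payload_len)
{
	return msg_timeout / lnd_transmit_count(op, payload_len);
}
```

With a 10 second message timeout, a large PUT gets 5 seconds per transmit, while a GET or an immediate message keeps the full 10 seconds.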

System Timeouts

There are multiple timeouts kept at different layers of the code. The LNet Resiliency feature will attempt to reduce the complexity and ambiguity of setting the timeouts in the system.

This will be done by using a trickle down approach as mentioned before. The top level transaction timeout will be provided to LNet for each PUT/GET send request. If one is not provided LNet will use a configurable default.

LNet will calculate the following timeouts from the transaction timeout:

  1. Message timeout = Transaction timeout / retry count
  2. LND timeout = Message timeout / number of LND messages used to complete an LNet PUT/GET

Caveat

One thing I'm worried about are cases where we see timeouts in tickets, for example LU-10831 (HPDD Community Jira). The LND timeout is currently set at 50 seconds. So it appears that there exist cases where the transaction can remain around for over 50 seconds. The reason that occurs is not clear to me. Also, the fact that ptlrpc is not reacting to these lengthy delays needs investigation. Or is it reacting in parallel to LNet? That would mean LNet can be waiting for transmits to complete even after ptlrpc has abandoned the associated RPC messages.

This could be a potential problem in the field: would we see a spike in timeout failures once this feature is used on sites, especially when one of the goals of this feature is to tighten the deadlines?

Without large systems to test this feature, it's difficult to gauge its impact in heavy traffic clusters.

Implementation Specifics

Reasons for timeout

The discussion here refers to the LND Transmit timeout.

Timeouts could occur due to several reasons:

  1. The message is on the sender's queue and is not posted within the timeout
    1. This indicates that the local interface is too busy and is unable to process the messages on its queue.
  2. The message is posted but the transmit is never completed
    1. An actual culprit cannot be determined in this scenario. It could be a sender issue, a receiver issue or a network issue.
  3. The message is posted, the transmit is completed, but the remote never acknowledges.
    1. In the IBLND, there are explicit acknowledgements in most cases when the message is received and forwarded to the LNet layer. Look below for more details.
    2. If an LND message is in waiting state and it didn't receive the expected response, then this indicates an issue at the remote's LND, either at the lower protocol, IB/TCP, or the notification at the LNet layer is not being processed in a timely fashion.

Each of these scenarios can be handled differently

Desired Behavior

The desired behavior is listed for each of the above scenarios:

Scenario 1 - Message not posted

  1. Connection is closed
  2. The local interface health is decremented
  3. Failure statistics incremented
  4. A resend is issued.
  5. Selection algorithm will prefer less the unhealthy NI

Scenario 2 - Transmit not completed

  1. Connection is closed
  2. The local and remote interface health is updated
  3. Failure statistics incremented on both local and remote
  4. Selection algorithm will prefer less the unhealthy NIs

Scenario 3 - No acknowledgement by remote

  1. Connection is closed
  2. The remote interface health is updated
  3. Failure statistics incremented
  4. Selection algorithm will prefer less the unhealthy NIs
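The three scenarios differ mainly in which side has its health updated. A compact sketch of that mapping follows, using hypothetical enum and flag names.

```c
/* Hypothetical names for the three LND transmit timeout scenarios. */
enum tx_timeout_scenario {
	TX_NOT_POSTED,		/* scenario 1: message never left the queue */
	TX_NOT_COMPLETED,	/* scenario 2: posted, transmit never completed */
	TX_NO_ACK,		/* scenario 3: completed, remote never acknowledged */
};

#define HEALTH_TARGET_LOCAL	0x1
#define HEALTH_TARGET_REMOTE	0x2

/* Which interfaces get their health updated for a given scenario. */
static unsigned int timeout_health_targets(enum tx_timeout_scenario scenario)
{
	switch (scenario) {
	case TX_NOT_POSTED:
		return HEALTH_TARGET_LOCAL;
	case TX_NOT_COMPLETED:
		return HEALTH_TARGET_LOCAL | HEALTH_TARGET_REMOTE;
	case TX_NO_ACK:
		return HEALTH_TARGET_REMOTE;
	}
	return 0;
}
```

In all three scenarios the connection is closed and failure statistics are incremented, so only the health target needs to be selected per scenario.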

Selection Algorithm with Health

Algorithm Parameters

Parameter    Values
SRC NID      Specified (A)    Not specified (B)
DST NID      local (1)        not local (2)
DST NID      MR (C)           NMR (D)
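The case labels used in the rest of this section (A1C, B2D and so on) are just the three parameter values from the table concatenated. A hypothetical helper that derives the label might look like this:

```c
#include <stdbool.h>

/* Hypothetical description of a send request for this sketch. */
struct send_case {
	bool src_specified;	/* A if true, B otherwise */
	bool dst_local;		/* 1 if true, 2 (routed) otherwise */
	bool dst_mr;		/* C if true, D (non-MR) otherwise */
};

/* Writes the three character case label plus terminator into out. */
static void send_case_label(const struct send_case *c, char out[4])
{
	out[0] = c->src_specified ? 'A' : 'B';
	out[1] = c->dst_local ? '1' : '2';
	out[2] = c->dst_mr ? 'C' : 'D';
	out[3] = '\0';
}
```

This gives eight combinations in total, which are the eight cases walked through below.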

Note that when communicating with an NMR peer we need to ensure that the source NI is always the same: there are a few places where the upper layers use the src nid from the message header to determine its originating node, as opposed to using something like a UUID embedded in the message. This means when sending to an NMR node we need to pick a NI and then stick with that going forward.

Note: When sending to a router that scenario boils down to considering the router as the next-hop peer. The final destination peer NIs are no longer considered in the selection. The next-hop can then be MR or non-MR and the code will deal with it accordingly.

A1C - src specified, local dst, mr dst

  • find the local ni given src_nid
    • if no local ni found fail
    • if local ni found is down, then fail
  • find peer identified by the dst_nid
  • select the best peer_ni for that peer
    • take into account the health of the peer_ni (if we just demerit the peer_ni it can still be the best of the bunch. So we need to keep track of the peer_nis/local_nis a message was sent over, so we don't revisit the same ones again. This should be part of the message)
    • If this is a resend and the resend peer_ni is specified, then select this peer_ni if it is healthy, otherwise continue with the algorithm.
    • if this is a resend, do not select the same peer_ni again unless no other peer_nis are available and that peer_ni is not in a HARD_ERROR state.

A2C - src specified, route to dst, mr dst

  • find local ni given src_nid
    • if no local ni found fail
    • if local ni found is down, then fail
  • find router to dst_nid
    • If no router present then fail.
  • find best peer_ni (for the router) to send to
    • take into account the health of the peer_ni
    • If this is a resend and the resend peer_ni is specified, then select this peer_ni if it is healthy, otherwise continue with the algorithm.
    • If this is a resend and the resend peer_ni is not specified, do not select the same peer_ni again. The original destination NID can be found in the message.
      • Keep trying to send to the peer_ni even if it has been used before, as long as it is not in a HARD_ERROR state.

A1D - src specified, local dst, nmr dst

  • find local ni given src nid
    • if no local_ni found fail
    • if local ni found is down, then fail
  • find peer_ni using dst_nid
  • send to that peer_ni
    • If this is a resend retry the send on the peer_ni unless that peer_ni is in a HARD_ERROR state, then fail.

A2D - src specified, route to dst, nmr dst

  • find local_ni given the src_nid
    • if no local_ni found fail
    • if local ni found is down, then fail
  • find router to go through to that peer_ni
  • send to the NID of that router.
    • If this is a resend retry the send on the peer_ni unless that peer_ni is in a HARD_ERROR state, then fail.

B1C - src any, local dst, mr dst

  • select the best_ni to send from, by going through all the local_nis that can reach any of the networks the peer is on
    • consider local_ni health in the selection by selecting the local_ni with the best health value.
    • If this is a resend do not select a local_ni that has already been used.
  • select the best_peer_ni that can be reached by the best_ni selected in the previous step
    • If this is a resend and the resend peer_ni is specified, then select this peer_ni if it is healthy, otherwise continue with the algorithm.
    • If this is a resend and the resend peer_ni is not specified do not consider a peer_ni that has already been used for sending as long as there are other peer_nis available for selection. Loop around and re-use peer-nis in round robin.
      • peer_nis that are selected cannot be in HARD_ERROR state.
  • send the message over that path.

B2C - src any, route to dst, mr dst

  • find the router that can reach the dst_nid
  • find the peer for that router. (The peer is MR)
  • go to B1C

B1D - src any, local dst, nmr dst

  • find peer_ni using dst_nid (non-MR, so this is the only peer_ni candidate)
    • no issue if peer_ni is healthy
    • try this peer_ni even if it is unhealthy if this is the 1st attempt to send this message
    • fail if resending to an unhealthy peer_ni
  • pick the preferred local_NI for this peer_ni if set
    • If the preferred local_NI is not healthy, fail sending the message and let the upper layers deal with recovery.
    • otherwise if preferred local_NI is not set, then pick a healthy local NI and make it the preferred NI for this peer_ni
  • send over this path

B2D - src any, route dst, nmr dst

  • find route to dst_nid

  • find peer_ni of router
    • no issue if peer_ni is healthy
    • try this peer_ni even if it is unhealthy if this is the 1st attempt to send this message
    • fail if resending to an unhealthy peer_ni
  • pick the preferred local_NI for the dst_nid if set
    • If the preferred local_NI is not healthy, fail sending the message and let the upper layers deal with recovery.
    • otherwise if preferred local_NI is not set, then pick a healthy local NI and make it the preferred NI for this peer_ni
  • send over this path

Resend Behavior

LNet will keep attempting to resend a message across different local/remote NIs as long as the interfaces are only in "soft" failure state. Interfaces are demerited when we fail to send over them due to a timeout. This is opposed to a hard failure which is reported by the underlying HW indicating that this interface can no longer be used for sending and receiving.

LNet will terminate resends of a message under one of the following conditions:

  1. Peer timeout expires
  2. No interfaces available that can be used.
  3. A message is sent successfully.

For hard failures there needs to be a method to recover these interfaces. This can be done through a ping of the interface whether it is local or remote, since that ping will tell us if an interface is up or down. 

The router checker infrastructure currently does this exact job for routers. This infrastructure can be expanded to also query the local or remote NIs which are down.

Selection of the local_ni or peer_ni will be dependent on the following criteria:

  • Has the best health value
    • skip interfaces in HARD_ERROR state
  • closest NUMA (for local interfaces)
  • most available credits
  • Round Robin.

Work Items

  • refactor lnet_select_pathway() as described above.
  • Health Value Maintenance/Demerit system
  • Selection based on Health Value and not resending over already used interfaces unless none are available.
  • Handling the new events in IBLND and passing them to LNet
  • Handling the new events in SOCKLND and passing them to LNet
  • Adding LNet level transaction timeout (or reuse the peer timeout) and cancelling a resend on timeout
  • Handling timeout case in ptlrpc

...