Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

The LNet Resiliency feature shall ensure that failures that can be dealt are delt with at the LNet level are delst with by attempting to resend by resending the message on the available local and remote interfaces with in a timeout provided.

LNet shall use a trickle down approach for managing timeouts. the The ULP (pltrpc or other upper layer protocol) shall provide a timeout value in the call to LNetPut() or LNetGet(). LNet shall use that as the transaction timeout value to wait for an ACK or REPLY. LNet shall further provide a configuration parameter for the number of retries. The number of retries shall allow the user to specify a maximum number of times LNet shall attempt to resend an unsuccessful message. LNet shall then calculate the message timeout by dividing the transaction timeout with the number of retries. LNet shall pass the calculated message timeout to the LND, which will use it to ensure that the LND protocol completes an LNet message within the message timeout. If the LND is not able to complete the message within the provided timeout it will close the connection and drop all messages on that connection. It will afterword proceed to call into LNet via lnet_finailze() to notify it of the error encountered.

LNet Resiliency

There are three types of failures that LNet needs to deal with:

  1. Local Interface failure
  2. Remote Interface failure
  3. Timeouts
    1. LND detected Timeout
    2. LNet detected Timeout

Timeouts will be provided by the ULP in the LNetPut() and LNetGet() APIs. The APIs are defined below:

LNetPut()/LNetGet() APIs

Code Block
/**
 * Initiate an asynchronous GET operation.
 *
 * On the initiator node, an LNET_EVENT_SEND is logged when the GET request
 * is sent, and an LNET_EVENT_REPLY is logged when the data returned from
 * the target node in the REPLY has been written to local MD.
 * LNET_EVENT_REPLY will have a timeout flag set if the REPLY has not arrived
 * with in the timeout provided.
 *
 * On the target node, an LNET_EVENT_GET is logged when the GET request
 * arrives and is accepted into a MD.
 *
 * \param self,target,portal,match_bits,offset See the discussion in LNetPut().
 * \param mdh A handle for the MD that describes the memory into which the
 * requested data will be received. The MD must be "free floating" (See LNetMDBind()).
 *
 * \retval  0»··   Success, and only in this case events will be generated
 * and logged to EQ (if it exists) of the MD.
 * \retval -EIO    Simulated failure.
 * \retval -ENOMEM Memory allocation failure.
 * \retval -ENOENT Invalid MD object.
 */
 int
 LNetGet(lnet_nid_t self, struct lnet_handle_md mdh,
 »·······struct lnet_process_id target, unsigned int portal,
 »·······__u64 match_bits, unsigned int offset, int timeout)

/**
 * Initiate an asynchronous PUT operation.
 *
 * There are several events associated with a PUT: completion of the send on
 * the initiator node (LNET_EVENT_SEND), and when the send completes
 * successfully, the receipt of an acknowledgment (LNET_EVENT_ACK) indicating
 * that the operation was accepted by the target. LNET_EVENT_ACK will have
 * the timeout flag set if an ACK is not received within the timeout provided
 *  
 * The event LNET_EVENT_PUT is used at the target node to indicate the
 * comple tion of incoming datadelivery.
 *
 * The local events will be logged in the EQ associated with the MD pointed to
 * by \a mdh handle. Using a MD without an associated EQ results in these
 * events being discarded. In this case, the caller must have another
 * mechanism (e.g., a higher level protocol) for determining when it is safe
 * to modify the memory region associated with the MD.
 *
 * Note that LNet does not guarantee the order of LNET_EVENT_SEND and
 * LNET_EVENT_ACK, though intuitively ACK should happen after SEND.
 *
 * \param self Indicates the NID of a local interface through which to send
 * the PUT request. Use LNET_NID_ANY to let LNet choose one by itself.
 * \param mdh A handle for the MD that describes the memory to be sent. The MD
 * must be "free floating" (See LNetMDBind()).
 * \param ack Controls whether an acknowledgment is requested.
 * Acknowledgments are only sent when they are requested by the initiating
 * process and the target MD enables them.
 * \param target A process identifier for the target process.
 * \param portal The index in the \a target's portal table.
 * \param match_bits The match bits to use for MD selection at the target
 * process.
 * \param offset The offset into the target MD (only used when the target
 * MD has the LNET_MD_MANAGE_REMOTE option set).
 * \param timeout The timeout to wait for an ACK if one is expected.
 * \param hdr_data 64 bits of user data that can be included in the message
 * header. This data is written to an event queue entry at the target if an
 * EQ is present on the matching MD.
 *
 * \retval  0»··   Success, and only in this case events will be generated
 * and logged to EQ (if it exists).
 * \retval -EIO    Simulated failure.
 * \retval -ENOMEM Memory allocation failure.
 * \retval -ENOENT Invalid MD object.
 *
 * \see struct lnet_event::hdr_data and lnet_event_kind_t.
 */
int
LNetPut(lnet_nid_t self, struct lnet_handle_md mdh, enum lnet_ack_req ack, 
»·······struct lnet_process_id target, unsigned int portal,
»·······__u64 match_bits, unsigned int offset, int timeout
»·······__u64 hdr_data)

Local Interface Failure

Local interface failures will be detected in one of two ways

  1. Synchronously as a return failure to the call to lnd_send()
  2. Asynchronously as an event that could be detected at a later point.
    1. These asynchronous events can be a result of a send operations
    2. They can also be independent of send operations, as failures are detected with the underlying device, for example a "link down" event.

Synchronous Send Failures

lnet_select_pathway() can fail for the following reasons:

  1. Shutdown in progress
  2. Out of memory
  3. Interrup signal received
  4. Discovery error.
  5. An MD bind failure
    1. -EINVAL
    2. -HOSTUNREACH
  6. Invalid information given

  7. Message dropped
  8. Aborting message
  9. no route found
  10. Internal failure

1, 2, 5, 6 and 10 are resource errors and it does not make sense to resend the message as any resend will likely run into the same problem.

Asynchronous Send Failures

LNet should resend the message: b

  1. On LND transmit timeout
  2. On LND connection failure
  3. On LND send failure

Resend Handling

...

When there is a message send failure due to the reasons outlined above. The behavior should be as follows:

  1. The local or remote interface health is updated
  2. Failure statistics incremented
  3. A resend is issued on a different local interface if there is one available. If there is none available attempt the same interface again.
  4. The message will continuously be resent until one of the following criteria is fulfilled:
    1. Message is completed successfully.
    2. Retry-count is recahed
    3. Transaction timeout expires

A new field in the msg, msg_status, will be added. This field will hold the send status of the message.

When a message encounters one of the errors above, the LND will update the msg_status field appropriately and call lnet_finalize()

lnet_finalize() will check if the message has timed out or if it needs to be resent and will take action on it. lnet_finalize() currently calls lnet_complete_msg_locked() to continue the processing. If the message has not been sent, then lnet_finalize() should call another function to resend, lnet_resend_msg_locked().

lnet_resend_msg_locked()shall queue the message on a resend queue and wake up a thread responsible for resending messages.

The router checker thread, which is always started, will be refactored to handle resending messages.

When a message is initially sent it's tagged with a deadline for this message. The deadline is the current time + transaction timeout. The message will be placed on the active queue. If the message is not completed within that timeout it will be finalized and removed from the active queue. A timeout event will be passed to the ULP.

If the LND times out and LNet attemps to resend, it'll place it on the resend queue. A message can be on the both the active and resend queue.

As shown in the diagram below both lnet_send() and lnet_parse() put messages on the active queue. lnet_finalize() consumes messages off the active queue when it's time to decommit them.

The monitor thread will wake up every second and checks if any messages which are being sent have passed their deadline. If so, it'll call lnet_finalize() on that message, which will decommit and finalize the message.

When the LND calls lnet_finalize() on a timed out message, lnet_finalize() will put the message on the resend queue and wake up the monitor thread, which will go through the resend queue in FIFO order, pop the message and call lnet_send() on it.

The assumption is that under normal circumstances the number of re-sends should be low, so the thread will not add any logic to pace out the resend rate.

It is possible that a message can be on the resend queue when it either completes or times out. In both of these case it will be removed from the resend queue as well as the active queue and finalized.

The message will continue to be protected by the LNet net CPT lock to ensure mutual access.

...

For PUT that doesn't require an ACK the timeout will be used to provide the transaction timeout to the LND. In that case LNet will resend the PUT if the LND detects an issue with the transmit. LNet shall be able to send a TIMEOUT event to the ULP if the PUT was not successfully transmited. However, if the PUT is successfully transmited, there is no way for LNet to determine if it has been processed properly on the receiving end.

Feature Specification

System Overview

Gliffy Diagram
nameMessageProcessing
pagePin1

lnet_msg is a structure used to keep information on the data that will be transmitted over the wire. It does not itself go over the wire. lnet_msg is passed to the LND for transmission.

Before it's passed to the LND it is placed on an active list, msc_active. The diagram below describes the datastructures

Gliffy Diagram
namelnet_msg_structure
pagePin1

The CPT is determined by the lnet_cpt_of_nid_locked() function. lnet_send() running in the context of the calling threads place a message on msc_active just before sending it to the LND. lnet_parse() places messages on msc_active when it receives it from the LND.

msc_active represent all messages which are currently being processed by LNet.

lnet_finalize(), running in the context of the calling threads, likely the LND scheduler threads, will determine if a message needs to be resent and place it on the resend list. The resend list is a list of all messages which are currently awaiting a resend.

A monitor thread monitors and ensures that messages which have expired are finalized. This processing is detailed in later sections.

Types of Failures

There are three types of failures that LNet needs to deal with:

  1. Local Interface failure
  2. Remote Interface failure
  3. Timeouts
    1. LND detected Timeout
    2. LNet detected Timeout

Timeouts will be provided by the ULP in the LNetPut() and LNetGet() APIs.

Resend Handling

When there is a message send failure due to the reasons outlined above. The behavior should be as follows:

  1. The local or remote interface health is updated
  2. Failure statistics incremented
  3. A resend is issued on a different local interface if there is one available. If there is none available attempt the same interface again.
  4. The message will continuously be resent until one of the following criteria is fulfilled:
    1. Message is completed successfully.
    2. Retry-count is reached
    3. Transaction timeout expires

Two new fiels will be added to lnet_msg:

  1. msg_status - bit field that indicates the type of failure which requires a resend
  2. msg_deadline - the deadline for the message calculated by,  send time + transaction timeout


Code Block
struct lnet_msg {
...
»·······__u32»··»·······»·······msg_status;
»·······cfs_time_t»·····»·······msg_deadline;
...
}


When a message encounters one of the errors above, the LND will update the msg_status field appropriately and call lnet_finalize()

lnet_finalize() will check if the message has timed out or if it needs to be resent and will take action on it. lnet_finalize() currently calls lnet_complete_msg_locked() to continue the processing. If the message has not been sent, then lnet_finalize() should call another function to resend, lnet_resend_msg_locked().

lnet_resend_msg_locked()shall queue the message on a resend queue and wake up a thread responsible for resending messages.

The router checker thread, which is always started, will be refactored to handle resending messages.

When a message is initially sent it's tagged with a deadline for this message.  The message will be placed on the active queue. If the message is not completed within that timeout it will be finalized and removed from the active queue. A timeout event will be passed to the ULP.

The monitor thread will wake up ever second and check the top of the active queue, IE the oldest message on the list. If that message has expired it updates its status to TIMEDOUT and finalizes it. It then moves on to the next message on the list and stops once its find a message that has not expired.

If the LND times out and LNet attemps to resend, it'll place it on the resend queue. A message can be on the both the active and resend queue.

As shown in the diagram above both lnet_send() and lnet_parse() put messages on the active queue. lnet_finalize() consumes messages off the active queue when it's time to decommit them.

When the LND calls lnet_finalize() on a timed out message, lnet_finalize() will put the message on the resend queue and wake up the monitor thread, which will go through the resend queue in FIFO order, pop the message and call lnet_send() on it.

When the monitor thread wakes up every second it'll perform the following ordered operations:

  1. Timeout all expired messages on the active list.
  2. Resend messages on the resend queue

The assumption is that under normal circumstances the number of re-sends should be low, so the thread will not add any logic to pace out the resend rate, such as what lnet_finalize() does.

It is possible that a message can be on the resend queue when it either completes or times out. In both of these case it will be removed from the resend queue as well as the active queue and finalized.

The message will continue to be protected by the LNet net CPT lock to ensure mutual access.

When the message is committed, lnet_msg_commit(), the message cpt is assigned. This cpt value is then used to protect the message in subsequent usages. Relevant to this discussion is when the message is examined in lnet_finalize() and either removed from the active queue or placed on the resend queue.

API Changes

The ULP will provide the transaction timeout value on which LNet will base its own timeout values. In the absence of that LNet will fall back on a configurable transaction timeout value.

This trickle down approach will simplify the configuration of the LNet Resiliency feature, as well as make the timeout consistent through out the system, instead of configuring the LND timeout to be much larger than the pltrpc timeout as it is now. Furthermore, the ptlrpc uses a backoff algorithm, which allows it to wait longer for responses. With this trickle down approach, LNet will be able to cope with that timeout backoff algorithm.

The LNetGet() and LNetPut() APIs will be changed to reflect that.

Code Block
/**
 * Initiate an asynchronous GET operation.
 *
 * On the initiator node, an LNET_EVENT_SEND is logged when the GET request
 * is sent, and an LNET_EVENT_REPLY is logged when the data returned from
 * the target node in the REPLY has been written to local MD.
 * LNET_EVENT_REPLY will have a timeout flag set if the REPLY has not arrived
 * with in the timeout provided.
 *
 * On the target node, an LNET_EVENT_GET is logged when the GET request
 * arrives and is accepted into a MD.
 *
 * \param self,target,portal,match_bits,offset See the discussion in LNetPut().
 * \param mdh A handle for the MD that describes the memory into which the
 * requested data will be received. The MD must be "free floating" (See LNetMDBind()).
 *
 * \retval  0»··   Success, and only in this case events will be generated
 * and logged to EQ (if it exists) of the MD.
 * \retval -EIO    Simulated failure.
 * \retval -ENOMEM Memory allocation failure.
 * \retval -ENOENT Invalid MD object.
 */
 int
 LNetGet(lnet_nid_t self, struct lnet_handle_md mdh,
 »·······struct lnet_process_id target, unsigned int portal,
 »·······__u64 match_bits, unsigned int offset, int timeout)

/**
 * Initiate an asynchronous PUT operation.
 *
 * There are several events associated with a PUT: completion of the send on
 * the initiator node (LNET_EVENT_SEND), and when the send completes
 * successfully, the receipt of an acknowledgment (LNET_EVENT_ACK) indicating
 * that the operation was accepted by the target. LNET_EVENT_ACK will have
 * the timeout flag set if an ACK is not received within the timeout provided
 *  
 * The event LNET_EVENT_PUT is used at the target node to indicate the
 * comple tion of incoming datadelivery.
 *
 * The local events will be logged in the EQ associated with the MD pointed to
 * by \a mdh handle. Using a MD without an associated EQ results in these
 * events being discarded. In this case, the caller must have another
 * mechanism (e.g., a higher level protocol) for determining when it is safe
 * to modify the memory region associated with the MD.
 *
 * Note that LNet does not guarantee the order of LNET_EVENT_SEND and
 * LNET_EVENT_ACK, though intuitively ACK should happen after SEND.
 *
 * \param self Indicates the NID of a local interface through which to send
 * the PUT request. Use LNET_NID_ANY to let LNet choose one by itself.
 * \param mdh A handle for the MD that describes the memory to be sent. The MD
 * must be "free floating" (See LNetMDBind()).
 * \param ack Controls whether an acknowledgment is requested.
 * Acknowledgments are only sent when they are requested by the initiating
 * process and the target MD enables them.
 * \param target A process identifier for the target process.
 * \param portal The index in the \a target's portal table.
 * \param match_bits The match bits to use for MD selection at the target
 * process.
 * \param offset The offset into the target MD (only used when the target
 * MD has the LNET_MD_MANAGE_REMOTE option set).
 * \param timeout The timeout to wait for an ACK if one is expected.
 * \param hdr_data 64 bits of user data that can be included in the message
 * header. This data is written to an event queue entry at the target if an
 * EQ is present on the matching MD.
 *
 * \retval  0»··   Success, and only in this case events will be generated
 * and logged to EQ (if it exists).
 * \retval -EIO    Simulated failure.
 * \retval -ENOMEM Memory allocation failure.
 * \retval -ENOENT Invalid MD object.
 *
 * \see struct lnet_event::hdr_data and lnet_event_kind_t.
 */
int
LNetPut(lnet_nid_t self, struct lnet_handle_md mdh, enum lnet_ack_req ack, 
»·······struct lnet_process_id target, unsigned int portal,
»·······__u64 match_bits, unsigned int offset, int timeout
»·······__u64 hdr_data)

Selection Algorithm

It is possible that a send can fail immediately. In this case we need to take active measures to ensure that we do not enter a tight loop resending until the timeout expires. This could peak the CPU consumption unexpectedly.

To do that the last sent time will be kept in the message. If the message is not sent successfully on any of the existing interfaces, then it will be placed on a queue and will be resent after a specific deadline expires. This will be termed a "(re)send procedure". An interval must expire between each (re)send procedure. A (re)send procedure will iterate through all local and remote peers, depending on the source of the send failure. 

The router pinger thread will be refactored to handle resending messages. The router pinger thread is started irregardless and only used to ping gateways if any are configured. Its operation will be expanded to check the pending message queue and re-send messages.

To keep the code simple when resending the health value of the interface that had the problem will be updated. If we are sending to a non-MR peer, then we will use the same src_nid, otherwise the peer will be confused. The selection algorithm will then consider the health value when it is selecting the NI or peer NI.

There are two aspects to consider when sending a message

  1. The congestion of the local and remote interfaces
  2. The health of the local and remote interfaces

The selection algorithm will take an average of these two values and will determine the best interface to select. To make the comparison proper, the health value of the interface will be set to the same value as the credits on initialization. It will be decremented on failure to send and incremented on successful send.

A health_value_range module parameter will be added to control the sensitiveness of the selection. If it is set to 0, then the best interface will be selected. A value higher than 0 will give a range in which to select the interface. If the value is large enough it will in effect be equivalent to turning off health comparison.

Local Interface Failure

Local interface failures will be detected in one of two ways

  1. Synchronously as a return failure to the call to lnd_send()
  2. Asynchronously as an event that could be detected at a later point.
    1. These asynchronous events can be a result of a send operations
    2. They can also be independent of send operations, as failures are detected with the underlying device, for example a "link down" event.

Synchronous Send Failures

lnet_select_pathway() can fail for the following reasons:

  1. Shutdown in progress
  2. Out of memory
  3. Interrup signal received
  4. Discovery error.
  5. An MD bind failure
    1. -EINVAL
    2. -HOSTUNREACH
  6. Invalid information given

  7. Message dropped
  8. Aborting message
  9. no route found
  10. Internal failure

1, 2, 5, 6 and 10 are resource errors and it does not make sense to resend the message as any resend will likely run into the same problem.

Asynchronous Send Failures

LNet should resend the message:

  1. On LND transmit timeout
  2. On LND connection failure
  3. On LND send failure

LND Interface

LNet shall calculate the message timeout as follows:

...

The message timeout will be stored in the lnet_msg structure and passed down to the LND via lnd_send().

Resending

It is possible that a send can fail immediately. In this case we need to take active measures to ensure that we do not enter a tight loop resending until the timeout expires. This could peak the CPU consumption unexpectedly.

To do that the last sent time will be kept in the message. If the message is not sent successfully on any of the existing interfaces, then it will be placed on a queue and will be resent after a specific deadline expires. This will be termed a "(re)send procedure". An interval must expire between each (re)send procedure. A (re)send procedure will iterate through all local and remote peers, depending on the source of the send failure. 

The router pinger thread will be refactored to handle resending messages. The router pinger thread is started irregardless and only used to ping gateways if any are configured. Its operation will be expanded to check the pending message queue and re-send messages.

To keep the code simple when resending the health value of the interface that had the problem will be updated. If we are sending to a non-MR peer, then we will use the same src_nid, otherwise the peer will be confused. The selection algorithm will then consider the health value when it is selecting the NI or peer NI.

There are two aspects to consider when sending a message

  1. The congestion of the local and remote interfaces
  2. The health of the local and remote interfaces

The selection algorithm will take an average of these two values and will determine the best interface to select. To make the comparison proper, the health value of the interface will be set to the same value as the credits on initialization. It will be decremented on failure to send and incremented on successful send.

...

LND timeout

An LNet message can be represented by a sequence of LND message. In the o2iblnd, the PUT and GET are described in the following sequence diagrams.

PUT

Gliffy Diagram
namePUT sequence
pagePin3

GET

Gliffy Diagram
nameGET Sequence Diagram
pagePin2

A third type of message that the LND sends is the IBLND_MSG_IMMEDIATE. The data is embedded in the message and posted. There is no handshake in this case.

For the PUT case described in the sequence diagram, the initiator sends two messages:

  1. IBLND_MSG_PUT_REQ
  2. IBLND_MSG_PUT_DONE

Both of these messages are sent using the same tx structure. The tx is allocated and placed on a waiting queue. When the IBLND_MSG_PUT_ACK is received the waiting tx is looked up and used to send the IBLND_MSG_PUT_DONE.

When kiblnd_queue_tx_locked() is called for IBLND_MSG_PUT_REQ it sets the tx_deadline as follows:

Code Block
timeout_ns = *kiblnd_tunables.kib_timeout * NSEC_PER_SEC;
tx->tx_deadline = ktime_add_ns(ktime_get(), timeout_ns);

When kiblnd_queu_tx_locked() is called for IBLND_MSG_PUT_DONE it reset the tx_deadline again.

This presents an obstacle for the LNet Resiliency feature. LNet provides a timeout for the LND as described above. From LNet's perspective this deadline is for the LNet PUT message. However, if we simply use that value for the timeout_ns calculation, then in essence will will be waing for 2 * LND timeout for the completion of the LNet PUT message. This will mean less re-transmits.

Therefore, the LND, since it has knowledge of its own protocols will need to divide the timeout provided by LNet by the number of transmits it needs to do to complete the LNet level message:

  1. LNET_MSG_GET: Requires only IBLND_MSG_GET_REQ. Use the LNet provided timeout as is.
  2. LNET_MSG_PUT: Requires IBLND_MSG_PUT and IBLND_MSG_PUT_DONE to complete the LNET_MSG_PUT. Use LNet provided timeout / 2
  3. LNET_MSG_GET/LNET_MSG_PUT with < 4K payload: Requires IBLND_MSG_IMMEDIATE. Use LNet provided timeout as is.


Hard Failure

It's possible that the local interface might get into a hard failure scenario by receiving one of these events from the o2iblnd. socklnd needs to be investigated to determine if there are similar cases:

...

Timeout Handling

LND TX Timeout

PUT

Gliffy Diagram
namePUT sequence
pagePin3

...

  1. TX timeout can be triggered because the TX remains on one of the outgoing queues for too long. This would indicate that there is something wrong with the local NI. It's either too congested or otherwise problematic. This should result in us trying to resend using a different local NI but possibly to the same peer_ni.
  2. TX timeout can be triggered because the TX is posted via (ib_post_send()) but it has not completed. In this case we can safely conclude that the peer_ni is either congested or otherwise down. This should result in us trying to resent to a different peer_ni, but potentially using the same local NI.
GET

Gliffy Diagram
nameGET Sequence Diagram
pagePin2

...