Page properties
Target release: Lustre 2.12
Epic: LU-9120 (HPDD Community Jira)
Document status: DRAFT
Document owner:
Designer:
Developers: Amir Shehata
QA: Amir Shehata

Adding Resiliency to LNet

...

If all goes well, the event handler sees two events: LNET_EVENT_SEND to indicate the GET message was sent, and LNET_EVENT_REPLY to indicate the REPLY message was received. Note that the send event can happen after the reply event (this is actually the typical case).

If sending the GET message failed, LNET_EVENT_SEND will include an error status, no LNET_EVENT_REPLY will happen, and clean up must be done accordingly. If the return value of LNetGet() indicates an error then sending the message certainly failed, but a 0 return does not imply success, only that no failure has yet been encountered.

A damaged REPLY message will be dropped, and does not result in an LNET_EVENT_REPLY. Effectively the only way for LNET_EVENT_REPLY to have an error status is if LNet detects a timeout before the REPLY is received.
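The event ordering rules above can be sketched as a small tracker that a caller might keep per GET. This is only an illustration: `get_event`, `get_tracker` and `get_tracker_event()` are hypothetical names standing in for whatever the event handler actually uses, and the status convention (0 for success, negative errno otherwise) is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for LNET_EVENT_SEND and LNET_EVENT_REPLY. */
enum get_event { GET_EV_SEND, GET_EV_REPLY };

/* Hypothetical per-GET bookkeeping kept by the caller. */
struct get_tracker {
	bool send_seen;
	bool reply_seen;
	bool failed;
};

/*
 * Record one event. Returns true once no further events are expected,
 * so the caller can clean up. The two events may arrive in either order.
 */
bool get_tracker_event(struct get_tracker *t, enum get_event ev, int status)
{
	assert(t);

	if (ev == GET_EV_SEND) {
		t->send_seen = true;
		if (status != 0) {
			/* A failed send is never followed by a reply event. */
			t->failed = true;
			return true;
		}
	} else {
		t->reply_seen = true;
		if (status != 0)
			t->failed = true; /* e.g. LNet detected a timeout */
	}

	return t->send_seen && t->reply_seen;
}
```

With this sketch, a reply event followed by a send event completes the GET just as the opposite order does, while a send event carrying an error completes it immediately.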

LNetPut() LNET_ACK_REQ

...

A PUT with an ACK is similar to a GET + REPLY pair. The events in this case are LNET_EVENT_SEND and LNET_EVENT_ACK.

...

LNet can mark the interface down, and depending on the capabilities of the LND either recheck periodically or wait for the LND to mark the interface up.

...

LNet might treat this as the "remote interface not reachable" case for all the interfaces of the remote node. That is, the handling does not differ much just because all interfaces of the remote node appear to be down, except for a log message indicating this.

PUT+ACK or GET+REPLY Timeout

This is the case where the LND does not signal any problem, so the ACK for a PUT or REPLY for a GET should arrive promptly, with the only delays due to credit-based throttling, and yet it does not do so. Note that this assumes that where possible the LND layer already implements reasonably tight timeouts, so that LNet can assume the problem is somewhere else.

LNet can impose a "reply timeout", and retry over a different path if there is one available. However, if the assumption about the LND is valid, then the implication is that the node is in trouble. So an alternative is to force the upper layers to cope. One argument for nevertheless implementing this facility in LNet is that the upper layers then do not have to re-invent and re-implement this wheel time and again. LNet will time out after the configured or passed-in transaction timeout and will send an event to the ULP indicating that the PUT/GET has timed out without receiving the expected ACK/REPLY respectively.

Dropped PUT

No problem was signaled by the LND, and there is no ACK that we could time out waiting for. LNet does not have enough information to do anything, so the ULP must do so instead.

If this case must be made tractable, LNet can be changed to make the ACK non-optional.

...

LNet Resend Handling

When there are multiple paths available for a message, it makes sense to try to resend it on failure. But where should the resending logic be implemented?

The easiest path is to tell upper layers to resend. For example, PtlRPC has some related logic already. Except that when PtlRPC detects a failure, it disconnects, reconnects, and triggers a recovery operation. This is a fairly heavy-weight process, while the type of resending logic desired is to "just try another path" which differs from what exists today and needs to be implemented for each user.

The alternative then is to have LNet resend a message. There should be some limit to the number of retries, and a limit to the amount of time spent retrying. Otherwise we are requiring the upper layers to implement a timer on each LNetGet() and LNetPut() call to guarantee progress. This introduces an LNet "retry timeout" (as opposed to the "reply timeout" discussed above) as the amount of time after which LNet gives up.

In terms of timeouts, this then gives us the following relationships, from shortest to longest:

  • LND Timeout: LND declares that a message won't arrive.
    • IB timeout is (default?) slightly less than 4 seconds
    • LND timeout is the timeout module parameter for o2ib and gni, and the sock_timeout module parameter for sock?
  • LNet Reply Timeout: LNet declares an Ack/Reply won't arrive. > LND Timeout * (max hops -1)
    • Depends on the route!
  • LNet Retry Timeout: LNet gives up on retries. > LNet Reply Timeout * max LNet retries
    • Depends on the route!
  • peer_timeout module parameter: peer is declared dead. Either use for LNet Retry Timeout, or > LNet Retry Timeout. Using the peer_timeout for the LNet Retry Timeout has the advantage of reducing the number of tunable parameters. A disadvantage is that the peer_timeout is currently a per-LND parameter (each LND has its own tunable value), effectively limiting the number of retries to 1 when the LND timed out.
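The ordering in the list above can be written down as a simple consistency check. The sketch below is an illustration only: `timeouts_sketch` and `timeouts_consistent()` are hypothetical names, all values are assumed to be in seconds, and the check treats "use peer_timeout for the LNet Retry Timeout" as equality.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bundle of the timeouts discussed above, in seconds. */
struct timeouts_sketch {
	unsigned int lnd_timeout;
	unsigned int max_hops;
	unsigned int max_retries;
	unsigned int reply_timeout;
	unsigned int retry_timeout;
	unsigned int peer_timeout;
};

/* Returns true when the values respect the shortest-to-longest ordering. */
bool timeouts_consistent(const struct timeouts_sketch *t)
{
	assert(t);

	if (t->max_hops == 0)
		return false;

	/* LNet Reply Timeout > LND Timeout * (max hops - 1) */
	if (t->reply_timeout <= t->lnd_timeout * (t->max_hops - 1))
		return false;

	/* LNet Retry Timeout > LNet Reply Timeout * max LNet retries */
	if (t->retry_timeout <= t->reply_timeout * t->max_retries)
		return false;

	/* peer_timeout is either used as the retry timeout, or is larger. */
	return t->peer_timeout >= t->retry_timeout;
}
```

Because the reply and retry timeouts depend on the route, a check like this would have to be evaluated per route rather than once globally.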

It is not completely obvious how this scheme interacts with Lustre's timeout parameter (the Lustre RPC timeout, from which a number of timeouts are derived), but at first glance it seems that at least peer_timeout < Lustre timeout.

LNet Health Version 2.0

There are three types of failures that LNet needs to deal with:

  1. Local Interface failure
  2. Remote Interface failure
  3. Timeouts
    1. LND detected Timeout
    2. LNet detected Timeout

Local Interface Failure

Local interface failures will be detected in one of two ways:

  1. Synchronously as a return failure to the call to lnd_send()
  2. Asynchronously as an event that could be detected at a later point.
    1. These asynchronous events can be the result of a send operation
    2. They can also be independent of send operations, as failures are detected with the underlying device, for example a "link down" event.

Synchronous Send Failures

lnet_select_pathway can fail for the following reasons:

  1. Shutdown in progress
  2. Out of memory
  3. Interrupt signal received
  4. Discovery error
  5. An MD bind failure
    1. -EINVAL
    2. -EHOSTUNREACH
  6. Invalid information given
  7. Message dropped
  8. Aborting message
  9. No route found
  10. Internal failure

All these cases are resource errors and it does not make sense to resend the message as any resend will likely run into the same problem.

Asynchronous Send Failures

LNet should resend the message:

  1. On LND transmit timeout
  2. On LND connection failure
  3. On LND send failure

When a message send fails for one of the reasons outlined above, the behavior should be as follows:

  1. The local or remote interface health is updated
  2. Failure statistics are incremented
  3. A resend is issued on a different local interface if there is one available. If there is none available, the same interface is attempted again.
  4. The message will be resent repeatedly until the timeout expires or the send succeeds.

A new field in the msg, msg_status, will be added. This field will hold the send status of the message.

When a message encounters one of the errors above, the LND will update the msg_status field appropriately and call lnet_finalize().

lnet_finalize() will check if the message has timed out or if it needs to be resent and will take action on it. lnet_finalize() currently calls lnet_complete_msg_locked() to continue the processing. If the message has not been sent, then lnet_finalize() should call another function to resend, lnet_resend_msg_locked().

When a message is initially sent it is tagged with a deadline. The deadline is the current time + peer_timeout. While the message has not timed out it will be resent as needed. The deadline is checked every time we enter lnet_finalize(). When the deadline is reached without a successful send, the MD will be detached.

While the message is in the sending state the MD will not be detached. 

LNet shall use a trickle-down approach for managing timeouts. The ULP (ptlrpc or other upper layer protocol) shall provide a timeout value in the call to LNetPut() or LNetGet(). LNet shall use that as the transaction timeout value to wait for an ACK or REPLY. LNet shall further provide a configuration parameter for the number of retries, which allows the user to specify the maximum number of times LNet shall attempt to resend an unsuccessful message. LNet shall then calculate the message timeout by dividing the transaction timeout by the number of retries. LNet shall pass the calculated message timeout to the LND, which will use it to ensure that the LND protocol exchange needed to complete an LNet message finishes within the message timeout. If the LND is not able to complete the message within the provided timeout it will close the connection and drop all messages on that connection. It will afterward call into LNet via lnet_finalize() to notify it of the error encountered.

LNet Resiliency

There are three types of failures that LNet needs to deal with:

  1. Local Interface failure
  2. Remote Interface failure
  3. Timeouts
    1. LND detected Timeout
    2. LNet detected Timeout

Local Interface Failure

Local interface failures will be detected in one of two ways:

  1. Synchronously as a return failure to the call to lnd_send()
  2. Asynchronously as an event that could be detected at a later point.
    1. These asynchronous events can be the result of a send operation
    2. They can also be independent of send operations, as failures are detected with the underlying device, for example a "link down" event.

Synchronous Send Failures

lnet_select_pathway() can fail for the following reasons:

  1. Shutdown in progress
  2. Out of memory
  3. Interrupt signal received
  4. Discovery error
  5. An MD bind failure
    1. -EINVAL
    2. -EHOSTUNREACH
  6. Invalid information given
  7. Message dropped
  8. Aborting message
  9. No route found
  10. Internal failure

1, 2, 5, 6 and 10 are resource errors and it does not make sense to resend the message as any resend will likely run into the same problem.
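The classification above can be sketched as a lookup over the ten listed reasons. The enum names below are hypothetical labels chosen for this example (the source only numbers the reasons), and the sketch deliberately says nothing about how the remaining reasons are handled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the failure reasons, numbered as in the list. */
enum select_failure {
	FAIL_SHUTDOWN = 1,
	FAIL_NOMEM = 2,
	FAIL_INTERRUPTED = 3,
	FAIL_DISCOVERY = 4,
	FAIL_MD_BIND = 5,
	FAIL_INVALID_INFO = 6,
	FAIL_MSG_DROPPED = 7,
	FAIL_MSG_ABORTED = 8,
	FAIL_NO_ROUTE = 9,
	FAIL_INTERNAL = 10,
};

/*
 * Returns true for the reasons the text calls resource errors
 * (1, 2, 5, 6 and 10), for which a resend is not worthwhile.
 */
bool failure_is_resource_error(enum select_failure reason)
{
	assert(reason >= FAIL_SHUTDOWN && reason <= FAIL_INTERNAL);

	switch (reason) {
	case FAIL_SHUTDOWN:
	case FAIL_NOMEM:
	case FAIL_MD_BIND:
	case FAIL_INVALID_INFO:
	case FAIL_INTERNAL:
		return true;
	default:
		return false;
	}
}
```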

Asynchronous Send Failures

LNet should resend the message:

  1. On LND transmit timeout
  2. On LND connection failure
  3. On LND send failure

Resend Handling

[Gliffy diagram: MessageProcessing]

When a message send fails for one of the reasons outlined above, the behavior should be as follows:

  1. The local or remote interface health is updated
  2. Failure statistics are incremented
  3. A resend is issued on a different local interface if there is one available. If there is none available, the same interface is attempted again.
  4. The message will be resent repeatedly until one of the following criteria is fulfilled:
    1. Message is completed successfully
    2. Retry-count is reached
    3. Transaction timeout expires
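The stop criteria and the interface choice from the list above can be sketched as two small helpers. Everything named here (`resend_verdict`, `resend_decide()`, `pick_resend_interface()`) is hypothetical, and the round-robin interface choice is a simplifying assumption that ignores interface health.

```c
#include <assert.h>

/* Hypothetical outcome of examining a message after a send attempt. */
enum resend_verdict {
	RESEND_NOT_NEEDED,        /* message completed successfully */
	RESEND_TIMED_OUT,         /* transaction timeout expired */
	RESEND_RETRIES_EXHAUSTED, /* retry count reached */
	RESEND_AGAIN,             /* queue the message for another attempt */
};

/* msg_status is 0 on success; times are in seconds on a common clock. */
enum resend_verdict resend_decide(int msg_status, unsigned int resends_done,
				  unsigned int retry_count,
				  long now, long deadline)
{
	if (msg_status == 0)
		return RESEND_NOT_NEEDED;
	if (now >= deadline)
		return RESEND_TIMED_OUT;
	if (resends_done >= retry_count)
		return RESEND_RETRIES_EXHAUSTED;
	return RESEND_AGAIN;
}

/*
 * Pick the next local interface index, falling back to the same one
 * when it is the only interface available.
 */
unsigned int pick_resend_interface(unsigned int failed_ni,
				   unsigned int ni_count)
{
	assert(ni_count > 0 && failed_ni < ni_count);
	return (failed_ni + 1) % ni_count;
}
```

With a single interface, `pick_resend_interface(0, 1)` returns the same index, which matches the "attempt the same interface again" rule.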

A new field in the msg, msg_status, will be added. This field will hold the send status of the message.

When a message encounters one of the errors above, the LND will update the msg_status field appropriately and call lnet_finalize().

lnet_finalize() will check if the message has timed out or if it needs to be resent and will take action on it. lnet_finalize() currently calls lnet_complete_msg_locked() to continue the processing. If the message has not been sent, then lnet_finalize() should call another function to resend, lnet_resend_msg_locked().

lnet_resend_msg_locked() shall queue the message on a resend queue and wake up a thread responsible for resending messages.

The router checker thread, which is always started, will be refactored to handle resending messages.

When a message is initially sent it's tagged with a deadline for this message. The deadline is the current time + transaction timeout. The message will be placed on the active queue. If the message is not completed within that timeout it will be finalized and removed from the active queue. A timeout event will be passed to the ULP.

If the LND times out and LNet attempts to resend, it will place the message on the resend queue. A message can be on both the active and resend queues.

As shown in the diagram below both lnet_send() and lnet_parse() put messages on the active queue. lnet_finalize() consumes messages off the active queue when it's time to decommit them.

The monitor thread will wake up every second and check whether any messages which are being sent have passed their deadline. If so, it will call lnet_finalize() on that message, which will decommit and finalize the message.

When the LND calls lnet_finalize() on a timed out message, lnet_finalize() will put the message on the resend queue and wake up the monitor thread, which will go through the resend queue in FIFO order, pop the message and call lnet_send() on it.

The assumption is that under normal circumstances the number of re-sends should be low, so the thread will not add any logic to pace out the resend rate.

It is possible that a message can be on the resend queue when it either completes or times out. In both of these cases it will be removed from the resend queue as well as the active queue and finalized.

The message will continue to be protected by the LNet net CPT lock to ensure mutually exclusive access.
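The interaction between the active queue, the resend queue and the monitor thread can be illustrated with a single-threaded sketch. All names below are hypothetical, the queues are reduced to flags plus a small FIFO of indices, and locking (the CPT lock) is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_MSGS 8

/* Hypothetical per-message state. */
struct msg_sketch {
	long deadline;      /* send time + transaction timeout */
	bool on_active;
	bool on_resend;
	bool timed_out;     /* timeout event was passed to the ULP */
	unsigned int sends; /* number of send attempts so far */
};

struct monitor_sketch {
	struct msg_sketch msgs[MAX_MSGS];
	size_t resend_fifo[MAX_MSGS]; /* indices into msgs, FIFO order */
	size_t head, tail;
};

/* Initial send: tag the deadline and put the message on the active queue. */
void sketch_send(struct monitor_sketch *mon, size_t i, long now, long tmo)
{
	assert(i < MAX_MSGS && !mon->msgs[i].on_resend);
	mon->msgs[i] = (struct msg_sketch){
		.deadline = now + tmo, .on_active = true, .sends = 1,
	};
}

/* The LND reported a failure: queue for resend, stay on the active queue. */
bool sketch_queue_resend(struct monitor_sketch *mon, size_t i)
{
	struct msg_sketch *m = &mon->msgs[i];

	if (!m->on_active || m->on_resend)
		return false;
	mon->resend_fifo[mon->tail++ % MAX_MSGS] = i;
	m->on_resend = true;
	return true;
}

/* One wake-up of the monitor thread. Returns how many messages were resent. */
unsigned int sketch_monitor_pass(struct monitor_sketch *mon, long now)
{
	unsigned int resent = 0;

	/* Finalize active messages that have passed their deadline. */
	for (size_t i = 0; i < MAX_MSGS; i++) {
		struct msg_sketch *m = &mon->msgs[i];

		if (m->on_active && now >= m->deadline) {
			m->on_active = false;
			m->timed_out = true;
		}
	}

	/* Drain the resend queue in FIFO order, skipping finalized messages. */
	while (mon->head != mon->tail) {
		size_t i = mon->resend_fifo[mon->head++ % MAX_MSGS];
		struct msg_sketch *m = &mon->msgs[i];

		m->on_resend = false;
		if (!m->on_active)
			continue;
		m->sends++;
		resent++;
	}
	return resent;
}
```

A message that times out while it is still queued for resend is dropped from both queues in the same pass, which mirrors the rule stated above.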

When the message is committed, lnet_msg_commit(), the message cpt is assigned. This cpt value is then used to protect the message in subsequent usages. Relevant to this discussion is when the message is examined in lnet_finalize() and either removed from the active queue or placed on the resend queue.

LND Interface

LNet shall calculate the message timeout as follows:

message timeout = transaction timeout / retry count

The message timeout will be stored in the lnet_msg structure and passed down to the LND via lnd_send().

lnet_finalize() will also update the statistics (or call a function to update the statistics).
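As a worked example of the message timeout formula, the helper below performs the division in whole seconds. The function name is hypothetical, and two behaviors are assumptions not taken from the design: a retry count of 0 returns the full transaction timeout, and the result is clamped to at least 1 second.

```c
#include <assert.h>

/*
 * message timeout = transaction timeout / retry count
 *
 * Assumptions for this sketch: retry_count == 0 means "no retries", so the
 * whole transaction timeout is used; the result never drops below 1 second.
 */
unsigned int msg_timeout_sketch(unsigned int transaction_timeout,
				unsigned int retry_count)
{
	unsigned int tmo;

	if (retry_count == 0)
		return transaction_timeout;

	tmo = transaction_timeout / retry_count;
	return tmo > 0 ? tmo : 1;
}
```

For instance, a transaction timeout of 50 seconds with a retry count of 5 gives each attempt a 10 second message timeout.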


Resending

It is possible that a send can fail immediately. In this case we need to take active measures to ensure that we do not enter a tight loop resending until the timeout expires. This could cause an unexpected spike in CPU consumption.

...