Target release: Lustre 2.12
Epic: LU-9120

Document status: DRAFT
Document owner:
Designer:
Developers: Amir Shehata
QA: Amir Shehata

Scope

Overview

LNet Multi-Rail has implemented the ability for multiple interfaces to be used on the same LNet network or across multiple LNet networks. The LNet Health feature adds the ability to resend messages across different interfaces when interface or network failures are detected. This allows LNet to mitigate communication failures before passing the failures to upper layers for further error handling. To accomplish this, LNet Health depends on health information reported by the LND and underlying fabrics such as MLX and OPA.

LNet Health will monitor three different types of failures:

  • local interface failures as reported by the underlying fabric
  • remote interface failures as reported by the remote fabric
  • network timeouts.

Each of these classes of failure is dealt with separately at the LNet layer. Implementing the health feature at the LNet layer allows LNet to re-transmit messages across different types of interfaces. For example, if a peer has both MLX and TCP interfaces and a transmit error is detected on one of them, LNet can re-transmit the message on the other available interface, provided the peer exists on both networks.

In-Scope

  • LNet shall make best effort to deliver the message to a node that is directly connected, whether that is the final destination of the message or a router in a chain of routers to the final destination.
  • LNet shall re-transmit LNet messages (PUT/GET/ACK/REPLY) over the different local and remote interfaces available.
  • Provide a global timeout mechanism for timing out if ACK/REPLY are not received for their respective PUT/GET.
  • Handle o2iblnd timeouts and errors.
  • Handle socklnd timeouts and errors.
  • Feature testable via LUTF.

Out-of-Scope

  • LNet shall not be responsible for end-to-end message reliability
  • This feature will not add any health functionality to LNDs other than o2iblnd and socklnd. Other LNDs, such as gnilnd, will not be modified.

Sign-off


Key Milestones and Deliverables

Milestones | Deliverables | Deliverable Date
Scope and Requirements | Scope and Requirements Document |
High-level Design | High-level Design Document |
Unit Test Plan | Unit Test Plan Document |
Unit Test Infrastructure | LNet Unit Test Framework Infrastructure improvement |
Implementation | Source Code |
Unit Test Plan Development and execution | LUTF Test Scripts & Reports |
Regression Testing | Test Reports |
Code Review | |
Test & Fix | |
Landing | |

Requirements

This section will detail the LNet Health Solution requirements.

Categorization

The requirements are broken down into separate categories as described below.

Category | Description
Configuration (cfg) | All requirements which specify the user interaction with the LNet Health Feature
LNet Driver (lnd) | All requirements which specify LND behavior
LNet (lnt) | All requirements which specify LNet Health behavior
Statistics (stt) | All requirements which specify LNet Health Statistics
Backward Compatibility (bck) | All requirements which specify how new Multi-Rail systems shall interact with old systems
Debugging (dbg) | All requirements which specify debugging features of the LNet Health Project
Testing (tst) | All requirements which specify how LNet Health will be tested
Documentation (doc) | All requirements which specify the LNet Health feature documentation

Classes

Each requirement will fall into one of these classes.

Class | Description
REQUIRED | Core requirement that must be implemented
DESIRED | Requirement is deemed an enhancement which can be implemented at a later date

Status

Each requirement will be in one of the following statuses.

Status | Description
ACCEPTED | Requirement has been reviewed and accepted for implementation.
IN-PROGRESS | Requirement is being reviewed.
REJECTED | Requirement has been reviewed and rejected for implementation. It will not be covered in the HLD or the implementation.


Terms

Term | Description
SHALL | This word, or the terms "REQUIRED" or "MUST", mean that the definition is an absolute requirement of the specification.
SHALL NOT | This phrase, or the phrase "MUST NOT", mean that the definition is an absolute prohibition of the specification.
SHOULD | This word, or the adjective "RECOMMENDED", mean that there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course.
SHOULD NOT | This phrase, or the phrase "NOT RECOMMENDED", mean that there may exist valid reasons in particular circumstances when the particular behavior is acceptable or even useful, but the full implications should be understood and the case carefully weighed before implementing any behavior described with this label.

Requirement Format

Each requirement will have the following attributes.

Attribute | Description
ID

Unique ID of the requirement, composed of:

  • The three-letter acronym of the requirement category, as defined above
  • A number which starts at 005 and is incremented in a way that leaves room for adding extra requirements between already existing requirements
Class | Class of the requirement as defined above
Version

Version number of the requirement

  • Draft version will be in the format of 0.X
    • Where X >= 0
      • Ex: 0.1
  • Accepted version will be in the format of Y.00
    • Where Y >= 0
Status

Current status of the requirement. Will be one of the following:

  • IN-PROGRESS: Still being developed and discussed
  • ACCEPTED: Has been agreed on and signed off
Description | A description of the requirement

Terms

Term | Description
Transaction | A transaction is a PUT/GET and the respective ACK/REPLY. Note that a PUT does not necessarily need an ACK.
Transaction timeout | The timeout to wait for an ACK/REPLY.
Message timeout | The timeout after which a PUT, GET, ACK, or REPLY is re-transmitted.
Local timeout | A timeout that occurs due to a problem on the local NI. The message has not been posted to the network.
Remote timeout | A timeout that occurs due to a problem on the peer NI. A message has been posted to the network and completed successfully, but an expected acknowledgement is not received. Not receiving an LNet ACK/REPLY is categorized as a remote timeout. Some LND protocol messages expect an acknowledgement; not receiving one also constitutes a remote timeout.
Network timeout | A message has been posted but the underlying communication protocol (e.g. IB, TCP) has not completed within the message timeout.
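
The three timeout terms above differ only in how far the message got before the timer fired. A minimal sketch of that distinction follows; the `posted` and `completed` flags and all names are hypothetical and are not taken from the LNet or LND code.

```python
from enum import Enum


class TimeoutKind(Enum):
    LOCAL = "local"
    NETWORK = "network"
    REMOTE = "remote"


def classify_timeout(posted: bool, completed: bool) -> TimeoutKind:
    """Classify a timed-out message from two hypothetical state flags.

    posted:    the message was handed to the network
    completed: the underlying protocol (e.g. IB, TCP) completed the send
    """
    if not posted:
        # Never reached the network: a problem on the local NI.
        return TimeoutKind.LOCAL
    if not completed:
        # Posted, but the transport did not complete in time.
        return TimeoutKind.NETWORK
    # Posted and completed, but the expected acknowledgement never arrived.
    return TimeoutKind.REMOTE
```

The function is only meaningful for a message that has already timed out; a message that was posted, completed, and acknowledged never reaches it.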


Configuration Requirements

ID | Class | Version | Status | Description

cfg-005

REQUIRED | 1.0 | ACCEPTED | LNet Health shall be configurable via the Dynamic LNet Configuration (DLC) API, which uses the sysfs interface.

cfg-010

REQUIRED | 1.0 | ACCEPTED | LNet Health configuration and statistics queried through the DLC API shall be represented via YAML syntax.

cfg-015

REQUIRED | 2.0 | ACCEPTED

The number of times that LNet retries sending a message shall be configurable via DLC or sysfs. The number of retries shall be an integer greater than or equal to 0. A value of 0 turns off re-transmits.

The retry count will default to 0. In essence, for the first release of the LNet Health feature, re-transmission will be turned off and the user must turn it on explicitly. The reasoning is that most sites currently have only one interface, and re-transmitting on the same interface after a failure will likely result in further failures.

cfg-020

REQUIRED | 1.0 | REJECTED

A health range value shall be configured. The NI health is a value between 0 and 1000. Each NI health value is initialized to 1000, fully healthy, on startup and is decremented on failure to transmit on/to that NI.

The health range value is used to provide a range in which to select the NI. Any NI within the configured health value range will be considered healthy. Values greater than 1000 turn off selection based on health. A value of 0 selects only healthy interfaces.

cfg-025

REQUIRED | 3.0 | ACCEPTED

Health sensitivity defines by how much the health value is decremented on failure. It defines the percentage of failure that the system is sensitive to. The health value will be configured as a percentage value.

Setting health sensitivity to 0 turns off selection based on the health of the NI, because the NI's health value will never be decremented.

Health sensitivity will default to 0. Refer to cfg-015 for the reason.

cfg-030

REQUIRED | 1.0 | ACCEPTED | A transaction timeout to wait for an ACK/REPLY for a PUT/GET shall be configurable from the DLC API and sysfs.

cfg-035

DESIRED | 1.0 | IN-PROGRESS

The user shall be able to reset the health of an interface dynamically, in effect putting it back into service.

This would be useful when debugging the system, or when a problem has been resolved and the interface can be used. There would be no need to wait for the recovery time.

LNet Driver (LND) Requirements

ID | Class | Version | Status | Description

lnd-005

REQUIRED | 1.0 | ACCEPTED | LNet Health shall build on existing LND failure handling and shall not add new failure handling that doesn't already exist, except for requirements explicitly outlined in this document.

lnd-010

DESIRED | 1.0 | ACCEPTED | The LND shall listen to events from the driver indicating fatal device failure, such as device unplugged.

lnd-015

DESIRED | 1.0 | ACCEPTED | The LND shall report fatal device failure via callbacks to the LNet layer.

lnd-020

DESIRED | 1.0 | ACCEPTED | The LND shall detect device degradation if the underlying driver provides this information.

lnd-025

DESIRED | 1.0 | ACCEPTED | The LND shall report device degradation via callbacks to the LNet layer if the LND supports that.

lnd-030

REQUIRED | 1.0 | ACCEPTED

The LND transmit timeouts shall be provided by the LNet layer.

  • Currently the LND transmit timeout is an LND module parameter that defaults to 50 seconds. This is too long. For LNet Health we'll use a trickle-down approach: the ULP provides LNet with the timeout when sending a PUT or a GET, and if none is provided a configurable default is used. LNet then divides that timeout by the number of retries and provides the result to the LND layer. If the message times out at the LND layer, the timeout is reported up via lnet_finalize() and a re-transmit is issued.
  • Currently an o2iblnd connection is not destroyed until all work requests are completed, which is not time bound.
    • There is a ticket open to use the available drain QP API.
    • This is a separate work item from this project.

lnd-035

DESIRED | 1.0 | ACCEPTED

Drain the o2iblnd QP to allow a connection to be closed (LU-10915).

lnd-040

REQUIRED | 1.0 | ACCEPTED | On timeout the LND shall drop all queued messages on the existing connection and close it.

lnd-045

REQUIRED | 1.0 | ACCEPTED | The LND shall detect when a message times out before being posted for send. This timeout will be termed "local timeout".

lnd-050

REQUIRED | 1.0 | ACCEPTED | The LND shall detect when a message times out after it has been transmitted but not completed. This timeout will be termed "network timeout".

lnd-055

REQUIRED | 1.0 | ACCEPTED

The LND shall detect when a message times out before receiving an expected acknowledgment message:

  • EX: an IBLND_MSG_PUT_ACK is not received for IBLND_MSG_PUT_REQ

This will be termed "remote timeout".

Note that for IMMEDIATE o2iblnd messages there is no expected ACK/NACK, and therefore the transmit completion is the only indication that the message transmit was successful. This feature will not add any further handshaking for this category of messages.

lnd-060

REQUIRED | 2.0 | ACCEPTED | The LND shall propagate through lnet_finalize() an error status identifying the failure encountered.

LNet Requirements

ID | Class | Version | Status | Description

lnt-005

REQUIRED | 1.0 | ACCEPTED | LNet shall maintain a health value per local NI.

lnt-010

REQUIRED | 1.0 | ACCEPTED | LNet shall maintain a health value per peer NI.

lnt-015

REQUIRED | 1.0 | ACCEPTED

The health value is a number between 0 and 1000, inclusive.

This range is chosen to allow enough granularity for decrementing and incrementing the health value.

lnt-020

REQUIRED | 1.0 | ACCEPTED | LNet shall decrement the health value of an NI by the configured health sensitivity value whenever there is an error sending a message over or to the NI. The health value shall not be less than 0.

lnt-025

REQUIRED | 1.0 | ACCEPTED | LNet shall increment the health value but not beyond 1000, which represents a healthy NI.
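
The bounds in lnt-015, lnt-020 and lnt-025 amount to clamped arithmetic. A minimal sketch follows, assuming the sensitivity is applied as an absolute decrement (cfg-025 describes it as a percentage, and this sketch does not settle how the two map). The function names are hypothetical.

```python
MAX_HEALTH = 1000  # fully healthy NI (lnt-015, lnt-025)
MIN_HEALTH = 0     # lower bound (lnt-020)


def decrement_health(health: int, sensitivity: int) -> int:
    """Lower the health value on a send error, never below MIN_HEALTH.

    Assumption: `sensitivity` is the amount subtracted per failure.
    A sensitivity of 0 leaves the value unchanged.
    """
    return max(MIN_HEALTH, health - sensitivity)


def increment_health(health: int, step: int = 1) -> int:
    """Raise the health value, never above MAX_HEALTH."""
    return min(MAX_HEALTH, health + step)
```

With a sensitivity of 0 the decrement is a no-op, which matches the statement in cfg-025 that the NI's health value will never be decremented in that configuration.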

lnt-030

REQUIRED | 1.0 | ACCEPTED

LNet shall determine the NI to select based on the following ordered criteria:

  1. NI health
  2. NUMA closeness
  3. NI available credits
  4. Round Robin
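
One way to read the ordered criteria in lnt-030 is as a lexicographic sort key. The sketch below assumes that reading; the `NI` record, its field names, and the use of a "last used" sequence number to stand in for round robin are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NI:
    name: str
    health: int         # 0..1000, higher is better
    numa_distance: int  # lower is closer
    credits: int        # available credits, higher is better
    last_used: int      # sequence number of last use, lower = used longest ago


def select_ni(candidates: List[NI]) -> NI:
    """Pick the best NI by health, then NUMA closeness, then credits,
    then least recently used as a stand-in for round robin."""
    return min(
        candidates,
        key=lambda ni: (-ni.health, ni.numa_distance, -ni.credits, ni.last_used),
    )
```

Each later criterion only breaks ties left by the earlier ones, so an NI with lower health is never chosen over a healthier one, regardless of NUMA distance or credits.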

lnt-035

REQUIRED | 1.0 | ACCEPTED

When LNet fails to send on a local NI or to a remote NI, it shall place that NI on a recovery queue. The NI shall be pinged, or used for a ping, periodically to determine if it has recovered. A number of pings equal to 1000 minus the current health value must pass sequentially in order for an interface to be considered fully healthy.

EX: if the NI's health value is 900, then 100 pings using that NI must be successful in order for the NI to be considered fully healthy. Each successful send will increment the NI's health value by 1.
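
The arithmetic in the example can be sketched as follows. The function names are hypothetical, and the sketch models only successful pings; what a failed recovery ping does to the health value is not covered here.

```python
MAX_HEALTH = 1000


def pings_to_full_health(health: int) -> int:
    """Number of successful pings needed to reach MAX_HEALTH,
    given that each successful send adds 1 to the health value."""
    return MAX_HEALTH - health


def health_after_successes(health: int, successes: int) -> int:
    """Health value after a run of successful pings, capped at MAX_HEALTH."""
    return min(MAX_HEALTH, health + successes)
```

For an NI at 900, `pings_to_full_health(900)` gives 100, matching the example above.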

lnt-040

REQUIRED | 1.0 | ACCEPTED

On a local, remote, or network timeout, LNet shall reselect a pair of local and peer NIs to resend the message.

  • Another option is to be more granular when selecting the interfaces, depending on the timeout that occurred: on a local timeout re-select only a different local NI, on a remote timeout re-select only a peer NI, and on a network timeout select a new pair of local and peer NIs.
  • This has the disadvantage of a more complicated implementation.
  • There isn't a clear advantage to making the selection more granular.
  • Since the health value of the NI in question is decremented based on the error encountered, the selection algorithm will already favor the NI with poor health less.

lnt-045

REQUIRED | 1.0 | ACCEPTED | On any type of timeout, if the peer is non-MR capable, LNet shall retransmit the message from the same local NI.

lnt-050

DESIRED | 1.0 | ACCEPTED

For routers, LNet shall re-transmit a message over any of its local NIs.

Routers are a special case since non-MR peers expect the same source NID from the final destination, but don't care about the router NIDs. The router NIDs are not passed up to ptlrpc or other ULPs.

lnt-055

REQUIRED | 1.0 | ACCEPTED

LNet shall not attempt to resend a message on the following failure types:

  1. Shutdown in progress
  2. Out of memory
  3. Discovery errors out with one of the errors on this list.
  4. An MD bind failure
    1. -EINVAL
    2. -EHOSTUNREACH
  5. Invalid information given

  6. Internal failure

The assumption is that any resend will encounter the same failure again. Let the upper layers deal with the failure.
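
The rule in lnt-055 can be sketched as a membership test against a fixed set of failure kinds. The enum below is hypothetical and simplifies the list: the specific error codes are not modelled, and discovery errors are treated as a single kind.

```python
from enum import Enum, auto


class SendFailure(Enum):
    # Failure kinds listed in lnt-055 (names are illustrative only).
    SHUTDOWN_IN_PROGRESS = auto()
    OUT_OF_MEMORY = auto()
    DISCOVERY_ERROR = auto()
    MD_BIND_FAILURE = auto()
    INVALID_INFORMATION = auto()
    INTERNAL_FAILURE = auto()
    # Timeout kinds, which lnt-040 says do trigger a resend.
    LOCAL_TIMEOUT = auto()
    REMOTE_TIMEOUT = auto()
    NETWORK_TIMEOUT = auto()


NON_RESENDABLE = frozenset({
    SendFailure.SHUTDOWN_IN_PROGRESS,
    SendFailure.OUT_OF_MEMORY,
    SendFailure.DISCOVERY_ERROR,
    SendFailure.MD_BIND_FAILURE,
    SendFailure.INVALID_INFORMATION,
    SendFailure.INTERNAL_FAILURE,
})


def may_resend(failure: SendFailure) -> bool:
    """Return True if a resend could plausibly succeed for this failure."""
    return failure not in NON_RESENDABLE
```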

lnt-060

REQUIRED | 1.0 | ACCEPTED

LNet shall re-transmit messages no more than the retry count specified by the user.

lnt-065

REQUIRED | 1.0 | ACCEPTED

LNet shall stop re-transmitting when one of the following criteria is satisfied:

  1. Message is sent successfully
  2. Retry count is reached
  3. Transaction timeout expires.
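
The three stop conditions can be sketched as a simple loop. This is a synchronous illustration only. The `send_once` callback, the injectable `clock`, and the function name are hypothetical, and the first attempt is assumed not to count as a retry.

```python
import time
from typing import Callable


def send_with_retries(
    send_once: Callable[[], bool],
    retry_count: int,
    transaction_timeout: float,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Send once, then re-transmit until success, until the retry
    count is reached, or until the transaction timeout expires."""
    deadline = clock() + transaction_timeout
    retries_done = 0
    while True:
        if send_once():
            return True                      # 1. message sent successfully
        if retries_done >= retry_count:
            return False                     # 2. retry count reached
        if clock() >= deadline:
            return False                     # 3. transaction timeout expired
        retries_done += 1
```

With `retry_count` set to 0 the loop makes a single attempt and never re-transmits.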

lnt-070

REQUIRED | 1.0 | ACCEPTED | LNet shall default the transaction timeout to 5 seconds.

lnt-075

REQUIRED | 1.0 | ACCEPTED

LNet shall time out a message and send a failure event to the ULP if the message is not re-transmitted successfully.

lnt-080

REQUIRED | 1.0 | ACCEPTED

LNet shall calculate the message timeout based on the ULP provided timeout if one is provided or the configured transaction timeout otherwise.

message timeout = transaction timeout / number of retries.
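
A worked sketch of this calculation follows. The function and parameter names are hypothetical. The formula is undefined for a retry count of 0, so the sketch assumes the whole transaction timeout is used in that case; the requirements do not state this.

```python
from typing import Optional

DEFAULT_TRANSACTION_TIMEOUT = 5.0  # seconds, the default from lnt-070


def message_timeout(
    retry_count: int,
    ulp_timeout: Optional[float] = None,
    configured_timeout: float = DEFAULT_TRANSACTION_TIMEOUT,
) -> float:
    """Per-message timeout handed to the LND.

    Uses the ULP-provided timeout when given, otherwise the
    configured transaction timeout.
    """
    transaction_timeout = (
        ulp_timeout if ulp_timeout is not None else configured_timeout
    )
    if retry_count <= 0:
        # Assumption: no retries means no division.
        return transaction_timeout
    return transaction_timeout / retry_count
```

For example, with the default 5-second transaction timeout and a retry count of 5, each message gets 1 second at the LND layer.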

lnt-085

REQUIRED | 1.0 | ACCEPTED | LNet shall pass the message timeout to the LND and will rely on the LND to enforce the timeout. If the LND times out the message, it will notify the LNet layer, which will attempt to re-transmit the message.

lnt-090

REQUIRED | 1.0 | ACCEPTED | LNet shall not attempt to re-transmit if the retry count is set to 0.

lnt-095

REQUIRED | 1.0 | ACCEPTED | LNet shall monitor the ACK/REPLY for a PUT/GET. It will send a timeout event for a PUT or a GET if the respective ACK/REPLY is not received within the transaction timeout.

lnt-100

REQUIRED | 1.0 | ACCEPTED

LNet shall allow the callers of LNetGet() or LNetPut() to specify a different transaction timeout other than the one configured in the system.

  • EX: lnetctl ping can specify a shorter timeout than ptlrpc

lnt-105

REQUIRED | 1.0 | ACCEPTED

LNet shall activate the transaction timeout only after a PUT which requires an ACK, or a GET which requires a REPLY, is successfully passed to the LND (i.e. lnd_send() returns successfully).

For a PUT which requires no ACK, no timeout will be activated.
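
The activation rule in lnt-105 reduces to a small predicate. In the sketch below the string message types, the flag names, and the function name are hypothetical.

```python
def should_arm_transaction_timeout(
    msg_type: str, ack_requested: bool, lnd_send_ok: bool
) -> bool:
    """Decide whether to start the transaction timer for a message.

    msg_type:      "PUT" or "GET"; other types never start the timer
    ack_requested: whether the PUT asked for an ACK (ignored for GET)
    lnd_send_ok:   whether the hand-off to the LND returned successfully
    """
    if not lnd_send_ok:
        # The timer starts only after a successful hand-off to the LND.
        return False
    if msg_type == "GET":
        return True           # a GET always expects a REPLY
    if msg_type == "PUT":
        return ack_requested  # a PUT without an ACK has nothing to wait for
    return False
```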

lnt-110

DESIRED | 1.0 | IN-PROGRESS | LNet shall use UDEV events to propagate errors detected on a local or peer NI.

lnt-115

DESIRED | 1.0 | IN-PROGRESS | LNet shall handle flapping of interfaces and will favor a flapping interface less.

Statistics Requirements

ID | Class | Version | Status | Description

stt-005

REQUIRED | 1.0 | ACCEPTED | LNet shall maintain the number of resends due to a local timeout per local NI.

stt-010

REQUIRED | 1.0 | ACCEPTED | LNet shall maintain the number of resends due to a remote timeout per peer NI.

stt-015

REQUIRED | 1.0 | ACCEPTED | LNet shall maintain the number of resends due to a network timeout per local and peer NI.

stt-020

DESIRED | 1.0 | ACCEPTED | LNet shall maintain the number of local interface down events.

stt-025

DESIRED | 1.0 | ACCEPTED | LNet shall maintain the number of local interface up events.

stt-030

DESIRED | 1.0 | ACCEPTED | LNet shall maintain the average time it takes to successfully send a message per peer NI.

stt-035

DESIRED | 1.0 | ACCEPTED | LNet shall maintain the average time it takes to successfully complete a transaction per peer NI.

stt-040

DESIRED | 1.0 | IN-PROGRESS | LNet shall provide a method to reset statistics.

Debugging Requirements

ID | Class | Version | Status | Description

dbg-005

DESIRED | 1.0 | ACCEPTED | LND shall provide hooks to simulate a local timeout.

dbg-015

DESIRED | 1.0 | ACCEPTED | LND shall provide hooks to simulate a remote timeout.

dbg-020

DESIRED | 1.0 | ACCEPTED | LND shall provide hooks to simulate a network timeout.

dbg-025

DESIRED | 1.0 | ACCEPTED | LND shall provide hooks to simulate an interface down event.

dbg-030

DESIRED | 1.0 | ACCEPTED | LND shall provide hooks to simulate an interface up event.

dbg-035

DESIRED | 1.0 | ACCEPTED | LNet shall provide hooks to simulate an ACK timeout.

dbg-040

DESIRED | 1.0 | ACCEPTED | LNet shall provide hooks to simulate a REPLY timeout.

Testing Requirements

ID | Class | Version | Status | Description

tst-005

DESIRED | 1.0 | ACCEPTED | LNet Health shall be testable via the LUTF.

tst-010

DESIRED | 1.0 | ACCEPTED | LUTF shall utilize the hooks provided by the LND and LNet to trigger failures for testing purposes.

Documentation Requirements

ID | Class | Version | Status | Description

doc-005

REQUIRED | 1.0 | ACCEPTED | The user-facing configuration shall be documented in the Lustre manual.

doc-010

REQUIRED | 1.0 | ACCEPTED | The trickle-down approach for timeouts described in this document shall be documented in the Lustre manual.

3 Comments

  1. Notes:

    Should the health value be a value from 0-100? NIs are initialized to 100 and decremented when there is an error that points to that interface. When an interface reaches 0, then it will not be selected.

    Should the health value be decremented on every failure? Or should multiple failures need to occur before the health is decremented, as a grace range? I believe that having the sensitivity value, which allows the interface to be selected within a range, gives the same behavior.

    1. try all interfaces before waiting.

    2. always queue.

    3. tunables:

      - number of retries

      - global timeout

        - overwritten by particular callers

    4. Select on health first. no averages.

    5. If a local interface or a peer interface has their health value decremented, then we'll place them on a recovery queue, to be used on some interval (possibly the same as the router_checker interval) and when that interface is up again, we can increment the health value.



  2. Overall, good set of requirements.  

    Might be a good idea to add a table at the top where reviewers can add their userid and a date stamp as a form of sign-off.  Sort of like how we give a +1 in Gerrit.  Get two sign-offs and you are good to go!  :^).

    Also, please find a way to turn off the captcha feature for the Wiki.  I don't know about others, but I had to enter a captcha for every single comment (including this one).  Not fun.

    1. I updated the requirements according to the discussion/comments.

      There is still one comment I'm mulling over. I'll tag Olaf and Andreas to give their thoughts as well.

      I was just thinking about the sign-off table.  I added it.