LNet Multi-Rail allows multiple interfaces to be used on the same LNet network or across multiple LNet networks. The LNet Health feature adds the ability to resend messages across different interfaces when interface or network failures are detected. This allows LNet to mitigate communication failures before passing them to upper layers for further error handling. To accomplish this, LNet Health depends on health information reported by the LND and by underlying fabrics such as MLX and OPA.
LNet Health will monitor three different types of failures:
- local interface failures as reported by the underlying fabric
- remote interface failures as reported by the remote fabric
- network timeouts.
Each of these classes of failure is dealt with separately at the LNet layer. Implementing the health feature at the LNet layer allows LNet to re-transmit messages across different types of interfaces. For example, if a peer has both MLX and TCP interfaces and a transmit error is detected on one of them, LNet can re-transmit the message on the other available interface, provided the peer exists on both networks.
- LNet shall make best effort to deliver the message to a node that is directly connected, whether that is the final destination of the message or a router in a chain of routers to the final destination.
- LNet shall re-transmit LNet messages (PUT/GET/ACK/REPLY) over the different local and remote interfaces available.
- Provide a global timeout mechanism for timing out if ACK/REPLY are not received for their respective PUT/GET.
- Handle o2iblnd timeouts and errors.
- Handle socklnd timeouts and errors.
- Feature testable via LUTF.
- LNet shall not be responsible for end-to-end message reliability.
- This feature will not add any health functionality to LNDs other than o2iblnd and socklnd. Other LNDs, such as gnilnd, will not be modified.
Key Milestones and Deliverables
|Scope and Requirements||Scope and Requirements Document|
|High-level Design||High-level Design Document|
|Unit Test Plan||Unit Test Plan Document|
|Unit Test Infrastructure||LNet Unit Test Framework Infrastructure improvement|
|Unit Test Plan Development and execution||LUTF Test Scripts & Reports|
|Regression Testing||Test Reports|
|Test & Fix|
This section details the LNet Health solution requirements.
The requirements are broken down into the categories described below.
|Configuration (cfg)||All requirements which specify the user interaction with the LNet Health Feature|
|LNet Driver (lnd)||All requirements which specify LND behavior|
|LNet (lnt)||All requirements which specify LNet Health behavior|
|Statistics (stt)||All requirements which specify LNet Health Statistics|
|Backward Compatibility (bck)||All requirements which specify how new Multi-Rail systems shall interact with old systems|
|Debugging (dbg)||All requirements which specify debugging features of the LNet Health Project|
|Testing (tst)||All requirements which specify how LNet Health will be tested|
|Documentation (doc)||All requirements which specify the LNet Health feature documentation|
Each requirement falls into one of these classes:
|REQUIRED||Core requirement that must be implemented|
|DESIRED||Requirement deemed an enhancement that can be implemented at a later date|
Each requirement will be in one of the following statuses:
|ACCEPTED||Requirement has been reviewed and accepted for implementation|
|IN-PROGRESS||Requirement is being reviewed.|
|REJECTED||Requirement has been reviewed and rejected for implementation. It will not be covered in the HLD or the implementation.|
|SHALL||This word, or the terms "REQUIRED" or "MUST", mean that the definition is an absolute requirement of the specification.|
|SHALL NOT||This phrase, or the phrase "MUST NOT", mean that the definition is an absolute prohibition of the specification.|
|SHOULD||This word, or the adjective "RECOMMENDED", mean that there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course.|
|SHOULD NOT||This phrase, or the phrase "NOT RECOMMENDED", mean that there may exist valid reasons in particular circumstances when the particular behavior is acceptable or even useful, but the full implications should be understood and the case carefully weighed before implementing any behavior described with this label.|
Each requirement will have the following attributes:
|ID||Unique ID of the requirement; comprised of:|
|Class||Class of the requirement as defined above|
|Version||Version number of the requirement|
|Status||Current status of the requirement. Will be one of the following:|
|Description||A description of the requirement|
|Transaction||A transaction is a PUT/GET and the respective ACK/REPLY. Note that a PUT does not necessarily need an ACK.|
|Transaction timeout||The timeout to wait for an ACK/REPLY|
|Message timeout||The timeout after which a PUT, GET, ACK, or REPLY is re-transmitted|
|Local timeout||A timeout that occurs due to a problem on the local NI. The message has not been posted to the network|
|Remote timeout||A timeout that occurs due to a problem on the peer NI. A message has been posted to the network and completed successfully, but an expected acknowledgement is not received. Not receiving an LNet ACK/REPLY is categorized as a remote timeout. Some LND protocol messages also expect an acknowledgement; not receiving one constitutes a remote timeout.|
|Network timeout||A message has been posted but the underlying communication protocol (e.g. IB, TCP) has not completed within the message timeout.|
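The three timeout definitions above differ only in how far the message got before the timer expired. The following is a minimal sketch of that distinction, not LNet code: the `TimeoutKind` enum and the `classify_timeout` function with its `posted` and `completed` flags are hypothetical names introduced for illustration.

```python
from enum import Enum


class TimeoutKind(Enum):
    LOCAL = "local"      # message was never posted to the network
    NETWORK = "network"  # posted, but the transport did not complete in time
    REMOTE = "remote"    # posted and completed, but the expected ack never arrived


def classify_timeout(posted: bool, completed: bool) -> TimeoutKind:
    """Map the state of a timed-out message to one of the three timeout classes.

    `posted` and `completed` are hypothetical flags describing how far the
    message progressed before its timer expired.
    """
    if not posted:
        return TimeoutKind.LOCAL
    if not completed:
        return TimeoutKind.NETWORK
    return TimeoutKind.REMOTE
```

A message that completed on the wire but never saw its ACK/REPLY therefore lands in the remote class, matching the definition above.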
|REQUIRED||1.0||ACCEPTED||LNet Health shall be configurable via the Dynamic LNet Configuration (DLC) API, which uses the sysfs interface|
|REQUIRED||1.0||ACCEPTED||LNet Health configuration and statistics queried through DLC API shall be represented via YAML syntax|
The number of times that LNet shall retry sending a message shall be configurable via DLC or sysfs. The number of retries shall be an integer greater than or equal to 0. A value of 0 turns off re-transmits.
The retry count will default to 0. In essence, for the first release of the LNet Health feature, re-transmission will be turned off and the user must turn it on explicitly. The reasoning is that most sites currently have only one interface, and re-transmitting on the same interface after a failure will likely result in further failures.
A health range value shall be configurable. The NI health is a value between 0 and 1000. Each NI health value is initialized to 1000 (fully healthy) on startup and is decremented on failure to transmit on or to that NI.
The health range value provides a range within which an NI may be selected. Any NI within the configured health value range will be considered healthy. A value greater than 1000 turns off selection based on health. A value of 0 selects only fully healthy interfaces.
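The effect of the health range on NI selection can be sketched as a simple filter. This is an illustration under a stated assumption: the range is measured downward from the maximum health of 1000, so an NI qualifies when its health is at least `1000 - health_range`. The function name `eligible_nis` and the NI labels are hypothetical.

```python
from typing import Dict, List

MAX_HEALTH = 1000  # fully healthy, per the requirement above


def eligible_nis(health_by_ni: Dict[str, int], health_range: int) -> List[str]:
    """Return the NIs whose health falls within the configured range.

    Assumption: the range is measured down from MAX_HEALTH. A range of 0
    admits only fully healthy NIs; a range above 1000 admits every NI,
    which effectively turns off health-based selection.
    """
    threshold = MAX_HEALTH - health_range
    return [ni for ni, health in health_by_ni.items() if health >= threshold]


nis = {"ni_a": 1000, "ni_b": 950, "ni_c": 400}  # hypothetical NI labels
print(eligible_nis(nis, 0))
print(eligible_nis(nis, 100))
print(eligible_nis(nis, 1001))
```

With a range of 0 only `ni_a` qualifies, a range of 100 also admits `ni_b`, and a range of 1001 admits all three.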
Health sensitivity defines how much the health value is decremented by on each failure. It defines the percentage of failure that the system is sensitive to. The health sensitivity will be configured as a percentage value.
Health sensitivity can be set to 0, which turns off selection based on the health of the NI because the NI's health value will never be decremented.
Health sensitivity will default to 0. Refer to cfg-015 for the reason.
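A short sketch of how a sensitivity setting could translate into a health decrement is shown here. The source does not say what the percentage is taken of; this example assumes it is a percentage of the full 0 to 1000 range, and `decrement_health` is a hypothetical helper, not an LNet function.

```python
MAX_HEALTH = 1000


def decrement_health(health: int, sensitivity_pct: int) -> int:
    """Lower an NI's health after a send failure, never going below 0.

    Assumption: sensitivity is a percentage of the full health range,
    so 10 percent removes 100 health points per failure.
    """
    step = MAX_HEALTH * sensitivity_pct // 100
    return max(0, health - step)


print(decrement_health(1000, 10))  # one failure at 10 percent sensitivity
print(decrement_health(1000, 0))   # sensitivity 0: health never changes
```

With sensitivity left at its default of 0, the health value stays at 1000 regardless of failures, which is why health-based selection is effectively off by default.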
|REQUIRED||1.0||ACCEPTED||A transaction timeout to wait for an ACK/REPLY for a PUT/GET shall be configurable from the DLC API and sysfs.|
The user shall be able to reset the health of an interface dynamically, in effect putting it back into service.
This would be useful when debugging the system, or when a problem has been resolved and the interface can be used. There would be no need to wait for the recovery time.
LNet Driver (LND) Requirements
|REQUIRED||1.0||ACCEPTED||LNet health shall build on existing LND failure handling and shall not add new failure handling that doesn't already exist, except requirements explicitly outlined in this document.|
|DESIRED||1.0||ACCEPTED||The LND shall listen to events from the driver indicating fatal device failure, such as device unplugged.|
|DESIRED||1.0||ACCEPTED||The LND shall report fatal device failure via callbacks to the LNet layer|
|DESIRED||1.0||ACCEPTED||The LND shall detect device degradation if the underlying driver provides this information|
|DESIRED||1.0||ACCEPTED||The LND shall report device degradation via callbacks to the LNet layer if the LND supports that.|
The LND transmit timeouts shall be provided by the LNet layer.
|REQUIRED||1.0||ACCEPTED||On timeout the LND shall drop all queued messages on the existing connection and close it.|
|REQUIRED||1.0||ACCEPTED||The LND shall detect when a message times out before being posted for send. This timeout will be termed "local timeout"|
|REQUIRED||1.0||ACCEPTED||The LND shall detect when a message times out after it has been transmitted but not completed. This timeout will be termed "network timeout"|
The LND shall detect when a message times out before receiving an expected acknowledgment message:
This will be termed "remote timeout"
Note that for IMMEDIATE o2iblnd messages there is no expected ACK/NACK, so the transmit completion is the only indication that the message was sent successfully. This feature will not add any further handshaking for this category of messages.
|REQUIRED||2.0||ACCEPTED||The LND shall propagate through |
|REQUIRED||1.0||ACCEPTED||LNet shall maintain a health value per local NI|
|REQUIRED||1.0||ACCEPTED||LNet shall maintain a health value per peer NI|
The health value is a number between 0 and 1000, inclusive.
This range is chosen to allow enough granularity for decrementing and incrementing the health value.
|REQUIRED||1.0||ACCEPTED||LNet shall decrement the health value of an NI by the configured health sensitivity value whenever there is an error sending a message over or to the NI. The health value shall not be less than 0.|
|REQUIRED||1.0||ACCEPTED||LNet shall increment the health value, but not beyond 1000, which represents a healthy NI.|
LNet shall determine the NI to select based on the following ordered criteria:
When LNet fails to send on a local NI or to a remote NI, it shall place that NI on a recovery queue. The NI shall be pinged, or used for a ping, periodically to determine whether it has recovered. A number of consecutive pings equal to 1000 minus the current health value must succeed before the interface is considered fully healthy.
For example, if the NI's health value is 900, then 100 pings using that NI must succeed before the NI is considered fully healthy. Each successful send will increment the NI's health value by 1.
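The recovery arithmetic described above can be checked with a few lines of code. This is a sketch only; `pings_to_full_health` and `record_successful_ping` are hypothetical names.

```python
MAX_HEALTH = 1000


def pings_to_full_health(health: int) -> int:
    """Number of successful pings needed to reach full health."""
    return MAX_HEALTH - health


def record_successful_ping(health: int) -> int:
    """Each successful ping raises health by 1, capped at MAX_HEALTH."""
    return min(MAX_HEALTH, health + 1)


health = 900
needed = pings_to_full_health(health)
for _ in range(needed):
    health = record_successful_ping(health)
print(needed, health)
```

Starting from a health of 900, the loop runs 100 times and leaves the NI at 1000, after which further successful pings no longer change the value.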
On a local, remote, or network timeout, LNet shall reselect a local and peer NI pair to resend the message.
|REQUIRED||1.0||ACCEPTED||On any type of timeout, if the peer is non-MR capable, LNet shall retransmit the message from the same local NI.|
For routers, LNet shall re-transmit a message over any of its local NIs.
Routers are a special case because non-MR peers expect the same source NID from the final destination, but do not care about the router NIDs. The router NIDs are not passed up to ptlrpc or other ULPs.
LNet shall not attempt to resend a message on the following failure types:
The assumption is that any resend will encounter the same failure again. Let the upper layers deal with the failure.
LNet shall re-transmit messages no more than the retry count specified by the user.
LNet shall stop re-transmitting when one of the following criteria is satisfied:
|REQUIRED||1.0||ACCEPTED||LNet shall default the transaction timeout to 5 seconds|
LNet shall time out a message and send a failure event to the ULP if the message is not re-transmitted successfully.
LNet shall calculate the message timeout based on the ULP provided timeout if one is provided or the configured transaction timeout otherwise.
message timeout = transaction timeout / number of retries.
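The formula above can be written out as a small function. This is a sketch under assumptions: the `message_timeout` name and its parameters are hypothetical, and the behaviour for a retry count of 0 (returning the whole transaction timeout, since no re-transmission takes place) is a choice made here to avoid a division by zero, not something the requirement specifies.

```python
from typing import Optional


def message_timeout(configured_transaction_timeout: float,
                    retry_count: int,
                    ulp_timeout: Optional[float] = None) -> float:
    """Compute the per-message timeout handed to the LND.

    The ULP-provided timeout takes precedence over the configured
    transaction timeout when one is given.
    """
    if ulp_timeout is not None:
        transaction_timeout = ulp_timeout
    else:
        transaction_timeout = configured_transaction_timeout
    if retry_count == 0:
        # Assumption: with retries disabled the single attempt may use
        # the full transaction timeout.
        return transaction_timeout
    return transaction_timeout / retry_count


print(message_timeout(5.0, 5))                    # default 5 s transaction timeout, 5 retries
print(message_timeout(5.0, 2, ulp_timeout=10.0))  # ULP override
```

With the default 5 second transaction timeout and 5 retries, each attempt gets 1 second; a ULP-supplied 10 second timeout with 2 retries gives each attempt 5 seconds.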
|REQUIRED||1.0||ACCEPTED||LNet shall pass the message timeout to the LND and will rely on the LND to enforce the timeout. If the LND times out the message then it will notify the LNet layer which will attempt to re-transmit the message.|
|REQUIRED||1.0||ACCEPTED||LNet shall not attempt to re-transmit if the retry count is set to 0|
|REQUIRED||1.0||ACCEPTED||LNet shall monitor the ACK/REPLY for a PUT/GET. It will send a timeout event for a PUT or a GET if the respective ACK/REPLY is not received within the transaction timeout.|
LNet shall allow the callers of LNetGet() or LNetPut() to specify a transaction timeout other than the one configured in the system.
LNet shall activate the transaction timeout only after a PUT which requires an ACK, or a GET which requires a REPLY, is successfully passed to the LND (i.e. lnd_send() returns successfully).
For a PUT which requires no ACK, no timeout will be activated.
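The activation rule above reduces to a small decision function. This is an illustrative sketch, not the LNet implementation: `should_arm_transaction_timeout` and its parameters are hypothetical, and it assumes every GET expects a REPLY.

```python
def should_arm_transaction_timeout(msg_type: str,
                                   ack_requested: bool,
                                   lnd_send_ok: bool) -> bool:
    """Decide whether the transaction timer should start for a message.

    The timer starts only once the message has been handed to the LND
    successfully, and only if a response (ACK or REPLY) is expected.
    """
    if not lnd_send_ok:
        return False  # never handed to the LND, so nothing to wait for
    if msg_type == "GET":
        return True   # assumption: a GET always expects a REPLY
    if msg_type == "PUT":
        return ack_requested
    return False      # ACK and REPLY messages start no transaction timer


print(should_arm_transaction_timeout("PUT", ack_requested=True, lnd_send_ok=True))
print(should_arm_transaction_timeout("PUT", ack_requested=False, lnd_send_ok=True))
```

A PUT sent without an ACK request never arms the timer, so it can only fail through the message-level timeouts handled by the LND.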
|DESIRED||1.0||IN-PROGRESS||LNet shall use UDEV events to propagate errors detected on a local or peer NI.|
|DESIRED||1.0||IN-PROGRESS||LNet shall handle flapping interfaces and will favor a flapping interface less during selection.|
|REQUIRED||1.0||ACCEPTED||LNet shall maintain the number of resends due to a local timeout per local NI|
|REQUIRED||1.0||ACCEPTED||LNet shall maintain the number of resends due to a remote timeout per peer NI|
|REQUIRED||1.0||ACCEPTED||LNet shall maintain the number of resends due to a network timeout per local and peer NI|
|DESIRED||1.0||ACCEPTED||LNet shall maintain the number of local interface down events|
|DESIRED||1.0||ACCEPTED||LNet shall maintain the number of local interface up events|
|DESIRED||1.0||ACCEPTED||LNet shall maintain the average time it takes to successfully send a message per peer NI|
|DESIRED||1.0||ACCEPTED||LNet shall maintain the average time it takes to successfully complete a transaction per peer NI|
|DESIRED||1.0||IN-PROGRESS||LNet shall provide a method to reset statistics.|
|DESIRED||1.0||ACCEPTED||LND shall provide hooks to simulate a local timeout|
|DESIRED||1.0||ACCEPTED||LND shall provide hooks to simulate a remote timeout|
|DESIRED||1.0||ACCEPTED||LND shall provide hooks to simulate a network timeout|
|DESIRED||1.0||ACCEPTED||LND shall provide hooks to simulate an interface down event|
|DESIRED||1.0||ACCEPTED||LND shall provide hooks to simulate an interface up event|
|DESIRED||1.0||ACCEPTED||LNet shall provide hooks to simulate an ACK timeout|
|DESIRED||1.0||ACCEPTED||LNet shall provide hooks to simulate a REPLY timeout|
|DESIRED||1.0||ACCEPTED||LNet Health shall be testable via the LUTF|
|DESIRED||1.0||ACCEPTED||LUTF shall utilize the hooks provided by the LND and LNet to trigger failures for testing purposes.|
|REQUIRED||1.0||ACCEPTED||The user facing configuration shall be documented in the Lustre manual|
|REQUIRED||1.0||ACCEPTED||The trickle down approach for timeouts described in this document shall be documented in the Lustre manual|