Scope
Overview
LNet Multi-Rail allows multiple interfaces to be used on the same LNet network or across multiple LNet networks. The LNet Health feature adds the ability to resend messages across different interfaces when interface or network failures are detected. This allows LNet to mitigate communication failures before passing them to upper layers for further error handling. To accomplish this, LNet Health depends on health information reported by the underlying fabrics, such as MLX and OPA.
LNet Health will monitor three different types of failures:
- local interface failures as reported by the underlying fabric
- remote interface failures as reported by the remote fabric
- network timeouts.
Each of these classes of failure is dealt with separately at the LNet layer. Implementing the health feature at the LNet layer allows LNet to re-transmit messages across different types of interfaces. For example, if a peer has both MLX and OPA interfaces and a transmit error is detected on one of them, LNet can re-transmit the message on the other available interface.
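The re-transmit behaviour described above can be illustrated with a minimal sketch. It is not the LNet implementation: the `Interface` type, the `send_with_failover` function, and the `transmit` callable are all hypothetical names introduced here, and the sketch assumes that a failed transmit simply marks the interface unhealthy.

```python
from dataclasses import dataclass


@dataclass
class Interface:
    """Hypothetical stand-in for one interface to a peer."""
    name: str             # illustrative label, e.g. "mlx0" or "opa0"
    healthy: bool = True  # cleared once a transmit error is seen


def send_with_failover(interfaces, transmit):
    """Try each healthy interface in turn until one transmit succeeds.

    `transmit(interface)` is an assumed callable that returns True on
    success and False when the fabric reports a transmit error.
    Returns the name of the interface that delivered the message, or
    None if every interface failed and the error has to be passed to
    the upper layer.
    """
    for iface in interfaces:
        if not iface.healthy:
            continue
        if transmit(iface):
            return iface.name
        # Remember the failure so later sends skip this interface.
        iface.healthy = False
    return None
```

With a peer that has one MLX and one OPA interface, a transmit error on the first causes the message to be re-sent on the second; only when both fail does the caller see a failure.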
In-Scope
- LNet shall make best effort to ensure a message is delivered to the immediate next-hop
- LNet shall re-transmit LNet messages (PUT/GET/ACK/REPLY) over the different local and remote interfaces available.
- Provide a global timeout mechanism that fires if an ACK/REPLY is not received for its respective PUT/GET
- Handle o2iblnd timeouts
- Handle socklnd timeouts
- Feature testable via LUTF
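One way to picture the global ACK/REPLY timeout item is a table of outstanding PUT/GET messages, each with a deadline. The sketch below is an assumption-laden illustration only: `ResponseTracker`, its methods, and the use of caller-supplied timestamps are hypothetical and not taken from the LNet code.

```python
class ResponseTracker:
    """Hypothetical tracker for PUT/GET messages awaiting ACK/REPLY."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s  # single global timeout, in seconds
        self._deadlines = {}        # message id -> absolute deadline

    def sent(self, msg_id, now):
        """Record that a PUT/GET was sent at time `now`."""
        self._deadlines[msg_id] = now + self.timeout_s

    def response_received(self, msg_id):
        """Clear a message on ACK/REPLY; False if it was not pending."""
        return self._deadlines.pop(msg_id, None) is not None

    def expire(self, now):
        """Return and forget ids whose deadline has passed at `now`."""
        expired = sorted(m for m, d in self._deadlines.items() if d <= now)
        for m in expired:
            del self._deadlines[m]
        return expired
```

Passing `now` in explicitly keeps the sketch deterministic; a real implementation would read a monotonic clock and decide what to do with each expired message.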
Out-of-Scope
- LNet shall not be responsible for end-to-end message reliability
Key Milestones and Deliverables
Milestones | Deliverables | Deliverable Date |
---|---|---|
Scope and Requirements | Scope and Requirements Document | |
High-level Design | High-level Design Document | |
Unit Test Plan | Unit Test Plan Document | |
Unit Test Infrastructure | LNet Unit Test Framework Infrastructure improvement | |
Implementation | Source Code | |
Unit Test Plan Development and execution | LUTF Test Scripts & Reports | |
Regression Testing | Test Reports | |
Code Review | ||
Test & Fix | ||
Landing | ||
Requirements
This section details the LNet Health solution requirements.
Categorization
The requirements are broken down into separate categories as described below:
Category | Description |
---|---|
Configuration (cfg) | All requirements which specify the user interaction with the LNet Health Feature |
LNet Driver (lnd) | All requirements which specify LND behavior |
LNet (lnt) | All requirements which specify LNet Health behavior |
Statistics (stt) | All requirements which specify LNet Health Statistics |
Backward Compatibility (bck) | All requirements which specify how new Multi-Rail systems shall interact with old systems |
Debugging (dbg) | All requirements which specify debugging features of the LNet Health Project |
Testing (tst) | All requirements which specify how LNet Health will be tested |
Documentation (doc) | All requirements which specify the LNet Health feature documentation |
Classes
Each requirement will fall into one of these classes:
Class | Description |
---|---|
REQUIRED | Core requirement that must be implemented |
DESIRED | Requirement deemed an enhancement that can be implemented at a later date |
Terms
Term | Description |
---|---|
SHALL | This word, or the terms "REQUIRED" or "MUST", mean that the definition is an absolute requirement of the specification. |
SHALL NOT | This phrase, or the phrase "MUST NOT", mean that the definition is an absolute prohibition of the specification. |
SHOULD | This word, or the adjective "RECOMMENDED", mean that there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course. |
SHOULD NOT | This phrase, or the phrase "NOT RECOMMENDED" mean that there may exist valid reasons in particular circumstances when the particular behavior is acceptable or even useful, but the full implications should be understood and the case carefully weighed before implementing any behavior described with this label. |
Requirement Format
Each requirement will have the following attributes:
Attribute | Description |
---|---|
ID | Unique ID of the requirement; comprised of: |
Class | Class of the requirement as defined above |
Version | Version number of the requirement |
Status | Current status of the requirement. Will be one of the following: |
Description | A description of the requirement |
Configuration Requirements
ID | Class | Version | Status | Description |
---|---|---|---|---|
cfg-005 | | | | LNet Health shall be configurable via the Dynamic LNet Configuration (DLC) API, which uses sysfs |
cfg-010 | | | | When LNet Health configuration and statistics are queried through the DLC API, the configuration shall be represented in YAML syntax |
cfg-015 | | | | The number of times that LNet shall retry sending a message shall be configurable via DLC or sysfs. The number of retries shall be an integer between 0 and 5. A value of 0 means no retries. |
cfg-020 |
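The retry-count rule in cfg-015 can be expressed as a small validation check. This is a sketch only: the names `MAX_RETRY_COUNT`, `validate_retry_count`, and `max_send_attempts` are hypothetical, and it assumes that both bounds (0 and 5) are inclusive and that a retry count of N permits one initial send plus N re-sends.

```python
MAX_RETRY_COUNT = 5  # upper bound from cfg-015 (assumed inclusive)


def validate_retry_count(value):
    """Return `value` if it is an acceptable retry count, else raise."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError("retry count must be an integer")
    if not 0 <= value <= MAX_RETRY_COUNT:
        raise ValueError(
            f"retry count must be between 0 and {MAX_RETRY_COUNT}"
        )
    return value


def max_send_attempts(retry_count):
    """Total sends allowed: the initial attempt plus the retries."""
    return 1 + validate_retry_count(retry_count)
```

A value of 0 therefore yields a single send attempt with no retries, which matches the wording of the requirement.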
Network Interface Health
ID | Class | Version | Status | Description |
---|---|---|---|---|
hlt-005 | DESIRED | 1.0 | ACCEPTED | The LND shall detect device failure |
hlt-010 | DESIRED | 1.0 | ACCEPTED | The LND shall report device failure via callbacks to the LNet layer |
hlt-015 | NICE-TO-HAVE | 1.0 | ACCEPTED | The LND shall detect device degradation |
hlt-020 | NICE-TO-HAVE | 1.0 | ACCEPTED | The LND shall report device degradation via callbacks to the LNet layer |
hlt-025 | NICE-TO-HAVE | 1.0 | ACCEPTED | The LNet layer shall update the status of the local NIs depending on the information reported by the LND layer |
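The callback flow in hlt-010 through hlt-025 can be sketched as follows. Everything here is illustrative: the `NIStatus` values, the `LNetHealth` class, and the callback names are hypothetical, and the rule that a failed NI is not downgraded to "degraded" is an assumption rather than a stated requirement.

```python
from enum import Enum


class NIStatus(Enum):
    """Hypothetical status values for a local network interface."""
    UP = "up"
    DEGRADED = "degraded"
    DOWN = "down"


class LNetHealth:
    """Hypothetical LNet-side record of local NI status."""

    def __init__(self, ni_names):
        self.status = {name: NIStatus.UP for name in ni_names}

    def on_device_failure(self, ni_name):
        """Callback the LND would invoke on device failure."""
        self.status[ni_name] = NIStatus.DOWN

    def on_device_degraded(self, ni_name):
        """Callback the LND would invoke on device degradation."""
        # Assumption: a reported failure outranks a degradation report.
        if self.status[ni_name] is not NIStatus.DOWN:
            self.status[ni_name] = NIStatus.DEGRADED

    def usable(self):
        """Names of NIs that have not been reported as failed."""
        return [n for n, s in self.status.items() if s is not NIStatus.DOWN]
```

In this sketch the LND only reports events and the LNet layer owns the status table, which mirrors the split of responsibilities in the requirements above.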