Target release: Lustre 2.12
Epic: LU-9120
Document status: DRAFT
Document owner:
Designer:
Developers: Amir Shehata
QA: Amir Shehata

Scope

Overview

LNet Multi-Rail has implemented the ability for multiple interfaces to be used on the same LNet network or across multiple LNet networks. The LNet Health feature adds the ability to resend messages across different interfaces when interface or network failures are detected. This allows LNet to mitigate communication failures before passing the failures to upper layers for further error handling. To accomplish this, LNet Health depends on health information reported by the LND and underlying fabrics such as MLX and OPA.

LNet Health will monitor three different types of failures:

  • local interface failures as reported by the underlying fabric
  • remote interface failures as reported by the remote fabric
  • network timeouts.

Each of these classes of failure is dealt with separately at the LNet layer. Implementing the health feature at the LNet layer allows LNet to re-transmit messages across different types of interfaces. For example, if a peer has both MLX and OPA interfaces and a transmit error is detected on one of them, LNet can re-transmit the message on the other available interface.
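The re-transmit behavior described above can be sketched as follows. This is a minimal illustration, not the LNet implementation: the `Interface` type, the `send_with_failover` function, the `transmit` callback, and the interface names are all hypothetical, and the sketch simply tries interfaces in list order rather than modeling any real selection algorithm.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Interface:
    """Hypothetical stand-in for one interface available to reach a peer."""
    name: str    # illustrative name only, e.g. "mlx0" or "opa0"
    fabric: str  # e.g. "MLX" or "OPA"


def send_with_failover(
    message: bytes,
    interfaces: List[Interface],
    transmit: Callable[[Interface, bytes], bool],
) -> Optional[Interface]:
    """Try each interface in order; return the one that succeeded, else None.

    `transmit` is assumed to return True on success and False when a
    transmit error is detected on that interface.
    """
    for iface in interfaces:
        if transmit(iface, message):
            return iface
    # Every interface failed: the failure is passed to the caller,
    # mirroring how LNet would pass it to the upper layers.
    return None


def failing_mlx_transmit(iface: Interface, message: bytes) -> bool:
    """Pretend every MLX interface reports a transmit error."""
    return iface.fabric != "MLX"


peer_interfaces = [Interface("mlx0", "MLX"), Interface("opa0", "OPA")]
used = send_with_failover(b"PUT", peer_interfaces, failing_mlx_transmit)
```

Here `used` ends up as the OPA interface, because the simulated MLX transmit error causes the message to be re-sent on the other available interface.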

In-Scope

  • LNet shall make a best effort to ensure a message is delivered to the immediate next hop.
  • LNet shall re-transmit LNet messages (PUT/GET/ACK/REPLY) over the different local and remote interfaces available.
  • Provide a global timeout mechanism that times out a PUT/GET if its respective ACK/REPLY is not received.
  • Handle o2iblnd timeouts.
  • Handle socklnd timeouts.
  • The feature shall be testable via LUTF.
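One way to picture the global timeout for PUT/GET messages that expect an ACK/REPLY is a table of pending deadlines. The sketch below is an assumption-laden illustration only: `ResponseTracker`, its method names, and the use of a caller-supplied clock value are hypothetical and do not describe the LNet design.

```python
from typing import Dict, List


class ResponseTracker:
    """Hypothetical tracker for PUT/GET messages awaiting an ACK/REPLY."""

    def __init__(self, timeout_s: float) -> None:
        # A single, global timeout applied to every tracked message.
        self.timeout_s = timeout_s
        self._deadlines: Dict[int, float] = {}

    def sent(self, msg_id: int, now: float) -> None:
        """Record that a PUT/GET was sent at time `now`."""
        self._deadlines[msg_id] = now + self.timeout_s

    def response_received(self, msg_id: int) -> bool:
        """Clear a pending entry; return False if it was unknown or expired."""
        return self._deadlines.pop(msg_id, None) is not None

    def expire(self, now: float) -> List[int]:
        """Remove and return the IDs whose deadline has passed at `now`."""
        expired = sorted(m for m, d in self._deadlines.items() if d <= now)
        for m in expired:
            del self._deadlines[m]
        return expired


tracker = ResponseTracker(timeout_s=5.0)
tracker.sent(1, now=0.0)   # PUT, answered below
tracker.sent(2, now=1.0)   # GET, never answered
answered = tracker.response_received(1)
timed_out = tracker.expire(now=6.0)
```

Passing the current time in explicitly keeps the sketch deterministic; a real implementation would read a monotonic clock.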

Out-of-Scope

  • LNet shall not be responsible for end-to-end message reliability

Key Milestones and Deliverables

Milestones | Deliverables | Deliverable Date
Scope and Requirements | Scope and Requirements Document |
High-level Design | High-level Design Document |
Unit Test Plan | Unit Test Plan Document |
Unit Test Infrastructure | LNet Unit Test Framework Infrastructure improvement |
Implementation | Source Code |
Unit Test Plan Development and execution | LUTF Test Scripts & Reports |
Regression Testing | Test Reports |
Code Review | |
Test & Fix | |
Landing | |

Requirements

This section details the LNet Health solution requirements.

Categorization

The requirements are broken down into separate categories as described below.

Category | Description
Configuration (cfg) | All requirements which specify the user interaction with the LNet Health Feature
LNet Driver (lnd) | All requirements which specify LND behavior
LNet (lnt) | All requirements which specify LNet Health behavior
Statistics (stt) | All requirements which specify LNet Health Statistics
Backward Compatibility (bck) | All requirements which specify how new Multi-Rail systems shall interact with old systems
Debugging (dbg) | All requirements which specify debugging features of the LNet Health Project
Testing (tst) | All requirements which specify how LNet Health will be tested
Documentation (doc) | All requirements which specify the LNet Health feature documentation

Classes

Each requirement will fall into one of these classes.

Class | Description
REQUIRED | Core requirement; must be implemented
DESIRED | Requirement is deemed an enhancement which can be implemented at a later date

Terms

Term | Description
SHALL | This word, or the terms "REQUIRED" or "MUST", mean that the definition is an absolute requirement of the specification.
SHALL NOT | This phrase, or the phrase "MUST NOT", mean that the definition is an absolute prohibition of the specification.
SHOULD | This word, or the adjective "RECOMMENDED", mean that there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course.
SHOULD NOT | This phrase, or the phrase "NOT RECOMMENDED", mean that there may exist valid reasons in particular circumstances when the particular behavior is acceptable or even useful, but the full implications should be understood and the case carefully weighed before implementing any behavior described with this label.

Requirement Format

Each requirement will have the following attributes.

Attribute | Description
ID | Unique ID of the requirement, composed of:
  • The three-letter acronym of the requirement category, as defined above
  • A number that starts at 005 and is incremented in steps that leave room for adding requirements between existing ones
Class | Class of the requirement, as defined above
Version | Version number of the requirement:
  • Draft versions use the format 0.X, where X >= 0 (for example, 0.1)
  • Accepted versions use the format Y.00, where Y >= 0
Status | Current status of the requirement. Will be one of the following:
  • IN-PROGRESS: Still being developed and discussed
  • ACCEPTED: Has been agreed on and signed off
Description | A description of the requirement
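The ID and version formats above lend themselves to a small mechanical check. The following sketch is illustrative only: the function names are hypothetical, and the hyphen and three-digit width in the ID pattern are assumptions inferred from IDs such as cfg-005 rather than rules stated in this document.

```python
import re
from typing import Optional

# Assumed shape: three lowercase letters, a hyphen, three digits (e.g. cfg-005).
ID_PATTERN = re.compile(r"([a-z]{3})-(\d{3})")
DRAFT_PATTERN = re.compile(r"0\.\d+")      # 0.X where X >= 0
ACCEPTED_PATTERN = re.compile(r"\d+\.00")  # Y.00 where Y >= 0


def is_valid_requirement_id(req_id: str) -> bool:
    """Check the assumed ID shape and that the number is at least 005."""
    match = ID_PATTERN.fullmatch(req_id)
    return match is not None and int(match.group(2)) >= 5


def version_kind(version: str) -> Optional[str]:
    """Return "accepted", "draft", or None for an unrecognized format.

    The string "0.00" satisfies both stated formats; this sketch
    arbitrarily reports it as accepted.
    """
    if ACCEPTED_PATTERN.fullmatch(version):
        return "accepted"
    if DRAFT_PATTERN.fullmatch(version):
        return "draft"
    return None


checks = {
    "cfg-005": is_valid_requirement_id("cfg-005"),
    "cfg-000": is_valid_requirement_id("cfg-000"),
    "0.1": version_kind("0.1"),
    "2.00": version_kind("2.00"),
}
```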

Configuration Requirements

ID | Class | Version | Status | Description
cfg-005 | | | | LNet Health shall be configurable via the Dynamic LNet Configuration (DLC) API, which uses the sysfs interface.
cfg-010 | | | | LNet Health configuration and statistics queried through the DLC API shall be represented in YAML syntax.
cfg-015 | | | | The number of times that LNet retries sending a message shall be configurable via DLC or sysfs. The number of retries shall be an integer between 0 and 5, inclusive. A value of 0 means no retries.
cfg-020 | | | | The health sensitivity value shall be configurable. The smaller the value, the more sensitive the selection algorithm is to the health of the interfaces examined.
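The range rule in cfg-015 can be illustrated with a short validation sketch. The names `validate_retry_count` and `attempts_allowed` are hypothetical and are not part of the DLC API; the reading that a retry count of N permits N re-sends after the initial send is also an interpretation of "a value of 0 means no retries".

```python
MIN_RETRIES = 0  # per cfg-015, 0 means no retries
MAX_RETRIES = 5  # per cfg-015, upper bound of the allowed range


def validate_retry_count(value: int) -> int:
    """Return `value` if it is an integer in [0, 5]; raise otherwise."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError("retry count must be an integer")
    if not MIN_RETRIES <= value <= MAX_RETRIES:
        raise ValueError(
            f"retry count must be between {MIN_RETRIES} and {MAX_RETRIES}"
        )
    return value


def attempts_allowed(retry_count: int) -> int:
    """Total send attempts: the initial send plus the configured retries."""
    return 1 + validate_retry_count(retry_count)


no_retry_attempts = attempts_allowed(0)
max_attempts = attempts_allowed(5)
```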

Network Interface Health

ID | Class | Version | Status | Description
hlt-005 | DESIRED | 1.0 | ACCEPTED | The LND shall detect device failure
hlt-010 | DESIRED | 1.0 | ACCEPTED | The LND shall report device failure via callbacks to the LNet layer
hlt-015 | NICE-TO-HAVE | 1.0 | ACCEPTED | The LND shall detect device degradation
hlt-020 | NICE-TO-HAVE | 1.0 | ACCEPTED | The LND shall report device degradation via callbacks to the LNet layer
hlt-025 | NICE-TO-HAVE | 1.0 | ACCEPTED | The LNet layer shall update the status of the local NIs depending on the information reported by the LND layer
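The callback relationship in hlt-010, hlt-020, and hlt-025 can be sketched as follows. Every name here (`NIStatus`, `LNetLayer`, `LND`, `on_device_event`, and the NI name) is hypothetical, and the three-state status model is an assumption made for illustration rather than something this document specifies.

```python
from enum import Enum
from typing import Callable, Dict


class NIStatus(Enum):
    """Assumed status values for a local network interface (NI)."""
    UP = "up"
    DEGRADED = "degraded"
    DOWN = "down"


class LNetLayer:
    """Hypothetical LNet side: keeps the status of each local NI."""

    def __init__(self) -> None:
        self.ni_status: Dict[str, NIStatus] = {}

    def add_ni(self, ni: str) -> None:
        self.ni_status[ni] = NIStatus.UP

    def on_device_event(self, ni: str, status: NIStatus) -> None:
        """Callback invoked by the LND; updates the local NI status."""
        if ni not in self.ni_status:
            raise KeyError(f"unknown NI: {ni}")
        self.ni_status[ni] = status


class LND:
    """Hypothetical LND side: reports device events through a callback."""

    def __init__(self, report: Callable[[str, NIStatus], None]) -> None:
        self._report = report

    def device_degraded(self, ni: str) -> None:
        self._report(ni, NIStatus.DEGRADED)

    def device_failed(self, ni: str) -> None:
        self._report(ni, NIStatus.DOWN)


lnet = LNetLayer()
lnet.add_ni("ni0")
lnd = LND(report=lnet.on_device_event)
lnd.device_degraded("ni0")
status_after_degradation = lnet.ni_status["ni0"]
lnd.device_failed("ni0")
status_after_failure = lnet.ni_status["ni0"]
```

Handing the LND a callback, rather than letting it write NI state directly, keeps the status bookkeeping in the LNet layer, which matches the division of responsibility in the table above.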