Comment: Part of update of the discovery algo to match the implementation.
Page properties
Document status

Status

VERSION 1.

1

2 (in progress)

Document authors
Designer
Developers

 

Table of Contents

Introduction

...

The first revision of this document targets sign-off by all stakeholders. As the implementation work is divided into phases, further documents detailing the design will be created as needed, and this document will be updated with links to them.

Reference Documents

Document Structure

This document is made up of the following sections:

...

Kernel Space: Describes the details of Kernel Space changes including the Dynamic Discovery Behavior

Acronym Table

Acronym     Description
LNet        Lustre Network
NI          Network Interface
RPC         Remote Procedure Call
FS          File System
o2ib        InfiniBand Network
TCP         Ethernet TCP-layer Network
NUMA        Non-Uniform Memory Access
RR          Round Robin
CPT         CPU Partition
CB          Channel Bonding
NID         Network Identifier
downrev     Node with no Multi-Rail
uprev       Node with Multi-Rail

Design Overview

System level

...

eth[1,2,3], eth[1-4/2]


 

Expression Structural Form Description


Figure 4: syntax descriptor

An expression can be a number:

[<num>, <expr>]
represented as:
start == end == NUM

An expression can be a wildcard:

[*, <expr>]
represented as:
start == 0
end == U32_MAX
INCR == 1

An expression can be a range:

[<start> - <end>, <expr>]
represented as:
start == START_NUM
end == END_NUM
INCR == 1

An expression can be a range and an increment:

[<num-start> - <num-end>/<incr>, <expr>]
represented as:
start == START_NUM
end == END_NUM
INCR == INCREMENT VALUE
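
All four expression forms reduce to a (start, end, increment) triple. The following is a minimal sketch of that idea in C, not the implementation. The names range_expr, range_number, range_wildcard, range_span and range_expr_match are hypothetical, and using an increment of 1 for the single-number form is an assumption, since the text above does not state it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical descriptor for one bracketed expression element. */
struct range_expr {
	uint32_t start;
	uint32_t end;
	uint32_t incr;
};

/* [<num>]: start == end == NUM (an increment of 1 is assumed here). */
struct range_expr range_number(uint32_t num)
{
	struct range_expr re = { num, num, 1 };
	return re;
}

/* [*]: start == 0, end == U32_MAX, INCR == 1. */
struct range_expr range_wildcard(void)
{
	struct range_expr re = { 0, UINT32_MAX, 1 };
	return re;
}

/* [<start> - <end>/<incr>]; pass incr == 1 for a plain range. */
struct range_expr range_span(uint32_t start, uint32_t end, uint32_t incr)
{
	struct range_expr re = { start, end, incr };

	assert(start <= end);
	return re;
}

/* True if value is one of the numbers the expression selects. */
bool range_expr_match(const struct range_expr *re, uint32_t value)
{
	if (re->incr == 0 || value < re->start || value > re->end)
		return false;
	return (value - re->start) % re->incr == 0;
}
```

Under this reading, the [1-4/2] part of eth[1-4/2] selects 1 and 3, so range_expr_match() accepts those two values and rejects 2 and 4.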

Before the built structural form is passed to the kernel it must be serialized, so that no pointers are passed between user space and kernel space.
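
One way to picture the serialization step is to flatten a pointer-linked list of descriptors into a contiguous buffer holding a count followed by fixed-size records. The sketch below assumes that layout purely for illustration. The types expr_node and expr_wire and the function expr_serialize are hypothetical and are not taken from the implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical user-space node: descriptors chained by pointer. */
struct expr_node {
	uint32_t start;
	uint32_t end;
	uint32_t incr;
	struct expr_node *next;
};

/* Hypothetical pointer-free record, as it would sit in the buffer. */
struct expr_wire {
	uint32_t start;
	uint32_t end;
	uint32_t incr;
};

/*
 * Copy the list into buf as a 32-bit count followed by one record
 * per node. Returns the number of bytes written, or 0 if buf is
 * too small to hold the whole list.
 */
size_t expr_serialize(const struct expr_node *head, void *buf, size_t len)
{
	const struct expr_node *n;
	unsigned char *p = buf;
	uint32_t count = 0;
	size_t need;

	assert(buf != NULL);

	for (n = head; n != NULL; n = n->next)
		count++;

	need = sizeof(count) + (size_t)count * sizeof(struct expr_wire);
	if (need > len)
		return 0;

	memcpy(p, &count, sizeof(count));
	p += sizeof(count);
	for (n = head; n != NULL; n = n->next) {
		struct expr_wire w = { n->start, n->end, n->incr };

		memcpy(p, &w, sizeof(w));
		p += sizeof(w);
	}
	return need;
}
```

The receiving side can then walk the records by index, so no user-space address ever has to be interpreted in the kernel.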

...

The state of a peer is a combination of the following bits of information, where the flags can be found in the source code by prepending LNET_PEER_, so CONFIGURED becomes LNET_PEER_CONFIGURED.

  • CONFIGURED: The peer was configured via DLC.
  • DISCOVERED: The peer has been discovered.
  • UNDISCOVERED: Peer discovery was disabled when the peer was created.

Configuration via DLC overrides peer discovery, but does not prevent the discovery algorithm from processing a peer. The algorithm complains if it finds differences between the configuration and what the peer reports. As such the CONFIGURED and DISCOVERED flags can both be set on a peer.

The UNDISCOVERED state indicates that a peer has been seen by discovery but has not been updated because discovery is disabled. It signals that the peer only needs to be re-examined if discovery is enabled.
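
Because the state is a set of independent bits, CONFIGURED and DISCOVERED can be tested together. The sketch below illustrates this with the flag names from the text. The bit positions are illustrative assumptions rather than the values used in the source, and both helper functions are hypothetical.

```c
#include <stdbool.h>

/* Bit positions are illustrative assumptions, not the source values. */
#define LNET_PEER_CONFIGURED	(1U << 0)
#define LNET_PEER_DISCOVERED	(1U << 1)
#define LNET_PEER_UNDISCOVERED	(1U << 2)

/* A DLC-configured peer that discovery has processed carries both bits. */
bool peer_configured_and_discovered(unsigned int state)
{
	unsigned int both = LNET_PEER_CONFIGURED | LNET_PEER_DISCOVERED;

	return (state & both) == both;
}

/* An UNDISCOVERED peer is re-examined only once discovery is enabled. */
bool peer_needs_reexamination(unsigned int state, bool discovery_enabled)
{
	return discovery_enabled && (state & LNET_PEER_UNDISCOVERED) != 0;
}
```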

  • QUEUED: Peer is queued for discovery.
  • DISCOVERING: Discovery is active for the peer.
  • DATA_PRESENT: Peer data is available to update the peer.
  • NIDS_UPTODATE: Discovery has successfully updated the NIDs of the peer.
  • PING_SENT: Discovery has sent a Ping to the peer and is waiting for the Reply.
  • PUSH_SENT: Discovery has sent a Push to the peer and is waiting for the Ack.
  • PING_FAILED: Sending a ping to the peer failed.
  • PUSH_FAILED: Sending a push to the peer failed.
  • PING_REQUIRED: Discovery must Ping the peer.

The QUEUED flag is used to determine whether a peer is on the ln_dc_request or ln_dc_working queues via its lp_dc_list member. A peer is queued by lnet_peer_queue_for_discovery() and dequeued by lnet_peer_discovery_complete().
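
The relationship between the QUEUED flag and the lp_dc_list member can be sketched as follows. This is a simplified user-space model, not the kernel code: the list helpers stand in for the kernel list API, locking is omitted, the bit value is illustrative, and sketch_peer, sketch_queue_for_discovery() and sketch_discovery_complete() are hypothetical stand-ins for the peer structure and the two functions named above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define LNET_PEER_QUEUED	(1U << 0)	/* illustrative bit value */

/* Minimal circular doubly-linked list, standing in for the kernel list. */
struct list_node {
	struct list_node *prev, *next;
};

void list_init(struct list_node *h)
{
	h->prev = h->next = h;
}

void list_add_tail(struct list_node *n, struct list_node *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

void list_del_init(struct list_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	list_init(n);
}

bool list_empty(const struct list_node *h)
{
	return h->next == h;
}

/* Simplified stand-in for the peer; only the two relevant members. */
struct sketch_peer {
	unsigned int lp_state;
	struct list_node lp_dc_list;
};

/* Queue the peer on the request queue unless it is already queued. */
bool sketch_queue_for_discovery(struct sketch_peer *lp,
				struct list_node *request_q)
{
	assert(lp != NULL && request_q != NULL);

	if (lp->lp_state & LNET_PEER_QUEUED)
		return false;
	lp->lp_state |= LNET_PEER_QUEUED;
	list_add_tail(&lp->lp_dc_list, request_q);
	return true;
}

/* Remove the peer from whichever queue it is on and clear QUEUED. */
void sketch_discovery_complete(struct sketch_peer *lp)
{
	assert(lp != NULL);

	list_del_init(&lp->lp_dc_list);
	lp->lp_state &= ~LNET_PEER_QUEUED;
}
```

Testing the flag before touching the list keeps a peer from being linked onto a queue twice.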

The DISCOVERING flag indicates that peer discovery is looking at the peer. When it is cleared, one of DISCOVERED or UNDISCOVERED is set.
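
The rule that clearing DISCOVERING leaves exactly one of DISCOVERED or UNDISCOVERED set can be sketched as a small state update. Choosing between the two based on whether discovery is enabled is an assumption drawn from the description of UNDISCOVERED; the bit values and the function name are hypothetical.

```c
#include <stdbool.h>

/* Bit positions are illustrative assumptions, not the source values. */
#define LNET_PEER_DISCOVERED	(1U << 0)
#define LNET_PEER_UNDISCOVERED	(1U << 1)
#define LNET_PEER_DISCOVERING	(1U << 2)

/* Clear DISCOVERING and leave exactly one of the two result flags set. */
unsigned int sketch_finish_discovering(unsigned int state,
				       bool discovery_enabled)
{
	state &= ~(LNET_PEER_DISCOVERING | LNET_PEER_DISCOVERED |
		   LNET_PEER_UNDISCOVERED);
	state |= discovery_enabled ? LNET_PEER_DISCOVERED
				   : LNET_PEER_UNDISCOVERED;
	return state;
}
```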

The DATA_PRESENT flag is set by the event handler for an incoming Push if it successfully stores the data, and by the event handler for an incoming Reply to a Ping. These event handlers run with spinlocks held, which is why we postpone the complex operation of updating the peer until the discovery thread can do it. The discovery thread processes the data and updates the peer by calling lnet_peer_data_present(), which clears the flag.
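
The split between the event handler and the discovery thread can be sketched as two functions that communicate through the DATA_PRESENT bit. This is a simplified model with no locking. The structure sketch_peer, its lp_data and lp_updates members, and both function names are hypothetical, and the bit value is illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define LNET_PEER_DATA_PRESENT	(1U << 0)	/* illustrative bit value */

/* Simplified stand-in for the peer. */
struct sketch_peer {
	unsigned int lp_state;
	const void *lp_data;	/* hypothetical: stored Push or Reply data */
	int lp_updates;		/* hypothetical: number of updates applied */
};

/* Event-handler side: spinlocks are held, so only record the data. */
void sketch_event_store(struct sketch_peer *lp, const void *data)
{
	assert(lp != NULL);

	lp->lp_data = data;
	lp->lp_state |= LNET_PEER_DATA_PRESENT;
}

/* Discovery-thread side: do the complex update, then clear the flag. */
bool sketch_data_present(struct sketch_peer *lp)
{
	assert(lp != NULL);

	if (!(lp->lp_state & LNET_PEER_DATA_PRESENT))
		return false;
	/* The merge of lp->lp_data into the peer would happen here. */
	lp->lp_updates++;
	lp->lp_data = NULL;
	lp->lp_state &= ~LNET_PEER_DATA_PRESENT;
	return true;
}
```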

The NIDS_UPTODATE flag is used to indicate that the NIDs for the peer are believed to be known. It is cleared when data is received that indicates that the peer may have changed, like an incoming Push. If storing the data from an incoming Push fails we cannot set the DATA_PRESENT flag but do clear NIDS_UPTODATE to indicate that the peer must be re-examined.
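
The handling of an incoming Push described above amounts to a small state update: NIDS_UPTODATE is always cleared, and DATA_PRESENT is set only if the data was stored. A minimal sketch, with illustrative bit values and a hypothetical function name:

```c
#include <stdbool.h>

/* Bit positions are illustrative assumptions, not the source values. */
#define LNET_PEER_DATA_PRESENT	(1U << 0)
#define LNET_PEER_NIDS_UPTODATE	(1U << 1)

/* New state after an incoming Push; stored says whether the data was saved. */
unsigned int sketch_on_push(unsigned int state, bool stored)
{
	/* The peer may have changed, so its NIDs are no longer trusted. */
	state &= ~LNET_PEER_NIDS_UPTODATE;
	if (stored)
		state |= LNET_PEER_DATA_PRESENT;
	return state;
}
```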

The PING_SENT flag indicates that a Ping has been sent and we are waiting for a Reply message. The implication is that lp_ping_mdh is live and has an MD bound to it.

The PUSH_SENT flag indicates that a Push has been sent and we are waiting for an Ack message. The implication is that lp_push_mdh is live and has an MD bound to it.

The PING_FAILED flag indicates that an attempted Ping failed for some reason. In addition to LNet messaging failures, a Ping fails if the Reply does not fit in the pre-allocated buffer.
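
The two failure causes for a Ping can be sketched as a completion routine. Clearing PING_SENT on completion is an assumption made for this sketch, as are the bit values, the function name, and the parameters status, reply_len and buf_len.

```c
#include <stddef.h>

/* Bit positions are illustrative assumptions, not the source values. */
#define LNET_PEER_DATA_PRESENT	(1U << 0)
#define LNET_PEER_PING_SENT	(1U << 1)
#define LNET_PEER_PING_FAILED	(1U << 2)

/*
 * New state once a Ping completes. status is non-zero on an LNet
 * messaging failure; reply_len is the size of the Reply and buf_len
 * the size of the pre-allocated buffer.
 */
unsigned int sketch_ping_done(unsigned int state, int status,
			      size_t reply_len, size_t buf_len)
{
	state &= ~LNET_PEER_PING_SENT;
	if (status != 0 || reply_len > buf_len)
		state |= LNET_PEER_PING_FAILED;
	else
		state |= LNET_PEER_DATA_PRESENT;
	return state;
}
```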

The PUSH_FAILED flag indicates that an attempted Push failed for some reason. The node sending the Push only sees a failure if LNet messaging reports one.

The PING_REQUIRED flag indicates that a Ping is necessary to properly determine the state of a peer. Triggering a Ping is the mechanism by which discovery attempts to recover from any problems it may have encountered while processing a peer.

  • MULTI_RAIL: This flag indicates whether a peer is running a multi-rail aware version of Lustre.
  • If MULTI_RAIL is set, then lp_node_seqno contains the last ping source sequence number of the node that has been received by the peer.

The following discussion must be updated – OW

  • L: Local config sent to peer
  • P: Peer config merged
  • M: Multi-rail capable peer
  • D: Data received from peer, not yet merged
  • R: Reply to ping pending
  • A: Ack pending
  • Q: Queued for the discovery thread to work on
  • C: Configured by DLC
  • S: Size of MD buffers needs to be increased

...

  • INIT - Initial state; transitory only.
  • CREATED - peer_ni created, but no active connection exists.
  • ACTIVATING - 1st message sent to the peer_ni, but has not completed yet
  • CONNECTED - 1st message sent successfully
  • FAILED - A message (1st or after) has failed to send
  • DELETING - A dynamic update or a config delete removes that peer_ni
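
The states listed above can be modelled as an enumeration with a transition function. The sketch below is one possible reading of the list, not the implementation: the enum names, the event set, and the choice of which transitions are allowed are all assumptions.

```c
/* Hypothetical names for the peer_ni states listed above. */
enum sketch_ni_state {
	NI_INIT,
	NI_CREATED,
	NI_ACTIVATING,
	NI_CONNECTED,
	NI_FAILED,
	NI_DELETING
};

/* Hypothetical events that drive the transitions. */
enum sketch_ni_event {
	EV_CREATE,	/* peer_ni has been created */
	EV_SEND,	/* first message handed to the network */
	EV_SEND_OK,	/* first message completed successfully */
	EV_SEND_FAIL,	/* a message, first or later, failed to send */
	EV_DELETE	/* dynamic update or config delete */
};

/* Return the next state; events that do not apply leave it unchanged. */
enum sketch_ni_state sketch_ni_next(enum sketch_ni_state s,
				    enum sketch_ni_event ev)
{
	switch (ev) {
	case EV_CREATE:
		return s == NI_INIT ? NI_CREATED : s;
	case EV_SEND:
		return s == NI_CREATED ? NI_ACTIVATING : s;
	case EV_SEND_OK:
		return s == NI_ACTIVATING ? NI_CONNECTED : s;
	case EV_SEND_FAIL:
		return (s == NI_ACTIVATING || s == NI_CONNECTED) ? NI_FAILED : s;
	case EV_DELETE:
		return NI_DELETING;
	}
	return s;
}
```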

Sign-off

Name                      Status
                          Signed Off
                          Signed Off
                          Signed Off
Robert Read (optional)
SGI PAC                   Signed Off


Appendix

Various Comments and Older Notes

...