...
Reference Documents

| Document Link |
| --- |
| Multi-Rail Scope and Requirements Document |
Document Structure
This document is made up of the following sections:
...
The primary data structures maintained by the LNet module will be modified as follows: (cfg-040)

- `struct lnet_ni` will reference exactly one NI
- `struct lnet_net` will be added, which can point to multiple `lnet_ni` structures
- `struct lnet` (aka `lnet_t`) will have a list of `lnet_net` structures instead of `lnet_ni` structures
- `struct lnet_peer` will be renamed to `struct lnet_peer_ni`, and will represent a peer NID with all its credits
- `struct lnet_peer_net` will encapsulate multiple `lnet_peer_ni` structures. This structure's purpose is to simplify the selection algorithm as discussed later.
- `struct lnet_peer` will encapsulate multiple `lnet_peer_net` structures.
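To make the containment explicit, here is a minimal C sketch of the resulting hierarchy; the list-head and field names are illustrative, not the final symbols:

```c
/*
 * Sketch of the modified containment hierarchy. The list-head and
 * field names below are illustrative, not the final symbols.
 */
#include <linux/list.h>

struct lnet_ni {                        /* references exactly one NI */
	struct list_head ni_netlist;    /* chained on lnet_net::net_ni_list */
	/* credits, CPTs, LND data for this one interface ... */
};

struct lnet_net {                       /* new structure, one per network */
	struct list_head net_list;      /* chained on lnet::ln_nets */
	struct list_head net_ni_list;   /* multiple lnet_ni structures */
};

struct lnet {                           /* aka lnet_t */
	struct list_head ln_nets;       /* list of lnet_net, not lnet_ni */
};

struct lnet_peer_ni {                   /* was struct lnet_peer */
	struct list_head lpni_peer_nis; /* chained on lnet_peer_net */
	/* one peer NID with all its credits ... */
};

struct lnet_peer_net {                  /* groups peer NIs on one network */
	struct list_head lpn_peer_nets; /* chained on lnet_peer::lp_peer_nets */
	struct list_head lpn_peer_nis;  /* multiple lnet_peer_ni structures */
};

struct lnet_peer {                      /* now represents the whole peer */
	struct list_head lp_peer_nets;  /* multiple lnet_peer_net structures */
};
```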
...
A "NUMA range" tunable will control the search width. Setting this value to a high number basically turns off NUMA based selection, as all local NIs are considered. cfg-090, snd-025
Dynamic peer discovery
Dynamic peer discovery will be added to LNet. It works by using an LNet PING to discover whether a peer node supports the same capabilities. Support is indicated by setting a bit in the `lnet_ping_info_t->pi_features` field.
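A sketch of the capability test on the received ping data; the feature-bit name and its position are assumptions here:

```c
/*
 * Sketch: test the multi-rail capability bit carried in the ping
 * response. The feature-bit name and its position are assumptions.
 */
#define LNET_PING_FEAT_MULTI_RAIL (1 << 3)	/* assumed bit */

static bool lnet_ping_info_is_multi_rail(lnet_ping_info_t *pi)
{
	return (pi->pi_features & LNET_PING_FEAT_MULTI_RAIL) != 0;
}
```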
...
- Adding/removing/showing Network Interfaces.
- Adding/removing/showing peers. cfg-070, cfg-075
- Each peer can be composed of one or more peer NIDs
- Adding/removing/showing selection policies
The `lnetctl` utility uses the DLC library API to perform its functions. Besides the standard command-line interface for configuring the different elements, configuration can be represented in a YAML-formatted file. Configuration can also be queried and presented to the user in YAML format. The configuration design philosophy is to ensure that any configuration which can be queried from the kernel can be fed back into the kernel to produce exactly the same result. cfg-045, cfg-050, cfg-060, cfg-065, cfg-170
DLC Library
The DLC library shall add a set of APIs to handle configuring the LNet kernel module. cfg-005, cfg-015
- `lustre_lnet_config_ni()` - this will be modified to add one or more network interfaces. cfg-020, cfg-025
- `lustre_lnet_del_ni()` - this will be modified to delete one or more network interfaces
- `lustre_lnet_show_ni()` - this will be modified to show all NIs in the network. cfg-010
- `lustre_lnet_config_peer()` - add a set of peer NIDs
- `lustre_lnet_del_peer()` - delete a peer NID
- `lustre_lnet_show_peers()` - shows all peers in the system. Can provide a maximum number of peers to show
- `lustre_lnet_config_<rule type>_selection_rule()` - add an NI selection policy rule to the existing rules
- `lustre_lnet_del_<rule type>_selection_rule()` - delete an NI selection policy rule using its assigned ID or matching criteria. cfg-095
- `lustre_lnet_show_<rule type>_selection_rule()` - show all NI selection policy rules configured in the system, each given an ID.
- `lustre_lnet_set_dynamic_discover()` - enable or disable dynamic discovery.
- `lustre_lnet_set_use_tcp_bonding()` - enable or disable using TCP bonding.
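For illustration only, a DLC caller might drive these APIs roughly as follows; the document specifies only the API names, so every parameter list in this sketch is an assumption:

```c
/*
 * Illustrative DLC caller. The document specifies only the API names,
 * so every parameter list below is an assumption, as is the cYAML
 * error block used for reporting.
 */
int setup_multi_rail(void)
{
	struct cYAML *err_rc = NULL;	/* YAML error block (assumed type) */
	char *nids[] = { "10.2.2.3@o2ib", "10.2.2.4@o2ib" };
	int rc;

	/* assumed signature: network, interface list, CPT list, seq_no */
	rc = lustre_lnet_config_ni("o2ib", "ib0,ib1", NULL, -1, &err_rc);
	if (rc != 0)
		return rc;

	/* assumed signature: peer NID array, count, seq_no */
	return lustre_lnet_config_peer(nids, 2, -1, &err_rc);
}
```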
...
CPTs are a creation-time element, and the best configuration philosophy is to allow the user to specify them explicitly as part of the interface. Therefore, this design recommends allowing CPTs to be configured only at the NI level. This maintains the current behavior, where CPTs are Network Interface specific.
cfg-030 - the CPT is a creation-time configuration and cannot be changed afterwards. This requirement will not be implemented.
...
Multi-Rail shall change the way network interfaces are configured. In order to maintain backwards compatibility, much code would need to be added to deal with the different configuration formats, which would inevitably lead to unmaintainable code. As a result, Multi-Rail `lnetctl`/DLC will only work with a Multi-Rail capable LNet. This means that upgrading a system to Multi-Rail capable LNet entails upgrading all userspace and kernel space components. Older YAML configuration will still work with the newer Multi-Rail capable nodes. bck-005, bck-010, bck-015, bck-020.
Multi-Rail nodes will continue to connect to non-Multi-Rail capable nodes and vice versa. When a Multi-Rail capable node is connected to a cluster and dynamic discovery is enabled, it will automatically be discovered on first use, as described later in this document in the Dynamic Discovery section. bck-025, bck-030
Adding local NI
lnetctl Interface
...
```
# In order to remain backward compatible, two forms of the command shall be allowed.
# The first will delete the entire network and all network interfaces under it.
# The second will delete a single network interface.
lnetctl > net del -h
net del: delete a network
Usage: net del --net <network> [--if <interface>]

WHERE:
	--net:	net name (e.g. tcp0)
	--if:	interface name. (e.g. eth0)

# If the --if parameter is specified, then this will specify exactly one NI to delete or a list
# of NIs, since the --if parameter can be a comma separated list.
# TODO: It is recommended that if --if is not specified all the interfaces are removed.
```
YAML Syntax
```yaml
net:
    - net: <network. Ex: tcp or o2ib>
      interfaces:
          - intf: <interface name to delete>
      seq_no: <integer. Optional. User generated, and is passed back in the YAML error block>

# Example: delete all network interfaces in o2ib1 network completely
net:
    - net: o2ib1

# delete only one NI
net:
    - net: o2ib1
      interfaces:
          - intf: ib0
          - intf: ib1
```
...
All peer NIDs specified must be unique in the system. If a non-unique peer NID is added, LNet shall fail the configuration. cfg-080
YAML Syntax
```yaml
peers:
    - nids:
          0: ip@net1
          1: ip@net2
    - nids:
          0: ip@net3
          1: ip@net4

# The exact same syntax can be used to refresh the peer table. The assumption is
# each peer in the YAML syntax contains all the peer NIDs.
# As an example if a peer is configured as follows:
peers:
    - nids:
          0: 10.2.2.3@ib0
          1: 10.4.4.4@ib1

# Then later you feed the following into the system:
peers:
    - nids:
          0: 10.2.2.3@ib0
          1: 10.5.5.5@ib2

# The result of this configuration is the removal of 10.4.4.4@ib1 from
# the peer NID list and the addition of 10.5.5.5@ib2.
# In general a peer can be referenced by any of its NIDs. So when configuring, all the NIDs are used
# to find the peer. The first peer that's found will be configured. If the peer NID being added is
# not unique, then that peer NID is ignored and an error flagged. The index of the ignored NID is
# returned to user space, and is subsequently reported to the user.
```
...
A rule can be uniquely identified by its matching criteria, or by an internal ID that is assigned by the LNet module when a rule is added and returned to user space when the rules are listed by a show command.
cfg-100, cfg-105, cfg-110, cfg-115, cfg-120, cfg-125, cfg-130, cfg-135, cfg-140, cfg-160, cfg-165
lnetctl Interface
```
# Adding a network priority rule. If the NI under the network doesn't have
# an explicit priority set, it'll inherit the network priority:
lnetctl > selection net [add | del | show] -h

Usage: selection net add --net <network name> --priority <priority>
WHERE:
	selection net add: add a selection rule based on the network priority
	--net: network string (e.g. o2ib or o2ib* or o2ib[1,2])
	--priority: Rule priority

Usage: selection net del --net <network name> [--id <rule id>]
WHERE:
	selection net del: delete a selection rule given the network pattern or the id.
	                   If both are provided they need to match or an error is returned.
	--net: network string (e.g. o2ib or o2ib* or o2ib[1,2])
	--id: ID assigned to the rule returned by the show command.

Usage: selection net show [--net <network name>]
WHERE:
	selection net show: show selection rules and filter on network name if provided.
	--net: network string (e.g. o2ib or o2ib* or o2ib[1,2])

# Add a NID priority rule. All NIDs added that match this pattern shall be assigned
# the identified priority. When the selection algorithm runs it shall prefer NIDs with
# higher priority.
lnetctl > selection nid [add | del | show] -h

Usage: selection nid add --nid <NID> --priority <priority>
WHERE:
	selection nid add: add a selection rule based on the nid pattern
	--nid: nid pattern which follows the same syntax as ip2net
	--priority: Rule priority

Usage: selection nid del --nid <NID> [--id <rule id>]
WHERE:
	selection nid del: delete a selection rule given the nid pattern or the id.
	                   If both are provided they need to match or an error is returned.
	--nid: nid pattern which follows the same syntax as ip2net
	--id: ID assigned to the rule returned by the show command.

Usage: selection nid show [--nid <NID>]
WHERE:
	selection nid show: show selection rules and filter on NID pattern if provided.
	--nid: nid pattern which follows the same syntax as ip2net

# Adding a point-to-point rule. This creates an association between a local NI and a remote
# NID, and assigns a priority to this relationship so that it's preferred when selecting a pathway.
lnetctl > selection peer [add | del | show] -h

Usage: selection peer add --local <NID> --remote <NID> --priority <priority>
WHERE:
	selection peer add: add a selection rule based on local to remote pathway
	--local: nid pattern which follows the same syntax as ip2net
	--remote: nid pattern which follows the same syntax as ip2net
	--priority: Rule priority

Usage: selection peer del --local <NID> --remote <NID> --id <ID>
WHERE:
	selection peer del: delete a selection rule based on local to remote NID pattern or id
	--local: nid pattern which follows the same syntax as ip2net
	--remote: nid pattern which follows the same syntax as ip2net
	--id: ID of the rule as provided by the show command.

Usage: selection peer show [--local <NID>] [--remote <NID>]
WHERE:
	selection peer show: show selection rules and filter on NID patterns if provided.
	--local: nid pattern which follows the same syntax as ip2net
	--remote: nid pattern which follows the same syntax as ip2net

# The output will be of the same YAML format as the input described below.
```
...
The pseudo code below describes the algorithm in more detail. snd-005, snd-010, snd-020, snd-030, snd-035, snd-040, snd-045, snd-050, snd-055, snd-060, snd-065, snd-070, snd-075
snd-015 - NUMA APIs have been available in some form since at least 2.6.1, and therefore pose no problem for this project.
...
Dynamic Peer Discovery ("Discovery" for short) is the process by which a node discovers, without being pre-configured, the network interfaces on which it can reach a peer. This involves sending a ping to the peer. The ping response carries a flag bit indicating whether the peer is multi-rail capable. If it is, the node then pushes its own network interface information to the peer. This protocol distributes the network interface information to both nodes, and subsequently each node can exercise the peer's network interfaces as well as its own, as described in further detail in this section. Discovery can be enabled, disabled, or set to verification mode. In verification mode it cross-references the discovered peer NIDs with the configured NIDs and complains if there is a discrepancy, but continues to use the configured NIDs. cfg-085, dyn-005, dyn-010, dyn-015, dyn-020, dyn-025, dyn-030, dyn-035, dyn-040, dyn-045, dyn-050, dyn-055, dyn-060, dyn-065
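The following deliberately synchronous sketch summarizes the active side of this exchange; the real implementation is event-driven, and the helper names are illustrative:

```c
/*
 * Deliberately synchronous sketch of the active side of discovery;
 * the real implementation is event-driven and the helper names are
 * illustrative.
 */
static int lnet_discover_peer(struct lnet_peer *lp)
{
	int rc;

	/* Active -> passive: Ping. The Reply carries the multi-rail
	 * capability bit in pi_features. */
	rc = lnet_peer_send_ping(lp);
	if (rc != 0)
		return rc;

	/* A non-multi-rail peer ends the exchange here. */
	if (!lnet_peer_is_multi_rail(lp))
		return 0;

	/* Active -> passive: Push our own interface list; the passive
	 * node Acks, and both nodes now know each other's NIDs. */
	return lnet_peer_send_push(lp);
}
```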
Discovery handshake
Discovery happens between an active node which takes the lead in the process, and a passive node which responds to messages from the active node. The following diagram illustrates the basic handshake that happens between the active and passive nodes. If the full handshake completes, both nodes have discovered each other.
...
The state of a peer is a combination of the following bits of information, where the flags can be found in the source code by prepending `LNET_PEER_`, so `CONFIGURED` becomes `LNET_PEER_CONFIGURED`. Peer state updates can be triggered from the event handler called when a message is received. These event handlers run while a spinlock is held in a time-critical path, and so we try to limit the amount of work done there. The discovery thread can then do the heavier lifting later under more relaxed locking constraints.
- `CONFIGURED`: The peer was configured via DLC.
- `DISCOVERED`: The peer has been discovered.
- `UNDISCOVERED`: Peer discovery was disabled when the peer was created.
Configuration via DLC overrides peer discovery, but does not prevent the discovery algorithm from processing a peer. The algorithm complains if it finds differences between the configuration and what the peer reports. As such the `CONFIGURED` and `DISCOVERED` flags can both be set on a peer.
...
- `QUEUED`: Peer is queued for discovery.
- `DISCOVERING`: Discovery is active for the peer.
- `DATA_PRESENT`: Peer data is available to update the peer.
- `NIDS_UPTODATE`: Discovery has successfully updated the NIDs of the peer.
- `PING_SENT`: Discovery has sent a Ping to the peer and is waiting for the Reply.
- `PUSH_SENT`: Discovery has sent a Push to the peer and is waiting for the Ack.
- `PING_FAILED`: Sending a Ping to the peer failed.
- `PUSH_FAILED`: Sending a Push to the peer failed.
- `PING_REQUIRED`: Discovery must Ping the peer.
- `MULTI_RAIL`: This flag indicates whether a peer is running a multi-rail aware version of Lustre.
- Ping Source Sequence Number
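For reference, these flags might be laid out as bits in a single peer state word, following the `LNET_PEER_` naming convention; the bit positions below are illustrative:

```c
/* Sketch: peer state flag bits; bit positions are illustrative. */
#define LNET_PEER_CONFIGURED	(1 << 0)   /* configured via DLC */
#define LNET_PEER_DISCOVERED	(1 << 1)   /* discovery completed */
#define LNET_PEER_UNDISCOVERED	(1 << 2)   /* discovery disabled at creation */
#define LNET_PEER_QUEUED	(1 << 3)   /* on a discovery queue */
#define LNET_PEER_DISCOVERING	(1 << 4)   /* discovery in progress */
#define LNET_PEER_DATA_PRESENT	(1 << 5)   /* Push/Reply data awaiting merge */
#define LNET_PEER_NIDS_UPTODATE	(1 << 6)   /* peer NIDs believed current */
#define LNET_PEER_PING_SENT	(1 << 7)   /* Ping out, waiting for Reply */
#define LNET_PEER_PUSH_SENT	(1 << 8)   /* Push out, waiting for Ack */
#define LNET_PEER_PING_FAILED	(1 << 9)   /* last Ping attempt failed */
#define LNET_PEER_PUSH_FAILED	(1 << 10)  /* last Push attempt failed */
#define LNET_PEER_PING_REQUIRED	(1 << 11)  /* must Ping to resync state */
#define LNET_PEER_MULTI_RAIL	(1 << 12)  /* peer is multi-rail aware */
```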
`CONFIGURED` marks the peer as being configured using DLC.
`DISCOVERED` marks the peer as having been through peer discovery. Configuration via DLC overrides peer discovery, but does not prevent the discovery algorithm from processing a peer. The algorithm complains if it finds differences between the configuration and what the peer reports. As such a peer can be marked as both `CONFIGURED` and `DISCOVERED`.
`UNDISCOVERED` marks a peer as having been through peer discovery without being updated, because peer discovery is disabled. It signals that the peer needs to be re-examined if discovery is enabled.
`QUEUED` is set on a peer that is linked on either the `ln_dc_request` or the `ln_dc_working` queue via its `lp_dc_list` member. A peer is queued by `lnet_peer_queue_for_discovery()` and dequeued by `lnet_peer_discovery_complete()`.
`DISCOVERING` is set on a peer when discovery is looking at it. When discovery completes, it clears `DISCOVERING` and sets one of `DISCOVERED` or `UNDISCOVERED`.
The `DATA_PRESENT` flag is set on a peer by the event handler for an incoming Push if it successfully stores the data, and by the event handler for an incoming Reply to a Ping. These event handlers run with spinlocks held, which is why we postpone the complex operation of updating the peer until the discovery thread can do it. The discovery thread processes the data and updates the peer by calling `lnet_peer_data_present()`, which clears the flag.
The `NIDS_UPTODATE` flag is set on a peer to indicate that the NIDs of the peer are believed to be known. It is cleared when data is received indicating that the peer may have changed, such as an incoming Push. If storing the data from an incoming Push fails, we cannot set the `DATA_PRESENT` flag, but we do clear `NIDS_UPTODATE` to indicate that the peer must be re-examined.
The `PING_SENT` flag is set on a peer when a Ping has been sent and we are waiting for the Reply message. The implication is that `lp_ping_mdh` is live and has an MD bound to it.
The `PUSH_SENT` flag is set on a peer when a Push has been sent and we are waiting for the Ack message. The implication is that `lp_push_mdh` is live and has an MD bound to it.
The `PING_FAILED` flag is set on a peer when an attempted Ping failed for some reason. In addition to LNet messaging failures, a Ping fails if the Reply does not fit in the pre-allocated buffer.
The `PUSH_FAILED` flag is set on a peer when an attempted Push failed for some reason. The node sending the Push only sees a failure if LNet messaging reports one.
The `PING_REQUIRED` flag is set on a peer when a Ping is necessary to properly determine the state of the peer. Triggering a Ping is the mechanism by which discovery attempts to recover from any problems it may have encountered while processing a peer. Pings triggered by this flag happen even if discovery has been disabled.
`MULTI_RAIL` marks a peer as running a multi-rail aware version of Lustre.
If `MULTI_RAIL` is set, then the Ping Source Sequence Number is sent in the `pi_ni[0].ns_status` field of the ping data. In ping data `pi_ni[0]` always contains the data for the loopback NID, and non-multi-rail nodes do not interpret that field. The last ping source sequence number received from the node is stored in the `lp_node_seqno` field of the peer. This is used in `lnet_peer_needs_push()` to determine whether a multi-rail aware peer needs a Push. The ping source sequence number is a 32-bit number that is updated whenever the source buffer for LNet ping is changed.
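A minimal sketch of that decision, assuming the flag layout above; `the_lnet.ln_ping_target_seqno` stands in here for the node's current Ping Source Sequence Number and is an illustrative name:

```c
/*
 * Sketch of the Push decision. the_lnet.ln_ping_target_seqno stands in
 * for the node's current Ping Source Sequence Number; treat it and the
 * lp_state layout as illustrative.
 */
static bool lnet_peer_needs_push(struct lnet_peer *lp)
{
	/* A peer that is not multi-rail aware does not understand Push. */
	if (!(lp->lp_state & LNET_PEER_MULTI_RAIL))
		return false;

	/* Push if the peer has not seen our latest ping source buffer. */
	return lp->lp_node_seqno < the_lnet.ln_ping_target_seqno;
}
```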
...
- `CONFIGURED`: This flag indicates that the `peer_ni` was configured using DLC.
- `NON_MR_PREF`: This flag indicates that the `peer_ni` has an implicit preferred local NI.
`CONFIGURED` is used to track whether the `peer_ni` was configured using DLC. Discovery can collect several `peer_ni` structures into a single `peer`, but should exempt any `peer_ni` that was set up by DLC.
The `NON_MR_PREF` flag marks a `peer_ni` of a peer that is not known to be multi-rail aware, and for which a preferred local NI was automatically assigned. A `peer_ni` for a non-multi-rail node should always see the same local NI as the source of the traffic from this node. If the preferred local NI has not been explicitly set with DLC, then the code picks one and sets this flag to indicate that this is the reason for the preferred NI. That way we know to clear the preferred NI if the `peer_ni` turns out to be part of a multi-rail aware peer.
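A sketch of how the implicit preference might be recorded; the field and flag names here are illustrative:

```c
/*
 * Sketch: record an implicitly preferred local NI on a peer_ni of a
 * peer not known to be multi-rail aware. Field and flag names are
 * illustrative.
 */
static void lnet_peer_ni_set_non_mr_pref(struct lnet_peer_ni *lpni,
					 lnet_nid_t local_nid)
{
	/* Only pick one if DLC has not configured a preference. */
	if (lpni->lpni_pref_nid == LNET_NID_ANY) {
		lpni->lpni_pref_nid = local_nid;
		lpni->lpni_state |= LNET_PEER_NI_NON_MR_PREF;
	}
}
```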
State Changes
Now we'll explore in detail how the above states are changed.
- Configuring a local NI with DLC
  - DLC adds a local NI
    - The ping source buffer is updated and the Ping Source Sequence Number is increased
  - DLC deletes a local NI
    - The ping source buffer is updated and the Ping Source Sequence Number is increased
- Configuring a peer with DLC
  - DLC creates a peer
    - The `peer` is marked `CONFIGURED`
    - The `peer_ni` is marked `CONFIGURED`
  - DLC adds a peer NI to a peer
  - DLC deletes a peer NI from a peer
  - DLC deletes a peer
- Forcing ping discovery
- Sending a message
- Ping handling
- Push handling
- Merging received data
- Discovery thread workflow
  - get queued `peer` (`QUEUED` is set)
  - if `DATA_PRESENT` is set
    - merge the received data
  - else if `PING_FAILED` is set
    - clear `PING_FAILED`
    - set `PING_REQUIRED`
    - if there was an error, terminate discovery
    - else do another pass over the `peer`
  - else if `PUSH_FAILED` is set
    - clear `PUSH_FAILED`
    - if there was an error, terminate discovery
    - else do another pass over the `peer`
  - else if `PING_REQUIRED` is set
    - clear `PING_REQUIRED`
    - set `PING_SENT`
    - send Ping
  - else if discovery is disabled
    - clear `DISCOVERED` and `DISCOVERING`
    - set `UNDISCOVERED`
  - else if `NIDS_UPTODATE` is not set
    - set `PING_SENT`
    - send Ping
  - else if `MULTI_RAIL` is not set
    - clear `DISCOVERING`
    - set `DISCOVERED`
  - else if `lp_node_seqno` < Ping Source Sequence Number
    - set `PUSH_SENT`
    - send Push
  - else
    - clear `DISCOVERING`
    - set `DISCOVERED`
  - if `DISCOVERING` is not set
    - clear `QUEUED`
    - dequeue `peer`
The following discussion must be updated – OW
...
- INIT: pre state. Only transitory.
- CREATED: The NI has been added successfully
- FAILED: LND notification that the physical interface is down. hlt-005, hlt-010
- DEGRADED: LND notification that the physical interface is degraded. IB and ETH will probably not send such a notification. hlt-015, hlt-020, hlt-025
- DELETING: a config delete has been received. All pending messages must be drained.
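These states might be captured as a simple enum; a sketch, with names derived from the list above:

```c
/* Sketch: the NI states above as an enum; names are illustrative. */
enum lnet_ni_state {
	LNET_NI_STATE_INIT,	/* pre state, only transitory */
	LNET_NI_STATE_CREATED,	/* NI has been added successfully */
	LNET_NI_STATE_FAILED,	/* LND reports physical interface down */
	LNET_NI_STATE_DEGRADED,	/* LND reports degraded performance */
	LNET_NI_STATE_DELETING,	/* config delete received, draining */
};
```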
Both DEGRADED and FAILED require the LND to notify LNet. For DEGRADED, the LND could query the card type to determine its theoretical speed; if the measured speed falls below that, the NI can be marked degraded. snd-080
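A sketch of that heuristic; the LND-side helpers and the notification call are hypothetical:

```c
/*
 * Sketch of the degraded-detection heuristic. The LND-side helpers and
 * the notification call are hypothetical.
 */
static void lnd_check_link_health(struct lnet_ni *ni)
{
	unsigned int theoretical = lnd_query_theoretical_speed(ni);
	unsigned int measured = lnd_measure_speed(ni);

	if (measured == 0)
		lnet_notify_ni_state(ni, LNET_NI_STATE_FAILED);
	else if (measured < theoretical)
		lnet_notify_ni_state(ni, LNET_NI_STATE_DEGRADED);
}
```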
snd-085 - TODO: need to identify in the design how we deal with local NI failures.
...