...

Reference Documents

Document Structure

This document is made up of the following sections:

...

The primary data structures maintained by the LNet module will be modified as follows (a sketch of the resulting relationships follows the list): (cfg-040)

  • struct lnet_ni will reference exactly one NI
  • struct lnet_net will be added which can point to multiple lnet_ni structures
  • struct lnet (aka lnet_t) will have a list of lnet_net structures instead of lnet_ni structures
  • struct lnet_peer will be renamed to struct lnet_peer_ni, and will represent a peer NID with all its credits
  • struct lnet_peer_net will encapsulate multiple lnet_peer_ni structures. This structure's purpose is to simplify the selection algorithm as discussed later.
  • struct lnet_peer will encapsulate multiple lnet_peer_net structures.
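
As a rough illustration of these relationships, the sketch below nests minimal stand-in versions of the structures. The field names and list heads are illustrative assumptions, not the in-tree definitions.

Code Block
/* Minimal sketch of the reworked structure nesting; illustrative only. */
#include <stdint.h>

struct list_head { struct list_head *next, *prev; };
typedef uint64_t lnet_nid_t;

struct lnet_ni {                        /* exactly one network interface */
        struct list_head ni_netlist;    /* chained on lnet_net::net_ni_list */
        lnet_nid_t       ni_nid;
};

struct lnet_net {                       /* one network, e.g. tcp0 or o2ib1 */
        struct list_head net_list;      /* chained on lnet::ln_nets */
        struct list_head net_ni_list;   /* one or more lnet_ni */
};

struct lnet_peer_ni {                   /* one peer NID with all its credits */
        struct list_head lpni_on_peer_net; /* chained on lnet_peer_net */
        lnet_nid_t       lpni_nid;
        int              lpni_txcredits;
};

struct lnet_peer_net {                  /* groups the peer NIDs on one network */
        struct list_head lpn_on_peer;   /* chained on lnet_peer */
        struct list_head lpn_peer_nis;  /* one or more lnet_peer_ni */
};

struct lnet_peer {                      /* the peer node as a whole */
        struct list_head lp_peer_nets;  /* one or more lnet_peer_net */
};

struct lnet {                           /* aka lnet_t */
        struct list_head ln_nets;       /* list of lnet_net, not lnet_ni */
};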

...

A "NUMA range" tunable will control the search width. Setting this value to a high number basically turns off NUMA based selection, as all local NIs are considered. cfg-090, snd-025

Dynamic peer discovery

Dynamic peer discovery will be added to LNet. It works by using an LNet PING to discover whether a peer node supports the same capabilities. Support is indicated by setting a bit in the lnet_ping_info_t->pi_features field.
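
A small sketch of such a capability test; the feature bit name and position below are assumptions, not the actual constant.

Code Block
#include <stdbool.h>
#include <stdint.h>

#define PING_FEAT_MULTI_RAIL_SKETCH (1u << 2)   /* assumed bit position */

struct ping_info_sketch {
        uint32_t pi_features;   /* capability bits advertised by the peer */
};

static bool peer_is_multi_rail_sketch(const struct ping_info_sketch *pi)
{
        return (pi->pi_features & PING_FEAT_MULTI_RAIL_SKETCH) != 0;
}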

...

  • Adding/removing/showing Network Interfaces.
  • Adding/removing/showing peers. cfg-070, cfg-075
    • Each peer can be composed of one or more peer NIDs
  • Adding/removing/showing selection policies

The lnetctl utility uses the DLC library API to perform its functions. Besides the standard command line interface for configuring different elements, configuration can be represented in a YAML formatted file. Configuration can also be queried and presented to the user in YAML format. The configuration design philosophy is to ensure that all config which can be queried from the kernel can be fed back into the kernel to get the exact same result. cfg-045, cfg-050, cfg-060, cfg-065, cfg-170

DLC Library

The DLC library shall add a set of APIs to handle configuring the LNet kernel module; a usage sketch follows the list. cfg-005, cfg-015

  • lustre_lnet_config_ni() - this will be modified to add one or more network interfaces. cfg-020, cfg-025
  • lustre_lnet_del_ni() - this will be modified to delete one or more network interfaces
  • lustre_lnet_show_ni() - this will be modified to show all NIs in the network. cfg-010
  • lustre_lnet_config_peer() - add a set of peer NIDs
  • lustre_lnet_del_peer() - delete a peer NID
  • lustre_lnet_show_peers() - shows all peers in the system. Can provide a maximum number of peers to show
  • lustre_lnet_config_<rule type>_selection_rule() - add an NI selection policy rule to the existing rules
  • lustre_lnet_del_<rule type>_selection_rule() - delete an NI selection policy rule using its assigned ID or matching criteria. cfg-095
  • lustre_lnet_show_<rule type>_selection_rule() - show all NI selection policy rules configured in the system, each given an ID.
  • lustre_lnet_set_dynamic_discover() - enable or disable dynamic discovery.
  • lustre_lnet_set_use_tcp_bonding() -  enable or disable using TCP bonding.
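
The sketch below shows how a caller might drive two of these entry points; the prototypes and stub bodies are simplified, hypothetical stand-ins so the example is self-contained, and do not match the real library signatures.

Code Block
#include <stdio.h>

/* hypothetical, simplified stand-ins for the DLC entry points */
static int lustre_lnet_config_ni_sketch(const char *net, const char *intf_list,
                                        int seq_no)
{
        (void)net; (void)intf_list; (void)seq_no;
        return 0;       /* stub: the real call configures one or more NIs */
}

static int lustre_lnet_config_peer_sketch(const char *nid_list, int seq_no)
{
        (void)nid_list; (void)seq_no;
        return 0;       /* stub: the real call adds a set of peer NIDs */
}

int main(void)
{
        /* add two interfaces under the tcp0 network in a single call */
        if (lustre_lnet_config_ni_sketch("tcp0", "eth0,eth1", 1) != 0)
                fprintf(stderr, "failed to configure NIs\n");

        /* add one peer composed of two peer NIDs */
        if (lustre_lnet_config_peer_sketch("10.2.2.3@tcp,10.4.4.4@o2ib", 2) != 0)
                fprintf(stderr, "failed to configure peer\n");

        return 0;
}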

...

CPTs are a creation-time element, and the best configuration philosophy is to allow the user to explicitly specify them as part of the interface. Therefore, this design recommends only allowing CPTs to be configured at the NI level. This maintains the current behavior where CPTs are Network Interface specific.

cfg-030 - the CPT is a creation time configuration and can not be changed afterwards. This requirement will not be implemented.

...

Multi-Rail shall change the way network interfaces are configured. In order to maintain backwards compatibility, much code would need to be added to deal with different configuration formats, which would inevitably lead to unmaintainable code. As a result, multi-rail lnetctl/DLC will only work with multi-rail capable LNet. This means that upgrading a system to Multi-Rail capable LNet will entail upgrading all userspace and kernel space components. Older YAML configuration will still work with the newer Multi-Rail capable nodes. bck-005, bck-010, bck-015, bck-020.

Multi-Rail nodes will continue to connect to non-multi-rail capable nodes and vice versa. When a Multi-Rail capable node is connected to a cluster and dynamic discovery is enabled, it will automatically be discovered on first use, as described later in this document in the Dynamic Discovery section. bck-025, bck-030

Adding local NI

lnetctl Interface

...

Code Block
# In order to remain backward compatible, two forms of the command shall be allowed.
# The first will delete the entire network and all network interfaces under it.
# The second will delete a single network interface

lnetctl > net del -h
net del: delete a network
Usage: net del --net <network> [--if <interface>] 

WHERE:

 --net: net name (e.g. tcp0)
 --if: interface name. (e.g. eth0)

# If the --if parameter is specified, then this will specify exactly one NI to delete or a list
# of NIs, since the --if parameter can be a comma separated list.
# TODO: It is recommended that if the --if is not specified that all the interfaces are removed.

YAML Syntax

cfg-055

Code Block
net:
   - net: <network.  Ex: tcp or o2ib>
     interfaces:
         - intf: <interface name to delete>
     seq_no: <integer.  Optional.  User generated, and is
              passed back in the YAML error block>

# Example: delete all network interfaces in o2ib1 network completely
net:
   - net: o2ib1

# delete only the listed NIs
net:
   - net: o2ib1
     interfaces:
           - intf: ib0
           - intf: ib1

...

All peer NIDs specified must be unique in the system. If a non-unique peer NID is added, LNet shall fail the configuration. cfg-080
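
A sketch of that uniqueness check; the lookup helper, structure, and error handling below are simplified assumptions.

Code Block
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t lnet_nid_t;

struct peer_sketch { int id; };

/* assumed global lookup: returns the peer that already owns the NID, or NULL */
static struct peer_sketch *find_owner_of_nid_sketch(lnet_nid_t nid)
{
        (void)nid;
        return NULL;    /* stub */
}

static int add_peer_nid_sketch(struct peer_sketch *peer, lnet_nid_t nid)
{
        struct peer_sketch *owner = find_owner_of_nid_sketch(nid);

        /* cfg-080: a peer NID may belong to at most one peer in the system */
        if (owner != NULL && owner != peer)
                return -EEXIST; /* fail the configuration */

        /* ... attach the new peer_ni to this peer ... */
        return 0;
}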

YAML Syntax

Code Block
peers:
  - nids:
      0: ip@net1
      1: ip@net2
  - nids:
      0: ip@net3
      1: ip@net4

# The exact same syntax can be used to refresh the peer table. The assumption is
# each peer in the YAML syntax contains all the peer NIDs.
# As an example if a peer is configured as follows:

peers:
  - nids:
     0: 10.2.2.3@ib0
     1: 10.4.4.4@ib1

# Then later you feed the following into the system

peers:
  - nids:
     0: 10.2.2.3@ib0
     1: 10.5.5.5@ib2

# The result of this configuration is the removal of 10.4.4.4@ib1 from 
# the peer NID list and the addition of 10.5.5.5@ib2
# In general a peer can be referenced by any of its NIDs. So when configuring, all the NIDs are used
# to find the peer. The first peer that's found will be configured. If the peer NID being added is
# not unique, then that peer NID is ignored and an error is flagged. The index of the ignored NID is
# returned to user space, and is subsequently reported to the user.

...

A rule can be uniquely identified by its matching criteria or by an internal ID, which is assigned by the LNet module when the rule is added and reported back to user space as part of the show command output.

cfg-100, cfg-105, cfg-110, cfg-115, cfg-120, cfg-125, cfg-130, cfg-135, cfg-140, cfg-160, cfg-165
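
For illustration only, a rule entry could carry both identifiers; the field names below are assumptions.

Code Block
struct selection_rule_sketch {
        unsigned int rule_id;   /* assigned by LNet when the rule is added,
                                 * reported by the show command */
        char         match[64]; /* matching criteria, e.g. "o2ib*" */
        int          priority;
};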

lnetctl Interface

Code Block
# Adding a network priority rule. If the NI under the network doesn't have
# an explicit priority set, it'll inherit the network priority:
lnetctl > selection net [add | del | show] -h
Usage: selection net add --net <network name> --priority <priority>
 
WHERE:

selection net add: add a selection rule based on the network priority
        --net: network string (e.g. o2ib or o2ib* or o2ib[1,2])
		--priority: Rule priority

Usage: selection net del --net <network name> [--id <rule id>]
 
WHERE:

selection net del: delete a selection rule given the network pattern or the id. If both
				   are provided they need to match or an error is returned.
        --net: network string (e.g. o2ib or o2ib* or o2ib[1,2])
		--id: ID assigned to the rule returned by the show command.
 
Usage: selection net show [--net <network name>]

WHERE:

selection net show: show selection rules and filter on network name if provided.
        --net: network string (e.g. o2ib or o2ib* or o2ib[1,2])
 
# Add a NID priority rule. All NIDs added that match this pattern shall be assigned
# the identified priority. When the selection algorithm runs it shall prefer NIDs with
# higher priority.
lnetctl > selection nid [add | del | show] -h
Usage: selection nid add --nid <NID> --priority <priority>

WHERE:

selection nid add: add a selection rule based on the nid pattern
   		--nid: nid pattern which follows the same syntax as ip2net
		--priority: Rule priority


Usage: selection nid del --nid <NID> [--id <rule id>]

WHERE:

selection nid del: delete a selection rule given the nid pattern or the id. If both
				   are provided they need to match or an error is returned.
        --nid: nid pattern which follows the same syntax as ip2net
		--id: ID assigned to the rule returned by the show command.


Usage: selection nid show [--nid <NID>]

WHERE:

selection nid show: show selection rules and filter on NID pattern if provided.
        --nid: nid pattern which follows the same syntax as ip2net

# Add a point-to-point rule. This creates an association between a local NI and a remote
# NID, and assigns a priority to this relationship so that it is preferred when selecting a pathway.
lnetctl > selection peer [add | del | show] -h
Usage: selection peer add --local <NID> --remote <NID> --priority <priority>

WHERE:

selection peer add: add a selection rule based on local to remote pathway
   		--local: nid pattern which follows the same syntax as ip2net
		--remote: nid pattern which follows the same syntax as ip2net
		--priority: Rule priority

Usage: selection peer del --local <NID> --remote <NID> --id <ID>

WHERE:

selection peer del: delete a selection rule based on local to remote NID pattern or id
   		--local: nid pattern which follows the same syntax as ip2net
		--remote: nid pattern which follows the same syntax as ip2net
		--id: ID of the rule as provided by the show command.

Usage: selection peer show [--local <NID>] [--remote <NID>]

WHERE:

selection peer show: show selection rules and filter on NID patterns if provided.
   		--local: nid pattern which follows the same syntax as ip2net
		--remote: nid pattern which follows the same syntax as ip2net

# the output will be of the same YAML format as the input described below.

...

The pseudo code below describes the algorithm in more detail. snd-005, snd-010, snd-020, snd-030, snd-035, snd-040, snd-045, snd-050, snd-055, snd-060, snd-065, snd-070, snd-075

snd-015 - NUMA APIs have been available in some form since at least 2.6.1, and therefore pose no problems for this project.

...

Dynamic Peer Discovery ("Discovery" for short) is the process by which a node can discover the network interfaces it can reach a peer on without being pre-configured. This involves sending a ping to the peer. The ping response carries a flag bit to indicate that the peer is multi-rail capable. If it is, the node then pushes its own network interface information to the peer. This protocol distributes the network interface information to both nodes, and subsequently each node can exercise the peer's network interfaces as well as its own, as described in further detail in this section. Discovery can be enabled, disabled, or in verification mode. If it is in verification mode, it will cross reference the discovered peer NIDs with the configured NIDs and complain if there is a discrepancy, but will continue to use the configured NIDs. cfg-085, dyn-005, dyn-010, dyn-015, dyn-020, dyn-025, dyn-030, dyn-035, dyn-040, dyn-045, dyn-050, dyn-055, dyn-060, dyn-065
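
The three modes might be handled along these lines; the enum, helper, and warning below are illustrative assumptions, not the implementation.

Code Block
#include <stdbool.h>
#include <stdio.h>

enum discovery_mode_sketch {
        DISC_DISABLED_SKETCH,   /* keep using configured NIDs only */
        DISC_ENABLED_SKETCH,    /* update the peer with discovered NIDs */
        DISC_VERIFY_SKETCH,     /* compare, warn on mismatch, keep config */
};

static void process_discovered_nids_sketch(enum discovery_mode_sketch mode,
                                           bool matches_config)
{
        switch (mode) {
        case DISC_ENABLED_SKETCH:
                /* merge the NIDs carried by the ping Reply or Push */
                break;
        case DISC_VERIFY_SKETCH:
                if (!matches_config)
                        fprintf(stderr, "discovered NIDs differ from config\n");
                /* the configured NIDs stay in use either way */
                break;
        case DISC_DISABLED_SKETCH:
        default:
                /* nothing to do; the peer is left as configured */
                break;
        }
}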

Discovery handshake

Discovery happens between an active node which takes the lead in the process, and a passive node which responds to messages from the active node. The following diagram illustrates the basic handshake that happens between the active and passive nodes. If the full handshake completes, both nodes have discovered each other.

...

The state of a peer is a combination of the following bits of information, where the flags can be found in the source code by prepending LNET_PEER_, so CONFIGURED becomes LNET_PEER_CONFIGURED. Peer state updates can be triggered from the event handler called when a message is received. These event handlers run while a spinlock is held in a time-critical path, and so we try to limit the amount of work done there. The discovery thread can then do the heavier lifting later under more relaxed locking constraints.

  • CONFIGURED: The peer was configured via DLC.
  • DISCOVERED: The peer has been discovered.
  • UNDISCOVERED: Peer discovery was disabled when the peer was created.

Configuration via DLC overrides peer discovery, but does not prevent the discovery algorithm from processing a peer. The algorithm complains if it finds differences between the configuration and what the peer reports. As such the CONFIGURED and DISCOVERED flags can both be set on a peer.

...

  • QUEUED: Peer is queued for discovery.
  • DISCOVERING: Discovery is active for the peer.
  • DATA_PRESENT: Peer data is available to update the peer.
  • NIDS_UPTODATE: Discovery has successfully updated the NIDs of the peer.
  • PING_SENT: Discovery has sent a Ping to the peer and is waiting for the Reply.
  • PUSH_SENT: Discovery has sent a Push to the peer and is waiting for the Ack.
  • PING_FAILED: Sending a ping to the peer failed.
  • PUSH_FAILED: Sending a push to the peer failed.
  • PING_REQUIRED: Discovery must Ping the peer.
  • MULTI_RAIL: This flag indicates whether a peer is running a multi-rail aware version of Lustre.
  • Ping Source Sequence Number
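
For reference, the state bits listed above might be defined as flag values along the following lines; the bit positions are assumptions, not the in-tree values.

Code Block
#define LNET_PEER_CONFIGURED    (1 << 0)  /* peer set up via DLC */
#define LNET_PEER_DISCOVERED    (1 << 1)  /* discovery has completed */
#define LNET_PEER_UNDISCOVERED  (1 << 2)  /* discovery disabled at creation */
#define LNET_PEER_QUEUED        (1 << 3)  /* on a discovery queue */
#define LNET_PEER_DISCOVERING   (1 << 4)  /* discovery in progress */
#define LNET_PEER_DATA_PRESENT  (1 << 5)  /* pushed/replied data to merge */
#define LNET_PEER_NIDS_UPTODATE (1 << 6)  /* peer NIDs believed current */
#define LNET_PEER_PING_SENT     (1 << 7)  /* waiting for a ping Reply */
#define LNET_PEER_PUSH_SENT     (1 << 8)  /* waiting for a push Ack */
#define LNET_PEER_PING_FAILED   (1 << 9)  /* last ping attempt failed */
#define LNET_PEER_PUSH_FAILED   (1 << 10) /* last push attempt failed */
#define LNET_PEER_PING_REQUIRED (1 << 11) /* force a ping, even if disabled */
#define LNET_PEER_MULTI_RAIL    (1 << 12) /* peer is multi-rail aware */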

CONFIGURED marks the peer as being configured using DLC.

DISCOVERED marks the peer as having been through peer discovery. Configuration via DLC overrides peer discovery, but does not prevent the discovery algorithm from processing a peer. The algorithm complains if it finds differences between the configuration and what the peer reports. As such a peer can be marked as both CONFIGURED and DISCOVERED.

UNDISCOVERED marks a peer as having been through peer discovery but not updated, because peer discovery is disabled. It signals that the peer needs to be re-examined if discovery is enabled.

QUEUED is set on a peer that is linked on either the ln_dc_request or the ln_dc_working queue via its lp_dc_list member. A peer is queued by lnet_peer_queue_for_discovery() and dequeued by lnet_peer_discovery_complete().

DISCOVERING is set on a peer while discovery is working on it. When discovery completes it clears DISCOVERING and sets either DISCOVERED or UNDISCOVERED.

The DATA_PRESENT flag is set on a peer by the event handler for an incoming Push if it successfully stores the data, and by the event handler for an incoming Reply to a Ping. These event handlers run with spinlocks held, which is why we postpone the complex operation of updating the peer until the discovery thread can do it. The discovery thread processes the data and updates the peer by calling lnet_peer_data_present(), which clears the flag.
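
The split between the event handler and the discovery thread can be sketched as below; a user-space mutex and condition variable stand in for the kernel spinlock and thread wakeup, and all names are assumptions.

Code Block
#include <pthread.h>

#define SK_DATA_PRESENT (1 << 0)

struct peer_sketch {
        unsigned int state;             /* LNET_PEER_* style flag bits */
        void        *pushed_data;       /* buffer saved by the event handler */
};

static pthread_mutex_t sk_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  sk_wake = PTHREAD_COND_INITIALIZER;

/* called from the Push/Reply event handler: cheap bookkeeping only */
static void event_handler_sketch(struct peer_sketch *lp, void *data)
{
        pthread_mutex_lock(&sk_lock);
        lp->pushed_data = data;                 /* stash the received data */
        lp->state |= SK_DATA_PRESENT;           /* flag it for later */
        pthread_cond_signal(&sk_wake);          /* wake the discovery thread */
        pthread_mutex_unlock(&sk_lock);
}

/* the discovery thread picks the peer up later and does the heavy work */
static void discovery_thread_step_sketch(struct peer_sketch *lp)
{
        pthread_mutex_lock(&sk_lock);
        if (lp->state & SK_DATA_PRESENT) {
                lp->state &= ~SK_DATA_PRESENT;
                /* the expensive update of the peer happens here */
        }
        pthread_mutex_unlock(&sk_lock);
}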

The NIDS_UPTODATE flag is set on a peer to indicate that the NIDs of the peer are believed to be known. It is cleared when data is received that indicates that the peer may have changed, such as an incoming Push. If storing the data from an incoming Push fails, we cannot set the DATA_PRESENT flag, but we do clear NIDS_UPTODATE to indicate that the peer must be re-examined.

The PING_SENT flag is set on a peer when a Ping has been sent and we are waiting for the Reply message. The implication is that lp_ping_mdh is live and has an MD bound to it.

The PUSH_SENT flag is set on a peer when a Push has been sent and we are waiting for the Ack message. The implication is that lp_push_mdh is live and has an MD bound to it.

The PING_FAILED flag is set on a peer when an attempted Ping failed for some reason. In addition to LNet messaging failures, a Ping fails if the Reply does not fit in the pre-allocated buffer.

The PUSH_FAILED flag is set on a peer when an attempted Push failed for some reason. The node sending the Push only sees a failure if LNet messaging reports one.

The PING_REQUIRED flag is set on a peer when a Ping is necessary to properly determine the state of the peer. Triggering a Ping is the mechanism by which discovery attempts to recover from any problems it may have encountered while processing a peer. Pings triggered by this flag happen even if discovery has been disabled.

MULTI_RAIL marks a peer as running a multi-rail aware version of Lustre.

If MULTI_RAIL is set, the Ping Source Sequence Number is sent in the pi_ni[0].ns_status field of the ping data. In ping data pi_ni[0] always contains the data for the loopback NID, and non-multi-rail nodes do not interpret that field. The last ping source sequence number received from the node is stored in the lp_node_seqno field of the peer, and is used in lnet_peer_needs_push() to determine whether a multi-rail aware peer needs a Push. The ping source sequence number is a 32-bit number that is updated whenever the source buffer for the LNet ping is changed.
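
A simplified stand-in for that check; the flag, field, and counter names below are assumptions.

Code Block
#include <stdbool.h>
#include <stdint.h>

#define SK_MULTI_RAIL (1 << 0)

struct peer_sketch {
        unsigned int state;             /* flag bits */
        uint32_t     lp_node_seqno;     /* last seqno received from the node */
};

/* 32-bit number bumped whenever the local ping source buffer changes */
static uint32_t ping_source_seqno_sketch;

static bool peer_needs_push_sketch(const struct peer_sketch *lp)
{
        /* a non-multi-rail peer would not understand a Push */
        if (!(lp->state & SK_MULTI_RAIL))
                return false;

        /* push only if the peer has not seen our latest ping buffer */
        return lp->lp_node_seqno < ping_source_seqno_sketch;
}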

...

  • CONFIGURED: This flag indicates that the peer_ni was configured using DLC.
  • NON_MR_PREF: This flag indicates that the peer_ni has an implicit preferred local NI.

CONFIGURED is used to track whether the peer_ni was configured using DLC. Discovery can collect several peer_ni structures into a single peer, but should exempt any peer_ni that was set up by DLC.

NON_MR_PREF marks a peer_ni belonging to a peer that is not known to be multi-rail aware and for which a preferred local NI was automatically assigned. A peer_ni for a non-multi-rail node should always see the same local NI as the source of the traffic from this node. If the preferred local NI has not been explicitly set with DLC, the code picks one and sets this flag to indicate that this is the reason for the preferred NI. That way we know to clear the preferred NI if the peer_ni turns out to be part of a multi-rail aware peer.

State Changes

Now we'll explore in detail how the above states are changed.

  1. Configuring a local NI with DLC
    1. DLC adds a local NI
      1. The ping source buffer is updated and the Ping Source Sequence Number increased
    2. DLC deletes a local NI
      1. The ping source buffer is updated and the Ping Source Sequence Number increased
  2. Configuring a peer with DLC
    1. DLC creates a peer
      1. The peer is marked CONFIGURED
      2. The peer_ni is marked CONFIGURED
    2. DLC adds a peer NI to a peer
    3. DLC deletes a peer NI from a peer
    4. DLC deletes a peer
  3. Forcing ping discovery
  4. Sending a message
  5. Ping handling
  6. Push handling
  7. Merging received data
  8. Discovery thread workflow (see the sketch after this list)
    1. get queued peer (QUEUED is set)
    2. if DATA_PRESENT is set
      1. merge the received data
    3. else if PING_FAILED is set
      1. clear PING_FAILED
      2. set PING_REQUIRED
      3. if there was an error, terminate discovery
      4. else do another pass over the peer
    4. else if PUSH_FAILED is set
      1. clear PUSH_FAILED
      2. if there was an error, terminate discovery
      3. else do another pass over the peer
    5. else if PING_REQUIRED is set
      1. clear PING_REQUIRED
      2. set PING_SENT
      3. send Ping
    6. else if discovery is disabled
      1. clear DISCOVERED and DISCOVERING
      2. set UNDISCOVERED
    7. else if NIDS_UPTODATE is not set
      1. set PING_SENT
      2. send Ping
    8. else if MULTI_RAIL is not set
      1. clear DISCOVERING
      2. set DISCOVERED
    9. else if lp_node_seqno < Ping Source Sequence Number
      1. set PUSH_SENT
      2. send Push
    10. else
      1. clear DISCOVERING
      2. set DISCOVERED
    11. if DISCOVERING is not set
      1. clear QUEUED
      2. dequeue peer
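
The sketch below walks one pass of this workflow over a queued peer. The flag names mirror the state bits described earlier, while the helpers, the error field, and the handling of "terminate discovery" (dropping DISCOVERING) are simplifying assumptions.

Code Block
#include <stdbool.h>
#include <stdint.h>

#define QUEUED          (1 << 0)
#define DISCOVERING     (1 << 1)
#define DISCOVERED      (1 << 2)
#define UNDISCOVERED    (1 << 3)
#define DATA_PRESENT    (1 << 4)
#define NIDS_UPTODATE   (1 << 5)
#define PING_SENT       (1 << 6)
#define PUSH_SENT       (1 << 7)
#define PING_FAILED     (1 << 8)
#define PUSH_FAILED     (1 << 9)
#define PING_REQUIRED   (1 << 10)
#define MULTI_RAIL      (1 << 11)

struct peer_sketch {
        unsigned int state;
        uint32_t     lp_node_seqno;
        int          error;     /* result of the last ping/push attempt */
};

static bool     discovery_enabled;
static uint32_t ping_source_seqno;

/* stubs standing in for the real operations */
static void merge_received_data(struct peer_sketch *lp)
{
        lp->state &= ~DATA_PRESENT;     /* real code updates the peer NIDs */
}
static void send_ping(struct peer_sketch *lp)    { lp->state |= PING_SENT; }
static void send_push(struct peer_sketch *lp)    { lp->state |= PUSH_SENT; }
static void dequeue_peer(struct peer_sketch *lp) { lp->state &= ~QUEUED; }

/* one pass of the discovery thread over a peer taken from the request queue */
static void discovery_pass_sketch(struct peer_sketch *lp)
{
        if (lp->state & DATA_PRESENT) {                         /* step 2 */
                merge_received_data(lp);
        } else if (lp->state & PING_FAILED) {                   /* step 3 */
                lp->state &= ~PING_FAILED;
                lp->state |= PING_REQUIRED;
                if (lp->error)
                        lp->state &= ~DISCOVERING;      /* terminate (assumed) */
                /* otherwise the next pass acts on PING_REQUIRED */
        } else if (lp->state & PUSH_FAILED) {                   /* step 4 */
                lp->state &= ~PUSH_FAILED;
                if (lp->error)
                        lp->state &= ~DISCOVERING;      /* terminate (assumed) */
        } else if (lp->state & PING_REQUIRED) {                 /* step 5 */
                lp->state &= ~PING_REQUIRED;
                send_ping(lp);                          /* sets PING_SENT */
        } else if (!discovery_enabled) {                        /* step 6 */
                lp->state &= ~(DISCOVERED | DISCOVERING);
                lp->state |= UNDISCOVERED;
        } else if (!(lp->state & NIDS_UPTODATE)) {              /* step 7 */
                send_ping(lp);                          /* sets PING_SENT */
        } else if (!(lp->state & MULTI_RAIL)) {                 /* step 8 */
                lp->state &= ~DISCOVERING;
                lp->state |= DISCOVERED;
        } else if (lp->lp_node_seqno < ping_source_seqno) {     /* step 9 */
                send_push(lp);                          /* sets PUSH_SENT */
        } else {                                                /* step 10 */
                lp->state &= ~DISCOVERING;
                lp->state |= DISCOVERED;
        }

        if (!(lp->state & DISCOVERING))                         /* step 11 */
                dequeue_peer(lp);                       /* clears QUEUED */
}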

The following discussion must be updated – OW

...

  • INIT: pre state. Only transitory.
  • CREATED: The NI has been added successfully
  • FAILED: LND notification that the physical interface is down. hlt-005, hlt-010,
  • DEGRADED: LND notification that the physical interface is degraded. IB and ETH will probably not send such a notification. hlt-015, hlt-020, hlt-025
  • DELETING: a config delete is received. All pending messages must be depleted.

Both Degraded and Failed need the LND to notify LNet. For Degraded, the LND could query the type of the card to determine the theoretical speed; if the measured speed falls below that, the NI can be marked as degraded. snd-080
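
A sketch of these NI states together with the kind of speed comparison an LND might use to report DEGRADED; the enum values, helper, and threshold are assumptions.

Code Block
enum ni_state_sketch {
        NI_STATE_INIT,          /* transitory pre-state */
        NI_STATE_CREATED,       /* NI added successfully */
        NI_STATE_FAILED,        /* LND reported the interface is down */
        NI_STATE_DEGRADED,      /* LND reported reduced performance */
        NI_STATE_DELETING,      /* config delete received, draining messages */
};

/* possible LND heuristic: measured rate well below the card's rated speed */
static enum ni_state_sketch classify_ni_sketch(unsigned long theoretical_mbps,
                                               unsigned long measured_mbps,
                                               int link_up)
{
        if (!link_up)
                return NI_STATE_FAILED;
        if (measured_mbps < theoretical_mbps / 2)       /* threshold assumed */
                return NI_STATE_DEGRADED;
        return NI_STATE_CREATED;
}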

snd-085 - TODO: need to identify in the design how we deal with local NI failures.

...