
...

Reference Documents

Document Structure

This document is made up of the following sections:

...

Gliffy Diagram: HLD System Diagram

Figure 1: System Level Diagram

...

Gliffy Diagram: HLD Multi-Rail LNet Data Structure Diagram

Figure 2: LNet Data Structure Diagram

The primary data structures maintained by the LNet module will be modified as follows: (cfg-040)

  • struct lnet_ni will reference exactly one NI
  • struct lnet_net will be added which can point to multiple lnet_ni structures
  • struct lnet (aka lnet_t) will have a list of lnet_net structures instead of lnet_ni structures
  • struct lnet_peer will be renamed to struct lnet_peer_ni, and will represent a peer NID with all its credits
  • struct lnet_peer_net will encapsulate multiple lnet_peer_ni structures. This structure's purpose is to simplify the selection algorithm as discussed later.
  • struct lnet_peer will encapsulate multiple lnet_peer_net structures.
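
The relationships described above can be sketched as follows. This is an illustrative sketch only; field names beyond the structure names listed above are assumptions, not the actual Lustre definitions.

Code Block
#include <linux/list.h>

/* Illustrative sketch of the modified data structures; field names other than
 * the structure names above are assumptions. */
struct lnet_ni {                        /* exactly one network interface */
    struct list_head ni_netlist;        /* membership in lnet_net::net_ni_list */
    /* per-NI credits, CPTs, status, ... */
};

struct lnet_net {                       /* one network (e.g. tcp0) and its NIs */
    struct list_head net_list;          /* membership in lnet::ln_nets */
    struct list_head net_ni_list;       /* list of struct lnet_ni */
};

struct lnet {                           /* aka lnet_t */
    struct list_head ln_nets;           /* list of struct lnet_net (was a list of NIs) */
};

struct lnet_peer_ni {                   /* one peer NID with all its credits */
    struct list_head lpni_on_peer_net;  /* membership in lnet_peer_net::lpn_peer_nis */
};

struct lnet_peer_net {                  /* groups the peer NIs on one network */
    struct list_head lpn_on_peer;       /* membership in lnet_peer::lp_peer_nets */
    struct list_head lpn_peer_nis;      /* list of struct lnet_peer_ni */
};

struct lnet_peer {                      /* the peer node as a whole */
    struct list_head lp_peer_nets;      /* list of struct lnet_peer_net */
};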

...

A "NUMA range" tunable will control the search width. Setting this value to a high number basically turns off NUMA based selection, as all local NIs are considered. cfg-090, snd-025

Dynamic peer discovery

Dynamic peer discovery will be added to LNet. It works by using an LNet PING to discover whether a peer node supports the same capabilities. Support is indicated by setting a bit in the lnet_ping_info_t->pi_features field.
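
For illustration, a node receiving the ping response might test the feature bit roughly as follows; the bit value shown is an assumption, the actual value being defined by the LNet protocol headers.

Code Block
/* Sketch only: the actual bit position is defined by the LNet protocol headers. */
#define LNET_PING_FEAT_MULTI_RAIL (1 << 1)      /* assumed bit value */

static int lnet_peer_is_multi_rail(lnet_ping_info_t *pi)
{
    return (pi->pi_features & LNET_PING_FEAT_MULTI_RAIL) != 0;
}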

...

  • Adding/removing/showing Network Interfaces.
  • Adding/removing/showing peers. cfg-070, cfg-075
    • Each peer can be composed of one or more peer NIDs
  • Adding/removing/showing selection policies

The lnetctl utility uses the DLC library API to perform its functions. Besides the standard command line interface to configure different elements, configuration can be represented in a YAML formatted file. Configuration can also be queried and presented to the user in YAML format. The configuration design philosophy is to ensure that all config which can be queried from the kernel can be fed back into the kernel to produce the exact same result. cfg-045, cfg-050, cfg-060, cfg-065, cfg-170

DLC Library

The DLC library shall add a set of APIs to handle configuring the LNet kernel module. cfg-005, cfg-015

  • lustre_lnet_config_ni() - this will be modified to add one or more network interfaces. cfg-020, cfg-025
  • lustre_lnet_del_ni() - this will be modified to delete one or more network interfaces
  • lustre_lnet_show_ni() - this will be modified to show all NIs in the network. cfg-010
  • lustre_lnet_config_peer() - add a set of peer NIDs
  • lustre_lnet_del_peer() - delete a peer NID
  • lustre_lnet_show_peers() - shows all peers in the system. Can provide a maximum number of peers to show
  • lustre_lnet_config_<rule type>_selection_rule() - add an NI selection policy rule to the existing rules
  • lustre_lnet_del_<rule type>_selection_rule() - delete an NI selection policy rule using its assigned ID or matching criteria. cfg-095
  • lustre_lnet_<rule type>_selection_rule() - show all NI selection policy rules configured in the system, each given an ID.
  • lustre_lnet_set_dynamic_discover() - enable or disable dynamic discovery.
  • lustre_lnet_set_use_tcp_bonding() -  enable or disable using TCP bonding.
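
As an illustration of how an application might drive these entry points, the sketch below calls two of them. The prototypes shown are assumptions for illustration only and do not reflect the actual DLC headers.

Code Block
/* Illustrative only: these prototypes are assumptions used to sketch how an
 * application might call the DLC library; they are not the actual headers. */
#include <stdio.h>

extern int lustre_lnet_config_ni(char *net, char *intf_list, int seq_no);
extern int lustre_lnet_set_dynamic_discover(int enable, int seq_no);

int main(void)
{
    /* Add two interfaces under the tcp network, then enable dynamic discovery.
     * A seq_no of -1 means "no sequence number" in this sketch. */
    if (lustre_lnet_config_ni("tcp", "eth0,eth1", -1) != 0)
        fprintf(stderr, "lustre_lnet_config_ni failed\n");

    if (lustre_lnet_set_dynamic_discover(1, -1) != 0)
        fprintf(stderr, "lustre_lnet_set_dynamic_discover failed\n");

    return 0;
}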

...

CPTs are a creation-time element, and the best configuration philosophy is to allow the user to explicitly specify them as part of the interface. Therefore, this design recommends only allowing CPTs to be configured at the NI level. This maintains the current behavior where CPTs are Network Interface specific.

cfg-030 - the CPT is a creation-time configuration and cannot be changed afterwards. This requirement will not be implemented.

...

Multi-Rail shall change the way network interfaces are configured. In order to maintain backwards compatibility, much code would need to be added to deal with different configuration formats. This would inevitably lead to unmaintainable code. As a result, multi-rail lnetctl/DLC will only work with Multi-Rail capable LNet. This means that upgrading a system to Multi-Rail capable LNet will entail upgrading all userspace and kernel space components. Older YAML configuration will still work with the newer Multi-Rail capable nodes. bck-005, bck-010, bck-015, bak-20.

Multi-Rail nodes will continue to connect to non-Multi-Rail capable nodes and vice versa. When a Multi-Rail capable node is connected to a cluster and dynamic discovery is enabled, it will automatically be discovered on first use, as described later in this document in the Dynamic Discovery section. bck-025, bck-030

Adding local NI

lnetctl Interface

...

Code Block
# In order to remain backward compatible, two forms of the command shall be allowed.
# The first will delete the entire network and all network interfaces under it.
# The second will delete a single network interface

lnetctl > net del -h
net del: delete a network
Usage: net del --net <network> [--if <interface>] 

WHERE:

 --net: net name (e.g. tcp0)
 --if: interface name. (e.g. eth0)

# If the --if parameter is specified, then this will specify exactly one NI to delete or a list
# of NIs, since the --if parameter can be a comma separated list.
# TODO: It is recommended that if the --if is not specified that all the interfaces are removed.

YAML Syntax

cfg-055

Code Block
net:
   - net: <network.  Ex: tcp or o2ib>
     interfaces:
         - intf: <interface name to delete>
     seq_no: <integer.  Optional.  User generated, and is
              passed back in the YAML error block>

# Example: delete all network interfaces in o2ib1 network completely
net:
   - net: o2ib1

# delete only the specified NIs
net:
   - net: o2ib1
     interfaces:
           - intf: ib0
           - intf: ib1

...

All peer NIDs specified must be unique in the system. If a non-unique peer NID is added, LNet shall fail the configuration. cfg-080

YAML Syntax

Code Block
peers:
  - nids:
      0: ip@net1
      1: ip@net2
  - nids:
      0: ip@net3
      1: ip@net4

# The exact same syntax can be used to refresh the peer table. The assumption is
# each peer in the YAML syntax contains all the peer NIDs.
# As an example if a peer is configured as follows:

peers:
  - nids:
     0: 10.2.2.3@ib0
     1: 10.4.4.4@ib1

# Then later you feed the following into the system

peers:
  - nids:
     0: 10.2.2.3@ib0
     1: 10.5.5.5@ib2

# The result of this configuration is the removal of 10.4.4.4@ib1 from 
# the peer NID list and the addition of 10.5.5.5@ib2
# In general a peer can be referenced by any of its NIDs. So when configuring all the NIDs are used
# to find the peer. The first peer that's found will be configured. If the peer NID being added is 
# not unique, then that peer NID is ignored and an error flagged. The Index of the ignored NID is
# returned to the user space, and is subsequently reported to the user.

...

A rule can be uniquely identified using its matching criteria or an internal ID, which is assigned by the LNet module when the rule is added and returned to user space as part of the output of a show command.

cfg-100, cfg-105, cfg-110, cfg-115, cfg-120, cfg-125, cfg-130, cfg-135, cfg-140, cfg-160, cfg-165

lnetctl Interface

Code Block
# Adding a network priority rule. If the NI under the network doesn't have
# an explicit priority set, it'll inherit the network priority:
lnetctl > selection net [add | del | show] -h
Usage: selection net add --net <network name> --priority <priority>
 
WHERE:

selection net add: add a selection rule based on the network priority
        --net: network string (e.g. o2ib or o2ib* or o2ib[1,2])
		--priority: Rule priority

Usage: selection net del --net <network name> [--id <rule id>]
 
WHERE:

selection net del: delete a selection rule given the network pattern or the id. If both
				   are provided they need to match or an error is returned.
        --net: network string (e.g. o2ib or o2ib* or o2ib[1,2])
		--id: ID assigned to the rule returned by the show command.
 
Usage: selection net show [--net <network name>]

WHERE:

selection net show: show selection rules and filter on network name if provided.
        --net: network string (e.g. o2ib or o2ib* or o2ib[1,2])
 
# Add a NID priority rule. All NIDs added that match this pattern shall be assigned
# the identified priority. When the selection algorithm runs it shall prefer NIDs with
# higher priority.
lnetctl > selection nid [add | del | show] -h
Usage: selection nid add --nid <NID> --priority <priority>

WHERE:

selection nid add: add a selection rule based on the nid pattern
   		--nid: nid pattern which follows the same syntax as ip2net
		--priority: Rule priority


Usage: selection nid del --nid <NID> [--id <rule id>]

WHERE:

selection nid del: delete a selection rule given the nid pattern or the id. If both
				   are provided they need to match or an error is returned.
        --nid: nid pattern which follows the same syntax as ip2net
		--id: ID assigned to the rule returned by the show command.


Usage: selection nid show [--nid <NID>]

WHERE:

selection nid show: show selection rules and filter on NID pattern if provided.
        --nid: nid pattern which follows the same syntax as ip2net

# Adding a point-to-point rule. This creates an association between a local NI and a remote
# NID, and assigns a priority to this relationship so that it's preferred when selecting a pathway.
lnetctl > selection peer [add | del | show] -h
Usage: selection peer add --local <NID> --remote <NID> --priority <priority>

WHERE:

selection peer add: add a selection rule based on local to remote pathway
   		--local: nid pattern which follows the same syntax as ip2net
		--remote: nid pattern which follows the same syntax as ip2net
		--priority: Rule priority

Usage: selection peer del --local <NID> --remote <NID> --id <ID>

WHERE:

selection peer del: delete a selection rule based on local to remote NID pattern or id
   		--local: nid pattern which follows the same syntax as ip2net
		--remote: nid pattern which follows the same syntax as ip2net
		--id: ID of the rule as provided by the show command.

Usage: selection peer show [--local <NID>] [--remote <NID>]

WHERE:

selection peer show: show selection rules and filter on NID patterns if provided.
   		--local: nid pattern which follows the same syntax as ip2net
		--remote: nid pattern which follows the same syntax as ip2net

# the output will be of the same YAML format as the input described below.

...

Gliffy Diagram: LNet Threading Model

Figure 3: LNet Threading Model

...

Code Block
192.168.0.[1-10/2, 13, 14]@nettype
# Refer to Lustre Manual for more examples

or

Code Block
eth[1,2,3], eth[1-4/2]

...



Expression Structural Form and Description

Gliffy Diagram: HLD to Kernel range 02

Figure 4: syntax descriptor

An expression can be a number:

Code Block
[<num>, <expr>]
represented as:
start == end == NUM

An expression can be a wildcard:

Code Block
[*, <expr>]
represented as:
start == 0
end == U32_MAX
INCR == 1

An expression can be a range:

Code Block
[<start> - <end>, <expr>]
represented as:
start == START_NUM
end == END_NUM
INCR == 1

An expression can be a range and an increment:

Code Block
[<num-start> - <num-end>/<incr>, <expr>]
represented as:
start == START_NUM
end == END_NUM
INCR == INCREMENT VALUE
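
Each expression element above reduces to a start/end/increment triple. A sketch of how such an element might be represented in the kernel is shown below; the struct and field names are illustrative, not the actual libcfs definitions.

Code Block
#include <linux/list.h>
#include <linux/types.h>

/* Illustrative representation of one <expr> element; names are assumptions. */
struct range_expr {
    struct list_head re_link;   /* next expression element in the list */
    __u32            re_start;  /* start == end == NUM for a plain number */
    __u32            re_end;    /* U32_MAX for the '*' wildcard */
    __u32            re_incr;   /* 1 unless an explicit /<incr> is given */
};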

...

The pseudo code below describes the algorithm in more details. snd-005, snd-010, snd-020, snd-030, snd-035, snd-040, snd-045, snd-050, snd-055, snd-060, snd-065, snd-070, snd-075

snd-015 - NUMA APIs have been available in some form since at least 2.6.1, and therefore will pose no problems for this project.

...

For apparent hard failures it is worth noting that PING/PUSH information contains the status of each interface. This is a mechanism by which presence of and recovery from hard failures can be communicated. Rather than have a peer actively push such information, it is likely better to have nodes pull it when they need it. Such a pull (done by pinging the peer, of course) can be done occasionally as long as other, healthy, peer NIs are available.

Selection Criteria

The selection criteria for the local NI / peer NI pair, listed in order of priority:

Local:

  • Health score
  • UDSP priority
  • NUMA score
  • Credit score
  • Round-robin

Peer:

  • Health score
  • UDSP priority
  • UDSP preference list
  • Credit score
  • Round-robin
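
For illustration, the local NI comparison can be thought of as a strict priority cascade, as sketched below. The helper name, score fields and comparison direction are assumptions; the authoritative behaviour is the selection algorithm pseudo-code referenced below.

Code Block
/* Illustrative sketch of the local NI comparison order; names and the
 * direction of each comparison are assumptions. */
struct ni_score {
    int health;     /* health score */
    int udsp;       /* UDSP priority */
    int numa;       /* NUMA score */
    int credits;    /* credit score */
};

/* Return the preferred candidate; ties fall through to the next criterion,
 * and a tie on all criteria is left to round-robin ordering. */
static const struct ni_score *pick_local_ni(const struct ni_score *a,
                                            const struct ni_score *b)
{
    if (a->health != b->health)
        return a->health > b->health ? a : b;
    if (a->udsp != b->udsp)
        return a->udsp > b->udsp ? a : b;
    if (a->numa != b->numa)
        return a->numa > b->numa ? a : b;
    if (a->credits != b->credits)
        return a->credits > b->credits ? a : b;
    return a;       /* equal on all criteria: round-robin decides */
}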

Selection Algorithm Pseudo-code

...

Dynamic Peer Discovery ("Discovery" for short) is the process by which a node can discover the network interfaces it can reach a peer on without being pre-configured. This involves sending a ping to the peer. The ping response carries a flag bit to indicate that the peer is multi-rail capable. If it is, the node then pushes its own network interface information to the peer. This protocol distributes the network interface information to both nodes, and subsequently each node can exercise the peer's network interfaces as well as its own, as described in further detail in this section. Discovery can be enabled, disabled or in verification mode. In verification mode, discovery cross-references the discovered peer NIDs with the configured NIDs and complains if there is a discrepancy, but continues to use the configured NIDs. cfg-085, dyn-005, dyn-010, dyn-015, dyn-020, dyn-025, dyn-030, dyn-035, dyn-040, dyn-045, dyn-050, dyn-055, dyn-060, dyn-065

Discovery handshake

Discovery happens between an active node, which takes the lead in the process, and a passive node, which responds to messages from the active node. The following diagram illustrates the basic handshake between the active and passive nodes. If the full handshake completes, both nodes have discovered each other.

...

Gliffy Diagram: HLD Multi-Rail-Multi-Rail Dynamic Discovery Sequence

Figure 5: Dynamic Discovery Overview

...

Gliffy Diagram: DD Overview

Figure 6: Dynamic Discovery Detailed Sequence Diagram

...

  1. Note that the passive node sends a ping reply to the active node before the active node knows whether it should do a ping push.
    1. At this point the passive node does not know whether the active node is uprev or downrev.
    2. As noted above, the LNet reply always goes to the originating NID, so the passive node has enough information to be able to send it.
  2. The active node can be doing discovery on multiple NIDs of the passive node at the same time.
    1. Active node:
      1. The active node has to create peer/peer_net/peer_ni (at the very least peer_ni) datastructures to be able to send a ping message
      2. The active node now has multiple peer/peer_net/peer_ni structures for the same peer.
      3. On receipt of the ping reply the active node merges these structures.
      4. Having merged these structures, the active node sends a ping push message.
      5. The active node should be smart enough to not send multiple ping push messages in this case.
      6. The serialization we obtain by having a single dynamic discovery thread helps here.
    2. Passive node:
      1. The passive node has to create peer/peer_net/peer_ni datastructures to be able to send a ping reply.
      2. At the point where the passive node does this, it doesn't know whether the active node is uprev or downrev.
      3. If downrev, the passive node will not receive further information from the active node.
      4. Therefore the datastructures set up must be complete for a downrev active node.
      5. An uprev active node may have multiple pings in flight to different NIDs, prompting creation of multiple peer structures.
      6. On receipt of the ping push message, these structures must be merged.
      7. Further pushes serve to update these structures.
  3. Dynamic discovery and DLC configuration can update the same peer at the same time.
    1. Serialize updates through a peer mutex, and protect lookups with per-CPT spinlocks.
    2. A lookup needs just the per-CPT spinlock.
      An update must hold both the mutex and all per-CPT spinlocks (LNET_LOCK_EX). It needs this because a single per-CPT lock protects not only lookups of a peer NID, but also traversal of the peer_ni list in the peer_net and the peer_net_list in the peer. So all per-CPT locks need to be held if the peer_ni_list or peer_net_list is to be changed.
  4. Can DLC modify discovered peers?
    1. Presumably yes.
    2. Troublesome case is deleting a peer NI that we're just using in discovery.
    3. This is not different from the normal case of trying to delete a peer NI that is currently in use.
    4. The peer NI must be idled first, which implies that the discovery round on that peer NI must be allowed to finish.
    5. Discovery can push a NI list that does not include the NI going idle, even though it uses that NI.
    6. This is similar to the normal case where DLC removes an active NI.
    7. While waiting for a NI to go idle, the peer mutex must be released, to avoid dynamic discovery deadlocking with DLC.
    8. We probably do not want yet another DLC request to come in and try to re-add the peer NI before all the above has finished.
    9. So the peer mutex mentioned above is not the ln_api_mutex that the ioctls serialize on.
    10. The api mutex must be held by the thread doing the ioctl across the entire operation, to avoid this configuration race.
    11. When both are held, the api mutex must be locked before the peer mutex.
  5. Can discovery modify DLC configured peers?
    1. Presumably yes.
    2. When DLC adds a peer NI, it can hold the peer mutex across the entire operation.
    3. When DLC removes a peer NI, it ensures the peer NI is idle first.
    4. Discovery always sees a coherent peer datastructure to work on.
  6. The active node has discovery enabled, the passive node has discovery disabled.
    1. The passive node may not have a configuration for the peer. In a cluster with only a few multi-rail nodes, it is plausible to just not explicitly configure the non-multi-rail peers.
    2. There are three approaches:
      1. the passive node just drops any push on the floor. In this case the dynamic discovery thread need not be running.
      2. the passive node verifies its configuration using the push message received. In this case the dynamic discovery thread needs to be running.
        1. a push containing more than one interface merits a complaint
        2. a push containing a single interface is accepted without complaint
      3. the passive node updates its configuration using the push.
  7. The active node has discovery disabled, the passive node has discovery enabled.
    1. The passive node is prompted to create the peer/peer_net/peer_ni datastructures as usual
    2. If the active node wasn't DLC configured on the passive node, then the passive node will not detect that the active node is uprev. The relevant ping traffic never happens.
    3. A multi-rail node on which discovery is disabled must be added to the DLC configuration of all its relevant peers.
  8. Active side enables dynamic discovery
    1. While dynamic discovery is disabled all peers added via DLC are moved to the ACTIVE state, and no dynamic discovery is performed
    2. When dynamic discovery is enabled peers which are in the ACTIVE state are not dynamically discovered.
      1. The other option is to have it retroactive and go through all the peers and determine if they have been dynamically discovered and if not then initiate dynamic discovery.
        1. This is likely to cause a spike in traffic
        2. In large systems this could cause a heavy load on the nodes since there could be potentially thousands of peers.
    3. Further communication with peers in ACTIVE state does not trigger dynamic discovery
    4. New peers added via DLC are moved to the WAITING-VERIFICATION state and on first message to these peers dynamic discovery is triggered.
    5. If dynamic discovery is disabled, any messages sent to peers in the WAITING_VERIFICATION state will cause the peers to move directly to the ACTIVE state with no discovery round triggered.
    6. Messages sent to peers that do not exist yet in the system trigger dynamic discovery if dynamic discovery is enabled.
  9. Network Race: Ping pushes from a peer can arrive out of order (flipped). This can happen in both directions. In this case the node receiving the pushes will not be able to distinguish their order and could end up with outdated information.
    1. To resolve this situation a sequence number can be added to the ping push, allowing the receiving side to determine the order.
    2. This requires the receiving side to maintain the last sequence number of the received push.
    3. If the push received has a sequence number greater than what it currently has for that peer, then update; otherwise ignore the push since it has outdated information (see the sketch after this list).
    4. Gliffy Diagram: NetworkRace01
       Figure 7: Push/Push Race Condition
  10. In the case when two peers simultaneously attempt to discover each other, each peer will create the corresponding peer/peer_net/peer_ni structures as it would normally do, and will transition its states according to the FSM. This scenario should be handled through the normal code path.
    1. Gliffy Diagram: NetworkRace02
       Figure 8: Simultaneous Discovery Scenario
    2. The following variations could occur
      1. The node can receive a ping and create the corresponding peer structures before it starts peer discovery.
        1. In this case, when a message is sent to that peer, the structures are found, the peer will be in the ACTIVE state, and no discovery round will be triggered.
      2. The node can receive a ping after it has started its ping discovery round
        1. In this case the peer structure will be found in the DISCOVERING state. The ping response will be sent back (possibly before the peer state is examined, but it's not important)
      3. The node can receive a ping response before it sends its own ping response. This is the standard case; the ping discovery protocol is completed at this point.
      4. The node can receive a ping push before it has sent its own ping push. This would result in it updating its own structures. This is again handled in the normal case.
  11. In the case where one node attempts to discover the same peer on multiple NIDs, multiple peer/peer_net/peer_ni structures will be created, one for each NID, since at this point the node doesn't know that it's the same peer. On ping response the node will send a ping push and transition the corresponding peer state to ACTIVE (note the order). When the second ping response on NID2 is received, the information in the ping response is used to locate peer2a and the structures are merged. Since the other peer found is already in ACTIVE state, there is no need to send another ping push. On the passive side, a similar process occurs. If the pings are sent from two different sources, then two peer/peer_net/peer_ni structures are created, and then merged when the push is received, which serves to link both structures. If the pings are sent from the same src NID, then the peer is created on the first ping and found on the second ping. No merge is required.
    1. Gliffy Diagram: NetworkRace03
       Figure 9: One-sided Discovery on multiple peer NIDs
  12. It is possible to have simultaneous discovery on multiple NIDs. This is a combination of scenarios 10 and 11. The handling of both scenarios applies here.
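
The sequence-number check described in item 9 can be sketched as follows; the structure and field names are assumptions for illustration only.

Code Block
#include <linux/types.h>

/* Illustrative sketch of the push-ordering check from item 9 above. */
struct peer_push_state {
    __u32 last_push_seqno;      /* highest sequence number accepted so far */
};

/* Return non-zero if the push should be applied, zero if it is stale. */
static int accept_push(struct peer_push_state *ps, __u32 push_seqno)
{
    if (push_seqno <= ps->last_push_seqno)
        return 0;               /* out of order: carries outdated information */
    ps->last_push_seqno = push_seqno;
    return 1;
}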

...

  1. Have we received the peer's NI information? Two ways to get it:
    1. This node pinged the peer.
    2. The peer pushed this node.
  2. Has the peer received the local node's NI information? Again, two ways:
    1. This node pushed to the peer.
    2. The peer pinged this node.
  3. Has the local NI config changed from what the peer was told? Several ways this can happen:
    1. DLC update
    2. Interface hotplug

Peer State

The state of a peer is a combination of the following bits of information, where the flags can be found in the source code by prepending LNET_PEER_, so CONFIGURED becomes LNET_PEER_CONFIGURED. Peer state updates can be triggered from the event handler called when a message is received. These event handlers run while a spinlock is held in a time-critical path, and so we try to limit the amount of work done there. The discovery thread can then do the heavier lifting later under more relaxed locking constraints.

  • CONFIGURED: The peer was configured via DLC.
  • DISCOVERED: The peer has been discovered.
  • UNDISCOVERED: Peer discovery was disabled when the peer was created.
  • QUEUED: Peer is queued for discovery.
  • DISCOVERING: Discovery is active for the peer.
  • DATA_PRESENT: Peer data is available to update the peer.
  • NIDS_UPTODATE: Discovery has successfully updated the NIDs of the peer.
  • PING_SENT: Discovery has sent a Ping to the peer and is waiting for the Reply.
  • PUSH_SENT: Discovery has sent a Push to the peer and is waiting for the Ack.
  • PING_FAILED: Sending a ping to the peer failed.
  • PUSH_FAILED: Sending a push to the peer failed.
  • PING_REQUIRED: Discovery must Ping the peer.
  • MULTI_RAIL: The peer is running a multi-rail aware version of Lustre.
  • Ping Source Sequence Number: tracks whether the peer has seen the node's latest ping data.
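
Sketched as kernel-style flag bits, the peer state might look like the following; the bit positions are assumptions, only the LNET_PEER_ naming convention is given above.

Code Block
/* Bit positions are illustrative; only the LNET_PEER_ prefix is given above. */
#define LNET_PEER_CONFIGURED    (1 << 0)    /* configured via DLC */
#define LNET_PEER_DISCOVERED    (1 << 1)    /* peer discovery completed */
#define LNET_PEER_UNDISCOVERED  (1 << 2)    /* created while discovery was disabled */
#define LNET_PEER_QUEUED        (1 << 3)    /* on a discovery queue */
#define LNET_PEER_DISCOVERING   (1 << 4)    /* discovery in progress */
#define LNET_PEER_DATA_PRESENT  (1 << 5)    /* received data waiting to be merged */
#define LNET_PEER_NIDS_UPTODATE (1 << 6)    /* peer NIDs believed current */
#define LNET_PEER_PING_SENT     (1 << 7)    /* waiting for the Ping Reply */
#define LNET_PEER_PUSH_SENT     (1 << 8)    /* waiting for the Push Ack */
#define LNET_PEER_PING_FAILED   (1 << 9)    /* last Ping attempt failed */
#define LNET_PEER_PUSH_FAILED   (1 << 10)   /* last Push attempt failed */
#define LNET_PEER_PING_REQUIRED (1 << 11)   /* force a Ping to recover */
#define LNET_PEER_MULTI_RAIL    (1 << 12)   /* peer is multi-rail aware */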

CONFIGURED marks the peer as being configured using DLC.

DISCOVERED marks the peer as having been through peer discovery. Configuration via DLC overrides peer discovery, but does not prevent the discovery algorithm from processing a peer. The algorithm complains if it finds differences between the configuration and what the peer reports. As such, a peer can be marked as both CONFIGURED and DISCOVERED.

UNDISCOVERED marks a peer as having been through peer discovery, but not updated because peer discovery is disabled. It signals that the peer needs to be re-examined if discovery is enabled.

QUEUED is set on a peer that is linked on either the ln_dc_request or the ln_dc_working queue via its lp_dc_list member. A peer is queued by lnet_peer_queue_for_discovery() and dequeued by lnet_peer_discovery_complete().

DISCOVERING is set on a peer when discovery is looking at it. When discovery completes, it clears DISCOVERING and sets one of DISCOVERED or UNDISCOVERED.

DATA_PRESENT is set on a peer by the event handler for an incoming Push if it successfully stores the data, and by the event handler for an incoming Reply to a Ping. These event handlers run with spinlocks held, which is why we postpone the complex operation of updating the peer until the discovery thread can do it. The discovery thread processes the data and updates the peer by calling lnet_peer_data_present(), which clears the flag.

NIDS_UPTODATE is set on a peer to indicate that the NIDs for the peer are believed to be known. It is cleared when data is received that indicates that the peer may have changed, such as an incoming Push. If storing the data from an incoming Push fails we cannot set the DATA_PRESENT flag, but we do clear NIDS_UPTODATE to indicate that the peer must be re-examined.

PING_SENT is set on a peer when a Ping has been sent and we are waiting for the Reply message. The implication is that lp_ping_mdh is live and has an MD bound to it.

PUSH_SENT is set on a peer when a Push has been sent and we are waiting for the Ack message. The implication is that lp_push_mdh is live and has an MD bound to it.

PING_FAILED is set on a peer when an attempted Ping failed for some reason. In addition to LNet messaging failures, a Ping fails if the Reply does not fit in the pre-allocated buffer.

PUSH_FAILED is set on a peer when an attempted Push failed for some reason. The node sending the Push only sees a failure if LNet messaging reports one.

...

PING_REQUIRED is set on a peer when a Ping is necessary to properly determine the state of a peer. Triggering a Ping is the mechanism by which discovery attempts to recover from any problems it may have encountered while processing a peer. Pings triggered by this flag happen even if discovery has been disabled.

MULTI_RAIL marks a peer as running a multi-rail aware version of Lustre.

The Ping Source Sequence Number is sent in the pi_ni[0].ns_status field of ping data. In ping data, pi_ni[0] always contains the data for the loopback NID, and non-multi-rail nodes do not interpret that field. The number is stored in the lp_node_seqno field of the peer. This is used in lnet_peer_needs_push() to determine whether a multi-rail aware peer needs a Push. The ping source sequence number is a 32-bit number that is updated whenever the source buffer for LNet ping is changed.
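
A minimal sketch of this decision is shown below; lp_node_seqno and lnet_peer_needs_push() are named above, while the lp_state field and the lnet_ping_source_seqno() helper are assumptions.

Code Block
/* Sketch of the lnet_peer_needs_push() decision described above; lp_state and
 * lnet_ping_source_seqno() are assumed names. */
static int lnet_peer_needs_push(struct lnet_peer *lp)
{
    /* Non-multi-rail peers do not interpret Push messages. */
    if (!(lp->lp_state & LNET_PEER_MULTI_RAIL))
        return 0;
    /* Push if the peer has not yet seen the latest ping source buffer. */
    return lp->lp_node_seqno < lnet_ping_source_seqno();
}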

Peer NI State

We also keep some state related to discovery in the peer_ni data structures. These state flags are prefixed by LNET_PEER_NI_ in the code.

  • CONFIGURED: This flag indicates that the peer_ni was configured using DLC.
  • NON_MR_PREF: This flag indicates that the peer_ni has an implicit preferred local NI.

CONFIGURED is used to track whether the peer_ni was configured using DLC. Discovery can collect several peer_ni structures into a single peer, but should exempt any peer_ni that was set up by DLC.

NON_MR_PREF marks a peer_ni for a peer that is not known to be multi-rail aware, and for which a preferred local NI was automatically assigned. A peer_ni for a non-multi-rail node should always see the same local NI as the source of the traffic from the node. If the preferred local NI has not been explicitly set with DLC then the code picks one and sets this flag to indicate that this is the reason for the preferred NI. That way we know to clear this preferred NI if the peer_ni turns out to be part of a multi-rail aware peer.
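
The bookkeeping described above might be sketched as follows; the field and helper names are assumptions, only the NON_MR_PREF flag itself comes from this design.

Code Block
/* Sketch of the NON_MR_PREF bookkeeping; field and helper names are assumed. */
static void lnet_peer_ni_set_non_mr_pref(struct lnet_peer_ni *lpni,
                                         struct lnet_ni *local_ni)
{
    if (lpni->lpni_pref_ni != NULL)
        return;                         /* explicitly set via DLC: leave it alone */
    lpni->lpni_pref_ni = local_ni;      /* auto-pick a preferred local NI */
    lpni->lpni_state |= LNET_PEER_NI_NON_MR_PREF;
}

static void lnet_peer_ni_clear_non_mr_pref(struct lnet_peer_ni *lpni)
{
    /* The peer turned out to be multi-rail aware: drop the auto-picked NI. */
    if (lpni->lpni_state & LNET_PEER_NI_NON_MR_PREF) {
        lpni->lpni_pref_ni = NULL;
        lpni->lpni_state &= ~LNET_PEER_NI_NON_MR_PREF;
    }
}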

State Changes

Now we'll explore in detail how the above states are changed.

  1. Configuring a NI with DLC
    1. DLC adds a local NI
      1. The ping source buffer is updated and the Ping Source Sequence Number increased
    2. DLC deletes a local NI
      1. The ping source buffer is updated and the Ping Source Sequence Number increased
  2. Configuring a peer with DLC
    1. DLC creates a peer
      1. The peer is marked CONFIGURED
      2. The peer_ni is marked CONFIGURED
    2. DLC adds a peer NI to a peer
    3. DLC deletes a peer NI from a peer
    4. DLC deletes a peer
  3. Forcing ping discovery
  4. Sending a message
  5. Ping handling
  6. Push handling
  7. Merging received data
  8. Discovery thread workflow
    1. get queued peer (QUEUED is set)
    2. if DATA_PRESENT is set
      1. merge the received data
    3. else if PING_FAILED is set
      1. clear PING_FAILED
      2. set PING_REQUIRED
      3. if there was an error, terminate discovery
      4. else do another pass over the peer
    4. else if PUSH_FAILED is set
      1. clear PUSH_FAILED
      2. if there was an error, terminate discovery
      3. else do another pass over the peer
    5. else if PING_REQUIRED is set
      1. clear PING_REQUIRED
      2. set PING_SENT
      3. send Ping
    6. else if discovery is disabled
      1. clear DISCOVERED and DISCOVERING
      2. set UNDISCOVERED
    7. else if NIDS_UPTODATE is not set
      1. set PING_SENT
      2. send Ping
    8. else if MULTI_RAIL is not set
      1. clear DISCOVERING
      2. set DISCOVERED
    9. else if lp_node_seqno < Ping Source Sequence Number
      1. set PUSH_SENT
      2. send Push
    10. else
      1. clear DISCOVERING
      2. set DISCOVERED
    11. if DISCOVERING is not set
      1. clear QUEUED
      2. dequeue peer
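
The workflow above could be expressed roughly as the following single pass over a queued peer. This is a sketch only; discovery_enabled, the send helpers and the exact error handling are assumptions.

Code Block
/* Sketch of one discovery-thread pass over a queued peer (item 8 above). */
static void lnet_discovery_process_peer(struct lnet_peer *lp)
{
    if (lp->lp_state & LNET_PEER_DATA_PRESENT) {
        lnet_peer_data_present(lp);             /* merge the received data */
    } else if (lp->lp_state & LNET_PEER_PING_FAILED) {
        lp->lp_state &= ~LNET_PEER_PING_FAILED;
        lp->lp_state |= LNET_PEER_PING_REQUIRED;
        /* on a hard error terminate discovery, else take another pass */
    } else if (lp->lp_state & LNET_PEER_PUSH_FAILED) {
        lp->lp_state &= ~LNET_PEER_PUSH_FAILED;
        /* on a hard error terminate discovery, else take another pass */
    } else if (lp->lp_state & LNET_PEER_PING_REQUIRED) {
        lp->lp_state &= ~LNET_PEER_PING_REQUIRED;
        lp->lp_state |= LNET_PEER_PING_SENT;
        lnet_peer_send_ping(lp);                /* assumed helper */
    } else if (!discovery_enabled) {
        lp->lp_state &= ~(LNET_PEER_DISCOVERED | LNET_PEER_DISCOVERING);
        lp->lp_state |= LNET_PEER_UNDISCOVERED;
    } else if (!(lp->lp_state & LNET_PEER_NIDS_UPTODATE)) {
        lp->lp_state |= LNET_PEER_PING_SENT;
        lnet_peer_send_ping(lp);                /* assumed helper */
    } else if (!(lp->lp_state & LNET_PEER_MULTI_RAIL)) {
        lp->lp_state &= ~LNET_PEER_DISCOVERING;
        lp->lp_state |= LNET_PEER_DISCOVERED;
    } else if (lp->lp_node_seqno < lnet_ping_source_seqno()) {
        lp->lp_state |= LNET_PEER_PUSH_SENT;
        lnet_peer_send_push(lp);                /* assumed helper */
    } else {
        lp->lp_state &= ~LNET_PEER_DISCOVERING;
        lp->lp_state |= LNET_PEER_DISCOVERED;
    }

    if (!(lp->lp_state & LNET_PEER_DISCOVERING))
        lnet_peer_discovery_complete(lp);       /* clears QUEUED and dequeues the peer */
}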

...

The following discussion must be updated – OW

...

Gliffy Diagram: HLD LocalNI FSM

Figure 10: Local NI FSM

  • INIT: pre state. Only transitory.
  • CREATED: The NI has been added successfully.
  • FAILED: LND notification that the physical interface is down. hlt-005, hlt-010
  • DEGRADED: LND notification that the physical interface is degraded. IB and ETH will probably not send such a notification. hlt-015, hlt-020, hlt-025
  • DELETING: a config delete is received. All pending messages must be depleted.

Both Degraded and Failed need the LND to notify LNet. For Degraded, the LND could possibly query the type of the card to determine the theoretical speed; if the measured speed is below that, the NI can be marked as degraded. snd-080

snd-085 - TODO: need to identify in the design how we deal with local NI failures.

...

Gliffy Diagram: HLD local Net FSM

Figure 11: Local Net FSM

  • INIT: pre state. Only transitory.
  • ACTIVE: First NI has been added successfully.
  • IN-ACTIVE: All NIs are down via LND notifications.
  • DELETING: Config request to delete local net.

...

Gliffy Diagram: Simplified Peer FSM

Figure 12: Peer FSM

This is a simplified FSM for a peer that illustrates how the various states relate to each other. The letters refer to The Discovery Algorithm.

...

Gliffy Diagram: HLD Peer NI FSM

Figure 13: Peer NI FSM

When a peer_ni is initially added to the peer, it will not be in CREATED state, which means there is no active connection with that peer_ni.

...

Name                      Status
                          Signed Off
                          Signed Off
                          Signed Off
Robert Read (optional)
SGI PAC                   Signed Off


Appendix

...

Gliffy Diagram: HLD Peer FSM

Gliffy Diagram: Peer FSM v3

TODO: I think we still need a VERIFY state and an IN-ACTIVE state as shown below. (I combined VERIFY into ACTIVE above – OW)

...