...

[Gliffy diagram: HLD System Diagram]

Figure 1: System Level Diagram

...

[Gliffy diagram: HLD Multi-Rail LNet Data Structure Diagram]

Figure 2: LNet Data Structure Diagram

...

[Gliffy diagram: LNet Threading Model]

Figure 3: LNet Threading Model

...

Code Block
192.168.0.[1-10/2, 13, 14]@nettype
# Refer to Lustre Manual for more examples

or

Code Block
eth[1,2,3], eth[1-4/2]

...



Expression Structural Form and Description

[Gliffy diagram: HLD to Kernel range 02]

Figure 4: syntax descriptor

An expression can be a number:

Code Block
[<num>, <expr>]
represented as:
start == end == NUM

An expression can be a wildcard:

Code Block
[*, <expr>]
represented as:
start == 0
end == U32_MAX
INCR == 1

An expression can be a range:

Code Block
[<start> - <end>, <expr>]
represented as:
start == START_NUM
end == END_NUM
INCR == 1

An expression can be a range with an increment:

Code Block
[<num-start> - <num-end>/<incr>, <expr>]
represented as:
start == START_NUM
end == END_NUM
INCR == INCREMENT VALUE

...

For apparent hard failures, it is worth noting that PING/PUSH information contains the status of each interface. This is a mechanism by which the presence of and recovery from hard failures can be communicated. Rather than have a peer actively push such information, it is likely better to have nodes pull it when they need it. Such a pull (done by pinging the peer, of course) can be done occasionally as long as other, healthy, peer NIs are available.

Selection Criteria

The selection criteria for the local NI / peer NI pair, listed in order of priority:

Local:

  • Health score
  • UDSP priority
  • NUMA score
  • Credit score
  • Round-robin

Peer:

  • Health score
  • UDSP priority
  • UDSP preference list
  • Credit score
  • Round-robin

Selection Algorithm Pseudo-code

...

[Gliffy diagram: HLD Multi-Rail-Multi-Rail Dynamic Discovery Sequence]

Figure 5: Dynamic Discovery Overview

...

[Gliffy diagram: DD Overview]

Figure 6: Dynamic Discovery Detailed Sequence Diagram

...

  1. Note that the passive node sends a ping reply to the active node before the active node knows whether it should do a ping push.
    1. At this point the passive node does not know whether the active node is uprev or downrev.
    2. As noted above, the LNet reply always goes to the originating NID, so the passive node has enough information to be able to send it.
  2. The active node can be doing discovery on multiple NIDs of the passive node at the same time.
    1. Active node:
      1. The active node has to create peer/peer_net/peer_ni (at the very least peer_ni) data structures to be able to send a ping message.
      2. The active node now has multiple peer/peer_net/peer_ni structures for the same peer.
      3. On receipt of the ping reply the active node merges these structures.
      4. Having merged these structures, the active node sends a ping push message.
      5. The active node should be smart enough to not send multiple ping push messages in this case.
      6. The serialization we obtain by having a single dynamic discovery thread helps here.
    2. Passive node:
      1. The passive node has to create peer/peer_net/peer_ni data structures to be able to send a ping reply.
      2. At the point where the passive node does this, it doesn't know whether the active node is uprev or downrev.
      3. If downrev, the passive node will not receive further information from the active node.
      4. Therefore the data structures set up must be complete for a downrev active node.
      5. An uprev active node may have multiple pings in flight to different NIDs, prompting creation of multiple peer structures.
      6. On receipt of the ping push message, these structures must be merged.
      7. Further pushes serve to update these structures.
  3. Dynamic discovery and DLC configuration can update the same peer at the same time.
    1. Serialize updates through a peer mutex, and protect lookups with per-CPT spinlocks.
    2. A lookup needs just the per-CPT spinlock.
    3. An update must hold both the mutex and all per-CPT spinlocks (LNET_LOCK_EX). This is needed because a single per-CPT lock protects lookups of a peer NID, but also traversal of the peer_ni list in the peer_net and the peer_net_list in the peer. So all per-CPT locks must be held if the peer_ni_list or peer_net_list is to be changed.
  4. Can DLC modify discovered peers?
    1. Presumably yes.
    2. Troublesome case is deleting a peer NI that we're just using in discovery.
    3. This is not different from the normal case of trying to delete a peer NI that is currently in use.
    4. The peer NI must be idled first, which implies that the discovery round on that peer NI must be allowed to finish.
    5. Discovery can push a NI list that does not include the NI going idle, even though it uses that NI.
    6. This is similar to the normal case where DLC removes an active NI.
    7. While waiting for a NI to go idle, the peer mutex must be released, to avoid dynamic discovery deadlocking with DLC.
    8. We probably do not want yet another DLC request to come in and try to re-add the peer NI before all the above has finished.
    9. So the peer mutex mentioned above is not the ln_api_mutex that the ioctls serialize on.
    10. The api mutex must be held by the thread doing the ioctl across the entire operation, to avoid this configuration race.
    11. When both are held, the api mutex must be locked before the peer mutex.
  5. Can discovery modify DLC configured peers?
    1. Presumably yes.
    2. When DLC adds a peer NI, it can hold the peer mutex across the entire operation.
    3. When DLC removes a peer NI, it ensures the peer NI is idle first.
    4. Discovery always sees a coherent peer data structure to work on.
  6. The active node has discovery enabled, the passive node has discovery disabled.
    1. The passive may not have a configuration for the peer. In a cluster with only a few multi-rail nodes, it is plausible to just not explicitly configure the non-multi-rail peers.
    2. There are three approaches:
      1. the passive node just drops any push on the floor. In this case the dynamic discovery thread need not be running.
      2. the passive node verifies its configuration using the push message received. In this case the dynamic discovery thread needs to be running.
        1. a push containing more than one interface merits a complaint
        2. a push containing a single interface is accepted without complaint
      3. the passive node updates its configuration using the push.
  7. The active node has discovery disabled, the passive node has discovery enabled.
    1. The passive node is prompted to create the peer/peer_net/peer_ni data structures as usual.
    2. If the active node wasn't DLC configured on the passive node, then the passive node will not detect that the active node is uprev. The relevant ping traffic never happens.
    3. A multi-rail node on which discovery is disabled must be added to the DLC configuration of all its relevant peers.
  8. Active side enables dynamic discovery
    1. While dynamic discovery is disabled all peers added via DLC are moved to the ACTIVE state, and no dynamic discovery is performed
    2. When dynamic discovery is enabled peers which are in the ACTIVE state are not dynamically discovered.
      1. The other option is to have it retroactive and go through all the peers and determine if they have been dynamically discovered and if not then initiate dynamic discovery.
        1. This is likely to cause a spike in traffic
        2. In large systems this could cause a heavy load on the nodes since there could be potentially thousands of peers.
    3. Further communication with peers in ACTIVE state does not trigger dynamic discovery
    4. New peers added via DLC are moved to the WAITING-VERIFICATION state and on first message to these peers dynamic discovery is triggered.
    5. If dynamic discovery is disabled, any messages sent to peers in the WAITING-VERIFICATION state will cause the peers to move directly to the ACTIVE state with no discovery round triggered.
    6. Messages sent to peers that do not exist yet in the system trigger dynamic discovery if dynamic discovery is enabled.
  9. Network Race: Ping pushes from a peer arrive flipped (out of order). This can happen in both directions. In this case the node receiving the pushes will not be able to distinguish their order and could end up with outdated information.
    1. To resolve this situation a sequence number can be added to the ping push, allowing the receiving side to determine the order.
    2. This requires the receiving side to maintain the sequence number of the last received push.
    3. If the received push has a sequence number greater than the one currently held for that peer, then update; otherwise ignore the push, since it has outdated information.
    4. [Gliffy diagram: NetworkRace01]

       Figure 7: Push/Push Race Condition
  10. In the case when two peers simultaneously attempt to discover each other, each peer will create the corresponding peer/peer_net/peer_ni structures as it would normally do, and will transition its states according to the FSM. This scenario should be handled through the normal code path.
    1. [Gliffy diagram: NetworkRace02]

       Figure 8: Simultaneous Discovery Scenario
    2. The following variations could occur
      1. The node can receive a ping and create the corresponding peer structures before it starts peer discovery.
        1. In this case when the message is attempted to be sent to that peer, the structures are found and the peer is going to be in ACTIVE state, and no discovery round will be triggered.
      2. The node can receive a ping after it has started its ping discovery round
        1. In this case the peer structure will be found in the DISCOVERING state. The ping response will be sent back (possibly before the peer state is examined, but it's not important)
      3. The node can receive a ping response before it sends its own ping response. This is the standard case. The ping discovery protocol would be completed at this point.
      4. The node can receive a ping push before it has sent its own ping push. This would result in it updating its own structures. This is again handled in the normal case.
  11. In the case when one node attempts to discover the same peer on multiple NIDs. Multiple peer/peer_net/peer_ni structures will be created for each one of the NIDs, since at this point the node doesn't know that it's the same peer. On ping response the node will send a ping push and transition the corresponding peer state to ACTIVE (note the order). When the second ping response on NID2 is received the information in the ping response is used to locate peer2a and the structures are merged. Since the other peer found is already in ACTIVE state, then there is no need to send another ping push. On the passive side, a similar process occurs. If the ping is sent from two different sources, then two peer/peer_net/peer_ni structures are created, and then merged when the push is received, which serves to link both structures. If the ping is sent from the same src NID, then the peer is created on the first ping and found on the second ping. No merge is required.
    1. [Gliffy diagram: NetworkRace03]

       Figure 9: One-sided Discovery on multiple peer NIDs
  12. It is possible to have simultaneous discovery on multiple NIDs. This is a combination of scenarios 10 and 11. The handling of both scenarios applies here.

...

[Gliffy diagram: HLD LocalNI FSM]

Figure 10: Local NI FSM
  • INIT: pre state. Only transitory.
  • CREATED: The NI has been added successfully
  • FAILED: LND notification that the physical interface is down. hlt-005, hlt-010,
  • DEGRADED: LND notification that the physical interface is degraded. IB and ETH will probably not send such a notification. hlt-015, hlt-020, hlt-025
  • DELETING: a config delete is received. All pending messages must be depleted.

...

[Gliffy diagram: HLD local Net FSM]

Figure 11: Local Net FSM
  • INIT: pre state. Only transitory.
  • ACTIVE: First NI has been added successfully
  • IN-ACTIVE: All NIs are down via LND notifications.
  • DELETING - Config request to delete local net

...

[Gliffy diagram: Simplified Peer FSM]

Figure 12: Peer FSM

This is a simplified FSM for a peer that illustrates how the various states relate to each other. The letters refer to the Discovery Algorithm.

...

[Gliffy diagram: HLD Peer NI FSM]

Figure 13: Peer NI FSM

When a peer_ni is initially added to the peer, it will not be in the CREATED state, which means there is no active connection with that peer_ni.

...

Name / Status

Signed Off
Signed Off
Signed Off
Robert Read (optional)
SGI PAC: Signed Off


Appendix

...

[Gliffy diagram: HLD Peer FSM]

[Gliffy diagram: Peer FSM v3]

TODO: I think we still need a VERIFY state and an IN-ACTIVE state as shown below. (I combined VERIFY into ACTIVE above – OW)

...