...
Figure 1: System Level Diagram
...
Figure 2: LNet Data Structure Diagram
...
Figure 3: LNet Threading Model
...
    192.168.0.[1-10/2, 13, 14]@nettype # Refer to Lustre Manual for more examples
or
    eth[1,2,3], eth[1-4/2]
...
Expression Structural Form | Description |
---|---|
Figure 4: syntax descriptor | An expression can be a number, a wildcard, a range, or a range with an increment. |
...
For apparent hard failures it is worth noting that PING/PUSH information contains the status of each interface. This is a mechanism by which presence of and recovery from hard failures can be communicated. Rather than have a peer actively push such information, it is likely better to have nodes pull it when they need it. Such a pull (done by pinging the peer, of course) can be done occasionally as long as other, healthy, peer NIs are available.
Selection Criteria
The selection criteria for the local NI, peer NI pair listed in order of priority:
Local:
- Health score
- UDSP priority
- NUMA score
- Credit score
- Round-robin
Peer:
- Health score
- UDSP priority
- UDSP preference list
- Credit score
- Round-robin
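The priority ordering above amounts to a lexicographic comparison. The following is a minimal Python sketch for the local NI side only. It assumes (none of this is stated above) that a higher health score is better, that a lower UDSP priority value means a higher priority, that a lower NUMA distance is better, that more available credits are better, and that round-robin can be approximated by preferring the NI selected the fewest times. All field and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LocalNI:
    name: str
    health: int         # assumed: higher is healthier
    udsp_priority: int  # assumed: lower value means higher priority
    numa_distance: int  # assumed: lower means closer to the caller
    credits: int        # assumed: more available credits is better
    seq: int = 0        # bumped on every selection, used as the round-robin tie-break


def select_local_ni(nis):
    """Pick the best NI by comparing the criteria in priority order."""
    best = min(
        nis,
        key=lambda ni: (-ni.health, ni.udsp_priority, ni.numa_distance,
                        -ni.credits, ni.seq),
    )
    best.seq += 1  # the next tie goes to a different, equally good NI
    return best
```

With two otherwise identical NIs, repeated calls alternate between them, and a less healthy NI is never chosen while a healthier one exists.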
Selection Algorithm Pseudo-code
...
Figure 5: Dynamic Discovery Overview
...
Figure 6: Dynamic Discovery Detailed Sequence Diagram
...
- Note that the passive node sends a ping reply to the active node before the active node knows whether it should do a ping push.
- At this point the passive node does not know whether the active node is uprev or downrev.
- As noted above, the LNet reply always goes to the originating NID, so the passive node has enough information to be able to send it.
- The active node can be doing discovery on multiple NIDs of the passive node at the same time.
- Active node:
  - The active node has to create peer/peer_net/peer_ni (at the very least peer_ni) datastructures to be able to send a ping message.
  - The active node now has multiple peer/peer_net/peer_ni structures for the same peer.
  - On receipt of the ping reply the active node merges these structures.
  - Having merged these structures, the active node sends a ping push message.
  - The active node should be smart enough to not send multiple ping push messages in this case.
  - The serialization we obtain by having a single dynamic discovery thread helps here.
- Passive node:
  - The passive node has to create peer/peer_net/peer_ni datastructures to be able to send a ping reply.
  - At the point where the passive node does this, it doesn't know whether the active node is uprev or downrev.
  - If downrev, the passive node will not receive further information from the active node.
  - Therefore the datastructures set up must be complete for a downrev active node.
  - An uprev active node may have multiple pings in flight to different NIDs, prompting creation of multiple peer structures.
  - On receipt of the ping push message, these structures must be merged.
  - Further pushes serve to update these structures.
- Dynamic discovery and DLC configuration can update the same peer at the same time.
- Serialize updates through a peer mutex, and protect lookups with per-CPT spinlocks.
- A lookup needs just the per-CPT spinlock.
- An update must hold both the mutex and all per-CPT spinlocks (LNET_LOCK_EX). It needs this because a single per-CPT lock protects lookups of a peer NID, but also traversal of the peer_ni list in the peer_net, and the peer_net_list in the peer. So all per-CPT locks need to be held if the peer_ni_list or peer_net_list is to be changed.
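The locking rule above can be illustrated with a small Python sketch. It is not LNet code: `PeerTable`, `NUM_CPTS` and the hash-based CPT mapping are assumptions made for illustration, and Python locks stand in for the kernel mutex and spinlocks. Taking every per-CPT lock models what the text calls LNET_LOCK_EX.

```python
import threading

NUM_CPTS = 4  # assumed number of CPU partitions


class PeerTable:
    def __init__(self):
        self.peer_mutex = threading.Lock()  # serializes updates
        self.cpt_locks = [threading.Lock() for _ in range(NUM_CPTS)]
        self.peer_nis = {}  # NID -> owning peer id

    def _cpt_of(self, nid):
        # Hypothetical mapping of a NID to a CPT.
        return hash(nid) % NUM_CPTS

    def lookup(self, nid):
        # A lookup needs just one per-CPT lock.
        with self.cpt_locks[self._cpt_of(nid)]:
            return self.peer_nis.get(nid)

    def update(self, nid, peer_id):
        # An update holds the mutex and all per-CPT locks, so no lookup
        # on any CPT can observe a half-modified list.
        with self.peer_mutex:
            for lock in self.cpt_locks:  # always taken in the same order
                lock.acquire()
            try:
                self.peer_nis[nid] = peer_id
            finally:
                for lock in reversed(self.cpt_locks):
                    lock.release()
```

Lookups on different CPTs do not contend with each other; only updates pay the cost of taking every lock.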
- Can DLC modify discovered peers?
- Presumably yes.
- The troublesome case is deleting a peer NI that is currently being used in discovery.
- This is not different from the normal case of trying to delete a peer NI that is currently in use.
- The peer NI must be idled first, which implies that the discovery round on that peer NI must be allowed to finish.
- Discovery can push an NI list that does not include the NI going idle, even though it uses that NI.
- This is similar to the normal case where DLC removes an active NI.
- While waiting for an NI to go idle, the peer mutex must be released, to avoid dynamic discovery deadlocking with DLC.
- We probably do not want yet another DLC request to come in and try to re-add the peer NI before all the above has finished.
- So the peer mutex mentioned above is not the ln_api_mutex that the ioctls serialize on.
- The api mutex must be held by the thread doing the ioctl across the entire operation, to avoid this configuration race.
- When both are held, the api mutex must be locked before the peer mutex.
- Can discovery modify DLC configured peers?
- Presumably yes.
- When DLC adds a peer NI, it can hold the peer mutex across the entire operation.
- When DLC removes a peer NI, it ensures the NI is idle first.
- Discovery always sees a coherent peer datastructure to work on.
- The active node has discovery enabled, the passive node has discovery disabled.
- The passive node may not have a configuration for the peer. In a cluster with only a few multi-rail nodes, it is plausible to just not explicitly configure the non-multi-rail peers.
- There are three approaches:
- the passive node just drops any push on the floor. In this case the dynamic discovery thread need not be running.
- the passive node verifies its configuration using the push message received. In this case the dynamic discovery thread needs to be running.
- a push containing more than one interface merits a complaint
- a push containing a single interface is accepted without complaint
- the passive node updates its configuration using the push.
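The three approaches can be compared side by side in a short Python sketch. The names `PushPolicy` and `handle_push` are hypothetical, and the complaint is modelled as a returned string rather than a log message.

```python
from enum import Enum


class PushPolicy(Enum):
    DROP = "drop"      # ignore the push entirely
    VERIFY = "verify"  # check the push against the local configuration
    UPDATE = "update"  # take the pushed interface list as the new configuration


def handle_push(policy, configured_nids, pushed_nids):
    """Return (resulting configuration, list of complaints)."""
    complaints = []
    if policy is PushPolicy.DROP:
        return list(configured_nids), complaints
    if policy is PushPolicy.VERIFY:
        # A single interface is accepted silently; more than one merits a complaint.
        if len(pushed_nids) > 1:
            complaints.append(
                f"peer pushed {len(pushed_nids)} interfaces while discovery is disabled"
            )
        return list(configured_nids), complaints
    return list(pushed_nids), complaints
```

Only the VERIFY and UPDATE policies look at the push contents, which is why the text notes that the discovery thread must be running for the second approach.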
- The active node has discovery disabled, the passive node has discovery enabled.
- The passive node is prompted to create the peer/peer_net/peer_ni datastructures as usual.
- If the active node wasn't DLC configured on the passive node, then the passive node will not detect that the active node is uprev. The relevant ping traffic never happens.
- A multi-rail node on which discovery is disabled must be added to the DLC configuration of all its relevant peers.
- Active side enables dynamic discovery
- While dynamic discovery is disabled all peers added via DLC are moved to the ACTIVE state, and no dynamic discovery is performed
- When dynamic discovery is enabled peers which are in the ACTIVE state are not dynamically discovered.
- The other option is to make it retroactive: go through all the peers, determine whether they have been dynamically discovered, and if not, initiate dynamic discovery.
- This is likely to cause a spike in traffic.
- In large systems this could cause a heavy load on the nodes, since there could be thousands of peers.
- Further communication with peers in ACTIVE state does not trigger dynamic discovery
- New peers added via DLC are moved to the WAITING_VERIFICATION state, and dynamic discovery is triggered on the first message to these peers.
- If dynamic discovery is disabled, any message sent to a peer in the WAITING_VERIFICATION state causes the peer to move directly to the ACTIVE state with no discovery round triggered.
- Messages sent to peers that do not exist yet in the system trigger dynamic discovery if dynamic discovery is enabled.
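The state handling described in this scenario can be sketched in Python as follows. `Node`, `dlc_add_peer` and `send` are hypothetical names. The behaviour for a message to an unknown peer while discovery is disabled is not covered above; the sketch assumes such a peer goes straight to ACTIVE.

```python
from enum import Enum


class PeerState(Enum):
    WAITING_VERIFICATION = 1
    DISCOVERING = 2
    ACTIVE = 3


class Node:
    def __init__(self, discovery_enabled):
        self.discovery_enabled = discovery_enabled
        self.peers = {}  # NID -> PeerState

    def dlc_add_peer(self, nid):
        # With discovery disabled, DLC-added peers go straight to ACTIVE.
        self.peers[nid] = (PeerState.WAITING_VERIFICATION
                           if self.discovery_enabled else PeerState.ACTIVE)

    def send(self, nid):
        """Return True if sending to nid triggers a discovery round."""
        state = self.peers.get(nid)
        if state in (PeerState.ACTIVE, PeerState.DISCOVERING):
            return False  # ACTIVE peers are never rediscovered
        if self.discovery_enabled:
            # Covers both WAITING_VERIFICATION peers and unknown peers.
            self.peers[nid] = PeerState.DISCOVERING
            return True
        # Discovery disabled: move directly to ACTIVE (assumed for unknown peers).
        self.peers[nid] = PeerState.ACTIVE
        return False
```

A peer added while discovery was disabled stays ACTIVE after discovery is enabled, which is the non-retroactive option described above.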
- Network Race: Ping pushes from a peer arrive in the wrong order. This can happen in both directions. In this case the peer receiving the pushes will not be able to distinguish their order and could end up with outdated information.
- To resolve this situation a sequence number can be added to the ping push, allowing the receiving side to determine the order.
- This requires the receiving side to maintain the sequence number of the last push it received.
- If the push received has a sequence number which is greater than what the receiver currently has for that peer, then update; otherwise ignore the push, since it has outdated information.
Figure 7: Push/Push Race Condition
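The sequence number check for reordered pushes can be sketched in Python as follows. `PushReceiver` and its fields are hypothetical, and the sketch assumes sequence numbers start at zero or above and never wrap.

```python
class PushReceiver:
    def __init__(self):
        self.last_seq = {}  # peer id -> sequence number of the last accepted push
        self.ni_lists = {}  # peer id -> most recent interface list

    def on_push(self, peer, seq, nids):
        """Apply the push if it is newer than what we have; return True if applied."""
        if seq <= self.last_seq.get(peer, -1):
            return False  # stale or duplicate push: ignore it
        self.last_seq[peer] = seq
        self.ni_lists[peer] = list(nids)
        return True
```

If push 2 overtakes push 1 in the network, push 1 is dropped on arrival and the receiver keeps the newer interface list.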
- In the case when two peers simultaneously attempt to discover each other, each peer will create the corresponding peer/peer_net/peer_ni structures as it would normally do, and will transition its states according to the FSM. This scenario should be handled through the normal code path.
Figure 8: Simultaneous Discovery Scenario
- The following variations could occur:
- The node can receive a ping and create the corresponding peer structures before it starts peer discovery.
- In this case, when a message is sent to that peer, the structures are found, the peer is in ACTIVE state, and no discovery round is triggered.
- The node can receive a ping after it has started its ping discovery round.
- In this case the peer structure will be found in the DISCOVERING state. The ping response will be sent back (possibly before the peer state is examined, but this is not important).
- The node can receive a ping response before it sends its own ping response. This is the standard case. The ping discovery protocol would be completed at this point.
- The node can receive a ping push before it has sent its own ping push. This would result in it updating its own structures. This is again handled in the normal case.
- One node attempts to discover the same peer on multiple NIDs. A separate peer/peer_net/peer_ni structure will be created for each one of the NIDs, since at this point the node doesn't know that it's the same peer. On the first ping response the node will send a ping push and transition the corresponding peer state to ACTIVE (note the order). When the second ping response on NID2 is received, the information in the ping response is used to locate peer2a and the structures are merged. Since the other peer found is already in ACTIVE state, there is no need to send another ping push. On the passive side, a similar process occurs. If the ping is sent from two different sources, then two peer/peer_net/peer_ni structures are created, and then merged when the push is received, which serves to link both structures. If the ping is sent from the same src NID, then the peer is created on the first ping and found on the second ping. No merge is required.
Figure 9: One-sided Discovery on multiple peer NIDs
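The active-side merge in this scenario can be sketched in Python as follows. `Peer`, `Discoverer` and their methods are hypothetical names, and each ping reply is assumed to carry the full NID list of the responding peer. The sketch only reports whether a push is needed. Sending the push before the peer is marked ACTIVE, as the text requires, is left to the caller.

```python
class Peer:
    def __init__(self, nid):
        self.nids = {nid}
        self.active = False


class Discoverer:
    def __init__(self):
        self.by_nid = {}  # NID -> Peer structure

    def start_discovery(self, nid):
        # Each NID gets its own structure; we do not yet know they share a peer.
        if nid not in self.by_nid:
            self.by_nid[nid] = Peer(nid)

    def on_ping_reply(self, src_nid, reported_nids):
        """Handle a reply received for src_nid; return True if a push is needed."""
        mine = self.by_nid[src_nid]
        if mine.active:
            return False
        for nid in reported_nids:
            other = self.by_nid.get(nid)
            if other is not None and other is not mine and other.active:
                # Another structure for the same peer is already ACTIVE:
                # merge into it and skip the second push.
                other.nids |= mine.nids
                for n in mine.nids:
                    self.by_nid[n] = other
                return False
        mine.active = True
        return True
```

Only the first reply results in a push; the second reply finds the ACTIVE structure and merges into it.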
- It is possible to have simultaneous discovery on multiple NIDs. This is a combination of scenarios 10 and 11. The handling of both scenarios applies here.
...
Figure 10: Local NI FSM
- INIT: pre state. Only transitory.
- CREATED: The NI has been added successfully
- FAILED: LND notification that the physical interface is down. hlt-005, hlt-010
- DEGRADED: LND notification that the physical interface is degraded. IB and ETH will probably not send such a notification. hlt-015, hlt-020, hlt-025
- DELETING: a config delete is received. All pending messages must be depleted.
...
Figure 11: Local Net FSM
- INIT: pre state. Only transitory.
- ACTIVE: First NI has been added successfully
- IN-ACTIVE: All NIs are down via LND notifications.
- DELETING: Config request to delete the local net.
...
Figure 12: Peer FSM
This is a simplified FSM for a peer that illustrates how the various states relate to each other. The letters refer to the Discovery Algorithm.
...
Figure 13: Peer NI FSM
When a peer_ni is initially added to the peer, it will not be in CREATED state, which means there is no active connection with that peer_ni.
...
Name | Status |
---|---|
Signed Off | |
Signed Off | |
Signed Off | |
Robert Read (optional) | |
SGI PAC | Signed Off |
Appendix
...
TODO: I think we still need a VERIFY state and an IN-ACTIVE state as shown below. (I combined VERIFY into ACTIVE above – OW)
...