LNet is a virtual networking layer which allows Lustre nodes to communicate with each other.
Overview
Lustre is an open-source, high-performance parallel distributed file system designed for large-scale cluster computing. It is widely used in high-performance computing (HPC) environments, especially supercomputing clusters and other high-performance storage systems. The Lustre Networking Module (LNet) is the high-performance networking layer used in Lustre. It serves as the communication backbone for Lustre, enabling efficient and scalable data transfers between the components of a Lustre file system deployed on a cluster. The key features and aspects of LNet:
System Level Overview
Users interact with LNet through the lnetctl utility. Behind the scenes, the DLC library implements the configuration API that lnetctl relies on. Within the LNet architecture, the kernel module exposes a set of configuration APIs for LNet, allowing fine-tuning of its behavior. It also provides a set of send and receive APIs for the layers above it, such as Lustre. In the modular design of LNet, the core logic for message handling resides in the LNet module, while Lustre Network Drivers (LNDs) serve as hardware-specific interfaces. LNDs referenced in this document include the socklnd and the o2iblnd.
Networks, Peers and NIDs
At its core, LNet virtualizes the underlying network. To do that effectively, it introduces three concepts.
Network Interface Descriptors (NIDs)
In a Lustre network, each node is identified by a unique identifier known as the Network Interface Descriptor (NID). The NID consists of the network address, followed by the network type and number, structured as <network address>@<network type><network number>.
Network Address
The form of the network address depends on the Lustre Network Driver (LND) in use. For instance, it is an IPv4 or IPv6 address for the socklnd, and an IPv4 address for the o2iblnd. An illustrative address for an o2iblnd interface could be:
Network Type
LNet uses the network type to identify the kind of virtual network to which a node belongs. Each network type is associated with a specific LND. For instance, the tcp network type is associated with the socklnd, and the o2ib network type with the o2iblnd.
Network Interface
Each node in a Lustre network needs at least one configured LNet Network Interface (NI). Each NI is assigned a distinct NID on the network, and LNet maintains internal data structures for every configured NI. These structures hold information such as the state and health of the NI.
Peers
When a node first sends an LNet message to another node, it creates a Peer structure to track information about that peer.
This includes advertised NIDs, the health status of those NIDs, routing details, and other data needed for efficient communication between nodes.
Use Cases
Use cases are used below to illustrate the LNet functionality available at the time of this writing.
Directly connected
The simplest case is when two nodes are each configured with a single network interface on the same network. This use case exercises the basic LNet send and receive functionality.
Routed single-hop
In a routed single-hop configuration, nodes may reside on different network types. As shown in the diagram, Node A has an interface configured on the o2ib network, while Node B has an interface configured on the tcp network. For these nodes to exchange messages, an LNet router node with interfaces on both the o2ib and tcp networks must be set up. Routes are configured on both Node A and Node B, following the format:
Example routes
On Node A, a route is added to instruct LNet to route messages intended for the tcp network to the designated LNet router. Conversely, a reverse route on Node B is configured to direct LNet on Node B to forward messages destined for the o2ib network to the same LNet router.
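The two routes just described can be modeled in a few lines of Python. This is a minimal sketch, not LNet code or lnetctl syntax: the Route class, the nid_network and next_hop helpers, and all addresses are hypothetical, and the text after the @ is treated as an opaque network name. A node sends directly when it has an interface on the destination network, and otherwise hands the message to the gateway named in a matching route.

```python
from dataclasses import dataclass


def nid_network(nid: str) -> str:
    """Return the network part of a NID, e.g. '10.0.0.2@tcp' -> 'tcp'."""
    address, sep, network = nid.partition("@")
    if not sep or not address or not network:
        raise ValueError(f"not a NID: {nid!r}")
    return network


@dataclass(frozen=True)
class Route:
    remote_net: str  # network this route leads to
    gateway: str     # NID of the LNet router, reachable from this node


def next_hop(local_nids: list[str], routes: list[Route], dest_nid: str) -> str:
    """Return the NID to send to: the destination itself or a gateway."""
    dest_net = nid_network(dest_nid)
    if any(nid_network(n) == dest_net for n in local_nids):
        return dest_nid           # destination network is directly attached
    for route in routes:
        if route.remote_net == dest_net:
            return route.gateway  # hand the message to the router
    raise LookupError(f"no route to network {dest_net}")


# Hypothetical addresses for Node A, Node B and the router's two interfaces.
NODE_A = ["192.168.1.10@o2ib"]
NODE_B = ["10.0.0.20@tcp"]
ROUTES_A = [Route("tcp", "192.168.1.1@o2ib")]  # route on Node A
ROUTES_B = [Route("o2ib", "10.0.0.1@tcp")]     # reverse route on Node B

print(next_hop(NODE_A, ROUTES_A, NODE_B[0]))   # gateway on the o2ib side
print(next_hop(NODE_B, ROUTES_B, NODE_A[0]))   # gateway on the tcp side
```

Each node only needs to know the router interface on its own network; the router's interface on the far side never appears in that node's route.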
When Lustre on Node A initiates a message destined for Node B, LNet extracts the network type from Node B's NID and finds that Node A has no Network Interface (NI) configured on the tcp network. It therefore consults the list of configured routes and forwards the message to the router address specified in the matching route.
Upon receiving a message, the LNet router examines the header to determine the final destination NID. If the final destination NID corresponds to an NI local to the router node, the router processes the message locally. However, LNet routers do not accept Lustre messages by default; they handle only a limited set of message types, such as LNet ping and health status messages. If the final destination NID is not local to the router but is on the same network as one of the router's local NIs, the router forwards the message over that network. In our example, if Node A sends a message destined for Node B's tcp NID, the router recognizes that it has its own tcp NI and forwards the message over the tcp network.
Each node maintains the status of its configured routes. Nodes periodically ping the routers named in those routes. If the router is healthy, the route is considered up; otherwise it is considered down. Only routes that are up are used for forwarding messages. This routing mechanism allows messages to be forwarded between different networks.
Routed single-hop with multiple LNet Routers
A variation of this use case involves multiple LNet routers, each adding its own set of routes. In such cases, Nodes A and B can be configured with multiple routes, each pointing to a different router. Each route can be assigned a priority. Routes of the same priority are used in round-robin fashion; if routes have different priorities, the highest priority route is selected.
Routed multi-hop
This scenario mirrors the previous one, except that multiple router nodes sit between Nodes A and B.
From the standpoint of Nodes A and B, the processing logic is essentially unchanged. Their configured routes point to the nearest router reachable from each node. Additionally, routes are configured on the LNet routers themselves so that they can forward messages to networks beyond their immediate reach. The routing configuration for the above example will look like:
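As a rough illustration of hop-by-hop forwarding, the Python model below walks a message through two routers. It is a sketch under stated assumptions, not lnetctl syntax and not the configuration from the original example: the topology (A on o2ib, R1 on o2ib and tcp1, R2 on tcp1 and tcp, B on tcp), the node names, the NODES table and every address are made up.

```python
def net_of(nid: str) -> str:
    """Network part of a NID: '10.1.0.2@tcp1' -> 'tcp1'."""
    return nid.split("@", 1)[1]


# Hypothetical topology: A -(o2ib)- R1 -(tcp1)- R2 -(tcp)- B
# "routes" maps a remote network to the gateway NID used to reach it.
NODES = {
    "A":  {"nids": ["192.168.1.10@o2ib"],
           "routes": {"tcp": "192.168.1.1@o2ib"}},    # via R1
    "R1": {"nids": ["192.168.1.1@o2ib", "10.1.0.1@tcp1"],
           "routes": {"tcp": "10.1.0.2@tcp1"}},       # via R2
    "R2": {"nids": ["10.1.0.2@tcp1", "10.0.0.1@tcp"],
           "routes": {"o2ib": "10.1.0.1@tcp1"}},      # via R1
    "B":  {"nids": ["10.0.0.20@tcp"],
           "routes": {"o2ib": "10.0.0.1@tcp"}},       # via R2
}

# Reverse index: which node owns a given NID.
OWNER = {nid: name for name, node in NODES.items() for nid in node["nids"]}


def path(src: str, dest_nid: str, max_hops: int = 8) -> list[str]:
    """Follow routes hop by hop and return the node names visited."""
    visited = [src]
    current = src
    for _ in range(max_hops):
        node = NODES[current]
        if dest_nid in node["nids"]:
            return visited                     # message has arrived
        dest_net = net_of(dest_nid)
        if any(net_of(n) == dest_net for n in node["nids"]):
            target = dest_nid                  # destination network attached
        else:
            target = node["routes"][dest_net]  # forward to the next router
        current = OWNER[target]
        visited.append(current)
    raise RuntimeError("hop limit exceeded")


print(path("A", "10.0.0.20@tcp"))
print(path("B", "192.168.1.10@o2ib"))
```

Nodes A and B each hold a single route to their nearest router, while R1 and R2 carry the extra routes that reach the network on the far side.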
Multi-Rail homogeneous interfaces
LNet can configure multiple Network Interfaces (NIs) on the same network and use them concurrently, which increases the node's overall bandwidth. The Multi-Rail feature introduces the concept of a node's Primary NID, the NID that identifies the node for Lustre transmissions. A node may have additional NIDs, but Lustre does not need to know about them. LNet manages the association between a primary NID and the remaining NIDs belonging to the node.
LNet also implements a discovery protocol to learn all NIDs of a peer. Before the first communication with a peer, the discovery protocol pulls all of the peer's interface information, which is stored in the peer data structures maintained by LNet. Instead of being discovered dynamically, a peer's NIDs can be configured statically by adding the peer and all its associated NIDs with the lnetctl utility.
As shown in the example above, Nodes A and B have 3 NIs each, configured on the o2ib network. When sending a message, LNet needs to select a local NI to send from and a remote peer NID to send to. LNet selects a local NI for each LNet message based on the following criteria:
LNet will select the peer interface to use as part of its Multi-Rail protocol. The peer NID selection logic is simpler, as it only has the following selection criteria:
LNet monitors the health of both its local NIs and its peers' NIs, and selects the healthiest interfaces. Interface health is represented as an integer value initialized to 1000; the specific value is arbitrary. When a transmission fails on an interface, its health value is decremented. The health value then recovers as messages are successfully transmitted over the interface. Certain events, such as a cable being unplugged, can put an interface into a fatal state. In this state the interface is considered unusable until a subsequent event signals that it is operational again. This health monitoring contributes to the overall resilience and reliability of LNet.
Multi-Rail heterogeneous interfaces
This use case is the same as the homogeneous case, except that LNet handles Multi-Rail across different network types. As shown in the above example, a node can have multiple interfaces, each configured on a different network. LNet will use all of these interfaces for message transmission, provided the remote node also has interfaces on the same networks.
Multi-Rail routing
The Multi-Rail feature also works for the routing cases described above. LNet routing supports routers with multiple interfaces. When configuring a route, only the primary NID of the router is used to identify the router. Upon first communication with the router, LNet uses its discovery protocol to pull all of the router's configured interfaces. It can then use the Multi-Rail functionality described above to communicate with the router.
User Defined Selection Policy (UDSP)
There are use cases where LNet is configured with multiple interfaces, possibly on different network types, and the user wants direct influence over the selection criteria.
As mentioned above, the selection criteria are built into the code and cannot be changed. However, with the UDSP feature the user can configure specific policies that tell LNet how to select an interface. For example, a user can configure a node with two networks, o2ib and tcp, and might want to use the tcp network only as a backup if the o2ib network is unavailable. Without UDSP, both the o2ib and tcp NIs will be used. With UDSP, the user can configure a policy that tells LNet to use tcp only if the o2ib network is not available. Several UDSP rule types can be configured. They are outlined below.
UDSP Rule Types
Network Rules
These rules define the relative priority of networks against each other. 0 is the highest priority. Networks with higher priority are preferred by the selection algorithm, unless the network has no healthy interfaces. If an interface on another network can be used and is healthier than any interface available on the current network, that interface is used. Health always trumps all other criteria.
NID Rules
These rules define the relative priority of individual NIDs. 0 is the highest priority. Once a network is selected, the NID with the highest priority is preferred. Note that NID priority ranks below health. For example, given two NIDs, NID-A and NID-B, where NID-A has the higher priority but the lower health value, NID-B will still be selected. In that sense the policies serve as hints that guide the selection algorithm.
NID Pair Rules
These rules define preferred paths. Once a local NI is selected, which is the first step in the selection algorithm, the peer NI that has the local NI on its preferred list is selected. The end result is an association between a local NI and a peer NI (or a group of them).
Router Rules
Router Rules define which set of routers to use when sending messages to one or more destination NIDs. They can also be used to identify preferred routers.
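The ordering implied by the Network and NID rules above (health first, then network priority, then NID priority, with 0 as the highest priority) can be sketched in Python. This is only an illustrative model of the tcp-as-backup example, not the LNet selection algorithm: the Interface class, the select function, the penalty of 100 per failure, the gain of 1 per success and the addresses are all assumptions. Only the initial health value of 1000 comes from the text.

```python
from dataclasses import dataclass

MAX_HEALTH = 1000  # initial health value given in the text


@dataclass
class Interface:
    nid: str
    net_priority: int = 0   # Network Rule priority, 0 is highest
    nid_priority: int = 0   # NID Rule priority, 0 is highest
    health: int = MAX_HEALTH
    fatal: bool = False     # e.g. cable unplugged

    def on_send_failure(self, penalty: int = 100) -> None:
        # The penalty size is an assumption made for this sketch.
        self.health = max(0, self.health - penalty)

    def on_send_success(self, gain: int = 1) -> None:
        # The recovery step is an assumption made for this sketch.
        self.health = min(MAX_HEALTH, self.health + gain)


def select(interfaces: list[Interface]) -> Interface:
    """Pick the healthiest usable interface; priorities only break ties."""
    usable = [i for i in interfaces if not i.fatal]
    if not usable:
        raise LookupError("no usable interface")
    return min(usable,
               key=lambda i: (-i.health, i.net_priority, i.nid_priority))


# Hypothetical policy: prefer o2ib (priority 0), keep tcp (priority 1) as backup.
ib = Interface("192.168.1.10@o2ib", net_priority=0)
tcp = Interface("10.0.0.10@tcp", net_priority=1)

print(select([ib, tcp]).nid)   # both fully healthy: the o2ib NI wins
ib.on_send_failure()
print(select([ib, tcp]).nid)   # o2ib is now less healthy: tcp is used
```

Because health is the first sort key, a priority can never force traffic onto a less healthy interface, which matches the description of the policies as hints.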
When defining a network, some paths may be better than others. To gain more control over the path that traffic takes, users configure interfaces on different networks and split the router pools among those networks. However, this results in a complex configuration that is hard to maintain and error prone. It is preferable to configure all interfaces on the same network and then define which routers to use when sending to a remote peer or from a source peer. Router Rules provide this functionality.
LNet Testing Tools
To assess the functionality and performance of LNet, two specialized tools have been developed, each addressing distinct testing requirements.
Conclusion
Lustre, as a high-performance parallel distributed file system, leverages the Lustre Networking Module (LNet) as its robust communication backbone. Designed with modularity in mind, LNet exhibits versatility in supporting various network transports, optimizing its performance for the rapid data transfers demanded by high-performance computing (HPC) environments. Scalable and reliable, LNet accommodates large-scale cluster configurations, featuring multiple network transports such as TCP/IP/IPv6, InfiniBand, and HPE's Slingshot interconnect.
System-level management is facilitated by the lnetctl utility. Operating through a command-line interface or YAML configuration files, lnetctl offers a comprehensive approach for configuring and extracting LNet settings. The LNet framework introduces concepts such as Network Interface Descriptors (NIDs), Network Addresses, Network Types, and Network Interfaces, all crucial for effective virtualization and communication within the Lustre network. Multi-Rail and routing mechanisms further enhance the coordination and exchange of messages between nodes.
Use cases, ranging from directly connected nodes to multi-hop configurations, showcase LNet's adaptability. The Multi-Rail feature enhances bandwidth utilization, supporting homogeneous and heterogeneous interfaces, routing scenarios, and even user-defined selection policies, allowing users to influence interface selection based on specific criteria. In summary, LNet stands as a pivotal component within Lustre, delivering efficient, scalable, and configurable networking capabilities essential for the demanding requirements of large-scale cluster computing and high-performance storage environments.
LNet Block Level Diagram
The diagram above represents the different functional blocks in LNet. A quick overview will help in understanding the code.