Video Transcript
Slide 1 - Introduction
Hello, my name is Amir Shehata. I'm the lead Lustre Networking Engineer at DDN Storage.
Today I'd like to talk to you about the Multi-Rail and Health features in the EXAScaler product [ Lustre Networking Layer ].
Slide 2 - Agenda
I'll start off by explaining what Multi-Rail is all about and what it offers us.
I will then cover a relevant example to show the simplification Multi-Rail brings to EXAScaler network deployment and configuration.
From there I'll highlight the benefits and give some ideas on best configuration practices.
So let's get started.
Slide 3 - What is Multi-Rail and Health
In this presentation, when I speak about Multi-Rail, I'm referring to two related features: Multi-Rail and Health.
First let's touch on Multi-Rail. The best way to explain it is to look at how the Lustre Networking Layer (LNet for short) used to work prior to the introduction of the Multi-Rail feature.
If a node had multiple interfaces, each interface had to be configured on a different LNet network, like o2ib1, o2ib2, etc.
Multi-Rail does a couple of important things. It allows us to group homogeneous interfaces in the same network. Instead of configuring the node's interfaces in different LNet networks, now we can configure them all in the same LNet network. This simplifies configuration tremendously.
It allows LNet to use all the interfaces in the same network. If we have two nodes, and each node has multiple interfaces, we can configure all the interfaces on the same network and LNet can use them in Active-Active mode. Basically, LNet will select the best interface from the group and use it.
If a node has heterogeneous interfaces, for example 2 IB interfaces and 2 OPA interfaces, we will by necessity group them into two different networks: IB interfaces in the o2ib network, for example, and OPA interfaces in the o2ib1 network. However, LNet can still use all these interfaces to communicate with peers which are on the same networks. This goes beyond standard bonding, which requires homogeneous interfaces.
So far we talked about using all the interfaces in Active-Active mode, but what about resiliency? With Health, we do not need to sacrifice performance for resiliency. The LNet Health feature allows us to keep track of the health of all configured networks and interfaces. When LNet selects the interface to use, it selects the interface with the best health. Therefore, if all the interfaces are healthy, we continue to maximize the bandwidth. Only when an interface fails will the bandwidth be reduced, because we will now avoid using the bad interface.
To explain this further, let's say we have two interfaces in the o2ib network, IB0 and IB1. If LNet fails to send on IB0, then it will decrement the health of that interface and retry on the IB1 interface. It will then keep monitoring the IB0 interface until it's sure it is healthy again. In the meantime, LNet will keep using IB1, the healthier interface.
Slide 4 - Without MR
Now let's look at an example. The DGX-2, NVIDIA's AI box, can have up to 8 different interfaces. Without MR each one of these interfaces will need to be configured in its own LNet network. And since the EXAScaler servers do not have as many interfaces, we will have to alias these interfaces and connect the aliases to the different networks. Complicated configuration for sure.
Slide 5 - With MR
Once we throw MR into the mix, the configuration becomes very simple. Only one LNet network, and all the interfaces of all the nodes, clients and servers, are connected to that network. Of course this is only one potential configuration.
Slide 6 - Dual Fabric
The key point I'm trying to make is now we can match the underlying fabric. If we're dealing with two fabrics, OPA vs IB as an example, or even if the underlying physical network has two segments, then Multi-Rail allows us to match the underlying fabric configuration. This makes configuration much more intuitive and less complicated.
Slide 7 - MR Benefits
Now that we have a basic understanding of what LNet Multi-Rail brings to the table, we can see how it simplifies the network configuration.
But that's not all. Since LNet is able to group interfaces in Active-Active mode, it can aggregate their bandwidth. Tests have shown an almost linear performance increase as interfaces are added. For example, if we have two EDR interfaces, each capable of up to 12 GB/s, pure LNet network testing has shown that we can get approximately 22 GB/s when both interfaces are used with Multi-Rail, almost line rate. Not too bad.
And as I mentioned before, even if we have two different LNet networks, each with one interface, we get the same performance scaling, because LNet can use both of them simultaneously to communicate with peers.
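As a quick check on the "almost linear" claim, the arithmetic behind those figures can be worked through as below. This is only a sketch using the two numbers quoted in the talk (12 GB/s per EDR interface, about 22 GB/s measured with two interfaces); the variable names are illustrative.

```python
# Figures quoted in the talk: two EDR interfaces, up to 12 GB/s each,
# and roughly 22 GB/s measured when both are used with Multi-Rail.
per_interface_gbs = 12.0
interfaces = 2
measured_gbs = 22.0

# Aggregate bandwidth if scaling were perfectly linear.
ideal_gbs = per_interface_gbs * interfaces

# Fraction of the ideal aggregate that was actually achieved.
efficiency = measured_gbs / ideal_gbs

print(f"ideal aggregate: {ideal_gbs:.0f} GB/s")
print(f"scaling efficiency: {efficiency:.1%}")
```

With these inputs the ideal aggregate is 24 GB/s, so the measured 22 GB/s is a little under 92% of perfect linear scaling.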
Slide 8 - Interface Selection
With large client machines like the DGX-2, it's not enough to select an interface to use at random. There are restrictions imposed on us by the HW.
For example in GPU Direct workloads, there is an affinity between GPUs and interfaces. If we're trying to RDMA to or from a specific GPU, LNet has to select the interface with the best affinity.
The same restriction applies to NUMA nodes. NUMA nodes have affinity to specific interfaces. Therefore LNet needs to be smart about which interface to select based on the NUMA node we're RDMAing from.
LNet uses a few criteria when it selects an interface.
First, it looks at the health of the interface, since if an interface is not healthy there is no sense in using it.
Second, it looks at the GPU priority: which interface is best given the GPU we're using for the RDMA operations?
Third, NUMA closeness: which interface is best given the NUMA node we're using for the RDMA operations?
Fourth, LNet uses a credit system to track how heavily a network interface is being used. LNet selects the least loaded interface.
Finally, if all other criteria are equal, we'll just select an interface in round robin.
With these criteria LNet can make intelligent decisions about which interface to use.
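The ordering of these criteria can be sketched as a lexicographic comparison. The following is a simplified illustration and not LNet's actual implementation (LNet is written in C and its internals are not described in this talk). The `Interface` fields, the use of "distance" values for GPU and NUMA affinity, and the `select_interface` helper are all assumptions made for the example.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Interface:
    name: str
    health: int          # higher is healthier
    gpu_distance: int    # assumed metric: lower means better GPU affinity
    numa_distance: int   # assumed metric: lower means closer NUMA node
    credits: int         # available credits: more credits means less loaded
    last_used: int = 0   # sequence number of the last selection (round robin)


_sequence = count(1)


def select_interface(interfaces):
    """Pick an interface by applying the criteria in the order given in the talk."""
    best = min(
        interfaces,
        key=lambda i: (
            -i.health,        # 1. healthiest first
            i.gpu_distance,   # 2. best GPU affinity
            i.numa_distance,  # 3. closest NUMA node
            -i.credits,       # 4. least loaded
            i.last_used,      # 5. round robin: least recently selected
        ),
    )
    best.last_used = next(_sequence)
    return best


nics = [
    Interface("ib0", health=1000, gpu_distance=1, numa_distance=1, credits=8),
    Interface("ib1", health=1000, gpu_distance=1, numa_distance=1, credits=8),
    Interface("ib2", health=900, gpu_distance=0, numa_distance=0, credits=8),
]
picks = [select_interface(nics).name for _ in range(4)]
print(picks)
```

In this example ib2 has the best GPU and NUMA affinity but is never chosen, because health is checked first. The two healthy interfaces tie on every other criterion, so the selection alternates between them. The sketch does not consume or return credits as messages are sent, which a real credit system would do.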
Slide 9 - Performance
Here is a real-life example of the benefits afforded to us by Multi-Rail. This graph shows a comparison performance test done between a GPU Direct workload and a CPU workload. The test was performed on a DGX-2 and 2 AI 400 servers. As you can see, LNet Multi-Rail benefits both workloads. However, due to the GPU's shorter latency, the GPU Direct workload is able to achieve higher performance, almost saturating the AI-400 servers' bandwidth.
Slide 10 - Configuring MR
Configuring Multi-Rail is dead simple. In fact no work is needed to configure it. It's on by default. All we need to do is group the interfaces in the same LNet network.
On the slide you can see a couple of ways to do that, either using the lnetctl utility or via the modprobe configuration file.
LNet takes care of the rest. It automatically discovers the peers' interfaces and applies the Multi-Rail selection algorithm discussed previously, on both the local and remote interfaces.
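The slide itself is not reproduced in this transcript, so as a rough illustration, the difference between the two styles of configuration can be sketched by rendering the value of the lnet `networks` module option for each case. The `render_networks` helper and the interface names ib0 and ib1 are assumptions for the example, and the `net(if1,if2)` form reflects my understanding of the modprobe syntax; check it against the Lustre Manual before use.

```python
def render_networks(networks):
    """Build the value of the lnet 'networks' module option.

    networks: dict mapping an LNet network name to a list of interface names.
    """
    return ",".join(
        f"{net}({','.join(ifaces)})" for net, ifaces in networks.items()
    )


# Before Multi-Rail: one LNet network per interface.
without_mr = render_networks({"o2ib1": ["ib0"], "o2ib2": ["ib1"]})

# With Multi-Rail: both interfaces grouped in a single LNet network.
with_mr = render_networks({"o2ib": ["ib0", "ib1"]})

print(f'options lnet networks="{without_mr}"')
print(f'options lnet networks="{with_mr}"')
```

The second line is the Multi-Rail form: a single network name with all of the node's homogeneous interfaces listed inside the parentheses.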
You can, if you want, turn off Multi-Rail, although I wouldn't recommend it. You can do that by turning off the automatic interface discovery feature, as I show on the slide. With discovery disabled, the node will not know about the peer's Multi-Rail capability. This can be useful for handling backwards compatibility with peers which are not Multi-Rail capable.
Slide 11 - Health Benefits
Now let's switch to health.
The Multi-Rail health feature gives us a few notable benefits. It allows LNet to monitor the health of each interface and by extension the network the interface belongs to. This means, every interface is given a health value. Whenever there is a failure detected on that interface the health value is decremented. And the interface goes into recovery mode.
The message LNet fails to send is retried on a different, healthier interface, if available. This re-sending happens at the LNet layer and avoids Lustre-level failure recovery.
For subsequent messages the healthiest interface is used.
Slide 12 - When to use Health
Health is most beneficial when you have multiple interfaces or networks which you can fail over to. But if a node has only one interface, the benefits of the health feature are not as pronounced. There is only one interface to use, no matter how unhealthy it becomes.
Slide 13 - Health Parameters
There are three parameters which control the health feature.
transaction_timeout is the time to wait for an LNet message response before failing the message. Within that timeout, if we have a confirmed send failure, then we can retry up to retry_count times.
These two parameters need to be set with the cluster size and network latency in mind. We found that with large clusters we need to increase the transaction timeout to accommodate the latency introduced by servicing thousands of nodes.
Finally, health_sensitivity is the amount by which we decrement the health of an interface on failure. The larger this value is, the more sensitive the interface is to failure.
To turn health off, set the health_sensitivity and retry_count to 0. The interface's health will not be decremented and messages will not be re-sent on failure.
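How health_sensitivity and retry_count interact can be sketched with a small simulation. This is a simplified model, not LNet's implementation. The `send_with_retry` helper, the `MAX_HEALTH` ceiling, and the parameter values are assumptions chosen for illustration, and transaction_timeout is left out entirely.

```python
MAX_HEALTH = 1000  # assumed health ceiling for this sketch


def send_with_retry(health, send, health_sensitivity, retry_count):
    """Try to send one message; return the interface used, or None on failure.

    health: dict of interface name -> health value (updated in place)
    send:   callable(name) -> bool, a stand-in for the real send operation
    """
    attempts = 1 + retry_count  # the first attempt plus the allowed retries
    for _ in range(attempts):
        name = max(health, key=health.get)  # prefer the healthiest interface
        if send(name):
            return name
        # Confirmed failure: lower the health so other interfaces are preferred.
        health[name] = max(0, health[name] - health_sensitivity)
    return None


def ib0_is_down(name):
    """Simulated fabric in which every send on ib0 fails."""
    return name != "ib0"


# Health enabled: the failed send on ib0 is retried on ib1.
health = {"ib0": MAX_HEALTH, "ib1": MAX_HEALTH}
used = send_with_retry(health, ib0_is_down, health_sensitivity=100, retry_count=2)
print(used, health)

# Health turned off: no decrement and no re-send, so the message fails.
health_off = {"ib0": MAX_HEALTH, "ib1": MAX_HEALTH}
unused = send_with_retry(health_off, ib0_is_down, health_sensitivity=0, retry_count=0)
print(unused, health_off)
```

In the first case the message goes out on ib1 and ib0 is left with a lower health value, so later messages keep avoiding it. In the second case both values are 0, the health of ib0 is untouched, and the failure is not recovered at this layer.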
Slide 14 - Conclusion
In conclusion, we saw how the LNet Multi-Rail feature simplifies network configuration and increases the performance of the node by aggregating the interfaces' bandwidth.
It is on by default and only requires grouping the homogeneous interfaces on the same LNet network. LNet takes care of discovering the remote peers' interfaces and then selects the interface to use based on the criteria discussed earlier.
The Health feature adds resiliency into the mix, but should be used when multiple interfaces or networks are available. Otherwise the only behavioural change would be re-sending failed messages on the same interface.
Finally, when configuring Health we should keep in mind the size and latency characteristics of the underlying network. The default settings for the Health parameters seem to work in most cases, but with very large clusters the values might need to be tweaked.
This concludes the EXAScaler Multi-Rail and Health overview presentation. I hope you found this useful. For more details you can take a look at the Lustre Manual.
Thank you