How do I build Lustre?
Refer to Walk-thru - Build Lustre MASTER on RHEL 7.3/CentOS 7.3 from Intel Git and Building Lustre from Source. Also refer to Building Lustre/LNet Centos/RHEL 7.x for some quirks when building.
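For reference, a minimal sketch of a typical from-source build; the repository URL and targets below are common practice but are assumptions here, so treat the walk-thru pages above as authoritative:
| Code Block |
|---|
# clone the Lustre tree (URL assumed)
git clone git://git.whamcloud.com/fs/lustre-release.git
cd lustre-release
# generate the configure script and build RPMs
# (kernel-devel for the running kernel must be installed)
sh ./autogen.sh
./configure
make rpms |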
How do I load LNet?
Load the module:
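A minimal sketch of the load step (standard commands; lnetctl lnet configure initializes LNet without adding any networks):
| Code Block |
|---|
modprobe lnet
lnetctl lnet configure |
Then add networks: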
| Code Block |
|---|
lnetctl net add --net <net-name> --if <interface name>
# Examples
lnetctl net add --net o2ib --if ib0
lnetctl net add --net tcp --if eth0 |
How do I configure different module parameters per network interface?
Nodes can have different types of interfaces, for example, MLX and OPA, and it is often desirable to configure each of them with different tunables. To do that, use the YAML configuration.
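A minimal sketch, assuming ib0 is an MLX interface on o2ib and ib1 is an OPA interface on o2ib1; the lnd tunable values mirror the MLX and OPA examples elsewhere on this page and can be applied with lnetctl import:
| Code Block |
|---|
net:
    - net type: o2ib
      local NI(s):
        - interfaces:
              0: ib0
          lnd tunables:
              peercredits_hiw: 4
              map_on_demand: 0
              concurrent_sends: 8
              conns_per_peer: 1
    - net type: o2ib1
      local NI(s):
        - interfaces:
              0: ib1
          lnd tunables:
              peercredits_hiw: 4
              map_on_demand: 32
              concurrent_sends: 256
              conns_per_peer: 4 |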
How do I configure LNet routing?
- Configure LNet and its interfaces as shown above
- Configure a router
| Code Block |
|---|
# enable routing
lnetctl set routing 1 |
- Configure route entries on each of the nodes
| Code Block |
|---|
lnetctl route add --net <destination net> --gateway <gateway nid>
# Examples
lnetctl route add --net o2ib1 --gateway 10.10.10.2@o2ib
# The above says:
# any messages destined to o2ib1 should be forwarded to 10.10.10.2@o2ib
# o2ib has to be a reachable network
lnetctl route add --net tcp --gateway 10.10.10.3@o2ib --hop 2 --priority 1
# If there are multiple routes, it is sometimes useful to define the priority between them.
# hop should define the number of hops to the gateway.
# Unfortunately, due to legacy reasons, hop and priority perform the same function;
# it would have been better to have only one of them, to reduce confusion.
# Routes with a lower hop count or a higher priority are selected first.
# Routes with the same hop count and priority are visited in round-robin order. |
What module parameters impact routing?
A router can be configured with a set number of buffers. These buffers are used to receive messages to be forwarded.
The pinger waits for router_ping_timeout for the gateway to respond to a ping health check.
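A sketch of the relevant lnet module parameters; the parameter names are standard, but the values below are purely illustrative and should be tuned for your system:
| Code Block |
|---|
# /etc/modprobe.d/lnet.conf
# router buffer pools used to receive messages to be forwarded
options lnet tiny_router_buffers=2048 small_router_buffers=16384 large_router_buffers=1024
# router pinger: how often live/dead gateways are health-checked, and how
# long to wait for a gateway to respond to a ping (router_ping_timeout)
options lnet live_router_check_interval=60 dead_router_check_interval=60 router_ping_timeout=50 |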
How can I verify my routing setup works?
Basic verification
- Run lnetctl peer show on your node (or lnetctl peer show --nid <router nid>) and check that all router NIDs on the same LNet are listed with "status: up". If any router NIDs are listed as "down", try the same connectivity checks between your node and the router as described in "TODO: How can I verify connectivity between nodes on the same LNet".
- If avoid_asym_router_failure is enabled, make sure that the same routers are set up on both sides.
Troubleshooting steps
If basic router setup verification indicates connectivity problems that can't be solved using means discussed in "TODO: How can I verify connectivity between nodes on the same LNet", use the following procedure.
Preparation
If there are many same-level routers connecting your nodes, try to isolate just one for each hop level (if you have identified which one causes the problem, leave that one). This will require changing routing settings, as sketched below. For example, if the topology is Client — GW[1..N] — Server, leave only GW1 in the router configuration of the server and the client. This can be done using the lnetctl route del command (changes won't be kept on LNet reload).
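For example, a sketch of removing the routes through every gateway except GW1 (the network and NIDs are assumptions for illustration):
| Code Block |
|---|
# keep only GW1; remove the routes through the other gateways
lnetctl route del --net o2ib1 --gateway 10.10.10.2@o2ib
lnetctl route del --net o2ib1 --gateway 10.10.10.3@o2ib |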
Make sure that connected nodes can lnetctl ping each other. Note that lnetctl ping needs to be tried multiple times (see the sketch after the list below).
For example, if the topology is Client — GW1 — GW2 — Server:
- lnetctl ping Server from the Client. If that fails, then
- lnetctl ping using directly connected nodes: Client to GW1, GW1 to GW2, GW2 to Server. If any of this fails, this is not a routing setup issue but a connectivity issue between specific nodes on the same network. Refer to "TODO: How can I verify connectivity between nodes on the same LNet"
- lnetctl ping nodes that are a single hop away: Client to GW2, Server to GW1 in both directions. If any of that fails, then
- Check the routing config for the nodes that fail to communicate
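Since a single attempt can fail transiently, a simple way to retry the ping (the NID is an assumption for illustration):
| Code Block |
|---|
# try the ping several times before concluding the path is down
for i in $(seq 1 5); do lnetctl ping 10.10.10.4@o2ib1; sleep 1; done |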
What is LNet Multi-Rail?
LNet Multi-Rail allows multiple interfaces to be used for sending LNet messages. This feature boosts performance. A follow-up feature, LNet Resiliency, which is currently being worked on, is aimed at increasing resiliency.
Refer to: http://wiki.lustre.org/Multi-Rail_LNet for the Requirements, HLD and LUG presentations.
How do I configure multiple network interfaces per network?
Via command line:
| Code Block |
|---|
lnetctl net add --net <network> --if <list of comma separated interfaces>
# Example
lnetctl net add --net o2ib --if ib0,ib1 |
From YAML configuration. The values of the tunables can be changed to whatever values are desired.
| Code Block |
|---|
net:
    - net type: o2ib
      local NI(s):
        - interfaces:
              0: ib0
          tunables:
              peer_timeout: 180
              peer_credits: 8
              peer_buffer_credits: 0
              credits: 256
          lnd tunables:
              peercredits_hiw: 4
              map_on_demand: 0
              concurrent_sends: 8
              fmr_pool_size: 512
              fmr_flush_trigger: 384
              fmr_cache: 1
              conns_per_peer: 1
        - interfaces:
              0: ib1
          tunables:
              peer_timeout: 180
              peer_credits: 8
              peer_buffer_credits: 0
              credits: 256
          lnd tunables:
              peercredits_hiw: 4
              map_on_demand: 0
              concurrent_sends: 8
              fmr_pool_size: 512
              fmr_flush_trigger: 384
              fmr_cache: 1
              conns_per_peer: 1 |
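The YAML can be applied with the standard lnetctl import command:
| Code Block |
|---|
# lnet.conf contains the YAML above
lnetctl import < lnet.conf |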
How do I statically configure Multi-Rail?
There are two steps to configuring multi-rail:
- Configuring the local network interfaces as shown above
- Configuring the Multi-rail enabled peers.
The first step ensures that the local node knows the different interfaces it can send messages over. The second step tells the local node which peers are Multi-Rail enabled and which interfaces of those peers to use.
For more information on exact configuration examples, refer to: https://build.hpdd.intel.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.xhtml#lnetmr
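As a minimal sketch of the second step, a Multi-Rail peer can be declared with lnetctl peer add (the NIDs are assumptions for illustration; see the manual link above for complete examples):
| Code Block |
|---|
# declare a Multi-Rail peer: its primary NID plus its remaining NIDs
lnetctl peer add --prim_nid 10.10.10.2@o2ib --nid 10.10.10.3@o2ib,10.10.10.4@o2ib |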
How do I dynamically configure Multi-Rail?
Configuring the peers manually is an error-prone process. It's best if a node is able to discover its peers dynamically. The Dynamic Discovery feature allows a node to discover the interfaces of a peer the first time it communicates with it.
Whenever the local interface list changes, an update is sent to all connected peers.
This feature reduces the configuration burden to only configuring the local interfaces of the node.
For more information, refer to the Dynamic Behavior section of the HLD: http://wiki.lustre.org/images/b/bb/Multi-Rail_High-Level_Design_20150119.pdf
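Dynamic Discovery can be toggled with lnetctl (it is enabled by default on Multi-Rail capable releases):
| Code Block |
|---|
lnetctl set discovery 1   # enable (the default)
lnetctl set discovery 0   # disable and rely on static peer configuration |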
How should I set up a cluster with a combination of Multi-Rail enabled nodes and non-Multi-Rail enabled nodes? (#TODO)
What should I do with regards to Multi-Rail if I upgrade from a non-Multi-Rail Lustre version (<= 2.10) to a Multi-Rail Lustre version (> 2.10)? (#TODO)
Are there any specific considerations when configuring Linux routing for an LNet Multi-Rail node?
Refer to: MR Cluster Setup
Are there any specific routing considerations with Multi-Rail?
Refer to: Multi-Rail (MR) Routing and Using Routing Resiliency with the Multi-Rail
How do I test LNet performance?
lnet_selftest is available for performance testing. Refer to: https://build.hpdd.intel.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.xhtml#lnetselftest
For a sample lnet_selftest script: self-test template script
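For reference, a minimal sketch of an lnet_selftest run in the style of the template script (the NIDs are assumptions; the lnet_selftest module must be loaded on all participating nodes first):
| Code Block |
|---|
# on all nodes: modprobe lnet_selftest
export LST_SESSION=$$
lst new_session read_test
lst add_group servers 10.10.10.2@o2ib
lst add_group clients 10.10.10.[3-4]@o2ib
lst add_batch bulk_read
lst add_test --batch bulk_read --from clients --to servers brw read size=1M
lst run bulk_read
lst stat clients servers &
sleep 30; kill $!
lst stop bulk_read
lst end_session |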
Is there a way to functionally test LNet?
We're currently working on a functional test tool, LNet Unit Test Framework. The documents will be made available soon.
What are the best OPA LND tunables to use?
| Code Block |
|---|
net:
    - net type: o2ib1
      local NI(s):
        - interfaces:
              0: ib2
          tunables:
              peer_timeout: 180
              peer_credits: 128
              peer_buffer_credits: 0
              credits: 1024
          lnd tunables:
              peercredits_hiw: 4
              map_on_demand: 32
              concurrent_sends: 256
              fmr_pool_size: 2048
              fmr_flush_trigger: 512
              fmr_cache: 1
              conns_per_peer: 4
              ntx: 2048 |
What are the best HFI tunables to use with Lustre?
| Code Block |
|---|
options hfi1 krcvqs=8 piothreshold=0 sge_copy_mode=2 wss_threshold=70 |
| Info |
|---|
It is NOT recommended to enable the OPA TID RDMA feature (cap_mask=0x4c09a01cbba), as this can cause significant memory usage and service errors when there are a large number of connections. |
Can you tell me more about how to configure LNet and QoS?
Refer to: Lustre QoS
How can I look at the debug logs?
| Code Block |
|---|
lctl set_param debug=+net
lctl set_param debug=+neterror
# make sure to use an absolute path
lctl debug_daemon start /root/log
tail -f /root/log
# or... NOTE doesn't have to be an absolute path
lctl dk > /root/log |
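To stop the debug daemon and decode its binary log, the standard lctl commands are:
| Code Block |
|---|
lctl debug_daemon stop
# convert the binary log to text
lctl debug_file /root/log /root/log.txt |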
I can't mount the FS and I think it's a networking issue, what should I do?
- Check that you can ping your MGS first.
- Check that you can "
lnetctl ping" the MGS NID.- If you're able to "
lnetctl ping" the MGS NID, then check if you can RDMA usingib_write_bw:- Start
ib_write_bwon your server. This will start a receiver process - Run "
ib_write_bw <MGS IP address>" - This will run an RDMA traffic test independent of LNet. If this works then RDMA works and move on to further LNet debugging. Otherwise contact your IB service provider.
- Start
- Next step is to run
lnet_selftestto verify LNet traffic. - If lnet_selftest works, then verify your MGS is setup properly.
- If you're able to "
If ib_write_bw works, but LNet doesn't work, then check your o2iblnd configuration, as shown below.
I have a routed setup and my clients can't mount. What should I do? (#TODO)
How do I check that my o2iblnd configurations on different nodes are compatible?
- Make sure peer_credits are the same across your nodes.
  - peer_credits are dynamically negotiated, such that the lowest peer_credits are used. However, if it's not your intention to have different peer_credits across the different nodes, it is recommended to ensure they all have the same value.
- Make sure your peer_credits_hiw are the same across your nodes.
  - peer_credits_hiw defines the High Water Mark value; when it is reached, the outstanding credits on the connection are returned using a No-op message.
- Make sure your concurrent_sends are the same across your nodes.
  - concurrent_sends defines the number of concurrent transmits per connection.
- The recommended values for the above parameters are:
- peer_credits = 32
- peer_credits_hiw = 16
- concurrent_sends = 64
- Generally speaking, you want peer_credits_hiw to be half of peer_credits and concurrent_sends to be two times peer_credits.
- Make sure conns_per_peer is the same across the nodes. It defines the number of IB connections to create to one peer.
  - On OPA it is recommended to set this value to 4.
  - On MLX it is recommended to leave it at the default value of 1.
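A sketch of pinning these consistently via modprobe configuration; the parameter names are the ko2iblnd module parameters, and the values follow the recommendations above:
| Code Block |
|---|
# /etc/modprobe.d/ko2iblnd.conf — use the same file on all nodes
options ko2iblnd peer_credits=32 peer_credits_hiw=16 concurrent_sends=64
# verify the live values once the module is loaded
cat /sys/module/ko2iblnd/parameters/peer_credits |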
What is map_on_demand and what should I set it to? (#TODO)
How should I configure LNet Health? (#TODO)
When should I turn off LNet Health? (#TODO)
When should I turn off Dynamic Discovery? (#TODO)
What is the process for seamlessly upgrading a router?
To upgrade a router with minimum impact to the live system use the following steps:
1. Make sure there's more than one router in the system and that they can handle the load if one router is decommissioned.
   a. If you want the router not to be used immediately after it comes back up, remove the routes pointing to that router from both the clients and the servers.
   On the decommissioned router run:
| Code Block |
|---|
watch -d -n 1 "lnetctl net show -v | grep -E 'send_|recv'" |
   That will show you when the traffic has stopped to that router. Once there is no more traffic, proceed to step 2.
2. Unload the LNet modules on the router to be reconfigured.
3. On the nodes connected to and configured to use this router,
| Code Block |
|---|
lnetctl route show -v 4 |
   should show that the router is down; or, if you have removed the routes using that router, it should no longer appear in the list.
   Remote LNet ping should succeed because other routers should get used instead:
| Code Block |
|---|
lnetctl ping <remote nid> |
4. Make changes to the decommissioned router configuration and bring it back online.
5. Perform any testing required on the router, using LNet selftest, to verify correct operations. Once satisfied with the router's operation, move to step 6.
6. If you've removed the routes to that router from the clients and servers in step 1a, then re-add them.
On the nodes connected to and configured to use the router,
| Code Block |
|---|
lnetctl route show -v 4 |
may show that the router is still down. In that case, rediscover the router:
| Code Block |
|---|
lnetctl discover <router nid> |
Which sysctl settings are optimal?
On systems using TCP, the default settings for the ARP cache thresholds may be too low:
| Code Block |
|---|
net.ipv4.neigh.default.gc_thresh1 = 128
net.ipv4.neigh.default.gc_thresh2 = 512
net.ipv4.neigh.default.gc_thresh3 = 1024 |
Replace these with:
| Code Block |
|---|
net.ipv4.neigh.default.gc_thresh1 = 8192
net.ipv4.neigh.default.gc_thresh2 = 16384
net.ipv4.neigh.default.gc_thresh3 = 32768 |
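These can be set persistently and loaded with sysctl in the standard way:
| Code Block |
|---|
# /etc/sysctl.d/99-lnet.conf
net.ipv4.neigh.default.gc_thresh1 = 8192
net.ipv4.neigh.default.gc_thresh2 = 16384
net.ipv4.neigh.default.gc_thresh3 = 32768
# then load the settings:
sysctl -p /etc/sysctl.d/99-lnet.conf |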
There are other sysctl settings required for proper MR operation. For these, refer to MR Cluster Setup.