Refer to Walk-thru- Build Lustre MASTER on RHEL 7.3/CentOS 7.3 from Git
Also refer to Building Lustre/LNet Centos/RHEL 7.x for some quirks when building.
Load the module
modprobe lnet
If using the standard /etc/modprobe.d/lustre.conf for module parameters, then:
# load all the module parameters
lnetctl lnet configure --all
If configuring dynamically, then:
# don't configure via module parameters
lnetctl lnet configure
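For reference, a typical /etc/modprobe.d/lustre.conf used with the module-parameter approach might contain an entry along the following lines (the network name and interface are illustrative):
# illustrative module-parameter configuration; adjust the networks to your fabric
options lnet networks="o2ib(ib0)"
When configuring dynamically, leave such entries out and add networks with lnetctl as shown below.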
Ensure that you have loaded the lnet module and configured it as shown above.
lnetctl net add --net <net-name> --if <interface name>
# Examples
lnetctl net add --net o2ib --if ib0
lnetctl net add --net tcp --if eth0
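To confirm that the network was added, the current configuration can be displayed:
# show the configured networks and their interfaces
lnetctl net show
# the local NIDs can also be listed with
lctl list_nids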
Sometimes nodes have different types of interfaces, for example MLX and OPA, and it is desirable to configure each of them with different tunables. To do that, use the YAML configuration.
Assume a node with two interfaces, one OPA and one MLX. Configure MLX on o2ib and OPA on o2ib1:
#> cat networkConfig.yaml
net:
    - net type: o2ib
      local NI(s):
        - interfaces:
              0: ib0
          tunables:
              peer_timeout: 180
              peer_credits: 8
              peer_buffer_credits: 0
              credits: 256
          lnd tunables:
              peercredits_hiw: 4
              map_on_demand: 0
              concurrent_sends: 8
              fmr_pool_size: 512
              fmr_flush_trigger: 384
              fmr_cache: 1
              conns_per_peer: 1
    - net type: o2ib1
      local NI(s):
        - interfaces:
              0: ib2
          tunables:
              peer_timeout: 180
              peer_credits: 128
              peer_buffer_credits: 0
              credits: 1024
          lnd tunables:
              peercredits_hiw: 4
              map_on_demand: 32
              concurrent_sends: 256
              fmr_pool_size: 2048
              fmr_flush_trigger: 512
              fmr_cache: 1
              conns_per_peer: 4
              ntx: 2048
#> lnetctl import < networkConfig.yaml
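Conversely, the running configuration can be dumped back to YAML, which is a convenient way to produce a template for editing:
# dump the currently applied LNet configuration as YAML
lnetctl export > currentConfig.yaml
# show the networks together with all tunables
lnetctl net show -v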
An LNet router routes LNet messages from one LNet network to another. For example from o2ib1 to tcp2 or from o2ib1 to o2ib2. This is especially useful when you have a cluster with nodes divided on different types of fabric, like OPA and MLX.
# enable routing
lnetctl set routing 1
lnetctl route add --net <destination net> --gateway <gateway nid>
# Examples
lnetctl route add --net o2ib1 --gateway 10.10.10.2@o2ib
# The above says:
# any messages destined to o2ib1 should be forwarded to 10.10.10.2@o2ib
# o2ib has to be a reachable network
lnetctl route add --net tcp --gateway 10.10.10.3@o2ib --hop 2 --priority 1
# If there are multiple routes it is sometimes useful to define the priority between these routes.
# hop should define the number of hops to the gateway.
# Unfortunately, due to legacy reasons, hop and priority perform the same function;
# it would have been better to only have one to reduce confusion.
# Routes with a lower number of hops or with higher priority are selected first.
# If routes have the same number of hops and priority they are visited in round-robin.
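Configured routes can then be inspected; for example:
# list the configured routes
lnetctl route show
# verbose output also includes hop, priority and the up/down state of the gateway
lnetctl route show -v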
A router can be configured with a set number of buffers. These buffers are used to receive messages to be forwarded.
There are three categories of buffers: tiny, small and large.
The number of buffers allocated can be controlled by the following module parameters.
tiny_router_buffers   # 512 min for each CPT
small_router_buffers  # 4096 min for each CPT
large_router_buffers  # 256 min for each CPT
They can also be changed dynamically via lnetctl. Note that the value you enter is divided among the configured CPTs. The minimum value restriction is enforced per CPT.
lnetctl set tiny_buffers <value>
lnetctl set small_buffers <value>
lnetctl set large_buffers <value>
They can also be set via YAML config:
buffers:
    tiny: <value>
    small: <value>
    large: <value>
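As with the network configuration, the buffer settings can be applied with lnetctl import; the file name and values below are illustrative, and the final command assumes your lnetctl provides routing show:
#> cat buffers.yaml
buffers:
    tiny: 2048
    small: 16384
    large: 1024
#> lnetctl import < buffers.yaml
# the buffer pools can then be inspected with
lnetctl routing show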
Other parameters of interest:
check_routers_before_use    # Assume routers are down and ping them before use. Defaults to disabled.
avoid_asym_router_failure   # Avoid asymmetrical router failures (0 to disable). Defaults to enabled.
dead_router_check_interval  # Seconds between dead router health checks (<= 0 to disable). Defaults to 60 seconds.
live_router_check_interval  # Seconds between live router health checks (<= 0 to disable). Defaults to 60 seconds.
router_ping_timeout         # Seconds to wait for the reply to a router health query. Defaults to 50 seconds.
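These are lnet module parameters, so they would typically be set in a modprobe configuration file; an illustrative entry using the defaults listed above:
# example /etc/modprobe.d/lnet.conf entry
options lnet check_routers_before_use=0 avoid_asym_router_failure=1 dead_router_check_interval=60 live_router_check_interval=60 router_ping_timeout=50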
Whenever a route entry is configured on a node, the gateway specified is added to a list. The router pinger is a thread that periodically pings the gateways to ensure that they are up. The gateway entries are segregated into two different categories, live gateways and dead gateways. dead_router_check_interval is the time interval used to ping the dead gateways, while live_router_check_interval is the time interval used to ping the live routers.
The router pinger thread wakes up every second in the following cases:
The pinger waits for router_ping_timeout for the gateway to respond to a ping health check.
Run lnetctl peer show --nid <router nid> and check that all router NIDs on the same LNet are listed with "status: up". If any router NIDs are listed as "down", try the same connectivity checks between your node and the router as described in "TODO: How can I verify connectivity between nodes on the same LNet". If avoid_asym_router_failure is enabled, make sure that the same routers are set up on both sides. If basic router setup verification indicates connectivity problems that can't be solved using the means discussed in "TODO: How can I verify connectivity between nodes on the same LNet", use the following procedure.
If there are many same-level routers connecting your nodes, try to isolate just one for each hop level (if you have identified which one causes the problem, leave that one). This will require changing routing settings. For example, if the topology is Client — GW[1..N] — Server, leave only GW1 in the router configuration of the server and the client. This can be done using the lnetctl route del command (changes won't be kept across an lnet reload).
Make sure that connected nodes can lnetctl ping each other. Note that lnetctl ping may need to be tried multiple times.
For example, if the topology is Client — GW1 — GW2 — Server:
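One plausible sequence of checks for that topology (all NIDs are placeholders):
# on the Client: verify the first-hop gateway responds
lnetctl ping <GW1 nid>
# on GW1: verify the next-hop gateway responds
lnetctl ping <GW2 nid>
# on GW2: verify the Server responds
lnetctl ping <Server nid>
# repeat the checks in the opposite direction (Server -> GW2 -> GW1 -> Client)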
LNet Multi-Rail allows multiple interfaces to be used for sending LNet messages. This feature boosts performance. A follow-up feature, LNet Resiliency, which is currently being worked on, is aimed at increasing resiliency.
Refer to: http://wiki.lustre.org/Multi-Rail_LNet for the Requirements, HLD and LUG presentations.
Via command line:
lnetctl net add --net <network> --if <list of comma separated interfaces>
# Example
lnetctl net add --net o2ib --if ib0,ib1
Via YAML configuration. The values of the tunables can be changed to whatever values are desired.
net:
    - net type: o2ib
      local NI(s):
        - interfaces:
              0: ib0
          tunables:
              peer_timeout: 180
              peer_credits: 8
              peer_buffer_credits: 0
              credits: 256
          lnd tunables:
              peercredits_hiw: 4
              map_on_demand: 0
              concurrent_sends: 8
              fmr_pool_size: 512
              fmr_flush_trigger: 384
              fmr_cache: 1
              conns_per_peer: 1
        - interfaces:
              0: ib1
          tunables:
              peer_timeout: 180
              peer_credits: 8
              peer_buffer_credits: 0
              credits: 256
          lnd tunables:
              peercredits_hiw: 4
              map_on_demand: 0
              concurrent_sends: 8
              fmr_pool_size: 512
              fmr_flush_trigger: 384
              fmr_cache: 1
              conns_per_peer: 1
There are two steps to configuring Multi-Rail: configuring the local interfaces and configuring the peers.
The first step ensures that the local node knows the different interfaces it can send messages over. The second step tells the local node which peers are Multi-Rail enabled and which interfaces of these peers to use.
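For the second step, peers can be configured manually with lnetctl peer add; a sketch assuming a peer reachable on two networks (the NIDs are placeholders):
# declare a Multi-Rail peer by its primary NID and list its additional NIDs
lnetctl peer add --prim_nid 10.10.10.2@o2ib --nid 10.10.11.2@o2ib1
# verify the peer and its NIDs
lnetctl peer show -v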
For more information on exact configuration examples, refer to: https://build.hpdd.intel.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.xhtml#lnetmr
Configuring the peers manually is an error-prone process. It is best if a node is able to discover its peers dynamically. The Dynamic Discovery feature allows a node to discover the interfaces of a peer the first time it communicates with it.
Whenever the local interface list changes, an update is sent to all connected peers.
This feature reduces the configuration burden to only configuring the local interfaces of the node.
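Dynamic Discovery can be toggled at runtime on releases that support it; a brief sketch:
# disable dynamic peer discovery
lnetctl set discovery 0
# re-enable it
lnetctl set discovery 1
# check the current setting
lnetctl global show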
For more information on the feature, refer to the "Dynamic Behavior" section of the HLD: http://wiki.lustre.org/images/b/bb/Multi-Rail_High-Level_Design_20150119.pdf
Refer to: MR Cluster Setup
Refer to: Using Routing Resiliency with the Multi-Rail
lnet_selftest is available for performance testing. Refer to: https://build.hpdd.intel.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.xhtml#lnetselftest
For a sample lnet_selftest script: self-test template script
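For reference, a minimal bulk read/write session along the lines of the manual's example might look like the following; the NIDs, size and duration are placeholders, and lnet_selftest must be loaded on every participating node:
# load the self-test module on all nodes involved
modprobe lnet_selftest
# the session is keyed off this environment variable
export LST_SESSION=$$
lst new_session read_write
# group the nodes under test (replace the NIDs)
lst add_group servers 192.168.1.10@o2ib
lst add_group clients 192.168.1.20@o2ib
# create a batch containing a bulk read and a bulk write test
lst add_batch bulk_rw
lst add_test --batch bulk_rw --from clients --to servers brw read size=1M
lst add_test --batch bulk_rw --from clients --to servers brw write size=1M
# run the batch and sample statistics for 30 seconds
lst run bulk_rw
lst stat clients servers & STAT_PID=$!
sleep 30
kill $STAT_PID
lst end_session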
We're currently working on a functional test tool, LNet Unit Test Framework. The documents will be made available soon.
net:
    - net type: o2ib1
      local NI(s):
        - interfaces:
              0: ib2
          tunables:
              peer_timeout: 180
              peer_credits: 128
              peer_buffer_credits: 0
              credits: 1024
          lnd tunables:
              peercredits_hiw: 4
              map_on_demand: 32
              concurrent_sends: 256
              fmr_pool_size: 2048
              fmr_flush_trigger: 512
              fmr_cache: 1
              conns_per_peer: 4
              ntx: 2048
options hfi1 krcvqs=8 piothreshold=0 sge_copy_mode=2 wss_threshold=70
It is NOT recommended to enable the OPA TID RDMA feature (cap_mask=0x4c09a01cbba) as this can cause significant memory usage and service errors when there are a large number of connections.
Refer to: Lustre QoS
lctl set_param debug=+net
lctl set_param debug=+neterror
# make sure to use an absolute path
lctl debug_daemon start /root/log
tail -f /root/log
# or... NOTE: doesn't have to be an absolute path
lctl dk > /root/log
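When finished, the debug daemon can be stopped and the binary log converted to readable text; the output path is illustrative:
# stop the debug daemon
lctl debug_daemon stop
# convert the binary debug log to text
lctl debug_file /root/log /root/log.txt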
lnetctl ping" the MGS NID.lnetctl ping" the MGS NID, then check if you can RDMA using ib_write_bw:ib_write_bw on your server. This will start a receiver processib_write_bw <MGS IP address>"lnet_selftest to verify LNet traffic.If ib_write_bw works, but LNet doesn't work, then check your o2iblnd configuration, as shown below.
How do I check that o2iblnd configurations on different nodes are compatible?
Check that peer_credits are the same across your nodes. Different peer_credits can be used; however, if it's not your intention to have different peer_credits across the different nodes, it is recommended to ensure they all have the same value.
Check that peer_credits_hiw are the same across your nodes. peer_credits_hiw defines the High Water Mark value; when it is reached, the outstanding credits on the connection are returned using a No-op message.
What is map_on_demand and what should I set it to? (#TODO)
To upgrade a router with minimum impact to the live system, use the following steps:
On the nodes connected to and configured to use this router,
lnetctl route show -v 4
should show that the router is down. Remote LNet ping should succeed because other routers should get used instead:
lnetctl ping <remote nid>
On the nodes connected to and configured to use the router,
lnetctl route show -v 4
may show that the router is still down. In that case, rediscover the router:
lnetctl discover <router nid>