Script

Attached is a script that should set up all the necessary Linux routing and parameters described on this wiki. You can download the script here.

Usage

## Dry run
python3 mrrouting.py --dry-run --verbose --if=<comma or space separated interface names>
## Example
python3 mrrouting.py --dry-run --verbose --if=eth0,eth1,eth2,ib0,ib1

## Run the script and setup the parameters:
python3 mrrouting.py --verbose --if=<comma or space separated interface names>
## or
python3 mrrouting.py --if=<comma or space separated interface names>
## Example
python3 mrrouting.py --if=eth0,eth1,eth2,ib0,ib1
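The --if option accepts interface names separated by commas or spaces. As a minimal sketch of how such a value can be split (parse_interfaces is a hypothetical helper written for illustration, not code taken from mrrouting.py):

```python
import re

def parse_interfaces(value):
    """Split a comma or space separated interface list into names."""
    # Any run of commas and/or whitespace counts as a single separator.
    return [name for name in re.split(r"[,\s]+", value.strip()) if name]

print(parse_interfaces("eth0,eth1,eth2,ib0,ib1"))
print(parse_interfaces("eth0 eth1, ib0"))
```

When passing a space separated list on the command line, the value has to be quoted so the shell hands it to the script as one argument.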

The script requires the following Python modules, installed using the pip that matches the Python version you run the script with:

pip3 install netaddr
pip3 install netifaces

I also created a general configuration script, which can come in handy: lustrecfg.py

python3 lustrecfg.py [--dry-run] --cfg=<YAML configuration file>

The configuration file would look like:

lustre:
   operation: all
   build_src: lustre_src
   build_script: deploy.py
   config_name: lnet
   lnet:
      net:
        - net type: o2ib0
          interfaces: eno1,eno2,eno3
          tunables:
             peer_credits: 32
             peer_credits_hiw: 16
             concurrent_sends: 64
        - net type: o2ib1
          interfaces: eno1
          tunables:
             peer_credits: 32
             peer_credits_hiw: 16
             concurrent_sends: 64
      route:
        - net: tcp1
          gateway: 192.168.122.21@tcp
          hop: -1
          priority: 0
          health_sensitivity: 1
      global:
          lnet_transaction_timeout: 100
          lnet_retry_count: 3


Overview

Due to Linux routing quirks, if there are two network interfaces on the same node, the HW address returned in the ARP for a specific IP might not necessarily be the one for the exact interface being ARPed.

This causes problems for o2iblnd, because it resolves the address using IPoIB and can get the wrong InfiniBand address, which leads to connection problems.

To get around this problem we need to set up routing entries and rules that tell the Linux kernel to respond with the correct HW address.

I use trevis-40[1,2] as an example, but this will need to be done on any other node with multiple interfaces of the same kind (MLX, OPA, ETH).

The main difference is in the routing rules. The rules explicitly cause the route selection algorithm to look at the ib0 or ib1 routing tables based on the source prefix. Therefore any packet with a source address set to ib0 or ib1's IP address triggers the rules and is first matched against the route in the corresponding table. In this way messages are guaranteed to use the correct interface. (ib0 and ib1 are used as examples)
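As a rough sketch of the per-interface table and source rule described above, the following builds the command strings for one interface. The function name policy_routing_commands and its arguments are made up for illustration, and the commands are only printed, not executed. It also assumes the table is named after the interface, as in the examples on this page.

```python
import ipaddress

def policy_routing_commands(ifname, cidr, table_id):
    """Return the commands that give one interface its own routing table.

    ifname   -- interface name, also used as the table name (e.g. "ib0")
    cidr     -- interface address with prefix length (e.g. "192.168.1.2/24")
    table_id -- numeric id to register in /etc/iproute2/rt_tables
    """
    iface = ipaddress.ip_interface(cidr)
    return [
        # Register a named routing table for this interface.
        f"echo {table_id} {ifname} >> /etc/iproute2/rt_tables",
        # Route to the interface's subnet, stored in that table.
        f"ip route add {iface.network} dev {ifname} proto kernel "
        f"scope link src {iface.ip} table {ifname}",
        # Packets sourced from this address consult that table first.
        f"ip rule add from {iface.ip} table {ifname}",
    ]

for cmd in policy_routing_commands("ib0", "192.168.1.2/24", 200):
    print(cmd)
```

Running the printed commands requires root, and the echo line should only be applied once per table, since it appends to rt_tables every time.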

Testing

I conducted some testing to understand which tunable parameters avoid the address resolution error (the mis-ARP problem).

Test procedure

  1. Load LNet
  2. Discover both nodes
  3. Load lnet_selftest
  4. Run lnet_selftest
  5. On the switch, bring down one of the ports
  6. Stop lnet_selftest
  7. On the switch, bring the port back up
  8. Look for address resolution errors and watch health statistics

Test matrix

  • ARP settings only: -110 address resolution errors observed. Local interface didn't recover.
  • rp_filter = 0 only: no -110 address resolution errors observed. Local interfaces recovered.
  • rp_filter = 1 only: -110 address resolution errors observed. Local interface didn't recover.
  • rp_filter = 2 only: -110 address resolution errors observed. Consumer fatal error, probably because the wrong HW address was returned for ARP.
  • routing rules only (delete old routes): -110 address resolution errors observed.
  • routing rules + rp_filter = 0: no -110 address resolution errors observed. Local interfaces recovered.
  • routing rules + rp_filter = 1: no -110 address resolution errors observed. Local interfaces recovered.
  • routing rules + rp_filter = 2: no -110 address resolution errors observed. Local interfaces recovered.

It appears that setting rp_filter to 0 avoids the -110 error. However, I cannot conclusively say that it resolves all issues.

Adding the routing rules in addition to the rp_filter avoids the -110 address resolution error for all rp_filter settings. Therefore, it seems like route rules + rp_filter = 0 is the safest configuration to avoid issues on MR setups with interfaces on the same network.

This configuration needs to be done on all clients and servers which have multiple interfaces configured in Multi-Rail.
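Since the same settings have to be repeated for every Multi-Rail interface on every node, a small generator can help avoid typos. This is only a sketch: the ARP values are the ones used in the per-node Setup sections of this page, sysctl_commands is a hypothetical helper, and the commands are printed rather than executed.

```python
# Per-interface values as used in the Setup sections of this page.
PER_INTERFACE_SETTINGS = {
    "arp_ignore": 1,
    "arp_filter": 0,
    "arp_announce": 2,
    "rp_filter": 0,
}

def sysctl_commands(interfaces):
    """Return the sysctl commands for a list of Multi-Rail interfaces."""
    # The global rp_filter setting comes first, as in the Setup sections.
    cmds = ["sysctl -w net.ipv4.conf.all.rp_filter=0"]
    for ifname in interfaces:
        for key, value in PER_INTERFACE_SETTINGS.items():
            cmds.append(f"sysctl -w net.ipv4.conf.{ifname}.{key}={value}")
    return cmds

print("\n".join(sysctl_commands(["ib0", "ib1"])))
```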

accept_local

In kernel version 3.10 commit

7a9bc9b81a5b ("ipv4: Elide fib_validate_source() completely when possible.")

introduced a behavior change where the accept_local parameter was not checked, so packets with a local address in the source field were not dropped, even though they should be when accept_local is off.

Another patch in kernel version 3.18 restored the behavior. That is why we have been seeing problems with health recovery on CentOS 8 and Ubuntu: health recovery pings perform ARP resolution on the local address.

commit 1dced6a854827eb5683f3c57ddbb4595daf145e4
Author: Sébastien Barré <sebastien.barre@uclouvain.be>
Date:   Sun Aug 17 09:19:54 2014 +0200

    ipv4: Restore accept_local behaviour in fib_validate_source()
    
    Commit 7a9bc9b81a5b ("ipv4: Elide fib_validate_source() completely when possible.")
    introduced a short-circuit to avoid calling fib_validate_source when not
    needed. That change took rp_filter into account, but not accept_local.
    This resulted in a change of behaviour: with rp_filter and accept_local
    off, incoming packets with a local address in the source field should be
    dropped.
    
    Here is how to reproduce the change pre/post 7a9bc9b81a5b commit:
    -configure the same IPv4 address on hosts A and B.
    -try to send an ARP request from B to A.
    -The ARP request will be dropped before that commit, but accepted and answered
    after that commit.
    
    This adds a check for ACCEPT_LOCAL, to maintain full
    fib validation in case it is 0. We also leave __fib_validate_source() earlier
    when possible, based on the same check as fib_validate_source(), once the
    accept_local stuff is verified.
    
    Cc: Gregory Detal <gregory.detal@uclouvain.be>
    Cc: Christoph Paasch <christoph.paasch@uclouvain.be>
    Cc: Hannes Frederic Sowa <hannes@redhat.com>
    Cc: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
    Signed-off-by: Sébastien Barré <sebastien.barre@uclouvain.be>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Therefore it is important to set accept_local to 1 on these systems to ensure health recovery works properly.

sysctl -w net.ipv4.conf.all.accept_local=1
# or
sysctl -w net.ipv4.conf.<intf name>.accept_local=1
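To confirm the setting took effect, the current values can be read back from /proc. The sketch below assumes the usual /proc/sys/net/ipv4/conf/<name>/accept_local layout. The helper accept_local_status and its root parameter are hypothetical; root exists only so the function can be pointed at a different directory for testing.

```python
from pathlib import Path

def accept_local_status(interfaces, root="/proc/sys/net/ipv4/conf"):
    """Map each name to its accept_local value, or None if unreadable."""
    status = {}
    for ifname in interfaces:
        path = Path(root) / ifname / "accept_local"
        try:
            status[ifname] = int(path.read_text().strip())
        except (OSError, ValueError):
            # Interface missing, file unreadable, or unexpected content.
            status[ifname] = None
    return status

for ifname, value in accept_local_status(["all", "ib0", "ib1"]).items():
    print(f"{ifname}: accept_local={value}")
```

A value of None for an interface most likely means the interface name does not exist on that node.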

trevis-401

401 is the most complicated node in the cluster. It has 2 ETH, 2 OPA and 2 MLX interfaces. 

Setup

#Setting ARP so it doesn't broadcast
sysctl -w net.ipv4.conf.all.rp_filter=0
sysctl -w net.ipv4.conf.ib0.arp_ignore=1
sysctl -w net.ipv4.conf.ib0.arp_filter=0
sysctl -w net.ipv4.conf.ib0.arp_announce=2
sysctl -w net.ipv4.conf.ib0.rp_filter=0

sysctl -w net.ipv4.conf.ib1.arp_ignore=1
sysctl -w net.ipv4.conf.ib1.arp_filter=0
sysctl -w net.ipv4.conf.ib1.arp_announce=2
sysctl -w net.ipv4.conf.ib1.rp_filter=0

sysctl -w net.ipv4.conf.ib2.arp_ignore=1
sysctl -w net.ipv4.conf.ib2.arp_filter=0
sysctl -w net.ipv4.conf.ib2.arp_announce=2
sysctl -w net.ipv4.conf.ib2.rp_filter=0

sysctl -w net.ipv4.conf.ib3.arp_ignore=1
sysctl -w net.ipv4.conf.ib3.arp_filter=0
sysctl -w net.ipv4.conf.ib3.arp_announce=2
sysctl -w net.ipv4.conf.ib3.rp_filter=0

ip neigh flush dev ib0
ip neigh flush dev ib1
ip neigh flush dev ib2
ip neigh flush dev ib3
 
echo 200 ib0 >> /etc/iproute2/rt_tables
echo 201 ib1 >> /etc/iproute2/rt_tables
echo 202 ib2 >> /etc/iproute2/rt_tables
echo 203 ib3 >> /etc/iproute2/rt_tables

ip route add 192.168.0.0/16 dev ib0 proto kernel scope link src 192.168.1.1 table ib0
ip route add 192.168.0.0/16 dev ib1 proto kernel scope link src 192.168.2.1 table ib1
ip rule add from 192.168.1.1 table ib0
ip rule add from 192.168.2.1 table ib1
 
ip route add 172.16.0.0/16 dev ib2 proto kernel scope link src 172.16.1.1 table ib2
ip route add 172.16.0.0/16 dev ib3 proto kernel scope link src 172.16.2.1 table ib3
ip rule add from 172.16.1.1 table ib2
ip rule add from 172.16.2.1 table ib3
ip route flush cache

trevis-402 

Setup

#Setting ARP so it doesn't broadcast
sysctl -w net.ipv4.conf.all.rp_filter=0
sysctl -w net.ipv4.conf.ib0.arp_ignore=1
sysctl -w net.ipv4.conf.ib0.arp_filter=0
sysctl -w net.ipv4.conf.ib0.arp_announce=2
sysctl -w net.ipv4.conf.ib0.rp_filter=0

sysctl -w net.ipv4.conf.ib1.arp_ignore=1
sysctl -w net.ipv4.conf.ib1.arp_filter=0
sysctl -w net.ipv4.conf.ib1.arp_announce=2
sysctl -w net.ipv4.conf.ib1.rp_filter=0

ip neigh flush dev ib0
ip neigh flush dev ib1
 
echo 200 ib0 >> /etc/iproute2/rt_tables
echo 201 ib1 >> /etc/iproute2/rt_tables
 
ip route add 192.168.2.0/24 dev ib1 proto kernel scope link src 192.168.2.2 table ib1
ip route add 192.168.1.0/24 dev ib0 proto kernel scope link src 192.168.1.2 table ib0
ip rule add from 192.168.1.2 table ib0
ip rule add from 192.168.2.2 table ib1
ip route flush cache

# Try to get the system in the following state:
[root@trevis-402 ~]# ip route show table ib1
192.168.2.0/24 dev ib1  proto kernel  scope link  src 192.168.2.2 



trevis-404

Setup

#Setting ARP so it doesn't broadcast
sysctl -w net.ipv4.conf.all.rp_filter=0
sysctl -w net.ipv4.conf.ib0.arp_ignore=1
sysctl -w net.ipv4.conf.ib0.arp_filter=0
sysctl -w net.ipv4.conf.ib0.arp_announce=2
sysctl -w net.ipv4.conf.ib0.rp_filter=0

sysctl -w net.ipv4.conf.ib1.arp_ignore=1
sysctl -w net.ipv4.conf.ib1.arp_filter=0
sysctl -w net.ipv4.conf.ib1.arp_announce=2
sysctl -w net.ipv4.conf.ib1.rp_filter=0

ip neigh flush dev ib0
ip neigh flush dev ib1
 
echo 200 ib0 >> /etc/iproute2/rt_tables
echo 201 ib1 >> /etc/iproute2/rt_tables
 
ip route add 172.16.1.0/24 dev ib0 proto kernel scope link src 172.16.1.4 table ib0
ip route add 172.16.2.0/24 dev ib1 proto kernel scope link src 172.16.2.4 table ib1
 
ip rule add from  172.16.1.4 table ib0
ip rule add from  172.16.2.4 table ib1
ip route flush cache
 
[root@trevis-404 ~]# ip route show table ib1
172.16.2.0/24 dev ib1  proto kernel  scope link  src 172.16.2.4 

Troubleshooting

# If the above setup doesn't resolve the issue, try the following steps:
# Make sure to flush the ARP cache on the other nodes, so that there is no confusion with addressing.
ip -s -s neigh flush all
arp -n # show ARP entries
# Look at the rules with:
# ip rule show
# Make sure that the rules are in the correct priority order.
# 0 is the highest priority.
# 0 is always the local routing table, which has all the default local and broadcast routes.
# 32766 is the main routing table, so all other policy routing rules should have a higher priority (a lower number) than this one.
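The priority check can also be done mechanically. The sketch below parses text in the usual "priority: selector lookup table" layout printed by ip rule show and reports any policy table whose rule comes at or after the main table. The SAMPLE text is an illustration built from the trevis-402 addresses, not output captured from the cluster, and both helper names are hypothetical.

```python
BUILTIN_TABLES = ("local", "main", "default")

def parse_rules(text):
    """Parse `ip rule show` style text into (priority, selector, table)."""
    rules = []
    for line in text.splitlines():
        prio, sep, rest = line.partition(":")
        if not sep or not prio.strip().isdigit() or " lookup " not in rest:
            continue  # skip blank or unexpected lines
        selector, _, table = rest.strip().rpartition(" lookup ")
        rules.append((int(prio), selector, table.split()[0]))
    return rules

def tables_after_main(rules):
    """Return policy tables that are not consulted before the main table."""
    main_prio = min((p for p, _, t in rules if t == "main"), default=32766)
    return [t for p, _, t in rules
            if t not in BUILTIN_TABLES and p >= main_prio]

SAMPLE = (
    "0:\tfrom all lookup local\n"
    "32764:\tfrom 192.168.2.2 lookup ib1\n"
    "32765:\tfrom 192.168.1.2 lookup ib0\n"
    "32766:\tfrom all lookup main\n"
    "32767:\tfrom all lookup default\n"
)

# An empty list means every policy rule is evaluated before main.
print(tables_after_main(parse_rules(SAMPLE)))
```

On a live node the text would come from running ip rule show, for example through subprocess.run with capture_output=True.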

Instability

There is some instability with the MR cluster, specifically with trevis-401. 

On many occasions the OPA interfaces are not pingable. If that happens, try shutting the node down, waiting 15 seconds, and starting it back up.

pm -0 trevis-401 # shutdown
pm -1 trevis-401 # start

Parameter Explanation

arp_announce/arp_ignore

arp_announce - INTEGER
	Define different restriction levels for announcing the local
	source IP address from IP packets in ARP requests sent on
	interface:
	0 - (default) Use any local address, configured on any interface
	1 - Try to avoid local addresses that are not in the target's
	subnet for this interface. This mode is useful when target
	hosts reachable via this interface require the source IP
	address in ARP requests to be part of their logical network
	configured on the receiving interface. When we generate the
	request we will check all our subnets that include the
	target IP and will preserve the source address if it is from
	such subnet. If there is no such subnet we select source
	address according to the rules for level 2.
	2 - Always use the best local address for this target.
	In this mode we ignore the source address in the IP packet
	and try to select local address that we prefer for talks with
	the target host. Such local address is selected by looking
	for primary IP addresses on all our subnets on the outgoing
	interface that include the target IP address. If no suitable
	local address is found we select the first local address
	we have on the outgoing interface or on all other interfaces,
	with the hope we will receive reply for our request and
	even sometimes no matter the source IP address we announce.

	The max value from conf/{all,interface}/arp_announce is used.

	Increasing the restriction level gives more chance for
	receiving answer from the resolved target while decreasing
	the level announces more valid sender's information.

arp_ignore - INTEGER
	Define different modes for sending replies in response to
	received ARP requests that resolve local target IP addresses:
	0 - (default): reply for any local target IP address, configured
	on any interface
	1 - reply only if the target IP address is local address
	configured on the incoming interface
	2 - reply only if the target IP address is local address
	configured on the incoming interface and both with the
	sender's IP address are part from same subnet on this interface
	3 - do not reply for local addresses configured with scope host,
	only resolutions for global and link addresses are replied
	4-7 - reserved
	8 - do not reply for all local addresses

	The max value from conf/{all,interface}/arp_ignore is used
	when ARP request is received on the {interface}
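To make the difference between arp_ignore 0 and 1 concrete, here is a toy model of the reply decision. It covers only those two levels, ignores subnets and scopes, and is not how the kernel implements it. The function names and the address map are invented for the example.

```python
def effective_setting(all_value, interface_value):
    """arp_ignore and arp_announce use the max of conf/all and conf/<if>."""
    return max(all_value, interface_value)

def replies_to_arp(target_ip, incoming_if, addresses, arp_ignore):
    """Toy model of arp_ignore levels 0 and 1.

    addresses maps an interface name to the set of local IPs on it.
    """
    if arp_ignore == 0:
        # Reply for any local address, whichever interface owns it.
        return any(target_ip in ips for ips in addresses.values())
    if arp_ignore == 1:
        # Reply only if the address is configured on the incoming interface.
        return target_ip in addresses.get(incoming_if, set())
    raise ValueError("only levels 0 and 1 are modelled in this sketch")

addresses = {"ib0": {"192.168.1.1"}, "ib1": {"192.168.2.1"}}

# A request for ib1's address arrives on ib0.
print(replies_to_arp("192.168.2.1", "ib0", addresses, arp_ignore=0))
print(replies_to_arp("192.168.2.1", "ib0", addresses, arp_ignore=1))
```

With level 0 the request arriving on ib0 is answered even though the address belongs to ib1, which is the situation where the requester can end up with the wrong HW address. With level 1 only the interface that owns the address answers.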

rp_filter

  • 0: No source address validation is performed and any packet is forwarded to the destination network.
  • 1: Strict mode as defined in RFC 3704. Each incoming packet to a router is tested against the routing table, and if the interface that the packet is received on is not the best return path for the packet then the packet is dropped.
  • 2: Loose mode as defined in RFC 3704 (Loose Reverse Path). Each incoming packet is tested against the routing table, and the packet is dropped if the source address is not routable through any interface. This allows for asymmetric routing, where the return path may not be the same as the source path.
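The three modes can be illustrated with a toy reverse-path check. This is a simplification for illustration, not the kernel's FIB lookup: routes are plain (network, interface) pairs, the longest matching prefix is treated as the best return path, and when two routes tie the first one listed wins. The function name and the route list are made up.

```python
import ipaddress

def accepts_packet(src, incoming_if, routes, rp_filter):
    """Toy reverse-path check for rp_filter modes 0, 1 and 2."""
    if rp_filter == 0:
        return True  # no source validation
    addr = ipaddress.ip_address(src)
    matches = [(ipaddress.ip_network(net), ifname)
               for net, ifname in routes
               if addr in ipaddress.ip_network(net)]
    if not matches:
        return False  # source not routable through any interface
    if rp_filter == 2:
        return True  # loose: any interface may be the return path
    # Strict: the best return path must be the receiving interface.
    best = max(matches, key=lambda m: m[0].prefixlen)
    return best[1] == incoming_if

# Two interfaces on the same network, as in a Multi-Rail setup.
routes = [("192.168.0.0/16", "ib0"), ("192.168.0.0/16", "ib1")]
for mode in (0, 1, 2):
    print(mode, accepts_packet("192.168.2.5", "ib1", routes, mode))
```

In this model a packet arriving on ib1 is dropped in strict mode because the first matching route points at ib0. That is one way to see why two interfaces on the same network interact badly with rp_filter = 1 when no per-interface routing tables are in place.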

Misc

Crash dump files are in:

/scratch/dumps/trevis-40x.hpdd.intel.com/

Installing OPA utilities

yum install opa-fastfabric

For other OPA related downloads:

https://downloadcenter.intel.com/product/92003/Intel-Omni-Path-Host-Fabric-Interface-Products

Updated instructions from Mellanox

To be verified

# Configure IP
ifconfig ib0 192.168.0.11/16
ifconfig ib1 192.168.1.11/16

# Then do the routing configuration as below
echo 0 > /proc/sys/net/ipv4/conf/all/rp_filter
echo 1 > /proc/sys/net/ipv4/conf/all/accept_local
echo 1 > /proc/sys/net/ipv4/conf/all/arp_filter
echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore
echo 2 > /proc/sys/net/ipv4/conf/all/arp_announce

ip rule add from all lookup local pref 100
ip rule del pref 0

# Configure Interface ib0
ip rule add iif ib0 lookup local pref 0
ip rule add from 192.168.0.11 table 10 pref 10
ip route add 192.168.0.0/16 dev ib0 src 192.168.0.11 table 10
ip route add local 192.168.0.11 dev ib0 src 192.168.0.11 table 10

# Configure Interface ib1
ip rule add iif ib1 lookup local pref 0
ip rule add from 192.168.1.11 table 11 pref 10
ip route add 192.168.0.0/16 dev ib1 src 192.168.1.11 table 11
ip route add local 192.168.1.11 dev ib1 src 192.168.1.11 table 11

# Choose the default output source port (rather than loopback)
ip rule add to 192.168.0.11 table 10 pref 10
ip rule add to 192.168.1.11 table 11 pref 10

# Make sure cache is flushed
ip route flush cache