Due to Linux routing quirks, when a node has two network interfaces, the HW address returned in the ARP reply for a specific IP address is not necessarily that of the interface being ARPed.
This causes problems for o2iblnd, because it resolves the address using IPoIB and gets the wrong InfiniBand address, which in turn leads to connection problems.
To work around this, we need to set up routing entries and rules that tell the Linux kernel to respond with the correct HW address.
trevis-40[1,2] are used as examples, but the same setup is needed on any other node with multiple interfaces of the same kind (MLX, OPA, ETH).
The main difference is in the routing rules. The rules explicitly cause the route selection algorithm to consult the ib0 or ib1 routing table based on the source prefix. Any packet whose source address is the IP address of ib0 or ib1 therefore triggers the matching rule and is first matched against the route in the corresponding table. This guarantees that messages use the correct interface. (ib0 and ib1 are used as examples.)
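The per-interface pattern described above can be sketched as a small helper that prints the three commands needed for one interface. This is a dry-run sketch, not the cluster's actual tooling: the function name policy_route_cmds is hypothetical, and the table id, subnet and address in the example call are the ones used for ib0 in the trevis-402 setup on this page.

```shell
#!/bin/bash
# Print (do not run) the commands that bind one interface to its own
# routing table. policy_route_cmds is a hypothetical helper name.
# Arguments: interface, table id, subnet (CIDR), IP address of the interface.
policy_route_cmds() {
    local dev=$1 id=$2 subnet=$3 src=$4
    # Name the table after the interface so "table ib0" can be used below.
    echo "echo $id $dev >> /etc/iproute2/rt_tables"
    # The only route in that table sends the subnet out of this interface.
    echo "ip route add $subnet dev $dev proto kernel scope link src $src table $dev"
    # Packets sourced from this interface's address consult that table first.
    echo "ip rule add from $src table $dev"
}

# Example: review the commands for ib0 before running them as root.
policy_route_cmds ib0 200 192.168.1.0/24 192.168.1.2
```

Printing the commands rather than executing them keeps the sketch safe to try on any machine; the output can be reviewed and then pasted into a root shell.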
trevis-401
trevis-401 is the most complicated node in the cluster. It has two ETH, two OPA and two MLX interfaces.
Setup
# Setting ARP so it doesn't broadcast
sysctl -w net.ipv4.conf.all.rp_filter=0
sysctl -w net.ipv4.conf.ib0.arp_ignore=1
sysctl -w net.ipv4.conf.ib0.arp_filter=0
sysctl -w net.ipv4.conf.ib0.arp_announce=2
sysctl -w net.ipv4.conf.ib0.rp_filter=0
sysctl -w net.ipv4.conf.ib1.arp_ignore=1
sysctl -w net.ipv4.conf.ib1.arp_filter=0
sysctl -w net.ipv4.conf.ib1.arp_announce=2
sysctl -w net.ipv4.conf.ib1.rp_filter=0
sysctl -w net.ipv4.conf.ib2.arp_ignore=1
sysctl -w net.ipv4.conf.ib2.arp_filter=0
sysctl -w net.ipv4.conf.ib2.arp_announce=2
sysctl -w net.ipv4.conf.ib2.rp_filter=0
sysctl -w net.ipv4.conf.ib3.arp_ignore=1
sysctl -w net.ipv4.conf.ib3.arp_filter=0
sysctl -w net.ipv4.conf.ib3.arp_announce=2
sysctl -w net.ipv4.conf.ib3.rp_filter=0
ip neigh flush dev ib0
ip neigh flush dev ib1
ip neigh flush dev ib2
ip neigh flush dev ib3
echo 200 ib0 >> /etc/iproute2/rt_tables
echo 201 ib1 >> /etc/iproute2/rt_tables
echo 202 ib2 >> /etc/iproute2/rt_tables
echo 203 ib3 >> /etc/iproute2/rt_tables
ip route add 192.168.0.0/16 dev ib0 proto kernel scope link src 192.168.1.1 table ib0
ip route add 192.168.0.0/16 dev ib1 proto kernel scope link src 192.168.2.1 table ib1
ip rule add from 192.168.1.1 table ib0
ip rule add from 192.168.2.1 table ib1
ip route add 172.16.0.0/16 dev ib2 proto kernel scope link src 172.16.1.1 table ib2
ip route add 172.16.0.0/16 dev ib3 proto kernel scope link src 172.16.2.1 table ib3
ip rule add from 172.16.1.1 table ib2
ip rule add from 172.16.2.1 table ib3
ip route flush cache
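The sysctl and neighbour-flush lines above differ only in the interface name, so they can be generated with a loop. A minimal sketch, assuming the same four values are wanted on every listed interface; arp_sysctl_cmds is a hypothetical helper that only prints the commands so they can be reviewed first.

```shell
#!/bin/bash
# Print the ARP-related sysctl and flush commands for each interface given
# as an argument. arp_sysctl_cmds is a hypothetical helper name.
arp_sysctl_cmds() {
    local dev
    # The global reverse-path filter setting is written once.
    echo "sysctl -w net.ipv4.conf.all.rp_filter=0"
    for dev in "$@"; do
        echo "sysctl -w net.ipv4.conf.$dev.arp_ignore=1"
        echo "sysctl -w net.ipv4.conf.$dev.arp_filter=0"
        echo "sysctl -w net.ipv4.conf.$dev.arp_announce=2"
        echo "sysctl -w net.ipv4.conf.$dev.rp_filter=0"
        # Drop stale neighbour entries learned before the settings changed.
        echo "ip neigh flush dev $dev"
    done
}

# Example for trevis-401, which has ib0 through ib3.
arp_sysctl_cmds ib0 ib1 ib2 ib3
```

The printed commands are the same ones listed in the setup, grouped per interface instead of per command type; the order within the list does not matter for these settings.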
trevis-402
Setup
# Setting ARP so it doesn't broadcast
sysctl -w net.ipv4.conf.all.rp_filter=0
sysctl -w net.ipv4.conf.ib0.arp_ignore=1
sysctl -w net.ipv4.conf.ib0.arp_filter=0
sysctl -w net.ipv4.conf.ib0.arp_announce=2
sysctl -w net.ipv4.conf.ib0.rp_filter=0
sysctl -w net.ipv4.conf.ib1.arp_ignore=1
sysctl -w net.ipv4.conf.ib1.arp_filter=0
sysctl -w net.ipv4.conf.ib1.arp_announce=2
sysctl -w net.ipv4.conf.ib1.rp_filter=0
ip neigh flush dev ib0
ip neigh flush dev ib1
echo 200 ib0 >> /etc/iproute2/rt_tables
echo 201 ib1 >> /etc/iproute2/rt_tables
ip route add 192.168.2.0/24 dev ib1 proto kernel scope link src 192.168.2.2 table ib1
ip route add 192.168.1.0/24 dev ib0 proto kernel scope link src 192.168.1.2 table ib0
ip rule add from 192.168.1.2 table ib0
ip rule add from 192.168.2.2 table ib1
ip route flush cache

# Try to get the system in the following state:
[root@trevis-402 ~]# ip route show table ib1
192.168.2.0/24 dev ib1 proto kernel scope link src 192.168.2.2
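Because the setup appends to /etc/iproute2/rt_tables with >>, running it a second time leaves duplicate table entries. A minimal sketch of an idempotent alternative; add_rt_table is a hypothetical helper, and the file path is a parameter so the logic can be tried on a scratch file before touching the real one.

```shell
#!/bin/bash
# add_rt_table ID NAME [FILE]
# Append "ID NAME" to FILE unless a non-comment line already uses that id
# or that name. add_rt_table is a hypothetical helper name.
add_rt_table() {
    local id=$1 name=$2 file=${3:-/etc/iproute2/rt_tables}
    if [ -f "$file" ] && awk -v id="$id" -v name="$name" '
            !/^[[:space:]]*#/ && ($1 == id || $2 == name) { found = 1 }
            END { exit !found }
        ' "$file"; then
        return 0    # already present, nothing to do
    fi
    echo "$id $name" >> "$file"
}

# Example on a scratch file: the second call is a no-op.
tables=$(mktemp)
add_rt_table 200 ib0 "$tables"
add_rt_table 200 ib0 "$tables"
cat "$tables"
rm -f "$tables"
```

On a node the third argument would be omitted so that the default path is used.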
trevis-404
Setup
# Setting ARP so it doesn't broadcast
sysctl -w net.ipv4.conf.all.rp_filter=0
sysctl -w net.ipv4.conf.ib0.arp_ignore=1
sysctl -w net.ipv4.conf.ib0.arp_filter=0
sysctl -w net.ipv4.conf.ib0.arp_announce=2
sysctl -w net.ipv4.conf.ib0.rp_filter=0
sysctl -w net.ipv4.conf.ib1.arp_ignore=1
sysctl -w net.ipv4.conf.ib1.arp_filter=0
sysctl -w net.ipv4.conf.ib1.arp_announce=2
sysctl -w net.ipv4.conf.ib1.rp_filter=0
ip neigh flush dev ib0
ip neigh flush dev ib1
echo 200 ib0 >> /etc/iproute2/rt_tables
echo 201 ib1 >> /etc/iproute2/rt_tables
ip route add 172.16.1.0/24 dev ib0 proto kernel scope link src 172.16.1.4 table ib0
ip route add 172.16.2.0/24 dev ib1 proto kernel scope link src 172.16.2.4 table ib1
ip rule add from 172.16.1.4 table ib0
ip rule add from 172.16.2.4 table ib1
ip route flush cache

# Try to get the system in the following state:
[root@trevis-404 ~]# ip route show table ib1
172.16.2.0/24 dev ib1 proto kernel scope link src 172.16.2.4
Troubleshooting
# If the above setup doesn't resolve the issue, try the following steps.
# Flush the ARP cache on the other nodes, so that there is no confusion with addressing.
ip -s -s neigh flush all
arp -n # show arp entries
# Look at the rules with:
ip rule show
# Make sure that the rules are in the correct priority order.
# 0 is the highest priority.
# Priority 0 is always the local routing table, which has all the default local and broadcast routes.
# 32766 is the main routing table, so all other policy routing rules should have a higher priority (that is, a lower number) than this one.
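The priority check can also be done mechanically. A minimal sketch, assuming the usual "PRIORITY: from SELECTOR lookup TABLE" layout of the ip rule show output; check_rule_order is a hypothetical helper that reads that output on stdin and reports every source-specific rule that would be evaluated after the main table.

```shell
#!/bin/bash
# Read "ip rule show" output on stdin. Exit 0 if every rule that selects on
# a specific source address comes before the rule for the main table,
# otherwise print the offending rules and exit 1.
# check_rule_order is a hypothetical helper name.
check_rule_order() {
    awk '
        {
            # Field 1 is the priority followed by a colon, e.g. "32766:".
            prio = $1; sub(/:$/, "", prio); prio += 0
            # The table name is the word that follows "lookup".
            tbl = ""
            for (i = 1; i < NF; i++) if ($i == "lookup") tbl = $(i + 1)
        }
        # Remember the priority at which the main table is consulted.
        $2 == "from" && $3 == "all" && tbl == "main" { main = prio; have_main = 1; next }
        # Collect rules that match a specific source address.
        $2 == "from" && $3 != "all" { rule[$3] = prio; table[$3] = tbl }
        END {
            bad = 0
            for (src in rule)
                if (have_main && rule[src] > main) {
                    printf "rule from %s (table %s, priority %d) is after main (%d)\n", src, table[src], rule[src], main
                    bad = 1
                }
            exit bad
        }
    '
}

# Example with captured output; on a node use: ip rule show | check_rule_order
printf '0:\tfrom all lookup local\n32765:\tfrom 192.168.1.2 lookup ib0\n32766:\tfrom all lookup main\n' |
    check_rule_order && echo "rule order OK"
```

Reading from stdin keeps the check usable on output saved from another node.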
Instability
There is some instability with the MR cluster, specifically with trevis-401.
On many occasions the OPA interfaces are not pingable. If that happens, try shutting the node down, waiting 15 seconds and starting it back up.
pm -0 trevis-401 # shutdown
pm -1 trevis-401 # start
Parameter Explanation
arp_announce/arp_ignore
arp_announce - INTEGER
Define different restriction levels for announcing the local
source IP address from IP packets in ARP requests sent on
interface:
0 - (default) Use any local address, configured on any interface
1 - Try to avoid local addresses that are not in the target's
subnet for this interface. This mode is useful when target
hosts reachable via this interface require the source IP
address in ARP requests to be part of their logical network
configured on the receiving interface. When we generate the
request we will check all our subnets that include the
target IP and will preserve the source address if it is from
such subnet. If there is no such subnet we select source
address according to the rules for level 2.
2 - Always use the best local address for this target.
In this mode we ignore the source address in the IP packet
and try to select local address that we prefer for talks with
the target host. Such local address is selected by looking
for primary IP addresses on all our subnets on the outgoing
interface that include the target IP address. If no suitable
local address is found we select the first local address
we have on the outgoing interface or on all other interfaces,
with the hope we will receive reply for our request and
even sometimes no matter the source IP address we announce.
The max value from conf/{all,interface}/arp_announce is used.
Increasing the restriction level gives more chance for
receiving answer from the resolved target while decreasing
the level announces more valid sender's information.
arp_ignore - INTEGER
Define different modes for sending replies in response to
received ARP requests that resolve local target IP addresses:
0 - (default): reply for any local target IP address, configured
on any interface
1 - reply only if the target IP address is local address
configured on the incoming interface
2 - reply only if the target IP address is local address
configured on the incoming interface and both with the
sender's IP address are part from same subnet on this interface
3 - do not reply for local addresses configured with scope host,
only resolutions for global and link addresses are replied
4-7 - reserved
8 - do not reply for all local addresses
The max value from conf/{all,interface}/arp_ignore is used
when ARP request is received on the {interface}
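As the excerpt states, for arp_announce and arp_ignore the kernel uses the larger of the "all" and per-interface values, so a value written for ib0 can be overridden by a larger value under "all". A minimal sketch for inspecting the effective value; effective_conf is a hypothetical helper, and its optional third argument (an alternative root for the /proc/sys/net/ipv4/conf tree) exists only so the logic can be tried against a scratch directory.

```shell
#!/bin/bash
# effective_conf KEY DEV [ROOT]
# Print max(conf/all/KEY, conf/DEV/KEY), which is how the kernel combines
# arp_announce and arp_ignore. effective_conf is a hypothetical helper name.
effective_conf() {
    local key=$1 dev=$2 root=${3:-/proc/sys/net/ipv4/conf}
    local all_val dev_val
    all_val=$(cat "$root/all/$key") || return 1
    dev_val=$(cat "$root/$dev/$key") || return 1
    if [ "$all_val" -gt "$dev_val" ]; then
        echo "$all_val"
    else
        echo "$dev_val"
    fi
}

# Example (on a node): effective_conf arp_ignore ib0
```

The helper is only meant for the two parameters quoted here; other per-interface settings are not necessarily combined the same way.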
Misc
Crash dump files are in:
/scratch/dumps/trevis-40x.hpdd.intel.com/
Installing OPA utilities
yum install opa-fastfabric
For other OPA related downloads:
https://downloadcenter.intel.com/product/92003/Intel-Omni-Path-Host-Fabric-Interface-Products