Due to Linux routing quirks, if there are two network interfaces on the same node, the HW address returned in the ARP reply for a specific IP is not necessarily the address of the interface that was ARPed.
This causes problems for o2iblnd: it resolves the address using IPoIB, gets the wrong InfiniBand address, and connections then fail.
To get around this problem we need to set up routing entries and rules that tell the Linux kernel to respond with the correct HW address.
trevis-40[1,2] is used as the example here, but the same setup is needed on any other node with multiple interfaces of the same kind (MLX, OPA, ETH).
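The per-node setups that follow repeat the same pattern for every interface: adjust the ARP and reverse-path sysctls, flush the neighbour entries, register a routing table, then add a route and a source-based rule. A minimal sketch of that pattern as a dry-run helper is shown here. The function name `print_iface_setup` and its argument order are hypothetical; it only prints the commands so they can be reviewed before being run as root.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the setup commands for one interface.
# Arguments: interface name, routing table id, subnet in CIDR form, source IP.
print_iface_setup() {
    ifname=$1; table_id=$2; subnet=$3; src=$4
    cat <<EOF
sysctl -w net.ipv4.conf.${ifname}.arp_ignore=1
sysctl -w net.ipv4.conf.${ifname}.arp_filter=0
sysctl -w net.ipv4.conf.${ifname}.arp_announce=2
sysctl -w net.ipv4.conf.${ifname}.rp_filter=0
ip neigh flush dev ${ifname}
echo ${table_id} ${ifname} >> /etc/iproute2/rt_tables
ip route add ${subnet} dev ${ifname} proto kernel scope link src ${src} table ${ifname}
ip rule add from ${src} table ${ifname}
EOF
}

# Example using the trevis-402 ib0 values; this prints the commands only.
print_iface_setup ib0 200 192.168.1.0/24 192.168.1.2
```

The global steps (`sysctl -w net.ipv4.conf.all.rp_filter=0` before, `ip route flush cache` after) are left out of the helper because they are run once per node rather than once per interface.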
trevis-401
trevis-401 is the most complicated node in the cluster. It has 2 ETH, 2 OPA and 2 MLX interfaces.
Setup
# Setting ARP so it doesn't broadcast
sysctl -w net.ipv4.conf.all.rp_filter=0
sysctl -w net.ipv4.conf.ib0.arp_ignore=1
sysctl -w net.ipv4.conf.ib0.arp_filter=0
sysctl -w net.ipv4.conf.ib0.arp_announce=2
sysctl -w net.ipv4.conf.ib0.rp_filter=0
sysctl -w net.ipv4.conf.ib1.arp_ignore=1
sysctl -w net.ipv4.conf.ib1.arp_filter=0
sysctl -w net.ipv4.conf.ib1.arp_announce=2
sysctl -w net.ipv4.conf.ib1.rp_filter=0
sysctl -w net.ipv4.conf.ib2.arp_ignore=1
sysctl -w net.ipv4.conf.ib2.arp_filter=0
sysctl -w net.ipv4.conf.ib2.arp_announce=2
sysctl -w net.ipv4.conf.ib2.rp_filter=0
sysctl -w net.ipv4.conf.ib3.arp_ignore=1
sysctl -w net.ipv4.conf.ib3.arp_filter=0
sysctl -w net.ipv4.conf.ib3.arp_announce=2
sysctl -w net.ipv4.conf.ib3.rp_filter=0

ip neigh flush dev ib0
ip neigh flush dev ib1
ip neigh flush dev ib2
ip neigh flush dev ib3

echo 200 ib0 >> /etc/iproute2/rt_tables
echo 201 ib1 >> /etc/iproute2/rt_tables
echo 202 ib2 >> /etc/iproute2/rt_tables
echo 203 ib3 >> /etc/iproute2/rt_tables

ip route add 192.168.0.0/16 dev ib0 proto kernel scope link src 192.168.1.1 table ib0
ip route add 192.168.0.0/16 dev ib1 proto kernel scope link src 192.168.2.1 table ib1
ip rule add from 192.168.1.1 table ib0
ip rule add from 192.168.2.1 table ib1

ip route add 172.16.0.0/16 dev ib2 proto kernel scope link src 172.16.1.1 table ib2
ip route add 172.16.0.0/16 dev ib3 proto kernel scope link src 172.16.2.1 table ib3
ip rule add from 172.16.1.1 table ib2
ip rule add from 172.16.2.1 table ib3

ip route flush cache
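Because the `echo ... >> /etc/iproute2/rt_tables` lines append unconditionally, re-running the setup adds duplicate table entries. One way to avoid that is to register a table only when its name is not already present. The sketch below assumes this approach; the function name `add_rt_table` and its optional third argument (an alternate file, useful for trying it out) are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: append "<id> <name>" to an rt_tables file only if no
# entry with that table name exists yet. Defaults to /etc/iproute2/rt_tables.
add_rt_table() {
    id=$1; name=$2; file=${3:-/etc/iproute2/rt_tables}
    # Look for a line of the form "<number> <name>" (name as a whole word).
    if ! grep -Eq "^[0-9]+[[:space:]]+${name}([[:space:]]|\$)" "$file" 2>/dev/null; then
        echo "$id $name" >> "$file"
    fi
}

# Usage on the node (as root), replacing the plain echo lines:
#   add_rt_table 200 ib0
#   add_rt_table 201 ib1
#   add_rt_table 202 ib2
#   add_rt_table 203 ib3
```

The check is keyed on the table name only, so an existing entry that maps the same name to a different id is left untouched and should be reviewed by hand.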
trevis-402
Setup
# Setting ARP so it doesn't broadcast
sysctl -w net.ipv4.conf.all.rp_filter=0
sysctl -w net.ipv4.conf.ib0.arp_ignore=1
sysctl -w net.ipv4.conf.ib0.arp_filter=0
sysctl -w net.ipv4.conf.ib0.arp_announce=2
sysctl -w net.ipv4.conf.ib0.rp_filter=0
sysctl -w net.ipv4.conf.ib1.arp_ignore=1
sysctl -w net.ipv4.conf.ib1.arp_filter=0
sysctl -w net.ipv4.conf.ib1.arp_announce=2
sysctl -w net.ipv4.conf.ib1.rp_filter=0

ip neigh flush dev ib0
ip neigh flush dev ib1

echo 200 ib0 >> /etc/iproute2/rt_tables
echo 201 ib1 >> /etc/iproute2/rt_tables

ip route add 192.168.2.0/24 dev ib1 proto kernel scope link src 192.168.2.2 table ib1
ip route add 192.168.1.0/24 dev ib0 proto kernel scope link src 192.168.1.2 table ib0
ip rule add from 192.168.1.2 table ib0
ip rule add from 192.168.2.2 table ib1

ip route flush cache

# Try to get the system in the following state:
[root@trevis-402 ~]# ip route show table ib0
192.168.1.0/24 dev ib0 proto kernel scope link src 192.168.1.2
[root@trevis-402 ~]# ip route show table ib1
192.168.2.0/24 dev ib1 proto kernel scope link src 192.168.2.2

# Also make sure to flush the ARP cache on the other nodes, so that there is no confusion with addressing.
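The target state can also be checked with a small script instead of by eye. The sketch below is one possible check, not part of the original procedure: the function name `route_table_ok` is hypothetical, and it simply tests whether the `ip route show table` output on stdin contains the expected line for an interface.

```shell
#!/bin/sh
# Hypothetical check: read "ip route show table <name>" output on stdin and
# succeed only if it contains the expected route line for the interface.
# Arguments: interface name, subnet in CIDR form, source IP.
route_table_ok() {
    ifname=$1; subnet=$2; src=$3
    # Strip trailing whitespace first, since ip may print a trailing space.
    sed 's/[[:space:]]*$//' |
        grep -qxF "$subnet dev $ifname proto kernel scope link src $src"
}

# Usage on trevis-402 (as root):
#   ip route show table ib0 | route_table_ok ib0 192.168.1.0/24 192.168.1.2 && echo "ib0 ok"
#   ip route show table ib1 | route_table_ok ib1 192.168.2.0/24 192.168.2.2 && echo "ib1 ok"
# On the other nodes, stale entries can then be cleared with the same command
# used above, for example: ip neigh flush dev ib0
```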