Due to Linux routing quirks, if a node has two network interfaces, the hardware (HW) address returned in the ARP reply for a specific IP address is not necessarily that of the interface the address is configured on.

This is a problem for o2iblnd, which resolves addresses over IPoIB and can therefore get the wrong InfiniBand address, which in turn leads to connection problems.

To work around this, we need to set up routing entries and rules that tell the Linux kernel to respond with the correct HW address.

I use trevis-40[1,2] as an example, but the same needs to be done on any other node with multiple interfaces of the same kind (MLX, OPA or ETH).
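
Since the same three steps (routing table name, route, rule) repeat for every interface on every such node, a small generator can help when adapting the setup to another node. The following is only a sketch, assuming bash; `policy_route_cmds` is a hypothetical helper name, and the `grep` guard in front of the `rt_tables` append is my addition to avoid duplicate entries when the setup is re-run. It only prints the commands so they can be reviewed before being run as root.

```shell
# Hypothetical helper, not part of the original setup.
# Usage: policy_route_cmds TABLE_ID DEV SUBNET SRC_IP
# Prints the per-interface policy routing commands instead of running them.
policy_route_cmds() {
    local id="$1" dev="$2" subnet="$3" src="$4"
    local tables=/etc/iproute2/rt_tables

    # Register the table name once (the guard is an assumption, see above).
    echo "grep -qw $dev $tables || echo $id $dev >> $tables"
    # Route for the subnet in the interface's own table.
    echo "ip route add $subnet dev $dev proto kernel scope link src $src table $dev"
    # Traffic sourced from this address is looked up in that table.
    echo "ip rule add from $src table $dev"
}
```

For example, `policy_route_cmds 201 ib1 192.168.2.0/24 192.168.2.2` prints the commands for ib1 on trevis-402.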

trevis-401

trevis-401 is the most complicated node in the cluster: it has 2 ETH, 2 OPA and 2 MLX interfaces.

Setup

# Configure ARP so that each interface only answers for its own address
sysctl -w net.ipv4.conf.all.rp_filter=0
sysctl -w net.ipv4.conf.ib0.arp_ignore=1
sysctl -w net.ipv4.conf.ib0.arp_filter=0
sysctl -w net.ipv4.conf.ib0.arp_announce=2
sysctl -w net.ipv4.conf.ib0.rp_filter=0

sysctl -w net.ipv4.conf.ib1.arp_ignore=1
sysctl -w net.ipv4.conf.ib1.arp_filter=0
sysctl -w net.ipv4.conf.ib1.arp_announce=2
sysctl -w net.ipv4.conf.ib1.rp_filter=0

sysctl -w net.ipv4.conf.ib2.arp_ignore=1
sysctl -w net.ipv4.conf.ib2.arp_filter=0
sysctl -w net.ipv4.conf.ib2.arp_announce=2
sysctl -w net.ipv4.conf.ib2.rp_filter=0

sysctl -w net.ipv4.conf.ib3.arp_ignore=1
sysctl -w net.ipv4.conf.ib3.arp_filter=0
sysctl -w net.ipv4.conf.ib3.arp_announce=2
sysctl -w net.ipv4.conf.ib3.rp_filter=0

ip neigh flush dev ib0
ip neigh flush dev ib1
ip neigh flush dev ib2
ip neigh flush dev ib3
 
echo 200 ib0 >> /etc/iproute2/rt_tables
echo 201 ib1 >> /etc/iproute2/rt_tables
echo 202 ib2 >> /etc/iproute2/rt_tables
echo 203 ib3 >> /etc/iproute2/rt_tables

ip route add 192.168.0.0/16 dev ib0 proto kernel scope link src 192.168.1.1 table ib0
ip route add 192.168.0.0/16 dev ib1 proto kernel scope link src 192.168.2.1 table ib1
ip rule add from 192.168.1.1 table ib0
ip rule add from 192.168.2.1 table ib1
 
ip route add 172.16.0.0/16 dev ib2 proto kernel scope link src 172.16.1.1 table ib2
ip route add 172.16.0.0/16 dev ib3 proto kernel scope link src 172.16.2.1 table ib3
ip rule add from 172.16.1.1 table ib2
ip rule add from 172.16.2.1 table ib3
ip route flush cache
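
The sysctl and neighbour-flush lines above are identical for every interface, so they can be written as a loop. This is a minimal sketch, assuming bash; `apply_arp_settings` and the `DRY_RUN` switch are names I made up for illustration. With `DRY_RUN=1` the commands are printed instead of executed.

```shell
# Hypothetical helper, not part of the original setup.
# Usage: apply_arp_settings IFACE...
apply_arp_settings() {
    local run=""
    if [ "${DRY_RUN:-0}" = "1" ]; then
        run="echo"    # print the commands instead of running them
    fi

    $run sysctl -w net.ipv4.conf.all.rp_filter=0

    local ifc key
    for ifc in "$@"; do
        # Same four settings as in the setup above, per interface.
        for key in arp_ignore=1 arp_filter=0 arp_announce=2 rp_filter=0; do
            $run sysctl -w "net.ipv4.conf.${ifc}.${key}"
        done
    done

    # Drop cached neighbour entries once all settings are in place.
    for ifc in "$@"; do
        $run ip neigh flush dev "$ifc"
    done
}
```

On trevis-401 this would be called as `apply_arp_settings ib0 ib1 ib2 ib3` (as root), or with `DRY_RUN=1` in front to check the commands first.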

trevis-402 

Setup

# Configure ARP so that each interface only answers for its own address
sysctl -w net.ipv4.conf.all.rp_filter=0
sysctl -w net.ipv4.conf.ib0.arp_ignore=1
sysctl -w net.ipv4.conf.ib0.arp_filter=0
sysctl -w net.ipv4.conf.ib0.arp_announce=2
sysctl -w net.ipv4.conf.ib0.rp_filter=0

sysctl -w net.ipv4.conf.ib1.arp_ignore=1
sysctl -w net.ipv4.conf.ib1.arp_filter=0
sysctl -w net.ipv4.conf.ib1.arp_announce=2
sysctl -w net.ipv4.conf.ib1.rp_filter=0

ip neigh flush dev ib0
ip neigh flush dev ib1
 
echo 200 ib0 >> /etc/iproute2/rt_tables
echo 201 ib1 >> /etc/iproute2/rt_tables
 
ip route add 192.168.2.0/24 dev ib1 proto kernel scope link src 192.168.2.2 table ib1
ip route add 192.168.1.0/24 dev ib0 proto kernel scope link src 192.168.1.2 table ib0
ip rule add from 192.168.1.2 table ib0
ip rule add from 192.168.2.2 table ib1
ip route flush cache

# Try to get the system in the following state:
[root@trevis-402 ~]# ip route show table ib1
192.168.2.0/24 dev ib1  proto kernel  scope link  src 192.168.2.2 

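
Checking the table contents by eye gets tedious with several interfaces, so the comparison can be scripted. The following is a sketch, assuming bash and awk; `route_matches` is a hypothetical helper that reads `ip route show table ...` output on stdin and succeeds only if it contains a route with the given prefix, device and source address.

```shell
# Hypothetical helper, not part of the original setup.
# Usage: ip route show table TABLE | route_matches PREFIX DEV SRC_IP
route_matches() {
    local prefix="$1" dev="$2" src="$3"
    awk -v p="$prefix" -v d="$dev" -v s="$src" '
        $1 == p {
            dev_ok = 0; src_ok = 0
            # Look for the "dev X" and "src Y" keyword/value pairs.
            for (i = 2; i < NF; i++) {
                if ($i == "dev" && $(i + 1) == d) dev_ok = 1
                if ($i == "src" && $(i + 1) == s) src_ok = 1
            }
            if (dev_ok && src_ok) found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}
```

For example: `ip route show table ib1 | route_matches 192.168.2.0/24 ib1 192.168.2.2 && echo "ib1 table OK"`.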

Troubleshooting

# If the above setup doesn't resolve the issue, try the following steps.
# Flush the ARP cache on the other nodes as well, so that no stale HW addresses remain:
ip -s -s neigh flush all
arp -n # show arp entries
# Look at the rules with:
# ip rule show
# Make sure the rules have the correct priority.
# 0 is the highest priority; lower numbers are evaluated first.
# 0 is always the rule for the local routing table, which has all the default local and broadcast routes.
# 32766 is the rule for the main routing table, so the rules for all other policy routing tables should have a higher priority (a lower number) than this one.
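
The priority check described above can also be automated. This is only a sketch, assuming bash and awk and the usual `PRIO: from ... lookup TABLE` layout of `ip rule show` output; `check_rule_priorities` is a hypothetical helper. It reports every rule for a table other than local, main or default whose priority number is 32766 or larger, i.e. a rule that would not be consulted before the main table.

```shell
# Hypothetical helper, not part of the original setup.
# Usage: ip rule show | check_rule_priorities
check_rule_priorities() {
    awk '
        {
            prio = $1
            sub(/:$/, "", prio)          # "32765:" -> "32765"
            table = ""
            for (i = 2; i < NF; i++)
                if ($i == "lookup") table = $(i + 1)
            # The three built-in tables are not policy routing tables.
            if (table == "" || table == "local" || table == "main" || table == "default")
                next
            if (prio + 0 >= 32766) {
                print "rule for table " table " has priority " prio ", after main"
                bad = 1
            }
        }
        END { exit (bad ? 1 : 0) }
    '
}
```

It prints nothing and returns 0 when all policy rules come before the main table.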

Instability

There is some instability with the MR cluster, specifically with trevis-401. 

On many occasions the OPA interfaces are not pingable. If that happens, try shutting the node down, waiting 15 seconds and starting it back up.

pm -0 trevis-401 # shutdown
pm -1 trevis-401 # start
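
The two commands and the wait in between can be wrapped in one function. This is a sketch, assuming bash; `power_cycle_node` and the `PM_CMD` override are hypothetical names I added, the latter only so the sequence can be tried without touching a real node.

```shell
# Hypothetical helper, not part of the original page.
# Usage: power_cycle_node NODE [WAIT_SECONDS]
power_cycle_node() {
    local node="$1" wait_s="${2:-15}"
    local pm_cmd="${PM_CMD:-pm}"

    "$pm_cmd" -0 "$node" || return 1    # shutdown
    sleep "$wait_s"                     # 15 seconds by default, as above
    "$pm_cmd" -1 "$node"                # start
}
```

For example: `power_cycle_node trevis-401`.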

Misc

Crash dump files are in:

/scratch/dumps/trevis-40x.hpdd.intel.com/

Installing OPA utilities

yum install opa-fastfabric

For other OPA related downloads:

https://downloadcenter.intel.com/product/92003/Intel-Omni-Path-Host-Fabric-Interface-Products
