If a non-MR (non-Multi-Rail) peer running 2.10.8 is discovered by a 2.12 MR peer, the following problem may occur: if the non-MR peer has LNets that are not defined on the MR peer, a NID on one of those undefined LNets may be listed as the peer's primary NID. This later causes communication problems when mounting.
Status
Open, reproducible in 2.12. Not an issue on master and 2.13.
Steps To Reproduce
Configuration
Prepare two nodes, one running 2.10.8 and the other a 2.12.4 build. Configure LNet on each node similar to the following:
PeerA (2.10.8 non-MR):

    #> lnetctl net show
    net:
        - net type: lo
          local NI(s):
            - nid: 0@lo
              status: up
        - net type: o2ib
          local NI(s):
            - nid: 192.168.1.123@o2ib
              status: up
              interfaces:
                  0: ib0
        - net type: o2ib4
          local NI(s):
            - nid: 192.168.1.123@o2ib4
              status: up
              interfaces:
                  0: ib0

PeerB (2.12.4 MR):

    #> lnetctl net show
    net:
        - net type: lo
          local NI(s):
            - nid: 0@lo
              status: up
        - net type: o2ib4
          local NI(s):
            - nid: 192.168.1.105@o2ib4
              status: up
              interfaces:
                  0: ib0
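The precondition for the problem is that PeerA has at least one LNet that PeerB does not define. As a rough sketch, the check can be done by comparing the `net type` entries of the two `lnetctl net show` outputs. The helper name `lnets` and the trimmed sample strings are illustrative assumptions, not part of LNet or lnetctl.

```python
import re


def lnets(net_show_output):
    """Return the set of LNet names found in `lnetctl net show` text.

    Hypothetical helper: it only looks for 'net type: <name>' entries,
    so it works on both the indented and the flattened output.
    """
    return set(re.findall(r"net type:\s*(\S+)", net_show_output))


# Sample outputs, trimmed to the lines that matter for this check.
PEER_A = """\
net:
    - net type: lo
    - net type: o2ib
    - net type: o2ib4
"""

PEER_B = """\
net:
    - net type: lo
    - net type: o2ib4
"""

# LNets configured on PeerA that PeerB does not define.
undefined_on_b = lnets(PEER_A) - lnets(PEER_B)
print(sorted(undefined_on_b))
```

With the configuration shown here, the only LNet missing on PeerB is `o2ib`, which is the net that ends up carrying PeerA's primary NID in the problem case.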
Procedure
Run discovery of PeerA from PeerB and check the results:
Problem Behaviour:

    #> lnetctl discover 192.168.1.123@o2ib4
    discover:
        - primary nid: 192.168.1.123@o2ib
          Multi-Rail: False
          peer ni:
            - nid: 192.168.1.123@o2ib4
            - nid: 192.168.1.123@o2ib
    #> lnetctl peer show
    peer:
        - primary nid: 192.168.1.123@o2ib
          Multi-Rail: False
          peer ni:
            - nid: 192.168.1.123@o2ib4
              state: NA
            - nid: 192.168.1.123@o2ib
              state: NA

Expected Behaviour:

    #> lnetctl discover 192.168.1.123@o2ib4
    discover:
        - primary nid: 192.168.1.123@o2ib4
          Multi-Rail: False
          peer ni:
            - nid: 192.168.1.123@o2ib4
    #> lnetctl peer show
    peer:
        - primary nid: 192.168.1.123@o2ib4
          Multi-Rail: False
          peer ni:
            - nid: 192.168.1.123@o2ib4
              state: NA
Note that in the "problem" scenario, PeerA's primary NID is on the o2ib net, which is not defined on PeerB and therefore not accessible from it.
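To spot the problem quickly on an affected node, one can compare the net of each reported primary NID with the locally configured LNets. The following is a minimal sketch that works on captured `lnetctl peer show` (or `lnetctl discover`) text. The function names are hypothetical, and it assumes no LNet routes are configured, so any net not defined locally is treated as unreachable.

```python
import re


def nid_net(nid):
    """Return the LNet part of a NID, e.g. 'o2ib4' for '192.168.1.123@o2ib4'."""
    return nid.rsplit("@", 1)[1]


def unreachable_primaries(peer_output, local_nets):
    """List primary NIDs whose net is not among the local LNets.

    Hypothetical helper; assumes no routing, so 'not local' means
    'not reachable'.
    """
    primaries = re.findall(r"primary nid:\s*(\S+)", peer_output)
    return [nid for nid in primaries if nid_net(nid) not in local_nets]


# LNets defined on PeerB in this reproduction.
LOCAL_NETS = {"lo", "o2ib4"}

# Captured `lnetctl peer show` output for the two cases.
PROBLEM = """\
peer:
    - primary nid: 192.168.1.123@o2ib
      Multi-Rail: False
      peer ni:
        - nid: 192.168.1.123@o2ib4
          state: NA
        - nid: 192.168.1.123@o2ib
          state: NA
"""

EXPECTED = """\
peer:
    - primary nid: 192.168.1.123@o2ib4
      Multi-Rail: False
      peer ni:
        - nid: 192.168.1.123@o2ib4
          state: NA
"""

print(unreachable_primaries(PROBLEM, LOCAL_NETS))
print(unreachable_primaries(EXPECTED, LOCAL_NETS))
```

A non-empty list for the problem output and an empty list for the expected output matches the behaviour described in this section.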
Resolution
A port of LU-11641 resolves the issue.
References
- DDN-1228
- LU-13548