This page tracks issues that we run into while testing the Multi-Rail feature.
Problem List
Description | Priority | Reporter | Notes |
---|---|---|---|
With 17 interfaces, trying to discover on any of the interfaces the first time returns the error "No route to host". Was able to reproduce: following the steps in UT-DD-EN-0005 exactly, the first attempt to discover any of the NIDs fails. Steps to reproduce: on P1 and P2: lnetctl lnet configure; lnetctl net add --net tcp --if eth0,eth1,eth2,eth3,eth4,eth5,eth6,eth7,eth8,eth9,eth10,eth11,eth12,eth13,eth14,eth15,eth16. P1: lnetctl ping --nid 192.168.122.42@tcp. P2: lnetctl ping --nid 192.168.122.23@tcp. P1: [root@MRtest01 Lustre]# lnetctl discover --nid 192.168.122.46@tcp manage: - discover: errno: -1 descr: failed to discover 192.168.122.46@tcp: No route to host. The second attempt works. | Critical | Amir | |
With 17 interfaces discovered, "show peer" hangs. When the rc from the kernel is non-zero, the structure is not copied out of the kernel to user space. The user-space code depends on that copy in order to receive the new size when the data to be copied out is too big for the buffer passed in by the user. Since the copy doesn't happen when rc == -E2BIG, user space gets into an infinite loop sending IOCTLs to the kernel. From libcfs_ioctl(): if (err == 0) { if (copy_to_user(uparam, hdr, hdr->ioc_len)) err = -EFAULT; } break; The buffer is only copied to user space if the ioctl handler returns 0. Not really sure if it's safe to change that. | | Amir | This has been fixed |
"lnetctl discover" command hangs with discovery off. This happened once, so it may be an intermittent issue. Will try to reproduce. | Major | Sonia | |
"lnetctl discover" discovers the peer even with discovery off | Major | Sonia | |
"lnetctl discover --force" expects a parameter (no parameter should be needed with --force). | Major | Doug | |
Doug: I configured a Parallels VM with 16 interfaces (it won't let me configure 17, as 16 is the limit). When I "lctl network configure" with no YAML or module parameters, I get this error from ksocklnd: "Mar 7 14:01:16 centos-7 kernel: LNet: 5111:0:(socklnd.c:2652:ksocknal_enumerate_interfaces()) Ignoring interface virbr0 (too many interfaces)". | Minor | Doug | |
Doug: When discovering a node with 16 interfaces via "lnetctl discover --nid", it works, but I am seeing this log: Mar 7 15:48:04 centos-7 kernel: LNetError: 24769:0:(peer.c:1726:lnet_peer_push_event()) Push Put from unknown 0@<0:0> (source 0@<0:0>) | Minor | Doug | Olaf: I've seen the message before. Not sure how best to get rid of it, but it would be safe to just not emit it for the loopback case. |
Doug: Tried the command "lnetctl set discovery" (no args) and got a core dump. | Major | Doug | Olaf: As I recall, the problem is |
Doug: How do I query the discovery setting? There is no "lnetctl show discovery" (should there be?). I tried "lnetctl set discovery" with no parameters and that core dumped (see previous bullet). From a usability perspective, there is no obvious way to get this information. | Minor | Doug | Doug: was told it is "lnetctl global show". Don't like that (see Usability section) but see this problem as solved. |
Doug: When I enter "lnetctl set" to see what options I can set, I get this: [root@centos-7 ~]# lnetctl set set {tiny_buffers | small_buffers | large_buffers | routing} It does not mention "discovery" at all. | Major | Doug | Olaf: So the text in |
When you do "lnetctl export" global section doesn't show discovery status | Minor | Amir | |
Doug: Test: UT-DD-EN-0002. The first call to discover P2 fails with this: [root@centos-7 ~]# lnetctl discover --nid 10.211.55.62@tcp manage: - discover: errno: -1 descr: failed to discover 10.211.55.62@tcp: No route to host The second attempt works as expected. I repeated the test twice and got the same result each time. | Critical | Doug | |
Doug: Test: UT-DD-EN-0003. Same behaviour as above. | Critical | Doug | |
Amir: There should be a way to turn off Multi-Rail. From the code the | Critical | Amir | |
Doug: Test: UT-DD-DIS-0001. Test passed and worked as described. However, I decided to run lnet-selftest to see how running some traffic after the test goes. The lnet-selftest failed (it never stopped running and did not show any stats). I looked at top and can see that the "lnet-discovery" thread is using 70% CPU (it was not using any CPU prior to running lnet-selftest). I suspect it is receiving all incoming traffic, so lnet-selftest is not getting anything. Additional note: I redid this test but ran lnet-selftest before trying to invoke discovery. The lnet-discovery thread still takes off and lnet-selftest locks. It seems that turning off discovery causes the lnet-discovery thread to misbehave. | Blocker | Doug | So this doesn't have to do specifically with the test case. It happens when discovery is off and you run lnet_selftest. From the logs I collected, it appears that for each selftest message being sent, the same NID gets queued on the discovery thread, but it doesn't end up doing anything. In effect it goes into a tight loop: it tries to discover, but because discovery is off it doesn't, and probably doesn't update the state properly, so the next message to the same peer triggers the NID to be queued on the discovery thread again. Since the discovery thread does pretty heavy locking, it drives the system into a grind. Selftest also reacts poorly and hangs the node. I don't think we should be queuing anything on the discovery thread if discovery is off. This has been fixed. |
Doug: After the previous point, I tried to kill the lnet-discovery thread. It did not stop. I then tried "kill -9". It still did not stop. I then did a reboot. Node went away and could not complete the reboot because it could not unload LNet. Had to reset the node. We need a way to stop any of our worker threads when things do not go well. Hard resetting a node in the field will be unacceptable to customers. | Blocker | Doug | This has been fixed |
DD doesn't handle the case where a Multi-Rail peer is torn down and then booted with a down-rev (non-MR) Lustre. This needs to be handled. Both this scenario and turning off the Multi-Rail feature are going to be handled fairly similarly: the node gets configured to !MR; it pushes with the MR flag not set; the peer receives the push, tears down the peer and recreates it; from then on, the peer is viewed as non-MR, and future message exchanges will set up the preferred NID. In the down-rev case, the admin will need to trigger an explicit discover; the ping response comes back with the feature bit off, and the rest is the same as above. | Blocker | Amir | |
DD doesn't send a push when an interface is added to an existing network | Blocker | Amir | The functionality to trigger a push when the configuration is updated is missing. Olaf: Adding an interface should cause Amir: As discussed, there is the scenario where triggering discovery on traffic is not sufficient, in case one of the peers changes its primary interface, or even all of its interfaces. The node initiating traffic will not be able to access it. |
LASSERT hit with the latest timeout patch <0>LNetError: 4706:0:(peer.c:1704:lnet_peer_discovery_complete()) ASSERTION( lp->lp_state & (1 << 4) ) failed: <0>LNetError: 4706:0:(peer.c:1704:lnet_peer_discovery_complete()) LBUG <4>Pid: 4706, comm: lnet_discovery <4> <4>Call Trace: <4> [<ffffffffa0c8b885>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] <4> [<ffffffffa0c8b9cf>] lbug_with_loc+0x3f/0x90 [libcfs] <4> [<ffffffffa0d2e297>] lnet_peer_discovery_complete+0x147/0x150 [lnet] <4> [<ffffffffa0d33ffd>] lnet_peer_discovery+0xe5d/0x1440 [lnet] <4> [<ffffffff8152a6be>] ? thread_return+0x4e/0x7d0 <4> [<ffffffff81064bd2>] ? default_wake_function+0x12/0x20 <4> [<ffffffff8109ebb0>] ? autoremove_wake_function+0x0/0x40 <4> [<ffffffffa0d331a0>] ? lnet_peer_discovery+0x0/0x1440 [lnet] <4> [<ffffffff8109e71e>] kthread+0x9e/0xc0 <4> [<ffffffff8100c20a>] child_rip+0xa/0x20 <4> [<ffffffff8109e680>] ? kthread+0x0/0xc0 <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20 <4> <0>Kernel panic - not syncing: LBUG <4>Pid: 4706, comm: lnet_discovery Not tainted 2.6.32.504.16.2.el6_lustre #1 <4>Call Trace: <4> [<ffffffff81529fbc>] ? panic+0xa7/0x16f <4> [<ffffffffa0c8b9e6>] ? lbug_with_loc+0x56/0x90 [libcfs] <4> [<ffffffffa0d2e297>] ? lnet_peer_discovery_complete+0x147/0x150 [lnet] <4> [<ffffffffa0d33ffd>] ? lnet_peer_discovery+0xe5d/0x1440 [lnet] <4> [<ffffffff8152a6be>] ? thread_return+0x4e/0x7d0 <4> [<ffffffff81064bd2>] ? default_wake_function+0x12/0x20 <4> [<ffffffff8109ebb0>] ? autoremove_wake_function+0x0/0x40 <4> [<ffffffffa0d331a0>] ? lnet_peer_discovery+0x0/0x1440 [lnet] <4> [<ffffffff8109e71e>] ? kthread+0x9e/0xc0 <4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20 <4> [<ffffffff8109e680>] ? kthread+0x0/0xc0 <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20 (gdb) l* lnet_peer_discovery_complete+0x147 0x2e2b7 is at /home/ashehata/LustreBuild/mr-dd/lnet/lnet/peer.c:1704. 
1699 { 1700 CDEBUG(D_NET, "Dequeue peer %s\n", 1701 libcfs_nid2str(lp->lp_primary_nid)); 1702 1703 spin_lock(&lp->lp_lock); 1704 LASSERT(lp->lp_state & LNET_PEER_QUEUED); 1705 lp->lp_state &= ~LNET_PEER_QUEUED; 1706 spin_unlock(&lp->lp_lock); 1707 list_del_init(&lp->lp_dc_list); 1708 wake_up_all(&lp->lp_dc_waitq); (gdb) l *lnet_peer_discovery+0xe5d 0x3401d is in lnet_peer_discovery (/home/ashehata/LustreBuild/mr-dd/lnet/lnet/peer.c:3013). 3008 lnet_net_lock(LNET_LOCK_EX); 3009 list_for_each_entry(lp, &the_lnet.ln_dc_request, lp_dc_list) { 3010 lnet_peer_discovery_error(lp, -ESHUTDOWN); 3011 lnet_peer_discovery_complete(lp); 3012 } 3013 list_for_each_entry(lp, &the_lnet.ln_dc_working, lp_dc_list) { 3014 lnet_peer_discovery_error(lp, -ESHUTDOWN); 3015 lnet_peer_discovery_complete(lp); 3016 } 3017 list_for_each_entry(lp, &the_lnet.ln_dc_expired, lp_dc_list) { To reproduce: modify the code to not send a REPLY for the GET; initiate discovery from the node (timeout set to 180); CTRL-C the command; run lnetctl lnet unconfigure; crash. | Blocker | | |
Usability Issues
Description | Priority | Reporter | Notes |
---|---|---|---|
Doug: How do I query the discovery setting? There is no "lnetctl show discovery" (should there be?). I tried "lnetctl set discovery" with no parameters and that core dumped (see previous section). From a usability perspective, there is no obvious way to get this information. | Doug | ||
Doug: I understand that Chris Morrone pushed for us to have "lnetctl set <key> <value>". That breaks from the original paradigm, which was to follow the Linux "ip" command. It was designed to be: "ip <object> <action> <optional params>". So, it would be more logical to have: "lnetctl discover set 0/1" than "lnetctl set discovery 0/1". Then starting discovery can be: "lnetctl discover start <nids>". Looking for discovery status can be: "lnetctl discovery show". I found myself guessing at commands like these and had no idea to look under "set". | Doug | |
Doug: You set the max interfaces with "lnetctl set max_interfaces", but it is shown as "max_intf" in the global settings. The names should be the same for consistency. | Doug | |
Document the behavior of Dynamic Discovery, including what type of traffic triggers discovery. | Critical | Amir | |