This page tracks issues that we run into while testing the Multi-Rail feature.
Problem List
Description | Priority | Reporter | Notes |
---|---|---|---|
with 17 interfaces trying to discover on the |
...
any of the interfaces the first time returns an error "no route to host". OK, I was able to reproduce this. If I follow the steps in UT-DD-EN-0005 exactly, then the first time I try to discover any of the NIDs it fails. Steps to reproduce:
| critical | Amir | This appears to be an issue with how the state machine works. I can reproduce it consistently by first getting the peer into the state "peer state: 1032: 10000001000 LNET_PEER_UNDISCOVERED LNET_PEER_PING_FAILED" and then running another discover, which triggers the problem. The FSM sees that the ping previously failed, decides that a ping is required, and stops there. The way the code works is a little bit odd in this scenario:
Also, in the above scenario the peer is left in the LNET_PEER_PING_FAILED | LNET_PEER_UNDISCOVERED state, but it is not on the ln_dc_request queue. Do we want these states set for a peer that's not on the request or working discovery queue? Olaf: The intent is |
...
that Amir: This is still an issue.
| ||
With 17 interfaces discovered, "show peer" hangs. When the rc from the kernel is not 0, the structure is not copied out of the kernel to user space. The code depends on that copy to pass back the new size when the data to be copied out is too big for the buffer passed in by the user. Since the copy doesn't happen when rc == -E2BIG, the user-space code gets into an infinite loop sending IOCTLs to the kernel.
|
...
| Amir | This has been fixed | ||
"lnetctl discover" command hangs with discovery off. This happened once, so an intermittent issue. Will try to reproduce. | Major | Sonia | not reproducible | |
"lnetctl discover" discovers the peer even with discovery off | Major | Sonia | discovery can not be turned off now | |
"lnetctl discover --force" expects a parameter (no parameter should be needed with --force). | Major | Doug | This has been fixed | |
Doug: I configured a Parallels VM with 16 interfaces (it won't let me do 17, as 16 is the limit). When I run "lctl network configure" with no YAML or module parameters, I get this error from ksocklnd: "Mar 7 14:01:16 centos-7 kernel: LNet: 5111:0:(socklnd.c:2652:ksocknal_enumerate_interfaces()) Ignoring interface virbr0 (too many interfaces)". | Minor | Doug | When no interfaces are configured, the ksocknal code enumerates all the interfaces and adds them under the same net. That's how socknal tcp bonding works, but it then only uses the first interface. Not sure why it does that. The maximum number of interfaces that socknal tcp bonding allows is 16, so you probably have 17 interfaces. Look at ksocknal_enumerate_interfaces(). Anyway, this is outside the scope of DD. We can fix it as a separate patch to master. | |
Doug: When discovering a node with 16 interfaces via "lnetctl discover --nid", it works, but I am seeing this log:
|
...
Minor | Doug | Olaf: I've seen the message before. Not sure how best to get rid of it, but it would be safe to just not emit it for the loopback case. | |||||||||||||
Doug: Tried the command "lnetctl set discovery" (no args) and got a core dump. | Major | Doug | Olaf: As I recall, the problem is Discovery is no longer configurable. | ||||||||||||
Doug: How do I query the discovery setting? There is no "lnetctl show discovery" (should there be?). I tried "lnetctl set discovery" with no parameters and that core dumped (see previous bullet). From a usability perspective, there is no obvious way to get this information. | Minor | Doug | Doug: was told it is "lnetctl global show". Don't like that (see Usability section) but see this problem as solved. Discovery is no longer configurable | ||||||||||||
Doug: When I enter "lnetctl set" to see what options I can set, I get this:
It does not mention "discovery" at all. | Major | Doug | Olaf: So the text in Discovery is no longer configurable | ||||||||||||
When you do "lnetctl export", the global section doesn't show the discovery status | Minor | Amir | Discovery is no longer configurable | ||||||||||||
Doug: Test: UT-DD-EN-0002. The first call to discover P2 fails with this:
The second attempt works as expected. I repeated the test twice and got the same result each time. | Critical | Doug | This is a duplicate of the first entry in this table. Please look above for details. | ||||||||||||
Doug: Test: UT-DD-EN-0003. Same behaviour as above. | Critical | Doug | Duplicate | ||||||||||||
Amir: There should be a way to turn off Multi-Rail. From the code the | Critical | Amir | We are no longer making multi-rail configurable | ||||||||||||
Doug: Test: UT-DD-DIS-0001. Test passed and worked as described. However, I decided to run lnet-selftest to see how running some traffic after the test goes. The lnet-selftest failed (never stopped running, did not show any stats). I looked at top and can see that the "lnet-discovery" thread is using 70% CPU (it was not using any CPU prior to running lnet-selftest). I suspect it is receiving all incoming traffic, so lnet-selftest is not getting anything. Additional note: I just redid this test but ran lnet-selftest "before" trying to invoke discovery. The lnet-discovery thread still takes off and lnet-selftest locks. It seems that turning off discovery causes the lnet-discovery thread to misbehave. | Blocker | Doug | This isn't specific to the test case. It happens whenever discovery is off and you run lnet_selftest. From the logs I collected, it appears that for each selftest message being sent, the same NID gets queued on the discovery thread, but the thread doesn't end up doing anything. In effect it goes into a tight loop: it tries to discover, but because discovery is off it doesn't, and it probably doesn't update the state properly, so the next message to the same peer causes the NID to be queued on the discovery thread again. Since the discovery thread does pretty heavy locking, it grinds the system to a halt. Selftest also reacts poorly and hangs the node. I don't think we should be queuing anything on the discovery thread if discovery is off. This has been fixed | ||||||||||||
Doug: After the previous point, I tried to kill the lnet-discovery thread. It did not stop. I then tried "kill -9". It still did not stop. I then did a reboot. Node went away and could not complete the reboot because it could not unload LNet. Had to reset the node. We need a way to stop any of our worker threads when things do not go well. Hard resetting a node in the field will be unacceptable to customers. | Blocker | Doug | This has been fixed | ||||||||||||
DD doesn't handle the case where a Multi-Rail peer is torn down and then booted with a downrev Lustre (non-MR). This needs to be handled. Both this scenario and turning off the Multi-Rail feature are going to be handled fairly similarly.
| Blocker | Amir | Fixed. Setting multi-rail off is a non-issue now. | ||||||||||||
DD doesn't send a push when an interface is added to an existing network | Blocker | Amir | The functionality to trigger a push when the configuration is updated is missing. Olaf: Adding an interface should cause Amir: As discussed, there is a scenario where triggering discovery on traffic is not sufficient: if one of the peers changes its primary interface, or even all of its interfaces, the node initiating traffic will not be able to access it. Fixed | ||||||||||||
LASSERT hit with the latest timeout patch
To reproduce
| Blocker | There seems to be a bit of a race here. I can't reproduce it again. Olaf: The most plausible thing I can think of is that Fixed | |||||||||||||
Doug: I have a bug which can be reproduced, but it requires these very specific steps. Start with a node and a peer. I have two interfaces configured for the node and three for the peer. I leave discovery on in both the node and the peer. On the peer, manually configure the node (as a peer) to have one interface, non-MR:
Then, trigger discovery on the node:
OK, that is fine, as the node did not have anything configured via MR and was able to discover the 3 interfaces on the peer. I then run a write-bulk lnet-selftest from the node to the peer. That works OK. However, when I look at the peers on the peer, I see both interfaces on the node even though MR has configured only one:
Looking at the stats for these two interfaces, I can see they are both used in the test even though they are not bound together:
That was not expected. | Critical | Doug | Olaf: The peer created with Amir: This is the expected behavior. Even though you configure the node on the peer to only have one interface, the node itself has two interfaces and is multi-rail capable, so it will round robin over both interfaces. Because you hard coded the node to be non-MR on the peer, the peer will continue to view the node as non-MR and thus will see both interfaces as two different peers:
That explains why you see traffic on both of the node's interfaces. Furthermore, because the peer is dynamically discovered as MR on the node, all of its interfaces are used in round robin. When you start lnet_selftest in the other direction, where the sender is the peer with the non-MR node configured as a peer, you'll see that only the interface configured for the node is used (mostly).
So I would consider this standard behavior. | ||||||||||||
Router Testing. Router: Set up two interfaces on a router on two networks, say tcp and tcp1, and set routing, as below:
Node: Set up an interface on tcp1 and add a route to tcp, like:
| critical | Sonia | The problem is in this part of the code:
This is going to fail for all non-local networks. Specifically here:
The code earlier will do the following:
Fixed | ||||||||||||
Crash on nodes with routes configured when bringing down LNet (lnetctl lnet unconfigure)
| Olaf: issue is the return value of LNetEQFree(). From the code this would either be
There was a bug in the router code where the rcd_mdh was being overwritten with an invalid value, which caused this crash. Both fixed | ||||||||||||||
Router Testing. Router: Set up two interfaces on a router on two networks, say tcp and tcp1, and set routing, as below:
Node 1: Set up an interface on tcp1 and add a route to tcp, like:
Node 2: Set up an interface on tcp and add a route to tcp1, like:
NOTE: Ping on the same network works, but ping to a different network always gives the above error. | Sonia | The router code was overwriting the rcd_mdh, so we were never sending out the ping that checks the router. As a result, the router was assumed to be down and we never used it. Fixed. |
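
The "show peer" hang reported in the table above can be illustrated with a small user-space simulation. This is only a sketch, not LNet code: `fake_ioctl`, `show_peer`, the `hdr` dictionary, the buffer sizes, and the `max_attempts` guard are all hypothetical. The only part taken from the report is the behaviour itself: when rc is -E2BIG the required size is not copied back, so the caller keeps retrying with the same too-small buffer.

```python
import errno

REQUIRED_SIZE = 4096  # hypothetical size of the peer data held by the "kernel"


def fake_ioctl(hdr, copy_out_on_error):
    """Hypothetical stand-in for the kernel side of the IOCTL.

    hdr is the user-space request; hdr["size"] is the buffer size offered.
    Returns 0 on success or -E2BIG when the buffer is too small.
    """
    if hdr["size"] >= REQUIRED_SIZE:
        return 0
    if copy_out_on_error:
        # Fixed behaviour: report the required size even though rc != 0.
        hdr["size"] = REQUIRED_SIZE
    return -errno.E2BIG


def show_peer(copy_out_on_error, max_attempts=5):
    """Retry loop as user space might write it; returns (rc, attempts).

    max_attempts exists only so that this sketch terminates; the reported
    hang corresponds to a loop without such a bound.
    """
    hdr = {"size": 1024}
    rc = -errno.E2BIG
    attempts = 0
    while rc == -errno.E2BIG and attempts < max_attempts:
        attempts += 1
        rc = fake_ioctl(hdr, copy_out_on_error)
        # On -E2BIG the caller reuses hdr["size"], trusting the kernel
        # to have updated it with the size it actually needs.
    return rc, attempts
```

With `copy_out_on_error=True` the second call succeeds because the first call passed back the needed size. With `copy_out_on_error=False` every call fails the same way until the artificial attempt limit is reached.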
Usability Issues
Description | Priority | Reporter | Notes |
---|---|---|---|
Doug: How do I query the discovery setting? There is no "lnetctl show discovery" (should there be?). I tried "lnetctl set discovery" with no parameters and that core dumped (see previous section). From a usability perspective, there is no obvious way to get this information. | Doug | ||
Doug: I understand that Chris Morrone pushed for us to have "lnetctl set <key> <value>". That breaks from the original paradigm, which followed the Linux "ip" command. It was designed to be: "ip <object> <action> <optional params>". So, it would be more logical to have: "lnetctl discover set 0/1" than "lnetctl set discovery 0/1". Then starting discovery can be: "lnetctl discover start <nids>". Looking for discovery status can be: "lnetctl discovery show". I found myself guessing at these commands as I have given here and had no idea that I should look at "set". | Doug | ||
Doug: You set the max interfaces with "lnetctl set max_interfaces" but it is shown as "max_intf" in the global settings. These should be the same for consistency. | Doug | ||
Document the behavior of Dynamic Discovery, including what type of traffic triggers discovery.
| Critical | Amir |
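
The "<object> <action> <optional params>" ordering suggested in the usability notes above could be sketched roughly as follows. This is a hypothetical illustration, not how lnetctl is implemented: `COMMANDS` and `parse_command` are made-up names, and the table lists only the discovery commands proposed in the note, using "discover" as the object name throughout (an assumption, since the note uses both "discover" and "discovery").

```python
# Hypothetical object -> allowed actions table, taken from the commands
# proposed in the usability note; this is not the real lnetctl grammar.
COMMANDS = {
    "discover": {"set", "start", "show"},
}


def parse_command(argv):
    """Split argv into (object, action, params) in "ip"-style order.

    Raises ValueError listing the valid choices when the object or the
    action is missing or unknown, so the user does not have to guess.
    """
    if not argv or argv[0] not in COMMANDS:
        raise ValueError(
            "expected an object, one of: " + ", ".join(sorted(COMMANDS))
        )
    obj = argv[0]
    actions = COMMANDS[obj]
    if len(argv) < 2 or argv[1] not in actions:
        raise ValueError(
            "expected an action for '%s', one of: %s"
            % (obj, ", ".join(sorted(actions)))
        )
    # Everything after the action is passed through as optional params.
    return obj, argv[1], list(argv[2:])
```

Because every action hangs off its object, entering just the object name yields an error that lists the available actions, which addresses the "had to guess" complaint in the note.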