This page tracks issues that we run into while testing the Multi-Rail feature.
Problem List
- With 17 interfaces, the first attempt to discover any of the interfaces returns the error "no route to host".
- OK, I was able to reproduce this. If I follow the steps in UT-DD-EN-0005 exactly, the first attempt to discover any of the NIDs fails.
- Steps to reproduce:
P1 && P2:
lnetctl lnet configure
lnetctl net add --net tcp --if eth0,eth1,eth2,eth3,eth4,eth5,eth6,eth7,eth8,eth9,eth10,eth11,eth12,eth13,eth14,eth15,eth16
P1: lnetctl ping --nid 192.168.122.42@tcp
P2: lnetctl ping --nid 192.168.122.23@tcp
P1:
[root@MRtest01 Lustre]# lnetctl discover --nid 192.168.122.46@tcp
manage:
    - discover:
          errno: -1
          descr: failed to discover 192.168.122.46@tcp: No route to host
Second time it works.
- With 17 interfaces discovered, "show peer" hangs.
- When the rc from the kernel is not 0, the structure is not copied out of the kernel to user space. The user-space code depends on that copy to learn the new size when the data to be copied out is too big for the buffer passed in by the user. Since the copy doesn't happen when rc == -E2BIG, the user-space code gets into an infinite loop sending IOCTLs to the kernel.
From libcfs_ioctl():
    if (err == 0) {
        if (copy_to_user(uparam, hdr, hdr->ioc_len))
            err = -EFAULT;
    }
    break;
The buffer is only copied to user space if the ioctl handler returns 0. Not really sure if it's safe to change that.
- "lnetctl discover" command hangs with discovery off. This happened once, so it appears to be an intermittent issue. Will try to reproduce.
- "lnetctl discover" discovers the peer even with discovery off.
- "lnetctl discover --force" expects a parameter (no parameter should be needed with --force).
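The "show peer" hang described above can be sketched as a small user-space simulation. This is not the libcfs code: fake_hdr, fake_ioctl(), query_with_retry(), NEEDED_LEN and the copy_out_on_error switch are all hypothetical names made up for illustration. The sketch assumes the handler records the required size in its own copy of the header and that the caller retries with whatever size the header holds. A retry cap stands in for the infinite loop.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the ioctl header; not the real libcfs struct. */
struct fake_hdr {
	size_t ioc_len;	/* size offered by the caller, or size required */
};

#define NEEDED_LEN 4096	/* size the simulated handler needs (assumed) */

/*
 * Simulated kernel side. The handler works on its own copy of the header,
 * and that copy is written back only when rc == 0, as in the
 * libcfs_ioctl() excerpt. copy_out_on_error is a hypothetical switch
 * that models also copying out on -E2BIG.
 */
int fake_ioctl(struct fake_hdr *user, int copy_out_on_error)
{
	struct fake_hdr kern = *user;	/* copy_from_user() stand-in */
	int rc = 0;

	if (kern.ioc_len < NEEDED_LEN) {
		kern.ioc_len = NEEDED_LEN;	/* report the required size */
		rc = -E2BIG;
	}

	if (rc == 0 || (copy_out_on_error && rc == -E2BIG))
		*user = kern;	/* copy_to_user() stand-in */

	return rc;
}

/*
 * Simulated user side: retry while the handler says the buffer is too
 * small. Returns the number of calls made, or -1 once max_tries is
 * exhausted (the stand-in for looping forever).
 */
int query_with_retry(size_t initial_len, int copy_out_on_error, int max_tries)
{
	struct fake_hdr hdr = { .ioc_len = initial_len };
	int tries;

	assert(max_tries > 0);

	for (tries = 1; tries <= max_tries; tries++) {
		int rc = fake_ioctl(&hdr, copy_out_on_error);

		if (rc != -E2BIG)
			return tries;
		/* hdr.ioc_len only changes if the handler copied it out */
	}
	return -1;
}
```

In this sketch, query_with_retry(128, 0, 1000) exhausts its tries because the caller never learns the required size, while query_with_retry(128, 1, 1000) succeeds on the second call. Whether copying out on error is safe in the real libcfs_ioctl() is still the open question noted above.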
Doug: I configured a Parallels VM with 16 interfaces (won't let me do 17 as 16 is a limit). When I "lctl network configure" with no YAML or module parameters, I get this error from ksocklnd: "Mar 7 14:01:16 centos-7 kernel: LNet: 5111:0:(socklnd.c:2652:ksocknal_enumerate_interfaces()) Ignoring interface virbr0 (too many interfaces)".
Doug: When discovering a node with 16 interfaces via "lnetctl discover --nid", it works, but I am seeing this log:
Mar 7 15:48:04 centos-7 kernel: LNetError: 24769:0:(peer.c:1726:lnet_peer_push_event()) Push Put from unknown 0@<0:0> (source 0@<0:0>)
- Olaf: I've seen the message before. Not sure how best to get rid of it, but it would be safe to just not emit it for the loopback case.
- Doug: Tried the command "lnetctl set discovery" (no args) and got a core dump.
- Olaf: As I recall, the problem is that parse_long() is passed a NULL pointer for its number parameter in this case, and it doesn't check for that. Easiest to fix in parse_long() rather than fix all callers. This bug affects all lnetctl set commands.
- Doug: How do I query the discovery setting? There is no "lnetctl show discovery" (should there be?). I tried "lnetctl set discovery" with no parameters and that core dumped (see previous bullet). From a usability perspective, there is no obvious way to get this information.
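The NULL check Olaf suggests for parse_long() might look like the sketch below. It is an illustration only: the name parse_long_checked(), its signature and its return convention (0 on success, -1 on failure) are assumptions, not the actual lnetctl source.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical NULL-safe variant of parse_long(). The name, signature
 * and return convention are assumptions made for this sketch.
 */
int parse_long_checked(const char *number, long *value)
{
	char *end;
	long parsed;

	/* A missing argument, as in "lnetctl set discovery", arrives as NULL. */
	if (number == NULL || value == NULL || *number == '\0')
		return -1;

	errno = 0;
	parsed = strtol(number, &end, 0);
	if (errno != 0 || *end != '\0')
		return -1;	/* out of range, or trailing characters */

	*value = parsed;
	return 0;
}
```

Putting the check in the parsing helper means every "lnetctl set" caller gets an error return instead of a crash, without each caller needing its own guard.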
Doug: When I enter "lnetctl set" to see what options I can set, I get this:
[root@centos-7 ~]# lnetctl set
set {tiny_buffers | small_buffers | large_buffers | routing}
It does not mention "discovery" at all.
- Olaf: So the text in list[] for the set subcommand needs to be updated. Note that max_interfaces needs to be added there as well.
- When you do "lnetctl export", the global section doesn't show the discovery status.
Usability Issues
- Doug: How do I query the discovery setting? There is no "lnetctl show discovery" (should there be?). I tried "lnetctl set discovery" with no parameters and that core dumped (see previous section). From a usability perspective, there is no obvious way to get this information.
- Doug: I understand that Chris Morrone pushed for us to have "lnetctl set <key> <value>". That breaks from the original paradigm, which followed the Linux "ip" command. It was designed to be: "ip <object> <action> <optional params>". So, it would be more logical to have: "lnetctl discover set 0/1" than "lnetctl set discovery 0/1". Then starting discovery can be: "lnetctl discover start <nids>". Looking for discovery status can be: "lnetctl discovery show". I found myself guessing at these commands as I have given them here and had no idea I should look at "set".
- Doug: You set the max interfaces with "lnetctl set max_interfaces", but it is shown as "max_intf" in the global settings. These should be the same for consistency.