LNet Resiliency/Health is mainly implemented at the LNet layer. The LND layer is only responsible for propagating specific errors up to the LNet layer, which then reacts to those errors as defined in the Requirements and HLD documentation.
To test this feature properly, fine-grained control over the LND behavior is required. The drop/delay message policies will be modified to simulate the various errors that could occur when sending a message. This is described in the sections below.
Test # | Tag | Procedure | Script | Result |
---|---|---|---|---|
1 | SRC_SPEC_LOCAL_MR_DST | | pass | |
2 | SRC_SPEC_LOCAL_MR_DST | | pass | |
3 | SRC_SPEC_ROUTER_MR_DST | | pass | |
4 | SRC_SPEC_ROUTER_MR_DST | | pass | |
5 | SRC_SPEC_ROUTER_MR_DST | | pass | |
6 | SRC_SPEC_ROUTER_MR_DST | | pass | |
7 | SRC_SPEC_LOCAL_NMR_DST | | pass | |
8 | SRC_SPEC_ROUTER_NMR_DST | | pass | |
9 | SRC_ANY_LOCAL_MR_DST | | pass | |
10 | SRC_ANY_ROUTER_MR_DST | | pass | |
11 | SRC_ANY_ROUTER_MR_DST | | pass | |
12 | SRC_ANY_LOCAL_NMR_DST | | pass | |
13 | SRC_ANY_ROUTER_NMR_DST | | pass | |
14 | SRC_ANY_ROUTER_NMR_DST | | pass | |

Test # | Tag | Procedure | Script | Result |
---|---|---|---|---|
1 | Immediate Failure | | lnetctl discover <nid>; lctl net_drop_add with "-e local_error"; lnetctl discover <nid> | pass |

Test # | Tag | Procedure | Script | Result |
---|---|---|---|---|
1 | LNET_MSG_STATUS_LOCAL_INTERRUPT, LNET_MSG_STATUS_LOCAL_DROPPED, LNET_MSG_STATUS_LOCAL_ABORTED, LNET_MSG_STATUS_LOCAL_NO_ROUTE, LNET_MSG_STATUS_LOCAL_TIMEOUT | | Example: lctl net_drop_add -s 10.9.10.3@tcp -d 10.9.10.4@tcp -m GET -i 20 -e local_dropped. Key messages in debug log: (lib-msg.c:762:lnet_health_check()) 10.9.10.3@tcp->10.9.10.4@tcp:GET:LOCAL_DROPPED - queuing for resend; (lib-msg.c:508:lnet_handle_local_failure()) ni 10.9.10.3@tcp added to recovery queue. Health = 950; (lib-move.c:2928:lnet_recover_local_nis()) attempting to recover local ni: 10.9.10.3@tcp | pass |
2 | Sensitivity == 0 | | pass | |
3 | Sensitivity > 0 | | pass | |
4 | Sensitivity > 0, buggy interface | | | |
5 | Retry count == 0 | | pass | |
6 | Retry count > 0 | | Key messages in debug log: (lib-move.c:1715:lnet_handle_send()) TRACE: 10.9.10.3@tcp(10.9.10.3@tcp:<?>) -> 10.9.10.4@tcp(10.9.10.4@tcp:10.9.10.4@tcp) : GET try# 0; (lib-move.c:1715:lnet_handle_send()) TRACE: 10.9.10.3@tcp(10.9.10.3@tcp:<?>) -> 10.9.10.4@tcp(10.9.10.4@tcp:10.9.10.4@tcp) : GET try# 1 | pass |
7 | REPLY timeout | | pass | |
8 | ACK timeout | | pass | |
9 | LNET_MSG_STATUS_LOCAL_ERROR | | pass | |
10 | LNET_MSG_STATUS_REMOTE_DROPPED | | pass | |
11 | LNET_MSG_STATUS_REMOTE_ERROR, LNET_MSG_STATUS_REMOTE_TIMEOUT, LNET_MSG_STATUS_NETWORK_TIMEOUT | | pass | |
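
The debug log messages quoted for test 1 above can be checked mechanically rather than by eye. The following is a minimal sketch that assumes the log text is supplied on standard input and that the recovery-queue message has exactly the wording quoted in the table; the helper name recovery_health is hypothetical and not part of any Lustre tool.

```shell
#!/bin/sh
# Hypothetical helper: print "<nid> <health>" for every
# "added to recovery queue" message found on stdin.
# Assumes the message wording matches the example in the test table.
recovery_health() {
    sed -n 's/.* ni \([^ ]*\) added to recovery queue\. Health = \([0-9][0-9]*\).*/\1 \2/p'
}
```

Feeding it the second log line from test 1 yields the NI and its health value on one line, which a test script can then compare against the value it expects after the simulated failure.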

Test # | Tag | Procedure | Script | Result |
---|---|---|---|---|
1 | self test | | ip link set eth1 down | pass |
2 | self test | | pass | |
3 | self test | | | |
4 | self test | | | |
5 | self test | | | |
6 | self test | | | |

Test # | Tag | Procedure | Script | Result |
---|---|---|---|---|
1 | Discovery triggered on route add | | pass | |
2 | Discovery triggered on interval | | pass | |
3 | Router tcp1 down due to no traffic | | pass | |
4 | Router tcp1 comes up when peerB is brought up | | pass | |
5 | Add route without router there | | pass | |
6 | Traffic should trigger an attempt at router discovery | | pass | |
7 | Ping should not trigger discovery of router | | pass | |
8 | Multi-interface router even traffic distribution | | pass | |
9 | Multi-interface router with one bad interface | | pass | |
10 | Multi-interface router with a bad interface that recovers | | pass. In an idle system the bad peer interface is pinged once every second, causing its sequence number to go up. So when it comes back online it is not used until the sequence numbers equalize. This is also the case if the system is busy, but the issue is reversed. | |
11 | Multi-Router/Multi-interface setup | | pass | |
12 | Multi-Router/Multi-interface setup with failed gateway | | pass | |
13 | Multi-Router/Multi-interface setup with router recovery | | Problem found, possibly with discovery. 1. Bring up two routers with 4 interfaces, 2 on each network. 2. Bring down one of the routers. 3. Bring it up again, but with only 2 of its interfaces on 1 network. 4. Client goes berserk and keeps trying to discover it; it toggles between state: 0x139 and 39. There were a couple of issues here. Pass | |
14 | router sensitivity < 100 | | | |
15 | Extra Health Testing | | | |

Test # | Tag | Procedure | Script | Result |
---|---|---|---|---|
1 | lnet_transaction_timeout | | lnetctl set transaction_timeout <value> | pass |
2 | lnet_retry_count | | lnetctl set retry_count <value> | pass |
3 | lnet_health_sensitivity | | lnetctl set health_sensitivity <value> | |
4 | NI statistics | | pass | |
5 | Peer NI statistics | | pass | |
6 | NI Health value | | | |
7 | Peer NI Health value | | | |

The drop policy has been modified to drop outgoing messages with specific errors. This can be done via the following commands. Unfortunately, for details on these commands you'll need to look at the code. A combination of these commands on the different nodes should cover approximately 75% of the health code paths.
lctl net_drop_add -s *@tcp -d *@tcp -m ACK -i 20
lctl net_drop_add -s *@tcp -d *@tcp -m REPLY -i 20
lctl net_drop_add -s *@tcp -d *@tcp -m GET -i 43 -e random -n
lctl net_drop_add -s *@tcp -d *@tcp -m PUT -i 20 -e random
lctl net_drop_del -s *@tcp -d *@tcp
The -e parameter can take the following arguments:
local_interrupt # will result in a resend
local_dropped # will result in a resend
local_aborted # will result in a resend
local_no_route # will result in a resend
local_error # will not result in a resend
local_timeout # will result in a resend
remote_error # will not result in a resend
remote_dropped # will result in a resend
remote_timeout # will not result in a resend
network_timeout # will not result in a resend
random
silent_queue # will queue the message and never call lnet_finalize()
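
When scripting the tests, it can help to encode the resend expectations from the list above so that a test can decide whether to look for a retry in the debug log. The following is a minimal sketch; the function name resend_expected is hypothetical, and the mapping is copied from the comments in the list, not derived from the code.

```shell
#!/bin/sh
# Hypothetical helper: report whether the given -e argument is expected
# to cause a resend, according to the list of arguments above.
resend_expected() {
    case "$1" in
        local_interrupt|local_dropped|local_aborted|local_no_route|local_timeout|remote_dropped)
            echo yes ;;
        local_error|remote_error|remote_timeout|network_timeout)
            echo no ;;
        *)
            # random, silent_queue and unrecognized names have no fixed
            # resend outcome in the list
            echo unknown ;;
    esac
}
```

For example, resend_expected local_dropped reports yes, while resend_expected remote_timeout reports no.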
The -e parameter can be repeated multiple times to specify a set of different errors to select from at random. Alternatively, random can be given to -e to select any error to simulate at random.
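
A rule with several -e arguments can be assembled by a small wrapper. The sketch below only prints the command instead of running it, so it can be reviewed first; the function name drop_rule_cmd and its argument order are hypothetical, and it uses only the flags that appear in the commands above.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) an lctl net_drop_add command
# with one -e flag per error name.
# Usage: drop_rule_cmd SRC DST MSG_TYPE I_VALUE ERROR [ERROR...]
drop_rule_cmd() {
    [ "$#" -ge 5 ] || { echo "usage: drop_rule_cmd SRC DST MSG_TYPE I_VALUE ERROR..." >&2; return 1; }
    src=$1
    dst=$2
    msg=$3
    i_value=$4
    shift 4
    cmd="lctl net_drop_add -s $src -d $dst -m $msg -i $i_value"
    # Append one -e per remaining argument.
    for err in "$@"; do
        cmd="$cmd -e $err"
    done
    printf '%s\n' "$cmd"
}

# Quote the wildcards so the shell does not expand them.
drop_rule_cmd '*@tcp' '*@tcp' GET 20 local_dropped local_timeout
# prints: lctl net_drop_add -s *@tcp -d *@tcp -m GET -i 20 -e local_dropped -e local_timeout
```

Passing random or silent_queue as the only error name produces the corresponding single -e rule.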
The error simulation occurs immediately before putting the message on the wire. If a drop rule is defined and matches, the message is not sent and the error is simulated.
Type | Description |
---|---|
Drop with error | This is the newly added error simulation. It is designed to simulate different health failures and can be used to exercise the corresponding failure scenarios. |
Drop Received messages | This is an existing rule and it can be used to drop received GET/PUT messages. This will result in no ACK/REPLY being sent to the message initiator and will exercise the response timeout code. |
Queuing messages | In heavily used systems, especially routers, the credits can dip below zero and a message can be queued internally. It's possible that these messages can be queued for a long period of time, so a mechanism was created to finalize these messages after the expiration time. A command can be issued to simulate queueing. This will queue the message on a separate queue which is never checked except when the message expires. The message credit is not returned until the message expires. |
LNet Health is off by default. To turn it on, the following configuration parameters need to be set:
lnetctl set retry_count <value>
lnetctl set health_sensitivity <value>
lnetctl set transaction_timeout <value>
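
For repeated test runs, the three settings can be generated together. The sketch below prints the commands for review and does not execute them; the function name health_enable_cmds is hypothetical, and the values are supplied by the caller because this document does not prescribe any.

```shell
#!/bin/sh
# Hypothetical helper: print the three lnetctl commands listed above
# with caller-supplied values. Pipe the output to sh to apply them.
# Usage: health_enable_cmds RETRY_COUNT HEALTH_SENSITIVITY TRANSACTION_TIMEOUT
health_enable_cmds() {
    [ "$#" -eq 3 ] || { echo "usage: health_enable_cmds RETRY_COUNT HEALTH_SENSITIVITY TRANSACTION_TIMEOUT" >&2; return 1; }
    printf 'lnetctl set retry_count %s\n' "$1"
    printf 'lnetctl set health_sensitivity %s\n' "$2"
    printf 'lnetctl set transaction_timeout %s\n' "$3"
}
```

Printing rather than executing keeps the helper safe to run on a node without Lustre installed and makes the applied values visible in the test log.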