...
Selection Algorithm Scenarios
Test # | Tag | Procedure | Script | Result |
---|---|---|---|---|
1 | SRC_SPEC_LOCAL_MR_DST | | pass | |
2 | SRC_SPEC_LOCAL_MR_DST | | pass | |
3 | SRC_SPEC_ROUTER_MR_DST | | pass | |
4 | SRC_SPEC_ROUTER_MR_DST | | pass | |
5 | SRC_SPEC_ROUTER_MR_DST | | pass | |
6 | SRC_SPEC_ROUTER_MR_DST | | pass | |
7 | SRC_SPEC_LOCAL_NMR_DST | | pass | |
8 | SRC_SPEC_ROUTER_NMR_DST | | pass | |
9 | SRC_ANY_LOCAL_MR_DST | | pass | |
10 | SRC_ANY_ROUTER_MR_DST | | pass | |
11 | SRC_ANY_ROUTER_MR_DST | | pass | |
12 | SRC_ANY_LOCAL_NMR_DST | | pass | |
13 | SRC_ANY_ROUTER_NMR_DST | | pass | |
14 | SRC_ANY_ROUTER_NMR_DST | | pass | |
Error Scenarios
Synchronous Errors
Test # | Tag | Procedure | Script | Result |
---|---|---|---|---|
1 | Immediate Failure |
| lnetctl discover <nid>; lctl net_drop_add with "-e local_error"; lnetctl discover <nid> | pass |
Asynchronous Errors
Test # | Tag | Procedure | Script | Result |
---|---|---|---|---|
1 | LNET_MSG_STATUS_LOCAL_INTERRUPT LNET_MSG_STATUS_LOCAL_DROPPED LNET_MSG_STATUS_LOCAL_ABORTED LNET_MSG_STATUS_LOCAL_NO_ROUTE LNET_MSG_STATUS_LOCAL_TIMEOUT |
|
- Same setup as 1
- NI is not placed on the recovery queue
- Same setup as 1
- NI is placed on the recovery queue
- Monitor network activity as NI is pinged until health is back to maximum
Examples:
lctl net_drop_add -s 10.9.10.3@tcp -d 10.9.10.4@tcp -m GET -i 20 -e local_dropped
Key messages in debug log:
(lib-msg.c:762:lnet_health_check()) 10.9.10.3@tcp->10.9.10.4@tcp:GET:LOCAL_DROPPED - queuing for resend
(lib-msg.c:508:lnet_handle_local_failure()) ni 10.9.10.3@tcp added to recovery queue. Health = 950
(lib-move.c:2928:lnet_recover_local_nis()) attempting to recover local ni: 10.9.10.3@tcp
| pass |
2 | Sensitivity == 0 |
| pass | |||
3 | Sensitivity > 0 |
| pass | |
4 | Sensitivity > 0 Buggy interface |
| ||
5 | Retry count == 0 |
| pass | |
6 | Retry count > 0 |
|
- Same setup as 1
- Except Use LNet selftest
- Simulate a local timeout
- Re-transmit
- No REPLY received
- Message is finalized and TIMEOUT event is propagated.
Key messages in debug log:
(lib-move.c:1715:lnet_handle_send()) TRACE: 10.9.10.3@tcp(10.9.10.3@tcp:<?>) -> 10.9.10.4@tcp(10.9.10.4@tcp:10.9.10.4@tcp) : GET try# 0
(lib-move.c:1715:lnet_handle_send()) TRACE: 10.9.10.3@tcp(10.9.10.3@tcp:<?>) -> 10.9.10.4@tcp(10.9.10.4@tcp:10.9.10.4@tcp) : GET try# 1
| pass | |||
7 | REPLY timeout |
| pass | |
8 | ACK timeout |
| pass | |
9 | LNET_MSG_STATUS_LOCAL_ERROR |
- Same setup as 7 except simulate ACK timeout
- Same setup as 1
- Message is finalized immediately (not resent)
- Local NI is placed on the recovery queue
- Same procedure to recover the local NI
- Same setup as 1
- Message is queued for resend depending on retry_count
- peer_ni is placed on the recovery queue (not if sensitivity == 0)
- peer_ni is pinged every 1 second
| pass | |||
10 | LNET_MSG_STATUS_REMOTE_DROPPED |
| pass | |
11 | LNET_MSG_STATUS_REMOTE_ERROR LNET_MSG_STATUS_REMOTE_TIMEOUT LNET_MSG_STATUS_NETWORK_TIMEOUT |
| pass |
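The local-failure procedure in test 1 (inject a drop rule, then look for the key messages in the debug log) can be sketched in shell. This is a minimal sketch, not the actual test script: the function names `print_local_drop_rules` and `recovery_events` are hypothetical, the flag pattern is copied from the example in test 1, and the lowercase `-e` values other than `local_dropped` are assumed to follow the same naming as the status tags. The first function only prints commands; nothing is executed.

```shell
#!/bin/sh
# Print one net_drop_add command per local error status from test 1.
# The flags mirror the example in the table; only the -e value changes.
# The -e names other than local_dropped are assumed from the status tags.
# $1 = source NID, $2 = destination NID
print_local_drop_rules() {
    for err in local_interrupt local_dropped local_aborted \
               local_no_route local_timeout; do
        echo "lctl net_drop_add -s $1 -d $2 -m GET -i 20 -e $err"
    done
}

# Read a debug log on stdin and print "<nid> <health>" for every local NI
# that the log reports as added to the recovery queue.
recovery_events() {
    sed -n 's/.*ni \([^ ]*\) added to recovery queue\. Health = \([0-9][0-9]*\).*/\1 \2/p'
}
```

For example, `print_local_drop_rules 10.9.10.3@tcp 10.9.10.4@tcp` lists the five candidate rules for review, and a dumped debug log (for instance from `lctl dk`) can be piped into `recovery_events` to confirm that the NI was queued for recovery and to read its health value.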
Random Failures
Test # | Tag | Procedure | Script | Result | |||
---|---|---|---|---|---|---|---|
1 | self test |
| |||||
2 | self test |
| ip link set eth1 down | pass |
23 | self test |
| pass | ||||
43 | self test |
| |||||
54 | self test |
| |||||
5 | self test |
| |||||
6 | self test |
|
MR Router Testing
Test # | Tag | Procedure | Script | Result |
---|---|---|---|---|
1 | Discovery triggered on route add |
| pass | |
2 | Discovery triggered on interval |
| pass | |
3 | Router tcp1 down due to no traffic |
| pass | |
4 | Router tcp1 comes up when peerB is brought up |
| pass | |
5 | Add route without router there |
| pass | |
6 | traffic should trigger an attempt at router discovery |
|
| pass | |||
7 | Ping should not trigger discovery of router |
| pass | |
8 | Multi-interface router even traffic distribution |
| pass | |
9 | Multi-interface router with one bad interface |
| pass | |
10 | Multi-interface router with a bad interface that recovers |
| pass. Note: in an idle system the bad peer interface is pinged once every second, which drives its sequence number up. When it comes back online it is therefore not used until the sequence numbers equalize. A busy system shows the same effect in reverse. | |
11 | Multi-Router/Multi-interface setup |
| pass | |
12 | Multi-Router/Multi-interface setup with failed gateway |
|
| pass | |||
13 | Multi-Router/Multi-interface setup with router recovery |
| Problem found, possibly with discovery:
1. Bring up two routers with 4 interfaces, 2 on each network.
2. Bring down one of the routers.
3. Bring it up again, but with only 2 of its interfaces on 1 network.
4. The client goes berserk and keeps trying to discover it, toggling between state: 0x139 and 39.
There were a couple of issues here:
Pass | |
14 | router sensitivity < 100 |
|
|
|
|
| ||||
15 | Extra Health Testing |
|
User Interface
Test # | Tag | Procedure | Script | Result | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | lnet_transaction_timeout |
| lnetctl set transaction_timeout <value> | pass | ||||||||
2 | lnet_retry_count |
| lnetctl set retry_count <value> | pass | ||||||||
3 | lnet_health_sensitivity |
| lnetctl set health_sensitivity <value> |
| ||||||||
4 | NI statistics |
| pass | |||||||||
5 | Peer NI statistics |
| pass | |||||||||
6 | NI Health value |
|
| |||||||||
7 | Peer NI Health value |
|
|
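The three `lnetctl set` commands from tests 1 to 3 can be grouped into one helper for repeated test runs. This is a minimal sketch under stated assumptions: the helper names `run` and `set_health_tunables` and the `DRY_RUN` variable are hypothetical, and the values passed in are placeholders rather than recommended settings. By default the helper only prints the commands.

```shell
#!/bin/sh
# Print the command when DRY_RUN is 1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Apply the three health-related tunables exercised by tests 1 to 3.
# $1 = transaction timeout, $2 = retry count, $3 = health sensitivity
set_health_tunables() {
    run lnetctl set transaction_timeout "$1"
    run lnetctl set retry_count "$2"
    run lnetctl set health_sensitivity "$3"
}
```

Calling `set_health_tunables 10 3 100` prints the three commands for review; setting `DRY_RUN=0` runs them on a node with LNet configured. The resulting values can then be checked with `lnetctl global show`.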
Testing Tools
The drop policy has been modified so that outgoing messages can be dropped with a specific error. This is done via the following commands. They are not documented in detail, so refer to the code for their full usage. Used in combination on the different nodes, these commands should cover approximately 75% of the health code paths.
...