...
Test # | Tag | Procedure | Script | Result |
---|
1 | Immediate Failure | - Send a PING
- Simulate an immediate LND failure (e.g., NOMEM)
- Message should not be resent
| lnetctl discover <nid>
lctl net_drop_add with "-e local_error"
lnetctl discover <nid>
| pass |
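The sequence in the Script column can be laid out as a dry run. This is a minimal sketch, assuming the drop rule takes the same `-s`, `-d`, `-m`, and `-i` options as the `net_drop_add` example given for the asynchronous tests. The function name and the default interval are illustrative and not part of the test plan. The commands are only printed, never executed.

```python
import shlex


def immediate_failure_commands(local_nid, peer_nid, interval=20):
    """Return the three steps of the immediate-failure test as argv lists.

    Hypothetical helper: it only builds the command lines, it does not run them.
    """
    discover = ["lnetctl", "discover", peer_nid]
    drop_rule = [
        "lctl", "net_drop_add",
        "-s", local_nid,       # source NID the rule applies to
        "-d", peer_nid,        # destination NID the rule applies to
        "-m", "GET",           # message type used in this plan's PING examples
        "-i", str(interval),   # interval value copied from the async example (assumption)
        "-e", "local_error",   # health error named in the Script column
    ]
    # Discover once without the rule, install the rule, then discover again.
    return [discover, drop_rule, list(discover)]


if __name__ == "__main__":
    for argv in immediate_failure_commands("10.9.10.3@tcp", "10.9.10.4@tcp"):
        print(shlex.join(argv))
```

The second discover is the one expected to fail without a resend; a real run would also need to remove the drop rule afterwards.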
Asynchronous Errors
Test # | Tag | Procedure | Script | Result |
---|
1 | LNET_MSG_STATUS_LOCAL_INTERRUPT LNET_MSG_STATUS_LOCAL_DROPPED LNET_MSG_STATUS_LOCAL_ABORTED LNET_MSG_STATUS_LOCAL_NO_ROUTE LNET_MSG_STATUS_LOCAL_TIMEOUT | - MR Node with Multiple interfaces
- Send a PING
- Simulate an <error>
- PING msg should be queued on resend queue
- PING msg will be resent on a different interface
- Failed interface's health value will be decremented
- Failed interface will be placed on the recovery queue
| Examples:
lctl net_drop_add -s 10.9.10.3@tcp -d 10.9.10.4@tcp -m GET -i 20 -e local_dropped
Key messages in debug log:
(lib-msg.c:762:lnet_health_check()) 10.9.10.3@tcp->10.9.10.4@tcp:GET:LOCAL_DROPPED - queuing for resend
(lib-msg.c:508:lnet_handle_local_failure()) ni 10.9.10.3@tcp added to recovery queue. Health = 950
(lib-move.c:2928:lnet_recover_local_nis()) attempting to recover local ni: 10.9.10.3@tcp
| pass |
2 | Sensitivity == 0 | - Same setup as 1
- NI is not placed on the recovery queue
|
| pass |
3 | Sensitivity > 0 | - Same setup as 1
- NI is placed on the recovery queue
- Monitor network activity as NI is pinged until health is back to maximum
|
| pass |
4 | Sensitivity > 0 Buggy interface | - Same setup as 1
- NI is placed on recovery queue
- NI is pinged every 1 second
- Simulate ping failure every other ping
- NI's health should be decremented on failure
- NI should remain on the recovery queue
|
|
|
5 | Retry count == 0 | - Same setup as 1
- Message will not be retried and the message will be finalized immediately
|
| pass |
6 | Retry count > 0 | - Same setup as 1
- Message will be transmitted for a maximum of retry count or until the message expires
| Key messages in debug log:
(lib-move.c:1715:lnet_handle_send()) TRACE: 10.9.10.3@tcp(10.9.10.3@tcp:<?>) -> 10.9.10.4@tcp(10.9.10.4@tcp:10.9.10.4@tcp) : GET try# 0
(lib-move.c:1715:lnet_handle_send()) TRACE: 10.9.10.3@tcp(10.9.10.3@tcp:<?>) -> 10.9.10.4@tcp(10.9.10.4@tcp:10.9.10.4@tcp) : GET try# 1
| pass |
7 | REPLY timeout | - Same setup as 1, except use LNet selftest
- Simulate a local timeout
- Re-transmit
- No REPLY received
- Message is finalized and TIMEOUT event is propagated.
|
| pass |
8 | ACK timeout | - Same setup as 7 except simulate ACK timeout
|
| pass |
9 | LNET_MSG_STATUS_LOCAL_ERROR | - Same setup as 1
- Message is finalized immediately (not resent)
- Local NI is placed on the recovery queue
- Same procedure to recover the local NI
|
| pass |
10 | LNET_MSG_STATUS_REMOTE_DROPPED | - Same setup as 1
- Message is queued for resend depending on retry_count
- peer_ni is placed on the recovery queue (not if sensitivity == 0)
- peer_ni is pinged every 1 second
|
| pass |
11 | LNET_MSG_STATUS_REMOTE_ERROR LNET_MSG_STATUS_REMOTE_TIMEOUT LNET_MSG_STATUS_NETWORK_TIMEOUT | - Same setup as 1
- Message is not resent
- peer_ni recovery happens as outlined in previous cases
|
| pass |
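Checking the debug log for the key messages quoted in the table can be automated. The following is a minimal sketch, assuming the log lines keep exactly the shape of the samples above; the regular expressions and the `summarize` helper are illustrative and are not part of LNet or its tooling.

```python
import re

# Patterns derived only from the sample log lines quoted in this test plan.
RESEND_RE = re.compile(
    r"lnet_health_check\(\)\) (\S+)->(\S+):(\w+):(\w+) - queuing for resend"
)
RECOVERY_RE = re.compile(r"ni (\S+) added to recovery queue\. Health = (\d+)")
TRY_RE = re.compile(r"lnet_handle_send\(\)\).* (\w+) try# (\d+)")


def summarize(lines):
    """Collect resend, recovery-queue, and retry events from debug log lines."""
    summary = {"resends": [], "recovery": [], "tries": []}
    for line in lines:
        m = RESEND_RE.search(line)
        if m:
            # (source NID, destination NID, message type, health status)
            summary["resends"].append(m.groups())
            continue
        m = RECOVERY_RE.search(line)
        if m:
            # (NID placed on the recovery queue, health value at that time)
            summary["recovery"].append((m.group(1), int(m.group(2))))
            continue
        m = TRY_RE.search(line)
        if m:
            # Retry counter printed by the send path
            summary["tries"].append(int(m.group(2)))
    return summary


SAMPLE = [
    "(lib-msg.c:762:lnet_health_check()) 10.9.10.3@tcp->10.9.10.4@tcp:GET:LOCAL_DROPPED - queuing for resend",
    "(lib-msg.c:508:lnet_handle_local_failure()) ni 10.9.10.3@tcp added to recovery queue. Health = 950",
    "(lib-move.c:1715:lnet_handle_send()) TRACE: 10.9.10.3@tcp(10.9.10.3@tcp:<?>) -> 10.9.10.4@tcp(10.9.10.4@tcp:10.9.10.4@tcp) : GET try# 0",
    "(lib-move.c:1715:lnet_handle_send()) TRACE: 10.9.10.3@tcp(10.9.10.3@tcp:<?>) -> 10.9.10.4@tcp(10.9.10.4@tcp:10.9.10.4@tcp) : GET try# 1",
]

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

With a summary like this, the "Retry count == 0" case can be checked by asserting that no try number above 0 appears, and the "Sensitivity == 0" case by asserting that the recovery list stays empty.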
Random Failures
Test # | Tag | Procedure | Script | Result |
---|
1 | self test | - MR Node
- NMR Peer
- Self-test
- Randomize local NI failure
- Randomize Remote NI failure
| ip link set eth1 down | pass |
2 | self test | - MR Node
- MR Peer
- Self-test
- Randomize local NI failure
- Randomize Remote NI failure
|
| pass |
3 | self test | - MR Node
- MR Router
- NMR Peer
- Self-test
- Randomize local NI failure
- Randomize Remote NI failure
|
|
|
4 | self test | - MR Node
- MR Router
- MR Peer
- Self-test
- Randomize local NI failure
- Randomize Remote NI failure
|
|
|
5 | self test | - MR Node
- NMR Router
- NMR Peer
- Self-test
- Randomize local NI failure
- Randomize Remote NI failure
|
|
|
6 | self test | - MR Node
- NMR Router
- MR Peer
- Self-test
- Randomize local NI failure
- Randomize Remote NI failure
|
|
|
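The "randomize NI failure" steps can be driven from a generated schedule. The following is a minimal sketch, assuming failures are injected with `ip link set <interface> down` as in test 1 and undone with the matching `up` command. The helper name, the delay range, and the interface names are illustrative, and nothing is executed here. The same schedule would be run on the peer for the remote NI failures.

```python
import random


def failure_schedule(interfaces, events, max_gap=30, seed=None):
    """Build a random list of (delay_seconds, argv) link toggles.

    Hypothetical helper: each event flips one randomly chosen interface
    between up and down; trailing events restore every interface to up.
    """
    rng = random.Random(seed)          # seeded so a failing run can be replayed
    state = {name: "up" for name in interfaces}
    schedule = []
    for _ in range(events):
        name = rng.choice(interfaces)
        state[name] = "down" if state[name] == "up" else "up"
        delay = rng.randint(1, max_gap)
        schedule.append((delay, ["ip", "link", "set", name, state[name]]))
    # Leave the node in a clean state once the self-test run is over.
    for name, current in state.items():
        if current == "down":
            schedule.append((0, ["ip", "link", "set", name, "up"]))
    return schedule


if __name__ == "__main__":
    for delay, argv in failure_schedule(["eth0", "eth1"], events=5, seed=1):
        print(f"sleep {delay}; {' '.join(argv)}")
```

Seeding the generator keeps a run reproducible, which helps when one of the self-test combinations above fails and has to be replayed.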
...