System Setup
LNet Resiliency/Health is mainly implemented at the LNet layer. The LND layer is only responsible for propagating specific errors up to the LNet Layer, which then reacts to those errors as defined in the Requirements and HLD documentation.
To test this feature properly, fine-grained control over LND behavior is required. Hooks will be added to socklnd and o2iblnd to generate errors on demand. Each LND will listen for IOCTL commands that put it in "debug mode". While in debug mode, the LND is tightly coupled with the userspace utility that initiated it. Depending on the debug level, the LND propagates events up to userspace and waits for commands in response. The userspace utility can then instruct the LND to drop messages, return errors, and so on.
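The event/command handshake described above can be pictured with a small userspace-side model. This is only a sketch: `LndEvent`, `Action`, and `DebugController` are hypothetical names invented for illustration, and the actual IOCTL interface is not defined in this document.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    PASS = "pass"            # let the message through untouched
    DROP = "drop"            # silently drop the message
    RETURN_ERROR = "error"   # fail the operation with an error


@dataclass(frozen=True)
class LndEvent:
    lnd: str        # "socklnd" or "o2iblnd"
    msg_type: str   # e.g. "PING"
    seq: int        # running count of events seen while in debug mode


class DebugController:
    """Hypothetical userspace utility: maps each LND event to a command."""

    def __init__(self, rules):
        # rules is a list of (predicate, Action) pairs; the first match wins
        self.rules = list(rules)

    def decide(self, event: LndEvent) -> Action:
        for predicate, action in self.rules:
            if predicate(event):
                return action
        return Action.PASS   # default: do not interfere


# Example policy: drop every odd-numbered PING seen by socklnd.
controller = DebugController([
    (lambda e: e.lnd == "socklnd" and e.msg_type == "PING" and e.seq % 2 == 1,
     Action.DROP),
])
```

In this model the LND would block after reporting each event until `decide` returns, which mirrors the "propagate events up and wait for commands back" behavior.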
Selection Algorithm Scenarios
Test # | Tag | Procedure | Script | Result |
---|---|---|---|---|
1 | SRC_SPEC_LOCAL_MR_DST | - MR Node<br>- MR Peer<br>- Send a ping<br>- REPLY for PING should always come on the same interface that PING was sent on.<br>- Check the TRACE in the logs to verify<br>- Repeat the test. A different local NI should be used for each new PING. | | |
2 | SRC_SPEC_LOCAL_MR_DST | - MR Node<br>- MR Peer<br>- Initiate discovery<br>- Node → PING → Peer<br>- Node ← PUSH ← Peer<br>- Node should respond with an ACK to the same interface as the one it received the PUSH on<br>- Check the TRACE in the logs to verify<br>- Repeat the test<br>- Peer's local_ni when sending the PUSH should be different. | | |
3 | SRC_SPEC_ROUTER_MR_DST | - MR Node<br>- NMR Router<br>- MR Peer<br>- Send a ping<br>- REPLY for PING should always come on the same interface that PING was sent on.<br>- Check the TRACE in the logs to verify<br>- Router should be used<br>- Repeat the test. A different local NI should be used for each new PING. | | |
4 | SRC_SPEC_ROUTER_MR_DST | | | |
5 | SRC_SPEC_ROUTER_MR_DST | - MR Node<br>- MR Router<br>- MR Peer<br>- Send a ping<br>- REPLY for PING should always come on the same interface that PING was sent on.<br>- Check the TRACE in the logs to verify<br>- Repeat sending<br>- Router interfaces should be used in round robin, while the peer destination should remain constant.<br>- Repeat the test. A different local NI should be used for each new PING. | | |
6 | SRC_SPEC_ROUTER_MR_DST | | | |
7 | SRC_SPEC_LOCAL_NMR_DST | - Same as 1 and 2<br>- Except that repeating the test will not result in a different local_ni being used. | | |
8 | SRC_SPEC_ROUTER_NMR_DST | - Same as 3 - 6<br>- Except that repeating the test will not result in different local NIs being used. | | |
9 | SRC_ANY_LOCAL_MR_DST | - MR Node<br>- MR Peer<br>- Send multiple PINGs<br>- Each PING REPLY should come on the same interface its PING was sent on<br>- Every PING will select a new local/remote NI pair | | |
10 | SRC_ANY_ROUTER_MR_DST | - MR Node<br>- NMR Router<br>- MR Peer<br>- Send multiple PINGs<br>- Node will cycle over local_NIs<br>- Node will use the same destination NID as final destination<br>- Node will use the NMR Router | | |
11 | SRC_ANY_ROUTER_MR_DST | - MR Node<br>- MR Router<br>- MR Peer<br>- Send multiple PINGs<br>- Node will cycle over local_NIs<br>- Node will use the same destination NID as final destination<br>- Node will use the different interfaces of the MR Router<br>- MR Router will cycle over the interfaces of the final destination. | | |
12 | SRC_ANY_LOCAL_NMR_DST | - MR Node<br>- NMR Peer<br>- Send multiple PINGs<br>- Node will use the same source/dst NIDs for all PINGs | | |
13 | SRC_ANY_ROUTER_NMR_DST | - MR Node<br>- NMR Router<br>- NMR Peer<br>- Send multiple PINGs<br>- Node will use the same source/dst NIDs for all PINGs<br>- Node will use the router interface | | |
14 | SRC_ANY_ROUTER_NMR_DST | - MR Node<br>- MR Router<br>- NMR Peer<br>- Send multiple PINGs<br>- Node will use the same source/dst NIDs for all PINGs<br>- Node will cycle through the Router's interfaces | | |
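Most of the checks in this table reduce to assertions over the interfaces recorded in the TRACE logs. A minimal sketch of such checks follows, assuming the log has already been parsed into one record per PING. The `PingTrace` record, the helper names, and the sample NIDs are hypothetical, and "a different local NI for each new PING" is read here as "different from the previous PING".

```python
from typing import List, NamedTuple


class PingTrace(NamedTuple):
    sent_from: str   # local NID the PING left on
    sent_to: str     # peer NID the PING was addressed to
    reply_on: str    # local NID the REPLY arrived on


def reply_on_same_interface(trace: List[PingTrace]) -> bool:
    """Every REPLY must arrive on the NI its PING was sent from."""
    return all(t.reply_on == t.sent_from for t in trace)


def local_ni_changes_each_ping(trace: List[PingTrace]) -> bool:
    """MR destination: consecutive PINGs use different local NIs."""
    return all(a.sent_from != b.sent_from for a, b in zip(trace, trace[1:]))


def local_ni_constant(trace: List[PingTrace]) -> bool:
    """NMR destination: every PING uses the same local NI."""
    return len({t.sent_from for t in trace}) <= 1


# Made-up sample data for an MR node talking to an MR peer.
mr_trace = [
    PingTrace("10.0.0.1@tcp", "10.0.0.10@tcp", "10.0.0.1@tcp"),
    PingTrace("10.0.0.2@tcp", "10.0.0.11@tcp", "10.0.0.2@tcp"),
]
```

For the MR cases, `reply_on_same_interface` and `local_ni_changes_each_ping` should both hold. For the NMR cases (tests 7, 8, 12 to 14), `local_ni_constant` replaces the second check.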
Error Scenarios
Synchronous Errors
Test # | Tag | Procedure | Script | Result |
---|---|---|---|---|
1 | Immediate Failure | - Send a PING<br>- simulate an immediate LND failure (EX: NOMEM)<br>- Message should not be resent | | |
Asynchronous Errors
Test # | Tag | Procedure | Script | Result |
---|---|---|---|---|
1 | LNET_MSG_STATUS_LOCAL_INTERRUPT<br>LNET_MSG_STATUS_LOCAL_DROPPED<br>LNET_MSG_STATUS_LOCAL_ABORTED<br>LNET_MSG_STATUS_LOCAL_NO_ROUTE<br>LNET_MSG_STATUS_LOCAL_TIMEOUT | - MR Node with multiple interfaces<br>- Send a PING<br>- Simulate one of the errors listed in the Tag column<br>- PING msg should be queued on resend queue<br>- PING msg will be resent on a different interface<br>- Failed interface's health value will be decremented<br>- Failed interface will be placed on the recovery queue | | |
2 | Sensitivity == 0 | - Same setup as 1<br>- NI is not placed on the recovery queue | | |
3 | Sensitivity > 0 | - Same setup as 1<br>- NI is placed on the recovery queue<br>- Monitor network activity as NI is pinged until health is back to maximum | | |
4 | Sensitivity > 0 Buggy interface | - Same setup as 1<br>- NI is placed on recovery queue<br>- NI is pinged every 1 second<br>- Simulate ping failure every other ping<br>- NI's health should be decremented on failure<br>- NI should remain on the recovery queue | | |
5 | Retry count == 0 | - Same setup as 1<br>- Message will not be retried and the message will be finalized immediately | | |
6 | Retry count > 0 | - Same setup as 1<br>- Message will be retransmitted up to retry count times or until the message expires | | |
7 | REPLY timeout | - Same setup as 1<br>- Except use LNet selftest<br>- Simulate a local timeout<br>- Re-transmit<br>- No REPLY received<br>- Message is finalized and TIMEOUT event is propagated. | | |
8 | ACK timeout | - Same setup as 7 except simulate ACK timeout | | |
9 | LNET_MSG_STATUS_LOCAL_ERROR | - Same setup as 1<br>- Message is finalized immediately (not resent)<br>- Local NI is placed on the recovery queue<br>- Same procedure to recover the local NI | | |
10 | LNET_MSG_STATUS_REMOTE_DROPPED | - Same setup as 1<br>- Message is queued for resend depending on retry_count<br>- peer_ni is placed on the recovery queue (not if sensitivity == 0)<br>- peer_ni is pinged every 1 second | | |
11 | LNET_MSG_STATUS_REMOTE_ERROR<br>LNET_MSG_STATUS_REMOTE_TIMEOUT<br>LNET_MSG_STATUS_NETWORK_TIMEOUT | - Same setup as 1<br>- Message is not resent<br>- peer_ni recovery happens as outlined in previous cases | | |
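The expected reactions in this table can be collected into a small reference model that a test script compares against observed behavior. The sketch below transcribes the resend and recovery expectations from the rows above. The maximum health value of 1000 and the recovery increment of 1 per successful ping are assumptions, since this plan only says "maximum" and "decremented". `NiHealth` and `should_resend` are hypothetical helper names.

```python
from dataclasses import dataclass

MAX_HEALTH = 1000  # assumption: the plan does not state the maximum

# status -> (resend allowed?, which NI is put into recovery)
EXPECTED = {
    "LNET_MSG_STATUS_LOCAL_INTERRUPT": (True, "local"),
    "LNET_MSG_STATUS_LOCAL_DROPPED": (True, "local"),
    "LNET_MSG_STATUS_LOCAL_ABORTED": (True, "local"),
    "LNET_MSG_STATUS_LOCAL_NO_ROUTE": (True, "local"),
    "LNET_MSG_STATUS_LOCAL_TIMEOUT": (True, "local"),
    "LNET_MSG_STATUS_LOCAL_ERROR": (False, "local"),
    "LNET_MSG_STATUS_REMOTE_DROPPED": (True, "peer"),
    "LNET_MSG_STATUS_REMOTE_ERROR": (False, "peer"),
    "LNET_MSG_STATUS_REMOTE_TIMEOUT": (False, "peer"),
    "LNET_MSG_STATUS_NETWORK_TIMEOUT": (False, "peer"),
}


def should_resend(status: str, retries_done: int, retry_count: int) -> bool:
    """Resend only for resendable statuses and while retries remain."""
    resend_allowed, _ = EXPECTED[status]
    return resend_allowed and retries_done < retry_count


@dataclass
class NiHealth:
    value: int = MAX_HEALTH
    on_recovery_queue: bool = False

    def on_failure(self, sensitivity: int) -> None:
        # Sensitivity 0: the NI is never queued for recovery.
        if sensitivity == 0:
            return
        self.value = max(0, self.value - sensitivity)
        self.on_recovery_queue = True

    def on_recovery_ping(self, ok: bool, sensitivity: int,
                         increment: int = 1) -> None:
        # increment is an assumed recovery step, not taken from this plan
        if ok:
            self.value = min(MAX_HEALTH, self.value + increment)
        else:
            self.value = max(0, self.value - sensitivity)
        # The NI leaves the queue only once health is back to maximum.
        self.on_recovery_queue = self.value < MAX_HEALTH
```

Under these assumptions, test 4 (every other recovery ping fails) keeps the NI on the recovery queue indefinitely, because each failure removes more health than one successful ping restores.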
Random Failures
Test # | Tag | Procedure | Script | Result |
---|---|---|---|---|
1 | self test | - MR Node<br>- NMR Peer<br>- Self-test<br>- Randomize local NI failure<br>- Randomize Remote NI failure | | |
2 | self test | - MR Node<br>- MR Peer<br>- Self-test<br>- Randomize local NI failure<br>- Randomize Remote NI failure | | |
3 | self test | - MR Node<br>- MR Router<br>- NMR Peer<br>- Self-test<br>- Randomize local NI failure<br>- Randomize Remote NI failure | | |
4 | self test | - MR Node<br>- MR Router<br>- MR Peer<br>- Self-test<br>- Randomize local NI failure<br>- Randomize Remote NI failure | | |
5 | self test | - MR Node<br>- NMR Router<br>- NMR Peer<br>- Self-test<br>- Randomize local NI failure<br>- Randomize Remote NI failure | | |
6 | self test | - MR Node<br>- NMR Router<br>- MR Peer<br>- Self-test<br>- Randomize local NI failure<br>- Randomize Remote NI failure | | |
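The six rows above are the cross product of three router options (none, MR, NMR) and two peer types, always with an MR node. A test driver could enumerate them and generate a reproducible failure schedule from a seed, so that a failing run can be replayed. The sketch below is one possible shape for that. The function names, the per-step failure probability, and the schedule format are all assumptions.

```python
import itertools
import random

ROUTERS = (None, "MR", "NMR")   # None means no router in the path
PEERS = ("NMR", "MR")


def topologies():
    """Return (node, router, peer) tuples in the same order as the table."""
    return [("MR", r, p) for r, p in itertools.product(ROUTERS, PEERS)]


def failure_schedule(seed, steps, local_nis, remote_nis, fail_prob=0.2):
    """Pick, per step, an optional local NI and remote NI to fail.

    A fixed seed yields the same schedule on every run.
    """
    rng = random.Random(seed)
    schedule = []
    for step in range(steps):
        local = rng.choice(local_nis) if rng.random() < fail_prob else None
        remote = rng.choice(remote_nis) if rng.random() < fail_prob else None
        schedule.append((step, local, remote))
    return schedule
```

Logging the seed alongside the self-test results is what makes a randomized run debuggable afterwards.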
User Interface
Test # | Tag | Procedure | Script | Result |
---|---|---|---|---|
1 | lnet_transaction_timeout | - Set lnet_transaction_timeout to a value < retry_count via lnetctl and YAML<br>- This should lead to a failure to set<br>- Set lnet_transaction_timeout to a value > retry_count via lnetctl and YAML<br>- lnet_lnd_timeout value should == lnet_transaction_timeout / retry_count<br>- Show value via "lnetctl global show" | | |
2 | lnet_retry_count | - Set the lnet_retry_count to a value > lnet_transaction_timeout via lnetctl and YAML<br>- This should lead to a failure to set<br>- Set the lnet_retry_count to a value < lnet_transaction_timeout via lnetctl and YAML<br>- lnet_lnd_timeout value should == lnet_transaction_timeout / retry_count<br>- Show value via "lnetctl global show" | | |
3 | lnet_health_sensitivity | - Set the lnet_health_sensitivity from lnetctl and from YAML<br>- Show value via "lnetctl global show" | | |
4 | NI statistics | - Verify LNet health statistics | | |
5 | Peer NI statistics | - Verify LNet health statistics for peer NIs | | |
6 | NI Health value | - Verify setting the local NI health value<br>- lnetctl net set --nid <nid> --health <value><br>- Redo from YAML | | |
7 | Peer NI Health value | - Verify setting the peer NI health value<br>- lnetctl peer set --nid <nid> --health <value><br>- Redo from YAML | | |
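Tests 1 and 2 state a constraint between the two tunables and a derived value, which a script can compute ahead of time and compare with what "lnetctl global show" reports. The sketch below follows the relationship as written in the table. Integer division, rejecting equal values, and the handling of retry_count == 0 are assumptions, because the table does not cover those cases.

```python
def transaction_timeout_is_valid(transaction_timeout: int,
                                 retry_count: int) -> bool:
    """The table requires lnet_transaction_timeout > retry_count.

    Rejecting the equal case is an assumption.
    """
    return transaction_timeout > retry_count


def expected_lnd_timeout(transaction_timeout: int, retry_count: int) -> int:
    """Expected lnet_lnd_timeout for an accepted pair of settings."""
    if not transaction_timeout_is_valid(transaction_timeout, retry_count):
        raise ValueError("setting should have been rejected")
    if retry_count == 0:
        # Assumption: with no retries the whole transaction timeout
        # is available to the LND.
        return transaction_timeout
    # Assumption: the division in the table is integer division.
    return transaction_timeout // retry_count
```

For example, a transaction timeout of 10 with a retry count of 2 gives an expected LND timeout of 5 under these assumptions, while a transaction timeout of 2 with a retry count of 3 should be refused by the set operation.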