System Setup
LNet Resiliency/Health is implemented mainly at the LNet layer. The LND layer is responsible only for propagating specific errors up to the LNet layer, which then reacts to those errors as defined in the Requirements and HLD documentation.
Testing this feature properly requires fine-grained control over LND behavior. Hooks will be added to socklnd and o2iblnd to generate errors on demand. Each LND will listen for IOCTL commands that put it in "debug mode". While in debug mode, the LND is tightly coupled with the userspace utility that initiated the mode. Different debug levels will require the LND to propagate events up to userspace and wait for commands in return. The userspace utility can then instruct the LND to drop messages, return errors, and so on.
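The debug-mode handshake described above can be modeled in a few lines of Python. This is only an illustrative sketch of the control flow (the LND raises an event, blocks, and applies whatever the utility answers); it is not the planned kernel interface. The names `FakeLnd`, `Event`, `Action`, and `drop_even_messages` are hypothetical, and the real hooks would be driven by IOCTLs rather than a Python callback.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Optional


class Action(Enum):
    """Commands the userspace utility may send back (hypothetical set)."""
    PASS = "sent"            # let the message through untouched
    DROP = "dropped"         # silently drop the message
    RETURN_ERROR = "error"   # fail the send with an error


@dataclass
class Event:
    """What the LND reports to userspace while in debug mode."""
    lnd: str      # e.g. "socklnd" or "o2iblnd"
    msg_id: int


# The controlling utility is modeled as a callback: event in, command out.
Controller = Callable[[Event], Action]


@dataclass
class FakeLnd:
    """Stand-in for an LND with an on-demand error hook."""
    name: str
    controller: Optional[Controller] = None
    log: List[str] = field(default_factory=list)

    def enter_debug_mode(self, controller: Controller) -> None:
        # Stands in for the IOCTL that switches the LND into debug mode.
        self.controller = controller

    def leave_debug_mode(self) -> None:
        self.controller = None

    def send(self, msg_id: int) -> str:
        if self.controller is None:
            action = Action.PASS  # normal operation: no interference
        else:
            # Debug mode: propagate the event up and wait for a command.
            action = self.controller(Event(self.name, msg_id))
        self.log.append(f"{self.name} msg {msg_id}: {action.value}")
        return action.value


def drop_even_messages(event: Event) -> Action:
    """Example policy: drop every message with an even id."""
    return Action.DROP if event.msg_id % 2 == 0 else Action.PASS
```

For example, `FakeLnd("socklnd").send(0)` returns `"sent"` outside debug mode, while the same call after `enter_debug_mode(drop_even_messages)` returns `"dropped"`. Keeping the policy in the utility rather than in the LND mirrors the tight coupling described above: the LND only reports and obeys.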
Selection Algorithm Scenarios
Test # | Tag | Procedure | Script | Result |
---|---|---|---|---|
1 | SRC_SPEC_LOCAL_MR_DST | | | |
2 | SRC_SPEC_LOCAL_MR_DST | | | |
3 | SRC_SPEC_ROUTER_MR_DST | | | |
4 | SRC_SPEC_ROUTER_MR_DST | | | |
5 | SRC_SPEC_ROUTER_MR_DST | | | |
6 | SRC_SPEC_ROUTER_MR_DST | | | |
7 | SRC_SPEC_LOCAL_NMR_DST | | | |
8 | SRC_SPEC_ROUTER_NMR_DST | | | |
9 | SRC_ANY_LOCAL_MR_DST | | | |
10 | SRC_ANY_ROUTER_MR_DST | | | |
11 | SRC_ANY_ROUTER_MR_DST | | | |
12 | SRC_ANY_LOCAL_NMR_DST | | | |
13 | SRC_ANY_ROUTER_NMR_DST | | | |
14 | SRC_ANY_ROUTER_NMR_DST | | | |
Error Scenarios
Synchronous Errors
Test # | Tag | Procedure | Script | Result |
---|---|---|---|---|
1 | Immediate Failure | | | |
Asynchronous Errors
Test # | Tag | Procedure | Script | Result |
---|---|---|---|---|
1 | LNET_MSG_STATUS_LOCAL_INTERRUPT, LNET_MSG_STATUS_LOCAL_DROPPED, LNET_MSG_STATUS_LOCAL_ABORTED, LNET_MSG_STATUS_LOCAL_NO_ROUTE, LNET_MSG_STATUS_LOCAL_TIMEOUT | | | |
2 | Sensitivity == 0 | | | |
3 | Sensitivity > 0 | | | |
4 | Sensitivity > 0, buggy interface | | | |
5 | Retry count == 0 | | | |
6 | Retry count > 0 | | | |
7 | REPLY timeout | | | |
8 | ACK timeout | | | |
9 | LNET_MSG_STATUS_LOCAL_ERROR | | | |
10 | LNET_MSG_STATUS_REMOTE_DROPPED | | | |
11 | LNET_MSG_STATUS_REMOTE_ERROR, LNET_MSG_STATUS_REMOTE_TIMEOUT, LNET_MSG_STATUS_NETWORK_TIMEOUT | | | |
Random Failures
Test # | Tag | Procedure | Script | Result |
---|---|---|---|---|
1 | self test | | | |
2 | self test | | | |
3 | self test | | | |
4 | self test | | | |
5 | self test | | | |
6 | self test | | | |
User Interface
Test # | Tag | Procedure | Script | Result |
---|---|---|---|---|
1 | lnet_transaction_timeout | | | |
2 | lnet_retry_count | | | |
3 | lnet_health_sensitivity | | | |
4 | NI statistics | | | |
5 | Peer NI statistics | | | |
6 | NI Health value | | | |
7 | Peer NI Health value | | | |