...
To test this feature properly, fine-grained control over the LND behavior is required. Hooks will be added in the socklnd and o2iblnd to generate errors on demand. Each LND will listen for IOCTL commands that put it in "debug mode". While in "debug mode", the LND is tightly coupled with the userspace utility that initiated the debug mode. Different levels of debug will necessitate that the LND propagate events up to userspace and wait for commands back. The userspace utility can then instruct the LND to drop messages, return errors, etc. The drop/delay message policies will be modified to simulate the various errors that could occur when sending a message. This is described in the sections below.
Selection Algorithm Scenarios
...
Test # | Tag | Procedure | Script | Result |
---|---|---|---|---|
1 | LNET_MSG_STATUS_LOCAL_INTERRUPT, LNET_MSG_STATUS_LOCAL_DROPPED, LNET_MSG_STATUS_LOCAL_ABORTED, LNET_MSG_STATUS_LOCAL_NO_ROUTE, LNET_MSG_STATUS_LOCAL_TIMEOUT | | | |
2 | Sensitivity == 0 | | | |
3 | Sensitivity > 0 | | | |
4 | Sensitivity > 0, Buggy interface | | | |
5 | Retry count == 0 | | | |
6 | Retry count > 0 | | | |
7 | REPLY timeout | | | |
8 | ACK timeout | | | |
9 | LNET_MSG_STATUS_LOCAL_ERROR | | | |
10 | LNET_MSG_STATUS_REMOTE_DROPPED | | | |
11 | LNET_MSG_STATUS_REMOTE_ERROR, LNET_MSG_STATUS_REMOTE_TIMEOUT, LNET_MSG_STATUS_NETWORK_TIMEOUT | | | |
Random Failures
Test # | Tag | Procedure | Script | Result |
---|---|---|---|---|
1 | self test | | | |
2 | self test | | | |
3 | self test | | | |
4 | self test | | | |
5 | self test | | | |
6 | self test | | | |
User Interface
Test # | Tag | Procedure | Script | Result |
---|---|---|---|---|
1 | lnet_transaction_timeout | | | |
2 | lnet_retry_count | | | |
3 | lnet_health_sensitivity | | | |
4 | NI statistics | | | |
5 | Peer NI statistics | | | |
6 | NI Health value | | | |
7 | Peer NI Health value | | | |
Testing Tools
The drop policy has been modified to drop outgoing messages with specific errors. This can be done via the following commands. For details on these commands, refer to the code. A combination of these commands on the different nodes should cover approximately 75% of the health code paths.
```
lctl net_drop_add -s *@tcp -d *@tcp -m ACK -i 20
lctl net_drop_add -s *@tcp -d *@tcp -m REPLY -i 20
lctl net_drop_add -s *@tcp -d *@tcp -m GET -i 1 -e random
lctl net_drop_add -s *@tcp -d *@tcp -m PUT -i 20 -e random
lctl net_drop_del -s *@tcp -d *@tcp
```
The `-e` parameter can take the following arguments:
```
local_interrupt   # will result in a resend
local_dropped     # will result in a resend
local_aborted     # will result in a resend
local_no_route    # will result in a resend
local_error       # will not result in a resend
local_timeout     # will result in a resend
remote_error      # will not result in a resend
remote_dropped    # will result in a resend
remote_timeout    # will not result in a resend
network_timeout   # will not result in a resend
random
drop_send         # will drop the message and never call lnet_finalize()
```
The `-e` parameter can be repeated multiple times to specify different errors. The error simulated will be selected randomly from the ones specified. `random` can be given to select any error to simulate at random.
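As a minimal sketch of how repeated `-e` options could be assembled in a test script, the helper below prints (but does not run) an `lctl net_drop_add` command. The function names `is_known_error` and `build_drop_rule` are hypothetical and not part of Lustre; the `*@tcp` source and destination and the error names are copied from the examples above.

```shell
#!/bin/sh
# Error names accepted by -e, copied from the list above.
KNOWN_ERRORS="local_interrupt local_dropped local_aborted local_no_route local_error local_timeout remote_error remote_dropped remote_timeout network_timeout random drop_send"

# is_known_error NAME (hypothetical helper): succeed if NAME is a listed -e argument.
is_known_error() {
    case " $KNOWN_ERRORS " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# build_drop_rule MSG_TYPE INTERVAL [ERROR...] (hypothetical helper)
# Prints an lctl net_drop_add command for *@tcp -> *@tcp, adding one -e
# option per requested error. Nothing is executed.
build_drop_rule() (
    msg_type=$1
    interval=$2
    shift 2
    cmd="lctl net_drop_add -s *@tcp -d *@tcp -m $msg_type -i $interval"
    for err in "$@"; do
        if ! is_known_error "$err"; then
            echo "unknown error name: $err" >&2
            exit 1
        fi
        cmd="$cmd -e $err"
    done
    printf '%s\n' "$cmd"
)

# prints: lctl net_drop_add -s *@tcp -d *@tcp -m PUT -i 20 -e local_timeout -e remote_dropped
build_drop_rule PUT 20 local_timeout remote_dropped
```

Printing the command rather than running it keeps the sketch safe to try on a node without LNet configured; the output can be reviewed and then pasted into a shell on the test node.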
Simulation Details
The error simulation occurs immediately before putting the message on the wire. If a drop rule is defined and the message matches it, the message is not sent and an error is simulated instead.
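When scripting checks around this simulation, it may help to encode which simulated errors are expected to trigger a resend. The sketch below only restates the comments from the `-e` argument list; the function name `expect_resend` is hypothetical, and `random` and `drop_send` are reported as `unspecified` because the list gives no resend behavior for them.

```shell
#!/bin/sh
# expect_resend ERROR (hypothetical helper): print "yes" or "no" according to
# the resend comments in the -e argument list, or "unspecified" when that
# list does not say.
expect_resend() {
    case "$1" in
        local_interrupt|local_dropped|local_aborted|local_no_route|local_timeout|remote_dropped)
            echo yes ;;
        local_error|remote_error|remote_timeout|network_timeout)
            echo no ;;
        *)
            echo unspecified ;;
    esac
}

# Show the expected outcome for a few simulated errors.
for err in local_timeout remote_error drop_send; do
    printf '%s -> resend: %s\n' "$err" "$(expect_resend "$err")"
done
```

A test script could compare this expectation against the observed behavior after installing a drop rule with a single `-e` value.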
Types of Error Simulation Testing
Type | Description |
---|---|
Drop with error | This is the newly added error simulation, designed to simulate different health failures. This can be used to exercise the following scenarios |
Drop Received messages | This is an existing rule that can be used to drop received GET/PUT messages. This will result in no ACK/REPLY being sent to the message initiator and will exercise the response timeout code. |
Configuration
LNet Health is off by default. To turn it on, the retry count and the health sensitivity need to be set to non-zero values; the transaction timeout can be set as well:
- Retry counter. This indicates how many times a message should be resent before it succeeds or times out. It defaults to 0, which means no message re-transmission will occur.
lnetctl set retry_count <value>
- Health sensitivity. This is the value by which to decrement the health of an interface. When an interface's health value goes below the optimal value, the interface is placed on a recovery queue and is pinged every second until its health recovers. The default is 0, which means that a peer or local NI will never go into the recovery state.
lnetctl set health_sensitivity <value>
- Transaction timeout. This is the timeout value to wait before a response expires or before a message on the active list expires.
lnetctl set transaction_timeout <value>
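The three settings above could be collected in one helper, sketched here under the assumption that each value is a non-negative integer. The function name `health_config_cmds` is hypothetical, and the values 3, 100 and 30 are purely illustrative, not recommendations. The helper prints the `lnetctl set` commands instead of running them.

```shell
#!/bin/sh
# health_config_cmds RETRY_COUNT SENSITIVITY TIMEOUT (hypothetical helper)
# Prints the three lnetctl commands described above. Nothing is executed.
health_config_cmds() (
    # Assumption: every value is a non-negative integer.
    for v in "$1" "$2" "$3"; do
        case "$v" in
            ''|*[!0-9]*)
                echo "expected a non-negative integer, got '$v'" >&2
                exit 1 ;;
        esac
    done
    printf 'lnetctl set retry_count %s\n' "$1"
    printf 'lnetctl set health_sensitivity %s\n' "$2"
    printf 'lnetctl set transaction_timeout %s\n' "$3"
)

# Illustrative values only.
health_config_cmds 3 100 30
```

Passing 0 for the first two arguments yields the commands that return LNet Health to its default, disabled state.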