...
A local failure has occurred, such as no route found or an address resolution error. These failures could be temporary, so LNet will attempt to resend the message. LNet will decrement the health value of the local interface and will select it less often if multiple interfaces are available.
Local/No-Resend
A local non-recoverable error occurred in the system, such as an out-of-memory error. In these cases LNet will not attempt to resend the message. LNet will decrement the health value of the local interface and will select it less often if multiple interfaces are available.
Remote/No-Resend
If LNet successfully sends a message, but the message doesn't complete or an expected reply is not received, then it's classified as a remote error. LNet will not attempt to resend the message, to avoid duplicate messages on the remote end. LNet will decrement the health value of the remote interface and will select it less often if multiple interfaces are available.
Remote/Resend
There is a set of failures where we can be reasonably sure that the message was dropped before getting to the remote end. In this case LNet will attempt to resend the message. LNet will decrement the health value of the remote interface and will select it less often if multiple interfaces are available.
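The four failure classes above can be summarized as a small lookup table. This is only an illustrative sketch, not LNet code: the `FailureHandling` type, the `FAILURE_CLASSES` table and the `handle_failure` helper are hypothetical names, and the label `local/resend` for the first class is assumed by analogy with the other three headings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureHandling:
    resend: bool     # does LNet attempt to resend the message?
    penalized: str   # which interface has its health value decremented


# Hypothetical table mirroring the class descriptions in the text.
# "local/resend" is an assumed label for the first (unnamed) class.
FAILURE_CLASSES = {
    "local/resend": FailureHandling(resend=True, penalized="local"),
    "local/no-resend": FailureHandling(resend=False, penalized="local"),
    "remote/no-resend": FailureHandling(resend=False, penalized="remote"),
    "remote/resend": FailureHandling(resend=True, penalized="remote"),
}


def handle_failure(failure_class: str) -> FailureHandling:
    """Look up how a failure class is handled (case-insensitive)."""
    return FAILURE_CLASSES[failure_class.lower()]


if __name__ == "__main__":
    for name, handling in FAILURE_CLASSES.items():
        print(f"{name}: resend={handling.resend}, penalize {handling.penalized} interface")
```

In every class the affected interface loses health; the classes differ only in whether a resend is safe and in which side is penalized.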
User Interface
LNet Health is turned off by default in releases prior to 2.12.3. Starting from 2.12.3 the feature is turned on by default. There are multiple module parameters available to control the LNet Health feature. They are described in the following sections.
...
When LNet detects a failure on a particular interface it will decrement its Health Value by lnet_health_sensitivity. The value defaults to 100. The lesser the value, the longer it takes for that interface to become healthy again. For example, if set to 100, it takes 10 seconds to fully recover from 0 to 1000; if set to 50, it takes 20 seconds to fully recover from 0 to 1000.
If set to 0, the health value will not be decremented. In essence, the health feature is turned off.
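The recovery times quoted above follow from simple arithmetic. The sketch below assumes that each successful recovery ping adds lnet_health_sensitivity to the health value, that pings are one second apart, that the maximum health value is 1000 (as in the example) and that the health value never drops below 0. The function names are hypothetical and this is a model of the arithmetic, not LNet code.

```python
import math

MAX_HEALTH = 1000  # full health value used in the example above


def apply_failure(health: int, sensitivity: int) -> int:
    """Decrement health by the sensitivity; floor at 0 (assumed)."""
    return max(0, health - sensitivity)


def seconds_to_recover(health: int, sensitivity: int, ping_interval_s: int = 1) -> int:
    """Seconds until health is back at MAX_HEALTH, one increment per ping."""
    if health >= MAX_HEALTH:
        return 0  # nothing to recover
    if sensitivity <= 0:
        raise ValueError("with sensitivity 0 the health value is never changed")
    pings = math.ceil((MAX_HEALTH - health) / sensitivity)
    return pings * ping_interval_s


if __name__ == "__main__":
    for s in (100, 50):
        print(f"sensitivity={s}: {seconds_to_recover(0, s)} s to recover from 0")
```

With a sensitivity of 0, `apply_failure` leaves the health value unchanged, which matches the statement that the feature is effectively off.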
When a failure occurs on an interface, its Health Value is decremented and the interface is flagged for recovery. The recovery mechanism is described in section lnet_recovery_interval.
Code Block |
---|
lnetctl set health_sensitivity: sensitivity to failure
      0 - turn off health evaluation
      >0 - sensitivity value not more than 1000 |
...
When LNet detects a failure on a local or remote interface it will place that interface on a recovery queue. There is a recovery queue for local interfaces and another for remote interfaces.
Prior to 2.15, the interfaces on the recovery queues will be LNet PINGed every lnet_recovery_interval. This value defaults to 1 second. On every successful PING the health value of the interface pinged will be incremented by lnet_health_sensitivity.
Having this value configurable allows system administrators to control the amount of control traffic on the network.
Code Block |
---|
lnetctl set recovery_interval: interval to ping unhealthy interfaces
      >0 - timeout in seconds |
Since 2.15, lnet_recovery_interval was deprecated and the fixed interval was replaced with exponential backoff. For more information, see LU-13569.
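To illustrate the general idea of exponential backoff, the sketch below computes the wait before each successive recovery ping. The base interval, the doubling factor and the cap are purely illustrative assumptions; they are not taken from LU-13569 or from the LNet source, and `backoff_schedule` is a hypothetical helper.

```python
from typing import List


def backoff_schedule(attempts: int, base_s: int = 1, factor: int = 2,
                     cap_s: int = 60) -> List[int]:
    """Wait (seconds) before each ping: base * factor**n, limited to cap_s.

    All default values are assumptions chosen for illustration only.
    """
    return [min(cap_s, base_s * factor ** n) for n in range(attempts)]


if __name__ == "__main__":
    print(backoff_schedule(8))
```

Compared with a fixed interval, a schedule like this sends fewer pings to an interface that stays unhealthy for a long time, which reduces control traffic on the network.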
lnet_retry_count
When LNet detects a failure that it deems appropriate for resending a message, it checks whether the message has already reached the maximum retry_count specified. If it has, and the message still wasn't sent successfully, a failure event is passed up to the layer that initiated the send.
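The retry limit described above can be sketched as a simple loop. This is a simplified model, not LNet code: `send_with_retries` and `send_once` are hypothetical names, the sketch assumes retry_count counts resends after the initial attempt, and it assumes every failure is one that LNet considers safe to resend.

```python
from typing import Callable, Tuple


def send_with_retries(send_once: Callable[[], bool], retry_count: int) -> Tuple[str, int]:
    """Attempt a send, resending at most retry_count times (assumed semantics).

    Returns ("sent", attempts) on success, or ("failure-event", attempts)
    once the retry budget is used up, standing in for the failure event
    passed to the layer that initiated the send.
    """
    attempts = 0
    while True:
        attempts += 1
        if send_once():
            return ("sent", attempts)
        if attempts > retry_count:
            return ("failure-event", attempts)


if __name__ == "__main__":
    # With retry_count = 0 a failed send is reported immediately.
    print(send_with_retries(lambda: False, 0))
```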
...
The core assumption here is that in a healthy network, sending and receiving LNet messages should not have large delays. There could be large delays with RPC messages and their responses, but that's handled at the PTLRPC layer.
Showing LNet Health Configuration Settings
...
Code Block |
---|
#> lnetctl global show
global:
    numa_range: 0
    max_intf: 200
    discovery: 1
    retry_count: 0
    transaction_timeout: 5
    health_sensitivity: 0
    recovery_interval: 1 |
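As a rough sketch of how such output could be checked from a script, the snippet below parses `key: value` lines like those shown above and reports whether health evaluation is on. It works on a pasted copy of the output (`SAMPLE`); the helper names are hypothetical, and the rule that a health_sensitivity of 0 means "off" is taken from the parameter description earlier on this page.

```python
SAMPLE = """\
global:
    numa_range: 0
    max_intf: 200
    discovery: 1
    retry_count: 0
    transaction_timeout: 5
    health_sensitivity: 0
    recovery_interval: 1
"""


def parse_global_show(text: str) -> dict:
    """Collect 'key: value' pairs; lines without a value are skipped."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        value = value.strip()
        if not sep or not value:
            continue  # e.g. the "global:" header or a prompt line
        try:
            settings[key.strip()] = int(value)
        except ValueError:
            settings[key.strip()] = value  # keep non-numeric values as text
    return settings


def health_enabled(settings: dict) -> bool:
    """Health is treated as off when health_sensitivity is 0."""
    return settings.get("health_sensitivity", 0) > 0


if __name__ == "__main__":
    cfg = parse_global_show(SAMPLE)
    print("health enabled:", health_enabled(cfg))
```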
Showing LNet Health Statistics
LNet Health statistics are shown at higher verbosity settings. To show the local interface health statistics:
Code Block |
---|
lnetctl net show -v 3 |
To show the remote interface health statistics:
Code Block |
---|
lnetctl peer show -v 3 |
...
There is a new YAML block, "health stats", which displays the health statistics for each local or remote network interface.
...
Code Block |
---|
#> lnetctl stats show
statistics:
msgs_alloc: 0
msgs_max: 0
rst_alloc: 0
errors: 0
send_count: 0
resend_count: 0
response_timeout_count: 0
local_interrupt_count: 0
local_dropped_count: 0
local_aborted_count: 0
local_no_route_count: 0
local_timeout_count: 0
local_error_count: 0
remote_dropped_count: 0
remote_error_count: 0
remote_timeout_count: 0
network_timeout_count: 0
recv_count: 0
route_count: 0
drop_count: 0
send_length: 0
recv_length: 0
route_length: 0
drop_length: 0
|
Initial Settings Recommendations (prior to 2.13)
LNet Health is off by default. This means that lnet_health_sensitivity and lnet_retry_count are set to 0.
Setting lnet_health_sensitivity to 0 will not decrement the health of the interface on failure and will not change the interface selection behavior. Furthermore, the failed interfaces will not be placed on the recovery queues. In essence, this turns off the LNet Health feature.
The LNet Health settings will need to be tuned for each cluster, but the base configuration would be as follows:
...
If there is a failure on the interface the health value will be decremented by 1 and the interface will be LNet PINGed every 1 second.