...

A local failure has occurred, such as no route found or an address resolution error. These failures may be temporary, so LNet will attempt to resend the message. LNet will decrement the health value of the local interface and, if multiple interfaces are available, will select it less often.

Local/No-Resend

A local non-recoverable error has occurred in the system, such as an out-of-memory error. In these cases LNet will not attempt to resend the message. LNet will decrement the health value of the local interface and, if multiple interfaces are available, will select it less often.

Remote/No-Resend

If LNet successfully sends a message, but the message doesn't complete or an expected reply is not received, then it is classified as a remote error. LNet will not attempt to resend the message, to avoid duplicate messages on the remote end. LNet will decrement the health value of the remote interface and, if multiple interfaces are available, will select it less often.

Remote/Resend

There is a set of failures where we can be reasonably sure that the message was dropped before reaching the remote end. In this case LNet will attempt to resend the message. LNet will decrement the health value of the remote interface and, if multiple interfaces are available, will select it less often.
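The four failure classes above can be summarized as a small lookup table. The following is a minimal Python sketch for illustration only, not LNet code: the class labels (including `local_resend` for the first, unlabeled case) and the helper `handle_failure` are hypothetical names; only the resend and health-decrement behavior in each row comes from the descriptions above.

```python
from dataclasses import dataclass
from enum import Enum


class Side(Enum):
    LOCAL = "local"
    REMOTE = "remote"


@dataclass(frozen=True)
class FailureClass:
    side: Side    # whose interface has its health value decremented
    resend: bool  # whether LNet attempts to resend the message


# Hypothetical labels; the behavior per row follows the text above.
FAILURE_CLASSES = {
    "local_resend": FailureClass(Side.LOCAL, True),
    "local_no_resend": FailureClass(Side.LOCAL, False),
    "remote_no_resend": FailureClass(Side.REMOTE, False),
    "remote_resend": FailureClass(Side.REMOTE, True),
}


def handle_failure(kind: str) -> str:
    """Describe what happens for a given failure class."""
    fc = FAILURE_CLASSES[kind]
    action = "resend" if fc.resend else "do not resend"
    return f"decrement {fc.side.value} interface health; {action}"


if __name__ == "__main__":
    for name in FAILURE_CLASSES:
        print(f"{name}: {handle_failure(name)}")
```

In every class the health value of the affected interface is decremented; the classes differ only in which side is penalized and whether a resend is safe.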

User Interface

LNet Health is turned off by default. There are multiple module parameters available to control the LNet Health feature. They are described in the following sections.

...

When LNet detects a failure on a particular interface it will decrement its Health Value by lnet_health_sensitivity. The greater the value, the longer it takes for that interface to become healthy again. The default value of lnet_health_sensitivity is 0, which means the health value will not be decremented. In essence, the health feature is turned off.

The sensitivity value can be set to a value greater than 0. When a failure then occurs on an interface, its Health Value is decremented and the interface is flagged for recovery. The recovery mechanism is described below.
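The effect of the sensitivity value can be modeled in a few lines. This is a hedged Python sketch rather than the LNet implementation: `MAX_HEALTH`, the `Interface` class, and both helpers are hypothetical, the ceiling of 1000 is an assumption, and always picking the healthiest interface is a simplification of "select it less often".

```python
from dataclasses import dataclass

MAX_HEALTH = 1000  # assumed ceiling, for illustration only


@dataclass
class Interface:
    name: str
    health: int = MAX_HEALTH
    in_recovery: bool = False


def record_failure(intf: Interface, sensitivity: int) -> None:
    """Apply one failure to an interface."""
    if sensitivity <= 0:
        return  # sensitivity 0: health feature is effectively off
    intf.health = max(0, intf.health - sensitivity)
    intf.in_recovery = True  # flagged for recovery


def pick_interface(interfaces: list) -> Interface:
    """Simplified selection: prefer the interface with the highest health."""
    return max(interfaces, key=lambda i: i.health)


if __name__ == "__main__":
    a, b = Interface("if-a"), Interface("if-b")
    record_failure(a, sensitivity=10)
    print(a, pick_interface([a, b]).name)
```

With a sensitivity of 0 the call is a no-op, which matches the default behavior described above.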

...

The core assumption here is that in a healthy network, sending and receiving LNet messages should not incur large delays. There could be large delays with RPC messages and their responses, but that is handled at the PTLRPC layer.

Showing LNet Health Configuration Settings

...

Code Block
#> lnetctl global show
global:
    numa_range: 0
    max_intf: 200
    discovery: 1
    retry_count: 0
    transaction_timeout: 5
    health_sensitivity: 0
    recovery_interval: 1
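When checking many nodes it can help to read these settings programmatically. The sketch below is one possible approach in Python, assuming output shaped exactly like the sample above; `parse_global` and `health_enabled` are hypothetical helper names, and the sample text is embedded rather than captured from a live system.

```python
SAMPLE = """\
global:
    numa_range: 0
    max_intf: 200
    discovery: 1
    retry_count: 0
    transaction_timeout: 5
    health_sensitivity: 0
    recovery_interval: 1
"""


def parse_global(text: str) -> dict:
    """Parse 'key: value' lines with integer values into a dict."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and value.strip():  # skips the bare 'global:' header
            settings[key] = int(value)
    return settings


def health_enabled(settings: dict) -> bool:
    """Health is considered on when health_sensitivity is greater than 0."""
    return settings.get("health_sensitivity", 0) > 0


if __name__ == "__main__":
    cfg = parse_global(SAMPLE)
    print(cfg)
    print("health enabled:", health_enabled(cfg))
```

For the sample shown, health_sensitivity is 0, so the check reports the feature as off.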

Showing LNet Health Statistics

LNet Health statistics are shown at a higher verbosity setting. To show the local interface health statistics:

Code Block
lnetctl net show -v 3

To show the remote interface health statistics:

Code Block
lnetctl peer show -v 3

...

There is a new YAML block, "health stats", which displays the health statistics for each local or remote network interface.

...

LNet Health is off by default. This means that lnet_health_sensitivity and lnet_retry_count are set to 0.

Setting lnet_health_sensitivity to 0 means the health value of an interface is not decremented on failure and the interface selection behavior does not change. Furthermore, failed interfaces will not be placed on the recovery queues. In essence, this turns off the LNet Health feature.

The LNet Health settings will need to be tuned for each cluster, but the base configuration would be as follows:

...

If there is a failure on the interface, the health value will be decremented by 1 and the interface will be LNet PINGed every 1 second.
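To get a rough feel for how the decrement and the ping interval interact, the following Python sketch estimates how long an interface stays in recovery. It rests on assumptions not stated in this section: that every recovery ping succeeds, that each successful ping restores a fixed amount of health (`gain_per_ping`, assumed to be 1), and that full health is 1000. The function name is hypothetical.

```python
def seconds_to_recover(health: int, max_health: int,
                       recovery_interval: int, gain_per_ping: int = 1) -> int:
    """Estimate recovery time, assuming every ping succeeds (illustrative only)."""
    missing = max(0, max_health - health)
    pings = -(-missing // gain_per_ping)  # ceiling division
    return pings * recovery_interval


if __name__ == "__main__":
    # One failure with a decrement of 1, pinged every 1 second.
    print(seconds_to_recover(999, 1000, recovery_interval=1))
    # One failure with a decrement of 10 takes proportionally longer.
    print(seconds_to_recover(990, 1000, recovery_interval=1))
```

Under these assumptions a larger decrement lengthens the recovery period in direct proportion, which is why the setting needs tuning per cluster.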