
...

When LNet detects a failure on a particular interface it will decrement its Health Value by lnet_health_sensitivity. The value defaults to 100. The lower the value, the longer it takes for that interface to become healthy again. For example, if set to 100, it takes 10 seconds to fully recover from 0 to 1000; if set to 50, it takes 20 seconds to fully recover from 0 to 1000.
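
The arithmetic behind these recovery times can be sketched as follows. This is a minimal model, not LNet code: it assumes one successful recovery PING per interval, the 1-second default interval, a health increment equal to lnet_health_sensitivity on each PING, and a maximum health value of 1000. The function and parameter names are hypothetical.

```python
import math

MAX_HEALTH = 1000  # maximum health value, as used in the examples above


def seconds_to_recover(sensitivity, interval=1, start_health=0):
    """Estimate the time for an interface to climb back to full health.

    Assumes every recovery PING succeeds and adds `sensitivity` to the
    health value, with one PING sent every `interval` seconds.
    """
    if sensitivity <= 0:
        # With sensitivity 0 the health value is never decremented,
        # so there is nothing to recover from.
        raise ValueError("health evaluation is off when sensitivity is 0")
    pings_needed = math.ceil((MAX_HEALTH - start_health) / sensitivity)
    return pings_needed * interval


print(seconds_to_recover(100))  # 10 pings at 1 second each
print(seconds_to_recover(50))   # 20 pings at 1 second each
```

Under these assumptions the recovery time scales inversely with the sensitivity, which is why a smaller value keeps an interface unhealthy for longer.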

If lnet_health_sensitivity is set to 0, the health value will not be decremented. In essence the health feature is turned off.

The sensitivity value can be set greater than 0. When a failure occurs on an interface, its Health Value is decremented and the interface is flagged for recovery. The recovery mechanism is described in section lnet_recovery_interval.

Code Block
lnetctl set health_sensitivity: sensitivity to failure
        0 - turn off health evaluation
        >0 - sensitivity value not more than 1000

...

When LNet detects a failure on a local or remote interface it will place that interface on a recovery queue. There is a recovery queue for local interfaces and another for remote interfaces.

Prior to 2.15, the interfaces on the recovery queues are LNet PINGed every lnet_recovery_interval. This value defaults to 1 second. On every successful PING the health value of the interface pinged is incremented by lnet_health_sensitivity.

Having this value configurable allows system administrators to control the amount of control traffic on the network.

Code Block
lnetctl set recovery_interval: interval to ping unhealthy interfaces
        >0 - timeout in seconds

As of 2.15, lnet_recovery_interval is deprecated and the fixed interval has been replaced with exponential backoff. For more information, see LU-13569.
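
For readers unfamiliar with the term, the general shape of an exponential backoff schedule is sketched below. This only illustrates the technique: the starting interval, the doubling factor, and the cap are assumptions chosen for the example, and the schedule actually implemented by LNet is the one described in LU-13569.

```python
from itertools import islice


def backoff_intervals(base=1, factor=2, cap=64):
    """Yield successive wait times that grow geometrically up to `cap`.

    All three parameters are hypothetical illustration values,
    not LNet defaults.
    """
    interval = base
    while True:
        yield interval
        interval = min(interval * factor, cap)


# First eight waits (in seconds) between recovery PINGs in this model.
print(list(islice(backoff_intervals(), 8)))
```

Compared with a fixed interval, a schedule like this sends fewer PINGs to an interface that stays down for a long time, which reduces control traffic on the network.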

lnet_retry_count

When LNet detects a failure that it deems appropriate for re-sending a message, it checks whether the message has already reached the maximum retry_count specified. Once that limit is reached and the message still has not been sent successfully, a failure event is passed up to the layer that initiated the send.
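
The retry behaviour can be modelled with a short sketch. It is a simplified illustration rather than LNet's implementation: `send_once` and `SendFailure` are hypothetical names, every failure is treated as eligible for re-sending, and retry_count is interpreted as the number of re-sends allowed after the first attempt.

```python
class SendFailure(Exception):
    """Hypothetical failure event delivered to the layer that initiated the send."""


def send_with_retries(send_once, retry_count):
    """Call send_once() until it succeeds or the retry budget is used up.

    Returns the number of attempts made on success and raises
    SendFailure once retry_count re-sends have all failed.
    """
    attempts = 0
    while True:
        attempts += 1
        if send_once():
            return attempts
        if attempts > retry_count:
            # Original attempt plus retry_count re-sends have all failed.
            raise SendFailure(f"gave up after {attempts} attempts")


def flaky_sender(failures_before_success):
    """Build a hypothetical sender that fails a fixed number of times first."""
    remaining = [failures_before_success]

    def send_once():
        if remaining[0] > 0:
            remaining[0] -= 1
            return False
        return True

    return send_once


print(send_with_retries(flaky_sender(2), retry_count=2))
```

In this model a retry_count of 0 means the message is attempted once and any failure is reported to the upper layer immediately.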

...

Code Block
#> lnetctl stats show
statistics:
    msgs_alloc: 0
    msgs_max: 0
    rst_alloc: 0
    errors: 0
    send_count: 0
    resend_count: 0
    response_timeout_count: 0
    local_interrupt_count: 0
    local_dropped_count: 0
    local_aborted_count: 0
    local_no_route_count: 0
    local_timeout_count: 0
    local_error_count: 0
    remote_dropped_count: 0
    remote_error_count: 0
    remote_timeout_count: 0
    network_timeout_count: 0
    recv_count: 0
    route_count: 0
    drop_count: 0
    send_length: 0
    recv_length: 0
    route_length: 0
    drop_length: 0
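
Output in this form is easy to check from a script. The sketch below parses text in the format shown above and reports the counters that are non-zero. It is a minimal example under stated assumptions: which counters count as health-related is a guess based on their names, and the sample values are invented for illustration, not taken from a real system.

```python
# Counters treated as failure indicators here; this selection is an assumption.
HEALTH_PREFIXES = ("local_", "remote_")
HEALTH_NAMES = {"resend_count", "response_timeout_count", "network_timeout_count"}


def parse_stats(text):
    """Turn 'name: value' lines into a dict, skipping non-numeric lines."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[key.strip()] = int(value)
    return stats


def nonzero_health_counters(stats):
    """Return only the failure-related counters with a non-zero value."""
    return {
        name: value
        for name, value in stats.items()
        if value and (name in HEALTH_NAMES or name.startswith(HEALTH_PREFIXES))
    }


# Invented sample in the same layout as the output above.
sample = """statistics:
    send_count: 120
    resend_count: 3
    local_timeout_count: 2
    remote_dropped_count: 0
    recv_count: 118
"""
print(nonzero_health_counters(parse_stats(sample)))
```

On a live node the text would come from running `lnetctl stats show`; the parsing is kept separate here so it can be tried without LNet loaded.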


Initial Settings Recommendations (prior to 2.13)

LNet Health is off by default. This means that lnet_health_sensitivity and lnet_retry_count are set to 0.

...