...

Test # | Tag | Procedure | Script | Result
1 | Immediate Failure
  • Send a PING
  • Simulate an immediate LND failure (e.g., NOMEM)
  • Message should not be resent

lnetctl discover <nid>

lctl net_drop_add with "-e local_error"

lnetctl discover <nid>

pass

Asynchronous Errors

Test # | Tag | Procedure | Script | Result
1

LNET_MSG_STATUS_LOCAL_INTERRUPT

LNET_MSG_STATUS_LOCAL_DROPPED

LNET_MSG_STATUS_LOCAL_ABORTED

LNET_MSG_STATUS_LOCAL_NO_ROUTE

LNET_MSG_STATUS_LOCAL_TIMEOUT

  • MR Node with Multiple interfaces
  • Send a PING
  • Simulate an <error>
  • PING msg should be queued on resend queue
  • PING msg will be resent on a different interface
  • Failed interface's health value will be decremented
  • Failed interface will be placed on the recovery queue

Examples:

lctl net_drop_add -s 10.9.10.3@tcp -d 10.9.10.4@tcp -m GET -i 20 -e local_dropped

Key messages in debug log:

(lib-msg.c:762:lnet_health_check()) 10.9.10.3@tcp->10.9.10.4@tcp:GET:LOCAL_DROPPED - queuing for resend

(lib-msg.c:508:lnet_handle_local_failure()) ni 10.9.10.3@tcp added to recovery queue. Health = 950

(lib-move.c:2928:lnet_recover_local_nis()) attempting to recover local ni: 10.9.10.3@tcp

pass
2 | Sensitivity == 0
  • Same setup as 1
  • NI is not placed on the recovery queue

pass
3 | Sensitivity > 0
  • Same setup as 1
  • NI is placed on the recovery queue
  • Monitor network activity as NI is pinged until health is back to maximum

pass
4

Sensitivity > 0

Buggy interface

  • Same setup as 1
  • NI is placed on the recovery queue
  • NI is pinged every 1 second
  • Simulate ping failure every other ping
  • NI's health should be decremented on failure
  • NI should remain on the recovery queue


5 | Retry count == 0
  • Same setup as 1
  • Message will not be retried and will be finalized immediately

pass
6 | Retry count > 0
  • Same setup as 1
  • Message will be transmitted for a maximum of retry count times or until the message expires

Key messages in debug log:

(lib-move.c:1715:lnet_handle_send()) TRACE: 10.9.10.3@tcp(10.9.10.3@tcp:<?>) -> 10.9.10.4@tcp(10.9.10.4@tcp:10.9.10.4@tcp) : GET try# 0

(lib-move.c:1715:lnet_handle_send()) TRACE: 10.9.10.3@tcp(10.9.10.3@tcp:<?>) -> 10.9.10.4@tcp(10.9.10.4@tcp:10.9.10.4@tcp) : GET try# 1

pass
7 | REPLY timeout
  • Same setup as 1, except use LNet selftest
  • Simulate a local timeout
  • Re-transmit
  • No REPLY received
  • Message is finalized and a TIMEOUT event is propagated

pass
8 | ACK timeout
  • Same setup as 7, except simulate an ACK timeout

pass
9 | LNET_MSG_STATUS_LOCAL_ERROR
  • Same setup as 1
  • Message is finalized immediately (not resent)
  • Local NI is placed on the recovery queue
  • Same procedure to recover the local NI

pass
10 | LNET_MSG_STATUS_REMOTE_DROPPED
  • Same setup as 1
  • Message is queued for resend depending on retry_count
  • peer_ni is placed on the recovery queue (not if sensitivity == 0)
  • peer_ni is pinged every 1 second

pass

11

LNET_MSG_STATUS_REMOTE_ERROR

LNET_MSG_STATUS_REMOTE_TIMEOUT

LNET_MSG_STATUS_NETWORK_TIMEOUT

  • Same setup as 1
  • Message is not resent
  • peer_ni recovery happens as outlined in previous cases

pass
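
The expected outcomes in the Asynchronous Errors cases can be summarized as a lookup from message status to behavior. The following is a minimal sketch for use in a test harness, not LNet code: the function name `expected_behavior` and the output labels (`resend`, `no-resend`, `local-ni`, `peer-ni`) are illustrative names introduced here. The mapping follows the procedures listed above; resends remain subject to the retry count, and recovery queueing assumes a health sensitivity greater than 0.

```shell
# Hypothetical helper: map a message status to the behavior the test plan expects.
# Prints "<resend|no-resend> <local-ni|peer-ni>"; the second word names what is
# expected to be placed on the recovery queue.
expected_behavior() {
    # Accept both the full constant and the short form seen in the debug log.
    _st=${1#LNET_MSG_STATUS_}
    case "$_st" in
        LOCAL_INTERRUPT|LOCAL_DROPPED|LOCAL_ABORTED|LOCAL_NO_ROUTE|LOCAL_TIMEOUT)
            echo "resend local-ni" ;;
        LOCAL_ERROR)
            echo "no-resend local-ni" ;;
        REMOTE_DROPPED)
            echo "resend peer-ni" ;;
        REMOTE_ERROR|REMOTE_TIMEOUT|NETWORK_TIMEOUT)
            echo "no-resend peer-ni" ;;
        *)
            echo "unknown"
            return 1 ;;
    esac
}
```

For example, `expected_behavior LOCAL_DROPPED` prints `resend local-ni`, which matches the "queuing for resend" and "added to recovery queue" messages quoted for test 1.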

Random Failures

Test # | Tag | Procedure | Script | Result
1 | self test
  • MR Node
  • NMR Peer
  • Self-test
  • Randomize local NI failure
  • Randomize remote NI failure

ip link set eth1 down

pass
2 | self test
  • MR Node
  • MR Peer
  • Self-test
  • Randomize local NI failure
  • Randomize remote NI failure

pass
3 | self test
  • MR Node
  • MR Router
  • NMR Peer
  • Self-test
  • Randomize local NI failure
  • Randomize remote NI failure

4 | self test
  • MR Node
  • MR Router
  • MR Peer
  • Self-test
  • Randomize local NI failure
  • Randomize remote NI failure

5 | self test
  • MR Node
  • NMR Router
  • NMR Peer
  • Self-test
  • Randomize local NI failure
  • Randomize remote NI failure

6 | self test
  • MR Node
  • NMR Router
  • MR Peer
  • Self-test
  • Randomize local NI failure
  • Randomize remote NI failure
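
The random-failure runs above take interfaces down while self-test traffic is flowing. A minimal dry-run sketch of choosing a random local interface is shown below. The helper names `pick_random` and `flap_cmds`, the default pause, and the interface names in the usage note are hypothetical. The sketch only prints the `ip link` commands rather than executing them, and it does not cover remote NI failures, which would have to be triggered on the peer.

```shell
# Hypothetical helper: print one of its arguments, chosen at random.
pick_random() {
    [ "$#" -gt 0 ] || return 1
    # awk's rand() is seeded from the clock; adequate for test randomization only.
    _idx=$(awk -v n="$#" 'BEGIN { srand(); print int(rand() * n) + 1 }')
    shift $(( _idx - 1 ))
    echo "$1"
}

# Print (do not run) the commands that take an interface down and restore it.
# Arguments: interface name, pause in seconds (defaults to 5).
flap_cmds() {
    echo "ip link set $1 down"
    echo "sleep ${2:-5}"
    echo "ip link set $1 up"
}
```

A run could then be driven with something like `flap_cmds "$(pick_random eth1 eth2)" 10`, reviewing the printed commands before piping them to a shell on the node under test.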


...

Test # | Tag | Procedure | Script | Result
1 | lnet_transaction_timeout
  • Set lnet_transaction_timeout to a value < retry_count via lnetctl and YAML
    • This should lead to a failure to set
  • Set lnet_transaction_timeout to a value > retry_count via lnetctl and YAML
    • lnet_lnd_timeout value should == lnet_transaction_timeout / retry_count
  • Show value via "lnetctl global show"

lnetctl set transaction_timeout <value>

pass
2 | lnet_retry_count
  • Set lnet_retry_count to a value > lnet_transaction_timeout via lnetctl and YAML
    • This should lead to a failure to set
  • Set lnet_retry_count to a value < lnet_transaction_timeout via lnetctl and YAML
    • lnet_lnd_timeout value should == lnet_transaction_timeout / retry_count
  • Show value via "lnetctl global show"

lnetctl set retry_count <value>

pass
3 | lnet_health_sensitivity
  • Set lnet_health_sensitivity from lnetctl and from YAML
  • Show value via "lnetctl global show"

lnetctl set health_sensitivity <value>

pass

Jira: LU-11530 (Whamcloud Community Jira)

4 | NI statistics
  • Verify LNet health statistics
    • lnetctl net show -v 3

pass
5 | Peer NI statistics
  • Verify LNet health statistics for peer NIs
    • lnetctl peer show -v 3

pass
6 | NI Health value
  • Verify setting the local NI health value
    • lnetctl net set --nid <nid> --health <value>
  • Redo from YAML

Jira: LU-11529 (Whamcloud Community Jira)

7 | Peer NI Health value
  • Verify setting the peer NI health value
    • lnetctl peer set --nid <nid> --health <value>
  • Redo from YAML

Jira: LU-11529 (Whamcloud Community Jira)
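
The constraints exercised in configuration tests 1 and 2 can be predicted ahead of time with a little arithmetic. The following is a sketch under assumptions: the function names are hypothetical, integer division is assumed for the lnet_lnd_timeout relation, and the case where the two values are equal is reported as unspecified because the procedure above covers only the strictly smaller and strictly larger cases.

```shell
# Hypothetical helper: predict whether a (transaction_timeout, retry_count)
# pair should be accepted according to the test plan.
# Arguments: transaction timeout, retry count.
check_timeout_pair() {
    if [ "$1" -gt "$2" ]; then
        echo "accept"
    elif [ "$1" -lt "$2" ]; then
        echo "reject"
    else
        echo "unspecified"   # equal values are not covered by the plan
    fi
}

# Expected lnet_lnd_timeout for an accepted pair, assuming integer division.
expected_lnd_timeout() {
    # A retry count of 0 is not defined by the formula, so report failure.
    [ "$2" -gt 0 ] || return 1
    echo $(( $1 / $2 ))
}
```

For instance, `expected_lnd_timeout 50 2` prints `25`, which can be compared against the value reported by "lnetctl global show" after the settings are applied.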

Testing Tools

The drop policy has been modified to drop outgoing messages with specific errors. This can be done via the following commands. For details on these commands, refer to the code. A combination of these commands on the different nodes should cover approximately 75% of the health code paths.

...
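
As an illustration of driving the drop policy from a script, the sketch below assembles the `lctl net_drop_add` invocation in the form used in the test 1 example and counts send attempts in a saved debug log. It is a dry-run sketch, not a reference for the tool: the function names are hypothetical, only the options that appear in that example (`-s`, `-d`, `-m`, `-i`, `-e`) are used, and the log matching assumes trace lines of the `GET try# N` form quoted for the retry count test.

```shell
# Hypothetical helper: print the drop rule command for review before running it.
# Arguments: source NID, destination NID, message type, interval, error name.
drop_rule_cmd() {
    echo "lctl net_drop_add -s $1 -d $2 -m $3 -i $4 -e $5"
}

# Hypothetical helper: count send attempts for a message type in a debug log
# read from standard input, based on the "try# N" trace lines.
count_tries() {
    grep -c ": $1 try# [0-9]"
}
```

With a retry count greater than 0, `count_tries GET < debug.log` would be expected to report more than one attempt for a dropped GET; `debug.log` here is a placeholder for wherever the debug log was saved.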