Test # | Tag | Procedure | Script | Result
1 | Immediate Failure
  • Send a PING
  • Simulate an immediate LND failure (e.g., NOMEM)
  • Message should not be resent

lnetctl discover <nid>

lctl net_drop_add with "-e local_error"

lnetctl discover <nid>

pass
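
The script column above can be sketched as a small Python helper that only assembles the command lines and does not run them. The `lnetctl discover` and `lctl net_drop_add ... -e` forms come from this plan; using `-s`, `-d`, `-m GET` and `-i 20` for the immediate-failure rule is an assumption borrowed from the `net_drop_add` example given for the asynchronous tests. The function names are hypothetical.

```python
import shlex


def drop_rule_cmd(src, dst, msg_type, interval, error):
    """Build argv for a drop rule, using only the flags shown in this plan."""
    return ["lctl", "net_drop_add", "-s", src, "-d", dst,
            "-m", msg_type, "-i", str(interval), "-e", error]


def immediate_failure_script(local_nid, peer_nid):
    """Discover, install a local_error rule, then discover again."""
    return [
        ["lnetctl", "discover", peer_nid],
        # Assumption: GET and interval 20 mirror the later example in this plan.
        drop_rule_cmd(local_nid, peer_nid, "GET", 20, "local_error"),
        ["lnetctl", "discover", peer_nid],
    ]


if __name__ == "__main__":
    for argv in immediate_failure_script("10.9.10.3@tcp", "10.9.10.4@tcp"):
        print(shlex.join(argv))
```

Keeping the commands as argv lists makes it straightforward to hand them to `subprocess.run` on a test node later, if that is how the plan ends up being automated.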

Asynchronous Errors

Test # | Tag | Procedure | Script | Result
1 | LNET_MSG_STATUS_LOCAL_INTERRUPT
    LNET_MSG_STATUS_LOCAL_DROPPED
    LNET_MSG_STATUS_LOCAL_ABORTED
    LNET_MSG_STATUS_LOCAL_NO_ROUTE
    LNET_MSG_STATUS_LOCAL_TIMEOUT

  • MR Node with Multiple interfaces
  • Send a PING
  • Simulate an <error>
  • PING msg should be queued on resend queue
  • PING msg will be resent on a different interface
  • Failed interface's health value will be decremented
  • Failed interface will be placed on the recovery queue

Examples:

lctl net_drop_add -s 10.9.10.3@tcp -d 10.9.10.4@tcp -m GET -i 20 -e local_dropped

Key messages in debug log:

(lib-msg.c:762:lnet_health_check()) 10.9.10.3@tcp->10.9.10.4@tcp:GET:LOCAL_DROPPED - queuing for resend

(lib-msg.c:508:lnet_handle_local_failure()) ni 10.9.10.3@tcp added to recovery queue. Health = 950

(lib-move.c:2928:lnet_recover_local_nis()) attempting to recover local ni: 10.9.10.3@tcp

pass
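
Checking for the key messages above can be automated with a small parser. The following is a minimal sketch, assuming each log record contains the `(file:line:function())` prefix exactly as quoted in this plan; real debug logs may carry extra leading fields, which the regular expression tolerates by searching rather than anchoring. The helper names are hypothetical.

```python
import re

# Matches "(lib-msg.c:762:lnet_health_check()) message text"
LOG_RE = re.compile(
    r"\((?P<file>[\w.-]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s*(?P<msg>.*)"
)
HEALTH_RE = re.compile(r"Health = (\d+)")


def parse_debug_line(text):
    """Return file, line, func and msg for one log record, or None."""
    m = LOG_RE.search(text)
    if m is None:
        return None
    rec = m.groupdict()
    rec["line"] = int(rec["line"])
    return rec


def functions_seen_in_order(lines, expected_funcs):
    """True if the expected functions appear in the log in the given order."""
    remaining = list(expected_funcs)
    for text in lines:
        rec = parse_debug_line(text)
        if rec and remaining and rec["func"] == remaining[0]:
            remaining.pop(0)
    return not remaining


def health_values(lines):
    """Collect every 'Health = N' value reported in the log."""
    return [int(m.group(1)) for t in lines if (m := HEALTH_RE.search(t))]
```

For test 1, the expected order would be `lnet_health_check`, `lnet_handle_local_failure`, `lnet_recover_local_nis`, matching the three messages quoted above.
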
2 | Sensitivity == 0
  • Same setup as 1
  • NI is not placed on the recovery queue

pass
3 | Sensitivity > 0
  • Same setup as 1
  • NI is placed on the recovery queue
  • Monitor network activity as NI is pinged until health is back to maximum

pass
4 | Sensitivity > 0, buggy interface
  • Same setup as 1
  • NI is placed on the recovery queue
  • NI is pinged every 1 second
  • Simulate a ping failure every other ping
  • NI's health should be decremented on each failure
  • NI should remain on the recovery queue
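
The expectations in tests 2 through 4 can be illustrated with a toy model of the health value and the recovery queue. This is a sketch under stated assumptions, not the LNet implementation: the maximum health of 1000, the decrement of one sensitivity unit per failure, and the increment of 1 per successful recovery ping are all assumed here. A sensitivity of 50 is chosen only so that the first failure yields 950, the value shown in the debug log for test 1.

```python
from dataclasses import dataclass

MAX_HEALTH = 1000  # assumption: maximum health value


@dataclass
class NI:
    nid: str
    health: int = MAX_HEALTH
    on_recovery_queue: bool = False


def record_failure(ni, sensitivity):
    """Apply a send failure. With sensitivity == 0 the NI is left alone."""
    if sensitivity == 0:
        return
    ni.health = max(0, ni.health - sensitivity)
    ni.on_recovery_queue = True


def record_recovery_ping(ni, ok, sensitivity, increment=1):
    """Apply the result of one recovery ping to a queued NI."""
    if not ni.on_recovery_queue:
        return
    if ok:
        ni.health = min(MAX_HEALTH, ni.health + increment)
        if ni.health == MAX_HEALTH:
            ni.on_recovery_queue = False  # fully healthy again
    else:
        ni.health = max(0, ni.health - sensitivity)
```

In this model a buggy interface that fails every other ping loses more health per failure than it regains per success, so it never leaves the recovery queue, which is the behavior test 4 expects.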


5 | Retry count == 0
  • Same setup as 1
  • Message will not be retried; it is finalized immediately

pass
6 | Retry count > 0
  • Same setup as 1
  • Message will be transmitted for a maximum of retry count or until the message expires

Key messages in debug log:

(lib-move.c:1715:lnet_handle_send()) TRACE: 10.9.10.3@tcp(10.9.10.3@tcp:<?>) -> 10.9.10.4@tcp(10.9.10.4@tcp:10.9.10.4@tcp) : GET try# 0

(lib-move.c:1715:lnet_handle_send()) TRACE: 10.9.10.3@tcp(10.9.10.3@tcp:<?>) -> 10.9.10.4@tcp(10.9.10.4@tcp:10.9.10.4@tcp) : GET try# 1

pass
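
The retry behavior can be checked from the `try#` counters in the `lnet_handle_send()` TRACE messages. A minimal sketch follows; it assumes that `try# 0` is the initial transmission and that the highest `try#` observed should not exceed the configured retry count. Both are interpretations of this plan rather than confirmed semantics, and the helper names are hypothetical.

```python
import re

# Matches the tail of "...lnet_handle_send()) TRACE: <src> -> <dst> : GET try# 1"
TRY_RE = re.compile(
    r"lnet_handle_send\(\)\)\s+TRACE:.*:\s*(?P<type>\w+)\s+try#\s*(?P<n>\d+)"
)


def try_numbers(lines):
    """Return the try# values found in lnet_handle_send TRACE lines, in order."""
    return [int(m["n"]) for text in lines if (m := TRY_RE.search(text))]


def within_retry_count(lines, retry_count):
    """True if at least one send was traced and no try# exceeds retry_count."""
    tries = try_numbers(lines)
    return bool(tries) and max(tries) <= retry_count
```
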
7 | REPLY timeout
  • Same setup as 1, except use LNet selftest
  • Simulate a local timeout
  • Message is re-transmitted
  • No REPLY is received
  • Message is finalized and a TIMEOUT event is propagated

pass
8 | ACK timeout
  • Same setup as 7, except simulate an ACK timeout

pass
9 | LNET_MSG_STATUS_LOCAL_ERROR
  • Same setup as 1
  • Message is finalized immediately (not resent)
  • Local NI is placed on the recovery queue
  • Same procedure to recover the local NI

pass
10 | LNET_MSG_STATUS_REMOTE_DROPPED
  • Same setup as 1
  • Message is queued for resend depending on retry_count
  • peer_ni is placed on the recovery queue (unless sensitivity == 0)
  • peer_ni is pinged every 1 second

pass

11 | LNET_MSG_STATUS_REMOTE_ERROR
     LNET_MSG_STATUS_REMOTE_TIMEOUT
     LNET_MSG_STATUS_NETWORK_TIMEOUT
  • Same setup as 1
  • Message is not resent
  • peer_ni recovery happens as outlined in previous cases

pass
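
The per-status expectations in the asynchronous tests can be collected into one lookup table, which may help when writing assertions for each case. This is a sketch of the test plan's expectations only, not of LNet's code. Applying the "retry count == 0" and "sensitivity == 0" rules uniformly to every status is an assumption, since the plan states them explicitly only for some cases. The names `Expected`, `BASE` and `expected_outcome` are hypothetical.

```python
from typing import NamedTuple, Optional


class Expected(NamedTuple):
    resend: bool            # is the message queued for resend?
    recover: Optional[str]  # "local_ni", "peer_ni", or None


BASE = {
    # Test 1: resent, local NI goes to the recovery queue
    "LNET_MSG_STATUS_LOCAL_INTERRUPT": Expected(True, "local_ni"),
    "LNET_MSG_STATUS_LOCAL_DROPPED": Expected(True, "local_ni"),
    "LNET_MSG_STATUS_LOCAL_ABORTED": Expected(True, "local_ni"),
    "LNET_MSG_STATUS_LOCAL_NO_ROUTE": Expected(True, "local_ni"),
    "LNET_MSG_STATUS_LOCAL_TIMEOUT": Expected(True, "local_ni"),
    # Test 9: finalized immediately, local NI still recovered
    "LNET_MSG_STATUS_LOCAL_ERROR": Expected(False, "local_ni"),
    # Test 10: resent depending on retry_count, peer_ni recovered
    "LNET_MSG_STATUS_REMOTE_DROPPED": Expected(True, "peer_ni"),
    # Test 11: not resent, peer_ni recovered
    "LNET_MSG_STATUS_REMOTE_ERROR": Expected(False, "peer_ni"),
    "LNET_MSG_STATUS_REMOTE_TIMEOUT": Expected(False, "peer_ni"),
    "LNET_MSG_STATUS_NETWORK_TIMEOUT": Expected(False, "peer_ni"),
}


def expected_outcome(status, retry_count, sensitivity):
    """Combine the base expectation with the retry and sensitivity settings."""
    base = BASE[status]
    resend = base.resend and retry_count > 0
    recover = base.recover if sensitivity > 0 else None
    return Expected(resend, recover)
```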

Random Failures

Test # | Tag | Procedure | Script | Result
1 | self test
  • MR Node
  • NMR Peer
  • Self-test
  • Randomize local NI failure
  • Randomize remote NI failure

ip link set eth1 down

pass
2 | self test
  • MR Node
  • MR Peer
  • Self-test
  • Randomize local NI failure
  • Randomize remote NI failure

pass
3 | self test
  • MR Node
  • MR Router
  • NMR Peer
  • Self-test
  • Randomize local NI failure
  • Randomize remote NI failure


4 | self test
  • MR Node
  • MR Router
  • MR Peer
  • Self-test
  • Randomize local NI failure
  • Randomize remote NI failure


5 | self test
  • MR Node
  • NMR Router
  • NMR Peer
  • Self-test
  • Randomize local NI failure
  • Randomize remote NI failure


6 | self test
  • MR Node
  • NMR Router
  • MR Peer
  • Self-test
  • Randomize local NI failure
  • Randomize remote NI failure
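
One way to randomize NI failures reproducibly is to generate a seeded schedule of `ip link set <ifname> down` and `up` commands. The sketch below only builds the schedule; it does not execute anything. The interface names other than `eth1`, the hold-time range, and the function name are assumptions made for illustration. For remote NI failures the same schedule would be generated for, and run on, the peer.

```python
import random


def failure_schedule(interfaces, rounds, seed=None, max_down=5.0):
    """Return a list of (down_cmd, hold_seconds, up_cmd) tuples.

    A fixed seed makes the sequence repeatable, so a failing run can be
    replayed with the same interface order and hold times.
    """
    rng = random.Random(seed)
    plan = []
    for _ in range(rounds):
        ifname = rng.choice(interfaces)
        hold = round(rng.uniform(1.0, max_down), 1)
        plan.append((f"ip link set {ifname} down", hold,
                     f"ip link set {ifname} up"))
    return plan


if __name__ == "__main__":
    for down, hold, up in failure_schedule(["eth1", "eth2"], 4, seed=1):
        print(f"{down}; sleep {hold}; {up}")
```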


    ...