O2IBLND

Overview

There are two types of events to account for:

...

There is a group of events which indicate a fatal error

RDMA Device Events

Below are the events that could occur on the RDMA device. Highlighted in BOLD RED are the events that should be handled for health purposes.

  • IB_EVENT_CQ_ERR
  • IB_EVENT_QP_FATAL
  • IB_EVENT_QP_REQ_ERR
  • IB_EVENT_QP_ACCESS_ERR
  • IB_EVENT_COMM_EST
  • IB_EVENT_SQ_DRAINED
  • IB_EVENT_PATH_MIG
  • IB_EVENT_PATH_MIG_ERR
  • IB_EVENT_DEVICE_FATAL
  • IB_EVENT_PORT_ACTIVE
  • IB_EVENT_PORT_ERR
  • IB_EVENT_LID_CHANGE
  • IB_EVENT_PKEY_CHANGE
  • IB_EVENT_SM_CHANGE
  • IB_EVENT_SRQ_ERR
  • IB_EVENT_SRQ_LIMIT_REACHED
  • IB_EVENT_QP_LAST_WQE_REACHED
  • IB_EVENT_CLIENT_REREGISTER
  • IB_EVENT_GID_CHANGE
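As a rough illustration of how a device event could feed into NI state, here is a minimal sketch. It assumes, purely for illustration, that IB_EVENT_DEVICE_FATAL and IB_EVENT_PORT_ERR take the NI down and IB_EVENT_PORT_ACTIVE brings it back; this page does not confirm that set. The enum is a local copy of a few names from the list above so the sketch builds in userspace (the real definitions live in the kernel's rdma/ib_verbs.h), and struct ni_sketch and ni_handle_device_event() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Local copies of a few of the event names listed above, so the sketch
 * builds in userspace. The real enum is in the kernel's <rdma/ib_verbs.h>. */
enum ib_event_type {
    IB_EVENT_DEVICE_FATAL,
    IB_EVENT_PORT_ACTIVE,
    IB_EVENT_PORT_ERR,
    IB_EVENT_LID_CHANGE,
};

/* STATE_ACTIVE / STATE_FAILED follow the ni_state names used in the
 * High Level Design section; struct ni_sketch itself is hypothetical. */
enum ni_state { STATE_ACTIVE, STATE_FAILED };

struct ni_sketch {
    enum ni_state ni_state;
};

/* Returns 1 if the event changed the NI state, 0 otherwise.
 * ASSUMPTION: the choice of health-relevant events below is illustrative. */
int ni_handle_device_event(struct ni_sketch *ni, enum ib_event_type ev)
{
    enum ni_state next;

    assert(ni != NULL);
    switch (ev) {
    case IB_EVENT_DEVICE_FATAL:
    case IB_EVENT_PORT_ERR:
        next = STATE_FAILED;    /* NI is skipped by selection */
        break;
    case IB_EVENT_PORT_ACTIVE:
        next = STATE_ACTIVE;    /* NI is eligible for selection again */
        break;
    default:
        return 0;               /* not health-relevant in this sketch */
    }
    if (ni->ni_state == next)
        return 0;
    ni->ni_state = next;
    return 1;
}
```

Keeping the mapping in one switch makes it easy to add per-event statistics or logging later.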

Communication Events

Below are the events that could occur on a connection. Highlighted in BOLD RED are the events that should be handled for health purposes.

...

One option to consider is to use the peer_timeout feature to recognize when peer_nis are down, update the peer_ni health information via this mechanism, and let the LND and RPC timeouts take care of further resends.
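The aliveness check behind this option could look roughly like the sketch below. It is not the existing LNet code: struct peer_ni_alive, its last_alive field, and peer_ni_is_alive() are hypothetical names. Treating a non-positive peer_timeout as "check disabled" is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical per-peer_ni bookkeeping; the field name is illustrative. */
struct peer_ni_alive {
    time_t last_alive;  /* last time traffic from this peer_ni was seen */
};

/* peer_timeout is in seconds. ASSUMPTION: a value <= 0 disables the
 * check, in which case the peer_ni is always treated as alive. */
bool peer_ni_is_alive(const struct peer_ni_alive *p, time_t now,
                      int peer_timeout)
{
    assert(p != NULL);
    if (peer_timeout <= 0)
        return true;
    /* Alive if we heard from the peer_ni within the timeout window. */
    return now - p->last_alive <= peer_timeout;
}
```

A false result would be the point at which the peer_ni health information is updated; resends stay with the LND and RPC timeouts.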

High Level Design

Callback Mechanism

[Olaf: bear in mind that currently the LND already reports status to LNet through lnet_finalize()]

...

  • Although some of the actions LNet will take are the same for different errors, it's still a good idea to keep them separate for statistics and logging.
  • on LNET_LOCAL_NI_DOWN set the ni_state to STATE_FAILED. In the selection algorithm this NI will not be picked.
  • on LNET_LOCAL_NI_UP set the ni_state to STATE_ACTIVE. In the selection algorithm this NI will be selected.
  • Add a state in the peer_ni. This will indicate whether it is usable or not.
  • on LNET_PEER_NI_ADDR_ERROR set the peer_ni state to FAILED. This peer_ni will not be selected in the selection algorithm.
  • Add a health value (int). 0 means it's healthy and available for selection.
  • on any LNET_PEER_NI_[UNREACHABLE | CONNECT_ERROR | CONNECT_REJECTED] decrement this value.
  • That value indicates how long to wait before we use the peer_ni again.
  • A time before use, in jiffies, is stored. The next time we examine this peer_ni for selection, we look at that time. If it has passed, we select the peer_ni, but we do not increment the health value. The value is reset to 0 only after a successful send to this peer_ni.
  • The net effect is that if we have a bad peer_ni, the health value will keep getting decremented, which means it will take progressively longer to reuse it.
  • This algorithm is in effect only if there are multiple interfaces and some of them are healthy. If none of them are healthy (i.e. every health value is negative), then select the least unhealthy peer_ni (the one with the greatest health value).
  • The same algorithm can be used for local NI selection
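The health rules in the list above can be sketched as follows. This is one possible reading, not a definitive implementation. The struct, the function names, the RETRY_STEP constant, and the linear back-off (wait = failures x RETRY_STEP) are all assumptions. Times are plain tick counters compared directly; kernel code would use a wrap-safe jiffies comparison instead.

```c
#include <assert.h>
#include <stddef.h>

#define RETRY_STEP 100UL  /* hypothetical back-off unit, in ticks */

struct peer_ni_health {
    int failed;                 /* e.g. set on LNET_PEER_NI_ADDR_ERROR */
    int health;                 /* 0 = healthy, negative = unhealthy */
    unsigned long retry_after;  /* tick at which it may be tried again */
};

/* Failure: decrement health; more failures mean a longer wait. */
void peer_ni_send_failed(struct peer_ni_health *p, unsigned long now)
{
    p->health--;
    p->retry_after = now + (unsigned long)(-p->health) * RETRY_STEP;
}

/* Only a successful send resets the health value to 0. */
void peer_ni_send_ok(struct peer_ni_health *p)
{
    p->health = 0;
    p->retry_after = 0;
}

static int eligible(const struct peer_ni_health *p, unsigned long now)
{
    return p->health == 0 || now >= p->retry_after;
}

/* Returns the index of the chosen peer_ni, or -1 if all are FAILED.
 * Eligible entries win; otherwise the least unhealthy one is used. */
int peer_ni_select(const struct peer_ni_health *v, size_t n,
                   unsigned long now)
{
    int best = -1, best_ok = 0;
    size_t i;

    assert(v != NULL || n == 0);
    for (i = 0; i < n; i++) {
        int ok;

        if (v[i].failed)
            continue;
        ok = eligible(&v[i], now);
        if (best < 0 || ok > best_ok ||
            (ok == best_ok && v[i].health > v[best].health)) {
            best = (int)i;
            best_ok = ok;
        }
    }
    return best;
}
```

Selecting an unhealthy peer_ni after its wait has expired leaves its health value untouched, so a further failure lengthens the next wait.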

Timeout Handling

LND TX Timeout

PUT

[Diagram: PUT sequence]

...

  1. TX timeout can be triggered because the TX remains on one of the outgoing queues for too long. This would indicate that there is something wrong with the local NI. It's either too congested or otherwise problematic. This should result in us trying to resend using a different local NI but possibly to the same peer_ni.
  2. TX timeout can be triggered because the TX was posted via ib_post_send() but has not completed. In this case we can safely conclude that the peer_ni is either congested or otherwise down. This should result in us trying to resend to a different peer_ni, but potentially using the same local NI.
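The two cases above amount to a small decision based on where the TX was when it timed out. A minimal sketch, in which the state names, the hint names, and tx_timeout_hint() are all hypothetical:

```c
#include <assert.h>

/* Hypothetical: where the TX was when the timeout fired. */
enum tx_state {
    TX_QUEUED,  /* still on an outgoing queue (case 1) */
    TX_POSTED,  /* handed to ib_post_send(), no completion yet (case 2) */
};

/* Hypothetical hint handed back to LNet for the resend. */
enum resend_hint {
    RESEND_OTHER_LOCAL_NI,  /* suspect the local NI; same peer_ni is fine */
    RESEND_OTHER_PEER_NI,   /* suspect the peer_ni; same local NI is fine */
};

enum resend_hint tx_timeout_hint(enum tx_state st)
{
    assert(st == TX_QUEUED || st == TX_POSTED);
    if (st == TX_QUEUED)
        return RESEND_OTHER_LOCAL_NI;
    return RESEND_OTHER_PEER_NI;
}
```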
GET

[Diagram: GET Sequence Diagram]

...

In summary, the tx_timeout ensures that messages which do not require an explicit response from the peer are completed by the tx event that M|OFED adds to the completion queue. It also ensures that messages which require an explicit reply in order to complete receive that reply within the tx_timeout.

O2IBLND TX Lifecycle

[Diagram: o2iblnd TX FSM]

...

NOTE: currently we don't know why the peer_ni is marked down. As mentioned above, the tx_timeout could be triggered for several reasons. Some reasons indicate a problem on the peer side, e.g. not receiving a response or a transmit completion. Other reasons could indicate local problems, for example the tx never leaving the queued state. Depending on the reason for the tx_timeout, LNet should react differently in its next round of interface selection.
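One way to carry the reason through to interface selection is to record which phase the tx was in and charge the failure to the matching side. The sketch below is an assumption throughout: the phase names, struct health_pair, and tx_timeout_account() are hypothetical, and the health values follow the "0 is healthy, decrement on failure" convention from the High Level Design section.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical phases a tx can be in when tx_timeout expires. */
enum tx_phase {
    TX_PHASE_QUEUED,       /* never left the queued state */
    TX_PHASE_POSTED,       /* posted, no transmit completion */
    TX_PHASE_AWAIT_REPLY,  /* completed locally, reply never arrived */
};

enum timeout_blame { BLAME_LOCAL_NI, BLAME_PEER_NI };

/* Hypothetical pair of health values for the NI and peer_ni in use. */
struct health_pair {
    int local_ni_health;
    int peer_ni_health;
};

/* Decrement the health of the side the timeout points at, and report
 * which side that was so it can be logged and counted separately. */
enum timeout_blame tx_timeout_account(enum tx_phase phase,
                                      struct health_pair *h)
{
    assert(h != NULL);
    if (phase == TX_PHASE_QUEUED) {
        h->local_ni_health--;   /* local problem */
        return BLAME_LOCAL_NI;
    }
    h->peer_ni_health--;        /* peer-side problem */
    return BLAME_PEER_NI;
}
```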

Health Revisited

There are different scenarios to consider with Health:

...

TBD - How do we recover from a peer down?

TX Timeouts in the presence of LNet Routers

Communication with a router adheres to the above details. Once the current hop is sure that the message has made it to the next hop, LNet shouldn't worry about resends. Resends are only to ensure that the message LNet is tasked to send makes it to the next hop. The upper layer RPC protocol makes sure that RPC messages are retried if necessary.

Each hop's LNet will do a best effort in getting the message to the following hop. Unfortunately, there is no feedback mechanism from a router to the originator to inform the originator that a message has failed to send, but I believe this is unnecessary and will probably increase the complexity of the code and the system in general. Rule of thumb should be that each hop only worries about the immediate next hop.

SOCKLND

TBD