...

  1. The local or remote interface health is updated.
  2. Failure statistics are incremented.
  3. A resend is issued on a different local interface, if one is available; otherwise the same interface is attempted again.
  4. The message is resent repeatedly until the timeout expires or the send succeeds.
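The four steps above can be sketched as a toy state update; every name here (struct toy_msg, on_send_failure() and its fields) is a placeholder for illustration, not an actual LNet symbol:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy illustration of the four-step failure path; all names are
 * placeholders, not LNet symbols. */
struct toy_msg {
	int  health;     /* step 1: health of the interface used */
	int  failures;   /* step 2: failure statistics */
	int  local_if;   /* step 3: interface chosen for the resend */
	bool give_up;    /* step 4: set once the timeout expires */
};

static void on_send_failure(struct toy_msg *m, int n_local_ifs, bool timed_out)
{
	m->health--;                                  /* step 1: demerit the interface */
	m->failures++;                                /* step 2: count the failure */
	if (n_local_ifs > 1)                          /* step 3: prefer a different NI */
		m->local_if = (m->local_if + 1) % n_local_ifs;
	m->give_up = timed_out;                       /* step 4: stop at the timeout */
}
```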

Note:

Keep the last sent time in the message in order to pace the sends. There is a risk that the failure occurs immediately, which would cause LNet to enter a tight loop, resending the message until the timeout expires. This could spike CPU consumption unexpectedly. It would be better to pace the resends, delaying each one incrementally longer until the peer timeout expires.
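The pacing idea in the note can be sketched as a capped, incrementally growing delay. resend_delay(), the doubling policy, and the parameters are assumptions for illustration, not LNet code:

```c
#include <assert.h>

/* Illustrative sketch of resend pacing: each attempt waits longer than
 * the last, capped so we never pace past the peer timeout. */
static int resend_delay(int attempt, int base_interval, int peer_timeout)
{
	int delay = base_interval << attempt;	/* 1x, 2x, 4x, ... */

	if (delay > peer_timeout)
		delay = peer_timeout;		/* never wait past the peer timeout */
	return delay;
}
```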

TBD: Should the messages be queued on a thread instead of resent from the context of lnet_finalize()? The router pinger thread can be used to process the resend list.

A new field in the msg, msg_status, will be added. This field will hold the send status of the message.

...

When a message is initially sent it is tagged with a deadline: the current time + peer_timeout. While the message has not timed out it will be resent if it needs to be. The deadline is checked every time we enter lnet_finalize(). If the deadline is reached without a successful send, the MD will be detached.
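A minimal sketch of the deadline rule, with hypothetical names (struct toy_deadline, stamp_deadline(), can_resend()); in the real code the check happens on entering lnet_finalize():

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative sketch: the deadline is stamped once at first send and
 * checked before every resend. */
struct toy_deadline {
	long msg_deadline;
};

static void stamp_deadline(struct toy_deadline *msg, long now, long peer_timeout)
{
	msg->msg_deadline = now + peer_timeout;	/* current time + peer_timeout */
}

static bool can_resend(const struct toy_deadline *msg, long now)
{
	/* once the deadline is reached the MD is detached instead */
	return now < msg->msg_deadline;
}
```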

...

lnet_finalize() will also update the statistics (or call a function to update the statistics).

Resending

To keep the code simple, when resending, the health value of the interface that had the problem will be updated. If we are sending to a non-MR peer we will use the same src_nid; otherwise the peer will be confused. The selection algorithm will then consider the health value when it is selecting the NI or peer NI.

There are two aspects to consider when sending a message:

  1. The congestion of the local and remote interfaces
  2. The health of the local and remote interfaces

It is possible for a send to fail immediately. In this case we need to take active measures to ensure that we do not enter a tight loop, resending until the timeout expires. This could spike CPU consumption unexpectedly.

To do that, the last sent time will be kept in the message. If the message is not sent successfully on any of the existing interfaces, it will be placed on a queue and resent after a specific deadline expires. This will be termed a "(re)send procedure". An interval must elapse between each (re)send procedure. A (re)send procedure will iterate through all local and remote peers, depending on the source of the send failure.
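The queueing described above can be sketched as follows; the singly linked list and the names (struct resend_entry, queue_for_resend(), resend_due()) are illustrative, not the kernel list API:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch of the pending-resend queue. */
struct resend_entry {
	struct resend_entry *next;
	long resend_after;	/* earliest time the next (re)send procedure may run */
};

static struct resend_entry *resend_queue;

static void queue_for_resend(struct resend_entry *e, long now, long interval)
{
	e->resend_after = now + interval;	/* enforce the inter-procedure interval */
	e->next = resend_queue;
	resend_queue = e;
}

/* Polled periodically (the text proposes the router pinger thread). */
static int resend_due(const struct resend_entry *e, long now)
{
	return now >= e->resend_after;
}
```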

The router pinger thread will be refactored to handle resending messages. The router pinger thread is started regardless of whether any gateways are configured, and is currently only used to ping them. Its operation will be expanded to check the pending message queue and resend messages.

The selection algorithm will take an average of these two values (congestion and health) to determine the best interface to select. To make the comparison proper, the health value of the interface will be initialized to the same value as the credits. It will be decremented on a failed send and incremented on a successful send.

A health_value_range module parameter will be added to control the sensitivity of the selection. If it is set to 0, the single best interface will be selected. A value greater than 0 gives a range within which interfaces may be selected. A large enough value is in effect equivalent to turning off the health comparison.
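The health_value_range comparison can be sketched as follows; the helper name is an assumption. A candidate whose health value is within the range of the best remains eligible, while a range of 0 admits only the best interface:

```c
#include <assert.h>

/* Illustrative sketch of the health_value_range eligibility test. */
static int ni_is_eligible(int health_value, int best_health_value,
			  int health_value_range)
{
	/* range 0: only the single best interface qualifies */
	return best_health_value - health_value <= health_value_range;
}
```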

...

A new LNet/LND API will be created to pass these events from the LND to LNet.

Timeouts

LND Detected Timeouts

Upper layers request that LNet send a GET or a PUT via the LNetGet() and LNetPut() APIs. LNet then calls into the LND to complete the operation. The LND encapsulates the LNet message in an LND-specific message with its own message type; for example, in the o2iblnd it is kib_msg_t.

...

Therefore if a tx_deadline is hit, it is safe to assume that the remote end has not received the message. The reasons are described further below.

...

The tx_deadline is LND-specific, and derived from the timeout (or sock_timeout) module parameter of the LND.

LNet Detected Timeouts

As mentioned above, at the LNet layer an LNET_MSG_PUT can be told to expect an LNET_MSG_ACK to confirm that the LNET_MSG_PUT has been processed by the destination. Similarly, an LNET_MSG_GET expects an LNET_MSG_REPLY to confirm that the LNET_MSG_GET has been successfully processed by the destination.

The pairs LNET_MSG_PUT+LNET_MSG_ACK and LNET_MSG_GET+LNET_MSG_REPLY are not covered by the tx_deadline in the LND. If the upper layer does not take precautions it could wait forever for the LNET_MSG_ACK or LNET_MSG_REPLY. Therefore it is reasonable to expect that LNet provides a TIMEOUT event if either of these messages is not received within the expected timeout.
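The LNet-detected timeout can be sketched as a simple predicate; the name and signature are illustrative assumptions. A PUT (respectively GET) that has not seen its ACK (respectively REPLY) within the transaction timeout yields a TIMEOUT event:

```c
#include <assert.h>

/* Illustrative sketch: has the transaction timeout expired without the
 * expected ACK/REPLY arriving? */
static int transaction_timed_out(long sent_at, long now,
				 long transaction_timeout, int got_response)
{
	return !got_response && (now - sent_at) >= transaction_timeout;
}
```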

...

The argument against this approach is mixed clusters, where not all nodes are MR capable. In this case we cannot rely on intermediary nodes to try all the interfaces of their next-hop. However, as assumed in the Multi-Rail design, if not all nodes are MR capable then not all Multi-Rail features are expected to work.

This approach would add the LNet resiliency required and avoid the many corner cases that would need to be addressed when receiving messages which have already been processed.

Relationship between different timeouts

There are multiple timeouts kept at different layers of the code. It is important to set the timeout defaults so that they work well together, and to give guidance on how the different timeouts interact.

Looking at timeouts from a bottom up approach:

  1. IB/TCP/GNI re-send timeout
  2. LND transmit timeout
    1. The timeout to wait for before a transmit fails and lnet_finalize() is called with an appropriate error code. This will result in a resend.
  3. Transaction timeout
    1. timeout after which LNet sends a timeout event for a missing REPLY/ACK.
  4. Message timeout
    1. timeout after which LNet abandons resending a message.
  5. Resend interval
    1. The interval between each (re)send procedure.
  6. RPC timeout
    1. The INITIAL_CONNECT_TIMEOUT is set to 5 sec
    2. ldlm_timeout and obd_timeout are tunables and default to LDLM_TIMEOUT_DEFAULT and OBD_TIMEOUT_DEFAULT.

IB/TCP/GNI re-send timeout < LND transmit timeout < RPC timeout.

A retry count can be specified. That's the number of times to resend after the LND transmit timeout expires.

The timeout value before failing an LNET_MSG_[PUT | GET] will be:

message timeout = (retry count * LND transmit timeout) + (resend interval * retry count)

where

retry count = min(retry count, 5)

message timeout <= transaction timeout
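A worked example of the formula above; message_timeout() is an illustrative helper and the values used are examples, not defaults:

```c
#include <assert.h>

/* Illustrative sketch of the message-timeout formula. */
static int message_timeout(int retry_count, int lnd_tx_timeout,
			   int resend_interval)
{
	if (retry_count > 5)
		retry_count = 5;	/* retry count = min(retry count, 5) */
	return retry_count * lnd_tx_timeout + resend_interval * retry_count;
}
```

With a retry count of 3, an LND transmit timeout of 10s and a resend interval of 2s, the message timeout is 3*10 + 2*3 = 36s, which must not exceed the transaction timeout.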

It has been observed that mount could hang for a long time if a discovery ping is not responded to. This could happen if an OST is down while a client mounts the file system. In this case it does not make sense to hold up the mount procedure while discovery is taking place. For some cases like discovery, the algorithm would specify a different timeout than what is configured.

Resiliency vs. Reliability

...

Code Block
lnet_ni lnet_get_best_ni(local_net, cur_ni, md_cpt)
{
	local_net = get_local_net(peer_net)
	for each ni in local_net {
		health_value = lnet_local_ni_health(ni)
		/* select the best health value */
		if (health_value < best_health_value)
			continue
		distance = get_distance(md_cpt, dev_cpt)
		/* select the shortest distance to the MD */
		if (distance < lnet_numa_range)
			distance = lnet_numa_range
		if (distance > shortest_distance)
			continue
		else if (distance < shortest_distance)
			shortest_distance = distance
		/* select based on the most available credits */
		else if (ni_credits < best_credits)
			continue
		/* if all is equal select based on round robin */
		else if (ni_credits == best_credits)
			if (best_ni->ni_seq <= ni->ni_seq)
				continue
		/* remember the best candidate found so far */
		best_health_value = health_value
		best_credits = ni_credits
		best_ni = ni
	}
	return best_ni
}
 
/*
 * lnet_select_pathway() will be modified to add a peer_nid parameter.
 * This parameter indicates that the peer_ni is predetermined, and is
 * identified by the NID provided. The peer_nid parameter is the
 * next-hop NID, which can be the final destination or the next-hop
 * router. If that peer_NID is not healthy then another peer_NID is
 * selected as per the current algorithm. This will force the
 * algorithm to prefer the peer_ni which was selected in the initial
 * message sending. The peer_ni NID is stored in the message. This
 * new parameter extends the concept of the src_nid, which is provided
 * to lnet_select_pathway() to inform it that the local NI is
 * predetermined.
 */
 
/* on resend */
enum lnet_error_type {
	LNET_LOCAL_NI_DOWN, /* don't use this NI until you get an UP */
	LNET_LOCAL_NI_UP, /* start using this NI */
	LNET_LOCAL_NI_SEND_TIMEOUT, /* demerit this NI so it's not selected immediately, provided there are other healthy interfaces */
	LNET_PEER_NI_NO_LISTENER, /* there is no remote listener. demerit the peer_ni and try another NI */
	LNET_PEER_NI_ADDR_ERROR, /* The address for the peer_ni is wrong. Don't use this peer_NI */
	LNET_PEER_NI_UNREACHABLE, /* temporarily don't use the peer NI */
	LNET_PEER_NI_CONNECT_ERROR, /* temporarily don't use the peer NI */
	LNET_PEER_NI_CONNECTION_REJECTED /* temporarily don't use the peer NI */
};
 
static int lnet_handle_send_failure_locked(msg, local_nid, status)
{
	switch (status)
	/*
	 * LNET_LOCAL_NI_DOWN can be received without a message being sent.
	 * In this case msg == NULL and it is sufficient to update the health
	 * of the local NI.
	 */
	case LNET_LOCAL_NI_DOWN:
		LASSERT(!msg);
		local_ni = lnet_get_local_ni(local_nid)
		if (!local_ni)
			return
		/* flag local NI down */
		lnet_set_local_ni_health(DOWN)
		break;
	case LNET_LOCAL_NI_UP:
		LASSERT(!msg);
		local_ni = lnet_get_local_ni(local_nid)
		if (!local_ni)
			return
		/* flag local NI up */
		lnet_set_local_ni_health(UP)
		/* This NI will be a candidate for selection in the next message send */
		break;
...
}
 
static int lnet_complete_msg_locked(msg, cpt)
{
	status = msg->msg_ev.status
	if (status != 0) {
		rc = lnet_handle_send_failure_locked(msg, status)
		if (rc == 0)
			return
	}
	/* continue as currently done */
}
  


Remote Interface Failure

A remote interface can be considered problematic under multiple scenarios:

...