...
- The local or remote interface health is decremented.
- Failure statistics are incremented.
- A resend is issued on a different local interface if one is available; otherwise the same interface is attempted again.
- The message will continue to be resent until one of the following criteria is met:
- Message completes successfully.
- Retry count is reached.
- Transaction timeout expires.
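The termination criteria above can be sketched as a small helper. All names here (msg_completed, msg_retries, msg_deadline_expired, should_resend) are illustrative assumptions, not the actual lnet_msg fields:

```c
/* Minimal sketch of the resend-termination criteria. Field names are
 * hypothetical, not the real lnet_msg members. */
struct msg_state {
	int msg_completed;		/* finalized successfully */
	int msg_retries;		/* resend attempts so far */
	int msg_deadline_expired;	/* transaction timeout hit */
};

static int should_resend(const struct msg_state *m, int retry_limit)
{
	if (m->msg_completed)
		return 0;	/* message completed successfully */
	if (m->msg_retries >= retry_limit)
		return 0;	/* retry count reached */
	if (m->msg_deadline_expired)
		return 0;	/* transaction timeout expired */
	return 1;
}

/* Sample states used for illustration */
static const struct msg_state sample_pending = { 0, 1, 0 };
static const struct msg_state sample_done    = { 1, 0, 0 };
```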
Two new fields will be added to lnet_msg:
...
It is possible for a message to be on the resend queue when it either completes or times out. In both of these cases it will be removed from the resend queue as well as the active queue and finalized.
Protection
The message will continue to be protected by the LNet net CPT lock to ensure mutually exclusive access.
When the message is committed, lnet_msg_commit(), the message CPT is assigned. This CPT value is then used to protect the message in subsequent usage. Relevant to this discussion is when the message is examined in lnet_finalize() and in the monitor thread, and either removed from the active queue or placed on the resend queue.
API Changes
The ULP will provide the transaction timeout value on which LNet will base its own timeout values. In the absence of that, LNet will fall back on a configurable transaction timeout value.
This trickle-down approach will simplify the configuration of the LNet Resiliency feature and make the timeout consistent throughout the system, instead of configuring the LND timeout to be much larger than the ptlrpc timeout as it is now. Furthermore, ptlrpc uses a backoff algorithm, which allows it to wait longer for responses. With this trickle-down approach, LNet will be able to cope with that timeout backoff algorithm.
The LNetGet() and LNetPut() APIs will be changed to reflect that.
Code Block |
---|
/**
* Initiate an asynchronous GET operation.
*
* On the initiator node, an LNET_EVENT_SEND is logged when the GET request
* is sent, and an LNET_EVENT_REPLY is logged when the data returned from
* the target node in the REPLY has been written to local MD.
 * LNET_EVENT_REPLY will have a timeout flag set if the REPLY has not arrived
 * within the timeout provided.
*
* On the target node, an LNET_EVENT_GET is logged when the GET request
* arrives and is accepted into a MD.
*
* \param self,target,portal,match_bits,offset See the discussion in LNetPut().
* \param mdh A handle for the MD that describes the memory into which the
* requested data will be received. The MD must be "free floating" (See LNetMDBind()).
*
 * \retval 0        Success, and only in this case events will be generated
 * and logged to EQ (if it exists) of the MD.
* \retval -EIO Simulated failure.
* \retval -ENOMEM Memory allocation failure.
* \retval -ENOENT Invalid MD object.
*/
int
LNetGet(lnet_nid_t self, struct lnet_handle_md mdh,
        struct lnet_process_id target, unsigned int portal,
        __u64 match_bits, unsigned int offset, int timeout);
/**
* Initiate an asynchronous PUT operation.
*
* There are several events associated with a PUT: completion of the send on
* the initiator node (LNET_EVENT_SEND), and when the send completes
* successfully, the receipt of an acknowledgment (LNET_EVENT_ACK) indicating
* that the operation was accepted by the target. LNET_EVENT_ACK will have
 * the timeout flag set if an ACK is not received within the timeout provided.
*
 * The event LNET_EVENT_PUT is used at the target node to indicate the
 * completion of incoming data delivery.
*
* The local events will be logged in the EQ associated with the MD pointed to
* by \a mdh handle. Using a MD without an associated EQ results in these
* events being discarded. In this case, the caller must have another
* mechanism (e.g., a higher level protocol) for determining when it is safe
* to modify the memory region associated with the MD.
*
* Note that LNet does not guarantee the order of LNET_EVENT_SEND and
* LNET_EVENT_ACK, though intuitively ACK should happen after SEND.
*
* \param self Indicates the NID of a local interface through which to send
* the PUT request. Use LNET_NID_ANY to let LNet choose one by itself.
* \param mdh A handle for the MD that describes the memory to be sent. The MD
* must be "free floating" (See LNetMDBind()).
* \param ack Controls whether an acknowledgment is requested.
* Acknowledgments are only sent when they are requested by the initiating
* process and the target MD enables them.
* \param target A process identifier for the target process.
* \param portal The index in the \a target's portal table.
* \param match_bits The match bits to use for MD selection at the target
* process.
* \param offset The offset into the target MD (only used when the target
* MD has the LNET_MD_MANAGE_REMOTE option set).
* \param timeout The timeout to wait for an ACK if one is expected.
* \param hdr_data 64 bits of user data that can be included in the message
* header. This data is written to an event queue entry at the target if an
* EQ is present on the matching MD.
*
 * \retval 0        Success, and only in this case events will be generated
* and logged to EQ (if it exists).
* \retval -EIO Simulated failure.
* \retval -ENOMEM Memory allocation failure.
* \retval -ENOENT Invalid MD object.
*
* \see struct lnet_event::hdr_data and lnet_event_kind_t.
*/
int
LNetPut(lnet_nid_t self, struct lnet_handle_md mdh, enum lnet_ack_req ack,
        struct lnet_process_id target, unsigned int portal,
        __u64 match_bits, unsigned int offset, int timeout,
        __u64 hdr_data); |
Selection Algorithm
The selection algorithm will be modified to take health into account and will operate according to the following logic:
Code Block |
---|
for every peer_net in peer {
    local_net = peer_net
    if peer_net is not local
        select a router
        local_net = router->net
    for every local_ni on local_net
        check if local_ni has best health_value
        check if local_ni is nearest MD NUMA
        check if local_ni has the most available credits
        check if we need to use round robin selection
        if above criteria are satisfied
            best_ni = local_ni
    for every peer_ni on best_ni->net
        check if peer_ni has best health value
        check if peer_ni has the most available credits
        check if we need to use round robin selection
        if above criteria are satisfied
            best_peer_ni = peer_ni
    send(best_ni, best_peer_ni)
} |
The above algorithm will always prefer the healthiest NIs. This is important because dropping even one message will likely result in client evictions, so it is important to always ensure we're using the best path possible.
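To illustrate the comparison order (health first, then credits) used when picking a candidate NI, here is a minimal sketch; the struct and function names are assumptions, not the actual selection code, and the real algorithm also considers NUMA distance and round-robin state:

```c
/* Hypothetical two-key comparison: prefer higher health, break ties
 * on available credits. */
struct cand_ni {
	int health;	/* health value, higher is better */
	int credits;	/* available send credits */
};

static const struct cand_ni *pick_best(const struct cand_ni *a,
				       const struct cand_ni *b)
{
	if (a->health != b->health)
		return a->health > b->health ? a : b;	/* health wins */
	return a->credits >= b->credits ? a : b;	/* then credits */
}

/* Sample candidates: ni_b is healthier despite fewer credits */
static const struct cand_ni ni_a = { .health = 2, .credits = 9 };
static const struct cand_ni ni_b = { .health = 3, .credits = 1 };
```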
LND Interface
LNet shall calculate the message timeout as follows:
message timeout = transaction timeout / retry count
The message timeout will be stored in the lnet_msg structure and passed down to the LND via lnd_send().
LND Transmits (o2iblnd specific discussion)
ULP requests from LNet to send a GET or a PUT via LNetGet()
and LNetPut()
APIs. LNet then calls into the LND to complete the operation. The LND can complete the LNet PUT/GET via a set of LND messages as shown in the diagrams below.
When the LND transmits the LND message it sets a tx_deadline for that particular transmit. If an acknowledgment is expected, this tx_deadline remains active until the remote confirms receipt of the message; if no acknowledgment is expected, the tx_deadline completes when the tx completes. Receipt of the message at the remote is when LNet is informed that a message has been received by the LND, done via lnet_parse(); LNet then calls back into the LND layer to receive the message.
By handling the tx_deadline properly we are able to account for almost all next-hop failures. LNet would have done its best to ensure that a message has arrived at the immediate next hop.
The tx_deadline is LND-specific, and derived from the timeout (or sock_timeout) module parameter of the LND.
Gliffy diagram: LND timeout
Gliffy diagram: PUT
Gliffy diagram: GET
A third type of message that the LND sends is the IBLND_MSG_IMMEDIATE. The data is embedded in the message and posted. There is no handshake in this case.
For the PUT case described in the sequence diagram, the initiator sends two messages:
- IBLND_MSG_PUT_REQ
- IBLND_MSG_PUT_DONE
Both of these messages are sent using the same tx structure. The tx is allocated and placed on a waiting queue. When the IBLND_MSG_PUT_ACK is received the waiting tx is looked up and used to send the IBLND_MSG_PUT_DONE.
When kiblnd_queue_tx_locked() is called for IBLND_MSG_PUT_REQ it sets the tx_deadline as follows:
Code Block |
---|
timeout_ns = *kiblnd_tunables.kib_timeout * NSEC_PER_SEC;
tx->tx_deadline = ktime_add_ns(ktime_get(), timeout_ns); |
When kiblnd_queue_tx_locked() is called for IBLND_MSG_PUT_DONE it resets the tx_deadline again.
This presents an obstacle for the LNet Resiliency feature. LNet provides a timeout for the LND as described above. From LNet's perspective this deadline is for the LNet PUT message. However, if we simply use that value for the timeout_ns calculation, then in essence we will be waiting 2 * LND timeout for the completion of the LNet PUT message. This will mean fewer retransmits.
Therefore the LND, since it has knowledge of its own protocols, will need to divide the timeout provided by LNet by the number of transmits it needs to complete the LNet-level message:
- LNET_MSG_GET: Requires only IBLND_MSG_GET_REQ. Use the LNet-provided timeout as is.
- LNET_MSG_PUT: Requires IBLND_MSG_PUT_REQ and IBLND_MSG_PUT_DONE to complete the LNET_MSG_PUT. Use the LNet-provided timeout / 2.
- LNET_MSG_GET/LNET_MSG_PUT with < 4K payload: Requires IBLND_MSG_IMMEDIATE. Use the LNet-provided timeout as is.
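That division can be sketched as follows; the enum and function names are illustrative assumptions, not the actual o2iblnd identifiers, and the real code would key off the lnet_msg type and payload size:

```c
/* Divide the LNet-provided timeout by the number of LND transmits
 * needed to complete the LNet-level message. */
enum lnet_op { OP_GET, OP_PUT, OP_IMMEDIATE };

static int lnd_tx_timeout(enum lnet_op op, int lnet_timeout)
{
	switch (op) {
	case OP_PUT:
		return lnet_timeout / 2;	/* PUT_REQ + PUT_DONE */
	case OP_GET:				/* single GET_REQ */
	case OP_IMMEDIATE:			/* payload < 4K, one message */
	default:
		return lnet_timeout;
	}
}
```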
System Timeouts
There are multiple timeouts kept at different layers of the code. The LNet Resiliency feature will attempt to reduce the complexity and ambiguity of setting the timeouts in the system.
This will be done by using a trickle down approach as mentioned before. The top level transaction timeout will be provided to LNet for each PUT/GET send request. If one is not provided LNet will use a configurable default.
LNet will calculate the following timeouts from the transaction timeout:
- Message timeout = Transaction timeout / retry count
- LND timeout = Message timeout / number of LND messages used to complete an LNet PUT/GET
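Putting the two formulas above together with example numbers (the values are assumptions for illustration only):

```c
/* Worked example of the trickle-down timeouts, in integer seconds.
 * Example: transaction timeout 50s, retry count 5 -> message timeout
 * 10s; a PUT completed by 2 LND messages -> LND timeout 5s. */
static int message_timeout(int transaction_timeout, int retry_count)
{
	return transaction_timeout / retry_count;
}

static int lnd_timeout(int msg_timeout, int lnd_msgs)
{
	return msg_timeout / lnd_msgs;
}
```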
Caveat
One thing I'm worried about is cases where we see timeouts in tickets, for example:
[Jira ticket]
This could be a potential problem in the field: would we see a spike in timeout failures once this feature is used on sites? Especially since one of the goals of this feature is to tighten the deadlines.
...
Remote triggered timeout
A PUT not receiving an ACK or a GET not receiving a REPLY is considered a remote timeout. This could happen if the peer is too busy to respond in a timely manner, is going through a reboot, or due to other unforeseen circumstances.
The other case which falls into this category is LND timeouts because of a missing LND acknowledgment, e.g. IBLND_MSG_PUT_ACK.
In these cases LNet cannot issue a resend safely. The message could have already been received and processed on the remote end, so we would need to add the ability to handle duplicate requests, which is outside the scope of this project.
Therefore, handling this category of failures will be delegated to the ULP. LNet will pass up a timeout event to the ULP.
Timeout Value
This section discusses how LNet shall calculate its timeouts.
There are two options:
- ULP provided timeout
- LNet configurable timeout
ULP Provided Timeout
The ULP, e.g. ptlrpc, will set the timeout in the lnet_libmd structure. This timeout indicates how long ptlrpc is willing to wait for an RPC response.
LNet can then set its own timeouts based on that timeout.
The drawback of this approach is that the timeout set is specific to the ULP. For example, in ptlrpc this would be the time to wait for an RPC reply.
As shown in the diagram below the timeout provided by ptlrpc would cover the RPC response sent by the peer ptlrpc. While LNet needs to ensure that a single LNET_MSG_PUT makes it to peer LNet.
Therefore the ULP timeout is related to the LNet transaction timeout only in the sense that the LNet transaction timeout must be less than the ULP timeout; a reasonable LNet transaction timeout cannot otherwise be derived from it.
This might be enough of a reason to pass the ULP timeout down and have the following logic:
Code Block |
---|
transaction_timeout = min(ULP_timeout / num_retries, global_transaction_timeout / num_retries) |
This way the LNet transaction timeout can be set much lower than the ULP timeout, considering that the ULP can vary its timeout based on an adaptive backoff algorithm.
Gliffy diagram
This trickle-down approach has the advantage of simplifying the configuration of the LNet Resiliency feature, as well as making the timeout consistent throughout the system, instead of configuring the LND timeout to be much larger than the ptlrpc timeout as it is now.
The timeout parameter will be passed down to LNet by setting it in struct lnet_libmd:
Code Block |
---|
struct lnet_libmd {
struct list_head md_list;
struct lnet_libhandle md_lh;
struct lnet_me *md_me;
char *md_start;
unsigned int md_offset;
unsigned int md_length;
unsigned int md_max_size;
int md_threshold;
int md_refcount;
unsigned int md_options;
unsigned int md_flags;
unsigned int md_niov; /* # frags at end of struct */
void *md_user_ptr;
struct lnet_eq *md_eq;
struct lnet_handle_md md_bulk_handle;
/*
* timeout to wait for this MD completion
*/
ktime_t md_timeout;
union {
struct kvec iov[LNET_MAX_IOV];
lnet_kiov_t kiov[LNET_MAX_IOV];
} md_iov;
}; |
LNet Configurable timeout
A transaction timeout value will be configured in LNet and used as described above. This approach avoids passing down the ULP timeout and relies on the system administrator to set the timeout in LNet to a reasonable value, based on the network requirements of the system.
That timeout value will still be used to calculate the LND timeout as in the above approach.
Conclusion on Approach
The second approach will be taken for the first phase, as it reduces the scope of the work and allows the project to meet the deadline for 2.12.
Implementation Specifics
Reasons for timeout
...