...
When multiple paths are available for a message, it makes sense to try to resend it on failure. But where should the resending logic be implemented?
...
LNet shall use a trickle-down approach for managing timeouts. The ULP (ptlrpc or another upper layer protocol) shall provide a timeout value in the call to LNetPut() or LNetGet(). LNet shall use that value as the transaction timeout while waiting for an ACK or REPLY. LNet shall further provide a configuration parameter for the number of retries, which lets the user specify the maximum number of times LNet shall attempt to resend an unsuccessful message. LNet shall then calculate the message timeout by dividing the transaction timeout by the number of retries. LNet shall pass the calculated message timeout to the LND, which will use it to ensure that the LND protocol completes an LNet message within the message timeout. If the LND is not able to complete the message within the provided timeout, it will close the connection and drop all messages on that connection. It will afterward call into LNet via lnet_finalize() to notify it of the error encountered.
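The timeout derivation described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: `sketch_message_timeout()` is a hypothetical helper, not an LNet function, and the handling of a retry count below 1 and of a quotient that rounds down to zero is an assumption the text does not specify.

```c
/*
 * Hypothetical helper, not part of LNet: derive the per-attempt message
 * timeout from the transaction timeout supplied by the ULP and the
 * configured number of retries. Units are whatever the ULP passes in.
 */
int sketch_message_timeout(int transaction_timeout, int retry_count)
{
	int msg_timeout;

	/* Assumption: a non-positive transaction timeout means "none". */
	if (transaction_timeout <= 0)
		return 0;

	/* Assumption: with at most one attempt, the message gets the
	 * whole transaction timeout. */
	if (retry_count <= 1)
		return transaction_timeout;

	/* The rule from the text: transaction timeout / number of retries. */
	msg_timeout = transaction_timeout / retry_count;

	/* Assumption: never hand the LND a zero timeout when integer
	 * division rounds down. */
	if (msg_timeout < 1)
		msg_timeout = 1;

	return msg_timeout;
}
```

With a transaction timeout of 50 and 5 retries, the helper returns 10, which is the value LNet would pass to the LND for each attempt.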
...
- Local Interface failure
- Remote Interface failure
- Timeouts
  - LND detected Timeout
  - LNet detected Timeout
Timeouts will be provided by the ULP in the LNetPut() and LNetGet() APIs. The APIs are defined below:
LNetPut()/LNetGet() APIs
Code Block |
---|
/**
* Initiate an asynchronous GET operation.
*
* On the initiator node, an LNET_EVENT_SEND is logged when the GET request
* is sent, and an LNET_EVENT_REPLY is logged when the data returned from
* the target node in the REPLY has been written to local MD.
* LNET_EVENT_REPLY will have a timeout flag set if the REPLY has not arrived
* within the timeout provided.
*
* On the target node, an LNET_EVENT_GET is logged when the GET request
* arrives and is accepted into a MD.
*
* \param self,target,portal,match_bits,offset See the discussion in LNetPut().
* \param mdh A handle for the MD that describes the memory into which the
* requested data will be received. The MD must be "free floating" (See LNetMDBind()).
* \param timeout The timeout to wait for the REPLY.
*
* \retval 0 Success, and only in this case events will be generated
* and logged to EQ (if it exists) of the MD.
* \retval -EIO Simulated failure.
* \retval -ENOMEM Memory allocation failure.
* \retval -ENOENT Invalid MD object.
*/
int
LNetGet(lnet_nid_t self, struct lnet_handle_md mdh,
        struct lnet_process_id target, unsigned int portal,
        __u64 match_bits, unsigned int offset, int timeout)
/**
* Initiate an asynchronous PUT operation.
*
* There are several events associated with a PUT: completion of the send on
* the initiator node (LNET_EVENT_SEND), and when the send completes
* successfully, the receipt of an acknowledgment (LNET_EVENT_ACK) indicating
* that the operation was accepted by the target. LNET_EVENT_ACK will have
* the timeout flag set if an ACK is not received within the timeout provided.
*
* The event LNET_EVENT_PUT is used at the target node to indicate the
* completion of incoming data delivery.
*
* The local events will be logged in the EQ associated with the MD pointed to
* by \a mdh handle. Using a MD without an associated EQ results in these
* events being discarded. In this case, the caller must have another
* mechanism (e.g., a higher level protocol) for determining when it is safe
* to modify the memory region associated with the MD.
*
* Note that LNet does not guarantee the order of LNET_EVENT_SEND and
* LNET_EVENT_ACK, though intuitively ACK should happen after SEND.
*
* \param self Indicates the NID of a local interface through which to send
* the PUT request. Use LNET_NID_ANY to let LNet choose one by itself.
* \param mdh A handle for the MD that describes the memory to be sent. The MD
* must be "free floating" (See LNetMDBind()).
* \param ack Controls whether an acknowledgment is requested.
* Acknowledgments are only sent when they are requested by the initiating
* process and the target MD enables them.
* \param target A process identifier for the target process.
* \param portal The index in the \a target's portal table.
* \param match_bits The match bits to use for MD selection at the target
* process.
* \param offset The offset into the target MD (only used when the target
* MD has the LNET_MD_MANAGE_REMOTE option set).
* \param timeout The timeout to wait for an ACK if one is expected.
* \param hdr_data 64 bits of user data that can be included in the message
* header. This data is written to an event queue entry at the target if an
* EQ is present on the matching MD.
*
* \retval 0 Success, and only in this case events will be generated
* and logged to EQ (if it exists).
* \retval -EIO Simulated failure.
* \retval -ENOMEM Memory allocation failure.
* \retval -ENOENT Invalid MD object.
*
* \see struct lnet_event::hdr_data and lnet_event_kind_t.
*/
int
LNetPut(lnet_nid_t self, struct lnet_handle_md mdh, enum lnet_ack_req ack,
        struct lnet_process_id target, unsigned int portal,
        __u64 match_bits, unsigned int offset, int timeout,
        __u64 hdr_data) |
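The comments above say that LNET_EVENT_ACK and LNET_EVENT_REPLY carry a timeout flag when the response does not arrive in time. A ULP event handler might act on that flag roughly as follows. This is a sketch only: the `sketch_*` types are hypothetical stand-ins, the real event structure and the name of its timeout flag are not given in this section, and the handler assumes the caller requested an ACK (for a PUT) or expects a REPLY (for a GET).

```c
#include <stdbool.h>

/* Hypothetical stand-ins for illustration; not the LNet definitions. */
enum sketch_event_kind {
	SKETCH_EVENT_SEND,	/* mirrors LNET_EVENT_SEND */
	SKETCH_EVENT_ACK,	/* mirrors LNET_EVENT_ACK */
	SKETCH_EVENT_REPLY,	/* mirrors LNET_EVENT_REPLY */
};

struct sketch_event {
	enum sketch_event_kind kind;
	bool timed_out;	/* assumed representation of the "timeout flag" */
};

enum sketch_action {
	SKETCH_WAIT,		/* transaction still in flight */
	SKETCH_COMPLETE,	/* ACK or REPLY arrived within the timeout */
	SKETCH_FAIL,		/* ACK or REPLY did not arrive in time */
};

/* Decide what the ULP should do with a local event on the initiator. */
enum sketch_action sketch_handle_event(const struct sketch_event *ev)
{
	switch (ev->kind) {
	case SKETCH_EVENT_ACK:
	case SKETCH_EVENT_REPLY:
		/* The response event ends the transaction either way;
		 * the flag tells the ULP whether it succeeded. */
		return ev->timed_out ? SKETCH_FAIL : SKETCH_COMPLETE;
	case SKETCH_EVENT_SEND:
	default:
		/* SEND alone does not finish the transaction here,
		 * because a response is still expected. */
		return SKETCH_WAIT;
	}
}
```

Because the ordering of LNET_EVENT_SEND and LNET_EVENT_ACK is not guaranteed, a real handler would also need to track whether both events have been seen before releasing the MD; the sketch leaves that out.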
Local Interface Failure
Local interface failures will be detected in one of two ways
...