...
- Local Interface failure
- Remote Interface failure
- Timeouts
  - LND detected Timeout
  - LNet detected Timeout
Local Interface Failure
Local interface failures are detected in one of two ways:
- Synchronously, as a failure return code from the call to lnd_send().
- Asynchronously, as an event detected at a later point.
  - These asynchronous events can be the result of a send operation.
  - They can also be independent of send operations, when a failure is detected in the underlying device.
Desired Behavior
When a local interface fails, the following actions should take place:
- The local interface health is updated.
- Failure statistics are incremented.
- A resend is issued on a different local interface, if one is available.
- If no other local interface is present, or all are in failed mode, the send fails.
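The steps above can be sketched in C as follows. This is only an illustration under stated assumptions, not LNet code: `struct sketch_ni`, `sketch_handle_local_failure()` and their fields are hypothetical names, and marking a failed interface with a health of 0 is a simplification of whatever health accounting the final design uses.

```c
#include <stddef.h>

/* Hypothetical stand-in for a local interface; not the LNet structure. */
struct sketch_ni {
    int health;        /* higher is healthier; 0 means "failed" (assumption) */
    unsigned failures; /* failure statistics counter */
};

/*
 * React to a send failure on nis[failed].
 * Returns the index of the interface to resend on, or -1 when no other
 * usable interface exists and the send has to fail.
 */
static int sketch_handle_local_failure(struct sketch_ni *nis, size_t count,
                                       size_t failed)
{
    size_t i;
    int best = -1;

    nis[failed].health = 0; /* update the local interface health */
    nis[failed].failures++; /* increment the failure statistics */

    /* look for a different, non-failed interface to resend on */
    for (i = 0; i < count; i++) {
        if (i == failed || nis[i].health == 0)
            continue;
        if (best < 0 || nis[i].health > nis[best].health)
            best = (int)i;
    }
    return best;
}
```

A caller would resend on the returned interface, and fail the send only when the function returns -1.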
Implementation Specifics
lnet_ni_send() calls into the LND via the lnd_send() callback provided. If the return code indicates failure, lnet_finalize() is called to finalize the message.
lnet_finalize() takes the return code as an input parameter. The above behavior should be implemented in lnet_finalize(), since it is also the main entry point into the LNet module from the LNDs.
lnet_finalize() detaches the MD in preparation for completing the message. Once the MD is detached it can be re-used. Therefore, if the message is to be resent, the MD must not be detached at this point.
lnet_complete_msg_locked() should be modified to manage the local interface health and to decide whether the message should be resent. If the message cannot be resent because no local interface is available, the MD can be detached and the message can be freed.
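The ordering constraint on the MD can be illustrated with a small sketch. The types and the function `sketch_complete_msg()` are hypothetical and only model the decision described above; they are not the signatures of lnet_finalize() or lnet_complete_msg_locked().

```c
#include <stdbool.h>

/* Hypothetical outcome of completing a message. */
enum sketch_action { SKETCH_COMPLETE, SKETCH_RESEND, SKETCH_FAIL };

/* Hypothetical message state; only tracks whether the MD is still attached. */
struct sketch_msg {
    bool md_attached;
};

/*
 * Decide what to do with a message given the send status.
 * other_ni_available is assumed to come from the local interface selection.
 */
static enum sketch_action sketch_complete_msg(struct sketch_msg *msg,
                                              int status,
                                              bool other_ni_available)
{
    /* On failure with another interface available, keep the MD attached
     * so the same message can be resent. */
    if (status != 0 && other_ni_available)
        return SKETCH_RESEND;

    /* Otherwise the message is done: detach the MD so it can be re-used. */
    msg->md_attached = false;
    return status == 0 ? SKETCH_COMPLETE : SKETCH_FAIL;
}
```

The point of the sketch is that the detach happens only after the resend decision has been made, never before it.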
Currently lnet_select_pathway() iterates through all the local interfaces on a particular peer identified by the NID to send to. In this case we would want to restrict the resend to go to the same peer_ni, but on a different local interface.
This approach lends itself to breaking out the selection of the local interface from lnet_select_pathway(), leading to the following logic:
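    lnet_select_local(peer_net)
    {
        local_net = get_local_net(peer_net)
        best_ni = NULL
        best_health_value = 0
        shortest_distance = MAX
        best_credits = MIN
        for each ni in local_net {
            health_value = lnet_local_ni_health(ni)
            distance = get_distance(md_cpt, dev_cpt)
            /* treat all distances within the NUMA range as equal */
            if (distance < lnet_numa_range)
                distance = lnet_numa_range
            /* select the best health value */
            if (health_value < best_health_value)
                continue
            else if (health_value > best_health_value) {
                best_health_value = health_value
                shortest_distance = distance
            }
            /* select the shortest distance to the MD */
            else if (distance > shortest_distance)
                continue
            else if (distance < shortest_distance)
                shortest_distance = distance
            /* select based on the most available credits */
            else if (ni_credits < best_credits)
                continue
            /* if all is equal select based on round robin */
            else if (ni_credits == best_credits) {
                if (best_ni != NULL && best_ni->ni_seq <= ni->ni_seq)
                    continue
            }
            /* ni is the best candidate seen so far */
            best_ni = ni
            best_credits = ni_credits
        }
        return best_ni
    }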
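The selection order in the pseudocode above (health first, then NUMA distance, then available credits, then round robin) can be exercised with a compilable sketch. Everything here is an assumption made for illustration: `struct sketch_local_ni`, `sketch_select_local()` and the field names are hypothetical, the distance is stored per interface instead of being computed from CPTs, and incrementing `seq` on selection is just one possible way to drive the round robin.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical per-interface state used only by this sketch. */
struct sketch_local_ni {
    int health;        /* higher is healthier */
    unsigned distance; /* NUMA distance from the MD to the device */
    int credits;       /* available send credits */
    unsigned seq;      /* bumped each time the interface is selected */
};

static struct sketch_local_ni *
sketch_select_local(struct sketch_local_ni *nis, size_t count,
                    unsigned numa_range)
{
    struct sketch_local_ni *best = NULL;
    int best_health = 0;
    unsigned shortest = UINT_MAX;
    int best_credits = INT_MIN;
    size_t i;

    for (i = 0; i < count; i++) {
        struct sketch_local_ni *ni = &nis[i];
        unsigned distance = ni->distance;

        /* distances inside the NUMA range are treated as equal */
        if (distance < numa_range)
            distance = numa_range;

        if (best != NULL) {
            if (ni->health < best_health)
                continue;               /* less healthy */
            if (ni->health == best_health) {
                if (distance > shortest)
                    continue;           /* farther from the MD */
                if (distance == shortest) {
                    if (ni->credits < best_credits)
                        continue;       /* fewer credits */
                    if (ni->credits == best_credits &&
                        best->seq <= ni->seq)
                        continue;       /* round robin: keep the older pick */
                }
            }
        }
        /* ni is the best candidate seen so far */
        best = ni;
        best_health = ni->health;
        shortest = distance;
        best_credits = ni->credits;
    }
    if (best != NULL)
        best->seq++;
    return best;
}
```

With two equally healthy interfaces at the same effective distance, the one with more credits is chosen; once credits are also equal, successive calls alternate between them because the selected interface's `seq` grows.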
TBD
Remote Interface Failure
TBD
...