States:
LNET_PEER_STATE_INIT
LNET_PEER_STATE_VERIFY
LNET_PEER_STATE_ACTIVE
LNET_PEER_STATE_WAIT_PING_RESPONSE
LNET_PEER_STATE_PUSH_SENT
LNET_PEER_STATE_DISCOVERY
Events
LNET_EVENT_RECV_ACK
LNET_EVENT_RECV_PING_PUSH
LNET_EVENT_RECV_PING_REPLY
LNET_EVENT_SEND
LNET_EVENT_DLC_ADD_PEER_NI
LNET_EVENT_DLC_DEL_PEER_NI
LNET_EVENT_DLC_LOCAL_NI_CFG_UPDATE
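The states and events above could be declared as two enumerations. The following is a minimal sketch, not the actual LNet declarations: the `*_COUNT` sentinels, the name table and the two lookup helpers are hypothetical additions for illustration. LNET_PEER_STATE_DISCOVERY is included because it has its own FSM table further down.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Peer states used by the FSM tables in this document. */
enum lnet_peer_state {
	LNET_PEER_STATE_INIT,
	LNET_PEER_STATE_VERIFY,
	LNET_PEER_STATE_ACTIVE,
	LNET_PEER_STATE_WAIT_PING_RESPONSE,
	LNET_PEER_STATE_PUSH_SENT,
	LNET_PEER_STATE_DISCOVERY,
	LNET_PEER_STATE_COUNT	/* hypothetical sentinel, not a real state */
};

/* Events that can be fed to a peer. */
enum lnet_peer_event {
	LNET_EVENT_RECV_ACK,
	LNET_EVENT_RECV_PING_PUSH,
	LNET_EVENT_RECV_PING_REPLY,
	LNET_EVENT_SEND,
	LNET_EVENT_DLC_ADD_PEER_NI,
	LNET_EVENT_DLC_DEL_PEER_NI,
	LNET_EVENT_DLC_LOCAL_NI_CFG_UPDATE,
	LNET_EVENT_COUNT	/* hypothetical sentinel, not a real event */
};

/* Name table indexed by event, handy for log messages. */
static const char *const lnet_peer_event_names[LNET_EVENT_COUNT] = {
	[LNET_EVENT_RECV_ACK] = "LNET_EVENT_RECV_ACK",
	[LNET_EVENT_RECV_PING_PUSH] = "LNET_EVENT_RECV_PING_PUSH",
	[LNET_EVENT_RECV_PING_REPLY] = "LNET_EVENT_RECV_PING_REPLY",
	[LNET_EVENT_SEND] = "LNET_EVENT_SEND",
	[LNET_EVENT_DLC_ADD_PEER_NI] = "LNET_EVENT_DLC_ADD_PEER_NI",
	[LNET_EVENT_DLC_DEL_PEER_NI] = "LNET_EVENT_DLC_DEL_PEER_NI",
	[LNET_EVENT_DLC_LOCAL_NI_CFG_UPDATE] = "LNET_EVENT_DLC_LOCAL_NI_CFG_UPDATE",
};

/* Hypothetical helper: event to printable name. */
const char *lnet_peer_event_name(enum lnet_peer_event event)
{
	if ((unsigned int)event >= LNET_EVENT_COUNT)
		return "LNET_EVENT_UNKNOWN";
	return lnet_peer_event_names[event];
}

/* Hypothetical helper: name to event, LNET_EVENT_COUNT if not found. */
enum lnet_peer_event lnet_peer_event_from_name(const char *name)
{
	int i;

	assert(name != NULL);
	for (i = 0; i < LNET_EVENT_COUNT; i++)
		if (strcmp(lnet_peer_event_names[i], name) == 0)
			return (enum lnet_peer_event)i;
	return LNET_EVENT_COUNT;
}
```

The sentinels also give a natural size for a state-by-event action table.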
Receiving
lnet_parse()
- Do error checking
- If routing then update the status of the local Network Interface
- If the message is not for the local node, then it should be routed, so perform appropriate checking
- Allocate an lnet_msg_t and populate it with the information passed in.

    msg->msg_type = type;
    msg->msg_private = private;
    msg->msg_receiving = 1;
    msg->msg_rdma_get = rdma_req;
    msg->msg_len = msg->msg_wanted = payload_length;
    msg->msg_offset = 0;
    msg->msg_hdr = *hdr;
    /* for building message event */
    msg->msg_from = from_nid;
    if (!for_me) {
        msg->msg_target.pid = dest_pid;
        msg->msg_target.nid = dest_nid;
        msg->msg_routing = 1;
    } else {
        /* convert common msg->hdr fields to host byteorder */
        msg->msg_hdr.type = type;
        msg->msg_hdr.src_nid = src_nid;
        msg->msg_hdr.src_pid = le32_to_cpu(msg->msg_hdr.src_pid);
        msg->msg_hdr.dest_nid = dest_nid;
        msg->msg_hdr.dest_pid = dest_pid;
        msg->msg_hdr.payload_length = payload_length;
    }
- Now we need to call into the peer module. This should be the end of lnet_parse(). The following code will be moved to the peer module:

    lnet_net_lock(cpt);
    rc = lnet_nid2peer_locked(&msg->msg_rxpeer, from_nid, cpt);
    if (rc != 0) {
        lnet_net_unlock(cpt);
        CERROR("%s, src %s: Dropping %s "
               "(error %d looking up sender)\n",
               libcfs_nid2str(from_nid), libcfs_nid2str(src_nid),
               lnet_msgtyp2str(type), rc);
        lnet_msg_free(msg);
        goto drop;
    }

    if (lnet_isrouter(msg->msg_rxpeer)) {
        lnet_peer_set_alive(msg->msg_rxpeer);
        if (avoid_asym_router_failure &&
            LNET_NIDNET(src_nid) != LNET_NIDNET(from_nid)) {
            /* received a remote message from router, update
             * remote NI status on this router.
             * NB: multi-hop routed message will be ignored.
             */
            lnet_router_ni_update_locked(msg->msg_rxpeer,
                                         LNET_NIDNET(src_nid));
        }
    }

    lnet_msg_commit(msg, cpt);

    /* message delay simulation */
    if (unlikely(!list_empty(&the_lnet.ln_delay_rules) &&
                 lnet_delay_rule_match_locked(hdr, msg))) {
        lnet_net_unlock(cpt);
        return 0;
    }

    if (!for_me) {
        rc = lnet_parse_forward_locked(ni, msg);
        lnet_net_unlock(cpt);

        if (rc < 0)
            goto free_drop;

        if (rc == LNET_CREDIT_OK) {
            lnet_ni_recv(ni, msg->msg_private, msg, 0,
                         0, payload_length, payload_length);
        }
        return 0;
    }

    lnet_net_unlock(cpt);

    /* AMIR:
     * lnet_peer_recv_message(ni, msg)
     */
    rc = lnet_parse_local(ni, msg);
    if (rc != 0)
        goto free_drop;
    return 0;
- Instead, call into the peer module with lnet_peer_recv_data().
lnet_peer_recv_data()
lnet_net_lock()
- Find a peer_ni.
- Perform the bit of logic we moved out of lnet_parse().
- If the peer_ni already exists, check whether it is detached or has a parent. If the peer_ni is not already in our db, create one; by default it is detached.
  - If it is detached, check what kind of message was received:
    - If it is a PING PUSH, the peer is trying to tell us about the rest of its interfaces.
      - We can check ahead of calling lnet_parse_put() by looking at hdr->msg.put.match_bits. That should tell us what specific message was sent to us.
      - Create a buffer to put the push data in.
      - Create the peer/peer_net structures and link them.
      - Call lnet_parse_put().
        - Provide an eq callback to put the ping info in.
      - Feed the peer the LNET_EVENT_RECV_PING_PUSH event.
    - If it is any other type of message, perform an operation similar to lnet_parse_local(), shown below.
      - The peer is still detached at this point.
    int lnet_parse_local(lnet_ni_t *ni, lnet_msg_t *msg)
    {
        int rc;

        switch (msg->msg_type) {
        case LNET_MSG_ACK:
            rc = lnet_parse_ack(ni, msg);
            break;
        case LNET_MSG_PUT:
            rc = lnet_parse_put(ni, msg);
            break;
        case LNET_MSG_GET:
            rc = lnet_parse_get(ni, msg, msg->msg_rdma_get);
            break;
        case LNET_MSG_REPLY:
            rc = lnet_parse_reply(ni, msg);
            break;
        default: /* prevent an unused label if !kernel */
            LASSERT(0);
            return -EPROTO;
        }

        LASSERT(rc == 0 || rc == ENOENT);
        return rc;
    }
  - If one already exists and it is part of a peer:
    - If msg->msg_type == LNET_MSG_PUT:
      - If this is a PING_PUSH (same check as above):
        - Create a buffer to put the push data in.
        - Call lnet_parse_put().
        - Feed the peer the LNET_EVENT_RECV_PING_PUSH event.
    - If msg->msg_type == LNET_MSG_REPLY:
      - If this is a PING_REPLY:
        - If peer_state == LNET_PEER_STATE_WAIT_PING_RESPONSE:
          - The reason for the state check here is to avoid creating a buffer for the reply when we don't want to.
          - An alternative is to have the buffer created when the ping is sent, but I like this better because it's symmetrical with dealing with receiving a PUSH.
          - Create a buffer to hold the reply in.
          - Call lnet_parse_reply().
            - Provide an eq callback to copy the ping info data in.
          - Feed the peer the LNET_EVENT_RECV_PING_REPLY event.
    - If msg->msg_type == LNET_MSG_ACK:
      - Call lnet_parse_ack().
        - No need for an eq callback here.
      - Feed the peer the LNET_EVENT_RECV_ACK event.
    - Else:
      - Perform logic similar to lnet_parse_local().
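The receive-side decision above reduces to one question: which event, if any, should be fed to the peer for an incoming message? The sketch below captures only that decision. The `struct lnet_rx_info` summary, the `LNET_EVENT_NONE` value and the function name are hypothetical, and the enums are local stand-ins rather than the real LNet definitions. A result of `LNET_EVENT_NONE` means "handle as in lnet_parse_local() without feeding the FSM".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Local stand-ins for the LNet message types. */
enum lnet_msg_type { LNET_MSG_ACK, LNET_MSG_PUT, LNET_MSG_GET, LNET_MSG_REPLY };

enum lnet_peer_event {
	LNET_EVENT_NONE,	/* hypothetical: no FSM event for this message */
	LNET_EVENT_RECV_ACK,
	LNET_EVENT_RECV_PING_PUSH,
	LNET_EVENT_RECV_PING_REPLY
};

/* Hypothetical summary of what the receive path knows about a message. */
struct lnet_rx_info {
	enum lnet_msg_type type;
	bool is_ping_push;	/* derived from hdr->msg.put.match_bits */
	bool is_ping_reply;
	bool attached;		/* the peer_ni already belongs to a peer */
	bool waiting_for_reply;	/* peer is in LNET_PEER_STATE_WAIT_PING_RESPONSE */
};

enum lnet_peer_event lnet_peer_classify_rx(const struct lnet_rx_info *rx)
{
	assert(rx != NULL);

	switch (rx->type) {
	case LNET_MSG_PUT:
		/* A push is interesting for detached and attached peer_nis. */
		if (rx->is_ping_push)
			return LNET_EVENT_RECV_PING_PUSH;
		break;
	case LNET_MSG_REPLY:
		/* Only buffer a ping reply if we are actually waiting for one. */
		if (rx->attached && rx->is_ping_reply && rx->waiting_for_reply)
			return LNET_EVENT_RECV_PING_REPLY;
		break;
	case LNET_MSG_ACK:
		if (rx->attached)
			return LNET_EVENT_RECV_ACK;
		break;
	case LNET_MSG_GET:
		break;
	}
	return LNET_EVENT_NONE;
}
```

Keeping the classification free of side effects means the buffer allocation and the lnet_parse_put()/lnet_parse_reply() calls can be driven from a single switch on the result.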
Notes
For background information.
By the time we return back from lnet_parse_<>() we should have
- read the message into the MD
- called the eq_callback
- detached the MD
- freed the message
Given this order, we can do normal processing, and once we return back from this processing, we should have the information we need copied into a local buffer, which we can then pass to the Dynamic Discovery thread to process.
High-level call flow
    lnet_parse_put()
        lnet_ptl_match_md()
        case LNET_MATCHMD_OK
            lnet_recv_put()
                lnet_ni_recv()
                    lnd_recv()
                        lnet_finalize()
                            lnet_msg_detach_md()
                                lnet_eq_enqueue_event()
                                    eq->eq_callback(ev);
Sending
lnet_peer_send_msg()
This API will replace lnet_send() and will exist in the peer module.
This API will:
- trigger discovery of the peer if needed
- determine the local/remote pathway
- send the message
The algorithm will be similar to:
- Find or create the peer_ni with msg->msg_target.nid, under lock.
  - If the peer_ni is detached:
    - create a peer_net
    - add the peer_ni to the peer_net
    - create a peer
    - add the peer_net to the peer
    - return the peer
  - If the peer_ni is part of a peer, return that peer.
- Feed LNET_EVENT_SEND to the peer.
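The find-or-create step can be sketched as follows. This is a simplified model under stated assumptions, not the LNet implementation: every structure and function name is hypothetical, a peer holds a single peer_net, the lookup is a linear list walk, and locking is omitted (the real code would do all of this under the net lock). Feeding LNET_EVENT_SEND is represented by a counter.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t nid_t;	/* stand-in for lnet_nid_t */

struct peer;

struct peer_net {
	struct peer *pn_peer;
};

struct peer_ni {
	nid_t pni_nid;
	struct peer_net *pni_net;	/* NULL while the peer_ni is detached */
	struct peer_ni *pni_next;
};

struct peer {
	struct peer_net *p_net;
	unsigned int p_send_events;	/* counts LNET_EVENT_SEND deliveries */
};

static struct peer_ni *peer_ni_list;	/* toy table; real code would hash */

static struct peer_ni *peer_ni_find_or_create(nid_t nid)
{
	struct peer_ni *pni;

	for (pni = peer_ni_list; pni != NULL; pni = pni->pni_next)
		if (pni->pni_nid == nid)
			return pni;

	pni = calloc(1, sizeof(*pni));	/* new peer_nis start out detached */
	if (pni == NULL)
		return NULL;
	pni->pni_nid = nid;
	pni->pni_next = peer_ni_list;
	peer_ni_list = pni;
	return pni;
}

/* Return the peer owning target_nid, building peer_net and peer if needed. */
struct peer *lnet_peer_send_lookup(nid_t target_nid)
{
	struct peer_ni *pni = peer_ni_find_or_create(target_nid);
	struct peer *peer;

	if (pni == NULL)
		return NULL;

	if (pni->pni_net == NULL) {
		struct peer_net *pn = calloc(1, sizeof(*pn));

		peer = calloc(1, sizeof(*peer));
		if (pn == NULL || peer == NULL) {
			free(pn);
			free(peer);
			return NULL;
		}
		pni->pni_net = pn;	/* add peer_ni to peer_net */
		pn->pn_peer = peer;	/* add peer_net to peer */
		peer->p_net = pn;
	}

	peer = pni->pni_net->pn_peer;
	assert(peer != NULL);
	peer->p_send_events++;	/* stands in for feeding LNET_EVENT_SEND */
	return peer;
}
```

Two sends to the same NID return the same peer, so the FSM always sees events for one NID on one peer object.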
DLC
lnet_peer_add_peer_ni()
api_mutex_lock()
- There is the concept of a primary NID.
- Find or create the peer_ni with the primary NID.
- If a peer_ni is found and it is attached:
  - Ensure that the peer NIDs provided are all unique to this peer.
  - Return the peer.
- If the peer_ni is detached:
  - Create a peer_net.
  - Add the peer_ni to the peer_net.
  - Create a peer.
  - Add the peer_net to the peer.
  - Return the peer.
- Feed the peer LNET_EVENT_DLC_ADD_PEER_NI.
api_mutex_unlock()
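The uniqueness check in lnet_peer_add_peer_ni() can be illustrated with a small table that maps each NID to its owning peer. This is a sketch under stated assumptions: `struct peer_db`, the integer peer identifiers and the function names are hypothetical, and the caller is assumed to hold the API mutex. The request is validated in full before anything is modified, so a rejected request leaves the table unchanged.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PEER_NI 16

typedef uint64_t nid_t;	/* stand-in for lnet_nid_t */

struct peer_ni_entry {
	nid_t nid;
	int owner;	/* hypothetical peer identifier */
};

struct peer_db {
	struct peer_ni_entry ni[MAX_PEER_NI];
	size_t count;
};

/* Return the owner of nid, or -1 if the NID is not known. */
static int peer_db_owner(const struct peer_db *db, nid_t nid)
{
	size_t i;

	for (i = 0; i < db->count; i++)
		if (db->ni[i].nid == nid)
			return db->ni[i].owner;
	return -1;
}

/* Add nids to peer; fail with -EEXIST if any NID belongs to another peer. */
int peer_db_add_nids(struct peer_db *db, int peer, const nid_t *nids, size_t n)
{
	size_t i, needed = 0;

	assert(db != NULL && (nids != NULL || n == 0));

	/* First pass: validate the whole request without changing state. */
	for (i = 0; i < n; i++) {
		int owner = peer_db_owner(db, nids[i]);

		if (owner >= 0 && owner != peer)
			return -EEXIST;
		if (owner < 0)
			needed++;
	}
	if (db->count + needed > MAX_PEER_NI)
		return -ENOSPC;

	/* Second pass: add the NIDs this peer does not already own. */
	for (i = 0; i < n; i++) {
		if (peer_db_owner(db, nids[i]) >= 0)
			continue;
		db->ni[db->count].nid = nids[i];
		db->ni[db->count].owner = peer;
		db->count++;
	}
	return 0;
}
```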
lnet_peer_del_peer_ni()
api_mutex_lock()
- There is the concept of a primary NID.
- Find the peer_ni with the primary NID.
- If a peer_ni is found and it is attached:
  - Ensure that the peer NIDs provided are all unique to this peer.
  - Return the peer.
- If the peer_ni is detached:
  - Report an error.
- Feed the peer LNET_EVENT_DLC_DEL_PEER_NI.
api_mutex_unlock()
lnet_peer_local_ni_cfg_change()
api_mutex_lock()
- For each peer on the peer list:
  - Feed the peer LNET_EVENT_DLC_LOCAL_NI_CFG_UPDATE.
api_mutex_unlock()
FSM Table
Note that when a peer is being worked on it gets locked so that no other thread can change it. This prevents unexpected state transitions.
Calling the FSM Action functions
    lnet_peer_lock()
    action_fn_locked = lookup_peer_action_function(peer, event);
    action_fn_locked()
    lnet_peer_unlock()
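A table-driven version of this lookup might look like the sketch below. It is illustrative only: the states and events are a two-by-two subset, the `locked` flag is a toy stand-in for the per-peer lock, and the INIT/SEND transition to ACTIVE is invented to have something to demonstrate. The no-op and "log error" entries for LNET_EVENT_RECV_ACK follow the FSM tables in this document.

```c
#include <assert.h>
#include <stddef.h>

enum lnet_peer_state {
	LNET_PEER_STATE_INIT,
	LNET_PEER_STATE_ACTIVE,
	LNET_PEER_STATE_COUNT	/* hypothetical sentinel */
};

enum lnet_peer_event {
	LNET_EVENT_SEND,
	LNET_EVENT_RECV_ACK,
	LNET_EVENT_COUNT	/* hypothetical sentinel */
};

struct lnet_peer {
	enum lnet_peer_state state;
	int locked;		/* toy stand-in for the per-peer lock */
	int errors_logged;
};

typedef void (*peer_action_fn)(struct lnet_peer *peer);

static void action_noop(struct lnet_peer *peer)
{
	(void)peer;
}

static void action_log_error(struct lnet_peer *peer)
{
	peer->errors_logged++;
}

/* Illustrative transition only; the real action for SEND is not shown here. */
static void action_activate(struct lnet_peer *peer)
{
	peer->state = LNET_PEER_STATE_ACTIVE;
}

/* One action per <state, event> pair. */
static const peer_action_fn peer_actions[LNET_PEER_STATE_COUNT][LNET_EVENT_COUNT] = {
	[LNET_PEER_STATE_INIT] = {
		[LNET_EVENT_SEND] = action_activate,
		[LNET_EVENT_RECV_ACK] = action_noop,
	},
	[LNET_PEER_STATE_ACTIVE] = {
		[LNET_EVENT_SEND] = action_noop,
		[LNET_EVENT_RECV_ACK] = action_log_error,
	},
};

static peer_action_fn lookup_peer_action_function(const struct lnet_peer *peer,
						  enum lnet_peer_event event)
{
	return peer_actions[peer->state][event];
}

void lnet_peer_feed_event(struct lnet_peer *peer, enum lnet_peer_event event)
{
	peer_action_fn action_fn_locked;

	assert(!peer->locked);
	peer->locked = 1;	/* lnet_peer_lock() */
	action_fn_locked = lookup_peer_action_function(peer, event);
	if (action_fn_locked != NULL)
		action_fn_locked(peer);
	peer->locked = 0;	/* lnet_peer_unlock() */
}
```

Because the action runs with the peer locked, the state it reads during the lookup is the state it acts on.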
[Apologies for intruding here. This note outgrew the limited space available in the comment column.
The "traditional" implementation of an FSM in C is to have a function per event type, with a switch inside on the FSM state of the object, and actions again being functions (but often open-coded if simple and unique). Example:
    // One for each event type
    int lnet_peer_event_recv_push_ack(peer, args)
    {
        recv_push_ack_prologue;
        switch (peer->state) {
        case LNET_PEER_STATE_INIT:
            peer_state_init_event_recv_push_ack_action(peer, args);
            peer->state = newstate;
            break;
        case LNET_PEER_STATE_VERIFY:
            peer_state_verify_event_recv_push_ack_action(peer, args);
            peer->state = newstate;
            break;
        ...
        }
        recv_push_ack_epilogue;
        return ...;
    }

    int lnet_peer_event_recv_push_mesg(peer, args) ...
    int lnet_peer_event_recv_ping_mesg(peer, args) ...
    int lnet_peer_event_recv_ping_reply(peer, args) ...
It is certainly possible to replace the switch with a lookup in an array of function pointers (one such array per event type with entries for each state). But it is not unusual that the bulk of the event handler is the common code (the prologue/epilogue code above) with only a small amount of code specific to the state. In that case going through function pointers may actually obscure code flow and logic more than it clarifies.
It is also possible to implement a generic event handler:
    int lnet_peer_event(peer, event, args)
    {
        event_prologue;
        peer_lookup_action(peer, event)(peer, args);
        event_epilogue;
        return ...;
    }

    // calls now change like this:
    lnet_peer_recv_push_mesg_event(peer, args);
    // becomes
    lnet_peer_event(peer, LNET_PEER_EVENT_RECV_PUSH_MESG, args);
This I dislike for several reasons. It gives us a rather odd bounce from event-specific code through generic code to (again) event-specific code. The actual work still happens in the individual action functions, and these will typically be unique to each <state,event> pair. While this makes it clear that we're dealing with a state machine, in terms of being able to follow the logic of the code I see no gain, and actually expect a loss: following code flow through function pointers is always more difficult than tracing a simple call. Moreover, each event type carries its own unique arguments (the push message with its data versus the ping reply with its data versus the peer ni that was added or removed) and now these have to be force-fitted through a single uniform interface to fit the signature of lnet_peer_event() and/or the common signatures of the action functions. That's effectively a typeless interface, and those also harm the ability to comprehend the code, and hamper the compiler's ability to detect some forms of abuse, like passing the wrong type of parameters.]
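The typed, one-function-per-event style described in the note can be sketched as below. The structures, fields and transitions are hypothetical and reduced to two states. Each handler takes the arguments natural to its event, so passing a NID where a push message is expected is rejected by the compiler rather than discovered at run time.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

enum lnet_peer_state { LNET_PEER_STATE_INIT, LNET_PEER_STATE_ACTIVE };

struct lnet_peer {
	enum lnet_peer_state state;
	size_t push_bytes;	/* size of the last push payload seen */
	size_t nnis;		/* number of peer NIs added via DLC */
	uint64_t last_added_nid;
};

/* Hypothetical argument type for a received push message. */
struct lnet_push_mesg {
	const void *data;
	size_t len;
};

int lnet_peer_event_recv_push_mesg(struct lnet_peer *peer,
				   const struct lnet_push_mesg *push)
{
	/* Prologue: validation common to every state. */
	if (push == NULL || push->data == NULL || push->len == 0)
		return -EINVAL;

	switch (peer->state) {
	case LNET_PEER_STATE_INIT:
		peer->push_bytes = push->len;
		peer->state = LNET_PEER_STATE_ACTIVE;	/* illustrative transition */
		break;
	case LNET_PEER_STATE_ACTIVE:
		peer->push_bytes = push->len;
		break;
	}
	return 0;
}

/* A different event with a different, equally typed, argument. */
int lnet_peer_event_add_peer_ni(struct lnet_peer *peer, uint64_t nid)
{
	switch (peer->state) {
	case LNET_PEER_STATE_INIT:
	case LNET_PEER_STATE_ACTIVE:
		peer->nnis++;
		peer->last_added_nid = nid;
		break;
	}
	return 0;
}
```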
LNET_PEER_STATE_INIT
Event | Action |
---|---|
LNET_EVENT_RECV_ACK | No-op. |
LNET_EVENT_RECV_PING_PUSH | On the DD thread |
LNET_EVENT_RECV_PING_REPLY | Impossible event. Log error. |
LNET_EVENT_SEND | On the DD thread |
LNET_EVENT_DLC_ADD_PEER_NI | |
LNET_EVENT_DLC_DEL_PEER_NI | No-op. The lookup for the peer should return NULL |
LNET_EVENT_DLC_LOCAL_NI_CFG_UPDATE | No-op. This peer won't be on the peer list yet. |
LNET_PEER_STATE_WAIT_PING_RESPONSE
Event | Action | Notes |
---|---|---|
LNET_EVENT_RECV_ACK | No-op | |
LNET_EVENT_RECV_PING_PUSH | | |
LNET_EVENT_RECV_PING_REPLY | On the DD thread | It is not clear what it means to enlarge the MD data if it is not big enough to receive the data. This can be done when you receive the initial message, before you match the MD. At that point you already know the size of the data and can set the MD size appropriately. When processing PING REPLY or PING PUSH the following scenarios are possible: |
LNET_EVENT_SEND | | |
LNET_EVENT_DLC_ADD_PEER_NI | | |
LNET_EVENT_DLC_DEL_PEER_NI | | |
LNET_EVENT_DLC_LOCAL_NI_CFG_UPDATE | | |
LNET_PEER_STATE_ACTIVE
Event | Action | Notes |
---|---|---|
LNET_EVENT_RECV_ACK | No-op. Log error | |
LNET_EVENT_RECV_PING_PUSH | On the DD thread (TODO: can be done outside the DD thread) | |
LNET_EVENT_RECV_PING_REPLY | | |
LNET_EVENT_SEND | | |
LNET_EVENT_DLC_ADD_PEER_NI | | The peer_nis are added to the peer. Before entering the FSM we have already verified that none of these peers invalidates the configuration. I don't see a need to keep track of whether a peer was configured via DLC or not. Another option: whenever a peer NI is added to a peer from DLC and discovery is on, we would want to make sure that this NI is "real", so we could go to the DISCOVERY state and initiate a discovery round. However, I think this is too much of a complication. |
LNET_EVENT_DLC_DEL_PEER_NI | | |
LNET_EVENT_DLC_LOCAL_NI_CFG_UPDATE | On the DD thread | |
LNET_PEER_STATE_PUSH_SENT
Event | Action |
---|---|
LNET_EVENT_RECV_ACK | |
LNET_EVENT_RECV_PING_PUSH | On the DD thread |
LNET_EVENT_RECV_PING_REPLY | |
LNET_EVENT_SEND | |
LNET_EVENT_DLC_ADD_PEER_NI | |
LNET_EVENT_DLC_DEL_PEER_NI | |
LNET_EVENT_DLC_LOCAL_NI_CFG_UPDATE | On the DD thread |
LNET_PEER_STATE_DISCOVERY
Event | Action |
---|---|
LNET_EVENT_RECV_ACK | |
LNET_EVENT_RECV_PING_PUSH | |
LNET_EVENT_RECV_PING_REPLY | |
LNET_EVENT_SEND | |
LNET_EVENT_DLC_ADD_PEER_NI | |
LNET_EVENT_DLC_DEL_PEER_NI | |
LNET_EVENT_DLC_LOCAL_NI_CFG_UPDATE | |
LNET_PEER_STATE_VERIFY
Event | Action |
---|---|
LNET_EVENT_RECV_ACK | |
LNET_EVENT_RECV_PING_PUSH | |
LNET_EVENT_RECV_PING_REPLY | |
LNET_EVENT_SEND | On the DD thread |
LNET_EVENT_DLC_ADD_PEER_NI | |
LNET_EVENT_DLC_DEL_PEER_NI | |
LNET_EVENT_DLC_LOCAL_NI_CFG_UPDATE | No-op. We will eventually either enter discovery process or not, and we will handle this later. |
Investigation Required
- If we want to force a discovery, we can create a FORCE_DISCOVERY event on a specific NID:
  - Find the peer to which this NID's peer_ni belongs.
  - Detach all peer_nis from that peer.
  - Initiate discovery on each of the requested NIDs and transition the peer state appropriately.
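One possible shape for the forced-discovery steps is sketched below. This is speculative, since the design is still under investigation: the structures and the function name are hypothetical, the peer_ni table is a plain array, and moving the peer to LNET_PEER_STATE_DISCOVERY is an assumption about what "appropriately" would mean.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum lnet_peer_state { LNET_PEER_STATE_ACTIVE, LNET_PEER_STATE_DISCOVERY };

struct peer {
	enum lnet_peer_state state;
};

struct peer_ni {
	uint64_t nid;
	struct peer *owner;	/* NULL when the peer_ni is detached */
	bool discovery_pending;
};

/*
 * Force discovery on nid: detach every peer_ni from the owning peer and
 * mark the requested NID for discovery. Returns the number of peer_nis
 * detached, or -ENOENT if the NID is unknown.
 */
int lnet_peer_force_discovery(struct peer_ni *nis, size_t count, uint64_t nid)
{
	struct peer *peer = NULL;
	int detached = 0;
	size_t i;

	assert(nis != NULL || count == 0);

	/* Find the peer to which this NID's peer_ni belongs. */
	for (i = 0; i < count; i++) {
		if (nis[i].nid == nid) {
			peer = nis[i].owner;
			nis[i].discovery_pending = true;
			break;
		}
	}
	if (i == count)
		return -ENOENT;
	if (peer == NULL)
		return 0;	/* already detached, nothing to unlink */

	/* Detach all peer_nis from that peer. */
	for (i = 0; i < count; i++) {
		if (nis[i].owner == peer) {
			nis[i].owner = NULL;
			detached++;
		}
	}

	peer->state = LNET_PEER_STATE_DISCOVERY;	/* assumed transition */
	return detached;
}
```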
2 Comments
Olaf Weber
Some sketches of what code would/could look like if the scheme I proposed is followed. Not complete, the hardest part looks to be peer merging code.
Olaf Weber
Some additional sketches: