Purpose
This document attempts to describe the current socklnd design and some proposed improvements to clean up the design and increase performance.
Overview
(Gliffy diagram)
LNet uses the LND via a set of APIs, defined here:

```c
struct lnet_lnd {
	/* fields initialized by the LND */
	__u32	lnd_type;

	int  (*lnd_startup)(struct lnet_ni *ni);
	void (*lnd_shutdown)(struct lnet_ni *ni);
	int  (*lnd_ctl)(struct lnet_ni *ni, unsigned int cmd, void *arg);

	/* In the data movement APIs below, payload buffers are described as a
	 * set of 'niov' fragments which are in pages.
	 * The LND may NOT overwrite these fragment descriptors.
	 * An 'offset' may specify a byte offset within the set of fragments
	 * to start from.
	 */

	/* Start sending a preformatted message. 'private' is NULL for PUT and
	 * GET messages; otherwise this is a response to an incoming message
	 * and 'private' is the 'private' passed to lnet_parse(). Return
	 * non-zero for immediate failure, otherwise complete later with
	 * lnet_finalize() */
	int (*lnd_send)(struct lnet_ni *ni, void *private,
			struct lnet_msg *msg);

	/* Start receiving 'mlen' bytes of payload data, skipping the following
	 * 'rlen' - 'mlen' bytes. 'private' is the 'private' passed to
	 * lnet_parse(). Return non-zero for immediate failure, otherwise
	 * complete later with lnet_finalize(). This also gives back a receive
	 * credit if the LND does flow control. */
	int (*lnd_recv)(struct lnet_ni *ni, void *private, struct lnet_msg *msg,
			int delayed, unsigned int niov,
			struct bio_vec *kiov,
			unsigned int offset, unsigned int mlen,
			unsigned int rlen);

	/* lnet_parse() has had to delay processing of this message
	 * (e.g. waiting for a forwarding buffer or send credits). Give the
	 * LND a chance to free urgently needed resources. If called, return 0
	 * for success and do NOT give back a receive credit; that has to wait
	 * until lnd_recv() gets called. On failure return < 0 and
	 * release resources; lnd_recv() will not be called. */
	int (*lnd_eager_recv)(struct lnet_ni *ni, void *private,
			      struct lnet_msg *msg, void **new_privatep);

	/* notification of peer down */
	void (*lnd_notify_peer_down)(lnet_nid_t peer);

	/* accept a new connection */
	int (*lnd_accept)(struct lnet_ni *ni, struct socket *sock);
};
```
These APIs are called from the context of the ptlrpc threads. In general, the LND performs connects, transmits, and receives from a pool of threads it creates. The threads which do the transmits and receives are affinitized to particular CPU partitions (CPTs). LND requests generated by calling these APIs are queued and then processed by one of the LND threads.
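For illustration, here is a minimal sketch of how an LND such as socklnd might fill in and register this table. lnet_register_lnd() is LNet's registration entry point; the handler names follow the ksocknal_* convention used later in this document, and the field list is abridged.

```c
#include <lnet/lib-lnet.h>	/* lnet_register_lnd(), struct lnet_lnd */

/* Abridged sketch: wire the socklnd handlers into the LND ops table.
 * Only a subset of the fields shown in the struct above is filled in. */
static struct lnet_lnd the_ksocklnd = {
	.lnd_type	= SOCKLND,
	.lnd_startup	= ksocknal_startup,
	.lnd_shutdown	= ksocknal_shutdown,
	.lnd_send	= ksocknal_send,
	.lnd_recv	= ksocknal_recv,
	.lnd_accept	= ksocknal_accept,
};

static int __init ksocklnd_init(void)
{
	/* make this LND available to LNet networks of type "tcp" */
	lnet_register_lnd(&the_ksocklnd);
	return 0;
}
```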
Below is an overview of the socklnd design.
Socklnd Overview
(Gliffy diagram)
Network Interface Management
LNet calls ksocknal_startup() for every lnet_ni created, either dynamically via lnetctl or via module parameters at initial startup.
A ksock_net block is created and assigned to the lnet_ni.ni_data field. On every API call from LNet into the socklnd, this field is used to pull up the ksock_net.
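A minimal sketch of that pull-up, assuming the fields named above (the surrounding function is hypothetical):

```c
/* Hypothetical entry point: recover the per-network private data that
 * ksocknal_startup() stored in ni->ni_data. */
static int ksocknal_example_ctl(struct lnet_ni *ni, unsigned int cmd, void *arg)
{
	struct ksock_net *net = ni->ni_data;	/* set during startup */

	if (net == NULL)			/* network was never started */
		return -EINVAL;

	/* ... operate on the ksock_net fields ... */
	return 0;
}
```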
Of particular interest is ksock_net.ksock_interface, an array of length LNET_INTERFACES_NUM. It is an array because of the legacy TCP bonding feature, under which multiple interfaces could be assigned to one ksock_net. However, since the Multi-Rail feature now manages multiple network interfaces per network, there is no need to continue supporting TCP bonding.
Once a ksock_net block is created, it is added to the global network list, ksocknal_data.ksnd_nets.
This list is traversed when adding a new network. If the interface being added is already in use by one of the configured networks, then there is no need to create a new set of scheduler threads. However, if it is a new interface, the number of scheduler threads is increased, so long as it stays below the configured maximum, to help process the transmits and receives on the new interface.
Scheduler threads are created per CPT and they are intended to serve transmit and receive operations.
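A simplified sketch of that per-CPT arrangement. The cfs_cpt_* helpers are the libcfs CPT API; the thread body name and error handling are illustrative, not the exact socklnd code.

```c
#include <linux/kthread.h>
#include <libcfs/libcfs.h>	/* lnet_cpt_table(), cfs_cpt_number() */

/* Illustrative: spawn one scheduler thread per CPT. Transmit and
 * receive work for a connection is always handled by the scheduler
 * of the CPT the connection was bound to at creation time. */
static int ksocknal_example_start_schedulers(void)
{
	int cpt;

	for (cpt = 0; cpt < cfs_cpt_number(lnet_cpt_table()); cpt++) {
		struct task_struct *task;

		/* ksocknal_example_scheduler is a stand-in for the real
		 * scheduler thread body */
		task = kthread_run(ksocknal_example_scheduler,
				   (void *)(long)cpt, "socknal_sd%02d", cpt);
		if (IS_ERR(task))
			return PTR_ERR(task);
	}
	return 0;
}
```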
Peer and Connection Management
Peer and Route Management
When LNet requests that a message be sent by calling ksocknal_send(), ksocklnd will create a peer if one doesn't already exist. ksocklnd identifies a peer via its source and destination NIDs. This lends itself to how Multi-Rail works: at the LNet level, messages going over the same local net to the same peer can traverse multiple peer_nis at the ksocklnd level.
When a transmit is initially launched, a peer is created if none exists. The following steps take place:
- A ksock_peer_ni block is created and initialized
- A ksock_route block is created and initialized with the IP address and port
  - The IP address of the peer is derived from the NID of the peer as provided by LNet
- The route is associated with the peer
Notes on routes and their use:
- Only one route block per peer IP address can be created.
- The IP address used to create the route is derived from the peer's NID as provided by LNet.
- Since the IP address is derived from the peer's NID, and only one route can exist to a specific IP address, each peer NID yields exactly one route.
- The peer is uniquely identified by the local NI (bound to only one interface) and the destination IP address defined by the peer's NID.

These restrictions in the code mean that only one route can exist between a specific local interface and a peer.
With legacy TCP bonding there could exist multiple routes between each interface stored in the ksock_net.ksock_interface array and the peer_ni.
Therefore the use of ksock_route is purely to serve the legacy TCP bonding implementation, which has been superseded by the LNet Multi-Rail feature.
Connection Management
Once the routes are created, TCP sockets must be established with the remote peer. This is referred to in the code as "connecting routes". The process is triggered by ksocknal_launch_all_connections_locked().
The route is placed on the ksocknal_data.ksnd_connd_routes queue. One of the connd threads then picks that up and starts the actual connection procedure by calling ksocknal_connect().
The number of sockets to create is configurable via the typed_conns module parameter. If this is set to 1, then three sockets will be created:
- Control socket
- Bulk in socket
- Bulk out socket
A connection is created per socket. This connection is added to the ksock_peer_ni list of connections.
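A hedged sketch of how a transmit can be steered to one of the typed connections. The SOCKLND_CONN_* values come from the socklnd headers; the selection logic here is a simplification of what the real connection-matching code does.

```c
/* Simplified: choose a connection type for a transmit. Messages with
 * no payload ride the control socket; payload-bearing messages use
 * the bulk sockets, split by direction. */
static int ksocknal_example_conn_type(int payload_nob, bool sending)
{
	if (payload_nob == 0)
		return SOCKLND_CONN_CONTROL;

	return sending ? SOCKLND_CONN_BULK_OUT : SOCKLND_CONN_BULK_IN;
}
```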
A hello message is sent by the active side of the connection. This hello message contains the list of IP addresses stored in the ksock_net.ksock_interface array for which we don't yet have routes. When the passive side receives the hello message, it sends its own hello as a response. The active side receives that hello, which contains the list of the remote peer's IP addresses, and creates additional routes to those interfaces; connections over these routes are established on demand when sending messages.
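For reference, the hello message carries the interface list roughly as follows (abridged from the socklnd wire protocol; treat the exact field set as a sketch):

```c
/* Abridged sketch of the socklnd hello message: kshm_nips and
 * kshm_ips[] carry the sender's interface addresses, which the
 * receiver uses to build additional routes as described above. */
struct ksock_hello_msg {
	__u32		kshm_magic;		/* LNET_PROTO_MAGIC */
	__u32		kshm_version;		/* protocol version */
	lnet_nid_t	kshm_src_nid;		/* sender's NID */
	lnet_nid_t	kshm_dst_nid;		/* receiver's NID */
	lnet_pid_t	kshm_src_pid;		/* sender's PID */
	lnet_pid_t	kshm_dst_pid;		/* receiver's PID */
	__u64		kshm_src_incarnation;	/* sender's incarnation */
	__u64		kshm_dst_incarnation;	/* receiver's incarnation */
	__u32		kshm_ctype;		/* connection type */
	__u32		kshm_nips;		/* # of IP addresses below */
	__u32		kshm_ips[0];		/* IP addresses */
};
```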
(Gliffy diagram)
Connection creation management is unduly complex due to TCP bonding; in fact, the purpose of the hello message appears to be primarily to pass around the peer's IP addresses.
Since TCP bonding is now deprecated, this code can be removed, simplifying the overall design of the socklnd code.
Sending Messages
LNet calls ksocknal_send() to send messages. This function will trigger the following steps:
- Connect any extra routes if they aren't already connected
- If no connection to the peer exists, queue the transmit on the peer
- If a connection to the peer exists, queue the transmit on that connection
- A scheduler thread on the specified CPT will then pick up the transmit and send it
  - The CPT is identified when the connection is created by calling lnet_cpt_of_nid(), providing the peer's NID and the NI associated with the peer (see the sketch below)
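A hedged sketch of that CPT binding step. lnet_cpt_of_nid() is the LNet helper named above; the surrounding function and field uses are illustrative.

```c
/* Illustrative: when a connection is established, bind it to the
 * scheduler of the CPT that LNet associates with the peer's NID. */
static void ksocknal_example_bind_conn(struct ksock_conn *conn,
				       struct ksock_peer_ni *peer_ni,
				       struct lnet_ni *ni)
{
	int cpt = lnet_cpt_of_nid(peer_ni->ksnp_id.nid, ni);

	/* all tx/rx for this connection is now served by this CPT's
	 * scheduler threads */
	conn->ksnc_scheduler = ksocknal_choose_scheduler_locked(cpt);
}
```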
CPT Confusion?
In lnet_get_best_ni(), one of the criteria used to determine the best NI to send from is NUMA distance. In that case we use the MD's CPT, since we need to find the interface nearest, NUMA-wise, to the memory described by that MD. However, the LND scheduler which picks up and processes the transmit could be associated with a CPT other than the MD's CPT. Since TCP is not a zero-copy protocol, would that introduce a performance penalty?
Receiving Messages
When a connection is created, a set of callbacks is registered with the socket in the call to ksocknal_lib_set_callback().
When there is data ready to be received, the kernel invokes ksocknal_data_ready(), which in turn calls ksocknal_read_callback().
This queues the connection on the scheduler associated with it so the data can be received. The message header is read in first and lnet_parse() is called; lnet_parse() can call back into the LND to receive the payload data.
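A simplified sketch of that callback path. sk_data_ready is the kernel socket hook (modern kernels pass only the struct sock); the queueing details are abridged.

```c
/* Simplified: the registered sk_data_ready hook does no socket I/O
 * itself; it only hands the connection to its scheduler thread,
 * which performs the actual receive. */
static void ksocknal_example_data_ready(struct sock *sk)
{
	struct ksock_conn *conn = sk->sk_user_data;

	if (conn != NULL)			/* connection still attached */
		ksocknal_read_callback(conn);	/* queue on its scheduler */
}
```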
socklnd Improvements
Remove TCP Bonding
- Remove the storage of multiple IP addresses in the ksock_net
- Remove all associated management code for the multiple IP addresses
- Remove all the route constructs and the code which uses them
- Associate connections directly with the peer
- Keep the hello message for backwards compatibility; however, it will always include only one IP address
Multiple Connections Per Peer
LU-12815 indicates that creating multiple virtual interfaces on top of the same physical interface, and then grouping them into one LNet network in a Multi-Rail configuration, increases performance. Creating multiple virtual interfaces has the following effects:
- Increases the number of bulk IN/OUT sockets; there will be three sockets per connection created over each virtual interface to the same peer
- Increases the number of scheduler threads
- Creates more concurrency in sending/receiving data
It would be better to make these benefits available without adding configuration complexity on the user side. To do that, we can increase the number of connections per peer by controlling it via the conns_per_peer module parameter, as sketched after the list below.
- Rename ksock_conn to ksock_socket_conn
- Reuse ksock_conn to encapsulate up to 3 ksock_socket_conn data structures. Each ksock_socket_conn would be the same as the present-day ksock_conn; it describes one socket connection to the peer
- ksock_peer should include a linked list of ksock_conn. The number of ksock_conn created will be controlled by the conns_per_peer module parameter
- Make conns_per_peer dynamically configurable
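A minimal sketch of the proposed layering and module parameter, assuming the names from the list above (ksock_socket_conn and its fields are hypothetical). Using 0644 permissions makes the parameter writable at runtime, which supports the dynamic-configuration item.

```c
#include <linux/module.h>

/* Hypothetical: conns_per_peer controls how many ksock_conn are
 * created per peer; 0644 allows it to be changed at runtime via
 * /sys/module/ksocklnd/parameters/conns_per_peer. */
static int conns_per_peer = 1;
module_param(conns_per_peer, int, 0644);
MODULE_PARM_DESC(conns_per_peer, "number of connections per peer");

/* Hypothetical sketch of the proposed split: each ksock_socket_conn
 * is what today's ksock_conn is (one TCP socket to the peer), while
 * the new ksock_conn groups up to three typed sockets. */
struct ksock_socket_conn {
	struct socket	*kssc_sock;	/* underlying TCP socket */
	int		 kssc_type;	/* control / bulk-in / bulk-out */
};

struct ksock_conn {
	struct list_head	 ksnc_list;	/* chained on the ksock_peer */
	struct ksock_socket_conn ksnc_socks[3];	/* up to 3 typed sockets */
};
```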