Purpose

This document describes the current socklnd design and proposes improvements to clean up that design and increase performance.

Overview

LNet uses the LND via a set of APIs, defined as follows:

struct lnet_lnd {
        /* fields initialized by the LND */
        __u32                   lnd_type;

        int  (*lnd_startup)(struct lnet_ni *ni);
        void (*lnd_shutdown)(struct lnet_ni *ni);
        int  (*lnd_ctl)(struct lnet_ni *ni, unsigned int cmd, void *arg);

        /* In the data movement APIs below, payload buffers are described as
         * a set of 'niov' page-based fragments.
         * The LND may NOT overwrite these fragment descriptors.
         * An 'offset' may specify a byte offset within the set of fragments
         * to start from.
         */

        /* Start sending a preformatted message.  'private' is NULL for PUT
         * and GET messages; otherwise this is a response to an incoming
         * message and 'private' is the 'private' passed to lnet_parse().
         * Return non-zero for immediate failure, otherwise complete later
         * with lnet_finalize() */
        int (*lnd_send)(struct lnet_ni *ni, void *private,
                        struct lnet_msg *msg);

        /* Start receiving 'mlen' bytes of payload data, skipping the
         * following 'rlen' - 'mlen' bytes.  'private' is the 'private'
         * passed to lnet_parse().  Return non-zero for immediate failure,
         * otherwise complete later with lnet_finalize().  This also gives
         * back a receive credit if the LND does flow control. */
        int (*lnd_recv)(struct lnet_ni *ni, void *private,
                        struct lnet_msg *msg, int delayed, unsigned int niov,
                        struct bio_vec *kiov, unsigned int offset,
                        unsigned int mlen, unsigned int rlen);

        /* lnet_parse() has had to delay processing of this message
         * (e.g. waiting for a forwarding buffer or send credits).  Give the
         * LND a chance to free urgently needed resources.  If called,
         * return 0 for success and do NOT give back a receive credit; that
         * has to wait until lnd_recv() gets called.  On failure return < 0
         * and release resources; lnd_recv() will not be called. */
        int (*lnd_eager_recv)(struct lnet_ni *ni, void *private,
                              struct lnet_msg *msg, void **new_privatep);

        /* notification of peer down */
        void (*lnd_notify_peer_down)(lnet_nid_t peer);

        /* accept a new connection */
        int (*lnd_accept)(struct lnet_ni *ni, struct socket *sock);
};
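
The socklnd fills in this operations table and registers it when the module loads. A rough sketch of the wiring (abbreviated; the actual init path also sets up global state, and the exact field list has varied across versions):

static const struct lnet_lnd the_ksocklnd = {
        .lnd_type              = SOCKLND,
        .lnd_startup           = ksocknal_startup,
        .lnd_shutdown          = ksocknal_shutdown,
        .lnd_ctl               = ksocknal_ctl,
        .lnd_send              = ksocknal_send,
        .lnd_recv              = ksocknal_recv,
        .lnd_notify_peer_down  = ksocknal_notify,
        .lnd_accept            = ksocknal_accept,
};

static int __init ksocklnd_init(void)
{
        /* global data setup elided */
        lnet_register_lnd(&the_ksocklnd);
        return 0;
}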

These APIs are called from within the context of the ptlrpc threads. In general, the LND performs connects, transmits, and receives from the context of a pool of threads it creates. The threads which do the transmits and receives are affinitized to particular CPU partitions (CPTs). The LND requests generated by calling the LND APIs are queued and then processed by one of the LND threads.
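
To make the lnd_send() contract concrete, here is a minimal sketch of the queue-and-complete-later pattern described above. All my_* names are hypothetical, and lnet_finalize() is assumed to take the (msg, status) form:

/* Sketch only: illustrates the deferred-completion contract of
 * lnd_send().  my_lnd_data, my_tx and my_sched_queue_tx are
 * hypothetical. */
static int my_lnd_send(struct lnet_ni *ni, void *private,
                       struct lnet_msg *msg)
{
        struct my_lnd_data *net = ni->ni_data; /* set in lnd_startup */
        struct my_tx *tx = my_tx_alloc(net, msg);

        if (tx == NULL)
                return -ENOMEM; /* immediate failure: LNet finalizes msg */

        /* Queue for an affinitized scheduler thread and return success.
         * That thread calls lnet_finalize(msg, rc) once the socket
         * write completes. */
        my_sched_queue_tx(net, tx);
        return 0;
}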

Below is an overview of the socklnd design.

Socklnd Overview

Network Interface Management

LNet calls ksocknal_startup() on every lnet_ni created, whether dynamically via lnetctl or from module parameters at initial startup.

A ksock_net block is created and assigned to the lnet_ni.ni_data field. On every API call from LNet into the socklnd, this field is used to retrieve the ksock_net.
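
For example, every entry point can recover its per-network state with a single cast (illustrative sketch; ksocknal_some_api is a made-up name):

static int ksocknal_some_api(struct lnet_ni *ni)
{
        /* ni_data was pointed at the ksock_net during startup */
        struct ksock_net *net = ni->ni_data;

        LASSERT(net != NULL);
        /* ... operate on net ... */
        return 0;
}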

Of particular interest is ksock_net.ksock_interface, an array of length LNET_INTERFACES_NUM. It is an array because of the legacy TCP bonding feature, under which multiple interfaces could be assigned to one ksock_net. However, since the Multi-Rail feature manages multiple network interfaces per network, there is no need to continue supporting TCP bonding.

Once a ksock_net block is created, it is added to the global network list, ksocknal_data.ksnd_nets.

This list is traversed when adding a new network. If the interface being added is already used by one of the configured networks, there is no need to create another set of scheduler threads. If it is a new interface, however, the number of scheduler threads is increased, as long as it stays below the configured maximum number of scheduler threads, to help process the transmits and receives on the new interface.

Scheduler threads are created per CPT and serve transmit and receive operations.
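
A rough sketch of the per-CPT startup pattern (my_scheduler_main is hypothetical; cfs_cpt_for_each() and lnet_cpt_table() are the libcfs/LNet CPT helpers):

#include <linux/kthread.h>

/* Illustrative only: start one scheduler (or one group of them)
 * per CPT so rx/tx processing stays NUMA-local. */
static int my_start_schedulers(void)
{
        struct task_struct *task;
        int cpt;

        cfs_cpt_for_each(cpt, lnet_cpt_table()) {
                task = kthread_run(my_scheduler_main,
                                   (void *)(long)cpt,
                                   "socknal_sd%02d", cpt);
                if (IS_ERR(task))
                        return PTR_ERR(task);
        }
        return 0;
}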

Peer and Connection Management

Peer and Route Management

When LNet requests a message to be sent by calling ksocknal_send(), ksocklnd will create a peer if one does not exist. ksocklnd identifies a peer by its source and destination NIDs, which matches how Multi-Rail works: at the LNet level, messages going over the local net to the same peer can traverse multiple peer_nis at the ksocklnd level.
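
Conceptually the lookup is keyed on the NID pair, along these lines (hypothetical sketch with my_* names; the real code uses a hash table protected by the global lock):

/* A peer_ni is unique per (local NID, peer NID) pair. */
static struct ksock_peer_ni *
my_find_peer(struct ksock_net *net, lnet_nid_t src_nid, lnet_nid_t dst_nid)
{
        struct ksock_peer_ni *peer_ni;

        list_for_each_entry(peer_ni, &net->my_peers, my_list) {
                if (peer_ni->my_src_nid == src_nid &&
                    peer_ni->my_dst_nid == dst_nid)
                        return peer_ni; /* caller takes a reference */
        }
        return NULL; /* caller allocates a new peer_ni */
}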

When a transmit is initially launched, a peer is created if none exists. The following steps take place:

Notes on routes and their use:

The restrictions the code places on routes mean that only one route can exist between a specific local interface and a peer.

With legacy TCP bonding, multiple routes could exist between the peer_ni and each of the interfaces stored in the ksock_net.ksock_interface array.

Therefore the use of ksock_route is purely to serve the legacy TCP bonding implementation, which has been superseded by the LNet Multi-Rail feature.

Connection Management

Once the routes are created, TCP sockets must be established to the remote peer. This is referred to in the code as "connecting routes". The process is triggered by ksocknal_launch_all_connections_locked().

The route is placed on the ksocknal_data.ksnd_connd_routes queue. One of the connd threads then picks that up and starts the actual connection procedure by calling ksocknal_connect().

The number of sockets to create is configurable via the typed_conns module parameter. If it is set to 1, three sockets are created:

- a control connection, used for small control messages such as acks
- a bulk-in connection, used to receive bulk data
- a bulk-out connection, used to send bulk data

A connection is created per socket and added to the ksock_peer_ni's list of connections.
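
Each transmit is then matched to a connection of the appropriate type, roughly as sketched below (illustrative; the real logic lives in the protocol's tx-matching callback and also weighs queue depth):

/* Sketch: choose a connection type for a tx when typed_conns is
 * enabled.  Zero-payload messages ride the control connection;
 * payload-bearing messages use the bulk connections. */
static int my_conn_type_for_tx(int payload_nob)
{
        if (payload_nob == 0)
                return SOCKLND_CONN_CONTROL;

        /* The sender writes on its BULK_OUT connection; the same
         * bytes arrive on the peer's BULK_IN connection. */
        return SOCKLND_CONN_BULK_OUT;
}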

The following TCP port range is defined for outgoing (active) connections: LNET_ACCEPTOR_MIN_RESERVED_PORT (512) to LNET_ACCEPTOR_MAX_RESERVED_PORT (1023). Connection requests are sent to the single predefined acceptor port on the other (passive) side.
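
The active side binds its end of the socket to one of these reserved ports, retrying downward until a free one is found. A sketch of the pattern (my_sock_connect is a hypothetical helper that binds locally to 'port' and connects to the peer's acceptor port):

/* Bind the outgoing socket to a privileged source port so the
 * passive side's acceptor can trust the connection request. */
static int my_connect_reserved(struct socket **sockp, __u32 peer_ip)
{
        int rc = -EADDRINUSE;
        int port;

        for (port = LNET_ACCEPTOR_MAX_RESERVED_PORT;
             port >= LNET_ACCEPTOR_MIN_RESERVED_PORT; port--) {
                rc = my_sock_connect(sockp, port, peer_ip);
                if (rc == 0)
                        return 0; /* connected */
                if (rc != -EADDRINUSE && rc != -EADDRNOTAVAIL)
                        return rc; /* hard failure */
                /* port already taken locally; try the next one down */
        }
        return rc; /* the whole range is exhausted */
}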

A hello message is sent by the active side of the connection. This hello message contains the list of IP addresses in the local ksock_net.ksock_interface array to which routes do not yet exist. When the passive side receives the hello message, it sends its own hello as a response. The active side receives that hello, which contains the list of the remote peer's IP addresses, and creates additional routes to those interfaces; connections on these routes are established on demand when sending messages.
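
The interface list is carried explicitly in the hello message. Its shape is approximately as follows (a sketch with abbreviated, approximate field names; see struct ksock_hello_msg in the source for the authoritative layout):

/* Approximate shape of the hello message exchanged at connection
 * setup; the trailing array carries the sender's interface IPs. */
struct my_hello_msg {
        __u32      hm_magic;           /* protocol magic */
        __u32      hm_version;         /* protocol version */
        lnet_nid_t hm_src_nid;         /* sender's NID */
        lnet_nid_t hm_dst_nid;         /* expected receiver's NID */
        __u64      hm_src_incarnation; /* detects peer reboots */
        __u32      hm_ctype;           /* requested connection type */
        __u32      hm_nips;            /* entries in hm_ips[] */
        __u32      hm_ips[0];          /* sender's interface IPs */
};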


Connection creation and management is unduly complex due to TCP bonding. In fact, the primary purpose of the hello message appears to be passing around the peer's IP addresses.

Since TCP bonding is now deprecated, this code can be removed, simplifying the overall design of the socklnd code.

Sending Messages

LNet calls ksocknal_send() to send messages. This function triggers the following steps:

- A tx descriptor (ksock_tx) is allocated and initialized to describe the message header and payload fragments.
- ksocknal_launch_packet() looks up the peer_ni, creating it if necessary, and launches connections on any routes that are not yet connected.
- The tx is queued on a matching connection, and the scheduler thread associated with that connection is woken to perform the actual socket write.

CPT Confusion?

In lnet_get_best_ni(), one of the criteria used to determine the best NI to send from is NUMA distance. In that case the MD CPT is used, since we need to determine the interface nearest, NUMA-wise, to the memory described by that MD. However, the LND scheduler which picks up the transmit and processes it could be associated with a CPT other than the MD CPT. Since TCP is not a zero-copy protocol, would that introduce a performance penalty?

Receiving Messages

When a connection is created, a set of callbacks is registered with the socket in the call to ksocknal_lib_set_callback().

When there is data ready to be received, ksocknal_data_ready() is invoked and calls ksocknal_read_callback(). This queues the connection on its associated scheduler, where the data is received: the message header is read in first and lnet_parse() is called, and lnet_parse() can call back into the LND to receive the payload data.
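
The callback itself replaces the socket's sk_data_ready hook. A minimal sketch of the pattern (modern kernel callback signature assumed; my_sched_queue_rx is hypothetical):

/* Socket upcall installed by ksocknal_lib_set_callback().  It runs
 * in softirq context, so it only queues the connection and wakes a
 * scheduler; the actual recvmsg happens in the scheduler thread. */
static void my_data_ready(struct sock *sk)
{
        struct ksock_conn *conn = sk->sk_user_data;

        if (conn != NULL)
                my_sched_queue_rx(conn);
}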

socklnd Improvements

Remove TCP Bonding

Redesign Overview: Re-purposing the Route Construct

While removing multiple interfaces per ksock_net is straightforward, getting rid of the route construct is much more invasive to the current socklnd design.

In the current design, there are multiple route structures associated with each peer_ni. The association is not unique because routes may be shared, so separate reference counts are maintained. As mentioned above, deprecating TCP bonding means that only a single route is needed per peer_ni. It is proposed to re-purpose the existing route construct as follows:

This proposal simplifies both the route-structure removal and the addition of a structure encapsulating ksock_socket_conn by combining the two tasks.


Multiple Connections Per Peer

LU-12815 indicates that creating multiple virtual interfaces on top of the same physical interface, and grouping them into one LNet network in a Multi-Rail configuration, increases performance. Creating multiple virtual interfaces has the following effects:

It would be better to make these benefits available without adding configuration complexity on the user side. To do that, the number of connections per peer can be increased under the control of a conns_per_peer module parameter.
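
A sketch of how that could look (illustrative; the parameter name follows the proposal above, and the round-robin fields are hypothetical):

#include <linux/module.h>

/* Module parameter controlling how many sockets to open per peer,
 * plus a naive round-robin spread of transmits across them. */
static int conns_per_peer = 1;
module_param(conns_per_peer, int, 0444);
MODULE_PARM_DESC(conns_per_peer, "number of connections per peer");

static struct ksock_conn *
my_pick_conn(struct ksock_peer_ni *peer_ni)
{
        unsigned int i = peer_ni->my_rr++ % conns_per_peer;

        return peer_ni->my_conns[i]; /* hypothetical fields */
}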