Purpose

This document attempts to describe the current socklnd design and some proposed improvements to clean up the design and increase performance.

Overview

[Diagram: LNetSystemDiagram]

LNet uses the LND via the set of APIs defined here:

Code Block
struct lnet_lnd {
»·······/* fields initialized by the LND */
»·······__u32»··»·······»·······lnd_type;

»·······int  (*lnd_startup)(struct lnet_ni *ni);
»·······void (*lnd_shutdown)(struct lnet_ni *ni);
»·······int  (*lnd_ctl)(struct lnet_ni *ni, unsigned int cmd, void *arg);

»·······/* In data movement APIs below, payload buffers are described as a set
»······· * of 'niov' fragments which are in pages.
»······· * The LND may NOT overwrite these fragment descriptors.
»······· * An 'offset' may specify a byte offset within the set of
»······· * fragments to start from
»······· */

»·······/* Start sending a preformatted message.  'private' is NULL for PUT and
»······· * GET messages; otherwise this is a response to an incoming message
»······· * and 'private' is the 'private' passed to lnet_parse().  Return
»······· * non-zero for immediate failure, otherwise complete later with
»······· * lnet_finalize() */
»·······int (*lnd_send)(struct lnet_ni *ni, void *private,
»·······»·······»·······struct lnet_msg *msg);

»·······/* Start receiving 'mlen' bytes of payload data, skipping the following
»······· * 'rlen' - 'mlen' bytes. 'private' is the 'private' passed to
»······· * lnet_parse().  Return non-zero for immediate failure, otherwise
»······· * complete later with lnet_finalize().  This also gives back a receive
»······· * credit if the LND does flow control. */
»·······int (*lnd_recv)(struct lnet_ni *ni, void *private, struct lnet_msg *msg,
»·······»·······»·······int delayed, unsigned int niov,
»·······»·······»·······struct bio_vec *kiov,
»·······»·······»·······unsigned int offset, unsigned int mlen, unsigned int rlen);

»·······/* lnet_parse() has had to delay processing of this message
»······· * (e.g. waiting for a forwarding buffer or send credits).  Give the
»······· * LND a chance to free urgently needed resources.  If called, return 0
»······· * for success and do NOT give back a receive credit; that has to wait
»······· * until lnd_recv() gets called.  On failure return < 0 and
»······· * release resources; lnd_recv() will not be called. */
»·······int (*lnd_eager_recv)(struct lnet_ni *ni, void *private,
»·······»·······»·······      struct lnet_msg *msg, void **new_privatep);

»·······/* notification of peer down */
»·······void (*lnd_notify_peer_down)(lnet_nid_t peer);

»·······/* accept a new connection */
»·······int (*lnd_accept)(struct lnet_ni *ni, struct socket *sock);
};

These APIs are called from within the context of the ptlrpc threads. In general, the LND performs connects, transmits and receives from the context of a pool of threads that it creates. The threads which do the transmits and receives are affinitized to particular CPU Partitions (CPTs). The requests generated by calling the LND APIs are queued and then processed by one of the LND threads.
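The split between the caller's context and the LND threads can be sketched as follows. This is a minimal illustration only, not the socklnd implementation: `toy_msg`, `toy_ni`, `toy_lnd`, `toy_send()` and `toy_scheduler_run()` are hypothetical stand-ins for the real LNet types and functions, and the queue is unlocked for brevity. The send callback only queues the request and returns 0, and a later pass in "LND thread" context completes it, which is where the real code would call lnet_finalize().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the LNet types. */
struct toy_msg {
	int		 id;
	struct toy_msg	*next;
};

struct toy_ni {
	struct toy_msg	*txq_head;	/* pending transmits, FIFO */
	struct toy_msg	*txq_tail;
	int		 finalized;	/* number of completed messages */
};

/* Reduced ops table modelled on struct lnet_lnd (one callback only). */
struct toy_lnd {
	int (*lnd_send)(struct toy_ni *ni, void *private, struct toy_msg *msg);
};

/* Caller context: queue the request and return 0 ("complete later"). */
int toy_send(struct toy_ni *ni, void *private, struct toy_msg *msg)
{
	(void)private;
	assert(ni != NULL && msg != NULL);

	msg->next = NULL;
	if (ni->txq_tail != NULL)
		ni->txq_tail->next = msg;
	else
		ni->txq_head = msg;
	ni->txq_tail = msg;
	return 0;
}

/* LND thread context: drain the queue. Each completion stands in for
 * the point at which the real LND would call lnet_finalize(). */
int toy_scheduler_run(struct toy_ni *ni)
{
	int done = 0;

	while (ni->txq_head != NULL) {
		struct toy_msg *msg = ni->txq_head;

		ni->txq_head = msg->next;
		if (ni->txq_head == NULL)
			ni->txq_tail = NULL;
		msg->next = NULL;
		ni->finalized++;
		done++;
	}
	return done;
}

/* The LND fills in the table; the caller only ever sees the pointers. */
const struct toy_lnd toy_the_lnd = {
	.lnd_send = toy_send,
};
```

A real LND would protect the queue with a lock and keep one such queue per scheduler rather than per NI; those details are omitted here.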

Below is an overview of the socklnd design.

Socklnd Overview

[Diagram: socklnd_overview]

Network Interface Management

LNet calls ksocknal_startup() on every lnet_ni created, whether dynamically via lnetctl or via module parameters on initial startup.

...

Scheduler threads are created per CPT and they are intended to serve transmit and receive operations.

Peer and Connection Management

Peer and Route Management

When LNet requests a message to be sent by calling ksocknal_send(), ksocklnd will create a peer if one doesn't exist. ksocklnd identifies a peer via its source and destination NIDs. This lends itself to how Multi-Rail works: at the LNet level, messages which are going over the local Net to the same peer can traverse multiple peer_nis at the ksocklnd level.

...

Therefore the use of ksock_route is purely to serve the legacy TCP bonding implementation, which has been superseded by the LNet Multi-Rail feature.

Connection Management

Once the routes are created, TCP sockets must be opened to the remote peer. This is referred to in the code as "connecting routes". The process is triggered by ksocknal_launch_all_connections_locked().

...

The number of sockets to create is configurable via the typed_conns module parameter. If this is set to 1 then three sockets will be created:

...

A connection is created per socket. This connection is added to the ksock_peer_ni list of connections.

The following TCP port range is defined for the outgoing connections (active): LNET_ACCEPTOR_MIN_RESERVED_PORT (512) to LNET_ACCEPTOR_MAX_RESERVED_PORT (1023). The requests are sent to the single predefined acceptor port on the other (passive) side. 
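As a rough sketch of how the active side might choose a source port in this range, the function below walks the range and returns the first port that can be bound. The two constants carry the values quoted above; everything else is an assumption made for illustration: `pick_reserved_port()`, the `try_bind_fn` callback and `bind_if_at_most()` are hypothetical names, and scanning from the top of the range downwards is a choice made here, not something this document specifies.

```c
#include <assert.h>
#include <stddef.h>

/* Values quoted in the text above. */
#define LNET_ACCEPTOR_MIN_RESERVED_PORT	512
#define LNET_ACCEPTOR_MAX_RESERVED_PORT	1023

/* Hypothetical callback: returns 0 if 'port' could be bound locally. */
typedef int (*try_bind_fn)(int port, void *arg);

/* Try each reserved port in turn (top of the range first, by
 * assumption). Returns the port that was bound, or -1 if none was. */
int pick_reserved_port(try_bind_fn try_bind, void *arg)
{
	int port;

	assert(try_bind != NULL);
	for (port = LNET_ACCEPTOR_MAX_RESERVED_PORT;
	     port >= LNET_ACCEPTOR_MIN_RESERVED_PORT; port--) {
		if (try_bind(port, arg) == 0)
			return port;
	}
	return -1;
}

/* Illustration-only callback: pretend every port above *arg is busy. */
int bind_if_at_most(int port, void *arg)
{
	int limit = *(int *)arg;

	return port <= limit ? 0 : -1;
}
```

In a real implementation the callback would wrap a bind() on the outgoing socket, which requires privilege for ports below 1024.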

A hello message is sent by the active side of the connection. This hello message contains the list of IP addresses stored in the ksnd_data.ksock_interfaces to which we don't have routes yet. When the passive side receives the hello message it sends its own hello as a response. The active side will receive that hello, which contains the list of the remote peer's IP addresses. Additional routes to these interfaces are then created on demand when messages are sent.

[Diagram: ConnectionEstablishmentSeq]


Connection creation management is unduly complex due to TCP bonding. In fact the purpose of the hello message appears to be primarily for passing around the IP addresses of the peer.

Since TCP bonding is now deprecated, this code can be removed, simplifying the overall design of the socklnd code.

Sending Messages

LNet calls ksocknal_send() to send messages. This function will trigger the following steps:

  • Connect any extra routes if they aren't connected
  • If there are no connections, then queue the transmit on the peer
  • If a connection to the peer exists, then queue the transmit on that connection
  • A scheduler thread on the specified CPT will then pick up the transmit and send it.
    • The CPT is identified when the connection is created by calling: lnet_cpt_of_nid() providing the peer's NID and the NI associated with the peer.
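The queueing decision in the steps above can be sketched as below. This is a simplified model under stated assumptions, not the socklnd code: all `toy_*` names are hypothetical, locking and CPT selection are left out, and moving the peer's pending transmits onto the connection once it is established is an assumption about what happens next rather than something described in this section.

```c
#include <assert.h>
#include <stddef.h>

struct toy_tx {
	struct toy_tx *next;
};

struct toy_queue {
	struct toy_tx *head;
	struct toy_tx *tail;
	int len;
};

struct toy_conn {
	struct toy_queue txq;		/* transmits for the scheduler */
};

struct toy_peer {
	struct toy_conn *conn;		/* NULL until a connection exists */
	struct toy_queue pending;	/* transmits waiting for a connection */
};

void toy_enqueue(struct toy_queue *q, struct toy_tx *tx)
{
	tx->next = NULL;
	if (q->tail != NULL)
		q->tail->next = tx;
	else
		q->head = tx;
	q->tail = tx;
	q->len++;
}

struct toy_tx *toy_dequeue(struct toy_queue *q)
{
	struct toy_tx *tx = q->head;

	if (tx == NULL)
		return NULL;
	q->head = tx->next;
	if (q->head == NULL)
		q->tail = NULL;
	tx->next = NULL;
	q->len--;
	return tx;
}

/* Returns 1 if the transmit went onto a connection, 0 if it was parked
 * on the peer because no connection exists yet. */
int toy_launch_tx(struct toy_peer *peer, struct toy_tx *tx)
{
	assert(peer != NULL && tx != NULL);
	if (peer->conn != NULL) {
		toy_enqueue(&peer->conn->txq, tx);
		return 1;
	}
	toy_enqueue(&peer->pending, tx);
	return 0;
}

/* Assumed follow-up step: once connected, move parked transmits onto
 * the connection in their original order. Returns the number moved. */
int toy_conn_established(struct toy_peer *peer, struct toy_conn *conn)
{
	struct toy_tx *tx;
	int moved = 0;

	peer->conn = conn;
	while ((tx = toy_dequeue(&peer->pending)) != NULL) {
		toy_enqueue(&conn->txq, tx);
		moved++;
	}
	return moved;
}
```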

CPT Confusion?

In lnet_get_best_ni(), one of the criteria used to determine the best NI to send from is NUMA distance. In that case the MD CPT is used, since we need to find the interface that is nearest, NUMA-wise, to the memory described by that MD. However, the LND scheduler which picks up the transmit and processes it could be associated with a CPT other than the MD CPT. Since TCP is not a zero-copy protocol, would that introduce a performance penalty?

Receiving Messages

When a connection is created, a set of callbacks is registered with the socket in the call to ksocknal_lib_set_callback().

...

This queues the connection, from which the data is to be received, on the scheduler associated with that thread. The message header is read in first and lnet_parse() is called. lnet_parse() can call into the LND again to receive the payload data.
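The header-then-payload order described here can be modelled as a two-state receive loop. The sketch below is an illustration under assumptions, not the socklnd receive path: `TOY_HDR_LEN`, `toy_rx` and its functions are hypothetical, the header size is fixed arbitrarily, and the payload length is supplied up front where the real code would learn it from the parsed header. The `parsed` counter marks the point at which lnet_parse() would be called.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_HDR_LEN	8	/* hypothetical fixed header size */

enum toy_rx_state {
	TOY_RX_HEADER,
	TOY_RX_PAYLOAD,
};

struct toy_rx {
	enum toy_rx_state state;
	size_t	left;		/* bytes still needed in this state */
	size_t	payload_len;	/* stand-in for what parsing reports */
	int	parsed;		/* headers handed to the parse step */
	int	completed;	/* fully received messages */
};

void toy_rx_init(struct toy_rx *rx, size_t payload_len)
{
	rx->state = TOY_RX_HEADER;
	rx->left = TOY_HDR_LEN;
	rx->payload_len = payload_len;
	rx->parsed = 0;
	rx->completed = 0;
}

/* Account for 'nbytes' that became readable on the socket. */
void toy_rx_feed(struct toy_rx *rx, size_t nbytes)
{
	assert(rx != NULL);
	while (nbytes > 0) {
		size_t take = nbytes < rx->left ? nbytes : rx->left;

		rx->left -= take;
		nbytes -= take;
		if (rx->left > 0)
			break;	/* need more data for this state */

		if (rx->state == TOY_RX_HEADER) {
			/* Header complete: the parse step runs here. */
			rx->parsed++;
			if (rx->payload_len > 0) {
				rx->state = TOY_RX_PAYLOAD;
				rx->left = rx->payload_len;
				continue;
			}
		}
		/* Payload complete, or a header-only message. */
		rx->completed++;
		rx->state = TOY_RX_HEADER;
		rx->left = TOY_HDR_LEN;
	}
}
```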

socklnd Improvements

Remove TCP Bonding

  • Remove the storage of multiple IP addresses in the ksock_net
  • Remove all associated management code of the multiple IP addresses
  • Remove all the route constructs and the code which uses the route constructs
  • Connections should be associated directly with the peer
  • Hello messages can be kept for backwards compatibility; however, they will always include only one IP address

Redesign Overview: re-purposing route construct

While removing multiple interfaces per ksock_net is straightforward, getting rid of route constructs is much more invasive to the current socklnd design.

In the current design, there are multiple route structures associated with each peer_ni. The association is not unique because routes may be shared, so separate reference counts are maintained. As mentioned above, deprecating TCP bonding means that only a single route is needed per peer_ni. It is proposed to re-purpose the existing route construct as follows:

  • Rename the route construct to "ksock_peer_conn_cb" or a similar name.
  • Peer_ni shall reference a single ksock_peer_conn_cb
  • ksock_peer_conn_cb structure shall be the same as the existing route, except there shall be no need for an individual refcount
  • It shall be ensured that ksock_peer_conn_cb is created and deleted at the same time as the owning peer_ni construct
  • It shall be ensured that ksock_peer_conn_cb won't be deleted before all of the connections it owns are deleted

This proposal simplifies both "the route structure removal" and "adding structure to encapsulate ksock_socket_conn" tasks by combining them.
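The ownership rules proposed above can be sketched as follows. This is one possible reading of the proposal, not a design decision: the `toy_*` names are hypothetical, the connection count stands in for a real list of connections, and refusing to destroy the peer_ni while connections remain is just one way of meeting the last requirement.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for the proposed ksock_peer_conn_cb: no refcount of its
 * own, because its lifetime is tied to the owning peer_ni. */
struct toy_conn_cb {
	int nconns;	/* connections currently owned */
};

struct toy_peer_ni {
	struct toy_conn_cb *conn_cb;	/* exactly one per peer_ni */
};

/* The conn_cb is created together with the peer_ni. */
struct toy_peer_ni *toy_peer_ni_create(void)
{
	struct toy_peer_ni *peer = calloc(1, sizeof(*peer));
	struct toy_conn_cb *cb = calloc(1, sizeof(*cb));

	if (peer == NULL || cb == NULL) {
		free(peer);
		free(cb);
		return NULL;
	}
	peer->conn_cb = cb;
	return peer;
}

void toy_conn_add(struct toy_peer_ni *peer)
{
	peer->conn_cb->nconns++;
}

void toy_conn_del(struct toy_peer_ni *peer)
{
	assert(peer->conn_cb->nconns > 0);
	peer->conn_cb->nconns--;
}

/* The conn_cb is deleted together with the peer_ni, and never while it
 * still owns connections. Returns 0 on success, -1 if refused. */
int toy_peer_ni_destroy(struct toy_peer_ni *peer)
{
	if (peer->conn_cb->nconns > 0)
		return -1;
	free(peer->conn_cb);
	free(peer);
	return 0;
}
```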


[Diagram: socklnd_rdprop01]

Multiple Connections Per Peer

LU-12815 indicates that creating multiple virtual interfaces under the same interface and then grouping these in one LNet in a Multi-Rail (MR) configuration increases performance. By creating multiple virtual interfaces the following effects take place:

...