Problem Statement

RoCEv2 is an implementation of RDMA over Ethernet that may be used over a lossy network. To minimize the effects of frame drops caused by network congestion, RoCEv2 uses the ECN mechanism.

For RoCEv2 ECN congestion control to work properly, congestion marking has to be enabled on every device along the path. Traffic subject to ECN marking in the network must be properly tagged by the HCA.
For this purpose, the DSCP field (part of the ToS field) of the IP packet is used to differentiate RDMA and RDMA-CNP traffic from other flows. ECN marking can then be enabled and applied only to RDMA traffic when congestion is detected.
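
The relationship between the DSCP value and the ToS byte can be sketched as follows. Per RFC 2474 and RFC 3168, the upper six bits of the ToS byte carry the DSCP and the lower two bits carry the ECN field. The helper and constant names below are illustrative only and do not come from Lustre or the kernel.

```c
#include <stdint.h>

/* ECN codepoints from RFC 3168 (low two bits of the ToS byte). */
enum {
	ECN_NOT_ECT = 0x0,	/* not ECN-capable transport */
	ECN_ECT_1   = 0x1,	/* ECN-capable transport, codepoint 1 */
	ECN_ECT_0   = 0x2,	/* ECN-capable transport, codepoint 0 */
	ECN_CE      = 0x3,	/* congestion experienced */
};

/* Hypothetical helper: build a ToS byte from a 6-bit DSCP and a 2-bit ECN value. */
static inline uint8_t tos_from_dscp(uint8_t dscp, uint8_t ecn)
{
	return (uint8_t)(((dscp & 0x3f) << 2) | (ecn & 0x3));
}

/* Hypothetical helpers: split a ToS byte back into its two fields. */
static inline uint8_t tos_to_dscp(uint8_t tos)
{
	return tos >> 2;
}

static inline uint8_t tos_to_ecn(uint8_t tos)
{
	return tos & 0x3;
}
```

For example, DSCP 26 with the ECN bits clear corresponds to a ToS value of 104 (0x68). The same packet marked "congestion experienced" by a switch carries 107 (0x6b).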

Lustre LNet does not support setting the ToS value in ko2iblnd.
Currently, the only way to enable ToS marking of RDMA traffic is to set the default ToS in the mlx4/mlx5 drivers using the cma_roce_tos script, which is part of the mOFED distribution. The script uses configfs to set the desired value and must be executed before the ko2iblnd module is loaded.

The drawback of this approach is that it does not allow different ToS values when more than one o2ib net is configured on a single HCA (in separate VLANs). It is also difficult to verify that the proper ToS has been set for the ko2iblnd QPs.

Kernel API

The kernel provides the following API for setting the ToS field. It is defined in rdma/rdma_cm.h:

/**
 * rdma_set_service_type - Set the type of service associated with a
 *   connection identifier.
 * @id: Communication identifier to associated with service type.
 * @tos: Type of service.
 *
 * The type of service is interpreted as a differentiated service
 * field (RFC 2474).  The service type should be specified before
 * performing route resolution, as existing communication on the
 * connection identifier may be unaffected.  The type of service
 * requested may not be supported by the network to all destinations.
 */
void rdma_set_service_type(struct rdma_cm_id *id, int tos);

Design Overview

[Figure: tos_flow]

lnetctl

Two commands will be modified/added to the lnetctl command utility:

lnetctl net add --net <net> --if <interface> --tos <tos>

The net add command will be modified to take the ToS value to use for this particular network interface.

lnetctl set tos [--net <net> --if <interface>] <tos>

A new lnetctl set tos command will be added to set or modify the ToS value for a particular network interface. If no network interface is specified, all NIs will be set to the given ToS value.

liblnetconfig API

struct lnet_ioctl_config_o2iblnd_tunables will be modified to take the ToS value:

struct lnet_ioctl_config_o2iblnd_tunables {
	__u32 lnd_version;
	__u32 lnd_peercredits_hiw;
	__u32 lnd_map_on_demand;
	__u32 lnd_concurrent_sends;
	__u32 lnd_fmr_pool_size;
	__u32 lnd_fmr_flush_trigger;
	__u32 lnd_fmr_cache;
	__u16 lnd_conns_per_peer;
	__u16 lnd_ntx;
	__s16 lnd_tos;
};

The lnd_tos field will be filled with the value passed by the user through the lnetctl command; otherwise it will be set to -1 to indicate that the field is not set. This avoids any modification to the net add API:
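
The sentinel handling described above could look roughly like the sketch below. It is a userspace illustration, not the liblnetconfig implementation: tos_from_arg and TOS_UNSET are hypothetical names, int16_t stands in for the kernel's __s16, and the accepted range of 0 to 255 is an assumption based on the ToS field being a single byte.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define TOS_UNSET (-1)	/* sentinel from the text: ToS not configured */

/*
 * Hypothetical helper: convert the optional --tos argument into the value
 * stored in lnd_tos. Returns 0 on success, or -EINVAL if the argument is
 * not a number in the assumed range 0..255.
 */
static int tos_from_arg(const char *arg, int16_t *lnd_tos)
{
	char *end;
	long val;

	if (arg == NULL) {		/* user did not pass --tos */
		*lnd_tos = TOS_UNSET;
		return 0;
	}

	errno = 0;
	val = strtol(arg, &end, 0);	/* base 0 accepts decimal, hex and octal */
	if (errno != 0 || end == arg || *end != '\0' || val < 0 || val > 255)
		return -EINVAL;

	*lnd_tos = (int16_t)val;
	return 0;
}
```

Because the field is signed, -1 cannot collide with any valid ToS value, so a consumer can test for lnd_tos < 0 to decide whether to leave the driver default in place.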

/*
 * lustre_lnet_config_ni
 *   Send down an IOCTL to configure a network interface. It implicitly
 *   creates a network if one doesn't exist.
 *
 *   nw_descr - network and interface descriptor
 *   global_cpts - globally defined CPTs
 *   ip2net - this parameter allows configuring multiple networks.
 *	it takes precedence over the net and intf parameters
 *   tunables - LND tunables
 *   seq_no - sequence number of the request
 *   lnd_tunables - lnet specific tunable parameters
 *   err_rc - [OUT] struct cYAML tree describing the error. Freed by caller
 */
int lustre_lnet_config_ni(struct lnet_dlc_network_descr *nw_descr,
			  struct cfs_expr_list *global_cpts,
			  char *ip2net,
			  struct lnet_ioctl_config_lnd_tunables *tunables,
			  int seq_no, struct cYAML **err_rc);

struct lnet_ioctl_config_ni will also be used for a new API to explicitly set the ToS for a network interface:

/*
 * lustre_lnet_config_tos_ni
 *   Set the ToS for the specified NI
 *
 * nw_descr - network and interface descriptor
 * tunables - LND tunables carrying the ToS value
 * seq_no - sequence number of the request
 * err_rc - [OUT] struct cYAML tree describing the error. Freed by caller
 */
int lustre_lnet_config_tos_ni(struct lnet_dlc_network_descr *nw_descr,
			      struct lnet_ioctl_config_lnd_tunables *tunables,
			      int seq_no, struct cYAML **err_rc);

In this manner the liblnetconfig API remains backwards compatible.

Module parameters

The ToS will not be set through module parameters, as that would require modifying the net configuration string. The decision was made some time ago to move away from string parsing in the kernel and to add new network interface changes to the lnetctl utility and the liblnetconfig API.

ioctl Interface

Two new ioctls will be added to get and set the ToS value for the NI:

IOC_LIBCFS_ADD_TOS_NI
IOC_LIBCFS_SET_TOS_NI

o2iblnd modifications

The o2iblnd module will handle the two new IOCTLs in its kiblnd_ctl() function.

The set operation will look up the NI. If the lookup fails, it returns an error code; otherwise it calls the rdma_set_service_type() API.

There does not appear to be a CMA API for retrieving the ToS value set on a particular cmid; therefore, the ToS value will need to be stored in the device structure, struct kib_hca_dev. The get operation will return this stored value.
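
The set and get behaviour described in this section can be modelled in userspace as follows. This is only a sketch under stated assumptions: every model_* type and function is a hypothetical stand-in (for struct rdma_cm_id, struct kib_hca_dev, the NI lookup and rdma_set_service_type()), the ibh_tos field name is invented for illustration, and the -ENOENT and -EINVAL return codes are assumptions rather than values specified by this design.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for struct rdma_cm_id; records what the stub applied. */
struct model_cm_id {
	int applied_tos;
};

/* Stand-in for struct kib_hca_dev with a hypothetical cached ToS field. */
struct model_hca_dev {
	int ibh_tos;		/* -1 when no ToS has been set */
};

/* Stand-in for a network interface bound to one device and one cmid. */
struct model_ni {
	const char *ifname;
	struct model_hca_dev *hdev;
	struct model_cm_id *cmid;
};

/* Stub for rdma_set_service_type(), which returns void in the kernel. */
static void model_set_service_type(struct model_cm_id *id, int tos)
{
	id->applied_tos = tos;
}

static struct model_ni *model_find_ni(struct model_ni *nis, size_t n,
				      const char *ifname)
{
	for (size_t i = 0; i < n; i++)
		if (strcmp(nis[i].ifname, ifname) == 0)
			return &nis[i];
	return NULL;
}

/* Set: look up the NI, fail if absent, otherwise apply and cache the ToS. */
static int model_set_tos(struct model_ni *nis, size_t n,
			 const char *ifname, int tos)
{
	struct model_ni *ni = model_find_ni(nis, n, ifname);

	if (ni == NULL)
		return -ENOENT;
	if (tos < 0 || tos > 255)
		return -EINVAL;

	model_set_service_type(ni->cmid, tos);
	ni->hdev->ibh_tos = tos;	/* cached because the cmid cannot be queried */
	return 0;
}

/* Get: return the cached value from the device structure. */
static int model_get_tos(struct model_ni *nis, size_t n,
			 const char *ifname, int *tos)
{
	struct model_ni *ni = model_find_ni(nis, n, ifname);

	if (ni == NULL)
		return -ENOENT;
	*tos = ni->hdev->ibh_tos;
	return 0;
}
```

Note that, per the kernel documentation quoted earlier, the service type should be set before route resolution, so a cached value reflects what was requested and existing connections on a cmid may be unaffected by a later change.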