Problem Statement
RoCEv2 is an RDMA-over-Ethernet implementation that may be used over a lossy network. To minimize the effect of frame drops caused by network congestion, RoCEv2 uses the ECN (Explicit Congestion Notification) mechanism.
For RoCEv2 ECN congestion control to work properly, congestion marking has to be enabled on every device along the path. Traffic subject to ECN marking on the network side must be properly tagged by the HCA.
For this purpose, the DSCP field (part of the ToS field) of the IP packet is used to differentiate RDMA and RDMA-CNP traffic from other flows. ECN marking can then be enabled and applied only to RDMA traffic when congestion is detected.
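The relationship between the ToS byte, the DSCP and the ECN bits can be sketched as follows. The layout (DSCP in the upper six bits per RFC 2474, ECN in the lower two bits per RFC 3168) is standard; the helper names are hypothetical and are not part of Lustre or the kernel.

```c
#include <stdint.h>

/* IPv4 ToS byte layout: bits 7..2 carry the DSCP (RFC 2474),
 * bits 1..0 carry the ECN field (RFC 3168). */
#define TOS_DSCP_SHIFT 2
#define TOS_DSCP_MASK  0x3fu
#define TOS_ECN_MASK   0x03u

/* Hypothetical helper: build a ToS byte from a DSCP (0..63) and ECN bits (0..3).
 * Out-of-range inputs are masked rather than rejected. */
uint8_t tos_from_dscp_ecn(uint8_t dscp, uint8_t ecn)
{
	return (uint8_t)(((dscp & TOS_DSCP_MASK) << TOS_DSCP_SHIFT) |
			 (ecn & TOS_ECN_MASK));
}

/* Hypothetical helper: extract the DSCP from a ToS byte. */
uint8_t tos_dscp(uint8_t tos)
{
	return (uint8_t)(tos >> TOS_DSCP_SHIFT);
}

/* Hypothetical helper: extract the ECN bits from a ToS byte. */
uint8_t tos_ecn(uint8_t tos)
{
	return (uint8_t)(tos & TOS_ECN_MASK);
}
```

As an illustration only (the DSCP value is an arbitrary example, not one mandated by this design), a DSCP of 26 with the ECN bits cleared corresponds to a ToS byte of 104.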
Lustre LNET does not support setting the ToS value in ko2iblnd.
Currently, the only way to enable ToS marking of RDMA traffic is to set the default ToS in the mlx4/5 drivers using the cma_roce_tos script, which is part of the mOFED distribution. The script uses configfs to set the desired value and must be executed before the ko2iblnd module is loaded.
The drawback of this approach is that it does not allow different ToS values when more than one o2ib net is configured on a single HCA (in separate VLANs). It is also difficult to verify that the intended ToS has actually been set for the ko2iblnd QPs.
Kernel API
The kernel provides the following API to set the ToS field. It is defined in rdma/rdma_cm.h:
/**
 * rdma_set_service_type - Set the type of service associated with a
 *   connection identifier.
 * @id: Communication identifier to associated with service type.
 * @tos: Type of service.
 *
 * The type of service is interpreted as a differentiated service
 * field (RFC 2474).  The service type should be specified before
 * performing route resolution, as existing communication on the
 * connection identifier may be unaffected.  The type of service
 * requested may not be supported by the network to all destinations.
 */
void rdma_set_service_type(struct rdma_cm_id *id, int tos);
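The ordering requirement in the comment above (set the service type before route resolution) can be illustrated with a small userspace mock. Everything in this sketch is a stand-in: struct mock_cm_id, mock_set_service_type() and mock_resolve_route() only imitate the kernel objects so the example can run outside the kernel, and the "snapshot at resolution time" behaviour is an assumption used to model the statement that existing communication may be unaffected.

```c
/* Mock stand-in for the kernel's struct rdma_cm_id; NOT the real layout. */
struct mock_cm_id {
	int tos;       /* value last requested for this identifier */
	int route_tos; /* value captured when the route was resolved */
	int resolved;  /* non-zero once route resolution has run */
};

/* Mock of rdma_set_service_type(): only records the requested value. */
void mock_set_service_type(struct mock_cm_id *id, int tos)
{
	id->tos = tos;
}

/* Mock of route resolution: snapshots the ToS in effect at this point
 * (assumed behaviour, for illustration only). */
int mock_resolve_route(struct mock_cm_id *id)
{
	id->route_tos = id->tos;
	id->resolved = 1;
	return 0;
}

/* Hypothetical helper: apply the ToS, if one is configured (tos >= 0),
 * and only then resolve the route. */
int sketch_connect_with_tos(struct mock_cm_id *id, int tos)
{
	if (tos >= 0)
		mock_set_service_type(id, tos);
	return mock_resolve_route(id);
}
```

In this mock, a value set after sketch_connect_with_tos() has returned changes tos but not route_tos, which is why the real call is expected to happen before route resolution.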
Design Overview
lnetctl
Two commands will be modified or added in the lnetctl command utility:
lnetctl net add --net <net> --if <interface> --tos <tos>
The net add command will be modified to accept the ToS value to use for this particular network interface.
lnetctl set tos [--net <net> --if <interface>] <tos>
A new lnetctl set tos command will be added to set or modify the ToS value for a particular network interface. If no network interface is specified, all NIs will be set to the given ToS value.
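The selection rule for lnetctl set tos (one NI when a net and interface are given, every NI otherwise) might look like the sketch below. The struct sketch_ni type and both functions are hypothetical and do not correspond to existing LNet code; treating a NULL filter field as "match anything" is an assumption, since the text only describes the case where both options are given or both are omitted.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified view of a configured network interface. */
struct sketch_ni {
	const char *net;  /* e.g. an o2ib network name */
	const char *intf; /* interface the NI is bound to */
	int tos;          /* -1 means "not set" */
};

/* A NULL filter matches every value (assumed semantics). */
int sketch_field_matches(const char *want, const char *have)
{
	return want == NULL || strcmp(want, have) == 0;
}

/* Set the ToS on every NI matching the optional net/interface filter.
 * Returns the number of NIs that were updated. */
int sketch_set_tos(struct sketch_ni *nis, size_t count,
		   const char *net, const char *intf, int tos)
{
	int updated = 0;
	size_t i;

	for (i = 0; i < count; i++) {
		if (sketch_field_matches(net, nis[i].net) &&
		    sketch_field_matches(intf, nis[i].intf)) {
			nis[i].tos = tos;
			updated++;
		}
	}
	return updated;
}
```

A caller could treat a return value of zero as "no such NI" and report an error to the user.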
liblnetconfig API
struct lnet_ioctl_config_o2iblnd_tunables will be modified to carry the ToS value:
struct lnet_ioctl_config_o2iblnd_tunables {
	__u32 lnd_version;
	__u32 lnd_peercredits_hiw;
	__u32 lnd_map_on_demand;
	__u32 lnd_concurrent_sends;
	__u32 lnd_fmr_pool_size;
	__u32 lnd_fmr_flush_trigger;
	__u32 lnd_fmr_cache;
	__u16 lnd_conns_per_peer;
	__u16 lnd_ntx;
	__s16 lnd_tos;
};
The lnd_tos field will be filled with the value passed by the user through the lnetctl command; otherwise it will be set to -1 to indicate that the field is not set. This avoids any modification to the net add API:
/*
 * lustre_lnet_config_ni
 *   Send down an IOCTL to configure a network interface. It implicitly
 *   creates a network if one doesn't exist.
 *
 *   nw_descr - network and interface descriptor
 *   global_cpts - globally defined CPTs
 *   ip2net - this parameter allows configuring multiple networks.
 *	it takes precedence over the net and intf parameters
 *   tunables - LND tunables
 *   seq_no - sequence number of the request
 *   lnd_tunables - lnet specific tunable parameters
 *   err_rc - [OUT] struct cYAML tree describing the error. Freed by caller
 */
int lustre_lnet_config_ni(struct lnet_dlc_network_descr *nw_descr,
			  struct cfs_expr_list *global_cpts,
			  char *ip2net,
			  struct lnet_ioctl_config_lnd_tunables *tunables,
			  int seq_no, struct cYAML **err_rc);
struct lnet_ioctl_config_ni will also be used for a new API that explicitly sets the ToS for a network interface:
/*
 * lustre_lnet_config_tos_ni
 *   Set the TOS for the specified NI
 *
 *   nw_descr - network and interface descriptor
 *   lnd_tunables - lnet specific tunable parameters
 *   seq_no - sequence number of the request
 *   err_rc - [OUT] struct cYAML tree describing the error. Freed by caller
 */
int lustre_lnet_config_tos_ni(struct lnet_dlc_network_descr *nw_descr,
			      struct lnet_ioctl_config_lnd_tunables *tunables,
			      int seq_no, struct cYAML **err_rc);
In this manner, the liblnetconfig API remains backwards compatible.
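The -1 "not set" convention for lnd_tos can be sketched as below. This is an illustration under assumptions, not the planned implementation: sketch_fill_tos(), sketch_tos_is_set() and SKETCH_TOS_UNSET are hypothetical names, int16_t stands in for the kernel's __s16, and rejecting values outside 0..255 is an assumed policy based on the ToS field being a single byte.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Sentinel meaning "the user did not supply a ToS" (hypothetical name). */
#define SKETCH_TOS_UNSET (-1)

/* Fill a signed 16-bit ToS field from an optional user value.
 * user_tos == NULL models the option being omitted on the command line.
 * Returns 0 on success, -EINVAL if the value does not fit in one byte. */
int sketch_fill_tos(int16_t *lnd_tos, const long *user_tos)
{
	if (user_tos == NULL) {
		*lnd_tos = SKETCH_TOS_UNSET;
		return 0;
	}
	if (*user_tos < 0 || *user_tos > 255)
		return -EINVAL; /* leave *lnd_tos untouched on error */
	*lnd_tos = (int16_t)*user_tos;
	return 0;
}

/* Non-zero when a ToS value was explicitly supplied. */
int sketch_tos_is_set(int16_t lnd_tos)
{
	return lnd_tos != SKETCH_TOS_UNSET;
}
```

A signed 16-bit field is wide enough to hold every 8-bit ToS value as well as the -1 sentinel, so the kernel side can distinguish "leave the ToS alone" from "set the ToS to 0".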
Module parameters
Setting the ToS will not be supported through module parameters, as that would require modifying the net configuration string. The decision was made some time ago to move away from string parsing in the kernel and to add new network interface features through the lnetctl utility and the liblnetconfig API.
ioctl Interface
Two new ioctls will be added to get and set the ToS value for the NI:
IOC_LIBCFS_ADD_TOS_NI
IOC_LIBCFS_SET_TOS_NI
o2iblnd modifications
The o2iblnd module will handle the two new ioctls in its kiblnd_ctl() function.
The set operation will look up the NI. If the lookup fails, it returns an error code; otherwise it calls the rdma_set_service_type() API.
There does not appear to be a cma API to get the ToS value set on a particular cmid; therefore, the ToS value will need to be stored in the device structure, struct kib_hca_dev. The get operation will return this stored value.
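The set and get behaviour described above can be sketched with stand-in types. struct sketch_hca_dev and struct sketch_o2ib_ni are hypothetical simplifications of struct kib_hca_dev and the NI, the function names are invented for illustration, and the choice of -ENOENT, -EINVAL and -ENODATA as error codes is an assumption; the source only states that a failed lookup returns an error code.

```c
#include <errno.h>
#include <stddef.h>

/* Stand-in for struct kib_hca_dev, reduced to the cached ToS. */
struct sketch_hca_dev {
	int tos; /* last ToS applied; -1 if never set */
};

/* Stand-in for an o2iblnd NI that points at its device. */
struct sketch_o2ib_ni {
	struct sketch_hca_dev *dev;
};

/* Set: fail if the NI (or its device) was not found, otherwise cache the
 * value. Real code would also pass it to rdma_set_service_type(). */
int sketch_tos_set(struct sketch_o2ib_ni *ni, int tos)
{
	if (ni == NULL || ni->dev == NULL)
		return -ENOENT;
	if (tos < 0 || tos > 255)
		return -EINVAL;
	ni->dev->tos = tos;
	return 0;
}

/* Get: return the cached value, since the cmid cannot be queried. */
int sketch_tos_get(const struct sketch_o2ib_ni *ni, int *tos)
{
	if (ni == NULL || ni->dev == NULL || tos == NULL)
		return -ENOENT;
	if (ni->dev->tos < 0)
		return -ENODATA; /* nothing has been set yet */
	*tos = ni->dev->tos;
	return 0;
}
```

Caching the value next to the device also gives administrators a way to read back what was configured, which addresses the verification difficulty noted in the problem statement.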