Introduction

It is sometimes desirable to fine-tune the selection of the local and remote NIs used for communication. For example, if a node is configured with two networks, an OPA network and a MLX network, both will currently be used. In particular, when traffic volume is low the credits criteria will be equivalent between the nodes, and both networks will be used in round robin. However, the user might want to direct all traffic to one network and keep the other network free unless the first network goes down.

User Defined Selection Policies (UDSP) will allow this type of control. 

UDSPs are configured from lnetctl, via either the command line or YAML configuration files, and are then passed to the kernel. Policies are applied to all local networks and remote peers and stored in the kernel. During the selection process the policies are examined as part of the selection algorithm.

Conceptual Overview

UDSPs are used to finely control traffic. To achieve this as efficiently as possible, the policies cannot be examined on the fast path with every message being sent. Instead, the policies are instantiated on LNet constructs. The LNet constructs are: Local Nets, Local NIs, Peer Nets and Peer NIs. Once a policy is instantiated on an LNet construct, meaning specific fields in the LNet construct structure are filled in, those fields are examined by the selection algorithm.

Currently, the policies control the preference of some constructs over others during the selection algorithm.

UDSP Structure

A UDSP consists of two parts:

  1. The matching criteria
    1. The matching criteria is used to match an LNet construct against the policy
  2. Policy action
    1. The policy action is the action taken on the LNet construct when the policy is matched.

UDSP Rule Types

Network Rules

These rules define the relative priority of the networks against each other, where 0 is the highest priority. Higher priority networks will be selected by the selection algorithm, unless the network has no healthy interfaces. If there exists an interface on another network which can be used and is healthier than any available on the current network, then that interface will be used instead. Health will always trump all other criteria.
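The interaction between health and priority described above can be sketched as a comparison function. This is an illustrative sketch only, assuming a simplified candidate structure; the field names and the prefer_net() helper are invented for the example and are not the real lnet_net structure or selection code.

```c
#include <assert.h>

/* Hypothetical, simplified view of the fields compared during
 * network selection; illustrative names, not the real construct. */
struct net_candidate {
	unsigned int priority; /* 0 is the highest priority */
	int healthy_ifaces;    /* number of healthy interfaces */
	int best_health;       /* best interface health value */
};

/* Return 1 if candidate a should be preferred over b.
 * Health always trumps priority: a healthier interface on a
 * lower-priority network wins over a less healthy one on a
 * higher-priority network. */
static int prefer_net(const struct net_candidate *a,
		      const struct net_candidate *b)
{
	if (a->healthy_ifaces && !b->healthy_ifaces)
		return 1;
	if (!a->healthy_ifaces && b->healthy_ifaces)
		return 0;
	if (a->best_health != b->best_health)
		return a->best_health > b->best_health;
	/* equal health: the lower numeric priority value wins */
	return a->priority < b->priority;
}
```

Only when health is equal does the UDSP priority break the tie.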

Matching Criteria

In order to match a network rule, the network type and an expression representing the network number must be provided. Example:

tcp1 # match tcp1 exactly
tcp[1-3] # match tcp1, tcp2 and tcp3
tcp* # match any tcp network

The policy can apply to local or remote networks, depending on the specification in the command.

Matching Action

When a network matches the policy the action is applied on the network LNet construct. The only action available is setting the selection priority of the network. When the selection algorithm is iterating through available networks, the one with the highest selection priority is selected.

NID Rules

These rules define the relative priority of individual NIDs, where 0 is the highest priority. Once a network is selected, the NID with the highest priority is preferred. Note that NID priority ranks below health. For example, given two NIDs, NID-A and NID-B, where NID-A has the higher priority but the lower health value, NID-B will still be selected. In that sense the policies act as a hint to guide the selection algorithm.


Matching Criteria

A NID expression is used to match the policy against local or remote NIDs. Example:

10.10.10.2@tcp1 # match the exact nid
10.10.10.[2-3]@tcp1 # match 10.10.10.2@tcp1 and 10.10.10.3@tcp1
10.10.10.[2-3]@tcp* # match 10.10.10.2 and 10.10.10.3 on any tcp network


The policy can apply to local or remote NIDs, depending on the specification in the command.

Matching Action

When a NID matches the policy the action specified in the rule is applied to the NI LNet construct. The only action available is setting the selection priority of the NID. When the selection algorithm is iterating through available NIs, the one with the highest selection priority is selected.

NID Pair Rules

These rules define preferred paths. Once a local NI is selected, as this is the first step in the selection algorithm, the peer NI which has that local NI on its preferred list is selected. The end result of this strategy is an association between a local NI and a peer NI (or a group of them).


Matching Criteria

A NID pair rule takes two expressions describing the source and destination NIDs which should be preferred. The matching rules for each of the supplied NID expressions are the same as for the NID Rules above.


Matching Action

The remote NIDs available in the system are examined. Each remote NID which matches the destination NID expression in the policy, will have a set of local NIDs, which match the source NID expression in the policy, added to its preferred local NID list. If no local NID matches the source NID expression in the policy, then the action is a no-op.
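The flattening described above can be sketched as follows. The structures and the match() helper are illustrative stand-ins, not the real peer NI construct or NID matcher, and the trivial prefix match merely keeps the sketch self-contained.

```c
#include <assert.h>
#include <string.h>

#define MAX_PREF 4

/* Simplified stand-in for a peer NI; real code stores
 * lnet_nid_t values on the lpni_pref list. */
struct peer_ni {
	const char *nid;
	const char *pref[MAX_PREF]; /* preferred local NIDs */
	int npref;
};

/* Stand-in NID matcher: exact match, or prefix match for a
 * trailing '*'. The real matcher follows the NID Rules syntax. */
static int match(const char *expr, const char *nid)
{
	size_t n = strlen(expr);

	if (n && expr[n - 1] == '*')
		return strncmp(expr, nid, n - 1) == 0;
	return strcmp(expr, nid) == 0;
}

/* Apply a NID pair rule: every peer NI matching dst_expr gets all
 * local NIDs matching src_expr added to its preferred list. If no
 * local NID matches, the action is a no-op. */
static void apply_nid_pair_rule(const char *src_expr, const char *dst_expr,
				const char **local_nids, int nlocal,
				struct peer_ni *peers, int npeers)
{
	int i, j;

	for (i = 0; i < npeers; i++) {
		if (!match(dst_expr, peers[i].nid))
			continue;
		for (j = 0; j < nlocal && peers[i].npref < MAX_PREF; j++) {
			if (match(src_expr, local_nids[j]))
				peers[i].pref[peers[i].npref++] =
					local_nids[j];
		}
	}
}
```

Router rules flatten the same way, with a router NID list on the peer NI instead of a local NID list.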

Router Rules

Router Rules define which set of routers to use when sending messages to a destination NID or set of NIDs. When defining a network, some paths can be more optimal than others. To gain more control over the path traffic takes, admins configure interfaces on different networks and split up the router pools among the networks. However, this results in complex configuration which is hard to maintain and error prone. It is much more desirable to configure all interfaces on the same network, and then define which routers to use when sending to a remote peer. Router Rules allow this functionality.


Matching Criteria


A Router rule takes two expressions describing the destination NIDs and the router NIDs which should be preferred when sending to the matching destination NIDs. The matching rules for each of the supplied NID expressions are the same as for the NID Rules above.



Matching Action

The remote NIDs available in the system are examined. Each remote NID which matches the destination NID expression in the policy, will have a set of router NIDs, which match the router NID expression in the policy, added to its preferred router NID list. If no router NID matches the router NID expression in the policy, then the action is a no-op.

UDSP application and Selection Priority


UDSP Rule Interactions


UDSP Rule Types

Outlined below are the UDSP rule types

  1. Network rules
  2. NID rules
  3. NID Pair rules
  4. Router rules

User Interface

Command Line Syntax

Below is the command line syntax for managing UDSPs:

# Adding a local network udsp
# if multiple local networks are available, each one can have a priority. 
# The one with the highest priority is preferred
lnetctl policy add --src *@<net type> --<action type> <action context sensitive value> --idx <value>
	--src: is defined in ip2nets syntax. '*@<net type>' syntax indicates the network.
		   This is not to be confused with '*.*.*.*'@<net type>' which indicates all
		   NIDs in this network.
	--<action type>: 'priority' is the only implemented action type
	--<action context sensitive value>: is a value specific to the action type.
					  For 'priority' it's a value in the range [0 - 255]
	--idx: The index at which to insert the rule. If it is larger than the size of the
		   policy list, the rule is appended to the end of the list. If not specified,
		   the default behavior is to append to the end of the list.

# Adding a local NID udsp
# After a local network is chosen. If there are multiple NIs in the network the
# one with highest priority is preferred.
lnetctl policy add --src <Address descriptor>@<net type> --<action type> <action context sensitive value>
				  --idx <value>
	--src: the address descriptor defined in ip2nets syntax as described in the manual
		   <net type> is something like: tcp1, o2ib2
	--<action type>: 'priority' is the only implemented action type
	--<action context sensitive value>: is a value specific to the action type.
					  For 'priority' it's a value in the range [0 - 255]
	--idx: The index at which to insert the rule. If it is larger than the size of the
		   policy list, the rule is appended to the end of the list. If not specified,
		   the default behavior is to append to the end of the list.

# Adding a remote NID udsp
# When selecting a peer NID select the one with the highest priority.
lnetctl policy add --dst <Address descriptor>@<net type> --<action type> <action context sensitive value>
				  --idx <value>
	--dst: the address descriptor defined in ip2nets syntax as described in the manual
		   <net type> is something like: tcp1, o2ib2
	--<action type>: 'priority' is the only implemented action type
	--<action context sensitive value>: is a value specific to the action type.
					  For 'priority' it's a value in the range [0 - 255]
	--idx: The index at which to insert the rule. If it is larger than the size of the
		   policy list, the rule is appended to the end of the list. If not specified,
		   the default behavior is to append to the end of the list.

# Adding a NID pair udsp
# When this rule is flattened the local NIDs which match the rule are added to a list
# on the peer NIs matching the rule. When selecting the peer NI, the one with the 
# local NID being used on its list is preferred.
lnetctl policy add --src <Address descriptor>@<net type> --dst <Address descriptor>@<net type>
				  --idx <value>
	--src: the address descriptor defined in ip2nets syntax as described in the manual
		   <net type> is something like: tcp1, o2ib2
	--dst: the address descriptor defined in ip2nets syntax as described in the manual
		   <net type> is something like: tcp1, o2ib2. Destination NIDs can be local or
		   remote.
	--idx: The index at which to insert the rule. If it is larger than the size of the
		   policy list, the rule is appended to the end of the list. If not specified,
		   the default behavior is to append to the end of the list.

# Adding a Router udsp
# similar to the NID pair udsp. The router NIDs matching the rule are added on a list
# on the peer NIs matching the rule. When sending to a remote peer, the router which
# has its nid on the peer NI list is preferred.
lnetctl policy add --dst <Address descriptor>@<net type> --rte <Address descriptor>@<net type>
				  --idx <value>
	--dst: the address descriptor defined in ip2nets syntax as described in the manual
		   <net type> is something like: tcp1, o2ib2
	--rte: the address descriptor defined in ip2nets syntax as described in the manual
		   <net type> is something like: tcp1, o2ib2.
	--idx: The index at which to insert the rule. If it is larger than the size of the
		   policy list, the rule is appended to the end of the list. If not specified,
		   the default behavior is to append to the end of the list.


# show all policies in the system.
# the policies are dumped in YAML form.
# Each policy is assigned an index.
# The index is part of the policy YAML block
lnetctl policy show

# To delete a policy the index must be specified.
# The normal workflow is to first show the list of policies,
# grab the index and use it in the delete command.
lnetctl policy del --idx <value>

# generally, the syntax is as follows
lnetctl policy <add | del | show>
  --src: ip2nets syntax specifying the local NID to match
  --dst: ip2nets syntax specifying the remote NID to match
  --rte: ip2nets syntax specifying the router NID to match
  --priority: Priority to apply to rule matches
  --idx: Index of where to insert the rule. By default it appends to
		 the end of the rule list 

As of this writing, only the "priority" action shall be implemented. However, it is feasible in the future to implement different actions to be taken when a rule matches. For example, we could implement a "redirect" action, which redirects traffic to another destination. Yet another example is a "lawful intercept" or "mirror" action, which mirrors messages to a different destination. This might be useful for keeping a standby server updated with all information going to the primary server. A lawful intercept action allows personnel authorized by a Law Enforcement Agency (LEA) to intercept file operations from targeted clients and send the file operations to an LI Mediation Device.

YAML Syntax

udsp:
    - idx: <unsigned int>
      src: <ip>@<net type>
      dst: <ip>@<net type>
      rte: <ip>@<net type>
      action:
          - priority: <unsigned int>
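As an illustration, a rule that assigns priority 1 to a range of local NIDs could be written as follows (the index, NID range and priority value are hypothetical):

```yaml
udsp:
    - idx: 0
      src: 10.10.10.[2-3]@tcp1
      action:
          - priority: 1
```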

Overview of Operations

There are three main operations which can be carried out on UDSPs either from the command line or YAML configuration: add, delete, show.

Add

The UI allows adding a new rule. With the optional idx parameter, the admin can specify where in the rule chain the new rule should be added. By default the rule is appended to the list. Any other value results in inserting the rule at that position.

When a new UDSP is added, the entire UDSP set is re-evaluated. This means all Nets, NIs and peer NIs in the system are traversed and the rules re-applied. This is an expensive operation, but since UDSP management should be a rare operation, it shouldn't be a problem.
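The re-evaluation pass can be sketched for Network Rules: reset all construct state to a default, then re-apply the full rule list in order, so construct state never drifts from the configured policies. All names here are illustrative stand-ins for the real constructs and rules.

```c
#include <assert.h>

/* Illustrative stand-ins for a network construct and a
 * flattened network rule. */
struct net { unsigned int id; unsigned int priority; };
struct net_rule { unsigned int net_id; unsigned int priority; };

#define DEFAULT_PRIORITY 0xffffffffu	/* "no policy applied" marker */

static void reapply_net_rules(struct net *nets, int nnets,
			      const struct net_rule *rules, int nrules)
{
	int i, j;

	/* wipe all traces of previously applied rules first */
	for (i = 0; i < nnets; i++)
		nets[i].priority = DEFAULT_PRIORITY;
	/* then flatten the current rule set onto the constructs */
	for (j = 0; j < nrules; j++)
		for (i = 0; i < nnets; i++)
			if (nets[i].id == rules[j].net_id)
				nets[i].priority = rules[j].priority;
}
```

The reset step is what makes delete work with the same code path: a removed rule simply stops being re-applied.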

Delete

The UI allows deleting an existing UDSP using its index. The index can be shown using the show command. When a UDSP is deleted the entire UDSP set is re-evaluated: the Nets, NIs and peer NIs are traversed and the rules re-applied.

Show

The UI allows showing existing UDSPs. The format of the YAML output is as follows:

udsp:
    - idx: <unsigned int>
      src: <ip>@<net type>
      dst: <ip>@<net type>
      rte: <ip>@<net type>
      action:
          - priority: <unsigned int>

Design

All policies are stored in kernel space, and all logic to add, delete and match policies will be implemented in kernel space. This complicates the kernel-space processing. Arguably, policy maintenance logic is not core to LNet functionality; what is core is the ability to select source and destination networks and NIDs in accordance with user definitions. However, the kernel can manage policies much more easily and with fewer potential race conditions than user space.

Design Principles

UDSPs are comprised of two parts:

  1. The matching rule
  2. The rule action

The matching rule is what's used to match a NID or a network. The action is what's applied when the rule is matched.

A rule can be uniquely identified using an internal ID which is assigned by the LNet module when a rule is added and returned to the user space when the UDSPs are shown.

UDSP Storage

UDSPs shall be defined by administrators either via the LNet command line utility, lnetctl, or via a YAML configuration file. lnetctl parses the UDSP and stores it in an intermediary format, which is flattened and passed down to the kernel LNet module. LNet stores these UDSPs on a policy list. Once policies are added to LNet they are applied to existing networks, NIDs and routers. The advantage of this approach is that UDSPs are not strictly tied to the internal constructs (i.e. networks, NIDs or routers): they can be applied whenever an internal construct is created, and if an internal construct is deleted the policies remain and can be automatically applied at a future time.

This makes configuration easy since a set of UDSPs can be defined, like "all IB networks priority 1", "all Gemini networks priority 2", etc, and when a network is added, it automatically inherits these rules.

Peers are normally not created explicitly by administrators. The ULP requests to send a message to a peer, or the node receives an unsolicited message from a peer, which results in creating a peer construct in LNet. It is feasible, especially for router policies, to have a UDSP which associates a set of clients within a specific range with a set of optimal routers. Having the policies stored and matched in the kernel helps fulfill this requirement.

UDSP Application

Performance needs to be taken into account with this feature. It is not feasible to traverse the policy lists on every send operation; this would add unnecessary overhead. When rules are applied they have to be "flattened" onto the constructs they impact. For example, a Network Rule is added as follows: o2ib priority 0. This rule gives priority to using the o2ib network for sending. A priority field will be added to the network structure and set to 0 for the o2ib network. As the selection algorithm traverses the networks, which is part of the current code, the priority field is compared. This is a more optimal approach than examining the policies on every send to see if we get any matches.

Order of Operations

It is important to define the order of rule operations, when there are multiple rules that apply to the same construct.

The order is defined by the selection algorithm logical flow:

  1. iterate over all the networks that a peer can be reached on and select the best local network
    1. The remote network with the highest priority is examined
      1. Network Rule
    2. The local network with the highest priority is selected
      1. Network Rule
    3. The local NI with the highest priority is selected
      1. NID Rule
  2. If the peer is a remote peer and has no local networks,
    1. then select the remote peer network with the highest priority
      1. Network Rule
    2. Select the highest priority remote peer_ni on the network selected
      1. NID Rule
    3. Now that the peer's network and NI are decided select the router in round robin from the peer NI's preferred router list
      1. Router Rule
  3. Otherwise for local peers, select the peer_ni from the peer.
    1. highest priority peer NI is selected
      1. NID Rule
    2. Select the peer NI which has the local NI selected on its preferred list.
      1. NID Pair Rule

Kernel Design

In Kernel Structures

/* lnet structure which will keep a list of UDSPs */
struct lnet {
	...
	struct list_head ln_udsp_list;
	...
};

/* each NID range is defined as a net_id and an ip range */
struct lnet_ud_nid_descr {
	__u32 ud_net_id;
	struct list_head ud_ip_range;
};

/* UDSP action types */
enum lnet_udsp_action_type {
	EN_LNET_UDSP_ACTION_PRIORITY = 0,
	EN_LNET_UDSP_ACTION_NONE = 1,
};

 /*
 * a UDSP rule can have up to three user defined NID descriptors
 * 		- src: defines the local NID range for the rule
 * 		- dst: defines the peer NID range for the rule
 * 		- rte: defines the router NID range for the rule
 *
 * An action union defines the action to take when the rule
 * is matched
 */ 
struct lnet_udsp {
	struct list_head udsp_on_list;
	__u32 idx;
	struct lnet_ud_nid_descr *udsp_src;
	struct lnet_ud_nid_descr *udsp_dst;
	struct lnet_ud_nid_descr *udsp_rte;
	enum lnet_udsp_action_type udsp_action_type;
	union {
		__u32 udsp_priority;
	} udsp_action;
};

/* The rules are flattened in the LNet structures as shown below */
struct lnet_net {
...
	/* defines the relative priority of this net compared to others in the system */
	__u32 net_priority;
...
};


struct lnet_remotenet {
...
	/* defines the relative priority of the remote net compared to other remote nets */
	__u32 lrn_priority;
...
};

struct lnet_ni {
...
	/* defines the relative priority of this NI compared to other NIs in the net */
	__u32 ni_priority;
...
};

struct lnet_peer_ni {
...
	/* defines the relative peer_ni priority compared to other peer_nis in the peer */
	__u32 lpni_priority;

	/* defines the list of local NID(s) (>=1) which should be used as the source */
	union {
		lnet_nid_t nid;
		lnet_nid_t *nids;
	} lpni_pref;

	/*
	 *	defines the list of router NID(s) to be used when sending to this peer NI
	 *	if the peer NI is remote
     */
	lnet_nid_t *lpni_rte_nids;
...
};

/* UDSPs will be passed to the kernel via IOCTL */
#define IOC_LIBCFS_ADD_UDSP _IOWR(IOC_LIBCFS_TYPE, 106, IOCTL_CONFIG_SIZE)
#define IOC_LIBCFS_DEL_UDSP _IOWR(IOC_LIBCFS_TYPE, 107, IOCTL_CONFIG_SIZE)
#define IOC_LIBCFS_GET_UDSP _IOWR(IOC_LIBCFS_TYPE, 108, IOCTL_CONFIG_SIZE)
#define IOC_LIBCFS_GET_UDSP_SIZE _IOWR(IOC_LIBCFS_TYPE, 109, IOCTL_CONFIG_SIZE)

Kernel IOCTL Handling

/*
 * api-ni.c will be modified to handle adding a UDSP 
 * All UDSP operations are done under mutex and exclusive spin
 * lock to avoid constructs changing during application of the
 * policies.
 */
int
LNetCtl(unsigned int cmd, void *arg)
{
...
	case IOC_LIBCFS_ADD_UDSP: {
		struct lnet_ioctl_config_udsp *config_udsp = arg;
		mutex_lock(&the_lnet.ln_api_mutex);
		/*
		 * add and do initial flattening of the UDSP into
		 * internal structures.
		 */
		rc = lnet_add_and_flatten_udsp(config_udsp);
		mutex_unlock(&the_lnet.ln_api_mutex);
		return rc;
	}

	case IOC_LIBCFS_DEL_UDSP: {
		struct lnet_ioctl_config_udsp *del_udsp = arg;
		mutex_lock(&the_lnet.ln_api_mutex);
		/*
		 * delete the rule identified by index
		 */
		rc = lnet_del_udsp(del_udsp->udsp_idx);
		mutex_unlock(&the_lnet.ln_api_mutex);
		return rc;
	}

	case IOC_LIBCFS_GET_UDSP_SIZE: {
		struct lnet_ioctl_config_udsp *get_udsp = arg;
		mutex_lock(&the_lnet.ln_api_mutex);
		/*
		 * get the UDSP size specified by idx
		 */
		rc = lnet_get_udsp_num(get_udsp);
		mutex_unlock(&the_lnet.ln_api_mutex);
		return rc;
	}

	case IOC_LIBCFS_GET_UDSP: {
		struct lnet_ioctl_config_udsp *get_udsp = arg;
		mutex_lock(&the_lnet.ln_api_mutex);
		/*
		 * get the udsp at index provided. Return -ENOENT if
		 * no more UDSPs to get
		 */
		rc = lnet_get_udsp(get_udsp, get_udsp->udsp_idx);
		mutex_unlock(&the_lnet.ln_api_mutex);
		return rc;
	}
...
}

IOC_LIBCFS_ADD_UDSP

The handler for IOC_LIBCFS_ADD_UDSP will perform the following operations:

  1. De-serialize the rules passed in from user space
  2. Make sure the rule is unique. If an identical rule already exists, the add is a no-op.
  3. If a rule exists which has the same matching criteria but different action, then update the rule
  4. Insert the rule and assign it an index.
  5. Iterate through all LNet constructs and apply the rules

Application of the rules will be done under api_mutex_lock and the exclusive lnet_net_lock to avoid having the peer or local net lists changed while the rules are being applied.

The rules are iterated and applied whenever:

  1. A local network interface is added.
  2. A remote peer/peer_net/peer_ni is added

IOC_LIBCFS_DEL_UDSP

The handler for IOC_LIBCFS_DEL_UDSP will:

  1. delete the rule with the specified index if it exists.
  2. Iterate through all LNet constructs and apply the updated set of rules

When the updated rule set is applied all traces of deleted or modified rules are removed from the LNet constructs.

IOC_LIBCFS_GET_UDSP_SIZE

Return the size of the UDSP specified by index.

IOC_LIBCFS_GET_UDSP

The handler for IOC_LIBCFS_GET_UDSP will serialize the rules on the UDSP list.

The GET call is done in two stages. First it makes a call to the kernel to determine the size of the UDSP at index. User space then allocates a block big enough to accommodate the UDSP and makes another call to actually get the UDSP.

User space iteratively calls the UDSPs until there are no more UDSPs to get.

User space prints the UDSPs in the YAML format specified here.

TODO: Another option is to have IOC_LIBCFS_GET_UDSP_NUM, which gets the total size needed for all UDSPs, and then user space can make one call to get all the UDSPs. However, this complicates the marshaling function. User space would also need to handle cases where the size of the UDSPs is too large for one call. The above proposal does more iterations to get all the UDSPs, but the code should be simpler. And since the number of UDSPs is expected to be small, the above proposal should be fine.
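The two-stage GET flow can be sketched with the kernel calls stubbed out, so the control flow is visible end to end. The mock_get_udsp_size() and mock_get_udsp() helpers are stand-ins for the IOC_LIBCFS_GET_UDSP_SIZE and IOC_LIBCFS_GET_UDSP ioctls; the static table and all names are invented for the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in UDSP store; the real data lives in the kernel. */
static const char *udsp_table[] = { "udsp-0", "udsp-1" };
#define NUM_UDSPS 2

/* Stage 1: query the size of the UDSP at idx. */
static int mock_get_udsp_size(int idx, size_t *size)
{
	if (idx >= NUM_UDSPS)
		return -ENOENT;
	*size = strlen(udsp_table[idx]) + 1;
	return 0;
}

/* Stage 2: fetch the UDSP at idx into a caller-sized buffer. */
static int mock_get_udsp(int idx, char *buf, size_t size)
{
	if (idx >= NUM_UDSPS)
		return -ENOENT;
	memcpy(buf, udsp_table[idx], size);
	return 0;
}

/* Iterate: size query, allocate, fetch; stop on -ENOENT.
 * Returns the number of UDSPs retrieved, or a negative errno. */
static int show_all_udsps(void)
{
	int idx = 0, rc;
	size_t size;
	char *buf;

	for (;;) {
		rc = mock_get_udsp_size(idx, &size);
		if (rc == -ENOENT)
			break;
		buf = malloc(size);
		if (!buf)
			return -ENOMEM;
		rc = mock_get_udsp(idx, buf, size);
		free(buf);	/* a real caller would print the YAML here */
		if (rc)
			return rc;
		idx++;
	}
	return idx;
}
```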

Kernel Selection Algorithm Modifications

/*
 * select an NI from the Nets with highest priority
 */
struct lnet_ni *
lnet_find_best_ni_on_local_net(struct lnet_peer *peer, int md_cpt)
{
...
	list_for_each_entry(peer_net, &peer->lp_peer_nets, lpn_peer_nets) {
	...
		struct lnet_net *net;


		/* consider only the highest priority peer_net */
		peer_net_prio = peer_net->lpn_priority;
		if (peer_net_prio > best_peer_net_prio)
			continue;
		else if (peer_net_prio < best_peer_net_prio)
			best_peer_net_prio = peer_net_prio;
		net = lnet_get_net_locked(peer_net->lpn_net_id);
		if (!net)
			continue;

		/*
		 * look only at the Nets with the highest priority and disregard
		 * nets which have lower priority. Nets with equal priority are
		 * examined and the best_ni is selected from amongst them.
		 */
		net_prio = net->net_priority;
		if (net_prio > best_net_prio)
			continue;
		else if (net_prio < best_net_prio) {
			best_net_prio = net_prio;
			best_ni = NULL;
		}
		best_ni = lnet_find_best_ni_on_spec_net(best_ni, peer,
							best_peer_net,
							md_cpt, false);
	...
	}
...
}

/*
 * select the NI with the highest priority 
 */
static struct lnet_ni *
lnet_get_best_ni(struct lnet_net *local_net, struct lnet_ni *best_ni,
				 struct lnet_peer *peer, struct lnet_peer_net *peer_net,
				 int md_cpt)
{
...
	ni_prio = ni->ni_priority;

	if (ni_fatal) {
		continue;
	} else if (ni_healthv < best_healthv) {
		continue;
	} else if (ni_healthv > best_healthv) {
		best_healthv = ni_healthv;
		if (distance < shortest_distance)
			shortest_distance = distance;
	/*
	 * if this NI is lower in priority than the one already set then
	 * discard it, otherwise use it and set the best priority so far
	 * to this NI's.
	 */
	} else if (ni_prio > best_ni_prio) {
		continue;
	} else if (ni_prio < best_ni_prio) {
		best_ni_prio = ni_prio;
	}

...
}

/*
 * When a UDSP rule associates local NIs with remote NIs, the list of local NIs NIDs
 * is flattened to a list in the associated peer_NI. When selecting a peer NI, the
 * peer NI with the corresponding preferred local NI is selected.
 */
bool
lnet_peer_is_pref_nid_locked(struct lnet_peer_ni *lpni, lnet_nid_t nid)
{
...
}

/*
 * select the peer NI with the highest priority first and then check
 * if it's preferred.
 */ 
static struct lnet_peer_ni *
lnet_select_peer_ni(struct lnet_send_data *sd, struct lnet_peer *peer,
					struct lnet_peer_net *peer_net)
{
...
	ni_is_pref = lnet_peer_is_pref_nid_locked(lpni, best_ni->ni_nid);

	lpni_prio = lpni->lpni_priority;

	if (lpni_healthv < best_lpni_healthv)
		continue;
	/*
	 * select the NI with the highest priority.
	 */
	else if (lpni_prio > best_lpni_prio)
		continue;
	else if (lpni_prio < best_lpni_prio)
		best_lpni_prio = lpni_prio;
	/*
	 * select the NI which has the best_ni's NID in its preferred list
	 */
	else if (!preferred && ni_is_pref)
		preferred = true;
...
} 


static int
lnet_handle_find_routed_path(struct lnet_send_data *sd,
							 lnet_nid_t dst_nid,
							 struct lnet_peer_ni **gw_lpni,
							 struct lnet_peer **gw_peer)
{
...
	lpni = lnet_find_peer_ni_locked(dst_nid);
	peer = lpni->lpni_net->lpn_peer;
	list_for_each_entry(peer_net, &peer->lp_peer_nets, lpn_peer_nets) {
		peer_net_priority = peer_net->lpn_priority;
		if (peer_net_priority > peer_net_best_priority)
			continue;
		else if (peer_net_priority < peer_net_best_priority)
			peer_net_best_priority = peer_net_priority;
		lpni = NULL;
		while ((lpni = lnet_get_next_peer_ni_locked(peer, peer_net, lpni)) != NULL) {
			/* find best gw for this lpni */
			lpni_prio = lpni->lpni_priority;
			if (lpni_prio > lpni_best_prio)
				continue;
			else if (lpni_prio < lpni_best_prio)
				lpni_best_prio = lpni_prio;


			/*
			 * lnet_find_route_locked will be changed to consider the list of
			 * gw NIDs on the lpni
			 */
			gw = lnet_find_route_locked(NULL, lpni, sd->sd_rtr_nid);
			...
			/*
			 * if gw is MR then select best_NI. Increment the sequence number of
			 * the gw NI for Round Robin selection.
		 	 */
			...
		}
	}
...
}

User Space Design

UDSP Marshaling

After a UDSP is parsed in user space it needs to be marshaled and sent to the kernel. The kernel de-marshals the data and stores it in its own data structures. A UDSP is formed of the following pieces of information:

  1. Index: The index of the UDSP to insert or delete
  2. Source Address expression: A dot expression describing the source address range
  3. Net of the Source: A net id of the source
  4. Destination Address expression: A dot expression describing the destination address range
  5. Net of the Destination: A net id of the destination
  6. Router Address expression: A dot expression describing the router address range
  7. Net of the Router: A net id of the router
  8. Action Type: An enumeration describing the action type.
  9. Action: A structure describing the action if the UDSP is matched.

The data flow of a UDSP is as follows: the UDSP is parsed by lnetctl, marshaled by the DLC library, passed to the kernel via ioctl, and stored on the kernel policy list.

DLC APIs

The DLC library will provide the outlined APIs to expose a way to create, delete and show rules.

Once rules are created and stored in the kernel, they are assigned an ID. This ID is returned and shown in the show command, which dumps the rules. This ID can be referenced later to delete a rule. The process is described in more details below.

/*
 * lustre_lnet_udsp_str_to_action
 * 	Given a string format of the action, convert it to an enumerated type
 * 		action - string format for the action.
 */
enum lnet_udsp_action_type lustre_lnet_udsp_str_to_action(char *action);

/*
 * lustre_lnet_add_udsp
 *   Add a selection policy.
 *		src - source NID descriptor
 *		dst - destination NID descriptor
 *		rte - router NID descriptor
 *		type - action type
 * 		action - union of the action
 *      idx - the index at which to insert the rule
 * 		seq_no - sequence number of the request
 * 		err_rc - [OUT] struct cYAML tree describing the error. Freed by
 * 				 caller
 */
int lustre_lnet_add_udsp(char *src, char *dst, char *rte,
						 enum lnet_udsp_action_type type,
						 union udsp_action action, unsigned int idx,
						 int seq_no, struct cYAML **err_rc);
 
/*
 * lustre_lnet_del_udsp
 *   Delete a net selection policy.
 *      idx - the index to delete
 * 		seq_no - sequence number of the request
 * 		err_rc - [OUT] struct cYAML tree describing the error. Freed by
 * 				 caller
 */
int lustre_lnet_del_udsp(int idx, int seq_no, struct cYAML **err_rc);
 
/*
 * lustre_lnet_show_udsp
 *   Show configured net selection policies.
 * 		seq_no - sequence number of the request
 * 		show_rc - [OUT] struct cYAML tree containing the UDSPs
 * 		err_rc - [OUT] struct cYAML tree describing the error. Freed by
 * 				 caller   
 */
int lustre_lnet_show_udsp(int seq_no, struct cYAML **show_rc,
						  struct cYAML **err_rc);

Userspace Structures

/* each NID range is defined as a net_id and an ip range */
struct lnet_ud_nid_descr {
	__u32 ud_net_id;
	struct list_head ud_ip_range;
};

/* UDSP action types */
enum lnet_udsp_action_type {
	EN_LNET_UDSP_ACTION_PRIORITY = 0,
	EN_LNET_UDSP_ACTION_NONE = 1,
};

 /*
 * a UDSP rule can have up to three user defined NID descriptors
 * 		- src: defines the local NID range for the rule
 * 		- dst: defines the peer NID range for the rule
 * 		- rte: defines the router NID range for the rule
 *
 * An action union defines the action to take when the rule
 * is matched
 */ 
struct lnet_udsp {
	struct list_head udsp_on_list;
	__u32 idx;
	struct lnet_ud_nid_descr *udsp_src;
	struct lnet_ud_nid_descr *udsp_dst;
	struct lnet_ud_nid_descr *udsp_rte;
	enum lnet_udsp_action_type udsp_action_type;
	union {
		__u32 udsp_priority;
	} udsp_action;
};

Marshaled Structures

struct cfs_range_expr {
	struct list_head re_link;
	__u32 re_lo;
	__u32 re_hi;
	__u32 re_stride;
};
	
struct lnet_ioctl_udsp {
	__u32 iou_idx;
	enum lnet_udsp_action_type iou_action_type;
	union {
		__u32 priority;
	} iou_action;
	__u32 iou_src_dot_expr_count;
	__u32 iou_dst_dot_expr_count;
	__u32 iou_rte_dot_expr_count;
	char iou_bulk[0];
};

The address is expressed as a list of cfs_range_expr structures, which need to be marshaled. For an IP address there are 4 of these structures; other types of addresses can have a different number (as an example, Gemini has only one). The corresponding iou_[src|dst|rte]_dot_expr_count is set to the number of expressions describing the address. Each expression is then flattened into the structure. They have to be flattened in the order defined: SRC, DST, RTE.
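The flattening order can be sketched as follows. Here struct range_expr and both helpers are simplified stand-ins for cfs_range_expr and the real marshaling code, which walks the ud_ip_range lists; the counts array models the iou_[src|dst|rte]_dot_expr_count fields.

```c
#include <assert.h>
#include <string.h>

/* Simplified stand-in for cfs_range_expr. */
struct range_expr { unsigned int lo, hi, stride; };

/* Copy nexpr expressions into buf, returning the bytes written. */
static size_t marshal_exprs(const struct range_expr *exprs, int nexpr,
			    char *buf)
{
	size_t len = (size_t)nexpr * sizeof(*exprs);

	if (len)
		memcpy(buf, exprs, len);
	return len;
}

/* Flatten src, dst and rte expression arrays back to back, in the
 * order the design requires (SRC, DST, RTE), recording the
 * per-descriptor expression counts. */
static size_t marshal_udsp_bulk(const struct range_expr *src, int nsrc,
				const struct range_expr *dst, int ndst,
				const struct range_expr *rte, int nrte,
				unsigned int counts[3], char *buf)
{
	size_t off = 0;

	counts[0] = nsrc;
	counts[1] = ndst;
	counts[2] = nrte;
	off += marshal_exprs(src, nsrc, buf + off);
	off += marshal_exprs(dst, ndst, buf + off);
	off += marshal_exprs(rte, nrte, buf + off);
	return off;
}
```

De-marshaling walks the same bulk area in the same order, consuming counts[i] expressions per descriptor.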

The kernel will receive the marshaled data and form its internal structures. The functions to marshal and de-marshal should be straightforward. Note that user space and kernel space use the same structures, which will be defined in a common location. For this reason the functions to marshal and de-marshal will be shared.

Marshalling and de-marshalling functions

Common functions that can be called from user space and kernel space will be created to marshal and de-marshal the UDSPs:

/*
 * lnet_get_udsp_size()
 * 	Given the UDSP return the size needed to flatten the UDSP
 */
int lnet_get_udsp_size(struct lnet_udsp *udsp);

/*
 * lnet_udsp_marshal()
 *	Marshal the UDSP pointed to by udsp into the memory block provided.
 *	In order for this API to work in both kernel and user space the bulk
 *	pointer needs to be passed in. When this API is called in the kernel,
 *	it is expected that the bulk memory is allocated in user space. The
 *	API is intended to be called from the kernel to marshal the rules
 *	before sending them to user space, and from user space to marshal the
 *	UDSP before sending it to the kernel.
 *		udsp [IN] - udsp to marshal
 *		bulk_size [IN] - size of bulk
 *		bulk [OUT] - allocated block of memory where the serialized
 *			     rules are stored
 */
int lnet_udsp_marshal(struct lnet_udsp *udsp, __u32 bulk_size, void __user *bulk);
 
/*
 * lnet_udsp_demarshal()
 *	Given a bulk containing a single UDSP, demarshal it and populate the
 *	udsp structure provided.
 *		bulk [IN] - memory block containing serialized rules
 *		bulk_size [IN] - size of bulk memory block
 *		udsp [OUT] - preallocated struct lnet_udsp
 */
int lnet_udsp_demarshal(void __user *bulk, __u32 bulk_size, struct lnet_udsp *udsp);
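The intent of the pair can be sketched in Python (a simulation of the flattening scheme, not the kernel code): the expressions for SRC, DST and RTE are flattened in that fixed order into one bulk buffer, preceded by their counts, and demarshalling reverses the process:

```python
import struct

# Sketch of the marshal/demarshal round trip. Each range expression is
# packed as three __u32 values (re_lo, re_hi, re_stride); the three
# leading counts play the role of iou_src/dst/rte_dot_expr_count.

EXPR_FMT = '<III'   # re_lo, re_hi, re_stride
HDR_FMT = '<III'    # src, dst, rte expression counts

def udsp_marshal(src, dst, rte):
    """Flatten expression lists in the fixed order SRC, DST, RTE."""
    hdr = struct.pack(HDR_FMT, len(src), len(dst), len(rte))
    bulk = b''.join(struct.pack(EXPR_FMT, *e) for e in src + dst + rte)
    return hdr + bulk

def udsp_demarshal(buf):
    """Recover the three expression lists from a marshalled buffer."""
    counts = struct.unpack_from(HDR_FMT, buf, 0)
    off, lists = struct.calcsize(HDR_FMT), []
    for n in counts:
        exprs = []
        for _ in range(n):
            exprs.append(struct.unpack_from(EXPR_FMT, buf, off))
            off += struct.calcsize(EXPR_FMT)
        lists.append(exprs)
    return tuple(lists)

# 4 expressions describe an IP address; a rule with no rte descriptor
# simply carries a count of 0.
src = [(10, 10, 1), (0, 0, 1), (1, 3, 1), (0, 255, 1)]
dst = [(192, 192, 1), (168, 168, 1), (0, 0, 1), (1, 128, 1)]
buf = udsp_marshal(src, dst, [])
assert udsp_demarshal(buf) == (src, dst, [])
```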

Requirements Covered by the Design

cfg-100, cfg-105, cfg-110, cfg-115, cfg-120, cfg-125, cfg-130, cfg-135, cfg-140, cfg-160, cfg-165

Unit Testing

This section will be updated as development continues. The goal is to flesh out the unit test cases with as much detail as possible; it might be better to have pointers to the actual test scripts in the test case table below. For now, an example of a pseudo-coded test script is outlined below.

Common Functions

This section defines common functions which will be used in many test cases. They are written in pseudo-python.

def add_verify_net(net_configs, destination):
	# all commands should be executed on destination
	redirect_to_dest(destination)

	for cfg in net_configs:
		lnetctl net add --net cfg['net'] --if cfg['intf']
		show_output = lnetctl net show
		if (cfg['net'] not in show_output) or
		   (show_output[cfg['net']].if_name != cfg['intf']):
			return FAILED

	return SUCCESS

def add_verify_policy(network_type, priority, destination):
	# all commands should be executed on destination
	redirect_to_dest(destination)

	lnetctl policy add --src *@network_type --priority priority
	show_output = lnetctl policy show
	if (network_type not in show_output) or
	   (show_output[network_type].priority != priority):
		return FAILED

	show_output = lnetctl net show --net network_type
	if (not show_output) or
	   (show_output[network_type].priority != priority):
		return FAILED

	return SUCCESS
 
def generate_traffic(peer1, peer2):
	run_lnet_selftest(peer1, peer2)

def get_traffic_stats(peer1):
	# get traffic statistics and return them

def verify_traffic_on(stats1, stats2, net):
	# make sure that the bulk of the traffic is on net

Test Cases

Network Rule

Add and verify local network policy.

peer1_address = argv[1]
peer2_address = argv[2]
net = argv[3]
intf = argv[4]
net2 = argv[5]
intf2 = argv[6]

peer1 = make_nid(peer1_address, net)

net_cfg = [{'net': net, 'intf': intf}, {'net': net2, 'intf': intf2}]
add_verify_net(net_cfg, peer1)
add_verify_policy(net, 0, peer1)

Verify traffic goes over the network with the highest priority

# script should grab its input from user (can be automated)
peer1_address = argv[1]
peer2_address = argv[2]
net = argv[3]
intf = argv[4]
net2 = argv[5]
intf2 = argv[6]

peer1 = make_nid(peer1_address, net)
peer2 = make_nid(peer2_address, net)

net_cfg = [{'net': net, 'intf': intf}, {'net': net2, 'intf': intf2}]
add_verify_net(net_cfg, peer1)
add_verify_policy(net, 0, peer1)
add_verify_net(net_cfg, peer2)
add_verify_policy(net, 0, peer2)
stats1 = get_traffic_stats(peer1)
generate_traffic(peer1, peer2)
stats2 = get_traffic_stats(peer1)
verify_traffic_on(stats1, stats2, net)

Verify traffic goes over the network with the healthiest local NI even though it might not have the highest priority

Delete local network policy and verify it has been deleted

Verify traffic returns to normal pattern when network policy is deleted

Error handling: Add policy for non-existent network

Add and verify a remote network policy, i.e. messages will need to be routed to that network

Verify traffic is routed to the remote network with the highest priority

Verify traffic is routed to another available network when the highest priority remote network is not reachable.

Delete remote network policy and verify it has been deleted

Verify traffic returns to normal pattern when remote network policy is deleted.

Error handling: Add policy for non-existent remote network
NID Rules

Add and verify local NID rule

Verify traffic goes over the local NID with the highest priority


Verify traffic goes over the healthiest NID even if it has lower priority

Delete NID policy and verify it has been deleted


Verify traffic goes back to regular pattern after NID policy is deleted.

Error handling: Add policy for non-existent NID


Repeat the above tests for remote NIDs

NID Pair Rules

Add and verify NID Pair Rule

TODO: how do you verify that a NID Pair rule has been applied correctly? We need to show the preferred NID list in the show command. This also applies to Router Rules.


Verify traffic goes over the preferred Local NIDs

Delete NID pair rule and verify it has been deleted

Verify traffic goes back to regular pattern after NID Pair policy is deleted.

Error handling: Add a policy that doesn't match any local NIDs. This should be a no-op
Router Rules

Same set of tests as above but for routers
Subsequent Addition

For each of the policy types, add a policy which doesn't match anything currently configured. Verify that the policy is added regardless

Add an LNet construct (Net, NI, Route) which matches an existing policy. Verify that the policy has been applied to the construct

TODO: Show commands like net show, peer show, etc. should be modified to show the result of the policy application.


Verify traffic adheres to policy

Delete the LNet construct. Verify that the policy remains.

Dynamic Policy Addition

Run traffic.

For each of the policy types add a policy which should alter traffic

Verify traffic patterns change when policy is added.

Policy application order

Add all types of policies. They all should match and be applied. Verify.

Run traffic.

Verify that policies are applied on traffic in the order of operations defined here.

Dynamic policy Deletion

Add all types of policies.

Run traffic

Verify that policies are applied on traffic in the order of operations defined.

Delete the policy one at a time.

Verify traffic pattern change with each policy deleted.

Use Cases

Preferred Network

If a node can be reached on two LNet networks, it is sometimes desirable to designate a fail-over network. Currently in Lustre there is the concept of High Availability (HA), which allows servicenode NIDs to be defined as described in the Lustre manual, section 11.2. By using the syntax described in that section, two NIDs to the same peer can also be defined. However, this approach suffers from a current limitation in the Lustre software, where the NIDs are exposed to layers above LNet. It is preferable to keep network failure handling contained within LNet and only let Lustre worry about defining HA.

Given this, it is desirable to have two LNet networks defined on a node, each of which could have multiple interfaces, and then have a way to tell LNet to always use one network until it is no longer available, i.e. all interfaces in that network are down.

In this manner we separate the functionality of defining fail-over pairs from defining fail-over networks.
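Using the lnetctl syntax sketched in the unit-test section above (the exact command names and flags are illustrative at this stage of the design), the fail-over preference might be configured as:

```shell
# Prefer o2ib for all traffic; o2ib1 is used only if every o2ib
# interface becomes unhealthy (health always trumps priority).
# Command syntax follows the pseudo test scripts and is illustrative.
lnetctl policy add --src "*@o2ib" --priority 0
lnetctl policy add --src "*@o2ib1" --priority 1
```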

Preferred NIDs

In a scenario where servers are being upgraded with new interfaces to be used in Multi-Rail, it is possible to add interfaces, for example MLX EDR interfaces, to the server. The user might want existing QDR clients to continue using the QDR interface, while new clients can use the EDR interface or even both interfaces. This behaviour can be achieved by specifying rules on the clients that prefer specific interfaces.

Preferred local/remote NID pairs

This is a finer-tuned method of specifying an exact path: not only is a priority assigned to a local or remote interface, but concrete pairs of interfaces are marked as most preferred. A peer interface can be associated with multiple local interfaces if necessary, giving an N:1 relationship between local and remote interfaces.
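A NID pair rule could be expressed by supplying both a source and a destination descriptor; the --dst flag below is an assumption modelled on the udsp_src/udsp_dst descriptors defined earlier, and all addresses are hypothetical:

```shell
# Hypothetical NID pair rule: when talking to peers in 10.10.10.[1-8],
# prefer the local interface 10.10.10.2. Flag names are illustrative.
lnetctl policy add --src "10.10.10.2@o2ib" --dst "10.10.10.[1-8]@o2ib" --priority 0
```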

Refer to Olaf's LUG 2016/LAD 2016 PPT for more context.

Preferred Routers

Client sets A and B are all configured on the same LNet network, for example o2ib. The servers are on a different LNet network, o2ib2. Due to the underlying network topology it is more efficient to route traffic from Client set A over Router set A and from Client set B over Router set B, because the links within each set have more bandwidth than the links between sets. UDSPs can be configured on the clients to specify the preferred set of router NIDs.
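Such a rule would use the router descriptor; the --rte flag below mirrors the udsp_rte field defined earlier and is likewise an assumption, as are the addresses:

```shell
# Hypothetical router rule on Client set A: prefer Router set A
# (10.10.20.[1-4]) when reaching the servers on o2ib2.
lnetctl policy add --dst "*@o2ib2" --rte "10.10.20.[1-4]@o2ib" --priority 0
```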

Node Types

Based on , there is a need to select an interface based on the destination portal type.

TODO: This will need a new type of policy. However, I believe we might be crossing a gray area here, since LNet would need some "understanding" of portal types. Another solution, suggested by Andreas Dilger: why not just configure the MDS with NID1 and the OSS with NID2, so the client won't even know that they are on the same node?

Reference Links

https://www.ece.tufts.edu/~karen/classes/final_presentation/Dragonfly_Topology_Long.pptx