Introduction

It is sometimes desirable to fine-tune the selection of local/remote NIs used for communication. For example, currently if there are two networks, an OPA network and an MLX network, both will be used. In particular, if the traffic volume is low, the credits criteria will be equivalent between the nodes and both networks will be used in round robin. However, the user might want to use one network for all traffic and keep the other network free unless the first network goes down.

User Defined Selection Policies (UDSP) will allow this type of control. 

UDSPs are configured from lnetctl via either the command line or YAML config files and then passed to the kernel. Policies are applied to all local networks and remote peers and stored in the kernel. During the selection process the policies are examined as part of the selection algorithm.

List of Terms

LNet Construct: a local net (lnet_net), local NI (lnet_ni), peer net (lnet_peer_net), or peer NI (lnet_peer_ni).
Selection Algorithm: the algorithm which selects a local NI and a peer NI, or a router NI, to send a message to.
Selection Criteria: the criteria used by the Selection Algorithm to select a local NI, peer NI or router NI for message sending.
Selection Priority: a priority applied by the user on an LNet construct; it becomes one of the Selection Criteria.
UDSP Matching Criteria: a set of expressions which describe the LNet constructs which the UDSP refers to.
UDSP Action: an action to take effect on an LNet construct once the construct matches the UDSP matching criteria.
UDSP Instantiation: the process of matching and applying the UDSP action on an LNet construct.

Conceptual Overview

UDSPs are used to finely control traffic. To achieve this optimally, the policies cannot be examined on the fast path with every message being sent. Instead, the policies shall be instantiated on LNet constructs. LNet constructs are: Local Nets, Local NIs, Peer Nets and Peer NIs. Once a policy is instantiated on an LNet construct, meaning specific fields in the LNet construct structure are filled, these fields are examined in the selection algorithm.

Currently, the policies control the preference of some constructs over others during the selection algorithm.

Below is an overview diagram of how the UDSPs integrate in the system.

[Overview Diagram]

  1. The admin enters UDSPs either through YAML or through the CLI.
  2. UDSPs are parsed and stored in a list.
  3. UDSPs are marshalled and sent to the kernel.
  4. The kernel unmarshals and stores the UDSPs locally on a list.
  5. The kernel instantiates UDSPs on LNet constructs.
  6. LNet constructs created dynamically will have the UDSPs instantiated on them.
  7. The admin can request a show of all UDSPs in the system.
  8. The kernel marshals the UDSPs and sends them to user space through IOCTL.
  9. lnetctl unmarshals and displays the UDSPs in YAML format.

UDSP Structure

A UDSP consists of two parts:

  1. The matching criteria
    1. The matching criteria is used to match an LNet construct against the policy
  2. Policy action
    1. The policy action is the action taken on the LNet construct when the policy is matched.

UDSP Rule Types

Network Rules

These rules define the relative priority of the networks against each other; 0 is the highest priority. Networks with higher priorities will be selected during the selection algorithm, unless the network has no healthy interfaces. If there exists a usable interface on another network which is healthier than any available on the current network, then that one will be used. Health always trumps all other criteria.

Matching Criteria

In order to match a network rule, the network type and an expression representing the network number must be provided. Example:

tcp1 # match tcp1 exactly
tcp[1-3] # match tcp1, tcp2 and tcp3
tcp* # match any tcp network

The policy can apply to a local or a remote network, depending on the specification in the command.

Matching Action

When a network matches the policy, the action is applied to the network LNet construct. The only action available is setting the selection priority of the network. When the selection algorithm is iterating through the available networks, the one with the highest selection priority is selected.
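As an illustration (the priority values are arbitrary, using the command line syntax defined later in this document), the following two rules make the selection algorithm prefer tcp1 over tcp2 for as long as tcp1 has healthy interfaces:

# prefer tcp1 (priority 0) over tcp2 (priority 10)
lnetctl policy add --src tcp1 --priority 0
lnetctl policy add --src tcp2 --priority 10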

NID Rules

These rules define the relative priority of individual NIDs; 0 is the highest priority. Once a network is selected, the NID with the highest priority is preferred. Note that NID priority is ranked below health. For example, if there are two NIDs, NID-A and NID-B, where NID-A has higher priority but a lower health value, NID-B will still be selected. In that sense the policies act as a hint to guide the selection algorithm.

Matching Criteria

A NID expression is used to match the policy against local or remote NIDs. Example:

10.10.10.2@tcp1 # match the exact nid
10.10.10.[2-3]@tcp1 # match 10.10.10.2@tcp1 and 10.10.10.3@tcp1
10.10.10.[2-3]@tcp* # match 10.10.10.2 and 10.10.10.3 on any tcp network

The policy can apply to local or remote NIDs, depending on the specification in the command.

Matching Action

When a NID matches the policy, the action specified in the rule is applied to the NI LNet construct. The only action available is setting the selection priority of the NID. When the selection algorithm is iterating through the available NIs, the one with the highest selection priority is selected.
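For example (the NIDs are arbitrary, using the command line syntax defined later in this document), the following rule makes the selection algorithm prefer the .2 and .3 peer NIDs on tcp1 over other NIDs on that network:

lnetctl policy add --dst 10.10.10.[2-3]@tcp1 --priority 0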

NID Pair Rules

These rules define preferred paths. Once a local NI is selected, as this is the first step in the selection algorithm, the peer NI which has the local NI on its preferred list is selected. The end result of this strategy is an association between a local NI and a peer NI (or a group of them).

Matching Criteria

A NID pair rule takes two expressions describing the source and destination NIDs which should be preferred. The matching rules for each of the supplied NID expressions are the same as for the NID Rules above.

Matching Action

The remote NIDs available in the system are examined. Each remote NID which matches the destination NID expression in the policy will have a set of local NIDs, which match the source NID expression in the policy, added to its preferred local NID list. If no local NID matches the source NID expression in the policy, then the action is a no-op.
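As an illustrative example (the NIDs are arbitrary), the following rule adds local NID 10.10.10.1@tcp1 to the preferred local NID list of the matching peer NIs, creating an association between that local NI and those peer NIs:

lnetctl policy add --src 10.10.10.1@tcp1 --dst 10.10.10.[2-3]@tcp1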

Router Rules

Router Rules define which set of routers to use when sending messages to a destination NID (or NIDs). They can also be used to identify preferred routers for a primary source NID.

When defining a network there can be paths which are more optimal than others. To have more control over the path traffic takes, admins configure interfaces on different networks and split up the router pools among the networks. However, this results in complex configuration which is hard to maintain and error prone. It is much more desirable to configure all interfaces on the same network, and then define which routers to use when sending to a remote peer or from a source peer. Router Rules allow this functionality.

Matching Criteria

There are two mutually exclusive matching criteria: source NID and destination NID.

A router rule can take two expressions describing the destination NIDs and the router NIDs which should be preferred when sending to the matching destination NIDs. The matching rules for each of the supplied NID expressions are the same as for the NID Rules above.

A router rule can also take two expressions describing the source network and the router NIDs which should be preferred when sending from the specified source network. The reason a source network is specified, as opposed to a source NID descriptor, is an implementation limitation: at the point of router selection we have not yet determined the source NI to send out of. However, by examining each gateway as we select the route, we are able to look up the local network for that gateway and examine the preferred list on that gateway. If there is an explicit requirement to refine the traffic control to be per NID as opposed to per network, then we can update the implementation in its own patch.

Only the source NID descriptor or the destination NID descriptor can be provided as a matching criterion, but not both.

Matching Action

The remote NIDs available in the system are examined. Each remote NID which matches the destination or source NID expression in the policy will have a set of router NIDs, which match the router NID expression in the policy, added to its preferred router NID list. If no router NID matches the router NID expression in the policy, then the action is a no-op.
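As an illustration (the NIDs are arbitrary), the first rule below prefers two routers for traffic destined to the matching peer NIDs, while the second prefers the same routers for all traffic sent from the o2ib network. As noted above, the two matching criteria cannot be combined in a single rule:

# prefer routers 10.10.10.[1-2]@o2ib when sending to the matching peers
lnetctl policy add --dst 10.10.10.[10-20]@o2ib1 --rte 10.10.10.[1-2]@o2ib

# prefer the same routers for all traffic sent from the o2ib network
lnetctl policy add --src o2ib --rte 10.10.10.[1-2]@o2ib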

UDSP application and Selection Priority

It is important to clarify the order in which UDSP rules are applied versus the selection priority. These are two distinct concepts which need to be kept separate for the simplicity of the design.

UDSPs are applied on a first-matching-rule basis. When there is a set of policies of the same type which apply to the same LNet construct, only the first policy on the list of policies is applied to the matching LNet construct. To keep this concept distinct from the selection priority concept, it shall be referred to as the UDSP Application Order.

The Selection Priority is the Policy Action part of the UDSP as described earlier. This Selection Priority only applies to the LNet construct on which the UDSP is instantiated and is not related to the UDSP Application Order. The Selection Priority of an LNet construct is what the Selection Algorithm uses as one of the criteria for selecting the LNet construct under examination when sending a message.
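To illustrate the distinction (the values are arbitrary), consider the two network rules below. Both match tcp1, but the second rule is inserted at index 0, so it is first on the list and is the one instantiated; tcp1 therefore gets Selection Priority 0, not 5:

# matches all tcp networks, including tcp1
lnetctl policy add --src tcp* --priority 5
# inserted at the head of the list, so it is matched first for tcp1
lnetctl policy add --src tcp1 --priority 0 --idx 0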

UDSP Policy Interactions

Once a UDSP is instantiated on an LNet construct, the UDSP action becomes part of the LNet construct. When the selection algorithm runs on a per-message-send basis, the policy action takes effect. To give a more concrete example, let's say the admin applies a policy which gives the highest Selection Priority to the o2ib0 local network. This policy action translates to the local network's selection priority field being set to 0, which is the highest priority. When the selection algorithm runs it checks the priorities of all local networks and selects o2ib0, since it has the highest Selection Priority.

Policy Actions integrate with the rest of the Selection Criteria used by the selection algorithm. For more details on the selection algorithm, please refer to the initial Multi-Rail HLD.

There is one particular rule which all UDSPs adhere to: health trumps UDSPs. The selection algorithm always selects the healthiest interface to send from or to send to, regardless of whether there is another network and/or interface which has a higher user-assigned Selection Priority. The following diagram gives a simplified overview of the selection algorithm. The HLD linked above is the best place to get more details, as this HLD concentrates on the UDSP design and does not rehash all the details of the selection algorithm.

[Selection Flow Diagram]

The diagram shows the integration of the UDSP provided actions in the logical flow of the selection algorithm.

Order of Operations

It is important to define the order of rule operations, when there are multiple rules that apply to the same construct.

The order is defined by the selection algorithm logical flow:

  1. iterate over all the networks that a peer can be reached on and select the best local network
    1. The remote network with the highest priority is examined
      1. Network Rule
    2. The local network with the highest priority is selected
      1. Network Rule
    3. The local NI with the highest priority is selected
      1. NID Rule
  2. If the peer is a remote peer and has no local networks,
    1. then select the remote peer network with the highest priority
      1. Network Rule
    2. Select the highest priority remote peer_ni on the network selected
      1. NID Rule
    3. Now that the peer's network and NI are decided, select the router in round robin from the source network's and the peer NI's preferred router lists. The preferred list on the source network takes precedence.
      1. Router Rule
  3. Otherwise for local peers, select the peer_ni from the peer.
    1. highest priority peer NI is selected
      1. NID Rule
    2. Select the peer NI which has the local NI selected on its preferred list.
      1. NID Pair Rule

User Interface

Overview of Operations

There are three main operations which can be carried out on UDSPs either from the command line or YAML configuration: add, delete, show.

Add

The UI allows adding a new rule. With the use of the optional idx parameter, the admin can specify where in the rule chain the new rule should be added. By default the rule is appended to the list. Any other value will result in inserting the rule at that position.

When a new UDSP is added the entire UDSP set is re-evaluated. This means all Nets, NIs and peer NIs in the system are traversed and the rules re-applied. This is an expensive operation, but given that UDSP management should be a rare operation, it shouldn't be a problem.

Delete

The UI allows deleting an existing UDSP using its index. The index can be shown using the show command. When a UDSP is deleted the entire UDSP set is re-evaluated. The Nets, NIs and peer NIs are traversed and the rules re-applied.

Show

The UI allows showing existing UDSPs. The format of the YAML output is as follows:

udsp:
    - idx: <unsigned int>
      src: <ip>@<net type>
      dst: <ip>@<net type>
      rte: <ip>@<net type>
      action:
          - priority: <unsigned int>
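For example, after adding a remote NID rule the show output might look as follows (the values are illustrative only):

udsp:
    - idx: 0
      dst: 10.10.10.[2-3]@tcp1
      action:
          - priority: 0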

Command Line Syntax

Below is the command line syntax for managing UDSPs:

# Adding a local network udsp
# if multiple local networks are available, each one can have a priority. 
# The one with the highest priority is preferred
lnetctl policy add --src  <net type><expr>
                   --<action type> <action context sensitive value> 
                   --idx <value>
	--src: is defined in ip2nets syntax. '<net type><expr>' syntax indicates the network.
		   <net type> is one of o2ib, tcp, gni, etc
           <expr> is an ip2nets expression describing a network number.
	--<action type>: 'priority' is the only implemented action type
	--<action context sensitive value>: is a value specific to the action type.
					  For 'priority' it's a value in [0 - 255], with 0 as the highest
                      priority
	--idx: The index of where to insert the rule. If it's larger than the policy list size it's
		   appended to the end of the list. If not specified the default behaviour is to append
		   to the end of the list

# Adding a local NID udsp
# After a local network is chosen, if there are multiple NIs in the network the
# one with highest priority is preferred.
lnetctl policy add --src <Address descriptor>@<net type><expr>
                  --<action type> <action context sensitive value>
				  --idx <value>
	--src: the address descriptor defined in ip2nets syntax as described in the manual
		   <net type><expr> is similar to what has been described previously.
	--<action type>: 'priority' is the only implemented action type
	--<action context sensitive value>: is a value specific to the action type.
					  For 'priority' it's a value in [0 - 255], with 0 as the highest
                      priority
	--idx: The index of where to insert the rule. If it's larger than the policy list it's
		   appended to the end of the list. If not specified the default behaviour is to append
		   to the end of the list

# Adding a remote NID udsp
# select the peer NID with the highest priority.
lnetctl policy add --dst <Address descriptor>@<net type><expr>
                  --<action type> <action context sensitive value>
				  --idx <value>
	--dst: the address descriptor defined in ip2nets syntax as described in the manual
		   <net type><expr> is similar to what has been described previously.
	--<action type>: 'priority' is the only implemented action type
	--<action context sensitive value>: is a value specific to the action type.
					  For 'priority' it's a value in [0 - 255], with 0 as the highest
                      priority
	--idx: The index of where to insert the rule. If it's larger than the policy list it's
		   appended to the end of the list. If not specified the default behaviour is to append
		   to the end of the list

# Adding a NID pair udsp
# When this rule is instantiated the local NIDs which match the rule are added on a list
# on the peer NIs matching the rule. When selecting the peer NI, the one with the 
# local NID being used on its list is preferred.
lnetctl policy add --src <Address descriptor>@<net type><expr>
                  --dst <Address descriptor>@<net type><expr>
				  --idx <value>
	--src: the address descriptor defined in ip2nets syntax as described in the manual
		   <net type><expr> is similar to what has been described previously.
	--dst: the address descriptor defined in ip2nets syntax as described in the manual
		   <net type><expr> is similar to what has been described previously.
           Destination NIDs can be local or remote.
	--idx: The index of where to insert the rule. If it's larger than the policy list it's
		   appended to the end of the list. If not specified the default behaviour is to append
		   to the end of the list

# Adding a Router udsp
# similar to the NID pair udsp. The router NIDs matching the rule are added on a list
# on the peer NIs matching the rule. When sending to a remote peer, the router which
# has its nid on the peer NI list is preferred.
lnetctl policy add --dst <Address descriptor>@<net type><expr>
                  --rte <Address descriptor>@<net type><expr>
				  --idx <value>
	--dst: the address descriptor defined in ip2nets syntax as described in the manual
		   <net type><expr> is similar to what has been described previously.
	--rte: the address descriptor defined in ip2nets syntax as described in the manual
		   <net type><expr> is similar to what has been described previously.
	--idx: The index of where to insert the rule. If it's larger than the policy list it's
		   appended to the end of the list. If not specified the default behaviour is to append
		   to the end of the list

# Adding a Router udsp
# similar to the NID pair udsp. The router NIDs matching the rule are added on a list
# on the local network matching the rule. When sending from one of the constituents of this
# network, then the router which has its nid on the local network list is preferred. If there are two
# UDSPs setting the preference of router selection based on the local network and the peer 
# primary NID, then the preferred list on the local network takes precedence.
lnetctl policy add --src <net type><expr>
                  --rte <Address descriptor>@<net type><expr>
				  --idx <value>
	--src: is defined in ip2nets syntax. '<net type><expr>' syntax indicates the network.
		   <net type> is one of o2ib, tcp, gni, etc
           <expr> is an ip2nets expression describing a network number.
	--rte: the address descriptor defined in ip2nets syntax as described in the manual
		   <net type><expr> is similar to what has been described previously.
	--idx: The index of where to insert the rule. If it's larger than the policy list it's
		   appended to the end of the list. If not specified the default behaviour is to append
		   to the end of the list

# show all policies in the system.
# the policies are dumped in YAML form.
# Each policy is assigned an index.
# The index is part of the policy YAML block
lnetctl policy show

# to delete a policy the index must be specified.
# The normal behaviour then is to first show the list of policies
# grab the index and use it in the delete command.
lnetctl policy del --idx <value>

# generally, the syntax is as follows
lnetctl policy <add | del | show>
  --src: ip2nets syntax specifying the local NID to match
  --dst: ip2nets syntax specifying the remote NID to match
  --rte: ip2nets syntax specifying the router NID to match
  --priority: Priority to apply to rule matches
  --idx: Index of where to insert the rule. By default it appends to
		 the end of the rule list 

Possible Future Actions

As of the time of this writing only the "priority" action shall be implemented. However, it is feasible in the future to implement different actions to be taken when a rule matches.

Drop

We can implement a "drop" rule: any message destined to a specific NID, or received from a specific NID, is dropped. Such a policy can be used to drop messages from untrusted peers. This would be a more general solution than the one provided in LU-11894.

Mirror

All messages destined to or received from a specific peer are mirrored to a different sink. The intention of this is not backup, but debugging traffic, similar in functionality to wireshark. Lustre clients are stateful, so using such a feature wouldn't work for keeping a backup server up to date. However, it would be useful for monitoring traffic. There is an extension to wireshark available to interpret dumps captured by utilities such as tcpdump or ibdump. However, such utilities capture all traffic, which makes interpreting the traffic extremely hard. Mirroring would allow us to isolate and capture traffic between specific peers. This policy can be applied on a server or a router.

Redirect

Redirect messages from a specific NID (or group of NIDs) to a different server. This can be applied on routers, for example, or even servers, to intercept traffic.

YAML Syntax

udsp:
    - idx: <unsigned int>
      src: <ip>@<net type>
      dst: <ip>@<net type>
      rte: <ip>@<net type>
      action:
          - priority: <unsigned int>

Design

UDSP add/del/show commands are parsed in user space and passed down in structure form to the LNet kernel module. All policies are stored in kernel space. All logic to add, delete and match policies will be implemented in kernel space. This complicates the kernel space processing. Arguably, policy maintenance logic is not core to LNet functionality; what is core is the ability to select source and destination networks and NIDs in accordance with user definitions. However, the kernel is able to manage policies much more easily and with fewer potential race conditions than user space.

Design Principles

UDSPs are comprised of two parts:

  1. The UDSP matching criteria
  2. The UDSP action

The matching criteria is what's used to match an LNet construct. The action is what's applied on the LNet construct when the rule is matched.

A rule can be uniquely identified using an internal ID which is assigned by the LNet module when a rule is added and returned to the user space when the UDSPs are shown.

UDSP Storage

UDSPs shall be defined by administrators either via the LNet command line utility, lnetctl, or via a YAML configuration file. lnetctl parses the UDSP and stores it in an intermediary format, which will be marshalled and passed down to the kernel LNet module. LNet shall store these UDSPs on a policy list. Once policies are added to LNet they will be applied on existing networks, NIDs and routers. The advantage of this approach is that UDSPs are not strictly tied to the LNet constructs, i.e. networks, NIDs or routers, but can be applied whenever the LNet constructs are created; and if the LNet constructs are deleted, the UDSPs remain and can be automatically applied at a future time.

This makes configuration easy since a set of UDSPs can be defined, like "all IB networks priority 1", "all Gemini networks priority 2", etc, and when a network is added, it automatically inherits these rules.

Peers are normally not created explicitly by administrators. The ULP requests to send a message to a peer, or the node receives an unsolicited message from a peer, which results in creating a peer construct in LNet. It is feasible, especially for router policies, to have a UDSP which associates a set of clients within a specific range with a set of optimal routers. Having the policies stored and matched in the kernel aids in fulfilling this requirement.

UDSP Instantiation

Performance needs to be taken into account with this feature. It is not feasible to traverse the policy lists on every send operation; this would add unnecessary overhead. When rules are applied they have to be "instantiated" on the LNet constructs they impact. For example, a Network Rule is added as follows: lnetctl policy add --src o2ib0 --priority 0. This rule gives priority to using the o2ib0 network for sending. A priority field will be added to the network structure and set to 0 for the o2ib0 network. As we traverse the networks in the selection algorithm, which is part of the current code, the priority field will be compared. This is a more optimal approach than examining the policies on every send to see if we get any matches.

Kernel Design

In Kernel Structures

/* lnet structure which will keep a list of UDSPs */
struct lnet {
	...
	list_head ln_udsp_list;
	...
};

/* net descriptor */
struct lnet_ud_net_descr {
	__u32 udn_net_type;
	list_head udn_net_num_range;
};

/* each NID range is defined as net_id and an ip range */
struct lnet_ud_nid_descr {
	struct lnet_ud_net_descr ud_net_id;
	list_head ud_ip_range;
};

/* UDSP action types */
enum lnet_udsp_action_type {
	EN_LNET_UDSP_ACTION_NONE = 0,
	EN_LNET_UDSP_ACTION_PRIORITY = 1,
	EN_LNET_UDSP_PREFERRED_LIST = 2,
};

 /*
 * a UDSP rule can have up to three user defined NID descriptors
 * 		- src: defines the local NID range for the rule
 * 		- dst: defines the peer NID range for the rule
 * 		- rte: defines the router NID range for the rule
 *
 * An action union defines the action to take when the rule
 * is matched
 */ 
struct lnet_udsp {
	list_head udsp_on_list;
	__u32 idx;
	struct lnet_ud_nid_descr *udsp_src;
	struct lnet_ud_nid_descr *udsp_dst;
	struct lnet_ud_nid_descr *udsp_rte;
	enum lnet_udsp_action_type udsp_action_type;
	union {
		__u32 udsp_priority;
	} udsp_action;
};

/* The rules are flattened in the LNet structures as shown below */
struct lnet_net {
...
	/* defines the relative priority of this net compared to others in the system */
	__u32 net_priority;
...
};


struct lnet_ni {
...
	/* defines the relative priority of this NI compared to other NIs in the net */
	__u32 ni_priority;
...
};

struct lnet_peer_net {
...
	/* defines the relative priority of this peer net compared to others in the system */
	__u32 lpn_priority;
...
};

struct lnet_peer_ni {
...
	/* defines the relative peer_ni priority compared to other peer_nis in the peer */
	__u32 lpni_priority;

	/* defines the list of local NID(s) (>=1) which should be used as the source */
	union {
		lnet_nid_t nid;
		struct list_head nids;
	} lpni_pref;

	/*
	 *	defines the list of router NID(s) to be used when sending to this peer NI
	 *	if the peer NI is remote
     */
	struct list_head lpni_rte_nids;
...
};

/* UDSPs will be passed to the kernel via IOCTL */
#define IOC_LIBCFS_ADD_UDSP _IOWR(IOC_LIBCFS_TYPE, 106, IOCTL_CONFIG_SIZE)
#define IOC_LIBCFS_DEL_UDSP _IOWR(IOC_LIBCFS_TYPE, 107, IOCTL_CONFIG_SIZE)
#define IOC_LIBCFS_GET_UDSP _IOWR(IOC_LIBCFS_TYPE, 108, IOCTL_CONFIG_SIZE)
#define IOC_LIBCFS_GET_UDSP_SIZE _IOWR(IOC_LIBCFS_TYPE, 109, IOCTL_CONFIG_SIZE)

There are two types of actions that can be applied on LNet constructs:

  1. Priority
  2. Preferred List

Priority is an integer value which is assigned to the LNet construct, such as Local Net or Remote NID.

Preferred List comes into play in the NID Pair and Router Rules. The NID Pair rules assign a set of local NIDs to destination peer NIs matching the destination NID descriptor.

The router rules assign a set of router NIDs to the destination peer NIs matching the destination NID descriptor.

The descriptor for the list of NIDs to be attached is carried in the udsp_src and udsp_rte fields. Although technically they are part of the action, it is simpler for marshalling and unmarshalling to implement them as shown above.

UDSP Structure Diagram

[UDSP Storage Structure Diagram]

Kernel IOCTL Handling

/*
 * api-ni.c will be modified to handle adding a UDSP 
 * All UDSP operations are done under mutex and exclusive spin
 * lock to avoid constructs changing during application of the
 * policies.
 */
int
LNetCtl(unsigned int cmd, void *arg)
{
...
	case IOC_LIBCFS_ADD_UDSP: {
		struct lnet_ioctl_config_udsp *config_udsp = arg;
		mutex_lock(&the_lnet.ln_api_mutex);
		/*
		 * add and do initial flattening of the UDSP into
		 * internal structures.
		 */
		rc = lnet_add_and_flatten_udsp(config_udsp);
		mutex_unlock(&the_lnet.ln_api_mutex);
		return rc;
	}

	case IOC_LIBCFS_DEL_UDSP: {
		struct lnet_ioctl_config_udsp *del_udsp = arg;
		mutex_lock(&the_lnet.ln_api_mutex);
		/*
		 * delete the rule identified by index
		 */
		rc = lnet_del_udsp(del_udsp->udsp_idx);
		mutex_unlock(&the_lnet.ln_api_mutex);
		return rc;
	}

	case IOC_LIBCFS_GET_UDSP_SIZE: {
		struct lnet_ioctl_config_udsp *get_udsp = arg;
		mutex_lock(&the_lnet.ln_api_mutex);
		/*
		 * get the UDSP size specified by idx
		 */
		rc = lnet_get_udsp_size(get_udsp);
		mutex_unlock(&the_lnet.ln_api_mutex);
		return rc;
	}

	case IOC_LIBCFS_GET_UDSP: {
		struct lnet_ioctl_config_udsp *get_udsp = arg;
		mutex_lock(&the_lnet.ln_api_mutex);
		/*
		 * get the udsp at index provided. Return -ENOENT if
		 * no more UDSPs to get
		 */
		rc = lnet_get_udsp(get_udsp, get_udsp->udsp_idx);
		mutex_unlock(&the_lnet.ln_api_mutex);
		return rc;
	}
...
}

IOC_LIBCFS_ADD_UDSP

The handler for IOC_LIBCFS_ADD_UDSP will perform the following operations:

  1. De-serialize the rules passed in from user space
  2. Make sure the rule is unique. If there exists another copy, the add is a NO-OP.
  3. If a rule exists which has the same matching criteria but different action, then update the rule
  4. Insert the rule and assign it an index.
  5. Iterate through all LNet constructs and apply the rules

Application of the rules will be done under the api mutex lock and the exclusive lnet_net_lock to avoid having the peer or local net lists change while the rules are being applied.

The rules are iterated and applied whenever:

  1. A local network interface is added.
  2. A remote peer/peer_net/peer_ni is added

IOC_LIBCFS_DEL_UDSP

The handler for IOC_LIBCFS_DEL_UDSP will:

  1. delete the rule with the specified index if it exists.
  2. Iterate through all LNet constructs and apply the updated set of rules

When the updated rule set is applied all traces of deleted or modified rules are removed from the LNet constructs.

IOC_LIBCFS_GET_UDSP_SIZE

Return the size of the UDSP specified by index.

IOC_LIBCFS_GET_UDSP

The handler for IOC_LIBCFS_GET_UDSP will serialize the rules on the UDSP list.

The GET call is done in two stages. First it makes a call to the kernel to determine the size of the UDSP at index. User space then allocates a block big enough to accommodate the UDSP and makes another call to actually get the UDSP.

User space iteratively fetches the UDSPs until there are no more UDSPs to get.

User space prints the UDSPs in the YAML format specified here.
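A minimal user-space sketch of this iterative, two-stage loop is shown below. The l_ioctl() wrapper and LNET_DEV_ID already exist in the lnetctl code base; carrying the marshalled size in iou_hdr.ioc_len and the print_udsp_yaml() helper are assumptions of this sketch:

static int fetch_all_udsps(void)
{
	struct lnet_ioctl_udsp get;
	struct lnet_ioctl_udsp *udsp;
	__u32 idx;
	int rc = 0;

	for (idx = 0;; idx++) {
		memset(&get, 0, sizeof(get));
		get.iou_idx = idx;
		/* stage 1: query the size of the UDSP at this index */
		rc = l_ioctl(LNET_DEV_ID, IOC_LIBCFS_GET_UDSP_SIZE, &get);
		if (rc)
			break;	/* -ENOENT: no more UDSPs to get */

		/* stage 2: allocate a block big enough and get the UDSP */
		udsp = calloc(1, get.iou_hdr.ioc_len);
		if (!udsp)
			return -ENOMEM;
		udsp->iou_hdr.ioc_len = get.iou_hdr.ioc_len;
		udsp->iou_idx = idx;
		rc = l_ioctl(LNET_DEV_ID, IOC_LIBCFS_GET_UDSP, udsp);
		if (!rc)
			/* demarshal and print in YAML (helper assumed) */
			print_udsp_yaml(udsp);
		free(udsp);
		if (rc)
			break;
	}

	return rc;
}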

TODO: Another option is to have IOC_LIBCFS_GET_UDSP_NUM, which gets the total size needed for all UDSPs, and then user space can make one call to get all the UDSPs. However, this complicates the marshalling function. User space would also need to handle cases where the size of the UDSPs is too large for one call. The above proposal does more iterations to get all the UDSPs, but the code should be simpler. And since the number of UDSPs is expected to be small, the above proposal should be fine.

Kernel Selection Algorithm Modifications

 /*
 * select an NI from the Nets with highest priority
 */
struct lnet_ni *
lnet_find_best_ni_on_local_net(struct lnet_peer *peer, int md_cpt)
{
...
        /* go through all the peer nets and find the best_ni */
        list_for_each_entry(lpn, &peer->lp_peer_nets, lpn_peer_nets) {
                /* select the preferred peer and local nets */
                lpn_healthv = lnet_get_peer_net_healthv_locked(lpn);
                if (best_lpn_healthv > lpn_healthv)
                        continue;

                lpn_sel_prio = lpn->lpn_sel_priority;
                if (best_lpn_sel_prio > lpn_sel_prio)
                        continue;

                /*
                 * The peer's list of nets can contain non-local nets. We
                 * want to only examine the local ones.
                 */
                net = lnet_get_net_locked(lpn->lpn_net_id);
                if (!net)
                        continue;

                net_healthv = lnet_get_net_healthv_locked(net);
                if (best_net_healthv > net_healthv)
                        continue;

                net_sel_prio = net->net_sel_priority;
                if (best_net_sel_prio > net_sel_prio)
                        continue;

                best_net_healthv = net_healthv;
                best_net_sel_prio = net_sel_prio;
                best_lpn_healthv = lpn_healthv;
                best_lpn_sel_prio = lpn_sel_prio;
                best_lpn = lpn; 
        }

        if (best_lpn) {
                /*
                 * Now that we have the healthiest and highest priority lpn,
                 * which can also be reached by the healthiest and highest
                 * priority local NI, we now select the best_ni
                 */
                best_ni = lnet_find_best_ni_on_spec_net(NULL, peer,
                                                        best_lpn, md_cpt, false);

                if (best_ni)
                        /* increment sequence number so we can round robin */
                        best_ni->ni_seq++;
        }

        return best_ni;
}

/*
 * select the NI with the highest priority 
 */
static struct lnet_ni *
lnet_get_best_ni(struct lnet_net *local_net, struct lnet_ni *best_ni,
				 struct lnet_peer *peer, struct lnet_peer_net *peer_net,
				 int md_cpt)
{
...
	ni_prio = ni->ni_priority;

	if (ni_fatal) {
		continue;
	} else if (ni_healthv < best_healthv) {
		continue;
	} else if (ni_healthv > best_healthv) {
		best_healthv = ni_healthv;
		if (distance < shortest_distance)
			shortest_distance = distance;
	/*
	 * if this NI is lower in priority than the one already set then discard it
 	 * otherwise use it and set the best priority so far to this NI's.
	 * keep track of the shortest distance because it is tested later
	 */
	} else if (ni_prio > best_ni_prio) {
		continue;
	} else if (ni_prio < best_ni_prio) {
		best_ni_prio = ni_prio;
		if (distance < shortest_distance)
			shortest_distance = distance;
	}

...
}

/*
 * When a UDSP rule associates local NIs with remote NIs, the list of local NI NIDs
 * is flattened to a list in the associated peer_NI. When selecting a peer NI, the
 * peer NI with the corresponding preferred local NI is selected.
 */
bool
lnet_peer_is_pref_nid_locked(struct lnet_peer_ni *lpni, lnet_nid_t nid)
{
...
}

/*
 * select the peer NI with the highest priority first and then check
 * if it's preferred.
 */ 
static struct lnet_peer_ni *
lnet_select_peer_ni(struct lnet_send_data *sd, struct lnet_peer *peer,
					struct lnet_peer_net *peer_net)
{
...
	/*
	 * For Non-MR peers we always want to use the preferred NID because
	 * if we don't the non-MR peer will have problems when it receives
	 * messages from a different NI other than the one it's expecting.
	 * However, for MR cases we need to adhere to the rule that health
	 * always trumps all other criteria. In the preferred NIDs case, if
	 * we have a healthier peer-NI which doesn't have the local_ni on its
	 * preferred list, then we should choose it.
	 *
	 * This scenario is handled here: lnet_handle_send_case_locked()
	 */
	ni_is_pref = lnet_peer_is_pref_nid_locked(lpni, best_ni->ni_nid);

	lpni_prio = lpni->lpni_priority;

	if (lpni_healthv < best_lpni_healthv)
		continue;
	/*
	 * select the NI with the highest priority.
	 */
	else if (lpni_prio > best_lpni_prio)
		continue;
	else if (lpni_prio < best_lpni_prio)
		best_lpni_prio = lpni_prio;
	/*
	 * select the NI which has the best_ni's NID in its preferred list
	 */
	else if (!preferred && ni_is_pref)
		preferred = true;
...
} 


static int
lnet_handle_find_routed_path(struct lnet_send_data *sd,
							 lnet_nid_t dst_nid,
							 struct lnet_peer_ni **gw_lpni,
							 struct lnet_peer **gw_peer)
{
...
	lpni = lnet_find_peer_ni_locked(dst_nid);
	peer = lpni->lpni_net->lpn_peer;
	list_for_each_entry(peer_net, &peer->lp_peer_nets, lpn_peer_nets) {
		peer_net_priority = peer_net->lpn_priority;
		if (peer_net_priority > peer_net_best_priority)
			continue;
		else if (peer_net_priority < peer_net_best_priority)
			peer_net_best_priority = peer_net_priority;
		lpni = NULL;
		while ((lpni = lnet_get_next_peer_ni_locked(peer, peer_net, lpni)) != NULL) {
			/* find best gw for this lpni */
			lpni_prio = lpni->lpni_priority;
			if (lpni_prio > lpni_best_prio)
				continue;
			else if (lpni_prio < lpni_best_prio)
				lpni_best_prio = lpni_prio;


			best_lpni = lpni;
			...
		}
	}
...
	/*
	 * lnet_find_route_locked will be changed to consider the list of
	 * gw NIDs on the lpni
	 */
	gw = lnet_find_route_locked(NULL, best_lpni, sd->sd_rtr_nid);
	...
	/*
	 * if gw is MR then select best_NI. Increment the sequence number of
	 * the gw NI for Round Robin selection.
	 */
}

Selection Algorithm Notes

  1. When examining the peer_net, we need to examine its health. The health of a peer_net can be derived from the health of the NIs in that peer_net. We can have a health value in the peer_net, which is set to the best health value of all the peer_NIs in that peer_net. When we are selecting the peer_net in lnet_find_best_ni_on_local_net(), we test that health value. This logic can be implemented for local networks as well. The loop will then select the best pair of peer and local nets, and the best_ni is then selected from the best network outside the loop. (A sketch of deriving the net health value follows this list.)

    1. There are two ways to maintain the peer_net and local net health values: on the fly, whenever we change one of the constituents' health values, or by deriving it by iterating through all the net constituents and finding the best health value. The latter approach is easier to implement and performance-wise is about the same. With the former approach, we would need to re-derive the net health every time a constituent's health value changes. It is better to just pull that information on demand.
  2. For Non-MR peers we always want to use the preferred NID, because if we don't, the non-MR peer will have problems when it receives messages from a different NI other than the one it's expecting. However, for MR cases we need to adhere to the rule that health always trumps all other criteria. In the preferred NIDs case, if we have a healthier peer-NI which doesn't have the local_ni on its preferred list, then we should choose it.

  3. We need to select the best peer_net (highest priority). Then from that peer_net we select the peer_ni with the highest priority, and if that peer_ni has a list of preferred routers, we select the route to use from this list. For remote peer_nis we never ding the health value, because we never send messages directly to them. So if there is a failure to send, we ding the router's NI. The only break to this rule is the REPLY/ACK case: if we don't receive the REPLY/ACK, whose fault is it? The remote peer's? Or the router's? To make things simple, we always blame the router.

  4. The distance criterion in the selection of the best local_ni should be ranked below the priority assigned by the admin.
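Below is a minimal sketch of the on-demand derivation described in note 1. The list and field names follow the existing lpn/lpni naming conventions in the LNet code, but the exact fields are assumptions of this sketch:

static int
lnet_get_peer_net_healthv_locked(struct lnet_peer_net *lpn)
{
	struct lnet_peer_ni *lpni;
	int best_healthv = 0;

	/*
	 * the net's health value is derived on demand as the best
	 * health value among its constituent peer NIs
	 */
	list_for_each_entry(lpni, &lpn->lpn_peer_nis, lpni_peer_nis) {
		if (atomic_read(&lpni->lpni_healthv) > best_healthv)
			best_healthv = atomic_read(&lpni->lpni_healthv);
	}

	return best_healthv;
}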

User Space Design

UDSP Marshaling

After a UDSP is parsed in user space it needs to be marshalled and sent to the kernel. The kernel will demarshal the data and store it in its own data structures. The UDSP is formed of the following pieces of information:

  1. Index: The index of the UDSP to insert or delete
  2. Source Address expression: A dot expression describing the source address range
  3. Net of the Source: A net id of the source
  4. Destination Address expression: A dot expression describing the destination address range
  5. Net of the Destination: A net id of the destination
  6. Router Address expression: A dot expression describing the router address range
  7. Net of the Router: A net id of the router
  8. Action Type: An enumeration describing the action type.
  9. Action: A structure describing the action if the UDSP is matched.

The data flow of a UDSP looks as follows:

[Data Flow Diagram]

DLC APIs

The DLC library will provide the outlined APIs to expose a way to create, delete and show rules.

Once rules are created and stored in the kernel, they are assigned an ID. This ID is returned and shown in the show command, which dumps the rules. This ID can be referenced later to delete a rule. The process is described in more detail below.

/*
 * lustre_lnet_udsp_str_to_action
 * 	Given a string format of the action, convert it to an enumerated type
 * 		action - string format for the action.
 */
enum lnet_udsp_action_type lustre_lnet_udsp_str_to_action(char *action);

/*
 * lustre_lnet_add_udsp
 *   Add a selection policy.
 *		src - source NID descriptor
 *		dst - destination NID descriptor
 *		rte - router NID descriptor
 *		type - action type
 * 		action - union of the action
 *      idx - the index at which to insert the rule
 * 		seq_no - sequence number of the request
 * 		err_rc - [OUT] struct cYAML tree describing the error. Freed by
 * 				 caller
 */
int lustre_lnet_add_udsp(char *src, char *dst, char *rte,
						 enum lnet_udsp_action_type type,
						 union udsp_action action, unsigned int idx,
						 int seq_no, struct cYAML **err_rc);
 
/*
 * lustre_lnet_del_udsp
 *   Delete a net selection policy.
 *      idx - the index to delete
 * 		seq_no - sequence number of the request
 * 		err_rc - [OUT] struct cYAML tree describing the error. Freed by
 * 				 caller
 */
int lustre_lnet_del_udsp(int idx, int seq_no, struct cYAML **err_rc);
 
/*
 * lustre_lnet_show_udsp
 *   Show configured net selection policies.
 * 		seq_no - sequence number of the request
 * 		show_rc - [OUT] struct cYAML tree containing the UDSPs
 * 		err_rc - [OUT] struct cYAML tree describing the error. Freed by
 * 				 caller   
 */
int lustre_lnet_show_udsp(int seq_no, struct cYAML **show_rc,
						  struct cYAML **err_rc);
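For illustration, a caller of this API, roughly what lnetctl would do for "lnetctl policy add --src tcp1 --priority 0", might look as follows. The union initialization and the use of -1 to mean "append to the end of the list" are assumptions of this sketch:

	union udsp_action action = { .udsp_priority = 0 };
	struct cYAML *err_rc = NULL;
	int rc;

	/* add a network rule preferring tcp1; append it to the policy list */
	rc = lustre_lnet_add_udsp("tcp1", NULL, NULL,
				  EN_LNET_UDSP_ACTION_PRIORITY,
				  action, -1 /* append */, -1, &err_rc);
	if (rc != LUSTRE_CFG_RC_NO_ERR)
		cYAML_print_tree2file(stderr, err_rc);
	cYAML_free_tree(err_rc);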

Userspace Structures

/* each NID range is defined as net_id and an ip range */
struct lnet_ud_nid_descr {
	__u32 ud_net_id;
	list_head ud_ip_range;
};

/* UDSP action types */
enum lnet_udsp_action_type {
	EN_LNET_UDSP_ACTION_NONE = 0,
	EN_LNET_UDSP_ACTION_PRIORITY,
};

 /*
 * a UDSP rule can have up to three user defined NID descriptors
 * 		- src: defines the local NID range for the rule
 * 		- dst: defines the peer NID range for the rule
 * 		- rte: defines the router NID range for the rule
 *
 * An action union defines the action to take when the rule
 * is matched
 */ 
struct lnet_udsp {
	list_head udsp_on_list;
	__u32 idx;
	enum lnet_udsp_action_type udsp_action_type;
	struct lnet_ud_nid_descr *udsp_src;
	struct lnet_ud_nid_descr *udsp_dst;
	struct lnet_ud_nid_descr *udsp_rte;
	union {
		__u32 udsp_priority;
	} udsp_action;
};

Marshaled Structures

/*
 * An IP or a Net number is composed of 1 or more of these descriptor
 * structures.
 */
struct lnet_range_expr {
	__u32 re_lo;
	__u32 re_hi;
	__u32 re_stride;
};

/*
 * A net descriptor has the net type, IE: O2IBLND, SOCKLND, etc and an
 * expression describing a net number range.
 */
struct lnet_ioctl_udsp_net_descr {
	__u32 ud_net_type;
	struct lnet_range_expr ud_net_num_expr;
};

/*
 * The UDSP descriptor header contains the type of matching criteria, SRC,
 * DST, RTE, etc and how many expressions. For example an IP can be
 * composed of 4 lnet_range_expr, a gni can be composed of 1.
 */
struct lnet_ioctl_udsp_descr_hdr {
	/*
	 * The literals SRC, DST and RTE are encoded
	 * here.
	 */
	__u32 ud_descr_type;
	__u32 ud_descr_count;
};

/*
 * each matching expression in the UDSP is described with this.
 * The bulk format is as follows:
 *	1. 1x struct lnet_ioctl_udsp_net_descr
 *		-> the net part of the NID
 *	2. >=1 struct lnet_range_expr
 *		-> the address part of the NID
 */
struct lnet_ioctl_udsp_descr {
	struct lnet_ioctl_udsp_descr_hdr iud_src_hdr;
	union {
		__u32 priority;
	} iou_action;
	struct lnet_ioctl_udsp_net_descr iud_net;
	char iud_bulk[0];
};

/*
 * The cumulative UDSP descriptor
 * The bulk format is as follows:
 *	1. >=1 struct lnet_ioctl_udsp_descr
 *
 * The size indicated in iou_hdr is the total size of the UDSP.
 */
struct lnet_ioctl_udsp {
	struct libcfs_ioctl_hdr iou_hdr;
	__u32 iou_idx;
	__u32 iou_action_type;
	char iou_bulk[0];
};

The address is expressed as a list of lnet_range_expr. These need to be marshalled. For an IP address there are 4 of these structures. Other types of addresses can have a different number; as an example, gemini will only have one. The marshalled structure will look like this:

[lnet_ioctl_udsp | lnet_ioctl_udsp_descr | lnet_range_expr ]

The matching expressions need to follow this exact order: SRC, DST, RTE.

It's worth noting that lnet_ioctl_udsp_descr_hdr.ud_descr_type is a 32-bit field which gets set to the literal SRC, DST or RTE depending on what it's describing. Using a 4-byte value that contains ASCII letters, which serve as magic values, can help in rebuilding a system's information in case of corruption.
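As a worked example, the address part of the NID expression 10.10.10.[2-3]@tcp would be marshalled as the following four lnet_range_expr entries, one per octet (the stride values shown are an assumption):

struct lnet_range_expr addr_expr[] = {
	{ .re_lo = 10, .re_hi = 10, .re_stride = 1 },	/* first octet */
	{ .re_lo = 10, .re_hi = 10, .re_stride = 1 },	/* second octet */
	{ .re_lo = 10, .re_hi = 10, .re_stride = 1 },	/* third octet */
	{ .re_lo = 2,  .re_hi = 3,  .re_stride = 1 },	/* fourth octet: [2-3] */
};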

The kernel will receive the marshalled data and form its internal structures. The functions to marshal and unmarshal should be straightforward. Note that user space and kernel space use the same structures. These structures will be defined in a common location. For this reason the functions to marshal and unmarshal will have the same interface and the same logic. However, because they are needed in both kernel and user space, they will need to be duplicated in two locations: one in the kernel path to be compiled into the kernel, and the other in user space to be compiled for use in user space utilities.

Marshalling and unmarshalling functions

Common functions that can be called from user space and kernel space will be created to marshal and de-marshal the UDSPs:

/*
 * lnet_get_udsp_size()
 * 	Given the UDSP return the size needed to store the marshalled UDSP
 */
int lnet_get_udsp_size(struct lnet_udsp *udsp);

/*
 * lnet_udsp_marshal()
 * 	Marshal the UDSP pointed to by udsp into the memory block that is provided. In order for this
 *  API to work in both Kernel and User space the bulk pointer needs to be passed in. When this API
 *  is called in the kernel, it is expected that the bulk memory is allocated in userspace. This API
 *  is intended to be called from the kernel to marshal the rules before sending it to user space.
 *  It will also be called from user space to marshal the udsp before sending to the kernel.
 * 		udsp [IN] - udsp to marshal
 * 		bulk_size [IN] - size of bulk.
 *  	bulk [OUT] - allocated block of memory where the serialized rules are stored.
 */
int lnet_udsp_marshal(struct lnet_udsp *udsp, __u32 *bulk_size, void __user *bulk);
 
/*
 * lnet_udsp_demarshal()
 * 	Given a bulk containing a single UDSP, demarshal and populate the udsp structure provided
 * 		bulk [IN] - memory block containing serialized rules
 * 		bulk_size [IN] - size of bulk memory block
 * 		udsp [OUT] - preallocated struct lnet_udsp
 */
int lnet_udsp_demarshal(void __user *bulk, __u32 bulk_size, struct lnet_udsp *udsp);

Requirements Covered by the Design

cfg-100, cfg-105, cfg-110, cfg-115, cfg-120, cfg-125, cfg-130, cfg-135, cfg-140, cfg-160, cfg-165

Use Cases

Preferred Network

If a node can be reached on two LNet networks, it is sometimes desirable to designate a fail-over network. Currently in lustre there is the concept of High Availability (HA), which allows servicenode NIDs to be defined as described in the lustre manual section 11.2. By using the syntax described in that section, two NIDs to the same peer can also be defined. However, this approach suffers from a current limitation in the lustre software, where the NIDs are exposed to layers above LNet. It is ideal to keep network failure handling contained within LNet and only let lustre worry about defining HA.

Given this, it is desirable to have two LNet networks defined on a node, each of which could have multiple interfaces, and then have a way to tell LNet to always use one network until it is no longer available, IE: all interfaces in that network are down.

In this manner we separate the functionality of defining fail-over pairs from defining fail-over networks.

Preferred NIDs

In a scenario where servers are being upgraded with new interfaces to be used in Multi-Rail, it's possible to add interfaces, for example MLX-EDR interfaces, to the server. The user might want existing QDR clients to continue using the QDR interface, while new clients can use the EDR interface or even both interfaces. By specifying rules on the clients that prefer specific interfaces, this behaviour can be achieved.

[Preferred NID Diagram]

Preferred local/remote NID pairs

This is a finer-tuned method of specifying an exact path: not only specifying a priority for a local interface or a remote interface, but specifying concrete pairs of interfaces that are most preferred. A peer interface can be associated with multiple local interfaces if necessary, to have an N:1 relationship between local interfaces and remote interfaces.

[Peer-to-Peer Pairing Diagram]

Refer to Olaf's LUG 2016/LAD 2016 PPT for more context.

Preferred Routers

[Router Use Case Diagram]

Client sets A and B are all configured on the same LNet network, for example o2ib. The servers are on a different LNet network, o2ib2. But due to the underlying network topology it is more efficient to route traffic from Client set A over Router set A and traffic from Client set B over Router set B. The green links are wider than the red links. UDSPs can be configured on the clients to specify the preferred set of router NIDs.

Fine Grained Routing

TODO: needs to be filled out.

Node Types

Based on LU-11447, there is a need to select an interface based on the destination portal type.

TODO: This will need a new type of policy. However, I believe we might be crossing a gray area here. LNet will need to have an "understanding" of portal types in a sense. Another solution suggested by Andreas Dilger: why not just configure the MDS with NID1 and the OSS with NID2, so the client won't even know that they are on the same node?

Unit Testing

This section will be updated as development continues. The goal is to update the unit test cases with as much detail as possible. It might be better to have pointers to the actual test scripts in the test case table below. For now an example of a pseudo-coded test script is outlined below.

Common Functions

This section defines common functions which will be used in many test cases. They are defined in pseudo python

def add_verify_net(net_configs, destination):
	# all commands should be executed on destination
	redirect_to_dest(destination)

	for cfg in net_configs:
		lnetctl net add --net cfg['net'] --if cfg['intf']
		show_output = lnetctl net show
		if (cfg['net'] not in show_output) or
		   (show_output[cfg['net']].if_name != cfg['intf']):
			return FAILED

	return SUCCESS

def add_verify_policy(network_type, priority, destination):
	# all commands should be executed on destination
	redirect_to_dest(destination)

	lnetctl policy add --src *@network_type --priority priority
	show_output = lnetctl policy show
	if (network_type not in show_output) or
	   (show_output[network_type].priority != priority):
		return FAILED

	show_output = lnetctl net show --net network_type
	if (not show_output) or
	   (show_output[network_type].priority != priority):
		return FAILED

	return SUCCESS
 
def generate_traffic(peer1, peer2):
	run_lnet_selftest(peer1, peer2)

def get_traffic_stats(peer1):
	# get traffic statistics and return them

def verify_traffic_on(stats1, stats2, net):
	# make sure that the bulk of the traffic is on net

Test Cases

The test cases below are organized by policy type.

Network Rule

Add and verify local network policy.

peer1_address = argv[1]
peer2_address = argv[2]
net = argv[3]
intf = argv[4]
net2 = argv[5]
intf2 = argv[6]

peer1 = make_nid(peer1_address, net)

net_cfg = [{'net': net, 'intf': intf}, {'net': net2, 'intf': intf2}]
add_verify_net(net_cfg, peer1)
add_verify_policy(net, 0, peer1)

Verify traffic goes over the network with the highest priority

# script should grab its input from user (can be automated)
peer1_address = argv[1]
peer2_address = argv[2]
net = argv[3]
intf = argv[4]
net2 = argv[5]
intf2 = argv[6]

net_cfg = [{'net': net, 'intf': intf}, {'net': net2, 'intf': intf2}]

peer1 = make_nid(peer1_address, net)
peer2 = make_nid(peer2_address, net)
add_verify_net(net_cfg, peer1)
add_verify_policy(net, 0, peer1)
add_verify_net(net_cfg, peer2)
add_verify_policy(net, 0, peer2)
stats1 = get_traffic_stats(peer1)
generate_traffic(peer1, peer2)
stats2 = get_traffic_stats(peer1)
verify_traffic_on(stats1, stats2, net)

Verify traffic goes over the network with the healthiest local NI even though it might not be set to highest priority

Delete local network policy and verify it has been deleted

Verify traffic returns to normal pattern when network policy is deleted

Error handling: Add policy for non-existent network

Add and verify a remote network policy. IE messages will need to be routed to that network

Verify traffic is routed to the remote network with the highest priority

Verify traffic is routed to another available network given the highest priority remote network is not reachable.

Delete remote network policy and verify it has been deleted

Verify traffic returns to normal pattern when remote network policy is deleted.

Error handling: Add policy for non-existent remote network
NID Rules

Add and verify local NID rule

Verify traffic goes over the local NID with the highest priority


Verify traffic goes over the healthiest NID even if it has lower priority

Delete NID policy and verify it has been deleted


Verify traffic goes back to regular pattern after NID policy is deleted.

Error handling: Add policy for non-existent NID


Repeat the above tests for remote NID
NID Pair Rules

Add and verify NID Pair Rule

TODO: how do you verify that a NID Pair rule has been applied correctly? We need to show the preferred NID list in the show command. This also applies to Router Rules.


Verify traffic goes over the preferred Local NIDs

Delete a NID pair rule and verify it has been deleted

Verify traffic goes back to regular pattern after NID Pair policy is deleted.

Error handling: Add a policy that doesn't match any local NIDs. This should be a no-op
Router Rules

Same set of tests as above, but for routers
Subsequent Addition

For each of the policy types, add a policy which doesn't match anything currently configured. Verify that the policy is added regardless

Add an LNet construct (Net, NI, Route) which matches an existing policy. Verify that policy has been applied on construct

TODO: Show commands like net show, peer show, etc should be modified to show the result of the policy application.


Verify traffic adheres to policy

Delete LNet construct. Verify that policy remains.
Dynamic Policy Addition

Run traffic.

For each of the policy types add a policy which should alter traffic

Verify traffic patterns change when policy is added.

Policy Application Order

Add all types of policies. They all should match and be applied. Verify.

Run traffic.

Verify that policies are applied on traffic in the order of operations defined here.

Dynamic Policy Deletion

Add all types of policies.

Run traffic

Verify that policies are applied on traffic in the order of operations defined.

Delete the policy one at a time.

Verify traffic pattern change with each policy deleted.

Breakdown

Functional Breakdown

In order to divide the work between multiple developers, we need to define a logical breakdown of the different functional components along API lines. By defining the APIs first, each developer can work on their piece independently. Below is a function block diagram highlighting the different API lines. The API blocks are in red.

User Space

  1. YAML
    1. YAML syntax as described earlier in the document
  2. CLI
    1. CLI syntax as described earlier in the document
  3. DLC API
    1. The set of API functions lnetctl (or any other utility) calls to perform policy operations
  4. Marshal API
    1. The API used to marshal and unmarshal UDSPs
  5. Parsing API
    1. The API used to parse a textual expression into the internal UDSP structures

Kernel Space

  1. Marshalled structure IOCTL API
    1. Used by both Kernel and User space
  2. Marshal API
    1. Should be the same for both Kernel and User space, except the implementation is duplicated
  3. Policy Management API
    1. Add, remove, apply policy
  4. Policy Storage Structure API
    1. These are the definitions of the UDSP structures when they are stored. The same definitions are used in both the user and kernel space. The marshal functions take these structures and form the marshalled IOCTL structures. The unmarshal functions take the marshalled IOCTL structures and forms the UDSP Storage structures

[Function Block Diagram]

Schedule

Preliminary Patch Breakdown

  1. Define UDSP storage structures - kernel space
  2. Define UDSP marshalled structures - kernel space
  3. Define Marshal API - kernel space
  4. Define Policy Management API - kernel space
  5. Selection Algorithm changes implementation - kernel space
  6. Policy Handling implementation - kernel space
    1. Adding/deleting/applying policies
  7. LNet construct creation handling - kernel space
    1. Applying policies on construct creation
  8. Marshalling/unmarshalling implementation - kernel space
  9. Marshalling/unmarshalling implementation - user space
  10. expression parsing and policy storage - user space
  11. Policy Add - kernel space
  12. Policy Del  - kernel space
  13. Policy Show - kernel space
  14. Define Marshal API - user space
  15. Define Parsing API - user space
  16. Define DLC API - user space
  17. Policy Add (including YAML) - user space
  18. Policy Del (including YAML) - user space
  19. Policy show (including YAML) - user space

Reference Links

https://www.ece.tufts.edu/~karen/classes/final_presentation/Dragonfly_Topology_Long.pptx


Comments

  1. enum lnet_udsp_action_type
    make action_none be 0, and everything follows.
  2. struct lnet_ud_nid_descr
    specify the net_type+description of the net number to allow things like: o2ib[1-4]
  3. struct lnet_remotenet {... /* defines the relative priority of the remote net compared to other remote nets */ __u32 lrn_priority;...}
    This is not needed
  4. I think this work could be used to implement detection of asymmetrical routes. For instance, if it was possible to negate matching criteria and apply the DROP action, it would be possible to say 'drop any message that comes from a NID that is not in the specified list'. Negating is important because by definition, we cannot know the NIDs of illegitimate routers.

  5. Yes you can. I had that conversation a few days ago with Amir (smile)

  6. It is not clear to me yet how to implement a solution for the problem I am going to describe, but the paragraph about 'Preferred Network' in 'Use cases' section rings a bell to me.

    Here is the problem description: when implementing multi-tenancy for Lustre, one may need as many LNet networks as the number of tenants to support on the file system. So basically, it consists in creating multiple NIDs on client and servers nodes, all using the same network interface but on a different LNet network. On server side, nodemap entries are created, one per tenant, making an association between an LNet network and a fileset. And on client side, the idea is to make use of the 'network' mount option to restrict the mount-point to a given LNet network, hence confining it to a fileset thanks to the nodemap. So far so good, but the problem stems from the fact that OSTs and MDTs have many NIDs on which they are communicating with the clients (as many as the number of LNet networks). So currently these NIDs have to be declared as 'servicenode' or 'failnode' when formatting the targets. Unfortunately, the number of 'servicenode' or 'failnode' values that can be specified is limited (maybe due to the length of the string that represent this information). So at the moment we cannot support having too many tenants (we are talking about hundreds), because it means too many LNet networks and then too many NIDs per target.

    Now, if the UDSP feature could avoid the need for all the 'servicenode' or 'failnode' parameters (as long as they are just another NID on the same network interface), that would be a great benefit for the multi-tenancy feature.

  7. Another use case: ORNL has been interested in controlling traffic over an interface based on message size. It was requested that small messages be handled over a different network interface so that all bulk I/O went over one specific network interface. This was done to maximize the amount of bulk I/O using all the peer credits available. When small messages are mixed in, you lose out on the maximum total bandwidth that is possible.

  8. How would the policy look for a case where the sysadmin wants to prefer src NID A to reach dst NID B through route R? Do we define 2 policies for such a case - one with a NID pair rule and the other with a router rule?

  9. Another use case could be to replace the client-side 'network' mount option. This option is used to restrict a mount point to a given LNet network, which is useful for multi-tenancy, but we have seen some incompatibility with recent LNet features like Dynamic Peer Discovery.

    UDSP Network rules operate globally for the whole LNet stack on the node, so they could not be used as a direct replacement for the 'network' mount option, because they would not let one mount point use a given LNet network while another mount point on the same node uses a different LNet network.

    But do you think it could be possible to define a UDSP Network rule, and then have a new mount option that instructs the mount point to apply this UDSP rule, so for that mount point only? Or maybe, being able to specify a special selection priority, and then have a new mount option that instructs to only use that priority level.

    The main reason to consider an alternative to the 'network' mount option would be to get something more robust to the upcoming and past LNet changes.