...
Below is the command line syntax for managing UDSPs:
| Code Block |
|---|
# Generic policy command syntax
lnetctl policy <add | del | show>
	--src: ip2nets syntax specifying the local NID to match
	--dst: ip2nets syntax specifying the remote NID to match
	--rte: ip2nets syntax specifying the router NID to match
	--priority: Priority to apply to rule matches
	--idx: Index of where to insert the rule. By default it appends to the end of the rule list

# Adding a network priority rule. If the NI under the network doesn't have
# an explicit priority set, it'll inherit the network priority:
lnetctl > selection net [add | del | show] -h
Usage: selection net add --net <network name> --priority <priority>
WHERE:
	selection net add: add a selection rule based on the network priority
	--net: network string (e.g. o2ib or o2ib* or o2ib[1,2])
	--priority: Rule priority
Usage: selection net del --net <network name> [--id <rule id>]
WHERE:
	selection net del: delete a selection rule given the network pattern or the id.
		If both are provided they need to match or an error is returned.
	--net: network string (e.g. o2ib or o2ib* or o2ib[1,2])
	--id: ID assigned to the rule returned by the show command.
Usage: selection net show [--net <network name>]
WHERE:
	selection net show: show selection rules and filter on network name if provided.
	--net: network string (e.g. o2ib or o2ib* or o2ib[1,2])

# Add a NID priority rule. All NIDs added that match this pattern shall be assigned
# the identified priority. When the selection algorithm runs it shall prefer NIDs with
# higher priority.
lnetctl > selection nid [add | del | show] -h
Usage: selection nid add --nid <NID> --priority <priority>
WHERE:
	selection nid add: add a selection rule based on the nid pattern
	--nid: nid pattern which follows the same syntax as ip2nets
	--priority: Rule priority
Usage: selection nid del --nid <NID> [--id <rule id>]
WHERE:
	selection nid del: delete a selection rule given the nid pattern or the id.
		If both are provided they need to match or an error is returned.
	--nid: nid pattern which follows the same syntax as ip2nets
	--id: ID assigned to the rule returned by the show command.
Usage: selection nid show [--nid <NID>]
WHERE:
	selection nid show: show selection rules and filter on NID pattern if provided.
	--nid: nid pattern which follows the same syntax as ip2nets

# Adding a point-to-point rule. This creates an association between a local NI and a remote
# NID, and assigns a priority to this relationship so that it's preferred when selecting a pathway.
lnetctl > selection peer [add | del | show] -h
Usage: selection peer add --local <NID> --remote <NID> --priority <priority>
WHERE:
	selection peer add: add a selection rule based on local to remote pathway
	--local: nid pattern which follows the same syntax as ip2nets
	--remote: nid pattern which follows the same syntax as ip2nets
	--priority: Rule priority
Usage: selection peer del --local <NID> --remote <NID> --id <ID>
WHERE:
	selection peer del: delete a selection rule based on local to remote NID pattern or id
	--local: nid pattern which follows the same syntax as ip2nets
	--remote: nid pattern which follows the same syntax as ip2nets
	--id: ID of the rule as provided by the show command.
Usage: selection peer show [--local <NID>] [--remote <NID>]
WHERE:
	selection peer show: show selection rules and filter on NID patterns if provided.
	--local: nid pattern which follows the same syntax as ip2nets
	--remote: nid pattern which follows the same syntax as ip2nets

# the output will be of the same YAML format as the input described below. |
Structures
| Code Block |
|---|
/* This is a common structure which describes an expression */
struct lnet_match_expr {
};
struct lnet_selection_descriptor {
enum selection_type lsd_type;
char *lsd_pattern1;
char *lsd_pattern2;
union {
__u32 lsda_priority;
} lsd_action_u;
};
/*
* lustre_lnet_add_selection
* Add a selection policy rule described by the selection descriptor.
*
* selection - describes the selection policy rule
* seq_no - sequence number of the command
* err_rc - YAML structure of the resultant return code
*/
int lustre_lnet_add_selection(struct lnet_selection_descriptor *selection, int seq_no, struct cYAML **err_rc); |
cfg-100, cfg-105, cfg-110, cfg-115, cfg-120, cfg-125, cfg-130, cfg-135, cfg-140, cfg-160, cfg-165
YAML Syntax
...
# Configuring Network rules
selection:
    - type: net
      net: <net name or pattern. e.g. o2ib1, o2ib*, o2ib[1,2]>
      priority: <Unsigned integer where 0 is the highest priority>
# Configuring NID rules:
selection:
    - type: nid
      nid: <a NID pattern as described in the Lustre Manual ip2nets syntax>
      priority: <Unsigned integer where 0 is the highest priority>
# Configuring Point-to-Point rules.
selection:
    - type: peer
      local: <a NID pattern as described in the Lustre Manual ip2nets syntax>
      remote: <a NID pattern as described in the Lustre Manual ip2nets syntax>
      priority: <Unsigned integer where 0 is the highest priority>
# To delete the rules, there are two options:
# 1. Whenever a rule is added it will be assigned a unique ID. The show command will display the
#    unique ID. The unique ID must be explicitly identified in the delete command.
# 2. The rule is matched in the kernel based on the matching rule, which acts as a unique identifier.
#    This means that there can not exist two rules that have the exact same matching criteria.
# Both options shall be supported.
# Adding a local network udsp
# If multiple local networks are available, each one can have a priority.
# The one with the highest priority is preferred
lnetctl policy add --src *@<net type> --<action type> <action context sensitive value> --idx <value>
--src: is defined in ip2nets syntax. '*@<net type>' syntax indicates the network.
This is not to be confused with '*.*.*.*'@<net type>' which indicates all
NIDs in this network.
--<action type>: 'priority' is the only implemented action type
--<action context sensitive value>: is a value specific to the action type.
For 'priority' it's a value for [0 - 255]
--idx: The index of where to insert the rule. If it's larger than the policy list it's
appended to the end of the list. If not specified the default behavior is to append
to the end of the list
# Adding a local NID udsp
# After a local network is chosen. If there are multiple NIs in the network the
# one with highest priority is preferred.
lnetctl policy add --src <Address descriptor>@<net type> --<action type> <action context sensitive value>
--idx <value>
--src: the address descriptor defined in ip2nets syntax as described in the manual
<net type> is something like: tcp1, o2ib2
--<action type>: 'priority' is the only implemented action type
--<action context sensitive value>: is a value specific to the action type.
For 'priority' it's a value for [0 - 255]
--idx: The index of where to insert the rule. If it's larger than the policy list it's
appended to the end of the list. If not specified the default behavior is to append
to the end of the list
# Adding a remote NID udsp
# When selecting a peer NID select the one with the highest priority.
lnetctl policy add --dst <Address descriptor>@<net type> --<action type> <action context sensitive value>
--idx <value>
--dst: the address descriptor defined in ip2nets syntax as described in the manual
<net type> is something like: tcp1, o2ib2
--<action type>: 'priority' is the only implemented action type
--<action context sensitive value>: is a value specific to the action type.
For 'priority' it's a value for [0 - 255]
--idx: The index of where to insert the rule. If it's larger than the policy list it's
appended to the end of the list. If not specified the default behavior is to append
to the end of the list
# Adding a NID pair udsp
# When this rule is flattened the local NIDs which match the rule are added on a list
# on the peer NIs matching the rule. When selecting the peer NI, the one with the
# local NID being used on its list is preferred.
lnetctl policy add --src <Address descriptor>@<net type> --dst <Address descriptor>@<net type>
--<action type> <action context sensitive value>
--idx <value>
--src: the address descriptor defined in ip2nets syntax as described in the manual
<net type> is something like: tcp1, o2ib2
--dst: the address descriptor defined in ip2nets syntax as described in the manual
<net type> is something like: tcp1, o2ib2. Destination NIDs can be local or
remote.
--<action type>: 'priority' is the only implemented action type
--<action context sensitive value>: is a value specific to the action type.
For 'priority' it's a value for [0 - 255]
--idx: The index of where to insert the rule. If it's larger than the policy list it's
appended to the end of the list. If not specified the default behavior is to append
to the end of the list
# Adding a network pair udsp
# TBD: do we need this rule?
lnetctl policy add --src *@<net type> --dst *@<net type>
--<action type> <action context sensitive value>
--idx <value>
--src: is defined in ip2nets syntax. '*@<net type>' syntax indicates the network.
This is not to be confused with '*.*.*.*'@<net type>' which indicates all
NIDs in this network.
--dst: is defined in ip2nets syntax, same as the src. Destination network needs to
be a remote network, not a local network. IE, messages need to be routed
to get to that network.
--<action type>: 'priority' is the only implemented action type
--<action context sensitive value>: is a value specific to the action type.
For 'priority' it's a value for [0 - 255]
--idx: The index of where to insert the rule. If it's larger than the policy list it's
appended to the end of the list. If not specified the default behavior is to append
to the end of the list
# Adding a Router udsp
# similar to the NID pair udsp. The router NIDs matching the rule are added on a list
# on the peer NIs matching the rule. When sending to a remote peer, the router which
# has its nid on the peer NI list is preferred.
lnetctl policy add --dst <Address descriptor>@<net type> --rte <Address descriptor>@<net type>
--<action type> <action context sensitive value>
--idx <value>
--dst: the address descriptor defined in ip2nets syntax as described in the manual
<net type> is something like: tcp1, o2ib2
--rte: the address descriptor defined in ip2nets syntax as described in the manual
<net type> is something like: tcp1, o2ib2.
--<action type>: 'priority' is the only implemented action type
--<action context sensitive value>: is a value specific to the action type.
For 'priority' it's a value for [0 - 255]
--idx: The index of where to insert the rule. If it's larger than the policy list it's
appended to the end of the list. If not specified the default behavior is to append
to the end of the list
# show all policies in the system.
# the policies are dumped in YAML form.
# Each policy is assigned an index.
# The index is part of the policy YAML block
lnetctl policy show
# to delete a policy the index must be specified.
# The normal behavior then is to first show the list of policies,
# grab the index and use it in the delete command.
lnetctl policy del --idx <value>
# generally, the syntax is as follows
lnetctl policy <add | del | show>
--src: ip2nets syntax specifying the local NID to match
--dst: ip2nets syntax specifying the remote NID to match
--rte: ip2nets syntax specifying the router NID to match
--priority: Priority to apply to rule matches
--idx: Index of where to insert the rule. By default it appends to
the end of the rule list
|
As of the time of this writing only the "priority" action shall be implemented. However, it is feasible in the future to implement different actions to be taken when a rule matches. For example, we can implement a "redirect" action, which redirects traffic to another destination. Yet another example is a "lawful intercept" or "mirror" action, which mirrors messages to a different destination. This might be useful for keeping a standby server updated with all information going to the primary server. A lawful intercept action allows personnel authorized by a Law Enforcement Agency (LEA) to intercept file operations from targeted clients and send the file operations to an LI Mediation Device.
YAML Syntax
| Code Block |
|---|
udsp:
    - idx: <unsigned int>
      src: <ip>@<net type>
      dst: <ip>@<net type>
      rte: <ip>@<net type>
      action:
          - priority: <unsigned int> |
Overview of Operations
There are three main operations which can be carried out on UDSPs either from the command line or YAML configuration: add, delete, show.
Add
The UI allows adding a new rule. With the use of the optional idx parameter, the admin can specify where in the rule chain the new rule should be added. By default the rule is appended to the end of the list; if idx is specified the rule is inserted at that position.
When a new UDSP is added the entire UDSP set is re-evaluated. This means all Nets, NIs and peer NIs in the system are traversed and the rules re-applied. This is an expensive operation, but given that UDSP management should be a rare operation, it shouldn't be a problem.
Delete
The UI allows deleting an existing UDSP using its index. The index can be shown using the show command. When a UDSP is deleted the entire UDSP set is re-evaluated. The Nets, NIs and peer NIs are traversed and the rules re-applied.
Show
The UI allows showing existing UDSPs. The format of the YAML output is as follows:
| Code Block |
|---|
udsp:
    - idx: <unsigned int>
      src: <ip>@<net type>
      dst: <ip>@<net type>
      rte: <ip>@<net type>
      action:
          - priority: <unsigned int> |
Design
All policies are stored in kernel space. All logic to add, delete and match policies will be implemented in kernel space. This complicates the kernel space processing. Arguably, policy maintenance logic is not core to LNet functionality. What is core is the ability to select source and destination networks and NIDs in accordance with user definitions. However, the kernel is able to manage policies more easily and with fewer potential race conditions than user space.
Design Principles
UDSPs are comprised of two parts:
- The matching rule
- The rule action
The matching rule is what's used to match a NID or a network. The action is what's applied when the rule is matched.
A rule can be uniquely identified using an internal ID which is assigned by the LNet module when a rule is added and returned to the user space when the UDSPs are shown.
UDSP Storage
UDSPs shall be defined by administrators either via the LNet command line utility, lnetctl, or via a YAML configuration file. lnetctl parses the UDSP and stores it in an intermediary format, which will be flattened and passed down to the kernel LNet module. LNet shall store these UDSPs on a policy list. Once policies are added to LNet they will be applied on existing networks, NIDs and routers. The advantage of this approach is that UDSPs are not strictly tied to the internal constructs, i.e. networks, NIDs or routers, but can be applied whenever the internal constructs are created; if the internal constructs are deleted the UDSPs remain and can be automatically applied again at a future time.
This makes configuration easy since a set of UDSPs can be defined, like "all IB networks priority 1", "all Gemini networks priority 2", etc, and when a network is added, it automatically inherits these rules.
Peers are normally not created explicitly by the administrators. The ULP requests to send a message to a peer, or the node receives an unsolicited message from a peer, which results in creating a peer construct in LNet. It is feasible, especially for router policies, to have a UDSP which associates a set of clients within a specific range with a set of optimal routers. Having the policies stored and matched in the kernel aids in fulfilling this requirement.
UDSP Application
Performance needs to be taken into account with this feature. It is not feasible to traverse the policy lists on every send operation; this would add unnecessary overhead. When rules are applied they have to be "flattened" to the constructs they impact. For example, a Network Rule is added as follows: o2ib priority 0. This rule gives priority to the o2ib network for sending. A priority field will be added to the network structure and set to 0 for the o2ib network. As we traverse the networks in the selection algorithm, which is part of the current code, the priority field will be compared. This is a more optimal approach than examining the policies on every send to see if we get any matches.
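To make the flattening step concrete, below is a minimal sketch of how a network priority rule could be applied; lnet_udsp_apply_net_rule() and lnet_udsp_net_match() are hypothetical helper names used only for illustration, while the list walk and the net_priority field follow the structures described later in this document.
| Code Block |
|---|
/*
 * Sketch only: flatten a network priority rule into struct lnet_net.
 * lnet_udsp_net_match() is a hypothetical helper that compares the
 * rule's source NID descriptor against a net ID.
 */
static void
lnet_udsp_apply_net_rule(struct lnet_udsp *udsp)
{
	struct lnet_net *net;

	/* walk every configured local net */
	list_for_each_entry(net, &the_lnet.ln_nets, net_list) {
		if (!lnet_udsp_net_match(udsp->udsp_src, net->net_id))
			continue;
		/* the rule is flattened into the net's priority field */
		net->net_priority = udsp->udsp_action.udsp_priority;
	}
} |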
DLC APIs
The DLC library will provide the outlined APIs to expose a way to create, delete and show rules.
Once rules are created and stored in the kernel, they are assigned an ID. This ID is returned and shown in the show command, which dumps the rules. This ID can be referenced later to delete a rule. The process is described in more detail below.
| Code Block |
|---|
/*
* lustre_lnet_udsp_str_to_action
* Given a string format of the action, convert it to an enumerated type
* action - string format for the action.
*/
enum lnet_udsp_action_type lustre_lnet_udsp_str_to_action(char *action);
/*
* lustre_lnet_add_udsp
* Add a selection policy.
* src - source NID descriptor
* dst - destination NID descriptor
* rte - router NID descriptor
* type - action type
* action - union of the action
* idx - the index at which to insert the rule
* seq_no - sequence number of the request
* err_rc - [OUT] struct cYAML tree describing the error. Freed by
* caller
*/
int lustre_lnet_add_udsp(char *src, char *dst, char *rte,
enum lnet_udsp_action_type type,
union udsp_action action, unsigned int idx,
int seq_no, struct cYAML **err_rc);
/*
* lustre_lnet_del_udsp
* Delete a net selection policy.
* idx - the index to delete
* seq_no - sequence number of the request
* err_rc - [OUT] struct cYAML tree describing the error. Freed by
* caller
*/
int lustre_lnet_del_udsp(int idx, int seq_no, struct cYAML **err_rc);
/*
* lustre_lnet_show_udsp
* Show configured net selection policies.
* seq_no - sequence number of the request
* show_rc - [OUT] struct cYAML tree containing the UDSPs
* err_rc - [OUT] struct cYAML tree describing the error. Freed by
* caller
*/
int lustre_lnet_show_udsp(int seq_no, struct cYAML **show_rc, struct cYAML **err_rc); |
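For illustration, a minimal sketch of how a caller might use the add API above; the NID pattern, priority value and wrapper function are examples only, and the error handling assumes the existing DLC cYAML helpers.
| Code Block |
|---|
/*
 * Sketch only: add a rule preferring the 192.168.0.* local NIDs on the tcp
 * network. The NID pattern and priority are illustrative values.
 */
int add_example_udsp(void)
{
	struct cYAML *err_rc = NULL;
	union udsp_action action;
	int rc;

	action.udsp_priority = 0;	/* 0 is the highest priority */
	rc = lustre_lnet_add_udsp("192.168.0.*@tcp", NULL, NULL,
				  EN_LNET_UDSP_ACTION_PRIORITY, action,
				  -1 /* larger than the list, so appended */,
				  -1 /* seq_no */, &err_rc);
	if (rc)
		cYAML_print_tree2file(stderr, err_rc);
	if (err_rc)
		cYAML_free_tree(err_rc);
	return rc;
} |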
In Kernel Structures
| Code Block |
|---|
/* lnet structure will keep a list of UDSPs */
struct lnet {
...
list_head ln_udsp_list;
...
}
/* each NID range is defined as net_id and an ip range */
struct lnet_ud_nid_descr {
__u32 ud_net_id;
list_head ud_ip_range;
}
/* UDSP action types */
enum lnet_udsp_action_type {
EN_LNET_UDSP_ACTION_PRIORITY = 0,
EN_LNET_UDSP_ACTION_NONE = 1,
}
/*
* a UDSP rule can have up to three user defined NID descriptors
* - src: defines the local NID range for the rule
* - dst: defines the peer NID range for the rule
* - rte: defines the router NID range for the rule
*
* An action union defines the action to take when the rule
* is matched
*/
struct lnet_udsp {
struct list_head udsp_on_list;
__u32 idx;
struct lnet_ud_nid_descr *udsp_src;
struct lnet_ud_nid_descr *udsp_dst;
struct lnet_ud_nid_descr *udsp_rte;
enum lnet_udsp_action_type udsp_action_type;
union {
__u32 udsp_priority;
} udsp_action;
}
/* The rules are flattened in the LNet structures as shown below */
struct lnet_net {
...
/* defines the relative priority of this net compared to others in the system */
__u32 net_priority;
...
}
struct lnet_ni {
...
/* defines the relative priority of this NI compared to other NIs in the net */
__u32 ni_priority;
...
}
struct lnet_peer_ni {
...
/* defines the relative peer_ni priority compared to other peer_nis in the peer */
__u32 lpni_priority;
/* defines the list of local NID(s) (>=1) which should be used as the source */
union lpni_pref {
lnet_nid_t nid;
lnet_nid_t *nids;
} lpni_pref;
/* defines the list of router NID(s) to be used when sending to this peer NI */
lnet_nid_t *lpni_rte_nids;
...
}
/* UDSPs will be passed to the kernel via IOCTL */
#define IOC_LIBCFS_ADD_UDSP _IOWR(IOC_LIBCFS_TYPE, 106, IOCTL_CONFIG_SIZE)
/* UDSPs will be grabbed from the kernel via IOCTL */
#define IOC_LIBCFS_GET_UDSP _IOWR(IOC_LIBCFS_TYPE, 107, IOCTL_CONFIG_SIZE) |
Kernel IOCTL Handling
| Code Block |
|---|
/* api-ni.c will be modified to handle adding a UDSP */
int
LNetCtl(unsigned int cmd, void *arg)
{
...
case IOC_LIBCFS_ADD_UDSP: {
struct lnet_ioctl_config_udsp *config_udsp = arg;
mutex_lock(&the_lnet.ln_api_mutex);
/*
* add and do initial flattening of the UDSP into
* internal structures
*/
rc = lnet_add_and_flatten_udsp(config_udsp);
mutex_unlock(&the_lnet.ln_api_mutex);
return rc;
}
case IOC_LIBCFS_DEL_UDSP: {
struct lnet_ioctl_config_udsp *del_udsp = arg;
mutex_lock(&the_lnet.ln_api_mutex);
/*
* delete the rule identified by index
*/
rc = lnet_del_udsp(del_udsp->udsp_idx);
mutex_unlock(&the_lnet.ln_api_mutex);
return rc;
}
case IOC_LIBCFS_GET_UDSP: {
struct lnet_ioctl_config_udsp *get_udsp = arg;
mutex_lock(&the_lnet.ln_api_mutex);
/*
* get the udsp at index provided. Return -ENOENT if
* no more UDSPs to get
*/
rc = lnet_get_udsp(get_udsp, get_udsp->udsp_idx);
mutex_unlock(&the_lnet.ln_api_mutex);
return rc;
}
case IOC_LIBCFS_GET_UDSP_SIZE: {
struct lnet_ioctl_config_udsp *get_udsp = arg;
mutex_lock(&the_lnet.ln_api_mutex);
/*
* get the UDSP size specified by idx
*/
rc = lnet_get_udsp_num(get_udsp);
mutex_unlock(&the_lnet.ln_api_mutex);
return rc;
}
...
} |
IOC_LIBCFS_ADD_UDSP
The handler for IOC_LIBCFS_ADD_UDSP will perform the following operations:
- De-serialize the rules passed in from user space
- Make sure the rule is unique. If there exists another copy, the add is a NO-OP.
- If a rule exists which has the same matching criteria but different action, then update the rule
- Insert the rule and assign it an index.
- Iterate through all LNet constructs and apply the rules
Application of the rules will be done under api_mutex_lock and the exclusive lnet_net_lock to avoid having the peer or local net lists changed while the rules are being applied.
The rules are iterated and applied whenever:
- A local network interface is added.
- A remote peer/peer_net/peer_ni is added
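To tie the steps above together, the following is one possible shape for lnet_add_and_flatten_udsp(); the helpers lnet_udsp_alloc(), lnet_udsp_insert(), lnet_udsp_apply_policies() and lnet_udsp_free(), as well as the iou_bulk and iou_bulk_size fields, are assumptions used only for illustration.
| Code Block |
|---|
/*
 * Sketch only: de-marshal the incoming UDSP, insert (or update) it on the
 * policy list, then re-apply the whole rule set to all LNet constructs.
 */
static int
lnet_add_and_flatten_udsp(struct lnet_ioctl_config_udsp *config_udsp)
{
	struct lnet_udsp *udsp;
	int rc;

	udsp = lnet_udsp_alloc();
	if (!udsp)
		return -ENOMEM;

	/* rebuild the kernel-side struct lnet_udsp from the ioctl bulk */
	rc = lnet_udsp_demarshal(config_udsp->iou_bulk,
				 config_udsp->iou_bulk_size, udsp);
	if (rc)
		goto failed;

	/* NO-OP if an identical rule exists; update if only the action differs */
	rc = lnet_udsp_insert(udsp, config_udsp->iou_idx);
	if (rc)
		goto failed;

	/* re-apply the whole rule set to all nets, NIs and peer NIs */
	lnet_net_lock(LNET_LOCK_EX);
	lnet_udsp_apply_policies();
	lnet_net_unlock(LNET_LOCK_EX);

	return 0;
failed:
	lnet_udsp_free(udsp);
	return rc;
} |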
IOC_LIBCFS_DEL_UDSP
The handler for IOC_LIBCFS_DEL_UDSP will:
- delete the rule with the specified index
- Iterate through all LNet constructs and apply the updated set of rules
When the new rules are applied all traces of deleted rules are removed from the LNet constructs.
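A corresponding minimal sketch of the delete path, under the same assumptions (lnet_udsp_del_policy() and lnet_udsp_apply_policies() are hypothetical names):
| Code Block |
|---|
/*
 * Sketch only: remove the rule at idx and re-apply the remaining set.
 */
static int
lnet_del_udsp(unsigned int idx)
{
	int rc;

	/* remove the rule at idx; -ENOENT if there is no rule at that index */
	rc = lnet_udsp_del_policy(idx);
	if (rc)
		return rc;

	/* re-applying the remaining rules wipes any priorities or preferred
	 * NIDs that were set by the deleted rule */
	lnet_net_lock(LNET_LOCK_EX);
	lnet_udsp_apply_policies();
	lnet_net_unlock(LNET_LOCK_EX);
	return 0;
} |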
IOC_LIBCFS_GET_UDSP_SIZE
Return the size of the UDSP specified by index.
IOC_LIBCFS_GET_UDSP
The handler for IOC_LIBCFS_GET_UDSP will serialize the rules on the UDSP list.
The GET call is done in two stages. First it makes a call to the kernel to determine the size of the UDSP at index. User space then allocates a block big enough to accommodate the UDSP and makes another call to actually get the UDSP.
User space iteratively queries the UDSPs until there are no more UDSPs to get.
User space prints the UDSPs in the same YAML format.
TODO: Another option is to have IOC_LIBCFS_GET_UDSP_NUM, which gets the total size needed for all UDSPs, and then user space can make one call to get all the UDSPs. However, this complicates the marshaling function. User space would also need to handle cases where the size of the UDSPs is too large for one call. The above proposal will do more iterations to get all the UDSPs, but the code should be simpler. And since the number of UDSPs is expected to be small, the above proposal should be fine.
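For illustration, the two-stage get loop might look like the sketch below from user space; l_ioctl_get_udsp_size(), l_ioctl_get_udsp() and print_udsp_yaml() are hypothetical wrappers around IOC_LIBCFS_GET_UDSP_SIZE, IOC_LIBCFS_GET_UDSP and the YAML printer.
| Code Block |
|---|
/*
 * Sketch only: iterate the UDSPs - ask for the size at idx, allocate a
 * buffer, then fetch and print that UDSP.
 */
int show_all_udsps(void)
{
	int idx, rc;

	for (idx = 0; ; idx++) {
		struct lnet_ioctl_udsp *udsp;
		__u32 size;

		rc = l_ioctl_get_udsp_size(idx, &size);
		if (rc == -ENOENT)
			return 0;	/* no more UDSPs to get */
		else if (rc)
			return rc;

		udsp = calloc(1, size);
		if (!udsp)
			return -ENOMEM;

		rc = l_ioctl_get_udsp(idx, udsp, size);
		if (!rc)
			print_udsp_yaml(udsp);
		free(udsp);
		if (rc)
			return rc;
	}
} |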
Kernel Selection Algorithm Modifications
| Code Block |
|---|
/*
* select an NI from the Nets with highest priority
*/
struct lnet_ni *
lnet_find_best_ni_on_local_net(struct lnet_peer *peer, int md_cpt)
{
...
list_for_each_entry(peer_net, &peer->lp_peer_nets, lpn_peer_nets) {
...
struct lnet_net *net;
net = lnet_get_net_locked(peer_net->lpn_net_id);
if (!net)
continue;
/*
* look only at the NIs with the highest priority and disregard
* nets which have lower priority. Nets with equal priority are
* examined and the best_ni is selected from amongst them.
*/
net_prio = net->net_priority;
if (net_prio > best_net_prio)
continue;
else if (net_prio < best_net_prio) {
best_net_prio = net_prio;
best_ni = NULL;
}
best_ni = lnet_find_best_ni_on_spec_net(best_ni, peer,
best_peer_net, md_cpt, false);
...
}
...
}
/*
* select the NI with the highest priority
*/
static struct lnet_ni *
lnet_get_best_ni(struct lnet_net *local_net, struct lnet_ni *best_ni,
struct lnet_peer *peer, struct lnet_peer_net *peer_net,
int md_cpt)
{
...
ni_prio = ni->ni_priority;
if (ni_fatal) {
continue;
} else if (ni_healthv < best_healthv) {
continue;
} else if (ni_healthv > best_healthv) {
best_healthv = ni_healthv;
if (distance < shortest_distance)
shortest_distance = distance;
/*
* if this NI is lower in priority than the one already set then discard it
* otherwise use it and set the best priority so far to this NI's.
*/
} else if (ni_prio > best_ni_prio) {
continue;
} else if (ni_prio < best_ni_prio)
best_ni_prio = ni_prio;
}
...
}
/*
* When a UDSP rule associates local NIs with remote NIs, the list of local NIs NIDs
* is flattened to a list in the associated peer_NI. When selecting a peer NI, the
* peer NI with the corresponding preferred local NI is selected.
*/
bool
lnet_peer_is_pref_nid_locked(struct lnet_peer_ni *lpni, lnet_nid_t nid)
{
...
}
/*
* select the peer NI with the highest priority first and then the
* preferred one
*/
static struct lnet_peer_ni *
lnet_select_peer_ni(struct lnet_send_data *sd, struct lnet_peer *peer,
struct lnet_peer_net *peer_net)
{
...
ni_is_pref = lnet_peer_is_pref_nid_locked(lpni, best_ni->ni_nid);
lpni_prio = lpni->lpni_priority;
if (lpni_healthv < best_lpni_healthv)
continue;
/*
* select the NI with the highest priority.
*/
else if (lpni_prio > best_lpni_prio)
continue;
else if (lpni_prio < best_lpni_prio)
best_lpni_prio = lpni_prio;
/*
* select the NI which has the best_ni's NID in its preferred list
*/
else if (!preferred && ni_is_pref)
preferred = true;
...
} |
UDSP Marshaling
After a UDSP is parsed in user space it needs to be marshaled and sent to the kernel. The kernel will de-marshal the data and store it in its own data structures. The UDSP is formed of the following pieces of information:
- Index: The index of the UDSP to insert or delete
- Source Address expression: A dot expression describing the source address range
- Net of the Source: A net id of the source
- Destination Address expression: A dot expression describing the destination address range
- Net of the Destination: A net id of the destination
- Router Address expression: A dot expression describing the router address range
- Net of the Router: A net id of the router
- Action Type: An enumeration describing the action type.
- Action: A structure describing the action if the UDSP is matched.
The data flow of a UDSP looks as follows:
(Gliffy diagram: DataFlow)
Userspace Structures
| Code Block |
|---|
/* each NID range is defined as net_id and an ip range */
struct lnet_ud_nid_descr {
__u32 ud_net_id;
list_head ud_ip_range;
}
/* UDSP action types */
enum lnet_udsp_action_type {
EN_LNET_UDSP_ACTION_PRIORITY = 0,
EN_LNET_UDSP_ACTION_NONE = 1,
}
/*
* a UDSP rule can have up to three user defined NID descriptors
* - src: defines the local NID range for the rule
* - dst: defines the peer NID range for the rule
* - rte: defines the router NID range for the rule
*
* An action union defines the action to take when the rule
* is matched
*/
struct lnet_udsp {
struct list_head udsp_on_list;
__u32 idx;
struct lnet_ud_nid_descr *udsp_src;
struct lnet_ud_nid_descr *udsp_dst;
struct lnet_ud_nid_descr *udsp_rte;
enum lnet_udsp_action_type udsp_action_type;
union {
__u32 udsp_priority;
} udsp_action;
} |
Marshaled Structures
| Code Block |
|---|
struct cfs_range_expr {
struct list_head re_link;
__u32 re_lo;
__u32 re_hi;
__u32 re_stride;
};
struct lnet_ioctl_udsp {
__u32 iou_idx;
enum lnet_udsp_action_type iou_action_type;
union {
__u32 priority;
} iou_action;
__u32 iou_src_dot_expr_count;
__u32 iou_dst_dot_expr_count;
__u32 iou_rte_dot_expr_count;
char iou_bulk[0];
}; |
The address is expressed as a list of cfs_range_expr structures. These need to be marshaled. For an IP address there are 4 of these structures. Other types of addresses can have a different number; as an example, gemini will only have one. The corresponding iou_[src|dst|rte]_dot_expr_count is set to the number of expressions describing the address. Each expression is then flattened in the structure. They have to be flattened in the order defined: SRC, DST, RTE.
The kernel will receive the marshaled data and will form its internal structures. The functions to marshal and de-marshal should be straightforward. Note that user space and kernel space use the same structures. These structures will be defined in a common location. For this reason the functions to marshal and de-marshal will be shared.
Marshaling and de-marshaling functions
Common functions that can be called from user space and kernel space will be created to marshal and de-marshal the UDSPs:
| Code Block |
|---|
/*
* lnet_get_udsp_size()
* Given the UDSP return the size needed to flatten the UDSP
*/
int lnet_get_udsp_size(struct lnet_udsp *udsp);
/*
* lnet_udsp_marshal()
* Marshal the UDSP pointed to by udsp into the memory block that is provided. In order for this
* API to work in both Kernel and User space the bulk pointer needs to be passed in. When this API
* is called in the kernel, it is expected that the bulk memory is allocated in userspace. This API
* is intended to be called from the kernel to marshal the rules before sending it to user space.
* It will also be called from user space to marshal the udsp before sending to the kernel.
* udsp [IN] - udsp to marshal
* bulk_size [IN] - size of bulk.
* bulk [OUT] - allocated block of memory where the serialized rules are stored.
*/
int lnet_udsp_marshal(struct lnet_udsp *udsp, __u32 *bulk_size, void __user *bulk);
/*
* lnet_udsp_demarshal()
* Given a bulk containing a single UDSP, demarshal and populate the udsp structure provided
* bulk [IN] - memory block containing serialized rules
* bulk_size [IN] - size of bulk memory block
* udsp [OUT] - preallocated struct lnet_udsp
*/
int lnet_udsp_demarshal(void __user *bulk, __u32 bulk_size, struct lnet_udsp *udsp); |
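As an illustration of how the expression counts translate into a bulk size, lnet_get_udsp_size() could be implemented along the following lines; lnet_get_descr_expr_count() is a hypothetical helper that returns the number of range expressions in a NID descriptor (4 for an IP address, 1 for gemini).
| Code Block |
|---|
/*
 * Sketch only: total marshaled size = fixed header + flattened expressions.
 */
int lnet_get_udsp_size(struct lnet_udsp *udsp)
{
	int size = sizeof(struct lnet_ioctl_udsp);

	/* expressions are appended to iou_bulk in the order SRC, DST, RTE */
	if (udsp->udsp_src)
		size += lnet_get_descr_expr_count(udsp->udsp_src) *
			sizeof(struct cfs_range_expr);
	if (udsp->udsp_dst)
		size += lnet_get_descr_expr_count(udsp->udsp_dst) *
			sizeof(struct cfs_range_expr);
	if (udsp->udsp_rte)
		size += lnet_get_descr_expr_count(udsp->udsp_rte) *
			sizeof(struct cfs_range_expr);

	return size;
} |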
Requirements Covered by the Design
cfg-100, cfg-105, cfg-110, cfg-115, cfg-120, cfg-125, cfg-130, cfg-135, cfg-140, cfg-160, cfg-165
Use Cases
Preferred Network
If a node can be reached on two LNet networks, it is sometimes desirable to designate a fail-over network. Currently in Lustre there is the concept of High Availability (HA), which allows servicenode NIDs to be defined as described in the Lustre manual section 11.2. By using the syntax described in that section, two NIDs to the same peer can also be defined. However, this approach suffers from a current limitation in the Lustre software, where the NIDs are exposed to layers above LNet. It is ideal to keep network failure handling contained within LNet and only let Lustre worry about defining HA.
Given this, it is desirable to have two LNet networks defined on a node, each of which could have multiple interfaces, and then have a way to tell LNet to always use one network until it is no longer available, i.e. all interfaces in that network are down.
In this manner we separate the functionality of defining fail-over pairs from defining fail-over networks.
Preferred NIDs
In a scenario where servers are being upgraded with new interfaces to be used in Multi-Rail, it's possible to add interfaces, for example MLX-EDR interfaces to the server. The user might want to continue making the existing QDR clients use the QDR interface, while new clients can use the EDR interface or even both interfaces. By specifying rules on the clients that prefer a specific interface this behavior can be achieved.
(Gliffy diagram)
Preferred local/remote NID pairs
This is a finer-tuned method of specifying an exact path: not only can a priority be assigned to a local or remote interface, but concrete pairs of interfaces can be specified as most preferred. A peer interface can be associated with multiple local interfaces if necessary, to form an N:1 relationship between local interfaces and remote interfaces.
(Gliffy diagram)
Refer to Olaf's LUG 2016/LAD 2016 PPT for more context.
Preferred Routers
Reference Links
https://www.ece.tufts.edu/~karen/classes/final_presentation/Dragonfly_Topology_Long.pptx
Flattening rules
Rules will have serialize and deserialize APIs. The serialize API will flatten the rules into a contiguous buffer that will be sent to the kernel. On the kernel side the rules will be deserialized to be stored and queried. When user space queries the rules, the rules are serialized and sent up to user space, which deserializes them and prints them in YAML format.
...
Selection Policies
There are four different types of rules that this HLD will address:
- LNet Network priority rule
- This rule assigns a priority to a network. During selection the network with the highest priority is preferred.
- Local NID rule
- This rule assigns a priority to a local NID within an LNet network. This NID is preferred during selection.
- Remote NID rule
- This rule assigns a priority to a remote NID within an LNet network. This NID is preferred during selection
- Peer-to-peer rules
- This rule associates local NIs with peer NIs. When selecting a peer NI to send to, the one associated with the selected local NI is preferred.
These rules are applied differently in the kernel.
The Network priority rule results in a priority value in the struct lnet_net being set to the one defined in the rule. The local NID rule results in a priority value in the struct lnet_ni being set to the one defined in the rule. The remote NID rule results in a priority value in the struct lnet_peer_ni being set to the one defined in the rule. The infrastructure for peer-to-peer rules is implemented via a list of preferred NIDs kept in the struct lnet_peer_ni structure. Once the local network and best NI are selected, we go through all the peer NIs on the same network and prefer the peer NI which has the best NI's NID on its preferred list, thereby preferring a specific pathway between the node and the peer.
Each of these rules impacts a different part of the selection algorithm. The Network rule impacts the selection of the local network. The local NID rule impacts the selection of the best NI to send from within the preferred network. The remote NID and peer-to-peer rules both impact the peer NI to send to.
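As an illustration of the peer-to-peer case, the sketch below shows how a matched rule could be flattened into the preferred-NID list of a peer NI; lnet_udsp_nid_match() and lnet_peer_add_pref_nid() are hypothetical names for the matching and list-update operations described above.
| Code Block |
|---|
/*
 * Sketch only: record each matching local NID as "preferred" on a peer NI.
 */
static void
lnet_udsp_apply_p2p_rule(struct lnet_udsp *udsp, struct lnet_peer_ni *lpni)
{
	struct lnet_ni *ni = NULL;

	/* ignore peer NIs that do not match the rule's destination descriptor */
	if (!lnet_udsp_nid_match(udsp->udsp_dst, lpni->lpni_nid))
		return;

	/* every local NI matching the source descriptor becomes a preferred
	 * source NID for this peer NI */
	while ((ni = lnet_get_next_ni_locked(NULL, ni)) != NULL) {
		if (lnet_udsp_nid_match(udsp->udsp_src, ni->ni_nid))
			lnet_peer_add_pref_nid(lpni, ni->ni_nid);
	}
} |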
It is possible to use both the local NID rule and the peer-to-peer rule to force messages to always take a specific path. For example, assuming a node with three interfaces 10.10.10.3, 10.10.10.4 and 10.10.10.5 and two rules as follows:
| Code Block |
|---|
selection:
    - type: nid
      local: true
      nid: 10.10.10.5
      priority: 0
selection:
    - type: peer
      local: 10.10.10.5
      remote: 10.10.10.6
      priority: 0 |
These two rules will always prefer sending messages from 10.10.10.5 to 10.10.10.6, as opposed to only doing so occasionally, when the 10.10.10.5 interface happens to be selected (every third message assuming round robin).
As another example, it is also possible to prioritize a set of local and remote NIs so that they are always preferred. Assuming two peers
- PeerA: 10.10.10.2, 10.10.10.3 and 10.10.30.2
- PeerB: 10.10.10.4, 10.10.10.5 and 10.10.30.3
We can setup the following rules:
| Code Block |
|---|
selection:
    - type: nid
      local: true
      nid: 10.10.10.*
      priority: 0
selection:
    - type: nid
      local: false
      nid: 10.10.10.*
      priority: 0 |
These rules will always prefer messages to be sent between the 10.10.10.* interfaces, rather than the 10.10.30.* interfaces.
The question to answer is whether such restrictions are generally useful. One use case for such rules is while debugging or characterizing the network. Another argument is that the clusters that use Lustre are so diverse that allowing them flexibility over traffic control is a benefit, as long as the default behavior is optimal out of the box.
The Use Cases section above outlines some real-life scenarios where these rules can be used.
DLC APIs
The DLC library will provide the outlined APIs to expose a way to create, delete and show rules.
Once rules are created and stored in the kernel, they are assigned an ID. This ID is returned and shown in the show command, which dumps the rules. This ID can be referenced later to delete a rule. The process is described in more details below.
| Code Block |
|---|
/*
* lustre_lnet_add_net_sel_pol
* Add a net selection policy. If there already exists a
* policy for this net it will be updated.
* net - Network for the selection policy
* priority - priority of the rule
*/
int lustre_lnet_add_net_sel_pol(char *net, int priority);
/*
* lustre_lnet_del_net_sel_pol
* Delete a net selection policy.
* net - Network for the selection policy
* id - [OPTIONAL] ID of the policy. This can be retrieved via a show command.
*/
int lustre_lnet_del_net_sel_pol(char *net, int id);
/*
* lustre_lnet_show_net_sel_pol
* Show configured net selection policies.
* net - filter on the net provided.
*/
int lustre_lnet_show_net_sel_pol(char *net);
/*
* lustre_lnet_add_nid_sel_pol
* Add a nid selection policy. If there already exists a
* policy for this nid it will be updated. NIDs can be either
* local NIDs or remote NIDs.
* nid - NID for the selection policy
* local - is this a local NID
* priority - priority of the rule
*/
int lustre_lnet_add_nid_sel_pol(char *nid, bool local, int priority);
/*
* lustre_lnet_del_nid_sel_pol
* Delete a nid selection policy.
* nid - NID for the selection policy
* local - is this a local NID
* id - [OPTIONAL] ID of the policy. This can be retrieved via a show command.
*/
int lustre_lnet_del_nid_sel_pol(char *nid, bool local, int id);
/*
* lustre_lnet_show_nid_sel_pol
* Show configured nid selection policies.
* nid - filter on the NID provided.
*/
int lustre_lnet_show_nid_sel_pol(char *nid);
/*
* lustre_lnet_add_peer_sel_pol
* Add a peer to peer selection policy. If there already exists a
* policy for the pair it will be updated.
* src_nid - source NID
* dst_nid - destination NID
* priority - priority of the rule
*/
int lustre_lnet_add_peer_sel_pol(char *src_nid, char *dst_nid, int priority);
/*
* lustre_lnet_del_peer_sel_pol
* Delete a peer to peer selection policy.
* src_nid - source NID
* dst_nid - destination NID
* id - [OPTIONAL] ID of the policy. This can be retrieved via a show command.
*/
int lustre_lnet_del_peer_sel_pol(char *src_nid, char *dst_nid, int id);
/*
* lustre_lnet_show_peer_sel_pol
* Show peer to peer selection policies.
* src_nid - [OPTIONAL] source NID. If provided the output will be filtered
* on this value.
* dst_nid - [OPTIONAL] destination NID. If provided the output will be filtered
* on this value.
*/
int lustre_lnet_show_peer_sel_pol(char *src_nid, char *dst_nid); |
...
User space/Kernel space Data structures
| Code Block |
|---|
/*
* describes a network:
* nw_id: can be the base network name, ex: o2ib or a full network id, ex: o2ib3.
* nw_expr: an expression to describe the variable part of the network ID
* ex: tcp* - all tcp networks
* ex: tcp[1-5] - resolves to tcp1, tcp2, tcp3, tcp4 and tcp5.
*/
struct lustre_lnet_network_descr {
__u32 nw_id;
struct cfs_expr_list *nw_expr;
};
/*
* lustre_lnet_network_rule
* network rule
* nwr_link - link on rule list
* nwr_descr - network descriptor
* nwr_priority - priority of the rule.
* nwr_id - ID of the rule assigned while deserializing if not already assigned.
*/
struct lustre_lnet_network_rule {
struct list_head nwr_link;
struct lustre_lnet_network_descr nwr_descr;
__u32 nwr_priority;
__u32 nwr_id;
};
/*
* lustre_lnet_nidr_range_descr
* nidr_expr - expression describing the IP part of the NID
* nidr_nw - a description of the network
*/
struct lustre_lnet_nidr_range_descr {
struct list_head nidr_expr;
struct lustre_lnet_network_descr nidr_nw;
};
/*
* lustre_lnet_nidr_range_rule
* Rule for the nid range.
* nidr_link - link on the rule list
* nidr_descr - descriptor of the nid range
* nidr_priority - priority of the rule
* nidr_local - true if the NID range describes local NIDs
*/
struct lustre_lnet_nidr_range_rule {
struct list_head nidr_link;
struct lustre_lnet_nidr_range_descr nidr_descr;
int nidr_priority;
bool nidr_local;
};
/*
* lustre_lnet_p2p_rule
* Rule for the peer to peer.
* p2p_link - link on the rule list
* p2p_src_descr - source nid range
* p2p_dst_descr - destination nid range
* priority - priority of the rule
*/
struct lustre_lnet_p2p_rule {
struct list_head p2p_link;
struct lustre_lnet_nidr_range_descr p2p_src_descr;
struct lustre_lnet_nidr_range_descr p2p_dst_descr;
int priority;
}; |
IOCTL Data structures
| Code Block |
|---|
enum lnet_sel_rule_type {
LNET_SEL_RULE_NET = 0,
LNET_SEL_RULE_NID,
LNET_SEL_RULE_P2P
};
struct lnet_expr {
__u32 ex_lo;
__u32 ex_hi;
__u32 ex_stride;
};
struct lnet_net_descr {
__u32 nsd_net_id;
struct lnet_expr nsd_expr;
};
struct lnet_nid_descr {
struct lnet_expr nir_ip[4];
struct lnet_net_descr nir_net;
};
struct lnet_ioctl_net_rule {
struct lnet_net_descr nsr_descr;
__u32 nsr_prio;
__u32 nsr_id;
};
struct lnet_ioctl_nid_rule {
struct lnet_nid_descr nir_descr;
__u32 nir_prio;
__u32 nir_id;
bool nir_local;
};
struct lnet_ioctl_net_p2p_rule {
struct lnet_nid_descr p2p_src_descr;
struct lnet_nid_descr p2p_dst_descr;
__u32 p2p_prio;
__u32 p2p_id;
};
/*
* lnet_ioctl_rule_blk
* describes a set of rules of the same type to transfer to the kernel.
* rule_hdr - header information describing the total size of the transfer
* rule_type - type of rules included
* rule_size - size of each individual rule. Can be used to check backwards compatibility
* rule_count - number of rules included in the bulk.
* rule_bulk - pointer to the user space allocated memory containing the rules.
*/
struct lnet_ioctl_rule_blk {
struct libcfs_ioctl_hdr rule_hdr;
enum lnet_sel_rule_type rule_type;
__u32 rule_size;
__u32 rule_count;
void __user *rule_bulk;
}; |
Serialization/Deserialization
Both userspace and kernel space will store the rules in the data structures described above. However, once userspace has parsed and stored the rules, it will need to serialize them and send them to the kernel.
The serialization process will use the IOCTL data structures defined above. The process itself is straightforward: the rules, as stored in user space or the kernel, are kept in a linked list, but each rule is of deterministic size and form. For example, an IP address is described by four struct cfs_range_expr structures, which can be translated into four struct lnet_expr structures.
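As a minimal sketch of that translation (the helper name is hypothetical; the re_lo/re_hi/re_stride fields follow the libcfs struct cfs_range_expr layout), each address component of a parsed NID range can be copied into the fixed-size array used by the IOCTL structures:
| Code Block |
|---|
/* Hypothetical helper: copy a parsed IP range (a list of four
 * struct cfs_range_expr entries, one per address component) into the
 * fixed-size lnet_expr array used by the IOCTL structures. */
static void
lnet_descr_copy_ip(struct lnet_nid_descr *descr, struct list_head *ip_exprs)
{
	struct cfs_range_expr *expr;
	int i = 0;

	list_for_each_entry(expr, ip_exprs, re_link) {
		if (i >= 4)
			break;
		descr->nir_ip[i].ex_lo = expr->re_lo;
		descr->nir_ip[i].ex_hi = expr->re_hi;
		descr->nir_ip[i].ex_stride = expr->re_stride;
		i++;
	}
} |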
As an example, a serialized list of net rules will look as follows:
| Gliffy Diagram |
|---|
The rest of the rule types will look very similar to the above, except that the list of rules in the memory pointed to by rule_bulk will contain the pertinent structure format.
On the receiving end the process is reversed to rebuild the linked lists.
Common functions that can be called from user space and kernel space will be created to serialize and deserialize the rules:
| Code Block |
|---|
/*
* lnet_sel_rule_serialize()
* Serialize the rules pointed to by rules into the memory block provided. In order for this
* API to work in both kernel and user space, the bulk pointer needs to be passed in. When this
* API is called in the kernel it is expected that the bulk memory was allocated in user space,
* since this API is intended to be called from the kernel to serialize the rules before
* sending them to user space.
* rules [IN] - rules to be serialized
* rule_type [IN] - rule type to be serialized
* bulk_size [IN] - size of memory allocated.
* bulk [OUT] - allocated block of memory where the serialized rules are stored.
*/
int lnet_sel_rule_serialize(struct list_head *rules, enum lnet_sel_rule_type rule_type, __u32 *bulk_size, void __user *bulk);
/*
* lnet_sel_rule_deserialize()
* Given a bulk of rule_type rules, deserialize and append rules to the linked
* list passed in. Each rule is assigned an ID > 0 if an ID is not already assigned
* bulk [IN] - memory block containing serialized rules
* bulk_size [IN] - size of bulk memory block
* rule_type [IN] - type of rule to deserialize
* rules [OUT] - linked list to append the deserialized rules to
*/
int lnet_sel_rule_deserialize(void __user *bulk, __u32 bulk_size, enum lnet_sel_rule_type rule_type, struct list_head *rules); |
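The net-rule case of the serialization might look roughly as follows. This is a sketch only: the helper name is hypothetical, the nw_expr translation is elided, error handling is abbreviated, and copy_to_user() assumes the kernel-side caller.
| Code Block |
|---|
/* Sketch: serialize a list of lustre_lnet_network_rule entries into the
 * user space bulk buffer as an array of struct lnet_ioctl_net_rule. */
static int
lnet_serialize_net_rules(struct list_head *rules, __u32 bulk_size,
			 void __user *bulk)
{
	struct lustre_lnet_network_rule *rule;
	struct lnet_ioctl_net_rule ioc_rule;
	__u32 offset = 0;

	list_for_each_entry(rule, rules, nwr_link) {
		if (offset + sizeof(ioc_rule) > bulk_size)
			return -ENOSPC;

		memset(&ioc_rule, 0, sizeof(ioc_rule));
		ioc_rule.nsr_descr.nsd_net_id = rule->nwr_descr.nw_id;
		/* translation of nw_expr into nsd_expr omitted for brevity */
		ioc_rule.nsr_prio = rule->nwr_priority;
		ioc_rule.nsr_id = rule->nwr_id;

		if (copy_to_user((char __user *)bulk + offset, &ioc_rule,
				 sizeof(ioc_rule)))
			return -EFAULT;
		offset += sizeof(ioc_rule);
	}

	return 0;
} |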
Policy IOCTL Handling
Three new IOCTLs will need to be added: IOC_LIBCFS_ADD_RULES, IOC_LIBCFS_DEL_RULES, and IOC_LIBCFS_GET_RULES.
IOC_LIBCFS_ADD_RULES
The handler for the IOC_LIBCFS_ADD_RULES will perform the following operations:
- call lnet_sel_rule_deserialize()
- Iterate through all the local networks and apply the rules
- Iterate through all the peers and apply the rules.
- splice the new list with the existing rules, resolving any conflicts in the process. New rules always trump old rules (no pun intended).
Application of the rules will be done under api_mutex_lock and the exclusive lnet_net_lock to avoid having the peer or local net lists changed while the rules are being applied.
There will be different lists, one for each rule type. The rules are iterated and applied whenever:
- A local network interface is added.
- A remote peer/peer_net/peer_ni is added
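A rough sketch of the IOC_LIBCFS_ADD_RULES handler flow described above is shown below. The helper names (lnet_apply_rules_to_nets(), lnet_apply_rules_to_peers(), lnet_splice_rule_list()) are hypothetical; the locking follows the existing LNet api_mutex and exclusive lnet_net_lock usage.
| Code Block |
|---|
/* Sketch of the IOC_LIBCFS_ADD_RULES handler; helper names are illustrative. */
static int
lnet_add_rules_handler(struct lnet_ioctl_rule_blk *blk)
{
	LIST_HEAD(new_rules);
	int rc;

	/* 1. rebuild the linked list from the user space bulk */
	rc = lnet_sel_rule_deserialize(blk->rule_bulk,
				       blk->rule_size * blk->rule_count,
				       blk->rule_type, &new_rules);
	if (rc < 0)
		return rc;

	mutex_lock(&the_lnet.ln_api_mutex);
	lnet_net_lock(LNET_LOCK_EX);

	/* 2 & 3. walk the local networks and the peers, applying the rules */
	lnet_apply_rules_to_nets(&new_rules, blk->rule_type);
	lnet_apply_rules_to_peers(&new_rules, blk->rule_type);

	/* 4. merge into the master list; new rules trump existing ones */
	lnet_splice_rule_list(&new_rules, blk->rule_type);

	lnet_net_unlock(LNET_LOCK_EX);
	mutex_unlock(&the_lnet.ln_api_mutex);

	return 0;
} |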
IOC_LIBCFS_DEL_RULES
The handler for IOC_LIBCFS_DEL_RULES will delete the rule matching the ID passed in; if no ID is passed in, the rule that exactly matches the supplied pattern is deleted.
There will be no other actions taken on rule removal. Once the rule has been applied it will remain applied until the objects it has been applied to are removed.
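A minimal sketch of the matching logic for the net-rule case is given below (assuming deserialization assigns IDs > 0, so an ID of 0 means "match the rule exactly"; lnet_net_descr_equal() and the memory management are illustrative):
| Code Block |
|---|
/* Sketch: delete a net rule either by ID or by exact match of its
 * descriptor. lnet_net_descr_equal() is hypothetical, and kfree() stands
 * in for whatever allocator the rule list actually uses. */
static int
lnet_del_net_rule(struct list_head *rules, struct lnet_ioctl_net_rule *match)
{
	struct lustre_lnet_network_rule *rule, *tmp;

	list_for_each_entry_safe(rule, tmp, rules, nwr_link) {
		if (match->nsr_id != 0 && rule->nwr_id != match->nsr_id)
			continue;
		if (match->nsr_id == 0 &&
		    !lnet_net_descr_equal(&rule->nwr_descr, &match->nsr_descr))
			continue;

		list_del(&rule->nwr_link);
		kfree(rule);
		return 0;
	}

	return -ENOENT;
} |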
IOC_LIBCFS_GET_RULES
The handler for the IOC_LIBCFS_GET_RULES will call lnet_sel_rule_serialize() on the master linked list for the type of the rule identified in struct lnet_ioctl_rule_blk.
It fills in as many rules as can fit in the bulk, computed as (rule_hdr.ioc_len - sizeof(struct lnet_ioctl_rule_blk)) / rule_size. That many rules are serialized and placed in the bulk memory block. The IOCTL returns ENOSPC if the given bulk memory block is not large enough to hold all the rules, and it sets rule_count to the number of rules serialized. The userspace process can then make another call with the number of rules to skip set in rule_count; the handler will skip the indicated number of rules and fill the new bulk memory with the remaining rules. This process can be repeated until all the rules have been returned to userspace.
In userspace the rules are printed in the same YAML format as they are parsed in.
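A sketch of how a userspace caller might page through the rules using this ENOSPC/skip-count protocol follows. The l_ioctl()/LNET_DEV_ID plumbing follows the existing libcfs ioctl conventions, print_net_rules() is hypothetical, and error handling is abbreviated.
| Code Block |
|---|
/* Sketch of a userspace paging loop for IOC_LIBCFS_GET_RULES. */
#define RULE_BATCH_MAX	64

static int get_all_net_rules(void)
{
	struct lnet_ioctl_net_rule bulk[RULE_BATCH_MAX];
	struct lnet_ioctl_rule_blk blk;
	__u32 skip = 0;
	int rc;

	do {
		memset(&blk, 0, sizeof(blk));
		/* ioc_len covers the header plus the bulk buffer */
		blk.rule_hdr.ioc_len = sizeof(blk) + sizeof(bulk);
		blk.rule_type = LNET_SEL_RULE_NET;
		blk.rule_size = sizeof(struct lnet_ioctl_net_rule);
		blk.rule_count = skip;	/* number of rules to skip */
		blk.rule_bulk = bulk;

		rc = l_ioctl(LNET_DEV_ID, IOC_LIBCFS_GET_RULES, &blk);
		if (rc < 0 && errno != ENOSPC)
			return -errno;

		/* on return, rule_count holds the number of rules serialized */
		print_net_rules(bulk, blk.rule_count);
		skip += blk.rule_count;
	} while (rc < 0 && errno == ENOSPC);

	return 0;
} |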
Policy Application
Net Rule
The net which matches the rule will be assigned the priority defined in the rule.
NID Rule
If the local flag is set then attempt to match the local_nis; otherwise attempt to match the peer_nis. The matched NI shall be assigned the priority defined in the rule.
Peer to Peer Rule
NIDs of local_nis matching the source NID pattern in the peer to peer rule will be added to a preferred list on the peer_nis whose NIDs match the destination NID pattern.
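A rough sketch of how a peer to peer rule might be applied when a peer_ni is added is shown below. The match helper, the preferred-NID helper, and the iteration macro over local NIs are hypothetical, and the lnet_ni/lnet_peer_ni field names may differ from the current code.
| Code Block |
|---|
/* Sketch: apply peer to peer rules to a newly added peer_ni. Helper names
 * (lnet_nid_descr_match(), lnet_peer_ni_add_pref_nid(), lnet_for_each_ni())
 * are illustrative, not the actual LNet API. */
static void
lnet_apply_p2p_rules(struct lnet_peer_ni *lpni, struct list_head *p2p_rules)
{
	struct lustre_lnet_p2p_rule *rule;
	struct lnet_ni *ni;

	list_for_each_entry(rule, p2p_rules, p2p_link) {
		/* the peer NID must match the destination pattern */
		if (!lnet_nid_descr_match(&rule->p2p_dst_descr, lpni->lpni_nid))
			continue;

		/* every local NI matching the source pattern becomes a
		 * preferred NID on this peer_ni */
		lnet_for_each_ni(ni) {
			if (lnet_nid_descr_match(&rule->p2p_src_descr, ni->ni_nid))
				lnet_peer_ni_add_pref_nid(lpni, ni->ni_nid);
		}
	}
} |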
Selection Algorithm Integration
Currently the selection algorithm performs its job in the following general steps:
- determine the best network to communicate to the destination peer by looking at all the LNet networks the peer is on.
- select the network with the highest priority
- for each selected network go through all the local NIs and keep track of the best_ni based on:
- Its priority
- NUMA distance
- available credits
- round robin
- Skip any networks which are lower priority than the "active" one. If there are multiple networks with the same priority then the best_ni is selected from amongst them using the above criteria.
- Once the best_ni has been selected, select the best peer_ni available by going through the list of the peer_nis on the selected network. Select the peer_ni based on:
- The priority of the peer_ni.
- Whether the NID of the best_ni is on the preferred local NID list of the peer_ni (it is placed there through the application of the peer to peer rules).
- available credits
- round robin
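A condensed sketch of the best_ni comparison order described above is shown below. The field and helper names are illustrative; the actual selection logic in the LNet code (lnet_select_pathway()) is considerably more involved.
| Code Block |
|---|
/* Sketch: compare a candidate local NI against the current best_ni using
 * the criteria above. Returns true if the candidate should replace best_ni.
 * Field names (ni_sel_priority, ni_dist, ni_tx_credits, ni_seq) are
 * illustrative. */
static bool
lnet_ni_is_better(struct lnet_ni *candidate, struct lnet_ni *best_ni)
{
	if (best_ni == NULL)
		return true;

	/* 1. rule-assigned priority (higher wins) */
	if (candidate->ni_sel_priority != best_ni->ni_sel_priority)
		return candidate->ni_sel_priority > best_ni->ni_sel_priority;

	/* 2. NUMA distance to the sending thread (shorter wins) */
	if (candidate->ni_dist != best_ni->ni_dist)
		return candidate->ni_dist < best_ni->ni_dist;

	/* 3. available TX credits (more wins) */
	if (candidate->ni_tx_credits != best_ni->ni_tx_credits)
		return candidate->ni_tx_credits > best_ni->ni_tx_credits;

	/* 4. round robin on the NI sequence number */
	return candidate->ni_seq < best_ni->ni_seq;
} |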
Misc
As an example, in a dragonfly topology as diagrammed below, a node can have multiple interfaces on the same network, but some interfaces are not optimized to go directly to the destination group. So if the selection algorithm is operating without any rules, it could select a local interface which is less than optimal.
The clouds in the diagram below each represent a group of LNet nodes on the o2ib network. The admin should know which node interfaces resolve to a direct path to the destination group. Therefore, giving priority to a local NID within a network is a way to ensure that messages always prefer the optimized paths.
| Gliffy Diagram |
|---|
The diagram above was inspired by: https://www.ece.tufts.edu/~karen/classes/final_presentation/Dragonfly_Topology_Long.pptx
Refer to the above PowerPoint for further discussion of the dragonfly topology.
Example
TBD: I'm thinking that in a topology such as the one represented above, the sysadmin would configure the routing properly, such that messages heading to a particular IP destination in a different group get routed to the correct edge router, and from there to the destination group. When LNet is layered on top of this topology there will be no need to explicitly specify a rule, as all necessary routing rules will be defined in the kernel routing tables. The assumption here is that IP over InfiniBand (IPoIB) would obey the standard Linux routing rules.