Albert Chu
chu11@llnl.gov
Quality of Service (QoS) in Infiniband provides a means to give guarantees/minimum requirements to certain applications on the fabric.
Virtual Lanes (VLs): Infiniband supports up to 15 data Virtual Lanes (VLs), numbered 0-14. Each VL provides independent virtual transmit/receive buffers on every port of the fabric.
Service Level (SL): A number (0-15) that can be assigned to any Infiniband packet. The meaning/purpose of an SL is not defined by the specification; it's up to the user to determine.
There are three basic parts to QoS in Infiniband:
1. SL assignment: Normally, you assign different SLs to different protocols, applications, etc. (e.g. MPI, Lustre). This allows each protocol/application to be given unique QoS requirements.
2. SL2VL mapping: Map SLs to VLs. For example, SL0->VL0, SL1->VL1, etc.
3. VL Arbitration: Determines VL transmission rules based on a set of prioritization rules.
It is the responsibility of administrators/users to configure and use the SLs/VLs properly. By themselves, SLs and VLs do nothing and mean nothing to the Infiniband hardware; they only take on meaning through the assignments, mappings, and arbitration rules above.
SL2VL mapping is pretty basic: you assign an SL to a VL with a direct one-to-one mapping, e.g. SL1->VL1, SL2->VL2. Normally, you map SLX -> VLX. If you do otherwise, you're starting to do something pretty crazy.
VL Arbitration is not so basic. There are three components to its configuration: the High-Priority Table, the Low-Priority Table, and the Limit of High Priority.
High & Low Priority VL Arbitration Tables are a list of VL numbers (0-14) and a weighting value (0-255) pairs. The weighting value indicates the number of 64 byte units that can be transmitted from that VL when it is that VL's turn to transmit. A weight of 0 means no data can be transferred. Counters are rounded up as needed for packets (i.e. a weight of 1 means a packet > 64 bytes can still be sent). The High Priority VL Arbitration Table is weights for "high priority" data while the Low Priority VL Arbitration Table is weights for "low priority" data (the usefulness will make more sense after you read "Limit of High Priority" below).
Note that 64*255 is only about 16K, which is a small amount of data for many institutions. I think it is easiest to think of the weights as ratios of bandwidth when the network is completely flooded with data from all protocols/applications.
For example:
A) VL0 Weight = 255, VL1 Weight = 255 -> 50% bandwidth each for VL0 and VL1.
B) VL0 Weight = 255, VL1 Weight = 255, VL2 Weight = 255 -> 33% bandwidth each for VL0, VL1, and VL2.
C) VL0 Weight = 200, VL1 Weight = 100 -> 66% bandwidth for VL0, 33% for VL1.
D) VL0 Weight = 200, VL1 Weight = 100, VL2 Weight = 100 -> 50% bandwidth for VL0, 25% each for VL1 and VL2.
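The arithmetic is simple enough, but if a sketch helps, here is a minimal C helper (my own illustration, not part of any Infiniband API) that treats the weights as ratios under full load:

#include <stdio.h>

/* Print each VL's approximate share of bandwidth, assuming every VL
 * always has data queued, so the weights behave as simple ratios. */
static void print_shares(const int *weights, int nvls)
{
    int total = 0;
    for (int i = 0; i < nvls; i++)
        total += weights[i];
    if (total == 0)
        return;
    for (int i = 0; i < nvls; i++)
        printf("VL%d: weight %3d -> %.1f%% of bandwidth\n",
               i, weights[i], 100.0 * weights[i] / total);
}

int main(void)
{
    int example_d[] = { 200, 100, 100 };  /* example D above */
    print_shares(example_d, 3);           /* -> 50.0%, 25.0%, 25.0% */
    return 0;
}

Keep in mind real arbitration works in 64-byte units with rounding up per packet, so these percentages are only the steady-state approximation.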
The Limit of High Priority indicates the amount of high-priority data (from the High Priority VL Arbitration Table) that can be sent without an opportunity to send a low-priority packet (from the Low Priority VL Arbitration Table). The limit is counted in units of 4K bytes, with two special values: 0 = one packet, 255 = unlimited data.
4K*254 is only about 1M, which again is a small amount for many institutions. The most likely values to consider using are:
0 - one packet
254 - max high limit data w/o being unlimited
255 - unlimited data
When you combine the High/Low VL Arbitration tables with the Limit of High Priority, you can create some interesting QoS behavior.
(The following example is borrowed from the "Quality and Service in OFED 3.1" presentation listed below.)
High-Limit: 0
VL-Arb-High: VL2 Weight = 1
VL-Arb-Low: VL0 Weight = 200, VL1 Weight = 50
Effectively, any time data is available on VL2, send at most one packet from VL2 before sending data from VL0 or VL1. If no VL2 data is available, VL0 gets 80% of bandwidth and VL1 gets 20%.
Idea:
(Assume Lustre metadata servers and Lustre OSTs are on the same fabric)
MPI -> SL0 -> VL0
Lustre OST Data -> SL1 -> VL1
Lustre Metadata -> SL2 -> VL2
In this example, Lustre metadata traffic is assumed to be light, but with the high priority it is serviced faster, theoretically allowing for better Lustre interaction. When there is no Lustre metadata traffic on the fabric, MPI is given the majority share of bandwidth because it is more timing sensitive.
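For reference, here is a sketch of how this example might be expressed with the opensm options used in the configuration section below (untested; option names taken from that section):

qos_high_limit 0
qos_vlarb_high 2:1
qos_vlarb_low 0:200,1:50
qos_sl2vl 0,1,2,15,15,15,15,15,15,15,15,15,15,15,15,15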
High-Limit: 254
VL-Arb-High: VL0 Weight = 255
VL-Arb-Low: VL1 Weight = 1
Effectively, whenever there is data on VL0, always send it before VL1, but do not allow VL0 to starve VL1: let VL1 send *something* once in a while.
Idea:
MPI -> SL0 -> VL0
Lustre -> SL1 -> VL1
So MPI always gets priority over Lustre, but cannot starve it out. The High-Limit of 254 means a low-priority packet must be sent once in a while. This could be important if Lustre "pings" are done to keep some services alive.
QoS is currently configured in /var/cache/opensm/opensm.opts (later to be in /etc/opensm/opensm.conf).
#
# QoS OPTIONS
#
qos TRUE
qos_policy_file /var/cache/opensm/qos-policy.conf

# QoS default options
qos_max_vls 2
qos_high_limit 254
qos_vlarb_high 0:255
qos_vlarb_low 1:1
qos_sl2vl 0,1,15,15,15,15,15,15,15,15,15,15,15,15,15,15

qos_ca_max_vls 2
qos_ca_high_limit 254
qos_ca_vlarb_high 0:255
qos_ca_vlarb_low 1:1
qos_ca_sl2vl 0,1,15,15,15,15,15,15,15,15,15,15,15,15,15,15

# achu: VL2 not used, need to give non-null input to buggy opensm
qos_swe_max_vls 2
qos_swe_high_limit 255
qos_swe_vlarb_high 0:225,1:25
qos_swe_vlarb_low 2:1
qos_swe_sl2vl 0,1,15,15,15,15,15,15,15,15,15,15,15,15,15,15
There are default QoS options, as well as specific QoS options for channel adapters, switches, etc. They allow you to configure different port types across the fabric. The "max_vls" entries can be ignored. The "high_limit", "vlarb_high", and "vlarb_low" fields are hopefully self-explanatory. The "vlarb_high" and "vlarb_low" entries take comma-separated <VL>:<Weight> pairs as input.
In the above example, channel adapters have:
VL0 Weight = 255 -> For MPI
VL1 Weight = 1 -> For Lustre
Idea: With the High Limit of 254, MPI always gets priority, but cannot starve Lustre.
In the above example, Switches have:
VL0 Weight = 225 -> For MPI
VL1 Weight = 25 -> For Lustre
Idea: Across the entire cluster, MPI, Lustre, etc. traffic from different jobs/tasks is in flight at the same time. We don't want MPI to starve out other traffic, so we give it a large chunk of bandwidth but not all of it (in this example 90% for MPI, 10% for Lustre).
SLs are mapped to VLs by listing the VL for each SL in increasing SL order. In the above example, SL0 -> VL0 and SL1 -> VL1. An entry of 15 is used for SLs you don't care about.
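Put another way, position N in the comma-separated list (counting from zero) holds the VL for SL N. For the string used above:

qos_sl2vl 0,1,15,15,15,15,15,15,15,15,15,15,15,15,15,15
-> SL0 -> VL0, SL1 -> VL1, SL2 through SL15 -> 15 (don't care)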
The configuration of QoS is now done, but we still need to make protocols/applications use the appropriate SLs.
Some tools allow you to pick an SL when you run, e.g.:

mpirun -sl 0
However, it may not be easy to force users/applications to use different SLs. The easiest way to assign SLs is through the OpenSM QoS policy file.
Depending on OpenSM version, this file is in /var/cache/opensm/qos-policy.conf or /etc/opensm/qos-policy.conf.
The following is a short summary of the options I think are needed for our environment. See "QoS Management in OpenSM" for the full set of options.
Format:
qos-ulps
    <user level protocol>, <options> : <SL level>
end-qos-ulps

<user level protocol> = IPoIB, SDP, SRP, iSER
<options> = port-num, pkey, service-id, target-port-guid
(Note: options depend on which user level protocol is selected)
<SL level> = SL level 0-15

Example:

qos-ulps
    default : 0
    any, target-port-guid 0x0002c9030002879d,0x0002c90300028765 : 1
end-qos-ulps
Idea:
Everything (most notably MPI) defaults to SL0. Anything destined to one of the listed target-port GUIDs gets SL1.
If the GUIDs listed under target-port-guid are Lustre routers, then Lustre data gets SL1. In combination with the VL Arbitration and SL2VL Mapping configuration listed above, hopefully it can be seen how MPI gets priority over Lustre but does not starve it out.
Note that the target-port-guid lists must be kept up to date if GUIDs change. You can determine GUIDs via /usr/sbin/ibstat.
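For example (output trimmed to the relevant line; the GUID shown is just the first one from the policy example above):

# > /usr/sbin/ibstat | grep "Port GUID"
Port GUID: 0x0002c9030002879d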
The tool smpquery can be used to verify that VL Arbitration tables and SL2VL tables have been configured in cards/switches properly.
# > /usr/sbin/smpquery sl2vl 346
# SL2VL table: Lid 346
#                 SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|
ports: in  0, out  0: | 0| 1|15|15|15|15|15|15|15|15|15|15|15|15|15|15|

# > /usr/sbin/smpquery vlarb 346
# VLArbitration tables: Lid 346 port 0 LowCap 8 HighCap 8
# Low priority VL Arbitration Table:
  VL    : |0x1 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
  WEIGHT: |0x1 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
# High priority VL Arbitration Table:
  VL    : |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
  WEIGHT: |0xFF|0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |

The high limit can be determined by issuing portinfo queries via /usr/sbin/smpquery.

# > /usr/sbin/smpquery portinfo 346 | grep Limit
VLHighLimit:.....................0
SLs are most often assigned at Infiniband Queue Pair (QP) creation time. So, if you change your QoS settings, any tools/applications (including Lustre) that are already running and have already created QPs may not have picked up the newest QoS policy. The appropriate tools/applications should be restarted.
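To make that concrete, here is a minimal sketch (mine, not from any of the documents below) of where the SL is specified with libibverbs: for a reliable-connected QP, the SL is part of the address vector supplied when moving the QP to the RTR state, so it is fixed for the life of the connection.

#include <string.h>
#include <stdint.h>
#include <infiniband/verbs.h>

/* Sketch: set the Service Level while transitioning an RC QP to RTR.
 * qp, dest_lid, dest_qpn, and dest_psn are assumed to come from your
 * normal out-of-band connection setup. */
static int move_to_rtr_with_sl(struct ibv_qp *qp, uint16_t dest_lid,
                               uint32_t dest_qpn, uint32_t dest_psn,
                               uint8_t sl)
{
    struct ibv_qp_attr attr;
    memset(&attr, 0, sizeof(attr));

    attr.qp_state           = IBV_QPS_RTR;
    attr.path_mtu           = IBV_MTU_2048;
    attr.dest_qp_num        = dest_qpn;
    attr.rq_psn             = dest_psn;
    attr.max_dest_rd_atomic = 1;
    attr.min_rnr_timer      = 12;
    attr.ah_attr.dlid       = dest_lid;
    attr.ah_attr.sl         = sl;   /* e.g. 1 -> Lustre's SL in the examples above */
    attr.ah_attr.port_num   = 1;

    return ibv_modify_qp(qp, &attr,
                         IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                         IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                         IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER);
}

Since the SL is baked in at this point, an already-connected QP keeps its old SL even after opensm's policy changes, hence the restarts.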
Not all Infiniband adapters support VLs, and those that do may not support all 15 data VLs. You can determine what your hardware supports by issuing portinfo queries via /usr/sbin/smpquery.
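For example (a hypothetical query against the same LID used above; the exact padding and values vary by diags version and hardware):

# > /usr/sbin/smpquery portinfo 346 | grep VLCap
VLCap:...........................VL0-7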
"Quality and Service in OFED 3.1" presentation (hopefully the URL is always legit):
http://www.openfabrics.org/archives/spring2008sonoma/Tuesday/qos_sonoma08_ofa_v1.ppt

"QoS Management in OpenSM" (a link into the Git tree - the URL is on the ofed_1_4 branch, so it probably will change at some point):
http://www.openfabrics.org/git/?p=~tziporet/docs.git;a=blob_plain;f=QoS_architecture.txt;hb=ofed_1_4