Problem Statement
Currently scheduler threads are grouped into blocks: the ksock_sched_info array is divided by the number of configured CPTs, one entry per CPT. Each ksock_sched_info defaults to cfs_cpt_weight() struct ksock_sched entries, or, if the ksnd_nscheds tunable is configured, the lesser of the two. The idea is not to have more scheduler threads on a CPT than cfs_cpt_weight() allows.
A struct ksock_sched is associated with exactly one thread: there is a 1:1 relationship between a ksock_sched and a scheduler thread, so any transmits or receives queued on the ksock_sched will be served by that one thread.
When a connection is created, it is associated with a ksock_sched selected from the ksock_sched_info on the CPT derived by lnet_cpt_of_nid(peer_nid). The connection remains associated with that ksock_sched for the duration of its life.
The life span of a TCP connection covers many transmits and receives, which means the same scheduler thread serves all of these operations; the scheduler only changes when the connection is torn down. As a result, in an lnet_selftest run and in other filesystem tests a single scheduler thread ends up consuming all the CPU resources, causing a severe drop in performance.
Solution
The fundamental issue with the socklnd scheduler design is that each connection is permanently bound to a single scheduler thread. There is no reason for such a binding when any scheduler thread bound to the desired CPT could serve the connection. The best case scenario is to let the kernel do the thread scheduling. This is how o2iblnd works.
In o2iblnd the scheduling code is as follows:
        /*
         * Allocate and determine the number of threads for each
         * scheduler
         */
        kiblnd_data.kib_scheds = cfs_percpt_alloc(lnet_cpt_table(),
                                                  sizeof(*sched));
        if (kiblnd_data.kib_scheds == NULL)
                goto failed;

        cfs_percpt_for_each(sched, i, kiblnd_data.kib_scheds) {
                int     nthrs;

                spin_lock_init(&sched->ibs_lock);
                INIT_LIST_HEAD(&sched->ibs_conns);
                init_waitqueue_head(&sched->ibs_waitq);

                nthrs = cfs_cpt_weight(lnet_cpt_table(), i);
                if (*kiblnd_tunables.kib_nscheds > 0) {
                        nthrs = min(nthrs, *kiblnd_tunables.kib_nscheds);
                } else {
                        /* max to half of CPUs, another half is reserved for
                         * upper layer modules */
                        nthrs = min(max(IBLND_N_SCHED, nthrs >> 1), nthrs);
                }

                sched->ibs_nthreads_max = nthrs;
                sched->ibs_cpt = i;
        }

/*
 * start schedulers
 */
static int
kiblnd_dev_start_threads(struct kib_dev *dev, int newdev, u32 *cpts, int ncpts)
{
        ...
        for (i = 0; i < ncpts; i++) {
                struct kib_sched_info *sched;
                ...
                rc = kiblnd_start_schedulers(kiblnd_data.kib_scheds[cpt]);
                ...
        }
}

static int
kiblnd_start_schedulers(struct kib_sched_info *sched)
{
        ...
        for (i = 0; i < nthrs; i++) {
                long    id;
                char    name[20];

                id = KIB_THREAD_ID(sched->ibs_cpt, sched->ibs_nthreads + i);
                snprintf(name, sizeof(name), "kiblnd_sd_%02ld_%02ld",
                         KIB_THREAD_CPT(id), KIB_THREAD_TID(id));
                rc = kiblnd_thread_start(kiblnd_scheduler, (void *)id, name);
                if (rc == 0)
                        continue;

                CERROR("Can't spawn thread %d for scheduler[%d]: %d\n",
                       sched->ibs_cpt, sched->ibs_nthreads + i, rc);
                break;
        }

        sched->ibs_nthreads += i;
}
In the code snippet above, each struct kib_sched_info has multiple threads, so when a connection is put on a scheduler's list, any of that scheduler's threads can pick up the connection and work on it.
The socklnd
needs to use the same mechanism.
This will result in a structure in which each per-CPT scheduler owns a pool of threads rather than a single thread.
Implementation Details
Locking
Locking used to be per thread; now a single lock covers all the threads in a scheduler. I need to work out the impact of that contention.
Sending/Receiving Messages
Again, each thread previously had its own buffers in which to place the data being received or sent. These buffers cannot be shared across all the threads, as the threads would otherwise overwrite each other's data. Some investigation is needed in this area.