Problem Statement

Currently scheduler threads are grouped into blocks: one struct ksock_sched_info per configured CPT. Each ksock_sched_info defaults to cfs_cpt_weight() struct ksock_sched entries, or, if ksnd_nscheds is configured, the lesser of the two. The idea is never to have more scheduler threads on a CPT than cfs_cpt_weight() reports.

Each struct ksock_sched is associated with exactly one thread; there is a 1:1 relationship between a ksock_sched and a scheduler thread. Any transmits or receives queued on a ksock_sched are therefore served by that one thread.

When a connection is created, it is associated with a ksock_sched selected from the ksock_sched_info on the CPT derived by lnet_cpt_of_nid(peer_nid).

The connection remains associated with that ksock_sched for the duration of its life.
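
For illustration, here is a minimal sketch of that selection step, loosely modeled on socklnd's ksocknal_choose_scheduler(); the structure layouts and field names are simplified assumptions for this page, not the exact source.

Code Block
/* Sketch of the current per-CPT scheduler selection; field and
 * function names are simplified assumptions based on socklnd. */
struct ksock_sched {
        spinlock_t              kss_lock;     /* serializes this thread's queues */
        struct list_head        kss_rx_conns; /* conns with pending receives */
        struct list_head        kss_tx_conns; /* conns with pending transmits */
        int                     kss_nconns;   /* conns bound to this thread */
};

struct ksock_sched_info {
        int                     ksi_nthreads; /* one ksock_sched per thread */
        int                     ksi_cpt;      /* CPT this block is bound to */
        struct ksock_sched      *ksi_scheds;  /* array of per-thread schedulers */
};

/* Pick the least-loaded per-thread scheduler on the connection's CPT;
 * the connection then stays on that one thread for its lifetime. */
static struct ksock_sched *
ksocknal_choose_scheduler_locked(struct ksock_sched_info *info)
{
        struct ksock_sched *sched = &info->ksi_scheds[0];
        int i;

        for (i = 1; i < info->ksi_nthreads; i++) {
                if (info->ksi_scheds[i].kss_nconns < sched->kss_nconns)
                        sched = &info->ksi_scheds[i];
        }
        return sched;
}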

Gliffy Diagram: socklnd_scheduler_orig

The life span of a TCP connection covers many transmits and receives, so the same scheduler thread serves all of these operations; the scheduler only changes when the connection is torn down. In lnet_selftest runs and in other filesystem tests, a single scheduler thread ends up consuming all the CPU resources, causing a severe drop in performance.

Solution

The fundamental issue with the socklnd scheduler design is that each connection is bound to a single scheduler thread for its entire lifetime. There is no reason for such a fixed binding when any scheduler thread bound to the desired CPT could serve the connection. The best case scenario is to let the kernel do the thread scheduling. This is how o2iblnd works.

In o2iblnd the scheduling code is as follows:

Code Block
/*
 * Allocate and determine the number of threads for each
 * scheduler
 */
        kiblnd_data.kib_scheds = cfs_percpt_alloc(lnet_cpt_table(),
                                                  sizeof(*sched));
        if (kiblnd_data.kib_scheds == NULL)
                goto failed;

        cfs_percpt_for_each(sched, i, kiblnd_data.kib_scheds) {
                int nthrs;

                spin_lock_init(&sched->ibs_lock);
                INIT_LIST_HEAD(&sched->ibs_conns);
                init_waitqueue_head(&sched->ibs_waitq);

                nthrs = cfs_cpt_weight(lnet_cpt_table(), i);
                if (*kiblnd_tunables.kib_nscheds > 0) {
                        nthrs = min(nthrs, *kiblnd_tunables.kib_nscheds);
                } else {
                        /* max to half of CPUs, another half is reserved for
                         * upper layer modules */
                        nthrs = min(max(IBLND_N_SCHED, nthrs >> 1), nthrs);
                }

                sched->ibs_nthreads_max = nthrs;
                sched->ibs_cpt = i;
        }

/*
 * start schedulers
 */
static int
kiblnd_dev_start_threads(struct kib_dev *dev, int newdev, u32 *cpts, int ncpts)
{
...
        for (i = 0; i < ncpts; i++) {
                struct kib_sched_info *sched;
...
                rc = kiblnd_start_schedulers(kiblnd_data.kib_scheds[cpt]);
...
}

static int
kiblnd_start_schedulers(struct kib_sched_info *sched)
{
...
        for (i = 0; i < nthrs; i++) {
                long id;
                char name[20];

                id = KIB_THREAD_ID(sched->ibs_cpt, sched->ibs_nthreads + i);
                snprintf(name, sizeof(name), "kiblnd_sd_%02ld_%02ld",
                         KIB_THREAD_CPT(id), KIB_THREAD_TID(id));
                rc = kiblnd_thread_start(kiblnd_scheduler, (void *)id, name);
                if (rc == 0)
                        continue;

                CERROR("Can't spawn thread %d for scheduler[%d]: %d\n",
                       sched->ibs_cpt, sched->ibs_nthreads + i, rc);
                break;
        }

        sched->ibs_nthreads += i;
}

In the code snippet above, each struct kib_sched_info has multiple threads. When a connection is put on the scheduler's list, any of the scheduler's threads can pick it up and work on it.

The socklnd needs to use the same mechanism.
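
As a sketch of what that could look like, the per-thread ksock_sched blocks would collapse into a single shared ksock_sched per CPT, mirroring kib_sched_info. The field names below are illustrative assumptions, not finalized code.

Code Block
/* Proposed shape of the socklnd scheduler: one ksock_sched per CPT,
 * shared by several threads (field names are illustrative only). */
struct ksock_sched {
        spinlock_t              kss_lock;         /* shared by all threads on the CPT */
        struct list_head        kss_rx_conns;     /* shared receive work queue */
        struct list_head        kss_tx_conns;     /* shared transmit work queue */
        wait_queue_head_t       kss_waitq;        /* idle threads sleep here */
        int                     kss_nthreads_max; /* capped by cfs_cpt_weight() */
        int                     kss_nthreads;     /* threads actually started */
        int                     kss_cpt;          /* CPT the threads are bound to */
};

With this shape, queuing a connection on kss_rx_conns wakes any idle thread on the CPT, and the kernel decides which one runs, exactly as in o2iblnd.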

This will result in the following structure:

Gliffy Diagram: socklnd_sched_new

Implementation Details

Locking

Locking was previously per thread; now a single lock covers all the threads in a scheduler. The impact of that contention still needs to be worked out.
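
To make the discussion concrete, here is a minimal sketch of a scheduler loop under the shared lock, assuming the proposed structure above. Helper names (ksocknal_process_receive, ksnd_shuttingdown, ksnc_rx_list) follow existing socklnd conventions but are assumptions in this sketch. The key point is that kss_lock only protects the queues and is dropped before the socket I/O, so contention is limited to short dequeue/enqueue windows.

Code Block
/* Sketch of one of several scheduler threads sharing a ksock_sched. */
static int
ksocknal_scheduler(void *arg)
{
        struct ksock_sched *sched = arg;
        struct ksock_conn *conn;

        while (!ksocknal_data.ksnd_shuttingdown) {
                spin_lock_bh(&sched->kss_lock);

                if (list_empty(&sched->kss_rx_conns)) {
                        spin_unlock_bh(&sched->kss_lock);
                        /* racy peek is fine: wait_event re-checks the
                         * condition after it is woken up */
                        wait_event_interruptible(sched->kss_waitq,
                                !list_empty(&sched->kss_rx_conns) ||
                                ksocknal_data.ksnd_shuttingdown);
                        continue;
                }

                conn = list_first_entry(&sched->kss_rx_conns,
                                        struct ksock_conn, ksnc_rx_list);
                list_del(&conn->ksnc_rx_list);

                /* drop the lock before the socket I/O so other threads
                 * on this CPT can dequeue work in parallel */
                spin_unlock_bh(&sched->kss_lock);

                ksocknal_process_receive(conn);
        }
        return 0;
}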

Sending/Receiving Messages

Similarly, each thread previously had its own buffers into which received or transmitted data was placed. These buffers cannot simply be shared across all the threads, since concurrent threads would overwrite each other's data. Some investigation is needed in this area.
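
One possible direction, sketched here purely as an assumption, is to keep the queues shared but move the scratch iovecs into a small per-thread container. All names below are hypothetical.

Code Block
/* Hypothetical per-thread container: shared scheduler state plus
 * private scratch buffers, so concurrent threads never overwrite
 * each other's in-flight iovecs. */
struct ksock_thread {
        struct ksock_sched      *kst_sched;                     /* shared queues/lock */
        struct kvec             kst_scratch_iov[LNET_MAX_IOV];  /* private */
        struct bio_vec          kst_scratch_kiov[LNET_MAX_IOV]; /* private */
};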