[Repost] isolcpus: Function and Usage

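The isolcpus feature has been around for a long time; I traced it back to v2.6.11 (2005), where the kernel already supported it. According to kernel-parameters.txt, "isolcpus isolates one or more CPUs from the general SMP balancing and scheduling algorithms. A process can be moved onto or off an "isolated" CPU only through the CPU affinity syscalls. The CPU numbers that follow isolcpus range from 0 to the number of CPUs in the system minus 1. This is the preferred way to isolate CPUs; the alternative, manually setting the affinity of every task, degrades the performance of the scheduler's load balancer."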

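The benefit of isolcpus is better real-time behavior for tasks running on the isolated CPUs. The feature protects the tasks on the isolated CPUs, but it also reduces the CPUs available to every other task, so CPU resources should be planned before it is used: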

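Steps for using isolcpus:

1. Decide how many CPUs to isolate, and which ones.

If several CPUs are to be isolated, try to keep the isolated CPUs and the non-isolated CPUs out of the same domain.

2. Add the isolated CPUs to the kernel command line.

Edit /boot/grub/grub.conf. For example, to isolate the fifth through eighth cores (CPU IDs 4 to 7), append isolcpus=4,5,6,7 to the kernel command line, with the IDs separated by commas.

3. Disable the interrupt balancing service.

Interrupt balancing makes interrupt delivery on the isolated cores unpredictable, which degrades the real-time performance of the tasks there. Disabling it also avoids the case where the gain from balancing is cancelled out by the cost of refilling caches.

4. Take inventory of all interrupts, then design and set their affinity.

5. Decide which tasks run on the isolated CPUs.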
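Steps 2 and 4 above can be sketched in user space. This is a minimal illustration, not code from the original post: the helper names build_isolcpus_arg and build_irq_affinity are made up here, and the sketch assumes the CPU count fits in one unsigned long. The second helper only produces the hex mask in the form that /proc/irq/<n>/smp_affinity accepts; writing it to the file is left out.

```c
#include <stdio.h>
#include <string.h>

/*
 * Step 2: format a list of CPU ids as a kernel command-line argument,
 * e.g. {4,5,6,7} -> "isolcpus=4,5,6,7".
 * Returns 0 on success, -1 if the buffer is too small.
 */
int build_isolcpus_arg(const int *cpus, size_t n, char *buf, size_t size)
{
        size_t used, i;
        int w;

        w = snprintf(buf, size, "isolcpus=");
        if (w < 0 || (size_t)w >= size)
                return -1;
        used = strlen(buf);
        for (i = 0; i < n; i++) {
                w = snprintf(buf + used, size - used, i ? ",%d" : "%d", cpus[i]);
                if (w < 0 || (size_t)w >= size - used)
                        return -1;
                used += (size_t)w;
        }
        return 0;
}

/*
 * Step 4: hex mask of the CPUs left to handle interrupts (every CPU below
 * ncpus except the isolated ones).
 * Returns 0 on success, -1 if the buffer is too small.
 */
int build_irq_affinity(unsigned int ncpus, unsigned long isolated,
                       char *buf, size_t size)
{
        unsigned long all;
        int w;

        all = ncpus >= sizeof(all) * 8 ? ~0UL : (1UL << ncpus) - 1;
        w = snprintf(buf, size, "%lx", all & ~isolated);
        return (w < 0 || (size_t)w >= size) ? -1 : 0;
}
```

For the eight-CPU example in step 2, build_isolcpus_arg gives "isolcpus=4,5,6,7" and build_irq_affinity(8, 0xF0, ...) gives "f", which keeps interrupts on CPUs 0 to 3.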

The following is part of the process list after booting a dual-core system with cpu0 isolated:

  PID COMMAND                                          PSR
    1 /sbin/init                                         1
    3 [migration/0]                                      0
    4 [ksoftirqd/0]                                      0
    5 [watchdog/0]                                       0
    6 [migration/1]                                      1
    7 [ksoftirqd/1]                                      1
    8 [watchdog/1]                                       1
    9 [events/0]                                         0
   10 [events/1]                                         1
   18 [kintegrityd/0]                                    0
   19 [kintegrityd/1]                                    1
   20 [kblockd/0]                                        0
   21 [kblockd/1]                                        1
   25 [ata/0]                                            0
   26 [ata/1]                                            1
   36 [aio/0]                                            0
   37 [aio/1]                                            1
   38 [crypto/0]                                         0
   39 [crypto/1]                                         1
   45 [kpsmoused]                                        0
   46 [usbhid_resumer]                                   0
   75 [kstriped]                                         0
 1081 rpcbind                                            1
 1093 mdadm --monitor --scan -f --pid-file=/var/run/mdad 1
 1102 dbus-daemon --system                               1
 1113 NetworkManager --pid-file=/var/run/NetworkManager/ 1
 1117 /usr/sbin/modem-manager                            1
 1124 /sbin/dhclient -d -4 -sf /usr/libexec/nm-dhcp-clie 1
 1129 /usr/sbin/wpa_supplicant -c /etc/wpa_supplicant/wp 1
 1131 avahi-daemon: registering [linux.local]            1
 1132 avahi-daemon: chroot helper                        1
 1598 sshd: root@pts/0                                   1
 1603 -bash                                              1

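Some kernel threads, such as [kblockd/0], still occupy CPU0 because they are explicitly bound to CPU0. All other processes run on CPU1. After boot the system does not use CPU0 by default, but CPU0 is not off limits: a task can still be placed on CPU 0 explicitly.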

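Finally, note that with isolcpus=1 the system uses CPU0 for general work by default. If a machine has only two CPUs and both are isolated with isolcpus=0,1, the first CPU, CPU0, is still the one used by default.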

How isolcpus works: code analysis
Setup and initialization
Setting isolcpus
As described above, once the isolcpus parameter has been passed to the kernel at boot, isolated_cpu_setup() first stores it in cpu_isolated_map, the bitmap of isolated CPUs.

/* cpus with isolated domains */
static cpumask_var_t cpu_isolated_map;

/* Setup the mask of cpus configured for isolated domains */
static int __init isolated_cpu_setup(char *str)
{
        alloc_bootmem_cpumask_var(&cpu_isolated_map);
        cpulist_parse(str, cpu_isolated_map);
        return 1;
}

__setup("isolcpus=", isolated_cpu_setup);
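To make the role of cpulist_parse() concrete, here is a simplified user-space analogue. It is a sketch, not the kernel implementation: parse_cpulist is a made-up name, the mask is a single unsigned long rather than a cpumask, and only comma-separated ids and "a-b" ranges are handled.

```c
#include <stdlib.h>

/*
 * Parse a CPU list such as "4,5,6,7" or "0,2-3" into a bitmask.
 * Returns 0 on success, -1 on malformed input or on a CPU id that does
 * not fit in an unsigned long.
 */
int parse_cpulist(const char *s, unsigned long *mask)
{
        const unsigned long max_cpu = sizeof(unsigned long) * 8 - 1;
        char *end;

        *mask = 0;
        while (*s) {
                unsigned long lo, hi, cpu;

                if (*s < '0' || *s > '9')
                        return -1;
                lo = hi = strtoul(s, &end, 10);
                s = end;
                if (*s == '-') {                /* range "a-b" */
                        s++;
                        if (*s < '0' || *s > '9')
                                return -1;
                        hi = strtoul(s, &end, 10);
                        s = end;
                }
                if (lo > hi || hi > max_cpu)
                        return -1;
                for (cpu = lo; cpu <= hi; cpu++)
                        *mask |= 1UL << cpu;
                if (*s == ',')
                        s++;
                else if (*s != '\0')
                        return -1;
        }
        return 0;
}
```

With this sketch, "4,5,6,7" and "4-7" both produce the mask 0xF0, which corresponds to the isolcpus=4,5,6,7 example earlier in the article.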

cpu_isolated_map comes into play when the scheduling domains are initialized in arch_init_sched_domains().

Scheduling domains initialization
The initialization path is kernel_init() -> sched_init_smp() -> arch_init_sched_domains():

static int arch_init_sched_domains(const struct cpumask *cpu_map)
{
        int err;

        arch_update_cpu_topology();
        ndoms_cur = 1;
        doms_cur = kmalloc(cpumask_size(), GFP_KERNEL);
        if (!doms_cur)
                doms_cur = fallback_doms;
        cpumask_andnot(doms_cur, cpu_map, cpu_isolated_map);
        dattr_cur = NULL;
        err = build_sched_domains(doms_cur);
        register_sched_domain_sysctl();

        return err;
}

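Here doms_cur is the bitmap of CPUs in the current scheduling domain and cpu_map is the cpu_active_mask passed in as the argument. In other words, the current scheduling domain consists of the CPUs in cpu_active_mask with those in cpu_isolated_map removed.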
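The effect of cpumask_andnot() can be illustrated with plain bitmasks. This is a sketch with made-up helper names (domain_cpus, mask_weight) that models a cpumask as a single unsigned long.

```c
/* CPUs left in the scheduling domain: active CPUs minus isolated CPUs. */
unsigned long domain_cpus(unsigned long active, unsigned long isolated)
{
        return active & ~isolated;
}

/* Number of CPUs set in a mask. */
int mask_weight(unsigned long mask)
{
        int n = 0;

        for (; mask; mask &= mask - 1)  /* clear the lowest set bit */
                n++;
        return n;
}
```

On a two-CPU machine (active mask 0x3) booted with isolcpus=1 (isolated mask 0x2), domain_cpus() returns 0x1, so only CPU0 remains in the domain that gets balanced.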

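Scheduling domains are a product of modern hardware, in particular multi-CPU and multi-core designs. Consider a NUMA system, in which each node accesses different regions of memory at different speeds.

It is also an SMP system, built from several physical CPUs (physical packages). These physical CPUs share all of the system's memory, but each has its own cache.

Each physical CPU in turn consists of several cores, which is known as multi-core or chip-level multiprocessor (CMP) technology. Each core usually has its own L1 cache but may share the L2 cache.

Each core may implement several hardware threads through SMT or a similar technique (Intel's Hyper-Threading, for example); these are also called virtual CPUs. Logically, each hardware thread is a CPU. The threads of a core share almost everything, including the L1 cache and even the arithmetic logic unit (ALU) and power.

In such a system the smallest unit of execution is the logical CPU, and tasks are scheduled onto logical CPUs.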

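build_sched_domains() builds the scheduling domains in four main parts: initializing the scheduling domains, initializing the scheduling groups, initializing the cpu power of the scheduling groups, and attaching the runqueues to the scheduling domains.

Initializing the scheduling domains:

        for_each_cpu(i, cpu_map) { /* isolated cpus are left out of domain initialization */
                cpumask_and(d.nodemask, cpumask_of_node(cpu_to_node(i)),
                            cpu_map);

                sd = __build_numa_sched_domains(&d, cpu_map, attr, i);
                sd = __build_cpu_sched_domain(&d, cpu_map, attr, sd, i);
                sd = __build_mc_sched_domain(&d, cpu_map, attr, sd, i);
                sd = __build_smt_sched_domain(&d, cpu_map, attr, sd, i);
        }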

This initializes the four scheduling domains described above: allnodes_domains, phys_domains, core_domains and cpu_domains. Taking cpu_domains as an example:

static struct sched_domain *__build_smt_sched_domain(struct s_data *d,
        const struct cpumask *cpu_map, struct sched_domain_attr *attr,
        struct sched_domain *parent, int i)
{
        struct sched_domain *sd = parent;
#ifdef CONFIG_SCHED_SMT
        sd = &per_cpu(cpu_domains, i).sd; /* all four domains are per-cpu variables that record the four domains each cpu belongs to */
        SD_INIT(sd, SIBLING); /* set the balancing policy of sd, among other things */
        set_domain_attribute(sd, attr);
        cpumask_and(sched_domain_span(sd), cpu_map, topology_thread_cpumask(i));
        sd->parent = parent; /* the parent of cpu_domains is core_domains */
        parent->child = sd;
        cpu_to_cpu_group(i, cpu_map, &sd->groups, d->tmpmask); /* initialize the scheduling groups in sd */
#endif
        return sd;
}

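To complement scheduling domains, the Linux kernel introduces scheduling groups (sched_group). Every level of scheduling domain has one or more scheduling groups, and each group contains one or more CPUs. The domain treats each group as a single unit: load is computed and compared per scheduling group within the domain.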

Initializing the cpu power (processing capacity) of the scheduling groups:

The cpu power of a scheduling group is the load the group can carry. On a hyper-threaded CPU, for example, the primary core and the virtual cores provided by hyper-threading do not have the same capacity, so during load balancing the cpu power of each group determines how much load that group should be given. On an SMP system the CPUs are fully symmetric, that is, they all have the same processing capacity, so the cpu power of the scheduling group of each CPU is set to SCHED_LOAD_SCALE.

static void init_sched_groups_power(int cpu, struct sched_domain *sd)
{
        struct sched_domain *child;
        struct sched_group *group;
        long power;
        int weight;

        WARN_ON(!sd || !sd->groups);

        if (cpu != group_first_cpu(sd->groups))
                return;

        sd->groups->group_weight = cpumask_weight(sched_group_cpus(sd->groups));

        child = sd->child;

        sd->groups->cpu_power = 0;

        if (!child) {
                power = SCHED_LOAD_SCALE;
                weight = cpumask_weight(sched_domain_span(sd));
                /*
                 * SMT siblings share the power of a single core.
                 * Usually multiple threads get a better yield out of
                 * that one core than a single thread would have,
                 * reflect that in sd->smt_gain.
                 */
                if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1) {
                        power *= sd->smt_gain;
                        power /= weight;
                        power >>= SCHED_LOAD_SHIFT;
                }
                sd->groups->cpu_power += power; /* with no child domain, cpu_power is SCHED_LOAD_SCALE (scaled by smt_gain for SMT siblings) */
                return;
        }

        /*
         * Add cpu_power of each child group to this groups cpu_power.
         */
        group = child->groups;
        do {
                sd->groups->cpu_power += group->cpu_power;
                group = group->next;
        } while (group != child->groups); /* with a child domain, cpu_power is the sum of the cpu_power of the child groups */
}
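The arithmetic of the branch without a child domain can be worked through in isolation. The sketch below is illustrative only: leaf_group_power is a made-up name, and the constants are assumptions rather than values stated in this article (SCHED_LOAD_SHIFT of 10, so SCHED_LOAD_SCALE is 1024, and an smt_gain of 1178 supplied by the caller).

```c
#define SCHED_LOAD_SHIFT 10                       /* assumed value */
#define SCHED_LOAD_SCALE (1L << SCHED_LOAD_SHIFT) /* 1024 under that assumption */

/*
 * cpu_power of a bottom-level group, following the !child branch above.
 *   share_cpupower : nonzero when the domain has SD_SHARE_CPUPOWER set
 *   weight         : number of CPUs spanned by the domain
 *   smt_gain       : the domain's smt_gain
 */
long leaf_group_power(int share_cpupower, int weight, long smt_gain)
{
        long power = SCHED_LOAD_SCALE;

        if (share_cpupower && weight > 1) {
                power *= smt_gain;
                power /= weight;
                power >>= SCHED_LOAD_SHIFT;
        }
        return power;
}
```

Under these assumptions, two SMT siblings each get 1024 * 1178 / 2 >> 10 = 589, so the pair together counts for a little more than one full core (1178 versus 1024), while a CPU without siblings keeps the full 1024.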

Finally, the runqueues are attached to the scheduling domains. This is done in cpu_attach_domain():

/*
 * Attach the domain 'sd' to 'cpu' as its base domain. Callers must
 * hold the hotplug lock.
 */
cpu_attach_domain
{
… …

        sched_domain_debug(sd, cpu);

        rq_attach_root(rq, rd);
        /*
         * Set rq->sd = sd, i.e. point the cpu's runqueue at its base
         * scheduling domain. Later balancing passes find the sd of a cpu
         * through it:
         * for_each_domain(cpu, __sd)
         *     for (__sd = rcu_dereference(cpu_rq(cpu)->sd); __sd; __sd = __sd->parent)
         */
        rcu_assign_pointer(rq->sd, sd);
}

#define rcu_assign_pointer(p, v)        ({ \
                                                smp_wmb(); \
                                                (p) = (v); \
                                        })

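Load balancing: triggers and algorithm

Balancing is triggered in three ways: periodically, when a CPU goes idle, and when a task is woken up.

Periodic balancing

On a timer interrupt, the balancing softirq (the kernel has a dedicated softirq for it) is raised if certain conditions hold. This is called periodic balancing.

void __init sched_init(void)
{
…
        open_softirq(SCHED_SOFTIRQ, run_rebalance_domains);
…
}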

The tick interrupt handler calls scheduler_tick():

void scheduler_tick(void)
{
… …

#ifdef CONFIG_SMP
        rq->idle_at_tick = idle_cpu(cpu);
        trigger_load_balance(rq, cpu);
#endif
… …
}

trigger_load_balance() is implemented as follows:

static inline void trigger_load_balance(struct rq *rq, int cpu)
{
… …
        /*
         * rq->next_balance is the timestamp of the next load balance pass;
         * it keeps load balancing from running too often.
         */
        if (time_after_eq(jiffies, rq->next_balance) &&
            likely(!on_null_domain(cpu)))
                raise_softirq(SCHED_SOFTIRQ);
}

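When jiffies is greater than or equal to rq->next_balance, raise_softirq() is called to raise the load balance softirq. rq->next_balance is the timestamp of the next load balance pass; the check keeps load balancing from running so often that it hurts system performance.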
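A tick counter such as jiffies eventually wraps around, so "greater than or equal" is normally evaluated on the signed difference rather than with a plain comparison. The sketch below shows that idea in user space; tick_after_eq and balance_due are made-up names, and the code assumes two's complement conversion from unsigned long to long, as on common compilers.

```c
/* Wraparound-safe "a is at or after b" for free-running tick counters. */
int tick_after_eq(unsigned long a, unsigned long b)
{
        return (long)(a - b) >= 0;
}

/* Should a balance pass be triggered at tick `now`? */
int balance_due(unsigned long now, unsigned long next_balance)
{
        return tick_after_eq(now, next_balance);
}
```

The check still gives the right answer when next_balance was set shortly before the counter wrapped and now is a small value just after the wrap.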

The handler for this softirq is run_rebalance_domains():

static void run_rebalance_domains(struct softirq_action *h)
{
        int this_cpu = smp_processor_id();
        struct rq *this_rq = cpu_rq(this_cpu);
        /* this_rq->idle_at_tick tells whether this runqueue was idle at the tick */
        enum cpu_idle_type idle = this_rq->idle_at_tick ?
                                                CPU_IDLE : CPU_NOT_IDLE;

        rebalance_domains(this_cpu, idle);

… …
}

rebalance_domains():

static void rebalance_domains(int cpu, enum cpu_idle_type idle)
{
… …
        /*
         * Walk upward from the base scheduling domain. A non-isolated cpu
         * never even tries to balance against an isolated cpu, because the
         * two are not in the same sd. This is the difference from exclusive
         * binding.
         */
        for_each_domain(cpu, sd) {
                if (!(sd->flags & SD_LOAD_BALANCE))
                        continue;

                /* compute the balancing interval */
                interval = sd->balance_interval; /* balance_interval is set by the SD_**_INIT initializers */
                /*
                 * If this cpu is not idle, stretch the interval. Load balancing
                 * runs at different rates on a busy and on an idle cpu; an idle
                 * cpu balances at a much higher rate.
                 */
                if (idle != CPU_IDLE)
                        interval *= sd->busy_factor;

                /* convert ms to jiffies */
                interval = msecs_to_jiffies(interval);
                /* clamp interval to a valid range */
                if (unlikely(!interval))
                        interval = 1;
                if (interval > HZ*NR_CPUS/10)
                        interval = HZ*NR_CPUS/10;

                need_serialize = sd->flags & SD_SERIALIZE;

                if (need_serialize) {
                        /* every load balance softirq must hold the spinlock before it can go on */
                        if (!spin_trylock(&balancing))
                                goto out;
                }
                /* if it is time to balance */
                if (time_after_eq(jiffies, sd->last_balance + interval)) {
                        if (load_balance(cpu, rq, sd, idle, &balance)) {
                                /*
                                 * Balancing: walk this cpu's domains from child
                                 * sched_domain to parent sched_domain, preferring
                                 * to find tasks to migrate in the lower-level
                                 * domains. Find the busiest sched_group in the
                                 * sched_domain, find the busiest runqueue in that
                                 * group, and pull a few tasks from that "busiest"
                                 * runqueue onto this cpu's runqueue.
                                 */
                                /* tasks have been pulled from another cpu, so this cpu is no longer idle */
                                idle = CPU_NOT_IDLE;
                        }
                        sd->last_balance = jiffies;
                }
                if (need_serialize)
                        spin_unlock(&balancing);
out:
                if (time_after(next_balance, sd->last_balance + interval)) {
                        next_balance = sd->last_balance + interval;
                        update_next_balance = 1;
                }

                /*
                 * Stop the load balance at this level. There is another
                 * CPU in our sched group which is doing load balancing more
                 * actively.
                 */
                /* if balance is 0, a more suitable cpu in this scheduling group does the balancing; stop here */
                if (!balance)
                        break;
        }

        /*
         * next_balance will be updated only when there is a need.
         * When the cpu is attached to null domain for ex, it will not be
         * updated.
         */
        if (likely(update_next_balance))
                rq->next_balance = next_balance;
}
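The interval computation at the top of the loop can be pulled out into a small function to see how the numbers behave. This is a sketch with assumptions: balance_interval_ticks is a made-up name, HZ and NR_CPUS are passed in as parameters, and the millisecond-to-tick conversion is a simple round-up that stands in for msecs_to_jiffies().

```c
/*
 * Interval, in ticks, between balance passes for one domain.
 *   interval_ms : the domain's balance_interval, in milliseconds
 *   busy_factor : multiplier applied when the CPU is not idle
 *   cpu_idle    : nonzero if the CPU was idle at the tick
 *   hz, nr_cpus : tick rate and CPU count used for the upper bound
 */
unsigned long balance_interval_ticks(unsigned long interval_ms,
                                     unsigned int busy_factor, int cpu_idle,
                                     unsigned long hz, unsigned long nr_cpus)
{
        unsigned long interval = interval_ms;
        unsigned long cap = hz * nr_cpus / 10;

        if (!cpu_idle)
                interval *= busy_factor;

        /* ms -> ticks, rounding up (stand-in for msecs_to_jiffies) */
        interval = (interval * hz + 999) / 1000;

        if (!interval)
                interval = 1;
        if (interval > cap)
                interval = cap;
        return interval;
}
```

With hz = 1000, two CPUs and a hypothetical busy_factor of 64, a 1 ms balance_interval gives 1 tick on an idle CPU and 64 ticks on a busy one, while an 8 ms interval on a busy CPU (512 ticks) is clamped to the cap of 200.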

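The balancing itself is done in load_balance(). Its implementation is essentially the same as load_balance_newidle(), which is covered below. It walks this CPU's domains from child sched_domain to parent sched_domain, preferring to find tasks to migrate in the lower-level domains: it finds the busiest sched_group in the sched_domain, finds the busiest runqueue in that group, and pulls a few tasks from that "busiest" runqueue onto the current CPU's runqueue.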

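Idle balancing
When a task is scheduled, the kernel checks whether the current cpu is idle and, if so, balances load through idle_balance(). Compared with periodic balancing this is a lightweight algorithm: it returns as soon as it has pulled a task and does not walk all the scheduling domains.

An empty runqueue means the current CPU has nothing left to schedule, so idle_balance() is called to "balance" some tasks from other CPUs onto the current rq:

asmlinkage void __sched schedule(void)
{
… …
        if (unlikely(!rq->nr_running))
                idle_balance(cpu, rq);
… …
}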

Now look at the implementation of idle_balance():

static void idle_balance(int this_cpu, struct rq *this_rq)
{
…
        /*
         * Walk upward from the base scheduling domain. A non-isolated cpu
         * never even tries to balance against an isolated cpu, because the
         * two are not in the same sd. This is the difference from exclusive
         * binding.
         */
        for_each_domain(this_cpu, sd) {
                unsigned long interval;

                if (!(sd->flags & SD_LOAD_BALANCE))
                        continue;

                if (sd->flags & SD_BALANCE_NEWIDLE)
                        /* If we've pulled tasks over stop searching: */
                        pulled_task = load_balance_newidle(this_cpu, this_rq,
                                                           sd);

                interval = msecs_to_jiffies(sd->balance_interval);
                if (time_after(next_balance, sd->last_balance + interval))
                        next_balance = sd->last_balance + interval;
                if (pulled_task) { /* stop as soon as a task has been pulled */
                        this_rq->idle_stamp = 0;
                        break;
                }
        }
…
}

It walks this CPU's domains from child sched_domain to parent sched_domain, preferring to find tasks to migrate in the lower-level domains. The call to watch is load_balance_newidle():

static int
load_balance_newidle(int this_cpu, struct rq *this_rq, struct sched_domain *sd)
{
… …

        schedstat_inc(sd, lb_count[CPU_NEWLY_IDLE]);
redo:
        update_shares_locked(this_rq, sd);
        group = find_busiest_group(sd, this_cpu, &imbalance, CPU_NEWLY_IDLE,
                                   &sd_idle, cpus, NULL); /* the busiest sched_group in this sched_domain */
        if (!group) {
                schedstat_inc(sd, lb_nobusyg[CPU_NEWLY_IDLE]);
                goto out_balanced;
        }

        busiest = find_busiest_queue(group, CPU_NEWLY_IDLE, imbalance, cpus); /* the busiest runqueue in that group */
        if (!busiest) {
                schedstat_inc(sd, lb_nobusyq[CPU_NEWLY_IDLE]);
                goto out_balanced;
        }

        BUG_ON(busiest == this_rq);

        schedstat_add(sd, lb_imbalance[CPU_NEWLY_IDLE], imbalance);

        ld_moved = 0;
        if (busiest->nr_running > 1) {
                /* Attempt to move tasks */
                double_lock_balance(this_rq, busiest);
                /* this_rq->clock is already updated */
                update_rq_clock(busiest);
                /*
                 * Finally, pull a few tasks from that "busiest" runqueue onto
                 * this cpu's runqueue. How many tasks move_tasks() moves is
                 * decided by its fourth argument, which is the maximum load
                 * to move.
                 */
                ld_moved = move_tasks(this_rq, this_cpu, busiest,
                                      imbalance, sd, CPU_NEWLY_IDLE,
                                      &all_pinned);
                double_unlock_balance(this_rq, busiest);

                if (unlikely(all_pinned)) {
                        cpumask_clear_cpu(cpu_of(busiest), cpus);
                        if (!cpumask_empty(cpus))
                                goto redo;
                }
        }
… …
}

When an imbalance is found and tasks have to be moved, affinity must be taken into account, so a move can also fail and lead to repeated retries. This is also where non-exclusive binding loses performance:

int can_migrate_task(struct task_struct *p, struct rq *rq, int this_cpu,
                     struct sched_domain *sd, enum cpu_idle_type idle,
                     int *all_pinned)
{
        /*
         * We do not migrate tasks that are:
         * 1) running (obviously), or
         * 2) cannot be migrated to this CPU due to cpus_allowed, or
         * 3) are cache-hot on their current CPU.
         */
        if (!cpu_isset(this_cpu, p->cpus_allowed))
                return 0;
        *all_pinned = 0;

        if (task_running(rq, p))
                return 0;

        return 1;
}
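The interaction between affinity and the all_pinned flag can be modelled in a few lines. This is a user-space sketch, not kernel code: can_pull_task and count_pullable are made-up names, affinity is a plain bitmask, and the caller is assumed to start with all_pinned set to 1, as count_pullable does here.

```c
/*
 * Can a task be pulled to this_cpu? Mirrors the two checks shown above.
 *   cpus_allowed : the task's affinity mask
 *   running      : nonzero if the task is currently executing
 *   all_pinned   : cleared as soon as one task is allowed on this_cpu
 */
int can_pull_task(unsigned long cpus_allowed, int this_cpu, int running,
                  int *all_pinned)
{
        if (!(cpus_allowed & (1UL << this_cpu)))
                return 0;       /* affinity forbids this CPU */
        *all_pinned = 0;

        if (running)
                return 0;       /* a running task is never migrated */

        return 1;
}

/* Count how many of n tasks on a busy runqueue could be pulled to this_cpu. */
int count_pullable(const unsigned long *allowed, const int *running, int n,
                   int this_cpu, int *all_pinned)
{
        int i, pulled = 0;

        *all_pinned = 1;
        for (i = 0; i < n; i++)
                pulled += can_pull_task(allowed[i], this_cpu, running[i],
                                        all_pinned);
        return pulled;
}
```

If every task on the busy runqueue is bound away from this_cpu, nothing is pulled and all_pinned stays 1; in load_balance_newidle() above, that is the case that removes the busiest CPU from the candidate mask and jumps back to redo.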

Finding the busiest runqueue in a group:

static struct rq *
find_busiest_queue(struct sched_group *group, enum cpu_idle_type idle,
                   unsigned long imbalance, cpumask_t *cpus)
{
        struct rq *busiest = NULL, *rq;
        unsigned long max_load = 0;
        int i;

        for_each_cpu_mask(i, group->cpumask) { /* iterate over every cpu in the group */
                unsigned long wl;

                if (!cpu_isset(i, *cpus))
                        continue;

                rq = cpu_rq(i);
                wl = weighted_cpuload(i); /* load on this cpu; the calculation is described below */
#ifdef CONFIG_SIRQ_BALANCE /* whether to account for softirq load */
                if (cpu_in_sirq_judge(i)) {
                        wl += SIRQ_WEIGHT(i);
                }
#endif
                if (rq->nr_running == 1 && wl > imbalance)
                        continue;

                if (wl > max_load) { /* remember the cpu with the highest load */
                        max_load = wl;
                        busiest = rq;
                }
        }

        return busiest;
}
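The selection rule is easier to follow on plain arrays. The sketch below is a simplified model, not the kernel function: busiest_cpu is a made-up name, per-CPU load and nr_running are passed in as arrays, the candidate mask is an unsigned long, and the softirq weighting is left out.

```c
/*
 * Return the index of the busiest runqueue among n CPUs, or -1 if none
 * qualifies.
 */
int busiest_cpu(const unsigned long *load, const unsigned int *nr_running,
                int n, unsigned long allowed, unsigned long imbalance)
{
        unsigned long max_load = 0;
        int i, busiest = -1;

        for (i = 0; i < n; i++) {
                if (!(allowed & (1UL << i)))
                        continue;       /* CPU was excluded by the caller */

                /* a lone task heavier than the imbalance is not worth moving */
                if (nr_running[i] == 1 && load[i] > imbalance)
                        continue;

                if (load[i] > max_load) {
                        max_load = load[i];
                        busiest = i;
                }
        }
        return busiest;
}
```

With loads {100, 3000, 2000}, nr_running {1, 1, 3} and an imbalance of 1024, CPU 1 is skipped even though it has the highest load, because its single task is heavier than the imbalance; CPU 2 is chosen instead.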

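Computing the load of a scheduling group:

As noted earlier, a sched_group may contain one or more CPUs, depending on the hardware and on the level of the scheduling domain it belongs to. The load of the whole group is the sum of the loads of the CPUs in it.

Computing the load of a CPU is simple as well: add up the load weights of all tasks on the CPU's runqueue.

The load weight of a task is looked up in the following table:

static const int prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45,
 /*  15 */        36,        29,        23,        18,        15,
};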

For real-time tasks, weight = prio_to_weight[0] * 2:

        if (task_has_rt_policy(p)) {
                p->se.load.weight = prio_to_weight[0] * 2;
                p->se.load.inv_weight = prio_to_wmult[0] >> 1;
                return;
        }

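For IDLE tasks, weight = 2.

For non-real-time tasks under the fair scheduler, weight = prio_to_weight[p->static_prio - 100]. Clearly, the smaller static_prio is, the larger the weight. In the CGLE kernel static_prio lies in the range [100, 139], which matches the prio_to_weight table.

The load of a CPU's runqueue (rq) is updated at two main points:

When a task is added to the runqueue (for example when it is woken up), its load is added to the runqueue load (rq->load).

When a task is removed from the runqueue (for example when it blocks), its load is subtracted from the runqueue load (rq->load).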
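The weight rules above fit in a short lookup. The sketch below is illustrative: the enum policy type and the names task_weight and rq_load are made up for this example, and the runqueue load is recomputed as a sum rather than updated on enqueue and dequeue as the kernel does.

```c
/* Weight table from the text, indexed by static_prio - 100 (nice + 20). */
static const int prio_to_weight[40] = {
        88761, 71755, 56483, 46273, 36291,
        29154, 23254, 18705, 14949, 11916,
         9548,  7620,  6100,  4904,  3906,
         3121,  2501,  1991,  1586,  1277,
         1024,   820,   655,   526,   423,
          335,   272,   215,   172,   137,
          110,    87,    70,    56,    45,
           36,    29,    23,    18,    15,
};

#define WEIGHT_IDLE 2   /* weight of an IDLE task, per the text */

enum policy { POLICY_NORMAL, POLICY_RT, POLICY_IDLE };

/* Load weight of one task; static_prio must be in [100, 139] for POLICY_NORMAL. */
unsigned long task_weight(enum policy policy, int static_prio)
{
        if (policy == POLICY_RT)
                return (unsigned long)prio_to_weight[0] * 2;
        if (policy == POLICY_IDLE)
                return WEIGHT_IDLE;
        return (unsigned long)prio_to_weight[static_prio - 100];
}

/* Runqueue load: the sum of the weights of all queued tasks. */
unsigned long rq_load(const enum policy *policy, const int *static_prio, int n)
{
        unsigned long load = 0;
        int i;

        for (i = 0; i < n; i++)
                load += task_weight(policy[i], static_prio[i]);
        return load;
}
```

A runqueue holding a nice 0 task (static_prio 120), a nice 5 task (static_prio 125) and one real-time task has a load of 1024 + 335 + 177522 = 178881, which shows how strongly a single real-time task dominates the load of its CPU.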

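Wakeup balancing
When a task is woken up, try_to_wake_up() decides whether to balance load based on the scheduling domain parameters of the current cpu. This is called wakeup balancing. this_cpu is the processor actually running this function, called the "target processor"; the variable cpu is the processor that task p ran on before it went to sleep, called the "source processor":

static int try_to_wake_up(struct task_struct *p, unsigned int state,
                          int wake_flags, int mutex)
{
        if (cpu == this_cpu) {
                schedstat_inc(rq, ttwu_local);
                /*
                 * Task p ran on processor A and then went to sleep, and
                 * try_to_wake_up() is also running on processor A. This is
                 * the best case: the data p has cached on A is left alone
                 * and A simply runs p.
                 */
                goto out_set_cpu;
        }
}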

If this_cpu and cpu are not the same processor, the code continues:

        if (unlikely(!cpu_isset(this_cpu, p->cpus_allowed)))
                goto out_set_cpu;
… …
        if (this_sd) { /* for an isolated cpu this_sd is NULL, so no balancing is attempted */
                int idx = this_sd->wake_idx;
                unsigned int imbalance;
#ifdef CONFIG_SIRQ_BALANCE
                int in_sirq;

                imbalance = 100 + (this_sd->imbalance_pct - 100) / 2;

                in_sirq = cpu_in_sirq_judge(this_cpu);
                this_load = target_load(this_cpu, idx, in_sirq);

                in_sirq = cpu_in_sirq_judge(cpu);
                load = source_load(cpu, idx, in_sirq);
#else
                imbalance = 100 + (this_sd->imbalance_pct - 100) / 2;

                load = source_load(cpu, idx); /* load of the "source processor" */
                this_load = target_load(this_cpu, idx); /* load of the "target processor" */
#endif
                new_cpu = this_cpu; /* Wake to this CPU if we can */

                /*
                 * Reading this code shows that if a task's affinity covers both
                 * isolated and non-isolated cpus, wakeup balancing moves the
                 * task to a non-isolated cpu and it never gets a chance to come
                 * back to an isolated cpu. So avoid the pointless practice of
                 * giving a task affinity to both isolated and non-isolated cpus.
                 */
                if (this_sd->flags & SD_WAKE_AFFINE) {
                        unsigned long tl = this_load;
                        unsigned long tl_per_task;

                        tl_per_task = cpu_avg_load_per_task(this_cpu); /* average load per task on the "target processor" */

                        /*
                         * If sync wakeup then subtract the (maximum possible)
                         * effect of the currently running task from the load
                         * of the current CPU:
                         */
                        if (sync)
                                tl -= current->se.load.weight;
#ifdef CONFIG_SIRQ_BALANCE
                        if ((tl <= load &&
                                tl + target_load(cpu, idx, in_sirq) <= tl_per_task) ||
                               100*(tl + p->se.load.weight) <= imbalance*load) {
#else
                        /*
                         * If the load of the "target processor" is no greater
                         * than that of the "source processor" and the two loads
                         * together are no greater than tl_per_task, the woken
                         * task moves to the "target processor". In the other
                         * case, if the load of the "target processor" plus the
                         * load of the woken task is still no greater than the
                         * load of the "source processor" (scaled by imbalance),
                         * the woken task also moves to the "target processor".
                         * If neither condition holds, the task stays on the CPU
                         * it ran on before (the "source processor").
                         */
                        if ((tl <= load &&
                                tl + target_load(cpu, idx) <= tl_per_task) ||
                               100*(tl + p->se.load.weight) <= imbalance*load) {
#endif
                                /*
                                 * This domain has SD_WAKE_AFFINE and
                                 * p is cache cold in this domain, and
                                 * there is no bad imbalance.
                                 */
                                schedstat_inc(this_sd, ttwu_move_affine);
                                goto out_set_cpu;
                        }
                }

                /*
                 * Start passive balancing when half the imbalance_pct
                 * limit is reached.
                 */
                if (this_sd->flags & SD_WAKE_BALANCE) {
                        if (imbalance*this_load <= 100*load) {
                                schedstat_inc(this_sd, ttwu_move_balance);
                                goto out_set_cpu;
                        }
                }
        }

        new_cpu = cpu; /* Could not wake to this_cpu. Wake to cpu instead */
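The wake-affine test is just two inequalities, so it can be tried out with concrete numbers. The following is a user-space sketch with a made-up function name (wake_affine_ok); loads are passed in directly instead of being read from runqueues, and imbalance_pct is assumed to be at least 100.

```c
/*
 * Decide whether a waking task should move to the waking ("target") CPU.
 *   tl            : load of the target CPU
 *   load          : load of the source CPU (where the task last ran)
 *   src_tl        : target_load() of the source CPU
 *   tl_per_task   : average load per task on the target CPU
 *   task_weight   : load weight of the waking task
 *   imbalance_pct : the domain's imbalance_pct (assumed >= 100)
 */
int wake_affine_ok(unsigned long tl, unsigned long load, unsigned long src_tl,
                   unsigned long tl_per_task, unsigned long task_weight,
                   unsigned int imbalance_pct)
{
        unsigned long imbalance = 100 + (imbalance_pct - 100) / 2;

        if (tl <= load && tl + src_tl <= tl_per_task)
                return 1;
        return 100 * (tl + task_weight) <= imbalance * load;
}
```

With a hypothetical imbalance_pct of 125, imbalance becomes 112. An idle target CPU (tl = 0) waking a nice 0 task from a source CPU with load 2048 passes the second test (102400 <= 229376), so the task moves; a target CPU with load 2048 and a source CPU with load 1024 fails both tests, so the task stays on the source CPU.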

Original link: https://blog.csdn.net/haitaoliang/article/details/22427045
