《linux内核设计与实现》阅读笔记-进程与调度

一、进程

process:

executing program code(text section)
data section containing global variables
open files
pending signals
internal kernel data
address space
one or more threads of execution

Processes, in effect, are the living result of running program code.

这是 LKD 对进程的经典描述。

1.1、进程描述符

进程描述符(Process Descriptor)在 linux 中就是指 struct task_struct 结构体，这个结构体在 32 位机器上大约是 1.7KB。

1.1.1、PID

struct task_struct {

    ...

    pit_t pid;

    ...

}

1.1.2、current 宏

linux 通常获取一个指向 task_struct 的指针，通过指针直接操作进程。针对不同体系结构实现了 current 宏。例如在 x86 下:

     +---------+

     | current |

     +----+----+

          |

          v

    +-----+---------+

    | get_current() |

    +-----+---------+

          |

          v

+---------+------------+

| percpu_read_stable() |

+---------+------------+

          |

          v

  +-------+----------+

  | percpu_from_op() |

  +------------------+

#define __percpu_arg(x)		"%%"__stringify(__percpu_seg)":%P" #x    %%

#ifdef CONFIG_X86_64

#define __percpu_seg		gs

#define __percpu_mov_op		movq

#else

#define __percpu_seg		fs

#define __percpu_mov_op		movl

#endif

asm(movl "%%fs:%P1","%0" :

    "=r" (pfo_ret__) :

    "p" (&(var))

asm(movq "%%gs:%P1", "%0" :

    "=r" (pfo_ret__) :

    "p" (&(var))

这段汇编将段寄存器 fs:P1 gs:P2 处的内容读出来(参考:linux内核数据结构)，那这个位置的内容到底是什么呢？(TODO)

上一个宏在 /arch/x86/include/asm 中；另外在源码 /include/asm-generic 中还通用宏定义:

       +---------+

       | current |

       +----+----+

            |

            v

    +-------+-------+

    | get_current() |

    +-------+-------+

            |

            v

+-----------+-----------+

| current_thread_info() |

+-----------+-----------+

            |

            v

 +----------+-----------+

 | percpu_read_stable() |

 +----------------------+

union thread_union {

	struct thread_info thread_info;

	unsigned long stack[THREAD_SIZE/sizeof(long)];

};

1.2、进程状态

#define TASK_RUNNING		0

#define TASK_INTERRUPTIBLE	1

#define TASK_UNINTERRUPTIBLE	2

#define __TASK_STOPPED		4

#define __TASK_TRACED		8

struct task_struct {

    ...

	volatile long state;

    ...

}

set_current_state(state);

set_task_state(current, state);

1.3、进程的经历

+----------+       +----------+      +----------+

|  fork()  +------>+  exec()  +----->+  exit()  |

+----------+       +----+-----+      +----+-----+

                        |                 |

                        |                 v

                        |            +----+-----+

                        +----------->+  wait()  +--------->

                                     +----------+

1.3.1 进程创建(CoW fork)

Copy-on-Write(CoW) 中译写时拷贝。在 CoW fork() 后，父子进程所有数据都只有一份，即它们映射到的物理内存是相同的。它们的 PTE 标志都是 read-only，一旦父进程或者子进程对共享区域执行了写操作，所以就会触发 Page Fault。系统发现 Page Fault 是因为写 CoW 区域造成。系统将写操作区域复制一份，然后将触发这个操作的进程的 PTE 指向新复制内存(并设置PTE为Write)。重新执行写操作，这时候复制的区域的写操作成功。

linux 实现了 CoW fork。

+------------+   +-------------+   +-------------+   +-----------------+

| sys_fork() |   | sys_vfork() |   | sys_clone() |   | kernel_thread() |

+------+-----+   +-------------+   +----+--------+   +-------+---------+

       |               |                |                    |

       |               +------+  +------+                    |

       |                      |  |                           |

       +-------------------+  |  |  +------------------------+

                           |  |  |  |

                          +v--v--v--v--+

                          |  do_fork() |

                          +------+-----+

                                 |

                         +-------+--------+

                         | copy_process() |

                         +----+---+-------+

      +--------------------+  |   |  |------------------------------+

      |                       |   +---------------+                 |

      v                       v                   v                 v

 +----+--------+      +-------+---------+     +---+----------+    +-+---+

 | alloc_pid() |      |dup_task_struct()|     | copy_flags() |    | ... |

 +-------------+      +-----------------+     +--------------+    +-----+

子进程共享 or 复制父进程的资源，取决于 flags 参数:

#define CSIGNAL		    0x000000ff

#define CLONE_VM	    0x00000100

#define CLONE_FS	    0x00000200

#define CLONE_FILES	    0x00000400

#define CLONE_SIGHAND	0x00000800

...

#define CLONE_NEWNET	0x40000000

#define CLONE_IO		0x80000000

fork 成功后，linux 通常让子进程先运行。原因如下:

假设，父子进程返回用户空间后，调度父进程先运行。父进程可能执行一个写操作，这时会触发 CoW。如果调度让子进程先运行，子进程在 fork 后通常会执行 exec。就不和父进程共享数据了，后面即是父进程再执行写操作，也不会触发 CoW。

对于 linux 来说，线程(Thread)是一种特殊的进程。创建的是线程还是进程，取决于 fork 时的 flag 参数:

// 线程

clone(CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND, 0);

// 进程

clone(SIGCHLD, 0);

其实 linux 里面没有严格的线程概念，它的线程就是进程(因为linux中进程已然很轻量)。

Interestingly, note that threads share the virtual memory abstraction, whereas each receives its own virtualized processor.

1.3.2 进程终结

结束进程生命周期由两种方式:

显示执行 exit()
隐式执行 exit()

第二种情况指 C 的编译器会在 main() 函数的返回后执行 exit()。

NORET_TYPE void do_exit(long code)

{

    ...

	exit_signals(tsk);  /* sets PF_EXITING */

    ...

	tsk->exit_code = code;

    ...

	exit_mm(tsk); /*release the mm_struct held by this process*/

    ...

	exit_sem(tsk); /* 退出 IPC 信号量队列 */

	exit_files(tsk);

	exit_fs(tsk);

    ...

	exit_notify(tsk, group_dead);

    ...

	schedule();

	BUG();

	/* Avoid "noreturn function does return".  */

	for (;;)

		cpu_relax();	/* For when BUG is null */

}

这个函数永远不会返回。现在这个进程已经被标志为 EXIT_ZOMBIE。之所以还称它为进程，是因为这个进程还有三个资源没有释放:

kernel stack
thread_info structure
task_struct structure.

这三个资源存在的意义是为了通知父进程，让父进程来释放。

父进程执行 wait 族函数来释放上诉资源:

                      +-------------+

                      | sys_wait4() |

                      +------+------+

                             |

                             v

                  +----------+---------+

                  | wait_task_zombie() |

                  +----------+---------+

                             |

                             v

                     +-------+--------+

                     | release_task() |

                     +------+---------+

         +---------------+  |    +-------------------+

         |                  |                        |

         v                  v                        v

+--------+--------+     +---+---------------+     +--+---+

| __exit_signal() |     | put_task_struct() |     | ...  |

+-----------------+     +-------------------+     +------+

自此，一个进程/线程在操作系统中的痕迹永远抹去了。

二、进程调度

调度策略(Scheduling policies):

SCHED_NORMAL/SCHED_OTHER
SCHED_FIFO
SCHED_RR
SCHED_BATCH
SCHED_IDLE

进程分类:

普通进程(Normal Process)
- 交互式进程(interactive process)
- 批处理进程(batch process)
实时进程(Real-Time Process)

实时进程的调度策略为: SCHED_FIFO/SCHED_RR；普通进程的调度策略为: SCHED_NORMAL。

优先级:

实时优先级(0~99，数值越高优先级越高)
Nice 优先级(-20_19/100139，数值越高优先级越低)

实时进程使用实时优先级，而普通进程则使用 Nice 优先级。在 linux 中实时进程总是优先于普通进程调度。所以这两种优先级互不干扰。

调度器类:

rt_sched_class
fair_sched_class
idle_sched_class

这几个类的类型都是 struct sched_class。调度器类也有优先级。

调度器实体(Scheduler Entity):

sched_entity
sched_rt_entity
sched_dl_entity

The highest priority scheduler class that has a runnable process wins, selecting who runs next.

2.1、普通进程调度

linux 中，普通进程调度实现了完全公平调度(Completely Fair Scheduler)算法。

CFS is based on a simple concept: Model process scheduling as if the system had an ideal, perfectly multitasking processor. In such a system, each process would receive 1/n of the processor’s time, where n is the number of runnable processes, and we’d schedule them for infinitely small durations, so that in any measurable period we’d have run all n processes for the same amount of time.

上面描述的只是一种理想情况。假设系统中有 100 个进程，measurable period 假设为 1ms(极端例子)。每个进程每运行 0.01ms 就要进行一次上下文切换。这是不现实的。

但是我们需要一种标准来衡量 CFS 的性能，于是提出两个概念:

targeted latency
minimum granularity(默认值为 1ms)

总结一句话就是: 在 targeted latency 长的时间内，要让每个进程都能被调度到，且每个进程的运行时间不低于 minimum granularity。

目前来说只是在纸上谈兵。关键是每次调度一个进程后，到底应该运行多长时间呢？在 CFS 中，这个时间由所有普通进程的 Nice 值决定。

先通过 Nice 值计算每个进程[i]的权重(weight):

weight[i] ≈ 1024 / (1.25)^(nice[i])

然后再由权重计算出该进程应该占用的 CPU 比例:

CPU proportion[i] = weight[i]/weight[1] + ... + weight[n]

这是一种几何加权。通过这种方式，使用 CFS 调度运行普通进程，能达到几乎完美的多任务。CFS 的实现分为四部分:

Time Accounting
Process Selection
The Scheduler Entry Point
Sleeping and Waking Up

2.1.1、Time Accounting

struct task_struct {

    ...

	struct sched_entity se;

	...

}

struct sched_entity {

    ...

	u64			vruntime;

    ...

}

对于理想的 CFS 模型来说，每个进程的 vruntime 都是相同的，但现实中却不同。

CFS uses vruntime to account for how long a process has run and thus how much longer it ought to run.

static void update_curr(struct cfs_rq *cfs_rq)

{

    ...

	delta_exec = (unsigned long)(now - curr->exec_start);

    ...

	__update_curr(cfs_rq, curr, delta_exec);

    ...

}

static inline void

__update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr,

	      unsigned long delta_exec)

{

    ...

	delta_exec_weighted = calc_delta_fair(delta_exec, curr);

	curr->vruntime += delta_exec_weighted;

	update_min_vruntime(cfs_rq);

}

可以看到 vruntime 经过加权计算。

2.1.2、Process Selection

CFS 选择 vruntime 最小的进程调度运行。为了查找迅速，CFS 使用红黑树来组织 struct cfs_rq 运行队列:

struct cfs_rq {

    ...

	struct sched_entity *curr, *next, *last;

    ...

}

vruntime 最小的 sched_entity 在红黑树的最左边。

2.1.3、The Scheduler Entry Point

linux 中总调度入口在 kernel/sched.c/schedule() 中，这个函数的核心是 pick_next_task() 函数:

static inline struct task_struct *

pick_next_task(struct rq *rq)

{

	const struct sched_class *class;

	struct task_struct *p;

    ...

	class = sched_class_highest;

	for ( ; ; ) {

		p = class->pick_next_task(rq);

		if (p)

			return p;

		class = class->next;

	}

}

这个函数看上去挺简单，实际上却是整个进程调度的精华所在。上面提到过 struct sched_class 的变量有 3 个:

fair_sched_class
rt_sched_class
idle_sched_class

在 sched_rt.c 中，fair_sched_class 为自己重新注册了函数:

static const struct sched_class rt_sched_class = {

	.next			= &fair_sched_class,

	.enqueue_task		= enqueue_task_rt,

	.dequeue_task		= dequeue_task_rt,

	.yield_task		= yield_task_rt,

	.check_preempt_curr	= check_preempt_curr_rt,

	.pick_next_task		= pick_next_task_rt,

	.put_prev_task		= put_prev_task_rt,

    ...

}

所以 pick_next_task() 的逻辑就是: 先按调度类优先级从高到底排序，执行各自的 pick_next_task_*() 函数。在各自的 struct *_rq 运行队列中找一个合适的进程。调度类优先级最高的是:

#define sched_class_highest (&rt_sched_class)

2.1.4、Sleeping and Waking Up

主动 sleep
被动 sleep

内核使用一个结构体来组织休眠的 task:

struct __wait_queue_head {

	spinlock_t lock;

	struct list_head task_list;

};

typedef struct __wait_queue_head wait_queue_head_t;

实现原理类似 xv6 中的的 sleep/wakeup。

2.2、实时进程调度

实时进程使用另一种调度方式，其实现比 CFS 要简单很多。在 kernel/sched_rt.c 中，实时进程的策略有两种:

SCHED_FIFO
SCHED_RR

SCHED_RR 是带有时间片的 SCHED_FIFO。

struct task_struct {

    ...

	struct sched_rt_entity rt;

    ...

}

参考资料: