The Linux Subreaper Mechanism and a Kernel-Space Way to Escape It (PR_SET_CHILD

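PS: Please credit the source when reposting; all rights reserved by the author.

PS: This is based only on my own understanding. If it conflicts with your principles or views, please bear with me and keep it civil.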

Environment

None

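Preface

While testing an unrelated problem, we noticed something odd. Our long-held, simple understanding was that if a program creates a parent-process and a child-process, and the parent-process exits while the child-process is still running, the child-process is reparented to the init process. Looking at pstree -p, however, that is not what happened: the child was reparented to a special process that is not init itself but a descendant of init. The steps below reproduce the observation:

Test program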

#include <unistd.h>

#include <sys/types.h>

#include <stdio.h>

int main(int argc, char * argv[])

{

    int pid = fork();

    if (pid < 0)

        printf("fork failed.\n");

    else if (pid > 0){

        printf("parent : child pid = %d\n", pid);

    }

    else{

        printf("child doing ... ...\n");

        printf("first get ppid = %d\n", getppid());

        sleep(20);

        printf("second get ppid = %d\n", getppid());

        sleep(1000);

    }

    sleep(15);

    return 0;

}

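After running this program, its output is shown in the figure below:

From the figure, the running a.out process has pid 72140 and the child process has pid 72141. After the parent-process exits, the child reads its ppid again and the output is 3203.

Next, an excerpt of the process tree while the parent-process is still alive:

The figure shows that a.out sits on the chain systemd(3203)->gnome-terminal-(3741)->bash(5073)->a.out(72140).

Next, an excerpt of the process tree after the parent-process has exited:

The figure shows that once the parent-process exited, child process 72141 was reparented to systemd(3203) rather than to the familiar init process with pid 1. As a preview: 3203 is a systemd --user process, and it is also a subreaper.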
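The parent that pstree -p displays can also be read programmatically. The following is a minimal sketch, assuming /proc is mounted; read_ppid is a hypothetical helper name and is not part of the test program above. It parses the PPid field of /proc/<pid>/status:

```c
#include <stdio.h>
#include <sys/types.h>

/*
 * Hypothetical helper: return the parent PID recorded in
 * /proc/<pid>/status, or -1 if the file cannot be read or parsed.
 */
pid_t read_ppid(pid_t pid)
{
    char path[64];
    char line[256];
    int ppid = -1;
    FILE *fp;

    snprintf(path, sizeof(path), "/proc/%d/status", (int)pid);
    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;

    /* The line of interest looks like "PPid:\t3203". */
    while (fgets(line, sizeof(line), fp) != NULL) {
        if (sscanf(line, "PPid: %d", &ppid) == 1)
            break;
    }
    fclose(fp);
    return (pid_t)ppid;
}
```

Calling read_ppid on the child's pid before and after the parent exits should show the same change that the two getppid() calls in the test program print.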

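Puzzled by this, I went through the relevant documentation and ran some experiments, and found that the behavior is caused by PR_SET_CHILD_SUBREAPER. That is how this article came about.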

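What is a subreaper (PR_SET_CHILD_SUBREAPER)?

For this question the man page is still the place to start: https://man7.org/linux/man-pages/man2/prctl.2.html

The prctl function lets a process apply many interesting settings to itself, and PR_SET_CHILD_SUBREAPER is one of them. It exists to collect orphaned processes and is typically used by daemon-style process managers (for example the systemd mentioned above), so that one process can manage all of its descendants. Under the hood it sets the is_child_subreaper flag, which lives in the signal_struct reached through the calling process's task_struct (task_struct->signal). Below is an excerpt of the implementation: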

//kernel/sys.c

static int propagate_has_child_subreaper(struct task_struct *p, void *data)

{

	/*

	 * If task has has_child_subreaper - all its descendants

	 * already have these flag too and new descendants will

	 * inherit it on fork, skip them.

	 *

	 * If we've found child_reaper - skip descendants in

	 * it's subtree as they will never get out pidns.

	 */

	if (p->signal->has_child_subreaper ||

	    is_child_reaper(task_pid(p)))

		return 0;

	p->signal->has_child_subreaper = 1;

	return 1;

}

//kernel/sys.c

SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,

		unsigned long, arg4, unsigned long, arg5)

{

    struct task_struct *me = current;

    //...

    switch (option){

        //...

        case PR_SET_CHILD_SUBREAPER:

            me->signal->is_child_subreaper = !!arg2;

            if (!arg2)

                break;

            //walk_process_tree() visits all descendants of the current process and calls propagate_has_child_subreaper() on each one to set has_child_subreaper.

            walk_process_tree(me, propagate_has_child_subreaper, NULL);

            break;

        case PR_GET_CHILD_SUBREAPER:

            error = put_user(me->signal->is_child_subreaper,

                    (int __user *)arg2);

            break;

        //...

    }

    //...

}

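The source above shows part of how PR_SET_CHILD_SUBREAPER is implemented, but it does not yet explain why, once the option is set, the children of an exiting process are handed to this child subreaper. As a preview: the answer lies mainly in the has_child_subreaper flag, which is also stored in signal_struct (task_struct->signal).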
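From user space, the flag is set and queried through prctl. The following is a minimal sketch; become_subreaper and is_subreaper are hypothetical wrapper names introduced here for illustration, while the two PR_* options are the ones handled in the kernel excerpt above:

```c
#include <sys/prctl.h>

/*
 * Hypothetical wrapper: mark the calling process as a child subreaper.
 * Returns 0 on success and -1 on error (errno is set by prctl).
 */
int become_subreaper(void)
{
    return prctl(PR_SET_CHILD_SUBREAPER, 1UL, 0UL, 0UL, 0UL);
}

/*
 * Hypothetical wrapper: report whether the calling process is a
 * child subreaper. Returns 1 or 0, or -1 on error.
 */
int is_subreaper(void)
{
    int flag = 0;

    /* The kernel writes the flag through the pointer passed as arg2. */
    if (prctl(PR_GET_CHILD_SUBREAPER, (unsigned long)&flag, 0UL, 0UL, 0UL) < 0)
        return -1;
    return flag;
}
```

A service manager would call something like become_subreaper() once at startup; passing 0 instead of 1 as the second argument clears the flag again.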

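How are children reparented when a parent that still has children exits?

The question itself gives a hint: reparenting happens when a process exits, so the exit path is the place to look. A process normally terminates through the exit system call, which in the kernel ends up in do_exit. Searching the kernel tree for has_child_subreaper turns up the related code. Some excerpts: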

void __noreturn do_exit(long code)

{

	struct task_struct *tsk = current;

    int group_dead;

    //...

    exit_notify(tsk, group_dead);

    //...

}

/*

 * Send signals to all our closest relatives so that they know

 * to properly mourn us..

 */

static void exit_notify(struct task_struct *tsk, int group_dead)

{

    //...

	LIST_HEAD(dead);

    //...

	forget_original_parent(tsk, &dead);

    //...

}

/*

 * This does two things:

 *

 * A.  Make init inherit all the child processes

 * B.  Check to see if any process groups have become orphaned

 *	as a result of our exiting, and if they have any stopped

 *	jobs, send them a SIGHUP and then a SIGCONT.  (POSIX 3.2.2.2)

 */

static void forget_original_parent(struct task_struct *father,

					struct list_head *dead)

{

	struct task_struct *p, *t, *reaper;

	if (unlikely(!list_empty(&father->ptraced)))

		exit_ptrace(father, dead);

	/* Can drop and reacquire tasklist_lock */

	//find_child_reaper() looks up a reaper via task_active_pid_ns()->child_reaper and returns it; normally this is the init process of the current pid namespace.

	reaper = find_child_reaper(father, dead);

	if (list_empty(&father->children))

		return;

	//Starting from that init reaper and the has_child_subreaper flag, find the reaper that actually applies

	reaper = find_new_reaper(father, reaper);

	list_for_each_entry(p, &father->children, sibling) {//iterate over every child of the exiting process

		for_each_thread(p, t) {//iterate over every thread of that child

			RCU_INIT_POINTER(t->real_parent, reaper);//set the new real parent, i.e. the reaper selected above

			BUG_ON((!t->ptrace) != (rcu_access_pointer(t->parent) == father));

			if (likely(!t->ptrace))

				t->parent = t->real_parent;

			if (t->pdeath_signal)

				group_send_sig_info(t->pdeath_signal,

						    SEND_SIG_NOINFO, t,

						    PIDTYPE_TGID);

		}

		/*

		 * If this is a threaded reparent there is no need to

		 * notify anyone anything has happened.

		 */

		if (!same_thread_group(reaper, father))

			reparent_leader(father, p, dead);

	}

	list_splice_tail_init(&father->children, &reaper->children);

}

As forget_original_parent shows, the whole function does one thing: it finds a reaper and hands all of the exiting process's children over to it.
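This handover can also be observed from user space without involving systemd. The sketch below assumes a kernel that supports PR_SET_CHILD_SUBREAPER (Linux 3.4 or later according to the man page); orphan_new_parent is a hypothetical function name. The caller makes itself a subreaper, forks an intermediate process that forks a grandchild and exits at once, and the grandchild reports through a pipe which pid it sees as its parent after being orphaned:

```c
#include <sched.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical demo: returns the parent pid an orphaned grandchild
 * observes after its original parent has exited, or -1 on failure.
 */
pid_t orphan_new_parent(void)
{
    int fds[2];
    pid_t mid;
    pid_t reported = -1;

    if (prctl(PR_SET_CHILD_SUBREAPER, 1UL, 0UL, 0UL, 0UL) < 0)
        return -1;
    if (pipe(fds) < 0)
        return -1;

    mid = fork();
    if (mid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (mid == 0) {
        /* Intermediate process: create the grandchild, then exit. */
        pid_t me = getpid();
        pid_t grandchild = fork();

        if (grandchild == 0) {
            pid_t ppid;

            close(fds[0]);
            /* Poll until the kernel has reparented us. */
            while ((ppid = getppid()) == me)
                sched_yield();
            if (write(fds[1], &ppid, sizeof(ppid)) != (ssize_t)sizeof(ppid))
                _exit(1);
            _exit(0);
        }
        _exit(0);
    }

    /* Subreaper: reap the intermediate process, then read the report. */
    close(fds[1]);
    waitpid(mid, NULL, 0);
    if (read(fds[0], &reported, sizeof(reported)) != (ssize_t)sizeof(reported))
        reported = -1;
    close(fds[0]);
    waitpid(-1, NULL, 0);   /* reap the adopted grandchild */
    return reported;
}
```

If the mechanism works as described, the returned pid should equal the caller's own getpid() rather than 1.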

How do we escape the influence of PR_SET_CHILD_SUBREAPER?

The answer is in find_new_reaper, which forget_original_parent calls; that is where has_child_subreaper takes effect. Let's look at what this function does:

/*

 * When we die, we re-parent all our children, and try to:

 * 1. give them to another thread in our thread group, if such a member exists

 * 2. give it to the first ancestor process which prctl'd itself as a

 *    child_subreaper for its children (like a service manager)

 * 3. give it to the init process (PID 1) in our pid namespace

 */

static struct task_struct *find_new_reaper(struct task_struct *father,

					   struct task_struct *child_reaper)

{

	struct task_struct *thread, *reaper;

	thread = find_alive_thread(father);

	if (thread)

		return thread;

	if (father->signal->has_child_subreaper) {//this is where has_child_subreaper takes effect

		unsigned int ns_level = task_pid(father)->level;

		/*

		 * Find the first ->is_child_subreaper ancestor in our pid_ns.

		 * We can't check reaper != child_reaper to ensure we do not

		 * cross the namespaces, the exiting parent could be injected

		 * by setns() + fork().

		 * We check pid->level, this is slightly more efficient than

		 * task_active_pid_ns(reaper) != task_active_pid_ns(father).

		 */

		for (reaper = father->real_parent;

		     task_pid(reaper)->level == ns_level;

		     reaper = reaper->real_parent) {

			if (reaper == &init_task)

				break;

			if (!reaper->signal->is_child_subreaper)

				continue;

			thread = find_alive_thread(reaper);

			if (thread)

				return thread;

		}

	}

	return child_reaper;

}

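find_new_reaper tells us the following. If the exiting process's thread group still has a live thread, that thread becomes the reaper. Otherwise, when has_child_subreaper is set, the kernel walks up the real_parent chain starting from the exiting process's parent, staying within the same pid namespace level, and returns a live thread of the first ancestor whose is_child_subreaper is set as the actual reaper. When has_child_subreaper is not set, or no such ancestor is found, the children are reparented to the init process.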
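The selection logic can be modeled in a few lines of ordinary C. This is a simplified sketch, not kernel code: struct proc and find_new_reaper_model are hypothetical names, every process is assumed to be single-threaded and alive, and everything is assumed to live in one pid namespace, so only the subreaper and init steps are modeled:

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for the fields the kernel consults. */
struct proc {
    const struct proc *real_parent; /* NULL above the modeled init */
    int is_child_subreaper;         /* set by PR_SET_CHILD_SUBREAPER */
    int has_child_subreaper;        /* propagated to descendants */
};

/*
 * Model of the last two steps of find_new_reaper(): pick the first
 * ancestor flagged is_child_subreaper, but only if the exiting process
 * has has_child_subreaper set; otherwise fall back to init.
 */
const struct proc *find_new_reaper_model(const struct proc *father,
                                         const struct proc *init)
{
    const struct proc *r;

    if (father->has_child_subreaper) {
        for (r = father->real_parent; r != NULL && r != init;
             r = r->real_parent) {
            if (r->is_child_subreaper)
                return r;
        }
    }
    return init;
}
```

With a chain modeled on the one from the preface (init, a subreaper, a shell, a.out), the model returns the subreaper while a.out has has_child_subreaper set, and returns init once that flag is cleared on a.out, even though the ancestor is still a subreaper.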

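From this reasoning, there are two ways to escape the influence of PR_SET_CHILD_SUBREAPER:

Change the place that actually sets PR_SET_CHILD_SUBREAPER so that the flag is never enabled, for example by modifying the systemd source.
Write a small kernel-space tool that changes has_child_subreaper for a given process. Once the flag is cleared, the process's children are reparented to the init process when it exits.

How do we escape the influence of PR_SET_CHILD_SUBREAPER?

Following the conclusion of the previous section, we would normally not modify open-source system programs such as systemd. So we write a basic kernel module that directly modifies the flag reached through the target's task_struct. The module source is as follows: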

#include <linux/module.h>	/* Needed by all modules */
#include <linux/kernel.h>	/* Needed for KERN_INFO */
#include <linux/err.h>		/* IS_ERR */
#include <linux/pid.h>
#include <linux/sched.h>
#include <linux/sched/signal.h>
#include <linux/sched/mm.h>
#include <linux/sched/task.h>	/* put_task_struct */
#include <linux/mm_types.h>
#include <linux/rwsem.h>
#include <linux/slab.h>
#include <linux/fs.h>
#include <linux/mmap_lock.h>
#include <linux/pid_namespace.h>

MODULE_AUTHOR("sky <sky@sky.com>");
MODULE_DESCRIPTION("sky's hack");
MODULE_LICENSE("GPL");
MODULE_VERSION("1.0.0");

static int hack_pid = -1;
/* The parameter type must match the variable type (int, not uint). */
module_param_named(hack_pid, hack_pid, int, S_IRUGO);
MODULE_PARM_DESC(hack_pid, "hack_pid");

int init_module(void)
{
	struct pid *pid_struct;
	struct task_struct *task = NULL;
	struct mm_struct *mm;
	struct pid_namespace *pid_ns;

	printk(KERN_INFO "Hello sky_hack.\n");
	printk(KERN_INFO "hack pid %d\n", hack_pid);

	/* Look the task up under RCU; get_pid_task() takes a reference on it. */
	rcu_read_lock();
	pid_struct = find_vpid(hack_pid);
	if (pid_struct)
		task = get_pid_task(pid_struct, PIDTYPE_PID);
	rcu_read_unlock();
	if (!task) {
		printk(KERN_ERR "get task struct failed.\n");
		return -ESRCH;
	}

	/* mmap_read_lock() may sleep, so this part runs outside the RCU section. */
	mm = get_task_mm(task);
	if (!mm) {
		printk(KERN_ERR "get mm struct failed.\n");
		put_task_struct(task);
		return -EINVAL;
	}
	mmap_read_lock(mm);
	if (mm->exe_file) {
		char *pathname = kmalloc(PATH_MAX, GFP_KERNEL);

		if (pathname) {
			char *p = d_path(&mm->exe_file->f_path, pathname, PATH_MAX);

			/* On success p points at the executable's path inside pathname. */
			if (!IS_ERR(p))
				printk(KERN_INFO "process full path %s\n", p);
			kfree(pathname);
		}
	}
	mmap_read_unlock(mm);
	mmput(mm);	/* drop the reference taken by get_task_mm() */

	rcu_read_lock();
	pid_ns = task_active_pid_ns(task);
	if (pid_ns)
		printk(KERN_INFO "pid_ns->child_reaper=%p, current task_struct=%p\n",
		       pid_ns->child_reaper, task);
	rcu_read_unlock();

	printk(KERN_INFO "is_child_subreaper %d\n", task->signal->is_child_subreaper);
	printk(KERN_INFO "has_child_subreaper %d\n", task->signal->has_child_subreaper);

	/* Escape the subreaper: on exit, find_new_reaper() will fall back to init. */
	task->signal->has_child_subreaper = 0;

	put_task_struct(task);	/* drop the reference taken by get_pid_task() */
	return 0;
}

void cleanup_module(void)
{
	printk(KERN_INFO "Goodbye sky_hack.\n");
}

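The sole purpose of this module is to set has_child_subreaper to 0 for the process with the given pid, so that the children of that process escape the subreaper when it exits.

Makefile for the build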

obj-m += sky_hack.o

all:
	$(MAKE) -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules

clean:
	$(MAKE) -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean

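Now we repeat the earlier test: run a.out, look at the process tree around a.out, then load the module with insmod sky_hack.ko hack_pid=<pid of the a.out process> before the parent a.out exits (it sleeps for 15 seconds). Once the parent a.out has exited, look at the process tree around a.out again.

Output of running a.out:

After following the steps above, the second ppid printed by a.out is now 1, which means we escaped the subreaper.

Next, the output of insmod sky_hack.ko hack_pid=14749:

The module printed has_child_subreaper as 1 for the a.out process and then reset it, so when the process exited its child was reparented to the init process.

Next, the process tree over the course of the whole test:

The process layout here is the same as at the start of the article.

And after escaping the subreaper:

This time the layout differs from the beginning: our child process was successfully reparented to the init process.

Afterword

We ran into this behavior while working on a different problem, dug into its cause, and ended up designing a way to escape it. Along the way we touched kernel source and kernel module writing, and deepened our understanding of subreapers. Going through this process gives a fresh perspective on the Linux kernel and on Linux application development, and it also strengthens general problem-solving skills.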
