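select, poll, and epoll

1. Concepts

select, poll, and epoll are all event-notification mechanisms used for I/O multiplexing: a process waits on a set of descriptors and handles each event once it occurs.

2. Understanding through a simple example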

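3. The select function

3.1 Function details

int select(int maxfdp1, fd_set *readset, fd_set *writeset, fd_set *exceptset, struct timeval *timeout);

// Return value: the number of ready descriptors, 0 on timeout, -1 on error

1) The first argument, maxfdp1, specifies how many descriptors to test. Its value is the highest descriptor to be tested plus 1 (hence the name maxfdp1). The kernel scans descriptors 0, 1, 2, ..., maxfdp1-1, including the range in between that you may not care about, but only descriptors whose bits are set in one of the three sets are actually tested.

2) The three middle arguments, readset, writeset, and exceptset, specify the descriptors that the kernel should test for read, write, and exception conditions. If you are not interested in one of the conditions, pass a null pointer for it. An fd_set holds descriptors; it is an array of long used as a bitmap and is manipulated through the following four macros: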

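void FD_ZERO(fd_set *fdset);           // clear every bit in the set

void FD_SET(int fd, fd_set *fdset);    // add the given descriptor to the set

void FD_CLR(int fd, fd_set *fdset);    // remove the given descriptor from the set

int FD_ISSET(int fd, fd_set *fdset);   // test whether the given descriptor is in the set (after select returns: whether it is ready)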

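3) timeout tells the kernel how long to wait for any one of the specified descriptors to become ready. The timeval structure expresses that time in seconds and microseconds:

struct timeval {
    long tv_sec;   // seconds
    long tv_usec;  // microseconds
};

There are three possibilities for this argument:

① Wait forever: return only when a descriptor is ready for I/O. To do this, pass a null pointer (NULL). (I return only once you are ready.)

② Wait up to a fixed amount of time: return when a descriptor is ready for I/O, but wait no longer than the seconds and microseconds given in the timeval structure the argument points to. (I return when the time is up.)

③ Do not wait at all: check the descriptors and return immediately. This is called polling. The argument must point to a timeval structure whose timer value is 0. (I check whether you are ready and return either way.)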
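The three timeout cases can be exercised with a small helper. This is a minimal sketch, not part of the original article: wait_readable is a hypothetical name, and the mapping of a negative timeout_ms to a NULL timeout is a convention chosen only for this example.

```c
#include <assert.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if fd becomes readable, 0 on timeout,
 * -1 on error. timeout_ms < 0 waits forever (NULL timeout), 0 polls and
 * returns at once, > 0 waits at most that many milliseconds. */
int wait_readable(int fd, int timeout_ms)
{
    fd_set rset;
    struct timeval tv;
    struct timeval *tvp = NULL;      /* case 1: NULL means wait forever */

    FD_ZERO(&rset);                  /* start from an empty set */
    FD_SET(fd, &rset);               /* watch fd for readability */

    if (timeout_ms >= 0) {           /* cases 2 and 3: bounded wait or poll */
        tv.tv_sec = timeout_ms / 1000;
        tv.tv_usec = (timeout_ms % 1000) * 1000;
        tvp = &tv;
    }

    int n = select(fd + 1, &rset, NULL, NULL, tvp);   /* maxfdp1 = fd + 1 */
    if (n < 0)
        return -1;
    return (n > 0 && FD_ISSET(fd, &rset)) ? 1 : 0;
}
```

Calling wait_readable(fd, 0) on an empty pipe returns 0 immediately, while wait_readable(fd, -1) blocks until data arrives. fd must be smaller than FD_SETSIZE.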

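3.2 Implementation

Inside the kernel, select works through steps 1 to 7 below, repeating steps 3 to 6 until a descriptor is ready or the timeout expires:

1) Use copy_from_user to copy the fd_set (the descriptor set) into the kernel

2) Register a function, __pollwait, which is the so-called poll callback

3) Traverse all descriptors and call the poll method of each one (for a socket this is sock_poll, which in turn calls tcp_poll, udp_poll, or datagram_poll as appropriate). The main job of the poll method is to put the current process on the wait queue of the device behind the fd, so that the sleeping process is woken when the fd becomes readable or writable. The poll method returns a mask describing read/write readiness, and that mask is used to fill in the fd_set

4) After the traversal, if any mask reports readiness, jump to step 7

5) Otherwise, call schedule_timeout to put the current process to sleep

6) If an fd becomes readable or writable during the sleep, or the sleep time runs out, the current process is woken, gets the CPU, and jumps back to step 3

7) Use copy_to_user to copy the fd_set from the kernel back to user space

Finally, the process checks the fd_set in user space, finds the descriptors that are readable or writable, and performs I/O on them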

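3.3 Disadvantages

1) select can monitor only a small number of descriptors; on Linux the default limit is 1024, set by the FD_SETSIZE macro

2) Every call to select copies the whole fd set from user space into the kernel, and copies it back on return, which costs time

3) Every time the current process is woken it has to traverse all the descriptors (that is, poll them), which is inefficient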
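Because the kernel writes its result back into the same fd_set, the caller's set is overwritten on every call and has to be rebuilt before the next one. The sketch below illustrates this; select_two and its out parameters are hypothetical names introduced only for the example.

```c
#include <assert.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper: watches two descriptors for readability with a zero
 * timeout and reports which bits are still set after select returns.
 * Returns the number of ready descriptors, or -1 on error. */
int select_two(int fd_a, int fd_b, int *a_set, int *b_set)
{
    fd_set rset;
    struct timeval tv = {0, 0};              /* poll, do not block */
    int maxfd = fd_a > fd_b ? fd_a : fd_b;

    FD_ZERO(&rset);
    FD_SET(fd_a, &rset);                     /* both bits set on entry */
    FD_SET(fd_b, &rset);

    int n = select(maxfd + 1, &rset, NULL, NULL, &tv);
    if (n < 0)
        return -1;

    /* On return only the bits of ready descriptors remain set. */
    *a_set = FD_ISSET(fd_a, &rset) ? 1 : 0;
    *b_set = FD_ISSET(fd_b, &rset) ? 1 : 0;
    return n;
}
```

If only fd_a has pending data, the bit for fd_b is cleared on return even though the caller set it, which is why loops built on select call FD_SET again on every iteration.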

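3.4 Example

#include <stdio.h>
#include <sys/select.h>
#include <sys/time.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Assumption: the buffer size was lost from the original listing. */
#define MAXLINE 4096

/* readline() and writen() are not defined in this listing; they are assumed
 * to be UNP-style helpers that read one line and write exactly n bytes. */
ssize_t readline(int fd, void *buf, size_t maxlen);
ssize_t writen(int fd, const void *buf, size_t n);

int max(int a, int b)
{
    return (a >= b ? a : b);
}

void str_cli(FILE *fp, int sockfd)
{
    int       maxfdp1;
    fd_set    rset;
    char      sendline[MAXLINE], recvline[MAXLINE];

    FD_ZERO(&rset);
    for (;;)
    {
        /* select overwrites rset, so both bits are set again on every pass */
        FD_SET(fileno(fp), &rset);
        FD_SET(sockfd, &rset);
        maxfdp1 = max(fileno(fp), sockfd) + 1;
        if (select(maxfdp1, &rset, NULL, NULL, NULL) < 0)
        {
            perror("select");
            exit(1);
        }
        if (FD_ISSET(sockfd, &rset))    /* socket is readable */
        {
            if (readline(sockfd, recvline, MAXLINE) == 0)
            {
                printf("str_cli: server terminated prematurely\n");
                exit(1);
            }
            fputs(recvline, stdout);
        }
        if (FD_ISSET(fileno(fp), &rset)) /* input is readable */
        {
            if (fgets(sendline, MAXLINE, fp) == NULL)
                return;
            writen(sockfd, sendline, strlen(sendline));
        }
    }
}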

4. The poll function

4.1 Function details

#include <poll.h>

int poll(struct pollfd fds[], nfds_t nfds, int timeout);

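1) poll uses an array of structures, fds, to hold the descriptors; each element is a pollfd structure

struct pollfd {
    int fd;         // the file descriptor
    short events;   // the events requested for this descriptor
    short revents;  // the events returned after the check; non-zero when the state of fd has changed
};

To speed up processing and improve system performance, the kernel represents all the struct pollfd entries in fds as a linked list of struct poll_list; in other words, the kernel side keeps the descriptors in a linked list

struct poll_list {
    struct poll_list *next;
    int len;
    struct pollfd entries[];
};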

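2) Parameters

fds: holds the socket descriptors whose state is to be checked. Unlike select, which overwrites the descriptor sets it is given, poll leaves the array intact after each call and only updates the revents field of the entries whose state changed, which makes it more convenient to use;
nfds: the total number of struct pollfd elements in the fds array;
timeout: how long the poll call may block, in milliseconds (ms)

3) Return value

Greater than 0: some socket descriptors in fds changed state: readable, writable, or in error. The value is the total number of descriptors whose state changed. Traverse the fds array to find the entries whose revents is non-zero, then check which events are set in order to read the data
Equal to 0: no socket descriptor changed state and the call timed out
Less than 0: an error occurred; the global variable errno holds the error code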
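The parameters and return values above can be seen in a short sketch. poll_readable is a hypothetical helper written for this example only; it watches a single descriptor for POLLIN.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: polls one descriptor for readability.
 * Returns the revents bits reported for fd (non-zero), 0 on timeout,
 * or -1 on error. */
int poll_readable(int fd, int timeout_ms)
{
    struct pollfd pfd;

    pfd.fd = fd;
    pfd.events = POLLIN;     /* requested events: stays untouched by poll */
    pfd.revents = 0;         /* filled in by the kernel */

    int n = poll(&pfd, 1, timeout_ms);   /* nfds = 1 element in the array */
    if (n < 0)
        return -1;
    if (n == 0)
        return 0;            /* timed out, nothing changed state */
    return pfd.revents;
}
```

Because events is left intact, the same pollfd array can be passed to poll again without being rebuilt, unlike the fd_set used by select.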

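4.2 Implementation

poll is implemented in much the same way as select

4.3 Advantages

1) poll has no built-in limit on the number of descriptors: the fds array of struct pollfd can be as large as the application needs (although performance still drops when the count becomes very large)

4.4 Disadvantages

poll shares disadvantages 2) and 3) of select: the whole descriptor array is copied between user space and the kernel on every call, and all descriptors are traversed every time the process wakes up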

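5. The epoll functions

epoll is the Linux improvement on select/poll

5.1 Function details

epoll is used through three functions:

epoll_create: creates an epoll handle

int epoll_create(int size);

//  size: tells the kernel how many descriptors will be monitored. It must be greater than 0, otherwise the call fails with EINVAL. It was only ever a hint for the kernel's initial allocation of internal data structures, and the source shows that the value is not actually used.

1) In the kernel, everything is a file. When the kernel initializes (at system boot), epoll registers a file system, that is, it sets up its own kernel caches for the sockets that need to be monitored. Those sockets are kept in the kernel cache as a red-black tree, which supports fast lookup, insertion, and deletion

2) Calling epoll_create creates a red-black tree in the epoll file system (to hold the descriptors later passed in through epoll_ctl) and a ready list (to hold the descriptors that have become ready)

3) Note: the epoll handle itself occupies one fd (on Linux it can be seen under /proc/<pid>/fd/), so close() must be called on it once epoll is no longer needed; otherwise descriptors may run out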

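epoll_ctl: adds descriptors to, or removes them from, the epoll handle produced by epoll_create, and registers the event types to monitor. Each descriptor and its event types are described by an epoll_event structure

int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);

/*

epfd: the return value of epoll_create().

op: the operation, given by one of three macros: EPOLL_CTL_ADD, EPOLL_CTL_DEL, and EPOLL_CTL_MOD, which add, delete, and modify the events monitored on fd

fd: the file descriptor to monitor

event: tells the kernel which events to monitor; ET mode is also selected in this structure

*/

1) copy_from_user copies the epoll_event structure into kernel space. (Many blog posts claim that epoll uses shared memory; this is wrong. Reading the source shows that no shared-memory API is used at all)

2) The socket fd to monitor is added to the red-black tree (entries can also be deleted or modified; on an add, if the fd is already present the call returns at once, otherwise it is inserted into the tree). During insertion a callback, ep_poll_callback, is registered for the socket. When the socket becomes ready this callback runs immediately (instead of the default_wake_function wake-up used by select/poll)

3) The job of ep_poll_callback: put the ready fd on the ready list, then wake the current process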

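epoll_wait: repeatedly checks whether the ready list is empty

int epoll_wait(int epfd, struct epoll_event * events, int maxevents, int timeout);

epoll_wait loops through steps 1 to 6:

1) epoll_wait checks whether the ready list is empty

2) If it is not empty, jump to step 6

3) If it is empty, call schedule_timeout to put the current process to sleep

4) If an fd becomes ready during the sleep, it triggers the ep_poll_callback callback, which puts the ready fd on the ready list and wakes the current process; then jump to step 1

5) If the sleep time runs out, also jump to step 1

6) Use __put_user to copy the ready events to user space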
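Seen from user space, the three calls fit together as in the sketch below. It is a minimal illustration rather than a realistic server: epoll_wait_readable is a hypothetical name, and creating a fresh epoll instance on every call is done only to keep the example short (a real program creates the instance once and reuses it).

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: registers fd for EPOLLIN on a new epoll instance and
 * waits up to timeout_ms. Returns the number of ready events (0 on timeout)
 * or -1 on error. */
int epoll_wait_readable(int fd, int timeout_ms)
{
    struct epoll_event ev;
    struct epoll_event ready[8];

    int epfd = epoll_create1(0);             /* the epoll handle is itself an fd */
    if (epfd < 0)
        return -1;

    ev.events = EPOLLIN;                     /* level-triggered by default */
    ev.data.fd = fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {
        close(epfd);
        return -1;
    }

    /* Only ready entries are copied back, at most 8 here. */
    int n = epoll_wait(epfd, ready, 8, timeout_ms);

    close(epfd);                             /* release the fd used by the handle */
    return n;
}
```

The close(epfd) at the end matters for the reason given in the epoll_create notes: the handle occupies a descriptor of its own.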

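5.2 Advantages of epoll

1) epoll can monitor a very large number of descriptors. The upper limit is the maximum number of files that all processes on the system can open, which can be read with cat /proc/sys/fs/file-max (98875 on ubuntu14.04)

2) select/poll copy the whole fd set between user space and the kernel on every call, whereas epoll copies only the ready descriptors on return, which reduces the copying cost

3) select/poll and epoll all alternate between sleeping and waking many times, but while "awake" select/poll must traverse the whole fd set, whereas epoll only has to check whether the ready list is empty, which is far more efficient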

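5.3 The two modes of epoll

5.3.1 Level-triggered mode (LT: level-triggered)

1) LT is the default working mode of epoll and supports both blocking and non-blocking sockets

2) The traditional select/poll both work this way

3) How LT is implemented in the source: when an fd becomes ready, the callback puts it on the ready list. A call to epoll_wait then copies this ready fd to user space and empties the ready list. As a last step epoll_wait checks the fd again, and if it has still not been dealt with, puts it back on the ready list that was just emptied, so the same fd is returned by the next epoll_wait. ET mode does not perform this check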
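The re-queueing described in point 3) can be observed from user space by leaving data unread and calling epoll_wait several times. The sketch below is an illustration under stated assumptions: count_notifications is a hypothetical helper, and passing the EPOLLET flag selects the edge-triggered mode covered in the next section.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: counts how many of `rounds` consecutive epoll_wait
 * calls report fd as readable while the pending data is never read.
 * extra_flags is 0 for LT or EPOLLET for ET. Returns -1 on error. */
int count_notifications(int fd, unsigned int extra_flags, int rounds)
{
    struct epoll_event ev, out;
    int hits = 0;

    int epfd = epoll_create1(0);
    if (epfd < 0)
        return -1;

    ev.events = EPOLLIN | extra_flags;
    ev.data.fd = fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {
        close(epfd);
        return -1;
    }

    for (int i = 0; i < rounds; i++) {
        int n = epoll_wait(epfd, &out, 1, 0);   /* timeout 0: never blocks */
        if (n < 0) {
            close(epfd);
            return -1;
        }
        hits += n;                              /* data is deliberately not read */
    }

    close(epfd);
    return hits;
}
```

With one unread byte in a pipe, three rounds in LT mode report the descriptor three times, while three rounds with EPOLLET report it only once.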

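5.3.2 Edge-triggered mode (ET: edge-triggered)

1) The difference between the two: in LT mode, as long as a socket is readable/writable, every epoll_wait returns it, no matter when it is called. In ET mode, epoll_wait returns a socket only when it changes from unreadable to readable or from unwritable to writable (comparable to the rising edge of a signal rather than its level)

2) Because of this difference, the correct way to read and write in ET mode is:

Read: once readable, keep reading until the socket's receive buffer is drained (read returns EAGAIN)

Write: once writable, keep writing until all data is sent or the socket's send buffer is full (write returns EAGAIN)

Why?

When epoll works in ET mode and a read does not drain the socket's receive buffer, the socket fd is still readable but no new edge occurs, so the next epoll_wait gives no read-ready notification for it until the peer sends new data. If the peer is itself waiting for a reply to the request that is still sitting partly unread in the buffer, it will never send anything new, and the leftover data never gets a chance to be read: a deadlock
The same reasoning applies to writes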

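3) Blocking sockets

When you read from a blocking file descriptor that has no data available, the call blocks (put simply, it stays stuck in the function call) until data can be read
When you write to a blocking file descriptor that has no space available (usually buffer space), the call blocks until there is room to write
"Read" and "write" here stand for any operation on a file descriptor, not only actually reading and writing data; they also include accepting a connection with accept(), initiating one with connect(), and so on

4) Non-blocking sockets

When you read from or write to a non-blocking file descriptor, the call returns immediately whether or not the operation can proceed. Success means the read or write was carried out; failure sets errno accordingly (EAGAIN), and that errno can be used to decide what to do next. The call does not get stuck the way it does on a blocking socket

5) Because ET mode requires the read/write pattern from point 2) above, ET mode should only be used with non-blocking sockets. Consider what happens with a blocking socket: since ET mode must keep reading until the data is exhausted, an edge-triggered epoll program normally reads the socket in a while loop. Once there is no more data, a blocking socket blocks in read inside that loop forever instead of in epoll_wait, and every other socket starves. With a non-blocking socket, once there is no more data read returns -1 with errno set to EAGAIN, the while loop exits, the program carries on, and the other sockets get served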

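// Read

if (events[i].events & EPOLLIN)
{
    n = 0;
    /* Keep reading until no data is left; the space already used is
     * subtracted so that buf cannot overflow. */
    while (n < BUFSIZ - 1 && (nread = read(fd, buf + n, BUFSIZ - 1 - n)) > 0)
    {
        n += nread;
    }
    /* nread == 0: peer closed; nread == -1 with EAGAIN: buffer drained */
    if (nread == -1 && errno != EAGAIN)
    {
        perror("read error");
    }
    buf[n] = '\0';  /* the write branch below uses strlen(buf) */
    ev.data.fd = fd;
    ev.events = events[i].events | EPOLLOUT;
    epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
}

// Write

if (events[i].events & EPOLLOUT)
{
    int nwrite, data_size = strlen(buf);
    n = data_size;
    while (n > 0)   /* until everything is written or the send buffer is full */
    {
        nwrite = write(fd, buf + data_size - n, n);
        if (nwrite == -1)
        {
            if (errno != EAGAIN)
            {
                perror("write error");
            }
            break;  /* EAGAIN: send buffer full, wait for the next EPOLLOUT */
        }
        n -= nwrite;    /* a short write is not an error: keep going */
    }
    ev.data.fd = fd;
    ev.events = EPOLLIN | EPOLLET;
    epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);  // switch the events handled on fd back to EPOLLIN
}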
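The read branch above is a fragment that depends on surrounding variables. A self-contained version of the same drain-until-EAGAIN pattern might look like the sketch below; set_nonblocking and drain_fd are hypothetical helper names, and the data is discarded rather than processed to keep the example short.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: switches fd to non-blocking mode. Returns 0 or -1. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Hypothetical helper: reads from a non-blocking fd until it is drained.
 * Returns the number of bytes read, or -1 on a real error.
 * *eof is set to 1 if end-of-file (read returned 0) was seen. */
long drain_fd(int fd, int *eof)
{
    char buf[4096];
    long total = 0;

    *eof = 0;
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) {                 /* got data, try again */
            total += n;
            continue;
        }
        if (n == 0) {                /* 0 means EOF, not "no data yet" */
            *eof = 1;
            return total;
        }
        if (errno == EINTR)          /* interrupted by a signal: retry */
            continue;
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return total;            /* drained: safe to go back to epoll_wait */
        return -1;                   /* any other errno is a real error */
    }
}
```

The distinction between the two exits is what matters in ET mode: -1 with EAGAIN ends the loop for this event, while a return value of 0 from read means the peer has closed the connection.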

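5.3.3 Choosing between LT mode and ET mode

1) LT mode returns a readable socket on every call, whereas ET mode returns it only when the edge condition is met. This cuts down on repeated epoll system calls, so ET is more efficient

2) LT mode places lower demands on the code and is less error-prone. ET mode is demanding: every event must be handled completely each time, because if it is not there is no second chance to handle it, so events are easily lost

5.3.4 In LT mode a writable socket keeps triggering the writable event. How should this be handled?

1) Do not add the socket to epoll at first. When there is data to write to the socket, call write directly. If write returns EAGAIN, add the socket to epoll at that point and write the data as epoll reports it writable; once all the data has been sent, remove the socket from epoll again

2) The advantage of this approach: when there is not much data, epoll is never triggered at all, which improves efficiency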
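The write-first strategy for LT mode described above can be sketched as follows. This is an illustration under assumptions, not code from the article: send_or_register and make_nonblocking are hypothetical names, and the caller is assumed to keep the unsent tail of the buffer and to remove fd from epoll once everything has been sent.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: switches fd to non-blocking mode. Returns 0 or -1. */
int make_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Hypothetical helper: writes directly first and registers fd for EPOLLOUT
 * only if the kernel buffer fills up. Returns the bytes written so far
 * (possibly fewer than len) or -1 on error. *registered is set to 1 when
 * fd was added to epfd. */
ssize_t send_or_register(int epfd, int fd, const char *buf, size_t len,
                         int *registered)
{
    size_t off = 0;

    *registered = 0;
    while (off < len) {
        ssize_t n = write(fd, buf + off, len - off);
        if (n > 0) {
            off += (size_t)n;
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            struct epoll_event ev;
            ev.events = EPOLLOUT;            /* LT: reported while writable */
            ev.data.fd = fd;
            if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0)
                return -1;
            *registered = 1;                 /* epoll now drives the rest */
            break;
        }
        return -1;                           /* real error */
    }
    return (ssize_t)off;
}
```

For a small message the loop finishes without ever touching epoll, which is the efficiency gain the text refers to; only a write that hits EAGAIN pays for the epoll_ctl call.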

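5.4 epoll source code

/*
 * Before digging into the implementation of epoll, three aspects of the
 * kernel need to be understood.
 *
 * 1. The wait queue (waitqueue)
 * A short explanation of wait queues:
 * the queue head (wait_queue_head_t) is usually the producer of a resource,
 * the queue members (wait_queue_t) are usually the consumers of it.
 * When the resource at the head becomes ready, the callback specified by
 * each member is run in turn to tell it that the resource is ready.
 * That is roughly what a wait queue is.
 *
 * 2. The kernel poll mechanism
 * An fd that is polled must support the kernel poll mechanism in its
 * implementation. Whether the fd is a character device or a socket, it must
 * implement the poll operation in file_operations and own a wait queue head.
 * A process that actively polls the fd must allocate a wait queue member,
 * add it to the wait queue of the fd, and specify the callback to run when
 * the resource becomes ready.
 * Take a socket as an example: it must implement a poll operation. The code
 * that starts the polling has to call this poll operation itself, and the
 * operation must call poll_wait(), which adds the caller as a member to the
 * wait queue of the socket.
 * This way, when the state of the socket changes, every process interested
 * in it can be notified one by one through the queue head.
 * This point must be understood clearly, otherwise it is hard to see how
 * epoll learns that the state of an fd has changed.
 *
 * 3. An epollfd is itself an fd, so it can itself be monitored by epoll.
 * One may guess whether epoll can then be nested without limit...
 *
 * epoll is essentially built from points 1 and 2 above.
 * So epoll does not introduce any especially complex or advanced technique
 * into the kernel; it recombines existing facilities and ends up
 * outperforming select.
 */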

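/*
 * Other related kernel knowledge:
 *
 * 1. An fd is a file descriptor; in kernel space its counterpart is
 * struct file, which can be seen as the kernel-side file descriptor.
 *
 * 2. spinlock: a lock that must be used with great care, especially with
 * spin_lock_irqsave(), where interrupts are disabled, no process scheduling
 * takes place, and other CPUs cannot access the protected resource either.
 * This lock is very heavy-handed, so it should only protect very
 * lightweight operations.
 *
 * 3. Reference counting is a very important concept in the kernel.
 * Kernel code often has release/free functions that take almost no locks.
 * This is because they are usually called when the reference count of the
 * object drops to 0; since no process is using the object any more, no
 * locking is needed.
 * struct file holds a reference count.
 */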

/* --- epoll data structures --- */

/*
 * This structure is stored inside the "private_data" member of the file
 * structure and represents the main data structure for the eventpoll
 * interface.
 */
/* Every epoll handle that is created gets one eventpoll allocated by the kernel */
struct eventpoll
{
    /* Protect the access to this structure */
    spinlock_t lock;

    /*
     * This mutex is used to ensure that files are not removed
     * while epoll is using them. This is held during the event
     * collection loop, the file cleanup path, the epoll file exit
     * code and the ctl operations.
     */
    /* The mutex is held when a monitored fd is added, modified, or removed,
     * and when epoll_wait returns and passes data to user space, so user
     * space can safely run epoll operations from several threads at once;
     * the protection is already done at kernel level. */
    struct mutex mtx;

    /* Wait queue used by sys_epoll_wait() */
    /* When epoll_wait() is called, the caller "sleeps" on this wait queue... */
    wait_queue_head_t wq;

    /* Wait queue used by file->poll() */
    /* Used when the epollfd itself is being polled... */
    wait_queue_head_t poll_wait;

    /* List of ready file descriptors */
    /* Every epitem that is already ready is on this list */
    struct list_head rdllist;

    /* RB tree root used to store monitored fd structs */
    /* Every epitem being monitored is stored here */
    struct rb_root rbr;

    /*
     * This is a single linked list that chains all the "struct epitem" that
     * happened while transferring ready events to userspace w/out
     * holding ->lock.
     */
    /* A singly linked list of the epitems whose events fired while ready
     * events were being transferred to user space */
    struct epitem *ovflist;

    /* The user that created the eventpoll descriptor */
    /* Holds some per-user values, such as the maximum number of monitored fds */
    struct user_struct *user;
};

/*
 * Each file descriptor added to the eventpoll interface will
 * have an entry of this type linked to the "rbr" RB tree.
 */
/* An epitem represents one monitored fd */
struct epitem
{
    /* RB tree node used to link this structure to the eventpoll RB tree */
    /* rb_node: when epoll_ctl() adds a batch of fds to an epollfd, the kernel
     * allocates a matching batch of epitems and organizes them as an rb_tree
     * whose root is kept in the epollfd, that is, in struct eventpoll.
     * The reason for using an rb_tree here is, in my view, to speed up
     * lookup, insertion, and deletion; an rb_tree performs all three in
     * O(lgN) time */
    struct rb_node rbn;

    /* List header used to link this structure to the eventpoll ready list */
    /* List node: every epitem that is ready is linked into rdllist of the eventpoll */
    struct list_head rdllink;

    /*
     * Works together "struct eventpoll"->ovflist in keeping the
     * single linked chain of items.
     */
    /* Explained later, in the code... */
    struct epitem *next;

    /* The file descriptor information this item refers to */
    /* The fd and struct file that this epitem corresponds to */
    struct epoll_filefd ffd;

    /* Number of active wait queue attached to poll operations */
    int nwait;

    /* List containing poll wait queues */
    struct list_head pwqlist;

    /* The "container" of this item */
    /* The eventpoll this epitem belongs to */
    struct eventpoll *ep;

    /* List header used to link this item to the "struct file" items list */
    struct list_head fllink;

    /* The structure that describe the interested events and the source fd */
    /* The events this epitem is interested in; passed in from user space by epoll_ctl */
    struct epoll_event event;
};

struct epoll_filefd
{
    struct file *file;
    int fd;
};

/* Wait structure used by the poll hooks */
struct eppoll_entry
{
    /* List header used to link this structure to the "struct epitem" */
    struct list_head llink;

    /* The "base" pointer is set to the container "struct epitem" */
    struct epitem *base;

    /*
     * Wait queue item that will be linked to the target file wait
     * queue head.
     */
    wait_queue_t wait;

    /* The wait queue head that linked the "wait" wait queue item */
    wait_queue_head_t *whead;
};

/* Wrapper struct used by poll queueing */
struct ep_pqueue
{
    poll_table pt;
    struct epitem *epi;
};

/* Used by the ep_send_events() function as callback private data */
struct ep_send_events_data
{
    int maxevents;
    struct epoll_event __user *events;
};

// SYSCALL_DEFINE1 is a macro that defines a system call taking one argument.
// This is the real epoll_create: it checks that size > 0 and, if so, simply calls epoll_create1,
// so the size in int epoll_create(int size); really is not used for anything.
SYSCALL_DEFINE1(epoll_create, int, size)
{
    if (size <= 0)
        return -EINVAL; // invalid argument, #define EINVAL 22 /* Invalid argument */

    return sys_epoll_create1(0);
}

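/* epoll_create1 */
SYSCALL_DEFINE1(epoll_create1, int, flags)
{
    int error;
    struct eventpoll *ep = NULL;    /* the main descriptor structure */

    /* Check the EPOLL_* constant for consistency.  */
    /* A compile-time check only; it does nothing at run time... */
    BUILD_BUG_ON(EPOLL_CLOEXEC != O_CLOEXEC);

    /* For epoll, the only valid flag at present is CLOEXEC */
    if (flags & ~EPOLL_CLOEXEC)
        return -EINVAL;

    /*
     * Create the internal data structure ("struct eventpoll").
     */
    /* Allocate a struct eventpoll; allocation and initialization are discussed later */
    error = ep_alloc(&ep);
    if (error < 0)
        return error;

    /*
     * Creates all the items needed to setup an eventpoll file. That is,
     * a file structure and a free file descriptor.
     */
    /* This creates an anonymous fd. In short:
     * an epollfd has no real file behind it, so the kernel has to create a
     * "virtual" file and give it a real struct file and a real fd.
     * Two of the arguments are the important ones:
     * eventpoll_fops: fops means file operations. When an operation (such as
     * a read) is performed on this (virtual) file, the function pointers in
     * fops point to the code that really implements it, similar to virtual
     * functions and subclasses in C++.
     * epoll implements only the poll and release (that is, close) operations;
     * all other file operations are handled entirely by the VFS.
     * ep: ep is the struct eventpoll, stored as private data in the
     * private_data pointer of struct file.
     * Put plainly, the point is to get from the fd to the struct file, and
     * from the struct file to the eventpoll structure.
     * Anyone who knows a little Linux character-device driver development
     * should find this easy to follow.
     * Recommended reading: <Linux device driver 3rd>
     */
    error = anon_inode_getfd("[eventpoll]", &eventpoll_fops, ep,
        O_RDWR | (flags & O_CLOEXEC));
    if (error < 0)
        ep_free(ep);

    return error;
}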

/*
 * With the epollfd created, the next step is to add fds to it.
 * Here is epoll_ctl:
 * epfd   the epollfd
 * op     ADD, MOD, DEL
 * fd     the descriptor to monitor
 * event  the events we are interested in
 */
SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
    struct epoll_event __user *, event)
{
    int error;
    struct file *file, *tfile;
    struct eventpoll *ep;
    struct epitem *epi;
    struct epoll_event epds;

    error = -EFAULT;
    /*
     * Error handling, and copying the epoll_event structure from user space
     * into kernel space.
     */
    if (ep_op_has_event(op) &&
        copy_from_user(&epds, event, sizeof(struct epoll_event)))
        goto error_return;

    /* Get the "struct file *" for the eventpoll file */
    /* Get the struct file. Since epfd is a real fd, there is a matching
     * struct file in kernel space; it was allocated by anon_inode_getfd()
     * in epoll_create1() */
    error = -EBADF;
    file = fget(epfd);
    if (!file)
        goto error_return;

    /* Get the "struct file *" for the target file */
    /* The fd we want to monitor has a struct file too; do not mix up the two */
    tfile = fget(fd);
    if (!tfile)
        goto error_fput;

    /* The target file descriptor must support poll */
    error = -EPERM;
    /* If the monitored file does not support poll, nothing can be done.
     * Do you know in which cases a file does not support poll?
     */
    if (!tfile->f_op || !tfile->f_op->poll)
        goto error_tgt_fput;

    /*
     * We have to check that the file structure underneath the file descriptor
     * the user passed to us _is_ an eventpoll file. And also we do not permit
     * adding an epoll file descriptor inside itself.
     */
    error = -EINVAL;
    /* epoll cannot monitor itself... */
    if (file == tfile || !is_file_epoll(file))
        goto error_tgt_fput;

    /*
     * At this point it is safe to assume that the "private_data" contains
     * our own data structure.
     */
    /* Fetch our eventpoll structure, allocated in epoll_create1() */
    ep = file->private_data;

    /* The operations below may modify the data structure, so lock it */
    mutex_lock(&ep->mtx);

    /*
     * Try to lookup the file inside our RB tree, Since we grabbed "mtx"
     * above, we can be sure to be able to use the item looked up by
     * ep_find() till we release the mutex.
     */
    /* The kernel allocates one epitem for every monitored fd, and epoll does
     * not allow the same fd to be added twice, so first check whether the fd
     * is already present.
     * ep_find() is just an rb-tree lookup, much like a C++ STL map, with
     * O(lgn) time complexity.
     */
    epi = ep_find(ep, tfile, fd);

    error = -EINVAL;
    switch (op) {
        /* Adding comes first */
    case EPOLL_CTL_ADD:
        if (!epi) {
            /* The find above did not locate a valid epitem, so this is a
             * first insertion: accept it.
             * This also shows that the kernel always monitors POLLERR and
             * POLLHUP
             */
            epds.events |= POLLERR | POLLHUP;
            /* rb-tree insertion; see the analysis of ep_insert().
             * Personally I think that, with the insert here, the earlier
             * find could have been left out... */
            error = ep_insert(ep, &epds, tfile, fd);
        }
        else
            /* Found it: a duplicate add */
            error = -EEXIST;
        break;
        /* Delete and modify are both fairly simple */
    case EPOLL_CTL_DEL:
        if (epi)
            error = ep_remove(ep, epi);
        else
            error = -ENOENT;
        break;
    case EPOLL_CTL_MOD:
        if (epi) {
            epds.events |= POLLERR | POLLHUP;
            error = ep_modify(ep, epi, &epds);
        }
        else
            error = -ENOENT;
        break;
    }
    mutex_unlock(&ep->mtx);

error_tgt_fput:
    fput(tfile);
error_fput:
    fput(file);
error_return:
    return error;
}
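The error paths in the epoll_ctl source above are visible from user space as errno values. The sketch below checks a few of them; ctl_errno is a hypothetical helper introduced for this example, and the expected values follow the branches in the listing (EEXIST for a duplicate add, ENOENT for MOD/DEL on an fd that was never added, EINVAL for adding the epollfd to itself, EBADF for an invalid fd).

```c
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: performs one epoll_ctl operation for EPOLLIN on fd
 * and returns 0 on success or the errno value on failure. */
int ctl_errno(int epfd, int op, int fd)
{
    struct epoll_event ev;

    ev.events = EPOLLIN;
    ev.data.fd = fd;
    if (epoll_ctl(epfd, op, fd, &ev) == 0)
        return 0;
    return errno;
}
```

A typical sequence on a pipe descriptor is: MOD before ADD fails with ENOENT, the first ADD succeeds, the second ADD fails with EEXIST, the first DEL succeeds, and a second DEL fails with ENOENT again.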

/*
 * ep_insert() is called from epoll_ctl() and does the work of adding one
 * monitored fd to the epollfd.
 * tfile is the kernel-side struct file of fd
 */
static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
    struct file *tfile, int fd)
{
    int error, revents, pwake = 0;
    unsigned long flags;
    struct epitem *epi;
    struct ep_pqueue epq;

    /* Check whether the current user has reached the maximum number of watches */
    if (unlikely(atomic_read(&ep->user->epoll_watches) >=
        max_user_watches))
        return -ENOSPC;

    /* Allocate an epitem from the well-known slab allocator */
    if (!(epi = kmem_cache_alloc(epi_cache, GFP_KERNEL)))
        return -ENOMEM;

    /* Item initialization follow here ... */
    /* These lines initialize the relevant members... */
    INIT_LIST_HEAD(&epi->rdllink);
    INIT_LIST_HEAD(&epi->fllink);
    INIT_LIST_HEAD(&epi->pwqlist);
    epi->ep = ep;
    /* Store the fd to monitor together with its file structure */
    ep_set_ffd(&epi->ffd, tfile, fd);
    epi->event = *event;
    epi->nwait = 0;
    /* Note that the initial value of this pointer is not NULL... */
    epi->next = EP_UNACTIVE_PTR;

    /* Initialize the poll table using the queue callback */
    /* Now we finally reach the heart of poll */
    epq.epi = epi;
    /* Initialize a poll_table.
     * This really just sets the callback that runs when poll_wait (note: not
     * epoll_wait!) is called, and which events we care about.
     * ep_ptable_queue_proc() is our callback; initially all events are of
     * interest */
    init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);

    /*
     * Attach the item to the poll hooks and get current event bits.
     * We can safely use the file* here because its usage count has
     * been increased by the caller of this function. Note that after
     * this operation completes, the poll callback can start hitting
     * the new item.
     */
    /* This step is crucial and also hard to follow; it is entirely a
     * consequence of the kernel poll mechanism...
     * First, f_op->poll() is usually only a wrapper that calls the real poll
     * implementation. Taking a UDP socket as the example, the call path is:
     * f_op->poll(), sock_poll(), udp_poll(), datagram_poll(),
     * sock_poll_wait(), and finally the ep_ptable_queue_proc() callback set
     * above (quite a deep call path).
     * Once this step is done, our epitem is tied to the socket, and when the
     * socket changes state we are notified through ep_poll_callback().
     * Finally, this function also checks whether the fd already has some
     * event ready, and if so returns it. */
    revents = tfile->f_op->poll(tfile, &epq.pt);

    /*
     * We have to check if something went wrong during the poll wait queue
     * install process. Namely an allocation for a wait queue failed due
     * high memory pressure.
     */
    error = -ENOMEM;
    if (epi->nwait < 0)
        goto error_unregister;

    /* Add the current item to the list of active epoll hook for this file */
    /* Each file chains together all the epitems that monitor it */
    spin_lock(&tfile->f_lock);
    list_add_tail(&epi->fllink, &tfile->f_ep_links);
    spin_unlock(&tfile->f_lock);

    /*
     * Add the current item to the RB tree. All RB tree operations are
     * protected by "mtx", and ep_insert() is called with "mtx" held.
     */
    /* With everything set up, insert the epitem into its eventpoll */
    ep_rbtree_insert(ep, epi);

    /* We have to drop the new item inside our item list to keep track of it */
    spin_lock_irqsave(&ep->lock, flags);

    /* If the file is already "ready" we drop it inside the ready list */
    /* At this point, if the monitored fd already has an event pending, handle it */
    if ((revents & event->events) && !ep_is_linked(&epi->rdllink)) {
        /* Add the current epitem to the ready list */
        list_add_tail(&epi->rdllink, &ep->rdllist);

        /* Notify waiting tasks that events are available */
        /* Wake whoever is in epoll_wait... */
        if (waitqueue_active(&ep->wq))
            wake_up_locked(&ep->wq);
        /* Also wake whoever is polling this epollfd... */
        if (waitqueue_active(&ep->poll_wait))
            pwake++;
    }

    spin_unlock_irqrestore(&ep->lock, flags);

    atomic_inc(&ep->user->epoll_watches);

    /* We have to call this outside the lock */
    if (pwake)
        ep_poll_safewake(&ep->poll_wait);

    return 0;

error_unregister:
    ep_unregister_pollwait(ep, epi);

    /*
     * We need to do this because an event could have been arrived on some
     * allocated wait queue. Note that we don't care about the ep->ovflist
     * list, since that is used/cleaned only inside a section bound by "mtx".
     * And ep_insert() is called with "mtx" held.
     */
    spin_lock_irqsave(&ep->lock, flags);
    if (ep_is_linked(&epi->rdllink))
        list_del_init(&epi->rdllink);
    spin_unlock_irqrestore(&ep->lock, flags);

    kmem_cache_free(epi_cache, epi);

    return error;
}

/*
 * This is the key callback: it is called when the state of a monitored fd
 * changes.
 * The key argument is used as an unsigned long integer and carries the events.
 */
static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *key)
{
    int pwake = 0;
    unsigned long flags;
    /* Get the epitem from the wait queue entry; this tells us which waiter
     * is attached to the device */
    struct epitem *epi = ep_item_from_wait(wait);
    struct eventpoll *ep = epi->ep;     /* the eventpoll that owns the epitem */

    spin_lock_irqsave(&ep->lock, flags);

    /*
     * If the event mask does not contain any poll(2) event, we consider the
     * descriptor to be disabled. This condition is likely the effect of the
     * EPOLLONESHOT bit that disables the descriptor when an event is received,
     * until the next EPOLL_CTL_MOD will be issued.
     */
    if (!(epi->event.events & ~EP_PRIVATE_BITS))
        goto out_unlock;

    /*
     * Check the events coming with the callback. At this stage, not
     * every device reports the events in the "key" parameter of the
     * callback. We need to be able to handle both cases here, hence the
     * test for "key" != NULL before the event match test.
     */
    /* None of the events we care about... */
    if (key && !((unsigned long)key & epi->event.events))
        goto out_unlock;

    /*
     * If we are transferring events to userspace, we can hold no locks
     * (because we're accessing user memory, and because of linux f_op->poll()
     * semantics). All the events that happens during that period of time are
     * chained in ep->ovflist and requeued later on.
     */
    /*
     * This may look puzzling, but what it does is fairly simple:
     * if this callback runs while epoll_wait() is already handing events
     * back, that is, while the ready events may be in the middle of being
     * copied out for the application, the kernel chains the epitems whose
     * events fire at that moment on a separate list. They are neither
     * delivered to the application nor dropped; they are returned to the
     * user on the next epoll_wait.
     */
    if (unlikely(ep->ovflist != EP_UNACTIVE_PTR)) {
        if (epi->next == EP_UNACTIVE_PTR) {
            epi->next = ep->ovflist;
            ep->ovflist = epi;
        }
        goto out_unlock;
    }

    /* If this file is already in the ready list we exit soon */
    /* Put the current epitem on the ready list */
    if (!ep_is_linked(&epi->rdllink))
        list_add_tail(&epi->rdllink, &ep->rdllist);

    /*
     * Wake up ( if active ) both the eventpoll wait list and the ->poll()
     * wait list.
     */
    /* Wake up epoll_wait... */
    if (waitqueue_active(&ep->wq))
        wake_up_locked(&ep->wq);
    /* If the epollfd itself is being polled, wake every member of that queue. */
    if (waitqueue_active(&ep->poll_wait))
        pwake++;

out_unlock:
    spin_unlock_irqrestore(&ep->lock, flags);

    /* We have to call this outside the lock */
    if (pwake)
        ep_poll_safewake(&ep->poll_wait);

    return 1;
}

/*
 * Implement the event wait interface for the eventpoll file. It is the kernel
 * part of the user space epoll_wait(2).
 */
SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
    int, maxevents, int, timeout)
{
    int error;
    struct file *file;
    struct eventpoll *ep;

    /* The maximum number of event must be greater than zero */
    if (maxevents <= 0 || maxevents > EP_MAX_EVENTS)
        return -EINVAL;

    /* Verify that the area passed by the user is writeable */
    /* A note is in order here:
     * the kernel's policy towards applications is "never trust them", so
     * data exchanged between kernel and application is mostly copied, and
     * direct pointer references are not allowed (and sometimes not possible).
     * epoll_wait() needs the kernel to return data to user space in memory
     * supplied by the user program, so the kernel uses some means to verify
     * that this memory range is valid.
     */
    if (!access_ok(VERIFY_WRITE, events, maxevents * sizeof(struct epoll_event))) {
        error = -EFAULT;
        goto error_return;
    }

    /* Get the "struct file *" for the eventpoll file */
    error = -EBADF;
    /* Get the struct file of the epollfd; an epollfd is a file too */
    file = fget(epfd);
    if (!file)
        goto error_return;

    /*
     * We have to check that the file structure underneath the fd
     * the user passed to us _is_ an eventpoll file.
     */
    error = -EINVAL;
    /* Check that it really is an epollfd... */
    if (!is_file_epoll(file))
        goto error_fput;

    /*
     * At this point it is safe to assume that the "private_data" contains
     * our own data structure.
     */
    /* Get the eventpoll structure */
    ep = file->private_data;

    /* Time to fish for events ... */
    /* OK, go to sleep and wait for events to arrive */
    error = ep_poll(ep, events, maxevents, timeout);

error_fput:
    fput(file);
error_return:
    return error;
}

/* This function is what actually puts the process that called epoll_wait to sleep... */
static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
    int maxevents, long timeout)
{
    int res, eavail;
    unsigned long flags;
    long jtimeout;
    wait_queue_t wait;  /* wait queue entry */

    /*
     * Calculate the timeout by checking for the "infinite" value (-1)
     * and the overflow condition. The passed timeout is in milliseconds,
     * that why (t * HZ) / 1000.
     */
    /* Compute the sleep time; milliseconds have to be converted to ticks (HZ) */
    jtimeout = (timeout < 0 || timeout >= EP_MAX_MSTIMEO) ?
        MAX_SCHEDULE_TIMEOUT : (timeout * HZ + 999) / 1000;

retry:
    spin_lock_irqsave(&ep->lock, flags);

    res = 0;
    /* If the ready list is not empty, skip the sleep and get straight to work... */
    if (list_empty(&ep->rdllist))
    {
        /*
         * We don't have any available event to return to the caller.
         * We need to sleep here, and we will be wake up by
         * ep_poll_callback() when events will become available.
         */
        /* Initialize a wait queue entry and prepare to suspend ourselves.
         * Note that current is a macro that stands for the current process */
        init_waitqueue_entry(&wait, current);       /* wait represents the current process */
        __add_wait_queue_exclusive(&ep->wq, &wait); /* attach it to the wait queue of ep */

        for (;;)
        {
            /*
             * We don't want to sleep if the ep_poll_callback() sends us
             * a wakeup in between. That's why we set the task state
             * to TASK_INTERRUPTIBLE before doing the checks.
             */
            /* Mark the current process as sleeping but wakeable by signals.
             * Note that this setting is "future tense": we are not asleep yet! */
            set_current_state(TASK_INTERRUPTIBLE);
            /* If the ready list has members by now, or the sleep time has
             * already run out, do not sleep at all... */
            if (!list_empty(&ep->rdllist) || !jtimeout)
                break;
            /* If a signal is pending, get up as well... */
            if (signal_pending(current))
            {
                res = -EINTR;
                break;
            }

            /* Nothing to do: unlock and sleep... */
            spin_unlock_irqrestore(&ep->lock, flags);
            /* We are woken after jtimeout has passed.
             * If ep_poll_callback() is called in the meantime, we are woken
             * directly without waiting for the time to elapse.
             * To stress it once more: when ep_poll_callback() is called is
             * decided by the implementation of the monitored fd, such as a
             * socket or a device driver, because they own the wait queue
             * head; epoll and the current process simply wait...
             */
            jtimeout = schedule_timeout(jtimeout);  /* sleep */
            spin_lock_irqsave(&ep->lock, flags);
        }
        __remove_wait_queue(&ep->wq, &wait);

        /* OK, we are awake... */
        set_current_state(TASK_RUNNING);
    }

    /* Is it worth to try to dig for events ? */
    eavail = !list_empty(&ep->rdllist) || ep->ovflist != EP_UNACTIVE_PTR;

    spin_unlock_irqrestore(&ep->lock, flags);

    /*
     * Try to transfer events to user space. In case we get 0 events and
     * there's still timeout left over, we go trying again in search of
     * more luck.
     */
    /* If all is well and events occurred, start copying the data to user space... */
    if (!res && eavail &&
        !(res = ep_send_events(ep, events, maxevents)) && jtimeout)
        goto retry;

    return res;
}
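The timeout conversion at the top of ep_poll rounds milliseconds up to whole ticks, as the kernel comment "(t * HZ) / 1000" indicates. The arithmetic can be tried in isolation with the sketch below; HZ_ASSUMED is a hypothetical tick rate of 250 chosen only for illustration (the real HZ depends on the kernel configuration), and ms_to_ticks_round_up is a made-up name.

```c
#include <assert.h>

/* Assumption for this example only: 250 ticks per second. */
#define HZ_ASSUMED 250L

/* Hypothetical helper: converts milliseconds to ticks, rounding up so that
 * a non-zero timeout never becomes zero ticks. */
long ms_to_ticks_round_up(long timeout_ms)
{
    return (timeout_ms * HZ_ASSUMED + 999) / 1000;
}
```

With 250 ticks per second one tick lasts 4 ms, so a 1 ms timeout becomes (250 + 999) / 1000 = 1 tick and a 10 ms timeout becomes (2500 + 999) / 1000 = 3 ticks. Without the + 999 both would be truncated, to 0 and 2 ticks respectively.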

References:

https://www.linuxidc.com/Linux/2012-05/59873.htm

https://www.nowcoder.com/discuss/26226

https://blog.csdn.net/zmxiangde_88/article/details/8099049

https://blog.csdn.net/baiye_xing/article/details/76352935

https://blog.csdn.net/hdutigerkin/article/details/7517390

https://blog.csdn.net/weiyuefei/article/details/52242880

https://blog.csdn.net/al_xin/article/details/39047047

https://blog.csdn.net/weiyuefei/article/details/52242890