Contents

  • What is EPOLL
  • EPOLL interface
  • EPOLL mechanism
  • Two diagrams

What is EPOLL

Excerpt from the man page:

man: epoll(7), epoll(4)
  epoll is a variant of poll(2) that can be used either as an edge-triggered or a level-triggered interface and scales well to large numbers of watched file descriptors.

EPOLL interface

epoll_create (or epoll_create1)
    epoll_create() opens an epoll file descriptor by requesting the kernel to allocate an event backing store dimensioned for size descriptors.


epoll_ctl
    epoll_ctl() controls an epoll file descriptor, epfd, by requesting that the operation op be performed on the target file descriptor, fd.


epoll_wait
    The epoll_wait() system call waits for events on the epoll file descriptor epfd for a maximum time of timeout milliseconds.
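The three calls are normally used together. The following is a minimal user-space sketch, not taken from the original article: the helper name epoll_pipe_demo is hypothetical, and a pipe is used only as a convenient pollable descriptor. It registers the read end of the pipe, makes it readable, and returns the event mask that epoll_wait() reports.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper (illustration only): returns the event mask reported
 * for the read end of a pipe after one byte is written, or -1 on failure. */
int epoll_pipe_demo(void)
{
    int pipefd[2];
    int epfd;
    int result = -1;
    struct epoll_event ev;
    struct epoll_event ready;

    if (pipe(pipefd) == -1)
        return -1;

    /* Create the epoll instance. */
    epfd = epoll_create1(0);
    if (epfd == -1)
        goto out_pipe;

    /* Register interest in "readable" on the pipe's read end. */
    ev.events = EPOLLIN;
    ev.data.fd = pipefd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) == -1)
        goto out_epoll;

    /* Make the read end readable. */
    if (write(pipefd[1], "x", 1) != 1)
        goto out_epoll;

    /* Wait up to 1000 ms for at most one event. */
    if (epoll_wait(epfd, &ready, 1, 1000) == 1 && ready.data.fd == pipefd[0])
        result = (int)ready.events;

out_epoll:
    close(epfd);
out_pipe:
    close(pipefd[0]);
    close(pipefd[1]);
    return result;
}
```

A caller would typically test the returned mask with `mask & EPOLLIN` before reading from the descriptor.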

 

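EPOLL implementation in the Linux kernel

Key data structures:

struct eventpoll

    Every epoll file has a struct eventpoll, stored in the private_data field of the epoll file. Its main members are as follows:

 

wait_queue_head_t wq:     wait queue used by sys_epoll_wait

struct list_head rdllist: list of ready files

struct rb_root rbr:       root of the red-black tree that stores the monitored fds

struct epitem *ovflist:   singly linked list; while ready events are being transferred to user space, events that occur in the meantime are chained onto this list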

 

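struct epitem

   Every monitored file has a corresponding struct epitem. Its members are as follows:

struct rb_node rbn:         node that links this item into the eventpoll RB tree

struct list_head rdllink:   links this item into the eventpoll ready list, i.e. rdllist

struct epoll_filefd ffd:    information about the monitored file, namely *file and fd

struct list_head pwqlist:   list of poll wait queues

struct eventpoll *ep:       points to the eventpoll that owns this item

struct list_head fllink:    links this item into the f_ep_links list of the monitored (target) file

struct epoll_event event:   describes the events of interest and the fd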

 

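struct eppoll_entry

   struct eppoll_entry is the hook used for the socket's poll. It corresponds one-to-one with the struct epitem of the monitored file (a socket registers a single wait queue). Through this structure, ep_ptable_queue_proc adds the epoll wait queue entry to the wake-up queue of the target file (the monitored socket file).

Red-black tree:

   The red-black tree stores and organizes the struct epitem structures that represent the monitored files.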

 

 

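 Kernel implementation of the Linux EPOLL interface

 Analysis of epoll_create

1) ep_alloc

Creates a new struct eventpoll.

2) get_unused_fd_flags

Obtains an unused file descriptor, i.e. the fd.

3) anon_inode_getfile

Creates a new struct file instance and attaches it to an anonymous inode;

the struct eventpoll is assigned to the struct file->private_data of the epoll file.

4) fd_install

Installs the struct file into the process's file array.

5) struct eventpoll->file = the struct file of the epoll file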
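The anonymous inode behind the epoll file can be observed from user space. The sketch below is not from the original article: it assumes a Linux system with /proc mounted, and the helper name epoll_fd_link_target is hypothetical. It creates an epoll descriptor and reads the target of its /proc/self/fd symlink, which on typical kernels reads anon_inode:[eventpoll].

```c
#include <stdio.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (illustration only): copies the /proc/self/fd link
 * target of a freshly created epoll descriptor into buf.
 * Returns 0 on success, -1 on failure. Assumes /proc is mounted. */
int epoll_fd_link_target(char *buf, size_t buflen)
{
    char path[64];
    ssize_t len;
    int epfd;

    if (buflen == 0)
        return -1;

    epfd = epoll_create1(EPOLL_CLOEXEC);
    if (epfd == -1)
        return -1;

    /* Each open descriptor appears as a symlink under /proc/self/fd. */
    snprintf(path, sizeof(path), "/proc/self/fd/%d", epfd);
    len = readlink(path, buf, buflen - 1);
    close(epfd);

    if (len == -1)
        return -1;
    buf[len] = '\0';   /* readlink() does not NUL-terminate */
    return 0;
}
```

Printing the buffer after a successful call shows which kind of kernel object backs the descriptor, without needing a debugger.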

 

Analysis of epoll_ctl

Relevant handler functions: ep_insert, ep_remove and ep_modify
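From user space, these handlers are reached through the op argument of epoll_ctl: EPOLL_CTL_ADD leads to ep_insert, EPOLL_CTL_MOD to ep_modify, and EPOLL_CTL_DEL to ep_remove. The following sketch is an illustration only (the helper name epoll_ctl_ops_demo is hypothetical). It walks one descriptor through all three operations and also checks two error codes documented in epoll_ctl(2): EEXIST for a duplicate add and ENOENT for modifying a descriptor that is not registered.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper (illustration only). Returns 0 if every step behaved
 * as documented, the number of the first failing step otherwise, or -1 if
 * the setup itself failed. */
int epoll_ctl_ops_demo(void)
{
    int pipefd[2];
    int epfd;
    int failed_step = 0;
    struct epoll_event ev;

    if (pipe(pipefd) == -1)
        return -1;
    epfd = epoll_create1(0);
    if (epfd == -1) {
        close(pipefd[0]);
        close(pipefd[1]);
        return -1;
    }

    ev.events = EPOLLIN;
    ev.data.fd = pipefd[0];

    /* Step 1: EPOLL_CTL_ADD (kernel side: ep_insert). */
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) != 0) {
        failed_step = 1;
        goto done;
    }
    /* Step 2: adding the same fd twice must fail with EEXIST. */
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) != -1 || errno != EEXIST) {
        failed_step = 2;
        goto done;
    }
    /* Step 3: EPOLL_CTL_MOD (kernel side: ep_modify), switch to edge-triggered. */
    ev.events = EPOLLIN | EPOLLET;
    if (epoll_ctl(epfd, EPOLL_CTL_MOD, pipefd[0], &ev) != 0) {
        failed_step = 3;
        goto done;
    }
    /* Step 4: EPOLL_CTL_DEL (kernel side: ep_remove). */
    if (epoll_ctl(epfd, EPOLL_CTL_DEL, pipefd[0], &ev) != 0) {
        failed_step = 4;
        goto done;
    }
    /* Step 5: the fd is no longer registered, so MOD must fail with ENOENT. */
    if (epoll_ctl(epfd, EPOLL_CTL_MOD, pipefd[0], &ev) != -1 || errno != ENOENT)
        failed_step = 5;

done:
    close(epfd);
    close(pipefd[0]);
    close(pipefd[1]);
    return failed_step;
}
```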

1)ep_insert:

Excerpt from ep_insert:

    struct ep_pqueue epq;

    epq.epi = epi;

    /* Initialize the proc and key fields of epq.pt in preparation for the poll call below */
    init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);

    /*
     * Call the poll file operation of the target file. For a socket this is
     * sock_poll (socket_file_ops: .poll = sock_poll), which does:
     *
     *     sock = file->private_data;
     *     return sock->ops->poll(file, sock, wait);
     *
     * i.e. tcp_poll for inet_stream_ops, or udp_poll for inet_dgram_ops.
     * Both functions contain the same statement:
     *
     *     sock_poll_wait(file, sk_sleep(sk), wait);
     *       --> poll_wait(filp, wait_address, p);      adds pwq to the socket's sk_wq
     *         --> p->qproc(filp, wait_address, p);     i.e. calls ep_ptable_queue_proc
     */
    revents = tfile->f_op->poll(tfile, &epq.pt);

    ……

    /* Insert epi into ep's red-black tree */
    ep_rbtree_insert(ep, epi);

    /* If a monitored event has already occurred and epi is not yet on ep->rdllist, add it */
    if ((revents & event->events) && !ep_is_linked(&epi->rdllink)) {
        list_add_tail(&epi->rdllink, &ep->rdllist);

        /* Notify waiting tasks that an event has occurred */
        if (waitqueue_active(&ep->wq))
            wake_up_locked(&ep->wq);
        if (waitqueue_active(&ep->poll_wait))
            pwake++;
    }

    ……

 

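  2) Analysis of ep_ptable_queue_proc:

What the function does:

1) Installs the event callback ep_poll_callback; the enclosing poll call then returns the currently available events

2) Adds the struct eppoll_entry, i.e. the wait queue entry, to the wait queue head returned by sk_sleep(sk)

Processing flow (of ep_insert, which reaches this function through the target file's poll):

1) Add the wait queue entry of the epitem to the socket's sk_wq and return the currently available events

2) epi->fllink is added to tfile->f_ep_links

3) epi (event poll item) is added to the event poll instance (the one behind the epoll fd)

4) If the returned events include a requested poll event and epi->rdllink is not yet linked, add epi to ep->rdllist

5) Wake up ep->wq          ## waiters queued by sys_epoll_wait

6) Wake up ep->poll_wait   ## waiters queued by file->poll on the epoll file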

 struct ep_pqueue {
     poll_table pt;         /* poll table */
     struct epitem *epi;    /* entry describing the monitored file */
 };

Code analysis:

ep_ptable_queue_proc
--> create struct eppoll_entry *pwq
--> initialize pwq->wait.func = ep_poll_callback   ## register the callback run when the socket wait queue is woken
--> pwq->whead = whead     ## whead = sk_sleep(sock->sk)
--> pwq->base = epi        ## event poll item
--> add_wait_queue         ## pwq->wait is added to whead
--> list_add_tail          ## pwq->llink is added to epi->pwqlist

 

static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
                                 poll_table *pt)
{
  struct epitem *epi = ep_item_from_epqueue(pt);
  struct eppoll_entry *pwq;

  if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL))) {
    init_waitqueue_func_entry(&pwq->wait, ep_poll_callback); /* set the wake-up callback */
    pwq->whead = whead;                         /* i.e. the socket's sk->sk_wq->wait */
    pwq->base = epi;                            /* epitem of the file being monitored */
    add_wait_queue(whead, &pwq->wait);          /* add to the socket's wait queue */
    list_add_tail(&pwq->llink, &epi->pwqlist);  /* add to the epitem's list of poll wait queues */
    epi->nwait++;
  } else {
    /* We have to signal that an error occurred */
    epi->nwait = -1;
  }
}

 

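 3) ep_poll_callback

What the function does:

This callback is invoked by the wait queue wake-up mechanism. When a monitored file descriptor has an event to report, the code responsible for that file descriptor wakes its wait queue, which in turn calls this function.

1) Handle the ep->ovflist list

If a new event occurs while ready events are being copied out to user space, the new event is chained onto the ovflist list

2) Otherwise, if epi->rdllink is not yet linked, add it to the ep->rdllist list

3) If tasks are sleeping on ep->wq (inside epoll_wait), wake them up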
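The interplay between the ready list and the overflow list can be sketched in user space. The code below is a simplified model, not kernel code: every name in it (struct item, struct model_ep, model_callback and so on) is hypothetical, the ready list is a small fixed array instead of a list_head, and there is no locking. UNACTIVE plays the role of the kernel's EP_UNACTIVE_PTR sentinel, and a NULL ovflist means "a scan is in progress, park new events here".

```c
#include <stddef.h>

#define MAX_READY 8
#define UNACTIVE ((struct item *)-1L)   /* stand-in for EP_UNACTIVE_PTR */

struct item {
    int id;
    int on_ready;        /* models ep_is_linked(&epi->rdllink) */
    struct item *next;   /* overflow chain link, UNACTIVE when not chained */
};

struct model_ep {
    struct item *ready[MAX_READY];   /* simplified rdllist */
    int nready;
    struct item *ovflist;            /* UNACTIVE unless a scan is running */
    int wakeups;                     /* how often a sleeper would be woken */
};

static void ready_add(struct model_ep *ep, struct item *it)
{
    if (!it->on_ready && ep->nready < MAX_READY) {
        ep->ready[ep->nready++] = it;
        it->on_ready = 1;
    }
}

void model_init(struct model_ep *ep)
{
    ep->nready = 0;
    ep->ovflist = UNACTIVE;
    ep->wakeups = 0;
}

/* Models the three steps of the callback. */
void model_callback(struct model_ep *ep, struct item *it)
{
    if (ep->ovflist != UNACTIVE) {       /* 1) a scan is copying events out */
        if (it->next == UNACTIVE) {      /*    chain each item at most once */
            it->next = ep->ovflist;
            ep->ovflist = it;
        }
        return;
    }
    ready_add(ep, it);                   /* 2) link onto the ready list once */
    ep->wakeups++;                       /* 3) wake whoever sleeps in epoll_wait */
}

/* Models the start of a scan: take the ready list, activate ovflist. */
void model_begin_scan(struct model_ep *ep)
{
    for (int i = 0; i < ep->nready; i++)
        ep->ready[i]->on_ready = 0;
    ep->nready = 0;
    ep->ovflist = NULL;
}

/* Models the end of a scan: move parked items back to the ready list. */
void model_end_scan(struct model_ep *ep)
{
    struct item *it = ep->ovflist;

    ep->ovflist = UNACTIVE;
    if (it == UNACTIVE)
        return;                          /* no scan was in progress */
    while (it != NULL) {
        struct item *next = it->next;
        it->next = UNACTIVE;
        ready_add(ep, it);
        it = next;
    }
}
```

In this model an event arriving between model_begin_scan and model_end_scan never touches the ready array directly, which is the property the overflow list exists to provide.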

 

Analysis of epoll_wait

In the kernel, epoll_wait is handled by ep_poll, which does three main things:

1) Timeout handling

  if (timeout > 0) {
    struct timespec end_time = ep_set_mstimeout(timeout);

    slack = select_estimate_accuracy(&end_time);
    to = &expires;
    *to = timespec_to_ktime(end_time);
  } else if (timeout == 0) {
    /*
     * Avoid the unnecessary trip to the wait queue loop, if the
     * caller specified a non blocking operation.
     */
    timed_out = 1;
    spin_lock_irqsave(&ep->lock, flags);
    goto check_events;
  }

  1) If the timeout is greater than 0, compute the end time as a struct timespec and convert it to ktime_t;

  2) If the timeout equals 0, set timed_out to 1 and jump straight to the event-checking code.
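The two timeout branches are visible from user space. In the small sketch below (an illustration only, with a hypothetical helper name), epoll_wait is called on an epoll instance that monitors nothing: a timeout of 0 returns 0 immediately, and a positive timeout returns 0 once it expires. A negative timeout is deliberately not exercised, because on an empty instance it would block indefinitely.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper (illustration only): calls epoll_wait() on an empty
 * epoll instance and returns its result (0 means "timed out, no events"). */
int epoll_wait_on_empty(int timeout_ms)
{
    struct epoll_event out;
    int epfd = epoll_create1(0);
    int n;

    if (epfd == -1)
        return -1;

    /* timeout_ms == 0: non-blocking check; timeout_ms > 0: bounded sleep. */
    n = epoll_wait(epfd, &out, 1, timeout_ms);
    close(epfd);
    return n;
}
```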

2) Waiting for event notification

If the timeout is not 0 (positive, or negative to wait indefinitely), ep_poll enters the event-fetching flow.

fetch_events:
  spin_lock_irqsave(&ep->lock, flags);  /* take ep->lock */

  /* First check whether events are already available; if so, fall straight through to check_events */
  if (!ep_events_available(ep)) {
    /*
     * We don't have any available event to return to the caller.
     * We need to sleep here, and we will be wake up by
     * ep_poll_callback() when events will become available.
     */
    /* Initialize the wait queue entry and add it to the epoll wait queue ep->wq */
    init_waitqueue_entry(&wait, current);
    __add_wait_queue_exclusive(&ep->wq, &wait);

    for (;;) {
      /*
       * We don't want to sleep if the ep_poll_callback() sends us
       * a wakeup in between. That's why we set the task state
       * to TASK_INTERRUPTIBLE before doing the checks.
       */
      /* Put the current task into the interruptible sleep state */
      set_current_state(TASK_INTERRUPTIBLE);
      /* Leave the loop if events are available or the timeout has expired */
      if (ep_events_available(ep) || timed_out)
        break;

      /* If the current task has a pending signal, bail out with -EINTR */
      if (signal_pending(current)) {
        res = -EINTR;
        break;
      }
      /* Release the ep->lock spinlock and sleep until woken or until the timeout expires */
      spin_unlock_irqrestore(&ep->lock, flags);
      if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS))
        timed_out = 1;

      spin_lock_irqsave(&ep->lock, flags);
    }

    /* Once the sleep has timed out or an event has fired, remove the current task from the ep->wq wait queue and set it back to the RUNNING state */
    __remove_wait_queue(&ep->wq, &wait);

    set_current_state(TASK_RUNNING);
  }

 

3) Handling triggered events

check_events:
  /* Is it worth to try to dig for events ? */
  eavail = ep_events_available(ep);

  spin_unlock_irqrestore(&ep->lock, flags);

  /*
   * Try to transfer events to user space. In case we get 0 events and
   * there's still timeout left over, we go trying again in search of
   * more luck.
   */
  /* If res is 0 and events are available, copy them to user space; if nothing was delivered and the timeout has not expired, go back and wait again */
  if (!res && eavail &&
      !(res = ep_send_events(ep, events, maxevents)) && !timed_out)
    goto fetch_events;

  return res;
19 

 

4) ep_send_events

  Scans the epoll instance's rdllist list and copies the events to user space. The actual work is delegated to ep_scan_ready_list:

  return ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0, false);

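5) ep_scan_ready_list

1) Take over the epoll instance's rdllist list:

  spin_lock_irqsave(&ep->lock, flags);
  /* Move the whole rdllist onto the local txlist */
  list_splice_init(&ep->rdllist, &txlist);
  /*
   * Set ovflist to NULL (empty but active). This tells ep_poll_callback that
   * a task is currently copying events out, so any new event must be chained
   * onto this singly linked list instead.
   */
  ep->ovflist = NULL;
  spin_unlock_irqrestore(&ep->lock, flags);

2) Call the event callback to copy the events to user space

  error = (*sproc)(ep, &txlist, priv);

  i.e. call ep_send_events_proc

3) Handle the ep->ovflist list

  If new events fired while the copy was in progress, they have to be linked back onto the epoll instance's rdllist list.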

  /*
   * During the time we spent inside the "sproc" callback, some
   * other events might have been queued by the poll callback.
   * We re-insert them inside the main ready-list here.
   */
  for (nepi = ep->ovflist; (epi = nepi) != NULL;
       nepi = epi->next, epi->next = EP_UNACTIVE_PTR) {
    /*
     * We need to check if the item is already in the list.
     * During the "sproc" callback execution time, items are
     * queued into ->ovflist but the "txlist" might already
     * contain them, and the list_splice() below takes care of them.
     */
    if (!ep_is_linked(&epi->rdllink))
      list_add_tail(&epi->rdllink, &ep->rdllist);
  }
  /*
   * We need to set back ep->ovflist to EP_UNACTIVE_PTR, so that after
   * releasing the lock, events will be queued in the normal way inside
   * ep->rdllist.
   */
  ep->ovflist = EP_UNACTIVE_PTR;

  /*
   * Quickly re-inject items left on "txlist".
   */
  list_splice(&txlist, &ep->rdllist);

 

4) If the ready list is not empty and tasks are still waiting on the epoll instance, wake them up

  if (!list_empty(&ep->rdllist)) {
    /*
     * Wake up (if active) both the eventpoll wait list and
     * the ->poll() wait list (delayed after we release the lock).
     */
    if (waitqueue_active(&ep->wq))
      wake_up_locked(&ep->wq);
    if (waitqueue_active(&ep->poll_wait))
      pwake++;
  }

6) ep_send_events_proc

  /* Walk the list of triggered events that was taken over */
  for (eventcnt = 0, uevent = esed->events;
       !list_empty(head) && eventcnt < esed->maxevents;) {
    epi = list_first_entry(head, struct epitem, rdllink);

    list_del_init(&epi->rdllink);

    /*
     * Call the poll function of the monitored file, e.g. tcp_poll or udp_poll.
     * Note: the second argument, poll_table *wait, is NULL here. ep_insert
     * has already hooked the listener onto the socket's sk_sleep queue, so
     * nothing needs to be queued again.
     */
    revents = epi->ffd.file->f_op->poll(epi->ffd.file, NULL) &
              epi->event.events;

    /*
     * If the event mask intersect the caller-requested one,
     * deliver the event to userspace. Again, ep_scan_ready_list()
     * is holding "mtx", so no operations coming from userspace
     * can change the item.
     */
    /* If any requested event fired, copy it to user space */
    if (revents) {
      if (__put_user(revents, &uevent->events) ||
          __put_user(epi->event.data, &uevent->data)) {
        list_add(&epi->rdllink, head);
        return eventcnt ? eventcnt : -EFAULT;
      }
      eventcnt++;
      uevent++;
      if (epi->event.events & EPOLLONESHOT)
        epi->event.events &= EP_PRIVATE_BITS;
      else if (!(epi->event.events & EPOLLET)) {
        /*
         * Level-triggered path (EPOLLET is not set): link the epi that
         * fired back onto the epoll instance's rdllist. Edge-triggered
         * items are not re-queued.
         */
        /*
         * If this file has been added with Level
         * Trigger mode, we need to insert back inside
         * the ready list, so that the next call to
         * epoll_wait() will check again the events
         * availability. At this point, no one can insert
         * into ep->rdllist besides us. The epoll_ctl()
         * callers are locked out by
         * ep_scan_ready_list() holding "mtx" and the
         * poll callback will queue them in ep->ovflist.
         */
        list_add_tail(&epi->rdllink, &ep->rdllist);
      }
    }
  }
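The re-queueing branch at the end of this loop is what distinguishes level-triggered from edge-triggered behaviour, and it can be observed from user space. The sketch below is an illustration only (the helper name count_reports is hypothetical): it makes a pipe readable once, then calls epoll_wait twice without draining the pipe. In level-triggered mode both calls report the descriptor, because the item is put back on the ready list; with EPOLLET or EPOLLONESHOT only the first call does.

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper (illustration only): registers the read end of a pipe
 * with EPOLLIN plus extra_flags (0, EPOLLET or EPOLLONESHOT), writes one
 * byte, and polls twice without reading the byte.
 * Returns how many of the two polls reported an event, or -1 on failure. */
int count_reports(uint32_t extra_flags)
{
    int pipefd[2];
    int epfd;
    int first, second;
    int reports = -1;
    struct epoll_event ev;
    struct epoll_event out;

    if (pipe(pipefd) == -1)
        return -1;
    epfd = epoll_create1(0);
    if (epfd == -1)
        goto out_pipe;

    ev.events = EPOLLIN | extra_flags;
    ev.data.fd = pipefd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) == -1)
        goto out_epoll;

    /* One readiness transition: the pipe goes from empty to readable. */
    if (write(pipefd[1], "x", 1) != 1)
        goto out_epoll;

    /* Two non-blocking polls; the data is intentionally left unread. */
    first = epoll_wait(epfd, &out, 1, 0);
    second = epoll_wait(epfd, &out, 1, 0);
    if (first >= 0 && second >= 0)
        reports = first + second;

out_epoll:
    close(epfd);
out_pipe:
    close(pipefd[0]);
    close(pipefd[1]);
    return reports;
}
```

This is why edge-triggered users are usually advised to read until EAGAIN on a non-blocking descriptor: the kernel will not report the same unread data a second time.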

Socket event notification

inet_create
--> sock_init_data
    --> sk->sk_state_change = sock_def_wakeup;
        sk->sk_data_ready   = sock_def_readable;      ## readable, POLLIN: wakes tasks monitoring readable events
        sk->sk_write_space  = sock_def_write_space;   ## writable, POLLOUT: wakes tasks monitoring writable events
        sk->sk_error_report = sock_def_error_report;  ## error, POLLERR: wakes tasks monitoring error events
        sk->sk_destruct     = sock_def_destruct;      ## frees the sock

 

Example:

Analysis of sock_def_readable

{
  struct socket_wq *wq;

  rcu_read_lock();
  wq = rcu_dereference(sk->sk_wq);  /* fetch the socket's wait queue */
  if (wq_has_sleeper(wq))           /* is anyone waiting on sk->sk_wq->wait? */
    /*
     * __wake_up_sync_key --> __wake_up_common (under the wait_queue_head's q->lock),
     * which ends up calling ep_poll_callback
     */
    wake_up_interruptible_sync_poll(&wq->wait, POLLIN | POLLPRI |
                                    POLLRDNORM | POLLRDBAND);
  sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
  rcu_read_unlock();
}

 

Data structure relationship diagram

 

 

 

 

 Q&A:

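Q1: What are the commonly used user-space epoll programming interfaces?
A1: epoll_create, epoll_ctl, epoll_wait


Q2: What is EPOLL?
A2: EPOLL is an I/O event notification mechanism


Q3: Which event triggering modes does EPOLL offer?
A3: Level-triggered and edge-triggered


Q4: What values can the op parameter of epoll_ctl take?
A4: EPOLL_CTL_ADD, EPOLL_CTL_MOD, EPOLL_CTL_DEL


Q5: Which events can the EPOLL interface monitor?
A5: EPOLLIN, EPOLLOUT, EPOLLERR, and others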