The kernel implementation of epoll

1. Kernel implementation basics
Compared with select, epoll is a more targeted implementation. While a task waits in epoll, epoll registers its own wake-up callback on each watched file's poll wait queue. When that callback fires, it adds the item (the fd that became ready) to the ready list, so that when epoll later determines exactly which events occurred, it only has to re-poll the descriptors on the ready list instead of scanning the entire descriptor set the way select does. That is the basic principle.
2. A brief walk through the code
Call path: ep_poll() → ep_events_transfer()
/*
 * Perform the transfer of events to user space.
 */
static int ep_events_transfer(struct eventpoll *ep,
                  struct epoll_event __user *events, int maxevents)
{
    int eventcnt = 0;
    struct list_head txlist;

    INIT_LIST_HEAD(&txlist);

    /*
     * We need to lock this because we could be hit by
     * eventpoll_release_file() and epoll_ctl(EPOLL_CTL_DEL).
     */
    down_read(&ep->sem);

    /* Collect/extract ready items */
    if (ep_collect_ready_items(ep, &txlist, maxevents) > 0) {
        /* Build result set in userspace */
        eventcnt = ep_send_events(ep, &txlist, events);

        /* Reinject ready items into the ready list */
        ep_reinject_items(ep, &txlist);
    }

    up_read(&ep->sem);

    return eventcnt;
}


/*
 * Walk through the transfer list we collected with ep_collect_ready_items()
 * and, if 1) the item is still "alive" 2) its event set is not empty 3) it's
 * not already linked, links it to the ready list. Same as above, we are holding
 * "sem" so items cannot vanish underneath our nose.
 */
static void ep_reinject_items(struct eventpoll *ep, struct list_head *txlist)
{
    int ricnt = 0, pwake = 0;
    unsigned long flags;
    struct epitem *epi;

    write_lock_irqsave(&ep->lock, flags);

    while (!list_empty(txlist)) {
        epi = list_entry(txlist->next, struct epitem, txlink);

        /* Unlink the current item from the transfer list */
        ep_list_del(&epi->txlink);

        /*
         * If the item is no more linked to the interest set, we don't
         * have to push it inside the ready list because the following
         * ep_release_epitem() is going to drop it. Also, if the current
         * item is set to have an Edge Triggered behaviour, we don't have
         * to push it back either.
         */
        if (ep_rb_linked(&epi->rbn) && !(epi->event.events & EPOLLET) &&
            (epi->revents & epi->event.events) && !ep_is_linked(&epi->rdllink)) {
            list_add_tail(&epi->rdllink, &ep->rdllist);
            ricnt++;
        }
    }

    if (ricnt) {
        /*
         * Wake up ( if active ) both the eventpoll wait list and the ->poll()
         * wait list.
         */
        if (waitqueue_active(&ep->wq))
            __wake_up_locked(&ep->wq, TASK_UNINTERRUPTIBLE |
                     TASK_INTERRUPTIBLE);
        if (waitqueue_active(&ep->poll_wait))
            pwake++;
    }

    write_unlock_irqrestore(&ep->lock, flags);

    /* We have to call this outside the lock */
    if (pwake)
        ep_poll_safewake(&psw, &ep->poll_wait);
}
The most puzzling piece here is ep_reinject_items(): after the events have been returned to user space, it puts all of the already-delivered items back onto the ready list. Isn't that rather unconventional? This was also one of the points that confused me for quite a while.
3. Why re-inject
This has to be understood together with the implementation outline at the beginning. select performs a full scan on every call, so if select returns and user space turns a deaf ear to an event, or fails to handle it for some other reason, no harm is done: on the next entry into select the kernel dutifully polls everything again, and if the event was not handled, its state is still there to be found.
For epoll, by contrast, the only chance to act on a state change is at the moment the event occurs. If epoll simply removed an item from the ready list after sending the event to user space, and user space failed to handle it, the next epoll_wait would miss that event; the process would not be woken until some later event occurred and triggered the wake-up check again. This is the price of the optimization over select, and re-injection is how (level-triggered) epoll avoids paying it.
4. Re-entering the wait in the kernel
static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
           int maxevents, long timeout)
{
retry:
    write_lock_irqsave(&ep->lock, flags);

    res = 0;
    if (list_empty(&ep->rdllist)) {
        /*
         * On re-entry this condition no longer holds, so execution falls
         * through to ep_events_transfer() below, which re-polls every item
         * on the ready list. Any item whose event has meanwhile vanished is
         * removed from the ready list, and we jump back to "retry" to wait
         * again; at that point the list-empty condition is satisfied.
         */
……
    }
    /* Is it worth to try to dig for events ? */
    eavail = !list_empty(&ep->rdllist);

    write_unlock_irqrestore(&ep->lock, flags);

    /*
     * Try to transfer events to user space. In case we get 0 events and
     * there's still timeout left over, we go trying again in search of
     * more luck.
     */
    if (!res && eavail &&
        !(res = ep_events_transfer(ep, events, maxevents)) && jtimeout)
        goto retry;
5. File wait queues
Unlike select, whose wait-queue entries are created and registered anew on every call, the wait-queue entries that epoll creates are never removed while in use; they persist until the main file returned by epoll_create is released,


/*
 * This is called from eventpoll_release() to unlink files from the eventpoll
 * interface. We need to have this facility to cleanup correctly files that are
 * closed without being removed from the eventpoll interface.
 */
void eventpoll_release_file(struct file *file)
{
    struct list_head *lsthead = &file->f_ep_links;
    struct eventpoll *ep;
    struct epitem *epi;

    /*
     * We don't want to get "file->f_ep_lock" because it is not
     * necessary. It is not necessary because we're in the "struct file"
     * cleanup path, and this means that noone is using this file anymore.
     * The only hit might come from ep_free() but by holding the semaphore
     * will correctly serialize the operation. We do need to acquire
     * "ep->sem" after "epmutex" because ep_remove() requires it when called
     * from anywhere but ep_free().
     */
    mutex_lock(&epmutex);

    while (!list_empty(lsthead)) {
        epi = list_entry(lsthead->next, struct epitem, fllink);

        ep = epi->ep;
        ep_list_del(&epi->fllink);
        down_write(&ep->sem);
        ep_remove(ep, epi);
        up_write(&ep->sem);
    }

    mutex_unlock(&epmutex);
}
or until the watched item is removed with the EPOLL_CTL_DEL command of epoll_ctl.
6. TODO
Verify the description above with kernel debugging.

posted on 2019-03-07 09:31  tsecer
