epoll的内核实现
一、内核实现基础
和之前的select相比,epoll是一个目标性更强的实现。在epoll等待的时候,它会把每个poll的唤醒函数注册为自己特有的函数,在该回调函数中,它将自己(被唤醒的fd)添加到readylist中,然后在poll到底是什么事件的时候只检测在readylist中的描述符即可,而不是像select一样遍历所有的描述符集合进行遍历。大致原理即是如此
二、代码中实现简单说明
static int ep_poll-->>ep_events_transfer
/*
* Perform the transfer of events to user space.
*/
static int ep_events_transfer(struct eventpoll *ep,
struct epoll_event __user *events, int maxevents)
{
int eventcnt = 0;
struct list_head txlist;
INIT_LIST_HEAD(&txlist);
/*
* We need to lock this because we could be hit by
* eventpoll_release_file() and epoll_ctl(EPOLL_CTL_DEL).
*/
down_read(&ep->sem);
/* Collect/extract ready items */
if (ep_collect_ready_items(ep, &txlist, maxevents) > 0) {
/* Build result set in userspace */
eventcnt = ep_send_events(ep, &txlist, events);
/* Reinject ready items into the ready list */
ep_reinject_items(ep, &txlist);
}
up_read(&ep->sem);
return eventcnt;
}
/*
* Walk through the transfer list we collected with ep_collect_ready_items()
* and, if 1) the item is still "alive" 2) its event set is not empty 3) it's
* not already linked, links it to the ready list. Same as above, we are holding
* "sem" so items cannot vanish underneath our nose.
*/
static void ep_reinject_items(struct eventpoll *ep, struct list_head *txlist)
{
int ricnt = 0, pwake = 0;
unsigned long flags;
struct epitem *epi;
write_lock_irqsave(&ep->lock, flags);
while (!list_empty(txlist)) {
epi = list_entry(txlist->next, struct epitem, txlink);
/* Unlink the current item from the transfer list */
ep_list_del(&epi->txlink);
/*
* If the item is no more linked to the interest set, we don't
* have to push it inside the ready list because the following
* ep_release_epitem() is going to drop it. Also, if the current
* item is set to have an Edge Triggered behaviour, we don't have
* to push it back either.
*/
if (ep_rb_linked(&epi->rbn) && !(epi->event.events & EPOLLET) &&
(epi->revents & epi->event.events) && !ep_is_linked(&epi->rdllink)) {
list_add_tail(&epi->rdllink, &ep->rdllist);
ricnt++;
}
}
if (ricnt) {
/*
* Wake up ( if active ) both the eventpoll wait list and the ->poll()
* wait list.
*/
if (waitqueue_active(&ep->wq))
__wake_up_locked(&ep->wq, TASK_UNINTERRUPTIBLE |
TASK_INTERRUPTIBLE);
if (waitqueue_active(&ep->poll_wait))
pwake++;
}
write_unlock_irqrestore(&ep->lock, flags);
/* We have to call this outside the lock */
if (pwake)
ep_poll_safewake(&psw, &ep->poll_wait);
}
这里最为莫名其妙的就是这个ep_reinject_items,把事件返回给用户态之后,此时它会把所有的已经发送的事件再次放入readylist,此时是不是不太符合常规呢?而这一点也是之前比较让我费解的一个地方。
三、为什么再次注入
这个要和最开始说的epoll实现来看。对于select来说,它每次执行的时候都会进行一次遍历,这样假设说select返回之后,用户态对这个事件充耳不闻,或者其它异常情况没有处理这个事件,那么没关系,再次进入select的时候内核不辞劳苦的又poll了一遍,如果没有处理,状态依然存在。
而对于epoll来说,它的此次状态变化事件执行机会在于事件发生的时候,如果epoll只是简单的将事件发送给用户态之后就从ready队列中删除该项,如果用户态没有处理该event事件,那么再次执行epoll_wait的时候将会丢失这次事件,直到有下一个事件发生并再次执行唤醒检查时才会唤醒用户态进程。这也是和select优化的代价。
四、再次进入内核wait
static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
int maxevents, long timeout)
retry:
write_lock_irqsave(&ep->lock, flags);
res = 0;
if (list_empty(&ep->rdllist)) { 再次进入,次条件不满足,所以执行后面的ep_event_transfer,其中再次poll所有ready事件,如果事件已经消除,则会被从readylist中删除,并再次回到retry位置进行等待,此时满足empty条件。
……
}
/* Is it worth to try to dig for events ? */
eavail = !list_empty(&ep->rdllist);
write_unlock_irqrestore(&ep->lock, flags);
/*
* Try to transfer events to user space. In case we get 0 events and
* there's still timeout left over, we go trying again in search of
* more luck.
*/
if (!res && eavail &&
!(res = ep_events_transfer(ep, events, maxevents)) && jtimeout)
goto retry;
五、文件等待队列
和select的等待队列每次进入时创建并添加不同,epoll的等待事件创建之后一直不会被删除,直到epoll_create返回的主文件被删除
/*
* This is called from eventpoll_release() to unlink files from the eventpoll
* interface. We need to have this facility to cleanup correctly files that are
* closed without being removed from the eventpoll interface.
*/
void eventpoll_release_file(struct file *file)
{
struct list_head *lsthead = &file->f_ep_links;
struct eventpoll *ep;
struct epitem *epi;
/*
* We don't want to get "file->f_ep_lock" because it is not
* necessary. It is not necessary because we're in the "struct file"
* cleanup path, and this means that noone is using this file anymore.
* The only hit might come from ep_free() but by holding the semaphore
* will correctly serialize the operation. We do need to acquire
* "ep->sem" after "epmutex" because ep_remove() requires it when called
* from anywhere but ep_free().
*/
mutex_lock(&epmutex);
while (!list_empty(lsthead)) {
epi = list_entry(lsthead->next, struct epitem, fllink);
ep = epi->ep;
ep_list_del(&epi->fllink);
down_write(&ep->sem);
ep_remove(ep, epi);
up_write(&ep->sem);
}
mutex_unlock(&epmutex);
}
或者epoll_ctl中EPOLL_CTL_DEL命令删除该关注项。
六、TODO
内核调试态验证以上描述。
【推荐】国内首个AI IDE,深度理解中文开发场景,立即下载体验Trae
【推荐】编程新体验,更懂你的AI,立即体验豆包MarsCode编程助手
【推荐】抖音旗下AI助手豆包,你的智能百科全书,全免费不限次数
【推荐】轻量又高性能的 SSH 工具 IShell:AI 加持,快人一步
· 10年+ .NET Coder 心语,封装的思维:从隐藏、稳定开始理解其本质意义
· .NET Core 中如何实现缓存的预热?
· 从 HTTP 原因短语缺失研究 HTTP/2 和 HTTP/3 的设计差异
· AI与.NET技术实操系列:向量存储与相似性搜索在 .NET 中的实现
· 基于Microsoft.Extensions.AI核心库实现RAG应用
· 10年+ .NET Coder 心语 ── 封装的思维:从隐藏、稳定开始理解其本质意义
· 地球OL攻略 —— 某应届生求职总结
· 提示词工程——AI应用必不可少的技术
· Open-Sora 2.0 重磅开源!
· 周边上新:园子的第一款马克杯温暖上架