第08课:【实战】Redis网络通信模块源码分析(1)
我们这里先研究redis-server端的网络通信模块。除去Redis本身的业务功能以外,Redis的网络通信模块实现思路和细节非常有代表性。由于网络通信模块的设计也是Linux C++后台开发一个很重要的模块,虽然网络上有很多现成的网络库,但是简单易学且可以作为典范的并不多,而redis-server就是这方面值得借鉴学习的材料之一。
8.1侦听socket初始化工作
通过前面课程的介绍,我们知道网络通信在应用层上的大致流程如下:
*服务器端侦听socket;
*将侦听socket绑定到需要的IP地址和端口上(调用Soket API bind函数);
*启动侦听(调用socket API listen函数);
*无限等待客户端连接到来,调用Socket API accept函数接受客户端连接,并称生一个与该客户端对应的客户端socket;
*处理客户端socket上网络数据的收发,必要时关闭该socket。
全局搜索了一下Redis的代码,寻找调用了bind()函数的代码,经过过滤和筛选,我们确定了位于anet.c的anetListen()函数。
static int (char *err, int s, struct sockaddr *sa, socklen_t len, int backlog) { if (bind(s,sa,len) == -1) { anetSetError(err, "bind: %s", strerror(errno)); close(s); return ANET_ERR; } if (listen(s, backlog) == -1) { anetSetError(err, "listen: %s", strerror(errno)); close(s); return ANET_ERR; } return ANET_OK; }
用GDB的b命令在这个函数上加个断点,然后重新运行redis-server:
(gdb) b anetListen Breakpoint 1 at 0x555555588620: file anet.c, line 440. (gdb) info breakpoints Num Type Disp Enb Address What 1 breakpoint keep y 0x0000555555588620 in anetListen at anet.c:440 (gdb) r The program being debugged has been started already. Start it from the beginning? (y or n) y Starting program: /home/wzq/Desktop/redis-5.0.3/src/redis-server [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". 11580:C 14 Jan 2019 11:27:06.118 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo 11580:C 14 Jan 2019 11:27:06.118 # Redis version=5.0.3, bits=64, commit=00000000, modified=0, pid=11580, just started 11580:C 14 Jan 2019 11:27:06.118 # Warning: no config file specified, using the default config. In order to specify a config file use /home/wzq/Desktop/redis-5.0.3/src/redis-server /path/to/redis.conf 11580:M 14 Jan 2019 11:27:06.119 * Increased maximum number of open files to 10032 (it was originally set to 1024). Breakpoint 1, anetListen (err=0x5555559161e0 <server+576> "", s=6, sa=0x555555b2b240, len=28, backlog=511) at anet.c:440 warning: Source file is more recent than executable. 440 static int (char *err, int s, struct sockaddr *sa, socklen_t len, int backlog) { (gdb)
在GDB中断在这个函数时,使用bt命令查看一下此时的调用堆栈:
(gdb) bt #0 anetListen (err=0x5555559161e0 <server+576> "", s=6, sa=0x555555b2b240, len=28, backlog=511) at anet.c:440 #1 0x00005555555887a4 in _anetTcpServer (err=0x5555559161e0 <server+576> "", port=<optimized out>, bindaddr=<optimized out>, af=10, backlog=511) at anet.c:487 #2 0x000055555558cf07 in listenToPort (port=6379, fds=0x55555591610c <server+364>, count=0x55555591614c <server+428>) at server.c:1924 #3 0x0000555555591ed0 in initServer () at server.c:2055 #4 0x0000555555585103 in main (argc=<optimized out>, argv=0x7fffffffccc8) at server.c:4160
通过这个堆栈,结合堆栈#1的6379端口号可以确认这就是我们要找的逻辑,并且这个逻辑在主线程(因为从堆栈上看,最顶层堆栈是main()函数)中进行。
我们看下堆栈#1处的代码:
static int _anetTcpServer(char *err, int port, char *bindaddr, int af, int backlog) { int s = -1, rv; char _port[6]; /* strlen("65535") */ struct addrinfo hints, *servinfo, *p; snprintf(_port,6,"%d",port); memset(&hints,0,sizeof(hints)); hints.ai_family = af; hints.ai_socktype = SOCK_STREAM; hints.ai_flags = AI_PASSIVE; /* No effect if bindaddr != NULL */ if ((rv = getaddrinfo(bindaddr,_port,&hints,&servinfo)) != 0) { anetSetError(err, "%s", gai_strerror(rv)); return ANET_ERR; } for (p = servinfo; p != NULL; p = p->ai_next) { if ((s = socket(p->ai_family,p->ai_socktype,p->ai_protocol)) == -1) continue; if (af == AF_INET6 && anetV6Only(err,s) == ANET_ERR) goto error; if (anetSetReuseAddr(err,s) == ANET_ERR) goto error; if (anetListen(err,s,p->ai_addr,p->ai_addrlen,backlog) == ANET_ERR) s = ANET_ERR; goto end; } if (p == NULL) { anetSetError(err, "unable to bind socket, errno: %d", errno); goto error; } error: if (s != -1) close(s); s = ANET_ERR; end: freeaddrinfo(servinfo); return s; }
将堆栈切换至#1,然后输入info arg查看传入给这个额函数的参数。
(gdb) f 1 #1 0x00005555555887a4 in _anetTcpServer (err=0x5555559161e0 <server+576> "", port=<optimized out>, bindaddr=<optimized out>, af=10, backlog=511) at anet.c:487 487 if (anetListen(err,s,p->ai_addr,p->ai_addrlen,backlog) == ANET_ERR) s = ANET_ERR; (gdb) info args err = 0x5555559161e0 <server+576> "" port = <optimized out> bindaddr = <optimized out> af = 10 backlog = 511 (gdb)
使用系统API getaddrinfo 来解析得到当前主机的IP地址和端口信息。这里没有选择使用gethostbyname这个API是因为gethostbyname仅用于解析ipv4相关的主机信息,而getaddrinfo既可以用于ipv4也可以用于ipv6,这个函数的签名如下:
int getaddrinfo(const char *node,const char *service,const struct addrinfo *hints,struct addrinfo **res);
这个函数的具体用法可以在Linux man手册上查看。通常服务器端在调用getaddrinfo之前,将hints参数的ai_flags设置为AL_PASSIVE,用于bind,主机名nodename通常会设置为NULL,返回通配地址[::]。当然,客户端调用getaddrinfo时,hints参数的ai_flags一般不设置AL_PASSIVE,但是主机名node和服务名service(更愿意称之为端口)则应该不为空。
解析完协议信息后,利用得到的协议信息创建侦听socket,并开启该socket的reuseAddr选项。然后调用anetListen函数,在该函数中先bind后listen。至此,redis-server就可以在6379端口上接受客户端连接了。
8.2接受客户端连接
同样的道理,要研究redis-server如何接受客户端连接,只要搜索socket API accept函数即可。
经定位,我们最终在anet.c文件中找到anetGenericAccept函数:
static int anetGenericAccept(char *err, int s, struct sockaddr *sa, socklen_t *len) { int fd; while(1) { fd = accept(s,sa,len); if (fd == -1) { if (errno == EINTR) continue; else { anetSetError(err, "accept: %s", strerror(errno)); return ANET_ERR; } } break; } return fd; }
我们用b命令在这个函数加个断点,然后重新运行redis-server。一直到程序全部运行起来,GDB都没有触发该断点,这时新打开一个redis-cli,以模拟新客户端连接到redis-server上的行为。断点触发了,此时查看一下调用堆栈。
Thread 1 "redis-server" hit Breakpoint 2, anetGenericAccept (err=err@entry=0x5555559161e0 <server+576> "", s=s@entry=7, sa=sa@entry=0x7fffffffc9f0, len=len@entry=0x7fffffffc9ec) at anet.c:531 531 static int anetGenericAccept(char *err, int s, struct sockaddr *sa, socklen_t *len) { (gdb) bt #0 anetGenericAccept (err=err@entry=0x5555559161e0 <server+576> "", s=s@entry=7, sa=sa@entry=0x7fffffffc9f0, len=len@entry=0x7fffffffc9ec) at anet.c:531 #1 0x00005555555893e2 in anetTcpAccept (err=err@entry=0x5555559161e0 <server+576> "", s=s@entry=7, ip=ip@entry=0x7fffffffcab0 "", ip_len=ip_len@entry=46, port=port@entry=0x7fffffffcaac) at anet.c:552 #2 0x000055555559aad2 in acceptTcpHandler (el=<optimized out>, fd=7, privdata=<optimized out>, mask=<optimized out>) at networking.c:728 #3 0x000055555558806c in aeProcessEvents (eventLoop=eventLoop@entry=0x7ffff6c320a0, flags=flags@entry=11) at ae.c:443 #4 0x000055555558841b in aeMain (eventLoop=0x7ffff6c320a0) at ae.c:501 #5 0x00005555555851d4 in main (argc=<optimized out>, argv=0x7fffffffccc8) at server.c:4197
分析这个调用堆栈,梳理一下这个调用流程。在main函数的initServer函数中创建侦听socket、绑定地址然后开启侦听,接着调用aeMain函数启动一个循环不断地处理“事件”。
void aeMain(aeEventLoop *eventLoop) { eventLoop->stop = 0; while (!eventLoop->stop) { if (eventLoop->beforesleep != NULL) eventLoop->beforesleep(eventLoop); aeProcessEvents(eventLoop, AE_ALL_EVENTS|AE_CALL_AFTER_SLEEP); } }
循环的退出条件是eventLoop->stop 为 1.事件处理的代码如下:
int aeProcessEvents(aeEventLoop *eventLoop, int flags) { int processed = 0, numevents; /* Nothing to do? return ASAP */ if (!(flags & AE_TIME_EVENTS) && !(flags & AE_FILE_EVENTS)) return 0; /* Note that we want call select() even if there are no * file events to process as long as we want to process time * events, in order to sleep until the next time event is ready * to fire. */ if (eventLoop->maxfd != -1 || ((flags & AE_TIME_EVENTS) && !(flags & AE_DONT_WAIT))) { int j; aeTimeEvent *shortest = NULL; struct timeval tv, *tvp; if (flags & AE_TIME_EVENTS && !(flags & AE_DONT_WAIT)) shortest = aeSearchNearestTimer(eventLoop); if (shortest) { long now_sec, now_ms; aeGetTime(&now_sec, &now_ms); tvp = &tv; /* How many milliseconds we need to wait for the next * time event to fire? */ long long ms = (shortest->when_sec - now_sec)*1000 + shortest->when_ms - now_ms; if (ms > 0) { tvp->tv_sec = ms/1000; tvp->tv_usec = (ms % 1000)*1000; } else { tvp->tv_sec = 0; tvp->tv_usec = 0; } } else { /* If we have to check for events but need to return * ASAP because of AE_DONT_WAIT we need to set the timeout * to zero */ if (flags & AE_DONT_WAIT) { tv.tv_sec = tv.tv_usec = 0; tvp = &tv; } else { /* Otherwise we can block */ tvp = NULL; /* wait forever */ } } /* Call the multiplexing API, will return only on timeout or when * some event fires. */ numevents = aeApiPoll(eventLoop, tvp); /* After sleep callback. */ if (eventLoop->aftersleep != NULL && flags & AE_CALL_AFTER_SLEEP) eventLoop->aftersleep(eventLoop); for (j = 0; j < numevents; j++) { aeFileEvent *fe = &eventLoop->events[eventLoop->fired[j].fd]; int mask = eventLoop->fired[j].mask; int fd = eventLoop->fired[j].fd; int fired = 0; /* Number of events fired for current fd. */ /* Normally we execute the readable event first, and the writable * event laster. This is useful as sometimes we may be able * to serve the reply of a query immediately after processing the * query. * * However if AE_BARRIER is set in the mask, our application is * asking us to do the reverse: never fire the writable event * after the readable. In such a case, we invert the calls. * This is useful when, for instance, we want to do things * in the beforeSleep() hook, like fsynching a file to disk, * before replying to a client. */ int invert = fe->mask & AE_BARRIER; /* Note the "fe->mask & mask & ..." code: maybe an already * processed event removed an element that fired and we still * didn't processed, so we check if the event is still valid. * * Fire the readable event if the call sequence is not * inverted. */ if (!invert && fe->mask & mask & AE_READABLE) { fe->rfileProc(eventLoop,fd,fe->clientData,mask); fired++; } /* Fire the writable event. */ if (fe->mask & mask & AE_WRITABLE) { if (!fired || fe->wfileProc != fe->rfileProc) { fe->wfileProc(eventLoop,fd,fe->clientData,mask); fired++; } } /* If we have to invert the call, fire the readable event now * after the writable one. */ if (invert && fe->mask & mask & AE_READABLE) { if (!fired || fe->wfileProc != fe->rfileProc) { fe->rfileProc(eventLoop,fd,fe->clientData,mask); fired++; } } processed++; } } /* Check time events */ if (flags & AE_TIME_EVENTS) processed += processTimeEvents(eventLoop); return processed; /* return the number of processed file/time events */ }
这段代码先通过flag参数检查是否有事件需要处理。如果有定时器事件(AE_TIME_EVENTS标志),则寻找最近要到期的定时器。
/* Search the first timer to fire. * This operation is useful to know how many time the select can be * put in sleep without to delay any event. * If there are no timers NULL is returned. * * Note that's O(N) since time events are unsorted. * Possible optimizations (not needed by Redis so far, but...): * 1) Insert the event in order, so that the nearest is just the head. * Much better but still insertion or deletion of timers is O(N). * 2) Use a skiplist to have this operation as O(1) and insertion as O(log(N)). */ static aeTimeEvent *aeSearchNearestTimer(aeEventLoop *eventLoop) { aeTimeEvent *te = eventLoop->timeEventHead; aeTimeEvent *nearest = NULL; while(te) { if (!nearest || te->when_sec < nearest->when_sec || (te->when_sec == nearest->when_sec && te->when_ms < nearest->when_ms)) nearest = te; te = te->next; } return nearest; }
这段代码有详细的注释,也非常好理解。注释告诉我们,由于这里的定时器集合是无序的,所以需要遍历一下这个链表,算法复杂度是O(n)。同时,注释中也“暗示”了我们将来Redis在这块的优化方向,即把这个链表按到期时间从小到大排序,这样链表的头部就是我们要的最近时间点的定时器对象,算法复杂度是O(l)。或者使用Redis中的skiplist,算法复杂度是O(log(N))。
接着获取当前系统时间(aeGetTime(&now_sec,&now_ms);)将最早到期的定时器事件减去当前系统时间获得一个间隔。这个时间间隔作为numevents = aeApiPoll(eventLoop,tvp);调用的参数,aeApiPoll()在Linux平台上使用epoll技术,Redis在这个IO复用技术上、在不同的操作系统平台上使用不同的系统函数,在Windows系统上使用select,在Mac系统上使用kqueue。这里重点看下Linux平台下的实现:
static int aeApiPoll(aeEventLoop *eventLoop, struct timeval *tvp) { aeApiState *state = eventLoop->apidata; int retval, numevents = 0; retval = epoll_wait(state->epfd,state->events,eventLoop->setsize, tvp ? (tvp->tv_sec*1000 + tvp->tv_usec/1000) : -1); if (retval > 0) { int j; numevents = retval; for (j = 0; j < numevents; j++) { int mask = 0; struct epoll_event *e = state->events+j; if (e->events & EPOLLIN) mask |= AE_READABLE; if (e->events & EPOLLOUT) mask |= AE_WRITABLE; if (e->events & EPOLLERR) mask |= AE_WRITABLE; if (e->events & EPOLLHUP) mask |= AE_WRITABLE; eventLoop->fired[j].fd = e->data.fd; eventLoop->fired[j].mask = mask; } } return numevents; }
epoll_wait这个函数的签名如下:
int epoll_wait(int epfd,struct epoll_event *events,int maxevents,int timeout);
最后一个参数timeout的设置非常有讲究,如果传入进来的tvp是NULL,根据上文的分析,说明定时器事件,则将等待时间设置为-1,这会让epoll_wait无限期地挂起来,直到有事件时才会被唤醒。挂起的好处就是不浪费CPU时间片。反之,将timeout设置成最近的定时器事件间隔,将epoll_wait的等待时间设置为最近的定时器事件来临的时间间隔,可以及时唤醒epoll_wait,这样程序流可以尽快处理这个到期的定时器事件(下文会介绍)。
对于epoll_wait这种系统调用,所有的fd(对于网络通信,也叫socket)信息包括侦听fd和普通客户端fd都记录在事件循环对象aeEventLoop的apidata字段中,当某个fd上有事件触发时,从apidata中找到该fd,并把事件类型(mask字段)一起记录到aeEventLoop的fired字段中去。我们先吧这个流程介绍完,再介绍epoll_wait函数中使用的epfd是在何时何地创建的,侦听fd、客户端fd是如何挂载到epfd上去的。
在得到了有事件的fd以后,接下来就要处理这些事件了。在主循环aeProcessEvents中从aeEventLoop对象的fired数组中取出上一步记录的fd,然后根据事件类型(读事件和写事件)分别进行处理。
for (j = 0; j < numevents; j++) { aeFileEvent *fe = &eventLoop->events[eventLoop->fired[j].fd]; int mask = eventLoop->fired[j].mask; int fd = eventLoop->fired[j].fd; int fired = 0; /* Number of events fired for current fd. */ /* Normally we execute the readable event first, and the writable * event laster. This is useful as sometimes we may be able * to serve the reply of a query immediately after processing the * query. * * However if AE_BARRIER is set in the mask, our application is * asking us to do the reverse: never fire the writable event * after the readable. In such a case, we invert the calls. * This is useful when, for instance, we want to do things * in the beforeSleep() hook, like fsynching a file to disk, * before replying to a client. */ int invert = fe->mask & AE_BARRIER; /* Note the "fe->mask & mask & ..." code: maybe an already * processed event removed an element that fired and we still * didn't processed, so we check if the event is still valid. * * Fire the readable event if the call sequence is not * inverted. */ if (!invert && fe->mask & mask & AE_READABLE) { fe->rfileProc(eventLoop,fd,fe->clientData,mask); fired++; } /* Fire the writable event. */ if (fe->mask & mask & AE_WRITABLE) { if (!fired || fe->wfileProc != fe->rfileProc) { fe->wfileProc(eventLoop,fd,fe->clientData,mask); fired++; } } /* If we have to invert the call, fire the readable event now * after the writable one. */ if (invert && fe->mask & mask & AE_READABLE) { if (!fired || fe->wfileProc != fe->rfileProc) { fe->rfileProc(eventLoop,fd,fe->clientData,mask); fired++; } } processed++; }
该事件字段rfileProc和写事件字段wfileProc都是函数指针,在程序早期设置好,这里直接调用就可以了。
typedef void aeFileProc(struct aeEventLoop *eventLoop, int fd, void *clientData, int mask); /* File event structure */ typedef struct aeFileEvent { int mask; /* one of AE_(READABLE|WRITABLE|BARRIER) */ aeFileProc *rfileProc; aeFileProc *wfileProc; void *clientData; } aeFileEvent;
我们通过搜索关键字epoll_create在ae_poll.c文件中找到EPFD的创建函数aeApiCreate。
static int aeApiCreate(aeEventLoop *eventLoop) { aeApiState *state = zmalloc(sizeof(aeApiState)); if (!state) return -1; state->events = zmalloc(sizeof(struct epoll_event)*eventLoop->setsize); if (!state->events) { zfree(state); return -1; } state->epfd = epoll_create(1024); /* 1024 is just a hint for the kernel */ if (state->epfd == -1) { zfree(state->events); zfree(state); return -1; } eventLoop->apidata = state; return 0; }
使用GDB的b命令在这个函数上加个断点,然后使用run命令重新运行一下redis-server,触发断点,使用bt命令查看此时的调用堆栈。发现EPFD也是在上文介绍的initServer函数中创建的。
(gdb) bt #0 aeCreateEventLoop (setsize=10128) at ae.c:79 #1 0x0000555555591aa0 in initServer () at server.c:2044 #2 0x0000555555585103 in main (argc=<optimized out>, argv=0x7fffffffccc8) at server.c:4160
在aeCreateEventLoop中不仅创建了EPFD,也创建了整个事件循环需要的aeEventLoop对象,并把这个对象记录在Redis的一个全局变量的el字段中。这个全局变量叫server,这是个结构体类型。其定义如下:
struct redisServer server; /* Server global state */ struct redisServer { /* General */ int lua_repl; /* Script replication flags for redis.set_repl(). */ int lua_timedout; /* True if we reached the time limit for script execution. */ int lua_kill; /* Kill the script if true. */ int lua_always_replicate_commands; /* Default replication type. */ /* Lazy free */ int lazyfree_lazy_eviction; int lazyfree_lazy_expire; int lazyfree_lazy_server_del; /* Latency monitor */ long long latency_monitor_threshold; dict *latency_events; /* 省略部分 */ /* Mutexes used to protect atomic variables when atomic builtins are * not available. */ pthread_mutex_t lruclock_mutex; pthread_mutex_t next_client_id_mutex; pthread_mutex_t unixtime_mutex; };