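Traffic Server Process Model
1. Overview
Traffic Server consists of three cooperating processes that serve requests and manage, control, and monitor the health of the system. Figure 1 shows how the three processes relate; each is described below.
Figure 1: Relationships between the processes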
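1) The traffic_server process is Traffic Server's transaction processing engine. It accepts connections, processes protocol requests, and serves content from the local cache or from the origin server.
2) The traffic_manager process is the command-and-control facility of Traffic Server. It launches, monitors, and reconfigures the traffic_server process. It is also responsible for the proxy auto-configuration port, the statistics interface, cluster administration, and virtual IP (VIP) failover.
If traffic_manager detects that traffic_server has failed, it restarts the process immediately and also maintains a connection queue for all incoming requests. Connections that arrive in the few seconds before traffic_server is back up are held in this queue and processed in FIFO order. The queue accepts whatever connections arrive while a failed server is being restarted.
3) The traffic_cop process monitors the health of both traffic_server and traffic_manager. It queries the two processes periodically (several times a minute) by issuing heartbeat requests that fetch synthetic web pages. If a failure occurs (no response arrives within the timeout interval, or an incorrect response is received), traffic_cop restarts traffic_server and traffic_manager. The benefit of this design is that traffic_server, the worker process that must be kept running, is protected twice: by traffic_manager and by traffic_cop.
4) Traffic Server uses a multi-threaded, asynchronous event-processing model. Instead of creating one thread per connection, it creates a configurable number of worker threads up front, each running its own asynchronous event handler. traffic_server creates several groups of Threads and dispatches each Event, by type, to the event queue of an appropriate Thread. The Thread runs the callback of the Continuation attached to the Event, and that callback moves a state machine forward. The transition from the initial state to the final state constitutes the complete execution of an event. Threads never exit; they wait for the next event to arrive.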
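The event model in item 4 can be illustrated with a very small single-threaded sketch. Everything here (Event, Continuation, EventQueue, make_request_continuation) is a hypothetical name chosen for illustration and is not Traffic Server's real API; a real worker thread would block and wait for new events instead of returning when its queue is empty.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical event types for one request's life cycle.
enum class Event { Start, Read, Finish };

class EventQueue;

// A continuation is a callback plus the state it advances.
struct Continuation {
    std::function<void(EventQueue&, Continuation&, Event)> handler;
    int state = 0;  // 0 = initial state, 3 = final state in this sketch
};

// One queue per worker thread; events are handled in arrival order.
class EventQueue {
public:
    void schedule(Continuation* c, Event e) { pending_.emplace_back(c, e); }

    // Runs callbacks until the queue is empty and returns how many ran.
    int run() {
        int handled = 0;
        while (!pending_.empty()) {
            std::pair<Continuation*, Event> item = pending_.front();
            pending_.pop_front();
            item.first->handler(*this, *item.first, item.second);
            ++handled;
        }
        return handled;
    }

private:
    std::deque<std::pair<Continuation*, Event>> pending_;
};

// Each callback performs one state transition and schedules the next event.
inline Continuation make_request_continuation() {
    Continuation c;
    c.handler = [](EventQueue& q, Continuation& self, Event e) {
        switch (e) {
        case Event::Start:  self.state = 1; q.schedule(&self, Event::Read);   break;
        case Event::Read:   self.state = 2; q.schedule(&self, Event::Finish); break;
        case Event::Finish: self.state = 3; break;
        }
    };
    return c;
}
```

Scheduling Event::Start on such a continuation and calling run() walks it from the initial state to the final state in three callbacks, without ever dedicating a thread to the request.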
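This article focuses on the relationship between the three Traffic Server processes and on how that relationship is implemented; the multi-threaded asynchronous event model is not analyzed in depth. The process model is shown in the following diagram: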
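2. Implementation principles
Basic principle: traffic_manager and traffic_server are each associated with a lockfile, manager_lockfile and server_lockfile respectively. traffic_cop monitors traffic_manager and traffic_server through these two lockfiles, and traffic_manager likewise monitors traffic_server through server_lockfile. Figure 2 illustrates this relationship:
Figure 2: Relationship between the processes and the lockfiles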
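Key implementation:
1. Key class: Lockfile
Lockfile::Open(pid_t * holding_pid) in detail:
Notes: Lockfile::Open(pid_t * holding_pid) has three kinds of return value, listed below. One of them involves the close-on-exec flag: when a process later calls one of the exec() family of functions, every file descriptor that carries this flag is closed automatically as part of the exec, so the new program does not inherit it.
(1) A return value of 1 means the lockfile could be opened and locked, which in turn means that the process associated with the lockfile is not running. If the associated process were running, it would hold the lock and the lock could not be taken.
(2) A return value of 0 means the lockfile is held by some process. The ID of the holding process is returned through holding_pid. That ID was written into the lockfile by Get() when the holding process started.
(3) A negative return value (-errno) covers three cases: opening fname failed, reading the descriptor's close-on-exec flag failed, or setting the close-on-exec flag failed. In the code listed below, a failure while reading the PID out of a held lockfile also returns -errno.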
The important functions involved in killing processes are summarized below:
// kill
// Sends signal sig to the process identified by pid
// return: 0 -- okay, -1 -- error
1. int kill(pid_t pid, int sig);
// ink_killall
// Sends signal sig to every process whose program name is pname
// return: 0 -- okay, -1 -- error
2. int ink_killall(const char *pname, int sig);
ink_killall calls:
3. ink_killall_get_pidv_xmalloc(pname, &pidv, &pidvcnt);
4. ink_killall_kill_pidv(pidv, pidvcnt, sig);
// ink_killall_get_pidv_xmalloc
// Looks up the running processes named pname, storing their process IDs in the
// pidv array and their count in pidvcnt
// return: -1 error (pidv: set to NULL; pidvcnt: set to 0);
//          0 okay (pidv: ats_malloc'd pid vector; pidvcnt: number of pid's;
//          if pidvcnt is set to 0, then pidv will be set to NULL)
3. int ink_killall_get_pidv_xmalloc(const char *pname, pid_t ** pidv, int *pidvcnt);
// ink_killall_kill_pidv
// Calls kill(pidv[i], sig) for each process ID recorded in pidv
// return: 0 -- okay, -1 -- error
4. int ink_killall_kill_pidv(pid_t * pidv, int pidvcnt, int sig);
ink_killall_kill_pidv calls:
1. kill(pid_t pid, int sig);
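ink_killall_get_pidv_xmalloc finds processes by name by reading /proc/<pid>/stat for every numeric entry under /proc and comparing the command name that appears in parentheses. The parsing step can be sketched on its own as below; comm_from_stat_line is a hypothetical helper, and like the listing at the end of this article it stops at the first ')' it finds.

```cpp
#include <string>

// Extracts the command name from one line of /proc/<pid>/stat,
// e.g. "5443 (traffic_server) S 5436 ..." yields "traffic_server".
// Returns an empty string when no parenthesised name is found.
inline std::string comm_from_stat_line(const std::string& line) {
    std::string::size_type open = line.find('(');
    if (open == std::string::npos)
        return "";
    std::string::size_type close = line.find(')', open + 1);
    if (close == std::string::npos)
        return "";
    return line.substr(open + 1, close - open - 1);
}
```

A caller would compare the returned name with pname and append the PID to the result vector on a match, skipping its own PID as the original function does.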
// safe_kill
// Safely kills every process whose program name is pname. lockfile_name is the
// lockfile associated with the process; group says whether the child processes
// created by pname must be killed too, since they belong to the same process group.
// return: void
5. static void safe_kill(const char *lockfile_name, const char *pname, bool group);
safe_kill calls (KillGroup if group is true, otherwise Kill):
6. Lockfile::Kill(killsig, coresig, pname);
7. Lockfile::KillGroup(killsig, coresig, pname);
// Lockfile::Kill
// Checks the lockfile and, if it is held, kills the holder and every process
// named pname. sig is normally the kill signal; initial_sig defaults to 0 and,
// when positive, is sent to the init_pid process first.
// return: void
6. void Lockfile::Kill(int sig, int initial_sig, const char *pname);
Lockfile::Kill calls:
8. lockfile_kill_internal(pid, initial_sig, pid, pname, sig);
// Lockfile::KillGroup
// Checks the lockfile and, if it is held, kills the process named pname together
// with the child processes it created (including, of course, their threads).
// sig is the kill signal; initial_sig is the same as for Kill above.
// return: void
7. void Lockfile::KillGroup(int sig, int initial_sig, const char *pname);
Lockfile::KillGroup calls:
8. lockfile_kill_internal(holding_pid, initial_sig, pid, pname, sig);
// lockfile_kill_internal
// Sends init_sig to init_pid first (when init_sig > 0), then keeps sending sig to
// pid and to every process named pname until pid can no longer be signalled
// return: void
8. static void lockfile_kill_internal(pid_t init_pid, int init_sig, pid_t pid, const char *pname, int sig);
lockfile_kill_internal calls:
1. kill(init_pid, init_sig); kill(pid, sig);
3. ink_killall_get_pidv_xmalloc(pname, &pidv, &pidvcnt);
4. ink_killall_kill_pidv(pidv, pidvcnt, sig);
For the full implementation details, see the source code.
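The one non-obvious step in Lockfile::KillGroup is how it picks the target that is passed to kill(). A negative pid argument makes kill() signal a whole process group, so KillGroup negates the group ID unless the lookup failed or the group ID equals the caller's own PID. The sketch below isolates that choice as a pure function; choose_kill_target is a hypothetical name, and pgid stands for the result of getpgid(holding_pid).

```cpp
#include <sys/types.h>

// holding_pid: PID read from the lockfile
// pgid:        result of getpgid(holding_pid), negative on failure
// self_pid:    PID of the caller (getpid())
// Returns the value to pass as the first argument of kill().
inline pid_t choose_kill_target(pid_t holding_pid, pid_t pgid, pid_t self_pid) {
    if (pgid < 0 || pgid == self_pid)
        return holding_pid;  // fall back to signalling the single process
    return -pgid;            // negative value: signal the whole process group
}
```

The second condition protects the caller: if the group ID equals the caller's PID, signalling the group would kill the caller as well.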
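2. Simulating traffic_cop's monitoring of traffic_manager and traffic_server
After it starts, traffic_cop enters main(), which calls check(). check() periodically calls check_programs() to monitor traffic_manager and traffic_server. check_programs() is somewhat involved; its flow is shown in the flowchart below.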
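The core decision inside check_programs() can be summarized as a small pure function. This is a simplified sketch based on the simulated traffic_cop.cpp listed at the end of this article: it leaves out the heartbeat results and the stale-PID check done with kill(pid, 0), and the names LockState, CopAction, CopState, and decide are my own.

```cpp
// Free: Lockfile::Open() returned 1 (nobody holds the lock, so the process is not running).
// Held: Lockfile::Open() returned 0 (a live process holds the lock).
enum class LockState { Free, Held };

enum class CopAction {
    RestartManager,    // kill any stray server, then spawn a new manager
    WaitForServer,     // server missing once: give the manager a chance to respawn it
    KillManagerGroup,  // server missing twice: kill the manager's process group
    HeartbeatServer    // both processes are running: run the server heartbeat
};

struct CopState {
    int server_not_found = 0;  // consecutive-miss counter kept between checks
};

inline CopAction decide(LockState manager, LockState server, CopState& st) {
    if (manager == LockState::Free)
        return CopAction::RestartManager;
    if (server == LockState::Free) {
        if (++st.server_not_found > 1) {
            st.server_not_found = 0;
            return CopAction::KillManagerGroup;
        }
        return CopAction::WaitForServer;
    }
    return CopAction::HeartbeatServer;
}
```

After KillManagerGroup, the next periodic check finds the manager's lock free and takes the RestartManager path, so recovery completes one cycle later.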
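3. Simulation test
Following the principles above, I wrote simulations of the three processes traffic_cop, traffic_manager, and traffic_server, with traffic_cop implemented as a daemon. traffic_manager's monitoring of traffic_server works much like traffic_cop's monitoring of traffic_manager and traffic_server, so it is not described separately. The functions that test the health of traffic_manager and traffic_server, namely heartbeat_manager(), server_up(), and heartbeat_server(), involve port communication. Because that part does not affect the simulation of the principle, their bodies are omitted and they simply return a normal value. (The programs need the manager_lockfile and server_lockfile files at run time; readers should create these two files in the directory that contains the executables.)
After starting the program, running the command ps -axj | grep binary produces the output shown in the figure below:
The first four columns are: parent process ID / process ID / process group ID / session ID.
The figure shows the normal relationship between the processes.
When traffic_manager exits abnormally, traffic_cop restarts it. The log file shows this (excerpt below):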
==============traffic_server is running, pid:'5443'!
----------------traffic_manager is running, pid:'5436'!
==============traffic_server is running, pid:'5443'!
---------------traffic_manager has a expcetion and eixt!
Entering check_programs()
traffic_manager not running, making sure traffic_server is dead
Entering safe_kill
Leaving safe_kill
Entering spwan_manager()!
Leaving spwan_manager()!
Leaving check_programs
----------------traffic_manager is running, pid:'5463'!
Entering spwan_server()!
Leaving spwan_server()!
==============traffic_server is running, pid:'5467'!
The log shows that at one moment the traffic_manager process ID is 5436 and the traffic_server process ID is 5443. Next, traffic_manager fails ("---------------traffic_manager has a expcetion and eixt!"). In its periodic check_programs(), traffic_cop finds "traffic_manager not running", kills the traffic_server process ("making sure traffic_server is dead"), and then creates a new traffic_manager process ("Entering spwan_manager()!"), whose ID is now 5463. Once the new traffic_manager is running, it finds that no traffic_server process is running and calls spawn_server() to create one, whose ID is 5467. This shows that traffic_cop's monitoring works.
When traffic_server exits abnormally, traffic_manager detects it and restarts traffic_server. The log file shows this as well (excerpt below):
==============traffic_server is running, pid:'7703'!
----------------traffic_manager is running, pid:'7699'!
=================traffic_server has a expcetion and exit!
Entering safe_kill
Leaving safe_kill
--------------Entering spwan_server()!
--------------Leaving spwan_server()!
----------------traffic_manager is running, pid:'7699'!
==============traffic_server is running, pid:'7712'!
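The log shows that at one moment the traffic_manager process ID is 7699 and the traffic_server process ID is 7703. traffic_server then exits abnormally, and traffic_manager calls spawn_server() to start a new traffic_server process with ID 7712. The traffic_manager process ID is still 7699, so traffic_manager itself has not changed. This shows that traffic_manager does monitor the traffic_server process.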
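4. Summary
Why does the design use three processes rather than two, with traffic_manager supervising traffic_server directly? The role traffic_manager plays in the system means that two processes alone cannot meet the requirements. In particular, when traffic_manager detects that traffic_server has failed, it temporarily queues incoming requests, so it too has to listen on the port for a while. The system therefore cannot guarantee that traffic_manager itself never fails. For this reason the design adds the traffic_cop daemon, whose role is purely to monitor the other two processes. In theory this daemon never terminates abnormally, which makes the three-layer design safer and more reliable than a two-layer one.
When the three processes work together, server failures are transparent to clients: a client does not notice that its request ran into a problem, although it may see somewhat higher latency. That is the design intent, but it is not absolute. In the unlikely event that traffic_manager and traffic_server both terminate abnormally at the same time, client requests cannot be accepted during the few seconds traffic_cop needs to restart them. The architecture shows that a server must be as stable and safe as possible, and that abnormal conditions need to be considered thoroughly.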
Source code:
1. lock_and_kill.h
#ifndef LOCK_AND_KILL_H
#define LOCK_AND_KILL_H
#include <sys/types.h>
#include <string.h>
#define PATH_NAME_MAX 4096

/*-------------------------------------------------------------------------
  ink_killall
  - Sends signal 'sig' to all processes with the name 'pname'
  - Returns: -1 error
              0 okay
  -------------------------------------------------------------------------*/
int ink_killall(const char *pname, int sig);

/*-------------------------------------------------------------------------
  ink_killall_get_pidv_xmalloc
  - Get all pid's named 'pname' and stores into ats_malloc'd
    pid_t array, 'pidv'
  - Returns: -1 error (pidv: set to NULL; pidvcnt: set to 0)
              0 okay (pidv: ats_malloc'd pid vector; pidvcnt: number of pid's;
                      if pidvcnt is set to 0, then pidv will be set to NULL)
  -------------------------------------------------------------------------*/
int ink_killall_get_pidv_xmalloc(const char *pname, pid_t ** pidv, int *pidvcnt);

/*-------------------------------------------------------------------------
  ink_killall_kill_pidv
  - Kills all pid's in 'pidv' with signal 'sig'
  - Returns: -1 error
              0 okay
  -------------------------------------------------------------------------*/
int ink_killall_kill_pidv(pid_t * pidv, int pidvcnt, int sig);

class Lockfile
{
public:
  // fd starts at -1 ("not open") so that Close() never closes descriptor 0 by accident
  Lockfile(void):fd(-1)
  {
    fname[0] = '\0';
  }

  // coverity[uninit_member]
  Lockfile(const char *filename):fd(-1)
  {
    SetLockfileName(filename);
  }

  ~Lockfile(void)
  {
  }

  void SetLockfileName(const char *filename)
  {
    // bounded copy so an over-long name cannot overflow fname
    strncpy(fname, filename, sizeof(fname) - 1);
    fname[sizeof(fname) - 1] = '\0';
  }

  const char *GetLockfileName(void)
  {
    return fname;
  }

  // Open()  (a very important function)
  //
  // Tries to open a lock file, returning:
  //   -errno on error
  //   0 if someone is holding the lock (with holding_pid set)
  //   1 if we now have a writable lock file
  int Open(pid_t * holding_pid);

  // Get()
  //
  // Gets write access to a lock file, and if successful, truncates
  // file, and writes the current process ID. Returns:
  //   -errno on error
  //   0 if someone is holding the lock (with holding_pid set)
  //   1 if we now have a writable lock file
  int Get(pid_t * holding_pid);

  // Close()
  //
  // Closes the file handle on the opened Lockfile.
  void Close(void);

  // Kill()
  // KillGroup()
  //
  // Ensures no one is holding the lock. It tries to open the lock file
  // and if that does not succeed, it kills the process holding the lock.
  // If the lock file open succeeds, it closes the lock file releasing
  // the lock.
  //
  // The initial signal can be used to generate a core from the process while
  // still ensuring it dies.
  void Kill(int sig, int initial_sig = 0, const char *pname = NULL);
  void KillGroup(int sig, int initial_sig = 0, const char *pname = NULL);

private:
  char fname[PATH_NAME_MAX];
  int fd;
};

#endif
2. lock_and_kill.cpp
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <dirent.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/file.h>
#include <errno.h>
#include <signal.h>

#include "lock_and_kill.h"

#define PROC_BASE "/proc"
#define INITIAL_PIDVSIZE 32
#define LOCKFILE_BUF_LEN 16
#ifndef LINE_MAX
#define LINE_MAX 1024 // <limits.h> may already define LINE_MAX
#endif

int
ink_killall(const char *pname, int sig)
{
  int err;
  pid_t *pidv;
  int pidvcnt;

  if (ink_killall_get_pidv_xmalloc(pname, &pidv, &pidvcnt) < 0) {
    return -1;
  }

  if (pidvcnt == 0) {
    free(pidv);
    return 0;
  }

  err = ink_killall_kill_pidv(pidv, pidvcnt, sig);
  free(pidv);
  return err;
}

int
ink_killall_get_pidv_xmalloc(const char *pname, pid_t ** pidv, int *pidvcnt)
{
  DIR *dir;
  FILE *fp;
  struct dirent *de;
  pid_t pid, self;
  char buf[LINE_MAX], *p, *comm;
  int pidvsize = INITIAL_PIDVSIZE;

  if (!pname || !pidv || !pidvcnt)
    goto l_error;

  self = getpid();
  if (!(dir = opendir(PROC_BASE)))
    goto l_error;

  *pidvcnt = 0;
  *pidv = (pid_t *)malloc(pidvsize * sizeof(pid_t));
  if (!*pidv) {
    closedir(dir);
    goto l_error;
  }

  // Every numeric directory under /proc is a process; skip ourselves.
  while ((de = readdir(dir))) {
    if (!(pid = (pid_t) atoi(de->d_name)) || pid == self)
      continue;
    snprintf(buf, sizeof(buf), PROC_BASE "/%d/stat", pid);
    if ((fp = fopen(buf, "r"))) {
      if (fgets(buf, sizeof buf, fp) == 0)
        goto l_close;
      // The command name is the text between '(' and ')'.
      if ((p = strchr(buf, '('))) {
        comm = p + 1;
        if ((p = strchr(comm, ')')))
          *p = '\0';
        else
          goto l_close;
        if (strcmp(comm, pname) == 0) {
          if (*pidvcnt >= pidvsize) {
            pid_t *pidv_realloc;
            pidvsize *= 2;
            if (!(pidv_realloc = (pid_t *)realloc(*pidv, pidvsize * sizeof(pid_t)))) {
              free(*pidv);
              fclose(fp);
              closedir(dir);
              goto l_error;
            } else {
              *pidv = pidv_realloc;
            }
          }
          (*pidv)[*pidvcnt] = pid;
          (*pidvcnt)++;
        }
      }
    l_close:
      fclose(fp);
    }
  }
  closedir(dir);

  if (*pidvcnt == 0) {
    free(*pidv);
    *pidv = 0;
  }
  return 0;
l_error:
  // Guard the output pointers: they may be the reason we got here.
  if (pidv)
    *pidv = NULL;
  if (pidvcnt)
    *pidvcnt = 0;
  return -1;
}

int
ink_killall_kill_pidv(pid_t * pidv, int pidvcnt, int sig)
{
  int err = 0;
  if (!pidv || (pidvcnt <= 0))
    return -1;
  while (pidvcnt > 0) {
    pidvcnt--;
    if (kill(pidv[pidvcnt], sig) < 0)
      err = -1;
  }
  return err;
}

//////////////// Lockfile member functions are implemented below ////////////////
int
Lockfile::Open(pid_t * holding_pid)
{
  char buf[LOCKFILE_BUF_LEN];
  pid_t val;
  int err;
  *holding_pid = 0;

#define FAIL(x) \
{ \
  if (fd > 0) \
    close (fd); \
  return (x); \
}

  struct flock lock;
  char *t;
  int size; // not meaningful until the read loop below sets it

  // Try and open the Lockfile. Create it if it does not already
  // exist.
  do {
    fd = open(fname, O_RDWR | O_CREAT, 0644);
  } while ((fd < 0) && (errno == EINTR));

  if (fd < 0)
    return (-errno);

  // Lock it. Note that if we can't get the lock EAGAIN will be the
  // error we receive.
  lock.l_type = F_WRLCK;
  lock.l_start = 0;
  lock.l_whence = SEEK_SET;
  lock.l_len = 0;

  do {
    err = fcntl(fd, F_SETLK, &lock);
  } while ((err < 0) && (errno == EINTR));

  if (err < 0) {
    // We couldn't get the lock. Try and read the process id of the
    // process holding the lock from the lockfile.
    t = buf;

    for (size = LOCKFILE_BUF_LEN - 1; size > 0;) {
      do {
        err = read(fd, t, size);
      } while ((err < 0) && (errno == EINTR));

      if (err < 0)
        FAIL(-errno);
      if (err == 0)
        break;

      size -= err;
      t += err;
    }
    *t = '\0';

    // coverity[secure_coding]
    if (sscanf(buf, "%d\n", (int*)&val) != 1) {
      *holding_pid = 0;
    } else {
      *holding_pid = val;
    }
    FAIL(0);
  }
  // If we did get the lock, then set the close on exec flag so that
  // we don't accidentally pass the file descriptor to a child process
  // when we do a fork/exec.
  do {
    err = fcntl(fd, F_GETFD, 0);
  } while ((err < 0) && (errno == EINTR));

  if (err < 0)
    FAIL(-errno);

  val = err | FD_CLOEXEC;

  do {
    err = fcntl(fd, F_SETFD, val);
  } while ((err < 0) && (errno == EINTR));

  if (err < 0)
    FAIL(-errno);

  // Return the file descriptor of the opened lockfile. When this file
  // descriptor is closed the lock will be released.
  return (1); // success
#undef FAIL
}

int
Lockfile::Get(pid_t * holding_pid)
{
  char buf[LOCKFILE_BUF_LEN];
  int err;
  *holding_pid = 0;

  fd = -1;

  // Open the Lockfile and get the lock. If we are successful, the
  // return value will be the file descriptor of the opened Lockfile.
  err = Open(holding_pid);
  if (err != 1)
    return err;

  if (fd < 0) {
    return -1;
  }

  // Truncate the Lockfile effectively erasing it.
  do {
    err = ftruncate(fd, 0);
  } while ((err < 0) && (errno == EINTR));

  if (err < 0) {
    close(fd);
    return (-errno);
  }

  // Write our process id to the Lockfile.
  snprintf(buf, sizeof(buf), "%d\n", (int) getpid());

  do {
    err = write(fd, buf, strlen(buf));
  } while ((err < 0) && (errno == EINTR));

  if (err != (int) strlen(buf)) {
    close(fd);
    return (-errno);
  }
  return (1); // success
}

void
Lockfile::Close(void)
{
  if (fd != -1) {
    close(fd);
    fd = -1; // avoid closing the same descriptor twice
  }
}

//-------------------------------------------------------------------------
// Lockfile::Kill() and Lockfile::KillGroup()
//
// Open the lockfile. If we succeed it means there was no process
// holding the lock. We'll just close the file and release the lock
// in that case. If we don't succeed in getting the lock, the
// process id of the process holding the lock is returned. We
// repeatedly send the KILL signal to that process until doing so
// fails. That is, until kill says that the process id is no longer
// valid (we killed the process), or that we don't have permission
// to send a signal to that process id (the process holding the lock
// is dead and a new process has replaced it).
//
// INKqa11325 (Kevlar: linux machine hosed up if specific threads
// killed): Unfortunately, it's possible on Linux that the main PID of
// the process has been successfully killed (and is waiting to be
// reaped while in a defunct state), while some of the other threads
// of the process just don't want to go away. Integrate ink_killall
// into Kill() and KillGroup() just to make sure we really kill
// everything and so that we don't spin hard while trying to kill a
// defunct process.
//-------------------------------------------------------------------------

static void
lockfile_kill_internal(pid_t init_pid, int init_sig, pid_t pid, const char *pname, int sig)
{
  int err;

#if defined(__linux__)

  // Initialized so that free(pidv) below is safe even when pname is NULL.
  pid_t *pidv = NULL;
  int pidvcnt = 0;

  // Need to grab pname's pid vector before we issue any kill signals.
  // Specifically, this prevents the race-condition in which
  // traffic_manager spawns a new traffic_server while we still think
  // we're killall'ing the old traffic_server.
  if (pname) {
    // Collect the PIDs of all processes named pname:
    // pidv receives the PID array, pidvcnt the number of entries.
    ink_killall_get_pidv_xmalloc(pname, &pidv, &pidvcnt);
  }

  if (init_sig > 0) {
    kill(init_pid, init_sig);
    // sleep for a bit and give time for the first signal to be
    // delivered
    sleep(1);
  }

  do {
    if ((err = kill(pid, sig)) == 0) {
      sleep(1);
    }
    if (pname && (pidvcnt > 0)) {
      ink_killall_kill_pidv(pidv, pidvcnt, sig);
      sleep(1);
    }
  } while ((err == 0) || ((err < 0) && (errno == EINTR)));

  free(pidv);

#else

  if (init_sig > 0) {
    kill(init_pid, init_sig);
    // sleep for a bit and give time for the first signal to be
    // delivered
    sleep(1);
  }

  do {
    err = kill(pid, sig);
  } while ((err == 0) || ((err < 0) && (errno == EINTR)));

#endif // linux check

}

void
Lockfile::Kill(int sig, int initial_sig, const char *pname)
{
  int err;
  int pid;
  pid_t holding_pid;

  err = Open(&holding_pid);
  if (err == 1) // success getting the lock file: no such process is running
  {
    Close(); // nothing to kill, so just close the file
  } else if (err == 0) // someone else has the lock
  {
    pid = holding_pid; // PID of the process holding the lock
    if (pid != 0) { // only kill when the PID is valid
      lockfile_kill_internal(pid, initial_sig, pid, pname, sig);
    }
  }
}

// (Author's note: I have not fully understood this function.)
void
Lockfile::KillGroup(int sig, int initial_sig, const char *pname)
{
  int err;
  pid_t pid;
  pid_t holding_pid;

  err = Open(&holding_pid);
  if (err == 1) // success getting the lock file
  {
    Close();
  } else if (err == 0) // someone else has the lock
  {
    do {
      pid = getpgid(holding_pid); // get the process group ID
    } while ((pid < 0) && (errno == EINTR));

    if ((pid < 0) || (pid == getpid()))
      pid = holding_pid;
    else
      pid = -pid; // a negative pid makes kill() signal the whole group

    if (pid != 0) {
      // We kill the holding_pid instead of the process_group
      // initially since there is no point trying to get core files
      // from a group since the core file of one overwrites the core
      // file of another one
      lockfile_kill_internal(holding_pid, initial_sig, pid, pname, sig);
    }
  }
}
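3. log.h
#ifndef LOG_H
#define LOG_H
#include <stdio.h>

// Appends the message to log.txt in the current working directory.
// Takes const char* so that string literals can be passed from C++.
inline void write_to_log(const char* c){

    FILE* fp;
    fp = fopen("log.txt", "ab");
    if (fp)
    {
        fputs(c, fp);
        fclose(fp);
    }
}

#endif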
4. traffic_cop.cpp
#include "lock_and_kill.h"
#include "log.h"
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <signal.h>
#include <sys/param.h>
#include <sys/resource.h> // getrlimit(), struct rlimit
#include <unistd.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <time.h>
#include <string.h>
#include <stdio.h>
#include <sys/stat.h>

#define NOWARN_UNUSED(x) (void)(x)

static char cop_lockfile[PATH_NAME_MAX];
static char manager_lockfile[PATH_NAME_MAX];
static char server_lockfile[PATH_NAME_MAX];

static char manager_binary[PATH_NAME_MAX] = "traffic_manager";
static char server_binary[PATH_NAME_MAX] = "traffic_server";
static int killsig=SIGKILL;
static int coresig=0;
static int server_not_found = 0;
static int server_failures=0;
static int manager_failures =0;

static const int sleep_time = 10;           // 10 sec
static const int manager_timeout = 3 * 60;  // 3 min
static const int server_timeout = 3 * 60;   // 3 min
static const int kill_timeout = 1 * 60;     // 1 min

static void sig_alarm_warn(int signum=0)
{
    NOWARN_UNUSED(signum);
    alarm(kill_timeout);
}

static void sig_fatal(int signum)
{
    NOWARN_UNUSED(signum);
    abort();
}

static void set_alarm_warn()
{
    struct sigaction action;
    action.sa_handler = sig_alarm_warn;
    sigemptyset(&action.sa_mask);
    action.sa_flags = 0;
    sigaction(SIGALRM, &action, NULL);
}

static void set_alarm_death()
{
    struct sigaction action;
    action.sa_handler = sig_fatal;
    sigemptyset(&action.sa_mask);
    action.sa_flags = 0;
    sigaction(SIGALRM, &action, NULL);
}

static void sig_child(int signum)
{
    NOWARN_UNUSED(signum);
    pid_t pid = 0;
    int status = 0;
    for (;;) {
        pid = waitpid(WAIT_ANY, &status, WNOHANG);

        if (pid <= 0) {
            break;
        }
        // TSqa03086 - We can not log the child status signal from
        // the signal handler since syslog can deadlock. Record
        // the pid and the status in a global for logging
        // next time through the event loop. We will occasionally
        // lose some information if we get two sig childs in rapid
        // succession
        // child_pid = pid;
        // child_status = status;
    }
}

static void init_signals()
{
    struct sigaction action;
    write_to_log("Entering init_signals()\n");
    action.sa_handler = sig_child;
    sigemptyset(&action.sa_mask);
    action.sa_flags = 0;
    sigaction(SIGCHLD, &action, NULL);
    // Prepared for the fatal-signal handlers, but this simplified version
    // does not install sig_fatal for any signal here.
    action.sa_handler = sig_fatal;
    sigemptyset(&action.sa_mask);
    action.sa_flags = 0;
    write_to_log("leaving init_signals()\n\n");
}

static void safe_kill(const char* lockfile_name,const char * pname,bool group)
{
    Lockfile lockfile(lockfile_name);
    write_to_log("Entering safe_kill\n");
    set_alarm_warn();
    alarm(kill_timeout);

    if (group == true) {
        lockfile.KillGroup(killsig, coresig, pname);
    } else {
        lockfile.Kill(killsig, coresig, pname);
    }
    alarm(0);
    set_alarm_death();
    write_to_log("Leaving safe_kill\n\n");
}

// Simplified: always reports that the server is up (returns 1)
static int server_up()
{
    return 1;
}

// Simplified: the real heartbeat talks to the manager over a port
static int heartbeat_manager()
{
    //safe_kill(manager_lockfile, manager_binary, true);
    return 1;
}

// Simplified: the real heartbeat talks to the server over a port
static int heartbeat_server()
{
    //safe_kill(server_lockfile, server_binary, false);
    //server_failures = 0;
    return 1;
}

static void spawn_manager()
{
    int err;
    err = fork();
    write_to_log("Entering spwan_manager()!\n\n");
    if (err == 0) {
        // execv() needs a NULL-terminated argv; argv[0] is the program name
        char *const argv[] = { manager_binary, NULL };
        err = execv(manager_binary, argv);
        write_to_log("somehow execv failed!\n");
        exit(1);
    } else if (err == -1) {
        write_to_log("unable to fork !\n");
        exit(1);
    }

    manager_failures = 0;
    write_to_log("Leaving spwan_manager()!\n\n");
}

static void init_lockfiles()
{
    // Layout::relative_to(cop_lockfile, sizeof(cop_lockfile), Layout::get()->runtimedir, COP_LOCK);
    // Layout::relative_to(manager_lockfile, sizeof(manager_lockfile), Layout::get()->runtimedir, MANAGER_LOCK);
    // Layout::relative_to(server_lockfile, sizeof(server_lockfile), Layout::get()->runtimedir, SERVER_LOCK);

    write_to_log("Entering init_lockfiles()\n");
    strcpy(cop_lockfile,"cop_lockfile");
    strcpy(manager_lockfile,"manager_lockfile");
    strcpy(server_lockfile,"server_lockfile");

    // The simulated executables are named manager_binary and server_binary
    strcpy(manager_binary,"manager_binary");
    strcpy(server_binary,"server_binary");

    write_to_log("leaving init_lockfiles()\n\n");
}

// Makes sure only one traffic_cop runs: take the cop lockfile or exit
static void check_lockfile()
{
    write_to_log("Entering check_lockfile()\n");
    int err;
    pid_t holding_pid;
    Lockfile cop_lf(cop_lockfile);
    err = cop_lf.Get(&holding_pid);

    if (err < 0) {
        write_to_log("leaving check_lockfile(),and err<0\n\n");
        exit(1);
    } else if (err == 0) {
        write_to_log("leaving check_lockfile(),and err==0\n\n");
        exit(1);
    }
    write_to_log("leaving check_lockfile()\n\n");
}

static void check_programs()
{
    int err;
    pid_t holding_pid;

    write_to_log("Entering check_programs()\n");
    printf("Entering check_programs()\n");
    // Try to take the manager's lockfile; success means no manager is running
    Lockfile manager_lf(manager_lockfile);
    err = manager_lf.Open(&holding_pid);

    // The value of err tells us the state of the manager process
    if(err==0){
        write_to_log("in check_programs(),manager_lockfile,err==0\n");
        printf("in check_programs(),manager_lockfile,err==0\n");

        // Signal 0 only tests whether the PID can be signalled
        if(kill(holding_pid,0)==-1){
            printf("holding_pid is %d,and invalid\n",holding_pid);

            ink_killall(manager_binary, killsig);
            sleep(1); // give signals a chance to be received
            err = manager_lf.Open(&holding_pid);
        }
    }

    if(err>0){ // we were able to take the manager lockfile
        // 'lockfile_open' returns the file descriptor of the opened
        // lockfile. We need to close this before spawning the
        // manager so that the manager can grab the lock.
        manager_lf.Close();
        // Make sure we don't have a stray traffic server running.

        write_to_log("traffic_manager not running, making sure traffic_server is dead\n");
        safe_kill(server_lockfile,server_binary,false);
        spawn_manager();
    }
    else
    {
        // err <= 0: either a manager holds the lock (err == 0) or Open()
        // failed (err < 0), for example because the lock was acquired but
        // setting the descriptor flags failed.
        // If there is a manager running we want to heartbeat it to
        // make sure it hasn't wedged. If the manager test succeeds we
        // check to see if the server is up. (That is, it hasn't been
        // brought down via the UI). If the manager thinks the server
        // is up, we make sure there is actually a server process
        // running. If there is we test it.

        alarm(2*manager_timeout);
        err=heartbeat_manager(); // always 1 in this simulation
        alarm(0);

        if(err<0){ // heartbeat failed; cannot happen in this simulation
            return ;
        }

        if(server_up()<=0){
            // The manager is running; if the server is down we assume the
            // manager will create a new one, so just return.
            return;
        }

        Lockfile server_lf(server_lockfile);
        err=server_lf.Open(&holding_pid);

        if(err==0){
            if(kill(holding_pid,0)==-1){
                ink_killall(server_binary,killsig);
                sleep(1); // give signals a chance to be received
                err=server_lf.Open(&holding_pid);
            }
        }

        if(err>0){
            server_lf.Close();
            server_not_found += 1;

            if(server_not_found>1){
                server_not_found=0;
                safe_kill(manager_lockfile, manager_binary, true);
            }
        }else{
            alarm(2 * server_timeout);
            heartbeat_server(); // always 1 in this simulation
            alarm(0);
        }
    }
    printf("Leaving check_programs\n\n");
    write_to_log("Leaving check_programs\n\n");
}

static void init()
{
    write_to_log("Entering init()\n");
    init_signals();
    init_lockfiles();
    check_lockfile();
    write_to_log("Leaving init()\n\n");
}

static void millisleep(int ms)
{
    struct timespec ts;
    ts.tv_sec = ms / 1000;
    ts.tv_nsec = (ms - ts.tv_sec * 1000) * 1000 * 1000;
    nanosleep(&ts, NULL);
}

// Changed function from taking no argument and returning void
// to taking a void* and returning a void*. The change was made
// so that we can call ink_thread_create() on this function
// in the case of running cop as a win32 service.
static void* check(void* arg)
{
    write_to_log("Entering check()\n\n");
    for(;;){
        // problems with the ownership of this file as root Make sure it is
        // owned by the admin user

        alarm(2 * (sleep_time + manager_timeout * 2 + server_timeout));

        check_programs();
        millisleep(sleep_time * 1000);
    }
    // Not reached: the loop above never exits
    write_to_log("Leaveing check()\n\n");
    return arg;
}

void init_daemon(void)
{
    rlim_t i;
    pid_t pid;
    struct rlimit rl;
    struct sigaction sa;

    if(getrlimit(RLIMIT_NOFILE,&rl)<0){
        exit(1);
    }

    if((pid=fork())<0){
        exit(1); // fork failed
    }else if(pid> 0){
        exit(0); // parent process exits
    }

    // First child: keeps running in the background
    setsid(); // become session leader and process group leader,
              // detaching from the controlling terminal
    sa.sa_handler=SIG_IGN;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags=0;

    if(sigaction(SIGHUP,&sa,NULL)<0){
        exit(1);
    }

    if((pid=fork())<0){
        exit(1); // fork failed
    }else if(pid> 0){
        exit(0); // first child exits
    }
    // Second child continues; it is no longer a session leader
    umask(0);
    if (rl.rlim_max==RLIM_INFINITY){
        rl.rlim_max=1024;
    }

    for(i=0;i<rl.rlim_max;++i) // close every inherited file descriptor
    {
        close((int)i);
    }

    //chdir("/tmp"); // would change the working directory to /tmp
    return;
}

int main()
{
    init_daemon(); // turn this process into a daemon
    write_to_log("Entering main()\n");
    signal(SIGHUP, SIG_IGN);
    signal(SIGTSTP, SIG_IGN);
    signal(SIGTTOU, SIG_IGN);
    signal(SIGTTIN, SIG_IGN);
    //setsid();
    init();
    check(NULL);
    write_to_log("leaving main()\n\n");
    return 0;
}
5. traffic_manager.cpp
#include "lock_and_kill.h"
#include "log.h"
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <signal.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <time.h>
#include <string.h>
#include <stdio.h>

#define NOWARN_UNUSED(x) (void)(x)
static char manager_lockfile[4096]="manager_lockfile";
static char server_lockfile[4096]="server_lockfile";
static int server_failures=0;
static int killsig=SIGKILL;
static int coresig=0;
static char server_binary[4096] = "server_binary";
static const int sleep_time = 10;           // 10 sec
static const int manager_timeout = 3 * 60;  // 3 min
static const int server_timeout = 3 * 60;   // 3 min
static const int kill_timeout = 1 * 60;     // 1 min

static void sig_alarm_warn(int signum=0)
{
    NOWARN_UNUSED(signum);
    alarm(kill_timeout);
}

static void sig_fatal(int signum)
{
    NOWARN_UNUSED(signum);
    abort();
}

static void set_alarm_warn()
{
    struct sigaction action;
    action.sa_handler = sig_alarm_warn;
    sigemptyset(&action.sa_mask);
    action.sa_flags = 0;
    sigaction(SIGALRM, &action, NULL);
}

static void set_alarm_death()
{
    struct sigaction action;
    action.sa_handler = sig_fatal;
    sigemptyset(&action.sa_mask);
    action.sa_flags = 0;
    sigaction(SIGALRM, &action, NULL);
}

// Not installed in this simulation; kept for parity with traffic_cop.cpp
static void sig_child(int signum)
{
    NOWARN_UNUSED(signum);
    pid_t pid = 0;
    int status = 0;
    for (;;) {
        pid = waitpid(WAIT_ANY, &status, WNOHANG);

        if (pid <= 0) {
            break;
        }
        // TSqa03086 - We can not log the child status signal from
        // the signal handler since syslog can deadlock. Record
        // the pid and the status in a global for logging
        // next time through the event loop. We will occasionally
        // lose some information if we get two sig childs in rapid
        // succession
        // child_pid = pid;
        // child_status = status;
    }
}

static void safe_kill(const char* lockfile_name,const char * pname,bool group)
{
    Lockfile lockfile(lockfile_name);
    write_to_log("Entering safe_kill\n");
    set_alarm_warn();
    alarm(kill_timeout);

    if (group == true) {
        lockfile.KillGroup(killsig, coresig, pname);
    } else {
        lockfile.Kill(killsig, coresig, pname);
    }
    alarm(0);
    set_alarm_death();
    write_to_log("Leaving safe_kill\n\n");
}

static void spawn_server()
{
    int err;
    write_to_log("--------------Entering spwan_server()!\n\n");
    err = fork();
    if (err == 0) {
        // execv() needs a NULL-terminated argv; argv[0] is the program name
        char *const argv[] = { server_binary, NULL };
        err = execv(server_binary, argv);

        write_to_log("--------------somehow execv failed!\n");
        exit(1);
    } else if (err == -1) {
        write_to_log("--------------unable to fork server !\n");
        exit(1);
    }

    server_failures = 0;
    write_to_log("--------------Leaving spwan_server()!\n\n");
}

void check_server()
{
    int err;
    pid_t holding_pid;
    Lockfile server_lf(server_lockfile);
    // Open() rather than Get(): the manager only probes the lock and must not
    // write its own PID into the server's lockfile.
    err=server_lf.Open(&holding_pid);

    if(err==0){
        // Signal 0 only tests whether the PID can be signalled
        if(kill(holding_pid,0)==-1){
            ink_killall(server_binary,killsig);
            sleep(1);
            err=server_lf.Open(&holding_pid);
        }
    }

    if(err>0){ // lock is free: the server is not running, so restart it
        server_lf.Close();
        safe_kill(server_lockfile,server_binary,false);
        spawn_server();
    }
}

int main()
{
    pid_t holding_pid=0;
    Lockfile manager_lf(manager_lockfile);
    manager_lf.Get(&holding_pid);

    while(1){
        char buf[100];
        snprintf(buf,sizeof(buf),"----------------traffic_manager is running, pid:'%d'!\n",getpid());
        write_to_log(buf);

        printf("----------------traffic_manager is running,pidID: %d\n",getpid());

        sleep(5);
        int c=rand()%10;

        if(c==1){ // simulate a failure of the manager process
            write_to_log("----------------traffic_manager has a expcetion and eixt!\n");
            exit(1);
        }else{ // otherwise check on the server process
            check_server();
        }
    }
}
6. traffic_server.cpp
#include "log.h"
#include "lock_and_kill.h"
#include <sys/types.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

static char server_lockfile[4096]="server_lockfile";

int main()
{
    pid_t holding_pid=0;
    Lockfile server_lf(server_lockfile);
    // Take the server lockfile and record our PID in it
    server_lf.Get(&holding_pid);

    while(1){
        char buf[100];
        snprintf(buf,sizeof(buf),"==============traffic_server is running, pid:'%d'!\n",getpid());
        write_to_log(buf);
        sleep(5);
        int c=rand()%100;

        if(c<30){ // simulate a failure of the server process
            write_to_log("=================traffic_server has a expcetion and exit!\n");
            exit(1);
        }
    }
    return 0;
}
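I wrote this document some time ago while studying Traffic Server. I hope it is of some help to interested readers, and corrections are welcome. It is only a brief analysis of how the Traffic Server processes are controlled; much of the test is simplified, for example the heartbeat checks, as noted in the code.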