The Thundering Herd on Linux

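1. Thundering herd

　　A thundering herd occurs when a resource becomes available and many processes or threads are woken to compete for it. It causes two problems:

　　　　[1] n-1 of the processes/threads are scheduled and context-switched for nothing, which produces a momentary CPU spike.

　　　　[2] The processes/threads contend on synchronization (locking and unlocking) around the resource, which adds system overhead.

　　On Linux today, the thundering herd shows up with accept, with epoll, and with multiple threads waiting on a condition variable.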

2. Thundering herd caused by accept, epoll, and condition variables

　　【1】accept

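　　　　Before kernel 2.6, when several processes created with fork called accept on the same listening fd, an incoming connection woke all of them out of accept, but only one accept succeeded; the others failed with EAGAIN.

　　　　Kernel 2.6 and later fix this: when a connection arrives, only one process is woken.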
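The pre-fork pattern described above can be sketched as follows. This is a minimal illustration, not code from the original post: the helper name prefork_accept, the loopback address, the report pipe, and the cap of 16 workers are all assumptions made for the example. It only shows that every connection is accepted exactly once even though all children block in accept() on the same fd; it does not measure how many processes the kernel wakes.

```c
#include <assert.h>
#include <netinet/in.h>
#include <signal.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork up to 16 workers that all block in accept()
 * on one listening socket, open nconns connections, and return how many
 * accepts the workers reported (-1 on setup failure). */
int prefork_accept(int nworkers, int nconns)
{
	struct sockaddr_in addr;
	socklen_t len = sizeof(addr);
	pid_t pids[16];
	int report[2], accepted = 0, started = 0;

	assert(nworkers > 0 && nconns >= 0);
	int lfd = socket(AF_INET, SOCK_STREAM, 0);
	if (lfd < 0)
		return -1;
	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	addr.sin_port = 0;              /* let the kernel pick a free port */
	if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    listen(lfd, 128) < 0 ||
	    getsockname(lfd, (struct sockaddr *)&addr, &len) < 0 ||
	    pipe(report) < 0) {
		close(lfd);
		return -1;
	}

	for (int i = 0; i < nworkers && i < 16; i++) {
		pid_t pid = fork();
		if (pid == 0) {
			/* Worker: sleeps in accept() on the shared fd. */
			close(report[0]);
			for (;;) {
				int cfd = accept(lfd, NULL, NULL);
				if (cfd < 0)
					_exit(1);
				close(cfd);
				if (write(report[1], "a", 1) != 1)
					_exit(1);
			}
		}
		if (pid > 0)
			pids[started++] = pid;
	}
	close(report[1]);               /* only the workers write reports */

	for (int i = 0; started > 0 && i < nconns; i++) {
		int cfd = socket(AF_INET, SOCK_STREAM, 0);
		char c;
		if (cfd < 0)
			continue;
		/* Wait for exactly one worker to report this connection. */
		if (connect(cfd, (struct sockaddr *)&addr, sizeof(addr)) == 0 &&
		    read(report[0], &c, 1) == 1)
			accepted++;
		close(cfd);
	}

	for (int i = 0; i < started; i++) {
		kill(pids[i], SIGKILL);
		waitpid(pids[i], NULL, 0);
	}
	close(report[0]);
	close(lfd);
	return accepted;
}
```

Calling prefork_accept(4, 8) should report 8 accepts in total, regardless of which of the four workers served each connection.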

　　【2】epoll

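　　　　epoll has two modes, LT (level-triggered) and ET (edge-triggered).

　　　　What the kernel does in epoll_wait(): with the lock held, it walks the ready list (ep->rdllist). For each event it first removes the event from the ready list, then calls the file's poll function to check whether the file (fd) really has a pending event. If it does, the event and the user data are copied to user space; afterwards, if the fd was not registered with EPOLLET (that is, it is in LT mode), the event is added back to the ready list.

　　　　Therefore, in LT mode, once one process's epoll_wait() has received the event, the kernel has already put the event back on the ready list, and this repeats until the event is actually handled.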

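　　　　Conclusion: when several processes created with fork call epoll_wait() on the same epoll fd:

　　　　　　[1] In LT mode, several processes can receive the event for the same fd: until the event is handled, it keeps being put back on the ready list, so other processes receive it too. This resembles a thundering herd (a chain reaction of repeated wakeups).

　　　　　　[2] In ET mode the event is not put back on the ready list, so one event is not handled by several processes.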
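The LT/ET difference can be observed from user space with a small sketch. It is an illustration under assumptions, not part of the original post: the helper name count_reports is made up, a pipe stands in for the shared fd, and two back-to-back epoll_wait() calls in one process stand in for two processes waiting on the same epoll fd. The byte written to the pipe is never read, so the "event" stays unhandled.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: registers the read end of a pipe with
 * EPOLLIN | extra_flags, writes one byte that is never read, and returns
 * how many of two consecutive epoll_wait() calls report it
 * (-1 on setup failure). */
int count_reports(uint32_t extra_flags)
{
	struct epoll_event ev = { .events = EPOLLIN | extra_flags }, out;
	int pfd[2], reports = 0;

	assert((extra_flags & ~(uint32_t)EPOLLET) == 0);
	if (pipe(pfd) < 0)
		return -1;
	int epfd = epoll_create1(0);
	ev.data.fd = pfd[0];
	if (epfd < 0 ||
	    epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) < 0 ||
	    write(pfd[1], "x", 1) != 1) {
		reports = -1;
	} else {
		/* Timeout 0: just look at the ready list, never block. */
		for (int i = 0; i < 2; i++)
			if (epoll_wait(epfd, &out, 1, 0) == 1)
				reports++;
	}
	if (epfd >= 0)
		close(epfd);
	close(pfd[0]);
	close(pfd[1]);
	return reports;
}
```

count_reports(0) returns 2, because in LT mode the unhandled event is reported to both waiters, while count_reports(EPOLLET) returns 1, because in ET mode it is reported only once.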

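　　【3】Multi-threaded thundering herd caused by condition variables

　　　　First, the mutex used with a condition variable protects the user's shared state (the condition being waited on), not the cond itself; operations on the cond are already atomic.

　　　　pthread_cond_signal: normally wakes a single waiting thread. POSIX only guarantees that at least one waiter is unblocked, so more than one may occasionally wake.

　　　　pthread_cond_broadcast: wakes all waiting threads at once. Only one thread's pthread_cond_wait() returns right away (the one that manages to lock the mutex); the others cannot return until the mutex is unlocked. In other words, the woken threads run one after another, serialized by the mutex.

　　　　So the thundering herd case for condition variables is waking many threads at once with pthread_cond_broadcast. Because the woken threads run one after another, the condition may no longer hold by the time the second thread's pthread_cond_wait returns, so the condition has to be re-checked in a while loop: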

pthread_mutex_lock(&lock);
while (count == 0)
	pthread_cond_wait(&cond, &lock);
pthread_mutex_unlock(&lock);
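To see why the while loop matters, the snippet can be extended into a small sketch in which a producer uses pthread_cond_broadcast for every item. The names lock, cond and count follow the snippet; consumed, done, consumer and run_broadcast_demo are hypothetical additions for this example, as is the cap of 16 threads. Each broadcast wakes every waiter, but only the thread that still finds count > 0 takes the item; the rest go back to waiting.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int count;     /* items available; protected by lock */
static int consumed;  /* items taken;     protected by lock */
static bool done;     /* no more items;   protected by lock */

static void *consumer(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock);
	for (;;) {
		/* Re-check after every wakeup: another woken thread may
		 * have taken the item before this one got the mutex. */
		while (count == 0 && !done)
			pthread_cond_wait(&cond, &lock);
		if (count == 0)
			break;          /* done and nothing left to take */
		count--;
		consumed++;
	}
	pthread_mutex_unlock(&lock);
	return NULL;
}

/* Hypothetical driver: starts up to 16 consumers, produces nitems items
 * with one broadcast each, and returns how many items were consumed. */
int run_broadcast_demo(int nthreads, int nitems)
{
	pthread_t tid[16];
	int started = 0;

	assert(nthreads > 0 && nitems >= 0);
	pthread_mutex_lock(&lock);
	count = consumed = 0;
	done = false;
	pthread_mutex_unlock(&lock);

	for (int i = 0; i < nthreads && i < 16; i++)
		if (pthread_create(&tid[started], NULL, consumer, NULL) == 0)
			started++;

	for (int i = 0; i < nitems; i++) {
		pthread_mutex_lock(&lock);
		count++;
		pthread_cond_broadcast(&cond);  /* wakes every waiter */
		pthread_mutex_unlock(&lock);
	}

	pthread_mutex_lock(&lock);
	done = true;
	pthread_cond_broadcast(&cond);          /* let all consumers exit */
	pthread_mutex_unlock(&lock);

	for (int i = 0; i < started; i++)
		pthread_join(tid[i], NULL);
	return consumed;
}
```

With the while loop in place, run_broadcast_demo(4, 100) consumes exactly 100 items. If the loop were replaced by a plain if, a thread woken after the item was already taken could decrement count below zero.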

3. How Nginx handles the thundering herd

4. Part of the epoll kernel source


/*
 * We can loop without lock because we are passed a task private list.
 * Items cannot vanish during the loop because ep_scan_ready_list() is            // i.e. the lock is already held when the loop below runs
 * holding "mtx" during this call.
 */
for (esed->res = 0, uevent = esed->events;                                        // walk the ready list, i.e. the list of events whose fd needs a status check or has a pending event
	 !list_empty(head) && esed->res < esed->maxevents;) {
	epi = list_first_entry(head, struct epitem, rdllink);

	/*
	 * Activate ep->ws before deactivating epi->ws to prevent
	 * triggering auto-suspend here (in case we reactive epi->ws
	 * below).
	 *
	 * This could be rearranged to delay the deactivation of epi->ws
	 * instead, but then epi->ws would temporarily be out of sync
	 * with ep_is_linked().
	 */
	ws = ep_wakeup_source(epi);
	if (ws) {
		if (ws->active)
			__pm_stay_awake(ep->ws);
		__pm_relax(ws);
	}

	list_del_init(&epi->rdllink);                                                    // remove the event from the ready list

	revents = ep_item_poll(epi, &pt, 1);                                             // returns the fd's current state, i.e. event bits such as EPOLLIN

	/*
	 * If the event mask intersect the caller-requested one,
	 * deliver the event to userspace. Again, ep_scan_ready_list()
	 * is holding "mtx", so no operations coming from userspace
	 * can change the item.
	 */
	if (revents) {                                                                  // if an event is pending, copy the event bits and user data to user space
		if (__put_user(revents, &uevent->events) ||
			__put_user(epi->event.data, &uevent->data)) {
			list_add(&epi->rdllink, head);
			ep_pm_stay_awake(epi);
			if (!esed->res)
				esed->res = -EFAULT;
			return 0;
		}
		esed->res++;
		uevent++;
		if (epi->event.events & EPOLLONESHOT)
			epi->event.events &= EP_PRIVATE_BITS;
		else if (!(epi->event.events & EPOLLET)) {                              // not EPOLLET (i.e. LT mode): put the event back on the ready list
			/*
			 * If this file has been added with Level
			 * Trigger mode, we need to insert back inside
			 * the ready list, so that the next call to
			 * epoll_wait() will check again the events
			 * availability. At this point, no one can insert
			 * into ep->rdllist besides us. The epoll_ctl()
			 * callers are locked out by
			 * ep_scan_ready_list() holding "mtx" and the
			 * poll callback will queue them in ep->ovflist.
			 */
			list_add_tail(&epi->rdllink, &ep->rdllist);
			ep_pm_stay_awake(epi);
		}
	}
}
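The requeue logic in the excerpt can be mimicked with a toy user-space model. This is only a sketch for illustration and has nothing to do with the real kernel data structures: toy_item, toy_ep, toy_scan, TOY_ET and TOY_MAXITEMS are all invented names, a fixed array stands in for ep->rdllist, and a boolean stands in for what the file's poll function would report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_ET       0x1u  /* stands in for EPOLLET */
#define TOY_MAXITEMS 8

struct toy_item {
	unsigned flags;    /* TOY_ET, or 0 for level-triggered */
	bool readable;     /* what the file's poll function would report */
};

struct toy_ep {
	struct toy_item *ready[TOY_MAXITEMS];  /* stands in for ep->rdllist */
	size_t nready;
};

/* One epoll_wait()-like pass over the ready list.
 * Returns the number of events that would be delivered to user space. */
size_t toy_scan(struct toy_ep *ep)
{
	struct toy_item *batch[TOY_MAXITEMS];
	size_t n = ep->nready, delivered = 0;

	assert(n <= TOY_MAXITEMS);
	/* Take a private snapshot, like the task-private list the
	 * kernel loop iterates over, and empty the shared list. */
	for (size_t i = 0; i < n; i++)
		batch[i] = ep->ready[i];
	ep->nready = 0;

	for (size_t i = 0; i < n; i++) {
		struct toy_item *it = batch[i];  /* removed from the list */
		if (!it->readable)
			continue;                /* poll found no event */
		delivered++;
		if (!(it->flags & TOY_ET))
			ep->ready[ep->nready++] = it;  /* LT: queue again */
	}
	return delivered;
}
```

Starting with one readable LT item and one readable ET item on the list, the first toy_scan() delivers 2 events and the second delivers 1 (only the LT item was put back). Once the LT item stops being readable, the next scan delivers 0 and leaves the list empty.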

posted on 2019-03-21 17:08 by 能量星星
