The pdflush kernel thread pool and the races hidden in it

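The pdflush kernel thread pool is the process-context working environment that Linux creates for writing back filesystem data. The implementation is quite compact: fewer than 250 lines in all.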

1 /*
2 * mm/pdflush.c - worker threads for writing back filesystem data
3 *
4 * Copyright (C) 2002, Linus Torvalds.
5 *
6 * 09Apr2002    akpm@zip.com.au
7 *      Initial version
8 * 29Feb2004    kaos@sgi.com
9 *      Move worker thread creation to kthread to avoid chewing
10 *      up stack space with nested calls to kernel_thread.
11 */

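The file header holds the copyright notice and the main changelog entries. kaos@sgi.com moved the creation of the worker threads over to kthread, so that nested calls to kernel_thread no longer eat up stack space. The effect of this change is visible in the output of ps: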

root         5     1     5 0    1 21:31 ?        00:00:00 [kthread]
root       114     5   114 0    1 21:31 ?        00:00:00 [pdflush]
root       115     5   115 0    1 21:31 ?        00:00:00 [pdflush]

The parent of every pdflush kernel thread is the kthread process (pid 5).

12
13 #include <linux/sched.h>
14 #include <linux/list.h>
15 #include <linux/signal.h>
16 #include <linux/spinlock.h>
17 #include <linux/gfp.h>
18 #include <linux/init.h>
19 #include <linux/module.h>
20 #include <linux/fs.h> // Needed by writeback.h
21 #include <linux/writeback.h> // Prototypes pdflush_operation()
22 #include <linux/kthread.h>
23 #include <linux/cpuset.h>
24
25

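The necessary header files are included. One thing I dislike: although C++-style line comments have made their way into C, seeing them in kernel code still feels uncomfortable. Maybe I am too picky; there is nothing really wrong with them, and I probably need to move with the times.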

26 /*
27 * Minimum and maximum number of pdflush instances
28 */
29 #define MIN_PDFLUSH_THREADS 2
30 #define MAX_PDFLUSH_THREADS 8
31
32 static void start_one_pdflush_thread(void);
33
34

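Lines 29 and 30 define the minimum and maximum number of pdflush thread instances, 2 and 8. The minimum keeps latency down; the maximum keeps an excess of threads from degrading system performance. The maximum is not strictly enforced, though, and we will come back to it when we analyse the race conditions.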

35 /*
36 * The pdflush threads are worker threads for writing back dirty data.
37 * Ideally, we'd like one thread per active disk spindle. But the disk
38 * topology is very hard to divine at this level. Instead, we take
39 * care in various places to prevent more than one pdflush thread from
40 * performing writeback against a single filesystem. pdflush threads
41 * have the PF_FLUSHER flag set in current->flags to aid in this.
42 */
43

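The comment above is a brief explanation of the pool. In short: pdflush threads are worker threads that write dirty data back. Ideally there would be one thread per active disk spindle, but the disk topology is hard to determine at this level, so care is taken in various places to keep more than one pdflush thread from writing back against a single filesystem. The PF_FLUSHER flag in current->flags helps with this.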

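Kernel developers are clearly quite "stingy" about efficiency and think things through carefully. They are just as careful about layering and never step across a layer boundary.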

44 /*
45 * All the pdflush threads. Protected by pdflush_lock
46 */
47 static LIST_HEAD(pdflush_list);
48 static DEFINE_SPINLOCK(pdflush_lock);
49
50 /*
51 * The count of currently-running pdflush threads. Protected
52 * by pdflush_lock.
53 *
54 * Readable by sysctl, but not writable. Published to userspace at
55 * /proc/sys/vm/nr_pdflush_threads.
56 */
57 int nr_pdflush_threads = 0;
58
59 /*
60 * The time at which the pdflush thread pool last went empty
61 */
62 static unsigned long last_empty_jifs;
63

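Some necessary global variables are defined here. To avoid polluting the kernel namespace, variables that need not be exported are declared static, which restricts their scope to this compilation unit (the file pdflush.c). All idle pdflush threads are linked on the doubly linked list pdflush_list. nr_pdflush_threads counts the current pdflush threads (both active and idle). last_empty_jifs records the jiffies value at which the pool last went empty (no thread available). Every operation in the pool that needs mutual exclusion is protected by the spinlock pdflush_lock.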

64 /*
65 * The pdflush thread.
66 *
67 * Thread pool management algorithm:
68 *
69 * - The minimum and maximum number of pdflush instances are bound
70 *   by MIN_PDFLUSH_THREADS and MAX_PDFLUSH_THREADS.
71 *
72 * - If there have been no idle pdflush instances for 1 second, create
73 *   a new one.
74 *
75 * - If the least-recently-went-to-sleep pdflush thread has been asleep
76 *   for more than one second, terminate a thread.
77 */
78

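Another long comment. I don't know whether you are tired of them yet; I am getting a little weary myself. I only meant to say a few words about the races and did not expect to drag in this much! The comment describes the pool management algorithm: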

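1. The number of pdflush thread instances stays between MIN_PDFLUSH_THREADS and MAX_PDFLUSH_THREADS.
2. If the pool has had no idle thread for 1 second, a new thread is created.
3. If the thread that went to sleep least recently (the one that has slept longest) has been asleep for more than 1 second, one thread instance is terminated.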
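
The second and third rules can be sketched as two small predicates. This is only a user-space illustration, not kernel code: the names should_spawn, should_exit, MIN_THREADS, MAX_THREADS and TICKS_PER_SEC are made up for the example, and TICKS_PER_SEC merely stands in for HZ, whose real value depends on the kernel configuration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for MIN_PDFLUSH_THREADS, MAX_PDFLUSH_THREADS and HZ. */
#define MIN_THREADS   2
#define MAX_THREADS   8
#define TICKS_PER_SEC 100UL

/* Second rule: grow when the pool went empty more than a second ago,
 * is still empty, and there is room for another thread. */
bool should_spawn(unsigned long now, unsigned long last_empty,
                  bool idle_list_empty, int nr_threads)
{
    assert(nr_threads >= 0);
    return now - last_empty > TICKS_PER_SEC
        && idle_list_empty
        && nr_threads < MAX_THREADS;
}

/* Third rule: shrink when the longest-sleeping idle thread has been
 * asleep for more than a second, but never go below the minimum. */
bool should_exit(unsigned long now, unsigned long oldest_sleep_start,
                 bool idle_list_empty, int nr_threads)
{
    assert(nr_threads >= 0);
    if (idle_list_empty || nr_threads <= MIN_THREADS)
        return false;
    return now - oldest_sleep_start > TICKS_PER_SEC;
}
```

Both predicates take the time values as plain unsigned counters, the way the kernel code compares jiffies.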

79 /*
80 * A structure for passing work to a pdflush thread. Also for passing
81 * state information between pdflush threads. Protected by pdflush_lock.
82 */
83 struct pdflush_work {
84         struct task_struct *who;        /* The thread */
85         void (*fn)(unsigned long);      /* A callback function */
86         unsigned long arg0;             /* An argument to the callback */
87         struct list_head list;          /* On pdflush_list, when idle */
88         unsigned long when_i_went_to_sleep;
89 };
90

This is the node structure kept for each thread instance. It is simple enough to need no further comment.

Having gone through the basic data structures and variables, we start the analysis from the module_init entry point:

232 static int __init pdflush_init(void)
233 {
234         int i;
235
236         for (i = 0; i < MIN_PDFLUSH_THREADS; i++)
237                 start_one_pdflush_thread();
238         return 0;
239 }
240
241 module_init(pdflush_init);

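This creates MIN_PDFLUSH_THREADS pdflush thread instances. Note that there is a module_init() but no module_exit(). The implication is that even if this code were built as a kernel module, it could be loaded but never unloaded. See the implementation of sys_delete_module():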

File: kernel/module.c

   609      /* If it has an init func, it must have an exit func to unload */
   610      if ((mod->init != NULL && mod->exit == NULL)
   611          || mod->unsafe) {
   612          forced = try_force(flags);
   613          if (!forced) {
   614              /* This module can't be removed */
   615              ret = -EBUSY;
   616              goto out;
   617          }
   618      }

   498 #ifdef CONFIG_MODULE_FORCE_UNLOAD
   499 static inline int try_force(unsigned int flags)
   500 {
   501      int ret = (flags & O_TRUNC);
   502      if (ret)
   503          add_taint(TAINT_FORCED_MODULE);
   504      return ret;
   505 }
   506 #else
   507 static inline int try_force(unsigned int flags)
   508 {
   509      return 0;
   510 }
   511 #endif /* CONFIG_MODULE_FORCE_UNLOAD */

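So unless forced module unloading (CONFIG_MODULE_FORCE_UNLOAD) was selected at build time and the caller passes O_TRUNC, such a module cannot be unloaded. The option is dangerous; do not experiment with it. Back to pdflush: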

227 static void start_one_pdflush_thread(void)
228 {
229         kthread_run(pdflush, NULL, "pdflush");
230 }
231

kthread_run() creates the pdflush kernel thread instance through the kthread helper thread:

164 /*
165 * Of course, my_work wants to be just a local in __pdflush(). It is
166 * separated out in this manner to hopefully prevent the compiler from
167 * performing unfortunate optimisations against the auto variables. Because
168 * these are visible to other tasks and CPUs. (No problem has actually
169 * been observed. This is just paranoia).
170 */
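This comment is interesting. To keep the compiler from optimising the local variable my_work in unfortunate ways (for instance by keeping it in registers) even though other tasks and CPUs can see it, the work is split into pdflush() calling __pdflush(). The comment itself admits that no problem was ever observed. Using a local variable instead of dynamically allocated memory also pays off in both space and time.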
171 static int pdflush(void *dummy)
172 {
173         struct pdflush_work my_work;
174         cpumask_t cpus_allowed;
175
176         /*
177          * pdflush can spend a lot of time doing encryption via dm-crypt. We
178          * don't want to do that at keventd's priority.
179          */
180         set_user_nice(current, 0);
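Resets the nice value to 0 so that pdflush does not run at keventd's priority, which improves overall system responsiveness.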
181
182         /*
183          * Some configs put our parent kthread in a limited cpuset,
184          * which kthread() overrides, forcing cpus_allowed == CPU_MASK_ALL.
185          * Our needs are more modest - cut back to our cpusets cpus_allowed.
186          * This is needed as pdflush's are dynamically created and destroyed.
187          * The boottime pdflush's are easily placed w/o these 2 lines.
188          */
189         cpus_allowed = cpuset_cpus_allowed(current);
190         set_cpus_allowed(current, cpus_allowed);
Sets the mask of CPUs the thread is allowed to run on.
191
192         return __pdflush(&my_work);
193 }

91 static int __pdflush(struct pdflush_work *my_work)
92 {
93         current->flags |= PF_FLUSHER;
94         my_work->fn = NULL;
95         my_work->who = current;
96         INIT_LIST_HEAD(&my_work->list);
Some initialisation.
97
98         spin_lock_irq(&pdflush_lock);
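nr_pdflush_threads and pdflush_list are about to be modified, so the spinlock is taken. To be safe (work may be handed to pdflush from hard interrupt context), hard interrupts are disabled at the same time.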
99         nr_pdflush_threads++;
Increments nr_pdflush_threads, because there is now one more pdflush thread instance.
100         for ( ; ; ) {
101                 struct pdflush_work *pdf;
102
103                 set_current_state(TASK_INTERRUPTIBLE);
104                 list_move(&my_work->list, &pdflush_list);
105                 my_work->when_i_went_to_sleep = jiffies;
106                 spin_unlock_irq(&pdflush_lock);
107
108                 schedule();
The thread puts itself on the idle list pdflush_list and gives up the CPU until it is woken.
109                 if (try_to_freeze()) {
110                         spin_lock_irq(&pdflush_lock);
111                         continue;
112                 }
If the thread was frozen in the meantime, it retakes the lock and loops again.
113
114                 spin_lock_irq(&pdflush_lock);
115                 if (!list_empty(&my_work->list)) {
116                         printk("pdflush: bogus wakeup!\n");
117                         my_work->fn = NULL;
118                         continue;
119                 }
120                 if (my_work->fn == NULL) {
121                         printk("pdflush: NULL work function\n");
122                         continue;
123                 }
124                 spin_unlock_irq(&pdflush_lock);
The checks above handle unexpected wakeups.
125
126                 (*my_work->fn)(my_work->arg0);
127
Runs the work function with the argument arg0.
128                 /*
129                  * Thread creation: For how long have there been zero
130                  * available threads?
131                  */
132                 if (jiffies - last_empty_jifs > 1 * HZ) {
133                         /* unlocked list_empty() test is OK here */
134                         if (list_empty(&pdflush_list)) {
135                                 /* unlocked test is OK here */
136                                 if (nr_pdflush_threads < MAX_PDFLUSH_THREADS)
137                                         start_one_pdflush_thread();
138                         }
139                 }
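If more than 1 second has passed since the pool last went empty, pdflush_list is empty now, and the thread count still has room to grow, a new pdflush thread instance is started.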
140
141                 spin_lock_irq(&pdflush_lock);
142                 my_work->fn = NULL;
143
144                 /*
145                  * Thread destruction: For how long has the sleepiest
146                  * thread slept?
147                  */
148                 if (list_empty(&pdflush_list))
149                         continue;
If pdflush_list is still empty, loop again.
150                 if (nr_pdflush_threads <= MIN_PDFLUSH_THREADS)
151                         continue;
If the thread count is not above the minimum, loop again.
152                 pdf = list_entry(pdflush_list.prev, struct pdflush_work, list);
153                 if (jiffies - pdf->when_i_went_to_sleep > 1 * HZ) {
154                         /* Limit exit rate */
155                         pdf->when_i_went_to_sleep = jiffies;
156                         break;                                  /* exeunt */
157                 }
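If the last thread on pdflush_list has slept for more than 1 second, the system has probably become lightly loaded, so this thread exits. Why the last one? The list is used as a stack: idle threads are added at the head (line 104) and pdflush_operation() also takes them from the head, so the entry at the bottom of the stack (the tail) is necessarily the oldest.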
158         }
159         nr_pdflush_threads--;
160         spin_unlock_irq(&pdflush_lock);
161         return 0;
Decrements nr_pdflush_threads and exits the thread.
162 }
163
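
The stack-like use of the idle list can be tried out in user space. The sketch below is an illustration only: it uses a hand-written circular list rather than the kernel's list.h, and the names worker, go_idle, take_idle and longest_sleeper are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, in the spirit of the kernel's list_head. */
struct node { struct node *prev, *next; };

struct worker {
    int id;
    unsigned long went_to_sleep;
    struct node link;
};

/* Recover the worker that embeds a given list node. */
#define WORKER_OF(n) ((struct worker *)((char *)(n) - offsetof(struct worker, link)))

void idle_init(struct node *head) { head->prev = head->next = head; }

/* A worker that goes idle is pushed at the head of the list. */
void go_idle(struct node *head, struct worker *w, unsigned long now)
{
    w->went_to_sleep = now;
    w->link.next = head->next;
    w->link.prev = head;
    head->next->prev = &w->link;
    head->next = &w->link;
}

/* Dispatch also pops from the head, so the most recent sleeper is woken first. */
struct worker *take_idle(struct node *head)
{
    struct node *n = head->next;

    if (n == head)
        return NULL;            /* no idle worker available */
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->next = n->prev = n;
    return WORKER_OF(n);
}

/* The tail is never overtaken, so it is always the longest sleeper. */
struct worker *longest_sleeper(struct node *head)
{
    assert(head->prev != NULL);  /* list must have been initialised */
    return head->prev == head ? NULL : WORKER_OF(head->prev);
}
```

Because entries only ever enter and leave at the head, the sleep timestamps grow from tail to head, which is why checking the tail alone is enough for the exit test.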

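Is something missing? Indeed, SIGCHLD does not seem to be handled anywhere. In fact, threads created through kthread clean up after themselves: the parent never has to wait for them, and no zombie processes are left behind. See: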

File: kernel/workqueue.c

   200      /* SIG_IGN makes children autoreap: see do_notify_parent(). */
   201      sa.sa.sa_handler = SIG_IGN;
   202      sa.sa.sa_flags = 0;
   203      siginitset(&sa.sa.sa_mask, sigmask(SIGCHLD));
   204      do_sigaction(SIGCHLD, &sa, (struct k_sigaction *)0);

The sigaction manual page also describes the "consequences" of ignoring SIGCHLD in detail:

       POSIX.1-1990 disallowed setting the action for SIGCHLD to SIG_IGN.
       POSIX.1-2001 allows this possibility, so that ignoring SIGCHLD can
       be used to prevent the creation of zombies (see wait(2)). Nevertheless,
       the historical BSD and System V behaviours for ignoring
       SIGCHLD differ, so that the only completely portable method of
       ensuring that terminated children do not become zombies is to catch
       the SIGCHLD signal and perform a wait(2) or similar.

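The Linux kernel clearly follows the newer POSIX standard, which also gives us an "easy" way to avoid zombie processes. Be aware, though, that this technique is not portable.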
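
The same trick can be tried from user space. The following is a minimal sketch that assumes Linux (or another system with the POSIX.1-2001 behaviour quoted above); the function name child_is_autoreaped is made up for this example.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if an exited child was reaped automatically,
 * 0 if it could still be waited for, -1 on error. */
int child_is_autoreaped(void)
{
    struct sigaction ignore, saved;
    pid_t pid;
    int result;

    /* Ignore SIGCHLD, remembering the previous disposition. */
    ignore.sa_handler = SIG_IGN;
    ignore.sa_flags = 0;
    sigemptyset(&ignore.sa_mask);
    if (sigaction(SIGCHLD, &ignore, &saved) == -1)
        return -1;

    pid = fork();
    if (pid == -1) {
        sigaction(SIGCHLD, &saved, NULL);
        return -1;
    }
    if (pid == 0)
        _exit(0);               /* child: exit immediately */
    assert(pid > 0);            /* parent from here on */

    /* With SIGCHLD ignored, waitpid() blocks until the child is gone
     * and then fails with ECHILD, because no zombie was left behind. */
    if (waitpid(pid, NULL, 0) == -1 && errno == ECHILD)
        result = 1;
    else
        result = 0;

    sigaction(SIGCHLD, &saved, NULL);   /* restore the old disposition */
    return result;
}
```

On a system with the older BSD or System V semantics the function may return 0 instead, which is exactly the portability caveat the manual page warns about.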

Now go back to __pdflush() once more, this time looking at the races in it:

135                                 /* unlocked test is OK here */
136                                 if (nr_pdflush_threads < MAX_PDFLUSH_THREADS)
137                                         start_one_pdflush_thread();

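Testing the thread count without the lock cannot corrupt any data. But if several threads evaluate nr_pdflush_threads in parallel and all conclude that there is still room to grow, each of them calls start_one_pdflush_thread() to create a new pdflush instance, and the thread count can exceed MAX_PDFLUSH_THREADS; in the worst case it may reach about twice that limit. The window is widened by the fact that nr_pdflush_threads is only incremented later, by the new thread itself at line 99.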
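
One way to close this kind of check-then-act window is to make the test and the increment a single atomic step, so that a slot is reserved before any thread is created. The sketch below shows the idea in user-space C11; it is not what pdflush does, and the names nr_workers, reserve_worker_slot and release_worker_slot are assumptions made for the illustration. Inside the kernel the same effect could presumably be had by doing the test and the increment together under pdflush_lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_WORKERS 8           /* stand-in for MAX_PDFLUSH_THREADS */

static atomic_int nr_workers;   /* zero-initialised worker count */

/* Reserve a slot before creating a worker. The compare-and-swap makes the
 * limit test and the increment one atomic step, so concurrent callers can
 * never push the count past MAX_WORKERS. */
bool reserve_worker_slot(void)
{
    int cur = atomic_load(&nr_workers);

    while (cur < MAX_WORKERS) {
        /* On failure, cur is reloaded with the current value and retested. */
        if (atomic_compare_exchange_weak(&nr_workers, &cur, cur + 1))
            return true;
    }
    return false;
}

/* Give the slot back when a worker exits or could not be created. */
void release_worker_slot(void)
{
    int prev = atomic_fetch_sub(&nr_workers, 1);

    assert(prev > 0);
    (void)prev;
}
```

A caller would create the thread only after reserve_worker_slot() returns true, and call release_worker_slot() if the creation fails or when the worker exits.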

Now look at the lines that follow:

152                 pdf = list_entry(pdflush_list.prev, struct pdflush_work, list);
153                 if (jiffies - pdf->when_i_went_to_sleep > 1 * HZ) {
154                         /* Limit exit rate */
155                         pdf->when_i_went_to_sleep = jiffies;
156                         break;                                  /* exeunt */
157                 }

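Consider a sudden burst of requests that all finish at the same moment. None of the threads satisfies the test on line 153 when it finishes, so they all go back to sleep. If no new request arrives during the next n seconds, the pool stays at its peak size for those n seconds, because the exit test only runs after a thread has completed a piece of work. This violates the third rule of the pool algorithm, which says a thread should be terminated once the longest-sleeping one has been asleep for more than a second.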
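
The effect is easy to reproduce with a toy model of the pool. The sketch below is a simplified user-space simulation under stated assumptions: it tracks only counters instead of real threads, and pool_model, finish_work, dispatch, MIN_THREADS and TICKS_PER_SEC are names invented for the example.

```c
#include <assert.h>

#define MIN_THREADS   2         /* stand-in for MIN_PDFLUSH_THREADS */
#define TICKS_PER_SEC 100UL     /* stand-in for HZ */

struct pool_model {
    int nr_threads;             /* running plus idle threads */
    int nr_idle;                /* threads on the idle list */
    unsigned long oldest_sleep; /* when the longest sleeper went idle */
};

/* One thread finishes its work at time `now`.
 * Returns 1 if it exits, 0 if it goes back to sleep. */
int finish_work(struct pool_model *p, unsigned long now)
{
    assert(p->nr_idle < p->nr_threads);     /* the caller itself is running */
    if (p->nr_idle > 0 && p->nr_threads > MIN_THREADS &&
        now - p->oldest_sleep > TICKS_PER_SEC) {
        p->oldest_sleep = now;              /* limit the exit rate */
        p->nr_threads--;
        return 1;
    }
    if (p->nr_idle == 0)
        p->oldest_sleep = now;              /* first sleeper is the oldest */
    p->nr_idle++;
    return 0;
}

/* Hand work to one idle thread. Returns 0 on success, -1 if none is idle. */
int dispatch(struct pool_model *p)
{
    if (p->nr_idle == 0)
        return -1;
    p->nr_idle--;
    return 0;
}
```

If eight threads call finish_work() at tick 0, all of them go idle. Nothing in the model changes while time passes, so the pool still holds eight threads many seconds later; only when a new request is dispatched and finished does a single thread exit.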

195 /*
196 * Attempt to wake up a pdflush thread, and get it to do some work for you.
197 * Returns zero if it indeed managed to find a worker thread, and passed your
198 * payload to it.
199 */
200 int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0)
201 {
202         unsigned long flags;
203         int ret = 0;
204
205         if (fn == NULL)
206                 BUG();          /* Hard to diagnose if it's deferred */
207
208         spin_lock_irqsave(&pdflush_lock, flags);
209         if (list_empty(&pdflush_list)) {
210                 spin_unlock_irqrestore(&pdflush_lock, flags);
211                 ret = -1;
212         } else {
213                 struct pdflush_work *pdf;
214
215                 pdf = list_entry(pdflush_list.next, struct pdflush_work, list);
216                 list_del_init(&pdf->list);
217                 if (list_empty(&pdflush_list))
218                         last_empty_jifs = jiffies;
219                 pdf->fn = fn;
220                 pdf->arg0 = arg0;
221                 wake_up_process(pdf->who);
222                 spin_unlock_irqrestore(&pdflush_lock, flags);
223         }
224         return ret;
225 }
226

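The function above hands work to a pdflush thread. If an idle thread is available, it assigns the work to that thread and wakes it up to run it; otherwise it returns -1.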

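Summary:

Kernel programming demands rigorous thinking. The slightest carelessness can cause surprises, and however short the code is, it has to be written with great care. The pdflush thread pool does contain the two races described above, but neither has very serious consequences. They merely violate the stated design, so the implementation cannot be held up as a model.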

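Note:

This article uses "kernel thread", "thread" and "process" interchangeably, but all of them mean "kernel thread". This is harmless: "thread" is short for "kernel thread", and a kernel thread is essentially one of a group of "processes" that share the kernel's data space, so the terms can be swapped in some contexts.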

Original: http://blog.chinaunix.net/u/5251/showart_320793.html

posted @ 2010-12-24 17:31 天不会黑
