Linux Kernel Debugging Techniques: softlockup detector, the Lockup Detection Mechanism for R-State Tasks in Process Context
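1 Introduction
Kernel stability is both a foundation of kernel security and an essential skill for working on it. Quite often, a kernel stability problem turns out to be the root cause of a system security problem.
When abnormal deadlocks, hangs, or freezes occur, the value of a watchdog becomes clear. A watchdog monitors how the system is running; once one of these conditions occurs, it reboots the system and collects a crash dump (the runtime data saved at the time of the crash).
Watchdogs come in hardware and software forms. A hardware watchdog is a hardware module; when it fires it resets the CPU directly, so no crash dump can be collected. A software watchdog can instead trigger a panic and collect a crash dump. In the Linux kernel the software watchdogs are the softlockup detector and the hardlockup detector; the Android framework also has its own watchdog.
This article covers the softlockup detector.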
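2 How It Works
- Assume that the state of some variable reflects the state of the system, for example an interrupt count (if a high-priority interrupt does not occur, the CPU is considered stuck) or the /dev/watchdog timestamp (if nothing has been written to the watchdog node by the time the timeout expires, user space is considered stuck).
- Start a watchdog program that observes the variable periodically, decides whether the system is healthy, and takes the corresponding action.
- The kernel-mode watchdog is mainly used to detect kernel lockups. A lockup means that some piece of kernel code holds on to the CPU, so the scheduler on that CPU cannot schedule anything else. In severe cases the whole system freezes.
- A lockup involves kernel threads and the timer interrupt, which have different priorities: kernel thread < timer interrupt. A kernel thread can be scheduled out or interrupted.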
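The first two points can be illustrated with a minimal userspace sketch in C. It is not kernel code: the toy_watchdog type and its two functions are hypothetical names introduced here, and time is passed in as plain seconds. It only shows the pattern of one side recording progress while the other side periodically compares the last record against a timeout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: one observed heartbeat plus a timeout, in seconds. */
struct toy_watchdog {
    uint64_t last_feed;  /* time of the most recent heartbeat */
    uint64_t timeout;    /* allowed silence before declaring a hang */
};

/* The monitored side calls this whenever it makes progress. */
static void toy_watchdog_feed(struct toy_watchdog *wd, uint64_t now)
{
    wd->last_feed = now;
}

/* The watchdog side calls this periodically; true means "looks hung". */
static bool toy_watchdog_expired(const struct toy_watchdog *wd, uint64_t now)
{
    assert(now >= wd->last_feed);  /* this model assumes time never goes backwards */
    return now - wd->last_feed > wd->timeout;
}
```

In the softlockup detector the heartbeat role is played by a per-CPU timestamp and the periodic check by a timer callback, as the following sections describe.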
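3 Lockup Classification
- Normally only kernel code (interrupt context, or thread context with preemption disabled) can cause a lockup, because user code can always be preempted. The one exception is a SCHED_FIFO real-time process with priority 99. A SCHED_FIFO task keeps the CPU until it blocks, yields, or is preempted by a higher-priority task, so as long as none of these happens it can keep the [watchdog/x] kernel thread from getting the CPU and cause a soft lockup.
- The kernel code must also be running with kernel preemption disabled. Linux is preemptible, and preemption is disabled only in specific code regions (for example while a spinlock is held); only there can a lockup form.
Lockups fall into two types: soft lockup and hard lockup.
- A soft lockup occurs when a CPU cannot schedule other threads: some code keeps occupying the CPU, so the [watchdog/x] kernel thread never gets scheduled, while interrupts are still serviced.
- A hard lockup occurs when interrupts cannot be serviced, that is, interrupts are disabled for too long or an interrupt handler runs for too long.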
4 softlockup detector in Detail
Adding the following code to a driver triggers a soft lockup: spin_lock() disables preemption, so the [watchdog/x] thread on that CPU can no longer be scheduled.
static spinlock_t spinlock;

spin_lock_init(&spinlock);
spin_lock(&spinlock);    /* disables preemption on this CPU */
while (1)
    ;                    /* spin forever; [watchdog/x] never gets to run */
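- The softlockup detector uses the smp_hotplug_thread mechanism to create one kernel thread [watchdog/X] per CPU core, where X is the CPU ID. The thread runs at SCHED_FIFO priority 99, the highest priority, and such threads have p->flags |= PF_NO_SETAFFINITY set, so user space cannot change their CPU affinity. Every 4 s the thread updates the variable watchdog_touch_ts to the current timestamp, which is called feeding the dog.
- Each CPU is also given a high-resolution timer (hrtimer), created when [watchdog/X] first runs. Every 4 s (sample_period, 4 seconds by default) its handler checks whether watchdog_touch_ts has been updated. If the variable has not been updated for 20 s, the CPU is stuck and the watchdog thread cannot be scheduled.
The hrtimer handler is watchdog_timer_fn() in kernel/watchdog.c.
The handler mainly does the following:
- Increments hrtimer_interrupts; the hard lockup detector also uses this variable to determine whether the CPU is responding to interrupts.
- Wakes the [watchdog/x] kernel thread to feed the dog. The thread performs the assignment soft_lockup_hrtimer_cnt = hrtimer_interrupts and sets watchdog_touch_ts to the current system timestamp. Both are assignments, not ++ operations.
- Checks whether watchdog_touch_ts has been updated. If it has not been updated for more than 20 s, [watchdog/x] has not run, a soft lockup has occurred, and something is hogging the CPU.
Note: the purpose of the kernel thread [watchdog/x] is to update watchdog_touch_ts, which is the object being watched. The real watchdog is watchdog_timer_fn(), which is triggered by the hrtimer interrupt and wakes the feeding thread. [watchdog/x] is run by the scheduler, whereas watchdog_timer_fn() is driven by an interrupt.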
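The interplay between the timer callback and the feeding thread can be modeled in userspace as a rough sketch. Everything here is an assumption made for illustration: the cpu_model struct only mirrors the per-CPU variable names from the text, time advances in whole seconds, and a thread_can_run flag stands in for the scheduler. The real logic lives in kernel/watchdog.c and is covered in section 5.

```c
#include <assert.h>
#include <stdbool.h>

#define SAMPLE_PERIOD_S   4   /* hrtimer period from the text */
#define SOFTLOCKUP_THRESH 20  /* seconds without a feed before reporting */

/* Userspace stand-in for the per-CPU variables named in the text. */
struct cpu_model {
    unsigned long hrtimer_interrupts;
    unsigned long soft_lockup_hrtimer_cnt;
    unsigned long watchdog_touch_ts;   /* seconds */
};

/* What [watchdog/x] does when it gets the CPU: sync the counter, stamp the time. */
static void model_feed(struct cpu_model *c, unsigned long now)
{
    c->soft_lockup_hrtimer_cnt = c->hrtimer_interrupts;
    c->watchdog_touch_ts = now;
}

/*
 * What the timer callback does on each tick. thread_can_run models whether
 * the scheduler manages to run the woken watchdog thread before the next tick.
 * Returns the stall duration in seconds, or 0 if no soft lockup is detected.
 */
static unsigned long model_timer_tick(struct cpu_model *c, unsigned long now,
                                      bool thread_can_run)
{
    unsigned long touch_ts = c->watchdog_touch_ts;  /* sampled before the wake-up */

    assert(now >= touch_ts);
    c->hrtimer_interrupts++;
    if (thread_can_run)
        model_feed(c, now);

    return (now > touch_ts + SOFTLOCKUP_THRESH) ? now - touch_ts : 0;
}

/* Tick every 4 s up to end; return the time of the first report, or 0 if none. */
static unsigned long model_first_report(unsigned long stall_from, unsigned long end)
{
    struct cpu_model c = { 0, 0, 0 };

    for (unsigned long now = SAMPLE_PERIOD_S; now <= end; now += SAMPLE_PERIOD_S)
        if (model_timer_tick(&c, now, now < stall_from))
            return now;
    return 0;
}
```

With a stall beginning at t = 10 s, the last feed in this model happens at the 8 s tick, so the first tick whose time exceeds 8 + 20 is the one at 32 s.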
5 Source Code Analysis
5.1 Global Variables and Data Structures
/* Global variables, exported for sysctl */
unsigned int __read_mostly softlockup_panic = CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE;
static bool softlockup_threads_initialized __read_mostly;
static u64 __read_mostly sample_period; /* timer period, 4 seconds by default */
/* The per-CPU variables below exist once per CPU */
static DEFINE_PER_CPU(unsigned long, watchdog_touch_ts); /* timestamp updated by the [watchdog/x] thread */
static DEFINE_PER_CPU(struct task_struct *, softlockup_watchdog); /* task_struct pointer of the [watchdog/x] thread */
static DEFINE_PER_CPU(struct hrtimer, watchdog_hrtimer); /* hrtimer control structure paired with the [watchdog/x] thread */
static DEFINE_PER_CPU(bool, softlockup_touch_sync);
static DEFINE_PER_CPU(bool, soft_watchdog_warn); /* true once a soft lockup has been reported, false otherwise */
static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts); /* incremented by the hrtimer callback watchdog_timer_fn */
static DEFINE_PER_CPU(unsigned long, soft_lockup_hrtimer_cnt); /* the [watchdog/x] thread sets soft_lockup_hrtimer_cnt = hrtimer_interrupts */
static DEFINE_PER_CPU(struct task_struct *, softlockup_task_ptr_saved); /* current, saved by watchdog_timer_fn when a soft lockup occurs; not the task_struct of [watchdog/x] */
static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts_saved); /* used by the hardlockup detector */
static unsigned long soft_lockup_nmi_warn;
static struct smp_hotplug_thread watchdog_threads = {
    .store = &softlockup_watchdog, /* per-CPU task_struct pointer of [watchdog/x], defined above */
    .thread_should_run = watchdog_should_run, /* returns hrtimer_interrupts != soft_lockup_hrtimer_cnt, i.e. true when they differ */
    .thread_fn = watchdog, /* feeding callback of [watchdog/x]; runs only after thread_should_run returns true */
    .thread_comm = "watchdog/%u",
    .setup = watchdog_enable, /* runs when [watchdog/x] is first activated, before thread_fn (feeding); starts the hrtimer and sets [watchdog/x] to FIFO 99 */
    .cleanup = watchdog_cleanup, /* calls watchdog_disable below; runs when kthread_stop() is called on this kthread, typically from module_exit, which normally does not happen for the softlockup detector */
    .park = watchdog_disable, /* undoes watchdog_enable: cancels the timer and resets [watchdog/x] to CFS, priority 0 */
    .unpark = watchdog_enable,
};
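5.2 Creating the [watchdog/x] Threads (per-CPU threads)
The softlockup detector creates its per-CPU threads through the smp_hotplug_thread mechanism. For details of that mechanism, see the article "Linux内核机制smp_hotplug_thread" (Linux kernel mechanism: smp_hotplug_thread).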
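kernel_init_freeable
    lockup_detector_init    /* called before do_basic_setup(), so before the initcalls run by do_initcalls(); early_initcall functions have already run in do_pre_smp_initcalls() */
        lockup_detector_setup:
            ret = smpboot_register_percpu_thread_cpumask(&watchdog_threads, &watchdog_allowed_mask);
            softlockup_threads_initialized = true;
            lockup_detector_reconfigure();    /* initializes the detector configuration, including the timer period */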
/** smpboot.c
 * smpboot_register_percpu_thread_cpumask - Register a per_cpu thread related to hotplug
 * @plug_thread: Hotplug thread descriptor
 * @cpumask: The cpumask where threads run
 *
 * Creates and starts the threads on all online cpus.
 */
int smpboot_register_percpu_thread_cpumask(struct smp_hotplug_thread *plug_thread, const struct cpumask *cpumask)
{
    unsigned int cpu;
    int ret = 0;
    .................
    for_each_online_cpu(cpu) {
        __smpboot_create_thread(plug_thread, cpu); /* create the per-CPU thread, i.e. [watchdog/x] */
        smpboot_unpark_thread(plug_thread, cpu); /* unpark the per-CPU thread */
    }
    list_add(&plug_thread->list, &hotplug_threads); /* add to hotplug_threads so the core can manage it */
    ............
}
static int __smpboot_create_thread(struct smp_hotplug_thread *ht, unsigned int cpu)
{
    struct task_struct *tsk = *per_cpu_ptr(ht->store, cpu);
    struct smpboot_thread_data *td;
    td = kzalloc_node(sizeof(*td), GFP_KERNEL, cpu_to_node(cpu));
    td->cpu = cpu;
    td->ht = ht;
    tsk = kthread_create_on_cpu(smpboot_thread_fn, td, cpu, ht->thread_comm); /* create the per-CPU thread, i.e. [watchdog/x] */
    kthread_park(tsk);
    get_task_struct(tsk);
    *per_cpu_ptr(ht->store, cpu) = tsk; /* save the thread's task_struct pointer in the per-CPU variable defined above */
    return 0;
}
static int smpboot_thread_fn(void *data)
{
    struct smpboot_thread_data *td = data;
    struct smp_hotplug_thread *ht = td->ht;
    while (1) {
        set_current_state(TASK_INTERRUPTIBLE);
        preempt_disable(); /* disable preemption */
        ............................
        /* Check for state change setup */
        switch (td->status) {
        case HP_THREAD_NONE:
            __set_current_state(TASK_RUNNING);
            preempt_enable();
            if (ht->setup) /* call setup, i.e. watchdog_enable above, which starts the hrtimer and sets [watchdog/x] to FIFO 99 */
                ht->setup(td->cpu);
            td->status = HP_THREAD_ACTIVE;
            continue;
        case HP_THREAD_PARKED:
            __set_current_state(TASK_RUNNING);
            preempt_enable();
            if (ht->unpark)
                ht->unpark(td->cpu);
            td->status = HP_THREAD_ACTIVE;
            continue;
        }
        if (!ht->thread_should_run(td->cpu)) { /* thread_should_run returns hrtimer_interrupts != soft_lockup_hrtimer_cnt */
            preempt_enable_no_resched();
            schedule();
        } else {
            __set_current_state(TASK_RUNNING);
            preempt_enable();
            ht->thread_fn(td->cpu); /* the counters differ, so run thread_fn, i.e. the feeding function watchdog() */
        }
    }
}
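The control flow of the loop above can be condensed into a small userspace model. This is only a sketch under stated assumptions: the hp_model struct, the hp_status enum, and hp_model_step() are hypothetical names, the callbacks are replaced by call counters, and sleeping in schedule() is represented by a return value of 1. Park handling from the elided part of the loop is not modeled.

```c
#include <assert.h>

enum hp_status { HP_NONE, HP_ACTIVE, HP_PARKED };

/* Hypothetical stand-in for one per-CPU hotplug thread and its counters. */
struct hp_model {
    enum hp_status status;
    int setup_calls;    /* how often setup() would have run */
    int unpark_calls;   /* how often unpark() would have run */
    int fn_calls;       /* how often thread_fn() would have run */
    unsigned long hrtimer_interrupts;
    unsigned long soft_lockup_hrtimer_cnt;
};

/* One pass of the loop body; returns 1 if the thread would go to sleep. */
static int hp_model_step(struct hp_model *m)
{
    assert(m);

    switch (m->status) {
    case HP_NONE:                 /* first run: setup, then become active */
        m->setup_calls++;
        m->status = HP_ACTIVE;
        return 0;
    case HP_PARKED:               /* coming back from park: unpark first */
        m->unpark_calls++;
        m->status = HP_ACTIVE;
        return 0;
    case HP_ACTIVE:
        break;
    }

    /* Models thread_should_run(): is there a timer tick not yet acknowledged? */
    if (m->hrtimer_interrupts == m->soft_lockup_hrtimer_cnt)
        return 1;                 /* nothing to do: the real loop calls schedule() */

    /* Models thread_fn(): acknowledge the tick (feeding). */
    m->soft_lockup_hrtimer_cnt = m->hrtimer_interrupts;
    m->fn_calls++;
    return 0;
}
```

The model makes one property easy to see: setup runs exactly once before any feeding, and feeding happens only after the timer has advanced hrtimer_interrupts.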
5.3 Creating the hrtimer
As the smpboot_thread_fn handling above shows, the hrtimer is created when the kernel thread first runs and invokes the smp_hotplug_thread::setup() callback, that is, watchdog_enable:
static void watchdog_enable(unsigned int cpu)
{
    struct hrtimer *hrtimer = this_cpu_ptr(&watchdog_hrtimer);
    /*
     * Start the timer first to prevent the NMI watchdog triggering
     * before the timer has a chance to fire.
     */
    hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
    hrtimer->function = watchdog_timer_fn;
    hrtimer_start(hrtimer, ns_to_ktime(sample_period), HRTIMER_MODE_REL_PINNED);
    /* Initialize timestamp */
    __touch_watchdog(); /* feed once to initialize watchdog_touch_ts, which is 0 because the definition above has no initializer */
    /* threads created by the smp_hotplug_thread mechanism start as CFS 120; switch to FIFO 99 here */
    watchdog_set_prio(SCHED_FIFO, MAX_RT_PRIO - 1);
}
After the steps in sections 5.2 and 5.3, the system has one kernel thread per CPU core, each at FIFO priority 99:
sh-3.2# ps | grep watchdog
12 root 0 SW [watchdog/0]
15 root 0 SW [watchdog/1]
21 root 0 SW [watchdog/2]
27 root 0 SW [watchdog/3]
sh-3.2# chrt -p 12
pid 12's current scheduling policy: SCHED_FIFO
pid 12's current scheduling priority: 99
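5.4 Periodic Feeding
5.4.1 hrtimer Period
During detector init (also mentioned in section 5.2), the timer period sample_period (a global variable) is configured. It is computed as watchdog_thresh times 2 divided by 5, converted to nanoseconds. watchdog_thresh defaults to 10 seconds (adjustable through /proc/sys/kernel/watchdog_thresh), so the timer fires every 4 seconds and the dog is fed every 4 seconds. See the code below: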
lockup_detector_init
    lockup_detector_setup:
        ret = smpboot_register_percpu_thread_cpumask(&watchdog_threads, &watchdog_allowed_mask);
        softlockup_threads_initialized = true;
        lockup_detector_reconfigure():    /* initializes the detector configuration, including the timer period */
            set_sample_period
static int get_softlockup_thresh(void)
{
    return watchdog_thresh * 2; /* watchdog_thresh defaults to 10 seconds; adjustable through /proc/sys/kernel/watchdog_thresh */
}
static void set_sample_period(void)
{
    sample_period = get_softlockup_thresh() * ((u64)NSEC_PER_SEC / 5); /* 20 / 5 == 4 seconds, converted to nanoseconds, used as the hrtimer sampling period */
    watchdog_update_hrtimer_threshold(sample_period); /* used by the hardlockup detector; not relevant here */
}
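The arithmetic in these two functions can be checked in userspace. The sketch below reproduces only the calculation; softlockup_thresh_s() and sample_period_ns() are hypothetical names chosen for this example, and watchdog_thresh is passed in as a parameter rather than read from the sysctl.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Soft lockup threshold in seconds: twice watchdog_thresh. */
static int softlockup_thresh_s(int watchdog_thresh)
{
    return watchdog_thresh * 2;
}

/* hrtimer period in nanoseconds: one fifth of the soft lockup threshold. */
static uint64_t sample_period_ns(int watchdog_thresh)
{
    assert(watchdog_thresh >= 0);
    return (uint64_t)softlockup_thresh_s(watchdog_thresh) * (NSEC_PER_SEC / 5);
}
```

For the default watchdog_thresh of 10 this gives a 20 s threshold and a period of 4,000,000,000 ns (4 s), so the timer samples five times within one threshold window.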
5.4.2 Feeding Triggered by the hrtimer
Feeding mainly does two things:
- soft_lockup_hrtimer_cnt = hrtimer_interrupts
- watchdog_touch_ts = get_timestamp(); /* feed the dog */
/* watchdog_timer_fn is the callback bound to the hrtimer in section 5.3; it runs every 4 seconds: */
static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
{
    unsigned long touch_ts = __this_cpu_read(watchdog_touch_ts);
    /* kick the softlockup detector */
    /* Wake the kernel thread [watchdog/x]; it uses thread_should_run to decide whether to call thread_fn, i.e. the feeding function watchdog(). See smpboot_thread_fn. */
    /* thread_should_run returns hrtimer_interrupts != soft_lockup_hrtimer_cnt, i.e. true when they differ */
    wake_up_process(__this_cpu_read(softlockup_watchdog)); /* wake the [watchdog/X] thread to feed the dog */
    /* .. and repeat */
    hrtimer_forward_now(hrtimer, ns_to_ktime(sample_period)); /* re-arm the timer to fire again in 4 seconds */
    .....
}
/*
 * The two functions below run in the context of the [watchdog/X] thread and are where the dog is actually fed.
 * The function above runs in hrtimer context; it wakes the [watchdog/X] thread and checks for a soft lockup.
 * When a soft lockup is reported, dump_stack runs in that hrtimer context, so the stack shown belongs to
 * the thread that was running on the CPU when the soft lockup occurred.
 */
static void __touch_watchdog(void)
{
    __this_cpu_write(watchdog_touch_ts, get_timestamp()); /* feed the dog */
}
static void watchdog(unsigned int cpu)
{
    __this_cpu_write(soft_lockup_hrtimer_cnt, __this_cpu_read(hrtimer_interrupts));
    __touch_watchdog(); /* feed the dog */
}
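5.5 softlockup Detection
Every 4 seconds the hrtimer checks whether a soft lockup has occurred. The detection logic is:
If watchdog_touch_ts has not been updated for 20 s, a soft lockup is considered to have occurred. See sections 2 and 4 for the underlying principle.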
/* Global variables, exported for sysctl */
unsigned int __read_mostly softlockup_panic = CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE;
/* watchdog_timer_fn is the callback bound to the hrtimer in section 5.3; it runs every 4 seconds: */
static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
{
    unsigned long touch_ts = __this_cpu_read(watchdog_touch_ts);
    int duration;
    watchdog_interrupt_count(); /* increment hrtimer_interrupts */
    wake_up_process(__this_cpu_read(softlockup_watchdog)); /* wake the feeding thread [watchdog/X], see section 5.4.2 */
    /* .. and repeat: re-arm the timer to fire again in 4 seconds */
    hrtimer_forward_now(hrtimer, ns_to_ktime(sample_period));
    /* check for a softlockup
     * This is done by making sure a high priority task is
     * being scheduled. The task touches the watchdog to
     * indicate it is getting cpu time. If it hasn't then
     * this is a good indication some task is hogging the cpu
     *
     * In other words: if a CPU is stuck, its watchdog thread is not scheduled,
     * so watchdog_touch_ts is not updated. If watchdog_touch_ts has not been
     * updated for 20 s, a soft lockup is considered to have occurred.
     * is_softlockup() works as follows:
     * if watchdog_touch_ts has not been updated for more than 20 seconds, it returns
     * duration = current timestamp - watchdog_touch_ts; otherwise it returns 0.
     * duration > 0 means a soft lockup has occurred.
     */
    duration = is_softlockup(touch_ts);
    if (unlikely(duration)) {
        pr_emerg("BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n",
            smp_processor_id(), duration,
            current->comm, task_pid_nr(current));
        __this_cpu_write(softlockup_task_ptr_saved, current);
        print_modules();
        print_irqtrace_events(current);
        dump_stack();
        add_taint(TAINT_SOFTLOCKUP, LOCKDEP_STILL_OK);
        if (softlockup_panic) /* panic when softlockup_panic is 1; its default comes from CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE */
            panic("softlockup: hung tasks");
        __this_cpu_write(soft_watchdog_warn, true);
    } else
        __this_cpu_write(soft_watchdog_warn, false);
    return HRTIMER_RESTART;
}
static int is_softlockup(unsigned long touch_ts)
{
    unsigned long now = get_timestamp();
    /* If a CPU is stuck, its watchdog thread is not scheduled, so watchdog_touch_ts */
    /* is not updated. If it has not been updated for 20 s, a soft lockup is considered to have occurred. */
    /* get_softlockup_thresh() returns 20 with the default watchdog_thresh */
    if (time_after(now, touch_ts + get_softlockup_thresh()))
        return now - touch_ts;
    return 0;
}
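The threshold comparison in is_softlockup() can be tried out in userspace. The following is a sketch, not the kernel implementation: model_time_after() follows the signed-difference idea behind the kernel's time_after() macro, while model_is_softlockup() takes now and the threshold as parameters (both hypothetical names) so that it can be called with arbitrary values.

```c
#include <assert.h>

/*
 * True if a is later than b. The comparison uses a signed difference,
 * so it stays correct when the unsigned counter wraps around.
 */
static int model_time_after(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}

/* Returns the stall duration in seconds, or 0 if the threshold is not exceeded. */
static unsigned long model_is_softlockup(unsigned long now,
                                         unsigned long touch_ts,
                                         unsigned long thresh)
{
    assert(thresh > 0);
    if (model_time_after(now, touch_ts + thresh))
        return now - touch_ts;
    return 0;
}
```

Note that the comparison is strict: with a 20 s threshold, a gap of exactly 20 s is not reported, and the first reported duration is 21 s. Since the timer samples every 4 s, real reports typically show a value slightly above the threshold.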
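6 Troubleshooting Approach
Soft lockup log:
BUG: soft lockup - CPU#2 stuck for 21s! [taskname:pid]
This log says that a process or thread has been running continuously for more than 20 s (21 s here), preventing other processes and threads from being scheduled. The main causes of a soft lockup are:
- An infinite loop in thread context (for example a for loop with a wrong exit condition) in a high-priority FIFO 99 task. This can happen in kernel mode or in user mode.
- Preemption disabled for too long in kernel mode. User mode cannot disable preemption, and disabling interrupts does not count here because that leads to a hard lockup rather than a soft lockup. Kernel or driver modules can disable preemption with preempt_disable(), and some APIs disable it implicitly, such as spin_lock, which calls preempt_disable internally. Misusing a spinlock leads to a soft lockup, for example a critical section that runs for more than 20 seconds, or nested acquisition in the wrong order that ends in a deadlock.
When handling this kind of problem, the following guidelines help:
- Check whether watchdog_touch_ts was updated by the watchdog thread within the last 20 seconds (watchdog_thresh * 2). If not, the watchdog thread is not being scheduled, most likely because some CPU has had preemption disabled for too long or a FIFO 99 thread has been running for too long.
- In this situation the system usually does not die but becomes very slow. If the kernel parameter softlockup_panic (whose default is set by CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC) is 1, the system panics; otherwise only the warning is printed.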
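When scanning collected logs for such reports, the fields of the log line can be extracted with a small helper. The sketch below assumes the line follows the pr_emerg format string shown in section 5.5; struct softlockup_report and parse_softlockup_line() are hypothetical names, and any prefix in front of "BUG:" (such as a timestamp) is skipped. The task name is split from the PID at the last colon, because names like kworker/2:1 contain colons themselves.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Fields of one soft lockup report (hypothetical helper type). */
struct softlockup_report {
    int cpu;
    unsigned int seconds;
    char comm[64];
    int pid;
};

/* Returns 1 and fills out if line contains a soft lockup report, else 0. */
static int parse_softlockup_line(const char *line, struct softlockup_report *out)
{
    char inside[64];   /* text between '[' and ']' */
    const char *p;
    char *colon;

    assert(line && out);

    p = strstr(line, "BUG: soft lockup - CPU#");
    /* %63[^]] reads everything up to, but not including, the closing ']'. */
    if (!p || sscanf(p, "BUG: soft lockup - CPU#%d stuck for %us! [%63[^]]",
                     &out->cpu, &out->seconds, inside) != 3)
        return 0;

    colon = strrchr(inside, ':');              /* last ':' separates comm and pid */
    if (!colon || sscanf(colon + 1, "%d", &out->pid) != 1)
        return 0;
    *colon = '\0';
    snprintf(out->comm, sizeof(out->comm), "%s", inside);
    return 1;
}
```

The reported task is the one that was current on the stuck CPU when the timer fired, so it is usually the first place to look.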
7 Kernel Configuration
- Enable soft lockup detection:
CONFIG_LOCKUP_DETECTOR=y #selected by CONFIG_SOFTLOCKUP_DETECTOR
CONFIG_SOFTLOCKUP_DETECTOR=y
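- Panic when a soft lockup occurs (default n):
CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=y, or change it at runtime with echo 1 >/proc/sys/kernel/softlockup_panic
- By default the system only prints the warning when a soft lockup occurs. To get a panic instead, the following can also be set from user space: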
echo 1 > /proc/sys/kernel/softlockup_panic
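cat /proc/sys/kernel/watchdog_thresh    # default 10 s
[!NOTE]
watchdog_thresh defaults to 10 s. If watchdog_touch_ts is not updated for more than (watchdog_thresh*2) seconds, the kernel reports a soft lockup, and panics if softlockup_panic is 1. The maximum value is 60 s, in which case the report is triggered after 120 s without an update to watchdog_touch_ts.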