Linux Kernel Debugging Techniques: softlockup detector, a Lockup Detection Mechanism for Process-Context Tasks Stuck in R State

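1 Introduction

Kernel stability is both a foundation of kernel security and a required skill for studying it. In many cases a kernel stability bug is the root cause of a system security problem.

When a system deadlocks, hangs, or freezes, a watchdog proves its worth. A watchdog monitors how the system is running; when one of these failures occurs, it reboots the system and collects a crash dump (the runtime data saved when a program crashes).

Watchdogs come in hardware and software forms. A hardware watchdog is a hardware module that resets the CPU directly when it fires, so no crash dump can be collected. A software watchdog can trigger a panic and collect a crash dump. In the Linux kernel the software watchdogs are the softlockup detector and the hardlockup detector; the Android framework has its own watchdog as well.

This article covers the softlockup detector.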

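2 How it works

Assume the state of some variable reflects the state of the system. Examples are an interrupt count (if a high-priority interrupt has stopped firing, the CPU is considered stuck) or the /dev/watchdog timestamp (if nothing has been written to the watchdog node by the time the timeout expires, user space is considered stuck).
Start a watchdog program that observes the variable periodically, decides whether the system is healthy, and takes the appropriate action.
The kernel-mode watchdog is mainly used to detect kernel lockups. A lockup means that some kernel code keeps hold of a CPU without releasing it, so the scheduler cannot schedule anything else on that CPU. In severe cases the whole system freezes.
A lockup involves kernel threads and the timer interrupt, which have different priorities: kernel thread < timer interrupt. A kernel thread can be scheduled out or interrupted.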
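
The "observe a variable periodically" idea can be sketched in plain user-space C. This is only a model of the principle, not kernel code: struct progress_monitor, monitor_check() and the max_stale parameter are hypothetical names chosen for illustration.

```c
#include <stdbool.h>

/* Hypothetical model of a watchdog that observes one progress variable. */
struct progress_monitor {
    unsigned long last_seen;    /* value observed at the previous check   */
    unsigned int  stale_checks; /* consecutive checks that saw no change  */
};

/*
 * Called periodically by the watchdog. Returns true once the observed
 * variable has stayed unchanged for max_stale consecutive checks.
 */
bool monitor_check(struct progress_monitor *m, unsigned long current,
                   unsigned int max_stale)
{
    if (current != m->last_seen) {
        /* Progress was made: remember the new value and start over. */
        m->last_seen = current;
        m->stale_checks = 0;
        return false;
    }
    m->stale_checks++;
    return m->stale_checks >= max_stale;
}
```

A caller would feed it the current interrupt count (or any other progress counter) on every check and take action, such as logging or rebooting, when it returns true.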

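3 Lockup categories

Normally only kernel code (interrupt context, or thread context with preemption disabled) can cause a lockup, because user code can always be preempted. The one exception is a real-time process running SCHED_FIFO at priority 99. A SCHED_FIFO task keeps the CPU until it blocks, exits, or is preempted by a higher-priority task, so if such a process never blocks, the [watchdog/x] kernel thread (also FIFO 99) cannot get the CPU and a soft lockup results.
Apart from that exception, the kernel code must be running with kernel preemption disabled. Linux is preemptible, and preemption is disabled only in specific regions (for example while holding a spinlock), so only those regions can produce a lockup.

There are two kinds of lockup: soft lockup and hard lockup.

A soft lockup occurs when a CPU can no longer schedule other threads: some code occupies the CPU continuously, the watchdog/x kernel thread is never scheduled, but interrupts are still serviced.
A hard lockup occurs when interrupts can no longer be serviced, that is, interrupts stay disabled for too long or an interrupt handler runs for too long.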
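
The distinction between the two kinds can be written as a small decision function. This is an illustrative user-space sketch; enum lockup_kind and classify_lockup() are hypothetical names, and the two boolean inputs stand for "the timer interrupt is still firing" and "the watchdog thread still gets to run".

```c
#include <stdbool.h>

enum lockup_kind { LOCKUP_NONE, LOCKUP_SOFT, LOCKUP_HARD };

/*
 * timer_irq_advancing: the per-CPU timer interrupt is still being serviced.
 * watchdog_thread_ran: the per-CPU watchdog thread was scheduled recently.
 */
enum lockup_kind classify_lockup(bool timer_irq_advancing,
                                 bool watchdog_thread_ran)
{
    if (!timer_irq_advancing)
        return LOCKUP_HARD;  /* interrupts are not serviced at all      */
    if (!watchdog_thread_ran)
        return LOCKUP_SOFT;  /* interrupts work, scheduling does not    */
    return LOCKUP_NONE;
}
```

The order of the checks matters: when interrupts are dead the thread cannot run either, so the hard case has to be tested first.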

4 softlockup detector in detail

Adding the following code to a driver triggers a soft lockup. spin_lock() disables preemption, so the [watchdog/x] thread on that CPU can no longer be scheduled.

static spinlock_t spinlock;

spin_lock_init(&spinlock);
spin_lock(&spinlock);	/* disables preemption on this CPU */
while (1)
	;		/* spin forever without ever releasing the lock */

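The soft lockup detector uses the smp_hotplug_thread mechanism to create one kernel thread per CPU, [watchdog/X], where X is the CPU ID. The thread runs at FIFO 99, the highest priority, and has p->flags |= PF_NO_SETAFFINITY set, so user space cannot change its CPU affinity. Each time it runs (every 4s) the thread writes the current timestamp into the variable watchdog_touch_ts, which is what "feeding the dog" means.
Each CPU also gets a high-resolution hrtimer, created the first time [watchdog/X] runs. Every 4s (sample_period seconds, 4 seconds by default) the timer's callback checks whether watchdog_touch_ts has been updated. If it has not been updated for 20s, the CPU is stuck and the watchdog thread cannot be scheduled.

The hrtimer callback is watchdog_timer_fn() in kernel/watchdog.c.

The callback does the following:

Increments hrtimer_interrupts. The hard lockup detector also uses this variable to decide whether the CPU is still servicing interrupts.
Wakes the [watchdog/x] kernel thread to feed the dog. The thread executes soft_lockup_hrtimer_cnt = hrtimer_interrupts, copying hrtimer_interrupts into soft_lockup_hrtimer_cnt, and sets watchdog_touch_ts to the current system timestamp. This is an assignment, not a ++ as the original diagram shows.
Checks whether watchdog_touch_ts has been updated. If it has gone more than 20s without an update, [watchdog/x] has not run, a soft lockup has occurred, and the CPU is being hogged.

Note: the job of the [watchdog/x] kernel thread is to update watchdog_touch_ts, which is the variable being watched. The real watchdog is the hrtimer callback watchdog_timer_fn(), which also wakes the feeding thread. [watchdog/x] is run by the scheduler, whereas watchdog_timer_fn() is driven by the timer interrupt.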
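
The interplay between the timer callback and the feeding thread can be simulated in user space. The sketch below is a simplified model under stated assumptions, not the kernel implementation: time is an integer number of seconds passed in by the caller, the thread_can_run flag stands in for the scheduler, and struct cpu_watchdog, feed(), timer_tick() and first_detection_time() are hypothetical names. The field names mirror the per-CPU variables mentioned above.

```c
#include <stdbool.h>

#define SOFTLOCKUP_THRESH 20UL /* seconds: watchdog_thresh (10) * 2 */
#define SAMPLE_PERIOD      4UL /* seconds: threshold / 5            */

/* User-space stand-in for the per-CPU variables named in the article. */
struct cpu_watchdog {
    unsigned long hrtimer_interrupts;      /* bumped by the timer callback */
    unsigned long soft_lockup_hrtimer_cnt; /* copied by the thread         */
    unsigned long watchdog_touch_ts;       /* last feed time, in seconds   */
};

/* Thread side: what [watchdog/x] does when it gets to run. */
void feed(struct cpu_watchdog *wd, unsigned long now)
{
    wd->soft_lockup_hrtimer_cnt = wd->hrtimer_interrupts;
    wd->watchdog_touch_ts = now;
}

/*
 * Timer side: one firing of the callback at time `now`.
 * thread_can_run models whether the woken thread manages to run before
 * the next firing. Returns the stall duration in seconds, or 0.
 */
unsigned long timer_tick(struct cpu_watchdog *wd, unsigned long now,
                         bool thread_can_run)
{
    unsigned long touch_ts = wd->watchdog_touch_ts; /* sampled first */
    unsigned long duration = 0;

    wd->hrtimer_interrupts++;
    if (now > touch_ts + SOFTLOCKUP_THRESH)
        duration = now - touch_ts;
    if (thread_can_run)
        feed(wd, now); /* wake-up followed by the thread actually running */
    return duration;
}

/*
 * Fire the timer every SAMPLE_PERIOD seconds on a CPU whose watchdog
 * thread never runs, and return the time of the first detection.
 */
unsigned long first_detection_time(void)
{
    struct cpu_watchdog wd = { 0, 0, 0 };
    unsigned long now;

    for (now = SAMPLE_PERIOD; ; now += SAMPLE_PERIOD)
        if (timer_tick(&wd, now, false))
            return now;
}
```

In this model the first report comes at the first timer firing strictly later than 20s after the last feed, which is why a report can say slightly more than 20s.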

5 Source code analysis

5.1 Global variables and data structures

/* Global variables, exported for sysctl */
unsigned int __read_mostly softlockup_panic =  CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE;

static bool softlockup_threads_initialized __read_mostly;
static u64 __read_mostly sample_period; // hrtimer period in nanoseconds, 4 seconds by default

// per-CPU variables, one instance per CPU
static DEFINE_PER_CPU(unsigned long, watchdog_touch_ts); // timestamp last written by the [watchdog/x] thread
static DEFINE_PER_CPU(struct task_struct *, softlockup_watchdog);// task_struct pointer of the [watchdog/x] thread
static DEFINE_PER_CPU(struct hrtimer, watchdog_hrtimer);// hrtimer control structure paired with the [watchdog/x] thread
static DEFINE_PER_CPU(bool, softlockup_touch_sync);
static DEFINE_PER_CPU(bool, soft_watchdog_warn);// true once a soft lockup has been reported, false otherwise
static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts);// incremented by the hrtimer callback watchdog_timer_fn
static DEFINE_PER_CPU(unsigned long, soft_lockup_hrtimer_cnt);// the [watchdog/x] thread sets soft_lockup_hrtimer_cnt = hrtimer_interrupts
static DEFINE_PER_CPU(struct task_struct *, softlockup_task_ptr_saved);// current as saved by watchdog_timer_fn when a soft lockup is detected; this is not the task_struct of [watchdog/x]
static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts_saved);// used by the hardlockup detector
static unsigned long soft_lockup_nmi_warn;

static struct smp_hotplug_thread watchdog_threads = {
	.store			= &softlockup_watchdog,// per-CPU variable defined above, holds the task_struct pointer of [watchdog/x]
	.thread_should_run	= watchdog_should_run,// returns hrtimer_interrupts != soft_lockup_hrtimer_cnt, i.e. true when the two differ
	.thread_fn		= watchdog,// feeding callback of [watchdog/x]; runs only when thread_should_run returns true
	.thread_comm		= "watchdog/%u",
	.setup			= watchdog_enable,// runs when the [watchdog/x] thread is first activated, before thread_fn; creates the hrtimer and sets [watchdog/x] to FIFO 99
	.cleanup		= watchdog_cleanup,// calls watchdog_disable below; runs when kthread_stop() is called on this kthread, typically from module_exit, which normally never happens for softlockup
	.park			= watchdog_disable,// undoes watchdog_enable: cancels the timer and returns [watchdog/x] to CFS with priority 0
	.unpark			= watchdog_enable,
};

5.2 Creating the [watchdog/x] threads (per-CPU threads)

The softlockup detector creates its per-CPU threads through the smp_hotplug_thread mechanism. For details of that mechanism see the article 《Linux内核机制smp_hotplug_thread》.

kernel_init_freeable
       lockup_detector_init  // called before do_basic_setup, so before the level 0-7 initcalls; early_initcall functions have already run by then (in do_pre_smp_initcalls)
           lockup_detector_setup:
               ret = smpboot_register_percpu_thread_cpumask(&watchdog_threads, &watchdog_allowed_mask);
               softlockup_threads_initialized = true;
               lockup_detector_reconfigure();// initial configuration of the detector, including the timer period

/**  smpboot.c
 * smpboot_register_percpu_thread_cpumask - Register a per_cpu thread related to hotplug
 * @plug_thread:	Hotplug thread descriptor
 * @cpumask:		The cpumask where threads run
 *
 * Creates and starts the threads on all online cpus.
 */
int smpboot_register_percpu_thread_cpumask(struct smp_hotplug_thread *plug_thread, const struct cpumask *cpumask)
{
	unsigned int cpu;
	int ret = 0;
.................
    
	for_each_online_cpu(cpu) {
		__smpboot_create_thread(plug_thread, cpu);// create the per-CPU thread, i.e. [watchdog/x]
		smpboot_unpark_thread(plug_thread, cpu);// unpark the per-CPU thread
	}
	list_add(&plug_thread->list, &hotplug_threads);// add to hotplug_threads so the smpboot core can manage it
............
}

static int __smpboot_create_thread(struct smp_hotplug_thread *ht, unsigned int cpu)
{
    struct task_struct *tsk = *per_cpu_ptr(ht->store, cpu);
	struct smpboot_thread_data *td;

	td = kzalloc_node(sizeof(*td), GFP_KERNEL, cpu_to_node(cpu));
	td->cpu = cpu;
	td->ht = ht;

	tsk = kthread_create_on_cpu(smpboot_thread_fn, td, cpu,  ht->thread_comm);// create the per-CPU thread, i.e. [watchdog/x]

	kthread_park(tsk);
	get_task_struct(tsk);
	*per_cpu_ptr(ht->store, cpu) = tsk;// save the kernel thread's task_struct pointer in the per-CPU variable defined above
	return 0;
}

static int smpboot_thread_fn(void *data)
{
	struct smpboot_thread_data *td = data;
	struct smp_hotplug_thread *ht = td->ht;

	while (1) {
		set_current_state(TASK_INTERRUPTIBLE);
		preempt_disable(); // disable preemption
        ............................
		/* Check for state change setup */
		switch (td->status) {
		case HP_THREAD_NONE:
			__set_current_state(TASK_RUNNING);
			preempt_enable();
			if (ht->setup) // call setup, i.e. watchdog_enable above, which creates the hrtimer and sets [watchdog/x] to FIFO 99
				ht->setup(td->cpu);
			td->status = HP_THREAD_ACTIVE;
			continue;

		case HP_THREAD_PARKED:
			__set_current_state(TASK_RUNNING);
			preempt_enable();
			if (ht->unpark)
				ht->unpark(td->cpu);
			td->status = HP_THREAD_ACTIVE;
			continue;
		}

		if (!ht->thread_should_run(td->cpu)) {// thread_should_run returns hrtimer_interrupts != soft_lockup_hrtimer_cnt, true when they differ
			preempt_enable_no_resched();
			schedule();
		} else {
			__set_current_state(TASK_RUNNING);
			preempt_enable();
			ht->thread_fn(td->cpu);// hrtimer_interrupts != soft_lockup_hrtimer_cnt here, so run thread_fn, the feeding function watchdog()
		}
	}
}
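
The control flow of smpboot_thread_fn can be reduced to a small state machine. The following user-space sketch models only the branches shown in the excerpt above (the elided stop/park handling is left out); enum hp_action, struct hp_thread and hp_step() are hypothetical names, while the HP_THREAD_* status names are taken from the excerpt.

```c
#include <stdbool.h>

enum hp_status { HP_THREAD_NONE, HP_THREAD_ACTIVE, HP_THREAD_PARKED };
enum hp_action { ACT_SETUP, ACT_UNPARK, ACT_RUN_THREAD_FN, ACT_SLEEP };

struct hp_thread {
    enum hp_status status;
};

/*
 * One iteration of the loop: returns which callback the thread would
 * invoke (or that it goes to sleep) and updates the status accordingly.
 */
enum hp_action hp_step(struct hp_thread *t, bool should_run)
{
    switch (t->status) {
    case HP_THREAD_NONE:   /* first run: setup(), then become active */
        t->status = HP_THREAD_ACTIVE;
        return ACT_SETUP;
    case HP_THREAD_PARKED: /* coming back from park: unpark()        */
        t->status = HP_THREAD_ACTIVE;
        return ACT_UNPARK;
    default:
        break;
    }
    /* Active: run thread_fn only if thread_should_run() says so. */
    return should_run ? ACT_RUN_THREAD_FN : ACT_SLEEP;
}
```

For the watchdog, ACT_SETUP and ACT_UNPARK both correspond to watchdog_enable, and ACT_RUN_THREAD_FN corresponds to the feeding function watchdog().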

5.3 Creating the hrtimer

As the smpboot_thread_fn code above shows, the hrtimer is created when the kernel thread first runs and invokes the smp_hotplug_thread::setup() callback, which is watchdog_enable:

static void watchdog_enable(unsigned int cpu)
{
	struct hrtimer *hrtimer = this_cpu_ptr(&watchdog_hrtimer);

	/*
	 * Start the timer first to prevent the NMI watchdog triggering
	 * before the timer has a chance to fire.
	 */
	hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
	hrtimer->function = watchdog_timer_fn;
	hrtimer_start(hrtimer, ns_to_ktime(sample_period), HRTIMER_MODE_REL_PINNED);

	/* Initialize timestamp */
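	__touch_watchdog();// feed once to initialize watchdog_touch_ts, which is 0 because its definition above has no initializer

	// kernel threads created by smp_hotplug_thread start as CFS with priority 120; switch to FIFO 99 here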
	watchdog_set_prio(SCHED_FIFO, MAX_RT_PRIO - 1);
}

Once the steps in sections 5.2 and 5.3 have run, the system has one kernel thread per CPU, at FIFO priority 99:

sh-3.2# ps | grep watchdog
   12 root         0 SW   [watchdog/0]
   15 root         0 SW   [watchdog/1]
   21 root         0 SW   [watchdog/2]
   27 root         0 SW   [watchdog/3]
sh-3.2# chrt -p 12
pid 12's current scheduling policy: SCHED_FIFO
pid 12's current scheduling priority: 99

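5.4 Feeding the dog periodically

5.4.1 hrtimer period

The detector init (also mentioned in section 5.2) sets the timer period sample_period, a global variable. Its value is watchdog_thresh times 2 divided by 5, converted to nanoseconds. watchdog_thresh defaults to 10 seconds (changeable through /proc/sys/kernel/watchdog_thresh), so the timer fires every 4 seconds and the dog is fed every 4 seconds. The code is as follows: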

lockup_detector_init
           lockup_detector_setup:
               ret = smpboot_register_percpu_thread_cpumask(&watchdog_threads, &watchdog_allowed_mask);
			  softlockup_threads_initialized = true;
			  lockup_detector_reconfigure():// initial configuration of the detector, including the timer period
				set_sample_period

static int get_softlockup_thresh(void)
{
	return watchdog_thresh * 2;// watchdog_thresh defaults to 10 seconds and can be changed through /proc/sys/kernel/watchdog_thresh
}
static void set_sample_period(void)
{
	sample_period = get_softlockup_thresh() * ((u64)NSEC_PER_SEC / 5);// 20 / 5 == 4 seconds, in nanoseconds, used as the hrtimer sample period
	watchdog_update_hrtimer_threshold(sample_period);// used by the hardlockup detector, not relevant here
}
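
The arithmetic can be checked in user space. The sketch below reproduces the two formulas with standard C types; softlockup_thresh_secs() and sample_period_ns() are hypothetical names standing in for get_softlockup_thresh() and set_sample_period(), and NSEC_PER_SEC is redefined locally.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Soft lockup threshold in seconds: twice watchdog_thresh. */
uint64_t softlockup_thresh_secs(unsigned int watchdog_thresh)
{
    return (uint64_t)watchdog_thresh * 2;
}

/* hrtimer sample period in nanoseconds: one fifth of the threshold. */
uint64_t sample_period_ns(unsigned int watchdog_thresh)
{
    return softlockup_thresh_secs(watchdog_thresh) * (NSEC_PER_SEC / 5);
}
```

With the default watchdog_thresh of 10 this gives a 20 second threshold and a sample period of 4,000,000,000 ns, so the timer gets five chances to see a feed within one threshold window.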

5.4.2 Feeding from the hrtimer

Feeding does two things:

soft_lockup_hrtimer_cnt = hrtimer_interrupts
watchdog_touch_ts = get_timestamp(); // feed the dog

// watchdog_timer_fn is the callback bound when the hrtimer was created in section 5.3; it runs every 4 seconds:
static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
{
	unsigned long touch_ts = __this_cpu_read(watchdog_touch_ts);
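	watchdog_interrupt_count();	// hrtimer_interrupts++, which makes thread_should_run return true

	/* kick the softlockup detector */
	// wake the [watchdog/x] kernel thread; it uses thread_should_run to decide whether to call thread_fn, the feeding function watchdog (see smpboot_thread_fn)
	// thread_should_run returns hrtimer_interrupts != soft_lockup_hrtimer_cnt, true when they differ
	wake_up_process(__this_cpu_read(softlockup_watchdog)); // wake the [watchdog/X] thread to feed the dog

	/* .. and repeat */
	hrtimer_forward_now(hrtimer, ns_to_ktime(sample_period));// re-arm the timer to fire again in 4 seconds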
    .....
}

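/* The two functions below run in the context of the [watchdog/X] thread and are where the dog is actually fed. The function above runs in hrtimer context (it wakes the [watchdog/X] thread and checks for a soft lockup).
   When a soft lockup is reported, dump_stack runs in hrtimer context, so the stack shown is that of the thread that was running on the CPU when the soft lockup was detected.
  */
static void __touch_watchdog(void)
{
	__this_cpu_write(watchdog_touch_ts, get_timestamp());// feed the dog
}
static void watchdog(unsigned int cpu)
{
	__this_cpu_write(soft_lockup_hrtimer_cnt,  __this_cpu_read(hrtimer_interrupts));
	__touch_watchdog();// feed the dog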
}

5.5 Soft lockup detection

Every 4 seconds the hrtimer checks for a soft lockup. The logic is:
if watchdog_touch_ts has not been updated for 20s, a soft lockup is assumed. See sections 2 and 4 for the principle.

/* Global variables, exported for sysctl */
unsigned int __read_mostly softlockup_panic = CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE;

/* watchdog_timer_fn is the callback bound when the hrtimer was created in section 5.3; it runs every 4 seconds: */
static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
{
	unsigned long touch_ts = __this_cpu_read(watchdog_touch_ts);
	int duration;

	watchdog_interrupt_count();  /* increment hrtimer_interrupts */
	wake_up_process(__this_cpu_read(softlockup_watchdog)); /* wake the feeding thread [watchdog/X], see section 5.4.2 */

	/* .. and repeat: re-arm the timer to fire again in 4 seconds */
	hrtimer_forward_now(hrtimer, ns_to_ktime(sample_period));


	/* check for a softlockup
	 * This is done by making sure a high priority task is
	 * being scheduled.  The task touches the watchdog to
	 * indicate it is getting cpu time.  If it hasn't then
	 * this is a good indication some task is hogging the cpu
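	 * The English comment above describes the check well: if a CPU is stuck, its watchdog
	 * thread is not scheduled, so watchdog_touch_ts is not updated. If watchdog_touch_ts has
	 * not been updated for 20s, a soft lockup is assumed.
	 * is_softlockup works as follows:
	 * if watchdog_touch_ts has not been updated for 20 seconds, it returns duration = current timestamp - watchdog_touch_ts, otherwise 0
	 * duration > 0 means a soft lockup has occurred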
	 */
	duration = is_softlockup(touch_ts);
	if (unlikely(duration)) {

		pr_emerg("BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n",
			smp_processor_id(), duration,
			current->comm, task_pid_nr(current));
		__this_cpu_write(softlockup_task_ptr_saved, current);
		print_modules();
		print_irqtrace_events(current);
		dump_stack();

		add_taint(TAINT_SOFTLOCKUP, LOCKDEP_STILL_OK);
		if (softlockup_panic) // defaults to CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE; when it is 1 a soft lockup triggers a panic
			panic("softlockup: hung tasks");
		__this_cpu_write(soft_watchdog_warn, true);
	} else
		__this_cpu_write(soft_watchdog_warn, false);
    
	return HRTIMER_RESTART;
}


static int is_softlockup(unsigned long touch_ts)
{
    unsigned long now = get_timestamp();
    /* If a CPU is stuck, its watchdog thread is not scheduled, so watchdog_touch_ts */
    /* is not updated. If it has not been updated for 20s, a soft lockup is assumed. */
    /* get_softlockup_thresh() returns 20 */
    if (time_after(now, touch_ts + get_softlockup_thresh()))
        return now - touch_ts;
    return 0;
}
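
The comparison in is_softlockup can be exercised in user space. The sketch below is a model under stated assumptions: time_after_ul() is a local re-implementation of the usual wrap-safe comparison idea behind time_after(), and is_softlockup_model() takes now and the threshold as parameters instead of calling get_timestamp() and get_softlockup_thresh(). Both function names are hypothetical.

```c
/*
 * True if a is later than b, even when the unsigned counter has wrapped
 * around between the two samples (relies on two's-complement conversion).
 */
int time_after_ul(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}

/*
 * Returns how long the watchdog has gone unfed if that exceeds the
 * threshold, otherwise 0. All values share one unit (seconds here).
 */
unsigned long is_softlockup_model(unsigned long now, unsigned long touch_ts,
                                  unsigned long thresh)
{
    if (time_after_ul(now, touch_ts + thresh))
        return now - touch_ts;
    return 0;
}
```

Note that the comparison is strict: exactly 20 units after the last feed is still fine, and the first value that can be reported is 21.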

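6 How to analyze the problem

Soft lockup log:

BUG: soft lockup - CPU#2 stuck for 21s! [taskname:pid]

This log says that some process or thread has held CPU#2 for 21s, so other processes and threads on that CPU could not be scheduled. The main causes of a soft lockup are:

An infinite loop in thread context (for example a for loop with the wrong exit condition) in a thread running at the top priority, FIFO 99. This can happen in kernel mode or user mode.
Preemption disabled for too long in kernel mode. User mode cannot disable preemption, and disabling interrupts does not count here because that leads to a hard lockup rather than a soft lockup. Kernel or driver modules can disable preemption with preempt_disable(), and some APIs do so implicitly: spin_lock, for instance, calls preempt_disable internally. Misusing a spinlock causes a soft lockup, for example when a critical section runs for more than 20 seconds, or when locks are nested in the wrong order and deadlock.

When handling this kind of problem, the following guidelines apply:

Check whether the watchdog thread has updated watchdog_touch_ts within the last 20 seconds (watchdog_thresh * 2). If not, the watchdog thread is not being scheduled, most likely because some CPU has had preemption disabled for too long or a FIFO 99 thread has been running for too long.
In this situation the system usually does not die but becomes very slow. If the kernel parameter softlockup_panic (whose default comes from CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC) is set to 1, the system panics; otherwise only the warning is printed.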
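
When triaging many logs it can help to extract the CPU, duration, task name and PID from the report line automatically. The following is a minimal user-space sketch that assumes the line follows the pr_emerg format string shown in section 5.5; struct softlockup_report and parse_softlockup_line() are hypothetical names. Because task names such as kworker/0:1 can themselves contain a colon, the PID is taken after the last colon inside the brackets.

```c
#include <stdio.h>
#include <string.h>

struct softlockup_report {
    int cpu;
    unsigned int stuck_secs;
    char comm[16]; /* kernel task names fit in 16 bytes including NUL */
    int pid;
};

/* Returns 0 on success, -1 if the line is not a soft lockup report. */
int parse_softlockup_line(const char *line, struct softlockup_report *r)
{
    const char *p = strstr(line, "BUG: soft lockup - CPU#");
    const char *lb, *rb, *colon = NULL;
    size_t len;

    if (!p || sscanf(p, "BUG: soft lockup - CPU#%d stuck for %us!",
                     &r->cpu, &r->stuck_secs) != 2)
        return -1;

    /* The "[comm:pid]" part follows the message text. */
    lb = strchr(p, '[');
    rb = lb ? strrchr(lb, ']') : NULL;
    if (!rb)
        return -1;

    /* Use the last ':' inside the brackets as the comm/pid separator. */
    for (const char *q = lb + 1; q < rb; q++)
        if (*q == ':')
            colon = q;
    if (!colon)
        return -1;

    len = (size_t)(colon - (lb + 1));
    if (len >= sizeof(r->comm))
        len = sizeof(r->comm) - 1;
    memcpy(r->comm, lb + 1, len);
    r->comm[len] = '\0';

    return sscanf(colon + 1, "%d", &r->pid) == 1 ? 0 : -1;
}
```

The task reported is current at the time the timer fired, so it is the thread that was hogging the CPU and a natural starting point for reading the accompanying stack dump.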

7 Kernel configuration

Enable soft lockup detection

CONFIG_LOCKUP_DETECTOR=y  #selected by CONFIG_SOFTLOCKUP_DETECTOR
CONFIG_SOFTLOCKUP_DETECTOR=y

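Panic when a soft lockup occurs (default n)

CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=y  or change it at run time with echo 1 >/proc/sys/kernel/softlockup_panic

By default the system prints a warning when a soft lockup occurs. To get a panic instead, the following can be set from user space:

echo 1 > /proc/sys/kernel/softlockup_panic
cat /proc/sys/kernel/watchdog_thresh   # default 10s

[!NOTE]
watchdog_thresh defaults to 10s. If watchdog_touch_ts has not been updated for more than (watchdog_thresh*2) seconds, the kernel reports a soft lockup, and panics if softlockup_panic is 1. The maximum value is 60s, in which case the report comes after 120s without an update.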
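
A monitoring tool can derive the effective soft lockup window from the sysctl value. The sketch below is a user-space illustration only: softlockup_window_secs() and read_softlockup_window() are hypothetical helpers, the upper bound of 60 is taken from the note above, and treating 0 as a valid lower bound is an assumption.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Parse the text content of watchdog_thresh and return the soft lockup
 * window in seconds (2 * thresh), or -1 if the text is malformed or
 * outside the assumed 0..60 range.
 */
int softlockup_window_secs(const char *text)
{
    char *end;
    unsigned long v;

    errno = 0;
    v = strtoul(text, &end, 10);
    if (errno || end == text || v > 60)
        return -1;
    return (int)(v * 2);
}

/* Read the value from a file such as /proc/sys/kernel/watchdog_thresh. */
int read_softlockup_window(const char *path)
{
    char buf[32];
    FILE *f = fopen(path, "r");
    int ret = -1;

    if (!f)
        return -1;
    if (fgets(buf, sizeof(buf), f))
        ret = softlockup_window_secs(buf);
    fclose(f);
    return ret;
}
```

Calling read_softlockup_window("/proc/sys/kernel/watchdog_thresh") on a system with the default setting would be expected to yield 20.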


Posted 2024-07-04 10:56 by RobertHu