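Softlockup & Hardlockup Detection Mechanisms

Preface

Linux has several built-in mechanisms for detecting abnormal conditions; softlockup and hardlockup detection are two typical ones. Softlockup detection checks whether the kernel has gone a long time without scheduling any other task. Hardlockup detection goes a step further and checks whether the kernel has gone a long time without responding to interrupts. The two are defined as follows: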

A 'softlockup' is defined as a bug that causes the kernel to loop in kernel mode for more than 20 seconds, without giving other tasks a chance to run.
A 'hardlockup' is defined as a bug that causes the CPU to loop in kernel mode for more than 10 seconds, without letting other interrupts have a chance to run.

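The two detection mechanisms have a lot in common, so they share one overall design. Their detection targets differ, however, so the implementations differ in places.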

watchdog

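A watchdog is a common keep-alive technique: a task runs periodically and checks whether some value has been updated. The checking step is called watching the dog, and the act of updating the value is called touching the dog.
Softlockup and hardlockup detection both work per CPU, so every CPU has two dogs, one for softlockup and one for hardlockup.

The softlockup dog is watchdog_touch_ts, which records the timestamp of the last touch.
The hardlockup dog is hrtimer_interrupts, which counts how many hrtimer (high-resolution timer) interrupts have occurred.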

static DEFINE_PER_CPU(unsigned long, watchdog_touch_ts);
static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts);
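
To make the per-CPU idea concrete, here is a minimal user-space sketch of the two dogs. It is only a model: NR_CPUS, struct cpu_dogs and the three helper functions are hypothetical names introduced for illustration, and a plain array stands in for the kernel's per-CPU variables.

```c
#include <stdbool.h>

#define NR_CPUS 4   /* hypothetical CPU count for this model */

/* One pair of "dogs" per CPU, mirroring the two per-CPU variables above. */
struct cpu_dogs {
    unsigned long watchdog_touch_ts;   /* softlockup dog: time of the last touch */
    unsigned long hrtimer_interrupts;  /* hardlockup dog: timer interrupt count */
};

/* A plain array stands in for DEFINE_PER_CPU in this user-space model. */
struct cpu_dogs dogs[NR_CPUS];

/* Touch the softlockup dog of one CPU. */
void touch_softlockup_dog(int cpu, unsigned long now)
{
    dogs[cpu].watchdog_touch_ts = now;
}

/* Touch the hardlockup dog of one CPU. */
void touch_hardlockup_dog(int cpu)
{
    dogs[cpu].hrtimer_interrupts++;
}

/* Watch the softlockup dog: true when the last touch is older than timeout. */
bool softlockup_dog_stale(int cpu, unsigned long now, unsigned long timeout)
{
    return now - dogs[cpu].watchdog_touch_ts > timeout;
}
```

Because each CPU has its own entry, touching the dog on one CPU says nothing about another: a stall on CPU 1 is still visible even if CPU 0 keeps touching its own dog.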

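Three kinds of code can run in the kernel. In descending priority they are NMI handlers, normal interrupt handlers, and tasks. In essence, softlockup detection checks whether scheduling between tasks still happens while NMIs and normal interrupts are being serviced properly; hardlockup detection checks whether normal interrupts are still serviced while NMIs are being serviced properly.

Note: an NMI is a non-maskable interrupt, so its handler can run even when normal interrupts are disabled.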

softlockup

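To meet this goal, softlockup detection needs a kernel thread that can touch the dog (update watchdog_touch_ts), and that thread must be started when the softlockup check runs. It also needs a periodic timer that checks whether the gap between watchdog_touch_ts and now exceeds a threshold; if it does, a softlockup is considered to have occurred. The default timeout, softlockup_thresh, is 20 s (2 * watchdog_thresh). The check is implemented in is_softlockup (kernel/watchdog.c):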

static int is_softlockup(unsigned long touch_ts)
{
    unsigned long now = get_timestamp();

    if ((watchdog_enabled & SOFT_WATCHDOG_ENABLED) && watchdog_thresh){
        /* Warn about unreasonable delays. */
        if (time_after(now, touch_ts + get_softlockup_thresh()))
            return now - touch_ts;
    }
    return 0;
}
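
The comparison above goes through time_after rather than a plain `>` so that it stays correct if the timestamp counter wraps around. The following user-space sketch illustrates that idea. ts_after and softlockup_duration are hypothetical names, and the helper only imitates the signed-difference trick that time_after relies on; it is not the kernel macro itself.

```c
#include <stdbool.h>

/*
 * Wraparound-safe "a is later than b": the unsigned difference is
 * reinterpreted as signed, so the result is still right when the
 * counter has wrapped between b and a.
 */
static bool ts_after(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}

/*
 * Model of the check: returns how long the dog has gone untouched if
 * that exceeds thresh, and 0 otherwise. A thresh of 0 disables the check.
 */
unsigned long softlockup_duration(unsigned long now, unsigned long touch_ts,
                                  unsigned long thresh)
{
    if (thresh && ts_after(now, touch_ts + thresh))
        return now - touch_ts;
    return 0;
}
```

With a threshold of 20, a dog last touched at 100 is reported at time 130 (duration 30) but not at time 120, because the comparison is strict.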

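For softlockup detection to be effective, the task that updates watchdog_touch_ts must have the highest task priority; otherwise a lower-priority task might fail to update the timestamp in time even though scheduling is working normally. In older kernel versions the updating task was [watchdog/x]; after the STOP scheduling class (which ranks above real-time tasks) was introduced, the updating thread became [migration/x].

The migration thread is the highest-priority thread in the kernel and handles jobs such as CPU hotplug and stopping a CPU. It manages a work_queue: when work is queued, migration becomes RUNNABLE and waits to be scheduled, and as soon as scheduling happens it is guaranteed to get the CPU and update watchdog_touch_ts. This is what makes the softlockup check reliable.

The check itself has to be done by something with higher priority than any task, namely an interrupt. An hrtimer can raise interrupts periodically, and its handler, watchdog_timer_fn, checks whether [migration/x] has updated watchdog_touch_ts as expected. The hrtimer period is softlockup_thresh / 5 (4 s by default).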
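
The relationship between the three time values can be written out as a small worked calculation. This is a sketch of the arithmetic described in the text, not the kernel code; the function names are hypothetical.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* softlockup_thresh is twice watchdog_thresh (both in seconds). */
uint64_t softlockup_thresh_s(uint64_t watchdog_thresh_s)
{
    return 2 * watchdog_thresh_s;
}

/* The hrtimer fires five times per softlockup threshold (result in ns). */
uint64_t sample_period_ns(uint64_t watchdog_thresh_s)
{
    return softlockup_thresh_s(watchdog_thresh_s) * (NSEC_PER_SEC / 5);
}
```

With the default watchdog_thresh of 10 s, softlockup_thresh_s(10) is 20 and sample_period_ns(10) is 4,000,000,000 ns, which matches the 4 s period quoted above.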

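The overall flow of the softlockup check is as follows:

The hrtimer fires periodically and runs the interrupt handler watchdog_timer_fn:
1. Queue the work item softlockup_fn on the work_queue
2. Check whether watchdog_touch_ts is stale
3. Return and wait for the next expiry
The migration thread:
1. Is woken up by the work_queue
2. Checks the queue, takes softlockup_fn off it and runs it
3. Updates watchdog_touch_ts
4. Goes back to sleep once the work_queue is empty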
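
The two flows above can be played through in a small discrete simulation. This is a deliberately simplified model under stated assumptions: time advances in whole hrtimer periods of 4 s, the threshold is 20 s, the queued work runs right after the handler returns unless a task is hogging the CPU, and first_softlockup_tick is a hypothetical name.

```c
#define SAMPLE_PERIOD_S     4   /* hrtimer period from the text */
#define SOFTLOCKUP_THRESH_S 20  /* default softlockup threshold */

/*
 * Simulate timer ticks 1..max_ticks. From tick hog_from_tick onward a
 * task monopolizes the CPU, so the migration thread never gets to run
 * the queued work. Returns the tick at which the handler first sees a
 * stale timestamp, or -1 if no softlockup is detected.
 */
int first_softlockup_tick(int hog_from_tick, int max_ticks)
{
    unsigned long touch_ts = 0;

    for (int tick = 1; tick <= max_ticks; tick++) {
        unsigned long now = (unsigned long)tick * SAMPLE_PERIOD_S;

        /* Handler: check whether the last touch is too old. */
        if (now - touch_ts > SOFTLOCKUP_THRESH_S)
            return tick;

        /* Migration thread: runs the queued work unless the CPU is hogged. */
        if (tick < hog_from_tick)
            touch_ts = now;
    }
    return -1;
}
```

In this model, if the CPU is hogged from the very first tick, the stale timestamp is first seen at tick 6, that is 24 s in, the first sample strictly beyond the 20 s threshold. If the migration thread always gets to run, the gap never exceeds one period and nothing is reported.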

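If the migration thread has work queued but is not scheduled for a long time (because some task on that CPU is monopolizing it), a softlockup has occurred and the state of the CPU needs to be dumped.

Figure: overall flow of the softlockup check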

hardlockup

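Hardlockup detection works much like softlockup detection, but the target is different: it detects normal interrupts going unserviced for a long time. The check is implemented in is_hardlockup in kernel/watchdog.c, which tests whether hrtimer_interrupts is still increasing; if it has not increased, a hardlockup is considered to have occurred.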

/* watchdog detector functions */
bool is_hardlockup(void)
{
    unsigned long hrint = __this_cpu_read(hrtimer_interrupts);

    if (__this_cpu_read(hrtimer_interrupts_saved) == hrint)
        return true;

    __this_cpu_write(hrtimer_interrupts_saved, hrint);
    return false;
}
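
The same compare-and-save logic can be exercised in user space. In this sketch a small struct replaces the per-CPU variables; struct hld_state, hrtimer_tick and nmi_check are hypothetical names used only for this model.

```c
#include <stdbool.h>

/* Stand-in for the two per-CPU counters used by the hardlockup check. */
struct hld_state {
    unsigned long hrtimer_interrupts;        /* bumped by the timer interrupt */
    unsigned long hrtimer_interrupts_saved;  /* value seen at the last check */
};

/* Touch: what the hrtimer handler does on every expiry. */
void hrtimer_tick(struct hld_state *s)
{
    s->hrtimer_interrupts++;
}

/* Watch: what the NMI callback does; true means "no progress since last check". */
bool nmi_check(struct hld_state *s)
{
    unsigned long hrint = s->hrtimer_interrupts;

    if (s->hrtimer_interrupts_saved == hrint)
        return true;

    s->hrtimer_interrupts_saved = hrint;
    return false;
}
```

As long as at least one hrtimer_tick happens between two nmi_check calls, the check returns false. Two consecutive checks with no tick in between make the second one return true.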

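The default hardlockup timeout, watchdog_thresh, is 10 s, half the softlockup timeout. Unlike the softlockup dog, hrtimer_interrupts carries no timestamp, so how is the timeout determined?
Linux uses a periodic NMI. It is built on the cycles event of the perf subsystem: a perf counter can be given an overflow threshold, and when the event count reaches that threshold an NMI is raised. Because the cycle count is related to elapsed time, this yields a periodic NMI; see watchdog_nmi_enable in kernel/watchdog.c for details. Following the call chain, hardlockup_detector_event_create (in kernel/watchdog_hld.c) calls hw_nmi_get_sample_period (in arch/x86/kernel/apic/hw_nmi.c). This architecture-specific function returns the number of cycles after which the counter overflows, chosen so that the NMI fires once every watchdog_thresh seconds.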

u64 hw_nmi_get_sample_period(int watchdog_thresh)
{
    return (u64)(cpu_khz) * 1000 * watchdog_thresh;
}
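
A worked example of this conversion may help. The sketch below repeats the multiplication in user space and then converts the cycle count back to time; the 2 GHz frequency is an illustrative assumption, and nmi_sample_cycles and cycles_to_ms are hypothetical names.

```c
#include <stdint.h>

/* Cycles to count before the NMI fires: kHz * 1000 gives cycles per second. */
uint64_t nmi_sample_cycles(uint64_t cpu_khz, int watchdog_thresh)
{
    return cpu_khz * 1000 * (uint64_t)watchdog_thresh;
}

/* Time needed to burn that many cycles; a frequency in kHz is cycles per ms. */
uint64_t cycles_to_ms(uint64_t cycles, uint64_t cpu_khz)
{
    return cycles / cpu_khz;
}
```

For a CPU with cpu_khz = 2,000,000 (2 GHz) and watchdog_thresh = 10, the counter is programmed with 20,000,000,000 cycles, which takes 10,000 ms at that frequency.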

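The periodic NMI runs a callback that watches (checks whether hrtimer_interrupts has increased), while the hrtimer is responsible for touching (incrementing hrtimer_interrupts) at regular intervals.

Hardlockup and softlockup detection meet at the hrtimer: its handler has to watch watchdog_touch_ts for the softlockup check and also touch hrtimer_interrupts to record that the timer interrupt fired.

NOTE: updated 2024-03-15
The hardlockup timeout is enforced by the period of the cycles NMI, but on CPUs with turbo mode, deriving time from a cycle count is inaccurate: the NMI period shrinks, which can cause false positives. In turbo mode the CPU adjusts core frequencies and powers cores down automatically depending on load. For example, when a single-threaded program runs on a quad-core processor, three cores may be shut down and the frequency of the running core raised, which improves performance while reducing power consumption. This creates two problems: the dynamic frequency makes the cycles-based NMI period inaccurate, and the clock of a stopped CPU is not updated.
For this case the kernel provides the config option CONFIG_HARDLOCKUP_CHECK_TIMESTAMP. With it enabled, the NMI callback checks a timestamp and only performs the hardlockup check if at least 4/5 * watchdog_thresh has elapsed since the previous check (long enough to guarantee at least one hrtimer_interrupts update). In addition, if ktime is jiffies based (updated once per timer interrupt), jiffies does not advance on a stopped CPU; in that case the counter nmi_rearmed is used to decide whether enough time has passed. The feature is implemented by the following code: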

#ifdef CONFIG_HARDLOCKUP_CHECK_TIMESTAMP
static DEFINE_PER_CPU(ktime_t, last_timestamp);
static DEFINE_PER_CPU(unsigned int, nmi_rearmed);
static ktime_t watchdog_hrtimer_sample_threshold __read_mostly;

void watchdog_update_hrtimer_threshold(u64 period)
{
	/*
	 * The hrtimer runs with a period of (watchdog_threshold * 2) / 5
	 *
	 * So it runs effectively with 2.5 times the rate of the NMI
	 * watchdog. That means the hrtimer should fire 2-3 times before
	 * the NMI watchdog expires. The NMI watchdog on x86 is based on
	 * unhalted CPU cycles, so if Turbo-Mode is enabled the CPU cycles
	 * might run way faster than expected and the NMI fires in a
	 * smaller period than the one deduced from the nominal CPU
	 * frequency. Depending on the Turbo-Mode factor this might be fast
	 * enough to get the NMI period smaller than the hrtimer watchdog
	 * period and trigger false positives.
	 *
	 * The sample threshold is used to check in the NMI handler whether
	 * the minimum time between two NMI samples has elapsed. That
	 * prevents false positives.
	 *
	 * Set this to 4/5 of the actual watchdog threshold period so the
	 * hrtimer is guaranteed to fire at least once within the real
	 * watchdog threshold.
	 */
	watchdog_hrtimer_sample_threshold = period * 2;
}

static bool watchdog_check_timestamp(void)
{
	ktime_t delta, now = ktime_get_mono_fast_ns();

	delta = now - __this_cpu_read(last_timestamp);
	if (delta < watchdog_hrtimer_sample_threshold) {
		/*
		 * If ktime is jiffies based, a stalled timer would prevent
		 * jiffies from being incremented and the filter would look
		 * at a stale timestamp and never trigger.
		 */
		if (__this_cpu_inc_return(nmi_rearmed) < 10)
			return false;
	}
	__this_cpu_write(nmi_rearmed, 0);
	__this_cpu_write(last_timestamp, now);
	return true;
}
#else
static inline bool watchdog_check_timestamp(void)
{
	return true;
}
#endif
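
The turbo-mode problem and the timestamp filter can be illustrated with some arithmetic. This is a simplified sketch: the frequencies are made-up illustrative values, all function names are hypothetical, and the filter model leaves out the nmi_rearmed fallback.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Wall-clock NMI period in ms when the counter was programmed for
 * nominal_khz but the core actually runs at actual_khz.
 */
uint64_t nmi_period_ms(uint64_t nominal_khz, uint64_t actual_khz,
                       uint64_t watchdog_thresh_s)
{
    uint64_t cycles = nominal_khz * 1000 * watchdog_thresh_s;

    return cycles / actual_khz;  /* a frequency in kHz is cycles per ms */
}

/* hrtimer period in ms: (watchdog_thresh * 2) / 5. */
uint64_t hrtimer_period_ms(uint64_t watchdog_thresh_s)
{
    return watchdog_thresh_s * 2 * 1000 / 5;
}

/*
 * Simplified timestamp filter: an NMI sample is only evaluated once two
 * hrtimer periods (4/5 of watchdog_thresh) have passed since the last one.
 */
bool nmi_sample_allowed(uint64_t delta_ms, uint64_t watchdog_thresh_s)
{
    return delta_ms >= 2 * hrtimer_period_ms(watchdog_thresh_s);
}
```

With a nominal 2 GHz clock and watchdog_thresh = 10, the NMI period is 10,000 ms at nominal speed. If the core ran at three times that frequency, the period would drop to 3,333 ms, shorter than the 4,000 ms hrtimer period, so an NMI could arrive before any hrtimer tick. The filter rejects such a sample because 3,333 ms is below the 8,000 ms minimum.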

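watchdog configuration interfaces

Enable or disable the watchdog:

/proc/sys/kernel/soft_watchdog: enable or disable softlockup detection
/proc/sys/kernel/nmi_watchdog: enable or disable hardlockup detection
/proc/sys/kernel/watchdog: enable or disable softlockup and hardlockup detection together; reading it returns the logical OR of soft_watchdog and nmi_watchdog.

Select which cores run the watchdog: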

/proc/sys/kernel/watchdog_cpumask

Set the lockup timeout threshold:

/proc/sys/kernel/watchdog_thresh: sets the NMI watchdog timeout threshold; softlockup_thresh is 2 * watchdog_thresh

Choose what happens on a timeout:

/proc/sys/kernel/hardlockup_panic: whether to panic when a hardlockup occurs
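
These files hold plain decimal text, so they can be read with ordinary file I/O. The sketch below shows one way to do that from C; parse_sysctl_int and read_sysctl_int are hypothetical helper names, and the code assumes the file contains a single integer on its first line.

```c
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse a decimal integer such as "10\n". Returns 0 on success, -1 on error. */
int parse_sysctl_int(const char *text, int *out)
{
    char *end;
    long v;

    errno = 0;
    v = strtol(text, &end, 10);
    if (end == text || errno != 0 || v < INT_MIN || v > INT_MAX)
        return -1;

    *out = (int)v;
    return 0;
}

/* Read the first line of a sysctl file and parse it as an integer. */
int read_sysctl_int(const char *path, int *out)
{
    char buf[64];
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    if (!fgets(buf, sizeof(buf), f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return parse_sysctl_int(buf, out);
}
```

Calling read_sysctl_int("/proc/sys/kernel/watchdog_thresh", &thresh) on a system where the file exists yields the hardlockup threshold in seconds; doubling it gives the softlockup threshold. The function returns -1 if the file cannot be opened or does not contain a number.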

References

[1] lockup-watchdogs

posted @ 2024-01-16 16:29 ZouTaooo
