timer bug导致系统cpu hard lockup分析

　　问题现象：

　　系统（android）正常运行一段时间之后就会panic，查看linux kernel的log发现是发生了hard lockup，相关的log如下：

<4>[ 1815.827575] c0 cpu0 detected cpu1 has HARDLOCKUP!
<4>[ 1815.828063] c0 Modules linked in: sd8xxx(O) citty(O) mlan(O) exfat_fs(P) exfat_core(P) geu(O) galcore(O)

　　什么是HARDLOCKUP，简单来说，就是在多核（4核）系统中，如果其中一个核（core1）长时间没有响应timer中断，会被逻辑上的上一个核（core0）检测到，然后系统就会panic。而每次timer中断中watchdog thread都会调度到，所以我们从watchdog thread出发进行调试，首先利用crash工具来查看watchdog/1 上一次被调度到是什么时候。

crash> ps
3008 2 1 e0e29e00 IN 0.0 0 0 [watchdog/1]
crash> struct task_struct e0e29e00
struct task_struct {
exec_start = 1801178161618, // 1801.1178161618

　　所以上次watchdog thead被wake up大概是在1801s，大概是在hardlockup发生（1815s）前的14秒。为什么这么长时间了watchdog一直没有被再次调度呢，怀疑是timer除了问题，所以查看core1的next time event 还有timerkeeper。

crash> struct clock_event_device 0xc1a11200
struct clock_event_device {
event_handler = 0xc01596a4 <hrtimer_interrupt>,
set_next_event = 0xc0118470 <twd_set_next_event>,
set_next_ktime = 0,
next_event = {
tv64 = 665477081301 //665.477081301, 
},

crash> p timekeeper
timekeeper = $11 = {
total_sleep_time = {
tv_sec = 1139, 
tv_nsec = 350860596

　　根据上面的数据计算，next time event应该发生在665.47 + 1139 + 0.35 ~ 1805s左右，这个时间点watchdog thread理论上应该被再次wakeup。但是实际上过了将近14s还是没有收到新的timer event，感觉上就是timer变慢了，查看code发现local timer是根据启动的core frequency是配置的，而启动的core frequency是1.2G，而panic的时候core的frequency却只有312M了，差了4倍左右，导致原本只需要等4s现在却需要16s（next time event发生的时间变成1801+16 ~ 1807s），时间大大超出预期，core0就误判core1已经无法响应中断，一个hardlockup就这样被trigger。

posted @ 2014-11-20 17:10 muryo 阅读(1865) 评论(0) 编辑收藏举报

会员力量，点亮园子希望

刷新页面返回顶部

muryo

timer bug导致系统cpu hard lockup分析

公告