Tracking down a kernel bug: THP (transparent huge pages)
Kernel 2.6.32-279, with KSM and THP enabled.
LOG:
<4>------------[ cut here ]------------
<2>kernel BUG at mm/huge_memory.c:1194!
<4>invalid opcode: 0000 [#1] SMP
<4>last sysfs file: /sys/devices/pci0000:00/0000:00:11.0/0000:06:00.0/host7/scsi_host/host7/proc_name
<4>CPU 1
<4>Modules linked in: nf_conntrack bonding 8021q garp stp llc ipv6 openvswitch(U) libcrc32c iptable_filter ip_tables sg igb dca microcode serio_raw i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp ext4 mbcache jbd2 sd_mod crc_t10dif isci libsas scsi_transport_sas ahci megaraid_sas wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
<4>
<4>Pid: 3084, comm: redis-server Tainted: G --------------- H 2.6.32-279.19.5.el6.ucloud.x86_64 #1 Huawei Technologies Co., Ltd. Tecal RH2288H V2-12L/BC11SRSG1
<4>RIP: 0010:[<ffffffff81180f22>] [<ffffffff81180f22>] split_huge_page+0x802/0x850
<4>RSP: 0018:ffff880c5edddbd8 EFLAGS: 00010086
<4>RAX: 00000000ffffffff RBX: ffffea0029333970 RCX: ffffc9000c06b000
<4>RDX: 0000000000000002 RSI: ffff880c61fe9c00 RDI: 0000000000000004
<4>RBP: ffff880c5edddca8 R08: 0000000000000064 R09: 0000000000000001
<4>R10: 0000000000000106 R11: ffff880c5941f300 R12: ffffea0029334000
<4>R13: 0000000000000000 R14: ffffea002932d000 R15: 00000007fcd17be1
<4>FS: 00007fcd25858720(0000) GS:ffff880053820000(0000) knlGS:0000000000000000
<4>CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
<4>CR2: 00007f7086a8a740 CR3: 0000000c57daf000 CR4: 00000000001406e0
<4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>Process redis-server (pid: 3084, threadinfo ffff880c5eddc000, task ffff880c5941e940)
<4>Stack:
<4> 0000000000000002 0000000000000000 0000000000000000 ffff880c00000001
<4><d> ffff880c5edddc78 ffff88184155d5f0 0000000000000000 ffffffff81180f9b
<4><d> ffff88184155d630 ffffffff810a9f0d ffffea0052938e78 0000000180010f40
<4>Call Trace:
<4> [<ffffffff81180f9b>] ? __split_huge_page_pmd+0x2b/0xc0
<4> [<ffffffff810a9f0d>] ? trace_hardirqs_on_caller+0x14d/0x190
<4> [<ffffffff81180ff0>] __split_huge_page_pmd+0x80/0xc0
<4> [<ffffffff81147ff4>] unmap_vmas+0xa34/0xc00
<4> [<ffffffff81148b9d>] zap_page_range+0x7d/0xe0
<4> [<ffffffff81144b4d>] sys_madvise+0x54d/0x770
<4> [<ffffffff815073bb>] ? thread_return+0x4e/0x7d3
<4> [<ffffffff810a9f0d>] ? trace_hardirqs_on_caller+0x14d/0x190
<4> [<ffffffff81509ee2>] ? trace_hardirqs_on_thunk+0x3a/0x3f
<4> [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
<4>Code: 83 c1 01 e8 69 5e 38 00 0f 0b eb fe 0f 0b eb fe 0f 0b 0f 1f 80 00 00 00 00 eb f7 0f 0b eb fe 0f 0b 0f 1f 84 00 00 00 00 00 eb f6 <0f> 0b eb fe 0f 0b 0f 1f 84 00 00 00 00 00 eb f6 0f 0b eb fe 0f
<1>RIP [<ffffffff81180f22>] split_huge_page+0x802/0x850
<4> RSP <ffff880c5edddbd8>
Correlating with the source, the call path (from the back trace above) is:

unmap_vmas -> unmap_page_range -> zap_pud_range -> zap_pmd_range -> split_huge_page_pmd -> __split_huge_page_pmd -> split_huge_page -> __split_huge_page -> __split_huge_page_refcount

mm/huge_memory.c line 1194 is the BUG site shown in the log above.
[Assembly code]: obtained by running objdump on vmlinux. Since many of these functions are static inline, the jump address found in System.map belongs to the caller; the inlined code has to be located inside the calling function's disassembly.
RAX: 00000000ffffffff
RBX: ffffea0029333970
R12: ffffea0029334000
Trigger cause: the value in eax is -1.
eax holds the result of: add $0x1, %eax  <------->  atomic_read(&(page)->_mapcount) + 1;
That is, _mapcount was -2, and the addition produced -1.
The symptom is that mapcount took on a value that, in theory, should never and can never occur.
mapcount (the comment on _mapcount in struct page):

 * Count of ptes mapped in
 * mms, to show when page is
 * mapped & limit reverse map
 * searches.

It starts from "-1".
Analysis shows that, in theory, _mapcount at this point should be -1: page_mapcount() would then return 0, the check BUG_ON(page_mapcount(page_tail) < 0) would see 0, and the BUG would not fire.
The overwritten value is not random garbage; it looks as if the correct value had been decremented by one.
And the candidate decrement sites are all in put_compound_page...
A web search (Google) turned up the following record:
====== start quote =======

mapcount 0 page_mapcount 1
kernel BUG at mm/huge_memory.c:1384!

At some point prior to the panic, a "bad pmd ..." message similar to the
following is logged on the console:

  mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).

The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
the page's PMD table entry.

    143 void pmd_clear_bad(pmd_t *pmd)
    144 {
->  145     pmd_ERROR(*pmd);
    146     pmd_clear(pmd);
    147 }

After the PMD table entry has been cleared, there is an inconsistency
between the actual number of PMD table entries that are mapping the page
and the page's map count (_mapcount field in struct page). When the page
is subsequently reclaimed, __split_huge_page() detects this inconsistency.

   1381     if (mapcount != page_mapcount(page))
   1382         printk(KERN_ERR "mapcount %d page_mapcount %d\n",
   1383                mapcount, page_mapcount(page));
-> 1384     BUG_ON(mapcount != page_mapcount(page));

The root cause of the problem is a race of two threads in a multithreaded
process. Thread B incurs a page fault on a virtual address that has never
been accessed (PMD entry is zero) while Thread A is executing an madvise()
system call on a virtual address within the same 2 MB (huge page) range.

           virtual address space
          .---------------------.
          |                     |
          |                     |
        .-|---------------------|
        | |                     |
        | |                     |<-- B(fault)
        | |                     |
  2 MB  | |/////////////////////|-.
  huge <  |/////////////////////| > A(range)
  page  | |/////////////////////|-'
        | |                     |
        | |                     |
        '-|---------------------|
          |                     |
          |                     |
          '---------------------'

- Thread A is executing an madvise(..., MADV_DONTNEED) system call
  on the virtual address range "A(range)" shown in the picture.

sys_madvise
  // Acquire the semaphore in shared mode.
  down_read(&current->mm->mmap_sem)
  ...
  madvise_vma
    switch (behavior)
    case MADV_DONTNEED:
      madvise_dontneed
        zap_page_range
          unmap_vmas
            unmap_page_range
              zap_pud_range
                zap_pmd_range
                  //
                  // Assume that this huge page has never been accessed.
                  // I.e. content of the PMD entry is zero (not mapped).
                  //
                  if (pmd_trans_huge(*pmd)) {
                      // We don't get here due to the above assumption.
                  }
                  //
                  // Assume that Thread B incurred a page fault and
      .---------> // sneaks in here as shown below.
      |           //
      |           if (pmd_none_or_clear_bad(pmd))
      |           {
      |               if (unlikely(pmd_bad(*pmd)))
      |                   pmd_clear_bad
      |                   {
      |                       pmd_ERROR
      |                         // Log "bad pmd ..." message here.
      |                       pmd_clear
      |                         // Clear the page's PMD entry.
      |                         // Thread B incremented the map count
      |                         // in page_add_new_anon_rmap(), but
      |                         // now the page is no longer mapped
      |                         // by a PMD entry (-> inconsistency).
      |                   }
      |           }
      |
      v
- Thread B is handling a page fault on virtual address "B(fault)" shown
  in the picture.

...
do_page_fault
  __do_page_fault
    // Acquire the semaphore in shared mode.
    down_read_trylock(&mm->mmap_sem)
    ...
    handle_mm_fault
      if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
          // We get here due to the above assumption (PMD entry is zero).
          do_huge_pmd_anonymous_page
            alloc_hugepage_vma
              // Allocate a new transparent huge page here.
            ...
            __do_huge_pmd_anonymous_page
              ...
              spin_lock(&mm->page_table_lock)
              ...
              page_add_new_anon_rmap
                // Here we increment the page's map count (starts at -1).
                atomic_set(&page->_mapcount, 0)
              set_pmd_at
                // Here we set the page's PMD entry which will be cleared
                // when Thread A calls pmd_clear_bad().
              ...
              spin_unlock(&mm->page_table_lock)

The mmap_sem does not prevent the race because both threads are acquiring
it in shared mode (down_read). Thread B holds the page_table_lock while
the page's map count and PMD table entry are updated. However, Thread A
does not synchronize on that lock.

====== end quote =======
Putting the above analysis together with our situation, there is good reason to conclude that this is a multi-thread race (the call path nests several loops, and the interleaving of multiple threads only lines up under specific conditions).
The key point of the quote bears repeating: the mmap_sem does not prevent the race because both threads acquire it in shared mode (down_read). Thread B holds the page_table_lock while the page's map count and PMD table entry are updated; however, Thread A does not synchronize on that lock.
We still need to pin down exactly where the race is triggered...
...and how to construct a test environment to reproduce it...