Mass process crashes caused by a CPU problem

Yesterday I received a faulty device. The symptom: it reboots over and over.

The logs show surfaceflinger repeatedly dying with a native exception (NE), for example:

pid: 17522, tid: 17522, name: surfaceflinger >>> /system/bin/surfaceflinger <<<
signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x7d977d4750
x0 ffffffffffffffff x1 ffffffffffffffff x2 0000000000000000 x3 0000000000000010
x4 fffffffffffffff0 x5 0000000000000040 x6 000000000000003f x7 0000000000000000
x8 0000007f977d47d0 x9 0000000000000004 x10 0000007f9940d1c0 x11 0000000000000004
x12 0000007f9940d1e8 x13 0000000000000000 x14 0000000000000000 x15 0000007f977d4738
x16 0000007f977d52f8 x17 0000007f977d53b4 x18 0000007f989a8b80 x19 0000007f977d3c00
x20 0000007fc5777a00 x21 0000007f977d4c00 x22 0000007f977d3c10 x23 0000000000000000
x24 0000000000000000 x25 0000007f92330010 x26 0000000000000438 x27 0000000000000438
x28 0000000000000780 x29 0000007fc57778e0 x30 0000007f989d2688
sp 0000007fc57778e0 pc 0000007f989d26a0 pstate 0000000080000000

backtrace:
#00 pc 00000000001cb6a0 /system/vendor/lib64/egl/libGLESv2_adreno.so (_ZN9EsxGfxMem4InitEP19EsxGfxMemCreateData+320)
#01 pc 00000000001cc2a8 /system/vendor/lib64/egl/libGLESv2_adreno.so (_ZN9EsxGfxMem6CreateEP19EsxGfxMemCreateData+72)
#02 pc 00000000001a1e58 /system/vendor/lib64/egl/libGLESv2_adreno.so (_ZN17EglSubDriverImage4InitEP10EsxContext+248)
#03 pc 00000000001a22f0 /system/vendor/lib64/egl/libGLESv2_adreno.so (_ZN17EglSubDriverImage6CreateEP10EglDisplayP10EsxContextiiPvPKi+208)
#04 pc 0000000000193294 /system/vendor/lib64/egl/libGLESv2_adreno.so (_ZN6EglApi11CreateImageEPvS0_jS0_PKi+340)
...

 

The instructions around pc are:

code around pc:
    0000007f989d2680 97fc1b75d2817802 911be2b0912ce26f  .x..u...o.,.....
    0000007f989d2690 9280000192800000 a90005e0911ed2b1  ................
    0000007f989d26a0 a90205e0a90105e0 a90405e0a90305e0  ................
    0000007f989d26b0 a90605e0a90505e0 a9000600a90705e0  ................
    0000007f989d26c0 a9020600a9010600 a9040600a9030600  ................
    0000007f989d26d0 a9060600a9050600 aa1103e3a9070600  ................
    0000007f989d26e0 b957e660f94bce72 f90bce618b000241  r.K.`.W.A...a...
    0000007f989d26f0 110007c2885ffc7e 35ffffa48804fc62  ~._.....b......5
    0000007f989d2700 36000185395eb2a5 aa1303e0f9400276  ..^9...6v.@.....
    0000007f989d2710 d63f0260f9400ad3 2a1503e02a0003f5  ..@.`.?....*...*
    0000007f989d2720 a94153f3f9401bf7 a8c47bfda9425bf5  ..@..SA..[B..{..
    0000007f989d2730 b957e263d65f03c0 51000c6652800009  .._.c.W....Rf..Q
    0000007f989d2740 54000689710070df b940168b79402a8a  .p.q...T.*@y..@.
    0000007f989d2750 331b0d8bd345214c 721f05bf53001d6d  L!E....3m..S...r
    0000007f989d2760 52a1800654000481 52a0800452a10007  ...T...R...R...R
    0000007f989d2770 7100099f52800008 7100119f540004a0  ...R...q...T...q

 

Disassembled into ARM instructions with a tool:

    7f989d2680:	d2817802 	mov	x2, #0xbc0                 	// #3008
    7f989d2684:	97fc1b75 	bl	0x7f988d9458
    7f989d2688:	912ce26f 	add	x15, x19, #0xb38
    7f989d268c:	911be2b0 	add	x16, x21, #0x6f8
    7f989d2690:	92800000 	mov	x0, #0xffffffffffffffff    	// #-1
    7f989d2694:	92800001 	mov	x1, #0xffffffffffffffff    	// #-1
    7f989d2698:	911ed2b1 	add	x17, x21, #0x7b4
    7f989d269c:	a90005e0 	stp	x0, x1, [x15]
    7f989d26a0:	a90105e0 	stp	x0, x1, [x15,#16]
    7f989d26a4:	a90205e0 	stp	x0, x1, [x15,#32]
    7f989d26a8:	a90305e0 	stp	x0, x1, [x15,#48]
    7f989d26ac:	a90405e0 	stp	x0, x1, [x15,#64]
    7f989d26b0:	a90505e0 	stp	x0, x1, [x15,#80]
    7f989d26b4:	a90605e0 	stp	x0, x1, [x15,#96]
    7f989d26b8:	a90705e0 	stp	x0, x1, [x15,#112]
    7f989d26bc:	a9000600 	stp	x0, x1, [x16]
    7f989d26c0:	a9010600 	stp	x0, x1, [x16,#16]
    7f989d26c4:	a9020600 	stp	x0, x1, [x16,#32]
    7f989d26c8:	a9030600 	stp	x0, x1, [x16,#48]
    7f989d26cc:	a9040600 	stp	x0, x1, [x16,#64]
    7f989d26d0:	a9050600 	stp	x0, x1, [x16,#80]
    7f989d26d4:	a9060600 	stp	x0, x1, [x16,#96]
    7f989d26d8:	a9070600 	stp	x0, x1, [x16,#112]
    7f989d26dc:	aa1103e3 	mov	x3, x17
    7f989d26e0:	f94bce72 	ldr	x18, [x19,#6040]
    7f989d26e4:	b957e660 	ldr	w0, [x19,#6116]
    7f989d26e8:	8b000241 	add	x1, x18, x0
    7f989d26ec:	f90bce61 	str	x1, [x19,#6040]
    7f989d26f0:	885ffc7e 	ldaxr	w30, [x3]
    7f989d26f4:	110007c2 	add	w2, w30, #0x1
    7f989d26f8:	8804fc62 	stlxr	w4, w2, [x3]
    7f989d26fc:	35ffffa4 	cbnz	w4, 0x7f989d26f0
    7f989d2700:	395eb2a5 	ldrb	w5, [x21,#1964]
    7f989d2704:	36000185 	tbz	w5, #0, 0x7f989d2734
    7f989d2708:	f9400276 	ldr	x22, [x19]
    7f989d270c:	aa1303e0 	mov	x0, x19
    7f989d2710:	f9400ad3 	ldr	x19, [x22,#16]
    7f989d2714:	d63f0260 	blr	x19
    7f989d2718:	2a0003f5 	mov	w21, w0
    7f989d271c:	2a1503e0 	mov	w0, w21
    7f989d2720:	f9401bf7 	ldr	x23, [sp,#48]
    7f989d2724:	a94153f3 	ldp	x19, x20, [sp,#16]
    7f989d2728:	a9425bf5 	ldp	x21, x22, [sp,#32]
    7f989d272c:	a8c47bfd 	ldp	x29, x30, [sp],#64
    7f989d2730:	d65f03c0 	ret
    7f989d2734:	b957e263 	ldr	w3, [x19,#6112]
    7f989d2738:	52800009 	mov	w9, #0x0                   	// #0
    7f989d273c:	51000c66 	sub	w6, w3, #0x3
    7f989d2740:	710070df 	cmp	w6, #0x1c
    7f989d2744:	54000689 	b.ls	0x7f989d2814
    7f989d2748:	79402a8a 	ldrh	w10, [x20,#20]
    7f989d274c:	b940168b 	ldr	w11, [x20,#20]
    7f989d2750:	d345214c 	ubfx	x12, x10, #5, #4
    7f989d2754:	331b0d8b 	bfi	w11, w12, #5, #4
    7f989d2758:	53001d6d 	uxtb	w13, w11
    7f989d275c:	721f05bf 	tst	w13, #0x6
    7f989d2760:	54000481 	b.ne	0x7f989d27f0
    7f989d2764:	52a18006 	mov	w6, #0xc000000             	// #201326592
    7f989d2768:	52a10007 	mov	w7, #0x8000000             	// #134217728
    7f989d276c:	52a08004 	mov	w4, #0x4000000             	// #67108864
    7f989d2770:	52800008 	mov	w8, #0x0                   	// #0
    7f989d2774:	7100099f 	cmp	w12, #0x2
    7f989d2778:	540004a0 	b.eq	0x7f989d280c
    7f989d277c:	7100119f 	cmp	w12, #0x4
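
The mapping from the raw dump to individual instructions can be reproduced by hand. Here is a minimal Python sketch, assuming each 16-digit group in "code around pc" is printed as one little-endian 64-bit word, so its low 32 bits are the instruction at the lower address. `split_code_dump` is a hypothetical helper name, not part of any existing tool.

```python
def split_code_dump(line):
    """Split one 'code around pc' line into (address, 32-bit word) pairs."""
    fields = line.split()
    base = int(fields[0], 16)
    pairs = []
    # Only the two 64-bit hex groups are used; the trailing ASCII column is ignored.
    for i, qword in enumerate(fields[1:3]):
        value = int(qword, 16)
        pairs.append((base + 8 * i, value & 0xFFFFFFFF))  # lower address
        pairs.append((base + 8 * i + 4, value >> 32))     # higher address
    return pairs


line = "0000007f989d26a0 a90205e0a90105e0 a90405e0a90305e0  ................"
for addr, word in split_code_dump(line):
    print(f"{addr:x}: {word:08x}")
```

The resulting words line up with the address and encoding columns of the disassembly above, and can be handed to any AArch64 disassembler.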

 

The faulting instruction is:

7f989d26a0:	a90105e0 	stp	x0, x1, [x15,#16]

 

At this point x15 holds 0x0000007f977d4738.

The instruction above writes x0 to 0x0000007f977d4748 (x15 + 16) and x1 to 0x0000007f977d4750 (x15 + 24).
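
The two target addresses can be double-checked from the raw encoding. Below is a minimal Python sketch that decodes only the 64-bit signed-offset form of STP and then applies the offset to the x15 value from the register dump. The function names are hypothetical, and the decoder deliberately rejects every other instruction form.

```python
def decode_stp64_signed_offset(word):
    """Decode STP <Xt1>, <Xt2>, [<Xn|SP>, #imm] (64-bit, signed offset)."""
    # Bits 31:22 are fixed for this form: opc=10, 101, V=0, 010, L=0.
    if (word & 0xFFC00000) != 0xA9000000:
        raise ValueError("not a 64-bit STP with signed offset")
    imm7 = (word >> 15) & 0x7F
    if imm7 & 0x40:  # sign-extend the 7-bit immediate
        imm7 -= 0x80
    return {
        "rt": word & 0x1F,
        "rn": (word >> 5) & 0x1F,
        "rt2": (word >> 10) & 0x1F,
        "offset": imm7 * 8,  # the immediate is scaled by the 8-byte register size
    }


def stp_store_addresses(base, offset):
    """Return the addresses written for the first and second register."""
    first = (base + offset) & 0xFFFFFFFFFFFFFFFF
    return first, first + 8


fields = decode_stp64_signed_offset(0xA90105E0)
x15 = 0x0000007F977D4738  # x15 as shown in the register dump
x0_addr, x1_addr = stp_store_addresses(x15, fields["offset"])
print(fields, hex(x0_addr), hex(x1_addr))
```

The decoded base register is x15 with an offset of 16, which gives the two addresses quoted above.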

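The fault address, however, is 0x0000007d977d4750. It looks as if, while x1 was being written to 0x0000007f977d4750, the address suddenly turned into 0x0000007d977d4750, and that caused the force close (FC).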

0x0000007f977d4750

0x0000007d977d4750

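These two values differ by exactly one bit (bit 33: 0x7f versus 0x7d in the top byte). When a single instruction misbehaves like this, it is almost certainly a CPU problem.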
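
The single-bit claim is easy to verify by XOR-ing the expected address with the fault address. A minimal Python sketch (`flipped_bits` is a hypothetical helper name):

```python
def flipped_bits(expected, actual):
    """Return the bit positions (0 = least significant) where two values differ."""
    diff = expected ^ actual
    return [bit for bit in range(diff.bit_length()) if (diff >> bit) & 1]


expected_addr = 0x0000007F977D4750  # x15 + 24, where x1 should have been stored
fault_addr = 0x0000007D977D4750     # fault addr reported in the tombstone
print(flipped_bits(expected_addr, fault_addr))
```

A result with exactly one entry means the two addresses are a single bit flip apart, which is the signature one would expect from a hardware fault rather than from a software bug.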

The memory that x15 points to supports this inference:

memory near x15:
    0000007f977d4718 0000000000000000 0000000000000000  ................
    0000007f977d4728 0000000000000000 0000000000000000  ................
    0000007f977d4738 ffffffffffffffff ffffffffffffffff  ................
    0000007f977d4748 ffffffffffffffff 0000000000000000  ................
    0000007f977d4758 0000000000000000 0000000000000000  ................
    0000007f977d4768 0000000000000000 0000000000000000  ................
    0000007f977d4778 0000000000000000 0000000000000000  ................
    0000007f977d4788 0000000000000000 0000000000000000  ................
    0000007f977d4798 0000000000000000 0000000000000000  ................
    0000007f977d47a8 0000000000000000 0000000000000000  ................
    0000007f977d47b8 0000000000000000 0000000000000000  ................
    0000007f977d47c8 0000000000000000 0000000000000000  ................
    0000007f977d47d8 0000000000000000 0000000000000000  ................
    0000007f977d47e8 0000000000000000 0000000000000000  ................
    0000007f977d47f8 0000000000000000 0000000000000000  ................
    0000007f977d4808 0000000000000000 0000000000000000  ................

 

The value at 0x0000007f977d4748 has already been updated to x0, but the value at 0x0000007f977d4750 is not x1.
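
The same check can be done mechanically on the "memory near x15" dump. This is a minimal Python sketch, assuming each dump line is an address followed by two 64-bit values; only the two relevant lines are copied in, and `parse_memory_dump` is a hypothetical helper name.

```python
def parse_memory_dump(lines):
    """Map each 8-byte slot address to the 64-bit value shown in the dump."""
    memory = {}
    for line in lines:
        fields = line.split()
        base = int(fields[0], 16)
        memory[base] = int(fields[1], 16)
        memory[base + 8] = int(fields[2], 16)
    return memory


dump = [
    "0000007f977d4738 ffffffffffffffff ffffffffffffffff  ................",
    "0000007f977d4748 ffffffffffffffff 0000000000000000  ................",
]
memory = parse_memory_dump(dump)
FILL = 0xFFFFFFFFFFFFFFFF  # x0 and x1 both hold -1 in the register dump

# Slots that already contain the fill value, in address order.
written = [hex(addr) for addr, value in sorted(memory.items()) if value == FILL]
print(written)
```

The slots up to and including x15 + 16 hold the fill value, while x15 + 24 does not: the first half of the faulting stp landed and the second half did not.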

So the value of x15 apparently changed while a single instruction was executing. What does that tell us?

 

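Other processes were force-closing as well, always with the same pattern: the register values are correct, but the value that was read has a few bits wrong.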

Some of the faulting instructions are listed below:

    7f98fd515c: a90039ed stp x13, x14, [x15]

    7f930c8c74: a94051f3 ldp x19, x20, [x15]

    7f81be5234: 3dc001e3 ldr q3, [x15]

    7f914b815c: a90039ed stp x13, x14, [x15]

    7f856c3d7c: b9001ded str w13, [x15,#28]

Every one of them faults while reading or writing memory addressed through x15.
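
With more tombstones, this kind of pattern is easier to spot by tallying the base register of each faulting instruction. A minimal Python sketch over the five instructions listed above; the regular expression simply grabs the first register name after `[` and is an illustrative assumption, not a full operand parser.

```python
import re
from collections import Counter

FAULTING = [
    "7f98fd515c: a90039ed stp x13, x14, [x15]",
    "7f930c8c74: a94051f3 ldp x19, x20, [x15]",
    "7f81be5234: 3dc001e3 ldr q3, [x15]",
    "7f914b815c: a90039ed stp x13, x14, [x15]",
    "7f856c3d7c: b9001ded str w13, [x15,#28]",
]


def base_register(disasm_line):
    """Return the base register of a memory operand such as [x15,#28], or None."""
    match = re.search(r"\[(\w+)", disasm_line)
    return match.group(1) if match else None


counts = Counter(base_register(line) for line in FAULTING)
print(counts.most_common())
```

If one register dominates the tally across unrelated processes and libraries, the common factor is more likely the hardware than any one piece of code.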

Software cannot explain this, so the next step is to file a bug with Qualcomm.

 

posted @ 2017-07-06 11:08  YYPapa