高通sensor库和Linker的死锁问题分析报告
【问题描述】
调试NativeHeap泄露时,我们会用到Android Native Heap调试框架。
在push libc_malloc_debug_leak.so后重启zygote(adb shell stop + adb shell start),会发现系统一直起不来。
用debuggerd打印system_server的调用栈,可以发现system server的大部分线程都在malloc函数里卡死:
pid: 4918, tid: 4929, name: Binder_1 >>> system_server <<< backtrace: #00 pc 00000000000178cc /system/bin/linker64 (__dl_syscall+28) #01 pc 00000000000160c0 /system/bin/linker64 (__dl__ZL33__pthread_mutex_lock_with_timeoutP24pthread_mutex_internal_tPK8timespeci.constprop.0+260) #02 pc 00000000000163cc /system/bin/linker64 (__dl_pthread_mutex_lock+36) #03 pc 0000000000003318 /system/bin/linker64 (__dl_dl_iterate_phdr+32) #04 pc 000000000003ca94 /system/lib64/libc_malloc_debug_leak.so (_Unwind_Find_FDE+368) #05 pc 0000000000039dd4 /system/lib64/libc_malloc_debug_leak.so #06 pc 000000000003a998 /system/lib64/libc_malloc_debug_leak.so #07 pc 000000000003b1fc /system/lib64/libc_malloc_debug_leak.so (_Unwind_Backtrace+76) #08 pc 0000000000008808 /system/lib64/libc_malloc_debug_leak.so #09 pc 0000000000009e10 /system/lib64/libc_malloc_debug_leak.so (leak_malloc+404) #10 pc 000000000001bb9c /system/lib64/libc.so (malloc+20) #11 pc 0000000000012770 /system/lib64/libutils.so (_ZN7android12SharedBuffer5allocEm+56) #12 pc 0000000000014108 /system/lib64/libutils.so (_ZN7android8String16C1EPKDsm+32) #13 pc 00000000000d356c /system/lib64/libandroid_runtime.so #14 pc 0000000073b6286c /data/dalvik-cache/arm64/system@framework@boot.oat (offset 0x248a000)
其中有两个线程发生死锁:
pid: 4918, tid: 4992, name: system_server >>> system_server <<<
backtrace:
#00 pc 000000000001bf9c /system/lib64/libc.so (syscall+28)
#01 pc 0000000000066bd4 /system/lib64/libc.so (_ZL33__pthread_mutex_lock_with_timeoutP24pthread_mutex_internal_tPK8timespeci.constprop.0+260)
#02 pc 0000000000066f5c /system/lib64/libc.so (pthread_mutex_lock+36)
#03 pc 000000000000a414 /system/vendor/lib64/libsensor1.so
#04 pc 000000000000c774 /system/vendor/lib64/libsensor1.so (sensor1_open+1388)
#05 pc 000000000002c6b0 /system/vendor/lib64/sensors.ssc.so (_ZN14SensorsContextC2Ev+252)
#06 pc 0000000000013298 /system/vendor/lib64/sensors.ssc.so
#07 pc 0000000000003c28 /system/bin/linker64 (__dl__ZN6soinfo13call_functionEPKcPFvvE+104)
#08 pc 0000000000003d7c /system/bin/linker64 (__dl__ZN6soinfo10call_arrayEPKcPPFvvEmb+248)
#09 pc 000000000000a090 /system/bin/linker64 (__dl__Z9do_dlopenPKciPK17android_dlextinfo+436)
#10 pc 00000000000033ac /system/bin/linker64 (__dl_dlopen+44)
#11 pc 0000000000006c70 /system/lib64/hw/sensors.msm8996.so (_ZL17lazy_init_modulesv.part.203+200)
#12 pc 0000000000006f20 /system/lib64/hw/sensors.msm8996.so (_ZL12open_sensorsPK11hw_module_tPKcPP11hw_device_t+456)
#13 pc 000000000000cdc0 /system/lib64/libsensorservice.so
#14 pc 00000000000117bc /system/lib64/libsensorservice.so
#15 pc 0000000000012244 /system/lib64/libutils.so (_ZNK7android7RefBase9incStrongEPKv+112)
#16 pc 000000000001fe90 /system/lib64/libandroid_servers.so (_ZN7android10sensorInitEPv+128)
#17 pc 0000000000065ee4 /system/lib64/libc.so (_ZL15__pthread_startPv+52)
#18 pc 000000000001ed44 /system/lib64/libc.so (__start_thread+16)
pid: 4918, tid: 4993, name: system_server >>> system_server <<<
backtrace:
#00 pc 00000000000178cc /system/bin/linker64 (__dl_syscall+28)
#01 pc 00000000000160c0 /system/bin/linker64 (__dl__ZL33__pthread_mutex_lock_with_timeoutP24pthread_mutex_internal_tPK8timespeci.constprop.0+260)
#02 pc 00000000000163cc /system/bin/linker64 (__dl_pthread_mutex_lock+36)
#03 pc 0000000000003318 /system/bin/linker64 (__dl_dl_iterate_phdr+32) -linker
#04 pc 000000000003ca94 /system/lib64/libc_malloc_debug_leak.so (_Unwind_Find_FDE+368)
#05 pc 0000000000039dd4 /system/lib64/libc_malloc_debug_leak.so
#06 pc 000000000003a998 /system/lib64/libc_malloc_debug_leak.so
#07 pc 000000000003b1fc /system/lib64/libc_malloc_debug_leak.so (_Unwind_Backtrace+76)
#08 pc 0000000000008808 /system/lib64/libc_malloc_debug_leak.so
#09 pc 0000000000009e10 /system/lib64/libc_malloc_debug_leak.so (leak_malloc+404)
#10 pc 000000000001bb9c /system/lib64/libc.so (malloc+20)
#11 pc 000000000000bd80 /system/vendor/lib64/libsensor1.so
#12 pc 0000000000065ee4 /system/lib64/libc.so (_ZL15__pthread_startPv+52)
#13 pc 000000000001ed44 /system/lib64/libc.so (__start_thread+16)
死锁过程如下:
4992线程在加载sensor模块时,由于sensors.ssc.so中定义了全局静态对象SensorsContext SensorsContext::self;
因此dlopen(4992#10)时,加载完sensors.ssc.so后就会调用SensorsContext的构造函数(4992#05)。
这个构造函数会创建线程4993,并等待4993线程唤醒自己。
而dlopen函数会持一个全局锁g_dl_mutex。
void* dlopen(const char* filename, int flags) { return dlopen_ext(filename, flags, nullptr); } static void* dlopen_ext(const char* filename, int flags, const android_dlextinfo* extinfo) { ScopedPthreadMutexLocker locker(&g_dl_mutex); soinfo* result = do_dlopen(filename, flags, extinfo); return result; }
4993线程在运行过程中,会获取一个sensor模块的全局锁libsensor_cli_data_mutex,然后再调用malloc()函数。
static sensor1_error_e libsensor_read_socket( int fd ) { ... pthread_mutex_lock( &libsensor_cli_data_mutex ); if( (cli_idx = libsensor_get_client_by_fd( fd )) < 0 ) { pthread_mutex_unlock( &libsensor_cli_data_mutex ); return SENSOR1_EBUFFER; } rx_msg_p = malloc( SENSOR_MAX_MSG_SIZE + sizeof(libsensor_ctl_read_s) -1 ); ...
由于打开了HeapLeak调试开关,所以malloc时,会记录当前的调用栈。
而获取调用栈的unwind方法需要用到linker的dl_iterate_phdr()方法(4993#3)
int dl_iterate_phdr(int (*cb)(dl_phdr_info* info, size_t size, void* data), void* data) { ScopedPthreadMutexLocker locker(&g_dl_mutex); return do_dl_iterate_phdr(cb, data); }
这个方法会申请g_dl_mutex锁,而这个锁此时已经被4992线程拿着了,所以4993会卡在dl_iterate_phdr()中。
此时4992线程还在等待4993线程:
sensor1_error_e sensor1_open( sensor1_handle_s **hndl, sensor1_notify_data_cb_t data_cbf, intptr_t cb_data ) { ... if( -1 == ( err = clock_gettime (CLOCK_REALTIME, &open_timeout ) ) ) { ... } else { open_timeout.tv_sec += 1; err = sem_timedwait( &open_sem, &open_timeout ); //这里等现线程4993 } if( 0 != err && ETIMEDOUT == errno ) { libsensor_client_data_s cli_data; LOG_ERROR( "%s: Sem wait timed-out for socket %i", __func__, sockfd ); libsensor_del_client( sockfd ); ...
sem_timedwait()在等待1秒后超时,会调用libsensor_del_client(),
而这个函数又申请4993线程持有的libsensor_cli_data_mutex锁。
static int libsensor_add_client( libsensor_client_data_s const *cli_data, bool is_wait_clnt ) { ... pthread_mutex_lock( &libsensor_cli_data_mutex );
这样就产生了死锁。
这个概率看起来是必现的,而由于这两个锁又不是同一个模块的,所以不太好解决。
似乎无论是HeapLeak机制还是Sensor代码,各自的处理都没有问题。
由于HeapLeak是Debug机制,所以在HeapLeak中做出让步,更合理一些。
【解决方案】
1、获取调用栈时,不用unwind方法:
在64位下,MTK对获取backtrace的方法做了优化,
可以参考MTK版本里的@bionic/libc/bionic/debug_stacktrace.cpp中的get_backtrace()的实现。
这个方法通过fp来查找lr,可以大大提高效率,且不会调用unwind,也不会有死锁了。
2、调用so的构造函数前释放 g_dl_mutex锁:
@bionic/linker/dlfcn.cpp /*static*/ pthread_mutex_t g_dl_mutex = PTHREAD_RECURSIVE_MUTEX_INITIALIZER_NP; static void* dlopen_ext(const char* filename, int flags, const android_dlextinfo* extinfo) { //ScopedPthreadMutexLocker locker(&g_dl_mutex); soinfo* result = do_dlopen(filename, flags, extinfo); return result; } @bionic/linker/linker.cpp extern pthread_mutex_t g_dl_mutex; soinfo* do_dlopen(const char* name, int flags, const android_dlextinfo* extinfo) { ... { ScopedPthreadMutexLocker locker(&g_dl_mutex); soinfo* si = find_library(name, flags, extinfo); } if (si != nullptr) { si->call_constructors(); //这里是so的构造函数 } return si; }