Analysis of How ANR Works
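Preface:
ANR stands for Application Not Responding. When one is triggered, the system shows a dialog to notify the user.
This article covers the following points:
- Scenarios that trigger an ANR
- How an ANR is produced
- How to analyze an ANR
- ANR monitoring
1. ANR trigger scenarios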
ANR type | Timeout | Error message |
---|---|---|
Input event (key press, touch, etc.) | 5s | Input event dispatching timed out |
Broadcast (BroadcastReceiver) | foreground 10s, background/offload 60s | Receiver during timeout of |
Service | startForeground() deadline 10s, executing in foreground 20s, executing in background 200s | Timeout executing service |
ContentProvider | 10s | timeout publishing content providers |
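2. How an ANR is produced
This section discusses how ANRs are produced for broadcasts, services, content providers, and input events.
The process can be summarized as:
- Before the event is handled, a delayed message is sent through a Handler
- If the event completes in time, the delayed message is removed
- Otherwise the delayed message fires and an ANR follows
One blog compares these three steps to planting a bomb, defusing it, and detonating it, which fits very well.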
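The plant/defuse/detonate pattern can be sketched in plain Java. This is only an illustration under assumptions: the class `TimeoutBomb` and its methods are made up for this post, and a `ScheduledExecutorService` stands in for the Handler's delayed message, so none of this is AOSP code.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch, not AOSP code: a scheduled task plays the delayed message.
final class TimeoutBomb {
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();
    private final AtomicBoolean exploded = new AtomicBoolean(false);
    private ScheduledFuture<?> pending;

    // Plant the bomb: schedule the timeout callback before the work starts.
    synchronized void plant(long timeoutMs) {
        pending = timer.schedule(() -> exploded.set(true), timeoutMs, TimeUnit.MILLISECONDS);
    }

    // Defuse the bomb: the work finished, so cancel the pending callback.
    synchronized void defuse() {
        if (pending != null) {
            pending.cancel(false);
        }
    }

    // Runs simulated work of workMs under a timeout; returns true if the bomb went off.
    static boolean runWithTimeout(long timeoutMs, long workMs) {
        TimeoutBomb bomb = new TimeoutBomb();
        bomb.plant(timeoutMs);
        try {
            Thread.sleep(workMs); // stands in for the component's work, e.g. onCreate
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        bomb.defuse();
        bomb.timer.shutdownNow();
        return bomb.exploded.get();
    }

    public static void main(String[] args) {
        System.out.println("slow work timed out: " + runWithTimeout(100, 400));
        System.out.println("fast work timed out: " + runWithTimeout(500, 20));
    }
}
```

The point of the design is that the timeout is armed before the work begins, so the system side never has to poll the app: either the app reports back and the timer is cancelled, or the timer fires on its own.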
2.1 Broadcast timeout mechanism
2.1.1 Setting the broadcast timeout
The call chain is:
ContextImpl.sendBroadcast
->AMS.broadcastIntent
->AMS.broadcastIntentLocked
broadcastIntentLocked adds the broadcast to a queue.

    @GuardedBy("this")
    final int broadcastIntentLocked(ProcessRecord callerApp, String callerPackage,
            Intent intent, String resolvedType, IIntentReceiver resultTo, int resultCode,
            String resultData, Bundle resultExtras, String[] requiredPermissions, int appOp,
            Bundle bOptions, boolean ordered, boolean sticky, int callingPid, int callingUid,
            int realCallingUid, int realCallingPid, int userId,
            boolean allowBackgroundActivityStarts) {
        intent = new Intent(intent);
        //...
        if ((receivers != null && receivers.size() > 0) || resultTo != null) {
            BroadcastQueue queue = broadcastQueueForIntent(intent);
            BroadcastRecord r = new BroadcastRecord(queue, intent, callerApp, callerPackage,
                    callingPid, callingUid, callerInstantApp, resolvedType,
                    requiredPermissions, appOp, brOptions, receivers, resultTo, resultCode,
                    resultData, resultExtras, ordered, sticky, false, userId,
                    allowBackgroundActivityStarts, timeoutExempt);
            if (DEBUG_BROADCAST) Slog.v(TAG_BROADCAST, "Enqueueing ordered broadcast " + r);
            final BroadcastRecord oldRecord =
                    replacePending ? queue.replaceOrderedBroadcastLocked(r) : null;
            if (oldRecord != null) {
                //...
            } else {
                queue.enqueueOrderedBroadcastLocked(r); // add to the mOrderedBroadcasts queue
                queue.scheduleBroadcastsLocked(); // schedule processing of the broadcast
            }
        }
        //...
        return ActivityManager.BROADCAST_SUCCESS;
    }
BroadcastQueue.scheduleBroadcastsLocked sends a BROADCAST_INTENT_MSG message:

    public void scheduleBroadcastsLocked() {
        if (DEBUG_BROADCAST) Slog.v(TAG_BROADCAST, "Schedule broadcasts ["
                + mQueueName + "]: current=" + mBroadcastsScheduled);
        if (mBroadcastsScheduled) {
            return;
        }
        mHandler.sendMessage(mHandler.obtainMessage(BROADCAST_INTENT_MSG, this));
        mBroadcastsScheduled = true;
    }

    private final class BroadcastHandler extends Handler {
        public BroadcastHandler(Looper looper) {
            super(looper, null, true);
        }

        @Override
        public void handleMessage(Message msg) {
            switch (msg.what) {
                case BROADCAST_INTENT_MSG: {
                    if (DEBUG_BROADCAST) Slog.v(
                            TAG_BROADCAST, "Received BROADCAST_INTENT_MSG ["
                            + mQueueName + "]");
                    processNextBroadcast(true);
                } break;
                case BROADCAST_TIMEOUT_MSG: {
                    synchronized (mService) {
                        broadcastTimeoutLocked(true);
                    }
                } break;
            }
        }
    }
Execution then moves to processNextBroadcast; the real work is done in processNextBroadcastLocked:

    final void processNextBroadcast(boolean fromMsg) {
        synchronized (mService) {
            processNextBroadcastLocked(fromMsg, false);
        }
    }

    final void processNextBroadcastLocked(boolean fromMsg, boolean skipOomAdj) {
        BroadcastRecord r;
        //...
        if (! mPendingBroadcastTimeoutMessage) {
            long timeoutTime = r.receiverTime + mConstants.TIMEOUT;
            if (DEBUG_BROADCAST) Slog.v(TAG_BROADCAST,
                    "Submitting BROADCAST_TIMEOUT_MSG ["
                    + mQueueName + "] for " + r + " at " + timeoutTime);
            setBroadcastTimeoutLocked(timeoutTime); // set the timeout
        }
        //...
    }
processNextBroadcastLocked sets the timeout. This is planting the bomb: once the time is up, the system has to decide whether to detonate it.

    final void setBroadcastTimeoutLocked(long timeoutTime) {
        if (! mPendingBroadcastTimeoutMessage) {
            Message msg = mHandler.obtainMessage(BROADCAST_TIMEOUT_MSG, this);
            mHandler.sendMessageAtTime(msg, timeoutTime);
            mPendingBroadcastTimeoutMessage = true;
        }
    }
2.1.2 Deciding whether an ANR occurs, i.e. whether the timeout has elapsed
Back to BroadcastHandler#BROADCAST_TIMEOUT_MSG.
When the time is up, broadcastTimeoutLocked runs and checks whether the current receiver has really timed out.

    final void broadcastTimeoutLocked(boolean fromMsg) {
        //...
        long now = SystemClock.uptimeMillis();
        BroadcastRecord r = mDispatcher.getActiveBroadcastLocked();
        if (fromMsg) {
            //...
            long timeoutTime = r.receiverTime + mConstants.TIMEOUT;
            if (timeoutTime > now) { // not timed out yet
                // We can observe premature timeouts because we do not cancel and reset the
                // broadcast timeout message after each receiver finishes. Instead, we set up
                // an initial timeout then kick it down the road a little further as needed
                // when it expires.
                setBroadcastTimeoutLocked(timeoutTime);
                return;
            }
        }
        //...
        if (!debugging && anrMessage != null) {
            // Post the ANR to the handler since we do not want to process ANRs while
            // potentially holding our lock.
            mHandler.post(new AppNotResponding(app, anrMessage)); // timed out: post the ANR
        }
    }
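The "kick it down the road" check is easy to miss, so here is a minimal sketch of just that decision. The class `BroadcastTimeoutCheck` and the `Action` enum are hypothetical names for illustration; only the comparison `receiverTime + timeout > now` comes from the code above.

```java
// Hypothetical sketch of the decision made when BROADCAST_TIMEOUT_MSG fires.
final class BroadcastTimeoutCheck {
    enum Action { REARM, ANR }

    // receiverTime: when the current receiver started; now: when the message fired.
    static Action onTimeoutMessage(long receiverTime, long timeoutMs, long now) {
        long timeoutTime = receiverTime + timeoutMs;
        // The timer is not reset after each receiver finishes, so it may fire early
        // for a receiver that started later than the one it was armed for.
        return timeoutTime > now ? Action.REARM : Action.ANR;
    }

    // The deadline the timer is re-armed with when the result is REARM.
    static long nextDeadline(long receiverTime, long timeoutMs) {
        return receiverTime + timeoutMs;
    }

    public static void main(String[] args) {
        long timeout = 10_000;
        // Timer armed at t=0 fires at t=10000, but the current receiver started at t=7000.
        System.out.println(onTimeoutMessage(7_000, timeout, 10_000));
        // The current receiver has been running since t=0: a real timeout.
        System.out.println(onTimeoutMessage(0, timeout, 10_000));
    }
}
```

In the first call the current receiver still has 7s left, so the timer is simply re-armed for t=17000; only the second call corresponds to a real ANR.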
2.2 Service timeout mechanism
2.2.1 Setting the service timeout (planting the bomb)
Take startService as an example:
Context.startService
The call chain is:
AMS.startService
ActiveServices.startService
ActiveServices.startServiceLocked
ActiveServices.startServiceInnerLocked
ActiveServices.bringUpServiceLocked
ActiveServices.realStartServiceLocked

    private final void realStartServiceLocked(ServiceRecord r, ProcessRecord app,
            boolean execInFg) throws RemoteException {
        // 1. This sends the delayed SERVICE_TIMEOUT_MSG message
        bumpServiceExecutingLocked(r, execInFg, "create");
        try {
            //...
            // 2. Tell the app process (through its ApplicationThread) to create the service
            app.thread.scheduleCreateService(r, r.serviceInfo,
                    mAm.compatibilityInfoForPackage(r.serviceInfo.applicationInfo),
                    app.getReportedProcState());
            r.postNotification();
            created = true;
        }
        //...
    }

    private final void bumpServiceExecutingLocked(ServiceRecord r, boolean fg, String why) {
        //...
        scheduleServiceTimeoutLocked(r.app); // the method that actually sends the message
        //...
    }

    void scheduleServiceTimeoutLocked(ProcessRecord proc) {
        if (proc.executingServices.size() == 0 || proc.thread == null) {
            return;
        }
        Message msg = mAm.mHandler.obtainMessage(
                ActivityManagerService.SERVICE_TIMEOUT_MSG);
        msg.obj = proc;
        // Send the delayed message: SERVICE_TIMEOUT (20s) when executing in the
        // foreground, SERVICE_BACKGROUND_TIMEOUT (200s) otherwise
        mAm.mHandler.sendMessageDelayed(msg,
                proc.execServicesFg ? SERVICE_TIMEOUT : SERVICE_BACKGROUND_TIMEOUT);
    }
If the message is not removed within the timeout, it is handled in AMS:

    final class MainHandler extends Handler {
        public MainHandler(Looper looper) {
            super(looper, null, true);
        }

        @Override
        public void handleMessage(Message msg) {
            switch (msg.what) {
                //... a service timeout ends up in serviceTimeout
                case SERVICE_TIMEOUT_MSG: {
                    mServices.serviceTimeout((ProcessRecord)msg.obj);
                } break;
            }
        }
    }
2.2.2 Removing the service timeout message (defusing the bomb)
Starting a Service first goes through AMS, which then tells the app process to run the Service lifecycle.
ActivityThread's handleCreateService method gets called:

    //ActivityThread.java
    @UnsupportedAppUsage
    private void handleCreateService(CreateServiceData data) {
        //...
        try {
            //...
            Application app = packageInfo.makeApplication(false, mInstrumentation);
            service.attach(context, this, data.info.name, data.token, app,
                    ActivityManager.getService());
            service.onCreate(); // 1. Service.onCreate is called
            mServices.put(data.token, service);
            try {
                // 2. The bomb is defused here
                ActivityManager.getService().serviceDoneExecuting(
                        data.token, SERVICE_DONE_EXECUTING_ANON, 0, 0);
            } catch (RemoteException e) {
                throw e.rethrowFromSystemServer();
            }
        }
        //...
    }
At comment 1, the Service's onCreate method is called.
At comment 2, AMS's serviceDoneExecuting is called, which eventually reaches ActiveServices.serviceDoneExecutingLocked:

    private void serviceDoneExecutingLocked(ServiceRecord r, boolean inDestroying,
            boolean finishing) {
        //...
        // remove the delayed message
        mAm.mHandler.removeMessages(ActivityManagerService.SERVICE_TIMEOUT_MSG, r.app);
        //...
    }

As you can see, once onCreate returns, the delayed message is removed and the bomb is defused.
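2.2.3 Service triggers an ANR (detonating the bomb)
Suppose the Service's onCreate runs past the timeout (SERVICE_TIMEOUT, 20s, when executing in the foreground). The bomb then goes off, that is, serviceTimeout is called:

    void serviceTimeout(ProcessRecord proc) {
        //...
        if (anrMessage != null) {
            mAm.mAppErrors.appNotResponding(proc, null, null, false, anrMessage);
        }
        //...
    }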
2.3 ContentProvider timeout mechanism
2.3.1 Setting the ContentProvider timeout (planting the bomb)
When an app starts, publishing its ContentProviders is also subject to a timeout.
Call chain:
ActivityThread.attach()
AMS.attachApplicationLocked()
After the app starts, ActivityThread runs attach(), which ends up in attachApplicationLocked(), where the timeout check is set up.

    //ActivityManagerService.java
    // How long we wait for an attached process to publish its content providers
    // before we decide it must be hung.
    static final int CONTENT_PROVIDER_PUBLISH_TIMEOUT = 10*1000;

    /**
     * How long we wait for an provider to be published. Should be longer than
     * {@link #CONTENT_PROVIDER_PUBLISH_TIMEOUT}.
     */
    static final int CONTENT_PROVIDER_WAIT_TIMEOUT = 20 * 1000;

    @GuardedBy("this")
    private boolean attachApplicationLocked(@NonNull IApplicationThread thread,
            int pid, int callingUid, long startSeq) {
        if (providers != null && checkAppInLaunchingProvidersLocked(app)) {
            Message msg = mHandler.obtainMessage(CONTENT_PROVIDER_PUBLISH_TIMEOUT_MSG);
            msg.obj = app;
            // may end in a timeout
            mHandler.sendMessageDelayed(msg, CONTENT_PROVIDER_PUBLISH_TIMEOUT);
        }
        //...
        try {
            // This call eventually removes the delayed CONTENT_PROVIDER_PUBLISH_TIMEOUT_MSG.
            // thread is the ApplicationThread object passed in from ActivityThread.
            thread.bindApplication(processName, appInfo, providers, ...);
        }
    }

    final class MainHandler extends Handler {
        @Override
        public void handleMessage(Message msg) {
            switch (msg.what) {
                case CONTENT_PROVIDER_PUBLISH_TIMEOUT_MSG: {
                    ProcessRecord app = (ProcessRecord)msg.obj;
                    synchronized (ActivityManagerService.this) {
                        processContentProviderPublishTimedOutLocked(app);
                    }
                } break;
            }
        }
    }

    @GuardedBy("this")
    private final void processContentProviderPublishTimedOutLocked(ProcessRecord app) {
        cleanupAppInLaunchingProvidersLocked(app, true);
        mProcessList.removeProcessLocked(app, false, true,
                "timeout publishing content providers");
    }
2.3.2 Removing the ContentProvider timeout message (defusing the bomb)
To see how the message is removed, look at thread.bindApplication(): it runs after the delayed message is sent and eventually removes it. If this completes within 10s, the timeout is not triggered.
Note: this is just the simplest path to follow; the message is removed in more than one place.
A quick look at how the delayed message is removed:

    //ActivityThread.java
    private class ApplicationThread extends IApplicationThread.Stub {
        public final void bindApplication(String processName, ApplicationInfo appInfo, ...) {
            sendMessage(H.BIND_APPLICATION, data);
        }
    }

    class H extends Handler {
        public void handleMessage(Message msg) {
            switch (msg.what) {
                case BIND_APPLICATION:
                    AppBindData data = (AppBindData)msg.obj;
                    handleBindApplication(data);
                    break;
            }
        }
    }

    @UnsupportedAppUsage
    private void handleBindApplication(AppBindData data) {
        try {
            if (!data.restrictedBackupMode) {
                if (!ArrayUtils.isEmpty(data.providers)) {
                    installContentProviders(app, data.providers);
                }
            }
        }
    }

    @UnsupportedAppUsage
    private void installContentProviders(
            Context context, List<ProviderInfo> providers) {
        try {
            ActivityManager.getService().publishContentProviders(
                    getApplicationThread(), results);
        } catch (RemoteException ex) {
            throw ex.rethrowFromSystemServer();
        }
    }

    //ActivityManagerService.java
    public final void publishContentProviders(IApplicationThread caller,
            List<ContentProviderHolder> providers) {
        final ProcessRecord r = getRecordForAppLocked(caller);
        if (wasInLaunchingProviders) {
            mHandler.removeMessages(CONTENT_PROVIDER_PUBLISH_TIMEOUT_MSG, r);
        }
    }
2.3.3 ContentProvider triggers the timeout (detonating the bomb)

    //AMS
    @GuardedBy("this")
    private final void processContentProviderPublishTimedOutLocked(ProcessRecord app) {
        cleanupAppInLaunchingProvidersLocked(app, true);
        mProcessList.removeProcessLocked(app, false, true,
                "timeout publishing content providers");
    }

When the timeout fires, appNotResponding() is not called here (unlike the other types). Instead, the process is killed and its related state is cleaned up.
2.4 Input event timeout mechanism
I will write this part after analyzing the input event mechanism; I don't fully understand it yet.
2.5 ANR handling
On Android 10, the ANR is handled by ProcessRecord's appNotResponding method.
Let's look at what this method does:

    void appNotResponding(String activityShortComponentName, ApplicationInfo aInfo,
            String parentShortComponentName, WindowProcessController parentProcess,
            boolean aboveSystem, String annotation) {
        //...
        // 1. Write to the event log
        // Log the ANR to the event log.
        EventLog.writeEvent(EventLogTags.AM_ANR, userId, pid, processName, info.flags,
                annotation);
        //...
        // 2. Collect the logs that are needed: ANR info, CPU usage, etc.
        // Log the ANR to the main log.
        StringBuilder info = new StringBuilder();
        info.setLength(0);
        info.append("ANR in ").append(processName);
        if (activityShortComponentName != null) {
            info.append(" (").append(activityShortComponentName).append(")");
        }
        info.append("\n");
        info.append("PID: ").append(pid).append("\n");
        if (annotation != null) {
            info.append("Reason: ").append(annotation).append("\n");
        }
        if (parentShortComponentName != null
                && parentShortComponentName.equals(activityShortComponentName)) {
            info.append("Parent: ").append(parentShortComponentName).append("\n");
        }

        ProcessCpuTracker processCpuTracker = new ProcessCpuTracker(true);
        //...
        // 3. Dump stack traces (both Java and native) and save them to a file
        // For background ANRs, don't pass the ProcessCpuTracker to
        // avoid spending 1/2 second collecting stats to rank lastPids.
        File tracesFile = ActivityManagerService.dumpStackTraces(firstPids,
                (isSilentAnr()) ? null : processCpuTracker,
                (isSilentAnr()) ? null : lastPids, nativePids);

        String cpuInfo = null;
        if (isMonitorCpuUsage()) {
            mService.updateCpuStatsNow();
            synchronized (mService.mProcessCpuTracker) {
                cpuInfo = mService.mProcessCpuTracker.printCurrentState(anrTime);
            }
            info.append(processCpuTracker.printCurrentLoad());
            info.append(cpuInfo);
        }

        info.append(processCpuTracker.printCurrentState(anrTime));

        Slog.e(TAG, info.toString()); // 4. Print the ANR log
        if (tracesFile == null) {
            // There is no trace file, so dump (only) the alleged culprit's threads to the log
            // 5. No traces file was captured, so send SIGNAL_QUIT
            Process.sendSignal(pid, Process.SIGNAL_QUIT);
        }

        StatsLog.write(StatsLog.ANR_OCCURRED, uid, processName,
                activityShortComponentName == null ? "unknown" : activityShortComponentName,
                annotation,
                (this.info != null) ? (this.info.isInstantApp()
                        ? StatsLog.ANROCCURRED__IS_INSTANT_APP__TRUE
                        : StatsLog.ANROCCURRED__IS_INSTANT_APP__FALSE)
                        : StatsLog.ANROCCURRED__IS_INSTANT_APP__UNAVAILABLE,
                isInterestingToUserLocked()
                        ? StatsLog.ANROCCURRED__FOREGROUND_STATE__FOREGROUND
                        : StatsLog.ANROCCURRED__FOREGROUND_STATE__BACKGROUND,
                getProcessClassEnum(),
                (this.info != null) ? this.info.packageName : "");
        final ProcessRecord parentPr = parentProcess != null
                ? (ProcessRecord) parentProcess.mOwner : null;
        // 6. Write to DropBox
        mService.addErrorToDropBox("anr", this, processName, activityShortComponentName,
                parentShortComponentName, parentPr, annotation, cpuInfo, tracesFile, null);
        //...
        synchronized (mService) {
            // mBatteryStatsService can be null if the AMS is constructed with injector only. This
            // will only happen in tests.
            if (mService.mBatteryStatsService != null) {
                mService.mBatteryStatsService.noteProcessAnr(processName, uid);
            }

            if (isSilentAnr() && !isDebugging()) {
                kill("bg anr", true); // 7. Background ANR: kill the process directly
                return;
            }

            // 8. Error report
            // Set the app's notResponding state, and look up the errorReportReceiver
            makeAppNotRespondingLocked(activityShortComponentName,
                    annotation != null ? "ANR " + annotation : "ANR", info.toString());

            // 9. Show the ANR dialog; this ends up calling handleShowAnrUi
            // mUiHandler can be null if the AMS is constructed with injector only. This will only
            // happen in tests.
            if (mService.mUiHandler != null) {
                // Bring up the infamous App Not Responding dialog
                Message msg = Message.obtain();
                msg.what = ActivityManagerService.SHOW_NOT_RESPONDING_UI_MSG;
                msg.obj = new AppNotRespondingDialog.Data(this, aInfo, aboveSystem);

                mService.mUiHandler.sendMessage(msg);
            }
        }
    }
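The main flow is:
1. Write to the event log
2. Write to the main log
3. Generate the traces file
4. Print the ANR to logcat (visible in the console)
5. If no traces file was obtained, send SIGNAL_QUIT; according to the source comment, this makes the process dump its thread stacks to the log
6. Write to DropBox
7. For a background ANR, kill the process directly
8. Error report
9. Show the ANR dialog, which calls AppErrors#handleShowAnrUi.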
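To make step 2 concrete, the header that ends up in the main log can be reproduced with the same StringBuilder logic. This is a sketch only: the class `AnrLogHeader` is a hypothetical helper, and the process name, PID and reason used in `main` are made-up sample values rather than output from a real device.

```java
// Hypothetical helper mirroring the StringBuilder logic in appNotResponding.
final class AnrLogHeader {
    static String build(String processName, String activity, int pid,
            String reason, String parent) {
        StringBuilder info = new StringBuilder();
        info.append("ANR in ").append(processName);
        if (activity != null) {
            info.append(" (").append(activity).append(")");
        }
        info.append("\n");
        info.append("PID: ").append(pid).append("\n");
        if (reason != null) {
            info.append("Reason: ").append(reason).append("\n");
        }
        // As in the excerpt, the parent line is only added when it equals the activity.
        if (parent != null && parent.equals(activity)) {
            info.append("Parent: ").append(parent).append("\n");
        }
        return info.toString();
    }

    public static void main(String[] args) {
        // Sample values are invented for illustration.
        System.out.print(build("com.example.app", "com.example.app/.MainActivity",
                1234, "Input dispatching timed out", null));
    }
}
```

Knowing this layout helps when searching logcat: the "ANR in" line gives the process, and the "Reason:" line carries the annotation that identifies which of the timeouts fired.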
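3. ANR analysis methods
3.1 Analyzing an ANR through traces.txt
The trigger flow analyzed above ends by saving the thread stacks, CPU usage, and other information captured at the time of the ANR. We usually analyze the /data/anr/traces.txt file.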
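Since the traces file holds every thread, a first step is usually to pull out the main thread's block. Below is a rough sketch, assuming the common layout where each thread starts with a line like `"main" prio=5 tid=1 ...` and thread blocks are separated by blank lines. The class `TraceFilter` and the sample text are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: extracts one thread's block from traces text.
final class TraceFilter {
    static List<String> threadBlock(String traces, String threadName) {
        List<String> block = new ArrayList<>();
        String header = "\"" + threadName + "\" ";
        boolean inBlock = false;
        for (String line : traces.split("\n")) {
            if (!inBlock) {
                if (line.startsWith(header)) {
                    inBlock = true;
                    block.add(line);
                }
            } else if (line.trim().isEmpty()) {
                break; // a blank line ends the thread's block
            } else {
                block.add(line);
            }
        }
        return block;
    }

    public static void main(String[] args) {
        // Invented sample in the assumed format, not a real trace.
        String sample = "\"main\" prio=5 tid=1 Sleeping\n"
                + "  at java.lang.Thread.sleep(Native method)\n"
                + "  at com.example.App.onCreate(App.java:10)\n"
                + "\n"
                + "\"worker\" prio=5 tid=9 Runnable\n"
                + "  at com.example.Worker.run(Worker.java:5)\n";
        for (String line : threadBlock(sample, "main")) {
            System.out.println(line);
        }
    }
}
```

The thread state on the header line and the topmost application frame in the block are usually the first things to look at.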
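4. ANR monitoring
How can we get ANR logs for problems that happen in production?
I haven't done this myself; this part is copied from another article.
I'm copying it down first and will study it later.
4.1 Capturing and uploading the system traces.txt
1. When the monitoring thread finds that the main thread is stuck, it proactively sends the SIGNAL_QUIT signal.
2. Wait for /data/anr/traces.txt to be generated.
3. Upload the file once it has been generated.
There are two problems:
1. traces.txt contains information for all threads, so it has to be filtered and analyzed manually after upload
2. Many newer system versions require root permission to read the /data/anr directory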
4.2 ANRWatchDog
ANRWatchDog is an open-source library that automatically detects ANRs.
4.2.1 How ANRWatchDog works
Its source has only two classes. The core is the ANRWatchDog class, which extends Thread. Its run method is shown below; see the comments.

    public void run() {
        setName("|ANR-WatchDog|");

        long interval = _timeoutInterval;
        // 1. Start the loop
        while (!isInterrupted()) {
            boolean needPost = _tick == 0;
            _tick += interval;
            if (needPost) {
                // 2. Post a Runnable to the UI thread that sets _tick to 0
                // and _reported to false
                _uiHandler.post(_ticker);
            }

            try {
                // 3. Sleep for 5s
                Thread.sleep(interval);
            } catch (InterruptedException e) {
                _interruptionListener.onInterrupted(e);
                return ;
            }

            // If the main thread has not handled _ticker, it is blocked. ANR.
            // 4. After the 5s sleep, check _tick and _reported. Normally the main thread
            // has already set _tick to 0 and _reported to false. If not, the Runnable
            // from step 2 never ran, so the main thread is stuck
            if (_tick != 0 && !_reported) {
                //...
                if (_namePrefix != null) {
                    // 5. An ANR is detected: collect the stack traces and call
                    // onAppNotResponding
                    error = ANRError.New(_tick, _namePrefix, _logThreadsWithoutStackTrace);
                } else {
                    error = ANRError.NewMainOnly(_tick);
                }
                _anrListener.onAppNotResponding(error);
                interval = _timeoutInterval;
                _reported = true;
            }
        }
    }
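The principle behind ANRWatchDog is fairly simple and can be summarized in these steps:
- Start a thread that loops forever and sleeps 5s in each iteration
- Post a Runnable to the UI thread that sets _tick to 0 and _reported to false
- After the 5s sleep, check whether _tick and _reported were modified
- If they were not, the Runnable posted to the main thread never ran, which means the main thread has been blocked for at least 5s (only "at least": the error can be up to 5s)
- Output the thread stack information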
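The same loop can be tried outside Android. In the sketch below a single-thread executor stands in for the UI thread's Handler, and the interval is shortened so it runs quickly. `SimpleWatchdog` and `countAnrs` are hypothetical names, and this is not the library's actual implementation.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: a single-thread executor plays the role of the UI thread.
final class SimpleWatchdog extends Thread {
    private final ExecutorService mainThread;
    private final long intervalMs;
    private volatile long tick = 0;
    private volatile boolean reported = false;
    private final AtomicInteger anrCount = new AtomicInteger();

    SimpleWatchdog(ExecutorService mainThread, long intervalMs) {
        this.mainThread = mainThread;
        this.intervalMs = intervalMs;
        setDaemon(true);
    }

    @Override
    public void run() {
        while (!isInterrupted()) {
            boolean needPost = tick == 0;
            tick += intervalMs;
            if (needPost) {
                // Ask the "main thread" to prove it is alive by resetting the flags.
                mainThread.execute(() -> { tick = 0; reported = false; });
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                return;
            }
            // The ticker did not run during the whole sleep: the main thread is blocked.
            if (tick != 0 && !reported) {
                anrCount.incrementAndGet();
                reported = true;
            }
        }
    }

    // Blocks the fake main thread for blockMs and returns how many ANRs were reported.
    static int countAnrs(long blockMs, long intervalMs, long observeMs) {
        ExecutorService main = Executors.newSingleThreadExecutor();
        if (blockMs > 0) {
            main.execute(() -> sleepQuietly(blockMs));
        }
        SimpleWatchdog dog = new SimpleWatchdog(main, intervalMs);
        dog.start();
        sleepQuietly(observeMs);
        dog.interrupt();
        try {
            dog.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        main.shutdownNow();
        return dog.anrCount.get();
    }

    private static void sleepQuietly(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("blocked main thread, ANRs: " + countAnrs(600, 100, 350));
        System.out.println("idle main thread, ANRs: " + countAnrs(0, 300, 700));
    }
}
```

Because `reported` stays true until the ticker finally runs, one long block is reported once rather than once per interval.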
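This touches on a concurrency topic, the volatile keyword, which is a regular in interviews.
volatile guarantees visibility and prevents instruction reordering. It suits cases where one thread writes and other threads read.
Interviews usually expand from there into the JMM, working memory and main memory, why working memory exists, and whether every field could simply be marked volatile.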
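A minimal sketch of the one-writer, one-reader case: the reader spins on a flag that another thread sets. `VolatileFlagDemo` is a hypothetical name. Without `volatile`, the Java memory model would not guarantee that the reader ever sees the write.

```java
// Hypothetical sketch: one thread writes the flag, another thread reads it.
final class VolatileFlagDemo {
    private static volatile boolean done = false;

    // Returns true if the reader thread saw the write and stopped in time.
    static boolean runOnce() {
        done = false;
        Thread reader = new Thread(() -> {
            while (!done) {
                // busy-wait until the writer's update becomes visible
            }
        });
        reader.setDaemon(true);
        reader.start();
        done = true; // the single writer
        try {
            reader.join(2000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !reader.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("reader stopped: " + runOnce());
    }
}
```

Note that volatile does not make a compound update such as `_tick += interval` atomic, which is one reason it fits the single-writer case best.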
Back to ANRWatchDog itself: careful readers may notice that it sometimes fails to catch an ANR. Why is that?
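4.2.2 Drawbacks of ANRWatchDog
ANRWatchDog can miss ANRs; see the figure.
In the case shown in the figure, red marks the time during which the main thread is blocked.
Suppose the main thread has already been blocked for 2s when ANRWatchDog starts a new loop iteration: it sets _tick to the 5s interval and posts a task to the main thread that sets _tick back to 0.
3s later the main thread is no longer blocked and sets _tick to 0.
When ANRWatchDog wakes up after its 5s sleep, it sees that _tick is 0 and concludes that no ANR happened. In fact the main thread was blocked for 5s in total. ANRWatchDog's error is within 5s (5s is the default sleep interval of the thread).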
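The miss can be reproduced with a small deterministic simulation that uses plain numbers instead of real threads. This is a simplified model under assumptions: time is in milliseconds, the watchdog's first iteration starts at t=0, and a posted ticker runs immediately unless the main thread is blocked, in which case it runs when the block ends. `WatchdogMissSim` is a hypothetical name.

```java
// Hypothetical model of the ANRWatchDog loop on a timeline, with no real threads.
final class WatchdogMissSim {
    // The main thread is blocked during [blockStart, blockEnd).
    static boolean detects(long blockStart, long blockEnd, long intervalMs, long horizonMs) {
        long pendingRunAt = -1; // when the posted ticker will run; -1 means none pending
        for (long t = 0; t + intervalMs <= horizonMs; t += intervalMs) {
            if (pendingRunAt < 0) {
                boolean blocked = t >= blockStart && t < blockEnd;
                pendingRunAt = blocked ? blockEnd : t;
            }
            long checkAt = t + intervalMs;
            if (pendingRunAt >= checkAt) {
                return true; // the ticker has not run by the time of the check
            }
            pendingRunAt = -1; // the ticker ran, so _tick is back to 0
        }
        return false;
    }

    public static void main(String[] args) {
        // Blocked from 3s to 8s: 2s before the iteration at 5s, 3s after it.
        System.out.println("straddling 5s block detected: " + detects(3000, 8000, 5000, 20000));
        // Blocked just over 5s, starting exactly when an iteration begins.
        System.out.println("aligned block detected: " + detects(0, 5001, 5000, 20000));
    }
}
```

In this model the first call returns false and the second returns true: whether a block of about 5s is caught depends only on where it falls relative to the watchdog's iterations.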
This problem can be reduced with an optimization.
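4.3 ANRMonitor
The root cause of ANRWatchDog's missed detections is that the thread sleeps for 5s and does not know whether the main thread was already blocked just before. If the check runs once per second instead, the error drops to within 1s.
Next we rework ANRWatchDog with this optimization and call the result ANRMonitor.
To have a worker thread run a task every second, we can use a HandlerThread.
The flow is easiest to follow from the core Runnable code:

    @Volatile
    var mainHandlerRunEnd = true

    // The worker thread runs this Runnable once every second
    private val mThreadRunnable = Runnable {
        blockTime++
        // 1. The main thread has not updated the mainHandlerRunEnd flag, so it is blocked
        if (!mainHandlerRunEnd && !isDebugger()) {
            logw(TAG, "mThreadRunnable: main thread may be block at least $blockTime s")
        }

        // 2. Blocked for 5s or more: trigger the ANR flow and print the stack
        if (blockTime >= 5) {
            if (!mainHandlerRunEnd && !isDebugger() && !mHadReport) {
                mHadReport = true
                // 5s have passed and the main thread still has not updated the flag: ANR
                loge(TAG, "ANR->main thread may be block at least $blockTime s ")
                loge(TAG, getMainThreadStack())
                // todo: call back to the listener; other threads' stacks can be dumped here too
                // todo: in debug builds, a new process could be started to show the stack
            }
        }

        // 3. If the previous second was not blocked, reset the flag and let the
        // main thread update it again
        if (mainHandlerRunEnd) {
            blockTime = 0 // the main thread responded, so restart the count
            mainHandlerRunEnd = false
            mMainHandler.post {
                mainHandlerRunEnd = true
            }
        }

        // Run mThreadRunnable again on the worker thread in 1s
        sendDelayThreadMessage()
    }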
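The worker thread runs mThreadRunnable every 1s and checks whether the mainHandlerRunEnd flag has been updated.
If the main thread has set mainHandlerRunEnd to true as expected, the flag is reset to false and the check starts again from step 1.
If mainHandlerRunEnd has not been set to true, the main thread is blocked; once the block accumulates to 5s, the ANR flow is triggered.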
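The effect of polling every second can be checked with a small timeline model. It assumes that the block counter restarts whenever the main thread has responded, that time is in milliseconds, and that a posted task runs immediately unless the main thread is blocked. `PollingMonitorSim` is a hypothetical name.

```java
// Hypothetical model of the 1s polling monitor on a timeline, with no real threads.
final class PollingMonitorSim {
    // The main thread is blocked during [blockStart, blockEnd).
    static boolean detects(long blockStart, long blockEnd, long pollMs,
            int threshold, long horizonMs) {
        long flagSetAt = 0; // time at which the main thread sets the flag back to true
        int blockTicks = 0; // consecutive polls without a response
        for (long t = 0; t <= horizonMs; t += pollMs) {
            blockTicks++;
            boolean mainRan = flagSetAt <= t;
            if (!mainRan && blockTicks >= threshold) {
                return true; // no response for `threshold` polls: report an ANR
            }
            if (mainRan) {
                blockTicks = 0; // assumption: the count restarts after a response
                boolean blocked = t >= blockStart && t < blockEnd;
                flagSetAt = blocked ? blockEnd : t;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A 6s block from 3s to 9s, polled every second with a threshold of 5.
        System.out.println("6s block detected: " + detects(3000, 9000, 1000, 5, 20000));
        // A 2s block stays below the threshold.
        System.out.println("2s block detected: " + detects(3000, 5000, 1000, 5, 20000));
    }
}
```

In this model the 6s block is reported on the poll at t=8000, five seconds after the unanswered post, while the 2s block never reaches the threshold.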
When an ANR is detected, information such as CPU and memory usage is also important in addition to the main thread's stack; the demo omits this part.
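As a rough idea of what that extra information could look like, here is a plain-Java sketch that formats a thread's stack and a heap usage summary. `StackDump` is a hypothetical helper. On Android the main thread would be obtained with `Looper.getMainLooper().getThread()`, while the example below uses the current thread so that it runs anywhere.

```java
// Hypothetical helper: formats a thread's stack plus a heap usage summary.
final class StackDump {
    static String format(Thread thread) {
        StringBuilder sb = new StringBuilder();
        sb.append('"').append(thread.getName()).append("\" state=")
                .append(thread.getState()).append('\n');
        for (StackTraceElement frame : thread.getStackTrace()) {
            sb.append("    at ").append(frame).append('\n');
        }
        return sb.toString();
    }

    static String heapSummary() {
        Runtime rt = Runtime.getRuntime();
        long usedKb = (rt.totalMemory() - rt.freeMemory()) / 1024;
        long maxKb = rt.maxMemory() / 1024;
        return "heap used=" + usedKb + "KB max=" + maxKb + "KB";
    }

    public static void main(String[] args) {
        System.out.print(format(Thread.currentThread()));
        System.out.println(heapSummary());
    }
}
```

CPU usage is harder to collect portably and is left out of this sketch.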