转自:http://blog.csdn.net/ruingman/article/details/53118202

 

定义

主线程在特定的时间内没有做完特定的事情

常见的场景

   A.input事件超过5S没有处理完成
B.service executing 超时(bind,create,start,unbind等等),前台20s,后台200s
C.广播处理超时,前台10S,后台60s
D.ContentProvider执行超时,20s

常见的原因

        A.耗时操作,如复杂的layout,庞大的for循环,IO等。
B.被Binder 对端block
C.被子线程同步锁block
D.Binder被占满导致主线程无法和SystemServer通信
E.得不到系统资源(CPU/RAM/IO)
其中ABCD比较好分析,而E比较困难。

如何分析?

指导思想:通过各种线索,寻求主线程在过去一段时间内发生block的可能原因。

线索包括:
A.traces.txt/dropbox
AMS在ANR发生的时候,dump相关进程(ANR的进程、systemserver、mediaserver,surfaceFinger等)的当前栈到traces.txt。
traces.txt存在的几种形式:
1. adb pull /data/anr/
2. slog/2012xxxxx/misc/anr/snapshot_xxxx.log
3. slog/dropbox/data_app_anr_xxxxx.tgz
4. slog/dropbox/system_app_anr_xxxx.tgz
需要注意的是,traces.txt是抓取的是超时后(如input超时就是5s后)的snapshot,并不一定能够真实的反应出block的点。
也存在抓到主线程没有block,在idle的情况。
B.Eventlog中的dvm_lock_sample.
在同步锁发生content的时候,虚拟机会将这个竞争记录在eventlog中:
dvm_lock_sample: [system_server,1,Binder_7, 22, ActivityManagerService.java,15921,-,9628,4]
进程名 被block线程 block时间(ms) 被block的行号 持有者行号
实现可以参考art/runtime/monitor_android.cc LogContentionEvent函数
如果主线程是被binder对端、被同步锁block,那么eventlog中就很有可能会有dvm_lock_sample的打印。
华为项目上实现了更加强大的功能BlockMonitor,在主线程有耗时操作(如handlemsg、Binder调用耗时)的时候会打印出栈。 C.搜索主线程在发生ANR前后的main,systemlog,结合代码查看可能block在哪里。
现在发生ANR的时候,sprdruntimeinfo在mainlog中会Dump出比较多的信息,其中比较重要的是binder、cpu。
D.ANR发生的时候,打印出的CPU的占用。
注意,并不能因为有进程cpu占用高就果断的去怀疑他。
因为最主要的线索traces.txt的有效性并不是非常高,所以ANR问题分析是存在一定的局限性的。

案例分析

1. Settings 被对端block

 首先需要去看的就是traces.txt 中ANR进程的主线程的栈。
以settting的这次ANR为例: "main" prio=5 tid=1 Native
| group="main" sCount=1 dsCount=0 obj=0x73ee6470 self=0xb4d76500
| sysTid=22831 nice=0 cgrp=default sched=0/0 handle=0xb6f4bc00
| state=S schedstat=( 0 0 0 ) utm=22 stm=22 core=0 HZ=100
| stack=0xbe283000-0xbe285000 stackSize=8MB
| held mutexes=
native: #00 pc 000410ac /system/lib/libc.so (__ioctl+8)
native: #01 pc 000477e5 /system/lib/libc.so (ioctl+14)
native: #02 pc 0001e7c5 /system/lib/libbinder.so (android::IPCThreadState::talkWithDriver(bool)+132)
native: #03 pc 0001ee17 /system/lib/libbinder.so (android::IPCThreadState::waitForResponse(android::Parcel*, int*)+38)
native: #04 pc 0001efcd /system/lib/libbinder.so (android::IPCThreadState::transact(int, unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+124)
native: #05 pc 00019fb7 /system/lib/libbinder.so (android::BpBinder::transact(unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+30)
native: #06 pc 00086de9 /system/lib/libandroid_runtime.so (???)
native: #07 pc 00d94629 /data/dalvik-cache/arm/system@framework@boot.oat (Java_android_os_BinderProxy_transactNative__ILandroid_os_Parcel_2Landroid_os_Parcel_2I+140)
at android.os.BinderProxy.transactNative(Native method)
at android.os.BinderProxy.transact(Binder.java:503)
at android.net.INetworkPolicyManager$Stub$Proxy.getNetworkPolicies(INetworkPolicyManager.java:410)
at android.net.NetworkPolicyManager.getNetworkPolicies(NetworkPolicyManager.java:174)
at com.android.settings.net.NetworkPolicyEditor.read(NetworkPolicyEditor.java:57)
at com.android.settings.DataUsageSummary.onCreate(DataUsageSummary.java:361)
at android.app.Fragment.performCreate(Fragment.java:2202)
at android.app.FragmentManagerImpl.moveToState(FragmentManager.java:942)
at android.app.FragmentManagerImpl.moveToState(FragmentManager.java:1148)
at android.app.BackStackRecord.run(BackStackRecord.java:793)
at android.app.FragmentManagerImpl.execPendingActions(FragmentManager.java:1535)
at android.app.FragmentManagerImpl.executePendingTransactions(FragmentManager.java:562)
at com.android.settings.SettingsActivity.switchToFragment(SettingsActivity.java:1084)
at com.android.settings.SettingsActivity.onCreate(SettingsActivity.java:657)
at android.app.Activity.performCreate(Activity.java:6251)
at android.app.Instrumentation.callActivityOnCreate(Instrumentation.java:1110)
at android.app.ActivityThread.performLaunchActivity(ActivityThread.java:2370)
at android.app.ActivityThread.handleLaunchActivity(ActivityThread.java:2477)
at android.app.ActivityThread.-wrap11(ActivityThread.java:-1)
at android.app.ActivityThread$H.handleMessage(ActivityThread.java:1345)
at android.os.Handler.dispatchMessage(Handler.java:102)
at android.os.Looper.loop(Looper.java:148)
at android.app.ActivityThread.main(ActivityThread.java:5438)
at java.lang.reflect.Method.invoke!(Native method)
at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:762)
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:652)
主线程被对端block,而对端是在systemserver中的NetworkPolicyManager。
接下来继续去查NetworkPolicyManager为何会被block。
通常情况下,可以搜索getNetworkPolicies,一般的对端的函数并不会修改函数名:
看到我们的对端:
"Binder_4" prio=5 tid=56 Blocked
| group="main" sCount=1 dsCount=0 obj=0x1321a0a0 self=0xad31e200
| sysTid=2491 nice=0 cgrp=default sched=0/0 handle=0x9fd00930
| state=S schedstat=( 0 0 0 ) utm=46441 stm=46582 core=1 HZ=100
| stack=0x9fc04000-0x9fc06000 stackSize=1014KB
| held mutexes=
at com.android.server.net.NetworkPolicyManagerService.getNetworkPolicies(NetworkPolicyManagerService.java:1696)
- waiting to lock <0x07439315> (a java.lang.Object) held by thread 35
at android.net.INetworkPolicyManager$Stub.onTransact(INetworkPolicyManager.java:145)
at android.os.Binder.execTransact(Binder.java:453
被Tid=35的人拿住了一把锁(0x07439315),那么继续看tid=35是谁,有两种方法:
1. 搜索tid=35
2. 搜索0x07439315,找到 - locked <0x07439315> (a java.lang.Object) "NetworkPolicy" prio=5 tid=35 TimedWaiting
| group="main" sCount=1 dsCount=0 obj=0x12d98940 self=0x9f91f700
| sysTid=2415 nice=0 cgrp=default sched=0/0 handle=0xa0f33930
| state=S schedstat=( 0 0 0 ) utm=7681 stm=7783 core=0 HZ=100
| stack=0xa0e31000-0xa0e33000 stackSize=1038KB
| held mutexes=
at java.lang.Object.wait!(Native method)
- waiting on <0x02580c1b> (a java.lang.Object)
at java.lang.Thread.parkFor$(Thread.java:1220)
- locked <0x02580c1b> (a java.lang.Object)
at sun.misc.Unsafe.park(Unsafe.java:299)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2053)
at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:372)
at com.android.server.NativeDaemonConnector$ResponseQueue.remove(NativeDaemonConnector.java:634)
at com.android.server.NativeDaemonConnector.executeForList(NativeDaemonConnector.java:426)
at com.android.server.NativeDaemonConnector.execute(NativeDaemonConnector.java:345)
at com.android.server.NativeDaemonConnector.execute(NativeDaemonConnector.java:340)
at com.android.server.NetworkManagementService.setInterfaceQuota(NetworkManagementService.java:1712)
- locked <0x0b0f91b8> (a java.lang.Object)
at com.android.server.net.NetworkPolicyManagerService.setInterfaceQuota(NetworkPolicyManagerService.java:2421)
at com.android.server.net.NetworkPolicyManagerService.updateNetworkRulesLocked(NetworkPolicyManagerService.java:1232)
at com.android.server.net.NetworkPolicyManagerService$14.onReceive(NetworkPolicyManagerService.java:1060)
- locked <0x07439315> (a java.lang.Object)
at android.app.LoadedApk$ReceiverDispatcher$Args.run(LoadedApk.java:881)
at android.os.Handler.handleCallback(Handler.java:739)
at android.os.Handler.dispatchMessage(Handler.java:95)
at android.os.Looper.loop(Looper.java:148)
at android.os.HandlerThread.run(HandlerThread.java:61) 可以看到,NetworkPolicy在通过NativeDaemonConnector和netd通信(setInterfaceQuota)
我们结合log来看下是否有有用信息,按照之前的经验,netd在执行完cmd的时候,会打印出slow operation。
在发生ANR的前后,查找netd相关的打印:
06-19 15:29:00.997 1235 1270 I am_anr : [0,22831,com.android.settings,818429509,Input dispatching timed out (Waiting because no window has focus but there is a focused application that may eventually add a window when it finishes starting up.)]
06-19 15:29:05.683 1235 2415 E NetdConnector: NDC Command {55445 bandwidth setiquota seth_w0 9223372036854775807} took too long (4755ms
06-19 15:29:05.723 1235 2491 I dvm_lock_sample: [system_server,1,Binder_4,4919,NetworkPolicyManagerService.java,1696,-,1056,100]
从eventslog中还可以发现,setting出现了两次连续的ANR,而上下文中都有类似上面的dvm_lock_sample NetworkPolicyManagerService相关的告警。
因此虽然15:28的这次ANR并没有打出有用的栈,但是我们还是可以猜测出这两次ANR的原因都是netd的cmd耗时太久导致的(在主线程的调用路径上存在必然的block)
那么,在netd可能无法修改的情况下,我们应该如何去resolve这个问题呢。
将可能存在block的操作放到非UI线程中去做。

2.com.huawei.hwvplayer.youku对端block:

  打开dropbox中对应的system_app_anrxxxx:
查看主线程的栈:
"main" prio=5 tid=1 Native
| group="main" sCount=1 dsCount=0 obj=0x752b0000 self=0xb4276500
| sysTid=25390 nice=-1 cgrp=default sched=3/0 handle=0xb6f18b34
| state=S schedstat=( 0 0 0 ) utm=81 stm=12 core=2 HZ=100
| stack=0xbe78b000-0xbe78d000 stackSize=8MB
| held mutexes=
kernel: (couldn't read /proc/self/task/25390/stack)
native: #00 pc 000422d0 /system/lib/libc.so (__ioctl+8)
native: #01 pc 00047825 /system/lib/libc.so (ioctl+14)
native: #02 pc 0001e835 /system/lib/libbinder.so (_ZN7android14IPCThreadState14talkWithDriverEb+132)
native: #03 pc 0001ee93 /system/lib/libbinder.so (_ZN7android14IPCThreadState15waitForResponseEPNS_6ParcelEPi+38)
native: #04 pc 0001f049 /system/lib/libbinder.so (_ZN7android14IPCThreadState8transactEijRKNS_6ParcelEPS1_j+124)
native: #05 pc 00019fe3 /system/lib/libbinder.so (_ZN7android8BpBinder8transactEjRKNS_6ParcelEPS1_j+30)
native: #06 pc 0008a035 /system/lib/libandroid_runtime.so (???)
native: #07 pc 00d78869 /data/dalvik-cache/arm/system@framework@boot.oat (Java_android_os_BinderProxy_transactNative__ILandroid_os_Parcel_2Landroid_os_Parcel_2I+140)
at android.os.BinderProxy.transactNative(Native method)
at android.os.BinderProxy.transact(Binder.java:510)
at android.os.storage.IMountService$Stub$Proxy.getVolumeList(IMountService.java:771)
at android.os.storage.StorageManager.getVolumeList(StorageManager.java:883)
at android.os.Environment$UserEnvironment.getExternalDirs(Environment.java:95)
at android.os.Environment.getExternalStorageDirectory(Environment.java:354)
at com.huawei.common.utils.PathUtils.<clinit>(PathUtils.java:51)
at com.huawei.common.utils.PathUtils.getWorkspacePath(PathUtils.java:80)
at com.huawei.common.components.log.Logger.<clinit>(Logger.java:37)
at com.huawei.common.components.log.Logger.i(Logger.java:162)
at com.huawei.hwvplayer.data.db.DbProvider.attachInfo(DbProvider.java:89)
at android.app.ActivityThread.installProvider(ActivityThread.java:5279)
at android.app.ActivityThread.installContentProviders(ActivityThread.java:4868)
at android.app.ActivityThread.handleBindApplication(ActivityThread.java:4799)
at android.app.ActivityThread.access$1600(ActivityThread.java:165)
at android.app.ActivityThread$H.handleMessage(ActivityThread.java:1436)
at android.os.Handler.dispatchMessage(Handler.java:102)
at android.os.Looper.loop(Looper.java:188)
at android.app.ActivityThread.main(ActivityThread.java:5578)
at java.lang.reflect.Method.invoke!(Native method)
at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:794)
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:684) 主线程尝试去调用MountService的 getVolumeList接口,可能没有返回。
查看system_server中相关的,搜索getVolumeList
发现systemserver中有三个Binder线程和主线程被block,我们的对端是哪个Binder线程暂时无法确认,但这并不影响我们继续分析,因为他们被blockd的路径是一致的: "main" prio=5 tid=1 Blocked
| group="main" sCount=1 dsCount=0 obj=0x752b0000 self=0xb4276500
| sysTid=22735 nice=-2 cgrp=default sched=0/0 handle=0xb6f18b34
| state=S schedstat=( 0 0 0 ) utm=432 stm=85 core=1 HZ=100
| stack=0xbe78b000-0xbe78d000 stackSize=8MB
| held mutexes=
at com.android.server.MountService.getVolumeList(MountService.java:2759)
- waiting to lock <0x0eeb54f1> (a java.lang.Object) held by thread 40
at android.os.storage.StorageManager.getVolumeList(StorageManager.java:883)
at android.os.storage.StorageManager.getVolumeList(StorageManager.java:858)
at android.os.storage.StorageManager.getPrimaryVolume(StorageManager.java:906)
at com.android.server.usb.UsbDeviceManager.systemReady(UsbDeviceManager.java:327)
at com.android.server.usb.UsbService.systemReady(UsbService.java:181)
at com.android.server.usb.UsbService$Lifecycle.onBootPhase(UsbService.java:78)
at com.android.server.SystemServiceManager.startBootPhase(SystemServiceManager.java:135)
at com.android.server.SystemServer$3.run(SystemServer.java:1489)
at com.android.server.am.ActivityManagerService.systemReady(ActivityManagerService.java:12417)
at com.android.server.am.HwActivityManagerService.systemReady(HwActivityManagerService.java:960)
at com.android.server.SystemServer.startOtherServices(SystemServer.java:1485)
at com.android.server.SystemServer.run(SystemServer.java:381)
at com.android.server.SystemServer.main(SystemServer.java:272)
at java.lang.reflect.Method.invoke!(Native method)
at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:794)
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:684) "Binder_8" prio=5 tid=76 Blocked
| group="main" sCount=1 dsCount=0 obj=0x13bd60a0 self=0x9c1abe00
| sysTid=25191 nice=-1 cgrp=default sched=0/0 handle=0x97158930
| state=S schedstat=( 0 0 0 ) utm=9 stm=7 core=2 HZ=100
| stack=0x9705c000-0x9705e000 stackSize=1014KB
| held mutexes=
at com.android.server.MountService.getVolumeList(MountService.java:2759)
- waiting to lock <0x0eeb54f1> (a java.lang.Object) held by thread 40
at android.os.storage.IMountService$Stub.onTransact(IMountService.java:1634)
at android.os.Binder.execTransact(Binder.java:453) "Binder_2" prio=5 tid=8 Blocked
| group="main" sCount=1 dsCount=0 obj=0x12cac0a0 self=0xaebf0300
| sysTid=22761 nice=-1 cgrp=default sched=0/0 handle=0xaef7d930
| state=S schedstat=( 0 0 0 ) utm=41 stm=25 core=0 HZ=100
| stack=0xaee81000-0xaee83000 stackSize=1014KB
| held mutexes=
at com.android.server.MountService.getVolumeList(MountService.java:2759)
- waiting to lock <0x0eeb54f1> (a java.lang.Object) held by thread 40
at android.os.storage.IMountService$Stub.onTransact(IMountService.java:1634)
at android.os.Binder.execTransact(Binder.java:453) 他们均是被tid=40的人block,按照上面的方法搜索tid=40或者 0x0eeb54f1得到block的人:
"MountService" prio=5 tid=40 TimedWaiting
| group="main" sCount=1 dsCount=0 obj=0x132c1160 self=0x9ce57400
| sysTid=23512 nice=0 cgrp=default sched=0/0 handle=0x9a239930
| state=S schedstat=( 0 0 0 ) utm=29 stm=2 core=2 HZ=100
| stack=0x9a137000-0x9a139000 stackSize=1038KB
| held mutexes=
at java.lang.Object.wait!(Native method)
- waiting on <0x00fea1f3> (a java.lang.Object)
at java.lang.Thread.parkFor$(Thread.java:1235)
- locked <0x00fea1f3> (a java.lang.Object)
at sun.misc.Unsafe.park(Unsafe.java:299)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2053)
at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:372)
at com.android.server.NativeDaemonConnector$ResponseQueue.remove(NativeDaemonConnector.java:777)
at com.android.server.NativeDaemonConnector.executeForList(NativeDaemonConnector.java:489)
at com.android.server.NativeDaemonConnector.execute(NativeDaemonConnector.java:386)
at com.android.server.NativeDaemonConnector.execute(NativeDaemonConnector.java:381)
at com.android.server.MountService.resetIfReadyAndConnectedLocked(MountService.java:827)
at com.android.server.MountService.handleSystemReady(MountService.java:776)
- locked <0x0eeb54f1> (a java.lang.Object)
at com.android.server.MountService.access$500(MountService.java:152)
at com.android.server.MountService$MountServiceHandler.handleMessage(MountService.java:596)
at android.os.Handler.dispatchMessage(Handler.java:102)
at android.os.Looper.loop(Looper.java:150)
at android.os.HandlerThread.run(HandlerThread.java:61) 和上面netd类似的,mountservice也是通过ndc和vold通信,这里我们需要继续查看是否vold存在异常。
之前提到过,类似这种同步锁block的,dvm_lock_sample一定会有打印,于是先去找eventslog,不过这个是华为的log,是没有eventlog的。
而华为实现了一个blockMonitor的功能,和dvm_lock_sample类似,当某个操作特别耗时的时候,会将其打印出来:
在ANR的附近找到如下:
07-19 10:17:50.739 25271 25271 W BlockMonitor: The binder calling took 55209ms.
07-19 10:17:50.739 25271 25271 W BlockMonitor: android.os.BlockMonitor.checkBinderTime(BlockMonitor.java:141)
07-19 10:17:50.739 25271 25271 W BlockMonitor: android.os.BinderProxy.transact(Binder.java:511)
07-19 10:17:50.739 25271 25271 W BlockMonitor: android.os.storage.IMountService$Stub$Proxy.getVolumeList(IMountService.java:771)
07-19 10:17:50.739 25271 25271 W BlockMonitor: android.os.storage.StorageManager.getVolumeList(StorageManager.java:883)
07-19 10:17:50.739 25271 25271 W BlockMonitor: android.os.Environment$UserEnvironment.getExternalDirs(Environment.java:95)
07-19 10:17:50.739 25271 25271 W BlockMonitor: android.os.Environment.getExternalStorageDirectory(Environment.java:354)
07-19 10:17:50.739 25271 25271 W BlockMonitor: android.support.v4.content.FileProvider.parsePathStrategy(FileProvider.java:583)
07-19 10:17:50.739 25271 25271 W BlockMonitor: android.support.v4.content.FileProvider.getPathStrategy(FileProvider.java:534)
07-19 10:17:50.739 25271 25271 W BlockMonitor: android.support.v4.content.FileProvider.attachInfo(FileProvider.java:352)
07-19 10:17:50.739 25271 25271 W BlockMonitor: android.app.ActivityThread.installProvider(ActivityThread.java:5279)
07-19 10:17:50.739 25271 25271 W BlockMonitor: android.app.ActivityThread.installContentProviders(ActivityThread.java:4868)
07-19 10:17:50.739 25271 25271 W BlockMonitor: android.app.ActivityThread.handleBindApplication(ActivityThread.java:4799)
07-19 10:17:50.739 25271 25271 W BlockMonitor: android.app.ActivityThread.access$1600(ActivityThread.java:165)
07-19 10:17:50.739 25271 25271 W BlockMonitor: android.app.ActivityThread$H.handleMessage(ActivityThread.java:1436)
07-19 10:17:50.739 25271 25271 W BlockMonitor: android.os.Handler.dispatchMessage(Handler.java:102)
07-19 10:17:50.739 25271 25271 W BlockMonitor: android.os.Looper.loop(Looper.java:188)
07-19 10:17:50.739 25271 25271 W BlockMonitor: android.app.ActivityThread.main(ActivityThread.java:5578)
07-19 10:17:50.739 25271 25271 W BlockMonitor: java.lang.reflect.Method.invoke(Native Method)
07-19 10:17:50.739 25271 25271 W BlockMonitor: com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:794)
07-19 10:17:50.739 25271 25271 W BlockMonitor: com.android.internal.os.ZygoteInit.main(ZygoteInit.java:684)
再加上之前的:
07-19 10:17:50.729 22735 23512 E NativeDaemonConnector.ResponseQueue: Timeout waiting for response
07-19 10:17:50.729 22735 23512 E VoldConnector: timed-out waiting for response to 4 volume reset
07-19 10:17:50.729 22735 23512 E VoldConnector: timed-out waiting for response mOutputStream = android.net.LocalSocketImpl$SocketOutputStream@922644, mSocket = vold
07-19 10:17:50.731 22735 23512 W MountService: Failed to reset vold
07-19 10:17:50.731 22735 23512 W MountService: com.android.server.NativeDaemonTimeoutException: command '4 volume reset' failed with 'null'
07-19 10:17:50.731 22735 23512 W MountService: at com.android.server.NativeDaemonConnector.executeForList(NativeDaemonConnector.java:501)
07-19 10:17:50.731 22735 23512 W MountService: at com.android.server.NativeDaemonConnector.execute(NativeDaemonConnector.java:386)
07-19 10:17:50.731 22735 23512 W MountService: at com.android.server.NativeDaemonConnector.execute(NativeDaemonConnector.java:381)
07-19 10:17:50.731 22735 23512 W MountService: at com.android.server.MountService.resetIfReadyAndConnectedLocked(MountService.java:827)
07-19 10:17:50.731 22735 23512 W MountService: at com.android.server.MountService.handleSystemReady(MountService.java:776)
07-19 10:17:50.731 22735 23512 W MountService: at com.android.server.MountService.access$500(MountService.java:152)
07-19 10:17:50.731 22735 23512 W MountService: at com.android.server.MountService$MountServiceHandler.handleMessage(MountService.java:596)
07-19 10:17:50.731 22735 23512 W MountService: at android.os.Handler.dispatchMessage(Handler.java:102)
07-19 10:17:50.731 22735 23512 W MountService: at android.os.Looper.loop(Looper.java:150)
07-19 10:17:50.731 22735 23512 W MountService: at android.os.HandlerThread.run(HandlerThread.java:61)
我们有理由去推断vold的状态是不对的,但是又没有vold的栈。
那么全局grep vold试试能不能找到线索,在kernellog中发现了vold的异常:
07-19 14:22:24.669 <3>[10772.492156] c0 Freezing of tasks failed after 20.008 seconds (1 tasks refusing to freeze, wq_busy=0):
07-19 14:22:24.669 <6>[10772.492217] c0 vold R running 0 224 1 0x00000001
07-19 14:22:24.669 <4>[10772.492278] c0 [<c05ebecc>] (__schedule+0x38c/0x5bc) from [<c05ea478>] (schedule_timeout+0x18/0x1e8)
07-19 14:22:24.669 <4>[10772.492309] c0 [<c05ea478>] (schedule_timeout+0x18/0x1e8) from [<c05eb90c>] (wait_for_common+0x11c/0x164)
07-19 14:22:24.669 <4>[10772.492309] c0 [<c05eb90c>] (wait_for_common+0x11c/0x164) from [<c03cd8c8>] (mmc_wait_for_req+0xb4/0xe4)
07-19 14:22:24.669 <4>[10772.492339] c0 [<c03cd8c8>] (mmc_wait_for_req+0xb4/0xe4) from [<c03cd95c>] (mmc_wait_for_cmd+0x64/0x74)
07-19 14:22:24.669 <4>[10772.492370] c0 [<c03cd95c>] (mmc_wait_for_cmd+0x64/0x74) from [<c03d41f0>] (mmc_send_status+0x6c/0x8c)
07-19 14:22:24.670 <4>[10772.492400] c0 [<c03d41f0>] (mmc_send_status+0x6c/0x8c) from [<c03d4504>] (sd_send_status+0x14/0x44)
07-19 14:22:24.670 <4>[10772.492431] c0 [<c03d4504>] (sd_send_status+0x14/0x44) from [<c03d491c>] (mmc_lock_unlock_by_buf+0xac/0x168)
07-19 14:22:24.670 <4>[10772.492431] c0 [<c03d491c>] (mmc_lock_unlock_by_buf+0xac/0x168) from [<c03dabd8>] (mmc_lockable_store+0x594/0x75c)
07-19 14:22:24.670 <4>[10772.492461] c0 [<c03dabd8>] (mmc_lockable_store+0x594/0x75c) from [<c029d560>] (dev_attr_store+0x18/0x24)
07-19 14:22:32.070 <4>[10772.492492] c0 [<c029d560>] (dev_attr_store+0x18/0x24) from [<c013b370>] (sysfs_write_file+0x104/0x148)
07-19 14:22:32.070 <4>[10772.492522] c0 [<c013b370>] (sysfs_write_file+0x104/0x148) from [<c00eabb4>] (vfs_write+0xd0/0x180)
07-19 14:22:32.070 <4>[10772.492553] c0 [<c00eabb4>] (vfs_write+0xd0/0x180) from [<c00eb070>] (SyS_write+0x38/0x68)
07-19 14:22:32.071 <4>[10772.492583] c0 [<c00eb070>] (SyS_write+0x38/0x68) from [<c000e840>] (ret_fast_syscall+0x0/0x30)
vold一直在这个操作中没有退出来,所以不能响应客户端的请求,从而导致了ANR。
这个问题需要mmc的同事进一步去分析,目前怀疑是SD卡发生了错误。

3. setting 主线程耗时操作

"main" prio=5 tid=1 Native
| group="main" sCount=1 dsCount=0 obj=0x7425caf8 self=0xb4827800
| sysTid=671 nice=0 cgrp=default sched=0/0 handle=0xb6febbec
| state=S schedstat=( 0 0 0 ) utm=1853 stm=50 core=2 HZ=100
| stack=0xbe7bd000-0xbe7bf000 stackSize=8MB
| held mutexes=
kernel: (couldn't read /proc/self/task/671/stack)
native: #00 pc 000133cc /system/lib/libc.so (syscall+28)
native: #01 pc 000a9a83 /system/lib/libart.so (art::ConditionVariable::Wait(art::Thread*)+82)
native: #02 pc 001b16f1 /system/lib/libart.so (art::JNI::NewString(_JNIEnv*, unsigned short const*, int)+640)
native: #03 pc 00075887 /system/lib/libandroid_runtime.so (???)
native: #04 pc 008570ab /data/dalvik-cache/arm/system@framework@boot.oat (Java_android_database_CursorWindow_nativeGetString__JII+110)
at android.database.CursorWindow.nativeGetString(Native method)
at android.database.CursorWindow.getString(CursorWindow.java:438)
at android.database.AbstractWindowedCursor.getString(AbstractWindowedCursor.java:51)
at android.database.CursorWrapper.getString(CursorWrapper.java:114)
at com.android.settings.ApnSettings.fillList(ApnSettings.java:259)
at com.android.settings.ApnSettings.onResume(ApnSettings.java:208)
at android.app.Fragment.performResume(Fragment.java:2096)
at android.app.FragmentManagerImpl.moveToState(FragmentManager.java:928)
at android.app.FragmentManagerImpl.moveToState(FragmentManager.java:1067)
at android.app.FragmentManagerImpl.moveToState(FragmentManager.java:1049)
at android.app.FragmentManagerImpl.dispatchResume(FragmentManager.java:1879)
at android.app.Activity.performResume(Activity.java:6113)
at android.app.ActivityThread.performResumeActivity(ActivityThread.java:3015)
at android.app.ActivityThread.handleResumeActivity(ActivityThread.java:3061)
at android.app.ActivityThread.handleLaunchActivity(ActivityThread.java:2415)
at android.app.ActivityThread.access$800(ActivityThread.java:151)
at android.app.ActivityThread$H.handleMessage(ActivityThread.java:1313)
at android.os.Handler.dispatchMessage(Handler.java:102)
at android.os.Looper.loop(Looper.java:135)
at android.app.ActivityThread.main(ActivityThread.java:5345)
at java.lang.reflect.Method.invoke!(Native method)
at java.lang.reflect.Method.invoke(Method.java:372)
at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:947)
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:742) 从ApnSettings.java中可以看到fillList这个可能的耗时操作主线程处理的地方有OnResume和OnReceive。
且代码中存在使用HandlerThread异步处理这个操作的地方,可以借鉴下将这两处都放到HandlerThread中去做。

4. 系统资源紧张造成的ANR

 通常情况下,一般这种问题的表现的可能形式是
1. traces.txt中的主线程的栈在一个非常common的操作中,如new一个变量,读取某个文件等
2. ANR的trace中IOW相当高(IOW+CPU sys+usr=100)
3. Kswapd非常活跃
4. Low Ram 配置
5. kernel log中 D状态的用户进程比较多,且都block在内存相关,且内存比较紧张.
6. 连续出现ANR,slog和ams一直在dumptrace,造成比较大的IO压力
7. 前台正在玩一个很大的游戏
8. 某个应用正在安装做dex2oat.
出现这种情况下,我们需要做的是:
1. 针对性的调整lmk参数,减少后台的数量
2. 检查是否有不应该常驻的进程
3. 是否需要修改伪前台的adj,减少常驻内存
4. 是否需要关闭dex2oat
此类问题比较难解决。

5. Binder被占满导致ANR

1. traces.txt中主线程处于IDLE状态,要么是没有抓到现场,要么是出现了其他问题。
2. 分析log,找到ANR的时间点如下:后台服务执行超时,这个timeout是200s
07-28 03:37:33.681 1719 1732 I am_anr : [0,15625,com.android.exchange,948452933,executing service com.android.exchange/.service.EasService]
----- pid 15625 at 2016-07-28 03:37:33 -----,dumptrace的时间点。
3. 查看是否有主线程的消息:
07-28 03:37:39.373 15625 15625 W BlockMonitor: Package name: com.android.exchange
07-28 03:37:39.373 15625 15625 W BlockMonitor: The Message{ what=115 obj=ServiceArgsData{token=android.os.BinderProxy@49814e2
startId=6 args=Intent { act=com.android.email.EXCHANGE_INTENT flg=0x4 cmp=com.android.exchange/.service.EasService (has extras) }}
target=android.app.ActivityThread$H } took 5174ms.
之前有说过,华为项目是没有dvmlocksample但是有blockMonitor的,如果有耗时的binder调用以及mesg处理,那么就会被他抓过来。
从代码中可以看到,what=115是serviceArg消息,是startservice的一部分,而这个消息应该是34秒的时候才接受到并处理的。
那么,为什么33s的时候dumptrace并没有抓到主线程有卡顿?且后续并没有blockmonitor的打印,也就是说200s内主线程和systemserver并没有出现block。
那么还有可能是什么原因导致了前200s并没有收到这个消息?
我们来看下这部分的代码:
  private final void sendServiceArgsLocked(ServiceRecord r, boolean execInFg,
boolean oomAdjusted) throws TransactionTooLargeException {
final int N = r.pendingStarts.size();
if (N == 0) {
return;
} while (r.pendingStarts.size() > 0) {
Exception caughtException = null;
ServiceRecord.StartItem si;
try {
si = r.pendingStarts.remove(0);
if (DEBUG_SERVICE) Slog.v(TAG_SERVICE, "Sending arguments to: "
+ r + " " + r.intent + " args=" + si.intent);
if (si.intent == null && N > 1) {
// If somehow we got a dummy null intent in the middle,
// then skip it. DO NOT skip a null intent when it is
// the only one in the list -- this is to support the
// onStartCommand(null) case.
continue;
}
si.deliveredTime = SystemClock.uptimeMillis();
r.deliveredStarts.add(si);
si.deliveryCount++;
if (si.neededGrants != null) {
mAm.grantUriPermissionUncheckedFromIntentLocked(si.neededGrants,
si.getUriPermissionsLocked());
}
bumpServiceExecutingLocked(r, execInFg, "start"); //超时timer开始计算
if (!oomAdjusted) {
oomAdjusted = true;
mAm.updateOomAdjLocked(r.app);
}
int flags = 0;
if (si.deliveryCount > 1) {
flags |= Service.START_FLAG_RETRY;
}
if (si.doneExecutingCount > 0) {
flags |= Service.START_FLAG_REDELIVERY;
}
r.app.thread.scheduleServiceArgs(r, si.taskRemoved, si.id, flags, si.intent);//通过binder给App发送一个消息,ActivityThread会收到并处理。
} catch (TransactionTooLargeException e) {
if (DEBUG_SERVICE) Slog.v(TAG_SERVICE, "Transaction too large: intent="
+ si.intent);
caughtException = e;
} catch (RemoteException e) {
// Remote process gone... we'll let the normal cleanup take care of this.
if (DEBUG_SERVICE) Slog.v(TAG_SERVICE, "Crashed while sending args: " + r);
caughtException = e;
} catch (Exception e) {
Slog.w(TAG, "Unexpected exception", e);
caughtException = e;
} if (caughtException != null) {
// Keep nesting count correct
final boolean inDestroying = mDestroyingServices.contains(r);
serviceDoneExecutingLocked(r, inDestroying, inDestroying);
if (caughtException instanceof TransactionTooLargeException) {
throw (TransactionTooLargeException)caughtException;
}
break;
}
}
}
从上面的代码可以看出,在启动计时器和给app发送消息之间并不存在耗时操作,且这个是在同步锁之内的,如果慢了一定会被blockmonitor抓出来。
那么问题的原因只有一种可能情形:
200s之前给app发送的消息,APP并没有及时的处理。
从经验可以猜测,一般这种情况都是app的Binder被占满导致没有空闲的Binder线程能够处理这个消息导致。
而我们的traces正好dump到了这一幕:
app所有的Binder线程都全部在做一件事情,一个依赖于网络的操作,是被另外一个人通过contentprovider.query调过来的,有点不太正常。
"Binder_10" prio=5 tid=30 Native
| group="main" sCount=1 dsCount=0 obj=0x22dba160 self=0xb2f26700
| sysTid=4613 nice=10 cgrp=bg_non_interactive sched=0/0 handle=0x9eb80930
| state=S schedstat=( 0 0 0 ) utm=0 stm=0 core=3 HZ=100
| stack=0x9ea84000-0x9ea86000 stackSize=1014KB
| held mutexes=
kernel: (couldn't read /proc/self/task/4613/stack)
native: #00 pc 000423e8 /system/lib/libc.so (__pselect6+20)
native: #01 pc 0001c115 /system/lib/libc.so (select+60)
native: #02 pc 0000a279 /system/lib/libjavacrypto.so (???)
native: #03 pc 0000b29f /system/lib/libjavacrypto.so (???)
native: #04 pc 004bc3fd /data/dalvik-cache/arm/system@framework@boot.oat (Java_com_android_org_conscrypt_NativeCrypto_SSL_1read__JLjava_io_FileDescriptor_2Lcom_android_org_conscrypt_NativeCrypto_00024SSLHandshakeCallbacks_2_3BIII+192)
at com.android.org.conscrypt.NativeCrypto.SSL_read(Native method)
at com.android.org.conscrypt.OpenSSLSocketImpl$SSLInputStream.read(OpenSSLSocketImpl.java:705)
- locked <0x0508e00e> (a java.lang.Object)
at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:108)
at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:196)
at org.apache.http.impl.conn.DefaultResponseParser.parseHead(DefaultResponseParser.java:88)
at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:179)
at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:185)
at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:240)
at org.apache.http.impl.conn.AbstractClientConnAdapter.receiveResponseHeader(AbstractClientConnAdapter.java:264)
at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:284)
at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:126)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:440)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:596)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:517)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:495)
at com.android.exchange.EasResponse.fromHttpRequest(EasResponse.java:91)
at com.android.exchange.service.EasServerConnection.executeHttpUriRequest(EasServerConnection.java:384)
at com.android.exchange.eas.EasOperation.performOperation(EasOperation.java:382)
at com.android.exchange.service.EasService.searchGal(EasService.java:706)
at com.android.exchange.provider.ExchangeDirectoryProvider.query(ExchangeDirectoryProvider.java:350)
at android.content.ContentProvider.query(ContentProvider.java:1017)
at android.content.ContentProvider$Transport.query(ContentProvider.java:238)
at android.content.ContentProviderNative.onTransact(ContentProviderNative.java:112)
at android.os.Binder.execTransact(Binder.java:453)
sprdruntime在ANR的时候会dump BinderState,搜索当前进程的pid,查看是否能够找到线索:
07-28 03:37:33.677 1719 1732 W SprdRuntimeInfo: proc 15625
07-28 03:37:33.677 1719 1732 W SprdRuntimeInfo: thread 4537: l 01
07-28 03:37:33.677 1719 1732 W SprdRuntimeInfo: incoming transaction 625339: ed9d1ec0 from 30228:4535 to 15625:4537 code 1 flags 10 pri 10 r1 node 222269 size 736:4 data fb600d84
07-28 03:37:33.677 1719 1732 W SprdRuntimeInfo: thread 4538: l 01
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: incoming transaction 625344: c2c47080 from 30228:4536 to 15625:4538 code 1 flags 10 pri 10 r1 node 222269 size 736:4 data fb601090
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: thread 4563: l 01
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: incoming transaction 625759: e887e6c0 from 30228:4558 to 15625:4563 code 1 flags 10 pri 10 r1 node 222269 size 740:4 data fb6013f4
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: thread 4564: l 01
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: incoming transaction 625805: e6223a40 from 30228:4561 to 15625:4564 code 1 flags 10 pri 10 r1 node 222269 size 740:4 data fb601704
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: thread 4578: l 01
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: incoming transaction 626144: e96d0cc0 from 30228:4577 to 15625:4578 code 1 flags 10 pri 10 r1 node 222269 size 740:4 data fb601a70
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: thread 4580: l 01
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: incoming transaction 626162: d0a97440 from 30228:4579 to 15625:4580 code 1 flags 10 pri 10 r1 node 222269 size 740:4 data fb601db0
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: thread 4591: l 01
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: incoming transaction 626562: d34d60c0 from 30228:4588 to 15625:4591 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb6020ec
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: thread 4592: l 01
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: incoming transaction 626575: ecb23800 from 30228:4590 to 15625:4592 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb602430
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: thread 4602: l 01
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: incoming transaction 626867: d3c64580 from 30228:4600 to 15625:4602 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb602770
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: thread 4603: l 01
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: incoming transaction 626883: e8b8bfc0 from 30228:4601 to 15625:4603 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb602ab4
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: thread 4612: l 01
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: incoming transaction 627170: e8b8bdc0 from 30228:4610 to 15625:4612 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb602df4
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: thread 4613: l 01
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: incoming transaction 627176: e8ded5c0 from 30228:4611 to 15625:4613 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb603138
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: thread 15635: l 02
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: incoming transaction 623807: edbbfa40 from 30228:1612 to 15625:15635 code 1 flags 10 pri 10 r1 node 222269 size 740:4 data fb600404
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: thread 15636: l 01
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: incoming transaction 624820: edc2bd80 from 30228:30240 to 15625:15636 code 1 flags 10 pri 10 r1 node 222269 size 736:4 data fb600714
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: thread 15658: l 01
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: incoming transaction 624824: c2c47400 from 30228:30238 to 15625:15658 code 1 flags 10 pri 10 r1 node 222269 size 736:4 data fb600a20
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: thread 18343: l 01
07-28 03:37:33.678 1719 1732 W SprdRuntimeInfo: incoming transaction 623794: d5132380 from 30228:30241 to 15625:18343 code 1 flags 10 pri 10 r1 node 222269 size 740:4 data fb6000f4 可以看到我们binder线程被30228给占满了,而同时我们也发现30228也有点不太对:
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: proc 30228
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: thread 1612: l 11
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: outgoing transaction 623807: edbbfa40 from 30228:1612 to 15625:15635 code 1 flags 10 pri 10 r1 node 222269 size 740:4 data fb600404
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: incoming transaction 623738: e994e880 from 2332:4503 to 30228:1612 code 1 flags 10 pri 10 r1 node 538358 size 776:4 data f880009c
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: thread 4535: l 11
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: outgoing transaction 625339: ed9d1ec0 from 30228:4535 to 15625:4537 code 1 flags 10 pri 10 r1 node 222269 size 736:4 data fb600d84
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: incoming transaction 625317: e82f4000 from 2332:4556 to 30228:4535 code 1 flags 10 pri 10 r1 node 538358 size 776:4 data f8800d6c
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: thread 4536: l 11
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: outgoing transaction 625344: c2c47080 from 30228:4536 to 15625:4538 code 1 flags 10 pri 10 r1 node 222269 size 736:4 data fb601090
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: incoming transaction 625326: ed9d1c40 from 2332:4557 to 30228:4536 code 1 flags 10 pri 10 r1 node 538358 size 776:4 data f88010a0
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: thread 4558: l 11
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: outgoing transaction 625759: e887e6c0 from 30228:4558 to 15625:4563 code 1 flags 10 pri 10 r1 node 222269 size 740:4 data fb6013f4
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: incoming transaction 625733: e887e5c0 from 2332:4575 to 30228:4558 code 1 flags 10 pri 10 r1 node 538358 size 780:4 data f88013d4
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: thread 4561: l 11
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: outgoing transaction 625805: e6223a40 from 30228:4561 to 15625:4564 code 1 flags 10 pri 10 r1 node 222269 size 740:4 data fb601704
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: incoming transaction 625756: e8719880 from 2332:4576 to 30228:4561 code 1 flags 10 pri 10 r1 node 538358 size 780:4 data f880170c
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: thread 4577: l 11
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: outgoing transaction 626144: e96d0cc0 from 30228:4577 to 15625:4578 code 1 flags 10 pri 10 r1 node 222269 size 740:4 data fb601a70
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: incoming transaction 626125: d34d6ec0 from 2332:4585 to 30228:4577 code 1 flags 10 pri 10 r1 node 538358 size 780:4 data f8801a44
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: thread 4579: l 11
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: outgoing transaction 626162: d0a97440 from 30228:4579 to 15625:4580 code 1 flags 10 pri 10 r1 node 222269 size 740:4 data fb601db0
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: incoming transaction 626131: c3f98200 from 2332:4587 to 30228:4579 code 1 flags 10 pri 10 r1 node 538358 size 780:4 data f8801d7c
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: thread 4588: l 11
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: outgoing transaction 626562: d34d60c0 from 30228:4588 to 15625:4591 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb6020ec
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: incoming transaction 626541: ede63540 from 2332:4598 to 30228:4588 code 1 flags 10 pri 10 r1 node 538358 size 784:4 data f88020b4
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: thread 4590: l 11
07-28 03:37:33.562 1719 1732 W SprdRuntimeInfo: outgoing transaction 626575: ecb23800 from 30228:4590 to 15625:4592 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb602430
07-28 03:37:33.563 1719 1732 W SprdRuntimeInfo: incoming transaction 626546: e9a32b80 from 2332:4599 to 30228:4590 code 1 flags 10 pri 10 r1 node 538358 size 784:4 data f88023f0
07-28 03:37:33.563 1719 1732 W SprdRuntimeInfo: thread 4600: l 11
07-28 03:37:33.563 1719 1732 W SprdRuntimeInfo: outgoing transaction 626867: d3c64580 from 30228:4600 to 15625:4602 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb602770
07-28 03:37:33.563 1719 1732 W SprdRuntimeInfo: incoming transaction 626849: ecb23e00 from 2332:4608 to 30228:4600 code 1 flags 10 pri 10 r1 node 538358 size 784:4 data f880272c
07-28 03:37:33.563 1719 1732 W SprdRuntimeInfo: thread 4601: l 11
07-28 03:37:33.563 1719 1732 W SprdRuntimeInfo: outgoing transaction 626883: e8b8bfc0 from 30228:4601 to 15625:4603 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb602ab4
07-28 03:37:33.563 1719 1732 W SprdRuntimeInfo: incoming transaction 626856: e8389f00 from 2332:4609 to 30228:4601 code 1 flags 10 pri 10 r1 node 538358 size 784:4 data f8802a68
07-28 03:37:33.563 1719 1732 W SprdRuntimeInfo: thread 4610: l 11
07-28 03:37:33.563 1719 1732 W SprdRuntimeInfo: outgoing transaction 627170: e8b8bdc0 from 30228:4610 to 15625:4612 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb602df4
07-28 03:37:33.563 1719 1732 W SprdRuntimeInfo: incoming transaction 627144: c27891c0 from 2332:4618 to 30228:4610 code 1 flags 10 pri 10 r1 node 538358 size 784:4 data f8802da4
07-28 03:37:33.563 1719 1732 W SprdRuntimeInfo: thread 4611: l 11
07-28 03:37:33.563 1719 1732 W SprdRuntimeInfo: outgoing transaction 627176: e8ded5c0 from 30228:4611 to 15625:4613 code 1 flags 10 pri 10 r1 node 222269 size 744:4 data fb603138
07-28 03:37:33.563 1719 1732 W SprdRuntimeInfo: incoming transaction 627151: e2b511c0 from 2332:4619 to 30228:4611 code 1 flags 10 pri 10 r1 node 538358 size 784:4 data f88030e0 也就是说实际上是2332query了30228,30228才会去query16625导致anr的。
那么2332是谁?
eventslog中可以看到2332是email,30228是android.process.acore。
搜索traces.txt,看到我们在anr的时候有dump到2332的栈:
发现了异常,有很多个线程(也就是上面dump到的2332的线程)在查询,
"Filter" prio=5 tid=42 Native
| group="main" sCount=1 dsCount=0 obj=0x130cc8e0 self=0x98637500
| sysTid=4502 nice=10 cgrp=bg_non_interactive sched=0/0 handle=0x9877e930
| state=S schedstat=( 0 0 0 ) utm=1 stm=0 core=0 HZ=100
| stack=0x9867c000-0x9867e000 stackSize=1038KB
| held mutexes=
kernel: (couldn't read /proc/self/task/4502/stack)
native: #00 pc 00017618 /system/lib/libc.so (syscall+28)
native: #01 pc 000f5cbd /system/lib/libart.so (_ZN3art17ConditionVariable4WaitEPNS_6ThreadE+80)
native: #02 pc 00275df1 /system/lib/libart.so (_ZN3art3JNI18CallBooleanMethodVEP7_JNIEnvP8_jobjectP10_jmethodIDSt9__va_list+592)
native: #03 pc 0008617b /system/lib/libandroid_runtime.so (???)
native: #04 pc 00089c2b /system/lib/libandroid_runtime.so (???)
native: #05 pc 000198dd /system/lib/libbinder.so (_ZN7android7BBinder8transactEjRKNS_6ParcelEPS1_j+60)
native: #06 pc 0001ec25 /system/lib/libbinder.so (_ZN7android14IPCThreadState14executeCommandEi+584)
native: #07 pc 0001ef73 /system/lib/libbinder.so (_ZN7android14IPCThreadState15waitForResponseEPNS_6ParcelEPi+262)
native: #08 pc 0001f049 /system/lib/libbinder.so (_ZN7android14IPCThreadState8transactEijRKNS_6ParcelEPS1_j+124)
native: #09 pc 00019fe3 /system/lib/libbinder.so (_ZN7android8BpBinder8transactEjRKNS_6ParcelEPS1_j+30)
native: #10 pc 0008a035 /system/lib/libandroid_runtime.so (???)
native: #11 pc 00d78919 /data/dalvik-cache/arm/system@framework@boot.oat (Java_android_os_BinderProxy_transactNative__ILandroid_os_Parcel_2Landroid_os_Parcel_2I+140)
at android.os.BinderProxy.transactNative(Native method)
at android.os.BinderProxy.transact(Binder.java:505)
at android.content.ContentProviderProxy.query(ContentProviderNative.java:419)
at android.content.ContentResolver.query(ContentResolver.java:502)
at android.content.ContentResolver.query(ContentResolver.java:438)
at com.huawei.mail.chips.BaseRecipientAdapter.doQuery(BaseRecipientAdapter.java:931)
at com.huawei.mail.chips.BaseRecipientAdapter$DirectoryFilter.performFiltering(BaseRecipientAdapter.java:400)
at android.widget.Filter$RequestHandler.handleMessage(Filter.java:234)
at android.os.Handler.dispatchMessage(Handler.java:102)
at android.os.Looper.loop(Looper.java:150)
at android.os.HandlerThread.run(HandlerThread.java:61) 那么接下来就需要结合代码查看这么多线程创建的原因。
解决的思路也比较简单,是否需要采取线程池或者限制下线程的个数,防止由于此操作比较耗时(依赖于网络)导致对端ANR。
  1. Bug 582473 single touch point test,操作15次以上必现工厂测试anr
   从dropbox中找到Validationtools对应的traces,查看主线程的栈,发现主线程在IDLE状态。

"main" prio=5 tid=1 NATIVE
| group="main" sCount=1 dsCount=0 obj=0x4155acc0 self=0x414943f0
| sysTid=3287 nice=-1 sched=0/0 cgrp=apps handle=1074065748
| state=S schedstat=( 0 0 0 ) utm=1241 stm=340 core=0
#00 pc 00021764 /system/lib/libc.so (epoll_wait+12)
#01 pc 000105e3 /system/lib/libutils.so (android::Looper::pollInner(int)+94)
#02 pc 00010811 /system/lib/libutils.so (android::Looper::pollOnce(int, int*, int*, void**)+92)
#03 pc 0006b345 /system/lib/libandroid_runtime.so (android::NativeMessageQueue::pollOnce(_JNIEnv*, int)+22)
#04 pc 0001e70c /system/lib/libdvm.so (dvmPlatformInvoke+112)
#05 pc 0004dc43 /system/lib/libdvm.so (dvmCallJNIMethod(unsigned int const*, JValue*, Method const*, Thread*)+398)
#06 pc 00027778 /system/lib/libdvm.so
#07 pc 0002e410 /system/lib/libdvm.so (dvmMterpStd(Thread*)+76)
#08 pc 0002bac8 /system/lib/libdvm.so (dvmInterpret(Thread*, Method const*, JValue*)+156)
#09 pc 00060259 /system/lib/libdvm.so (dvmInvokeMethod(Object*, Method const*, ArrayObject*, ArrayObject*, ClassObject*, bool)+392)
#10 pc 000683eb /system/lib/libdvm.so
#11 pc 00027778 /system/lib/libdvm.so
#12 pc 0002e410 /system/lib/libdvm.so (dvmMterpStd(Thread*)+76)
#13 pc 0002bac8 /system/lib/libdvm.so (dvmInterpret(Thread*, Method const*, JValue*)+156)
#14 pc 0005ff77 /system/lib/libdvm.so (dvmCallMethodV(Thread*, Method const*, Object*, bool, JValue*, std::__va_list)+338)
#15 pc 0004982b /system/lib/libdvm.so
#16 pc 0004de13 /system/lib/libandroid_runtime.so
#17 pc 0004ecdd /system/lib/libandroid_runtime.so (android::AndroidRuntime::start(char const*, char const*)+532)
#18 pc 00001423 /system/bin/app_process
#19 pc 0000e403 /system/lib/libc.so (__libc_init+50)
#20 pc 00000f34 /system/bin/app_process
at android.os.MessageQueue.nativePollOnce(Native Method)
at android.os.MessageQueue.next(MessageQueue.java:138)
at android.os.Looper.loop(Looper.java:123)
at android.app.ActivityThread.main(ActivityThread.java:5294)
at java.lang.reflect.Method.invokeNative(Native Method)
at java.lang.reflect.Method.invoke(Method.java:515)
at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:932)
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:748)
at dalvik.system.NativeStart.main(Native Method)
然后发现第二个线程是Binder10 且在block状态,通常情况下,一个不是非常繁忙的需要对外提供服务的app是不许要16个binder线程的。
于是接着查看是否其他binder也存在异常,结果发现所有的binder线程均处于同样的调用栈。 "Binder_D" prio=5 tid=21 WAIT
| group="main" sCount=1 dsCount=0 obj=0x41d6f468 self=0x555f4898
| sysTid=3344 nice=0 sched=0/0 cgrp=apps handle=1442108096
| state=S schedstat=( 0 0 0 ) utm=0 stm=0 core=3
at java.lang.Object.wait(Native Method)
- waiting on <0x41d6f4c0> (a java.lang.VMThread) held by tid=21 (Binder_D)
at java.lang.Thread.parkFor(Thread.java:1205)
at sun.misc.Unsafe.park(Unsafe.java:325)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:157)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:813)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:846)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1175)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:180)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:256)
at android.view.SurfaceView$MyWindow.resized(SurfaceView.java:655)
at android.view.IWindow$Stub.onTransact(IWindow.java:108)
at android.os.Binder.execTransact(Binder.java:427)
at dalvik.system.NativeStart.run(Native Method) "Binder_F" prio=5 tid=23 WAIT
| group="main" sCount=1 dsCount=0 obj=0x41d4e600 self=0x55cb63f8
| sysTid=3351 nice=0 sched=0/0 cgrp=apps handle=1435714912
| state=S schedstat=( 0 0 0 ) utm=1 stm=1 core=0
at java.lang.Object.wait(Native Method)
- waiting on <0x41d4e658> (a java.lang.VMThread) held by tid=23 (Binder_F)
at java.lang.Thread.parkFor(Thread.java:1205)
at sun.misc.Unsafe.park(Unsafe.java:325)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:157)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:813)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:846)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1175)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:180)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:256)
at android.view.SurfaceView$MyWindow.resized(SurfaceView.java:655)
at android.view.IWindow$Stub.onTransact(IWindow.java:108)
at android.os.Binder.execTransact(Binder.java:427)
at dalvik.system.NativeStart.run(Native Method) 如上面一个例子,在这种Binder满的情况下,systemserver和app之间的通信是被block的,虽然这个是4.4并没有sprdruntimeInfo给我们dump出binder的状态,我们还是可以推断出
ANR的直接原因就是此处不合常理的Binder被占满。
首先通过代码可以看到,surfaceView中这行对应的代码为如下: @Override
public void resized(Rect frame, Rect overscanInsets, Rect contentInsets,
Rect visibleInsets, boolean reportDraw, Configuration newConfig) {
SurfaceView surfaceView = mSurfaceView.get();
if (surfaceView != null) {
if (DEBUG) Log.v(
"SurfaceView", surfaceView + " got resized: w=" + frame.width()
+ " h=" + frame.height() + ", cur w=" + mCurWidth + " h=" + mCurHeight);
surfaceView.mSurfaceLock.lock();
try {
if (reportDraw) {
surfaceView.mUpdateWindowNeeded = true;
surfaceView.mReportDrawNeeded = true;
surfaceView.mHandler.sendEmptyMessage(UPDATE_WINDOW_MSG);
} else if (surfaceView.mWinFrame.width() != frame.width()
|| surfaceView.mWinFrame.height() != frame.height()) {
surfaceView.mUpdateWindowNeeded = true;
surfaceView.mHandler.sendEmptyMessage(UPDATE_WINDOW_MSG);
}
} finally {
surfaceView.mSurfaceLock.unlock();
}
}
}
mSurfaceLocklock的路径还有如下:
private void updateWindow(boolean force, boolean redrawNeeded)
private final Canvas internalLockCanvas(Rect dirty)
由于这个是个必现问题,那么问题就简单了,我们可以在所有会lock和unlock的地方都加上log,再复现一次来得到真相。
于是乎发现是在SingleTouchPointTest pass的时候忘记unlockCanvs导致。

套路总结

   1. 查看traces.txt,如果有block栈则通过栈来找到block原因。
2. 应用自身耗时操作尝试修改异步,被Binder block则进一步确认下对端的情况,如果是依赖于外部状态的可能block操作,应用修改为异步,如果是common接口,转FW
3. 如果不存在栈,那么尽人事听天命,查看ANR点附近是否有主线程block的可能信息,如果有则顺腾摸瓜,没有就Needinfo。

如何避免ANR

          A.减少复杂的layout
B.主线程尽量不要做和显示无关的事情。
C.如果存在可能会block、耗时的操作,不要放到主线程中,可以使用异步等方式来放到另外一个handlerthread或者asynctask中去做。
需要注意的是,使用handler.post或者sendmessage的时候需要确认清楚handler的looper,如果是主线程,依然可能会ANR。
在HandlerThread中如果需要处理和显示相关的,还需要到主线程中处理(非UI线程不能操作UI,在ViewRoot中有检查)

Android 开发

Android ANR分析(2)的更多相关文章

  1. Android ANR 分析解决方法

    一:什么是ANR ANR:Application Not Responding,即应用无响应 二:ANR的类型 ANR一般有三种类型: 1. KeyDispatchTimeout(5 seconds) ...

  2. Android ANR分析(1)

    转自:http://blog.csdn.net/itachi85/article/details/6918761 一:什么是ANR ANR:Application Not Responding,即应用 ...

  3. android anr分析方法

    目录(?)[+] 案例1关键词ContentResolver in AsyncTask onPostExecute high iowait 案例2关键词在UI线程进行网络数据的读写   一:什么是AN ...

  4. Android ANR分析及解决方案

    一:什么是ANR ANR:Application Not Responding,即应用无响应. ANR定义:在Android上,如果你的应用程序有一段时间响应不够灵敏,系统会向用户显示一个对话框,这个 ...

  5. [转]Android ANR 分析解决方法

    一:什么是ANR ANR:Application Not Responding,即应用无响应 二:ANR的类型 ANR一般有三种类型: 1. KeyDispatchTimeout(5 seconds) ...

  6. android ANR 分析定位问题

    ANR ? android 规定,Activity如果5秒钟之内无法响应屏幕触摸事件或者键盘输入事件,BroadcastReceiver 如果10s中之内还未执行完操作就会出现ANR 定位ANR问题 ...

  7. Android ANR分析(三)

    http://www.jianshu.com/p/8964812972be http://stackoverflow.com/questions/704311/android-how-do-i-inv ...

  8. Android ANR 分析

    首先贴一下trace 文件 Process: com.oppo.reader PID: 20358 Time: 2933175644_1545041895232 Flags: 0x38d83e44 P ...

  9. Android ANR分析

    1.发生anr时手机会生产traces文件 拉取trace文件:adb pull data/anr/traces.txt ./mytraces.txt 保存路径 参考

随机推荐

  1. python os.path.dirname 是什么目录

    这个获取文件路径中所在的目录. 1 2 3 4 5 6 7 In [1]: import os   In [2]: os.__file__ Out[2]: '/usr/lib/python2.7/os ...

  2. 跟着百度学PHP[4]OOP面对对象编程-15-魔术方法__call方法

    简而言之就是调用了一个类中没有的方法就会自动调用__call()方法, 该参数有两个必须的参数! 第一个参数:调用的不存在的方法的方法名. 第二个参数:调用不存在的方法的参数. 但是总的说回来,__c ...

  3. Sqli-LABS通关笔录-11[sql注入之万能密码以及登录框报错注入]

    在这一关卡我学到了 1.万能密码的构造,大概的去揣测正常的SQL语句是如何的. 2. 3. 00x1 SQL万能密码的构造 在登录框当中可以添加了一个单引号.报错信息如下所示: 据此报错,我们大概的可 ...

  4. MATLAB命令大全

    一.常用对象操作:除了一般windows窗口的常用功能键外.1.!dir 可以查看当前工作目录的文件. !dir& 可以在dos状态下查看.2.who 可以查看当前工作空间变量名, whos ...

  5. Android 创建内容提供器(ContentResolver)

    如果想实现跨程序共享数据的功能,官方推荐的方式就是使用内容提供器,可以通过新建一个类去继承 ContentResolver 的方式来创建一个自己的内容提供器. ContentProvider 类中有六 ...

  6. 7 HandlerSet 处理程序链表类——Live555源码阅读(一)基本组件类

    这是Live555源码阅读的第一部分,包括了时间类,延时队列类,处理程序描述类,哈希表类这四个大类. 本文由乌合之众 lym瞎编,欢迎转载 my.oschina.net/oloroso Handler ...

  7. 3.2---最小栈(CC150)

    //思路:入栈时不是最小值,永远都没机会成为最小值. import java.util.Stack; class MinStack { private Stack<Integer> sta ...

  8. 【转】Oracle当中扫描数据的方法

    本文将对oracle当中扫描数据的存取方法进行介绍. 1) 全表扫描(Full Table Scans, FTS) 为实现全表扫描,Oracle读取表中所有的行,并检查每一行是否满足语句的WHERE限 ...

  9. 新型序列化类库MessagePack,比JSON更快、更小的格式

    MessagePack is an efficient binary serialization format. It lets you exchange data among multiple la ...

  10. 【Unity3D】使用鼠标键盘控制Camera视角(即时战略类游戏视角):缩近,拉远,旋转

    今天写一个demo,要用到鼠标键盘控制三维视角,因此写了个脚本用于控制. 该脚本可以用于即时战略类游戏的视角,提供了缩进,拉伸,旋转.同时按住鼠标右键不放,移动鼠标可以实现第一人称视角的效果. usi ...