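It is rare to hit a deadlock in a WinForms project, so here are my notes.
QA reported that sometimes, after the automation scripts run overnight, they come in the next morning to find the system hung on the screensaver screen and unresponsive, with no errors in the log. This is the kind of bug that occurs very rarely and that nobody knows how to reproduce, yet it is severe, so I grabbed a dump to study it.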
# File information shown after WinDbg loads the dump
    Microsoft (R) Windows Debugger Version 6.12.0002.633 AMD64
    Copyright (c) Microsoft Corporation. All rights reserved.

    Loading Dump File [F:\DeadLock\Hang.DMP]
    User Mini Dump File with Full Memory: Only application data is available

    Symbol search path is: *** Invalid ***
    ****************************************************************************
    * Symbol loading may be unreliable without a symbol search path.           *
    * Use .symfix to have the debugger choose a symbol path.                   *
    * After setting your symbol path, use .reload to refresh symbol locations. *
    ****************************************************************************
    Executable search path is:
    Windows 7 Version 7601 (Service Pack 1) MP (4 procs) Free x64
    Product: WinNt, suite: EmbeddedNT SingleUserTS
    Machine Name:
    Debug session time: Thu Feb 27 02:47:42.000 2014 (UTC + 8:00)
    System Uptime: 4 days 16:41:57.807
    Process Uptime: 4 days 16:41:28.000
    ................................................................
    Loading unloaded module list
    ................................................................
    *** ERROR: Symbol file could not be found. Defaulted to export symbols for ntdll.dll -
    ntdll!NtWaitForMultipleObjects+0xa:
    00000000`7706186a c3 ret
# Set the symbol path
    0:000> .sympath srv*c:\MyServerSymbols*http://msdl.microsoft.com/download/symbols
    Symbol search path is: srv*c:\MyServerSymbols*http://msdl.microsoft.com/download/symbols
    Expanded Symbol search path is: srv*c:\myserversymbols*http://msdl.microsoft.com/download/symbols
# Load SOS.dll manually: the SOS versions on our debugging machine and on the test machine differ, so .loadby sos clr is not used
    0:000> .load F:\SOS.dll\1008\SOS.dll
# See what all the threads are doing
    0:000> ~* e!clrstack
    OS Thread Id: 0xfe8 (0)
    Child SP         IP               Call Site
    00000000001d5f80 000000007706186a [GCFrame: 00000000001d5f80]
    00000000001d6128 000000007706186a [GCFrame: 00000000001d6128]
    00000000001d60d8 000000007706186a [HelperMethodFrame_1OBJ: 00000000001d60d8] System.Threading.Monitor.Enter(System.Object)
    00000000001d6220 000007ff02b1c734 TheSystem.CCDialog.OnClosed(System.EventArgs)
    00000000001d62c0 000007fef18f4695 System.Windows.Forms.Form.CheckCloseDialog(Boolean)
    *** WARNING: Unable to verify checksum for System.Windows.Forms.ni.dll
    00000000001d6330 000007fef1e64073 System.Windows.Forms.Application+ThreadContext.System.Windows.Forms.UnsafeNativeMethods.IMsoComponent.FContinueMessageLoop(Int32, Int32, MSG[])
    00000000001d6470 000007fef12aadbe System.Windows.Forms.Application+ComponentManager.System.Windows.Forms.UnsafeNativeMethods.IMsoComponentManager.FPushMessageLoop(IntPtr, Int32, Int32)
    00000000001d66b0 000007fef12aa813 System.Windows.Forms.Application+ThreadContext.RunMessageLoopInner(Int32, System.Windows.Forms.ApplicationContext)
    00000000001d6810 000007fef12aa1a1 System.Windows.Forms.Application+ThreadContext.RunMessageLoop(Int32, System.Windows.Forms.ApplicationContext)
    00000000001d68a0 000007fef18f8a52 System.Windows.Forms.Form.ShowDialog(System.Windows.Forms.IWin32Window)
    00000000001d6c10 000007ff02c844f6 TheSystem.CCMessageBox.ShowInternal(System.Windows.Forms.Form, MessageBoxIcon, System.String, System.String, System.Windows.Forms.MessageBoxButtons, Boolean, System.Collections.Generic.Dictionary`2<System.String,System.String>, System.Windows.Forms.Control, System.Drawing.ContentAlignment)
    00000000001d70e0 000007ff02c862cd TheSystem.CCMessageBox.Show(MessageBoxIcon, System.String, System.String, System.Windows.Forms.MessageBoxButtons)
    ......<cut to save space>......

    OS Thread Id: 0x2180 (132)
    Child SP         IP               Call Site
    0000000043cad940 000000007706186a [GCFrame: 0000000043cad940]
    0000000043cadb98 000000007706186a [GCFrame: 0000000043cadb98]
    0000000043cadb48 000000007706186a [HelperMethodFrame: 0000000043cadb48] System.Threading.Monitor.Enter(System.Object)
    0000000043cadca0 000007ff02ca8a18 TheSystem.CCDialog.CloseModalDialogs()
    0000000043cadd30 000007ff02ca8937 TheSystem.CCUserActivityManager.autoLogoutTimer_Elapsed(System.Object, System.Timers.ElapsedEventArgs)
    0000000043cadda0 000007fef5f01b51 System.Timers.Timer.MyTimerCallback(System.Object)
    0000000043cade20 000007fef6ba2f78 System.Threading.ExecutionContext.runTryCode(System.Object)
    0000000043cae548 000007fef7ba44c4 [HelperMethodFrame_PROTECTOBJ: 0000000043cae548] System.Runtime.CompilerServices.RuntimeHelpers.ExecuteCodeWithGuaranteedCleanup(TryCode, CleanupCode, System.Object)
    0000000043cae670 000007fef6b91661 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
    0000000043cae6d0 000007fef6c0ced7 System.Threading._TimerCallback.PerformTimerCallback(System.Object)
    0000000043cae948 000007fef7ba44c4 [GCFrame: 0000000043cae948]
    0000000043caeb20 000007fef7ba44c4 [DebuggerU2MCatchHandlerFrame: 0000000043caeb20]
    ......<cut to save space>......

    OS Thread Id: 0x363c (1076)
    Child SP         IP               Call Site
    00000000d190e7a0 000000007706186a [GCFrame: 00000000d190e7a0]
    00000000d190e9f8 000000007706186a [GCFrame: 00000000d190e9f8]
    00000000d190e9a8 000000007706186a [HelperMethodFrame: 00000000d190e9a8] System.Threading.Monitor.Enter(System.Object)
    00000000d190eb00 000007ff02ca8a18 TheSystem.CCDialog.CloseModalDialogs()
    00000000d190eb90 000007ff02ca8937 TheSystem.CCUserActivityManager.autoLogoutTimer_Elapsed(System.Object, System.Timers.ElapsedEventArgs)
    00000000d190ec00 000007fef5f01b51 System.Timers.Timer.MyTimerCallback(System.Object)
    00000000d190ec80 000007fef6ba2f78 System.Threading.ExecutionContext.runTryCode(System.Object)
    00000000d190f3a8 000007fef7ba44c4 [HelperMethodFrame_PROTECTOBJ: 00000000d190f3a8] System.Runtime.CompilerServices.RuntimeHelpers.ExecuteCodeWithGuaranteedCleanup(TryCode, CleanupCode, System.Object)
    00000000d190f4d0 000007fef6b91661 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
    00000000d190f530 000007fef6c0ced7 System.Threading._TimerCallback.PerformTimerCallback(System.Object)
    00000000d190f7a8 000007fef7ba44c4 [GCFrame: 00000000d190f7a8]
    00000000d190f980 000007fef7ba44c4 [DebuggerU2MCatchHandlerFrame: 00000000d190f980]
    ......<cut to save space>......

    OS Thread Id: 0x11a0 (1120)
    Failed to start stack walk: 80004005
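WinDbg turned out to be extremely busy and printed for a long time before it stopped: 1120 threads... Under normal conditions there should be around a hundred. Judging from the stacks, the vast majority of the threads in the middle (several hundred of them) have the same call stack as thread 132, and they all appear to be waiting for some lock.
# What are they waiting for?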
    0:000> !syncblk
    Index SyncBlock        MonitorHeld Recursion Owning Thread Info         SyncBlock Owner
     1453 0000000039f170a8          21         1 000000002e74f8e0 1a3c  82  00000000143cccd0 System.Object
     3711 000000002dba2568        1823         1 0000000067884fe0 1920 151  0000000011271e58 System.Collections.Hashtable
    -----------------------------
    Total           4562
    CCW             551
    RCW             21
    ComClassFactory 14
    Free            1196
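In other words, thread 151 owns the lock on the Hashtable object 0000000011271e58. MonitorHeld counts 1 for the owner plus 2 for each waiter, so a value of 1823 means (1823 - 1) / 2 = 911 threads are waiting for this lock!
(For reference, see Tess's explanation of !syncblk: http://blogs.msdn.com/b/tess/archive/2006/01/09/a-hang-scenario-locks-and-critical-sections.aspx)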
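As a quick check of the arithmetic, here is a minimal sketch that decodes a MonitorHeld value. It assumes the encoding described in the linked post (1 for the owner, 2 for each waiter); the function name `decode_monitor_held` is made up for illustration and is not part of SOS or WinDbg.

```python
def decode_monitor_held(monitor_held):
    """Split a !syncblk MonitorHeld value into (owners, waiters).

    Assumption: the value is 1 for the owning thread plus 2 per waiting thread.
    """
    owners = monitor_held % 2    # the odd bit is the owner
    waiters = monitor_held // 2  # every waiter contributes 2
    return owners, waiters


# The two rows from the !syncblk output above.
print(decode_monitor_held(1823))  # (1, 911)
print(decode_monitor_held(21))    # (1, 10)
```

Under that assumption the Hashtable lock has 911 waiters, which fits the several hundred identical call stacks seen in the thread listing.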
# First, look at thread 151
    0:000> ~151 e!clrstack
    OS Thread Id: 0x1920 (151)
    Child SP         IP               Call Site
    000000005ec9dbb8 000000007706186a [HelperMethodFrame_1OBJ: 000000005ec9dbb8] System.Threading.WaitHandle.WaitOneNative(System.Runtime.InteropServices.SafeHandle, UInt32, Boolean, Boolean)
    000000005ec9dcf0 000007fef6c1c094 System.Threading.WaitHandle.InternalWaitOne(System.Runtime.InteropServices.SafeHandle, Int64, Boolean, Boolean)
    *** WARNING: Unable to verify checksum for mscorlib.ni.dll
    000000005ec9dd30 000007fef18b01ad System.Windows.Forms.Control.WaitForWaitHandle(System.Threading.WaitHandle)
    *** WARNING: Unable to verify checksum for System.Windows.Forms.ni.dll
    000000005ec9de20 000007fef12a41e2 System.Windows.Forms.Control.MarshaledInvoke(System.Windows.Forms.Control, System.Delegate, System.Object[], Boolean)
    000000005ec9dfa0 000007fef18b2d99 System.Windows.Forms.Control.Invoke(System.Delegate, System.Object[])
    000000005ec9e020 000007ff02ca8b34 TheSystem.CCDialog.CloseModalDialogs()
    000000005ec9e0b0 000007ff02ca8937 TheSystem.CCUserActivityManager.autoLogoutTimer_Elapsed(System.Object, System.Timers.ElapsedEventArgs)
    *** ERROR: Module load completed but symbols could not be loaded for TheSystem.dll
    000000005ec9e120 000007fef5f01b51 System.Timers.Timer.MyTimerCallback(System.Object)
    *** WARNING: Unable to verify checksum for System.ni.dll
    ......<cut to save space>......
# It is waiting on the UI thread, so see what the UI thread is doing
    0:000> !clrstack
    OS Thread Id: 0xfe8 (0)
    Child SP         IP               Call Site
    00000000001d5f80 000000007706186a [GCFrame: 00000000001d5f80]
    00000000001d6128 000000007706186a [GCFrame: 00000000001d6128]
    00000000001d60d8 000000007706186a [HelperMethodFrame_1OBJ: 00000000001d60d8] System.Threading.Monitor.Enter(System.Object)
    00000000001d6220 000007ff02b1c734 TheSystem.CCDialog.OnClosed(System.EventArgs)
    00000000001d62c0 000007fef18f4695 System.Windows.Forms.Form.CheckCloseDialog(Boolean)
    *** WARNING: Unable to verify checksum for System.Windows.Forms.ni.dll
    00000000001d6330 000007fef1e64073 System.Windows.Forms.Application+ThreadContext.System.Windows.Forms.UnsafeNativeMethods.IMsoComponent.FContinueMessageLoop(Int32, Int32, MSG[])
    00000000001d6470 000007fef12aadbe System.Windows.Forms.Application+ComponentManager.System.Windows.Forms.UnsafeNativeMethods.IMsoComponentManager.FPushMessageLoop(IntPtr, Int32, Int32)
    00000000001d66b0 000007fef12aa813 System.Windows.Forms.Application+ThreadContext.RunMessageLoopInner(Int32, System.Windows.Forms.ApplicationContext)
    00000000001d6810 000007fef12aa1a1 System.Windows.Forms.Application+ThreadContext.RunMessageLoop(Int32, System.Windows.Forms.ApplicationContext)
    00000000001d68a0 000007fef18f8a52 System.Windows.Forms.Form.ShowDialog(System.Windows.Forms.IWin32Window)
    00000000001d6c10 000007ff02c844f6 TheSystem.CCMessageBox.ShowInternal(System.Windows.Forms.Form, MessageBoxIcon, System.String, System.String, System.Windows.Forms.MessageBoxButtons, Boolean, System.Collections.Generic.Dictionary`2<System.String,System.String>, System.Windows.Forms.Control, System.Drawing.ContentAlignment)
    00000000001d70e0 000007ff02c862cd TheSystem.CCMessageBox.Show(MessageBoxIcon, System.String, System.String, System.Windows.Forms.MessageBoxButtons)
    ......<cut to save space>......
I also ran !do on the Hashtable object 0000000011271e58 and found three dialogs in it, which matches the log. Next, look at the code:
Thread 151, which owns the lock: TheSystem.CCDialog.CloseModalDialogs()
UI thread: TheSystem.CCDialog.OnClosed(System.EventArgs)
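By design, the system logs out automatically after a period without user activity. If a modal dialog is open at that point, the system closes the modal dialog first and then logs out. Normally there is at most one modal dialog open before logout; the problem is that when the hang occurred there were three modal dialogs open.
So how does the hang come about? Thread 151 acquires the lock and then calls Invoke on the UI thread to close the first dialog. Note that once that Invoke completes, execution continues on thread 151. Meanwhile the UI thread handles the WM_CLOSE message and calls OnClosed(), which waits for the lock, while thread 151, still holding the lock, goes on to Invoke the UI thread to close the next dialog.
In other words, thread 151 holds the lock and then calls Invoke, a synchronous call, into the UI thread. The UI thread, however, must obtain the lock held by thread 151 before it can continue, and thread 151 never releases the lock because its Invoke never returns, so the UI thread waits there forever: a deadlock. In the normal case with only one dialog, the method exits right after Invoke() returns and never waits on a second Invoke(), so there is no deadlock.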
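The sequence described above can be reproduced in a small simulation. This is only a sketch in Python, not the application's WinForms code: the queue stands in for the UI message loop, posting a message and waiting on its event stands in for Control.Invoke, and the plain lock stands in for the lock on the Hashtable. All names are hypothetical, and the timeout exists only so the demo reports the stuck call instead of hanging forever the way the real process did.

```python
import queue
import threading


def close_modal_dialogs(names, invoke_timeout=0.5):
    """Simulate the timer thread closing dialogs via a synchronous Invoke
    while holding the lock. Returns the dialog whose Invoke never
    completed, or None if every Invoke returned."""
    dialogs_lock = threading.Lock()  # stands in for the locked Hashtable
    messages = queue.Queue()         # stands in for the UI message queue

    def pump():
        # Simulated UI thread: run one posted message at a time.
        while True:
            work, done = messages.get()
            work()
            done.set()

    def post(work):
        done = threading.Event()
        messages.put((work, done))
        return done

    def on_closed():
        # Runs on the UI thread and needs the same lock as the timer thread.
        with dialogs_lock:
            pass

    def close_dialog():
        # Closing returns at once; OnClosed arrives as a later UI message.
        post(on_closed)

    threading.Thread(target=pump, daemon=True).start()

    with dialogs_lock:
        for name in names:
            # Invoke-style call: post, then wait for the UI thread to run it.
            if not post(close_dialog).wait(invoke_timeout):
                return name
    return None


print(close_modal_dialogs(["A"]))            # None
print(close_modal_dialogs(["A", "B", "C"]))  # B
```

With one dialog the lock is released before the simulated OnClosed needs it. With three, OnClosed for the first dialog is already blocked on the lock when the second synchronous call is posted, so that call can never run.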
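And what about the 1120 threads? TheSystem.CCDialog.CloseModalDialogs() is triggered by a timer, so every so often a new thread starts up and then hangs there waiting for that lock, which is why we see so many threads.
There are several possible fixes. The simplest is to change Invoke to BeginInvoke, so that the non-UI thread can exit quickly and hand the lock over to the UI thread. Alternatively, CloseModalDialogs() can be called on the UI thread itself.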
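To illustrate why posting instead of waiting avoids the hang, here is a minimal Python simulation of the BeginInvoke-style fix. It is a sketch under assumptions, not the application's code: the queue models the UI message loop, the plain lock models the lock on the Hashtable, and every name in it is hypothetical.

```python
import queue
import threading


def close_modal_dialogs_posting(names, wait_seconds=2.0):
    """Simulate the fix: post each close and release the lock without waiting.
    Returns the dialog names in the order the simulated OnClosed ran."""
    dialogs_lock = threading.Lock()  # stands in for the locked Hashtable
    messages = queue.Queue()         # stands in for the UI message queue
    closed = []
    all_closed = threading.Event()

    def pump():
        # Simulated UI thread: run one message at a time; None stops it.
        while True:
            work = messages.get()
            if work is None:
                return
            work()

    def on_closed(name):
        # Runs on the UI thread and needs the same lock as the timer thread.
        with dialogs_lock:
            closed.append(name)
            if len(closed) == len(names):
                all_closed.set()

    def close_dialog(name):
        # Closing queues OnClosed as a later UI message.
        messages.put(lambda: on_closed(name))

    ui_thread = threading.Thread(target=pump)
    ui_thread.start()

    with dialogs_lock:
        for name in names:
            # Post and move on: nothing waits on the UI thread while locked.
            messages.put(lambda n=name: close_dialog(n))

    if names:
        all_closed.wait(wait_seconds)
    messages.put(None)
    ui_thread.join()
    return closed


print(close_modal_dialogs_posting(["A", "B", "C"]))  # ['A', 'B', 'C']
```

Because the timer-thread side never blocks while holding the lock, the lock is released almost immediately and the simulated UI thread can work through all three closes.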