最近遇到一个有趣的.net 2.0程序死锁问题,一般来说.net死锁问题都是应用程序显示的请求锁的过程出现锁访问顺序不一致导致的,但是本文中这个死锁则相对较为隐晦,隐藏的很深。
调试过程
.net的死锁我们可以通过sos.dll提供的syncblk来查看sync block来发现那些线程拥有锁,哪些线程等待锁。所以我们先通过syncblk来查看以下输出如何。
通过syncblk可以看到目前有一个syncblk已经被线程3(系统线程1814)所拥有。
0:005> .loadby sos mscorwks 0:005> !syncblk Index SyncBlock MonitorHeld Recursion Owning Thread Info SyncBlock Owner 3 0000000000f4e678 3 1 0000000000f58320 1814 3 0000000002ef6040 System.Object ----------------------------- Total 3 CCW 0 RCW 0 ComClassFactory 0 Free 0
切换到3号线程,验证是否系统线程号为1814。输出其调用栈。原来该线程在等待进入另外一个CriticalSection。
0:005> ~3s ntdll!ZwWaitForSingleObject+0xa: 000007f8`04ba2c2a c3 ret 0:003> ~. . 3 Id: 88c.1814 Suspend: 1 Teb: 000007f5`ff396000 Unfrozen Start: mscorwks!Thread::intermediateThreadProc (000007ff`e7d5f33c) Priority: 0 Priority class: 32 Affinity: f 0:003> kvL Args to Child : Call Site 00000000`00f535d0 00000000`000001f8 00000000`00000000 00000000`00f07c70 : ntdll!ZwWaitForSingleObject+0xa 00000000`00000000 000007ff`43010042 00000000`00f535d0 00000000`00000001 : ntdll!RtlpWaitOnCriticalSection+0xea 00000000`0000000a 00000000`00000000 00000000`00f1c5b0 00000000`00f535d0 : ntdll!RtlpEnterCriticalSectionContended+0x94 00000000`00000a45 00000000`00000000 ffffffff`fffffffe 000007ff`88533480 : mscorwks!UnsafeEEEnterCriticalSection+0x20 00000000`00000000 000007ff`e82758f0 ffffffff`fffffffe 00000000`00000000 : mscorwks!CrstBase::Enter+0x123 ffffffff`00000001 00000000`00f4d920 00000000`00000000 00000000`00000001 : mscorwks!ListLockEntry::FinishDeadlockAwareEnter+0x2b 00000000`1b81e100 00000000`00f48e80 00000000`00000000 00000000`00000000 : mscorwks!ListLockEntry::LockHolder::DeadlockAwareAcquire+0x32 000007f8`01eb798a 000007ff`883f3ba0 00000000`00000003 000007ff`883f3958 : mscorwks!MethodTable::DoRunClassInitThrowing+0x6cb 00000000`00000000 00000000`00000000 ffffffff`fffffffe 000007ff`e1261560 : mscorwks!MethodTable::CheckRunClassInitThrowing+0x68 00000000`02ef8148 00000000`00f0c040 00000000`00000000 000007ff`883f3b90 : mscorwks!MethodDesc::DoPrestub+0x162 00000000`02ef4e40 00000000`00000000 00000000`02ef5fe8 00000000`00f58320 : mscorwks!PreStubWorker+0x1fa 00000000`1b81ec50 00000000`00000000 00000000`1b81ea50 00000000`00000000 : mscorwks!ThePreStubAMD64+0x87 00000000`02ef6028 000007ff`e16334c0 00000000`1b81ed00 00000000`00000000 : StaticConstruction!StaticConstruction.Singleton.LockIt()+0x10c 00000000`02ef4c20 000007ff`e16334c0 00000000`1b81ed00 00000000`00000000 : StaticConstruction!StaticConstruction.Program.Thread1Proc()+0x37
...
00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!RtlUserThreadStart+0x1d
将该CriticalSection输出,查看一下拥有的线程是1994,即4号线程。
0:003> !cs 00f535d0 ----------------------------------------- Critical section = 0x0000000000f535d0 (+0xF535D0) DebugInfo = 0x0000000000f402e0 LOCKED LockCount = 0x1 WaiterWoken = No OwningThread = 0x0000000000001994 RecursionCount = 0x1 LockSemaphore = 0x1F8 SpinCount = 0x00000000020007d0 0:003> ~ 0 Id: 88c.1f7c Suspend: 1 Teb: 000007f5`ff39e000 Unfrozen 1 Id: 88c.18a4 Suspend: 1 Teb: 000007f5`ff39c000 Unfrozen 2 Id: 88c.1438 Suspend: 1 Teb: 000007f5`ff39a000 Unfrozen . 3 Id: 88c.1814 Suspend: 1 Teb: 000007f5`ff396000 Unfrozen 4 Id: 88c.1994 Suspend: 1 Teb: 000007f5`ff394000 Unfrozen # 5 Id: 88c.174c Suspend: 1 Teb: 000007f5`ff1be000 Unfrozen
切换到4号线程查看调用栈,可以看到该调用栈正在等待StaticConstruction.Singleton.LockIt中的Monitor.Enter,即三号线程拥有的syncblk。
0:003> ~4s ntdll!ZwWaitForMultipleObjects+0xa: 000007f8`04ba319b c3 ret 0:004> kL Child-SP RetAddr Call Site 00000000`1b91d348 000007f8`01e812d2 ntdll!ZwWaitForMultipleObjects+0xa 00000000`1b91d350 000007ff`e7c3e809 KERNELBASE!WaitForMultipleObjectsEx+0xe5 00000000`1b91d630 000007ff`e7c431f1 mscorwks!WaitForMultipleObjectsEx_SO_TOLERANT+0xc1 00000000`1b91d6d0 000007ff`e7d403e5 mscorwks!Thread::DoAppropriateAptStateWait+0x41 00000000`1b91d730 000007ff`e7c5e95c mscorwks!Thread::DoAppropriateWaitWorker+0x191 00000000`1b91d830 000007ff`e7c9d17a mscorwks!Thread::DoAppropriateWait+0x5c 00000000`1b91d8a0 000007ff`e7c20fe1 mscorwks!CLREvent::WaitEx+0xbe 00000000`1b91d950 000007ff`e7d6e012 mscorwks!AwareLock::EnterEpilog+0xc9 00000000`1b91da20 000007ff`e817a825 mscorwks!AwareLock::Enter+0x72 00000000`1b91da50 000007ff`88540657 mscorwks!JIT_MonEnterWorker_Portable+0xf5 00000000`1b91dc20 000007ff`885404c3 StaticConstruction!StaticConstruction.Singleton.LockIt()+0xa7 00000000`1b91dcc0 000007ff`e7ddd562 StaticConstruction!StaticConstruction.Static..cctor()+0x93 00000000`1b91dd30 000007ff`e7d1a293 mscorwks!CallDescrWorker+0x82 00000000`1b91dd70 000007ff`e7d1a3da mscorwks!CallDescrWorkerWithHandler+0xd3 00000000`1b91de10 000007ff`e7cfd437 mscorwks!DispatchCallDebuggerWrapper+0x3e 00000000`1b91de70 000007ff`e7cf22bd mscorwks!MethodTable::RunClassInitEx+0x207 00000000`1b91dfc0 000007ff`e8165f98 mscorwks!MethodTable::DoRunClassInitThrowing+0x74d 00000000`1b91ea40 000007ff`e814f162 mscorwks!MethodTable::CheckRunClassInitThrowing+0x68 00000000`1b91ea80 000007ff`e7cf72aa mscorwks!MethodDesc::DoPrestub+0x162 00000000`1b91eb70 000007ff`e7ddd447 mscorwks!PreStubWorker+0x1fa 00000000`1b91ec30 000007ff`88540300 mscorwks!ThePreStubAMD64+0x87 00000000`1b91ed00 000007ff`e14e2bdb StaticConstruction!StaticConstruction.Program.Thread2Proc()+0x20 ... 00000000`1b91f8c0 00000000`00000000 ntdll!RtlUserThreadStart+0x1d
因此我们可以看到是3号线程和4号线程互相得到了彼此需要请求的锁,因而造成了死锁。
其中4号线程很容易理解,他在等待monitor.enter,但是如何解释3号线程的调用栈,3号线程同样是调用StaticConstruction.Singleton.LockIt,其等待的CritcalSection来自何处又是做什么用的呢?
回头查看程序代码,
using System; using System.Collections.Generic; using System.Text; using System.Threading; namespace StaticConstruction { class Program { static void Main(string[] args) { Thread t1 = new Thread(Thread1Proc); Thread t2 = new Thread(Thread2Proc); t1.Start(); // lockA -> wait for static construction t2.Start(); // static consctruction -> lockA Console.Read(); } static void Thread1Proc() { Singleton.Instance.LockIt(); } static void Thread2Proc() { Static.Foo(); } } class Singleton { private object lockA = new object(); private Singleton() { } private static Singleton _instance = new Singleton(); public static Singleton Instance { get { return _instance; } } public void LockIt() { Console.WriteLine("Thread {0} waiting lock A", Thread.CurrentThread.ManagedThreadId); lock (lockA) { Console.WriteLine("Thread {0} got lock A", Thread.CurrentThread.ManagedThreadId); Thread.Sleep(10); Static.Foo(); } Console.WriteLine("Thread {0} released lock A", Thread.CurrentThread.ManagedThreadId); } } class Static { private Static() { } static Static() { Console.WriteLine("Static constructor begin by thread {0}", Thread.CurrentThread.ManagedThreadId); Singleton.Instance.LockIt(); Console.WriteLine("Static constructor end by thread {0}", Thread.CurrentThread.ManagedThreadId); } public static void Foo() { Console.WriteLine("Static method Foo begin"); Console.WriteLine("Static method Foo end"); } } }
原来3号线程在StaticConstruction.Singleton.LockIt得到锁之后紧接着要调用一个静态类Static的方法,而回到4号线程的调用栈可以看到该静态类Static构造函数还在执行过程中,在这个静态类被构造成功之前,3号线程当然无法调用该类上的方法,原来CLR通过在methodtable初始化过程中通过一个CriticalSection来确保线程安全。因此也使得在这种特殊情况下会有死锁情况发生。
明白了这一点,解决方法就很简单了,将锁的请求顺序统一,只要大家都按照同样的请求顺序来请求锁,死锁问题就不会发生了。