A workaround for StackOverflowException in the HtmlAgilityPack library
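I recently tried out HtmlAgilityPack for parsing HTML, and during testing the program threw a StackOverflowException. According to MSDN, starting with .NET Framework 2.0 a StackOverflowException object cannot be caught by a try-catch block, and by default the corresponding process is terminated.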
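Looking into the cause, I found that when an HTML document has a very complex structure, HtmlAgilityPack recurses a great many times, and that is what triggers the StackOverflowException. After some googling, I found the workaround below.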
First, add a new class to the library:
    using System;
    using System.Runtime.InteropServices;

    // Must be compiled with unsafe code enabled (/unsafe).
    public class StackChecker
    {
        public unsafe static bool HasSufficientStack(long bytes)
        {
            var stackInfo = new MEMORY_BASIC_INFORMATION();

            // We subtract one page for our request. VirtualQuery rounds UP to the next page.
            // Unfortunately, the stack grows down. If we're on the first page (last page in the
            // VirtualAlloc), we'll be moved to the next page, which is off the stack! Note this
            // doesn't work right for IA64 due to bigger pages.
            // Note: the (uint) casts here and the uint-typed struct fields below assume a
            // 32-bit (x86) process, where pointers are 4 bytes wide.
            IntPtr currentAddr = new IntPtr((uint)&stackInfo - 4096);

            // Query for the current stack allocation information.
            VirtualQuery(currentAddr, ref stackInfo, sizeof(MEMORY_BASIC_INFORMATION));

            // If the current address minus the base (remember: the stack grows downward in the
            // address space) is greater than the number of bytes requested plus the reserved
            // space at the end, the request has succeeded.
            return ((uint)currentAddr.ToInt64() - stackInfo.AllocationBase) >
                   (bytes + STACK_RESERVED_SPACE);
        }

        // We are conservative here. We assume that the platform needs a whole 16 pages to
        // respond to stack overflow (using an x86/x64 page-size, not IA64). That's 64KB,
        // which means that for very small stacks (e.g. 128KB) we'll fail a lot of stack checks
        // incorrectly.
        private const long STACK_RESERVED_SPACE = 4096 * 16;

        [DllImport("kernel32.dll")]
        private static extern int VirtualQuery(
            IntPtr lpAddress,
            ref MEMORY_BASIC_INFORMATION lpBuffer,
            int dwLength);

        private struct MEMORY_BASIC_INFORMATION
        {
            internal uint BaseAddress;
            internal uint AllocationBase;
            internal uint AllocationProtect;
            internal uint RegionSize;
            internal uint State;
            internal uint Protect;
            internal uint Type;
        }
    }
Then, in the places that recurse heavily (such as HtmlNode.WriteTo(TextWriter outText) and HtmlNode.WriteTo(XmlWriter writer)), add the following code:
    if (!StackChecker.HasSufficientStack(4 * 1024))
        throw new Exception("The document is too complex to parse");
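To see how a guard like this sits inside a recursive method, here is a minimal, self-contained sketch. It is not HtmlAgilityPack code: Node, GuardedWriter, TryRender and BuildNested are hypothetical names made up for illustration. So that the sample runs without unsafe code or kernel32, it swaps the StackChecker call for RuntimeHelpers.EnsureSufficientExecutionStack(), a built-in available since .NET Framework 4 that throws a catchable InsufficientExecutionStackException when little stack is left. The placement of the check, at the top of the recursive method, is the same idea as above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.CompilerServices;

// Hypothetical stand-in for HtmlNode: a tag name plus child nodes.
public sealed class Node
{
    public string Name;
    public List<Node> Children = new List<Node>();
    public Node(string name) { Name = name; }
}

public static class GuardedWriter
{
    // Recursive writer, similar in shape to a recursive WriteTo method.
    public static void WriteTo(Node node, TextWriter output)
    {
        // Guard at the top of every recursive call. This throws
        // InsufficientExecutionStackException, which can be caught,
        // unlike StackOverflowException.
        RuntimeHelpers.EnsureSufficientExecutionStack();

        output.Write("<" + node.Name + ">");
        foreach (Node child in node.Children)
            WriteTo(child, output);
        output.Write("</" + node.Name + ">");
    }

    // Returns false when the guard rejects a tree that is too deep.
    public static bool TryRender(Node root, out string html)
    {
        var buffer = new StringWriter();
        try
        {
            WriteTo(root, buffer);
            html = buffer.ToString();
            return true;
        }
        catch (InsufficientExecutionStackException)
        {
            html = null;
            return false;
        }
    }

    // Builds a chain of nested nodes with a loop, so that building
    // the test input does not itself recurse.
    public static Node BuildNested(int depth)
    {
        var root = new Node("div");
        Node current = root;
        for (int i = 1; i < depth; i++)
        {
            var child = new Node("div");
            current.Children.Add(child);
            current = child;
        }
        return root;
    }
}

public static class Program
{
    public static void Main()
    {
        string html;
        // A shallow tree and a very deep one (one million nested nodes).
        Console.WriteLine(GuardedWriter.TryRender(GuardedWriter.BuildNested(3), out html));
        Console.WriteLine(GuardedWriter.TryRender(GuardedWriter.BuildNested(1000000), out html));
    }
}
```

Because the guard raises an ordinary exception, the caller can report "document too complex" and carry on, which is exactly what the StackChecker approach is after.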
OK, that's all there is to it!