Fixing garbled text when HtmlAgilityPack scrapes Chinese pages

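HtmlAgilityPack is an open-source HTML parser written in C#. Some parts of its design are less than polished, though. For example, fetching a Chinese web page the standard way often yields garbled text. Take the Xinhuanet home page (http://xinhua.org). Modeled on the HtmlAgilityPack samples, the scraping code looks like this: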

HtmlWeb hw = new HtmlWeb();
string url = @"http://xinhua.org";
HtmlDocument doc = hw.Load(url);
doc.Save("output.html");

Opened in IE, the saved page is garbled.

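After working through the maze of HtmlAgilityPack's code, I finally traced the problem to the Get(Uri uri, string method, string path, HtmlDocument doc) method of the HtmlWeb class. That method contains the following code: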

HttpWebResponse resp;
try
{
    resp = req.GetResponse() as HttpWebResponse;
}
……
if ((resp.ContentEncoding != null) && (resp.ContentEncoding.Length>0))
{
    respenc = System.Text.Encoding.GetEncoding(resp.ContentEncoding);
}
else
{
    respenc = null;
}
……
Stream s = resp.GetResponseStream();
if (s != null)
{
    if (UsingCache)
    {
        // NOTE: LastModified does not contain milliseconds, so we remove them to the file
        SaveStream(s, cachePath, RemoveMilliseconds(resp.LastModified), _streamBufferSize);
        // save headers
        SaveCacheHeaders(req.RequestUri, resp);
        if (path != null)
        {
            // copy and touch the file
            IOLibrary.CopyAlways(cachePath, path);
            File.SetLastWriteTime(path, File.GetLastWriteTime(cachePath));
        }
        else
        {
            // try to work in-memory
            if ((doc != null) && (html))
            {
                if (respenc != null)
                {
                    doc.Load(s, respenc);
                }
                else
                {
                    doc.Load(s, true);
                }
                resp.Close();
            }

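Here resp is the response to the HTTP request. Setting a breakpoint showed that resp.ContentEncoding was empty. (That property reflects the HTTP Content-Encoding header, which normally names a compression scheme such as gzip rather than a character set.) With respenc left null, the load falls through to doc.Load(s, true), and that Load overload apparently goes wrong as well, so the result is garbled.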

The fix:

Don't use HtmlWeb; the class is not mature. Issue the HTTP request yourself, as follows:

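// Requires: using System; using System.IO; using System.Net; using HtmlAgilityPack;
string url = @"http://xinhua.org";
HttpWebRequest req = WebRequest.Create(new Uri(url)) as HttpWebRequest;
req.Method = "GET";
try
{
    // Dispose the response and its stream once the document is loaded.
    using (WebResponse rs = req.GetResponse())
    using (Stream rss = rs.GetResponseStream())
    {
        HtmlDocument doc = new HtmlDocument();
        // No encoding is passed here, so HtmlDocument falls back to its default.
        doc.Load(rss);
        doc.Save("output.html");
    }
}
catch (Exception e)
{
    Console.WriteLine(e.Message);
    Console.WriteLine(e.StackTrace);
}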

In the code above, doc.Load(…) uses System.Text.Encoding.Default, which on my machine is gb2312.

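HtmlDocument can also load a stream with an explicitly specified encoding. There are two ways to obtain that encoding:
(1) the HttpWebResponse object exposes the charset the server declared for the page (its CharacterSet property, taken from the Content-Type header);
(2) for HTML pages that don't provide a charset, HtmlDocument offers an encoding auto-detection method, DetectEncoding(…). I haven't tested it, so I don't know how reliable it is.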
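
Option (1) could look roughly like the sketch below. It is only an illustration under a few assumptions: EncodingAwareLoader and its two methods are hypothetical names, not part of HtmlAgilityPack; the charset is parsed from the Content-Type header with System.Net.Mime.ContentType instead of being read from CharacterSet; and the caller supplies a fallback encoding for pages that declare no usable charset. DetectEncoding(…) is deliberately left out, since it is untested here.

```csharp
using System;
using System.IO;
using System.Net;
using System.Net.Mime;
using System.Text;
using HtmlAgilityPack;

// Hypothetical helper for illustration; not part of HtmlAgilityPack.
public static class EncodingAwareLoader
{
    // Turn a Content-Type header value such as "text/html; charset=gb2312"
    // into an Encoding. Returns the fallback when the header is missing,
    // malformed, has no charset, or names a charset this runtime does not know.
    public static Encoding EncodingFromContentType(string contentType, Encoding fallback)
    {
        if (string.IsNullOrEmpty(contentType))
            return fallback;
        try
        {
            string charset = new ContentType(contentType).CharSet;
            return string.IsNullOrEmpty(charset)
                ? fallback
                : Encoding.GetEncoding(charset);
        }
        catch (FormatException)    // header could not be parsed
        {
            return fallback;
        }
        catch (ArgumentException)  // charset name not recognized
        {
            return fallback;
        }
    }

    // Fetch the page and hand HtmlDocument an explicit encoding.
    public static HtmlDocument Load(string url, Encoding fallback)
    {
        HttpWebRequest req = (HttpWebRequest)WebRequest.Create(new Uri(url));
        req.Method = "GET";
        using (WebResponse resp = req.GetResponse())
        using (Stream stream = resp.GetResponseStream())
        {
            Encoding enc = EncodingFromContentType(resp.ContentType, fallback);
            HtmlDocument doc = new HtmlDocument();
            doc.Load(stream, enc);
            return doc;
        }
    }
}
```

A call such as EncodingAwareLoader.Load("http://xinhua.org", Encoding.GetEncoding("gb2312")) followed by doc.Save("output.html") would then no longer depend on the machine's default encoding. On runtimes where gb2312 is not registered out of the box, that GetEncoding call throws, so the fallback has to be an encoding the runtime actually provides.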

posted @ 2007-06-24 23:53 xiaotie
