Crawling hao123: handling pages with different encodings

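Most websites use a single encoding for all of their pages.

A few sites, however, are unusual and use different encodings on different pages. Hao123 is one example.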

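When you run into this, you can take advantage of the server's static file caching.

Web servers generally have static file caching enabled.

So two GET requests for the same page will, in most cases, return the same content.

Decode the first response with any one encoding (UTF-8 in the code here), just to read the charset the page declares. The charset declaration is plain ASCII, so it stays readable even if the rest of the page is garbled.

Then decode the second response with that charset, and you get the real page content.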

The code is as follows:

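                // First request: decode as UTF-8 only to find the declared charset.
                // HttpHelper.CreateGetHttpResponse is this project's own helper; its result is used as the response stream.
                var stream = HttpHelper.CreateGetHttpResponse(url, null, null, "", null);

                string encoding = "";
                string html = "";

                using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
                {
                    // Matches charset=gb2312, charset="utf-8", charset='GBK', and so on.
                    Regex messages1 = new Regex("charset\\s*=\\s*['\"]?(?<enc>[\\w-]+)", RegexOptions.IgnoreCase);

                    string _html = reader.ReadToEnd();

                    encoding = messages1.Match(_html).Groups["enc"].Value;
                }

                // No charset declaration found: fall back to UTF-8 instead of
                // passing an empty name to Encoding.GetEncoding, which would throw.
                if (string.IsNullOrEmpty(encoding))
                {
                    encoding = "utf-8";
                }

                // Second request: decode the same page with the detected charset.
                var stream2 = HttpHelper.CreateGetHttpResponse(url, null, null, "", null);
                using (StreamReader reader2 = new StreamReader(stream2, Encoding.GetEncoding(encoding)))
                {
                    html = reader2.ReadToEnd();
                }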
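
A variation on the approach above, as a rough sketch and not the method used in this post: download the raw bytes once and decode that same buffer twice, so the result does not depend on both requests returning the same content. The class name `CharsetSniffer`, its two methods, and the 4096-byte scan limit are made up for illustration. The sketch only looks at the charset declared in the HTML itself and ignores the HTTP `Content-Type` header.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original code.
public static class CharsetSniffer
{
    // Matches charset=gb2312, charset="utf-8", charset='GBK', and so on.
    private static readonly Regex CharsetPattern = new Regex(
        "charset\\s*=\\s*['\"]?(?<enc>[\\w-]+)", RegexOptions.IgnoreCase);

    // Returns the charset name declared in the raw page bytes,
    // or the fallback name if no declaration is found.
    public static string DetectCharset(byte[] raw, string fallback = "utf-8")
    {
        // The declaration is plain ASCII, so an ASCII decode of the first
        // few kilobytes is enough to find it (4096 is an arbitrary limit).
        int length = Math.Min(raw.Length, 4096);
        string head = Encoding.ASCII.GetString(raw, 0, length);
        Match match = CharsetPattern.Match(head);
        return match.Success ? match.Groups["enc"].Value : fallback;
    }

    // Decodes the page bytes with the charset the page declares.
    public static string Decode(byte[] raw)
    {
        string name = DetectCharset(raw);
        try
        {
            return Encoding.GetEncoding(name).GetString(raw);
        }
        catch (ArgumentException)
        {
            // Unknown or unsupported charset name: fall back to UTF-8.
            return Encoding.UTF8.GetString(raw);
        }
    }
}
```

One possible way to use it, assuming the page is fetched with `System.Net.WebClient` in place of the `HttpHelper` call:

```csharp
using (var client = new System.Net.WebClient())
{
    byte[] raw = client.DownloadData(url);   // a single request
    string html = CharsetSniffer.Decode(raw);
}
```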

posted on 2012-11-09 10:05 CosmoKey
