Fetching a Website's Encoding and Content

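I have recently been writing a small tool that reads website content. A search turns up plenty of code for this, but what I tried did not work well in practice: some pages still came out garbled. So I made a few small adjustments on top of loafinweb's code to suit my own needs. If I got anything wrong, I ask loafinweb's forgiveness.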

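1. Reading only the fragment that declares the encoding:

Replace string html = reader.ReadToEnd();

with

 while ((temp = reader.ReadLine()) != null)
 {
    htmlBuilder.AppendLine(temp);
    // Use >= 0 rather than > 0: a line may begin with "charset".
    if (temp.IndexOf("charset", StringComparison.OrdinalIgnoreCase) >= 0)
     {
        break;
     }
 }
 html = htmlBuilder.ToString();

This speeds up reading: only the part of the page head that declares the encoding is read, not the whole page.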
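The early exit above can be tried without a network connection. The following is a minimal sketch, not the original author's code: the class CharsetSniffer, the method ReadUntilCharset, and the maxChars limit are hypothetical names introduced here for illustration. The limit is an added assumption so that a page with no charset declaration does not get read to the end.

```csharp
using System;
using System.IO;
using System.Text;

static class CharsetSniffer
{
    // Hypothetical helper: reads line by line and stops after the first line
    // that mentions "charset", or once at least maxChars characters have been
    // buffered. The limit is checked between lines, not inside a line.
    public static string ReadUntilCharset(TextReader reader, int maxChars = 8192)
    {
        var buffer = new StringBuilder();
        string line;
        while (buffer.Length < maxChars && (line = reader.ReadLine()) != null)
        {
            buffer.AppendLine(line);
            // Only the new line is tested, so earlier text is not re-scanned.
            if (line.IndexOf("charset", StringComparison.OrdinalIgnoreCase) >= 0)
                break;
        }
        return buffer.ToString();
    }
}
```

For example, CharsetSniffer.ReadUntilCharset(new StringReader("<html>\n<meta charset=\"gbk\">\n<body>text</body>")) returns the first two lines and never reads the body line.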

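2. Add a check for response.StatusCode == HttpStatusCode.MovedPermanently || response.StatusCode == HttpStatusCode.Found, and obtain the encoding recursively from the redirect target.

(For the HTTP status codes behind Response.StatusCode, see http://wenku.baidu.com/view/cc274309bb68a98271fefada.html)

For example, Sina's www.sina.com redirects to www.sina.com.cn by default.

The redirect can be observed with Fiddler.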
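A redirect target may be absolute or relative to the requested URL, so the two have to be combined before the next request is made. The sketch below shows one way to do that with System.Uri, assuming the target comes from the response's Location header, as 301/302 responses normally provide. RedirectResolver and Resolve are hypothetical names, and the example.com URLs are placeholders.

```csharp
using System;

static class RedirectResolver
{
    // Hypothetical helper: combines the requested URL with a redirect target
    // that may be absolute ("http://host/path") or relative ("/path", "page.html").
    public static string Resolve(string requestUrl, string location)
    {
        var baseUri = new Uri(requestUrl);
        // Uri(Uri, string) keeps an absolute target as it is and resolves a
        // relative one against the base.
        return new Uri(baseUri, location).AbsoluteUri;
    }
}
```

With this, Resolve("http://example.com/a/b.html", "/c/d.html") yields "http://example.com/c/d.html", while an absolute target such as "http://www.sina.com.cn/" is returned unchanged. Plain string concatenation of the two URLs would produce an invalid address in the first case.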

// Namespaces needed at the top of the file:
// System, System.IO, System.IO.Compression, System.Net,
// System.Text, System.Text.RegularExpressions

// Get the page encoding
        public static string getEncoding(string url)
        {
            HttpWebRequest request = null;
            HttpWebResponse response = null;
            StreamReader reader = null;
            string temp = string.Empty;
            try
            {
                request = (HttpWebRequest)WebRequest.Create(url);
                request.Timeout = 30000;
                // Redirects are handled manually so that 301/302 responses can be inspected.
                request.AllowAutoRedirect = false;
                string html = "";
                response = (HttpWebResponse)request.GetResponse();
                if (response.StatusCode == HttpStatusCode.OK && response.ContentLength < 1024 * 1024)
                {
                    // The charset declaration itself is plain ASCII, so ASCII decoding is enough here.
                    if (response.ContentEncoding != null && response.ContentEncoding.Equals("gzip", StringComparison.OrdinalIgnoreCase))
                        reader = new StreamReader(new GZipStream(response.GetResponseStream(), CompressionMode.Decompress), Encoding.ASCII);
                    else
                        reader = new StreamReader(response.GetResponseStream(), Encoding.ASCII);

                    // ReadToEnd is not used here; ReadLine is used and the loop exits once "charset" has been read.
                    StringBuilder htmlBuilder = new StringBuilder();
                    while ((temp = reader.ReadLine()) != null)
                    {
                        htmlBuilder.AppendLine(temp);
                        // Use >= 0 rather than > 0: a line may begin with "charset".
                        if (temp.IndexOf("charset", StringComparison.OrdinalIgnoreCase) >= 0)
                        {
                            break;
                        }
                    }
                    html = htmlBuilder.ToString();

                    // Accepts charset=gb2312 as well as the quoted form <meta charset="utf-8">.
                    Regex reg_charset = new Regex(@"charset\s*=\s*[""']?(?<charset>[\w\-]+)", RegexOptions.IgnoreCase);
                    Match match = reg_charset.Match(html);
                    if (match.Success)
                    {
                        return match.Groups["charset"].Value;
                    }
                    else if (!string.IsNullOrEmpty(response.CharacterSet))
                    {
                        return response.CharacterSet;
                    }
                    else
                        return Encoding.Default.BodyName;
                }
                else if (response.StatusCode == HttpStatusCode.MovedPermanently || response.StatusCode == HttpStatusCode.Found)
                {
                    // The page redirects with 301/302, e.g. www.sina.com.
                    // Prefer the Location header; fall back to the link in the response body.
                    string targetUrl = response.Headers["Location"];
                    if (string.IsNullOrEmpty(targetUrl))
                    {
                        reader = new StreamReader(response.GetResponseStream(), Encoding.ASCII);
                        html = reader.ReadToEnd();
                        Regex reg_href = new Regex("<a[\\s]+href[\\s]*=[\\s]*\"([^<\"]+)\"", RegexOptions.IgnoreCase);
                        Match hrefMatch = reg_href.Match(html);
                        if (hrefMatch.Success)
                            targetUrl = hrefMatch.Groups[1].Value;
                    }
                    if (!string.IsNullOrEmpty(targetUrl))
                    {
                        // Resolve a relative target against the current URL instead of concatenating strings.
                        // Note: there is no protection against redirect loops here.
                        return getEncoding(new Uri(new Uri(url), targetUrl).AbsoluteUri);
                    }
                }
            }
            finally
            {
                // Exceptions propagate to the caller unchanged; only clean-up happens here.
                if (reader != null)
                    reader.Close();
                if (response != null)
                    response.Close();
            }
            return Encoding.Default.BodyName;
        }
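The value returned by getEncoding is later passed to Encoding.GetEncoding, which throws an ArgumentException when the name is empty or not supported on the current platform. The sketch below shows one way to extract a charset name and map it to an Encoding with a fallback. CharsetParser, ExtractName, and Resolve are hypothetical names, and the pattern is an assumption about the common forms charset=..., charset="..." and charset='...', not a full parser.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class CharsetParser
{
    // Matches charset=utf-8, charset="utf-8" and charset='utf-8'.
    private static readonly Regex CharsetPattern = new Regex(
        @"charset\s*=\s*[""']?(?<charset>[A-Za-z0-9_\-]+)", RegexOptions.IgnoreCase);

    // Returns the declared charset name, or null when none is found.
    public static string ExtractName(string text)
    {
        Match m = CharsetPattern.Match(text ?? string.Empty);
        return m.Success ? m.Groups["charset"].Value : null;
    }

    // Maps a charset name to an Encoding; returns the fallback when the
    // name is missing or the platform does not support it.
    public static Encoding Resolve(string name, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(name)) return fallback;
        try { return Encoding.GetEncoding(name.Trim()); }
        catch (ArgumentException) { return fallback; }
    }
}
```

A caller could then write Encoding en = CharsetParser.Resolve(getEncoding(url), Encoding.UTF8); so that an unrecognised name degrades to a default instead of raising an exception. Choosing UTF-8 as the default is itself an assumption.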

3. When fetching the page content, add a check for gzip; otherwise the text may come out garbled, as with www.sohu.com.

// Get the page text for a url
        public static string getHtml(string url)
        {
            try
            {
                string str = "";
                Encoding en = Encoding.GetEncoding(getEncoding(url));
                HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
                request.Headers.Set("Pragma", "no-cache");
                request.Timeout = 30000;
                // using blocks release the connection even when reading fails.
                using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
                {
                    if (response.StatusCode == HttpStatusCode.OK && response.ContentLength < 1024 * 1024)
                    {
                        // Do not read the response stream with a StreamReader directly; check for gzip first,
                        // otherwise the text may come out garbled, e.g. www.sohu.com
                        Stream strM = response.GetResponseStream();
                        if (response.ContentEncoding != null && response.ContentEncoding.Equals("gzip", StringComparison.OrdinalIgnoreCase))
                            strM = new GZipStream(strM, CompressionMode.Decompress);
                        using (StreamReader sr = new StreamReader(strM, en))
                        {
                            str = sr.ReadToEnd();
                        }
                    }
                }
                return str;
            }
            catch
            {
                // Any failure (network error, unsupported encoding name) yields an empty string.
                return String.Empty;
            }
        }
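The gzip check can be exercised offline by compressing a string in memory and decoding it again. This is a minimal sketch under stated assumptions: GzipText, Compress, and Decode are hypothetical names, and the contentEncoding argument stands in for the value of response.ContentEncoding.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

static class GzipText
{
    // Compresses text in memory, standing in for a server that answers
    // with Content-Encoding: gzip.
    public static byte[] Compress(string text, Encoding encoding)
    {
        using (var output = new MemoryStream())
        {
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
            {
                byte[] raw = encoding.GetBytes(text);
                gzip.Write(raw, 0, raw.Length);
            }
            // MemoryStream.ToArray still works after GZipStream has closed the stream.
            return output.ToArray();
        }
    }

    // Decodes a response body, decompressing first when it is gzip-encoded.
    public static string Decode(Stream body, string contentEncoding, Encoding encoding)
    {
        bool isGzip = string.Equals(contentEncoding, "gzip", StringComparison.OrdinalIgnoreCase);
        Stream source = isGzip ? new GZipStream(body, CompressionMode.Decompress) : body;
        using (var reader = new StreamReader(source, encoding))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Passing the compressed bytes to Decode with "gzip" returns the original text, whereas reading the same bytes directly with a StreamReader produces the kind of garbled output described above. HttpWebRequest also has an AutomaticDecompression property that can take over this step; whether it suits a given setup is left to the reader.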

Reference: http://www.cnblogs.com/clc2008/archive/2011/09/13/2174284.html

posted @ 2012-01-31 15:24 peak