Fetching a Website's Encoding Information and Content

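     I have recently been writing a small tool that reads website content. A web search turns up plenty of examples, but none worked quite right when I tried them: some pages still came back garbled. So I made a few small adjustments on top of loafinweb's code to fit my own needs. My apologies to loafinweb if anything here is off.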
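1. Reading only the fragment that declares the encoding:
Replace string html = reader.ReadToEnd();
with
while ((temp = reader.ReadLine()) != null)
{
    htmlBuilder.AppendLine(temp);
    // IndexOf returns 0 for a match at the start of the line, so test >= 0 rather than > 0.
    if (temp.IndexOf("charset", StringComparison.OrdinalIgnoreCase) >= 0)
    {
        break;
    }
}
html = htmlBuilder.ToString();
This speeds up reading: only the head of the page, up to the encoding declaration, has to be read rather than the whole page.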
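The early-exit read can be tried without any network access. The following is a minimal sketch that assumes the page is available through a TextReader; the class name CharsetSniffer and the method ReadUntilCharset are hypothetical names used only for illustration.

```csharp
using System;
using System.IO;
using System.Text;

public static class CharsetSniffer
{
    // Reads lines until one mentions "charset", then stops.
    // Returns everything read so far, with line breaks preserved.
    public static string ReadUntilCharset(TextReader reader)
    {
        var builder = new StringBuilder();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            builder.AppendLine(line);
            if (line.IndexOf("charset", StringComparison.OrdinalIgnoreCase) >= 0)
                break;
        }
        return builder.ToString();
    }

    public static void Main()
    {
        // A made-up page: the declaration sits in the head, the body is never read.
        string page = "<html>\n<head>\n<meta charset=\"gb2312\">\n</head>\n<body>long body...</body>\n</html>";
        string head = ReadUntilCharset(new StringReader(page));
        Console.WriteLine(head.Contains("charset")); // True
        Console.WriteLine(head.Contains("<body>"));  // False
    }
}
```

The loop tests each line as it arrives instead of rebuilding and rescanning the accumulated text, so the cost stays proportional to the amount read.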
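2. Handle the case response.StatusCode == HttpStatusCode.MovedPermanently || response.StatusCode == HttpStatusCode.Found (HTTP 301 and 302) and fetch the encoding recursively from the redirect target.
(For the HTTP status codes behind Response.StatusCode, see http://wenku.baidu.com/view/cc274309bb68a98271fefada.html)
For example, Sina's www.sina.com redirects to www.sina.com.cn by default.
The redirect can be observed with Fiddler.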
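A 301 or 302 response normally names its target in the Location header, and that value may be absolute or relative. As a rough sketch, assuming the header is present, the target can be resolved against the requested URL with System.Uri; the class RedirectTarget and the method Resolve are hypothetical names, and the second example URL is made up.

```csharp
using System;

public static class RedirectTarget
{
    // Resolves a Location header value against the URL that was requested.
    // An absolute value replaces the base; a relative value is combined with it.
    public static string Resolve(string requestedUrl, string location)
    {
        var baseUri = new Uri(requestedUrl);
        return new Uri(baseUri, location).AbsoluteUri;
    }

    public static void Main()
    {
        Console.WriteLine(Resolve("http://www.sina.com/", "http://www.sina.com.cn/"));
        // prints http://www.sina.com.cn/
        Console.WriteLine(Resolve("http://www.sina.com/news/", "/index.html"));
        // prints http://www.sina.com/index.html
    }
}
```

Resolving through Uri avoids the malformed addresses that plain string concatenation of base and target can produce.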

   

   

// Namespaces required by the code in this post:
using System;
using System.IO;
using System.IO.Compression;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

// Detects the character encoding of the page at url.
// redirectsLeft caps how many 301/302 hops are followed, so a redirect loop cannot recurse forever.
public static string getEncoding(string url, int redirectsLeft = 5)
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    request.Timeout = 30000;
    request.AllowAutoRedirect = false; // 301/302 are handled below

    // No catch block: rethrowing as new Exception(ex.Message) would discard the
    // exception type and stack trace, so errors simply propagate to the caller.
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    {
        if (response.StatusCode == HttpStatusCode.OK && response.ContentLength < 1024 * 1024)
        {
            Stream stream = response.GetResponseStream();
            if ("gzip".Equals(response.ContentEncoding, StringComparison.OrdinalIgnoreCase))
                stream = new GZipStream(stream, CompressionMode.Decompress);

            // ASCII is enough to locate the charset declaration.
            using (StreamReader reader = new StreamReader(stream, Encoding.ASCII))
            {
                // Do not use ReadToEnd here; read line by line and stop at the first line mentioning charset.
                StringBuilder htmlBuilder = new StringBuilder();
                string temp;
                while ((temp = reader.ReadLine()) != null)
                {
                    htmlBuilder.AppendLine(temp);
                    // IndexOf returns 0 for a match at the start of the line, so test >= 0.
                    if (temp.IndexOf("charset", StringComparison.OrdinalIgnoreCase) >= 0)
                        break;
                }
                string html = htmlBuilder.ToString();

                // Accepts charset=gb2312, charset="utf-8" and charset='utf-8'.
                // The earlier pattern [^"]* captured an empty string for <meta charset="utf-8">.
                Match m = Regex.Match(html, @"charset\s*=\s*[""']?(?<charset>[\w-]+)", RegexOptions.IgnoreCase);
                if (m.Success)
                    return m.Groups["charset"].Value;

                // Fall back to the charset announced in the Content-Type header.
                if (!string.IsNullOrEmpty(response.CharacterSet))
                    return response.CharacterSet;
            }
        }
        else if ((response.StatusCode == HttpStatusCode.MovedPermanently || response.StatusCode == HttpStatusCode.Found)
                 && redirectsLeft > 0)
        {
            // The page redirects with 301/302, e.g. www.sina.com.
            // Take the target from the Location header and resolve it against the current URL.
            string location = response.Headers["Location"];
            if (!string.IsNullOrEmpty(location))
            {
                Uri target = new Uri(new Uri(url), location);
                return getEncoding(target.AbsoluteUri, redirectsLeft - 1);
            }
        }
    }
    return Encoding.Default.BodyName;
}
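Pages declare their encoding in more than one form: as part of a content attribute (charset=gb2312) or as the HTML5 attribute charset="utf-8". The sketch below shows one pattern that tolerates an optional quote before the name. The pattern and the names CharsetRegexDemo and Extract are assumptions made for illustration, and the sample tags are made up.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CharsetRegexDemo
{
    // Optional opening quote, then the charset name (letters, digits, '_' and '-').
    private static readonly Regex CharsetPattern = new Regex(
        @"charset\s*=\s*[""']?(?<charset>[\w-]+)", RegexOptions.IgnoreCase);

    // Returns the declared charset, or null when none is found.
    public static string Extract(string html)
    {
        Match m = CharsetPattern.Match(html);
        return m.Success ? m.Groups["charset"].Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(Extract("<meta http-equiv=\"Content-Type\" content=\"text/html; charset=gb2312\">")); // gb2312
        Console.WriteLine(Extract("<meta charset=\"utf-8\">"));                                                 // utf-8
        Console.WriteLine(Extract("<title>no declaration</title>") ?? "(none)");                                // (none)
    }
}
```

Returning null for a missing declaration lets the caller decide on a fallback, such as the charset from the response header.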
3. When fetching the page content, also check for gzip; otherwise the result may be garbled, for example on www.sohu.com.
// Fetches the page at url as a string, decoded with the detected encoding.
public static string getHtml(string url)
{
    try
    {
        string str = "";
        Encoding en = Encoding.GetEncoding(getEncoding(url));
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Headers.Set("Pragma", "no-cache");
        request.Timeout = 30000;
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        {
            if (response.StatusCode == HttpStatusCode.OK && response.ContentLength < 1024 * 1024)
            {
                // Do not wrap the response stream in a StreamReader directly: a gzip-compressed
                // body must be decompressed first, otherwise the text is garbled (e.g. www.sohu.com).
                Stream stream = response.GetResponseStream();
                if ("gzip".Equals(response.ContentEncoding, StringComparison.OrdinalIgnoreCase))
                    stream = new GZipStream(stream, CompressionMode.Decompress);
                using (StreamReader sr = new StreamReader(stream, en))
                {
                    str = sr.ReadToEnd();
                }
            }
        }
        return str;
    }
    catch
    {
        // Any failure (network error, unknown encoding name) yields an empty string.
        return String.Empty;
    }
}
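Why the gzip check matters can be seen offline: a compressed body is binary data, and only after decompression do the bytes decode back into text. The following is a small round-trip sketch with in-memory streams; the class GzipDemo and its methods are hypothetical names, and the sample HTML is made up.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class GzipDemo
{
    // Compresses text the way a server would before sending Content-Encoding: gzip.
    public static byte[] Compress(string text, Encoding encoding)
    {
        byte[] raw = encoding.GetBytes(text);
        using (var buffer = new MemoryStream())
        {
            using (var gzip = new GZipStream(buffer, CompressionMode.Compress))
            {
                gzip.Write(raw, 0, raw.Length);
            } // disposing the GZipStream flushes the final gzip block
            return buffer.ToArray();
        }
    }

    // Decompresses first, then decodes with the page's encoding.
    public static string Decompress(byte[] compressed, Encoding encoding)
    {
        using (var gzip = new GZipStream(new MemoryStream(compressed), CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        byte[] body = Compress("<html>hello</html>", Encoding.UTF8);
        // A gzip stream starts with the magic bytes 0x1f 0x8b, not with readable text.
        Console.WriteLine(body[0] == 0x1f && body[1] == 0x8b); // True
        Console.WriteLine(Decompress(body, Encoding.UTF8));    // <html>hello</html>
    }
}
```

HttpWebRequest also has an AutomaticDecompression property (for example DecompressionMethods.GZip | DecompressionMethods.Deflate) that lets the framework handle this step; whether it fits depends on the project.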
Reference: http://www.cnblogs.com/clc2008/archive/2011/09/13/2174284.html
posted @ 2012-01-31 15:24  peak