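There are some seriously badass problems in this world, the kind that let a lot of badass people die of their own badassery.
Let's look at a page first: http://www.skxox.com/xxinfo_127691.html
This page should display fine in a browser, with no garbled text.
Now view its source, and you will see this line: <meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
So you, being badass, very badassly conclude that this page is encoded in gb2312...
Now try forcing the browser to display the page as GB2312: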
涓�鍛ㄩ挗閾佽涓氫俊鎭嫨瑕�
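WTF, what on earth is this......
So all those articles on cnblogs (博客园) about auto-detecting a page's encoding from its meta tag are lying to you...
Take a look with a packet capture tool: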
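The garbling above is what you get when bytes written in one encoding are decoded with another. Here is a minimal sketch of how to reproduce that effect. `MojibakeDemo` and `DecodeUtf8BytesAs` are hypothetical names made up for illustration. Note that `Encoding.GetEncoding("gb2312")` works out of the box on .NET Framework, while on .NET Core and later you would first need to register `CodePagesEncodingProvider.Instance` from the System.Text.Encoding.CodePages package.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Encodes the text as UTF-8, then decodes those same bytes with a
    // different (wrong) encoding, which is what a browser does when it is
    // forced to use the charset from a misleading meta tag.
    public static string DecodeUtf8BytesAs(string text, string wrongEncodingName)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        Encoding wrongEncoding = Encoding.GetEncoding(wrongEncodingName);
        return wrongEncoding.GetString(utf8Bytes);
    }
}
```

Calling `MojibakeDemo.DecodeUtf8BytesAs(someChineseText, "gb2312")` should give the same kind of garbage as the line above, since every three-byte UTF-8 character gets chopped into GB2312 byte pairs.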
HTTP/1.1 200 OK
Date: Thu, 21 Apr 2011 07:36:27 GMT
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
X-AspNet-Version: 2.0.50727
X-Powered-By: UrlRewriter.NET 1.7.0
Cache-Control: private
Content-Type: text/html; charset=utf-8
Content-Length: 41750
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
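The charset in the Content-Type response header above is the page's real encoding.
So please, I'm begging you, stop parsing the charset inside the page. When the HTTP header declares a charset, browsers use it in preference to the meta tag.
Change the statement that gets the encoding to: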
string c = response.ContentType.Replace("text/html;", "").Replace("charset=", "").Trim();
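That one-liner does the job for a header like `text/html; charset=utf-8`, but it leaves junk behind for other media types (say `application/xhtml+xml; charset=utf-8`) and gives an empty string when there is no charset at all. Below is a slightly more defensive sketch, not the code this post actually uses. `CharsetHelper`, `GetCharset`, and `ResolveEncoding` are hypothetical names, and the fallback encoding is an assumption you would choose yourself.

```csharp
using System;
using System.Text;

public static class CharsetHelper
{
    // Returns the charset parameter of a Content-Type header value,
    // or null when the header is missing or declares no charset.
    public static string GetCharset(string contentType)
    {
        if (string.IsNullOrEmpty(contentType)) return null;

        // parts[0] is the media type; everything after it is a parameter.
        string[] parts = contentType.Split(';');
        for (int i = 1; i < parts.Length; i++)
        {
            string part = parts[i].Trim();
            if (part.StartsWith("charset=", StringComparison.OrdinalIgnoreCase))
            {
                string value = part.Substring("charset=".Length).Trim().Trim('"');
                return value.Length > 0 ? value : null;
            }
        }
        return null;
    }

    // Maps the header to an Encoding, using the caller's fallback when the
    // charset is absent or is a name this runtime does not recognize.
    public static Encoding ResolveEncoding(string contentType, Encoding fallback)
    {
        string name = GetCharset(contentType);
        if (name == null) return fallback;
        try
        {
            return Encoding.GetEncoding(name);
        }
        catch (ArgumentException)
        {
            return fallback;
        }
    }
}
```

With this, something like `CharsetHelper.ResolveEncoding(response.ContentType, Encoding.UTF8)` could stand in for the string surgery, assuming UTF-8 is an acceptable default for the sites you fetch.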
The whole lump of code:
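/// <summary>
/// Fetches the page source of a remote URL.
/// </summary>
/// <param name="url">URL of the page to fetch</param>
/// <param name="ucoid">Encoding name to force; when null or empty, the charset from the Content-Type response header is used</param>
/// <returns>The HTML source, or an empty string on any failure</returns>
public static string GetHtml(string url, string ucoid)
{
    HttpWebRequest request = null;
    HttpWebResponse response = null;
    StreamReader reader = null;
    try
    {
        request = (HttpWebRequest)WebRequest.Create(url);
        request.UserAgent = "www.svnhost.cn";
        request.Timeout = 20000;
        request.AllowAutoRedirect = true;
        response = (HttpWebResponse)request.GetResponse();
        // Only read successful responses smaller than 1 MB.
        if (response.StatusCode == HttpStatusCode.OK && response.ContentLength < 1024 * 1024)
        {
            // Take the charset from the HTTP header, not from the page's meta tag.
            string c = response.ContentType.Replace("text/html;", "").Replace("charset=", "").Trim();
            if (string.IsNullOrEmpty(ucoid))
            {
                ucoid = c;
            }
            // GetEncoding throws if the header carried no usable charset;
            // the catch below turns that into an empty result.
            reader = new StreamReader(response.GetResponseStream(), System.Text.Encoding.GetEncoding(ucoid));
            string html = reader.ReadToEnd();
            return html;
        }
    }
    catch { } // Swallow every error; the caller just gets string.Empty.
    finally
    {
        // Close the reader before the response it reads from.
        if (reader != null)
        {
            reader.Close();
        }
        if (response != null)
        {
            response.Close();
        }
    }
    return string.Empty;
}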
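So, web designers, you just can't catch a break.... In a past life they must all have been broken-winged angels who fell into a septic tank...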