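Three Ways to Fetch a Remote Page's Source in WebForms: WebRequest, WebClient, and WebBrowser (downmoon)

A small requirement: fetch the source of a remote page, mainly for scraping data. The code had been working fine, but recently it suddenly stopped returning the page source, even though the page still loaded normally in a browser. (A source code download is attached at the end of the post. ^_^)

After some analysis, this is the code I had been using: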

Code

After some research, it turned out that a few request parameters have to be set.
    #region Important parameters; without them no content is returned
    // UserAgent takes only the header value, so the "User-Agent: " prefix is not included.
    httpWebRequest.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)";
    httpWebRequest.Accept = "*/*";
    httpWebRequest.KeepAlive = true;
    httpWebRequest.Headers.Add("Accept-Language", "zh-cn,en-us;q=0.5");
    #endregion

The corrected code is as follows:

Code

// Requires: using System.IO; using System.Net;
#region Read page content
/// <summary>
/// Reads the full source of a page.
/// </summary>
/// <param name="Url">The address to read.</param>
/// <param name="encoding">The encoding used to decode the response.</param>
/// <returns>The page source, or null if every attempt fails.</returns>
public static string GetStringByUrl(string Url, System.Text.Encoding encoding)
{
    if (Url.Equals("about:blank")) return null;
    if (!Url.StartsWith("http://") && !Url.StartsWith("https://")) { Url = "http://" + Url; }

    int dialCount = 0;
loop:
    StreamReader sreader = null;
    string result = string.Empty;
    try
    {
        HttpWebRequest httpWebRequest = (HttpWebRequest)WebRequest.Create(Url);
        //httpWebRequest.Timeout = 20;

        #region Important parameters; without them no content is returned
        httpWebRequest.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)";
        httpWebRequest.Accept = "*/*";
        httpWebRequest.KeepAlive = true;
        httpWebRequest.Headers.Add("Accept-Language", "zh-cn,en-us;q=0.5");
        #endregion

        HttpWebResponse httpWebResponse = (HttpWebResponse)httpWebRequest.GetResponse();
        if (httpWebResponse.StatusCode == HttpStatusCode.OK)
        {
            sreader = new StreamReader(httpWebResponse.GetResponseStream(), encoding);
            char[] cCont = new char[256];
            int count = sreader.Read(cCont, 0, 256);
            while (count > 0)
            {
                // Append each chunk of up to 256 characters to the result.
                result += new String(cCont, 0, count);
                count = sreader.Read(cCont, 0, 256);
            }
        }
        if (null != httpWebResponse) { httpWebResponse.Close(); }
        return result;
    }
    catch (WebException e)
    {
        // Redial only on a connection failure, but count every failed
        // attempt so that the retry loop always terminates.
        if (e.Status == WebExceptionStatus.ConnectFailure) { ReDial(); }
        dialCount++;
        if (dialCount < 5) { goto loop; }
        return null;
    }
    finally { if (sreader != null) { sreader.Close(); } }
}
#endregion

public static void ReDial()
{
    // The redial logic is disabled in this sample:
    //
    // int res = 1;
    // while (res != 0)
    // {
    //     CSDNWebTest.RASDisplay ras = new RASDisplay();
    //     ras.Disconnect();
    //     res = ras.Connect("asdl");
    //     System.Threading.Thread.Sleep(TimeSpan.FromSeconds(10));
    // }
}

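That solved the problem. Thinking about it some more, another option is to use WebClient to download the page to a local temporary file first, and then read the text content from that file.

The code is as follows: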

Code
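The listing itself is collapsed in the post. As a rough sketch of the approach just described (not the author's code), the helper below downloads the page to a temporary file with WebClient and then reads the file back as text. The class and method names are hypothetical, and setting the same User-Agent and Accept-Language headers as in the WebRequest fix is an assumption about what the target site expects.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

public static class WebClientFetcher
{
    // Hypothetical helper: downloads url to a temp file, then reads it as text.
    public static string GetStringByWebClient(string url, Encoding encoding)
    {
        string tempFile = Path.GetTempFileName();
        try
        {
            using (WebClient client = new WebClient())
            {
                // Assumption: the site needs browser-like headers, as in the
                // WebRequest version. WebClient sends no User-Agent by default.
                client.Headers.Add("User-Agent",
                    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
                client.Headers.Add("Accept-Language", "zh-cn,en-us;q=0.5");
                client.DownloadFile(url, tempFile);
            }
            return File.ReadAllText(tempFile, encoding);
        }
        finally
        {
            // Always remove the temporary file, even if the download fails.
            File.Delete(tempFile);
        }
    }
}
```

A call such as `WebClientFetcher.GetStringByWebClient("http://www.example.com/", Encoding.UTF8)` would return the page text; the URL here is only a placeholder.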

The result: the source could not be retrieved. The error was as follows:

Then I remembered that the WebBrowser control is also available. In WinForms, it is enough to put [STAThread] on the main thread's entry point.

Code
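The WinForms listing is likewise collapsed in the post. A minimal sketch of the idea, assuming a console-style entry point with a reference to System.Windows.Forms and a placeholder URL, could look like this. The Program class and the way the source is printed are illustrative choices, not the author's code.

```csharp
using System;
using System.Windows.Forms;

static class Program
{
    // WebBrowser wraps an ActiveX control, so the thread must be STA.
    [STAThread]
    static void Main()
    {
        using (WebBrowser browser = new WebBrowser())
        {
            browser.ScriptErrorsSuppressed = true;
            browser.DocumentCompleted += (sender, e) =>
            {
                // DocumentCompleted also fires for frames; wait for the top-level document.
                if (e.Url != browser.Url) return;
                Console.WriteLine(browser.DocumentText);
                Application.ExitThread();
            };
            browser.Navigate("http://www.example.com/"); // placeholder URL
            // Run a message loop so the control can load the page.
            Application.Run();
        }
    }
}
```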

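In WebForms it is more trouble. An error occurs: ActiveX control "8856f961-340a-11d0-a96b-00c04fd705a2" cannot be instantiated because the current thread is not in a single-threaded apartment.

The code is as follows: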

Code

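After N hours of searching (N >= 5), I finally found a workable solution: add AspCompat="true" to the Page directive at the top of the web page.

That is, <%@ Page Language="C#" AspCompat="true" ****** %>

The explanation given by MSDN is:
Include the compatibility attribute aspcompat=true in the <%@Page> tag of the ASP.NET page, as in <%@Page aspcompat=true Language=VB%>. Using this attribute forces the page to execute in STA mode, which ensures that your component continues to work correctly. If you try to use an STA component without specifying this tag, the runtime throws an exception.

Setting this attribute to true also allows the page to call COM+ 1.0 components that need access to the unmanaged ASP built-in objects. These are accessible through the ObjectContext object.

If you set this tag to true, performance degrades slightly. It is recommended to do so only when it is really needed.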
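To make the AspCompat approach concrete, here is a hedged sketch of a code-behind class for a page whose directive includes AspCompat="true". The page name FetchPage, the placeholder URL, the 30-second deadline, and the polling loop are all assumptions for illustration. The project would also need a reference to System.Windows.Forms.

```csharp
using System;
using System.Threading;
using System.Web.UI;

// Code-behind for a page declared with AspCompat="true" (hypothetical name).
public partial class FetchPage : Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        // With AspCompat="true" the page runs on an STA thread, so the
        // ActiveX-based WebBrowser control can be instantiated here.
        using (var browser = new System.Windows.Forms.WebBrowser())
        {
            browser.ScriptErrorsSuppressed = true;
            browser.Navigate("http://www.example.com/"); // placeholder URL

            // Pump messages until the document has loaded or the deadline passes.
            // Fully qualified because Page already has an Application property.
            DateTime deadline = DateTime.Now.AddSeconds(30);
            while (browser.ReadyState != System.Windows.Forms.WebBrowserReadyState.Complete
                   && DateTime.Now < deadline)
            {
                System.Windows.Forms.Application.DoEvents();
                Thread.Sleep(50);
            }

            // Write the fetched source into the response, HTML-encoded.
            Response.Write(Server.HtmlEncode(browser.DocumentText));
        }
    }
}
```

The deadline keeps a page that never finishes loading from blocking the request indefinitely.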

It finally works! I wonder whether there is a better way?

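Attachment: source code download.

Note from 邀月:

If the test does not work for you, check whether you are in a domain (AD) environment. If so, make sure the proxy and firewall are configured.
See:
http://dev.csdn.net/article/83914.shtm

or http://blog.csdn.net/downmoon/archive/2006/04/14/663337.aspx

or http://www.cnblogs.com/downmoon/archive/2007/12/29/1019701.html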

posted @ 2009-07-01 10:51 邀月
