"C# Crawlers: The Way of Breaking Through", Realm One: How Crawlers Work, Section 2: WebRequest

This section introduces the most common and most practical base classes for building a crawler in C#: WebRequest and WebResponse.

Let's start with an example [1.2.1]:

    using System;
    using System.IO;
    using System.Net;
    using System.Text;

    class Program
    {
        static void Main(string[] args)
        {
            // Create picks the concrete request class from the URI scheme.
            var request = WebRequest.Create(@"https://www.cnblogs.com/mikecheers/p/12090487.html");
            request.Method = "GET";
            using (var response = request.GetResponse())
            {
                using (var stream = response.GetResponseStream())
                {
                    // Decode the response body as UTF-8 (without a byte order mark).
                    using (var reader = new StreamReader(stream, new UTF8Encoding(false)))
                    {
                        var content = reader.ReadToEnd();
                        Console.WriteLine(content);
                    }
                }
                response.Close();
            }
            request.Abort();
            Console.ReadLine(); // keep the console window open
        }
    }


<div id="cnblogs_post_body" class="blogpost-body "> <p>  在构建本章节内容的时候,笔者也在想一个问题,究竟什么样的采集器框架,才能算得上是一个&ldquo;全能&rdquo;的呢?就我自己以往项目经历而言,可以归纳以下几个大的分类:</p> <ol> <li>根据通讯协议:HTTP的、HTTPS的、TCP的、UDP的;</li> <li>根据数据类型:纯文本的、json的、压缩包的、图片的、视频的;</li> <li>根据更新周期:不定期更新的、定期更新的、增量更新的;</li> <li>根据数据来源:单一数据源、多重数据源、多重数据源混合;</li> <li>根据采集点分布:单机的,集群的;</li> <li>根据反爬虫策略:控制频率的,使用代理的,使用特定UA的;</li> <li>根据配置:可配置的,不可配置的;</li> </ol> <p>  以上的分类,也有可能不够全面,不过应该可以涵盖主流数据采集的分类了。</p> <p>  为了方便阐述一个爬虫的工作原理,我们从上面找到一条最简单路径,来进行阐述(偷奸耍滑?非也,大道化简,万变不离其宗:)</p> <p>  OK,一个小目标,单机、单一数据源、定期更新、纯文本、HTTP的爬虫,来一只。</p> <p>&nbsp;</p> <p style="margin-left: 30px;">在第一境的后面各节中,我们就来逐步实现这个小目标,同时,也来探究一下其中的原理。只有掌握了这些,才能通过继续扩展,衍生出强大的爬虫:)</p> </div>



The core of example [1.2.1] is its use of the two base classes that are the subject of this section: WebRequest and WebResponse.
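The request/response pattern from example [1.2.1] can be wrapped in a small reusable helper. This is only a sketch: the names `PageFetcher` and `Fetch` are made up for illustration, and it assumes the resource is UTF-8 text. Since it touches nothing but the abstract WebRequest and WebResponse types, the same code should also work for a `file://` URI, which is handy for trying it out without a network.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

// Hypothetical helper, not part of the original example.
public static class PageFetcher
{
    // Downloads the resource at `uri` and returns its body as text.
    // Assumption: the body is UTF-8 encoded.
    public static string Fetch(string uri)
    {
        // Create chooses the concrete WebRequest subclass from the URI scheme.
        WebRequest request = WebRequest.Create(uri);

        // Each using disposes its object even if reading throws.
        using (WebResponse response = request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (var reader = new StreamReader(stream, new UTF8Encoding(false)))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Usage would look like `Console.WriteLine(PageFetcher.Fetch("https://www.cnblogs.com/mikecheers/p/12090487.html"));`.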




[Code 1.2.2]

 1     public abstract class WebRequest : MarshalByRefObject, ISerializable
 2     {
 3         protected WebRequest();
 4         protected WebRequest(SerializationInfo serializationInfo, StreamingContext streamingContext);
 5         /***************
 6          *  Some properties and methods are omitted here to keep the listing short.
 7          * *************/
 8         public static IWebProxy DefaultWebProxy { get; set; }
 9         public static RequestCachePolicy DefaultCachePolicy { get; set; }
10         public virtual IWebProxy Proxy { get; set; }
11         public static WebRequest Create(Uri requestUri);
12         public static WebRequest Create(string requestUriString);
13         public static WebRequest CreateDefault(Uri requestUri);
14         public static HttpWebRequest CreateHttp(Uri requestUri);
15         public static HttpWebRequest CreateHttp(string requestUriString);
16         public static IWebProxy GetSystemWebProxy();
17     }


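 Listed here are the properties and methods we usually care about most, along with the class declaration itself. First, it is an abstract class, which means it has derived classes. In .NET Framework 4.6.1 these are mainly HttpWebRequest, FtpWebRequest and FileWebRequest.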





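Version 4.6.1 has a few more static methods than 4.0, for example public static HttpWebRequest CreateHttp(xxx). Newer versions will probably differ again, so while learning to write crawlers, make sure you know which framework version you are using and what its classes look like.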


Having covered construction, let's look at the instance-creation methods on lines 11 to 15 of [Code 1.2.2]:

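There are several overloads, but the parameter names make it clear that they all ask for the same thing: a URI (uniform resource identifier), for example https://www.cnblogs.com/mikecheers/category/1609574.html. With that in hand, the request is ready to go out and roam :P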
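A quick way to see what these factory methods hand back is to ask the returned object for its runtime type. The sketch below is illustrative only; `CreateDemo` and its method names are made up, and no request is actually sent, because creating a request object does not touch the network.

```csharp
using System;
using System.Net;

// Hypothetical demo class, for illustration only.
public static class CreateDemo
{
    // Returns the name of the concrete class WebRequest.Create chose for the URI.
    public static string ConcreteTypeFor(string uri)
    {
        WebRequest request = WebRequest.Create(uri);
        return request.GetType().Name;
    }

    // Same idea, using the Uri overload instead of the string overload.
    public static string ConcreteTypeFor(Uri uri)
    {
        WebRequest request = WebRequest.Create(uri);
        return request.GetType().Name;
    }

    // CreateHttp returns HttpWebRequest directly, so no cast is needed.
    // It only accepts http and https URIs.
    public static HttpWebRequest CreateTyped(string uri)
    {
        return WebRequest.CreateHttp(uri);
    }
}
```

For an `https://` URI the type name should be `HttpWebRequest`, and for a `file://` URI it should be `FileWebRequest`.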





Having covered instance creation, let's look at the proxy-related properties and methods on lines 8, 10 and 16 of [Code 1.2.2]:

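  • DefaultWebProxy is a static property, so no matter how many instances of WebRequest-derived classes we create, there is only one DefaultWebProxy. It is mutable and can be changed while the program is running, but it is global: once it is set, every instance uses it by default, with no need to set it on each instance;
  • Proxy is a virtual property that lets each instance of a WebRequest-derived class use its own proxy;
  • GetSystemWebProxy is a static method that returns the proxy configured in Internet Explorer for the current user. In China people use all sorts of browsers, so it is rarely needed, although it is quite useful if you are building a browser;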
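The difference between the global default and a per-request proxy can be sketched as follows. The class name `ProxySetupDemo` and both proxy addresses are hypothetical placeholders; nothing is sent over the network here, the code only configures request objects.

```csharp
using System;
using System.Net;

// Hypothetical demo class; the proxy addresses below are placeholders.
public static class ProxySetupDemo
{
    public const string GlobalProxyAddress = "http://127.0.0.1:8888/";
    public const string PerRequestProxyAddress = "http://127.0.0.1:9999/";

    // Sets the process-wide default proxy, then creates two requests:
    // index 0 keeps the default, index 1 overrides it with its own proxy.
    public static WebRequest[] CreateRequests(string url)
    {
        // Global: requests created afterwards pick this up by default.
        WebRequest.DefaultWebProxy = new WebProxy(GlobalProxyAddress);

        WebRequest usesDefault = WebRequest.Create(url);

        // Per instance: only this request goes through the second proxy.
        WebRequest usesOwn = WebRequest.Create(url);
        usesOwn.Proxy = new WebProxy(PerRequestProxyAddress);

        return new[] { usesDefault, usesOwn };
    }
}
```

Calling `Proxy.GetProxy(destination)` on each request shows which proxy address it would use for that destination.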


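Why use a proxy at all? There are two typical reasons:

  • No direct access: from within mainland China, for example, sites such as Google and Facebook cannot be reached directly. A proxy is one way around this: route the request through, say, Japan, and let the proxy there relay it to Google or Facebook. A proxy is only one road among many; a VPN is another option, so there is no need to agonize over the choice;
  • "Taking a detour": we are building a crawler, and we should have a fair idea whether the owner of the resources is happy to hand them over. If not, they impose restrictions, commonly known as "anti-crawling strategies". One common strategy limits the number of concurrent connections and the request rate per client IP. As crawler developers we also want the resources as quickly as possible, since electricity is not cheap :) The usual countermeasure is to use proxies: spreading requests across multiple proxies eases the pressure of per-IP limits. One use case is the countless sock-puppet accounts in live-streaming rooms, often tens of thousands at a time, which would be hard to pull off from a single IP;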
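Spreading requests across several proxies can be as simple as handing out proxies in round-robin order. The following is a minimal sketch under stated assumptions, not a production proxy pool: `RoundRobinProxyPool` is a made-up class, the addresses passed to it are up to the caller, and it does no health checking or rate limiting.

```csharp
using System;
using System.Net;
using System.Threading;

// Hypothetical round-robin proxy pool, for illustration only.
public sealed class RoundRobinProxyPool
{
    private readonly IWebProxy[] _proxies;
    private int _next = -1;

    public RoundRobinProxyPool(params string[] proxyAddresses)
    {
        if (proxyAddresses == null || proxyAddresses.Length == 0)
            throw new ArgumentException("At least one proxy address is required.", nameof(proxyAddresses));

        _proxies = Array.ConvertAll(proxyAddresses, a => (IWebProxy)new WebProxy(a));
    }

    // Returns the next proxy in rotation.
    public IWebProxy Next()
    {
        // Interlocked keeps the counter correct when several threads call Next.
        int ticket = Interlocked.Increment(ref _next);

        // The uint cast keeps the index non-negative even if the counter wraps.
        int index = (int)((uint)ticket % (uint)_proxies.Length);
        return _proxies[index];
    }

    // Creates a request for `url` that goes through the next proxy in rotation.
    public WebRequest CreateRequest(string url)
    {
        WebRequest request = WebRequest.Create(url);
        request.Proxy = Next();
        return request;
    }
}
```

With two addresses in the pool, consecutive calls to `CreateRequest` alternate between the two proxies.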







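Proxies can be classified in several ways:

  • By the kind of request they forward: S5 (SOCKS5), HTTP, HTTPS and so on;
  • By degree of anonymity: ordinary, high-anonymity, transparent and so on;
  • By degree of openness: public, private and so on; there are other classifications as well;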





    public interface IWebProxy
    {
        ICredentials Credentials { get; set; }
        Uri GetProxy(Uri destination);
        bool IsBypassed(Uri host);
    }
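Because IWebProxy is just an interface with three members, you can supply your own implementation wherever a proxy is expected. The sketch below is one hypothetical implementation (`SingleHostBypassProxy` is a made-up name): it sends everything through a single proxy, except for one host that is contacted directly.

```csharp
using System;
using System.Net;

// Hypothetical IWebProxy implementation, for illustration only.
public sealed class SingleHostBypassProxy : IWebProxy
{
    private readonly Uri _proxyAddress;
    private readonly string _directHost;

    public SingleHostBypassProxy(Uri proxyAddress, string directHost)
    {
        if (proxyAddress == null) throw new ArgumentNullException(nameof(proxyAddress));
        if (directHost == null) throw new ArgumentNullException(nameof(directHost));
        _proxyAddress = proxyAddress;
        _directHost = directHost;
    }

    // Credentials to present to the proxy server, if it requires any.
    public ICredentials Credentials { get; set; }

    // Returns the destination itself when it is bypassed, otherwise the proxy address.
    public Uri GetProxy(Uri destination)
    {
        if (destination == null) throw new ArgumentNullException(nameof(destination));
        return IsBypassed(destination) ? destination : _proxyAddress;
    }

    // True when the proxy should NOT be used for this host.
    public bool IsBypassed(Uri host)
    {
        if (host == null) throw new ArgumentNullException(nameof(host));
        return string.Equals(host.Host, _directHost, StringComparison.OrdinalIgnoreCase);
    }
}
```

An instance can then be assigned to `request.Proxy` or to `WebRequest.DefaultWebProxy` like any other proxy.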


    //
    // Summary:
    //     Contains HTTP proxy settings for the System.Net.WebRequest class.
    public class WebProxy : IAutoWebProxy, IWebProxy, ISerializable
    {
        //
        // Summary:
        //     Initializes an empty instance of the System.Net.WebProxy class.
        public WebProxy();
        //
        // Summary:
        //     Initializes a new instance of the System.Net.WebProxy class from the specified
        //     System.Uri instance.
        //
        // Parameters:
        //   Address:
        //     A System.Uri instance that contains the address of the proxy server.
        public WebProxy(Uri Address);
        //
        // Summary:
        //     Initializes a new instance of the System.Net.WebProxy class with the specified
        //     URI.
        //
        // Parameters:
        //   Address:
        //     The URI of the proxy server.
        //
        // Exceptions:
        //   T:System.UriFormatException:
        //     Address is an invalid URI.
        public WebProxy(string Address);
        //
        // Summary:
        //     Initializes a new instance of the System.Net.WebProxy class with the System.Uri
        //     instance and bypass setting.
        //
        // Parameters:
        //   Address:
        //     A System.Uri instance that contains the address of the proxy server.
        //
        //   BypassOnLocal:
        //     true to bypass the proxy for local addresses; otherwise, false.
        public WebProxy(Uri Address, bool BypassOnLocal);
        //
        // Summary:
        //     Initializes a new instance of the System.Net.WebProxy class with the specified
        //     host and port number.
        //
        // Parameters:
        //   Host:
        //     The name of the proxy host.
        //
        //   Port:
        //     The port number on Host to use.
        //
        // Exceptions:
        //   T:System.UriFormatException:
        //     The URI formed by combining Host and Port is not a valid URI.
        public WebProxy(string Host, int Port);
        //
        // Summary:
        //     Initializes a new instance of the System.Net.WebProxy class with the specified
        //     URI and bypass setting.
        //
        // Parameters:
        //   Address:
        //     The URI of the proxy server.
        //
        //   BypassOnLocal:
        //     true to bypass the proxy for local addresses; otherwise, false.
        //
        // Exceptions:
        //   T:System.UriFormatException:
        //     Address is an invalid URI.
        public WebProxy(string Address, bool BypassOnLocal);
        //
        // Summary:
        //     Initializes a new instance of the System.Net.WebProxy class with the specified
        //     System.Uri instance, bypass setting, and list of URIs to bypass.
        //
        // Parameters:
        //   Address:
        //     A System.Uri instance that contains the address of the proxy server.
        //
        //   BypassOnLocal:
        //     true to bypass the proxy for local addresses; otherwise, false.
        //
        //   BypassList:
        //     An array of regular expression strings that contains the URIs of the servers
        //     to bypass.
        public WebProxy(Uri Address, bool BypassOnLocal, string[] BypassList);
        //
        // Summary:
        //     Initializes a new instance of the System.Net.WebProxy class with the specified
        //     URI, bypass setting, and list of URIs to bypass.
        //
        // Parameters:
        //   Address:
        //     The URI of the proxy server.
        //
        //   BypassOnLocal:
        //     true to bypass the proxy for local addresses; otherwise, false.
        //
        //   BypassList:
        //     An array of regular expression strings that contain the URIs of the servers to
        //     bypass.
        //
        // Exceptions:
        //   T:System.UriFormatException:
        //     Address is an invalid URI.
        public WebProxy(string Address, bool BypassOnLocal, string[] BypassList);
        //
        // Summary:
        //     Initializes a new instance of the System.Net.WebProxy class with the specified
        //     System.Uri instance, bypass setting, list of URIs to bypass, and credentials.
        //
        // Parameters:
        //   Address:
        //     A System.Uri instance that contains the address of the proxy server.
        //
        //   BypassOnLocal:
        //     true to bypass the proxy for local addresses; otherwise, false.
        //
        //   BypassList:
        //     An array of regular expression strings that contains the URIs of the servers
        //     to bypass.
        //
        //   Credentials:
        //     An System.Net.ICredentials instance to submit to the proxy server for authentication.
        public WebProxy(Uri Address, bool BypassOnLocal, string[] BypassList, ICredentials Credentials);
        //
        // Summary:
        //     Initializes a new instance of the System.Net.WebProxy class with the specified
        //     URI, bypass setting, list of URIs to bypass, and credentials.
        //
        // Parameters:
        //   Address:
        //     The URI of the proxy server.
        //
        //   BypassOnLocal:
        //     true to bypass the proxy for local addresses; otherwise, false.
        //
        //   BypassList:
        //     An array of regular expression strings that contains the URIs of the servers
        //     to bypass.
        //
        //   Credentials:
        //     An System.Net.ICredentials instance to submit to the proxy server for authentication.
        //
        // Exceptions:
        //   T:System.UriFormatException:
        //     Address is an invalid URI.
        public WebProxy(string Address, bool BypassOnLocal, string[] BypassList, ICredentials Credentials);
        //
        // Summary:
        //     Initializes an instance of the System.Net.WebProxy class using previously serialized
        //     content.
        //
        // Parameters:
        //   serializationInfo:
        //     The serialization data.
        //
        //   streamingContext:
        //     The context for the serialized data.
        protected WebProxy(SerializationInfo serializationInfo, StreamingContext streamingContext);

        //
        // Summary:
        //     Gets or sets the credentials to submit to the proxy server for authentication.
        //
        // Returns:
        //     An System.Net.ICredentials instance that contains the credentials to submit to
        //     the proxy server for authentication.
        //
        // Exceptions:
        //   T:System.InvalidOperationException:
        //     You attempted to set this property when the System.Net.WebProxy.UseDefaultCredentials
        //     property was set to true.
        public ICredentials Credentials { get; set; }
        //
        // Summary:
        //     Gets or sets an array of addresses that do not use the proxy server.
        //
        // Returns:
        //     An array that contains a list of regular expressions that describe URIs that
        //     do not use the proxy server when accessed.
        public string[] BypassList { get; set; }
        //
        // Summary:
        //     Gets or sets a value that indicates whether to bypass the proxy server for local
        //     addresses.
        //
        // Returns:
        //     true to bypass the proxy server for local addresses; otherwise, false. The default
        //     value is false.
        public bool BypassProxyOnLocal { get; set; }
        //
        // Summary:
        //     Gets or sets the address of the proxy server.
        //
        // Returns:
        //     A System.Uri instance that contains the address of the proxy server.
        public Uri Address { get; set; }
        //
        // Summary:
        //     Gets a list of addresses that do not use the proxy server.
        //
        // Returns:
        //     An System.Collections.ArrayList that contains a list of System.Net.WebProxy.BypassList
        //     arrays that represents URIs that do not use the proxy server when accessed.
        public ArrayList BypassArrayList { get; }
        //
        // Summary:
        //     Gets or sets a System.Boolean value that controls whether the System.Net.CredentialCache.DefaultCredentials
        //     are sent with requests.
        //
        // Returns:
        //     true if the default credentials are used; otherwise, false. The default value
        //     is false.
        //
        // Exceptions:
        //   T:System.InvalidOperationException:
        //     You attempted to set this property when the System.Net.WebProxy.Credentials property
        //     contains credentials other than the default credentials. For more information,
        //     see the Remarks section.
        public bool UseDefaultCredentials { get; set; }

        //
        // Summary:
        //     Reads the Internet Explorer nondynamic proxy settings.
        //
        // Returns:
        //     A System.Net.WebProxy instance that contains the nondynamic proxy settings from
        //     Internet Explorer 5.5 and later.
        [Obsolete("This method has been deprecated. Please use the proxy selected for you by default. http://go.microsoft.com/fwlink/?linkid=14202")]
        public static WebProxy GetDefaultProxy();
        //
        // Summary:
        //     Returns the proxied URI for a request.
        //
        // Parameters:
        //   destination:
        //     The System.Uri instance of the requested Internet resource.
        //
        // Returns:
        //     The System.Uri instance of the Internet resource, if the resource is on the bypass
        //     list; otherwise, the System.Uri instance of the proxy.
        //
        // Exceptions:
        //   T:System.ArgumentNullException:
        //     The destination parameter is null.
        public Uri GetProxy(Uri destination);
        //
        // Summary:
        //     Indicates whether to use the proxy server for the specified host.
        //
        // Parameters:
        //   host:
        //     The System.Uri instance of the host to check for proxy use.
        //
        // Returns:
        //     true if the proxy server should not be used for host; otherwise, false.
        //
        // Exceptions:
        //   T:System.ArgumentNullException:
        //     The host parameter is null.
        public bool IsBypassed(Uri host);
        //
        // Summary:
        //     Populates a System.Runtime.Serialization.SerializationInfo with the data that
        //     is needed to serialize the target object.
        //
        // Parameters:
        //   serializationInfo:
        //     The System.Runtime.Serialization.SerializationInfo to populate with data.
        //
        //   streamingContext:
        //     A System.Runtime.Serialization.StreamingContext that specifies the destination
        //     for this serialization.
        protected virtual void GetObjectData(SerializationInfo serializationInfo, StreamingContext streamingContext);
    }
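To make the bypass-related members of WebProxy more concrete, here is a small sketch that builds a proxy with a bypass list and credentials and then asks it how it would treat a destination. `WebProxyDemo` is a made-up class, and the proxy address, bypass pattern, user name and password are all placeholders rather than real values. As far as I know, the bypass patterns are regular expressions matched against the scheme and host of the destination.

```csharp
using System;
using System.Net;

// Hypothetical demo class; every address and credential here is a placeholder.
public static class WebProxyDemo
{
    public static WebProxy Build()
    {
        var proxy = new WebProxy(
            "http://proxyserver:80/",        // address of the proxy server
            true,                            // BypassOnLocal: skip the proxy for local addresses
            new[] { @"contoso\.com" });      // BypassList: regular expressions for hosts to reach directly

        // Credentials submitted to the proxy server if it asks for authentication.
        proxy.Credentials = new NetworkCredential("user", "password");
        return proxy;
    }

    // Describes how the proxy would route a request to `destination`.
    public static string Describe(WebProxy proxy, Uri destination)
    {
        return proxy.IsBypassed(destination)
            ? "direct: " + destination
            : "via proxy: " + proxy.GetProxy(destination);
    }
}
```

With this configuration, a URI on www.contoso.com should be reported as direct, while a URI on another public host should be reported as going through the proxy.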


[Code 1.2.3]

    WebProxy proxyObject = new WebProxy("http://proxyserver:80/", true);
    WebRequest req = WebRequest.Create("http://www.contoso.com");
    req.Proxy = proxyObject;




[Code 1.2.4]

    public abstract class WebRequest : MarshalByRefObject, ISerializable
    {
        /// <summary>
        /// Gets or sets the default cache policy for this request.
        /// </summary>
        public static RequestCachePolicy DefaultCachePolicy { get; set; }
        /// <summary>
        /// Gets or sets the cache policy for this request.
        /// </summary>
        public virtual RequestCachePolicy CachePolicy { get; set; }
        /// <summary>
        /// Gets or sets the impersonation level for the current request.
        /// </summary>
        public TokenImpersonationLevel ImpersonationLevel { get; set; }
        /// <summary>
        /// When overridden in a descendant class, gets or sets the name of the connection group for the request.
        /// </summary>
        public virtual string ConnectionGroupName { get; set; }
        //
        // Summary:
        //     When overridden in a descendant class, gets or sets the collection of header
        //     name/value pairs associated with the request.
        //
        // Returns:
        //     A System.Net.WebHeaderCollection containing the header name/value pairs associated
        //     with this request.
        //
        // Exceptions:
        //   T:System.NotImplementedException:
        //     Any attempt is made to get or set the property, when the property is not overridden
        //     in a descendant class.
        public virtual WebHeaderCollection Headers { get; set; }
        /// <summary>
        /// When overridden in a descendant class, gets or sets the content length of the request data being sent.
        /// </summary>
        public virtual long ContentLength { get; set; }
        /// <summary>
        /// When overridden in a descendant class, gets or sets the content type of the request data being sent.
        /// </summary>
        public virtual string ContentType { get; set; }
        /// <summary>
        /// When overridden in a descendant class, gets or sets the network credentials used to authenticate the request with the Internet resource.
        /// </summary>
        public virtual ICredentials Credentials { get; set; }
        /// <summary>
        /// When overridden in a descendant class, gets or sets a Boolean value that controls whether DefaultCredentials are sent with requests.
        /// </summary>
        public virtual bool UseDefaultCredentials { get; set; }
        /// <summary>
        /// When overridden in a descendant class, indicates whether to pre-authenticate the request.
        /// </summary>
        public virtual bool PreAuthenticate { get; set; }
        /// <summary>
        /// Gets or sets the length of time, in milliseconds, before the request times out.
        /// </summary>
        public virtual int Timeout { get; set; }
        /// <summary>
        /// Gets or sets the level of authentication and impersonation used for this request.
        /// </summary>
        public AuthenticationLevel AuthenticationLevel { get; set; }
        /// <summary>
        /// When overridden in a descendant class, gets or sets the protocol method to use in this request.
        /// </summary>
        public virtual string Method { get; set; }
        /// <summary>
        /// When overridden in a descendant class, gets the URI of the Internet resource associated with the request.
        /// </summary>
        public virtual Uri RequestUri { get; }

        /***************
         * Some properties and methods are omitted here to keep the listing short.
         * *************/

        /// <summary>
        /// Registers a WebRequest descendant for the specified URI.
        /// </summary>
        public static bool RegisterPrefix(string prefix, IWebRequestCreate creator);
        /// <summary>
        /// Aborts the request.
        /// </summary>
        public virtual void Abort();
        /// <summary>
        /// When overridden in a descendant class, provides an asynchronous version of the GetRequestStream() method.
        /// </summary>
        public virtual IAsyncResult BeginGetRequestStream(AsyncCallback callback, object state);
        /// <summary>
        /// When overridden in a descendant class, begins an asynchronous request for an Internet resource.
        /// </summary>
        public virtual IAsyncResult BeginGetResponse(AsyncCallback callback, object state);
        /// <summary>
        /// When overridden in a descendant class, returns a Stream for writing data to the Internet resource.
        /// </summary>
        public virtual Stream EndGetRequestStream(IAsyncResult asyncResult);
        /// <summary>
        /// When overridden in a descendant class, returns a WebResponse.
        /// </summary>
        public virtual WebResponse EndGetResponse(IAsyncResult asyncResult);
        /// <summary>
        /// When overridden in a descendant class, returns a Stream for writing data to the Internet resource.
        /// </summary>
        public virtual Stream GetRequestStream();
        /// <summary>
        /// When overridden in a descendant class, returns a Stream for writing data to the Internet resource as an asynchronous operation.
        /// </summary>
        public virtual Task<Stream> GetRequestStreamAsync();
        /// <summary>
        /// When overridden in a descendant class, returns a response to an Internet request.
        /// </summary>
        public virtual WebResponse GetResponse();
        /// <summary>
        /// When overridden in a descendant class, returns a response to an Internet request as an asynchronous operation.
        /// </summary>
        public virtual Task<WebResponse> GetResponseAsync();
        /// <summary>
        /// Populates a SerializationInfo with the data needed to serialize the target object.
        /// </summary>
        protected virtual void GetObjectData(SerializationInfo serializationInfo, StreamingContext streamingContext);
    }

These are the important properties and methods of System.Net.WebRequest; we will not go through them in detail here.
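As a taste of how a few of these members fit together, the sketch below configures an HTTP request and reads the response asynchronously. It is an illustration under assumptions, not a recommended setup: `RequestOptionsDemo`, the user-agent string and the header values are made up, and the response body is assumed to be UTF-8 text.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;
using System.Threading.Tasks;

// Hypothetical demo class; the header values are placeholders.
public static class RequestOptionsDemo
{
    // Creates and configures a request without sending it.
    public static HttpWebRequest Build(string url)
    {
        HttpWebRequest request = WebRequest.CreateHttp(url);
        request.Method = "GET";

        // Timeout is in milliseconds. It applies to the synchronous
        // GetResponse and GetRequestStream calls.
        request.Timeout = 15000;

        // User-Agent is set through its dedicated property on HttpWebRequest.
        request.UserAgent = "MyCrawler/0.1";

        // Other headers can go through the Headers collection.
        request.Headers["Accept-Language"] = "zh-CN,zh;q=0.9,en;q=0.8";
        return request;
    }

    // Sends the request and returns the body as text, without blocking the caller.
    public static async Task<string> FetchAsync(string url)
    {
        HttpWebRequest request = Build(url);

        using (WebResponse response = await request.GetResponseAsync())
        using (Stream stream = response.GetResponseStream())
        using (var reader = new StreamReader(stream, Encoding.UTF8))
        {
            return await reader.ReadToEndAsync();
        }
    }
}
```

From an async method, usage would look like `string html = await RequestOptionsDemo.FetchAsync("https://www.cnblogs.com/mikecheers/p/12090487.html");`.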




