C# Crawlers, The Way of Breaking Through Realms: Realm Two, Crawler Applications. Section 7: Concurrency Control and Policies
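- What order of magnitude is our total task volume, roughly how much time and how many resources would a full-speed crawl consume, and will future growth stay under control?
- Are the environment resources the crawler system itself relies on sufficient to absorb the heavy resource consumption that follows?
- Does the target system enforce any anti-crawling policies or limits?
- Can the target system withstand concurrent requests at this scale? (This applies to single-node and distributed crawlers alike.)
- As results come back, can the downstream analysis, processing, and storage keep up with a sudden influx of data?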
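For the first question, a rough estimate is often enough to start with. The sketch below is only a back-of-the-envelope model, not something from the framework: it assumes every request takes the same average time and that requests run in full "waves" of `concurrency` at once. The class and parameter names are made up for illustration.

```csharp
using System;

public static class CrawlEstimate
{
    // Hypothetical helper: estimates wall-clock time for a crawl, assuming a fixed
    // average latency per request and `concurrency` requests always in flight.
    public static TimeSpan Duration(long totalRequests, TimeSpan avgLatency, int concurrency)
    {
        if (totalRequests < 0) throw new ArgumentOutOfRangeException(nameof(totalRequests));
        if (concurrency < 1) throw new ArgumentOutOfRangeException(nameof(concurrency));

        // Number of waves, rounded up so a partially filled last wave still counts.
        long waves = (totalRequests + concurrency - 1) / concurrency;
        return TimeSpan.FromTicks(avgLatency.Ticks * waves);
    }
}
```

With one million requests at 500 ms each and 100 concurrent requests, this model gives 5,000 seconds, a little under an hour and a half. Real crawls will deviate from it (retries, throttling, slow pages), so treat the number as a lower bound for planning.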
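- CPU: the crawler system should use no more than 30%, and total CPU usage should stay under 50%. (Although, madman that I am, I often get far too greedy T_T.) On a multi-core CPU, create no more than twice as many threads as there are CPU cores.
- Memory: the crawler system should use no more than 50%, and total memory usage should stay under 70%.
- Storage: for commercial or large-scale crawlers, separate the storage and use external storage such as a NAS, a distributed cache, or a data warehouse. The same advice applies to smaller crawlers, but if circumstances only allow the local disk, you need to think about the disk's IOPS. Even with a cache or database system as the intermediate store, you are still dealing with disk I/O underneath; most caches and databases do optimize their I/O, and there is little you can tune yourself, so this setup is at least somewhat less worrying. I cannot offer a sensible general-purpose figure here: disk performance varies wildly, so you have to judge by the actual environment.
- Bandwidth: there are two metrics, upstream and downstream, and the crawler system should use no more than 80% of either. Beyond the bandwidth your ISP allocates, consider the surrounding equipment that affects it, such as the modem, switches, routers, and even the throughput of the network cables. Embarrassingly, I often run experiments at home where the crawler and the target system are both fine, but the China Unicom fiber modem falls over... reboot, revive... falls over again... reboot, revive... and again...
- Available ports: this is an implicit constraint and one that is often overlooked. Taking Windows as an example, port numbers top out at UInt16.MaxValue (65535). Services that start with the system already occupy some ports, for example websites in IIS, databases, and QQ, and the system itself reserves others, such as 443 and 3389. Whether port reuse can ease the pain depends heavily on the specific implementation and on the NAT port-mapping rules, so it is unreliable or hard to control. The crawler therefore has a hard ceiling on the number of ports it can use. There is no recommended value here either; every situation is different.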
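To make the rules of thumb above a little more concrete, here is a small sketch that turns two of them into code: the "no more than twice the core count" thread cap and the 80% bandwidth ceiling. `CrawlerLimits` and its method names are hypothetical and not part of the framework; the thresholds are simply the suggestions from the list, not measured values.

```csharp
using System;

public static class CrawlerLimits
{
    // Hypothetical helper: caps worker threads at twice the logical core count,
    // following the CPU rule of thumb in the list above.
    public static int MaxWorkerThreads(int logicalCores)
    {
        if (logicalCores < 1) throw new ArgumentOutOfRangeException(nameof(logicalCores));
        return logicalCores * 2;
    }

    // Same cap, computed for the machine the crawler is running on.
    public static int MaxWorkerThreadsForThisMachine() =>
        MaxWorkerThreads(Environment.ProcessorCount);

    // Hypothetical helper: true while measured usage stays within 80% of the link.
    // Call it once for upstream and once for downstream.
    public static bool WithinBandwidthBudget(double usedMbps, double linkMbps) =>
        usedMbps <= linkMbps * 0.8;
}
```

A value like `MaxWorkerThreadsForThisMachine()` is a natural candidate for the `max` argument of a concurrency policy, though the right number still depends on how much of the work is waiting on the network rather than on the CPU.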

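namespace MikeWare.Core.Components.CrawlerFramework.Policies
{
    using System;

    // Base class for concurrency policies; concrete policies override both methods.
    public abstract class AConcurrentPolicy
    {
        // Tries to obtain a permit within the timeout; returns true on success.
        public virtual bool WaitOne(TimeSpan timeout) => throw new NotImplementedException();

        // Hands back a permit obtained earlier through WaitOne.
        public virtual void ReleaseOne() => throw new NotImplementedException();
    }
}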

namespace MikeWare.Core.Components.CrawlerFramework.Policies
{
    using System;
    using System.Threading;

    // Concurrency policy backed by a counting semaphore.
    public class SemaphoreConcurrentPolicy : AConcurrentPolicy
    {
        private readonly Semaphore semaphore;

        // init: permits available at start; max: upper bound on permits.
        public SemaphoreConcurrentPolicy(int init, int max)
        {
            semaphore = new Semaphore(init, max);
        }

        // Blocks for up to timeout waiting for a permit; false means it timed out.
        public override bool WaitOne(TimeSpan timeout)
        {
            return semaphore.WaitOne(timeout);
        }

        // Returns one permit to the semaphore.
        public override void ReleaseOne()
        {
            semaphore.Release(1);
        }
    }
}
SemaphoreConcurrentPolicy inherits from AConcurrentPolicy, defines a private field semaphore of type Semaphore, and overrides the base class's two virtual methods.
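The policy's two methods map directly onto `Semaphore.WaitOne` and `Semaphore.Release`, so the usual discipline applies: every successful wait needs a matching release, ideally in a `finally` block. The sketch below uses a raw `Semaphore` rather than the policy class so that it stands alone; `ThrottleDemo` and the simulated jobs are assumptions made for illustration, and this section does not show where the framework itself calls `ReleaseOne`.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottleDemo
{
    // Runs jobCount simulated jobs, never more than limit at once,
    // and returns the highest concurrency actually observed.
    public static int Run(int jobCount, int limit)
    {
        var sync = new object();
        int running = 0, peak = 0;

        using (var gate = new Semaphore(limit, limit))
        {
            var tasks = Enumerable.Range(0, jobCount).Select(_ => Task.Run(() =>
            {
                gate.WaitOne();                      // take a permit, blocking while none is free
                try
                {
                    lock (sync) { running++; if (running > peak) peak = running; }
                    Thread.Sleep(20);                // stand-in for one crawl request
                    lock (sync) { running--; }
                }
                finally
                {
                    gate.Release();                  // always hand the permit back
                }
            })).ToArray();

            Task.WaitAll(tasks);                     // finish before the semaphore is disposed
        }

        return peak;
    }
}
```

Calling `ThrottleDemo.Run(12, 3)` should report a peak no higher than 3, however the thread pool schedules the jobs.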

namespace System.Threading
{
    //
    // Summary:
    //     Limits the number of threads that can access a resource or pool of resources
    //     concurrently.
    public sealed class Semaphore : WaitHandle
    {
        //
        // Summary:
        //     Initializes a new instance of the System.Threading.Semaphore class, specifying
        //     the initial number of entries and the maximum number of concurrent entries.
        //
        // Parameters:
        //   initialCount:
        //     The initial number of requests for the semaphore that can be granted concurrently.
        //
        //   maximumCount:
        //     The maximum number of requests for the semaphore that can be granted concurrently.
        //
        // Exceptions:
        //   T:System.ArgumentException:
        //     initialCount is greater than maximumCount.
        //
        //   T:System.ArgumentOutOfRangeException:
        //     maximumCount is less than 1. -or- initialCount is less than 0.
        public Semaphore(int initialCount, int maximumCount);
        //
        // Summary:
        //     Initializes a new instance of the System.Threading.Semaphore class, specifying
        //     the initial number of entries and the maximum number of concurrent entries, and
        //     optionally specifying the name of a system semaphore object.
        //
        // Parameters:
        //   initialCount:
        //     The initial number of requests for the semaphore that can be granted concurrently.
        //
        //   maximumCount:
        //     The maximum number of requests for the semaphore that can be granted concurrently.
        //
        //   name:
        //     The name of a named system semaphore object.
        //
        // Exceptions:
        //   T:System.ArgumentException:
        //     initialCount is greater than maximumCount. -or- name is longer than 260 characters.
        //
        //   T:System.ArgumentOutOfRangeException:
        //     maximumCount is less than 1. -or- initialCount is less than 0.
        //
        //   T:System.IO.IOException:
        //     A Win32 error occurred.
        //
        //   T:System.UnauthorizedAccessException:
        //     The named semaphore exists and has access control security, and the user does
        //     not have System.Security.AccessControl.SemaphoreRights.FullControl.
        //
        //   T:System.Threading.WaitHandleCannotBeOpenedException:
        //     The named semaphore cannot be created, perhaps because a wait handle of a different
        //     type has the same name.
        public Semaphore(int initialCount, int maximumCount, string name);
        //
        // Summary:
        //     Initializes a new instance of the System.Threading.Semaphore class, specifying
        //     the initial number of entries and the maximum number of concurrent entries, optionally
        //     specifying the name of a system semaphore object, and specifying a variable that
        //     receives a value indicating whether a new system semaphore was created.
        //
        // Parameters:
        //   initialCount:
        //     The initial number of requests for the semaphore that can be satisfied concurrently.
        //
        //   maximumCount:
        //     The maximum number of requests for the semaphore that can be satisfied concurrently.
        //
        //   name:
        //     The name of a named system semaphore object.
        //
        //   createdNew:
        //     When this method returns, contains true if a local semaphore was created (that
        //     is, if name is null or an empty string) or if the specified named system semaphore
        //     was created; false if the specified named system semaphore already existed. This
        //     parameter is passed uninitialized.
        //
        // Exceptions:
        //   T:System.ArgumentException:
        //     initialCount is greater than maximumCount. -or- name is longer than 260 characters.
        //
        //   T:System.ArgumentOutOfRangeException:
        //     maximumCount is less than 1. -or- initialCount is less than 0.
        //
        //   T:System.IO.IOException:
        //     A Win32 error occurred.
        //
        //   T:System.UnauthorizedAccessException:
        //     The named semaphore exists and has access control security, and the user does
        //     not have System.Security.AccessControl.SemaphoreRights.FullControl.
        //
        //   T:System.Threading.WaitHandleCannotBeOpenedException:
        //     The named semaphore cannot be created, perhaps because a wait handle of a different
        //     type has the same name.
        public Semaphore(int initialCount, int maximumCount, string name, out bool createdNew);
        //
        // Summary:
        //     Opens the specified named semaphore, if it already exists.
        //
        // Parameters:
        //   name:
        //     The name of the system semaphore to open.
        //
        // Returns:
        //     An object that represents the named system semaphore.
        //
        // Exceptions:
        //   T:System.ArgumentException:
        //     name is an empty string. -or- name is longer than 260 characters.
        //
        //   T:System.ArgumentNullException:
        //     name is null.
        //
        //   T:System.Threading.WaitHandleCannotBeOpenedException:
        //     The named semaphore does not exist.
        //
        //   T:System.IO.IOException:
        //     A Win32 error occurred.
        //
        //   T:System.UnauthorizedAccessException:
        //     The named semaphore exists, but the user does not have the security access required
        //     to use it.
        public static Semaphore OpenExisting(string name);
        //
        // Summary:
        //     Opens the specified named semaphore, if it already exists, and returns a value
        //     that indicates whether the operation succeeded.
        //
        // Parameters:
        //   name:
        //     The name of the system semaphore to open.
        //
        //   result:
        //     When this method returns, contains a System.Threading.Semaphore object that represents
        //     the named semaphore if the call succeeded, or null if the call failed. This parameter
        //     is treated as uninitialized.
        //
        // Returns:
        //     true if the named semaphore was opened successfully; otherwise, false.
        //
        // Exceptions:
        //   T:System.ArgumentException:
        //     name is an empty string. -or- name is longer than 260 characters.
        //
        //   T:System.ArgumentNullException:
        //     name is null.
        //
        //   T:System.IO.IOException:
        //     A Win32 error occurred.
        //
        //   T:System.UnauthorizedAccessException:
        //     The named semaphore exists, but the user does not have the security access required
        //     to use it.
        public static bool TryOpenExisting(string name, out Semaphore result);
        //
        // Summary:
        //     Exits the semaphore and returns the previous count.
        //
        // Returns:
        //     The count on the semaphore before the System.Threading.Semaphore.Release* method
        //     was called.
        //
        // Exceptions:
        //   T:System.Threading.SemaphoreFullException:
        //     The semaphore count is already at the maximum value.
        //
        //   T:System.IO.IOException:
        //     A Win32 error occurred with a named semaphore.
        //
        //   T:System.UnauthorizedAccessException:
        //     The current semaphore represents a named system semaphore, but the user does
        //     not have System.Security.AccessControl.SemaphoreRights.Modify. -or- The current
        //     semaphore represents a named system semaphore, but it was not opened with System.Security.AccessControl.SemaphoreRights.Modify.
        public int Release();
        //
        // Summary:
        //     Exits the semaphore a specified number of times and returns the previous count.
        //
        // Parameters:
        //   releaseCount:
        //     The number of times to exit the semaphore.
        //
        // Returns:
        //     The count on the semaphore before the System.Threading.Semaphore.Release* method
        //     was called.
        //
        // Exceptions:
        //   T:System.ArgumentOutOfRangeException:
        //     releaseCount is less than 1.
        //
        //   T:System.Threading.SemaphoreFullException:
        //     The semaphore count is already at the maximum value.
        //
        //   T:System.IO.IOException:
        //     A Win32 error occurred with a named semaphore.
        //
        //   T:System.UnauthorizedAccessException:
        //     The current semaphore represents a named system semaphore, but the user does
        //     not have System.Security.AccessControl.SemaphoreRights.Modify rights. -or- The
        //     current semaphore represents a named system semaphore, but it was not opened
        //     with System.Security.AccessControl.SemaphoreRights.Modify rights.
        public int Release(int releaseCount);
    }
}
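Its summary tells us that this class exists specifically to limit concurrency. It has three constructors, and what we care about most are two of their parameters, int initialCount and int maximumCount, and what they mean:
initialCount: the initial number of requests the Semaphore can grant;
maximumCount: the maximum number of requests the Semaphore can grant.
Of course, the common case when constructing the box is initialCount == maximumCount; in special scenarios the two are set differently, depending on the business need. However, maximumCount must not be less than initialCount, and initialCount must not be less than 0. These are hard rules.
Seen this way, initialCount and maximumCount are easy to understand, aren't they?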
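A case where the two counts differ can be shown in a few lines. In this minimal sketch the box starts empty (`initialCount` 0) with room for three keys (`maximumCount` 3), so nothing can enter until someone puts keys in. `SemaphoreCounts` is a made-up name; only `Semaphore`, `WaitOne`, and `Release` are real API.

```csharp
using System;
using System.Threading;

public static class SemaphoreCounts
{
    // Starts with 0 of 3 permits, then opens the gate with Release(3).
    public static string Describe()
    {
        using (var box = new Semaphore(0, 3))
        {
            bool before = box.WaitOne(0);      // no permits yet, returns immediately with false
            int previous = box.Release(3);     // adds three permits, returns the old count (0)

            int granted = 0;
            while (box.WaitOne(0)) granted++;  // drains exactly three permits

            return $"{before},{previous},{granted}";   // "False,0,3"
        }
    }

    // The constructor rejects an initial count above the maximum.
    public static bool RejectsInitialAboveMax()
    {
        try { new Semaphore(4, 3).Dispose(); return false; }
        catch (ArgumentException) { return true; }
    }
}
```

Starting at zero like this is one way to build up a queue of waiting work first and let it all through at a chosen moment.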
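Semaphore also has a very important method, Release. To pick the earlier example back up: Release returns the key. When a task finishes, it hands its key back on the way out, and another task waiting at the door can then take a key and go in :)
Furthermore, Semaphore inherits from System.Threading.WaitHandle, so it comes with a family of Wait methods. When a new task comes for a key and finds the box empty, what then? It waits. But for how long: forever, or only until a timeout? That depends on the business logic.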
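The difference between waiting with a timeout and finding a key straight away is easy to see with a box that holds a single key. The sketch below is illustrative only; `KeyBoxDemo` is a hypothetical name. It also shows what happens if a key is returned to a box that is already full.

```csharp
using System;
using System.Threading;

public static class KeyBoxDemo
{
    // Takes the only key, then tries again with the box empty.
    public static bool[] TryTwice(TimeSpan timeout)
    {
        using (var box = new Semaphore(1, 1))
        {
            bool first = box.WaitOne(timeout);    // key available: returns true at once
            bool second = box.WaitOne(timeout);   // box empty: gives up after the timeout
            box.Release();                        // return the key taken by the first call
            return new[] { first, second };
        }
    }

    // Returning a key to a full box raises SemaphoreFullException.
    public static bool OverReleaseThrows()
    {
        using (var box = new Semaphore(1, 1))
        {
            try { box.Release(); return false; }
            catch (SemaphoreFullException) { return true; }
        }
    }
}
```

Calling `WaitOne()` with no argument would instead block until a key comes back, which is the "wait forever" choice; the timeout overloads let the caller do something else, or check a stop signal, between attempts.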

namespace MikeWare.Core.Components.CrawlerFramework
{
    using MikeWare.Core.Components.CrawlerFramework.Policies;
    using System;
    using System.Collections.Concurrent;
    using System.Threading;
    using System.Threading.Tasks;

    public class LeaderAnt : Ant
    {
        private ConcurrentQueue<JobContext> Queue;
        private ManualResetEvent mre = new ManualResetEvent(false);
        // Optional throttle applied before a job enters the queue; null means no limit.
        public AConcurrentPolicy EnqueuePolicy { get; set; }

        ……

        public void Enqueue(JobContext context)
        {
            if (null != EnqueuePolicy)
            {
                // Retry in short slices until the policy grants a permit
                // or mre is signaled; either outcome ends the loop.
                while (!EnqueuePolicy.WaitOne(TimeSpan.FromMilliseconds(3)) && !mre.WaitOne(1))
                    continue;
            }

            Queue.Enqueue(context);
        }

        ……
    }
}

namespace MikeWare.Crawlers.EBooks.Bizs
{
    using MikeWare.Core.Components.CrawlerFramework;
    using MikeWare.Core.Components.CrawlerFramework.Policies;
    using MikeWare.Crawlers.EBooks.Entities;
    using System;
    using System.Collections.Generic;
    using System.Net;

    public class EBooksCrawler
    {
        public static void Start(int pageIndex, DateTime lastUpdateTime)
        {
            var leader = new LeaderAnt()
            {
                // At most 100 permits, all available from the start.
                EnqueuePolicy = new SemaphoreConcurrentPolicy(100, 100)
                //EnqueuePolicy = new PeriodEnqueuePolicy(TimeSpan.FromMilliseconds(150))
            };

            var newContext = new JobContext
            {
                // Job name reads "Qishu site - latest e-books - list - page NNNNN".
                JobName = $"奇书网-最新电子书-列表-第{pageIndex:00000}页",
                Uri = $"http://www.xqishuta.com/s/new/index_{pageIndex}.html",
                Method = WebRequestMethods.Http.Get,
                InParams = new Dictionary<string, object>(),
                Analizer = new BooksListAnalizer(),
            };
            newContext.InParams.Add(Consts.PAGE_INDEX, 1);
            newContext.InParams.Add(Consts.LAST_UPDATE_TIME, DateTime.MinValue);

            leader.Enqueue(newContext);

            leader.Work();
        }
    }
}
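The commented-out line hints at a time-based alternative, PeriodEnqueuePolicy, whose implementation is not shown in this section. What follows is only a guess at how such a policy could work, under the assumption that it admits at most one job per period. The class name `IntervalPolicySketch` and all of its logic are mine, not the framework's, and it omits the AConcurrentPolicy base class so that it compiles on its own.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical sketch: lets one caller through per period, with the same
// WaitOne/ReleaseOne shape as the policies above.
public class IntervalPolicySketch
{
    private readonly object gate = new object();
    private readonly TimeSpan period;
    private readonly Stopwatch clock = Stopwatch.StartNew();
    private TimeSpan nextAllowed = TimeSpan.Zero;   // earliest time the next caller may pass

    public IntervalPolicySketch(TimeSpan period)
    {
        this.period = period;
    }

    // Returns true if a slot opens within timeout (sleeping until it does);
    // returns false at once when the next slot is further away than timeout.
    public bool WaitOne(TimeSpan timeout)
    {
        TimeSpan wait;
        lock (gate)
        {
            var now = clock.Elapsed;
            wait = nextAllowed > now ? nextAllowed - now : TimeSpan.Zero;
            if (wait > timeout) return false;       // too far away; the caller may retry
            nextAllowed = now + wait + period;      // reserve this slot, push the next one back
        }

        if (wait > TimeSpan.Zero) Thread.Sleep(wait);
        return true;
    }

    // A time-based policy has nothing to hand back.
    public void ReleaseOne() { }
}
```

Plugged into a retry loop like the one in Enqueue, a policy of this kind would refuse short waits until the next slot is within reach, which spaces requests out over time instead of capping how many are in flight. The two approaches answer different limits: the semaphore protects your own resources, while an interval is often the gentler choice toward the target site.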
