Parallel WebClient (Part 1)

2018-09-11 16:28 音乐让我说

I came across a question on Stack Overflow about running WebClient in parallel. The answerer's code is a useful reference, so I am recording it here for future use:

Asker:

I have a function that basically breaks down into two sub-functions.

    html = RetrieveHTML(index);
    returnCollection = RegexProcess(html, index);

What is the best way to speed this process up by parallelizing RetrieveHTML?

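I usually call it with up to 20,000 indexes. The first sub-function is network-bound (it uses WebClient.DownloadString to fetch the HTML of several URLs from a single server), while the second is mostly CPU-bound.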

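I am lost in the world of Parallel.ForEach and Tasks (continuations, FromAsync) and am having trouble working out a solution. I first tried Parallel.ForEach, but found that its performance, i.e. the network I/O, degrades over successive calls (the first loop is fast, the later ones get slower). The solution should release the HTML objects, because there are many of them and they are large. I am using .NET 4.0.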

Answerer:

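    using System;
    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.Net;
    using System.Text;
    using System.Text.RegularExpressions;
    using System.Threading.Tasks;

    class Program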
    {
        static void Main(string[] args)
        {
            ProcessInParallell();
        }

        private static Regex _regex = new Regex("net");

        private static void ProcessInParallell()
        {
            Uri[] resourceUri = new Uri[] { new Uri("http://www.microsoft.com"), new Uri("http://www.google.com"), new Uri("http://www.amazon.com") };
            //1. Stage 1: Download HTML
            //Use the blocking collection for concurrent tasks
            BlockingCollection<string> htmlDataList = new BlockingCollection<string>();
            Parallel.For(0, resourceUri.Length, index =>
            {
                var html = RetrieveHTML(resourceUri[index]);
                htmlDataList.Add(html);
            });

            //Parallel.For returns only after every download has finished, so signal
            //completion here. Signalling inside the loop at the last index is a race:
            //that iteration can finish first, and later adds would then throw.
            htmlDataList.CompleteAdding();

            //2. Get matches
            //This concurrent bags will be used to store the result of the matching stage
            ConcurrentBag<string> matchedHtml = new ConcurrentBag<string>();

            IList<Task> processingTasks = new List<Task>();

            //Enumerate through each downloaded HTML document
            foreach (var html in htmlDataList.GetConsumingEnumerable())
            {
                //Create a new task to match the downloaded HTML
                var task = Task.Factory.StartNew((data) =>
                {
                    var downloadedHtml = data as string;
                    if (downloadedHtml == null)
                        return;
                    if (_regex.IsMatch(downloadedHtml))
                    {
                        matchedHtml.Add(downloadedHtml);
                    }
                },html);
                //Add the task to the waiting list
                processingTasks.Add(task);
            }

            //wait for the all tasks to complete
            Task.WaitAll(processingTasks.ToArray());

            foreach (var html in matchedHtml)
            {
                //Do something with the matched result    

            }
        }

        private static string RetrieveHTML(Uri uri)
        {
            using (WebClient webClient = new WebClient())
            {
                //set this to null if there is no proxy
                webClient.Proxy = null;

                byte[] data = webClient.DownloadData(uri);

                return Encoding.UTF8.GetString(data);
            }
        }
    }

Follow-up question:

Thanks for your input. But won't this use too much memory if many or very long HTML files are downloaded? - husvar, Jan 23, 2013 at 21:58

Reply:

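I realize that it will use a lot of memory if the downloaded content is large. In that case you had better split the workload into chunks. Cheers - Toan Nguyen, Jan 23, 2013 at 22:45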
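
A note of my own on the memory concern, not part of the original thread: one possible way to keep memory in check is to give the BlockingCollection a bounded capacity and to consume pages while the downloads are still running, so that only a limited number of HTML strings wait in the queue at any time. The sketch below is a minimal illustration under that assumption. The names BoundedPipeline, Run, FakeDownload and capacity are hypothetical, and FakeDownload is a stand-in for RetrieveHTML so the example runs without a network.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

static class BoundedPipeline
{
    private static readonly Regex Pattern = new Regex("net");

    // Hypothetical stand-in for RetrieveHTML: even indexes produce text containing "net".
    private static string FakeDownload(int index)
    {
        return index % 2 == 0 ? "page " + index + " mentions .net" : "page " + index;
    }

    // Downloads and matches concurrently; at most `capacity` pages wait in the queue.
    public static List<int> Run(IEnumerable<int> indexes, int capacity)
    {
        var queue = new BlockingCollection<KeyValuePair<int, string>>(capacity);
        var matched = new ConcurrentBag<int>();

        // Producer: Add blocks while the queue is full, which throttles the downloads.
        var producer = Task.Factory.StartNew(() =>
        {
            try
            {
                Parallel.ForEach(indexes, index =>
                    queue.Add(new KeyValuePair<int, string>(index, FakeDownload(index))));
            }
            finally
            {
                // Signal completion only after the parallel loop has ended.
                queue.CompleteAdding();
            }
        });

        // Consumer: a page is no longer referenced once it has been matched,
        // so the garbage collector can reclaim it.
        foreach (var page in queue.GetConsumingEnumerable())
        {
            if (Pattern.IsMatch(page.Value))
                matched.Add(page.Key);
        }

        producer.Wait(); // surfaces any exception thrown by a download
        return matched.OrderBy(i => i).ToList();
    }

    static void Main()
    {
        var result = Run(Enumerable.Range(0, 10), 3);
        Console.WriteLine(string.Join(",", result)); // prints 0,2,4,6,8
    }
}
```

Only the indexes of matching pages are kept here rather than the pages themselves. If the regex stage turns out to be the bottleneck, the single consumer loop could be replaced by several consumer tasks reading from the same queue.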

Thanks for reading!


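音乐让我说: "Where mountains and rivers seem to block every road, past shady willows and bright flowers another village appears." Seize every opportunity that is yours, and never give up hope; the moment when the road seems to end may be exactly where things brighten.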
