Scraping Mzitu Images with a .Net Core Crawler

There are plenty of Python crawler tutorials online, but ones written in C# are rare. The newly released .Net Core happens to deploy to Linux very easily, so let's write a small crawler using the Mzitu site as the example.

C# has a very handy library for parsing HTML pages: HtmlAgilityPack.

First, create a new .Net Core console project named MzituCrawler and add a reference to HtmlAgilityPack from the NuGet Package Manager Console: Install-Package HtmlAgilityPack -Version 1.9.2
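If you prefer the dotnet command line to the Package Manager Console, the equivalent setup looks like this (same package and version as above):

dotnet new console -n MzituCrawler
cd MzituCrawler
dotnet add package HtmlAgilityPack --version 1.9.2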

Open the Mzitu home page and click the pagination buttons at the bottom; every page's address follows a fixed format: https://www.mzitu.com/page/<page number>/

First we get the total number of pages:

var baseUrl = "https://www.mzitu.com";
HtmlWeb web = new HtmlWeb();
var indexDoc = web.Load(baseUrl);
// The last link with the "page-numbers" class in the pagination bar holds the total page count.
var pageNode = indexDoc.DocumentNode.SelectNodes("/html/body/div[@class='main']/div[@class='main-content']/div[@class='postlist']/nav/div/a").Last(a => a.GetAttributeValue("class", string.Empty) == "page-numbers");
var pageCount = int.Parse(pageNode.InnerText);
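A quick note on that XPath: an absolute /html/body/... path is brittle, because any change to the page layout breaks it. A relative query anchored on the page-numbers class (for example //a[@class='page-numbers']) would survive layout changes better, at the cost of matching a little more loosely.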

Inspecting the link elements on the page shows that each album link matches the XPath //*[@id='pins']/li/a

Then we use HtmlAgilityPack to fetch the content of every page:

for (int pageIndex = 1; pageIndex <= pageCount; pageIndex++)
{
    var url = new Uri(new Uri(baseUrl), $"/page/{pageIndex}/").ToString();
    var doc = web.Load(url);
    // SelectNodes returns null, not an empty collection, when nothing matches.
    var nodes = doc.DocumentNode.SelectNodes("//*[@id='pins']/li/a");
    if (nodes != null && nodes.Count > 0)
    {
        foreach (var node in nodes)
        {
            // The <img> alt text carries the album title; the <a> href points to the album page.
            var title = node.SelectSingleNode("img").GetAttributeValue("alt", string.Empty);
            var href = node.GetAttributeValue("href", string.Empty);
            href = new Uri(new Uri(baseUrl), href).ToString();
            DownloadImages(downloadFolder: Path.Combine(baseFolder, title), url: href);
        }
    }
    else
    {
        // No album links on this page: we have run past the last page, so stop.
        return;
    }
}

The DownloadImages method downloads the images found at each album link:

private static void DownloadImages(string downloadFolder, string url)
{
    if (!Directory.Exists(downloadFolder))
    {
        Directory.CreateDirectory(downloadFolder);
    }
    HtmlWeb web = new HtmlWeb();
    var indexDoc = web.Load(url);
    // The second-to-last pagination link holds the album's page count. FirstOrDefault
    // (rather than First) keeps the null check below meaningful when no links exist.
    var pageNode = indexDoc.DocumentNode.SelectNodes("/html/body/div[@class='main']/div[@class='content']/div[@class='pagenavi']/a")?.Reverse().Skip(1).FirstOrDefault();
    var pageCount = pageNode == null ? 1 : int.Parse(pageNode.InnerText);
    for (int pageIndex = 1; pageIndex <= pageCount; pageIndex++)
    {
        var doc = web.Load($"{url}/{pageIndex}");
        var imageNode = doc.DocumentNode.SelectSingleNode("/html/body/div[2]/div[1]/div[3]/p/a/img");
        if (imageNode != null)
        {
            var imageUrl = imageNode.GetAttributeValue("src", string.Empty);
            imageUrl = new Uri(new Uri(url), imageUrl).ToString();
            // Skip images that were already downloaded in a previous run.
            if (historyUrl.Contains(imageUrl))
            {
                continue;
            }
            using (var client = new HttpClient())
            {
                // These headers mimic a normal browser request; the Referer in particular
                // matters, because the image host rejects hotlinked requests without it.
                client.DefaultRequestHeaders.Host = "i.meizitu.net";
                client.DefaultRequestHeaders.Pragma.ParseAdd("no-cache");
                client.DefaultRequestHeaders.AcceptEncoding.ParseAdd("gzip, deflate");
                client.DefaultRequestHeaders.AcceptLanguage.ParseAdd("zh-CN,zh;q=0.8,en;q=0.6");
                client.DefaultRequestHeaders.CacheControl = new System.Net.Http.Headers.CacheControlHeaderValue { NoCache = true };
                client.DefaultRequestHeaders.Connection.ParseAdd("keep-alive");
                client.DefaultRequestHeaders.Referrer = new Uri(url);
                client.DefaultRequestHeaders.UserAgent.ParseAdd("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36");
                client.DefaultRequestHeaders.Accept.ParseAdd("image/webp,image/apng,image/*,*/*;q=0.8");
                var buffer = client.GetByteArrayAsync(imageUrl).Result;
                var fileName = new Uri(imageUrl).Segments.Last();
                File.WriteAllBytes(Path.Combine(downloadFolder, fileName), buffer);
            }
        }
    }
}
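One loose end: the snippets above use baseFolder and historyUrl without ever declaring them. The sketch below shows the fields the Program class would need, assuming historyUrl is simply an in-memory set of image URLs downloaded so far (these declarations are an assumption, not code from the original project):

// Assumed declarations (not shown above): baseFolder is the root folder that
// each album's subfolder goes under; historyUrl records image URLs that have
// already been downloaded so repeated runs can skip them.
// Requires System.Collections.Generic and System.IO.
private static readonly string baseFolder = Path.Combine(AppContext.BaseDirectory, "Images");
private static readonly HashSet<string> historyUrl = new HashSet<string>();

Note also that DownloadImages builds a new HttpClient for every image. Reusing one static HttpClient instance is the more idiomatic .Net pattern and avoids exhausting sockets on long crawls; the per-image instance here just keeps the example simple.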

