Writing a Simple Web Crawler in C#

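I have recently been looking into how to write a crawler in C#. After a fair amount of effort, I pieced together a simple demo from the many approaches floating around online (I'm a beginner, so experts please go easy on me). I'm posting it partly as a note to myself in case I need it later, and partly as a reference for anyone else who needs it.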

Without further ado, here is the code:

using HtmlAgilityPack;
using System;
using System.IO;
using System.Net;
using System.Text;

namespace Crawler
{
    class Program
    {
        static void Main(string[] args)
        {
            // Optional: I used a proxy here.
            //WebProxy proxyObject = new WebProxy(IP, PORT);

            // Send a request to the target address.
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://news.baidu.com/");
            //request.Proxy = proxyObject;
            request.Timeout = 10000; // in milliseconds

            HtmlDocument doc = new HtmlDocument();
            // The using blocks close the response and the reader even if loading throws.
            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            using (StreamReader sr = new StreamReader(response.GetResponseStream(), Encoding.UTF8))
            {
                doc.Load(sr);
            }

            // SelectSingleNode and SelectNodes return null when nothing matches,
            // so check each result before using it.
            HtmlNode newsPane = doc.DocumentNode.SelectSingleNode("//div[@id='pane-news']");
            HtmlNodeCollection ulNodes = newsPane == null ? null : newsPane.SelectNodes("ul");
            if (ulNodes != null)
            {
                foreach (HtmlNode ul in ulNodes)
                {
                    HtmlNodeCollection liNodes = ul.SelectNodes("li");
                    if (liNodes == null) continue;

                    foreach (HtmlNode li in liNodes)
                    {
                        HtmlNode link = li.SelectSingleNode("a");
                        if (link == null) continue; // skip list items without a link

                        // InnerText drops any markup nested inside the <a> element.
                        string title = link.InnerText.Trim();
                        string href = link.GetAttributeValue("href", "").Trim();
                        Console.WriteLine("News title: " + title + ", link: " + href);
                    }
                }
            }
            Console.ReadLine(); // keep the console window open
        }
    }
}

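The HTML parsing uses XPath syntax, which is fairly simple; you can look up the details online. In short, //div[@id='pane-news'] selects the div whose id attribute is pane-news anywhere in the document, and the relative paths ul, li, and a each select direct children of the current node.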
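To try the XPath expressions without hitting the network, here is a minimal sketch that runs them against an inline HTML string. The markup is made up and only imitates the structure the demo expects; the names XPathDemo, SampleHtml, and ExtractLinks are hypothetical and not part of HtmlAgilityPack. It also assumes the nested ul/li/a lookups can be collapsed into a single path, which holds for this sample but may not suit every page.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class XPathDemo
{
    // Made-up markup that only imitates the structure the crawler looks for.
    public const string SampleHtml =
        "<html><body>" +
        "<div id='pane-news'>" +
        "<ul><li><a href='http://example.com/a'> First story </a></li>" +
        "<li>no link here</li></ul>" +
        "<ul><li><a href='http://example.com/b'>Second story</a></li></ul>" +
        "</div>" +
        "<div id='other'><ul><li><a href='http://example.com/c'>Ignored</a></li></ul></div>" +
        "</body></html>";

    // Returns (title, href) pairs for every link under div#pane-news/ul/li.
    public static List<KeyValuePair<string, string>> ExtractLinks(string html)
    {
        var results = new List<KeyValuePair<string, string>>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // //div[@id='pane-news']  matches the div with that id at any depth;
        // /ul/li/a                then walks direct children one step at a time.
        HtmlNodeCollection links =
            doc.DocumentNode.SelectNodes("//div[@id='pane-news']/ul/li/a");
        if (links == null) return results; // SelectNodes returns null on no match

        foreach (HtmlNode link in links)
        {
            string title = link.InnerText.Trim();
            string href = link.GetAttributeValue("href", "").Trim();
            results.Add(new KeyValuePair<string, string>(title, href));
        }
        return results;
    }

    static void Main()
    {
        foreach (KeyValuePair<string, string> pair in ExtractLinks(SampleHtml))
        {
            Console.WriteLine(pair.Key + " -> " + pair.Value);
        }
    }
}
```

Because the path is anchored at the pane-news div, the link inside the second div is never visited, and the list item without an a element simply contributes no match, so no null check is needed per item.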

posted @ 2018-07-16 10:37  一叶、知秋