paul_cheung

C# & Fizzler: crawling web pages within a given website domain

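This post uses Fizzler (an HtmlAgilityPack extension) and C# to extract data from web pages. Fizzler extends HtmlAgilityPack with support for jQuery-style CSS selectors.

Data extraction usually works like this: assemble URLs that follow a known pattern, send a request to each one, and parse the response: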

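1. Suppose every xxx.sample.com/contactus.html page under a website contains an email field (the data to extract).

  a) When the site has subdomains, e.g. a.sample.com, aadr.sample.com, 135dj.sample.com, their names are fairly random and cannot be guessed;

   Solution: search for site:b2b.sample.com on the Bing search engine. All subdomains can be extracted from the result pages and assembled into xxx.sample.com/contactus.html; then send a request to each such URL and parse the response;

   NOTE: the search URL for site:b2b.sample.com is assembled as follows,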

       http://www.bing.com/search?q=site%3A{b2b.sample.com}&go=Submit&qs=n&form=QBRE&pq=site%3A{b2b.sample.com}&sc=1-19&sp=-1&sk=&cvid=6165a189f5354b1982fb8cd6933abb6f&first={pageIndex}&FORM=PERE
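The URL assembly can be sketched as a small helper. This is only an illustration: `BingSiteSearch`, `BuildUrl` and `FirstResultOffset` are hypothetical names, the 14 results per page comes from the `pageSize` default used later in this post, and dropping the remaining query parameters (`cvid`, `sc`, `FORM`, etc.) is an assumption that has not been verified against Bing.

```csharp
using System;

public static class BingSiteSearch
{
    // Results per page; an assumption taken from the pageSize default used in this post.
    public const int PageSize = 14;

    // Builds a search URL for "site:<domain>" starting at the given 1-based result offset.
    // Assumption: only q and first are sent; the other parameters of the original URL are omitted.
    public static string BuildUrl(string domain, int first)
    {
        var query = Uri.EscapeDataString("site:" + domain); // the colon is escaped as %3A
        return "http://www.bing.com/search?q=" + query + "&first=" + first;
    }

    // Maps a 1-based page index to the offset of the first result on that page.
    public static int FirstResultOffset(int pageIndex, int pageSize = PageSize)
    {
        return (pageIndex - 1) * pageSize + 1;
    }

    public static void Main()
    {
        // Print the URLs of the first three result pages.
        for (var page = 1; page <= 3; page++)
        {
            Console.WriteLine(BuildUrl("b2b.sample.com", FirstResultOffset(page)));
        }
    }
}
```

Keeping the offset calculation in its own method makes it easy to resume a crawl from any page index.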

2. Pages such as www.sample.com/1456.html can be assembled directly (1456.html, 1457.html, 1458.html, etc.), so they are not covered in detail here;
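For the sequential case, a minimal sketch might look like the following. `SequentialPages` and `BuildUrls` are hypothetical names, and the `http://{host}/{id}.html` layout is assumed from the example URLs above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SequentialPages
{
    // Builds URLs for `count` consecutive numeric page ids, starting at startId.
    public static List<string> BuildUrls(string host, int startId, int count)
    {
        return Enumerable.Range(startId, count)
            .Select(id => string.Format("http://{0}/{1}.html", host, id))
            .ToList();
    }

    public static void Main()
    {
        foreach (var url in BuildUrls("www.sample.com", 1456, 3))
        {
            Console.WriteLine(url);
        }
    }
}
```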

How to use Fizzler:

1. Install Fizzler from NuGet

2. For usage, see code.google.com

3. Use Bing to extract all subdomains under a website:

        using System.Collections.Generic;
        using System.Linq;
        using System.Net;
        using System.Text;
        using Fizzler.Systems.HtmlAgilityPack;
        using HtmlAgilityPack;

        private static List<string> GetSubdomains(string websiteDomain, int startPageIndex = 1, int pageCount = 999, int pageSize = 14)
        {
            var list = new List<string>();
            // Use Bing to search for subdomains of a website; {0} is the domain, {1} the offset of the first result.
            var bingSearchUrlFormat = "http://www.bing.com/search?q=site%3a{0}&go=Submit&qs=n&pq=site%3a{0}&sc=1-100&sp=-1&sk=&cvid=a9b36439006f4b05b09f9202c5b784bd&first={1}&FORM=PQRE";
            var doc = new HtmlDocument();

            // Offset of the first result on the start page (page indexes are 1-based).
            var first = (startPageIndex - 1) * pageSize + 1;
            var stopIndex = first + pageCount * pageSize;
            using (var client = new WebClient())
            {
                client.Encoding = Encoding.UTF8;
                for (var startItemSequenceNumber = first; startItemSequenceNumber < stopIndex; startItemSequenceNumber += pageSize)
                {
                    var response = client.DownloadString(string.Format(bingSearchUrlFormat, websiteDomain, startItemSequenceNumber));
                    HtmlDocumentExtensions.LoadHtml2(doc, response);
                    // Collect the text of every <cite> element under .sb_meta on the result page.
                    var subDomains = doc.DocumentNode.QuerySelectorAll(".sb_meta cite");
                    foreach (var subDomain in subDomains)
                    {
                        list.Add(subDomain.InnerText);
                    }
                }
            }
            return list;
        }
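The text collected from the `cite` elements may repeat across result pages, and it may carry a scheme or a path rather than a bare host name. A hedged sketch of a clean-up step follows; `CiteHosts`, `ToHost` and `DistinctHosts` are hypothetical helpers that are not part of Fizzler or HtmlAgilityPack, and the input shapes they handle are assumptions about what Bing displays.

```csharp
using System;
using System.Collections.Generic;

public static class CiteHosts
{
    // Reduces text such as "http://a.sample.com/contactus.html" to "a.sample.com".
    // Returns null when no host can be recovered.
    public static string ToHost(string citeText)
    {
        if (string.IsNullOrWhiteSpace(citeText)) return null;
        var text = citeText.Trim();

        // Drop a leading scheme such as "http://" or "https://".
        var schemeEnd = text.IndexOf("://", StringComparison.Ordinal);
        if (schemeEnd >= 0) text = text.Substring(schemeEnd + 3);

        // Cut at the first path, query, fragment or space character.
        var end = text.IndexOfAny(new[] { '/', '?', '#', ' ' });
        if (end >= 0) text = text.Substring(0, end);

        return text.Length == 0 ? null : text.ToLowerInvariant();
    }

    // Keeps the first occurrence of each host, preserving the original order.
    public static List<string> DistinctHosts(IEnumerable<string> citeTexts)
    {
        var seen = new HashSet<string>();
        var hosts = new List<string>();
        foreach (var citeText in citeTexts)
        {
            var host = ToHost(citeText);
            if (host != null && seen.Add(host)) hosts.Add(host);
        }
        return hosts;
    }

    public static void Main()
    {
        var raw = new[] { "a.sample.com/contactus.html", "https://A.sample.com", "135dj.sample.com" };
        foreach (var host in DistinctHosts(raw)) Console.WriteLine(host);
    }
}
```

Passing the result of `GetSubdomains` through `DistinctHosts` before building the contactus.html URLs would avoid requesting the same page twice.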

4. Get the matching nodes of a web page:

        private static List<HtmlNode> GetWebPageNodes(string url, string elementSelector, string attributeNameContained, string attributeNameContainedValueLike)
        {
            string response;
            using (var client = new WebClient())
            {
                client.Encoding = Encoding.UTF8;
                response = client.DownloadString(url);
            }
            var doc = new HtmlDocument();
            HtmlDocumentExtensions.LoadHtml2(doc, response);
            var docNode = doc.DocumentNode;

            // Keep the nodes whose given attribute contains the expected text.
            // GetAttributeValue falls back to string.Empty, so nodes without the attribute are skipped instead of throwing.
            var nodes = (from node in docNode.QuerySelectorAll(elementSelector)
                         where node.HasAttributes && node.GetAttributeValue(attributeNameContained, string.Empty).Contains(attributeNameContainedValueLike)
                         select node).ToList();

            return nodes;
        }

5. Extracting the email address from a page:

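var subdomains = GetSubdomains("b2b.sample.com", startPageIndex: 1, pageCount: 10);
var urlFormat = "http://{0}/contactus.html";
foreach (var item in subdomains)
{
    var emailNode = GetWebPageNodes(string.Format(urlFormat, item), "body table a", "href", "mailto").FirstOrDefault(); // null when the page has no mailto link
}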
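The node returned above is still an `<a>` element, so the address itself has to be pulled out of its `href`. A minimal sketch, assuming the link has the usual `mailto:address?subject=...` shape; `MailtoParser` and `FromHref` are hypothetical names. The input would be something like `node.GetAttributeValue("href", string.Empty)`.

```csharp
using System;

public static class MailtoParser
{
    // Returns the address part of a "mailto:" href, or null if the href is not a mailto link.
    public static string FromHref(string href)
    {
        const string prefix = "mailto:";
        if (href == null) return null;

        var value = href.Trim();
        if (!value.StartsWith(prefix, StringComparison.OrdinalIgnoreCase)) return null;
        value = value.Substring(prefix.Length);

        // Drop optional parameters such as "?subject=Hello".
        var queryStart = value.IndexOf('?');
        if (queryStart >= 0) value = value.Substring(0, queryStart);

        // Decode percent-escapes such as %40 for '@'.
        value = Uri.UnescapeDataString(value).Trim();
        return value.Length == 0 ? null : value;
    }

    public static void Main()
    {
        Console.WriteLine(FromHref("mailto:info@sample.com?subject=Hello"));
    }
}
```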

 

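One remaining problem: searching for subdomains through Bing is rate-limited. After 100~150 requests, the response is no longer the result page I want but an HTML page that asks for a CAPTCHA to guard against attacks. This problem is still unsolved; any pointers from the experts would be much appreciated!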

posted on 2014-04-09 20:14 by paul_cheung