Using HtmlAgilityPack with ScrapySharp or HtmlAgilityPack.CssSelectors

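The Html Agility Pack source contains roughly 28 classes, so it is not a particularly complex library, yet it is far from weak: its support for parsing the DOM is strong enough to rival jQuery's DOM manipulation :) The core classes you use most often are few. For DOM parsing there are really only two, HtmlDocument and HtmlNode, plus the collection class HtmlNodeCollection.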
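To make those three classes concrete, here is a minimal sketch that uses only Html Agility Pack itself, with XPath instead of CSS selectors. The class name BasicHapDemo, the method PostLinks and the sample markup are made up for illustration; they are not part of the library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class BasicHapDemo
{
    // Hypothetical helper: returns the href of every <a> inside <div class="post">.
    public static List<string> PostLinks(string markup)
    {
        var doc = new HtmlDocument();      // HtmlDocument holds the parsed tree
        doc.LoadHtml(markup);
        HtmlNode root = doc.DocumentNode;  // HtmlNode is a single node in that tree

        // SelectNodes returns an HtmlNodeCollection, or null when nothing matches.
        HtmlNodeCollection links = root.SelectNodes("//div[@class='post']//a");
        if (links == null)
        {
            return new List<string>();
        }
        return links.Select(a => a.GetAttributeValue("href", "")).ToList();
    }

    public static void Main()
    {
        // Sample markup invented for this sketch.
        const string markup =
            "<div class='post'><a href='/first'>First</a><a href='/second'>Second</a></div>";
        foreach (var href in PostLinks(markup))
        {
            Console.WriteLine(href);
        }
    }
}
```

The null check matters: unlike most LINQ-style APIs, SelectNodes does not return an empty collection when the XPath matches nothing.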

1. ScrapySharp

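Working with Html Agility Pack directly is still quite cumbersome. The component introduced below is ScrapySharp, which wraps Html Agility Pack in two respects so that parsing HTML pages is no longer painful, and the happiness index shoots straight up to 90 points.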

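First, ScrapySharp provides a wrapper class that simulates a real browser (handling the Referer header, cookies, and so on). Second, it offers jQuery-like CSS selectors together with LINQ syntax, which makes it a pleasure to use. Its code is hosted at https://bitbucket.org/rflechner/scrapysharp, and it can also be added through NuGet.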

 

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using HtmlAgilityPack;
using ScrapySharp.Extensions;
using ScrapySharp.Network;

namespace HTMLAgilityDemo
{
    class Program
    {
        static void Main(string[] args)
        {
            // Download the page with ScrapySharp's browser wrapper.
            var uri = new Uri("http://www.cnblogs.com/shanyou/archive/2012/05/20/2509435.html");
            var browser1 = new ScrapingBrowser();
            var html1 = browser1.DownloadString(uri);

            // Parse the downloaded markup with Html Agility Pack.
            var htmlDocument = new HtmlDocument();
            htmlDocument.LoadHtml(html1);
            var html = htmlDocument.DocumentNode;

            // Select by tag name.
            var title = html.CssSelect("title");
            foreach (var htmlNode in title)
            {
                Console.WriteLine(htmlNode.InnerHtml);
            }

            // Select by CSS class.
            var divs = html.CssSelect("div.postBody");
            foreach (var htmlNode in divs)
            {
                Console.WriteLine(htmlNode.InnerHtml);
            }

            // Select by element id.
            divs = html.CssSelect("#cnblogs_post_body");
            foreach (var htmlNode in divs)
            {
                Console.WriteLine(htmlNode.InnerHtml);
            }
        }
    }
}
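The listing above only loops over the selected nodes. Since CssSelect returns an ordinary sequence of HtmlNode objects, it can also be chained with LINQ, which is the second half of what ScrapySharp promises. A small sketch on an in-memory string (so no network access is needed); CssSelectLinqDemo, ItemTexts and the markup are hypothetical names chosen for this example:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;
using ScrapySharp.Extensions;

public static class CssSelectLinqDemo
{
    // Hypothetical helper: trimmed, non-empty text of span.item children of div.posts.
    public static List<string> ItemTexts(string markup)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(markup);

        return doc.DocumentNode
            .CssSelect("div.posts > span.item")     // CSS selector from ScrapySharp
            .Select(node => node.InnerText.Trim())  // LINQ projection
            .Where(text => text.Length > 0)         // LINQ filter: skip empty items
            .ToList();
    }

    public static void Main()
    {
        // Sample markup invented for this sketch.
        const string markup =
            "<div class='posts'>" +
            "<span class='item'> First </span>" +
            "<span class='item'></span>" +
            "<span class='other'>ignored</span>" +
            "</div>";

        foreach (var text in ItemTexts(markup))
        {
            Console.WriteLine(text);
        }
    }
}
```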

Basic examples of CssSelect usage:

var divs = html.CssSelect("div"); // all div elements

var nodes = html.CssSelect("div.content"); // all div elements with the CSS class 'content'

var nodes = html.CssSelect("div.widget.monthlist"); // all div elements with both CSS classes

var nodes = html.CssSelect("#postPaging"); // all HTML elements with the id postPaging

var nodes = html.CssSelect("div#postPaging.testClass"); // all div elements with the id postPaging and the CSS class testClass

var nodes = html.CssSelect("div.content > p.para"); // p elements that are direct children of div elements with the CSS class 'content'

var nodes = html.CssSelect("input[type=text].login"); // text inputs with the CSS class login

We can also select ancestors of elements:

var nodes = html.CssSelect("p.para").CssSelectAncestors("div.content > div.widget");

2. Using HtmlAgilityPack.CssSelectors (this one has a bug: a class name containing an underscore _ throws an exception)

var postItems = htmlDocument.QuerySelectorAll(".post-item");
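If a page does use class names with underscores, one possible way around the bug is to skip the CSS selector and fall back to plain XPath, which Html Agility Pack supports on its own. This is only a sketch of that idea, not something the HtmlAgilityPack.CssSelectors package provides; ClassXPathDemo, SelectByClass and the post_item class are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class ClassXPathDemo
{
    // Hypothetical helper: finds elements whose class attribute contains className
    // as a whole token. Assumes className contains no quote characters.
    public static List<HtmlNode> SelectByClass(HtmlDocument doc, string className)
    {
        // Padding both sides with spaces makes "post_item" match "post_item first"
        // but not "post_item_extra".
        string xpath =
            "//*[contains(concat(' ', normalize-space(@class), ' '), ' " + className + " ')]";

        // SelectNodes returns null when nothing matches.
        HtmlNodeCollection found = doc.DocumentNode.SelectNodes(xpath);
        return found == null ? new List<HtmlNode>() : found.ToList();
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        // Sample markup invented for this sketch.
        doc.LoadHtml(
            "<div class='post_item first'>a</div>" +
            "<div class='post_item_extra'>b</div>");

        foreach (var node in SelectByClass(doc, "post_item"))
        {
            Console.WriteLine(node.InnerText);
        }
    }
}
```

The XPath is more verbose than ".post_item", but it never passes through the CSS selector parser, so the underscore is not a problem.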

References: http://www.cnblogs.com/shanyou/archive/2012/05/27/2520603.html

http://www.tools138.com/create/article/20141014/130844875.html

posted @ 2015-11-13 13:57