Treating HTML like XML using HtmlAgilityPack, and doing it inside of an XSLT too [Reposted]
I was not able to post this on Simon Mourier's blog due to the HTML and XSLT tags, so here it is on mine:
Maybe someone has done this already, but I don't see it in the comments.
I created an XSLT extension object based on HtmlAgilityPack. The class is tiny:
using System.IO;
using System.Xml;

namespace HtmlAgilityPack
{
    public class XslExtension
    {
        // Called from XSLT as hap:loadhtmlasxml(url); the lowercase name matches the style sheet
        public XmlDocument loadhtmlasxml(string url)
        {
            // Create an instance of the HtmlWeb object
            HtmlWeb web = new HtmlWeb();

            // Declare the stream and writer that receive the converted markup
            MemoryStream m = new MemoryStream();
            XmlTextWriter xtw = new XmlTextWriter(m, null);

            // Fetch the page and write it to the writer as XML
            web.LoadHtmlAsXml(url, xtw);

            // Flush any buffered output, then rewind the memory stream
            xtw.Flush();
            m.Position = 0;

            // Create, fill, and return the XML document
            XmlDocument xdoc = new XmlDocument();
            xdoc.Load(m);
            return xdoc;
        }
    }
}
Then I used NXSLT from http://www.xmllab.net to load the custom extension object from the command line, so that the following XSL style sheet can call it directly:
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:hap="http://smourier.blogspot.com"
    xmlns:msxsl="urn:schemas-microsoft-com:xslt"
    version="1.0">
  <xsl:output method="html" omit-xml-declaration="yes" indent="no"/>
  <xsl:template match="/">
    <h1>BEGIN TEST OF HtmlAgilityPack.XslExtension</h1>
    <h2>First, connect to http://www.cnn.com and load its node set into a local variable</h2>
    <xsl:variable name="cnn"><xsl:copy-of select="hap:loadhtmlasxml('http://www.cnn.com')" /></xsl:variable>
    <h3>CNN.com has this many nodes:</h3>
    <xsl:value-of select="count(msxsl:node-set($cnn)//*)" />
    <h2>Now, process all the A tags within the "Special Coverage" stories inside the div with class="cnnLSSpecialCovBoxContent" that have an HREF starting with /2005/.</h2>
    <h3>Special Coverage</h3>
    <xsl:for-each select="msxsl:node-set($cnn)//div[@class='cnnLSSpecialCovBoxContent']//a[starts-with(@href, '/2005/')]">
      <div>
        <h3><xsl:copy-of select="." /></h3>
        <!-- Now get the images from each story if they exist -->
        <h5>Connecting to: <xsl:value-of select="concat('http://www.cnn.com', @href)" /> to retrieve image if it exists</h5>
        <xsl:copy-of select="hap:loadhtmlasxml(concat('http://www.cnn.com', @href))//img[@height = '168']" />
        <br /><br />
      </div>
    </xsl:for-each>
    <h1>END TEST OF HtmlAgilityPack.XslExtension</h1>
  </xsl:template>
</xsl:stylesheet>
The command for NXSLT to perform this is:
nxslt2.exe source.xml source.xsl -ext hap:HtmlAgilityPack.XslExtension xmlns:hap="http://smourier.blogspot.com" -af .\HtmlAgilityPackXslExtension.dll
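For anyone who would rather run the transform from code than through NXSLT, here is a minimal sketch of what the equivalent might look like using the standard System.Xml.Xsl classes. It assumes the XslExtension class above is compiled into a referenced assembly and reuses the file names source.xml and source.xsl from the command line. The RunTransform class name is made up for illustration. I have only described the extension working under NXSLT, so whether an XmlDocument return value is accepted by a given XSLT processor is an assumption to verify.

```csharp
using System;
using System.Xml.Xsl;
using HtmlAgilityPack; // assumption: XslExtension (shown above) lives in a referenced assembly

public static class RunTransform // hypothetical class name, for illustration only
{
    public static void Main()
    {
        // Compile the style sheet (file name taken from the NXSLT command line)
        XslCompiledTransform xslt = new XslCompiledTransform();
        xslt.Load("source.xsl");

        // Bind the extension object to the namespace URI that the
        // style sheet declares for the "hap" prefix
        XsltArgumentList xsltArgs = new XsltArgumentList();
        xsltArgs.AddExtensionObject("http://smourier.blogspot.com", new XslExtension());

        // Run the transform and write the HTML result to standard output
        xslt.Transform("source.xml", xsltArgs, Console.Out);
    }
}
```

The namespace URI passed to AddExtensionObject has to match the xmlns:hap declaration in the style sheet exactly, which is the same pairing the -ext and xmlns:hap arguments set up on the command line.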
The style sheet connects to CNN.com using the syntax:
select="hap:loadhtmlasxml('http://www.cnn.com')"
Then, further down, after it selects the matching A elements, it connects to each of the linked stories and copies any images with height 168 into the HTML result tree.
This approach could follow links to any depth. I haven't worked out the automatic form processor yet, but I think that could be an XSLT extension too.
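To make the idea of following links to some depth concrete, here is a rough sketch in plain C# rather than XSLT. It is not something I have built into the extension. The LinkFollower class, the Follow method, and the depth and seen parameters are all hypothetical names, and it uses only HtmlWeb.Load, SelectNodes, and GetAttributeValue from HtmlAgilityPack. There is no error handling for pages that fail to load.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkFollower // hypothetical helper, for illustration only
{
    // Visits 'url' and the pages it links to, up to 'depth' link hops away.
    // 'seen' records visited addresses so that no page is fetched twice.
    public static void Follow(Uri url, int depth, HashSet<string> seen)
    {
        if (depth < 0 || !seen.Add(url.AbsoluteUri))
            return;

        HtmlDocument doc = new HtmlWeb().Load(url.AbsoluteUri);
        Console.WriteLine(url.AbsoluteUri);

        // SelectNodes returns null when nothing matches the XPath
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null)
            return;

        foreach (HtmlNode a in links)
        {
            string href = a.GetAttributeValue("href", "");
            Uri next;

            // Resolve relative links against the current page; skip non-HTTP schemes
            if (Uri.TryCreate(url, href, out next)
                && (next.Scheme == Uri.UriSchemeHttp || next.Scheme == Uri.UriSchemeHttps))
            {
                Follow(next, depth - 1, seen);
            }
        }
    }
}
```

A call such as LinkFollower.Follow(new Uri("http://www.cnn.com"), 1, new HashSet<string>()) would visit the front page and every page it links to directly. Keeping the depth small matters, since the number of requests grows quickly with each extra hop.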
Let me know what you think...
http://blogs.wdevs.com/ultravioletconsulting/archive/2005/09/10/10506.aspx