Web Crawler Crash Course (2): Web Page Parsing (Template-Based)
Web page parsing techniques: 1. XPath tutorial 2. Regular expression tutorial
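XPath works on the HTML after it has been loaded into a DOM tree. Rules written this way are simple and easy to maintain.
I usually use regular expressions as a secondary extraction step: first locate the node with XPath, then use a regex to pull the exact value out of the located data.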
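As a rough illustration of the "locate with XPath, then extract with regex" approach, here is a minimal Python sketch. It is not the author's HtmlParser. It assumes the page is well-formed markup so that the standard library's xml.etree.ElementTree (which supports only a subset of XPath) can parse it; real-world HTML usually needs a tolerant parser. The sample page and the function name locate_then_extract are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for a downloaded page.
PAGE = """<html><body>
<div id="meta"><span id="pub_date">Published: 2013年05月20日 10:30</span></div>
</body></html>"""


def locate_then_extract(page, xpath, pattern):
    """Locate one node with XPath, then pull regex capture groups from its text."""
    root = ET.fromstring(page)
    node = root.find(xpath)          # step 1: XPath narrows the search to one node
    if node is None:
        return None
    text = "".join(node.itertext())  # all text inside the located node
    match = re.search(pattern, text) # step 2: regex extracts the exact value
    return match.groups() if match else None


groups = locate_then_extract(PAGE, ".//*[@id='pub_date']", r"(\d+)年(\d+)月(\d+)日")
print(groups)
```

Because the regex only ever sees the text of the located node, it can stay short and does not have to cope with the rest of the page.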
XPath libraries:
.NET: mainly HtmlAgilityPack
Java: mainly HtmlCleaner (its site may require a VPN to reach from mainland China) and jsoup
Below is HtmlParser, a wrapper I wrote myself.
Usage example:
HtmlParser<HotelInfo> parser = new HtmlParser<HotelInfo>();
ParseConfig config = new ParseConfig("save_xc.xml");
String html = File.ReadAllText("xiecheng.txt", Encoding.GetEncoding("GBK"));
HotelInfo entity = parser.GetEntity(html, config);
Template sample 1:
<?xml version="1.0" encoding="utf-8"?>
<template>
  <page>
    <save root=".">
      <field>
        <name>Title</name>
        <xpath>//div[@id='J_Article_Wrap']//h1</xpath>
      </field>
      <field>
        <name>PubTime</name>
        <xpath>//*[@id='pub_date']</xpath>
        <regex>
          <pattern>(\d+)年(\d+)月(\d+)日</pattern>
          <format>{0}-{1}-{2}</format>
        </regex>
      </field>
      <field>
        <name>Article</name>
        <xpath>//*[@id="artibody"]</xpath>
      </field>
    </save>
  </page>
</template>
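To make the template idea concrete, here is a minimal Python sketch of how a template of this shape could be applied to a page. It is not the author's HtmlParser, whose internals are not shown here. Assumptions: the page is well-formed so xml.etree.ElementTree can parse it, {0} in format refers to the first capture group, and the helper names (relative, extract_field, parse_single) and the sample page are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Shortened copy of a single-entity template (two fields only).
TEMPLATE = r"""<template><page><save root=".">
  <field><name>Title</name><xpath>//div[@id='J_Article_Wrap']//h1</xpath></field>
  <field><name>PubTime</name><xpath>//*[@id='pub_date']</xpath>
    <regex><pattern>(\d+)年(\d+)月(\d+)日</pattern><format>{0}-{1}-{2}</format></regex>
  </field>
</save></page></template>"""

# Hypothetical, well-formed page.
PAGE = """<html><body><div id="J_Article_Wrap"><h1>Sample headline</h1>
<span id="pub_date">2013年05月20日 10:30</span></div></body></html>"""


def relative(xpath):
    # ElementTree rejects absolute paths on an element, so "//x" becomes ".//x".
    return "." + xpath if xpath.startswith("/") else xpath


def extract_field(scope, field):
    """Apply one <field> rule: XPath first, then the optional regex and format."""
    node = scope.find(relative(field.findtext("xpath")))
    if node is None:
        return None
    value = "".join(node.itertext()).strip()
    regex = field.find("regex")
    if regex is not None:
        match = re.search(regex.findtext("pattern"), value)
        if match is None:
            return None
        # Assumption: {0} is the first capture group, {1} the second, and so on.
        value = regex.findtext("format").format(*match.groups())
    return value


def parse_single(page, template):
    """Build one entity (a dict here) from the <save> block of the template."""
    save = ET.fromstring(template).find("./page/save")
    scope = ET.fromstring(page).find(relative(save.get("root", ".")))
    return {f.findtext("name"): extract_field(scope, f) for f in save.findall("field")}


entity = parse_single(PAGE, TEMPLATE)
print(entity)
```

Keeping the rules in a template means a site redesign usually only requires editing the XML, not the parsing code.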
Template sample 2:
<?xml version="1.0" encoding="utf-8"?>
<template>
  <page>
    <save_m root="//tr[@id]">
      <field>
        <name>Price</name>
        <xpath>./td[@class='price']</xpath>
      </field>
    </save_m>
  </page>
</template>
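The save_m element appears to describe a repeated record: the root XPath selects many nodes, and each field XPath is evaluated relative to one of them. That reading is an assumption based on the template itself. A minimal Python sketch of that behavior, using xml.etree.ElementTree on a hypothetical well-formed page (the function name parse_many is also hypothetical):

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed price table; the header row has no id attribute.
PAGE = """<html><body><table>
<tr><th>Room</th><th>Price</th></tr>
<tr id="r1"><td class="name">Standard</td><td class="price">300</td></tr>
<tr id="r2"><td class="name">Deluxe</td><td class="price">450</td></tr>
</table></body></html>"""


def parse_many(page, root_xpath, fields):
    """Return one entity per node matched by root_xpath."""
    doc = ET.fromstring(page)
    # ElementTree needs a relative path, so "//tr[@id]" becomes ".//tr[@id]".
    if root_xpath.startswith("/"):
        root_xpath = "." + root_xpath
    entities = []
    for row in doc.findall(root_xpath):
        entity = {}
        for name, xpath in fields.items():
            node = row.find(xpath)  # evaluated relative to the current row
            entity[name] = "".join(node.itertext()).strip() if node is not None else None
        entities.append(entity)
    return entities


prices = parse_many(PAGE, "//tr[@id]", {"Price": "./td[@class='price']"})
print(prices)
```

The [@id] predicate is what skips the header row, since only data rows carry an id in this sample.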
Template sample 3:
<?xml version="1.0" encoding="utf-8"?>
<template>
  <page>
    <save root=".">
      <field>
        <name>Name</name>
        <xpath>//h1</xpath>
      </field>
      <field>
        <name>EngName</name>
        <xpath>//div[@class='name']/h2</xpath>
      </field>
      <field>
        <name>Star</name>
        <xpath>//div[@class='grade']/span/@title</xpath>
      </field>
      <field>
        <name>Address</name>
        <xpath>//div[@class='adress']</xpath>
      </field>
      <field>
        <name>Description</name>
        <xpath>//*[@id="htlDes"]</xpath>
      </field>
      <field>
        <name>Facility</name>
        <xpath position='outerhtml'>//div[@class="htl_info_table "]</xpath>
      </field>
      <field>
        <name>Policy</name>
        <xpath position='outerhtml'>//div[@class='detail_main']/div[@class="htl_info_table"]</xpath>
      </field>
      <field>
        <name>Traffic</name>
        <xpath position='outerhtml'>//div[@class='transSub'][1]/div[@class="htl_info_table"]</xpath>
      </field>
      <field>
        <name>Nearby</name>
        <xpath position='outerhtml'>//div[@class='transSub'][2]/div[@class="htl_info_table"]</xpath>
      </field>
    </save>
  </page>
</template>
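This template uses two features the earlier ones do not: an XPath ending in /@title, which selects an attribute value, and position='outerhtml'. The meaning of position is not documented here; the sketch below assumes it means "return the node's full markup instead of its text". The code is a hypothetical Python illustration, not the author's HtmlParser. Since xml.etree.ElementTree cannot evaluate a trailing attribute step, the sketch splits it off by hand; the sample page and the function name select are made up.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed hotel page fragment.
PAGE = """<html><body>
<div class="grade"><span title="5 stars">*****</span></div>
<div class="htl_info_table"><table><tr><td>Wi-Fi</td><td>Parking</td></tr></table></div>
</body></html>"""

# Matches a path that ends in an attribute step such as ".../span/@title".
ATTR_STEP = re.compile(r"^(.*)/@([\w:-]+)$")


def select(doc, xpath, position="text"):
    """Return an attribute value, the outer markup, or the text of the first match."""
    if xpath.startswith("/"):
        xpath = "." + xpath  # ElementTree only accepts relative paths
    attr = None
    step = ATTR_STEP.match(xpath)
    if step:
        xpath, attr = step.group(1), step.group(2)
    node = doc.find(xpath)
    if node is None:
        return None
    if attr is not None:
        return node.get(attr)
    if position == "outerhtml":
        # Assumed meaning: serialize the node itself together with its children.
        return ET.tostring(node, encoding="unicode", method="html").strip()
    return "".join(node.itertext()).strip()


doc = ET.fromstring(PAGE)
star = select(doc, "//div[@class='grade']/span/@title")
facility = select(doc, "//div[@class='htl_info_table']", position="outerhtml")
print(star)
print(facility)
```

Returning the outer markup is useful for blocks such as facility tables, where the structure matters and flattening to plain text would run the cells together.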