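Examples of defining information sources (XML files)
Here are some shared information sources; click the Download button to download them.
Contents
1. Subscribing to a blog, a simple example
2. Getting information from a web page, a simple example
3. Making full use of callback code
4. Multiple blocks in html_re
5. Using the html_json worker to parse JSON data
1. Subscribing to a blog, a simple example: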
<source>
<name>范志红博客</name>
<comment>Sohu blog. Original nutrition information.</comment>
<link>http://snowheart19.blog.sohu.com/</link>
<worker>rss_atom</worker>
<data>
<url>http://snowheart19.blog.sohu.com/rss</url>
</data>
</source>
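The layout of a source definition can be illustrated by reading one with Python's standard library. This is only a sketch: the function name read_source and the returned dict are made up for illustration, and the real tool may load these files differently.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the blog example above.
SOURCE_XML = """
<source>
<name>范志红博客</name>
<link>http://snowheart19.blog.sohu.com/</link>
<worker>rss_atom</worker>
<data>
<url>http://snowheart19.blog.sohu.com/rss</url>
</data>
</source>
"""

def read_source(xml_text):
    """Return the top-level fields of one <source> definition as a dict.

    Hypothetical helper, not part of the tool itself.
    """
    root = ET.fromstring(xml_text.strip())
    return {
        "name": root.findtext("name"),
        "link": root.findtext("link"),
        "worker": root.findtext("worker"),
        # The feed address lives one level down, inside <data>.
        "url": root.findtext("data/url"),
    }

source = read_source(SOURCE_XML)
```

Here the worker field (rss_atom) names the component that fetches the feed, and data/url is the address it reads.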
2. Getting information from a web page, a simple example:
<source>
<name>ybk168新邮预告</name>
<comment>ybk168 new stamp announcements</comment>
<link>http://www.ybk168.com/newslist/00040051.html</link>
<worker>html_re</worker>
<data>
<url>http://www.ybk168.com/newslist/00040051.html</url>
<block>
<blockre flags='DOTALL'>
<![CDATA[
<div class="list">(.*?)<div class="page">
]]>
</blockre>
<itemre flags='DOTALL'>
<![CDATA[
<li><span.*?href="([^"]+)".*?title="([^"]+)".*? class="list_lr">([^<]+)<
]]>
</itemre>
<maprules>
<title>2</title>
<url>'http://www.ybk168.com', 1</url>
<pub_date>3</pub_date>
</maprules>
</block>
</data>
</source>
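One plausible reading of blockre, itemre and maprules is sketched below in Python: the block pattern cuts out a region of the page, the item pattern is applied repeatedly inside it, and each map rule joins quoted literals with numbered capture groups. The function apply_block, the rule encoding, and the sample HTML are all assumptions made for illustration; the html_re worker may behave differently.

```python
import re

def apply_block(html, blockre, itemre, maprules):
    """Extract items from html with a block regex, an item regex and map rules.

    Each rule is a list of parts: an int is a capture-group number,
    a str is a literal. The parts are concatenated in order.
    """
    block = re.search(blockre, html, re.DOTALL)
    if block is None:
        return []
    items = []
    for m in re.finditer(itemre, block.group(1), re.DOTALL):
        item = {}
        for field, parts in maprules.items():
            item[field] = "".join(
                m.group(p) if isinstance(p, int) else p for p in parts
            )
        items.append(item)
    return items

# Made-up page fragment shaped like the patterns expect.
SAMPLE_HTML = (
    '<div class="list"><ul>'
    '<li><span><a href="/news/1.html" title="New stamp A">New stamp A</a></span>'
    '<span class="list_lr">2015-07-01</span></li>'
    '<li><span><a href="/news/2.html" title="New stamp B">New stamp B</a></span>'
    '<span class="list_lr">2015-07-02</span></li>'
    '</ul></div><div class="page">1 2 3</div>'
)

items = apply_block(
    SAMPLE_HTML,
    r'<div class="list">(.*?)<div class="page">',
    r'<li><span.*?href="([^"]+)".*?title="([^"]+)".*? class="list_lr">([^<]+)<',
    {
        "title": [2],
        "url": ["http://www.ybk168.com", 1],  # literal prefix + group 1
        "pub_date": [3],
    },
)
```

Under this reading, the rule 'http://www.ybk168.com', 1 turns a relative link captured by group 1 into an absolute URL.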
3. Making full use of callback code:
<source>
<name>北京空气质量</name>
<comment>Weibo account of Beijing environmental monitoring. ',利有散染预【8时' in s or '浓度】' not in s</comment>
<link>http://weibo.cn/u/2516831703</link>
<worker>html_re</worker>
<data>
<url>http://weibo.cn/u/2516831703</url>
<block>
<blockre flags='DOTALL'>
<![CDATA[
<div class="b">(.*)$
]]>
</blockre>
<itemre flags='DOTALL'>
<![CDATA[
weibo\.cn\[([\d-]+)
]]>
</itemre>
<maprules>
<title>'notitle'</title>
<pub_date>1</pub_date>
<suid>1</suid>
</maprules>
</block>
<block>
<blockre flags='DOTALL'>
<![CDATA[
^(?:.*?\[<span class="kt">置顶</span>\]|.*?<span class="pms">) (.*?) <input type="submit" value="查看更多内容"
]]>
</blockre>
<itemre flags='DOTALL'>
<![CDATA[
<div class="c" id="([^"]+)"> (?:<div><span class="ctt">|.*?<span class="cmt">转发理由:</span>) (.*?) (?:</span>|<a [^>]+>赞\[\d+\]).*? <span class="ct">([^& ]+)
]]>
</itemre>
<maprules>
<title>'notitle'</title>
<summary>2</summary>
<pub_date>3</pub_date>
<suid>1</suid>
</maprules>
</block>
</data>
<callback>
<![CDATA[
if posi == 0:
    temp_date = info.pub_date
    info.temp = 'del'
elif '日' in info.pub_date:
    info.temp = 'del'
else:
    s = info.summary
    if ',' in s or \
       '利' in s or \
       '有' in s or \
       '散' in s or \
       '染' in s or \
       '预' in s or \
       '【8时' in s or \
       '浓度】' not in s:
        info.url = 'http://weibo.cn/u/2516831703'
        info.pub_date = ''
        info.title = '[' + temp_date + '] ' + s[:16] + '…'
    else:
        info.temp = 'del'
]]>
</callback>
</source>
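The effect of this callback can be tried outside the tool with a small stand-alone sketch. It assumes the callback runs once per item in order, that posi is the item's position, that variables such as temp_date persist between items, and that setting info.temp to 'del' marks an item for removal. The Info class, the run_callback function, and the sample posts are hypothetical.

```python
class Info:
    """Hypothetical stand-in for the item object passed to a callback."""
    def __init__(self, summary="", pub_date=""):
        self.title = "notitle"
        self.summary = summary
        self.pub_date = pub_date
        self.url = ""
        self.temp = ""

# Same substrings the callback tests for, gathered in one tuple.
KEYWORDS = (",", "利", "有", "散", "染", "预", "【8时")

def run_callback(infos):
    """Apply the callback's rule to each item and return the survivors."""
    temp_date = ""
    for posi, info in enumerate(infos):
        if posi == 0:
            # The first item only carries the date found by the first block.
            temp_date = info.pub_date
            info.temp = "del"
        elif "日" in info.pub_date:
            # Dropped; presumably a post from an earlier day.
            info.temp = "del"
        else:
            s = info.summary
            if any(k in s for k in KEYWORDS) or "浓度】" not in s:
                info.url = "http://weibo.cn/u/2516831703"
                info.pub_date = ""
                info.title = "[" + temp_date + "] " + s[:16] + "…"
            else:
                info.temp = "del"
    return [i for i in infos if i.temp != "del"]

# Made-up posts for illustration.
SAMPLE = [
    Info(pub_date="2015-07-13"),
    Info("【10时浓度】PM2.5: 35", "今天 10:05"),
    Info("空气质量预报:明天扩散条件有利", "今天 17:00"),
    Info("【8时浓度】PM2.5: 40", "07月12日 08:05"),
]
kept = run_callback(SAMPLE)
```

With this sample, only the forecast post survives: routine hourly readings are dropped, as is the post whose date contains 日.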
4. Multiple blocks in html_re:
<source>
<name>中国国家地理</name>
<comment>中国国家地理</comment>
<link>http://www.dili360.com/</link>
<worker>html_re</worker>
<data>
<url>http://www.dili360.com/</url>
<block>
<blockre flags='DOTALL'>
<![CDATA[
<div class="community-item" id="community-items" > (.*?)<!--end-->
]]>
</blockre>
<itemre flags='DOTALL'>
<![CDATA[
<li class="img-block".*? <a target="_blank" href="([^"]+)">.*? <h4>(.*?)</h4>
]]>
</itemre>
<maprules>
<title>2</title>
<url>'http://www.dili360.com', 1</url>
</maprules>
</block>
<block>
<blockre flags='DOTALL'>
<![CDATA[
<div class="community-item" id="community-items" > (.*?)<!--end-->
]]>
</blockre>
<itemre flags='DOTALL'>
<![CDATA[
<dt><a href="([^"]+)" target="_blank">(.*?)</a></dt>
]]>
</itemre>
<maprules>
<title>2</title>
<url>'http://www.dili360.com', 1</url>
</maprules>
</block>
<block>
<blockre flags='DOTALL'>
<![CDATA[
<ul class="style-1" id="replace">(.*?)</ul>
]]>
</blockre>
<itemre flags='DOTALL'>
<![CDATA[
<div class="detail">.*? <a href="([^"]+)" target="_blank"><h4>(.*?)</h4>
]]>
</itemre>
<maprules>
<title>2</title>
<url>'http://www.dili360.com', 1</url>
<summary>'景观图片'</summary>
</maprules>
</block>
</data>
</source>
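A source with several block elements can be pictured as running each block against the same downloaded page and appending the results in order. The sketch below assumes exactly that, and also assumes that a block whose pattern does not match is simply skipped. The function run_blocks, the sample page, and the example.com prefix are invented for illustration.

```python
import re

def run_blocks(html, blocks):
    """Apply every (blockre, itemre, prefix) triple to the same page, in order."""
    results = []
    for blockre, itemre, prefix in blocks:
        region = re.search(blockre, html, re.DOTALL)
        if region is None:
            continue  # assumed: a block that does not match contributes nothing
        for m in re.finditer(itemre, region.group(1), re.DOTALL):
            # Group 2 is the title, group 1 the relative link, as in the maprules above.
            results.append({"title": m.group(2), "url": prefix + m.group(1)})
    return results

# Made-up page with two separate regions.
SAMPLE = (
    '<div id="a"><h4><a href="/x/1">First</a></h4></div><!--end-->'
    '<ul id="b"><li><a href="/y/2">Second</a></li></ul>'
)
BLOCKS = [
    (r'<div id="a">(.*?)<!--end-->',
     r'<h4><a href="([^"]+)">(.*?)</a></h4>',
     "http://example.com"),
    (r'<ul id="b">(.*?)</ul>',
     r'<li><a href="([^"]+)">(.*?)</a></li>',
     "http://example.com"),
]
items = run_blocks(SAMPLE, BLOCKS)
```

Two blocks may also share one blockre and differ only in itemre, as the first two blocks of the example above do, to pick up differently marked-up items from the same region.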
5. Using the html_json worker to parse JSON data:
<source>
<name>新浪书讯</name>
<comment>Sina Books, book news.</comment>
<link>http://book.sina.com.cn/</link>
<worker>html_json</worker>
<data>
<url>http://feed.mix.sina.com.cn/api/roll/get?callback=jsonp1436772833418&amp;pageid=8&amp;lid=156&amp;num=20</url>
<re flags='DOTALL'>
<![CDATA[
^try\{\w+\(
(.*)
\);\}catch\(e\)\{\};$
]]>
</re>
<block>
<block_path>'result', 'data'</block_path>
<title>'title'</title>
<url>'url'</url>
<summary>'summary'</summary>
<temp>'intro'</temp>
<pub_date>'ctime'</pub_date>
</block>
</data>
<callback>
<![CDATA[
info.pub_date = unixtime(info.pub_date)
info.summary = info.summary or info.temp
info.temp = 0
]]>
</callback>
</source>
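The html_json example can be read as three steps: strip the JSONP wrapper with the re pattern, walk block_path down to the list of entries, then copy the named keys into each item and let the callback tidy up. The Python sketch below follows that reading. The unixtime function here is a hypothetical stand-in for the tool's helper (its real output format is not known), the three regex lines are assumed to be joined without separators, and the sample response is made up.

```python
import json
import re
from datetime import datetime, timezone

# The three lines of the <re> element, joined into one pattern (assumption).
JSONP_RE = re.compile(r'^try\{\w+\((.*)\);\}catch\(e\)\{\};$', re.DOTALL)

def unixtime(value):
    """Hypothetical stand-in: convert epoch seconds to a UTC date string."""
    moment = datetime.fromtimestamp(int(value), tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")

def parse_feed(text):
    """Unwrap a JSONP response and map each entry to an item dict."""
    payload = JSONP_RE.match(text.strip()).group(1)
    data = json.loads(payload)
    for key in ("result", "data"):  # block_path: 'result', 'data'
        data = data[key]
    items = []
    for entry in data:
        items.append({
            "title": entry["title"],
            "url": entry["url"],
            # Same fallback as the callback: use intro when summary is empty.
            "summary": entry.get("summary") or entry.get("intro", ""),
            "pub_date": unixtime(entry["ctime"]),
        })
    return items

# Made-up response in the shape the pattern expects.
SAMPLE = (
    'try{jsonp1436772833418({"result": {"data": [{"title": "T", '
    '"url": "http://book.sina.com.cn/x", "summary": "", "intro": "I", '
    '"ctime": "1436772833"}]}});}catch(e){};'
)
items = parse_feed(SAMPLE)
```

The temp field in the XML plays the role of a scratch slot: it holds intro long enough for the callback to use it as a fallback summary, and is then reset.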