Using Python feedparser
2013-06-12 16:06 youxin
feedparser bills itself as a "Universal feed parser": it handles RSS 0.9x, RSS 1.0, RSS 2.0, CDF, Atom 0.3, and Atom 1.0 feeds. Official site:
https://pypi.python.org/pypi/feedparser/
Basic usage
>>> import feedparser
>>> d = feedparser.parse("http://feedparser.org/docs/examples/atom10.xml")
>>> d['feed']['title']  # feed data is a dictionary
u'Sample Feed'
>>> d.feed.title  # get values attr-style or dict-style
u'Sample Feed'
>>> d.channel.title  # use RSS or Atom terminology anywhere
u'Sample Feed'
>>> d.feed.link  # resolves relative links
u'http://example.org/'
>>> d.feed.subtitle  # parses escaped HTML
u'For documentation <em>only</em>'
>>> d.channel.description  # RSS terminology works here too
u'For documentation <em>only</em>'
>>> len(d['entries'])  # entries are a list
1
>>> d['entries'][0]['title']  # each entry is a dictionary
u'First entry title'
>>> d.entries[0].title  # attr-style works here too
u'First entry title'
>>> d['items'][0].title  # RSS terminology works here too
u'First entry title'
>>> e = d.entries[0]
>>> e.link  # easy access to alternate link
u'http://example.org/entry/3'
>>> e.links[1].rel  # full access to all Atom links
u'related'
>>> e.links[0].href  # resolves relative links here too
u'http://example.org/entry/3'
>>> e.author_detail.name  # author data is a dictionary
u'Mark Pilgrim'
>>> e.updated_parsed  # parses all date formats
(2005, 11, 9, 11, 56, 34, 2, 313, 0)
>>> e.content[0].value  # sanitizes dangerous HTML
u'<div>Watch out for <em>nasty tricks</em></div>'
>>> d.version  # reports feed type and version
u'atom10'
>>> d.encoding  # auto-detects character encoding
u'utf-8'
>>> d.headers.get('Content-type')  # full access to all HTTP headers
u'application/xml'
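The session above is from Python 2 (note the u'...' strings). As a rough sketch of the same idea in Python 3, the helper below walks the entries list and collects each title and link. The name list_entries and the placeholder values are my own assumptions, not part of feedparser; the helper only uses dict-style access, so it runs on a plain dict shaped like the parse result as well.

```python
def list_entries(parsed):
    """Return (title, link) pairs for every entry in a parsed feed.

    `parsed` is expected to be the object returned by feedparser.parse().
    Only dict-style access is used, so a plain dict of the same shape works too.
    """
    pairs = []
    for entry in parsed.get("entries", []):
        # Not every feed supplies every field, so fall back to placeholders.
        title = entry.get("title", "(untitled)")
        link = entry.get("link", "")
        pairs.append((title, link))
    return pairs


# Stand-in data shaped like the parse result shown in the session above.
sample = {"entries": [{"title": "First entry title",
                       "link": "http://example.org/entry/3"}]}
print(list_entries(sample))

# Usage with a real feed (requires the third-party feedparser package):
#   import feedparser
#   d = feedparser.parse("http://feedparser.org/docs/examples/atom10.xml")
#   print(list_entries(d))
```

Using .get() with a default instead of attribute access keeps the loop from raising when a feed omits a title or link.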
A typical item:
<item>
  <title><![CDATA[厦门公交车放火案死者名单公布<br/>警方公布嫌犯犯罪证据]]></title>
  <link>http://www.infzm.com/content/91404</link>
  <description><![CDATA[6月11日下午,厦门BRT公交车放火案47名死亡者名单公布。厦门政府新闻办6月10日发布消息称,有证据表明,陈水总携带汽油上了闽DY7396公交车。且有多名幸存者指认其在车上纵火,致使整部车引起猛烈燃烧。经笔迹鉴定,陈水总6月7日致妻、女的两封绝笔书系陈水总本人所写。]]></description>
  <category>南方周末-热点新闻</category>
  <author>infzm</author>
  <pubDate>2013-06-11 11:24:32</pubDate>
</item>
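One thing to watch in this item: the pubDate value is written as "2013-06-11 11:24:32" rather than in the RFC 822 form that RSS 2.0 specifies. I have not verified how feedparser treats this exact string, so the sketch below assumes the parsed date may be missing and falls back to parsing the raw text. It also assumes a recent feedparser, where pubDate surfaces as published and published_parsed. The names entry_datetime and fallback_format are hypothetical.

```python
from datetime import datetime


def entry_datetime(entry, fallback_format="%Y-%m-%d %H:%M:%S"):
    """Best-effort publication time for one entry, or None if unavailable.

    Assumes the entry exposes `published_parsed` (a time.struct_time-like
    tuple) and `published` (the raw string), as recent feedparser versions do.
    """
    parsed = entry.get("published_parsed")
    if parsed:
        # The first six fields of a struct_time are year through second.
        return datetime(*parsed[:6])
    raw = entry.get("published")
    if not raw:
        return None
    try:
        # Fallback for feeds whose dates look like the item above.
        return datetime.strptime(raw.strip(), fallback_format)
    except ValueError:
        return None


# Stand-in entry carrying only the raw date string from the item above.
print(entry_datetime({"published": "2013-06-11 11:24:32"}))
```

The returned datetime is naive: the item gives no time zone, so the fallback cannot know one.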
What does feedparser.parse() return?
>>> d = feedparser.parse(' ')
>>> print d
{'feed': {}, 'encoding': u'utf-8', 'bozo': 1, 'version': u'', 'namespaces': {}, 'entries': [], 'bozo_exception': SAXParseException('no element found',)}
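As you can see, the result is a dictionary; feed is also a dictionary, and entries is a list. Because the input here was not well-formed XML, bozo is set to 1 and the underlying error is stored in bozo_exception.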
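The bozo and bozo_exception keys in that output are worth checking before trusting a result. A minimal sketch, assuming only the keys shown in the printed dictionary (the helper name feed_problem is my own):

```python
def feed_problem(parsed):
    """Return None if the feed parsed cleanly, else a short description.

    `parsed` is the result of feedparser.parse(); `bozo` is truthy when the
    parser hit a problem, and `bozo_exception` holds the error it recorded.
    """
    if not parsed.get("bozo"):
        return None
    exc = parsed.get("bozo_exception")
    return repr(exc) if exc is not None else "unknown parse problem"


# Stand-in for the result printed above (ValueError replaces SAXParseException
# here so the snippet runs without building a real SAX error).
result = {"feed": {}, "bozo": 1, "entries": [],
          "bozo_exception": ValueError("no element found")}
problem = feed_problem(result)
if problem:
    print("feed had a problem:", problem)
```

A truthy bozo does not always mean the result is empty; a malformed feed can still yield usable entries, so it is usually reasonable to log the problem and then inspect entries anyway.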