Python HTML Parsing Demo - SGMLParser & PyQuery
1. SGMLParser:
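Here we define a Parse class that inherits from SGMLParser. A flag variable, is_h4, marks whether the parser is currently inside an h4 tag in the HTML file; while it is set, the text inside the tag is appended to Parse's name list. A note on start_h4() and end_h4(): they follow these SGMLParser prototypes: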
start_tagname(self, attrs)
end_tagname(self)
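tagname is the name of the tag. For example, start_h4 is called when <h4> is encountered, and end_h4 is called when </h4> is encountered. attrs holds the tag's attributes, passed in as a list of the form [(attribute, value), (attribute, value), ...].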
Demo:
#!/usr/bin/python2.7
# FileName: sgmlparser.py
# Author: lxw
# Date: 2015-07-30

import urllib2
from sgmllib import SGMLParser


class Parse(SGMLParser):
    def __init__(self):
        SGMLParser.__init__(self)
        self.is_h4 = ""
        self.name = []
        self.is_a = ""
        self.link = []

    def start_h4(self, attrs):
        self.is_h4 = 1

    def end_h4(self):
        self.is_h4 = ""

    def start_a(self, attrs):
        self.is_a = 1

    def end_a(self):
        self.is_a = ""

    def handle_data(self, text):
        if self.is_h4 == 1:
            self.name.append(text)
        if self.is_a == 1:
            self.link.append(text)


def main():
    #content = urllib2.urlopen("https://kb.isc.org/").read()
    content = urllib2.urlopen("https://list.taobao.com/browse/cat-0.htm").read()
    parse = Parse()
    parse.feed(content)
    for item in parse.link:
        print(item.decode("gbk").encode("utf-8"))
    print("-"*20)
    for item in parse.name:
        print(item.decode("gbk").encode("utf-8"))


if __name__ == '__main__':
    main()
else:
    print("Being imported as a module.")
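The sgmllib module was removed in Python 3, so the demo above only runs on Python 2. As a rough sketch of the same start_tagname/end_tagname convention, the example below rebuilds the dispatch on top of the standard library's html.parser and reads the href out of attrs. The names DispatchingParser, LinkParser, and SAMPLE_HTML are hypothetical and not part of the original demo; the HTML is an inline sample, not the Taobao page.

```python
from html.parser import HTMLParser


class DispatchingParser(HTMLParser):
    """Hypothetical helper: routes each tag to start_<tag>/end_<tag>,
    imitating the SGMLParser naming convention."""

    def handle_starttag(self, tag, attrs):
        handler = getattr(self, "start_" + tag, None)
        if handler is not None:
            handler(attrs)

    def handle_endtag(self, tag):
        handler = getattr(self, "end_" + tag, None)
        if handler is not None:
            handler()


class LinkParser(DispatchingParser):
    def __init__(self):
        super().__init__()
        self.in_h4 = False
        self.names = []   # text found inside <h4> tags
        self.hrefs = []   # href values found on <a> tags

    def start_h4(self, attrs):
        self.in_h4 = True

    def end_h4(self):
        self.in_h4 = False

    def start_a(self, attrs):
        # attrs arrives as a list of (attribute, value) tuples.
        for attribute, value in attrs:
            if attribute == "href":
                self.hrefs.append(value)

    def handle_data(self, data):
        if self.in_h4:
            self.names.append(data)


# Inline sample input (an assumption for illustration only).
SAMPLE_HTML = '<h4>Books</h4><a href="/books" class="nav">All books</a>'

parser = LinkParser()
parser.feed(SAMPLE_HTML)
parser.close()
print(parser.names)
print(parser.hrefs)
```

Unlike the demo above, which stores the text of each a tag, this sketch stores the href attribute, mainly to show how the attrs list can be consumed.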
2. PyQuery:
#!/usr/bin/python2.7
#coding=utf-8
# The coding line above is required if the file contains Chinese comments.
# FileName: pyQueryParse.py
# Author: lxw
# Date: 2015-07-30

from pyquery import PyQuery

'''
Running the script directly works fine, but when the output is redirected
to a file, the following error occurs:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 166-167: ordinal not in range(128)
The fix is to add the three lines below:
'''
import sys
reload(sys)
sys.setdefaultencoding("utf-8")


def main():
    source = PyQuery(url="https://list.taobao.com/browse/cat-0.htm")
    #print(type(source))          #<class 'pyquery.pyquery.PyQuery'>
    #print(type((source("a"))))   #<class 'pyquery.pyquery.PyQuery'>
    for data in source.find("a"):
        #print(type(data))                  #<class 'lxml.html.HtmlElement'>
        #print(type(PyQuery((data))))       #<class 'pyquery.pyquery.PyQuery'>
        #print(type(PyQuery(data).text()))  #<type 'unicode'>/<type 'str'>
        print(PyQuery(data).text())


if __name__ == '__main__':
    main()
else:
    print("Being imported as a module.")
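The UnicodeEncodeError described in the script's comment comes from non-ASCII text being pushed through the ASCII codec. One alternative to changing the interpreter-wide default is to encode explicitly at the point of output. The sketch below is written for Python 3, where sys.setdefaultencoding does not exist; the sample string and the in-memory buffer standing in for a redirected stdout are assumptions for illustration.

```python
import io

# Sample text containing non-ASCII characters (an assumption for illustration).
text = u"淘宝 links"

# The ASCII codec cannot represent these characters, which is the failure
# mode behind the UnicodeEncodeError quoted in the script above.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Stand-in for a redirected stdout: a byte buffer wrapped in a text layer
# with an explicit UTF-8 encoding.
raw = io.BytesIO()
out = io.TextIOWrapper(raw, encoding="utf-8")
out.write(text)
out.flush()

written = raw.getvalue()
print(ascii_failed)
print(written.decode("utf-8"))
```

In a real script the wrapper could be placed around the byte stream of standard output instead of a BytesIO buffer, so that redirected output is always written as UTF-8.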