Parsing HTML in Python with the sgmllib module
""" 对html文本的解析方案-示例:在标签开始的时候检查标签中的attrs属性,解析出所有的参数的href属性值 依赖安装:pip install sgmllib3k 使用方法: 1.自定义一个类,继承sgmllib的SGMLParser 2.复写SGMLParser的方法,添加自己自定义的标签处理函数 3.通过自定义的类的对象的.feed(data)把要解析的数据传入解析器,然后自定义的方法自动生效。 """ from urllib import request import sgmllib class HandleHtml(sgmllib.SGMLParser): """ 自定义HTML解析类 """ def unknown_starttag(self, tag, attrs): """ 任意标签开始被解析时调用 :param tag: 标签名 :param attrs: 标签的参数 :return: """ try: for attr in attrs: if attr[0] == 'href': print(f"{attr[0]}:{attr[1]}") except: pass if __name__ == '__main__': response = request.urlopen("http://freebuf.com/") page = response.read() page = page.decode('utf-8') # 创建HTML解析对象 handle_html = HandleHtml() # 将数据传入解析器 handle_html.feed(page)
Output:
href:https://www.freebuf.com/buf/plugins/wp-favorite-posts/wpfp.css
href:https://static.3001.net/css/recentcomments/wp-recentcomments.css?ver=2.2.3
href:https://www.freebuf.com/buf/plugins/gold/assets/css/widget.css?ver=1.3.2.1
href:https://static.3001.net/css/highslide/highslide.css
href:https://www.freebuf.com/buf/plugins/cartpauj-pm/style/style.css
href: https://www.freebuf.com/buf/plugins/simditor/highlight/styles/default.css
href:https://static.freebuf.com/images/favicon.ico
href:https://static.3001.net/css/new/header.css
href:https://static.3001.net/css/new/bootstrap.min.css?ver=2016051701
href:https://static.3001.net/css/new/swiper-3.4.2.min.css
href:https://static.3001.net/css/new/model.css?ver=2017112156855
href:https://static.3001.net/css/new/style.css?ver=2018112123749359438534
href:http://www.freebuf.com
href:http://www.freebuf.com
href:http://job.freebuf.com
href:#
......
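If sgmllib3k is not available, the same idea can be sketched with the standard library's html.parser module, which is a different parser from the one used above. This is only a minimal sketch under stated assumptions: the names HrefCollector, extract_hrefs and SAMPLE are hypothetical, the input is a small inline string rather than a downloaded page, and the hrefs are collected in a list instead of being printed as they are found.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Hypothetical helper: collect the href value of every start tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None when the
        # attribute is written without a value.
        for name, value in attrs:
            if name == "href" and value is not None:
                self.hrefs.append(value)


def extract_hrefs(html_text):
    """Feed html_text to a fresh parser and return the collected hrefs."""
    parser = HrefCollector()
    parser.feed(html_text)
    parser.close()  # flush any buffered data
    return parser.hrefs


# Hypothetical sample input, not taken from the page fetched above.
SAMPLE = (
    '<link rel="stylesheet" href="/css/style.css">'
    '<a href="http://example.com/">home</a>'
    '<a name="top">no link</a>'
)

if __name__ == "__main__":
    for href in extract_hrefs(SAMPLE):
        print(f"href:{href}")
```

Here handle_starttag plays the role that unknown_starttag plays in the sgmllib version: html.parser calls it for every start tag. Returning a list rather than printing makes the result easier to filter or deduplicate afterwards.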