Parsing HTML in Python with the sgmllib module

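"""
Example approach to parsing HTML text: whenever a start tag is parsed, inspect its attrs and print the value of every href attribute.
Install the dependency: pip install sgmllib3k (sgmllib was removed from the standard library in Python 3)
Usage:
    1. Define a class that inherits from sgmllib's SGMLParser.
    2. Override SGMLParser methods to add your own tag-handling functions.
    3. Pass the data to be parsed into the parser with .feed(data) on an instance of that class; the custom methods are then called automatically.
"""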
from urllib import request
import sgmllib


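class HandleHtml(sgmllib.SGMLParser):
    """
    Custom HTML parser class
    """

    def unknown_starttag(self, tag, attrs):
        """
        Called when a start tag is parsed (no tag-specific handlers are defined here, so this runs for every tag)
        :param tag: tag name
        :param attrs: the tag's attributes, as a list of (name, value) pairs
        :return: None
        """
        for name, value in attrs:
            if name == 'href':
                print(f"{name}:{value}")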


if __name__ == '__main__':
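    response = request.urlopen("http://freebuf.com/")
    # Read the raw bytes and decode them (assumes the page is UTF-8 encoded)
    page = response.read().decode('utf-8')

    # Create the HTML parser object
    handle_html = HandleHtml()
    # Feed the data into the parser
    handle_html.feed(page)
    # Flush any data still buffered in the parser
    handle_html.close()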

Output:

href:https://www.freebuf.com/buf/plugins/wp-favorite-posts/wpfp.css
href:https://static.3001.net/css/recentcomments/wp-recentcomments.css?ver=2.2.3
href:https://www.freebuf.com/buf/plugins/gold/assets/css/widget.css?ver=1.3.2.1
href:https://static.3001.net/css/highslide/highslide.css
href:https://www.freebuf.com/buf/plugins/cartpauj-pm/style/style.css
href: https://www.freebuf.com/buf/plugins/simditor/highlight/styles/default.css
href:https://static.freebuf.com/images/favicon.ico
href:https://static.3001.net/css/new/header.css
href:https://static.3001.net/css/new/bootstrap.min.css?ver=2016051701
href:https://static.3001.net/css/new/swiper-3.4.2.min.css
href:https://static.3001.net/css/new/model.css?ver=2017112156855
href:https://static.3001.net/css/new/style.css?ver=2018112123749359438534
href:http://www.freebuf.com
href:http://www.freebuf.com
href:http://job.freebuf.com
href:#
......
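
The three usage steps above (subclass a parser, override a tag handler, call .feed(data)) can also be sketched without the sgmllib3k dependency. The block below is only an illustration under stated assumptions: it swaps in the standard library's html.parser.HTMLParser in place of SGMLParser, collects the href values in a list instead of printing them, and resolves relative links against a base URL. The names HrefCollector, extract_hrefs and the sample markup are hypothetical and not part of the original script.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Hypothetical helper: gathers href values from every start tag."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        for name, value in attrs:
            if name == "href" and value is not None:
                # Strip stray whitespace and resolve relative links against base_url
                self.hrefs.append(urljoin(self.base_url, value.strip()))


def extract_hrefs(html, base_url=""):
    """Feed an HTML string to the parser and return the collected hrefs."""
    parser = HrefCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.hrefs


# Made-up sample markup, so the sketch runs without network access
sample = (
    '<link href="/css/a.css">'
    '<a href="http://job.freebuf.com">jobs</a>'
    '<a name="x">no href here</a>'
)
links = extract_hrefs(sample, "https://www.freebuf.com/")
for link in links:
    print(f"href:{link}")
```

Returning a list rather than printing inside the handler makes the result easier to filter or de-duplicate afterwards; whether that suits a given script is a design choice, not something the original code requires.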

 

posted @ 2019-01-23 10:54  寒爵