HTMLParser and BeautifulSoup: Getting Started and Summary

1. HTMLParser is typically used like this:

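from html.parser import HTMLParser

class MyHtmlParser(HTMLParser):
    """Collect the text of every <a role="menuitem"> element."""

    def __init__(self):
        super().__init__()
        self.categories = []   # text collected from matching <a> tags
        self.in_a = False      # True while inside a matching <a>

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; return the value for attrname
        def _attr(attrs, attrname):
            for attr in attrs:
                if attr[0] == attrname:
                    return attr[1]
            return None
        if tag == 'a' and _attr(attrs, 'role') == 'menuitem':
            self.in_a = True

    def handle_endtag(self, tag):
        # Leaving the <a>: stop collecting text
        if tag == 'a' and self.in_a:
            self.in_a = False

    def handle_data(self, data):
        # Text nodes are only recorded while inside a matching <a>
        if self.in_a:
            self.categories.append(data)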

2. BeautifulSoup is typically used like this:

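from bs4 import BeautifulSoup

# price_html is a string holding the HTML source to parse
soup = BeautifulSoup(price_html, 'html.parser')
# Every <div> whose class attribute includes 'abcd'
divs = soup.find_all('div', class_='abcd')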

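3. With HTMLParser, nested divs are a problem: if handle_endtag switches off a boolean "inside div" flag, the first inner </div> switches it off too early. I tried for a long time and have not found a solution so far.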
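One possible way around the nested-div problem is to replace the boolean flag with a depth counter, so that only the </div> matching the outer opening tag ends the capture. The following is a minimal sketch, not the author's solution; the class name DivTextParser, the target class 'outer', and the sample HTML are all made up for illustration.

```python
from html.parser import HTMLParser


class DivTextParser(HTMLParser):
    """Collect text inside a target <div>, including any nested <div>s."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class  # hypothetical class name to look for
        self.depth = 0                    # 0 means "outside the target div"
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != 'div':
            return
        if self.depth > 0:
            # A nested <div> inside the target: go one level deeper.
            self.depth += 1
        elif self.target_class in (dict(attrs).get('class') or '').split():
            # The opening tag of the target div itself.
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth > 0:
            # Only the </div> that brings depth back to 0 closes the target.
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


parser = DivTextParser('outer')
parser.feed('<div class="outer">a<div>b</div>c</div><div>d</div>')
parser.close()
print(parser.chunks)  # ['a', 'b', 'c']
```

The text "c" after the inner </div> is still captured because the counter only drops from 2 to 1 at that point. This assumes the div tags in the page are properly balanced; with malformed HTML the counter can drift.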

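3. BeautifulSoup's find('div', class_='test') is a special case of find_all(...) that returns only the first match. class is a reserved keyword in Python, so the argument is spelled class_ with a trailing underscore; attrs={'class': 'test'} works as well. The value test can also be replaced with a regular expression.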
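As a small illustration of the find/find_all variants described here, the sketch below runs them against a made-up HTML string (the class names test, test2, and other are assumptions for the example only).

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, just for demonstration
html = ('<div class="test">one</div>'
        '<div class="test2">two</div>'
        '<div class="other">three</div>')
soup = BeautifulSoup(html, 'html.parser')

# find returns the first match (or None when nothing matches)
first = soup.find('div', class_='test')

# The attrs dict is an equivalent spelling that avoids the class_ keyword
same = soup.find('div', attrs={'class': 'test'})

# find_all with limit=1 returns a one-element list holding that same tag
limited = soup.find_all('div', class_='test', limit=1)

# A compiled regular expression matches any class starting with "test"
by_regex = soup.find_all('div', class_=re.compile(r'^test'))

print(first.string)                       # one
print([d.get_text() for d in by_regex])   # ['one', 'two']
```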

4. If there is no span, tag.div.a.span is None, so tag.div.a.span.string raises an AttributeError.
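One way to avoid that error is to check for None before reading .string. This is only a sketch on an invented HTML fragment; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the <a> has no <span> child
soup = BeautifulSoup('<div><a href="#">no span here</a></div>', 'html.parser')

span = soup.div.a.span   # None, because <a> contains no <span>

# Accessing .string directly on None fails
try:
    soup.div.a.span.string
    failed = False
except AttributeError:
    failed = True

# Guarded access: fall back to None when the tag is missing
text = span.string if span is not None else None
print(span, failed, text)  # None True None
```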

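5. A BeautifulSoup pitfall: for <a>kkk<span>lang</span></a>, a.string returns None instead of kkk. The .string attribute only works when a tag has a single child, and here <a> contains both the text kkk and a nested <span>.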
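The text can still be reached in other ways. The sketch below shows a few options on the same markup; which one fits depends on whether the text of the nested <span> is wanted too.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a>kkk<span>lang</span></a>', 'html.parser')
a = soup.a

# .string is None because <a> has two children (a text node and a <span>)
print(a.string)  # None

# Only the text nodes that are direct children of <a>
own_text = ''.join(a.find_all(string=True, recursive=False))
print(own_text)  # kkk

# All text under <a>, including the nested <span>
all_text = a.get_text()
print(all_text)  # kkklang

# The <span> has a single text child, so .string works on it
print(a.span.string)  # lang
```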

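6. BeautifulSoup reads the whole HTML document first and builds an object tree, so it uses more memory and runs slower than HTMLParser. In exchange, it is much more convenient to use.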

posted @ 2018-06-15 22:32 by 方山客
