General-purpose regular expressions for scraping announcement-style pages

detail = response.xpath('//div[@class="meetingDetailBox"]').extract()[0]  # HTML of the main body
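Extract the plain text (strip style blocks, tags, comment remnants, and whitespace characters)
import re
summary = re.sub(r'<style.*?</style>|<.*?>|begin-->|end-->|\r|\n|\t|\xa0', '', detail, flags=re.S)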
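As a rough illustration of what this substitution leaves behind, the sketch below runs the same pattern on a small hand-written fragment. The sample HTML and the helper name `strip_markup` are assumptions made for the example; they are not part of the original spider.

```python
import re

# Same alternatives as the pattern above: style blocks, any tag,
# leftover comment tails, and \r \n \t \xa0 characters.
TAG_PATTERN = re.compile(
    r'<style.*?</style>|<.*?>|begin-->|end-->|\r|\n|\t|\xa0',
    flags=re.S,
)

def strip_markup(html):
    """Return html with everything matched by TAG_PATTERN removed."""
    return TAG_PATTERN.sub('', html)

# Hypothetical page fragment, made up for this example.
sample = (
    '<div class="meetingDetailBox">\n'
    '<style>p { color: red; }</style>\n'
    '<!--begin-->\n'
    '<p>Notice:\xa0the meeting\tstarts at 9:00.</p>\n'
    '<!--end-->\n'
    '</div>'
)

print(strip_markup(sample))  # Notice:the meetingstarts at 9:00.
```

Note that tabs, newlines, and non-breaking spaces are deleted rather than replaced, so words separated only by those characters end up joined. For Chinese text this is usually harmless; for space-delimited languages, replacing them with a single space may be preferable.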
Match all image URLs
img_url = re.findall(r'<img.*?src="(.*?)".*?>', detail)
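The `src` values found this way are often site-relative paths. A minimal sketch of resolving them to absolute URLs with the standard library's `urljoin` follows; the helper name `image_urls`, the sample HTML, and the `example.com` addresses are all hypothetical.

```python
import re
from urllib.parse import urljoin

def image_urls(html, base_url):
    """Collect the src of each <img> tag and resolve it against base_url."""
    srcs = re.findall(r'<img.*?src="(.*?)".*?>', html)
    return [urljoin(base_url, src) for src in srcs]

# Hypothetical fragment: one site-relative image, one absolute image.
sample = (
    '<p><img alt="a" src="/upload/a.png">'
    '<img src="https://cdn.example.com/b.jpg" width="10"></p>'
)

print(image_urls(sample, 'https://example.com/notice/1.html'))
# ['https://example.com/upload/a.png', 'https://cdn.example.com/b.jpg']
```

In a Scrapy callback, `response.urljoin(src)` does the same resolution against the page's own URL.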
Attachments (returns each matching <a> element as a whole)
file_doc = re.findall(r'<a href="/module/download.*?".*?>.*?</a>', detail)
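If the download URL and the link text are needed separately, one option is to add capture groups to the same pattern. The sketch below is an assumption about how that might look, not the author's code; `attachments`, the sample HTML, and the `example.com` base URL are made up for illustration.

```python
import re
from urllib.parse import urljoin

# Same shape as the pattern above, with groups around the href and the link body.
ATTACHMENT = re.compile(
    r'<a href="(/module/download.*?)".*?>(.*?)</a>',
    flags=re.S,
)

def attachments(html, base_url):
    """Return (absolute_url, link_text) pairs for download links in html."""
    result = []
    for href, inner in ATTACHMENT.findall(html):
        # The link body may contain nested tags such as <span>; drop them.
        text = re.sub(r'<.*?>', '', inner, flags=re.S).strip()
        result.append((urljoin(base_url, href), text))
    return result

# Hypothetical fragment: one download link and one ordinary link.
sample = (
    '<p><a href="/module/download/file.pdf" target="_blank">'
    '<span>Attachment 1.pdf</span></a></p>'
    '<p><a href="/other/page.html">Not a download</a></p>'
)

print(attachments(sample, 'https://example.com/art/2020/7/9/art_1.html'))
# [('https://example.com/module/download/file.pdf', 'Attachment 1.pdf')]
```

Like the original pattern, this only matches anchors whose first attribute is `href`, so links written as `<a target="_blank" href="...">` would be missed.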

Alternative way to extract the text, keeping only the first 300 characters
summary = re.sub(r'<style.*?</style>|<.*?>| | ', '', detail, flags=re.S)[:300]

posted @ 2020-07-09 09:15 山东张铭恩
