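First, analyze the page structure and the crawl flow:
1. Scrapy downloads start_urls and hands each response to the parse method.
2. parse extracts the article URLs on the current page (later on, each article's cover image will also be passed along to the custom parse_detail) as well as the URL of the next page.
3. The next-page URL is handed back to parse, which repeats steps 1 and 2; each article URL is passed to the custom callback parse_detail.
4. The next-page URL is downloaded and processed by parse, while parse_detail uses CSS or XPath selectors to extract the useful fields and pass them to an Item. These two operations run asynchronously.
The source code follows: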
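The flow in steps 1 to 4 can be sketched without Scrapy or a network connection. The snippet below is a simplified, synchronous model and not how Scrapy is implemented: `FAKE_SITE`, `crawl`, and the `example.test` URLs are made up for illustration, and a plain queue stands in for Scrapy's asynchronous scheduler and downloader.

```python
from collections import deque

# Hypothetical in-memory "website": URL -> page data (stands in for real downloads).
FAKE_SITE = {
    "http://example.test/all-posts/": {
        "posts": ["http://example.test/post/1", "http://example.test/post/2"],
        "next": "http://example.test/all-posts/page/2",
    },
    "http://example.test/all-posts/page/2": {
        "posts": ["http://example.test/post/3"],
        "next": None,
    },
    "http://example.test/post/1": {"title": "First"},
    "http://example.test/post/2": {"title": "Second"},
    "http://example.test/post/3": {"title": "Third"},
}


def parse(url, page):
    """List page: yield one request per article, then one for the next page."""
    for post_url in page["posts"]:
        yield ("request", post_url, parse_detail)
    if page["next"]:
        yield ("request", page["next"], parse)


def parse_detail(url, page):
    """Article page: yield the extracted fields as an item."""
    yield ("item", {"url": url, "title": page["title"]})


def crawl(start_urls):
    """Toy scheduler: pop a request, 'download' it, then run its callback."""
    queue = deque((url, parse) for url in start_urls)
    items = []
    while queue:
        url, callback = queue.popleft()
        page = FAKE_SITE[url]  # step 1: the "download"
        for result in callback(url, page):
            if result[0] == "request":
                _, next_url, next_callback = result
                queue.append((next_url, next_callback))
            else:
                items.append(result[1])
    return items


items = crawl(["http://example.test/all-posts/"])
print([item["title"] for item in items])
```

In real Scrapy the requests are downloaded concurrently, so the order in which items arrive is not guaranteed the way it is in this toy loop.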
import re

import scrapy
from scrapy.http import Request


class JobboleSpider(scrapy.Spider):
    name = "jobbole"
    allowed_domains = ["blog.jobbole.com"]
    start_urls = ['http://blog.jobbole.com/all-posts/']

    def parse(self, response):
        '''
        Does two things:
        1. Extracts every article URL on the page and hands each one to
           parse_detail, which parses the title, date, like count, and so on.
        2. Extracts the URL of the next page and schedules it for download.
        '''
        post_urls = response.xpath('/html/body/div[1]/div[3]/div/div/a/@href').extract()
        for url in post_urls:
            yield Request(url, callback=self.parse_detail)
        next_url = response.xpath('/html/body/div[1]/div[3]/div[21]/a[4]/@href').extract_first()
        if next_url:
            yield Request(next_url, callback=self.parse)

    def parse_detail(self, response):
        # Title
        title = response.xpath('/html//div[@class="entry-header"]/h1/text()').extract()[0]
        # Publication date, with surrounding whitespace and the "·" separator removed
        add_time = response.xpath('/html//p[@class="entry-meta-hide-on-mobile"]/text()').extract()[0].strip().replace("·", "")
        # Like count
        like = response.xpath('//span[contains(@class,"vote-post-up")]/h10/text()').extract()[0]
        # Favorite count: the non-greedy ".*?" lets the group capture the whole
        # number; a greedy ".*" would leave only its last digit
        fav_temp = response.xpath('//span[contains(@class,"bookmark-btn")]/text()').extract()[0]
        re_fav_num = re.match(r".*?(\d+).*", fav_temp)
        if re_fav_num:
            fav_num = int(re_fav_num.group(1))
        else:
            fav_num = 0
        # Comment count
        comment_temp = response.xpath('//span[@class="btn-bluet-bigger href-style hide-on-480"]/text()').extract()[0]
        re_comment = re.match(r".*?(\d+).*", comment_temp)
        if re_comment:
            comment_num = int(re_comment.group(1))
        else:
            comment_num = 0
        # Article body, kept as HTML
        content = response.xpath('//div[@class="entry"]').extract()[0]
        # Tags: drop the "N 评论" (comment count) link, which is not a tag
        temp_tag = response.xpath('/html//p[@class="entry-meta-hide-on-mobile"]/a/text()').extract()
        tag_list = [x for x in temp_tag if not x.strip().endswith("评论")]
        tags = ";".join(tag_list)
        pass
        r'''
        # Extract the same data with CSS selectors
        # Title: a leading "." selects by class
        css_title = response.css(".entry-header h1::text").extract()[0]
        # Publication date: select the p tag whose class is entry-meta-hide-on-mobile;
        # ::text extracts its text content, much like text() in XPath
        css_add_time = response.css("p.entry-meta-hide-on-mobile::text").extract()[0].strip().replace("·", "")
        # Like count: the only tag whose class is vote-post-up
        css_like = response.css(".vote-post-up h10::text").extract()[0]
        # Favorite count: indexing a list is risky because it may be empty;
        # extract_first() returns a default instead of raising in that case
        css_fav_temp = response.css(".bookmark-btn::text").extract_first("")
        re_fav_num = re.match(r".*?(\d+).*", css_fav_temp)
        css_fav_num = int(re_fav_num.group(1)) if re_fav_num else 0
        # Comment count
        css_comment_temp = response.css('a[href="#article-comment"] span::text').extract_first("")
        re_comment = re.match(r".*?(\d+).*", css_comment_temp)
        css_comment_num = int(re_comment.group(1)) if re_comment else 0
        # Tags: filter out the comment-count entry
        temp_tag = response.css('p[class="entry-meta-hide-on-mobile"] a::text').extract()
        tag_list = [x for x in temp_tag if not x.strip().endswith("评论")]
        tags = ";".join(tag_list)
        '''
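One detail in the count extraction is worth checking in isolation. With a greedy `.*` in front of the group, `(\d+)` only receives the last digit of a multi-digit number, so a non-greedy `.*?` is the safer choice. The sketch below shows the difference; `extract_number` and `join_tags` are hypothetical helper names, and the sample strings are made up rather than taken from the site.

```python
import re


def extract_number(text, default=0):
    """Return the first run of digits in text as an int, or default if none."""
    match = re.match(r".*?(\d+).*", text or "")
    return int(match.group(1)) if match else default


def join_tags(raw_tags):
    """Drop the 'N 评论' (comment count) entry and join the rest with ';'."""
    return ";".join(t.strip() for t in raw_tags if not t.strip().endswith("评论"))


# Greedy prefix: '.*' swallows ' 12', leaving only the last digit for the group.
greedy = re.match(r".*(\d+).*", " 123 收藏").group(1)
print(greedy)                                # 3
print(extract_number(" 123 收藏"))            # 123
print(extract_number(" 收藏"))                # 0
print(join_tags(["职场", " 2 评论 ", "面试"]))  # 职场;面试
```

Passing `None` (what `extract_first()` returns by default for an empty selection) is also handled, since the helper falls back to an empty string before matching.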