First Stop with Scrapy
http://www.jianshu.com/p/fa614bea98eb
Following that tutorial step by step and taking my own notes along the way; the goal is to scrape books rated 9 or above.
First, create a Scrapy project by running `scrapy startproject douban` in cmd.
Then open the project in PyCharm and create a main file (at the same level as scrapy.cfg):
```python
from scrapy import cmdline

cmdline.execute("scrapy crawl dbbook".split())
```
Here dbbook is the spider name; change it to whatever name you define for your own spider.
Next, create a new Python file in the spiders folder:
```python
import re

import scrapy
from douban.items import doubanBook  # 'douban' is the project created above


class DbbookSpider(scrapy.Spider):
    name = "dbbook"
    start_urls = ('https://www.douban.com/doulist/1264675/',)  # the list to crawl

    def parse(self, response):
        selector = scrapy.Selector(response)
        books = selector.xpath('//div[@class="bd doulist-subject"]')
        for each in books:
            item = doubanBook()  # a fresh item per book
            title = each.xpath('div[@class="title"]/a/text()').extract()[0]
            title = title.replace(' ', '').replace('\n', '')
            rate = each.xpath('div[@class="rating"]/span[@class="rating_nums"]/text()').extract()[0]
            author = re.search('<div class="abstract">(.*?)<br', each.extract(), re.S).group(1)
            author = author.replace(' ', '').replace('\n', '')
            item['title'] = title
            item['rate'] = rate
            item['author'] = author
            yield item
        # follow the link to the next page, if there is one
        nextPage = selector.xpath('//div[@class="paginator"]/span[@class="next"]/link/@href').extract()
        if nextPage:
            next = nextPage[0]
            yield scrapy.http.Request(next, callback=self.parse)
```
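The author field is pulled out with a regular expression rather than XPath. As a quick sanity check, here is a minimal, self-contained sketch of that regex run against a hypothetical HTML fragment (the snippet below is made up for illustration; the real page markup may differ):

```python
import re

# Hypothetical HTML mimicking one entry of the Douban list page,
# only to exercise the author-extraction regex from the spider above.
html = '''<div class="bd doulist-subject">
<div class="title"><a> 活着 </a></div>
<div class="rating"><span class="rating_nums">9.1</span></div>
<div class="abstract">
作者: 余华
<br/>出版社: 作家出版社
</div>
</div>'''

# re.S lets '.' match newlines, so the capture can span multiple lines
author = re.search('<div class="abstract">(.*?)<br', html, re.S).group(1)
author = author.replace(' ', '').replace('\n', '')
print(author)  # 作者:余华
```

The non-greedy `(.*?)` stops at the first `<br`, so only the author line is captured even though the abstract block contains more fields.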
The items.py file is the container for the scraped data; you can think of it as a struct in which every field you want to extract is declared.
```python
import scrapy


class doubanBook(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    rate = scrapy.Field()
    author = scrapy.Field()
```
This declares the three fields we need to store.
Finally, edit settings.py so the scraped data is written out:
```python
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0'
FEED_URI = u'file:///G://douban.csv'
FEED_FORMAT = 'csv'  # the documented format name is lowercase
```
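One thing worth noting: the spider above yields every book on the list, even though the goal stated at the top is books rated 9 or above. A minimal filtering sketch in plain Python (the helper name and sample rows below are made up for illustration; in the real spider the same check would sit inside parse() before yield item):

```python
def keep_high_rated(items, threshold=9.0):
    """Keep only items whose 'rate' string parses to >= threshold."""
    # 'rate' is scraped as a string like '9.1'; skip rows with an empty rate
    return [it for it in items if it.get('rate') and float(it['rate']) >= threshold]


# hypothetical sample rows shaped like the doubanBook item
books = [
    {'title': '活着', 'rate': '9.1', 'author': '余华'},
    {'title': '示例', 'rate': '8.4', 'author': '某人'},
]
print(keep_high_rated(books))  # only the 9.1 entry survives
```

Inside parse() this would amount to wrapping `yield item` in `if float(rate) >= 9:`.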
For review, here is a tutorial on the re library and regular expressions: http://cuiqingcai.com/977.html