Scrapy Spider: Douban Top 250

I've recently been learning scrapy, a seriously powerful Python library, and wanted to write down some notes.

I started with the official documentation, but it is somewhat dense, a few parts are ambiguous, and there aren't many complete intermediate-to-advanced examples. On top of that, some methods have changed across versions. So I combined a blog post I found on cnblogs with the official docs and built my own scrapy project. The target is the Douban movie Top 250. The source code follows.

First, create a new project in a directory of your choice:

scrapy startproject douban

Enter the douban folder and take a look at the directory structure; the result.txt file there is my output file.
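For reference, a typical layout right after scrapy startproject douban looks roughly like this (result.txt only appears once the spider has run):

douban/
    scrapy.cfg                 # deployment configuration
    result.txt                 # written by our pipeline at runtime
    douban/
        __init__.py
        items.py               # item definitions (edited below)
        pipelines.py           # item pipelines (edited below)
        settings.py            # project settings
        spiders/
            __init__.py
            douban_spider.py   # our spider (created below)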

Next, edit items.py in the douban folder to bundle the fields we want to return into a single item:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    movie_name = scrapy.Field()      # title
    movie_director = scrapy.Field()  # director(s)
    movie_editor = scrapy.Field()    # screenwriter(s)
    movie_roles = scrapy.Field()     # starring actors
    movie_style = scrapy.Field()     # genre(s)
    movie_date = scrapy.Field()      # release date
    movie_long = scrapy.Field()      # runtime
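An Item behaves much like a dict, so you can sanity-check the fields in a Python 2 shell before writing the spider (a throwaway sketch; the value is made up):

>>> from douban.items import DoubanItem
>>> item = DoubanItem()
>>> item['movie_name'] = [u'Example Title']   # hypothetical value
>>> item['movie_name']
[u'Example Title']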

Next, write the spider itself. Create douban_spider.py under the spiders folder:

# -*- coding: utf-8 -*-
from scrapy.spiders import Spider     # newer versions keep spiders in scrapy.spiders
from scrapy.selector import Selector  # HtmlXPathSelector is deprecated; Selector replaces it
from douban.items import DoubanItem
import scrapy
import sys
reload(sys)
sys.setdefaultencoding("utf-8")  # set the default string encoding to utf-8 (Python 2)


class DoubanSpider(Spider):
    """Spider for the Douban movie Top 250."""
    name = "douban"                         # scrapy spider name
    allowed_domains = ["movie.douban.com"]  # allowed domains
    # starting URLs: the ten list pages, 25 movies each (yeshu = page index)
    start_urls = ["http://movie.douban.com/top250" + "?start=" + str(yeshu * 25) + "&filter=&type=" for yeshu in range(0, 10)]

    # callback for the list pages
    def parse(self, response):
        hxs = Selector(response)
        movie_link = hxs.xpath('//div[@class="hd"]/a/@href').extract()
        # alternative approach (unused): follow the "next page" link instead
        # movie_next = hxs.xpath('//span[@class="next"]/a/@href').extract()
        # nextmo = movie_next[0]
        # if nextmo:
        #     nextmo = "http://movie.douban.com/top" + nextmo
        #     start_urls.append(nextmo)
        for link in movie_link:
            # request each second-level (detail) page, handled by the callback below
            yield scrapy.Request(link, callback=self.parse_item)

    # custom callback for the detail pages
    def parse_item(self, response):
        item_has = Selector(response)
        movie_name = item_has.xpath('//h1/span/text()').extract()
        movie_director = item_has.xpath('//a[@rel="v:directedBy"]/text()').extract()
        movie_editor = item_has.xpath('//div[@id="info"]/span[2]/span[@class="attrs"]/a/text()').extract()
        movie_roles = item_has.xpath('//a[@rel="v:starring"]/text()').extract()
        movie_style = item_has.xpath('//span[@property="v:genre"]/text()').extract()
        movie_date = item_has.xpath('//span[@property="v:initialReleaseDate"]/text()').extract()
        movie_long = item_has.xpath('//span[@property="v:runtime"]/text()').extract()
        item = DoubanItem()
        item['movie_name'] = movie_name
        item['movie_director'] = movie_director
        item['movie_editor'] = movie_editor
        item['movie_roles'] = movie_roles
        item['movie_style'] = movie_style
        item['movie_date'] = movie_date
        item['movie_long'] = movie_long
        yield item
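As an aside, recent Scrapy versions let you query the response object directly, with no selector wrapper at all. A minimal sketch of parse_item in that style (same XPaths, abbreviated to two fields):

    def parse_item(self, response):
        item = DoubanItem()
        # response.xpath works on Scrapy >= 1.0; .extract() returns a list of strings
        item['movie_name'] = response.xpath('//h1/span/text()').extract()
        item['movie_director'] = response.xpath('//a[@rel="v:directedBy"]/text()').extract()
        yield item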

Finally, edit pipelines.py under the douban folder; it saves the scraped data to a file:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
NUM = 1  # module-level counter: rank number written before each movie


class DoubanPipeline(object):

    def process_item(self, item, spider):
        movie_name = item['movie_name']
        movie_director = item['movie_director']
        # append a space to each entry so writelines() keeps them separated
        movie_editor = [line + ' ' for line in item['movie_editor']]
        movie_roles = [line + ' ' for line in item['movie_roles']]
        movie_style = [line + ' ' for line in item['movie_style']]
        movie_date = [line + ' ' for line in item['movie_date']]
        movie_long = item['movie_long']
        f = open("result.txt", "a")
        global NUM
        f.write(str(NUM))
        f.write("\nTitle: ")
        NUM += 1
        print "NAME:", movie_name
        f.writelines(movie_name)
        f.write("\nDirector: ")
        f.writelines(movie_director)
        f.write("\nScreenwriter: ")
        f.writelines(movie_editor)
        f.write("\nStarring: ")
        f.writelines(movie_roles)
        f.write("\nGenre: ")
        f.writelines(movie_style)
        f.write("\nRelease date: ")
        f.writelines(movie_date)
        f.write("\nRuntime: ")
        f.writelines(movie_long)
        f.write("\n")
        f.close()
        return item
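Opening result.txt again for every single item works, but pipelines also get open_spider/close_spider hooks, so the file handle and the counter can live on the pipeline instance instead of at module level. A minimal sketch of that variant (abbreviated to the title field):

class DoubanPipeline(object):

    def open_spider(self, spider):
        # called once when the crawl starts
        self.f = open("result.txt", "a")
        self.num = 1

    def close_spider(self, spider):
        # called once when the crawl ends
        self.f.close()

    def process_item(self, item, spider):
        self.f.write(str(self.num) + "\nTitle: ")
        self.f.writelines(item['movie_name'])
        self.f.write("\n")
        self.num += 1
        return item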

Last of all, don't forget to update ITEM_PIPELINES in settings.py under douban so that it points at the pipeline we just wrote; it is commented out by default:

ITEM_PIPELINES = {
   'douban.pipelines.DoubanPipeline': 300,
}
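While you are in settings.py: if Douban starts rejecting requests from the default Scrapy user agent, setting a browser-like USER_AGENT and a small DOWNLOAD_DELAY usually helps. Both are standard Scrapy settings; the values below are only example assumptions, tune them as needed:

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64)'  # example browser UA, adjust as needed
DOWNLOAD_DELAY = 2  # seconds between requests, to go easy on the site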

That's the whole program; now you can run it. The command has to be issued from the project root, i.e. the first (outer) douban directory:

scrapy crawl douban
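By the way, if you just want the raw items and don't need the custom text format, Scrapy's built-in feed exports can skip the pipeline entirely (result.json here is only an example filename):

scrapy crawl douban -o result.json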

posted @ 2015-09-24 21:34  phil_chow