1. Installing Scrapy on Windows
(1) pip install wheel
(2) Download Twisted-18.9.0-cp36-cp36m-win_amd64.whl from https://www.lfd.uci.edu/~gohlke/pythonlibs/ (pick the wheel that matches the Python version installed on your machine), open a terminal in the download directory and run pip install Twisted-18.9.0-cp36-cp36m-win_amd64.whl
(3) pip install pywin32
(4) pip install scrapy

On Linux and macOS, pip3 install scrapy is all that is needed.
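
A quick way to confirm the installation worked (not part of the original steps) is to print the version; the -v flag also lists the versions of Twisted, lxml and Python:

scrapy version -v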


2. Creating a Scrapy project (run in a terminal)
2.1 Create a new directory named firstblood

cd firstblood   # change into the new directory
scrapy startproject firstblood    # create the scrapy project

2.2 Project directory structure

firstblood    # project name
    firstblood    # directory with the same name as the project
        |
        |-spiders    # directory that holds the spider files
        |   |
        |   |-__init__.py
        |
        |-__init__.py
        |
        |-items.py    # used together with the pipelines file
        |
        |-middlewares.py    # scrapy middleware file
        |
        |-pipelines.py
        |
        |-settings.py    # scrapy configuration file

 

3. Creating the first spider file
3.1 Run in the terminal

scrapy genspider first www.xxx.com     # pick any domain; the spider file will be named first

3.2 The generated spider file

import scrapy

class FirstSpider(scrapy.Spider):
    # spider name
    name = 'first'

    # domains the spider is allowed to crawl
    allowed_domains = ['www.xxx.com']

    # urls requested first (the generated file writes them with the http:// scheme)
    start_urls = ['http://www.xxx.com/']

    # callback that receives the returned data
    def parse(self, response):
        pass

3.3 A first try: scraping douban.com

import scrapy

class FirstSpider(scrapy.Spider):
    # spider name
    name = 'first'
    # domains the spider is allowed to crawl
    # allowed_domains = ['www.xxx.com']

    # send a GET request to https://www.douban.com/
    start_urls = ['https://www.douban.com/',]

    def parse(self, response):
        """
        Receives the data returned after the request is sent.
        :param response:
        :return:
        """

        # parse the xpath copied from the browser with scrapy's built-in xpath
        # support; extract() pulls the final string data out as a list
        data_list = response.xpath('//*[@id="anony-book"]/div/div[1]/div[1]/h2/a/text()').extract()
        print(data_list)
Run in the terminal to start the spider:

scrapy crawl first --nolog

Result: ['读书']

PS: the difference between extract() and extract_first():

    extract() returns a list; appending [0] takes the first element of that list.

    extract_first() directly returns the first element of the list.

Add to the settings.py file:

    # ignore the robots.txt protocol
    ROBOTSTXT_OBEY = False

    # fake the request header
    USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36'
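
A minimal sketch of that difference, runnable outside a spider with scrapy's Selector (the HTML snippet here is made up for illustration):

from scrapy.selector import Selector

sel = Selector(text='<ul><li>读书</li><li>电影</li></ul>')

print(sel.xpath('//li/text()').extract())          # ['读书', '电影']  -- always a list
print(sel.xpath('//li/text()').extract()[0])       # '读书'            -- index into the list
print(sel.xpath('//li/text()').extract_first())    # '读书'            -- first element directly
print(sel.xpath('//span/text()').extract_first())  # None instead of an IndexError when nothing matches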

 

4. Scraping data from qiushibaike.com
4.1 Persisting the data via a terminal command

import scrapy

class FirstSpider(scrapy.Spider):
    name = 'first'

    start_urls = ['https://www.qiushibaike.com/text/',]

    def parse(self, response):
        """
        Persistence via a terminal command: just return the parsed data.
        :param response:
        :return:
        """
        data_list = []
        div_list = response.xpath('//div[@id="content-left"]/div')
        for div in div_list:
            author = div.xpath('./div/a[2]/h2/text()').extract()
            content = div.xpath('./a[1]/div/span//text()').extract()
            content = "".join(content).strip('\n')

            data_dic = {
                'author': author,
                'content': content,
            }
            data_list.append(data_dic)
        return data_list
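
The terminal command that actually writes the returned data to disk is Scrapy's -o option; the parse() above only has to return the data. The file name below is just an example, and json, jsonlines, csv and xml are all supported:

scrapy crawl first -o qiushi.json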

4.2 Persisting the data through a pipeline (to a file)
4.2.1 Scrapy does not enable pipeline persistence by default, so first turn the pipeline on in the settings.py file

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'firstBlood.pipelines.FirstbloodPipeline': 300,
}

4.2.2 Pipeline persistence relies on the items file

import scrapy

class FirstbloodItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # fields that wrap the data collected by the spider
    author = scrapy.Field()
    content = scrapy.Field()

4.2.3 Write the spider file that collects the data

import scrapy
from firstBlood.items import FirstbloodItem  # import the class from the items file

class FirstSpider(scrapy.Spider):
    # spider name
    name = 'first'
    # domains the spider is allowed to crawl
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/',]

    def parse(self, response):
        """
        Persistence through the pipeline.
        :param response:
        :return:
        """
        div_list = response.xpath('//div[@id="content-left"]/div')
        for div in div_list:
            author = div.xpath('./div/a[2]/h2/text()').extract_first()
            content = div.xpath('./a[1]/div/span//text()').extract()
            content = "".join(content).strip('\n')

            # wrap the parsed data in an item object
            item = FirstbloodItem()  # instantiate the items class
            item['author'] = author
            item['content'] = content

            # hand each item object over to the pipeline with yield
            yield item

4.2.4 Write the pipeline file that does the persistence

class FirstbloodPipeline(object):

    def __init__(self):
        self.pc = None

    def open_spider(self, spider):
        """
        Open the file; this method is called only once.
        :param spider:
        :return:
        """
        print('opening file')
        self.pc = open('data.txt', 'a', encoding='utf-8')

    def process_item(self, item, spider):
        """
        This method is called once for every item.
        :param item: the item submitted by the spider
        :param spider:
        :return:
        """

        # read the parsed data out of the item
        author = item.get('author')
        content = item.get('content')

        # write the data to the file through the handle opened above
        self.pc.write(author + content + '\n')

        return item

    def close_spider(self, spider):
        """
        Close the file.
        :return:
        """
        print('closing file')
        self.pc.close()

4.3 Persisting through a pipeline (to redis)
4.3.1 The spider file and the items file stay unchanged; install the redis Python client (pip install redis)
4.3.2 Write a new pipeline class named RedisPipeline

import redis

class RedisPipeline(object):

    def __init__(self):
        self.conn = None

    def open_spider(self, spider):
        """
        Connect to redis; called only once.
        :param spider:
        :return:
        """

        self.conn = redis.Redis(host='212.64.54.2', port=6379, password='123123')
        print('connection opened')

    def process_item(self, item, spider):
        """
        Take the data handed over by the spider and write it into a redis hash.
        :param item:
        :param spider:
        :return:
        """

        author = item.get('author')
        content = item.get('content')
        self.conn.hset('data_dic', author, content)
        return item

    def close_spider(self, spider):
        """
        Read the data back from redis and close the connection.
        :param spider:
        :return:
        """

        data = self.conn.hgetall('data_dic')
        print('connection closed')
        self.conn.close()
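
To check what actually landed in redis, the hash can be read back with the same client (host and password are the placeholders used in the pipeline above; decode_responses turns the bytes into str):

import redis

conn = redis.Redis(host='212.64.54.2', port=6379, password='123123', decode_responses=True)

# 'data_dic' is the hash key written by RedisPipeline
for author, content in conn.hgetall('data_dic').items():
    print(author, content)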

4.3.3 Configure it in settings.py:

ITEM_PIPELINES = {
    # 'firstBlood.pipelines.FirstbloodPipeline': 300,
    'firstBlood.pipelines.RedisPipeline': 300,  # add this line
}
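
Both pipelines could also be enabled at the same time; the number is a priority (lower runs first), and because each process_item() returns the item, the item is handed on to the next pipeline. A sketch:

ITEM_PIPELINES = {
    'firstBlood.pipelines.FirstbloodPipeline': 300,  # write to data.txt first
    'firstBlood.pipelines.RedisPipeline': 301,       # then write to redis
}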

 

5. Crawling a whole site (all pages)

The spider file:

import scrapy
from total_station.items import TotalStationItem

class TotalSpider(scrapy.Spider):
    name = 'total'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://bj.58.com/chuzu/?PGTID=0d100000-0000-13b8-36f2-bf4b23ad89f2&ClickID=3']

    # url template used to build the url of every page
    url = 'https://bj.58.com/chuzu/pn%s/?PGTID=0d3090a7-0000-138e-9b50-77af32298fce&ClickID=2'

    # page counter that is incremented for every new request
    page_num = 1

    def parse(self, response):
        li_list = response.xpath('//ul[@class="listUl"]/li')
        for li in li_list:
            title = li.xpath('./div[2]/h2/a/text()').extract()
            title = "".join(title).strip('\n ')
            price = li.xpath('./div[3]/div//text()').extract()
            price = "".join(price).strip('\n \t')
            pattern = li.xpath('./div[2]/p[2]//text()').extract()  # flat layout
            pattern = "".join(pattern).strip('\n \t')
            position = li.xpath('./div[2]/p[2]//text()').extract()  # location
            position = "".join(position).strip('\n \t ')
            agent = li.xpath('./div[2]/div//text()').extract()
            agent = "".join(agent).strip('\n \t ')

            item = TotalStationItem()
            item['title'] = title
            item['price'] = price
            item['pattern'] = pattern
            item['position'] = position
            item['agent'] = agent
            yield item

        # runs after the for loop has finished
        if self.page_num < 5:

            self.page_num += 1

            # build the url of the next page from the template
            new_url = self.url % self.page_num

            # request the new url; parse() is used as the callback again,
            # so every page is handled the same way
            yield scrapy.Request(url=new_url, callback=self.parse)

The items file and the pipelines file follow the same pattern as before and need no changes.
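
For reference, the items file presumably just declares the five fields used above (a sketch inferred from the spider; it is not shown in the original notes):

import scrapy

class TotalStationItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    pattern = scrapy.Field()
    position = scrapy.Field()
    agent = scrapy.Field()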

 

6. Sending a POST request with Scrapy (Baidu translate)

import scrapy

class PosttestSpider(scrapy.Spider):
    name = 'postTest'
    # allowed_domains = ['www.xxx.com']

    # send a POST request to the Baidu translate API
    start_urls = ['https://fanyi.baidu.com/sug']

    def start_requests(self):
        # request parameters
        data = {
            'kw': 'abs',
        }
        for url in self.start_urls:

            # POST requests in scrapy use FormRequest; the parameters go in formdata
            yield scrapy.FormRequest(url=url, callback=self.parse, formdata=data)

    def parse(self, response):
        print(response.text)

The response comes back as JSON, so it needs a json.loads().
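
A sketch of that decoding step; the method below would replace parse() in the spider above, and only json from the standard library is needed (the structure of the returned object is not assumed here):

    import json  # at the top of the spider file

    def parse(self, response):
        # response.text is the raw JSON string returned by the API
        result = json.loads(response.text)
        print(result)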

 

7. Setting the log level and writing the log to a file
Configure in settings.py:

# log level of the output
LOG_LEVEL = 'ERROR'

# write the log to a file
LOG_FILE = 'spider.log'

 

8. Proxy IPs and proxy pools
8.1 Setting a proxy IP goes through the middleware file middlewares.py; only the DownloaderMiddleware-related code in that file needs to be kept
8.2 Enable the downloader middleware in settings.py

DOWNLOADER_MIDDLEWARES = {
    'PostPro.middlewares.PostproDownloaderMiddleware': 543,
}

8.3 Set a single proxy IP in the middleware (a site that lists proxy IPs: http://www.goubanjia.com/)

class PostproDownloaderMiddleware(object):

    def process_request(self, request, spider):
        request.meta['proxy'] = 'https://134.209.13.153:8080'
        return None

8.4 Setting up a proxy pool

import random

# proxy IP pools
HTTP_POOL = [
    '124.236.21.44:80',
    '81.95.142.178:80',
    '88.84.223.1:33519',
    '115.79.208.56:38351'
]

HTTPS_POOL = [
    '134.209.13.153:8080',
    '43.243.165.206:3128',
]

class PostproDownloaderMiddleware(object):

    def process_request(self, request, spider):

        # split the requested url on ':' to get its scheme
        pro = request.url.split(':')

        # if the scheme is http
        if pro[0] == 'http':

            # pick a random IP from the http pool
            ip = random.choice(HTTP_POOL)

            # set it as the proxy for this request
            request.meta['proxy'] = 'http://' + ip

        # if the scheme is https
        elif pro[0] == 'https':

            # pick a random IP from the https pool
            ip = random.choice(HTTPS_POOL)

            # and set it as the proxy
            request.meta['proxy'] = 'https://' + ip

        return None

    def process_response(self, request, response, spider):

        return response

 

9. Setting the User-Agent

import random
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

# User-Agent pool
UA_POOL = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)',
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36',
]

class UserAgentDownloaderMiddleware(UserAgentMiddleware):

    def process_request(self, request, spider):

        # pick a random UA from the pool
        UA = random.choice(UA_POOL)

        # set the User-Agent header of the request
        request.headers['User-Agent'] = UA
        return None

    def process_response(self, request, response, spider):

        return response

Register the class in settings.py:

DOWNLOADER_MIDDLEWARES = {
    # proxy IP middleware
    'PostPro.middlewares.PostproDownloaderMiddleware': 543,

    # User-Agent middleware
    'PostPro.middlewares.UserAgentDownloaderMiddleware': 544,
}

 

10. Fetching dynamically loaded data with selenium (scraping news from people.com.cn)
The middleware:

from scrapy.http import HtmlResponse

class NewsDownloaderMiddleware(object):

    def process_request(self, request, spider):

        return None

    def process_response(self, request, response, spider):
        # get the webdriver object created by the spider
        browser = spider.browser

        # load the requested url in the browser
        browser.get(request.url)

        # grab the rendered page source
        page_text = browser.page_source

        # wrap the rendered page in a new response and return it
        new_response = HtmlResponse(url=browser.current_url, body=page_text, encoding='utf-8', request=request)
        return new_response
The spider file:

import scrapy
from selenium import webdriver

class NewSpider(scrapy.Spider):
    name = 'new'

    # site to visit
    start_urls = ['http://tw.people.com.cn/']

    def __init__(self):
        self.browser = webdriver.Chrome(executable_path=r'F:\pycharm\chromedriver.exe')
        super(NewSpider, self).__init__()

    def parse(self, response):
        """
        Extract the data from the rendered page.
        :param response:
        :return:
        """
        title = response.xpath('//*[@id="p2Ab_1"]/div[4]/p/strong/a').extract_first()
        print(title)
        content = response.xpath('//*[@id="p2Ab_1"]/div[4]/p/em/a/text()').extract_first()
        print(content)

    def closed(self, spider):
        # quit the browser when the spider closes
        self.browser.quit()
The settings file:

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'news.middlewares.NewsDownloaderMiddleware': 543,
}
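
An optional tweak, not in the original notes: Chrome can be started headless so no browser window opens during the crawl. A sketch using plain selenium API (the chromedriver path is the same placeholder as in the spider above; keyword arguments can differ slightly between selenium versions):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')      # no visible browser window
options.add_argument('--disable-gpu')

# would replace the webdriver.Chrome(...) call in __init__ above
browser = webdriver.Chrome(executable_path=r'F:\pycharm\chromedriver.exe', options=options)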

 

11. Crawling a whole site with CrawlSpider
11.1 Create the scrapy project

scrapy startproject crawlPro

11.2 Create the spider file; the command differs from the one used for an ordinary spider

scrapy genspider -t crawl website www.x.com

11.3 How the rules are written

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class WebsiteSpider(CrawlSpider):
    name = 'website'
    # allowed_domains = ['www.x.com']
    start_urls = ['http://www.x.com/']

    # link extractor: extracts the links matching the pattern from the pages of the start urls
    link = LinkExtractor(allow=r'Items/')

    rules = (
        # rule: parses the pages behind the extracted links with the given callback
        # follow=True  -> keep extracting links from every newly visited page, not only the current one
        # follow=False -> only handle the links visible on the current page
        Rule(link, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        #item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        #item['name'] = response.xpath('//div[@id="name"]').get()
        #item['description'] = response.xpath('//div[@id="description"]').get()
        return item

11.4 Test: crawl all pages of dig.chouti.com

The spider file:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class WebsiteSpider(CrawlSpider):
    name = 'website'
    # allowed_domains = ['www.x.com']
    start_urls = ['https://dig.chouti.com/']

    # link extractor: extracts the paging links matching the regex from the start url's page
    link = LinkExtractor(allow=r'/all/hot/recent/(\d+)')

    rules = (
        # rule: parses the page behind every extracted link with the given callback
        Rule(link, callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        item = {}
        content_list = []

        div_list = response.xpath('//*[@id="content-list"]/div')

        for div in div_list:
            title = "".join(div.xpath('./div[4]/div[1]/a[1]/text()').extract()).strip('\t\n')
            content_list.append(title)

        # collect all titles first, then return the item once the loop is done
        item['title'] = content_list
        return item
The items file:

import scrapy

class CrawlproItem(scrapy.Item):

    title = scrapy.Field()
The pipeline file:

class CrawlproPipeline(object):
    def process_item(self, item, spider):
        print(item['title'])

        return item
The settings file:

ITEM_PIPELINES = {
    'crawlPro.pipelines.CrawlproPipeline': 300,
}

 

12. Distributed crawling
12.1 Install scrapy-redis

pip3 install scrapy-redis

12.2 Create the project

scrapy startproject distributed

12.3 Create the spider file, in the same way as for a CrawlSpider

scrapy genspider -t crawl website www.x.com

12.4 In the spider file, import the module and make the class inherit from RedisCrawlSpider

from scrapy_redis.spiders import RedisCrawlSpider

12.5 Comment out allowed_domains and start_urls in the class, and add redis_key = 'HDD'; this is the name of the queue the scheduler reads start urls from

12.6 Modify the settings file
12.6.1 Set the User-Agent

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36'

12.6.2 Disable the robots.txt protocol

ROBOTSTXT_OBEY = False

12.6.3 Configure the shared pipeline
# shared pipeline

ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400,
}

12.6.4 Other settings

# use the deduplication queue of the scrapy_redis component
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'

# use scrapy_redis's own scheduler
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'

# allow pausing (if the server goes down, the crawl resumes from where it stopped after restart)
SCHEDULER_PERSIST = True

12.6.5 Configure the redis connection

# IP address of the redis host
REDIS_HOST = 'x.x.x.x'

# port of the redis service
REDIS_PORT = '6379'

# encoding
REDIS_ENCODING = 'utf-8'

# extra connection parameters (a dict), e.g. the password
REDIS_PARAMS = {'password': '123456'}

12.7 Install redis
pass
12.8 Write the spider file

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy_redis.spiders import RedisCrawlSpider
from distributed.items import DistributedItem  # the item class defined in the project's items file

class DistriSpider(RedisCrawlSpider):
    name = 'distri'
    # allowed_domains = ['www.x.com']
    # start_urls = ['https://dig.chouti.com/']

    redis_key = 'HDD'  # name of the start-url queue in redis

    rules = (
        Rule(LinkExtractor(allow=r'/all/hot/recent/(\d+)'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        div_list = response.xpath('//*[@id="content-list"]/div')
        for div in div_list:

            # get the title and the author of each entry
            title = div.xpath('./div[4]/div[1]/a[1]/text()').extract_first()
            author = div.xpath('./div[4]/div[2]/a[4]/b/text()').extract_first()

            # hand the data over to the item
            item = DistributedItem()
            item['title'] = title
            item['author'] = author

            yield item
The items file:

import scrapy

class DistributedItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()

12.9 Start the spider

scrapy runspider distri.py    # started differently from before: run the spider file directly

2019-03-06 17:26:03 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023   # execution pauses here until a start url is pushed

12.10 Push the start url into redis

LPUSH HDD https://dig.chouti.com    # HDD is the value of redis_key in the spider file

Once the url is pushed, the spider starts working.

12.11 View the data in redis

LRANGE distri:items 0 -1    # shows all the stored data
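
The same data can also be read back from Python. scrapy_redis's RedisPipeline stores each item as a JSON string in the <spider name>:items list, so a sketch (host and password are the placeholders from the settings above):

import json
import redis

conn = redis.Redis(host='x.x.x.x', port=6379, password='123456', decode_responses=True)

# every element of distri:items is one serialized item
for raw in conn.lrange('distri:items', 0, -1):
    print(json.loads(raw))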

 
