2017.08.07 python爬虫实战之使用代理爬取糗事百科

1.目标分析:

 

2.创建编辑Scrapy爬虫:

(1)执行命令:

(2)编辑items.py文件:

(3)Scrapy项目中间件----添加Headers:

在Scrapy项目中,掌管proxy的中间件是scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware,直接修改这个文件不是不可以,不过为了一个项目就去修改整个环境变量,不值得;

所以,我们自己写一个中间件,让它运行,然后将Scrapy默认的中间件关闭掉就可以了;

 

#! /usr/bin/env python
# -*- coding: utf-8 -*-

from scrapy.contrib.downloadermiddleware.useragent import UserAgentMiddleware

class CustomUserAgent(UserAgentMiddleware):
def process_request(self, request, spider):
ua="Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML,like Gecko) Chrome/19.0.1061.1 Safari/536.3"

request.headers.setdefault('User-Agent',ua)

 

(4)修改settings.py文件,将系统默认的中间件scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware关闭,用自己创建的中间件qiushi.middlewares.customMiddlewares.CustomUserAgent代替:

(5)编辑糗事Spider.py文件,获取所需数据:

subSelector=response.xpath('//div[@class="article block untagged mb15 typs_long"]')

subSelector.xpath('.//div[@class="content"]/span/text()').extract()[0]    笑话内容

subSelector.xpath('./div[@class="stats"]/span/i/text()').extract()[0]    好笑次数

subSelector.xpath('.//h2/text()').extract()[0]    发布者名字

subSelector.xpath('//div[@class="thumb"]/a/img/@src').extract()[0]         笑话图片

subSelector.xpath('//div[@class="stats"]/span/a/i/text()').extract()[0]      评论次数

 

(6)编辑qiushiSpider.py文件:

 

# -*- coding: utf-8 -*-
import scrapy
from qiushi.items import QiushiItem

class QiushispiderSpider(scrapy.Spider):
name = 'qiushiSpider'
allowed_domains = ['qiushibaike.com']
start_urls = ['http://qiushibaike.com/hot']

def parse(self, response):
subSelector=response.xpath('.//div[@class="article block untagged mb15 typs_hot" or @class="article block untagged mb15 typs_old" ]')
items=[ ]
for sub in subSelector:
item=QiushiItem()
item['author']=sub.xpath('.//h2/text()').extract()[0]
item['content']=sub.xpath('.//div[@class="content"]/span/text()').extract()[0]

item['img']=sub.xpath('//div[@class="thumb"]/a/img/@src').extract()

item['funNum']=sub.xpath('./div[@class="stats"]/span/i/text()').extract()[0]
item['talkNum']=sub.xpath('//div[@class="stats"]/span/a/i/text()').extract()[0]
items.append(item)
return items

注意这个图片的路径:是//div[@class="thumb"]/a/img/@src,不是.//div[@class="thumb"]/a/img/@src
而且是extract(),不是extract()[0],加上[0]意味着只有第一张图片的URL被爬取下来

(7)编辑pipelines.py文件,保存图片和数据:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import time
import urllib2
import os


class QiushiPipeline(object):
def process_item(self, item, spider):
today=time.strftime('%Y%m%d',time.localtime())
fileName=today+'qiushi.txt'
imgDir='IMG'

if os.path.isdir(imgDir):
pass
else:
os.mkdir(imgDir)

with open(fileName,'a') as fp:
fp.write('-'*50+'\n'+'-'*50+'\n')
fp.write("author:\t%s \n" %(item['author'].encode('utf8')))
fp.write("content:\t%s \n" %(item['content'].encode('utf8')))

for i in item['img']:
try:
imgUrl=i

except IndexError:
pass

imgName=os.path.basename(imgUrl)
print("我是imgName"+imgName)
fp.write("img:\t %s \n" %(imgName))
imgPathName=imgDir+os.sep+imgName
print("我是imgPathName"+imgPathName)
with open(imgPathName,'wb') as fpi:
response=urllib2.urlopen("http:"+imgUrl)
fpi.write(response.read())
fp.write("funNum:%s \t talkNum:%s \t" %(item['funNum'],item['talkNum']))
fp.write('-'*50+'\n'+'-'*50+'\n'*10)
return item

3.在qiushi项目的任意目录下,运行命令:scrapy crawl qiushiSpider

爬取结果:

 4.Scrapy项目中间件----------添加Proxy

  Scrapy默认环境下,proxy的设置是由中间件scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware控制的

我们这里还是自己定义一个中间件取代这个环境变量就好。直接在middleware.customMiddlewares(自己建的模块和文件)中添加一个类就好。

接下来在修改Settings.py文件就好,将新添加的中间件CustomProxy添加到DOWNLOADER_MIDDLEWARES中去,这里与之前的CustomUserAgent不同的是:

CustomUserAgent需要禁止系统的UserAgentMiddleware,而CustomProxy则需要在系统的HttpProxyMiddle之前执行:

 

再次执行命令:scrapy crawl qiushiSpider

可以看到两个中间件都启动运行了,如果proxy不行,多试几个代理服务器看看就好

 

posted @ 2017-08-10 13:25  小春熙子  阅读(420)  评论(0编辑  收藏  举报