January 18 study notes: Scrapy follow-up on crawler proxies, extensions, and signals

s7day102

Review:
1. Practice exercises

val = [[11,22],[44,55],[66,77],]
val[0].append(33)


for item in val:
    item.append(33)


Exercise: convert the flat permission list below into the nested (tree) structure shown after it; a sketch of one solution follows the expected result.

permission_list = [
    {'id': 1, 'title': '权限1', 'pid': None},
    {'id': 2, 'title': '权限2', 'pid': 1},
    {'id': 3, 'title': '权限3', 'pid': None},
    {'id': 4, 'title': '权限4', 'pid': 1},
    {'id': 5, 'title': '权限5', 'pid': 3},
]

# expected result
permission_list = [
    {'id': 1, 'title': '权限1', 'pid': None, 'children': [{'id': 2, 'title': '权限2', 'pid': 1}, {'id': 4, 'title': '权限4', 'pid': 1}]},
    {'id': 3, 'title': '权限3', 'pid': None, 'children': [{'id': 5, 'title': '权限5', 'pid': 3}]},
]
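One way to build the tree, as a minimal sketch; the helper name build_tree is mine, not from the notes, and this version gives every node a children list (including leaves), which differs slightly from the expected output above:

def build_tree(flat):
    # index every permission by id and give each a fresh children list
    nodes = {item['id']: dict(item, children=[]) for item in flat}
    tree = []
    for node in nodes.values():
        if node['pid'] is None:
            tree.append(node)                            # top-level permission
        else:
            nodes[node['pid']]['children'].append(node)  # attach to its parent
    return tree


permission_list = build_tree(permission_list)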

2. scrapy
1. scrapy startproject sp1
cd sp1
scrapy genspider chouti chouti.com
write the spider code
scrapy crawl chouti --nolog

2. Start URLs

start_urls = ['http://chouti.com/']

start_requests method; its return value can be:
- an iterable object
- a generator
(a minimal override is sketched below)
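A minimal sketch of overriding start_requests as a generator, reusing the chouti spider name and URL from these notes:

import scrapy


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    start_urls = ['http://chouti.com/']

    def start_requests(self):
        # returning a list would also work; a generator is simply lazier
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        pass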

3. Deduplication
Class:
implement the request_seen method (a sketch follows below)
Config:
# default dedup rule
# DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'
# custom dedup rule
DUPEFILTER_CLASS = 'spnew.filter.MyDupeFilter'
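A sketch of a custom dedup class matching the spnew.filter.MyDupeFilter path above. The method Scrapy actually calls is request_seen; the other methods mirror BaseDupeFilter, and using request_fingerprint to key requests is my assumption, not something stated in the notes:

# spnew/filter.py
from scrapy.dupefilters import BaseDupeFilter
from scrapy.utils.request import request_fingerprint


class MyDupeFilter(BaseDupeFilter):
    def __init__(self):
        self.visited = set()

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        # True means "duplicate, drop it"; False means "crawl it"
        fp = request_fingerprint(request)
        if fp in self.visited:
            return True
        self.visited.add(fp)
        return False

    def open(self):
        pass  # called when the spider starts

    def close(self, reason):
        pass  # called when the spider closes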



4. Callback functions
yield Item objects


5. Pipelines
Class:
process_item method:
return item          # pass the item on to the next pipeline
raise DropItem()     # discard the item
Config (the ITEM_PIPELINES setting; see the example below):
{
"xxxxxx": 300,
"xxxxxx": None,
}
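A minimal pipeline sketch matching the shape above; the class and module names are placeholders, not from the notes:

# sp1/pipelines.py
from scrapy.exceptions import DropItem


class Sp1Pipeline(object):
    def process_item(self, item, spider):
        if not item.get('title'):
            # discard the item; pipelines registered after this one never see it
            raise DropItem('missing title')
        # hand the item to the next pipeline in the chain
        return item

In settings.py it would be enabled as ITEM_PIPELINES = {'sp1.pipelines.Sp1Pipeline': 300}; setting a pipeline's value to None disables it.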

6. Handling cookies (left blank in the notes; one documented option is sketched below)
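A hedged sketch using Scrapy's cookiejar request meta key, which keeps a separate cookie session per jar id; the spider name, login URL, and form fields are placeholders, not taken from the notes:

import scrapy


class CookieDemoSpider(scrapy.Spider):
    name = 'cookie_demo'                     # hypothetical spider name
    start_urls = ['http://chouti.com/']

    def start_requests(self):
        for i, url in enumerate(self.start_urls):
            # every distinct cookiejar value gets its own cookie session
            yield scrapy.Request(url, meta={'cookiejar': i}, callback=self.parse)

    def parse(self, response):
        # reuse the same jar so the login cookies stick to this session
        yield scrapy.FormRequest(
            url='http://chouti.com/login',                 # placeholder URL
            formdata={'phone': 'xxx', 'password': 'xxx'},  # placeholder fields
            meta={'cookiejar': response.meta['cookiejar']},
            callback=self.after_login,
        )

    def after_login(self, response):
        self.logger.info(response.text[:100])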



3. HTTP protocol
Methods: 'get', 'post', 'put', 'patch', 'delete', 'head', 'options', 'trace'
Request headers:
Content-Type: application/x-www-form-urlencoded


How Django reads the request data on the server side:
request.body: the raw request body, e.g. name=alex&age=18&xxx=xx

# request.POST is only filled in when:
# Content-Type: application/x-www-form-urlencoded
# and the body looks like: name=alex&age=18&xxx=xx
request.POST: request.body parsed into a dict-like object, e.g. {'name': 'alex', 'age': '18', 'xxx': 'xx'}
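A small example view showing the difference; the view name and the response shape are mine, not from the notes:

from django.http import JsonResponse


def demo(request):
    # raw bytes of the body, e.g. b'name=alex&age=18&xxx=xx'
    raw = request.body
    # parsed only for form-encoded (or multipart) POST bodies
    name = request.POST.get('name')
    return JsonResponse({'raw': raw.decode('utf-8'), 'name': name})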

4. Async and non-blocking
Non-blocking: no waiting; socket.setblocking(False) affects 1. connect and 2. recv (see the socket sketch after this section)
Async: callbacks

Event-loop based: async non-blocking I/O, e.g. Twisted

Coroutine based: gevent
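A bare socket sketch of what setblocking(False) means: connect and recv no longer wait, so the caller must catch BlockingIOError and come back later, which is exactly the bookkeeping an event loop automates (the target host is just an example):

import socket

client = socket.socket()
client.setblocking(False)                    # no waiting on connect / recv

try:
    client.connect(('www.baidu.com', 80))    # returns immediately
except BlockingIOError:
    pass                                     # connection still in progress

try:
    data = client.recv(8096)                 # returns immediately as well
except BlockingIOError:
    data = None                              # nothing readable yet; retry later

print(data)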

5. Twisted & Scrapy
- Create an empty Deferred object so the Twisted event loop does not stop.
- Keep running a task forever by having it re-schedule itself (runnable sketch below):
reactor.callLater(0, func)

def func():
    reactor.callLater(0, func)
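A runnable sketch of both ideas, assuming Twisted is installed; the function names engine and tick are mine:

from twisted.internet import defer, reactor


@defer.inlineCallbacks
def engine():
    # an empty Deferred that nobody ever fires: as long as we are
    # "waiting" on it, this task never finishes
    yield defer.Deferred()


def tick():
    print('working...')
    reactor.callLater(0, tick)   # re-schedule immediately: never stops


if __name__ == '__main__':
    d = engine()
    d.addBoth(lambda _: reactor.stop())   # only reached if the Deferred ever fired
    reactor.callLater(0, tick)
    reactor.run()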


Today's topics:
1. Crawling
- crawler follow-ups
- scrapy-redis

2. Permission system (reusable component)



Details:
1. Crawling
a. Adding a proxy in Scrapy, based on a downloader middleware
- Built-in approach: os.environ (the built-in HttpProxyMiddleware reads these at startup)
os.environ['http_proxy'] = "http://root:woshiniba@192.168.11.11:9999/"    # the key must be http_proxy
os.environ['https_proxy'] = "http://192.168.11.11:9999/"

- Custom proxy: a downloader middleware (not a spider middleware), e.g. in sp1/proxy.py to match the config further down

import random
import base64

# the built-in middleware, shown for reference (not used directly below)
from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware


def to_bytes(text, encoding=None, errors='strict'):
    """Coerce str to bytes; bytes pass through unchanged."""
    if isinstance(text, bytes):
        return text
    if not isinstance(text, str):
        raise TypeError('to_bytes must receive a unicode, str or bytes '
                        'object, got %s' % type(text).__name__)
    if encoding is None:
        encoding = 'utf-8'
    return text.encode(encoding, errors)


class ProxyMiddleware(object):
    def process_request(self, request, spider):
        PROXIES = [
            {'ip_port': '111.11.228.75:80', 'user_pass': ''},
            {'ip_port': '120.198.243.22:80', 'user_pass': ''},
            {'ip_port': '111.8.60.9:8123', 'user_pass': ''},
            {'ip_port': '101.71.27.120:80', 'user_pass': ''},
            {'ip_port': '122.96.59.104:80', 'user_pass': ''},
            {'ip_port': '122.224.249.122:8088', 'user_pass': ''},
        ]
        proxy = random.choice(PROXIES)
        # tell the downloader which proxy to use for this request
        request.meta['proxy'] = "http://%s" % proxy['ip_port']
        if proxy['user_pass']:
            # proxies that need auth also get a Proxy-Authorization header
            encoded_user_pass = base64.b64encode(to_bytes(proxy['user_pass']))
            request.headers['Proxy-Authorization'] = b'Basic ' + encoded_user_pass


DOWNLOADER_MIDDLEWARES = {
    'sp1.proxy.ProxyMiddleware': 666,
}


b. HTTPS certificates
- Pay a CA for a certificate: costs money, best user experience (browsers trust it out of the box)

- Do it yourself (self-signed): users have to download and install the certificate first


c. Extensions & signals
Registering a signal handler: crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
When does it get registered: inside the extension's from_crawler classmethod, when Scrapy builds the crawler (see below)

EXTENSIONS = {
    "sp1.ext.MyExtension": 300,
}

# sp1/ext.py
from scrapy import signals


class MyExtension(object):

    @classmethod
    def from_crawler(cls, crawler):
        # the original passed a settings value to __init__; none is needed here
        ext = cls()
        # this is where the signal handlers get registered
        crawler.signals.connect(ext.xxxx1, signal=signals.spider_opened)
        crawler.signals.connect(ext.dddd1, signal=signals.spider_closed)
        return ext

    def xxxx1(self, spider):
        print('open')

    def dddd1(self, spider):
        print('close')

PS: this is the same idea as Django signals


d. Entry point / custom commands
scrapy crawl chouti
cat scrapy    # the scrapy command itself is just a short Python entry script
....

Requirement: start 10 spiders at the same time, instead of one at a time with
scrapy crawl chouti
scrapy crawl cnblogs

Custom command:
Class:
from scrapy.commands import ScrapyCommand
from scrapy.utils.project import get_project_settings


class Command(ScrapyCommand):

    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        # all registered spider names
        spider_list = self.crawler_process.spiders.list()
        for name in spider_list:
            # schedule (prepare) each spider
            self.crawler_process.crawl(name, **opts.__dict__)
        # start crawling; every spider runs in the same reactor
        self.crawler_process.start()



Config:
# custom commands package; the command name comes from the module file name,
# e.g. putting the Command class in sp1/cms/crawlall.py gives you `scrapy crawlall`
COMMANDS_MODULE = "sp1.cms"


posted @ 2018-01-18 13:55 九二零