Python Scrapy crawler framework: learning notes
As early as last year, working from material found by searching around, I had already written crawlers for a few specific sites, but my learning was not systematic: everything was pieced together from search results, and I did not really understand many of the deeper principles. Over the last couple of days I found a video course and am working through the framework more thoroughly.
I will not cover installation here; see the official site. My own environment is Windows 7, with Anaconda3 (64-bit) as the Python distribution.
2. First steps with Scrapy
Create a file called stackoverflow_spider.py with the code below, then run it from the command line as follows:
# and save the scraped data as a local JSON file (CSV, XML and other formats are also supported)
scrapy runspider stackoverflow_spider.py -o quotes.json
import scrapy

class StackOverflowSpider(scrapy.Spider):
    # the spider's name
    name = "stackoverflow"
    # initial URLs to crawl; parse is the default callback
    start_urls = ["http://stackoverflow.com/questions?sort=votes"]

    def parse(self, response):
        for href in response.css('.question-summary h3 a::attr(href)'):
            # turn the relative link into an absolute one
            full_url = response.urljoin(href.extract())
            # register parse_question as the callback
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
        yield {
            'title': response.css('h1 a::text').extract()[0],
            'votes': response.css(".question .vote-count-post::text").extract()[0],
            'body': response.css(".question .post-text").extract()[0],
            'tags': response.css('.question .post-tag::text').extract(),
            'link': response.url,
        }
Some of the framework's more advanced features:
1. Built-in data extractors (CSS and XPath selectors)
2. An interactive console (scrapy shell) for debugging extraction code
3. Built-in support for exporting results, e.g. saving them as JSON, CSV or XML
4. Automatic encoding handling
5. Support for custom extensions
6. A rich set of built-in extensions and middlewares for handling:
1) cookies and sessions
2) HTTP features like compression, authentication, caching
3) user-agent spoofing
4) robots.txt
5) crawl depth restriction
7. Remote debugging of Scrapy (telnet console)
8. And more, e.g. crawling XML/CSV feeds, automatic image downloading, and so on.
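Most of these features are switched on through project settings. As a minimal sketch (the values below are purely illustrative, not from the course), a project's settings.py might enable a few of them like this:
# settings.py (illustrative values only)
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0'  # user-agent spoofing
COOKIES_ENABLED = True   # cookie/session handling
ROBOTSTXT_OBEY = True    # respect robots.txt
DEPTH_LIMIT = 3          # crawl depth restriction
FEED_FORMAT = 'json'     # built-in export of results
FEED_URI = 'items.json'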
3. Basic usage workflow
Create a project:
scrapy startproject tutorial
This automatically generates the following files:
tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
Then use the genspider shortcut to create a spider skeleton that can be quickly edited:
scrapy genspider toc_spider toscrape.com
First, an example that simply saves the page source:
# -*- coding: utf-8 -*-
import scrapy

class TocSpiderSpider(scrapy.Spider):
    name = 'toc_spider'
    allowed_domains = ['toscrape.com']
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        # save the downloaded page under a file name taken from the URL
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
Run the crawl:
scrapy crawl toc_spider
# If you are not sure which spiders the project contains, run scrapy list first to see them.
Of course, we usually scrape specific fields and then store them in a database, so we define an Item, with one field for every piece of data we want to extract:
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy
class ToscrapeItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
Then we need to import it in the spider file:
# import at the top of the spider: the project package name followed by the Item class (you can copy this line from the items file)
from tocscrape.items import ToscrapeItem
# then, inside the spider's parsing logic:
item = ToscrapeItem()
item['name'] = response.xpath('//h1/text()').extract_first()  # example XPath only; use one that matches your target field
yield item
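Putting the pieces together, here is a minimal sketch of the spider rewritten to fill the item and yield it (the CSS selectors assume the quotes.toscrape.com page structure and are only illustrative):
import scrapy
from tocscrape.items import ToscrapeItem

class TocSpiderSpider(scrapy.Spider):
    name = 'toc_spider'
    allowed_domains = ['toscrape.com']
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = ToscrapeItem()
            # 'name' is the only field defined in ToscrapeItem above
            item['name'] = quote.css('small.author::text').extract_first()
            yield item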
4.1 Basic concepts: the command line
help: Scrapy's basic help command, used to view help information.
scrapy --help
version: show the version; add the -v flag to also print the versions of the components Scrapy relies on (the installed Python, Scrapy, lxml and so on);
scrapy version
scrapy version -v
Create a project:
scrapy startproject projectname
Generate a spider inside the project. A project can hold several spiders, but their names must be unique (and must not clash with the project name). Run this from inside the project directory after the project has been created, giving the spider name and the domain to crawl:
scrapy genspider example example.com
List the spiders in the project:
scrapy list
view: download a page and open it in the browser exactly as Scrapy sees it, so you can check whether the data you want to extract is actually present in the downloaded source:
scrapy view https://www.baidu.com/
parse: inside a project, fetch the given URL and parse it with the spider's parse method (a different callback can be selected with -c); run it from the project directory:
scrapy parse url
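For example, using the toscrape spider from earlier (--spider picks the spider and -c the callback; both are standard options of the parse command):
scrapy parse --spider=toc_spider -c parse http://quotes.toscrape.com/tag/humor/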
A very useful command for debugging extraction code, testing XPath expressions, inspecting the page source and so on: it opens an interactive console with the Scrapy response (and other objects) already loaded. A must-have for debugging:
scrapy shell url
PS: a handy trick for pulling just the matching piece out of the response: response.xpath('/html/body/div/li/em/text()').re(r'\d+')[0]
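A rough sketch of a typical shell session (prompts shown, output omitted; the selector is only illustrative):
scrapy shell http://quotes.toscrape.com/tag/humor/
>>> response.status
>>> response.css('div.quote span.text::text').extract_first()
>>> view(response)   # open the downloaded response in a browser
>>> fetch('http://quotes.toscrape.com/page/2/')   # download another page into the same session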
runspider: run a self-contained spider, i.e. a single .py file written without creating a project via startproject:
scrapy runspider demo_spider.py
bench: run a quick benchmark; it can also be used to check that Scrapy was installed correctly;
scrapy bench
5. Basic concepts: important objects in Scrapy
The Request object (see the official documentation for the latest details):
Constructor parameters:
class scrapy.http.Request(
    url [,
    callback,
    method='GET',
    headers,
    body,
    cookies,
    meta,               # very important: used to pass data between parse callbacks (see the sketch below)
    encoding='utf-8',
    priority=0,
    dont_filter=False,
    errback ] )
Other attributes and methods:
url
method
headers
body
meta
copy()
replace()
Subclasses:
FormRequest: very important, used to implement logins and other form/POST requests
The Response object: you normally never instantiate it yourself, you simply use the one Scrapy passes to your callbacks; it carries a lot of useful data (body, headers, url, css/xpath selectors, and so on)
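As a minimal sketch of the meta mechanism mentioned above (the site and field names are only illustrative): one callback stores data in the request's meta dict, and the next callback reads it back from response.meta.
import scrapy

class MetaDemoSpider(scrapy.Spider):
    name = 'meta_demo'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        for href in response.css('div.quote a::attr(href)').extract():
            # stash data from this page in meta so the next callback can use it
            yield scrapy.Request(response.urljoin(href),
                                 meta={'list_url': response.url},
                                 callback=self.parse_detail)

    def parse_detail(self, response):
        # the meta dict travels with the request and comes back on the response
        yield {'list_url': response.meta['list_url'], 'detail_url': response.url}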
For the login part, here is the code from the tutorial:
# -*- coding: utf-8 -*-
import json
import scrapy
from scrapy import FormRequest
from scrapy.mail import MailSender
from bioon import settings
from bioon.items import BioonItem
class BioonspiderSpider(scrapy.Spider):
    name = "bioonspider"
    allowed_domains = ["bioon.com"]
    start_urls = ['http://login.bioon.com/login']

    def parse(self, response):
        # get the cookie information from response.headers
        # (header values are bytes under Python 3, so decode before splitting)
        r_headers = response.headers['Set-Cookie'].decode('utf-8')
        cookies_v = r_headers.split(';')[0].split('=')
        cookies = {cookies_v[0]: cookies_v[1]}
        # headers that mimic a real browser request
        headers = {
            'Host': 'login.bioon.com',
            'Referer': 'http://login.bioon.com/login',
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0',
            'X-Requested-With': 'XMLHttpRequest'
        }
        # get the CSRF token
        csrf_token = response.xpath(
            '//input[@id="csrf_token"]/@value').extract()[0]
        # get the URL the login form posts to
        login_url = response.xpath(
            '//form[@id="login_form"]/@action').extract()[0]
        end_login = response.urljoin(login_url)
        # build the POST data
        formdata = {
            # use your own registered account name
            'account': '********',
            'client_id': 'usercenter',
            'csrf_token': csrf_token,
            'grant_type': 'grant_type',
            'redirect_uri': 'http://login.bioon.com/userinfo',
            # use your own registered username
            'username': '********',
            # use your own password
            'password': 'xxxxxxx',
        }
        # submit the simulated login request
        return FormRequest(
            end_login,
            formdata=formdata,
            headers=headers,
            cookies=cookies,
            callback=self.after_login
        )

    def after_login(self, response):
        self.log('Now handling bioon login page.')
        aim_url = 'http://news.bioon.com/Cfda/'
        obj = json.loads(response.body)
        print("Login state: ", obj['message'])
        if "success" in obj['message']:
            self.logger.info("=========Login success.==========")
            return scrapy.Request(aim_url, callback=self.parse_list)

    def parse_list(self, response):
        lis_news = response.xpath(
            '//ul[@id="cms_list"]/li/div/h4/a/@href').extract()
        for li in lis_news:
            end_url = response.urljoin(li)
            yield scrapy.Request(end_url, callback=self.parse_content)

    def parse_content(self, response):
        head = response.xpath(
            '//div[@class="list_left"]/div[@class="title5"]')[0]
        item = BioonItem()
        item['title'] = head.xpath('h1/text()').extract()[0]
        item['source'] = head.xpath('p/text()').re(r'来源:(.*?)\s(.*?)$')[0]
        item['date_time'] = head.xpath('p/text()').re(r'来源:(.*?)\s(.*?)$')[1]
        item['body'] = response.xpath(
            '//div[@class="list_left"]/div[@class="text3"]').extract()[0]
        return item

    def closed(self, reason):
        import pdb; pdb.set_trace()  # debugging aid from the tutorial: pause when the spider closes
        self.logger.info("Spider closed: %s" % str(reason))
        mailer = MailSender.from_settings(self.settings)
        mailer.send(
            to=["******@qq.com"],
            subject="Spider closed",
            body=str(self.crawler.stats.get_stats()),
            cc=["**********@xxxxxxxx.com"]
        )
The items file:
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy
class BioonItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    source = scrapy.Field()
    date_time = scrapy.Field()
    body = scrapy.Field()
The settings file:
# -*- coding: utf-8 -*-
# Scrapy settings for bioon project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
#
# Name of the bot implemented by this Scrapy project (this is also the project name).
BOT_NAME = 'bioon'
SPIDER_MODULES = ['bioon.spiders']
NEWSPIDER_MODULE = 'bioon.spiders'
# A dict of the downloader middlewares enabled in this project and their orders. Default: {}
DOWNLOADER_MIDDLEWARES = {
'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
}
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0'
# A dict of the item pipelines enabled in this project and their orders. Empty by default; the values are arbitrary,
# but by convention they are kept in the 0-1000 range.
ITEM_PIPELINES={
#'bioon.pipelines.BioonPipeline':500
}
# How long the downloader should wait between downloading pages from the same website. Use it to throttle
# the crawl and reduce the load on the server. Decimal values are supported:
DOWNLOAD_DELAY = 0.25 # 250 ms of delay
# Maximum crawl depth allowed. 0 means no limit.
DEPTH_LIMIT=0
# Whether to enable the in-memory DNS cache. Default: True
DNSCACHE_ENABLED=True
# File name to use for the log output. If None, standard error is used. Default: None
LOG_FILE='scrapy.log'
# Minimum level to log. Available levels: CRITICAL, ERROR, WARNING, INFO, DEBUG. Default: 'DEBUG'
LOG_LEVEL='DEBUG'
# If True, all standard output (and errors) of the process will be redirected to the log.
# For example, print('hello') will show up in the Scrapy log.
# Default: False
LOG_STDOUT=False
# Maximum number of concurrent requests performed against a single domain. Default: 8
CONCURRENT_REQUESTS_PER_DOMAIN=8
#Default: True ,Whether to enable the cookies middleware. If disabled, no cookies will be sent to web servers.
COOKIES_ENABLED = True
#feed settings
FEED_URI = 'file:///C:/Users/stwan/Desktop/bioon/a.txt'
FEED_FORMAT = 'jsonlines'
LOG_ENCODING = None
##----------------------Mail settings------------------------
#Default: 'scrapy@localhost', Sender email to use (From: header) for sending emails.
MAIL_FROM='*********@163.com'
#Default: 'localhost', SMTP host to use for sending emails.
MAIL_HOST="smtp.163.com"
#Default: 25, SMTP port to use for sending emails.
MAIL_PORT="25"
#Default: None, User to use for SMTP authentication. If disabled no SMTP authentication will be performed.
MAIL_USER="*********@163.com"
#Default: None, Password to use for SMTP authentication, along with MAIL_USER.
MAIL_PASS="xxxxxxxxxxxxx"
#Enforce using STARTTLS. STARTTLS is a way to take an existing insecure connection,
#and upgrade it to a secure connection using SSL/TLS.
MAIL_TLS=False
#Default: False, Enforce connecting using an SSL encrypted connection
MAIL_SSL=False
The pipelines file:
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from bioon.handledb import adb_insert_data,exec_sql
from bioon.settings import DBAPI,DBKWARGS
class BioonPipeline(object):
    def process_item(self, item, spider):
        print("Now in pipeline:")
        # print a couple of the scraped fields (the names must match the Item definition)
        print(item['title'])
        print(item['date_time'])
        print("End of pipeline.")
        # store data
        # adb_insert_data(item,"tablename",DBAPI,**DBKWARGS)
        return item
The middlewares file:
# Importing base64 library because we'll need it ONLY in case if the proxy we are going to use requires authentication
#-*- coding:utf-8-*-
import base64
import logging  # use the standard logging module (Scrapy's old log module is deprecated)

from proxy import GetIp, counter

ips = GetIp().get_ips()

class ProxyMiddleware(object):
    http_n = 0   # counter for http requests
    https_n = 0  # counter for https requests

    # overwrite process_request
    def process_request(self, request, spider):
        # Set the location of the proxy
        if request.url.startswith("http://"):
            n = ProxyMiddleware.http_n
            n = n if n < len(ips['http']) else 0
            request.meta['proxy'] = "http://%s:%d" % (ips['http'][n][0], int(ips['http'][n][1]))
            logging.info('Sequence - http: %s - %s' % (n, str(ips['http'][n])))
            ProxyMiddleware.http_n = n + 1
        if request.url.startswith("https://"):
            n = ProxyMiddleware.https_n
            n = n if n < len(ips['https']) else 0
            request.meta['proxy'] = "https://%s:%d" % (ips['https'][n][0], int(ips['https'][n][1]))
            logging.info('Sequence - https: %s - %s' % (n, str(ips['https'][n])))
            ProxyMiddleware.https_n = n + 1
Crawling the proxy IP list from the xici site (xicidaili.com):
The spider:
# -*- coding: utf-8 -*-
import scrapy
from collectips.items import CollectipsItem
class XiciSpider(scrapy.Spider):
name = "xici"
allowed_domains = ["xicidaili.com"]
start_urls = (
'http://www.xicidaili.com',
)
def start_requests(self):
reqs=[]
for i in range(1,206):
req=scrapy.Request("http://www.xicidaili.com/nn/%s"%i)
reqs.append(req)
return reqs
    def parse(self, response):
        ip_list = response.xpath('//table[@id="ip_list"]')
        trs = ip_list[0].xpath('tr')
        items = []
        for ip in trs[1:]:
            pre_item = CollectipsItem()
            pre_item['IP'] = ip.xpath('td[3]/text()')[0].extract()
            pre_item['PORT'] = ip.xpath('td[4]/text()')[0].extract()
            pre_item['POSITION'] = ip.xpath('string(td[5])')[0].extract().strip()
            pre_item['TYPE'] = ip.xpath('td[7]/text()')[0].extract()
            pre_item['SPEED'] = ip.xpath(
                'td[8]/div[@class="bar"]/@title').re(r'\d{0,2}\.\d{0,}')[0]
            pre_item['LAST_CHECK_TIME'] = ip.xpath('td[10]/text()')[0].extract()
            items.append(pre_item)
        return items
item:
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy
class CollectipsItem(scrapy.Item):
    # define the fields for your item here like:
    IP = scrapy.Field()
    PORT = scrapy.Field()
    POSITION = scrapy.Field()
    TYPE = scrapy.Field()
    SPEED = scrapy.Field()
    LAST_CHECK_TIME = scrapy.Field()
settings:
# -*- coding: utf-8 -*-
# Scrapy settings for collectips project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'collectips'
SPIDER_MODULES = ['collectips.spiders']
NEWSPIDER_MODULE = 'collectips.spiders'
# database connection parameters
DBKWARGS={'db':'ippool','user':'root', 'passwd':'toor',
'host':'localhost','use_unicode':True, 'charset':'utf8'}
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'collectips.pipelines.CollectipsPipeline': 300,
}
#Configure log file name
LOG_FILE = "scrapy.log"
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0'
pipelines:
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import MySQLdb
class CollectipsPipeline(object):
    def process_item(self, item, spider):
        DBKWARGS = spider.settings.get('DBKWARGS')
        con = MySQLdb.connect(**DBKWARGS)
        cur = con.cursor()
        sql = ("insert into proxy(IP,PORT,TYPE,POSITION,SPEED,LAST_CHECK_TIME) "
               "values(%s,%s,%s,%s,%s,%s)")
        lis = (item['IP'], item['PORT'], item['TYPE'], item['POSITION'], item['SPEED'],
               item['LAST_CHECK_TIME'])
        try:
            cur.execute(sql, lis)
        except Exception as e:
            print("Insert error:", e)
            con.rollback()
        else:
            con.commit()
        cur.close()
        con.close()
        return item
Code for crawling Tmall products:
The spider:
# -*- coding: utf-8 -*-
import scrapy
from topgoods.items import TopgoodsItem
class TmGoodsSpider(scrapy.Spider):
    name = "tm_goods"
    allowed_domains = ["tmall.com"]  # allowed_domains takes bare domain names, not full URLs
    start_urls = (
        'http://list.tmall.com/search_product.htm?type=pc&totalPage=100&cat=50025135&sort=d&style=g&from=sn_1_cat-qp&active=1&jumpto=10#J_Filter',
    )
    # counter of list pages processed
    count = 0

    def parse(self, response):
        TmGoodsSpider.count += 1
        divs = response.xpath("//div[@id='J_ItemList']/div[@class='product']/div")
        if not divs:
            self.log("List Page error--%s" % response.url)
        print("Goods numbers: ", len(divs))
        for div in divs:
            item = TopgoodsItem()
            # product price
            item["GOODS_PRICE"] = div.xpath("p[@class='productPrice']/em/@title")[0].extract()
            # product name
            item["GOODS_NAME"] = div.xpath("p[@class='productTitle']/a/@title")[0].extract()
            # product link
            pre_goods_url = div.xpath("p[@class='productTitle']/a/@href")[0].extract()
            item["GOODS_URL"] = pre_goods_url if "http:" in pre_goods_url else ("http:" + pre_goods_url)
            # image link
            try:
                file_urls = div.xpath('div[@class="productImg-wrap"]/a[1]/img/@src|'
                                      'div[@class="productImg-wrap"]/a[1]/img/@data-ks-lazyload').extract()[0]
                item['file_urls'] = ["http:" + file_urls]
            except Exception as e:
                print("Error: ", e)
                import pdb; pdb.set_trace()  # debugging aid: stop here if the image URL is missing
            yield scrapy.Request(url=item["GOODS_URL"], meta={'item': item}, callback=self.parse_detail,
                                 dont_filter=True)

    def parse_detail(self, response):
        div = response.xpath('//div[@class="extend"]/ul')
        if not div:
            self.log("Detail Page error--%s" % response.url)
        item = response.meta['item']
        div = div[0]
        # shop name
        item["SHOP_NAME"] = div.xpath("li[1]/div/a/text()")[0].extract()
        # shop link
        pre_shop_url = div.xpath("li[1]/div/a/@href")[0].extract()
        item["SHOP_URL"] = response.urljoin(pre_shop_url)
        # company name
        item["COMPANY_NAME"] = div.xpath("li[3]/div/text()")[0].extract().strip()
        # company location
        item["COMPANY_ADDRESS"] = div.xpath("li[4]/div/text()")[0].extract().strip()
        yield item
items:
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy
class TopgoodsItem(scrapy.Item):
    # define the fields for your item here like:
    GOODS_PRICE = scrapy.Field()
    GOODS_NAME = scrapy.Field()
    GOODS_URL = scrapy.Field()
    SHOP_NAME = scrapy.Field()
    SHOP_URL = scrapy.Field()
    COMPANY_NAME = scrapy.Field()
    COMPANY_ADDRESS = scrapy.Field()
    # image URLs
    file_urls = scrapy.Field()
The settings file (note that this one also sets up image downloading):
# -*- coding: utf-8 -*-
# Scrapy settings for topgoods project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
#
BOT_NAME = 'topgoods'
SPIDER_MODULES = ['topgoods.spiders']
NEWSPIDER_MODULE = 'topgoods.spiders'
DOWNLOADER_MIDDLEWARES = {
'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware':301,
}
# the next three lines set up image downloading
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_URLS_FIELD = 'file_urls'
IMAGES_STORE = r'.'
# IMAGES_THUMBS = {
# 'small': (50, 50),
# 'big': (270, 270),
# }
LOG_FILE = "scrapy.log"
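One note on the image download settings: the ImagesPipeline requires the Pillow library to be installed, and with the settings above the downloaded images are written under IMAGES_STORE into a full/ subfolder, with file names derived from a SHA-1 hash of the image URL.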
Settings for the proxy IP module:
middlewares:
# Importing base64 library because we'll need it ONLY in case
#if the proxy we are going to use requires authentication
#-*- coding:utf-8-*-
import base64
from proxy import GetIp,counter
import logging
ips=GetIp().get_ips()
class ProxyMiddleware(object):
    http_n = 0   # counter for http requests
    https_n = 0  # counter for https requests

    # overwrite process_request
    def process_request(self, request, spider):
        # Set the location of the proxy
        if request.url.startswith("http://"):
            n = ProxyMiddleware.http_n
            n = n if n < len(ips['http']) else 0
            request.meta['proxy'] = "http://%s:%d" % (
                ips['http'][n][0], int(ips['http'][n][1]))
            logging.info('Sequence - http: %s - %s' % (n, str(ips['http'][n])))
            ProxyMiddleware.http_n = n + 1
        if request.url.startswith("https://"):
            n = ProxyMiddleware.https_n
            n = n if n < len(ips['https']) else 0
            request.meta['proxy'] = "https://%s:%d" % (
                ips['https'][n][0], int(ips['https'][n][1]))
            logging.info('Sequence - https: %s - %s' % (n, str(ips['https'][n])))
            ProxyMiddleware.https_n = n + 1
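The settings shown earlier only register the built-in HttpProxyMiddleware. As a sketch (replace 'yourproject' with the actual project package name; the order values are only examples), the custom middleware itself would be registered in settings.py roughly like this:
DOWNLOADER_MIDDLEWARES = {
    'yourproject.middlewares.ProxyMiddleware': 100,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
}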
Create proxy.py in the same directory:
import sys
from handledb import exec_sql
import socket
import urllib.request as urllib2  # Python 3: the old urllib2 module became urllib.request
dbapi="MySQLdb"
kwargs={'user':'root','passwd':'toor','db':'ippool','host':'localhost', 'use_unicode':True}
def counter(start_at=0):
    '''Function: count number
    Usage: f=counter(i) print f() #i+1'''
    count = [start_at]
    def incr():
        count[0] += 1
        return count[0]
    return incr

def use_proxy(browser, proxy, url):
    '''Open browser with proxy'''
    # switch to a new ip after each visit
    profile = browser.profile
    profile.set_preference('network.proxy.type', 1)
    profile.set_preference('network.proxy.http', proxy[0])
    profile.set_preference('network.proxy.http_port', int(proxy[1]))
    profile.set_preference('permissions.default.image', 2)
    profile.update_preferences()
    browser.profile = profile
    browser.get(url)
    browser.implicitly_wait(30)
    return browser
class Singleton(object):
    '''Singleton instance example.'''
    def __new__(cls, *args, **kw):
        if not hasattr(cls, '_instance'):
            orig = super(Singleton, cls)
            # object.__new__() takes no extra arguments in Python 3
            cls._instance = orig.__new__(cls)
        return cls._instance
class GetIp(Singleton):
    def __init__(self):
        sql = '''SELECT `IP`,`PORT`,`TYPE`
            FROM `proxy`
            WHERE `TYPE` REGEXP 'HTTP|HTTPS'
            AND `SPEED`<5 OR `SPEED` IS NULL
            ORDER BY `proxy`.`TYPE` ASC
            LIMIT 50 '''
        self.result = exec_sql(sql, **kwargs)
    def del_ip(self, record):
        '''delete an ip that cannot be used'''
        sql = "delete from proxy where IP='%s' and PORT='%s'" % (record[0], record[1])
        print(sql)
        exec_sql(sql, **kwargs)
        print(record, "was deleted.")
    def judge_ip(self, record):
        '''Judge whether an IP can be used or not'''
        http_url = "http://www.baidu.com/"
        https_url = "https://www.alipay.com/"
        proxy_type = record[2].lower()
        url = http_url if proxy_type == "http" else https_url
        proxy = "%s:%s" % (record[0], record[1])
        try:
            req = urllib2.Request(url=url)
            req.set_proxy(proxy, proxy_type)
            response = urllib2.urlopen(req, timeout=30)
        except Exception as e:
            print("Request Error:", e)
            self.del_ip(record)
            return False
        else:
            code = response.getcode()
            if 200 <= code < 300:
                print('Effective proxy', record)
                return True
            else:
                print('Invalid proxy', record)
                self.del_ip(record)
                return False
    def get_ips(self):
        print("Proxy getip was executed.")
        http = [h[0:2] for h in self.result if h[2] == "HTTP" and self.judge_ip(h)]
        https = [h[0:2] for h in self.result if h[2] == "HTTPS" and self.judge_ip(h)]
        print("Http: ", len(http), "Https: ", len(https))
        return {"http": http, "https": https}