2017.08.05 Python网络爬虫实战之获取代理

1.项目准备:爬取网站:http://www.proxy360.cn/Region/China,http://www.xicidaili.com/

2.创建编辑Scrapy爬虫:

scrapy startproject getProxy

scrapy genspider proxy360Spider proxy360.cn

项目目录结构:

 

3.修改items.py:

 

 4.修改Spider.py文件 proxy360Spider.py:

(1)先使用scrapy shell命令查看一下连接网络返回的结果和数据:

scrapy shell http://www.proxy360.cn/Region/China

(2)再看一下response的数据内容:response.xpath('/*').extract(),返回的数据中含有代理服务器;

(3)观察发现所有的数据模块都是以<div class="proxylistitem" name="list_proxy_ip">这个tag开头的:

(4)在scrapy shell中测试一下:

subSelector=response.xpath('//div[@class="proxylistitem" and @name="list_proxy_ip"]')

subSelector.xpath('.//span[1]/text()').extract()[0]

 subSelector.xpath('.//span[2]/text()').extract()[0]

 subSelector.xpath('.//span[3]/text()').extract()[0]

 subSelector.xpath('.//span[4]/text()').extract()[0]

 

(5) 编写Spider文件 proxy360Spider.py:

# -*- coding: utf-8 -*-
import scrapy
from getProxy.items import GetproxyItem

class Proxy360spiderSpider(scrapy.Spider):
name = 'proxy360Spider'
allowed_domains = ['proxy360.cn']

nations=['Brazil','China','Amercia','Taiwan','Japan','Thailand','Vietnam','bahrein']
start_urls=[ ]
for nation in nations:
start_urls.append('http://www.proxy360.cn/Region/'+nation)

def parse(self, response):
subSelector=response.xpath('//div[@class="proxylistitem" and @name="list_proxy_ip"]')
items=[]
for sub in subSelector:
item=GetproxyItem()
item['ip']=sub.xpath('.//span[1]/text()').extract()[0]
item['port']=sub.xpath('.//span[2]/text()').extract()[0]
item['type']=sub.xpath('.//span[3]/text()').extract()[0]
item['loction']=sub.xpath('.//span[4]/text()').extract()[0]
item['protocol']='HTTP'
item['source']='proxy360'
items.append(item)
return items



(6)修改pipelines.py文件,处理:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class GetproxyPipeline(object):
def process_item(self, item, spider):
fileName='proxy.txt'
with open(fileName,'a') as fp:
fp.write(item['ip'].encode('utf8').strip()+'\t')
fp.write(item['port'].encode('utf8').strip()+'\t')
fp.write(item['protocol'].encode('utf8').strip()+'\t')
fp.write(item['type'].encode('utf8').strip()+'\t')
fp.write(item['loction'].encode('utf8').strip()+'\t')
fp.write(item['source'].encode('utf8').strip()+'\n')
return item


(7)修改Settings.py,决定由哪个文件来处理获取的数据:

(8)执行结果:

 

5.多个Spider,只有一个Spdier的时候得到的proxy数据不够多:

(1 )到getProxy目录下,执行:scrapy  genspider xiciSpider xicidaili.com

(2)确定如何获取数据:scrapy shell http://www.xicidaili.com/nn/2

 (3)只需要在settings.py中添加一个USER_Agent项就可以了

再次测试如何获取数据:scrapy shell http://www.xicidaili.com/nn/2

(4)在浏览器中查看源代码:发现所需的数据块都是以<tr class="odd">开头的

(5)在scrapy shell中执行命令:

 subSelector=response.xpath('//tr[@class=""]| //tr[@class="odd"]')

subSelector[0].xpath('.//td[2]/text()').extract()[0]

 subSelector[0].xpath('.//td[3]/text()').extract()[0]

 subSelector[0].xpath('.//td[4]/a/text()').extract()[0]

subSelector[0].xpath('.//td[5]/text()').extract()[0]

subSelector[0].xpath('.//td[6]/text()').extract()[0]

 

(6)编写xiciSpider.py:

# -*- coding: utf-8 -*-
import scrapy
from getProxy.items import GetproxyItem

class XicispdierSpider(scrapy.Spider):
name = 'xiciSpdier'
allowed_domains = ['xicidaili.com']
wds=['nn','nt','wn','wt']
pages=20
start_urls=[]
for type in wds:
for i in xrange(1,pages+1):
start_urls.append('http://www.xicidaili.com/'+type+'/'+str(i))


def parse(self, response):
subSelector=response.xpath('//tr[@class=""]| //tr[@class="odd"]')
items=[]
for sub in subSelector:
item=GetproxyItem()
item['ip']=sub.xpath('.//td[2]/text()').extract()[0]
item['port']=sub.xpath('.//td[3]/text()').extract()[0]
item['type']=sub.xpath('.//td[5]/text()').extract()[0]
if sub.xpath('.//td[4]/a/text()'):
item['loction']=sub.xpath('.//td[4]/a/text()').extract()[0]
else:
item['loction']=sub.xpath('.//td[4]/text()').extract()[0]

item['protocol']=sub.xpath('.//td[6]/text()').extract()[0]
item['source']='xicidaili'
items.append(item)


return items

 (7)执行:scrapy crawl xiciSpider

结果:

 

 

6.验证获取的代理服务器地址是否可用:另外写一个python程序验证代理:testProxy.py

#! /usr/bin/env python
# -*- coding: utf-8 -*-


import urllib2
import re
import threading

class TesyProxy(object):
def __init__(self):
self.sFile=r'proxy.txt'
self.dFile=r'alive.txt'
self.URL=r'http://www.baidu.com/'
self.threads=10
self.timeout=3
self.regex=re.compile(r'baidu.com')
self.aliveList=[]

self.run()

def run(self):
with open(self.sFile,'r') as fp:
lines=fp.readlines()
line=lines.pop()
while lines:
for i in xrange(self.threads):
t=threading.Thread(target=self.linkWithProxy,args=(line,))

t.start()
if lines:
line=lines.pop()
else:
continue

with open(self.dFile,'w') as fp:
for i in xrange(len(self.aliveList)):
fp.write(self.aliveList[i])

def linkWithProxy(self,line):
lineList=line.split('\t')
protocol=lineList[2].lower()
server=protocol+r'://'+lineList[0]+':'+lineList[1]
opener=urllib2.build_opener(urllib2.ProxyHandler({protocol:server}))
urllib2.install_opener(opener)
try:
response=urllib2.urlopen(self.URL,timeout=self.timeout)
except:
print('%s connect failed' %server)
return
else:
try:
str=response.read()
except:
print('%s connect failed' %server)
return
if self.regex.search(str):
print('%s connect success ............' %server)
self.aliveList.append(line)


if __name__ == '__main__':
TP=TesyProxy()


执行命令:python testProxy.py
结果:

 


 

posted @ 2017-08-07 19:19  小春熙子  阅读(692)  评论(0编辑  收藏  举报