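Python Web Scraping in Practice (4): Simulated Login to Douban (Simulated Login and Captcha Handling with Scrapy)
When you crawl sites with the Scrapy framework, you will sooner or later hit one that requires a login before it serves any information.
Over the past two days I have been learning how to simulate a login. Combining code I wrote myself with ideas borrowed from other people's projects, I got a simulated Douban login working and, along the way, automated the captcha handling.
Captchas are usually handled through a captcha-solving platform, though you could also recognise them with machine learning. Later on I would like to build my own machine-learning captcha-recognition API, trained on the major sites, for my own use. (I don't know yet whether I can pull it off; one step at a time!)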
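Approach
1. Key points for logging in to Douban
- Find the real POST address: locate its form data in the browser's developer tools (press F12).
- Simulate the POST: build form data with the same fields.
- Handle the captcha: use a captcha-solving platform.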
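The second point, building the POST body, can be sketched as a small helper. The field names `form_email`, `form_password`, and `captcha-solution` are the ones the spider below submits; the function name `build_login_form` and the example credentials are hypothetical, and the sketch is written for Python 3 purely as an illustration.

```python
def build_login_form(email, password, captcha_value=None):
    """Return the form fields for the login POST.

    The `captcha-solution` field is added only when a captcha value
    is supplied, mirroring the two cases the login page can present.
    """
    data = {
        "form_email": email,
        "form_password": password,
    }
    if captcha_value:
        data["captcha-solution"] = captcha_value
    return data


if __name__ == "__main__":
    # Placeholder credentials, for illustration only.
    print(build_login_form("user@example.com", "secret"))
    print(build_login_form("user@example.com", "secret", "abcd"))
```

Keeping the captcha field optional means the same helper covers both the captcha and no-captcha branches.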
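Hands-on
The code below was debugged and working as of 2017-4-5.
Target site: Douban
Goal: simulate a Douban login and handle the captcha; reaching the personal home page counts as success.
Data: no data is scraped. This exercise is only about learning simulated login and captcha handling. If you need to scrape data, write the corresponding extraction rules.
[Figure: screenshot of a successful login]
Only the main code is shown here; the complete code is on my GitHub: https://github.com/pujinxiao/douban_login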
The main code of DouBan.py in the spiders folder:
```python
# -*- coding: utf-8 -*-
# Python 2 / Scrapy
import re
import urllib  # only needed for the manual-captcha option below

import scrapy
from scrapy.http import Request, FormRequest

import ruokuai


class DoubanSpider(scrapy.Spider):
    name = "DouBan"
    allowed_domains = ["douban.com"]
    # start_urls = ['http://douban.com/']
    # Browser-like User-Agent used for the simulated login
    header = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"}

    def start_requests(self):
        url = 'https://www.douban.com/accounts/login'
        # In meta={'cookiejar': 1} the 1 is an identifier; passing different
        # identifiers lets one spider keep several cookie sessions apart.
        return [Request(url=url, meta={"cookiejar": 1}, callback=self.parse)]

    def parse(self, response):
        # Link to the captcha image, if the login page shows one
        captcha = response.xpath('//*[@id="captcha_image"]/@src').extract()
        print captcha
        if len(captcha) > 0:
            # A captcha is present.
            # Option 1: enter the captcha by hand
            # urllib.urlretrieve(captcha[0], filename="C:/Users/pujinxiao/Desktop/learn/douban20170405/douban/douban/spiders/captcha.png")
            # captcha_value = raw_input('Open captcha.png and enter the captcha: ')

            # Option 2: the Ruokuai captcha-solving platform. The captcha is
            # letters of arbitrary length, so the success rate is fairly low.
            captcha_value = ruokuai.get_captcha(captcha[0])
            reg = re.compile(r'<Result>(.*?)</Result>')
            captcha_value = re.findall(reg, captcha_value)[0]
            print 'Captcha:', captcha_value

            data = {
                "form_email": "********",     # your Douban account email
                "form_password": "********",  # your Douban password
                "captcha-solution": captcha_value,
                # Page to redirect to after login; we want the personal home page
                # "redir": "https://www.douban.com/people/151968962/",
            }
        else:
            # No captcha this time
            print 'No captcha'
            data = {
                "form_email": "********",
                "form_password": "********",
                # "redir": "https://www.douban.com/people/151968962/",
            }
        print 'Logging in...'
        # Log in with FormRequest.from_response()
        return [
            FormRequest.from_response(
                response,
                meta={"cookiejar": response.meta["cookiejar"]},
                headers=self.header,
                formdata=data,
                callback=self.get_content,
            )
        ]

    def get_content(self, response):
        title = response.xpath('//title/text()').extract()[0]
        # The login page title contains "登录豆瓣" ("Log in to Douban");
        # still seeing it means the login did not go through.
        if u'登录豆瓣' in title:
            print 'Login failed, please try again!'
        else:
            print 'Login succeeded'
            # Further crawling can continue from here
```
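The spider reads the captcha text out of the platform's reply with the pattern `<Result>(.*?)</Result>`. If recognition fails and the tag is missing, indexing the list of matches raises `IndexError`. A slightly more defensive variant might look like the sketch below (Python 3). The function name `extract_captcha` and both sample replies are assumptions made for illustration; only the `<Result>` tag comes from the spider above, and the real reply may carry other fields.

```python
import re

RESULT_RE = re.compile(r"<Result>(.*?)</Result>", re.S)


def extract_captcha(reply):
    """Return the text inside <Result>...</Result>, or None if absent or empty."""
    match = RESULT_RE.search(reply)
    if match is None:
        return None
    value = match.group(1).strip()
    return value or None


if __name__ == "__main__":
    # Hypothetical replies, not captured from the real service.
    ok_reply = "<Root><Result>quick</Result></Root>"
    bad_reply = "<Root><Error>timeout</Error></Root>"
    print(extract_captcha(ok_reply))
    print(extract_captcha(bad_reply))
```

Returning `None` lets the caller decide whether to retry the captcha or fall back to manual entry instead of crashing the spider.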
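The code of ruokuai.py is as follows:
I use the Ruokuai captcha-solving platform in URL mode: pass the platform the URL of the captcha image and it returns the captcha value.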
```python
# -*- coding: utf-8 -*-
# Python 2
import hashlib
import urllib
import urllib2
from datetime import datetime


class APIClient(object):
    def http_request(self, url, paramDict):
        # URL-encode the fields so values containing '&' or '=' (such as the
        # captcha image URL) arrive intact.
        post_content = urllib.urlencode(paramDict)
        req = urllib2.Request(url, data=post_content)
        req.add_header('Content-Type', 'application/x-www-form-urlencoded')
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
        response = opener.open(req)
        return response.read()

    def http_upload_image(self, url, paramKeys, paramDict, filebytes):
        # Build a multipart/form-data body by hand: one part per parameter,
        # then the image itself.
        timestr = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        boundary = '------------' + hashlib.md5(timestr).hexdigest().lower()
        boundarystr = '\r\n--%s\r\n' % (boundary)

        bs = b''
        for key in paramKeys:
            bs = bs + boundarystr.encode('ascii')
            param = "Content-Disposition: form-data; name=\"%s\"\r\n\r\n%s" % (key, paramDict[key])
            bs = bs + param.encode('utf8')
        bs = bs + boundarystr.encode('ascii')

        header = 'Content-Disposition: form-data; name=\"image\"; filename=\"%s\"\r\nContent-Type: image/gif\r\n\r\n' % ('sample')
        bs = bs + header.encode('utf8')

        bs = bs + filebytes
        tailer = '\r\n--%s--\r\n' % (boundary)
        bs = bs + tailer.encode('ascii')

        import requests
        headers = {'Content-Type': 'multipart/form-data; boundary=%s' % boundary,
                   'Connection': 'Keep-Alive',
                   'Expect': '100-continue',
                   }
        response = requests.post(url, params='', data=bs, headers=headers)
        return response.text


def arguments_to_dict(args):
    # Turn ['prog', 'a=1', 'b=2'] into {'a': '1', 'b': '2'}
    argDict = {}
    if args is None:
        return argDict

    count = len(args)
    if count <= 1:
        print 'exit:need arguments.'
        return argDict

    # Walk every argument after the program name
    for i in range(1, count):
        pair = args[i].split('=')
        if len(pair) < 2:
            continue
        argDict[pair[0]] = pair[1]

    return argDict


def get_captcha(image_url):
    client = APIClient()
    while 1:
        paramDict = {}
        result = ''
        # Type "url" to have the platform fetch the captcha from image_url
        act = raw_input('Enter the solving mode (url): ')
        if act == 'info':
            paramDict['username'] = raw_input('username:')
            paramDict['password'] = raw_input('password:')
            result = client.http_request('http://api.ruokuai.com/info.xml', paramDict)
        elif act == 'register':
            paramDict['username'] = raw_input('username:')
            paramDict['password'] = raw_input('password:')
            paramDict['email'] = raw_input('email:')
            result = client.http_request('http://api.ruokuai.com/register.xml', paramDict)
        elif act == 'recharge':
            paramDict['username'] = raw_input('username:')
            paramDict['id'] = raw_input('id:')
            paramDict['password'] = raw_input('password:')
            result = client.http_request('http://api.ruokuai.com/recharge.xml', paramDict)
        elif act == 'url':
            paramDict['username'] = '********'
            paramDict['password'] = '********'
            paramDict['typeid'] = '2000'
            paramDict['timeout'] = '90'
            paramDict['softid'] = '76693'
            paramDict['softkey'] = 'ec2b5b2a576840619bc885a47a025ef6'
            paramDict['imageurl'] = image_url
            result = client.http_request('http://api.ruokuai.com/create.xml', paramDict)
        elif act == 'report':
            paramDict['username'] = raw_input('username:')
            paramDict['password'] = raw_input('password:')
            paramDict['id'] = raw_input('id:')
            result = client.http_request('http://api.ruokuai.com/create.xml', paramDict)
        elif act == 'upload':
            paramDict['username'] = '********'
            paramDict['password'] = '********'
            paramDict['typeid'] = '2000'
            paramDict['timeout'] = '90'
            paramDict['softid'] = '76693'
            paramDict['softkey'] = 'ec2b5b2a576840619bc885a47a025ef6'
            paramKeys = ['username',
                         'password',
                         'typeid',
                         'timeout',
                         'softid',
                         'softkey'
                         ]

            from PIL import Image
            imagePath = raw_input('Image Path:')
            # Image.open raises IOError on a missing or unreadable file
            try:
                img = Image.open(imagePath)
            except IOError:
                print 'get file error!'
                continue
            img.save("upload.gif", format="gif")
            with open("upload.gif", "rb") as f:
                filebytes = f.read()
            result = client.http_upload_image("http://api.ruokuai.com/create.xml", paramKeys, paramDict, filebytes)

        elif act == 'help':
            print 'info'
            print 'register'
            print 'recharge'
            print 'url'
            print 'report'
            print 'upload'
            print 'help'
            print 'exit'
        elif act == 'exit':
            break

        return result
```
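The body that `http_request` posts is form-encoded (`application/x-www-form-urlencoded`). A captcha link that carries its own query string contains `?`, `=`, and `&`, so each value has to be percent-encoded, otherwise the server would split it into separate fields. Below is a minimal Python 3 sketch of that encoding step using the standard library's `urllib.parse`. The helper name `build_post_body` and the example URL are hypothetical.

```python
from urllib.parse import urlencode, parse_qs


def build_post_body(params):
    """Percent-encode a dict into an x-www-form-urlencoded body."""
    return urlencode(params)


if __name__ == "__main__":
    # Hypothetical captcha URL that contains its own query string.
    params = {
        "typeid": "2000",
        "imageurl": "https://example.com/captcha?id=abc&size=s",
    }
    body = build_post_body(params)
    print(body)
    # Decoding the body gives back the original value in one piece.
    decoded = parse_qs(body)
    print(decoded["imageurl"][0])
```

Joining `key=value` pairs by hand skips this escaping, which only works as long as no value contains a reserved character.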
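Notes
Key points:
- Using return Request
```python
# In meta={'cookiejar': 1} the 1 is an identifier; passing different
# identifiers lets you use several cookie sessions.
return [Request(url=url, meta={"cookiejar": 1}, callback=self.parse)]
```
- Using the captcha-solving platform
Simply call the platform's interface that takes the URL of the captcha image.
- Using FormRequest
```python
return [
    FormRequest.from_response(
        response,
        meta={"cookiejar": response.meta["cookiejar"]},
        headers=self.header,
        formdata=data,
        callback=self.get_content,
    )
]
```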
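Author: 今孝
Source: http://www.cnblogs.com/jinxiao-pu/p/6670672.html
Copyright of this article is shared by the author and cnblogs (博客园). Reposting is welcome, but unless the author agrees otherwise this notice must be kept and a link to the original must be shown prominently on the page.
Readers are welcome to point out mistakes; I will correct them so that we can all improve together.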