Simulating a Sina Weibo Login with Scrapy
Hi all, simulating a login with Scrapy is really simple. Everything below assumes you already have Scrapy installed and working.
First, analyze Sina Weibo's login flow. A packet-capture tool gives the picture below:
[packet-capture screenshot of the login requests]
Generally speaking, logging in is mostly a matter of POSTing data to the server. If the site has a captcha, you need captcha recognition, which is computer-vision territory that Scrapy can't handle. Sina Weibo is a bit special. Everyone should realize that Sina is a big company and won't simply let you POST data directly: before the POST there is a GET request that fetches some parameters from the server. So the first thing we do is write that GET request:
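Before diving into Scrapy, here is a minimal sketch of that two-step flow using plain urllib2 (Python 2), just to make the protocol concrete; the account name is a placeholder:

# Step 1: GET the prelogin URL to obtain servertime/nonce (sketch, Python 2).
import re
import time
import urllib2

prelogin = ('http://login.sina.com.cn/sso/prelogin.php?entry=miniblog'
            '&callback=sinaSSOController.preloginCallBack&user=%s'
            '&client=ssologin.js(v1.3.14)&_=%s'
            % ('your_username', str(time.time()).replace('.', '')))
body = urllib2.urlopen(prelogin).read()
servertime, nonce = re.findall('"servertime":(.*?),"nonce":"(.*?)"', body)[0]
# Step 2 would POST the login form with servertime/nonce filled in,
# which is exactly what the Scrapy spider built below does.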
Step 1: use Scrapy's command line to generate a template spider
E:\workspace\TribuneSpider\src>scrapy genspider -t crawl weibo weibo.com
Created spider 'weibo' using template 'crawl' in module:
src.spiders.weibo
The last two lines above are Scrapy's confirmation. This generates the following skeleton code:
#! -*- encoding:utf-8 -*-
import re
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from src.items import SrcItem

class WeiboSpider(CrawlSpider):
    '''
    An example of simulating a Sina Weibo login with Scrapy;
    I hope it is of some help.
    '''
    name = 'weibo'
    allowed_domains = ['weibo.com']
    start_urls = ['http://www.weibo.com/']
    rules = (
        Rule(SgmlLinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        i = SrcItem()
        #i['domain_id'] = hxs.select('//input[@id="sid"]/@value').extract()
        #i['name'] = hxs.select('//div[@id="name"]').extract()
        #i['description'] = hxs.select('//div[@id="description"]').extract()
        return i
Step 2: adjust the spider. For a WeiboSpider inheriting from CrawlSpider, the name and allowed_domains attributes are mandatory, but start_urls is optional and can be replaced by another method. Next, modify the code to fetch what Sina's first link returns; before that, we need to know what format this link generally returns:
sinaSSOController.preloginCallBack({"retcode":0,"servertime":1314019864,"nonce":"J1F9XN"})
The response takes the form of the snippet above, and the login POST will need the servertime and nonce values it carries. The modified code:
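Two ways to peel servertime and nonce out of that JSONP wrapper, shown as a quick sketch (body is assumed to hold the response text; the spider below uses the regex route):

import json
import re

# Regex route, as used later in the spider:
servertime, nonce = re.findall('{"retcode":0,"servertime":(.*?),"nonce":"(.*?)"}', body, re.I)[0]
# json route: strip the 'sinaSSOController.preloginCallBack(...)' wrapper first.
payload = re.search(r'\((.*)\)', body).group(1)
data = json.loads(payload)
servertime, nonce = data['servertime'], data['nonce']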
#! -*- encoding:utf-8 -*-
import re
import time
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http import Request
from src.items import SrcItem

class WeiboSpider(CrawlSpider):
    '''
    An example of simulating a Sina Weibo login with Scrapy;
    I hope it is of some help.
    '''
    name = 'weibo'
    allowed_domains = ['weibo.com', 'sina.com.cn']

    def start_requests(self):
        username = '********'
        url = 'http://login.sina.com.cn/sso/prelogin.php?entry=miniblog&callback=sinaSSOController.preloginCallBack&user=%s&client=ssologin.js(v1.3.14)&_=%s' % \
            (username, str(time.time()).replace('.', ''))
        print url
        return [Request(url=url, method='get', callback=self.parse_item)]

    def parse_item(self, response):
        print response.body
        hxs = HtmlXPathSelector(response)
        i = SrcItem()
        #i['domain_id'] = hxs.select('//input[@id="sid"]/@value').extract()
        #i['name'] = hxs.select('//div[@id="name"]').extract()
        #i['description'] = hxs.select('//div[@id="description"]').extract()
        return i
Notes on the code:
1. Scrapy has an offsite mechanism that controls whether links to other domains get crawled, so sina.com.cn has to be added to allowed_domains.
2. Inspecting the GET URL shows it wants your username plus a timestamp for the current request. str(time.time()) is almost right; it just contains a decimal point, and stripping it yields a millisecond-style timestamp string (see the short sketch after this list).
3. callback names the callback function; this touches Scrapy's internal execution machinery, which runs fairly deep, so I won't expand on it here. parse_item then prints response.body, the body of this response. The response object has many attributes, e.g. response.request, response.url, response.headers, and so on; you can explore them yourself with dir(response).
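A quick illustration of item 2, with the kind of values involved (sketch):

import time
t = time.time()                    # e.g. 1314019864.12
ts = str(t).replace('.', '')       # e.g. '131401986412' -- the dot is gone
print ts                           # this is the value appended as _=...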
Now run this spider and see whether it works. The result:
E:\workspace\TribuneSpider\src>scrapy crawl weibo
2011-08-22 22:02:47+0800 [scrapy] INFO: Scrapy 0.12.0.2539 started (bot: src)
2011-08-22 22:02:47+0800 [scrapy] DEBUG: Enabled extensions: TelnetConsole, SpiderContext, WebService, CoreStats, CloseSpider
2011-08-22 22:02:47+0800 [scrapy] DEBUG: Enabled scheduler middlewares: DuplicatesFilterMiddleware
2011-08-22 22:02:47+0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, DownloaderStats
2011-08-22 22:02:47+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2011-08-22 22:02:47+0800 [scrapy] DEBUG: Enabled item pipelines: SrcPipeline
http://login.sina.com.cn/sso/prelogin.php?entry=miniblog&callback=sinaSSOController.preloginCallBack&user=*******&client=ssologin.js(v1.3.14)&_=131402176767
2011-08-22 22:02:47+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2011-08-22 22:02:47+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2011-08-22 22:02:47+0800 [weibo] INFO: Spider opened
2011-08-22 22:02:47+0800 [weibo] DEBUG: Crawled (200) <GET http://login.sina.com.cn/sso/prelogin.php?entry=miniblog&callback=sinaSSOController.preloginCallBack&user=******&client=ssologin.js(v1.3.14)&_=131402176767> (referer: None)
sinaSSOController.preloginCallBack({"retcode":0,"servertime":1314021767,"nonce":"0G2Q3S"})
2011-08-22 22:02:47+0800 [weibo] DEBUG: Scraped SrcItem() in <http://login.sina.com.cn/sso/prelogin.php?entry=miniblog&callback=sinaSSOController.preloginCallBack&user=********&client=ssologin.js(v1.3.14)&_=131402176767>
2011-08-22 22:02:47+0800 [weibo] INFO: Passed SrcItem()
2011-08-22 22:02:47+0800 [weibo] INFO: Closing spider (finished)
2011-08-22 22:02:47+0800 [weibo] INFO: Spider closed (finished)
Do you see the sinaSSOController.preloginCallBack(...) response and the scraped item in the output? Bingo.
Next we add to it: the POST request.
1. Analyze the data in the POST request:
[packet-capture screenshot of the POST form fields]
It really is that simple: take this whole pile of values and POST them all to the POST URL. What could be easier?
Now add the implementation code:
#! -*- encoding:utf-8 -*-
import re
import os
import time
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http import Request
from scrapy.http import FormRequest
from src.items import SrcItem

class WeiboSpider(CrawlSpider):
    '''
    An example of simulating a Sina Weibo login with Scrapy;
    I hope it is of some help.
    '''
    name = 'weibo'
    allowed_domains = ['weibo.com', 'sina.com.cn']

    def start_requests(self):
        username = '*******'
        url = 'http://login.sina.com.cn/sso/prelogin.php?entry=miniblog&callback=sinaSSOController.preloginCallBack&user=%s&client=ssologin.js(v1.3.14)&_=%s' % \
            (username, str(time.time()).replace('.', ''))
        print url
        return [Request(url=url, method='get', callback=self.post_message)]

    def post_message(self, response):
        # Pull servertime and nonce out of the prelogin response.
        serverdata = re.findall('{"retcode":0,"servertime":(.*?),"nonce":"(.*?)"}', response.body, re.I)[0]
        print serverdata
        servertime = serverdata[0]
        print servertime
        nonce = serverdata[1]
        print nonce
        formdata = {"entry": 'miniblog',
                    "gateway": '1',
                    "from": "",
                    "savestate": '7',
                    "useticket": '1',
                    "ssosimplelogin": '1',
                    "username": '**********',
                    "service": 'miniblog',
                    "servertime": servertime,
                    "nonce": nonce,
                    "pwencode": 'wsse',
                    "password": '*********',
                    "encoding": 'utf-8',
                    "url": 'http://weibo.com/ajaxlogin.php?framelogin=1&callback=parent.sinaSSOController.feedBackUrlCallBack',
                    "returntype": 'META'}
        return [FormRequest(url='http://login.sina.com.cn/sso/login.php?client=ssologin.js(v1.3.14)',
                            formdata=formdata, callback=self.parse_item)]

    def parse_item(self, response):
        with open('%s%s%s' % (os.getcwd(), os.sep, 'logged.html'), 'wb') as f:
            f.write(response.body)
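One caveat on the formdata above: with pwencode set to 'wsse', the ssologin.js of that era is commonly reported to derive the posted password from the plaintext one via a triple SHA-1 over servertime and nonce. That is an assumption about Sina's client-side code, not something Scrapy does for you; verify it against the actual JS before relying on it. A sketch:

# Hedged sketch of the wsse derivation attributed to ssologin.js (v1.3.x);
# this is an assumption about Sina's client-side hashing, not confirmed here.
import hashlib

def wsse_password(plain, servertime, nonce):
    s1 = hashlib.sha1(plain).hexdigest()
    s2 = hashlib.sha1(s1).hexdigest()
    return hashlib.sha1(s2 + str(servertime) + nonce).hexdigest()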
Next, verify by looking at the contents of logged.html:
<html><head><script language='javascript'>parent.sinaSSOController.feedBackUrlCallBack({"result":true,"userinfo":{"uniqueid":"1700208252","userid":"×××××××××","displayname":"×××××××××","userdomain":"×××××××××"}});</script></head><body></body></html>null
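If you want to check this payload programmatically rather than by eye, a minimal sketch (regex plus the json module; it assumes the body looks exactly like the line above):

import json
import re

m = re.search(r'feedBackUrlCallBack\((.*?)\);', response.body)
info = json.loads(m.group(1))
print info['result']                   # True on success
print info['userinfo']['uniqueid']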
There is a "result":true in there. Did we succeed? Not so fast; let's add one more piece of code (in fact, the content above is what the third link in the packet capture returns):
#! -*- encoding:utf-8 -*-
import re
import os
import time
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http import Request
from scrapy.http import FormRequest
from src.items import SrcItem

class WeiboSpider(CrawlSpider):
    '''
    An example of simulating a Sina Weibo login with Scrapy;
    I hope it is of some help. This is the complete code.
    '''
    name = 'weibo'
    allowed_domains = ['weibo.com', 'sina.com.cn']

    def start_requests(self):
        username = '********'
        url = 'http://login.sina.com.cn/sso/prelogin.php?entry=miniblog&callback=sinaSSOController.preloginCallBack&user=%s&client=ssologin.js(v1.3.14)&_=%s' % \
            (username, str(time.time()).replace('.', ''))
        print url
        return [Request(url=url, method='get', callback=self.post_message)]

    def post_message(self, response):
        # Pull servertime and nonce out of the prelogin response.
        serverdata = re.findall('{"retcode":0,"servertime":(.*?),"nonce":"(.*?)"}', response.body, re.I)[0]
        print serverdata
        servertime = serverdata[0]
        print servertime
        nonce = serverdata[1]
        print nonce
        formdata = {"entry": 'miniblog',
                    "gateway": '1',
                    "from": "",
                    "savestate": '7',
                    "useticket": '1',
                    "ssosimplelogin": '1',
                    "username": '*******',
                    "service": 'miniblog',
                    "servertime": servertime,
                    "nonce": nonce,
                    "pwencode": 'wsse',
                    "password": '*******',
                    "encoding": 'utf-8',
                    "url": 'http://weibo.com/ajaxlogin.php?framelogin=1&callback=parent.sinaSSOController.feedBackUrlCallBack',
                    "returntype": 'META'}
        return [FormRequest(url='http://login.sina.com.cn/sso/login.php?client=ssologin.js(v1.3.14)',
                            formdata=formdata, callback=self.check_page)]

    def check_page(self, response):
        # Reuse the logged-in request (cookies included) to fetch the home page.
        url = 'http://weibo.com/'
        request = response.request.replace(url=url, method='get', callback=self.parse_item)
        return request

    def parse_item(self, response):
        with open('%s%s%s' % (os.getcwd(), os.sep, 'logged.html'), 'wb') as f:
            f.write(response.body)
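Why does check_page work? The FormRequest to login.php set the session cookies, and Scrapy's CookiesMiddleware (visible in the enabled-middlewares log earlier) carries them along automatically, so the follow-up GET to weibo.com built via response.request.replace(...) arrives already authenticated.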
Open the output file in a browser:
[screenshot: weibo.com rendered as a logged-in user]
Success, friends. And how little code did it actually take? Oh, one thing I forgot to mention: Scrapy's FormRequest has a built-in shortcut for locating and submitting the form on a page, FormRequest.from_response(), but Sina's login is unusual and the form cannot be found that way, hence the approach above.
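For a site whose login form actually sits in the HTML, the usual shortcut looks roughly like this (a sketch; the field names and the after_login callback are hypothetical):

from scrapy.http import FormRequest

def parse_login_page(self, response):
    # Finds the form in the response and merges our fields into it.
    return FormRequest.from_response(
        response,
        formdata={'username': 'me', 'password': 'secret'},
        callback=self.after_login)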