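For a project, my boss assigned me to scrape some information about Weibo users. As someone with almost zero crawler experience, and hardly a Python veteran, I set off on a long road of research...
After getting to know the basic crawler tools and the well-known framework scrapy, I still decided to write a login script myself, borrowing from the scripts that various experts have posted online...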
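Environment
Tools
1. Chrome and its Developer Tools
2. Charles [this is what I use on the Mac in place of Fiddler; it is paid software, though cracked copies can be found online; it feels much nicer than the Mac build of Fiddler]
3. Python 3.6
4. PyCharm
While researching I learned that Weibo login goes through several redirects, so many experts recommend turning on Preserve log in the Developer Tools Network panel.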
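Libraries used in Python 3.6
1. urllib.request, urllib.error, urllib.parse
2. re: regular expressions
3. rsa, base64
4. json
5. binascii: converts the encrypted bytes to their hex representation
PS: I am using the libraries bundled with Anaconda, and found that rsa had to be installed separately with pip (base64 is part of the standard library).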
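System
macOS 10.13.2
Logging in to weibo.com
With Weibo open in the browser, a push_count.json request shows up every so often. When we click into the username field, a prelogin.php request appears, and that caught our attention.
Open it and you will find some very suspicious things, such as su.
Let's try decoding it with base64: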
import base64
print(base64.b64decode('MzU4NTEwMjQ5JTQwcXEuY29t'))
Output: b'358510249%40qq.com'
Sure enough, it's the username!!!
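Note that a username may contain characters such as @. In the su value we just decoded, the @ has turned into %40, which is simply URL encoding. In other words, su is the URL-encoded username wrapped in base64; it is encoded, not really encrypted.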
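The two layers seen in su can be sketched as a round trip. This is only an illustration, assuming the browser URL-quotes the username and then base64-encodes it, which is what the decoded value above suggests. encode_su and decode_su are hypothetical helper names, not anything from Weibo's own code.

```python
import base64
import urllib.parse


def encode_su(username: str) -> str:
    """URL-quote the username, then base64-encode the result."""
    quoted = urllib.parse.quote(username)  # '@' becomes '%40'
    return base64.b64encode(quoted.encode('utf-8')).decode('ascii')


def decode_su(su: str) -> str:
    """Reverse encode_su: base64-decode, then URL-unquote."""
    quoted = base64.b64decode(su).decode('utf-8')
    return urllib.parse.unquote(quoted)


if __name__ == '__main__':
    print(encode_su('358510249@qq.com'))
    print(decode_su('MzU4NTEwMjQ5JTQwcXEuY29t'))
```

Running it should reproduce the su value captured above and turn it back into the plain username.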
Then, to make things easier to read, we switch over to Charles and look at the body of the prelogin.php response:
sinaSSOController.preloginCallBack({ "retcode": 0, "servertime": 1515836591, "pcid": "gz-cd9bccf44f515b8765496d8694e51ba7c996", "nonce": "JLT53P", "pubkey": "EB2A38568661887FA180BDDB5CABD5F21C7BFD59C090CB2D245A87AC253062882729293E5506350508E7F9AA3BB77F4333231490F915F6D63C55FE2F08A49B353F444AD3993CACC02DB784ABBB8E42A9B1BBFFFB38BE18D78E87A0E41B9B8F73A928EE0CCEE1F6739884B9777E4FE9E88A1BBE495927AC4A799B3181D6442443", "rsakv": "1330428213", "is_openlock": 0, "showpin": 0, "exectime": 5 })
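At a glance: wow, it looks a lot like JSON (wrapped in a callback). Happy!
The useful bits seem to be servertime, nonce, rsakv, and this long, long pubkey... what on earth is that!!
One search later: fine, asymmetric encryption, haha, so happy... NOT!!! QAQ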
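Since the body is JSON wrapped in a callback (JSONP), the wrapper has to come off before json.loads can read it. A minimal sketch, assuming the response always has the shape callbackName({...}); parse_jsonp is a hypothetical helper, and the sample string is an abbreviated copy of the response above with pubkey left out for brevity.

```python
import json
import re

# Matches "callbackName({...})" and captures everything between the parentheses.
JSONP_PATTERN = re.compile(r'^\s*[\w.$]+\s*\((.*)\)\s*;?\s*$', re.DOTALL)


def parse_jsonp(text: str) -> dict:
    """Strip the callback wrapper and parse the JSON payload."""
    match = JSONP_PATTERN.match(text)
    if match is None:
        raise ValueError('not a JSONP response')
    return json.loads(match.group(1))


# Abbreviated sample; the real response also carries pubkey and other fields.
sample = ('sinaSSOController.preloginCallBack({"retcode": 0, '
          '"servertime": 1515836591, "nonce": "JLT53P", '
          '"rsakv": "1330428213", "showpin": 0})')
args = parse_jsonp(sample)
print(args['servertime'], args['nonce'], args['rsakv'])
```

Raising ValueError on an unexpected shape makes a changed response format fail loudly instead of surfacing later as a confusing AttributeError.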
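Logging in to Weibo
Here we need Charles to capture the chain of redirects. The result is as follows:
Our target is the POST form that shows up right after prelogin: login.php?client=ssologin.js(v1.4.19)
Take a look at what's inside: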
entry:weibo
gateway:1
from:
savestate:7
qrcode_flag:false
useticket:1
pagerefer:https://login.sina.com.cn/crossdomain2.php?action=logout&r=https%3A%2F%2Fweibo.com%2Flogout.php%3Fbackurl%3D%252F%252Fs.weibo.com
vsnf:1
su:MzU4NTEwMjQ5JTQwcXEuY29t
service:miniblog
servertime:1515895583
nonce:JLT53P
pwencode:rsa2
rsakv:1330428213
sp:02ca1b627293c21e098882de3e276def93654ffba9817d0d95174b11c403e46e8016bf66ed421198fffaaa691fb0c9d03d45da676de0282a30aef899855262e09164dfef35eb6820ba017ecf8f437643fe94eaf0632095ffcc647ada27b23c9ed1b1c8f7d1d87ce2c69ed4f9997fb9283c42622c677dbecfe60a802f4b621ee3
sr:1680*1050
encoding:UTF-8
prelt:31
url:https://weibo.com/ajaxlogin.php?framelogin=1&callback=parent.sinaSSOController.feedBackUrlCallBack
returntype:META
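It's easy to see that su is the username and sp is the password, and sp has clearly been RSA-encrypted.
To produce the same encrypted sp ourselves, we first need to analyze the JS.
At login, the POST form login.php?client=ssologin.js(v1.4.19) appears first, followed by an ssologin.js file. Open it and we are greeted by a dense wall of code.
From what we found earlier, we already know about RSA encryption and a parameter called pubkey. Search for it and we immediately get what we want:
Here, 10001 is the exponent used for RSA encryption. Note that it is written in hexadecimal, so we need to convert it to decimal (0x10001 = 65537).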
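Both halves of the public key arrive as hex strings, so both go through int(..., 16). The sketch below shows that conversion and the message-size limit that follows from it, assuming the rsa package's encrypt applies PKCS#1 v1.5 padding (which reserves 11 bytes), as its documentation describes. parse_public_key and max_plaintext_len are hypothetical helpers, and demo_pubkey is a made-up 1024-bit value, not Weibo's key.

```python
def parse_public_key(pubkey_hex: str, exponent_hex: str = '10001'):
    """Return (n, e) as Python ints from their hex spellings."""
    n = int(pubkey_hex, 16)
    e = int(exponent_hex, 16)
    return n, e


def max_plaintext_len(n: int) -> int:
    """Longest message, in bytes, that PKCS#1 v1.5 encryption accepts for n."""
    key_bytes = (n.bit_length() + 7) // 8
    return key_bytes - 11  # 11 bytes are reserved for the padding


# Hypothetical 1024-bit modulus, used only to show the arithmetic.
demo_pubkey = 'C5' + '00' * 126 + '01'
n, e = parse_public_key(demo_pubkey)
print(e)                     # 65537
print(n.bit_length())        # 1024
print(max_plaintext_len(n))  # 117
```

With a 1024-bit key, servertime, nonce, and the password together must fit in 117 bytes, which is plenty for ordinary passwords.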
The other piece of information is how our password gets built:
password=RSAKey.encrypt([me.servertime,me.nonce].join("\t")+"\n"+password)}
The corresponding Python encryption code looks like this:
import binascii
import rsa

def get_encrypted_pw(password, servertime, nonce, pubkey):
    rsa_e = int('10001', 16)  # 0x10001 == 65537
    # Same layout as ssologin.js: servertime, tab, nonce, newline, password
    pw_string = str(servertime) + '\t' + str(nonce) + '\n' + str(password)
    key = rsa.PublicKey(int(pubkey, 16), rsa_e)
    pw_encrypted = rsa.encrypt(pw_string.encode('utf-8'), key)
    # Convert the raw ciphertext bytes to their ASCII hex representation
    passwd = binascii.b2a_hex(pw_encrypted)
    print(passwd)
    return passwd
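The two steps around the RSA call can be checked without any third-party package. This is a sketch only: build_pw_plaintext and ciphertext_to_sp are hypothetical helper names, the password is a placeholder, and the four bytes fed to ciphertext_to_sp are just the first bytes of the sp value captured earlier, not a full ciphertext.

```python
import binascii


def build_pw_plaintext(servertime, nonce, password) -> bytes:
    """Mirror [servertime, nonce].join("\\t") + "\\n" + password from the JS."""
    text = '\t'.join([str(servertime), str(nonce)]) + '\n' + str(password)
    return text.encode('utf-8')


def ciphertext_to_sp(ciphertext: bytes) -> str:
    """Hex-encode raw ciphertext bytes and return a str for the form."""
    return binascii.b2a_hex(ciphertext).decode('ascii')


plain = build_pw_plaintext(1515895583, 'JLT53P', 'not-a-real-password')
print(plain)
print(ciphertext_to_sp(b'\x02\xca\x1b\x62'))
```

Returning a str rather than bytes from the hex step is a small convenience choice here; urllib.parse.urlencode accepts either.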
Final code
# Import the required modules
import base64
import binascii
import http.cookiejar  # formerly cookielib
import json
import re
import urllib.error
import urllib.parse
import urllib.request

import rsa


# Define the Launcher class
class Launcher:
    # Store the username and password
    def __init__(self, username, password):
        self.username = username
        self.password = password

    # Return the URL-quoted, base64-encoded username (the su field)
    def get_encrypted_name(self):
        # URL-encode the string first
        username_urllike = urllib.parse.quote(self.username)
        username_encrypted = base64.b64encode(username_urllike.encode('utf-8'))
        return username_encrypted.decode('utf-8')  # bytes -> str

    def get_prelogin_args(self):
        '''
        Simulate the pre-login request and return the server's nonce,
        servertime, pubkey and so on as a dict.
        '''
        json_pattern = re.compile(r'\((.*)\)')
        url = ('http://login.sina.com.cn/sso/prelogin.php?entry=weibo'
               '&callback=sinaSSOController.preloginCallBack&su='
               + self.get_encrypted_name()
               + '&rsakt=mod&client=ssologin.js(v1.4.19)')
        try:
            request = urllib.request.Request(url)
            response = urllib.request.urlopen(request)
            raw_data = response.read().decode('utf-8')
            # Use the regex to pull the JSON out of the callback wrapper
            json_data = json_pattern.search(raw_data).group(1)
            # Parse the JSON into a dict
            data = json.loads(json_data)
            # print(data)
            return data
        except urllib.error.HTTPError as e:
            print("%d" % e.code)
            return None

    # Build the RSA-encrypted password (the sp field) from the pre-login data
    def get_encrypted_pw(self, data):
        rsa_e = int('10001', 16)  # 0x10001
        pw_string = str(data['servertime']) + '\t' + str(data['nonce']) + '\n' + str(self.password)
        key = rsa.PublicKey(int(data['pubkey'], 16), rsa_e)
        pw_encrypted = rsa.encrypt(pw_string.encode('utf-8'), key)
        self.password = ''  # clear the plaintext password, to be safe
        passwd = binascii.b2a_hex(pw_encrypted)
        print(passwd)
        return passwd

    def enableCookies(self):
        # Create a cookie container
        cookie_container = http.cookiejar.CookieJar()
        # Bind the cookie container to an HTTP cookie processor
        cookie_support = urllib.request.HTTPCookieProcessor(cookie_container)
        # Create an opener with a handler for opening HTTP URLs
        opener = urllib.request.build_opener(cookie_support, urllib.request.HTTPHandler)
        # Install the opener; later urlopen() calls will use it
        urllib.request.install_opener(opener)

    # Package the data needed for the POST request
    def build_post_data(self, raw):
        post_data = {
            "entry": "weibo",
            "gateway": "1",
            "from": "",
            "savestate": "7",
            "qrcode_flag": 'false',
            "useticket": "1",
            "pagerefer": "https://login.sina.com.cn/crossdomain2.php?action=logout&r=https%3A%2F%2Fweibo.com%2Flogout.php%3Fbackurl%3D%252F",
            "vsnf": "1",
            "su": self.get_encrypted_name(),
            "service": "miniblog",
            "servertime": raw['servertime'],
            "nonce": raw['nonce'],
            "pwencode": "rsa2",
            "rsakv": raw['rsakv'],
            "sp": self.get_encrypted_pw(raw),
            "sr": "1680*1050",
            "encoding": "UTF-8",
            "prelt": "194",
            "url": "https://weibo.com/ajaxlogin.php?framelogin=1&callback=parent.sinaSSOController.feedBackUrlCallBack",
            "returntype": "META"
        }
        data = urllib.parse.urlencode(post_data).encode('utf-8')
        return data

    # Log in; note that three hops are needed here
    def login(self):
        url = 'https://login.sina.com.cn/sso/login.php?client=ssologin.js(v1.4.19)'
        self.enableCookies()
        data = self.get_prelogin_args()
        post_data = self.build_post_data(data)
        headers = {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
        }
        try:
            request = urllib.request.Request(url=url, data=post_data, headers=headers)
            response = urllib.request.urlopen(request)
            # I first decoded this as UTF-8 and got ugly output, but could just
            # make out the word GBK in it, so decode as GBK directly.
            html = response.read().decode('GBK')
            # print(html)
        except urllib.error.HTTPError as e:
            print(e.code)
            return 0

        p = re.compile(r'location\.replace\("(.*?)"\)')
        p2 = re.compile(r"location\.replace\('(.*?)'\)")
        p3 = re.compile(r'"userdomain":"(.*?)"')
        try:
            # Hop 1: follow the first location.replace("...") URL
            login_url = p.search(html).group(1)
            request = urllib.request.Request(login_url)
            response = urllib.request.urlopen(request)
            page = response.read().decode('GBK')
            # print(page)
            # Hop 2: follow the second location.replace('...') URL
            login_url2 = p2.search(page).group(1)
            request = urllib.request.Request(login_url2)
            response = urllib.request.urlopen(request)
            page2 = response.read().decode('utf-8')
            # print(page2)
            # Hop 3: build the final URL from the userdomain field
            login_url = 'http://weibo.com/' + p3.search(page2).group(1)
            request = urllib.request.Request(login_url)
            response = urllib.request.urlopen(request)
            final = response.read().decode('utf-8')
            print(final)

            print("Login success!")
        except Exception:
            print('Login error!')
            return 0
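One design note on enableCookies: install_opener changes the behaviour of every later urllib.request.urlopen call in the process. A possible alternative, sketched here as an assumption rather than as what the script above does, is to keep the opener local and call opener.open(...) explicitly. make_cookie_opener is a hypothetical helper name.

```python
import http.cookiejar
import urllib.request


def make_cookie_opener():
    """Return (opener, jar); the opener stores received cookies in jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


opener, jar = make_cookie_opener()
# opener.open(url) would be used in place of urllib.request.urlopen(url).
# No request is sent here, so the jar starts out empty.
print(len(jar))
```

Keeping the jar in hand also makes it easy to inspect which cookies each hop of the login actually set.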
It's worth walking through how the final login method came about. First we try logging in directly and look at what comes back.
def login(self):
    url = 'https://login.sina.com.cn/sso/login.php?client=ssologin.js(v1.4.19)'
    self.enableCookies()
    data = self.get_prelogin_args()
    post_data = self.build_post_data(data)
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
    }
    try:
        request = urllib.request.Request(url=url, data=post_data, headers=headers)
        response = urllib.request.urlopen(request)
        # I first decoded this as UTF-8 and got ugly output, but could just
        # make out the word GBK in it, so decode as GBK directly.
        html = response.read().decode('GBK')
        print(html)
        print('-------------------------')
    except urllib.error.HTTPError as e:
        print(e.code)
Great, what we get back is indeed a pile of weird stuff!!
<html> <head> <title>新浪通行证</title> <meta http-equiv="refresh" content="0; url='https://login.sina.com.cn/crossdomain2.php?action=login&entry=weibo&r=https%3A%2F%2Fpassport.weibo.com%2Fwbsso%2Flogin%3Fssosavestate%3D1547533996%26url%3Dhttps%253A%252F%252Fweibo.com%252Fajaxlogin.php%253Fframelogin%253D1%2526callback%253Dparent.sinaSSOController.feedBackUrlCallBack%26display%3D0%26ticket%3DST-MjQ2Nzk2MDk3Mg%3D%3D-1515997996-gz-5F219439701347BC4686F4A3E10C79C9-1%26retcode%3D0&sr=1680%2A1050'"/> <meta http-equiv="Content-Type" content="text/html; charset=GBK" /> </head> <body bgcolor="#ffffff" text="#000000" link="#0000cc" vlink="#551a8b" alink="#ff0000"> <script type="text/javascript" language="javascript"> location.replace("https://login.sina.com.cn/crossdomain2.php?action=login&entry=weibo&r=https%3A%2F%2Fpassport.weibo.com%2Fwbsso%2Flogin%3Fssosavestate%3D1547533996%26url%3Dhttps%253A%252F%252Fweibo.com%252Fajaxlogin.php%253Fframelogin%253D1%2526callback%253Dparent.sinaSSOController.feedBackUrlCallBack%26display%3D0%26ticket%3DST-MjQ2Nzk2MDk3Mg%3D%3D-1515997996-gz-5F219439701347BC4686F4A3E10C79C9-1%26retcode%3D0&sr=1680%2A1050"); </script> </body> </html>
According to the experts, this is redirect code, and the redirect URL is written right after location.replace, so we need a regular expression to pull that URL out.
# Continues inside login()
p = re.compile(r'location\.replace\("(.*?)"\)')
try:
    login_url = p.search(html).group(1)
    request = urllib.request.Request(login_url)
    response = urllib.request.urlopen(request)
    page = response.read().decode('GBK')
    print(page)
except Exception:
    print('Login error!')
    return 0
Let's see the result:
<html> <head> <meta http-equiv="Content-Type" content="text/html; charset=GBK" /> <title>新浪通行证</title> <script charset="utf-8" src="https://i.sso.sina.com.cn/js/ssologin.js"></script> </head> <body> 正在登录 ... <script> try{sinaSSOController.setCrossDomainUrlList({"retcode":0,"arrURL":["https:\/\/passport.weibo.com\/wbsso\/login?ticket=ST-MjQ2Nzk2MDk3Mg%3D%3D-1515997996-gz-0D1D8222688249D4F950E05810AD22DD-1&ssosavestate=1547533996","https:\/\/passport.97973.com\/sso\/crossdomain?action=login&savestate=1547533996","https:\/\/passport.krcom.cn\/sso\/crossdomain?service=krvideo&savestate=1&ticket=ST-MjQ2Nzk2MDk3Mg%3D%3D-1515997996-gz-94884ABE2B9E4113CE7B809F4B5C92DC-1&ssosavestate=1547533996","https:\/\/passport.weibo.cn\/sso\/crossdomain?action=login&savestate=1"]});} catch(e){ var msg = e.message; var img = new Image(); var type = 1; img.src = 'https://login.sina.com.cn/sso/debuglog?msg=' + msg +'&type=' + type; }try{sinaSSOController.crossDomainAction('login',function(){location.replace('https://passport.weibo.com/wbsso/login?ssosavestate=1547533996&url=https%3A%2F%2Fweibo.com%2Fajaxlogin.php%3Fframelogin%3D1%26callback%3Dparent.sinaSSOController.feedBackUrlCallBack&display=0&ticket=ST-MjQ2Nzk2MDk3Mg==-1515997996-gz-5F219439701347BC4686F4A3E10C79C9-1&retcode=0');});} catch(e){ var msg = e.message; var img = new Image(); var type = 2; img.src = 'https://login.sina.com.cn/sso/debuglog?msg=' + msg +'&type=' + type; } </script> </body> </html>
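Great, another pile of weird stuff! (ノಠ益ಠ)ノ彡┻━┻ But look closely: doesn't it seem rather familiar??
OMG!! location.replace again!!! Only this time the link after it seems to be passport.weibo.com. Ooh, doesn't that look a lot like the real login~
Enough talk, let's grab this URL with a regular expression right away!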
# Continues inside login()
p2 = re.compile(r"location\.replace\('(.*?)'\)")
try:
    login_url2 = p2.search(page).group(1)
    request = urllib.request.Request(login_url2)
    response = urllib.request.urlopen(request)
    page2 = response.read().decode('utf-8')
    print(page2)
except Exception:
    print('Login error!')
    return 0
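The two redirect patterns differ only in the quote character around the URL. As a hedged alternative to keeping two patterns, one regex with a backreference can cover both cases. find_redirect is a hypothetical helper, and the example.com strings are made-up stand-ins for the real pages.

```python
import re

# \1 refers back to whichever quote character opened the URL.
REDIRECT_PATTERN = re.compile(r'''location\.replace\((["'])(.*?)\1\)''')


def find_redirect(html: str):
    """Return the URL inside location.replace(...), or None if absent."""
    match = REDIRECT_PATTERN.search(html)
    return match.group(2) if match else None


print(find_redirect('<script>location.replace("https://example.com/a?x=1");</script>'))
print(find_redirect("function(){location.replace('https://example.com/b');}"))
print(find_redirect('<html></html>'))
```

Returning None instead of raising lets the caller decide whether a missing redirect means a failed login or simply the end of the chain.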
I thought this one was finally in the bag, but what I actually got was...
<html><head><script language='javascript'>parent.sinaSSOController.feedBackUrlCallBack({"result":true,"userinfo":{"uniqueid":"2467960972","userid":null,"displayname":null,"userdomain":"?wvr=5&lf=reg"}});</script></head><body></body></html>
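Heh, yet another hop = =
This time, though, it's easy to spot the "?wvr=5&lf=reg" inside. That field looks very familiar. Check the packets captured during the manual login and, sure enough, it is part of the final link.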
So we cook up one more regular expression to pull that field out, stitch together the final URL, and we can simulate the login easily and happily!
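The script above extracts userdomain with a regex on the raw text. A slightly different approach, sketched here, is to parse the callback argument as JSON and read the field from the resulting dict. build_home_url is a hypothetical helper, and treating "result": true as the success signal is an assumption based on the single response shown above.

```python
import json
import re

# Captures the JSON object passed to feedBackUrlCallBack(...).
CALLBACK_PATTERN = re.compile(r'feedBackUrlCallBack\((\{.*\})\)', re.DOTALL)


def build_home_url(page: str, base: str = 'http://weibo.com/') -> str:
    """Build the final URL from the userdomain field of the callback."""
    match = CALLBACK_PATTERN.search(page)
    if match is None:
        raise ValueError('callback payload not found')
    payload = json.loads(match.group(1))
    if not payload.get('result'):  # assumption: result is true on success
        raise ValueError('login was not accepted')
    return base + payload['userinfo']['userdomain']


page2 = ('<html><head><script language=\'javascript\'>'
         'parent.sinaSSOController.feedBackUrlCallBack({"result":true,'
         '"userinfo":{"uniqueid":"2467960972","userid":null,'
         '"displayname":null,"userdomain":"?wvr=5&lf=reg"}});'
         '</script></head><body></body></html>')
print(build_home_url(page2))
```

Parsing the payload also gives access to uniqueid, which may be handy for later scraping.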
That's all~