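Background
Target site: http://www.gsxt.gov.cn/index.html
Traffic analysis: packets captured with Charles
Working backwards from the result
Start with the request to http://www.gsxt.gov.cn/corp-query-search-1.html
As the screenshot above shows, the content of this page is served as a static resource, so this is the page we have to fetch.
The screenshot also shows the parameters the request needs: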
tab: ent_tab
province:
geetest_challenge: 10faf845f3f031f4aa0c314d5b593477
geetest_validate: 84cec0edcd71ef8e63faafaf251c840a
geetest_seccode: 84cec0edcd71ef8e63faafaf251c840a|jordan
token: 40390420
searchword: <search keyword>
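If you search the JS for where these values are generated, you find the following (360 Browser was acting up, so from here on I debug in Chrome):
The screenshot above shows where these three parameters are generated. Exciting, right? Just work out the generating function and we are done? Not so simple; read on.
As the screenshot shows, I jumped to the generating function, and... well, it was not encouraging.
So let's try another route and look at the other two requests.
The third request:
The response to the third request contains: validate
Take that value and compare it with the geetest_validate parameter captured in the first request:
Parameters from the first request:
geetest_validate: 84cec0edcd71ef8e63faafaf251c840a
geetest_seccode: 84cec0edcd71ef8e63faafaf251c840a|jordan
Value from the third request: 84cec0edcd71ef8e63faafaf251c840a
Conclusion: exactly the same (geetest_seccode is the same value with |jordan appended).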
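The comparison above suggests that geetest_seccode is simply the validate value with |jordan appended. Here is a minimal sketch of assembling the search parameters on that assumption. The helper name build_search_params is made up for illustration, and the relationship is inferred from this single capture, so treat it as a guess rather than a documented rule.

```python
def build_search_params(challenge, validate, token, searchword):
    """Assemble the query parameters seen in the captured search request.

    Assumption: geetest_seccode == geetest_validate + "|jordan",
    as observed in one capture only.
    """
    return {
        "tab": "ent_tab",
        "province": "",
        "geetest_challenge": challenge,
        "geetest_validate": validate,
        "geetest_seccode": validate + "|jordan",
        "token": token,
        "searchword": searchword,
    }


params = build_search_params(
    challenge="10faf845f3f031f4aa0c314d5b593477",
    validate="84cec0edcd71ef8e63faafaf251c840a",
    token="40390420",
    searchword="腾讯",
)
print(params["geetest_seccode"])
```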
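One problem shows up here:
the challenge value returned by SearchItemCaptcha?t=1593853193470 is different from the geetest_challenge parameter sent with the corp-query-search-1.html request,
and to get the validate value you first have to sort out the geetest_challenge parameter.
Set that aside for now: first make the request and get gt, then come back to geetest_challenge later.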
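Main part
Goal: get the response shown below
Cookie-based anti-scraping
As mentioned above, we need the response from http://www.gsxt.gov.cn/SearchItemCaptcha?t=1593853193470 in order to get the gt parameter.
First, simulate the request: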
import requests
import time
import re
import execjs


class Business_Information(object):
    keyword = '腾讯'  # search keyword (Tencent)
    headers = {
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Cache-Control': 'no-cache',
        'Host': 'www.gsxt.gov.cn',
        'Pragma': 'no-cache',
        'Proxy-Connection': 'keep-alive',
        'Referer': 'http://www.gsxt.gov.cn/corp-query-search-1.html',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest',
    }
    sess = requests.session()
    # Visit the home page first; the session keeps any cookies it sets
    sess.get('http://www.gsxt.gov.cn/index.html', headers=headers)

    def get_challenge(self):
        url = 'http://www.gsxt.gov.cn/SearchItemCaptcha'
        params = {
            "t": round(time.time() * 1000)  # current time in milliseconds
        }
        # Fetch the JS that generates the cookie
        cookie_html = self.sess.get(url, params=params, headers=self.headers).text
        print(cookie_html)

    def main(self):
        self.get_challenge()


bf = Business_Information()
bf.main()
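The result:
What comes back is clearly not the data we want. So what is it? After two hours of digging, it turned out that this code exists to generate more JS: only by running the generated JS do you get the JS that builds the cookie, and only by running that cookie-building JS do you get the real cookie.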
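One way to tell the two kinds of response apart in code is to try decoding the body as JSON and treat a failure as "this is the JS challenge page". The sketch below assumes the real response is a JSON object; parse_captcha_response is a hypothetical helper name, and the sample bodies are invented for illustration.

```python
import json


def parse_captcha_response(text):
    """Return the decoded JSON object, or None if the body is not a JSON object.

    Assumption: the genuine SearchItemCaptcha response is a JSON object,
    while the anti-scraping response is HTML with an embedded <script>.
    """
    try:
        data = json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None
    return data if isinstance(data, dict) else None


# Invented sample bodies, only to show the two outcomes
print(parse_captcha_response('{"gt": "abc", "challenge": "def"}'))
print(parse_captcha_response('<script>var x = "@21@var";</script>'))
```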
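Flow: call the endpoint and receive a blob of JS -> use a regex to pull out the JS we need -> run that JS to produce the JS that generates the real cookie -> run the generated JS -> obtain the real cookie
Analysis:
The JS returned by the endpoint (really just JS embedded in HTML):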
<script > var x = "@21@var@@Jl@location@new@@while@0xEDB88320@@D@a@for@@@@0xFF@match@document@catch@20@@window@@search@parseInt@@@1593854493@@@0@rOm9XFMtA3QKV7nYsPGT4lifyWwkq5vcjH2IdxUoCbhERLaz81DNB6@@@@__jsl_clearance@2@04@10@innerHTML@onreadystatechange@@Jul@@@@substr@replace@false@4@try@g@8@DOMContentLoaded@@return@33@@@@reverse@@@charAt@addEventListener@if@createElement@002@href@Sat@@@36@GMT@charCodeAt@mDoFw@Ei@@length@@@else@@firstChild@toLowerCase@captcha@1@setTimeout@toString@split@@3@@String@https@challenge@function@@e@@Path@cookie@eval@@I@chars@@attachEvent@1500@Array@@@f@RegExp@@join@P@pathname@Expires@JgSe0upZ@@div@d@@@fromCharCode".replace(/@*$/, "").split("@"), y = "3 f=4f(){46('6.38=6.5f+6.15.28(/[\\?|&]44-4e/,\\'\\')',56);k.4k='1h=19.37|1c|'+(4f(){3 4i=[4f(f){2g f},4f(4i){2g 4i},4f(f){2g 50('4c.62('+f+')')},4f(f){e(3 4i=1c;4i<f.3i;4i++){f[4i]=16(f[4i]).47(3c)};2g f.5d('')}],f=[[(1i+[])+(-~~~''+[]+[])],'52',[(1i+[])+(-~~~''+[]+[])],[+[-~{}, ~~![]]+[]+[[]][1c]][1c].33(~~![]),[(-~~~''+[]+[])+(1i+[]),(-~~~''+[]+[])+[-~![]-~[((+!(+!{}))<<(+!(+!{})))-~{}-~(-~[-~[]-~[]])]]],'5%',(1i+[]),[((+!(+!{}))-~[]-~[]-~![]-~[]-~[]+[[]][1c])+((+!(+!{}))-~[]-~[]-~![]-~[]-~[]+[[]][1c])],(-~![]-~~~''+(-~[]+[-~-~{}])/[-~-~{}]+[]+[[]][1c]),'3f',[((+!(+!{}))-~[]-~[]-~![]-~[]-~[]+[[]][1c])+(-~[1i]+2a+[[]][1c])],(2a+[]+[])+({}+[]+[]).33(2a+2a),'3g',(-~![]-~~~''+(-~[]+[-~-~{}])/[-~-~{}]+[]+[[]][1c]),'5e',(2a+[]+[]),[(-~[1i]+[])+(-~~~''+[]+[])],[(-~![]-~~~''+(-~[]+[-~-~{}])/[-~-~{}]+[]+[[]][1c])+((+!(+!{}))-~[]-~[]-~![]-~[]-~[]+[[]][1c])],(([-~-~{}]+(+[])>>-~-~{})+[]+[[]][1c]),[(-~![]-~~~''+(-~[]+[-~-~{}])/[-~-~{}]+[]+[[]][1c])+[-~![]-~[((+!(+!{}))<<(+!(+!{})))-~{}-~(-~[-~[]-~[]])]]],'%',(-~[1i]+[]),'c'];e(3 49=1c;49<f.3i;49++){f[49]=4i[[4a,45,4a,1c,4a,45,1c,1i,1c,45,1i,1c,45,1c,45,1c,4a,1i,1c,1i,45,1c,45][49]](f[49])};2g f.5d('')})()+';5g=39, 1j-23-11 1k:2:2h 3d;4j=/;'};35((4f(){2b{2g !!13.34;}10(4h){2g 29;}})()){k.34('2e',f,29)}40{k.55('21',f)}", f = function (x, y) { var a = 
0, b = 0, c = 0; x = x.split(""); y = y || 99; while ((a = x.shift()) && (b = a.charCodeAt(0) - 77.5)) c = (Math.abs(b) < 13 ? (b + 48.5) : parseInt(a, 36)) + y * c; return c }, z = f(y.match(/\w/g).sort(function (x, y) { return f(x) - f(y) }).pop()); while (z++) try { eval(y.replace(/\b\w+\b/g, function (y) { return x[f(y, z) - 1] || ("_" + y) })); break } catch (_) { } </script>
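As a reading aid for the script above: its helper f appears to read each token in y as a number in base z, using the digits 0-9, a-z (values 0 to 35) and A-Z (values 36 to 61), and the result minus one indexes into the word table x; an empty or missing entry falls back to "_" plus the token. The following is a rough Python port of that one step, not the site's code. The names decode_token and substitute are introduced here, tokens are assumed to be ASCII letters and digits only, and base 21 is what this particular sample seems to use (token 3 maps to var, token 6 to location).

```python
import re
import string


def decode_token(token, base):
    """Port of the JS helper f(x, y): read token as a base-`base` number."""
    value = 0
    for ch in token:
        if ch in string.ascii_uppercase:
            digit = ord(ch) - 29          # 'A' -> 36 ... 'Z' -> 61
        elif ch in string.digits or ch in string.ascii_lowercase:
            digit = int(ch, 36)           # '0' -> 0 ... 'z' -> 35
        else:
            return None                   # e.g. '_' (NaN in the JS version)
        value = digit + base * value
    return value


def substitute(template, table, base):
    """Replace every word token with table[index - 1], or '_' + token."""
    def repl(match):
        token = match.group(0)
        index = decode_token(token, base)
        if index is not None and 1 <= index <= len(table) and table[index - 1]:
            return table[index - 1]
        return "_" + token
    return re.sub(r"\b\w+\b", repl, template, flags=re.ASCII)


# First few entries of the word table from the sample above
table = "@21@var@@Jl@location@new".split("@")
print(substitute("3 f=6.7", table, 21))
```

The original script does not know the base in advance: it starts from the largest digit found in y and keeps incrementing z until eval stops throwing, which is why the port takes the base as an argument.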
Next, modify it:
function pre_cookie() {  // wrap everything in a function
    var x = "@21@var@@Jl@location@new@@while@0xEDB88320@@D@a@for@@@@0xFF@match@document@catch@20@@window@@search@parseInt@@@1593854493@@@0@rOm9XFMtA3QKV7nYsPGT4lifyWwkq5vcjH2IdxUoCbhERLaz81DNB6@@@@__jsl_clearance@2@04@10@innerHTML@onreadystatechange@@Jul@@@@substr@replace@false@4@try@g@8@DOMContentLoaded@@return@33@@@@reverse@@@charAt@addEventListener@if@createElement@002@href@Sat@@@36@GMT@charCodeAt@mDoFw@Ei@@length@@@else@@firstChild@toLowerCase@captcha@1@setTimeout@toString@split@@3@@String@https@challenge@function@@e@@Path@cookie@eval@@I@chars@@attachEvent@1500@Array@@@f@RegExp@@join@P@pathname@Expires@JgSe0upZ@@div@d@@@fromCharCode".replace(/@*$/, "").split("@"),
        y = "3 f=4f(){46('6.38=6.5f+6.15.28(/[\\?|&]44-4e/,\\'\\')',56);k.4k='1h=19.37|1c|'+(4f(){3 4i=[4f(f){2g f},4f(4i){2g 4i},4f(f){2g 50('4c.62('+f+')')},4f(f){e(3 4i=1c;4i<f.3i;4i++){f[4i]=16(f[4i]).47(3c)};2g f.5d('')}],f=[[(1i+[])+(-~~~''+[]+[])],'52',[(1i+[])+(-~~~''+[]+[])],[+[-~{}, ~~![]]+[]+[[]][1c]][1c].33(~~![]),[(-~~~''+[]+[])+(1i+[]),(-~~~''+[]+[])+[-~![]-~[((+!(+!{}))<<(+!(+!{})))-~{}-~(-~[-~[]-~[]])]]],'5%',(1i+[]),[((+!(+!{}))-~[]-~[]-~![]-~[]-~[]+[[]][1c])+((+!(+!{}))-~[]-~[]-~![]-~[]-~[]+[[]][1c])],(-~![]-~~~''+(-~[]+[-~-~{}])/[-~-~{}]+[]+[[]][1c]),'3f',[((+!(+!{}))-~[]-~[]-~![]-~[]-~[]+[[]][1c])+(-~[1i]+2a+[[]][1c])],(2a+[]+[])+({}+[]+[]).33(2a+2a),'3g',(-~![]-~~~''+(-~[]+[-~-~{}])/[-~-~{}]+[]+[[]][1c]),'5e',(2a+[]+[]),[(-~[1i]+[])+(-~~~''+[]+[])],[(-~![]-~~~''+(-~[]+[-~-~{}])/[-~-~{}]+[]+[[]][1c])+((+!(+!{}))-~[]-~[]-~![]-~[]-~[]+[[]][1c])],(([-~-~{}]+(+[])>>-~-~{})+[]+[[]][1c]),[(-~![]-~~~''+(-~[]+[-~-~{}])/[-~-~{}]+[]+[[]][1c])+[-~![]-~[((+!(+!{}))<<(+!(+!{})))-~{}-~(-~[-~[]-~[]])]]],'%',(-~[1i]+[]),'c'];e(3 49=1c;49<f.3i;49++){f[49]=4i[[4a,45,4a,1c,4a,45,1c,1i,1c,45,1i,1c,45,1c,45,1c,4a,1i,1c,1i,45,1c,45][49]](f[49])};2g f.5d('')})()+';5g=39, 1j-23-11 1k:2:2h 3d;4j=/;'};35((4f(){2b{2g !!13.34;}10(4h){2g 29;}})()){k.34('2e',f,29)}40{k.55('21',f)}",
        f = function (x, y) {
            var a = 0, b = 0, c = 0;
            x = x.split("");
            y = y || 99;
            while ((a = x.shift()) && (b = a.charCodeAt(0) - 77.5)) c = (Math.abs(b) < 13 ? (b + 48.5) : parseInt(a, 36)) + y * c;
            return c
        },
        z = f(y.match(/\w/g).sort(function (x, y) { return f(x) - f(y) }).pop());
    while (z++) try {
        var result;  // variable that holds the decoded JS
        result = (y.replace(/\b\w+\b/g, function (y) {  // assign the decoded JS instead of eval-ing it
            return x[f(y, z) - 1] || ("_" + y)
        }));
        break
    } catch (_) {
    }
    return result  // return the decoded JS
}
Next, call it with the execjs module:
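A minimal sketch of that call, under a few assumptions: the script body has already been extracted from the HTML, it is in its raw minified form (so the text try{eval appears literally), and PyExecJS plus a JS runtime such as Node.js are installed. The helper names wrap_first_stage and run_first_stage are made up for this example.

```python
def wrap_first_stage(script_body):
    """Wrap the challenge script in pre_cookie() so it returns the decoded JS.

    Assumption: the raw script contains the literal text 'try{eval'.
    """
    patched = script_body.replace("try{eval", "try{var result; result=")
    return "function pre_cookie(){" + patched + "return result}"


def run_first_stage(script_body):
    """Compile the wrapped script and call pre_cookie() through PyExecJS."""
    import execjs  # imported lazily; needs a JS runtime (e.g. Node.js)
    ctx = execjs.compile(wrap_first_stage(script_body))
    return ctx.call("pre_cookie")


# Shape of the wrapped source for a toy script body
print(wrap_first_stage("while(z++)try{eval(code);break}catch(_){}"))
```

Replacing eval with an assignment is what keeps the decoded JS as a string that can be returned to Python, instead of having it executed inside the JS runtime, where document and location do not exist.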
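Result: the part in bold is the JS we need for generating the real cookie. Note: if you run this many times, the JS it returns is occasionally wrong (it does not happen often; try it if you are curious), so the next step needs a check for that case.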
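Since the decoded JS is occasionally wrong, one option is to fetch it again a few times before giving up. The sketch below is a generic retry helper, not part of the original script: fetch_valid_js, fetch, and is_valid are hypothetical names, and the validity test shown in the usage reuses the "href(){setTimeout" marker that the full source further down checks for.

```python
def fetch_valid_js(fetch, is_valid, attempts=3):
    """Call fetch() up to `attempts` times and return the first valid result."""
    for _ in range(attempts):
        js = fetch()
        if is_valid(js):
            return js
    raise RuntimeError("no valid JS after %d attempts" % attempts)


def looks_valid(js):
    """A bad decode shows 'href' where 'function' should be."""
    return "href(){setTimeout" not in js


# Usage with canned responses standing in for real requests
responses = iter([
    "var _f=href(){setTimeout('...',1500);}",      # bad decode
    "var _f=function(){setTimeout('...',1500);}",  # good decode
])
print(fetch_valid_js(lambda: next(responses), looks_valid))
```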
var _f = function () { setTimeout('location.href=location.pathname+location.search.replace(/[\?|&]captcha-challenge/,\'\')', 1500); document.cookie = '__jsl_clearance=1593854493.002|0|' + (function () { var _4i = [function (_f) { return _f }, function (_4i) { return _4i }, function (_f) { return eval('String.fromCharCode(' + _f + ')') }, function (_f) { for (var _4i = 0; _4i < _f.length; _4i++) { _f[_4i] = parseInt(_f[_4i]).toString(36) } ; return _f.join('') }], _f = [[(2 + []) + (-~~~'' + [] + [])], 'I', [(2 + []) + (-~~~'' + [] + [])], [+[-~{}, ~~![]] + [] + [[]][0]][0].charAt(~~![]), [(-~~~'' + [] + []) + (2 + []), (-~~~'' + [] + []) + [-~![] - ~[((+!(+!{})) << (+!(+!{}))) - ~{} - ~(-~[-~[] - ~[]])]]], 'Jl%', (2 + []), [((+!(+!{})) - ~[] - ~[] - ~![] - ~[] - ~[] + [[]][0]) + ((+!(+!{})) - ~[] - ~[] - ~![] - ~[] - ~[] + [[]][0])], (-~![] - ~~~'' + (-~[] + [-~-~{}]) / [-~-~{}] + [] + [[]][0]), 'mDoFw', [((+!(+!{})) - ~[] - ~[] - ~![] - ~[] - ~[] + [[]][0]) + (-~[2] + 4 + [[]][0])], (4 + [] + []) + ({} + [] + []).charAt(4 + 4), 'Ei', (-~![] - ~~~'' + (-~[] + [-~-~{}]) / [-~-~{}] + [] + [[]][0]), 'P', (4 + [] + []), [(-~[2] + []) + (-~~~'' + [] + [])], [(-~![] - ~~~'' + (-~[] + [-~-~{}]) / [-~-~{}] + [] + [[]][0]) + ((+!(+!{})) - ~[] - ~[] - ~![] - ~[] - ~[] + [[]][0])], (([-~-~{}] + (+[]) >> -~-~{}) + [] + [[]][0]), [(-~![] - ~~~'' + (-~[] + [-~-~{}]) / [-~-~{}] + [] + [[]][0]) + [-~![] - ~[((+!(+!{})) << (+!(+!{}))) - ~{} - ~(-~[-~[] - ~[]])]]], '%', (-~[2] + []), 'D']; for (var _49 = 0; _49 < _f.length; _49++) { _f[_49] = _4i[[3, 1, 3, 0, 3, 1, 0, 2, 0, 1, 2, 0, 1, 0, 1, 0, 3, 2, 0, 2, 1, 0, 1][_49]](_f[_49]) } ; return _f.join('') })() + ';Expires=Sat, 04-Jul-20 10:21:33 GMT;Path=/;' }; if ((function () { try { return !!window.addEventListener; } catch (e) { return false; } })()) { document.addEventListener('DOMContentLoaded', _f, false) } else { document.attachEvent('onreadystatechange', _f) }
Take that part out and rewrite it:
function generate_cookie_js() {  // wrap it in a function
    var cookie = '__jsl_clearance=1593854493.002|0|' + (function () {
        var _4i = [function (_f) { return _f }, function (_4i) { return _4i }, function (_f) { return eval('String.fromCharCode(' + _f + ')') }, function (_f) { for (var _4i = 0; _4i < _f.length; _4i++) { _f[_4i] = parseInt(_f[_4i]).toString(36) } ; return _f.join('') }],
            _f = [[(2 + []) + (-~~~'' + [] + [])], 'I', [(2 + []) + (-~~~'' + [] + [])], [+[-~{}, ~~![]] + [] + [[]][0]][0].charAt(~~![]), [(-~~~'' + [] + []) + (2 + []), (-~~~'' + [] + []) + [-~![] - ~[((+!(+!{})) << (+!(+!{}))) - ~{} - ~(-~[-~[] - ~[]])]]], 'Jl%', (2 + []), [((+!(+!{})) - ~[] - ~[] - ~![] - ~[] - ~[] + [[]][0]) + ((+!(+!{})) - ~[] - ~[] - ~![] - ~[] - ~[] + [[]][0])], (-~![] - ~~~'' + (-~[] + [-~-~{}]) / [-~-~{}] + [] + [[]][0]), 'mDoFw', [((+!(+!{})) - ~[] - ~[] - ~![] - ~[] - ~[] + [[]][0]) + (-~[2] + 4 + [[]][0])], (4 + [] + []) + ({} + [] + []).charAt(4 + 4), 'Ei', (-~![] - ~~~'' + (-~[] + [-~-~{}]) / [-~-~{}] + [] + [[]][0]), 'P', (4 + [] + []), [(-~[2] + []) + (-~~~'' + [] + [])], [(-~![] - ~~~'' + (-~[] + [-~-~{}]) / [-~-~{}] + [] + [[]][0]) + ((+!(+!{})) - ~[] - ~[] - ~![] - ~[] - ~[] + [[]][0])], (([-~-~{}] + (+[]) >> -~-~{}) + [] + [[]][0]), [(-~![] - ~~~'' + (-~[] + [-~-~{}]) / [-~-~{}] + [] + [[]][0]) + [-~![] - ~[((+!(+!{})) << (+!(+!{}))) - ~{} - ~(-~[-~[] - ~[]])]]], '%', (-~[2] + []), 'D'];
        for (var _49 = 0; _49 < _f.length; _49++) { _f[_49] = _4i[[3, 1, 3, 0, 3, 1, 0, 2, 0, 1, 2, 0, 1, 0, 1, 0, 3, 2, 0, 2, 1, 0, 1][_49]](_f[_49]) } ;
        return _f.join('')
    })() + ';Expires=Sat, 04-Jul-20 10:21:33 GMT;Path=/;';
    return cookie  // return the cookie string
};
Call it with the execjs module:
That gives the result: the cookie value that this anti-scraping check expects the request to carry.
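The string produced by the JS has the Set-Cookie shape (name=value followed by ;Expires=...;Path=/;), so before handing it to a requests session it may be cleaner to keep only the name and value. A small sketch, assuming that shape; parse_cookie_string is a hypothetical helper and the value in the example is a shortened placeholder, not a real token.

```python
def parse_cookie_string(cookie):
    """Split 'name=value;Expires=...;Path=/;' into (name, value).

    Only the first 'name=value' pair is kept; attributes are dropped.
    """
    pair = cookie.split(";", 1)[0]
    name, _, value = pair.partition("=")
    return name.strip(), value


name, value = parse_cookie_string(
    "__jsl_clearance=1593854493.002|0|abc%3D;Expires=Sat, 04-Jul-20 10:21:33 GMT;Path=/;"
)
print(name, value)
```

With a requests session, the pair can then be stored via sess.cookies.set(name, value).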
Here is the full source for getting past the cookie check:
import requests
import time
import re
import execjs


class Business_Information(object):
    keyword = '腾讯'  # search keyword (Tencent)
    headers = {
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Cache-Control': 'no-cache',
        'Host': 'www.gsxt.gov.cn',
        'Pragma': 'no-cache',
        'Proxy-Connection': 'keep-alive',
        'Referer': 'http://www.gsxt.gov.cn/corp-query-search-1.html',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest',
    }
    sess = requests.session()
    # Visit the home page first; the session keeps any cookies it sets
    sess.get('http://www.gsxt.gov.cn/index.html', headers=headers)

    def get_challenge(self):
        url = 'http://www.gsxt.gov.cn/SearchItemCaptcha'
        params = {
            "t": round(time.time() * 1000)  # current time in milliseconds
        }
        # Fetch the JS that generates the cookie
        cookie_html = self.sess.get(url, params=params, headers=self.headers).text
        # Extract the JS from the returned HTML (tolerates "<script >" with a space)
        cookie_js = re.findall(r"<script\s*>(.*?)</script>", cookie_html, re.S)[0]
        # Build the JS to call: wrap it in a function and return the code instead of eval-ing it
        edit_js = "function pre_cookie(){" + cookie_js.replace('try{eval', 'try{var result; result=') + "return result}"
        # First JS call: compile the wrapper that yields the real cookie-generating JS
        first_js = execjs.compile(edit_js)
        # Run it to get the second-stage JS (this changes on every request)
        generate_cookie_js_all = first_js.call("pre_cookie")
        # print(generate_cookie_js_all)
        # The decoded JS is occasionally wrong; bail out so the caller can fetch again
        if "href(){setTimeout" in generate_cookie_js_all:
            raise Exception('The returned JS is malformed; please fetch it again!')
        # Extract the part that actually builds the cookie
        generate_cookie_js = re.findall(r'document\.(cookie=.*?if)', generate_cookie_js_all)[0]
        generate_cookie_js = "window = {};var get_cookie = function () {" + generate_cookie_js.replace("};if", ";return cookie};")
        # Second JS call: generate the real cookie
        second_js = execjs.compile(generate_cookie_js)
        cookie = second_js.call('get_cookie')
        print(cookie)
        # Keep only the value: drop the name and the trailing ;Expires=...;Path=/; attributes
        cookie = cookie.split("__jsl_clearance=")[-1].split(";")[0]
        self.sess.cookies.set("__jsl_clearance", cookie)
        # Repeat the request, now carrying the cookie
        json_data = self.sess.get(url, params=params, headers=self.headers).json()
        print(json_data)

    def main(self):
        self.get_challenge()


bf = Business_Information()
bf.main()
Output:
Note: this site's anti-scraping mechanism has since been updated.