Background

Target site: http://www.gsxt.gov.cn/index.html

Traffic analysis: packets captured with Charles

Start from the result and trace it back to its source

Start with the request to http://www.gsxt.gov.cn/corp-query-search-1.html

As the screenshot shows, the content of this page is static (rendered on the server), so this is the page we ultimately have to fetch.

The same screenshot shows the parameters the request needs:

tab: ent_tab
province:
geetest_challenge: 10faf845f3f031f4aa0c314d5b593477
geetest_validate: 84cec0edcd71ef8e63faafaf251c840a
geetest_seccode: 84cec0edcd71ef8e63faafaf251c840a|jordan
token: 40390420
searchword: (the search keyword)
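As a rough sketch of how these captured fields could be assembled on the Python side: the key names and sample values are copied from the capture, while the helper name build_search_params is hypothetical, and whether the site expects them as a query string or a form body is not stated here.

```python
from urllib.parse import urlencode


def build_search_params(keyword, challenge, validate, seccode, token, province=""):
    """Assemble the captured fields; key names come from the capture."""
    return {
        "tab": "ent_tab",
        "province": province,
        "geetest_challenge": challenge,
        "geetest_validate": validate,
        "geetest_seccode": seccode,
        "token": token,
        "searchword": keyword,
    }


# Sample values taken from the captured request; they expire quickly in practice.
params = build_search_params(
    keyword="腾讯",
    challenge="10faf845f3f031f4aa0c314d5b593477",
    validate="84cec0edcd71ef8e63faafaf251c840a",
    seccode="84cec0edcd71ef8e63faafaf251c840a|jordan",
    token="40390420",
)

# urlencode percent-encodes the Chinese keyword and the "|" in the seccode.
encoded = urlencode(params)
print(encoded)
```

The values are only useful for seeing the shape of the request; a live run needs fresh geetest and token values.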

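If you search the JS for where these values are generated, you find the following (360 Browser was misbehaving, so from here on I debug in Chrome):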

 

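The screenshot shows where the three geetest parameters are generated. Exciting, right? Just reverse the function that generates them and the job is done? It is not that simple; read on.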

As the screenshot shows, I clicked through to the generating function, and... no comment.

Try another route: look at the other two requests

The third request:

The response of the third request contains a validate field.

Take that value and compare it with the geetest_validate parameter captured in the first request:

Parameters of the first request:
geetest_validate: 84cec0edcd71ef8e63faafaf251c840a
geetest_seccode: 84cec0edcd71ef8e63faafaf251c840a|jordan
validate in the third request's response: 84cec0edcd71ef8e63faafaf251c840a

Conclusion: they are identical (geetest_seccode is the same value with |jordan appended).
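The comparison can be written down as a small check. This is only a sketch based on the single capture shown here; the helper names are hypothetical, and the "|jordan" suffix is assumed from this one sample rather than from any documentation.

```python
def split_seccode(seccode):
    """Split a geetest_seccode value into (validate part, suffix)."""
    validate, _, suffix = seccode.partition("|")
    return validate, suffix


def seccode_matches(validate, seccode):
    """True when the seccode is the validate value plus a '|suffix'."""
    return split_seccode(seccode)[0] == validate


# Values copied from the first and third captured requests.
first_validate = "84cec0edcd71ef8e63faafaf251c840a"
first_seccode = "84cec0edcd71ef8e63faafaf251c840a|jordan"
third_validate = "84cec0edcd71ef8e63faafaf251c840a"

print(first_validate == third_validate)
print(seccode_matches(third_validate, first_seccode))
```

If the relation holds in general, getting validate from the third request is enough to fill both geetest_validate and geetest_seccode.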

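One problem shows up here:

The challenge value returned by the SearchItemCaptcha?t=1593853193470 request is different from the geetest_challenge value carried by the corp-query-search-1.html request.

And to obtain validate, the geetest_challenge parameter has to be dealt with first.

Setting that aside for now: first make the request and get gt, then come back to the geetest_challenge parameter later.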

Main part

Goal: obtain the response shown below

Cookie-based anti-crawling

As mentioned above, we need the response of http://www.gsxt.gov.cn/SearchItemCaptcha?t=1593853193470 in order to get the gt parameter.

First, simulate the request:

import requests
import time
import re
import execjs

class Business_Information(object):
    keyword = '腾讯'
    headers = {
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Cache-Control': 'no-cache',
        'Host': 'www.gsxt.gov.cn',
        'Pragma': 'no-cache',
        'Proxy-Connection': 'keep-alive',
        'Referer': 'http://www.gsxt.gov.cn/corp-query-search-1.html',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest',
           }
    sess = requests.session()
    sess.get('http://www.gsxt.gov.cn/index.html',headers=headers)
    def get_challenge(self):
        url = 'http://www.gsxt.gov.cn/SearchItemCaptcha'
        params = {
            "t": round(time.time() * 1000)
        }
        # Fetch the page whose embedded JS generates the cookie
        cookie_html = self.sess.get(url, params=params, headers=self.headers).text
        print(cookie_html)

    def main(self):
        self.get_challenge()


bf = Business_Information()
bf.main()

The result:

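It returns a blob like this, which is clearly not the data we want. So what is it? After two hours of digging, it turns out that this code generates more JS: only by running it do you get the JS that generates the cookie, and only by running that second piece of JS do you get the real cookie.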

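Flow: call the endpoint and get a pile of JS -> use a regex to extract the JS we need -> run it to produce the JS that generates the real cookie -> run that generated JS -> obtain the real cookie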

Analysis:

The JS returned by the endpoint (really an HTML page with JS embedded in it):

<script >
var x = "@21@var@@Jl@location@new@@while@0xEDB88320@@D@a@for@@@@0xFF@match@document@catch@20@@window@@search@parseInt@@@1593854493@@@0@rOm9XFMtA3QKV7nYsPGT4lifyWwkq5vcjH2IdxUoCbhERLaz81DNB6@@@@__jsl_clearance@2@04@10@innerHTML@onreadystatechange@@Jul@@@@substr@replace@false@4@try@g@8@DOMContentLoaded@@return@33@@@@reverse@@@charAt@addEventListener@if@createElement@002@href@Sat@@@36@GMT@charCodeAt@mDoFw@Ei@@length@@@else@@firstChild@toLowerCase@captcha@1@setTimeout@toString@split@@3@@String@https@challenge@function@@e@@Path@cookie@eval@@I@chars@@attachEvent@1500@Array@@@f@RegExp@@join@P@pathname@Expires@JgSe0upZ@@div@d@@@fromCharCode".replace(/@*$/, "").split("@"),
    y = "3 f=4f(){46('6.38=6.5f+6.15.28(/[\\?|&]44-4e/,\\'\\')',56);k.4k='1h=19.37|1c|'+(4f(){3 4i=[4f(f){2g f},4f(4i){2g 4i},4f(f){2g 50('4c.62('+f+')')},4f(f){e(3 4i=1c;4i<f.3i;4i++){f[4i]=16(f[4i]).47(3c)};2g f.5d('')}],f=[[(1i+[])+(-~~~''+[]+[])],'52',[(1i+[])+(-~~~''+[]+[])],[+[-~{}, ~~![]]+[]+[[]][1c]][1c].33(~~![]),[(-~~~''+[]+[])+(1i+[]),(-~~~''+[]+[])+[-~![]-~[((+!(+!{}))<<(+!(+!{})))-~{}-~(-~[-~[]-~[]])]]],'5%',(1i+[]),[((+!(+!{}))-~[]-~[]-~![]-~[]-~[]+[[]][1c])+((+!(+!{}))-~[]-~[]-~![]-~[]-~[]+[[]][1c])],(-~![]-~~~''+(-~[]+[-~-~{}])/[-~-~{}]+[]+[[]][1c]),'3f',[((+!(+!{}))-~[]-~[]-~![]-~[]-~[]+[[]][1c])+(-~[1i]+2a+[[]][1c])],(2a+[]+[])+({}+[]+[]).33(2a+2a),'3g',(-~![]-~~~''+(-~[]+[-~-~{}])/[-~-~{}]+[]+[[]][1c]),'5e',(2a+[]+[]),[(-~[1i]+[])+(-~~~''+[]+[])],[(-~![]-~~~''+(-~[]+[-~-~{}])/[-~-~{}]+[]+[[]][1c])+((+!(+!{}))-~[]-~[]-~![]-~[]-~[]+[[]][1c])],(([-~-~{}]+(+[])>>-~-~{})+[]+[[]][1c]),[(-~![]-~~~''+(-~[]+[-~-~{}])/[-~-~{}]+[]+[[]][1c])+[-~![]-~[((+!(+!{}))<<(+!(+!{})))-~{}-~(-~[-~[]-~[]])]]],'%',(-~[1i]+[]),'c'];e(3 49=1c;49<f.3i;49++){f[49]=4i[[4a,45,4a,1c,4a,45,1c,1i,1c,45,1i,1c,45,1c,45,1c,4a,1i,1c,1i,45,1c,45][49]](f[49])};2g f.5d('')})()+';5g=39, 1j-23-11 1k:2:2h 3d;4j=/;'};35((4f(){2b{2g !!13.34;}10(4h){2g 29;}})()){k.34('2e',f,29)}40{k.55('21',f)}",
    f = function (x, y) {
        var a = 0, b = 0, c = 0;
        x = x.split("");
        y = y || 99;
        while ((a = x.shift()) && (b = a.charCodeAt(0) - 77.5)) c = (Math.abs(b) < 13 ? (b + 48.5) : parseInt(a, 36)) + y * c;
        return c
    }, z = f(y.match(/\w/g).sort(function (x, y) {
        return f(x) - f(y)
    }).pop());
while (z++) try {
    eval(y.replace(/\b\w+\b/g, function (y) {
        return x[f(y, z) - 1] || ("_" + y)
    }));
    break
} catch (_) {
}
</script>        
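To make the packing scheme less opaque, here is a rough Python port of the decoding logic in that script. It is a reading of the sample above, not a drop-in replacement: x is an "@"-separated word list, every word-like token in y is an index into that list written in a custom base (digits 0-9, a-z, then A-Z), and empty slots become "_" plus the token. The function names are hypothetical. The port assumes the template has no underscores, and it uses only the first candidate base (largest digit plus one), whereas the JS keeps incrementing the base until eval succeeds.

```python
import re


def digit_value(ch):
    """Value of one digit: 0-9 and a-z map to 0..35, A-Z map to 36..61."""
    if "A" <= ch <= "Z":
        return ord(ch) - 29
    return int(ch, 36)


def decode_token(token, base):
    """Read a token as a number written in the given base."""
    value = 0
    for ch in token:
        value = value * base + digit_value(ch)
    return value


def unpack(packed_words, template):
    """Replace every token in template with its word from packed_words."""
    words = packed_words.rstrip("@").split("@")
    digits = re.findall(r"[0-9A-Za-z]", template)
    # First base the JS loop tries: largest digit value plus one.
    base = max(digit_value(ch) for ch in digits) + 1

    def substitute(match):
        token = match.group(0)
        index = decode_token(token, base) - 1  # the word list is 1-based
        word = words[index] if 0 <= index < len(words) else ""
        return word or "_" + token  # empty slot -> "_token", as in the JS

    return re.sub(r"\b[0-9A-Za-z]+\b", substitute, template, flags=re.ASCII)


# Toy input in the same format (not taken from the site).
print(unpack("var@@document@cookie@@", "1 2=3.4"))
```

For the sample above the largest digit is k, so the first base tried is 21: token 3 maps to var, 6 to location and k to document, which matches the decoded output shown later in this post.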

Next, modify the returned JS as follows:

function pre_cookie() { // wrap everything in a function
var x = "@21@var@@Jl@location@new@@while@0xEDB88320@@D@a@for@@@@0xFF@match@document@catch@20@@window@@search@parseInt@@@1593854493@@@0@rOm9XFMtA3QKV7nYsPGT4lifyWwkq5vcjH2IdxUoCbhERLaz81DNB6@@@@__jsl_clearance@2@04@10@innerHTML@onreadystatechange@@Jul@@@@substr@replace@false@4@try@g@8@DOMContentLoaded@@return@33@@@@reverse@@@charAt@addEventListener@if@createElement@002@href@Sat@@@36@GMT@charCodeAt@mDoFw@Ei@@length@@@else@@firstChild@toLowerCase@captcha@1@setTimeout@toString@split@@3@@String@https@challenge@function@@e@@Path@cookie@eval@@I@chars@@attachEvent@1500@Array@@@f@RegExp@@join@P@pathname@Expires@JgSe0upZ@@div@d@@@fromCharCode".replace(/@*$/, "").split("@"),
    y = "3 f=4f(){46('6.38=6.5f+6.15.28(/[\\?|&]44-4e/,\\'\\')',56);k.4k='1h=19.37|1c|'+(4f(){3 4i=[4f(f){2g f},4f(4i){2g 4i},4f(f){2g 50('4c.62('+f+')')},4f(f){e(3 4i=1c;4i<f.3i;4i++){f[4i]=16(f[4i]).47(3c)};2g f.5d('')}],f=[[(1i+[])+(-~~~''+[]+[])],'52',[(1i+[])+(-~~~''+[]+[])],[+[-~{}, ~~![]]+[]+[[]][1c]][1c].33(~~![]),[(-~~~''+[]+[])+(1i+[]),(-~~~''+[]+[])+[-~![]-~[((+!(+!{}))<<(+!(+!{})))-~{}-~(-~[-~[]-~[]])]]],'5%',(1i+[]),[((+!(+!{}))-~[]-~[]-~![]-~[]-~[]+[[]][1c])+((+!(+!{}))-~[]-~[]-~![]-~[]-~[]+[[]][1c])],(-~![]-~~~''+(-~[]+[-~-~{}])/[-~-~{}]+[]+[[]][1c]),'3f',[((+!(+!{}))-~[]-~[]-~![]-~[]-~[]+[[]][1c])+(-~[1i]+2a+[[]][1c])],(2a+[]+[])+({}+[]+[]).33(2a+2a),'3g',(-~![]-~~~''+(-~[]+[-~-~{}])/[-~-~{}]+[]+[[]][1c]),'5e',(2a+[]+[]),[(-~[1i]+[])+(-~~~''+[]+[])],[(-~![]-~~~''+(-~[]+[-~-~{}])/[-~-~{}]+[]+[[]][1c])+((+!(+!{}))-~[]-~[]-~![]-~[]-~[]+[[]][1c])],(([-~-~{}]+(+[])>>-~-~{})+[]+[[]][1c]),[(-~![]-~~~''+(-~[]+[-~-~{}])/[-~-~{}]+[]+[[]][1c])+[-~![]-~[((+!(+!{}))<<(+!(+!{})))-~{}-~(-~[-~[]-~[]])]]],'%',(-~[1i]+[]),'c'];e(3 49=1c;49<f.3i;49++){f[49]=4i[[4a,45,4a,1c,4a,45,1c,1i,1c,45,1i,1c,45,1c,45,1c,4a,1i,1c,1i,45,1c,45][49]](f[49])};2g f.5d('')})()+';5g=39, 1j-23-11 1k:2:2h 3d;4j=/;'};35((4f(){2b{2g !!13.34;}10(4h){2g 29;}})()){k.34('2e',f,29)}40{k.55('21',f)}",
    f = function (x, y) {
        var a = 0, b = 0, c = 0;
        x = x.split("");
        y = y || 99;
        while ((a = x.shift()) && (b = a.charCodeAt(0) - 77.5)) c = (Math.abs(b) < 13 ? (b + 48.5) : parseInt(a, 36)) + y * c;
        return c
    }, z = f(y.match(/\w/g).sort(function (x, y) {
        return f(x) - f(y)
    }).pop());
while (z++) try {
    var result;  // variable that holds the decoded JS
    result = (y.replace(/\b\w+\b/g, function (y) { // assign the string instead of calling eval
        return x[f(y, z) - 1] || ("_" + y)
    }));
    break
} catch (_) {
}
return result  // return the decoded JS
}
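The same edit can be automated. The sketch below pulls the script body out of the returned HTML and swaps the eval call for an assignment. The function names are hypothetical, and the regexes assume the page contains a single plain script tag with one eval( call, as in the sample above.

```python
import re


def extract_script(html):
    """Return the body of the first <script> block (whitespace-tolerant)."""
    match = re.search(r"<script\s*>(.*?)</script>", html, re.S | re.I)
    if match is None:
        raise ValueError("no <script> block found")
    return match.group(1)


def wrap_as_function(script_body, name="pre_cookie"):
    """Turn the script into a function that returns the decoded JS string."""
    patched, count = re.subn(r"\beval\s*\(", "result = (", script_body, count=1)
    if count != 1:
        raise ValueError("eval( call not found")
    return "function %s(){var result;%s;return result}" % (name, patched)


# Toy page with the same overall shape as the real response.
sample_html = (
    "<script >\nvar y = 'a';\nwhile (z++) try {\n"
    "    eval(y);\n    break\n} catch (_) {\n}\n</script>"
)
print(wrap_as_function(extract_script(sample_html)))
```

Matching on eval followed by an opening parenthesis, rather than on an exact string such as try{eval, keeps the patch working whether or not the response is minified.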

Next, call pre_cookie with the execjs module:

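Result: the part in bold is the JS that generates the real cookie. Note: over repeated runs the returned JS is occasionally malformed (not often; try it if you are curious), so the next step needs a check for that.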

var _f = function () {
    setTimeout('location.href=location.pathname+location.search.replace(/[\?|&]captcha-challenge/,\'\')', 1500);
    document.cookie = '__jsl_clearance=1593854493.002|0|' + (function () {
        var _4i = [function (_f) {
                return _f
            }, function (_4i) {
                return _4i
            }, function (_f) {
                return eval('String.fromCharCode(' + _f + ')')
            }, function (_f) {
                for (var _4i = 0; _4i < _f.length; _4i++) {
                    _f[_4i] = parseInt(_f[_4i]).toString(36)
                }
                ;
                return _f.join('')
            }],
            _f = [[(2 + []) + (-~~~'' + [] + [])], 'I', [(2 + []) + (-~~~'' + [] + [])], [+[-~{}, ~~![]] + [] + [[]][0]][0].charAt(~~![]), [(-~~~'' + [] + []) + (2 + []), (-~~~'' + [] + []) + [-~![] - ~[((+!(+!{})) << (+!(+!{}))) - ~{} - ~(-~[-~[] - ~[]])]]], 'Jl%', (2 + []), [((+!(+!{})) - ~[] - ~[] - ~![] - ~[] - ~[] + [[]][0]) + ((+!(+!{})) - ~[] - ~[] - ~![] - ~[] - ~[] + [[]][0])], (-~![] - ~~~'' + (-~[] + [-~-~{}]) / [-~-~{}] + [] + [[]][0]), 'mDoFw', [((+!(+!{})) - ~[] - ~[] - ~![] - ~[] - ~[] + [[]][0]) + (-~[2] + 4 + [[]][0])], (4 + [] + []) + ({} + [] + []).charAt(4 + 4), 'Ei', (-~![] - ~~~'' + (-~[] + [-~-~{}]) / [-~-~{}] + [] + [[]][0]), 'P', (4 + [] + []), [(-~[2] + []) + (-~~~'' + [] + [])], [(-~![] - ~~~'' + (-~[] + [-~-~{}]) / [-~-~{}] + [] + [[]][0]) + ((+!(+!{})) - ~[] - ~[] - ~![] - ~[] - ~[] + [[]][0])], (([-~-~{}] + (+[]) >> -~-~{}) + [] + [[]][0]), [(-~![] - ~~~'' + (-~[] + [-~-~{}]) / [-~-~{}] + [] + [[]][0]) + [-~![] - ~[((+!(+!{})) << (+!(+!{}))) - ~{} - ~(-~[-~[] - ~[]])]]], '%', (-~[2] + []), 'D'];
        for (var _49 = 0; _49 < _f.length; _49++) {
            _f[_49] = _4i[[3, 1, 3, 0, 3, 1, 0, 2, 0, 1, 2, 0, 1, 0, 1, 0, 3, 2, 0, 2, 1, 0, 1][_49]](_f[_49])
        }
        ;
        return _f.join('')
    })() + ';Expires=Sat, 04-Jul-20 10:21:33 GMT;Path=/;'
};
if ((function () {
    try {
        return !!window.addEventListener;
    } catch (e) {
        return false;
    }
})()) {
    document.addEventListener('DOMContentLoaded', _f, false)
} else {
    document.attachEvent('onreadystatechange', _f)
}
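Because the decoded JS is occasionally malformed, one option is to validate it and fetch again on failure. A minimal sketch follows: the marker string is the one the full source in this post checks for, while looks_usable, fetch_with_retry and the attempt count are hypothetical names and choices.

```python
BAD_MARKER = "href(){setTimeout"  # substring seen in the malformed variant


def looks_usable(generated_js):
    """Heuristic check that the decoded JS is the cookie-setting variant."""
    return "document.cookie" in generated_js and BAD_MARKER not in generated_js


def fetch_with_retry(fetch_generated_js, attempts=3):
    """Call fetch_generated_js() until it returns usable JS or attempts run out."""
    for _ in range(attempts):
        generated_js = fetch_generated_js()
        if looks_usable(generated_js):
            return generated_js
    raise RuntimeError("no usable JS after %d attempts" % attempts)


# Stand-in for the real fetch: the first result is malformed, the second is fine.
samples = iter([
    "var _f=function href(){setTimeout('x',1500)}",
    "var _f=function(){document.cookie='a=b'}",
])
print(fetch_with_retry(lambda: next(samples)))
```

Passing the fetch step in as a callable keeps the retry logic separate from the network and execjs calls.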

Take the cookie-generating part of that JS and rewrite it:

function generate_cookie_js() { // wrap it in a function
    var cookie = '__jsl_clearance=1593854493.002|0|' + (function () {
        var _4i = [function (_f) {
                return _f
            }, function (_4i) {
                return _4i
            }, function (_f) {
                return eval('String.fromCharCode(' + _f + ')')
            }, function (_f) {
                for (var _4i = 0; _4i < _f.length; _4i++) {
                    _f[_4i] = parseInt(_f[_4i]).toString(36)
                }
                ;
                return _f.join('')
            }],
            _f = [[(2 + []) + (-~~~'' + [] + [])], 'I', [(2 + []) + (-~~~'' + [] + [])], [+[-~{}, ~~![]] + [] + [[]][0]][0].charAt(~~![]), [(-~~~'' + [] + []) + (2 + []), (-~~~'' + [] + []) + [-~![] - ~[((+!(+!{})) << (+!(+!{}))) - ~{} - ~(-~[-~[] - ~[]])]]], 'Jl%', (2 + []), [((+!(+!{})) - ~[] - ~[] - ~![] - ~[] - ~[] + [[]][0]) + ((+!(+!{})) - ~[] - ~[] - ~![] - ~[] - ~[] + [[]][0])], (-~![] - ~~~'' + (-~[] + [-~-~{}]) / [-~-~{}] + [] + [[]][0]), 'mDoFw', [((+!(+!{})) - ~[] - ~[] - ~![] - ~[] - ~[] + [[]][0]) + (-~[2] + 4 + [[]][0])], (4 + [] + []) + ({} + [] + []).charAt(4 + 4), 'Ei', (-~![] - ~~~'' + (-~[] + [-~-~{}]) / [-~-~{}] + [] + [[]][0]), 'P', (4 + [] + []), [(-~[2] + []) + (-~~~'' + [] + [])], [(-~![] - ~~~'' + (-~[] + [-~-~{}]) / [-~-~{}] + [] + [[]][0]) + ((+!(+!{})) - ~[] - ~[] - ~![] - ~[] - ~[] + [[]][0])], (([-~-~{}] + (+[]) >> -~-~{}) + [] + [[]][0]), [(-~![] - ~~~'' + (-~[] + [-~-~{}]) / [-~-~{}] + [] + [[]][0]) + [-~![] - ~[((+!(+!{})) << (+!(+!{}))) - ~{} - ~(-~[-~[] - ~[]])]]], '%', (-~[2] + []), 'D'];
        for (var _49 = 0; _49 < _f.length; _49++) {
            _f[_49] = _4i[[3, 1, 3, 0, 3, 1, 0, 2, 0, 1, 2, 0, 1, 0, 1, 0, 3, 2, 0, 2, 1, 0, 1][_49]](_f[_49])
        }
        ;
        return _f.join('')
    })() + ';Expires=Sat, 04-Jul-20 10:21:33 GMT;Path=/;'
    return cookie; // return the cookie string
};

Call it with the execjs module:

That gives the result: the cookie value this anti-crawling check expects the client to send.
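The string produced by the JS has the form name=value followed by Expires and Path attributes, so only the first segment belongs in the session cookie. A small sketch of that split follows; the helper name is hypothetical and the value after the second "|" is a made-up placeholder.

```python
def parse_cookie_string(cookie_string):
    """Return (name, value) from 'name=value;Expires=...;Path=/;'."""
    first_segment, _, _ = cookie_string.partition(";")
    name, separator, value = first_segment.partition("=")
    if not separator:
        raise ValueError("no name=value pair found")
    return name.strip(), value.strip()


# The part after the second "|" is a placeholder, not a real clearance value.
raw = "__jsl_clearance=1593854493.002|0|abc%3D;Expires=Sat, 04-Jul-20 10:21:33 GMT;Path=/;"
name, value = parse_cookie_string(raw)
print(name, value)
```

The pair can then be handed to something like requests' sess.cookies.set(name, value).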

Here is the full source for the cookie anti-crawling part:

import requests
import time
import re
import execjs

class Business_Information(object):
    keyword = '腾讯'
    headers = {
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Cache-Control': 'no-cache',
        'Host': 'www.gsxt.gov.cn',
        'Pragma': 'no-cache',
        'Proxy-Connection': 'keep-alive',
        'Referer': 'http://www.gsxt.gov.cn/corp-query-search-1.html',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest',
           }
    sess = requests.session()
    sess.get('http://www.gsxt.gov.cn/index.html',headers=headers)
    def get_challenge(self):
        url = 'http://www.gsxt.gov.cn/SearchItemCaptcha'
        params = {
            "t": round(time.time() * 1000)
        }
        # Fetch the page whose embedded JS generates the cookie
        cookie_html = self.sess.get(url, params=params, headers=self.headers).text
        # Pull the JS out of the returned HTML (tolerating whitespace inside the tag)
        cookie_js = re.findall(r"<script\s*>(.*?)</script>", cookie_html, re.S)[0]
        # Wrap it in a function and capture the decoded string instead of eval-ing it
        edit_js = "function pre_cookie(){" + cookie_js.replace('try{eval', 'try{var result; result=') + "return result}"
        # First JS call: compile the wrapper
        first_js = execjs.compile(edit_js)
        # Run it to obtain the second-stage JS (it changes on every request)
        generate_cookie_js_all = first_js.call("pre_cookie")
        # print(generate_cookie_js_all)
        # Occasionally the generated JS is malformed; bail out so the caller can retry
        if "href(){setTimeout" in generate_cookie_js_all:
            raise Exception('The generated JS is malformed; fetch it again.')
        # Extract the expression that builds the cookie
        generate_cookie_js = re.findall(r'document\.(cookie=.*?if)', generate_cookie_js_all)[0]
        generate_cookie_js = "window = {};var get_cookie = function () {" + generate_cookie_js.replace("};if", ";return cookie};")
        # Second JS call: produce the real cookie string
        second_js = execjs.compile(generate_cookie_js)
        cookie = second_js.call('get_cookie')
        print(cookie)
        # Keep only the value, dropping the trailing Expires/Path attributes
        cookie = cookie.split("__jsl_clearance=")[-1].split(";")[0]
        self.sess.cookies.set("__jsl_clearance", cookie)
        json_data = self.sess.get(url, params=params, headers=self.headers).json()
        print(json_data)


    def main(self):
        self.get_challenge()


bf = Business_Information()
bf.main()

Output:

Note: this site's anti-crawling measures have since been updated.

posted on 2020-07-04 16:08 by kindvampire