SPIDER-DAY04: requests.post requests and proxies

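1. Proxy parameters

1.1 Proxy IP overview

【1】Definition
An IP address that connects to the network in place of your own IP address

【2】Purpose
Hides your real IP address so that it does not get banned
   
【3】Sites that provide proxy IPs
Kuaidaili (快代理), Quanwang Daili (全网代理), Daili Jingling (代理精灵), ...

【4】Parameter type
proxies
proxies = { 'protocol':'protocol://IP:port' }
proxies = { 'protocol':'protocol://username:password@IP:port' }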
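
The two proxies formats above can be produced by one small helper. This is a minimal sketch, not part of the original notes: the function name build_proxies and its parameters are hypothetical, and percent-encoding the credentials with urllib.parse.quote is an added precaution for passwords that contain characters such as @.

```python
from urllib.parse import quote


def build_proxies(host, port, username=None, password=None):
    """Build a requests-style proxies dict for a single proxy server.

    build_proxies is a hypothetical helper used for illustration only.
    """
    auth = ''
    if username is not None and password is not None:
        # Percent-encode credentials so characters like '@' or ':' stay unambiguous
        auth = '{}:{}@'.format(quote(username, safe=''), quote(password, safe=''))
    address = '{}{}:{}'.format(auth, host, port)
    return {
        'http': 'http://' + address,
        'https': 'https://' + address,
    }


print(build_proxies('1.1.1.1', 8888))
print(build_proxies('1.1.1.1', 8888, 'user', 'p@ss'))
```

The returned dict can be passed directly as the proxies argument of requests.get or requests.post.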

1.2 Proxy categories

1.2.1 Ordinary proxies

【1】Proxy format
proxies = { 'protocol':'protocol://IP:port' }

【2】Use a free ordinary proxy IP to access the test site: http://httpbin.org/get

import requests
url = 'http://httpbin.org/get'
headers = {'User-Agent':'Mozilla/5.0'}
# Define the proxy; look up a free proxy IP on one of the proxy IP sites
proxies = {
   'http':'http://112.85.164.220:9999',
   'https':'https://112.85.164.220:9999'
}
html = requests.get(url,proxies=proxies,headers=headers,timeout=5).text
print(html)
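
To confirm that the request really went through the proxy, the "origin" field of the httpbin.org/get response can be compared with your own address. The sketch below works on a hand-written sample body rather than a live response; the helper names and the address 203.0.113.7 (standing in for your real IP) are assumptions made for illustration.

```python
import json


def seen_ip(httpbin_body):
    """Return the caller IP that httpbin reports in its 'origin' field."""
    return json.loads(httpbin_body)['origin']


def proxy_took_effect(httpbin_body, real_ip):
    """True when the reported origin does not contain our real IP."""
    return real_ip not in seen_ip(httpbin_body)


# Hand-written sample shaped like an httpbin.org/get response body
sample = '{"args": {}, "origin": "112.85.164.220", "url": "http://httpbin.org/get"}'
print(seen_ip(sample))
print(proxy_took_effect(sample, '203.0.113.7'))
```

A substring check is used instead of equality because a proxy that forwards the client address may cause more than one IP to appear in the origin field.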

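1.2.2 Private proxies and dedicated proxies

【1】Proxy format
proxies = { 'protocol':'protocol://username:password@IP:port' }

【2】Use a private or dedicated proxy IP to access the test site: http://httpbin.org/get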

import requests
url = 'http://httpbin.org/get'
proxies = {
   'http': 'http://309435365:szayclhp@106.75.71.140:16816',
   'https':'https://309435365:szayclhp@106.75.71.140:16816',
}
headers = {
   'User-Agent' : 'Mozilla/5.0',
}

html = requests.get(url,proxies=proxies,headers=headers,timeout=5).text
print(html)

1.3 Building a proxy IP pool

"""
Build a proxy IP pool - open proxies
"""
import requests
from fake_useragent import UserAgent

class ProxyPool:
    def __init__(self):
        # Placeholder: replace with your own Kuaidaili API link
        self.api_url = 'Kuaidaili API link'
        self.test_url = 'http://httpbin.org/get'
        self.headers = {'User-Agent': UserAgent().random}

    def get_proxy(self):
        """Fetch proxy IPs from the API and test each one"""
        html = requests.get(url=self.api_url,
                            headers=self.headers).text
        proxy_list = html.split('\r\n')
        for proxy in proxy_list:
            # proxy: 1.1.1.1:8888
            # Skip empty entries left over from the split
            if proxy.strip():
                self.test_proxy(proxy.strip())

    def test_proxy(self, proxy):
        """Test whether a single proxy IP is usable"""
        proxies = {
            'http' : 'http://{}'.format(proxy),
            'https': 'https://{}'.format(proxy)
        }
        try:
            requests.get(url=self.test_url,
                         proxies=proxies,
                         headers=self.headers,
                         timeout=3)
            print(proxy, '\033[31musable\033[0m')
        except Exception:
            print(proxy, 'unusable')

if __name__ == '__main__':
    spider = ProxyPool()
    spider.get_proxy()
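
The pool above tests proxies one at a time, so a long list with a 3-second timeout can take a while. One possible variation, sketched here under assumptions, runs the checks in a thread pool. The names filter_usable and fake_check are hypothetical, and fake_check stands in for a real network test so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def filter_usable(proxies, check, max_workers=8):
    """Return the proxies for which check(proxy) is truthy, keeping input order."""
    candidates = [p.strip() for p in proxies if p.strip()]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the inputs
        results = list(pool.map(check, candidates))
    return [p for p, ok in zip(candidates, results) if ok]


def fake_check(proxy):
    """Stand-in for a real test such as a requests.get through the proxy."""
    return proxy.endswith(':8888')


usable = filter_usable(['1.1.1.1:8888', '', '2.2.2.2:9999'], fake_check)
print(usable)
```

In real use, check would wrap a request like the one in test_proxy and return True or False instead of printing.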

2. requests.post()

2.1 POST requests

【1】Use case: websites that take POST requests

【2】Parameter: data={}
  2.1) Form data: a dict
  2.2) res = requests.post(url=url,data=data,headers=headers)
 
【3】Characteristic of POST requests: the data is submitted as a Form
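
When data is a dict, requests sends it as a form-encoded body (Content-Type application/x-www-form-urlencoded). The sketch below reproduces that encoding with the standard library only, to show what ends up in the request body; encode_form is a hypothetical name and the field values are illustrative.

```python
from urllib.parse import urlencode


def encode_form(data):
    """Encode a dict the way an HTML form body is encoded."""
    return urlencode(data)


body = encode_form({'i': 'tiger', 'from': 'AUTO', 'to': 'AUTO'})
print(body)
```

This is the same key=value&key=value text that the Form Data panel of the browser console shows in its source view.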

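2.2 Capturing packets in the browser console

  • How to open it, and common options

    【1】Open the browser, press F12 to open the console, and find the Network tab

    【2】Common console options
      2.1) Network: captures network packets
        a> ALL: captures all network packets
        b> XHR: captures asynchronously loaded network packets
        c> JS: captures all JS files
      2.2) Sources: pretty-prints JavaScript and lets you set breakpoints to debug it, which helps when analyzing some of a spider's parameters
      2.3) Console: interactive mode, where you can test JavaScript code
       
    【3】After capturing a specific network packet
      3.1) Click the packet's address on the left to open its details, then look at the right-hand pane
      3.2) Right-hand pane:
        a> Headers: the full request information
           General, Response Headers, Request Headers, Query String, Form Data
        b> Preview: a preview of the response content
        c> Response: the response content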

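3. Youdao Translate spider

3.1 Project requirements

Reverse-engineer the Youdao Translate interface and scrape the translation result

# Sample output
Enter the word to translate: elephant
Translation: 大象
*************************
Enter the word to translate: 喵喵叫
Translation: mews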

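3.2 Project analysis workflow

【1】Prepare to capture packets: press F12 to open the console, then refresh the page
【2】Find the address
2.1) Type a word to translate on the page, capture the network packets in the console, then find and analyze the address that returns the translation data
       F12-Network-XHR-Headers-General-Request URL
【3】Find the pattern
3.1) Once you have the address that returns the actual data, enter a few more words on the page and find the corresponding URL
3.2) Compare Network - All (or XHR) - Form Data across requests to find the pattern
【4】Find the JS encryption file
Top-right corner of the console ...->Search->search for a keyword->click->jump to Sources, then click the format symbol {} at the bottom left
【5】Read the JS code
Search for the keyword, find the relevant encryption method, and implement the algorithm in Python
【6】Breakpoint debugging
If some parameters in the JS code are unclear, set breakpoints to inspect them
【7】Implement the JS encryption algorithm in Python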

3.3 Project steps

1. Start an F12 packet capture and find the Form data, shown below:

i: 喵喵叫
from: AUTO
to: AUTO
smartresult: dict
client: fanyideskweb
salt: 15614112641250
sign: 94008208919faa19bd531acde36aac5d
ts: 1561411264125
bv: f4d62a2579ebb44874d7ef93ba47e822
doctype: json
version: 2.1
keyfrom: fanyi.web
action: FY_BY_REALTlME

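2. Translate a few more words on the page and watch how the Form data changes

salt: 15614112641250
sign: 94008208919faa19bd531acde36aac5d
ts: 1561411264125
bv: f4d62a2579ebb44874d7ef93ba47e822
# salt, sign, and ts change on every request, but the value of bv stays the same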
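
Comparing captures by eye gets tedious once a form has many fields. A small sketch of the same comparison in code follows. capture_1 uses the values shown above, while capture_2 is made-up placeholder data standing in for a second capture, and changing_keys is a hypothetical helper name.

```python
def changing_keys(*captures):
    """Return the sorted keys whose value differs between the captured forms."""
    first = captures[0]
    return sorted(
        key for key in first
        if any(other.get(key) != first[key] for other in captures[1:])
    )


capture_1 = {
    'salt': '15614112641250',
    'sign': '94008208919faa19bd531acde36aac5d',
    'ts': '1561411264125',
    'bv': 'f4d62a2579ebb44874d7ef93ba47e822',
}
# Placeholder values, NOT a real capture: only the shape matters here
capture_2 = {
    'salt': '15614112700003',
    'sign': '0' * 32,
    'ts': '1561411270000',
    'bv': 'f4d62a2579ebb44874d7ef93ba47e822',
}

print(changing_keys(capture_1, capture_2))
```

The keys it reports are the ones that have to be generated per request; everything else can be copied into the spider as a constant.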

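3. The encryption is usually done by a local JS file: refresh the page, find the JS file, and analyze its code

Top-right corner of the console - Search - search for salt - open the file - pretty-print it

【Result】: the relevant JS file turns out to be fanyi.min.js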

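4. Open the JS file, analyze the encryption algorithm, and implement it in Python

【ts】Analysis shows this is a 13-digit (millisecond) timestamp, as a string
   JS implementation)  "" + (new Date).getTime()
   Python implementation) str(int(time.time() * 1000))

【salt】ts + a random integer from 0 to 9 (as a string)
   JS implementation)  ts + parseInt(10 * Math.random(), 10);
   Python implementation)  ts + str(random.randint(0, 9))

【sign】(set a breakpoint to inspect the value of e; e turns out to be the word being translated)
   JS implementation) n.md5("fanyideskweb" + e + salt + "Tbh5E8=q6U3EXe+&L[4c@")
   Python implementation)
   from hashlib import md5
   # word is the text to translate (e in the JS code)
   string = "fanyideskweb" + word + salt + "Tbh5E8=q6U3EXe+&L[4c@"
   m = md5()
   m.update(string.encode())
   sign = m.hexdigest()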
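
The three values can be generated together in one function. This is a minimal sketch built from the fragments above: the name make_ts_salt_sign and the optional now_ms and digit parameters are additions for illustration, included only so that the result can be reproduced in a test. The key string is the one quoted in these notes and may no longer match the live site.

```python
import random
import time
from hashlib import md5

# Key string as quoted in the notes above
SECRET = "Tbh5E8=q6U3EXe+&L[4c@"


def make_ts_salt_sign(word, now_ms=None, digit=None):
    """Return (ts, salt, sign) for the word to translate.

    now_ms and digit are hypothetical overrides that make the output
    deterministic; leave them as None for normal use.
    """
    ts = str(int(time.time() * 1000)) if now_ms is None else str(now_ms)
    salt = ts + str(random.randint(0, 9) if digit is None else digit)
    string = "fanyideskweb" + word + salt + SECRET
    sign = md5(string.encode()).hexdigest()
    return ts, salt, sign


ts, salt, sign = make_ts_salt_sign('tiger', now_ms=1561411264125, digit=0)
print(ts, salt, sign)
```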

5. Use a regex in PyCharm to convert headers and form data

【1】How to open it in PyCharm: Ctrl + r, then tick Regex
【2】Convert the headers and form data
    (.*): (.*)
    "$1": "$2",
【3】Click Replace All
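
The same conversion can be done in Python without the editor. The sketch below is an alternative to the PyCharm replace, not the method used in these notes. lines_to_dict is a hypothetical helper, and it uses a non-greedy first group so that each line is split at its first ": ".

```python
import re


def lines_to_dict(raw):
    """Turn 'key: value' lines copied from the console into a dict."""
    result = {}
    for line in raw.strip().splitlines():
        match = re.match(r'(.*?): (.*)', line.strip())
        if match:
            result[match.group(1)] = match.group(2)
    return result


copied = """
i: tiger
from: AUTO
Referer: http://fanyi.youdao.com/
"""
print(lines_to_dict(copied))
```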

3.4 Code implementation

"""
Enter the word to translate: tiger
Translation: 老虎
"""
import requests
import time
from hashlib import md5
import random

class YdSpider:
    def __init__(self):
        # The URL must be the POST address captured with F12
        self.post_url = 'http://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule'
        self.headers = {
            # The three most frequently checked headers: Cookie, Referer, User-Agent
            "Cookie": "OUTFOX_SEARCH_USER_ID=1391264118@10.108.160.105; OUTFOX_SEARCH_USER_ID_NCOO=2105417985.4787014; JSESSIONID=aaasSeD7PiY4G_nO8cWDx; SESSION_FROM_COOKIE=unknown; ___rl__test__cookies=1612506057171",
            "Referer": "http://fanyi.youdao.com/",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.146 Safari/537.36",
        }
        # Read the word to translate
        self.word = input('Enter the word to translate: ')

    def md5_string(self, string):
        """Helper: return the MD5 hex digest of a string"""
        m = md5()
        m.update(string.encode())

        return m.hexdigest()

    def get_ts_salt_sign(self):
        """Compute ts, salt, and sign"""
        ts = str(int(time.time() * 1000))
        salt = ts + str(random.randint(0, 9))
        string = "fanyideskweb" + self.word + salt + "Tbh5E8=q6U3EXe+&L[4c@"
        sign = self.md5_string(string)

        return ts, salt, sign

    def attack_yd(self):
        """Main logic: send the POST request and return the translation"""
        ts, salt, sign = self.get_ts_salt_sign()
        data = {
            "i": self.word,
            "from": "AUTO",
            "to": "AUTO",
            "smartresult": "dict",
            "client": "fanyideskweb",
            "salt": salt,
            "sign": sign,
            "lts": ts,
            "bv": "6a1ac4a5cc37a3de2c535a36eda9e149",
            "doctype": "json",
            "version": "2.1",
            "keyfrom": "fanyi.web",
            "action": "FY_BY_REALTlME",
        }
        # .json(): parses the JSON response body into Python data types
        # .json() is equivalent to json.loads(response.text)
        html = requests.post(url=self.post_url,
                             data=data,
                             headers=self.headers).json()

        return html['translateResult'][0][0]['tgt']

if __name__ == '__main__':
    spider = YdSpider()
    print(spider.attack_yd())
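
The lookup html['translateResult'][0][0]['tgt'] raises an unhelpful KeyError when the server rejects the request. A more defensive variant is sketched below, under assumptions. extract_translation is a hypothetical helper, the sample payloads are hand-written to match the lookup used above, and the shape of an error response (here a bare errorCode) is an assumption, not something these notes confirm.

```python
def extract_translation(payload):
    """Join every 'tgt' segment found under 'translateResult'."""
    try:
        paragraphs = payload['translateResult']
    except (KeyError, TypeError):
        # The response did not have the expected structure
        raise ValueError('unexpected response: {!r}'.format(payload))
    return '\n'.join(
        ''.join(segment['tgt'] for segment in paragraph)
        for paragraph in paragraphs
    )


# Hand-written sample payloads, not captured responses
ok_payload = {'translateResult': [[{'src': 'tiger', 'tgt': '老虎'}]]}
bad_payload = {'errorCode': 50}

print(extract_translation(ok_payload))
```

Raising ValueError with the full payload makes it easier to see whether a failure comes from a stale Cookie, a wrong sign, or a changed response format.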

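4. Baidu Translate spider with JS reverse engineering

4.1 JS reverse engineering in detail

【1】Use case
    When the JS encryption code is too complex to reimplement in Python, consider the JS reverse-engineering approach: run the original JS code from Python

【2】Module
    2.1》Module name: execjs
    2.2》Install: sudo pip3 install pyexecjs
    2.3》Usage
        import execjs
        with open('xxx.js', 'r') as f:
            js_code = f.read()

        js_obj = execjs.compile(js_code)
        js_obj.eval('function_name("argument")')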

4.2 Debugging the JS code

  • Grab the JS encryption file and save it as translate.js

    // e(r, gtk): added the gtk parameter
    // i = window[l] was changed to i = gtk
    function a(r) {
        if (Array.isArray(r)) {
            for (var o = 0, t = Array(r.length); o < r.length; o++)
                t[o] = r[o];
            return t
        }
        return Array.from(r)
    }
    function n(r, o) {
        for (var t = 0; t < o.length - 2; t += 3) {
            var a = o.charAt(t + 2);
            a = a >= "a" ? a.charCodeAt(0) - 87 : Number(a),
            a = "+" === o.charAt(t + 1) ? r >>> a : r << a,
            r = "+" === o.charAt(t) ? r + a & 4294967295 : r ^ a
        }
        return r
    }
    function e(r,gtk) {
        var o = r.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g);
        if (null === o) {
            var t = r.length;
            t > 30 && (r = "" + r.substr(0, 10) + r.substr(Math.floor(t / 2) - 5, 10) + r.substr(-10, 10))
        } else {
            for (var e = r.split(/[\uD800-\uDBFF][\uDC00-\uDFFF]/), C = 0, h = e.length, f = []; h > C; C++)
                "" !== e[C] && f.push.apply(f, a(e[C].split(""))),
                C !== h - 1 && f.push(o[C]);
            var g = f.length;
            g > 30 && (r = f.slice(0, 10).join("") + f.slice(Math.floor(g / 2) - 5, Math.floor(g / 2) + 5).join("") + f.slice(-10).join(""))
        }
        var u = void 0
          , l = "" + String.fromCharCode(103) + String.fromCharCode(116) + String.fromCharCode(107);
        u = null !== i ? i : (i = gtk || "") || "";
        for (var d = u.split("."), m = Number(d[0]) || 0, s = Number(d[1]) || 0, S = [], c = 0, v = 0; v < r.length; v++) {
            var A = r.charCodeAt(v);
            128 > A ? S[c++] = A : (2048 > A ? S[c++] = A >> 6 | 192 : (55296 === (64512 & A) && v + 1 < r.length && 56320 === (64512 & r.charCodeAt(v + 1)) ? (A = 65536 + ((1023 & A) << 10) + (1023 & r.charCodeAt(++v)),
            S[c++] = A >> 18 | 240,
            S[c++] = A >> 12 & 63 | 128) : S[c++] = A >> 12 | 224,
            S[c++] = A >> 6 & 63 | 128),
            S[c++] = 63 & A | 128)
        }
        for (var p = m, F = "" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(97) + ("" + String.fromCharCode(94) + String.fromCharCode(43) + String.fromCharCode(54)), D = "" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(51) + ("" + String.fromCharCode(94) + String.fromCharCode(43) + String.fromCharCode(98)) + ("" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(102)), b = 0; b < S.length; b++)
            p += S[b],
            p = n(p, F);
        return p = n(p, D),
        p ^= s,
        0 > p && (p = (2147483647 & p) + 2147483648),
        p %= 1e6,
        p.toString() + "." + (p ^ m)
    }
    var i = null;
  • Debug the JS file with test_translate.py

    import execjs
    
    with open('translate.js', 'r', encoding='utf-8') as f:
        jscode = f.read()
    
    jsobj = execjs.compile(jscode)
    sign = jsobj.eval('e("hello","320305.131321201")')
    print(sign)
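
    When the expression passed to eval is assembled from user input, a word containing a double quote would break the JS string literal. One way to avoid that, sketched here as an assumption and not as part of the original script, is to let json.dumps produce the quoted arguments. build_js_call is a hypothetical helper name.

```python
import json


def build_js_call(func_name, *args):
    """Build a JS call expression with each argument safely quoted."""
    quoted = ','.join(json.dumps(arg) for arg in args)
    return '{}({})'.format(func_name, quoted)


expression = build_js_call('e', 'hello', '320305.131321201')
print(expression)
print(build_js_call('e', 'say "hi"', '320305.131321201'))
```

    The resulting string can be handed to jsobj.eval in place of the hand-written expression.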

4.3 Baidu Translate code implementation

"""
Baidu Translate cracking example - JS reverse engineering (execjs module)
"""
import requests
import execjs
import re

class BdSpider:
    def __init__(self):
        # url: the POST URL captured with F12
        self.post_url = 'https://fanyi.baidu.com/v2transapi?from=en&to=zh'
        self.post_headers = {
            '''Accept''': '''*/*''',
            '''Accept-Encoding''': '''gzip, deflate, br''',
            '''Accept-Language''': '''zh-CN,zh;q=0.9''',
            '''Cache-Control''': '''no-cache''',
            '''Connection''': '''keep-alive''',
            '''Content-Length''': '''135''',
            '''Content-Type''': '''application/x-www-form-urlencoded; charset=UTF-8''',
            '''Cookie''': '''BIDUPSID=F29F6E28B0FBCFC1A7E211DE92583124; PSTM=1608988098; BAIDUID=F29F6E28B0FBCFC153ABCCC1188CE960:FG=1; FANYI_WORD_SWITCH=1; REALTIME_TRANS_SWITCH=1; HISTORY_SWITCH=1; SOUND_SPD_SWITCH=1; SOUND_PREFER_SWITCH=1; __yjs_duid=1_40d64069dbd8841dbfa1e003a7c7dd8a1611644719974; BDSFRCVID_BFESS=WauOJeC627FxfbveyBYtbHBDYpJWrfTTH6aoVfeprTJ6BRthQWFTEG0P8M8g0KubvwPHogKKBmOTHgLF_2uxOjjg8UtVJeC6EG0Ptf8g0M5; H_BDCLCKID_SF_BFESS=tJI8oK0XJD-3fP36qR6sMJI0hU5054RB2C6yX4K8Kb7VbU3p5fnkbfJBDGJlhP5abDT85fT2KqnIsMTxylJKbl07yajK2h5h56RH-pFK0foMjMj5QlOpQT8re-FOK5OibCrqLMJdab3vOpozXpO1bUAzBN5thURB2DkO-4bCWJ5TMl5jDh3Mb6ksD-FtqtJHKbDOVC_K3D; BDUSS=U5hRU50YVFPZmdLbnZ3QTlZYWJDdXpkZ2VRSUtDZzl3V0o4TEY1Z3d5YVhlMEJnRVFBQUFBJCQAAAAAAAAAAAEAAAADZykQv9u~2zMwOTQzNTM2NQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAJfuGGCX7hhge; BDUSS_BFESS=U5hRU50YVFPZmdLbnZ3QTlZYWJDdXpkZ2VRSUtDZzl3V0o4TEY1Z3d5YVhlMEJnRVFBQUFBJCQAAAAAAAAAAAEAAAADZykQv9u~2zMwOTQzNTM2NQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAJfuGGCX7hhge; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; Hm_lvt_64ecd82404c51e03dc91cb9e8c025574=1610182098,1610437054,1610501127,1612504926; H_PS_PSSID=33423_33439_33273_31660_33595_33540_33590_26350; BDRCVFR[feWj1Vr5u3D]=I67x6TjHwwYf0; delPer=0; PSINO=2; BA_HECTOR=8l2kag0h21052l24g91g1ps820r; Hm_lpvt_64ecd82404c51e03dc91cb9e8c025574=1612509484; __yjsv5_shitong=1.0_7_7b5b5c8481cf67b7c1053df9dfdc2105d925_300_1612509484292_101.30.65.196_f2e2fdf6; ab_sr=1.0.0_MGIyNDA3Mjk2NGU4NjAxZjkzYzU5YjQ4Mjg3YjJmMTFjMzRjY2E0Y2EwYWE5YTllZGE2Yjk5NmM2M2RjZmViMjUwMjIyZGJlODNhZDJkOTk0YjNkMjRiNTE0NjM4YzEx''',
            '''Host''': '''fanyi.baidu.com''',
            '''Origin''': '''https://fanyi.baidu.com''',
            '''Pragma''': '''no-cache''',
            '''Referer''': '''https://fanyi.baidu.com/''',
            '''sec-ch-ua''': '''"Chromium";v="88", "Google Chrome";v="88", ";Not A Brand";v="99"''',
            '''sec-ch-ua-mobile''': '''?0''',
            '''Sec-Fetch-Dest''': '''empty''',
            '''Sec-Fetch-Mode''': '''cors''',
            '''Sec-Fetch-Site''': '''same-origin''',
            '''User-Agent''': '''Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.146 Safari/537.36''',
            '''X-Requested-With''': '''XMLHttpRequest''',
        }
        self.word = input('Enter the word to translate: ')
        # Used to fetch gtk and token
        self.get_url = 'https://fanyi.baidu.com/'
        self.get_headers = {
            '''Accept''': '''text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9''',
            '''Accept-Encoding''': '''gzip, deflate, br''',
            '''Accept-Language''': '''zh-CN,zh;q=0.9''',
            '''Cache-Control''': '''no-cache''',
            '''Connection''': '''keep-alive''',
            '''Cookie''': '''BIDUPSID=F29F6E28B0FBCFC1A7E211DE92583124; PSTM=1608988098; BAIDUID=F29F6E28B0FBCFC153ABCCC1188CE960:FG=1; FANYI_WORD_SWITCH=1; REALTIME_TRANS_SWITCH=1; HISTORY_SWITCH=1; SOUND_SPD_SWITCH=1; SOUND_PREFER_SWITCH=1; __yjs_duid=1_40d64069dbd8841dbfa1e003a7c7dd8a1611644719974; BDSFRCVID_BFESS=WauOJeC627FxfbveyBYtbHBDYpJWrfTTH6aoVfeprTJ6BRthQWFTEG0P8M8g0KubvwPHogKKBmOTHgLF_2uxOjjg8UtVJeC6EG0Ptf8g0M5; H_BDCLCKID_SF_BFESS=tJI8oK0XJD-3fP36qR6sMJI0hU5054RB2C6yX4K8Kb7VbU3p5fnkbfJBDGJlhP5abDT85fT2KqnIsMTxylJKbl07yajK2h5h56RH-pFK0foMjMj5QlOpQT8re-FOK5OibCrqLMJdab3vOpozXpO1bUAzBN5thURB2DkO-4bCWJ5TMl5jDh3Mb6ksD-FtqtJHKbDOVC_K3D; BDUSS=U5hRU50YVFPZmdLbnZ3QTlZYWJDdXpkZ2VRSUtDZzl3V0o4TEY1Z3d5YVhlMEJnRVFBQUFBJCQAAAAAAAAAAAEAAAADZykQv9u~2zMwOTQzNTM2NQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAJfuGGCX7hhge; BDUSS_BFESS=U5hRU50YVFPZmdLbnZ3QTlZYWJDdXpkZ2VRSUtDZzl3V0o4TEY1Z3d5YVhlMEJnRVFBQUFBJCQAAAAAAAAAAAEAAAADZykQv9u~2zMwOTQzNTM2NQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAJfuGGCX7hhge; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; Hm_lvt_64ecd82404c51e03dc91cb9e8c025574=1610182098,1610437054,1610501127,1612504926; H_PS_PSSID=33423_33439_33273_31660_33595_33540_33590_26350; BDRCVFR[feWj1Vr5u3D]=I67x6TjHwwYf0; delPer=0; PSINO=2; Hm_lpvt_64ecd82404c51e03dc91cb9e8c025574=1612514819; __yjsv5_shitong=1.0_7_7b5b5c8481cf67b7c1053df9dfdc2105d925_300_1612514819989_101.30.65.196_430628f1; ab_sr=1.0.0_NjNjMjA3YzcyNDdiMzE4Njk5MGRkNjY1ZTY2YmFiNTI4MzE2ODQ3ZDIwYjBmNGRlZWFjODgyOGFjMGY0ZTQ3ODVlM2MxNDYxMjQ2ZWYzZGFkM2EzYWFjZjYyM2RkY2Vi''',
            '''Host''': '''fanyi.baidu.com''',
            '''Pragma''': '''no-cache''',
            '''sec-ch-ua''': '''"Chromium";v="88", "Google Chrome";v="88", ";Not A Brand";v="99"''',
            '''sec-ch-ua-mobile''': '''?0''',
            '''Sec-Fetch-Dest''': '''document''',
            '''Sec-Fetch-Mode''': '''navigate''',
            '''Sec-Fetch-Site''': '''none''',
            '''Sec-Fetch-User''': '''?1''',
            '''Upgrade-Insecure-Requests''': '''1''',
            '''User-Agent''': '''Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.146 Safari/537.36''',
        }

    def get_gtk_token(self):
        """Fetch gtk and token"""
        html = requests.get(url=self.get_url,
                            headers=self.get_headers).text
        gtk = re.findall("window.gtk = '(.*?)'", html, re.S)[0]
        token = re.findall("token: '(.*?)'", html, re.S)[0]

        return gtk, token

    def get_sign(self):
        """Compute sign"""
        gtk, token = self.get_gtk_token()
        with open('translate.js', 'r', encoding='utf-8') as f:
            jscode = f.read()

        jsobj = execjs.compile(jscode)
        # call() passes the arguments as data, so quotes in the word cannot break the JS expression
        sign = jsobj.call('e', self.word, gtk)

        return sign

    def attack_bd(self):
        """Main logic: send the POST request and return the translation"""
        sign = self.get_sign()
        gtk, token = self.get_gtk_token()
        data = {
            "from": "en",
            "to": "zh",
            "query": self.word,
            "transtype": "realtime",
            "simple_means_flag": "3",
            "sign": sign,
            "token": token,
            "domain": "common",
        }
        html = requests.post(url=self.post_url,
                             data=data,
                             headers=self.post_headers).json()

        return html['trans_result']['data'][0]['dst']

if __name__ == '__main__':
    spider = BdSpider()
    print(spider.attack_bd())
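
The indexing with [0] in get_gtk_token raises an IndexError when the page layout changes or the Cookie has expired. A stricter parsing step is sketched below on a hand-written HTML fragment. parse_gtk_token is a hypothetical helper, the token value is a placeholder, and the real page may embed these values differently.

```python
import re


def parse_gtk_token(html):
    """Extract gtk and token from the page source, or raise ValueError."""
    # The dot in window.gtk is escaped so it only matches a literal dot
    gtk = re.search(r"window\.gtk = '(.*?)'", html)
    token = re.search(r"token: '(.*?)'", html)
    if gtk is None or token is None:
        raise ValueError('gtk or token not found in page')
    return gtk.group(1), token.group(1)


# Hand-written fragment; the token is a placeholder, not a real value
sample_html = (
    "<script>window.gtk = '320305.131321201';"
    " var common = { token: 'placeholder-token' };</script>"
)
print(parse_gtk_token(sample_html))
```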

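5. Today's homework

【1】Scrape the free high-anonymity proxies from Kuaidaili and test which ones are usable, to build your own proxy IP pool
    https://www.kuaidaili.com/free/

【2】Scrape KFC restaurant information (POST request practice)
    2.1) URL: http://www.kfc.com.cn/kfccda/storelist/index.aspx
    2.2) Data to scrape: restaurant ID, restaurant name, restaurant address, city
    2.3) Storage: save to a database
    2.4) Expected behavior:
         Enter a city name: 北京
         Saves the information for every KFC restaurant in 北京 (Beijing) to the database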

 

 

posted @ 2022-02-26 15:49  我不知道取什么名字好