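SPIDER-DAY04: requests.post requests and proxies
1. Proxy parameter
【1】Definition
An IP address that connects to the network in place of your own IP address
【2】Purpose
Hides your real IP address so that it does not get banned
【3】Websites that provide proxy IPs
快代理 (Kuaidaili), 全网代理, 代理精灵, ...
【4】Parameter type
proxies
proxies = { 'protocol':'protocol://IP:port' }
proxies = { 'protocol':'protocol://username:password@IP:port' }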
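The two formats above can be assembled by a small helper. This is a minimal sketch and not part of the original notes: build_proxies and its parameters are hypothetical names, and percent-encoding the credentials with urllib.parse.quote is an added precaution for usernames or passwords that contain characters such as @ or :.

```python
from urllib.parse import quote


def build_proxies(host, port, username=None, password=None):
    """Build a proxies dict in the 'protocol://[user:password@]IP:port' format.

    Hypothetical helper for illustration; the names are not from the notes.
    """
    auth = ''
    if username is not None and password is not None:
        # Percent-encode so that '@' or ':' inside credentials cannot break the URL
        auth = '{}:{}@'.format(quote(username, safe=''), quote(password, safe=''))
    return {
        scheme: '{}://{}{}:{}'.format(scheme, auth, host, port)
        for scheme in ('http', 'https')
    }


print(build_proxies('1.1.1.1', 8888))
print(build_proxies('1.1.1.1', 8888, 'user', 'p@ss'))
```

The returned dict can be passed directly as the proxies argument of requests.get() or requests.post().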
1.2 Proxy categories
1.2.1 Ordinary proxies
【1】Proxy format
proxies = { 'protocol':'protocol://IP:port' }
【2】Use a free ordinary proxy IP to access the test site: http://httpbin.org/get
import requests

url = 'http://httpbin.org/get'
headers = {'User-Agent':'Mozilla/5.0'}
# Define the proxy; look up a free proxy IP on one of the proxy IP websites
proxies = {
    'http':'http://112.85.164.220:9999',
    'https':'https://112.85.164.220:9999'
}
html = requests.get(url,proxies=proxies,headers=headers,timeout=5).text
print(html)
1.2.2 Private proxies and dedicated proxies
【1】Proxy format
proxies = { 'protocol':'protocol://username:password@IP:port' }
【2】Use a private or dedicated proxy IP to access the test site: http://httpbin.org/get
import requests

url = 'http://httpbin.org/get'
proxies = {
    'http': 'http://309435365:szayclhp@106.75.71.140:16816',
    'https':'https://309435365:szayclhp@106.75.71.140:16816',
}
headers = {
    'User-Agent' : 'Mozilla/5.0',
}
html = requests.get(url,proxies=proxies,headers=headers,timeout=5).text
print(html)
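Both scripts above only print the response. To judge whether the proxy actually took effect, the "origin" field of the JSON returned by http://httpbin.org/get can be compared with your own address. The sketch below works on a sample response string rather than a live request. proxy_in_effect, the sample body and the address 203.0.113.7 are illustrative assumptions, and the handling of several comma-separated addresses is a precaution, since a proxy may forward the client IP.

```python
import json


def proxy_in_effect(response_text, real_ip):
    """Return True if httpbin's reported origin does not reveal real_ip.

    Hypothetical helper; response_text is the body of http://httpbin.org/get.
    """
    origin = json.loads(response_text).get('origin', '')
    # origin may hold several comma-separated addresses if the proxy forwards the client IP
    seen = [ip.strip() for ip in origin.split(',') if ip.strip()]
    return bool(seen) and real_ip not in seen


# Sample body for illustration only (not a captured response)
sample = '{"args": {}, "origin": "106.75.71.140", "url": "http://httpbin.org/get"}'
print(proxy_in_effect(sample, '203.0.113.7'))
```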
1.3 Building a proxy IP pool
"""
Build a proxy IP pool - open proxies
"""
import requests
from fake_useragent import UserAgent


class ProxyPool:
    def __init__(self):
        # Placeholder: the API link obtained from 快代理 (Kuaidaili)
        self.api_url = 'Kuaidaili API link'
        self.test_url = 'http://httpbin.org/get'
        self.headers = {'User-Agent':UserAgent().random}

    def get_proxy(self):
        """Fetch proxy IPs from the API"""
        html = requests.get(url=self.api_url,
                            headers=self.headers).text
        proxy_list = html.split('\r\n')
        for proxy in proxy_list:
            # proxy: 1.1.1.1:8888
            self.test_proxy(proxy)

    def test_proxy(self, proxy):
        """Test whether a single proxy IP is usable"""
        proxies = {
            'http' : 'http://{}'.format(proxy),
            'https': 'https://{}'.format(proxy)
        }
        try:
            resp = requests.get(url=self.test_url,
                                proxies=proxies,
                                headers=self.headers,
                                timeout=3)
            # Printed in red via ANSI escape codes
            print(proxy, '\033[31musable\033[0m')
        except Exception as e:
            print(proxy, 'not usable')


if __name__ == '__main__':
    spider = ProxyPool()
    spider.get_proxy()
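One detail in the pool above: html.split('\r\n') also yields empty strings or error text when the API response ends with a blank line or returns a message instead of addresses. A possible pre-filter is sketched below. parse_proxy_list and PROXY_PATTERN are hypothetical names, and the pattern only checks the rough IP:port shape, not whether the address is valid.

```python
import re

# Rough 'IP:port' shape; does not validate the numeric ranges
PROXY_PATTERN = re.compile(r'^\d{1,3}(?:\.\d{1,3}){3}:\d{1,5}$')


def parse_proxy_list(text):
    """Split API text into unique 'IP:port' strings, skipping blank or malformed lines."""
    proxies = []
    for line in text.splitlines():
        line = line.strip()
        if PROXY_PATTERN.match(line) and line not in proxies:
            proxies.append(line)
    return proxies


# Illustrative API text with a blank line, a junk line and a duplicate
sample = '1.1.1.1:8888\r\n\r\n2.2.2.2:80\r\nerror\r\n1.1.1.1:8888\r\n'
print(parse_proxy_list(sample))
```

Each surviving entry could then be handed to test_proxy() one by one.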
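2. requests.post()
2.1 POST requests
【1】When to use: websites that expect POST requests
【2】Parameter: data={}
2.1) Form data: a dict
2.2) res = requests.post(url=url,data=data,headers=headers)
【3】Characteristic of POST requests: the data is submitted as form data in the request body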
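When a dict is passed as data=, requests form-encodes it into the request body (Content-Type: application/x-www-form-urlencoded). The sketch below uses only the standard library to show what such a body looks like; the field values are illustrative and no request is sent.

```python
from urllib.parse import urlencode, parse_qs

# Illustrative form fields, in the style of the translation forms later in these notes
data = {'i': 'tiger', 'from': 'AUTO', 'to': 'AUTO'}

# Form encoding: key=value pairs joined by '&', special characters percent-encoded
body = urlencode(data)
print(body)

# Non-ASCII values are percent-encoded as UTF-8 and decode back unchanged
encoded = urlencode({'i': '喵喵叫'})
print(encoded)
print(parse_qs(encoded))
```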
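2.2 Capturing packets in the browser console
- How to open it and common options
【1】Open the browser, press F12 to open the console, and find the Network tab
【2】Common console options
2.1) Network: captures network packets
a> All: captures all network packets
b> XHR: captures asynchronously loaded network packets
c> JS: captures all JS files
2.2) Sources: pretty-prints JavaScript and lets you debug it with breakpoints, which helps when analysing parameters a spider has to reproduce
2.3) Console: interactive mode, for testing JavaScript code
【3】After capturing a specific network packet
3.1) Click the packet address on the left to open its details, then look at the right-hand pane
3.2) Right-hand pane:
a> Headers: the complete request information
General, Response Headers, Request Headers, Query String, Form Data
b> Preview: a preview of the response content
c> Response: the response content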
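3. Youdao Translate (有道翻译) spider
3.1 Project requirements
Reverse-engineer the Youdao Translate interface and scrape the translation result
# Expected output
Enter the word to translate: elephant
Translation: 大象
*************************
Enter the word to translate: 喵喵叫
Translation: mews
3.2 Project analysis workflow
【1】Prepare to capture packets: press F12 to open the console and refresh the page
【2】Find the address
2.1) Enter a word to translate on the page, capture the network packets in the console, then find and analyse the address that returns the translation data
F12-Network-XHR-Headers-General-Request URL
【3】Find the pattern
3.1) Once the address that returns the data is found, enter a few more words on the page and locate the corresponding URL each time
3.2) Compare Network - All (or XHR) - Form Data across requests to find the pattern
【4】Find the JS encryption file
Top-right corner of the console ... -> Search -> search for the keyword -> click the result -> jump to Sources, then click the format symbol {} at the bottom left
【5】Read the JS code
Search for the keyword, find the relevant encryption method, and implement the algorithm in Python
【6】Breakpoint debugging
Parameters in the JS code that are unclear can be inspected by debugging with breakpoints
【7】Implement the JS encryption algorithm in Python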
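3.3 Project steps
1. Open F12 packet capture and find the form data:
    i: 喵喵叫
    from: AUTO
    to: AUTO
    smartresult: dict
    client: fanyideskweb
    salt: 15614112641250
    sign: 94008208919faa19bd531acde36aac5d
    ts: 1561411264125
    bv: f4d62a2579ebb44874d7ef93ba47e822
    doctype: json
    version: 2.1
    keyfrom: fanyi.web
    action: FY_BY_REALTlME
2. Translate a few more words on the page and watch how the form data changes. The following fields change between requests, except bv:
    salt: 15614112641250
    sign: 94008208919faa19bd531acde36aac5d
    ts: 1561411264125
    bv: f4d62a2579ebb44874d7ef93ba47e822    # the value of bv does not change
3. Such values are usually computed by a local JS file. Refresh the page, find the JS file and analyse the code
    Top-right corner of the console - Search - search for salt - open the file - pretty-print it
    【Result】The relevant JS file turns out to be fanyi.min.js
4. Open the JS file, analyse the algorithm, and implement it in Python
    【ts】A 13-digit (millisecond) timestamp, as a string
        JS)     "" + (new Date).getTime()
        Python) str(int(time.time() * 1000))
    【salt】ts followed by a random digit from 0 to 9 (string)
        JS)     ts + parseInt(10 * Math.random(), 10);
        Python) ts + str(random.randint(0, 9))
    【sign】An MD5 hex digest (setting a breakpoint shows that e is the word to translate)
        JS)     n.md5("fanyideskweb" + e + salt + "Tbh5E8=q6U3EXe+&L[4c@")
        Python) from hashlib import md5
                # string = "fanyideskweb" + word + salt + "Tbh5E8=q6U3EXe+&L[4c@"
                m = md5()
                m.update(string.encode())
                sign = m.hexdigest()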
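The three rules above fit into one function. This is a sketch under the assumptions stated in these notes: the client name and the key string are the values quoted above and may have changed on the site since. make_ts_salt_sign and its now/digit parameters are hypothetical, added only so the result can be reproduced with fixed inputs.

```python
import random
import time
from hashlib import md5

CLIENT = 'fanyideskweb'
# Key string quoted in the notes above; the site may have changed it since
SECRET = 'Tbh5E8=q6U3EXe+&L[4c@'


def make_ts_salt_sign(word, now=None, digit=None):
    """Return (ts, salt, sign) for a word, following the rules described above.

    now (seconds since the epoch) and digit (0-9) are optional overrides
    for reproducible runs; by default the clock and a random digit are used.
    """
    ts = str(int((time.time() if now is None else now) * 1000))
    salt = ts + str(random.randint(0, 9) if digit is None else digit)
    sign = md5((CLIENT + word + salt + SECRET).encode()).hexdigest()
    return ts, salt, sign


ts, salt, sign = make_ts_salt_sign('tiger', now=1561411264.125, digit=0)
print(ts, salt, sign)
```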
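5. Use a regex in PyCharm to convert the copied headers and form data into dict entries
    【1】Open find-and-replace in PyCharm: Ctrl + r, then tick Regex
    【2】Convert the headers and form data
        Search:  (.*): (.*)
        Replace: "$1": "$2",
    【3】Click Replace All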
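The same conversion can be done in Python instead of the editor. A minimal sketch: headers_to_dict is a hypothetical helper name, and it splits each line on the first ": " only, so a value that itself contains ": " stays intact (a greedy (.*): (.*) would split at the last occurrence).

```python
def headers_to_dict(raw):
    """Turn 'Key: Value' lines copied from the F12 panel into a dict."""
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first ': ' only
        key, _, value = line.partition(': ')
        headers[key] = value
    return headers


raw = """
Host: fanyi.baidu.com
Referer: https://fanyi.baidu.com/
X-Requested-With: XMLHttpRequest
"""
print(headers_to_dict(raw))
```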
3.4 Code implementation
"""
Enter the word to translate: tiger
Translation: 老虎
"""
import requests
import time
from hashlib import md5
import random


class YdSpider:
    def __init__(self):
        # The URL must be the POST address captured with F12
        self.post_url = 'http://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule'
        self.headers = {
            # The three headers checked most often: Cookie, Referer, User-Agent
            "Cookie": "OUTFOX_SEARCH_USER_ID=1391264118@10.108.160.105; OUTFOX_SEARCH_USER_ID_NCOO=2105417985.4787014; JSESSIONID=aaasSeD7PiY4G_nO8cWDx; SESSION_FROM_COOKIE=unknown; ___rl__test__cookies=1612506057171",
            "Referer": "http://fanyi.youdao.com/",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.146 Safari/537.36",
        }
        # Read the word to translate
        self.word = input('Enter the word to translate: ')

    def md5_string(self, string):
        """Helper: return the MD5 hex digest of a string"""
        m = md5()
        m.update(string.encode())
        return m.hexdigest()

    def get_ts_salt_sign(self):
        """Compute ts, salt and sign"""
        ts = str(int(time.time() * 1000))
        salt = ts + str(random.randint(0, 9))
        string = "fanyideskweb" + self.word + salt + "Tbh5E8=q6U3EXe+&L[4c@"
        sign = self.md5_string(string)
        return ts, salt, sign

    def attack_yd(self):
        """Main logic"""
        ts, salt, sign = self.get_ts_salt_sign()
        data = {
            "i": self.word,
            "from": "AUTO",
            "to": "AUTO",
            "smartresult": "dict",
            "client": "fanyideskweb",
            "salt": salt,
            "sign": sign,
            "lts": ts,
            "bv": "6a1ac4a5cc37a3de2c535a36eda9e149",
            "doctype": "json",
            "version": "2.1",
            "keyfrom": "fanyi.web",
            "action": "FY_BY_REALTlME",
        }
        # .json(): converts the JSON-formatted response body into Python data types
        # .json() is equivalent to json.loads('{}')
        html = requests.post(url=self.post_url, data=data, headers=self.headers).json()
        return html['translateResult'][0][0]['tgt']


if __name__ == '__main__':
    spider = YdSpider()
    print(spider.attack_yd())
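4. Baidu Translate (百度翻译) spider via JS reverse engineering
4.1 JS reverse engineering in detail
【1】When to use
    When the JS encryption code is too complex to reimplement, run the original JS from Python instead
【2】Module
    2.1) Module name: execjs
    2.2) Install: sudo pip3 install pyexecjs
    2.3) Usage
        import execjs

        with open('xxx.js', 'r') as f:
            js_code = f.read()

        js_obj = execjs.compile(js_code)
        js_obj.eval('function_name("argument")')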
4.2 Debugging the JS code
- Save the captured JS encryption file as translate.js
// e(r, gtk): the gtk parameter was added
// i = window[l] was changed to i = gtk
function a(r) {
    if (Array.isArray(r)) {
        for (var o = 0, t = Array(r.length); o < r.length; o++)
            t[o] = r[o];
        return t
    }
    return Array.from(r)
}
function n(r, o) {
    for (var t = 0; t < o.length - 2; t += 3) {
        var a = o.charAt(t + 2);
        a = a >= "a" ? a.charCodeAt(0) - 87 : Number(a),
        a = "+" === o.charAt(t + 1) ? r >>> a : r << a,
        r = "+" === o.charAt(t) ? r + a & 4294967295 : r ^ a
    }
    return r
}
function e(r,gtk) {
    var o = r.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g);
    if (null === o) {
        var t = r.length;
        t > 30 && (r = "" + r.substr(0, 10) + r.substr(Math.floor(t / 2) - 5, 10) + r.substr(-10, 10))
    } else {
        for (var e = r.split(/[\uD800-\uDBFF][\uDC00-\uDFFF]/), C = 0, h = e.length, f = []; h > C; C++)
            "" !== e[C] && f.push.apply(f, a(e[C].split(""))),
            C !== h - 1 && f.push(o[C]);
        var g = f.length;
        g > 30 && (r = f.slice(0, 10).join("") + f.slice(Math.floor(g / 2) - 5, Math.floor(g / 2) + 5).join("") + f.slice(-10).join(""))
    }
    var u = void 0
      , l = "" + String.fromCharCode(103) + String.fromCharCode(116) + String.fromCharCode(107);
    u = null !== i ? i : (i = gtk || "") || "";
    for (var d = u.split("."), m = Number(d[0]) || 0, s = Number(d[1]) || 0, S = [], c = 0, v = 0; v < r.length; v++) {
        var A = r.charCodeAt(v);
        128 > A ? S[c++] = A : (2048 > A ? S[c++] = A >> 6 | 192 : (55296 === (64512 & A) && v + 1 < r.length && 56320 === (64512 & r.charCodeAt(v + 1)) ? (A = 65536 + ((1023 & A) << 10) + (1023 & r.charCodeAt(++v)), S[c++] = A >> 18 | 240, S[c++] = A >> 12 & 63 | 128) : S[c++] = A >> 12 | 224, S[c++] = A >> 6 & 63 | 128), S[c++] = 63 & A | 128)
    }
    for (var p = m, F = "" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(97) + ("" + String.fromCharCode(94) + String.fromCharCode(43) + String.fromCharCode(54)), D = "" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(51) + ("" + String.fromCharCode(94) + String.fromCharCode(43) + String.fromCharCode(98)) + ("" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(102)), b = 0; b < S.length; b++)
        p += S[b],
        p = n(p, F);
    return p = n(p, D), p ^= s, 0 > p && (p = (2147483647 & p) + 2147483648), p %= 1e6, p.toString() + "." + (p ^ m)
}
var i = null;
- Debug the JS file with test_translate.py
import execjs

with open('translate.js', 'r', encoding='utf-8') as f:
    jscode = f.read()

jsobj = execjs.compile(jscode)
sign = jsobj.eval('e("hello","320305.131321201")')
print(sign)
4.3 Baidu Translate code implementation
"""
Baidu Translate case study - JS reverse engineering (execjs module)
"""
import requests
import execjs
import re


class BdSpider:
    def __init__(self):
        # url: the POST URL captured with F12
        self.post_url = 'https://fanyi.baidu.com/v2transapi?from=en&to=zh'
        self.post_headers = {
            '''Accept''': '''*/*''',
            '''Accept-Encoding''': '''gzip, deflate, br''',
            '''Accept-Language''': '''zh-CN,zh;q=0.9''',
            '''Cache-Control''': '''no-cache''',
            '''Connection''': '''keep-alive''',
            '''Content-Length''': '''135''',
            '''Content-Type''': '''application/x-www-form-urlencoded; charset=UTF-8''',
            '''Cookie''': '''BIDUPSID=F29F6E28B0FBCFC1A7E211DE92583124; PSTM=1608988098; BAIDUID=F29F6E28B0FBCFC153ABCCC1188CE960:FG=1; FANYI_WORD_SWITCH=1; REALTIME_TRANS_SWITCH=1; HISTORY_SWITCH=1; SOUND_SPD_SWITCH=1; SOUND_PREFER_SWITCH=1; __yjs_duid=1_40d64069dbd8841dbfa1e003a7c7dd8a1611644719974; BDSFRCVID_BFESS=WauOJeC627FxfbveyBYtbHBDYpJWrfTTH6aoVfeprTJ6BRthQWFTEG0P8M8g0KubvwPHogKKBmOTHgLF_2uxOjjg8UtVJeC6EG0Ptf8g0M5; H_BDCLCKID_SF_BFESS=tJI8oK0XJD-3fP36qR6sMJI0hU5054RB2C6yX4K8Kb7VbU3p5fnkbfJBDGJlhP5abDT85fT2KqnIsMTxylJKbl07yajK2h5h56RH-pFK0foMjMj5QlOpQT8re-FOK5OibCrqLMJdab3vOpozXpO1bUAzBN5thURB2DkO-4bCWJ5TMl5jDh3Mb6ksD-FtqtJHKbDOVC_K3D; BDUSS=U5hRU50YVFPZmdLbnZ3QTlZYWJDdXpkZ2VRSUtDZzl3V0o4TEY1Z3d5YVhlMEJnRVFBQUFBJCQAAAAAAAAAAAEAAAADZykQv9u~2zMwOTQzNTM2NQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAJfuGGCX7hhge; BDUSS_BFESS=U5hRU50YVFPZmdLbnZ3QTlZYWJDdXpkZ2VRSUtDZzl3V0o4TEY1Z3d5YVhlMEJnRVFBQUFBJCQAAAAAAAAAAAEAAAADZykQv9u~2zMwOTQzNTM2NQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAJfuGGCX7hhge; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; Hm_lvt_64ecd82404c51e03dc91cb9e8c025574=1610182098,1610437054,1610501127,1612504926; H_PS_PSSID=33423_33439_33273_31660_33595_33540_33590_26350; BDRCVFR[feWj1Vr5u3D]=I67x6TjHwwYf0; delPer=0; PSINO=2; BA_HECTOR=8l2kag0h21052l24g91g1ps820r; Hm_lpvt_64ecd82404c51e03dc91cb9e8c025574=1612509484; __yjsv5_shitong=1.0_7_7b5b5c8481cf67b7c1053df9dfdc2105d925_300_1612509484292_101.30.65.196_f2e2fdf6; ab_sr=1.0.0_MGIyNDA3Mjk2NGU4NjAxZjkzYzU5YjQ4Mjg3YjJmMTFjMzRjY2E0Y2EwYWE5YTllZGE2Yjk5NmM2M2RjZmViMjUwMjIyZGJlODNhZDJkOTk0YjNkMjRiNTE0NjM4YzEx''',
            '''Host''': '''fanyi.baidu.com''',
            '''Origin''': '''https://fanyi.baidu.com''',
            '''Pragma''': '''no-cache''',
            '''Referer''': '''https://fanyi.baidu.com/''',
            '''sec-ch-ua''': '''"Chromium";v="88", "Google Chrome";v="88", ";Not A Brand";v="99"''',
            '''sec-ch-ua-mobile''': '''?0''',
            '''Sec-Fetch-Dest''': '''empty''',
            '''Sec-Fetch-Mode''': '''cors''',
            '''Sec-Fetch-Site''': '''same-origin''',
            '''User-Agent''': '''Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.146 Safari/537.36''',
            '''X-Requested-With''': '''XMLHttpRequest''',
        }
        self.word = input('Enter the word to translate: ')
        # URL and headers used to fetch gtk and token
        self.get_url = 'https://fanyi.baidu.com/'
        self.get_headers = {
            '''Accept''': '''text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9''',
            '''Accept-Encoding''': '''gzip, deflate, br''',
            '''Accept-Language''': '''zh-CN,zh;q=0.9''',
            '''Cache-Control''': '''no-cache''',
            '''Connection''': '''keep-alive''',
            '''Cookie''': '''BIDUPSID=F29F6E28B0FBCFC1A7E211DE92583124; PSTM=1608988098; BAIDUID=F29F6E28B0FBCFC153ABCCC1188CE960:FG=1; FANYI_WORD_SWITCH=1; REALTIME_TRANS_SWITCH=1; HISTORY_SWITCH=1; SOUND_SPD_SWITCH=1; SOUND_PREFER_SWITCH=1; __yjs_duid=1_40d64069dbd8841dbfa1e003a7c7dd8a1611644719974; BDSFRCVID_BFESS=WauOJeC627FxfbveyBYtbHBDYpJWrfTTH6aoVfeprTJ6BRthQWFTEG0P8M8g0KubvwPHogKKBmOTHgLF_2uxOjjg8UtVJeC6EG0Ptf8g0M5; H_BDCLCKID_SF_BFESS=tJI8oK0XJD-3fP36qR6sMJI0hU5054RB2C6yX4K8Kb7VbU3p5fnkbfJBDGJlhP5abDT85fT2KqnIsMTxylJKbl07yajK2h5h56RH-pFK0foMjMj5QlOpQT8re-FOK5OibCrqLMJdab3vOpozXpO1bUAzBN5thURB2DkO-4bCWJ5TMl5jDh3Mb6ksD-FtqtJHKbDOVC_K3D; BDUSS=U5hRU50YVFPZmdLbnZ3QTlZYWJDdXpkZ2VRSUtDZzl3V0o4TEY1Z3d5YVhlMEJnRVFBQUFBJCQAAAAAAAAAAAEAAAADZykQv9u~2zMwOTQzNTM2NQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAJfuGGCX7hhge; BDUSS_BFESS=U5hRU50YVFPZmdLbnZ3QTlZYWJDdXpkZ2VRSUtDZzl3V0o4TEY1Z3d5YVhlMEJnRVFBQUFBJCQAAAAAAAAAAAEAAAADZykQv9u~2zMwOTQzNTM2NQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAJfuGGCX7hhge; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; Hm_lvt_64ecd82404c51e03dc91cb9e8c025574=1610182098,1610437054,1610501127,1612504926; H_PS_PSSID=33423_33439_33273_31660_33595_33540_33590_26350; BDRCVFR[feWj1Vr5u3D]=I67x6TjHwwYf0; delPer=0; PSINO=2; Hm_lpvt_64ecd82404c51e03dc91cb9e8c025574=1612514819; __yjsv5_shitong=1.0_7_7b5b5c8481cf67b7c1053df9dfdc2105d925_300_1612514819989_101.30.65.196_430628f1; ab_sr=1.0.0_NjNjMjA3YzcyNDdiMzE4Njk5MGRkNjY1ZTY2YmFiNTI4MzE2ODQ3ZDIwYjBmNGRlZWFjODgyOGFjMGY0ZTQ3ODVlM2MxNDYxMjQ2ZWYzZGFkM2EzYWFjZjYyM2RkY2Vi''',
            '''Host''': '''fanyi.baidu.com''',
            '''Pragma''': '''no-cache''',
            '''sec-ch-ua''': '''"Chromium";v="88", "Google Chrome";v="88", ";Not A Brand";v="99"''',
            '''sec-ch-ua-mobile''': '''?0''',
            '''Sec-Fetch-Dest''': '''document''',
            '''Sec-Fetch-Mode''': '''navigate''',
            '''Sec-Fetch-Site''': '''none''',
            '''Sec-Fetch-User''': '''?1''',
            '''Upgrade-Insecure-Requests''': '''1''',
            '''User-Agent''': '''Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.146 Safari/537.36''',
        }

    def get_gtk_token(self):
        """Fetch gtk and token from the home page"""
        html = requests.get(url=self.get_url, headers=self.get_headers).text
        gtk = re.findall("window.gtk = '(.*?)'", html, re.S)[0]
        token = re.findall("token: '(.*?)'", html, re.S)[0]
        return gtk, token

    def get_sign(self):
        """Compute sign by running translate.js"""
        gtk, token = self.get_gtk_token()
        with open('translate.js', 'r', encoding='utf-8') as f:
            jscode = f.read()
        jsobj = execjs.compile(jscode)
        sign = jsobj.eval('e("{}","{}")'.format(self.word, gtk))
        return sign

    def attack_bd(self):
        """Main logic"""
        sign = self.get_sign()
        gtk, token = self.get_gtk_token()
        data = {
            "from": "en",
            "to": "zh",
            "query": self.word,
            "transtype": "realtime",
            "simple_means_flag": "3",
            "sign": sign,
            "token": token,
            "domain": "common",
        }
        html = requests.post(url=self.post_url, data=data, headers=self.post_headers).json()
        return html['trans_result']['data'][0]['dst']


if __name__ == '__main__':
    spider = BdSpider()
    print(spider.attack_bd())
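The regex extraction in get_gtk_token() can be tried offline before sending any request. In the sketch below, sample_html is a made-up fragment that only imitates the two patterns the code searches for, and the token value is invented. extract_gtk_token is a hypothetical standalone variant that raises a clear error instead of an IndexError when the page layout has changed.

```python
import re


def extract_gtk_token(html):
    """Pull gtk and token out of the page source, or raise ValueError if missing."""
    gtk = re.search(r"window\.gtk\s*=\s*'(.*?)'", html)
    token = re.search(r"token:\s*'(.*?)'", html)
    if gtk is None or token is None:
        raise ValueError('gtk or token not found; the page layout may have changed')
    return gtk.group(1), token.group(1)


# Made-up fragment for illustration; not captured from the real page
sample_html = (
    "<script>window.gtk = '320305.131321201';</script>"
    "<script>window.common = { token: '0123456789abcdef', systime: '0' };</script>"
)
print(extract_gtk_token(sample_html))
```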
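5. Homework
【1】Scrape the free high-anonymity proxies from the 快代理 (Kuaidaili) site and test whether they work, to build your own proxy IP pool
    https://www.kuaidaili.com/free/
【2】Scrape KFC restaurant information (POST request practice)
    2.1) URL: http://www.kfc.com.cn/kfccda/storelist/index.aspx
    2.2) Data to scrape: restaurant number, restaurant name, restaurant address, city
    2.3) Storage: save to a database
    2.4) Expected behaviour:
        Enter a city name: 北京
        All KFC restaurants in 北京 are saved to the database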