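Web scraping: requests module basics
The requests module: a third-party Python library for sending HTTP requests (it is not part of the standard library, so it has to be installed). It is powerful and efficient to work with.
Purpose: simulate a browser sending requests.
How to use it (the requests coding workflow):
--specify the URL
--send the request
--get the response data
--persist the data
Installation:
pip install requests
Case 1: scraping the Sogou home page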
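import requests

if __name__ == '__main__':
    # Step 1: specify the URL
    url = "https://www.sogou.com/"
    # Step 2: send the request
    response = requests.get(url=url)
    # Step 3: get the response data as text
    page_text = response.text
    print(page_text)
    # Step 4: persist the data
    with open('./sogou.html', 'w', encoding='utf-8') as fp:
        fp.write(page_text)
    print("Scraping finished!")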
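Case 1 assumes the request always succeeds. A slightly more defensive sketch of the same four steps is given here. The names `fetch_text` and `save_text`, the injectable `get` argument, and the 10-second timeout are illustrative assumptions, not part of the original example.

```python
import requests


def fetch_text(url, get=requests.get, timeout=10):
    """Send a GET request and return the body as text.

    `get` defaults to requests.get; it is a parameter only so that the
    function can be exercised without network access.
    """
    response = get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx answers instead of saving an error page.
    response.raise_for_status()
    return response.text


def save_text(text, path):
    """Persist the text as UTF-8 and return the path that was written."""
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(text)
    return path
```

Calling `save_text(fetch_text("https://www.sogou.com/"), "./sogou.html")` would then reproduce Case 1, with a timeout and a status check added.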
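Case 2: a simple page collector
import requests

# UA spoofing: a portal site's server checks the identity of whatever sent each request.
# If the identity is a browser, the request is treated as a normal one. If it is not
# a browser, the request is treated as abnormal (possibly a crawler) and the server
# may well reject it, so the crawler has to disguise its User-Agent.
# UA: User-Agent (the identity string of the request sender)
if __name__ == '__main__':
    # UA spoofing: put the User-Agent into a dict
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/...'
    }
    url = "https://www.sogou.com/web"
    # Put the parameters carried by the URL into a dict
    kw = input('enter a word:')
    param = {
        'query': kw
    }
    # The requested URL carries parameters; requests encodes them into the query string
    response = requests.get(url=url, params=param, headers=headers)
    page_text = response.text
    file_name = kw + '.html'
    with open(file_name, 'w', encoding='utf-8') as f:
        f.write(page_text)
    print(file_name, 'saved successfully!')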
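To see why the UA spoofing matters and what `params` does, both can be inspected without sending anything. This is a minimal sketch: `preview_url` is a hypothetical helper name, and the keyword `python` is just sample input.

```python
import requests


def preview_url(url, params):
    """Build the request locally and return the final URL; nothing is sent."""
    prepared = requests.Request("GET", url, params=params).prepare()
    return prepared.url


# Without a custom header, requests identifies itself as "python-requests/<version>",
# which is exactly the kind of non-browser identity a server can detect.
print(requests.utils.default_user_agent())

# The params dict is encoded into the query string of the URL.
print(preview_url("https://www.sogou.com/web", {"query": "python"}))
# → https://www.sogou.com/web?query=python
```

Non-ASCII keywords are percent-encoded the same way, so the dict can hold the raw input string.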
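Case 3: cracking Baidu Translate -- partial page refresh
Packet-capture analysis shows:
--Baidu Translate uses a POST request that carries parameters
--the response data is JSON
import requests
from sign import sign

if __name__ == '__main__':
    # Specify the URL: the POST URL is taken from the packet-capture tool
    post_url = 'https://fanyi.baidu.com/v2transapi?from=en&to=zh'
    mycookie = 'BIDUPSID=93775633FDF274C72'
    # UA spoofing
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome...6',
        'Cookie': mycookie
    }
    # POST parameters are handled the same way as GET parameters
    search_word = input('Enter the text to translate:')
    # sign is a parameter Baidu Translate added recently; a community-contributed
    # implementation of the sign algorithm is listed in sign.py.
    # See also: "JS reverse engineering: cracking the Baidu Translate sign parameter" (very detailed)
    # Use a separate name so the imported sign() function is not overwritten
    sign_value = sign(search_word)
    data = {
        'query': search_word,
        'from': 'en',
        'to': 'zh',
        'transtype': 'realtime',
        'simple_means_flag': '3',
        'token': '79f9b84e69633634e9bab84cf796a52a',
        'sign': sign_value,
        'domain': 'common'
    }
    # Send the request
    response = requests.post(url=post_url, data=data, headers=headers)
    # Get the response data: the packet capture shows content-type: application/json,
    # so the server returns JSON.
    # json() turns the returned JSON string into a dict; only use it once you have
    # confirmed that the server really returns JSON.
    dic_obj = response.json()
    print(dic_obj)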
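The note that `json()` should only be used on confirmed JSON can be turned into a small guard. This is a sketch under assumptions: `parse_json_response` is a hypothetical helper, and it only relies on the response object having `headers` and `json()`, as `requests.Response` does.

```python
def parse_json_response(response):
    """Return the decoded JSON body, or None if the response is not JSON."""
    content_type = response.headers.get("Content-Type", "")
    if "json" not in content_type.lower():
        # For example an HTML error page or a login page.
        return None
    try:
        return response.json()
    except ValueError:
        # The header claimed JSON but the body could not be decoded.
        return None
```

In Case 3 this would replace the bare `response.json()` call, so that a rejected request (wrong sign, token, or cookie) yields `None` instead of an exception.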
# sign.py, the community-contributed code
import re

import js2py
import requests


def sign(word):
    session = requests.Session()
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36"
    }
    session.headers.update(headers)
    response = session.get("http://fanyi.baidu.com/")
    # Extract window.gtk from the page; the captured value keeps its quotes
    # because it is pasted into the JS below as a string literal
    gtk = re.findall(r";window.gtk = ('.*?');", response.content.decode())[0]
    context = js2py.EvalJs()
    js = r'''
    function a(r) { if (Array.isArray(r)) { for (var o = 0, t = Array(r.length); o < r.length; o++) t[o] = r[o]; return t } return Array.from(r) }
    function n(r, o) { for (var t = 0; t < o.length - 2; t += 3) { var a = o.charAt(t + 2); a = a >= "a" ? a.charCodeAt(0) - 87 : Number(a), a = "+" === o.charAt(t + 1) ? r >>> a : r << a, r = "+" === o.charAt(t) ? r + a & 4294967295 : r ^ a } return r }
    function e(r) {
        var o = r.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g);
        if (null === o) { var t = r.length; t > 30 && (r = "" + r.substr(0, 10) + r.substr(Math.floor(t / 2) - 5, 10) + r.substr(-10, 10)) } else { for (var e = r.split(/[\uD800-\uDBFF][\uDC00-\uDFFF]/), C = 0, h = e.length, f = []; h > C; C++) "" !== e[C] && f.push.apply(f, a(e[C].split(""))), C !== h - 1 && f.push(o[C]); var g = f.length; g > 30 && (r = f.slice(0, 10).join("") + f.slice(Math.floor(g / 2) - 5, Math.floor(g / 2) + 5).join("") + f.slice(-10).join("")) }
        var u = void 0 , l = "" + String.fromCharCode(103) + String.fromCharCode(116) + String.fromCharCode(107);
        u = 'null !== i ? i : (i = window[l] || "") || ""';
        for (var d = u.split("."), m = Number(d[0]) || 0, s = Number(d[1]) || 0, S = [], c = 0, v = 0; v < r.length; v++) { var A = r.charCodeAt(v); 128 > A ? S[c++] = A : (2048 > A ? S[c++] = A >> 6 | 192 : (55296 === (64512 & A) && v + 1 < r.length && 56320 === (64512 & r.charCodeAt(v + 1)) ? (A = 65536 + ((1023 & A) << 10) + (1023 & r.charCodeAt(++v)), S[c++] = A >> 18 | 240, S[c++] = A >> 12 & 63 | 128) : S[c++] = A >> 12 | 224, S[c++] = A >> 6 & 63 | 128), S[c++] = 63 & A | 128) }
        for (var p = m, F = "" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(97) + ("" + String.fromCharCode(94) + String.fromCharCode(43) + String.fromCharCode(54)), D = "" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(51) + ("" + String.fromCharCode(94) + String.fromCharCode(43) + String.fromCharCode(98)) + ("" + String.fromCharCode(43) + String.fromCharCode(45) + String.fromCharCode(102)), b = 0; b < S.length; b++) p += S[b], p = n(p, F);
        return p = n(p, D), p ^= s, 0 > p && (p = (2147483647 & p) + 2147483648), p %= 1e6, p.toString() + "." + (p ^ m)
    }
    '''
    # Substitute the real gtk value for the placeholder expression
    js = js.replace('\'null !== i ? i : (i = window[l] || "") || ""\'', gtk)
    # Execute the JS
    context.execute(js)
    # Call the JS function e to get the sign
    return context.e(word)
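The first thing function `e` in the JS does is shorten long input: when the text has more than 30 characters, only the first 10, the middle 10, and the last 10 are kept for signing. Here is a minimal Python sketch of that one step. It assumes the text contains no characters outside the Basic Multilingual Plane (the JS handles surrogate pairs in a separate branch), and `shorten_query` is a hypothetical name.

```python
def shorten_query(text):
    """Mirror the truncation at the start of the JS function e (BMP text only)."""
    length = len(text)
    if length <= 30:
        return text
    middle = length // 2
    # First 10 characters + 10 characters around the middle + last 10 characters.
    return text[:10] + text[middle - 5:middle + 5] + text[-10:]


print(shorten_query("abcdefghijklmnopqrstuvwxyz0123456789"))
# → abcdefghijnopqrstuvw0123456789
```

This covers only the truncation; the hashing that follows still depends on the gtk value and is left to the JS.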
Case 4: scraping Douban movie category data
import json

import requests

if __name__ == '__main__':
    url = 'https://movie.douban.com/j/chart/top_list'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome...'
    }
    param = {
        'type': '24',
        'interval_id': '100:90',
        'action': '',
        'start': '10',  # index of the first movie to fetch from the library
        'limit': '20',  # number of movies to fetch per request
    }
    response = requests.get(url=url, params=param, headers=headers)
    # Decode the JSON string back into the original data type
    list_data = response.json()
    print(type(list_data))
    # Use a with block so the file is closed and the data is flushed to disk
    with open('./douban.json', 'w', encoding='utf-8') as fp:
        json.dump(list_data, fp=fp, ensure_ascii=False)
    print('Done!')
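Because `start` is the offset of the first movie and `limit` is the page size, fetching more than one page only requires stepping `start`. Here is a small sketch that generates the parameter dicts. `page_params` and its `total` argument are hypothetical; the other values are copied from the example above.

```python
def page_params(total, limit=20, type_id="24", interval_id="100:90"):
    """Yield one params dict per page until `total` movies are covered."""
    for start in range(0, total, limit):
        yield {
            "type": type_id,
            "interval_id": interval_id,
            "action": "",
            "start": str(start),   # offset of the first movie on this page
            "limit": str(limit),   # number of movies per request
        }


for params in page_params(total=50):
    print(params["start"], params["limit"])
```

Each dict can be passed as `params=` to `requests.get`, and the returned lists can be concatenated before the single `json.dump` call.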