Python Web Scraping Self-Study Notes (4): Scraping Data from Mobile Apps
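Mobile apps keep multiplying and most people now go online from their phones, so scraping mobile data has become a routine part of a scraper's job.
The basic approach is to use a packet capture tool to record the traffic a phone generates while it browses a web page or uses an app, and then parse that traffic.
Most mobile data is structured, mainly JSON, so parsing is relatively easy. The hard part is capturing the traffic and picking the data you need out of a large, messy pile of packets.
There are many capture tools; Fiddler is one of the most commonly used.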
Packet capture tool: Fiddler
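Fiddler is an HTTP debugging proxy. It records all HTTP traffic between your computer and the internet and lets you inspect it, set breakpoints, and view all data going in and out of Fiddler.
Fiddler is easier to use than many other network debuggers because it not only exposes the HTTP traffic but also presents it in a user-friendly format.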
1. Installation
Official Fiddler download link: https://www.telerik.com/download/fiddler
Once the download finishes, follow the installer step by step.
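2. Configuring Fiddler
2.1 Allow capturing HTTPS traffic
This is simple: open Fiddler, go to Tools -> Options, and on the HTTPS tab check "Decrypt HTTPS traffic". In the options that then appear, check "Ignore server certificate errors". Fiddler can now capture the contents of HTTPS packets; without this setting, HTTPS sessions only ever show up as "Tunnel" entries.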
2.2 Allow external devices to send HTTP/HTTPS traffic to Fiddler
Similarly, on the Connections tab check "Allow remote computers to connect", and note the port number shown above it, 8888. It is needed later.
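Before moving on to the phone, it can help to confirm from the PC that something is actually listening on that port. The following is a minimal sketch and not part of Fiddler; the function name `is_port_open` and the host `127.0.0.1` are assumptions made for illustration. It only attempts a TCP connection and reports whether it succeeded.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: treat all of these as "not open"
        return False


if __name__ == "__main__":
    # 8888 is the port number noted on Fiddler's Connections tab
    print("Fiddler reachable:", is_port_open("127.0.0.1", 8888))
```

A result of False while Fiddler is running would suggest checking the port number or the firewall before changing anything on the phone.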
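3. Configuring the phone
Before configuring the phone, keep one thing in mind: the computer and the phone must be on the same network, for example the same Wi-Fi network or a phone hotspot.
Once the computer and the phone are on the same network, you need the computer's IP address on that network. On Windows you can get it by running ipconfig on the command line, as shown in the figure.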
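The same address can also be looked up from Python. This is a best-effort sketch rather than a replacement for ipconfig: the function name `get_lan_ip` is made up for illustration, and on machines with several network adapters or a VPN the result may differ from the adapter the phone is actually connected to.

```python
import socket


def get_lan_ip() -> str:
    """Best-effort guess of this machine's outgoing IPv4 address."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # connect() on a UDP socket sends no packets; it only makes the OS
        # choose an outgoing interface. 192.0.2.1 is a reserved
        # documentation address (TEST-NET-1), used here only as a target.
        sock.connect(("192.0.2.1", 80))
        return sock.getsockname()[0]
    except OSError:
        # No usable route (for example, offline): fall back to loopback
        return "127.0.0.1"
    finally:
        sock.close()


if __name__ == "__main__":
    print(get_lan_ip())
```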
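The following uses an Android phone as an example to set up the proxy.
Check again that the phone and the PC are connected to the same LAN.
On the phone, go to Settings -> WLAN, pick the wireless network you are connected to, and long-press it to bring up the options dialog, as shown in the figure.
Set the proxy to Manual, enter the IP address and port number obtained above, and tap Save. The phone is now configured.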
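3.1 Download and install Fiddler's security certificate
In the Android phone's browser, open http://192.168.1.96:8888 (replace 192.168.1.96 with your own computer's IP address), tap "FiddlerRoot certificate", and install the certificate, as shown in the figure.
Note: this certificate is installed on the phone. Without it, HTTPS traffic cannot be captured correctly.
If all goes well, open Fiddler and use the phone to browse a web page or an app; Fiddler will start capturing the traffic.
Here you can see which packets the app sends and receives.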
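To zero in on Zhihu and see only the target's packets, add a filter in Fiddler.
The session list then contains only requests to the target host that match the filter.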
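The same idea, keeping only traffic for one host, can be expressed in a few lines of Python if the captured URLs have been copied out of Fiddler. This is only a sketch of the filtering logic, not of Fiddler's own filter; `filter_by_host` is a hypothetical helper, and the `captured` list is a hand-written sample built from URLs that appear in this article.

```python
from urllib.parse import urlsplit


def filter_by_host(urls, host_suffix):
    """Keep URLs whose hostname equals host_suffix or is a subdomain of it."""
    kept = []
    for url in urls:
        host = urlsplit(url).hostname or ""
        if host == host_suffix or host.endswith("." + host_suffix):
            kept.append(url)
    return kept


# Hand-written sample, not real capture output
captured = [
    "https://api.zhihu.com/topstory/hot-list?limit=10&reverse_order=0",
    "https://www.telerik.com/download/fiddler",
    "https://push2his.eastmoney.com/api/qt/stock/trends2/get?secid=0.000651",
]
print(filter_by_host(captured, "zhihu.com"))
# ['https://api.zhihu.com/topstory/hot-list?limit=10&reverse_order=0']
```

Matching on the parsed hostname rather than on a substring of the whole URL avoids false hits such as a different site whose path happens to contain "zhihu.com".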
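4. Finding the data packet
For example, tap the hot list (热榜) in the app, locate the corresponding HTTPS session in Fiddler, and inspect the data it carries.
The request URL extracted from that session is: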
https://api.zhihu.com/topstory/hot-list?limit=10&reverse_order=0
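It is often easier to reason about a captured URL once it is split into host, path, and query parameters. The sketch below uses only the standard library; what the server does with other values of `limit` or `reverse_order` is not verified here.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://api.zhihu.com/topstory/hot-list?limit=10&reverse_order=0"

parts = urlsplit(url)
# parse_qs returns a list of values per key; each key occurs once here
params = {key: values[0] for key, values in parse_qs(parts.query).items()}

print(parts.netloc)  # api.zhihu.com
print(parts.path)    # /topstory/hot-list
print(params)        # {'limit': '10', 'reverse_order': '0'}
```

With the parameters in a dict, they can be passed to `requests.get` through its `params` argument instead of being embedded in the URL string.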
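Note: when looking for the packet you want, check two things: the size shown in the Body column, and the content-type marker at the start of the row. Most of the data we need is in sessions marked {json}; images are in sessions marked img.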
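The two rules of thumb in the note (JSON type, large body) can be sketched as a small filter. Everything in the example is hypothetical: the record layout (`url`, `content_type`, `body_size`), the `example.com` URLs, and the 1000-byte threshold are assumptions for illustration and do not correspond to a Fiddler export format.

```python
def pick_json_sessions(sessions, min_body_bytes=1000):
    """Return sessions with a JSON content type, largest body first."""
    hits = [
        s for s in sessions
        if "json" in s["content_type"].lower()
        and s["body_size"] >= min_body_bytes
    ]
    return sorted(hits, key=lambda s: s["body_size"], reverse=True)


# Hand-written records for illustration, not real capture data
sessions = [
    {"url": "https://example.com/logo.png",
     "content_type": "image/png", "body_size": 20480},
    {"url": "https://example.com/api/list",
     "content_type": "application/json; charset=utf-8", "body_size": 15360},
    {"url": "https://example.com/api/ping",
     "content_type": "application/json", "body_size": 17},
]

for s in pick_json_sessions(sessions):
    print(s["body_size"], s["url"])
# 15360 https://example.com/api/list
```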
With the URL in hand, the next step is to write code that fetches and saves the data.
5. Writing the scraper
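import requests

# Desktop browser User-Agent, sent so the request looks like a normal browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0',
}
url = "https://api.zhihu.com/topstory/hot-list?limit=10&reverse_order=0"

res = requests.get(url, headers=headers, timeout=10)
res.raise_for_status()  # fail early on an HTTP error status
payload = res.json()    # the response body is JSON

# Each entry under 'data' carries its title at target.title
for item in payload['data']:
    print(item['target']['title'])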
Running the script prints the title of each hot-list entry.
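The script above only prints the titles, while the stated goal was to fetch and save the data. One possible way to add the saving step is sketched below, under the assumption that the response keeps the `data` / `target` / `title` shape used above. The helper names `extract_titles` and `save_titles` are made up for this example, and the sample payload is hand-written so the block runs without network access.

```python
import json


def extract_titles(payload):
    """Collect target.title from each entry under 'data', skipping incomplete entries."""
    titles = []
    for item in payload.get("data", []):
        target = item.get("target") or {}
        title = target.get("title")
        if title:
            titles.append(title)
    return titles


def save_titles(titles, path):
    """Write titles to a UTF-8 JSON file, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(titles, f, ensure_ascii=False, indent=2)


# Hand-made sample with the same shape as the response parsed above
sample = {"data": [{"target": {"title": "first"}}, {"target": {}}, {"other": 1}]}
print(extract_titles(sample))  # ['first']
```

In the scraper this could be called as `save_titles(extract_titles(res.json()), "hot_list.json")`, where the file name is arbitrary.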
As another example, fetch the current day's trade records for a single stock from Eastmoney (东方财富网):
import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
}
url = 'https://push2his.eastmoney.com/api/qt/stock/trends2/get?secid=0.000651&fields1=f1%2Cf2%2Cf3%2Cf4%2Cf5%2Cf6%2Cf7%2Cf8%2Cf9%2Cf10%2Cf11%2Cf12%2Cf13%2Cf14&fields2=f51%2Cf53%2Cf54%2Cf55%2Cf56%2Cf57%2Cf58&iscr=0&iscca=0&ut=f057cbcbce2a86e2866ab8877db1d059&ndays=1'

# verify=False disables TLS certificate verification, e.g. when requests
# are still routed through Fiddler's HTTPS-decrypting proxy
res = requests.get(url=url, headers=headers, verify=False, timeout=10)
data = res.json()

# Print at most the first 30 records; slicing avoids an IndexError
# when fewer than 30 are returned
for record in data['data']['trends'][:30]:
    print(record)
This produces the following data:
2021-08-17 09:36,47.82,47.87,47.82,2412,11541700.00,47.923
2021-08-17 09:37,47.75,47.82,47.75,4438,21215542.00,47.906
2021-08-17 09:38,47.67,47.75,47.66,3942,18802944.00,47.882
2021-08-17 09:39,47.89,47.89,47.61,5406,25805845.00,47.861
2021-08-17 09:40,47.92,47.98,47.89,1341,6428628.00,47.864
2021-08-17 09:41,47.96,47.96,47.89,2773,13292278.00,47.868
2021-08-17 09:42,47.99,47.99,47.96,1525,7316794.00,47.872
2021-08-17 09:43,48.06,48.06,47.99,1842,8846337.00,47.878
2021-08-17 09:44,48.03,48.06,48.00,3042,14611700.00,47.888
2021-08-17 09:45,48.01,48.03,48.00,2128,10215539.00,47.892
2021-08-17 09:46,48.18,48.20,48.03,7288,35060015.00,47.919
2021-08-17 09:47,48.13,48.18,48.08,2552,12286559.00,47.928
2021-08-17 09:48,48.13,48.14,48.10,2558,12310755.00,47.936
2021-08-17 09:49,48.07,48.12,48.07,1568,7542189.00,47.940
2021-08-17 09:50,47.98,48.07,47.98,2322,11148561.00,47.942
2021-08-17 09:51,48.02,48.03,47.97,1370,6574020.00,47.943
2021-08-17 09:52,48.01,48.03,48.00,1093,5248116.00,47.944
2021-08-17 09:53,47.95,48.01,47.95,1830,8780509.00,47.945
2021-08-17 09:54,47.88,47.95,47.88,1346,6448633.00,47.945
2021-08-17 09:55,47.90,47.90,47.85,1453,6955471.00,47.943
2021-08-17 09:56,47.93,47.93,47.91,999,4787261.00,47.943
2021-08-17 09:57,47.95,47.95,47.93,1215,5824444.00,47.943
2021-08-17 09:58,47.99,47.99,47.95,1172,5622096.00,47.943
2021-08-17 09:59,47.98,48.00,47.97,881,4227343.00,47.944
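Each record is a single comma-separated string, so it usually needs one more parsing step before it can be analysed. The sketch below turns a record into a timestamp plus a list of numbers. The meaning of the individual columns is not stated here, so they are deliberately left unnamed; `parse_trend` is a hypothetical helper.

```python
from datetime import datetime


def parse_trend(record):
    """Split one trend record into a datetime and a list of float values."""
    timestamp, *values = record.split(",")
    return {
        "time": datetime.strptime(timestamp, "%Y-%m-%d %H:%M"),
        "values": [float(v) for v in values],
    }


row = parse_trend("2021-08-17 09:36,47.82,47.87,47.82,2412,11541700.00,47.923")
print(row["time"])    # 2021-08-17 09:36:00
print(row["values"])  # [47.82, 47.87, 47.82, 2412.0, 11541700.0, 47.923]
```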
6. Common problems
6.1 The phone cannot get online after the proxy is set
See the following links for answers:
https://www.jianshu.com/p/b122eab059c4
https://blog.csdn.net/jss19940414/article/details/89875043
https://blog.csdn.net/jianglianye21/article/details/81743129
6.2 The phone can get online and Fiddler captures traffic, but some apps cannot connect
See the following link for an answer:
https://www.cnblogs.com/lulianqi/p/11380794.html