Cracking the font-based price obfuscation (anti-scraping) on 58 second-hand housing detail pages
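1. When scraping 58 second-hand housing detail pages we run into a custom font: the price, which looks normal in the browser, comes back as garbled characters, as shown in the figure:
Font obfuscation generally works by remapping characters to glyphs. The page loads the site's own font file as a style, so the browser draws the digits correctly. In the page source, however, the same code points are shown without that custom font, so they fall back to the default font and appear as unrelated characters.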
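To make this concrete, the sketch below simulates the mechanism with plain dictionaries: the page source carries only code points, and what gets drawn depends on the font's table. The two toy tables (`default_font`, `custom_font`) and the digits in them are illustrative assumptions, not data taken from 58.

```python
# Toy model of font obfuscation (illustrative tables, not real 58 data).
# The page source only stores code points; the font decides what is drawn.

source_text = "\u9fa5\u9f92"  # what a scraper finds in the HTML

# Without the custom font, each code point is drawn as its ordinary character.
default_font = {cp: chr(cp) for cp in (0x9FA5, 0x9F92)}

# Hypothetical custom font: the same code points are drawn as digit shapes.
custom_font = {0x9FA5: "1", 0x9F92: "0"}


def render(text, font):
    """Return what a reader would see when `text` is drawn with `font`."""
    return "".join(font[ord(ch)] for ch in text)


seen_by_scraper = render(source_text, default_font)
seen_in_browser = render(source_text, custom_font)
print(repr(seen_by_scraper), repr(seen_in_browser))
```

Recovering the price therefore comes down to reading the custom font's table, which is what the following steps do.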
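The usual solution is to find the font file and analyze the mapping it contains. The font is applied as a style to the obfuscated part of the page.
Having gone through the styles, only strongbox and fangchan-secret look, judging by their names, like candidates for the obfuscation font.
2. Look in the page source: press Ctrl+F and search for fangchan-secret to locate the font file.
In 58's source, the font file is base64-encoded and embedded in the JS. Take out the encoded part. For this first analysis it can be copied by hand; in code, a regular expression can extract it.
The mapping order in 58's font file changes every time the page is refreshed, so copy the garbled text as well from the same page load, without refreshing.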
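A minimal sketch of the regex extraction follows. It assumes the font is embedded as a `data:` URI of the form `base64,<payload>` inside an `@font-face` rule. That layout, the sample `html` string, and the helper name `extract_font_base64` are assumptions for illustration, so the pattern may need adjusting to what the real page source contains. Because the mapping changes on refresh, the payload and the garbled text should come from the same response.

```python
import re

# Assumed layout: ...base64,<payload>')... ; the real page may differ.
FONT_RE = re.compile(r"base64,([A-Za-z0-9+/=]+)")


def extract_font_base64(html):
    """Return the first base64 payload found in `html`, or None."""
    match = FONT_RE.search(html)
    return match.group(1) if match else None


# Made-up page fragment standing in for one downloaded response.
html = (
    "<style>@font-face{font-family:'fangchan-secret';"
    "src:url('data:application/font-ttf;charset=utf-8;base64,AAEAAAAL')"
    " format('truetype')}</style>"
)
font_b64 = extract_font_base64(html)
print(font_b64)
```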
3. Here we take the first listing, priced at 1300, together with its garbled text.
base64_str = "AAEAAAALAIAAAwAwR1NVQiCLJXoAAAE4AAAAVE9TLzL4XQjtAAABjAAAAFZjbWFwq8d/YAAAAhAAAAIuZ2x5ZuWIN0cAAARYAAADdGhlYWQWmlfBAAAA4AAAADZoaGVhCtADIwAAALwAAAAkaG10eC7qAAAAAAHkAAAALGxvY2ED7gSyAAAEQAAAABhtYXhwARgANgAAARgAAAAgbmFtZTd6VP8AAAfMAAACanBvc3QFRAYqAAAKOAAAAEUAAQAABmb+ZgAABLEAAAAABGgAAQAAAAAAAAAAAAAAAAAAAAsAAQAAAAEAAOWi6hJfDzz1AAsIAAAAAADZiYZYAAAAANmJhlgAAP/mBGgGLgAAAAgAAgAAAAAAAAABAAAACwAqAAMAAAAAAAIAAAAKAAoAAAD/AAAAAAAAAAEAAAAKADAAPgACREZMVAAObGF0bgAaAAQAAAAAAAAAAQAAAAQAAAAAAAAAAQAAAAFsaWdhAAgAAAABAAAAAQAEAAQAAAABAAgAAQAGAAAAAQAAAAEERAGQAAUAAAUTBZkAAAEeBRMFmQAAA9cAZAIQAAACAAUDAAAAAAAAAAAAAAAAAAAAAAAAAAAAAFBmRWQAQJR2n6UGZv5mALgGZgGaAAAAAQAAAAAAAAAAAAAEsQAABLEAAASxAAAEsQAABLEAAASxAAAEsQAABLEAAASxAAAEsQAAAAAABQAAAAMAAAAsAAAABAAAAaYAAQAAAAAAoAADAAEAAAAsAAMACgAAAaYABAB0AAAAFAAQAAMABJR2lY+ZPJpLnjqeo59kn5Kfpf//AACUdpWPmTyaS546nqOfZJ+Sn6T//wAAAAAAAAAAAAAAAAAAAAAAAAABABQAFAAUABQAFAAUABQAFAAUAAAACgAFAAcAAwAJAAIACAAEAAEABgAAAQYAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADAAAAAAAiAAAAAAAAAAKAACUdgAAlHYAAAAKAACVjwAAlY8AAAAFAACZPAAAmTwAAAAHAACaSwAAmksAAAADAACeOgAAnjoAAAAJAACeowAAnqMAAAACAACfZAAAn2QAAAAIAACfkgAAn5IAAAAEAACfpAAAn6QAAAABAACfpQAAn6UAAAAGAAAAAAAAACgAPgBmAJoAvgDoASQBOAF+AboAAgAA/+YEWQYnAAoAEgAAExAAISAREAAjIgATECEgERAhIFsBEAECAez+6/rs/v3IATkBNP7S/sEC6AGaAaX85v54/mEBigGB/ZcCcwKJAAABAAAAAAQ1Bi4ACQAAKQE1IREFNSURIQQ1/IgBW/6cAicBWqkEmGe0oPp7AAEAAAAABCYGJwAXAAApATUBPgE1NCYjIgc1NjMyFhUUAgcBFSEEGPxSAcK6fpSMz7y389Hym9j+nwLGqgHButl0hI2wx43iv5D+69b+pwQAAQAA/+YEGQYnACEAABMWMzI2NRAhIzUzIBE0ISIHNTYzMhYVEAUVHgEVFAAjIiePn8igu/5bgXsBdf7jo5CYy8bw/sqow/7T+tyHAQN7nYQBJqIBFP9uuVjPpf7QVwQSyZbR/wBSAAACAAAAAARoBg0ACgASAAABIxEjESE1ATMRMyERNDcjBgcBBGjGvv0uAq3jxv58BAQOLf4zAZL+bgGSfwP8/CACiUVaJl
H9TwABAAD/5gQhBg0AGAAANxYzMjYQJiMiBxEhFSERNjMyBBUUACEiJ7GcqaDEx71bmgL6/bxXLPUBEv7a/v3Zbu5mswEppA4DE63+SgX42uH+6kAAAAACAAD/5gRbBicAFgAiAAABJiMiAgMzNjMyEhUUACMiABEQACEyFwEUFjMyNjU0JiMiBgP6eYTJ9AIFbvHJ8P7r1+z+8wFhASClXv1Qo4eAoJeLhKQFRj7+ov7R1f762eP+3AFxAVMBmgHjLfwBmdq8lKCytAAAAAABAAAAAARNBg0ABgAACQEjASE1IQRN/aLLAkD8+gPvBcn6NwVgrQAAAwAA/+YESgYnABUAHwApAAABJDU0JDMyFhUQBRUEERQEIyIkNRAlATQmIyIGFRQXNgEEFRQWMzI2NTQBtv7rAQTKufD+3wFT/un6zf7+AUwBnIJvaJLz+P78/uGoh4OkAy+B9avXyqD+/osEev7aweXitAEohwF7aHh9YcJlZ/7qdNhwkI9r4QAAAAACAAD/5gRGBicAFwAjAAA3FjMyEhEGJwYjIgA1NAAzMgAREAAhIicTFBYzMjY1NCYjIga5gJTQ5QICZvHD/wABGN/nAQT+sP7Xo3FxoI16pqWHfaTSSgFIAS4CAsIBDNbkASX+lf6l/lP+MjUEHJy3p3en274AAAAAABAAxgABAAAAAAABAA8AAAABAAAAAAACAAcADwABAAAAAAADAA8AFgABAAAAAAAEAA8AJQABAAAAAAAFAAsANAABAAAAAAAGAA8APwABAAAAAAAKACsATgABAAAAAAALABMAeQADAAEECQABAB4AjAADAAEECQACAA4AqgADAAEECQADAB4AuAADAAEECQAEAB4A1gADAAEECQAFABYA9AADAAEECQAGAB4BCgADAAEECQAKAFYBKAADAAEECQALACYBfmZhbmdjaGFuLXNlY3JldFJlZ3VsYXJmYW5nY2hhbi1zZWNyZXRmYW5nY2hhbi1zZWNyZXRWZXJzaW9uIDEuMGZhbmdjaGFuLXNlY3JldEdlbmVyYXRlZCBieSBzdmcydHRmIGZyb20gRm9udGVsbG8gcHJvamVjdC5odHRwOi8vZm9udGVsbG8uY29tAGYAYQBuAGcAYwBoAGEAbgAtAHMAZQBjAHIAZQB0AFIAZQBnAHUAbABhAHIAZgBhAG4AZwBjAGgAYQBuAC0AcwBlAGMAcgBlAHQAZgBhAG4AZwBjAGgAYQBuAC0AcwBlAGMAcgBlAHQAVgBlAHIAcwBpAG8AbgAgADEALgAwAGYAYQBuAGcAYwBoAGEAbgAtAHMAZQBjAHIAZQB0AEcAZQBuAGUAcgBhAHQAZQBkACAAYgB5ACAAcwB2AGcAMgB0AHQAZgAgAGYAcgBvAG0AIABGAG8AbgB0AGUAbABsAG8AIABwAHIAbwBqAGUAYwB0AC4AaAB0AHQAcAA6AC8ALwBmAG8AbgB0AGUAbABsAG8ALgBjAG8AbQAAAAIAAAAAAAAAFAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACwECAQMBBAEFAQYBBwEIAQkBCgELAQwAAAAAAAAAAAAAAAAAAAAA" single_code = "麣龥龥齤龒"
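4. First, base64-decode the string into binary data. The function also writes the font to an otf file.

import base64

def make_font_file(base64_string: str):
    # Decode the base64 text into the raw bytes of the font
    bin_data = base64.decodebytes(base64_string.encode())
    # Keep a copy on disk so the font can be inspected later
    with open('text.otf', 'wb') as f:
        f.write(bin_data)
    return bin_data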
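5. Convert the font bytes to XML

from io import BytesIO
from fontTools.ttLib import TTFont

def convert_font_to_xml(bin_data):
    # TTFont expects a file-like object, so
    # BytesIO(bin_data) lets the bytes be treated as a file
    font = TTFont(BytesIO(bin_data))
    font.saveXML("text.xml")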
bin_data = make_font_file(base64_str)
convert_font_to_xml(bin_data)
# Get the code point -> glyph name mapping
font = TTFont(BytesIO(bin_data))
uniList = font.getGlyphOrder()
c = font['cmap'].tables[0].cmap
# c = font.getBestCmap()
print(c)

Output:
{38006: 'glyph00007', 38287: 'glyph00009', 39228: 'glyph00004', 39499: 'glyph00005', 40506: 'glyph00008', 40611: 'glyph00006', 40804: 'glyph00010', 40850: 'glyph00001', 40868: 'glyph00003', 40869: 'glyph00002'}
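The keys of this dictionary are the Unicode code points of the garbled characters found in the page; the values identify the real digits those characters stand for. For example, glyph00007 is a glyph name, and each glyph0000x corresponds to one digit.
58's font file is rather lazy: the digit can be read straight off the suffix of the glyph name. glyph00001 corresponds to 0, glyph00002 corresponds to 1, and so on.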
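The naming rule can be turned into a lookup table in a few lines. The sketch below uses the dictionary printed above as input. The names `cmap`, `digit_table` and `decode` are illustrative, and the rule "numeric suffix minus one" is only known to hold for this particular font.

```python
# Code point -> glyph name, copied from the printed result above.
cmap = {
    38006: 'glyph00007', 38287: 'glyph00009', 39228: 'glyph00004',
    39499: 'glyph00005', 40506: 'glyph00008', 40611: 'glyph00006',
    40804: 'glyph00010', 40850: 'glyph00001', 40868: 'glyph00003',
    40869: 'glyph00002',
}

# Assumption specific to this font: digit = numeric suffix of the name - 1.
digit_table = {code: str(int(name[-2:]) - 1) for code, name in cmap.items()}


def decode(text):
    """Replace every obfuscated character in `text` with its digit."""
    return text.translate(digit_table)


# Three characters whose glyphs are glyph00002, glyph00001, glyph00001.
sample = chr(40869) + chr(40850) + chr(40850)
print(decode(sample))
```

Since the mapping changes with every page load, a table like this has to be rebuilt from the font that came with the same response as the text being decoded.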
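6. Using the Unicode code point of each garbled character scraped from the page, look up its glyph name; that gives the corresponding digit.

def get_num(string):
    ret_list = []
    for char in string:
        decode_num = ord(char)      # code point of the garbled character
        num = c[decode_num]         # glyph name, e.g. 'glyph00007'
        num = int(num[-2:]) - 1     # numeric suffix minus 1 is the digit
        ret_list.append(num)
    return ret_list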
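If the browser shows garbled text such as 鸺齤齤, the data a crawler receives may instead be hexadecimal character references such as &#x9e3a;&#x9f92;&#x9f92;. In that case, take the last four hex digits of each reference, convert them to a decimal number, and look that number up in the mapping.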
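A small sketch of that conversion, assuming the scraped text contains references of the form `&#x....;` with four hex digits. The sample string and the helper name `entity_code_points` are illustrative. The standard library's `html.unescape` is an alternative route: it turns the references back into characters, which can then go through `ord` as before.

```python
import html
import re

# Matches references such as &#x9e3a; and captures the four hex digits.
ENTITY_RE = re.compile(r"&#x([0-9a-fA-F]{4});")


def entity_code_points(raw):
    """Return the decimal code point of each &#x....; reference in `raw`."""
    return [int(hex_digits, 16) for hex_digits in ENTITY_RE.findall(raw)]


raw = "&#x9e3a;&#x9f92;&#x9f92;"  # illustrative scraped fragment
codes = entity_code_points(raw)
print(codes)

# Alternative route: decode the references first, then use ord().
codes_via_unescape = [ord(ch) for ch in html.unescape(raw)]
print(codes_via_unescape == codes)
```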
The complete code follows. Save it as a .py file and call get_crack_text, passing in the base64 string and the garbled text to decode:
# -*- coding: utf-8 -*-
# __author: Tiger_Lee
# @file: decode_price.py
# @time: 2019 08 26
# @email: lxh661314@163.com

import base64
from io import BytesIO
from fontTools.ttLib import TTFont


def make_font_file(base64_string: str):
    """
    Base64-decode the string into binary data; the font is also written to an otf file.
    :param base64_string: base64 text of the font
    :return: raw bytes of the font
    """
    bin_data = base64.decodebytes(base64_string.encode())
    with open('text.otf', 'wb') as f:
        f.write(bin_data)
    return bin_data


def convert_font_to_xml(bin_data):
    """
    Convert the font bytes to XML.
    TTFont expects a file-like object, so the bytes are wrapped in BytesIO.
    :param bin_data: binary data, handled as if it were a file
    :return:
    """
    font = TTFont(BytesIO(bin_data))
    font.saveXML("text.xml")


def get_num(string, c_list):
    """
    Look up the glyph name for the code point of each garbled character
    scraped from the page, and derive the digit from that name.
    :param string: garbled text
    :param c_list: mapping of code point -> glyph name
    :return: list of digits
    """
    ret_list = []
    for char in string:
        decode_num = ord(char)
        num = c_list[decode_num]
        num = int(num[-2:]) - 1
        ret_list.append(num)
    return ret_list


def get_crack_text(code, text):
    """
    Entry point: call this function.
    :param code: base64 string matched by the regular expression
    :param text: garbled text
    :return: decoded digits as a string
    """
    font = TTFont(BytesIO(make_font_file(code)))
    # uni_list = font.getGlyphOrder()
    code_list = font['cmap'].tables[0].cmap
    # code_list = font.getBestCmap()
    crack_text = "".join([str(i) for i in get_num(text, code_list)])
    return crack_text
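Opening the font file in FontCreator shows at a glance which code corresponds to each digit.
Most other sites that use this kind of font obfuscation can be handled with the same approach.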
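Author: TigerLee
Source: http://www.cnblogs.com/tiger666/
Copyright of this article belongs to the author and cnblogs (博客园). Reposting is welcome; please leave me a note on the message board and include a link to the original in the article. Thank you!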