1. The first step in scraping a site is to determine the URL. Start by checking whether the page's data is loaded dynamically via Ajax: refresh the page and watch the XHR tab for matching requests. Nothing relevant shows up there, which confirms the data can be fetched straight from the original address. While we're at it, copy the request headers, then send the request with requests.get to obtain the page source:
page_text = requests.get(url=url,headers=headers).text
2. The second step is to instantiate an etree object for parsing. Since the page is loaded over the network rather than from a local file, we use etree.HTML:
tree=etree.HTML(page_text)
Inspect the tags in the page source to find where the content we want lives, then locate it with XPath:
li_list = tree.xpath('//ul[@class="house-list-wrap"]/li')
With the list of li elements in hand, we pull the data out by iterating over it with a for loop:
for li in li_list:
    title = li.xpath('./div[2]/h2/a/text()')[0]  # ./ means the current tag (the li tag)
    price_list = li.xpath('./div[3]//text()')
    price = "".join(price_list)  # join the scattered text nodes into one string
    detail_url = li.xpath('./div[2]/h2/a/@href')[0]
3. With the steps above we have each listing's title, price, and detail page address detail_url. The scraped URLs are already absolute, so there is no need to prepend anything.
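Had the site returned relative hrefs instead, the standard library's urljoin could complete them. A minimal sketch (the relative path here is made up for illustration; 58's listing pages actually give absolute URLs):

```python
from urllib.parse import urljoin

# Hypothetical relative href, for illustration only
base = "https://sz.58.com/ershoufang/"
relative_href = "/ershoufang/detail.shtml"
detail_url = urljoin(base, relative_href)
print(detail_url)  # https://sz.58.com/ershoufang/detail.shtml
```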
Click into the first listing and inspect its source: the content we want to scrape sits inside the element with
id="generalSituation".
Now send a request to the detail page URL to fetch the overview data:
detail_page_text = requests.get(url=detail_url,headers = headers).text
Then instantiate a separate detail etree object to parse this page:
detail_tree = etree.HTML(detail_page_text)
desc = "".join(detail_tree.xpath('//div[@id="generalSituation"]/div//text()'))
With the steps above, 95% of this scraper is already done.
4. Last step: we need a "container" to collect all of this data in one place.
all_data_list = list()  # created once, before the loop
dic = {"title": title, "price": price, "desc": desc}
all_data_list.append(dic)
That's it. Below is the rough scraper for 58.com second-hand listings so far, not yet wrapped into functions:
from lxml import etree
import requests
from decode_handler import get_crack_text, base64_code

url = "https://sz.58.com/ershoufang/?utm_source=market&spm=u-2d2yxv86y3v43nkddh1.BDPCPZ_BT&PGTID=0d100000-0000-45eb-257a-c435be5884d6&ClickID=2"
headers = {"user-agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"}

# Fetch the page source
page_text = requests.get(url=url, headers=headers).text

# Instantiate an etree object and parse the data
tree = etree.HTML(page_text)
li_list = tree.xpath('//ul[@class="house-list-wrap"]/li')

all_data_list = list()
for li in li_list:
    title = li.xpath('./div[2]/h2/a/text()')[0]  # ./ means the current tag (the li tag)
    price_list = li.xpath('./div[3]//text()')
    price = "".join(price_list)  # join the scattered text nodes into one string
    detail_url = li.xpath('./div[2]/h2/a/@href')[0]
    print(title)
    print(price)
    # print(detail_url)

    # Request the detail page URL and fetch the overview data
    detail_page_text = requests.get(url=detail_url, headers=headers).text
    # Parse it with a fresh etree object for the detail page
    detail_tree = etree.HTML(detail_page_text)
    desc = "".join(detail_tree.xpath('//div[@id="generalSituation"]/div//text()'))

    dic = {"title": title, "price": price, "desc": desc}
    all_data_list.append(dic)
After scraping, though, we run into a problem:
the prices in the overview come back garbled. How do we handle data that is already garbled in the page source itself?
Analysis:
- Font obfuscation usually means the page has replaced the default character mapping: a custom font file is loaded as a style so the digits render correctly in the browser, while the same code points, viewed in the raw source without that custom font, fall back to the default encoding and show up as gibberish.
- The usual fix is to find the font file and work out the mapping it defines. The font file is always applied as a style exactly where the obfuscated glyphs appear.
- Besides spotting the garbled text in the scraped strongbox source or in our output, we can Ctrl+F the page source for fangchan-secret to locate the obfuscation font file.
- In 58's source, the font file is base64-encoded and embedded in the JS. Extract the encoded part: by hand the first time for analysis, and with a regular expression in code. The mapping order in 58's font file changes on every page refresh, so copy a second sample without refreshing.
base64_string = "AAEAAAALAIAAAwAwR1NVQiCLJXoAAAE4AAAAVE9TLzL4XQjtAAABjAAAAFZjbWFwq8l/XgAAAhAAAAIuZ2x5ZuWIN0cAAARYAAADdGhlYWQY1C5jAAAA4AAAADZoaGVhCtADIwAAALwAAAAkaG10eC7qAAAAAAHkAAAALGxvY2ED7gSyAAAEQAAAABhtYXhwARgANgAAARgAAAAgbmFtZTd6VP8AAAfMAAACanBvc3QFRAYqAAAKOAAAAEUAAQAABmb+ZgAABLEAAAAABGgAAQAAAAAAAAAAAAAAAAAAAAsAAQAAAAEAAOErPNJfDzz1AAsIAAAAAADapnGpAAAAANqmcakAAP/mBGgGLgAAAAgAAgAAAAAAAAABAAAACwAqAAMAAAAAAAIAAAAKAAoAAAD/AAAAAAAAAAEAAAAKADAAPgACREZMVAAObGF0bgAaAAQAAAAAAAAAAQAAAAQAAAAAAAAAAQAAAAFsaWdhAAgAAAABAAAAAQAEAAQAAAABAAgAAQAGAAAAAQAAAAEERAGQAAUAAAUTBZkAAAEeBRMFmQAAA9cAZAIQAAACAAUDAAAAAAAAAAAAAAAAAAAAAAAAAAAAAFBmRWQAQJR2n6UGZv5mALgGZgGaAAAAAQAAAAAAAAAAAAAEsQAABLEAAASxAAAEsQAABLEAAASxAAAEsQAABLEAAASxAAAEsQAAAAAABQAAAAMAAAAsAAAABAAAAaYAAQAAAAAAoAADAAEAAAAsAAMACgAAAaYABAB0AAAAFAAQAAMABJR2lY+ZPJpLnjqeo59kn5Kfpf//AACUdpWPmTyaS546nqOfZJ+Sn6T//wAAAAAAAAAAAAAAAAAAAAAAAAABABQAFAAUABQAFAAUABQAFAAUAAAABgAFAAoAAwAJAAIACAABAAQABwAAAQYAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADAAAAAAAiAAAAAAAAAAKAACUdgAAlHYAAAAGAACVjwAAlY8AAAAFAACZPAAAmTwAAAAKAACaSwAAmksAAAADAACeOgAAnjoAAAAJAACeowAAnqMAAAACAACfZAAAn2QAAAAIAACfkgAAn5IAAAABAACfpAAAn6QAAAAEAACfpQAAn6UAAAAHAAAAAAAAACgAPgBmAJoAvgDoASQBOAF+AboAAgAA/+YEWQYnAAoAEgAAExAAISAREAAjIgATECEgERAhIFsBEAECAez+6/rs/v3IATkBNP7S/sEC6AGaAaX85v54/mEBigGB/ZcCcwKJAAABAAAAAAQ1Bi4ACQAAKQE1IREFNSURIQQ1/IgBW/6cAicBWqkEmGe0oPp7AAEAAAAABCYGJwAXAAApATUBPgE1NCYjIgc1NjMyFhUUAgcBFSEEGPxSAcK6fpSMz7y389Hym9j+nwLGqgHButl0hI2wx43iv5D+69b+pwQAAQAA/+YEGQYnACEAABMWMzI2NRAhIzUzIBE0ISIHNTYzMhYVEAUVHgEVFAAjIiePn8igu/5bgXsBdf7jo5CYy8bw/sqow/7T+tyHAQN7nYQBJqIBFP9uuVjPpf7QVwQSyZbR/wBSAAACAAAAAARoBg0ACgASAAABIxEjESE1ATMRMyERNDcjBgcBBGjGvv0uAq3jxv58BAQOLf4zAZL+bgGSfwP8/CACiUVaJlH9TwABAAD/5gQhBg0AGAAANxYzMjYQJiMiBxEhFSERNjMyBBUUACEiJ7GcqaDEx71bmgL6/bxXLPUBEv7a/v3Zbu5mswEppA4DE63+SgX42uH+6kAAAAACAAD/5gRbBicAFgAiAAABJiMiAgMzNjMyEhUUACMiABEQACEyFwEUFjMyNjU0JiMiBgP6eYTJ9AIFbvHJ8P7r1+z+8wFhASClXv1Qo4eAoJeLhKQFRj7+ov7R1f762eP+3AFxAVMBmgHjLfwBmdq8lKCytAAAAAABAAAAAARNBg0ABgAACQEjASE1IQRN/aLLAkD8+gPvBcn6NwVgrQAAAwAA/+YESgYnABUAHwApAAABJDU0JDMyFhUQBRUEERQEIyIkNRAlATQmIyIGFRQXNgEEFRQWMzI2NTQBtv7rAQTKufD+3wFT/un6zf7+AUwBnIJvaJLz+P78/uGoh4OkAy+B9avXyqD+/osEev7aweXitAEohwF7aHh9YcJlZ/7qdNhwkI9r4QAAAAACAAD/5gRGBicAFwAjAAA3FjMyEhEGJwYjIgA1NAAzMgAREAAhIicTFBYzMjY1NCYjIga5gJTQ5QICZvHD/wABGN/nAQT+sP7Xo3FxoI16pqWHfaTSSgFIAS4CAsIBDNbkASX+lf6l/lP+MjUEHJy3p3en274AAAAAABAAxgABAAAAAAABAA8AAAABAAAAAAACAAcADwABAAAAAAADAA8AFgABAAAAAAAEAA8AJQABAAAAAAAFAAsANAABAAAAAAAGAA8APwABAAAAAAAKACsATgABAAAAAAALABMAeQADAAEECQABAB4AjAADAAEECQACAA4AqgADAAEECQADAB4AuAADAAEECQAEAB4A1gADAAEECQAFABYA9AADAAEECQAGAB4BCgADAAEECQAKAFYBKAADAAEECQALACYBfmZhbmdjaGFuLXNlY3JldFJlZ3VsYXJmYW5nY2hhbi1zZWNyZXRmYW5nY2hhbi1zZWNyZXRWZXJzaW9uIDEuMGZhbmdjaGFuLXNlY3JldEdlbmVyYXRlZCBieSBzdmcydHRmIGZyb20gRm9udGVsbG8gcHJvamVjdC5odHRwOi8vZm9udGVsbG8uY29tAGYAYQBuAGcAYwBoAGEAbgAtAHMAZQBjAHIAZQB0AFIAZQBnAHUAbABhAHIAZgBhAG4AZwBjAGgAYQBuAC0AcwBlAGMAcgBlAHQAZgBhAG4AZwBjAGgAYQBuAC0AcwBlAGMAcgBlAHQAVgBlAHIAcwBpAG8AbgAgADEALgAwAGYAYQBuAGcAYwBoAGEAbgAtAHMAZQBjAHIAZQB0AEcAZQBuAGUAcgBhAHQAZQBkACAAYgB5ACAAcwB2AGcAMgB0AHQAZgAgAGYAcgBvAG0AIABGAG8AbgB0AGUAbABsAG8AIABwAHIAbwBqAGUAYwB0AC4AaAB0AHQAcAA6AC8ALwBmAG8AbgB0AGUAbABsAG8ALgBjAG8AbQAAAAIAAAAAAAAAFAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACwECAQMBBAEFAQYBBwEIAQkBCgELAQwAAAAAAAAAAAAAAAAAAAAA"
single_code = "麣龥龥齤龒"
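The extraction step could be sketched with a regular expression. The exact pattern depends on how 58 embeds the font on the day you scrape it; assuming the page contains something like `base64,<data>` inside a data URL (an assumption; verify against the live source and adjust), a sketch:

```python
import re

# Toy page fragment modeled on the embedding; the real markup may differ
page_text = ("@font-face{font-family:'fangchan-secret';"
             "src:url('data:application/font-ttf;charset=utf-8;"
             "base64,AAEAAAALAIAAAwAwR1NVQg==')}")

# Capture everything after "base64," up to the closing quote/paren
match = re.search(r"base64,([^'\"\)]+)", page_text)
base64_string = match.group(1) if match else None
print(base64_string)  # AAEAAAALAIAAAwAwR1NVQg==
```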
- First base64-decode the string into binary; in this helper I also write the bytes out to an .otf font file along the way:
def make_font_file(base64_string: str):
    bin_data = base64.decodebytes(base64_string.encode())
    with open('text.otf', 'wb') as f:
        f.write(bin_data)
    return bin_data
Then convert the font bytes to XML:
def convert_font_to_xml(bin_data):
    # TTFont expects a file-like object,
    # so BytesIO(bin_data) lets us treat the bytes as a file
    font = TTFont(BytesIO(bin_data))
    font.saveXML("text.xml")

bin_data = make_font_file(base64_string)
convert_font_to_xml(bin_data)

# Get the mapping
font = TTFont(BytesIO(make_font_file(base64_string)))
uniList = font['cmap'].tables[0].ttFont.getGlyphOrder()
c = font['cmap'].tables[0].ttFont.tables['cmap'].tables[0].cmap
# c = font.getBestCmap()
print(c)
In the resulting dict, the keys are the Unicode code points of the garbled characters shown on the page and the values are the glyph names behind them; each glyph0000x glyph stands for one digit. 58's font file is lazy about it: the glyph name's suffix gives the digit away, with glyph00001 mapping to 0, glyph00002 to 1, and so on.
{38006: 'glyph00007', 38287: 'glyph00009', 39228: 'glyph00004', 39499: 'glyph00005', 40506: 'glyph00008', 40611: 'glyph00006', 40804: 'glyph00010', 40850: 'glyph00001', 40868: 'glyph00003', 40869: 'glyph00002'}
Given the Unicode code point of each garbled character scraped from the page, we look up its glyph name and recover the corresponding digit:
def get_num(string):
    ret_list = []
    for char in string:
        decode_num = ord(char)
        num = c[decode_num]
        num = int(num[-2:]) - 1
        ret_list.append(num)
    return ret_list
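Putting the pieces together with the cmap dump shown above (the sample string is built from chr() code points taken from that dump, to keep the example copy-paste safe):

```python
# Mapping copied from the cmap dump above
c = {38006: 'glyph00007', 38287: 'glyph00009', 39228: 'glyph00004',
     39499: 'glyph00005', 40506: 'glyph00008', 40611: 'glyph00006',
     40804: 'glyph00010', 40850: 'glyph00001', 40868: 'glyph00003',
     40869: 'glyph00002'}

def get_num(string):
    ret_list = []
    for char in string:
        glyph = c[ord(char)]                  # obfuscated char -> glyph name
        ret_list.append(int(glyph[-2:]) - 1)  # glyph0000x -> digit x-1
    return ret_list

# A sample obfuscated string built from code points present in the mapping
sample = chr(40611) + chr(40869) + chr(40869) + chr(40804) + chr(40850)
print(get_num(sample))  # [5, 1, 1, 9, 0]
```

Note that this mapping is only valid for the font sample captured above; a fresh page load ships a reshuffled cmap, which is why the full code re-extracts it every time.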
The complete code follows. Save it as a .py file and call get_crack_text (passing in the base64 string and the garbled text to decode):
import base64
from io import BytesIO
from fontTools.ttLib import TTFont


def make_font_file(base64_string: str):
    """
    Base64-decode the string into binary; also write the bytes
    out to an .otf font file along the way.
    :param base64_string:
    :return:
    """
    bin_data = base64.decodebytes(base64_string.encode())
    with open('text.otf', 'wb') as f:
        f.write(bin_data)
    return bin_data


def convert_font_to_xml(bin_data):
    """
    Convert the font bytes to XML.
    TTFont expects a file-like object,
    :param bin_data: so BytesIO lets us treat the bytes as a file
    :return:
    """
    font = TTFont(BytesIO(bin_data))
    font.saveXML("text.xml")


def get_num(string, c_list):
    """
    Given the Unicode code point of each garbled character, look up
    its glyph name and recover the corresponding digit.
    :param string:
    :return:
    """
    ret_list = []
    for char in string:
        decode_num = ord(char)
        num = c_list[decode_num]
        num = int(num[-2:]) - 1
        ret_list.append(num)
    return ret_list


def get_crack_text(code, text):
    """
    Call this function.
    :param code: the base64 string matched by the regex
    :param text: the garbled text
    :return:
    """
    font = TTFont(BytesIO(make_font_file(code)))
    # uni_list = font['cmap'].tables[0].ttFont.getGlyphOrder()
    code_list = font['cmap'].tables[0].ttFont.tables['cmap'].tables[0].cmap
    # code_list = font.getBestCmap()
    crack_text = "".join([str(i) for i in get_num(text, code_list)])
    return crack_text