Python Web Scraping: Scraping a Fanqie Novel with Encrypted Text

1. Learning to Scrape by Reading a Novel

It had been a while since I last worked on web scraping, so I tried scraping a novel site to read a novel, only to find that the content the pages returned was encrypted.

2. Analyzing the Novel's Table of Contents

By analyzing the table-of-contents page, we can extract the book title and related details.

We use the parsel package to pull information out of the page. For example, on a chapter (reader) page there are three ways to extract content (the headers dict is defined in the full script further down):

import requests
import parsel

url = "https://fanqienovel.com/reader/7276663560427471412?enter_from=page"

# Send the request (headers: see the full script below)
response = requests.get(url=url, headers=headers)
# Get the response body as text (an HTML string)
html_data = response.text
"""Parse the data: extract what we need"""
# Turn the HTML string into a parseable object
selector = parsel.Selector(html_data)

# Option 1: XPath
text = selector.xpath('string(//div[@class="muye-reader-content noselect"])').get()

# Option 2: regex
text = selector.re(r'<p>(.*?)</p>')

# Option 3: CSS selectors
# Chapter title
name = selector.css('.muye-reader-title::text').get()
print(name)

Straight to the code:

import requests
import parsel

# URL (request address)
url = "https://fanqienovel.com/page/7276384138653862966"
# Pretend to be a browser
headers = {
    # Cookie
    'Cookie': 'Hm_lvt_2667d29c8e792e6fa9182c20a3013175=1716438629; csrf_session_id=cb69e6cf3b1af43a88a56157e7795f2e; '
              'novel_web_id=7372047678422058532; s_v_web_id=verify_lwir8sbl_HcMwpu3M_DoJp_4RKG_BcMo_izZ4lEmNBlEQ; '
              'Hm_lpvt_2667d29c8e792e6fa9182c20a3013175=1716454389; ttwid=1%7CRpx4a-wFaDG9-ogRfl7wXC7k61DQkWYwkb_Q2THE'
              'qb4%7C1716454388%7Cb80bb1f8f2ccd546e6a1ccd1b1abb9151e31bbf5d48e3224451a90b7ca5d534c; msToken=-9U5-TOe5X2'
              'axgeeY4G28F-tp-R7o8gDaOF5p2fPPvcNdZYLXWU9JiPv_tOU81HeXCDT52o4UtGOLCZmuDMN2I8yulNK-8hIUpNSHiEVK3ke5aEeG'
              'J4wDhk_cQgJ3g==',
    # User-Agent
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 '
                  'Safari/537.36'
}
# Send the request
response = requests.get(url=url, headers=headers)
# Get the response body as text (an HTML string)
html = response.text
"""Parse the data: extract what we need"""
# Turn the HTML string into a parseable object
selector = parsel.Selector(html)
# Book title
name = selector.css('.info-name h1::text').get()
print(name)
# Author
au = selector.css('.author-name-text::text').get()
print(au)
# Tags
x = selector.css('.info-label span::text').getall()
print(x)

Running this prints the book title, author, and tags.

Next, grab the chapter titles and chapter URLs.

Analyzing the page, both fields can be extracted with CSS selectors:

# CSS selectors
# Chapter title
.chapter-item-title::text

# Chapter URL
.chapter-item-title::attr(href)


# Chapter titles
title_list = selector.css('.chapter-item-title::text').getall()
print(title_list)

# Chapter URLs
href = selector.css('.chapter-item-title::attr(href)').getall()
print(href)

Running this prints the list of titles and the list of hrefs.

Now join each href onto the site root to build a full URL:

for title, link in zip(title_list, href):
    print(title)

    # Full chapter URL
    link_url = 'https://fanqienovel.com' + link
    print(link_url)

Running this prints each chapter title followed by its full URL.
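(Aside: plain concatenation works here because every href is a root-relative path. If that ever changes, urllib.parse.urljoin from the standard library is the safer way to build the full URL; a minimal sketch:)

from urllib.parse import urljoin

# urljoin copes with both root-relative hrefs ('/reader/123')
# and already-absolute URLs, without double-prefixing
link_url = urljoin('https://fanqienovel.com', link)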

Spot-checking the URLs turns up a problem: the URL paired with chapter 1 is wrong; opening it does not show chapter 1's content (it points to id 7372041397370618392).

Inspecting the page's href list explains why: the page also renders a "latest update" link, whose id is exactly the one our code paired with chapter 1. The href list therefore has one extra entry at the front, and the code needs a fix: skip the first href.

for title, link in zip(title_list, href[1:]):
    print(title)

    # Full chapter URL
    link_url = 'https://fanqienovel.com' + link
    print(link_url)

Now the titles and URLs line up, and spot-checking the URLs confirms they are correct.

3. Fetching Each Chapter Page

Inside the loop, request each chapter URL and extract its text:

    # Send the request and grab the response text
    link_data = requests.get(url=link_url, headers=headers).text
    # Parse the data: extract the novel text
    link_selector = parsel.Selector(link_data)
    # Extract the chapter paragraphs
    content_list = link_selector.css('.muye-reader-content-16 p::text').getall()
    # Join the list into a single string
    content = '\n'.join(content_list)

Running this retrieves part of the page text, but the content is incomplete: many characters are obfuscated and do not display.
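A quick way to confirm what is happening is to look at the code points of the scraped text: the unreadable characters all sit in Unicode's Private Use Area (U+E000 to U+F8FF), where the site's custom font draws its own glyphs. A minimal sketch, assuming content holds the joined chapter text from above:

# Private Use Area code points have no standard meaning; they only
# render correctly with the site's custom font, hence the garbage text
pua_chars = {c for c in content if 0xE000 <= ord(c) <= 0xF8FF}
print(f"{len(pua_chars)} distinct obfuscated characters")
print(sorted(ord(c) for c in pua_chars))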

4. Decrypting the Text

Analyzing the page, we can locate the custom font file and download it.

Opening the downloaded font with FontCreator.exe lets us browse its contents: the glyphs and the code points they are mapped to.
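If you would rather inspect the font programmatically than in FontCreator, the fontTools package can dump its character map. A sketch, assuming the downloaded font was saved as font.woff (the actual filename and format depend on what the page serves; WOFF2 files additionally need the brotli package):

from fontTools.ttLib import TTFont

# Load the downloaded font and read its code-point -> glyph map
font = TTFont('font.woff')
cmap = font['cmap'].getBestCmap()

for codepoint, glyph_name in sorted(cmap.items()):
    print(codepoint, hex(codepoint), glyph_name)

Note that glyph names do not necessarily reveal which real character a glyph draws; that still has to be read off the rendered glyphs, which is why the mapping below was compiled by hand.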

Now convert the scraped novel text: use the ord() function to print the code point of every character.

    # Send the request and grab the response text
    link_data = requests.get(url=link_url, headers=headers).text
    # Parse the data: extract the novel text
    link_selector = parsel.Selector(link_data)
    # Extract the chapter paragraphs
    content_list = link_selector.css('.muye-reader-content-16 p::text').getall()
    # Join the list into a single string
    content = '\n'.join(content_list)

    # Print each character alongside its code point
    for i in content:
        print(i, "-->", ord(i))

The output pairs each character with its code point.

Analyzing the extracted data: for each code point, the matching Chinese character can be found in the downloaded font. (These values are Unicode private-use code points, not ASCII codes.) For example:

58398 ---> 是
58483 ---> 白
58611 ---> 的

And so on. To decode whole chapters, a lookup table has to be compiled from the font, mapping every code point to the character its glyph draws.
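Compiling the table is manual work, but the skeleton can be generated from the font itself so that no code point is missed. A sketch, again assuming font.woff from earlier:

from fontTools.ttLib import TTFont

font = TTFont('font.woff')
cmap = font['cmap'].getBestCmap()

# Print a dict skeleton keyed the same way as the script below
# (decimal code point as a string); each value still has to be
# filled in by reading the glyph in a font viewer
print('dict_data = {')
for codepoint in sorted(cmap):
    print(f"    '{codepoint}': '',")
print('}')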

Once each obfuscated character is replaced via this dictionary, the full text is recovered.

Decryption:

text = link_selector.css('.muye-reader-content-16 p::text').getall()
content = '\n'.join(text)
# print(content)
for index in content:
    try:
        # Look up the real character for this code point
        t1 = dict_data[str(ord(index))]
        print(t1, end="")
    except KeyError:
        # Not obfuscated: print the character as-is
        t1 = index
        print(t1, end="")

The decoded output matches what the page displays in the browser.
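As an aside, once dict_data exists, Python's built-in str.translate can apply the whole mapping in one pass instead of the per-character try/except; a sketch:

# str.translate expects integer code points as keys; characters
# without an entry pass through unchanged
table = {int(k): v for k, v in dict_data.items()}
decoded = content.translate(table)
print(decoded)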

Saving the data

All that is left is to write the decoded content to a file:

text = link_selector.css('.muye-reader-content-16 p::text').getall()
content = '\n'.join(text)
# print(content)
result = []
for index in content:
    try:
        # Look up the real character for this code point
        t1 = dict_data[str(ord(index))]
        # print(t1, end="")
        result.append(t1)
    except KeyError:
        # Not obfuscated: keep the character as-is
        t1 = index
        # print(t1, end="")
        result.append(t1)


# Write to file
with open('2.txt', mode='a', encoding='utf8') as f:
    f.write(name + '\n')  # write the chapter title
    f.write(''.join(result))

Each run appends the decoded chapter to 2.txt.

The complete code:

PS: the decryption dictionary below was compiled by hand, so its accuracy is not guaranteed; treat this as a reference for the overall approach.

import requests
import parsel

# URL (request address)
url = "https://fanqienovel.com/page/7276384138653862966"
# Pretend to be a browser
headers = {
    # Cookie
    'Cookie': 'Hm_lvt_2667d29c8e792e6fa9182c20a3013175=1716438629; csrf_session_id=cb69e6cf3b1af43a88a56157e7795f2e; '
              'novel_web_id=7372047678422058532; s_v_web_id=verify_lwir8sbl_HcMwpu3M_DoJp_4RKG_BcMo_izZ4lEmNBlEQ; '
              'Hm_lpvt_2667d29c8e792e6fa9182c20a3013175=1716454389; ttwid=1%7CRpx4a-wFaDG9-ogRfl7wXC7k61DQkWYwkb_Q2THE'
              'qb4%7C1716454388%7Cb80bb1f8f2ccd546e6a1ccd1b1abb9151e31bbf5d48e3224451a90b7ca5d534c; msToken=-9U5-TOe5X2'
              'axgeeY4G28F-tp-R7o8gDaOF5p2fPPvcNdZYLXWU9JiPv_tOU81HeXCDT52o4UtGOLCZmuDMN2I8yulNK-8hIUpNSHiEVK3ke5aEeG'
              'J4wDhk_cQgJ3g==',
    # User-Agent
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 '
                  'Safari/537.36'
}
# Send the request
response = requests.get(url=url, headers=headers)
# Get the response body as text (an HTML string)
html = response.text
"""Parse the data: extract what we need"""
# Turn the HTML string into a parseable object
selector = parsel.Selector(html)
# Book title
name = selector.css('.info-name h1::text').get()
# print(name)
# Author
au = selector.css('.author-name-text::text').get()
# print(au)
# Tags
x = selector.css('.info-label span::text').getall()
# print(x)

# Chapter titles
title_list = selector.css('.chapter-item-title::text').getall()
# print(title_list)

# Chapter URLs
href = selector.css('.chapter-item-title::attr(href)').getall()
# print(href)


# Skip the first href, which is the "latest update" link
for title, link in zip(title_list, href[1:]):
    print(title)

    # Full chapter URL
    link_url = 'https://fanqienovel.com' + link
    print(link_url)

    # Send the request and grab the response text
    link_data = requests.get(url=link_url, headers=headers).text
    # Parse the data: extract the novel text
    link_selector = parsel.Selector(link_data)
    # Extract the chapter paragraphs
    content_list = link_selector.css('.muye-reader-content-16 p::text').getall()
    # Join the list into a single string
    content = '\n'.join(content_list)

    # for i in content:
    #     print(i, "-->", ord(i))

    # Code point -> real character, compiled by hand from the font
    dict_data = {
        '58670': '0',
        '58413': '1',
        '58678': '2',
        '58371': '3',
        '58353': '4',
        '58480': '5',
        '58359': '6',
        '58449': '7',
        '58540': '8',
        '58692': '9',
        '58712': 'a',
        '58542': 'b',
        '58575': 'c',
        '58626': 'd',
        '58691': 'e',
        '58561': 'f',
        '58362': 'g',
        '58619': 'h',
        '58430': 'i',
        '58531': 'j',
        '58588': 'k',
        '58440': 'l',
        '58681': 'm',
        '58631': 'n',
        '58376': 'o',
        '58429': 'p',
        '58555': 'q',
        '58498': 'r',
        '58518': 's',
        '58453': 't',
        '58397': 'u',
        '58356': 'v',
        '58435': 'w',
        '58514': 'x',
        '58482': 'y',
        '58529': 'z',
        '58515': 'A',
        '58688': 'B',
        '58709': 'C',
        '58344': 'D',
        '58656': 'E',
        '58381': 'F',
        '58576': 'G',
        '58516': 'H',
        '58463': 'I',
        '58649': 'J',
        '58571': 'K',
        '58558': 'L',
        '58433': 'M',
        '58517': 'N',
        '58387': 'O',
        '58687': 'P',
        '58537': 'Q',
        '58541': 'R',
        '58458': 'S',
        '58390': 'T',
        '58466': 'U',
        '58386': 'V',
        '58697': 'W',
        '58519': 'X',
        '58511': 'Y',
        '58634': 'Z',
        '58611': '',
        '58590': '',
        '58398': '',
        '58422': '',
        '58657': '',
        '58666': '',
        '58562': '',
        '58345': '',
        '58510': '',
        '58496': '',
        '58654': '',
        '58441': '',
        '58493': '',
        '58714': '',
        '58618': '',
        '58528': '',
        '58620': '',
        '58403': '',
        '58461': '',
        '58481': '',
        '58700': '',
        '58708': '',
        '58503': '',
        '58442': '',
        '58639': '',
        '58506': '',
        '58663': '',
        '58436': '',
        '58563': '',
        '58391': '',
        '58357': '',
        '58354': '',
        '58695': '',
        '58372': '',
        '58696': '',
        '58551': '',
        '58445': '',
        '58408': '',
        '58599': '',
        '58424': '',
        '58394': '',
        '58348': '',
        '58426': '',
        '58673': '',
        '58417': '',
        '58556': '',
        '58603': '',
        '58565': '',
        '58604': '',
        '58522': '',
        '58632': '',
        '58622': '',
        '58350': '',
        '58605': '',
        '58617': '',
        '58401': '',
        '58637': '',
        '58684': '',
        '58382': '',
        '58464': '',
        '58487': '',
        '58693': '',
        '58608': '',
        '58392': '',
        '58474': '',
        '58601': '',
        '58355': '',
        '58573': '',
        '58499': '',
        '58469': '',
        '58361': '',
        '58698': '',
        '58489': '',
        '58711': '',
        '58457': '',
        '58635': '',
        '58492': '',
        '58647': '',
        '58623': '',
        '58521': '',
        '58609': '',
        '58530': '',
        '58665': '',
        '58652': '',
        '58676': '',
        '58456': '',
        '58581': '',
        '58509': '',
        '58488': '',
        '58363': '',
        '58685': '',
        '58396': '',
        '58523': '',
        '58471': '',
        '58485': '',
        '58613': '',
        '58533': '',
        '58589': '',
        '58527': '',
        '58593': '',
        '58699': '',
        '58707': '',
        '58414': '',
        '58596': '',
        '58570': '',
        '58660': '',
        '58364': '',
        '58526': '',
        '58501': '',
        '58638': '',
        '58404': '',
        '58677': '',
        '58535': '',
        '58629': '',
        '58577': '',
        '58606': '',
        '58497': '',
        '58662': '',
        '58479': '',
        '58532': '',
        '58380': '',
        '58385': '',
        '58405': '',
        '58644': '',
        '58578': '使',
        '58505': '',
        '58564': '',
        '58412': '',
        '58686': '',
        '58624': '',
        '58667': '',
        '58607': '',
        '58616': '',
        '58368': '',
        '58427': '',
        '58423': '',
        '58633': '',
        '58525': '',
        '58543': '',
        '58418': '',
        '58597': '',
        '58683': '',
        '58507': '',
        '58621': '',
        '58703': '',
        '58438': '',
        '58536': '',
        '58384': '',
        '58484': '',
        '58539': '',
        '58554': '',
        '58421': '',
        '58347': '',
        '58569': '',
        '58710': '',
        '58574': '',
        '58375': '',
        '58645': '西',
        '58592': '',
        '58572': '',
        '58388': '',
        '58370': '',
        '58399': '',
        '58651': '',
        '58546': '',
        '58504': '',
        '58419': '',
        '58407': '',
        '58672': '',
        '58675': '',
        '58538': '',
        '58465': '',
        '58374': '',
        '58579': '',
        '58402': '',
        '58702': '',
        '58553': '',
        '58360': '',
        '58389': '',
        '58560': '',
        '58690': '',
        '58473': '',
        '58512': '',
        '58653': '',
        '58704': '便',
        '58545': '',
        '58641': '',
        '58475': '',
        '58583': '',
        '58472': '',
        '58478': '',
        '58664': '',
        '58586': '',
        '58568': '',
        '58674': '',
        '58490': '',
        '58476': '',
        '58346': '',
        '58630': '',
        '58595': '',
        '58502': '',
        '58713': '',
        '58587': '',
        '58548': '',
        '58351': '',
        '58547': '',
        '58443': '',
        '58460': '',
        '58636': '',
        '58585': '',
        '58625': '',
        '58694': '',
        '58428': '',
        '58640': '',
        '58628': '',
        '58612': '',
        '58446': '',
        '58468': '',
        '58410': '',
        '58508': '',
        '58594': '',
        '58483': '',
        '58544': '',
        '58495': '',
        '58450': '',
        '58643': '',
        '58486': '',
        '58406': '',
        '58447': '',
        '58669': '',
        '58415': '',
        '58444': '',
        '58549': '',
        '58494': '',
        '58409': '',
        '58658': '',
        '58557': '',
        '58602': '',
        '58559': '',
        '58610': '',
        '58513': '',
        '58500': '',
        '58378': '',
        '58680': '',
        '58352': '',
        '58383': '',
        '58454': '',
        '58671': '',
        '58668': '',
        '58452': '',
        '58627': '',
        '58400': '',
        '58455': '',
        '58416': '',
        '58552': '',
        '58614': '',
        '58582': '',
        '58534': '',
        '58701': '',
        '58349': '',
        '58491': '',
        '58467': '',
        '58365': '',
        '58598': '',
        '58425': '',
        '58462': '',
        '58420': '',
        '58661': '',
        '58615': '',
        '58648': '',
        '58470': '',
        '58377': '',
        '58520': '',
        '58646': '',
        '58600': '',
        '58431': '',
        '58715': '',
        '58524': '',
        '58439': '',
        '58566': '',
        '58477': '',
        '58642': '',
        '58437': '',
        '58411': '',
        '58451': '',
        '58395': '',
        '58369': '',
        '58706': '',
        '58705': '',
        '58379': '',
        '58567': '',
        '58373': '',
        '58448': '',
        '58659': '',
        '58434': '',
        '58679': '',
        '58432': '',
        '58689': '',
        '58591': '',
        '58682': ''
    }
    for index in content:
        try:
            t1 = dict_data[str(ord(index))]
            print(t1, end="")
        except KeyError:
            t1 = index
            print(t1, end="")

The final run prints every chapter, decoded.

 

posted @ 2024-05-23 22:31  RChow