
Getting a web page as Markdown


# Fetch the page source and convert it to Markdown
import re

import html2text
import requests


def preprocess_html(html):
    # Remove <img> tags that have no src attribute
    processed_html = re.sub(r'<img(?![^>]*\ssrc=)[^>]*>', '', html)
    return processed_html


page_url = 'https://www.ysxiao.cn/c/202212/57443.html'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36'
}


def requests_page(url):
    # Download the page and return its HTML as a str
    fp = requests.get(url=url, headers=headers, timeout=10)
    fp.raise_for_status()  # fail early on HTTP error responses
    fp.encoding = 'utf-8'  # decode the body as UTF-8
    return fp.text


# fp.text is already a str, so no bytes decoding step is needed
original_format = preprocess_html(requests_page(page_url))
markdown = html2text.html2text(original_format)
print(markdown)
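
To see what the preprocess_html step actually removes, the regular expression can be tried offline on a small HTML string. This is a minimal sketch using only the standard library; the sample markup is made up for illustration and is not taken from the real page.

```python
import re


def preprocess_html(html):
    # Same pattern as above: drop <img> tags that have no src attribute
    return re.sub(r'<img(?![^>]*\ssrc=)[^>]*>', '', html)


# Hypothetical sample input, not taken from the real page
sample = '<p>a<img alt="x"><img src="pic.png" alt="y">b<img data-src="lazy.png"></p>'
cleaned = preprocess_html(sample)
print(cleaned)
# → <p>a<img src="pic.png" alt="y">b</p>
```

Note that a lazy-loaded image carrying only data-src is also removed, because the pattern requires whitespace directly before src=. If such images should survive, the pattern would need to be adjusted.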

　　


Author: 布都御魂

Link: https://www.cnblogs.com/wolvies/p/18451333

Copyright notice: This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 2.5 China Mainland license.

posted @ 2024-10-08 11:30 布都御魂
