A not-so-detailed look at backing up cnblogs posts

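Take saving my collection 《短文新编》 as the example. First, scrape the page source of https://www.cnblogs.com/caijianhong/collections/18069?page={page} (the script below reads it from list.txt) and pull every post's URL and title out of it. As we all know, for a post at https://www.cnblogs.com/caijianhong/p/{blogid}, the post's owner can append .md to the URL to get its markdown, but: 1. the markdown has no title, so the title has to be carried over from the collection page using some arcane grep tricks; see the code, or search for grep regular expressions and awk substr. 2. the request needs the post owner's cookies; as far as I can tell only .CNBlogsCookie and .Cnblogs.AspNetCore.Cookies are required. After that, just fetch the pages with Python's requests library.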
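The script below starts from an already saved list.txt, so here is a rough sketch of the first step, saving the collection pages into that file. Several things here are my assumptions rather than facts from the post: that page numbers start at 1, that you look up the last page number yourself, that the collection page accepts the same headers as the post requests, and the helper names page_urls, save_listing and fetch_with_requests, which are made up for illustration.

```python
COLLECTION_URL = "https://www.cnblogs.com/caijianhong/collections/18069?page={page}"

# Placeholders: fill in your own cookie values and User-Agent.
HEADERS = {
    "cookie": ".CNBlogsCookie=<one string>;.Cnblogs.AspNetCore.Cookies=<another string>",
    "User-Agent": "<your browser's User-Agent>",
}


def page_urls(last_page):
    """Build the collection page URLs for pages 1..last_page (assumed numbering)."""
    return [COLLECTION_URL.format(page=p) for p in range(1, last_page + 1)]


def save_listing(fetch, last_page, out_path="list.txt"):
    """Fetch every collection page with `fetch` and append its source to out_path."""
    with open(out_path, "w", encoding="utf-8") as out:
        for url in page_urls(last_page):
            out.write(fetch(url))
            out.write("\n")


def fetch_with_requests(url):
    """Download one page with requests and return its text."""
    import requests  # imported here so the helpers above work without it

    res = requests.get(url, headers=HEADERS, timeout=30)
    res.raise_for_status()
    res.encoding = "utf-8"
    return res.text
```

Calling save_listing(fetch_with_requests, last_page=2) would write two pages of source into list.txt. Passing the fetch function in as an argument keeps the file-writing part easy to try out without touching the network.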

When necessary, call on any AI large language model to help you.

#!/usr/bin/env python3
import requests as rq
import subprocess as sp

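headers = {
    # Replace the placeholders with the values from your own logged-in browser session.
    "cookie": ".CNBlogsCookie=<one string>;.Cnblogs.AspNetCore.Cookies=<another string>",
    "User-Agent": "<check your own browser's User-Agent>",
}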


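def getContent(idx, url, title):
    # Appending ".md" to a post URL returns its markdown (owner's cookie required).
    res = rq.get("https://www.cnblogs.com/caijianhong/p/" + url + ".md", headers=headers, timeout=30)
    res.raise_for_status()  # fail on HTTP errors rather than saving an error page
    res.encoding = "utf-8"
    # "/" cannot appear in a file name, so swap it out of the title.
    with open(f"{idx}.{title.replace('/', '_')}.md", "w", encoding="utf-8") as file:
        print(f"# {title}", file=file)  # the markdown has no title, so prepend it
        print(res.text, file=file)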


def runcmd(cmd):
    return sp.check_output(cmd, shell=True).decode()


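cmds = {
    # list.txt is the saved source of the collection page.
    # Post ids: keep the 8-digit number from each entry link.
    "url": """grep -E '<a class="entrylistItemTitle" href="https://www.cnblogs.com/caijianhong/p/' list.txt | grep -oE '[0-9]{8}'""",
    # Titles: cut the text out of each heading line; the substr offsets depend on the page's indentation and markup.
    "title": """grep -E '<span role="heading" aria-level="2">' list.txt | awk '{print substr($0, 41, length($0)-48)}'""",
}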

urls = runcmd(cmds["url"]).strip().split("\n")
titles = runcmd(cmds["title"]).strip().split("\n")
for idx, (url, title) in enumerate(zip(urls, titles)):
    getContent(idx, url, title)
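If the hard-coded awk offsets ever stop lining up, the same extraction can probably be done in Python alone. The following is only a sketch under assumptions: it relies on the markup that the grep patterns above target, it assumes each title sits on one line and is closed by </span>, and it accepts post ids of any length instead of exactly 8 digits. The name extract_posts is made up for illustration.

```python
import html
import re

# The same anchors the grep commands look for, with the wanted part captured.
LINK_RE = re.compile(
    r'<a class="entrylistItemTitle" href="https://www\.cnblogs\.com/caijianhong/p/(\d+)'
)
TITLE_RE = re.compile(r'<span role="heading" aria-level="2">(.*?)</span>')


def extract_posts(page_source):
    """Return (post id, title) pairs in the order they appear in the page source."""
    ids = LINK_RE.findall(page_source)
    # Decode entities such as &amp; so the saved title reads naturally.
    titles = [html.unescape(t.strip()) for t in TITLE_RE.findall(page_source)]
    if len(ids) != len(titles):
        # A mismatch means the assumed markup no longer matches the page.
        raise ValueError(f"found {len(ids)} links but {len(titles)} titles")
    return list(zip(ids, titles))
```

With this, the loop could read the pairs from extract_posts(open("list.txt", encoding="utf-8").read()) instead of shelling out. Capturing the text between the tags avoids counting columns, and the length check reports a layout change instead of silently pairing the wrong title with a post.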

posted @ 2024-07-22 14:44 caijianhong
