Converting Markdown to PDF: a summary of methods (used to export the DL500 interview-prep notes to PDF; targets the GitHub project "DeepLearning-500-questions")

Summary and usage

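DeepLearning-500-questions_pdf-html版本_20241023: the generated PDF and HTML files, together with the code, are all in this archive.

File shared via Baidu Netdisk: DeepLearning-500-questions_pdf-html...
Link: https://pan.baidu.com/s/1D8pHj62pOyKYUjjM4KJN9w?pwd=to4e
Extraction code: to4e

1. (Solved) The following approach solves the problem almost perfectly. The one exception is that some images are rendered rather large, which looks poor when the image quality is low. Formulas, formatting, bold text, and line breaks come out essentially right.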

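a. Convert Markdown to HTML with the VS Code plugin Markdown All in One: press Ctrl+Shift+P, then run ">Markdown All in One: batch print documents to HTML" and select a folder. This is a batch operation and handles many files across nested folders.

　　a.1 First fix the formatting with a script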

import os

def replace_strings_in_file(file_path, replacements):
    """Replace the given strings in a file and record where each replacement happened."""
    modified = False
    occurrences = []

    # Read the file contents
    with open(file_path, 'r', encoding='utf-8') as file:
        lines = file.readlines()

    # Apply the replacements line by line
    new_lines = []
    for line_num, line in enumerate(lines, 1):
        new_line = line
        for search_string, replace_string in replacements.items():
            index = new_line.find(search_string)
            while index != -1:
                # Record the position (the column refers to the line as modified so far)
                occurrences.append((line_num, index, search_string, replace_string))
                # Replace this occurrence
                new_line = new_line[:index] + replace_string + new_line[index + len(search_string):]
                # Look for the next occurrence, starting after the inserted text
                index = new_line.find(search_string, index + len(replace_string))
        # Keep the (possibly modified) line
        if new_line != line:
            modified = True
        new_lines.append(new_line)

    # Rewrite the file only if something changed
    if modified:
        with open(file_path, 'w', encoding='utf-8') as file:
            file.writelines(new_lines)

    return occurrences

def replace_strings_in_directory(directory, replacements):
    """Walk the folder recursively and apply the replacements to every Markdown file."""
    all_occurrences = {}

    # Walk the folder and its subfolders
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith('.md'):
                file_path = os.path.join(root, file)
                occurrences = replace_strings_in_file(file_path, replacements)
                if occurrences:
                    all_occurrences[file_path] = occurrences

    return all_occurrences

def print_replacement_occurrences(occurrences):
    """Print every position that was modified."""
    for file_path, positions in occurrences.items():
        print(f"\nFile: {file_path}")
        for line_num, col_num, old_string, new_string in positions:
            print(f"  Line {line_num}, Column {col_num}: '{old_string}' -> '{new_string}'")

if __name__ == "__main__":
    # Folder to process
    directory = input("Enter the folder to search: ")

    # Replacement rules
    replacements = {
        r'\begin{eqnarray}': r'\begin{equation}\begin{aligned}',
        r'\end{eqnarray}': r'\end{aligned}\end{equation}',
        '`$': '$',
        '$`': '$'
    }

    # Run the replacements and collect the modified positions
    occurrences = replace_strings_in_directory(directory, replacements)

    # Report every replacement
    if occurrences:
        print_replacement_occurrences(occurrences)
    else:
        print("No replacements made.")


This script finds where, and how many times, a string occurs:

import os

def find_string_in_file(file_path, search_string):
    """Find every position at which a string occurs in the given file."""
    occurrences = []
    with open(file_path, 'r', encoding='utf-8') as file:
        lines = file.readlines()
        for line_num, line in enumerate(lines, 1):
            # Find the first occurrence in this line
            index = line.find(search_string)
            while index != -1:
                occurrences.append((line_num, index))
                # Look for further occurrences
                index = line.find(search_string, index + 1)
    return occurrences

def find_string_in_directory(directory, search_string):
    """Walk the folder recursively and find the string in every Markdown file."""
    string_occurrences = {}

    # Walk the folder and its subfolders
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith('.md'):
                file_path = os.path.join(root, file)
                occurrences = find_string_in_file(file_path, search_string)
                if occurrences:
                    string_occurrences[file_path] = occurrences

    return string_occurrences

def print_string_occurrences(occurrences, search_string):
    """Print the positions and count of the string for each file."""
    for file_path, positions in occurrences.items():
        print(f"\nFile: {file_path}")
        print(f"'{search_string}' found {len(positions)} time(s):")
        for line_num, col_num in positions:
            print(f"  Line {line_num}, Column {col_num}")

if __name__ == "__main__":
    directory = input("Enter the folder to search: ")
    search_string = input("Enter the string to find: ")

    occurrences = find_string_in_directory(directory, search_string)
    if occurrences:
        print_string_occurrences(occurrences, search_string)
    else:
        print(f"No occurrences of '{search_string}' found.")


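b. Print the HTML files to PDF from the browser (Microsoft Print to PDF), driven by a script. This is a batch operation and handles many files across nested folders. (You can also print each file by hand with the browser's Microsoft Print to PDF, but that is tedious.)

b.1 The chromedriver.exe version must match the installed Chrome version.

b.2 Tune the time.sleep waits to your situation: too long and the run is slow, too short and it may fail. If a run fails once, for example with a Selenium error about the DevTools connection being lost, simply run it again and see whether the problem persists.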

import os
import json
import time
from pathlib import Path
from selenium import webdriver

source_folder = r'C:/Users/chenguanbin/OneDrive - hust.edu.cn/_工作/八股文/DL500-html'  # change to your HTML folder
output_folder = r'C:/Users/chenguanbin/OneDrive - hust.edu.cn/_工作/八股文/DL500-html-to-microsoft-pdf'    # change to your PDF output folder

settings = {
    "recentDestinations": [{
        "id": "Save as PDF",
        "origin": "local",
        "account": ""
    }],
    "selectedDestinationId": "Save as PDF",
    "version": 2,
    "isHeaderFooterEnabled": False,
    "isLandscapeEnabled": True,
    "isCssBackgroundEnabled": True,
    "mediaSize": {
        "height_microns": 297000,
        "name": "ISO_A4",
        "width_microns": 210000,
        "custom_display_name": "A4 210 x 297 mm"
    },
}

def make_chrome_options(pdf_output_dir):
    """Build fresh options per file so arguments and prefs do not pile up across iterations."""
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--enable-print-browser')
    chrome_options.add_argument('--kiosk-printing')
    # chrome_options.add_argument('--headless')
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-extensions')
    prefs = {
        'printing.print_preview_sticky_settings.appState': json.dumps(settings),
        'savefile.default_directory': os.path.abspath(pdf_output_dir)  # PDFs are saved here
    }
    chrome_options.add_experimental_option('prefs', prefs)
    return chrome_options

def print_html_files(source_folder, output_folder):
    for dirpath, _, filenames in os.walk(source_folder):
        for filename in filenames:
            if not filename.endswith('.html') or filename == 'readme.html':
                continue

            html_path = os.path.join(dirpath, filename)
            # Build the output folder, mirroring the source folder structure
            relative_path = os.path.relpath(dirpath, source_folder)
            pdf_output_dir = os.path.join(output_folder, relative_path)
            os.makedirs(pdf_output_dir, exist_ok=True)
            pdf_name = f"{os.path.splitext(filename)[0]}.pdf"

            driver = webdriver.Chrome(options=make_chrome_options(pdf_output_dir))
            try:
                # as_uri() builds the file:/// URL and escapes spaces and non-ASCII characters
                driver.get(Path(html_path).resolve().as_uri())
                driver.maximize_window()
                time.sleep(10)  # wait for the page to load; err on the large side
                # The page title is set so that the saved file gets the intended name
                driver.execute_script('document.title = arguments[0]; window.print();', pdf_name)
                time.sleep(10)  # wait for printing to finish; err on the large side
            finally:
                driver.quit()  # also shuts down the chromedriver process

print_html_files(source_folder, output_folder)
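As a possible alternative to fixed waits, the two ideas in b.2 (wait long enough, and rerun on failure) can be sketched with the standard library alone. The helpers `wait_for_file` and `retry` are hypothetical names, not part of the script above, and the sketch assumes the printed PDF appears at a path you can predict (for example the output folder joined with the page title).

```python
import os
import time

def wait_for_file(path, timeout=60.0, poll_interval=0.5):
    """Return True once path exists and its size has stopped changing, False on timeout."""
    deadline = time.monotonic() + timeout
    last_size = -1
    while time.monotonic() < deadline:
        if os.path.exists(path):
            size = os.path.getsize(path)
            # Two identical non-zero sizes in a row suggest the write has finished
            if size > 0 and size == last_size:
                return True
            last_size = size
        time.sleep(poll_interval)
    return False

def retry(action, attempts=3, delay=2.0):
    """Call action() until it succeeds; re-raise the last error after `attempts` tries."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(delay)
```

In the batch script, the second `time.sleep(10)` could in principle be replaced by a call to `wait_for_file` on the expected PDF path, and the per-file work could be wrapped in `retry` so that a single lost DevTools connection does not abort the whole run.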


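2. The VS Code plugin Markdown Preview Enhanced. The output formatting is correct, but it supports neither batch processing nor command-line use. (You can convert the files by hand one at a time, but that is tedious.)

3. pandoc (with --pdf-engine=xelatex or typst) and markdown-pdf are worth a try, but there is no guarantee that formulas and images come out right or that the run finishes without errors.

4. Converting all the HTML files to PDF with a PDF reader. The results are sometimes poor.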

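Requirement

Convert Markdown files to PDF.

My situation:

1. I want to convert several interview-prep documents to PDF. They come from the GitHub project scutan90/DeepLearning-500-questions, whose description reads: "Deep Learning 500 Questions explains frequently asked topics in probability, linear algebra, machine learning, deep learning, computer vision and more in question-and-answer form, to help the author and any readers who need it. The book has 18 chapters and over 500,000 characters. Given the author's limited expertise, readers are kindly asked to point out any mistakes. To be continued............ For collaboration, contact scutjy2015@163.com. All rights reserved; infringement will be pursued. Tan 2018.06"

2. Some GitHub projects can already export Markdown and PDF, for example OI-Wiki.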

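Problems encountered

1. Markdown comes in several flavors with different rules. This project appears to use GitHub's own flavor.

2. Formulas are sometimes not recognized.

3. Line breaks and image placement are sometimes handled poorly.

4. Chinese and English fonts, and bold text, sometimes cannot be handled.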

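Directions to explore

1. How GitHub itself handles printing. Also look at projects such as OI-Wiki to see how they do it, since they have a real need to print to PDF.

2. Existing GitHub projects, Python libraries, PDF readers such as Foxit, and Markdown editors such as Typora.

3. Search VS Code plugins and GitHub projects with keywords such as multi, multiply, batch.

Existing methods and online discussions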

How Can I Convert Github-Flavored Markdown To A PDF - Super User

http://www.markdowntopdf.com

Grip

joeyespo/grip: preview GitHub README.md files locally before committing them.

pip install grip  
grip your_markdown.md

grip your_markdown.md --export your_markdown.html

Or: Alternatively you can download (free!) Atom (atom.io), open your file in Atom, use control + shift+ M to view it in preview, save as html, then open the html in your Chrome browser and save as pdf.

Grip is an excellent tool that works right after installation, but it does not handle formulas well.
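Grip exports one file per invocation, so a batch run over a nested folder needs a small wrapper. The sketch below only builds the `grip <file>.md --export <file>.html` argument lists shown above; `build_grip_commands` is a hypothetical helper, and writing each HTML file next to its Markdown source is an assumption made for the example.

```python
import os

def build_grip_commands(source_folder):
    """Return one `grip <md> --export <html>` argument list per Markdown file under source_folder."""
    commands = []
    for root, _, files in os.walk(source_folder):
        for name in sorted(files):
            if name.endswith('.md'):
                md_path = os.path.join(root, name)
                # Assumption: the HTML file is written next to its Markdown source
                html_path = os.path.splitext(md_path)[0] + '.html'
                commands.append(['grip', md_path, '--export', html_path])
    return commands
```

Each returned list can be passed to `subprocess.run(command, check=True)` once grip is installed. Building the commands separately from running them makes it easy to print and inspect them first.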

VS Code plugin Markdown Preview Enhanced

The formatting comes out well.

However, it can only be operated by hand and cannot process files in batch.

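Markdown PDF (VS Code plugin / tool)

Out of the box, converting to PDF or HTML gives poor results:

the formulas are not converted when the Markdown is turned into PDF.

The fix comes from the CSDN blog post "采用VScode中Markdown PDF无法正确输出包含公式的pdf解决方案_markdown pdf 数学公式无法识别-CSDN博客":

Fix for Markdown PDF failing to output PDFs that contain formulas
After installing the plugin, locate the following file:

C://Users/<username>/.vscode/extensions/yzane.markdown-pdf-XXX/template/template.html
Then append the following two lines of JavaScript to the end of that file.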

<script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/x-mathjax-config"> MathJax.Hub.Config({ tex2jax: {inlineMath: [['$', '$']]}, messageStyle: "none" });</script>
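The manual edit can also be scripted. This is a minimal sketch, assuming you pass in the real path of template.html yourself (the XXX version part of the path is not known here); `patch_template` is a hypothetical helper that appends the two lines above and does nothing if the file already references MathJax.js.

```python
from pathlib import Path

# The two lines quoted above, copied verbatim
MATHJAX_SNIPPET = '''<script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/x-mathjax-config"> MathJax.Hub.Config({ tex2jax: {inlineMath: [['$', '$']]}, messageStyle: "none" });</script>
'''

def patch_template(template_path):
    """Append the MathJax snippet unless it is already present. Return True if the file changed."""
    path = Path(template_path)
    html = path.read_text(encoding='utf-8')
    if 'MathJax.js' in html:
        return False
    if not html.endswith('\n'):
        html += '\n'
    path.write_text(html + MATHJAX_SNIPPET, encoding='utf-8')
    return True
```

Because the extension folder name contains the version number, the patch probably has to be reapplied after the plugin updates. Running the function twice is harmless.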

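VS Code plugin Markdown All in One

Ctrl+Shift+P: Markdown: Print current document to HTML, or use the right-click menu

It supports batch processing.

The result is a Markdown-to-HTML conversion that mirrors the full nested folder structure one-to-one. Excellent!

However, it can only produce HTML.

Formulas and images are rendered accurately for the most part.

There are a very small number of problems:

Search the HTML files for ParseError to find them.

1. eqnarray (the LaTeX environment) is not recognized

2. ` $U,W,b$ ` is not rendered as an italic formula

3. \times in parts of a few documents (I left this alone; it should not be hard to fix, but the output looked acceptable)

4. One problem that several VS Code plugins all run into (I chose not to handle it)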
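Searching the generated HTML for ParseError can be automated in a few lines. This is a rough sketch rather than the script from the top of the post (which only scans .md files); `count_parse_errors` is a hypothetical helper.

```python
from pathlib import Path

def count_parse_errors(html_folder, marker='ParseError'):
    """Map each .html file under html_folder to the number of times marker occurs in it."""
    counts = {}
    for html_file in sorted(Path(html_folder).rglob('*.html')):
        text = html_file.read_text(encoding='utf-8', errors='replace')
        hits = text.count(marker)
        if hits:
            counts[str(html_file)] = hits
    return counts
```

Files with no hits are left out of the result, so an empty dictionary means no file contains the marker.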

Solutions:

1. KaTeX parse error: No such environment: eqnarray at position 7: \begin{eqnarray} \label{eq} - CSDN文库

Change \begin{eqnarray} to \begin{equation}\begin{aligned}

Change \end{eqnarray} to \end{aligned}\end{equation}

2. Change `$ to $, and $` to $
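The effect of these rules is easy to check on a string before touching any files. The snippet below is only an illustration; `fix_katex_markup` and the sample text are made up for the example.

```python
# Hypothetical helper that applies the rewrite rules to a string.
REPLACEMENTS = {
    r'\begin{eqnarray}': r'\begin{equation}\begin{aligned}',
    r'\end{eqnarray}': r'\end{aligned}\end{equation}',
    '`$': '$',
    '$`': '$',
}

def fix_katex_markup(text):
    """Apply each replacement rule in order and return the new text."""
    for old, new in REPLACEMENTS.items():
        text = text.replace(old, new)
    return text

# Made-up sample: one backtick-wrapped formula and one eqnarray block
sample = "Weights `$U,W,b$` are shared.\n\\begin{eqnarray}\na &=& b\n\\end{eqnarray}\n"
fixed = fix_katex_markup(sample)
print(fixed)
```

Note that renaming the environment leaves any eqnarray-style `&=&` alignment markers in place. `aligned` normally uses a single `&` per row, so the spacing around `=` in such rows may still look slightly off.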

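Prompts given to ChatGPT:

1. Write code that scans every Markdown file in a folder, which may contain nested subfolders, and reports the positions and the number of occurrences of a given string, for example "eqnarray", "`$", or "$`".

2. Write code that scans every Markdown file in a folder, which may contain nested subfolders. In each file, change every "\begin{eqnarray}" to "\begin{equation}\begin{aligned}", every "\end{eqnarray}" to "\end{aligned}\end{equation}", every "`$" to "$", and every "$`" to "$". Print the position of every change.

The code is at the top of this post. Its output: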

File: C:\Users\chenguanbin\OneDrive - hust.edu.cn\_工作\八股文\DeepLearning-500-questions\ch02_机器学习基础\第二章_机器学习基础.md
  Line 923, Column 0: '\begin{eqnarray}' -> '\begin{equation}\begin{aligned}'
  Line 930, Column 0: '\end{eqnarray}' -> '\end{aligned}\end{equation}'
  Line 948, Column 0: '\begin{eqnarray}' -> '\begin{equation}\begin{aligned}'
  Line 952, Column 0: '\end{eqnarray}' -> '\end{aligned}\end{equation}'
  Line 956, Column 0: '\begin{eqnarray}' -> '\begin{equation}\begin{aligned}'
  Line 961, Column 0: '\end{eqnarray}' -> '\end{aligned}\end{equation}'

File: C:\Users\chenguanbin\OneDrive - hust.edu.cn\_工作\八股文\DeepLearning-500-questions\ch03_深度学习基础\第三章_深度学习基础.md
  Line 612, Column 1: '\begin{eqnarray}' -> '\begin{equation}\begin{aligned}'
  Line 612, Column 534: '\end{eqnarray}' -> '\end{aligned}\end{equation}'
  Line 668, Column 1: '\begin{eqnarray}' -> '\begin{equation}\begin{aligned}'
  Line 668, Column 226: '\end{eqnarray}' -> '\end{aligned}\end{equation}'

File: C:\Users\chenguanbin\OneDrive - hust.edu.cn\_工作\八股文\DeepLearning-500-questions\ch06_循环神经网络(RNN)\第六章_循环神经网络(RNN).md
  Line 17, Column 31: '`$' -> '$'
  Line 17, Column 33: '$`' -> '$'
  Line 43, Column 16: '`$' -> '$'
  Line 43, Column 28: '`$' -> '$'
  Line 43, Column 40: '`$' -> '$'
  Line 43, Column 22: '$`' -> '$'
  Line 43, Column 31: '$`' -> '$'
  Line 43, Column 43: '$`' -> '$'

File: C:\Users\chenguanbin\OneDrive - hust.edu.cn\_工作\八股文\DeepLearning-500-questions\English version\ch03_DeepLearningFoundation\ChapterIII_DeepLearningFoundation.md
  Line 581, Column 1: '\begin{eqnarray}' -> '\begin{equation}\begin{aligned}'
  Line 581, Column 540: '\end{eqnarray}' -> '\end{aligned}\end{equation}'

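Markdown All in One also cannot be driven from the command line, but batch-processing every file in a folder is already good enough.

A few remaining problems could not be solved by any of the tools I tried: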

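HTML to PDF

1. wkhtmltopdf: some formulas are not recognized, and those that are look a little ugly.

Some images are not picked up either.

2. Foxit: supports batch processing by selecting a folder.

Some images disappear:

HTML

Foxit PDF

Tables also look a little ugly.

3. Browser with Microsoft Print to PDF

The results are quite good.

The only drawback is that some images are rendered rather large, and low-quality images look blurry once enlarged.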

Batch processing with Python

Based on this post: python之批量打印网页为pdf文件（一） - NewJune - 博客园

The code is at the top of this post.

Check whether any PDF failed to be generated:

import os

def collect_stems(folder, extension):
    """Return the extension-less relative paths of all files under folder that end with extension."""
    stems = set()
    for root, _, files in os.walk(folder):
        for file in files:
            if file.endswith(extension):
                relative_path = os.path.relpath(root, folder)
                stems.add(os.path.join(relative_path, os.path.splitext(file)[0]))
    return stems

def find_missing_pdfs(folder_a, folder_b):
    # HTML files in folder A and PDF files in folder B, keyed by relative path without extension
    html_stems = collect_stems(folder_a, '.html')
    pdf_stems = collect_stems(folder_b, '.pdf')

    # An HTML file with no PDF at the same relative path is missing
    return sorted(stem + '.pdf' for stem in html_stems - pdf_stems)

folder_a = 'C:/Users/chenguanbin/OneDrive - hust.edu.cn/_工作/八股文/DL500-html'  # replace with the path of folder A (HTML)
folder_b = 'C:/Users/chenguanbin/OneDrive - hust.edu.cn/_工作/八股文/DL500-html-to-microsoft-pdf'    # replace with the path of folder B (PDF)
missing_pdf_files = find_missing_pdfs(folder_a, folder_b)

print("Missing PDF files in folder B:")
for data in missing_pdf_files:
    print(data)

# Example result: 'ch01_数学基础\\第一章_数学基础.pdf', 'ch06_循环神经网络(RNN)\\第六章_循环神经网络(RNN).pdf'

Rename Chinese numerals in the file names to Arabic numerals so that the files sort properly:

import os
import re

# Chinese digits and the multipliers that can follow them
digits = {
    '零': 0, '一': 1, '二': 2, '三': 3, '四': 4,
    '五': 5, '六': 6, '七': 7, '八': 8, '九': 9
}
units = {'十': 10, '百': 100, '千': 1000}

# Convert a Chinese numeral to an integer
def convert_chinese_to_arabic(chinese_number):
    total = 0    # value of the completed 万 sections
    section = 0  # value accumulated within the current section
    digit = 0    # digit still waiting for its unit
    for char in chinese_number:
        if char in digits:
            digit = digits[char]
        elif char in units:
            # A bare unit, such as a leading 十, means 1 x unit
            section += (digit or 1) * units[char]
            digit = 0
        elif char == '万':
            total += (section + digit) * 10000
            section = 0
            digit = 0
    return total + section + digit

def rename_pdfs(folder):
    # Only files directly inside folder are renamed; subfolders are not visited
    for filename in os.listdir(folder):
        if filename.endswith('.pdf'):
            match = re.match(r'第([零一二三四五六七八九十]+)章(.+)\.pdf', filename)
            if match:
                chinese_number = match.group(1)
                arabic_number = convert_chinese_to_arabic(chinese_number)
                new_filename = f'第{arabic_number}章{match.group(2)}.pdf'
                os.rename(os.path.join(folder, filename), os.path.join(folder, new_filename))
                print(f'Renamed: {filename} to {new_filename}')

folder_path = r'C:/Users/chenguanbin/OneDrive - hust.edu.cn/_工作/八股文/DL500-html-to-microsoft-pdf'  # replace with your folder path
rename_pdfs(folder_path)
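One caveat on sorting: Arabic numerals still sort as text in a plain string sort, so 第10章 lands before 第2章 unless the viewer sorts numerically. If you list the files from Python, a natural-sort key avoids this. The sketch below is an optional extra, not part of the rename script; `natural_key` and the chapter-10 file name are made up for the example.

```python
import re

def natural_key(name):
    """Split name into text and integer chunks so that 第2章 sorts before 第10章."""
    parts = re.split(r'(\d+)', name)
    # With a capturing group, the odd positions of the split hold the digit runs
    return [int(part) if i % 2 else part for i, part in enumerate(parts)]

# The chapter-10 name is hypothetical
names = ['第10章_示例.pdf', '第2章_机器学习基础.pdf', '第1章_数学基础.pdf']
plain_order = sorted(names)
natural_order = sorted(names, key=natural_key)
print(plain_order)
print(natural_order)
```

Zero-padding the chapter number when renaming (for example 第01章) would be another way to get the same ordering everywhere.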

All done!

The markdown-pdf project

alanshaw/markdown-pdf: Markdown to PDF converter

Use Node version 11.10 (newer versions may fail with errors):

npm install markdown-pdf --save

It supports batch processing.

This is the official example, "From multiple paths to multiple paths":

var markdownpdf = require("markdown-pdf")

var mdDocs = ["home.md", "about.md", "contact.md"]
  , pdfDocs = mdDocs.map(function (d) { return "out/" + d.replace(".md", ".pdf") })

markdownpdf().from(mdDocs).to(pdfDocs, function () {
  pdfDocs.forEach(function (d) { console.log("Created", d) })
})

node xxx.js

However, formulas are not rendered.

Others have asked for this as well: Is there a way to use MathJax? · Issue #155 · alanshaw/markdown-pdf

pandoc

1. pandoc --pdf-engine=xelatex

You may run into errors such as:

! Package amsmath Error: Erroneous nesting of equation structures; (amsmath) trying to recover with `aligned'.

See the amsmath package documentation for explanation. Type H <return> for immediate help. ...

l.584 \end{align*}

The fix is to delete the "$$" before and after the environment.
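For many files, removing the wrapping "$$" by hand gets tedious. A possible regex-based sketch follows; it assumes the offending blocks are `align` or `align*` environments wrapped directly in `$$ ... $$`, as in the error above, and `unwrap_align` is a hypothetical helper.

```python
import re

# Matches $$ ... $$ wrapped directly around an align or align* environment
WRAPPED_ALIGN = re.compile(
    r'\$\$\s*(\\begin\{(align\*?)\}.*?\\end\{\2\})\s*\$\$',
    flags=re.DOTALL,
)

def unwrap_align(markdown_text):
    """Drop the surrounding $$ from align/align* blocks and leave other math untouched."""
    return WRAPPED_ALIGN.sub(lambda match: match.group(1), markdown_text)
```

The pattern is deliberately narrow. Unusual layouts, such as a display formula that ends immediately before an unwrapped align block, could still be matched incorrectly, so review the diff before overwriting any file.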

Another example is the "Missing $ inserted" error:

Error producing PDF.
! Missing $ inserted.
<inserted text>
                $
l.290 鏄湡瀹炲€硷紝\$ \frac{\partial y_l}{\partial z_l}

markdown - Pandoc Error: ! missing $ inserted - Stack Overflow

2. pandoc --pdf-engine=typst

It sometimes has formatting problems too.

The problem I hit: bold and italic text do not work.

I tried

--ascii

--highlight

-V CJKmainfont:SourceHanSerifCN-Regular -V CJKoptions:BoldFont=SourceHanSansCN-Medium,ItalicFont=STKaiti

None of these helped.

Potential fix: the official documentation (I did not pursue this further; it is a bit of a hassle)

用 pandoc 让 Markdown 从 LaTeX 输出 pdf 文档 - 黄石的时空回环

中文没有加粗 | Typst 中文社区导航

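Other methods and software

1. Prince (used inside Markdown Preview Enhanced)

Poor results; formulas are not recognized.

None of these gave satisfactory results: formulas, for example, or certain Markdown constructs, are not recognized.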

Posted 2024-10-22 20:05 by congmingyige
