microsoft/markitdown
https://github.com/microsoft/markitdown
MarkItDown is a utility for converting various files to Markdown (e.g., for indexing, text analysis, etc.). It supports:
- PowerPoint
- Word
- Excel
- Images (EXIF metadata and OCR)
- Audio (EXIF metadata and speech transcription)
- HTML
- Text-based formats (CSV, JSON, XML)
- ZIP files (iterates over contents)
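To make two of the list entries concrete, here is a minimal standard-library sketch of what "CSV to Markdown" and "iterate over ZIP contents" can look like. It is not MarkItDown's implementation; the function names `csv_to_markdown` and `zip_to_markdown` are hypothetical, and the sketch assumes UTF-8 CSV members whose first row is the header.

```python
import csv
import io
import zipfile


def csv_to_markdown(csv_text: str) -> str:
    """Render CSV text as a Markdown pipe table (first row is the header)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ""
    header, body = rows[0], rows[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in body:
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)


def zip_to_markdown(zip_bytes: bytes) -> str:
    """Walk a ZIP archive and emit one Markdown section per CSV member."""
    sections = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            # Assumption: only .csv members are handled in this sketch.
            if name.lower().endswith(".csv"):
                text = archive.read(name).decode("utf-8")
                sections.append(f"## {name}\n\n{csv_to_markdown(text)}")
    return "\n\n".join(sections)
```

A real converter would dispatch each archive member to a format-specific handler instead of skipping everything that is not CSV.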
To install MarkItDown, use pip:
pip install markitdown
Alternatively, you can install it from source:
pip install -e .
https://www.sohu.com/a/838561544_121798711
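Microsoft recently released MarkItDown on GitHub, an open-source Python library that opens up new possibilities for text processing. MarkItDown converts many document formats to Markdown, including Office documents (Word, Excel, PowerPoint), PDF, images, audio, HTML, and text-based formats such as CSV, JSON, and XML, which is a real convenience for developers and users alike.
Markdown, a lightweight markup language, has become popular for its simplicity and ease of use; it is a convenient format for editing, note-taking, and publishing. MarkItDown simplifies getting content into that form: a single step turns a complex document into clean Markdown, which in turn makes the text easier to index, search, and analyze.
Notably, MarkItDown is more than a file converter. It also integrates with large language models (such as OpenAI's GPT-4) so that images can be handled more intelligently. For example, when configured with a model client, it can turn a description of an image's content into text, further enriching the converted document. The following short code shows how to use MarkItDown to convert an image into a text description: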
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()  # Initialize the OpenAI client
md = MarkItDown(mlm_client=client, mlm_model="gpt-4")  # Create the MarkItDown object and configure the model
result = md.convert("example.jpg")  # Convert the image to text content
print(result.text_content)  # Print the text content
markdowner
https://github.com/supermemoryai/markdowner
A fast tool to convert any website into LLM-ready markdown data.
I'm building an AI app called Supermemory (https://git.new/memory), where users can store website content and then query it using AI. One thing I noticed: when data is structured and predictable (in Markdown format), the LLM responses are much better.
There are other solutions available for this - https://r.jina.ai, https://firecrawl.dev, etc. But they are either:
- too expensive / proprietary,
- too limited,
- or very difficult to deploy.
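To illustrate what "website to LLM-ready markdown" means in practice, here is a minimal sketch built only on Python's standard library. It is not how markdowner works internally; the class name `MarkdownExtractor`, the helper `html_to_markdown`, and the choice of handled tags (headings, paragraphs, links, list items) are assumptions made for this example.

```python
import re
from html.parser import HTMLParser


class MarkdownExtractor(HTMLParser):
    """Tiny HTML-to-Markdown converter: headings, paragraphs, links, list items."""

    BLOCKS = {"h1": "# ", "h2": "## ", "h3": "### ", "p": "", "li": "- "}
    SKIPPED = {"script", "style", "nav"}  # page chrome an LLM does not need

    def __init__(self):
        super().__init__()
        self.blocks = []     # finished Markdown blocks
        self.current = []    # text pieces of the block being built
        self.prefix = ""     # Markdown prefix for the current block
        self.href = None     # target of the link being read, if any
        self.skip_depth = 0  # > 0 while inside a skipped element

    def flush(self):
        """Close the block being built and store it if it has any text."""
        text = "".join(self.current).strip()
        if text:
            self.blocks.append(self.prefix + text)
        self.current, self.prefix = [], ""

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        if self.skip_depth:
            return
        if tag in self.BLOCKS:
            self.flush()
            self.prefix = self.BLOCKS[tag]
        elif tag == "a":
            self.href = dict(attrs).get("href")
            if self.href:
                self.current.append("[")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth:
            return
        if tag == "a":
            if self.href:
                self.current.append(f"]({self.href})")
            self.href = None
        elif tag in self.BLOCKS:
            self.flush()

    def handle_data(self, data):
        if not self.skip_depth:
            # Collapse runs of whitespace but keep word boundaries.
            self.current.append(re.sub(r"\s+", " ", data))


def html_to_markdown(html: str) -> str:
    parser = MarkdownExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()
    return "\n\n".join(parser.blocks)
```

Dropping `script`, `style`, and `nav` content is the part that matters most for LLM input: it keeps the output structured and predictable, which is the property the author describes above.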
firecrawl
https://github.com/mendableai/firecrawl
Firecrawl
Empower your AI apps with clean data from any website. Featuring advanced scraping, crawling, and data extraction capabilities.
This repository is in development, and we’re still integrating custom modules into the mono repo. It's not fully ready for self-hosted deployment yet, but you can run it locally.
Firecrawl is an API service that takes a URL, crawls it, and converts it into clean markdown or structured data. We crawl all accessible subpages and give you clean data for each. No sitemap required. Check out our documentation.
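The "no sitemap required" idea can be sketched as a breadth-first crawl that discovers subpages by following links. This is an illustration only, not Firecrawl's implementation or API: `LinkCollector`, `crawl`, and the `fetch` callable (which maps a URL to HTML text, or `None` when the page is unavailable) are hypothetical names, and the sketch assumes that "accessible subpages" means pages on the same host.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch, max_pages=50):
    """Breadth-first crawl of same-host pages reachable from start_url.

    Returns a dict of {url: html} in discovery order.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # unreachable page: skip it, keep crawling
        pages[url] = html
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            # Resolve relative links and drop #fragments before deduplicating.
            target, _fragment = urldefrag(urljoin(url, href))
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return pages
```

Passing `fetch` in as a parameter keeps the sketch runnable offline (for example with a dict's `.get` method); a real crawler would plug in an HTTP client there and add politeness controls such as rate limiting and robots.txt handling.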