langchain中的Document类

在Langchain-Chatchat的上传文档接口（upload_docs）中有个自定义的docs字段，用到了Document类。根据发现指的是from langchain.docstore.document import Document。本文简要对Document类进行介绍。

1.上传文档接口upload_docs

def upload_docs(
        file: List[UploadFile] = File(..., description="上传文件，支持多文件"),
        knowledge_base_name: str = Form(..., description="知识库名称", examples=["samples"]),
        override: bool = Form(False, description="覆盖已有文件"),
        to_vector_store: bool = Form(True, description="上传文件后是否进行向量化"),
        chunk_size: int = Form(CHUNK_SIZE, description="知识库中单段文本最大长度"),
        chunk_overlap: int = Form(OVERLAP_SIZE, description="知识库中相邻文本重合长度"),
        zh_title_enhance: bool = Form(ZH_TITLE_ENHANCE, description="是否开启中文标题加强"),
        docs: Json = Form({}, description="自定义的docs，需要转为json字符串",
                          examples=[{"test.txt": [Document(page_content="custom doc")]}]),
        not_refresh_vs_cache: bool = Form(False, description="暂不保存向量库（用于FAISS）"),
) -> BaseResponse:

这里的docs是Json数据类型，本质上可以理解为dict数据类型。pydantic 中的 Json 类用于表示包含 JSON 数据的字段。它可以接受任何合法的 JSON 数据，然后在验证时将其解析为 Python 字典。以下是一个使用 Json 类的简单示例：

from typing import List
from pydantic import BaseModel, Json

class MyModel(BaseModel):
    json_data: Json

# 实例化 MyModel 类
data = {'key1': 'value1', 'key2': [1, 2, 3]}
my_model_instance = MyModel(json_data=data)

# 输出实例
print(my_model_instance)

在这个例子中，定义了一个 MyModel 类，其中有一个字段 json_data，它的类型是 Json。然后创建一个包含 JSON 数据的字典 data，并用它实例化 MyModel 类。在输出实例时，Json 类会将传入的 JSON 数据解析为 Python 字典。请注意，Json 类并不关心具体的 JSON 数据结构，它接受任何合法的 JSON 数据。

2.Document类源码
该类的引用包为from langchain.docstore.document import Document。简单理解就是包括文本内容（page_content）、元数据（metadata）和类型（type）的类。源码如下所示：

class Document(Serializable):
    """Class for storing a piece of text and associated metadata."""

    page_content: str
    """String text."""
    metadata: dict = Field(default_factory=dict)
    """Arbitrary metadata about the page content (e.g., source, relationships to other
        documents, etc.).
    """
    type: Literal["Document"] = "Document"

    @classmethod
    def is_lc_serializable(cls) -> bool:
        """Return whether this class is serializable."""
        return True

    @classmethod
    def get_lc_namespace(cls) -> List[str]:
        """Get the namespace of the langchain object."""
        return ["langchain", "schema", "document"]

3.Document类例子
代码定义了一个 Document 类，该类继承自 Serializable，使用了 Python 的类型提示和注解。在 Document 类中，有 page_content、metadata、type 三个属性，并定义了一些方法。

下面实例化 Document 类，并输出实例的内容：

from typing import List, Literal
from langchain_core.load.serializable import Serializable
from pydantic import Field

class Document(Serializable):
    page_content: str
    metadata: dict = Field(default_factory=dict)
    type: Literal["Document"] = "Document"

    @classmethod
    def is_lc_serializable(cls) -> bool:
        return True

    @classmethod
    def get_lc_namespace(cls) -> List[str]:
        return ["langchain", "schema", "document"]

# 实例化 Document 类
custom_doc = Document(page_content="custom doc")

# 输出实例
print(custom_doc)

输出结果，如下所示：

page_content='custom doc' metadata=FieldInfo(annotation=NoneType, required=False, default_factory=dict)

在这个例子中，创建了一个名为 custom_doc 的 Document 类的实例，并通过 print(custom_doc) 将其输出。确保环境中已经安装了 pydantic 和langchain_core模块，可以使用 pip install pydantic langchain_core -i https://pypi.tuna.tsinghua.edu.cn/simple 进行安装。

参考文献：
[1] 文档加载器：https://python.langchain.com/docs/integrations/document_loaders/copypaste
[2] https://docs.pydantic.dev/latest/concepts/fields/
[3] https://github.com/chatchat-space/Langchain-Chatchat/blob/master/server/api.py

NLP工程化

1.本公众号以对话系统为中心，专注于Python/C++/CUDA、ML/DL/RL和NLP/KG/DS/LLM领域的技术分享。
2.本公众号Roadmap可查看飞书文档：https://z0yrmerhgi8.feishu.cn/wiki/Zpewwe2T2iCQfwkSyMOcgwdInhf

NLP工程化(公众号)

NLP工程化(星球号)

posted on 2024-01-22 00:28 扫地升阅读(1098) 评论(0) 收藏举报