03 LangChain Beginner's Guide: Implementing Efficient Data Retrieval from Scratch

https://python.langchain.com/v0.2/docs/tutorials/retrievers/

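In this guide we get familiar with LangChain's vector store and retriever abstractions. These abstractions support retrieving data from (vector) databases and other sources and integrating it into LLM workflows. They matter for applications that fetch data to reason over as part of model inference, as in retrieval-augmented generation, or RAG (see the RAG tutorial in the LangChain documentation).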

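Concepts

This guide focuses on retrieval of text data. It covers the following main concepts:

Documents: units of text with metadata
Vector stores: storage for embedded text
Retrievers: a standard interface for retrieval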

Setup

Jupyter Notebook

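This tutorial, like the others, is probably most convenient to run in a Jupyter notebook. See the Jupyter documentation for installation instructions.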

Installation

This tutorial requires the langchain, langchain-chroma, and langchain-openai packages.

pip install langchain langchain-chroma langchain-openai

For more details, see the LangChain installation guide.

LangSmith

Set the following environment variables to enable LangSmith tracing:

export LANGCHAIN_TRACING_V2="true"
export LANGCHAIN_API_KEY="..."

In a notebook, you can set them like this:

import getpass
import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass()

Documents

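LangChain implements a Document abstraction, which represents a unit of text and its associated metadata. It has two attributes:

page_content: a string holding the content
metadata: a dict containing arbitrary metadata.

The metadata attribute can capture information about the source of the document, its relationship to other documents, and other details. Note that a single Document object often represents a chunk of a larger document.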
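To illustrate how one larger text can end up as several document objects, here is a minimal sketch. `SimpleDocument` and `split_into_documents` are hypothetical stand-ins written for this example (they are not LangChain classes or functions), and the fixed word-count chunking is only an assumption chosen for simplicity.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in mirroring the two attributes described above."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_into_documents(text: str, source: str, words_per_chunk: int = 8) -> list:
    """Split a text into chunks of at most `words_per_chunk` words."""
    words = text.split()
    docs = []
    for start in range(0, len(words), words_per_chunk):
        chunk = " ".join(words[start:start + words_per_chunk])
        # Each chunk remembers where it came from via its metadata.
        docs.append(SimpleDocument(chunk, {"source": source, "start_word": start}))
    return docs


long_text = (
    "Dogs are great companions, known for their loyalty and friendliness. "
    "Cats are independent pets that often enjoy their own space."
)
chunks = split_into_documents(long_text, source="mammal-pets-doc")
for chunk in chunks:
    print(chunk.metadata, "->", chunk.page_content)
```

Every chunk carries the same source but a different start_word, which is the kind of relationship information the metadata dict is meant to hold.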

Let's generate some sample documents:

from langchain_core.documents import Document

documents = [
    Document(
        page_content="Dogs are great companions, known for their loyalty and friendliness.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Cats are independent pets that often enjoy their own space.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Goldfish are popular pets for beginners, requiring relatively simple care.",
        metadata={"source": "fish-pets-doc"},
    ),
    Document(
        page_content="Parrots are intelligent birds capable of mimicking human speech.",
        metadata={"source": "bird-pets-doc"},
    ),
    Document(
        page_content="Rabbits are social animals that need plenty of space to hop around.",
        metadata={"source": "mammal-pets-doc"},
    ),
]

API Reference: Document

Here we have generated five documents whose metadata indicates three distinct "sources".

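Vector stores

Vector search is a common way to store and search over unstructured data (such as unstructured text). The idea is to store numeric vectors associated with the text. Given a query, we embed it as a vector of the same dimension and use a vector similarity metric to identify related data in the store.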
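The idea can be sketched without any external service. The example below is only an illustration: the word-count "embedding" and the helper names (`tokenize`, `embed`, `search`) are assumptions made for this sketch, whereas a real embedding model produces dense vectors that capture meaning rather than exact word overlap.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())


def embed(text: str, vocabulary: list) -> list:
    """Toy embedding (illustrative assumption): one word count per vocabulary entry."""
    counts = Counter(tokenize(text))
    return [float(counts[word]) for word in vocabulary]


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


texts = [
    "Cats are independent pets that often enjoy their own space.",
    "Goldfish are popular pets for beginners, requiring relatively simple care.",
]
# "Store" step: embed every text once, with a shared dimension.
vocabulary = sorted({word for text in texts for word in tokenize(text)})
stored_vectors = [embed(text, vocabulary) for text in texts]


def search(query: str, k: int = 1) -> list:
    """Embed the query in the same space and return the k most similar texts."""
    query_vector = embed(query, vocabulary)
    scored = [
        (text, cosine_similarity(query_vector, vector))
        for text, vector in zip(texts, stored_vectors)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


top_hits = search("independent cats", k=1)
print(top_hits)
```

The query and the stored texts share one vocabulary, so their vectors have the same dimension, which is what makes the similarity comparison meaningful.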

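LangChain VectorStore objects define methods for adding text and Document objects to the store and for querying them with various similarity metrics. They are usually initialized with an embedding model, which determines how text is translated into numeric vectors.

LangChain includes a suite of integrations with different vector store technologies. Some vector stores are hosted by a provider (for example, various cloud providers) and require specific credentials to use; some (such as Postgres) run in separate infrastructure that can be operated locally or through a third party; others can run in memory for lightweight workloads. Here we demonstrate LangChain vector stores using Chroma, which includes an in-memory implementation.

To instantiate a vector store, we usually need to provide an embedding model that specifies how text should be converted into a numeric vector. Here we use OpenAI embeddings.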

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
vectorstore = Chroma.from_documents(
    documents,
    embedding=OpenAIEmbeddings(),
)

API Reference: OpenAIEmbeddings

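Calling .from_documents adds the documents to the vector store. VectorStore also implements methods for adding documents that can be called after the object is instantiated. Most implementations let you connect to an existing vector store, for example by providing a client, an index name, or other information. See the documentation of the specific integration for more detail.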

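Once we have instantiated a VectorStore that contains documents, we can query it. VectorStore includes methods for querying:

Synchronously and asynchronously;
By string query and by vector;
With and without returning similarity scores;
By similarity and by maximum marginal relevance (which balances similarity to the query against diversity in the retrieved results).

These methods generally return a list of Document objects.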
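Maximum marginal relevance is the least obvious of these options, so here is a minimal sketch of the greedy selection it is usually described with. This is an illustration under assumptions, not the code any vector store runs: the two-dimensional vectors are made up, and `lambda_mult` is simply the name this sketch uses for the weight that trades relevance against diversity.

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec: list, doc_vecs: list, k: int = 2, lambda_mult: float = 0.5) -> list:
    """Greedily pick k document indices, rewarding similarity to the query
    and penalizing similarity to documents that were already picked."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def score(i: int) -> float:
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
docs = [
    [1.0, 0.10],  # very close to the query
    [1.0, 0.12],  # near-duplicate of the first document
    [0.6, 0.80],  # less similar, but different from the others
]

print(mmr(query, docs, k=2, lambda_mult=1.0))  # pure similarity -> [0, 1]
print(mmr(query, docs, k=2, lambda_mult=0.3))  # favors diversity -> [0, 2]
```

With the weight at 1.0 the near-duplicate is returned as the second result; lowering it makes the selection skip the duplicate in favor of the more distinct document.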

Examples

Return documents based on similarity to a string query:

vectorstore.similarity_search("cat")

[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Parrots are intelligent birds capable of mimicking human speech.', metadata={'source': 'bird-pets-doc'})]

Async query:

await vectorstore.asimilarity_search("cat")

[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Parrots are intelligent birds capable of mimicking human speech.', metadata={'source': 'bird-pets-doc'})]
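A top-level await like the one above works in a notebook, but a plain script needs an event loop. The sketch below shows one common way to drive several asynchronous searches concurrently with the standard library. `blocking_search` is a hypothetical stand-in for a synchronous search call, and wrapping it with `asyncio.to_thread` is an assumption of this sketch, not a description of how any vector store implements its async methods.

```python
import asyncio

CORPUS = [
    "Cats are independent pets that often enjoy their own space.",
    "Goldfish are popular pets for beginners, requiring relatively simple care.",
]


def blocking_search(query: str) -> list:
    """Hypothetical synchronous search: simple substring match."""
    return [text for text in CORPUS if query.lower() in text.lower()]


async def async_search(query: str) -> list:
    """Run the blocking search in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(blocking_search, query)


async def main() -> list:
    # gather() runs both searches concurrently and preserves argument order.
    return await asyncio.gather(async_search("cats"), async_search("goldfish"))


results = asyncio.run(main())
print(results)
```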

Return scores along with the documents:

# Note that providers implement different scores; Chroma here
# returns a distance metric that should vary inversely with
# similarity.
vectorstore.similarity_search_with_score("cat")

[(Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'}),
  0.3751849830150604),
 (Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'source': 'mammal-pets-doc'}),
  0.48316916823387146),
 (Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'source': 'mammal-pets-doc'}),
  0.49601367115974426),
 (Document(page_content='Parrots are intelligent birds capable of mimicking human speech.', metadata={'source': 'bird-pets-doc'}),
  0.4972994923591614)]

Return documents based on similarity to an embedded query:

embedding = OpenAIEmbeddings().embed_query("cat")
vectorstore.similarity_search_by_vector(embedding)

[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Parrots are intelligent birds capable of mimicking human speech.', metadata={'source': 'bird-pets-doc'})]

Retrievers

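LangChain VectorStore objects do not subclass Runnable, so they cannot be integrated directly into LangChain Expression Language (LCEL) chains.

Retrievers are Runnables: they implement a standard set of methods (for example, synchronous and asynchronous invoke and batch operations) and are designed to be incorporated into LCEL chains.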
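To make the "standard set of methods" concrete, here is a deliberately simplified sketch of the idea. `MiniRunnable` is a hypothetical class written for this example, not LangChain's Runnable; it only shows why a shared invoke/batch interface lets components be piped together with `|`.

```python
from typing import Any, Callable, List


class MiniRunnable:
    """Hypothetical, simplified illustration of a runnable-style interface."""

    def __init__(self, func: Callable[[Any], Any]) -> None:
        self.func = func

    def invoke(self, value: Any) -> Any:
        """Process a single input."""
        return self.func(value)

    def batch(self, values: List[Any]) -> List[Any]:
        """Process several inputs, one result per input."""
        return [self.invoke(value) for value in values]

    def __or__(self, other: "MiniRunnable") -> "MiniRunnable":
        """Compose two steps: the output of self feeds into other."""
        return MiniRunnable(lambda value: other.invoke(self.invoke(value)))


CORPUS = [
    "Cats are independent pets that often enjoy their own space.",
    "Goldfish are popular pets for beginners, requiring relatively simple care.",
]


def toy_search(query: str) -> List[str]:
    """Stand-in for a retrieval step: simple substring match."""
    return [text for text in CORPUS if query.lower() in text.lower()]


retriever = MiniRunnable(toy_search)
count_hits = MiniRunnable(len)
chain = retriever | count_hits

print(chain.invoke("cat"))                    # 1
print(chain.batch(["cat", "pets", "shark"]))  # [1, 2, 0]
```

Because every step exposes the same methods, the composed chain supports invoke and batch without any extra code.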

We can create a simple runnable ourselves without subclassing Retriever. Below we build an example around the similarity_search method:

from typing import List
from langchain_core.documents import Document
from langchain_core.runnables import RunnableLambda
retriever = RunnableLambda(vectorstore.similarity_search).bind(k=1)  # select top result
retriever.batch(["cat", "shark"])

[[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'})],
 [Document(page_content='Goldfish are popular pets for beginners, requiring relatively simple care.', metadata={'source': 'fish-pets-doc'})]]

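Vector stores implement an as_retriever method that generates a VectorStoreRetriever. These retrievers carry search_type and search_kwargs attributes that identify which method of the underlying vector store to call and how to parameterize it. For example, we can replicate the above with the following: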

retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1},
)
retriever.batch(["cat", "shark"])

[[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'})],
 [Document(page_content='Goldfish are popular pets for beginners, requiring relatively simple care.', metadata={'source': 'fish-pets-doc'})]]

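VectorStoreRetriever supports the search types "similarity" (the default), "mmr" (maximum marginal relevance, described above), and "similarity_score_threshold". The last one lets us filter the documents returned by the retriever by a similarity score threshold.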
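The thresholding idea can be sketched with the scores printed earlier for the query "cat". Those values were described as a distance (lower means closer), so this sketch keeps results at or below a cutoff; a retriever that works with relevance scores where higher is better would apply the comparison in the opposite direction. `filter_by_max_distance` and the cutoff of 0.4 are assumptions made for illustration.

```python
# (text, distance) pairs copied from the similarity_search_with_score output above.
scored_docs = [
    ("Dogs are great companions, known for their loyalty and friendliness.", 0.48316916823387146),
    ("Cats are independent pets that often enjoy their own space.", 0.3751849830150604),
    ("Parrots are intelligent birds capable of mimicking human speech.", 0.4972994923591614),
    ("Rabbits are social animals that need plenty of space to hop around.", 0.49601367115974426),
]


def filter_by_max_distance(pairs: list, max_distance: float) -> list:
    """Sort by distance (closest first) and drop anything beyond the cutoff."""
    ranked = sorted(pairs, key=lambda pair: pair[1])
    return [(text, distance) for text, distance in ranked if distance <= max_distance]


kept = filter_by_max_distance(scored_docs, max_distance=0.4)
for text, distance in kept:
    print(f"{distance:.3f}  {text}")
```

With this cutoff only the document about cats survives; raising it to 0.49 would also admit the document about dogs.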

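Retrievers can easily be incorporated into more complex applications, such as retrieval-augmented generation (RAG) applications that combine a given question with retrieved context into a prompt for an LLM. Below we show a minimal example.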

OpenAI

pip install -qU langchain-openai

import getpass
import os
os.environ["OPENAI_API_KEY"] = getpass.getpass()
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-3.5-turbo-0125")

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
message = """
Answer this question using the provided context only.
{question}
Context:
{context}
"""
prompt = ChatPromptTemplate.from_messages([("human", message)])
rag_chain = {"context": retriever, "question": RunnablePassthrough()} | prompt | llm

response = rag_chain.invoke("tell me about cats")
print(response.content)

Cats are independent pets that often enjoy their own space.
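To make the data flow of the chain easier to follow, the sketch below reproduces the prompt-building step in plain Python. It is an illustration under assumptions: `toy_retriever` and `build_prompt` are hypothetical helpers, the retriever returns plain strings rather than Document objects, and no model is called.

```python
TEMPLATE = """
Answer this question using the provided context only.
{question}
Context:
{context}
"""

CORPUS = [
    "Cats are independent pets that often enjoy their own space.",
    "Goldfish are popular pets for beginners, requiring relatively simple care.",
]


def toy_retriever(question: str) -> list:
    """Stand-in retriever: return texts sharing a word with the question."""
    words = set(question.lower().split())
    return [text for text in CORPUS if words & set(text.lower().rstrip(".").split())]


def build_prompt(question: str) -> str:
    """Mirror the dict step of the chain: the same input goes to the retriever
    (for "context") and is passed through unchanged (for "question")."""
    inputs = {"context": toy_retriever(question), "question": question}
    return TEMPLATE.format(
        question=inputs["question"],
        context="\n".join(inputs["context"]),
    )


prompt_text = build_prompt("tell me about cats")
print(prompt_text)
```

The resulting string is what a chat model would receive as the human message; only the retrieved text about cats appears under "Context:".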

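Summary:

This document walks through example code for vector stores and retrievers. It introduces LangChain's vector store and retriever abstractions, covering both the concepts and their usage. A vector store is one way to store and search over unstructured data; LangChain's VectorStore objects define methods for adding text and Document objects to the store and for querying them with various similarity metrics. Retrievers are Runnables, implement a standard set of methods, and can be incorporated into LCEL chains.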

Posted 2024-11-13 14:31 by onecyl
