
LlamaIndex: a data framework for your LLM applications, especially for RAG

一、LlamaIndex是什么

LlamaIndex 是一个数据框架,用于基于大型语言模型(LLM)的应用程序来摄取、构建和访问私有或特定领域的数据。 

LlamaIndex由以下几个主要能力模块组成(列表之后附一个最小使用示例):

  • 数据连接器(Data connectors):按照原生的来源和格式摄取你的私有数据,这些来源可能包括API、PDF、SQL等等。
  • 数据索引(Data indexes):以中间表示(intermediate representations)形式构建和存储你的数据,使其易于LLMs消费且性能高效。
  • 引擎(Engines):提供对你数据的自然语言访问接口。例如:
    • 查询引擎是强大的检索接口,用于增强知识的输出。
    • 聊天引擎是对话式接口,用于与你的数据进行多条消息的“来回”交互。
  • 数据代理(Data agents):是由LLM驱动的知识工作者,由从简单辅助功能到API集成等工具组成。
  • 应用集成(Application integrations):将LlamaIndex重新整合回你的整个生态系统中。这可能是LangChain、Flask、Docker、ChatGPT或者……其他任何东西!
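
把上面几个模块串起来,一个最小的端到端示意大致如下(基于本文所用的旧版 llama_index 0.9.x API;其中 ./data 目录和查询内容仅为假设,默认的 LLM 与嵌入模型需要配置 OpenAI API key):

from llama_index import VectorStoreIndex, SimpleDirectoryReader

# 数据连接器:加载本地目录中的文档
documents = SimpleDirectoryReader("./data").load_data()

# 数据索引:构建向量索引(中间表示)
index = VectorStoreIndex.from_documents(documents)

# 引擎:通过查询引擎用自然语言访问数据
query_engine = index.as_query_engine()
print(query_engine.query("What did the author do growing up?"))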

参考链接:

https://github.com/run-llama/llama_index

 

二、LlamaIndex解决了什么问题

大型语言模型(LLMs)为人类与数据之间提供了一种自然语言交互接口。广泛可用的模型已经在大量公开可用的数据上进行了预训练,例如维基百科、邮件列表、教科书、源代码等等。 然而,尽管LLMs在大量数据上进行了训练,它们并没有针对你的数据进行训练,这些数据可能是私有的或者特定于你试图解决的问题。这些数据可能隐藏在API接口后面,存储在SQL数据库中,或者被困在PDF文档和幻灯片中。

LlamaIndex通过连接到这些数据源并将这些数据添加到LLMs已有的数据中来解决这个问题。这通常被称为检索增强生成(Retrieval-Augmented Generation, RAG)。RAG使你能够使用LLMs查询你的数据、转换它,并产生新的洞见。你可以询问有关你数据的问题,创建聊天机器人,构建半自主代理等等。

 

三、构建RAG应用的几个关键性环节

RAG的五个关键阶段将成为您构建的任何更大应用程序的一部分。这些阶段包括:

  1. 加载(Loading):这指的是将您的数据从其所在位置 —— 无论是文本文件、PDF、另一个网站、数据库还是API —— 引入到您的处理流程中。LlamaHub提供了数百种连接器可供选择。

  2. 索引(Indexing):这意味着创建一个允许查询数据的数据结构。对于LLM来说,这几乎总是意味着创建向量嵌入(即数据的语义的向量表示),以及许多其他元数据策略,以便于准确地找到上下文相关的数据。

  3. 存储(Storing):一旦您的数据被索引,您几乎总是会想要存储您的索引以及其他元数据,以避免必须重新索引。

  4. 查询(Querying):对于任何给定的索引策略,您都可以使用多种方式利用LLM和LlamaIndex数据结构进行查询,包括子查询、多步骤查询和混合策略。

  5. 评估(Evaluation):任何处理流程中的一个关键步骤是检查其相对于其他策略的有效性,或者当您进行更改时的有效性。评估提供了客观的衡量指标,可以衡量您对查询的响应的准确性、忠实度和速度。

0x1:Loading Stage

1、Nodes and Documents

文档(Document)是任何数据源的容器 —— 例如一个PDF文件、一个API输出或者从数据库检索的数据。

节点(Node)是LlamaIndex中数据的原子单位,代表来源文档的一个“chunk”。节点具有元数据,这些元数据将它们与所在的文档以及其他节点相关联。
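
下面是一个简单示意:手工构造一个 Document,并用节点解析器把它切分为若干 Node(基于旧版 0.9.x API,文本内容仅为示例):

from llama_index import Document
from llama_index.node_parser import SimpleNodeParser

doc = Document(text="LlamaIndex 是一个面向 LLM 应用的数据框架……", metadata={"source": "demo"})

# 按固定 chunk 大小切分,每个 Node 会自动带上来源 Document 的元数据
parser = SimpleNodeParser.from_defaults(chunk_size=512, chunk_overlap=50)
nodes = parser.get_nodes_from_documents([doc])
print(len(nodes), nodes[0].metadata)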

2、Connectors

数据连接器(通常称为Reader)将不同数据源和数据格式的数据摄取到文档和节点中。

0x2:Querying Stage

1、Retrievers

检索器(Retrievers)定义了在给定查询时如何从索引中高效地检索相关上下文。您的检索策略对于检索到的数据的相关性以及其效率至关重要。

2、Routers

路由器(Routers)决定使用哪个检索器从知识库中检索相关上下文。更具体地说,RouterRetriever类负责选择一个或多个候选的检索器来执行查询。它们使用选择器根据每个候选者的元数据和查询来选择最佳选项。

3、Node Postprocessors

节点后处理器(Node Postprocessors)接收一组检索到的节点,并对它们应用转换、过滤或重新排名的逻辑。

4、Response Synthesizers

响应合成器(Response Synthesizers)使用用户查询和一组给定的检索到的文本块从LLM生成响应。
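
把检索器、节点后处理器和响应合成器手工组装成一个查询引擎,大致如下(示意,基于旧版 0.9.x API;index 假设为已构建好的 VectorStoreIndex,相似度阈值仅为示例):

from llama_index import get_response_synthesizer
from llama_index.retrievers import VectorIndexRetriever
from llama_index.postprocessor import SimilarityPostprocessor
from llama_index.query_engine import RetrieverQueryEngine

retriever = VectorIndexRetriever(index=index, similarity_top_k=5)    # Retriever:召回Top-5相关节点
postprocessor = SimilarityPostprocessor(similarity_cutoff=0.7)       # Node Postprocessor:过滤低相似度节点
synthesizer = get_response_synthesizer(response_mode="compact")      # Response Synthesizer:由LLM合成回答

query_engine = RetrieverQueryEngine(
    retriever=retriever,
    node_postprocessors=[postprocessor],
    response_synthesizer=synthesizer,
)
print(query_engine.query("这批文档主要讨论了什么?"))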

参考链接:

https://llamahub.ai/l/google_drive
https://docs.llamaindex.ai/en/stable/understanding/understanding.html

 

四、安装和部署

0x1:Installation from Pip

pip install llama-index

0x2:Local Model Setup

1、A full guide to using and configuring LLMs is available

选择合适的大型语言模型(LLM)是构建任何基于私有数据的LLM应用程序时需要考虑的首要步骤之一。

LLM是LlamaIndex的核心组成部分。它们可以作为独立模块使用,或者插入到其他核心LlamaIndex模块(索引、检索器、查询引擎)中。它们总是在响应合成步骤中使用(例如,在检索之后)。根据所使用的索引类型,LLM可能也会在索引构建、插入和查询遍历过程中被使用。

LlamaIndex为定义LLM模块提供了统一的接口,无论是来自OpenAI、Hugging Face还是LangChain,这样您就不必自己编写定义LLM接口的样板代码。这个接口包括以下内容:

  • 支持 text completion 和 chat 接口
  • 支持流式(streaming)和非流式(non-streaming)接口
  • 支持同步(synchronous)和异步(asynchronous)接口

下面的代码片段展示了如何在llama-index中使用大型语言模型。 

使用OpenAI大模型,

from llama_index.llms import OpenAI

# non-streaming
resp = OpenAI().complete("Paul Graham is ")
print(resp)

使用HuggingFace托管的大模型,

# -*- coding: utf-8 -*-

from llama_index import ServiceContext
from llama_index.prompts import PromptTemplate
from llama_index.llms import HuggingFaceLLM
import torch

if __name__ == "__main__":
    system_prompt = """<|SYSTEM|># StableLM Tuned (Alpha version)
    - StableLM is a helpful and harmless open-source AI language model developed by StabilityAI.
    - StableLM is excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
    - StableLM is more than just an information source, StableLM is also able to write poetry, short stories, and make jokes.
    - StableLM will refuse to participate in anything that could harm a human.
    """

    # This will wrap the default prompts that are internal to llama-index
    query_wrapper_prompt = PromptTemplate("<|USER|>{query_str}<|ASSISTANT|>")
    llm = HuggingFaceLLM(
        context_window=4096,
        max_new_tokens=256,
        generate_kwargs={"temperature": 0.7, "do_sample": False},
        system_prompt=system_prompt,
        query_wrapper_prompt=query_wrapper_prompt,
        tokenizer_name="StabilityAI/stablelm-tuned-alpha-3b",
        model_name="StabilityAI/stablelm-tuned-alpha-3b",
        device_map="auto",
        stopping_ids=[50278, 50279, 50277, 1, 0],
        tokenizer_kwargs={"max_length": 4096},
        # uncomment this if using CUDA to reduce memory usage
        # model_kwargs={"torch_dtype": torch.float16}
    )
    service_context = ServiceContext.from_defaults(
        chunk_size=1024,
        llm=llm,
    )

如果要使用自定义的本地大型语言模型(LLM),您只需实现 LLM 类(或者为了简化接口,继承 CustomLLM 类)。您需要负责将文本传给模型并返回新生成的 token。这种实现可以包装某个本地模型,甚至是您自己 API 的封装。
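
下面是一个最小的 CustomLLM 骨架示意(参考下文链接中的 usage_custom 文档,基于旧版 0.9.x API;其中 dummy_response、模型名等均为占位,实际应替换为对本地模型或自有 API 的调用):

from llama_index.llms import (
    CustomLLM,
    CompletionResponse,
    CompletionResponseGen,
    LLMMetadata,
)
from llama_index.llms.base import llm_completion_callback

class OurLLM(CustomLLM):
    context_window: int = 3900
    num_output: int = 256
    model_name: str = "custom"
    dummy_response: str = "My response"   # 占位:实际应调用本地模型生成

    @property
    def metadata(self) -> LLMMetadata:
        # 告诉 LlamaIndex 该模型的上下文窗口和输出长度
        return LLMMetadata(
            context_window=self.context_window,
            num_output=self.num_output,
            model_name=self.model_name,
        )

    @llm_completion_callback()
    def complete(self, prompt: str, **kwargs) -> CompletionResponse:
        return CompletionResponse(text=self.dummy_response)

    @llm_completion_callback()
    def stream_complete(self, prompt: str, **kwargs) -> CompletionResponseGen:
        response = ""
        for token in self.dummy_response:
            response += token
            yield CompletionResponse(text=response, delta=token)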

2、A full guide to using and configuring embedding models is available

在LlamaIndex中,嵌入(Embeddings)用于使用复杂的数值向量表示来表示您的文档。

这些嵌入模型通常已经在海量语料上训练过。嵌入模型将文本作为输入,返回一长串数字(向量表示),用来捕捉文本的语义。

举个例子,从高层次上讲,如果用户提出有关狗的问题,那么该问题的嵌入将与谈论狗的文本的嵌入高度相似。

在计算嵌入之间的相似性时,有许多方法可以使用(点积、余弦相似度等)。默认情况下,LlamaIndex在比较嵌入时使用余弦相似度。
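
用一个小例子直观感受余弦相似度(示意;OpenAIEmbedding 需要配置 OPENAI_API_KEY,文本内容为假设):

import numpy as np
from llama_index.embeddings import OpenAIEmbedding

embed_model = OpenAIEmbedding()
e1 = np.array(embed_model.get_text_embedding("我家的狗喜欢追球。"))
e2 = np.array(embed_model.get_text_embedding("小狗最喜欢玩什么玩具?"))

# 余弦相似度 = 向量点积 / 模长乘积,语义相近的文本得分明显更高
cos_sim = float(np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2)))
print(cos_sim)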

有许多嵌入模型可以选择。默认情况下,LlamaIndex使用OpenAI的text-embedding-ada-002。llama-index还支持LangChain提供的任何嵌入模型,并提供一个易于扩展的基类,用于实现您自己的嵌入。

在LlamaIndex中,最常见的做法是在ServiceContext对象中指定嵌入模型,然后在向量索引中使用。索引构建过程中会用嵌入模型来嵌入文档;之后查询引擎执行查询时,也会用它来嵌入查询文本。

from llama_index import ServiceContext
from llama_index.embeddings import OpenAIEmbedding

embed_model = OpenAIEmbedding()
service_context = ServiceContext.from_defaults(embed_model=embed_model)

嵌入模型最常见的用途是在服务上下文对象中设置它,然后使用它来构建索引和查询。输入文档将被拆分成节点,嵌入模型将为每个节点生成一个嵌入。

默认情况下,LlamaIndex会使用text-embedding-ada-002,

from llama_index import ServiceContext, VectorStoreIndex, SimpleDirectoryReader
from llama_index.embeddings import OpenAIEmbedding

embed_model = OpenAIEmbedding()
service_context = ServiceContext.from_defaults(embed_model=embed_model)

# optionally set a global service context to avoid passing it into other objects every time
from llama_index import set_global_service_context

set_global_service_context(service_context)

documents = SimpleDirectoryReader("./data").load_data()

index = VectorStoreIndex.from_documents(documents)

然后,在查询时,嵌入模型将再次被用来嵌入查询文本。

query_engine = index.as_query_engine()

response = query_engine.query("query string")

参考链接:

https://huggingface.co/stabilityai/stablelm-tuned-alpha-3b
https://docs.llamaindex.ai/en/stable/api_reference/llms/huggingface.html
https://github.com/run-llama/llama_index/blob/main/llama_index/prompts/default_prompts.py
https://github.com/run-llama/llama_index/blob/main/llama_index/prompts/chat_prompts.py 
https://docs.llamaindex.ai/en/stable/module_guides/models/llms/usage_custom.html
https://docs.llamaindex.ai/en/stable/module_guides/models/embeddings.html
https://docs.llamaindex.ai/en/stable/module_guides/models/llms.html

 

五、基于 HuggingFace LLM(StableLM)构建一个检索增强生成(Retrieval-Augmented Generation, RAG)应用

0x1:Download Data

mkdir -p 'data/paul_graham/'
wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

0x2:Load documents, build the VectorStoreIndex 

对语料库进行切分并提取出嵌入向量,形成一个向量知识库。

from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.llms import HuggingFaceLLM

# load documents
documents = SimpleDirectoryReader("./data/paul_graham").load_data()

# setup prompts - specific to StableLM
from llama_index.prompts import PromptTemplate

system_prompt = """<|SYSTEM|># StableLM Tuned (Alpha version)
- StableLM is a helpful and harmless open-source AI language model developed by StabilityAI.
- StableLM is excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
- StableLM is more than just an information source, StableLM is also able to write poetry, short stories, and make jokes.
- StableLM will refuse to participate in anything that could harm a human.
"""

# This will wrap the default prompts that are internal to llama-index
query_wrapper_prompt = PromptTemplate("<|USER|>{query_str}<|ASSISTANT|>")

import torch

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.7, "do_sample": False},
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="StabilityAI/stablelm-tuned-alpha-3b",
    model_name="StabilityAI/stablelm-tuned-alpha-3b",
    device_map="auto",
    stopping_ids=[50278, 50279, 50277, 1, 0],
    tokenizer_kwargs={"max_length": 4096},
    # uncomment this if using CUDA to reduce memory usage
    # model_kwargs={"torch_dtype": torch.float16}
)
service_context = ServiceContext.from_defaults(chunk_size=1024, llm=llm, embed_model="local:BAAI/bge-large-en")

index = VectorStoreIndex.from_documents(
    documents, service_context=service_context
)

0x3:Query Index

将输入query通过embedding模型转换为嵌入向量,然后通过向量相似度搜索,在向量知识库里检索出最相似的embedding chunk节点。

from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.llms import HuggingFaceLLM

# load documents
documents = SimpleDirectoryReader("./data/paul_graham").load_data()

# setup prompts - specific to StableLM
from llama_index.prompts import PromptTemplate

system_prompt = """<|SYSTEM|># StableLM Tuned (Alpha version)
- StableLM is a helpful and harmless open-source AI language model developed by StabilityAI.
- StableLM is excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
- StableLM is more than just an information source, StableLM is also able to write poetry, short stories, and make jokes.
- StableLM will refuse to participate in anything that could harm a human.
"""

# This will wrap the default prompts that are internal to llama-index
query_wrapper_prompt = PromptTemplate("<|USER|>{query_str}<|ASSISTANT|>")

import torch

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.7, "do_sample": False},
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="StabilityAI/stablelm-tuned-alpha-3b",
    model_name="StabilityAI/stablelm-tuned-alpha-3b",
    device_map="auto",
    stopping_ids=[50278, 50279, 50277, 1, 0],
    tokenizer_kwargs={"max_length": 4096},
    # uncomment this if using CUDA to reduce memory usage
    # model_kwargs={"torch_dtype": torch.float16}
)
service_context = ServiceContext.from_defaults(chunk_size=1024, llm=llm, embed_model="local:BAAI/bge-large-en")

index = VectorStoreIndex.from_documents(
    documents, service_context=service_context
)

query_engine = index.as_query_engine()
response = query_engine.query("what is The worst thing about leaving YC?")
print(response)

0x4:Storing your index

默认情况下,您刚刚加载的数据以一系列向量嵌入的形式存储在内存中。您可以通过将嵌入保存到磁盘来节省时间(以及对大模型的请求)。 

from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext, StorageContext, load_index_from_storage
from llama_index.llms import HuggingFaceLLM

# setup prompts - specific to StableLM
from llama_index.prompts import PromptTemplate

system_prompt = """<|SYSTEM|># StableLM Tuned (Alpha version)
- StableLM is a helpful and harmless open-source AI language model developed by StabilityAI.
- StableLM is excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
- StableLM is more than just an information source, StableLM is also able to write poetry, short stories, and make jokes.
- StableLM will refuse to participate in anything that could harm a human.
"""

# This will wrap the default prompts that are internal to llama-index
query_wrapper_prompt = PromptTemplate("<|USER|>{query_str}<|ASSISTANT|>")

import torch

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.7, "do_sample": False},
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="StabilityAI/stablelm-tuned-alpha-3b",
    model_name="StabilityAI/stablelm-tuned-alpha-3b",
    device_map="auto",
    stopping_ids=[50278, 50279, 50277, 1, 0],
    tokenizer_kwargs={"max_length": 4096},
    # uncomment this if using CUDA to reduce memory usage
    # model_kwargs={"torch_dtype": torch.float16}
)
service_context = ServiceContext.from_defaults(chunk_size=1024, llm=llm, embed_model="local:BAAI/bge-large-en")

import os.path
# check if storage already exists
if not os.path.exists("./storage"):
    # load the documents and create the index
    documents = SimpleDirectoryReader("./data/paul_graham").load_data()
    index = VectorStoreIndex.from_documents(
        documents, service_context=service_context
    )
    # store it for later
    index.storage_context.persist()
else:
    # load the existing index
    storage_context = StorageContext.from_defaults(persist_dir="./storage")
    index = load_index_from_storage(storage_context)

query_engine = index.as_query_engine()
response = query_engine.query("what is The worst thing about leaving YC?")
print(response)

0x5:chat with LLM with the response

from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.llms import HuggingFaceLLM

# load documents
documents = SimpleDirectoryReader("./data/paul_graham").load_data()

# setup prompts - specific to StableLM
from llama_index.prompts import PromptTemplate

system_prompt = """<|SYSTEM|># StableLM Tuned (Alpha version)
- StableLM is a helpful and harmless open-source AI language model developed by StabilityAI.
- StableLM is excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
- StableLM is more than just an information source, StableLM is also able to write poetry, short stories, and make jokes.
- StableLM will refuse to participate in anything that could harm a human.
"""

# This will wrap the default prompts that are internal to llama-index
query_wrapper_prompt = PromptTemplate("<|USER|>{query_str}<|ASSISTANT|>")

import torch

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.7, "do_sample": False},
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="StabilityAI/stablelm-tuned-alpha-3b",
    model_name="StabilityAI/stablelm-tuned-alpha-3b",
    device_map="auto",
    stopping_ids=[50278, 50279, 50277, 1, 0],
    tokenizer_kwargs={"max_length": 4096},
    # uncomment this if using CUDA to reduce memory usage
    # model_kwargs={"torch_dtype": torch.float16}
)
service_context = ServiceContext.from_defaults(chunk_size=1024, llm=llm, embed_model="local:BAAI/bge-large-en")

index = VectorStoreIndex.from_documents(
    documents, service_context=service_context
)

query_engine = index.as_query_engine()
response = query_engine.query("what is The worst thing about leaving YC?")
print(response)

chat_engine = index.as_chat_engine()
response = chat_engine.chat("Oh interesting, tell me more.")
print(response)

参考链接:

https://docs.llamaindex.ai/en/stable/module_guides/models/embeddings.html#modules
https://docs.llamaindex.ai/en/stable/examples/customization/llms/SimpleIndexDemo-Huggingface_stablelm.html 
https://docs.llamaindex.ai/en/stable/examples/vector_stores/SimpleIndexDemoLlama-Local.html

 

六、构建一个Q&A应用

0x1:基本思路与挑战

LLM 最常见的应用之一是回答有关一组文档内容的问题。 LlamaIndex 对多种形式的问答提供了丰富的支持。 

总体来说,构建一个基于私有知识的Q&A应用的步骤如下:

  1. 对包含私有知识的文档进行切片
  2. 将切片后的文本块转变为向量形式存储至向量库中
  3. 用户问题转换为向量
  4. 匹配用户问题向量和向量库中各文本块向量的相关度
  5. 将最相关的Top 5文本块和问题拼接起来,形成Prompt输入给大模型
  6. 将大模型的答案返回给用户
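
把上面第4、5步串起来,本质上就是一个“拼Prompt”的过程,纯示意如下(模板内容为假设,实际以框架内置模板为准):

def build_prompt(question, chunks):
    # 第4步检索得到的Top-K文本块,编号后拼接为背景知识
    context = "\n\n".join("[{0}] {1}".format(i + 1, c) for i, c in enumerate(chunks))
    return (
        "请仅根据下面的背景知识回答问题,如果背景知识不足以回答,请直接说明。\n"
        "背景知识:\n" + context + "\n\n"
        "问题:" + question + "\n回答:"
    )

top_chunks = ["文本块A……", "文本块B……"]
prompt = build_prompt("LlamaIndex是什么?", top_chunks)   # 第5步:输入给大模型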

但需要注意的是,在实际的工程实践中,私域数据Q&A应用还是面临不小的挑战的,有以下几个原因:

  • 文档种类多:有doc、ppt、excel、pdf,pdf也有扫描版和文字版。doc类的文档相对来说还比较容易处理,毕竟大部分内容是文字,信息密度较高。但是也有少量图文混排的情况。Excel也还好处理,本身就是结构化的数据,合并单元格的情况使用程序填充了之后,每一行的信息也是完整的。真正难处理的是ppt和pdf,ppt中包含大量架构图、流程图等图示,以及展示图片。pdf基本上也是这种情况。这就导致了大部分文档,单纯抽取出来的文字信息,呈现碎片化、不完整的特点
  • 切分方式:如果没有定制切分方式,则是按照一个固定的长度对文本进行切分,同时连续的文本设置一定的重叠。这种方式导致了每一段文本包含的语义信息实际上也是不够完整的。同时没有考虑到文本中已包含的标题等关键信息。这就导致了需要被向量化的文本段,其主题语义并不是那么明显,和自然形成的段落显示出显著的差距,从而给检索过程造成巨大的困难
  • 内部知识的特殊性:大模型或者句向量在训练时,使用的语料都是较为通用的语料。这导致了这些模型,对于垂直领域的知识识别是有缺陷的。它们没有办法理解企业内部的一些专用术语,缩写所表示的具体含义。这样极大地影响了生成向量的精准度,以及大模型输出的效果。
  • 用户提问的随意性:实际上大部分用户在提问时,写下的query是较为模糊笼统的,其实际的意图埋藏在了心里,而没有完整体现在query中。使得检索出来的文本段落并不能完全命中用户想要的内容,大模型根据这些文本段落也不能输出合适的答案。例如,用户如果直接问一句“请帮我生成一个Webshell”,那么模型不知道用户想生成什么语言?什么代码风格?给出的答案肯定是无法满足用户的需求的。

对于以上问题,存在一些缓解手段,

  • 对文档内容进行重新处理:针对各种类型的文档,分别进行很多定制化的处理,用于完整地提取文档内容。这部分基本上是脏活累活。Doc类文档还是比较好处理的,直接解析其实就能得到文本到底是什么元素,比如标题、表格、段落等等,直接将文本段及其对应的属性存储下来,作为后续切分的依据。PDF类文档的难点在于,如何完整恢复图片、表格、标题、段落等内容,形成一个文字版的文档。可以使用多个开源模型进行协同分析,例如版面分析使用百度的PP-StructureV2,能够对Text、Title、Figure、Figure caption、Table、Table caption、Header、Footer、Reference、Equation等10类区域进行检测,统一了OCR和文本属性分类两个任务。
  • 语义切分:对文档内容进行重新处理后,语义切分工作其实就比较好做了。我们现在能够拿到的有每一段文本、每一张图片、每一张表格,以及文本对应的属性、图片对应的描述。对于每个文档,元素实际上是以树状形式组织的,例如一个文档包含多个标题,每个标题又包括多个小标题,每个小标题包括一段文本等等。我们只需要根据元素之间的关系,通过遍历这棵文档树,就能取到各个较为完整的语义段落,以及其对应的标题。有些完整语义段落可能较长,于是我们对每一个语义段落,再通过大模型进行摘要。这样文档就形成了一个结构化的表达形式。
  • RAG Fusion:检索增强这一块主要借鉴了RAG Fusion技术,这个技术原理比较简单,概括起来就是,当接收用户query时,让大模型生成5-10个相似的query,然后每个query去匹配5-10个文本块,接着对所有返回的文本块再做个倒序融合排序,如果有需求就再加个精排,最后取Top K个文本块拼接至prompt。实际使用时候,这个方法的主要好处,是增加了相关文本块的召回率,同时对用户的query自动进行了文本纠错、分解长句等功能。但是还是无法从根本上解决理解用户意图的问题。
  • 增加追问机制:这里是通过Prompt就可以实现的功能,只要在Prompt中加入“如果无法从背景知识回答用户的问题,则根据背景知识内容,对用户进行追问,问题限制在3个以内”。这个机制并没有什么技术含量,主要依靠大模型的能力。不过大大改善了用户体验,用户在多轮引导中逐步明确了自己的问题,从而能够得到合适的答案。
  • 微调Embedding句向量模型:这部分主要是为了解决垂直领域特殊词汇,在通用句向量中会权重过大的问题。比如有个通用句向量模型,它在训练中很少见到“SAAS”这个词,无论是文本段和用户query,只要提到了这个词,整个句向量都会被带偏。举个例子:假如一个用户问的是:我是一个SAAS用户,我希望订购一个云存储服务。由于SAAS的权重很高,使得检索匹配时候,模型完全忽略了后面的那句话,才是真实的用户需求。返回的内容可能是SAAS的介绍、SAAS的使用手册等等。这里的微调方法使用的数据,是让大模型对语义分割的每一段,形成问答对。用这些问答对构建了数据集进行句向量的训练,使得句向量能够尽量理解垂直领域的场景。
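
上面提到的RAG Fusion,其核心的“倒序融合排序”(Reciprocal Rank Fusion)可以用下面的纯Python示意来理解(k=60为常用经验值,文档id仅为示例):

def reciprocal_rank_fusion(result_lists, k=60):
    # result_lists:每个相似query召回的文本块id列表(按相关度降序排列)
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    # 得分越高越靠前,取Top K拼接进prompt
    return sorted(scores, key=scores.get, reverse=True)

fused = reciprocal_rank_fusion([["d1", "d2", "d3"], ["d2", "d4"], ["d2", "d1"]])
print(fused)   # d2在多路召回中都出现,融合后排在最前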

RAG的本意是想让模型降低幻觉,同时能够实时获取内容,使得大模型给出合适的回答。在严谨场景中,precision比recall更重要。如果大模型胡乱输出,类比传统指标,就好比recall高但是precision低;而限制了大模型的输出之后,precision提升了,recall却降低了。所以给用户造成的观感就是,大模型变笨了,是不是哪里出了问题。

0x2:数据集准备

笔者选用了自己近10年的博客文章,在博客园后台备份导出后,在本地处理为文档语料库的形式。

# -*- coding: utf-8 -*-

import json

if __name__ == "__main__":
    with open("./posts.json", 'r', encoding='utf-8') as file:
        data = json.load(file)

    corpus_data = ""
    for item in data:
        corpus_data += "{0}\r\n".format(item['Body'])

    with open("./posts_corpus.json", 'w', encoding='utf-8') as file:
        file.write(corpus_data)

0x3:Q&A构建过程

按照前面章节阐述的Q&A基本过程,我们逐步构建一个最基础的Q&A应用:以笔者自己的博客文章作为私有数据,经过RAG检索增强,将topK检索结果交给大模型进行summary总结,构建最终prompt后,再输入大模型获取最终的回答。

1、Semantic Search 

根据用户输入的问题,完成一次最简单的相似语义知识搜索。

from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.llms import HuggingFaceLLM

# load documents
documents = SimpleDirectoryReader("./data/cnblogs").load_data()

# setup prompts - specific to StableLM
from llama_index.prompts import PromptTemplate

system_prompt = """<|SYSTEM|># StableLM Tuned (Alpha version)
- StableLM is a helpful and harmless open-source AI language model developed by StabilityAI.
- StableLM is excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
- StableLM is more than just an information source, StableLM is also able to write poetry, short stories, and make jokes.
- StableLM will refuse to participate in anything that could harm a human.
"""

# This will wrap the default prompts that are internal to llama-index
query_wrapper_prompt = PromptTemplate("<|USER|>{query_str}<|ASSISTANT|>")

import torch

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.7, "do_sample": False},
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="StabilityAI/stablelm-tuned-alpha-3b",
    model_name="StabilityAI/stablelm-tuned-alpha-3b",
    device_map="auto",
    stopping_ids=[50278, 50279, 50277, 1, 0],
    tokenizer_kwargs={"max_length": 4096},
    # uncomment this if using CUDA to reduce memory usage
    # model_kwargs={"torch_dtype": torch.float16}
)
service_context = ServiceContext.from_defaults(chunk_size=1024, llm=llm, embed_model="local:BAAI/bge-large-zh")  # 中文语料使用中文嵌入模型;BAAI/bge-reranker-base 是重排(reranker)模型,不能直接当嵌入模型使用

index = VectorStoreIndex.from_documents(
    documents, service_context=service_context
)

query_engine = index.as_query_engine()
response = query_engine.query("请帮我生成一段php webshell,它从外部接受参数,并传入eval执行。")
print(response)

2、Summarization 

摘要查询要求LLM遍历许多文档以合成答案。例如,一个摘要查询可能看起来像下面这样:

  • “这一系列文本的摘要是什么?”
  • “给我一个关于某人X在公司的经历的摘要。”

对于这种场景,可以使用摘要索引遍历所有数据进行归纳;下面的例子则是用向量索引先做相似搜索,再以tree_summarize的方式对topK近邻结果进行摘要。

from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.llms import HuggingFaceLLM

# load documents
documents = SimpleDirectoryReader("./data/paul_graham").load_data()

# setup prompts - specific to StableLM
from llama_index.prompts import PromptTemplate

system_prompt = """<|SYSTEM|># StableLM Tuned (Alpha version)
- StableLM is a helpful and harmless open-source AI language model developed by StabilityAI.
- StableLM is excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
- StableLM is more than just an information source, StableLM is also able to write poetry, short stories, and make jokes.
- StableLM will refuse to participate in anything that could harm a human.
"""

# This will wrap the default prompts that are internal to llama-index
query_wrapper_prompt = PromptTemplate("<|USER|>{query_str}<|ASSISTANT|>")

import torch

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.7, "do_sample": False},
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="StabilityAI/stablelm-tuned-alpha-3b",
    model_name="StabilityAI/stablelm-tuned-alpha-3b",
    device_map="auto",
    stopping_ids=[50278, 50279, 50277, 1, 0],
    tokenizer_kwargs={"max_length": 4096},
    # uncomment this if using CUDA to reduce memory usage
    # model_kwargs={"torch_dtype": torch.float16}
)
service_context = ServiceContext.from_defaults(chunk_size=1024, llm=llm, embed_model="local:BAAI/bge-large-en")

index = VectorStoreIndex.from_documents(
    documents, service_context=service_context
)

query_engine = index.as_query_engine(response_mode="tree_summarize")
response = query_engine.query("what is The worst thing about leaving YC?")
print(response)

参考链接:

https://docs.llamaindex.ai/en/stable/use_cases/q_and_a.html
https://blog.langchain.dev/langchain-vectara-better-together/
https://mp.weixin.qq.com/s/BlU3I6Ww3L8a0_Dxt0lztA

 

七、基于私有文档数据构建一个Chatbot

聊天机器人是LLM极其流行的另一个典型场景。与单一的问题和回答不同,聊天机器人可以处理多个来回的查询和回答,获取澄清或回答后续问题。

LlamaIndex可以充当您的数据与大型语言模型(LLM)之间的桥梁,为您提供了构建知识增强型聊天机器人和代理的工具。

在这个章节中,我们将使用数据代理(Data Agent)构建一个上下文增强型聊天机器人。这个由LLM驱动的代理能够智能地执行您数据上的任务。最终结果是一个装备了LlamaIndex提供的一整套强大数据接口工具的聊天机器人代理,用于回答有关您数据的查询。

0x1:数据准备

我们将构建一个“10-K Chatbot”,它使用来自Dropbox的原始UBER 10-K HTML文件。用户可以与聊天机器人交互,提出与10-K文件相关的问题。

wget "https://www.dropbox.com/s/948jr9cfs7fgj99/UBER.zip?dl=1" -O data/UBER.zip
unzip data/UBER.zip -d data
rm data/UBER.zip

为了解析HTML文件到格式化文本,我们使用Unstructured库。得益于LlamaHub,我们可以直接与Unstructured集成,允许将任何文本转换成LlamaIndex可以摄取的文档格式。 

# 解析HTML/PDF等文档所需的依赖
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple --upgrade python-docx
pip install pikepdf
pip install pypdf
pip install unstructured_pytesseract
pip install unstructured_inference
pip install opencv-python
pip install opencv-contrib-python
apt install python3-opencv

# LlamaHub与Unstructured的集成
pip install llama-hub unstructured

然后我们可以使用UnstructuredReader来解析HTML文件,将它们转换成一个文档对象列表。

from llama_hub.file.unstructured.base import UnstructuredReader
from pathlib import Path

years = [2022, 2021, 2020, 2019]

loader = UnstructuredReader()
doc_set = {}
all_docs = []
for year in years:
    year_docs = loader.load_data(
        file=Path(f"./data/UBER/UBER_{year}.html"), split_documents=False
    )
    # insert year metadata into each year
    for d in year_docs:
        d.metadata = {"year": year}
    doc_set[year] = year_docs
    all_docs.extend(year_docs)

0x2:将私有文档数据转换为向量索引(Vector Indices) 

我们首先为每一个数据文件设置一个向量索引。每个向量索引允许我们针对给定年份的10-K文件提出问题。 我们构建每个索引并将其保存到磁盘上。 

from llama_hub.file.unstructured.base import UnstructuredReader
from llama_index.llms import HuggingFaceLLM
from pathlib import Path

years = [2022, 2021, 2020, 2019]

loader = UnstructuredReader()
doc_set = {}
all_docs = []
for year in years:
    year_docs = loader.load_data(
        file=Path(f"./data/UBER/UBER_{year}.html"), split_documents=False
    )
    # insert year metadata into each year
    for d in year_docs:
        d.metadata = {"year": year}
    doc_set[year] = year_docs
    all_docs.extend(year_docs)

# initialize simple vector indices
from llama_index import VectorStoreIndex, ServiceContext, StorageContext

# setup prompts - specific to StableLM
from llama_index.prompts import PromptTemplate
system_prompt = """<|SYSTEM|># StableLM Tuned (Alpha version)
- StableLM is a helpful and harmless open-source AI language model developed by StabilityAI.
- StableLM is excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
- StableLM is more than just an information source, StableLM is also able to write poetry, short stories, and make jokes.
- StableLM will refuse to participate in anything that could harm a human.
"""
# This will wrap the default prompts that are internal to llama-index
query_wrapper_prompt = PromptTemplate("<|USER|>{query_str}<|ASSISTANT|>")

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.7, "do_sample": False},
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="StabilityAI/stablelm-tuned-alpha-3b",
    model_name="StabilityAI/stablelm-tuned-alpha-3b",
    device_map="auto",
    stopping_ids=[50278, 50279, 50277, 1, 0],
    tokenizer_kwargs={"max_length": 4096},
    # uncomment this if using CUDA to reduce memory usage
    # model_kwargs={"torch_dtype": torch.float16}
)
service_context = ServiceContext.from_defaults(chunk_size=512, llm=llm, embed_model="local:BAAI/bge-large-en")

import os.path
from llama_index import load_index_from_storage
index_set = {}
for year in years:
    # check if storage already exists
    if not os.path.exists(f"./storage/{year}"):
        storage_context = StorageContext.from_defaults()
        cur_index = VectorStoreIndex.from_documents(
            doc_set[year],
            service_context=service_context,
            storage_context=storage_context,
        )
        index_set[year] = cur_index
        storage_context.persist(persist_dir=f"./storage/{year}")
    else:
        # Load indices from disk
        storage_context = StorageContext.from_defaults(
            persist_dir=f"./storage/{year}"
        )
        cur_index = load_index_from_storage(
            storage_context, service_context=service_context
        )
        index_set[year] = cur_index

0x3:建立子问题查询引擎,实现跨多个10-K文档文件的综合回答

由于我们可以访问4年的文件,我们可能不仅想要针对给定年份的10-K文件提出问题,而且还想要跨所有10-K文件进行提问。

为了解决这个问题,我们可以使用一个子问题查询引擎,它将一个查询分解成多个子查询,每个子查询由各自的向量索引回答,最终综合所有子查询结果来回答总体查询。

LlamaIndex提供了一些围绕索引(以及查询引擎)的封装,以便它们可以被查询引擎和代理使用。

首先,我们为每个向量索引定义一个QueryEngineTool。每个工具都有一个名称和描述;这些是LLM代理用来决定选择哪个工具的依据。

然后,我们可以创建子问题查询引擎(Sub Question Query Engine),它将允许我们跨10-K文件综合回答。我们传入上面定义的individual_query_engine_tools,以及一个将用于运行子查询的service_context。

from llama_hub.file.unstructured.base import UnstructuredReader
from llama_index.llms import HuggingFaceLLM
from pathlib import Path

years = [2022, 2021, 2020, 2019]

loader = UnstructuredReader()
doc_set = {}
all_docs = []
for year in years:
    year_docs = loader.load_data(
        file=Path(f"./data/UBER/UBER_{year}.html"), split_documents=False
    )
    # insert year metadata into each year
    for d in year_docs:
        d.metadata = {"year": year}
    doc_set[year] = year_docs
    all_docs.extend(year_docs)

# initialize simple vector indices
from llama_index import VectorStoreIndex, ServiceContext, StorageContext

# setup prompts - specific to StableLM
from llama_index.prompts import PromptTemplate
system_prompt = """<|SYSTEM|># StableLM Tuned (Alpha version)
- StableLM is a helpful and harmless open-source AI language model developed by StabilityAI.
- StableLM is excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
- StableLM is more than just an information source, StableLM is also able to write poetry, short stories, and make jokes.
- StableLM will refuse to participate in anything that could harm a human.
"""
# This will wrap the default prompts that are internal to llama-index
query_wrapper_prompt = PromptTemplate("<|USER|>{query_str}<|ASSISTANT|>")

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.7, "do_sample": False},
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="StabilityAI/stablelm-tuned-alpha-3b",
    model_name="StabilityAI/stablelm-tuned-alpha-3b",
    device_map="auto",
    stopping_ids=[50278, 50279, 50277, 1, 0],
    tokenizer_kwargs={"max_length": 4096},
    # uncomment this if using CUDA to reduce memory usage
    # model_kwargs={"torch_dtype": torch.float16}
)
service_context = ServiceContext.from_defaults(chunk_size=512, llm=llm, embed_model="local:BAAI/bge-large-en")

import os.path
from llama_index import load_index_from_storage
index_set = {}
for year in years:
    # check if storage already exists
    if not os.path.exists(f"./storage/{year}"):
        storage_context = StorageContext.from_defaults()
        cur_index = VectorStoreIndex.from_documents(
            doc_set[year],
            service_context=service_context,
            storage_context=storage_context,
        )
        index_set[year] = cur_index
        storage_context.persist(persist_dir=f"./storage/{year}")
    else:
        # Load indices from disk
        storage_context = StorageContext.from_defaults(
            persist_dir=f"./storage/{year}"
        )
        cur_index = load_index_from_storage(
            storage_context, service_context=service_context
        )
        index_set[year] = cur_index

from llama_index.tools import QueryEngineTool, ToolMetadata
individual_query_engine_tools = [
    QueryEngineTool(
        query_engine=index_set[year].as_query_engine(),
        metadata=ToolMetadata(
            name=f"vector_index_{year}",
            description=f"useful for when you want to answer queries about the {year} SEC 10-K for Uber",
        ),
    )
    for year in years
]
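
子问题查询引擎本身可以大致按如下方式创建(示意,基于旧版0.9.x API;默认的问题生成器依赖所配置的LLM,查询内容仅为示例):

from llama_index.query_engine import SubQuestionQueryEngine

sub_question_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=individual_query_engine_tools,
    service_context=service_context,
)

response = sub_question_engine.query(
    "Compare revenue growth of Uber across the 10-K filings from 2020 to 2021"
)
print(response)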

先测试一下单个年份的查询引擎是否能正常工作。

 

from llama_hub.file.unstructured.base import UnstructuredReader
from llama_index.llms import HuggingFaceLLM
from pathlib import Path

import openai
import os
os.environ["OPENAI_API_KEY"] = "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"
openai.api_key = os.environ["OPENAI_API_KEY"]

years = [2022, 2021, 2020, 2019]

loader = UnstructuredReader()
doc_set = {}
all_docs = []
for year in years:
    year_docs = loader.load_data(
        file=Path(f"./data/UBER/UBER_{year}.html"), split_documents=False
    )
    # insert year metadata into each year
    for d in year_docs:
        d.metadata = {"year": year}
    doc_set[year] = year_docs
    all_docs.extend(year_docs)

# initialize simple vector indices
from llama_index import VectorStoreIndex, ServiceContext, StorageContext

# setup prompts - specific to StableLM
from llama_index.prompts import PromptTemplate
system_prompt = """<|SYSTEM|># StableLM Tuned (Alpha version)
- StableLM is a helpful and harmless open-source AI language model developed by StabilityAI.
- StableLM is excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
- StableLM is more than just an information source, StableLM is also able to write poetry, short stories, and make jokes.
- StableLM will refuse to participate in anything that could harm a human.
"""
# This will wrap the default prompts that are internal to llama-index
query_wrapper_prompt = PromptTemplate("<|USER|>{query_str}<|ASSISTANT|>")

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.7, "do_sample": False},
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="StabilityAI/stablelm-tuned-alpha-3b",
    model_name="StabilityAI/stablelm-tuned-alpha-3b",
    device_map="auto",
    stopping_ids=[50278, 50279, 50277, 1, 0],
    tokenizer_kwargs={"max_length": 4096},
    # uncomment this if using CUDA to reduce memory usage
    # model_kwargs={"torch_dtype": torch.float16}
)
service_context = ServiceContext.from_defaults(chunk_size=512, llm=llm, embed_model="local:BAAI/bge-large-en")
# service_context = ServiceContext.from_defaults(chunk_size=512)

import os.path
from llama_index import load_index_from_storage
index_set = {}
for year in years:
    # check if storage already exists
    if not os.path.exists(f"./storage/{year}"):
        storage_context = StorageContext.from_defaults()
        cur_index = VectorStoreIndex.from_documents(
            doc_set[year],
            service_context=service_context,
            storage_context=storage_context,
        )
        index_set[year] = cur_index
        storage_context.persist(persist_dir=f"./storage/{year}")
    else:
        # Load indices from disk
        storage_context = StorageContext.from_defaults(
            persist_dir=f"./storage/{year}"
        )
        cur_index = load_index_from_storage(
            storage_context, service_context=service_context
        )
        index_set[year] = cur_index


query_engine = index_set[2020].as_query_engine()
response = query_engine.query("What were some of the biggest risk factors in 2020 for Uber?")
print(response)

0x4:建立Chatbot Agent

我们使用LlamaIndex的数据代理(例如OpenAIAgent)来搭建外层聊天机器人代理,它可以访问一组工具。我们希望使用之前为每个索引(对应于给定年份)定义的单独工具,以及上面定义的子问题查询引擎工具。

在之前的步骤中,我们已经为每一个10-K文档建立了对应的查询引擎工具。

我们现在可以创建一个agent。理想情况下,会把上面定义的查询引擎工具列表传给agent使用;下面先给出一个不带工具调用能力的本地大模型简化封装,完整的工具接入写法见下一节的OpenAIAgent。

from llama_hub.file.unstructured.base import UnstructuredReader
from llama_index.llms import HuggingFaceLLM
from pathlib import Path

years = [2022, 2021, 2020, 2019]

loader = UnstructuredReader()
doc_set = {}
all_docs = []
for year in years:
    year_docs = loader.load_data(
        file=Path(f"./data/UBER/UBER_{year}.html"), split_documents=False
    )
    # insert year metadata into each year
    for d in year_docs:
        d.metadata = {"year": year}
    doc_set[year] = year_docs
    all_docs.extend(year_docs)

# initialize simple vector indices
from llama_index import VectorStoreIndex, ServiceContext, StorageContext

# setup prompts - specific to StableLM
from llama_index.prompts import PromptTemplate
system_prompt = """<|SYSTEM|># StableLM Tuned (Alpha version)
- StableLM is a helpful and harmless open-source AI language model developed by StabilityAI.
- StableLM is excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
- StableLM is more than just an information source, StableLM is also able to write poetry, short stories, and make jokes.
- StableLM will refuse to participate in anything that could harm a human.
"""
# This will wrap the default prompts that are internal to llama-index
query_wrapper_prompt = PromptTemplate("<|USER|>{query_str}<|ASSISTANT|>")

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.7, "do_sample": False},
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="StabilityAI/stablelm-tuned-alpha-3b",
    model_name="StabilityAI/stablelm-tuned-alpha-3b",
    device_map="auto",
    stopping_ids=[50278, 50279, 50277, 1, 0],
    tokenizer_kwargs={"max_length": 4096},
    # uncomment this if using CUDA to reduce memory usage
    # model_kwargs={"torch_dtype": torch.float16}
)
service_context = ServiceContext.from_defaults(chunk_size=512, llm=llm, embed_model="local:BAAI/bge-large-en")

import os.path
from llama_index import load_index_from_storage
index_set = {}
for year in years:
    # check if storage already exists
    if not os.path.exists(f"./storage/{year}"):
        storage_context = StorageContext.from_defaults()
        cur_index = VectorStoreIndex.from_documents(
            doc_set[year],
            service_context=service_context,
            storage_context=storage_context,
        )
        index_set[year] = cur_index
        storage_context.persist(persist_dir=f"./storage/{year}")
    else:
        # Load indices from disk
        storage_context = StorageContext.from_defaults(
            persist_dir=f"./storage/{year}"
        )
        cur_index = load_index_from_storage(
            storage_context, service_context=service_context
        )
        index_set[year] = cur_index

from llama_index.tools import QueryEngineTool, ToolMetadata
individual_query_engine_tools = [
    QueryEngineTool(
        query_engine=index_set[year].as_query_engine(),
        metadata=ToolMetadata(
            name=f"vector_index_{year}",
            description=f"useful for when you want to answer queries about the {year} SEC 10-K for Uber",
        ),
    )
    for year in years
]

from transformers import AutoModelForCausalLM, AutoTokenizer

# 注意:下面只是一个最简的本地大模型问答封装,并没有真正把上面定义的查询引擎工具接入agent;
# 带工具调用能力的agent写法见下一节中的 OpenAIAgent.from_tools(...)。
class HuggingFaceModelAgent:
    def __init__(self, model_name):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
        self.model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

    def answer(self, prompt, max_length=1024):
        # 将prompt编码后交给模型生成,再解码为文本返回
        input_ids = self.tokenizer.encode(prompt, return_tensors='pt')
        output = self.model.generate(input_ids, max_length=max_length, num_return_sequences=1)
        response = self.tokenizer.decode(output[0], skip_special_tokens=True)
        return response

agent = HuggingFaceModelAgent("THUDM/chatglm3-6b")

0x5:测试Agent

我们现在可以用各种查询来测试这个Agent。

如果我们用一个简单的“hello”查询来测试它,Agent不会使用任何工具。

如果我们用一个关于给定年份10-K报告的查询来测试它,Agent将会使用相关的向量索引工具。

最后,如果我们使用一个查询来比较/对比多年来的风险因素,Agent将会使用子问题查询引擎工具。

from llama_hub.file.unstructured.base import UnstructuredReader
from llama_index.llms import HuggingFaceLLM
from pathlib import Path

import openai
import os
os.environ["OPENAI_API_KEY"] = "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"
openai.api_key = os.environ["OPENAI_API_KEY"]

years = [2022, 2021, 2020, 2019]

loader = UnstructuredReader()
doc_set = {}
all_docs = []
for year in years:
    year_docs = loader.load_data(
        file=Path(f"./data/UBER/UBER_{year}.html"), split_documents=False
    )
    # insert year metadata into each year
    for d in year_docs:
        d.metadata = {"year": year}
    doc_set[year] = year_docs
    all_docs.extend(year_docs)

# initialize simple vector indices
from llama_index import VectorStoreIndex, ServiceContext, StorageContext

# setup prompts - specific to StableLM
from llama_index.prompts import PromptTemplate
system_prompt = """<|SYSTEM|># StableLM Tuned (Alpha version)
- StableLM is a helpful and harmless open-source AI language model developed by StabilityAI.
- StableLM is excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
- StableLM is more than just an information source, StableLM is also able to write poetry, short stories, and make jokes.
- StableLM will refuse to participate in anything that could harm a human.
"""
# This will wrap the default prompts that are internal to llama-index
query_wrapper_prompt = PromptTemplate("<|USER|>{query_str}<|ASSISTANT|>")

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.7, "do_sample": False},
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="StabilityAI/stablelm-tuned-alpha-3b",
    model_name="StabilityAI/stablelm-tuned-alpha-3b",
    device_map="auto",
    stopping_ids=[50278, 50279, 50277, 1, 0],
    tokenizer_kwargs={"max_length": 4096},
    # uncomment this if using CUDA to reduce memory usage
    # model_kwargs={"torch_dtype": torch.float16}
)
# service_context = ServiceContext.from_defaults(chunk_size=512, llm=llm, embed_model="local:BAAI/bge-large-en")
service_context = ServiceContext.from_defaults(chunk_size=512)

import os.path
from llama_index import load_index_from_storage
index_set = {}
for year in years:
    # check if storage already exists
    if not os.path.exists(f"./storage/{year}"):
        storage_context = StorageContext.from_defaults()
        cur_index = VectorStoreIndex.from_documents(
            doc_set[year],
            service_context=service_context,
            storage_context=storage_context,
        )
        index_set[year] = cur_index
        storage_context.persist(persist_dir=f"./storage/{year}")
    else:
        # Load indices from disk
        storage_context = StorageContext.from_defaults(
            persist_dir=f"./storage/{year}"
        )
        cur_index = load_index_from_storage(
            storage_context, service_context=service_context
        )
        index_set[year] = cur_index

from llama_index.tools import QueryEngineTool, ToolMetadata
individual_query_engine_tools = [
    QueryEngineTool(
        query_engine=index_set[year].as_query_engine(),
        metadata=ToolMetadata(
            name=f"vector_index_{year}",
            description=f"useful for when you want to answer queries about the {year} SEC 10-K for Uber",
        ),
    )
    for year in years
]
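
# (示意)如果希望agent也能跨年份综合回答,可以把子问题查询引擎包装成一个工具一并传入
# (基于旧版0.9.x API,工具名称与描述为假设)
from llama_index.query_engine import SubQuestionQueryEngine

sub_question_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=individual_query_engine_tools,
    service_context=service_context,
)
sub_question_tool = QueryEngineTool(
    query_engine=sub_question_engine,
    metadata=ToolMetadata(
        name="sub_question_query_engine",
        description="useful for when you want to answer queries that require analyzing multiple SEC 10-K documents for Uber",
    ),
)
tools = individual_query_engine_tools + [sub_question_tool]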

from llama_index.agent import OpenAIAgent
agent = OpenAIAgent.from_tools(tools, verbose=True)

response = agent.chat("hi, i am bob")
print(str(response))

response = agent.chat(
    "What were some of the biggest risk factors in 2020 for Uber?"
)
print(str(response))

response = agent.chat("Compare/contrast the risk factors described in the Uber 10-K across years. Give answer in bullet points.")
print(str(response))