RAG in Practice 4: What Happens During the RAG Process?

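In RAG in Practice 3 we showed how to trace which document chunks are used for retrieval-augmented generation. What we still have not seen is what actually happens inside the RAG pipeline: why can the LLM answer based on the retrieved document chunks? This post answers that question with a simple example.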

Please read RAG in Practice 3 before reading this post.

Answer: why can the LLM respond based on the retrieved document chunks?

First, run the following code:

import logging
import sys
import torch
from llama_index.core import PromptTemplate, Settings, StorageContext, load_index_from_storage
from llama_index.core.callbacks import LlamaDebugHandler, CallbackManager
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.huggingface import HuggingFaceLLM

# Configure logging
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))


# Define the system prompt
SYSTEM_PROMPT = """You are a helpful AI assistant."""
query_wrapper_prompt = PromptTemplate(
    "[INST]<<SYS>>\n" + SYSTEM_PROMPT + "<</SYS>>\n\n{query_str}[/INST] "
)

# Create a local LLM with llama-index
llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=2048,
    generate_kwargs={"temperature": 0.0, "do_sample": False},
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name='/yldm0226/models/Qwen1.5-14B-Chat',
    model_name='/yldm0226/models/Qwen1.5-14B-Chat',
    device_map="auto",
    model_kwargs={"torch_dtype": torch.float16},
)
Settings.llm = llm

# Use LlamaDebugHandler as a callback handler to trace the events that occur while LlamaIndex runs
llama_debug = LlamaDebugHandler(print_trace_on_end=True)
callback_manager = CallbackManager([llama_debug])
Settings.callback_manager = callback_manager

# Build a local embedding model with llama-index-embeddings-huggingface
Settings.embed_model = HuggingFaceEmbedding(
    model_name="/yldm0226/RAG/BAAI/bge-base-zh-v1.5"
)

# Load the embedding vectors and the vector index from storage
storage_context = StorageContext.from_defaults(persist_dir="doc_emb")
index = load_index_from_storage(storage_context)
# Build the query engine
query_engine = index.as_query_engine(similarity_top_k=5)

# Run the query and get the answer
response = query_engine.query("不耐疲劳，口燥、咽干可能是哪些证候？")
print(response)

# get_llm_inputs_outputs returns the start/end event pair for each LLM call
event_pairs = llama_debug.get_llm_inputs_outputs()
# print(event_pairs[0][1].payload.keys())
print(event_pairs[0][1].payload["formatted_prompt"])

The output is long, so let's go through it piece by piece.

First, find output similar to the following:

**********
Trace: query
    |_query ->  14.458354 seconds
      |_retrieve ->  0.845918 seconds
        |_embedding ->  0.71383 seconds
      |_synthesize ->  13.612246 seconds
        |_templating ->  2e-05 seconds
        |_llm ->  13.60905 seconds
**********

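The output above records the stages our query went through and how long each one took. The whole process has two stages: retrieval (retrieve) and synthesis (synthesize).

In the synthesis stage, the templating step fills a prompt template with our query and the retrieved document chunks to form a new query. The LLM is then called with that new query to produce the final response.

So, once we find the new query built by templating, we will know why the LLM can answer based on the documents we retrieved.

In the output, find the part below the response: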

[INST]<<SYS>>
You are a helpful AI assistant.<</SYS>>

Context information is below.
---------------------
file_path: document/中医临床诊疗术语证候.txt

4.6.1.1
    津液不足证  syndrome/pattern of fluid and humor insufficiency
    津亏证
    因津液生成不足，或嗜食辛辣，蕴热化燥，邪热灼损津液所致。临床以口眼喉鼻及皮肤等干燥，大便干结，小便短少，舌质偏红而干，脉细数等为特征的证候。

4.6.1.

file_path: document/中医临床诊疗术语证候.txt

临床以口干、舌燥，频饮而不解其渴，食多、善饥，夜尿频多，逐渐消瘦，舌质红，舌苔薄黄或少，脉弦细或滑数，伴见皮肤干燥，四肢乏力，大便干结等为特征的证候。

4.6.3.2
    津亏热结证  syndrome/pattern of fluid depletion and heat binding
    液干热结证
    因津液亏虚，热邪内结所致。

file_path: document/中医临床诊疗术语证候.txt

临床以口眼喉鼻及皮肤等干燥，大便干结，小便短少，舌质偏红而干，脉细数等为特征的证候。

4.6.1.2
    津液亏涸证  syndrome/pattern of fluid and humor scantiness
    津液亏耗证
    津液干枯证
    因津液亏损，形体官窍失养所致。临床以口干、唇裂，鼻燥无涕，皮肤干瘪，目陷、螺瘪，甚则肌肤甲错，舌质红而少津，舌中裂，脉细或数，可伴见口渴、欲饮，干咳，目涩，大便干，小便少等为特征的证候。

file_path: document/中医临床诊疗术语证候.txt

临床以鼻咽干涩或痛，口唇燥干，舌质红，舌苔白或燥，脉浮或微数，伴见发热、无汗，头痛或肢节酸痛等为特征的证候。

3.6.3.2
    燥干清窍证  syndrome/pattern of dryness harassing the upper orifices
    因气候或环境干燥，津液耗损，清窍失濡所致。临床以口鼻、咽喉干燥，两眼干涩，少泪、少涕、少津、甚则衄血，舌质瘦小、舌苔干而少津，脉细等为特征的证候。

file_path: document/中医临床诊疗术语证候.txt

6.3.1
    津伤化燥证  syndrome/pattern of fluid damage transforming into dryness
    津伤燥热证
    因燥热内蕴，或内热化燥，伤津耗液所致。临床以口干、舌燥，频饮而不解其渴，食多、善饥，夜尿频多，逐渐消瘦，舌质红，舌苔薄黄或少，脉弦细或滑数，伴见皮肤干燥，四肢乏力，大便干结等为特征的证候。

4.6.3.
---------------------
Given the context information and not prior knowledge, answer the query.
Query: 不耐疲劳，口燥、咽干可能是哪些证候？
Answer: [/INST]

The long text above is printed by the statement print(event_pairs[0][1].payload["formatted_prompt"]); it is the new query produced by templating.

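Now we can answer why the LLM can respond based on the retrieved document chunks. Our original query, "不耐疲劳，口燥、咽干可能是哪些证候？" ("Which syndromes might intolerance of fatigue, dry mouth, and dry throat indicate?"), has been turned into the long new query above. That new query supplies the retrieved chunks as context and instructs the LLM to answer the original query from that context rather than from its prior knowledge, so the LLM responds based on the retrieved chunks. (This is, in essence, what RAG is.)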
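The templating step can be approximated in a few lines of plain Python. This is only a sketch that mimics the shape of the prompt printed above: QA_TEMPLATE is copied from that output, while build_prompt and the shortened chunk strings are hypothetical names and stand-ins, not LlamaIndex APIs or real retrieval results. The system-prompt wrapper ([INST]<<SYS>>...) is left out for brevity.

```python
# Illustrative sketch only: reproduces the layout of the formatted prompt shown above.
# build_prompt is a hypothetical helper, not part of LlamaIndex.
QA_TEMPLATE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)


def build_prompt(query: str, chunks: list[str]) -> str:
    """Join the retrieved chunks and substitute them, with the query, into the template."""
    context = "\n\n".join(chunks)
    return QA_TEMPLATE.format(context_str=context, query_str=query)


# The chunk strings are shortened stand-ins for retrieved document chunks.
prompt = build_prompt(
    "不耐疲劳，口燥、咽干可能是哪些证候？",
    ["津液不足证 ...", "津亏热结证 ..."],
)
print(prompt)
```

The point of the sketch is that the LLM never sees the knowledge base itself, only whatever text ends up between the two dashed lines.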

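One thing to notice is that the new query mixes Chinese and English. This is because LlamaIndex's built-in prompt templates are written in English. LlamaIndex lets you customize the query pipeline and supply your own Chinese templates, but since the mixed-language prompt already works for our purposes, we will not go into that here.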

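In addition, event_pairs holds plenty of other useful information. You can print it or step through it in a debugger to find what you need for your own problem. For example, the commented-out line # print(event_pairs[0][1].payload.keys()) prints all the payload keys available on the end event.

Here is the model's reply: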

从提供的中医临床证候信息来看，口燥、咽干的症状可能与以下证候相关：

1. 津液不足证：由于津液生成不足或者体内燥热导致，表现为口眼喉鼻干燥，咽干是其中的一个症状。

2. 津亏热结证：津液亏虚加上热邪内结，也可能出现口燥和咽干。

3. 津液亏涸证：严重的津液亏损可能导致口唇干燥、咽部干燥，伴随其他严重脱水症状。

4. 燥干清窍证：气候干燥或体质原因引起的津液缺乏，口鼻咽喉干燥也是其特征。

5. 津伤化燥证：燥热内蕴或内热化燥损伤津液，也会出现口燥、频饮但不解渴的现象。

因此，这些证候都有可能与不耐疲劳和口燥、咽干的症状相符合，需要结合其他临床表现来确定具体的证候类型。建议在中医诊断中由专业医生根据全人情况判断。

Going Further

Next, let's trace a more complex RAG process.

Earlier we mentioned the two stages: retrieval (retrieve) and synthesis (synthesize).

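In the retrieval stage, the retriever determines how relevant context is fetched from the knowledge base for a query. So far we have only used the defaults. LlamaIndex also provides node postprocessors, which filter or reorder the retrieved nodes before they reach the synthesis stage. Commonly used ones include: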

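SimilarityPostprocessor: Sets a threshold with similarity_cutoff and removes nodes whose similarity score is below it.
KeywordNodePostprocessor: Uses required_keywords and exclude_keywords to keep or drop nodes based on the keywords they contain.
MetadataReplacementPostProcessor: Replaces the node content with data from the node's metadata.
LongContextReorder: Reorders nodes, which helps when a large number of top results is needed and mitigates the difficulty models have with long contexts.
SentenceEmbeddingOptimizer: Uses percentile_cutoff or threshold_cutoff as the relevance criterion and removes irrelevant sentences based on embeddings.
CohereRerank: Reranks nodes with Cohere's rerank service and returns the top N results.
SentenceTransformerRerank: Reranks nodes with a SentenceTransformer cross-encoder and returns the top N nodes.
LLMRerank: Reranks nodes with an LLM, which assigns a relevance score to each node.
FixedRecencyPostprocessor: Returns nodes sorted by date.
EmbeddingRecencyPostprocessor: Sorts nodes by date, and also removes older nodes that are similar to newer ones according to embedding similarity.
TimeWeightedPostprocessor: Reranks nodes with a bias toward information that has not been returned recently.
PIINodePostprocessor (beta): Removes personally identifiable information, using a local LLM or an NER model.
PrevNextNodePostprocessor (beta): Based on node relationships, also fetches the nodes that come before, after, or on both sides of each retrieved node.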
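Most entries in the list above take the retrieved chunks and return a filtered or reordered list. As one illustration, here is a minimal sketch of the idea behind LongContextReorder: put the highest-scoring chunks at both ends of the context and the weakest in the middle. reorder_for_long_context and the (name, score) pairs are hypothetical, written for this post; this is not the library's implementation.

```python
# Sketch of the reordering idea only; reorder_for_long_context is a hypothetical
# helper and the scores are made-up values for illustration.
def reorder_for_long_context(scored_chunks):
    """Place the highest-scoring chunks at both ends and the weakest in the middle."""
    ordered = sorted(scored_chunks, key=lambda pair: pair[1])  # ascending by score
    reordered = []
    for i, pair in enumerate(ordered):
        if i % 2 == 0:
            reordered.insert(0, pair)  # even index: push to the front
        else:
            reordered.append(pair)     # odd index: push to the back
    return reordered


chunks = [("a", 0.5), ("b", 0.4), ("c", 0.3), ("d", 0.2), ("e", 0.1)]
reordered = reorder_for_long_context(chunks)
print([name for name, _ in reordered])  # ['a', 'c', 'e', 'd', 'b']
```

The two strongest chunks ("a" and "b") end up first and last, and the weakest ("e") lands in the middle.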

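In the synthesis stage, the response synthesizer guides the LLM to generate a response by combining the user query with the retrieved text chunks.

Suppose you have a pile of documents, ask a question, and want an answer based on them. The response synthesizer works the way a person would: it goes through the documents, finds the relevant information, and writes a reply.

The retriever is responsible for pulling out the relevant text chunks, as discussed above. The response synthesizer gathers those chunks and produces a well-formed answer.

LlamaIndex provides several response synthesizers: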

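Refine: Goes through the text chunks one at a time, refining the answer step by step.
Compact: A compact version of Refine. It first packs the chunks together, so fewer steps are needed.
Tree Summarize: Produces many small answers, combines them, and summarizes again until a single final answer remains.
Simple Summarize: Simply truncates the text chunks and gives a quick summary.
No Text: Does not produce an answer; it only tells you which text chunks would have been used.
Accumulate: Produces a separate small answer for each text chunk, then joins them together.
Compact Accumulate: A combination of Compact and Accumulate.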

Retrievers and response synthesizers can both be customized as well; we will not discuss that here.

Now let's pick a node postprocessor and a response synthesizer: SimilarityPostprocessor for the former and Refine for the latter.

The code is as follows:

import logging
import sys
import torch
from llama_index.core import PromptTemplate, Settings, SimpleDirectoryReader, \
    VectorStoreIndex, get_response_synthesizer
from llama_index.core.callbacks import LlamaDebugHandler, CallbackManager
from llama_index.core.indices.vector_store import VectorIndexRetriever
from llama_index.core.postprocessor import SimilarityPostprocessor
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.response_synthesizers import ResponseMode
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.huggingface import HuggingFaceLLM

# Configure logging
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

# Define the system prompt
SYSTEM_PROMPT = """You are a helpful AI assistant."""
query_wrapper_prompt = PromptTemplate(
    "[INST]<<SYS>>\n" + SYSTEM_PROMPT + "<</SYS>>\n\n{query_str}[/INST] "
)

# Create a local LLM with llama-index
llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=2048,
    generate_kwargs={"temperature": 0.0, "do_sample": False},
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name='/yldm0226/models/Qwen1.5-14B-Chat',
    model_name='/yldm0226/models/Qwen1.5-14B-Chat',
    device_map="auto",
    model_kwargs={"torch_dtype": torch.float16},
)
Settings.llm = llm

# Use LlamaDebugHandler as a callback handler to trace the events that occur while LlamaIndex runs
llama_debug = LlamaDebugHandler(print_trace_on_end=True)
callback_manager = CallbackManager([llama_debug])
Settings.callback_manager = callback_manager

# Build a local embedding model with llama-index-embeddings-huggingface
Settings.embed_model = HuggingFaceEmbedding(
    model_name="/yldm0226/RAG/BAAI/bge-base-zh-v1.5"
)

# Load the documents and build the index
documents = SimpleDirectoryReader("document").load_data()
index = VectorStoreIndex.from_documents(documents)

# Build the retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=5,
)

# Build the response synthesizer
response_synthesizer = get_response_synthesizer(
    response_mode=ResponseMode.REFINE
)

# Build the query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.6)],
)

# Run the query and get the answer
response = query_engine.query("不耐疲劳，口燥、咽干可能是哪些证候？")
print(response)

# get_llm_inputs_outputs returns the start/end event pair for each LLM call
event_pairs = llama_debug.get_llm_inputs_outputs()
print(event_pairs[0][1].payload["formatted_prompt"])

After running the code, you can find output similar to the following:

**********
Trace: query
    |_query ->  33.425664 seconds
      |_synthesize ->  33.403238 seconds
        |_templating ->  2e-05 seconds
        |_llm ->  7.425154 seconds
        |_templating ->  2.5e-05 seconds
        |_llm ->  4.763223 seconds
        |_templating ->  2.4e-05 seconds
        |_llm ->  6.601226 seconds
        |_templating ->  2.2e-05 seconds
        |_llm ->  6.878335 seconds
        |_templating ->  2.2e-05 seconds
        |_llm ->  7.726241 seconds
**********

As you can see, after replacing the default Compact response synthesizer with Refine, the stages the query goes through change: REFINE mode performs more templating steps and more LLM calls.
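The five templating/llm pairs in the trace above are consistent with Refine issuing one LLM call per retrieved chunk (similarity_top_k=5). The sketch below shows only that control flow, under stated assumptions: fake_llm, refine_answer and compact_answer are hypothetical stand-ins, the prompts are simplified strings, and a character budget (max_chars) stands in for the token counting that a real synthesizer would do.

```python
# Control-flow sketch only. fake_llm, refine_answer and compact_answer are
# hypothetical helpers written for illustration, not LlamaIndex APIs.
calls = []


def fake_llm(prompt: str) -> str:
    """Stand-in for an LLM: records the prompt and returns a placeholder answer."""
    calls.append(prompt)
    return f"answer#{len(calls)}"


def refine_answer(query, chunks, llm):
    """One LLM call per chunk: answer from the first, then refine with each later one."""
    answer = llm(f"Context: {chunks[0]}\nQuery: {query}")
    for chunk in chunks[1:]:
        answer = llm(f"Existing answer: {answer}\nNew context: {chunk}\nQuery: {query}")
    return answer


def compact_answer(query, chunks, llm, max_chars):
    """Pack chunks into as few prompts as fit in max_chars, then refine over the packs."""
    packs, current = [], ""
    for chunk in chunks:
        candidate = f"{current}\n\n{chunk}" if current else chunk
        if current and len(candidate) > max_chars:
            packs.append(current)  # current pack is full, start a new one
            current = chunk
        else:
            current = candidate
    if current:
        packs.append(current)
    return refine_answer(query, packs, llm)


chunks = ["chunk one", "chunk two", "chunk three", "chunk four", "chunk five"]

refine_answer("q", chunks, fake_llm)
refine_calls = len(calls)

calls.clear()
compact_answer("q", chunks, fake_llm, max_chars=1000)
compact_calls = len(calls)

print(refine_calls, compact_calls)  # 5 1
```

With five short chunks, the refine loop makes five calls, while the compact variant fits everything into one prompt and makes a single call. That is the trade-off between the two modes: Refine spends more LLM time, Compact needs enough context window to hold the packed chunks.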

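The new query is shown below. It is the prompt of the first LLM call (event_pairs[0]) and uses the same template as before: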

[INST]<<SYS>>
You are a helpful AI assistant.<</SYS>>

Context information is below.
---------------------
file_path: document/中医临床诊疗术语证候.txt

临床以干咳、痰少，或痰中带血，口渴，鼻咽燥痛，声音嘶哑，肌肤枯燥，舌质红而干，舌苔少，脉虚数，伴见低热，神疲、乏力，语声低微，盗汗，大便干结等为特征的证候。

5.4.1.5.1.1
    肺燥津伤证  syndrome/pattern of lung dryness with fluid damage
    肺燥津亏证
    因燥邪袭肺，津液亏虚，肺燥失润所致。临床以干咳、少痰，咽干，口燥，鼻燥，喉痒，舌质红，舌苔少津，脉浮细数等为特征的证候。

5.4.1.5.1.2
    肺燥伤阴证  syndrome/pattern of lung dryness damaging yin
    因肺热化燥，伤及阴津所致。临床以咳嗽，痰少或无，痰黄而黏，口干、咽燥，烦渴、多饮，小便短少，舌质红，舌苔焦黄，脉弦数，可伴见潮热、颧红等为特征的证候。

5.4.1.5.1.3
    肺燥阴虚证  syndrome/pattern of lung dryness with yin deficiency
    阴虚肺燥证
    因阴液亏虚，肺燥失润所致。临床以午后潮热，干咳、痰少，喉痒、鼻燥、少涕，咽干、烦渴，消瘦，舌质红，舌苔少，脉细数，伴见盗汗浸衣，心烦、失眠等为特征的证候。

5.4.1.5.2
    肺燥郁热证  syndrome/pattern of lung dryness with stagnated heat
    肺燥化热证
    因忧劳伤肺，郁热化燥，伤及肺津所致。临床以发热、烦渴，咳嗽、痰少而黏，胸胁灼痛，大便干结，小便短少，舌质红而干，舌苔薄黄，脉弦数等为特征的证候。

5.4.1.6
    肺经证  syndrome/pattern of lung meridian (vessel)
    泛指因各种原因致使肺经循行部位异常所引起的一类证候。

5.4.1.6.1
    肺经风热证  syndrome/pattern of wind and heat in the lung meridian
    因风热邪客肺经，或风热郁滞肤腠，外发于头面所致。
---------------------
Given the context information and not prior knowledge, answer the query.
Query: 不耐疲劳，口燥、咽干可能是哪些证候？
Answer: [/INST]

Also, because we added a SimilarityPostprocessor with the similarity threshold set to 0.6, retrieved chunks with a similarity score below 0.6 are removed.

Finally, let's look at the model's reply:
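The effect of the cutoff can be pictured with a small sketch. apply_similarity_cutoff is a hypothetical helper and the scores are made-up values; the sketch assumes that a chunk scoring exactly at the cutoff is kept.

```python
# Sketch of a similarity cutoff; apply_similarity_cutoff is a hypothetical helper
# and the scores below are invented for illustration, not real retrieval scores.
def apply_similarity_cutoff(scored_chunks, cutoff):
    """Keep only the chunks whose similarity score is at least the cutoff."""
    return [(text, score) for text, score in scored_chunks if score >= cutoff]


retrieved = [("chunk A", 0.71), ("chunk B", 0.64), ("chunk C", 0.58)]
kept = apply_similarity_cutoff(retrieved, cutoff=0.6)
print(kept)  # [('chunk A', 0.71), ('chunk B', 0.64)]
```

Note that the filter runs after retrieval, so similarity_top_k=5 is an upper bound on the number of chunks that reach the synthesizer, not a guarantee.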

从中医角度看，不耐疲劳、口燥、咽干的症状可能涉及多个证候，如燥邪犯肺证（4.6.3.3），由于燥气耗伤肺津；津亏热结证（4.6.3.2）或津枯肠结证（4.6.3.3），表现为体内津液亏损且伴有热象，导致口干、便秘等；肺胃阴虚证（5.6.4.4.2.2）和心肾阴虚（5.1.1.1.1），特别是心肾不交时，也会出现类似症状。此外，心系证中的心寒证（5.1.1.1）如心中寒证也可能表现出口干咽燥。具体诊断需根据临床表现、体质和相关检查结果来确定。

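As this example shows, we can freely combine different retrievers, node postprocessors, and response synthesizers to meet our needs. When the ones LlamaIndex provides are not enough, we can also write custom retrievers and response synthesizers; interested readers can explore this on their own.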

posted @ 2024-03-09 12:26 一蓑烟雨度平生
