Proj. CLJ Paper Reading: Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting

Abstract

  • Paper: Speculative RAG
  • Task: improving retrieval-augmented generation by combining RAG with LLM refinement
  • Method: use a larger generalist LM to verify RAG drafts generated by a smaller, distilled specialist LM.
    • Each draft is generated from a different subset of the retrieved documents, each subset covering a distinct perspective
    • This reduces the input token count per draft
  • Results
    • Improves comprehension of each subset and reduces potential position bias over long contexts
    • Experiments
      • Datasets: TriviaQA, MuSiQue, PubHealth, and ARC-Challenge
      • Results
        1. achieved state-of-the-art performance (or merely on par?)
        2. reduced latency
        3. on PubHealth: +12.97% accuracy, -51% latency
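The "one document per cluster" subset construction summarized above can be sketched with toy 2-D embeddings. The nearest-centroid assignment and the function names (`cluster_docs`, `make_subsets`) are illustrative assumptions, not the paper's actual clustering algorithm.

```python
# Minimal sketch: cluster retrieved docs, then build subsets by taking
# one document per cluster. Toy vectors, not real embeddings.

def cluster_docs(docs, centroids):
    """Assign each (doc_id, vector) to its nearest centroid (squared L2)."""
    clusters = {i: [] for i in range(len(centroids))}
    for doc_id, vec in docs:
        dists = [sum((v - c) ** 2 for v, c in zip(vec, cen)) for cen in centroids]
        clusters[dists.index(min(dists))].append(doc_id)
    return clusters

def make_subsets(clusters, k):
    """Build k subsets, each containing one document per non-empty cluster."""
    return [
        [members[i % len(members)] for members in clusters.values() if members]
        for i in range(k)
    ]

docs = [("d1", (0.0, 0.0)), ("d2", (0.1, 0.0)),
        ("d3", (5.0, 5.0)), ("d4", (5.1, 4.9))]
centroids = [(0.0, 0.0), (5.0, 5.0)]
clusters = cluster_docs(docs, centroids)   # {0: ['d1', 'd2'], 1: ['d3', 'd4']}
subsets = make_subsets(clusters, 2)        # [['d1', 'd3'], ['d2', 'd4']]
```

Each subset mixes one document from every perspective cluster, which is what keeps the per-draft context short while still diverse.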

1. Intro

https://blog.langchain.dev/agentic-rag-with-langgraph/

Self-Reflective RAG

Instruction-tune the model so it scores the relevance of the retrieved documents and generates the final answer only from the relevant ones.
Note the state-machine framing here: a state can be e.g. retrieval, grade documents, or re-write query.


Q1: If the retrieved documents are low quality, should everything be re-retrieved, or should the relevant keywords be extracted from the poorly-matching documents and retrieval run again?

  • Refinement Options:
    • Option 1: Query Rewriting: Construct a new query focusing on specific missing terms like "environmental benefits" and "wind energy emissions." Example: "Wind energy environmental impact emissions wildlife."
    • Option 2: Selective Supplementation:
      • Retain Doc 2 (General renewable energy) and retrieve new documents to replace Doc 1 and Doc 3.
    • Option 3: Total Redo
  • The answer appears to be rewriting the retrieval query
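The three refinement options above amount to a decision rule over per-document relevance scores. A hypothetical sketch, where the scores and the 0.5 threshold are made up for illustration:

```python
# Toy refinement planner: keep documents above a relevance threshold,
# re-retrieve only the weak ones (Option 2), and fall back to a full
# query rewrite when nothing is relevant (Option 1/3).

def plan_refinement(doc_scores, threshold=0.5):
    """Return (kept_docs, action) from per-document relevance scores."""
    kept = [doc for doc, score in doc_scores.items() if score >= threshold]
    if not kept:
        return [], "rewrite_query"    # Option 1/3: start over with a new query
    if len(kept) < len(doc_scores):
        return kept, "supplement"     # Option 2: replace only weak documents
    return kept, "answer"             # all documents are good enough

scores = {"doc1": 0.2, "doc2": 0.8, "doc3": 0.3}
kept, action = plan_refinement(scores)   # kept == ["doc2"], action == "supplement"
```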

Corrective RAG

Use a lightweight evaluator to score document relevance; for low-scoring documents, expand the context with web search (Tavily API). Then split these documents by sentence into knowledge strips (e.g. one statement or one action per strip), cluster the strips into general strips, score those general strips for relevance, and drop the irrelevant ones. Finally, generate the answer from the relevant strips.

Q1: Is the lightweight evaluator a traditional ML method?
A: It can be anything: an LLM API such as GPT-4, or a fine-tuned model.

Q2: Knowledge strip:

  • E.g.:
    • Knowledge Strip 1:
      • "Solar energy is a renewable energy source that harnesses sunlight to generate electricity or heat."
    • Knowledge Strip 2:
      • "Solar panels are made of photovoltaic cells, which convert sunlight into electrical energy."
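CRAG's divide-then-combine step over strips like these can be sketched as follows. The word-overlap relevance function is a toy stand-in for the real evaluator, and the threshold is an assumption:

```python
# Minimal sketch: divide a document into sentence-level knowledge strips,
# score each strip against the query, keep only the relevant ones.

def to_strips(document):
    """Divide: one knowledge strip per sentence."""
    return [s.strip() for s in document.split(".") if s.strip()]

def relevance(strip, query):
    """Toy scorer: fraction of query words that appear in the strip."""
    q_words = set(query.lower().split())
    return len(q_words & set(strip.lower().split())) / len(q_words)

def filter_strips(document, query, threshold=0.3):
    """Combine: keep strips whose relevance exceeds the threshold."""
    return [s for s in to_strips(document) if relevance(s, query) > threshold]

doc = ("Solar energy is a renewable energy source. "
       "Solar panels convert sunlight into electrical energy. "
       "The company was founded in 1999")
kept = filter_strips(doc, "solar energy source")  # drops the off-topic sentence
```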

Q3: The difference between an NLI model and a retrieval scorer

  • Relevance is somewhat similar to entailment, so a retrieval scorer can loosely be called a kind of NLI model.
  • But the core purpose of NLI is extracting logical relations such as entailment and contradiction, while a retrieval scorer leans toward relevance scoring.

Q4: Why is CRAG still judged as making "no change to reasoning capabilities" despite using knowledge strips?

  • It does not strengthen the reasoning steps of answer generation itself.
  • Improving the quality of the retrieved documents does not count as improving reasoning.

Q5: Comparing Self-Reflective RAG and Corrective RAG:

  • Steps:
    • Self-Reflective: Query -> Retrieve -> Reflect (relevance + quality; if not good enough, re-query) -> Generate answer with reflection -> Output
    • CRAG: Query -> Retrieve -> Evaluate (relevance; if not good, re-query, expand with web search, or discard) -> Expand irrelevant or contradictory documents via web search -> Knowledge strips: divide-then-combine to filter irrelevant content -> Generate answer -> Output

Q1: Comparison with Speculative RAG:

  • Steps:
    • Query -> Retrieve -> Divide into multiple clusters -> Form subsets by taking one document from each cluster; for the multiple subsets (in parallel) -> Select the best draft, Generate (if not good enough, re-query) -> Output
    • Parallel steps 1..n: multiple smaller specialist models each process one subset, generating a draft and a corresponding rationale
    • Parallel step n+1: the generalist LLM reasons over and scores each draft and rationale
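The parallel draft-then-verify steps above can be sketched with stubs. `draft()` stands in for the small specialist LM (returning a draft plus a rationale) and `verify()` for the generalist LM's scoring pass; both are placeholders, not real model calls, and the length-based score is purely illustrative.

```python
# Hypothetical sketch of Speculative RAG's parallel flow.
from concurrent.futures import ThreadPoolExecutor

def draft(subset):
    """Specialist LM stub: one (draft, rationale) pair per document subset."""
    answer = f"answer from {'+'.join(subset)}"
    rationale = f"based on {len(subset)} documents"
    return answer, rationale

def verify(pair):
    """Generalist LM stub: score a (draft, rationale) pair; here, simply
    by answer length, purely for illustration."""
    answer, _rationale = pair
    return len(answer)

subsets = [["d1", "d3"], ["d2", "d4", "d5"]]
with ThreadPoolExecutor() as pool:       # parallel steps 1..n
    drafts = list(pool.map(draft, subsets))
best = max(drafts, key=verify)           # step n+1: pick the best-scoring draft
```

The design point is visible even in the stub: drafting is embarrassingly parallel across subsets, and the generalist model only ever sees short (draft, rationale) pairs rather than the full retrieved context.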

Q2: Why emphasize "no need to instruction-tune the generalist LM"? What is the selling point?

  • Self-Reflective RAG and Corrective RAG still need to instruction-tune the generalist LM, whereas Speculative RAG only fine-tunes the small, specialized draft-generation model.
  • The selling point is separating draft generation (specialist LM) from quality verification (generalist LM).

Q3: Both Corrective RAG and Speculative RAG:

  • Use lightweight models (like smaller LMs or NLI models) for retrieval-related tasks.
  • Aim to enhance the relevance of retrieved documents, either through iterative refinement (Corrective RAG) or robust retrieval techniques (Speculative RAG).
  • Employ the generalist LM for final answer generation(?)
  • Corrective RAG also uses lightweight evaluators, but its goal is to refine the retrieved context before passing it to the generalist LM, reducing retrieval errors; Speculative RAG instead leans on retrieval to strengthen particular domain knowledge (unclear).
  • Q: Self-reflection allows the LLM to assess the adequacy of its context and adjust the answer or trigger further retrieval steps as needed, embedding self-assessment into the LLM's generation process.

Q4: Why is Speculative RAG described as more dynamic?

  • Doesn't it still need to fine-tune the small models, and is it therefore actually more costly?

Q5: Without parallelizing the small models, would Speculative RAG be slower?

  • Possibly.

3.1 Overview


Q: Doesn't this product-style score shrink toward zero very quickly?

  • Speculative Decoding: Speculative decoding (Stern et al., 2018; Xia et al., 2023; Chen et al., 2023a; Leviathan et al., 2023; Xia et al., 2024) aims to reduce auto-regressive decoding latency through a draft-then-verify paradigm. This involves drafting multiple future tokens with a small model and verifying them in parallel with the target model (Xia et al., 2024). The draft model is typically either an independent model from the same series (Leviathan et al., 2023; Chen et al., 2023a) or the target model itself (Zhang et al., 2023a; Cai et al., 2024). Our approach extends this concept from token-level drafting to answer-level drafting. In contrast to traditional verification criteria (Stern et al., 2018; Xia et al., 2023; Leviathan et al., 2023; Chen et al., 2023a; Miao et al., 2024), which accept or reject tokens based on their generation probabilities, we leverage language modeling objectives to directly assess the confidence of entire answer drafts.
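On the underflow concern raised above: product-style probability scores are conventionally computed as sums of log-probabilities, which preserves the ranking while staying numerically stable. The token probabilities below are made up for illustration, not outputs of any real model.

```python
import math

def product_score(probs):
    """Naive product of per-token probabilities; underflows for long drafts."""
    score = 1.0
    for p in probs:
        score *= p
    return score

def log_score(probs):
    """Same ranking signal computed in log space; numerically stable."""
    return sum(math.log(p) for p in probs)

long_draft = [0.5] * 2000
naive = product_score(long_draft)   # 0.5**2000 underflows to exactly 0.0
stable = log_score(long_draft)      # finite: 2000 * ln(0.5) ≈ -1386.29
```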
posted @ 雪溯