[Weekly Read] Optimize your RAG pt.1 - Data ingestion

Today's post is less a set of reading notes and more a collection of excerpts. I originally put everything in quote blocks, but the formatting didn't look good, so I changed it back to body text.

The passages below are quoted directly from the original article.

Original article: https://textgeneration.substack.com/p/optimize-your-rag-pt1-data-ingestion

Author: Iulia Brezeanu

Optimize your RAG pt.1 - Data ingestion

The biggest drawbacks of classical RAG appear in the retrieval phase, which can be affected by misaligned retrieved chunks or the failure to retrieve the relevant ones. The generation phase can present challenges when the model generates answers not grounded in the context. The augmentation phase — when the selected documents are synthesized into a coherent prompt — brings concerns about repetition and redundancy if multiple retrieved passages contain similar information.

Optimizing the Data Ingestion Process

Data preparation includes activities like removing irrelevant details, clarifying ambiguous entities and terminology, verifying factual correctness, retaining contextual information, and refreshing outdated documents.

Metadata such as tags and categories can make information retrieval more efficient. For example, when indexing scientific papers, adding metadata like location, time period, variables, and experiments helps categorize the papers and makes them easier to search and retrieve later.

...We can see that raw semantic search has low precision, so there's no guarantee the most relevant text chunks will be retrieved. Consider adding relevant metadata entities before ingesting data to increase precision and remove irrelevant candidates during retrieval.
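
To make that concrete, here is a small sketch I put together (not from the article) of metadata pre-filtering before semantic ranking; the embed() stub, field names, and texts are all illustrative assumptions:

import numpy as np

docs = [
    {"text": "Field study of coastal wetlands ...", "metadata": {"domain": "ecology", "year": 2021}},
    {"text": "Quarterly earnings call summary ...", "metadata": {"domain": "finance", "year": 2019}},
]

def embed(text):
    # Stand-in for a real embedding model; returns a normalized vector.
    rng = np.random.default_rng(abs(hash(text)) % (2 ** 32))
    v = rng.standard_normal(8)
    return v / np.linalg.norm(v)

def retrieve(query, domain, top_k=5):
    # 1) Hard-filter on metadata to cheaply remove irrelevant candidates.
    candidates = [d for d in docs if d["metadata"]["domain"] == domain]
    # 2) Rank only the survivors by embedding similarity.
    q = embed(query)
    return sorted(candidates, key=lambda d: float(np.dot(q, embed(d["text"]))), reverse=True)[:top_k]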

Semantic Representations

There are two types of embedding models: static and dynamic. As seen in models like OpenAI's text-embedding-ada-002, dynamic embeddings represent words based on their context, differing from static embeddings, where each word has a fixed vector. This means the same word can have different embeddings depending on the surrounding words.
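
As a quick illustration (my own sketch, not from the article, assuming the transformers library and the bert-base-uncased checkpoint), a contextual model gives the same word different vectors in different sentences:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence, word):
    # Return the contextual embedding of the first occurrence of `word`.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

v1 = word_vector("She sat on the river bank.", "bank")
v2 = word_vector("He deposited cash at the bank.", "bank")
print(torch.cosine_similarity(v1, v2, dim=0))  # well below 1.0: same word, different vectors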

Chunk Optimization

The chunking strategy should match the content length, whether long or short. Embedding models also behave differently depending on the block size. For instance, sentence-transformer works best with single sentences, while text-embedding-ada-002 does better with blocks of 256 or 512 tokens.

In practice, retrieving accurate query results requires adaptively using different chunking approaches. No single "best" chunking strategy fits every situation. Rather, the most suitable strategy depends on the specific context and application.
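
As one concrete starting point, here is a minimal fixed-size chunker with overlap that I sketched (not from the article), assuming the tiktoken library; the default sizes are illustrative and should be tuned per embedding model:

import tiktoken

def chunk_by_tokens(text, chunk_size=256, overlap=32):
    # cl100k_base is the tokenizer used by text-embedding-ada-002.
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = chunk_size - overlap
    return [enc.decode(tokens[i:i + chunk_size]) for i in range(0, len(tokens), step)]

A chunk_size of 256 or 512 matches the ada-002 guidance above, while sentence-level splitting suits sentence-transformer models.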

Choosing the Embedding Model

AnglE

AnglE proposes going beyond plain cosine similarity for text similarity tasks. Cosine similarity can have "saturation zones" where small changes in the angle between vectors lead to minimal changes in the cosine value, especially near -1 and 1. This can cause issues like vanishing gradients and make it harder to learn subtle differences during backpropagation.

To address this, AnglE optimizes not only the cosine similarity but also the vector angle itself. It does this by splitting the text embedding into real and imaginary parts to represent it in complex space. Then, it computes the angle difference between vectors in this complex space using complex division. Finally, it normalizes this angle difference to get a value to optimize.

This normalized angle difference ensures that text embeddings with smaller angles are considered more similar. Optimizing it directly avoids the saturation caused by using cosine similarity.
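
Here is a rough sketch of that idea as I understand it (my own simplification, not the official AnglE implementation):

import torch

def angle_difference(x, y):
    # Treat each embedding as a complex vector: first half real, second half imaginary.
    a, b = torch.chunk(x, 2, dim=-1)   # x = a + b*i
    c, d = torch.chunk(y, 2, dim=-1)   # y = c + d*i
    # Complex division: (a + bi) / (c + di) = ((ac + bd) + (bc - ad)i) / (c^2 + d^2)
    denom = c ** 2 + d ** 2
    re = (a * c + b * d) / denom
    im = (b * c - a * d) / denom
    # Angle of the quotient, averaged; smaller values mean more similar embeddings.
    return torch.atan2(im, re).abs().mean(dim=-1)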

If you plan to use AnglE for retrieval tasks, you can use the code below.

from angle_emb import AnglE, Prompts

# Load the pretrained UAE-Large-V1 checkpoint with CLS pooling on GPU.
angle = AnglE.from_pretrained('WhereIsAI/UAE-Large-V1', pooling_strategy='cls').cuda()

# Prompts.C is the prompt template recommended for retrieval tasks.
angle.set_prompt(prompt=Prompts.C)

# Each input is a dict with a 'text' key; to_numpy returns plain arrays.
vecs = angle.encode([{'text': 'hello world1'}, {'text': 'hello world2'}], to_numpy=True)
print(vecs)

Voyage

Voyage-02 is a proprietary general-purpose embedding model with a context length of 4000 tokens and 1024 dimensions. It claims to have better retrieval quality than OpenAI's ada-002.

The creators built 9 evaluation datasets to evaluate retrieval performance spanning domains like technical docs, restaurant reviews, and news. Experiments showed that Voyage-02's base model outperformed OpenAI's embeddings and other popular open-source models on these datasets.

You can use the voyageai package to create embeddings like in this example:

import voyageai

# The client reads the API key from the VOYAGE_API_KEY environment variable.
vo = voyageai.Client()

texts = [
    "The Mediterranean diet emphasizes fish, olive oil, ...",
    "Photosynthesis in plants converts light energy into ...",
    "20th-century innovations, from radios to smartphones ...",
    "Rivers provide water, irrigation, and habitat for ...",
    "Apple's conference call to discuss fourth fiscal ...",
    "Shakespeare's works, like 'Hamlet' and ...",
]

# Embed the documents
result = vo.embed(texts, model="voyage-02", input_type="document")
print(result.embeddings)
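
For search queries, the same embed call takes input_type="query", so that queries and documents are embedded asymmetrically (a usage sketch; the query text is illustrative):

# Embed a query with the matching input_type before nearest-neighbor search.
query_result = vo.embed(
    ["What does the Mediterranean diet emphasize?"],
    model="voyage-02", input_type="query"
)
print(query_result.embeddings[0])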

BGE

The Beijing Academy of Artificial Intelligence (BAAI) created the BGE models, which are available on HuggingFace and are some of the best open-source embedding models.

The models were fine-tuned on unsupervised datasets, including Wikipedia, CC-net, StackExchange, Reddit, S2orc, and sentence-transformers datasets. Then, they conducted further fine-tuning on supervised datasets like NLI, FEVER, NQ, HotpotQA, Quora/StackExchange duplicates, and MEDI. The models can also be fine-tuned to optimize retrieval relevance.

It's also very easy to use with LangChain:

from langchain_community.embeddings import HuggingFaceBgeEmbeddings

model_name = "BAAI/bge-small-en"
model_kwargs = {"device": "cpu"}
# Normalized embeddings make cosine similarity equivalent to a dot product.
encode_kwargs = {"normalize_embeddings": True}
hf = HuggingFaceBgeEmbeddings(
    model_name=model_name, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs
)

embedding = hf.embed_query("hi this is harrison")
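
One detail worth adding (not from the article): BGE recommends prepending an instruction to retrieval queries, and the LangChain wrapper exposes this as query_instruction. The string below is the default English instruction as I remember it, so treat it as an assumption to verify:

# embed_query() prepends this instruction to every query automatically.
hf = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
    query_instruction="Represent this question for searching relevant passages: ",
)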

LLM-Embedder

The LLM-Embedder is a unified embedding model that supports LLMs' diverse retrieval augmentation needs. It is trained to capture distinct semantic relationships required for various retrieval tasks, which are often subject to mutual interference.

The training methodology includes reward formulation based on LLMs' feedback, stabilized knowledge distillation, multi-task fine-tuning with explicit instructions, and homogeneous in-batch negative sampling. These strategies significantly improve retrieval augmentation for LLMs, surpassing both general-purpose and task-specific retrievers in various evaluation scenarios.
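
A usage sketch under two assumptions: that the checkpoint loads through sentence-transformers, and that the task-specific instruction prefixes below match the ones in the FlagEmbedding repository (quoted from memory; verify before use):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/llm-embedder")

# LLM-Embedder distinguishes tasks with instruction prefixes on both sides.
query_prefix = "Represent this query for retrieving relevant documents: "
key_prefix = "Represent this document for retrieval: "

q = model.encode(query_prefix + "When was the space telescope launched?")
d = model.encode(key_prefix + "The telescope was launched in December 2021.")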

Fine-tuning the Embedding Model

Embedding models are trained on large corpora, which makes them good at representing common terms as vectors in a multi-dimensional space. However, they might have a limited capacity to effectively represent niche or domain-specific words not encountered during training.

The same goes for user queries that contain ambiguous phrasing: the model might not generate the most accurate vector representations.

An effective solution is fine-tuning with data specific to the target application. It can enhance the model’s capacity to represent specialized vocabulary and align with our specific use case.

To create a fine-tuned embedding model specialized for a particular domain, we start by breaking our document into smaller chunks. Then, use a powerful LLM like GPT-4 to generate relevant questions based on each chunk.

We feed those generated questions back into the LLM, along with the original chunks as context, to produce answers. The resulting question-answer pairs can be compiled into a Q&A dataset.
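
A sketch of those two steps with the openai client library; the prompt wording and the choice of gpt-4 are my own illustrative assumptions:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def make_qa_pair(chunk):
    # Step 1: generate a question answerable from this chunk alone.
    question = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": f"Write one question answerable only from this text:\n{chunk}"}],
    ).choices[0].message.content
    # Step 2: answer it with the chunk supplied back as context.
    answer = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": f"Context:\n{chunk}\n\nQuestion: {question}\nAnswer:"}],
    ).choices[0].message.content
    return question, answer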

This curated Q&A dataset can be used to fine-tune an existing embedding model, helping it encode the meanings and relationships relevant to our domain more accurately. Now, our embedding model is fine-tuned for the nuances and subject matter contained in the original document.
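
A minimal fine-tuning sketch with sentence-transformers, pairing each generated question with its source chunk; question_chunk_pairs is assumed to come from the previous step, and the base model and hyperparameters are illustrative:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("BAAI/bge-small-en")

# Each example pairs a synthetic question with the chunk that answers it;
# the other chunks in a batch serve as in-batch negatives.
train_examples = [InputExample(texts=[question, chunk])
                  for question, chunk in question_chunk_pairs]
loader = DataLoader(train_examples, shuffle=True, batch_size=16)
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
model.save("bge-small-finetuned")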

Future work

Up to this point, we have focused on enhancing the semantic representations of our documents. While it definitely helps, retrieval systems do not always achieve optimal compatibility with certain LLMs. A solution is to directly supervise the fine-tuning process using feedback from the LLM itself by letting the LLM evaluate the retriever's outputs.

----------------------------------------------

A thought-provoking article. Every stage of a RAG pipeline affects the final result, and each stage offers a range of optimization strategies; we need to find the approach that best fits each scenario and task.

Having the LLM generate Q&A pairs on its own from domain-data chunks is a clever trick that I'd like to try.

I had previously only thought about fine-tuning chat models; it hadn't occurred to me that embedding models can also be fine-tuned on domain data.

Also, this was my first time hearing about LLM-Embedder; I'll read up on it later.

Thanks for reading, and feel free to discuss in the comments!

posted @ 2024-01-31 12:52  Aikoin