
ragas: Supercharge Your LLM Application Evaluations


https://docs.ragas.io/en/latest/

 

Ragas is a library that provides tools to supercharge the evaluation of Large Language Model (LLM) applications. It is designed to help you evaluate your LLM applications with ease and confidence.

  • 🚀 Get Started: Install with pip and get started with Ragas with these tutorials.

  • 📚 Core Concepts: In-depth explanation and discussion of the concepts and workings of the different features available in Ragas.

  • 🛠️ How-to Guides: Practical guides that help you achieve specific goals. Take a look at these guides to learn how to use Ragas to solve real-world problems.

  • 📖 References: Technical descriptions of how Ragas classes and methods work.

Frequently Asked Questions (covered in the docs):

  • What is the best open-source model to use?
  • Why do NaN values appear in evaluation results?
  • How can I make evaluation results more explainable?

     

 

https://github.com/explodinggradients/ragas

Objective metrics, intelligent test generation, and data-driven insights for LLM apps

Ragas is your ultimate toolkit for evaluating and optimizing Large Language Model (LLM) applications. Say goodbye to time-consuming, subjective assessments and hello to data-driven, efficient evaluation workflows. Don't have a test dataset ready? We also do production-aligned test set generation.

 

Note

Need help setting up Evals for your AI application? We'd love to help! We are conducting Office Hours every week. You can sign up here.

Key Features

  • 🎯 Objective Metrics: Evaluate your LLM applications with precision using both LLM-based and traditional metrics (a non-LLM example is sketched right after this list).
  • 🧪 Test Data Generation: Automatically create comprehensive test datasets covering a wide range of scenarios.
  • 🔗 Seamless Integrations: Works flawlessly with popular LLM frameworks like LangChain and major observability tools.
  • 📊 Build feedback loops: Leverage production data to continually improve your LLM applications.
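
As a contrast to the LLM-judged README quickstart that follows, here is a minimal sketch using BleuScore, one of the traditional (non-LLM) string-overlap metrics listed in the Ragas docs. The response/reference pair is made up for illustration, and notebook-style top-level await is assumed, as in the quickstart:

from ragas import SingleTurnSample
from ragas.metrics import BleuScore

# Hypothetical response/reference pair, for illustration only.
sample = SingleTurnSample(
    response="The Eiffel Tower is located in Paris.",
    reference="The Eiffel Tower is located in Paris, France.",
)

metric = BleuScore()  # pure string-overlap metric; no evaluator LLM needed
await metric.single_turn_ascore(sample)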

 

from langchain_openai import ChatOpenAI

from ragas import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import AspectCritic

test_data = {
    "user_input": "summarise given text\nThe company reported an 8% rise in Q3 2024, driven by strong performance in the Asian market. Sales in this region have significantly contributed to the overall growth. Analysts attribute this success to strategic marketing and product localization. The positive trend in the Asian market is expected to continue into the next quarter.",
    "response": "The company experienced an 8% increase in Q3 2024, largely due to effective marketing strategies and product adaptation, with expectations of continued growth in the coming quarter.",
}

# Wrap a LangChain chat model so Ragas can use it as the evaluator LLM.
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
metric = AspectCritic(name="summary_accuracy", llm=evaluator_llm, definition="Verify if the summary is accurate.")
await metric.single_turn_ascore(SingleTurnSample(**test_data))
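
The README also mentions production-aligned test set generation. A minimal sketch of what that might look like, based on the TestsetGenerator API described in recent Ragas docs; the docs variable is a placeholder for a list of LangChain Document objects you have loaded, and the import paths and arguments should be checked against your installed version:

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.testset import TestsetGenerator

generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
# docs: a list of langchain_core.documents.Document objects (placeholder here)
testset = generator.generate_with_langchain_docs(docs, testset_size=10)
testset.to_pandas()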

 

LangChain Integration

https://docs.ragas.io/en/latest/howtos/integrations/langchain/

This tutorial demonstrates how to evaluate a RAG-based Q&A application built with LangChain using Ragas. Additionally, we will explore how the Ragas App can help analyze and enhance the application's performance.

Building a simple Q&A application

To build a question-answering system, we start by creating a small dataset and indexing it using its embeddings in a vector database.

 
import os
from dotenv import load_dotenv
from langchain_core.documents import Document

load_dotenv()

content_list = [
    "Andrew Ng is the CEO of Landing AI and is known for his pioneering work in deep learning. He is also widely recognized for democratizing AI education through platforms like Coursera.",
    "Sam Altman is the CEO of OpenAI and has played a key role in advancing AI research and development. He is a strong advocate for creating safe and beneficial AI technologies.",
    "Demis Hassabis is the CEO of DeepMind and is celebrated for his innovative approach to artificial intelligence. He gained prominence for developing systems that can master complex games like AlphaGo.",
    "Sundar Pichai is the CEO of Google and Alphabet Inc., and he is praised for leading innovation across Google's vast product ecosystem. His leadership has significantly enhanced user experiences on a global scale.",
    "Arvind Krishna is the CEO of IBM and is recognized for transforming the company towards cloud computing and AI solutions. He focuses on providing cutting-edge technologies to address modern business challenges.",
]

langchain_documents = []

for content in content_list:
    langchain_documents.append(
        Document(
            page_content=content,
        )
    )
 
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_core.vectorstores import InMemoryVectorStore

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vector_store = InMemoryVectorStore(embeddings)

_ = vector_store.add_documents(langchain_documents)

We will now build a RAG-based system that combines the retriever, the LLM, and a prompt into a retrieval QA chain. The retriever fetches relevant documents from the knowledge base, and the LLM generates a response grounded in those documents, guided by the prompt, which supplies the context and steers the model toward relevant, coherent output.

In LangChain, we can create a retriever from a vector store by using its .as_retriever method. For more details, refer to the LangChain documentation on vector store retrievers.

 
retriever = vector_store.as_retriever(search_kwargs={"k": 1})
 
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

We will define a chain that takes the user query and the retrieved documents, passes them to the model within a structured prompt, and parses the model's output into the final response string.

 
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser


template = """Answer the question based only on the following context:
{context}

Question: {query}
"""
prompt = ChatPromptTemplate.from_template(template)

qa_chain = prompt | llm | StrOutputParser()

 

 
def format_docs(relevant_docs):
    return "\n".join(doc.page_content for doc in relevant_docs)


query = "Who is the CEO of OpenAI?"

relevant_docs = retriever.invoke(query)
qa_chain.invoke({"context": format_docs(relevant_docs), "query": query})

Output:

 
'The CEO of OpenAI is Sam Altman.'

 

Evaluate

 
sample_queries = [
    "Which CEO is widely recognized for democratizing AI education through platforms like Coursera?",
    "Who is Sam Altman?",
    "Who is Demis Hassabis and how did he gained prominence?",
    "Who is the CEO of Google and Alphabet Inc., praised for leading innovation across Google's product ecosystem?",
    "How did Arvind Krishna transformed IBM?",
]

expected_responses = [
    "Andrew Ng is the CEO of Landing AI and is widely recognized for democratizing AI education through platforms like Coursera.",
    "Sam Altman is the CEO of OpenAI and has played a key role in advancing AI research and development. He strongly advocates for creating safe and beneficial AI technologies.",
    "Demis Hassabis is the CEO of DeepMind and is celebrated for his innovative approach to artificial intelligence. He gained prominence for developing systems like AlphaGo that can master complex games.",
    "Sundar Pichai is the CEO of Google and Alphabet Inc., praised for leading innovation across Google's vast product ecosystem. His leadership has significantly enhanced user experiences globally.",
    "Arvind Krishna is the CEO of IBM and has transformed the company towards cloud computing and AI solutions. He focuses on delivering cutting-edge technologies to address modern business challenges.",
]

To evaluate the Q&A system we need to structure the queries, expected responses, and other metric-specific requirements into an EvaluationDataset.

 
from ragas import EvaluationDataset


dataset = []

for query, reference in zip(sample_queries, expected_responses):
    relevant_docs = retriever.invoke(query)
    response = qa_chain.invoke({"context": format_docs(relevant_docs), "query": query})
    dataset.append(
        {
            "user_input": query,
            "retrieved_contexts": [rdoc.page_content for rdoc in relevant_docs],
            "response": response,
            "reference": reference,
        }
    )

evaluation_dataset = EvaluationDataset.from_list(dataset)

To evaluate our Q&A application we will use the following metrics.

  • LLMContextRecall: Evaluates how well retrieved contexts align with claims in the reference answer, estimating recall without manual reference context annotations.
  • Faithfulness: Assesses whether all claims in the generated answer can be inferred directly from the provided context.
  • Factual Correctness: Checks the factual accuracy of the generated response by comparing it with a reference, using claim-based evaluation and natural language inference.

For more details on these metrics and how they apply to evaluating RAG systems, visit Ragas Metrics Documentation.

 
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness

evaluator_llm = LangchainLLMWrapper(llm)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness()],
    llm=evaluator_llm,
)

result

Output

 
{'context_recall': 1.0000, 'faithfulness': 0.9000, 'factual_correctness': 0.9260}
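
The dictionary above shows aggregate scores. For per-sample scores, the result object can be converted to a DataFrame with to_pandas(), the same call the Langfuse script later in this post uses:

# Each row pairs one evaluated sample with its metric values.
df = result.to_pandas()
df[["user_input", "context_recall", "faithfulness", "factual_correctness"]].head()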

 

Check out app.ragas.io for a more detailed analysis, including interactive visualizations and metrics. You'll need to create an account and generate a Ragas API key to upload and explore your results.
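
A minimal sketch of the upload step, assuming the result object's upload() helper and a RAGAS_APP_TOKEN environment variable holding the API key generated from app.ragas.io (verify the exact mechanism in the current docs):

import os

os.environ["RAGAS_APP_TOKEN"] = "your-ragas-app-token"  # placeholder; generated at app.ragas.io
result.upload()  # pushes this evaluation run to app.ragas.io for interactive analysis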

 

 

rag_evaluation (Vector Institute RAG bootcamp notebook)

https://github.com/VectorInstitute/rag_bootcamp/blob/main/rag_evaluation/rag_evaluation_basic.ipynb

Langfuse integration (evaluation script)

https://github.com/fhrzn/rag-analytics-eval/blob/main/evaluation.py

from langfuse import Langfuse
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
from langchain.chat_models.azure_openai import AzureChatOpenAI
from langchain.embeddings.azure_openai import AzureOpenAIEmbeddings
import os
from dotenv import load_dotenv
import pandas as pd
from typing import List

load_dotenv()


def init_models():
    # LLM & Embedding
    llm = AzureChatOpenAI(
        azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
        azure_deployment=os.getenv("AZURE_OPENAI_MODEL"),
        openai_api_key=os.getenv("AZURE_OPENAI_API_KEY"),
        openai_api_version=os.getenv("AZURE_OPENAI_VERSION"),
    )

    embedding = AzureOpenAIEmbeddings(
        azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
        azure_deployment=os.getenv("AZURE_OPENAI_EMBEDDING"),
        openai_api_key=os.getenv("AZURE_OPENAI_API_KEY"),
        openai_api_version=os.getenv("AZURE_OPENAI_VERSION"),
    )

    return llm, embedding


def get_traces_dataset(langfuse_client: Langfuse, tag: str):
    # get traces
    response = langfuse_client.client.trace.list(tags=tag)
    traces = response.data
    
    evaluation_set = {
        "question": [],
        "contexts": [],
        "answer": [],
        "trace_id": []
    }

    # extract question, context, answer
    for t in traces:
        observations = [langfuse_client.client.observations.get(o) for o in t.observations]
        for o in observations:
            if o.name == "LLMChain":
                question = o.input["question"]
                contexts = [o.input["context"]]
                answer = o.output["text"]
        
        evaluation_set['question'].append(question)
        evaluation_set['contexts'].append(contexts)
        evaluation_set['answer'].append(answer)
        evaluation_set['trace_id'].append(t.id)

    return evaluation_set


def ingest_score(langfuse_client: Langfuse, scores: pd.DataFrame, metric_names: List[str]):
    for _, row in scores.iterrows():
        for metric in metric_names:
            langfuse_client.score(
                name=metric,
                value=row[metric],
                trace_id=row["trace_id"]
            )





if __name__ == "__main__":
    # init
    langfuse = Langfuse(
        secret_key="sk-lf-d26600a7-aa86-4aae-af39-e09c0155a96d",
        public_key="pk-lf-aaa63774-ad52-487e-9e2f-1f354b0d60ae",
        host="http://localhost:3000",
    )

    llm, embedding = init_models()

    # get dataset
    evaluation_set = get_traces_dataset(langfuse, tag="RAG")
    evaluation_set = Dataset.from_dict(evaluation_set)

    # evaluate
    scores = evaluate(evaluation_set,
                      metrics=[faithfulness, answer_relevancy],
                      llm=llm,
                      embeddings=embedding,
                      raise_exceptions=False)
    
    scores = scores.to_pandas()

    # save result
    ingest_score(langfuse, scores, metric_names=["faithfulness", "answer_relevancy"])

 

Running the RAG application (main.py)

https://github.com/fhrzn/rag-analytics-eval/blob/main/main.py

import os
import sys
from dotenv import load_dotenv

from langchain_openai.chat_models.azure import AzureChatOpenAI
from langchain_openai.embeddings.azure import AzureOpenAIEmbeddings
from langchain_community.document_loaders.wikipedia import WikipediaLoader
from langchain_community.vectorstores.faiss import FAISS
from langchain_core.prompts import PromptTemplate
from langchain.chains.retrieval_qa.base import RetrievalQA
from langchain.chains.llm import LLMChain

from langfuse.callback import CallbackHandler
from langfuse import Langfuse

load_dotenv()



def setup_langfuse():
    print("setup langfuse...")
    # analytics
    langfuse_callback = CallbackHandler(
        secret_key="sk-lf-d26600a7-aa86-4aae-af39-e09c0155a96d",
        public_key="pk-lf-aaa63774-ad52-487e-9e2f-1f354b0d60ae",
        host="http://localhost:3000",
        tags=["RAG"]
    )

    return langfuse_callback


def init_models():
    print("init model...")
    # LLM & Embedding
    llm = AzureChatOpenAI(
        azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
        azure_deployment=os.getenv("AZURE_OPENAI_MODEL"),
        openai_api_key=os.getenv("AZURE_OPENAI_API_KEY"),
        openai_api_version=os.getenv("AZURE_OPENAI_VERSION"),
    )

    embedding = AzureOpenAIEmbeddings(
        azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
        azure_deployment=os.getenv("AZURE_OPENAI_EMBEDDING"),
        openai_api_key=os.getenv("AZURE_OPENAI_API_KEY"),
        openai_api_version=os.getenv("AZURE_OPENAI_VERSION"),
    )

    return llm, embedding


def ingest_data(query: str, embedding: AzureOpenAIEmbeddings, lang: str = "id"):
    print("ingesting data...")
    # document loader
    loader = WikipediaLoader(query=query, lang=lang, load_max_docs=3)
    docs = loader.load()
    vectorstores = FAISS.from_documents(docs, embedding)

    return vectorstores


def retrieval_mode(question: str, wiki_search: str, llm: AzureChatOpenAI, embedding: AzureOpenAIEmbeddings, langfuse_handler: CallbackHandler):
    # get vectorstore
    vectorstore = ingest_data(wiki_search, embedding)
    retriever = vectorstore.as_retriever(search_kwargs={"k": 20})

    # setup prompt
    prompt_str = (
        "Use the given context to answer the question. \n"
        "If you don't know the answer, say you don't know. \n"
        "Use three sentence maximum and keep the answer concise.\n"
        "-------------------------------------------------------\n"
        "Context: ```\n{context}\n````\n"
        "-------------------------------------------------------\n"
        "Question: \"{question}\""
    )    
    prompt = PromptTemplate.from_template(prompt_str)

    retrieval_chain = RetrievalQA.from_llm(llm, prompt, retriever=retriever, llm_chain_kwargs={"verbose": True})
    result = retrieval_chain.run({"query": question}, callbacks=[langfuse_handler])

    return result


# OPTIONAL
def general_mode(question: str, llm: AzureChatOpenAI, langfuse_handler: CallbackHandler):
    prompt = PromptTemplate.from_template("Answer the given question below\n Question: {text}")
    chain = LLMChain(llm=llm, prompt=prompt)
    result = chain.predict(text=question, callbacks=[langfuse_handler])
    return result


if __name__ == "__main__":

    # init model
    llm, embedding = init_models()

    # langfuse
    langfuse_handler = setup_langfuse()

    try:
        while True:
            wiki_search = input("Enter wikipedia keyword (optional): ")
            question = input("Enter your question: ")

            result = retrieval_mode(question, wiki_search, llm, embedding, langfuse_handler)
            print(result)
    except KeyboardInterrupt:
        print()
        sys.exit(1)

 

 

 

https://langfuse.com/self-hosting/docker-compose

Docker Compose

This guide will walk you through deploying Langfuse on a VM using Docker Compose. We will use the docker-compose.yml file.

If you use a cloud provider like AWS, GCP, or Azure, you will need permissions to deploy virtual machines.

For high availability and high throughput, we recommend using Kubernetes (see the deployment guide). The Docker Compose setup lacks high availability, scaling capabilities, and backup functionality.

Langfuse is an open-source LLM engineering platform: traces, evals, prompt management, and metrics to debug and improve your LLM application.

 
