Agent Foundry

#58. RAG Evaluation Pipeline

Hard · RAG · Evaluation

The Problem

Your RAG pipeline returns answers, but you have no way to tell how good they are. Is the retriever finding the right documents? Is the LLM stating facts that are not in the context? Is the answer complete? Without evaluation metrics you are flying blind: every prompt tweak or retriever change is a guess. Your job is to build an automated evaluation pipeline that measures three metrics: retrieval relevance (are the retrieved documents relevant to the question?), answer faithfulness (is every claim in the answer grounded in the retrieved context?), and answer completeness (does the answer address every part of the question?).

Examples

Example 1

User input: What SDKs are available and what are the rate limits?

Current (bad) output:

Answer: We support Python, JavaScript, and Go SDKs. Free tier: 1000 req/min, Enterprise: 10000 req/min.
(No evaluation: you cannot tell whether this answer is faithful or complete.)

Expected (good) output:

Answer: We support Python, JavaScript, and Go SDKs. Free tier: 1000 req/min, Enterprise: 10000 req/min.

Evaluation:
  Retrieval Relevance: 0.95 (retrieved docs directly relate to SDKs and rate limits)
  Answer Faithfulness: 1.0 (all claims are supported by retrieved context)
  Answer Completeness: 1.0 (both SDKs and rate limits are covered)

Example 2

User input: How do I set up webhooks and authenticate?

Expected (good) output:

Answer: Configure webhooks in Dashboard > Settings > Integrations. Authenticate using an API key in the X-API-Key header.

Evaluation:
  Retrieval Relevance: 0.9
  Answer Faithfulness: 1.0
  Answer Completeness: 1.0
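
The evaluation blocks in the examples above could be produced by a small result container. The following is only a sketch: the `EvalReport` class and its `render` method are hypothetical names, not part of the starter code, and the range check is an assumption based on the 0 to 1 score range used in the examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalReport:
    """Hypothetical container for the three scores, each expected in [0, 1]."""

    retrieval_relevance: float
    answer_faithfulness: float
    answer_completeness: float

    def __post_init__(self) -> None:
        # Reject out-of-range values early so bad scores never reach a report.
        for name, value in vars(self).items():
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")

    def render(self, answer: str) -> str:
        """Format the answer and scores like the expected output in the examples."""
        return (
            f"Answer: {answer}\n\n"
            "Evaluation:\n"
            f"  Retrieval Relevance: {self.retrieval_relevance}\n"
            f"  Answer Faithfulness: {self.answer_faithfulness}\n"
            f"  Answer Completeness: {self.answer_completeness}"
        )
```

Calling `EvalReport(0.9, 1.0, 1.0).render(answer)` would print the three score lines in the same layout as Example 2, without the parenthetical explanations shown in Example 1.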

Your Task

Add an evaluation layer to the RAG pipeline that:

  • Scores retrieval relevance: how well the retrieved documents match the question (0 to 1).
  • Scores answer faithfulness: whether every claim in the answer is supported by the retrieved context (0 to 1).
  • Scores answer completeness: whether the answer addresses all parts of the question (0 to 1).
  • Uses LLM-as-judge or programmatic checks to produce these scores automatically.
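
One way to approach the LLM-as-judge option is sketched below for the faithfulness metric. It is an illustration under assumptions, not a reference solution: the prompt wording, `parse_score`, and `judge_faithfulness` are hypothetical, and `judge` stands for any function that takes a prompt string and returns the model's reply text. The parser takes the first number it finds, ignores any sign, and clamps the result to the 0 to 1 range.

```python
import re
from typing import Callable

# Hypothetical prompt wording; tune it for your own judge model.
FAITHFULNESS_PROMPT = (
    "Rate from 0 to 1 how fully the ANSWER is supported by the CONTEXT.\n"
    "Reply with a single number only.\n\n"
    "CONTEXT:\n{context}\n\nANSWER:\n{answer}"
)


def parse_score(reply: str) -> float:
    """Pull the first unsigned number out of a judge reply and clamp it to [0, 1]."""
    match = re.search(r"\d*\.?\d+", reply)
    if match is None:
        raise ValueError(f"no numeric score in judge reply: {reply!r}")
    return max(0.0, min(1.0, float(match.group())))


def judge_faithfulness(judge: Callable[[str], str], answer: str, context: str) -> float:
    """Ask the judge for a faithfulness score; `judge` maps a prompt to reply text."""
    prompt = FAITHFULNESS_PROMPT.format(context=context, answer=answer)
    return parse_score(judge(prompt))
```

With the starter code's `llm`, something like `judge = lambda p: llm.invoke(p).content` should fit this signature. Relevance and completeness could follow the same pattern with their own prompts.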

Evaluation

Submissions are checked for the following:

  • Measures retrieval relevance: The pipeline scores how relevant the retrieved documents are to the question.
  • Measures answer faithfulness: The pipeline scores whether the answer is faithful to the retrieved context without hallucination.
  • Measures answer completeness: The pipeline scores whether the answer addresses all parts of the user's question.
  • Produces numeric scores: Each metric produces a score between 0 and 1 for quantitative comparison.
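
For the programmatic route, the three checks above can be roughly approximated with token overlap. This is a deliberately crude sketch: the stopword list, `tokens`, `coverage`, and `evaluate` are all hypothetical, and overlap only sees shared words, so a paraphrased but faithful answer would score lower than it deserves.

```python
import re

# Small, hypothetical stopword list; extend it for real use.
STOPWORDS = {"a", "an", "and", "are", "do", "how", "i", "in", "is", "of", "the", "to", "what"}


def tokens(text: str) -> set[str]:
    """Lowercase word tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def coverage(target: str, source: str) -> float:
    """Fraction of target tokens that also appear in source (0 to 1)."""
    wanted = tokens(target)
    if not wanted:
        return 0.0
    return len(wanted & tokens(source)) / len(wanted)


def evaluate(question: str, answer: str, contexts: list[str]) -> dict[str, float]:
    """Score one question-answer-context triple with overlap heuristics."""
    joined = " ".join(contexts)
    return {
        # Question terms found in the retrieved documents.
        "retrieval_relevance": coverage(question, joined),
        # Answer terms found in the retrieved documents.
        "answer_faithfulness": coverage(answer, joined),
        # Question terms echoed in the answer (a rough proxy for completeness).
        "answer_completeness": coverage(question, answer),
    }
```

Because every score is a ratio of set sizes, each one stays between 0 and 1 by construction. Note that this relevance score does not penalize irrelevant documents that were retrieved alongside relevant ones.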

Constraints

  • The evaluation must measure retrieval relevance, answer faithfulness, and answer completeness.
  • Each metric must produce a score between 0 and 1.
  • Evaluation must be automated using LLM-as-judge or programmatic checks.
  • The evaluation pipeline must work on any question-answer-context triple.

Starter Code

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document

llm = ChatOpenAI(model="gpt-4o-mini")
embeddings = OpenAIEmbeddings()

documents = [
    Document(page_content="The platform supports Python, JavaScript, and Go SDKs."),
    Document(page_content="Rate limits are 1000 req/min for free tier, 10000 req/min for enterprise."),
    Document(page_content="Authentication requires an API key passed in the X-API-Key header."),
    Document(page_content="Webhooks can be configured in the dashboard under Settings > Integrations."),
]

vectorstore = FAISS.from_documents(documents, embeddings)
retriever = vectorstore.as_retriever()

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer based on context.\n\nContext: {context}"),
    ("human", "{question}"),
])

# TODO: This RAG pipeline has no evaluation. Add metrics for retrieval relevance,
# answer faithfulness, and answer completeness.
def ask(question: str) -> str:
    """Retrieve context for the question and return the LLM's answer text."""
    docs = retriever.invoke(question)
    # Join the retrieved passages into one context string for the prompt.
    context = "\n".join(doc.page_content for doc in docs)
    chain = prompt | llm
    result = chain.invoke({"context": context, "question": question})
    return result.content

# No evaluation: this only prints the answer, with no quality metrics.
print(ask("What SDKs are available and what are the rate limits?"))
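
One possible shape for the evaluated pipeline is to keep retrieval, generation, and scoring as separate pluggable steps, so the scorer sees the same question-answer-context triple the pipeline produced. The sketch below is an assumption about structure, not the expected solution: `ask_with_eval`, `Scorer`, and the `retrieve`, `generate`, and `score` parameters are hypothetical names.

```python
from typing import Callable

# A scorer takes (question, answer, contexts) and returns named scores.
Scorer = Callable[[str, str, list[str]], dict[str, float]]


def ask_with_eval(
    question: str,
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str, str], str],
    score: Scorer,
) -> dict:
    """Run retrieval and generation, then score the resulting triple."""
    contexts = retrieve(question)
    answer = generate(question, "\n".join(contexts))
    scores = score(question, answer, contexts)
    # Guard the 0-to-1 constraint no matter which scorer is plugged in.
    clamped = {name: max(0.0, min(1.0, value)) for name, value in scores.items()}
    return {"answer": answer, "contexts": contexts, "scores": clamped}
```

With the starter code, `retrieve` could wrap `retriever.invoke` and return each document's `page_content`, and `generate` could wrap the `prompt | llm` chain. Passing plain functions also makes the evaluation layer testable without any API calls.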