Agent Foundry

#46. Basic Document QA

Easy · RAG

The Problem

You have five text documents about a company, and you need an agent that answers questions using only the information in those documents. Right now the agent ignores the documents entirely and answers from its own parametric knowledge, so it confidently produces answers that may be outdated, wrong, or completely fabricated. Your job is to wire up a proper retrieval pipeline so the agent fetches relevant documents first and grounds every answer in their content.

Examples

Example 1

User input: Where is Acme Corp headquartered?

Current (bad) output: The agent guesses a city based on its training data, potentially answering "San Francisco" or "New York" because those are common startup locations, without ever consulting the documents.

Expected (good) output: Based on the provided documents, Acme Corp is headquartered in Austin, Texas.

Example 2

User input: How much funding has Acme Corp raised?

Current (bad) output: The agent fabricates a funding amount or says it doesn't know, since it never looks at the documents.

Expected (good) output: According to the documents, Acme Corp raised a $50M Series B in January 2024, led by Sequoia Capital.

Example 3

User input: What does Acme Corp sell?

Current (bad) output: A generic guess like "software solutions" with no specifics.

Expected (good) output: Acme Corp's flagship product is the AcmeAI platform for enterprise automation.

Your Task

Build a retrieval-augmented generation (RAG) pipeline that:

  • Loads all five documents into a vector store with embeddings.
  • Retrieves the most relevant document(s) for a given question.
  • Passes the retrieved context to the LLM with instructions to answer only from that context.
  • Returns accurate, document-grounded answers.
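
The first three steps above can be sketched with the standard library alone. This is only an illustration under stated assumptions: `embed` is a toy word-count vector standing in for a real embedding model, `index` is a plain list standing in for a vector store, and the names `embed`, `cosine`, `retrieve`, and `build_prompt` are hypothetical rather than part of the starter code. The final step, sending the prompt to the LLM, is left out.

```python
import math
import re
from collections import Counter

documents = [
    "Acme Corp was founded in 2019 and is headquartered in Austin, Texas.",
    "Acme Corp's flagship product is the AcmeAI platform for enterprise automation.",
    "Acme Corp raised a $50M Series B in January 2024 led by Sequoia Capital.",
    "Acme Corp has 200 employees across offices in Austin, New York, and London.",
    "Acme Corp's CEO is Jane Smith, who previously led engineering at BigTech Inc.",
]


def embed(text):
    """Toy embedding (assumption): a sparse vector of lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Step 1: "load" the documents by storing each one next to its vector.
index = [(doc, embed(doc)) for doc in documents]


def retrieve(question, k=1):
    """Step 2: return the k documents most similar to the question."""
    query_vector = embed(question)
    ranked = sorted(index, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


def build_prompt(question, k=1):
    """Step 3: place the retrieved context ahead of the question."""
    context = "\n".join(retrieve(question, k))
    return (
        "Answer using only the context below. If the answer is not there, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


print(build_prompt("Where is Acme Corp headquartered?"))
```

Word counts only match on shared words, so this stand-in is fragile: for "How much funding has Acme Corp raised?" it ranks the employees document above the funding document because of the shared word "has". A learned embedding model is the usual way around that, and retrieving more than one document (`k=2`) is a cheap hedge.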

Evaluation

Submissions are checked for the following:

  • Uses document retrieval: The agent retrieves relevant documents from a vector store before answering.
  • Answers from documents only: The agent's answer is grounded in the provided documents, not parametric knowledge.
  • Returns correct answers: The agent returns factually correct answers that match the document content.
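
Before submitting, a rough local self-check for the "answers from documents only" criterion might look like the sketch below. It is not the grader's method, which the problem does not describe; `unsupported_specifics` is a hypothetical helper that flags numbers and mid-sentence capitalized words in an answer that never occur in the retrieved context. It is a crude heuristic and will miss paraphrased fabrications.

```python
import re


def unsupported_specifics(answer, context):
    """Return numbers and mid-sentence capitalized words absent from the context."""
    context_tokens = set(re.findall(r"[A-Za-z0-9$]+", context.lower()))
    flagged = []
    # Split into sentences so the capitalized first word of each can be skipped.
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = re.findall(r"[A-Za-z0-9$]+", sentence)
        for position, word in enumerate(words):
            is_number = any(ch.isdigit() for ch in word)
            is_name = position > 0 and word[0].isupper()
            if (is_number or is_name) and word.lower() not in context_tokens:
                flagged.append(word)
    return flagged


context = "Acme Corp was founded in 2019 and is headquartered in Austin, Texas."

grounded = "Based on the provided documents, Acme Corp is headquartered in Austin, Texas."
guessed = "Acme Corp is headquartered in San Francisco."

print(unsupported_specifics(grounded, context))  # -> []
print(unsupported_specifics(guessed, context))   # -> ['San', 'Francisco']
```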

Constraints

  • The agent must answer only from the provided documents
  • No external knowledge or web search may be used
  • All five documents must be loadable by the pipeline
  • The retrieval mechanism must use vector similarity search

Starter Code

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o-mini")

documents = [
    "Acme Corp was founded in 2019 and is headquartered in Austin, Texas.",
    "Acme Corp's flagship product is the AcmeAI platform for enterprise automation.",
    "Acme Corp raised a $50M Series B in January 2024 led by Sequoia Capital.",
    "Acme Corp has 200 employees across offices in Austin, New York, and London.",
    "Acme Corp's CEO is Jane Smith, who previously led engineering at BigTech Inc.",
]

# BUG: No retrieval — the agent answers from parametric knowledge instead of documents
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Answer the user's question."),
    ("human", "{question}"),
])

chain = prompt | llm

result = chain.invoke({"question": "Where is Acme Corp headquartered?"})
print(result.content)
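
One possible way to repair the starter code is sketched below. Everything beyond the original imports is an assumption: the choice of `InMemoryVectorStore` as the vector store, `OpenAIEmbeddings` with `text-embedding-3-small` as the embedding model, `k=2`, `temperature=0`, and the wording of the system prompt. The helper names `format_context`, `SYSTEM_PROMPT`, and `build_rag_chain` are hypothetical. Running the chain requires the `langchain-openai` package and an `OPENAI_API_KEY` in the environment.

```python
def format_context(passages):
    """Join retrieved passages into one numbered block for the prompt."""
    return "\n".join(f"[{i}] {text}" for i, text in enumerate(passages, start=1))


# Assumed wording; the key point is that the model is told to stay inside the context.
SYSTEM_PROMPT = (
    "Answer the user's question using only the context below. "
    "If the context does not contain the answer, say that the documents "
    "do not cover it.\n\nContext:\n{context}"
)


def build_rag_chain(documents, k=2):
    """Build a retrieve-then-answer chain over a list of document strings."""
    # Imported here so the helpers above stay usable without LangChain installed.
    from langchain_core.output_parsers import StrOutputParser
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_core.runnables import RunnablePassthrough
    from langchain_core.vectorstores import InMemoryVectorStore
    from langchain_openai import ChatOpenAI, OpenAIEmbeddings

    # Embed every document once and keep the vectors in memory.
    store = InMemoryVectorStore.from_texts(
        documents,
        embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
    )
    # The retriever runs a vector similarity search and returns the top k documents.
    retriever = store.as_retriever(search_kwargs={"k": k})

    prompt = ChatPromptTemplate.from_messages([
        ("system", SYSTEM_PROMPT),
        ("human", "{question}"),
    ])
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

    # The question string goes to the retriever (to build the context)
    # and is also passed through unchanged to the human message.
    return (
        {
            "context": retriever
            | (lambda docs: format_context(d.page_content for d in docs)),
            "question": RunnablePassthrough(),
        }
        | prompt
        | llm
        | StrOutputParser()
    )
```

With this in place, `build_rag_chain(documents).invoke("Where is Acme Corp headquartered?")` takes the question as a plain string and returns the answer as a string, so `.content` is no longer needed.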