#48. Source Citation

Medium · RAG

The Problem

Your RAG pipeline retrieves the right documents and produces correct answers, but the user has no way to verify where the information came from. In business settings such as legal, finance, and compliance, unattributed answers have little value. Users need to see which document backs each claim so they can verify, audit, and trust the output. The current pipeline drops source metadata when it builds the context from the retrieved documents, and the prompt never asks the LLM to cite anything. Your job is to preserve source metadata through the pipeline and instruct the LLM to cite the source document for every factual claim.

Examples

Example 1

User input: What was the revenue growth?

Current (bad) output: The company's revenue grew 40% year-over-year in Q3 2024. (No citation: the user has no idea which report this came from.)

Expected (good) output: The company's revenue grew 40% year-over-year in Q3 2024. [Source: earnings_report_q3.pdf]

Example 2

User input: When is the new product launching?

Current (bad) output: The new product launch is scheduled for March 2025.

Expected (good) output: The new product launch is scheduled for March 2025. [Source: product_roadmap.pdf]

Example 3

User input: How are employees feeling about the company?

Current (bad) output: Employee satisfaction scores improved to 4.5 out of 5.

Expected (good) output: Employee satisfaction scores improved to 4.5/5 in the latest survey. [Source: hr_annual_review.pdf]

Your Task

Modify the RAG pipeline so that:

  • Source metadata (document name/ID) is preserved when documents are retrieved.
  • Retrieved context passed to the LLM includes source identifiers alongside the content.
  • The LLM prompt instructs the model to cite the source for every factual claim.
  • The final answer is readable and naturally incorporates citations.
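
One way to approach the first three requirements is to label each retrieved chunk with its source before it goes into the prompt. The sketch below is an illustration under assumptions, not the reference solution: `format_context` and `CITATION_SYSTEM_PROMPT` are hypothetical names, the `[Source: ...]` label format is borrowed from the expected outputs in the examples, and the `Doc` dataclass only stands in for LangChain's `Document` so the snippet runs without extra dependencies.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for langchain_core.documents.Document (same two attributes)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Hypothetical system prompt; the wording is an assumption and may need tuning.
CITATION_SYSTEM_PROMPT = (
    "Answer the question using only the context below. "
    "Each context entry starts with a [Source: <name>] label. "
    "After every factual claim, cite the entry it came from in the form "
    "[Source: <name>]. If the context does not contain the answer, say so.\n\n"
    "Context:\n{context}"
)


def format_context(docs) -> str:
    """Render each document as '[Source: name] content', one per line."""
    lines = []
    for doc in docs:
        # Fall back to 'unknown' rather than silently dropping the label.
        source = doc.metadata.get("source", "unknown")
        lines.append(f"[Source: {source}] {doc.page_content}")
    return "\n".join(lines)


docs = [
    Doc("The company's revenue grew 40% year-over-year in Q3 2024.",
        {"source": "earnings_report_q3.pdf"}),
    Doc("The new product launch is scheduled for March 2025.",
        {"source": "product_roadmap.pdf"}),
]
context = format_context(docs)
print(CITATION_SYSTEM_PROMPT.format(context=context))
```

Because the template's only placeholder is `{context}`, the same string could also serve as the system message of a `ChatPromptTemplate`. Putting the label on the same line as the content keeps each claim visually tied to its source, which may make it easier for the model to copy the right name.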

Evaluation

Submissions are checked for the following:

  • Cites sources for claims: Every factual claim in the answer includes a reference to the source document.
  • Citations are accurate: Citations match the actual source documents that contained the information.
  • Output remains readable: The answer is still natural and easy to read despite including citations.
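
The accuracy criterion can be partly checked locally before submitting. The sketch below assumes citations follow the `[Source: name]` form used in the examples; `check_citations` and `CITATION_PATTERN` are hypothetical helpers, not part of the platform's grader. It only confirms that every cited name belongs to a retrieved document. It does not prove that each claim is cited or that the cited document actually supports the claim.

```python
import re

# Matches citations of the form [Source: some_file.pdf] and captures the name.
CITATION_PATTERN = re.compile(r"\[Source:\s*([^\]]+?)\s*\]")


def check_citations(answer: str, retrieved_sources: set[str]) -> dict:
    """Report which sources an answer cites and which are not among the retrieved ones."""
    cited = set(CITATION_PATTERN.findall(answer))
    return {
        "cited": sorted(cited),
        "unknown": sorted(cited - retrieved_sources),
        "has_citation": bool(cited),
    }


sources = {"earnings_report_q3.pdf", "product_roadmap.pdf"}
good = ("The company's revenue grew 40% year-over-year in Q3 2024. "
        "[Source: earnings_report_q3.pdf]")
bad = "Revenue grew 40%. [Source: annual_report.pdf]"

print(check_citations(good, sources))
print(check_citations(bad, sources))
```

A non-empty `unknown` list suggests the model invented or misspelled a document name, which is the failure the "Citations are accurate" check targets.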

Constraints

  • Every factual claim in the answer must include a source citation.
  • Citations must reference the actual document name or ID.
  • The retrieval mechanism itself must not be removed or bypassed.
  • The agent must still produce natural, readable answers.

Starter Code

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document

llm = ChatOpenAI(model="gpt-4o-mini")
embeddings = OpenAIEmbeddings()

documents = [
    Document(page_content="The company's revenue grew 40% year-over-year in Q3 2024.", metadata={"source": "earnings_report_q3.pdf"}),
    Document(page_content="The new product launch is scheduled for March 2025.", metadata={"source": "product_roadmap.pdf"}),
    Document(page_content="Employee satisfaction scores improved to 4.5/5 in the latest survey.", metadata={"source": "hr_annual_review.pdf"}),
]

vectorstore = FAISS.from_documents(documents, embeddings)
retriever = vectorstore.as_retriever()

# BUG: The prompt does not ask the LLM to cite sources — answers have no attribution
prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer the question based on the context below.\n\nContext: {context}"),
    ("human", "{question}"),
])

def ask(question: str) -> str:
    docs = retriever.invoke(question)
    # Only page_content is kept here; doc.metadata (including the source name) is dropped
    context = "\n".join([doc.page_content for doc in docs])
    chain = prompt | llm
    result = chain.invoke({"context": context, "question": question})
    return result.content

print(ask("What was the revenue growth?"))
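
For comparison, here is one possible shape for a revised `ask`, offered as a sketch rather than the expected solution. `ask_with_citations` is a hypothetical name, and the retriever and LLM are passed in as arguments so the function can be exercised without an API key. `FakeRetriever` and `EchoLLM` are test doubles invented for this demo. The prompt wording is an assumption.

```python
from types import SimpleNamespace


def ask_with_citations(question: str, retriever, llm) -> str:
    """Retrieve documents, label them by source, and ask for a cited answer."""
    docs = retriever.invoke(question)
    # Keep the metadata: prefix every chunk with the document it came from.
    context = "\n".join(
        f"[Source: {doc.metadata.get('source', 'unknown')}] {doc.page_content}"
        for doc in docs
    )
    system = (
        "Answer the question using only the context below. "
        "End every factual claim with a citation in the form [Source: <name>], "
        "using the label of the context entry that supports it.\n\n"
        f"Context:\n{context}"
    )
    # LangChain chat models accept a list of (role, content) tuples as input.
    result = llm.invoke([("system", system), ("human", question)])
    return result.content


class FakeRetriever:
    """Test double: always returns one document with source metadata."""

    def invoke(self, query):
        return [SimpleNamespace(
            page_content="The company's revenue grew 40% year-over-year in Q3 2024.",
            metadata={"source": "earnings_report_q3.pdf"},
        )]


class EchoLLM:
    """Test double: returns the system message so the demo shows what the model sees."""

    def invoke(self, messages):
        return SimpleNamespace(content=messages[0][1])


print(ask_with_citations("What was the revenue growth?", FakeRetriever(), EchoLLM()))
```

With the starter code's objects, the call would be `ask_with_citations("What was the revenue growth?", retriever, llm)`. Retrieval still goes through `retriever.invoke`, so the constraint against bypassing the retrieval mechanism is respected.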