Agent Foundry

#54. RAG with Tables

Hard · RAG

The Problem

Your financial reports contain important data in tables: revenue by region, expense breakdowns, quarterly comparisons. The current RAG pipeline uses a naive character-based text splitter that cuts the document into fixed-size chunks with no awareness of table structure. As a result, a table's header row can land in one chunk, its first two rows in another, and the remaining rows in a third. When the agent retrieves one of these chunks, it sees a table fragment with no headers and no context, so it cannot reliably answer questions like "What was Asia Pacific's revenue growth?" Your job is to implement a table-aware chunking strategy that keeps tables intact.
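To make the failure concrete, here is a minimal sketch of what fixed-size slicing does to the revenue table. The `naive_chunks` helper is hypothetical: a plain fixed-width slicer standing in for a structure-unaware splitter, not LangChain's own implementation.

```python
# Hypothetical fixed-width slicer, used only to illustrate the failure mode.
def naive_chunks(text: str, size: int) -> list[str]:
    """Cut text every `size` characters, ignoring lines and tables."""
    return [text[i:i + size] for i in range(0, len(text), size)]

# Revenue table copied from the report in the starter code.
table = "\n".join([
    "| Region       | Q3 Revenue | Q2 Revenue | Growth |",
    "|-------------|-----------|-----------|--------|",
    "| North America | $45M      | $38M      | 18.4%  |",
    "| Europe       | $32M      | $29M      | 10.3%  |",
    "| Asia Pacific | $28M      | $22M      | 27.3%  |",
    "| Latin America | $12M      | $11M      | 9.1%   |",
])

chunks = naive_chunks(table, 150)

# Chunks that mention the Asia Pacific row, and whether they carry the header.
asia_chunks = [c for c in chunks if "Asia Pacific" in c]
asia_has_header = any("Q3 Revenue" in c for c in asia_chunks)

for i, chunk in enumerate(chunks):
    print(f"--- chunk {i} ---\n{chunk}")
print("Asia Pacific chunk includes header:", asia_has_header)
```

The chunk that holds the Asia Pacific row starts partway through the table, so the column names never reach the model alongside the numbers.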

Examples

Example 1

User input: What was the revenue growth in Asia Pacific?

Current (bad) output: The agent retrieves a chunk containing the row fragment "| Asia Pacific | $28M" but not the table header, so it can't tell which column is Q3 revenue, Q2 revenue, or growth. It guesses or hallucinates.

Expected (good) output: Asia Pacific's revenue grew 27.3% in Q3 2024, from $22M in Q2 to $28M in Q3.

Example 2

User input: What percentage of revenue goes to R&D?

Current (bad) output: The agent retrieves a fragment like "| R&D | $25M |" without the header row, so it can't tell what the amount or the accompanying percentage refers to.

Expected (good) output: R&D spending was $25M, representing 21.4% of revenue in Q3 2024.

Example 3

User input: Which region had the highest growth?

Current (bad) output: Returns an incomplete or wrong answer because the revenue table is split across chunks and the agent can't compare all regions.

Expected (good) output: Asia Pacific had the highest growth at 27.3%, followed by North America at 18.4%.
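The expected answers above can be checked against the table itself. The sketch below assumes a pipe-delimited table with a separator row on its second line; `parse_markdown_table` and `millions` are illustrative helpers, not part of the starter code.

```python
def split_row(line: str) -> list[str]:
    """Split one pipe-delimited row into stripped cell values."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]

def parse_markdown_table(table: str) -> list[dict[str, str]]:
    """Assumes line 0 is the header and line 1 is the |---| separator."""
    lines = table.strip().splitlines()
    header = split_row(lines[0])
    return [dict(zip(header, split_row(line))) for line in lines[2:]]

def millions(cell: str) -> float:
    """Convert a value such as '$28M' to 28.0."""
    return float(cell.strip("$M"))

revenue_table = """
| Region        | Q3 Revenue | Q2 Revenue | Growth |
|---------------|------------|------------|--------|
| North America | $45M       | $38M       | 18.4%  |
| Europe        | $32M       | $29M       | 10.3%  |
| Asia Pacific  | $28M       | $22M       | 27.3%  |
| Latin America | $12M       | $11M       | 9.1%   |
"""

rows = parse_markdown_table(revenue_table)

# Growth = (Q3 - Q2) / Q2, as a percentage rounded to one decimal place.
growth = {
    row["Region"]: round(
        (millions(row["Q3 Revenue"]) - millions(row["Q2 Revenue"]))
        / millions(row["Q2 Revenue"]) * 100,
        1,
    )
    for row in rows
}
best = max(growth, key=growth.get)
print(growth, best)
```

Recomputing growth from the Q2 and Q3 columns reproduces the figures in the Growth column, which is only possible when the header and every row arrive in the same chunk.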

Your Task

Fix the chunking strategy so the pipeline:

  • Detects table boundaries in the document and treats each table as an atomic unit.
  • Keeps table headers with their data rows and never splits a table across chunks.
  • Handles documents with both prose paragraphs and tables.
  • Produces correct answers to questions that require reading tabular data.
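One possible shape for such a chunker is sketched below. It assumes tables are pipe-delimited rows and that blocks are separated by blank lines; the names `split_blocks` and `table_aware_chunks` and the `max_chars` parameter are hypothetical, and this is one way to meet the requirements rather than the reference solution.

```python
def split_blocks(text: str) -> list[tuple[str, str]]:
    """Group consecutive lines into ('table' | 'prose', block_text) pairs."""
    blocks, current, kind = [], [], None
    for line in text.splitlines():
        if not line.strip():
            line_kind = None  # a blank line closes the current block
        elif line.lstrip().startswith("|"):
            line_kind = "table"  # assumption: table rows start with a pipe
        else:
            line_kind = "prose"
        if line_kind != kind and current:
            blocks.append((kind, "\n".join(current)))
            current = []
        kind = line_kind
        if line_kind is not None:
            current.append(line)
    if current:
        blocks.append((kind, "\n".join(current)))
    return blocks

def table_aware_chunks(text: str, max_chars: int = 150) -> list[str]:
    """Pack prose up to max_chars, but always emit a table as one unit."""
    chunks, pending = [], ""
    for kind, block in split_blocks(text):
        if kind == "table":
            # Keep the preceding prose (e.g. a section title) with the table.
            # The table is never split, even if it exceeds max_chars.
            chunks.append((pending + "\n\n" + block).strip())
            pending = ""
        elif pending and len(pending) + len(block) + 2 > max_chars:
            chunks.append(pending)
            pending = block
        else:
            # Limitation: a single oversized prose block is not split further.
            pending = (pending + "\n\n" + block).strip()
    if pending:
        chunks.append(pending)
    return chunks

sample = """Expense Breakdown:
Operating expenses were well managed this quarter.

| Category     | Amount  | % of Revenue |
|-------------|---------|-------------|
| R&D          | $25M    | 21.4%       |
| Sales        | $18M    | 15.4%       |
| Operations   | $15M    | 12.8%       |
| G&A          | $8M     | 6.8%        |

The company projects continued growth in Q4 2024."""

chunks = table_aware_chunks(sample)
for chunk in chunks:
    print(chunk, end="\n-----\n")
```

Each returned string could then be wrapped as `Document(page_content=chunk)` in place of the splitter output in the starter code. Attaching the preceding prose to the table is a design choice: it gives the retriever a section title such as "Expense Breakdown:" to match against.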

Evaluation

Submissions are checked for the following:

  • Preserves table structure: Tables are kept intact during chunking and not split across chunk boundaries.
  • Answers tabular questions correctly: The agent correctly extracts and reports data from tables in the documents.
  • Handles mixed prose and tables: The chunking strategy works for documents containing both narrative text and tables.
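The first criterion can be approximated locally before submitting. The following is a rough self-check, not the grader's actual logic: `extract_tables` and `tables_intact` are hypothetical helpers that assume pipe-delimited tables and compare the tables found in the chunks with those in the source document.

```python
def extract_tables(text: str) -> list[str]:
    """Return each run of consecutive pipe-prefixed lines as one table."""
    tables, current = [], []
    for line in text.splitlines():
        if line.lstrip().startswith("|"):
            current.append(line)
        elif current:
            tables.append("\n".join(current))
            current = []
    if current:
        tables.append("\n".join(current))
    return tables

def tables_intact(document: str, chunks: list[str]) -> bool:
    """True if every document table appears whole and no chunk holds a fragment."""
    doc_tables = set(extract_tables(document))
    chunk_tables = [t for chunk in chunks for t in extract_tables(chunk)]
    every_table_present = doc_tables <= set(chunk_tables)
    no_fragments = all(t in doc_tables for t in chunk_tables)
    return every_table_present and no_fragments

doc = """Expense Breakdown:

| Category     | Amount  | % of Revenue |
|-------------|---------|-------------|
| R&D          | $25M    | 21.4%       |
| Sales        | $18M    | 15.4%       |
| Operations   | $15M    | 12.8%       |
| G&A          | $8M     | 6.8%        |

The company projects continued growth in Q4 2024."""

# Chunks split on blank lines keep the table whole; 80-character slices do not.
good_chunks = doc.split("\n\n")
bad_chunks = [doc[i:i + 80] for i in range(0, len(doc), 80)]

good_ok = tables_intact(doc, good_chunks)
bad_ok = tables_intact(doc, bad_chunks)
print(good_ok, bad_ok)
```

A check like this only covers structure. Whether the agent answers tabular questions correctly still has to be verified by running the questions from the examples.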

Constraints

  • Tables in documents must be preserved during chunking.
  • The agent must correctly answer questions requiring tabular data.
  • Chunk boundaries must not split table rows apart.
  • The chunking strategy must handle both prose and tables in the same document.

Starter Code

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_text_splitters import CharacterTextSplitter

llm = ChatOpenAI(model="gpt-4o-mini")
embeddings = OpenAIEmbeddings()

document_text = """Company Financial Report Q3 2024

Revenue Summary:
The company achieved strong growth across all segments.

| Region       | Q3 Revenue | Q2 Revenue | Growth |
|-------------|-----------|-----------|--------|
| North America | $45M      | $38M      | 18.4%  |
| Europe       | $32M      | $29M      | 10.3%  |
| Asia Pacific | $28M      | $22M      | 27.3%  |
| Latin America | $12M      | $11M      | 9.1%   |

Expense Breakdown:
Operating expenses were well managed this quarter.

| Category     | Amount  | % of Revenue |
|-------------|---------|-------------|
| R&D          | $25M    | 21.4%       |
| Sales        | $18M    | 15.4%       |
| Operations   | $15M    | 12.8%       |
| G&A          | $8M     | 6.8%        |

The company projects continued growth in Q4 2024."""

# BUG: Naive character splitting breaks tables apart
splitter = CharacterTextSplitter(chunk_size=150, chunk_overlap=20)
chunks = splitter.split_text(document_text)
docs = [Document(page_content=chunk) for chunk in chunks]

vectorstore = FAISS.from_documents(docs, embeddings)
retriever = vectorstore.as_retriever()

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer based on context.\n\nContext: {context}"),
    ("human", "{question}"),
])

def ask(question: str) -> str:
    retrieved = retriever.invoke(question)
    context = "\n".join([doc.page_content for doc in retrieved])
    chain = prompt | llm
    result = chain.invoke({"context": context, "question": question})
    return result.content

print(ask("What was the revenue growth in Asia Pacific?"))