
#51. Metadata Filter

Medium, RAG, Tool Calling

The Problem

Your company has policy documents spanning multiple years and categories (HR, finance, security). When a user asks for "2024 HR policies," the retriever ignores the year and category constraints and returns semantically similar documents from any year, so 2022 safety training shows up alongside 2024 PTO policies. The user ends up with outdated information they can't trust. Your job is to extract metadata filters (year, category) from the user's query and apply them during retrieval so that only documents matching the specified criteria are returned.

Examples

Example 1

User input: What are the 2024 HR policies?

Current (bad) output: An answer mixing 2022 safety training, 2023 remote work, and 2024 PTO policies, even though the user asked for 2024 only.

Expected (good) output: Based on the 2024 HR policies: All employees receive 20 days of PTO per year.

Example 2

User input: What finance policies were updated in 2024?

Current (bad) output: Mentions the 2022 travel expense policy alongside 2024 401k matching because no year filter is applied.

Expected (good) output: In 2024, the finance policy states that the company matches 401k contributions up to 6%.

Example 3

User input: What are the current security requirements?

Current (bad) output: Returns HR or finance documents that are semantically closer to the query but belong to the wrong category.

Expected (good) output: According to the 2024 security policy, data must be encrypted at rest and in transit.
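
Example 3 names no year, yet the expected answer cites the 2024 policy, which suggests that "current" should resolve to the newest year available. The sketch below shows one way that could work. The helper name `resolve_year` and the `RECENCY_WORDS` set are assumptions made for illustration and are not part of the starter code.

```python
import re
from typing import Optional, Set

# Hypothetical vocabulary of words that imply "the newest year on file".
RECENCY_WORDS = {"current", "latest", "newest"}


def resolve_year(query: str, available_years: Set[int]) -> Optional[int]:
    """Return an explicit four-digit year, or the newest year for recency words."""
    match = re.search(r"\b(?:19|20)\d{2}\b", query)
    if match:
        return int(match.group())
    words = set(re.findall(r"[a-z]+", query.lower()))
    if words & RECENCY_WORDS and available_years:
        return max(available_years)
    return None  # no year constraint found


years = {2022, 2023, 2024}  # years present in the sample corpus
print(resolve_year("What are the current security requirements?", years))  # 2024
print(resolve_year("What finance policies were updated in 2024?", years))  # 2024
print(resolve_year("What is the travel expense policy?", years))  # None
```

Whether an unconstrained query should search every year or default to the newest one is a design decision the problem leaves open; this sketch returns `None` and leaves that choice to the caller.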

Your Task

Modify the retrieval pipeline so the agent:

Parses the user's natural language query to extract metadata filters (year, category).
Applies those filters during retrieval so only matching documents are searched.
Combines metadata filtering with semantic search for accurate, targeted results.
Returns answers grounded only in documents that match the user's criteria.
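
The first step, parsing filters out of the query, can be sketched with plain pattern matching. This is a minimal illustration, not the required approach: the function name `extract_filters` and the `CATEGORY_KEYWORDS` table are assumptions, and an LLM-based extractor would be an equally valid choice.

```python
import re

# Assumed keyword table covering the three categories named in the problem.
CATEGORY_KEYWORDS = {
    "hr": ("hr", "human resources"),
    "finance": ("finance", "financial"),
    "security": ("security",),
}


def extract_filters(query: str) -> dict:
    """Build a metadata filter dict from a natural language query."""
    filters = {}
    year = re.search(r"\b(?:19|20)\d{2}\b", query)
    if year:
        filters["year"] = int(year.group())
    lowered = query.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        # Whole-word match so that "hr" does not fire inside another word.
        if any(re.search(rf"\b{re.escape(kw)}\b", lowered) for kw in keywords):
            filters["category"] = category
            break
    return filters


print(extract_filters("What are the 2024 HR policies?"))
# {'year': 2024, 'category': 'hr'}
print(extract_filters("What are the current security requirements?"))
# {'category': 'security'}
```

The returned dict uses the same keys as the document metadata (`year`, `category`), so it can be handed to a retriever that accepts a metadata filter without further translation.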

Evaluation

Submissions are checked for the following:

Extracts metadata filters from query: The agent correctly identifies filter criteria (year, category) from the user's natural language query.
Applies metadata filters during retrieval: The retriever uses metadata filters to narrow results before or during search.
Returns correctly filtered results: The agent returns only documents matching the specified metadata criteria.
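
The third criterion can be checked locally before submitting. The helper below is only a self-check sketch; the name `mismatched` is hypothetical, and nothing here reflects how the actual grader works.

```python
def mismatched(docs_metadata: list, filters: dict) -> list:
    """Return the metadata dicts that violate at least one filter."""
    return [
        meta
        for meta in docs_metadata
        if any(meta.get(key) != value for key, value in filters.items())
    ]


# Metadata of documents a retriever might have returned for "2024 HR policies".
returned = [
    {"year": 2024, "category": "hr"},
    {"year": 2022, "category": "hr"},
]
print(mismatched(returned, {"year": 2024, "category": "hr"}))
# [{'year': 2022, 'category': 'hr'}]
```

An empty list means every returned document satisfies the extracted filters.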

Constraints

Documents must have metadata fields including year and category
The retriever must filter by metadata before or during search
The agent must extract filter criteria from the user's natural language query
Semantic search alone is not sufficient; metadata filtering is required
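
To illustrate what "filter before search" means, the sketch below narrows the candidate set by metadata first and only then ranks what remains. It uses the six sample policies from the problem, but the token-overlap score is a deliberately crude stand-in for embedding similarity, and the names `search` and `tokens` are assumptions.

```python
import re

DOCS = [
    ("Employees must complete safety training annually.", {"year": 2022, "category": "hr"}),
    ("Remote work is allowed up to 3 days per week.", {"year": 2023, "category": "hr"}),
    ("All employees receive 20 days of PTO per year.", {"year": 2024, "category": "hr"}),
    ("Travel expenses must be submitted within 30 days.", {"year": 2022, "category": "finance"}),
    ("The company matches 401k contributions up to 6%.", {"year": 2024, "category": "finance"}),
    ("Data must be encrypted at rest and in transit.", {"year": 2024, "category": "security"}),
]


def tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search(query: str, filters: dict, k: int = 2) -> list:
    # Step 1: keep only documents whose metadata matches every filter.
    candidates = [
        (text, meta)
        for text, meta in DOCS
        if all(meta.get(key) == value for key, value in filters.items())
    ]
    # Step 2: rank the survivors (toy score: number of shared tokens).
    query_tokens = tokens(query)
    ranked = sorted(
        candidates, key=lambda doc: len(query_tokens & tokens(doc[0])), reverse=True
    )
    return [text for text, _ in ranked[:k]]


print(search("What are the 2024 HR policies?", {"year": 2024, "category": "hr"}))
# ['All employees receive 20 days of PTO per year.']
```

Because the filter runs first, a 2022 or 2023 document can never appear in the results, no matter how similar its text is to the query.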

```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document

llm = ChatOpenAI(model="gpt-4o-mini")
embeddings = OpenAIEmbeddings()

documents = [
    Document(page_content="Employees must complete safety training annually.", metadata={"year": 2022, "category": "hr"}),
    Document(page_content="Remote work is allowed up to 3 days per week.", metadata={"year": 2023, "category": "hr"}),
    Document(page_content="All employees receive 20 days of PTO per year.", metadata={"year": 2024, "category": "hr"}),
    Document(page_content="Travel expenses must be submitted within 30 days.", metadata={"year": 2022, "category": "finance"}),
    Document(page_content="The company matches 401k contributions up to 6%.", metadata={"year": 2024, "category": "finance"}),
    Document(page_content="Data must be encrypted at rest and in transit.", metadata={"year": 2024, "category": "security"}),
]

vectorstore = FAISS.from_documents(documents, embeddings)

# BUG: No metadata filtering, so this retrieves semantically similar but wrong-year docs
retriever = vectorstore.as_retriever()

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer based on the context.\n\nContext: {context}"),
    ("human", "{question}"),
])


def ask(question: str) -> str:
    docs = retriever.invoke(question)
    context = "\n".join([doc.page_content for doc in docs])
    chain = prompt | llm
    result = chain.invoke({"context": context, "question": question})
    return result.content


# Asks for 2024 policies but may get 2022 docs instead
print(ask("What are the 2024 HR policies?"))
```
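
One possible direction for the fix is to pass a metadata filter to the vector store at query time. The sketch below wraps the LangChain FAISS `similarity_search` call, which accepts `filter` and `fetch_k` keyword arguments. The wrapper name `retrieve_filtered` and the `fetch_k=50` default are assumptions chosen for illustration. To my understanding, the FAISS wrapper applies the filter to the `fetch_k` nearest candidates, so `fetch_k` should be large enough that matching documents are not cut off before filtering.

```python
def retrieve_filtered(vectorstore, question: str, filters: dict, k: int = 4):
    """Search only documents whose metadata matches every key in `filters`.

    `vectorstore` is expected to be the FAISS store from the starter code,
    or any object with a compatible `similarity_search` method.
    """
    return vectorstore.similarity_search(
        question,
        k=k,
        # An empty dict means "no constraint"; pass None to skip filtering.
        filter=filters or None,
        # Assumed value: fetch more candidates than k so filtering has room.
        fetch_k=50,
    )


# Usage inside ask(), assuming the filters were already extracted from the query:
#   docs = retrieve_filtered(vectorstore, question, {"year": 2024, "category": "hr"})
#   context = "\n".join(doc.page_content for doc in docs)
```

A fixed filter can also be supplied when building the retriever, via `vectorstore.as_retriever(search_kwargs={"filter": {...}})`, but a per-query call fits this problem better because the filter changes with each question.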