Agent Foundry

#97. Caching Layer

Medium · Cost Optimization

The Problem

Your customer-support bot answers the same handful of questions hundreds of times a day—"What is your return policy?", "How do I return an item?", "What's your policy on returns?"—and every single one triggers a fresh LLM call. These semantically identical questions waste tokens and add latency. Your job is to add a semantic caching layer that recognizes when a new question is similar enough to a previously answered one and serves the cached response instantly, at zero token cost.

Examples

Example 1

User input sequence: "What is your return policy?" → "How do I return an item?" → "What's your policy on returns?"

Current (bad) output: Three separate LLM calls, each taking ~1 second and costing tokens. All three get slightly different phrasings of the same answer.

Expected (good) output: The first query calls the LLM and caches the response. The second and third queries are recognized as semantically similar (cosine similarity > 0.95), so the cached response is returned in under 10ms with zero token cost. Output shows: "[CACHE HIT — 0.003s] Our return policy allows…"
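
The threshold check in Example 1 can be sketched as follows. This is a minimal illustration, not a reference solution: the vectors are toy values rather than real embeddings, and `is_cache_hit` is a hypothetical helper name. Whether real paraphrases clear 0.95 depends on the embedding model, so treat the threshold as something to tune.

```python
import math
from typing import Sequence

SIMILARITY_THRESHOLD = 0.95  # value taken from Example 1


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def is_cache_hit(query_vec, cached_vec, threshold=SIMILARITY_THRESHOLD) -> bool:
    """Hypothetical helper: True when the two vectors are similar enough."""
    return cosine_similarity(query_vec, cached_vec) > threshold


# Toy 3-dimensional vectors standing in for real embeddings (assumption).
cached = [0.9, 0.1, 0.0]
near = [0.88, 0.12, 0.01]   # points in almost the same direction as `cached`
far = [0.1, 0.9, 0.2]       # points in a clearly different direction

print(is_cache_hit(near, cached))
print(is_cache_hit(far, cached))
```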

Example 2

User input sequence: "Do you offer free shipping?" → "Is shipping free?" → "Do you ship for free?"

Current (bad) output: Three LLM calls for what is essentially the same question.

Expected (good) output: One LLM call, two cache hits. The response time drops from ~1s to under 10ms for cached answers. Total LLM token cost is roughly one third of the original, since only the first of the three queries reaches the model.

Your Task

Modify the starter code so that:

  • A semantic cache stores query-response pairs indexed by query embeddings.
  • New queries are compared against cached queries using cosine similarity.
  • Queries above a similarity threshold return the cached response without calling the LLM.
  • The cache has a configurable max size or TTL to prevent unbounded growth.
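
One possible shape for such a cache is sketched below, assuming a linear scan over stored embeddings and least-recently-used eviction once `max_size` is reached. `SemanticCache`, its method names, and the hand-picked `TOY_VECTORS` are all hypothetical; in a real solution the `embed` callable would wrap an embedding model.

```python
import math
from collections import OrderedDict
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


class SemanticCache:
    """Hypothetical cache: query embeddings -> responses, bounded by max_size."""

    def __init__(self, embed: Callable[[str], Vector],
                 threshold: float = 0.95, max_size: int = 128):
        self.embed = embed
        self.threshold = threshold
        self.max_size = max_size
        self._entries = OrderedDict()  # query text -> (embedding, response)

    def get(self, query: str) -> Optional[str]:
        """Return the response of the most similar cached query, or None."""
        query_vec = self.embed(query)
        best_key, best_score = None, self.threshold
        for key, (vec, _) in self._entries.items():
            score = cosine_similarity(query_vec, vec)
            if score > best_score:  # must strictly exceed the threshold
                best_key, best_score = key, score
        if best_key is None:
            return None
        self._entries.move_to_end(best_key)  # mark as recently used
        return self._entries[best_key][1]

    def put(self, query: str, response: str) -> None:
        self._entries[query] = (self.embed(query), response)
        self._entries.move_to_end(query)
        while len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # evict least recently used

    def __len__(self) -> int:
        return len(self._entries)


# Toy stand-in for an embedding model: hand-picked 2-D vectors (assumption).
# Queries outside this table raise KeyError; a real embedder would not.
TOY_VECTORS = {
    "What is your return policy?": (1.0, 0.0),
    "How do I return an item?": (0.98, 0.05),
    "Do you offer free shipping?": (0.0, 1.0),
}

cache = SemanticCache(embed=TOY_VECTORS.__getitem__, max_size=2)
cache.put("What is your return policy?", "<cached return-policy answer>")
hit = cache.get("How do I return an item?")      # similar: cached answer
miss = cache.get("Do you offer free shipping?")  # unrelated: None
```

A linear scan is fine for a few hundred entries; a larger cache would likely want a vector index instead. On a miss this sketch embeds the query once in `get` and again in `put`, which a real implementation could avoid by passing the vector along.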

Evaluation

Submissions are checked for the following:

  • Cache hit on similar queries: Semantically similar queries return cached responses without calling the LLM.
  • Faster on cache hit: Cache hits are significantly faster than LLM calls.
  • Zero token cost on hit: Cached responses incur no LLM token cost.
  • Bounded cache size: The cache has a configurable TTL or max size to prevent unbounded growth.
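
The first three checks can be approximated locally without spending tokens. The sketch below is an assumption about how one might self-test, not the grader's actual method: `CountingLLM` is a hypothetical test double that counts calls and simulates latency, and `answer` is a deliberately trivial stand-in (a hand-written topic table) for the cached function under test.

```python
import time


class CountingLLM:
    """Hypothetical test double: counts calls and simulates model latency."""

    def __init__(self, delay_s: float = 0.05):
        self.calls = 0
        self.delay_s = delay_s

    def __call__(self, question: str) -> str:
        self.calls += 1
        time.sleep(self.delay_s)
        return f"answer to: {question}"


def timed(fn, arg):
    """Run fn(arg) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(arg)
    return result, time.perf_counter() - start


def check_hit(answer, llm, first, paraphrase):
    """Ask `first` (expected miss), then `paraphrase` (expected hit)."""
    first_reply, miss_s = timed(answer, first)
    calls_before = llm.calls
    second_reply, hit_s = timed(answer, paraphrase)
    return {
        "same_response": first_reply == second_reply,
        "llm_calls_on_hit": llm.calls - calls_before,
        "speedup": miss_s / hit_s if hit_s > 0 else float("inf"),
    }


# Stand-in for the system under test: a topic table instead of embeddings.
TOPIC_OF = {"Is shipping free?": "shipping", "Do you ship for free?": "shipping"}
llm = CountingLLM()
_store = {}


def answer(question: str) -> str:
    topic = TOPIC_OF[question]
    if topic not in _store:
        _store[topic] = llm(question)  # only a miss reaches the (fake) LLM
    return _store[topic]


report = check_hit(answer, llm, "Is shipping free?", "Do you ship for free?")
print(report)
```

Swapping `answer` for a real semantic-cache implementation keeps the harness unchanged: a passing run shows zero LLM calls on the paraphrase and a speedup well above 1.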

Constraints

  • Semantically similar queries must return cached responses without calling the LLM
  • The cache must store and retrieve based on semantic similarity, not exact string match
  • Cache hits must be significantly faster than LLM calls and incur zero token cost
  • The cache must have a configurable TTL or max size to prevent unbounded growth
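
For the TTL option, one approach is to stamp each entry with an expiry time and purge stale entries before every lookup. The sketch below assumes that design; `TTLSemanticCache`, the injectable `clock`, and the toy vectors are hypothetical names chosen for illustration.

```python
import math
import time
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


class TTLSemanticCache:
    """Hypothetical sketch: entries expire ttl_seconds after being stored."""

    def __init__(self, embed: Callable[[str], Vector], ttl_seconds: float = 3600.0,
                 threshold: float = 0.95, clock: Callable[[], float] = time.monotonic):
        self.embed = embed
        self.ttl_seconds = ttl_seconds
        self.threshold = threshold
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = []  # list of (embedding, response, expires_at)

    def _purge_expired(self) -> None:
        now = self.clock()
        self._entries = [e for e in self._entries if e[2] > now]

    def get(self, query: str) -> Optional[str]:
        self._purge_expired()
        query_vec = self.embed(query)
        best, best_score = None, self.threshold
        for vec, response, _ in self._entries:
            score = cosine_similarity(query_vec, vec)
            if score > best_score:
                best, best_score = response, score
        return best

    def put(self, query: str, response: str) -> None:
        self._purge_expired()
        expires_at = self.clock() + self.ttl_seconds
        self._entries.append((self.embed(query), response, expires_at))

    def __len__(self) -> int:
        return len(self._entries)


# Toy vectors standing in for real embeddings (assumption).
TOY = {"Is shipping free?": (1.0, 0.0), "Do you ship for free?": (0.99, 0.08)}
now = [0.0]  # fake clock, advanced by hand

cache = TTLSemanticCache(embed=TOY.__getitem__, ttl_seconds=60, clock=lambda: now[0])
cache.put("Is shipping free?", "<cached shipping answer>")
fresh = cache.get("Do you ship for free?")  # within TTL: cached answer
now[0] = 61.0
stale = cache.get("Do you ship for free?")  # past TTL: None
```

A TTL also limits how long an outdated answer (for example, after a policy change) can keep being served, which a pure size bound does not.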

Starter Code

from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage
import time

llm = ChatOpenAI(model="gpt-4o-mini")

# BUG: Every query calls the LLM, even repeated or near-identical questions
# TODO: Add a caching layer that serves cached responses for similar queries
def answer_question(question: str) -> str:
    start = time.time()
    response = llm.invoke([
        SystemMessage(content="You are a helpful customer support agent. Answer concisely."),
        HumanMessage(content=question),
    ])
    elapsed = time.time() - start
    return f"[{elapsed:.2f}s] {response.content}"

questions = [
    "What is your return policy?",
    "How do I return an item?",
    "What's your policy on returns?",
    "Do you offer free shipping?",
    "Is shipping free?",
    "What is your return policy?",
    "Do you ship for free?",
    "How can I return a product?",
]
for q in questions:
    result = answer_question(q)
    print(f"Q: {q}\nA: {result}\n")