Caching Layer - Problems

The Problem

Your customer-support bot answers the same handful of questions hundreds of times a day ("What is your return policy?", "How do I return an item?", "What's your policy on returns?"), and every one of them triggers a fresh LLM call. These questions are worded differently but mean the same thing, so answering each from scratch wastes tokens and adds latency. Your job is to add a semantic caching layer that recognizes when a new question is similar enough to a previously answered one and serves the cached response without calling the LLM, so that a hit returns in milliseconds and costs no LLM tokens.

Examples

Example 1

User input sequence: "What is your return policy?" → "How do I return an item?" → "What's your policy on returns?"

Current (bad) output: Three separate LLM calls, each taking ~1 second and costing tokens. The three queries receive slightly different phrasings of the same answer.

Expected (good) output: The first query calls the LLM and caches the response. The second and third queries are recognized as semantically similar (cosine similarity > 0.95), so the cached response is returned in under 10ms with zero token cost. Output shows: "[CACHE HIT — 0.003s] Our return policy allows…"
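The similarity test in Example 1 can be sketched as follows. This is a minimal illustration using only the standard library: the three-dimensional vectors are made-up stand-ins for real embeddings, and `is_cache_hit` is a hypothetical helper name. Only the 0.95 threshold comes from the example above.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

def is_cache_hit(query_vec, cached_vec, threshold=0.95):
    """Hypothetical helper: True when the query is close enough to reuse the answer."""
    return cosine_similarity(query_vec, cached_vec) > threshold

# Made-up "embeddings" for illustration only; a real model returns
# vectors with hundreds of dimensions.
return_policy = [0.9, 0.1, 0.0]
how_to_return = [0.88, 0.15, 0.02]
free_shipping = [0.1, 0.9, 0.2]

print(is_cache_hit(how_to_return, return_policy))  # nearly parallel vectors
print(is_cache_hit(free_shipping, return_policy))  # nearly orthogonal vectors
```

The right threshold depends on the embedding model in use: set it too low and unrelated questions share an answer, set it too high and paraphrases miss the cache.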

Example 2

User input sequence: "Do you offer free shipping?" → "Is shipping free?" → "Do you ship for free?"

Current (bad) output: Three LLM calls for what is essentially the same question.

Expected (good) output: One LLM call, two cache hits. The response time drops from ~1s to under 10ms for cached answers. Total LLM token cost is roughly 1/3 of the original, because one call now serves all three queries.
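The arithmetic behind the figures in Example 2 can be written out as a short sketch. The function names are hypothetical, and the calculation assumes that every LLM call costs the same and that the cost of embedding each query is negligible. The ~1s and 10ms values are the ones quoted above.

```python
def llm_cost_fraction(total_queries, cache_hits):
    """Fraction of the uncached LLM token cost that remains.

    Assumes every LLM call costs the same number of tokens.
    """
    return (total_queries - cache_hits) / total_queries

def total_latency_seconds(total_queries, cache_hits,
                          llm_seconds=1.0, hit_seconds=0.010):
    """Total time for a batch, using ~1s per LLM call and 10ms per hit."""
    misses = total_queries - cache_hits
    return misses * llm_seconds + cache_hits * hit_seconds

# Example 2: three queries, of which two are served from the cache.
print(f"cost fraction: {llm_cost_fraction(3, 2):.2f}")
print(f"latency with cache: {total_latency_seconds(3, 2):.2f} s")
print(f"latency without cache: {total_latency_seconds(3, 0):.2f} s")
```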

Your Task

Modify the starter code so that:

A semantic cache stores query-response pairs indexed by query embeddings.
New queries are compared against cached queries using cosine similarity.
Queries above a similarity threshold return the cached response without calling the LLM.
The cache has a configurable max size or TTL to prevent unbounded growth.
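One possible shape for a cache that meets the four requirements above is sketched below. It is not the starter code's API: `SemanticCache`, its parameters, and `toy_embed` are all hypothetical names, and `toy_embed` is a keyword counter standing in for a real embedding model so that the example runs without external services. Eviction here is least-recently-used plus a TTL, and lookup is a linear scan.

```python
import math
import time
from collections import OrderedDict

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def toy_embed(text):
    """Hypothetical stand-in for a real embedding model: counts topic keywords."""
    words = text.lower().replace("?", "").split()
    topics = [("return", "returns"), ("ship", "shipping"), ("hours", "open")]
    return [float(sum(w in group for w in words)) for group in topics]

class SemanticCache:
    """Query-response pairs indexed by embedding, bounded by max size and TTL."""

    def __init__(self, embed, threshold=0.95, max_size=1000,
                 ttl_seconds=3600.0, clock=time.monotonic):
        self.embed = embed
        self.threshold = threshold
        self.max_size = max_size
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        # query text -> (embedding, response, stored_at); order tracks recency
        self._entries = OrderedDict()

    def _evict_expired(self):
        now = self.clock()
        expired = [q for q, (_, _, stored_at) in self._entries.items()
                   if now - stored_at > self.ttl_seconds]
        for q in expired:
            del self._entries[q]

    def get(self, query):
        """Return the cached response for the most similar query, or None."""
        self._evict_expired()
        vec = self.embed(query)
        best_key, best_score = None, -1.0
        for key, (cached_vec, _, _) in self._entries.items():
            score = cosine_similarity(vec, cached_vec)
            if score > best_score:
                best_key, best_score = key, score
        if best_key is not None and best_score > self.threshold:
            self._entries.move_to_end(best_key)  # mark as recently used
            return self._entries[best_key][1]
        return None

    def put(self, query, response):
        self._entries[query] = (self.embed(query), response, self.clock())
        self._entries.move_to_end(query)
        while len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # drop the least recently used

cache = SemanticCache(toy_embed, threshold=0.95, max_size=100, ttl_seconds=600)
cache.put("What is your return policy?", "<cached answer>")
print(cache.get("How do I return an item?"))     # similar query
print(cache.get("Do you offer free shipping?"))  # unrelated query
```

The linear scan is adequate for a small cache. For a large one, a vector index would be a natural substitute, at the price of an extra dependency.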

Evaluation

Submissions are checked for the following:

Cache hit on similar queries: Semantically similar queries return cached responses without calling the LLM.
Faster on cache hit: Cache hits are significantly faster than LLM calls (in the examples above, under 10ms versus ~1s).
Zero token cost on hit: Cached responses incur no LLM token cost.
Bounded cache size: The cache has a configurable TTL or max size to prevent unbounded growth.
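A rough way to check the first three criteria yourself is to wrap a fake LLM that counts its own calls and tokens, then run the queries from Example 2 through the cache. Everything below is an assumption made for illustration and is not the actual grader: `FakeLLM`, `answer`, and `toy_embed` are hypothetical names, the token count is a crude word count, and the cache is a plain list with no eviction so the sketch stays short.

```python
import math
import time

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def toy_embed(text):
    """Hypothetical stand-in for a real embedding model: counts topic keywords."""
    words = text.lower().replace("?", "").split()
    topics = [("return", "returns"), ("ship", "shipping"), ("hours", "open")]
    return [float(sum(w in group for w in words)) for group in topics]

class FakeLLM:
    """Hypothetical stand-in for a real LLM client; counts calls and tokens."""

    def __init__(self, delay=0.05):
        self.calls = 0
        self.tokens = 0
        self.delay = delay

    def complete(self, prompt):
        self.calls += 1
        self.tokens += len(prompt.split())  # crude token estimate
        time.sleep(self.delay)              # simulate network and generation time
        return f"answer to: {prompt}"

def answer(query, entries, llm, threshold=0.95):
    """Return (response, was_hit, elapsed_seconds), calling the LLM only on a miss."""
    start = time.perf_counter()
    vec = toy_embed(query)
    for cached_vec, cached_response in entries:
        if cosine(vec, cached_vec) > threshold:
            return cached_response, True, time.perf_counter() - start
    response = llm.complete(query)
    entries.append((vec, response))
    return response, False, time.perf_counter() - start

llm = FakeLLM()
entries = []  # unbounded list, for brevity only
queries = ["Do you offer free shipping?", "Is shipping free?", "Do you ship for free?"]
results = [answer(q, entries, llm) for q in queries]

for response, hit, elapsed in results:
    label = "CACHE HIT" if hit else "LLM CALL"
    print(f"[{label} {elapsed:.3f}s] {response}")
print(f"LLM calls: {llm.calls}, tokens billed: {llm.tokens}")
```

If the cache works, the call and token counters stop increasing after the first query, and the elapsed time for the second and third queries is a small fraction of the first.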

#97. Caching Layer