Agent Foundry

#102. A/B Test Two Agents

Hard | Evaluation | Cost Optimization

The Problem

You want to know whether your customer-support agent should use gpt-4o with a detailed prompt or gpt-4o-mini with a concise prompt. Today, there is only one configuration—no way to compare alternatives on quality or cost. Your job is to build an A/B testing framework that runs both configurations side-by-side on the same inputs, scores each on quality and cost, and recommends a winner.

Examples

Example 1

Test input: What is your return policy?

Current (bad) output: Only one configuration runs. You have no idea if a cheaper model would produce equivalent quality.

Expected (good) output:

Config A (gpt-4o, detailed prompt):
  Quality: 4.5/5  |  Cost: $0.0035  |  Latency: 1.8s

Config B (gpt-4o-mini, concise prompt):
  Quality: 4.2/5  |  Cost: $0.0004  |  Latency: 0.6s
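
The per-query cost figures above can be derived from token counts. The following is a minimal sketch, assuming a per-million-token price table; the model keys and prices in `PRICES_PER_MILLION` are hypothetical placeholders, not real rates, and `call_cost` is an illustrative helper name.

```python
# Hypothetical placeholder prices in USD per one million tokens.
# Replace these with the current rates published by your provider.
PRICES_PER_MILLION = {
    "model-large": {"input": 2.50, "output": 10.00},
    "model-small": {"input": 0.25, "output": 1.00},
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the dollar cost of a single call from its token counts."""
    price = PRICES_PER_MILLION[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000
```

With LangChain, recent versions appear to attach token counts to the response as `response.usage_metadata` (keys `input_tokens` and `output_tokens`); verify this against your installed version before relying on it.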

Example 2

A/B test summary across 5 queries:

Current (bad) output: No comparison data exists.

Expected (good) output:

=== A/B Test Results ===
Config A (gpt-4o):      Avg quality: 4.6/5  |  Total cost: $0.0180  |  Avg latency: 1.9s
Config B (gpt-4o-mini): Avg quality: 4.1/5  |  Total cost: $0.0020  |  Avg latency: 0.5s

Cost-effectiveness (quality per $):
  Config A: 255.6 quality-points/$
  Config B: 2050.0 quality-points/$

RECOMMENDATION: Config B — 89% cheaper with only 11% quality reduction.
  Use Config A only for complex queries requiring deep reasoning.
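
The summary figures follow directly from the averages and totals in the report. This short sketch reproduces the arithmetic; the function names are illustrative, not part of the required interface.

```python
def quality_per_dollar(avg_quality: float, total_cost: float) -> float:
    """Quality points obtained per dollar spent."""
    return avg_quality / total_cost


def relative_reduction(baseline: float, candidate: float) -> float:
    """Fraction by which candidate is lower than baseline."""
    return (baseline - candidate) / baseline


# Figures taken from the example report above.
a_quality, a_cost = 4.6, 0.0180
b_quality, b_cost = 4.1, 0.0020

print(f"Config A: {quality_per_dollar(a_quality, a_cost):.1f} quality-points/$")
print(f"Config B: {quality_per_dollar(b_quality, b_cost):.1f} quality-points/$")
print(f"Cost reduction: {relative_reduction(a_cost, b_cost):.0%}")
print(f"Quality reduction: {relative_reduction(a_quality, b_quality):.0%}")
```

Both percentages are measured relative to Config A, which is why a 0.5-point drop from 4.6 rounds to 11%.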

Your Task

Build an A/B testing framework so that:

  • Two agent configurations run on the same set of test inputs.
  • Each is scored on quality (via LLM-as-judge) and cost (via token tracking).
  • A summary report compares the two and recommends a winner with justification.
  • The framework is configurable — swapping in new configurations requires no code changes.
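
One possible shape for such a framework is sketched below. It assumes configurations are supplied as JSON data and that the model call and the judge are passed in as callables, so adding a configuration only means editing data. All names here (`AgentConfig`, `RunResult`, `load_configs`, `run_ab_test`, `summarize`) are hypothetical, and this is one design among several, not a reference solution.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AgentConfig:
    name: str
    model: str
    system_prompt: str


@dataclass
class RunResult:
    config: str
    query: str
    answer: str
    quality: float
    cost: float


def load_configs(raw_json: str) -> list[AgentConfig]:
    """Build configurations from a JSON array, so new ones need no code changes."""
    return [AgentConfig(**entry) for entry in json.loads(raw_json)]


def run_ab_test(
    configs: list[AgentConfig],
    queries: list[str],
    run_agent: Callable[[AgentConfig, str], tuple[str, float]],
    judge: Callable[[str, str], float],
) -> list[RunResult]:
    """Run every configuration on every query, so all see identical inputs."""
    results = []
    for query in queries:
        for config in configs:
            answer, cost = run_agent(config, query)  # returns (answer, dollars)
            quality = judge(query, answer)
            results.append(RunResult(config.name, query, answer, quality, cost))
    return results


def summarize(results: list[RunResult]) -> dict[str, dict[str, float]]:
    """Aggregate average quality and total cost per configuration."""
    grouped: dict[str, list[RunResult]] = {}
    for result in results:
        grouped.setdefault(result.config, []).append(result)
    return {
        name: {
            "avg_quality": sum(r.quality for r in rows) / len(rows),
            "total_cost": sum(r.cost for r in rows),
        }
        for name, rows in grouped.items()
    }
```

Injecting `run_agent` and `judge` keeps the comparison loop independent of any particular model client, which also makes it testable with stubs.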

Evaluation

Submissions are checked for the following:

  • Same inputs for both: Both agent configurations run on the same set of test inputs.
  • Quality and cost scores: Each configuration is scored on both quality and cost metrics.
  • Winner recommendation: The comparison produces a winner recommendation with justification.
  • Swappable configurations: New agent configurations can be added without modifying the comparison framework.
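
For the quality score, an LLM-as-judge step needs a rubric prompt and a way to turn the judge's reply into a number. The sketch below assumes a 1 to 5 scale, matching the examples, and takes the judge model as a plain callable named `ask_llm`; the prompt wording, `parse_score`, and `judge_answer` are illustrative assumptions.

```python
import re
from typing import Callable

# Hypothetical rubric prompt; tune the criteria to your own support domain.
JUDGE_TEMPLATE = (
    "Rate the customer-support answer below from 1 (poor) to 5 (excellent) "
    "for accuracy and helpfulness.\n"
    "Question: {query}\n"
    "Answer: {answer}\n"
    "Reply with only the number."
)


def parse_score(reply: str) -> float:
    """Extract the first number in the judge's reply and check it is in 1..5."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        raise ValueError(f"No numeric score found in judge reply: {reply!r}")
    score = float(match.group())
    if not 1.0 <= score <= 5.0:
        raise ValueError(f"Score {score} is outside the 1-5 range")
    return score


def judge_answer(query: str, answer: str, ask_llm: Callable[[str], str]) -> float:
    """Send the rubric prompt to the judge model and return its parsed score."""
    prompt = JUDGE_TEMPLATE.format(query=query, answer=answer)
    return parse_score(ask_llm(prompt))
```

Raising on an unparseable reply, rather than defaulting to a middle score, keeps judge failures from silently skewing the averages.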

Constraints

  • Both agent configurations must run on the same set of test inputs for fair comparison.
  • Each configuration must be scored on quality (accuracy) and cost (tokens/dollars).
  • The comparison must produce a winner recommendation with statistical justification.
  • The framework must support swapping in new configurations without code changes.
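
The problem does not say which statistical test to use. Because both configurations answer the same queries, one reasonable choice is a paired test on the per-query score differences. The sketch below uses an exact sign-flip permutation test with only the standard library; the choice of test and the name `paired_permutation_p_value` are assumptions, not requirements.

```python
from itertools import product
from statistics import mean


def paired_permutation_p_value(scores_a: list[float], scores_b: list[float]) -> float:
    """Two-sided exact sign-flip test on paired per-query score differences.

    Under the null hypothesis that the configurations are equally good, each
    difference is as likely to be positive as negative, so every sign
    assignment is enumerated and compared with the observed mean difference.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("Both configurations must be scored on the same queries")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(mean(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    return extreme / total
```

With only five queries there are 32 sign assignments, so the smallest attainable two-sided p-value is 2/32 = 0.0625. A five-query test set therefore cannot reach the conventional 0.05 threshold, which argues for a larger set of test inputs. Full enumeration grows as 2^n, so for larger sets it would be replaced with random sampling of sign assignments.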

Starter Code

from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage

# BUG: Only one agent configuration — no way to compare alternatives
# TODO: Run two configs side-by-side and compare quality vs cost
llm = ChatOpenAI(model="gpt-4o")

def run_agent(query: str) -> str:
    response = llm.invoke([
        SystemMessage(content="You are a helpful customer support agent. Answer thoroughly and professionally."),
        HumanMessage(content=query),
    ])
    return response.content

test_queries = [
    "What is your return policy?",
    "My order arrived damaged, what should I do?",
    "Can I change my shipping address after placing an order?",
    "Do you price match with competitors?",
    "How do I cancel my subscription?",
]
for q in test_queries:
    result = run_agent(q)
    print(f"Q: {q}\nA: {result[:100]}...\n")