Agent Foundry

#101. Agent Evaluation Harness

Hard · Evaluation

The Problem

Your customer-support agent is "tested" by manually running a few queries and eyeballing the output. There are no metrics, no expected outputs, and no way to detect regressions when you change the prompt or model. Your job is to build an automated evaluation harness that runs a suite of at least 10 test cases, scores each one on accuracy, latency, and cost, and produces a summary report, so that agent quality becomes measurable and repeatable.

Examples

Example 1

Test case: Input: "What is your return policy?" — Expected: mentions 30-day window and free returns

Current (bad) output: The developer manually reads the response and decides "looks good" — no score, no record, not reproducible.

Expected (good) output:

Test 1: "What is your return policy?"
  Accuracy: 0.9 (mentions 30-day window ✓, mentions free returns ✓, misses restocking fee)
  Latency: 1.23s
  Cost: $0.0004
  Status: PASS
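
One simple way to produce an accuracy number like the one above is keyword matching: score the fraction of expected phrases that appear in the response. The sketch below is a minimal illustration under that assumption; `keyword_accuracy` is a hypothetical helper name, not part of the starter code. Plain substring matching only yields fractions such as 0.5 or 1.0, so a partial-credit score like 0.9 would likely come from a weighted rubric or an LLM judge instead.

```python
def keyword_accuracy(response: str, expected_phrases: list[str]) -> float:
    """Return the fraction of expected phrases found in the response (case-insensitive)."""
    if not expected_phrases:
        # Assumption: a test case with nothing to check counts as fully accurate.
        return 1.0
    text = response.lower()
    hits = sum(1 for phrase in expected_phrases if phrase.lower() in text)
    return hits / len(expected_phrases)


expected = ["30-day window", "free returns"]

# Both phrases appear verbatim, so every check passes.
full_score = keyword_accuracy(
    "We offer free returns within a 30-day window.", expected
)

# Same meaning, different wording: "returns are free" does not contain "free returns".
partial_score = keyword_accuracy(
    "Returns are free within our 30-day window.", expected
)
```

The second call shows the main weakness of substring matching: a correct paraphrase is penalized. That brittleness is one reason to consider an LLM-as-judge scorer for free-form answers.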

Example 2

Test suite summary:

Current (bad) output: No summary exists. The developer has a vague sense that "it usually works."

Expected (good) output:

=== Evaluation Report ===
Total test cases: 10
Passed: 8 (80%)
Failed: 2 (20%)
Avg accuracy: 0.85
Avg latency: 1.15s
Total cost: $0.0042
Failed tests: #4 (shipping question), #9 (edge case: empty input)
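
A report in this shape can be built by aggregating per-test results. The following is a rough sketch, assuming each result is stored in a small record; `CaseResult` and `format_report` are illustrative names rather than anything the problem prescribes.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    name: str
    accuracy: float   # 0.0 to 1.0
    latency_s: float  # wall-clock seconds
    cost_usd: float
    passed: bool


def format_report(results: list[CaseResult]) -> str:
    """Aggregate per-test results into a plain-text summary report."""
    total = len(results)
    if total == 0:
        return "=== Evaluation Report ===\nTotal test cases: 0"
    passed = sum(r.passed for r in results)
    failed = total - passed
    # Number failed tests by their 1-based position in the suite.
    failed_names = ", ".join(
        f"#{i} ({r.name})" for i, r in enumerate(results, start=1) if not r.passed
    )
    lines = [
        "=== Evaluation Report ===",
        f"Total test cases: {total}",
        f"Passed: {passed} ({passed / total:.0%})",
        f"Failed: {failed} ({failed / total:.0%})",
        f"Avg accuracy: {sum(r.accuracy for r in results) / total:.2f}",
        f"Avg latency: {sum(r.latency_s for r in results) / total:.2f}s",
        f"Total cost: ${sum(r.cost_usd for r in results):.4f}",
        f"Failed tests: {failed_names or 'none'}",
    ]
    return "\n".join(lines)


# Made-up sample data, only to exercise the formatter.
sample_results = [
    CaseResult("return policy", 1.0, 1.2, 0.0004, True),
    CaseResult("shipping question", 0.5, 1.0, 0.0006, False),
]
report = format_report(sample_results)
print(report)
```

Keeping aggregation separate from scoring means the same report function works whether accuracy comes from keyword matching or an LLM judge.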

Your Task

Build an evaluation harness so that:

  • At least 10 test cases are defined with inputs and expected outputs.
  • Each test case is scored on accuracy (LLM-as-judge or keyword matching), latency (wall-clock time), and cost (token usage).
  • Results are aggregated into a summary report with pass/fail rates and averages.
  • The harness is data-driven — adding new test cases requires only adding data, not changing code.
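
To make the data-driven requirement concrete, here is a minimal sketch in which test cases live in JSON and the runner never needs to change when cases are added. The field names (`name`, `input`, `expected`), the `PASS_THRESHOLD` value, and `stub_agent` are all assumptions for illustration; the stub stands in for `customer_support_agent` so the example runs offline. Only two cases are shown, cost scoring is left out for brevity, and a real suite would need at least 10 cases.

```python
import json
import time
from typing import Callable

# Test cases are pure data. Inline JSON here; loading from a file works the same way.
TEST_CASES_JSON = """
[
  {"name": "return policy",
   "input": "What is your return policy?",
   "expected": ["30-day", "free returns"]},
  {"name": "international shipping",
   "input": "Do you ship internationally?",
   "expected": ["international"]}
]
"""

PASS_THRESHOLD = 0.5  # hypothetical cutoff; tune for your suite


def run_suite(agent: Callable[[str], str], cases: list[dict]) -> list[dict]:
    """Run every case through the agent, recording accuracy and latency."""
    results = []
    for case in cases:
        start = time.perf_counter()
        response = agent(case["input"])
        latency_s = time.perf_counter() - start

        # Keyword matching: fraction of expected phrases present in the response.
        text = response.lower()
        expected = case["expected"]
        hits = sum(1 for kw in expected if kw.lower() in text)
        accuracy = hits / len(expected) if expected else 1.0

        results.append({
            "name": case["name"],
            "accuracy": accuracy,
            "latency_s": latency_s,
            "passed": accuracy >= PASS_THRESHOLD,
        })
    return results


def stub_agent(query: str) -> str:
    """Offline stand-in for the real agent; always returns the same answer."""
    return "We offer free returns within a 30-day window."


cases = json.loads(TEST_CASES_JSON)
results = run_suite(stub_agent, cases)
```

Adding an eleventh or twelfth test then means appending one JSON object; `run_suite` stays untouched.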

Evaluation

Submissions are checked for the following:

  • At least 10 test cases: The harness includes at least 10 test cases with expected outputs.
  • Multi-metric scoring: Each test case is scored on accuracy, latency, and cost.
  • Summary report: Results are aggregated into a report with pass/fail rates and metric averages.
  • Data-driven test cases: Adding new test cases requires only data changes, not code modifications.

Constraints

  • The test suite must include at least 10 test cases with expected outputs.
  • Each test case must be scored on accuracy, latency, and cost.
  • Results must be aggregated into a summary report with pass/fail rates and averages.
  • The harness must be reusable: adding new test cases should require no code changes.

Starter Code

from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage

llm = ChatOpenAI(model="gpt-4o-mini")

# BUG: No automated evaluation — testing is manual, inconsistent, and not reproducible
# TODO: Build an evaluation harness that runs test cases and scores accuracy, latency, and cost
def customer_support_agent(query: str) -> str:
    response = llm.invoke([
        SystemMessage(content="You are a customer support agent for an e-commerce store. Answer questions about orders, returns, shipping, and products."),
        HumanMessage(content=query),
    ])
    return response.content

# Manual testing — no scoring, no metrics, not reproducible
print(customer_support_agent("What is your return policy?"))
print("---")
print(customer_support_agent("Where is my order #1234?"))
print("---")
print(customer_support_agent("Do you ship internationally?"))
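
The starter code discards everything except `response.content`, so latency and cost are never captured. One possible approach is sketched below. It assumes the message returned by `llm.invoke` exposes a `usage_metadata` mapping with `input_tokens` and `output_tokens` keys, as recent `langchain-core` releases do; verify this against your installed version. The per-token prices are placeholders, and `timed_call` and `fake_invoke` are hypothetical names. The fake stands in for `llm.invoke` so the sketch runs without an API key.

```python
import time
from types import SimpleNamespace

# Placeholder prices in USD per one million tokens (assumption: check your
# provider's current pricing for the model you actually use).
INPUT_PRICE_PER_1M = 0.15
OUTPUT_PRICE_PER_1M = 0.60


def timed_call(invoke, messages):
    """Call the model once; return (text, latency in seconds, estimated cost in USD)."""
    start = time.perf_counter()
    response = invoke(messages)
    latency_s = time.perf_counter() - start

    # Fall back to zero tokens if the response carries no usage information.
    usage = getattr(response, "usage_metadata", None) or {}
    input_tokens = usage.get("input_tokens", 0)
    output_tokens = usage.get("output_tokens", 0)
    cost_usd = (
        input_tokens * INPUT_PRICE_PER_1M + output_tokens * OUTPUT_PRICE_PER_1M
    ) / 1_000_000
    return response.content, latency_s, cost_usd


def fake_invoke(messages):
    """Offline stand-in for llm.invoke with fixed token counts."""
    return SimpleNamespace(
        content="We offer free returns within a 30-day window.",
        usage_metadata={"input_tokens": 1000, "output_tokens": 500},
    )


text, latency_s, cost_usd = timed_call(fake_invoke, ["What is your return policy?"])
```

With the real model, the call would look like `timed_call(llm.invoke, [SystemMessage(...), HumanMessage(content=query)])`, which lets the agent function return all three measurements instead of only the text.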