Agent Foundry

#74. Cascading Failure Isolation

Hard · Error Recovery · Multi-Agent

The Problem

You have a multi-agent pipeline for generating research reports: a research agent, an analysis agent, a visualization agent, and a report agent. They run in sequence, each building on the previous agent's output. When the analysis agent crashes (bad model load, OOM, API error), the exception propagates and the entire pipeline dies. The user gets nothing, even though the research step has already finished and the visualization and report agents are perfectly healthy. There is no error isolation between agents, so one crash cascades through the whole system. Your job is to isolate each agent in its own error boundary so that crashes are contained, surviving agents continue, and the user gets partial results.

Examples

Example 1

User input: Generate a report on AI trends for 2025

Current (bad) output: RuntimeError: Analysis model failed to load — the entire pipeline crashes. Research results are lost. No report is generated.

Expected (good) output:

Report on AI trends for 2025:

Research: [detailed findings on AI trends]
Analysis: ⚠️ Unavailable (analysis agent encountered an error: Analysis model failed to load)
Visualizations: Charts generated for AI trends
Conclusion: Report compiled from available data. Note: analysis section is missing due to a processing error.
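The layout above can be produced by a small formatting step that substitutes a failure note for any section whose agent crashed. This is a sketch only: `render_report`, `SECTIONS`, and the `results` and `errors` dictionaries are illustrative names, not part of the challenge's API.

```python
from typing import Dict

# (state key, label shown in the report); order matches Example 1.
SECTIONS = [
    ("research", "Research"),
    ("analysis", "Analysis"),
    ("visualization", "Visualizations"),
]

def render_report(topic: str, results: Dict[str, str], errors: Dict[str, str]) -> str:
    """Build the report text, marking sections whose agent failed."""
    lines = [f"Report on {topic}:", ""]
    for key, label in SECTIONS:
        if key in errors:
            # Failed agent: show a placeholder that names the agent and the error.
            lines.append(
                f"{label}: ⚠️ Unavailable "
                f"({key} agent encountered an error: {errors[key]})"
            )
        else:
            lines.append(f"{label}: {results.get(key, '')}")
    if errors:
        missing = ", ".join(sorted(errors))
        noun = "section is" if len(errors) == 1 else "sections are"
        lines.append(
            "Conclusion: Report compiled from available data. "
            f"Note: {missing} {noun} missing due to a processing error."
        )
    else:
        lines.append("Conclusion: Report compiled from all sections.")
    return "\n".join(lines)

report = render_report(
    "AI trends for 2025",
    results={
        "research": "[detailed findings on AI trends]",
        "visualization": "Charts generated for AI trends",
    },
    errors={"analysis": "Analysis model failed to load"},
)
print(report)
```

Keeping the formatting separate from the agents means the report step never has to read a key that a crashed agent failed to write.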

Example 2

User input: Analyze market trends for Q4

Current (bad) output: Pipeline crash — no output at all.

Expected (good) output: Research and visualization results are returned. The analysis failure is noted. The report agent compiles what's available.

Example 3

User input: Summarize climate change data (all agents healthy)

Current output: Works fine when no agent crashes.

Expected (good) output: Full report with all four sections populated. No error notes needed.

Your Task

Implement cascading failure isolation so the pipeline:

  • Wraps each agent in its own error boundary (try/except or equivalent).
  • Allows healthy agents to continue running even when one crashes.
  • Aggregates results from all successful agents into the final output.
  • Clearly reports which agent(s) failed and what error occurred.
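One way to build the error boundary described above is a small wrapper around each agent function. The sketch below is plain Python and makes two assumptions that are not in the starter code: the helper name `with_error_boundary` and an extra `errors` entry in the state, both hypothetical. It catches `Exception` broadly so that any agent failure is recorded rather than re-raised.

```python
from typing import Any, Callable, Dict

AgentFn = Callable[[Dict[str, Any]], Dict[str, Any]]

def with_error_boundary(name: str, agent: AgentFn) -> AgentFn:
    """Wrap an agent so an exception becomes a recorded failure, not a crash."""
    def guarded(state: Dict[str, Any]) -> Dict[str, Any]:
        try:
            return agent(state)
        except Exception as exc:  # deliberately broad: contain any agent error
            # `errors` is a hypothetical state key (agent name -> message);
            # copy it so the incoming state is not mutated.
            errors = dict(state.get("errors", {}))
            errors[name] = f"{type(exc).__name__}: {exc}"
            return {"errors": errors}
    return guarded

def flaky_agent(state: Dict[str, Any]) -> Dict[str, Any]:
    raise RuntimeError("Analysis model failed to load")

def healthy_agent(state: Dict[str, Any]) -> Dict[str, Any]:
    return {"visualization": f"Charts generated for {state['topic']}"}

# Run two agents in sequence; the second still runs after the first fails.
state: Dict[str, Any] = {"topic": "AI trends 2025"}
for agent_name, fn in [("analysis", flaky_agent), ("visualization", healthy_agent)]:
    state.update(with_error_boundary(agent_name, fn)(state))
print(state)
```

Because the wrapped function keeps the agent's signature, it could plausibly be registered in place of the original node, for example `graph.add_node("analysis", with_error_boundary("analysis", analysis_agent))`, provided the `State` schema also declares an `errors` key.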

Evaluation

Submissions are checked for the following:

  • Failure is isolated to one agent: A crash in one agent doesn't take down others.
  • Other agents continue running: Healthy agents execute and produce output normally.
  • Partial results returned: The final output includes everything that succeeded.
  • Failed agent is reported: The output notes which agent failed and why.
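Before submitting, it may help to run a rough self-check along the lines of these four criteria. The function below is an assumption about what such a check could look like, not the grader's actual logic; `self_check`, the `errors` key, and both stand-in pipelines are hypothetical.

```python
from typing import Callable, Dict

def self_check(run: Callable[[dict], dict], failing: str, message: str) -> Dict[str, bool]:
    """Score one pipeline run against the four criteria (assumed result layout)."""
    checks = {
        "isolated": False,
        "others_ran": False,
        "partial_results": False,
        "failure_reported": False,
    }
    try:
        result = run({"topic": "AI trends 2025"})
    except Exception:
        return checks  # an exception escaped, so nothing was isolated
    healthy = [k for k in ("research", "analysis", "visualization") if k != failing]
    report = result.get("report", "")
    checks["isolated"] = True
    checks["others_ran"] = all(bool(result.get(k)) for k in healthy)
    checks["partial_results"] = all(
        bool(result.get(k)) and result[k] in report for k in healthy
    )
    checks["failure_reported"] = message in str(result.get("errors", {}).get(failing, ""))
    return checks

def crashing_pipeline(state: dict) -> dict:
    # Stand-in for the starter behaviour: the error propagates to the caller.
    raise RuntimeError("Analysis model failed to load")

def isolated_pipeline(state: dict) -> dict:
    # Hand-written stand-in for the final state of a fixed pipeline.
    out = {
        "research": f"Research findings on {state['topic']}",
        "visualization": f"Charts generated for {state['topic']}",
        "errors": {"analysis": "RuntimeError: Analysis model failed to load"},
    }
    out["report"] = f"{out['research']} | {out['visualization']} | analysis unavailable"
    return out

before = self_check(crashing_pipeline, "analysis", "failed to load")
after = self_check(isolated_pipeline, "analysis", "failed to load")
print(before)
print(after)
```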

Constraints

  • Each agent must run in its own isolated error boundary.
  • If one agent crashes, the others must continue executing.
  • The final output must include results from all successful agents and a failure note for crashed ones.
  • The pipeline must not re-raise exceptions from individual agent failures.

Starter Code

from langgraph.graph import StateGraph, START, END
from typing import TypedDict

class State(TypedDict):
    topic: str
    research: str
    analysis: str
    visualization: str
    report: str

def research_agent(state: State) -> State:
    return {"research": f"Research findings on {state['topic']}: [detailed research]"}

def analysis_agent(state: State) -> State:
    # BUG: This agent crashes and takes down the entire pipeline
    raise RuntimeError("Analysis model failed to load")

def visualization_agent(state: State) -> State:
    return {"visualization": f"Charts generated for {state['topic']}"}

def report_agent(state: State) -> State:
    return {"report": f"Report: Research={state['research']}, Analysis={state['analysis']}, Viz={state['visualization']}"}

graph = StateGraph(State)
graph.add_node("research", research_agent)
graph.add_node("analysis", analysis_agent)
graph.add_node("visualization", visualization_agent)
graph.add_node("report", report_agent)

graph.add_edge(START, "research")
graph.add_edge("research", "analysis")
graph.add_edge("analysis", "visualization")
graph.add_edge("visualization", "report")
graph.add_edge("report", END)

app = graph.compile()

# Test: One agent crash kills the entire pipeline
result = app.invoke({"topic": "AI trends 2025", "research": "", "analysis": "", "visualization": "", "report": ""})
print(result["report"])