Workflow Checkpointing - Problems

The Problem

Your data-processing pipeline has five steps: extract entities, classify the document, summarize findings, generate recommendations, and format the report. When step 4 fails due to an API timeout, the entire pipeline restarts from step 1, discarding the work already done in steps 1–3 and paying for those LLM calls a second time. Your job is to add checkpointing so that the pipeline saves its state after each successful step and, on failure, resumes from the last checkpoint instead of starting over.

Examples

Example 1

Input: Q3 2024 sales data

Current (bad) output:

Attempt 1: Step 1 ✓, Step 2 ✓, Step 3 ✓, Step 4 ✗ (API timeout)
Attempt 2: Step 1 ✓ (repeated!), Step 2 ✓ (repeated!), Step 3 ✓ (repeated!), Step 4 ✓, Step 5 ✓
Total LLM calls: 9, counting the timed-out step 4 call (3 wasted)

Expected (good) output:

Attempt 1: Step 1 ✓ [checkpointed], Step 2 ✓ [checkpointed], Step 3 ✓ [checkpointed], Step 4 ✗ (API timeout)
Attempt 2: Resuming from step 4... Step 4 ✓ [checkpointed], Step 5 ✓
Total LLM calls: 6 (0 wasted)
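
The call counts in Example 1 can be reproduced with a small simulation. This is only an illustrative sketch and not part of the starter code: count_calls and its parameters are hypothetical names, and it assumes exactly one failure followed by one retry. It reports attempted calls (including the one that timed out) separately from completed calls, since the two totals differ by one.

```python
def count_calls(num_steps, failing_step, use_checkpoints):
    """Simulate one failure at failing_step, then one retry.

    Returns (attempted, completed): attempted includes the call that
    timed out, completed counts only calls that returned a result.
    """
    attempted = 0
    completed = 0
    checkpoint = 0      # highest step number saved so far
    has_failed = False  # the simulated timeout happens only once
    finished = False
    while not finished:
        # With checkpoints, resume after the last saved step; otherwise restart.
        first = checkpoint + 1 if use_checkpoints else 1
        finished = True
        for step in range(first, num_steps + 1):
            attempted += 1
            if step == failing_step and not has_failed:
                has_failed = True
                finished = False
                break
            completed += 1
            checkpoint = step
    return attempted, completed


print(count_calls(5, 4, use_checkpoints=False))  # (9, 8)
print(count_calls(5, 4, use_checkpoints=True))   # (6, 5)
```

In both cases only five completed calls are needed for the final report, so the three extra completed calls in the first result are the repeated steps 1–3.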

Example 2

Input: Annual compliance report data

Current (bad) output: A failure at step 5 forces re-execution of all five steps, so the four steps that had already succeeded cost time and tokens a second time.

Expected (good) output: Steps 1–4 are restored from checkpoints. Only step 5 re-executes. The final report is identical to what an uninterrupted run would produce.

Your Task

Modify the starter code so that:

The workflow checkpoints state after each successful step.
On failure, the workflow resumes from the last checkpoint, skipping already-completed steps.
Checkpoints are persistent: they survive process restarts rather than living only in memory.
The final output from a resumed run matches what an uninterrupted run would produce.
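
One way to meet these requirements is sketched below, under some assumptions: the starter code is not shown here, so the function names (load_checkpoint, save_checkpoint, run_pipeline), the JSON file format, and the step signature fn(data, results) are all hypothetical. The sketch stores the number of completed steps together with their results in a JSON file, and writes that file through a temporary file plus os.replace so that a crash in the middle of a write does not leave a truncated checkpoint behind.

```python
import json
import os
import tempfile
from pathlib import Path


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"completed": 0, "results": {}}


def save_checkpoint(path, state):
    """Write state to a temporary file, then atomically rename it into place."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_name, path)


def run_pipeline(steps, data, checkpoint_path):
    """Run steps in order, skipping those a checkpoint marks as done.

    steps is a list of (name, fn) pairs. Each fn(data, results) receives the
    input and the results of earlier steps, and must return a value that
    json can serialize (an assumption of this sketch).
    """
    path = Path(checkpoint_path)
    state = load_checkpoint(path)
    for index, (name, fn) in enumerate(steps):
        if index < state["completed"]:
            continue  # result was restored from the checkpoint
        state["results"][name] = fn(data, state["results"])  # may raise
        state["completed"] = index + 1
        save_checkpoint(path, state)  # persist only after the step succeeds
    return state["results"][steps[-1][0]]
```

If a step raises, the exception propagates and the checkpoint file still describes the last successful step, so calling run_pipeline again with the same checkpoint_path resumes from there. The sketch does not tie a checkpoint to a particular input; a real solution would likely use one checkpoint file per input or run, and remove it once the report has been delivered.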

Evaluation

Submissions are checked for the following:

Checkpoints after each step: The workflow saves state after each successful step completion.
Resumes from failure point: On failure and retry, the workflow continues from the last successful step, not step 1.
Persistent checkpoints: Checkpoints survive process restarts rather than being held only in memory.
Correct final result: The resumed workflow produces the same final output as an uninterrupted run.
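
The criteria above could be checked locally with a harness along the following lines. This is a hedged sketch, not the actual grader: check_resume is a hypothetical helper, and it assumes the runner under test has the signature run(steps, data, checkpoint_path), with steps given as (name, fn) pairs and each fn called as fn(data, results). It injects a single simulated timeout at a chosen step, retries, and compares the result against an uninterrupted run.

```python
import os
import tempfile


def check_resume(run, step_names, data, fail_at):
    """Check a checkpointing runner; fail_at is the 1-based failing step."""
    calls = []
    state = {"should_fail": False}

    def make_step(index, name):
        def step(data, results):
            calls.append(name)
            if state["should_fail"] and index == fail_at:
                state["should_fail"] = False  # fail only once
                raise TimeoutError("simulated timeout in " + name)
            # The output depends on earlier results, so a lost result shows up.
            return "{}:{}:{}".format(name, data, len(results))
        return step

    steps = [(n, make_step(i, n)) for i, n in enumerate(step_names, start=1)]
    failing = step_names[fail_at - 1]

    with tempfile.TemporaryDirectory() as tmp:
        # Baseline: an uninterrupted run with its own checkpoint file.
        expected = run(steps, data, os.path.join(tmp, "baseline.json"))

        # Interrupted run: one failure at fail_at, then a retry.
        calls.clear()
        state["should_fail"] = True
        path = os.path.join(tmp, "resumed.json")
        try:
            run(steps, data, path)
        except TimeoutError:
            pass
        assert fail_at == 1 or os.path.exists(path), "no checkpoint on disk"
        actual = run(steps, data, path)

    assert actual == expected, "resumed output differs from uninterrupted run"
    assert calls.count(failing) == 2, "failing step should run exactly twice"
    assert all(calls.count(n) == 1 for n in step_names if n != failing), \
        "completed steps must not be repeated"
    return True
```

Both runs happen in one process, so the file-existence check is only a proxy for persistence; launching the retry in a fresh process would be a stricter test of that criterion.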

#98. Workflow Checkpointing