---
name: checkpoint
description: Robust workflow checkpoint and resume. Handles session interruption, state recovery, and safe resume across all workflow phases.
allowed-tools: Read, Write, Glob
---
# Checkpoint & Resume Skill
Pattern for saving workflow state and resuming after interruption.
## When to Load This Skill
- Starting a workflow that might be interrupted
- Resuming after `claude -r`
- Recovering from crashes or timeouts
## Core Concept
The dotagent workflow uses **file-based state** that survives session interruption:
```
Session crash/exit
        ↓
State files persist on disk:
  - memory/state/phase.json      # Which phase we're in
  - memory/state/execution.json  # Task-level progress
  - memory/reports/*.json        # Completed phase outputs
        ↓
claude -r (resume session)
        ↓
Orchestrator reads state, continues from last checkpoint
```
## Checkpoint Files
### Phase Checkpoint: `memory/state/phase.json`
```json
{"workflow_id":"string","started_at":"ISO-8601","last_updated":"ISO-8601","current_phase":"REQUIREMENTS|ARCHITECTURE|IMPLEMENTATION|VERIFICATION|REFLECTION","phase_status":"pending|in_progress|complete|failed","completed_phases":[{"phase":"REQUIREMENTS","completed_at":"ISO-8601","output":"memory/reports/demand.json"}],"user_checkpoints":[{"phase":"REQUIREMENTS","approved_at":"ISO-8601"}],"interruption_safe":true}
```
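For reference, here is the same schema expressed as Python types. This is a minimal sketch; the class names are illustrative, not part of the skill.

```python
# Typed mirror of phase.json; the names here are illustrative only.
from typing import TypedDict

class CompletedPhase(TypedDict):
    phase: str            # e.g. "REQUIREMENTS"
    completed_at: str     # ISO-8601
    output: str           # path to the phase's report

class UserCheckpoint(TypedDict):
    phase: str
    approved_at: str      # ISO-8601

class PhaseCheckpoint(TypedDict):
    workflow_id: str
    started_at: str       # ISO-8601
    last_updated: str     # ISO-8601
    current_phase: str    # REQUIREMENTS | ARCHITECTURE | IMPLEMENTATION | VERIFICATION | REFLECTION
    phase_status: str     # pending | in_progress | complete | failed
    completed_phases: list[CompletedPhase]
    user_checkpoints: list[UserCheckpoint]
    interruption_safe: bool
```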
### Execution Checkpoint: `memory/state/execution.json`
See the executor agent for the detailed schema; it covers:
- Task status tracking
- Timestamps (started_at, completed_at)
- Output file paths for verification
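Purely as an illustration of that shape (every field name here is an assumption drawn from the recovery scenarios later in this document, not the executor's actual schema), an entry might look like:

```json
{
  "tasks": [
    {
      "id": "task-001",
      "status": "pending",
      "attempts": 1,
      "started_at": "ISO-8601",
      "completed_at": null,
      "output": "memory/reports/task-001.json"
    }
  ]
}
```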
## Resume Protocol
### Step 1: Detect Resume Scenario
```
ON WORKFLOW START:
  checkpoint = Read("memory/state/phase.json")
  IF checkpoint exists AND checkpoint.phase_status == "in_progress":
    → This is a RESUME
    → Log: "Detected interrupted workflow: {workflow_id}"
    → Go to Step 2
  ELSE:
    → Fresh start, create new checkpoint
```
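A minimal Python sketch of this check, assuming the phase.json schema above (`detect_resume` is an illustrative name):

```python
# Step 1 as code: read phase.json and decide resume vs. fresh start.
import json
from pathlib import Path

PHASE_FILE = Path("memory/state/phase.json")

def detect_resume() -> dict | None:
    """Return the checkpoint if this is a resume, else None for a fresh start."""
    if not PHASE_FILE.exists():
        return None
    checkpoint = json.loads(PHASE_FILE.read_text())
    if checkpoint.get("phase_status") != "in_progress":
        return None
    print(f"Detected interrupted workflow: {checkpoint['workflow_id']}")
    return checkpoint
```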
### Step 2: Validate State Integrity
```
VALIDATE:
  1. Check all referenced output files exist
  2. Check timestamps are reasonable (not future, not ancient)
  3. Check phase progression is valid
  4. Check for incomplete writes (interruption_safe flag)

IF validation fails:
  → Ask user: "State appears corrupted. Start fresh? [y/N]"
  → Archive corrupted state to memory/state/.archive/
```
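As a sketch, the four checks might translate to code like this; the 7-day cutoff mirrors Scenario 4 below, and `validate_checkpoint` is an illustrative name:

```python
# Step 2 as code: return a list of problems; an empty list means the state looks sound.
from datetime import datetime, timezone
from pathlib import Path

KNOWN_PHASES = ["REQUIREMENTS", "ARCHITECTURE", "IMPLEMENTATION",
                "VERIFICATION", "REFLECTION"]

def validate_checkpoint(cp: dict) -> list[str]:
    problems = []
    for done in cp.get("completed_phases", []):           # check 1: outputs exist
        if not Path(done["output"]).exists():
            problems.append(f"missing output file: {done['output']}")
    # Check 2: timestamps reasonable (assumes UTC ISO-8601 strings).
    started = datetime.fromisoformat(cp["started_at"].replace("Z", "+00:00"))
    age = datetime.now(timezone.utc) - started
    if age.total_seconds() < 0:
        problems.append("started_at is in the future")
    elif age.days > 7:
        problems.append("state is more than 7 days old")
    if cp["current_phase"] not in KNOWN_PHASES:           # check 3: valid progression
        problems.append(f"unknown phase: {cp['current_phase']}")
    if not cp.get("interruption_safe", False):            # check 4: incomplete write
        problems.append("interruption_safe is false (possible partial write)")
    return problems
```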
### Step 3: Determine Resume Point
```
RESUME LOGIC by phase:

REQUIREMENTS (in_progress):
  - Check if demand.json exists and is valid
  - If valid: advance to ARCHITECTURE
  - If not: re-spawn PM agent

ARCHITECTURE (in_progress):
  - Check for design files in memory/reports/designs/
  - Check for final_design.json
  - If final exists: advance to IMPLEMENTATION
  - If designs exist but no final: spawn Roundtable
  - If no designs: re-spawn Architects

IMPLEMENTATION (in_progress):
  - Read execution.json
  - Run executor recovery checks
  - Continue execution loop

VERIFICATION (in_progress):
  - Check for verification.json
  - If exists: advance to REFLECTION
  - If not: re-spawn QA

REFLECTION (in_progress):
  - Check for reflection file
  - If exists: workflow complete
  - If not: re-spawn Reflector
```
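Condensed into code, the dispatch might look like the sketch below. The locations of final_design.json and the reflection file are assumptions, and the returned strings stand in for whatever mechanism actually spawns agents:

```python
# Step 3 as code: map the current phase plus on-disk artifacts to a next action.
from pathlib import Path

REPORTS = Path("memory/reports")

def resume_point(current_phase: str) -> str:
    if current_phase == "REQUIREMENTS":
        # Existence check only; content validation would go here too.
        return "advance:ARCHITECTURE" if (REPORTS / "demand.json").exists() \
            else "respawn:PM"
    if current_phase == "ARCHITECTURE":
        if (REPORTS / "final_design.json").exists():      # assumed location
            return "advance:IMPLEMENTATION"
        designs = REPORTS / "designs"
        if designs.is_dir() and any(designs.iterdir()):
            return "spawn:Roundtable"
        return "respawn:Architects"
    if current_phase == "IMPLEMENTATION":
        return "continue:execution-loop"   # executor recovery checks handle details
    if current_phase == "VERIFICATION":
        return "advance:REFLECTION" if (REPORTS / "verification.json").exists() \
            else "respawn:QA"
    if current_phase == "REFLECTION":
        reflections = list(REPORTS.glob("reflection*"))   # assumed naming
        return "complete" if reflections else "respawn:Reflector"
    raise ValueError(f"unknown phase: {current_phase}")
```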
### Step 4: Inform User and Continue
```
LOG to user:
  "Resuming workflow {id} from {phase} phase"
  "Last activity: {timestamp}"
  "Completed: {list of completed phases}"

IF current_phase requires user approval (was at checkpoint):
  → Re-confirm with user before proceeding
```
## Safe Checkpoint Writing
Always update the checkpoint atomically:
```
# BAD: Can leave corrupted state
Write(checkpoint_file, new_state)

# GOOD: Atomic update
1. Set interruption_safe = false
2. Write to checkpoint_file.tmp
3. Rename checkpoint_file.tmp → checkpoint_file
4. Set interruption_safe = true
```
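In Python, the same four steps rely on the atomicity of `os.replace` (the function name and pretty-printing are illustrative):

```python
# Atomic checkpoint update: a full write to a temp file, then an atomic rename.
import json
import os
from pathlib import Path

def write_checkpoint(path: Path, state: dict) -> None:
    tmp = path.with_name(path.name + ".tmp")
    state["interruption_safe"] = False              # step 1: mark update in flight
    tmp.write_text(json.dumps(state, indent=2))     # step 2: write to .tmp
    os.replace(tmp, path)                           # step 3: atomic rename
    state["interruption_safe"] = True               # step 4: record the safe flag
    tmp.write_text(json.dumps(state, indent=2))
    os.replace(tmp, path)
```

Because `os.replace` is atomic on POSIX filesystems, a crash at any point leaves either the old file or the new one on disk, never a half-written mix.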
## Recovery from Specific Scenarios
### Scenario 1: Ctrl-C During Subagent
```
State: task-001 status="running", no output file

Recovery:
  - Detect orphaned task
  - Increment attempts
  - Reset to "pending"
  - Re-spawn on next loop
```
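A sketch of that recovery pass over execution.json, using the illustrative field names from earlier:

```python
# Reset orphaned tasks: "running" with no output file means the subagent died.
from pathlib import Path

def recover_orphans(execution: dict) -> None:
    for task in execution["tasks"]:
        if task["status"] == "running" and not Path(task["output"]).exists():
            task["attempts"] = task.get("attempts", 0) + 1
            task["status"] = "pending"   # re-spawned on the next execution loop
```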
### Scenario 2: Crash After Write, Before State Update
```
State: task-001 status="running", output file EXISTS

Recovery:
  - Detect output file
  - Read status from output
  - Update state to match
```
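The mirror-image pass trusts the output file over the stale state entry (this sketch assumes the output file records its own status field):

```python
# Reconcile state with outputs that landed on disk before the crash.
import json
from pathlib import Path

def reconcile_outputs(execution: dict) -> None:
    for task in execution["tasks"]:
        out = Path(task["output"])
        if task["status"] == "running" and out.exists():
            result = json.loads(out.read_text())
            task["status"] = result.get("status", "complete")
```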
### Scenario 3: Interrupted During User Approval
```
State: phase=ARCHITECTURE, has designs but no final_design

Recovery:
  - Detect we're at approval checkpoint
  - Re-present options to user
  - Don't re-run architects
```
### Scenario 4: Ancient State File
```
State: started_at is 7 days ago

Recovery:
  - Warn user about stale state
  - Offer to archive and start fresh
  - If user continues: proceed with caution
```
## Checkpoint Frequency
Update the checkpoint after:
- Phase completion
- User approval
- Each task status change (in executor)
- Before spawning expensive agents (opus)
## Archiving Old State
When starting fresh or after completion:
```
Archive pattern:
  memory/state/.archive/{workflow_id}_{timestamp}/
    - phase.json
    - execution.json

Keep last 5 archives, delete older
```
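A sketch of the archive-and-prune step; the timestamp format and helper name are illustrative:

```python
# Move live state into a timestamped archive directory, then prune to the 5 newest.
import shutil
from datetime import datetime, timezone
from pathlib import Path

STATE = Path("memory/state")
ARCHIVE = STATE / ".archive"

def archive_state(workflow_id: str) -> None:
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    dest = ARCHIVE / f"{workflow_id}_{stamp}"
    dest.mkdir(parents=True, exist_ok=True)
    for name in ("phase.json", "execution.json"):
        src = STATE / name
        if src.exists():
            shutil.move(str(src), str(dest / name))
    # Keep only the 5 most recent archive directories.
    archives = sorted((p for p in ARCHIVE.iterdir() if p.is_dir()),
                      key=lambda p: p.stat().st_mtime)
    for old in archives[:-5]:
        shutil.rmtree(old)
```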
## Integration with Workflow
### In /develop Command
```markdown
## Resume Check
Before starting workflow:
1. Check for existing phase.json
2. If exists and in_progress:
   - Show resume prompt to user
   - "Resume workflow from {phase}? [Y/n]"
3. If user confirms: load checkpoint, continue
4. If user declines: archive old state, start fresh
```
### In Each Phase Agent
```markdown
## On Completion
Before returning:
1. Write output file
2. Update phase.json:
   - Add to completed_phases
   - Advance current_phase
   - Set phase_status = complete
3. Log checkpoint saved
```
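One way to express that update in code; the phase ordering comes from the schema, and how the orchestrator later flips the next phase to in_progress is left out of this sketch:

```python
# Apply the completion steps to the in-memory checkpoint, then persist it.
from datetime import datetime, timezone

PHASE_ORDER = ["REQUIREMENTS", "ARCHITECTURE", "IMPLEMENTATION",
               "VERIFICATION", "REFLECTION"]

def complete_phase(cp: dict, output_path: str) -> None:
    now = datetime.now(timezone.utc).isoformat()
    cp["completed_phases"].append(
        {"phase": cp["current_phase"], "completed_at": now, "output": output_path})
    nxt = PHASE_ORDER.index(cp["current_phase"]) + 1
    if nxt < len(PHASE_ORDER):
        cp["current_phase"] = PHASE_ORDER[nxt]
    cp["phase_status"] = "complete"
    cp["last_updated"] = now
    # Persist with the atomic pattern from "Safe Checkpoint Writing" above.
```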
## Principles
1. **State on disk** - Never rely on conversation memory alone
2. **Validate before resume** - Don't blindly trust old state
3. **Inform the user** - Always tell them what's being resumed
4. **Atomic writes** - Prevent half-written state
5. **Archive, don't delete** - Keep old state for debugging