---
name: checkpoint
description: Robust workflow checkpoint and resume. Handles session interruption, state recovery, and safe resume across all workflow phases.
allowed-tools: Read, Write, Glob
---
# Checkpoint & Resume Skill
Pattern for saving workflow state and resuming after interruption.
## When to Load This Skill
- Starting a workflow that might be interrupted
- Resuming after `claude -r`
- Recovering from crashes or timeouts
## Core Concept
The dotagent workflow uses **file-based state** that survives session interruption:
```
Session crash/exit
        ↓
State files persist on disk:
  - memory/state/phase.json      # Which phase we're in
  - memory/state/execution.json  # Task-level progress
  - memory/reports/*.json        # Completed phase outputs
        ↓
claude -r (resume session)
        ↓
Orchestrator reads state, continues from last checkpoint
```
## Checkpoint Files
### Phase Checkpoint: `memory/state/phase.json`
```json
{"workflow_id":"string","started_at":"ISO-8601","last_updated":"ISO-8601","current_phase":"REQUIREMENTS|ARCHITECTURE|IMPLEMENTATION|VERIFICATION|REFLECTION","phase_status":"pending|in_progress|complete|failed","completed_phases":[{"phase":"REQUIREMENTS","completed_at":"ISO-8601","output":"memory/reports/demand.json"}],"user_checkpoints":[{"phase":"REQUIREMENTS","approved_at":"ISO-8601"}],"interruption_safe":true}
```
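For reference, here is the same schema expressed as Python types. This is a minimal sketch; the class names are illustrative, not part of the skill.

```python
# Typed mirror of phase.json; the names here are illustrative only.
from typing import TypedDict

class CompletedPhase(TypedDict):
    phase: str            # e.g. "REQUIREMENTS"
    completed_at: str     # ISO-8601
    output: str           # path to the phase's report

class UserCheckpoint(TypedDict):
    phase: str
    approved_at: str      # ISO-8601

class PhaseCheckpoint(TypedDict):
    workflow_id: str
    started_at: str       # ISO-8601
    last_updated: str     # ISO-8601
    current_phase: str    # REQUIREMENTS | ARCHITECTURE | IMPLEMENTATION | VERIFICATION | REFLECTION
    phase_status: str     # pending | in_progress | complete | failed
    completed_phases: list[CompletedPhase]
    user_checkpoints: list[UserCheckpoint]
    interruption_safe: bool
```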
### Execution Checkpoint: `memory/state/execution.json`
See the executor agent for the detailed schema; it covers:
- Task status tracking
- Timestamps (started_at, completed_at)
- Output file paths for verification
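Purely as an illustration of that shape (every field name here is an assumption drawn from the recovery scenarios later in this document, not the executor's actual schema), an entry might look like:

```json
{
  "tasks": [
    {
      "id": "task-001",
      "status": "pending",
      "attempts": 1,
      "started_at": "ISO-8601",
      "completed_at": null,
      "output": "memory/reports/task-001.json"
    }
  ]
}
```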
## Resume Protocol
### Step 1: Detect Resume Scenario
```
ON WORKFLOW START:
  checkpoint = Read("memory/state/phase.json")
  IF checkpoint exists AND checkpoint.phase_status == "in_progress":
    → This is a RESUME
    → Log: "Detected interrupted workflow: {workflow_id}"
    → Go to Step 2
  ELSE:
    → Fresh start, create new checkpoint
```
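A minimal Python sketch of this check, assuming the phase.json schema above (`detect_resume` is an illustrative name):

```python
# Step 1 as code: read phase.json and decide resume vs. fresh start.
import json
from pathlib import Path

PHASE_FILE = Path("memory/state/phase.json")

def detect_resume() -> dict | None:
    """Return the checkpoint if this is a resume, else None for a fresh start."""
    if not PHASE_FILE.exists():
        return None
    checkpoint = json.loads(PHASE_FILE.read_text())
    if checkpoint.get("phase_status") != "in_progress":
        return None
    print(f"Detected interrupted workflow: {checkpoint['workflow_id']}")
    return checkpoint
```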
### Step 2: Validate State Integrity
```
VALIDATE:
  1. Check all referenced output files exist
  2. Check timestamps are reasonable (not future, not ancient)
  3. Check phase progression is valid
  4. Check for incomplete writes (interruption_safe flag)

IF validation fails:
  → Ask user: "State appears corrupted. Start fresh? [y/N]"
  → Archive corrupted state to memory/state/.archive/
```
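As a sketch, the four checks might translate to code like this; the 7-day cutoff mirrors Scenario 4 below, and `validate_checkpoint` is an illustrative name:

```python
# Step 2 as code: return a list of problems; an empty list means the state looks sound.
from datetime import datetime, timezone
from pathlib import Path

KNOWN_PHASES = ["REQUIREMENTS", "ARCHITECTURE", "IMPLEMENTATION",
                "VERIFICATION", "REFLECTION"]

def validate_checkpoint(cp: dict) -> list[str]:
    problems = []
    for done in cp.get("completed_phases", []):           # check 1: outputs exist
        if not Path(done["output"]).exists():
            problems.append(f"missing output file: {done['output']}")
    # Check 2: timestamps reasonable (assumes UTC ISO-8601 strings).
    started = datetime.fromisoformat(cp["started_at"].replace("Z", "+00:00"))
    age = datetime.now(timezone.utc) - started
    if age.total_seconds() < 0:
        problems.append("started_at is in the future")
    elif age.days > 7:
        problems.append("state is more than 7 days old")
    if cp["current_phase"] not in KNOWN_PHASES:           # check 3: valid progression
        problems.append(f"unknown phase: {cp['current_phase']}")
    if not cp.get("interruption_safe", False):            # check 4: incomplete write
        problems.append("interruption_safe is false (possible partial write)")
    return problems
```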
### Step 3: Determine Resume Point
```
RESUME LOGIC by phase:

REQUIREMENTS (in_progress):
  - Check if demand.json exists and is valid
  - If valid: advance to ARCHITECTURE
  - If not: re-spawn PM agent

ARCHITECTURE (in_progress):
  - Check for design files in memory/reports/designs/
  - Check for final_design.json
  - If final exists: advance to IMPLEMENTATION
  - If designs exist but no final: spawn Roundtable
  - If no designs: re-spawn Architects

IMPLEMENTATION (in_progress):
  - Read execution.json
  - Run executor recovery checks
  - Continue execution loop

VERIFICATION (in_progress):
  - Check for verification.json
  - If exists: advance to REFLECTION
  - If not: re-spawn QA

REFLECTION (in_progress):
  - Check for reflection file
  - If exists: workflow complete
  - If not: re-spawn Reflector
```
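Condensed into code, the dispatch might look like the sketch below. The locations of final_design.json and the reflection file are assumptions, and the returned strings stand in for whatever mechanism actually spawns agents:

```python
# Step 3 as code: map the current phase plus on-disk artifacts to a next action.
from pathlib import Path

REPORTS = Path("memory/reports")

def resume_point(current_phase: str) -> str:
    if current_phase == "REQUIREMENTS":
        # Existence check only; content validation would go here too.
        return "advance:ARCHITECTURE" if (REPORTS / "demand.json").exists() \
            else "respawn:PM"
    if current_phase == "ARCHITECTURE":
        if (REPORTS / "final_design.json").exists():      # assumed location
            return "advance:IMPLEMENTATION"
        designs = REPORTS / "designs"
        if designs.is_dir() and any(designs.iterdir()):
            return "spawn:Roundtable"
        return "respawn:Architects"
    if current_phase == "IMPLEMENTATION":
        return "continue:execution-loop"   # executor recovery checks handle details
    if current_phase == "VERIFICATION":
        return "advance:REFLECTION" if (REPORTS / "verification.json").exists() \
            else "respawn:QA"
    if current_phase == "REFLECTION":
        reflections = list(REPORTS.glob("reflection*"))   # assumed naming
        return "complete" if reflections else "respawn:Reflector"
    raise ValueError(f"unknown phase: {current_phase}")
```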
### Step 4: Inform User and Continue
```
LOG to user:
  "Resuming workflow {id} from {phase} phase"
  "Last activity: {timestamp}"
  "Completed: {list of completed phases}"

IF current_phase requires user approval (was at checkpoint):
  → Re-confirm with user before proceeding
```
## Safe Checkpoint Writing
Always update the checkpoint atomically:
```
# BAD: Can leave corrupted state
Write(checkpoint_file, new_state)

# GOOD: Atomic update
1. Set interruption_safe = false
2. Write to checkpoint_file.tmp
3. Rename checkpoint_file.tmp → checkpoint_file
4. Set interruption_safe = true
```
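In Python, the same four steps rely on the atomicity of `os.replace` (the function name and pretty-printing are illustrative):

```python
# Atomic checkpoint update: a full write to a temp file, then an atomic rename.
import json
import os
from pathlib import Path

def write_checkpoint(path: Path, state: dict) -> None:
    tmp = path.with_name(path.name + ".tmp")
    state["interruption_safe"] = False              # step 1: mark update in flight
    tmp.write_text(json.dumps(state, indent=2))     # step 2: write to .tmp
    os.replace(tmp, path)                           # step 3: atomic rename
    state["interruption_safe"] = True               # step 4: record the safe flag
    tmp.write_text(json.dumps(state, indent=2))
    os.replace(tmp, path)
```

Because `os.replace` is atomic on POSIX filesystems, a crash at any point leaves either the old file or the new one on disk, never a half-written mix.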
## Recovery from Specific Scenarios
### Scenario 1: Ctrl-C During Subagent
```
State: task-001 status="running", no output file

Recovery:
  - Detect orphaned task
  - Increment attempts
  - Reset to "pending"
  - Re-spawn on next loop
```
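A sketch of that recovery pass over execution.json, using the illustrative field names from earlier:

```python
# Reset orphaned tasks: "running" with no output file means the subagent died.
from pathlib import Path

def recover_orphans(execution: dict) -> None:
    for task in execution["tasks"]:
        if task["status"] == "running" and not Path(task["output"]).exists():
            task["attempts"] = task.get("attempts", 0) + 1
            task["status"] = "pending"   # re-spawned on the next execution loop
```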
### Scenario 2: Crash After Write, Before State Update
```
State: task-001 status="running", output file EXISTS

Recovery:
  - Detect output file
  - Read status from output
  - Update state to match
```
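The mirror-image pass trusts the output file over the stale state entry (this sketch assumes the output file records its own status field):

```python
# Reconcile state with outputs that landed on disk before the crash.
import json
from pathlib import Path

def reconcile_outputs(execution: dict) -> None:
    for task in execution["tasks"]:
        out = Path(task["output"])
        if task["status"] == "running" and out.exists():
            result = json.loads(out.read_text())
            task["status"] = result.get("status", "complete")
```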
### Scenario 3: Interrupted During User Approval
```
State: phase=ARCHITECTURE, has designs but no final_design

Recovery:
  - Detect we're at approval checkpoint
  - Re-present options to user
  - Don't re-run architects
```
### Scenario 4: Ancient State File
```
State: started_at is 7 days ago

Recovery:
  - Warn user about stale state
  - Offer to archive and start fresh
  - If user continues: proceed with caution
```
## Checkpoint Frequency
Update the checkpoint after:
- Phase completion
- User approval
- Each task status change (in executor)
- Before spawning expensive agents (opus)
## Archiving Old State
When starting fresh or after completion:
```
Archive pattern:
  memory/state/.archive/{workflow_id}_{timestamp}/
    - phase.json
    - execution.json

Keep last 5 archives, delete older
```
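A sketch of the archive-and-prune step; the timestamp format and helper name are illustrative:

```python
# Move live state into a timestamped archive directory, then prune to the 5 newest.
import shutil
from datetime import datetime, timezone
from pathlib import Path

STATE = Path("memory/state")
ARCHIVE = STATE / ".archive"

def archive_state(workflow_id: str) -> None:
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    dest = ARCHIVE / f"{workflow_id}_{stamp}"
    dest.mkdir(parents=True, exist_ok=True)
    for name in ("phase.json", "execution.json"):
        src = STATE / name
        if src.exists():
            shutil.move(str(src), str(dest / name))
    # Keep only the 5 most recent archive directories.
    archives = sorted((p for p in ARCHIVE.iterdir() if p.is_dir()),
                      key=lambda p: p.stat().st_mtime)
    for old in archives[:-5]:
        shutil.rmtree(old)
```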
## Integration with Workflow
### In /develop Command
```markdown
## Resume Check
Before starting workflow:
1. Check for existing phase.json
2. If exists and in_progress:
   - Show resume prompt to user
   - "Resume workflow from {phase}? [Y/n]"
3. If user confirms: load checkpoint, continue
4. If user declines: archive old state, start fresh
```
### In Each Phase Agent
```markdown
## On Completion
Before returning:
1. Write output file
2. Update phase.json:
   - Add to completed_phases
   - Advance current_phase
   - Set phase_status = complete
3. Log checkpoint saved
```
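One way to express that update in code; the phase ordering comes from the schema, and how the orchestrator later flips the next phase to in_progress is left out of this sketch:

```python
# Apply the completion steps to the in-memory checkpoint, then persist it.
from datetime import datetime, timezone

PHASE_ORDER = ["REQUIREMENTS", "ARCHITECTURE", "IMPLEMENTATION",
               "VERIFICATION", "REFLECTION"]

def complete_phase(cp: dict, output_path: str) -> None:
    now = datetime.now(timezone.utc).isoformat()
    cp["completed_phases"].append(
        {"phase": cp["current_phase"], "completed_at": now, "output": output_path})
    nxt = PHASE_ORDER.index(cp["current_phase"]) + 1
    if nxt < len(PHASE_ORDER):
        cp["current_phase"] = PHASE_ORDER[nxt]
    cp["phase_status"] = "complete"
    cp["last_updated"] = now
    # Persist with the atomic pattern from "Safe Checkpoint Writing" above.
```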
## Principles
1. **State on disk** - Never rely on conversation memory alone
2. **Validate before resume** - Don't blindly trust old state
3. **Inform the user** - Always tell them what's being resumed
4. **Atomic writes** - Prevent half-written state
5. **Archive, don't delete** - Keep old state for debugging