flow-swarm

Show SKILL.md content (~5.7k tokens)
---
name: flow-swarm
version: 2.1.1
description: >
  Multi-agent swarm orchestration via RuFlo + Claude Code. Turns single coding sessions into coordinated agent teams (architect/coder/tester/reviewer). Proven on 7 consecutive production runs generating 430+ tests across a 50K+ line Elixir codebase with 83% zero-iteration success rate. Features 150+ MCP tools for inter-agent coordination, persistent cross-run memory (sql.js + HNSW vectors), task tracking, file claim locks, and session persistence. Includes battle-tested prompt templates for test generation, feature builds, refactors, security audits, and quality loops. Setup script with 8-point verification. Works with any language, any codebase. NOT for one-liner edits or read-only tasks. Triggers: "swarm this", "flow swarm", "use the swarm", "flowswarm".
---

# FlowSwarm v2.1

Multi-agent swarm orchestration for Claude Code via RuFlo. One prompt, coordinated agents, production results.

## What Changed in v2.1

Critical fix: **MCP tools were disabled** (`autoStart: false`). This prevented Claude Code from calling `mcp__claude-flow__swarm_init`, `memory_store`, `agent_spawn`, etc. during every swarm run we'd done so far. Fixed.

Also fixed: `--print` mode **does not** auto-discover `.mcp.json`. You must pass `--mcp-config .mcp.json` explicitly.

| Change | Why |
|---|---|
| `autoStart: true` in `.mcp.json` | Was `false` — all 150+ MCP tools were disabled in every prior run |
| `--mcp-config .mcp.json` flag added to exec pattern | `--print` doesn't auto-load project MCP config |
| MCP tool reference table added | 150+ tools now documented: swarm_init, agent_spawn, memory_store, etc. |
| Prompt templates updated to call MCP tools | Without explicit instructions, Claude may not use them |
| Setup script auto-fixes autoStart | `ruflo init` defaults to false; setup script corrects it |

## What Changed in v2.0

v1.0 was theory. v2.0 is battle-tested across 5 production runs (355 tests, 5/5 green, 4/5 zero-iteration).

| Change | Why |
|---|---|
| Tiered task routing replaces one-size-fits-all | Pure-data modules don't need GenServer test isolation advice |
| Target selection protocol added | Picking the RIGHT module matters more than swarm config |
| Pre-flight context injection | Feeding the swarm grep output of public functions = dramatically better coverage |
| Daemon reality check | Workers timeout/fail often (20-50% success); swarm value comes from prompt orchestration, not daemon workers |
| Removed WASM Booster claims | Never observed in practice; hooks + prompt patterns drive all real value |
| Real performance data | Actual timing, test counts, iteration rates from production runs |

## Architecture

Two layers working together:

**Layer 1: MCP Tools (150+ tools via @claude-flow/cli)**
When `autoStart: true` in `.mcp.json`, Claude Code gets access to real coordination tools:
- `swarm_init` — creates swarm with topology, persists to `.claude-flow/swarm/swarm-state.json`
- `agent_spawn` — registers agents with model routing (haiku/sonnet/opus/inherit)
- `memory_store` / `memory_search` — sql.js + HNSW vector embeddings for semantic recall
- `task_create` / `task_complete` — tracks task state with assignment
- `session_save` / `session_restore` — persists session state between runs
- `claims_claim` / `claims_release` — prevents agents from editing same files
- `coordination_consensus` — multi-agent agreement on decisions

**Layer 2: Prompt Orchestration (our FlowSwarm patterns)**
The SWARM MODE prefix causes Claude Code to think in roles (architect/coder/reviewer). Combined with pre-flight context injection (grepping public APIs), this produces 80% zero-iteration success.

**Both layers matter.** v1.0 had Layer 2 only (autoStart was false, MCP tools never loaded). v2.0 enables both.

```
OpenClaw → exec (background) → Claude Code
                                    ↓
                    MCP Server starts (autoStart: true)
                    150+ tools available via @claude-flow/cli
                                    ↓
                    SWARM MODE prompt → swarm_init tool called
                    agent_spawn × N → task_create → execute
                                    ↓
                    memory_store (findings) → task_complete
                                    ↓
                              Output + persisted state
```

## Prerequisites

```bash
ruflo --version    # 3.5.x+
claude --version   # Claude Code CLI
```

## Setup (One-Time Per Machine)

```bash
# Full setup: install RuFlo + register MCP + init project
./scripts/setup-flow-swarm.sh /path/to/project

# Verify
./scripts/setup-flow-swarm.sh --verify /path/to/project
```

Or manually:
```bash
npm install -g ruflo@latest
claude mcp add ruflo -- npx -y ruflo@latest mcp start
cd /path/to/project && ruflo init && ruflo memory init && ruflo daemon start
```

## CRITICAL: Enable MCP Server

After `ruflo init`, the `.mcp.json` file defaults to `autoStart: false`. This disables ALL 150+ MCP tools during Claude Code sessions. Fix it:

```bash
# Check current state
python3 -c "import json; d=json.load(open('.mcp.json')); print('autoStart:', d['mcpServers']['claude-flow'].get('autoStart'))"

# Enable (REQUIRED for full swarm functionality)
python3 -c "
import json
with open('.mcp.json') as f: d = json.load(f)
d['mcpServers']['claude-flow']['autoStart'] = True
with open('.mcp.json', 'w') as f: json.dump(d, f, indent=2)
print('MCP autoStart enabled')
"
```

Without this, Claude Code runs without swarm tools. The prompt patterns still work (v1.0 proved this), but you lose: persistent swarm state, agent memory, task tracking, session persistence, and inter-agent coordination.

## The FlowSwarm Protocol (3 Steps)

### Step 1: Select Your Target

**This is the highest-leverage decision.** Pick wrong and you waste a swarm run.

Best targets (in order):
1. **Large modules with zero tests** — highest ROI, swarm excels here
2. **Pure data/logic modules** — no IO mocking needed, near-100% first-pass success
3. **Modules with thin test coverage** — swarm fills gaps the original author skipped
4. **Feature builds with clear specs** — architect/coder/reviewer shines on greenfield

Find targets fast:
```bash
# List untested modules by size (biggest = best target)
for f in lib/**/*.ex; do
  base=$(basename "$f" .ex)
  count=$(find test/ -name "${base}_test.exs" 2>/dev/null | wc -l | tr -d ' ')
  [ "$count" = "0" ] && echo "$(wc -l < "$f")L $f"
done | sort -rn | head -10
```

### Step 2: Build the Prompt (Context-Rich)

The secret sauce: **feed the swarm a pre-flight scan** of the module. Don't just say "test this file" — tell it exactly what functions exist, what patterns the project uses, what edge cases matter.

```bash
# Pre-flight: scan public API
grep -n "^  def " lib/your_module.ex
# Pre-flight: check existing test patterns
head -30 test/some_existing_test.exs
```

Then build the prompt with that intel baked in.

### Step 3: Launch and Verify

```bash
# Launch (ALWAYS background, NEVER nohup)
exec(
  command='cd /project && claude --permission-mode bypassPermissions --mcp-config .mcp.json --print "SWARM MODE: ... TASK: ..."',
  background=True,
  timeout=300
)

# Poll for completion
process(action="poll", sessionId="xxx", timeout=120000)

# Verify the output actually compiles/passes
mix test test/path/to/new_test.exs
```

**Critical exec rules:**
- `--print` buffers ALL output until exit. Use `background: true` + poll.
- NEVER use `nohup` — Node.js stdout capture breaks silently (empty files).
- Timeout 300s minimum for complex tasks. Simple test gen: 60-120s.
- Always run `mix test` (or equivalent) on swarm output before committing.

## Prompt Templates (Battle-Tested)

### Test Generation — Pure Data Module
**Proven: 147/147, 66/66, 41/41 zero-iteration**

Best for: static catalogs, type definitions, translation modules, config builders.

```
SWARM MODE: Initialize hierarchical swarm with MCP tools.

COORDINATION:
1. Call swarm_init with topology "hierarchical", maxAgents 4, strategy "specialized"
2. Call agent_spawn for: architect (analyze module), coder (write tests), reviewer (verify)
3. Call task_create for the test generation task
4. After completion: call memory_store with key findings and task_complete

TASK: Write comprehensive ExUnit tests for [MODULE_PATH] ([LINE_COUNT] lines, [DESCRIPTION]).

Public API:
[PASTE grep -n "^  def " output here]

Key data to validate:
- [List specific assertions: required struct keys, value ranges, URL formats, etc.]
- [List known edge cases: unknown inputs, nil, empty string, integer where string expected]

Requirements:
- File: test/[matching_path]_test.exs
- Use async: true (pure functions, no state)
- Group tests by function (describe blocks)
- Test ALL variants, not just a sample (e.g., all 8 hotels, not just 2)
- Include edge cases: nil input, empty string, unknown keys
- Do NOT modify any source files

When done: call memory_store with test count + key findings, then output results.
```

### Test Generation — GenServer / Stateful Module
**Proven: 43/43, required 3 iterations (test isolation)**

```
SWARM MODE: Initialize hierarchical swarm with MCP tools.

COORDINATION:
1. Call swarm_init with topology "hierarchical", maxAgents 4, strategy "specialized"
2. Call agent_spawn for: architect (analyze GenServer behavior), coder (write tests), reviewer (verify)
3. Call task_create for the test generation task
4. After each iteration: call memory_store with what failed and why
5. After completion: call task_complete with final results

TASK: Write comprehensive ExUnit tests for [MODULE_PATH] ([LINE_COUNT] lines, GenServer).

Public API:
[PASTE grep output]

CRITICAL — Test Isolation for GenServers:
- The module registers as a named process (__MODULE__). It's already supervised globally.
- Do NOT use start_supervised! — it conflicts with the app-supervised instance.
- Pattern: stop the global instance, restart with test config, re-stop at end.
- OR: if start_link accepts a name: option, use unique names per test.
- async: false for GenServer tests that touch global state.

Requirements:
- Test GenServer lifecycle (init, handle_call, handle_cast, handle_info)
- Test crash recovery: missing catch-all handlers are REAL BUGS worth flagging
- Test state transitions and side effects
- Do NOT modify any source files
```

### Test Generation — Module with External Dependencies
```
SWARM MODE: Initialize hierarchical swarm (maxAgents 4, strategy specialized).
Spawn: architect (analyze deps + plan mocks), coder (write tests), reviewer (verify coverage).

TASK: Write comprehensive ExUnit tests for [MODULE_PATH].

This module depends on: [LIST DEPENDENCIES]
Mock strategy: [Mox / manual mock / test config override]
Reference existing mocks in test/support/ if any.

Requirements:
- Mock all external calls (HTTP, DB, external services)
- Test happy path AND error paths (timeouts, 4xx, 5xx, malformed responses)
- async: true if using Mox with allowances
- Do NOT modify source files
```

### Feature Build (Greenfield)
```
SWARM MODE: Initialize hierarchical swarm (maxAgents 6, strategy specialized).
Spawn: architect (plan structure), coder (implement), tester (tests), reviewer (quality).
Architect plans FIRST. Coder implements. Tester validates. Reviewer catches issues.

TASK: [Feature description with clear acceptance criteria]

Architecture constraints:
- [List patterns to follow from existing codebase]
- [List modules/files to reference for conventions]

HARD LIMIT: Maximum 5 iterations if quality loop needed.
```

### Refactor (Anti-Drift)
```
SWARM MODE: Initialize anti-drift hierarchical swarm (maxAgents 4).
Spawn: architect (plan + checkpoint), coder (execute), reviewer (validate each step).

ANTI-DRIFT RULES:
- Architect creates numbered plan before ANY code is written
- Coder implements ONE step at a time
- Reviewer validates EACH step before proceeding
- If reviewer rejects twice: STOP and report
- Checkpoint state after each successful step

TASK: [Refactor description]
HARD LIMIT: Maximum 8 iterations.
```

### Security Audit
```
SWARM MODE: Security-focused hierarchical swarm (maxAgents 5, strategy specialized).
Spawn: security-architect (threat model), auditor (scan), coder (fix), tester (verify).

TASK: Security audit of [scope].

Checklist:
- [ ] Dependency vulnerabilities (mix audit / npm audit)
- [ ] Hardcoded secrets in source
- [ ] Injection vectors (SQL, XSS, command)
- [ ] Auth/authz bypass paths
- [ ] GenServer catch-all handlers (handle_info, handle_cast) — these are REAL BUGS
- [ ] Error messages leaking internal state
- [ ] Rate limiting gaps
- [ ] CORS/CSP headers

Output: findings table with severity, file, line, fix.
HARD LIMIT: Maximum 5 iterations.
```

### Quality Loop (Ralph-Style)
```
SWARM MODE: Initialize ring swarm (maxAgents 4, strategy adaptive).
Spawn: coder, tester, reviewer, coordinator.
HARD LIMIT: Maximum 10 iterations.

TASK: Iterate on [target] until [score threshold].

Per iteration:
1. Coder fixes based on reviewer feedback
2. Tester runs full suite, reports pass/fail count
3. Reviewer scores against rubric
4. Score >= threshold → STOP, report final score
5. Iteration == 10 → STOP regardless, report score and remaining gaps
```

## MCP Tools Reference (Available When autoStart: true)

These tools become available to Claude Code during swarm sessions. Include instructions to USE them in your prompts.

### Core Swarm (must-use)
| Tool | Purpose |
|---|---|
| `swarm_init` | Create swarm with topology + strategy. Persists to `.claude-flow/swarm/` |
| `swarm_status` | Check swarm health mid-run |
| `swarm_shutdown` | Clean shutdown with state persistence |
| `agent_spawn` | Register agents with model routing (haiku/sonnet/opus) |
| `agent_status` | Check individual agent state |
| `memory_store` | Persist findings to sql.js + HNSW (semantic search) |
| `memory_search` | Retrieve relevant context from prior runs |
| `task_create` | Track task with assignment + status |
| `task_complete` | Mark task done with summary |

### Coordination (high-value for complex tasks)
| Tool | Purpose |
|---|---|
| `session_save` | Save session state between runs |
| `session_restore` | Resume from prior session |
| `claims_claim` | Lock a file/resource (prevents agent conflicts) |
| `claims_release` | Release lock |
| `coordination_consensus` | Multi-agent agreement |
| `coordination_sync` | Synchronize agent state |

### Analysis (useful for reviews)
| Tool | Purpose |
|---|---|
| `analyze_diff` | Review code changes |
| `analyze_diff_risk` | Assess risk of changes |
| `performance_report` | Bottleneck detection |

### Why This Matters
Without `autoStart: true`, Claude Code has ZERO access to these tools. It runs on prompt intelligence alone (which works, as v1.0 proved). With them enabled, the swarm can:
- **Persist memories between runs** — learn from prior sessions
- **Track tasks formally** — not just in-context reasoning
- **Coordinate agents** — prevent file conflicts, reach consensus
- **Route by model** — use haiku for simple subtasks, opus for architecture

## Swarm Topologies

| Topology | When | Track Record |
|---|---|---|
| `hierarchical` | Test gen, features, refactors | 5/5 green tonight |
| `ring` | Quality loops, pipelines | Proven in Ralph loops |
| `mesh` | Research, exploration | Untested in production |
| `star` | Simple delegation | Untested in production |

**Default: `hierarchical`.** It has the strongest anti-drift properties and all production wins used it.

## Performance Data (Real, Not Theoretical)

### Test Generation Runs (March 23, 2026)

| Module | Lines | Tests | Time | Iterations | Result |
|---|---|---|---|---|---|
| AssetHealthCheck (GenServer) | 288 | 43 | ~120s | 3 | 43/43 ✅ + found 2 real bugs |
| WidgetHotelAssets (data) | 2,017 | 147 | ~90s | 0 | 147/147 ✅ |
| SquadBuilder (config builder) | 889 | 58 | ~90s | 0 | 58/58 ✅ |
| I18n (translations) | 272 | 41 | ~60s | 0 | 41/41 ✅ |
| TravelClick Types (structs) | 637 | 66 | ~90s | 0 | 66/66 ✅ |

**Totals:** 355 new tests, 5/5 modules green, 4/5 zero-iteration (80%)

### Key Observations
- **Pure data modules**: near-100% first-pass success. No iteration needed.
- **GenServer modules**: expect 2-3 iterations for test isolation issues.
- **Real bugs found**: 2 missing catch-all handlers (handle_cast, handle_info) — production-grade findings.
- **Data quality issues found**: 1 (duplicate image URLs across hotel rooms).
- **Execution time**: 60-120s per module regardless of size (bottleneck is Claude Code `--print` buffering, not swarm complexity).

### Daemon Worker Reality Check

| Worker | Success Rate | Notes |
|---|---|---|
| map | 100% | Fast (1ms), just indexes project structure |
| consolidate | 100% | Fast (9ms), memory compaction |
| audit | 20% | Timeouts at 300s, falls back to local mode |
| optimize | 33% | Timeouts, deferred on high CPU load |
| testgaps | 50% | Deferred on high CPU load |
| predict | 0% | Disabled by default |
| document | 0% | Disabled by default |

**Takeaway:** Don't rely on daemon workers for task quality. The prompt pattern does the heavy lifting. Daemon adds marginal background value (map + consolidate work; audit/optimize are unreliable).

## Parallel Swarm Runs

You can run multiple swarms simultaneously on different modules. Each gets its own exec session:

```python
# Launch 2 parallel swarms
exec(command='cd /project && claude --mcp-config .mcp.json --print "SWARM: ... TASK: test module_a"', background=True)
exec(command='cd /project && claude --mcp-config .mcp.json --print "SWARM: ... TASK: test module_b"', background=True)

# Poll both
process(action="poll", sessionId="session-a", timeout=120000)
process(action="poll", sessionId="session-b", timeout=120000)
```

**Observed:** 2 parallel swarms work cleanly. 3+ may cause CPU load deferrals on daemon workers (irrelevant for prompt-driven value).

## Self-Improvement Protocol

FlowSwarm can analyze and improve itself:

```
SWARM MODE: Initialize meta-analysis hierarchical swarm (maxAgents 4).
Spawn: architect (analyze skill files), analyst (review production data), coder (rewrite), reviewer (validate).

TASK: Analyze the FlowSwarm skill at [path] against production run data.
Review: what worked, what failed, what's missing. Generate v[N+1].
```

## Troubleshooting

**No output from Claude Code:**
`--print` buffers until completion. Use `background: true` on exec, poll with generous timeout. Never use `nohup`.

**Swarm didn't fire:**
1. Check `.claude/settings.json` has hooks
2. Run `ruflo doctor`
3. Verify MCP: `claude mcp list | grep ruflo`
4. Restart: `ruflo daemon stop && ruflo daemon start`

**GenServer test isolation failures:**
The global supervised instance conflicts with test instances. Solutions:
- Stop global, restart for test, cleanup after
- Use unique names: `name: :"test_#{System.unique_integer()}"`
- Set `async: false` for stateful tests

**Daemon workers timing out:**
Normal. Workers like `audit` and `optimize` timeout at 300s regularly (20-33% success rate). The swarm's value comes from prompt orchestration, not daemon workers. Ignore worker failures unless you specifically need their output.

**Memory shows 0 entries:**
In v1.0 (autoStart: false), the MCP server never started so memory_store was never called. With v2.0 (autoStart: true), Claude Code can call memory_store directly. Check after a run:
```bash
ruflo memory stats
ruflo memory search -q "test results"
```

**CPU load deferrals:**
Workers defer when system CPU > 8. This is protective. During active swarm runs, expect deferrals. Workers catch up when CPU drops.

## Files

```
skills/flow-swarm/
├── SKILL.md                          # This file (v2.0)
├── scripts/
│   └── setup-flow-swarm.sh           # Install + init + verify
└── references/
    └── template-examples.md          # Extended templates with context
```

## Changelog

### v2.1.0 (2026-03-23)
- **CRITICAL FIX:** Enabled MCP autoStart (was false, disabling ALL 150+ swarm tools)
- Added MCP tool reference table (swarm_init, agent_spawn, memory_store, etc.)
- Updated prompt templates to instruct Claude Code to USE MCP tools
- Setup script now auto-fixes autoStart: false → true
- Verify mode checks autoStart status
- Documented full MCP tool inventory (150+ tools across 18 categories)
- Root cause of "memory shows 0 entries" identified: MCP server wasn't running

### v2.0.0 (2026-03-23)
- Complete rewrite based on 5 production swarm runs (355 tests generated)
- Added tiered prompt templates (pure-data vs GenServer vs external deps)
- Added target selection protocol + --targets script flag
- Added pre-flight context injection pattern
- Added real performance data table with actual timing and iteration counts
- Added parallel swarm run documentation
- Added daemon worker reality check (success rates, what to ignore)
- Added self-improvement protocol
- Removed unverified WASM Booster performance claims
- Fixed: documented that `--print` mode doesn't persist swarm memories

### v1.0.0 (2026-03-23)
- Initial release based on first swarm run (AssetHealthCheck)
Get flow-swarm.

vz-bench-debug

vz-scrape-runner

Think you can beat it?