---
name: behavioral-test-loop
description: "Composite skill: read a spec file, extract test cases, launch browser, run visual+programmatic assertions, record with ffmpeg, propose and apply fixes, then loop until all tests pass. Produces tagged video artifacts per iteration. Use when the user says 'behavioral test loop', 'test against spec and fix', 'spec-driven test-fix cycle', 'test and fix loop', 'behavioral tests with recording', or chains /human-emulate + /validate + /visual-verify + ffmpeg recording in a single prompt. Also triggers on '/behavioral-test-loop'."
metadata:
  filePattern: "**/expected_behavior.md,**/spec.md,**/requirements.md"
  bashPattern: "behavioral.test|spec.*test|test.*loop"
  priority: 80
---
# Behavioral Test Loop — Spec-Driven Fix-Test-Record Cycle
Composite skill. Reads a spec, extracts test cases, tests them visually and programmatically in a browser, records the session with ffmpeg, fixes bugs, and loops until all tests pass.
See `reference_guide.md` in this folder for a worked example with all artifacts.
## Prerequisites
Before starting, verify these are available:
```bash
# Required tools
which ffmpeg # screen recording + video compositing
which python3 # HTTP server for static files
echo $DISPLAY # X11 display for ffmpeg x11grab
```
Required MCP: **Playwright** (`mcp__plugin_playwright_playwright__*`) for browser automation.
Depends on these sibling skills (must exist in `~/.claude/skills/`):
- `ffmpeg-screencast` — screen recording lifecycle
- `serve-and-test` — HTTP server + cache-busting
- `annotated-test-video` — screenshot-to-video compositing
If any are missing, the skill still works — the instructions below are self-contained. The sibling skills just provide deeper guidance for each sub-step.
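The prerequisite checks above can be wrapped in a small preflight sketch. The `check` helper is an invented name; the tools it probes are the ones listed:

```shell
#!/bin/sh
# Preflight sketch: print OK/MISSING for each required tool.
check() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "OK $1"
  else
    echo "MISSING $1"
  fi
}
check ffmpeg
check python3
# $DISPLAY is an environment variable, not a binary, so test it directly
[ -n "$DISPLAY" ] && echo "OK DISPLAY" || echo "MISSING DISPLAY"
```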
## Step 1: Find and Parse the Spec
Look for the spec file. Common names: `expected_behavior.md`, `spec.md`, `requirements.md`, or any file the user points to.
Extract **every** testable claim as a numbered test case:
```
TC1: "Black background" → getComputedStyle(body).backgroundColor === 'rgb(0, 0, 0)'
TC2: "White header" → getComputedStyle(h1).color === 'rgb(255, 255, 255)'
TC3: "Header text" → h1.textContent === 'Click a button below to change color'
TC4: "Two buttons: Red, Blue" → buttons.length === 2 && labels === ['Red', 'Blue']
TC5: "Green square" → getComputedStyle(square).backgroundColor === 'rgb(0, 128, 0)'
TC6: "Red btn → red" → [click Red] → square bg === 'rgb(255, 0, 0)'
TC7: "Blue btn → blue" → [click Blue] → square bg === 'rgb(0, 0, 255)'
```
Group into **Initial State** (true on load) and **Interactions** (true after user action).
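Before writing TCs by hand, a rough enumeration pass can surface candidate claims. This sketch assumes the spec lists claims as `- ` bullets, which may not hold for every spec:

```shell
#!/bin/sh
# Sketch: list and count bullet-line claims in a spec file.
spec=$(mktemp)
cat > "$spec" <<'EOF'
- Black background
- White header
- Two buttons: Red, Blue
EOF
grep -n '^- ' "$spec"    # line-numbered claims, one per candidate TC
grep -c '^- ' "$spec"    # total candidate count
rm -f "$spec"
```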
## Step 2: Code Analysis (Pre-Test Hypothesis)
Read the implementation and compare each TC against the actual code. Note suspected bugs **before** opening the browser. Example:
```
Line 58: square.style.background = 'blue' ← Red handler sets blue, not red
HYPOTHESIS: TC6 will FAIL
```
This gives you a prediction to confirm visually.
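One quick way to build those hypotheses is to grep the implementation for the state-changing lines and eyeball each against its TC. The handler snippet below is an invented example, not the real app:

```shell
#!/bin/sh
# Sketch: list every background assignment with its line number.
src=$(mktemp)
cat > "$src" <<'EOF'
redBtn.onclick  = () => { square.style.background = 'blue'; };
blueBtn.onclick = () => { square.style.background = 'blue'; };
EOF
grep -n "style.background" "$src"   # line 1 is the suspect: Red handler sets blue
rm -f "$src"
```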
## Step 3: Iteration Loop
```
PORT=8765
ITERATION=0
while bugs_remain:
```
### 3a. Start Screen Recording
```bash
# Detect resolution
RES=$(xdpyinfo -display ${DISPLAY:-:1} | grep dimensions | awk '{print $2}')
# Start recording in the background (-y overwrites leftovers from a crashed run)
ffmpeg -y -f x11grab -framerate 15 -video_size "$RES" -i ${DISPLAY:-:1} \
  -c:v libx264 -preset ultrafast -crf 23 -pix_fmt yuv420p \
  "iter${ITERATION}_recording.mp4" &
```
### 3b. Start HTTP Server
```bash
# Kill stale servers first
lsof -ti:$PORT | xargs kill -9 2>/dev/null
sleep 1
# Start fresh
cd /path/to/project && python3 -m http.server $PORT &
# Verify correct content is served
curl -s http://localhost:$PORT/index.html | grep "the key line"
```
### 3c. Run All Tests in Playwright
**Navigate with cache-busting:**
```
mcp__plugin_playwright_playwright__browser_navigate
url: http://localhost:8765/index.html?v=ITERATION
```
**Screenshot initial state:**
```
mcp__plugin_playwright_playwright__browser_take_screenshot
type: png
filename: iter{N}_01_initial_state.png
fullPage: true
```
**Run programmatic assertions (Initial State):**
```
mcp__plugin_playwright_playwright__browser_evaluate
function: |
() => {
const results = {};
// TC1: Check background
const bodyBg = getComputedStyle(document.body).backgroundColor;
results.TC1 = { expected: 'rgb(0, 0, 0)', actual: bodyBg, pass: bodyBg === 'rgb(0, 0, 0)' };
// TC2: Check header color
const h1 = document.querySelector('h1');
const h1Color = getComputedStyle(h1).color;
results.TC2 = { expected: 'rgb(255, 255, 255)', actual: h1Color, pass: h1Color === 'rgb(255, 255, 255)' };
// ... add all initial-state TCs
return results;
}
```
**For each interaction test:**
1. Click: `mcp__plugin_playwright_playwright__browser_click` with `ref` from snapshot
2. Screenshot: `mcp__plugin_playwright_playwright__browser_take_screenshot`
3. Assert: `mcp__plugin_playwright_playwright__browser_evaluate` with `getComputedStyle`
4. Reload for next test: `mcp__plugin_playwright_playwright__browser_navigate` with new `?v=` param
**Key: always use `getComputedStyle()` for color checks.** CSS `green` resolves to `rgb(0, 128, 0)`, not `rgb(0, 255, 0)`. Named colors resolve to specific RGB values.
### 3d. Compile Results
```markdown
| TC | Spec Claim | Expected | Actual | Result |
|-----|------------------------|-----------------|-----------------|--------|
| TC1 | Black background | rgb(0,0,0) | rgb(0,0,0) | PASS |
| TC6 | Red btn → red square | rgb(255,0,0) | rgb(0,0,255) | FAIL |
```
### 3e. Stop Recording
```bash
pkill -INT -f "ffmpeg.*iter${ITERATION}_recording"
sleep 2
# Validate
ffprobe -v error -show_entries format=duration,size \
-of default=noprint_wrappers=1 "iter${ITERATION}_recording.mp4"
```
### 3f. Create Annotated Summary Video
Use ffmpeg `drawtext` to overlay test names and PASS/FAIL verdicts on screenshots:
```bash
ffmpeg -y \
-loop 1 -t 4 -i iter${N}_01_initial_state.png \
-loop 1 -t 4 -i iter${N}_02_after_red_click.png \
-loop 1 -t 4 -i iter${N}_03_after_blue_click.png \
-filter_complex "\
[0:v]scale=1280:960:...,fps=15,\
drawtext=text='ITER ${N} | TEST 1/3 - Initial State':fontsize=32:fontcolor=yellow:x=(w-text_w)/2:y=20:box=1:boxcolor=black@0.7:boxborderw=8,\
drawtext=text='PASS - description':fontsize=22:fontcolor=lime:x=(w-text_w)/2:y=h-50:box=1:boxcolor=black@0.7:boxborderw=6[v0];\
...concat..." \
-map "[out]" -c:v libx264 -pix_fmt yuv420p -preset fast \
"iter${N}_${RESULT}_video.mp4"
```
Use `fontcolor=lime` for PASS, `fontcolor=red` for FAIL.
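The color choice can be scripted so the verdict string drives `fontcolor` when building the filtergraph (the helper name is invented):

```shell
#!/bin/sh
# Sketch: map a verdict to a drawtext fontcolor.
verdict_color() {
  case "$1" in
    PASS) echo lime ;;
    *)    echo red ;;
  esac
}
verdict_color PASS
verdict_color FAIL
```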
### 3g. If Failures Exist: Fix and Continue
```
if all_pass:
break # Exit loop — done!
# Otherwise:
# 1. Propose fix for each failing TC with file:line reference
# 2. Apply fix with Edit tool
# 3. Increment ITERATION
# 4. Kill old server, restart (lsof -ti:PORT | xargs kill -9)
# 5. Go to 3a
```
**CRITICAL after editing files:**
- Kill and restart the HTTP server (stale process serves old content)
- Navigate with incremented `?v=` cache buster
- Verify with `curl` that the server returns the fixed content
## Step 4: Final Report
After the loop exits (all tests pass):
```markdown
## Behavioral Test Report
**Spec file**: expected_behavior.md
**App file**: index.html
**Iterations**: N (0 = baseline with bugs, 1..N = fixes applied)
**Final result**: ALL PASS (N/N)
### Test Cases
| TC | Spec Claim | Iter 0 | Iter 1 | ... |
|----|-----------|--------|--------|-----|
### Bugs Fixed
| Bug | File:Line | Was | Should Be | Fixed In |
|-----|-----------|-----|-----------|----------|
### Video Artifacts
| File | Description |
|------|-------------|
| iter0_recording.mp4 | Fullscreen capture, baseline |
| iter0_baseline_1FAIL_video.mp4 | Annotated summary, baseline |
| iter1_recording.mp4 | Fullscreen capture, after fix |
| iter1_ALL_PASS_video.mp4 | Annotated summary, all green |
```
## Cleanup
Always run at the end:
```bash
# Kill HTTP server
lsof -ti:8765 | xargs kill -9 2>/dev/null
# Stop any leftover ffmpeg with SIGINT so the MP4 finalizes (see gotcha 4)
pkill -INT -f "ffmpeg.*x11grab" 2>/dev/null
```
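To make cleanup survive an aborted session, it can be registered with `trap` instead of relying on reaching the end. A sketch, with the real kill commands stubbed as comments:

```shell
#!/bin/sh
# Sketch: trap-based cleanup that fires on normal exit and on interrupts.
cleanup() {
  # real version: lsof -ti:8765 | xargs kill -9 2>/dev/null
  #               pkill -INT -f "ffmpeg.*x11grab" 2>/dev/null
  echo "cleanup ran"
}
trap cleanup EXIT INT TERM
echo "session work here"
```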
## Naming Convention
```
iter{N}_recording.mp4 — fullscreen ffmpeg capture
iter{N}_{RESULT}_video.mp4 — annotated summary (iter1_ALL_PASS_video.mp4)
iter{N}_{NN}_{testname}.png — test screenshots (iter0_02_after_red_click.png)
```
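A small sketch that derives all three artifact names from the iteration number, result tag, and test name, so they stay consistent across the loop:

```shell
#!/bin/sh
# Sketch: generate artifact filenames for one iteration.
N=1
RESULT=ALL_PASS
echo "iter${N}_recording.mp4"
echo "iter${N}_${RESULT}_video.mp4"
printf 'iter%d_%02d_%s.png\n' "$N" 2 "after_red_click"
```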
## Gotchas Learned from Real Usage
1. **Stale server**: `python3 -m http.server` doesn't auto-reload. After editing files, you MUST kill and restart or you'll test old code and get false results.
2. **Browser cache**: Always use `?v=N` query param when navigating. Without it, Playwright may serve cached content.
3. **CSS color values**: `getComputedStyle` returns `rgb()` format. `green` = `rgb(0, 128, 0)`, `red` = `rgb(255, 0, 0)`, `blue` = `rgb(0, 0, 255)`. Don't compare against named colors.
4. **ffmpeg SIGINT vs SIGKILL**: Always stop ffmpeg with `pkill -INT` (SIGINT), not `-9` (SIGKILL). SIGKILL corrupts the MP4 container. SIGINT lets ffmpeg finalize cleanly.
5. **file:// is unreliable**: Even when the browser opens `file://` URLs, fetch/XHR and ES module loading break without an HTTP origin, so static HTML must be served over HTTP. That's why we use `python3 -m http.server`.
6. **Port conflicts**: Always check `lsof -ti:PORT` before starting a server. Previous sessions leave orphaned processes.