---
name: behavioral-test-loop
description: "Composite skill: read a spec file, extract test cases, launch browser, run visual+programmatic assertions, record with ffmpeg, propose and apply fixes, then loop until all tests pass. Produces tagged video artifacts per iteration. Use when the user says 'behavioral test loop', 'test against spec and fix', 'spec-driven test-fix cycle', 'test and fix loop', 'behavioral tests with recording', or chains /human-emulate + /validate + /visual-verify + ffmpeg recording in a single prompt. Also triggers on '/behavioral-test-loop'."
metadata:
  filePattern: "**/expected_behavior.md,**/spec.md,**/requirements.md"
  bashPattern: "behavioral.test|spec.*test|test.*loop"
  priority: 80
---
# Behavioral Test Loop — Spec-Driven Fix-Test-Record Cycle
Composite skill. Reads a spec, extracts test cases, tests them visually and programmatically in a browser, records the session with ffmpeg, fixes bugs, and loops until all tests pass.
See `reference_guide.md` in this folder for a worked example with all artifacts.
## Prerequisites
Before starting, verify these are available:
```bash
# Required tools
which ffmpeg # screen recording + video compositing
which python3 # HTTP server for static files
echo $DISPLAY # X11 display for ffmpeg x11grab
```
Required MCP: **Playwright** (`mcp__plugin_playwright_playwright__*`) for browser automation.
Depends on these sibling skills (must exist in `~/.claude/skills/`):
- `ffmpeg-screencast` — screen recording lifecycle
- `serve-and-test` — HTTP server + cache-busting
- `annotated-test-video` — screenshot-to-video compositing
If any are missing, the skill still works — the instructions below are self-contained. The sibling skills just provide deeper guidance for each sub-step.
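The prerequisite checks above can be wrapped in a small preflight sketch. The `check` helper is an invented name; the tools it probes are the ones listed:

```shell
#!/bin/sh
# Preflight sketch: print OK/MISSING for each required tool.
check() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "OK $1"
  else
    echo "MISSING $1"
  fi
}
check ffmpeg
check python3
# $DISPLAY is an environment variable, not a binary, so test it directly
[ -n "$DISPLAY" ] && echo "OK DISPLAY" || echo "MISSING DISPLAY"
```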
## Step 1: Find and Parse the Spec
Look for the spec file. Common names: `expected_behavior.md`, `spec.md`, `requirements.md`, or any file the user points to.
Extract **every** testable claim as a numbered test case:
```
TC1: "Black background" → getComputedStyle(body).backgroundColor === 'rgb(0, 0, 0)'
TC2: "White header" → getComputedStyle(h1).color === 'rgb(255, 255, 255)'
TC3: "Header text" → h1.textContent === 'Click a button below to change color'
TC4: "Two buttons: Red, Blue" → buttons.length === 2 && labels === ['Red', 'Blue']
TC5: "Green square" → getComputedStyle(square).backgroundColor === 'rgb(0, 128, 0)'
TC6: "Red btn → red" → [click Red] → square bg === 'rgb(255, 0, 0)'
TC7: "Blue btn → blue" → [click Blue] → square bg === 'rgb(0, 0, 255)'
```
Group into **Initial State** (true on load) and **Interactions** (true after user action).
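Before writing TCs by hand, a rough enumeration pass can surface candidate claims. This sketch assumes the spec lists claims as `- ` bullets, which may not hold for every spec:

```shell
#!/bin/sh
# Sketch: list and count bullet-line claims in a spec file.
spec=$(mktemp)
cat > "$spec" <<'EOF'
- Black background
- White header
- Two buttons: Red, Blue
EOF
grep -n '^- ' "$spec"    # line-numbered claims, one per candidate TC
grep -c '^- ' "$spec"    # total candidate count
rm -f "$spec"
```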
## Step 2: Code Analysis (Pre-Test Hypothesis)
Read the implementation and compare each TC against the actual code. Note suspected bugs **before** opening the browser. Example:
```
Line 58: square.style.background = 'blue' ← Red handler sets blue, not red
HYPOTHESIS: TC6 will FAIL
```
This gives you a prediction to confirm visually.
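One quick way to build those hypotheses is to grep the implementation for the state-changing lines and eyeball each against its TC. The handler snippet below is an invented example, not the real app:

```shell
#!/bin/sh
# Sketch: list every background assignment with its line number.
src=$(mktemp)
cat > "$src" <<'EOF'
redBtn.onclick  = () => { square.style.background = 'blue'; };
blueBtn.onclick = () => { square.style.background = 'blue'; };
EOF
grep -n "style.background" "$src"   # line 1 is the suspect: Red handler sets blue
rm -f "$src"
```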
## Step 3: Iteration Loop
```
PORT=8765
ITERATION=0
while bugs_remain:
```
### 3a. Start Screen Recording
```bash
# Detect resolution
RES=$(xdpyinfo -display ${DISPLAY:-:1} | grep dimensions | awk '{print $2}')
# Start recording in the background (-y overwrites leftovers from a crashed run)
ffmpeg -y -f x11grab -framerate 15 -video_size "$RES" -i ${DISPLAY:-:1} \
  -c:v libx264 -preset ultrafast -crf 23 -pix_fmt yuv420p \
  "iter${ITERATION}_recording.mp4" &
```
### 3b. Start HTTP Server
```bash
# Kill stale servers first
lsof -ti:$PORT | xargs kill -9 2>/dev/null
sleep 1
# Start fresh
cd /path/to/project && python3 -m http.server $PORT &
# Verify correct content is served
curl -s http://localhost:$PORT/index.html | grep "the key line"
```
### 3c. Run All Tests in Playwright
**Navigate with cache-busting:**
```
mcp__plugin_playwright_playwright__browser_navigate
url: http://localhost:8765/index.html?v=ITERATION
```
**Screenshot initial state:**
```
mcp__plugin_playwright_playwright__browser_take_screenshot
type: png
filename: iter{N}_01_initial_state.png
fullPage: true
```
**Run programmatic assertions (Initial State):**
```
mcp__plugin_playwright_playwright__browser_evaluate
function: |
() => {
const results = {};
// TC1: Check background
const bodyBg = getComputedStyle(document.body).backgroundColor;
results.TC1 = { expected: 'rgb(0, 0, 0)', actual: bodyBg, pass: bodyBg === 'rgb(0, 0, 0)' };
// TC2: Check header color
const h1 = document.querySelector('h1');
const h1Color = getComputedStyle(h1).color;
results.TC2 = { expected: 'rgb(255, 255, 255)', actual: h1Color, pass: h1Color === 'rgb(255, 255, 255)' };
// ... add all initial-state TCs
return results;
}
```
**For each interaction test:**
1. Click: `mcp__plugin_playwright_playwright__browser_click` with `ref` from snapshot
2. Screenshot: `mcp__plugin_playwright_playwright__browser_take_screenshot`
3. Assert: `mcp__plugin_playwright_playwright__browser_evaluate` with `getComputedStyle`
4. Reload for next test: `mcp__plugin_playwright_playwright__browser_navigate` with new `?v=` param
**Key: always use `getComputedStyle()` for color checks.** CSS `green` resolves to `rgb(0, 128, 0)`, not `rgb(0, 255, 0)`. Named colors resolve to specific RGB values.
### 3d. Compile Results
```markdown
| TC | Spec Claim | Expected | Actual | Result |
|-----|------------------------|-----------------|-----------------|--------|
| TC1 | Black background | rgb(0,0,0) | rgb(0,0,0) | PASS |
| TC6 | Red btn → red square | rgb(255,0,0) | rgb(0,0,255) | FAIL |
```
### 3e. Stop Recording
```bash
pkill -INT -f "ffmpeg.*iter${ITERATION}_recording"
sleep 2
# Validate
ffprobe -v error -show_entries format=duration,size \
-of default=noprint_wrappers=1 "iter${ITERATION}_recording.mp4"
```
### 3f. Create Annotated Summary Video
Use ffmpeg `drawtext` to overlay test names and PASS/FAIL verdicts on screenshots:
```bash
ffmpeg -y \
-loop 1 -t 4 -i iter${N}_01_initial_state.png \
-loop 1 -t 4 -i iter${N}_02_after_red_click.png \
-loop 1 -t 4 -i iter${N}_03_after_blue_click.png \
-filter_complex "\
[0:v]scale=1280:960:...,fps=15,\
drawtext=text='ITER ${N} | TEST 1/3 - Initial State':fontsize=32:fontcolor=yellow:x=(w-text_w)/2:y=20:box=1:boxcolor=black@0.7:boxborderw=8,\
drawtext=text='PASS - description':fontsize=22:fontcolor=lime:x=(w-text_w)/2:y=h-50:box=1:boxcolor=black@0.7:boxborderw=6[v0];\
...concat..." \
-map "[out]" -c:v libx264 -pix_fmt yuv420p -preset fast \
"iter${N}_${RESULT}_video.mp4"
```
Use `fontcolor=lime` for PASS, `fontcolor=red` for FAIL.
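The color choice can be scripted so the verdict string drives `fontcolor` when building the filtergraph (the helper name is invented):

```shell
#!/bin/sh
# Sketch: map a verdict to a drawtext fontcolor.
verdict_color() {
  case "$1" in
    PASS) echo lime ;;
    *)    echo red ;;
  esac
}
verdict_color PASS
verdict_color FAIL
```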
### 3g. If Failures Exist: Fix and Continue
```
if all_pass:
break # Exit loop — done!
# Otherwise:
# 1. Propose fix for each failing TC with file:line reference
# 2. Apply fix with Edit tool
# 3. Increment ITERATION
# 4. Kill old server, restart (lsof -ti:PORT | xargs kill -9)
# 5. Go to 3a
```
**CRITICAL after editing files:**
- Kill and restart the HTTP server (stale process serves old content)
- Navigate with incremented `?v=` cache buster
- Verify with `curl` that the server returns the fixed content
## Step 4: Final Report
After the loop exits (all tests pass):
```markdown
## Behavioral Test Report
**Spec file**: expected_behavior.md
**App file**: index.html
**Iterations**: N (0 = baseline with bugs, 1..N = fixes applied)
**Final result**: ALL PASS (N/N)
### Test Cases
| TC | Spec Claim | Iter 0 | Iter 1 | ... |
|----|-----------|--------|--------|-----|
### Bugs Fixed
| Bug | File:Line | Was | Should Be | Fixed In |
|-----|-----------|-----|-----------|----------|
### Video Artifacts
| File | Description |
|------|-------------|
| iter0_recording.mp4 | Fullscreen capture, baseline |
| iter0_baseline_1FAIL_video.mp4 | Annotated summary, baseline |
| iter1_recording.mp4 | Fullscreen capture, after fix |
| iter1_ALL_PASS_video.mp4 | Annotated summary, all green |
```
## Cleanup
Always run at the end:
```bash
# Kill HTTP server
lsof -ti:8765 | xargs kill -9 2>/dev/null
# Stop any leftover ffmpeg with SIGINT so the MP4 finalizes (see gotcha 4)
pkill -INT -f "ffmpeg.*x11grab" 2>/dev/null
```
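To make cleanup survive an aborted session, it can be registered with `trap` instead of relying on reaching the end. A sketch, with the real kill commands stubbed as comments:

```shell
#!/bin/sh
# Sketch: trap-based cleanup that fires on normal exit and on interrupts.
cleanup() {
  # real version: lsof -ti:8765 | xargs kill -9 2>/dev/null
  #               pkill -INT -f "ffmpeg.*x11grab" 2>/dev/null
  echo "cleanup ran"
}
trap cleanup EXIT INT TERM
echo "session work here"
```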
## Naming Convention
```
iter{N}_recording.mp4 — fullscreen ffmpeg capture
iter{N}_{RESULT}_video.mp4 — annotated summary (iter1_ALL_PASS_video.mp4)
iter{N}_{NN}_{testname}.png — test screenshots (iter0_02_after_red_click.png)
```
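A small sketch that derives all three artifact names from the iteration number, result tag, and test name, so they stay consistent across the loop:

```shell
#!/bin/sh
# Sketch: generate artifact filenames for one iteration.
N=1
RESULT=ALL_PASS
echo "iter${N}_recording.mp4"
echo "iter${N}_${RESULT}_video.mp4"
printf 'iter%d_%02d_%s.png\n' "$N" 2 "after_red_click"
```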
## Gotchas Learned from Real Usage
1. **Stale server**: `python3 -m http.server` doesn't auto-reload. After editing files, you MUST kill and restart or you'll test old code and get false results.
2. **Browser cache**: Always use `?v=N` query param when navigating. Without it, Playwright may serve cached content.
3. **CSS color values**: `getComputedStyle` returns `rgb()` format. `green` = `rgb(0, 128, 0)`, `red` = `rgb(255, 0, 0)`, `blue` = `rgb(0, 0, 255)`. Don't compare against named colors.
4. **ffmpeg SIGINT vs SIGKILL**: Always stop ffmpeg with `pkill -INT` (SIGINT), not `-9` (SIGKILL). SIGKILL corrupts the MP4 container. SIGINT lets ffmpeg finalize cleanly.
5. **file:// is unreliable**: Even when the browser opens `file://` URLs, fetch/XHR and ES module loading break without an HTTP origin, so static HTML must be served over HTTP. That's why we use `python3 -m http.server`.
6. **Port conflicts**: Always check `lsof -ti:PORT` before starting a server. Previous sessions leave orphaned processes.