---
name: verification
description: "Complete verification pipeline (L1-L6 tests + quality gates). This skill should be used when the user asks to 'verify', '/a-verify', 'check everything', 'pre-ship verification', or before claiming any task complete — evidence before assertions."
mode: [coding, engineering]
effort: medium
superpowers_pattern: [iron_law, red_flags, hard_gate]
see_also: [tdd, systematic-debugging, code-review]
thinking_mode: adaptive
---
# Verification
**Principle**: Evidence before assertions. NEVER claim work passes without running commands and confirming output.
<HARD-GATE>
NO COMPLETION CLAIMS WITHOUT FRESH VERIFICATION EVIDENCE.
If you have not run the verification command in this message, you cannot claim it passes.
Evidence before assertions, always.
</HARD-GATE>
**Iron Law**: `LAW-VERIFY-001` (evidence-before-assertions). Override requires HITL AskUserQuestion. Source: `scripts/execution-philosophy/iron-laws.yaml`.
<red-flags>
| Thought | Reality |
|---|---|
| "Tests should pass now, committing" | "Should pass" is a wish, not evidence. Confidence is not verification. Until the command runs and the output is read, you do not know — you hope. |
| "Good enough, we can refactor later" | "Later" is the cemetery where good intentions go. Code merged ships to production. Every refactor-later is a mortgage with compound interest paid in incident reviews. |
| "YAGNI — nobody will notice this edge case" | YAGNI means "don't build speculative features", not "don't handle real inputs". Edge cases happen IN PRODUCTION to REAL users, not in your head. |
| "The agent reported success, task is done" | Agent reports are NOT evidence. Agents can claim success while leaving an empty diff, broken tests, or uncommitted files. Trust but verify — always check the VCS diff independently. |
| "Trust me, I've done this pattern 20 times" | Experience speeds recognition, not verification. The 20 previous times had 20 different contexts. This one has its own gotcha you have not met yet. |
</red-flags>
## Verification Levels
| Level | Scope | Command pattern | Notes |
|-------|-------|----------------|-------|
| **L1** | Backend unit | `docker exec synapse-backend bash -c "cd /app && python -m pytest tests/{path} -x -q --tb=short"` | Specific file first, then broader |
| **L2** | Frontend unit + types | `cd frontend && bunx vitest --run` + `bun run type-check` | Independent, can parallelize |
| **L3** | E2E (Playwright) | `cd frontend && bunx playwright test e2e/qa-*.spec.ts` | Only if plan Section O specifies E2E |
| **L4** | Persona validation | Per persona: precondition → action → verify result → check RBAC → check perf | Browser automation, API curl, or manual |
| **L5** | Security | Input validation, RBAC enforcement (wrong role), no secrets in responses/logs | Code review + runtime checks |
| **L6** | Performance | `time curl -s "localhost:8001/api/v1/{pid}/resource"` (target <200ms) + `bunx vite build` (bundle size) | Benchmarks |
## Output Format
```
✅ VERIFICATION REPORT
L1 Backend: ✅/❌ {passed} passed ({failed} failed)
L2 Frontend: ✅/❌ {passed} passed, type-check {status}
L3 E2E: ✅/❌ {scenarios} scenarios
L4 Persona: {role}: ✅/❌ {details} (per persona)
L5 Security: ✅/❌ {details}
L6 Perf: ✅/❌ API {ms}ms, build {size}
OVERALL: ✅/❌
```
## Parallel Execution
**Sequential first**: DB migrations (`alembic upgrade head`) must complete before any test.
**Then parallel** (3 background calls in same message):
1. pytest (backend) → `/tmp/pytest-results.txt`
2. vitest (frontend) → `/tmp/vitest-results.txt`
3. type-check → `/tmp/typecheck-results.txt`
**Do NOT parallelize**: migration+tests, 2 pytest on same DB, E2E+pytest, security+deploy.
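The sequential-then-parallel shape can be sketched as follows. The four functions are stand-ins for the real invocations listed in the tables above (`alembic upgrade head`, pytest, vitest, type-check); substitute the actual commands when running against the repo.

```bash
# Stand-ins for the real commands; swap in the invocations from the tables above.
migrate()       { echo "migrations applied"; }   # e.g. alembic upgrade head
pytest_cmd()    { echo "12 passed"; }            # L1 backend stream
vitest_cmd()    { echo "8 passed"; }             # L2 frontend stream
typecheck_cmd() { echo "0 errors"; }             # L2 type-check stream

migrate                                          # sequential: must finish first
pytest_cmd    > /tmp/pytest-results.txt &        # then three independent streams
vitest_cmd    > /tmp/vitest-results.txt &
typecheck_cmd > /tmp/typecheck-results.txt &
wait                                             # collect all three before reading any result file
```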
## Quality Gates Pipeline (fail-fast)
| Gate | Command | Hard fail |
|------|---------|-----------|
| **Build** | `bun run build` | Yes |
| **Types** | `bun run typecheck` | Yes |
| **Lint** | `bun run lint` | Yes |
| **Tests** | `bun test` / `pytest` | Yes |
| **Security** | Grep staged: secrets, .env, private keys, console.log | Secrets/keys: yes. console.log: warning |
| **Diff** | `git status --short` + `git diff --stat` | Info only |
Detect available scripts first. Missing gate = ⏭️ skipped. Stop on first failure.
**Status**: ✅ Passed | ❌ Failed (blocking) | ⚠️ Warning | ⏭️ Skipped
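A minimal sketch of the fail-fast loop. The gate functions are stand-ins for the table's `bun run …` / `pytest` commands; the lint stand-in fails deliberately to show where the pipeline stops.

```bash
# Stand-in gates; a real run would invoke the commands from the table above.
gate_build() { true; }
gate_types() { true; }
gate_lint()  { false; }   # simulated lint failure
gate_tests() { true; }    # never reached: the pipeline stops at lint

run_pipeline() {
  local gate
  for gate in "$@"; do                  # gates run in table order
    if ! "$gate"; then
      echo "❌ $gate failed (blocking)"
      return 1                          # stop on first failure
    fi
    echo "✅ $gate passed"
  done
  echo "OVERALL: ✅"
}

run_pipeline gate_build gate_types gate_lint gate_tests || true   # exit 1 swallowed for the demo
```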
## Flags
| Flag | Behavior |
|------|----------|
| `--quick` | Smoke tests only |
| `--fix` | Auto-fix lint |
| `--verbose` | Full output |
| `--no-security` | Skip security scan |
## Checkpoints
`checkpoint save|compare|list|diff` — captures test counts, type errors, lint errors, security issues, coverage %, file checksums. Flags regressions (decreased passes, increased errors).
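What `checkpoint compare` flags can be sketched as a diff over two saved metric snapshots. The key=value snapshot format and field names here are assumptions; the real checkpoints capture more (coverage %, file checksums, security issues).

```bash
# Compare two checkpoint snapshots (key=value files) and flag regressions:
# fewer passing tests, or more errors, than the saved baseline.
compare_checkpoint() {
  local before=$1 after=$2 status=0
  local p0 p1 e0 e1
  p0=$(grep '^passed=' "$before" | cut -d= -f2)
  p1=$(grep '^passed=' "$after"  | cut -d= -f2)
  e0=$(grep '^errors=' "$before" | cut -d= -f2)
  e1=$(grep '^errors=' "$after"  | cut -d= -f2)
  (( p1 < p0 )) && { echo "REGRESSION: passed $p0 -> $p1"; status=1; }
  (( e1 > e0 )) && { echo "REGRESSION: errors $e0 -> $e1"; status=1; }
  (( status == 0 )) && echo "OK: no regressions"
  return $status
}
```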
## On Failure
1. Identify which level failed
2. Use systematic-debugging skill
3. Max 2 fix attempts
4. Still failing → AskUserQuestion: what failed, what tried, 2-3 alternatives
## HITL Gates
- **All pass** → AskUserQuestion: "All gates pass. Ready to commit/ship?"
- **Any fail** → AskUserQuestion: "(a) Fix issues (b) Skip gate (c) Abort — I'll fix manually"
## Environment Health Checks (run BEFORE L1-L6)
Before running any verification, check the runtime environment is healthy:
| Check | Command | Fix if broken |
|-------|---------|---------------|
| **Docker containers up** | `docker compose ps --format "{{.Name}} {{.Status}}"` | `docker compose up -d` |
| **Workspace packages synced** | `docker exec synapse-frontend ls node_modules/@axoiq/` | `docker exec synapse-frontend bun install && docker restart synapse-frontend` |
| **Vite dev server responding** | `curl -s -o /dev/null -w "%{http_code}" http://localhost:4000` | `docker restart synapse-frontend` |
| **Backend API healthy** | `curl -s http://localhost:8001/health` | `docker restart synapse-backend` |
| **Vite cache stale** | Check if `node_modules/.vite` is outdated after package changes | `docker exec synapse-frontend rm -rf node_modules/.vite && docker restart synapse-frontend` |
**When to run**:
- After adding/modifying workspace packages (`frontend/packages/*`)
- After `bun install` or `bun.lock` changes
- After Docker container rebuild
- When you see `Failed to resolve import "@axoiq/*"` errors
**Auto-fix pattern** (for finishing-branch skill):
```bash
# If workspace packages changed in this commit:
if git diff --cached --name-only | grep -q "^frontend/packages/"; then
  docker exec synapse-frontend bun install
  docker restart synapse-frontend
  sleep 5
  curl -sf http://localhost:4000 > /dev/null || echo "⚠️ Frontend not responding after package sync"
fi
```
## DoD Tier Check (after L1-L6)
After completing verification levels, compute the DoD tier from the feature's validation matrix:
| Tier (score) | Condition | Status |
|--------------|-----------|--------|
| CODED (20%) | Only code-level layers pass (BE Unit, FE Unit, Type Check, etc.) | Not ready for review |
| VALIDATING (21-80%) | Some validation layers pass (E2E, HITL, Security, etc.) | In progress |
| VALIDATED (81-99%) | Most layers pass but not yet shipped | Ready for deploy |
| SHIPPED (100%) | All 13 layers PASS | Production-ready |
**After L1-L6, report DoD tier**:
```
DoD Score: {score}/100% → {tier_icon} {tier}
CODED: {tier1_score}/20%
VALIDATED: {tier2_score}/60%
SHIPPED: {tier3_score}/20%
```
NEVER claim a feature is "done" if DoD tier < VALIDATED. NEVER report progress > 20% if only Tier 1 passes.
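The score-to-tier mapping can be sketched as a small helper. Boundaries come from the tier table above; the 0-100 score itself comes from the feature's validation matrix.

```bash
# Map a DoD score (0-100) to its tier, per the tier table above.
dod_tier() {
  local score=$1
  if   (( score >= 100 )); then echo "SHIPPED"
  elif (( score >= 81 ));  then echo "VALIDATED"
  elif (( score >= 21 ));  then echo "VALIDATING"
  else                          echo "CODED"
  fi
}
```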
## File Coverage Check (after DoD)
```bash
curl -s $BACKEND/api/v1/admin/atlas-dev/features/coverage \
-H "X-Admin-Token: $ADMIN_TOKEN" | python3 -c "
import json, sys
data = json.load(sys.stdin)
pct = data['coverage_pct']['overall']
orphans = len(data['orphans'].get('backend',[])) + len(data['orphans'].get('frontend',[]))
print(f'Coverage: {pct}% | Orphans: {orphans}')
if pct < 80: print('WARNING: File coverage below 80%')
"
```
- Source Files section required for features at VALIDATING tier or above
- Before claiming "BE Unit PASS", verify tests exist for files in Source Files
## Never Skip
- NEVER claim "tests pass" without running them
- NEVER claim "it works" without verifying
- ALWAYS show actual output
- ALWAYS run type-check
## Test Impact Analysis (consolidated from test-affected W5.3 2026-05-01)
Sub-mode for fast pre-push gate (G1): run **only** the tests impacted by
uncommitted (or recently committed) changes, with a hard 30-second budget.
Splits into backend (pytest-testmon) and frontend (vitest --changed) streams.
The goal: **sub-30s feedback before `git push`**. Full suite runs in CI.
### CATO compatibility
This sub-mode is referenced by Synapse's CATO orchestration
(`.claude/rules/cato-orchestration.md`) as the G1 pre-push runner. Behaviour
must remain compatible: same flags, same exit-code semantics (0/1/2),
same JSONL log format to `~/.claude/ci-health.jsonl`.
### Commands
```bash
/atlas test-affected # Run affected since HEAD~1
/atlas test-affected --since origin/dev # Compare to branch tip
/atlas test-affected --dry-run # Print selection, don't execute
/atlas test-affected --budget 60 # Raise budget (default 30s)
/atlas test-affected --only backend # Skip frontend
```
Implementation: `${CLAUDE_PLUGIN_ROOT}/skills/verification/test-affected.sh`
### What it runs
| Change touches | Runs |
|---|---|
| `backend/**/*.py` | `pytest --testmon -x -q -m "not slow and not external"` |
| `frontend/packages/*/src/**` | `bun x vitest run --changed <since>` in the package dir |
| `frontend/src/**` | `bun x vitest run --changed <since>` |
| `.woodpecker/**` / `scripts/**` | syntax validation (`yq` / `shellcheck`) |
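The dispatch table above can be sketched as a path-pattern router. This version echoes the runner instead of executing it; `<since>` is filled in by the caller, and the runner strings mirror the table.

```bash
# Route a changed path to its test stream, mirroring the table above.
# In a bash case statement, * matches across / so backend/*.py covers nested files.
select_runner() {
  case "$1" in
    backend/*.py)              echo 'pytest --testmon -x -q -m "not slow and not external"' ;;
    frontend/packages/*/src/*) echo 'bun x vitest run --changed <since> (in package dir)' ;;
    frontend/src/*)            echo 'bun x vitest run --changed <since>' ;;
    .woodpecker/*|scripts/*)   echo 'syntax validation (yq / shellcheck)' ;;
    *)                         echo 'no runner mapped' ;;
  esac
}
```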
### Budget enforcement
- Total wall-clock budget: **30s** (configurable via `--budget`)
- On timeout: SIGTERM the runner, print `Unrun N tests — covered in CI`
- Exit code: `0` green / `1` red / `2` budget exceeded (advisory)
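The budget can be enforced with coreutils `timeout`, which exits 124 when the command is killed on expiry. A sketch mapping that onto this doc's 0/1/2 exit-code semantics:

```bash
# Run a test command under a wall-clock budget (seconds).
# Exit codes mirror this doc: 0 green, 1 red, 2 budget exceeded (advisory).
run_with_budget() {
  local budget=$1; shift
  timeout --signal=TERM "$budget" "$@"
  local rc=$?
  if (( rc == 124 )); then                  # 124 = coreutils timeout expired
    echo 'Unrun N tests — covered in CI'    # N = count left unrun (computed by the real runner)
    return 2
  fi
  return $rc
}
```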
### pytest-testmon fallback chain (TIA — Test Impact Analysis)
1. If `backend/.testmondata` exists → use `--testmon` (SQLite TIA cache)
2. Else, if a pytest last-failed cache exists → fall back to `--lf` (last-failed)
3. Else → select by filepath map (`backend/app/X.py → backend/tests/.../test_X.py`)
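The chain can be sketched as a mode selector. One way to make step 2 concrete (an assumption, not stated above) is to check pytest's own last-failed record at `.pytest_cache/v/cache/lastfailed`.

```bash
# Pick the pytest selection mode per the fallback chain above.
select_pytest_mode() {
  local repo=$1
  if [ -f "$repo/backend/.testmondata" ]; then
    echo "--testmon"      # TIA cache present
  elif [ -f "$repo/backend/.pytest_cache/v/cache/lastfailed" ]; then
    echo "--lf"           # pytest last-failed record present
  else
    echo "filepath-map"   # fall back to path-based test selection
  fi
}
```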
### Test cache pluggable backends (CATO Rule 9)
The `.testmondata` SQLite blob can be shared across CI agents via:
1. **Valkey** (preferred, ~5ms latency) — `VALKEY_URL` env var
2. **Forgejo generic package registry** (portable) — `FORGEJO_API` + `FORGEJO_TOKEN`
3. **Filesystem / NFS mount** (simplest) — `CATO_TESTMON_FS_DIR`
Key schema: `cato:testmondata:{branch}:py{major.minor}`, TTL 14 days.
SQLite magic-byte validation rejects corrupt blobs (cache poisoning guard).
Missing config → cache is skipped (cold seed), never fails CI.
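The magic-byte guard can be sketched in a few lines: every SQLite database file begins with the 16-byte header `SQLite format 3\0`, so a fetched blob that lacks it is rejected before use.

```bash
# Cache-poisoning guard: accept a .testmondata blob only if it carries
# the SQLite file header (first 15 printable bytes of "SQLite format 3\0").
valid_testmondata() {
  [ "$(head -c 15 "$1" 2>/dev/null)" = "SQLite format 3" ]
}
```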
### G1 pre-push gate integration
Hook: `${CLAUDE_PLUGIN_ROOT}/hooks/pre-push-affected`
- PreToolUse on `Bash(git push …)` — runs the sub-mode advisory v1
- Logs results to `~/.claude/ci-health.jsonl` (does NOT block push by default)
- Flip to blocking via env `ATLAS_G1_BLOCKING=true` (Phase 5 hazy-mapping-stallman)
- Opt-out: `ATLAS_SKIP_G1=1` one-shot bypass
### vitest --changed pattern
Frontend stream uses Vitest's native `--changed <since>` flag — no custom
diff parsing. Vitest computes the dependency graph and runs only tests whose
imports transitively touch a changed file. `--passWithNoTests` ensures clean
exit when no frontend changes.
### References
- pytest-testmon: https://pypi.org/project/pytest-testmon/
- vitest --changed: https://vitest.dev/guide/cli#changed
- CATO orchestration: `.claude/rules/cato-orchestration.md` (Synapse repo)
- Plan parent (G1 doctrine): `.blueprint/plans/hazy-mapping-stallman.md` Section F / T3.2
---
## Verify-App Orchestrator (SOTA 2026 — sp6)
> Entry point: `commands/verify-app.md` (`/atlas verify-app`)
> Sub-plan: `.blueprint/plans/sp6-verify-app-sota-2026.md`
> Iron Law: `LAW-CI-VERIFY-001` — evidence before assertions on ALL gates.
The `/verify-app` command wraps L1-L6 with SOTA 2026 gates. Use this section when
orchestrating the full pipeline rather than individual levels.
### Execution Order
```
1. Env health checks (pre-flight) [always]
2. L1 Backend + L2 Frontend + L6 Perf [parallel, always]
3. L3 E2E → L4 Persona → L5 Security [sequential, always]
[if --depth=basic → STOP + report]
4. Judge gate (atlas-eval) [full + ultrathink]
5. Property gate (pbt-generator) [full + ultrathink]
6. Mutation gate (emit config) [full + ultrathink]
7. 9-Pillar readiness (agent-readiness) [full + ultrathink]
[if --depth=ultrathink → Phase 6 Opus synthesis]
8. Final aggregated VERIFICATION REPORT [always]
```
### Depth Modes
| `--depth` | Gates | Model | Time |
|-----------|-------|-------|------|
| `basic` | L1-L6 only | Sonnet | < 5 min |
| `full` (default) | L1-L6 + judge + property + mutation + 9-Pillar | Sonnet | 5-15 min |
| `ultrathink` | full + Opus Phase 6 risk synthesis | Opus | 15-30 min |
### HITL Gates
- All gates pass → AskUserQuestion: "All gates pass. Ready to commit/ship?"
- Judge score 75-84 (borderline) → AskUserQuestion with 3 options
- Any L1-L6 hard fail → AskUserQuestion with fix/skip/abort
---
## Multi-Gate Composition (SOTA 2026)
L1-L6 are necessary but not sufficient for SOTA 2026 ship-confidence. The three
additional gates address gap classes that L1-L6 cannot catch:
| Gate class | Gap it closes | Skill |
|------------|---------------|-------|
| **LLM Judge** | Semantic drift, spec mismatch, output quality regression | `atlas-eval` |
| **Property / Mutation** | Logic edge cases, redundant tests (passing despite bugs) | `pbt-generator` + mutmut/Stryker |
| **9-Pillar readiness** | Agent-workflow support degradation (docs, CI, observability) | `agent-readiness` |
**Principle**: No single gate type is sufficient. Require ≥ 3 gate types for `--depth full`.
### Gate Invocation Pattern
```bash
# Gate 4: Judge
atlas eval skill-regression verify-app # uses evals/skills/verify-app/golden.jsonl
# OR for per-diff judging:
# invoke atlas-eval directly with diff context as input
# Gate 5: Property
atlas pbt run --target <changed-pure-functions> --no-judge
# Gate 6: Mutation (doc-only v1 — emit config, human runs)
# See: skills/verification/references/mutation-config.md
# Gate 7: 9-Pillar
bash scripts/agent-readiness-check.sh --json | \
python3 -c "import json,sys; d=json.load(sys.stdin); ok=sum(1 for v in d['pillars'].values() if v>=3); print(f'{ok}/9 at level>=3 — {\"PASS\" if ok>=7 else \"FAIL\"}')"
```
---
## Judge Gate Thresholds (SOTA 2026)
Reference doc: `skills/verification/references/llm-judge-gates.md`
| Score | Action |
|-------|--------|
| ≥ 85/100 | **PASS** — proceed to next gate |
| 75–84/100 | **BORDERLINE** — AskUserQuestion (a) accept (b) investigate (c) abort |
| < 75/100 | **HARD FAIL** — same as L1-L6 failure, block ship |
**Why 0.85 (not 0.80)?**
The atlas-eval v1 PASS threshold is 0.80 for general skills. For `/verify-app` specifically —
the gate that guards ALL future ships — we set 0.85 as the floor. sp7 may raise to 0.90 after
3 sprints of baseline data. Decision logged in sp6 sub-plan.
**Judge model**: Sonnet 4.6 (cost-sensitive for bulk eval). Opus 4.7 for ultrathink synthesis.
**Regression alert**: if score drops > 10 points vs prior 7-day median → flag as regression,
require human review before proceeding.
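The threshold table above maps onto a three-way gate; a minimal sketch (score is the judge's 0-100 output):

```bash
# Map a judge score to the gate action, per the thresholds above.
judge_gate() {
  local score=$1
  if   (( score >= 85 )); then echo "PASS"
  elif (( score >= 75 )); then echo "BORDERLINE"   # triggers AskUserQuestion
  else                         echo "HARD_FAIL"    # blocks ship, same as L1-L6
  fi
}
```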
---
## Mutation Kill-Rate Gate (SOTA 2026)
Reference doc: `skills/verification/references/mutation-config.md`
**Kill-rate formula**: `killed_mutants / (total - timeout_mutants - skipped_mutants)`
| Kill-rate | Status | v1 behavior |
|-----------|--------|-------------|
| ≥ 60% | PASS | Advisory PASS |
| 40–59% | WARNING | Advisory WARN, proceed |
| < 40% | FAIL | Advisory FAIL, surface to human |
**Phase 2**: auto-blocking gate (mise.toml entry for mutmut/Stryker).
**Constraint**: mutmut (Python) + Stryker (TypeScript) = doc-only, no auto-install in v1.
Each project team installs via their own dependency manager (separate concern from atlas-plugin).
**Tools**:
- Python: `mutmut run --paths-to-mutate backend/app/services/`
- TypeScript: `npx stryker run` (Stryker config: `stryker.conf.json`)
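The kill-rate formula and advisory statuses above can be sketched together; inputs are the raw mutant counts a mutmut/Stryker run reports.

```bash
# Kill-rate = killed / (total - timeouts - skipped), mapped to advisory status.
mutation_status() {
  local killed=$1 total=$2 timeouts=$3 skipped=$4
  local denom=$(( total - timeouts - skipped ))
  (( denom > 0 )) || { echo "SKIP"; return; }      # nothing countable survived filtering
  local rate=$(( 100 * killed / denom ))
  if   (( rate >= 60 )); then echo "PASS ($rate%)"
  elif (( rate >= 40 )); then echo "WARNING ($rate%)"
  else                        echo "FAIL ($rate%)"
  fi
}
```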
---
## 9-Pillar Readiness Gate (SOTA 2026)
Reference doc: `skills/verification/references/factory-9-pillar.md`
**Pass condition**: ≥ 7 of 9 pillars scored at level ≥ 3 (Standardized) by the
`agent-readiness-check.sh` script.
| Pillar | Level ≥ 3 required for pass? |
|--------|------------------------------|
| P1 Style & Validation | Yes |
| P2 Build System | Yes |
| P3 Testing | Yes |
| P4 Documentation | Yes |
| P5 Dev Environment | Yes |
| P6 Code Quality | Yes |
| P7 Observability | Optional (atlas-plugin is a plugin, not a server) |
| P8 Security & Governance | Yes |
| P9 Task Discovery | Yes |
P7 is optional because atlas-plugin is a skills/commands package, not a runtime server —
it has no structlog/OTel/Sentry wiring by design. Gate counts P7 pass if level ≥ 1.
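The pass condition can be sketched as a counter over the nine pillar levels, with the P7 exemption applied:

```bash
# Gate: >=7 of 9 pillars at level >=3, except P7 which passes at level >=1.
# Args: nine integer levels, in P1..P9 order.
pillar_gate() {
  local i=0 ok=0 level
  for level in "$@"; do
    i=$((i + 1))
    if [ "$i" -eq 7 ]; then
      (( level >= 1 )) && ok=$((ok + 1))   # P7 plugin exemption
    else
      (( level >= 3 )) && ok=$((ok + 1))
    fi
  done
  (( ok >= 7 )) && echo "PASS ($ok/9)" || echo "FAIL ($ok/9)"
}
```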
**Output line in VERIFICATION REPORT**:
```
9-Pillar: ✅ {n}/9 at level ≥ 3 (score {total}/45) — {PASS|FAIL}
```