---
name: skill-regression-test
description: "Skill regression testing via golden datasets + LLM-as-judge. Auto-runs on SkillVersionBump event. Use when 'regression test', 'verify skill quality', 'detect skill drift'."
mode: [coding, engineering]
effort: medium
version: 1.0.0
tier: [admin]
---
# skill-regression-test — Auto Regression Gate on Skill Version Bump
> Treat skill version bumps like production deploys: golden dataset must still pass.
> Reuses the **W1.5 atlas-eval** harness (no duplication) — same `golden.jsonl` schema,
> same `evals/run.sh` runner, same Sonnet 4.6 judge.
> Trigger: **SkillVersionBump** custom event (NEW — emitted by `skill-management`).
## When to Use
- User says: "regression test", "verify skill quality", "detect skill drift", "score this skill"
- Automatically: when `skill-management` bumps a skill's `version:` field, the
`regression-test-runner` hook fires this skill against the affected golden dataset
- Before promoting a skill version (block promotion if score drop > 5pts)
- CI: scheduled re-baseline (canary loop) on top-N mature skills
## Reuse Map (W1.5 atlas-eval)
| Concern | Source (atlas-eval) | What this skill adds |
|---------|---------------------|----------------------|
| Golden dataset schema | `evals/skills/.template/golden.jsonl` | nothing — same schema |
| Eval runner | `evals/run.sh <skill>` | invoked verbatim |
| Judge model | Sonnet 4.6 medium-effort | inherited |
| Pass threshold | `≥ 80` weighted score | inherited (`PASS_THRESHOLD`) |
| Result storage | `evals/results/{skill}/{date}.jsonl` | inherited |
| **Baseline score store** | — | **NEW**: `~/.atlas/skill-baseline-scores/{skill}.json` |
| **Version-bump trigger** | — | **NEW**: `SkillVersionBump` custom event |
| **Promotion gate** | — | **NEW**: exit 1 + JSON alert on regression |
## Custom Event: SkillVersionBump
**Emitter**: `skill-management` skill — when it patches a `version:` field in
any `skills/*/SKILL.md`, it emits one JSON line to stdin of registered hooks:
```json
{
  "event": "SkillVersionBump",
  "skill": "code-review",
  "old_version": "1.0.0",
  "new_version": "1.0.1",
  "timestamp": "2026-05-01T18:00:00Z",
  "trigger": "manual"
}
```
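On the emitter side, broadcasting this event could look like the sketch below. This is illustrative, not the confirmed `skill-management` implementation: the `hooks/` registry directory and the printf-built JSON line are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical emitter inside skill-management (sketch only):
# build the event line and pipe it to the stdin of every
# executable registered under hooks/.
emit_skill_version_bump() {
  local skill="$1" old="$2" new="$3" hook event
  event=$(printf '{"event":"SkillVersionBump","skill":"%s","old_version":"%s","new_version":"%s","timestamp":"%s","trigger":"manual"}' \
    "$skill" "$old" "$new" "$(date -u +%Y-%m-%dT%H:%M:%SZ)")
  for hook in hooks/*; do
    # One JSON line per hook, as the contract below specifies.
    [ -x "$hook" ] && printf '%s\n' "$event" | "$hook"
  done
}
```

Each subscriber then reads exactly one JSON line from stdin, which keeps the emitter/subscriber contract trivially testable.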
**Subscriber**: `hooks/regression-test-runner` (this skill's bash hook).
**Contract**:
- Hook reads stdin → parses event
- If `event != "SkillVersionBump"` → exit 0 (no-op)
- If golden dataset missing → exit 0 + warn (skill not yet covered)
- Else → invoke `atlas eval skill-regression <skill>` (which calls `bash evals/run.sh <skill>`)
- Compare new score vs `~/.atlas/skill-baseline-scores/{skill}.json`
- If `regression > 5pts` → emit JSON alert + exit 1 (block version promotion)
## Coverage Targets
| Tier | Skill count | Goal | Enforcement |
|------|------------:|------|-------------|
| **Top 10 mature skills** | 10 | **100%** golden coverage | Block version bump if missing |
| **Top 30 by usage** | 30 | **50%** golden coverage | Warn on version bump |
| **All admin-tier skills** | ~80 | best-effort | Advisory only |
Top-N source: `~/.atlas/skill-usage.jsonl` aggregated by `flow-analytics`.
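The top-N ranking can be derived from that usage log with plain shell. A sketch, assuming each JSONL line carries a `"skill"` field — the actual schema is owned by `flow-analytics` and may differ:

```shell
# Print the N most-used skills from a skill-usage JSONL log.
# Assumes one JSON object per line with a "skill" string field.
top_skills() {
  local n="${1:-10}" log="${2:-$HOME/.atlas/skill-usage.jsonl}"
  grep -o '"skill":"[^"]*"' "$log" \
    | cut -d'"' -f4 \
    | sort | uniq -c | sort -rn \
    | awk -v n="$n" 'NR<=n {print $2}'
}
```

Usage: `top_skills 10` yields the list the 100%-coverage gate would be enforced against.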
## Workflow
```
┌─────────────────────────────────────────────────────────────┐
│ skill-management bumps `version:` in skills/X/SKILL.md │
│ → emits SkillVersionBump event to registered hooks │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ hooks/regression-test-runner (this skill's hook) │
│ 1. parse event → skill name + new/old version │
│ 2. check evals/skills/{skill}/golden.jsonl exists │
│ 3. bash evals/run.sh <skill> │
│ 4. parse new_score from evals/results/{skill}/{date}.jsonl │
│ 5. read baseline_score from ~/.atlas/skill-baseline-... │
│ 6. if (baseline - new_score) > 5 → REGRESSION │
│ emit JSON alert + exit 1 (blocks promotion) │
│ 7. else → write new_score as new baseline + exit 0 │
└─────────────────────────────────────────────────────────────┘
```
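The seven steps above can be sketched as one bash function. This is illustrative only: the `"score"` field names in the results and baseline files are assumptions about the atlas-eval output schema, and the `BASELINE_DIR` override is added here purely for testability.

```shell
#!/usr/bin/env bash
# Sketch of hooks/regression-test-runner (steps 1-7 above).
regression_test_runner() {
  local event skill golden result new_score baseline baseline_file
  event=$(cat)                                           # step 1: one JSON line on stdin
  case "$event" in *'"SkillVersionBump"'*) ;; *) return 0 ;; esac
  skill=$(printf '%s' "$event" | grep -o '"skill":"[^"]*"' | cut -d'"' -f4)
  golden="evals/skills/$skill/golden.jsonl"              # step 2: golden must exist
  if [ ! -f "$golden" ]; then
    echo "warn: no golden dataset for $skill" >&2; return 0
  fi
  bash evals/run.sh "$skill"                             # step 3: reuse atlas-eval verbatim
  result="evals/results/$skill/$(date +%F).jsonl"
  new_score=$(grep -o '"score":[0-9]*' "$result" | tail -1 | cut -d: -f2)   # step 4
  baseline_file="${BASELINE_DIR:-$HOME/.atlas/skill-baseline-scores}/$skill.json"
  if [ ! -f "$baseline_file" ]; then                     # safe default: seed + PASS
    mkdir -p "$(dirname "$baseline_file")"
    printf '{"score":%s}\n' "$new_score" > "$baseline_file"; return 0
  fi
  baseline=$(grep -o '"score":[0-9]*' "$baseline_file" | cut -d: -f2)       # step 5
  if [ $((baseline - new_score)) -gt 5 ]; then           # step 6: regression gate
    printf '{"verdict":"REGRESSION","skill":"%s","baseline":%s,"new_score":%s}\n' \
      "$skill" "$baseline" "$new_score"
    return 1
  fi
  printf '{"score":%s}\n' "$new_score" > "$baseline_file"  # step 7: roll baseline forward
}
```

Note the asymmetry: the baseline file only advances on PASS, so a regressed run leaves the last good score in place for the rerun.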
## CLI
```bash
# Manual run (no event needed)
atlas regression-test code-review
# With explicit baseline pin
atlas regression-test code-review --baseline 1.0.0
# CI mode (machine-readable JSON output, exit code = signal)
atlas regression-test code-review --ci
# Direct hook invocation (test harness)
echo '{"event":"SkillVersionBump","skill":"code-review","new_version":"1.0.1","old_version":"1.0.0"}' \
| bash hooks/regression-test-runner
```
## Output (human mode)
```
============================================================
skill-regression-test — code-review
============================================================
baseline  : 87 (pinned to v1.0.0, captured 2026-04-15)
new score : 81 (v1.0.1, just now)
delta     : -6 pts
threshold : -5 pts (block when delta < -5)
verdict   : REGRESSION (delta below threshold)
regressed : ["pr-feedback-edge-1", "security-finding-format-2"]
============================================================
PROMOTION BLOCKED: rerun after fix or override with --force
```
## Output (CI / `--ci` mode)
Single JSON line on stdout:
```json
{"verdict":"REGRESSION","skill":"code-review","baseline":87,"new_score":82,"delta":-5,"regressed_ids":["pr-feedback-edge-1","security-finding-format-2"],"results_file":"evals/results/code-review/2026-05-01.jsonl"}
```
Exit codes:
- `0` — PASS (delta within tolerance, baseline updated)
- `1` — REGRESSION (delta exceeds tolerance, baseline NOT updated)
- `2` — ERROR (golden missing, runner crash, etc.)
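A CI step can drive the gate from these exit codes alone. The wrapper below is an illustrative sketch; the `atlas` invocation it would wrap is the one from the CLI section above:

```shell
# Map the documented exit codes (0/1/2) to CI log lines while
# preserving the code as the pipeline signal.
gate() {
  "$@"
  local rc=$?
  case "$rc" in
    0) echo "PASS: baseline updated" ;;
    1) echo "REGRESSION: blocking promotion" ;;
    2) echo "ERROR: eval harness failed" ;;
  esac
  return "$rc"
}
# Usage in CI: gate atlas regression-test code-review --ci
```

Because `gate` re-returns the original code, it composes with any CI system that fails the job on nonzero exit.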
## Constraints
- ❌ Does NOT duplicate atlas-eval logic — only orchestrates `evals/run.sh` + diffs scores
- ❌ Does NOT write to `evals/results/` — that's atlas-eval's responsibility
- ✅ Owns `~/.atlas/skill-baseline-scores/` — JSON files, one per skill
- ✅ Idempotent: re-running with same versions produces same verdict
- ✅ Safe default: missing baseline → seed with current score + PASS
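A plausible shape for one baseline file — only `score` is required by the gate; the other fields echo what the human-mode output reports (pinned version, capture date) and are an assumption, not a fixed schema:

```json
{
  "skill": "code-review",
  "score": 87,
  "version": "1.0.0",
  "captured": "2026-04-15T18:00:00Z",
  "results_file": "evals/results/code-review/2026-04-15.jsonl"
}
```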
## Dogfooding
This skill eats its own dog food: `evals/skills/atlas-eval/golden.jsonl` evaluates
the eval harness itself (5 meta test cases: schema validation, judge prompt
rendering, scoring math, regression detection, dry-run safety). When a future
session bumps `atlas-eval` version, this skill auto-runs against those 5 cases
and blocks the bump if the harness regresses.
## Forward Compatibility
- **W3.2 canary**: same baseline-store reused; canary writes `~/.atlas/skill-baseline-scores/{skill}.canary.json`
- **W4 multi-judge**: regression detector adopts vote-median once `--multi-judge` lands
- **GitHub PR comments**: future hook can post `--ci` JSON to PR review (not in scope v1)
## References
- W1.5 atlas-eval skill: `skills/atlas-eval/SKILL.md`
- atlas-eval runner: `evals/run.sh`
- Golden dataset template: `evals/skills/.template/golden.jsonl`
- Bootstrap meta-test: `evals/skills/atlas-eval/golden.jsonl` (this PR)
- Hook: `hooks/regression-test-runner`
- Baseline store: `~/.atlas/skill-baseline-scores/{skill}.json`
- Plan SSoT: `.blueprint/plans/ultrathink-regarde-ce-qui-abundant-petal.md` Section H W3.1