---
name: skill-regression-test
description: "Skill regression testing via golden datasets + LLM-as-judge. Auto-runs on SkillVersionBump event. Use when 'regression test', 'verify skill quality', 'detect skill drift'."
mode: [coding, engineering]
effort: medium
version: 1.0.0
tier: [admin]
---
# skill-regression-test — Auto Regression Gate on Skill Version Bump
> Treat skill version bumps like production deploys: golden dataset must still pass.
> Reuses the **W1.5 atlas-eval** harness (no duplication) — same `golden.jsonl` schema,
> same `evals/run.sh` runner, same Sonnet 4.6 judge.
> Trigger: **SkillVersionBump** custom event (NEW — emitted by `skill-management`).
## When to Use
- User says: "regression test", "verify skill quality", "detect skill drift", "score this skill"
- Automatically: when `skill-management` bumps a skill's `version:` field, the
`regression-test-runner` hook fires this skill against the affected golden dataset
- Before promoting a skill version (block promotion if score drop > 5pts)
- CI: scheduled re-baseline (canary loop) on top-N mature skills
## Reuse Map (W1.5 atlas-eval)
| Concern | Source (atlas-eval) | What this skill adds |
|---------|---------------------|----------------------|
| Golden dataset schema | `evals/skills/.template/golden.jsonl` | nothing — same schema |
| Eval runner | `evals/run.sh <skill>` | invoked verbatim |
| Judge model | Sonnet 4.6 medium-effort | inherited |
| Pass threshold | `≥ 80` weighted score | inherited (`PASS_THRESHOLD`) |
| Result storage | `evals/results/{skill}/{date}.jsonl` | inherited |
| **Baseline score store** | — | **NEW**: `~/.atlas/skill-baseline-scores/{skill}.json` |
| **Version-bump trigger** | — | **NEW**: `SkillVersionBump` custom event |
| **Promotion gate** | — | **NEW**: exit 1 + JSON alert on regression |
## Custom Event: SkillVersionBump
**Emitter**: `skill-management` skill — when it patches a `version:` field in
any `skills/*/SKILL.md`, it emits one JSON line to stdin of registered hooks:
```json
{
  "event": "SkillVersionBump",
  "skill": "code-review",
  "old_version": "1.0.0",
  "new_version": "1.0.1",
  "timestamp": "2026-05-01T18:00:00Z",
  "trigger": "manual"
}
```
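On the emitter side, broadcasting this event could look like the sketch below. This is illustrative, not the confirmed `skill-management` implementation: the `hooks/` registry directory and the printf-built JSON line are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical emitter inside skill-management (sketch only):
# build the event line and pipe it to the stdin of every
# executable registered under hooks/.
emit_skill_version_bump() {
  local skill="$1" old="$2" new="$3" hook event
  event=$(printf '{"event":"SkillVersionBump","skill":"%s","old_version":"%s","new_version":"%s","timestamp":"%s","trigger":"manual"}' \
    "$skill" "$old" "$new" "$(date -u +%Y-%m-%dT%H:%M:%SZ)")
  for hook in hooks/*; do
    # One JSON line per hook, as the contract below specifies.
    [ -x "$hook" ] && printf '%s\n' "$event" | "$hook"
  done
}
```

Each subscriber then reads exactly one JSON line from stdin, which keeps the emitter/subscriber contract trivially testable.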
**Subscriber**: `hooks/regression-test-runner` (this skill's bash hook).
**Contract**:
- Hook reads stdin → parses event
- If `event != "SkillVersionBump"` → exit 0 (no-op)
- If golden dataset missing → exit 0 + warn (skill not yet covered)
- Else → invoke `atlas eval skill-regression <skill>` (which calls `bash evals/run.sh <skill>`)
- Compare new score vs `~/.atlas/skill-baseline-scores/{skill}.json`
- If `regression > 5pts` → emit JSON alert + exit 1 (block version promotion)
## Coverage Targets
| Tier | Skill count | Goal | Enforcement |
|------|------------:|------|-------------|
| **Top 10 mature skills** | 10 | **100%** golden coverage | Block version bump if missing |
| **Top 30 by usage** | 30 | **50%** golden coverage | Warn on version bump |
| **All admin-tier skills** | ~80 | best-effort | Advisory only |
Top-N source: `~/.atlas/skill-usage.jsonl` aggregated by `flow-analytics`.
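The top-N ranking can be derived from that usage log with plain shell. A sketch, assuming each JSONL line carries a `"skill"` field — the actual schema is owned by `flow-analytics` and may differ:

```shell
# Print the N most-used skills from a skill-usage JSONL log.
# Assumes one JSON object per line with a "skill" string field.
top_skills() {
  local n="${1:-10}" log="${2:-$HOME/.atlas/skill-usage.jsonl}"
  grep -o '"skill":"[^"]*"' "$log" \
    | cut -d'"' -f4 \
    | sort | uniq -c | sort -rn \
    | awk -v n="$n" 'NR<=n {print $2}'
}
```

Usage: `top_skills 10` yields the list the 100%-coverage gate would be enforced against.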
## Workflow
```
┌─────────────────────────────────────────────────────────────┐
│ skill-management bumps `version:` in skills/X/SKILL.md │
│ → emits SkillVersionBump event to registered hooks │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ hooks/regression-test-runner (this skill's hook) │
│ 1. parse event → skill name + new/old version │
│ 2. check evals/skills/{skill}/golden.jsonl exists │
│ 3. bash evals/run.sh <skill> │
│ 4. parse new_score from evals/results/{skill}/{date}.jsonl │
│ 5. read baseline_score from ~/.atlas/skill-baseline-... │
│ 6. if (baseline - new_score) > 5 → REGRESSION │
│ emit JSON alert + exit 1 (blocks promotion) │
│ 7. else → write new_score as new baseline + exit 0 │
└─────────────────────────────────────────────────────────────┘
```
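The seven steps above can be sketched as one bash function. This is illustrative only: the `"score"` field names in the results and baseline files are assumptions about the atlas-eval output schema, and the `BASELINE_DIR` override is added here purely for testability.

```shell
#!/usr/bin/env bash
# Sketch of hooks/regression-test-runner (steps 1-7 above).
regression_test_runner() {
  local event skill golden result new_score baseline baseline_file
  event=$(cat)                                           # step 1: one JSON line on stdin
  case "$event" in *'"SkillVersionBump"'*) ;; *) return 0 ;; esac
  skill=$(printf '%s' "$event" | grep -o '"skill":"[^"]*"' | cut -d'"' -f4)
  golden="evals/skills/$skill/golden.jsonl"              # step 2: golden must exist
  if [ ! -f "$golden" ]; then
    echo "warn: no golden dataset for $skill" >&2; return 0
  fi
  bash evals/run.sh "$skill"                             # step 3: reuse atlas-eval verbatim
  result="evals/results/$skill/$(date +%F).jsonl"
  new_score=$(grep -o '"score":[0-9]*' "$result" | tail -1 | cut -d: -f2)   # step 4
  baseline_file="${BASELINE_DIR:-$HOME/.atlas/skill-baseline-scores}/$skill.json"
  if [ ! -f "$baseline_file" ]; then                     # safe default: seed + PASS
    mkdir -p "$(dirname "$baseline_file")"
    printf '{"score":%s}\n' "$new_score" > "$baseline_file"; return 0
  fi
  baseline=$(grep -o '"score":[0-9]*' "$baseline_file" | cut -d: -f2)       # step 5
  if [ $((baseline - new_score)) -gt 5 ]; then           # step 6: regression gate
    printf '{"verdict":"REGRESSION","skill":"%s","baseline":%s,"new_score":%s}\n' \
      "$skill" "$baseline" "$new_score"
    return 1
  fi
  printf '{"score":%s}\n' "$new_score" > "$baseline_file"  # step 7: roll baseline forward
}
```

Note the asymmetry: the baseline file only advances on PASS, so a regressed run leaves the last good score in place for the rerun.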
## CLI
```bash
# Manual run (no event needed)
atlas regression-test code-review
# With explicit baseline pin
atlas regression-test code-review --baseline 1.0.0
# CI mode (machine-readable JSON output, exit code = signal)
atlas regression-test code-review --ci
# Direct hook invocation (test harness)
echo '{"event":"SkillVersionBump","skill":"code-review","new_version":"1.0.1","old_version":"1.0.0"}' \
| bash hooks/regression-test-runner
```
## Output (human mode)
```
============================================================
skill-regression-test — code-review
============================================================
baseline  : 87 (pinned to v1.0.0, captured 2026-04-15)
new score : 81 (v1.0.1, just now)
delta     : -6 pts
threshold : -5 pts (block when delta < -5)
verdict   : REGRESSION (delta below threshold)
regressed : ["pr-feedback-edge-1", "security-finding-format-2"]
============================================================
PROMOTION BLOCKED: rerun after fix or override with --force
```
## Output (CI / `--ci` mode)
Single JSON line on stdout:
```json
{"verdict":"REGRESSION","skill":"code-review","baseline":87,"new_score":82,"delta":-5,"regressed_ids":["pr-feedback-edge-1","security-finding-format-2"],"results_file":"evals/results/code-review/2026-05-01.jsonl"}
```
Exit codes:
- `0` — PASS (delta within tolerance, baseline updated)
- `1` — REGRESSION (delta exceeds tolerance, baseline NOT updated)
- `2` — ERROR (golden missing, runner crash, etc.)
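A CI step can drive the gate from these exit codes alone. The wrapper below is an illustrative sketch; the `atlas` invocation it would wrap is the one from the CLI section above:

```shell
# Map the documented exit codes (0/1/2) to CI log lines while
# preserving the code as the pipeline signal.
gate() {
  "$@"
  local rc=$?
  case "$rc" in
    0) echo "PASS: baseline updated" ;;
    1) echo "REGRESSION: blocking promotion" ;;
    2) echo "ERROR: eval harness failed" ;;
  esac
  return "$rc"
}
# Usage in CI: gate atlas regression-test code-review --ci
```

Because `gate` re-returns the original code, it composes with any CI system that fails the job on nonzero exit.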
## Constraints
- ❌ Does NOT duplicate atlas-eval logic — only orchestrates `evals/run.sh` + diffs scores
- ❌ Does NOT write to `evals/results/` — that's atlas-eval's responsibility
- ✅ Owns `~/.atlas/skill-baseline-scores/` — JSON files, one per skill
- ✅ Idempotent: re-running with same versions produces same verdict
- ✅ Safe default: missing baseline → seed with current score + PASS
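A plausible shape for one baseline file — only `score` is required by the gate; the other fields echo what the human-mode output reports (pinned version, capture date) and are an assumption, not a fixed schema:

```json
{
  "skill": "code-review",
  "score": 87,
  "version": "1.0.0",
  "captured": "2026-04-15T18:00:00Z",
  "results_file": "evals/results/code-review/2026-04-15.jsonl"
}
```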
## Dogfooding
This skill eats its own dog food: `evals/skills/atlas-eval/golden.jsonl` evaluates
the eval harness itself (5 meta test cases: schema validation, judge prompt
rendering, scoring math, regression detection, dry-run safety). When a future
session bumps `atlas-eval` version, this skill auto-runs against those 5 cases
and blocks the bump if the harness regresses.
## Forward Compatibility
- **W3.2 canary**: same baseline-store reused; canary writes `~/.atlas/skill-baseline-scores/{skill}.canary.json`
- **W4 multi-judge**: regression detector adopts vote-median once `--multi-judge` lands
- **GitHub PR comments**: future hook can post `--ci` JSON to PR review (not in scope v1)
## References
- W1.5 atlas-eval skill: `skills/atlas-eval/SKILL.md`
- atlas-eval runner: `evals/run.sh`
- Golden dataset template: `evals/skills/.template/golden.jsonl`
- Bootstrap meta-test: `evals/skills/atlas-eval/golden.jsonl` (this PR)
- Hook: `hooks/regression-test-runner`
- Baseline store: `~/.atlas/skill-baseline-scores/{skill}.json`
- Plan SSoT: `.blueprint/plans/ultrathink-regarde-ce-qui-abundant-petal.md` Section H W3.1