---
name: benchmark-e2e
description: End-to-end benchmark suite for vercel-plugin. Runs realistic projects through skill injection, launches dev servers, verifies everything works, analyzes conversation logs, and produces an improvement report for overnight self-improvement loops.
---
# Benchmark E2E
Single-command pipeline that creates projects, exercises skill injection via `claude --print`, launches dev servers, verifies they work, analyzes conversation logs, and generates actionable improvement reports.
## Quick Start
```bash
# Full suite (9 projects, ~2-3 hours)
bun run scripts/benchmark-e2e.ts
# Quick mode (first 3 projects, ~30-45 min)
bun run scripts/benchmark-e2e.ts --quick
```
Options:
| Flag | Description | Default |
|------|-------------|---------|
| `--quick` | Run only first 3 projects | `false` |
| `--base <path>` | Override base directory | `~/dev/vercel-plugin-testing` |
| `--timeout <ms>` | Per-project timeout (forwarded to runner) | `900000` (15 min) |
## Pipeline Stages
The orchestrator chains four stages sequentially, aborting on failure:
1. **runner** — Creates test dirs, installs plugin, runs `claude --print` with `VERCEL_PLUGIN_LOG_LEVEL=trace`
2. **verify** — Detects the package manager, launches the dev server, and polls until it responds with HTTP 200 and non-empty HTML
3. **analyze** — Matches JSONL sessions to projects via `run-manifest.json`, extracts metrics
4. **report** — Generates `report.md` and `report.json` with scorecards and recommendations
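The stage chaining can be sketched in TypeScript. This is a hypothetical illustration of the abort-on-failure behavior, not the actual orchestrator code; the `Stage` type and `runPipeline` name are invented for the example.

```typescript
// Hypothetical sketch: run stages in order, abort on the first non-zero exit code.
type Stage = { name: string; run: () => Promise<number> };

async function runPipeline(stages: Stage[]): Promise<{ failedStage: string | null }> {
  for (const stage of stages) {
    const exitCode = await stage.run();
    if (exitCode !== 0) {
      // Abort: remaining stages are skipped, matching the pipeline's
      // "aborting on failure" behavior.
      return { failedStage: stage.name };
    }
  }
  return { failedStage: null };
}
```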
## Contracts
### `run-manifest.json`
Written by the runner at `<base>/results/run-manifest.json`. Links all downstream stages to the same run.
```typescript
interface BenchmarkRunManifest {
  runId: string;     // UUID for this pipeline run
  timestamp: string; // ISO 8601
  baseDir: string;   // Absolute path to base directory
  projects: Array<{
    slug: string;       // e.g. "01-recipe-platform"
    cwd: string;        // Absolute path to project dir
    promptHash: string; // SHA hash of the prompt text
    expectedSkills: string[];
  }>;
}
```
The analyzer and verifier read this manifest to correlate sessions precisely instead of guessing from directory listings.
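One plausible way to do that correlation, sketched here under the assumption that each session records its working directory: match sessions to manifest entries by exact `cwd` comparison. The `matchSessions` helper is illustrative, not part of the actual analyzer.

```typescript
// Hypothetical sketch: match JSONL sessions to manifest projects by cwd.
interface ManifestProject {
  slug: string;
  cwd: string;
  promptHash: string;
  expectedSkills: string[];
}

function matchSessions(
  projects: ManifestProject[],
  sessionCwds: string[],
): Map<string, string> {
  const bySlug = new Map<string, string>();
  for (const p of projects) {
    // Exact comparison avoids false matches from directory-listing heuristics.
    const match = sessionCwds.find((cwd) => cwd === p.cwd);
    if (match) bySlug.set(p.slug, match);
  }
  return bySlug;
}
```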
### `events.jsonl`
The orchestrator writes NDJSON events to `<base>/results/events.jsonl` tracking pipeline lifecycle:
```jsonc
// Each line is one JSON object:
{ "stage": "pipeline", "event": "start", "timestamp": "...", "data": { "baseDir": "...", "quick": false } }
{ "stage": "runner", "event": "start", "timestamp": "...", "data": { "script": "...", "args": [...] } }
{ "stage": "runner", "event": "complete", "timestamp": "...", "data": { "exitCode": 0, "durationMs": 120000 } }
// On failure:
{ "stage": "verify", "event": "error", "timestamp": "...", "data": { "exitCode": 1, "durationMs": 5000, "slug": "04-conference-tickets" } }
{ "stage": "pipeline", "event": "abort", "timestamp": "...", "data": { "failedStage": "verify", "exitCode": 1, "slug": "04-conference-tickets" } }
```
### `report.json`
Machine-readable report at `<base>/results/report.json` for programmatic consumption:
```typescript
interface ReportJson {
  runId: string | null;
  timestamp: string;
  verdict: "pass" | "partial" | "fail";
  gaps: Array<{
    slug: string;
    expected: string[];
    actual: string[];
    missing: string[];
  }>;
  recommendations: string[];
  suggestedPatterns: Array<{
    skill: string; // Skill that was expected but not injected
    glob: string;  // Suggested pathPattern glob
    tool: string;  // Tool name that should trigger injection
  }>;
}
```
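One plausible rule for deriving `verdict` from `gaps` is sketched below; the actual rule is defined by the report stage, so treat this as an assumption about its semantics.

```typescript
// Hypothetical sketch: derive a verdict from the gaps array.
// "pass" = no project has missing skills, "fail" = every project does,
// "partial" = somewhere in between.
function summarizeVerdict(
  gaps: Array<{ missing: string[] }>,
): "pass" | "partial" | "fail" {
  const withMissing = gaps.filter((g) => g.missing.length > 0).length;
  if (withMissing === 0) return "pass";
  if (withMissing < gaps.length) return "partial";
  return "fail";
}
```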
## Overnight Automation Loop
Run the pipeline repeatedly with a cooldown between iterations:
```bash
while true; do
bun run scripts/benchmark-e2e.ts
sleep 3600
done
```
Each run produces timestamped `report.json` and `report.md` files. Compare across runs to track improvement.
## Self-Improvement Cycle
The pipeline enables a closed feedback loop:
1. **Run** — `bun run scripts/benchmark-e2e.ts` exercises the plugin against realistic projects
2. **Read gaps** — `report.json` lists which skills were expected but never injected, with exact slugs
3. **Apply fixes** — Use `suggestedPatterns` entries (copy-pasteable YAML) to add missing frontmatter patterns; use `recommendations` to fix hook logic
4. **Re-run** — Execute the pipeline again to verify the gaps are closed
5. **Compare** — Diff `report.json` across runs: `verdict` should trend from `"fail"` → `"partial"` → `"pass"`
For overnight automation, combine with the loop above. Wake up to reports showing exactly what improved and what still needs work.
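The cross-run diff in step 5 can be sketched as a small helper that lists which missing skills from an earlier `report.json` were injected in a later one. `closedGaps` is an illustrative name, not a script in the repo.

```typescript
// Hypothetical sketch: compare gaps across two runs and report closed gaps.
interface Gap {
  slug: string;
  missing: string[];
}

function closedGaps(before: Gap[], after: Gap[]): string[] {
  const afterMissing = new Map(
    after.map((g) => [g.slug, new Set(g.missing)] as [string, Set<string>]),
  );
  const closed: string[] = [];
  for (const g of before) {
    const still = afterMissing.get(g.slug) ?? new Set<string>();
    for (const skill of g.missing) {
      // A gap is "closed" if the skill was missing before but not after.
      if (!still.has(skill)) closed.push(`${g.slug}: ${skill}`);
    }
  }
  return closed;
}
```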
## Prompt Table
Prompts never name specific technologies — they describe the product and features, letting the plugin infer which skills to inject.
| # | Slug | Expected Skills |
|---|------|----------------|
| 01 | recipe-platform | auth, vercel-storage, nextjs |
| 02 | trivia-game | vercel-storage, nextjs |
| 03 | code-review-bot | ai-sdk, nextjs |
| 04 | conference-tickets | payments, email, auth |
| 05 | content-aggregator | cron-jobs, ai-sdk |
| 06 | finance-tracker | cron-jobs, email |
| 07 | multi-tenant-blog | routing-middleware, cms, auth |
| 08 | status-page | cron-jobs, vercel-storage, observability |
| 09 | dog-walking-saas | payments, auth, vercel-storage, env-vars |
## Cleanup
```bash
rm -rf ~/dev/vercel-plugin-testing
```