---
namespace: aiwg
name: eval-report
platforms: [all]
description: Generate an aggregate agent quality report from evaluation results, showing scores, regressions, and recommendations
---
# Evaluation Report
Generate a quality report from accumulated evaluation results.
## Research Foundation
- **REF-001**: BP-9 - Continuous evaluation of agent performance
- **REF-002**: KAMI benchmark methodology for real agentic task evaluation
## Usage
```bash
/eval-report
/eval-report --output .aiwg/reports/quality-report.md
/eval-report --compare previous-report.json
/eval-report --mode sdlc --format json
```
## Options
| Option | Default | Description |
|--------|---------|-------------|
| --output | stdout | Output file path |
| --compare | none | Previous report to diff against |
| --mode | all | Agent category: sdlc, marketing, forensics, all |
| --format | markdown | Output format: markdown, json |
| --since | none | Only include results after this date (ISO 8601) |
| --threshold | 0.85 | Score below this triggers a warning |
## Process
1. **Collect Results**: Read all `eval-*.json` files from `.aiwg/reports/`
2. **Aggregate Scores**: Compute per-agent and per-archetype scores
3. **Detect Regressions**: Compare against --compare baseline if provided
4. **Rank Agents**: Sort by overall score, flag below-threshold agents
5. **Build Recommendations**: Surface specific agents and archetypes needing attention
6. **Output Report**: Write markdown or JSON to --output or stdout
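The collection and aggregation steps above can be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation: the per-result field names (`agent`, `score`, `archetypes`) are assumptions, since this document does not specify the schema of individual `eval-*.json` files.

```python
import glob
import json
import statistics
from collections import defaultdict

def aggregate_results(report_dir=".aiwg/reports", threshold=0.85):
    """Aggregate per-agent scores and per-archetype pass rates
    from eval-*.json result files (assumed field names)."""
    per_agent = defaultdict(list)
    per_archetype = defaultdict(list)
    for path in glob.glob(f"{report_dir}/eval-*.json"):
        with open(path) as f:
            result = json.load(f)
        per_agent[result["agent"]].append(result["score"])
        # archetypes assumed to map archetype name -> pass/fail boolean
        for archetype, passed in result.get("archetypes", {}).items():
            per_archetype[archetype].append(1.0 if passed else 0.0)
    agent_scores = {a: statistics.mean(s) for a, s in per_agent.items()}
    archetype_rates = {a: statistics.mean(s) for a, s in per_archetype.items()}
    # Flag below-threshold agents, worst first
    flagged = sorted(
        (a for a, s in agent_scores.items() if s < threshold),
        key=lambda a: agent_scores[a],
    )
    return agent_scores, archetype_rates, flagged
```

The default threshold mirrors the `--threshold` default of 0.85 from the options table.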
## Report Sections
### Summary Dashboard
Overall health at a glance: total agents tested, aggregate score, and regression count.
### By Archetype
Pass rates per Roig (2025) failure archetype across all agents.
### Agents Needing Attention
Agents below the --threshold, with consecutive-failure streaks flagged.
### Regression Analysis
When --compare is provided: agents whose scores dropped since the baseline.
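Regression detection against the `--compare` baseline amounts to a score diff per agent. A sketch under stated assumptions: the `min_drop` noise floor is a hypothetical parameter, not something this skill specifies.

```python
def detect_regressions(current, baseline, min_drop=0.01):
    """Return agents whose score dropped from baseline to current.

    current, baseline: dicts mapping agent name -> score in [0, 1].
    min_drop: assumed noise floor below which a drop is ignored.
    """
    regressions = []
    for agent, score in current.items():
        prev = baseline.get(agent)  # agents new since the baseline are skipped
        if prev is not None and prev - score >= min_drop:
            regressions.append(
                {"agent": agent, "from": prev, "to": score,
                 "drop": round(prev - score, 4)}
            )
    # Largest drops first, so the report leads with the worst regressions
    return sorted(regressions, key=lambda r: r["drop"], reverse=True)
```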
### Recommendations
Prioritized action list: which agents to review, which archetypes to harden.
## Output Format (Markdown)
```markdown
# Agent Quality Report
**Generated**: 2026-04-01T10:30:00Z
**Agents Tested**: 58
**Overall Score**: 87%
**Regressions**: 2
## By Archetype
| Archetype | Pass Rate | Trend |
|-----------|-----------|-------|
| #1 Grounding | 92% | ↑ |
| #2 Substitution | 88% | → |
| #3 Distractor | 78% | ↓ |
| #4 Recovery | 90% | ↑ |
## Agents Needing Attention
| Agent | Score | Consecutive Failures | Issue |
|-------|-------|---------------------|-------|
| data-analyst | 72% | 3 | distractor-test |
| api-designer | 79% | 1 | latency regression (+40%) |
## Recommendations
1. Review `data-analyst` context filtering — failed distractor-test 3 consecutive runs
2. Investigate `api-designer` tool selection — latency regression
3. Increase distractor-test scenarios for marketing agents (78% pass rate below 80% target)
```
## Output Format (JSON)
```json
{
  "generated": "2026-04-01T10:30:00Z",
  "summary": {
    "agents_tested": 58,
    "overall_score": 0.87,
    "regressions": 2
  },
  "by_archetype": {
    "grounding": 0.92,
    "substitution": 0.88,
    "distractor": 0.78,
    "recovery": 0.90
  },
  "agents_needing_attention": [
    {"agent": "data-analyst", "score": 0.72, "consecutive_failures": 3, "issue": "distractor-test"}
  ],
  "recommendations": [
    "Review data-analyst context filtering"
  ]
}
```
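A CI pipeline consuming the JSON format above can gate on the `summary` fields. The field names follow the format shown; the gate thresholds and failure policy here are assumptions, not part of this skill.

```python
import json

def check_quality_gate(report_path, min_score=0.80, max_regressions=0):
    """Return a list of gate-failure messages for a quality report;
    an empty list means the gate passes."""
    with open(report_path) as f:
        report = json.load(f)
    summary = report["summary"]
    failures = []
    if summary["overall_score"] < min_score:
        failures.append(
            f"overall score {summary['overall_score']:.2f} < {min_score:.2f}"
        )
    if summary["regressions"] > max_regressions:
        failures.append(
            f"{summary['regressions']} regressions (max {max_regressions})"
        )
    return failures
```

In a pipeline step, a non-empty return would translate to a non-zero exit code after printing each failure.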
## Examples
```bash
# Standard report to stdout
/eval-report
# Save to file
/eval-report --output .aiwg/reports/quality-$(date +%Y%m%d).md
# Compare against baseline
/eval-report --compare .aiwg/reports/quality-20260301.json
# JSON for CI consumption
/eval-report --format json --threshold 0.80
# SDLC agents only
/eval-report --mode sdlc
```
## Related Commands
- `/eval-agent` - Test individual agents
- `/eval-workflow` - Test multi-agent workflows
- `aiwg lint agents` - Static validation
Generate evaluation report: $ARGUMENTS
## References
- @$AIWG_ROOT/agentic/code/addons/aiwg-evals/README.md — aiwg-evals addon overview
- @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/vague-discretion.md — Concrete threshold and scoring requirements
- @$AIWG_ROOT/agentic/code/frameworks/sdlc-complete/README.md — SDLC framework context for agent evaluation scope
- @$AIWG_ROOT/docs/cli-reference.md — CLI reference for evaluation-related commands