---
namespace: aiwg
name: eval-workflow
platforms: [all]
description: Run evaluation tests against a multi-agent workflow to assess orchestration quality and failure archetype resistance
---
# Workflow Evaluation
Run automated evaluation tests against a multi-agent workflow.
## Research Foundation
- **REF-001**: BP-9 - Continuous evaluation of agent performance
- **REF-002**: KAMI benchmark methodology for real agentic task evaluation
## Usage
```bash
/eval-workflow flow-security-review-cycle
/eval-workflow flow-inception-to-elaboration --scenario distractor-test
/eval-workflow flow-deploy-to-production --verbose --strict
```
## Arguments
| Argument | Required | Description |
|----------|----------|-------------|
| workflow-name | Yes | Workflow (flow command) to evaluate |
## Options
| Option | Default | Description |
|--------|---------|-------------|
| --scenario | all | Specific scenario to run |
| --verbose | false | Show detailed test output |
| --output | stdout | Output file for results |
| --strict | false | Fail on any test failure |
| --timeout | 300 | Maximum seconds per scenario |
## What Gets Evaluated
### Orchestration Quality
- **Agent coordination**: Parallel agents launched correctly in a single message
- **Handoff fidelity**: Artifacts pass correctly between phases
- **Gate enforcement**: Phase gates checked before transition
### Archetype Resistance
- `grounding-test` — Archetype 1: Premature action without reading state
- `distractor-test` — Archetype 3: Context pollution from irrelevant artifacts
- `recovery-test` — Archetype 4: Fragile execution when subagent fails
### Output Validation
- Required artifacts created in correct `.aiwg/` paths
- Document structure matches templates
- Traceability links intact
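The artifact-presence check above can be sketched in a few lines of Python; the artifact paths below are illustrative examples, not the real template set:

```python
import tempfile
from pathlib import Path

def validate_artifacts(root, expected):
    """Check that each required artifact exists under the project tree."""
    return [{"name": rel, "passed": (root / rel).is_file()} for rel in expected]

# Build a throwaway tree with one artifact present and one missing.
root = Path(tempfile.mkdtemp())
(root / ".aiwg").mkdir()
(root / ".aiwg" / "threat-model.md").write_text("# Threat Model\n")

checks = validate_artifacts(root, [".aiwg/threat-model.md", ".aiwg/missing.md"])
print(checks)
```

Structure and traceability checks follow the same shape: each produces one `{"name", "passed"}` entry per assertion.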
## Process
1. **Load Workflow**: Read flow command definition
2. **Select Scenarios**: Based on the `--scenario` flag, or all applicable scenarios if omitted
3. **Setup Workspace**: Create isolated `.aiwg/working/` test space
4. **Execute Flow**: Run workflow against each scenario
5. **Validate Outputs**: Check artifact presence, structure, and content
6. **Generate Report**: Output results with pass/fail per assertion
7. **Cleanup**: Remove test workspace
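The seven steps above can be sketched as a driver loop. Every function name here (`run_flow`, `validate`) is a hypothetical placeholder for the real flow runner, not the tool's actual API:

```python
import shutil
import tempfile
from pathlib import Path

def evaluate_workflow(workflow_name, scenarios, run_flow, validate):
    """Hypothetical driver: one isolated workspace per run,
    one result entry per scenario (steps 3-7)."""
    workspace = Path(tempfile.mkdtemp(prefix="aiwg-eval-"))  # step 3: isolated test space
    results = {}
    try:
        for name, scenario in scenarios.items():             # step 2: selected scenarios
            outputs = run_flow(workflow_name, scenario, workspace)  # step 4: execute flow
            assertions = validate(outputs)                   # step 5: validate outputs
            results[name] = {
                "passed": all(a["passed"] for a in assertions),
                "score": sum(a["passed"] for a in assertions) / max(len(assertions), 1),
                "assertions": assertions,
            }
    finally:
        shutil.rmtree(workspace, ignore_errors=True)         # step 7: cleanup
    return results                                           # step 6: report material

# Stub run to show the shape of the result:
res = evaluate_workflow(
    "flow-security-review-cycle",
    {"grounding-test": {}},
    run_flow=lambda wf, sc, ws: {},
    validate=lambda outputs: [{"name": "threat-model-created", "passed": True}],
)
print(res["grounding-test"]["passed"])
```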
## Output Format
```json
{
  "workflow": "flow-security-review-cycle",
  "timestamp": "2026-04-01T10:30:00Z",
  "scenarios": {
    "grounding-test": {
      "passed": true,
      "score": 1.0,
      "assertions": [
        {"name": "threat-model-created", "passed": true},
        {"name": "security-gate-run", "passed": true}
      ],
      "duration_ms": 45000
    },
    "distractor-test": {
      "passed": false,
      "score": 0.7,
      "assertions": [
        {"name": "correct-assets-only", "passed": false, "evidence": "Distractor file referenced in output"}
      ],
      "duration_ms": 38000
    }
  },
  "summary": {
    "passed": 1,
    "failed": 1,
    "total": 2,
    "score": 0.85
  }
}
```
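A report in this shape is easy to post-process. A minimal sketch that recomputes the summary block from the scenario entries (field names taken from the JSON format above; the sample data here is trimmed for brevity):

```python
sample = {
    "workflow": "flow-security-review-cycle",
    "scenarios": {
        "grounding-test": {"passed": True, "score": 1.0},
        "distractor-test": {"passed": False, "score": 0.7},
    },
}

def summarize(report):
    """Roll scenario results up into the summary block."""
    scenarios = report["scenarios"]
    passed = sum(1 for s in scenarios.values() if s["passed"])
    total = len(scenarios)
    score = round(sum(s["score"] for s in scenarios.values()) / total, 2)
    return {"passed": passed, "failed": total - passed, "total": total, "score": score}

print(summarize(sample))
```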
## Examples
```bash
# Full evaluation of a workflow
/eval-workflow flow-security-review-cycle
# Single scenario with verbose output
/eval-workflow flow-inception-to-elaboration --scenario grounding-test --verbose
# Strict mode with output saved
/eval-workflow flow-deploy-to-production --strict --output .aiwg/reports/deploy-eval.json
```
## Success Criteria
| Metric | Target |
|--------|--------|
| Artifact creation | 100% |
| Grounding compliance | >90% |
| Distractor resistance | >80% |
| Recovery success | ≥80% |
| Overall | ≥85% |
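These targets can be enforced mechanically. A sketch, treating every threshold as inclusive for simplicity (the metric keys are assumptions for illustration, not the tool's real field names):

```python
# Hypothetical threshold check mirroring the Success Criteria table.
TARGETS = {
    "artifact_creation": 1.00,  # 100%
    "grounding": 0.90,          # >90%
    "distractor": 0.80,         # >80%
    "recovery": 0.80,           # >=80%
    "overall": 0.85,            # >=85%
}

def meets_targets(metrics):
    """Return per-metric pass/fail against the targets above."""
    return {name: metrics.get(name, 0.0) >= target for name, target in TARGETS.items()}

result = meets_targets({"artifact_creation": 1.0, "grounding": 0.95,
                        "distractor": 0.70, "recovery": 0.80, "overall": 0.86})
print(result)
```

A `--strict` run would fail as soon as any entry in this map is `False`.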
## Related Commands
- `/eval-agent` - Test individual agents
- `/eval-report` - Generate aggregate quality report
- `aiwg lint agents` - Static validation
Evaluate workflow: $ARGUMENTS
## References
- @$AIWG_ROOT/agentic/code/addons/aiwg-evals/README.md — aiwg-evals addon overview
- @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/subagent-scoping.md — Parallel agent coordination rules evaluated in workflows
- @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/vague-discretion.md — Concrete success thresholds and test criteria
- @$AIWG_ROOT/agentic/code/frameworks/sdlc-complete/README.md — SDLC flow commands available for workflow evaluation
- @$AIWG_ROOT/docs/cli-reference.md — CLI reference for aiwg lint and eval commands