---
name: ci-archaeology
description: "Woodpecker CI failure log archaeology — pattern classification + remediation suggestions. Use when 'ci archaeology', 'why did CI fail', 'analyze pipeline failure', 'pipeline forensic', or after multiple red builds."
mode: [engineering, ops]
effort: medium
version: 1.0.0
tier: [admin]
---
# CI Archaeology — Woodpecker Failure Forensics
Classify Woodpecker CI failures from raw log archaeology. Bridges the gap between
"CI red, but why?" and "actionable fix" by mapping log signatures to documented
remediation lessons (`lesson_woodpecker_*.md`, `lesson_oauth_refresh_token_race.md`).
ROI: 30 min saved per CI failure × 1-2 failures/week ≈ **26-52h/year** for the lead engineer.
## When to Use
- User says "ci archaeology", "why did CI fail", "analyze pipeline failure",
"pipeline forensic", "classify CI failures", "what's killing CI"
- After multiple red builds in a row (≥ 2) — pattern aggregation reveals root cause
- Post-mortem on a specific failed pipeline (`atlas ci-archaeology --pipeline N`)
- Quarterly health review — classify last 50 fails to validate ≥ 80% known-pattern rate
- BEFORE re-running a failed pipeline blindly (avoid wasted retry per
`lesson_woodpecker_oauth_retry_no_unstuck.md`)
## Companion skills (delegate, don't duplicate)
| Skill | Use it for |
|-------|------------|
| `ci-management` | Fetching pipelines, decoding logs, rerunning, secrets, agents — **base API patterns reused here** |
| `ci-feedback-loop` | Post-push polling to terminal state (LAW-WORKFLOW-001). Hands off to ci-archaeology when red. |
| `ci-health` | Aggregate trends (kill rate, flaky), 7d/30d windows. Archaeology zooms into a single failure. |
ci-archaeology is the **diagnostic** layer: given a red pipeline (or N reds),
emit a classification + lesson pointer. It does NOT fix; it suggests.
## Environment
```bash
# Required (read from ~/.env or shell)
export WP_TOKEN="..." # Woodpecker bearer token
export WP_API="https://ci.axoiq.com/api"
export WP_REPO_ID="1" # synapse default; override per repo
# Optional
export FORGEJO_API="http://192.168.10.75:3000/api/v1"
export FORGEJO_TOKEN="..." # for cross-ref to issues / commit metadata
```
If `WP_TOKEN` is missing → exit 2 with the same hint as `ci-health.py`.
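A minimal sketch of that guard (hypothetical helper; the exact hint text in `ci-health.py` is an assumption):

```python
import os
import sys

def require_env() -> dict:
    """Exit 2 with a hint when WP_TOKEN is absent, mirroring ci-health.py."""
    token = os.environ.get("WP_TOKEN")
    if not token:
        print("WP_TOKEN missing — export it or add it to ~/.env", file=sys.stderr)
        sys.exit(2)
    return {
        "token": token,
        "api": os.environ.get("WP_API", "https://ci.axoiq.com/api"),
        "repo_id": os.environ.get("WP_REPO_ID", "1"),
    }
```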
## CLI
| Command | Purpose |
|---------|---------|
| `atlas ci-archaeology` (no args) | Last 10 failures, classification breakdown table |
| `atlas ci-archaeology --recent N` | Last N failures (default 10, max 200) |
| `atlas ci-archaeology --pipeline N` | Deep-dive single pipeline (every step, every pattern hit) |
| `atlas ci-archaeology --pattern P` | Filter to one pattern: `timeout\|oom\|lint\|test\|deploy\|oauth` |
| `atlas ci-archaeology --classify-rate` | Run on last 50 fails; exit 0 if ≥ 80% classified, else 1 |
| `atlas ci-archaeology --since 7d` | Time window instead of count (accepts `Nd` / `Nh`) |
| `atlas ci-archaeology --json` | Machine output (for piping to ci-health, dashboards) |
| `atlas ci-archaeology --verbose` | Show matched regex + log excerpt per finding |
Default output: human-readable table grouped by pattern, with a remediation
column pointing at the matching memory lesson.
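The `--since` flag accepts `Nd` / `Nh` windows; a minimal parser sketch (function name is illustrative, not part of the CLI):

```python
import re

def parse_window(spec: str) -> int:
    """Convert an 'Nd' / 'Nh' window (e.g. '7d', '12h') into seconds."""
    m = re.fullmatch(r"(\d+)([dh])", spec)
    if not m:
        raise ValueError(f"bad --since window: {spec!r} (expected Nd or Nh)")
    n, unit = int(m.group(1)), m.group(2)
    return n * (86400 if unit == "d" else 3600)
```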
## The 6 Classification Patterns
Each pattern has:
- **regex set** (case-insensitive, multiline) matched against decoded step logs
- **confidence score** (0-100) — primary regex hit = 80, corroborating signal = +20
- **remediation lesson** — exact memory file path, with one-line "fix" summary
- **suggested action** — single command or HITL gate
Classification picks the **highest-confidence** pattern per failure. Ties go to
the more-specific pattern (oauth_race > deploy_fail > generic).
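The scoring and tie-break rule above can be sketched as follows (names are illustrative; corroborating checks are modeled as callables over the log and pipeline metadata):

```python
import re
from dataclasses import dataclass

# Specificity rank used only for ties at equal confidence: higher wins.
SPECIFICITY = {"oauth_race": 3, "deploy_fail": 2, "timeout": 1, "oom": 1,
               "lint_fail": 1, "test_fail": 1}

@dataclass
class Pattern:
    name: str
    primary: list      # compiled regexes; any hit = base score 80
    corroborate: list  # callables(log, meta) -> bool; +20 each, capped at 100

def score(pat: Pattern, log: str, meta: dict) -> int:
    if not any(r.search(log) for r in pat.primary):
        return 0
    hits = sum(1 for check in pat.corroborate if check(log, meta))
    return min(100, 80 + 20 * hits)

def classify(patterns: list, log: str, meta: dict) -> str:
    scored = [(score(p, log, meta), SPECIFICITY[p.name], p.name) for p in patterns]
    best = max(scored)  # highest confidence first, then specificity
    return best[2] if best[0] >= 80 else "unclassified"
```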
### 1. `timeout` — step exceeded budget
**Primary regex** (any match → 80):
```
deadline exceeded
context deadline exceeded
timeout after \d+
step .* timed out
killed after \d+(s|m|h)
```
**Corroborating** (+20 each, max 100):
- Step duration > 30 min in pipeline metadata (`finished - started > 1800`)
- Step name contains `test|e2e|smoke|build` (long-runners)
**Lesson**: `memory/lesson_woodpecker_oauth_retry_no_unstuck.md`
**Fix**: Don't retry blindly — OAuth race timeouts inherit broken cache.
Force a fresh commit (empty `git commit --allow-empty`) instead of API retry.
For genuine slow tests: split the step or raise budget in `.woodpecker/*.yml`.
**Suggested action**:
```bash
# If suspect OAuth race
ssh root@vm-forgejo "su - git -c 'gitea admin auth list'" # check OAuth source
# If suspect slow test, profile locally
docker exec synapse-backend pytest <slow-step> --durations=20
```
### 2. `oom` — process killed by kernel / cgroup
**Primary regex** (any match → 80):
```
\bKilled\b
OOMKiller
out of memory
oom-killer
ENOMEM
exit code 137
exit status 137
container .* killed
cannot allocate memory
```
**Corroborating** (+20):
- Step ran on agent with known low RAM (`agent_id` cross-ref to `atlas ci agents`)
- Step before contained `docker buildx` / `npm install` / `bun install` (memory hogs)
**Lesson**: `memory/lesson_woodpecker_disk_full_queue_zombies.md`
**Fix**: Disk-full and OOM share signatures. First check disk on the agent host:
```bash
ssh root@<woodpecker-host> "df -h /var/lib/docker"
ssh root@<woodpecker-host> "docker system df"
```
If logs > 1 GB: docker daemon stalls → queue zombies appear as OOM.
Permanent fix: `/etc/docker/daemon.json` `{"log-opts":{"max-size":"100m","max-file":"3"}}`
+ daemon reload (already deployed fleet-wide 2026-04-24, ADR-021).
**Suggested action**:
```bash
atlas ci agents # find affected agent
ssh root@<agent> "df -h && docker system df" # disk + image bloat
ssh root@<agent> "journalctl -u docker --since '2h ago' | grep -i oom"
```
### 3. `lint_fail` — static-analysis errors
**Primary regex** (any match → 80, language-tagged):
```
ruff: ^[^\s]+\.py:\d+:\d+: [A-Z]\d{3} | ruff failed | ^E\d{3}
mypy: error: .* \[(arg-type|attr-defined|call-arg|...)\] | mypy: \d+ errors
shellcheck: SC\d{4}: | shellcheck reported \d+
biome: biome check .* error | × .* error
eslint: \d+ errors? \(\d+ warnings?\) | error Unexpected
semgrep: ❯❯❯ Semgrep \d+ findings | severity: ERROR
prettier: Code style issues found
gitleaks: \d+ leaks? found
```
**Corroborating** (+20):
- Step name matches `lint|format|check|typecheck|biome|ruff|mypy`
- Exit code is 1 (lint convention) not 137 (OOM)
**Lesson**: `memory/lesson_audit_layer_must_actually_run.md` (when scanner FPs)
+ language-specific: ruff/biome/mypy docs.
**Fix**: Lint failures are usually local-fix-then-push. Check first if scanner
config drifted (regex char classes, brace expansion) — see lesson on bashisms.
For Python: `ruff check --fix` + `mypy --strict`. For TS: `biome check --apply`.
**Suggested action**:
```bash
# Reproduce locally with the same config CI uses
docker exec synapse-backend ruff check backend/ --output-format=concise
cd frontend && bunx biome check --apply
```
### 4. `test_fail` — runtime test assertions
**Primary regex** (any match → 80):
```
pytest: FAILED \S+ | AssertionError | ^E\s+assert | \d+ failed,? \d+ passed
vitest: FAIL \S+\.test\. | ✗ \S+ | Test Files \d+ failed
playwright: \d+ failed \((\d+)s\) | Test timeout of \d+ms exceeded\.
go test: --- FAIL: | FAIL\s+\S+\s+\d
generic: Error: expect\(.*\)\.toBe | Error: expected
```
**Corroborating** (+20):
- Step name matches `test|spec|e2e|smoke|integration`
- Stack trace present (3+ lines starting with `\s+at `, `File "`, or `>`)
**Lesson**: `.claude/rules/testing-funnel.md` (G0-G4) +
`.claude/rules/testing-mock-budget.md` (when mock-related)
**Fix**:
1. Reproduce the **exact** failing test locally with the **exact** marker / fixture
2. Distinguish flaky (intermittent) vs broken (always red) via 3 reruns
3. If flaky: tag with `@pytest.mark.flaky` and file `cato-flaky-detected` issue
4. If broken by mock-of-internal: see persona-bug pattern → migrate to T3/T4
**Suggested action**:
```bash
# Pytest: rerun the failing test, verbose, no capture
docker exec synapse-backend pytest <failing_test_id> -x -vv --tb=short
# Vitest: rerun changed
cd frontend && bunx vitest run <failing-file> --reporter=verbose
# Playwright: trace + headed locally
cd frontend && bunx playwright test <spec> --trace=on --headed
```
### 5. `deploy_fail` — deploy step crashed
**Primary regex** (any match → 80):
```
ssh:.*Permission denied
Could not resolve hostname
no such file or directory.*\.env
Error response from daemon
container .* not running
container .* unhealthy
unhealthy after \d+ attempts
manifest .* not found
denied: requested access to the resource is denied
```
**Corroborating** (+20):
- Step name matches `deploy|publish|release|push-image|smoke-prod`
- Pipeline branch is `dev` or `main` (deploy targets)
**Lesson**: `memory/lesson_localhost_ipv6_healthcheck_pitfall.md`
+ `memory/lesson_forgejo_actions_runner_replaces_ssh_deploy.md`
+ `memory/lesson_woodpecker_deploy_secret_rotation.md`
**Fix**: Three sub-classes:
- **SSH leg failure** → migrate to Forgejo Actions runner on target VM (atlas user pattern)
- **Healthcheck flap** → check `localhost` vs `127.0.0.1` (IPv6 ::1 trap in Alpine/slim)
- **Registry auth** → secret rotation (Woodpecker UI), then `atlas ci secrets`
**Suggested action**:
```bash
# SSH path
ssh atlas@<target-vm> "docker ps -a | grep synapse"
# Healthcheck
docker exec <container> wget -q -O- http://127.0.0.1:<port>/health
# Registry
atlas ci secrets list # verify FORGEJO_TOKEN present + correct events
```
### 6. `oauth_race` — Forgejo↔Woodpecker token desync
**Primary regex** (high specificity → 80):
```
user does not exist \[uid: 0\]
unable to fetch repo .* from forgejo
oauth.*token.*expired
token refresh.*failed
GetRepo.*404 Not Found
```
**Corroborating** (+20):
- Pipeline workflows array is **empty** (`workflows: []`) — config not re-fetched
- Multiple consecutive failures within < 60s (race window)
**Lesson**: `memory/lesson_oauth_refresh_token_race.md`
+ `memory/lesson_woodpecker_oauth_retry_no_unstuck.md`
**Fix**: This is the most invisible CI killer. API retries (`POST /pipelines/{n}`)
inherit the broken empty-workflow cache → 5+ wasted retries. Pattern:
1. Re-login to Woodpecker UI (re-issues OAuth refresh token, ~30s)
2. Push a fresh commit (forces config re-fetch — `--allow-empty` if no real change)
3. Long-term migration: Forgejo PAT (CLAUDE.md task #75)
**Suggested action**:
```bash
# Verify the race signature
atlas ci pipeline <N> --json | jq '.workflows | length' # 0 = race
# Fresh-commit unstick
git commit --allow-empty -m "ci: refresh after oauth race"
git push
# Then ci-feedback-loop takes over
```
## Confidence + classification rate
A failure is **classified** if any pattern scores ≥ 80.
Failures where no pattern reaches 80 are reported as `unclassified`, with the
top-3 partial matches shown for human triage.
`--classify-rate` runs on last 50 fails; success criterion ≥ 80% classified.
Below 80% = patterns drift → file Forgejo issue with label `ci-archaeology-drift`
+ propose new regex via PR to this skill.
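The gate can be sketched as (hypothetical helper, mirroring the exit-code idiom described in "Reuse from existing skills"):

```python
def classify_rate_gate(results: list, threshold: float = 0.80) -> int:
    """Return 0 when the classified fraction meets the threshold, else 1."""
    if not results:
        return 0
    classified = sum(1 for r in results if r != "unclassified")
    rate = classified / len(results)
    print(f"Classification rate: {classified}/{len(results)} ({rate:.0%})")
    return 0 if rate >= threshold else 1
```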
## Output formats
### Default (human, grouped):
```
CI Archaeology — last 10 failures (synapse, repo_id=1)
┌──────────────┬───────┬─────────────────────────────────────────┐
│ Pattern │ Count │ Top remediation │
├──────────────┼───────┼─────────────────────────────────────────┤
│ test_fail │ 4 │ See testing-funnel.md (G2 mock-budget) │
│ oauth_race │ 3 │ Fresh-commit unstick + re-login WP UI │
│ deploy_fail │ 2 │ Healthcheck IPv6 — use 127.0.0.1 │
│ unclassified │ 1 │ Manual triage: pipeline #1502 │
└──────────────┴───────┴─────────────────────────────────────────┘
Classification rate: 9/10 (90%) ✅
```
### `--pipeline N` (deep-dive):
```
Pipeline #1495 status=failure branch=feat/x duration=4m12s
Step backend-test (pid=3) → test_fail conf=100
match: "FAILED tests/api/test_chat.py::test_persona_resolution"
match: "AssertionError: expected 'I&C', got None"
→ memory/lesson_persona_bug_2026-04-16.md
→ suggested: docker exec synapse-backend pytest tests/api/test_chat.py::test_persona_resolution -x -vv
```
### `--json`:
Standard schema, one object per failure with `pipeline`, `step`, `pattern`,
`confidence`, `regex_hits`, `lesson_path`, `suggested_action`, `excerpt`.
## Reuse from existing skills
- **API auth + pipeline list**: copy the `_fetch_pipelines()` pattern from
`skills/ci-health/ci-health.py` (verbatim function, same env vars)
- **Step log decode**: shell out to `atlas ci logs <N> --all` (ci-management)
to avoid re-implementing the base64/multiplexed decoder
- **Agent cross-ref**: `atlas ci agents` for OOM correlation (ci-management)
- **Classify-rate validation**: mirror `--validate-p1` exit-code idiom from ci-health
This keeps ci-archaeology as a **thin classifier on top** — no duplicate API code.
## Implementation notes
- Compile all regexes once at module load with `re.IGNORECASE | re.MULTILINE`
- Cache per-pipeline log fetches in `~/.cache/atlas/ci-archaeology/pipeline-{N}.log`
with 24h TTL (failures don't change post-mortem; cache amortizes deep-dive)
- For `--recent N`: filter `status in {failure, error}` server-side via Woodpecker API
- For unclassified output: print the **first 200 chars** of the failing step log
to give the human a head start
- Exit codes: 0 = ran successfully, 1 = `--classify-rate` below threshold,
2 = missing env / API error
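The cache-with-TTL note can be sketched as (the `fetch` callable stands in for shelling out to `atlas ci logs <N> --all`; the directory parameter is exposed for testability):

```python
import time
from pathlib import Path

DEFAULT_CACHE = Path.home() / ".cache" / "atlas" / "ci-archaeology"
TTL = 24 * 3600  # failed-pipeline logs are immutable post-mortem

def cached_log(pipeline: int, fetch, cache_dir=DEFAULT_CACHE, ttl=TTL) -> str:
    """Return step logs for a pipeline, refetching only after the TTL expires."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"pipeline-{pipeline}.log"
    if path.exists() and time.time() - path.stat().st_mtime < ttl:
        return path.read_text()
    text = fetch(pipeline)
    path.write_text(text)
    return text
```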
## Observability
Append one JSONL line to `memory/ci-archaeology-metrics.jsonl` per run:
```json
{"ts":"2026-04-30T20:00:00Z","fails_seen":10,"classified":9,"rate":0.90,
"top_pattern":"test_fail","weeks_window":1}
```
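A minimal append sketch matching that record shape (helper name is illustrative):

```python
import json
from datetime import datetime, timezone

def append_metric(path: str, fails_seen: int, classified: int,
                  top_pattern: str, weeks_window: int = 1) -> dict:
    """Append one metrics record to the JSONL file and return it."""
    record = {
        "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "fails_seen": fails_seen,
        "classified": classified,
        "rate": round(classified / fails_seen, 2) if fails_seen else 0.0,
        "top_pattern": top_pattern,
        "weeks_window": weeks_window,
    }
    with open(path, "a") as fh:
        fh.write(json.dumps(record) + "\n")
    return record
```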
Weekly digest (cron Mon 09:23 EDT, aligned with existing Anthropic Routine):
- 7-day classification rate trend
- Top-3 patterns by count
- Any pattern with > 30% week-over-week growth → flag for triage
## Verification (skill self-test)
```bash
# 1. YAML frontmatter parses
sed -n '2,/^---$/p' skills/ci-archaeology/SKILL.md | sed '$d' \
  | python3 -c "import sys, yaml; print(yaml.safe_load(sys.stdin))"
# 2. All referenced lesson files exist
for f in \
memory/lesson_woodpecker_oauth_retry_no_unstuck.md \
memory/lesson_woodpecker_disk_full_queue_zombies.md \
memory/lesson_oauth_refresh_token_race.md \
memory/lesson_woodpecker_debug_wp_token.md \
memory/lesson_localhost_ipv6_healthcheck_pitfall.md \
memory/lesson_forgejo_actions_runner_replaces_ssh_deploy.md
do
test -f "$f" && echo "OK $f" || echo "MISSING $f"
done
# 3. Classify-rate gate (production)
atlas ci-archaeology --classify-rate
# exit 0 = ≥ 80%, exit 1 = drift, file issue
```
## See Also
- `skills/ci-management/SKILL.md` — base Woodpecker CLI (logs, rerun, secrets)
- `skills/ci-feedback-loop/SKILL.md` — post-push poll loop (hands off here on red)
- `skills/ci-health/SKILL.md` — kill-rate / flaky aggregation (complementary)
- `.claude/rules/cato-orchestration.md` — change-aware test routing (G0-G4)
- `.claude/rules/testing-funnel.md` — where tests live (G0/G1/G2/G3/G4)
- `.claude/rules/testing-mock-budget.md` — mock rules (Rule 1: orchestrator smoke)
- `memory/lesson_woodpecker_debug_wp_token.md` — canonical "silent CI failure" debug
- ADR-021 — fleet log-rotation hardening (2026-04-24)