---
name: ci-archaeology
description: "Woodpecker CI failure log archaeology — pattern classification + remediation suggestions. Use when 'ci archaeology', 'why did CI fail', 'analyze pipeline failure', 'pipeline forensic', or after multiple red builds."
mode: [engineering, ops]
effort: medium
version: 1.0.0
tier: [admin]
---
# CI Archaeology — Woodpecker Failure Forensics
Classify Woodpecker CI failures from raw log archaeology. Bridges the gap between
"CI red, but why?" and "actionable fix" by mapping log signatures to documented
remediation lessons (`lesson_woodpecker_*.md`, `lesson_oauth_refresh_token_race.md`).
ROI: 30 min saved per CI failure × 1-2 failures/week ≈ **26-52h/year** for the lead engineer.
## When to Use
- User says "ci archaeology", "why did CI fail", "analyze pipeline failure",
"pipeline forensic", "classify CI failures", "what's killing CI"
- After multiple red builds in a row (≥ 2) — pattern aggregation reveals root cause
- Post-mortem on a specific failed pipeline (`atlas ci-archaeology --pipeline N`)
- Quarterly health review — classify last 50 fails to validate ≥ 80% known-pattern rate
- BEFORE re-running a failed pipeline blindly (avoid wasted retry per
`lesson_woodpecker_oauth_retry_no_unstuck.md`)
## Companion skills (delegate, don't duplicate)
| Skill | Use it for |
|-------|------------|
| `ci-management` | Fetching pipelines, decoding logs, rerunning, secrets, agents — **base API patterns reused here** |
| `ci-feedback-loop` | Post-push polling to terminal state (LAW-WORKFLOW-001). Hands off to ci-archaeology when red. |
| `ci-health` | Aggregate trends (kill rate, flaky), 7d/30d windows. Archaeology zooms into a single failure. |
ci-archaeology is the **diagnostic** layer: given a red pipeline (or N reds),
emit a classification + lesson pointer. It does NOT fix; it suggests.
## Environment
```bash
# Required (read from ~/.env or shell)
export WP_TOKEN="..." # Woodpecker bearer token
export WP_API="https://ci.axoiq.com/api"
export WP_REPO_ID="1" # synapse default; override per repo
# Optional
export FORGEJO_API="http://192.168.10.75:3000/api/v1"
export FORGEJO_TOKEN="..." # for cross-ref to issues / commit metadata
```
If `WP_TOKEN` is missing → exit 2 with the same hint as `ci-health.py`.
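A minimal sketch of that guard (hypothetical helper; the exact hint text in `ci-health.py` is an assumption):

```python
import os
import sys

def require_env() -> dict:
    """Exit 2 with a hint when WP_TOKEN is absent, mirroring ci-health.py."""
    token = os.environ.get("WP_TOKEN")
    if not token:
        print("WP_TOKEN missing — export it or add it to ~/.env", file=sys.stderr)
        sys.exit(2)
    return {
        "token": token,
        "api": os.environ.get("WP_API", "https://ci.axoiq.com/api"),
        "repo_id": os.environ.get("WP_REPO_ID", "1"),
    }
```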
## CLI
| Command | Purpose |
|---------|---------|
| `atlas ci-archaeology` (no args) | Last 10 failures, classification breakdown table |
| `atlas ci-archaeology --recent N` | Last N failures (default 10, max 200) |
| `atlas ci-archaeology --pipeline N` | Deep-dive single pipeline (every step, every pattern hit) |
| `atlas ci-archaeology --pattern P` | Filter to one pattern: `timeout\|oom\|lint\|test\|deploy\|oauth` |
| `atlas ci-archaeology --classify-rate` | Run on last 50 fails; exit 0 if ≥ 80% classified, else 1 |
| `atlas ci-archaeology --since 7d` | Time window instead of count (accepts `Nd` / `Nh`) |
| `atlas ci-archaeology --json` | Machine output (for piping to ci-health, dashboards) |
| `atlas ci-archaeology --verbose` | Show matched regex + log excerpt per finding |
Default output: human-readable table grouped by pattern, with a remediation
column pointing at the matching memory lesson.
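The `--since` flag accepts `Nd` / `Nh` windows; a minimal parser sketch (function name is illustrative, not part of the CLI):

```python
import re

def parse_window(spec: str) -> int:
    """Convert an 'Nd' / 'Nh' window (e.g. '7d', '12h') into seconds."""
    m = re.fullmatch(r"(\d+)([dh])", spec)
    if not m:
        raise ValueError(f"bad --since window: {spec!r} (expected Nd or Nh)")
    n, unit = int(m.group(1)), m.group(2)
    return n * (86400 if unit == "d" else 3600)
```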
## The 6 Classification Patterns
Each pattern has:
- **regex set** (case-insensitive, multiline) matched against decoded step logs
- **confidence score** (0-100) — primary regex hit = 80, corroborating signal = +20
- **remediation lesson** — exact memory file path, with one-line "fix" summary
- **suggested action** — single command or HITL gate
Classification picks the **highest-confidence** pattern per failure. Ties go to
the more-specific pattern (oauth_race > deploy_fail > generic).
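The scoring and tie-break rule above can be sketched as follows (names are illustrative; corroborating checks are modeled as callables over the log and pipeline metadata):

```python
import re
from dataclasses import dataclass

# Specificity rank used only for ties at equal confidence: higher wins.
SPECIFICITY = {"oauth_race": 3, "deploy_fail": 2, "timeout": 1, "oom": 1,
               "lint_fail": 1, "test_fail": 1}

@dataclass
class Pattern:
    name: str
    primary: list      # compiled regexes; any hit = base score 80
    corroborate: list  # callables(log, meta) -> bool; +20 each, capped at 100

def score(pat: Pattern, log: str, meta: dict) -> int:
    if not any(r.search(log) for r in pat.primary):
        return 0
    hits = sum(1 for check in pat.corroborate if check(log, meta))
    return min(100, 80 + 20 * hits)

def classify(patterns: list, log: str, meta: dict) -> str:
    scored = [(score(p, log, meta), SPECIFICITY[p.name], p.name) for p in patterns]
    best = max(scored)  # highest confidence first, then specificity
    return best[2] if best[0] >= 80 else "unclassified"
```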
### 1. `timeout` — step exceeded budget
**Primary regex** (any match → 80):
```
deadline exceeded
context deadline exceeded
timeout after \d+
step .* timed out
killed after \d+(s|m|h)
```
**Corroborating** (+20 each, max 100):
- Step duration > 30 min in pipeline metadata (`finished - started > 1800`)
- Step name contains `test|e2e|smoke|build` (long-runners)
**Lesson**: `memory/lesson_woodpecker_oauth_retry_no_unstuck.md`
**Fix**: Don't retry blindly — OAuth race timeouts inherit broken cache.
Force a fresh commit (empty `git commit --allow-empty`) instead of API retry.
For genuine slow tests: split the step or raise budget in `.woodpecker/*.yml`.
**Suggested action**:
```bash
# If suspect OAuth race
ssh root@vm-forgejo "su - git -c 'gitea admin auth list'" # check OAuth source
# If suspect slow test, profile locally
docker exec synapse-backend pytest <slow-step> --durations=20
```
### 2. `oom` — process killed by kernel / cgroup
**Primary regex** (any match → 80):
```
\bKilled\b
OOMKiller
out of memory
oom-killer
ENOMEM
exit code 137
exit status 137
container .* killed
cannot allocate memory
```
**Corroborating** (+20):
- Step ran on agent with known low RAM (`agent_id` cross-ref to `atlas ci agents`)
- Step before contained `docker buildx` / `npm install` / `bun install` (memory hogs)
**Lesson**: `memory/lesson_woodpecker_disk_full_queue_zombies.md`
**Fix**: Disk-full and OOM share signatures. First check disk on the agent host:
```bash
ssh root@<woodpecker-host> "df -h /var/lib/docker"
ssh root@<woodpecker-host> "docker system df"
```
If logs > 1 GB: docker daemon stalls → queue zombies appear as OOM.
Permanent fix: `/etc/docker/daemon.json` `{"log-opts":{"max-size":"100m","max-file":"3"}}`
+ daemon reload (already deployed fleet-wide 2026-04-24, ADR-021).
**Suggested action**:
```bash
atlas ci agents # find affected agent
ssh root@<agent> "df -h && docker system df" # disk + image bloat
ssh root@<agent> "journalctl -u docker --since '2h ago' | grep -i oom"
```
### 3. `lint_fail` — static-analysis errors
**Primary regex** (any match → 80, language-tagged):
```
ruff: ^[^\s]+\.py:\d+:\d+: [A-Z]\d{3} | ruff failed | ^E\d{3}
mypy: error: .* \[(arg-type|attr-defined|call-arg|...)\] | mypy: \d+ errors
shellcheck: SC\d{4}: | shellcheck reported \d+
biome: biome check .* error | × .* error
eslint: \d+ errors? \(\d+ warnings?\) | error Unexpected
semgrep: ❯❯❯ Semgrep \d+ findings | severity: ERROR
prettier: Code style issues found
gitleaks: \d+ leaks? found
```
**Corroborating** (+20):
- Step name matches `lint|format|check|typecheck|biome|ruff|mypy`
- Exit code is 1 (lint convention) not 137 (OOM)
**Lesson**: `memory/lesson_audit_layer_must_actually_run.md` (when scanner FPs)
+ language-specific: ruff/biome/mypy docs.
**Fix**: Lint failures are usually local-fix-then-push. Check first if scanner
config drifted (regex char classes, brace expansion) — see lesson on bashisms.
For Python: `ruff check --fix` + `mypy --strict`. For TS: `biome check --apply`.
**Suggested action**:
```bash
# Reproduce locally with the same config CI uses
docker exec synapse-backend ruff check backend/ --output-format=concise
cd frontend && bunx biome check --apply
```
### 4. `test_fail` — runtime test assertions
**Primary regex** (any match → 80):
```
pytest: FAILED \S+ | AssertionError | ^E\s+assert | \d+ failed,? \d+ passed
vitest: FAIL \S+\.test\. | ✗ \S+ | Test Files \d+ failed
playwright: \d+ failed \((\d+)s\) | Test timeout of \d+ms exceeded\.
go test: --- FAIL: | FAIL\s+\S+\s+\d
generic: Error: expect\(.*\)\.toBe | Error: expected
```
**Corroborating** (+20):
- Step name matches `test|spec|e2e|smoke|integration`
- Stack trace present (3+ lines starting with `\s+at `, `File "`, or `>`)
**Lesson**: `.claude/rules/testing-funnel.md` (G0-G4) +
`.claude/rules/testing-mock-budget.md` (when mock-related)
**Fix**:
1. Reproduce the **exact** failing test locally with the **exact** marker / fixture
2. Distinguish flaky (intermittent) vs broken (always red) via 3 reruns
3. If flaky: tag with `@pytest.mark.flaky` and file `cato-flaky-detected` issue
4. If broken by mock-of-internal: see persona-bug pattern → migrate to T3/T4
**Suggested action**:
```bash
# Pytest: rerun the failing test, verbose, no capture
docker exec synapse-backend pytest <failing_test_id> -x -vv --tb=short
# Vitest: rerun changed
cd frontend && bunx vitest run <failing-file> --reporter=verbose
# Playwright: trace + headed locally
cd frontend && bunx playwright test <spec> --trace=on --headed
```
### 5. `deploy_fail` — deploy step crashed
**Primary regex** (any match → 80):
```
ssh:.*Permission denied
Could not resolve hostname
no such file or directory.*\.env
Error response from daemon
container .* not running
container .* unhealthy
unhealthy after \d+ attempts
manifest .* not found
denied: requested access to the resource is denied
```
**Corroborating** (+20):
- Step name matches `deploy|publish|release|push-image|smoke-prod`
- Pipeline branch is `dev` or `main` (deploy targets)
**Lesson**: `memory/lesson_localhost_ipv6_healthcheck_pitfall.md`
+ `memory/lesson_forgejo_actions_runner_replaces_ssh_deploy.md`
+ `memory/lesson_woodpecker_deploy_secret_rotation.md`
**Fix**: Three sub-classes:
- **SSH leg failure** → migrate to Forgejo Actions runner on target VM (atlas user pattern)
- **Healthcheck flap** → check `localhost` vs `127.0.0.1` (IPv6 ::1 trap in Alpine/slim)
- **Registry auth** → secret rotation (Woodpecker UI), then `atlas ci secrets`
**Suggested action**:
```bash
# SSH path
ssh atlas@<target-vm> "docker ps -a | grep synapse"
# Healthcheck
docker exec <container> wget -q -O- http://127.0.0.1:<port>/health
# Registry
atlas ci secrets list # verify FORGEJO_TOKEN present + correct events
```
### 6. `oauth_race` — Forgejo↔Woodpecker token desync
**Primary regex** (high specificity → 80):
```
user does not exist \[uid: 0\]
unable to fetch repo .* from forgejo
oauth.*token.*expired
token refresh.*failed
GetRepo.*404 Not Found
```
**Corroborating** (+20):
- Pipeline workflows array is **empty** (`workflows: []`) — config not re-fetched
- Multiple consecutive failures within < 60s (race window)
**Lesson**: `memory/lesson_oauth_refresh_token_race.md`
+ `memory/lesson_woodpecker_oauth_retry_no_unstuck.md`
**Fix**: This is the most invisible CI killer. API retries (`POST /pipelines/{n}`)
inherit the broken empty-workflow cache → 5+ wasted retries. Pattern:
1. Re-login to Woodpecker UI (re-issues OAuth refresh token, ~30s)
2. Push a fresh commit (forces config re-fetch — `--allow-empty` if no real change)
3. Long-term migration: Forgejo PAT (CLAUDE.md task #75)
**Suggested action**:
```bash
# Verify the race signature
atlas ci pipeline <N> --json | jq '.workflows | length' # 0 = race
# Fresh-commit unstick
git commit --allow-empty -m "ci: refresh after oauth race"
git push
# Then ci-feedback-loop takes over
```
## Confidence + classification rate
A failure is **classified** if any pattern scores ≥ 80.
Failures where no pattern reaches 80 are reported as `unclassified`, with the
top-3 partial matches shown for human triage.
`--classify-rate` runs on last 50 fails; success criterion ≥ 80% classified.
Below 80% = patterns drift → file Forgejo issue with label `ci-archaeology-drift`
+ propose new regex via PR to this skill.
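The gate can be sketched as (hypothetical helper, mirroring the exit-code idiom described in "Reuse from existing skills"):

```python
def classify_rate_gate(results: list, threshold: float = 0.80) -> int:
    """Return 0 when the classified fraction meets the threshold, else 1."""
    if not results:
        return 0
    classified = sum(1 for r in results if r != "unclassified")
    rate = classified / len(results)
    print(f"Classification rate: {classified}/{len(results)} ({rate:.0%})")
    return 0 if rate >= threshold else 1
```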
## Output formats
### Default (human, grouped):
```
CI Archaeology — last 10 failures (synapse, repo_id=1)
┌──────────────┬───────┬─────────────────────────────────────────┐
│ Pattern │ Count │ Top remediation │
├──────────────┼───────┼─────────────────────────────────────────┤
│ test_fail │ 4 │ See testing-funnel.md (G2 mock-budget) │
│ oauth_race │ 3 │ Fresh-commit unstick + re-login WP UI │
│ deploy_fail │ 2 │ Healthcheck IPv6 — use 127.0.0.1 │
│ unclassified │ 1 │ Manual triage: pipeline #1502 │
└──────────────┴───────┴─────────────────────────────────────────┘
Classification rate: 9/10 (90%) ✅
```
### `--pipeline N` (deep-dive):
```
Pipeline #1495 status=failure branch=feat/x duration=4m12s
Step backend-test (pid=3) → test_fail conf=100
match: "FAILED tests/api/test_chat.py::test_persona_resolution"
match: "AssertionError: expected 'I&C', got None"
→ memory/lesson_persona_bug_2026-04-16.md
→ suggested: docker exec synapse-backend pytest tests/api/test_chat.py::test_persona_resolution -x -vv
```
### `--json`:
Standard schema, one object per failure with `pipeline`, `step`, `pattern`,
`confidence`, `regex_hits`, `lesson_path`, `suggested_action`, `excerpt`.
## Reuse from existing skills
- **API auth + pipeline list**: copy the `_fetch_pipelines()` pattern from
`skills/ci-health/ci-health.py` (verbatim function, same env vars)
- **Step log decode**: shell out to `atlas ci logs <N> --all` (ci-management)
to avoid re-implementing the base64/multiplexed decoder
- **Agent cross-ref**: `atlas ci agents` for OOM correlation (ci-management)
- **Classify-rate validation**: mirror `--validate-p1` exit-code idiom from ci-health
This keeps ci-archaeology as a **thin classifier on top** — no duplicate API code.
## Implementation notes
- Compile all regexes once at module load with `re.IGNORECASE | re.MULTILINE`
- Cache per-pipeline log fetches in `~/.cache/atlas/ci-archaeology/pipeline-{N}.log`
with 24h TTL (failures don't change post-mortem; cache amortizes deep-dive)
- For `--recent N`: filter `status in {failure, error}` server-side via Woodpecker API
- For unclassified output: print the **first 200 chars** of the failing step log
to give the human a head start
- Exit codes: 0 = ran successfully, 1 = `--classify-rate` below threshold,
2 = missing env / API error
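The cache-with-TTL note can be sketched as (the `fetch` callable stands in for shelling out to `atlas ci logs <N> --all`; the directory parameter is exposed for testability):

```python
import time
from pathlib import Path

DEFAULT_CACHE = Path.home() / ".cache" / "atlas" / "ci-archaeology"
TTL = 24 * 3600  # failed-pipeline logs are immutable post-mortem

def cached_log(pipeline: int, fetch, cache_dir=DEFAULT_CACHE, ttl=TTL) -> str:
    """Return step logs for a pipeline, refetching only after the TTL expires."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"pipeline-{pipeline}.log"
    if path.exists() and time.time() - path.stat().st_mtime < ttl:
        return path.read_text()
    text = fetch(pipeline)
    path.write_text(text)
    return text
```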
## Observability
Append one JSONL line to `memory/ci-archaeology-metrics.jsonl` per run:
```json
{"ts":"2026-04-30T20:00:00Z","fails_seen":10,"classified":9,"rate":0.90,
"top_pattern":"test_fail","weeks_window":1}
```
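A minimal append sketch matching that record shape (helper name is illustrative):

```python
import json
from datetime import datetime, timezone

def append_metric(path: str, fails_seen: int, classified: int,
                  top_pattern: str, weeks_window: int = 1) -> dict:
    """Append one metrics record to the JSONL file and return it."""
    record = {
        "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "fails_seen": fails_seen,
        "classified": classified,
        "rate": round(classified / fails_seen, 2) if fails_seen else 0.0,
        "top_pattern": top_pattern,
        "weeks_window": weeks_window,
    }
    with open(path, "a") as fh:
        fh.write(json.dumps(record) + "\n")
    return record
```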
Weekly digest (cron Mon 09:23 EDT, aligned with existing Anthropic Routine):
- 7-day classification rate trend
- Top-3 patterns by count
- Any pattern with > 30% week-over-week growth → flag for triage
## Verification (skill self-test)
```bash
# 1. YAML frontmatter parses
sed -n '2,/^---$/p' skills/ci-archaeology/SKILL.md | sed '$d' \
  | python3 -c "import sys, yaml; print(yaml.safe_load(sys.stdin))"
# 2. All referenced lesson files exist
for f in \
memory/lesson_woodpecker_oauth_retry_no_unstuck.md \
memory/lesson_woodpecker_disk_full_queue_zombies.md \
memory/lesson_oauth_refresh_token_race.md \
memory/lesson_woodpecker_debug_wp_token.md \
memory/lesson_localhost_ipv6_healthcheck_pitfall.md \
memory/lesson_forgejo_actions_runner_replaces_ssh_deploy.md
do
test -f "$f" && echo "OK $f" || echo "MISSING $f"
done
# 3. Classify-rate gate (production)
atlas ci-archaeology --classify-rate
# exit 0 = ≥ 80%, exit 1 = drift, file issue
```
## See Also
- `skills/ci-management/SKILL.md` — base Woodpecker CLI (logs, rerun, secrets)
- `skills/ci-feedback-loop/SKILL.md` — post-push poll loop (hands off here on red)
- `skills/ci-health/SKILL.md` — kill-rate / flaky aggregation (complementary)
- `.claude/rules/cato-orchestration.md` — change-aware test routing (G0-G4)
- `.claude/rules/testing-funnel.md` — where tests live (G0/G1/G2/G3/G4)
- `.claude/rules/testing-mock-budget.md` — mock rules (Rule 1: orchestrator smoke)
- `memory/lesson_woodpecker_debug_wp_token.md` — canonical "silent CI failure" debug
- ADR-021 — fleet log-rotation hardening (2026-04-24)