---
name: infra-audit
description: "Comprehensive infrastructure audit pipeline (5-8 parallel phases). Use when 'infra audit', 'check infra', 'mesh diagnostic', 'audit infra', or before major infra changes."
mode: [ops]
effort: medium
version: 1.0.0
tier: [admin]
---
# Infra Audit — Comprehensive Parallel Audit Pipeline
> Spawns 5-8 parallel audit phases via `atlas-team`, each reusing an existing
> infra skill as a building block. Aggregates per-phase status into a single
> ASCII dashboard with drift summary and actionable next steps.
>
> **ROI target**: 20-25 min/session saved on the manual audits Seb runs
> regularly (pre-change, post-change, weekly health, after CI fleet incident).
## When to Use
- User says "infra audit", "audit infra", "check infra", "mesh diagnostic"
- Before any major infra change (Caddy/Authentik/DNS/firewall/VM migration)
- After fleet incident (CI red, deploy stuck, OAuth race, mesh disconnect)
- Weekly cadence (recurring routine via `atlas-routines`)
- Investor/client demo prep (validate full stack health)
- Drift detection vs `infrastructure/docs/admin-access-paths.md` SSoT
## When NOT to Use
- Single-service health check → use `infra-health` directly
- Targeted DNS/SSL only → use `network-audit`
- VPN debug only → use `mesh-diagnostics`
- Endpoint API ping → use `api-healthcheck`
- Active change/migration → use `infrastructure-change` (this skill is read-only)
## CLI
```bash
# Default: 5 quick phases (network, sso, caddy, ci, drift) — ~3-5 min
atlas infra-audit
# Deep: all 8 phases including backup restore test — ~10-15 min
atlas infra-audit --deep
# Single phase (debug or follow-up)
atlas infra-audit --phase {network|sso|caddy|ci|drift|backup|cert|netbird}
# Output formats
atlas infra-audit --format dashboard # ASCII (default)
atlas infra-audit --format json # for scripts/dashboards
atlas infra-audit --format markdown # for handoff/memory file
# Routine mode (cron-friendly, opens Forgejo issue if any FAIL)
atlas infra-audit --routine
```
## The 8 Audit Phases
Each phase is **spawnable as a parallel `atlas-team` agent** with its own
read-only worktree. Phases share **zero mutable state** — they aggregate
through the report writer at the end.
### Phase 1 — Network Baseline `--phase network`
**Reuses**: `network-audit` (DNS + ports + SSL + VLAN)
Checks:
- DNS resolution (laptop → VM 550, laptop → VM 801, mesh → all peers)
- Ping latency (laptop → 100.64.0.1 prod, laptop → 100.64.0.x dev, baseline budget < 50ms)
- UniFi VPN reachability (admin path #1 invariant per `admin-access-paths.md`)
- SSL cert validity (axoiq.com + s-gagnon.com wildcards)
**Pass**: all DNS resolve, latency < 50ms, certs valid > 30d.
**Warn**: latency 50-150ms OR certs 14-30d expiry.
**Fail**: DNS NXDOMAIN, latency > 150ms, certs < 14d expiry.
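A minimal sketch of the kinds of probes this phase runs, assuming standard `dig`, `ping`, and `openssl` tooling; the hostnames and mesh IP are illustrative placeholders, not the real fleet values:
```bash
#!/usr/bin/env bash
# Phase 1 sketch: DNS, mesh latency, and cert probes (read-only).
# Hostnames and mesh IPs below are illustrative placeholders.
set -uo pipefail

# DNS resolution: flag NXDOMAIN / empty answers
for host in vm550.example.internal vm801.example.internal; do
  dig +short "$host" | grep -q . || echo "FAIL dns: $host does not resolve"
done

# Mesh latency: avg RTT budget 50ms (WARN) / 150ms (FAIL)
rtt=$(ping -c 4 -q 100.64.0.1 | awk -F/ '/^(rtt|round-trip)/ {print $5}')
if [ -z "${rtt}" ]; then
  echo "FAIL latency: 100.64.0.1 unreachable"
else
  awk -v rtt="$rtt" 'BEGIN { if (rtt > 150) print "FAIL latency " rtt "ms";
                             else if (rtt > 50) print "WARN latency " rtt "ms" }'
fi

# Wildcard cert: warn if expiring within 30 days (2592000 s)
echo | openssl s_client -connect axoiq.com:443 -servername axoiq.com 2>/dev/null \
  | openssl x509 -noout -checkend 2592000 >/dev/null \
  || echo "WARN cert: axoiq.com expires in under 30 days"
```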
### Phase 2 — SSO Scopes `--phase sso`
**Reuses**: Authentik API client (lib in `infrastructure-change` skill)
Checks (anchored on lesson `lesson_authentik_provider_zero_mappings_silent_oidc_fail.md`):
- All Authentik OIDC providers have 4 standard scopes: `openid`, `profile`, `email`, `groups`
- Each app has `policy_bindings` set (per `feedback_authentik_policy_bindings_default.md`)
- `deny-external-devs` policy bound on sensitive apps (vault/proxmox/coder/forgejo-admin)
- Latest `last_sign_in` < 7d for active service accounts (charles, jeremy, atlas)
**Pass**: 18/18 providers w/ all 4 scopes, all sensitive apps bound.
**Warn**: 1-2 providers missing 1 scope OR 1-2 apps unbound.
**Fail**: ≥3 providers w/ partial mappings OR sensitive app unbound.
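A conceptual sketch of the scope check (the real phase reuses the Authentik API client from `infrastructure-change`; the `/api/v3/providers/oauth2/` endpoint and the shape of `property_mappings` in the response are assumptions to verify against your Authentik version):
```bash
#!/usr/bin/env bash
# Phase 2 sketch: flag OIDC providers with fewer than 4 scope mappings.
# Assumes AUTHENTIK_URL + AUTHENTIK_TOKEN env vars; the 'property_mappings'
# field shape is an assumption -- verify against your Authentik version.
set -euo pipefail

curl -fsS -H "Authorization: Bearer ${AUTHENTIK_TOKEN}" \
  "${AUTHENTIK_URL}/api/v3/providers/oauth2/" \
| jq -r '.results[]
         | select((.property_mappings | length) < 4)
         | "WARN provider \(.pk) (\(.name)): \(.property_mappings | length) scope mappings"'
```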
### Phase 3 — Caddy Routes `--phase caddy`
**Reuses**: `infra-health` (HTTP probe matrix)
Checks:
- Every entry in `/etc/caddy/Caddyfile` (and homelab equivalents) returns 200/302/401
- HTTP→HTTPS redirect on all subdomains
- `forward_auth` blocks point to live Authentik outpost
- `copy_headers` rename for legacy services (per `lesson_caddy_authentik_header_rename_bridge.md`)
- NetBird gRPC bypass `@grpc` matcher includes `/management.AccountInfoService/*` (per `lesson_caddy_netbird_grpc_account_info_service_bypass.md`)
**Pass**: all routes 200/302/401 expected, redirects intact.
**Warn**: 1-2 routes 5xx but auto-recover.
**Fail**: ≥3 routes hard-down OR forward_auth broken.
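A sketch of the route probe loop; the subdomain list here is illustrative, the real phase derives it from the Caddyfile / `infra-health` probe matrix:
```bash
#!/usr/bin/env bash
# Phase 3 sketch: probe each public route and flag unexpected status codes.
# The route list is illustrative -- the real phase reads it from the Caddyfile.
set -uo pipefail

routes="auth.axoiq.com git.axoiq.com vault.axoiq.com"
for host in $routes; do
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "https://${host}/")
  case "$code" in
    200|302|401) ;;                          # expected for proxied / SSO-gated apps
    *) echo "FAIL route ${host} -> HTTP ${code}" ;;
  esac
done
```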
### Phase 4 — CI Health `--phase ci`
**Reuses**: `ci-health` (Woodpecker + Forgejo Actions observability)
Checks:
- Woodpecker queue depth ≤ 3 (else stuck)
- Agent fleet status (5 active expected post-cleanup)
- Recent failure rate ≤ 15% over 24h
- No `agent_id NULL` zombies (per `lesson_woodpecker_disk_full_queue_zombies.md`)
- Forgejo Actions runner alive on VM 801 (per `lesson_forgejo_actions_runner_replaces_ssh_deploy.md`)
**Pass**: queue ≤ 3, all agents online, failure rate ≤ 15%.
**Warn**: queue 4-9 OR failure rate 16-30%.
**Fail**: queue ≥ 10 OR agent down OR failure rate > 30%.
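A sketch of the queue-depth part of this phase, assuming a Woodpecker admin token in `WOODPECKER_TOKEN`; the queue-info endpoint and the JSON field names are assumptions to verify against your Woodpecker version:
```bash
#!/usr/bin/env bash
# Phase 4 sketch: check Woodpecker queue depth against the thresholds above.
# Endpoint and JSON field names are assumptions about the Woodpecker API.
set -euo pipefail

info=$(curl -fsS -H "Authorization: Bearer ${WOODPECKER_TOKEN}" \
  "${WOODPECKER_URL}/api/queue/info")
pending=$(echo "$info" | jq -r '.stats.pending_count // .pending // 0')

if   [ "$pending" -ge 10 ]; then echo "FAIL queue depth ${pending}"
elif [ "$pending" -ge 4  ]; then echo "WARN queue depth ${pending}"
else                             echo "PASS queue depth ${pending}"
fi
```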
### Phase 5 — Deviation Compare `--phase drift`
**Reuses**: `git diff` + structured doc parser
Checks current state vs SSoT:
- `infrastructure/docs/admin-access-paths.md` → live mesh peer matrix
- `.blueprint/MEGA-PLAN.md` Section L → live ecosystem URL matrix
- `axoiq-business/finance/` references → grep tech repos for corpo leak (per `.semgrep/synapse-no-corpo-leak.yaml`)
- `lefthook.yml` hooks (per `.claude/rules/no-corpo-leak.md`) actually configured
**Pass**: zero drift across 3 SSoT docs.
**Warn**: 1-2 minor drifts (cosmetic, naming).
**Fail**: corpo leak detected OR admin path missing OR mesh peer missing in doc.
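The corpo-leak half of this phase can lean on the existing semgrep rule directly; a minimal sketch, assuming `semgrep` is installed and the command runs from the repo root:
```bash
#!/usr/bin/env bash
# Phase 5 sketch (corpo-leak gate): run the existing semgrep rule and
# exit 2 immediately on any finding, per the drift-gate constraint below.
set -euo pipefail

if ! semgrep scan --config .semgrep/synapse-no-corpo-leak.yaml --error --quiet .; then
  echo "FAIL drift: corpo leak detected"
  exit 2
fi
```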
### Phase 6 — Backup Integrity `--phase backup` (deep only by default)
**Reuses**: `infrastructure-ops` (PBS + DB backup probes)
Checks:
- PBS snapshots: latest < 24h, retention chain intact (daily/weekly/monthly)
- DB backups: synapse_db + devhub_db + authentik_db latest < 24h
- Sample restore test (deep mode only): `pg_restore --list` on latest `*.dump`, verify table count matches live
- TrueNAS replication lag: < 1h on critical datasets
**Pass**: snapshots fresh, restore list parses, replication lag < 1h.
**Warn**: snapshots 24-48h OR replication lag 1-6h.
**Fail**: snapshots > 48h OR restore corrupt OR replication > 6h.
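A read-only sketch of the deep-mode restore-list probe; backup paths and database names are placeholders:
```bash
#!/usr/bin/env bash
# Phase 6 sketch: read-only restore check -- parse the dump's table of
# contents and compare its table count against the live database.
# Paths and DB names are placeholders.
set -euo pipefail

dump=$(ls -t /backups/synapse_db/*.dump | head -1)

# Table entries listed in the archive TOC (no write, no restore performed)
dump_tables=$(pg_restore --list "$dump" | grep -c 'TABLE DATA' || true)

# Tables currently live in the public schema
live_tables=$(psql -At -d synapse_db \
  -c "SELECT count(*) FROM pg_tables WHERE schemaname = 'public';")

if [ "$dump_tables" -lt "$live_tables" ]; then
  echo "WARN backup: dump has ${dump_tables} tables, live has ${live_tables}"
fi
```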
### Phase 7 — Cert Expiry `--phase cert`
**Reuses**: `network-audit` cert section
Checks all certs:
- `*.axoiq.com` + `axoiq.com` apex (Cloudflare-managed)
- `*.s-gagnon.com` + apex (Let's Encrypt via Caddy)
- Internal mesh certs (NetBird/Tailscale) — auto-rotated, just verify
- Authentik signing cert (used for OIDC token signing)
- Forgejo HTTPS cert
- Vault transit cert (if used)
**Pass**: all > 30d to expiry.
**Warn**: any 14-30d to expiry → schedule rotation.
**Fail**: any < 14d to expiry → ALERT, immediate rotation.
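A sketch of the days-to-expiry computation that feeds the 14d/30d thresholds (hostnames are illustrative; `date -d` assumes GNU date):
```bash
#!/usr/bin/env bash
# Phase 7 sketch: compute days-to-expiry per endpoint and apply the
# 14d / 30d thresholds. Hostnames are illustrative; requires GNU date.
set -euo pipefail

for host in axoiq.com s-gagnon.com git.axoiq.com; do
  not_after=$(echo | openssl s_client -connect "${host}:443" -servername "$host" 2>/dev/null \
    | openssl x509 -noout -enddate | cut -d= -f2)
  days_left=$(( ($(date -d "$not_after" +%s) - $(date +%s)) / 86400 ))
  if   [ "$days_left" -lt 14 ]; then echo "FAIL cert ${host}: ${days_left}d left"
  elif [ "$days_left" -lt 30 ]; then echo "WARN cert ${host}: ${days_left}d left"
  fi
done
```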
### Phase 8 — NetBird Mesh `--phase netbird`
**Reuses**: `mesh-diagnostics`
Checks:
- Peer connectivity matrix (every peer pings every other peer)
- OIDC group sync recent (<1h since last `groups_propagation_enabled` job)
- Recent disconnects in `journalctl -u netbird` last 24h ≤ 3 events
- Route advertisements valid (no orphan routes)
- AccountInfoService gRPC reachable (smoke check + lesson cross-ref)
**Pass**: all peers reachable, OIDC sync fresh, disconnects ≤ 3.
**Warn**: 1-2 peers flapping OR OIDC sync 1-6h.
**Fail**: ≥3 peers down OR OIDC sync > 6h OR mass disconnect storm.
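A sketch of the disconnect-count part of this phase; the grep pattern is an assumption about the NetBird client's journal wording:
```bash
#!/usr/bin/env bash
# Phase 8 sketch: count netbird disconnect events over the last 24h.
# The grep pattern is an assumption about the client's log phrasing.
set -uo pipefail

disconnects=$(journalctl -u netbird --since "24 hours ago" --no-pager \
  | grep -ci 'disconnect' || true)

if [ "$disconnects" -gt 3 ]; then
  echo "WARN netbird: ${disconnects} disconnects in 24h"
else
  echo "PASS netbird: ${disconnects} disconnects in 24h"
fi
```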
## Pipeline
```
                                    ┌─ Phase 1 (network) ─────┐
                                    ├─ Phase 2 (sso) ─────────┤
                                    ├─ Phase 3 (caddy) ───────┤
USER → atlas-team spawn (parallel) ─┼─ Phase 4 (ci) ──────────┼─→ aggregator → dashboard
                                    ├─ Phase 5 (drift) ───────┤            │
                                    ├─ Phase 6 (backup) ──────┤            └─→ Forgejo issue (if FAIL + --routine)
                                    ├─ Phase 7 (cert) ────────┤
                                    └─ Phase 8 (netbird) ─────┘
```
## Spawn Pattern (atlas-team)
Each phase spawns 1 worker. The aggregator is a thin DET node (no LLM
reasoning) — pure JSON merge + ASCII render. Never escalate to a larger model
for the aggregation step.
```yaml
# pseudocode of the spawn manifest atlas-team uses
team:
  name: infra-audit
  workers:
    - phase: network
      skill: network-audit
      timeout_s: 60
    - phase: sso
      skill: infrastructure-change   # reuses Authentik client
      timeout_s: 90
    - phase: caddy
      skill: infra-health
      timeout_s: 90
    - phase: ci
      skill: ci-health
      timeout_s: 60
    - phase: drift
      skill: code-analysis           # for git diff + doc parse
      timeout_s: 60
    # --deep only:
    - phase: backup
      skill: infrastructure-ops
      timeout_s: 180
    - phase: cert
      skill: network-audit
      timeout_s: 30
    - phase: netbird
      skill: mesh-diagnostics
      timeout_s: 60
  aggregator:
    type: det
    output: dashboard
  artifacts:
    - memory/infra-audit-${TS}.md
    - .atlas/runs/infra-audit/${TS}/raw/*.json
```
## Output: ASCII Dashboard
```
╔══════════════════════════════════════════════════════════════════════╗
║ INFRA AUDIT — 2026-04-30 18:42 EDT — mode: --deep — duration: 9m 12s ║
╠══════════════════════════════════════════════════════════════════════╣
║ Phase           │ Status │ Findings                                  ║
║ ────────────────┼────────┼────────────────────────────────────────── ║
║ 1. network      │ PASS   │ DNS 14/14, latency p50 12ms, SSL 367d     ║
║ 2. sso          │ WARN   │ 17/19 providers OK, 2 missing groups scope║
║ 3. caddy        │ PASS   │ 32/32 routes 200/302/401, redirects OK    ║
║ 4. ci           │ PASS   │ queue=2, agents 5/5, fail-rate 8.4%       ║
║ 5. drift        │ WARN   │ admin-access-paths.md missing peer "ai"   ║
║ 6. backup       │ PASS   │ PBS 4h, DB 6h, restore-list OK            ║
║ 7. cert         │ PASS   │ min expiry 89d (s-gagnon.com wildcard)    ║
║ 8. netbird      │ PASS   │ 12/12 peers, OIDC sync 23m, disconnects=0 ║
╠══════════════════════════════════════════════════════════════════════╣
║ OVERALL: WARN — 2 actionable items, 0 blockers                        ║
╚══════════════════════════════════════════════════════════════════════╝
Actionable next steps:
  1. [sso]   PATCH Authentik provider id=7 (paperless) + id=12 (immich)
             to add 'groups' property mapping. Reuse pattern from
             lesson_authentik_provider_zero_mappings_silent_oidc_fail.md
  2. [drift] Update infrastructure/docs/admin-access-paths.md to include
             peer "ai" (VM 551) added 2026-04-23. Re-run --phase drift
             to confirm zero deviation.
Artifacts:
  • memory/infra-audit-2026-04-30T18-42.md
  • .atlas/runs/infra-audit/2026-04-30T18-42/raw/*.json
```
## Pass/Warn/Fail aggregation rule
- All PASS → exit 0, **OVERALL: PASS**
- ≥1 WARN, 0 FAIL → exit 0, **OVERALL: WARN** (actionable, non-blocking)
- ≥1 FAIL → exit 2, **OVERALL: FAIL** (blocking, opens Forgejo issue if `--routine`)
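A minimal sketch of that DET reduction, assuming each phase worker writes a raw JSON artifact with a top-level `status` field of `PASS`, `WARN`, or `FAIL` (the field name is an assumption; the artifact path comes from the manifest above):
```bash
#!/usr/bin/env bash
# Aggregator sketch: pure reduce over per-phase JSON, no LLM involved.
# Assumes each phase wrote raw/<phase>.json with a top-level "status" field.
set -euo pipefail

run_dir=".atlas/runs/infra-audit/${TS}/raw"

fails=$(jq -rs '[.[] | select(.status == "FAIL")] | length' "${run_dir}"/*.json)
warns=$(jq -rs '[.[] | select(.status == "WARN")] | length' "${run_dir}"/*.json)

if   [ "$fails" -gt 0 ]; then echo "OVERALL: FAIL"; exit 2
elif [ "$warns" -gt 0 ]; then echo "OVERALL: WARN"; exit 0
else                          echo "OVERALL: PASS"; exit 0
fi
```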
## Routine integration (`--routine`)
When invoked from a cron via `atlas-routines`:
- Run all 8 phases (deep mode auto-on)
- Output dashboard to `memory/infra-audit-routine-${ISO_DATE}.md`
- If `OVERALL=FAIL` → open Forgejo issue in `axoiq/synapse` repo with
label `infra-audit-fail` + assign to Seb
- If `OVERALL=WARN` → append to `memory/infra-audit-warn-stream.md` (rolling
log; no issue spam)
Suggested cadence:
- Weekly Monday 06:00 EDT (catches weekend drift)
- Pre-investor-demo (manual `atlas infra-audit --deep` 30 min before)
- Post-major-change (auto-trigger from `infrastructure-change` skill on success)
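For the weekly cadence, a minimal crontab sketch (assumes the cron host runs in the Eastern timezone and has `atlas` on its PATH; `atlas-routines` normally owns this schedule, and the log path is a placeholder):
```bash
# Weekly infra audit, Mondays 06:00 local time (sketch -- atlas-routines
# normally schedules this; the log path is a placeholder)
0 6 * * 1  atlas infra-audit --routine >> /var/log/atlas/infra-audit.log 2>&1
```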
## Reused skills (composable building blocks)
| Skill | Used by phases | Why reused |
|--------------------------|----------------|------------------------------------------|
| `infra-health` | 3 (caddy) | HTTP probe + LAN/WAN/SSO matrix |
| `network-audit` | 1, 7 | DNS + ports + SSL |
| `mesh-diagnostics` | 8 | NetBird/Tailscale peer + OIDC sync |
| `infrastructure-ops` | 6 | PBS + DB backup probes (read-only) |
| `ci-health` | 4 | Woodpecker + Forgejo Actions observability |
| `infrastructure-change` | 2 | Authentik API client (read-only mode) |
| `code-analysis` | 5 | git diff + structured doc parser |
| `secret-manager` | (env) | RESEND_API_KEY, FORGEJO_TOKEN injection |
**Anti-pattern avoided**: this skill does **not** duplicate any of the above
logic. It composes. Each phase is 5-15 lines of glue plus a `Skill` invocation.
## Constraints
- **Read-only by default**: NEVER mutate state. `--phase backup --deep`
performs a `pg_restore --list` (read-only) but never `--clean`.
- **Stateless**: phases don't share writable memory. Aggregator is a pure
reducer over per-phase JSON.
- **Token budget**: aggregator is DET (no LLM). Each phase capped at its own
budget per `atlas-team` worker config.
- **No HITL gate**: this skill is informational. Action skills
(`infrastructure-change`, `deploy-hotfix`) handle HITL.
- **Drift gate**: if `Phase 5 (drift)` flags corpo leak → exit 2 immediately,
do not run remaining phases.
## Cross-references
- Companion skill: `infra-health` (single-pass health) vs `infra-audit` (multi-phase)
- Source-of-truth doc: `infrastructure/docs/admin-access-paths.md`
- Lessons learned absorbed into checks:
- `lesson_authentik_provider_zero_mappings_silent_oidc_fail.md` → Phase 2
- `lesson_caddy_authentik_header_rename_bridge.md` → Phase 3
- `lesson_caddy_netbird_grpc_account_info_service_bypass.md` → Phase 3, 8
- `lesson_woodpecker_disk_full_queue_zombies.md` → Phase 4
- `lesson_forgejo_actions_runner_replaces_ssh_deploy.md` → Phase 4
- `feedback_authentik_policy_bindings_default.md` → Phase 2
- `feedback_admin_access_invariant_two_paths.md` → Phase 1, 5
- `feedback_cellular_external_simulation.md` → Phase 2 (extension hint)
- Plan SSoT: `.blueprint/plans/ultrathink-regarde-ce-qui-abundant-petal.md` Section H W6.1
- Routines integration: `atlas-routines` skill (cloud cron, headless)
## FAQ
**Q: Why parallel phases vs one big sequential audit?**
A: Wall-clock time is ~9 min in parallel vs ~30 min sequential, with the same
agent budget because each worker is narrowly scoped. The trade-off is aggregator
complexity, but the aggregator is DET (cheap).
**Q: Can I run just one phase fast?**
A: Yes — `--phase netbird` runs in ~60s, no team spawn (single skill direct call).
**Q: The backup phase worries me; what if the `--deep` restore check actually mutates something?**
A: It doesn't. `pg_restore --list` only reads the archive's table of contents;
it never opens a database connection. Full sample restore tests are routed
through a separate `infrastructure-ops --restore-test` flow with HITL.
**Q: How does this differ from `enterprise-audit`?**
A: `enterprise-audit` = 14-dimension static analysis (codebase quality,
security, RBAC, observability, etc). `infra-audit` = runtime infra health
(network, SSO, CI, mesh, certs). Complementary, not overlapping.
**Q: What if a phase times out?**
A: The phase reports `FAIL` with reason `timeout` and the aggregator continues
with the remaining phases; the overall result is FAIL. Re-run `--phase <name>`
standalone with a larger timeout via `ATLAS_AUDIT_TIMEOUT_S=300`.