---
name: infra-audit
description: "Comprehensive infrastructure audit pipeline (5-8 parallel phases). Use when 'infra audit', 'check infra', 'mesh diagnostic', 'audit infra', or before major infra changes."
mode: [ops]
effort: medium
version: 1.0.0
tier: [admin]
---
# Infra Audit — Comprehensive Parallel Audit Pipeline
> Spawns 5-8 parallel audit phases via `atlas-team`, each reusing an existing
> infra skill as a building block. Aggregates per-phase status into a single
> ASCII dashboard with drift summary and actionable next steps.
>
> **ROI target**: 20-25 min/session saved on the manual audits Seb runs
> regularly (pre-change, post-change, weekly health, after CI fleet incident).
## When to Use
- User says "infra audit", "audit infra", "check infra", "mesh diagnostic"
- Before any major infra change (Caddy/Authentik/DNS/firewall/VM migration)
- After fleet incident (CI red, deploy stuck, OAuth race, mesh disconnect)
- Weekly cadence (recurring routine via `atlas-routines`)
- Investor/client demo prep (validate full stack health)
- Drift detection vs `infrastructure/docs/admin-access-paths.md` SSoT
## When NOT to Use
- Single-service health check → use `infra-health` directly
- Targeted DNS/SSL only → use `network-audit`
- VPN debug only → use `mesh-diagnostics`
- Endpoint API ping → use `api-healthcheck`
- Active change/migration → use `infrastructure-change` (this skill is read-only)
## CLI
```bash
# Default: 5 quick phases (network, sso, caddy, ci, drift) — ~3-5 min
atlas infra-audit
# Deep: all 8 phases including backup restore test — ~10-15 min
atlas infra-audit --deep
# Single phase (debug or follow-up)
atlas infra-audit --phase {network|sso|caddy|ci|drift|backup|cert|netbird}
# Output formats
atlas infra-audit --format dashboard # ASCII (default)
atlas infra-audit --format json # for scripts/dashboards
atlas infra-audit --format markdown # for handoff/memory file
# Routine mode (cron-friendly, opens Forgejo issue if any FAIL)
atlas infra-audit --routine
```
## The 8 Audit Phases
Each phase is **spawnable as a parallel `atlas-team` agent** with its own
read-only worktree. Phases share **zero mutable state** — they aggregate
through the report writer at the end.
### Phase 1 — Network Baseline `--phase network`
**Reuses**: `network-audit` (DNS + ports + SSL + VLAN)
Checks:
- DNS resolution (laptop → VM 550, laptop → VM 801, mesh → all peers)
- Ping latency (laptop → 100.64.0.1 prod, laptop → 100.64.0.x dev, baseline budget < 50ms)
- UniFi VPN reachability (admin path #1 invariant per `admin-access-paths.md`)
- SSL cert validity (axoiq.com + s-gagnon.com wildcards)
**Pass**: all DNS resolve, latency < 50ms, certs valid > 30d.
**Warn**: latency 50-150ms OR certs 14-30d expiry.
**Fail**: DNS NXDOMAIN, latency > 150ms, certs < 14d expiry.
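A minimal sketch of the kinds of probes this phase runs, assuming standard `dig`, `ping`, and `openssl` tooling; the hostnames and mesh IP are illustrative placeholders, not the real fleet values:
```bash
#!/usr/bin/env bash
# Phase 1 sketch: DNS, mesh latency, and cert probes (read-only).
# Hostnames and mesh IPs below are illustrative placeholders.
set -uo pipefail

# DNS resolution: flag NXDOMAIN / empty answers
for host in vm550.example.internal vm801.example.internal; do
  dig +short "$host" | grep -q . || echo "FAIL dns: $host does not resolve"
done

# Mesh latency: avg RTT budget 50ms (WARN) / 150ms (FAIL)
rtt=$(ping -c 4 -q 100.64.0.1 | awk -F/ '/^(rtt|round-trip)/ {print $5}')
if [ -z "${rtt}" ]; then
  echo "FAIL latency: 100.64.0.1 unreachable"
else
  awk -v rtt="$rtt" 'BEGIN { if (rtt > 150) print "FAIL latency " rtt "ms";
                             else if (rtt > 50) print "WARN latency " rtt "ms" }'
fi

# Wildcard cert: warn if expiring within 30 days (2592000 s)
echo | openssl s_client -connect axoiq.com:443 -servername axoiq.com 2>/dev/null \
  | openssl x509 -noout -checkend 2592000 >/dev/null \
  || echo "WARN cert: axoiq.com expires in under 30 days"
```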
### Phase 2 — SSO Scopes `--phase sso`
**Reuses**: Authentik API client (lib in `infrastructure-change` skill)
Checks (anchored on lesson `lesson_authentik_provider_zero_mappings_silent_oidc_fail.md`):
- All Authentik OIDC providers have 4 standard scopes: `openid`, `profile`, `email`, `groups`
- Each app has `policy_bindings` set (per `feedback_authentik_policy_bindings_default.md`)
- `deny-external-devs` policy bound on sensitive apps (vault/proxmox/coder/forgejo-admin)
- Latest `last_sign_in` < 7d for active service accounts (charles, jeremy, atlas)
**Pass**: 18/18 providers w/ all 4 scopes, all sensitive apps bound.
**Warn**: 1-2 providers missing 1 scope OR 1-2 apps unbound.
**Fail**: ≥3 providers w/ partial mappings OR sensitive app unbound.
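A conceptual sketch of the scope check (the real phase reuses the Authentik API client from `infrastructure-change`; the `/api/v3/providers/oauth2/` endpoint and the shape of `property_mappings` in the response are assumptions to verify against your Authentik version):
```bash
#!/usr/bin/env bash
# Phase 2 sketch: flag OIDC providers with fewer than 4 scope mappings.
# Assumes AUTHENTIK_URL + AUTHENTIK_TOKEN env vars; the 'property_mappings'
# field shape is an assumption -- verify against your Authentik version.
set -euo pipefail

curl -fsS -H "Authorization: Bearer ${AUTHENTIK_TOKEN}" \
  "${AUTHENTIK_URL}/api/v3/providers/oauth2/" \
| jq -r '.results[]
         | select((.property_mappings | length) < 4)
         | "WARN provider \(.pk) (\(.name)): \(.property_mappings | length) scope mappings"'
```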
### Phase 3 — Caddy Routes `--phase caddy`
**Reuses**: `infra-health` (HTTP probe matrix)
Checks:
- Every entry in `/etc/caddy/Caddyfile` (and homelab equivalents) returns 200/302/401
- HTTP→HTTPS redirect on all subdomains
- `forward_auth` blocks point to live Authentik outpost
- `copy_headers` rename for legacy services (per `lesson_caddy_authentik_header_rename_bridge.md`)
- NetBird gRPC bypass `@grpc` matcher includes `/management.AccountInfoService/*` (per `lesson_caddy_netbird_grpc_account_info_service_bypass.md`)
**Pass**: all routes 200/302/401 expected, redirects intact.
**Warn**: 1-2 routes 5xx but auto-recover.
**Fail**: ≥3 routes hard-down OR forward_auth broken.
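A sketch of the route probe loop; the subdomain list here is illustrative, the real phase derives it from the Caddyfile / `infra-health` probe matrix:
```bash
#!/usr/bin/env bash
# Phase 3 sketch: probe each public route and flag unexpected status codes.
# The route list is illustrative -- the real phase reads it from the Caddyfile.
set -uo pipefail

routes="auth.axoiq.com git.axoiq.com vault.axoiq.com"
for host in $routes; do
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "https://${host}/")
  case "$code" in
    200|302|401) ;;                          # expected for proxied / SSO-gated apps
    *) echo "FAIL route ${host} -> HTTP ${code}" ;;
  esac
done
```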
### Phase 4 — CI Health `--phase ci`
**Reuses**: `ci-health` (Woodpecker + Forgejo Actions observability)
Checks:
- Woodpecker queue depth ≤ 3 (else stuck)
- Agent fleet status (5 active expected post-cleanup)
- Recent failure rate ≤ 15% over 24h
- No `agent_id NULL` zombies (per `lesson_woodpecker_disk_full_queue_zombies.md`)
- Forgejo Actions runner alive on VM 801 (per `lesson_forgejo_actions_runner_replaces_ssh_deploy.md`)
**Pass**: queue ≤ 3, all agents online, failure rate ≤ 15%.
**Warn**: queue 4-9 OR failure rate 16-30%.
**Fail**: queue ≥ 10 OR agent down OR failure rate > 30%.
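A sketch of the queue-depth part of this phase, assuming a Woodpecker admin token in `WOODPECKER_TOKEN`; the queue-info endpoint and the JSON field names are assumptions to verify against your Woodpecker version:
```bash
#!/usr/bin/env bash
# Phase 4 sketch: check Woodpecker queue depth against the thresholds above.
# Endpoint and JSON field names are assumptions about the Woodpecker API.
set -euo pipefail

info=$(curl -fsS -H "Authorization: Bearer ${WOODPECKER_TOKEN}" \
  "${WOODPECKER_URL}/api/queue/info")
pending=$(echo "$info" | jq -r '.stats.pending_count // .pending // 0')

if   [ "$pending" -ge 10 ]; then echo "FAIL queue depth ${pending}"
elif [ "$pending" -ge 4  ]; then echo "WARN queue depth ${pending}"
else                             echo "PASS queue depth ${pending}"
fi
```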
### Phase 5 — Deviation Compare `--phase drift`
**Reuses**: `git diff` + structured doc parser
Checks current state vs SSoT:
- `infrastructure/docs/admin-access-paths.md` → live mesh peer matrix
- `.blueprint/MEGA-PLAN.md` Section L → live ecosystem URL matrix
- `axoiq-business/finance/` references → grep tech repos for corpo leak (per `.semgrep/synapse-no-corpo-leak.yaml`)
- `lefthook.yml` hooks (per `.claude/rules/no-corpo-leak.md`) actually configured
**Pass**: zero drift across 3 SSoT docs.
**Warn**: 1-2 minor drifts (cosmetic, naming).
**Fail**: corpo leak detected OR admin path missing OR mesh peer missing in doc.
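The corpo-leak half of this phase can lean on the existing semgrep rule directly; a minimal sketch, assuming `semgrep` is installed and the command runs from the repo root:
```bash
#!/usr/bin/env bash
# Phase 5 sketch (corpo-leak gate): run the existing semgrep rule and
# exit 2 immediately on any finding, per the drift-gate constraint below.
set -euo pipefail

if ! semgrep scan --config .semgrep/synapse-no-corpo-leak.yaml --error --quiet .; then
  echo "FAIL drift: corpo leak detected"
  exit 2
fi
```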
### Phase 6 — Backup Integrity `--phase backup` (deep only by default)
**Reuses**: `infrastructure-ops` (PBS + DB backup probes)
Checks:
- PBS snapshots: latest < 24h, retention chain intact (daily/weekly/monthly)
- DB backups: synapse_db + devhub_db + authentik_db latest < 24h
- Sample restore test (deep mode only): `pg_restore --list` on latest `*.dump`, verify table count matches live
- TrueNAS replication lag: < 1h on critical datasets
**Pass**: snapshots fresh, restore list parses, replication lag < 1h.
**Warn**: snapshots 24-48h OR replication lag 1-6h.
**Fail**: snapshots > 48h OR restore corrupt OR replication > 6h.
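A read-only sketch of the deep-mode restore-list probe; backup paths and database names are placeholders:
```bash
#!/usr/bin/env bash
# Phase 6 sketch: read-only restore check -- parse the dump's table of
# contents and compare its table count against the live database.
# Paths and DB names are placeholders.
set -euo pipefail

dump=$(ls -t /backups/synapse_db/*.dump | head -1)

# Table entries listed in the archive TOC (no write, no restore performed)
dump_tables=$(pg_restore --list "$dump" | grep -c 'TABLE DATA' || true)

# Tables currently live in the public schema
live_tables=$(psql -At -d synapse_db \
  -c "SELECT count(*) FROM pg_tables WHERE schemaname = 'public';")

if [ "$dump_tables" -lt "$live_tables" ]; then
  echo "WARN backup: dump has ${dump_tables} tables, live has ${live_tables}"
fi
```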
### Phase 7 — Cert Expiry `--phase cert`
**Reuses**: `network-audit` cert section
Checks all certs:
- `*.axoiq.com` + `axoiq.com` apex (Cloudflare-managed)
- `*.s-gagnon.com` + apex (Let's Encrypt via Caddy)
- Internal mesh certs (NetBird/Tailscale) — auto-rotated, just verify
- Authentik signing cert (used for OIDC token signing)
- Forgejo HTTPS cert
- Vault transit cert (if used)
**Pass**: all > 30d to expiry.
**Warn**: any 14-30d to expiry → schedule rotation.
**Fail**: any < 14d to expiry → ALERT, immediate rotation.
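A sketch of the days-to-expiry computation that feeds the 14d/30d thresholds (hostnames are illustrative; `date -d` assumes GNU date):
```bash
#!/usr/bin/env bash
# Phase 7 sketch: compute days-to-expiry per endpoint and apply the
# 14d / 30d thresholds. Hostnames are illustrative; requires GNU date.
set -euo pipefail

for host in axoiq.com s-gagnon.com git.axoiq.com; do
  not_after=$(echo | openssl s_client -connect "${host}:443" -servername "$host" 2>/dev/null \
    | openssl x509 -noout -enddate | cut -d= -f2)
  days_left=$(( ($(date -d "$not_after" +%s) - $(date +%s)) / 86400 ))
  if   [ "$days_left" -lt 14 ]; then echo "FAIL cert ${host}: ${days_left}d left"
  elif [ "$days_left" -lt 30 ]; then echo "WARN cert ${host}: ${days_left}d left"
  fi
done
```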
### Phase 8 — NetBird Mesh `--phase netbird`
**Reuses**: `mesh-diagnostics`
Checks:
- Peer connectivity matrix (every peer pings every other peer)
- OIDC group sync recent (<1h since last `groups_propagation_enabled` job)
- Recent disconnects in `journalctl -u netbird` last 24h ≤ 3 events
- Route advertisements valid (no orphan routes)
- AccountInfoService gRPC reachable (smoke check + lesson cross-ref)
**Pass**: all peers reachable, OIDC sync fresh, disconnects ≤ 3.
**Warn**: 1-2 peers flapping OR OIDC sync 1-6h.
**Fail**: ≥3 peers down OR OIDC sync > 6h OR mass disconnect storm.
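A sketch of the disconnect-count part of this phase; the grep pattern is an assumption about the NetBird client's journal wording:
```bash
#!/usr/bin/env bash
# Phase 8 sketch: count netbird disconnect events over the last 24h.
# The grep pattern is an assumption about the client's log phrasing.
set -uo pipefail

disconnects=$(journalctl -u netbird --since "24 hours ago" --no-pager \
  | grep -ci 'disconnect' || true)

if [ "$disconnects" -gt 3 ]; then
  echo "WARN netbird: ${disconnects} disconnects in 24h"
else
  echo "PASS netbird: ${disconnects} disconnects in 24h"
fi
```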
## Pipeline
```
                                    ┌─ Phase 1 (network) ─────┐
                                    ├─ Phase 2 (sso) ─────────┤
                                    ├─ Phase 3 (caddy) ───────┤
USER → atlas-team spawn (parallel) ─┼─ Phase 4 (ci) ──────────┼─→ aggregator → dashboard
                                    ├─ Phase 5 (drift) ───────┤            │
                                    ├─ Phase 6 (backup) ──────┤            └─→ Forgejo issue (if FAIL + --routine)
                                    ├─ Phase 7 (cert) ────────┤
                                    └─ Phase 8 (netbird) ─────┘
```
## Spawn Pattern (atlas-team)
Each phase spawns 1 worker. The aggregator is a thin DET node (no LLM
reasoning) — pure JSON merge + ASCII render. Never escalate to a larger model
for the aggregation step.
```yaml
# pseudocode of the spawn manifest atlas-team uses
team:
  name: infra-audit
  workers:
    - phase: network
      skill: network-audit
      timeout_s: 60
    - phase: sso
      skill: infrastructure-change   # reuses Authentik client
      timeout_s: 90
    - phase: caddy
      skill: infra-health
      timeout_s: 90
    - phase: ci
      skill: ci-health
      timeout_s: 60
    - phase: drift
      skill: code-analysis           # for git diff + doc parse
      timeout_s: 60
    # --deep only:
    - phase: backup
      skill: infrastructure-ops
      timeout_s: 180
    - phase: cert
      skill: network-audit
      timeout_s: 30
    - phase: netbird
      skill: mesh-diagnostics
      timeout_s: 60
  aggregator:
    type: det
    output: dashboard
  artifacts:
    - memory/infra-audit-${TS}.md
    - .atlas/runs/infra-audit/${TS}/raw/*.json
```
## Output: ASCII Dashboard
```
╔══════════════════════════════════════════════════════════════════════╗
║ INFRA AUDIT — 2026-04-30 18:42 EDT — mode: --deep — duration: 9m 12s ║
╠══════════════════════════════════════════════════════════════════════╣
║ Phase           │ Status │ Findings                                  ║
║ ────────────────┼────────┼────────────────────────────────────────── ║
║ 1. network      │ PASS   │ DNS 14/14, latency p50 12ms, SSL 367d     ║
║ 2. sso          │ WARN   │ 17/19 providers OK, 2 missing groups scope║
║ 3. caddy        │ PASS   │ 32/32 routes 200/302/401, redirects OK    ║
║ 4. ci           │ PASS   │ queue=2, agents 5/5, fail-rate 8.4%       ║
║ 5. drift        │ WARN   │ admin-access-paths.md missing peer "ai"   ║
║ 6. backup       │ PASS   │ PBS 4h, DB 6h, restore-list OK            ║
║ 7. cert         │ PASS   │ min expiry 89d (s-gagnon.com wildcard)    ║
║ 8. netbird      │ PASS   │ 12/12 peers, OIDC sync 23m, disconnects=0 ║
╠══════════════════════════════════════════════════════════════════════╣
║ OVERALL: WARN — 2 actionable items, 0 blockers                        ║
╚══════════════════════════════════════════════════════════════════════╝
Actionable next steps:
  1. [sso]   PATCH Authentik provider id=7 (paperless) + id=12 (immich)
             to add 'groups' property mapping. Reuse pattern from
             lesson_authentik_provider_zero_mappings_silent_oidc_fail.md
  2. [drift] Update infrastructure/docs/admin-access-paths.md to include
             peer "ai" (VM 551) added 2026-04-23. Re-run --phase drift
             to confirm zero deviation.
Artifacts:
  • memory/infra-audit-2026-04-30T18-42.md
  • .atlas/runs/infra-audit/2026-04-30T18-42/raw/*.json
```
## Pass/Warn/Fail aggregation rule
- All PASS → exit 0, **OVERALL: PASS**
- ≥1 WARN, 0 FAIL → exit 0, **OVERALL: WARN** (actionable, non-blocking)
- ≥1 FAIL → exit 2, **OVERALL: FAIL** (blocking, opens Forgejo issue if `--routine`)
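A minimal sketch of that DET reduction, assuming each phase worker writes a raw JSON artifact with a top-level `status` field of `PASS`, `WARN`, or `FAIL` (the field name is an assumption; the artifact path comes from the manifest above):
```bash
#!/usr/bin/env bash
# Aggregator sketch: pure reduce over per-phase JSON, no LLM involved.
# Assumes each phase wrote raw/<phase>.json with a top-level "status" field.
set -euo pipefail

run_dir=".atlas/runs/infra-audit/${TS}/raw"

fails=$(jq -rs '[.[] | select(.status == "FAIL")] | length' "${run_dir}"/*.json)
warns=$(jq -rs '[.[] | select(.status == "WARN")] | length' "${run_dir}"/*.json)

if   [ "$fails" -gt 0 ]; then echo "OVERALL: FAIL"; exit 2
elif [ "$warns" -gt 0 ]; then echo "OVERALL: WARN"; exit 0
else                          echo "OVERALL: PASS"; exit 0
fi
```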
## Routine integration (`--routine`)
When invoked from a cron via `atlas-routines`:
- Run all 8 phases (deep mode auto-on)
- Output dashboard to `memory/infra-audit-routine-${ISO_DATE}.md`
- If `OVERALL=FAIL` → open Forgejo issue in `axoiq/synapse` repo with
label `infra-audit-fail` + assign to Seb
- If `OVERALL=WARN` → append to `memory/infra-audit-warn-stream.md` (rolling
log; no issue spam)
Suggested cadence:
- Weekly Monday 06:00 EDT (catches weekend drift)
- Pre-investor-demo (manual `atlas infra-audit --deep` 30 min before)
- Post-major-change (auto-trigger from `infrastructure-change` skill on success)
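For the weekly cadence, a minimal crontab sketch (assumes the cron host runs in the Eastern timezone and has `atlas` on its PATH; `atlas-routines` normally owns this schedule, and the log path is a placeholder):
```bash
# Weekly infra audit, Mondays 06:00 local time (sketch -- atlas-routines
# normally schedules this; the log path is a placeholder)
0 6 * * 1  atlas infra-audit --routine >> /var/log/atlas/infra-audit.log 2>&1
```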
## Reused skills (composable building blocks)
| Skill | Used by phases | Why reused |
|--------------------------|----------------|------------------------------------------|
| `infra-health` | 3 (caddy) | HTTP probe + LAN/WAN/SSO matrix |
| `network-audit` | 1, 7 | DNS + ports + SSL |
| `mesh-diagnostics` | 8 | NetBird/Tailscale peer + OIDC sync |
| `infrastructure-ops` | 6 | PBS + DB backup probes (read-only) |
| `ci-health` | 4 | Woodpecker + Forgejo Actions observability |
| `infrastructure-change` | 2 | Authentik API client (read-only mode) |
| `code-analysis` | 5 | git diff + structured doc parser |
| `secret-manager` | (env) | RESEND_API_KEY, FORGEJO_TOKEN injection |
**Anti-pattern avoided**: this skill does **not** duplicate any of the above
logic. It composes. Each phase is 5-15 lines of glue plus a `Skill` invocation.
## Constraints
- **Read-only by default**: NEVER mutate state. `--phase backup --deep`
performs a `pg_restore --list` (read-only) but never `--clean`.
- **Stateless**: phases don't share writable memory. Aggregator is a pure
reducer over per-phase JSON.
- **Token budget**: aggregator is DET (no LLM). Each phase capped at its own
budget per `atlas-team` worker config.
- **No HITL gate**: this skill is informational. Action skills
(`infrastructure-change`, `deploy-hotfix`) handle HITL.
- **Drift gate**: if `Phase 5 (drift)` flags corpo leak → exit 2 immediately,
do not run remaining phases.
## Cross-references
- Companion skill: `infra-health` (single-pass health) vs `infra-audit` (multi-phase)
- Source-of-truth doc: `infrastructure/docs/admin-access-paths.md`
- Lessons learned absorbed into checks:
- `lesson_authentik_provider_zero_mappings_silent_oidc_fail.md` → Phase 2
- `lesson_caddy_authentik_header_rename_bridge.md` → Phase 3
- `lesson_caddy_netbird_grpc_account_info_service_bypass.md` → Phase 3, 8
- `lesson_woodpecker_disk_full_queue_zombies.md` → Phase 4
- `lesson_forgejo_actions_runner_replaces_ssh_deploy.md` → Phase 4
- `feedback_authentik_policy_bindings_default.md` → Phase 2
- `feedback_admin_access_invariant_two_paths.md` → Phase 1, 5
- `feedback_cellular_external_simulation.md` → Phase 2 (extension hint)
- Plan SSoT: `.blueprint/plans/ultrathink-regarde-ce-qui-abundant-petal.md` Section H W6.1
- Routines integration: `atlas-routines` skill (cloud cron, headless)
## FAQ
**Q: Why parallel phases vs one big sequential audit?**
A: Wall-clock time is ~9 min in parallel vs ~30 min sequential, with the same
agent budget because each worker is narrowly scoped. The trade-off is aggregator
complexity, but the aggregator is DET (cheap).
**Q: Can I run just one phase fast?**
A: Yes — `--phase netbird` runs in ~60s, no team spawn (single skill direct call).
**Q: The backup phase worries me; what if the `--deep` restore check actually mutates something?**
A: It doesn't. `pg_restore --list` only reads the archive's table of contents;
it never opens a database connection. Full sample restore tests are routed
through a separate `infrastructure-ops --restore-test` flow with HITL.
**Q: How does this differ from `enterprise-audit`?**
A: `enterprise-audit` = 14-dimension static analysis (codebase quality,
security, RBAC, observability, etc). `infra-audit` = runtime infra health
(network, SSO, CI, mesh, certs). Complementary, not overlapping.
**Q: What if a phase times out?**
A: The phase reports `FAIL` with reason `timeout` and the aggregator continues
with the remaining phases; the overall result is FAIL. Re-run `--phase <name>`
standalone with a larger timeout via `ATLAS_AUDIT_TIMEOUT_S=300`.