---
name: infrastructure-ops
description: "Admin-tier infrastructure operations. This skill should be used when the user asks to 'manage VMs', 'LXC', 'proxmox ops', 'docker ops', 'backup DR', 'database admin', '/atlas infra ops', or needs Tailscale/DNS/Cloudflare/monitoring changes with HITL gates."
mode: [ops]
effort: high
---
# Infrastructure Ops
Manage homelab and production infrastructure safely.
Every destructive operation requires HITL approval and pre-action backup verification.
## Pipeline
```
AUDIT → PLAN → EXECUTE → VERIFY → (ROLLBACK if unhealthy)
```
## Scope
| Domain | Tools | Operations |
|--------|-------|-----------|
| Virtualization | Proxmox, LXC, QEMU | Create, resize, snapshot, migrate |
| Containers | Docker, Compose | Stack up/down, build, prune, volumes |
| Networking | Tailscale, Cloudflare, CoreDNS | ACL, DNS, tunnels, firewall |
| Monitoring | Grafana, Prometheus, Uptime Kuma | Dashboards, alerts, probes |
| Databases | PostgreSQL 17, Valkey 8 | Backup, vacuum, reindex, roles, slow queries |
| Backup & DR | pg_dump, rsync, Proxmox snapshots | Schedule, verify, restore test |
| Capacity | CPU/RAM/disk metrics | Trends, provisioning recommendations |
## Workflow
### Step 1: AUDIT — Establish ground truth
Run: `pvesh get /nodes`, `pct list && qm list`, `docker compose ls && docker ps`, `df -h`, `tailscale status`, DB health queries. Present summary table.
### Step 2: PLAN — Concrete change plan before touching anything
1. Goal (1 sentence)
2. Action steps + expected outcomes
3. Dependencies/order
4. Flag `⚠️ DESTRUCTIVE` steps
5. Rollback per destructive step
6. Success criteria
**HITL Gate**: AskUserQuestion → Approve | Dry run | Modify | Abort
### Step 3: EXECUTE — Sequential, verify after each destructive step
Check exit code/logs, run health check, report status. **Retry cap: 2** → AskUserQuestion with error + 2-3 options.
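The retry cap can be sketched as a small wrapper, assuming placeholder step and health-check commands (`run_with_retry` and the example commands are illustrative, not part of the skill):

```shell
#!/bin/sh
# Retry cap sketch: one initial attempt plus at most MAX_RETRIES retries.
# Each attempt runs the step, then its health check; on exhaustion the
# error is surfaced for a human decision (AskUserQuestion).
MAX_RETRIES=2

run_with_retry() {
  step_cmd=$1
  check_cmd=$2
  attempt=0
  while [ "$attempt" -le "$MAX_RETRIES" ]; do
    if sh -c "$step_cmd" && sh -c "$check_cmd"; then
      echo "ok after $attempt retries"
      return 0
    fi
    attempt=$((attempt + 1))
  done
  echo "failed after $MAX_RETRIES retries: escalating to operator" >&2
  return 1
}
```

Example: `run_with_retry "docker compose -f stack.yml restart api" "curl -fsS http://localhost:8080/health"` (compose file and endpoint are hypothetical).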
### Step 4: VERIFY — Full health sweep
Check service endpoints, container restart counts, Prometheus targets, Uptime Kuma. If any fail → ROLLBACK.
## Subcommands
| Command | Description | HITL |
|---------|-------------|------|
| `infra status` | Full health sweep | No |
| `infra audit` | Resource inventory + capacity | No |
| `infra restart <svc>` | Restart with health check | Prod: yes |
| `infra snapshot <vm>` | Proxmox snapshot | No |
| `infra db backup [env]` | pg_dump + verify integrity | No |
| `infra db vacuum` | VACUUM ANALYZE + REINDEX | Prod: yes |
| `infra prune` | Docker prune (preview candidates first) | Yes |
| `infra capacity` | CPU/RAM/disk trends → recommendations | No |
| `infra network audit` | Tailscale ACL + CF DNS + firewall | No |
| `infra backup verify` | Restore-test latest backup | No |
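The check behind `infra db backup` / `infra backup verify` can be sketched as follows; a full restore test would additionally load the dump into a scratch database (`verify_backup` is an illustrative name):

```shell
#!/bin/sh
# Minimal integrity check for a gzip-compressed dump: the file must exist,
# be non-empty, and pass gzip's CRC check. A real restore test would also
# load it into a scratch database and compare row counts.
verify_backup() {
  f=$1
  [ -s "$f" ] || { echo "backup missing or empty: $f" >&2; return 1; }
  gzip -t "$f" || { echo "backup corrupt (gzip CRC failed): $f" >&2; return 1; }
  echo "backup ok: $f ($(wc -c < "$f") bytes)"
}
```

Example: `verify_backup backup_20250101.sql.gz` (filename hypothetical).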
## HITL Gates
| Operation | Dev/Staging | Production |
|-----------|:-----------:|:----------:|
| Service restart | Auto | ⚠️ HITL |
| DB backup | Auto | Auto |
| DB VACUUM / REINDEX | Auto | ⚠️ HITL |
| Docker prune | ⚠️ HITL | ⚠️ HITL |
| Volume delete | ⚠️ HITL | ⚠️ HITL |
| VM stop | Auto | ⚠️ HITL |
| VM snapshot | Auto | Auto |
| Firewall / DNS change | ⚠️ HITL | ⚠️ HITL |
| Valkey flush | ⚠️ HITL | ⚠️ HITL |
## Key Command Patterns
| Domain | Pattern |
|--------|---------|
| Docker restart | `docker compose -f <file> restart <service>` |
| Docker rebuild | `docker compose -f <file> up -d --build <service>` |
| Docker prune | `docker system df && docker image prune --filter "until=720h"` (preview usage FIRST; `prune` has no `--dry-run` flag, so omit `-f` to keep the confirmation prompt) |
| Proxmox snapshot | `pvesh create /nodes/<node>/qemu/<vmid>/snapshot -snapname pre-change-$(date +%Y%m%d)` |
| PG backup | `docker exec <db> pg_dump -U postgres <dbname> \| gzip > backup_$(date +%Y%m%d).sql.gz` |
| PG slow queries | `SELECT query, calls, mean_exec_time FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 20` |
| Valkey health | `redis-cli -p 6379 info all \| grep -E "uptime\|memory\|clients\|keyspace"` |
| Tailscale status | `tailscale status && tailscale ping <peer>` |
| CF DNS | `curl -s "https://api.cloudflare.com/client/v4/zones/$CF_ZONE_ID/dns_records" -H "Authorization: Bearer $CF_API_TOKEN" \| jq '.result[]'` |
## Safety Rules (NON-NEGOTIABLE)
1. **Backup before destructive ops** — pg_dump / snapshot ALWAYS
2. **Health check after every change** — never report success without proof
3. **Preview before prune** — list reclaim candidates (`docker system df`) before pruning
4. **HITL for prod** — restart, prune, schema change = explicit approval
5. **Max 2 retries** — then AskUserQuestion with alternatives
6. **Off-peak for REINDEX** — table-locking = maintenance windows only
7. **Never expose secrets** — no env vars/tokens/passwords in output
8. **Audit trail** — log every action with timestamp in session notes
## Error Recovery
| Scenario | Action |
|----------|--------|
| Container won't start | `docker logs --tail 100 <name>` → AskUserQuestion |
| DB connection refused | Check pg_hba.conf, max_connections, port |
| Tailscale unreachable | `tailscale ping`, check ACL, re-auth node |
| Health check fails post-deploy | Rollback compose → restore backup → AskUserQuestion |
| Disk full | `df -h` + `du -sh /opt/* /var/*` → prune Docker first |
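The disk-full threshold in the table above can be sketched as a small filter over `df -P` output (`disk_alerts` is an illustrative name):

```shell
#!/bin/sh
# Flag filesystems at or above a usage threshold. Reads `df -P`-style
# output on stdin and prints "mountpoint usage%" for each offender.
disk_alerts() {
  threshold=$1
  awk -v th="$threshold" 'NR > 1 { sub(/%/, "", $5); if ($5 + 0 >= th) print $6 " " $5 "%" }'
}
```

Example: `df -P | disk_alerts 80`.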
## Capacity Planning
Collect 7-day trends (CPU/RAM/disk) → project 30/60/90 days → flag >80% within 30 days → recommend actions. Present as table: resource | current | projected | severity (CRITICAL/WARNING/OK).
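The projection above can be sketched as arithmetic over one resource (`forecast_row` is an illustrative name; growth is per day, severity thresholds per the text):

```shell
#!/bin/sh
# One forecast-table row: project usage 30/60/90 days out from a daily
# growth rate; CRITICAL if >80% within 30 days, WARNING if within 90.
forecast_row() {
  name=$1; used=$2; total=$3; daily_growth=$4
  awk -v n="$name" -v u="$used" -v t="$total" -v g="$daily_growth" 'BEGIN {
    p30 = 100 * (u + 30 * g) / t
    p60 = 100 * (u + 60 * g) / t
    p90 = 100 * (u + 90 * g) / t
    sev = p30 > 80 ? "CRITICAL" : p90 > 80 ? "WARNING" : "OK"
    printf "%s | %.0f%% now | %.0f%% / %.0f%% / %.0f%% | %s\n", n, 100*u/t, p30, p60, p90, sev
  }'
}
```

Example: `forecast_row disk 700 1000 4` reports a disk at 70% now that crosses 80% within 30 days, hence CRITICAL.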
## Hardware Capacity Planning (consolidated from hardware-capacity W5.3 2026-05-01)
Use this section when the user asks for `capacity audit`, `hardware status`, `CPU/RAM/disk/GPU`, `resource planning`, `/atlas hardware`, or growth projections with thermal data.
### When to Use
- Auditing current hardware utilization across PVE nodes
- Planning for new VMs/workloads (will it fit?)
- GPU allocation decisions (which GPU for which workload)
- Disk capacity projections (when will we run out?)
- RAM/CPU right-sizing for existing VMs
### Hardware Inventory
| Node | Hostname | CPU | Cores/Threads | RAM | Disk | GPU | VRAM |
|------|----------|-----|-------|-----|------|-----|------|
| PVE1 | srv-ctrl | i5-12600H | 12T | 32GB | ~100GB SSD | none | — |
| PVE2 | srv-comp | Ryzen 7950X3D | 32T | 94GB | ~960GB NVMe | RTX 3080 Ti | 12GB |
| PVE3 | srv-stor | i5-9600KF | 6C | 32GB | ~960GB NVMe + NAS | GTX 1070 Ti | 8GB |
### Process
#### 1. Live Audit (per node)
```bash
for node in 192.168.1.21 192.168.1.22 192.168.1.23; do
  echo "=== $(ssh root@$node hostname) ==="
  ssh root@$node '
    echo "CPU: $(nproc) threads, $(grep -m1 "model name" /proc/cpuinfo | cut -d: -f2)"
    echo "RAM: $(free -h | awk "/Mem:/{print \$2\" total, \"\$3\" used, \"\$7\" available\"}")"
    echo "Disk: $(df -h / | tail -1 | awk "{print \$2\" total, \"\$3\" used, \"\$5\" usage\"}")"
    echo "VMs running: $(qm list 2>/dev/null | grep -c running)"
    echo "LXCs running: $(pct list 2>/dev/null | grep -c running)"
    lspci | grep -i "nvidia\|vga" | grep -v "virtio"
  '
done
```
#### 2. Capacity Calculation
```
Available = Total - Reserved(host) - Allocated(VMs)
Overcommit ratio: CPU 2:1 OK, RAM 1:1 strict, Disk 1:1 strict
```
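A sketch of the fit check implied by the formula, using the overcommit ratios above (function names and the host-reserve figure in the example are illustrative):

```shell
#!/bin/sh
# available = total - host reserve - already allocated.
# RAM is 1:1 strict (never overcommit); values in MB.
fits_ram() {
  total_mb=$1; reserved_mb=$2; allocated_mb=$3; request_mb=$4
  available=$((total_mb - reserved_mb - allocated_mb))
  [ "$request_mb" -le "$available" ]
}

# CPU tolerates 2:1 overcommit: allocated vCPUs may reach
# twice the hardware thread count.
fits_cpu() {
  threads=$1; allocated_vcpu=$2; request_vcpu=$3
  [ $((allocated_vcpu + request_vcpu)) -le $((threads * 2)) ]
}
```

Example: `fits_ram 96256 4096 73728 16384` checks whether a 16GB VM fits on a 94GB node with a hypothetical 4GB host reserve and 72GB already allocated.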
#### 3. GPU Allocation Matrix
| GPU | Node | Allocated To | VRAM Used | Available |
|-----|------|-------------|-----------|-----------|
| RTX 3080 Ti | PVE2 | VM 551 (AI/LLM) | ~10GB | ~2GB |
| GTX 1070 Ti | PVE3 | VM 570 (GPU-dev) | 0 (passthrough) | 8GB |
#### 4. Growth Projection
```
Current: X GB used / Y GB total (Z%)
30-day trend: +N GB/month
Projected full: in M months
Action threshold: 80% → alert, 90% → expand
```
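The same arithmetic as a runnable sketch (`project` is an illustrative name; GB in, months out, thresholds per the template above):

```shell
#!/bin/sh
# Given used/total (GB) and a monthly growth rate, report percent used,
# the action flag (80% -> alert, 90% -> expand), and months until full.
project() {
  used=$1; total=$2; growth_per_month=$3
  awk -v u="$used" -v t="$total" -v g="$growth_per_month" 'BEGIN {
    pct = 100 * u / t
    flag = pct >= 90 ? "expand" : pct >= 80 ? "alert" : "ok"
    if (g > 0) printf "%.0f%% used, %s, full in %.1f months\n", pct, flag, (t - u) / g
    else       printf "%.0f%% used, %s, no growth\n", pct, flag
  }'
}
```

Example: `project 700 960 30` reports a 960GB volume at 73% filling in under 9 months.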
### Output Format
Present results as:
1. **Node Summary Table** — CPU/RAM/Disk per node
2. **VM Allocation Table** — Per-VM resource usage
3. **Capacity Forecast** — 30/60/90 day projections
4. **Recommendations** — Resize, migrate, or expand decisions