---
name: vastai-gpu
description: Provision and manage vast.ai GPU instances for GPU-dependent testing (vLLM inference, model loading, VRAM validation). No local GPU available — all GPU work goes through vast.ai.
---
# vast.ai GPU Testing
Run GPU-dependent tests (vLLM inference, model loading, VRAM budgeting, tensor parallelism) on
vast.ai rental instances. The local laptop has no GPU — only `llm-server` needs one;
`api-backend`, `frontend`, and `qdrant` always run locally in Docker.
**Use `scripts/gpu-session.py` as the default path** — it owns the full lifecycle
(search → provision → download → vLLM → SSH tunnel → compose up → teardown) and persists
session state so partial failures never leak paid instances. The manual workflow in the
appendix is only for scenarios the script doesn't cover.
## Prerequisites
- `vastai` CLI installed and authenticated (`pip install vastai`)
- SSH key registered on vast.ai (current key: `~/.ssh/id_ed25519`, vast.ai key ID: 740068)
- Volume 34301538 ("tepegoz_models", 80GB, Bulgaria, machine_id=20420) — optional, not
used by the script today; reserved for a future `--use-cached-models` flag.
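A quick pre-session sanity check for both prerequisites (a sketch — `vastai show user` fails loudly when the API key is missing; the `show ssh-keys` subcommand may vary by CLI version):
```bash
vastai show user                    # CLI installed and authenticated?
vastai show ssh-keys                # key ID 740068 should appear here
ssh-keygen -lf ~/.ssh/id_ed25519    # fingerprint of the local key
```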
## Cost Rules
- Each profile in `scripts/gpu-session.py` has a per-hour budget cap. Defaults range from
$0.15/hr (qwen-0.5b) to $0.50/hr (qwen-32b). Override with `--budget-dph`.
- The script **refuses to provision** if the cheapest matching offer exceeds the budget.
- Use the **smallest profile** that exercises the code path you care about:
- config / tp wiring → `qwen-0.5b-tp2` (2×cheap GPU, cheap model)
- quality-of-output → `qwen-14b` (AWQ, 1×12GB)
- production parity → `qwen-32b` (GPTQ, 2×24GB)
- `gpu-session.py down` destroys the instance as its last step — always use it rather
than leaving a session alive.
## Automated workflow
### 1. Start a session
```bash
# Smallest + cheapest, skip confirmation:
make gpu-up MODEL=qwen-0.5b YES=1
# or directly:
python scripts/gpu-session.py up --model qwen-0.5b --yes
# Override budget:
python scripts/gpu-session.py up --model qwen-14b --budget-dph 0.40
# Skip pre-flight checks (not recommended):
python scripts/gpu-session.py up --model qwen-0.5b --yes --skip-preflight
```
The script will:
0. **Pre-flight checks** — before spending money, verify the local stack is healthy:
- Docker daemon is running
- `docker-compose.cpu-only.yml` parses correctly
- `make test` passes (all backend + frontend tests green)
- No stale compose overrides from previous sessions
If any check fails, the script aborts with a clear error message. Fix the
local issue before renting a GPU. Use `--skip-preflight` only when you know
what you're doing (e.g. the failing test is unrelated to your GPU work).
A hand-run sketch of these checks, and of the step-5 validation, follows this list.
1. Search vast.ai offers matching the profile's GPU filter.
2. Pick the cheapest; refuse if over budget.
3. Create the instance; write `.gpu-session.json` immediately (so a crash
here still has a record to destroy).
4. Wait for `running`, verify SSH + GPU visibility.
5. **Post-provision validation** — run 5 checks on the remote machine:
- CUDA runtime available (`torch.cuda.is_available()`)
- GPU count matches tensor-parallel requirement
- Disk space sufficient for model weights
- Docker Hub connectivity (TCP+TLS to registry-1.docker.io)
- GPU inference smoke test (matrix multiply on CUDA)
If any check fails, the instance is destroyed (~$0.01) and the next offer
is tried automatically (up to 3 attempts).
6. Download the model (HuggingFace snapshot_download). `HF_TOKEN` is
forwarded if set in the local environment.
7. Start vLLM with `--enforce-eager` and `--disable-custom-all-reduce`.
8. Open an SSH tunnel: `0.0.0.0:18080 → remote:8080`. `0.0.0.0` is mandatory
so Docker containers can reach it via `host.docker.internal`.
9. Generate `docker-compose.gpu.yml` — a compose override that injects GPU-mode env
vars (`MODEL_NAME`, `VLLM_BASE_URL`, …) into `api-backend` via `environment:`.
**The committed `.env` and `docker-compose.cpu-only.yml` are never mutated.**
10. `docker compose -f docker-compose.cpu-only.yml -f docker-compose.gpu.yml up -d`.
11. Probe container → vLLM connectivity.
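For orientation (and as a starting point for the manual workflow), here is a hand-run sketch of the step-0 pre-flight and step-5 validation checks. The script's internals may differ; host, port, and key path are placeholders:
```bash
# Pre-flight (local, costs nothing):
docker info > /dev/null                                  # daemon running?
docker compose -f docker-compose.cpu-only.yml config -q  # compose parses?
make test                                                # CPU suite green?
test ! -f docker-compose.gpu.yml                         # no stale override?

# Post-provision validation (on the rented box):
ssh -i ~/.ssh/id_ed25519 -p <port> root@<host> '
  curl -sI https://registry-1.docker.io/v2/ > /dev/null && echo "registry TLS OK"
  python3 - <<EOF
import shutil
import torch
assert torch.cuda.is_available(), "no CUDA runtime"
print("GPU count:", torch.cuda.device_count())   # must cover the TP size
print("free disk GB:", shutil.disk_usage("/").free // 2**30)
x = torch.rand(1024, 1024, device="cuda")
print("matmul OK:", (x @ x).shape)               # GPU inference smoke test
EOF'
```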
### 2. Use the session
- Backend API: `http://localhost:8000`
- Frontend: `http://localhost:3000`
- Direct vLLM: `curl http://localhost:18080/v1/models`
- Interactive shell on the remote box: `make gpu-ssh`
- Tail vLLM server logs: `make gpu-logs FOLLOW=1`
- Session state + costs: `make gpu-status`
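For a quick end-to-end smoke test straight against vLLM (bypassing the backend), a minimal chat completion works. The `model` value must match the served model name — the one below assumes a `qwen-0.5b` session:
```bash
curl -s http://localhost:18080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "Qwen2.5-0.5B-Instruct",
       "messages": [{"role": "user", "content": "ping"}],
       "max_tokens": 8}'
```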
### 3. Tear down
```bash
make gpu-down
```
Always use `down` rather than ad-hoc `docker compose down` + `vastai destroy` —
the script orchestrates all five cleanup steps (compose, override file, tunnel PID,
instance, session state file) and is idempotent.
## Profiles
Defined at the top of `scripts/gpu-session.py`; run `make gpu-status` or
`python scripts/gpu-session.py list-profiles` to see the live set.
| Profile | Repo | TP | Min VRAM | Disk | Budget |
|---------|------|----|----------|------|--------|
| qwen-0.5b | Qwen2.5-0.5B-Instruct | 1 | 4GB | 20GB | $0.15/hr |
| qwen-0.5b-tp2 | Qwen2.5-0.5B-Instruct | 2 | 4GB | 20GB | $0.25/hr |
| qwen-7b | Qwen2.5-7B-Instruct | 1 | 16GB | 30GB | $0.30/hr |
| qwen-14b | Qwen2.5-14B-Instruct-AWQ | 1 | 12GB | 25GB | $0.30/hr |
| qwen-32b | Qwen2.5-32B-Instruct-GPTQ-Int8 | 2 | 24GB | 60GB | $0.50/hr |
**Known incompatibilities** (from prior testing):
- GGUF does NOT work with vLLM tp>1 (`GGUFUninitializedParameter` error)
- AWQ on 2×4090 gives only 4K context — insufficient for engineering QA
- 72B-AWQ quality is worse than 32B-GPTQ-Int8 despite larger size
- GPUs with Compute Capability < 8.0 (e.g. GTX 1080) lack bfloat16 support — vLLM refuses to start
## Recovery
If anything fails mid-session, the state file (`.gpu-session.json`) is still on disk.
`make gpu-down` will read it and clean up what it can regardless of phase:
- compose stack → stopped
- `docker-compose.gpu.yml` override → deleted
- SSH tunnel → killed (by saved PID, with pattern match as safety net)
- vast.ai instance → destroyed (this is the money-saver; runs even if earlier steps fail)
- session state file → deleted
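If the script itself is broken, the one step that actually stops the billing can always be run by hand — read the instance ID out of the state file and destroy it directly:
```bash
cat .gpu-session.json               # find the instance ID
vastai destroy instance <ID>        # the money-stopper
```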
## When NOT to use vast.ai
- code review, config syntax checks, type-checking
- any test that doesn't execute CUDA kernels or load weights
- anything validated by `make test` on CPU
Don't spin up a GPU to run `make lint`.
---
## Appendix: Manual workflow (fallback)
When to use the manual workflow:
- Debugging something the script doesn't cover (e.g. trying a non-Qwen model,
experimenting with a different vLLM image tag, or needing step-by-step control
over the remote commands).
- Improving `gpu-session.py` itself — you need to understand what it does.
All steps the script automates can be reproduced by hand. The key invariants:
### SSH tunnel must bind 0.0.0.0
```bash
ssh -f -N -L 0.0.0.0:18080:localhost:8080 \
-i ~/.ssh/id_ed25519 \
-o ServerAliveInterval=60 \
root@<host> -p <port>
```
Docker containers on Linux resolve `host.docker.internal` to the bridge gateway
(e.g. 172.17.0.1), NOT 127.0.0.1. A tunnel bound only to localhost is unreachable
from inside containers. Docker Desktop (Mac/Windows) handles this differently;
on Linux the `extra_hosts: host.docker.internal:host-gateway` in
`docker-compose.cpu-only.yml` is required.
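To confirm the tunnel is reachable from where it matters — inside a container — probe it from `api-backend` (a sketch; assumes `curl` exists in that image):
```bash
docker compose -f docker-compose.cpu-only.yml exec api-backend \
  curl -s http://host.docker.internal:18080/v1/models
```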
### Local stack env overrides via compose override, NOT .env edits
Don't edit `.env` or `docker-compose.cpu-only.yml` by hand — you'll commit the
wrong values eventually. Instead, create a temporary `docker-compose.gpu.yml`:
```yaml
services:
api-backend:
environment:
MODEL_NAME: "Qwen2.5-14B-Instruct-AWQ"
VLLM_BASE_URL: "http://host.docker.internal:18080"
VLLM_MAX_MODEL_LEN: "8192"
VLLM_TP_SIZE: "1"
VLLM_GPU_COUNT: "1"
VLLM_QUANTIZATION: "awq"
```
Then:
```bash
docker compose -f docker-compose.cpu-only.yml -f docker-compose.gpu.yml up -d
```
Compose's `environment:` takes priority over `env_file:`, so the backend container
sees the override values. This is exactly what `gpu-session.py` does.
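To verify the override actually wins before starting anything, render the merged config (standard compose behavior, no containers touched):
```bash
docker compose -f docker-compose.cpu-only.yml -f docker-compose.gpu.yml \
  config | grep -E 'MODEL_NAME|VLLM_'
```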
### vLLM startup flags are non-negotiable
```
--enforce-eager
--disable-custom-all-reduce
--served-model-name=<same short name the backend sends>
```
See the "Critical Design Decisions" section of the root `CLAUDE.md` and issue #61
for the reasoning.
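One practical consequence of `--served-model-name`: vLLM rejects any request whose `model` field doesn't match it. To see what name the server actually registered (assumes `jq`; raw `curl` output works too):
```bash
curl -s http://localhost:18080/v1/models | jq -r '.data[].id'
```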
### Teardown order
```bash
docker compose -f docker-compose.cpu-only.yml -f docker-compose.gpu.yml down
rm docker-compose.gpu.yml
pkill -f 'ssh.*-L.*:18080:localhost:8080'
vastai destroy instance <ID>
```
Destroy the instance LAST — if any earlier step fails, at least you want the
money-stopper to still run.
### Quick manual reference
```bash
INSTANCE_ID=<id> SSH_HOST=<host> SSH_PORT=<port> MODEL=Qwen2.5-14B-Instruct-AWQ
# Download
ssh ... "nohup python3 -c \"from huggingface_hub import snapshot_download; snapshot_download('Qwen/$MODEL', local_dir='/vllm-workspace/$MODEL')\" > /tmp/dl.log 2>&1 &"
# Start vLLM
ssh ... "nohup python3 -m vllm.entrypoints.openai.api_server \
--model /vllm-workspace/$MODEL --served-model-name $MODEL \
--quantization awq --max-model-len 8192 --enforce-eager \
--disable-custom-all-reduce --gpu-memory-utilization 0.90 --port 8080 \
> /tmp/vllm.log 2>&1 &"
# Tunnel + compose override + compose up (see above)
# Teardown (see above)
```