---
name: vastai-gpu
description: Provision and manage vast.ai GPU instances for GPU-dependent testing (vLLM inference, model loading, VRAM validation). No local GPU available — all GPU work goes through vast.ai.
---
# vast.ai GPU Testing
Run GPU-dependent tests (vLLM inference, model loading, VRAM budgeting, tensor parallelism) on
vast.ai rental instances. The local laptop has no GPU — only `llm-server` needs one;
`api-backend`, `frontend`, and `qdrant` always run locally in Docker.
**Use `scripts/gpu-session.py` as the default path** — it owns the full lifecycle
(search → provision → download → vLLM → SSH tunnel → compose up → teardown) and persists
session state so partial failures never leak paid instances. The manual workflow in the
appendix is only for scenarios the script doesn't cover.
## Prerequisites
- `vastai` CLI installed and authenticated (`pip install vastai`)
- SSH key registered on vast.ai (current key: `~/.ssh/id_ed25519`, vast.ai key ID: 740068)
- Volume 34301538 ("tepegoz_models", 80GB, Bulgaria, machine_id=20420) — optional, not
used by the script today; reserved for a future `--use-cached-models` flag.
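A quick pre-session sanity check for both prerequisites (a sketch — `vastai show user` fails loudly when the API key is missing; the `show ssh-keys` subcommand may vary by CLI version):
```bash
vastai show user                    # CLI installed and authenticated?
vastai show ssh-keys                # key ID 740068 should appear here
ssh-keygen -lf ~/.ssh/id_ed25519    # fingerprint of the local key
```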
## Cost Rules
- Each profile in `scripts/gpu-session.py` has a per-hour budget cap. Defaults range from
$0.15/hr (qwen-0.5b) to $0.50/hr (qwen-32b). Override with `--budget-dph`.
- The script **refuses to provision** if the cheapest matching offer exceeds the budget.
- Use the **smallest profile** that exercises the code path you care about:
- config / tp wiring → `qwen-0.5b-tp2` (2×cheap GPU, cheap model)
- quality-of-output → `qwen-14b` (AWQ, 1×12GB)
- production parity → `qwen-32b` (GPTQ, 2×24GB)
- `gpu-session.py down` destroys the instance as its last step — always use it rather
than leaving a session alive.
## Automated workflow
### 1. Start a session
```bash
# Smallest + cheapest, skip confirmation:
make gpu-up MODEL=qwen-0.5b YES=1
# or directly:
python scripts/gpu-session.py up --model qwen-0.5b --yes
# Override budget:
python scripts/gpu-session.py up --model qwen-14b --budget-dph 0.40
# Skip pre-flight checks (not recommended):
python scripts/gpu-session.py up --model qwen-0.5b --yes --skip-preflight
```
The script will:
0. **Pre-flight checks** — before spending money, verify the local stack is healthy:
- Docker daemon is running
- `docker-compose.cpu-only.yml` parses correctly
- `make test` passes (all backend + frontend tests green)
- No stale compose overrides from previous sessions
If any check fails, the script aborts with a clear error message. Fix the
local issue before renting a GPU. Use `--skip-preflight` only when you know
what you're doing (e.g. the failing test is unrelated to your GPU work).
A hand-run sketch of these checks, and of the step-5 validation, follows this list.
1. Search vast.ai offers matching the profile's GPU filter.
2. Pick the cheapest; refuse if over budget.
3. Create the instance; write `.gpu-session.json` immediately (so a crash
here still has a record to destroy).
4. Wait for `running`, verify SSH + GPU visibility.
5. **Post-provision validation** — run 5 checks on the remote machine:
- CUDA runtime available (`torch.cuda.is_available()`)
- GPU count matches tensor-parallel requirement
- Disk space sufficient for model weights
- Docker Hub connectivity (TCP+TLS to registry-1.docker.io)
- GPU inference smoke test (matrix multiply on CUDA)
If any check fails, the instance is destroyed (~$0.01) and the next offer
is tried automatically (up to 3 attempts).
6. Download the model (HuggingFace snapshot_download). `HF_TOKEN` is
forwarded if set in the local environment.
7. Start vLLM with `--enforce-eager` and `--disable-custom-all-reduce`.
8. Open an SSH tunnel: `0.0.0.0:18080 → remote:8080`. `0.0.0.0` is mandatory
so Docker containers can reach it via `host.docker.internal`.
9. Generate `docker-compose.gpu.yml` — a compose override that injects GPU-mode env
vars (`MODEL_NAME`, `VLLM_BASE_URL`, …) into `api-backend` via `environment:`.
**The committed `.env` and `docker-compose.cpu-only.yml` are never mutated.**
10. `docker compose -f docker-compose.cpu-only.yml -f docker-compose.gpu.yml up -d`.
11. Probe container → vLLM connectivity.
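For orientation (and as a starting point for the manual workflow), here is a hand-run sketch of the step-0 pre-flight and step-5 validation checks. The script's internals may differ; host, port, and key path are placeholders:
```bash
# Pre-flight (local, costs nothing):
docker info > /dev/null                                  # daemon running?
docker compose -f docker-compose.cpu-only.yml config -q  # compose parses?
make test                                                # CPU suite green?
test ! -f docker-compose.gpu.yml                         # no stale override?

# Post-provision validation (on the rented box):
ssh -i ~/.ssh/id_ed25519 -p <port> root@<host> '
  curl -sI https://registry-1.docker.io/v2/ > /dev/null && echo "registry TLS OK"
  python3 - <<EOF
import shutil
import torch
assert torch.cuda.is_available(), "no CUDA runtime"
print("GPU count:", torch.cuda.device_count())   # must cover the TP size
print("free disk GB:", shutil.disk_usage("/").free // 2**30)
x = torch.rand(1024, 1024, device="cuda")
print("matmul OK:", (x @ x).shape)               # GPU inference smoke test
EOF'
```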
### 2. Use the session
- Backend API: `http://localhost:8000`
- Frontend: `http://localhost:3000`
- Direct vLLM: `curl http://localhost:18080/v1/models`
- Interactive shell on the remote box: `make gpu-ssh`
- Tail vLLM server logs: `make gpu-logs FOLLOW=1`
- Session state + costs: `make gpu-status`
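For a quick end-to-end smoke test straight against vLLM (bypassing the backend), a minimal chat completion works. The `model` value must match the served model name — the one below assumes a `qwen-0.5b` session:
```bash
curl -s http://localhost:18080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "Qwen2.5-0.5B-Instruct",
       "messages": [{"role": "user", "content": "ping"}],
       "max_tokens": 8}'
```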
### 3. Tear down
```bash
make gpu-down
```
Always use `down` rather than ad-hoc `docker compose down` + `vastai destroy` —
the script orchestrates all five cleanup steps (compose, override file, tunnel PID,
instance, session state file) and is idempotent.
## Profiles
Defined at the top of `scripts/gpu-session.py`; run `make gpu-status` or
`python scripts/gpu-session.py list-profiles` to see the live set.
| Profile | Repo | TP | Min VRAM | Disk | Budget |
|---------|------|----|----------|------|--------|
| qwen-0.5b | Qwen2.5-0.5B-Instruct | 1 | 4GB | 20GB | $0.15/hr |
| qwen-0.5b-tp2 | Qwen2.5-0.5B-Instruct | 2 | 4GB | 20GB | $0.25/hr |
| qwen-7b | Qwen2.5-7B-Instruct | 1 | 16GB | 30GB | $0.30/hr |
| qwen-14b | Qwen2.5-14B-Instruct-AWQ | 1 | 12GB | 25GB | $0.30/hr |
| qwen-32b | Qwen2.5-32B-Instruct-GPTQ-Int8 | 2 | 24GB | 60GB | $0.50/hr |
**Known incompatibilities** (from prior testing):
- GGUF does NOT work with vLLM tp>1 (`GGUFUninitializedParameter` error)
- AWQ on 2×4090 gives only 4K context — insufficient for engineering QA
- 72B-AWQ quality is worse than 32B-GPTQ-Int8 despite larger size
- GPUs with Compute Capability < 8.0 (e.g. GTX 1080) lack bfloat16 support — vLLM refuses to start
## Recovery
If anything fails mid-session, the state file (`.gpu-session.json`) is still on disk.
`make gpu-down` will read it and clean up what it can regardless of phase:
- compose stack → stopped
- `docker-compose.gpu.yml` override → deleted
- SSH tunnel → killed (by saved PID, with pattern match as safety net)
- vast.ai instance → destroyed (this is the money-saver; runs even if earlier steps fail)
- session state file → deleted
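If the script itself is broken, the one step that actually stops the billing can always be run by hand — read the instance ID out of the state file and destroy it directly:
```bash
cat .gpu-session.json               # find the instance ID
vastai destroy instance <ID>        # the money-stopper
```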
## When NOT to use vast.ai
- code review, config syntax checks, type-checking
- any test that doesn't execute CUDA kernels or load weights
- anything validated by `make test` on CPU
Don't spin up a GPU to run `make lint`.
---
## Appendix: Manual workflow (fallback)
When to use the manual workflow:
- Debugging something the script doesn't cover (e.g. trying a non-Qwen model,
experimenting with a different vLLM image tag, or needing step-by-step control
over the remote commands).
- Improving `gpu-session.py` itself — you need to understand what it does.
All steps the script automates can be reproduced by hand. The key invariants:
### SSH tunnel must bind 0.0.0.0
```bash
ssh -f -N -L 0.0.0.0:18080:localhost:8080 \
-i ~/.ssh/id_ed25519 \
-o ServerAliveInterval=60 \
root@<host> -p <port>
```
Docker containers on Linux resolve `host.docker.internal` to the bridge gateway
(e.g. 172.17.0.1), NOT 127.0.0.1. A tunnel bound only to localhost is unreachable
from inside containers. Docker Desktop (Mac/Windows) handles this differently;
on Linux the `extra_hosts: host.docker.internal:host-gateway` in
`docker-compose.cpu-only.yml` is required.
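To confirm the tunnel is reachable from where it matters — inside a container — probe it from `api-backend` (a sketch; assumes `curl` exists in that image):
```bash
docker compose -f docker-compose.cpu-only.yml exec api-backend \
  curl -s http://host.docker.internal:18080/v1/models
```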
### Local stack env overrides via compose override, NOT .env edits
Don't edit `.env` or `docker-compose.cpu-only.yml` by hand — you'll commit the
wrong values eventually. Instead, create a temporary `docker-compose.gpu.yml`:
```yaml
services:
api-backend:
environment:
MODEL_NAME: "Qwen2.5-14B-Instruct-AWQ"
VLLM_BASE_URL: "http://host.docker.internal:18080"
VLLM_MAX_MODEL_LEN: "8192"
VLLM_TP_SIZE: "1"
VLLM_GPU_COUNT: "1"
VLLM_QUANTIZATION: "awq"
```
Then:
```bash
docker compose -f docker-compose.cpu-only.yml -f docker-compose.gpu.yml up -d
```
Compose's `environment:` takes priority over `env_file:`, so the backend container
sees the override values. This is exactly what `gpu-session.py` does.
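To verify the override actually wins before starting anything, render the merged config (standard compose behavior, no containers touched):
```bash
docker compose -f docker-compose.cpu-only.yml -f docker-compose.gpu.yml \
  config | grep -E 'MODEL_NAME|VLLM_'
```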
### vLLM startup flags are non-negotiable
```
--enforce-eager
--disable-custom-all-reduce
--served-model-name=<same short name the backend sends>
```
See the "Critical Design Decisions" section of the root `CLAUDE.md` and issue #61
for the reasoning.
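One practical consequence of `--served-model-name`: vLLM rejects any request whose `model` field doesn't match it. To see what name the server actually registered (assumes `jq`; raw `curl` output works too):
```bash
curl -s http://localhost:18080/v1/models | jq -r '.data[].id'
```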
### Teardown order
```bash
docker compose -f docker-compose.cpu-only.yml -f docker-compose.gpu.yml down
rm docker-compose.gpu.yml
pkill -f 'ssh.*-L.*:18080:localhost:8080'
vastai destroy instance <ID>
```
Destroy the instance LAST — if any earlier step fails, at least you want the
money-stopper to still run.
### Quick manual reference
```bash
INSTANCE_ID=<id> SSH_HOST=<host> SSH_PORT=<port> MODEL=Qwen2.5-14B-Instruct-AWQ
# Download
ssh ... "nohup python3 -c \"from huggingface_hub import snapshot_download; snapshot_download('Qwen/$MODEL', local_dir='/vllm-workspace/$MODEL')\" > /tmp/dl.log 2>&1 &"
# Start vLLM
ssh ... "nohup python3 -m vllm.entrypoints.openai.api_server \
--model /vllm-workspace/$MODEL --served-model-name $MODEL \
--quantization awq --max-model-len 8192 --enforce-eager \
--disable-custom-all-reduce --gpu-memory-utilization 0.90 --port 8080 \
> /tmp/vllm.log 2>&1 &"
# Tunnel + compose override + compose up (see above)
# Teardown (see above)
```