---
name: standalone-experimental-test
description: Run standalone experimental tests that never touch the existing codebase. Creates isolated test files, runs them, captures results, then cleans up. Zero risk to working code.
triggers:
- "test this experimentally"
- "standalone test"
- "experimental test"
- "try this without touching the code"
- "test in isolation"
- "can we verify this works"
- "/standalone-experimental-test"
---
# Standalone Experimental Test — Zero-Risk Experimentation
Run experiments to verify assumptions, test APIs, or explore behavior WITHOUT modifying any existing project files. Everything happens in an isolated scratch space.
## Philosophy
The codebase is sacred. Experiments are disposable. Never mix the two.
## Workflow
### Step 1: Define the experiment
Before writing anything, state clearly:
```
Experiment: [what we're testing]
Hypothesis: [what we expect to happen]
Success criteria: [how we'll know it worked]
Files needed: [what test files we'll create]
Dependencies: [what packages/tools are needed]
```
Wait for user approval before proceeding.
### Step 2: Create the scratch space
Create an isolated directory for the experiment:
```bash
mkdir -p /tmp/experiment-<descriptive-name>-$(date +%s)
```
Rules:
- **NEVER create test files inside the project directory**
- **NEVER modify existing project files** — not even "temporarily"
- **NEVER add dependencies to the project's package.json or pyproject.toml**
- If the experiment needs project dependencies, install them separately in the scratch space
- If the experiment needs to import from the project, use absolute paths or symlinks — don't copy project files
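For example, a scratch-space script can make project code importable by absolute path without copying or touching it. The paths and module name below are purely illustrative:
```python
# test_import.py — lives in /tmp/experiment-.../, never inside the project.
# All paths below are examples; substitute your actual project location.
import sys
from pathlib import Path

PROJECT_ROOT = Path("/home/me/code/my-project")  # hypothetical project root

# Make project code importable WITHOUT copying or modifying it.
sys.path.insert(0, str(PROJECT_ROOT / "src"))

import my_module  # hypothetical project module, imported read-only

print("Imported:", my_module.__file__)
```
A symlink created inside the scratch directory (for example `ln -s /abs/path/to/project/src project_src`) achieves the same thing at the filesystem level.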
### Step 3: Write the test
Create minimal, focused test files in the scratch space (a skeleton example follows at the end of this step). The test should:
- Be as small as possible — test ONE thing
- Be self-contained — runnable without the project
- Print clear output — what happened, what was expected
- Handle errors gracefully — catch and report, don't crash silently
If the test needs project context (e.g., testing how a library interacts with project code):
- Read project files but don't write to them
- Import from the project using absolute paths
- Use the project's virtual environment if needed, but don't modify it
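A minimal skeleton that satisfies these points might look like the sketch below; the function under test and the expected value are placeholders, not real project code:
```python
# test_one_thing.py — tests exactly ONE hypothesis, self-contained, in /tmp.
import json
import traceback

EXPECTED = {"ok": True}  # placeholder success criterion


def run_experiment():
    """Placeholder for the single behavior under test."""
    return json.loads('{"ok": true}')


def main():
    print("Hypothesis: parsing returns", EXPECTED)
    try:
        actual = run_experiment()
    except Exception:
        # Report failures loudly instead of crashing silently.
        print("Result: INCONCLUSIVE (experiment raised an exception)")
        traceback.print_exc()
        return
    print("Actual:  ", actual)
    print("Result:  ", "CONFIRMED" if actual == EXPECTED else "DENIED")


if __name__ == "__main__":
    main()
```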
### Step 4: Run and capture
Run the test and capture ALL output:
```bash
cd /tmp/experiment-<name>-<timestamp>
# Run the test, capture stdout and stderr
python test.py 2>&1 | tee results.txt
```
### Step 5: Report findings
Present results clearly:
```
Experiment: [name]
Result: [CONFIRMED / DENIED / INCONCLUSIVE]
Evidence: [what the output showed]
Implication: [what this means for the project]
```
If the hypothesis was confirmed, explain what the next step would be to integrate it into the project (but don't do it — that's a separate task).
If denied, explain what we learned and what alternative approaches exist.
### Step 6: Clean up
Ask the user if they want to keep the experiment files:
```
Experiment complete. Keep the scratch files at /tmp/experiment-<name>-<timestamp>?
```
- If yes: leave them
- If no: `rm -rf /tmp/experiment-<name>-<timestamp>`
## Rules
1. **NEVER modify project files.** Not even a single line. Not even "just a log statement." If you need to test a modified version of a project file, copy it to the scratch space and modify the copy.
2. **NEVER install packages into the project.** If the experiment needs a package, install it in the scratch space or a temporary venv (see the sketch after these rules).
3. **NEVER commit experiment files.** They live in /tmp and die there.
4. **One experiment, one question.** Don't combine multiple hypotheses into one test. Run separate experiments.
5. **The scratch space is disposable.** Don't build anything in /tmp that you'd be sad to lose. If results are valuable, save them to the project as a deliberate, separate step.
6. **Read-only access to the project.** The experiment can READ project files, USE the project's installed packages (via their paths), and REFERENCE project configuration. It cannot WRITE, MODIFY, or ADD anything to the project.
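As one way to satisfy Rule 2, the experiment can build a throwaway venv inside the scratch space. The sketch below uses only the Python standard library; the directory name and package list are examples:
```python
# make_scratch_venv.py — hypothetical helper for a throwaway experiment venv.
import subprocess
import venv
from pathlib import Path

SCRATCH = Path("/tmp/experiment-example-1700000000")  # example scratch directory
ENV_DIR = SCRATCH / "venv"

SCRATCH.mkdir(parents=True, exist_ok=True)
venv.EnvBuilder(with_pip=True).create(ENV_DIR)  # stdlib venv module, pip included

# Install experiment-only packages into the scratch venv, never the project.
pip = ENV_DIR / "bin" / "pip"
subprocess.run([str(pip), "install", "websockets", "python-dotenv"], check=True)

# Run the experiment with the scratch interpreter, never the project's.
subprocess.run([str(ENV_DIR / "bin" / "python"), str(SCRATCH / "test.py")], check=True)
```
Deleting the scratch directory in Step 6 removes the venv along with everything else.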
## Examples
### Testing if a library handles partial input
```
Experiment: Does Streamdown render incomplete LaTeX gracefully?
Hypothesis: Passing partial LaTeX like "$$x = \frac{" renders without crashing
Success criteria: Component renders without error, shows partial equation
Files needed: test_partial_latex.html in scratch space
Dependencies: streamdown (use project's node_modules)
```
### Testing API response format
```
Experiment: What does the GitHub API return for repo tree requests?
Hypothesis: Returns flat array of file paths with types
Success criteria: We see the exact JSON structure
Files needed: test_gh_api.sh in scratch space
Dependencies: gh CLI (already installed)
```
## Proven Example: Testing OpenAI Realtime API Delta Streaming
This experiment was run on 2026-03-26 and CONFIRMED that function call arguments stream token by token.
**Definition:**
```
Experiment: Does OpenAI Realtime API stream function call arguments via delta events?
Hypothesis: Delta events fire token by token during function calls
Success criteria: Multiple delta events with partial JSON for a single function call
Result: CONFIRMED — 147 delta events over 1.4 seconds, each containing 1-6 chars
```
**Test file** (`test_deltas.py`):
```python
"""
Tests whether OpenAI Realtime API streams function call arguments.
Connects via WebSocket, sends a prompt that forces a function call,
captures all response.function_call_arguments.delta events.
Requirements: websockets, python-dotenv
Env: OPENAI_API_KEY must be set (or loaded from a .env.local)
"""
import asyncio
import json
import os
import time
# Load API key — point this at your project's .env.local (READ-ONLY)
from dotenv import load_dotenv
load_dotenv("/path/to/your/project/.env.local")
import websockets
RESULTS_FILE = os.path.join(os.path.dirname(__file__), "results.txt")
def log(msg):
timestamp = time.time()
line = f"[{timestamp:.3f}] {msg}"
print(line, flush=True)
with open(RESULTS_FILE, "a") as f:
f.write(line + "\n")
async def test_realtime_delta_streaming():
api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
log("ERROR: OPENAI_API_KEY not found")
return
url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
headers = {
"Authorization": f"Bearer {api_key}",
"OpenAI-Beta": "realtime=v1",
}
log("Connecting to OpenAI Realtime API...")
async with websockets.connect(url, additional_headers=headers) as ws:
# Configure session with a tool, text-only mode
await ws.send(json.dumps({
"type": "session.update",
"session": {
"modalities": ["text"],
"tools": [{
"type": "function",
"name": "show_content",
"description": "Show visual content",
"parameters": {
"type": "object",
"properties": {
"content": {
"type": "string",
"description": "Markdown/LaTeX content. Use $$ for display math."
}
},
"required": ["content"]
}
}],
"tool_choice": "required",
}
}))
resp = json.loads(await ws.recv())
log(f"Session configured: {resp['type']}")
# Prompt that forces a long function call argument
await ws.send(json.dumps({
"type": "conversation.item.create",
"item": {
"type": "message",
"role": "user",
"content": [{
"type": "input_text",
"text": (
"Show me on the blackboard: the quadratic formula, "
"Euler's identity, the Pythagorean theorem, "
"and the definition of a derivative. "
"Use full LaTeX formatting with display math."
)
}]
}
}))
await ws.send(json.dumps({"type": "response.create"}))
log("Prompt sent. Waiting for function call events...")
delta_events = []
done = False
start_time = time.time()
while not done:
try:
msg = await asyncio.wait_for(ws.recv(), timeout=30)
except asyncio.TimeoutError:
log("TIMEOUT")
break
event = json.loads(msg)
event_type = event.get("type", "unknown")
if "function_call" in event_type:
elapsed = time.time() - start_time
if event_type == "response.function_call_arguments.delta":
delta_events.append({
"elapsed": round(elapsed, 3),
"delta": event.get("delta", ""),
})
log(f" DELTA #{len(delta_events):3d} @ {elapsed:.3f}s: '{event.get('delta', '')}'")
elif event_type == "response.function_call_arguments.done":
log(f" DONE @ {elapsed:.3f}s: ({len(event.get('arguments', ''))} chars)")
if event_type == "response.done":
done = True
# Report
log("")
log("=" * 60)
log(f"Total delta events: {len(delta_events)}")
if len(delta_events) > 1:
first = delta_events[0]["elapsed"]
last = delta_events[-1]["elapsed"]
log(f"CONFIRMED: {len(delta_events)} deltas over {last - first:.3f}s")
log(f"Reconstructed: {''.join(d['delta'] for d in delta_events)}")
elif len(delta_events) == 1:
log("DENIED: Only one delta event")
else:
log("INCONCLUSIVE: No delta events")
if __name__ == "__main__":
asyncio.run(test_realtime_delta_streaming())
```
**How to run:**
```bash
mkdir -p /tmp/experiment-deltas
cp test_deltas.py /tmp/experiment-deltas/
cd /tmp/experiment-deltas
# Use a venv with websockets and python-dotenv installed
python test_deltas.py 2>&1 | tee results.txt
```
**Expected output (CONFIRMED):**
```
DELTA # 1 @ 0.819s: '{"'
DELTA # 2 @ 0.819s: 'content'
DELTA # 3 @ 0.819s: '":"'
DELTA # 4 @ 0.819s: 'Here'
...
DELTA #147 @ 2.186s: '"}'
DONE @ 2.208s: (349 chars)
Total delta events: 147
CONFIRMED: 147 deltas over 1.367s
```