---
name: autoresearch
description: "Autonomous AI research loop - let the agent run ML experiments overnight. Inspired by Karpathy's autoresearch. Use when: autonomous research, ml experiments, overnight training, self-improving models, auto-optimization."
---

# AutoResearch 🔬

**Let the agent run autonomous ML experiments while you sleep.**

## Description

AutoResearch enables the agent to autonomously iterate on machine learning experiments. It modifies code, runs training, evaluates the results, and keeps improvements, looping indefinitely until manually stopped. Inspired by [karpathy/autoresearch](https://github.com/karpathy/autoresearch).

## Activation Keywords

- autoresearch
- autonomous research
- overnight experiments
- ml experiments loop
- auto optimization
- 自主研究 (autonomous research)
- 自动实验 (automated experiments)

## Prerequisites

1. A working ML training setup (a single GPU is recommended)
2. The `uv` package manager: `curl -LsSf https://astral.sh/uv/install.sh | sh`
3. The autoresearch repo cloned, or your own training code

## Quick Start

```
User: "Start autoresearch on my training code"
Agent: Reads this skill, sets up the experiment loop, runs indefinitely
```

## Experiment Loop

### Phase 1: Setup

1. **Agree on a run tag**: Create a tag based on the date (e.g., `apr5`)
2. **Create a branch**: `git checkout -b autoresearch/<tag>`
3. **Read in-scope files**:
   - Training code (e.g., `train.py`)
   - Data prep (e.g., `prepare.py`) - READ ONLY
   - README.md for context
4. **Verify data exists**: Check that the training data is prepared
5. **Initialize the results log**: Create `results.tsv` with a header
6. **Confirm setup** with the user

### Phase 2: First Baseline Run

Always run the initial training to establish baseline metrics:

```bash
uv run train.py > run.log 2>&1
grep "^val_loss:\|^val_bpb:\|^peak_vram_mb:" run.log
```

Record the baseline in `results.tsv`.

### Phase 3: Autonomous Loop

```
LOOP FOREVER (until manually interrupted):

1. ANALYZE current state
   - Read results.tsv to see what has been tried
   - Identify patterns: what worked, what didn't
   - Consider the next experiment

2. MODIFY code
   - Edit train.py with the experimental idea
   - Keep changes focused and reviewable

3. COMMIT
   git add -A && git commit -m "experiment: <description>"

4. RUN experiment
   uv run train.py > run.log 2>&1

5. EVALUATE results
   grep "^val_bpb:\|^peak_vram_mb:" run.log

6. LOG to results.tsv
   - commit hash (7 chars)
   - metric value
   - memory usage
   - status: keep/discard/crash
   - description

7. DECIDE
   - Improved (lower val_bpb)? → KEEP, advance the branch
   - Worse or equal? → DISCARD, git reset --hard HEAD~1
   - Crashed? → LOG the crash, fix or skip

8. REPEAT
```

## Results Log Format

`results.tsv` (tab-separated):

```
commit   val_bpb   memory_gb  status   description
a1b2c3d  0.997900  44.0       keep     baseline
b2c3d4e  0.993200  44.2       keep     increase LR to 0.04
c3d4e5f  1.005000  44.0       discard  switch to GeLU activation
d4e5f6g  0.000000  0.0        crash    double model width (OOM)
```

## Experiment Ideas

### Architecture Changes

- Increase/decrease model depth
- Change attention patterns (windowed, local, etc.)
- Modify MLP activation functions
- Add/remove normalization layers
- Experiment with embedding sizes

### Optimizer Tuning

- Adjust the learning rate
- Try different optimizers (Adam, Muon, etc.)
- Modify weight decay
- Experiment with gradient clipping

### Training Loop Modifications

- Change batch size
- Modify sequence length
- Add regularization techniques
- Implement learning rate schedules

## Safety Rules

| Rule | Detail |
|------|--------|
| Fixed time budget | Each run = 5 minutes (configurable) |
| Single file to modify | Only edit train.py (or the specified file) |
| No new dependencies | Use only existing packages |
| Read-only data prep | Never modify prepare.py |
| Timeout protection | Kill runs exceeding 2x the time budget |
| Git branch isolation | All work happens on a dedicated branch |

## Complexity Criterion

All else being equal, simpler is better:

- Small improvement + ugly code → NOT worth it
- Small improvement + deleted code → DEFINITELY keep
- No improvement + simpler code → Keep (a simplification win)

Weigh the complexity cost against the improvement magnitude.

## Key Metrics

| Metric | Goal | Notes |
|--------|------|-------|
| val_bpb | Lower is better | Validation bits per byte |
| val_loss | Lower is better | Alternative metric |
| peak_vram_mb | Monitor | Don't explode memory |
| MFU | Higher = better efficiency | Model FLOPs Utilization |
| tokens/sec | Higher = faster | Training throughput |

## Notifications

When the user wakes up / returns, report:

1. **A summary of the experiments run**
2. **The best result achieved**
3. **Notable discoveries**
4. **Recommendations for next steps**

## Error Handling

### Crashes

- Easy fix (typo, missing import) → Fix and re-run
- Fundamental issue → Log the crash, skip the idea

### OOM (Out of Memory)

- Reduce batch size
- Reduce model size
- Log as a crash, try an alternative

### Timeout

- Kill the process after 2x the budget
- Log as a failure, revert

## Example Session

```
User: "Run autoresearch on nanogpt overnight"

Agent:
1. Sets up branch autoresearch/apr5
2. Runs baseline: val_bpb = 1.023
3. Tries LR=0.02: val_bpb = 1.015 ✓ KEEP
4. Tries depth=16: val_bpb = 1.008 ✓ KEEP
5. Tries GeLU: val_bpb = 1.010 ✗ DISCARD
6. Tries window attention: val_bpb = 1.002 ✓ KEEP
... (runs 100+ experiments overnight)

User returns to:
- 127 experiments completed
- Best val_bpb: 0.987
- Key insight: window attention + LR=0.015 works best
```

## Advanced Usage

### Multiple Agents

Run parallel experiments on different GPUs:

```
Agent 1: branch autoresearch/apr5-gpu0
Agent 2: branch autoresearch/apr5-gpu1
```

### Custom Time Budget

Set it in your training script or via an environment variable (prepare.py stays read-only per the safety rules):

```bash
TIME_BUDGET=300  # 5 minutes, in seconds
```

### Research Domain Adaptation

Adapt the skill for:

- NLP experiments
- Computer vision
- Reinforcement learning
- Any iterative optimization task

## Related Skills

- `arxiv-search`: Find relevant papers for ideas
- `skill-extractor`: Capture patterns from successful experiments
- `meta-cognitive-reflection`: Reflect on research strategy

## Resources

- [karpathy/autoresearch](https://github.com/karpathy/autoresearch)
- [karpathy/nanochat](https://github.com/karpathy/nanochat)
- [Tweet announcement](https://x.com/karpathy/status/2029701092347630069)

---

**Remember: NEVER STOP until manually interrupted. The human expects you to continue working indefinitely.**
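## Appendix: Loop Sketch

The Phase 3 loop can be sketched in a few lines of Python. This is a minimal, illustrative sketch, not a real autoresearch API: it assumes `train.py` prints a line like `val_bpb: 0.9979`, and the names `run_experiment`, `decide`, and `TIME_BUDGET` are made up for this example.

```python
import re
import subprocess

TIME_BUDGET = 300  # seconds per run; kill at 2x, per the safety rules


def run_experiment():
    """Run one training job; return val_bpb, or None on crash/timeout."""
    try:
        proc = subprocess.run(
            ["uv", "run", "train.py"],
            capture_output=True, text=True, timeout=2 * TIME_BUDGET,
        )
    except subprocess.TimeoutExpired:
        return None
    if proc.returncode != 0:
        return None
    match = re.search(r"^val_bpb:\s*([0-9.]+)", proc.stdout, re.MULTILINE)
    return float(match.group(1)) if match else None


def decide(best, val_bpb):
    """Step 7 (DECIDE): keep improvements, discard the rest, log crashes."""
    if val_bpb is None:
        return "crash"
    if best is None or val_bpb < best:  # lower val_bpb is better
        return "keep"
    return "discard"                    # worse or equal


def autoresearch_loop():
    """Runs until manually interrupted; not invoked here."""
    best = run_experiment()  # Phase 2: baseline
    while True:
        # ... the agent edits train.py with the next idea and commits ...
        val_bpb = run_experiment()
        status = decide(best, val_bpb)
        if status == "keep":
            best = val_bpb
        else:
            # discard the experiment, per step 7
            subprocess.run(["git", "reset", "--hard", "HEAD~1"])
        # ... append commit, val_bpb, status, description to results.tsv ...
```

The keep/discard rule is deliberately strict: an equal score discards, so the branch only ever advances on a genuine improvement.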
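The wake-up report described under "Notifications" can be derived directly from `results.tsv`. A sketch, assuming the tab-separated columns shown in "Results Log Format" (`summarize` is a hypothetical helper, not part of any real autoresearch API):

```python
import csv


def summarize(path="results.tsv"):
    """Report experiments run, crash count, and the best kept result."""
    with open(path) as f:
        # header row: commit, val_bpb, memory_gb, status, description
        rows = list(csv.DictReader(f, delimiter="\t"))
    kept = [r for r in rows if r["status"] == "keep"]
    crashed = sum(r["status"] == "crash" for r in rows)
    if not kept:
        return f"{len(rows)} experiments run, none kept, {crashed} crashed"
    best = min(kept, key=lambda r: float(r["val_bpb"]))  # lower is better
    return (
        f"{len(rows)} experiments run, {len(kept)} kept, {crashed} crashed; "
        f"best val_bpb {best['val_bpb']} ({best['description']}, "
        f"commit {best['commit']})"
    )
```

For example, run against the sample log in "Results Log Format", this would report four experiments with two kept, one discarded, and one crash, and name `b2c3d4e` as the best commit.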