---
name: experiment-bridge
description: "Workflow 1.5: Bridge between idea discovery and auto review. Reads EXPERIMENT_PLAN.md, implements experiment code, deploys to GPU, collects initial results. Use when user says \"实现实验\", \"implement experiments\", \"bridge\", \"从计划到跑实验\", \"deploy the plan\", or has an experiment plan ready to execute."
---

# Workflow 1.5: Experiment Bridge

Implement and deploy experiments from plan: **$ARGUMENTS**

## Overview

This skill bridges Workflow 1 (idea discovery + method refinement) and Workflow 2 (auto review loop). It takes the experiment plan and turns it into running experiments with initial results.

```
Workflow 1 output:                   This skill:                      Workflow 2 input:
refine-logs/EXPERIMENT_PLAN.md    →  implement → deploy → collect  →  initial results ready
refine-logs/EXPERIMENT_TRACKER.md    code        /run-experiment      for /auto-review-loop
refine-logs/FINAL_PROPOSAL.md
```

## Constants

- **AUTO_DEPLOY = true** — Automatically deploy experiments after implementation. Set `false` to review code before deploying.
- **SANITY_FIRST = true** — Run the sanity-stage experiment first (smallest, fastest) before launching the rest. Catches setup bugs early.
- **MAX_PARALLEL_RUNS = 4** — Maximum number of experiments to deploy in parallel (limited by available GPUs).
- **BASE_REPO = false** — GitHub repo URL to use as a base codebase. When set, clone it first and implement experiments on top of it.
- **COMPACT = false** — When `true`, prefer `IDEA_CANDIDATES.md` over the full `IDEA_REPORT.md`, and append completed runs to `EXPERIMENT_LOG.md`.

> Override: `/experiment-bridge "EXPERIMENT_PLAN.md" — compact: true, base repo: https://github.com/org/project`

## Inputs

This skill expects one or more of:

1. **`refine-logs/EXPERIMENT_PLAN.md`** (best) — claim-driven experiment roadmap from `/experiment-plan`
2. **`refine-logs/EXPERIMENT_TRACKER.md`** — run-by-run execution table
3. **`refine-logs/FINAL_PROPOSAL.md`** — method description for implementation context
4. **`IDEA_CANDIDATES.md`** — compact idea summary (preferred when `COMPACT = true`)
5. **`IDEA_REPORT.md`** — fallback if refine-logs don't exist

If none exist, ask the user what experiments to implement.

## Workflow

### Phase 1: Parse the Experiment Plan

Read `EXPERIMENT_PLAN.md` and extract:

1. **Run order and milestones** — which experiments run first (sanity → baseline → main → ablation → polish)
2. **For each experiment block:**
   - Dataset / split / task
   - Compared systems and variants
   - Metrics to compute
   - Setup details (backbone, hyperparameters, seeds)
   - Success criterion
   - Priority (MUST-RUN vs NICE-TO-HAVE)
3. **Compute budget** — total estimated GPU-hours
4. **Method details** from `FINAL_PROPOSAL.md` — what exactly to implement

Present a brief summary:

```
📋 Experiment plan loaded:
- Milestones: [N] (sanity → baseline → main → ablation)
- Must-run experiments: [N]
- Nice-to-have: [N]
- Estimated GPU-hours: [X]

Proceeding to implementation.
```

### Phase 2: Implement Experiment Code

**If `BASE_REPO` is set** — clone the repo first:

```bash
git clone <BASE_REPO> base_repo/
```

For each milestone (in order), write the experiment scripts:

1. **Check existing code** — scan the project (or cloned `base_repo/`) for existing experiment scripts, model code, and data loaders. Reuse as much as possible.
2. **Implement missing pieces** (see the sketch after this list):
   - Training scripts with proper argparse (all hyperparameters configurable)
   - Evaluation scripts computing the specified metrics
   - Data loading / preprocessing if needed
   - Baseline implementations if not already present
   - Fixed random seeds for reproducibility
   - Results saved to JSON/CSV for later analysis
   - Proper logging (wandb if configured in AGENTS.md)
3. **Follow the plan's run order** — implement sanity-stage experiments first, then baselines, then main method, then ablations.
4. **Self-review before deploying:**
   - Are all hyperparameters from EXPERIMENT_PLAN.md reflected in argparse?
   - Is the random seed fixed and controllable?
   - Are results saved in a parseable format (JSON/CSV)?
   - Does the code match FINAL_PROPOSAL.md's method description?
   - **CRITICAL**: does evaluation compare predictions against dataset ground truth, never another model's output?
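As a concrete reference for step 2, here is a minimal sketch of what such a training script might look like. The file name `train.py`, the flag names, and the `results/` layout are illustrative assumptions, not something the plan prescribes:

```python
# train.py: minimal experiment-script skeleton (illustrative; adapt to the
# project's actual model, data, and metric code).
import argparse
import json
import random
import time
from pathlib import Path

def main():
    parser = argparse.ArgumentParser()
    # Every hyperparameter from EXPERIMENT_PLAN.md should surface here.
    parser.add_argument("--run-id", required=True)
    parser.add_argument("--lr", type=float, default=1e-4)
    parser.add_argument("--batch-size", type=int, default=32)
    parser.add_argument("--seed", type=int, default=42)  # fixed and controllable
    parser.add_argument("--out-dir", type=Path, default=Path("results"))
    args = parser.parse_args()

    random.seed(args.seed)  # also seed numpy/torch if the project uses them

    # ... build model, train, evaluate against *dataset ground truth* ...
    metrics = {"key_metric": 0.0}  # placeholder for real evaluation output

    # Save a parseable record so Phase 5 / auto-review-loop can consume it.
    args.out_dir.mkdir(parents=True, exist_ok=True)
    record = {
        "run_id": args.run_id,
        "seed": args.seed,
        "config": {"lr": args.lr, "batch_size": args.batch_size},
        "metrics": metrics,
        "finished_at": time.time(),
    }
    (args.out_dir / f"{args.run_id}.json").write_text(json.dumps(record, indent=2))

if __name__ == "__main__":
    main()
```

Every value the self-review checklist asks about (hyperparameters, seed, output format) is visible on the command line, which is what makes the later phases scriptable.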
### Phase 3: Sanity Check (if SANITY_FIRST = true)

Before deploying the full experiment suite, run the sanity-stage experiment:

```
/run-experiment [sanity experiment command]
```

Wait for completion. Verify:

- Training loop runs without errors
- Metrics are computed and saved correctly
- GPU memory usage is within bounds
- Output format matches expectations

If sanity fails → fix the code, re-run. Do not proceed to full deployment with broken code.

### Phase 4: Deploy Full Experiments

Deploy experiments following the plan's milestone order:

```
/run-experiment [experiment commands]
```

For each milestone:

1. Deploy experiments in parallel (up to MAX_PARALLEL_RUNS)
2. Use `/monitor-experiment` to track progress
3. Collect results as experiments complete

**🚦 Checkpoint (if AUTO_DEPLOY = false):**

```
🔧 Code implementation complete. Ready to deploy:

Milestone 0 (sanity): [status — passed/pending]
Milestone 1 (baseline): [N experiments, ~X GPU-hours]
Milestone 2 (main method): [N experiments, ~X GPU-hours]
Milestone 3 (ablations): [N experiments, ~X GPU-hours]

Total estimated: ~X GPU-hours on [N] GPUs

Deploy now? Or review the code first?
```

### Phase 5: Collect Initial Results

As experiments complete:

1. **Parse output files** (JSON/CSV/logs) for key metrics (see the sketch after this phase)
2. **Training quality check** — if W&B data is available, invoke `/training-check` to detect NaN, loss divergence, plateaus, or overfitting. If W&B is not configured, skip silently.
3. **Update `refine-logs/EXPERIMENT_TRACKER.md`** — fill in Status and Notes columns
4. **Check success criteria** from EXPERIMENT_PLAN.md — did each experiment meet its bar?
5. **Write initial results summary:**

```markdown
# Initial Experiment Results

**Date**: [today]
**Plan**: refine-logs/EXPERIMENT_PLAN.md

## Results by Milestone

### M0: Sanity — PASSED
- [result]

### M1: Baselines

| Run  | System     | Key Metric | Status |
|------|------------|------------|--------|
| R001 | baseline_1 | X.XX       | DONE   |

### M2: Main Method

| Run  | System     | Key Metric | Status |
|------|------------|------------|--------|
| R003 | our_method | X.XX       | DONE   |

### M3: Ablations
...

## Summary

- [X/Y] must-run experiments completed
- Main result: [positive/negative/inconclusive]
- Ready for /auto-review-loop: [YES/NO]

## Next Step

→ /auto-review-loop "[topic]"
```
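One possible shape for the collection step, assuming experiments wrote the per-run JSON records sketched in Phase 2. The `results/` directory and the success-bar mapping here are assumptions; in practice the thresholds come from EXPERIMENT_PLAN.md's success criteria:

```python
# collect_results.py: gather per-run JSON records and check success criteria
# (illustrative; the directory layout and bars are assumptions).
import json
from pathlib import Path

def collect(results_dir: Path, success_bars: dict[str, float]) -> list[dict]:
    rows = []
    for path in sorted(results_dir.glob("*.json")):
        record = json.loads(path.read_text())
        run_id = record["run_id"]
        metric = record["metrics"]["key_metric"]
        bar = success_bars.get(run_id)
        # A run is DONE if it has no bar, or its key metric meets the bar.
        status = "DONE" if bar is None or metric >= bar else "BELOW BAR"
        rows.append({"run": run_id, "key_metric": metric, "status": status})
    return rows

if __name__ == "__main__":
    # Hypothetical bar for R003; real values are parsed from the plan.
    for row in collect(Path("results"), {"R003": 0.80}):
        print(f"| {row['run']} | {row['key_metric']:.2f} | {row['status']} |")
```

The printed rows drop straight into the tracker and summary tables above, which keeps the tracker update mechanical rather than hand-edited.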
### Phase 5.5: Write Compact Log (when COMPACT = true)

**Skip entirely if `COMPACT` is `false`.**

Append each completed experiment to `EXPERIMENT_LOG.md`:

```markdown
## [Run ID] — [timestamp]

- **System**: [method name]
- **Config**: [key hyperparameters]
- **Result**: [primary metric = X.XX]
- **Verdict**: [positive / negative / inconclusive]
- **Reproduce**: `python train.py --config configs/run_id.yaml --seed 42`
```

### Phase 5.6: Auto Ablation Planning

After main experiments (M2) complete with positive results, invoke `/ablation-planner` to design ablation studies:

- Read the main results and method description
- Generate a claim-driven ablation plan: which components to remove, what to compare, and expected outcomes
- Append ablation blocks to `refine-logs/EXPERIMENT_PLAN.md` and `refine-logs/EXPERIMENT_TRACKER.md`
- If main results are negative or inconclusive, skip ablation planning and note that in the summary

If `/ablation-planner` is unavailable, skip silently.

### Phase 6: Handoff

Present final status:

```
🔬 Experiment bridge complete:

- Implemented: [N] experiment scripts
- Deployed: [N] experiments on [M] GPUs
- Completed: [X/Y] must-run, [A/B] nice-to-have
- Main result: [one sentence]

Results: refine-logs/EXPERIMENT_RESULTS.md
Tracker: refine-logs/EXPERIMENT_TRACKER.md

Ready for Workflow 2:
→ /auto-review-loop "[topic]"
```

## Key Rules

- **Large file handling**: If the Write tool fails due to file size, immediately retry using Bash (`cat << 'EOF' > file`) to write in chunks. Do NOT ask the user for permission — just do it silently.
- **CRITICAL — Evaluation must use dataset ground truth.** Always compare model predictions against the dataset's actual labels/targets, never another model's output. If the task has official eval scripts, prefer them. (See the sketch at the end of this file.)
- **Follow the plan.** Do not invent experiments not in EXPERIMENT_PLAN.md. If you think something is missing, note it but don't add it.
- **Sanity first.** Never deploy a full suite without verifying the sanity stage passes.
- **Reuse existing code.** Scan the project before writing new scripts. Extend, don't duplicate.
- **Save everything as JSON/CSV.** The auto-review-loop needs parseable results, not just terminal output.
- **Update the tracker.** `EXPERIMENT_TRACKER.md` should reflect real status after each run completes.
- **Don't wait forever.** If an experiment exceeds 2x its estimated time, flag it and move on to the next milestone.
- **Budget awareness.** Track GPU-hours against the plan's budget. Warn if approaching the limit.

## Composing with Other Skills

```
/idea-discovery "direction"            ← Workflow 1: find + refine + plan
/experiment-bridge                     ← you are here (Workflow 1.5: implement + deploy)
/auto-review-loop "topic"              ← Workflow 2: review + iterate
/paper-writing "NARRATIVE_REPORT.md"   ← Workflow 3: write the paper

Or use /research-pipeline for the full end-to-end flow (includes this bridge).
```
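As a concrete footnote to the CRITICAL evaluation rule above, a minimal sketch of a compliant evaluation step. Everything here (file paths, function names, the accuracy metric) is hypothetical, and official eval scripts take precedence whenever they exist:

```python
# evaluate.py: compliant evaluation sketch. Predictions are scored against the
# dataset's own labels, never against another model's outputs.
# (All names and paths are hypothetical.)
import json
from pathlib import Path

def accuracy(predictions: list[int], labels: list[int]) -> float:
    assert len(predictions) == len(labels), "prediction/label count mismatch"
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def evaluate(pred_file: Path, dataset_file: Path, out_file: Path) -> None:
    preds = json.loads(pred_file.read_text())      # model outputs
    labels = json.loads(dataset_file.read_text())  # dataset ground truth
    metrics = {"accuracy": accuracy(preds, labels)}
    # Save as JSON so the auto-review-loop can parse it (Key Rules above).
    out_file.write_text(json.dumps(metrics, indent=2))
    print(metrics)

if __name__ == "__main__":
    evaluate(Path("results/R003_preds.json"),
             Path("data/test_labels.json"),
             Path("results/R003_metrics.json"))
```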