---
name: bio-mystery-bench
description: BioMysteryBench methodology for benchmarking LLM bioinformatics research capabilities on real-world, open-ended problems. Addresses three key challenges in scientific benchmarking: (1) multiple valid approaches to the same problem, (2) subjective research decisions in noisy datasets, (3) unsolved biological questions. Use when: evaluating AI scientific capabilities, designing research benchmarks, bioinformatics evaluation, agentic science assessment, or measuring AI performance on open-ended research tasks.
---

# BioMysteryBench: Benchmarking AI Bioinformatics Research

## The Challenge of Scientific Benchmarks

Three properties make biological research especially hard to benchmark:

### 1. Multiple Valid Approaches

- The same research question can be tackled via GWAS, microbiome sequencing, and other methods
- The choice of approach depends on expertise, resources, and research taste
- BixBench handles this by grading conclusions, not methods
- Tradeoff: conclusions reflect subjective choices made along the way

### 2. Subjective Decisions Lead to Different Conclusions

- Small analysis differences can flip conclusions in noisy biological datasets
- Example: metformin response prediction — three studies, three different answers
- SciGym avoids this by using simulated biology with known ground truth
- Tradeoff: unclear whether simulated performance tracks real-data performance

### 3. Unsolved Problems

- The most impactful tasks are the ones humans haven't solved yet
- Expert intuition cannot serve as ground truth
- Solution: ground benchmarks in experimental measurements
  - ProteinGym: mutation fitness via Deep Mutational Scanning
  - CASP: protein folding against unpublished crystal structures

## Benchmarking Strategy

**BioMysteryBench approach**:

- Tasks Claude with analysis of real-world biological datasets
- Evaluates creative solutions to open-ended problems
- Grounds evaluation in verifiable biological outcomes

## Key Findings

- Claude's biological capabilities improve rapidly across model generations
- Current models perform on par with human experts
- The latest generations solved problems that human expert panels could not
- Models sometimes used very different strategies than humans did

## Benchmark Design Principles for Scientific Tasks

1. **Ground in data, not just expert opinion** — use experimental measurements as ground truth
2. **Allow multiple solution paths** — grade on correctness, not methodology
3. **Include unsolved problems** — test where models could surpass human capability
4. **Handle ambiguity gracefully** — noisy real-world data should be part of the evaluation
5. **Compare to human baselines** — include expert panels for comparison

## Comparison with Other Benchmarks

| Benchmark | Tests | Limitation |
|-----------|-------|------------|
| MMLU-Pro | Expert knowledge | Self-contained Q&A only |
| GPQA | Graduate-level questions | Google-proof, but still Q&A |
| LAB-Bench | Biology knowledge work | Reading figures and protocols |
| BLADE | Dataset analysis | Checks whether steps match a human's |
| BixBench | Biological datasets | Grades conclusions (subjective grader) |
| SciGym | Simulated lab | Simulated, not real data |
| BioMysteryBench | Real-world open problems | Novel approach |
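## Example: Outcome-Grounded Grading (Sketch)

The grading principles above can be made concrete with a small harness. The sketch below is a hypothetical illustration, not the published BioMysteryBench code: `Task`, `grade_task`, and every path and number in it are invented for the example, and the only real library call is `scipy.stats.spearmanr`. It scores a model's predictions against held-out experimental measurements (e.g., Deep Mutational Scanning fitness values, as in ProteinGym) without ever inspecting the methodology that produced them.

```python
from dataclasses import dataclass

from scipy.stats import spearmanr


@dataclass
class Task:
    """One open-ended research task grounded in experimental data.

    Hypothetical structure, for illustration only.
    """
    question: str                         # open-ended prompt given to the model
    dataset_path: str                     # real-world dataset the model analyzes
    ground_truth: dict[str, float]        # held-out experimental measurements
    expert_baseline: float | None = None  # score achieved by a human expert panel


def grade_task(task: Task, predictions: dict[str, float]) -> float:
    """Rank-correlate predictions with held-out experimental outcomes.

    Methodology-agnostic: any analysis route (GWAS, microbiome
    sequencing, or a strategy no human would try) that yields accurate
    predictions earns the same score.
    """
    shared = sorted(set(task.ground_truth) & set(predictions))
    if len(shared) < 3:
        return 0.0  # too few overlapping items to score meaningfully
    rho, _ = spearmanr(
        [task.ground_truth[k] for k in shared],
        [predictions[k] for k in shared],
    )
    return float(rho)


# Usage: compare the model's score to a human expert baseline (principle 5).
task = Task(
    question="Which variants of gene X impair protein function?",
    dataset_path="data/gene_x_variants.csv",  # invented path
    ground_truth={"A12V": 0.31, "G45D": 0.02, "L78F": 0.88, "R90Q": 0.15},
    expert_baseline=0.61,  # invented baseline
)
model_score = grade_task(task, {"A12V": 0.4, "G45D": 0.1, "L78F": 0.9, "R90Q": 0.2})
print(f"model {model_score:.2f} vs expert panel {task.expert_baseline:.2f}")
```

Grading against held-out measurements rather than expert opinion addresses challenges 1 and 3 directly: the grader never judges the approach, and it still works on questions no human has answered, as long as an experimental readout exists.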