---
name: cheesebench-evaluating-large-language-models
description: "We introduce CheeseBench, a benchmark that evaluates large language models (LLMs) on nine classical behavioral neuroscience paradigms (Morris water maze, Barnes maze, T-maze, radial arm maze, star maze, operant chamber, shuttle box, conditioned place preference, and delayed non-match to sample), spanning six cognitive dimensions. Activation: rodent behavior paradigms, LLM evaluation, ODE complexity"
version: 1.0.0
metadata:
  hermes:
    source_paper: "CheeseBench: Evaluating Large Language Models on Rodent Behavioral Neuroscience Paradigms (arXiv:2604.10825v1)"
    tags: [behavior, behavioral, cognitive, learning, neuroscience, paradigm]
---
# CheeseBench: Evaluating Large Language Models on Rodent Behavioral Neuroscience Paradigms
## Paper Reference
- **Title**: CheeseBench: Evaluating Large Language Models on Rodent Behavioral Neuroscience Paradigms
- **Authors**: Zacharie Bugaud
- **arXiv**: 2604.10825v1
- **Published**: 2026-04-12
- **Categories**: cs.AI
- **Abstract**: https://arxiv.org/abs/2604.10825
## Overview
We introduce CheeseBench, a benchmark that evaluates large language models (LLMs) on nine classical behavioral neuroscience paradigms (Morris water maze, Barnes maze, T-maze, radial arm maze, star maze, operant chamber, shuttle box, conditioned place preference, and delayed non-match to sample), spanning six cognitive dimensions. Each task is grounded in peer-reviewed rodent protocols with approximate animal baselines. The agent receives a unified system prompt with no task-specific instructions and must discover goals purely from ASCII text observations and reward signals, much like a rodent placed into an unfamiliar apparatus. We evaluate six open-weight LLMs (3B to 72B parameters) on text-based ASCII renderings and compare against both a random baseline and a graph-based reinforcement learning baseline.
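The observe-act-reward loop described above can be sketched as a minimal simulation. Everything below (`ToyTMaze`, `run_episode`, the policy) is a hypothetical stand-in to illustrate the interaction pattern, not the paper's actual environment or API:

```python
class ToyTMaze:
    """Toy T-maze: the rewarded arm alternates sides each trial,
    mirroring spontaneous-alternation protocols."""

    def __init__(self):
        self.rewarded = "left"

    def reset(self):
        # The agent sees only a text observation, never the reward rule.
        return "You stand at the junction of a T-maze. Arms: left, right."

    def step(self, action):
        reward = 1.0 if action == self.rewarded else 0.0
        # Rewarded side flips for the next trial.
        self.rewarded = "right" if self.rewarded == "left" else "left"
        return "Back at the junction.", reward, False


def run_episode(env, policy, max_steps=6):
    """One episode: text observation in, action out, reward accumulated."""
    transcript = []  # (observation, action, reward) history the agent may use
    total = 0.0
    obs = env.reset()
    for _ in range(max_steps):
        action = policy(obs, transcript)
        obs, reward, done = env.step(action)
        transcript.append((obs, action, reward))
        total += reward
        if done:
            break
    return total


def alternating_policy(obs, transcript):
    """Picks the arm opposite to the previous choice."""
    if not transcript:
        return "left"
    return "right" if transcript[-1][1] == "left" else "left"
```

A perfectly alternating agent earns reward on every trial of this toy task, while a fixed-side agent earns it on only half, which is the kind of gap the benchmark's reward signal is meant to expose.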
## Core Concepts
1. **Behavioral Neuroscience Paradigms**: Classic rodent behavioral tests as LLM evaluation tasks
2. **Cross-Species Evaluation**: Bridging animal behavior research with AI evaluation
3. **Cognitive Task Benchmarking**: Systematic assessment of LLM capabilities through behavioral paradigms
4. **Spatial Memory & Navigation**: Evaluating spatial reasoning through maze-like tasks
## Core Paradigms Covered
| Task | Cognitive Domain | What it Tests |
|------|-----------------|---------------|
| Morris Water Maze | Spatial learning | Navigation & memory |
| Barnes Maze | Spatial learning | Escape navigation |
| T-Maze | Working memory | Alternation behavior |
| Radial Arm Maze | Spatial reference memory | Memory capacity |
| Star Maze | Spatial navigation | Route learning |
| Operant Chamber | Instrumental learning | Action-outcome association |
| Shuttle Box | Avoidance learning | Conditioned avoidance |
| Conditioned Place Preference | Reward learning | Place-reward association |
| Delayed Non-Match to Sample | Working memory | Recognition across delay |
## Implementation Pattern
```python
class CheeseBenchTask:
    """Base class for rodent behavioral paradigm tasks."""

    def __init__(self, name, domain, description):
        self.name = name
        self.domain = domain
        self.description = description

    def evaluate_llm(self, llm_response, ground_truth):
        raise NotImplementedError


class MorrisWaterMaze(CheeseBenchTask):
    """Spatial navigation and learning task."""

    def __init__(self):
        super().__init__(
            name="Morris Water Maze",
            domain="Spatial Learning",
            description="Navigate to hidden platform using spatial cues",
        )

    def generate_prompt(self, session_num=1):
        return ("Imagine you are in a circular pool. "
                "Find a hidden platform using spatial landmarks. "
                f"Session: {session_num}")
```
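The base class above leaves `evaluate_llm` abstract. As one illustrative way to make scoring concrete, the sketch below grades a T-maze choice sequence by its alternation rate; this metric and the class name are assumptions for the example, not the paper's actual scoring criterion:

```python
class TMazeAlternationScorer:
    """Illustrative scorer: fraction of consecutive choices that alternate.

    This metric is an assumption for the sketch, not CheeseBench's
    published criterion."""

    def evaluate(self, choices):
        # choices: list of "left"/"right" picks across trials
        if len(choices) < 2:
            return 0.0  # no consecutive pair to judge
        alternations = sum(a != b for a, b in zip(choices, choices[1:]))
        return alternations / (len(choices) - 1)
```

For example, `["left", "right", "left", "right"]` scores 1.0 (perfect alternation) while `["left", "left", "right"]` scores 0.5, giving a scalar that can be compared directly against random and animal baselines.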
## Applications
- LLM cognitive capability evaluation
- Cross-modal behavioral benchmarking
- AI safety assessment through behavioral paradigms
- Comparative cognitive science
## Limitations
- Based on abstract analysis; full paper may contain additional details
- Implementations are illustrative; refer to paper for production code
- Domain-specific parameters need empirical tuning
## References
- Zacharie Bugaud (2026). "CheeseBench: Evaluating Large Language Models on Rodent Behavioral Neuroscience Paradigms." arXiv:2604.10825v1.
- Full paper: https://arxiv.org/pdf/2604.10825.pdf
## Activation Keywords
- behavior, behavioral, cognitive, learning, neuroscience, paradigm, rodent