---
name: agent-rl-benchmark
description: "Agent² RL-Bench: Benchmark for evaluating agentic RL post-training where LLM agents autonomously design, implement, and run complete RL pipelines. Use when evaluating LLM agent capabilities for reinforcement learning engineering, RL pipeline automation, or agentic model alignment."
---
# Agent² RL-Bench: Agentic RL Post-Training Benchmark
Benchmark for evaluating whether LLM agents can autonomously design, implement, and run complete RL pipelines that improve foundation models.
## Core Concept
As RL post-training increasingly drives model alignment and specialization, the ability of LLM agents to engineer RL pipelines becomes critical. This benchmark tests:
1. **Pipeline Design**: Can agents design RL training pipelines?
2. **Implementation**: Can agents write correct RL code?
3. **Execution**: Can agents run and debug RL experiments?
4. **Improvement**: Do the pipelines actually improve models?
## Activation Keywords
- Agent² RL-Bench
- agentic RL post-training
- LLM agent RL engineering
- automated RL pipeline
- agent RL benchmark
- RL pipeline automation
- LLM alignment automation
## Benchmark Structure
### Level 1: Pipeline Design
Agents must design RL pipelines given:
- Target model (base LLM)
- Task specification
- Reward model requirements
- Computational constraints
**Outputs** (one possible design is sketched after this list):
- RL algorithm selection (PPO, DPO, GRPO, etc.)
- Hyperparameter configuration
- Data preprocessing strategy
- Evaluation protocol
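One way a Level 1 design might be expressed is as a plain config; every field name below is illustrative, not part of the benchmark API:
```python
# Illustrative Level 1 output: a pipeline design as a config dict.
# All keys and values are hypothetical placeholders.
pipeline_design = {
    "algorithm": "GRPO",               # chosen over PPO: no value network needed
    "base_model": "base-llm-7b",       # placeholder model id
    "hyperparameters": {
        "learning_rate": 1e-6,
        "batch_size": 64,
        "group_size": 8,               # responses sampled per prompt
        "kl_coeff": 0.04,              # KL penalty against the reference model
        "max_steps": 1000,
    },
    "data": {
        "dataset": "reasoning-train",  # placeholder dataset id
        "preprocessing": ["dedupe", "filter_length", "tokenize"],
    },
    "evaluation": {
        "metric": "pass@1",
        "heldout_split": "reasoning-test",
        "eval_every_steps": 100,
    },
}
```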
### Level 2: Implementation
Agents must implement the designed pipeline:
```python
# Example: GRPO trainer skeleton; agents must fill in the stubs.
class AgenticGRPOTrainer:
    def __init__(self, model, config):
        self.model = model
        self.config = config

    def generate_responses(self, prompts):
        """Generate multiple responses per prompt (one group per prompt)."""
        pass

    def compute_rewards(self, responses, ground_truth):
        """Score each response against the ground truth."""
        pass

    def compute_grpo_loss(self, log_probs, rewards):
        """Group Relative Policy Optimization loss for one response group.

        log_probs, rewards: 1-D tensors of length group_size.
        Advantage = r_i - mean(r); Loss = -mean(log_prob * advantage).
        """
        advantage = rewards - rewards.mean()
        return -(log_probs * advantage).mean()

    def train_step(self, batch):
        """Single training step: generate, score, compute loss, backprop."""
        pass
```
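Unlike PPO, GRPO needs no learned value function: the mean reward of each response group serves as the baseline, so only the policy model is trained.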
### Level 3: Execution & Debugging
Agents must (a minimal run-and-monitor loop is sketched after this list):
- Execute training runs
- Monitor metrics
- Handle failures
- Debug issues
- Optimize performance
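A minimal run-and-monitor sketch, assuming the trainer exposes a `next_batch()` helper and a `train_step()` that returns a metrics dict; neither is specified above:
```python
import math

def run_training(trainer, num_steps, log_every=10):
    """Hypothetical execution loop with basic failure detection."""
    history = []
    for step in range(num_steps):
        metrics = trainer.train_step(trainer.next_batch())  # assumed API
        history.append(metrics)
        # NaN loss is one of the most common RL training failures;
        # abort early so the agent can diagnose and retry.
        if math.isnan(metrics["loss"]):
            raise RuntimeError(f"NaN loss at step {step}; check LR / reward scale")
        if step % log_every == 0:
            print(f"step={step} loss={metrics['loss']:.4f} "
                  f"mean_reward={metrics['mean_reward']:.3f}")
    return history
```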
### Level 4: Evaluation
Agents must evaluate whether the trained model improved:
```python
def evaluate_model(model, test_tasks):
    metrics = {}
    for task in test_tasks:
        predictions = model.generate(task.inputs)
        metrics[task.name] = task.evaluate(predictions)
    return metrics
```
## Task Categories
### 1. Reasoning Tasks
- Mathematical reasoning
- Code generation
- Logical deduction
- Chain-of-thought
### 2. Alignment Tasks
- Harmlessness
- Helpfulness
- Honesty
- Instruction following
### 3. Specialization Tasks
- Domain adaptation
- Few-shot learning
- Tool use
- Multi-turn conversation
## Evaluation Metrics
### Pipeline Quality
- **Correctness**: Does the code run?
- **Efficiency**: Training time, memory usage
- **Scalability**: Does it work across model sizes?
### Model Improvement
- **Absolute Gain**: Δ in task performance
- **Relative Gain**: % improvement over baseline
- **Generalization**: Improvement on held-out tasks
- **Robustness**: Consistent across seeds
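A minimal sketch of computing Absolute and Relative Gain from per-task scores; the function and score format are assumed, not part of the benchmark:
```python
def improvement_report(baseline_scores, trained_scores):
    """Compare per-task scores before and after RL training.

    Both arguments map task name -> scalar score (higher is better).
    """
    report = {}
    for name, base in baseline_scores.items():
        trained = trained_scores[name]
        absolute_gain = trained - base                               # Δ performance
        relative_gain = (trained - base) / base if base else float("inf")
        report[name] = {
            "absolute_gain": absolute_gain,
            "relative_gain_pct": 100.0 * relative_gain,              # % over baseline
        }
    return report
```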
### Agent Capabilities
- **Autonomy**: % tasks completed without human intervention
- **Recovery**: Ability to recover from failures
- **Optimization**: Hyperparameter tuning quality
## Example Tasks
### Task 1: Implement GRPO
```
Given: Base model, reasoning dataset
Design: GRPO training pipeline
Implement: Complete training loop
Execute: Train for N steps
Evaluate: Pass@1 on math problems
```
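Pass@1 here reads as the fraction of problems solved with a single sample per problem; a minimal sketch, with `checker` standing in for an assumed task-specific verifier:
```python
def pass_at_1(model, problems, checker):
    """Fraction of problems solved by one sample per problem (assumed names)."""
    solved = 0
    for problem in problems:
        answer = model.generate(problem.prompt)  # one sample per problem
        solved += int(checker(problem, answer))  # checker: assumed verifier
    return solved / len(problems)
```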
### Task 2: Debug RL Training
```
Given: Failing RL training run
Diagnose: Root cause of failure
Fix: Correct the issue
Verify: Training succeeds
```
### Task 3: Optimize Hyperparameters
```
Given: Working baseline
Explore: Hyperparameter space
Optimize: For target metric
Report: Best configuration
```
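Even naive random search illustrates the expected Task 3 loop; `train_and_eval` is an assumed helper that trains briefly under a config and returns the target metric:
```python
import random

def random_search(base_config, train_and_eval, trials=8):
    """Naive random search over the design config; agents may do better."""
    best_score, best_config = float("-inf"), None
    for _ in range(trials):
        config = dict(base_config)
        config["learning_rate"] = 10 ** random.uniform(-7, -5)
        config["kl_coeff"] = random.choice([0.01, 0.02, 0.04, 0.08])
        score = train_and_eval(config)  # assumed: trains, returns target metric
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score
```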
## Difficulty Levels
| Level | Description | Example |
|-------|-------------|---------|
| Easy | Standard RLHF | PPO on preference data |
| Medium | Advanced methods | GRPO, DPO variants |
| Hard | Novel problems | Multi-task RL, constrained optimization |
| Expert | Research-level | Novel algorithm design |
## Tools Used
- `exec`: Run RL training scripts
- `python`: Implement RL pipelines
- `read`: Inspect results and logs
- `write`: Generate pipeline code
## Implementation Guidelines
### Agent Environment
```python
class RLBenchEnvironment:
    def __init__(self):
        self.available_models = [...]   # models the agent may train
        self.compute_budget = {...}     # e.g. GPU-hours, max steps
        self.datasets = {...}           # training / held-out splits
        self.runs = {}                  # run_id -> execution results

    def submit_solution(self, agent_code):
        """Submit the agent's RL pipeline for execution."""
        return self.evaluate(agent_code)

    def get_feedback(self, run_id):
        """Get execution results for a previous submission."""
        return self.runs[run_id]
```
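A hypothetical interaction, assuming `submit_solution` returns a dict carrying a `run_id` (the return shape is not specified above):
```python
env = RLBenchEnvironment()
agent_pipeline_code = "..."                        # the agent's generated pipeline
result = env.submit_solution(agent_pipeline_code)
feedback = env.get_feedback(result["run_id"])      # assumes result carries a run_id
print(feedback)
```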
### Success Criteria
A successful agent:
1. Designs a valid RL approach
2. Implements bug-free code
3. Trains without crashes
4. Achieves a positive Δ in performance
5. Generalizes to held-out tasks
## References
- arXiv:2604.10547v1 (2026) - "Agent² RL-Bench: Can LLM Agents Engineer Agentic RL Post-Training?"
- RLHF literature (Ouyang et al., Bai et al.)
- GRPO (DeepSeek-R1)
- LLM agent benchmarks (SWE-bench, etc.)
## Related Skills
- grpo-rl-training: GRPO implementation
- llm-alignment: Model alignment techniques
- agent-evaluation: LLM agent benchmarks
- reinforcement-learning: General RL methods