---
name: agent-rl-benchmark
description: "Agent² RL-Bench: Benchmark for evaluating agentic RL post-training where LLM agents autonomously design, implement, and run complete RL pipelines. Use when evaluating LLM agent capabilities for reinforcement learning engineering, RL pipeline automation, or agentic model alignment."
---
# Agent² RL-Bench: Agentic RL Post-Training Benchmark
Benchmark for evaluating whether LLM agents can autonomously design, implement, and run complete RL pipelines that improve foundation models.
## Core Concept
As RL post-training increasingly drives model alignment and specialization, the ability of LLM agents to engineer RL pipelines becomes critical. This benchmark tests:
1. **Pipeline Design**: Can agents design RL training pipelines?
2. **Implementation**: Can agents write correct RL code?
3. **Execution**: Can agents run and debug RL experiments?
4. **Improvement**: Do the pipelines actually improve models?
## Activation Keywords
- Agent² RL-Bench
- agentic RL post-training
- LLM agent RL engineering
- automated RL pipeline
- agent RL benchmark
- RL pipeline automation
- LLM alignment automation
## Benchmark Structure
### Level 1: Pipeline Design
Agents must design RL pipelines given:
- Target model (base LLM)
- Task specification
- Reward model requirements
- Computational constraints
**Outputs** (one possible design is sketched after this list):
- RL algorithm selection (PPO, DPO, GRPO, etc.)
- Hyperparameter configuration
- Data preprocessing strategy
- Evaluation protocol
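One way a Level 1 design might be expressed is as a plain config; every field name below is illustrative, not part of the benchmark API:
```python
# Illustrative Level 1 output: a pipeline design as a config dict.
# All keys and values are hypothetical placeholders.
pipeline_design = {
    "algorithm": "GRPO",               # chosen over PPO: no value network needed
    "base_model": "base-llm-7b",       # placeholder model id
    "hyperparameters": {
        "learning_rate": 1e-6,
        "batch_size": 64,
        "group_size": 8,               # responses sampled per prompt
        "kl_coeff": 0.04,              # KL penalty against the reference model
        "max_steps": 1000,
    },
    "data": {
        "dataset": "reasoning-train",  # placeholder dataset id
        "preprocessing": ["dedupe", "filter_length", "tokenize"],
    },
    "evaluation": {
        "metric": "pass@1",
        "heldout_split": "reasoning-test",
        "eval_every_steps": 100,
    },
}
```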
### Level 2: Implementation
Agents must implement the designed pipeline:
```python
# Example: GRPO trainer skeleton; agents must fill in the stubs.
class AgenticGRPOTrainer:
    def __init__(self, model, config):
        self.model = model
        self.config = config

    def generate_responses(self, prompts):
        """Generate multiple responses per prompt (one group per prompt)."""
        pass

    def compute_rewards(self, responses, ground_truth):
        """Score each response against the ground truth."""
        pass

    def compute_grpo_loss(self, log_probs, rewards):
        """Group Relative Policy Optimization loss for one response group.

        log_probs, rewards: 1-D tensors of length group_size.
        Advantage = r_i - mean(r); Loss = -mean(log_prob * advantage).
        """
        advantage = rewards - rewards.mean()
        return -(log_probs * advantage).mean()

    def train_step(self, batch):
        """Single training step: generate, score, compute loss, backprop."""
        pass
```
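Unlike PPO, GRPO needs no learned value function: the mean reward of each response group serves as the baseline, so only the policy model is trained.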
### Level 3: Execution & Debugging
Agents must (a minimal run-and-monitor loop is sketched after this list):
- Execute training runs
- Monitor metrics
- Handle failures
- Debug issues
- Optimize performance
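A minimal run-and-monitor sketch, assuming the trainer exposes a `next_batch()` helper and a `train_step()` that returns a metrics dict; neither is specified above:
```python
import math

def run_training(trainer, num_steps, log_every=10):
    """Hypothetical execution loop with basic failure detection."""
    history = []
    for step in range(num_steps):
        metrics = trainer.train_step(trainer.next_batch())  # assumed API
        history.append(metrics)
        # NaN loss is one of the most common RL training failures;
        # abort early so the agent can diagnose and retry.
        if math.isnan(metrics["loss"]):
            raise RuntimeError(f"NaN loss at step {step}; check LR / reward scale")
        if step % log_every == 0:
            print(f"step={step} loss={metrics['loss']:.4f} "
                  f"mean_reward={metrics['mean_reward']:.3f}")
    return history
```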
### Level 4: Evaluation
Agents must evaluate whether the trained model improved:
```python
def evaluate_model(model, test_tasks):
    metrics = {}
    for task in test_tasks:
        predictions = model.generate(task.inputs)
        metrics[task.name] = task.evaluate(predictions)
    return metrics
```
## Task Categories
### 1. Reasoning Tasks
- Mathematical reasoning
- Code generation
- Logical deduction
- Chain-of-thought
### 2. Alignment Tasks
- Harmlessness
- Helpfulness
- Honesty
- Instruction following
### 3. Specialization Tasks
- Domain adaptation
- Few-shot learning
- Tool use
- Multi-turn conversation
## Evaluation Metrics
### Pipeline Quality
- **Correctness**: Does the code run?
- **Efficiency**: Training time, memory usage
- **Scalability**: Does it work across model sizes?
### Model Improvement
- **Absolute Gain**: Δ in task performance
- **Relative Gain**: % improvement over baseline
- **Generalization**: Improvement on held-out tasks
- **Robustness**: Consistent across seeds
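A minimal sketch of computing Absolute and Relative Gain from per-task scores; the function and score format are assumed, not part of the benchmark:
```python
def improvement_report(baseline_scores, trained_scores):
    """Compare per-task scores before and after RL training.

    Both arguments map task name -> scalar score (higher is better).
    """
    report = {}
    for name, base in baseline_scores.items():
        trained = trained_scores[name]
        absolute_gain = trained - base                               # Δ performance
        relative_gain = (trained - base) / base if base else float("inf")
        report[name] = {
            "absolute_gain": absolute_gain,
            "relative_gain_pct": 100.0 * relative_gain,              # % over baseline
        }
    return report
```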
### Agent Capabilities
- **Autonomy**: % tasks completed without human intervention
- **Recovery**: Ability to recover from failures
- **Optimization**: Hyperparameter tuning quality
## Example Tasks
### Task 1: Implement GRPO
```
Given: Base model, reasoning dataset
Design: GRPO training pipeline
Implement: Complete training loop
Execute: Train for N steps
Evaluate: Pass@1 on math problems
```
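Pass@1 here reads as the fraction of problems solved with a single sample per problem; a minimal sketch, with `checker` standing in for an assumed task-specific verifier:
```python
def pass_at_1(model, problems, checker):
    """Fraction of problems solved by one sample per problem (assumed names)."""
    solved = 0
    for problem in problems:
        answer = model.generate(problem.prompt)  # one sample per problem
        solved += int(checker(problem, answer))  # checker: assumed verifier
    return solved / len(problems)
```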
### Task 2: Debug RL Training
```
Given: Failing RL training run
Diagnose: Root cause of failure
Fix: Correct the issue
Verify: Training succeeds
```
### Task 3: Optimize Hyperparameters
```
Given: Working baseline
Explore: Hyperparameter space
Optimize: For target metric
Report: Best configuration
```
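Even naive random search illustrates the expected Task 3 loop; `train_and_eval` is an assumed helper that trains briefly under a config and returns the target metric:
```python
import random

def random_search(base_config, train_and_eval, trials=8):
    """Naive random search over the design config; agents may do better."""
    best_score, best_config = float("-inf"), None
    for _ in range(trials):
        config = dict(base_config)
        config["learning_rate"] = 10 ** random.uniform(-7, -5)
        config["kl_coeff"] = random.choice([0.01, 0.02, 0.04, 0.08])
        score = train_and_eval(config)  # assumed: trains, returns target metric
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score
```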
## Difficulty Levels
| Level | Description | Example |
|-------|-------------|---------|
| Easy | Standard RLHF | PPO on preference data |
| Medium | Advanced methods | GRPO, DPO variants |
| Hard | Novel problems | Multi-task RL, constrained optimization |
| Expert | Research-level | Novel algorithm design |
## Tools Used
- `exec`: Run RL training scripts
- `python`: Implement RL pipelines
- `read`: Inspect results and logs
- `write`: Generate pipeline code
## Implementation Guidelines
### Agent Environment
```python
class RLBenchEnvironment:
    def __init__(self):
        self.available_models = [...]   # models the agent may train
        self.compute_budget = {...}     # e.g. GPU-hours, max steps
        self.datasets = {...}           # training / held-out splits
        self.runs = {}                  # run_id -> execution results

    def submit_solution(self, agent_code):
        """Submit the agent's RL pipeline for execution."""
        return self.evaluate(agent_code)

    def get_feedback(self, run_id):
        """Get execution results for a previous submission."""
        return self.runs[run_id]
```
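A hypothetical interaction, assuming `submit_solution` returns a dict carrying a `run_id` (the return shape is not specified above):
```python
env = RLBenchEnvironment()
agent_pipeline_code = "..."                        # the agent's generated pipeline
result = env.submit_solution(agent_pipeline_code)
feedback = env.get_feedback(result["run_id"])      # assumes result carries a run_id
print(feedback)
```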
### Success Criteria
A successful agent:
1. Designs a valid RL approach
2. Implements bug-free code
3. Trains without crashes
4. Achieves a positive Δ in performance
5. Generalizes to held-out tasks
## References
- arXiv:2604.10547v1 (2026) - "Agent² RL-Bench: Can LLM Agents Engineer Agentic RL Post-Training?"
- RLHF literature (Ouyang et al., Bai et al.)
- GRPO (DeepSeek-R1)
- LLM agent benchmarks (SWE-bench, etc.)
## Related Skills
- grpo-rl-training: GRPO implementation
- llm-alignment: Model alignment techniques
- agent-evaluation: LLM agent benchmarks
- reinforcement-learning: General RL methods