---
name: agentic-reinforcement-learning-for-llms
description: Apply agentic reinforcement learning concepts for LLMs, covering planning, tool use, memory, reasoning, self-improvement, and perception
---
# Agentic Reinforcement Learning for LLMs
## Description
Agentic RL transforms LLMs from passive sequence generators into autonomous, decision-making agents. This skill covers the paradigm shift from conventional LLM-RL (single-step MDPs) to Agentic RL (temporally extended POMDPs), including planning, tool use, memory, reasoning, self-improvement, and perception capabilities.
**Key Contributions:**
- Formalizes the conceptual shift from LLM-RL to Agentic RL
- Two-fold taxonomy: agentic capabilities and task domains
- Positions RL as the mechanism for turning static modules into adaptive behavior
- Consolidates 500+ works into a practical compendium
## Tools Used
- read: Read agent state and environment observations
- exec: Execute actions in environment
- write: Store memories and plans
- browser: Access external tools and APIs
- memory_search: Retrieve relevant past experiences
## Instructions for Agents
### Core Agentic Capabilities
1. **Planning** - Generate and execute multi-step action sequences
2. **Tool Use** - Select and invoke appropriate tools
3. **Memory** - Store and retrieve relevant information
4. **Reasoning** - Chain-of-thought and task decomposition
5. **Self-Improvement** - Learn from experience via RL
6. **Perception** - Process multi-modal inputs
### Key Paradigm Shift
| LLM-RL (Conventional) | Agentic RL |
|-----------------------|------------|
| Single-step MDP | Temporally extended POMDP |
| Passive generation | Autonomous decision-making |
| Static prompts | Dynamic environment interaction |
| No memory | Persistent memory |
| No planning | Multi-step planning |
### When to Use
- Multi-step reasoning tasks
- Environment interaction required
- Tool selection decisions
- Memory-dependent decisions
- Self-improvement loops
## Overview
**Source:** arXiv:2509.02547v4
**Utility:** 0.94
**Authors:** 23 authors from 14 institutions
**Scope:** 500+ works synthesized
## Activation Keywords
- agentic RL
- LLM reinforcement learning
- agent planning
- tool use RL
- LLM decision making
---
## Core Framework
### From MDP to POMDP
**LLM-RL (Single-step MDP):**
```
State (prompt) → Action (token) → Reward → Next State
```
**Agentic RL (POMDP):**
```
Observation → Belief State → Plan → Actions → Observations → ...
                   ↓          ↓        ↓
                Memory    Reasoning  Tool Use
```
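To make the loop concrete, here is a minimal sketch of one agentic episode under this framing. The `llm`, `env`, and `memory` interfaces are illustrative assumptions, not an API from the paper:
```python
# Minimal agentic episode under the POMDP framing (all interfaces assumed).
def agentic_episode(llm, env, memory, max_steps=20):
    observation = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        # Belief state: fold retrieved memories into the current observation
        context = memory.retrieve(observation)
        # Plan/act: the LLM proposes the next action (possibly a tool call)
        action = llm.act(observation, context)
        observation, reward, done = env.step(action)
        memory.store(observation, action, reward)
        total_reward += reward
        if done:
            break
    return total_reward
```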
### Agentic Capabilities Taxonomy
| Capability | Description | RL Role |
|------------|-------------|---------|
| Planning | Multi-step action generation | Learn optimal plans |
| Tool Use | External API/tool selection | Learn tool policies |
| Memory | Information persistence | Learn what to store |
| Reasoning | Chain-of-thought | Learn reasoning strategies |
| Self-Improvement | Learning from experience | Meta-learning |
| Perception | Multi-modal processing | Learn representations |
---
## Implementation Patterns
### 1. Planning Agent
```python
class PlanningAgent:
    def __init__(self, llm, env, policy):
        self.llm = llm
        self.env = env
        self.policy = policy  # RL-learned low-level policy
        self.memory = []      # trajectory buffer

    def plan(self, goal):
        # Generate a high-level plan with the LLM
        plan = self.llm.generate_plan(goal)
        # Execute each step with the RL-learned policy
        for step in plan:
            action = self.policy(step)
            obs, reward = self.env.step(action)
            self.memory.append((obs, reward))
        return self.memory
```
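A usage sketch with toy stand-ins so the pattern can be exercised end to end (the stub classes and the plan steps are hypothetical):
```python
# Toy stand-ins for the LLM, environment, and policy (all hypothetical).
class StubLLM:
    def generate_plan(self, goal):
        return ["search flights", "compare prices", "book cheapest"]

class StubEnv:
    def step(self, action):
        return f"done: {action}", 1.0  # (observation, reward)

agent = PlanningAgent(StubLLM(), StubEnv(), policy=lambda step: step)
trajectory = agent.plan("book the cheapest flight")
```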
### 2. Tool-Using Agent
```python
import random

class ToolUsingAgent:
    def __init__(self, tools, policy_network):
        self.tools = tools            # list of tool objects
        self.policy = policy_network  # maps context -> per-tool probabilities

    def select_tool(self, context):
        # RL-learned tool selection: sample a tool index from the policy
        tool_probs = self.policy(context)
        index = random.choices(range(len(self.tools)), weights=tool_probs)[0]
        return self.tools[index]

    def execute(self, tool, inputs):
        return tool.execute(inputs)
```
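For illustration, the agent can be wired to any object exposing an `execute()` method; the tool class and the uniform stand-in policy below are hypothetical:
```python
# Hypothetical tool with the execute() interface the agent expects.
class EchoTool:
    def execute(self, inputs):
        return f"echo: {inputs}"

agent = ToolUsingAgent(
    tools=[EchoTool(), EchoTool()],
    policy_network=lambda context: [0.5, 0.5],  # stand-in for a trained network
)
tool = agent.select_tool("some context")
print(agent.execute(tool, "hello"))  # echo: hello
```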
### 3. Memory-Augmented Agent
```python
class MemoryAgent:
    def __init__(self, memory_size=1000, storage_policy=None, retrieval_policy=None):
        self.memory = []
        self.memory_size = memory_size
        self.storage_policy = storage_policy      # RL-learned: what to remember
        self.retrieval_policy = retrieval_policy  # RL-learned: what to recall

    def store(self, observation, action, reward):
        # Learn what to remember; default is to store everything
        if self.storage_policy is None or self.storage_policy(observation):
            self.memory.append((observation, action, reward))
            self.memory = self.memory[-self.memory_size:]  # bounded buffer

    def retrieve(self, context):
        # RL-learned retrieval; fall back to the most recent entries
        if self.retrieval_policy is None:
            return self.memory[-5:]
        return self.retrieval_policy(context, self.memory)
```
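As a stand-in for an RL-learned retrieval policy, a retrieval function could be as simple as lexical-overlap scoring; this sketch (my assumption, not from the survey) plugs into the `retrieval_policy` hook above:
```python
def overlap_retrieval(context, memory, k=3):
    # Score each stored (obs, action, reward) entry by word overlap with the context.
    context_words = set(context.lower().split())
    def score(entry):
        return len(context_words & set(str(entry[0]).lower().split()))
    return sorted(memory, key=score, reverse=True)[:k]

agent = MemoryAgent(retrieval_policy=overlap_retrieval)
```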
---
## Key Applications
| Domain | Agentic RL Use |
|--------|----------------|
| Web Navigation | Multi-step browsing, form filling |
| Code Generation | Planning, testing, debugging loops |
| Game Playing | Strategic decision making |
| Robotics | Sensorimotor control |
| Research | Literature search, experiment design |
---
## Benchmarks
| Benchmark | Focus | Environment |
|-----------|-------|-------------|
| WebShop | E-commerce navigation | Web |
| ALFWorld | Household tasks | Embodied |
| ScienceWorld | Scientific reasoning | Text |
| InterCode | Code execution | Terminal |
| ToolBench | Tool use | API |
---
## Training Approaches
### 1. Policy Gradient Methods
```python
import torch

# PPO-style clipped-surrogate update for agentic behavior (schematic)
def agentic_ppo_update(agent, trajectories, eps=0.2):
    for traj in trajectories:
        returns = compute_returns(traj.rewards)  # discounted returns (see below)
        advantages = returns - traj.values       # simple advantage estimate
        # Probability ratio between the current policy and the behavior policy
        ratio = agent.action_probs(traj) / traj.old_probs
        clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
        # Clipped surrogate objective, negated for gradient descent
        loss = -torch.min(ratio * advantages, clipped * advantages).mean()
        loss.backward()
        agent.optimizer.step()
        agent.optimizer.zero_grad()
```
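The sketch calls a `compute_returns` helper that is left undefined; a standard discounted-return implementation (my assumption, with `gamma` as a hyperparameter) would be:
```python
import torch

def compute_returns(rewards, gamma=0.99):
    # Backward pass: G_t = r_t + gamma * G_{t+1}
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return torch.tensor(returns)
```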
### 2. Reward Shaping for Agents
```python
def agentic_reward(action, outcome, plan, num_steps):
    reward = 0.0
    reward += outcome.success * 1.0   # task completion (terminal signal)
    reward += plan.efficiency * 0.1   # plan quality
    reward += action.validity * 0.1   # well-formed actions
    reward -= num_steps * 0.01        # per-step cost encourages efficiency
    return reward
```
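A quick usage sketch with hypothetical stand-ins for the `outcome`, `plan`, and `action` objects the function expects:
```python
from types import SimpleNamespace

# Hypothetical stand-ins for the objects the reward function expects.
outcome = SimpleNamespace(success=1.0)
plan = SimpleNamespace(efficiency=0.8)
action = SimpleNamespace(validity=1.0)
print(agentic_reward(action, outcome, plan, num_steps=12))  # 1.0 + 0.08 + 0.1 - 0.12 = 1.06
```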
---
## Open-Source Resources
### Environments
- OpenAI Gym
- RL4LMs
- AgentBench
- WebArena
### Frameworks
- LangChain
- AutoGPT
- BabyAGI
- MetaGPT
---
## Best Practices
1. **Start with clear objectives** - Define success metrics
2. **Design reward functions carefully** - Shape agent behavior
3. **Use curriculum learning** - Start simple, increase complexity
4. **Enable self-improvement** - Let agents learn from mistakes
5. **Monitor for reward hacking** - Agents may exploit loopholes
---
## Examples
### Example 1: Basic Application
**User:** I need to apply Agentic Reinforcement Learning for LLMs to my analysis.
**Agent:** I'll help you apply agentic-rl-llm-survey. First, let me understand your specific use case...
**Context:** Apply the methodology
### Example 2: Advanced Scenario
**User:** I have a multi-step reasoning task that requires environment interaction and tool use.
**Agent:** Based on the methodology, I'll guide you through the advanced application...
### Example 3: Research Considerations
**User:** What are the key considerations for agentic-rl-llm-survey?
**Agent:** Let me search for the latest research and best practices...
## References
- Paper: https://arxiv.org/abs/2509.02547
- DOI: https://doi.org/10.48550/arXiv.2509.02547
- TMLR: https://openreview.net/forum?id=RY19y2RI1O
---
**Created:** 2026-03-28
**Source:** arXiv:2509.02547v4 - "The Landscape of Agentic RL for LLMs"