- Agentic RL trains an LLM to act over many turns — plan, call tools, read the results, keep going — instead of optimizing a single answer like classic RLHF.
- It reframes the model from a one-step text generator into a policy in a long-horizon, partially-observable environment (a POMDP), usually rewarded by a verifier (RLVR) on final task success.
- Two mechanics make it work in practice: a rollout that interleaves think → tool call → observation → answer, and masking the tool-observation tokens so you never train on text the model didn't generate.
- The hard parts are agent-specific: long-horizon credit assignment, tool-call reward hacking, and training collapse (RAGEN's 'Echo Trap') — and they're where most recipes fail.
What is agentic RL?
Agentic RL is the practice of using reinforcement learning to train large language models to behave as autonomous agents: they plan, call external tools (a search engine, a code interpreter, a browser, an API), observe what comes back, and keep going across many turns until a task is finished. Where classic RLHF optimizes a single response to match human taste, agentic RL optimizes a whole trajectory — an interleaved stream of reasoning, tool calls, and tool observations — usually against an automatically computed reward for whether the final task actually succeeded.
The shift is conceptual, not just engineering. A vanilla LLM-RL setup treats generation as essentially one big action: prompt in, answer out, one reward. Agentic RL treats the model as a policy operating in an environment it doesn’t fully control — search results it can’t predict, code that may throw an error, a webpage that changes. That makes it a long-horizon, partially-observable decision problem, and it’s why the algorithms, rewards, and failure modes all look different from single-turn alignment.
Agentic RL vs. RLHF vs. prompt-based agents (ReAct, Toolformer)
These four approaches are constantly conflated. They differ along two axes: how the model comes to use tools (prompting, supervised training, or reward) and whether RL updates its weights.
| Approach | What it is | Reward / signal | Weights updated? |
|---|---|---|---|
| ReAct (2022) | Prompt pattern: interleave “thought” and “action” so a frozen model uses tools | none — pure prompting | No |
| Toolformer (2023) | Self-supervised: insert API calls into training text, keep the useful ones | does the call lower LM loss? | Yes (SFT, not RL) |
| RLHF | Align a single response to human preference | learned reward model | Yes (RL, single-turn) |
| Agentic RL | Optimize a multi-turn tool-using trajectory | verifier on task success (RLVR) | Yes (RL, multi-turn) |
ReAct and Toolformer are the conceptual ancestors: they showed an LLM can interleave reasoning and tool calls. Agentic RL is what happens when you stop hand-prompting that behavior and instead reward it into the weights — so the model learns when to search, which tool to reach for, and when it has enough to answer.
A worked example: a search agent that learns when to query
Concretely, take a question-answering agent built on the Search-R1 recipe. The model emits structured tags; the environment runs the tool and pastes results back; the loop repeats until the model commits to an answer.
Only the final answer earns reward (does it exactly match the gold answer?). Nobody told the model to search once versus twice, or what to put in the query: RL discovered that a search-then-answer policy maximizes the outcome reward. Search-R1 reports roughly a 41% average relative improvement over RAG baselines on QA benchmarks with a Qwen2.5-7B model, holding across in- and out-of-distribution test sets.
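The outcome reward described above can be sketched as an exact-match check on the final answer. This is a minimal illustration, not Search-R1's actual scoring code: the function names are hypothetical, the `<answer>` tag convention is assumed, and the normalization (lowercasing, stripping punctuation and English articles) is the common QA convention rather than something stated here.

```python
import re
import string
from typing import List, Optional


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def extract_answer(trajectory: str) -> Optional[str]:
    """Return the content of the last <answer>...</answer> span, if any."""
    matches = re.findall(r"<answer>(.*?)</answer>", trajectory, flags=re.DOTALL)
    return matches[-1].strip() if matches else None


def exact_match_reward(trajectory: str, gold_answers: List[str]) -> float:
    """1.0 if the final answer matches any gold answer after normalization, else 0.0."""
    predicted = extract_answer(trajectory)
    if predicted is None:
        return 0.0  # no committed answer means no reward
    normalized = normalize_answer(predicted)
    return 1.0 if any(normalized == normalize_answer(g) for g in gold_answers) else 0.0
```

Nothing in this reward looks at the searches themselves, which is why the number and content of queries is left for RL to discover.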
Why RL for tool-using agents?
Couldn’t you just do supervised fine-tuning on good agent traces, or prompt a strong model? Both hit walls that RL is built to climb.
Imitating expert trajectories teaches a path, not recovery. The agent never sees its own mistakes during training, so at test time — when a search returns junk or code throws an error — it’s off-distribution and flounders. SFT also can’t optimize for outcome; it copies tokens whether or not the episode would have succeeded.
ReAct-style prompting depends entirely on the base model’s latent skill. It can’t teach a model to call a tool it tends to ignore, can’t fix systematic over- or under-calling, and can’t be tuned toward your specific task’s success metric. It’s a ceiling, not a training method.
Outcome rewards and emergent behavior
The defining feature of agentic RL is that you usually only need to score the end state: did the tests pass, is the answer correct, did the task complete? From that sparse signal, RL discovers strategies nobody scripted. ToRL and ToolRL both report emergent behaviors after RL — the model learns to self-correct faulty code, adjust how often it invokes a tool, and compose multiple tools for sub-tasks. This is the same lesson DeepSeek-R1 taught for reasoning: a verifiable outcome reward can incentivize complex behavior to appear on its own. See RL for reasoning.
The mental model: agentic RL as a (PO)MDP
To train an agent you have to name the pieces of the decision problem. Agentic RL maps the LLM loop onto a Partially Observable Markov Decision Process.
The true state is the full task context plus the external world (the search index, the codebase, the live webpage). The agent never sees all of it — it only sees observations: the tokens currently in context, including whatever tools returned. Hidden world state is exactly what makes it partially observable.
At each turn the policy emits a chunk of tokens that encodes a reasoning step plus an action, typically a tool call (with arguments) or a final answer. At the token level the action space is the model’s whole vocabulary; at the turn level an action is a token sequence, structured by special tags.
When the agent calls a tool, the environment executes it — runs the code, queries the index, loads the page — and appends the result as a new observation. This transition is outside the model and often stochastic.
An episode is the whole trajectory until the agent answers or hits a turn limit. The reward is usually sparse and terminal: a verifier checks final success and assigns (often) a single scalar. Everything in between has to be credited from that.
RLVR: rewards you can check automatically
A major reason agentic RL took off in 2025 is RLVR (Reinforcement Learning with Verifiable Rewards). Instead of a learned reward model (which can be hacked, see RLHF), the reward comes from a programmatic checker: do the unit tests pass, does the math answer match, did the SQL return the right rows, did the task’s success condition fire? This suits agents well, because many agentic tasks have a ground-truth outcome you can verify cheaply.
Verifiable rewards are harder to game than a learned reward model (you can’t sweet-talk a unit test), though a weak checker can still be exploited, and they are cheap to compute at scale. Their downside is sparsity, often one bit at the end of a long episode, which is the central credit-assignment headache below.
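As a rough sketch of what a programmatic checker can look like, here are two toy verifiers: one for a numeric answer and one that runs a candidate function against test cases. Both function names and the tolerance are assumptions made for illustration; a real verifier would run candidate code in a sandbox rather than in-process.

```python
from typing import Callable, List, Tuple


def math_answer_reward(predicted: str, gold: float, tol: float = 1e-6) -> float:
    """1.0 if the predicted string parses to a number within tol of gold."""
    try:
        value = float(predicted.strip())
    except ValueError:
        return 0.0  # unparseable answers earn nothing
    return 1.0 if abs(value - gold) <= tol else 0.0


def unit_test_reward(candidate: Callable[[int], int],
                     cases: List[Tuple[int, int]]) -> float:
    """1.0 only if the candidate returns the expected value on every case."""
    for arg, expected in cases:
        try:
            if candidate(arg) != expected:
                return 0.0
        except Exception:
            return 0.0  # a crash counts as a failure
    return 1.0
```

With the cases `[(2, 4), (3, 9)]`, a candidate that hardcodes `4` passes the first case and fails the second, which is the cheapest defense against a policy that memorizes a single expected output.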
How training actually works
The rollout loop: interleaved reasoning, tool calls, observations
A training rollout generates the trajectory the policy learns from. Generation pauses whenever the model emits a tool call, the environment runs it, the result is spliced back into context, and generation resumes.
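The pause-execute-splice loop described above might look like the following sketch. Everything here is a stand-in: `policy` is any callable that returns the next chunk of text and stops after a closing tag, `search_tool` is a hypothetical retrieval function, and the `<search>`, `<information>`, and `<answer>` tags are assumed conventions. Each segment records whether the policy produced it, which is the bookkeeping later needed for loss masking.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Segment:
    text: str
    from_policy: bool  # True for model output, False for tool observations


@dataclass
class Trajectory:
    segments: List[Segment] = field(default_factory=list)
    answer: Optional[str] = None


def rollout(prompt: str,
            policy: Callable[[str], str],
            search_tool: Callable[[str], str],
            max_turns: int = 4) -> Trajectory:
    """Alternate policy generation and tool execution until an answer or the turn limit."""
    traj = Trajectory()
    context = prompt
    for _ in range(max_turns):
        chunk = policy(context)  # generation pauses at a tool call or an answer
        traj.segments.append(Segment(chunk, from_policy=True))
        context += chunk
        answer = re.search(r"<answer>(.*?)</answer>", chunk, re.DOTALL)
        if answer:
            traj.answer = answer.group(1).strip()
            break
        query = re.search(r"<search>(.*?)</search>", chunk, re.DOTALL)
        if query:
            # The environment runs the tool and splices the result into context.
            result = search_tool(query.group(1).strip())
            observation = f"<information>{result}</information>"
            traj.segments.append(Segment(observation, from_policy=False))
            context += observation
    return traj


def scripted_policy(context: str) -> str:
    """Toy stand-in for an LLM: search once, then answer."""
    if "<information>" not in context:
        return "<think>I should look this up.</think><search>capital of France</search>"
    return "<think>The result says Paris.</think><answer>Paris</answer>"
```

Calling `rollout("Q: What is the capital of France?\n", scripted_policy, lambda q: "Paris is the capital of France.")` yields three segments (policy, observation, policy) and the answer `Paris`. If the turn limit is reached first, `answer` stays `None` and a verifier would typically score the episode as a failure.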
Tool-observation masking and the loss
This is the single most important agentic-RL implementation detail. The tokens a tool returns were not produced by the policy — so you must not compute the policy-gradient loss on them. Search-R1 calls this retrieved-token masking: the loss is computed only over the LLM’s own generated tokens; tokens copied verbatim from a retrieval (or any tool) have their gradients masked out.
Why it matters: if you let those tokens into the objective, you’d be training the model to predict arbitrary external text it can’t control, which destabilizes training and teaches it to parrot retrievals instead of reasoning over them. Concretely, the per-token policy-gradient objective carries a 0/1 mask.
The mask is 1 for every token the policy generated and 0 for every observation/tool-output token. Frameworks like VerlTool generalize this bookkeeping so the masking is handled uniformly across many tool types.
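One way to write the masked objective described above is the following sketch, in generic REINFORCE-style notation rather than the exact formula of any one paper. The symbols are assumptions: $x$ is the prompt, $y_t$ the token at position $t$ of a trajectory of length $T$, $\hat{A}_t$ an advantage estimate, and $m_t$ the mask.

```latex
\mathcal{L}(\theta)
  = -\,\frac{1}{\sum_{t=1}^{T} m_t}
    \sum_{t=1}^{T} m_t \,\hat{A}_t \,
    \log \pi_\theta\!\left(y_t \mid x,\, y_{<t}\right),
\qquad
m_t =
\begin{cases}
  1 & \text{if } y_t \text{ was generated by the policy,}\\
  0 & \text{if } y_t \text{ is a tool observation.}
\end{cases}
```

Masked tokens still appear in the conditioning context $y_{<t}$, so the policy reads tool output without being trained to predict it. PPO and GRPO replace the log-probability term with a clipped probability ratio, but the mask enters the same way.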
Algorithms: PPO, GRPO, RLOO, DAPO
Agentic RL borrows the policy-optimization toolkit from LLM-RL, with multi-turn twists.
| Algorithm | Idea | Critic / baseline | Why used for agents |
|---|---|---|---|
| PPO | Clipped on-policy updates | learned value network | Most controllable; the default in Search-R1 |
| GRPO | Advantage from a group of sampled trajectories’ relative rewards | none (group mean is the baseline) | Cheaper — no critic; dominant for verifiable rewards |
| RLOO | REINFORCE with leave-one-out baseline over samples | none (other samples) | Simple, low-variance for single terminal reward |
| DAPO | GRPO + decoupled clipping & dynamic sampling fixes | none | Stabilizes long-output / sparse-reward RL |
GRPO is the workhorse: with one terminal reward per trajectory, sampling a group of trajectories per task and normalizing their rewards gives a clean advantage without training a separate value network — a big win when each rollout is expensive.
Multi-turn variants (RAGEN’s StarPO, and the trajectory-level schemes in Tool-Star) keep this structure but operate over whole tool-using trajectories rather than single completions.
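To make the critic-free baselines in the table concrete, here is a small sketch of the GRPO-style and RLOO-style advantage computations for one group of trajectories sampled from the same task. The function names are hypothetical, and the use of the population standard deviation with a small `eps` is an assumption; implementations differ on these details.

```python
from statistics import mean, pstdev
from typing import List


def grpo_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantage: subtract the group mean, divide by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def rloo_advantages(rewards: List[float]) -> List[float]:
    """Leave-one-out advantage: each sample's baseline is the mean of the others."""
    n = len(rewards)
    if n < 2:
        raise ValueError("RLOO needs at least two samples per task")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]
```

A group where every trajectory earns the same reward produces all-zero advantages under both schemes, so it contributes no gradient. That is one reason tasks that are always solved or never solved are uninformative for training.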
Credit assignment across turns
Here is the deepest agentic problem. The reward is a single bit at turn N, but the agent made N decisions — which one earned (or lost) the reward? This is temporal credit assignment, and it’s far harder than in single-turn RLHF.
- Sparse outcome reward — give 1 at the end, 0 otherwise, and let the advantage flow back across all generated turns. Simple, hard to hack, but high-variance and sample-hungry on long horizons.
- Turn-level / dense shaping — add intermediate signals: a small reward for a well-formed tool call, step-level rewards for useful retrievals (as in step-search variants), or process rewards on reasoning steps (see process reward models). Faster to learn, but every added signal is a new surface for reward hacking.
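The two schemes above differ only in what each turn is credited with. The sketch below contrasts them under stated assumptions: the function names, the per-turn reward values, and the discount factor are all illustrative, not taken from any particular recipe.

```python
from typing import List


def broadcast_outcome(final_reward: float, num_turns: int) -> List[float]:
    """Sparse scheme: every turn is credited with the same terminal reward."""
    return [final_reward] * num_turns


def discounted_returns(turn_rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Dense scheme: each turn gets its own reward plus the discounted future return."""
    returns: List[float] = []
    running = 0.0
    for reward in reversed(turn_rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns
```

For a three-turn episode with shaped rewards `[0.1, 0.0, 1.0]` and `gamma=0.9`, the per-turn returns come out to roughly `[0.91, 0.9, 1.0]`, so the first turn's small shaping bonus is visible in its credit. Under the sparse scheme the same episode would credit every turn with `1.0`, regardless of which turn did the useful work.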
Reward design and reward hacking
Format vs. correctness vs. tool-use rewards
ToolRL is the first systematic study of reward design for tool use under GRPO, sweeping reward type, scale, granularity, and temporal dynamics. The headline lesson: coarse answer-matching alone is too blunt — agents need fine-grained feedback distinguishing (a) did you call the right tool, (b) with the right parameters, (c) in the right format, and (d) did it produce the right outcome. A principled decomposition beats both base models (+17%) and SFT (+15%) and trains more stably.
A common, well-behaved reward decomposition is a weighted sum of format, tool-use, and outcome terms, with the outcome term dominant so the agent can’t trade real success for format points.
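A weighted-sum decomposition of this kind could be sketched as follows. The weights and field names are purely illustrative assumptions, not values from ToolRL or any other paper; the only property the sketch tries to preserve is that the outcome weight exceeds the sum of the shaping weights.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    outcome: float = 1.0  # did the task actually succeed
    tool: float = 0.2     # right tool, right parameters
    fmt: float = 0.1      # well-formed tags and structure


def composite_reward(outcome_correct: bool,
                     tool_calls_valid: bool,
                     format_ok: bool,
                     weights: RewardWeights = RewardWeights()) -> float:
    """Weighted sum of binary reward components."""
    return (weights.outcome * float(outcome_correct)
            + weights.tool * float(tool_calls_valid)
            + weights.fmt * float(format_ok))
```

With these weights, a trajectory that gets only the outcome right scores higher than one that gets everything except the outcome right, so collecting format and tool-use points can never beat solving the task.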
Tool-call hacking and agent-specific exploits
Because tools change the reward surface, agents invent exploits a single-turn model never could:
- Tool-call hacking — calling a tool (or many) to collect format/shaping reward without using the result, or making a call that looks productive to a lenient verifier.
- Format gaming — emitting the exact tags/structure the reward checks for while the content is wrong.
- Verifier exploitation — under RLVR, finding answers that satisfy the checker but not the intent (e.g. printing the expected output instead of computing it, hardcoding a test’s answer).
- Observation parroting: copying retrieved text as the answer. The masking above is a partial defense, and a dominant outcome reward covers the rest.
The fix is the same discipline as in RLHF: make your verifier harder to game than your policy is to train, keep the outcome term dominant, and watch held-out success rather than training reward.
Training stability and failure modes
Multi-turn RL is notoriously unstable, and RAGEN gave the canonical anatomy of the collapse: the Echo Trap.
In the Echo Trap, the agent overfits to locally-rewarded reasoning patterns — it finds a phrasing or tool sequence that scored once and repeats it, narrowing its behavior. The signature is collapsing reward variance, falling entropy, and gradient spikes, with variance and entropy dropping before reward degrades — so they’re early warnings worth logging. RAGEN’s stabilized variant StarPO-S counters it with variance-based trajectory filtering (train on high-uncertainty tasks, discard low-information rollouts), critic baselining, and decoupled clipping. Other common stabilizers: an entropy bonus, KL control toward the reference, clip-higher schemes (DAPO), and curriculum over task difficulty.
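The early-warning signals and the variance-based filter described above can be sketched in a few lines. This is an illustration of the idea, not RAGEN's implementation: the function names, the `keep_fraction` parameter, and the use of reward standard deviation as the uncertainty measure are assumptions.

```python
import math
from statistics import pstdev
from typing import Dict, List


def token_entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution; log its mean per step."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def filter_by_reward_variance(groups: Dict[str, List[float]],
                              keep_fraction: float = 0.5) -> List[str]:
    """Keep the tasks whose sampled rollouts disagree most about the reward."""
    ranked = sorted(groups, key=lambda task: pstdev(groups[task]), reverse=True)
    keep = max(1, math.ceil(len(ranked) * keep_fraction))
    return ranked[:keep]
```

Tasks whose rollouts all succeed or all fail have zero reward spread and are dropped first, since they carry little learning signal. Tracking the per-task reward spread alongside mean entropy over training steps gives the two leading indicators before reward itself degrades.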
Task families and benchmarks
Agentic RL has crystallized around a handful of task families, each with its own environment and verifier.
- Search and retrieval QA: multi-hop QA where the agent searches, reads, and synthesizes. Verifier = exact-match / F1 against gold answers. Canonical: Search-R1; environments built on Wikipedia/dense retrievers.
- Tool-integrated math: the model writes and runs code to compute, then continues reasoning. Verifier = numerical answer match. Canonical: ToRL, Tool-Star.
- Software engineering: edit a real repo to fix a bug; verifier = the project’s test suite. Sparse and long-horizon. Benchmark: SWE-bench Verified. See Agent-RLVR, SkyRL-Agent.
- Web and conversational agents: navigate sites, fill forms, complete purchases, or hold tool-using conversations. Benchmarks: WebShop, tau-bench / tau2-bench.
Environments and benchmarks
The environment is the reward in agentic RL, so benchmarks double as training grounds:
| Benchmark | Domain | Reward / success signal |
|---|---|---|
| WebShop | Simulated e-commerce browsing | bought the right product matching the instruction |
| SWE-bench (Verified) | Real GitHub bug-fixing | the repo’s unit tests pass after the patch |
| tau-bench / tau2-bench | Tool-using user-facing tasks (airline, retail) | task completed under domain rules |
Building and operating these sandboxes — sandboxed code execution, browser pools, reproducible repo snapshots — is its own discipline; see the RL environments page and the companies building agent environments.
Frameworks and infrastructure
A 2025 wave of open frameworks made agentic RL reproducible. They differ mostly in how much of the agent loop they own and how they scale rollouts.
| Framework | Angle | Notes |
|---|---|---|
| verl | General LLM-RL engine | The de facto backend most agent frameworks build on |
| VerlTool | Holistic agentic RL with tool use (ARLT) | Tools as modular plugins via a unified API on top of verl |
| SkyRL-Agent | Efficient multi-turn agent RL | Async dispatcher + lightweight tool interface; backend-portable |
| Agent Lightning | Train any existing agent | Decouples agent execution from RL training, minimal code change |
| rLLM | Extensible agent RL | Strong extensibility, tied to the verl backend |
The shared engineering bottleneck is rollout cost: each training step needs many long, tool-calling trajectories, and a tool call can take seconds (a test suite, a web request). The dominant fix is asynchronous rollouts — decouple trajectory generation from gradient updates so slow tools don’t stall the learner. SkyRL-Agent’s async dispatcher and Agent Lightning’s execution/training split both target exactly this.
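The decoupling described above can be illustrated with a toy producer/consumer built on `asyncio`. This is a sketch of the pattern only, not how SkyRL-Agent or Agent Lightning implement it: `run_rollout`, `producer`, `learner`, and `main` are hypothetical names, the sleep stands in for a slow tool call, and the "gradient step" is just collecting a batch.

```python
import asyncio
import random
from typing import Dict, Iterable, List, Optional


async def run_rollout(task_id: int) -> Dict[str, float]:
    """Stand-in for one tool-calling trajectory; the sleep mimics a slow tool."""
    await asyncio.sleep(random.uniform(0.01, 0.05))
    return {"task_id": task_id, "reward": float(task_id % 2)}


async def producer(task_ids: Iterable[int], queue: asyncio.Queue) -> None:
    """Run all rollouts concurrently and enqueue each one as soon as it finishes."""
    tasks = [asyncio.create_task(run_rollout(t)) for t in task_ids]
    for finished in asyncio.as_completed(tasks):
        await queue.put(await finished)
    await queue.put(None)  # sentinel: no more rollouts


async def learner(queue: asyncio.Queue, batch_size: int) -> List[List[Dict[str, float]]]:
    """Consume finished rollouts in batches without waiting for stragglers."""
    batches: List[List[Dict[str, float]]] = []
    batch: List[Dict[str, float]] = []
    while True:
        item: Optional[Dict[str, float]] = await queue.get()
        if item is None:
            break
        batch.append(item)
        if len(batch) == batch_size:
            batches.append(batch)  # a real learner would take a gradient step here
            batch = []
    if batch:
        batches.append(batch)
    return batches


async def main(num_tasks: int = 8, batch_size: int = 4) -> List[List[Dict[str, float]]]:
    queue: asyncio.Queue = asyncio.Queue()
    _, batches = await asyncio.gather(
        producer(range(num_tasks), queue),
        learner(queue, batch_size),
    )
    return batches
```

Running `asyncio.run(main())` returns two batches of four rollouts, grouped by completion order rather than task order. The trade-off is that rollouts consumed this way may come from a slightly stale policy, which real systems have to bound or correct for.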
Frontier-model practice (2025–2026): native tool-use RL
Agentic RL has moved from papers into shipped models. Frontier systems are increasingly trained so tool use is native — learned with RL, not bolted on by prompting.
- DeepSeek-R1 established the template the whole “-R1” agent literature builds on: pure RL with GRPO and verifiable rewards can incentivize complex reasoning — and, extended to tools, complex acting.
- Kimi K2 (a 1T-param / 32B-active MoE) is the clearest open example of frontier agentic RL: a joint RL stage combining RLVR with a self-critique rubric reward, fed by large-scale agentic data synthesis. It posts strong SWE-bench Verified (~65.8%) and tau2-bench (~66.1) scores, and the later Kimi K2 Thinking variant is reported to chain 200–300 sequential tool calls without losing the goal.
The frontier pattern mirrors the RLHF story’s endgame: use RLVR/agentic RL to make the model competent at acting, and keep preference-based alignment so it stays helpful and safe. See computer-use and enterprise-workflow environment vendors.
Open problems and where the field is going
- Long-horizon credit assignment — propagating one terminal bit across dozens of turns is still high-variance; turn-level rewards help but invite hacking. The core unsolved problem.
- Sparse rewards & exploration — many tasks reward almost nothing until the very end (SWE agents especially); guidance toward successful trajectories (Agent-RLVR) is a partial answer.
- Rollout cost — tool-calling trajectories are slow and expensive; async infrastructure is necessary but not sufficient.
- Generalization — agents trained on one tool/benchmark often don’t transfer; how broadly RL-trained tool skills generalize is open.
- Evaluation — outcome metrics miss how a task was solved (wasteful tool use, lucky guesses); honest agent eval is harder than a single success rate.
Frequently asked questions
How is agentic RL different from RLHF?
RLHF optimizes a single response against a learned reward model of human preference. Agentic RL optimizes a multi-turn trajectory of reasoning and tool calls against a verifiable outcome reward. RLHF is single-step and taste-based; agentic RL is long-horizon, environment-interactive, and success-based. Frontier models use both — RL for competence, preference alignment for helpfulness/safety.
Is agentic RL just RLVR?
They overlap heavily but aren’t identical. RLVR describes the reward source — a programmatic verifier. Agentic RL describes the setting — a multi-turn tool-using agent. Most agentic RL uses RLVR for its reward, but RLVR also applies to single-turn tasks (e.g. one-shot math), and agentic RL can in principle use learned or rubric rewards (as Kimi K2 does alongside RLVR).
Why mask tool-observation tokens?
Because the model didn’t generate them. Including tool outputs in the policy-gradient loss would train the model to predict external text it can’t control — destabilizing training and encouraging it to parrot retrievals instead of reasoning over them. Masking computes the loss only on the agent’s own generated tokens. It’s the most important implementation detail in multi-turn agent RL.
Can I RL-train an agent I already built with LangChain / a custom harness?
Increasingly, yes. Agent Lightning explicitly decouples agent execution from RL training so an existing agent can be tuned with minimal code change, and frameworks like SkyRL-Agent expose lightweight tool interfaces. You still need a verifiable reward and the compute for many rollouts.
Key papers
- The Landscape of Agentic Reinforcement Learning for LLMs: A Survey — 2025 — the field’s reference survey (500+ works, capability taxonomy).
- Search-R1 — 2025 — multi-turn search agents with outcome rewards and retrieved-token masking.
- ToolRL: Reward is All Tool Learning Needs — 2025 — systematic reward design for tool use under GRPO.
- RAGEN — 2025 — StarPO for trajectory-level RL; documents the Echo Trap and StarPO-S.
- ToRL: Scaling Tool-Integrated RL — 2025 — RL teaches code-interpreter use with emergent self-correction.
- DeepSeek-R1 — 2025 — pure RL (GRPO + verifiable rewards) induces reasoning; the template agent papers extend.
- Kimi K2: Open Agentic Intelligence — 2025 — frontier open-weight native tool-use RL (RLVR + self-critique rubric).
- ReAct (2022) · Toolformer (2023) · WebShop (2022) — the prompt-and-SFT precursors and a foundational environment.
Related
RLVR · GRPO · PPO · RLHF · Reward models · RL for reasoning · RL environments · What is reinforcement learning?