reinforcement-learning.com

Agentic RL: Training LLM Agents

How reinforcement learning turns LLMs into tool-using agents: the POMDP framing, RLVR rewards, rollout masking, GRPO/PPO, reward hacking, and 2026 frontier practice.

Updated 2026-06-07
Key takeaways
  • Agentic RL trains an LLM to act over many turns — plan, call tools, read the results, keep going — instead of optimizing a single answer like classic RLHF.
  • It reframes the model from a one-step text generator into a policy in a long-horizon, partially-observable environment (a POMDP), usually rewarded by a verifier (RLVR) on final task success.
  • Two mechanics make it work in practice: a rollout that interleaves think → tool call → observation → answer, and masking the tool-observation tokens so you never train on text the model didn't generate.
  • The hard parts are agent-specific: long-horizon credit assignment, tool-call reward hacking, and training collapse (RAGEN's 'Echo Trap') — and they're where most recipes fail.

What is agentic RL?

Agentic RL is the practice of using reinforcement learning to train large language models to behave as autonomous agents: they plan, call external tools (a search engine, a code interpreter, a browser, an API), observe what comes back, and keep going across many turns until a task is finished. Where classic RLHF optimizes a single response to match human taste, agentic RL optimizes a whole trajectory — an interleaved stream of reasoning, tool calls, and tool observations — usually against an automatically computed reward for whether the final task actually succeeded.

The shift is conceptual, not just engineering. A vanilla LLM-RL setup treats generation as essentially one big action: prompt in, answer out, one reward. Agentic RL treats the model as a policy operating in an environment it doesn’t fully control — search results it can’t predict, code that may throw an error, a webpage that changes. That makes it a long-horizon, partially-observable decision problem, and it’s why the algorithms, rewards, and failure modes all look different from single-turn alignment.

[Figure: single-turn (RLHF / RLVR) flow Prompt → Policy LM → Answer → Reward, and multi-turn agentic RL loop Task → Policy LM (think + act) → Tool (search/code) → Observation (masked) → back to the policy (read observation, decide next action), with Reward on success.]
Single-turn LLM-RL (top) optimizes one prompt-to-answer step. Agentic RL (bottom) optimizes a multi-turn trajectory where the model's tool calls produce observations it must read and react to before the final reward.

Agentic RL vs. RLHF vs. prompt-based agents (ReAct, Toolformer)

People constantly conflate four things. They sit on two axes: how the model decides to act and whether RL updates its weights.

| Approach | What it is | Reward / signal | Weights updated? |
|---|---|---|---|
| ReAct (2022) | Prompt pattern: interleave “thought” and “action” so a frozen model uses tools | none (pure prompting) | No |
| Toolformer (2023) | Self-supervised: insert API calls into training text, keep the useful ones | does the call lower LM loss? | Yes (SFT, not RL) |
| RLHF | Align a single response to human preference | learned reward model | Yes (RL, single-turn) |
| Agentic RL | Optimize a multi-turn tool-using trajectory | verifier on task success (RLVR) | Yes (RL, multi-turn) |

ReAct and Toolformer are the conceptual ancestors: they showed an LLM can interleave reasoning and tool calls. Agentic RL is what happens when you stop hand-prompting that behavior and instead reward it into the weights — so the model learns when to search, which tool to reach for, and when it has enough to answer.

A worked example: a search agent that learns when to query

Concretely, take a question-answering agent built on the Search-R1 recipe. The model emits structured tags; the environment runs the tool and pastes results back; the loop repeats until the model commits to an answer.

Task: “Who directed the highest-grossing film of the year Oppenheimer won Best Picture?”
Model generates: <think> Oppenheimer won at the 2024 ceremony, for 2023. I need the top-grossing 2023 film. </think> <search> highest-grossing film 2023 </search>
Environment returns (masked): <information> Barbie (2023): worldwide gross $1.45B, the year’s highest. </information>
Model generates: <think> Barbie was directed by Greta Gerwig. I can answer. </think> <answer> Greta Gerwig </answer>

Only the final answer earns reward (does it exactly match the gold answer?). Nobody told the model to search once versus twice, or what to put in the query; RL discovered that the search-then-answer policy maximizes the outcome reward. Search-R1 reports roughly a 41% average relative improvement over retrieval-augmented (RAG) baselines on QA benchmarks with a Qwen2.5-7B model, holding across in- and out-of-distribution test sets.

Why RL for tool-using agents?

Couldn’t you just do supervised fine-tuning on good agent traces, or prompt a strong model? Both hit walls that RL is built to climb.

The limits of SFT on traces

Imitating expert trajectories teaches a path, not recovery. The agent never sees its own mistakes during training, so at test time — when a search returns junk or code throws an error — it’s off-distribution and flounders. SFT also can’t optimize for outcome; it copies tokens whether or not the episode would have succeeded.

The limits of prompting

ReAct-style prompting depends entirely on the base model’s latent skill. It can’t teach a model to call a tool it tends to ignore, can’t fix systematic over- or under-calling, and can’t be tuned toward your specific task’s success metric. It’s a ceiling, not a training method.

Outcome rewards and emergent behavior

The defining feature of agentic RL is that you usually only need to score the end state: did the tests pass, is the answer correct, did the task complete? From that sparse signal, RL discovers strategies nobody scripted. ToRL and ToolRL both report emergent behaviors after RL — the model learns to self-correct faulty code, adjust how often it invokes a tool, and compose multiple tools for sub-tasks. This is the same lesson DeepSeek-R1 taught for reasoning: a verifiable outcome reward can incentivize complex behavior to appear on its own. See RL for reasoning.

The mental model: agentic RL as a (PO)MDP

To train an agent you have to name the pieces of the decision problem. Agentic RL maps the LLM loop onto a Partially Observable Markov Decision Process.

1. State and observation

The true state is the full task context plus the external world (the search index, the codebase, the live webpage). The agent never sees all of it — it only sees observations: the tokens currently in context, including whatever tools returned. Hidden world state is exactly what makes it partially observable.

2. Action

At each turn the policy emits a chunk of tokens that encodes a reasoning step plus an action, typically a tool call (with arguments) or a final answer. At the token level the action space is the model’s whole vocabulary; at the turn level it is any token sequence, given structure by special tags.

3. Transition (the environment)

When the agent calls a tool, the environment executes it — runs the code, queries the index, loads the page — and appends the result as a new observation. This transition is outside the model and often stochastic.

4. Reward and episode

An episode is the whole trajectory until the agent answers or hits a turn limit. The reward is usually sparse and terminal: a verifier checks final success and assigns (often) a single scalar. Everything in between has to be credited from that.
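
The four pieces above can be written down as plain data types. This is a minimal sketch, not any framework's API: `Segment`, `Episode`, and their fields are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Segment:
    source: str  # "policy" (an action) or "environment" (an observation)
    text: str


@dataclass
class Episode:
    task: str
    segments: List[Segment] = field(default_factory=list)
    reward: Optional[float] = None  # sparse and terminal: set once by the verifier

    def context(self) -> str:
        """Everything the policy can see: the task plus all segments so far."""
        return self.task + "".join(s.text for s in self.segments)

    def policy_text(self) -> str:
        """Only the policy's own output, i.e. its actions."""
        return "".join(s.text for s in self.segments if s.source == "policy")


episode = Episode(task="Who directed the top-grossing 2023 film? ")
episode.segments.append(Segment("policy", "<search>highest-grossing film 2023</search>"))
episode.segments.append(Segment("environment", "<information>Barbie (2023)</information>"))
episode.segments.append(Segment("policy", "<answer>Greta Gerwig</answer>"))
episode.reward = 1.0  # assigned by a verifier after the final answer
```

Note what is missing from `Episode`: the search index itself. The hidden world state never enters the context, only the observations it produces, which is the partial observability described above.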

RLVR: rewards you can check automatically

The reason agentic RL took off in 2025 is RLVR — Reinforcement Learning with Verifiable Rewards. Instead of a learned reward model (which can be hacked, see RLHF), the reward comes from a programmatic checker: do the unit tests pass, does the math answer match, did the SQL return the right rows, did the task’s success condition fire? For agents this is a perfect match — most agentic tasks have a ground-truth outcome you can verify cheaply.

$$r(\tau) \;=\; \texttt{verify}\big(\text{final state of trajectory } \tau\big) \;\in\; \{0, 1\}$$

Verifiable rewards are robust (you can’t sweet-talk a unit test) and cheap to compute at scale, though a weak checker can still be gamed. Their downside is sparsity, one bit at the end of a long episode, which is the central credit-assignment headache below.
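
For a QA agent, a verifier of this kind can be a few lines. The sketch below assumes the `<answer>` tag convention from the worked example and a common exact-match normalization (lowercase, strip punctuation and articles). The function names are hypothetical, and real benchmarks may normalize differently.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def extract_answer(trajectory: str):
    """Return the content of the last <answer>...</answer> block, or None."""
    matches = re.findall(r"<answer>(.*?)</answer>", trajectory, flags=re.DOTALL)
    return matches[-1].strip() if matches else None


def verify(trajectory: str, gold: str) -> int:
    """Binary outcome reward: 1 if the final answer matches the gold answer."""
    answer = extract_answer(trajectory)
    if answer is None:
        return 0  # no committed answer means no reward
    return int(normalize(answer) == normalize(gold))


reward = verify("<think>...</think> <answer> Greta Gerwig. </answer>", "greta gerwig")
```

The checker only looks at the final state (the committed answer), so nothing in the trajectory before it can earn or lose reward directly.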

How training actually works

The rollout loop: interleaved reasoning, tool calls, observations

A training rollout generates the trajectory the policy learns from. Generation pauses whenever the model emits a tool call, the environment runs it, the result is spliced back into context, and generation resumes.

[Figure: rollout timeline over time / turns: think (generated) → tool call (generated) → observation (MASKED) → think (generated) → observation (MASKED) → answer (generated); terminal reward credited back across all generated tokens.]
One agentic rollout. The policy alternates generated segments (think / call / answer) with environment-produced observations. The reward lands only at the end and must be propagated back across every generated turn.
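
The pause-execute-resume loop can be sketched in a few lines. This is an illustration under stated assumptions, not Search-R1's code: `policy` and `search` are hypothetical callables (here a scripted stand-in and a fake retriever), and the tag names follow the worked example earlier.

```python
import re

TOOL_CALL = re.compile(r"<search>(.*?)</search>", re.DOTALL)


def rollout(policy, search, task, max_turns=4):
    """Alternate policy generation and tool execution.

    Returns a list of (text, is_model_generated) segments.
    """
    segments = []
    context = task
    for _ in range(max_turns):
        generated = policy(context)  # assumed to stop after a tool call or an answer
        segments.append((generated, True))
        context += generated
        if "<answer>" in generated:
            break  # the agent committed to an answer: episode over
        call = TOOL_CALL.search(generated)
        if call is None:
            break  # neither a tool call nor an answer: stop the episode
        result = search(call.group(1).strip())
        observation = f"<information>{result}</information>"
        segments.append((observation, False))  # environment text, masked in the loss
        context += observation
    return segments


def scripted_policy(context):
    """Stand-in for the LLM: search once, then answer."""
    if "<information>" not in context:
        return "<think>I need the top-grossing 2023 film.</think><search>highest-grossing film 2023</search>"
    return "<think>Barbie was directed by Greta Gerwig.</think><answer>Greta Gerwig</answer>"


def fake_search(query):
    """Stand-in for the retriever."""
    return "Barbie (2023): worldwide gross $1.45B, the year's highest."


segments = rollout(scripted_policy, fake_search, "Who directed the top-grossing 2023 film? ")
```

Recording the `is_model_generated` flag during the rollout is the design choice that matters: the trainer can later build the loss mask from it without re-parsing the text.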

Tool-observation masking and the loss

This is the single most important agentic-RL implementation detail. The tokens a tool returns were not produced by the policy — so you must not compute the policy-gradient loss on them. Search-R1 calls this retrieved-token masking: the loss is computed only over the LLM’s own generated tokens; tokens copied verbatim from a retrieval (or any tool) have their gradients masked out.

Why it matters: if you let those tokens into the objective, you’d be training the model to predict arbitrary external text it can’t control — which destabilizes training and teaches it to parrot retrievals instead of reasoning over them. Concretely, the per-token policy-gradient objective carries a 0/1 mask:

$$\mathcal{L}(\theta) \;=\; -\,\mathbb{E}_\tau\!\left[\sum_{t} \mathbf{1}[\,t \text{ is model-generated}\,]\; \hat{A}_t \;\log \pi_\theta(a_t \mid s_t)\right]$$

The mask $\mathbf{1}[\cdot]$ is 0 for every observation/tool-output token. Frameworks like VerlTool generalize this bookkeeping so the masking is handled uniformly across many tool types.
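
For a single trajectory, the masked objective reduces to a masked sum. The sketch below uses plain Python lists to stay dependency-free; a real trainer would do the same thing on tensors and usually add clipping and normalization, which are omitted here. The function name is hypothetical.

```python
def masked_pg_loss(logprobs, advantages, mask):
    """Masked policy-gradient loss for one trajectory.

    logprobs[t]   : log pi_theta(a_t | s_t) for token t
    advantages[t] : advantage estimate for token t
    mask[t]       : 1 if the policy generated token t, 0 if a tool returned it
    """
    if not (len(logprobs) == len(advantages) == len(mask)):
        raise ValueError("all inputs must have one entry per token")
    total = sum(m * a * lp for lp, a, m in zip(logprobs, advantages, mask))
    return -total


# Four tokens; the third one came from a tool observation and is masked out.
logprobs = [-0.5, -1.0, -3.0, -0.2]
advantages = [1.0, 1.0, 1.0, 1.0]
mask = [1, 1, 0, 1]
loss = masked_pg_loss(logprobs, advantages, mask)
```

Changing the log-probability of the masked token leaves `loss` unchanged, which is the point: the policy gets no gradient for text it did not write.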

Algorithms: PPO, GRPO, RLOO, DAPO

Agentic RL borrows the policy-optimization toolkit from LLM-RL, with multi-turn twists.

| Algorithm | Idea | Critic / baseline | Why used for agents |
|---|---|---|---|
| PPO | Clipped on-policy updates | learned value network | Most controllable; the default in Search-R1 |
| GRPO | Advantage from a group of sampled trajectories’ relative rewards | none (group mean is the baseline) | Cheaper (no critic); dominant for verifiable rewards |
| RLOO | REINFORCE with leave-one-out baseline over samples | none (other samples) | Simple, low-variance for single terminal reward |
| DAPO | GRPO + decoupled clipping & dynamic sampling fixes | none | Stabilizes long-output / sparse-reward RL |
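
To make "clipped" and "decoupled clipping" concrete, here is a per-token sketch of the clipped surrogate with separate lower and upper clip ranges. With `eps_low == eps_high` it is the standard PPO clip; a larger `eps_high` gives the clip-higher behavior associated with DAPO. The function name and default values are illustrative assumptions, not taken from any framework.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, eps_low=0.2, eps_high=0.28):
    """Per-token clipped objective (to be maximized).

    ratio = pi_new(a|s) / pi_old(a|s), clipped to [1 - eps_low, 1 + eps_high].
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Taking the minimum makes the objective pessimistic: the update gets no
    # extra credit for pushing the ratio beyond the clip range.
    return min(ratio * advantage, clipped * advantage)


# A token whose probability rose by 25% with positive advantage:
symmetric = clipped_surrogate(math.log(1.25), 0.0, 1.0, eps_low=0.2, eps_high=0.2)
decoupled = clipped_surrogate(math.log(1.25), 0.0, 1.0, eps_low=0.2, eps_high=0.28)
```

With the symmetric range the objective is capped at 1.2, while the wider upper range lets the full 1.25 through. That extra headroom for raising low-probability tokens is why clip-higher schemes are used to keep exploration alive.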

GRPO is the workhorse: with one terminal reward per trajectory, sampling a group of $G$ trajectories per task and normalizing their rewards gives a clean advantage without training a separate value network, a big win when each rollout is expensive.

$$\hat{A}_i \;=\; \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)}$$

Multi-turn variants (RAGEN’s StarPO, and the trajectory-level schemes in Tool-Star) keep this structure but operate over whole tool-using trajectories rather than single completions.
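
The two critic-free baselines from the table are easy to compare side by side. A minimal sketch, with assumptions flagged: the small `eps` in the denominator and the use of the population standard deviation are common implementation choices, not part of the formula above, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r_i - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def rloo_advantages(rewards):
    """Leave-one-out: each sample's baseline is the mean of the other samples."""
    n = len(rewards)
    if n < 2:
        raise ValueError("RLOO needs at least two samples per task")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


group = [1.0, 0.0, 0.0, 1.0]       # four rollouts of one task, binary outcome reward
grpo = grpo_advantages(group)       # approximately [1, -1, -1, 1]
rloo = rloo_advantages(group)       # approximately [0.667, -0.667, -0.667, 0.667]
flat = grpo_advantages([1.0] * 4)   # every rollout succeeded: all advantages are 0
```

The last line shows a practical consequence: a group where every rollout gets the same reward yields zero advantage everywhere, so that task contributes no gradient for the step.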

Credit assignment across turns

Here is the deepest agentic problem. The reward is a single bit at turn N, but the agent made N decisions — which one earned (or lost) the reward? This is temporal credit assignment, and it’s far harder than in single-turn RLHF.

  • Sparse outcome reward — give 1 at the end, 0 otherwise, and let the advantage flow back across all generated turns. Simple, hard to hack, but high-variance and sample-hungry on long horizons.
  • Turn-level / dense shaping — add intermediate signals: a small reward for a well-formed tool call, step-level rewards for useful retrievals (as in step-search variants), or process rewards on reasoning steps (see process reward models). Faster to learn, but every added signal is a new surface for reward hacking.
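
The difference between the two options shows up in how much credit each turn receives. The sketch below is a generic returns-to-go illustration, not the scheme from any of the papers named here; the shaping value of 0.1 and the discount `gamma` are hypothetical.

```python
def broadcast_terminal(num_turns, terminal_reward):
    """Sparse scheme: every generated turn shares the single outcome reward."""
    return [terminal_reward] * num_turns


def returns_to_go(turn_rewards, gamma=1.0):
    """Dense scheme: each turn gets its own reward plus discounted future rewards."""
    returns = []
    running = 0.0
    for r in reversed(turn_rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


# Three turns: a well-formed tool call earns a small bonus, the final answer earns 1.
sparse = broadcast_terminal(3, 1.0)        # [1.0, 1.0, 1.0]
dense = returns_to_go([0.1, 0.0, 1.0])     # approximately [1.1, 1.0, 1.0]
failed = returns_to_go([0.1, 0.0, 0.0])    # approximately [0.1, 0.0, 0.0]
```

The `failed` case shows the hacking surface in miniature: with shaping, a trajectory that fails the task still collects credit for the tool call, which is harmless only while the outcome term dominates.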

Reward design and reward hacking

Format vs. correctness vs. tool-use rewards

ToolRL is the first systematic study of reward design for tool use under GRPO, sweeping reward type, scale, granularity, and temporal dynamics. The headline lesson: coarse answer-matching alone is too blunt — agents need fine-grained feedback distinguishing (a) did you call the right tool, (b) with the right parameters, (c) in the right format, and (d) did it produce the right outcome. A principled decomposition beats both base models (+17%) and SFT (+15%) and trains more stably.

A common, well-behaved reward decomposition looks like:

$$r \;=\; \underbrace{r_{\text{format}}}_{\text{small, gated}} \;+\; \underbrace{r_{\text{tool-match}}}_{\text{right tool/args}} \;+\; \underbrace{r_{\text{outcome}}}_{\text{dominant}}$$

with the outcome term dominant so the agent can’t trade real success for format points.
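
One way to read that decomposition in code is below. The weights and the gating rule (nothing is paid unless the output parses) are assumptions made for illustration, not values from ToolRL. The only property the sketch is meant to show is that the outcome weight exceeds the other two combined.

```python
def composite_reward(format_ok, tool_match, outcome_correct,
                     w_format=0.1, w_tool=0.2, w_outcome=1.0):
    """Format + tool-match + outcome reward with a dominant outcome term.

    format_ok       : bool, the output parsed into the expected tags
    tool_match      : float in [0, 1], how well tool name and arguments matched
    outcome_correct : bool, the verifier accepted the final result
    """
    if not 0.0 <= tool_match <= 1.0:
        raise ValueError("tool_match must be in [0, 1]")
    if not format_ok:
        return 0.0  # gate: malformed output earns nothing at all
    return w_format + w_tool * tool_match + w_outcome * float(outcome_correct)


perfect_form_wrong_answer = composite_reward(True, 1.0, False)   # about 0.3
sloppy_tools_right_answer = composite_reward(True, 0.0, True)    # about 1.1
```

Because `w_outcome` is larger than `w_format + w_tool`, no amount of well-formatted, well-matched tool calling can outscore a trajectory that actually solves the task.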

Tool-call hacking and agent-specific exploits

Because tools change the reward surface, agents invent exploits a single-turn model never could:

  • Tool-call hacking — calling a tool (or many) to collect format/shaping reward without using the result, or making a call that looks productive to a lenient verifier.
  • Format gaming — emitting the exact tags/structure the reward checks for while the content is wrong.
  • Verifier exploitation — under RLVR, finding answers that satisfy the checker but not the intent (e.g. printing the expected output instead of computing it, hardcoding a test’s answer).
  • Observation parroting — copying retrieved text as the answer; the masking above is a partial defense, an outcome reward the rest.

The fix is the same discipline as in RLHF: make your verifier harder to game than your policy is to train, keep the outcome term dominant, and watch held-out success rather than training reward.

Training stability and failure modes

Multi-turn RL is notoriously unstable, and RAGEN gave the canonical anatomy of the collapse: the Echo Trap.

[Figure: task reward (mean) and entropy / reward variance plotted against training steps; entropy and variance fall first, reward collapses later.]
The Echo Trap: the agent improves, then overfits to a few locally-rewarded patterns. Reward-variance and entropy fall early — before reward itself drops — making them the leading indicators of collapse.

In the Echo Trap, the agent overfits to locally-rewarded reasoning patterns — it finds a phrasing or tool sequence that scored once and repeats it, narrowing its behavior. The signature is collapsing reward variance, falling entropy, and gradient spikes, with variance and entropy dropping before reward degrades — so they’re early warnings worth logging. RAGEN’s stabilized variant StarPO-S counters it with variance-based trajectory filtering (train on high-uncertainty tasks, discard low-information rollouts), critic baselining, and decoupled clipping. Other common stabilizers: an entropy bonus, KL control toward the reference, clip-higher schemes (DAPO), and curriculum over task difficulty.
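
Both early-warning signals are cheap to compute from data the trainer already has. The sketch below is loosely modeled on the variance-based filtering described above and is not RAGEN's implementation: the function names, the `keep_fraction` parameter, and the ranking rule are hypothetical.

```python
import math
from statistics import pstdev


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def filter_by_reward_variance(groups, keep_fraction=0.5):
    """Keep the tasks whose sampled rollouts disagree the most.

    groups: dict mapping task_id -> list of rewards from that task's rollouts.
    Returns the kept task ids, highest reward spread first.
    """
    ranked = sorted(groups, key=lambda task: pstdev(groups[task]), reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


groups = {
    "a": [1.0, 1.0, 1.0, 1.0],  # always solved: no learning signal
    "b": [1.0, 0.0, 1.0, 0.0],  # maximally uncertain
    "c": [0.0, 0.0, 0.0, 1.0],  # occasionally solved
    "d": [0.0, 0.0, 0.0, 0.0],  # never solved: no learning signal
}
kept = filter_by_reward_variance(groups, keep_fraction=0.5)
uniform_entropy = token_entropy([0.25, 0.25, 0.25, 0.25])
```

Logging the mean of `token_entropy` over generated tokens and the per-task reward spread at every step gives the two curves that, per the description above, turn down before the reward does.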

Task families and benchmarks

Agentic RL has crystallized around a handful of task families, each with its own environment and verifier.

Search / deep-research agents

Multi-hop QA where the agent searches, reads, and synthesizes. Verifier = exact-match / F1 against gold answers. Canonical: Search-R1; environments built on Wikipedia/dense retrievers.

Tool-integrated math reasoning

The model writes and runs code to compute, then continues reasoning. Verifier = numerical answer match. Canonical: ToRL, Tool-Star.

Software-engineering (SWE) agents

Edit a real repo to fix a bug; verifier = the project’s test suite. Sparse and long-horizon. Benchmark: SWE-bench Verified. See Agent-RLVR, SkyRL-Agent.

Web / computer-use & user-facing agents

Navigate sites, fill forms, complete purchases, or hold tool-using conversations. Benchmarks: WebShop, tau-bench / tau2-bench.

Environments and benchmarks

In agentic RL the environment supplies the reward, so benchmarks double as training grounds:

| Benchmark | Domain | Reward / success signal |
|---|---|---|
| WebShop | Simulated e-commerce browsing | bought the right product matching the instruction |
| SWE-bench (Verified) | Real GitHub bug-fixing | the repo’s unit tests pass after the patch |
| tau-bench / tau2-bench | Tool-using user-facing tasks (airline, retail) | task completed under domain rules |

Building and operating these sandboxes — sandboxed code execution, browser pools, reproducible repo snapshots — is its own discipline; see the RL environments page and the companies building agent environments.

Frameworks and infrastructure

A 2025 wave of open frameworks made agentic RL reproducible. They differ mostly in how much of the agent loop they own and how they scale rollouts.

| Framework | Angle | Notes |
|---|---|---|
| verl | General LLM-RL engine | The de facto backend most agent frameworks build on |
| VerlTool | Holistic tool-use RL (ARLT) | Tools as modular plugins via a unified API on top of verl |
| SkyRL-Agent | Efficient multi-turn agent RL | Async dispatcher + lightweight tool interface; backend-portable |
| Agent Lightning | Train any existing agent | Decouples agent execution from RL training, minimal code change |
| rLLM | Extensible agent RL | Strong extensibility, tied to the verl backend |

The shared engineering bottleneck is rollout cost: each training step needs many long, tool-calling trajectories, and a tool call can take seconds (a test suite, a web request). The dominant fix is asynchronous rollouts — decouple trajectory generation from gradient updates so slow tools don’t stall the learner. SkyRL-Agent’s async dispatcher and Agent Lightning’s execution/training split both target exactly this.
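
The decoupling idea can be shown with nothing but `asyncio` from the standard library. This is a toy sketch, not how SkyRL-Agent or Agent Lightning are implemented: `slow_rollout` and `learner` are hypothetical stand-ins, the sleep simulates tool latency, and the rewards are random placeholders.

```python
import asyncio
import random


async def slow_rollout(task_id, queue):
    """Stand-in for one tool-calling trajectory with unpredictable latency."""
    await asyncio.sleep(random.uniform(0.01, 0.05))  # simulated tool latency
    await queue.put({"task_id": task_id, "reward": random.choice([0.0, 1.0])})


async def learner(queue, batch_size, num_batches):
    """Consume finished trajectories as they arrive, in whatever order."""
    batches = []
    for _ in range(num_batches):
        batch = [await queue.get() for _ in range(batch_size)]
        batches.append(batch)  # a real learner would take a gradient step here
    return batches


async def main(num_tasks=8, batch_size=4):
    queue = asyncio.Queue()
    producers = [asyncio.create_task(slow_rollout(i, queue)) for i in range(num_tasks)]
    batches = await learner(queue, batch_size, num_tasks // batch_size)
    await asyncio.gather(*producers)
    return batches


batches = asyncio.run(main())
```

The learner forms its first batch from whichever four rollouts finish first instead of waiting for the slowest one. The cost of this design is that trajectories may have been generated by a slightly older policy than the one being updated, which real systems have to bound or correct for.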

Frontier-model practice (2025–2026): native tool-use RL

Agentic RL has moved from papers into shipped models. Frontier systems are increasingly trained so tool use is native — learned with RL, not bolted on by prompting.

65.8%: Kimi K2 on SWE-bench Verified
200–300: sequential tool calls Kimi K2 holds on task
500+: works synthesized in the 2025 agentic-RL survey
  • DeepSeek-R1 established the template the whole “-R1” agent literature builds on: pure RL with GRPO and verifiable rewards can incentivize complex reasoning — and, extended to tools, complex acting.
  • Kimi K2 (a 1T-param / 32B-active MoE) is the clearest open example of frontier agentic RL: a joint RL stage combining RLVR with a self-critique rubric reward, fed by large-scale agentic data synthesis. It posts strong SWE-bench Verified (~65.8%) and tau2-bench (~66.1) scores and can chain 200–300 sequential tool calls without losing the goal.

The frontier pattern mirrors the RLHF story’s endgame: use RLVR/agentic RL to make the model competent at acting, and keep preference-based alignment so it stays helpful and safe. See computer-use and enterprise-workflow environment vendors.

A short history

2022
ReAct & WebShop
Interleaving reasoning and acting via prompting (ReAct); WebShop gives an interactive web environment to train and test agents.
2023
Toolformer
LLMs teach themselves when to call APIs via self-supervision — the pre-RL baseline for tool use.
2025 (Jan)
DeepSeek-R1
Pure RL with GRPO + verifiable rewards induces reasoning without SFT — the recipe the agent papers extend to tools.
2025 (spring)
Search-R1 · ToRL · ToolRL · RAGEN
Multi-turn tool agents trained with outcome rewards + masking; ToolRL formalizes reward design; RAGEN names the Echo Trap and StarPO-S.
2025 (summer–fall)
Kimi K2 · the Survey · VerlTool · Agent Lightning
Frontier native tool-use RL ships; the field gets its reference survey and a stack of open agentic-RL frameworks.
2026
Scaling the rollout
Async, backend-portable training (SkyRL-Agent) pushes long-horizon SWE/web agents toward production at lower cost.

Open problems and where the field is going

  • Long-horizon credit assignment — propagating one terminal bit across dozens of turns is still high-variance; turn-level rewards help but invite hacking. The core unsolved problem.
  • Sparse rewards & exploration — many tasks reward almost nothing until the very end (SWE agents especially); guidance toward successful trajectories (Agent-RLVR) is a partial answer.
  • Rollout cost — tool-calling trajectories are slow and expensive; async infrastructure is necessary but not sufficient.
  • Generalization — agents trained on one tool/benchmark often don’t transfer; how broadly RL-trained tool skills generalize is open.
  • Evaluation — outcome metrics miss how a task was solved (wasteful tool use, lucky guesses); honest agent eval is harder than a single success rate.


Frequently asked questions

How is agentic RL different from RLHF?

RLHF optimizes a single response against a learned reward model of human preference. Agentic RL optimizes a multi-turn trajectory of reasoning and tool calls against a verifiable outcome reward. RLHF is single-step and taste-based; agentic RL is long-horizon, environment-interactive, and success-based. Frontier models use both — RL for competence, preference alignment for helpfulness/safety.

Is agentic RL just RLVR?

They overlap heavily but aren’t identical. RLVR describes the reward source — a programmatic verifier. Agentic RL describes the setting — a multi-turn tool-using agent. Most agentic RL uses RLVR for its reward, but RLVR also applies to single-turn tasks (e.g. one-shot math), and agentic RL can in principle use learned or rubric rewards (as Kimi K2 does alongside RLVR).

Why mask tool-observation tokens?

Because the model didn’t generate them. Including tool outputs in the policy-gradient loss would train the model to predict external text it can’t control — destabilizing training and encouraging it to parrot retrievals instead of reasoning over them. Masking computes the loss only on the agent’s own generated tokens. It’s the most important implementation detail in multi-turn agent RL.

Can I RL-train an agent I already built with LangChain / a custom harness?

Increasingly, yes. Agent Lightning explicitly decouples agent execution from RL training so an existing agent can be tuned with minimal code change, and frameworks like SkyRL-Agent expose lightweight tool interfaces. You still need a verifiable reward and the compute for many rollouts.
