reinforcement-learning.com

On-Policy vs Off-Policy RL

On-policy vs off-policy reinforcement learning explained: behavior vs target policy, SARSA vs Q-learning, importance sampling, the deadly triad, and PPO/SAC.

Updated 2026-06-07 · 14 min read
Key takeaways
  • On-policy methods learn about the same policy they use to act; off-policy methods learn about a target policy from data generated by a different behavior policy.
  • SARSA (on-policy) vs Q-learning (off-policy) is the canonical contrast: on the Cliff Walking task SARSA learns a safe path, Q-learning learns the optimal-but-risky one.
  • Off-policy buys sample efficiency — you can reuse a replay buffer and learn the greedy policy while exploring — but it can be unstable and often needs importance sampling.
  • The distinction drives modern algorithm choice: PPO (near-on-policy) vs SAC/DQN (off-policy) in control, and is now central to LLM post-training.

What does on-policy vs off-policy mean?

Every reinforcement learning agent does two things at once: it acts in the world (choosing actions to gather experience) and it learns (updating its estimate of how good actions are). The on-policy / off-policy distinction is about a single question — is the policy you are learning about the same as the policy you are using to act?

  • On-policy methods evaluate and improve the same policy that generates the data. The agent learns the value of the policy it is actually following, exploration and all.
  • Off-policy methods learn about a target policy $\pi$ (the one you ultimately care about, usually the greedy/optimal one) using data drawn from a different behavior policy $b$ (the one that actually picks actions, usually something more exploratory).

That second policy is the whole trick. Off-policy learning decouples acting from learning, which is why an off-policy agent can explore wildly while still learning the value of behaving optimally — and why it can learn from old data, demonstrations, or another agent entirely.
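As a minimal sketch of the two roles, assume a tabular agent with a hypothetical row of action-values `q_row` for a single state. The behavior policy below is ε-greedy and the target policy is greedy; the function names are illustrative and do not come from any library.

```python
def greedy_probs(q_row):
    """Target policy: all probability on the highest-valued action (first one on ties)."""
    best = max(range(len(q_row)), key=lambda a: q_row[a])
    return [1.0 if a == best else 0.0 for a in range(len(q_row))]


def epsilon_greedy_probs(q_row, epsilon):
    """Behavior policy: greedy with probability 1 - epsilon, uniform otherwise."""
    n = len(q_row)
    greedy = greedy_probs(q_row)
    return [(1.0 - epsilon) * g + epsilon / n for g in greedy]


q_row = [0.2, 1.5, -0.3, 0.9]                        # hypothetical action-values
target = greedy_probs(q_row)                         # what an off-policy learner evaluates
behavior = epsilon_greedy_probs(q_row, epsilon=0.1)  # what actually picks actions
```

An on-policy learner would use `behavior` for both jobs; an off-policy learner acts with `behavior` while learning the value of `target`.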

[Figure: on-policy vs off-policy. On-policy: one policy π both acts and is learned. Off-policy: a behavior policy b explores and fills a replay buffer with experience, and a greedy target policy π is learned from it.]
On-policy uses one policy for both acting and learning. Off-policy splits them: a behavior policy collects experience while a separate target policy is learned from it.
Video: RL Course by David Silver, Lecture 5: Model-Free Control (SARSA, Q-learning, on- vs off-policy)

The canonical pair: SARSA vs Q-learning

The cleanest way to feel the difference is to compare the two most famous tabular control algorithms. They look almost identical — both are temporal-difference methods that bootstrap a Q-value toward a one-step target — but they differ in a single term: what they assume happens next.

1. SARSA (on-policy)

SARSA updates toward the value of the action the agent actually took next, following its current (exploratory) policy:

$$Q(s,a) \leftarrow Q(s,a) + \alpha\big[\,r + \gamma\, Q(s', a') - Q(s,a)\,\big]$$

Here $a'$ is the action the behavior policy genuinely picks in $s'$ (often $\varepsilon$-greedy). The name spells out the tuple it uses: State, Action, Reward, State, Action. Because the target depends on what the policy really does, SARSA evaluates the policy it is following.

2. Q-learning (off-policy)

Q-learning updates toward the value of the best next action, regardless of what the agent will actually do:

$$Q(s,a) \leftarrow Q(s,a) + \alpha\big[\,r + \gamma\, \max_{a'} Q(s', a') - Q(s,a)\,\big]$$

The $\max$ makes the target the greedy policy, even while the agent keeps exploring with $\varepsilon$-greedy. Target policy (greedy) and behavior policy ($\varepsilon$-greedy) differ, and that mismatch is what makes it off-policy.

That one change, $Q(s',a')$ versus $\max_{a'} Q(s',a')$, is the entire on-policy / off-policy distinction in tabular form. See Q-learning for the full treatment.
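The two updates can be sketched side by side. This is a minimal tabular illustration, assuming `Q` is a dict mapping each state to a list of action-values; the function names and the toy numbers are hypothetical.

```python
def sarsa_target(Q, r, s_next, a_next, gamma):
    """On-policy target: uses the action the agent actually takes next."""
    return r + gamma * Q[s_next][a_next]


def q_learning_target(Q, r, s_next, gamma):
    """Off-policy target: uses the best next action, whatever the agent does."""
    return r + gamma * max(Q[s_next])


def td_update(Q, s, a, target, alpha):
    """Shared TD step: move Q(s, a) a fraction alpha toward the target."""
    Q[s][a] += alpha * (target - Q[s][a])


# Toy table: in state "s2" the exploratory next action (index 0) is worse
# than the greedy one (index 1), so the two targets disagree.
Q = {"s": [0.0, 0.0], "s2": [1.0, 5.0]}
on_policy = sarsa_target(Q, r=-1.0, s_next="s2", a_next=0, gamma=0.9)
off_policy = q_learning_target(Q, r=-1.0, s_next="s2", gamma=0.9)
```

Everything except the target is shared, which is why the two algorithms can live in the same training loop behind a single flag.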

Cliff Walking: where the difference becomes visible

The textbook demonstration is Cliff Walking (Sutton & Barto). An agent must cross a grid to a goal. Each step costs $-1$; stepping off the cliff edge costs $-100$ and resets the episode. The shortest path runs right along the cliff edge.

Q-learning learns the optimal path

It evaluates the greedy policy, so it learns to hug the cliff edge, which is the shortest route. But because it still explores with $\varepsilon$-greedy while learning, it repeatedly falls off, so its online reward during training is worse.

SARSA learns the safe path

It evaluates its actual exploratory policy, which sometimes takes a random step. A random step next to the cliff is catastrophic, so SARSA learns to walk a row away from the edge — a longer but safer route that earns higher reward during learning.
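A compact way to try this yourself is sketched below. It assumes the usual 4×12 grid with the start at the bottom-left, the goal at the bottom-right, and the cliff between them; following the description above, a fall ends the episode. The hyperparameters and the names `step`, `choose`, and `train` are illustrative choices, not a reference implementation.

```python
import random

ROWS, COLS = 4, 12
START, GOAL = (3, 0), (3, 11)
MOVES = [(-1, 0), (0, 1), (1, 0), (0, -1)]  # up, right, down, left


def step(state, action):
    """Return (next_state, reward, done) for one move; walls block movement."""
    r = min(max(state[0] + MOVES[action][0], 0), ROWS - 1)
    c = min(max(state[1] + MOVES[action][1], 0), COLS - 1)
    if r == ROWS - 1 and 0 < c < COLS - 1:  # cliff cells on the bottom row
        return START, -100, True            # fall: large penalty, episode resets
    return (r, c), -1, (r, c) == GOAL


def choose(Q, state, epsilon, rng):
    """Epsilon-greedy behavior policy over the four moves."""
    if rng.random() < epsilon:
        return rng.randrange(4)
    return max(range(4), key=lambda a: Q[state][a])


def train(on_policy, episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1, seed=0):
    """Run SARSA (on_policy=True) or Q-learning (on_policy=False)."""
    rng = random.Random(seed)
    Q = {(r, c): [0.0] * 4 for r in range(ROWS) for c in range(COLS)}
    returns = []
    for _ in range(episodes):
        s, done, total = START, False, 0
        a = choose(Q, s, epsilon, rng)
        while not done:
            s2, reward, done = step(s, a)
            # The next action is sampled before the update in both modes;
            # it only enters the target when on_policy is True.
            a2 = choose(Q, s2, epsilon, rng)
            if done:
                target = reward
            elif on_policy:
                target = reward + gamma * Q[s2][a2]   # SARSA
            else:
                target = reward + gamma * max(Q[s2])  # Q-learning
            Q[s][a] += alpha * (target - Q[s][a])
            s, a, total = s2, a2, total + reward
        returns.append(total)
    return Q, returns
```

Comparing the average of the later entries of `returns` for `train(on_policy=True)` and `train(on_policy=False)` is one way to look for the gap described above; the exact numbers depend on the seed and hyperparameters.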

  • −1: cost per step in Cliff Walking
  • −100: cost of falling off the cliff
  • 1 term: the only formula difference, Q(s',a') vs max Q(s',a')

Why off-policy is powerful: the payoffs

Decoupling behavior from target unlocks capabilities on-policy methods simply cannot have:

| Capability | Why off-policy enables it |
|---|---|
| Experience replay | Store transitions in a buffer and reuse them many times. The data was generated by old policies, so only an off-policy learner can use it. This is what makes DQN work. |
| Sample efficiency | Reusing data means far fewer environment interactions. Off-policy SAC reports 5–10× better sample efficiency than on-policy PPO on continuous-control benchmarks. |
| Learn from demonstrations / logs | Train on human data, expert trajectories, or logged production data (see offline RL and imitation learning). |
| Learn many policies at once | One stream of experience can train several target policies (e.g. general value functions) in parallel. |
| Explore freely, deploy greedily | Behave with maximum exploration while still learning the optimal greedy policy. |
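Experience replay is the easiest of these payoffs to show in code. The sketch below is a minimal uniform replay buffer built on the standard library; the class name and the transition layout are assumptions for illustration, not the interface of any particular RL framework.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions, sampled uniformly at random."""

    def __init__(self, capacity, seed=None):
        self.storage = deque(maxlen=capacity)  # oldest transitions drop out first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Sampled transitions may come from much older versions of the policy,
        # which is why only an off-policy learner can train on them.
        return self.rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)
```

A training loop would call `add` after every environment step and `sample` before every gradient step, so each transition can contribute to many updates.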

The catch: importance sampling and instability

If off-policy is so capable, why use on-policy at all? Because the freedom comes with a bill.

When you learn about $\pi$ from data drawn under $b$, the experience is sampled from the wrong distribution. To get an unbiased estimate you must reweight each sample by how much more (or less) likely the target policy was to take that action, using the importance sampling ratio:

$$\rho_t = \frac{\pi(a_t \mid s_t)}{b(a_t \mid s_t)}$$
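To see why this reweighting is unbiased for a single step, take any function $f$ of the sampled action (the symbol $f$ is introduced here only for illustration) and assume $b(a \mid s) > 0$ wherever $\pi(a \mid s) > 0$:

$$\mathbb{E}_{a \sim b}\big[\rho\, f(a)\big] = \sum_a b(a \mid s)\,\frac{\pi(a \mid s)}{b(a \mid s)}\, f(a) = \sum_a \pi(a \mid s)\, f(a) = \mathbb{E}_{a \sim \pi}\big[f(a)\big]$$

The behavior probabilities cancel, leaving the expectation under the target policy. The coverage assumption matters: an action the behavior policy never takes cannot be reweighted into existence.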

For a multi-step return you multiply the per-step ratios, $\rho_{t:T} = \prod_{k=t}^{T} \rho_k$. This is unbiased, but the product can explode or vanish, giving these estimates enormous variance: a long trajectory where $\pi$ and $b$ disagree even mildly produces a wildly swinging weight.

[Figure: samples drawn under the behavior policy b are reweighted by ρ = π(a|s) / b(a|s) to match the target policy π's distribution.]
The importance sampling ratio reweights off-policy data back onto the target distribution. Multiplying ratios over a trajectory keeps it unbiased but makes the variance grow with horizon length.
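The variance growth is easy to observe in a toy setting. The sketch below assumes a two-action problem in which the behavior policy is uniform and the target policy prefers action 0 with probability 0.7; these numbers and the function names are made up for illustration.

```python
import random
import statistics


def trajectory_weight(pi_probs, b_probs):
    """Product of per-step ratios pi(a_k|s_k) / b(a_k|s_k) along one trajectory."""
    weight = 1.0
    for p, q in zip(pi_probs, b_probs):
        weight *= p / q
    return weight


def simulate_weights(horizon, n_trajectories=10000, seed=0):
    """Sample actions from b and return the importance weight of each trajectory."""
    rng = random.Random(seed)
    pi, b = [0.7, 0.3], [0.5, 0.5]
    weights = []
    for _ in range(n_trajectories):
        actions = [0 if rng.random() < b[0] else 1 for _ in range(horizon)]
        weights.append(trajectory_weight([pi[a] for a in actions],
                                         [b[a] for a in actions]))
    return weights


for horizon in (1, 10, 50):
    w = simulate_weights(horizon)
    print(horizon, statistics.fmean(w), statistics.pvariance(w))
```

In this toy each per-step ratio has mean 1 and second moment 0.5·1.4² + 0.5·0.6² = 1.16, so the true variance of the product is 1.16^H − 1 for horizon H. The policies differ only mildly, yet the variance grows exponentially with trajectory length.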

The deadly triad

Variance is not the only hazard. Sutton & Barto name the deadly triad — the three ingredients that, combined, can make value estimates diverge to infinity:

  1. Function approximation (e.g. a neural network instead of a table),
  2. Bootstrapping (updating an estimate from another estimate, as in TD),
  3. Off-policy learning (training on data from a different policy).

Any two are usually fine. All three together can blow up. This is exactly the regime Deep Q-Networks operate in — function approximation + bootstrapping + off-policy replay — which is why DQN needs stabilizers like a frozen target network and experience replay to tame the triad. On-policy methods sidestep the off-policy leg entirely, which is a large part of why they tend to be more stable, if less sample-efficient.
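A tiny example makes the divergence concrete. The sketch below is modeled on the two-state illustration commonly used for off-policy divergence: one weight `w`, a state whose estimated value is `w`, and a successor whose estimated value is `2w`, with reward 0. The assumption that makes it off-policy is that only this one transition is ever updated, so the successor's estimate is never corrected by its own updates. The function name and step sizes are illustrative.

```python
def semi_gradient_td_step(w, alpha, gamma):
    """One semi-gradient TD(0) update on the single transition s -> s_next."""
    x_s, x_next, reward = 1.0, 2.0, 0.0                # features: v(s) = w, v(s_next) = 2w
    td_error = reward + gamma * w * x_next - w * x_s   # bootstrapped target minus estimate
    return w + alpha * td_error * x_s                  # gradient taken only through v(s)


w = 1.0
history = [w]
for _ in range(100):
    w = semi_gradient_td_step(w, alpha=0.1, gamma=0.9)
    history.append(w)
```

Each update multiplies `w` by 1 + α(2γ − 1), which exceeds 1 whenever γ > 0.5, so the estimate grows without bound even though every reward is zero. All three legs of the triad are present: linear function approximation, a bootstrapped target, and an update distribution that does not match the policy's own.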

Go deeper: why Q-learning survives the triad but plain off-policy TD struggles

Q-learning’s $\max$ operator makes its target the greedy policy, but it never explicitly forms the multi-step importance ratio. It bootstraps a one-step target, so it dodges the exploding-product problem of full-trajectory importance sampling. The price is that, with function approximation, the one-step off-policy update is no longer a true gradient of any objective (it is a “semi-gradient” method), which is precisely where the deadly triad bites and divergence becomes possible. Several methods were designed to make off-policy learning better behaved: Gradient TD performs true stochastic gradient descent on a projected Bellman error objective, Retrace($\lambda$) and V-trace truncate the importance ratios to bound variance, and Tree-Backup avoids importance ratios altogether by backing up expectations under the target policy. DeepMind’s study of the deadly triad empirically maps when divergence actually shows up in deep RL.
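The truncation idea can be shown in a few lines. The sketch below assumes the per-step coefficient λ·min(1, ρ) used by Retrace(λ) and compares it with the raw importance product; it illustrates the weighting only and is not an implementation of any of the algorithms named above.

```python
def raw_and_truncated_products(ratios, lam=1.0):
    """Return (raw importance product, product of lam * min(1, rho)) for one trajectory."""
    raw, truncated = 1.0, 1.0
    for rho in ratios:
        raw *= rho
        truncated *= lam * min(1.0, rho)  # each factor is capped at lam <= 1
    return raw, truncated


# Twenty steps on which the target policy is 1.4x more likely than the behavior policy.
raw, truncated = raw_and_truncated_products([1.4] * 20)
```

Because every truncated factor is at most 1, the truncated product can shrink but never explode, which trades some bias in how far credit propagates for bounded variance.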

The full spectrum: how common algorithms line up

In practice “on-policy vs off-policy” is less a binary and more a spectrum of how far the data-generating policy is allowed to drift from the one being optimized.

| Algorithm | Class | Data reuse | Notes |
|---|---|---|---|
| SARSA | On-policy | None | Learns the value of its own exploratory policy. |
| REINFORCE / vanilla policy gradients | On-policy | None | Each update needs fresh samples from the current policy. |
| PPO | “Near-on-policy” | A few epochs | Reuses each batch for a handful of updates via a clipped importance ratio, then discards the data. |
| Q-learning / DQN | Off-policy | Large replay buffer | Learns the greedy policy from any past data. |
| DDPG / TD3 | Off-policy | Large replay buffer | Off-policy actor-critic for continuous actions. |
| SAC | Off-policy | Large replay buffer | Maximum-entropy off-policy; among the most sample-efficient. |

PPO is the interesting middle case. It is usually called on-policy, yet it reuses each batch of trajectories for several gradient epochs. After the first gradient step the data no longer comes from the policy being updated, which is technically off-policy. It gets away with this by forming the importance ratio $\rho_t = \pi_\theta(a_t \mid s_t) / \pi_{\text{old}}(a_t \mid s_t)$ and clipping it, which removes the incentive to move the new policy far from the one that collected the data. That keeps $\pi$ and $b$ close enough that variance stays manageable, so “near-on-policy” is the honest label.
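The clipping can be written for a single sample as follows. This is a minimal sketch of PPO's clipped surrogate term, assuming an advantage estimate is already available; `clip_eps=0.2` is a common default rather than a value taken from this article.

```python
def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """Per-sample PPO objective: min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action (A > 0): pushing the ratio past 1 + eps earns nothing extra.
gain_at_limit = clipped_surrogate(ratio=1.2, advantage=2.0)
gain_beyond = clipped_surrogate(ratio=1.5, advantage=2.0)
```

Once the ratio leaves the interval [1 − ε, 1 + ε] in the direction the advantage favors, the objective goes flat, so further movement away from the data-collecting policy receives no gradient from that sample.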

On-policy vs off-policy in LLM post-training

The distinction has moved to the center of frontier AI. Reinforcement learning for language models — RLHF, RLVR, GRPO — is dominated by on-policy methods: PPO and GRPO both sample fresh completions from the current model, score them, and update. On-policy is favored here because it is stable and because the “environment” (a reward model or verifier) is cheap to query relative to a real robot.
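The scoring step of that loop can be sketched for GRPO. The code below assumes the group-relative form in which each completion's advantage is its reward standardized against the other completions sampled for the same prompt; the use of the population standard deviation and the `eps` constant are assumptions, and implementations differ on such details.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of completions for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Four completions sampled from the current model, scored 1 (correct) or 0 by a verifier.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline comes from completions sampled by the current model, the group has to be regenerated as the model changes, which is what keeps the procedure on-policy.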

But the same sample-efficiency pressure that drives control research is now appearing in LLMs. Generating completions from a giant model is expensive, so reusing samples is tempting. Recent work shows off-policy RL can match or beat on-policy GRPO for LLM reasoning at much higher sample efficiency, and a wave of methods (e.g. clipped/balanced off-policy objectives) aim to stabilize off-policy LLM training. DPO, notably, learns directly from a fixed offline dataset of preferences — squarely off-policy / offline. See RL for reasoning and agentic RL for where this is heading.
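To make the offline character of DPO concrete, here is a hedged sketch of its per-pair loss, assuming the sequence log-probabilities of the chosen and rejected responses are already computed under both the trained policy and a frozen reference model. The argument names and `beta=0.1` are illustrative.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * margin), where margin compares policy and reference log-ratios."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return math.log1p(math.exp(-beta * margin))  # equals -log(sigmoid(beta * margin))


# Policy identical to the reference: the margin is zero and the loss is log(2).
baseline = dpo_loss(-2.0, -2.0, -2.0, -2.0)
```

Nothing in this loss requires sampling from the current model: every input comes from a fixed dataset of preference pairs, which is why DPO sits on the offline end of the spectrum.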

A short history

  • 1989, Q-learning: Chris Watkins introduces Q-learning, showing that an agent can learn optimal action-values off-policy while following any sufficiently exploratory behavior policy (the full convergence proof followed in Watkins & Dayan, 1992).
  • 1994, SARSA: Rummery & Niranjan propose the on-policy TD control algorithm later named SARSA, which evaluates the policy it actually follows.
  • 2015, DQN tames off-policy deep RL: DeepMind’s Deep Q-Network combines off-policy Q-learning with experience replay and a target network to stabilize the deadly triad on Atari.
  • 2017–18, PPO and SAC: PPO popularizes stable near-on-policy optimization; SAC pushes off-policy sample efficiency in continuous control.
  • 2018, Deep RL and the deadly triad: Van Hasselt et al. empirically characterize when off-policy + bootstrapping + function approximation actually diverges.
  • 2023–26, The LLM era: On-policy PPO/GRPO dominate LLM post-training; off-policy and offline methods (DPO, off-policy RL for reasoning) re-emerge for sample efficiency.

How to choose

1. Are environment steps cheap?

If interaction is cheap (a simulator, a reward model), on-policy (PPO, GRPO) is simpler and more stable. If each step is slow or costly (a real robot, a person), favor off-policy to squeeze every transition.

2. Do you have logged or offline data?

If you must learn from a fixed dataset of past behavior, you need off-policy / offline RL by definition — on-policy cannot reuse foreign data.

3. How much stability can you afford to engineer?

Off-policy with deep nets invites the deadly triad; budget for target networks, replay tricks, and ratio clipping. If you want fewer moving parts, on-policy is the safer default.

4. Continuous or discrete actions?

Discrete + off-policy → DQN. Continuous + off-policy → SAC / TD3. Either action space + on-policy → PPO.
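The checklist above can be condensed into a small helper. This is only a sketch of the rules of thumb in this section, with a hypothetical function name and a deliberately coarse set of inputs; real projects weigh more factors than three booleans.

```python
def suggest_algorithm(cheap_steps, has_offline_data, continuous_actions):
    """Map the questions above to a starting-point algorithm family."""
    if has_offline_data:
        return "offline RL"      # a fixed dataset rules out on-policy learning
    if cheap_steps:
        return "PPO"             # on-policy: simpler and more stable
    if continuous_actions:
        return "SAC / TD3"       # off-policy, continuous actions
    return "DQN"                 # off-policy, discrete actions


# A real robot arm: each step is costly and the actions are continuous torques.
choice = suggest_algorithm(cheap_steps=False, has_offline_data=False,
                           continuous_actions=True)
```

The stability question has no branch of its own here: it is a budget consideration that applies whenever the helper lands on an off-policy answer.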

Frequently asked questions

Is the difference just “does it use a replay buffer?”

A replay buffer is a consequence, not the definition. The definition is whether the policy being learned (target) matches the policy generating the data (behavior). Reusing a replay buffer is only valid for off-policy methods precisely because the buffered data came from older, different policies — so the buffer is a strong signal, but the underlying property is the policy mismatch.

Why is SARSA “safer” than Q-learning if Q-learning finds the optimal policy?

Both find the optimal policy in the limit (as exploration anneals to zero). The difference is during learning: SARSA evaluates its actual exploratory policy, so it accounts for the chance that a random exploratory step sends it off the cliff, and it avoids the edge. Q-learning evaluates the greedy policy and ignores its own exploration, so it learns the risky optimal route while falling off repeatedly mid-training.

Is PPO on-policy or off-policy?

Both labels appear in the literature. PPO is conventionally called on-policy, but it reuses each batch for several gradient epochs using a clipped importance ratio — technically off-policy behavior. The accurate description is “near-on-policy”: it permits a small, bounded drift between the data-collecting policy and the policy being updated. See PPO.

What is the deadly triad and how does it relate?

It is the combination of function approximation, bootstrapping, and off-policy learning. Each pair is usually safe, but all three together can make value estimates diverge. Off-policy learning is one of the three legs, which is the core technical reason off-policy deep RL needs extra stabilization (target networks, clipped importance ratios) that on-policy methods do not.

Key references

Q-learning · Deep Q-Networks · Temporal-difference learning · Policy gradients · PPO · Offline RL · Exploration vs exploitation · Continuous control