PPO: Proximal Policy Optimization, Explained

Key takeaways

PPO is the default deep-RL policy-gradient algorithm: it improves a policy in small, safe steps instead of risky big jumps.
Its core trick is a clipped surrogate objective that caps how far the new policy can move from the old one — TRPO's trust region without the second-order math.
The full recipe: collect rollouts, estimate advantages with GAE, then run several epochs of minibatch SGD on the clipped policy loss, a value loss, and an entropy bonus.
PPO is the classic RLHF optimizer for aligning LLMs; its critic-free descendant GRPO now powers most reasoning-model training.

What is PPO?

Proximal Policy Optimization (PPO) is a reinforcement-learning algorithm introduced by OpenAI in 2017 that improves an agent’s policy a little at a time while preventing each update from changing the policy too drastically. It does this with a clipped objective that caps how far the new policy can move from the old one — giving it the stability of more complex methods like TRPO but with far simpler code. PPO became the default policy-gradient algorithm in deep RL, the workhorse of RLHF for aligning large language models, and the conceptual parent of newer variants like GRPO.

If you are new to the field, start with “What is reinforcement learning?” first. PPO sits inside the policy-gradient family: instead of learning the value of states and acting greedily, it directly adjusts the parameters $\theta$ of a policy $\pi_\theta(a\mid s)$ to make good actions more likely.

▶ An introduction to Policy Gradient methods & PPO — Arxiv Insights (the plain-English intuition)

The intuition: why limit how much the policy changes

The problem with vanilla policy gradients

The simplest policy-gradient method (REINFORCE / “vanilla” PG) nudges the policy in the direction that increases expected reward:

\nabla_\theta J(\theta) = \mathbb{E}_t\big[\nabla_\theta \log \pi_\theta(a_t\mid s_t)\,\hat{A}_t\big]

where $\hat{A}_t$ is the advantage — how much better action $a_t$ was than the policy’s average behaviour in that state. This works in theory but is brutal in practice for two reasons:

High variance. Reward signals are noisy, so the gradient estimate jumps around. One unlucky batch can point you in a bad direction.
Destructive updates. The data was collected by the old policy. Take one large gradient step and the new policy can be so different that the old data no longer describes it. The policy collapses, and because the method is on-policy, the collapsed policy then gathers its own (bad) training data, so there is no clean way to recover. Tuning the learning rate alone does not solve this: too small and learning crawls; too large and it explodes.

The fix everyone wants: reuse each batch of experience for several gradient steps, but stop before the policy drifts too far from the one that collected the data.
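To make the vanilla policy-gradient estimate concrete, here is a minimal sketch for a stateless softmax policy over two actions. The function names and the toy batch are assumptions made for illustration; a real agent would condition on states and use a neural network.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_gradient(logits, actions, advantages):
    """Monte-Carlo estimate of E[grad log pi(a) * A] w.r.t. the logits.

    For a softmax policy, d log pi(a) / d logit_k = 1[k == a] - pi(k).
    """
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, adv in zip(actions, advantages):
        for k in range(len(logits)):
            indicator = 1.0 if k == a else 0.0
            grad[k] += (indicator - probs[k]) * adv
    return [g / len(actions) for g in grad]

# Toy batch (hypothetical): action 0 looked good, action 1 looked bad.
grad = reinforce_gradient(logits=[0.0, 0.0], actions=[0, 1], advantages=[1.0, -1.0])
# Gradient ascent on the logits would raise logit 0 and lower logit 1.
```

With only two samples the estimate depends entirely on which actions happened to be drawn, which is the variance problem described above in miniature.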

From TRPO to PPO: keeping the trust region, dropping the hard math

TRPO (Trust Region Policy Optimization, 2015) solved this by maximizing improvement subject to a hard constraint: the new policy must stay within a KL-divergence trust region of the old one. It comes with a monotonic-improvement guarantee, but enforcing the constraint needs second-order optimization (conjugate gradient, Fisher-vector products) that is fiddly to implement and hard to combine with architectures that include noise (such as dropout) or that share parameters between the policy and the value function.

PPO’s insight: you can get most of TRPO’s stability with a first-order method by baking the “stay close” idea directly into the loss as a simple clip. No constrained optimization, no Hessians — just a modified objective you can drop into any SGD trainer. That simplicity is why PPO, not TRPO, became the default.

TRPO enforces a hard KL trust region around the old policy with second-order optimization. PPO approximates the same 'stay close' effect with a cheap first-order clip on the probability ratio.

How PPO works, step by step

PPO runs as a loop: gather a batch of experience with the current policy, compute advantages, then squeeze several optimization steps out of that batch before throwing it away and gathering more.

Collect rollouts with the current policy

Run the policy $\pi_{\theta_{\text{old}}}$ in the environment (or, for LLMs, generate completions) for a fixed number of steps across parallel actors. Store each transition: state, action, reward, and the log-probability the old policy assigned to that action. PPO is on-policy — this fresh data is what every update in this iteration is measured against.

Estimate advantages with GAE

For each timestep, compute how much better the taken action was than expected, using Generalized Advantage Estimation (GAE). GAE blends short- and long-horizon estimates with a parameter $\lambda$ to trade off bias and variance (formula below). The critic’s value estimates $V(s_t)$ act as the baseline that makes this low-variance.

Form the probability ratio and clipped objective

For each sample, compute the ratio between how likely the new policy is to take that action versus the old one:

r_t(\theta) = \frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\text{old}}}(a_t\mid s_t)}

Then maximize the clipped surrogate objective — the part of PPO that does the real work:

L^{\text{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\; \text{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\Big]
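As a rough sketch of the two formulas above, the ratio is usually computed from stored log-probabilities and then fed through the clip and the min. The function name and the toy numbers are assumptions for illustration; production code would do the same arithmetic on tensors.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Mean clipped surrogate objective over a batch (to be maximized)."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # r_t = pi_new / pi_old, via log-probs
        clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)

# Toy batch (hypothetical): the old policy gave both actions probability 0.5.
logp_old = [math.log(0.5), math.log(0.5)]
logp_new = [math.log(0.75), math.log(0.25)]  # ratios 1.5 and 0.5
advantages = [1.0, -1.0]
objective = clipped_surrogate(logp_new, logp_old, advantages)
# Both samples are clipped: (1.2 * 1.0 + 0.8 * -1.0) / 2, which is about 0.2
```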

Add the value loss and entropy bonus

Combine three terms into one loss: the clipped policy objective, a value-function loss that trains the critic to predict returns, and an entropy bonus that keeps the policy exploring rather than collapsing to one action too soon.

Run several epochs of minibatch SGD

This is PPO’s efficiency win: shuffle the batch into minibatches and take multiple epochs (typically 3–10) of gradient steps on it. The clip is what makes this safe: even after several updates, the policy has no incentive to stray far from the one that collected the data. Then discard the batch and return to rollout collection.
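The epoch-and-minibatch loop can be sketched as an index generator. This is only an illustration under assumed names (`minibatch_indices` is hypothetical); the gradient step itself is left as a comment because it depends on the framework.

```python
import random

def minibatch_indices(batch_size, minibatch_size, n_epochs, rng):
    """Yield index lists: every epoch reshuffles and covers the whole batch."""
    for _ in range(n_epochs):
        order = list(range(batch_size))
        rng.shuffle(order)
        for start in range(0, batch_size, minibatch_size):
            yield order[start:start + minibatch_size]

rng = random.Random(0)  # seeded only to make the sketch reproducible
batches = list(minibatch_indices(batch_size=8, minibatch_size=4, n_epochs=3, rng=rng))
for idx in batches:
    # Here a real trainer would gather the samples at `idx`, compute the
    # PPO loss on them, and take one optimizer step.
    pass
```

Three epochs over a batch of 8 with minibatches of 4 give six gradient steps from a single round of data collection.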

Why the min() makes it a pessimistic bound

The clipped objective is the single most-misunderstood line in PPO, so it is worth slowing down. There are two cases, depending on whether the action was good or bad.

Good action (Â > 0)

We want to increase its probability, so $r_t$ rises above 1. The clip caps the reward of going past $1+\epsilon$ : once the new policy is already much more likely to take this action, further increases earn no extra objective. The incentive to push harder vanishes.

Bad action (Â < 0)

We want to decrease its probability, so $r_t$ falls below 1. The clip caps the benefit at $1-\epsilon$ : the objective stops rewarding you for shrinking the action’s probability beyond that point. You cannot annihilate an action in one update.

The crucial detail is the min between the clipped and unclipped terms. It makes the objective a pessimistic (lower) bound on the unclipped one: PPO only “lets go” of the clip when doing so makes the objective worse, never better. Concretely, if a bad action’s ratio jumps above $1+\epsilon$ (the update is making a mistake), the unclipped term is smaller than the clipped one and the min selects it, so the gradient still pulls the policy back. The clip removes the incentive to over-step; the min ensures the clip can never be exploited to avoid a needed correction.

The clipped objective for a single sample, as a function of the ratio rₜ. For a good action (left) the objective flattens above 1+ε; for a bad action (right) it flattens below 1−ε. Beyond those points the gradient is zero, so the update stops pushing.
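The four cases can be checked numerically. The sketch below evaluates the single-sample objective and uses a finite difference to test whether it still changes with the ratio; the helper names and the probe ratios 1.5 and 0.5 are assumptions chosen for illustration.

```python
def sample_objective(ratio, adv, eps=0.2):
    """Clipped surrogate for one sample."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * adv, clipped_ratio * adv)

def has_gradient(ratio, adv, eps=0.2, h=1e-6):
    """Finite-difference check: does the objective still depend on the ratio?"""
    up = sample_objective(ratio + h, adv, eps)
    down = sample_objective(ratio - h, adv, eps)
    return abs(up - down) > 0.0

cases = {
    "good action, ratio too high": has_gradient(1.5, +1.0),  # flat: stop pushing
    "good action, ratio too low": has_gradient(0.5, +1.0),   # still corrected
    "bad action, ratio too low": has_gradient(0.5, -1.0),    # flat: stop pushing
    "bad action, ratio too high": has_gradient(1.5, -1.0),   # still corrected
}
```

The gradient vanishes only when the policy has already moved far enough in the direction the advantage favours; when it has moved the wrong way, the min keeps the unclipped term and the correction stays active.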

The math (practitioner depth)

The clipped surrogate and the role of ε

L^{\text{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\; \text{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\Big]

The clip range $\epsilon$ is PPO’s most important knob and plays the role of the trust-region radius. The paper’s default is $\epsilon = 0.2$ : once an action’s probability ratio has moved more than roughly $\pm 20\%$ from the old policy in the direction the advantage favours, the objective stops rewarding further movement. This is a removed incentive rather than a hard bound, so ratios can still end up outside the range. Smaller $\epsilon$ means more conservative, more stable, slower learning; larger $\epsilon$ means faster but riskier. In LLM RLHF, values at or below the default (e.g. 0.1–0.2), combined with extra KL control, are common because a destabilized chat model is expensive to recover.

The full PPO objective

The complete per-iteration objective adds a value loss and entropy term:

L_t(\theta) = \mathbb{E}_t\Big[\, L^{\text{CLIP}}_t(\theta) \;-\; c_1\,L^{\text{VF}}_t(\theta) \;+\; c_2\,S\big[\pi_\theta\big](s_t) \,\Big]

$L^{\text{VF}}_t = \big(V_\theta(s_t) - V_t^{\text{target}}\big)^2$ — the critic is trained to predict returns; accurate values give low-variance advantages.
$S[\pi_\theta]$ — the entropy bonus rewards a less-peaked action distribution, slowing premature convergence to a deterministic policy.
$c_1$ (value coefficient, often ~0.5) and $c_2$ (entropy coefficient, often ~0.0–0.01) weight the three terms. When the actor and critic share a network, all three flow through the same backbone.
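A minimal sketch of how the three terms combine, assuming the clipped policy objective has already been computed for the minibatch. The function names are hypothetical, and the sign flip at the end reflects the usual practice of handing a loss to a minimizer.

```python
import math

def categorical_entropy(probs):
    """Entropy of a discrete action distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def ppo_loss(policy_objective, value_preds, value_targets, entropies,
             c1=0.5, c2=0.01):
    """Negative of L^CLIP - c1 * L^VF + c2 * S, averaged over the minibatch."""
    n = len(value_preds)
    value_loss = sum((v - t) ** 2 for v, t in zip(value_preds, value_targets)) / n
    entropy = sum(entropies) / len(entropies)
    objective = policy_objective - c1 * value_loss + c2 * entropy
    return -objective  # minimizing this maximizes the PPO objective

# Toy numbers (hypothetical): objective 0.2, value MSE 0.5, mean entropy 0.7.
loss = ppo_loss(0.2, value_preds=[1.0, 2.0], value_targets=[0.0, 2.0],
                entropies=[0.6, 0.8])
```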

GAE and the λ–γ tradeoff

Generalized Advantage Estimation computes the advantage as an exponentially-weighted sum of temporal-difference (TD) residuals $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ , where $r_t$ here denotes the reward at step $t$, not the probability ratio $r_t(\theta)$ :

\hat{A}^{\text{GAE}}_t = \sum_{l=0}^{\infty}(\gamma\lambda)^l\,\delta_{t+l}

Two knobs control the bias–variance tradeoff:

$\gamma$ (discount) sets how far into the future rewards matter. Lower $\gamma$ is myopic and low-variance; $\gamma \approx 0.99$ is standard.
$\lambda$ (GAE) interpolates between two extremes. At $\lambda = 0$ , GAE is the one-step TD advantage $\delta_t$ — low variance, high bias (relies heavily on the critic). At $\lambda = 1$ , it becomes the full Monte-Carlo return minus the baseline — unbiased but high variance. The usual sweet spot is $\lambda \approx 0.95$ .
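In practice the infinite sum is computed with a backward recursion over a finite rollout. The sketch below assumes a single trajectory with no episode boundary inside it, and a `last_value` argument (a name chosen here) that bootstraps from the critic at the cut-off, or is 0 if the episode ended.

```python
def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """GAE via the recursion A_t = delta_t + gamma * lam * A_{t+1}."""
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages

rewards = [1.0, 1.0, 1.0]
values = [0.0, 0.0, 0.0]  # a critic that predicts nothing, for illustration
one_step = compute_gae(rewards, values, last_value=0.0, gamma=1.0, lam=0.0)
monte_carlo = compute_gae(rewards, values, last_value=0.0, gamma=1.0, lam=1.0)
# one_step is [1.0, 1.0, 1.0]; monte_carlo is [3.0, 2.0, 1.0]
```

With $\lambda = 0$ each advantage is just the one-step TD residual; with $\lambda = 1$ it is the full return-to-go minus the baseline, matching the two extremes described above.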

PPO-Penalty vs PPO-Clip

The 2017 paper proposed two variants. PPO-Clip (above) is the one everyone uses. PPO-Penalty instead keeps the unclipped objective but subtracts an explicit, adaptively-tuned KL penalty:

L^{\text{KLPEN}}(\theta) = \mathbb{E}_t\Big[ r_t(\theta)\hat{A}_t - \beta\,\mathrm{KL}\big[\pi_{\theta_{\text{old}}}\,\|\,\pi_\theta\big]\Big]

After each update, $\beta$ is raised if the measured KL overshot a target and lowered if it undershot. The paper found PPO-Clip generally performs better and is simpler, so it became the default — but the KL-penalty idea reappears prominently in RLHF, where a KL term to a frozen reference model (not the previous step) keeps the LLM from drifting away from its supervised-fine-tuned self.
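The adaptive rule can be sketched in a few lines. The thresholds of 1.5 and the factor of 2 follow the heuristic constants given in the 2017 paper; the function name and the example numbers are assumptions for illustration.

```python
def update_kl_coefficient(beta, measured_kl, target_kl):
    """Raise beta when KL overshoots the target, lower it when KL undershoots."""
    if measured_kl > 1.5 * target_kl:
        return beta * 2.0  # policy moved too far: penalize KL more
    if measured_kl < target_kl / 1.5:
        return beta / 2.0  # policy barely moved: relax the penalty
    return beta

beta = 1.0
beta = update_kl_coefficient(beta, measured_kl=0.05, target_kl=0.01)   # -> 2.0
beta = update_kl_coefficient(beta, measured_kl=0.001, target_kl=0.01)  # -> 1.0
```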

Go deeper: PPO is on-policy, but it reuses data — how?

Pure on-policy methods (like A2C) use each sample exactly once. PPO bends this: it takes multiple epochs over the same batch, which is technically off-policy after the first gradient step because the policy being optimized is no longer the one that generated the data. The clip is precisely what licenses this: by refusing to credit updates that move $r_t$ far from 1, it keeps the optimization inside the region where the old samples remain a reasonable approximation. This is why PPO is far more sample-efficient than vanilla PG while staying stable. Push the epoch count too high, though, and the policy outgrows its data, a common cause of instability.

Hyperparameters and how to tune them

PPO is famously sensitive to hyperparameters and implementation choices. Typical starting points (continuous control / games):

Hyperparameter	Typical value	What it controls
Clip $\epsilon$	0.1 – 0.3 (def. 0.2)	Trust-region radius — the single biggest lever
Learning rate	1e-4 – 3e-4 (often annealed)	Step size; anneal to 0 over training for stability
Discount $\gamma$	0.99	How far future rewards count
GAE $\lambda$	0.95	Bias–variance of advantage estimates
Epochs per batch	3 – 10	Sample reuse; too high destabilizes
Minibatch size	32 – 4096	SGD granularity
Rollout / batch length	2048 × N actors	Data per update
Entropy coeff $c_2$	0.0 – 0.01	Exploration; raise if policy collapses early
Value coeff $c_1$	0.5	Weight of the critic loss
Max grad norm	0.5	Gradient clipping for stability
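One way to keep these settings together is a small config object. The sketch below is not any library's API: the class and field names are hypothetical, and the defaults are simply picked from within the ranges in the table.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PPOConfig:
    """Hypothetical container for the starting points listed in the table."""
    clip_eps: float = 0.2
    learning_rate: float = 3e-4
    gamma: float = 0.99
    gae_lambda: float = 0.95
    n_epochs: int = 10
    minibatch_size: int = 64
    rollout_steps: int = 2048
    n_actors: int = 1
    entropy_coef: float = 0.0
    value_coef: float = 0.5
    max_grad_norm: float = 0.5

    @property
    def batch_size(self):
        """Samples gathered per iteration across all actors."""
        return self.rollout_steps * self.n_actors

    @property
    def updates_per_iteration(self):
        """Gradient steps taken on each batch before it is discarded."""
        return self.n_epochs * (self.batch_size // self.minibatch_size)

cfg = PPOConfig()
```

Deriving `updates_per_iteration` makes the sample-reuse level explicit: with these defaults each batch of 2048 samples receives 320 gradient steps.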

Implementation details that actually matter

PPO’s reputation for being “simple” hides a dirty secret: a faithful reading of the paper often fails to reproduce its results. The gap is closed by a pile of unglamorous engineering tricks. Huang et al.’s “37 Implementation Details of PPO” is the definitive catalogue. The ones that matter most:

Advantage normalization — normalize advantages to zero mean and unit variance per minibatch. PPO is very sensitive to advantage scale; skipping this is a frequent cause of failure.
Value-function clipping — clip the value update with the same $\epsilon$ idea, mirroring the policy clip.
Orthogonal initialization — initialize hidden layers orthogonally with gain $\sqrt 2$ ; the policy output head with a tiny scale (~0.01) so early actions are near-uniform, the value head with scale 1.
Learning-rate annealing — decay the LR linearly to zero over training.
Gradient clipping — clip global grad norm (commonly 0.5).
Reward / observation scaling — normalize observations and scale rewards; raw magnitudes wreck the value loss.
A separate, frozen log-prob from rollout time: store $\log\pi_{\theta_{\text{old}}}$ at collection time rather than recomputing it during the update.
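Three of these details fit in a few lines each. The sketch below uses hypothetical function names, the population standard deviation for normalization (libraries differ on this choice), and the max-of-two-losses form of value clipping.

```python
import statistics

def normalize_advantages(advantages, tiny=1e-8):
    """Shift to zero mean and scale to unit variance within a minibatch."""
    mean = statistics.fmean(advantages)
    std = statistics.pstdev(advantages)
    return [(a - mean) / (std + tiny) for a in advantages]

def annealed_lr(initial_lr, update, total_updates):
    """Linear decay from initial_lr at update 0 toward 0 at the final update."""
    return initial_lr * (1.0 - update / total_updates)

def clipped_value_loss(v_new, v_old, target, eps=0.2):
    """Value clipping: keep the new prediction within eps of the old one,
    then take the worse (larger) of the clipped and unclipped squared errors."""
    v_clipped = v_old + min(max(v_new - v_old, -eps), eps)
    return max((v_new - target) ** 2, (v_clipped - target) ** 2)

normalized = normalize_advantages([1.0, 2.0, 3.0])
lr_halfway = annealed_lr(3e-4, update=50, total_updates=100)
v_loss = clipped_value_loss(v_new=2.0, v_old=1.0, target=2.0)
```

In the last call the new prediction already hits the target, but the clipped prediction (1.2) does not, so the loss stays positive, mirroring the pessimistic min in the policy objective.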

PPO in RLHF and LLM training

PPO is the algorithm that turned base language models into assistants. In the classic RLHF pipeline (InstructGPT, ChatGPT), the language model is the policy, and PPO optimizes it against a learned reward model — with a KL leash to keep it sane.

The KL penalty and the four-model setup

The objective PPO maximizes for an LLM is the reward-model score minus a per-token KL penalty to a frozen reference model (the SFT checkpoint):

\max_{\pi_\theta}\; \mathbb{E}_{x,\,y\sim\pi_\theta}\big[\,r_\phi(x,y)\,\big]\;-\;\beta\,\mathrm{KL}\big(\pi_\theta(\cdot\mid x)\,\|\,\pi_{\text{ref}}(\cdot\mid x)\big)

This is why production PPO-RLHF juggles four models at once — a real systems burden at LLM scale:

PPO-RLHF holds four models in memory: the trainable policy, a frozen reference for the KL penalty, a frozen reward model that scores completions, and the value/critic network that supplies the baseline for advantages.
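A common way to turn this objective into per-token rewards for PPO is to charge each generated token a KL cost and add the reward-model score at the final token. The sketch below assumes that convention, uses the sampled log-ratio as the KL estimate, and picks a hypothetical $\beta$ of 0.1; the function name is not from any library.

```python
def shaped_token_rewards(logp_policy, logp_ref, rm_score, beta=0.1):
    """Per-token reward: -beta * (log pi - log pi_ref), plus the
    reward-model score added on the last generated token."""
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += rm_score
    return rewards

# Toy completion of two tokens (hypothetical log-probabilities).
rewards = shaped_token_rewards(
    logp_policy=[-1.0, -2.0],  # from the trainable policy
    logp_ref=[-1.5, -2.0],     # from the frozen reference model
    rm_score=1.0,              # from the frozen reward model
)
# The first token pays a KL cost of 0.05; the last token receives the score.
```

These per-token rewards are what the critic and GAE then consume, which is where the fourth model enters.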

This memory and complexity cost — four large models, plus PPO’s tuning sensitivity — is exactly what motivated the simpler alternatives that followed. DPO skips the reward model and RL loop entirely; GRPO keeps the RL loop but throws away the critic.

PPO vs GRPO and the reasoning-era variants

GRPO (Group Relative Policy Optimization), introduced in DeepSeekMath (2024) and central to DeepSeek-R1, is “PPO without the critic.” Instead of a learned value network, it samples a group of completions per prompt and uses the group’s mean reward as the baseline: the advantage of each completion is just how it scored relative to its siblings. For long reasoning chains, where the reward arrives only at the very end and a per-token critic is nearly impossible to train, this is both cheaper (there is no value network to store or train, which removes a model of roughly policy size and its optimizer state from memory) and often more effective. GRPO is now the most common optimizer for RLVR (verifiable-reward training of reasoning models), while PPO remains the more controllable default when you can afford the critic.
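The group-relative baseline is simple enough to sketch directly. The version below also divides by the group's standard deviation, as the GRPO formulation does; that scaling, the function name, and the toy rewards are additions for illustration beyond the description above.

```python
import statistics

def group_relative_advantages(rewards, tiny=1e-8):
    """Advantage of each completion relative to its group for one prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + tiny) for r in rewards]

# Four completions for one prompt, scored 1 if the answer verified, else 0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# Correct completions get about +1, incorrect ones about -1.
no_signal = group_relative_advantages([1.0, 1.0, 1.0])
# If every completion scores the same, all advantages are 0.
```

The second call shows a practical consequence: prompts where the whole group succeeds or the whole group fails contribute no learning signal.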

	TRPO	PPO	GRPO
Stay-close mechanism	Hard KL trust region	Clipped ratio (+ optional KL)	Clipped ratio + KL to reference
Optimization order	Second-order	First-order	First-order
Critic / value network	Yes	Yes	No (group baseline)
Baseline for advantage	Learned value	Learned value (GAE)	Group-relative mean reward
Main use today	Largely superseded	Robotics, games, classic RLHF	LLM reasoning / RLVR
Drops vs PPO	—	(the parent)	The value network

2025 saw continued PPO evolution for LLMs, e.g. Truncated PPO for more efficient long-generation training, and sequence-level variants — but the clipped-ratio core remains intact across all of them.

▶ RLHF + PPO with full math derivations and PyTorch code — Umar Jamil (the deep LLM version)

Strengths, limitations, and common failure modes

Why people reach for PPO

Stable and forgiving relative to vanilla PG; sample-efficient via batch reuse; first-order and easy to parallelize; works across discrete and continuous actions; one well-understood recipe spans games, robotics, and LLMs.

Where it bites

Many interacting hyperparameters; on-policy, so it discards old data and is less sample-efficient than off-policy methods like SAC; results hinge on implementation details; at LLM scale the four-model setup is heavy.

Common failure modes and how to debug them:

KL blow-up / policy drift. The policy moves too far and quality craters. Symptoms: KL to the old (or reference) policy spikes. Fix: tighten $\epsilon$ , lower the LR, reduce epochs; in RLHF, raise the KL coefficient $\beta$ .
Value-loss blow-ups. The critic loss explodes and corrupts advantages. Fix: value clipping, reward/return normalization, a separate or lower value LR, or a smaller value coefficient $c_1$ .
Advantage-scale sensitivity. Forgetting to normalize advantages, or normalizing inconsistently, is a top cause of silent underperformance.
Entropy collapse. The policy becomes deterministic too early and stops exploring. Fix: increase the entropy coefficient.
Reward hacking. In RLHF especially, the policy games the proxy reward model. The KL penalty mitigates but does not eliminate it — watch the gold reward, not just the proxy.
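Two cheap diagnostics help catch policy drift early: an approximate KL between the old and new policy, and the fraction of samples whose ratio falls outside the clip range. The sketch below uses the estimator $(r - 1) - \log r$ averaged over samples, which several open-source implementations log; the function name and toy inputs are assumptions.

```python
import math

def ppo_diagnostics(logp_new, logp_old, eps=0.2):
    """Return (approximate KL, clip fraction) for one minibatch."""
    log_ratios = [ln - lo for ln, lo in zip(logp_new, logp_old)]
    ratios = [math.exp(x) for x in log_ratios]
    # (r - 1) - log r is non-negative and zero only when the policies agree.
    approx_kl = sum((r - 1.0) - x for r, x in zip(ratios, log_ratios)) / len(ratios)
    clip_fraction = sum(abs(r - 1.0) > eps for r in ratios) / len(ratios)
    return approx_kl, clip_fraction

logp_old = [math.log(0.5), math.log(0.5)]
logp_new = [math.log(0.75), math.log(0.5)]  # ratios 1.5 and 1.0
approx_kl, clip_fraction = ppo_diagnostics(logp_new, logp_old)
```

A clip fraction that climbs toward 1, or an approximate KL that keeps rising across epochs on the same batch, suggests the batch is being reused too aggressively.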

Code and libraries

You should almost never implement PPO from scratch for production — use a vetted library:

Library	Best for	Notes
Stable-Baselines3	Classic RL (Gym/Gymnasium)	Batteries-included, well-tested PPO
CleanRL	Learning & research	Single-file, readable; mirrors the “37 details”
TRL	LLM RLHF	PPO, DPO, GRPO for transformers
TorchRL	Custom RL pipelines	Modular PyTorch components
OpenRLHF / veRL	Large-scale LLM RL	Distributed PPO/GRPO at scale

For the canonical math-plus-pseudocode reference, OpenAI Spinning Up’s PPO page is still the best single document. Building the environments, reward pipelines, and infrastructure that PPO trains against is its own industry — see the RL environment and RLHF tooling companies.

A short history

2015

TRPO + GAE

Schulman et al. introduce Trust Region Policy Optimization and Generalized Advantage Estimation — the trust-region idea and the advantage estimator PPO inherits.

2017

PPO

Schulman et al. publish Proximal Policy Optimization, replacing TRPO’s hard constraint with a simple clip. It quickly becomes deep RL’s default.

2019

OpenAI Five & robotics

PPO scales to Dota 2 and to dexterous robot-hand manipulation, proving it works far beyond toy benchmarks.

2022

InstructGPT → ChatGPT

PPO becomes the RL engine of RLHF, aligning GPT-3 into an instruction follower. The “37 details” blog post documents what makes PPO actually reproduce.

2024

GRPO

DeepSeekMath introduces GRPO — PPO without the critic — using group-relative rewards; it underpins the reasoning-model wave.

2025–26

Reasoning-era variants

Truncated PPO, sequence-level and difficulty-aware variants refine PPO-derived RL for efficient large-scale LLM reasoning.

Researcher takes

ML researcher Cameron Wolfe breaks down why PPO became the workhorse RL algorithm behind RLHF and LLM alignment.


Nathan Lambert points to a structural failure mode in PPO’s core clipping mechanism when applied to reasoning models.


Frequently asked questions

What does “proximal” mean in PPO?

It refers to keeping the updated policy proximal — close — to the previous one. The clip on the probability ratio is the mechanism that enforces this proximity, so each update is a small, safe step rather than a destabilizing leap.

What is a good value for the clip epsilon?

The paper’s default of $\epsilon = 0.2$ is a solid starting point for games and continuous control. Smaller values (0.1) are more conservative and stable; larger (0.3) learn faster but risk instability. LLM RLHF often uses tighter clips plus an explicit KL penalty to a reference model.

Is PPO on-policy or off-policy?

PPO is on-policy: each iteration optimizes against data freshly collected by the current policy and then discards it. It bends this slightly by taking several epochs over each batch — the clip is what keeps those reused updates valid. This makes it less sample-efficient than off-policy methods like SAC, but more stable and simpler.

PPO vs GRPO — which should I use for an LLM?

For classic preference-based RLHF with a learned reward model, PPO is the controllable, well-understood default. For training reasoning models against verifiable rewards (RLVR), such as math, code, and long chains-of-thought, GRPO is usually preferred: it drops the hard-to-train critic and the memory that critic occupies, and it uses a group-relative baseline that suits sparse end-of-sequence rewards.

Key papers

Proximal Policy Optimization Algorithms — Schulman et al., 2017 — the original PPO paper (clipped objective, PPO-Penalty/PPO-Clip).
Trust Region Policy Optimization — Schulman et al., 2015 — PPO’s predecessor with the hard KL trust region.
High-Dimensional Continuous Control Using GAE — Schulman et al., 2015/16 — the advantage estimator PPO relies on.
The 37 Implementation Details of PPO — Huang et al., 2022 — the tricks that make PPO reproduce.
DeepSeekMath / GRPO — Shao et al., 2024 — the critic-free PPO variant powering reasoning models.
Truncated Proximal Policy Optimization — 2025 — an efficient PPO variant for large-scale LLM RL.
