- PPO is the default deep-RL policy-gradient algorithm: it improves a policy in small, safe steps instead of risky big jumps.
- Its core trick is a clipped surrogate objective that caps how far the new policy can move from the old one — TRPO's trust region without the second-order math.
- The full recipe: collect rollouts, estimate advantages with GAE, then run several epochs of minibatch SGD on the clipped policy loss, a value loss, and an entropy bonus.
- PPO is the classic RLHF optimizer for aligning LLMs; its critic-free descendant GRPO now powers most reasoning-model training.
What is PPO?
Proximal Policy Optimization (PPO) is a reinforcement-learning algorithm introduced by OpenAI in 2017 that improves an agent’s policy a little at a time while preventing each update from changing the policy too drastically. It does this with a clipped objective that caps how far the new policy can move from the old one — giving it the stability of more complex methods like TRPO but with far simpler code. PPO became the default policy-gradient algorithm in deep RL, the workhorse of RLHF for aligning large language models, and the conceptual parent of newer variants like GRPO.
If you are new to the field, start with “What is reinforcement learning?”. PPO sits inside the policy-gradient family: instead of learning the value of states and acting greedily, it directly adjusts the parameters of a policy to make good actions more likely.
The intuition: why limit how much the policy changes
The problem with vanilla policy gradients
The simplest policy-gradient method (REINFORCE / “vanilla” PG) nudges the policy in the direction that increases expected reward:
$$\nabla_\theta J(\theta) = \mathbb{E}_t\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\right]$$
where $\hat{A}_t$ is the advantage: how much better action $a_t$ was than the policy’s average behaviour in state $s_t$. This works in theory but is brutal in practice for two reasons:
- High variance. Reward signals are noisy, so the gradient estimate jumps around. One unlucky batch can point you in a bad direction.
- Destructive updates. The data was collected by the old policy. Take one large gradient step and the new policy can be so different that the old data no longer describes it. The policy collapses, and because training is on-policy, the collapsed policy then collects the next batch of data, so there is no clean way to recover. Tuning the learning rate alone does not solve this: too small and learning crawls; too large and it explodes.
The fix everyone wants: reuse each batch of experience for several gradient steps, but stop before the policy drifts too far from the one that collected the data.
From TRPO to PPO: keeping the trust region, dropping the hard math
TRPO (Trust Region Policy Optimization, 2015) solved this by maximizing improvement subject to a hard constraint: the new policy must stay within a KL-divergence trust region of the old one. It comes with a theoretical monotonic-improvement guarantee, but enforcing the constraint needs second-order optimization (conjugate gradient, Fisher-vector products) that is fiddly to implement and hard to combine with architectures that use dropout or share parameters between the policy and the value function.
PPO’s insight: you can get most of TRPO’s stability with a first-order method by baking the “stay close” idea directly into the loss as a simple clip. No constrained optimization, no Hessians — just a modified objective you can drop into any SGD trainer. That simplicity is why PPO, not TRPO, became the default.
How PPO works, step by step
PPO runs as a loop: gather a batch of experience with the current policy, compute advantages, then squeeze several optimization steps out of that batch before throwing it away and gathering more.
1. Collect rollouts. Run the policy in the environment (or, for LLMs, generate completions) for a fixed number of steps across parallel actors. Store each transition: state, action, reward, and the log-probability the old policy assigned to that action. PPO is on-policy, so this fresh data is what every update in this iteration is measured against.
2. Estimate advantages. For each timestep, compute how much better the taken action was than expected, using Generalized Advantage Estimation (GAE). GAE blends short- and long-horizon estimates with a parameter $\lambda$ to trade off bias and variance (formula below). The critic’s value estimates act as the baseline that makes this low-variance.
3. Compute the probability ratio and the clipped objective. For each sample, compute the ratio between how likely the new policy is to take that action versus the old one:
   $$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$
   Then maximize the clipped surrogate objective, the part of PPO that does the real work:
   $$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \text{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$
4. Add the value loss and the entropy bonus. Combine three terms into one loss: the clipped policy objective, a value-function loss that trains the critic to predict returns, and an entropy bonus that keeps the policy exploring rather than collapsing to one action too soon.
5. Run several epochs of minibatch updates. This is PPO’s efficiency win: shuffle the batch into minibatches and take multiple epochs (typically 3–10) of gradient steps on it. The clip is what makes this safe: even after several updates, the policy has no incentive to stray far from the one that collected the data. Then discard the batch and return to step 1.
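To make the ratio and the clipped surrogate concrete, here is a minimal sketch in plain Python. It works on lists of per-sample log-probabilities and advantages rather than tensors, and the function name `clipped_surrogate` is illustrative, not taken from any library. A real implementation would compute the same quantity with an autodiff framework so that gradients flow through the new log-probabilities.

```python
import math


def clipped_surrogate(new_logp, old_logp, advantages, eps=0.2):
    """Mean clipped surrogate objective over a batch (a quantity to maximize).

    new_logp / old_logp: log-probabilities of the taken actions under the
    current policy and under the policy that collected the data.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        # Probability ratio r_t = pi_new(a|s) / pi_old(a|s), computed in log space.
        ratio = math.exp(lp_new - lp_old)
        # Clip the ratio to the interval [1 - eps, 1 + eps].
        clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
        # Take the smaller (more pessimistic) of the two candidate terms.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)
```

When the new and old log-probabilities are identical, every ratio is 1 and the objective reduces to the mean advantage, which is a handy sanity check at the start of each iteration.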
Why the min() makes it a pessimistic bound
The clipped objective is the single most-misunderstood line in PPO, so it is worth slowing down. There are two cases, depending on whether the action was good or bad.
Case 1: the action was good ($\hat{A}_t > 0$). We want to increase its probability, so $r_t(\theta)$ rises above 1. The clip caps the reward of going past $1 + \epsilon$: once the new policy is already much more likely to take this action, further increases earn no extra objective. The incentive to push harder vanishes.
Case 2: the action was bad ($\hat{A}_t < 0$). We want to decrease its probability, so $r_t(\theta)$ falls below 1. The clip caps the benefit at $1 - \epsilon$: the objective stops rewarding you for shrinking the action’s probability beyond that point. You cannot annihilate an action in one update.
The crucial detail is the min between the clipped and unclipped terms. It makes the objective a pessimistic (lower) bound on the unclipped one: PPO only “lets go” of the clip when doing so makes the objective worse, never better. Concretely, if a bad action’s ratio jumps above 1 (the update is making a mistake), the unclipped term is smaller and the min selects it — so the gradient still pulls the policy back. The clip removes the incentive to over-step; the min ensures the clip can never be exploited to avoid a needed correction.
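The case analysis can be checked numerically. The sketch below evaluates a single sample's term for the four combinations of good/bad action and ratio above/below the clip range, and reports whether the unclipped term is the one selected by the min (which is when a gradient would flow through the ratio). The helper `surrogate_term` and the example ratios 1.5 and 0.5 are illustrative choices, not from the paper.

```python
def surrogate_term(ratio, advantage, eps=0.2):
    """Return (objective value, whether the gradient flows) for one sample."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    unclipped_term = ratio * advantage
    clipped_term = clipped_ratio * advantage
    value = min(unclipped_term, clipped_term)
    # The clipped term is constant in the policy parameters, so a gradient
    # exists only when the unclipped term is the one the min selects.
    gradient_flows = unclipped_term <= clipped_term
    return value, gradient_flows


cases = {
    "good action, ratio 1.5 (already pushed up)": surrogate_term(1.5, +1.0),
    "good action, ratio 0.5 (pushed the wrong way)": surrogate_term(0.5, +1.0),
    "bad action, ratio 0.5 (already pushed down)": surrogate_term(0.5, -1.0),
    "bad action, ratio 1.5 (pushed the wrong way)": surrogate_term(1.5, -1.0),
}
for name, (value, gradient_flows) in cases.items():
    print(f"{name}: objective={value:+.2f}, gradient flows={gradient_flows}")
```

The two "pushed the wrong way" cases keep their gradient, which is the correction behaviour described above; the two "already pushed" cases are flat.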
The math (practitioner depth)
The clipped surrogate and the role of ε
The clip range $\epsilon$ is PPO’s most important knob: the radius of the trust region. The paper’s default is $\epsilon = 0.2$: the objective stops rewarding a single update for changing any action’s probability by more than roughly 20% relative to the old policy. Smaller $\epsilon$ means more conservative, more stable, slower learning; larger $\epsilon$ means faster but riskier. In LLM RLHF, values at or below the default (e.g. 0.1–0.2, with extra KL control) are common because a destabilized chat model is expensive to recover.
The full PPO objective
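The complete per-iteration objective adds a value loss and entropy term:
$$L^{\text{PPO}}(\theta) = \mathbb{E}_t\left[L_t^{\text{CLIP}}(\theta) - c_1\, L_t^{\text{VF}}(\theta) + c_2\, S[\pi_\theta](s_t)\right]$$
- $L_t^{\text{VF}}(\theta) = \left(V_\theta(s_t) - V_t^{\text{targ}}\right)^2$: the critic is trained to predict returns; accurate values give low-variance advantages.
- $S[\pi_\theta](s_t)$: the entropy bonus rewards a less-peaked action distribution, slowing premature convergence to a deterministic policy.
- $c_1$ (value coefficient, often ~0.5) and $c_2$ (entropy coefficient, often ~0.0–0.01) weight the value and entropy terms against the policy term. When the actor and critic share a network, all three flow through the same backbone.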
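As a rough sketch of how the three terms combine, the snippet below turns the objective into a loss to minimize (the sign of each term flips). It assumes discrete actions, takes the clipped policy objective as an already-computed number, and uses a plain squared error for the critic; some implementations add a factor of 0.5. The names `value_loss`, `entropy`, and `ppo_loss` are illustrative.

```python
import math


def value_loss(values, returns):
    """Mean squared error between critic predictions and return targets."""
    return sum((v - r) ** 2 for v, r in zip(values, returns)) / len(values)


def entropy(probs):
    """Shannon entropy (in nats) of one discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def ppo_loss(clip_objective, values, returns, action_probs, c1=0.5, c2=0.01):
    """Scalar loss to minimize: -L_CLIP + c1 * L_VF - c2 * mean entropy."""
    mean_entropy = sum(entropy(p) for p in action_probs) / len(action_probs)
    return -clip_objective + c1 * value_loss(values, returns) - c2 * mean_entropy
```

Minimizing this loss maximizes the clipped objective and the entropy while shrinking the critic's error, which matches the signs in the objective above.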
GAE and the λ–γ tradeoff
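Generalized Advantage Estimation computes the advantage as an exponentially-weighted sum of temporal-difference (TD) residuals $\delta_t$:
$$\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$
Here $r_t$ is the environment reward at step $t$, not the probability ratio $r_t(\theta)$.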
Two knobs control the bias–variance tradeoff:
- $\gamma$ (discount) sets how far into the future rewards matter. Lower $\gamma$ is myopic and low-variance; $\gamma = 0.99$ is standard.
- $\lambda$ (GAE) interpolates between two extremes. At $\lambda = 0$, GAE is the one-step TD advantage: low variance, high bias (relies heavily on the critic). At $\lambda = 1$, it becomes the full Monte-Carlo return minus the baseline: unbiased but high variance. The usual sweet spot is $\lambda \approx 0.95$.
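GAE is usually computed with a single backward pass over the rollout. The sketch below is one common way to write that recursion in plain Python; the convention that `dones[t]` marks an episode ending right after step `t`, and the argument `last_value` (the critic's estimate for the state after the final step), are assumptions of this sketch.

```python
def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Return (advantages, returns) for one rollout using GAE.

    rewards[t], values[t]: reward received and critic estimate at step t.
    dones[t]: True if the episode terminated after step t.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD residual; no bootstrapping across an episode boundary.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Exponentially weighted sum of residuals with weight (gamma * lam).
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    # Value targets for the critic: advantage plus the baseline.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns
```

Setting `lam=0.0` reproduces the one-step TD advantage at every step, and `lam=1.0` reproduces the discounted return (bootstrapped from `last_value`) minus the baseline, matching the two extremes described above.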
PPO-Penalty vs PPO-Clip
The 2017 paper proposed two variants. PPO-Clip (above) is the one everyone uses. PPO-Penalty instead keeps the unclipped objective but subtracts an explicit, adaptively-tuned KL penalty:
$$L^{\text{KLPEN}}(\theta) = \mathbb{E}_t\left[r_t(\theta)\,\hat{A}_t - \beta\, \mathrm{KL}\left[\pi_{\theta_{\text{old}}}(\cdot \mid s_t),\ \pi_\theta(\cdot \mid s_t)\right]\right]$$
After each update, $\beta$ is raised if the measured KL overshot a target and lowered if it undershot. The paper found PPO-Clip generally performs better and is simpler, so it became the default. The KL-penalty idea reappears prominently in RLHF, however, where a KL term to a frozen reference model (not the previous step) keeps the LLM from drifting away from its supervised-fine-tuned self.
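The adaptive rule for the penalty coefficient is short enough to write out. This sketch follows the heuristic given in the PPO paper (double the coefficient when the measured KL exceeds 1.5 times the target, halve it when the KL falls below the target divided by 1.5); the function name `update_kl_coef` is illustrative.

```python
def update_kl_coef(beta, measured_kl, target_kl):
    """Adapt the KL penalty coefficient after one policy update."""
    if measured_kl > 1.5 * target_kl:
        # The policy moved too far: penalize divergence more strongly.
        return beta * 2.0
    if measured_kl < target_kl / 1.5:
        # The policy barely moved: relax the penalty.
        return beta / 2.0
    # Within the tolerance band: leave the coefficient unchanged.
    return beta
```

For example, with a target of 0.01, a measured KL of 0.03 doubles the coefficient and a measured KL of 0.001 halves it.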
Go deeper: PPO is on-policy, but it reuses data — how?
Pure on-policy methods (like A2C) use each sample exactly once. PPO bends this: it takes multiple epochs over the same batch, which is technically off-policy after the first gradient step because the policy being optimized is no longer the one that generated the data. The clip is precisely what licenses this: by refusing to credit updates that move $r_t(\theta)$ far from 1, it keeps the optimization inside the region where the old samples remain a reasonable approximation. This is why PPO is far more sample-efficient than vanilla PG while staying stable. Push the epoch count too high, though, and the policy outgrows its data, a common cause of instability.
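The data-reuse pattern itself is just a shuffled minibatch loop repeated for a few epochs. The generator below is a minimal sketch of that schedule using only the standard library; the name `minibatch_indices` and the fixed `seed` are illustrative, and a real trainer would use each yielded index list to slice its stored rollout tensors and take one gradient step.

```python
import random


def minibatch_indices(batch_size, minibatch_size, epochs, seed=0):
    """Yield index lists: `epochs` shuffled passes over one rollout batch."""
    rng = random.Random(seed)
    indices = list(range(batch_size))
    for _ in range(epochs):
        # Reshuffle every epoch so minibatches differ between passes.
        rng.shuffle(indices)
        for start in range(0, batch_size, minibatch_size):
            yield indices[start:start + minibatch_size]


# Example: a batch of 8 samples, minibatches of 4, reused for 3 epochs.
schedule = list(minibatch_indices(batch_size=8, minibatch_size=4, epochs=3))
print(len(schedule), "gradient steps on one batch of data")
```

Every sample is visited once per epoch, so the number of gradient steps per batch grows linearly with the epoch count, which is the quantity to watch when reuse starts to destabilize training.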
Hyperparameters and how to tune them
PPO is famously sensitive to hyperparameters and implementation choices. Typical starting points (continuous control / games):
| Hyperparameter | Typical value | What it controls |
|---|---|---|
| Clip $\epsilon$ | 0.1 – 0.3 (default 0.2) | Trust-region radius; the single biggest lever |
| Learning rate | 1e-4 – 3e-4 (often annealed) | Step size; anneal to 0 over training for stability |
| Discount $\gamma$ | 0.99 | How far future rewards count |
| GAE $\lambda$ | 0.95 | Bias–variance of advantage estimates |
| Epochs per batch | 3 – 10 | Sample reuse; too high destabilizes |
| Minibatch size | 32 – 4096 | SGD granularity |
| Rollout / batch length | 2048 × N actors | Data per update |
| Entropy coeff | 0.0 – 0.01 | Exploration; raise if policy collapses early |
| Value coeff | 0.5 | Weight of the critic loss |
| Max grad norm | 0.5 | Gradient clipping for stability |
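One way to keep these settings together is a small config object. The sketch below is a hypothetical `PPOConfig` dataclass, not the configuration class of any particular library; its defaults are picked from the ranges in the table, and the two derived properties show how rollout length, actor count, minibatch size, and epoch count interact.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PPOConfig:
    """Illustrative starting point; every default is a value from the table."""
    clip_eps: float = 0.2
    learning_rate: float = 3e-4
    gamma: float = 0.99
    gae_lambda: float = 0.95
    epochs: int = 10
    minibatch_size: int = 64
    rollout_length: int = 2048
    n_actors: int = 1
    entropy_coef: float = 0.0
    value_coef: float = 0.5
    max_grad_norm: float = 0.5

    @property
    def batch_size(self) -> int:
        # Total transitions collected per iteration across all actors.
        return self.rollout_length * self.n_actors

    @property
    def updates_per_iteration(self) -> int:
        # Gradient steps taken on each batch before it is discarded.
        return self.epochs * (self.batch_size // self.minibatch_size)


config = PPOConfig()
print(config.batch_size, config.updates_per_iteration)
```

Seeing the number of gradient steps per batch as a derived quantity makes it easier to notice when a change to minibatch size or epoch count quietly multiplies how hard each batch is reused.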
Implementation details that actually matter
PPO’s reputation for being “simple” hides a dirty secret: a faithful reading of the paper often fails to reproduce its results. The gap is closed by a pile of unglamorous engineering tricks. Huang et al.’s “37 Implementation Details of PPO” is the definitive catalogue. The ones that matter most:
- Advantage normalization — normalize advantages to zero mean and unit variance per minibatch. PPO is very sensitive to advantage scale; skipping this is a frequent cause of failure.
- Value-function clipping: clip the change in the value prediction to within $\pm\epsilon$ of the prediction made at rollout time, mirroring the policy clip.
- Orthogonal initialization: initialize hidden layers orthogonally with gain $\sqrt{2}$; the policy output head with a tiny scale (~0.01) so early actions are near-uniform, the value head with scale 1.
- Learning-rate annealing — decay the LR linearly to zero over training.
- Gradient clipping — clip global grad norm (commonly 0.5).
- Reward / observation scaling — normalize observations and scale rewards; raw magnitudes wreck the value loss.
- A separate, frozen log-prob from rollout time: store $\log \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ at collection time rather than recomputing it later.
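Three of the details listed above are small enough to sketch in plain Python: advantage normalization, linear learning-rate annealing, and the clipped value loss. The function names are illustrative, the value-loss form follows the common "max of clipped and unclipped squared error" formulation, and a real implementation would apply these to tensors inside the training loop.

```python
import math


def normalize_advantages(advantages, eps=1e-8):
    """Shift and scale one minibatch of advantages to zero mean, unit variance."""
    n = len(advantages)
    mean = sum(advantages) / n
    std = math.sqrt(sum((a - mean) ** 2 for a in advantages) / n)
    return [(a - mean) / (std + eps) for a in advantages]


def linear_lr(initial_lr, update_idx, total_updates):
    """Learning rate decayed linearly from initial_lr toward zero."""
    return initial_lr * (1.0 - update_idx / total_updates)


def clipped_value_loss(new_value, old_value, target_return, eps=0.2):
    """Per-sample value loss with the prediction change clipped to +/- eps."""
    change = max(-eps, min(new_value - old_value, eps))
    clipped_value = old_value + change
    # Pessimistic choice: take the larger of the two squared errors.
    return max((new_value - target_return) ** 2,
               (clipped_value - target_return) ** 2)
```

Normalizing per minibatch keeps the scale of the policy gradient roughly constant regardless of reward magnitude, which is why skipping it tends to show up as silent underperformance rather than an outright crash.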
PPO in RLHF and LLM training
PPO is the algorithm that turned base language models into assistants. In the classic RLHF pipeline (InstructGPT, ChatGPT), the language model is the policy, and PPO optimizes it against a learned reward model — with a KL leash to keep it sane.
The KL penalty and the four-model setup
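The objective PPO maximizes for an LLM is the reward-model score minus a per-token KL penalty to a frozen reference model (the SFT checkpoint):
$$\max_\theta\ \mathbb{E}_{x,\ y \sim \pi_\theta}\left[r_\phi(x, y) - \beta \sum_t \log \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\text{ref}}(y_t \mid x, y_{<t})}\right]$$
Here $r_\phi$ is the reward model, $\pi_{\text{ref}}$ is the frozen reference model, and $\beta$ is the KL coefficient.
This is why production PPO-RLHF juggles four models at once, a real systems burden at LLM scale:
- Policy (actor): the LLM being trained.
- Reference model: a frozen copy of the SFT checkpoint, used for the KL penalty.
- Reward model: scores completed responses.
- Value model (critic): predicts per-token returns for advantage estimation.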
This memory and complexity cost — four large models, plus PPO’s tuning sensitivity — is exactly what motivated the simpler alternatives that followed. DPO skips the reward model and RL loop entirely; GRPO keeps the RL loop but throws away the critic.
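The reward signal described in this section can be sketched as a per-token computation: each token is charged a KL penalty based on the log-probability gap between the policy and the reference model, and the reward-model score is added on the final token. The function name `kl_shaped_rewards` and the default `beta=0.05` are assumptions for illustration, not values from any specific pipeline.

```python
def kl_shaped_rewards(reward_score, policy_logps, ref_logps, beta=0.05):
    """Per-token rewards for one completion in PPO-style RLHF.

    policy_logps / ref_logps: log-probabilities of each generated token
    under the trained policy and under the frozen reference model.
    """
    # Per-token penalty: -beta * (log pi_policy - log pi_ref).
    rewards = [-beta * (lp - lr) for lp, lr in zip(policy_logps, ref_logps)]
    # The scalar reward-model score arrives only at the end of the sequence.
    rewards[-1] += reward_score
    return rewards
```

These per-token rewards are then what GAE and the critic consume, which is where the fourth model in the setup earns its keep.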
PPO vs GRPO and the reasoning-era variants
GRPO (Group Relative Policy Optimization), introduced in DeepSeekMath (2024) and central to DeepSeek-R1, is “PPO without the critic.” Instead of a learned value network, it samples a group of completions per prompt and uses the group’s mean reward as the baseline — the advantage of each completion is just how it scored relative to its siblings. For long reasoning chains, where the reward arrives only at the very end and a per-token critic is nearly impossible to train, this is both cheaper (no value network, ~50% less memory) and often more effective. GRPO is now the most common optimizer for RLVR (verifiable-reward training of reasoning models), while PPO remains the more controllable default when you can afford the critic.
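The group-relative baseline is simple to sketch. Given the rewards of all completions sampled for one prompt, each completion's advantage is its reward minus the group mean; the DeepSeekMath formulation also divides by the group's standard deviation, which this sketch includes. The function name and the small `eps` guard against a zero standard deviation are illustrative choices.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Advantages for one prompt's group of sampled completions."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    # Each completion is scored relative to its siblings; no critic is needed.
    return [(r - mean) / (std + eps) for r in rewards]


# Example: four completions for one prompt, two of which passed a verifier.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Note that when every completion in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal.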
| | TRPO | PPO | GRPO |
|---|---|---|---|
| Stay-close mechanism | Hard KL trust region | Clipped ratio (+ optional KL) | Clipped ratio + KL to reference |
| Optimization order | Second-order | First-order | First-order |
| Critic / value network | Yes | Yes | No (group baseline) |
| Baseline for advantage | Learned value | Learned value (GAE) | Group-relative mean reward |
| Main use today | Largely superseded | Robotics, games, classic RLHF | LLM reasoning / RLVR |
| Drops vs PPO | — | (the parent) | The value network |
2025 saw continued PPO evolution for LLMs, e.g. Truncated PPO for more efficient long-generation training, and sequence-level variants — but the clipped-ratio core remains intact across all of them.
Strengths, limitations, and common failure modes
Strengths: stable and forgiving relative to vanilla PG; sample-efficient via batch reuse; first-order and easy to parallelize; works across discrete and continuous actions; one well-understood recipe spans games, robotics, and LLMs.
Limitations: many interacting hyperparameters; on-policy, so it discards old data and is less sample-efficient than off-policy methods like SAC; results hinge on implementation details; at LLM scale the four-model setup is heavy.
Common failure modes and how to debug them:
- KL collapse / policy drift. The policy moves too far and quality craters. Symptoms: KL to the old (or reference) policy spikes. Fix: tighten $\epsilon$, lower the LR, reduce epochs; in RLHF, raise the KL coefficient $\beta$.
- Value-loss blow-ups. The critic loss explodes and corrupts advantages. Fix: value clipping, reward/return normalization, a separate or lower value LR, or a lower value coefficient.
- Advantage-scale sensitivity. Forgetting to normalize advantages, or normalizing inconsistently, is a top cause of silent underperformance.
- Entropy collapse. The policy becomes deterministic too early and stops exploring. Fix: increase the entropy coefficient.
- Reward hacking. In RLHF especially, the policy games the proxy reward model. The KL penalty mitigates but does not eliminate it — watch the gold reward, not just the proxy.
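Two cheap diagnostics help catch policy drift early: an approximate KL between the old and new policy, and the fraction of samples whose ratio falls outside the clip range. The sketch below computes both from stored and recomputed log-probabilities. The KL estimator used here, the mean of `(ratio - 1) - log(ratio)`, is one common low-variance choice and is an assumption of this sketch, as is the function name `ppo_diagnostics`.

```python
import math


def ppo_diagnostics(new_logp, old_logp, eps=0.2):
    """Return (approximate KL, clip fraction) for one minibatch."""
    log_ratios = [n - o for n, o in zip(new_logp, old_logp)]
    ratios = [math.exp(lr) for lr in log_ratios]
    # Non-negative sample estimate of KL(old || new).
    approx_kl = sum((r - 1.0) - lr for r, lr in zip(ratios, log_ratios)) / len(ratios)
    # Share of samples whose ratio has left the interval [1 - eps, 1 + eps].
    clip_fraction = sum(abs(r - 1.0) > eps for r in ratios) / len(ratios)
    return approx_kl, clip_fraction
```

A KL estimate that climbs steadily across epochs, or a clip fraction that covers most of the batch, suggests the batch is being reused too aggressively or the learning rate is too high.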
Code and libraries
You should almost never implement PPO from scratch for production — use a vetted library:
| Library | Best for | Notes |
|---|---|---|
| Stable-Baselines3 | Classic RL (Gym/Gymnasium) | Batteries-included, well-tested PPO |
| CleanRL | Learning & research | Single-file, readable; mirrors the “37 details” |
| TRL | LLM RLHF | PPO, DPO, GRPO for transformers |
| TorchRL | Custom RL pipelines | Modular PyTorch components |
| OpenRLHF / veRL | Large-scale LLM RL | Distributed PPO/GRPO at scale |
For the canonical math-plus-pseudocode reference, OpenAI Spinning Up’s PPO page is still the best single document. Building the environments, reward pipelines, and infrastructure that PPO trains against is its own industry — see the RL environment and RLHF tooling companies.
Researcher takes
ML researcher Cameron Wolfe breaks down why PPO became the workhorse RL algorithm behind RLHF and LLM alignment.
Nathan Lambert points to a structural failure mode in PPO’s core clipping mechanism when applied to reasoning models.
Frequently asked questions
What does “proximal” mean in PPO?
It refers to keeping the updated policy proximal — close — to the previous one. The clip on the probability ratio is the mechanism that enforces this proximity, so each update is a small, safe step rather than a destabilizing leap.
What is a good value for the clip epsilon?
The paper’s default of $\epsilon = 0.2$ is a solid starting point for games and continuous control. Smaller values (0.1) are more conservative and stable; larger (0.3) learn faster but risk instability. LLM RLHF often uses tighter clips plus an explicit KL penalty to a reference model.
Is PPO on-policy or off-policy?
PPO is on-policy: each iteration optimizes against data freshly collected by the current policy and then discards it. It bends this slightly by taking several epochs over each batch — the clip is what keeps those reused updates valid. This makes it less sample-efficient than off-policy methods like SAC, but more stable and simpler.
PPO vs GRPO — which should I use for an LLM?
For classic preference-based RLHF with a learned reward model, PPO is the controllable, well-understood default. For training reasoning models against verifiable rewards (RLVR) — math, code, long chains-of-thought — GRPO is usually preferred: it drops the hard-to-train critic, halves memory, and uses a group-relative baseline that suits sparse end-of-sequence rewards.
Key papers
- Proximal Policy Optimization Algorithms — Schulman et al., 2017 — the original PPO paper (clipped objective, PPO-Penalty/PPO-Clip).
- Trust Region Policy Optimization — Schulman et al., 2015 — PPO’s predecessor with the hard KL trust region.
- High-Dimensional Continuous Control Using Generalized Advantage Estimation — Schulman et al., 2015/16 — the advantage estimator PPO relies on.
- The 37 Implementation Details of PPO — Huang et al., 2022 — the tricks that make PPO reproduce.
- DeepSeekMath / GRPO — Shao et al., 2024 — the critic-free PPO variant powering reasoning models.
- Truncated Proximal Policy Optimization — 2025 — an efficient PPO variant for large-scale LLM RL.
Related
What is reinforcement learning? · RLHF · GRPO · DPO & preference optimization · Reward models · RLVR · RL for reasoning