- Policy gradient methods optimize a parameterized policy directly by gradient ascent on expected return — no value function or argmax required.
- The policy gradient theorem turns 'gradient of an expectation' into 'expectation of a gradient' via the log-derivative trick, so you can estimate it from sampled trajectories.
- REINFORCE (Williams, 1992) is the simplest instance: scale each action's log-probability gradient by the return that followed it.
- Its weakness is high variance — fixed with baselines, reward-to-go and advantages, which lead straight to actor-critic, TRPO, PPO and GRPO.
What are policy gradient methods?
Policy gradient methods learn the policy directly. Instead of estimating how good each state or action is and then acting greedily (the value-based recipe behind Q-learning and DQN), they parameterize the policy itself as $\pi_\theta(a \mid s)$ (typically a neural network) and adjust $\theta$ with gradient ascent to make high-return behavior more probable.
The core trick is disarmingly simple: run the policy, see what happened, then increase the probability of the actions that led to good outcomes and decrease the probability of the ones that led to bad outcomes — weighted by how good or bad. REINFORCE is the canonical algorithm that implements exactly this idea.
Why optimize the policy directly?
A policy parameterization gives you three things value-based methods struggle with:
- Continuous action spaces. Output the parameters of a distribution (e.g. a Gaussian's mean and variance) instead of one Q-value per discrete action. No argmax over an infinite set. Essential for robotics and control.
- Stochastic policies. Some problems have no optimal deterministic policy (partially observed states, games like rock-paper-scissors). A policy network outputs a probability distribution, so randomness is first-class, and it doubles as built-in exploration.
- Smooth updates. Small parameter changes produce small policy changes. Value-based methods can flip the greedy action discontinuously, causing oscillation; policy gradients move gently along a gradient.
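To make the continuous-action case concrete, here is a minimal sketch of a one-dimensional Gaussian policy. It assumes the mean and standard deviation have already been produced by some policy network; the function names and the numbers are illustrative only.

```python
import math
import random


def gaussian_log_prob(action, mean, std):
    """Log-density of a 1-D Gaussian policy evaluated at `action`."""
    var = std * std
    return -0.5 * ((action - mean) ** 2 / var + math.log(2.0 * math.pi * var))


def gaussian_score_mean(action, mean, std):
    """Derivative of log pi(action) with respect to the mean."""
    return (action - mean) / (std * std)


rng = random.Random(0)
mean, std = 0.5, 0.2            # assumed network outputs, chosen for illustration
action = rng.gauss(mean, std)   # sample an action; no argmax is needed
print(action, gaussian_log_prob(action, mean, std), gaussian_score_mean(action, mean, std))
```

The score with respect to the mean is positive when the sampled action lies above the mean, so a good outcome pulls the mean toward that action.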
The trade-off: policy gradients are on-policy (data must come from the current policy, so it’s discarded after each update) and high-variance Monte-Carlo estimators. Most of the field’s progress since REINFORCE is about taming that variance.
The objective and the policy gradient theorem
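We want parameters $\theta$ that maximize expected return. Let a trajectory be $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots)$ with return $R(\tau) = \sum_{t} \gamma^{t} r_t$. The objective is

$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\left[R(\tau)\right] = \int p_\theta(\tau)\, R(\tau)\, d\tau.$$

The problem: $\theta$ appears inside the distribution we're averaging over, so we can't just differentiate a sum of fixed terms. The log-derivative trick rescues us. Because $\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)$,

$$\nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau)\, R(\tau)\, d\tau = \mathbb{E}_{\tau \sim p_\theta}\left[\nabla_\theta \log p_\theta(\tau)\, R(\tau)\right].$$

This is the conceptual heart of the whole field: it converts the gradient of an expectation (intractable) into an expectation of a gradient (estimate it by sampling). And the environment dynamics drop out: the trajectory probability factorizes as $p_\theta(\tau) = \rho_0(s_0) \prod_{t} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t)$, but only the $\pi_\theta$ terms depend on $\theta$. Taking the log turns the product into a sum and the transition terms vanish:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right].$$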
This is the policy gradient theorem (Sutton et al., 2000), and you never need a model of the environment to use it.
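The claim that sampling recovers the true gradient can be checked numerically in the simplest setting. The sketch below assumes a two-armed bandit with fixed rewards and a softmax policy over two logits (all of these are illustrative choices, not part of the derivation above) and compares the exact gradient with the score-function estimate.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def exact_gradient(logits, rewards):
    """dJ/dlogit_k = pi_k * (r_k - J) for a softmax bandit policy."""
    probs = softmax(logits)
    expected_reward = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - expected_reward) for p, r in zip(probs, rewards)]


def score_function_estimate(logits, rewards, n_samples, rng):
    """Monte-Carlo average of grad log pi(a) * r(a) over sampled actions."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(n_samples):
        a = rng.choices(range(len(probs)), weights=probs)[0]
        for k in range(len(logits)):
            score_k = (1.0 if k == a else 0.0) - probs[k]
            grad[k] += score_k * rewards[a] / n_samples
    return grad


rng = random.Random(0)
logits, rewards = [0.0, 0.0], [1.0, 0.0]   # hypothetical bandit
print(exact_gradient(logits, rewards))
print(score_function_estimate(logits, rewards, 20000, rng))
```

Note that the estimator only ever calls the reward function on sampled actions; it never differentiates through it, which is the model-free property in miniature.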
REINFORCE, step by step
REINFORCE (Williams, 1992 — the name is a backronym for “REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility”) is the most direct Monte-Carlo implementation of the policy gradient theorem.
Sample a full trajectory by acting with the current policy until the episode ends.
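For every timestep $t$, compute the reward-to-go $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$, the discounted return from that step onward. (Using $G_t$ rather than the whole-episode return $R(\tau)$ is the standard, lower-variance form: an action can only influence rewards that come after it.)
For each step, form $\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t$, the score function scaled by how good the rest of the episode turned out.
Ascend the gradient: $\theta \leftarrow \theta + \alpha \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t$, where $\alpha$ is the learning rate.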
Then throw the data away and repeat — REINFORCE is strictly on-policy.
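The steps above can be sketched in a few dozen lines of plain Python. This is a minimal illustration, not a reference implementation: it assumes a tabular softmax policy stored as a list of logits per state, and a made-up one-state task (`run_episode`) in which action 1 pays a reward of 1 and action 0 pays nothing.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rewards_to_go(rewards, gamma):
    """G_t = sum over k >= t of gamma^(k-t) * r_k, via one backward pass."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def reinforce_update(theta, episode, gamma, lr):
    """One REINFORCE step. theta[s][a] are logits; episode holds (s, a, r)."""
    returns = rewards_to_go([r for _, _, r in episode], gamma)
    grad = [[0.0] * len(row) for row in theta]
    for (s, a, _), g in zip(episode, returns):
        probs = softmax(theta[s])
        for k in range(len(probs)):
            # Softmax score: d log pi(a|s) / d theta[s][k] = 1[k == a] - pi(k|s)
            grad[s][k] += ((1.0 if k == a else 0.0) - probs[k]) * g
    for s in range(len(theta)):
        for k in range(len(theta[s])):
            theta[s][k] += lr * grad[s][k]   # gradient ascent


def run_episode(theta, rng, horizon=5):
    """Hypothetical toy task: a single state, reward equals the action taken."""
    episode = []
    for _ in range(horizon):
        probs = softmax(theta[0])
        a = rng.choices([0, 1], weights=probs)[0]
        episode.append((0, a, float(a)))
    return episode


rng = random.Random(0)
theta = [[0.0, 0.0]]
for _ in range(300):
    # Each episode is sampled from the current policy, used once, then discarded.
    reinforce_update(theta, run_episode(theta, rng), gamma=0.99, lr=0.05)
print(softmax(theta[0]))
```

The learning rate, horizon and episode count are arbitrary; on this toy task the probability of the rewarded action should drift toward 1 as training proceeds.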
Go deeper: the score function and why it’s unbiased
The term $\nabla_\theta \log \pi_\theta(a \mid s)$ is called the score function. A key identity makes everything work: the expected score under the policy is zero, $\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\left[\nabla_\theta \log \pi_\theta(a \mid s)\right] = 0$. That is precisely why you can subtract a state-dependent baseline (next section) without biasing the estimator. REINFORCE itself is unbiased for a separate reason: even with a single sampled return, its update is a Monte-Carlo sample of the expectation in the policy gradient theorem. The price of unbiasedness is variance: one noisy episode stands in for an expectation over all possible trajectories.
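The zero-mean property of the score can be verified exactly for a small discrete policy. The sketch below assumes a softmax policy over three arbitrary logits and sums the score over all actions, weighted by their probabilities.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_score(logits):
    """E_a[grad log pi(a)] for a softmax policy, one entry per logit."""
    probs = softmax(logits)
    n = len(logits)
    # The score of action a with respect to logit k is 1[a == k] - probs[k].
    return [
        sum(probs[a] * ((1.0 if a == k else 0.0) - probs[k]) for a in range(n))
        for k in range(n)
    ]


print(expected_score([2.0, -1.0, 0.5]))   # every entry is zero up to rounding
```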
The variance problem — and baselines
REINFORCE works, but it is notoriously high-variance. Returns swing wildly from episode to episode, so the gradient estimate is noisy and learning is slow and unstable. The single most important fix is the baseline.
Subtract a function $b(s_t)$ that depends only on the state (not the action) from the return:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \bigl(G_t - b(s_t)\bigr)\right].$$

Because the expected score is zero, subtracting any state-dependent $b(s_t)$ leaves the gradient unbiased, and a well-chosen one shrinks its variance. The natural choice is the state-value function $V^\pi(s_t)$, the average return you'd expect from that state. Then the weight becomes $G_t - V^\pi(s_t)$: an estimate of the advantage $A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$, i.e. "was this action better or worse than typical for this state?"
When you learn $V_\phi(s)$ with its own network and use it as the baseline, you've built an actor-critic method: the actor is the policy, the critic is the value estimate. That step, replacing the raw Monte-Carlo return with a learned, bootstrapped advantage, is the bridge from REINFORCE to every modern policy-gradient algorithm.
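A tiny exact calculation shows what a baseline buys. The sketch assumes a two-armed bandit with rewards 10 and 11 and a uniform policy (illustrative numbers), and enumerates both actions to get the exact mean and variance of the single-sample gradient estimate for the first logit.

```python
def estimator_stats(probs, rewards, baseline):
    """Exact mean and variance of (1[a == 0] - pi_0) * (r_a - baseline)."""
    samples = [
        ((1.0 if a == 0 else 0.0) - probs[0]) * (rewards[a] - baseline)
        for a in range(len(probs))
    ]
    mean = sum(p * x for p, x in zip(probs, samples))
    var = sum(p * (x - mean) ** 2 for p, x in zip(probs, samples))
    return mean, var


probs, rewards = [0.5, 0.5], [10.0, 11.0]
value = sum(p * r for p, r in zip(probs, rewards))   # expected reward, 10.5

print(estimator_stats(probs, rewards, baseline=0.0))    # (-0.25, 27.5625)
print(estimator_stats(probs, rewards, baseline=value))  # (-0.25, 0.0)
```

The mean is identical in both cases, so the baseline adds no bias, while the variance drops from about 27.6 to zero. Zero variance is an artifact of this symmetric two-action example; in general the reduction is partial.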
Go deeper: the optimal baseline and other variance fighters
$V^\pi(s)$ is convenient but not variance-optimal. The minimum-variance baseline is a gradient-magnitude-weighted average of returns, $b^{*} = \mathbb{E}\left[\lVert \nabla_\theta \log \pi_\theta(a_t \mid s_t) \rVert^2\, G_t\right] / \mathbb{E}\left[\lVert \nabla_\theta \log \pi_\theta(a_t \mid s_t) \rVert^2\right]$, derived by Greensmith, Bartlett & Baxter (2004). In practice $V^\pi(s)$ captures most of the benefit and is what people use. Other variance-reduction tools stack on top: reward-to-go (causality), generalized advantage estimation (GAE), which trades bias for variance via a $\lambda$ knob, discounting as an implicit variance control, and large batch sizes to average the noise. PPO and TRPO add trust regions on top to bound how far each noisy step can move the policy.
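Of the tools listed here, GAE is the easiest to show in code. The sketch below follows the usual recursion over TD residuals; the function name and the convention that `values` carries one extra bootstrap entry at the end are assumptions made for this example.

```python
def gae_advantages(rewards, values, gamma, lam):
    """Generalized advantage estimates for one trajectory.

    `values` must have len(rewards) + 1 entries; the last one is the value of
    the state after the final step (use 0.0 if the episode terminated).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD residual: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# lam = 1 with a zero critic recovers the Monte-Carlo reward-to-go
print(gae_advantages([1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], gamma=0.5, lam=1.0))
# lam = 0 keeps only the one-step TD residuals
print(gae_advantages([1.0, 1.0], [0.5, 0.5, 0.0], gamma=1.0, lam=0.0))
```

Values of `lam` between 0 and 1 interpolate between these two extremes: closer to 1 means less bias from the critic but more variance from the sampled rewards.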
From REINFORCE to the modern family
REINFORCE is the root of a large, still-growing tree. Each descendant attacks one of its weaknesses — variance, sample inefficiency, or instability from too-large updates.
| Method | What it adds over REINFORCE | Why it matters |
|---|---|---|
| REINFORCE + baseline | Subtract a state-dependent baseline $b(s_t)$ from the return | First big variance cut, still unbiased |
| Actor-critic | Learned, bootstrapped critic for the advantage | Lower variance, can update per-step (online) |
| A2C / A3C | Synchronous/parallel actors + entropy bonus | Stable, scalable deep RL |
| TRPO | Trust region constraining KL between updates | Prevents destructive policy jumps |
| PPO | Clipped surrogate objective (cheap trust region) | The workhorse — robotics, games, and RLHF |
| GRPO | Group-relative advantage; drops the value network | Cheap RL for LLM reasoning (DeepSeek) |
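The last row of the table is simple enough to sketch directly. The example assumes the commonly described form of the group-relative advantage, where each sampled response's reward is standardized against the other responses to the same prompt; the function name, the use of the population standard deviation and the `eps` guard are choices made for this illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every sample earned the same reward
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of four responses scored 1 (correct) or 0 (incorrect)
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

The group mean plays the role of the baseline, which is why no separate value network is needed.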
Worked intuition: REINFORCE on CartPole
Concretely, on the classic CartPole task the policy network maps the 4-D state (cart position, velocity, pole angle, angular velocity) to a probability over two actions (push left / push right).
Each episode: act stochastically until the pole falls, record $(s_t, a_t, r_t)$, compute reward-to-go $G_t$, subtract a baseline, and ascend $\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,(G_t - b)$. Episodes that balanced longer produce larger $G_t$, so their actions get reinforced harder, and because the policy is stochastic, exploration is automatic. Plain REINFORCE solves CartPole but with visibly noisy learning curves; adding a value baseline smooths them dramatically. To try variants yourself, environment libraries like Gymnasium ship CartPole as the standard first benchmark.
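As a rough picture of the policy side of this setup, the sketch below assumes a linear softmax policy (one weight row per action) in place of a neural network, and a made-up 4-D observation rather than one from a real rollout. It shows the two pieces the update needs: action probabilities and the gradient of the log-probability.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def action_probs(weights, state):
    """Linear softmax policy: logit_k is the dot product of weights[k] and state."""
    logits = [sum(w * x for w, x in zip(row, state)) for row in weights]
    return softmax(logits)


def log_prob_gradient(weights, state, action):
    """Gradient of log pi(action | state) with respect to every weight."""
    probs = action_probs(weights, state)
    return [
        [((1.0 if k == action else 0.0) - probs[k]) * x for x in state]
        for k in range(len(weights))
    ]


rng = random.Random(0)
weights = [[0.0] * 4, [0.0] * 4]     # 2 actions (left, right) x 4 state features
state = [0.01, -0.02, 0.03, 0.04]    # illustrative observation only
probs = action_probs(weights, state)
action = rng.choices([0, 1], weights=probs)[0]
grad = log_prob_gradient(weights, state, action)
print(probs, action, grad)
```

In a full training loop each such gradient would be scaled by $G_t - b$ for its timestep and summed over the episode before the weights are updated.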
When to use policy gradients
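Reach for policy gradients when actions are continuous or the best policy is genuinely stochastic. Prefer value-based methods when actions are discrete and few, sample efficiency matters, and off-policy replay helps: DQN and its descendants reuse old data; pure policy gradients can't. Many real systems blend both via actor-critic.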
Building the environments, reward pipelines and infrastructure that make large-scale policy-gradient training practical is its own industry — see the companies building RL environments.
Researcher takes
Lambert argues that the algorithm everyone calls REINFORCE is simply the vanilla policy gradient. He traces the name back to Williams (1992) and places it in the lineage that RLHF practitioners later rediscovered.
Frequently asked questions
What’s the difference between REINFORCE and policy gradient methods?
“Policy gradient methods” is the whole family of algorithms that optimize a parameterized policy by gradient ascent on expected return. REINFORCE is the simplest concrete member — a Monte-Carlo estimator that scales each action’s log-probability gradient by the realized return. Actor-critic, TRPO, PPO and GRPO are all policy gradient methods that improve on REINFORCE.
Why is REINFORCE so high-variance?
It estimates an expectation over all possible trajectories using a single sampled episode, and the return that weights each action sums up many random rewards and transitions. Two runs of the same policy can produce very different returns, so the gradient estimate jumps around. Baselines, reward-to-go, learned critics (actor-critic) and large batches all reduce this variance.
Does subtracting a baseline bias the gradient?
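No, as long as the baseline depends only on the state, not the action. The expected score function is zero, so $\mathbb{E}_{a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\right] = b(s)\, \mathbb{E}_{a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\right] = 0$ and the baseline cancels in expectation while still reducing variance. An action-dependent baseline cannot be pulled out of the expectation this way, so in general it introduces bias unless a correction term is added.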
Is PPO just REINFORCE with extra steps?
Essentially yes, with two crucial additions: a learned critic to compute low-variance advantages, and a clipped surrogate objective that stops any single update from moving the policy too far. The core “log-prob gradient times advantage” update is pure policy gradient. See PPO.
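The clipping idea is easiest to see per sample. In the sketch below, `ratio` stands for the new policy's probability of an action divided by the old policy's, and 0.2 is a commonly used clip range; the function name and the scalar form are simplifications for illustration.

```python
def ppo_clip_term(ratio, advantage, clip_eps=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action, policy already moved far toward it: the gain is capped
print(ppo_clip_term(1.5, 1.0))
# Bad action, policy moved toward it anyway: the full penalty is kept
print(ppo_clip_term(1.5, -1.0))
```

Taking the minimum makes the objective pessimistic: the policy gets no extra credit for moving beyond the clip range in a helpful direction, but it is still fully penalized for moving in a harmful one.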
Key papers
- Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning (Williams, 1992): the original REINFORCE paper.
- Policy Gradient Methods for Reinforcement Learning with Function Approximation (Sutton, McAllester, Singh & Mansour, 2000): the policy gradient theorem.
- Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning (Greensmith, Bartlett & Baxter, 2004): baselines and variance bounds.
- Trust Region Policy Optimization (Schulman et al., 2015): trust regions.
- High-Dimensional Continuous Control Using Generalized Advantage Estimation (Schulman et al., 2015): generalized advantage estimation.
- Proximal Policy Optimization Algorithms (Schulman et al., 2017): the modern default.