reinforcement-learning.com

Policy Gradient Methods (REINFORCE)

How policy gradient methods and REINFORCE optimize a policy directly via the policy gradient theorem, the log-derivative trick, baselines, and variance reduction.

Updated 2026-06-07 15 min read
Key takeaways
  • Policy gradient methods optimize a parameterized policy directly by gradient ascent on expected return — no value function or argmax required.
  • The policy gradient theorem turns 'gradient of an expectation' into 'expectation of a gradient' via the log-derivative trick, so you can estimate it from sampled trajectories.
  • REINFORCE (Williams, 1992) is the simplest instance: scale each action's log-probability gradient by the return that followed it.
  • Its weakness is high variance — fixed with baselines, reward-to-go and advantages, which lead straight to actor-critic, TRPO, PPO and GRPO.

What are policy gradient methods?

Policy gradient methods learn the policy directly. Instead of estimating how good each state or action is and then acting greedily (the value-based recipe behind Q-learning and DQN), they parameterize the policy itself as $\pi_\theta(a \mid s)$ (typically a neural network) and adjust $\theta$ with gradient ascent to make high-return behavior more probable.

The core trick is disarmingly simple: run the policy, see what happened, then increase the probability of the actions that led to good outcomes and decrease the probability of the ones that led to bad outcomes — weighted by how good or bad. REINFORCE is the canonical algorithm that implements exactly this idea.

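[Figure: policy $\pi_\theta$ (neural net) → sample trajectory $\tau = (s_0, a_0, \dots, s_T)$ → return $R(\tau) = \sum_t \gamma^t r_t$ → update $\theta \leftarrow \theta + \alpha\, R(\tau)\, \nabla_\theta \log \pi_\theta(a \mid s)$]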
Policy gradient loop: sample whole trajectories from the current policy, compute the return of each, then nudge θ to raise the log-probability of actions weighted by the return that followed them.
▶ RL Course by David Silver — Lecture 7: Policy Gradient Methods (the canonical lecture)

Why optimize the policy directly?

A policy parameterization gives you three things value-based methods struggle with:

Continuous & high-dimensional actions

Output the parameters of a distribution (e.g. a Gaussian’s mean and variance) instead of one Q-value per discrete action. No argmax over an infinite set. Essential for robotics and control.
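To make the continuous-action case concrete, here is a minimal sketch of a one-dimensional Gaussian policy head. The function names and the numbers standing in for the network outputs (`mean`, `std`) are hypothetical and chosen only for illustration; a real policy would produce them from the state.

```python
import math
import random

def gaussian_log_prob(action, mean, std):
    """Log-density of a 1-D Gaussian N(mean, std^2) evaluated at `action`."""
    var = std * std
    return -0.5 * math.log(2.0 * math.pi * var) - (action - mean) ** 2 / (2.0 * var)

def sample_action(mean, std, rng):
    """Draw one continuous action from the policy's Gaussian."""
    return rng.gauss(mean, std)

def score_wrt_mean(action, mean, std):
    """Derivative of the log-density with respect to the mean: (a - mean) / std^2."""
    return (action - mean) / (std * std)

rng = random.Random(0)
mean, std = 0.5, 0.2  # hypothetical policy outputs for a single state
action = sample_action(mean, std, rng)
print("action:", action)
print("log-prob:", gaussian_log_prob(action, mean, std))
print("score w.r.t. mean:", score_wrt_mean(action, mean, std))
```

No argmax appears anywhere: acting is a single sample, and the log-probability is available in closed form for the gradient.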

Stochastic policies by construction

Some problems have no optimal deterministic policy (partially observed states, games like rock-paper-scissors). A policy network outputs a probability distribution, so randomness is first-class — and it doubles as built-in exploration.

Smooth improvement

Small parameter changes produce small policy changes. Value-based methods can flip the greedy action discontinuously, causing oscillation; policy gradients move gently along a gradient.

The trade-off: policy gradients are on-policy (data must come from the current policy, so it is discarded after each update), and their Monte-Carlo gradient estimates are high-variance. Most of the field’s progress since REINFORCE is about taming that variance.

The objective and the policy gradient theorem

We want parameters $\theta$ that maximize expected return. Let a trajectory be $\tau = (s_0, a_0, s_1, a_1, \dots)$ with return $R(\tau) = \sum_{t} \gamma^{t} r_t$. The objective is

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[\,R(\tau)\,\big].$$

The problem: $\theta$ appears inside the distribution we’re averaging over, so we can’t just differentiate a sum of fixed terms. The log-derivative trick rescues us. Because $\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\,\nabla_\theta \log p_\theta(\tau)$,

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[\,R(\tau)\,\nabla_\theta \log p_\theta(\tau)\,\big].$$

This is the conceptual heart of the whole field: it converts the gradient of an expectation (intractable) into an expectation of a gradient (estimate it by sampling). And the environment dynamics drop out: the trajectory probability factorizes as $p_\theta(\tau) = p(s_0)\prod_t P(s_{t+1}\mid s_t,a_t)\,\pi_\theta(a_t\mid s_t)$, but only the $\pi_\theta$ terms depend on $\theta$. Taking the log turns the product into a sum and the transition terms vanish:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\left(\sum_{t=0}^{T}\nabla_\theta \log \pi_\theta(a_t\mid s_t)\right)\!\left(\sum_{t=0}^{T} \gamma^t r_t\right)\right].$$

This is the policy gradient theorem (Sutton et al., 2000), and you never need a model of the environment to use it.
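The log-derivative trick can be checked numerically on the smallest possible problem. The sketch below assumes a one-step "trajectory" (a three-armed bandit) with a softmax policy over logits and a made-up reward per action; `REWARDS` and all function names are illustrative. It compares the analytic gradient of $J(\theta)$ with the sampled average of $R(a)\,\nabla_\theta \log \pi_\theta(a)$.

```python
import math
import random

REWARDS = [1.0, 2.0, 5.0]  # hypothetical fixed reward for each action

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def exact_gradient(theta):
    """Analytic gradient of J(theta) = sum_a pi(a) R(a): component k is pi_k * (R_k - J)."""
    pi = softmax(theta)
    j = sum(p * r for p, r in zip(pi, REWARDS))
    return [p * (r - j) for p, r in zip(pi, REWARDS)]

def score_function_estimate(theta, num_samples, rng):
    """Monte-Carlo average of R(a) * grad log pi(a), with a sampled from pi."""
    pi = softmax(theta)
    grad = [0.0] * len(theta)
    for _ in range(num_samples):
        a = rng.choices(range(len(pi)), weights=pi)[0]
        for k in range(len(theta)):
            # Gradient of log-softmax w.r.t. logit k: indicator(k == a) - pi_k
            score_k = (1.0 if k == a else 0.0) - pi[k]
            grad[k] += REWARDS[a] * score_k
    return [g / num_samples for g in grad]

theta = [0.0, 0.0, 0.0]
exact = exact_gradient(theta)
estimate = score_function_estimate(theta, 50_000, random.Random(0))
print("exact   :", [round(g, 3) for g in exact])
print("sampled :", [round(g, 3) for g in estimate])
```

The sampled estimate only ever evaluates the policy's own log-probabilities; it never differentiates through how rewards are produced, which is the model-free property described above.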

REINFORCE, step by step

REINFORCE (Williams, 1992 — the name is a backronym for “REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility”) is the most direct Monte-Carlo implementation of the policy gradient theorem.

1. Run an episode

Sample a full trajectory $\tau = (s_0, a_0, r_1, \dots, s_T)$ by acting with the current policy $\pi_\theta$ until the episode ends.

2. Compute the return after each step

For every timestep $t$, compute the reward-to-go $G_t = \sum_{k=t}^{T-1}\gamma^{k-t} r_{k+1}$, the discounted return from that step onward (the last reward in the episode is $r_T$). Using $G_t$ rather than the whole-episode return is the standard, lower-variance form: an action can only influence rewards that come after it.

3. Accumulate the gradient

For each step, form $G_t\,\nabla_\theta \log \pi_\theta(a_t\mid s_t)$: the score function scaled by how good the rest of the episode turned out.

4. Update the parameters

Ascend the gradient, summing over the steps at which an action was taken:

$$\theta \leftarrow \theta + \alpha \sum_{t=0}^{T-1} \gamma^t\, G_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t).$$

Then throw the data away and repeat — REINFORCE is strictly on-policy.
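The four steps above can be sketched end to end in plain Python. Everything concrete here is an assumption made for illustration: a hypothetical four-cell corridor (start at cell 0, goal at cell 3, cost of 1 per move, episodes capped at 10 moves), a tabular softmax policy with one logit pair per cell, and untuned hyperparameters.

```python
import math
import random

GOAL, MAX_STEPS = 3, 10   # hypothetical corridor: cells 0..3, capped episode length
LEFT, RIGHT = 0, 1

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def run_episode(theta, rng):
    """Step 1: sample (s_t, a_t, r_{t+1}) tuples with the current policy."""
    state, steps = 0, []
    for _ in range(MAX_STEPS):
        probs = softmax(theta[state])
        action = rng.choices([LEFT, RIGHT], weights=probs)[0]
        next_state = max(state - 1, 0) if action == LEFT else state + 1
        steps.append((state, action, -1.0))   # every move costs 1
        state = next_state
        if state == GOAL:
            break
    return steps

def rewards_to_go(rewards, gamma):
    """Step 2: G_t = r_{t+1} + gamma * G_{t+1}, computed backwards in time."""
    returns, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def reinforce_update(theta, steps, alpha, gamma):
    """Steps 3-4: theta += alpha * sum_t gamma^t * G_t * grad log pi(a_t | s_t)."""
    returns = rewards_to_go([r for _, _, r in steps], gamma)
    grads = [[0.0, 0.0] for _ in theta]
    for t, ((state, action, _), g_t) in enumerate(zip(steps, returns)):
        probs = softmax(theta[state])
        for k in (LEFT, RIGHT):
            score = (1.0 if k == action else 0.0) - probs[k]
            grads[state][k] += (gamma ** t) * g_t * score
    for s in range(len(theta)):
        for k in (LEFT, RIGHT):
            theta[s][k] += alpha * grads[s][k]

def train(episodes=3000, alpha=0.01, gamma=0.99, seed=0):
    rng = random.Random(seed)
    theta = [[0.0, 0.0] for _ in range(GOAL)]   # one logit pair per non-goal cell
    for _ in range(episodes):
        # Each episode is used for exactly one update, then discarded (on-policy).
        reinforce_update(theta, run_episode(theta, rng), alpha, gamma)
    return theta

theta = train()
print("P(right | start) =", round(softmax(theta[0])[RIGHT], 3))
```

The gradient is accumulated with the policy that generated the episode and applied once at the end, so every sample in the update comes from the same $\pi_\theta$.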

Go deeper: the score function and why it’s unbiased

The term $\nabla_\theta \log \pi_\theta(a\mid s)$ is called the score function. A key identity makes baselines work: the expected score under the policy is zero, $\mathbb{E}_{a\sim\pi_\theta}[\nabla_\theta \log \pi_\theta(a\mid s)] = \sum_a \nabla_\theta \pi_\theta(a\mid s) = \nabla_\theta \sum_a \pi_\theta(a\mid s) = \nabla_\theta 1 = 0$. That is precisely why you can subtract a state-dependent baseline (next section) without biasing the estimator. REINFORCE itself is unbiased for a separate reason: each update is a direct Monte-Carlo sample of the expectation in the policy gradient theorem, so even a single sampled return gives an unbiased estimate of the true gradient. The price of unbiasedness is variance: one noisy episode stands in for an expectation over all possible trajectories.
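The zero-mean property of the score is easy to confirm exactly for a softmax policy. This sketch sums over the actions instead of sampling; the logits and the baseline value are arbitrary illustrative numbers.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_score(logits, baseline=1.0):
    """E_{a ~ pi}[ b * grad log pi(a) ], computed exactly by summing over actions."""
    pi = softmax(logits)
    n = len(logits)
    expected = [0.0] * n
    for a in range(n):
        for k in range(n):
            # Gradient of log-softmax w.r.t. logit k: indicator(k == a) - pi_k
            score = (1.0 if k == a else 0.0) - pi[k]
            expected[k] += pi[a] * baseline * score
    return expected

zero_mean = expected_score([0.3, -1.2, 2.0])
with_baseline = expected_score([0.3, -1.2, 2.0], baseline=7.5)
print(zero_mean)
print(with_baseline)
```

Both vectors vanish up to floating-point rounding, whatever constant multiplies the score, which is the cancellation that baselines rely on.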

The variance problem — and baselines

REINFORCE works, but it is notoriously high-variance. Returns swing wildly from episode to episode, so the gradient estimate is noisy and learning is slow and unstable. The single most important fix is the baseline.

Subtract a function $b(s)$ that depends only on the state (not the action) from the return:

$$\nabla_\theta J(\theta) = \mathbb{E}\!\left[\sum_t \nabla_\theta \log \pi_\theta(a_t\mid s_t)\,\big(G_t - b(s_t)\big)\right].$$

Because the expected score is zero, subtracting any state-dependent $b(s)$ leaves the gradient unbiased, and a well-chosen one shrinks its variance. The natural choice is the state-value function $b(s) = V(s)$, the average return you’d expect from that state. Then the weight becomes $G_t - V(s_t)$: an estimate of the advantage $A(s_t,a_t)$, i.e. “was this action better or worse than typical for this state?”

[Figure: raw returns $G_t$ all push “up” (high variance); advantages $G_t - V(s)$ are centered on 0, so above-average actions go up and below-average actions go down]
Without a baseline, all positive returns push action probabilities up (only the relative sizes differ). Subtracting V(s) recenters returns around zero, so genuinely above-average actions go up and below-average actions go down — far less noise.

When you learn $V(s)$ with its own network and use it as the baseline, you’ve built an actor-critic method: the actor is the policy, the critic is the value estimate. That step, replacing the raw Monte-Carlo return with a learned, bootstrapped advantage, is the bridge from REINFORCE to every modern policy-gradient algorithm.
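The effect of recentering can be measured exactly in a toy setting. The sketch below assumes a single state, a uniform softmax policy over three actions, and made-up returns that are all positive; it enumerates the actions to get the exact mean and total variance of the estimator $(R(a) - b)\,\nabla_\theta \log \pi_\theta(a)$ with $b = 0$ and with $b = V$.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def estimator_stats(logits, rewards, baseline):
    """Exact mean vector and summed per-component variance of (R(a) - b) * grad log pi(a)."""
    pi = softmax(logits)
    n = len(logits)
    mean = [0.0] * n
    second_moment = [0.0] * n
    for a in range(n):  # enumerate every action instead of sampling
        for k in range(n):
            score = (1.0 if k == a else 0.0) - pi[k]
            sample = (rewards[a] - baseline) * score
            mean[k] += pi[a] * sample
            second_moment[k] += pi[a] * sample * sample
    variance = sum(m2 - m * m for m2, m in zip(second_moment, mean))
    return mean, variance

logits = [0.0, 0.0, 0.0]
rewards = [10.0, 11.0, 12.0]  # hypothetical returns, all positive
value = sum(p * r for p, r in zip(softmax(logits), rewards))  # V = E[R]

mean_raw, var_raw = estimator_stats(logits, rewards, baseline=0.0)
mean_adv, var_adv = estimator_stats(logits, rewards, baseline=value)
print("mean without baseline:", [round(m, 4) for m in mean_raw])
print("mean with V baseline :", [round(m, 4) for m in mean_adv])
print("variance without / with baseline:", var_raw, var_adv)
```

The two mean vectors agree, so the expected update is unchanged, while the variance drops sharply once the large common offset in the returns is removed.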

Go deeper: the optimal baseline and other variance fighters

$V(s)$ is convenient but not variance-optimal. The minimum-variance baseline is a gradient-magnitude-weighted average of returns, $b^* = \frac{\mathbb{E}[(\nabla_\theta \log \pi_\theta)^2\, G]}{\mathbb{E}[(\nabla_\theta \log \pi_\theta)^2]}$, derived by Greensmith, Bartlett & Baxter (2004). In practice $V(s)$ captures most of the benefit and is what people use. Other variance-reduction tools stack on top: reward-to-go (causality), generalized advantage estimation (GAE), which trades bias for variance via a $\lambda$ knob, discounting as an implicit variance control, and large batch sizes to average the noise. PPO and TRPO add trust regions on top to bound how far each noisy step can move the policy.
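To see that $V(s)$ and $b^*$ really differ, here is a small sketch under strong simplifying assumptions: one state, two actions, a single scalar parameter $\theta$ with $\pi_\theta(a{=}0) = \sigma(\theta)$, and a fixed made-up return per action. The function names are illustrative. In this special case the optimal baseline removes the variance entirely, which will not hold in general.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def policy_terms(theta):
    """Action probabilities and scores d/dtheta log pi(a) for the two actions."""
    p = sigmoid(theta)
    return [p, 1.0 - p], [1.0 - p, -p]

def value_baseline(theta, returns):
    probs, _ = policy_terms(theta)
    return sum(q * g for q, g in zip(probs, returns))

def optimal_baseline(theta, returns):
    """b* = E[score^2 * G] / E[score^2] for a scalar parameter."""
    probs, scores = policy_terms(theta)
    num = sum(q * s * s * g for q, s, g in zip(probs, scores, returns))
    den = sum(q * s * s for q, s in zip(probs, scores))
    return num / den

def estimator_moments(theta, returns, baseline):
    """Exact mean and variance of score * (G - b) under the policy."""
    probs, scores = policy_terms(theta)
    samples = [s * (g - baseline) for s, g in zip(scores, returns)]
    mean = sum(q * x for q, x in zip(probs, samples))
    var = sum(q * x * x for q, x in zip(probs, samples)) - mean * mean
    return mean, var

theta = math.log(4.0)   # gives pi(a=0) = 0.8
returns = [1.0, 3.0]    # hypothetical return for each action

b_v = value_baseline(theta, returns)
b_opt = optimal_baseline(theta, returns)
mean_v, var_v = estimator_moments(theta, returns, b_v)
mean_opt, var_opt = estimator_moments(theta, returns, b_opt)
print("V baseline:", b_v, "variance:", var_v)
print("optimal baseline:", b_opt, "variance:", var_opt)
```

Both baselines give the same expected gradient; only the spread around it changes.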

From REINFORCE to the modern family

REINFORCE is the root of a large, still-growing tree. Each descendant attacks one of its weaknesses — variance, sample inefficiency, or instability from too-large updates.

Method | What it adds over REINFORCE | Why it matters
REINFORCE + baseline | Subtract $V(s)$ | First big variance cut, still unbiased
Actor-critic | Learned, bootstrapped critic for the advantage | Lower variance, can update per-step (online)
A2C / A3C | Synchronous (A2C) or asynchronous (A3C) parallel actors + entropy bonus | Stable, scalable deep RL
TRPO | Trust region constraining KL between updates | Prevents destructive policy jumps
PPO | Clipped surrogate objective (cheap trust region) | The workhorse: robotics, games, and RLHF
GRPO | Group-relative advantage; drops the value network | Cheap RL for LLM reasoning (DeepSeek)
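As an illustration of the last row, a group-relative advantage can be sketched in a few lines. This assumes the commonly described recipe of standardizing each reward against the other samples drawn for the same prompt (subtract the group mean, divide by the group standard deviation); the function name, the `eps` guard and the example rewards are all hypothetical.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples; no value network needed."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps is an assumed guard against a zero standard deviation
    return [(r - mean) / (std + eps) for r in rewards]

group_rewards = [1.0, 0.0, 0.0, 1.0]  # hypothetical scores for four sampled answers
advantages = group_relative_advantages(group_rewards)
print(advantages)
```

The group mean plays the role that $V(s)$ plays in the baseline discussion: it recenters the rewards so that better-than-average samples are pushed up and worse-than-average ones are pushed down.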
▶ L3 Policy Gradients and Advantage Estimation — Pieter Abbeel (Foundations of Deep RL)

Worked intuition: REINFORCE on CartPole

Concretely, on the classic CartPole task the policy network maps the 4-D state (cart position, velocity, pole angle, angular velocity) to a probability over two actions (push left / push right).

  • 2: discrete actions the policy outputs probabilities over
  • +1: reward per timestep the pole stays upright
  • 500: max return; a long episode is a strong gradient signal

Each episode: act stochastically until the pole falls, record $(s_t, a_t, r_t)$, compute reward-to-go $G_t$, subtract a baseline, and ascend $\sum_t (G_t - b)\nabla_\theta \log \pi_\theta(a_t\mid s_t)$. Episodes that balanced longer produce larger $G_t$, so their actions get reinforced harder, and because the policy is stochastic, exploration is automatic. Plain REINFORCE solves CartPole but with visibly noisy learning curves; adding a value baseline smooths them dramatically. To try variants yourself, RL environments like Gymnasium ship CartPole as the standard first benchmark.

A short history

  • 1992, REINFORCE: Ronald Williams formalizes the class of “REINFORCE” algorithms and the log-derivative gradient estimator, the foundation of policy-based RL.
  • 2000, Policy Gradient Theorem: Sutton, McAllester, Singh & Mansour prove the policy gradient theorem with function approximation and the compatible-features result that grounds actor-critic.
  • 2004, Variance reduction theory: Greensmith, Bartlett & Baxter derive optimal baselines and rigorous variance bounds for gradient estimates.
  • 2015–16, Deep policy gradients: TRPO (Schulman et al.) adds trust regions; A3C (Mnih et al.) scales actor-critic; GAE refines advantage estimation.
  • 2017, PPO: Schulman et al. introduce the clipped objective, which is simple, stable, and now the default policy-gradient algorithm everywhere.
  • 2024–25, GRPO & LLMs: DeepSeek’s GRPO strips the critic and uses group-relative advantages, making policy gradients the engine of LLM reasoning post-training.

When to use policy gradients

Reach for policy gradients when…

Actions are continuous or high-dimensional; the optimal policy is stochastic; you want stable, smooth improvement; or you’re doing LLM post-training (PPO, GRPO, RLHF).

Prefer value-based when…

Actions are discrete and few, sample efficiency matters, and off-policy replay helps. DQN and its descendants reuse old data; pure policy gradients can’t. Many real systems blend both via actor-critic.

Building the environments, reward pipelines and infrastructure that make large-scale policy-gradient training practical is its own industry — see the companies building RL environments.

Researcher takes

Lambert makes the historical-clarification argument that the algorithm everyone calls REINFORCE is nothing more than the vanilla policy gradient, tracing the name back to Williams 1992 and situating it within the lineage that RLHF practitioners rediscovered.

Frequently asked questions

What’s the difference between REINFORCE and policy gradient methods?

“Policy gradient methods” is the whole family of algorithms that optimize a parameterized policy by gradient ascent on expected return. REINFORCE is the simplest concrete member — a Monte-Carlo estimator that scales each action’s log-probability gradient by the realized return. Actor-critic, TRPO, PPO and GRPO are all policy gradient methods that improve on REINFORCE.

Why is REINFORCE so high-variance?

It estimates an expectation over all possible trajectories using a single sampled episode, and the return that weights each action sums up many random rewards and transitions. Two runs of the same policy can produce very different returns, so the gradient estimate jumps around. Baselines, reward-to-go, learned critics (actor-critic) and large batches all reduce this variance.

Does subtracting a baseline bias the gradient?

No, as long as the baseline depends only on the state, not the action. The expected score function is zero, so $\mathbb{E}[b(s)\nabla_\theta \log \pi_\theta] = 0$ and the baseline cancels in expectation while still reducing variance. An action-dependent baseline would in general introduce bias unless a correction term is added.

Is PPO just REINFORCE with extra steps?

Essentially yes, with two crucial additions: a learned critic to compute low-variance advantages, and a clipped surrogate objective that stops any single update from moving the policy too far. The core “log-prob gradient times advantage” update is pure policy gradient. See PPO.
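The clipping idea fits in one function. This is a minimal per-sample sketch of the clipped surrogate, assuming `ratio` is the new-to-old probability ratio $\pi_\theta(a \mid s) / \pi_{\theta_\text{old}}(a \mid s)$ and `advantage` comes from a critic; the default `clip_eps=0.2` is a commonly used value, not something this article specifies.

```python
def ppo_clip_objective(ratio, advantage, clip_eps=0.2):
    """Per-sample clipped surrogate: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# Good action, policy already moved far toward it: the gain is capped.
capped_gain = ppo_clip_objective(ratio=1.5, advantage=1.0)
# Bad action, policy already moved away from it: no extra credit for moving further.
capped_loss = ppo_clip_objective(ratio=0.5, advantage=-1.0)
# Bad action that became more likely: the penalty is not clipped.
full_penalty = ppo_clip_objective(ratio=1.5, advantage=-1.0)
print(capped_gain, capped_loss, full_penalty)
```

With `ratio` equal to 1 the objective reduces to the plain advantage-weighted term, which is where the REINFORCE-style update reappears.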
