Policy Gradient Methods (REINFORCE)

Key takeaways

Policy gradient methods optimize a parameterized policy directly by gradient ascent on expected return — no value function or argmax required.
The policy gradient theorem turns 'gradient of an expectation' into 'expectation of a gradient' via the log-derivative trick, so you can estimate it from sampled trajectories.
REINFORCE (Williams, 1992) is the simplest instance: scale each action's log-probability gradient by the return that followed it.
Its weakness is high variance, which is mitigated by baselines, reward-to-go and advantages; these lead straight to actor-critic, TRPO, PPO and GRPO.

What are policy gradient methods?

Policy gradient methods learn the policy directly. Instead of estimating how good each state or action is and then acting greedily — the value-based recipe behind Q-learning and DQN — they parameterize the policy itself as $\pi_\theta(a \mid s)$ (typically a neural network) and adjust $\theta$ with gradient ascent to make high-return behavior more probable.

The core trick is disarmingly simple: run the policy, see what happened, then increase the probability of the actions that led to good outcomes and decrease the probability of the ones that led to bad outcomes — weighted by how good or bad. REINFORCE is the canonical algorithm that implements exactly this idea.

Policy gradient loop: sample whole trajectories from the current policy, compute the return of each, then nudge θ to raise the log-probability of actions weighted by the return that followed them.

▶ RL Course by David Silver — Lecture 7: Policy Gradient Methods (the canonical lecture)

Why optimize the policy directly?

A policy parameterization gives you three things value-based methods struggle with:

Continuous & high-dimensional actions

Output the parameters of a distribution (e.g. a Gaussian’s mean and variance) instead of one Q-value per discrete action. No argmax over an infinite set. Essential for robotics and control.
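As a minimal sketch of that idea, assume a one-dimensional action and a policy head that emits a mean and a log standard deviation. The log parameterization is an assumption of this example (the text only mentions mean and variance), and the names `gaussian_log_prob` and `sample_action` are illustrative.

```python
import math
import random


def gaussian_log_prob(action, mean, log_std):
    """Log-density of a scalar action under N(mean, exp(log_std)^2)."""
    std = math.exp(log_std)
    z = (action - mean) / std
    return -0.5 * z * z - log_std - 0.5 * math.log(2.0 * math.pi)


def sample_action(mean, log_std, rng):
    """Draw one continuous action from the Gaussian policy head."""
    return rng.gauss(mean, math.exp(log_std))


# Hypothetical policy-head outputs for one state; a real network would produce these.
mean, log_std = 0.3, -1.0
rng = random.Random(0)
action = sample_action(mean, log_std, rng)
log_prob = gaussian_log_prob(action, mean, log_std)
```

The log-probability is the quantity a policy-gradient update differentiates, so no maximization over the action set is ever needed.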

Stochastic policies by construction

Some problems have no optimal deterministic policy (partially observed states, games like rock-paper-scissors). A policy network outputs a probability distribution, so randomness is first-class — and it doubles as built-in exploration.

Smooth improvement

Small parameter changes produce small policy changes. Value-based methods can flip the greedy action discontinuously, causing oscillation; policy gradients move gently along a gradient.

The trade-off: policy gradients are on-policy (data must come from the current policy, so it’s discarded after each update), and their Monte-Carlo gradient estimates are high-variance. Most of the field’s progress since REINFORCE is about taming that variance.

The objective and the policy gradient theorem

We want parameters $\theta$ that maximize expected return. Let a trajectory be $\tau = (s_0, a_0, s_1, a_1, \dots)$ with return $R(\tau) = \sum_{t} \gamma^{t} r_t$ . The objective is

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[\,R(\tau)\,\big].$$

The problem: $\theta$ appears inside the distribution we’re averaging over, so we can’t just differentiate a sum of fixed terms. The log-derivative trick rescues us. Because $\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\,\nabla_\theta \log p_\theta(\tau)$ ,

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[\,R(\tau)\,\nabla_\theta \log p_\theta(\tau)\,\big].$$

This is the conceptual heart of the whole field: it converts the gradient of an expectation (intractable) into an expectation of a gradient (estimate it by sampling). And the environment dynamics drop out — the trajectory probability factorizes as $p_\theta(\tau) = p(s_0)\prod_t P(s_{t+1}\mid s_t,a_t)\,\pi_\theta(a_t\mid s_t)$ , but only the $\pi_\theta$ terms depend on $\theta$ . Taking the log turns the product into a sum and the transition terms vanish:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\left(\sum_{t=0}^{T}\nabla_\theta \log \pi_\theta(a_t\mid s_t)\right)\!\left(\sum_{t=0}^{T} \gamma^t r_t\right)\right].$$

This is the policy gradient theorem (Sutton et al., 2000), and you never need a model of the environment to use it.
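For readers who want the intermediate steps, the derivation can be spelled out as follows. This sketch writes the expectation as an integral over trajectories (a sum in the discrete case) and assumes the gradient and the integral can be interchanged and that $R(\tau)$ does not depend on $\theta$.

```latex
\begin{align*}
\nabla_\theta J(\theta)
  &= \nabla_\theta \int p_\theta(\tau)\, R(\tau)\, d\tau \\
  &= \int \nabla_\theta p_\theta(\tau)\, R(\tau)\, d\tau \\
  &= \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau \\
  &= \mathbb{E}_{\tau \sim \pi_\theta}\big[\, R(\tau)\, \nabla_\theta \log p_\theta(\tau) \,\big], \\[4pt]
\log p_\theta(\tau)
  &= \log p(s_0) + \sum_{t=0}^{T} \log P(s_{t+1}\mid s_t,a_t)
     + \sum_{t=0}^{T} \log \pi_\theta(a_t\mid s_t), \\
\nabla_\theta \log p_\theta(\tau)
  &= \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t\mid s_t).
\end{align*}
```

The first two terms of $\log p_\theta(\tau)$ contain no $\theta$, which is why the initial-state and transition probabilities disappear from the final gradient.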

REINFORCE, step by step

REINFORCE (Williams, 1992 — the name is a backronym for “REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility”) is the most direct Monte-Carlo implementation of the policy gradient theorem.

Run an episode

Sample a full trajectory $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_T, a_T, r_T)$ by acting with the current policy $\pi_\theta$ until the episode ends.

Compute the return after each step

For every timestep $t$ , compute the reward-to-go $G_t = \sum_{k=t}^{T}\gamma^{k-t} r_k$ , the discounted return from that step onward. (Using $G_t$ rather than the whole-episode return is the standard, lower-variance form: an action cannot influence rewards that were collected before it.)

Accumulate the gradient

For each step, form $G_t\,\nabla_\theta \log \pi_\theta(a_t\mid s_t)$ — the score function scaled by how good the rest of the episode turned out.

Update the parameters

Ascend the gradient:

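$$\theta \leftarrow \theta + \alpha \sum_{t=0}^{T} \gamma^t\, G_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t).$$

The $\gamma^t$ factor follows from the discounted objective; many practical implementations drop it and weight each step by $G_t$ alone. Then throw the data away and repeat: REINFORCE is strictly on-policy.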
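The four steps can be sketched in plain Python for a tabular softmax policy. The parameterization (`theta[s]` holds one logit per action), the function names, and the hand-made episode are all assumptions of this example; the update keeps the $\gamma^t$ factor from the rule above.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward_to_go(rewards, gamma):
    """G_t = sum over k >= t of gamma^(k-t) * r_k, computed in one backward pass."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def reinforce_update(theta, episode, alpha, gamma):
    """One REINFORCE step; episode is a list of (state, action, reward) tuples."""
    returns = reward_to_go([r for _, _, r in episode], gamma)
    # Accumulate first so every term is evaluated under the pre-update policy.
    grad = [[0.0] * len(row) for row in theta]
    for t, (s, a, _) in enumerate(episode):
        probs = softmax(theta[s])
        for b in range(len(probs)):
            # For a softmax, d log pi(a|s) / d theta[s][b] = 1[a == b] - pi(b|s).
            score = (1.0 if b == a else 0.0) - probs[b]
            grad[s][b] += (gamma ** t) * returns[t] * score
    for s, row in enumerate(grad):
        for b, g in enumerate(row):
            theta[s][b] += alpha * g
    return theta


# Hand-made two-state episode (hypothetical data, not from a real environment).
theta = [[0.0, 0.0], [0.0, 0.0]]
episode = [(0, 1, 0.0), (1, 0, 1.0)]
before = softmax(theta[0])[1]
reinforce_update(theta, episode, alpha=0.5, gamma=0.9)
after = softmax(theta[0])[1]
```

Because the episode ended with a positive return, the probability of the action taken in state 0 rises (`after > before`). A full training loop would sample a fresh episode from the updated policy before each call.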

Go deeper: the score function and why it’s unbiased

The term $\nabla_\theta \log \pi_\theta(a\mid s)$ is called the score function. A key identity makes baselines work: the expected score under the policy is zero, $\mathbb{E}_{a\sim\pi_\theta}[\nabla_\theta \log \pi_\theta(a\mid s)] = \sum_a \pi_\theta(a\mid s)\,\nabla_\theta \log \pi_\theta(a\mid s) = \nabla_\theta \sum_a \pi_\theta(a\mid s) = \nabla_\theta 1 = 0$ . That is precisely why you can subtract a state-dependent baseline (next section) without biasing the estimator. REINFORCE itself is unbiased for a related but separate reason: a sampled return, even from a single episode, is an unbiased estimate of the expectation in the policy gradient theorem. The price of unbiasedness is variance: one noisy episode stands in for an expectation over all possible trajectories.
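The zero-mean property of the score is easy to check numerically. The sketch below assumes a softmax policy over three actions with arbitrary logits and takes the gradient with respect to the logits; `score` is an illustrative helper name.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score(logits, action):
    """Gradient of log pi(action) with respect to the logits of a softmax policy."""
    probs = softmax(logits)
    return [(1.0 if b == action else 0.0) - probs[b] for b in range(len(logits))]


logits = [0.2, -1.0, 1.5]  # arbitrary illustrative logits for one state
probs = softmax(logits)

# Exact expectation over actions: sum_a pi(a) * grad log pi(a), one entry per logit.
expected_score = [
    sum(probs[a] * score(logits, a)[b] for a in range(len(logits)))
    for b in range(len(logits))
]

# Multiplying by any constant baseline value leaves the expectation at zero.
baseline = 7.0
expected_baseline_term = [baseline * g for g in expected_score]
```

Every entry of both lists is zero up to floating-point rounding, which is the cancellation that lets a baseline be subtracted for free.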

The variance problem — and baselines

REINFORCE works, but it is notoriously high-variance. Returns swing wildly from episode to episode, so the gradient estimate is noisy and learning is slow and unstable. The single most important fix is the baseline.

Subtract a function $b(s)$ that depends only on the state (not the action) from the return:

$$\nabla_\theta J(\theta) = \mathbb{E}\!\left[\sum_t \nabla_\theta \log \pi_\theta(a_t\mid s_t)\,\big(G_t - b(s_t)\big)\right].$$

Because the expected score is zero, subtracting any state-dependent $b(s)$ leaves the gradient unbiased. A well-chosen baseline also shrinks its variance, while a poorly chosen one can increase it. The natural choice is the state-value function $b(s) = V(s)$ , the average return you’d expect from that state. Then the weight becomes $G_t - V(s_t)$ : an estimate of the advantage $A(s_t,a_t)$ , i.e. “was this action better or worse than typical for this state?”

Without a baseline, all positive returns push action probabilities up (only the relative sizes differ). Subtracting V(s) recenters returns around zero, so genuinely above-average actions go up and below-average actions go down — far less noise.
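This effect can be checked exactly on a toy problem. The sketch below assumes a single state, two actions with hypothetical returns of 10 and 11, and a uniform softmax policy; it enumerates both actions instead of sampling, and tracks only the gradient with respect to the second logit to keep everything scalar.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def estimator_stats(logits, returns, baseline):
    """Exact mean and variance of (R(a) - b) * d log pi(a) / d logit_1."""
    probs = softmax(logits)
    samples = []
    for a, p in enumerate(probs):
        score_1 = (1.0 if a == 1 else 0.0) - probs[1]
        samples.append((p, (returns[a] - baseline) * score_1))
    mean = sum(p * g for p, g in samples)
    var = sum(p * (g - mean) ** 2 for p, g in samples)
    return mean, var


logits = [0.0, 0.0]     # uniform policy over two actions
returns = [10.0, 11.0]  # hypothetical returns, both positive
value = sum(p * r for p, r in zip(softmax(logits), returns))  # V = 10.5

mean_raw, var_raw = estimator_stats(logits, returns, baseline=0.0)
mean_base, var_base = estimator_stats(logits, returns, baseline=value)
```

Under these assumptions both estimators have mean 0.25, but the variance drops from about 27.56 without a baseline to 0 with $b = V$. The collapse to exactly zero is an artifact of the symmetric two-action setup; in general a baseline reduces variance without eliminating it.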

When you learn $V(s)$ with its own network and use it as the baseline, you’ve built an actor-critic method: the actor is the policy, the critic is the value estimate. That step — replacing the raw Monte-Carlo return with a learned, bootstrapped advantage — is the bridge from REINFORCE to every modern policy-gradient algorithm.

Go deeper: the optimal baseline and other variance fighters

$V(s)$ is convenient but not variance-optimal. The minimum-variance baseline is a gradient-magnitude-weighted average of returns, $b^* = \frac{\mathbb{E}[\|\nabla_\theta \log \pi_\theta\|^2\, G]}{\mathbb{E}[\|\nabla_\theta \log \pi_\theta\|^2]}$ , derived by Greensmith, Bartlett & Baxter (2004). In practice $V(s)$ captures most of the benefit and is what people use. Other variance-reduction tools stack on top: reward-to-go (causality), generalized advantage estimation (GAE), which trades bias for variance via a $\lambda$ knob, discounting as an implicit variance control, and large batch sizes to average out the noise. TRPO adds a trust region, and PPO approximates one with clipping, to bound how far each noisy step can move the policy.
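To see the gap between $V(s)$ and the optimal baseline, here is a small exact computation. It assumes a single state, a two-action softmax policy with probabilities (0.8, 0.2), hypothetical returns of 10 and 11, and reads the squared score as the squared norm of the gradient with respect to the logits. Since the mean gradient does not depend on the baseline, comparing second moments is equivalent to comparing variances.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score_sq_norm(logits, action):
    """Squared norm of grad log pi(action) with respect to the logits."""
    probs = softmax(logits)
    return sum(((1.0 if b == action else 0.0) - probs[b]) ** 2
               for b in range(len(logits)))


def second_moment(logits, returns, baseline):
    """E[ ||(R(a) - b) * grad log pi(a)||^2 ], computed exactly over actions."""
    probs = softmax(logits)
    return sum(p * score_sq_norm(logits, a) * (returns[a] - baseline) ** 2
               for a, p in enumerate(probs))


def optimal_baseline(logits, returns):
    """b* = E[||score||^2 * R] / E[||score||^2] for a single-state problem."""
    probs = softmax(logits)
    num = sum(p * score_sq_norm(logits, a) * returns[a] for a, p in enumerate(probs))
    den = sum(p * score_sq_norm(logits, a) for a, p in enumerate(probs))
    return num / den


logits = [math.log(4.0), 0.0]  # pi = (0.8, 0.2)
returns = [10.0, 11.0]         # hypothetical returns
value = sum(p * r for p, r in zip(softmax(logits), returns))  # V = 10.2
b_star = optimal_baseline(logits, returns)                    # about 10.8

m2_value = second_moment(logits, returns, value)   # about 0.1664
m2_star = second_moment(logits, returns, b_star)   # about 0.0512
```

The optimal baseline shifts toward the return of the rarer action because that action has the larger score norm. Both choices are far better than no baseline at all in this example.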

From REINFORCE to the modern family

REINFORCE is the root of a large, still-growing tree. Each descendant attacks one of its weaknesses — variance, sample inefficiency, or instability from too-large updates.

Method	What it adds over REINFORCE	Why it matters
REINFORCE + baseline	Subtract $V(s)$	First big variance cut, still unbiased
Actor-critic	Learned, bootstrapped critic for the advantage	Lower variance, can update per-step (online)
A2C / A3C	Synchronous (A2C) or asynchronous (A3C) parallel actors + entropy bonus	Stable, scalable deep RL
TRPO	Trust region constraining KL between updates	Prevents destructive policy jumps
PPO	Clipped surrogate objective (cheap trust region)	The workhorse — robotics, games, and RLHF
GRPO	Group-relative advantage; drops the value network	Cheap RL for LLM reasoning (DeepSeek)
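As one concrete illustration of the last row, a group-relative advantage can be sketched by standardizing rewards within a group of samples drawn for the same prompt. The function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions of this sketch rather than a reference implementation.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group; no value network is involved."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std: an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for five sampled completions of one prompt.
group_rewards = [1.0, 0.0, 0.0, 1.0, 1.0]
advantages = group_relative_advantages(group_rewards)
```

Samples that beat the group mean get positive advantages and the rest get negative ones, so the group itself plays the role that a learned $V(s)$ baseline plays in actor-critic methods.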

▶ L3 Policy Gradients and Advantage Estimation — Pieter Abbeel (Foundations of Deep RL)

Worked intuition: REINFORCE on CartPole

Concretely, on the classic CartPole task the policy network maps the 4-D state (cart position, velocity, pole angle, angular velocity) to a probability over two actions (push left / push right).

2: discrete actions the policy outputs probabilities over

+1: reward per timestep the pole stays upright

500: max return; a long episode is a strong gradient signal

Each episode: act stochastically until the pole falls or the step limit is reached, record $(s_t, a_t, r_t)$ , compute reward-to-go $G_t$ , subtract a baseline, and ascend $\sum_t (G_t - b)\nabla_\theta \log \pi_\theta(a_t\mid s_t)$ . Episodes that balanced longer produce larger $G_t$ , so their actions get reinforced harder, and because the policy is stochastic, exploration is automatic. Plain REINFORCE solves CartPole but with visibly noisy learning curves; adding a value baseline typically smooths them considerably. To try variants yourself, environment libraries like Gymnasium ship CartPole as the standard first benchmark.
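The per-episode computation can be sketched without any environment dependency. The example below assumes a linear softmax policy (one weight row per action over the 4-D state), a constant baseline passed in by the caller, and a made-up three-step episode; in practice the states would come from an environment loop, which is omitted here.

```python
import math


def action_probs(theta, state):
    """Softmax over two linear scores, one weight row per action."""
    logits = [sum(w * x for w, x in zip(row, state)) for row in theta]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward_to_go(rewards, gamma):
    """Discounted return from each step onward, computed backwards."""
    returns, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def episode_gradient(theta, states, actions, rewards, baseline, gamma=1.0):
    """Sum over t of (G_t - b) * grad log pi(a_t | s_t) for the linear policy."""
    returns = reward_to_go(rewards, gamma)
    grad = [[0.0] * len(row) for row in theta]
    for s, a, g in zip(states, actions, returns):
        probs = action_probs(theta, s)
        for b in range(len(theta)):
            # d log pi(a|s) / d theta[b] = (1[a == b] - pi(b|s)) * s
            coeff = (g - baseline) * ((1.0 if b == a else 0.0) - probs[b])
            for i, x in enumerate(s):
                grad[b][i] += coeff * x
    return grad


# Made-up episode: (position, velocity, angle, angular velocity) per step.
theta = [[0.0] * 4, [0.0] * 4]
states = [[0.0, 0.1, 0.02, -0.1], [0.0, 0.2, 0.01, -0.2], [0.01, 0.1, 0.0, 0.0]]
actions = [1, 0, 1]
rewards = [1.0, 1.0, 1.0]  # +1 per step survived
grad = episode_gradient(theta, states, actions, rewards, baseline=2.0)
theta = [[w + 0.01 * g for w, g in zip(row, g_row)] for row, g_row in zip(theta, grad)]
```

With a baseline of 2.0 the step weights are 1, 0 and -1, so the first action is reinforced and the last one, taken just before the episode ended, is discouraged. The baseline value here is hypothetical; a running average of past returns or a learned $V(s)$ would be the usual choices.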

A short history

1992

REINFORCE

Ronald Williams formalizes the class of “REINFORCE” algorithms and the log-derivative gradient estimator — the foundation of policy-based RL.

2000

Policy Gradient Theorem

Sutton, McAllester, Singh & Mansour prove the policy gradient theorem with function approximation and the compatible-features result that grounds actor-critic.

2004

Variance reduction theory

Greensmith, Bartlett & Baxter derive optimal baselines and rigorous variance bounds for gradient estimates.

2015–16

Deep policy gradients

TRPO (Schulman et al.) adds trust regions; A3C (Mnih et al.) scales actor-critic; GAE refines advantage estimation.

2017

PPO

Schulman et al. introduce the clipped objective: simple, stable, and now a default choice for policy-gradient training in many settings.

2024–25

GRPO & LLMs

DeepSeek’s GRPO strips the critic and uses group-relative advantages, making policy gradients the engine of LLM reasoning post-training.

When to use policy gradients

Reach for policy gradients when…

Actions are continuous or high-dimensional; the optimal policy is stochastic; you want stable, smooth improvement; or you’re doing LLM post-training (PPO, GRPO, RLHF).

Prefer value-based when…

Actions are discrete and few, sample efficiency matters, and off-policy replay helps. DQN and its descendants reuse old data; pure policy gradients can’t. Many real systems blend both via actor-critic.

Building the environments, reward pipelines and infrastructure that make large-scale policy-gradient training practical is its own industry — see the companies building RL environments.

Researcher takes

Nathan Lambert makes the historical-clarification argument that the algorithm everyone calls REINFORCE is nothing more than the vanilla policy gradient, tracing the name back to Williams (1992) and situating it within the lineage that RLHF practitioners rediscovered. (Source: Nathan Lambert's post on X.)

Frequently asked questions

What’s the difference between REINFORCE and policy gradient methods?

“Policy gradient methods” is the whole family of algorithms that optimize a parameterized policy by gradient ascent on expected return. REINFORCE is the simplest concrete member — a Monte-Carlo estimator that scales each action’s log-probability gradient by the realized return. Actor-critic, TRPO, PPO and GRPO are all policy gradient methods that improve on REINFORCE.

Why is REINFORCE so high-variance?

It estimates an expectation over all possible trajectories using a single sampled episode, and the return that weights each action sums up many random rewards and transitions. Two runs of the same policy can produce very different returns, so the gradient estimate jumps around. Baselines, reward-to-go, learned critics (actor-critic) and large batches all reduce this variance.

Does subtracting a baseline bias the gradient?

No, as long as the baseline depends only on the state, not the action. The expected score function is zero, so $\mathbb{E}[b(s)\nabla_\theta \log \pi_\theta] = 0$ and the baseline cancels in expectation, while a well-chosen one still reduces variance. A baseline that depends on the action generally introduces bias unless a correction term is added.

Is PPO just REINFORCE with extra steps?

Essentially yes, with two crucial additions: a learned critic to compute low-variance advantages, and a clipped surrogate objective that stops any single update from moving the policy too far. The core “log-prob gradient times advantage” update is pure policy gradient. See PPO.
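The clipping idea fits in a few lines. This is a per-sample sketch of the clipped surrogate, $\min(r A, \operatorname{clip}(r, 1-\epsilon, 1+\epsilon) A)$ with $r$ the probability ratio between new and old policy; the function name and the value `clip_eps=0.2` (a commonly used setting) are assumptions of this example.

```python
import math


def ppo_clip_objective(new_log_prob, old_log_prob, advantage, clip_eps=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(new_log_prob - old_log_prob)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)


# Policy unchanged: ratio is 1, so the objective equals the advantage.
same = ppo_clip_objective(math.log(0.5), math.log(0.5), advantage=2.0)
# Probability rose from 0.5 to 0.9 (ratio 1.8) with a positive advantage: gain is capped.
capped = ppo_clip_objective(math.log(0.9), math.log(0.5), advantage=2.0)
# Same move with a negative advantage: the unclipped (worse) value is kept.
penalized = ppo_clip_objective(math.log(0.9), math.log(0.5), advantage=-2.0)
```

Here `same` is 2.0, `capped` is 2.4 instead of 3.6, and `penalized` is -3.6. The objective stops rewarding a move once the ratio leaves the clip range in the favorable direction, but it never hides a move that made things worse.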

Key papers

Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning (Williams, 1992): the original REINFORCE paper.
Policy Gradient Methods for Reinforcement Learning with Function Approximation (Sutton, McAllester, Singh & Mansour, 2000): the policy gradient theorem.
Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning (Greensmith, Bartlett & Baxter, 2004): baselines and variance bounds.
Trust Region Policy Optimization (Schulman et al., 2015): trust regions.
High-Dimensional Continuous Control Using Generalized Advantage Estimation (Schulman et al., 2015): generalized advantage estimation.
Proximal Policy Optimization Algorithms (Schulman et al., 2017): the modern default.
