
Actor-Critic Methods (A2C / A3C)

How actor-critic RL works: the actor-critic split, the advantage function, the A3C asynchronous algorithm, its synchronous A2C cousin, the math, and where they fit in 2026.

Updated 2026-06-07 15 min read
Key takeaways
  • Actor-critic methods run two networks together: an actor that picks actions (the policy) and a critic that scores states (a value function).
  • The critic supplies a low-variance baseline so the actor learns from the advantage — how much better an action was than expected — instead of raw returns.
  • A3C (2016) parallelised this across many CPU workers running asynchronously; A2C is the simpler synchronous version that batches workers and matches its performance.
  • Actor-critic is the structural backbone of modern deep RL: PPO, SAC, DDPG and IMPALA are all actor-critic at heart, and GRPO is a critic-free descendant of the same design.

What is an actor-critic method?

An actor-critic method is a reinforcement learning algorithm that learns two things at once. The actor is a policy — it looks at the current state and decides what to do. The critic is a value function — it watches the actor and estimates how good the situation is, then tells the actor whether the action it just took turned out better or worse than expected.

This split solves the central weakness of each family it descends from. Pure policy-gradient methods (like REINFORCE) learn the policy directly but estimate the gradient from full Monte-Carlo returns, which are unbiased but extremely noisy. Pure value-based methods (like Q-learning) are low-variance and sample-efficient but struggle with continuous or large action spaces. Actor-critic keeps the explicit policy of one and the bootstrapped value estimate of the other — the critic acts as a learned baseline that slashes variance without adding much bias.

[Figure: the actor (policy π(a|s; θ)) sends action a to the environment; the environment returns state s and reward r; the critic (value V(s; w)) produces the advantage / TD error δ, which updates the actor.]
The actor-critic loop. The actor selects an action; the environment returns a reward and next state; the critic computes a TD error (the advantage signal) that updates both networks — the critic to predict value better, the actor to favour above-average actions.

The advantage: the signal that makes it work

The policy-gradient theorem says you can improve a policy by nudging it toward actions that led to high return. Plain REINFORCE uses the full return $G_t$ as the weight:

$$\nabla_\theta J(\theta) = \mathbb{E}\big[\,\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,G_t\,\big]$$

That estimator is unbiased but has enormous variance: $G_t$ swings wildly with luck later in the episode. Subtracting a baseline $b(s_t)$ that depends only on the state leaves the gradient unbiased (the baseline term has zero expectation) while shrinking variance. The standard choice of baseline is the state value $V(s_t)$, and replacing $G_t - V(s_t)$ with the advantage function gives the actor-critic gradient:

$$A(s_t, a_t) = Q(s_t, a_t) - V(s_t), \qquad \nabla_\theta J(\theta) = \mathbb{E}\big[\,\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,A(s_t, a_t)\,\big]$$
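The zero-expectation property of the baseline term can be checked in one line. This is a short derivation rather than anything taken from the article, and it assumes a discrete action space and a baseline $b(s)$ that does not depend on the action:

```latex
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\big]
  = b(s) \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)
  = b(s) \sum_a \nabla_\theta \pi_\theta(a \mid s)
  = b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s)
  = b(s)\, \nabla_\theta 1
  = 0
```

The same argument goes through for continuous actions with the sum replaced by an integral, provided differentiation and integration can be exchanged.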

In practice you don’t have $Q$ directly. A3C estimates the advantage with an $n$-step return bootstrapped off the critic:

$$\hat{A}_t = \sum_{i=0}^{k-1}\gamma^i r_{t+i} + \gamma^k V(s_{t+k};w) - V(s_t;w)$$

The single-step case ($k=1$) is just the TD error $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$, the same quantity that trains the critic. One number does double duty: it tells the critic how wrong its prediction was, and tells the actor whether to make the action more or less likely.
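To make the formula concrete, here is a minimal NumPy sketch of the $k$-step advantage estimate. The function name `n_step_advantage` and the reward and value numbers are made up for illustration; `values` stands in for the critic's outputs $V(s_i; w)$.

```python
import numpy as np

def n_step_advantage(rewards, values, t, k, gamma=0.99):
    """k-step advantage estimate at time t.

    rewards[i] is r_i and values[i] is the critic's V(s_i), so `values`
    needs at least t + k + 1 entries (it must include V(s_{t+k})).
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    discounts = gamma ** np.arange(k)                      # 1, gamma, ..., gamma^(k-1)
    k_step_return = np.sum(discounts * rewards[t:t + k])   # discounted reward sum
    # Bootstrap off the critic at s_{t+k}, then subtract the baseline V(s_t).
    return k_step_return + gamma ** k * values[t + k] - values[t]

rewards = [1.0, 0.0, 2.0]          # hypothetical r_0..r_2
values = [0.5, 1.0, 1.5, 0.0]      # hypothetical V(s_0)..V(s_3)
td_error = n_step_advantage(rewards, values, t=0, k=1, gamma=0.9)    # k = 1: TD error
three_step = n_step_advantage(rewards, values, t=0, k=3, gamma=0.9)  # k = 3
```

With these numbers the one-step estimate is the TD error $1 + 0.9 \cdot 1.0 - 0.5 = 1.4$, while the three-step estimate leans on observed rewards and gives $2.12$.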

Go deeper: GAE and the bias-variance dial

The $n$-step return picks one fixed horizon $k$: short $k$ is low-variance but biased by the imperfect critic, long $k$ is the reverse. Generalized Advantage Estimation (GAE) (Schulman et al., 2015) instead takes an exponentially-weighted average of all $n$-step estimates with a decay $\lambda$:

$$\hat{A}_t^{\text{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty}(\gamma\lambda)^l\,\delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

$\lambda=0$ collapses to one-step TD (low variance, high bias); $\lambda=1$ recovers the full Monte-Carlo advantage (high variance, no bias). $\lambda\approx0.95$ is the workhorse default in PPO and most modern actor-critics.
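The GAE sum is usually computed with a backward recursion, $\hat A_t = \delta_t + \gamma\lambda\,\hat A_{t+1}$. The sketch below assumes a single rollout with no episode boundary inside it and truncates the infinite sum at the end of the rollout; the function name `gae_advantages` is illustrative.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """GAE over one rollout (assumes no episode boundary inside it).

    `values` has len(rewards) + 1 entries; the last one is the bootstrap
    value V(s_T). The infinite sum is truncated at the rollout's end.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    deltas = rewards + gamma * values[1:] - values[:-1]    # one-step TD errors
    advantages = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        # A_t = delta_t + (gamma * lambda) * A_{t+1}
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

rewards = [1.0, 0.0, 2.0]
values = [0.5, 1.0, 1.5, 0.0]
adv_td = gae_advantages(rewards, values, gamma=0.9, lam=0.0)   # equals the TD errors
adv_mc = gae_advantages(rewards, values, gamma=0.9, lam=1.0)   # full-rollout estimate
```

Setting `lam=0.0` returns the raw TD errors, and `lam=1.0` returns the discounted rollout return (plus the bootstrap term) minus $V(s_t)$, which matches the two limiting cases described above.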

A3C: Asynchronous Advantage Actor-Critic

A3C was introduced by Mnih et al. at DeepMind in the 2016 ICML paper Asynchronous Methods for Deep Reinforcement Learning. Its insight was about infrastructure as much as algorithm. At the time, deep RL relied on a large experience replay buffer (as in DQN) to decorrelate the highly correlated stream of consecutive transitions — without that, the network would chase its own tail and diverge.

A3C replaced replay with parallelism. Run many actor-learners in parallel, each with its own copy of the policy and its own environment instance, each exploring a different part of the state space. At any instant the workers are seeing decorrelated data simply because they’re in different places — so you get the stabilising effect of replay for free, and it works for on-policy methods that replay can’t support.

[Figure: a global network holding the shared θ (actor) and w (critic), connected to Workers 1 to N, each with a local net and its own environment; workers push gradients asynchronously and pull fresh weights. 16 CPU cores, no GPU; workers never wait for each other.]
A3C architecture: each worker holds a local copy of the actor-critic, runs its own environment for a few steps, computes gradients, and asynchronously applies them to a shared global network — then pulls the fresh weights and repeats.
1. Each worker syncs and rolls out

A worker copies the current global parameters into its local actor-critic, then runs its own environment for up to $t_{\max}$ steps (typically 5–20), collecting states, actions and rewards on-policy.

2. Bootstrap the return and compute advantages

At the end of the rollout it bootstraps with the critic, using $R = V(s_{t_{\max}};w)$ for a non-terminal state or $0$ if the episode ended, then walks backward computing the $n$-step return and advantage $\hat A_t$ at each step.

3. Accumulate the gradient (three terms)

Each worker accumulates a policy gradient $\nabla_\theta \log\pi_\theta(a_t \mid s_t)\,\hat A_t$, a value gradient on $(R_t - V(s_t;w))^2$, and an entropy bonus $\beta\,\nabla_\theta H(\pi_\theta(\cdot \mid s_t))$ that rewards keeping the policy spread out, preventing premature collapse to a single action.

4. Apply asynchronously, then repeat

The worker applies its accumulated gradients to the shared global network without locking, so other workers may update in between (a “Hogwild!”-style lock-free update). It then loops back to step 1. Many workers doing this concurrently is what stabilises and accelerates training.
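Step 2 of the loop above, the backward walk, is the part that is easiest to get wrong, so here is a minimal sketch of it. The function name and arguments are illustrative and not taken from the A3C paper's pseudocode; `values` stands in for the critic's outputs along the rollout.

```python
import numpy as np

def rollout_returns_and_advantages(rewards, values, bootstrap_value,
                                   terminal, gamma=0.99):
    """Backward pass over one worker rollout.

    rewards[t] and values[t] = V(s_t) cover the rollout's steps.
    bootstrap_value is the critic's estimate for the state reached after
    the last step; it is ignored when the episode ended there.
    """
    R = 0.0 if terminal else float(bootstrap_value)
    returns = np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        R = rewards[t] + gamma * R        # R_t = r_t + gamma * R_{t+1}
        returns[t] = R
    advantages = returns - np.asarray(values, dtype=float)
    return returns, advantages

# Rollout that ends the episode: nothing to bootstrap from.
ret_done, adv_done = rollout_returns_and_advantages(
    [1.0, 0.0, 2.0], [0.5, 1.0, 1.5], bootstrap_value=3.0,
    terminal=True, gamma=0.9)

# Rollout cut off mid-episode: bootstrap from the critic.
ret_cut, adv_cut = rollout_returns_and_advantages(
    [1.0, 0.0, 2.0], [0.5, 1.0, 1.5], bootstrap_value=1.0,
    terminal=False, gamma=0.9)
```

Because the recursion starts from the bootstrap value, earlier steps in the rollout get longer effective horizons than later ones: the first step uses a full-rollout return, the last step a one-step return.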

A3C usually shares the lower layers of the actor and critic networks (e.g. the convolutional stack for Atari), with two heads on top: a softmax policy head and a single linear value head. The combined loss for one worker is:

$$\mathcal{L} = \underbrace{-\log\pi_\theta(a_t \mid s_t)\,\hat A_t}_{\text{actor}} \;+\; \underbrace{c_v\,(R_t - V(s_t;w))^2}_{\text{critic}} \;-\; \underbrace{\beta\,H(\pi_\theta(\cdot \mid s_t))}_{\text{entropy bonus}}$$
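As a rough illustration, the three terms can be evaluated with plain NumPy for a softmax policy head. This is a sketch only: a real implementation would use an autodiff framework and stop gradients through the advantages, and the defaults `c_v=0.5` and `beta=0.01` are assumed illustrative values, not settings stated in this article.

```python
import numpy as np

def a3c_loss(logits, actions, advantages, returns, values, c_v=0.5, beta=0.01):
    """Mean combined loss over a rollout for a softmax policy.

    logits: (T, num_actions); actions: (T,) ints; the rest: (T,) floats.
    Advantages are treated as constants here.
    """
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    probs = np.exp(log_probs)
    chosen = log_probs[np.arange(len(actions)), actions]     # log pi(a_t | s_t)
    actor = -(chosen * np.asarray(advantages, dtype=float)).mean()
    critic = c_v * ((np.asarray(returns, dtype=float)
                     - np.asarray(values, dtype=float)) ** 2).mean()
    entropy = -(probs * log_probs).sum(axis=1).mean()        # H(pi(. | s_t))
    return actor + critic - beta * entropy

loss = a3c_loss(
    logits=np.zeros((2, 2)),        # uniform policy over two actions
    actions=[0, 1],
    advantages=[1.0, -1.0],
    returns=[1.0, 1.0],
    values=[0.0, 0.0],
)
```

In this toy call the actor terms cancel, the critic term contributes $0.5$, and the uniform policy's entropy $\ln 2$ is subtracted with weight $\beta$.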
  • 16: CPU cores A3C used, no GPU required
  • ½: training time vs. prior Atari state-of-the-art
  • 57: Atari games where it beat the prior average score

A2C: the synchronous cousin

When researchers reproduced A3C they found something surprising: the asynchrony wasn’t actually pulling its weight. The noise from lock-free, out-of-date (“stale”) gradients was a cost, not a benefit. A2C (Advantage Actor-Critic) is the synchronous variant: a coordinator waits for every worker to finish its rollout, combines their experience into one big batch, performs a single update, and broadcasts the new weights.

OpenAI introduced A2C in its Baselines: ACKTR & A2C release and reported that the synchronous version matches or beats A3C while being simpler to implement and far better at exploiting a GPU — large synchronised batches are exactly what GPUs want, whereas A3C’s many small asynchronous CPU updates leave a GPU starved.
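The synchronous step can be sketched with a toy stand-in: a linear critic trained on a squared-error loss, with four hypothetical workers returning equal-length rollouts. Everything here (the function names, the random data, the linear model) is an assumption for illustration; the point is only that averaging per-worker gradients is equivalent to one gradient on the combined batch.

```python
import numpy as np

def value_grad(w, states, targets):
    """Gradient of 0.5 * mean((states @ w - targets)^2) with respect to w."""
    errors = states @ w - targets
    return states.T @ errors / len(targets)

def a2c_sync_step(w, worker_batches, lr=0.1):
    """One synchronous update: gather every worker's batch, average, step once."""
    grads = [value_grad(w, s, y) for s, y in worker_batches]
    return w - lr * np.mean(grads, axis=0)

rng = np.random.default_rng(0)
w = np.zeros(3)
# Four workers, each with a 5-step rollout of 3-dimensional states.
batches = [(rng.normal(size=(5, 3)), rng.normal(size=5)) for _ in range(4)]
w_new = a2c_sync_step(w, batches)

# Equivalent view: one gradient on the concatenated batch.
all_states = np.concatenate([s for s, _ in batches])
all_targets = np.concatenate([y for _, y in batches])
w_big_batch = w - 0.1 * value_grad(w, all_states, all_targets)
```

The equivalence holds because the rollouts have equal length; it is also why the update maps well onto a GPU, since the concatenated batch is a single large matrix operation.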

A3C — asynchronous

Workers update a shared network independently, no waiting. Strong on CPU-only clusters; tolerant of slow or heterogeneous workers. Cost: stale gradients add noise, and results are hard to reproduce exactly.

A2C — synchronous

A coordinator batches all workers and does one update per step. Deterministic, reproducible, GPU-friendly, simpler code. Cost: throughput is capped by the slowest worker each round.

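|                      | A3C                     | A2C                  |
|----------------------|-------------------------|----------------------|
| Update timing        | Asynchronous, lock-free | Synchronous, batched |
| Gradient freshness   | Can be stale            | Always on-policy     |
| Hardware sweet spot  | Many CPU cores          | GPU + parallel envs  |
| Reproducibility      | Hard (race conditions)  | Deterministic        |
| Reported performance | Strong                  | Equal or better      |
| Code complexity      | Higher                  | Lower                |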

The practical verdict held: the field largely moved to synchronous, batched rollouts. A2C is the template most modern on-policy algorithms — including PPO — actually use under the hood.

Actor-critic vs. its neighbours

| Method     | Family                | Key idea relative to A2C/A3C |
|------------|-----------------------|------------------------------|
| A2C / A3C  | On-policy AC          | The baseline: advantage actor-critic with parallel rollouts |
| PPO        | On-policy AC          | Adds a clipped surrogate objective so updates can’t move too far; the modern default |
| DDPG / TD3 | Off-policy AC         | Deterministic actor + Q-critic for continuous control; uses replay |
| SAC        | Off-policy AC         | Maximum-entropy objective; an entropy bonus baked into the reward, not just a regulariser |
| IMPALA     | Distributed AC        | Scales A3C-style learning with V-trace off-policy correction for stale data |
| GRPO       | On-policy, critic-free | Drops the value network entirely; estimates advantage from a group of sampled answers |

The line to PPO is the one to remember: PPO is an advantage actor-critic with the same value head, the same GAE advantages, and the same entropy bonus — it just wraps the actor update in a clipping mechanism that lets you reuse each batch for several epochs safely. A2C is “PPO without the clip.” Going the other direction, GRPO — central to today’s reasoning-model training — is what you get when you keep the actor and the advantage but delete the critic, replacing it with a group-relative baseline.
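A group-relative baseline can be sketched in a few lines. This follows the commonly described GRPO recipe of standardising each sampled answer's reward against the other answers for the same prompt; the function name and the `eps` guard are illustrative assumptions.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each sampled answer relative to its own group.

    rewards: scalar rewards for several answers to the same prompt.
    The group mean plays the role the critic's V(s) plays in A2C.
    """
    rewards = np.asarray(rewards, dtype=float)
    # eps guards against division by zero when all rewards are equal.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four sampled answers to one prompt, scored 1 (correct) or 0 (wrong).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Answers above the group mean get a positive advantage and those below get a negative one, with no value network involved; if every answer in a group scores the same, all advantages are zero and that prompt contributes no learning signal.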

Go deeper: why entropy regularization matters here

A pure policy gradient can converge greedily onto whatever action looks best early, before it has explored enough: a self-reinforcing trap, since a narrowing policy generates less diverse data. A3C adds $-\beta\,H(\pi)$ to the loss, where $H$ is the policy’s entropy; this gently rewards keeping probability mass spread across actions. It’s a cheap, effective nudge toward exploration and a direct ancestor of SAC’s full maximum-entropy framework, which makes that bonus the central objective rather than a side term.
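For a discrete policy the entropy is $H(\pi) = -\sum_a \pi(a)\log\pi(a)$. The small sketch below (the function name is illustrative) shows how the bonus behaves at the two extremes.

```python
import numpy as np

def policy_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    probs = np.asarray(probs, dtype=float)
    nonzero = probs[probs > 0]             # treat 0 * log 0 as 0
    return float(-(nonzero * np.log(nonzero)).sum())

h_uniform = policy_entropy([0.25, 0.25, 0.25, 0.25])   # maximum: log(4)
h_peaked = policy_entropy([0.97, 0.01, 0.01, 0.01])    # nearly collapsed
h_greedy = policy_entropy([1.0, 0.0, 0.0, 0.0])        # fully collapsed: 0
```

Since the loss subtracts $\beta H(\pi)$, a collapsed policy earns no bonus while a spread-out one lowers the loss, which is the pull toward exploration described above.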

A short history

1983
Actor-critic, the original
Barto, Sutton & Anderson’s “adaptive critic” on the cart-pole task — the two-structure idea decades before deep nets.
2000
Policy gradient theorem
Sutton et al. formalise policy gradients with function approximation, giving actor-critic its theoretical footing and the baseline trick.
2015
GAE
Schulman et al. introduce Generalized Advantage Estimation, the bias-variance dial that nearly every later actor-critic uses.
2016
A3C
Mnih et al. parallelise actor-critic across asynchronous CPU workers — replay-free, fast, and a new Atari state-of-the-art.
2017
A2C & PPO
OpenAI shows synchronous A2C matches A3C; PPO adds clipping and becomes the dominant on-policy method.
2018
IMPALA & SAC
IMPALA scales AC to distributed clusters with V-trace; SAC makes the entropy bonus the objective for off-policy continuous control.
2024–25
GRPO & the LLM era
Critic-free actor-critic variants (GRPO) power reasoning-model RL; the actor-critic skeleton remains everywhere.

Where actor-critic is used

  • Continuous control & robotics: actor-critic handles continuous action spaces that value-only methods can’t; SAC and TD3 (both AC) are standard for locomotion and manipulation.
  • Games: A3C set Atari records and learned to navigate 3D mazes from pixels; AC variants appear throughout game-playing RL.
  • LLM post-training: PPO (an actor-critic) is the classic RLHF optimiser; GRPO is the critic-free descendant behind reasoning models.
  • Operations & systems: resource scheduling, networking and recommendation pipelines commonly use A2C/PPO-style agents.

Building and scaling these systems — parallel environments, reward pipelines, distributed rollouts — is its own industry; see the RL environment vendors.

Limitations

  • On-policy sample inefficiency: A2C/A3C must throw away each batch after one (or a few) updates; off-policy AC methods like SAC reuse a replay buffer and are far more sample-efficient.
  • Hyperparameter sensitivity: entropy coefficient, value-loss weight, rollout length and learning rate all interact; a bad setting silently kills learning.
  • Critic bias: bootstrapping off an imperfect critic injects bias into the advantage; if the critic is badly wrong, the actor follows it off a cliff.
  • A3C’s stale gradients: asynchrony’s out-of-date updates can hurt more than the decorrelation helps, which is exactly why A2C exists.

Frequently asked questions

What’s the difference between A2C and A3C?

Same algorithm, different orchestration. A3C runs many workers that update a shared network asynchronously (no waiting, lock-free). A2C is synchronous: a coordinator waits for all workers, combines their experience into one batch, and does a single update. A2C is simpler, reproducible, GPU-friendly, and performs as well or better, so it’s usually preferred.

Why use the advantage instead of the raw reward or return?

The return tells you the outcome was good but not whether your action caused it: a state might be good regardless. The advantage $A(s,a)=Q(s,a)-V(s)$ subtracts the state’s baseline value, isolating how much better this action was than average. Because the baseline has zero expected gradient, this slashes variance while keeping the estimate unbiased.
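A tiny exact calculation makes the point. The sketch assumes a one-state problem with two actions, a softmax policy over logits (for which the gradient of $\log\pi(a)$ with respect to the logits is `onehot(a) - probs`), and made-up rewards of 10 and 11; it enumerates both actions instead of sampling, so the numbers are exact.

```python
import numpy as np

def grad_estimator_stats(probs, rewards, baseline):
    """Exact mean and total variance of g = grad log pi(a) * (r_a - baseline)."""
    probs = np.asarray(probs, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    score = np.eye(len(probs)) - probs                 # row a: grad log pi(a)
    samples = score * (rewards - baseline)[:, None]    # estimator value per action
    mean = probs @ samples                             # expectation over actions
    variance = probs @ ((samples - mean) ** 2).sum(axis=1)
    return mean, variance

probs = [0.5, 0.5]
rewards = [10.0, 11.0]                                 # both actions look "good"
value = float(np.dot(probs, rewards))                  # V(s) = 10.5
mean_raw, var_raw = grad_estimator_stats(probs, rewards, baseline=0.0)
mean_adv, var_adv = grad_estimator_stats(probs, rewards, baseline=value)
```

Both estimators have the same expected gradient, but the raw-return version has variance 55.125 while the advantage version has variance 0 in this toy case: the large shared reward level is pure noise as far as the policy gradient is concerned.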

Is PPO an actor-critic method?

Yes. PPO is an advantage actor-critic: it has the same value-function critic, uses GAE advantages, and adds an entropy bonus. Its one addition is a clipped surrogate objective that limits how far each update moves the policy, letting it safely reuse a batch for multiple epochs. A2C is essentially PPO without the clipping. See PPO.
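The clipped surrogate is short enough to write out. This sketch shows the per-sample objective only; `clip_eps=0.2` is a commonly used value assumed here for illustration, and `ratio` denotes the new policy's probability of the action divided by the old policy's.

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, clip_eps=0.2):
    """Per-sample clipped surrogate (to be maximised).

    ratio = pi_new(a|s) / pi_old(a|s); advantage is the estimate A_hat.
    """
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Take the more pessimistic of the unclipped and clipped terms.
    return np.minimum(ratio * advantage, clipped * advantage)

on_policy = ppo_clip_objective(1.0, 2.0)     # ratio 1: just the advantage
pushed_up = ppo_clip_objective(1.5, 2.0)     # gain capped at 1.2 * 2.0
pushed_down = ppo_clip_objective(0.5, -2.0)  # capped at 0.8 * -2.0
```

At `ratio == 1`, which is where the first pass over a fresh batch starts, the objective reduces to the plain advantage-weighted term, which is the sense in which A2C is PPO without the clip.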

Does the actor-critic need two separate networks?

Not necessarily. Many implementations (including A3C on Atari) share a body — common feature layers — with two small heads: a policy head and a value head. This shares representation and saves compute. Fully separate networks are also common, especially when actor and critic need different capacities or learning rates.
