- Actor-critic methods run two networks together: an actor that picks actions (the policy) and a critic that scores states (a value function).
- The critic supplies a low-variance baseline so the actor learns from the advantage — how much better an action was than expected — instead of raw returns.
- A3C (2016) parallelised this across many CPU workers running asynchronously; A2C is the simpler synchronous version that batches workers and matches its performance.
- Actor-critic is the structural backbone of modern deep RL: PPO, SAC, DDPG and IMPALA are all actor-critic at heart, and GRPO is a critic-free descendant of the same design.
What is an actor-critic method?
An actor-critic method is a reinforcement learning algorithm that learns two things at once. The actor is a policy — it looks at the current state and decides what to do. The critic is a value function — it watches the actor and estimates how good the situation is, then tells the actor whether the action it just took turned out better or worse than expected.
This split solves the central weakness of each family it descends from. Pure policy-gradient methods (like REINFORCE) learn the policy directly but estimate the gradient from full Monte-Carlo returns, which are unbiased but extremely noisy. Pure value-based methods (like Q-learning) are low-variance and sample-efficient but struggle with continuous or large action spaces. Actor-critic keeps the explicit policy of one and the bootstrapped value estimate of the other — the critic acts as a learned baseline that slashes variance without adding much bias.
The advantage: the signal that makes it work
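The policy-gradient theorem says you can improve a policy by nudging it toward actions that led to high return. Plain REINFORCE uses the full return $G_t$ as the weight:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]$$
That estimator is unbiased but has enormous variance: $G_t$ swings wildly with luck later in the episode. Subtracting a baseline $b(s_t)$ that depends only on state leaves the gradient unbiased (the baseline term has zero expectation) while shrinking variance. The standard baseline is the state value $V^\pi(s_t)$, and replacing $G_t$ with the advantage function $A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$ gives the actor-critic gradient:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^\pi(s_t, a_t)\right]$$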
In practice you don’t have $A^\pi$ directly. A3C estimates the advantage with an $n$-step return bootstrapped off the critic $V_\phi$:
$$\hat{A}_t = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n V_\phi(s_{t+n}) - V_\phi(s_t)$$
The single-step case ($n = 1$) is just the TD error $\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$, the same quantity that trains the critic. One number does double duty: it tells the critic how wrong its prediction was, and tells the actor whether to make the action more or less likely.
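As a rough illustration of the bootstrapped advantage estimate, here is a minimal NumPy sketch. The function name `n_step_advantage` and the numbers in the example are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def n_step_advantage(rewards, v_start, v_boot, gamma=0.99):
    """Advantage estimate for the first state of an n-step segment.

    rewards: the n rewards r_t, ..., r_{t+n-1}
    v_start: critic estimate V(s_t)
    v_boot:  critic estimate V(s_{t+n}), or 0.0 if s_{t+n} is terminal
    """
    rewards = np.asarray(rewards, dtype=float)
    n = len(rewards)
    discounts = gamma ** np.arange(n)                  # 1, gamma, gamma^2, ...
    n_step_return = np.sum(discounts * rewards) + gamma ** n * v_boot
    return n_step_return - v_start

# With a single reward (n = 1) this is the TD error r_t + gamma * V(s_{t+1}) - V(s_t).
td_error = n_step_advantage([1.0], v_start=2.0, v_boot=3.0, gamma=0.9)

# A longer segment leans more on observed rewards and less on the critic.
three_step = n_step_advantage([1.0, 0.0, 2.0], v_start=2.0, v_boot=3.0, gamma=0.9)
```

Longer segments weight the bootstrap value by a smaller power of the discount, which is where the reduced reliance on an imperfect critic comes from.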
Go deeper: GAE and the bias-variance dial
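The $n$-step return picks one fixed horizon: short is low-variance but biased by the imperfect critic, long is the reverse. Generalized Advantage Estimation (GAE) (Schulman et al., 2015) instead takes an exponentially-weighted average of all $n$-step estimates with a decay $\lambda$:
$$\hat{A}_t^{\mathrm{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}, \qquad \delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$$
Setting $\lambda = 0$ collapses to one-step TD (low variance, high bias); $\lambda = 1$ recovers the full Monte-Carlo advantage (high variance, no bias). A value of $\lambda \approx 0.95$ is the workhorse default in PPO and most modern actor-critics.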
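GAE is usually computed with a single backward pass over the rollout. The sketch below assumes one rollout with no episode boundary inside it; the function name `gae_advantages` and the sample numbers are illustrative, not taken from any particular library.

```python
import numpy as np

def gae_advantages(rewards, values, bootstrap_value, gamma=0.99, lam=0.95):
    """GAE over one rollout.

    values[t] is the critic estimate V(s_t) for each visited state;
    bootstrap_value is V(s_T) for the state after the last step (0.0 if terminal).
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    next_values = np.append(values[1:], bootstrap_value)
    deltas = rewards + gamma * next_values - values    # one-step TD errors
    advantages = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        # A_t = delta_t + (gamma * lambda) * A_{t+1}
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

rewards = [1.0, 0.0, 2.0]
values = [0.5, 1.0, 1.5]
td_only = gae_advantages(rewards, values, 0.0, gamma=0.9, lam=0.0)      # one-step TD errors
monte_carlo = gae_advantages(rewards, values, 0.0, gamma=0.9, lam=1.0)  # return minus V(s_t)
```

The two calls at the end show the extremes of the dial: with `lam=0.0` each advantage is just its own TD error, and with `lam=1.0` it is the discounted return minus the critic's estimate.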
A3C: Asynchronous Advantage Actor-Critic
A3C was introduced by Mnih et al. at DeepMind in the 2016 ICML paper Asynchronous Methods for Deep Reinforcement Learning. Its insight was about infrastructure as much as algorithm. At the time, deep RL relied on a large experience replay buffer (as in DQN) to decorrelate the highly correlated stream of consecutive transitions — without that, the network would chase its own tail and diverge.
A3C replaced replay with parallelism. Run many actor-learners in parallel, each with its own copy of the policy and its own environment instance, each exploring a different part of the state space. At any instant the workers are seeing decorrelated data simply because they’re in different places — so you get the stabilising effect of replay for free, and it works for on-policy methods that replay can’t support.
1. A worker copies the current global parameters into its local actor-critic, then runs its own environment for up to $t_{\max}$ steps (typically 5–20), collecting states, actions and rewards on-policy.
2. At the end of the rollout it bootstraps with the critic, setting $R = V_\phi(s_t)$ for a non-terminal state or $R = 0$ if the episode ended, then walks backward computing the $n$-step return and advantage at each step.
3. Each worker accumulates a policy gradient $\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,(R_t - V_\phi(s_t))$, a value gradient on $(R_t - V_\phi(s_t))^2$, and an entropy bonus $\beta H(\pi_\theta(\cdot \mid s_t))$ that rewards keeping the policy spread out, preventing premature collapse to a single action.
4. The worker applies its accumulated gradients to the shared global network without locking, so other workers may update in between (a “Hogwild!”-style lock-free update). It then loops back to step 1. Many workers doing this concurrently is what stabilises and accelerates training.
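The bootstrapping and backward walk described above can be sketched in a few lines. This is a simplified illustration for a single worker's rollout, not the paper's full procedure; the name `rollout_targets` and the example values are assumptions made for the sketch.

```python
import numpy as np

def rollout_targets(rewards, values, last_value, episode_ended, gamma=0.99):
    """Return targets and advantages for one worker rollout.

    rewards[t], values[t]: reward and critic estimate V(s_t) at each rollout step
    last_value: critic estimate for the state reached after the final step
    episode_ended: True if that final state is terminal
    """
    # Bootstrap from the critic unless the episode ended.
    R = 0.0 if episode_ended else last_value
    returns = np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        R = rewards[t] + gamma * R        # discounted return, built back to front
        returns[t] = R
    advantages = returns - np.asarray(values, dtype=float)
    return returns, advantages

rewards = [1.0, 0.0, 2.0]
values = [0.5, 1.0, 1.5]
returns_boot, adv_boot = rollout_targets(rewards, values, 4.0, False, gamma=0.9)
returns_end, adv_end = rollout_targets(rewards, values, 4.0, True, gamma=0.9)
```

Note that the earliest step in the rollout gets the longest lookahead and the last step gets a one-step target, so the effective $n$ varies across the rollout.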
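A3C usually shares the lower layers of the actor and critic networks (e.g. the convolutional stack for Atari), with two heads on top: a softmax policy head and a single linear value head. The combined loss for one worker is:
$$L = -\log \pi_\theta(a_t \mid s_t)\, \hat{A}_t + c_v \left(R_t - V_\phi(s_t)\right)^2 - \beta\, H\left(\pi_\theta(\cdot \mid s_t)\right)$$
where $\hat{A}_t = R_t - V_\phi(s_t)$ is treated as a constant in the policy term, $c_v$ weights the value loss and $\beta$ weights the entropy bonus.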
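A sketch of how such a combined loss might be evaluated for one sample, using plain NumPy in place of a real network. The function name and the coefficient values `value_coef=0.5` and `entropy_coef=0.01` are illustrative assumptions; in an autodiff framework the advantage would also be detached from the graph before it enters the policy term.

```python
import numpy as np

def actor_critic_loss(logits, value, action, ret, value_coef=0.5, entropy_coef=0.01):
    """Combined loss for one (state, action) sample from a two-head network.

    logits: unnormalised action scores from the policy head
    value:  scalar V(s) from the value head
    action: index of the action that was taken
    ret:    bootstrapped return target R for this state
    """
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()                       # numerically stable softmax
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    probs = np.exp(log_probs)

    advantage = ret - value                               # constant w.r.t. the actor
    policy_loss = -log_probs[action] * advantage          # raise log-prob if advantage > 0
    value_loss = (ret - value) ** 2                       # critic regression error
    entropy = -np.sum(probs * log_probs)                  # larger when policy is spread out
    total = policy_loss + value_coef * value_loss - entropy_coef * entropy
    return total, policy_loss, value_loss, entropy

total, policy_loss, value_loss, entropy = actor_critic_loss(
    logits=[0.0, 0.0], value=1.0, action=0, ret=2.0
)
```

The entropy term enters with a minus sign because the loss is minimised while entropy is meant to be kept high.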
A2C: the synchronous cousin
When researchers reproduced A3C they found something surprising: the asynchrony wasn’t actually pulling its weight. The noise from lock-free, out-of-date (“stale”) gradients was a cost, not a benefit. A2C — Advantage Actor-Critic — is the synchronous variant: a coordinator waits for every worker to finish its rollout, averages their experience into one big batch, performs a single update, and broadcasts the new weights.
OpenAI introduced A2C in its Baselines: ACKTR & A2C release and reported that the synchronous version matches or beats A3C while being simpler to implement and far better at exploiting a GPU — large synchronised batches are exactly what GPUs want, whereas A3C’s many small asynchronous CPU updates leave a GPU starved.
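One way to see why the synchronous scheme is simple: when every worker contributes an equal-length rollout, averaging per-worker gradients gives the same result as one gradient over the concatenated batch. The toy below checks this for a linear critic with a squared-error loss; the data, shapes, learning rate and the helper `value_grad` are all hypothetical.

```python
import numpy as np

def value_grad(weights, states, returns):
    """Gradient of 0.5 * mean((states @ weights - returns)^2) for a linear critic."""
    errors = states @ weights - returns
    return states.T @ errors / len(returns)

rng = np.random.default_rng(0)
weights = np.zeros(3)

# Hypothetical data: 4 workers, each with a 5-step rollout of 3-dimensional features.
worker_states = [rng.normal(size=(5, 3)) for _ in range(4)]
worker_returns = [rng.normal(size=5) for _ in range(4)]

# View 1: the coordinator averages the gradients reported by each worker.
avg_grad = np.mean(
    [value_grad(weights, s, r) for s, r in zip(worker_states, worker_returns)], axis=0
)

# View 2: one gradient computed on the single concatenated batch.
batch_grad = value_grad(
    weights, np.concatenate(worker_states), np.concatenate(worker_returns)
)

# A single synchronous update; every worker would then receive the new weights.
new_weights = weights - 0.1 * batch_grad
```

The second view is the one that maps well onto a GPU, since it is a single large matrix operation rather than many small ones.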
A3C (asynchronous): Workers update a shared network independently, no waiting. Strong on CPU-only clusters; tolerant of slow or heterogeneous workers. Cost: stale gradients add noise, and results are hard to reproduce exactly.
A2C (synchronous): A coordinator batches all workers and does one update per step. Deterministic, reproducible, GPU-friendly, simpler code. Cost: throughput is capped by the slowest worker each round.
| | A3C | A2C |
|---|---|---|
| Update timing | Asynchronous, lock-free | Synchronous, batched |
| Gradient freshness | Can be stale | Always on-policy |
| Hardware sweet spot | Many CPU cores | GPU + parallel envs |
| Reproducibility | Hard (race conditions) | Deterministic |
| Reported performance | Strong | Equal or better |
| Code complexity | Higher | Lower |
The practical verdict held: the field largely moved to synchronous, batched rollouts. A2C is the template most modern on-policy algorithms — including PPO — actually use under the hood.
Actor-critic vs. its neighbours
| Method | Family | Key idea relative to A2C/A3C |
|---|---|---|
| A2C / A3C | On-policy AC | The baseline: advantage actor-critic with parallel rollouts |
| PPO | On-policy AC | Adds a clipped surrogate objective so updates can’t move too far; the modern default |
| DDPG / TD3 | Off-policy AC | Deterministic actor + Q-critic for continuous control; uses replay |
| SAC | Off-policy AC | Maximum-entropy objective; an entropy bonus baked into the reward, not just a regulariser |
| IMPALA | Distributed AC | Scales A3C-style learning with V-trace off-policy correction for stale data |
| GRPO | On-policy, critic-free | Drops the value network entirely; estimates advantage from a group of sampled answers |
The line to PPO is the one to remember: PPO is an advantage actor-critic with the same value head, the same GAE advantages, and the same entropy bonus — it just wraps the actor update in a clipping mechanism that lets you reuse each batch for several epochs safely. A2C is “PPO without the clip.” Going the other direction, GRPO — central to today’s reasoning-model training — is what you get when you keep the actor and the advantage but delete the critic, replacing it with a group-relative baseline.
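Both ends of that lineage fit in a few lines. The sketch below shows PPO's clipped surrogate objective (to be maximised) and a group-relative advantage of the kind GRPO uses, where each sampled answer's reward is normalised against the mean and standard deviation of its group. The function names, `clip_eps=0.2` and the sample rewards are illustrative assumptions.

```python
import numpy as np

def ppo_clip_objective(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective, averaged over a batch (higher is better)."""
    ratio = np.exp(np.asarray(new_log_probs, dtype=float)
                   - np.asarray(old_log_probs, dtype=float))
    advantages = np.asarray(advantages, dtype=float)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Taking the minimum removes any incentive to push the ratio past the clip range.
    return np.mean(np.minimum(unclipped, clipped))

def group_relative_advantages(rewards, eps=1e-8):
    """Critic-free advantages: rewards normalised within one group of samples."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Probability ratio of 1.5: the gain is capped for a positive advantage,
# while a negative advantage keeps its full penalty.
capped_gain = ppo_clip_objective([np.log(1.5)], [0.0], [1.0])
full_penalty = ppo_clip_objective([np.log(1.5)], [0.0], [-1.0])

# Four sampled answers to one prompt, two judged correct.
group_adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Dropping the `np.clip` line and using the batch only once gives back an A2C-style update, which is the sense in which A2C is PPO without the clip.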
Go deeper: why entropy regularization matters here
A pure policy gradient can converge greedily onto whatever action looks best early, before it has explored enough. This is a self-reinforcing trap, since a narrowing policy generates less diverse data. A3C adds $-\beta H(\pi_\theta(\cdot \mid s_t))$ to the loss, where $H$ is the policy’s entropy and $\beta$ is a small positive coefficient; this gently rewards keeping probability mass spread across actions. It’s a cheap, effective nudge toward exploration and a direct ancestor of SAC’s full maximum-entropy framework, which makes that bonus the central objective rather than a side term.
Where actor-critic is used
- Continuous control & robotics — actor-critic handles continuous action spaces that value-only methods can’t; SAC and TD3 (both AC) are standard for locomotion and manipulation.
- Games — A3C set Atari records and learned to navigate 3D mazes from pixels; AC variants appear throughout game-playing RL.
- LLM post-training — PPO (an actor-critic) is the classic RLHF optimiser; GRPO is the critic-free descendant behind reasoning models.
- Operations & systems — resource scheduling, networking and recommendation pipelines commonly use A2C/PPO-style agents.
Building and scaling these systems — parallel environments, reward pipelines, distributed rollouts — is its own industry; see the RL environment vendors.
Limitations
- On-policy sample inefficiency — A2C/A3C must throw away each batch after one (or a few) updates; off-policy AC methods like SAC reuse a replay buffer and are far more sample-efficient.
- Hyperparameter sensitivity — entropy coefficient, value-loss weight, rollout length and learning rate all interact; a bad setting silently kills learning.
- Critic bias — bootstrapping off an imperfect critic injects bias into the advantage; if the critic is badly wrong, the actor follows it off a cliff.
- A3C’s stale gradients — asynchrony’s out-of-date updates can hurt more than the decorrelation helps, which is exactly why A2C exists.
Frequently asked questions
What’s the difference between A2C and A3C?
Same algorithm, different orchestration. A3C runs many workers that update a shared network asynchronously (no waiting, lock-free). A2C is synchronous: a coordinator waits for all workers, averages their experience into one batch, and does a single update. A2C is simpler, reproducible, GPU-friendly, and performs as well or better — so it’s usually preferred.
Why use the advantage instead of the raw reward or return?
The return tells you the outcome was good but not whether your action caused it — a state might be good regardless. The advantage subtracts the state’s baseline value, isolating how much better this action was than average. Because the baseline has zero expected gradient, this slashes variance while keeping the estimate unbiased.
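The unbiasedness and variance claims can be checked exactly on a toy problem. The sketch below assumes a hypothetical one-step task with two actions and a softmax policy, and enumerates both actions instead of sampling, so the mean and variance of the gradient estimator are computed without noise.

```python
import numpy as np

probs = np.array([0.5, 0.5])        # softmax policy with equal logits
rewards = np.array([10.0, 11.0])    # hypothetical return for each action

def grad_stats(baseline):
    """Exact mean and total variance of the policy-gradient estimator."""
    # Score function of a softmax policy w.r.t. its logits: onehot(a) - probs.
    scores = np.eye(2) - probs
    # Row a holds the gradient estimate obtained when action a is sampled.
    samples = scores * (rewards - baseline)[:, None]
    mean = probs @ samples
    variance = probs @ np.sum((samples - mean) ** 2, axis=1)
    return mean, variance

mean_raw, var_raw = grad_stats(baseline=0.0)              # weight by raw return
mean_adv, var_adv = grad_stats(baseline=probs @ rewards)  # weight by advantage
```

Both estimators share the same expected gradient, but the raw-return version carries a large variance driven by the size of the returns, whereas the advantage version depends only on the gap between the two actions.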
Is PPO an actor-critic method?
Yes. PPO is an advantage actor-critic: it has the same value-function critic, uses GAE advantages, and adds an entropy bonus. Its one addition is a clipped surrogate objective that limits how far each update moves the policy, letting it safely reuse a batch for multiple epochs. A2C is essentially PPO without the clipping. See PPO.
Does the actor-critic need two separate networks?
Not necessarily. Many implementations (including A3C on Atari) share a body — common feature layers — with two small heads: a policy head and a value head. This shares representation and saves compute. Fully separate networks are also common, especially when actor and critic need different capacities or learning rates.
Key papers
- Asynchronous Methods for Deep Reinforcement Learning — Mnih et al., 2016 — the A3C paper.
- High-Dimensional Continuous Control Using Generalized Advantage Estimation — Schulman et al., 2015 — GAE.
- Policy Gradient Methods for Reinforcement Learning with Function Approximation — Sutton et al., 2000 — the theoretical foundation.
- OpenAI Baselines: ACKTR & A2C — OpenAI, 2017 — the synchronous A2C.
- Soft Actor-Critic — Haarnoja et al., 2018 — maximum-entropy off-policy AC.
Related
Policy gradients · Value functions · PPO · Q-learning · Deep Q-networks · GRPO · Exploration vs. exploitation · What is reinforcement learning?