Actor-Critic Methods (A2C, A3C), Explained

Key takeaways

Actor-critic methods run two networks together: an actor that picks actions (the policy) and a critic that scores states (a value function).
The critic supplies a low-variance baseline so the actor learns from the advantage — how much better an action was than expected — instead of raw returns.
A3C (2016) parallelised this across many CPU workers running asynchronously; A2C is the simpler synchronous version that batches workers and matches its performance.
Actor-critic is the structural backbone of modern deep RL: PPO, SAC, DDPG and IMPALA are all actor-critic at heart, and GRPO is a critic-free descendant of the same design.

What is an actor-critic method?

An actor-critic method is a reinforcement learning algorithm that learns two things at once. The actor is a policy — it looks at the current state and decides what to do. The critic is a value function — it watches the actor and estimates how good the situation is, then tells the actor whether the action it just took turned out better or worse than expected.

This split solves the central weakness of each family it descends from. Pure policy-gradient methods (like REINFORCE) learn the policy directly but estimate the gradient from full Monte-Carlo returns, which are unbiased but extremely noisy. Pure value-based methods (like Q-learning) are low-variance and sample-efficient but struggle with continuous or large action spaces. Actor-critic keeps the explicit policy of one and the bootstrapped value estimate of the other — the critic acts as a learned baseline that slashes variance without adding much bias.

The actor-critic loop. The actor selects an action; the environment returns a reward and next state; the critic computes a TD error (the advantage signal) that updates both networks — the critic to predict value better, the actor to favour above-average actions.
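The loop in the caption above can be sketched on the smallest possible problem. This is a minimal illustration, not a reference implementation: the two-armed bandit, the learning rates, the step count and the function names are all assumptions made for the example. Episodes last one step, so the bootstrap term in the TD error $\delta = r + \gamma V(s') - V(s)$ is zero here.

```python
import math
import random


def softmax(logits):
    """Turn a list of logits into action probabilities."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit_actor_critic(steps=2000, actor_lr=0.5, critic_lr=0.1, seed=0):
    """One-step actor-critic on a hypothetical bandit: arm 1 pays 1, arm 0 pays 0."""
    rng = random.Random(seed)
    logits = [0.0, 0.0]   # actor parameters: one logit per action
    value = 0.0           # critic: value estimate of the single state
    gamma = 0.99
    for _ in range(steps):
        probs = softmax(logits)
        action = 0 if rng.random() < probs[0] else 1      # actor selects an action
        reward = float(action)                            # environment returns a reward
        next_value = 0.0                                  # terminal: nothing to bootstrap from
        td_error = reward + gamma * next_value - value    # the critic's surprise
        value += critic_lr * td_error                     # critic: predict value better
        for j in range(2):                                # actor: favour above-average actions
            grad_log_pi = (1.0 if j == action else 0.0) - probs[j]
            logits[j] += actor_lr * td_error * grad_log_pi
    return softmax(logits), value


final_probs, final_value = train_bandit_actor_critic()
```

Both updates are driven by the same `td_error`: a positive value raises the probability of the chosen arm, a negative one lowers it, so the policy drifts toward the paying arm.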

The advantage: the signal that makes it work

The policy-gradient theorem says you can improve a policy by nudging it toward actions that led to high return. Plain REINFORCE uses the full return $G_t$ as the weight:

$$\nabla_\theta J(\theta) = \mathbb{E}\big[\,\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,G_t\,\big]$$

That estimator is unbiased but has enormous variance: $G_t$ swings wildly with luck later in the episode. Subtracting a baseline $b(s_t)$ that depends only on the state leaves the gradient unbiased (the baseline term has zero expectation) while shrinking variance. The standard choice of baseline is the state value $V(s_t)$. Because $G_t$ is a sample of $Q(s_t, a_t)$, the weight $G_t - V(s_t)$ is a noisy estimate of the advantage function, and replacing it with the advantage gives the actor-critic gradient:

$$A(s_t, a_t) = Q(s_t, a_t) - V(s_t), \qquad \nabla_\theta J(\theta) = \mathbb{E}\big[\,\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,A(s_t, a_t)\,\big]$$
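The effect of the baseline can be checked exactly on a toy case. The sketch below assumes a one-state problem with a uniform softmax policy over three actions and made-up deterministic rewards; `gradient_stats` is a hypothetical helper that computes the mean and variance of the single-sample gradient estimate for one logit, with and without subtracting $V(s)$.

```python
def gradient_stats(probs, rewards, baseline, j=0):
    """Exact mean and variance of the score-function estimator for logit j.

    For a softmax policy, d log pi(a) / d logit_j = 1[a == j] - pi_j, so the
    single-sample estimate is (1[a == j] - pi_j) * (reward_a - baseline).
    """
    samples = [((1.0 if a == j else 0.0) - probs[j]) * (rewards[a] - baseline)
               for a in range(len(probs))]
    mean = sum(p * g for p, g in zip(probs, samples))
    second_moment = sum(p * g * g for p, g in zip(probs, samples))
    return mean, second_moment - mean * mean


probs = [1 / 3, 1 / 3, 1 / 3]      # hypothetical uniform policy
rewards = [1.0, 2.0, 6.0]          # hypothetical deterministic payoffs
state_value = sum(p * r for p, r in zip(probs, rewards))   # V(s), about 3.0

mean_raw, var_raw = gradient_stats(probs, rewards, baseline=0.0)
mean_adv, var_adv = gradient_stats(probs, rewards, baseline=state_value)
```

In this example the expected gradient is $-2/3$ either way, while the variance falls from about 1.19 to about 0.52 once the state value is subtracted.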

In practice you don’t have $Q$ directly. A3C estimates the advantage with an $n$-step return bootstrapped off the critic, written here with horizon $k$:

$$\hat{A}_t = \sum_{i=0}^{k-1}\gamma^i r_{t+i} + \gamma^k V(s_{t+k};w) - V(s_t;w)$$

The single-step case ($k=1$) is just the TD error $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$, the same quantity that trains the critic. One number does double duty: it tells the critic how wrong its prediction was, and tells the actor whether to make the action more or less likely.
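A minimal sketch of the bootstrapped estimate, assuming the rollout is stored as plain Python lists. The function name, the example numbers and the convention that `values` carries one extra entry for the state after the last reward are assumptions for illustration.

```python
def k_step_advantage(rewards, values, t, k, gamma=0.99):
    """k-step advantage estimate at time t.

    rewards[i] is r_i; values[i] is the critic's V(s_i), with one extra
    entry for the state after the last reward (0.0 if that state is terminal).
    """
    k = min(k, len(rewards) - t)              # do not look past the rollout
    ret = sum(gamma ** i * rewards[t + i] for i in range(k))
    ret += gamma ** k * values[t + k]         # bootstrap off the critic
    return ret - values[t]


rewards = [1.0, 0.0, 2.0]                     # hypothetical rollout
values = [0.5, 1.0, 0.0, 0.0]                 # hypothetical critic outputs
one_step = k_step_advantage(rewards, values, t=0, k=1, gamma=0.9)
three_step = k_step_advantage(rewards, values, t=0, k=3, gamma=0.9)
```

With these numbers the one-step estimate is the TD error, $1 + 0.9 \cdot 1.0 - 0.5 = 1.4$, while the three-step estimate leans on actual rewards and comes out at $2.12$.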

Go deeper: GAE and the bias-variance dial

The $n$-step return picks one fixed horizon $k$: a short $k$ is low-variance but biased by the imperfect critic, while a long $k$ is the reverse. Generalized Advantage Estimation (GAE) (Schulman et al., 2015) instead takes an exponentially weighted average of all $n$-step estimates with a decay $\lambda$:

$$\hat{A}_t^{\text{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty}(\gamma\lambda)^l\,\delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

$\lambda=0$ collapses to one-step TD (low variance, high bias); $\lambda=1$ recovers the full Monte-Carlo advantage (high variance, no bias). $\lambda\approx0.95$ is the workhorse default in PPO and most modern actor-critics.
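On a finite rollout the infinite sum is truncated at the last step, and it is usually computed with a backward recursion. The sketch below is one common way to write it, under the assumption that `values` holds one extra bootstrap entry for the state after the final reward; the function name is illustrative.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    values has len(rewards) + 1 entries; the last is the bootstrap value of
    the state after the final reward (0.0 if terminal).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error at step t
        running = delta + gamma * lam * running                  # (gamma * lam)-discounted sum
        advantages[t] = running
    return advantages


rewards = [1.0, 0.0, 2.0]                     # hypothetical rollout
values = [0.5, 1.0, 0.0, 0.0]                 # hypothetical critic outputs
td_only = gae_advantages(rewards, values, gamma=0.9, lam=0.0)
monte_carlo = gae_advantages(rewards, values, gamma=0.9, lam=1.0)
```

Setting `lam=0.0` returns the per-step TD errors, and `lam=1.0` returns the discounted return to the end of the rollout minus $V(s_t)$, matching the two limits described above.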

A3C: Asynchronous Advantage Actor-Critic

A3C was introduced by Mnih et al. at DeepMind in the 2016 ICML paper Asynchronous Methods for Deep Reinforcement Learning. Its insight was about infrastructure as much as algorithm. At the time, deep RL relied on a large experience replay buffer (as in DQN) to decorrelate the highly correlated stream of consecutive transitions — without that, the network would chase its own tail and diverge.

A3C replaced replay with parallelism. Run many actor-learners in parallel, each with its own copy of the policy and its own environment instance, each exploring a different part of the state space. At any instant the workers are seeing decorrelated data simply because they’re in different places — so you get the stabilising effect of replay for free, and it works for on-policy methods that replay can’t support.

A3C architecture: each worker holds a local copy of the actor-critic, runs its own environment for a few steps, computes gradients, and asynchronously applies them to a shared global network — then pulls the fresh weights and repeats.
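The pull, compute, push cycle in that caption can be sketched with ordinary threads. This is a toy under stated assumptions: the "gradient" is that of a simple quadratic loss standing in for a real rollout gradient, and the parameter values, worker count and function names are invented for the example. CPython threads do not run Python bytecode in parallel, so the sketch shows the lock-free update pattern rather than any speed-up.

```python
import threading


def worker(shared, target, steps, lr):
    """Repeat: pull weights, compute a gradient locally, push it without a lock."""
    for _ in range(steps):
        local = list(shared)                               # sync a local copy
        grads = [w - t for w, t in zip(local, target)]     # stand-in for a rollout gradient
        for i, g in enumerate(grads):
            shared[i] -= lr * g                            # lock-free write; may race with others


def run_async(num_workers=4, steps=200, lr=0.1):
    shared = [5.0, -3.0]      # hypothetical global parameters
    target = [1.0, 2.0]       # minimiser of the toy loss 0.5 * ||w - target||^2
    threads = [threading.Thread(target=worker, args=(shared, target, steps, lr))
               for _ in range(num_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return shared, target


final_params, target = run_async()
```

Because no worker waits for another, a gradient may be computed from weights that have already changed by the time it is applied, which is the staleness that asynchronous training has to tolerate.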

Each worker syncs and rolls out

A worker copies the current global parameters into its local actor-critic, then runs its own environment for up to $t_{\max}$ steps (typically 5–20), collecting states, actions and rewards on-policy.

Bootstrap the return and compute advantages

At the end of the rollout the worker bootstraps with the critic: $R = V(s_{t_{\max}};w)$ if the last state is non-terminal, or $R = 0$ if the episode ended. It then walks backward through the rollout, computing the $n$-step return $R_t$ and the advantage $\hat A_t = R_t - V(s_t;w)$ at each step.

Accumulate the gradient (three terms)

Each worker accumulates a policy gradient $\nabla_\theta \log\pi_\theta(a_t|s_t)\,\hat A_t$, a value gradient on $(R_t - V(s_t;w))^2$, and an entropy bonus $\beta\,\nabla_\theta H(\pi_\theta(\cdot|s_t))$ that rewards keeping the policy spread out, preventing premature collapse to a single action.

Apply asynchronously, then repeat

The worker applies its accumulated gradients to the shared global network without locking, so other workers may update in between (a “Hogwild!”-style lock-free update). It then returns to the first step, syncing with the global parameters and starting a new rollout. Many workers doing this concurrently is what stabilises and accelerates training.
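The bootstrapping step in the walkthrough above reduces to a short backward pass. A minimal sketch, assuming the rollout is held in Python lists; the function and argument names are illustrative.

```python
def rollout_returns_and_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """Walk a rollout backward to get the return and advantage at each step.

    rewards[t] and values[t] = V(s_t) cover the rollout; bootstrap_value is
    V of the state after the last step, or 0.0 if the episode terminated.
    """
    returns = [0.0] * len(rewards)
    advantages = [0.0] * len(rewards)
    running = bootstrap_value
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running      # R_t = r_t + gamma * R_{t+1}
        returns[t] = running
        advantages[t] = running - values[t]         # A_t = R_t - V(s_t)
    return returns, advantages


# Hypothetical three-step rollout that ended in a terminal state.
returns, advantages = rollout_returns_and_advantages(
    rewards=[1.0, 0.0, 2.0], values=[0.5, 1.0, 0.0], bootstrap_value=0.0, gamma=0.9)
```

Each step ends up with a different effective horizon: the first step sees three real rewards before the bootstrap, the last step sees only one.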

A3C usually shares the lower layers of the actor and critic networks (e.g. the convolutional stack for Atari), with two heads on top: a softmax policy head and a single linear value head. The combined loss for one worker is:

$$\mathcal{L} = \underbrace{-\log\pi_\theta(a_t \mid s_t)\,\hat A_t}_{\text{actor}} \;+\; \underbrace{c_v\,(R_t - V(s_t;w))^2}_{\text{critic}} \;-\; \underbrace{\beta\,H(\pi_\theta(\cdot \mid s_t))}_{\text{entropy bonus}}$$
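For a single time step and a softmax policy head, the three terms can be evaluated as follows. This is a sketch only: the function name and the default coefficients `value_coef=0.5` and `entropy_coef=0.01` are assumptions chosen for illustration, and a real implementation would compute this in an autodiff framework with the advantage held constant.

```python
import math


def a3c_loss(logits, action, advantage, ret, value, value_coef=0.5, entropy_coef=0.01):
    """Per-step loss: actor term + weighted critic term - entropy bonus."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    log_probs = [z - log_z for z in logits]                  # log-softmax
    entropy = -sum(math.exp(lp) * lp for lp in log_probs)    # H(pi(.|s))
    actor = -log_probs[action] * advantage    # advantage is treated as a constant
    critic = value_coef * (ret - value) ** 2
    return actor + critic - entropy_coef * entropy


# Hypothetical inputs: a uniform two-action policy, return 1.0, critic estimate 0.5.
loss = a3c_loss(logits=[0.0, 0.0], action=0, advantage=1.0, ret=1.0, value=0.5)
```

The entropy term enters with a minus sign, so a more spread-out policy lowers the loss.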

[Stat cards; the figures did not survive extraction. Labels: CPU cores A3C used (no GPU required); training time vs. prior Atari state-of-the-art; Atari games where it beat the prior average score.]

A2C: the synchronous cousin

When researchers reproduced A3C they found something surprising: the asynchrony wasn’t actually pulling its weight. The noise from lock-free, out-of-date (“stale”) gradients was a cost, not a benefit. A2C (Advantage Actor-Critic) is the synchronous variant: a coordinator waits for every worker to finish its rollout, combines their experience into one big batch, performs a single update averaged over that batch, and broadcasts the new weights.

OpenAI introduced A2C in its Baselines: ACKTR & A2C release and reported that the synchronous version matches or beats A3C while being simpler to implement and far better at exploiting a GPU — large synchronised batches are exactly what GPUs want, whereas A3C’s many small asynchronous CPU updates leave a GPU starved.
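The synchronous pattern can be sketched on a toy quadratic loss. Everything here is an assumption made for illustration: the parameter values, the small per-worker offset standing in for different environment experience, and the function names. The point is only the control flow: every gradient is computed from the same current weights, then averaged into one update.

```python
def a2c_update(params, worker_grads, lr=0.1):
    """One synchronous step: average every worker's gradient, then update once."""
    num_workers = len(worker_grads)
    avg = [sum(g[i] for g in worker_grads) / num_workers
           for i in range(len(params))]
    return [p - lr * a for p, a in zip(params, avg)]


def train_sync(num_workers=4, rounds=200, lr=0.1):
    params = [5.0, -3.0]      # hypothetical shared parameters
    target = [1.0, 2.0]       # minimiser of the toy loss 0.5 * ||w - target||^2
    for _ in range(rounds):
        # Every worker starts from the SAME current parameters, so no gradient
        # is stale. The per-worker offset mimics differing experience.
        worker_grads = [[p - t + 0.01 * (k - (num_workers - 1) / 2)
                         for p, t in zip(params, target)]
                        for k in range(num_workers)]
        params = a2c_update(params, worker_grads, lr)   # one update, then broadcast
    return params, target


final_params, target = train_sync()
```

Given the same inputs this loop always produces the same result, which is the reproducibility advantage of the synchronous design; the price is that each round waits for the slowest worker.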

A3C — asynchronous

Workers update a shared network independently, no waiting. Strong on CPU-only clusters; tolerant of slow or heterogeneous workers. Cost: stale gradients add noise, and results are hard to reproduce exactly.

A2C — synchronous

A coordinator batches all workers and does one update per step. Deterministic, reproducible, GPU-friendly, simpler code. Cost: throughput is capped by the slowest worker each round.

Aspect	A3C	A2C
Update timing	Asynchronous, lock-free	Synchronous, batched
Gradient freshness	Can be stale	Always on-policy
Hardware sweet spot	Many CPU cores	GPU + parallel envs
Reproducibility	Hard (race conditions)	Deterministic
Reported performance	Strong	Equal or better
Code complexity	Higher	Lower

The practical verdict held: the field largely moved to synchronous, batched rollouts. A2C is the template most modern on-policy algorithms — including PPO — actually use under the hood.

Actor-critic vs. its neighbours

Method	Family	Key idea relative to A2C/A3C
A2C / A3C	On-policy AC	The baseline: advantage actor-critic with parallel rollouts
PPO	On-policy AC	Adds a clipped surrogate objective so updates can’t move too far; the modern default
DDPG / TD3	Off-policy AC	Deterministic actor + Q-critic for continuous control; uses replay
SAC	Off-policy AC	Maximum-entropy objective; an entropy bonus baked into the reward, not just a regulariser
IMPALA	Distributed AC	Scales A3C-style learning with V-trace off-policy correction for stale data
GRPO	On-policy, critic-free	Drops the value network entirely; estimates advantage from a group of sampled answers

The line to PPO is the one to remember: PPO is an advantage actor-critic with the same value head, the same GAE advantages, and the same entropy bonus — it just wraps the actor update in a clipping mechanism that lets you reuse each batch for several epochs safely. A2C is “PPO without the clip.” Going the other direction, GRPO — central to today’s reasoning-model training — is what you get when you keep the actor and the advantage but delete the critic, replacing it with a group-relative baseline.
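Both relationships fit in a few lines. The sketch below shows, per sample, the unclipped surrogate next to PPO's clipped version, and a GRPO-style group-relative advantage. The function names, the `clip_eps=0.2` default and the use of the population standard deviation with a small `eps` are assumptions for illustration rather than any library's API.

```python
import statistics


def a2c_term(ratio, advantage):
    """Unclipped surrogate; its gradient at ratio = 1 is the A2C policy gradient."""
    return ratio * advantage


def ppo_clipped_term(ratio, advantage, clip_eps=0.2):
    """PPO's clipped surrogate for one sample (a quantity to maximise)."""
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style baseline: standardise rewards within one group of samples."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# ratio = pi_new(a|s) / pi_old(a|s); hypothetical values.
gain_capped = ppo_clipped_term(ratio=1.5, advantage=1.0)      # capped at 1.2
loss_kept = ppo_clipped_term(ratio=1.5, advantage=-1.0)       # stays at -1.5
group_adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])   # no critic involved
```

The clip removes the incentive to push the ratio further once an update has already moved the policy by more than `clip_eps` in the favourable direction, while unfavourable moves are still penalised in full. The group-relative version replaces $V(s)$ with the mean reward of the sampled group.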

Go deeper: why entropy regularization matters here

A pure policy gradient can converge greedily onto whatever action looks best early, before it has explored enough — a self-reinforcing trap, since a narrowing policy generates less diverse data. A3C adds $-\beta\,H(\pi)$ to the loss, where $H$ is the policy’s entropy; this gently rewards keeping probability mass spread across actions. It’s a cheap, effective nudge toward exploration and a direct ancestor of SAC’s full maximum-entropy framework, which makes that bonus the central objective rather than a side term.
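For a softmax policy the direction of this nudge can be written down in closed form: $\partial H / \partial z_j = -\pi_j(\log \pi_j + H)$, where $z_j$ is the logit of action $j$. The sketch below evaluates it for a made-up peaked policy; the helper names and the logits are assumptions for illustration.

```python
import math


def softmax(logits):
    """Turn a list of logits into action probabilities."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy H(pi) in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_grad_wrt_logits(logits):
    """dH/dz_j = -pi_j * (log pi_j + H) for a softmax policy."""
    probs = softmax(logits)
    h = entropy(probs)
    return [-p * (math.log(p) + h) for p in probs]


peaked_logits = [2.0, 0.0, 0.0]    # hypothetical policy favouring action 0
grad = entropy_grad_wrt_logits(peaked_logits)
```

Here the gradient is negative for the dominant action and positive for the other two, so following the entropy bonus shifts probability mass back toward the neglected actions.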

A short history

1983

Actor-critic, the original

Barto, Sutton & Anderson’s “adaptive critic” on the cart-pole task — the two-structure idea decades before deep nets.

2000

Policy gradient theorem

Sutton et al. formalise policy gradients with function approximation, giving actor-critic its theoretical footing and the baseline trick.

2015

GAE

Schulman et al. introduce Generalized Advantage Estimation, the bias-variance dial that nearly every later actor-critic uses.

2016

A3C

Mnih et al. parallelise actor-critic across asynchronous CPU workers — replay-free, fast, and a new Atari state-of-the-art.

2017

A2C & PPO

OpenAI shows synchronous A2C matches A3C; PPO adds clipping and becomes the dominant on-policy method.

2018

IMPALA & SAC

IMPALA scales AC to distributed clusters with V-trace; SAC makes the entropy bonus the objective for off-policy continuous control.

2024–25

GRPO & the LLM era

Critic-free descendants of actor-critic (GRPO) power reasoning-model RL; the actor-critic skeleton remains everywhere.

Where actor-critic is used

Continuous control & robotics — actor-critic handles continuous action spaces that value-only methods can’t; SAC and TD3 (both AC) are standard for locomotion and manipulation.
Games — A3C set Atari records and learned to navigate 3D mazes from pixels; AC variants appear throughout game-playing RL.
LLM post-training — PPO (an actor-critic) is the classic RLHF optimiser; GRPO is the critic-free descendant behind reasoning models.
Operations & systems — resource scheduling, networking and recommendation pipelines commonly use A2C/PPO-style agents.

Building and scaling these systems — parallel environments, reward pipelines, distributed rollouts — is its own industry; see the RL environment vendors.

Limitations

On-policy sample inefficiency — A2C/A3C must throw away each batch after one (or few) updates; off-policy AC methods like SAC reuse a replay buffer and are far more sample-efficient.
Hyperparameter sensitivity — entropy coefficient, value-loss weight, rollout length and learning rate all interact; a bad setting silently kills learning.
Critic bias — bootstrapping off an imperfect critic injects bias into the advantage; if the critic is badly wrong, the actor follows it off a cliff.
A3C’s stale gradients — asynchrony’s out-of-date updates can hurt more than the decorrelation helps, which is exactly why A2C exists.

Frequently asked questions

What’s the difference between A2C and A3C?

Same algorithm, different orchestration. A3C runs many workers that update a shared network asynchronously (no waiting, lock-free). A2C is synchronous: a coordinator waits for all workers, combines their experience into one batch, and does a single update averaged over it. A2C is simpler, reproducible, GPU-friendly, and performs as well or better, so it’s usually preferred.

Why use the advantage instead of the raw reward or return?

The return tells you the outcome was good but not whether your action caused it — a state might be good regardless. The advantage $A(s,a)=Q(s,a)-V(s)$ subtracts the state’s baseline value, isolating how much better this action was than average. Because the baseline has zero expected gradient, this slashes variance while keeping the estimate unbiased.

Is PPO an actor-critic method?

Yes. PPO is an advantage actor-critic: it has the same value-function critic, uses GAE advantages, and adds an entropy bonus. Its one addition is a clipped surrogate objective that limits how far each update moves the policy, letting it safely reuse a batch for multiple epochs. A2C is essentially PPO without the clipping. See PPO.

Does the actor-critic need two separate networks?

Not necessarily. Many implementations (including A3C on Atari) share a body — common feature layers — with two small heads: a policy head and a value head. This shares representation and saves compute. Fully separate networks are also common, especially when actor and critic need different capacities or learning rates.
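A shared-body layout looks like this in miniature. The sketch is a forward pass only, in plain Python with no biases and no training; the class name, layer sizes and initialisation range are assumptions, and a real implementation would use an autodiff library.

```python
import math
import random


class SharedActorCritic:
    """Tiny two-headed network: a shared tanh layer, a policy head, a value head."""

    def __init__(self, obs_dim, hidden_dim, num_actions, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.body = init(hidden_dim, obs_dim)              # shared feature layer
        self.policy_head = init(num_actions, hidden_dim)   # logits over actions
        self.value_head = init(1, hidden_dim)              # single linear value output

    @staticmethod
    def _matvec(matrix, vector):
        return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

    def forward(self, obs):
        features = [math.tanh(h) for h in self._matvec(self.body, obs)]
        logits = self._matvec(self.policy_head, features)
        m = max(logits)
        exps = [math.exp(z - m) for z in logits]
        total = sum(exps)
        probs = [e / total for e in exps]                  # softmax policy
        value = self._matvec(self.value_head, features)[0]
        return probs, value


net = SharedActorCritic(obs_dim=4, hidden_dim=8, num_actions=3)
probs, value = net.forward([0.1, -0.2, 0.3, 0.0])
```

Both heads read the same `features`, so gradients from the actor and critic losses both flow into `body`. That is where the compute saving comes from, and also why some implementations prefer separate networks when the two losses interfere.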

Key papers

Asynchronous Methods for Deep Reinforcement Learning — Mnih et al., 2016 — the A3C paper.
High-Dimensional Continuous Control Using Generalized Advantage Estimation — Schulman et al., 2015 — GAE.
Policy Gradient Methods for RL with Function Approximation — Sutton et al., 2000 — the theoretical foundation.
OpenAI Baselines: ACKTR & A2C — OpenAI, 2017 — the synchronous A2C.
Soft Actor-Critic — Haarnoja et al., 2018 — maximum-entropy off-policy AC.

