
Continuous Control: DDPG, TD3 & SAC

How off-policy actor-critic methods solve continuous control: the deterministic policy gradient, DDPG, TD3's three fixes, and SAC's maximum-entropy approach.

Updated 2026-06-07 15 min read
Key takeaways
  • Continuous control means the action is a real-valued vector (joint torques, steering angle) — you can't enumerate actions and pick the max like in DQN.
  • DDPG, TD3 and SAC are off-policy actor-critic methods: a critic learns a Q-function, and an actor is trained by following the critic's gradient toward high-value actions (the deterministic policy gradient in DDPG and TD3, a reparameterized stochastic gradient in SAC).
  • DDPG is the foundational recipe but is notoriously unstable; TD3 adds three fixes (twin critics, delayed updates, target smoothing) to tame Q-value overestimation.
  • SAC adds an entropy bonus to the reward, making the policy stochastic and self-tuning its own exploration — it's the most robust default for continuous control today.

What is continuous control?

In many of the flashiest RL problems — a robot arm grasping an object, a quadruped learning to run, a car holding a lane — the agent doesn’t choose from a short menu of buttons. It outputs a real-valued vector: torques for each joint, a steering angle, a throttle. This is continuous control, and it breaks the single most useful trick in value-based RL.

Deep Q-Networks pick an action by computing $Q(s, a)$ for every action and taking the $\arg\max$. With a continuous action space that $\arg\max$ is an optimization problem over $\mathbb{R}^n$ at every single timestep, which is intractable to do exactly. The whole family of algorithms on this page exists to answer one question: how do you do the $\arg\max$ when there are infinitely many actions?
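A quick count shows why brute-force discretization does not rescue the argmax. The sketch below assumes every action dimension is split into the same number of bins; the function name, the bin count, and the dimensions chosen are illustrative only.

```python
def num_discrete_actions(action_dims: int, bins_per_dim: int) -> int:
    """Size of the action grid when every dimension is split into equal bins."""
    return bins_per_dim ** action_dims


# The grid grows exponentially with the number of action dimensions.
for dims in (1, 3, 6, 17):
    print(dims, num_discrete_actions(dims, bins_per_dim=10))
```

Even a coarse grid becomes far too large to evaluate at every timestep once the action has more than a handful of dimensions, which is the gap the actor network fills.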

[Figure: actor-critic loop. State s → Actor (action a = μ(s)) → Critic (value Q(s, a)); Bellman target r + γ Q'(s', a'); Environment returns next state and reward; ∇a Q flows back to the actor; a replay buffer feeds off-policy transitions to both networks.]
Continuous control actor-critic. The actor maps a state to an action vector; the critic scores that state-action pair. Gradients of Q with respect to the action flow back through the critic into the actor — that is the deterministic policy gradient.

The deterministic policy gradient

The mathematical engine under DDPG and TD3 is the deterministic policy gradient (DPG) theorem, proved by Silver et al. (2014). It says you can train a deterministic actor $\mu_\theta(s)$ by pushing its parameters in the direction that most increases the critic’s value:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \mathcal{D}}\Big[\, \nabla_a Q_\phi(s, a)\big|_{a=\mu_\theta(s)} \; \nabla_\theta \mu_\theta(s) \,\Big]$$

This is just the chain rule. Differentiate the critic with respect to its action input, then differentiate the action with respect to the actor’s weights, and you get a gradient that flows from $Q$ all the way back into the policy. The critic acts as a differentiable surrogate for the environment: instead of sampling noisy returns like a stochastic policy gradient method, the actor reads off the slope of $Q$ and walks uphill.
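The chain rule above can be checked by hand on a toy problem. This is a minimal sketch, not a DDPG implementation: the quadratic critic, the one-parameter linear actor, the step size, and all function names are hypothetical, chosen so that both derivatives can be written analytically.

```python
import numpy as np


def dq_da(s, a):
    # Hypothetical critic Q(s, a) = -(a - 2s)^2, whose best action is a = 2s.
    # Its derivative with respect to the action input:
    return -2.0 * (a - 2.0 * s)


def actor(theta, s):
    # Hypothetical one-parameter deterministic actor mu_theta(s) = theta * s.
    return theta * s


def dmu_dtheta(s):
    # Derivative of the actor's output with respect to its parameter.
    return s


def dpg_gradient(theta, states):
    # Chain rule: (dQ/da evaluated at a = mu_theta(s)) * (dmu/dtheta), averaged over states.
    a = actor(theta, states)
    return np.mean(dq_da(states, a) * dmu_dtheta(states))


rng = np.random.default_rng(0)
theta = 0.0
for _ in range(200):
    states = rng.uniform(-1.0, 1.0, size=64)  # stand-in for a replay-buffer batch
    theta += 0.5 * dpg_gradient(theta, states)  # gradient ascent on Q

print(theta)  # approaches 2, the slope of the best action a = 2s
```

The actor never sees a return here. It only follows the slope of the critic, so it is exactly as good as the critic is accurate.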

DDPG: deep deterministic policy gradient

DDPG (Lillicrap et al., 2015) was the breakthrough that married DPG with the deep-learning tricks from DQN. The paper’s title — “Continuous Control with Deep Reinforcement Learning” — names the whole problem. Its recipe is the template every later method modifies.

1. Learn a critic with the Bellman equation

A Q-network $Q_\phi(s,a)$ is trained to minimize the temporal-difference error against a bootstrapped target, exactly like DQN, except that the next action comes from the actor, not an $\arg\max$:

$$y = r + \gamma \, Q_{\phi'}\big(s', \mu_{\theta'}(s')\big)$$

The loss is $\big(Q_\phi(s,a) - y\big)^2$ over transitions sampled from a replay buffer.

2. Learn an actor with the deterministic policy gradient

The actor $\mu_\theta(s)$ is updated to output actions the critic scores highly, by ascending $\nabla_\theta \, Q_\phi(s, \mu_\theta(s))$, the chain-rule gradient above.

3. Stabilize with target networks and replay

Slowly-updated target networks $\phi', \theta'$ (Polyak averaging $\phi' \leftarrow \tau \phi + (1-\tau)\phi'$) provide stable targets, and a replay buffer decorrelates data; both are inherited from DQN.

4. Explore by adding action noise

Because the policy is deterministic, exploration is injected by hand: add noise (originally an Ornstein-Uhlenbeck process, later just Gaussian) to the actor’s output when collecting data.
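The four steps above can be sketched as a few small NumPy helpers. This is a hedged illustration rather than a reference implementation: the function names, the default values for gamma, tau and sigma, the action bounds of [-1, 1], and the `done` mask that stops bootstrapping at terminal states are all assumptions not taken from the text. The actor and critic are passed in as plain callables.

```python
import numpy as np


def td_target(r, s_next, done, target_actor, target_critic, gamma=0.99):
    """y = r + gamma * Q'(s', mu'(s')); the done mask (an assumption) cuts bootstrapping."""
    a_next = target_actor(s_next)
    return r + gamma * (1.0 - done) * target_critic(s_next, a_next)


def critic_loss(q_pred, y):
    """Mean squared temporal-difference error over a sampled batch."""
    return np.mean((q_pred - y) ** 2)


def polyak_update(target_params, online_params, tau=0.005):
    """phi' <- tau * phi + (1 - tau) * phi', applied to each parameter array."""
    return [tau * p + (1.0 - tau) * tp for tp, p in zip(target_params, online_params)]


def noisy_action(actor, s, rng, sigma=0.1, low=-1.0, high=1.0):
    """Deterministic action plus Gaussian exploration noise, clipped to the action bounds."""
    a = actor(s)
    return np.clip(a + sigma * rng.normal(size=np.shape(a)), low, high)


# Toy stand-ins for the target networks (hypothetical, for illustration only).
target_actor = lambda s: 0.5 * s
target_critic = lambda s, a: s + a

y = td_target(
    r=np.array([1.0, 0.0]),
    s_next=np.array([2.0, 4.0]),
    done=np.array([0.0, 1.0]),
    target_actor=target_actor,
    target_critic=target_critic,
)
new_target = polyak_update([np.array([0.0])], [np.array([1.0])], tau=0.1)
explore = noisy_action(target_actor, np.array([0.2, 1.9]), np.random.default_rng(0))
```

The second transition is terminal, so its target reduces to the reward alone. The Polyak step moves the target parameters only a fraction tau of the way toward the online ones, which is what keeps the bootstrapped targets slow-moving.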

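DDPG works, and on a good day it is remarkably sample-efficient. But it earned a reputation as brittle and hyperparameter-sensitive: small changes in learning rate or reward scale can send it from solved to diverged. The root cause is a specific, diagnosable failure: the critic overestimates Q-values, and the actor, which is trained to maximize the critic, seeks out and amplifies exactly those errors.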

TD3: three fixes for a brittle algorithm

TD3 — Twin Delayed DDPG (Fujimoto et al., 2018, “Addressing Function Approximation Error in Actor-Critic Methods”) — is not a new paradigm. It is DDPG with three targeted patches that, together, turn it into one of the most reliable continuous-control algorithms. The paper’s contribution is showing exactly why DDPG fails and fixing each cause.

1. Clipped double-Q (the "twin")

Learn two independent critics $Q_{\phi_1}, Q_{\phi_2}$ and use the minimum of the two when forming the target. Taking the min systematically underestimates rather than overestimates, a conservative bias that is far safer to bootstrap. This is the single most important fix.

2. Delayed policy updates

Update the actor (and target networks) less often than the critics — typically once every two critic updates. Letting the value estimate settle before the actor chases it prevents the actor from amplifying transient critic errors.

3. Target policy smoothing

Add clipped noise to the target action so the policy can’t exploit a sharp, spurious peak in $Q$. It regularizes the value estimate to be smooth over a small neighborhood of similar actions, on the premise that similar actions should have similar values.
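The delayed-update schedule from the second fix is easy to state as a loop. The sketch below only counts where the updates would happen; the function name and the `policy_delay` parameter are hypothetical, and the counter increments stand in for real gradient steps.

```python
def run_updates(num_steps, policy_delay=2):
    """Count critic and actor updates under a delayed-policy schedule."""
    critic_updates = 0
    actor_updates = 0
    for step in range(1, num_steps + 1):
        critic_updates += 1  # both critics take a gradient step every iteration
        if step % policy_delay == 0:
            actor_updates += 1  # actor and target networks update together, less often
    return critic_updates, actor_updates


print(run_updates(10))
```

With a delay of two, the actor sees a critic that has had two gradient steps to absorb each change in the policy.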

The TD3 target combines clipped double-Q with target policy smoothing (delayed updates change only the update schedule, not the target):

$$y = r + \gamma \min_{i=1,2} Q_{\phi_i'}\big(s', \, \tilde{a}'\big), \qquad \tilde{a}' = \mu_{\theta'}(s') + \mathrm{clip}\big(\epsilon, -c, c\big), \;\; \epsilon \sim \mathcal{N}(0, \sigma)$$

Both critics are trained toward this same target $y$; the actor is updated (delayed) against only the first critic, $\nabla_\theta Q_{\phi_1}(s, \mu_\theta(s))$. With these changes TD3 dramatically outperforms DDPG and matches or beats the original SAC across the standard MuJoCo benchmark suite.
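The target above translates almost line for line into NumPy. This is a sketch under stated assumptions: the function and argument names are hypothetical, sigma is treated as a standard deviation, and the default values are illustrative. It follows the formula as written, so it omits the terminal-state mask and the clipping of the smoothed action to the valid range that a practical implementation would usually add.

```python
import numpy as np


def td3_target(r, s_next, target_actor, target_q1, target_q2, rng,
               gamma=0.99, sigma=0.2, noise_clip=0.5):
    """Clipped double-Q target with target policy smoothing."""
    a_next = target_actor(s_next)
    # Target policy smoothing: Gaussian noise, clipped to [-c, c].
    eps = np.clip(sigma * rng.normal(size=np.shape(a_next)), -noise_clip, noise_clip)
    a_smoothed = a_next + eps
    # Clipped double-Q: bootstrap from the smaller of the two target critics.
    q_min = np.minimum(target_q1(s_next, a_smoothed), target_q2(s_next, a_smoothed))
    return r + gamma * q_min


# Toy stand-ins (hypothetical): the second critic is uniformly more optimistic.
target_actor = lambda s: 0.5 * s
target_q1 = lambda s, a: s + a
target_q2 = lambda s, a: s + a + 1.0

rng = np.random.default_rng(0)
y_no_noise = td3_target(np.array([1.0]), np.array([2.0]),
                        target_actor, target_q1, target_q2, rng, sigma=0.0)
```

With the noise switched off, the target bootstraps from the less optimistic critic, which is the behavior the minimum is there to enforce.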

Go deeper: why the minimum of two critics works

Two critics trained on the same data are not identical: they start from different random initializations, so their errors are only partially correlated. Where one critic happens to overestimate a particular state-action pair, the other often does not, and $\min(Q_1, Q_2)$ picks the lower (less inflated) value. The result is a deliberate downward bias. A small underestimate is benign: the actor simply won’t chase a value bubble that isn’t there. An overestimate is toxic because the actor is built to seek it out. TD3 trades a little pessimism for a lot of stability.
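A small simulation makes the two biases concrete. It is an idealized sketch: the critics' errors are modeled as fully independent zero-mean Gaussian noise around one true value, which is a stronger assumption than the partial decorrelation described above, and the noise scale and number of candidate actions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
true_q = 1.0
n = 100_000

# Two critics with independent, zero-mean estimation errors (idealized assumption).
q1 = true_q + rng.normal(scale=0.5, size=n)
q2 = true_q + rng.normal(scale=0.5, size=n)

single_bias = np.mean(q1) - true_q               # one critic alone: roughly unbiased
min_bias = np.mean(np.minimum(q1, q2)) - true_q  # min of two: biased downward

# The actor's side of the story: picking the best of several noisy action values
# (here 10 actions that are all truly worth true_q) overestimates.
noisy_action_values = true_q + rng.normal(scale=0.5, size=(n, 10))
max_bias = np.mean(noisy_action_values.max(axis=1)) - true_q

print(single_bias, min_bias, max_bias)
```

The maximization step inflates values even when each estimate is unbiased on its own, and the minimum over two critics pushes in the opposite direction to counteract it.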

SAC: maximum-entropy reinforcement learning

Soft Actor-Critic (Haarnoja et al., 2018) attacks the same problem from a different angle. Instead of a deterministic actor plus hand-tuned exploration noise, SAC makes the policy stochastic and changes the objective itself: maximize reward and the policy’s entropy.

$$J(\pi) = \sum_t \mathbb{E}_{(s_t, a_t) \sim \pi}\Big[\, r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \,\Big]$$

The entropy term $\mathcal{H}$ rewards the policy for staying as random as it can while still solving the task. The temperature $\alpha$ trades off the two. This maximum-entropy framing has three big payoffs: built-in, state-dependent exploration; robustness, because the policy keeps multiple good options alive instead of collapsing onto one; and far less hyperparameter fuss than DDPG.

[Figure: two panels with axes labeled action and value. Left, DDPG / TD3 (deterministic + noise): single output a = μ(s). Right, SAC (stochastic policy): distribution π(a|s).]
DDPG and TD3 use a deterministic actor and add external noise for exploration; SAC uses a stochastic actor whose entropy is part of the objective, so exploration is learned and state-dependent.

SAC keeps TD3’s clipped double-Q trick, but its soft target adds an entropy bonus, and the next action is sampled from the current stochastic policy rather than a target actor:

$$y = r + \gamma \Big( \min_{i=1,2} Q_{\phi_i'}(s', \tilde{a}') - \alpha \log \pi_\theta(\tilde{a}' \mid s') \Big), \qquad \tilde{a}' \sim \pi_\theta(\cdot \mid s')$$
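The soft target can be sketched in a few lines of NumPy. The names here are hypothetical, and `sample_action` is assumed to be a callable that returns both a sampled next action and its log-probability under the current policy. As in the formula, no terminal-state mask is applied.

```python
import numpy as np


def sac_target(r, s_next, sample_action, target_q1, target_q2, alpha, gamma=0.99):
    """Soft Bellman target: min of two target critics plus an entropy bonus."""
    a_next, log_prob = sample_action(s_next)  # action drawn from the current policy
    q_min = np.minimum(target_q1(s_next, a_next), target_q2(s_next, a_next))
    # Subtracting alpha * log pi adds a bonus for low-probability (high-entropy) actions.
    return r + gamma * (q_min - alpha * log_prob)


# Toy stand-ins (hypothetical): a "policy" that returns a fixed log-probability of -1.
sample_action = lambda s: (0.5 * s, np.full_like(s, -1.0))
target_q1 = lambda s, a: s + a
target_q2 = lambda s, a: s + a + 1.0

y = sac_target(np.array([1.0]), np.array([2.0]),
               sample_action, target_q1, target_q2, alpha=0.2)
```

Setting alpha to zero removes the entropy bonus and leaves a clipped double-Q target evaluated at a sampled action.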

To train the stochastic actor with backpropagation, SAC uses the reparameterization trick: sample noise $\xi \sim \mathcal{N}(0, I)$ and squash a Gaussian through a $\tanh$, so gradients flow through the sampling step.

$$\tilde{a}_\theta(s, \xi) = \tanh\big(\mu_\theta(s) + \sigma_\theta(s) \odot \xi\big)$$
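The sampling step can be written out in NumPy as follows. This is a sketch under assumptions: the function name is hypothetical, the policy outputs `mean` and `log_std` are passed in directly, and the small constant inside the logarithm is a common numerical guard rather than part of the math. The log-probability uses the standard change-of-variables correction for the tanh squashing, which the text above does not spell out. NumPy does not compute gradients, so the point here is only that the action is a deterministic function of the policy outputs and the noise, which is what lets an autodiff framework backpropagate through it.

```python
import numpy as np


def sample_tanh_gaussian(mean, log_std, rng):
    """Reparameterized sample from a tanh-squashed diagonal Gaussian, with its log-prob."""
    std = np.exp(log_std)
    xi = rng.normal(size=np.shape(mean))  # noise independent of the policy parameters
    u = mean + std * xi                   # pre-squash Gaussian sample
    action = np.tanh(u)                   # squashed into (-1, 1)

    # Log-density of u under the diagonal Gaussian, summed over action dimensions.
    log_prob_u = np.sum(-0.5 * xi ** 2 - log_std - 0.5 * np.log(2.0 * np.pi), axis=-1)
    # Change of variables: subtract log|d tanh(u)/du| = log(1 - tanh(u)^2).
    correction = np.sum(np.log(1.0 - action ** 2 + 1e-6), axis=-1)
    return action, log_prob_u - correction


rng = np.random.default_rng(0)
mean = np.zeros((4, 2))      # batch of 4 states, 2 action dimensions
log_std = np.zeros((4, 2))
actions, log_probs = sample_tanh_gaussian(mean, log_std, rng)
```

The returned log-probabilities are the quantity that appears in the entropy term of the soft target.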
Go deeper: automatic temperature tuning

The temperature $\alpha$ is the trickiest knob in maximum-entropy RL: too high and the agent acts randomly forever, too low and it collapses to a deterministic policy and stops exploring. The follow-up paper (“Soft Actor-Critic Algorithms and Applications”, 2018) made $\alpha$ self-tuning: you specify a target entropy $\bar{\mathcal{H}}$ (a common default is $-\dim(\mathcal{A})$, i.e. minus the number of action dimensions) and $\alpha$ is adjusted by gradient descent on the following loss to hit it:

$$J(\alpha) = \mathbb{E}_{a \sim \pi}\big[ -\alpha \big( \log \pi(a \mid s) + \bar{\mathcal{H}} \big) \big]$$

This is why modern SAC is so close to plug-and-play: the one hyperparameter that mattered most now tunes itself, leaving little to hand-fit.
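A minimal sketch of the temperature update, with the gradient written out by hand. It assumes the temperature is parameterized through its logarithm so that it stays positive, which is a common implementation choice and not something stated above. The function names and the learning rate are hypothetical, and the log-probabilities are treated as constants, so no gradient reaches the policy.

```python
import numpy as np


def alpha_loss_and_grad(log_alpha, log_probs, target_entropy):
    """J(alpha) = E[-alpha * (log pi + target_entropy)] and its gradient w.r.t. log(alpha)."""
    alpha = np.exp(log_alpha)
    gap = np.mean(log_probs + target_entropy)  # positive when entropy is below the target
    loss = -alpha * gap
    grad_log_alpha = -alpha * gap              # d(loss)/d(log_alpha), since d(alpha)/d(log_alpha) = alpha
    return loss, grad_log_alpha


def update_log_alpha(log_alpha, log_probs, target_entropy, lr=3e-4):
    """One gradient-descent step on the temperature loss."""
    _, grad = alpha_loss_and_grad(log_alpha, log_probs, target_entropy)
    return log_alpha - lr * grad


target_entropy = -2.0  # minus the number of action dimensions, for a 2-D action

# Policy too deterministic (entropy estimate -3, below the target): alpha should rise.
raised = update_log_alpha(0.0, np.array([3.0, 3.0]), target_entropy)
# Policy too random (entropy estimate 0, above the target): alpha should fall.
lowered = update_log_alpha(0.0, np.array([0.0, 0.0]), target_entropy)
```

The update is a simple feedback loop: the temperature grows while the policy's entropy sits below the target and shrinks once it exceeds it.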

DDPG vs TD3 vs SAC at a glance

  • 3: targeted fixes TD3 adds to DDPG
  • 2: critics used by both TD3 and SAC
  • 1: hyperparameter SAC auto-tunes (temperature α)

| | DDPG | TD3 | SAC |
| --- | --- | --- | --- |
| Policy | Deterministic | Deterministic | Stochastic |
| Critics | One | Two (clipped min) | Two (clipped min) |
| Exploration | External noise on action | External noise on action | Entropy bonus (built-in) |
| Policy update | Every step | Delayed | Every step |
| Objective | Expected return | Expected return | Return + entropy |
| Robustness | Brittle | Stable | Very stable |
| Year / paper | 2015 (Lillicrap) | 2018 (Fujimoto) | 2018 (Haarnoja) |

In practice the two stable algorithms trade blows by environment. On the standard MuJoCo benchmarks, SAC tends to win on high-dimensional tasks like HalfCheetah and Humanoid (where its exploration shines), while TD3 is competitive or better on lower-dimensional ones like Hopper. The honest summary: SAC is the safer default, especially for real-robot work where exploration and robustness matter most, but a well-tuned TD3 is a strong, simpler baseline.

A short history

2014
Deterministic Policy Gradient
Silver et al. prove the DPG theorem — you can compute an exact gradient for a deterministic policy through a differentiable critic.
2015
DDPG
Lillicrap et al. combine DPG with DQN’s target networks and replay buffer, solving continuous-control tasks end-to-end from a deep network.
2018
TD3
Fujimoto et al. diagnose DDPG’s overestimation bug and fix it with twin critics, delayed updates and target smoothing.
2018
SAC
Haarnoja et al. introduce maximum-entropy actor-critic; a follow-up adds automatic temperature tuning, making it near plug-and-play.
2018–now
The robotics workhorses
SAC and TD3 become the default off-policy baselines for continuous control and real-robot learning; PPO covers the massively-parallel-simulation regime.
Video: L5 DDPG and SAC, Pieter Abbeel, Foundations of Deep RL series

Where these methods are used

Off-policy continuous control is the backbone of RL in robotics: manipulation, dexterous hands, and locomotion, where each real-world sample is slow and expensive, so sample efficiency is paramount. SAC in particular is a favorite for learning directly on physical robots because its entropy-driven exploration is gentle and self-regulating. Beyond robotics, these methods show up in autonomous-driving controllers, energy/HVAC optimization, and continuous-action game and simulation agents. For the broader topic, see the parent topic, actor-critic methods, and the robotics and control RL environment vendors.

Researcher takes

Levine offers a conceptual argument for why maximum-entropy methods like SAC are not just an exploration trick but provably solve a deeper class of robust-control problems.

Frequently asked questions

Why can’t I just use DQN for continuous control?

DQN selects actions with $\arg\max_a Q(s,a)$, which requires evaluating every action. With a continuous action space that maximization is itself an optimization problem at every timestep. DDPG/TD3/SAC sidestep it by training an actor network to output the maximizing action directly.

Should I default to TD3 or SAC?

SAC is the more robust default for most continuous-control problems, especially where exploration is hard or you’re training on hardware — its entropy objective handles exploration automatically and it has fewer brittle hyperparameters. TD3 is a strong, slightly simpler baseline that can edge SAC out on some lower-dimensional tasks. Try SAC first.

How do TD3 and SAC both fix DDPG’s overestimation?

Both use clipped double-Q learning — two critics with the target taken as their minimum, which biases estimates downward and prevents the actor from exploiting inflated Q-values. TD3 adds delayed policy updates and target smoothing on top; SAC adds the entropy term and a stochastic policy.

When would I use PPO instead?

PPO is on-policy: less sample-efficient but extremely robust and trivially parallelizable. If you have a fast simulator and can run thousands of parallel environments (common in legged-robot sim-to-real), PPO often wins. If samples are scarce and expensive, the off-policy SAC/TD3 family is far more efficient. See on-policy vs off-policy.

Key papers
