Continuous Control: DDPG, TD3 & SAC

Key takeaways

Continuous control means the action is a real-valued vector (joint torques, steering angle) — you can't enumerate actions and pick the max like in DQN.
DDPG, TD3 and SAC are off-policy actor-critic methods: a critic learns a Q-function, and an actor is trained by gradient ascent through the critic to output high-value actions. DDPG and TD3 use the deterministic policy gradient; SAC uses a reparameterized stochastic policy.
DDPG is the foundational recipe but is notoriously unstable; TD3 adds three fixes (twin critics, delayed updates, target smoothing) to tame Q-value overestimation.
SAC adds an entropy bonus to the reward, making the policy stochastic and self-tuning its own exploration — it's the most robust default for continuous control today.

What is continuous control?

In many of the flashiest RL problems — a robot arm grasping an object, a quadruped learning to run, a car holding a lane — the agent doesn’t choose from a short menu of buttons. It outputs a real-valued vector: torques for each joint, a steering angle, a throttle. This is continuous control, and it breaks the single most useful trick in value-based RL.

Deep Q-Networks pick an action by computing $Q(s, a)$ for every action and taking the $\arg\max$. With a continuous action space that $\arg\max$ is an optimization problem over $\mathbb{R}^n$ at every single timestep, which is intractable to do exactly. The whole family of algorithms on this page exists to answer one question: how do you do the $\arg\max$ when there are infinitely many actions?
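To get a feel for the scale, consider the naive workaround of discretizing each action dimension into a grid and running the usual $\arg\max$ over the grid. The sketch below just counts the resulting actions; the bin count and the action dimensions are illustrative assumptions, not values taken from any benchmark.

```python
def grid_actions(bins_per_dim: int, action_dims: int) -> int:
    """Number of discrete actions when every dimension gets its own grid."""
    return bins_per_dim ** action_dims

# Hypothetical setup: 11 torque levels per joint.
for dims in (1, 3, 6, 17):
    count = grid_actions(11, dims)
    print(f"{dims:>2} action dims -> {count:,} Q-values to compare per step")
```

The count grows exponentially with the number of action dimensions, so even a coarse grid stops being practical beyond a handful of joints.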

Continuous control actor-critic. The actor maps a state to an action vector; the critic scores that state-action pair. Gradients of Q with respect to the action flow back through the critic into the actor — that is the deterministic policy gradient.

The deterministic policy gradient

The mathematical engine under DDPG and TD3 is the deterministic policy gradient (DPG) theorem, proved by Silver et al. (2014). It says you can train a deterministic actor $\mu_\theta(s)$ by pushing its parameters in the direction that most increases the critic’s value:

\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \mathcal{D}}\Big[\, \nabla_a Q_\phi(s, a)\big|_{a=\mu_\theta(s)} \; \nabla_\theta \mu_\theta(s) \,\Big]

This is just the chain rule. Differentiate the critic with respect to its action input, then differentiate the action with respect to the actor’s weights, and you get a gradient that flows from $Q$ all the way back into the policy. The critic acts as a differentiable surrogate for the environment: instead of estimating the gradient from noisy sampled returns, as likelihood-ratio policy gradient methods such as REINFORCE do, the actor reads off the slope of $Q$ and walks uphill.
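The chain rule can be checked by hand on a toy problem. The sketch below is a minimal illustration under stated assumptions: a 1-D state and action, a linear actor $\mu_\theta(s) = \theta s$, and a hand-written quadratic critic whose best action is $a = 2s$. None of these choices come from the DPG paper; they only keep the derivatives short enough to read.

```python
def actor(theta, s):
    return theta * s                      # mu_theta(s), assumed linear

def critic(s, a):
    return -(a - 2.0 * s) ** 2            # toy Q(s, a), maximized at a = 2s

def dq_da(s, a):
    return -2.0 * (a - 2.0 * s)           # derivative of Q w.r.t. the action

def dmu_dtheta(theta, s):
    return s                              # derivative of the action w.r.t. theta

def dpg_gradient(theta, states):
    """Batch average of dQ/da (at a = mu(s)) times dmu/dtheta."""
    total = 0.0
    for s in states:
        total += dq_da(s, actor(theta, s)) * dmu_dtheta(theta, s)
    return total / len(states)

def objective(theta, states):
    """Average critic value of the actor's own actions."""
    return sum(critic(s, actor(theta, s)) for s in states) / len(states)

states = [-1.0, 0.5, 2.0]                 # stand-in for a replay-buffer batch
theta = 0.0

# Compare the chain-rule gradient with a central finite difference.
eps = 1e-6
analytic = dpg_gradient(theta, states)
numeric = (objective(theta + eps, states) - objective(theta - eps, states)) / (2 * eps)
print(analytic, numeric)

# Gradient ascent on theta: the actor walks uphill on the critic.
for _ in range(200):
    theta += 0.1 * dpg_gradient(theta, states)
print(theta)
```

The two gradient estimates agree, and ascent drives `theta` toward 2, the value at which the actor outputs the critic’s preferred action for every state.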

DDPG: deep deterministic policy gradient

DDPG (Lillicrap et al., 2015) was the breakthrough that married DPG with the deep-learning tricks from DQN. The paper’s title — “Continuous Control with Deep Reinforcement Learning” — names the whole problem. Its recipe is the template every later method modifies.

Learn a critic with the Bellman equation

A Q-network $Q_\phi(s,a)$ is trained to minimize the temporal-difference error against a bootstrapped target, exactly like DQN, except that the next action comes from the actor, not an $\arg\max$:

y = r + \gamma \, Q_{\phi'}\big(s', \mu_{\theta'}(s')\big)

The loss is $\big(Q_\phi(s,a) - y\big)^2$ over transitions sampled from a replay buffer.
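As a concrete illustration of the target and the loss, here is a minimal pure-Python sketch. The function names, the toy linear stand-ins for the networks, and the `done` flag (which switches off bootstrapping at terminal states and does not appear in the equation above) are assumptions made for the example.

```python
def ddpg_targets(batch, target_actor, target_critic, gamma=0.99):
    """y = r + gamma * Q'(s', mu'(s')); no bootstrap on terminal transitions."""
    targets = []
    for s, a, r, s_next, done in batch:
        a_next = target_actor(s_next)
        bootstrap = 0.0 if done else gamma * target_critic(s_next, a_next)
        targets.append(r + bootstrap)
    return targets

def critic_loss(batch, critic, targets):
    """Mean squared TD error over the sampled transitions."""
    errors = [(critic(s, a) - y) ** 2
              for (s, a, _, _, _), y in zip(batch, targets)]
    return sum(errors) / len(errors)

# Hypothetical 1-D stand-ins for the target actor and the critics.
target_actor = lambda s: 0.5 * s
target_critic = lambda s, a: s + a
critic = lambda s, a: s + a

# Each transition is (s, a, r, s_next, done), as sampled from a replay buffer.
batch = [(1.0, 0.2, 1.0, 2.0, False),
         (2.0, -0.1, 0.0, 0.0, True)]

targets = ddpg_targets(batch, target_actor, target_critic)
loss = critic_loss(batch, critic, targets)
print(targets, loss)
```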

Learn an actor with the deterministic policy gradient

The actor $\mu_\theta(s)$ is updated to output actions the critic scores highly, by ascending $\nabla_\theta \, Q_\phi(s, \mu_\theta(s))$ — the chain-rule gradient above.

Stabilize with target networks and replay

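Slowly updated target networks $\phi', \theta'$ (Polyak averaging $\phi' \leftarrow \tau \phi + (1-\tau)\phi'$ with a small $\tau$) provide stable targets, and a replay buffer decorrelates data. Both ideas are inherited from DQN, although DQN copies the online weights into the target network periodically instead of averaging them.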
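The Polyak update itself is a one-liner. The sketch below treats parameters as flat lists of floats; the helper name `polyak_update` and the value `tau=0.005` are illustrative assumptions rather than settings taken from the DDPG paper.

```python
def polyak_update(target_params, online_params, tau=0.005):
    """Return new target params: tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_params, target_params)]

online = [1.0, -2.0]        # pretend these are the learned weights
target = [0.0, 0.0]         # target network starts elsewhere

one_step = polyak_update(target, online)
for _ in range(1000):
    target = polyak_update(target, online)
print(one_step, target)
```

With the online weights held fixed, the gap to the target shrinks by a factor of $(1-\tau)$ per update, so the target trails the online network smoothly instead of jumping.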

Explore by adding action noise

Because the policy is deterministic, exploration is injected by hand: add noise (originally an Ornstein-Uhlenbeck process, later just Gaussian) to the actor’s output when collecting data.
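A minimal sketch of both noise choices follows. The class and function names are made up for the example, and the parameter values (`theta=0.15`, `sigma=0.2`, `dt=1.0`) and the $[-1, 1]$ action range are illustrative assumptions.

```python
import math
import random

class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated, mean-reverting noise."""

    def __init__(self, dim, theta=0.15, sigma=0.2, dt=1.0, seed=0):
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.rng = random.Random(seed)
        self.state = [0.0] * dim

    def sample(self):
        # x <- x + theta * (0 - x) * dt + sigma * sqrt(dt) * N(0, 1)
        self.state = [
            x - self.theta * x * self.dt
            + self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
            for x in self.state
        ]
        return list(self.state)

def gaussian_noise(dim, sigma, rng):
    """Independent Gaussian noise, the simpler alternative."""
    return [rng.gauss(0.0, sigma) for _ in range(dim)]

def explore(action, noise, low=-1.0, high=1.0):
    """Add noise to the actor output and clip to the valid action range."""
    return [max(low, min(high, a + n)) for a, n in zip(action, noise)]

ou = OUNoise(dim=2)
actor_output = [0.9, -0.3]                 # stand-in for mu_theta(s)
noisy_ou = explore(actor_output, ou.sample())
noisy_gauss = explore(actor_output, gaussian_noise(2, 0.1, random.Random(0)))
print(noisy_ou, noisy_gauss)
```

The noise is applied only while collecting data; the critic target still uses the noise-free target actor.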

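DDPG works, and on a good day it is remarkably sample-efficient. But it earned a reputation as brittle and hyperparameter-sensitive: small changes in learning rate or reward scale can send it from solved to diverged. The root cause is a specific, diagnosable failure: Q-value overestimation. The actor is trained to maximize the critic, so it seeks out actions whose values the critic happens to overestimate, and bootstrapping then feeds those inflated values back into the targets.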

TD3: three fixes for a brittle algorithm

TD3 — Twin Delayed DDPG (Fujimoto et al., 2018, “Addressing Function Approximation Error in Actor-Critic Methods”) — is not a new paradigm. It is DDPG with three targeted patches that, together, turn it into one of the most reliable continuous-control algorithms. The paper’s contribution is showing exactly why DDPG fails and fixing each cause.

1. Clipped double-Q (the "twin")

Learn two independent critics $Q_{\phi_1}, Q_{\phi_2}$ and use the minimum of the two when forming the target. Taking the min systematically underestimates rather than overestimates — a conservative bias that is far safer to bootstrap. This is the single most important fix.

2. Delayed policy updates

Update the actor (and target networks) less often than the critics — typically once every two critic updates. Letting the value estimate settle before the actor chases it prevents the actor from amplifying transient critic errors.

3. Target policy smoothing

Add clipped noise to the target action so the policy can’t exploit a sharp, spurious peak in $Q$. This regularizes the value estimate to be smooth over a small neighborhood of similar actions, a form of value smoothing.

The TD3 target combines the first and third ideas (delayed updates change the update schedule, not the target):

y = r + \gamma \min_{i=1,2} Q_{\phi_i'}\big(s', \, \tilde{a}'\big), \qquad \tilde{a}' = \mu_{\theta'}(s') + \mathrm{clip}\big(\epsilon, -c, c\big), \;\; \epsilon \sim \mathcal{N}(0, \sigma)

Both critics are trained toward this same target $y$; the actor is updated (delayed) against only the first critic, by ascending $\nabla_\theta Q_{\phi_1}(s, \mu_\theta(s))$. In the experiments reported in the TD3 paper, these changes let TD3 dramatically outperform DDPG and match or beat the original SAC across the standard MuJoCo benchmark suite.
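The target above and the delayed schedule can be sketched in a few lines. This is a minimal illustration for a 1-D action, not a reference implementation: the function names, the toy target networks, the `done` flag, the extra clip of the smoothed action to $[-1, 1]$, and the values `sigma=0.2`, `noise_clip=0.5` and `policy_delay=2` are all assumptions for the example.

```python
import random

def clip(x, low, high):
    return max(low, min(high, x))

def td3_target(r, s_next, done, target_actor, target_q1, target_q2, rng,
               gamma=0.99, sigma=0.2, noise_clip=0.5,
               act_low=-1.0, act_high=1.0):
    """y = r + gamma * min_i Q_i'(s', a~), with a~ = mu'(s') + clipped noise."""
    eps = clip(rng.gauss(0.0, sigma), -noise_clip, noise_clip)
    # Assumption: also keep the smoothed action inside the valid range.
    a_next = clip(target_actor(s_next) + eps, act_low, act_high)
    q_min = min(target_q1(s_next, a_next), target_q2(s_next, a_next))
    return r + (0.0 if done else gamma * q_min)

# Hypothetical 1-D stand-ins for the target actor and the twin target critics.
target_actor = lambda s: 0.5
target_q1 = lambda s, a: 1.0 + a
target_q2 = lambda s, a: 3.0 - a

y = td3_target(1.0, 0.0, False, target_actor, target_q1, target_q2,
               random.Random(0))

# Delayed updates: the actor and the target networks move only once
# every `policy_delay` critic updates.
policy_delay = 2
actor_update_steps = [t for t in range(1, 7) if t % policy_delay == 0]
print(y, actor_update_steps)
```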

Go deeper: why the minimum of two critics works

Two critics trained on the same data are not identical: they start from different random initializations, so their errors are only partially correlated. Where one critic happens to overestimate a particular state-action pair, the other often does not, and $\min(Q_1, Q_2)$ picks the lower (less inflated) value. The result is a deliberate downward bias. A small underestimate is benign: the actor simply won’t chase a value bubble that isn’t there. An overestimate is toxic because the actor is built to seek it out. TD3 trades a little pessimism for a lot of stability.
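A small simulation makes the bias visible. It rests on idealized assumptions that are not from the TD3 paper: every action has a true value of 0, each critic’s error is independent Gaussian noise (real critics are only partially uncorrelated), and the actor picks whichever action the first critic rates highest.

```python
import random

def simulate(num_actions=10, noise_std=1.0, trials=5000, seed=0):
    """Average value estimate at the greedy action; the true value is 0."""
    rng = random.Random(seed)
    single_total, clipped_total = 0.0, 0.0
    for _ in range(trials):
        q1 = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        q2 = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        best = max(range(num_actions), key=lambda i: q1[i])  # actor maximizes Q1
        single_total += q1[best]                   # single-critic estimate
        clipped_total += min(q1[best], q2[best])   # clipped double-Q estimate
    return single_total / trials, clipped_total / trials

single, clipped = simulate()
print(f"single critic: {single:.3f}   min of two critics: {clipped:.3f}")
```

Under these assumptions the single-critic estimate is inflated well above the true value of 0, because the greedy action is selected precisely where the noise is largest. The minimum of two critics lands slightly below 0 instead.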

SAC: maximum-entropy reinforcement learning

Soft Actor-Critic (Haarnoja et al., 2018) attacks the same problem from a different angle. Instead of a deterministic actor plus hand-tuned exploration noise, SAC makes the policy stochastic and changes the objective itself: maximize reward and the policy’s entropy.

J(\pi) = \sum_t \mathbb{E}_{(s_t, a_t) \sim \pi}\Big[\, r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \,\Big]

The entropy term $\mathcal{H}$ rewards the policy for staying as random as it can while still solving the task. The temperature $\alpha$ trades off the two. This maximum-entropy framing has three big payoffs: built-in, state-dependent exploration; robustness, because the policy keeps multiple good options alive instead of collapsing onto one; and far less hyperparameter fuss than DDPG.
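To make the entropy term concrete, the sketch below uses the closed-form differential entropy of a diagonal Gaussian policy. It is a simplification: it ignores any squashing of the action, and the function names and the values `alpha=0.2` and the two sets of standard deviations are assumptions for illustration.

```python
import math

def gaussian_entropy(stds):
    """Differential entropy of a diagonal Gaussian with the given std devs."""
    return sum(0.5 * math.log(2.0 * math.pi * math.e * s * s) for s in stds)

def soft_reward(reward, stds, alpha):
    """Per-step term of the objective: r + alpha * H(pi(.|s))."""
    return reward + alpha * gaussian_entropy(stds)

narrow, wide = [0.1, 0.1], [1.0, 1.0]      # two hypothetical 2-D policies
print(gaussian_entropy(narrow), gaussian_entropy(wide))
print(soft_reward(1.0, narrow, alpha=0.2), soft_reward(1.0, wide, alpha=0.2))
```

For the same reward, the wider policy scores higher on the soft objective. Note that differential entropy can be negative for narrow distributions.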

DDPG and TD3 use a deterministic actor and add external noise for exploration; SAC uses a stochastic actor whose entropy is part of the objective, so exploration is learned and state-dependent.

SAC keeps TD3’s clipped double-Q trick, but its soft target adds an entropy bonus, and the next action is sampled from the current stochastic policy rather than a target actor:

y = r + \gamma \Big( \min_{i=1,2} Q_{\phi_i'}(s', \tilde{a}') - \alpha \log \pi_\theta(\tilde{a}' \mid s') \Big), \qquad \tilde{a}' \sim \pi_\theta(\cdot \mid s')
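Numerically, the soft target is a small change to the clipped double-Q target. The sketch below takes the two target-critic values and the log-probability of the sampled next action as plain numbers; the function name, the `done` flag and the value `alpha=0.2` are assumptions for the example.

```python
def sac_target(r, done, q1_next, q2_next, logp_next, alpha=0.2, gamma=0.99):
    """y = r + gamma * (min(Q1', Q2') - alpha * log pi(a'|s'))."""
    soft_value = min(q1_next, q2_next) - alpha * logp_next
    return r + (0.0 if done else gamma * soft_value)

# Same critic values, two different next-action log-probabilities.
y_likely = sac_target(1.0, False, q1_next=2.0, q2_next=1.5, logp_next=-1.0)
y_unlikely = sac_target(1.0, False, q1_next=2.0, q2_next=1.5, logp_next=-3.0)
print(y_likely, y_unlikely)
```

A lower log-probability for the sampled action means a larger entropy bonus, so the target is higher even though the critic values are unchanged.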

To train the stochastic actor with backpropagation, SAC uses the reparameterization trick: sample noise $\xi \sim \mathcal{N}(0, I)$, shift and scale it by the actor’s mean and standard deviation, and squash the result through a $\tanh$, so gradients flow through the sampling step and actions stay bounded.

\tilde{a}_\theta(s, \xi) = \tanh\big(\mu_\theta(s) + \sigma_\theta(s) \odot \xi\big)
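A minimal sketch of the sampling step, in pure Python without automatic differentiation. It also returns $\log \pi(a \mid s)$ using the standard change-of-variables correction for the $\tanh$, which the soft target needs but which is not spelled out above. The function name and the small constant `1e-6` (added for numerical safety) are assumptions for the example.

```python
import math
import random

def sample_squashed_action(mean, std, rng):
    """Return (action, log_prob) for a tanh-squashed diagonal Gaussian."""
    action, log_prob = [], 0.0
    for m, s in zip(mean, std):
        xi = rng.gauss(0.0, 1.0)          # noise, independent of the parameters
        u = m + s * xi                    # reparameterized pre-squash sample
        a = math.tanh(u)                  # squash into (-1, 1)
        # Log-density of u under N(m, s^2).
        log_prob += -0.5 * xi * xi - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        # Change of variables for a = tanh(u): subtract log(1 - tanh(u)^2).
        log_prob -= math.log(1.0 - a * a + 1e-6)
        action.append(a)
    return action, log_prob

rng = random.Random(0)
action, log_prob = sample_squashed_action([0.0, 0.5], [1.0, 0.5], rng)
print(action, log_prob)
```

Because the randomness enters only through `xi`, the action is a deterministic, differentiable function of the mean and standard deviation, which is what lets an autodiff framework push critic gradients back into the actor.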

Go deeper: automatic temperature tuning

The temperature $\alpha$ is the trickiest knob in maximum-entropy RL: too high and the agent acts randomly forever, too low and it collapses to a deterministic policy and stops exploring. The follow-up paper (“Soft Actor-Critic Algorithms and Applications”, 2018) made $\alpha$ self-tuning: you specify a target entropy $\bar{\mathcal{H}}$ (a common default is $-\dim(\mathcal{A})$, i.e. minus the number of action dimensions) and $\alpha$ is adjusted by gradient descent on the following loss so that the policy’s entropy tracks the target:

J(\alpha) = \mathbb{E}_{a \sim \pi}\big[ -\alpha \big( \log \pi(a \mid s) + \bar{\mathcal{H}} \big) \big]

This is why modern SAC is so close to plug-and-play: the one hyperparameter that mattered most now tunes itself, leaving little to hand-fit.
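One gradient step on $J(\alpha)$ can be written out by hand. The sketch below optimizes $\log \alpha$ rather than $\alpha$ so that the temperature stays positive; that parameterization, the function name, the learning rate and the sample log-probabilities are all assumptions for the example.

```python
import math

def alpha_step(log_alpha, log_probs, target_entropy, lr=0.01):
    """One gradient-descent step on J(alpha) = E[-alpha * (log pi + H_target)]."""
    alpha = math.exp(log_alpha)
    mean_gap = sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    grad = -alpha * mean_gap              # dJ / d(log alpha)
    return log_alpha - lr * grad

target_entropy = -2.0                     # -dim(A) for a 2-D action space

# Policy more random than the target (entropy estimate -1.0): alpha goes down.
lower = alpha_step(0.0, [1.0, 0.5, 1.5], target_entropy)
# Policy less random than the target (entropy estimate -3.0): alpha goes up.
higher = alpha_step(0.0, [3.0, 3.0], target_entropy)
print(math.exp(lower), math.exp(higher))
```

The sign works as a thermostat: when the sampled entropy $-\mathbb{E}[\log \pi]$ falls below the target, $\alpha$ rises and the entropy bonus gets more weight, and the reverse when the policy is more random than needed.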

DDPG vs TD3 vs SAC at a glance

3 targeted fixes TD3 adds to DDPG

2 critics used by both TD3 and SAC

1 hyperparameter SAC auto-tunes (temperature α)

	DDPG	TD3	SAC
Policy	Deterministic	Deterministic	Stochastic
Critics	One	Two (clipped min)	Two (clipped min)
Exploration	External noise on action	External noise on action	Entropy bonus (built-in)
Policy update	Every step	Delayed	Every step
Objective	Expected return	Expected return	Return + entropy
Robustness	Brittle	Stable	Very stable
Year / paper	2015 (Lillicrap)	2018 (Fujimoto)	2018 (Haarnoja)

In practice the two stable algorithms trade blows by environment. On the standard MuJoCo benchmarks, SAC tends to win on high-dimensional tasks like Humanoid (where its exploration shines), while TD3 is competitive or better on lower-dimensional ones like Hopper. The honest summary: SAC is the safer default, especially for real-robot work where exploration and robustness matter most, but a well-tuned TD3 is a strong, simpler baseline.

A short history

2014

Deterministic Policy Gradient

Silver et al. prove the DPG theorem — you can compute an exact gradient for a deterministic policy through a differentiable critic.

2015

DDPG

Lillicrap et al. combine DPG with DQN’s target networks and replay buffer, solving continuous-control tasks end-to-end with deep networks.

2018

TD3

Fujimoto et al. diagnose DDPG’s overestimation bug and fix it with twin critics, delayed updates and target smoothing.

2018

SAC

Haarnoja et al. introduce maximum-entropy actor-critic; a follow-up adds automatic temperature tuning, making it near plug-and-play.

2018–now

The robotics workhorses

SAC and TD3 become the default off-policy baselines for continuous control and real-robot learning; PPO covers the massively-parallel-simulation regime.


Where these methods are used

Off-policy continuous control is the backbone of RL in robotics: manipulation, dexterous hands, and locomotion where each real-world sample is slow and expensive, so sample efficiency is paramount. SAC in particular is a favorite for learning directly on physical robots because its entropy-driven exploration is gentle and self-regulating. Beyond robotics, these methods show up in autonomous-driving controllers, energy/HVAC optimization, and continuous-action game and simulation agents. For the broader topic, see actor-critic methods (the parent topic of continuous control) and the robotics and control RL environment vendors.

Researcher takes

In a post on X, Sergey Levine offers a conceptual argument for why maximum-entropy methods like SAC are not just an exploration trick but provably solve a deeper class of robust-control problems.

Frequently asked questions

Why can’t I just use DQN for continuous control?

DQN selects actions with $\arg\max_a Q(s,a)$, which requires evaluating every action. With a continuous action space that maximization is itself an optimization problem at every timestep. DDPG/TD3/SAC sidestep it by training an actor network to output the maximizing action directly.

Should I default to TD3 or SAC?

SAC is the more robust default for most continuous-control problems, especially where exploration is hard or you’re training on hardware — its entropy objective handles exploration automatically and it has fewer brittle hyperparameters. TD3 is a strong, slightly simpler baseline that can edge SAC out on some lower-dimensional tasks. Try SAC first.

How do TD3 and SAC both fix DDPG’s overestimation?

Both use clipped double-Q learning — two critics with the target taken as their minimum, which biases estimates downward and prevents the actor from exploiting inflated Q-values. TD3 adds delayed policy updates and target smoothing on top; SAC adds the entropy term and a stochastic policy.

When would I use PPO instead?

PPO is on-policy: less sample-efficient but extremely robust and trivially parallelizable. If you have a fast simulator and can run thousands of parallel environments (common in legged-robot sim-to-real), PPO often wins. If samples are scarce and expensive, the off-policy SAC/TD3 family is far more efficient. See on-policy vs off-policy.

Key papers

Deterministic Policy Gradient Algorithms — Silver et al., 2014 — the theorem that makes deterministic actor-critic possible.
Continuous Control with Deep RL (DDPG) — Lillicrap et al., 2015 — the foundational deep continuous-control algorithm.
Addressing Function Approximation Error (TD3) — Fujimoto et al., 2018 — diagnoses and fixes DDPG’s overestimation.
Soft Actor-Critic — Haarnoja et al., 2018 — maximum-entropy off-policy actor-critic.
Soft Actor-Critic Algorithms and Applications — Haarnoja et al., 2018 — automatic temperature tuning and real-robot results.
