- Continuous control means the action is a real-valued vector (joint torques, steering angle) — you can't enumerate actions and pick the max like in DQN.
- DDPG, TD3 and SAC are off-policy actor-critic methods: a critic learns a Q-function, and an actor is trained to output actions that maximize it, via the deterministic policy gradient in DDPG and TD3 and a reparameterized stochastic gradient in SAC.
- DDPG is the foundational recipe but is notoriously unstable; TD3 adds three fixes (twin critics, delayed updates, target smoothing) to tame Q-value overestimation.
- SAC adds an entropy bonus to the reward, making the policy stochastic and self-tuning its own exploration — it's the most robust default for continuous control today.
What is continuous control?
In many of the flashiest RL problems — a robot arm grasping an object, a quadruped learning to run, a car holding a lane — the agent doesn’t choose from a short menu of buttons. It outputs a real-valued vector: torques for each joint, a steering angle, a throttle. This is continuous control, and it breaks the single most useful trick in value-based RL.
Deep Q-Networks pick an action by computing $Q(s, a)$ for every action and taking the $\arg\max$. With a continuous action space, that $\arg\max_a Q(s, a)$ is an optimization problem over $\mathbb{R}^n$ at every single timestep, which is intractable to do exactly. The whole family of algorithms on this page exists to answer one question: how do you do the $\arg\max$ when there are infinitely many actions?
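To get a feel for why enumeration fails, the sketch below counts how many candidate actions a naive grid would have to score at each step. The function name `grid_size`, the bin count, and the dimension counts are illustrative assumptions, not figures from any benchmark.

```python
def grid_size(bins_per_dim: int, action_dims: int) -> int:
    """Number of discrete actions if every action dimension is split into bins."""
    return bins_per_dim ** action_dims


# Hypothetical setup: 11 torque levels per joint.
for dims in (1, 3, 6, 17):
    print(f"{dims:>2} dims -> {grid_size(11, dims):,} actions to score per step")
```

The count grows exponentially with the number of action dimensions, which is why a coarse grid stops being practical after a handful of joints.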
The deterministic policy gradient
The mathematical engine under DDPG and TD3 is the deterministic policy gradient (DPG) theorem, proved by Silver et al. (2014). It says you can train a deterministic actor $\mu_\theta$ by pushing its parameters in the direction that most increases the critic’s value:
$$\nabla_\theta J(\theta) = \mathbb{E}_{s}\left[\nabla_a Q_\phi(s, a)\big|_{a=\mu_\theta(s)}\,\nabla_\theta \mu_\theta(s)\right]$$
This is just the chain rule. Differentiate the critic with respect to its action input, then differentiate the action with respect to the actor’s weights, and you get a gradient that flows from $Q$ all the way back into the policy. The critic acts as a differentiable surrogate for the environment: instead of sampling noisy returns like a stochastic policy gradient method, the actor reads off the slope of $Q$ and walks uphill.
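The chain rule described above can be traced by hand on a toy problem. The sketch below assumes a fixed, known critic $Q(s, a) = -(a - 2s)^2$ and a one-parameter linear actor $a = \theta s$. Both are invented for illustration (a real critic is learned), as are the names `actor`, `dq_da`, `dpg_step`, and `train`.

```python
import random


def actor(theta: float, s: float) -> float:
    """Deterministic linear policy a = theta * s (a toy stand-in for a network)."""
    return theta * s


def dq_da(s: float, a: float) -> float:
    """Slope of the toy critic Q(s, a) = -(a - 2s)^2 with respect to the action."""
    return -2.0 * (a - 2.0 * s)


def dpg_step(theta: float, states: list, lr: float = 0.05) -> float:
    """One gradient-ascent step: average dQ/da * da/dtheta over a batch of states."""
    grad = 0.0
    for s in states:
        a = actor(theta, s)
        grad += dq_da(s, a) * s  # da/dtheta = s for the linear actor
    return theta + lr * grad / len(states)


def train(steps: int = 500, seed: int = 0) -> float:
    """Run repeated DPG steps on batches of states drawn uniformly from [-1, 1]."""
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(steps):
        batch = [rng.uniform(-1.0, 1.0) for _ in range(32)]
        theta = dpg_step(theta, batch)
    return theta


theta = train()
print(f"learned theta = {theta:.3f}")
```

By construction the toy critic peaks at $a = 2s$, so the actor parameter should approach 2. No returns are sampled anywhere: the actor only follows the critic's slope.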
DDPG: deep deterministic policy gradient
DDPG (Lillicrap et al., 2015) was the breakthrough that married DPG with the deep-learning tricks from DQN. The paper’s title — “Continuous Control with Deep Reinforcement Learning” — names the whole problem. Its recipe is the template every later method modifies.
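A Q-network $Q_\phi$ is trained to minimize the temporal-difference error against a bootstrapped target, exactly like DQN, but the next action comes from the target actor $\mu_{\theta'}$, not an $\arg\max$:
$$y = r + \gamma\, Q_{\phi'}\big(s', \mu_{\theta'}(s')\big)$$
The loss is $L(\phi) = \mathbb{E}\big[(Q_\phi(s, a) - y)^2\big]$ over transitions $(s, a, r, s')$ sampled from a replay buffer.
The actor is updated to output actions the critic scores highly, by ascending $\nabla_\theta\, \mathbb{E}_s\big[Q_\phi(s, \mu_\theta(s))\big]$, the chain-rule gradient above.
Slowly-updated target networks (Polyak averaging, $\phi' \leftarrow \tau\phi + (1-\tau)\phi'$ with a small $\tau$, and likewise for $\theta'$) provide stable targets, and a replay buffer decorrelates data; both are inherited from DQN.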
Because the policy is deterministic, exploration is injected by hand: add noise (originally an Ornstein-Uhlenbeck process, later just Gaussian) to the actor’s output when collecting data.
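The pieces of the recipe above can be sketched as three small helpers. This is a minimal illustration, not the paper's implementation: the names `td_target`, `polyak_update`, and `noisy_action` are hypothetical, parameters are plain lists of floats rather than network weights, and the `done` flag is an addition the text does not discuss.

```python
import random


def td_target(reward: float, next_q: float, gamma: float = 0.99, done: bool = False) -> float:
    """Bootstrapped target y = r + gamma * Q_target(s', mu_target(s')).

    next_q is assumed to be the target critic evaluated at the target actor's action.
    The done flag (an assumption, not from the text) drops the bootstrap term at episode end.
    """
    return reward + (0.0 if done else gamma * next_q)


def polyak_update(target_params: list, online_params: list, tau: float = 0.005) -> list:
    """Move each target parameter a fraction tau toward its online counterpart."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]


def noisy_action(action: list, sigma: float, low: float, high: float, rng: random.Random) -> list:
    """Add Gaussian exploration noise to a deterministic action, then clip to the valid range."""
    return [min(high, max(low, a + rng.gauss(0.0, sigma))) for a in action]


rng = random.Random(0)
print(td_target(reward=1.0, next_q=10.0, gamma=0.9))
print(polyak_update([0.0, 0.0], [1.0, -1.0], tau=0.1))
print(noisy_action([0.3, -0.9], sigma=0.1, low=-1.0, high=1.0, rng=rng))
```

A small `tau` keeps the targets lagging well behind the online networks, which is what makes the bootstrapped target stable.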
DDPG works, and on a good day it is remarkably sample-efficient. But it earned a reputation as brittle and hyperparameter-sensitive: small changes in learning rate or reward scale can send it from solved to diverged. The root cause is a specific, diagnosable failure.
TD3: three fixes for a brittle algorithm
TD3 — Twin Delayed DDPG (Fujimoto et al., 2018, “Addressing Function Approximation Error in Actor-Critic Methods”) — is not a new paradigm. It is DDPG with three targeted patches that, together, turn it into one of the most reliable continuous-control algorithms. The paper’s contribution is showing exactly why DDPG fails and fixing each cause.
Learn two independent critics and use the minimum of the two when forming the target. Taking the min systematically underestimates rather than overestimates — a conservative bias that is far safer to bootstrap. This is the single most important fix.
Update the actor (and target networks) less often than the critics — typically once every two critic updates. Letting the value estimate settle before the actor chases it prevents the actor from amplifying transient critic errors.
Add clipped noise to the target action so the policy can’t exploit a sharp, spurious peak in $Q$. It regularizes the value estimate to be smooth over a small neighborhood of similar actions, a form of value smoothing.
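The clipped double-Q target combines the twin critics and target smoothing in one expression:
$$y = r + \gamma \min_{i=1,2} Q_{\phi_i'}\big(s', \tilde a\big), \qquad \tilde a = \operatorname{clip}\big(\mu_{\theta'}(s') + \operatorname{clip}(\epsilon, -c, c),\ a_{\text{low}},\ a_{\text{high}}\big), \quad \epsilon \sim \mathcal{N}(0, \sigma^2)$$
Both critics are trained toward this same target $y$; the actor is updated (on the delayed schedule) against only the first critic, $Q_{\phi_1}$. With these changes TD3 dramatically outperforms DDPG and matches or beats the original SAC across the standard MuJoCo benchmark suite.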
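As a rough sketch of how the three fixes fit together, the helper below builds a TD3-style target for a scalar action and exposes the delay as a simple schedule check. The function names are hypothetical, the critics are passed in as plain callables, and the defaults (noise scale 0.2, noise clip 0.5, delay 2) are commonly used values treated here as assumptions.

```python
import random


def clip(x: float, low: float, high: float) -> float:
    """Clamp x to the interval [low, high]."""
    return max(low, min(high, x))


def td3_target(reward, next_action, q1_target, q2_target, rng,
               gamma=0.99, sigma=0.2, noise_clip=0.5, act_low=-1.0, act_high=1.0):
    """Clipped double-Q target with target policy smoothing (scalar action for brevity).

    q1_target and q2_target are callables a -> Q(s', a) standing in for the target critics;
    next_action is assumed to be the target actor's output at s'.
    """
    noise = clip(rng.gauss(0.0, sigma), -noise_clip, noise_clip)
    smoothed = clip(next_action + noise, act_low, act_high)
    return reward + gamma * min(q1_target(smoothed), q2_target(smoothed))


def should_update_actor(step: int, policy_delay: int = 2) -> bool:
    """Delayed updates: refresh the actor and targets once every policy_delay critic updates."""
    return step % policy_delay == 0


rng = random.Random(0)
y = td3_target(1.0, 0.4, lambda a: 5.0 - a * a, lambda a: 4.5 - a * a, rng)
print(f"target = {y:.3f}")
print([step for step in range(1, 7) if should_update_actor(step)])
```

Both critics would regress toward the same `y`, while the actor update runs only on the steps where `should_update_actor` returns true.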
Go deeper: why the minimum of two critics works
Two critics trained on the same data are not identical: they have different random initializations and see minibatches in a different order, so their errors are partially uncorrelated. Where one critic happens to overestimate a particular state-action pair, the other often does not, and the $\min$ picks the lower (less inflated) value. The result is a deliberate downward bias. A small underestimate is benign: the actor simply won’t chase a value bubble that isn’t there. An overestimate is toxic because the actor is built to seek it out. TD3 trades a little pessimism for a lot of stability.
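A small Monte Carlo experiment can make the two biases concrete. The sketch below assumes every true value is exactly zero and that estimation errors are independent unit Gaussians. That is an idealization, since the text only claims the critics' errors are partially uncorrelated. The function names are hypothetical.

```python
import random


def mean_max_over_actions(n_actions: int = 10, trials: int = 20000, seed: int = 0) -> float:
    """Average of max over noisy action values whose true value is 0 (maximization bias)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.gauss(0.0, 1.0) for _ in range(n_actions))
    return total / trials


def mean_min_of_two(trials: int = 20000, seed: int = 0) -> float:
    """Average of the min of two independent noisy estimates of a true value of 0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += min(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
    return total / trials


print(f"max over 10 noisy actions: {mean_max_over_actions():+.2f}")
print(f"min of two noisy critics:  {mean_min_of_two():+.2f}")
```

The first average comes out clearly positive and the second clearly negative, even though nothing is actually worth more or less than zero. Maximizing over noise inflates the estimate, and taking the minimum of two estimates deflates it.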
SAC: maximum-entropy reinforcement learning
Soft Actor-Critic (Haarnoja et al., 2018) attacks the same problem from a different angle. Instead of a deterministic actor plus hand-tuned exploration noise, SAC makes the policy stochastic and changes the objective itself: maximize reward and the policy’s entropy.
$$J(\pi) = \sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\big[r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big)\big]$$
The entropy term $\mathcal{H}$ rewards the policy for staying as random as it can while still solving the task. The temperature $\alpha$ trades off the two. This maximum-entropy framing has three big payoffs: built-in, state-dependent exploration; robustness, because the policy keeps multiple good options alive instead of collapsing onto one; and far less hyperparameter fuss than DDPG.
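SAC keeps TD3’s clipped double-Q trick, but its soft target adds an entropy bonus, and the next action is sampled from the current stochastic policy rather than a target actor:
$$y = r + \gamma\Big(\min_{i=1,2} Q_{\phi_i'}(s', a') - \alpha \log \pi_\theta(a' \mid s')\Big), \qquad a' \sim \pi_\theta(\cdot \mid s')$$
To train the stochastic actor with backpropagation, SAC uses the reparameterization trick: sample noise $\epsilon \sim \mathcal{N}(0, I)$ and squash a Gaussian through a $\tanh$, $a = \tanh\big(\mu_\theta(s) + \sigma_\theta(s) \odot \epsilon\big)$, so gradients flow through the sampling step.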
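The sampling step and the soft target can be sketched without any deep-learning library. The code below is an illustration under assumptions: the policy's mean and log standard deviation are passed in as plain lists, the names `sample_squashed` and `soft_target` are hypothetical, and the small constant inside the logarithm is a numerical guard rather than part of the math.

```python
import math
import random


def sample_squashed(mean: list, log_std: list, rng: random.Random):
    """Reparameterized sample a = tanh(mean + std * eps) together with log pi(a | s)."""
    action, log_prob = [], 0.0
    for m, ls in zip(mean, log_std):
        eps = rng.gauss(0.0, 1.0)          # noise drawn independently of the parameters
        u = m + math.exp(ls) * eps         # pre-squash Gaussian sample
        a = math.tanh(u)                   # squash into (-1, 1)
        # Log-density of u under the diagonal Gaussian.
        log_prob += -0.5 * eps * eps - ls - 0.5 * math.log(2.0 * math.pi)
        # Change-of-variables correction for the tanh squash.
        log_prob -= math.log(1.0 - a * a + 1e-6)
        action.append(a)
    return action, log_prob


def soft_target(reward, q1_next, q2_next, next_log_prob, alpha=0.2, gamma=0.99):
    """Soft target: min of the two target critics minus alpha times the next log-probability."""
    return reward + gamma * (min(q1_next, q2_next) - alpha * next_log_prob)


rng = random.Random(0)
a_next, logp_next = sample_squashed([0.1, -0.3], [-1.0, -0.5], rng)
print(a_next, logp_next)
print(soft_target(1.0, q1_next=5.0, q2_next=4.0, next_log_prob=logp_next))
```

Because the noise is drawn separately from the parameters, the action is a differentiable function of the mean and log standard deviation. That is what lets an autodiff framework push gradients through the sample.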
Go deeper: automatic temperature tuning
The temperature $\alpha$ is the trickiest knob in maximum-entropy RL: too high and the agent acts randomly forever, too low and it collapses to a deterministic policy and stops exploring. The follow-up paper (“Soft Actor-Critic Algorithms and Applications”, 2018) made $\alpha$ self-tuning: you specify a target entropy $\bar{\mathcal{H}}$ (a common default is $-\dim(\mathcal{A})$, i.e. minus the number of action dimensions) and $\alpha$ is adjusted by gradient descent to hit it:
$$J(\alpha) = \mathbb{E}_{a_t \sim \pi_t}\big[-\alpha \log \pi_t(a_t \mid s_t) - \alpha\, \bar{\mathcal{H}}\big]$$
This is why modern SAC is so close to plug-and-play: the one hyperparameter that mattered most now tunes itself, leaving little to hand-fit.
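One way to picture the self-tuning temperature is as a single scalar nudged by the gap between the policy's current entropy and the target. The sketch below optimizes $\log \alpha$ rather than $\alpha$ to keep the temperature positive. That is a common implementation choice, assumed here rather than taken from the text. The names `target_entropy` and `update_log_alpha` are hypothetical, and so is the learning rate.

```python
import math


def target_entropy(action_dims: int) -> float:
    """Common default mentioned above: minus the number of action dimensions."""
    return -float(action_dims)


def update_log_alpha(log_alpha: float, log_probs: list, target_ent: float, lr: float = 3e-4) -> float:
    """One gradient-descent step on J = -log_alpha * mean(log_prob + target_entropy)."""
    mean_gap = sum(lp + target_ent for lp in log_probs) / len(log_probs)
    grad = -mean_gap  # dJ / d(log_alpha)
    return log_alpha - lr * grad


# Hypothetical batch of log-probabilities from a policy with 2 action dimensions.
log_alpha = 0.0
batch_log_probs = [1.5, 2.0, 2.5]  # high log-probs, so the entropy estimate is low
log_alpha = update_log_alpha(log_alpha, batch_log_probs, target_entropy(2), lr=0.1)
print(f"alpha after one step: {math.exp(log_alpha):.4f}")
```

When the sampled log-probabilities imply an entropy below the target, the gap is positive and the temperature rises, strengthening the entropy bonus. When the policy is more random than the target, the temperature falls.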
DDPG vs TD3 vs SAC at a glance
| | DDPG | TD3 | SAC |
|---|---|---|---|
| Policy | Deterministic | Deterministic | Stochastic |
| Critics | One | Two (clipped min) | Two (clipped min) |
| Exploration | External noise on action | External noise on action | Entropy bonus (built-in) |
| Policy update | Every step | Delayed | Every step |
| Objective | Expected return | Expected return | Return + entropy |
| Robustness | Brittle | Stable | Very stable |
| Year / paper | 2015 (Lillicrap) | 2018 (Fujimoto) | 2018 (Haarnoja) |
In practice the two stable algorithms trade blows by environment. On the standard MuJoCo benchmarks, SAC tends to win on high-dimensional tasks like HalfCheetah and Humanoid (where its exploration shines), while TD3 is competitive or better on lower-dimensional ones like Hopper. The honest summary: SAC is the safer default, especially for real-robot work where exploration and robustness matter most, but a well-tuned TD3 is a strong, simpler baseline.
Where these methods are used
Off-policy continuous control is the backbone of RL in robotics: manipulation, dexterous hands, and locomotion where each real-world sample is slow and expensive, so sample efficiency is paramount. SAC in particular is a favorite for learning directly on physical robots because its entropy-driven exploration is gentle and self-regulating. Beyond robotics, these methods show up in autonomous-driving controllers, energy/HVAC optimization, and continuous-action game and simulation agents. For the broader topic, see actor-critic methods (the parent of continuous control) and the robotics and control RL environment vendors.
Researcher takes
Levine offers a conceptual argument for why maximum-entropy methods like SAC are not just an exploration trick but provably solve a deeper class of robust-control problems:
Frequently asked questions
Why can’t I just use DQN for continuous control?
DQN selects actions with $\arg\max_a Q(s, a)$, which requires evaluating every action. With a continuous action space that maximization is itself an optimization problem at every timestep. DDPG/TD3/SAC sidestep it by training an actor network to output the maximizing action directly.
Should I default to TD3 or SAC?
SAC is the more robust default for most continuous-control problems, especially where exploration is hard or you’re training on hardware — its entropy objective handles exploration automatically and it has fewer brittle hyperparameters. TD3 is a strong, slightly simpler baseline that can edge SAC out on some lower-dimensional tasks. Try SAC first.
How do TD3 and SAC both fix DDPG’s overestimation?
Both use clipped double-Q learning — two critics with the target taken as their minimum, which biases estimates downward and prevents the actor from exploiting inflated Q-values. TD3 adds delayed policy updates and target smoothing on top; SAC adds the entropy term and a stochastic policy.
When would I use PPO instead?
PPO is on-policy: less sample-efficient but extremely robust and trivially parallelizable. If you have a fast simulator and can run thousands of parallel environments (common in legged-robot sim-to-real), PPO often wins. If samples are scarce and expensive, the off-policy SAC/TD3 family is far more efficient. See on-policy vs off-policy.
Key papers
- Deterministic Policy Gradient Algorithms — Silver et al., 2014 — the theorem that makes deterministic actor-critic possible.
- Continuous Control with Deep RL (DDPG) — Lillicrap et al., 2015 — the foundational deep continuous-control algorithm.
- Addressing Function Approximation Error (TD3) — Fujimoto et al., 2018 — diagnoses and fixes DDPG’s overestimation.
- Soft Actor-Critic — Haarnoja et al., 2018 — maximum-entropy off-policy actor-critic.
- Soft Actor-Critic Algorithms and Applications — Haarnoja et al., 2018 — automatic temperature tuning and real-robot results.
Related
Actor-critic methods · Policy gradients · Deep Q-networks · PPO · On-policy vs off-policy · Value functions · RL in robotics · Exploration vs exploitation