reinforcement-learning.com

Deep Q-Networks (DQN)

How DQN combined Q-learning with deep neural nets to play Atari from pixels — experience replay, target networks, the loss, the deadly triad, and Rainbow.

Updated 2026-06-07
Key takeaways
  • DQN scales tabular Q-learning to high-dimensional inputs by replacing the Q-table with a deep neural network that maps states to action values.
  • Two tricks make it stable: an experience replay buffer (break correlations in the data) and a slowly-updated target network (stop the target chasing its own tail).
  • In 2015 a single DQN design, with the same architecture and hyperparameters for every game, reached play comparable to a human tester across 49 Atari games straight from raw pixels, launching the deep RL era.
  • Plain DQN overestimates values and is sample-hungry; Double DQN, Dueling heads, Prioritized Replay and the combined Rainbow agent fixed most of its weaknesses.

What is a Deep Q-Network?

A Deep Q-Network (DQN) is Q-learning with a neural network standing in for the lookup table. Classic Q-learning stores one number, the expected long-run return, for every (state, action) pair in a table. That works for gridworlds, but the moment the state is an image (an Atari screen is $210\times160$ pixels) the table would need more entries than there are atoms in the universe. DQN throws the table away and trains a deep network $Q(s,a;\theta)$ to approximate those numbers, so it can generalize to states it has never seen exactly.
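
To put a rough number on the table-size claim, the sketch below counts the distinct screens a table would have to index. It assumes a 128-colour palette per pixel (the figure given for the Atari 2600 in the 2015 Nature paper) and takes about 10^80 as the usual order-of-magnitude estimate for atoms in the observable universe. Both numbers are assumptions brought in for illustration, not values from this article.

```python
import math

WIDTH, HEIGHT = 160, 210   # Atari frame size in pixels
COLOURS = 128              # assumed palette size per pixel
LOG10_ATOMS = 80           # assumed order of magnitude for atoms in the universe

# The number of distinct screens is COLOURS ** (WIDTH * HEIGHT);
# work in log10 so the result stays printable.
log10_screens = WIDTH * HEIGHT * math.log10(COLOURS)

print(f"distinct screens    ~ 10^{log10_screens:.0f}")
print(f"ratio to atom count ~ 10^{log10_screens - LOG10_ATOMS:.0f}")
```

Even before multiplying by the number of actions, the exponent runs to tens of thousands, which is why a function approximator is needed rather than a bigger table.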

The breakthrough wasn’t the idea of using a network: people had tried that for years and it tended to diverge. The breakthrough was a pair of stabilizing tricks (experience replay and a target network) that finally made non-linear function approximation in RL work reliably, and at a scale nobody had managed before: one architecture and one set of hyperparameters, trained separately on each of 49 different games from nothing but pixels and the score.

[Figure: four stacked 84×84 frames → Conv → Conv → Conv (convolutional feature extractor) → FC 512 → one output per action, e.g. Q(s, ↑) = 8.2, Q(s, →) = 9.7 ✓, Q(s, fire) = 3.1]
DQN architecture: raw pixel frames flow through convolutional layers, then fully-connected layers output one Q-value per action in a single forward pass. The agent picks the action with the highest value.
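
The layer shapes in that pipeline can be traced with a few lines of arithmetic. The kernel sizes, strides, and channel counts below are assumptions taken from the 2015 Nature paper (the figure only says three conv layers and a 512-unit layer), and 18 actions is just an example action count.

```python
def conv_out(size, kernel, stride):
    """Spatial output size of a convolution without padding."""
    return (size - kernel) // stride + 1

# (out_channels, kernel, stride) per conv layer; assumed from the 2015 paper.
CONV_LAYERS = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]

def dqn_shapes(n_actions, size=84, in_channels=4):
    """Return the tensor shape after each stage of the network."""
    shapes = [(in_channels, size, size)]
    for out_channels, kernel, stride in CONV_LAYERS:
        size = conv_out(size, kernel, stride)
        shapes.append((out_channels, size, size))
    flat = shapes[-1][0] * size * size          # flatten before the dense layers
    shapes += [(flat,), (512,), (n_actions,)]   # FC 512, then one Q-value per action
    return shapes

for shape in dqn_shapes(n_actions=18):
    print(shape)
```

The final shape has one entry per action, which is what lets a single forward pass score every action at once.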

From Q-learning to a network

Recall the optimal action-value function. It obeys the Bellman optimality equation:

$$Q^*(s,a) = \mathbb{E}_{s'}\!\left[\, r + \gamma \max_{a'} Q^*(s',a') \;\middle|\; s,a \,\right]$$

Tabular Q-learning nudges each table entry toward that right-hand side. DQN does the same thing, but the “entry” is now the output of a network with parameters $\theta$, so instead of overwriting a cell we take a gradient step to reduce the squared difference between prediction and target:

$$L(\theta) = \mathbb{E}_{(s,a,r,s')\sim D}\!\left[\Big(\underbrace{r + \gamma \max_{a'} Q(s',a';\theta^-)}_{\text{TD target}} - Q(s,a;\theta)\Big)^2\right]$$

Two subtleties hide in that formula, and they are the whole reason DQN works:

  • The expectation is over $(s,a,r,s')$ drawn from $D$, a replay buffer of stored past transitions, not from the latest trajectory.
  • The target uses $\theta^-$, the parameters of a separate target network, not the live weights $\theta$.
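
Spelling out the gradient makes the role of the frozen target explicit. This is a sketch of the standard semi-gradient derivation; the step size $\alpha$ is a symbol introduced here for illustration and does not appear in the loss above.

```latex
y = r + \gamma \max_{a'} Q(s',a';\theta^-)
\qquad \text{(constant with respect to } \theta \text{)}

\nabla_\theta L(\theta)
  = -2\, \mathbb{E}_{(s,a,r,s')\sim D}\!\left[ \big( y - Q(s,a;\theta) \big)\, \nabla_\theta Q(s,a;\theta) \right]

\theta \leftarrow \theta + \alpha \, \big( y - Q(s,a;\theta) \big)\, \nabla_\theta Q(s,a;\theta)
```

Because $y$ depends only on $\theta^-$, no gradient flows through the target. The factor of 2 is absorbed into $\alpha$, and the last line is the single-sample version of the update.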

The two stabilizers

Experience replay

Every transition (s, a, r, s′) the agent sees is pushed into a large buffer (often 1M frames). Updates sample a random minibatch from it. This breaks the strong correlation between consecutive frames, lets each experience be reused many times (sample efficiency), and averages over many past behaviours so the data distribution shifts slowly.
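
A minimal sketch of such a buffer, assuming uniform sampling and first-in-first-out eviction. The class name and the extra `done` flag (needed later to handle terminal states) are illustrative choices, not taken from any particular library.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO store of transitions with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        self.storage = deque(maxlen=capacity)   # a full deque drops its oldest item
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement ignores the order of arrival.
        return self.rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)

buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buffer.push(state=t, action=0, reward=0.0, next_state=t + 1, done=False)

print(len(buffer))                                        # 3
print([transition[0] for transition in buffer.storage])   # [2, 3, 4]
```

Copying the deque to a list on every sample is fine for a toy example. At the 1M-transition scale a preallocated array with a write index would be the more usual design.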

Target network

A frozen copy of the network, $\theta^-$, supplies the bootstrap target $r + \gamma \max_{a'} Q(s',a';\theta^-)$. It’s synced to the live weights only every $C$ steps (e.g. every 10k updates). Because the target moves in slow, discrete jumps, the feedback loop that causes value estimates to blow up is largely tamed.

  • 49: Atari games learned by one DQN design (2015 Nature)
  • >75%: of human score reached on 29 of those games
  • 1M: transitions held in the replay buffer

The training loop, step by step

1. Act with ε-greedy exploration

At state $s$, with probability $\varepsilon$ pick a random action, otherwise pick $\arg\max_a Q(s,a;\theta)$. $\varepsilon$ is annealed from 1.0 down to ~0.1 over training so the agent explores early and exploits later: the classic exploration vs exploitation trade-off.

2. Store the transition

Observe reward $r$ and next state $s'$. Push the tuple $(s,a,r,s')$ into the replay buffer $D$, evicting the oldest if full.

3. Sample a minibatch

Draw a random minibatch of transitions from $D$. Random sampling is what decorrelates the data: consecutive Atari frames are almost identical, and training on them in order would be unstable.

4. Compute targets with the target network

For each sampled transition, the target is $y = r$ if $s'$ is terminal, else $y = r + \gamma \max_{a'} Q(s',a';\theta^-)$. Crucially this uses the frozen weights $\theta^-$.

5. Gradient step on the live network

Minimize $\big(y - Q(s,a;\theta)\big)^2$ with respect to $\theta$ (the original paper clips the error term and uses RMSProp). Only the live network learns.

6. Periodically refresh the target

Every $C$ steps, copy $\theta \rightarrow \theta^-$. Then loop back to step 1.
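
The six steps can be sketched end to end on a toy problem. Everything specific here is an assumption made for illustration: a hypothetical five-state chain instead of Atari, a one-hot "network" whose weights are simply a table `theta[s, a]`, plain SGD instead of RMSProp, and hyperparameters picked so the example finishes in about a second.

```python
import random
from collections import deque
import numpy as np

N_STATES, N_ACTIONS, GAMMA = 5, 2, 0.9

def env_step(s, a):
    """Hypothetical chain: action 1 moves right, action 0 moves left (state 0 is a wall).
    Reaching the last state pays reward 1 and ends the episode."""
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    done = s_next == N_STATES - 1
    return s_next, float(done), done

def train(steps=5000, capacity=1000, batch_size=32, sync_every=100, lr=0.5, seed=0):
    rng = random.Random(seed)
    theta = np.zeros((N_STATES, N_ACTIONS))   # live weights: Q(s, a; theta) = theta[s, a]
    theta_target = theta.copy()               # frozen copy, theta^-
    buffer = deque(maxlen=capacity)
    s = 0
    for t in range(steps):
        # Step 1: epsilon-greedy, epsilon annealed linearly from 1.0 to 0.1 over the first half.
        eps = max(0.1, 1.0 - 0.9 * t / (steps / 2))
        if rng.random() < eps:
            a = rng.randrange(N_ACTIONS)
        else:
            a = int(np.argmax(theta[s]))
        # Step 2: act, then store the transition (the deque evicts the oldest when full).
        s_next, r, done = env_step(s, a)
        buffer.append((s, a, r, s_next, done))
        s = 0 if done else s_next
        if len(buffer) < batch_size:
            continue
        # Step 3: uniform random minibatch.
        batch = rng.sample(list(buffer), batch_size)
        grad_step = np.zeros_like(theta)
        for bs, ba, br, bs_next, bdone in batch:
            # Step 4: target from the frozen weights; no bootstrap at terminal states.
            y = br if bdone else br + GAMMA * theta_target[bs_next].max()
            # Step 5: clipped TD error; with one-hot features dQ/dtheta[bs, ba] = 1.
            grad_step[bs, ba] += np.clip(y - theta[bs, ba], -1.0, 1.0)
        theta += lr * grad_step / batch_size
        # Step 6: periodic hard sync of the target weights.
        if (t + 1) % sync_every == 0:
            theta_target = theta.copy()
    return theta

theta = train()
print(np.round(theta, 2))
print("greedy action per non-terminal state:", np.argmax(theta[:-1], axis=1))
```

Swapping the table for a deep network changes only how `theta` is stored and how the gradient in step 5 is computed; the structure of the loop stays the same.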

Go deeper: the preprocessing that made Atari tractable

DQN doesn’t feed raw $210\times160$ RGB frames to the network. Each frame is converted to grayscale, cropped/downsampled to $84\times84$, and four consecutive frames are stacked as input channels. The stacking matters: a single frame can’t tell you whether the ball is moving up or down, so a static image isn’t a Markov state. Stacking four frames restores enough of the velocity/direction information to make the problem approximately Markov. DQN also uses frame-skipping (the agent acts every 4th frame and repeats the action in between) to cut compute roughly 4×.
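
A rough sketch of that pipeline in NumPy. The luma weights and the nearest-neighbour downsampling are assumptions chosen to keep the example dependency-free; the paper's exact resizing differs in detail, and `FrameStack` is a hypothetical helper name.

```python
from collections import deque
import numpy as np

def preprocess(frame_rgb, out_size=84):
    """Turn one (210, 160, 3) uint8 frame into an (84, 84) grayscale array in [0, 1]."""
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)   # assumed luma weights
    gray = frame_rgb.astype(np.float32) @ weights
    height, width = gray.shape
    rows = np.arange(out_size) * height // out_size   # nearest-neighbour row indices
    cols = np.arange(out_size) * width // out_size    # nearest-neighbour column indices
    return gray[np.ix_(rows, cols)] / 255.0

class FrameStack:
    """Keeps the last k preprocessed frames as one (k, 84, 84) state."""

    def __init__(self, k=4):
        self.frames = deque(maxlen=k)

    def reset(self, frame_rgb):
        first = preprocess(frame_rgb)
        for _ in range(self.frames.maxlen):   # fill the stack with the first frame
            self.frames.append(first)
        return self.state()

    def push(self, frame_rgb):
        self.frames.append(preprocess(frame_rgb))
        return self.state()

    def state(self):
        return np.stack(list(self.frames), axis=0)   # oldest frame first

rng = np.random.default_rng(0)
frame_a = rng.integers(0, 256, size=(210, 160, 3), dtype=np.uint8)
frame_b = rng.integers(0, 256, size=(210, 160, 3), dtype=np.uint8)

stack = FrameStack(k=4)
state = stack.reset(frame_a)
state = stack.push(frame_b)
print(state.shape)   # (4, 84, 84)
```

Only the newest plane changes on each `push`, so the difference between planes carries the motion information a single frame lacks.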

Why it nearly didn’t work: the deadly triad

Combining three ingredients is known to make value-based RL diverge — Sutton and Barto call it the deadly triad:

Ingredient | DQN uses it? | Why it’s dangerous
Function approximation | Yes: a deep net | Updates to one state leak into others; errors can compound
Bootstrapping | Yes: one-step TD target | The target depends on the model’s own (possibly wrong) estimates
Off-policy learning | Yes: replay + greedy target | Training data comes from a different policy than the one being improved

DQN hits all three at once, which is exactly why earlier attempts at “neural Q-learning” tended to blow up. The target network and replay buffer don’t remove the triad; they damp the feedback loops enough that learning usually converges in practice. “Deep Reinforcement Learning and the Deadly Triad”, by Hado van Hasselt and colleagues, is the definitive study of when it does and doesn’t bite.
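
The triad can be seen in a textbook two-state example, often called "w to 2w" and discussed by Sutton and Barto. The sketch below uses state values rather than action values to keep it short; the step size and number of updates are arbitrary illustrative choices. One shared weight `w` gives the two states values `w` and `2w` (function approximation), the target is built from the model's own estimate (bootstrapping), and only the first transition is ever updated (an off-policy data distribution).

```python
def run(gamma, alpha=0.1, w=1.0, n_updates=100):
    """Repeat the semi-gradient TD(0) update on the single transition s1 -> s2 (reward 0)."""
    phi_s1, phi_s2 = 1.0, 2.0   # linear features: V(s) = w * phi(s)
    for _ in range(n_updates):
        td_error = 0.0 + gamma * w * phi_s2 - w * phi_s1   # bootstrapped target minus prediction
        w += alpha * td_error * phi_s1                      # d(w * phi_s1)/dw = phi_s1
    return w

print(run(gamma=0.9))   # the weight keeps growing
print(run(gamma=0.4))   # the weight decays toward the true value, 0
```

Each update multiplies `w` by `1 + alpha * (2 * gamma - 1)`, so the estimate grows without bound whenever `gamma` exceeds 0.5, even though every reward is zero.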


Overestimation, and the fixes that became Rainbow

Plain DQN has a systematic flaw: the $\max$ in the target is taken over noisy estimates, and the max of noisy numbers is biased upward. The agent becomes overconfident about actions it has barely tried. A cascade of extensions fixed this and other weaknesses; each is a small change, and stacking all of them gives the Rainbow agent.
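
The upward bias is easy to reproduce numerically. In this sketch every action is truly worth 0 and the estimates carry zero-mean Gaussian noise; the number of actions, the noise level, and the trial count are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials, noise_std = 10, 10_000, 1.0

true_q = np.zeros(n_actions)   # every action is truly worth 0
noisy_q = true_q + rng.normal(0.0, noise_std, size=(n_trials, n_actions))

# Bias of "max over noisy estimates" versus bias of a plain average.
bias_of_max = noisy_q.max(axis=1).mean() - true_q.max()
bias_of_mean = noisy_q.mean(axis=1).mean() - true_q.mean()

print(f"bias of the max:  {bias_of_max:+.2f}")
print(f"bias of the mean: {bias_of_mean:+.2f}")
```

The individual estimates are unbiased, yet the max is clearly positive, and the gap widens with more actions or noisier estimates.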

Extension | One-line idea | Paper
Double DQN | Select the action with the live net, evaluate it with the target net; decouples the two roles of the max to cut overestimation | van Hasselt 2015
Dueling DQN | Split the head into a state-value $V(s)$ stream and an advantage $A(s,a)$ stream, recombined into $Q$ | Wang 2015
Prioritized Replay | Sample transitions with large TD-error more often; learn from surprises first | Schaul 2015
Multi-step returns | Bootstrap from $n$ steps ahead instead of 1 for faster reward propagation | (n-step TD)
Distributional RL (C51) | Learn the full distribution of returns, not just the mean | Bellemare 2017
Noisy Nets | Learnable noise in the weights replaces ε-greedy for exploration | Fortunato 2017
Rainbow | Combine all six; beats every individual component | Hessel 2017
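
As one concrete instance from the table, the dueling head's recombination step can be sketched in a few lines. Subtracting the mean advantage is the aggregation described in the dueling paper; it is an assumption here, since the table only says the streams are recombined.

```python
import numpy as np

def dueling_q(value, advantages):
    """Combine V(s), shape (batch, 1), and A(s, a), shape (batch, n_actions), into Q(s, a)."""
    return value + advantages - advantages.mean(axis=1, keepdims=True)

value = np.array([[2.0]])
advantages = np.array([[1.0, 0.0, -1.0]])

q = dueling_q(value, advantages)
print(q)   # [[3. 2. 1.]]

# Shifting every advantage by a constant leaves Q unchanged,
# which is what makes the V / A split identifiable.
print(np.allclose(q, dueling_q(value, advantages + 5.0)))   # True
```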
Go deeper: the Double DQN correction in one line

Standard DQN target: $y = r + \gamma\, Q\big(s', \arg\max_{a'} Q(s',a';\theta^-);\ \theta^-\big)$. The same network both picks and scores the best next action, so its own optimistic errors reinforce themselves.

Double DQN target: $y = r + \gamma\, Q\big(s', \arg\max_{a'} Q(s',a';\theta);\ \theta^-\big)$. The live network picks the action, the target network scores it. Splitting selection from evaluation removes most of the upward bias at essentially zero extra cost, since DQN already maintains both networks.
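
The two targets differ by a single indexing step, which the sketch below makes explicit. The Q-value arrays are made-up numbers standing in for network outputs on a batch of two next states, `gamma=0.99` is an illustrative default, and terminal-state handling is left out for brevity.

```python
import numpy as np

def dqn_target(r, q_next_target, gamma=0.99):
    """Standard target: the target network both selects and evaluates."""
    return r + gamma * q_next_target.max(axis=1)

def double_dqn_target(r, q_next_live, q_next_target, gamma=0.99):
    """Double DQN target: the live network selects, the target network evaluates."""
    best = q_next_live.argmax(axis=1)
    chosen = q_next_target[np.arange(len(best)), best]
    return r + gamma * chosen

r = np.array([0.0, 1.0])
q_next_live = np.array([[1.0, 2.0, 0.5], [0.3, 0.1, 0.2]])     # Q(s', . ; theta)
q_next_target = np.array([[1.5, 1.0, 3.0], [0.4, 0.0, 0.1]])   # Q(s', . ; theta^-)

standard = dqn_target(r, q_next_target)
double = double_dqn_target(r, q_next_live, q_next_target)
print(standard)
print(double)
```

For the same arrays the Double DQN target can never exceed the standard one, because the target network's value for any chosen action is at most its own maximum. When both networks agree on the best action, as in the second row, the two targets coincide.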

DQN vs policy-gradient methods

DQN learns a value function and derives the policy implicitly by taking the argmax. The other major family — policy gradients and actor-critic — parameterizes and optimizes the policy directly. The practical split:

Aspect | DQN (value-based) | Policy gradient / actor-critic
Action space | Discrete only (needs an argmax) | Discrete or continuous
Policy | Deterministic, implicit (argmax of Q) | Explicit, often stochastic
Sample efficiency | High: replay reuses data heavily | Lower for on-policy methods
Stability | Can diverge (deadly triad) | On-policy methods like PPO are smoother
Sweet spot | Atari, games, discrete control | Robotics, continuous control, LLM post-training

DQN’s discrete-action limitation is structural: in general there is no cheap, exact argmax over a continuous action space. DDPG and SAC can be read as the continuous-control descendants of DQN, marrying its replay-buffer and target-network machinery to an actor that outputs continuous actions.

A short history

  • 1989, Q-learning: Watkins introduces Q-learning, the tabular algorithm DQN scales up.
  • 2013, DQN on arXiv: Mnih et al. publish “Playing Atari with Deep Reinforcement Learning”, the first deep net trained end-to-end from pixels to control, on 7 games.
  • 2015, Human-level control (Nature): The expanded version adds the target network; one agent reaches human-level play on 49 Atari games and lands on the cover of Nature.
  • 2015–16, Double & Dueling DQN: Overestimation and architecture fixes sharply improve scores.
  • 2017, Rainbow: Hessel et al. combine six extensions into a single state-of-the-art agent.
  • 2020, Agent57: DeepMind’s descendant of DQN finally beats the human baseline on all 57 Atari games.

Where DQN sits today

DQN is rarely the final answer in modern production systems, but it is the foundation everyone learns first and the conceptual ancestor of most off-policy deep RL. Its ideas — replay buffers, target networks, off-policy value learning — are everywhere in model-free control, recommendation, and game AI. For continuous control and LLM alignment the field leans on PPO, actor-critic, and preference methods instead, but the replay/target-network toolkit DQN popularized still underpins SAC, TD3, and friends.

Building the simulators, replay infrastructure, and RL environments these agents train in is its own industry — see the RL environment and benchmark vendors.

Researcher takes

The lasting lesson practitioners draw from DQN is less “use these exact two tricks” and more “the deadly triad is real, and stability engineering matters as much as the learning rule.” Hado van Hasselt — author of Double DQN and the definitive deadly-triad study — frames it bluntly: DQN combines all three dangerous ingredients and shouldn’t converge, yet careful target-network and replay design make it work anyway. That tension between theoretical fragility and empirical success is the through-line of value-based deep RL, and it’s why every successor (Rainbow, Agent57, SAC) is in large part a stability story.

Frequently asked questions

What’s the difference between Q-learning and DQN?

Q-learning stores action values in a table — one entry per state-action pair — which only works for small, discrete state spaces. DQN replaces that table with a neural network so it can handle huge or continuous-input states (like images), and adds experience replay and a target network to keep the neural approximation from diverging.

Why does DQN need a target network?

The TD target $r + \gamma \max_{a'} Q(s',a')$ depends on the network’s own output. If you compute it with the live weights you’re updating, every gradient step also moves the target, creating an unstable feedback loop. Freezing a copy of the weights for the target, and refreshing it only every few thousand steps, gives the regression a stationary goal and stops value estimates from blowing up.

Can DQN handle continuous action spaces?

Not directly. DQN selects actions with an argmax over its output layer, which requires a finite, discrete set of actions. For continuous control you need a different approach — DDPG, TD3, or SAC reuse DQN’s replay buffer and target networks but add an actor network that outputs continuous actions.

Is DQN still used in 2026?

As a production algorithm it’s mostly been superseded by Rainbow-style agents and by policy-gradient methods like PPO for many tasks. But DQN remains the standard teaching entry point to deep RL, the baseline most papers compare against, and the source of ideas (off-policy replay, target networks) that are still core to modern off-policy methods.
