- Q-learning learns a table of action-values Q(s,a) — the expected long-run reward of taking action a in state s and acting optimally thereafter.
- It's model-free and off-policy: it learns the optimal policy while exploring with a different (e.g. ε-greedy) behaviour policy, by bootstrapping toward the best next action.
- One update rule does the work: nudge Q(s,a) toward r + γ·maxₐ' Q(s',a'). Watkins & Dayan (1992) proved it converges to Q* under mild conditions.
- Its max operator causes maximization bias (overestimation); Double Q-learning fixes it, and pairing Q-learning with neural nets gives Deep Q-Networks (DQN).
What is Q-learning?
Q-learning is one of the foundational algorithms of reinforcement learning: a simple, model-free method that learns how good each action is in each situation, purely from trial and error. The “Q” stands for quality — Q(s, a) is the expected total (discounted) reward you’ll collect if you take action a in state s and then act optimally forever after.
Once you have an accurate Q, the optimal policy is trivial: in any state, pick the action with the highest Q-value. No model of the environment, no planning, no separate policy network — just look up the row for your state and take the argmax. That elegance is why Q-learning, introduced by Chris Watkins in his 1989 PhD thesis, remains the gateway algorithm for understanding value functions and the launch point for Deep Q-Networks.
The Q-function and the Bellman optimality equation
The optimal action-value function Q* satisfies a self-consistency condition called the Bellman optimality equation. The value of acting now equals the immediate reward plus the discounted value of acting optimally next:
Q*(s, a) = E[ r + γ maxₐ' Q*(s', a') | s, a ]
Here γ ∈ [0,1) is the discount factor — how much future reward is worth relative to immediate reward. Low γ makes a myopic agent chasing instant payoff; γ near 1 makes a far-sighted one. This is the action-value cousin of the equations on the value functions page, set in a Markov Decision Process.
Q-learning’s whole job is to solve this equation by sampling — without ever knowing the transition probabilities or reward function. Each real transition (s, a, r, s') is a noisy sample of the right-hand side, and we slowly average those samples into our estimate.
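To make the fixed-point reading concrete, here is a minimal sketch that solves the Bellman optimality equation by repeated substitution on a tiny MDP whose model is fully known. The two-state MDP, its numbers, and the names `P`, `bellman_backup`, and `q_value_iteration` are illustrative assumptions, not part of any library. Q-learning aims at the same fixed point, but replaces the explicit expectation over next states with sampled transitions.

```python
# Hypothetical 2-state, 2-action MDP (all numbers are made up for illustration).
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)],                  # stay in state 0, no reward
        1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},  # try to move to state 1
    1: {0: [(1.0, 0, 0.0)],                  # go back to state 0
        1: [(1.0, 1, 2.0)]},                 # stay in state 1, reward 2
}

def bellman_backup(Q, s, a, gamma):
    """Right-hand side of the Bellman optimality equation for one (s, a)."""
    return sum(p * (r + gamma * max(Q[s2].values())) for p, s2, r in P[s][a])

def q_value_iteration(gamma=0.9, tol=1e-10):
    """Apply the backup to every (s, a) until the table stops changing."""
    Q = {s: {a: 0.0 for a in P[s]} for s in P}
    while True:
        new_Q = {s: {a: bellman_backup(Q, s, a, gamma) for a in P[s]} for s in P}
        delta = max(abs(new_Q[s][a] - Q[s][a]) for s in P for a in P[s])
        Q = new_Q
        if delta < tol:
            return Q

Q_star = q_value_iteration()
greedy_policy = {s: max(Q_star[s], key=Q_star[s].get) for s in Q_star}
```

In this toy problem staying in state 1 pays 2 per step, so Q*(1, 1) works out to 2 / (1 − 0.9) = 20, and the greedy policy heads for state 1 and stays there.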
The update rule
Everything in Q-learning reduces to one line, applied after every transition:
Q(s,a) ← Q(s,a) + α[r + γ maxₐ' Q(s',a') − Q(s,a)]
The bracketed term is the temporal-difference (TD) error: the gap between what we now think this action is worth (the TD target) and our old estimate. We move our estimate a fraction α of the way to close that gap.
- Learning rate α: how fast we overwrite old beliefs. α = 0 learns nothing; α = 1 keeps only the latest sample. Deterministic worlds tolerate α = 1; stochastic ones need α to shrink over time for convergence.
- Discount factor γ: how much the future counts. As γ → 1 the agent values long-horizon reward; if γ ≥ 1 the values can diverge. It sets the effective planning horizon, roughly 1/(1−γ) steps.
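The update and the two extremes of the learning rate can be sketched in a few lines. The dict-of-dicts table layout and the names `q_update` and `make_table` are assumptions chosen for readability, not a standard API.

```python
def q_update(Q, s, a, r, s_next, alpha, gamma):
    """Apply one Q-learning update in place and return the TD error."""
    td_target = r + gamma * max(Q[s_next].values())   # r + gamma * max_a' Q(s', a')
    td_error = td_target - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error

def make_table():
    # Hypothetical toy table, indexed as Q[state][action].
    return {"s": {"left": 1.0, "right": 4.0},
            "s2": {"left": 0.0, "right": 2.0}}

results = {}
for alpha in (0.0, 0.5, 1.0):
    Q = make_table()
    # Reward 3, next state "s2" whose best value is 2, gamma 0.5: target = 4.
    q_update(Q, "s", "left", r=3.0, s_next="s2", alpha=alpha, gamma=0.5)
    results[alpha] = Q["s"]["left"]
```

With α = 0 the entry stays at its old value of 1, with α = 1 it jumps straight to the target of 4, and α = 0.5 lands halfway at 2.5.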
The algorithm, end to end
1. Initialize. Set Q(s, a) for all state-action pairs, often to zero, or optimistically high to encourage early exploration. Terminal states are fixed at 0.
2. Choose an action. From the current state s, pick an action, typically ε-greedy: with probability 1−ε take argmaxₐ Q(s,a), otherwise act uniformly at random. This is the exploration-vs-exploitation knob.
3. Act and observe. Execute a, receive reward r and next state s' from the environment.
4. Update. Apply the update rule: Q(s,a) ← Q(s,a) + α[r + γ maxₐ' Q(s',a') − Q(s,a)]. Note the target uses the greedy next value, regardless of what you’ll actually do next.
5. Repeat. Set s ← s'. Loop until the episode ends, then start a new one. Decay ε (and often α) over time so the agent shifts from exploring to exploiting.
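The loop above can be sketched end to end on a toy problem. The five-cell corridor environment, the hyperparameters, and the helper names (`step`, `epsilon_greedy`, `train`) are all assumptions made for illustration; a real project would more likely plug in an environment from a library such as Gymnasium.

```python
import random

N_STATES, GOAL = 5, 4          # corridor of cells 0..4; cell 4 is terminal
ACTIONS = (0, 1)               # 0 = left, 1 = right

def step(s, a):
    """Hypothetical deterministic corridor: reward 1 on reaching GOAL, else 0."""
    s_next = max(0, s - 1) if a == 0 else s + 1
    done = s_next == GOAL
    return s_next, (1.0 if done else 0.0), done

def epsilon_greedy(Q, s, epsilon, rng):
    """Explore with probability epsilon, otherwise act greedily (random tie-break)."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(Q[s])
    return rng.choice([a for a in ACTIONS if Q[s][a] == best])

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=1.0, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]      # terminal row is never updated
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = epsilon_greedy(Q, s, epsilon, rng)
            s_next, r, done = step(s, a)
            target = r + gamma * max(Q[s_next])    # greedy next value (0 at terminal)
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
        epsilon = max(0.05, epsilon * 0.99)        # shift from exploring to exploiting
    return Q

Q = train()
policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(GOAL)]
```

Because this corridor is deterministic, a constant α is enough here; a stochastic environment would call for a decaying step size as well.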
Go deeper: a worked single update
Say Q(s, a) = 5, you take a, get reward r = 2, and land in s' whose best action has value maxₐ' Q(s',a') = 10. With α = 0.1, γ = 0.9:
- TD target = 2 + 0.9 × 10 = 11
- TD error = 11 − 5 = 6
- New Q(s,a) = 5 + 0.1 × 6 = 5.6
We nudged 10% of the way from 5 toward 11. Repeat across many visits and Q(s,a) converges to the true Q*(s,a). Crucially, the next actual action you take might be exploratory and worth only 3 — but the update ignored that and used the optimistic 10. That is off-policy learning in one number.
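The same arithmetic in code, followed by a hedged illustration of what "repeat across many visits" does. The second half assumes, purely for illustration, that the target stays frozen at 11; in real Q-learning the target moves as Q(s', ·) is itself updated.

```python
alpha, gamma = 0.1, 0.9
q, r, best_next = 5.0, 2.0, 10.0

td_target = r + gamma * best_next      # 2 + 0.9 * 10 = 11
td_error = td_target - q               # 11 - 5 = 6
q_after_one = q + alpha * td_error     # 5 + 0.1 * 6 = 5.6 (up to float rounding)

# Illustration only: with the target held fixed, every further update
# closes 10% of whatever gap remains, so the gap shrinks geometrically.
history = [q]
for _ in range(50):
    history.append(history[-1] + alpha * (td_target - history[-1]))

remaining_gap = td_target - history[-1]   # equals 6 * 0.9**50 up to rounding
```

After n updates the remaining gap is 6 × 0.9ⁿ, which is why a small α gives smooth but slow learning.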
Off-policy: the defining property
Q-learning learns about one policy (the greedy, optimal target policy) while following another (the exploratory behaviour policy). This decoupling is its superpower.
- Q-learning (off-policy): the target uses maxₐ' Q(s', a'), the value of the best next action. It learns Q* directly, even from random or old data, and tends to learn the optimal (sometimes risky) path.
- SARSA (on-policy): the target uses Q(s', a') for the action a' the policy actually takes next. It learns the value of the policy it follows, including its exploration, so it learns safer, more conservative paths.
The textbook illustration is Cliff Walking (Sutton & Barto): a gridworld with a cliff edge that gives a large negative reward. Q-learning learns the optimal policy that walks right along the cliff edge, but because it still explores ε-greedily, it occasionally steps off and scores worse online. SARSA learns a longer, safer path farther from the edge, accounting for the chance that exploration sends it over. Same environment, different objective: optimal policy vs. value-of-the-policy-being-followed.
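The whole difference between the two methods fits in the target computation. This sketch reuses the numbers from the worked update (best next value 10, exploratory next action worth 3); the function names and the dict layout are illustrative assumptions.

```python
def q_learning_target(r, gamma, next_q_values):
    """Off-policy target: bootstrap from the best next action."""
    return r + gamma * max(next_q_values.values())

def sarsa_target(r, gamma, next_q_values, next_action):
    """On-policy target: bootstrap from the action actually taken next."""
    return r + gamma * next_q_values[next_action]

next_q = {"greedy": 10.0, "exploratory": 3.0}   # hypothetical values for s'
r, gamma = 2.0, 0.9

off_policy = q_learning_target(r, gamma, next_q)            # 2 + 0.9 * 10
on_policy = sarsa_target(r, gamma, next_q, "exploratory")   # 2 + 0.9 * 3
```

When the behaviour policy happens to take the greedy action, the two targets coincide; they differ only on exploratory steps, which is exactly where SARSA's caution comes from.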
Does it converge?
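Yes, and that guarantee is a big part of why Q-learning is canonical. Watkins & Dayan (1992) proved that tabular Q-learning converges to the optimal Q* with probability 1, under conditions that are mild but must all hold:
- Every state-action pair (s, a) is visited, and updated, infinitely often.
- The learning rates satisfy Σ_t α_t = ∞ and Σ_t α_t² < ∞.
- Rewards are bounded and γ < 1.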
In plain terms: explore everything forever, and let the learning rate decay neither too fast nor too slow (e.g. α_t = 1/t satisfies both sums). Because the target uses max rather than the behaviour policy’s action, convergence does not require the behaviour policy to become greedy — it only needs to keep visiting every pair. That is the formal statement of off-policy convergence.
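A small numerical sketch of why α_t = 1/t is a natural choice: with that schedule the incremental update is exactly a running average of the sampled targets. The Gaussian "targets" with mean 4 are made-up stand-ins for noisy TD targets, and the partial sums only hint at the limiting behaviour; they are not a proof.

```python
import random

rng = random.Random(0)
samples = [rng.gauss(4.0, 2.0) for _ in range(10_000)]   # noisy targets, true mean 4

estimate = 0.0
for t, x in enumerate(samples, start=1):
    alpha_t = 1.0 / t
    estimate += alpha_t * (x - estimate)   # same form as the Q-learning update

sample_mean = sum(samples) / len(samples)  # matches `estimate` up to rounding

# Partial sums of the two step-size conditions for alpha_t = 1/t.
N = 100_000
sum_alpha = sum(1.0 / t for t in range(1, N + 1))          # keeps growing, like ln(N)
sum_alpha_sq = sum(1.0 / t ** 2 for t in range(1, N + 1))  # stays below pi**2 / 6
```

The first sum being unbounded lets the estimate travel any distance from a bad initial value; the second being finite makes the noise average out.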
Maximization bias and Double Q-learning
Q-learning has a subtle, systematic flaw. Because the update uses maxₐ' Q(s', a'), and the Q estimates are noisy, the max tends to pick whichever action got lucky noise — so it systematically overestimates the true value. This is maximization bias (the “optimizer’s curse”): the maximum of noisy estimates is biased above the maximum of the true values.
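Double Q-learning (van Hasselt, 2010) fixes this by keeping two independent value tables, Q_A and Q_B. On each update it uses one to choose the best next action and the other to evaluate it, swapping the roles of the two tables at random from step to step:
Q_A(s,a) ← Q_A(s,a) + α[r + γ Q_B(s', argmaxₐ' Q_A(s',a')) − Q_A(s,a)]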
Because the chooser and the evaluator have independent noise, the lucky-outlier action selected by Q_A is unlikely to also be overvalued by Q_B. This decoupling removes the upward bias, although the resulting estimate can lean toward underestimation instead. The same idea later became Double DQN, a standard fix for overestimation in deep RL.
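The bias is easy to see in a small simulation. The setup is an assumption chosen to isolate the effect: every action has true value 0, and each estimate is pure Gaussian noise, so any nonzero average is bias. The function name `simulate` and its parameters are hypothetical.

```python
import random

def simulate(n_actions=10, n_trials=20_000, seed=0):
    """Average value of the single and the double estimator of max_a Q(s', a)."""
    rng = random.Random(seed)
    single_total = double_total = 0.0
    for _ in range(n_trials):
        # True value of every action is 0; both tables hold independent noise.
        q_a = [rng.gauss(0.0, 1.0) for _ in range(n_actions)]
        q_b = [rng.gauss(0.0, 1.0) for _ in range(n_actions)]
        single_total += max(q_a)                           # select and evaluate with Q_A
        best = max(range(n_actions), key=q_a.__getitem__)  # select with Q_A ...
        double_total += q_b[best]                          # ... evaluate with Q_B
    return single_total / n_trials, double_total / n_trials

single_bias, double_bias = simulate()
```

The true maximum is 0, yet the single estimator averages well above it, while the double estimator stays near 0. In this all-equal setup the double estimator is unbiased; when true values differ it can underestimate.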
From table to network: Deep Q-Networks
The Q-table is exact but doesn’t scale: a chess or Atari state space is astronomically large, and most states are never seen twice. The fix is to approximate Q(s, a) with a function — and once that function is a deep neural network, you get a Deep Q-Network (DQN).
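DeepMind’s 2015 DQN learned to play 49 Atari games from raw pixels, reaching human-level play on many. It relied on two tricks, experience replay and a target network, to tame the instability that arises when function approximation, bootstrapping, and off-policy learning are combined (often called the deadly triad). The table compares it with the tabular method: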
| Ingredient | Tabular Q-learning | Deep Q-Network (DQN) |
|---|---|---|
| Q representation | exact lookup table | convolutional neural network |
| Update | one cell per step | gradient step minimizing TD error |
| Data | latest transition | experience replay buffer (decorrelates samples) |
| Target stability | exact | target network (frozen copy for the max) |
| Scales to | small, discrete | high-dimensional, image inputs |
The conceptual core — bootstrap toward r + γ maxₐ' Q(s',a'), off-policy — is identical. DQN just learns the function instead of filling in cells, and adds machinery to keep that learning from diverging.
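The two stabilizing ingredients can be sketched without any deep-learning library. This is a hedged, stdlib-only illustration: `ReplayBuffer`, `td_targets`, and the stand-in `target_q` function are hypothetical names, and a real DQN would replace `target_q` with a periodically synced copy of the online network.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of transitions, sampled uniformly at random."""
    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)   # oldest transitions drop off
        self.rng = random.Random(seed)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return self.rng.sample(list(self.buffer), batch_size)

def td_targets(batch, target_q, gamma):
    """r + gamma * max_a' Q_target(s', a'), with no bootstrap at terminal states."""
    return [r if done else r + gamma * max(target_q(s_next))
            for (_, _, r, s_next, done) in batch]

def target_q(state):
    # Stand-in for a frozen target network: state -> list of action-values.
    return [0.5 * state, 1.0 * state]

buf = ReplayBuffer(capacity=3)
for s in range(5):                              # only the last 3 transitions survive
    buf.add(s, 1, 1.0, s + 1, done=(s == 4))

batch = buf.sample(2)                           # decorrelated minibatch
targets = td_targets(batch, target_q, gamma=0.9)
```

The network would then take a gradient step to pull its predictions Q(s, a) toward these targets, which stay fixed until the target copy is next refreshed.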
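A short history
- 1989: Chris Watkins introduces Q-learning in his PhD thesis.
- 1992: Watkins & Dayan publish the convergence proof.
- 2010: van Hasselt introduces Double Q-learning to address maximization bias.
- 2015: Mnih et al. publish DQN in Nature, learning Atari games from raw pixels.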
When to reach for Q-learning
- Good fit: discrete action spaces, environments you can sample cheaply, and problems where a value table or value network suffices. It is the default starting point for value-based control and for learning RL itself.
- Poor fit: continuous or huge action spaces, where the max becomes intractable; there, policy gradients and actor-critic methods like PPO are usually better. Very sparse rewards also strain pure Q-learning.
To actually run Q-learning you need an environment to learn in; the standard testbeds (Gymnasium, MiniGrid, classic gridworlds) and the broader tooling are covered under RL environments and the companies building RL environments.
Researcher takes
Levine frames the core fix for off-policy overestimation as a design choice in the TD target: by using a SARSA-like update with a modified (expectile) loss, Implicit Q-Learning (IQL) never queries the values of out-of-distribution actions, sidestepping the maximization bias that the max-over-actions in standard Q-learning introduces.
Frequently asked questions
What’s the difference between Q-learning and SARSA?
Both are TD control methods, but the update target differs. Q-learning uses maxₐ' Q(s',a') — the best possible next action — making it off-policy: it learns the optimal policy regardless of how it explores. SARSA uses Q(s',a') for the action it actually takes next, making it on-policy: it learns the value of the exploratory policy it follows, and so prefers safer paths.
Why is Q-learning called “model-free”?
It never learns or uses the environment’s transition probabilities or reward function. It only needs sampled experience (s, a, r, s') and learns values directly from it. Methods that do learn a transition model are covered under model-based RL.
Why does Q-learning overestimate values?
The max in its target picks the highest of several noisy estimates, which is statistically biased above the true maximum — “maximization bias.” Double Q-learning removes it by using two independent estimators: one selects the action, the other evaluates it.
Is Q-learning the same as a Deep Q-Network?
A DQN is Q-learning, with the lookup table replaced by a neural network and stabilized with experience replay and a target network. The update rule and off-policy logic are unchanged; only the representation of Q differs.
Key papers and references
- Q-learning — Watkins & Dayan, 1992 — the convergence proof and canonical statement.
- Double Q-learning — van Hasselt, NIPS 2010 — diagnoses and fixes maximization bias.
- Human-level control through deep reinforcement learning — Mnih et al., Nature 2015 — DQN, the deep extension.
- Reinforcement Learning: An Introduction — Sutton & Barto — Chapter 6 covers Q-learning, SARSA, and the Cliff Walking example.
Related
Value functions · Deep Q-Networks · Exploration vs exploitation · Markov Decision Processes · Policy gradients · Actor-critic · What is reinforcement learning?