Q-Learning: Off-Policy Value-Based RL

Key takeaways

Q-learning learns a table of action-values Q(s,a) — the expected long-run reward of taking action a in state s and acting optimally thereafter.
It's model-free and off-policy: it learns the optimal policy while exploring with a different (e.g. ε-greedy) behaviour policy, by bootstrapping toward the best next action.
One update rule does the work: nudge Q(s,a) toward r + γ·maxₐ' Q(s',a'). Watkins & Dayan (1992) proved it converges to Q* under mild conditions.
Its max operator causes maximization bias (overestimation); Double Q-learning fixes it, and pairing Q-learning with neural nets gives Deep Q-Networks (DQN).

What is Q-learning?

Q-learning is one of the foundational algorithms of reinforcement learning: a simple, model-free method that learns how good each action is in each situation, purely from trial and error. The “Q” stands for quality — Q(s, a) is the expected total (discounted) reward you’ll collect if you take action a in state s and then act optimally forever after.

Once you have an accurate Q, the optimal policy is trivial: in any state, pick the action with the highest Q-value. No model of the environment, no planning, no separate policy network — just look up the row for your state and take the argmax. That elegance is why Q-learning, introduced by Chris Watkins in his 1989 PhD thesis, remains the gateway algorithm for understanding value functions and the launch point for Deep Q-Networks.

Q-learning maintains a Q-table: one cell per state-action pair. The agent acts (often ε-greedily), observes a reward and next state, and updates one cell toward the best achievable future. The optimal policy is just the argmax of each row.

The Q-function and the Bellman optimality equation

The optimal action-value function Q* satisfies a self-consistency condition called the Bellman optimality equation. The value of acting now equals the immediate reward plus the discounted value of acting optimally next:

Q^*(s,a) = \mathbb{E}\Big[\,r + \gamma \max_{a'} Q^*(s',a') \;\big|\; s,a\,\Big]

Here γ ∈ [0,1) is the discount factor — how much future reward is worth relative to immediate reward. Low γ makes a myopic agent chasing instant payoff; γ near 1 makes a far-sighted one. This is the action-value cousin of the equations on the value functions page, set in a Markov Decision Process.

Q-learning’s whole job is to solve this equation by sampling — without ever knowing the transition probabilities or reward function. Each real transition (s, a, r, s') is a noisy sample of the right-hand side, and we slowly average those samples into our estimate.

The update rule

Everything in Q-learning reduces to one line, applied after every transition:

Q(s_t, a_t) \;\leftarrow\; Q(s_t, a_t) \;+\; \alpha\,\big[\,\underbrace{r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a')}_{\text{TD target}} \;-\; Q(s_t, a_t)\,\big]

The bracketed term is the temporal-difference (TD) error: the gap between what we now think this action is worth (the TD target) and our old estimate. We move our estimate a fraction α of the way to close that gap.

α — learning rate

How fast we overwrite old beliefs. α = 0 learns nothing; α = 1 keeps only the latest sample. Deterministic worlds tolerate α = 1; stochastic ones need α to shrink over time for convergence.

γ — discount factor

How much the future counts. As γ → 1 the agent values long-horizon reward; if γ ≥ 1 the values can diverge. Tied to the effective planning horizon ≈ 1/(1−γ).

The algorithm, end to end

Initialize the Q-table

Set Q(s, a) for all state-action pairs — often to zero, or optimistically high to encourage early exploration. Terminal states are fixed at 0.

Choose an action (behaviour policy)

From the current state s, pick an action — typically ε-greedy: with probability 1−ε take argmaxₐ Q(s,a), otherwise act randomly. This is the exploration-vs-exploitation knob.

Act and observe

Execute a, receive reward r and next state s' from the environment.

Update toward the TD target

Apply the update rule: Q(s,a) ← Q(s,a) + α[r + γ maxₐ' Q(s',a') − Q(s,a)]. Note the target uses the greedy next value, regardless of what you’ll actually do next.

Advance and repeat

Set s ← s'. Loop until the episode ends, then start a new one. Decay ε (and often α) over time so the agent shifts from exploring to exploiting.

Go deeper: a worked single update

Say Q(s, a) = 5, you take a, get reward r = 2, and land in s' whose best action has value maxₐ' Q(s',a') = 10. With α = 0.1, γ = 0.9:

TD target = 2 + 0.9 × 10 = 11
TD error = 11 − 5 = 6
New Q(s,a) = 5 + 0.1 × 6 = 5.6

We nudged 10% of the way from 5 toward 11. Repeat across many visits and Q(s,a) converges to the true Q*(s,a). Crucially, the next actual action you take might be exploratory and worth only 3 — but the update ignored that and used the optimistic 10. That is off-policy learning in one number.

Off-policy: the defining property

Q-learning learns about one policy (the greedy, optimal target policy) while following another (the exploratory behaviour policy). This decoupling is its superpower.

Q-learning (off-policy)

Target uses maxₐ' Q(s', a') — the value of the best next action. Learns Q* directly, even from random or old data. Tends to learn the optimal (sometimes risky) path.

SARSA (on-policy)

Target uses Q(s', a') for the action a' the policy actually takes next. Learns the value of the policy it follows — including its exploration — so it learns safer, more conservative paths.

The textbook illustration is Cliff Walking (Sutton & Barto): a gridworld with a cliff edge that gives a big negative reward. Q-learning learns the optimal policy that walks right along the cliff edge — but because it still explores ε-greedily, it occasionally steps off and scores worse online. SARSA learns a safer path one row back, accounting for the chance that exploration sends it over the edge. Same environment, different objective: optimal policy vs. value-of-the-policy-being-followed.

Does it converge?

Yes — and that guarantee is a big part of why Q-learning is canonical. Watkins & Dayan (1992) proved that tabular Q-learning converges to the optimal Q* with probability 1, under conditions that are mild but must all hold:

∞

Every state-action pair must be visited infinitely often

Σα = ∞

Step sizes must sum to infinity (keep learning)…

Σα² < ∞

…but their squares must be finite (eventually settle)

In plain terms: explore everything forever, and let the learning rate decay neither too fast nor too slow (e.g. α_t = 1/t satisfies both sums). Because the target uses max rather than the behaviour policy’s action, convergence does not require the behaviour policy to become greedy — it only needs to keep visiting every pair. That is the formal statement of off-policy convergence.

Maximization bias and Double Q-learning

Q-learning has a subtle, systematic flaw. Because the update uses maxₐ' Q(s', a'), and the Q estimates are noisy, the max tends to pick whichever action got lucky noise — so it systematically overestimates the true value. This is maximization bias (the “optimizer’s curse”): the maximum of noisy estimates is biased above the maximum of the true values.

Maximization bias: with several noisy action-value estimates, taking the max selects the upward outlier, so the bootstrapped target sits above the true value. Double Q-learning decouples action selection from action evaluation to remove the bias.

Double Q-learning (van Hasselt, 2010) fixes this by keeping two independent value tables, Q_A and Q_B. On each update it uses one to choose the best next action and the other to evaluate it:

Q_A(s,a) \leftarrow Q_A(s,a) + \alpha\Big[r + \gamma\, Q_B\big(s',\, \arg\max_{a'} Q_A(s',a')\big) - Q_A(s,a)\Big]

Because the chooser and the evaluator have independent noise, the lucky-outlier action selected by Q_A is unlikely to also be overvalued by Q_B. This decoupling provably removes the upward bias. The same idea later became Double DQN, a standard fix for overestimation in deep RL.

From table to network: Deep Q-Networks

The Q-table is exact but doesn’t scale: a chess or Atari state space is astronomically large, and most states are never seen twice. The fix is to approximate Q(s, a) with a function — and once that function is a deep neural network, you get a Deep Q-Network (DQN).

DeepMind’s 2015 DQN learned to play 49 Atari games from raw pixels, reaching human-level play on many, using two tricks to stabilize the unstable triad:

Ingredient	Tabular Q-learning	Deep Q-Network (DQN)
Q representation	exact lookup table	convolutional neural network
Update	one cell per step	gradient step minimizing TD error
Data	latest transition	experience replay buffer (decorrelates samples)
Target stability	exact	target network (frozen copy for the max)
Scales to	small, discrete	high-dimensional, image inputs

The conceptual core — bootstrap toward r + γ maxₐ' Q(s',a'), off-policy — is identical. DQN just learns the function instead of filling in cells, and adds machinery to keep that learning from diverging.

A short history

1989

Watkins introduces Q-learning

Chris Watkins’s PhD thesis, “Learning from Delayed Rewards,” defines the algorithm and the Q-function.

1992

Convergence proved

Watkins & Dayan publish the proof that tabular Q-learning converges to Q* with probability 1.

2010

Double Q-learning

Hado van Hasselt diagnoses maximization bias and introduces the two-estimator fix.

2013–15

Deep Q-Networks

Mnih et al. combine Q-learning with deep nets, experience replay and target networks — human-level Atari from pixels, launching deep RL.

2015–16

Double DQN & Rainbow

Overestimation fixes and a bundle of improvements (dueling nets, prioritized replay, n-step, distributional) push value-based RL further.

When to reach for Q-learning

Good fit

Discrete action spaces, environments you can sample cheaply, and problems where a value table or value network suffices. The default starting point for value-based control and for learning RL itself.

Poor fit

Continuous or huge action spaces (the max becomes intractable) — there, policy gradients and actor-critic methods like PPO are usually better. Very sparse rewards also strain pure Q-learning.

To actually run Q-learning you need an environment to learn in; the standard testbeds (Gymnasium, MiniGrid, classic gridworlds) and the broader tooling are covered under RL environments and the companies building RL environments.

Researcher takes

Levine frames the core fix for off-policy overestimation as a design choice in the TD target: by using a SARSA-like update with a modified (expectile) loss, IQL never queries the values of out-of-distribution actions, sidestepping the maximization bias that the max-over-actions in standard Q-learning introduces.

View Sergey Levine's post on X →

Frequently asked questions

What’s the difference between Q-learning and SARSA?

Both are TD control methods, but the update target differs. Q-learning uses maxₐ' Q(s',a') — the best possible next action — making it off-policy: it learns the optimal policy regardless of how it explores. SARSA uses Q(s',a') for the action it actually takes next, making it on-policy: it learns the value of the exploratory policy it follows, and so prefers safer paths.

Why is Q-learning called “model-free”?

It never learns or uses the environment’s transition probabilities or reward function. It only needs sampled experience (s, a, r, s') and learns values directly from it. Methods that do learn a transition model are covered under model-based RL.

Why does Q-learning overestimate values?

The max in its target picks the highest of several noisy estimates, which is statistically biased above the true maximum — “maximization bias.” Double Q-learning removes it by using two independent estimators: one selects the action, the other evaluates it.

Is Q-learning the same as a Deep Q-Network?

A DQN is Q-learning, with the lookup table replaced by a neural network and stabilized with experience replay and a target network. The update rule and off-policy logic are unchanged; only the representation of Q differs.

Key papers and references

Q-learning — Watkins & Dayan, 1992 — the convergence proof and canonical statement.
Double Q-learning — van Hasselt, NIPS 2010 — diagnoses and fixes maximization bias.
Human-level control through deep reinforcement learning — Mnih et al., Nature 2015 — DQN, the deep extension.
Reinforcement Learning: An Introduction — Sutton & Barto — Chapter 6 covers Q-learning, SARSA, and the Cliff Walking example.

Value functions · Deep Q-Networks · Exploration vs exploitation · Markov Decision Processes · Policy gradients · Actor-critic · What is reinforcement learning?