Temporal-Difference (TD) Learning, Explained

Key takeaways

TD learning estimates value functions from raw experience — no model of the environment — while bootstrapping from its own current estimates instead of waiting for the final outcome.
The whole idea fits in one line: nudge each estimate toward the next one. The nudge is proportional to the TD error, δ = r + γV(s') − V(s).
It blends Monte Carlo (learn from experience) and dynamic programming (bootstrap from estimates) — and TD(λ) puts a single dial between the two.
TD is the engine under SARSA, Q-learning and DQN, and the same error signal turns out to match how dopamine neurons in the brain encode reward prediction errors.

What is temporal-difference learning?

Temporal-difference (TD) learning is the central idea in reinforcement learning: learn to predict long-run reward by comparing each prediction to the next one and correcting the difference. You do not need a model of how the world works (unlike dynamic programming), and you do not need to wait until the end of an episode to learn (unlike Monte Carlo methods). You update as you go.

The intuition is everyday. Suppose at 8:00 you predict your commute will take 30 minutes. At 8:10 you hit unexpected traffic and now expect the whole trip to take 40. You do not need to arrive to know your first guess was 10 minutes too low — the revision of your own estimate already carries the lesson. TD learning is exactly that: every time a later, better-informed estimate disagrees with an earlier one, the gap is a learning signal.

TD(0) prediction: from a state, take one real step (reward r, next state s'), then bootstrap with the existing estimate V(s'). The mismatch between this one-step target and the old estimate is the TD error that drives the update.


The TD(0) update and the TD error

The simplest version, TD(0) (one-step TD), estimates the state-value function $V(s)$. After taking one step (observing reward $r$ and landing in $s'$), it updates:

$$V(s) \leftarrow V(s) + \alpha\,\big[\,\underbrace{r + \gamma V(s')}_{\text{TD target}} - V(s)\,\big]$$

With time indices made explicit, the bracketed quantity is the TD error, written $\delta$:

$$\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$$

Here $\alpha$ is the learning rate (how big a step to take toward the target) and $\gamma$ is the discount factor. Read it plainly: $\delta$ is how wrong the current estimate was, judged against a slightly better estimate that has one extra real reward baked in. If $\delta > 0$, the future looked better than predicted, so raise $V(s)$; if $\delta < 0$, lower it.
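As a minimal sketch of the update above, the following Python function applies one TD(0) step to a value table. The function name `td0_update`, the dictionary-based table, and the two-state example are illustrative assumptions, not part of the original text.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one TD(0) update to the table V in place and return the TD error."""
    td_target = r + gamma * V[s_next]  # one real reward plus the bootstrapped estimate
    delta = td_target - V[s]           # TD error: target minus old estimate
    V[s] += alpha * delta              # move a fraction alpha of the way toward the target
    return delta


# Hypothetical two-state example: leaving "A" with reward 0.5 and landing in "B".
V = {"A": 0.0, "B": 1.0}
delta = td0_update(V, "A", r=0.5, s_next="B")
print(delta, V["A"])  # delta is about 1.4 (positive), so V["A"] is raised to about 0.14
```

Because the TD error is positive here, the estimate for "A" goes up; a negative error would push it down by the same rule.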

1 step: how far TD(0) looks before updating
O(1): memory per update, no episode buffer needed
1988: Sutton formalizes TD, the field's cornerstone

The single most important word here is bootstrapping: the update target $r + \gamma V(s')$ contains $V(s')$, which is itself an estimate the agent is still learning. TD pulls itself up by its own bootstraps: it learns a guess from a guess. That is what makes it fast and online, and also what makes its convergence subtler than ordinary supervised learning.

How TD prediction works, step by step

1. Initialize estimates. Set $V(s)$ to arbitrary values (often zero) for every non-terminal state, and to zero for terminal states. These starting values are almost certainly wrong; the algorithm's job is to chip away at the error.

2. Act and observe one transition. From the current state $s$, follow your policy, take an action, and observe the immediate reward $r$ and the next state $s'$. This is one real interaction: no model, no rollout.

3. Form the TD target. Compute $r + \gamma V(s')$. You are combining one real reward with your existing estimate of everything that follows $s'$. This is the bootstrap.

4. Compute the TD error and update. Take $\delta = (r + \gamma V(s')) - V(s)$ and move the old estimate a fraction $\alpha$ of the way toward the target: $V(s) \leftarrow V(s) + \alpha\,\delta$. Then set $s \leftarrow s'$ and repeat from step 2.

Because each update touches only the state just left, TD propagates information one step backward per visit. Over many episodes, value flows back from rewards toward the states that lead to them — slowly at first, then faster as the estimates downstream become reliable. (Eligibility traces, below, accelerate exactly this backward flow.)
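The steps above can be sketched end to end on a toy problem. The environment below is an assumption chosen for illustration (a 7-state chain with terminal states at both ends, a random policy, and reward 1 only on reaching the right end); with $\gamma = 1$ the true values of the five inner states are 1/6 through 5/6, which gives something to check the estimates against.

```python
import random


def td0_random_walk(episodes=5000, alpha=0.02, gamma=1.0, seed=0):
    """Tabular TD(0) prediction for a random policy on a 7-state chain."""
    rng = random.Random(seed)
    left_end, right_end, start = 0, 6, 3
    V = [0.0] * 7                                  # initialize; terminal entries stay 0
    for _ in range(episodes):
        s = start
        while s not in (left_end, right_end):
            s_next = s + rng.choice((-1, 1))       # act: step left or right at random
            r = 1.0 if s_next == right_end else 0.0
            delta = r + gamma * V[s_next] - V[s]   # TD target minus old estimate
            V[s] += alpha * delta                  # update only the state just left
            s = s_next
    return V


V = td0_random_walk()
print([round(v, 2) for v in V[1:6]])  # should land near 1/6, 2/6, 3/6, 4/6, 5/6
```

With a constant step size the estimates keep jittering around the true values rather than settling exactly, which is the expected behaviour for this kind of sketch.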

Go deeper: why bootstrapping is not true gradient descent

In supervised learning the target is fixed and the update is a real gradient step on a loss. In TD the target $r + \gamma V(s')$ moves because it depends on the very parameters you are updating. So TD is a semi-gradient method: you take the gradient of $V(s)$ but treat the target as a constant, ignoring its dependence on the weights. This is why tabular TD(0) provably converges (under standard step-size conditions) while TD with function approximation (even linear) and off-policy data can diverge: the notorious deadly triad of bootstrapping + function approximation + off-policy training. Gradient-TD methods (GTD2, TDC) restore stability by performing true stochastic gradient descent on an objective, the mean-squared projected Bellman error.
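To make "semi-gradient" concrete, here is a small sketch for a linear value function $V(s) = w \cdot x(s)$. The function name, the plain-list weight vector, and the feature vectors are illustrative assumptions. The key line is the update, which uses only the gradient of the current prediction (the features `x`) and leaves the target out of the derivative.

```python
def semi_gradient_td0_step(w, x, r, x_next, alpha=0.1, gamma=0.9, terminal=False):
    """One semi-gradient TD(0) step for a linear value function v(s) = w . x(s).

    Returns the new weight list and the TD error.
    """
    v = sum(wi * xi for wi, xi in zip(w, x))
    v_next = 0.0 if terminal else sum(wi * xi for wi, xi in zip(w, x_next))
    delta = r + gamma * v_next - v
    # The gradient of v with respect to w is x. The target r + gamma * v_next is
    # treated as a constant, so no term involving x_next appears in the update.
    new_w = [wi + alpha * delta * xi for wi, xi in zip(w, x)]
    return new_w, delta


# With one-hot features this reduces to the tabular update.
w, delta = semi_gradient_td0_step([0.0, 1.0], x=[1.0, 0.0], r=0.5, x_next=[0.0, 1.0])
print(w, delta)
```

A full gradient of the squared TD error would add a term proportional to `gamma * x_next`; dropping it is what makes the method "semi".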

TD vs Monte Carlo vs dynamic programming

The cleanest way to place TD is along two axes: does it sample experience? and does it bootstrap?

Property	Dynamic programming	Monte Carlo	TD learning
Needs a model of dynamics	Yes	No	No
Learns from raw experience	No	Yes	Yes
Bootstraps (uses own estimates)	Yes	No	Yes
Updates before episode ends	Yes	No	Yes
Works in continuing (non-terminating) tasks	Yes	No	Yes
Target variance	None	High	Low
Target bias	Biased until estimates converge	None (unbiased)	Biased (early on)

The bias/variance trade-off is the heart of it. Monte Carlo waits for the actual return $G_t$, which is an unbiased sample of the true value but has high variance (it accumulates every random reward and transition to the end of the episode). TD replaces most of that random tail with a single bootstrapped estimate $V(s')$: low variance, but biased while $V$ is still wrong. In practice TD's lower variance usually makes it learn faster. On a fixed batch of Markov data it also tends to generalize better, because batch TD(0) converges to the value function of the maximum-likelihood Markov model of that data, whereas batch Monte Carlo only minimizes squared error on the observed returns.
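The variance half of this trade-off can be checked numerically. The sketch below is an illustration under stated assumptions: it uses a 7-state random-walk chain (not from the original text) and bootstraps from the true values, so the TD target here is unbiased and only the variance difference shows up. The bias of a real TD learner, which bootstraps from imperfect estimates, is not demonstrated.

```python
import random
from statistics import mean, pvariance


def sample_targets(n_episodes=20000, seed=0):
    """Sample Monte Carlo and one-step TD targets for the middle state of a 7-state chain."""
    rng = random.Random(seed)
    true_V = [0.0, 1 / 6, 2 / 6, 3 / 6, 4 / 6, 5 / 6, 0.0]  # known values, gamma = 1
    mc_targets, td_targets = [], []
    for _ in range(n_episodes):
        s = 3 + rng.choice((-1, 1))          # first step from the middle state
        td_targets.append(0.0 + true_V[s])   # first reward is 0, then bootstrap
        while s not in (0, 6):               # finish the episode for the MC return
            s += rng.choice((-1, 1))
        mc_targets.append(1.0 if s == 6 else 0.0)
    return mc_targets, td_targets


mc, td = sample_targets()
print("MC mean/var:", round(mean(mc), 3), round(pvariance(mc), 3))
print("TD mean/var:", round(mean(td), 3), round(pvariance(td), 3))
```

Both targets average out near the true value of 0.5, but the Monte Carlo return is always 0 or 1 while the TD target only ever takes the values 1/3 or 2/3, so its spread is far smaller.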

Bias vs variance of the update target. Monte Carlo uses the full sampled return (unbiased, high variance); TD(0) uses a one-step bootstrap (low variance, biased); n-step methods interpolate, and TD(λ) averages across all n at once.

n-step TD and TD(λ): one dial between the extremes

TD(0) and Monte Carlo are the endpoints of a continuum. n-step TD takes $n$ real rewards before bootstrapping:

$$G_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^{n} V(s_{t+n})$$

With $n=1$ this is TD(0); with $n \to \infty$ (to the end of the episode) it is Monte Carlo. Intermediate $n$ often beats both endpoints.
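A short sketch of the n-step return for one recorded episode may help. The indexing convention is an assumption made for this example: `rewards[k]` holds $r_{k+1}$ and `values[k]` holds the current estimate $V(s_k)$, and bootstrapping is skipped once the horizon reaches the end of the episode (where the terminal value is zero).

```python
def n_step_return(rewards, values, t, n, gamma=0.9):
    """n-step return from time t for a single finished episode.

    rewards[k] is the reward received after step k; values[k] is the estimate V(s_k).
    """
    T = len(rewards)
    end = min(t + n, T)                       # never look past the end of the episode
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if end < T:                               # bootstrap only from a non-terminal state
        G += gamma ** (end - t) * values[end]
    return G


rewards, values = [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]
print(n_step_return(rewards, values, t=0, n=1, gamma=0.5))   # TD(0) target: 1 + 0.5 * 20 = 11.0
print(n_step_return(rewards, values, t=0, n=2, gamma=0.5))   # 1 + 0.5 * 2 + 0.25 * 30 = 9.5
print(n_step_return(rewards, values, t=0, n=10, gamma=0.5))  # Monte Carlo return: 2.75
```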

TD(λ) is the elegant unification: instead of picking one $n$, it averages all n-step returns, weighting the $n$-step return by $(1-\lambda)\lambda^{n-1}$. The parameter $\lambda \in [0,1]$ is a single knob: $\lambda = 0$ recovers TD(0), and $\lambda = 1$ recovers Monte Carlo.
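The weighted average can be written out directly for a finished episode. In this sketch (function names and indexing are illustrative assumptions, with `rewards[k]` holding $r_{k+1}$ and `values[k]` holding $V(s_k)$), all the weight that would fall on n-step returns reaching past the end of the episode is placed on the full Monte Carlo return, so the weights still sum to one.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step return from time t; bootstraps only if the horizon ends before the episode does."""
    T = len(rewards)
    end = min(t + n, T)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if end < T:
        G += gamma ** (end - t) * values[end]
    return G


def lambda_return(rewards, values, t, lam, gamma):
    """Forward-view lambda-return: a (1 - lam) * lam**(n - 1) weighted mix of n-step returns."""
    T = len(rewards)
    total = 0.0
    for n in range(1, T - t):                 # returns that still bootstrap
        total += (1 - lam) * lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
    full_return = n_step_return(rewards, values, t, T - t, gamma)
    return total + lam ** (T - t - 1) * full_return  # leftover weight goes to Monte Carlo


rewards, values = [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]
for lam in (0.0, 0.5, 1.0):
    print(lam, lambda_return(rewards, values, t=0, lam=lam, gamma=0.5))
```

Setting `lam` to 0 reproduces the one-step TD target and setting it to 1 reproduces the Monte Carlo return, matching the two endpoints described above.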

Computing that average directly would require waiting for the whole episode (the forward view). The trick that makes TD(λ) practical online is the backward view with eligibility traces: each state keeps a fading memory $e(s)$ of how recently and frequently it was visited, decaying by $\gamma\lambda$ each step. Every time step’s TD error $\delta$ is then broadcast to all states in proportion to their current trace:

$$e_t(s) = \gamma\lambda\, e_{t-1}(s) + \mathbb{1}[s_t = s], \qquad V(s) \leftarrow V(s) + \alpha\,\delta_t\, e_t(s) \quad \text{for every state } s$$

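This assigns credit to recently visited states the instant a surprise arrives, solving the credit-assignment problem incrementally. In the tabular form shown here, the price is that each step updates every state with a nonzero trace rather than only the state just left; with linear function approximation the trace is a single vector and the per-step cost stays of the same order.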
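The backward view can be sketched in a few lines. As before, the environment is an assumption for illustration (a 7-state random-walk chain with reward 1 at the right end, true inner values 1/6 through 5/6), and the traces are the accumulating kind described by the equation above, reset at the start of every episode.

```python
import random


def td_lambda_random_walk(episodes=5000, alpha=0.01, gamma=1.0, lam=0.8, seed=0):
    """Tabular TD(lambda) with accumulating eligibility traces on a 7-state chain."""
    rng = random.Random(seed)
    n_states = 7
    V = [0.0] * n_states
    for _ in range(episodes):
        e = [0.0] * n_states                       # traces start at zero each episode
        s = 3
        while s not in (0, 6):
            s_next = s + rng.choice((-1, 1))
            r = 1.0 if s_next == 6 else 0.0
            delta = r + gamma * V[s_next] - V[s]   # the usual one-step TD error
            for i in range(n_states):
                e[i] *= gamma * lam                # every trace fades
            e[s] += 1.0                            # the state just visited is bumped
            for i in range(n_states):
                V[i] += alpha * delta * e[i]       # broadcast the error along the traces
            s = s_next
    return V


V = td_lambda_random_walk()
print([round(v, 2) for v in V[1:6]])  # should land near 1/6, 2/6, 3/6, 4/6, 5/6
```

Setting `lam=0.0` makes every trace vanish after one step, which recovers plain TD(0).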

From prediction to control: SARSA and Q-learning

Everything above predicts value for a fixed policy. To actually improve behavior, you apply the same TD update to action values $Q(s,a)$ and act greedily (or ε-greedily) on them. The choice of TD target splits the two most famous control algorithms:

SARSA — on-policy TD control

Target uses the action actually taken next: $\delta = r + \gamma Q(s', a') - Q(s,a)$. It learns the value of the policy it is following (exploration included), so it tends to be more conservative near danger. See on-policy vs off-policy.

Q-learning — off-policy TD control

Target uses the best next action: $\delta = r + \gamma \max_{a'} Q(s', a') - Q(s,a)$. It learns the optimal action values regardless of how it explores, provided every state-action pair keeps being visited. With a neural network in place of the table, it becomes Deep Q-Networks.

The difference is a single operator in the target — $Q(s',a')$ (the action sampled) versus $\max_{a'} Q(s',a')$ (the greedy action). That is the whole on-policy/off-policy distinction, expressed in the TD error. Both are instances of generalized policy iteration: TD evaluates, the greedy step improves, round and round.
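The single-operator difference is easy to see side by side. In this sketch the nested-dictionary Q-table, the state and action names, and the helper names are all illustrative assumptions; the shared update function shows that only the target changes between the two algorithms.

```python
def sarsa_target(Q, r, s_next, a_next, gamma=0.9):
    """On-policy target: bootstrap from the action the agent actually takes next."""
    return r + gamma * Q[s_next][a_next]


def q_learning_target(Q, r, s_next, gamma=0.9):
    """Off-policy target: bootstrap from the greedy action, whatever is taken next."""
    return r + gamma * max(Q[s_next].values())


def td_control_update(Q, s, a, target, alpha=0.1):
    """Shared update rule: move Q(s, a) a fraction alpha toward the chosen target."""
    delta = target - Q[s][a]
    Q[s][a] += alpha * delta
    return delta


# Hypothetical table: in "s1" the exploratory action "left" is worse than "right".
Q = {"s0": {"left": 0.0, "right": 0.0}, "s1": {"left": 1.0, "right": 3.0}}
print(sarsa_target(Q, r=0.0, s_next="s1", a_next="left"))  # uses Q["s1"]["left"]
print(q_learning_target(Q, r=0.0, s_next="s1"))            # uses the max over "s1"
```

When the next action happens to be the greedy one, the two targets coincide; they differ only on exploratory steps.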

Go deeper: TD in deep RL and actor-critic

Replace the table with a neural network and the TD error becomes the loss signal. DQN minimizes the squared TD error of Q-learning, stabilized by a target network (a periodically updated frozen copy that supplies $\max_{a'} Q(s',a')$, slowing the moving target) and a replay buffer. In actor-critic methods the critic is a TD learner estimating value, and its TD error $\delta$ serves as an estimate of the advantage that tells the actor which way to push the policy, so TD sits at the core of actor-critic policy-gradient algorithms like A3C and PPO. (Critic-free variants such as GRPO drop the learned value function and estimate advantages from group-relative rewards instead.) Generalized advantage estimation (GAE) is, under the hood, a TD(λ)-style estimator: an exponentially weighted sum of one-step TD errors with weights $(\gamma\lambda)^l$.
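The connection between GAE and TD errors can be sketched with a single backward pass over one trajectory. The function name and the convention that `values` carries one extra trailing entry (the bootstrap value of the final state, 0.0 if it is terminal) are assumptions made for this example.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Advantage estimates as a discounted sum of one-step TD errors.

    values must have len(rewards) + 1 entries; the last one is the value of the
    state reached after the final reward (0.0 if that state is terminal).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        running = delta + gamma * lam * running                 # weight (gamma * lam) per step back
        advantages[t] = running
    return advantages


rewards = [1.0, 2.0, 3.0]
values = [0.5, 0.5, 0.5, 0.0]
print(gae_advantages(rewards, values, gamma=1.0, lam=0.0))  # plain one-step TD errors
print(gae_advantages(rewards, values, gamma=1.0, lam=1.0))  # Monte Carlo return minus V(s_t)
```

The two printed cases are the endpoints of the same dial: `lam=0.0` keeps only the immediate TD error, and `lam=1.0` sums every later error, which telescopes into the full return minus the baseline.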

TD learning in the brain

One of the most striking results in computational neuroscience is that TD did not just inspire algorithms; it appears to describe biology. Wolfram Schultz's recordings of midbrain dopamine neurons show they fire not for reward itself but for reward that was better than predicted, stay at baseline for fully predicted reward, and dip below baseline when an expected reward is omitted. That signature (positive, zero, negative on surprise) is precisely the sign pattern of the TD error $\delta$. Schultz, Dayan and Montague's 1997 work made the link explicit, and a 2022 study reported that dopamine responses gradually shift backward in time across learning, as the TD error predicts.

A short history of TD learning

1959

Samuel's checkers player

Arthur Samuel’s self-learning checkers program updates board evaluations toward later evaluations — a proto-TD method, decades early.

1983

Actor-critic and the adaptive critic

Barto, Sutton and Anderson use a TD-like critic to balance a pole, seeding the actor-critic architecture.

1988

Sutton formalizes TD(λ)

Sutton’s “Learning to Predict by the Methods of Temporal Differences” defines the family and proves key convergence results — the field’s cornerstone.

1989

Q-learning

Watkins introduces Q-learning, off-policy TD control, with a convergence proof following in 1992.

1992

TD-Gammon

Tesauro's neural-network backgammon player, trained by TD self-play, reaches near-world-champion strength: the first great success of TD learning combined with neural networks.

1997

Dopamine as TD error

Schultz, Dayan and Montague map the TD error onto phasic dopamine signaling in the brain.

2015

Deep Q-Networks

DeepMind’s DQN scales TD-based Q-learning to Atari from pixels, igniting the modern deep-RL era.

Practical notes

Step size $\alpha$. Convergence theory wants $\alpha$ to shrink over time, but a constant small $\alpha$ is standard in practice for non-stationary or deep settings, because it lets the agent keep tracking a moving target.
Choosing $\lambda$. Intermediate values (often around $0.8$ to $0.95$) frequently beat both TD(0) and Monte Carlo; GAE in modern policy-gradient code exposes exactly this dial.
Watch the deadly triad. Bootstrapping + function approximation + off-policy data can diverge. Target networks, careful step sizes, or gradient-TD methods are the usual remedies.
TD targets are biased early. Estimates can be wildly off before downstream values settle; this is expected, not a bug.
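The step-size note above can be illustrated without any RL machinery. The sketch below is a toy under stated assumptions: a stream of targets that jumps from 0 to 1 halfway through, tracked once with a shrinking step size of $1/n$ and once with a constant step size. The helper name `running_estimate` is hypothetical.

```python
def running_estimate(samples, step_size):
    """Incrementally track a stream of targets; step_size maps the sample count n to alpha."""
    v = 0.0
    for n, x in enumerate(samples, start=1):
        v += step_size(n) * (x - v)   # same shape as a TD update: old + alpha * error
    return v


# The target jumps from 0 to 1 halfway through, mimicking a non-stationary problem.
samples = [0.0] * 50 + [1.0] * 50
sample_average = running_estimate(samples, lambda n: 1.0 / n)  # shrinking step size
constant_step = running_estimate(samples, lambda n: 0.1)       # constant step size
print(sample_average, constant_step)
```

The shrinking schedule ends at the plain average of everything seen (about 0.5), while the constant step size has almost fully moved to the new level, which is why it is preferred when the target keeps drifting.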


Frequently asked questions

What exactly is the “temporal difference”?

It is the difference between two temporally successive predictions of the same quantity: your estimate now, $V(s_t)$, versus your slightly better estimate a step later, $r_{t+1} + \gamma V(s_{t+1})$. The gap between predictions made at different times is the error signal you learn from.

Is TD learning model-free?

Yes. TD learns value estimates straight from sampled transitions $(s, r, s')$ without ever knowing the environment’s transition probabilities or reward function. That is what separates it from dynamic programming, which requires a full model. See model-based RL for the contrast.

When should I use Monte Carlo instead of TD?

Prefer Monte Carlo when the problem is strongly non-Markov (so bootstrapping on $V(s')$ misleads), when episodes are short and you want unbiased targets, or for debugging. Prefer TD for continuing tasks, online learning, and most large-scale problems where its lower variance and faster propagation win. Many modern methods (n-step, TD(λ), GAE) deliberately sit between the two.

How does TD(λ) relate to eligibility traces?

Eligibility traces are the mechanism that implements TD(λ) online. Rather than waiting to compute the λ-weighted average of all n-step returns (the forward view), each state holds a decaying trace of how recently it was visited, and every TD error is applied to all states in proportion to their traces (the backward view). The two views are provably equivalent when updates are accumulated and applied at the end of each episode; with online updates they agree only approximately, closely so for small step sizes.

Key papers and resources

Learning to Predict by the Methods of Temporal Differences — Sutton, 1988 — the foundational paper.
Reinforcement Learning: An Introduction — Sutton & Barto — chapters 6, 7 and 12 are the definitive treatment (free online).
Temporal Difference Learning and TD-Gammon — Tesauro, 1995 — the landmark application.
A Neural Substrate of Prediction and Reward — Schultz, Dayan & Montague, 1997 — TD error in the brain.
Temporal difference learning — Wikipedia — a solid overview with references.
