- TD learning estimates value functions from raw experience — no model of the environment — while bootstrapping from its own current estimates instead of waiting for the final outcome.
- The whole idea fits in one line: nudge each estimate toward the next one. The nudge is proportional to the TD error, δ = r + γV(s') − V(s).
- It blends Monte Carlo (learn from experience) and dynamic programming (bootstrap from estimates) — and TD(λ) puts a single dial between the two.
- TD is the engine under SARSA, Q-learning and DQN, and the same error signal turns out to match how dopamine neurons in the brain encode reward prediction errors.
What is temporal-difference learning?
Temporal-difference (TD) learning is the central idea in reinforcement learning: learn to predict long-run reward by comparing each prediction to the next one and correcting the difference. You do not need a model of how the world works (unlike dynamic programming), and you do not need to wait until the end of an episode to learn (unlike Monte Carlo methods). You update as you go.
The intuition is everyday. Suppose at 8:00 you predict your commute will take 30 minutes. At 8:10 you hit unexpected traffic and now expect the whole trip to take 40. You do not need to arrive to know your first guess was 10 minutes too low — the revision of your own estimate already carries the lesson. TD learning is exactly that: every time a later, better-informed estimate disagrees with an earlier one, the gap is a learning signal.
The TD(0) update and the TD error
The simplest version, TD(0) (one-step TD), estimates the state-value function $V(s)$. After taking one step (observing reward $r$ and landing in $s'$) it updates:

$$V(s) \leftarrow V(s) + \alpha \left[ r + \gamma V(s') - V(s) \right]$$

The bracketed quantity is the TD error, written $\delta$:

$$\delta = r + \gamma V(s') - V(s)$$

Here $\alpha$ is the learning rate (how big a step toward the target) and $\gamma$ is the discount factor. Read it plainly: $\delta$ is how wrong the current estimate was, judged against a slightly better estimate that has one extra real reward baked in. If $\delta > 0$, the future looked better than predicted, so raise $V(s)$; if $\delta < 0$, lower it.
The single most important word here is bootstrapping: the update target contains $V(s')$, which is itself an estimate the agent is still learning. TD pulls itself up by its own bootstraps: it learns a guess from a guess. That is what makes it fast and online, and also what makes its convergence subtler than ordinary supervised learning.
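As a small illustration, a single TD(0) update on a two-state table might look like the sketch below. The function name `td0_update`, the state labels and all numbers are assumptions chosen for the example, not part of any library.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one TD(0) update to the table V in place and return the TD error."""
    target = r + gamma * V[s_next]   # one real reward plus a bootstrapped estimate
    delta = target - V[s]            # TD error
    V[s] += alpha * delta            # move a fraction alpha toward the target
    return delta


# Hypothetical two-state table: A is underestimated, B already has some value.
V = {"A": 0.0, "B": 1.0}
delta = td0_update(V, "A", r=0.5, s_next="B")
print(f"TD error: {delta:.3f}, new V(A): {V['A']:.3f}")
```

The target here is 0.5 + 0.9 × 1.0 = 1.4, so the TD error is positive and the estimate for A moves 10% of the way up, to about 0.14. Only the state just left is touched.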
How TD prediction works, step by step
1. Set $V(s)$ to arbitrary values (often zero) for every state. These are deliberately wrong; the algorithm’s job is to chip away at the error.
2. From the current state $s$, follow your policy, take an action, and observe the immediate reward $r$ and the next state $s'$. This is one real interaction: no model, no rollout.
3. Compute the TD target $r + \gamma V(s')$. You are combining one real reward with your existing estimate of everything that follows, $\gamma V(s')$. This is the bootstrap.
4. Take $\delta = r + \gamma V(s') - V(s)$ and move the old estimate a fraction $\alpha$ of the way toward the target: $V(s) \leftarrow V(s) + \alpha \delta$. Then set $s \leftarrow s'$ and repeat from step 2.
Because each update touches only the state just left, TD propagates information one step backward per visit. Over many episodes, value flows back from rewards toward the states that lead to them — slowly at first, then faster as the estimates downstream become reliable. (Eligibility traces, below, accelerate exactly this backward flow.)
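The prediction loop can be sketched on a five-state random walk, a standard textbook test problem (it appears as Example 6.2 in Sutton & Barto). The environment, the function name `td0_random_walk`, and the hyperparameters are assumptions for illustration. The policy is a fair coin flip left or right, the only reward is 1 for exiting on the right, and the true values of the five states are 1/6 through 5/6.

```python
import random


def td0_random_walk(episodes=20000, alpha=0.005, gamma=1.0, seed=0):
    """Tabular TD(0) prediction on a 5-state random walk with terminals at both ends."""
    rng = random.Random(seed)
    n_states = 5
    # Indices 0 and n_states + 1 are terminal; their value stays 0.
    V = [0.0] * (n_states + 2)
    for _ in range(episodes):
        s = 3  # every episode starts in the middle state
        while 0 < s < n_states + 1:
            s_next = s + rng.choice((-1, 1))             # one real step, no model
            r = 1.0 if s_next == n_states + 1 else 0.0   # reward only on the right exit
            delta = r + gamma * V[s_next] - V[s]         # TD error
            V[s] += alpha * delta                        # update only the state just left
            s = s_next
    return V[1:-1]


values = td0_random_walk()
print([round(v, 2) for v in values])
```

With a small constant step size the estimates should hover near 1/6, 2/6, 3/6, 4/6, 5/6 rather than settle exactly, which is the usual behavior of constant-step TD.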
Go deeper: why bootstrapping is not true gradient descent
In supervised learning the target is fixed and the update is a real gradient step on a loss. In TD the target moves because it depends on the very parameters you are updating. So TD is a semi-gradient method: you take the gradient of the prediction $V(s)$ but treat the target $r + \gamma V(s')$ as a constant, ignoring its dependence on the weights. This is why tabular TD(0) provably converges (under standard step-size conditions) but TD with function approximation (even linear) and off-policy data can diverge: the notorious deadly triad of bootstrapping + function approximation + off-policy training. Gradient-TD methods (GTD, TDC) restore stability by optimizing a true objective, the mean-squared projected Bellman error.
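A minimal sketch of the semi-gradient idea, assuming a linear value estimate (the dot product of a weight vector and a feature vector). The function name, the feature vectors and the numbers are hypothetical. The key line is the weight update: it uses only the gradient of the prediction, which for a linear model is the feature vector `x`, and ignores that the target also depends on `w`.

```python
def semi_gradient_td0_step(w, x, r, x_next, alpha=0.1, gamma=0.9, terminal=False):
    """One semi-gradient TD(0) step for a linear value estimate v(s) = w . x(s)."""
    v = sum(wi * xi for wi, xi in zip(w, x))
    v_next = 0.0 if terminal else sum(wi * xi for wi, xi in zip(w, x_next))
    delta = r + gamma * v_next - v
    # Semi-gradient: the gradient of the prediction v with respect to w is x.
    # The target r + gamma * v_next is treated as a constant, so x_next
    # does not appear in the update direction.
    new_w = [wi + alpha * delta * xi for wi, xi in zip(w, x)]
    return new_w, delta


# Hypothetical one-hot features for two states.
w, delta = semi_gradient_td0_step(w=[0.0, 0.0], x=[1.0, 0.0], r=1.0, x_next=[0.0, 1.0])
print(w, delta)
```

A full gradient of the squared TD error would add a term involving `x_next`; dropping it is what makes the method a semi-gradient one.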
TD vs Monte Carlo vs dynamic programming
The cleanest way to place TD is along two axes: does it sample experience? and does it bootstrap?
| Property | Dynamic programming | Monte Carlo | TD learning |
|---|---|---|---|
| Needs a model of dynamics | Yes | No | No |
| Learns from raw experience | No | Yes | Yes |
| Bootstraps (uses own estimates) | Yes | No | Yes |
| Updates before episode ends | Yes | No | Yes |
| Works in continuing (non-terminating) tasks | Yes | No | Yes |
| Target variance | None | High | Low |
| Target bias | None | None (unbiased) | Biased (early on) |
The bias/variance trade-off is the heart of it. Monte Carlo waits for the actual return $G_t$, which is an unbiased sample of the true value but high variance (it accumulates every random reward and transition to the end of the episode). TD replaces most of that random tail with a single bootstrapped estimate: low variance, but biased while $V$ is still wrong. In practice TD’s lower variance usually makes it learn faster, and it often finds better estimates on Markov problems because it implicitly exploits the Markov structure.
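The variance gap can be made concrete with a rough simulation. The sketch below assumes a five-state random walk (fair left/right steps, reward 1 only for exiting on the right, no discounting) and compares the two kinds of target for the middle state. To isolate variance it plugs in the true values as the bootstrap estimate, so the TD target here is unbiased as well; that is an assumption of the example, not something a learning agent has access to.

```python
import random
from statistics import pvariance


def sample_targets(n_episodes=20000, seed=0):
    """Return (variance of MC targets, variance of TD targets) for the middle state."""
    rng = random.Random(seed)
    # True state values; indices 0 and 6 are terminal.
    true_v = [0.0, 1 / 6, 2 / 6, 3 / 6, 4 / 6, 5 / 6, 0.0]
    mc_targets, td_targets = [], []
    for _ in range(n_episodes):
        s = 3 + rng.choice((-1, 1))           # first step from the middle; reward is 0
        td_targets.append(0.0 + true_v[s])    # one-step target: r + V(s'), gamma = 1
        while 0 < s < 6:                      # keep walking to the end of the episode
            s += rng.choice((-1, 1))
        mc_targets.append(1.0 if s == 6 else 0.0)  # full return: 1 only on the right exit
    return pvariance(mc_targets), pvariance(td_targets)


mc_var, td_var = sample_targets()
print(f"MC target variance: {mc_var:.3f}, TD target variance: {td_var:.3f}")
```

The Monte Carlo target is always 0 or 1, so its variance sits near 0.25, while the one-step target is always 1/3 or 2/3, giving a variance near 1/36.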
n-step TD and TD(λ): one dial between the extremes
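TD(0) and Monte Carlo are the endpoints of a continuum. n-step TD takes $n$ real rewards before bootstrapping:

$$G_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^{n} V(s_{t+n})$$

where $r_{t+k}$ is the reward received $k$ steps after time $t$. With $n = 1$ this is TD(0); with $n \to \infty$ (to the end of the episode) it is Monte Carlo. Intermediate $n$ often beats both endpoints.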
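A sketch of the n-step return for one recorded episode. The indexing convention is an assumption of this example: `rewards[k]` is the reward received on leaving state k, `values[k]` is the current estimate for state k, and the episode terminates after `len(rewards)` steps, so the terminal value is 0.

```python
def n_step_return(rewards, values, t, n, gamma=0.9):
    """n-step return from time t: n real rewards, then bootstrap from the value estimate."""
    T = len(rewards)                 # episode length
    end = min(t + n, T)              # do not run past the end of the episode
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if end < T:                      # bootstrap only if the episode has not ended yet
        g += gamma ** (end - t) * values[end]
    return g


# Hypothetical three-step episode and value estimates.
rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.4, 0.3]
g1 = n_step_return(rewards, values, t=0, n=1, gamma=0.5)  # TD(0) target
g2 = n_step_return(rewards, values, t=0, n=2, gamma=0.5)
g3 = n_step_return(rewards, values, t=0, n=3, gamma=0.5)  # Monte Carlo return
print(g1, g2, g3)
```

Any `n` at or beyond the remaining episode length gives the same Monte Carlo return, since there is nothing left to bootstrap from.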
TD(λ) is the elegant unification: instead of picking one $n$, it averages all n-step returns, weighting the $n$-step return by $(1 - \lambda)\lambda^{n-1}$. The parameter $\lambda \in [0, 1]$ is a single knob: $\lambda = 0$ recovers TD(0), $\lambda = 1$ recovers Monte Carlo.
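The weighted average can be sketched for a finite episode as follows. One detail the infinite-sum form hides: once the episode ends, every longer n-step return equals the Monte Carlo return, so all the remaining weight collapses onto that final term. The indexing (`rewards[k]` received on leaving state k, terminal value 0) and the function name are assumptions of this example.

```python
def lambda_return(rewards, values, t, lam, gamma=0.9):
    """Lambda-weighted average of all n-step returns from time t in a finite episode."""
    T = len(rewards)

    def n_step(n):
        end = min(t + n, T)
        g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
        if end < T:
            g += gamma ** (end - t) * values[end]
        return g

    g_lam = 0.0
    for n in range(1, T - t):                        # returns that still bootstrap
        g_lam += (1 - lam) * lam ** (n - 1) * n_step(n)
    g_lam += lam ** (T - t - 1) * n_step(T - t)      # remaining weight on the full return
    return g_lam


rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.4, 0.3]
g_td0 = lambda_return(rewards, values, t=0, lam=0.0, gamma=0.5)  # equals the 1-step return
g_mid = lambda_return(rewards, values, t=0, lam=0.5, gamma=0.5)
g_mc = lambda_return(rewards, values, t=0, lam=1.0, gamma=0.5)   # equals the full return
print(g_td0, g_mid, g_mc)
```

Note that the function needs the whole episode before it can return anything, which is the practical limitation of computing the average this way.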
Computing that average directly would require waiting for the whole episode (the forward view). The trick that makes TD(λ) practical online is the backward view with eligibility traces: each state keeps a fading memory $e(s)$ of how recently and frequently it was visited, decaying by $\gamma\lambda$ each step. Every time step’s TD error $\delta_t$ is then broadcast to all states in proportion to their current trace:

$$e(s) \leftarrow \gamma\lambda\, e(s) + \mathbb{1}[s = s_t], \qquad V(s) \leftarrow V(s) + \alpha\, \delta_t\, e(s) \quad \text{for all } s$$
This assigns credit to recently visited states the instant a surprise arrives — solving the credit-assignment problem incrementally, with the same per-step cost.
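A sketch of the backward view with accumulating traces, run over one recorded episode. The transition format `(s, r, s_next, done)`, the function name and the three-state chain are assumptions made for the example.

```python
def td_lambda_episode(V, transitions, alpha=0.1, gamma=0.9, lam=0.8):
    """Backward-view TD(lambda) with accumulating traces over one episode, in place."""
    e = {s: 0.0 for s in V}                  # eligibility trace per state
    for s, r, s_next, done in transitions:
        v_next = 0.0 if done else V[s_next]
        delta = r + gamma * v_next - V[s]    # ordinary one-step TD error
        e[s] += 1.0                          # mark the state just visited
        for x in V:
            V[x] += alpha * delta * e[x]     # broadcast the error to every traced state
            e[x] *= gamma * lam              # then let each trace fade
    return V


# Hypothetical chain A -> B -> C -> terminal, with the only reward at the end.
V = {"A": 0.0, "B": 0.0, "C": 0.0}
episode = [("A", 0.0, "B", False), ("B", 0.0, "C", False), ("C", 1.0, None, True)]
td_lambda_episode(V, episode, alpha=0.5, gamma=1.0, lam=0.5)
print(V)  # {'A': 0.125, 'B': 0.25, 'C': 0.5}
```

With plain TD(0) only C would change in this first episode. The traces let the single surprise at the end reach B and A immediately, with credit halving at each step back because the trace decays by 0.5.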
From prediction to control: SARSA and Q-learning
Everything above predicts value for a fixed policy. To actually improve behavior, you apply the same TD update to action values and act greedily (or ε-greedily) on them. The choice of TD target splits the two most famous control algorithms:
- On-policy (SARSA). The target uses the action actually taken next: $r + \gamma Q(s', a')$. It learns the value of the policy it is following (exploration included), so it tends to be more conservative near danger. See on-policy vs off-policy.
- Off-policy (Q-learning). The target uses the best next action: $r + \gamma \max_{a'} Q(s', a')$. It learns the optimal policy regardless of how it explores, as long as exploration keeps visiting every state-action pair. This is the basis of Q-learning and, with a neural net, Deep Q-Networks.

The difference is a single operator in the target: $Q(s', a')$ (the action sampled) versus $\max_{a'} Q(s', a')$ (the greedy action). That is the whole on-policy/off-policy distinction, expressed in the TD error. Both are instances of generalized policy iteration: TD evaluates, the greedy step improves, round and round.
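The two targets can be put side by side in a few lines. This is a sketch with a hypothetical nested-dict Q-table and made-up numbers; the function names are illustrative, not a library API.

```python
def sarsa_target(Q, r, s_next, a_next, gamma=0.9):
    """On-policy target: bootstrap from the action actually taken next."""
    return r + gamma * Q[s_next][a_next]


def q_learning_target(Q, r, s_next, gamma=0.9):
    """Off-policy target: bootstrap from the greedy next action."""
    return r + gamma * max(Q[s_next].values())


def td_control_update(Q, s, a, target, alpha=0.1):
    """Shared update rule: move Q(s, a) a fraction alpha toward the chosen target."""
    Q[s][a] += alpha * (target - Q[s][a])


Q = {
    "s0": {"left": 0.0, "right": 0.0},
    "s1": {"left": 1.0, "right": 2.0},
}
# Suppose the agent moved s0 -> s1 with reward 0, then explored by picking "left".
sarsa = sarsa_target(Q, r=0.0, s_next="s1", a_next="left", gamma=0.5)
qlearn = q_learning_target(Q, r=0.0, s_next="s1", gamma=0.5)
td_control_update(Q, "s0", "right", qlearn, alpha=0.5)
print(sarsa, qlearn, Q["s0"]["right"])  # 0.5 1.0 0.5
```

The exploratory action lowers the SARSA target but leaves the Q-learning target untouched, which is the on-policy/off-policy split in miniature.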
Go deeper: TD in deep RL and actor-critic
Replace the table with a neural network and the TD error becomes the loss signal. DQN minimizes the squared TD error of Q-learning, stabilized by a target network (a frozen copy that supplies the bootstrap term $\max_{a'} Q(s', a')$, decoupling the moving target) and a replay buffer. In actor-critic methods the critic is a TD learner estimating value, and its TD error serves as an estimate of the advantage that tells the actor which way to push the policy, so TD sits at the core of critic-based policy-gradient algorithms like A3C and PPO. Generalized advantage estimation (GAE) is, under the hood, a TD(λ)-style exponentially weighted sum of one-step TD errors.
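The GAE computation is short enough to sketch directly. This version assumes a single trajectory, with `values[t]` the critic's estimate for the state at step t and `last_value` the estimate for the state after the final step (0 if the episode terminated). The function name and the numbers are illustrative.

```python
def gae(rewards, values, last_value=0.0, gamma=0.99, lam=0.95):
    """Generalized advantage estimates: discounted sum of one-step TD errors."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        v_next = values[t + 1] if t + 1 < len(rewards) else last_value
        delta = rewards[t] + gamma * v_next - values[t]   # one-step TD error
        running = delta + gamma * lam * running           # fold in later TD errors
        advantages[t] = running
    return advantages


rewards = [1.0, 1.0]
values = [0.5, 0.5]
adv_td = gae(rewards, values, gamma=1.0, lam=0.0)   # pure one-step TD errors
adv_mid = gae(rewards, values, gamma=1.0, lam=0.5)
adv_mc = gae(rewards, values, gamma=1.0, lam=1.0)   # Monte Carlo return minus value
print(adv_td, adv_mid, adv_mc)  # [1.0, 0.5] [1.25, 0.5] [1.5, 0.5]
```

Setting `lam` to 0 gives the raw TD error at each step, and setting it to 1 gives the full return minus the critic's estimate, so the same dial appears here as in TD(λ).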
TD learning in the brain
One of the most striking results in computational neuroscience is that TD did not just inspire algorithms; it appears to describe biology. Wolfram Schultz’s recordings of midbrain dopamine neurons show they fire not for reward itself but for reward that was better than predicted, stay at baseline for fully predicted reward, and dip below baseline when an expected reward is omitted. That signature (positive, zero, negative on surprise) matches the sign of the TD error $\delta$. Schultz, Dayan and Montague’s 1997 work made the link explicit, and a 2022 DeepMind/Harvard study showed dopamine responses even shift backward in time across learning as the TD error predicts.
A short history of TD learning
Practical notes
- Step size $\alpha$. Convergence theory wants $\alpha$ to shrink over time, but a constant small $\alpha$ is standard in practice for non-stationary or deep settings, because it lets the agent keep tracking a moving target.
- Choosing $\lambda$. Intermediate values frequently beat both TD(0) and Monte Carlo; GAE in modern policy-gradient code exposes exactly this dial.
- Watch the deadly triad. Bootstrapping + function approximation + off-policy data can diverge. Target networks, careful step sizes, or gradient-TD methods are the usual remedies.
- TD targets are biased early. Estimates can be wildly off before downstream values settle; this is expected, not a bug.
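The step-size note can be illustrated with a toy tracking problem. The sketch below is an assumption-laden simplification: it drops the TD machinery and tracks a single scalar whose true value jumps from 0 to 1 halfway through, comparing a decaying 1/n step with a constant step.

```python
def track(samples, step_size):
    """Run est <- est + step * (x - est), where step_size maps the sample count to a step."""
    est = 0.0
    for n, x in enumerate(samples, start=1):
        est += step_size(n) * (x - est)
    return est


# The quantity being estimated changes halfway through (a non-stationary target).
samples = [0.0] * 100 + [1.0] * 100
decaying = track(samples, lambda n: 1.0 / n)   # classic shrinking step: a plain average
constant = track(samples, lambda n: 0.1)       # constant step: keeps forgetting the past
print(f"decaying: {decaying:.3f}, constant: {constant:.3f}")
```

The decaying step ends at about 0.5, the average of everything it ever saw, while the constant step ends very close to 1, the current value. That is the trade being made when a constant step size is chosen for non-stationary problems.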
Most production RL stacks — and the libraries and frameworks that implement them — are built on TD updates at their core. For the broader tooling and vendor landscape around RL, see companies building RL environments.
Frequently asked questions
What exactly is the “temporal difference”?
It is the difference between two temporally successive predictions of the same quantity: your estimate now, $V(s)$, versus your slightly better estimate a step later, $r + \gamma V(s')$. The gap between predictions made at different times is the error signal you learn from.
Is TD learning model-free?
Yes. TD learns value estimates straight from sampled transitions without ever knowing the environment’s transition probabilities or reward function. That is what separates it from dynamic programming, which requires a full model. See model-based RL for the contrast.
When should I use Monte Carlo instead of TD?
Prefer Monte Carlo when the problem is strongly non-Markov (so bootstrapping on $V(s')$ misleads), when episodes are short and you want unbiased targets, or for debugging. Prefer TD for continuing tasks, online learning, and most large-scale problems where its lower variance and faster propagation win. Many modern methods (n-step, TD(λ), GAE) deliberately sit between the two.
How does TD(λ) relate to eligibility traces?
Eligibility traces are the mechanism that implements TD(λ) online. Rather than waiting to compute the λ-weighted average of all n-step returns (the forward view), each state holds a decaying trace of how recently it was visited, and every TD error is applied to all states in proportion to their traces (the backward view). The two views are provably equivalent when updates are applied offline at the end of the episode, and they agree closely when updating online with small step sizes.
Key papers and resources
- Learning to Predict by the Methods of Temporal Differences — Sutton, 1988 — the foundational paper.
- Reinforcement Learning: An Introduction — Sutton & Barto — chapters 6, 7 and 12 are the definitive treatment (free online).
- Temporal Difference Learning and TD-Gammon — Tesauro, 1995 — the landmark application.
- A Neural Substrate of Prediction and Reward — Schultz, Dayan & Montague, 1997 — TD error in the brain.
- Temporal difference learning — Wikipedia — a solid overview with references.