- A value function answers one question: starting here (and following a policy), how much total reward should I expect?
- The Bellman equation is the trick that makes this tractable — it expresses the value of a state as the immediate reward plus the discounted value of the next state.
- Two flavours: the state-value V(s) scores a situation; the action-value Q(s,a) scores a situation-plus-move and is what lets you act greedily.
- The Bellman optimality equation (with a max instead of an average) defines the best possible value — and almost every RL algorithm is a way of solving or approximating it.
What is a value function?
A value function estimates how good it is to be in a given situation — measured as the total reward you expect to collect from here onward. It is the single most important object in reinforcement learning: rewards tell you what is good right now, but values tell you what is good in the long run, after accounting for everything that is likely to follow.
The catch is the phrase “from here onward.” The future is long, often infinite, and branches with every action and every roll of the environment’s dice. Summing over all of it directly is hopeless. The Bellman equation is the insight that rescues this: the value of where you are equals the reward you get next, plus the (discounted) value of where you end up. That one recursive step turns an infinite sum into a self-referential equation you can actually solve.
Value functions are always defined relative to a policy — a way of behaving — inside a Markov Decision Process, which supplies the states, actions, transition probabilities and rewards.
The return: what we are actually valuing
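Before defining value, we need the thing it is the expectation of: the return $G_t$, the discounted sum of all future rewards from time $t$:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
The discount factor $\gamma \in [0, 1]$ controls how much the future counts. At $\gamma = 0$ the agent is myopic (only the next reward matters); near $\gamma = 1$ it is far-sighted. Discounting with $\gamma < 1$ also keeps the infinite sum finite when rewards are bounded, and reflects that a reward now is usually worth more than the same reward later.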
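As a small illustration of the return, the sketch below computes a discounted sum for a finite list of rewards. The function name `discounted_return` and the sample rewards are made up for this example; accumulating backwards uses the recursion $G_t = R_{t+1} + \gamma G_{t+1}$, which is the same one-step structure the Bellman equation relies on.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k], accumulated from the last reward backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # G_t = R_{t+1} + gamma * G_{t+1}
    return g

rewards = [1.0, 1.0, 1.0]                 # hypothetical reward sequence
print(discounted_return(rewards, 0.0))    # 1.0  (myopic: only the first reward counts)
print(discounted_return(rewards, 0.5))    # 1.75 (1 + 0.5 + 0.25)
```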
Two value functions: state-value and action-value
There are two value functions, and the difference between them is the difference between judging a situation and judging a move.
The state-value $V^\pi(s)$ is the expected return from state $s$ when you then follow policy $\pi$:
$$V^\pi(s) = \mathbb{E}_\pi\big[G_t \mid S_t = s\big]$$
Useful for evaluating where you are, but it does not directly tell you what to do — you would need the model to look one step ahead.
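The action-value $Q^\pi(s,a)$ is the expected return from taking action $a$ in state $s$, then following $\pi$:
$$Q^\pi(s,a) = \mathbb{E}_\pi\big[G_t \mid S_t = s, A_t = a\big]$$
This is the action-friendly one: compare $Q(s,a)$ across actions and just pick the best. It is what Q-learning and DQN learn.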
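The two are tied together. The state-value is the action-value averaged over the policy’s action choices, and the action-value is the immediate reward plus the discounted state-value of where you land:
$$V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s,a), \qquad Q^\pi(s,a) = \sum_{s'} P(s' \mid s,a)\,\big[R(s,a,s') + \gamma V^\pi(s')\big]$$
Here $P(s' \mid s,a)$ is the probability of landing in $s'$ and $R(s,a,s')$ is the reward for that transition.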
The Bellman expectation equation
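Substitute one of those identities into the other and the recursion appears. The Bellman expectation equation writes the value of a state purely in terms of the values of its possible successor states:
$$V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s,a)\,\big[R(s,a,s') + \gamma V^\pi(s')\big]$$
Read it left to right: average over the actions the policy might take, then over where the environment might send you, of (reward now) + $\gamma$ · (value of the next state). The same logic gives the action-value form:
$$Q^\pi(s,a) = \sum_{s'} P(s' \mid s,a)\,\Big[R(s,a,s') + \gamma \sum_{a'} \pi(a' \mid s')\, Q^\pi(s',a')\Big]$$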
This is the whole idea: an infinite-horizon return collapses into a one-step lookahead plus the value of the rest. The diagram below — a backup diagram, the standard way to draw these — shows how value flows back from successor states to the current one.
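The one-step backup can also be sketched in code. The two-state MDP, the 50/50 policy and the names `P`, `pi`, `q_backup` and `v_backup` below are all invented for illustration; the point is only that the state-value backup is the action-value backup averaged over the policy.

```python
GAMMA = 0.9

# Hypothetical 2-state MDP: P[s][a] is a list of (probability, next_state, reward).
P = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(0.8, "B", 1.0), (0.2, "A", 0.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "go": [(1.0, "A", 0.0)]},
}
# Assumed policy: pick each action with probability 0.5 in every state.
pi = {s: {a: 0.5 for a in P[s]} for s in P}

def q_backup(V, s, a):
    """Action-value form: expected (reward now + gamma * value of the next state)."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])

def v_backup(V, s):
    """State-value form: q_backup averaged over the policy's action choices."""
    return sum(pi[s][a] * q_backup(V, s, a) for a in P[s])

V = {"A": 1.0, "B": 2.0}           # arbitrary current estimates
print(q_backup(V, "A", "go"))      # 0.8*(1 + 0.9*2) + 0.2*(0 + 0.9*1)
print(v_backup(V, "A"))            # 0.5*q(A, stay) + 0.5*q(A, go)
```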
From a recursion to a system you can solve
For a finite MDP, the Bellman expectation equation is not just one equation: it is one equation per state, all sharing the same unknowns. For $n$ states that is a system of $n$ linear equations in $n$ unknowns, which in matrix form has a closed-form solution:
$$V^\pi = (I - \gamma P^\pi)^{-1} R^\pi$$
where $P^\pi$ is the state-to-state transition matrix induced by the policy and $R^\pi$ the expected immediate reward per state. Solving the inverse directly costs $O(n^3)$, which is fine for toy problems but hopeless at scale, so in practice we iterate instead.
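For a toy problem the linear system can be solved directly. The sketch below uses a made-up two-state chain (`P_pi`, `R_pi` and `gamma` are illustrative values) and a small Gaussian-elimination helper, so it runs without third-party libraries. It solves $(I - \gamma P^\pi) V = R^\pi$ rather than forming the inverse explicitly.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

gamma = 0.5
P_pi = [[0.5, 0.5], [0.5, 0.5]]   # hypothetical policy-induced transition matrix
R_pi = [1.0, 0.0]                 # hypothetical expected reward per state
n = len(R_pi)
A = [[(1.0 if i == j else 0.0) - gamma * P_pi[i][j] for j in range(n)] for i in range(n)]
V = solve(A, R_pi)
print(V)  # approximately [1.5, 0.5]
```

A quick sanity check by hand: $V(0) = 1 + 0.5 \cdot (0.5 \cdot 1.5 + 0.5 \cdot 0.5) = 1.5$, which matches.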
Start with an arbitrary guess for every state’s value, e.g. $V_0(s) = 0$ for all $s$. The starting point does not matter for convergence.
For each state, replace its old value with the right-hand side of the Bellman equation, computed from the current estimates of its successors. This is iterative policy evaluation.
The update is a contraction mapping with modulus $\gamma$: each sweep shrinks the max-norm error by at least a factor of $\gamma$, so the estimates converge geometrically to the unique fixed point $V^\pi$.
Act greedily with respect to the new values to get a better policy, re-evaluate, and repeat. Alternating these two steps is policy iteration; folding them into one is value iteration. Both rest entirely on the Bellman equation.
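The evaluate-then-improve loop described above can be sketched as policy iteration on a toy problem. The two-state MDP and every name here (`P`, `evaluate`, `improve`) are hypothetical, and the evaluation step uses in-place sweeps with a simple stopping tolerance rather than any particular textbook variant.

```python
GAMMA = 0.9

# Hypothetical 2-state MDP: P[s][a] is a list of (probability, next_state, reward).
P = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(0.8, "B", 1.0), (0.2, "A", 0.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "go": [(1.0, "A", 0.0)]},
}

def q_value(V, s, a):
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])

def evaluate(policy, tol=1e-10):
    """Iterative policy evaluation for a deterministic policy {state: action}."""
    V = {s: 0.0 for s in P}              # arbitrary starting guess
    while True:
        delta = 0.0
        for s in P:                      # one sweep over all states
            new_v = q_value(V, s, policy[s])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V

def improve(V):
    """Greedy policy with respect to the current value estimates."""
    return {s: max(P[s], key=lambda a: q_value(V, s, a)) for s in P}

policy = {s: "stay" for s in P}          # deliberately poor starting policy
while True:
    V = evaluate(policy)
    new_policy = improve(V)
    if new_policy == policy:             # stable policy: stop
        break
    policy = new_policy

print(policy, V)
```

On this toy problem the loop settles on leaving `A` and staying in `B`, where the reward of 2 per step gives $V(B) = 2 / (1 - 0.9) = 20$.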
The Bellman optimality equation
Everything so far evaluated a given policy. The point of RL is to find the best one. Define the optimal value functions as the best achievable over all policies:
$$V^*(s) = \max_\pi V^\pi(s), \qquad Q^*(s,a) = \max_\pi Q^\pi(s,a)$$
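The key result, and the reason RL is tractable at all, is that these satisfy their own self-consistent recursion, the Bellman optimality equation. Instead of averaging over actions, it maximises over them:
$$V^*(s) = \max_a \sum_{s'} P(s' \mid s,a)\,\big[R(s,a,s') + \gamma V^*(s')\big]$$
$$Q^*(s,a) = \sum_{s'} P(s' \mid s,a)\,\big[R(s,a,s') + \gamma \max_{a'} Q^*(s',a')\big]$$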
The intuition: the value of acting optimally from $s$ is the value of the single best action now, plus the discounted value of acting optimally thereafter. Once you have $Q^*$, the optimal policy is trivial: in every state, pick $\pi^*(s) = \arg\max_a Q^*(s,a)$. No model, no lookahead, no planning: the optimal action falls straight out of the optimal action-values.
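To make the max-backup concrete, here is a minimal sketch that repeatedly applies the optimality backup to a table of action-values and then reads off the greedy policy. The MDP, the fixed count of 500 sweeps and the names `P`, `Q` and `greedy` are assumptions for illustration. The model is used only to compute $Q^*$; choosing actions afterwards needs nothing but the table.

```python
GAMMA = 0.9

# Hypothetical 2-state MDP: P[s][a] is a list of (probability, next_state, reward).
P = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(0.8, "B", 1.0), (0.2, "A", 0.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "go": [(1.0, "A", 0.0)]},
}

Q = {s: {a: 0.0 for a in P[s]} for s in P}
for _ in range(500):  # enough sweeps for 0.9**500 to be negligible
    # Synchronous backup: the right-hand side reads the old Q throughout.
    Q = {
        s: {
            a: sum(p * (r + GAMMA * max(Q[s2].values())) for p, s2, r in P[s][a])
            for a in P[s]
        }
        for s in P
    }

# Acting optimally is just an argmax over the learned action-values.
greedy = {s: max(Q[s], key=Q[s].get) for s in Q}
print(greedy)
print(Q)
```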
Why this powers (almost) all of RL
Nearly every RL algorithm is, underneath, a different way of solving or approximating a Bellman equation. The split is about how much you know and how you compute the expectation.
| Family | What it knows | How it uses Bellman | Examples |
|---|---|---|---|
| Dynamic programming | Full model | Sweeps the exact equation over all states | Policy iteration, value iteration |
| Monte Carlo | Nothing — just episodes | Averages actual returns, no bootstrapping | First-visit / every-visit MC |
| Temporal difference | Nothing | Bootstraps: updates toward $r + \gamma V(s')$ | TD(0), SARSA, Q-learning |
| Deep RL | Nothing; huge state space | Minimises Bellman error with a neural net | DQN, actor-critic |
The unifying object is the TD error, the gap between the two sides of the Bellman equation given current estimates:
$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$
If $\delta_t$ is zero in expectation in every state, the Bellman equation holds and your value function is consistent. Every TD method just nudges estimates to shrink this error. DQN turns it into a regression loss; actor-critic uses it as the signal that trains the critic and weights the policy-gradient update; even model-based RL plans by applying Bellman backups inside a learned model.
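A tabular TD(0) update is short enough to sketch in full. The helper name `td0_update`, the step size and the two-state deterministic loop are hypothetical choices for this example; the update itself is the usual "move the estimate a fraction of the TD error" rule.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one TD(0) update to V[s] in place and return the TD error."""
    delta = r + gamma * V[s_next] - V[s]   # gap between the two sides of Bellman
    V[s] += alpha * delta                  # nudge the estimate to shrink the gap
    return delta

V = {"A": 0.0, "B": 0.0}
first_delta = td0_update(V, "A", 1.0, "B")   # 1 + 0.9*0 - 0 = 1.0

# Hypothetical deterministic loop: A -> B pays 1, B -> A pays 0.
for _ in range(5000):
    td0_update(V, "A", 1.0, "B")
    td0_update(V, "B", 0.0, "A")

print(V)  # approaches V(A) = 1 / (1 - 0.81) and V(B) = 0.9 * V(A)
```

Because the transitions here are deterministic, the TD error itself goes to zero; in a stochastic environment only its expectation does.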
Go deeper: contraction, fixed points and why iteration works
Define the Bellman optimality operator $T$ acting on a value function $V$: $(TV)(s) = \max_a \sum_{s'} P(s' \mid s,a)\,[R(s,a,s') + \gamma V(s')]$. For any two value functions $U$ and $V$ it satisfies $\lVert TU - TV \rVert_\infty \le \gamma \lVert U - V \rVert_\infty$, so it is a $\gamma$-contraction in the max-norm. By the Banach fixed-point theorem a contraction on a complete metric space has exactly one fixed point and repeated application converges to it from any starting point, with error shrinking by at least a factor of $\gamma$ each step. That fixed point is $V^*$. This is the formal guarantee behind value iteration, and the reason discounting ($\gamma < 1$) matters so much: it is what makes the operator a contraction in the first place. The expectation operator (with average instead of max) is a contraction too, which is why policy evaluation converges.
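The contraction property is easy to check numerically on a small example. The MDP and the two starting value functions `U` and `W` below are arbitrary, made-up inputs; the sketch applies the optimality operator to both and compares max-norm distances before and after.

```python
GAMMA = 0.9

# Hypothetical 2-state MDP: P[s][a] is a list of (probability, next_state, reward).
P = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(0.8, "B", 1.0), (0.2, "A", 0.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "go": [(1.0, "A", 0.0)]},
}

def bellman_optimality(V):
    """One application of the optimality operator T to a value table."""
    return {
        s: max(sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a]) for a in P[s])
        for s in P
    }

def max_norm_gap(U, V):
    return max(abs(U[s] - V[s]) for s in U)

U = {"A": 10.0, "B": -3.0}   # two arbitrary value functions
W = {"A": 0.0, "B": 5.0}

before = max_norm_gap(U, W)
after = max_norm_gap(bellman_optimality(U), bellman_optimality(W))
print(before, after, after <= GAMMA * before)
```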
Go deeper: the deadly triad and Bellman error in deep RL
With a perfect table of values, Bellman backups are guaranteed to converge. Replace the table with a function approximator and the guarantee can break. Sutton and Barto call the dangerous combination the deadly triad: (1) function approximation, (2) bootstrapping (using your own estimates as targets, as TD does), and (3) off-policy training. When all three are present, the Bellman update is no longer a contraction in the approximator’s parameter space and values can diverge. DQN’s headline tricks — a frozen target network and an experience replay buffer — are engineering patches that tame exactly this instability, keeping the Bellman target stable enough for gradient descent to chase.
A concrete example
Picture a tiny gridworld: an agent on a 4×4 grid, reward $-1$ per step, $\gamma = 1$, episode ends at a goal corner. The state-value of a cell is, roughly, “how many steps to the goal, negated.” A cell next to the goal has value near $-1$; a far corner has value around $-6$. Run iterative policy evaluation and watch these numbers settle: each sweep, a cell’s value updates to $-1$ plus the average value of the neighbours its policy moves toward. After enough sweeps the values stop changing, meaning the Bellman equation is satisfied, and acting greedily on them (“step toward the higher-valued neighbour”) gives the shortest path. That is the entire arc of value-based RL in miniature: estimate values, then act greedily on them.
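The gridworld can be reproduced in a few lines. Everything specific here is an assumption made for the sketch: the goal sits in the top-left corner, each step costs $-1$, there is no discounting ($\gamma = 1$), and the evaluated policy steps up or left with equal probability whenever both moves stay on the grid.

```python
N = 4
GOAL = (0, 0)  # assumed goal corner
states = [(r, c) for r in range(N) for c in range(N)]

def policy_moves(s):
    """Successor cells under the assumed policy: up or left, uniformly at random."""
    r, c = s
    moves = []
    if r > 0:
        moves.append((r - 1, c))
    if c > 0:
        moves.append((r, c - 1))
    return moves

# Iterative policy evaluation with reward -1 per step and gamma = 1.
V = {s: 0.0 for s in states}
while True:
    delta = 0.0
    for s in states:
        if s == GOAL:
            continue  # terminal state keeps value 0
        succ = policy_moves(s)
        new_v = -1.0 + sum(V[s2] for s2 in succ) / len(succ)
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v
    if delta == 0.0:  # values stopped changing: Bellman equation satisfied
        break

def neighbours(s):
    r, c = s
    cand = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [(i, j) for i, j in cand if 0 <= i < N and 0 <= j < N]

# Act greedily: always step to the highest-valued neighbouring cell.
s, steps = (N - 1, N - 1), 0
while s != GOAL:
    s = max(neighbours(s), key=V.get)
    steps += 1

print(V[(0, 1)], V[(N - 1, N - 1)], steps)
```

With these assumptions each cell's value is exactly the negated Manhattan distance to the goal, so the greedy walk from the far corner takes six steps.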
Researcher takes
The Bellman equation is one of the rare ideas that is simultaneously the theoretical foundation and the practical workhorse of a field. The community’s running observation is that decades of progress — from tabular dynamic programming to TD learning to deep RL — have been less about replacing the Bellman equation than about finding ever-more-scalable ways to satisfy it from data instead of from a known model. The arc Sutton describes runs from “compute the value when you know the world” to “learn the value while you live in it,” and every major algorithm is a waypoint on that path. Even the recent wave of LLM post-training keeps the same skeleton: value-style critics estimate expected return inside methods like PPO and actor-critic, and the simplifications that dropped them (GRPO) are defined precisely by what part of the Bellman machinery they remove.
A direct argument for why a learned value function changes the credit-assignment game: bootstrapping TD methods can attribute outcomes online, even under long reward delays, rather than waiting for full returns the way REINFORCE-style methods must.
A contrarian-leaning case (against the policy-gradient-only trend) that value/Q-function methods scale favorably with both compute spent querying the value function and with data, making value-based RL a promising long-run direction.
Frequently asked questions
What is the difference between a value function and a reward?
Reward is the immediate, single-step signal from the environment. A value function is the expected total of all future (discounted) rewards from a state. Reward says what is good now; value says what is good in the long run. RL agents act on values precisely because chasing immediate reward is short-sighted.
When should I use V(s) versus Q(s,a)?
Use $Q(s,a)$ when you want to choose actions without a model: you can just pick $\arg\max_a Q(s,a)$, which is why Q-learning and DQN learn $Q$. Use $V(s)$ when you only need to evaluate states (e.g. as a critic baseline in actor-critic and policy gradients) or when you have a model to do the one-step lookahead yourself.
Why is the discount factor γ necessary?
Three reasons: it keeps the infinite sum of rewards finite, it expresses a preference for sooner rewards, and, mathematically, it is what makes the Bellman operator a contraction, which is what guarantees iterative methods converge to a unique value function. Without $\gamma < 1$ (in continuing tasks) the whole edifice can fail to converge.
Does the Bellman equation require knowing the environment’s dynamics?
The equation is written with the transition model $P(s' \mid s, a)$, so dynamic programming needs the model. But temporal-difference methods (TD, SARSA, Q-learning) target the same equation using sampled transitions: they replace the explicit expectation with experience. That is exactly what makes model-free RL possible.
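A minimal tabular Q-learning sketch shows the sampled version in action. The toy MDP, the uniformly random behaviour policy, the constant step size and the fixed seed are all assumptions for illustration. The table `P` is used only inside `step` to simulate the environment; the learning update never reads the probabilities.

```python
import random

GAMMA, ALPHA = 0.9, 0.1

# Hypothetical environment dynamics, hidden from the learner.
P = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(0.8, "B", 1.0), (0.2, "A", 0.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "go": [(1.0, "A", 0.0)]},
}

def step(s, a, rng):
    """Sample one (next_state, reward) pair; the learner only sees samples."""
    u, acc = rng.random(), 0.0
    for p, s2, r in P[s][a]:
        acc += p
        if u < acc:
            break
    return s2, r  # falls back to the last outcome on floating-point round-off

rng = random.Random(0)
Q = {s: {a: 0.0 for a in P[s]} for s in P}
s = "A"
for _ in range(20000):
    a = rng.choice(list(Q[s]))                  # explore uniformly (off-policy)
    s2, r = step(s, a, rng)
    target = r + GAMMA * max(Q[s2].values())    # sampled Bellman optimality target
    Q[s][a] += ALPHA * (target - Q[s][a])
    s = s2

greedy = {state: max(Q[state], key=Q[state].get) for state in Q}
print(greedy)
```

With a constant step size the estimates keep fluctuating slightly around the optimal action-values rather than converging exactly, which is why the example reports only the greedy policy.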
Key references
- Reinforcement Learning: An Introduction — Sutton & Barto, 2nd ed. — Chapters 3–4 are the canonical treatment of value functions and the Bellman equations.
- Dynamic Programming — Richard Bellman, 1957 — the original source of the equation.
- Learning to Predict by the Methods of Temporal Differences — Sutton, 1988 — TD learning.
- Q-learning — Watkins & Dayan, 1992 — convergent sampled solution to the Bellman optimality equation.
- Human-level control through deep reinforcement learning — Mnih et al., 2015 — Bellman-error minimisation with deep networks (DQN).
Related
What is reinforcement learning? · Markov Decision Processes · Q-learning · Deep Q-Networks · Policy gradients · Actor-critic · Model-based RL