reinforcement-learning.com
// FOUNDATIONS

Value Functions and the Bellman Equations

Value functions and the Bellman equations explained: state-value vs action-value, the recursive Bellman expectation and optimality equations, and why they power all of RL.

Updated 2026-06-07 15 min read
Key takeaways
  • A value function answers one question: starting here (and following a policy), how much total reward should I expect?
  • The Bellman equation is the trick that makes this tractable — it expresses the value of a state as the immediate reward plus the discounted value of the next state.
  • Two flavours: the state-value V(s) scores a situation; the action-value Q(s,a) scores a situation-plus-move and is what lets you act greedily.
  • The Bellman optimality equation (with a max instead of an average) defines the best possible value — and almost every RL algorithm is a way of solving or approximating it.

What is a value function?

A value function estimates how good it is to be in a given situation — measured as the total reward you expect to collect from here onward. It is the single most important object in reinforcement learning: rewards tell you what is good right now, but values tell you what is good in the long run, after accounting for everything that is likely to follow.

The catch is the phrase “from here onward.” The future is long, often infinite, and branches with every action and every roll of the environment’s dice. Summing over all of it directly is hopeless. The Bellman equation is the insight that rescues this: the value of where you are equals the reward you get next, plus the (discounted) value of where you end up. That one recursive step turns an infinite sum into a self-referential equation you can actually solve.

Value functions are always defined relative to a policy π\pi — a way of behaving — inside a Markov Decision Process, which supplies the states, actions, transition probabilities and rewards.

The return: what we are actually valuing

Before defining value, we need the thing it is the expectation of: the return GtG_t, the discounted sum of all future rewards from time tt:

Gt=Rt+1+γRt+2+γ2Rt+3+=k=0γkRt+k+1G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}

The discount factor γ[0,1]\gamma \in [0,1] controls how much the future counts. At γ=0\gamma = 0 the agent is myopic (only the next reward matters); near γ=1\gamma = 1 it is far-sighted. Discounting also keeps the infinite sum finite and reflects that a reward now is usually worth more than the same reward later.

γ = 0
Myopic — only the very next reward matters
γ ≈ 0.99
Far-sighted — typical for long-horizon tasks
1/(1−γ)
Effective horizon, in steps, the agent plans over

Two value functions: state-value and action-value

There are two value functions, and the difference between them is the difference between judging a situation and judging a move.

State-value V(s) — how good is this situation?

The expected return from state ss when you then follow policy π\pi:

vπ(s)=Eπ[GtSt=s]v_\pi(s) = \mathbb{E}_\pi\big[\, G_t \mid S_t = s \,\big]

Useful for evaluating where you are, but it does not directly tell you what to do — you would need the model to look one step ahead.

Q(s,a) — how good is this move from here?

The expected return from taking action aa in state ss, then following π\pi:

qπ(s,a)=Eπ[GtSt=s,At=a]q_\pi(s,a) = \mathbb{E}_\pi\big[\, G_t \mid S_t = s,\, A_t = a \,\big]

This is the action-friendly one: compare qπ(s,a)q_\pi(s,a) across actions and just pick the best. It is what Q-learning and DQN learn.

The two are tied together. The state-value is the action-value averaged over the policy’s action choices, and the action-value is the immediate reward plus the discounted state-value of where you land:

vπ(s)=aπ(as)qπ(s,a)qπ(s,a)=s,rp(s,rs,a)[r+γvπ(s)]v_\pi(s) = \sum_{a} \pi(a\mid s)\, q_\pi(s,a) \qquad\quad q_\pi(s,a) = \sum_{s',r} p(s',r\mid s,a)\big[\, r + \gamma\, v_\pi(s') \,\big]

The Bellman expectation equation

Substitute one of those identities into the other and the recursion appears. The Bellman expectation equation writes the value of a state purely in terms of the values of its possible successor states:

vπ(s)=aπ(as)s,rp(s,rs,a)[r+γvπ(s)]v_\pi(s) = \sum_{a} \pi(a\mid s) \sum_{s',r} p(s',r\mid s,a)\Big[\, r + \gamma\, v_\pi(s') \,\Big]

Read it left to right: average over the actions the policy might take, then over where the environment might send you, of (reward now) + γ·(value of the next state). The same logic gives the action-value form:

qπ(s,a)=s,rp(s,rs,a)[r+γaπ(as)qπ(s,a)]q_\pi(s,a) = \sum_{s',r} p(s',r\mid s,a)\Big[\, r + \gamma \sum_{a'} \pi(a'\mid s')\, q_\pi(s',a') \,\Big]

This is the whole idea: an infinite-horizon return collapses into a one-step lookahead plus the value of the rest. The diagram below — a backup diagram, the standard way to draw these — shows how value flows back from successor states to the current one.

sstate, value v(s)π(a|s)action aaction a’p(s’,r|s,a)s’s’s’s’value backs up: v(s) = avg of r + γ·v(s’)
A backup diagram for the Bellman expectation equation. From state s the agent picks an action under π (open circles); the environment then transitions to a successor state s' with reward r. Value backs up from the leaves to the root: v(s) is the average over both branchings of r + γ·v(s').
▶ RL Course by David Silver — Lecture 2: Markov Decision Processes (value functions & the Bellman equation, the canonical treatment)

From a recursion to a system you can solve

For a finite MDP, the Bellman expectation equation is not just one equation — it is one equation per state, all sharing the same unknowns. For nn states that is a system of nn linear equations in nn unknowns, which in matrix form has a closed-form solution:

vπ=Rπ+γPπvπvπ=(IγPπ)1Rπv_\pi = R^\pi + \gamma P^\pi v_\pi \quad\Longrightarrow\quad v_\pi = \big(I - \gamma P^\pi\big)^{-1} R^\pi

where PπP^\pi is the state-to-state transition matrix induced by the policy and RπR^\pi the expected immediate reward per state. Solving the inverse directly costs O(n3)O(n^3), which is fine for toy problems but hopeless at scale — so in practice we iterate instead.

1
Initialise

Start with an arbitrary guess for every state’s value, e.g. v(s)=0v(s) = 0 for all ss. The starting point does not matter for convergence.

2
Apply the Bellman operator (a sweep)

For each state, replace its old value with the right-hand side of the Bellman equation, computed from the current estimates of its successors. This is iterative policy evaluation.

3
Repeat until it stops moving

The update is a contraction mapping with modulus γ\gamma: each sweep shrinks the error by at least a factor of γ\gamma, so the estimates converge geometrically to the unique fixed point vπv_\pi.

4
(Optional) improve the policy

Act greedily with respect to the new values to get a better policy, re-evaluate, and repeat. Alternating these two steps is policy iteration; folding them into one is value iteration. Both rest entirely on the Bellman equation.

The Bellman optimality equation

Everything so far evaluated a given policy. The point of RL is to find the best one. Define the optimal value functions as the best achievable over all policies:

v(s)=maxπvπ(s)q(s,a)=maxπqπ(s,a)v_*(s) = \max_\pi v_\pi(s) \qquad\qquad q_*(s,a) = \max_\pi q_\pi(s,a)

The key result — and the reason RL is tractable at all — is that these satisfy their own self-consistent recursion, the Bellman optimality equation. Instead of averaging over actions, it maximises over them:

v(s)=maxas,rp(s,rs,a)[r+γv(s)]v_*(s) = \max_{a} \sum_{s',r} p(s',r\mid s,a)\Big[\, r + \gamma\, v_*(s') \,\Big] q(s,a)=s,rp(s,rs,a)[r+γmaxaq(s,a)]q_*(s,a) = \sum_{s',r} p(s',r\mid s,a)\Big[\, r + \gamma \max_{a'} q_*(s',a') \,\Big]

The intuition: the value of acting optimally from ss is the value of the single best action now, plus the discounted value of acting optimally thereafter. Once you have qq_*, the optimal policy is trivial — in every state, pick argmaxaq(s,a)\arg\max_a q_*(s,a). No model, no lookahead, no planning: the optimal action falls straight out of the optimal action-values.

Bellman expectationevaluate a fixed policy πsaverage over actionsBellman optimalityfind the best policysmax over actions
Expectation vs. optimality. The expectation equation averages over the policy's actions to evaluate it; the optimality equation takes the max to find the best. The max makes the equation non-linear — which is why it needs iterative methods, not a matrix inverse.

Why this powers (almost) all of RL

Nearly every RL algorithm is, underneath, a different way of solving or approximating a Bellman equation. The split is about how much you know and how you compute the expectation.

FamilyWhat it knowsHow it uses BellmanExamples
Dynamic programmingFull model p(s,rs,a)p(s',r\mid s,a)Sweeps the exact equation over all statesPolicy iteration, value iteration
Monte CarloNothing — just episodesAverages actual returns, no bootstrappingFirst-visit / every-visit MC
Temporal differenceNothingBootstraps: updates toward r+γV(s)r + \gamma V(s')TD(0), SARSA, Q-learning
Deep RLNothing; huge state spaceMinimises Bellman error with a neural netDQN, actor-critic

The unifying object is the TD error — the gap between the two sides of the Bellman equation given current estimates:

δt=Rt+1+γV(St+1)Bellman targetV(St)current estimate\delta_t = \underbrace{R_{t+1} + \gamma\, V(S_{t+1})}_{\text{Bellman target}} - \underbrace{V(S_t)}_{\text{current estimate}}

If δt\delta_t is zero everywhere, the Bellman equation holds and your value function is consistent. Every TD method just nudges estimates to shrink this error. DQN turns it into a regression loss; actor-critic uses it as the signal that trains the critic and weights the policy-gradient update; even model-based RL plans by applying Bellman backups inside a learned model.

Go deeper: contraction, fixed points and why iteration works

Define the Bellman optimality operator TT acting on a value function: (Tv)(s)=maxas,rp(s,rs,a)[r+γv(s)](Tv)(s) = \max_a \sum_{s',r} p(s',r\mid s,a)[r + \gamma v(s')]. For any two value functions u,vu, v it satisfies TuTvγuv\lVert Tu - Tv \rVert_\infty \le \gamma \lVert u - v \rVert_\infty — it is a γ\gamma-contraction in the max-norm. By the Banach fixed-point theorem a contraction has exactly one fixed point and repeated application converges to it from any starting point, with error shrinking by at least γ\gamma each step. That fixed point is vv_*. This is the formal guarantee behind value iteration, and the reason discounting (γ<1\gamma < 1) matters so much: it is what makes the operator a contraction in the first place. The expectation operator TπT^\pi (with average instead of max) is a contraction too, which is why policy evaluation converges.

Go deeper: the deadly triad and Bellman error in deep RL

With a perfect table of values, Bellman backups are guaranteed to converge. Replace the table with a function approximator and the guarantee can break. Sutton and Barto call the dangerous combination the deadly triad: (1) function approximation, (2) bootstrapping (using your own estimates as targets, as TD does), and (3) off-policy training. When all three are present, the Bellman update is no longer a contraction in the approximator’s parameter space and values can diverge. DQN’s headline tricks — a frozen target network and an experience replay buffer — are engineering patches that tame exactly this instability, keeping the Bellman target stable enough for gradient descent to chase.

A short history

1950s
Bellman invents dynamic programming
Richard Bellman, working at RAND, formalises multistage decision processes and the recursive equation that now bears his name. He reportedly chose the bland name “dynamic programming” to disguise that he was doing mathematics from a research-averse Secretary of Defense.
1957
Markov Decision Processes
Bellman and, separately, Ronald Howard (policy iteration, 1960) cast sequential decision-making as MDPs solved by Bellman backups — the foundation RL still stands on.
1988
Temporal-difference learning
Richard Sutton shows you can satisfy the Bellman equation from raw experience, without a model, by bootstrapping — the bridge from planning to learning.
1989
Q-learning
Chris Watkins introduces Q-learning, a sampled, off-policy solver for the Bellman optimality equation with a convergence proof.
2015
Deep Q-Networks
DeepMind scales Bellman-error minimisation to a neural network and learns Atari from pixels — value functions go deep.

A concrete example

Picture a tiny gridworld: an agent on a 4×4 grid, reward 1-1 per step, γ=1\gamma = 1, episode ends at a goal corner. The state-value of a cell is, roughly, “how many steps to the goal, negated.” A cell next to the goal has value near 1-1; a far corner has value around 6-6. Run iterative policy evaluation and watch these numbers settle: each sweep, a cell’s value updates to 1-1 plus the average value of the neighbours its policy moves toward. After enough sweeps the values stop changing — the Bellman equation is satisfied — and acting greedily on them (“step toward the higher-valued neighbour”) gives the shortest path. That is the entire arc of value-based RL in miniature: estimate values, then act greedily on them.

Researcher takes

The Bellman equation is one of the rare ideas that is simultaneously the theoretical foundation and the practical workhorse of a field. The community’s running observation is that decades of progress — from tabular dynamic programming to TD learning to deep RL — have been less about replacing the Bellman equation than about finding ever-more-scalable ways to satisfy it from data instead of from a known model. The arc Sutton describes runs from “compute the value when you know the world” to “learn the value while you live in it,” and every major algorithm is a waypoint on that path. Even the recent wave of LLM post-training keeps the same skeleton: value-style critics estimate expected return inside methods like PPO and actor-critic, and the simplifications that dropped them (GRPO) are defined precisely by what part of the Bellman machinery they remove.

Researcher takes

A direct argument for why a learned value function changes the credit-assignment game: bootstrapping TD methods can attribute outcomes online, even under long reward delays, rather than waiting for full returns the way REINFORCE-style methods must.

A contrarian-leaning case (against the policy-gradient-only trend) that value/Q-function methods scale favorably with both compute spent querying the value function and with data, making value-based RL a promising long-run direction.

Frequently asked questions

What is the difference between a value function and a reward?

Reward is the immediate, single-step signal from the environment. A value function is the expected total of all future (discounted) rewards from a state. Reward says what is good now; value says what is good in the long run. RL agents act on values precisely because chasing immediate reward is short-sighted.

When should I use V(s) versus Q(s,a)?

Use Q(s,a)Q(s,a) when you want to choose actions without a model — you can just pick argmaxaQ(s,a)\arg\max_a Q(s,a), which is why Q-learning and DQN learn QQ. Use V(s)V(s) when you only need to evaluate states (e.g. as a critic baseline in actor-critic and policy gradients) or when you have a model to do the one-step lookahead yourself.

Why is the discount factor γ necessary?

Three reasons: it keeps the infinite sum of rewards finite, it expresses a preference for sooner rewards, and — mathematically — it is what makes the Bellman operator a contraction, which is what guarantees iterative methods converge to a unique value function. Without γ<1\gamma < 1 (in continuing tasks) the whole edifice can fail to converge.

Does the Bellman equation require knowing the environment’s dynamics?

The equation is written with the transition model p(s,rs,a)p(s',r\mid s,a), so dynamic programming needs the model. But temporal-difference methods (TD, SARSA, Q-learning) satisfy the same equation using sampled transitions — they replace the explicit expectation with experience. That is exactly what makes model-free RL possible.

Key references

What is reinforcement learning? · Markov Decision Processes · Q-learning · Deep Q-Networks · Policy gradients · Actor-critic · Model-based RL