reinforcement-learning.com
// FOUNDATIONS

Markov Decision Processes (MDPs)

What a Markov Decision Process is, the (S, A, P, R, γ) tuple, the Markov property, Bellman equations, value/policy iteration, POMDPs — the formal foundation of all RL.

Updated 2026-06-07 15 min read
Key takeaways
  • A Markov Decision Process (MDP) is the formal model of sequential decision-making under uncertainty — the mathematical foundation of essentially all reinforcement learning.
  • It is the 5-tuple (S, A, P, R, γ): states, actions, transition probabilities, rewards, and a discount factor.
  • The Markov property — 'the future depends only on the present, not the path that got you here' — is what makes the whole thing tractable.
  • Solving an MDP means finding an optimal policy via the Bellman equations, using dynamic programming (value/policy iteration) when the model is known, and RL when it isn't.

What is a Markov Decision Process?

A Markov Decision Process (MDP) is the standard mathematical framework for modelling a problem where an agent repeatedly chooses actions in an environment, the environment responds with a new state and a reward, and the agent’s goal is to act so as to maximize reward over time. Almost every reinforcement learning algorithm — from Q-learning to PPO to AlphaZero — is, under the hood, a method for solving or approximating an MDP.

The key idea is that an MDP captures the full structure of a sequential decision problem in five pieces. If you can write your problem down as those five pieces, the entire theory of RL becomes available to you.

Agent (policy π)Environmentaction a_treward r_{t+1}next state s_{t+1}observes state s_t
The agent–environment loop that every MDP describes: at step t the agent observes state s_t, takes action a_t, and the environment returns a reward r_{t+1} and next state s_{t+1}.
▶ RL Course by David Silver — Lecture 2: Markov Decision Processes (the canonical lecture, ~1h40)

The five pieces: the (S, A, P, R, γ) tuple

Formally, an MDP is the tuple S,A,P,R,γ\langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle:

SymbolNameWhat it is
S\mathcal{S}State spaceThe set of all situations the agent can be in (e.g. board positions, robot joint angles).
A\mathcal{A}Action spaceThe set of choices available (e.g. moves, motor torques).
P(ss,a)P(s' \mid s, a)Transition functionThe probability of landing in state ss' after taking action aa in state ss. Encodes the environment’s dynamics.
R(s,a)R(s, a)Reward functionThe (expected) immediate reward for taking aa in ss. The numeric signal the agent maximizes.
γ[0,1]\gamma \in [0, 1]Discount factorHow much future reward is worth relative to immediate reward.

The agent’s behaviour is a policy π(as)\pi(a \mid s) — a (possibly stochastic) mapping from states to actions. The goal is to find the policy that maximizes the expected discounted return:

Gt=Rt+1+γRt+2+γ2Rt+3+=k=0γkRt+k+1G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
5
components in the MDP tuple (S, A, P, R, γ)
γ ∈ [0,1]
discount factor — 0 = myopic, →1 = far-sighted
1957
Bellman formalizes dynamic programming & MDPs

The Markov property

The “Markov” in MDP refers to the Markov property: the future is independent of the past given the present.

P(st+1st,at,st1,at1,,s0)=P(st+1st,at)P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0) = P(s_{t+1} \mid s_t, a_t)

In plain English: the current state contains everything you need to know to predict what happens next. The path you took to get there carries no extra information. This is what makes MDPs tractable — the agent never has to remember history, only react to the present state.

Go deeper: the ladder from Markov chains to MDPs

MDPs sit at the top of a small hierarchy, each level adding one ingredient:

  • Markov chain (Markov process) — just S,P\langle \mathcal{S}, P \rangle. States that transition randomly. No actions, no rewards. Andrey Markov’s original 1906 object; he famously demonstrated it on the sequence of vowels and consonants in Pushkin’s Eugene Onegin.
  • Markov reward process (MRP) — add a reward function and discount: S,P,R,γ\langle \mathcal{S}, P, R, \gamma \rangle. Now states have value, but the agent still can’t choose.
  • Markov decision process (MDP) — add actions: S,A,P,R,γ\langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle. The agent now influences transitions, so we can ask for the optimal policy.
  • Partially observable MDP (POMDP) — the agent no longer sees the true state, only noisy observations, and must act on a belief over states. More on this below.

Value functions and the Bellman equations

To solve an MDP we need to know how good states and actions are. That is the job of the value functions. The state-value function Vπ(s)V^\pi(s) is the expected return from state ss when following policy π\pi; the action-value function Qπ(s,a)Q^\pi(s,a) is the expected return from taking aa in ss and following π\pi thereafter.

The structural insight that makes everything computable is recursion: the value of a state equals the immediate reward plus the discounted value of wherever you land next. This is the Bellman expectation equation:

Vπ(s)=aπ(as)sP(ss,a)[R(s,a)+γVπ(s)]V^\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \big[\, R(s, a) + \gamma\, V^\pi(s') \,\big]

The whole point of RL is the optimal value function VV^*, which obeys the Bellman optimality equation — replace the average-over-actions with a max:

V(s)=maxasP(ss,a)[R(s,a)+γV(s)]V^*(s) = \max_{a} \sum_{s'} P(s' \mid s, a) \big[\, R(s, a) + \gamma\, V^*(s') \,\big]

Once you have VV^* (or QQ^*), the optimal policy is just act greedily: in each state pick the action with the highest value. The Bellman equations are the bridge from “what is this problem worth?” to “what should I do?“

sV(s)action a₁action a₂π(a|s)s’s’s’P(s’|s,a)R + γV(s’)
A Bellman backup: the value of state s is computed by looking one step ahead — averaging over actions (the policy), then over the environment's possible next states, then folding in the discounted value of each.
Go deeper: why the discount factor γ exists

γ\gamma does three jobs at once. Mathematically, with γ<1\gamma < 1 the infinite sum of rewards is guaranteed to converge (geometric series), so values are finite and the Bellman operator is a contraction — which is what guarantees a unique fixed point and convergence of the algorithms below. Economically, it captures “a reward now beats a reward later,” like an interest rate. Practically, it sets the agent’s planning horizon: γ=0\gamma = 0 is purely myopic; γ1\gamma \to 1 makes the agent value the distant future almost as much as the present. Roughly, the effective horizon is 1/(1γ)1/(1-\gamma) steps — so γ=0.99\gamma = 0.99 means the agent “sees” about 100 steps ahead.

Solving an MDP: dynamic programming

When you know the model — the transition and reward functions — you can solve the MDP exactly with dynamic programming, using the Bellman optimality equation as an update rule. Two classic algorithms dominate.

1
Value iteration

Start with arbitrary values and repeatedly apply the Bellman optimality backup to every state:

Vk+1(s)maxasP(ss,a)[R(s,a)+γVk(s)]V_{k+1}(s) \leftarrow \max_a \sum_{s'} P(s'\mid s,a)\big[R(s,a) + \gamma V_k(s')\big]

Because the backup is a contraction, VkV_k converges to VV^*. Then read off the greedy policy. Simple, and works well even on large state spaces.

2
Policy iteration

Alternate two phases until the policy stops changing:

  1. Policy evaluation — compute VπV^\pi for the current policy (solve the linear Bellman expectation equations).
  2. Policy improvement — make the policy greedy with respect to that VπV^\pi.

Each round produces a strictly better policy, and it converges in surprisingly few iterations — often a handful. Introduced by Ronald Howard in 1960.

Value iteration

One cheap, partial backup per state per sweep; many sweeps. Folds evaluation and improvement together. Easy to implement; great when the state space is large.

Policy iteration

Full (expensive) policy evaluation, then a greedy improvement; few outer iterations. Converges in fewer rounds but each round costs more. Both reach the same VV^*.

From known models to RL: when you can’t sweep the states

If you can’t enumerate states or don’t know PP and RR, you move from planning to learning — the agent samples experience and estimates values from it. This is the leap from dynamic programming to reinforcement learning.

SettingModel known?Enumerate states?Approach
Small, known MDPYesYesDynamic programming (value/policy iteration)
Unknown dynamicsNoYesModel-free RL — Q-learning, SARSA
Huge / continuous statesEitherNoFunction approximation — Deep Q-Networks, policy gradients
Learn the model, then planLearnedEitherModel-based RL, MuZero

Every one of these is still solving an MDP — they just trade the exact Bellman backup for sampled estimates, and a lookup table for a neural network. The need to learn the model (rather than be handed it) is also what creates the exploration vs. exploitation dilemma: you can only estimate PP and RR for actions you actually try.

When the state is hidden: POMDPs

A core MDP assumption is that the agent sees the true state. Often it doesn’t — a robot has noisy sensors, a poker player can’t see opponents’ cards. A Partially Observable MDP (POMDP) adds an observation set Ω\Omega and an observation function O(os,a)O(o \mid s', a); the agent never sees ss directly, only an observation oo.

The standard trick is to maintain a belief state — a probability distribution over which true state you might be in — and treat that as the (now continuous) state of a regular MDP. This recovers the Markov property at the cost of a much harder problem: exact POMDP solving is intractable in general, which is why deep RL often sidesteps it with recurrent networks or frame-stacking to summarize history.

A short history

1906
Markov chains
Andrey Markov introduces the chains that bear his name, studying letter sequences in Pushkin’s Eugene Onegin — the statistical seed of the Markov property.
1957
Bellman & dynamic programming
Richard Bellman formalizes dynamic programming, the Bellman equation, and the term ‘Markov decision process’ — and coins ‘the curse of dimensionality.‘
1960
Policy iteration
Ronald Howard’s book Dynamic Programming and Markov Processes introduces policy iteration, still a workhorse algorithm.
1989
Q-learning
Chris Watkins shows you can solve an MDP without knowing its model, learning Q-values from sampled experience — the birth of practical model-free RL.
1998
Sutton & Barto
The textbook Reinforcement Learning: An Introduction makes the MDP the universal language of the field.
2013–25
Deep RL at scale
DQN, AlphaGo/MuZero and LLM post-training all rest on the MDP — scaled with neural networks to states no table could hold.

Where MDPs show up

The framework is deliberately general, which is why it appears everywhere:

  • Games — board games, video games, and self-play systems like AlphaZero are MDPs (or two-player extensions).
  • Robotics & control — states are sensor/joint readings, actions are torques; classic continuous-control MDPs.
  • Operations — inventory, queueing, and resource allocation were early MDP applications.
  • Recommenders & ads — sequential choices where today’s action shapes tomorrow’s state.
  • LLM post-training — even RLHF frames text generation as an MDP: the state is the prompt-so-far, actions are next tokens, and the reward comes from a reward model. The MDP abstraction is what lets the same RL algorithms apply to a chatbot and a chess engine.

Building the simulators, datasets and RL environments that instantiate these MDPs at production scale is now its own industry — see the companies building RL environments.

Frequently asked questions

What is the difference between a Markov chain and an MDP?

A Markov chain is just states transitioning randomly — no decisions, no rewards. An MDP adds actions (the agent influences which transitions happen) and a reward (a goal to optimize). Add only a reward to a chain and you get a Markov reward process; add actions too and you get a full MDP.

What exactly is the Markov property?

It says the next state depends only on the current state and action, not on the full history of how you got there. The present state is a sufficient summary of the past. It is a property of your chosen state representation — a good RL state is designed to be Markov.

What does the discount factor γ do?

It trades off immediate versus future reward. γ\gamma near 0 makes the agent myopic; γ\gamma near 1 makes it far-sighted. It also keeps the infinite return finite and makes the Bellman operator a contraction, which is what guarantees the solution methods converge. The effective horizon is roughly 1/(1γ)1/(1-\gamma) steps.

How do MDPs relate to reinforcement learning?

The MDP is the problem; RL is the solution method for the case where the model is unknown or too large to solve exactly. Dynamic programming solves a fully known MDP; RL learns to act in an MDP from sampled experience. Every value function, policy and Bellman equation in RL is defined with respect to an underlying MDP.

Key references

  • Dynamic Programming — Richard Bellman, 1957 — the origin of the Bellman equation and the MDP formalism.
  • Dynamic Programming and Markov Processes — Ronald Howard, 1960 — introduces policy iteration.
  • Reinforcement Learning: An Introduction — Sutton & Barto, 2nd ed. — Chapters 3–4 are the definitive modern treatment of MDPs and dynamic programming.
  • Markov decision process — Wikipedia — a solid, well-cited overview.
  • RL Course, Lecture 2 — David Silver (DeepMind) — the canonical lecture on MDPs.

What is reinforcement learning? · Value functions · Q-learning · Exploration vs. exploitation · Model-based RL · Policy gradients