reinforcement-learning.com
// FOUNDATIONS

Partially Observable MDPs (POMDPs)

What a POMDP is, the 7-tuple and belief-state math, how Bayesian belief updates work, exact and point-based solvers, and how deep RL handles partial observability.

Updated 2026-06-08 15 min read
Key takeaways
  • A POMDP is an MDP where the agent can't see the true state — it only gets noisy, partial observations of it.
  • The fix is a belief state: a probability distribution over states, updated with Bayes' rule after every action and observation.
  • Beliefs turn a POMDP into a continuous-state 'belief MDP' whose optimal value function is piecewise-linear and convex — but exact solving is intractable, so we use point-based and Monte-Carlo solvers.
  • Deep RL sidesteps explicit belief tracking with memory: frame stacking or, better, a recurrent/Transformer network that summarizes history into a hidden state.

What is a POMDP?

A Partially Observable Markov Decision Process (POMDP) is the model you reach for when an agent has to act in a world it cannot fully see. In an ordinary Markov Decision Process, the agent knows exactly what state it is in; in a POMDP it only receives an observation — a noisy, incomplete clue about the hidden true state. A robot’s camera shows one room, not the whole building; a poker player sees their cards, not their opponent’s; a medical policy reads test results, not the underlying disease.

Because the agent can’t observe the state, it can’t just memorize a state-to-action policy. Instead it must reason about what the state probably is given everything it has seen and done so far. That summary — a probability distribution over possible states — is called the belief state, and it is the central idea that makes POMDPs both powerful and hard.

The formal definition: a 7-tuple

A POMDP extends the MDP 5-tuple with two new ingredients — an observation space and an observation model. Formally it is a 7-tuple (S,A,T,R,Ω,O,γ)(S, A, T, R, \Omega, O, \gamma):

SymbolNameMeaning
SSStatesThe hidden true states of the world (never directly seen)
AAActionsWhat the agent can do
T(ss,a)T(s' \mid s, a)Transition modelProbability of moving to ss' from ss under action aa
R(s,a)R(s, a)RewardReward for taking aa in (hidden) state ss
Ω\OmegaObservationsThe clues the agent can perceive
O(os,a)O(o \mid s', a)Observation modelProbability of seeing oo after landing in ss' via aa
γ\gammaDiscountHow much future reward is worth now, γ[0,1)\gamma \in [0,1)

The only structural difference from an MDP is the pair (Ω,O)(\Omega, O). Set the observation equal to the state — Ω=S\Omega = S and OO an identity — and a POMDP collapses back to an ordinary MDP. Everything hard about POMDPs flows from the gap between SS and Ω\Omega.

Hidden state sevolves via T(s’|s,a)Observation odrawn from O(o|s’,a)Belief b(s)Bayes updatePolicya = pi(b)action a affects the worldsense
The POMDP loop. The true state evolves under the transition model but is hidden (dashed). The agent never sees the state directly — only an observation drawn from the observation model — and must maintain a belief to decide what to do.

Why partial observability breaks the MDP trick

In an MDP the Markov property holds for the observable state: the current state is a complete summary of the past, so the optimal policy is a simple map from state to action. In a POMDP that map does not exist over observations. The same observation can mean two completely different things depending on history — a closed door looks identical whether or not there’s a tiger behind it. A policy that reacts only to the latest observation is provably suboptimal, and naive Q-learning over observations can fail to converge at all.

There are two clean ways to restore the Markov property:

Track a belief (model-based)

Maintain bb, a probability distribution over all states, and update it with Bayes’ rule. The belief is a Markov state — it summarizes the entire history. This is exact but requires a model and is computationally heavy.

Remember history (model-free)

Feed the agent a window of recent observations (frame stacking) or a recurrent/Transformer network whose hidden state compresses the whole trajectory. No explicit model needed — this is how deep RL handles POMDPs in practice.

The belief state and the Bayesian update

The belief btb_t is a vector that, for every state ss, holds the probability the agent assigns to being in ss right now. It starts as a prior b0b_0 and, after each action aa and resulting observation oo, is refined by Bayes’ rule — the same machinery as a hidden-Markov-model filter. For discrete states:

b(s)=O(os,a)  sST(ss,a)b(s)Pr(ob,a)b'(s') = \frac{O(o \mid s', a)\;\sum_{s \in S} T(s' \mid s, a)\, b(s)}{\Pr(o \mid b, a)}

Read it left to right in two moves:

1
Predict (push the belief through dynamics)

Before seeing anything new, propagate the old belief through the transition model: b~(s)=sT(ss,a)b(s)\tilde b(s') = \sum_{s} T(s' \mid s, a)\, b(s). This is what you’d believe based on the action alone — uncertainty usually grows here.

2
Correct (fold in the observation)

Reweight each predicted state by how likely it was to produce the observation you actually got, O(os,a)O(o \mid s', a), then renormalize by Pr(ob,a)=sO(os,a)b~(s)\Pr(o \mid b, a) = \sum_{s'} O(o \mid s', a)\,\tilde b(s'). A sharp, informative observation shrinks uncertainty.

The denominator is just a normalizing constant ensuring the new belief sums to one. Crucially, bb' depends only on the previous belief bb, the action, and the observation — never on the full history. That is the magic: the belief is a sufficient statistic of everything the agent has ever experienced.

Go deeper: the belief MDP and why the value function is piecewise-linear

Because the belief is Markov, a POMDP over S|S| states is exactly an ordinary MDP over the continuous belief simplex — the “belief MDP.” Its reward is ρ(b,a)=sb(s)R(s,a)\rho(b,a) = \sum_s b(s)R(s,a) and its transitions are governed by the belief-update rule above. The Bellman equation becomes

V(b)=maxa[ρ(b,a)+γoPr(ob,a)V(ba,o)].V^*(b) = \max_a \Big[\, \rho(b,a) + \gamma \sum_{o} \Pr(o \mid b, a)\, V^*(b^{a,o}) \,\Big].

Sondik’s key 1971 result: for a finite horizon the optimal value function over the belief simplex is piecewise-linear and convex (PWLC). It can be written as the upper envelope of a finite set of hyperplanes — the famous alpha-vectors:

V(b)=maxαΓ  αb=maxαΓsα(s)b(s).V(b) = \max_{\alpha \in \Gamma} \; \alpha \cdot b = \max_{\alpha \in \Gamma}\sum_s \alpha(s)\, b(s).

Each alpha-vector corresponds to a conditional plan; convexity reflects that more certain beliefs (corners of the simplex) are worth more. This representation is what every exact and point-based solver manipulates.

A worked example: the Tiger problem

The canonical POMDP, from Kaelbling, Littman & Cassandra, is the Tiger problem. An agent faces two doors. Behind one is a tiger (100-100), behind the other treasure (+10+10). The hidden state is which door hides the tiger. The agent can open-left, open-right, or listen (cost 1-1). Listening returns a noisy observation — “hear tiger left/right” — that is correct only ~85% of the time.

With a 50/50 prior, opening immediately has expected value 12(10)+12(100)=45\tfrac{1}{2}(10) + \tfrac{1}{2}(-100) = -45 — terrible. So the rational policy is to listen first, watch the belief sharpen toward one door, and only open once the belief crosses a confidence threshold. Listening twice in agreement pushes the belief well past 90%, at which point opening becomes worthwhile. One noisy sensor, a tiny state space — yet the optimal behavior is a non-trivial information-gathering then commit strategy that no observation-reactive policy can express. That is the whole POMDP story in miniature.

7-tuple
POMDP spec — MDP plus an observation space and model
|S|−1
Dimensions of the continuous belief simplex
PSPACE
Hardness of finite-horizon POMDP planning (Papadimitriou & Tsitsiklis)
▶ Partially Observable MDPs (POMDPs) — Pieter Abbeel, UC Berkeley CS287 (the full lecture)

Why exact solving is hard

Two curses make POMDPs notoriously difficult:

  • Curse of dimensionality. The belief lives in a continuous (S1)(|S|{-}1)-dimensional simplex. Value iteration must reason over this whole space, not a finite set of states.
  • Curse of history. The number of distinct action–observation histories — and the number of alpha-vectors needed to represent the value function — grows exponentially with the planning horizon. Exact value iteration is feasible only for toy problems.

Finite-horizon POMDP planning is PSPACE-complete, and infinite-horizon optimal control is undecidable in general. This is why the practical field is almost entirely about approximation.

How POMDPs are solved in practice

1
Exact value iteration (small problems only)

Algorithms like Sondik’s, Witness, and incremental pruning compute the exact PWLC value function by enumerating and pruning alpha-vectors. Correct but exponential — usable only for a handful of states and a short horizon.

2
Point-based value iteration (the workhorse)

PBVI (Pineau et al., 2003) and successors avoid representing the whole belief space: they sample a finite set of reachable belief points and back up alpha-vectors only there. Perseus adds randomized backups; SARSOP (Kurniawati et al., 2008) focuses sampling on the optimally reachable belief region using value bounds, and remains a standard offline solver.

3
Online Monte-Carlo planning

Plan from the current belief only, just in time. POMCP (Silver & Veness, 2010) runs Monte-Carlo tree search over action–observation histories using only a black-box simulator, scaling to POMDPs with 105610^{56} states. DESPOT (Ye et al., 2017) searches a sparse, determinized belief tree with regularization for better worst-case guarantees.

4
Deep RL with memory

When you have no model — just experience — learn a policy or value function whose input is a summary of history. Frame stacking, recurrent networks (LSTM/GRU), and Transformers let the network build an implicit belief. This is how modern deep RL tackles real partial observability.

belief b(s1) — 0 (state s0) to 1 (state s1) →value V(b)switch pointV(b) = max over alpha-vectorsalpha plans
Exact POMDP value iteration represents the value function as the upper envelope of alpha-vectors (hyperplanes) over the belief simplex — here the 1-D belief between two states. Each alpha-vector is a conditional plan; the envelope is piecewise-linear and convex.

Deep RL under partial observability

Most large-scale RL problems are POMDPs in disguise — a single video frame, sensor reading, or web page rarely reveals the full state. Rather than track an explicit belief, deep agents learn to remember.

ApproachHow it summarizes historyNotes
Frame stackingConcatenate the last kk observations (Atari used 4 frames)Cheap; captures short-range dynamics like velocity, but a fixed window forgets the distant past
DRQNAn LSTM replaces the first dense layer of a DQNIntegrates information over arbitrary horizons; robust to flickering/occluded observations
Recurrent policy gradientsActor-critic / PPO with a recurrent coreStandard for partially observable control and many robotics tasks
Transformers / sequence modelsAttention over the full observation–action historyStrong long-horizon memory; basis of in-context and decision-transformer agents
World modelsLearn a latent dynamics model; the latent state acts as a learned beliefDreamerV3 and friends plan in a compact latent that implicitly does belief tracking

The throughline: a recurrent or attentional hidden state is a learned, implicit belief. The agent never writes down b(s)b(s), but it learns to encode exactly the history features needed to act — which is what the belief was for in the first place. POMDPs are also why memory architecture is a first-class design choice in agentic RL.

Go deeper: belief tracking vs. learned memory — when to use which

If you have an accurate model (T,O)(T, O) and a modest state space, an explicit belief filter plus a point-based or online solver (SARSOP, POMCP, DESPOT) is hard to beat: it is principled, interpretable, and can do active information-gathering by construction. If you lack a model, have high-dimensional observations (pixels, language), or need to scale, model-free deep RL with recurrent/Transformer memory is the practical default. The frontier blends both: model-based RL and world models learn a latent state that is an approximate belief, recovering planning benefits without a hand-specified observation model. See also model-based RL and POMDP-adjacent exploration, since reducing belief uncertainty is itself a form of exploration.

A short history

1965
Åström: optimal control under partial observation
Karl Åström formalizes control of Markov processes with incomplete state information, introducing the sufficient-statistic belief.
1971
Sondik: PWLC value function
Edward Sondik’s thesis proves the optimal finite-horizon value function is piecewise-linear and convex — the alpha-vector foundation of every exact solver.
1998
Kaelbling, Littman & Cassandra
The classic AI synthesis “Planning and acting in partially observable stochastic domains” brings POMDPs to mainstream AI with the Tiger problem.
2003
PBVI: point-based value iteration
Pineau, Gordon & Thrun make large POMDPs tractable by backing up alpha-vectors at sampled belief points.
2008–2010
SARSOP and POMCP
SARSOP focuses on optimally reachable beliefs; POMCP scales online planning to enormous POMDPs via MCTS over histories.
2015
DRQN: deep RL meets POMDPs
Hausknecht & Stone add an LSTM to DQN, learning implicit beliefs from raw, partial observations.
2023–25
World models as learned beliefs
DreamerV3 and successors plan in a learned latent state, blurring the line between belief tracking and model-free memory.

Where POMDPs show up

  • Robotics & autonomy. Localization, manipulation under occlusion, and navigation with noisy sensors are textbook POMDPs — see RL in robotics.
  • Dialogue & assistants. The user’s intent is hidden; the system maintains a belief over goals and acts to clarify or fulfill them.
  • Healthcare. Treatment policies act on tests that only partially reveal a patient’s true condition.
  • Games of imperfect information. Poker, Battleship, and partially observed video games hide part of the state from each player.
  • Operations & maintenance. Inspection-and-repair policies (railways, infrastructure) decide when to inspect vs. act under uncertain asset condition.

Building the simulators, reward pipelines, and partially observable environments these agents train in is its own industry — see the list of RL environment companies.

Frequently asked questions

What’s the difference between an MDP and a POMDP?

In an MDP the agent observes the true state directly, so the optimal policy maps state to action. In a POMDP the agent sees only noisy, partial observations and must maintain a belief — a distribution over states — to act well. An MDP is the special case of a POMDP where observation equals state.

Is a belief state a Markov state?

Yes — that’s the whole point. The belief is a sufficient statistic of the entire action–observation history, so the belief update depends only on the previous belief, the action, and the new observation. This turns a POMDP into an MDP over the (continuous) belief space.

Why is solving POMDPs so much harder than MDPs?

Two curses. The belief lives in a continuous high-dimensional simplex (curse of dimensionality), and the number of distinct histories — and alpha-vectors needed to represent the value function — grows exponentially with the horizon (curse of history). Finite-horizon POMDP planning is PSPACE-complete, so practitioners rely on approximation.

How does deep RL handle partial observability without computing a belief?

It learns memory instead. Frame stacking, recurrent networks (LSTM/GRU as in DRQN), Transformers, and world models compress the observation history into a hidden state that acts as an implicit, learned belief — no explicit observation model required.

Key papers

Markov decision processes · What is reinforcement learning? · Value functions · Deep Q-networks · Model-based RL · World models · RL in robotics · Agentic RL