Partially Observable MDPs (POMDPs)

Key takeaways

A POMDP is an MDP where the agent can't see the true state — it only gets noisy, partial observations of it.
The fix is a belief state: a probability distribution over states, updated with Bayes' rule after every action and observation.
Beliefs turn a POMDP into a continuous-state 'belief MDP' whose optimal value function is piecewise-linear and convex — but exact solving is intractable, so we use point-based and Monte-Carlo solvers.
Deep RL sidesteps explicit belief tracking with memory: frame stacking or, better, a recurrent/Transformer network that summarizes history into a hidden state.

What is a POMDP?

A Partially Observable Markov Decision Process (POMDP) is the model you reach for when an agent has to act in a world it cannot fully see. In an ordinary Markov Decision Process, the agent knows exactly what state it is in; in a POMDP it only receives an observation — a noisy, incomplete clue about the hidden true state. A robot’s camera shows one room, not the whole building; a poker player sees their cards, not their opponent’s; a medical policy reads test results, not the underlying disease.

Because the agent can’t observe the state, it can’t just memorize a state-to-action policy. Instead it must reason about what the state probably is given everything it has seen and done so far. That summary — a probability distribution over possible states — is called the belief state, and it is the central idea that makes POMDPs both powerful and hard.

The formal definition: a 7-tuple

A POMDP extends the MDP 5-tuple with two new ingredients — an observation space and an observation model. Formally it is a 7-tuple $(S, A, T, R, \Omega, O, \gamma)$ :

Symbol	Name	Meaning
$S$	States	The hidden true states of the world (never directly seen)
$A$	Actions	What the agent can do
$T(s' \mid s, a)$	Transition model	Probability of moving to $s'$ from $s$ under action $a$
$R(s, a)$	Reward	Reward for taking $a$ in (hidden) state $s$
$\Omega$	Observations	The clues the agent can perceive
$O(o \mid s', a)$	Observation model	Probability of seeing $o$ after landing in $s'$ via $a$
$\gamma$	Discount	How much future reward is worth now, $\gamma \in [0,1)$

The only structural difference from an MDP is the pair $(\Omega, O)$ . Set the observation equal to the state — $\Omega = S$ and $O$ an identity — and a POMDP collapses back to an ordinary MDP. Everything hard about POMDPs flows from the gap between $S$ and $\Omega$ .

The POMDP loop. The true state evolves under the transition model but is hidden (dashed). The agent never sees the state directly — only an observation drawn from the observation model — and must maintain a belief to decide what to do.

Why partial observability breaks the MDP trick

In an MDP the Markov property holds for the observable state: the current state is a complete summary of the past, so the optimal policy is a simple map from state to action. In a POMDP that map does not exist over observations. The same observation can mean two completely different things depending on history — a closed door looks identical whether or not there’s a tiger behind it. A policy that reacts only to the latest observation is provably suboptimal, and naive Q-learning over observations can fail to converge at all.

There are two clean ways to restore the Markov property:

Track a belief (model-based)

Maintain $b$ , a probability distribution over all states, and update it with Bayes’ rule. The belief is a Markov state — it summarizes the entire history. This is exact but requires a model and is computationally heavy.

Remember history (model-free)

Feed the agent a window of recent observations (frame stacking) or a recurrent/Transformer network whose hidden state compresses the whole trajectory. No explicit model needed — this is how deep RL handles POMDPs in practice.

The belief state and the Bayesian update

The belief $b_t$ is a vector that, for every state $s$ , holds the probability the agent assigns to being in $s$ right now. It starts as a prior $b_0$ and, after each action $a$ and resulting observation $o$ , is refined by Bayes’ rule — the same machinery as a hidden-Markov-model filter. For discrete states:

b'(s') = \frac{O(o \mid s', a)\;\sum_{s \in S} T(s' \mid s, a)\, b(s)}{\Pr(o \mid b, a)}

Read it left to right in two moves:

Predict (push the belief through dynamics)

Before seeing anything new, propagate the old belief through the transition model: $\tilde b(s') = \sum_{s} T(s' \mid s, a)\, b(s)$ . This is what you’d believe based on the action alone — uncertainty usually grows here.

Correct (fold in the observation)

Reweight each predicted state by how likely it was to produce the observation you actually got, $O(o \mid s', a)$ , then renormalize by $\Pr(o \mid b, a) = \sum_{s'} O(o \mid s', a)\,\tilde b(s')$ . A sharp, informative observation shrinks uncertainty.

The denominator is just a normalizing constant ensuring the new belief sums to one. Crucially, $b'$ depends only on the previous belief $b$ , the action, and the observation — never on the full history. That is the magic: the belief is a sufficient statistic of everything the agent has ever experienced.

Go deeper: the belief MDP and why the value function is piecewise-linear

Because the belief is Markov, a POMDP over $|S|$ states is exactly an ordinary MDP over the continuous belief simplex — the “belief MDP.” Its reward is $\rho(b,a) = \sum_s b(s)R(s,a)$ and its transitions are governed by the belief-update rule above. The Bellman equation becomes

V^*(b) = \max_a \Big[\, \rho(b,a) + \gamma \sum_{o} \Pr(o \mid b, a)\, V^*(b^{a,o}) \,\Big].

Sondik’s key 1971 result: for a finite horizon the optimal value function over the belief simplex is piecewise-linear and convex (PWLC). It can be written as the upper envelope of a finite set of hyperplanes — the famous alpha-vectors:

V(b) = \max_{\alpha \in \Gamma} \; \alpha \cdot b = \max_{\alpha \in \Gamma}\sum_s \alpha(s)\, b(s).

Each alpha-vector corresponds to a conditional plan; convexity reflects that more certain beliefs (corners of the simplex) are worth more. This representation is what every exact and point-based solver manipulates.

A worked example: the Tiger problem

The canonical POMDP, from Kaelbling, Littman & Cassandra, is the Tiger problem. An agent faces two doors. Behind one is a tiger ( $-100$ ), behind the other treasure ( $+10$ ). The hidden state is which door hides the tiger. The agent can open-left, open-right, or listen (cost $-1$ ). Listening returns a noisy observation — “hear tiger left/right” — that is correct only ~85% of the time.

With a 50/50 prior, opening immediately has expected value $\tfrac{1}{2}(10) + \tfrac{1}{2}(-100) = -45$ — terrible. So the rational policy is to listen first, watch the belief sharpen toward one door, and only open once the belief crosses a confidence threshold. Listening twice in agreement pushes the belief well past 90%, at which point opening becomes worthwhile. One noisy sensor, a tiny state space — yet the optimal behavior is a non-trivial information-gathering then commit strategy that no observation-reactive policy can express. That is the whole POMDP story in miniature.

7-tuple

POMDP spec — MDP plus an observation space and model

|S|−1

Dimensions of the continuous belief simplex

PSPACE

Hardness of finite-horizon POMDP planning (Papadimitriou & Tsitsiklis)

▶ Partially Observable MDPs (POMDPs) — Pieter Abbeel, UC Berkeley CS287 (the full lecture)

Why exact solving is hard

Two curses make POMDPs notoriously difficult:

Curse of dimensionality. The belief lives in a continuous $(|S|{-}1)$ -dimensional simplex. Value iteration must reason over this whole space, not a finite set of states.
Curse of history. The number of distinct action–observation histories — and the number of alpha-vectors needed to represent the value function — grows exponentially with the planning horizon. Exact value iteration is feasible only for toy problems.

Finite-horizon POMDP planning is PSPACE-complete, and infinite-horizon optimal control is undecidable in general. This is why the practical field is almost entirely about approximation.

How POMDPs are solved in practice

Exact value iteration (small problems only)

Algorithms like Sondik’s, Witness, and incremental pruning compute the exact PWLC value function by enumerating and pruning alpha-vectors. Correct but exponential — usable only for a handful of states and a short horizon.

Point-based value iteration (the workhorse)

PBVI (Pineau et al., 2003) and successors avoid representing the whole belief space: they sample a finite set of reachable belief points and back up alpha-vectors only there. Perseus adds randomized backups; SARSOP (Kurniawati et al., 2008) focuses sampling on the optimally reachable belief region using value bounds, and remains a standard offline solver.

Online Monte-Carlo planning

Plan from the current belief only, just in time. POMCP (Silver & Veness, 2010) runs Monte-Carlo tree search over action–observation histories using only a black-box simulator, scaling to POMDPs with $10^{56}$ states. DESPOT (Ye et al., 2017) searches a sparse, determinized belief tree with regularization for better worst-case guarantees.

Deep RL with memory

When you have no model — just experience — learn a policy or value function whose input is a summary of history. Frame stacking, recurrent networks (LSTM/GRU), and Transformers let the network build an implicit belief. This is how modern deep RL tackles real partial observability.

Exact POMDP value iteration represents the value function as the upper envelope of alpha-vectors (hyperplanes) over the belief simplex — here the 1-D belief between two states. Each alpha-vector is a conditional plan; the envelope is piecewise-linear and convex.

Deep RL under partial observability

Most large-scale RL problems are POMDPs in disguise — a single video frame, sensor reading, or web page rarely reveals the full state. Rather than track an explicit belief, deep agents learn to remember.

Approach	How it summarizes history	Notes
Frame stacking	Concatenate the last $k$ observations (Atari used 4 frames)	Cheap; captures short-range dynamics like velocity, but a fixed window forgets the distant past
DRQN	An LSTM replaces the first dense layer of a DQN	Integrates information over arbitrary horizons; robust to flickering/occluded observations
Recurrent policy gradients	Actor-critic / PPO with a recurrent core	Standard for partially observable control and many robotics tasks
Transformers / sequence models	Attention over the full observation–action history	Strong long-horizon memory; basis of in-context and decision-transformer agents
World models	Learn a latent dynamics model; the latent state acts as a learned belief	DreamerV3 and friends plan in a compact latent that implicitly does belief tracking

The throughline: a recurrent or attentional hidden state is a learned, implicit belief. The agent never writes down $b(s)$ , but it learns to encode exactly the history features needed to act — which is what the belief was for in the first place. POMDPs are also why memory architecture is a first-class design choice in agentic RL.

Go deeper: belief tracking vs. learned memory — when to use which

If you have an accurate model $(T, O)$ and a modest state space, an explicit belief filter plus a point-based or online solver (SARSOP, POMCP, DESPOT) is hard to beat: it is principled, interpretable, and can do active information-gathering by construction. If you lack a model, have high-dimensional observations (pixels, language), or need to scale, model-free deep RL with recurrent/Transformer memory is the practical default. The frontier blends both: model-based RL and world models learn a latent state that is an approximate belief, recovering planning benefits without a hand-specified observation model. See also model-based RL and POMDP-adjacent exploration, since reducing belief uncertainty is itself a form of exploration.

A short history

1965

Åström: optimal control under partial observation

Karl Åström formalizes control of Markov processes with incomplete state information, introducing the sufficient-statistic belief.

1971

Sondik: PWLC value function

Edward Sondik’s thesis proves the optimal finite-horizon value function is piecewise-linear and convex — the alpha-vector foundation of every exact solver.

1998

Kaelbling, Littman & Cassandra

The classic AI synthesis “Planning and acting in partially observable stochastic domains” brings POMDPs to mainstream AI with the Tiger problem.

2003

PBVI: point-based value iteration

Pineau, Gordon & Thrun make large POMDPs tractable by backing up alpha-vectors at sampled belief points.

2008–2010

SARSOP and POMCP

SARSOP focuses on optimally reachable beliefs; POMCP scales online planning to enormous POMDPs via MCTS over histories.

2015

DRQN: deep RL meets POMDPs

Hausknecht & Stone add an LSTM to DQN, learning implicit beliefs from raw, partial observations.

2023–25

World models as learned beliefs

DreamerV3 and successors plan in a learned latent state, blurring the line between belief tracking and model-free memory.

Where POMDPs show up

Robotics & autonomy. Localization, manipulation under occlusion, and navigation with noisy sensors are textbook POMDPs — see RL in robotics.
Dialogue & assistants. The user’s intent is hidden; the system maintains a belief over goals and acts to clarify or fulfill them.
Healthcare. Treatment policies act on tests that only partially reveal a patient’s true condition.
Games of imperfect information. Poker, Battleship, and partially observed video games hide part of the state from each player.
Operations & maintenance. Inspection-and-repair policies (railways, infrastructure) decide when to inspect vs. act under uncertain asset condition.

Building the simulators, reward pipelines, and partially observable environments these agents train in is its own industry — see the list of RL environment companies.

Frequently asked questions

What’s the difference between an MDP and a POMDP?

In an MDP the agent observes the true state directly, so the optimal policy maps state to action. In a POMDP the agent sees only noisy, partial observations and must maintain a belief — a distribution over states — to act well. An MDP is the special case of a POMDP where observation equals state.

Is a belief state a Markov state?

Yes — that’s the whole point. The belief is a sufficient statistic of the entire action–observation history, so the belief update depends only on the previous belief, the action, and the new observation. This turns a POMDP into an MDP over the (continuous) belief space.

Why is solving POMDPs so much harder than MDPs?

Two curses. The belief lives in a continuous high-dimensional simplex (curse of dimensionality), and the number of distinct histories — and alpha-vectors needed to represent the value function — grows exponentially with the horizon (curse of history). Finite-horizon POMDP planning is PSPACE-complete, so practitioners rely on approximation.

How does deep RL handle partial observability without computing a belief?

It learns memory instead. Frame stacking, recurrent networks (LSTM/GRU as in DRQN), Transformers, and world models compress the observation history into a hidden state that acts as an implicit, learned belief — no explicit observation model required.

Key papers

Planning and Acting in Partially Observable Stochastic Domains — Kaelbling, Littman & Cassandra, 1998 — the canonical AI treatment.
Point-Based Value Iteration (PBVI) — Pineau, Gordon & Thrun, 2003 — made large POMDPs tractable.
SARSOP — Kurniawati, Hsu & Lee, 2008 — point-based planning over optimally reachable beliefs.
Monte-Carlo Planning in Large POMDPs (POMCP) — Silver & Veness, 2010 — online MCTS for huge POMDPs.
Deep Recurrent Q-Learning for POMDPs (DRQN) — Hausknecht & Stone, 2015 — recurrent memory for deep RL.

Markov decision processes · What is reinforcement learning? · Value functions · Deep Q-networks · Model-based RL · World models · RL in robotics · Agentic RL