- A POMDP is an MDP where the agent can't see the true state — it only gets noisy, partial observations of it.
- The fix is a belief state: a probability distribution over states, updated with Bayes' rule after every action and observation.
- Beliefs turn a POMDP into a continuous-state 'belief MDP' whose optimal value function is piecewise-linear and convex — but exact solving is intractable, so we use point-based and Monte-Carlo solvers.
- Deep RL sidesteps explicit belief tracking with memory: frame stacking or, better, a recurrent/Transformer network that summarizes history into a hidden state.
What is a POMDP?
A Partially Observable Markov Decision Process (POMDP) is the model you reach for when an agent has to act in a world it cannot fully see. In an ordinary Markov Decision Process, the agent knows exactly what state it is in; in a POMDP it only receives an observation — a noisy, incomplete clue about the hidden true state. A robot’s camera shows one room, not the whole building; a poker player sees their cards, not their opponent’s; a medical policy reads test results, not the underlying disease.
Because the agent can’t observe the state, it can’t just memorize a state-to-action policy. Instead it must reason about what the state probably is given everything it has seen and done so far. That summary — a probability distribution over possible states — is called the belief state, and it is the central idea that makes POMDPs both powerful and hard.
The formal definition: a 7-tuple
A POMDP extends the MDP 5-tuple with two new ingredients — an observation space and an observation model. Formally it is a 7-tuple :
| Symbol | Name | Meaning |
|---|---|---|
| States | The hidden true states of the world (never directly seen) | |
| Actions | What the agent can do | |
| Transition model | Probability of moving to from under action | |
| Reward | Reward for taking in (hidden) state | |
| Observations | The clues the agent can perceive | |
| Observation model | Probability of seeing after landing in via | |
| Discount | How much future reward is worth now, |
The only structural difference from an MDP is the pair . Set the observation equal to the state — and an identity — and a POMDP collapses back to an ordinary MDP. Everything hard about POMDPs flows from the gap between and .
Why partial observability breaks the MDP trick
In an MDP the Markov property holds for the observable state: the current state is a complete summary of the past, so the optimal policy is a simple map from state to action. In a POMDP that map does not exist over observations. The same observation can mean two completely different things depending on history — a closed door looks identical whether or not there’s a tiger behind it. A policy that reacts only to the latest observation is provably suboptimal, and naive Q-learning over observations can fail to converge at all.
There are two clean ways to restore the Markov property:
Maintain , a probability distribution over all states, and update it with Bayes’ rule. The belief is a Markov state — it summarizes the entire history. This is exact but requires a model and is computationally heavy.
Feed the agent a window of recent observations (frame stacking) or a recurrent/Transformer network whose hidden state compresses the whole trajectory. No explicit model needed — this is how deep RL handles POMDPs in practice.
The belief state and the Bayesian update
The belief is a vector that, for every state , holds the probability the agent assigns to being in right now. It starts as a prior and, after each action and resulting observation , is refined by Bayes’ rule — the same machinery as a hidden-Markov-model filter. For discrete states:
Read it left to right in two moves:
Before seeing anything new, propagate the old belief through the transition model: . This is what you’d believe based on the action alone — uncertainty usually grows here.
Reweight each predicted state by how likely it was to produce the observation you actually got, , then renormalize by . A sharp, informative observation shrinks uncertainty.
The denominator is just a normalizing constant ensuring the new belief sums to one. Crucially, depends only on the previous belief , the action, and the observation — never on the full history. That is the magic: the belief is a sufficient statistic of everything the agent has ever experienced.
Go deeper: the belief MDP and why the value function is piecewise-linear
Because the belief is Markov, a POMDP over states is exactly an ordinary MDP over the continuous belief simplex — the “belief MDP.” Its reward is and its transitions are governed by the belief-update rule above. The Bellman equation becomes
Sondik’s key 1971 result: for a finite horizon the optimal value function over the belief simplex is piecewise-linear and convex (PWLC). It can be written as the upper envelope of a finite set of hyperplanes — the famous alpha-vectors:
Each alpha-vector corresponds to a conditional plan; convexity reflects that more certain beliefs (corners of the simplex) are worth more. This representation is what every exact and point-based solver manipulates.
A worked example: the Tiger problem
The canonical POMDP, from Kaelbling, Littman & Cassandra, is the Tiger problem. An agent faces two doors. Behind one is a tiger (), behind the other treasure (). The hidden state is which door hides the tiger. The agent can open-left, open-right, or listen (cost ). Listening returns a noisy observation — “hear tiger left/right” — that is correct only ~85% of the time.
With a 50/50 prior, opening immediately has expected value — terrible. So the rational policy is to listen first, watch the belief sharpen toward one door, and only open once the belief crosses a confidence threshold. Listening twice in agreement pushes the belief well past 90%, at which point opening becomes worthwhile. One noisy sensor, a tiny state space — yet the optimal behavior is a non-trivial information-gathering then commit strategy that no observation-reactive policy can express. That is the whole POMDP story in miniature.
Why exact solving is hard
Two curses make POMDPs notoriously difficult:
- Curse of dimensionality. The belief lives in a continuous -dimensional simplex. Value iteration must reason over this whole space, not a finite set of states.
- Curse of history. The number of distinct action–observation histories — and the number of alpha-vectors needed to represent the value function — grows exponentially with the planning horizon. Exact value iteration is feasible only for toy problems.
Finite-horizon POMDP planning is PSPACE-complete, and infinite-horizon optimal control is undecidable in general. This is why the practical field is almost entirely about approximation.
How POMDPs are solved in practice
Algorithms like Sondik’s, Witness, and incremental pruning compute the exact PWLC value function by enumerating and pruning alpha-vectors. Correct but exponential — usable only for a handful of states and a short horizon.
PBVI (Pineau et al., 2003) and successors avoid representing the whole belief space: they sample a finite set of reachable belief points and back up alpha-vectors only there. Perseus adds randomized backups; SARSOP (Kurniawati et al., 2008) focuses sampling on the optimally reachable belief region using value bounds, and remains a standard offline solver.
Plan from the current belief only, just in time. POMCP (Silver & Veness, 2010) runs Monte-Carlo tree search over action–observation histories using only a black-box simulator, scaling to POMDPs with states. DESPOT (Ye et al., 2017) searches a sparse, determinized belief tree with regularization for better worst-case guarantees.
When you have no model — just experience — learn a policy or value function whose input is a summary of history. Frame stacking, recurrent networks (LSTM/GRU), and Transformers let the network build an implicit belief. This is how modern deep RL tackles real partial observability.
Deep RL under partial observability
Most large-scale RL problems are POMDPs in disguise — a single video frame, sensor reading, or web page rarely reveals the full state. Rather than track an explicit belief, deep agents learn to remember.
| Approach | How it summarizes history | Notes |
|---|---|---|
| Frame stacking | Concatenate the last observations (Atari used 4 frames) | Cheap; captures short-range dynamics like velocity, but a fixed window forgets the distant past |
| DRQN | An LSTM replaces the first dense layer of a DQN | Integrates information over arbitrary horizons; robust to flickering/occluded observations |
| Recurrent policy gradients | Actor-critic / PPO with a recurrent core | Standard for partially observable control and many robotics tasks |
| Transformers / sequence models | Attention over the full observation–action history | Strong long-horizon memory; basis of in-context and decision-transformer agents |
| World models | Learn a latent dynamics model; the latent state acts as a learned belief | DreamerV3 and friends plan in a compact latent that implicitly does belief tracking |
The throughline: a recurrent or attentional hidden state is a learned, implicit belief. The agent never writes down , but it learns to encode exactly the history features needed to act — which is what the belief was for in the first place. POMDPs are also why memory architecture is a first-class design choice in agentic RL.
Go deeper: belief tracking vs. learned memory — when to use which
If you have an accurate model and a modest state space, an explicit belief filter plus a point-based or online solver (SARSOP, POMCP, DESPOT) is hard to beat: it is principled, interpretable, and can do active information-gathering by construction. If you lack a model, have high-dimensional observations (pixels, language), or need to scale, model-free deep RL with recurrent/Transformer memory is the practical default. The frontier blends both: model-based RL and world models learn a latent state that is an approximate belief, recovering planning benefits without a hand-specified observation model. See also model-based RL and POMDP-adjacent exploration, since reducing belief uncertainty is itself a form of exploration.
A short history
Where POMDPs show up
- Robotics & autonomy. Localization, manipulation under occlusion, and navigation with noisy sensors are textbook POMDPs — see RL in robotics.
- Dialogue & assistants. The user’s intent is hidden; the system maintains a belief over goals and acts to clarify or fulfill them.
- Healthcare. Treatment policies act on tests that only partially reveal a patient’s true condition.
- Games of imperfect information. Poker, Battleship, and partially observed video games hide part of the state from each player.
- Operations & maintenance. Inspection-and-repair policies (railways, infrastructure) decide when to inspect vs. act under uncertain asset condition.
Building the simulators, reward pipelines, and partially observable environments these agents train in is its own industry — see the list of RL environment companies.
Frequently asked questions
What’s the difference between an MDP and a POMDP?
In an MDP the agent observes the true state directly, so the optimal policy maps state to action. In a POMDP the agent sees only noisy, partial observations and must maintain a belief — a distribution over states — to act well. An MDP is the special case of a POMDP where observation equals state.
Is a belief state a Markov state?
Yes — that’s the whole point. The belief is a sufficient statistic of the entire action–observation history, so the belief update depends only on the previous belief, the action, and the new observation. This turns a POMDP into an MDP over the (continuous) belief space.
Why is solving POMDPs so much harder than MDPs?
Two curses. The belief lives in a continuous high-dimensional simplex (curse of dimensionality), and the number of distinct histories — and alpha-vectors needed to represent the value function — grows exponentially with the horizon (curse of history). Finite-horizon POMDP planning is PSPACE-complete, so practitioners rely on approximation.
How does deep RL handle partial observability without computing a belief?
It learns memory instead. Frame stacking, recurrent networks (LSTM/GRU as in DRQN), Transformers, and world models compress the observation history into a hidden state that acts as an implicit, learned belief — no explicit observation model required.
Key papers
- Planning and Acting in Partially Observable Stochastic Domains — Kaelbling, Littman & Cassandra, 1998 — the canonical AI treatment.
- Point-Based Value Iteration (PBVI) — Pineau, Gordon & Thrun, 2003 — made large POMDPs tractable.
- SARSOP — Kurniawati, Hsu & Lee, 2008 — point-based planning over optimally reachable beliefs.
- Monte-Carlo Planning in Large POMDPs (POMCP) — Silver & Veness, 2010 — online MCTS for huge POMDPs.
- Deep Recurrent Q-Learning for POMDPs (DRQN) — Hausknecht & Stone, 2015 — recurrent memory for deep RL.
Related
Markov decision processes · What is reinforcement learning? · Value functions · Deep Q-networks · Model-based RL · World models · RL in robotics · Agentic RL