reinforcement-learning.com

Model-Based Reinforcement Learning

What model-based RL is, how learning a world model plus planning works, the math, Dyna/MuZero/Dreamer, sample efficiency vs model-free, and the 2026 landscape.

Updated 2026-06-07 16 min read
Key takeaways
  • Model-based RL learns (or is given) a model of the environment's dynamics, then uses it to plan or to generate imagined experience — instead of learning purely from real trial and error.
  • The big payoff is sample efficiency: an agent that can simulate outcomes needs far fewer real interactions, which matters when real data is slow, costly or dangerous (robots, control).
  • The big risk is model bias: a policy optimized against an imperfect model exploits the model's errors. Short rollouts, uncertainty estimates and latent world models are the main defenses.
  • The lineage runs from Dyna (1990) through PILCO and World Models to MuZero (planning with a learned model) and Dreamer (learning entirely in imagination) — the engine behind much of modern planning-based RL.

What is model-based RL?

Model-based reinforcement learning (MBRL) is the family of RL methods in which the agent uses a model of the environment — a function that predicts what happens next — to decide how to act. Give it a state and an action and the model returns a prediction of the next state and the reward. With that predictor in hand, the agent can think ahead: roll out hypothetical futures, score them, and pick actions that lead somewhere good, all without touching the real world.

The contrast is with model-free RL (think Q-learning, DQN, PPO), which throws away any explicit notion of dynamics and learns a value function or policy directly from experienced transitions. Model-free methods are simpler and famously robust; model-based methods are far more sample-efficient because each real interaction also improves a model that can then be queried thousands of times for free.

[Figure: two pipelines. Model-free: Experience → (learn) → Policy / Value. Model-based: Experience → (learn) → Dynamics model p(s′, r | s, a) → (query) → Plan / imagine → act; imagined rollouts feed back as extra training data.]
Model-free RL maps experience straight to a policy or value. Model-based RL inserts a learned dynamics model in the middle, which it can query to plan or to generate imagined experience.
▶ CS 285 Lecture 12: Model-Based RL with Policies — Sergey Levine (UC Berkeley)

Why model-based RL exists

Model-free RL works, but it is hungry. DQN famously needed tens of millions of frames to master a single Atari game; modern policy-gradient methods need millions of environment steps. That is fine in a fast simulator and ruinous on real hardware. Model-based RL exists to break that dependence on raw interaction count.

The intuition is the one humans use constantly: you do not learn to parallel-park by crashing ten thousand times. You build an internal model of how the car responds and plan a few candidate maneuvers in your head before committing. A model lets an RL agent do the same — and because the model is reusable, the agent can keep planning against it long after the real interaction is over.

  • ~10–100×: typical sample-efficiency gain over model-free on control tasks
  • 300k vs 3M: MBPO steps to match SAC on Ant, a 10× reduction
  • 150+ tasks: mastered by a single DreamerV3 config with fixed hyperparameters

The two ingredients: a model and a planner

Every model-based system is built from two parts, and the design space is mostly about how they fit together.

1. Learn (or obtain) a dynamics model

The model approximates the environment’s transition and reward functions. Formally, for a Markov decision process it learns

$$\hat{p}(s_{t+1}\mid s_t, a_t) \qquad\text{and}\qquad \hat{r}(s_t, a_t).$$

In games like Go or chess the model is known and exact (the rules). In the messy real world it must be learned from data — with neural networks, Gaussian processes (PILCO), or latent recurrent state-space models (Dreamer). Choices that matter: deterministic vs stochastic, pixel-space vs compact latent space, and whether the model reports its own uncertainty.
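
As a minimal sketch of the learned-from-data case, the snippet below fits a deterministic linear model s' ≈ w_s·s + w_a·a by least squares on transitions gathered with random actions. The 1-D environment `true_step`, its coefficients and every function name here are hypothetical, chosen only for illustration; practical systems use the richer model classes named above.

```python
import random

def true_step(s, a):
    """Hypothetical ground-truth environment: a noiseless 1-D linear system."""
    return 0.9 * s + 0.5 * a

def collect(n, seed=0):
    """Gather n real transitions (s, a, s_next) using uniformly random actions."""
    rng = random.Random(seed)
    data, s = [], 0.0
    for _ in range(n):
        a = rng.uniform(-1.0, 1.0)
        s_next = true_step(s, a)
        data.append((s, a, s_next))
        s = s_next
    return data

def fit_linear_model(data):
    """Least-squares fit of s_next ~ w_s*s + w_a*a via the 2x2 normal equations."""
    sss = sum(s * s for s, a, y in data)
    ssa = sum(s * a for s, a, y in data)
    saa = sum(a * a for s, a, y in data)
    ssy = sum(s * y for s, a, y in data)
    say = sum(a * y for s, a, y in data)
    det = sss * saa - ssa * ssa
    w_s = (ssy * saa - ssa * say) / det
    w_a = (sss * say - ssa * ssy) / det
    return w_s, w_a

w_s, w_a = fit_linear_model(collect(200))
print(f"learned model: s_next = {w_s:.3f}*s + {w_a:.3f}*a")
```

Because the toy environment is noiseless and exactly linear, the fit recovers the true coefficients up to floating-point error; with noise or nonlinearity the same loop would yield only an approximation, which is where model bias enters.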

2. Plan or imagine with the model

Given the model, the agent improves its behavior in one of two ways:

  • Planning / decision-time control — at each step, search forward through the model to choose the best action now. Examples: Monte Carlo Tree Search (MuZero, AlphaZero), Model Predictive Control (MPC), and random-shooting / CEM trajectory optimization.
  • Background planning (Dyna-style) — use the model to generate imagined transitions and feed them to an ordinary model-free learner, so the policy or value function trains on real and synthetic data.
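
The decision-time option can be sketched with random shooting inside an MPC loop: sample candidate action sequences, score each by rolling it through the model, execute only the first action of the best one, then replan. The toy `model`, its reward, the horizon and the candidate count are all assumptions made for this illustration, and the "real" environment is taken to equal the model so that the sketch stays self-contained.

```python
import random

def model(s, a):
    """Assumed dynamics model: returns (next_state, reward) for a 1-D system.
    The reward penalizes distance from the origin."""
    s_next = 0.9 * s + 0.5 * a
    return s_next, -(s_next ** 2)

def random_shooting(s0, rng, horizon=5, n_candidates=200):
    """Return the first action of the best of n random action sequences."""
    best_return, best_first = float("-inf"), 0.0
    for _ in range(n_candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        s, total = s0, 0.0
        for a in actions:          # imagined rollout through the model
            s, r = model(s, a)
            total += r
        if total > best_return:
            best_return, best_first = total, actions[0]
    return best_first

def run_mpc(s0, steps=20, seed=0):
    """Replan at every step, executing only the first planned action."""
    rng = random.Random(seed)
    s = s0
    for _ in range(steps):
        a = random_shooting(s, rng)
        s, _ = model(s, a)         # here the environment equals the model
    return s

print("final state:", run_mpc(2.0))
```

Replanning at every step is what makes this MPC rather than open-loop control: errors in one plan get corrected by the next. CEM would replace the uniform sampling with a sampling distribution refit to the best candidates.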

Dyna: the foundational template

Richard Sutton’s Dyna (1990) is the architecture every later method echoes. Its loop interleaves three operations: act in the real environment, use that real transition to update both a value function and a model, then run extra value updates on simulated transitions sampled from the model. The model-free learner cannot tell real experience from imagined experience — it just gets more of it.

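[Figure: the Dyna loop. The environment supplies real experience to the value / policy (model-free learner) and, through model learning, to the learned model; the learned model returns simulated experience to the learner; the learner acts on the environment.]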
The Dyna loop: real experience updates the value function and the model directly; the model then generates simulated experience for additional planning updates between real steps.

Dyna-Q’s lesson endures: even a crude model, used for a handful of extra updates per real step, dramatically speeds learning early on when real data is scarce. The danger it also exposes — when the world changes, a stale model keeps feeding the learner lies — motivated decades of work on model uncertainty.
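
The three interleaved operations can be sketched as tabular Dyna-Q on a tiny deterministic chain. The environment (six states, reward 1 at the right end), the hyperparameters and all names are hypothetical choices for illustration, not taken from Sutton's paper.

```python
import random

N_STATES, GOAL = 6, 5
ACTIONS = (-1, +1)  # move left, move right

def env_step(s, a):
    """Hypothetical chain environment: reward 1.0 on reaching the goal state."""
    s_next = min(max(s + a, 0), GOAL)
    done = s_next == GOAL
    return s_next, (1.0 if done else 0.0), done

def dyna_q(episodes=30, planning_steps=10, alpha=0.5, gamma=0.95, eps=0.1, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    model = {}  # (s, a) -> (reward, next_state, done), learned from real steps

    def update(s, a, r, s2, done):
        """One Q-learning backup, identical for real and simulated transitions."""
        target = r if done else r + gamma * max(q[(s2, b)] for b in ACTIONS)
        q[(s, a)] += alpha * (target - q[(s, a)])

    for _ in range(episodes):
        s, done = 0, False
        while not done:
            if rng.random() < eps:
                a = rng.choice(ACTIONS)
            else:  # greedy with random tie-breaking
                best = max(q[(s, b)] for b in ACTIONS)
                a = rng.choice([b for b in ACTIONS if q[(s, b)] == best])
            s2, r, done = env_step(s, a)
            update(s, a, r, s2, done)        # 1) direct RL from real experience
            model[(s, a)] = (r, s2, done)    # 2) model learning
            for _ in range(planning_steps):  # 3) planning on simulated experience
                (ps, pa), (pr, ps2, pdone) = rng.choice(list(model.items()))
                update(ps, pa, pr, ps2, pdone)
            s = s2
    return q

q = dyna_q()
print({s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)})
```

Setting `planning_steps=0` reduces this to plain Q-learning, which makes the loop a convenient way to compare learning speed with and without the model.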

The math: what could go wrong, formally

The seductive promise of MBRL has a sharp catch. Suppose your model has a small per-step prediction error. Over a long rollout, those errors compound: the agent plans deep into a future the model invents, and a policy that is optimal under the model can be arbitrarily bad in the world. This is model bias, also called model exploitation. (It is related to, but distinct from, objective mismatch, where a model trained for prediction accuracy is not necessarily the model most useful for control.)

The standard remedy, formalized by MBPO (Janner et al., 2019), is to keep model rollouts short and branch them from real states rather than imagining whole episodes from scratch. If the model’s one-step generalization error is bounded by $\epsilon_m$ and the policy has shifted from the data-collection policy by $\epsilon_\pi$, the gap between true and model returns under a $k$-step branched rollout scales roughly as

$$\text{gap} \;\lesssim\; 2\,r_{\max}\!\left[\frac{\gamma^{k+1}\epsilon_\pi}{(1-\gamma)^2} + \frac{\gamma^{k}\epsilon_\pi}{1-\gamma} + \frac{k}{1-\gamma}\,\epsilon_m\right].$$

The shape is what matters: the model-error term grows linearly with rollout length $k$ (the $k\,\epsilon_m$ term), while the two policy-shift terms shrink as $\gamma^k$. Pick $k$ to trade these off: long enough to be useful, short enough that compounding model error stays small. “When to trust your model” is the whole game.
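
To make the trade-off concrete, the sketch below evaluates the bracketed expression exactly as written above and searches for the rollout length that minimizes it. The function names, the default `gamma` and `r_max`, and the example error values are illustrative assumptions, not numbers from the MBPO paper.

```python
def branched_rollout_gap(k, eps_m, eps_pi, gamma=0.99, r_max=1.0):
    """Evaluate the k-step branched-rollout bound as stated in the text."""
    policy_shift = (gamma ** (k + 1)) * eps_pi / (1 - gamma) ** 2 \
                   + (gamma ** k) * eps_pi / (1 - gamma)
    model_error = k / (1 - gamma) * eps_m
    return 2 * r_max * (policy_shift + model_error)

def best_rollout_length(eps_m, eps_pi, max_k=1000, **kwargs):
    """Rollout length in [0, max_k] that minimizes the bound."""
    return min(range(max_k + 1),
               key=lambda k: branched_rollout_gap(k, eps_m, eps_pi, **kwargs))

# An accurate model (eps_m much smaller than eps_pi) favors rollouts of some length;
# an inaccurate one (eps_m larger than eps_pi) favors not using the model at all.
print(best_rollout_length(eps_m=0.001, eps_pi=0.1))
print(best_rollout_length(eps_m=0.5, eps_pi=0.1))
```

Under this expression the two policy-shift terms sum to $\gamma^k \epsilon_\pi / (1-\gamma)^2$, so moving from $k$ to $k+1$ lowers the bound only while $\epsilon_m < \gamma^k \epsilon_\pi$. That is the formal version of trusting the model only as far as it is accurate.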

Go deeper: handling uncertainty in the model

The cleanest fix for model exploitation is to make the model honest about what it does not know. PILCO used a Gaussian process to propagate full predictive distributions through long-term planning, which is why it learned cart-pole-style tasks in a handful of trials. Deep methods approximate this with ensembles of probabilistic networks (as in PETS and MBPO): disagreement among ensemble members flags regions where the model is extrapolating, so the planner can avoid betting the policy on them. The general principle — penalize or avoid high-uncertainty states — also underpins offline RL, where you can never collect more data to correct a confidently-wrong model.
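
A toy version of the ensemble idea, as a sketch only: each member is a straight-line one-step model fit to a bootstrap resample of the same data, and the spread of their predictions serves as the uncertainty signal. PETS and MBPO use ensembles of probabilistic neural networks; the linear members, the synthetic data and all names below are simplifying assumptions for illustration.

```python
import random
import statistics

def fit_line(points):
    """Ordinary least-squares line through (x, y) points: returns (slope, intercept)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return slope, my - slope * mx

def train_ensemble(data, n_members=5, seed=0):
    """Fit each member on a bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    return [fit_line([rng.choice(data) for _ in data]) for _ in range(n_members)]

def disagreement(ensemble, x):
    """Standard deviation of member predictions at state x."""
    return statistics.pstdev(slope * x + intercept for slope, intercept in ensemble)

# Synthetic transitions: states in [-1, 1], noisy next state (action omitted for brevity).
rng = random.Random(1)
states = [rng.uniform(-1.0, 1.0) for _ in range(50)]
data = [(s, 0.9 * s + rng.gauss(0.0, 0.1)) for s in states]

ensemble = train_ensemble(data)
print("in-distribution disagreement:", disagreement(ensemble, 0.0))
print("extrapolation disagreement:  ", disagreement(ensemble, 10.0))
```

A planner could subtract a multiple of `disagreement` from the predicted reward, or truncate rollouts once it crosses a threshold; either is one way to apply the penalize-or-avoid principle described above.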

Decision-time planning: MuZero and the learned model

The most dramatic MBRL result hides the model inside a search. MuZero (Schrittwieser et al., 2019/2020) learns a model that predicts only what planning actually needs — the reward, the policy and the value — and runs Monte Carlo Tree Search over that learned model. Crucially it never predicts pixels or full states; it learns an abstract latent dynamics that is “good enough to plan with,” which turns out to be far easier than reconstructing the world.

The payoff: MuZero matched AlphaZero in Go, chess and shogi without being told the rules, and set a new state of the art on Atari at the same time — superhuman planning in domains where the dynamics had to be learned from scratch.

Known model — plan with the rules

Board games hand you a perfect simulator. AlphaZero plans with the true rules via MCTS; no model error, all the budget goes to search. See AlphaZero & MuZero.

Learned model — plan with predictions

MuZero, Dreamer, TD-MPC2 learn the dynamics from experience, then plan in that learned (often latent) space. More general, but now exposed to model bias.

Learning in imagination: World Models and Dreamer

The other modern thread trains the policy itself inside the model. Ha and Schmidhuber’s World Models (2018) showed the recipe vividly: a VAE compresses pixels to a latent code, an MDN-RNN predicts how that latent evolves, and a tiny controller is trained almost entirely inside the model’s “dream” — then transferred back to the real environment.

Danijar Hafner’s Dreamer line industrialized this. Dreamer learns a recurrent latent world model, then trains an actor and critic on imagined latent rollouts by backpropagating value gradients through the learned dynamics. DreamerV3 (2023) hit a milestone that had resisted everyone: collecting diamonds in Minecraft from scratch, from sparse rewards, with no human data. It also mastered 150+ tasks across very different domains with a single fixed set of hyperparameters; sensitivity to tuning had long been the Achilles heel of model-based methods.
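
One piece of the imagination step is easy to isolate: the Dreamer papers train the critic toward λ-returns computed along each imagined rollout. The sketch below computes those targets from a list of imagined rewards and critic values. The function name, argument layout and default `gamma` and `lam` are assumptions for illustration, and the world model and actor are omitted entirely.

```python
def lambda_returns(rewards, values, gamma=0.99, lam=0.95):
    """Lambda-return targets along one imagined rollout.

    rewards[t] is the predicted reward for imagined step t (length H).
    values[t] is the critic's estimate at imagined state t (length H + 1);
    the last entry bootstraps the return beyond the imagination horizon.
    """
    assert len(values) == len(rewards) + 1
    returns = [0.0] * len(rewards)
    next_return = values[-1]
    for t in reversed(range(len(rewards))):
        # Blend the one-step bootstrap with the longer return that follows it.
        next_return = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * next_return)
        returns[t] = next_return
    return returns

# lam=0 gives one-step TD targets; lam=1 gives the discounted sum of rewards
# plus the bootstrapped final value.
print(lambda_returns([1.0, 1.0, 1.0], [0.0, 2.0, 4.0, 6.0], gamma=0.5, lam=0.0))
print(lambda_returns([1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], gamma=0.5, lam=1.0))
```

Intermediate values of `lam` trade the bias of trusting the critic early against the variance, and the model error, accumulated over longer imagined horizons.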

Go deeper: why plan in latent space?

Predicting raw pixels is wasteful — most pixels are irrelevant to control, and reconstruction loss spends capacity on visual detail the agent never acts on. Latent world models (Dreamer, TD-MPC2, MuZero) instead learn a compact state that need only be predictive of reward and value. This makes long imagined rollouts cheap, lets the planner search more futures per unit compute, and sidesteps a whole class of model-bias failures where the model nails the picture but mis-predicts the consequence. TD-MPC2 pushes this further with a decoder-free latent model and shows it scales: a single 317M-parameter agent handled 80 tasks across multiple embodiments and action spaces.

Model-based vs model-free: the trade-off

Neither family dominates; they trade different resources.

| Dimension | Model-based RL | Model-free RL |
| --- | --- | --- |
| Sample efficiency | High: reuses every real transition via the model | Low: needs many real interactions |
| Compute per step | High: planning/imagination is expensive | Low: one forward pass |
| Asymptotic performance | Can be capped by model error | Often higher with enough data |
| Failure mode | Model bias / exploitation | Slow, unstable, sample-hungry |
| Best when | Real data is costly, slow or risky (robotics, control) | Cheap fast simulators, lots of steps |
| Examples | Dyna, PILCO, MBPO, MuZero, Dreamer, TD-MPC2 | DQN, PPO, SAC |

The gap is narrowing from both sides: MBPO and DreamerV3 closed much of the historic performance deficit of model-based methods, while model-free methods borrow tricks (replay, ensembles) to improve efficiency. In practice the question is rarely “which camp” but “how much planning can I afford, and how much can I trust my model?”

A short history of model-based RL

  • 1990, Dyna: Sutton’s architecture interleaves acting, model learning and planning on simulated experience, the template for the whole field.
  • 2011, PILCO: Deisenroth & Rasmussen use a Gaussian-process model that propagates uncertainty, achieving then-unmatched data efficiency on control tasks.
  • 2017, AlphaZero: Planning (MCTS) with a known, exact model reaches superhuman Go, chess and shogi from self-play alone.
  • 2018, World Models: Ha & Schmidhuber train a controller almost entirely inside a learned generative dream of the environment.
  • 2019, MBPO & PlaNet/Dreamer: Short branched rollouts make deep model-based RL competitive with SAC; latent world models learn behaviors by imagination.
  • 2020, MuZero: Planning with a learned latent model matches AlphaZero without being given the rules, and tops Atari.
  • 2023–25, DreamerV3 & TD-MPC2: A single config masters 150+ tasks and finds Minecraft diamonds from scratch; scalable latent world models point toward general planning agents.

Where model-based RL is used

  • Robotics and continuous control — the home turf, where real data is slow and expensive. PILCO, MBPO, Dreamer and TD-MPC2 all target this.
  • Games with planning — AlphaZero/MuZero for board games and Atari; learned-model search where rules are unknown.
  • Industrial and systems control — MuZero was adapted to optimize video compression; model-predictive control underlies chemical, energy and HVAC systems.
  • Data center & resource scheduling — short-horizon learned models support planning where exploration in production is risky.
  • A bridge to offline RL — model-based methods that quantify uncertainty are a natural fit for learning from fixed datasets without further interaction.

Building the simulators, environments and data pipelines that model-based agents train against is its own growing industry — see the RL environment and simulation vendors.

Limitations and open problems

  • Model bias remains the core tension — a perfectly optimized policy in an imperfect model can fail in reality; uncertainty and short rollouts mitigate but don’t eliminate it.
  • Compute cost of planning — decision-time search (MCTS, CEM) is expensive per action, limiting real-time use.
  • Long-horizon and stochastic dynamics: compounding error remains hard, as does separating genuine randomness in the environment (aleatoric uncertainty) from the model’s own ignorance (epistemic uncertainty).
  • Objective mismatch — models trained to predict accurately are not necessarily the models most useful for control; value-aware and decision-aware model learning is an active area.
  • Generality — DreamerV3’s fixed-hyperparameter success was a landmark precisely because robustness across domains had been so elusive.

Researcher takes

Danijar Hafner, lead author of the Dreamer line, frames the model-based idea in one sentence: solve tasks by imagining the consequences of actions inside a continuously learned world model.

Yann LeCun makes the core sample-efficiency argument for model-based methods: with a good world model, model-predictive control solves new tasks zero-shot, whereas model-free RL needs enormous numbers of trials per task.

A pointed historical critique: LeCun argues that because pure RL is so sample-inefficient, the field has effectively rediscovered ideas from optimal control and model-predictive control under new names.

Frequently asked questions

What is the difference between model-based and model-free RL?

Model-free RL learns a policy or value function directly from experienced transitions, ignoring any explicit model of dynamics. Model-based RL learns (or is given) a model that predicts the next state and reward, then uses it to plan or to generate imagined training data. Model-based is more sample-efficient; model-free is simpler and often stronger at the asymptote given unlimited cheap data.

Is MuZero model-based?

Yes — it is a flagship example. MuZero learns a model that predicts the reward, policy and value (not raw observations) and runs Monte Carlo Tree Search over that learned model. It plans with a model it learned itself, which is exactly what makes it model-based, even though it never reconstructs the environment’s true state. See AlphaZero & MuZero.

What is a world model?

A world model is a learned internal simulator of an environment’s dynamics — typically a compact latent representation plus a predictor of how that latent evolves under actions, often with reward and value heads. Agents like Dreamer train behaviors inside the world model by imagining rollouts, then act in the real environment. It is the model in model-based RL, scaled up with deep generative networks.

Why is model bias such a problem?

Because a policy is an optimizer. Optimize it against a model with even small per-step errors and it will exploit those errors — choosing actions that look great in the model and fail in reality — and the errors compound over long imagined rollouts. The standard defenses are short rollouts branched from real states (MBPO), uncertainty-aware models (ensembles, Gaussian processes), and planning in compact latent spaces.
