- A world model is a learned, predictive model of an environment's dynamics — given the current state and an action, it predicts what happens next, usually in a compressed latent space rather than raw pixels.
- Once you have a world model you can train a policy almost entirely inside it — 'imagining' or 'dreaming' rollouts — which makes learning dramatically more sample-efficient than acting in the real world.
- The lineage runs from Ha & Schmidhuber's 2018 'World Models' (VAE + RNN + tiny controller) to the Dreamer family, which mastered Atari and continuous control and was the first to mine diamonds in Minecraft from scratch.
- World models now power generative interactive worlds (DeepMind's Genie) and are central to the bet — argued by Yann LeCun and others — that prediction, not just language, is the path to agents that plan.
What is a world model?
A world model is a learned model of how an environment behaves: feed it the current situation and a candidate action, and it predicts the next situation (and usually the reward). It is the agent’s internal “simulator” of reality. The name and modern framing come from David Ha and Jürgen Schmidhuber’s 2018 paper World Models, but the idea is old — it is just a learned version of the dynamics model at the heart of model-based RL.
The crucial twist that makes world models powerful is where they predict. Predicting the next frame of a game pixel-by-pixel is wasteful and hard. Instead, a world model first compresses each observation into a small latent vector $z_t$, then learns to predict the next latent $z_{t+1}$. Planning and policy learning happen in this compact latent space, which is far cheaper than reasoning over raw images.
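To make the size difference concrete, here is a minimal sketch of the encode-then-predict pattern. The weights are random and untrained, and the names `encode`, `predict_next_latent`, `W_enc` and `W_dyn`, as well as the action size, are hypothetical choices for illustration, not any paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

OBS_DIM = 64 * 64 * 3   # a flattened 64x64 RGB frame
LATENT_DIM = 32         # illustrative latent size
ACTION_DIM = 3          # illustrative action size (assumption)

# Hypothetical, untrained weights standing in for a learned encoder and dynamics model.
W_enc = rng.normal(scale=0.01, size=(LATENT_DIM, OBS_DIM))
W_dyn = rng.normal(scale=0.1, size=(LATENT_DIM, LATENT_DIM + ACTION_DIM))

def encode(obs):
    """Compress a flattened observation into a latent vector z_t."""
    return np.tanh(W_enc @ obs)

def predict_next_latent(z, action):
    """Predict z_{t+1} from (z_t, a_t) without ever touching pixels."""
    return np.tanh(W_dyn @ np.concatenate([z, action]))

obs = rng.random(OBS_DIM)
action = np.array([0.0, 1.0, 0.0])
z_t = encode(obs)
z_next = predict_next_latent(z_t, action)
print(obs.shape, z_t.shape, z_next.shape)
```

The prediction step works on 32 numbers instead of 12,288, which is where the savings in planning and policy learning come from.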
Why world models exist
Standard model-free RL (Q-learning, DQN, PPO) learns a policy or value function directly from experience and never builds an explicit model of the transitions it sees. It works, but it is sample-hungry: it can take tens of millions of frames to learn a game, because each environment interaction teaches the network only about the reward, not about how the world works.
A world model captures the structure that model-free methods ignore. Every transition the agent sees is a free lesson in physics: what follows what, which actions matter, what the consequences are. Learning that dynamics model is self-supervised — no rewards required, just “predict the next thing” — so it can soak up far more signal from the same data. Then the agent can practice in imagination instead of paying for every lesson in the real world.
The original World Models: V, M, C
Ha & Schmidhuber split the agent into three pieces, deliberately pushing almost all the parameters into the model and keeping the policy tiny.
Vision (V). A convolutional VAE compresses each 64×64 RGB frame into a small latent vector $z_t$ (dimension 32 for CarRacing, 64 for VizDoom). This is the “what does the world look like right now” code, learned purely to reconstruct frames.
Memory (M). An LSTM with a Mixture Density Network head predicts the distribution over the next latent given the current latent, action and hidden state:
$$P(z_{t+1} \mid a_t, z_t, h_t)$$
Modelling a mixture (5 Gaussians) rather than a single point lets the model express genuine uncertainty: the future is stochastic, and a fireball might or might not appear.
Controller (C). The controller is almost nothing: a single linear map from the concatenation of the latent and the RNN hidden state to an action, $a_t = W_c\,[z_t\; h_t] + b_c$. With only 867 parameters in the CarRacing agent, it is trained with CMA-ES, an evolution strategy, which is feasible precisely because the search space is so small.
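The parameter count can be checked with a few lines. This is a sketch with random weights; the hidden size of 256 and the three-dimensional action (steering, gas, brake) are assumptions taken from the CarRacing setup in the paper rather than from the text above, and the names `W_c`, `b_c` and `controller` are illustrative.

```python
import numpy as np

Z_DIM = 32    # latent size for CarRacing (from the text)
H_DIM = 256   # assumed LSTM hidden size for CarRacing
A_DIM = 3     # assumed action size: steering, gas, brake

rng = np.random.default_rng(0)
W_c = rng.normal(size=(A_DIM, Z_DIM + H_DIM))  # random stand-in for evolved weights
b_c = np.zeros(A_DIM)

def controller(z, h):
    """Single linear map from the concatenated [z_t, h_t] to an action."""
    return W_c @ np.concatenate([z, h]) + b_c

n_params = W_c.size + b_c.size
print(n_params)  # (32 + 256) * 3 + 3 = 867
```

A search space this small is what makes a gradient-free optimizer such as CMA-ES practical.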
The headline result: on VizDoom, the agent was trained entirely inside the dream generated by the MDN-RNN, never touching the real game during policy learning, then transferred — scoring 1092 on the real environment. On CarRacing-v0 it reached 906 ± 21, the first agent to “solve” that benchmark.
Go deeper: the temperature trick and cheating the dream
Training inside a learned model has a trap: the policy can find adversarial actions that exploit the model’s imperfections, racking up imaginary reward in situations the real world never produces. Ha & Schmidhuber’s fix was a temperature parameter $\tau$ on the MDN-RNN’s sampling. Cranking up $\tau$ (they used $\tau = 1.15$ for the transferred VizDoom agent) makes the dream more uncertain and noisy, which paradoxically improves real-world transfer: a policy robust to a chaotic dream is robust to reality, and can’t lazily exploit a too-clean hallucination. This is the same over-optimization tension you see in reward hacking: optimize a flawed model too hard and it stops tracking truth.
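The effect of temperature on a mixture density output can be sketched as follows. The scheme here (dividing the mixture log-weights by the temperature and scaling each standard deviation by its square root) is one common way to apply temperature and is an assumption, as are the toy mixture parameters and the names `sample_mdn` and `spread`.

```python
import numpy as np

def sample_mdn(log_pi, mu, sigma, tau, rng):
    """Sample one scalar from a 1-D Gaussian mixture at temperature tau."""
    logits = log_pi / tau              # higher tau flattens the mixture weights
    logits = logits - logits.max()     # numerical stability for the softmax
    pi = np.exp(logits)
    pi /= pi.sum()
    k = rng.choice(len(pi), p=pi)      # pick a mixture component
    # Higher tau also widens the chosen Gaussian.
    return mu[k] + sigma[k] * np.sqrt(tau) * rng.standard_normal()

rng = np.random.default_rng(0)
log_pi = np.log(np.array([0.7, 0.1, 0.1, 0.05, 0.05]))  # toy 5-component mixture
mu = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
sigma = np.full(5, 0.1)

def spread(tau, n=5000):
    """Empirical standard deviation of samples drawn at a given temperature."""
    return np.std([sample_mdn(log_pi, mu, sigma, tau, rng) for _ in range(n)])

low, high = spread(0.5), spread(1.5)
print(low < high)
```

At low temperature almost every sample comes from the dominant component; at high temperature the rarer outcomes show up far more often, which is what makes the dream harder to exploit.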
How a modern world model is trained
The contemporary recipe — used by the Dreamer family — interleaves three loops that all run continuously.
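Collect experience. The current policy acts in the real environment and stores transitions in a replay buffer. This is the only place real interaction is spent.
Learn the world model. Train the dynamics model to predict the next latent, the reward and an episode-continuation flag from replayed sequences. Dreamer uses a Recurrent State-Space Model (RSSM) whose latent has a deterministic part $h_t$ (carried by a GRU) and a stochastic part $z_t$:
$$h_t = f_\phi(h_{t-1}, z_{t-1}, a_{t-1}), \qquad z_t \sim q_\phi(z_t \mid h_t, x_t), \qquad \hat{z}_t \sim p_\phi(\hat{z}_t \mid h_t)$$
Here $x_t$ is the observation. The posterior $q_\phi$ sees it, while the prior $p_\phi$ must predict the stochastic part from $h_t$ alone, which is what imagination relies on.
Learn behaviour in imagination. Roll the world model forward from sampled start states, generating purely imagined latent trajectories. Train an actor-critic on these dreams: the critic estimates returns inside the model, the actor maximizes them. No real frames are touched in this loop; it is the imagination engine.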
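The three loops can be sketched end to end on a toy problem. Everything here is a simplifying assumption: the 1-D environment, the linear least-squares model standing in for an RSSM, the known reward function (Dreamer learns a reward head), and random search standing in for the actor-critic. It shows only the control flow, not Dreamer itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def real_step(x, a):
    """Toy 'real' environment (hypothetical): the action nudges a 1-D state."""
    x_next = x + 0.1 * a
    return x_next, -x_next ** 2        # reward is highest at x = 0

def policy(x, k):
    """One-parameter policy: push toward zero with gain k."""
    return float(np.clip(-k * x, -1.0, 1.0))

def fit_world_model(buffer):
    """Loop 2: least-squares fit of x_next ~ w . [x, a, 1] on replayed data."""
    data = np.array(buffer)
    feats = np.column_stack([data[:, 0], data[:, 1], np.ones(len(data))])
    w, *_ = np.linalg.lstsq(feats, data[:, 2], rcond=None)
    return w

def imagined_return(w, k, x0, horizon=15):
    """Loop 3: roll the learned model forward; no real steps are taken."""
    x, total = x0, 0.0
    for _ in range(horizon):
        x = w @ np.array([x, policy(x, k), 1.0])
        total += -x ** 2               # reward assumed known in this sketch
    return total

buffer, k, real_steps = [], 0.0, 0
for iteration in range(5):
    # Loop 1: collect a little real experience with exploration noise.
    x = rng.uniform(-1, 1)
    for _ in range(20):
        a = float(np.clip(policy(x, k) + rng.normal(scale=0.5), -1, 1))
        x_next, _ = real_step(x, a)
        buffer.append((x, a, x_next))
        x, real_steps = x_next, real_steps + 1
    w = fit_world_model(buffer)
    # Improve the policy purely in imagination (random search, not actor-critic).
    starts = rng.uniform(-1, 1, size=16)
    def score(gain):
        return np.mean([imagined_return(w, gain, s) for s in starts])
    for cand in k + rng.normal(scale=1.0, size=32):
        if score(cand) > score(k):
            k = cand

print(real_steps, np.round(w, 3), round(float(k), 2))
```

Only 100 real steps are spent in total; every policy evaluation happens inside the fitted model.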
The Dreamer family
Danijar Hafner and collaborators turned the world-model idea into a state-of-the-art, general agent over three versions.
| Version | Year | Headline | Key change |
|---|---|---|---|
| PlaNet | 2019 | Latent planning from pixels | Introduced the RSSM; planned with model-predictive control, no policy network |
| Dreamer (v1) | 2020 | Continuous control by latent imagination | Learned an actor-critic inside the model via analytic gradients through the dynamics |
| DreamerV2 | 2021 | First world-model agent to reach human-level performance on the 55-game Atari benchmark | Discrete (categorical) latents stabilized learning |
| DreamerV3 | 2023–25 | One config across 150+ tasks; first to mine Minecraft diamonds from scratch | Robustness via normalization, “symlog” reward transforms and balancing — no per-task tuning |
DreamerV3’s Minecraft result is the standout. Collecting a diamond requires a long chain of subgoals (wood, tools, stone, iron, then diamond) with sparse reward, and DreamerV3 was the first algorithm to do it from scratch, with no human demonstrations and no curriculum, using roughly 30M environment steps. The work was published in Nature in 2025.
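The symlog transform mentioned in the table is simple to write down: $\mathrm{symlog}(x) = \mathrm{sign}(x)\,\ln(|x| + 1)$, with inverse $\mathrm{symexp}(x) = \mathrm{sign}(x)\,(e^{|x|} - 1)$. The sketch below shows how it squashes large magnitudes while staying roughly linear near zero; the reward values are made up for illustration.

```python
import numpy as np

def symlog(x):
    """Symmetric log: compresses large magnitudes, keeps the sign."""
    return np.sign(x) * np.log1p(np.abs(x))

def symexp(x):
    """Inverse of symlog."""
    return np.sign(x) * np.expm1(np.abs(x))

rewards = np.array([-1000.0, -1.0, 0.0, 0.5, 1000.0])  # illustrative values
squashed = symlog(rewards)
print(np.round(squashed, 3))
print(np.allclose(symexp(squashed), rewards))
```

Because targets of very different scales land in a similar numeric range, the same network settings can be reused across tasks whose rewards differ by orders of magnitude.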
Go deeper: why imagination beats real rollouts for the policy
Once the world model is decent, generating an imagined trajectory is a few cheap forward passes through a small recurrent net — orders of magnitude faster and safer than stepping a real simulator or robot. Crucially, the dynamics are differentiable, so Dreamer can backpropagate the policy gradient through the imagined dynamics (a pathwise / reparameterized gradient), giving a lower-variance learning signal than the score-function estimator model-free methods rely on. The catch is compounding model error: small per-step prediction mistakes accumulate over a long rollout, so imagined horizons are kept short (often 15–16 steps) and refreshed from real states.
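Compounding error is easy to see numerically. In this sketch the "true" dynamics and the slightly wrong learned model are both hypothetical one-parameter linear systems chosen for illustration; the point is only how a small per-step error grows with the rollout horizon.

```python
import numpy as np

TRUE_A = 0.95    # hypothetical true dynamics: x_next = TRUE_A * x
MODEL_A = 0.97   # hypothetical learned model with a small per-step error

def rollout(a, x0, horizon):
    """Iterate x_next = a * x for `horizon` steps and return the trajectory."""
    xs = [x0]
    for _ in range(horizon):
        xs.append(a * xs[-1])
    return np.array(xs)

horizon = 50
real = rollout(TRUE_A, 1.0, horizon)
imagined = rollout(MODEL_A, 1.0, horizon)
rel_err = np.abs(imagined - real) / np.abs(real)

for t in (1, 15, 50):
    print(t, round(float(rel_err[t]), 3))
```

A one-step relative error of about 2% grows to roughly 37% after 15 steps and well over 100% after 50, which is why short imagined horizons that restart from real states are the usual compromise.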
Generative vs. predictive world models
There is a live architectural debate about what a world model should predict.
Generative. Predict (and often render) the full next observation: pixels, frames, full latents with a decoder. Used by the original World Models, Dreamer and Genie. Great for interpretability and interactive generation, but spends capacity on irrelevant detail (every leaf, every texture).
Predictive (joint-embedding). Predict only the representation of the future, never the pixels. Yann LeCun’s JEPA family argues pixel prediction yields a “blurry average” and that the model should learn to ignore unpredictable detail and capture structure. V-JEPA 2 reported strong zero-shot robot control from latent-space prediction alone.
LeCun’s bet is that prediction in representation space, not next-token generation, is the road to agents that plan and reason — and that a good world model is the missing ingredient in today’s LLMs. Whether to render pixels (Genie) or skip them (JEPA) is the central open design choice in 2026.
Genie: world models as generative worlds
DeepMind’s Genie line reframes a world model as something you can play. Trained unsupervised on unlabelled internet video, Genie (2024) learns a spatiotemporal tokenizer, an autoregressive dynamics model and — its key trick — a latent action model that infers controllable actions without any action labels. At 11B parameters it is, in effect, a foundation world model: prompt it and step through a generated 2D world frame by frame.
Genie 3 (2025) pushed this to real-time, navigable 3D worlds at 720p and 24 fps from a text prompt, holding consistency for minutes. The implication for RL is large: such models could become on-demand training environments — infinite, diverse RL environments generated rather than hand-built.
Where world models are used
| Domain | What the world model buys you |
|---|---|
| Sample-efficient game RL | Dreamer-style imagination training on Atari, DMLab, Minecraft, Crafter |
| Robotics | Learn dynamics from cheap/unlabelled data, then plan or train policies without risking hardware (V-JEPA 2, Dreamer on real robots) |
| Autonomous driving | Predict how a scene evolves to anticipate other agents; latent occupancy forecasting |
| Generative environments | Genie-style models as infinite, controllable training and evaluation worlds |
| Planning & reasoning | A learned model is what you search over in MuZero-style planning |
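As a rough illustration of planning over a learned model, the sketch below uses random-shooting model-predictive control: sample many action sequences, score each inside the model, execute the first action of the best one, then replan. This is a deliberately simple stand-in, not the tree search used by MuZero or the cross-entropy method used by PlaNet, and `model_step` is a hypothetical hand-written model rather than a trained one.

```python
import numpy as np

def model_step(x, a):
    """Hypothetical learned model: predicted next state and reward."""
    x_next = x + 0.1 * a
    return x_next, -x_next ** 2

def plan(x0, rng, horizon=10, n_candidates=256):
    """Random-shooting MPC: first action of the best imagined sequence."""
    seqs = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon))
    returns = np.zeros(n_candidates)
    for i, seq in enumerate(seqs):
        x = x0
        for a in seq:
            x, r = model_step(x, a)
            returns[i] += r
    return seqs[np.argmax(returns), 0]

rng = np.random.default_rng(0)
x = 1.0
for _ in range(20):
    a = plan(x, rng)
    x, _ = model_step(x, a)   # in this sketch the model doubles as the environment
print(round(float(x), 3))
```

No policy network is involved: all of the decision-making comes from searching the model at each step, which is the trade-off between planning-based and policy-based use of a world model.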
Limitations and open problems
- Compounding error. Tiny per-step prediction mistakes accumulate over long imagined rollouts, so horizons stay short and the model must be refreshed from real states.
- Model exploitation. Policies can “hack” the world model’s flaws, scoring imaginary reward that doesn’t transfer — the same Goodhart dynamic as reward hacking, mitigated by noise, short horizons and KL-style constraints.
- What to predict. Pixels waste capacity on unpredictable detail; pure latent prediction (JEPA) risks representation collapse and needs care to stay informative.
- Long-horizon consistency. Even Genie 3 holds coherence only for minutes; objects drift and worlds forget. Stable, long-lived simulation is unsolved.
- Compute. Foundation world models are large and expensive to train and run.
Frequently asked questions
How is a world model different from model-based RL?
It is the modern, learned incarnation of it. Model-based RL is the broad category — any method that learns or uses a dynamics model to plan or train. “World model” specifically connotes a learned, often latent generative model of a high-dimensional environment (pixels, video), and the practice of training policies inside it by imagination.
What does “training in a dream” actually mean?
The policy never touches the real environment during behaviour learning. The world model generates synthetic (“imagined” or “dreamed”) trajectories in latent space, and the actor-critic is trained on those. Real interaction is spent only to improve the model, not to grind the policy — which is why it is so sample-efficient.
Is a large language model a world model?
Partially and controversially. An LLM has absorbed a lot of implicit knowledge about how the world works from text, and can predict consequences in language. But critics like Yann LeCun argue next-token prediction over text is not a grounded, controllable model of physical dynamics — which is exactly what JEPA-style and video world models aim to provide. See the generative vs. predictive debate above.
Do world models replace simulators?
Increasingly they complement and sometimes replace them. Hand-built simulators are accurate but expensive to author and limited in diversity; learned world models (and generative ones like Genie) can be trained from data and produce far more variety — at the cost of fidelity and long-horizon consistency.
Key papers
- World Models — Ha & Schmidhuber, 2018 — the VAE + MDN-RNN + controller agent that trains in its dream.
- Learning Latent Dynamics for Planning from Pixels (PlaNet) — Hafner et al., 2019 — the Recurrent State-Space Model.
- Dream to Control: Learning Behaviors by Latent Imagination — Hafner et al., 2020 — Dreamer.
- Mastering Diverse Domains through World Models (DreamerV3) — Hafner et al., 2023; published in Nature, 2025.
- Genie: Generative Interactive Environments — Bruce et al., 2024 — world models you can play.
Related
Model-based RL · AlphaZero & MuZero · Curiosity & intrinsic motivation · RL environments · Policy gradients · RL in robotics · What is reinforcement learning?