World Models in Reinforcement Learning

Key takeaways

A world model is a learned, predictive model of an environment's dynamics — given the current state and an action, it predicts what happens next, usually in a compressed latent space rather than raw pixels.
Once you have a world model you can train a policy almost entirely inside it — 'imagining' or 'dreaming' rollouts — which makes learning dramatically more sample-efficient than acting in the real world.
The lineage runs from Ha & Schmidhuber's 2018 'World Models' (VAE + RNN + tiny controller) to the Dreamer family, which mastered Atari and continuous control and was the first to mine diamonds in Minecraft from scratch.
World models now power generative interactive worlds (DeepMind's Genie) and are central to the bet — argued by Yann LeCun and others — that prediction, not just language, is the path to agents that plan.

What is a world model?

A world model is a learned model of how an environment behaves: feed it the current situation and a candidate action, and it predicts the next situation (and usually the reward). It is the agent’s internal “simulator” of reality. The name and modern framing come from David Ha and Jürgen Schmidhuber’s 2018 paper World Models, but the idea is old — it is just a learned version of the dynamics model at the heart of model-based RL.

The crucial twist that makes world models powerful is where they predict. Predicting the next frame of a game pixel-by-pixel is wasteful and hard. Instead, a world model first compresses each observation into a small latent vector $z$, then learns to predict the next latent $z'$. Planning and policy learning happen in this compact latent space, which is far cheaper than reasoning over raw images.

A world model compresses observations into a latent code, predicts how that code evolves under each action, and lets the agent train a policy inside the model's imagination before acting for real.


Why world models exist

Standard model-free RL (Q-learning, DQN, PPO) learns a policy or value function directly from experience and never builds an explicit model of how the environment behaves. It works, but it is sample-hungry: learning a single game can take tens of millions of frames, because each transition is used only to improve reward-driven value or policy estimates, not to learn how the world works.

A world model captures the structure that model-free methods ignore. Every transition the agent sees is a free lesson in physics: what follows what, which actions matter, what the consequences are. Learning that dynamics model is self-supervised — no rewards required, just “predict the next thing” — so it can soak up far more signal from the same data. Then the agent can practice in imagination instead of paying for every lesson in the real world.

Figure	What it measures
867	Parameters in the original World Models CarRacing controller; the rest is the world model
1092 ± 556	VizDoom score from a policy trained entirely inside the dream, then transferred
150+	Tasks DreamerV3 masters with one fixed set of hyperparameters

The original World Models: V, M, C

Ha & Schmidhuber split the agent into three pieces, deliberately pushing almost all the parameters into the model and keeping the policy tiny.

V — Vision (a variational autoencoder)

A convolutional VAE compresses each 64×64 RGB frame into a small latent vector $z$ — dimension 32 for CarRacing, 64 for VizDoom. This is the “what does the world look like right now” code, learned purely to reconstruct frames.
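As a rough illustration of what the V component computes once the convolutional encoder has produced its outputs, the sketch below shows the two VAE ingredients that do not depend on the network architecture: the reparameterized sample of $z$ and the KL penalty against a standard normal prior. It assumes the encoder emits a per-dimension mean and log-variance, which is the usual VAE convention rather than something stated above; the function names are hypothetical.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), independently per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


rng = random.Random(0)
latent_dim = 32  # the CarRacing latent size quoted above

# Placeholder encoder outputs (assumption: a real encoder would compute these from a frame).
mu = [0.0] * latent_dim
log_var = [0.0] * latent_dim

z = reparameterize(mu, log_var, rng)
kl_at_prior = kl_to_standard_normal(mu, log_var)                 # posterior equals the prior
kl_shifted = kl_to_standard_normal([1.0] * latent_dim, log_var)  # mean moved away from the prior
```

The KL term is zero when the encoder output matches the prior and grows as the posterior drifts away from it, which is what keeps the latent code compact.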

M — Memory (an MDN-RNN)

An LSTM with a Mixture Density Network head predicts the distribution over the next latent given the current latent, action and hidden state:

$$P\big(z_{t+1} \mid a_t,\, z_t,\, h_t\big) = \sum_{k} \pi_k\, \mathcal{N}\!\big(z_{t+1};\, \mu_k,\, \sigma_k^2\big)$$

Modelling a mixture (5 Gaussians) rather than a single point lets the model express genuine uncertainty — the future is stochastic, and a fireball might or might not appear.
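To make the mixture concrete, here is a minimal sketch of sampling from, and scoring under, a five-component Gaussian mixture for a single latent dimension. The mixture parameters are made-up placeholders; in the actual agent they would be produced by the LSTM from $(a_t, z_t, h_t)$, and the function names are hypothetical.

```python
import math
import random


def sample_mdn(pi, mu, sigma, rng):
    """Draw one value from sum_k pi_k * N(mu_k, sigma_k^2): pick a component, then sample it."""
    k = rng.choices(range(len(pi)), weights=pi, k=1)[0]
    return rng.gauss(mu[k], sigma[k]), k


def mdn_log_likelihood(x, pi, mu, sigma):
    """log of sum_k pi_k * N(x; mu_k, sigma_k^2), the per-sample training signal."""
    density = sum(
        p * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for p, m, s in zip(pi, mu, sigma)
    )
    return math.log(density)


rng = random.Random(0)
pi = [0.1, 0.2, 0.4, 0.2, 0.1]      # mixture weights (placeholder values)
mu = [-2.0, -1.0, 0.0, 1.0, 2.0]    # component means (placeholder values)
sigma = [0.5] * 5                   # component standard deviations (placeholder values)

next_latent, component = sample_mdn(pi, mu, sigma, rng)
ll_likely = mdn_log_likelihood(0.0, pi, mu, sigma)
ll_unlikely = mdn_log_likelihood(5.0, pi, mu, sigma)
```

Because the sample first commits to one component, the model can represent "either this happens or that happens" instead of averaging the two outcomes.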

C — Controller (a linear policy)

The controller is almost nothing: a single linear map from the concatenation of the latent and the RNN hidden state to an action, $a_t = W_c\,[\,z_t\, ;\, h_t\,] + b_c$. With only 867 parameters in the CarRacing agent, it is trained with CMA-ES, an evolution strategy, which is feasible precisely because the search space is so small.
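The parameter count follows directly from the shape of $W_c$ and $b_c$. The sketch below reproduces the arithmetic and applies the linear map once. The hidden size of 256 and the three-dimensional action (steering, gas, brake) are taken from the paper's CarRacing setup rather than from the text above, so treat them as assumptions; the weights are random placeholders, not trained values.

```python
import random

Z_DIM = 32    # latent size for CarRacing (stated above)
H_DIM = 256   # LSTM hidden size (assumption: CarRacing setting from the paper)
A_DIM = 3     # steering, gas, brake (assumption: CarRacing action space)


def num_controller_params(z_dim, h_dim, a_dim):
    """Weights of W_c plus biases b_c for a_t = W_c [z_t ; h_t] + b_c."""
    return (z_dim + h_dim) * a_dim + a_dim


def controller(z, h, W, b):
    """Apply the linear policy to the concatenated input [z ; h]."""
    x = z + h  # list concatenation
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


rng = random.Random(0)
W = [[rng.gauss(0.0, 0.1) for _ in range(Z_DIM + H_DIM)] for _ in range(A_DIM)]
b = [0.0] * A_DIM
z = [rng.gauss(0.0, 1.0) for _ in range(Z_DIM)]
h = [rng.uniform(-1.0, 1.0) for _ in range(H_DIM)]

n_params = num_controller_params(Z_DIM, H_DIM, A_DIM)  # (32 + 256) * 3 + 3
action = controller(z, h, W, b)
```

A search space of this size is small enough for a population-based optimizer such as CMA-ES, which only needs the episode return of each candidate parameter vector and no gradients.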

The headline result: on VizDoom, the agent was trained entirely inside the dream generated by the MDN-RNN, never touching the real game during policy learning, then transferred to the real environment, where it scored 1092 ± 556. On CarRacing-v0 it reached 906 ± 21, the first agent to “solve” that benchmark.

Go deeper: the temperature trick and cheating the dream

Training inside a learned model has a trap: the policy can find adversarial actions that exploit the model’s imperfections — racking up imaginary reward in situations the real world never produces. Ha & Schmidhuber’s fix was a temperature parameter $\tau$ on the MDN-RNN’s sampling. Cranking $\tau$ up (they used $\tau = 1.15$ for the transferred VizDoom agent) makes the dream more uncertain and noisy, which paradoxically improves real-world transfer: a policy robust to a chaotic dream is robust to reality, and can’t lazily exploit a too-clean hallucination. This is the same over-optimization tension you see in reward hacking — optimize a flawed model too hard and it stops tracking truth.
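A minimal sketch of how a temperature can be applied to a mixture before sampling. The convention used here (divide the mixture logits by $\tau$ and scale each standard deviation by $\sqrt{\tau}$) is an assumption borrowed from common MDN sampling code, not something the text above specifies; the numbers are placeholders.

```python
import math


def apply_temperature(log_pi, sigma, tau):
    """Return (weights, stds) after temperature scaling.

    Assumed convention: logits are divided by tau and re-normalized with a softmax,
    and every standard deviation is multiplied by sqrt(tau).
    """
    scaled = [lp / tau for lp in log_pi]
    top = max(scaled)  # subtract the max for numerical stability
    unnorm = [math.exp(s - top) for s in scaled]
    total = sum(unnorm)
    return [u / total for u in unnorm], [s * math.sqrt(tau) for s in sigma]


def entropy(pi):
    """Shannon entropy of the mixture weights, in nats."""
    return -sum(p * math.log(p) for p in pi if p > 0.0)


base_pi = [0.7, 0.2, 0.1]                  # placeholder mixture weights
log_pi = [math.log(p) for p in base_pi]
sigma = [0.3, 0.3, 0.3]                    # placeholder standard deviations

pi_same, sigma_same = apply_temperature(log_pi, sigma, tau=1.0)   # no change
pi_hot, sigma_hot = apply_temperature(log_pi, sigma, tau=1.15)    # noisier dream
```

With $\tau > 1$ the weights flatten and the components widen, so rare outcomes are sampled more often and the dream becomes harder to exploit.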

How a modern world model is trained

The contemporary recipe — used by the Dreamer family — interleaves three loops that all run continuously.

Collect real experience

The current policy acts in the real environment and stores transitions $(o_t, a_t, r_t, o_{t+1})$ in a replay buffer. This is the only place real interaction is spent.

Learn the world model (self-supervised)

Train the dynamics model to predict the next latent, the reward and an episode-continuation flag from replayed sequences. Dreamer uses a Recurrent State-Space Model (RSSM) whose latent has a deterministic part $h_t$ (carried by a GRU) and a stochastic part $z_t$:

$$\hat{z}_{t} \sim p_\theta(\hat{z}_t \mid h_t), \qquad h_t = f_\theta(h_{t-1}, z_{t-1}, a_{t-1})$$
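The recurrence above can be sketched as a two-part update: advance the deterministic state, then sample the stochastic state from a prior conditioned on it. This is a deliberately simplified stand-in, with a single tanh layer in place of the GRU, a Gaussian prior, and random untrained weights; all sizes and names are hypothetical.

```python
import math
import random

rng = random.Random(0)
H, Z, A = 4, 2, 1  # toy sizes for h_t, z_t and a_t (assumptions)


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


W_h = rand_matrix(H, H + Z + A)  # stand-in for the GRU weights
W_mu = rand_matrix(Z, H)         # prior mean head
W_std = rand_matrix(Z, H)        # prior standard-deviation head


def deterministic_step(h_prev, z_prev, a_prev):
    """h_t = f(h_{t-1}, z_{t-1}, a_{t-1}); a tanh layer replaces the GRU here."""
    return [math.tanh(x) for x in matvec(W_h, h_prev + z_prev + a_prev)]


def prior_sample(h):
    """z_hat_t ~ p(z_hat_t | h_t), modelled as a diagonal Gaussian."""
    mu = matvec(W_mu, h)
    std = [math.log1p(math.exp(s)) + 0.1 for s in matvec(W_std, h)]  # softplus plus a floor
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, std)]


def imagine(h0, z0, actions):
    """Roll the latent state forward without looking at any observation."""
    h, z, trajectory = h0, z0, []
    for a in actions:
        h = deterministic_step(h, z, a)
        z = prior_sample(h)
        trajectory.append((h, z))
    return trajectory


trajectory = imagine([0.0] * H, [0.0] * Z, [[1.0], [0.0], [-1.0]])
```

During model learning, a posterior that also sees the current observation would supply $z_t$ and the prior would be trained to match it; the sketch only covers the observation-free path used for imagination.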

Learn behaviour in imagination

Roll the world model forward from sampled start states, generating purely imagined latent trajectories. Train an actor-critic on these dreams: the critic estimates returns inside the model, the actor maximizes them. No real frames are touched in this loop — it is the imagination engine.
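The three loops above can be sketched end to end on a toy problem. Everything here is a simplifying assumption: the environment is a one-dimensional linear system, the "world model" is a least-squares fit with no latent space, the reward function is treated as known, and behaviour learning is a search over a single feedback gain rather than an actor-critic.

```python
import random

rng = random.Random(0)


def real_step(s, a):
    """The real environment; the agent never reads these coefficients directly."""
    s_next = 0.9 * s + 0.5 * a
    return s_next, -s_next ** 2


def policy(s, k, noise=0.0):
    """Proportional controller with gain k, clipped to [-1, 1]."""
    return max(-1.0, min(1.0, -k * s + noise))


def fit_model(buffer):
    """Least-squares fit of s' = w_s * s + w_a * a via the 2x2 normal equations."""
    sss = sum(s * s for s, a, _, _ in buffer)
    saa = sum(a * a for s, a, _, _ in buffer)
    ssa = sum(s * a for s, a, _, _ in buffer)
    sy = sum(s * s2 for s, a, _, s2 in buffer)
    ay = sum(a * s2 for s, a, _, s2 in buffer)
    det = sss * saa - ssa ** 2
    return (sy * saa - ay * ssa) / det, (ay * sss - sy * ssa) / det


def imagined_return(model, k, s0, horizon=15):
    """Roll the learned model forward under gain k; no real steps are taken."""
    w_s, w_a = model
    s, total = s0, 0.0
    for _ in range(horizon):
        s = w_s * s + w_a * policy(s, k)
        total += -s ** 2  # reward assumed known for brevity
    return total


buffer, k = [], 0.0
for iteration in range(5):
    # 1) Collect real experience with the current (noisy) policy.
    s = rng.uniform(-1.0, 1.0)
    for _ in range(20):
        a = policy(s, k, noise=rng.gauss(0.0, 0.3))
        s_next, r = real_step(s, a)
        buffer.append((s, a, r, s_next))
        s = s_next
    # 2) Learn the world model from the replay buffer.
    model = fit_model(buffer)
    # 3) Improve behaviour purely in imagination, starting from replayed states.
    starts = [t[0] for t in rng.sample(buffer, 10)]
    candidates = [i * 0.2 for i in range(11)]
    k = max(candidates, key=lambda c: sum(imagined_return(model, c, s0) for s0 in starts))
```

Only step 1 touches `real_step`; the gain is chosen entirely from rollouts of the fitted model, which is the division of labour the Dreamer recipe relies on.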

The Dreamer family

Building on PlaNet, Danijar Hafner and collaborators turned the world-model idea into a state-of-the-art, general agent over three Dreamer versions.

Version	Year	Headline	Key change
PlaNet	2019	Latent planning from pixels	Introduced the RSSM; planned with model-predictive control, no policy network
Dreamer (v1)	2020	Continuous control by latent imagination	Learned an actor-critic inside the model via analytic gradients through the dynamics
DreamerV2	2021	First world model to beat humans on Atari	Discrete (categorical) latents stabilized learning
DreamerV3	2023–25	One config across 150+ tasks; first to mine Minecraft diamonds from scratch	Robustness via normalization, “symlog” reward transforms and balancing — no per-task tuning

DreamerV3’s Minecraft result is the standout. Collecting a diamond requires a long chain of subgoals (wood, tools, stone, iron, then diamond) with sparse reward, and DreamerV3 was the first algorithm to do it from scratch, with no human demonstrations and no curriculum, using roughly 30M environment steps. The work was published in Nature in 2025.
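The “symlog” transform named in the DreamerV3 row of the table is simple enough to write down. It is defined as $\operatorname{symlog}(x) = \operatorname{sign}(x)\,\ln(|x| + 1)$ with inverse $\operatorname{symexp}(x) = \operatorname{sign}(x)\,(e^{|x|} - 1)$. The sketch below is a direct transcription of those two formulas, not DreamerV3's code.

```python
import math


def symlog(x):
    """Compress large magnitudes symmetrically around zero: sign(x) * ln(|x| + 1)."""
    return math.copysign(math.log1p(abs(x)), x)


def symexp(y):
    """Inverse of symlog: sign(y) * (exp(|y|) - 1)."""
    return math.copysign(math.expm1(abs(y)), y)


small = symlog(0.5)                      # close to the identity near zero
large = symlog(1000.0)                   # squashed to about ln(1001)
round_trip = symexp(symlog(-1000.0))     # recovers the original value
```

Targets that differ by orders of magnitude across tasks end up on a comparable scale, which is one reason a single set of hyperparameters can work across domains.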

Go deeper: why imagination beats real rollouts for the policy

Once the world model is decent, generating an imagined trajectory is a few cheap forward passes through a small recurrent net, orders of magnitude faster and safer than stepping a real simulator or robot. Crucially, the dynamics are differentiable, so the original Dreamer can backpropagate the policy gradient through the imagined dynamics (a pathwise / reparameterized gradient), giving a lower-variance learning signal than the score-function estimator model-free methods rely on. The catch is compounding model error: small per-step prediction mistakes accumulate over a long rollout, so imagined horizons are kept short (often 15–16 steps) and refreshed from real states.
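Compounding error is easy to see numerically. In the sketch below, a "learned" model differs from the true dynamics only by a small error in one coefficient, yet the gap between imagined and real states keeps growing with the rollout length. The dynamics and coefficients are made-up illustrative values.

```python
def rollout(w, b, s0, horizon):
    """Iterate s <- w * s + b and record every visited state."""
    s, states = s0, []
    for _ in range(horizon):
        s = w * s + b
        states.append(s)
    return states


true_states = rollout(0.95, 0.1, s0=1.0, horizon=50)    # real dynamics (assumed)
model_states = rollout(0.97, 0.1, s0=1.0, horizon=50)   # model with a slightly wrong coefficient

errors = [abs(m - t) for m, t in zip(model_states, true_states)]
error_after_1 = errors[0]     # one imagined step
error_after_15 = errors[14]   # a typical short imagination horizon
error_after_50 = errors[49]   # a long rollout
```

The one-step error is 0.02, but it grows by more than an order of magnitude by step 15 and further still by step 50, which is why short horizons that restart from real states are the norm.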

Generative vs. predictive world models

There is a live architectural debate about what a world model should predict.

Generative (reconstructive)

Predict (and often render) the full next observation — pixels, frames, full latents with a decoder. Used by the original World Models, Dreamer and Genie. Great for interpretability and interactive generation, but spends capacity on irrelevant detail (every leaf, every texture).

Predictive (non-generative / JEPA)

Predict only the representation of the future, never the pixels. Yann LeCun’s JEPA family argues pixel prediction yields a “blurry average” and that the model should learn to ignore unpredictable detail and capture structure. V-JEPA 2 reported strong zero-shot robot control from latent-space prediction alone.
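A toy illustration of the argument: when part of the observation is pure noise, a pixel-space loss is dominated by error that no model can remove, while a loss computed on a representation that discards the noise is not. The encoder here is hand-picked to keep only the predictable coordinate, which is a strong assumption; a JEPA-style model has to learn such a representation, and this sketch does not attempt that.

```python
import random

rng = random.Random(0)


def next_signal(signal):
    """The predictable part of the world (assumed dynamics)."""
    return 0.9 * signal


def encode(obs):
    """Hand-picked encoder that keeps the predictable coordinate and drops the noise."""
    return obs[0]


pixel_errors, latent_errors = [], []
for _ in range(1000):
    signal = rng.uniform(-1.0, 1.0)
    obs_next = (next_signal(signal), rng.gauss(0.0, 1.0))  # second coordinate is unpredictable

    pixel_pred = (next_signal(signal), 0.0)   # best guess for noise is its mean
    latent_pred = next_signal(signal)         # prediction of the encoded future only

    pixel_errors.append(sum((p - o) ** 2 for p, o in zip(pixel_pred, obs_next)))
    latent_errors.append((latent_pred - encode(obs_next)) ** 2)

pixel_loss = sum(pixel_errors) / len(pixel_errors)
latent_loss = sum(latent_errors) / len(latent_errors)
```

The pixel loss stays near the noise variance even with a perfect model of the signal, whereas the latent loss is zero. The flip side is that an encoder mapping everything to a constant would also get zero loss, so a latent objective needs some safeguard against that degenerate solution.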

LeCun’s bet is that prediction in representation space, not next-token generation, is the road to agents that plan and reason — and that a good world model is the missing ingredient in today’s LLMs. Whether to render pixels (Genie) or skip them (JEPA) is the central open design choice in 2026.

Genie: world models as generative worlds

DeepMind’s Genie line reframes a world model as something you can play. Trained unsupervised on unlabelled internet video, Genie (2024) learns a spatiotemporal tokenizer, an autoregressive dynamics model and — its key trick — a latent action model that infers controllable actions without any action labels. At 11B parameters it is, in effect, a foundation world model: prompt it and step through a generated 2D world frame by frame.

Genie 3 (2025) pushed this to real-time, navigable 3D worlds at 720p and 24 fps from a text prompt, holding consistency for minutes. The implication for RL is large: such models could become on-demand training environments — infinite, diverse RL environments generated rather than hand-built.

A short history

Period	Milestone	What happened
1990s	Schmidhuber's predictive controllers	Early work on recurrent “world models” and curiosity: a network that predicts its environment to drive control and exploration.
2018	World Models (Ha & Schmidhuber)	VAE + MDN-RNN + tiny CMA-ES controller; first clean demonstration of training a policy entirely inside a learned dream and transferring it.
2019	PlaNet	Hafner et al. introduce the Recurrent State-Space Model and latent planning directly from pixels.
2020–21	Dreamer & DreamerV2	Actor-critic learned by latent imagination; DreamerV2 is the first world-model agent to exceed human Atari performance.
2023–25	DreamerV3 in Nature	One configuration masters 150+ tasks; first to collect Minecraft diamonds from scratch, no human data.
2024–25	Genie & JEPA	DeepMind’s Genie generates playable worlds from video; LeCun’s V-JEPA 2 pushes non-generative, representation-space world models for planning.

Where world models are used

Domain	What the world model buys you
Sample-efficient game RL	Dreamer-style imagination training on Atari, DMLab, Minecraft, Crafter
Robotics	Learn dynamics from cheap/unlabelled data, then plan or train policies without risking hardware (V-JEPA 2, Dreamer on real robots)
Autonomous driving	Predict how a scene evolves to anticipate other agents; latent occupancy forecasting
Generative environments	Genie-style models as infinite, controllable training and evaluation worlds
Planning & reasoning	A learned model is what you search over in MuZero-style planning
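The planning row can be sketched as model-predictive control: imagine every candidate action sequence inside the model, execute only the first action of the best one, then replan. The version below enumerates a coarse action grid exhaustively, which is a simplification chosen for clarity (PlaNet samples sequences with the cross-entropy method, and MuZero uses tree search). The linear `model_step` is a hypothetical stand-in for a learned dynamics and reward model.

```python
import itertools

ACTIONS = (-1.0, -0.5, 0.0, 0.5, 1.0)  # coarse action grid (assumption)


def model_step(s, a):
    """Stand-in for a learned model: returns the predicted next state and reward."""
    s_next = 0.9 * s + 0.5 * a
    return s_next, -s_next ** 2


def imagined_return(s, actions):
    """Sum of predicted rewards along one imagined action sequence."""
    total = 0.0
    for a in actions:
        s, reward = model_step(s, a)
        total += reward
    return total


def plan(s, horizon=3):
    """Search all action sequences in the model and return the first action of the best."""
    best = max(itertools.product(ACTIONS, repeat=horizon),
               key=lambda seq: imagined_return(s, seq))
    return best[0]


first_action = plan(1.0)

# Closed loop: replan at every step (here the model also plays the environment).
s = 1.0
for _ in range(10):
    s, _reward = model_step(s, plan(s))
```

Replanning at every step keeps the effective horizon short, so errors in the model have less room to accumulate than in one long open-loop rollout.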

Limitations and open problems

Compounding error. Tiny per-step prediction mistakes accumulate over long imagined rollouts, so horizons stay short and the model must be refreshed from real states.
Model exploitation. Policies can “hack” the world model’s flaws, scoring imaginary reward that doesn’t transfer — the same Goodhart dynamic as reward hacking, mitigated by noise, short horizons and KL-style constraints.
What to predict. Pixels waste capacity on unpredictable detail; pure latent prediction (JEPA) risks representation collapse and needs care to stay informative.
Long-horizon consistency. Even Genie 3 holds coherence only for minutes; objects drift and worlds forget. Stable, long-lived simulation is unsolved.
Compute. Foundation world models are large and expensive to train and run.

Researcher takes

Danijar Hafner, lead author of the Dreamer line, posted on X about DreamerV3 reaching Nature and becoming the first agent to find diamonds in Minecraft without human data, a milestone for imagination-based RL.

Frequently asked questions

How is a world model different from model-based RL?

It is the modern, learned incarnation of it. Model-based RL is the broad category — any method that learns or uses a dynamics model to plan or train. “World model” specifically connotes a learned, often latent generative model of a high-dimensional environment (pixels, video), and the practice of training policies inside it by imagination.

What does “training in a dream” actually mean?

The policy never touches the real environment during behaviour learning. The world model generates synthetic (“imagined” or “dreamed”) trajectories in latent space, and the actor-critic is trained on those. Real interaction is spent only to improve the model, not to grind the policy — which is why it is so sample-efficient.

Is a large language model a world model?

Partially and controversially. An LLM has absorbed a lot of implicit knowledge about how the world works from text, and can predict consequences in language. But critics like Yann LeCun argue next-token prediction over text is not a grounded, controllable model of physical dynamics — which is exactly what JEPA-style and video world models aim to provide. See the generative vs. predictive debate above.

Do world models replace simulators?

Increasingly they complement and sometimes replace them. Hand-built simulators are accurate but expensive to author and limited in diversity; learned world models (and generative ones like Genie) can be trained from data and produce far more variety — at the cost of fidelity and long-horizon consistency.

Key papers

World Models — Ha & Schmidhuber, 2018 — the VAE + MDN-RNN + controller agent that trains in its dream.
Learning Latent Dynamics for Planning from Pixels (PlaNet) — Hafner et al., 2019 — the Recurrent State-Space Model.
Dream to Control: Learning Behaviors by Latent Imagination — Hafner et al., 2020 — Dreamer.
Mastering Diverse Domains through World Models (DreamerV3) — Hafner et al., 2023; published in Nature, 2025.
Genie: Generative Interactive Environments — Bruce et al., 2024 — world models you can play.
