World Models
Learning to Imagine: Latent Dynamics, Planning in Dream, and Mastering Diverse Domains
The previous chapter showed that access to a perfect simulator transforms the planning problem: given the rules of Go, AlphaZero can search millions of positions per second and reach superhuman strength from self-play alone. But perfect simulators are a luxury that the real world rarely provides. Robots interacting with physical environments, agents navigating markets, or systems exploring molecular space have no closed-form transition function to query — every piece of data must be earned through direct interaction. The natural response is to learn a model of the environment from collected experience and then use that model as a surrogate simulator for planning.
This chapter is concerned with a specific and powerful instantiation of that idea: the world model. A world model is a learned function that approximates the environment’s transition distribution $p(s_{t+1} \mid s_t, a_t)$, but crucially operates in a compact latent space rather than the raw observation space Ha & Schmidhuber, 2018. Instead of predicting the next pixel frame, a high-dimensional and largely redundant signal, the model encodes each observation into a small latent vector and performs all dynamics reasoning there, which results in order-of-magnitude faster planning than in pixel space Hafner et al., 2019; Moerland et al., 2023.
We organise the chapter around three milestones. We begin with MuZero Schrittwieser et al., 2020, which picks up exactly where AlphaZero left off: it brings tree search to settings where the rules are not given, by learning the game engine itself in an abstract latent space that need not decode back to observations. We then turn to the Dreamer family Hafner et al., 2020; Hafner et al., 2021; Hafner et al., 2025, which descends from the earliest modern world models (the V/M/C architecture of Ha and Schmidhuber Ha & Schmidhuber, 2018 and the recurrent state-space model of PlaNet Hafner et al., 2019) and trains an actor-critic entirely inside imagined rollouts of a reconstructive latent model. Finally, we look at where the field is heading: toward pretrained, reusable world models that decouple representation learning from any single task.
MuZero: Learning the Rules from Scratch
Chapter 7 showed that Monte Carlo Tree Search becomes extraordinarily powerful when paired with a learned policy-value network — provided a perfect simulator is available to expand the search tree. MuZero Schrittwieser et al., 2020 removes that proviso. It brings the tree-search machinery of AlphaZero (Chapter 7) to bear on settings where no game engine is given — Atari, where the rules are buried inside an emulator that cannot be queried symbolically, or robotics, where physics governs transitions that have no closed-form description — by learning the engine itself.
From AlphaZero to MuZero
Recall from Chapter 7 that AlphaZero Silver et al., 2018 runs MCTS using the game engine to expand nodes: at every node, the tree policy selects an action, the engine executes it, and the child state is obtained for free. This architecture is superhuman in Go, Chess, and Shogi — but only because the rules are provided. Applying AlphaZero to Atari, where the “rules” are encoded in the emulator and cannot be queried directly, or to a robot, where physics governs transitions that have no closed-form description, is not possible.
MuZero’s key insight is that the agent does not need to predict observations in order to plan effectively. It only needs to predict the three quantities that drive MCTS: rewards (to evaluate paths), values (to estimate leaf nodes), and policies (to guide search). These can all be predicted in an abstract latent space that need not correspond to any observable quantity Schrittwieser et al., 2020.
The Three Learned Functions
MuZero learns three neural networks, shown in Figure 2:
Figure 2: MuZero’s three learned functions. The representation function $h_\theta$ encodes a stack of past observations into an initial latent state $s^0$. The dynamics function $g_\theta$ advances the latent state and predicts the immediate reward $r^k$ given an action, replacing the game engine at every MCTS expansion. The prediction function $f_\theta$ reads the latent state and outputs a policy prior $p^k$ and a value estimate $v^k$, playing the role of AlphaZero’s neural network.
Representation function $h_\theta$:
The representation function encodes a stack of the most recent observations (in Atari, together with the actions that produced them) into an initial hidden state $s^0 = h_\theta(o_1, \dots, o_t)$ from which MCTS begins. It is called once per real environment step and never called during search.
Dynamics function $g_\theta$:
Given a latent state and an action, the dynamics function predicts the immediate reward and the next latent state, $(r^k, s^k) = g_\theta(s^{k-1}, a^k)$. This function replaces the game engine inside MCTS: every node expansion calls $g_\theta$ rather than querying a simulator.
Prediction function $f_\theta$:
The prediction function reads a latent state and returns a policy prior over actions and a scalar value estimate, $(p^k, v^k) = f_\theta(s^k)$. It plays exactly the same role as the policy-value network in AlphaZero.
Neural MCTS in Latent Space
The search procedure in MuZero is identical in structure to AlphaZero’s (see Monte Carlo Tree Search in Chapter 7): PUCT-guided selection, expansion, and backup, with the prediction function providing policy priors and value estimates. The only difference is that node expansion calls the learned dynamics function $g_\theta$ instead of the game engine, as illustrated in Figure 3.
Figure 3: MuZero MCTS in latent space: AlphaZero queries the game engine at every expansion; MuZero queries the learned dynamics function instead. The PUCT selection rule, backup, and recommended-action extraction (argmax over visit counts) are identical. The true environment is never queried during search. Cf. Schrittwieser et al., 2020.
At each real step, MuZero first encodes the current observation stack with $h_\theta$ to obtain $s^0$, then runs MCTS simulations. Each simulation selects a leaf by PUCT:

$$a^k = \arg\max_a \left[ Q(s, a) + c \, P(s, a) \, \frac{\sqrt{\sum_b N(s, b)}}{1 + N(s, a)} \right],$$

expands the leaf by calling $g_\theta$ and $f_\theta$, and backs up value estimates along the visited path. Here $c$ stands in for a visit-count-dependent exploration term: the original algorithm (Equation 2 of Schrittwieser et al. (2020)) grows the exploration bonus logarithmically as a node is visited more often, which we fold into a single constant for clarity. The value backup in MuZero differs from AlphaZero’s in one important detail: it incorporates the predicted rewards accumulated along the unrolled path,

$$G^k = \sum_{\tau=0}^{l-1-k} \gamma^{\tau} r_{k+1+\tau} + \gamma^{l-k} v^l,$$

where $l$ is the depth of the expanded leaf, the rewards $r_{k+1+\tau}$ are those predicted after each action along the path, and $v^l$ is the value head output at the leaf. AlphaZero’s search can back up the true game outcome (win/loss) whenever it reaches a terminal position; MuZero always bootstraps from the value head because latent trajectories never reach a terminal state in the traditional sense. Terminal states are instead treated as absorbing during training, so the value head learns to emit a constant value beyond episode end.
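The selection rule and the reward-aware backup described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: the function names, the flat-list representation of edge statistics, and the constant `c=1.25` are assumptions made here, and the logarithmic growth of the exploration term is deliberately left out, as in the text.

```python
import math


def puct_select(q, prior, visits, c=1.25):
    """Pick the action maximising Q + c * P * sqrt(sum of visits) / (1 + N).

    q, prior and visits are per-action lists for one node (hypothetical layout).
    """
    total = sum(visits)
    scores = [
        q[a] + c * prior[a] * math.sqrt(total) / (1 + visits[a])
        for a in range(len(q))
    ]
    return max(range(len(scores)), key=scores.__getitem__)


def backup_returns(rewards, leaf_value, gamma=0.997):
    """Discounted return seen from every node on the search path, root first.

    rewards[i] is the reward predicted for the i-th action on the path;
    leaf_value is the value-head output at the newly expanded leaf.
    """
    g = leaf_value
    returns = [g]  # the leaf itself only has its bootstrapped value
    for r in reversed(rewards):
        g = r + gamma * g  # fold in the reward predicted on the way down
        returns.append(g)
    returns.reverse()
    return returns
```

With a rarely visited second action, `puct_select([0.5, 0.0], [0.5, 0.5], [10, 0])` prefers that action even though its current value estimate is lower, which is the exploration behaviour the bonus term is meant to produce.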
Training
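During training, the MuZero network is unrolled for $K$ hypothetical steps on sequences sampled from the replay buffer Schrittwieser et al., 2020. Fix a real time index $t$: the representation function receives past observations $o_1, \dots, o_t$, and the dynamics function is fed the real actions $a_{t+1}, \dots, a_{t+K}$. At each unroll step $k = 0, \dots, K$, the combined model predicts

$$(p_t^k, v_t^k, r_t^k) = \mu_\theta(o_1, \dots, o_t, a_{t+1}, \dots, a_{t+k}),$$

where $\mu_\theta$ chains $h_\theta$, $g_\theta$, and $f_\theta$. Each step is aligned with stored MCTS targets $\pi_{t+k}$, environment rewards $u_{t+k}$, and value targets $z_{t+k}$. Following Figure S2 of Schrittwieser et al. (2020),

$$z_t = \begin{cases} u_T & \text{board games} \\ u_{t+1} + \gamma u_{t+2} + \dots + \gamma^{n-1} u_{t+n} + \gamma^{n} \nu_{t+n} & \text{Atari.} \end{cases}$$

In board games, final outcomes are encoded as rewards at the terminal step $T$, and the reward head is not trained ($\ell^r = 0$). In Atari, $n = 10$ and $\nu_{t+n}$ is the MCTS value estimate at the bootstrap step. The per-trajectory loss (up to an L2 weight-decay term) is

$$\ell_t(\theta) = \sum_{k=0}^{K} \left[ \ell^p(\pi_{t+k}, p_t^k) + \ell^v(z_{t+k}, v_t^k) + \ell^r(u_{t+k}, r_t^k) \right],$$

with the three head losses defined as

$$\ell^p(\pi, p) = -\pi^\top \log p, \qquad \ell^v(z, q) = \begin{cases} (z - q)^2 & \text{board games} \\ -\phi(z)^\top \log \mathbf{q} & \text{Atari} \end{cases}, \qquad \ell^r(u, r) = \begin{cases} 0 & \text{board games} \\ -\phi(u)^\top \log \mathbf{r} & \text{Atari.} \end{cases}$$

Here $\mathbf{r}$ and $\mathbf{q}$ are the softmax distributions over a fixed 601-integer support ($\{-300, \dots, 300\}$) emitted by the reward and value heads; $\phi$ encodes a scalar target as a convex combination of its two adjacent integer supports (e.g. a target of 3.7 becomes weight 0.3 on support 3 and 0.7 on support 4), after the invertible square-root rescaling of the raw targets, $h(x) = \operatorname{sign}(x)\left(\sqrt{|x| + 1} - 1\right) + \epsilon x$, described in Appendix F of Schrittwieser et al. (2020) (DreamerV3, below, later replaces this with the related but distinct symlog transform). Squared error on rewards and values sufficed for the bounded board-game targets; cross-entropy against this support representation was found more stable when scales vary widely in Atari. The cross-entropy heads $\ell^p$, $\ell^v$, and $\ell^r$ each carry a leading minus, so minimising them maximises the log-likelihood of the targets, matching AlphaZero’s policy loss (Equation (13)). To keep gradient magnitude roughly constant across unroll depths, each head loss is scaled by $1/K$ and the gradient entering the dynamics function is additionally scaled by $1/2$ Schrittwieser et al., 2020.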
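The two-step target encoding described above (square-root rescaling, then a split across the two neighbouring integer supports) is easy to check numerically. The sketch below is an illustration under stated assumptions: the names `h`, `h_inv`, `two_hot` and `expected_value` are chosen here, and the constant `EPS = 0.001` is the value I recall from the paper's appendix rather than something stated in this chapter.

```python
import math

EPS = 0.001  # assumed small linear coefficient that keeps the transform invertible


def h(x):
    """Square-root rescaling applied to raw reward and value targets."""
    return math.copysign(1.0, x) * (math.sqrt(abs(x) + 1.0) - 1.0) + EPS * x


def h_inv(y):
    """Closed-form inverse of h, used to map predictions back to raw scale."""
    root = (math.sqrt(1.0 + 4.0 * EPS * (abs(y) + 1.0 + EPS)) - 1.0) / (2.0 * EPS)
    return math.copysign(1.0, y) * (root * root - 1.0)


def two_hot(x, lo=-300, hi=300):
    """Encode a scalar as weights on its two adjacent integer supports."""
    x = min(max(x, lo), hi)  # clamp into the support range
    probs = [0.0] * (hi - lo + 1)
    lower = math.floor(x)
    frac = x - lower
    probs[lower - lo] = 1.0 - frac
    if frac > 0.0:
        probs[lower - lo + 1] = frac
    return probs


def expected_value(probs, lo=-300):
    """Decode a distribution over the support back to a scalar."""
    return sum(p * (lo + i) for i, p in enumerate(probs))
```

A training target for a raw reward `u` would then be `two_hot(h(u))`, and a predicted distribution is decoded with `h_inv(expected_value(probs))`. Calling `two_hot(3.7)` directly reproduces the worked example in the text: weight 0.3 on support 3 and 0.7 on support 4.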
Results
In board games, MuZero matches AlphaZero on Go, Chess, and Shogi despite never being told the rules — it learns the equivalent of the game engine from self-play data. On the 57-game Atari suite, MuZero sets a new state of the art, surpassing model-free Rainbow; its sample-efficient MuZero Reanalyze variant reaches a 731% median human-normalised score from only 200M frames. A particularly striking finding, in Go, is that MuZero’s MCTS over its learned model matches — and at shorter search budgets slightly exceeds — search with the ground-truth simulator (AlphaZero), despite evaluating each node with fewer residual blocks (16 vs. 20). This suggests MuZero caches computation in the search tree, using each application of the dynamics model to deepen its understanding of the position Schrittwieser et al., 2020.
EfficientZero: Closing the Data-Efficiency Gap
MuZero’s superhuman Atari performance comes at a steep data cost — hundreds of millions of frames per game to reach its reported scores, or weeks of continuous play, whereas a human can master most Atari games in under two hours Ye et al., 2021. EfficientZero Ye et al., 2021 closes most of that gap with three targeted additions to MuZero, reaching above-human performance from the Atari 100k budget — 100 000 agent steps, about 400K frames or roughly two hours of play — for the first time, using on the order of 500× less data than a DQN trained on 200M frames.
First, a self-supervised consistency loss, a SimSiam-style objective, directly aligns the dynamics function’s predicted latent, $\hat{s}_{t+1}$, with the latent that the representation function produces from the actual next observation, $s_{t+1}$. This supplies the dynamics model with a dense learning signal it otherwise receives only indirectly through reward and value losses (the stop-gradient on the target latent is essential, otherwise the representation could trivially collapse to a constant).
Second, a value-prefix head uses a small LSTM to predict the cumulative discounted reward over the unroll before bootstrapping, separating short-term credit assignment from the long-term value estimate and making both easier to learn.
Third, model-based off-policy correction addresses value targets that grow stale as the policy improves: for older replayed transitions, the $n$-step value target is computed over a shortened horizon and the remaining steps are filled in by re-running the search with the current model, so the bootstrap reflects up-to-date value estimates rather than the outdated behaviour policy that originally generated the data.
Together these let EfficientZero reach a median human-normalised score above 1.0 on the Atari 100k benchmark — roughly two hours of play, where MuZero remains below human level — effectively closing the data-efficiency gap with human learning speed Ye et al., 2021.
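The first of the three additions, the consistency loss, can be illustrated with the similarity measure that SimSiam-style objectives use. The sketch below is a simplification under stated assumptions: it shows only the negative cosine similarity between a predicted latent and a target latent, it omits the projection and prediction heads such objectives normally add, and the function name is chosen here. In an autodiff framework the `target` argument would be wrapped in a stop-gradient; plain Python has no gradients, so that step appears only as a comment.

```python
import math


def negative_cosine_similarity(pred, target):
    """Consistency loss between a predicted latent and a target latent.

    `target` plays the role of the latent encoded from the real next
    observation; under autodiff it would be detached (stop-gradient).
    """
    dot = sum(p * t for p, t in zip(pred, target))
    norm_pred = math.sqrt(sum(p * p for p in pred))
    norm_target = math.sqrt(sum(t * t for t in target))
    return -dot / (norm_pred * norm_target)
```

The loss reaches its minimum of -1 when the two latents point in the same direction and is insensitive to their magnitudes, so the dynamics model is rewarded for predicting the direction of the encoded next state rather than its scale.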
Dreamer: Imagination-Based Actor-Critic
MuZero shapes its latent space purely to serve tree search. The Dreamer family takes the opposite stance: it learns a latent transition model and uses it to train a policy by gradient-based latent imagination rather than search. The lineage runs back to the first modern world models.
Ha and Schmidhuber Ha & Schmidhuber, 2018 introduced the idea in 2018: compress each observation into a compact latent code, learn a recurrent model of how that code evolves, and train a small controller entirely inside the model’s own imagined rollouts — “dreaming” — before deploying it in the real environment. PlaNet Hafner et al., 2019 turned this template into the Recurrent State Space Model (RSSM) — the world model used by the Dreamer family (V1–3) — and solved continuous-control tasks from pixels by online planning directly in latent space, with no learned policy at all. The catch is cost: the planner re-optimises a fresh distribution over action sequences at every decision, requiring hundreds of latent rollouts per step. Dreamer Hafner et al., 2020 keeps PlaNet’s RSSM but amortises that planning into a policy: instead of searching at decision time, it trains an actor-critic inside imagined rollouts of the model, yielding a policy that acts in a single forward pass at test time.
Following the classical decomposition of agents that learn in imagination, the Dreamer agent interleaves three operations, which structure the rest of this section: fitting the RSSM world model to past experience, behavior learning inside imagined rollouts of that frozen model, and environment interaction to gather fresh experience.
RSSM World Model
Dreamer Hafner et al., 2020 couples its world model with an actor and a critic, introduced in the next subsection, all operating entirely in latent space. We begin with the world model itself. Dreamer’s world model is PlaNet’s Recurrent State Space Model (RSSM), a generative model of observation sequences built around a single compact model state $s_t$ with Markovian transitions: given $s_{t-1}$ and the action $a_{t-1}$, the model predicts $s_t$, and from it predicts the reward. The model is fitted to past experience by representation learning: encoding real observations into model states from which it can reconstruct the observation and predict the reward, as shown in Figure 4.
Figure 4: Learning the RSSM from experience (the “learn dynamics” component of Dreamer), unrolled over two steps. Each model state $s_t$ (orange) can be formed two ways. The transition/prior model (blue box, dashed arrow) predicts the next state from the previous state and action without seeing the observation; this is what lets the model imagine forward in latent space. When a real observation $o_t$ (green) is available, an encoder (purple) feeds the representation/posterior model (blue box, solid arrow), which corrects the prior. From each model state a decoder reconstructs the observation and a reward model (orange box) predicts $r_t$. All heads are fit jointly by maximising the sequence ELBO (Equation (13)), so the model state learns to capture exactly what is needed to reconstruct observations and predict rewards; the KL term pulls the prior toward the posterior, keeping imagination accurate when observations are absent.
Following the original Dreamer paper, we treat $s_t$ as a single latent variable and describe the world model by the conditional distributions it learns:

$$\begin{aligned} \text{Representation model:} \quad & p_\theta(s_t \mid s_{t-1}, a_{t-1}, o_t) \\ \text{Transition model:} \quad & q_\theta(s_t \mid s_{t-1}, a_{t-1}) \\ \text{Reward model:} \quad & q_\theta(r_t \mid s_t) \\ \text{Observation model:} \quad & q_\theta(o_t \mid s_t) \end{aligned}$$

Here, $p_\theta$ denotes the representation model (the posterior that infers the state with access to the observation) and $q_\theta$ denotes the transition model (the prior that predicts the next state from $s_{t-1}$ and $a_{t-1}$ without an observation), together with the reward and observation heads that read off $s_t$. The transition is what lets the model imagine forward in latent space without decoding pixels; the posterior is used only when a real observation is available, to “correct” the prior. (For tasks with early termination, Dreamer V1 additionally predicts a discount factor from each model state to down-weight imagined steps beyond a likely episode end; this “continue” predictor is not part of the core continuous-control model, but it becomes a standard RSSM head from DreamerV2 onward.)
Both prior and posterior are diagonal Gaussians whose parameters are produced by small networks; the reward head is an MLP and the observation decoder is a transposed CNN. The table below summarises the modules and their roles in the sequence ELBO of Equation (13).
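| Component | Input → output | Architecture | Objective |
|---|---|---|---|
| Representation / posterior | $(s_{t-1}, a_{t-1}, o_t) \to p_\theta(s_t \mid s_{t-1}, a_{t-1}, o_t)$ | recurrent net + MLP → diagonal Gaussian | KL term (first argument): inference when $o_t$ is available (training) |
| Transition / prior | $(s_{t-1}, a_{t-1}) \to q_\theta(s_t \mid s_{t-1}, a_{t-1})$ | recurrent net + MLP → diagonal Gaussian | KL term (second argument): predicts the next state without an observation (imagination) |
| Observation decoder | $s_t \to q_\theta(o_t \mid s_t)$ | Convolutional decoder | $\ln q_\theta(o_t \mid s_t)$: pixel reconstruction |
| Reward model | $s_t \to q_\theta(r_t \mid s_t)$ | MLP (Gaussian likelihood) | $\ln q_\theta(r_t \mid s_t)$: scalar reward prediction |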
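All modules are trained jointly by maximising a sequence-level Evidence Lower Bound (ELBO):

$$\mathcal{J}(\theta) = \mathbb{E}_{p_\theta}\left[ \sum_t \Big( \ln q_\theta(o_t \mid s_t) + \ln q_\theta(r_t \mid s_t) - \mathrm{KL}\big[\, p_\theta(s_t \mid s_{t-1}, a_{t-1}, o_t) \,\big\|\, q_\theta(s_t \mid s_{t-1}, a_{t-1}) \,\big] \Big) \right] \tag{13}$$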
The reconstruction terms push the model to predict both observations and rewards accurately. The KL term regularises the posterior toward the prior — ensuring that the prior, which must operate without observations during imagination, carries enough information to support useful predictions.
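To make the three terms of the bound concrete, the following sketch evaluates a single time step of such an objective for diagonal Gaussian posterior and prior. It is an illustration under stated assumptions, not Dreamer's code: the function names are chosen here, observations are short vectors rather than images, and both likelihoods are unit-variance Gaussians.

```python
import math


def gaussian_kl(mu_p, std_p, mu_q, std_q):
    """KL(N(mu_p, std_p^2) || N(mu_q, std_q^2)) for diagonal Gaussians, summed over dimensions."""
    return sum(
        math.log(sq / sp) + (sp * sp + (mp - mq) ** 2) / (2.0 * sq * sq) - 0.5
        for mp, sp, mq, sq in zip(mu_p, std_p, mu_q, std_q)
    )


def gaussian_log_prob(x, mean, std=1.0):
    """Log-density of x under an isotropic Gaussian centred at `mean`."""
    return sum(
        -0.5 * ((xi - mi) / std) ** 2 - math.log(std) - 0.5 * math.log(2.0 * math.pi)
        for xi, mi in zip(x, mean)
    )


def elbo_step(obs, obs_mean, reward, reward_mean, posterior, prior):
    """One time step of the bound: reconstruction + reward - KL(posterior || prior).

    posterior and prior are (mean_list, std_list) pairs (hypothetical layout).
    """
    reconstruction = gaussian_log_prob(obs, obs_mean)
    reward_term = gaussian_log_prob([reward], [reward_mean])
    kl = gaussian_kl(posterior[0], posterior[1], prior[0], prior[1])
    return reconstruction + reward_term - kl
```

Holding the predictions fixed, moving the prior mean away from the posterior lowers the bound by exactly the added KL, which is the pressure that keeps the prior usable when no observation is available.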
Behavior learning
With the world model fitted and held fixed, behaviour is learned entirely in imagination; this is the “learn behavior” component of Dreamer. Starting from posterior states inferred on real data, the agent imagines $H$ steps into the future inside the world model, without interacting with the environment: the actor proposes actions, the world model advances the latent state, and the reward predictor supplies rewards. The actor and critic are updated on these purely imagined trajectories.
The key innovation is the gradient estimator used to train the actor. Because the world model and the actor are both differentiable, gradients can flow back through imagined states and rewards directly to the actor’s parameters, so no REINFORCE estimator is needed. Concretely, the actor samples actions via reparameterisation:

$$a_\tau = \tanh\big( \mu_\phi(s_\tau) + \sigma_\phi(s_\tau) \, \epsilon \big), \qquad \epsilon \sim \mathcal{N}(0, I), \tag{14}$$

so the gradient $\partial a_\tau / \partial \phi$ exists analytically. The imagined next state $s_{\tau+1}$, drawn from the transition model, is also reparameterised, so the gradient continues flowing backwards through the dynamics.
Figure 5: Behavior learning inside imagined rollouts. An $H$-step trajectory of latent states (orange) is rolled forward by the frozen world model’s dynamics (the prior). At each step the actor (blue) proposes an action via reparameterisation, a reward head predicts $r_\tau$ (red), and the critic (orange) estimates the state value. The $\lambda$-returns (red dashed, backward) serve as critic targets, while the actor’s analytic gradient (green) flows backward through the latent dynamics. A stop-gradient keeps the critic target from interfering with the critic’s own gradient.
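The critic is trained to predict $\lambda$-returns, a weighted mixture of $n$-step returns that trades off bias against variance Sutton & Barto, 2018:

$$V_\lambda(s_\tau) = (1 - \lambda) \sum_{n=1}^{H-1} \lambda^{n-1} V_N^n(s_\tau) + \lambda^{H-1} V_N^H(s_\tau), \qquad V_N^k(s_\tau) = \mathbb{E}\left[ \sum_{i=\tau}^{h-1} \gamma^{i-\tau} r_i + \gamma^{h-\tau} v_\psi(s_h) \right], \quad h = \min(\tau + k, \, t + H), \tag{15}$$

where each $k$-step estimate $V_N^k$ sums imagined rewards and bootstraps from the critic $v_\psi$. The actor and critic are then trained cooperatively, as in policy iteration: the actor maximises the predicted returns while the critic minimises its squared distance to them,

$$\max_\phi \; \mathbb{E}\left[ \sum_{\tau=t}^{t+H} V_\lambda(s_\tau) \right], \tag{16}$$

$$\min_\psi \; \mathbb{E}\left[ \sum_{\tau=t}^{t+H} \tfrac{1}{2} \big\lVert v_\psi(s_\tau) - \mathrm{sg}\big( V_\lambda(s_\tau) \big) \big\rVert^2 \right], \tag{17}$$

where $\mathrm{sg}(\cdot)$ is the stop-gradient: the critic regresses toward the $\lambda$-return as a fixed target, while the actor maximises it by propagating analytic gradients back through the learned dynamics (Equation (14)).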
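The weighted mixture of multi-step returns used as the critic target can be computed with a single backward pass over an imagined trajectory. The sketch below is an illustration under stated assumptions: the function names and the list layout (`rewards[i]` earned at imagined step `i`, `values[i]` the critic estimate at step `i`, with one extra entry for the final state) are chosen here, and the backward recursion is the standard way of evaluating a truncated mixture rather than something this chapter prescribes. The second and third functions spell out the mixture explicitly so the two forms can be compared.

```python
def lambda_returns(rewards, values, gamma=0.99, lam=0.95):
    """Backward recursion for truncated lambda-returns over one imagined rollout."""
    horizon = len(rewards)
    assert len(values) == horizon + 1
    returns = [0.0] * horizon
    running = values[horizon]  # bootstrap from the critic at the final state
    for i in reversed(range(horizon)):
        running = rewards[i] + gamma * ((1.0 - lam) * values[i + 1] + lam * running)
        returns[i] = running
    return returns


def n_step_return(rewards, values, start, n, gamma=0.99):
    """Sum n discounted rewards from `start`, then bootstrap (capped at the horizon)."""
    end = min(start + n, len(rewards))
    total = sum(gamma ** (i - start) * rewards[i] for i in range(start, end))
    return total + gamma ** (end - start) * values[end]


def lambda_return_mixture(rewards, values, start, gamma=0.99, lam=0.95):
    """Explicit exponentially weighted mixture of n-step returns."""
    steps = len(rewards) - start  # remaining imagined steps
    mix = sum(
        (1.0 - lam) * lam ** (n - 1) * n_step_return(rewards, values, start, n, gamma)
        for n in range(1, steps)
    )
    return mix + lam ** (steps - 1) * n_step_return(rewards, values, start, steps, gamma)
```

Setting `lam=0` reduces every target to a one-step bootstrapped return, while `lam=1` gives the full discounted sum of imagined rewards plus a single bootstrap at the horizon; intermediate values interpolate between the two.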
Environment interaction
Imagination alone cannot improve the agent indefinitely: the world model is only accurate where it has seen data, so the agent must periodically act in the real environment to collect fresh experience. This is the third Dreamer component, shown in Figure 6. To choose an action, the agent runs the RSSM forward over the history of the current episode, encoding each real observation through the posterior to obtain the model state $s_t$, and queries the actor $q_\phi(a_t \mid s_t)$, adding a small amount of exploration noise. Executing that action in the environment yields the next observation and reward, and the transition is appended to a replay buffer $\mathcal{D}$.
Figure 6: Environment interaction (the “act” component of Dreamer). The agent encodes the episode history into a model state via the RSSM posterior, the actor proposes an action (with added exploration noise), and executing it in the environment produces the next observation and reward. Each transition is stored in the replay buffer $\mathcal{D}$, from which batches are later drawn to refit the world model and the actor-critic.
The three components run in a single loop. The buffer is seeded with a handful of episodes collected under a random policy; thereafter the agent alternates between a fixed number of training updates (fitting the world model on sampled batches, then the actor-critic on rollouts imagined from those batches) and collecting one fresh episode with the current policy. Because behaviour is learned almost entirely from imagined rollouts rather than from the replayed transitions directly, Dreamer is strikingly sample-efficient: it substantially improved over PlaNet on the DeepMind Control Suite from pixels, and exceeded the strongest model-free baseline (D4PG) in final performance while using roughly 20× fewer environment steps, reaching an average score of 823 in $5 \times 10^6$ steps, against D4PG’s 786 in $10^8$ steps Hafner et al., 2020.
The Dreamer Algorithm
Putting the three components together gives the full Dreamer agent of Hafner et al. (2020) (Algorithm 1 in the paper, with notation matched to this chapter):
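1. Initialize the dataset $\mathcal{D}$ with $S$ random seed episodes.
2. Initialize the world-model parameters $\theta$, actor parameters $\phi$, and critic parameters $\psi$ randomly.
3. While not converged do
    1. For update step $c = 1, \dots, C$ do
        1. Dynamics learning. Draw $B$ data sequences $\{(a_t, o_t, r_t)\}_{t=k}^{k+L} \sim \mathcal{D}$.
        2. Compute model states $s_t \sim p_\theta(s_t \mid s_{t-1}, a_{t-1}, o_t)$ along each sequence from the representation/posterior model.
        3. Update $\theta$ with one gradient step on the sequence ELBO (Equation (13)).
        4. Behavior learning. From each posterior state $s_t$, imagine an $H$-step trajectory $\{(s_\tau, a_\tau)\}_{\tau=t}^{t+H}$ under the frozen world model, with actions $a_\tau \sim q_\phi(a_\tau \mid s_\tau)$.
        5. Predict imagined rewards $\mathbb{E}[q_\theta(r_\tau \mid s_\tau)]$ and values $v_\psi(s_\tau)$, and compute the $\lambda$-returns $V_\lambda(s_\tau)$ (Equation (15)).
        6. Update the actor: $\phi \leftarrow \phi + \alpha \nabla_\phi \sum_{\tau=t}^{t+H} V_\lambda(s_\tau)$ (Equation (16)).
        7. Update the critic: $\psi \leftarrow \psi - \alpha \nabla_\psi \sum_{\tau=t}^{t+H} \tfrac{1}{2} \lVert v_\psi(s_\tau) - V_\lambda(s_\tau) \rVert^2$ (Equation (17)).
    2. Environment interaction. Reset the environment: $o_1 \leftarrow \texttt{env.reset}()$.
    3. For time step $t = 1, \dots, T$ do
        1. Infer the model state $s_t \sim p_\theta(s_t \mid s_{t-1}, a_{t-1}, o_t)$ from the history via the posterior.
        2. Draw an action $a_t \sim q_\phi(a_t \mid s_t)$ and add exploration noise.
        3. Step the environment: $r_t, o_{t+1} \leftarrow \texttt{env.step}(a_t)$.
    4. Add the episode to the dataset: $\mathcal{D} \leftarrow \mathcal{D} \cup \{(o_t, a_t, r_t)\}_{t=1}^{T}$.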
The inner loop ($c = 1, \dots, C$) interleaves dynamics learning and behavior learning, both consuming only batches drawn from $\mathcal{D}$; the outer loop then collects a single fresh episode with the current actor before refilling the inner loop. With one fixed setting across all continuous tasks ($S = 5$ seed episodes, $C = 100$ updates per collected episode, $B = 50$ sequences of length $L = 50$, imagination horizon $H = 15$, and $\lambda$-return parameters $\gamma = 0.99$, $\lambda = 0.95$), this loop is what reaches the sample efficiency reported above Hafner et al., 2020.
DreamerV2
Dreamer V1 was designed for continuous-control tasks with continuous Gaussian latents. Extending it to Atari — where observations are game frames, rewards are sparse, and the underlying dynamics have a discrete, switchlike character — required rethinking the latent representation. DreamerV2 Hafner et al., 2021 introduces three targeted changes to the world model of Figure 7.
Figure 7: The DreamerV2 world model, unrolled over three steps. A deterministic recurrent state $h_t$ (blue) is carried forward by a GRU from the previous recurrent state, the previous stochastic latent, and the action (gray). At each step the stochastic latent comes in two versions: the posterior $z_t$ (orange), encoded from the real observation (green) through the encoder (purple), and the prior $\hat{z}_t$ (orange), predicted from $h_t$ without the observation. The min KL arrows pull the prior toward the posterior (KL balancing). From $(h_t, z_t)$ a reward head predicts the reward and a decoder reconstructs the observation (green). The inset shows DreamerV2’s defining change: each latent is 32 categorical variables of 32 classes each (a stack of one-hot codes, with over $10^{48}$ possible configurations in total), replacing DreamerV1’s continuous Gaussian and sampled with straight-through gradients. Cf. Hafner et al., 2021.
Change 1: Categorical latents. DreamerV2 replaces the 30-dimensional continuous Gaussian latent with 32 categorical random variables, each taking one of 32 classes and sampled from its own Categorical(32) distribution. The full latent can take $32^{32} \approx 1.5 \times 10^{48}$ distinct values, far more than the Gaussian’s effective support, and categorical distributions naturally represent the discrete mode-switching dynamics that dominate Atari environments (a character picks up an item, the game transitions to a new level, a shot destroys a ship).
Change 2: Straight-through gradients. Drawing a one-hot sample is not differentiable, so reparameterisation cannot be applied directly to categorical samples. DreamerV2 uses the straight-through gradient estimator (a standard trick for discrete latents, going back to Bengio et al. 2013 and related to the Gumbel-softmax of Jang et al. 2017): in the forward pass, $z_t$ is sampled as a one-hot vector from the softmax of learned logits; in the backward pass, the gradient is passed through the logits as if the one-hot sample were identical to the softmax output. This is a biased but low-variance estimator that has proven effective in practice for discrete latent variable models.
Change 3: KL balancing. The KL term in the ELBO plays two roles at once: it trains the prior (the transition model) toward the posterior, and it regularises the posterior toward the prior. Coupling these is harmful, because the transition prior is hard to learn and the posterior should not be dragged toward a poorly trained prior. DreamerV2 decouples the two directions with an asymmetric KL loss:

$$\mathcal{L}_{\mathrm{KL}} = \alpha \, \mathrm{KL}\big[\, \mathrm{sg}(\text{posterior}) \,\big\|\, \text{prior} \,\big] + (1 - \alpha) \, \mathrm{KL}\big[\, \text{posterior} \,\big\|\, \mathrm{sg}(\text{prior}) \,\big],$$

where $\alpha = 0.8$. The stop-gradient ($\mathrm{sg}$) on the posterior in the first term means that 80% of the KL gradient flows only into the prior, pushing the prior to match the more-informed posterior. The remaining 20%, with the stop-gradient on the prior, updates the posterior toward the prior in the usual direction. This asymmetry encourages the prior to carry rich information, acting as a compressed world model, while the posterior retains the freedom to diverge from the prior when the observation is informative.
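The straight-through estimator from the second change is easiest to see with the forward and backward passes written out by hand. The sketch below is a plain-Python illustration under stated assumptions: the function names are chosen here, a single categorical variable stands in for the full stack of latents, and the backward pass is the softmax Jacobian applied to an upstream gradient, which is what treating the sample as if it were the softmax output amounts to.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_one_hot(probs, rng):
    """Forward pass: draw one class and return it as a one-hot vector."""
    index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return [1.0 if i == index else 0.0 for i in range(len(probs))]


def straight_through_logit_grad(probs, upstream):
    """Backward pass: route the gradient through the softmax instead of the sample.

    upstream[i] is dLoss/dSample[i]; the result is dLoss/dLogit[j].
    """
    dot = sum(p * u for p, u in zip(probs, upstream))
    return [p * (u - dot) for p, u in zip(probs, upstream)]
```

The forward value is always a hard one-hot code, so downstream networks only ever see discrete latents, while the logits still receive a dense gradient. The bias mentioned in the text comes from the mismatch between the hard sample used forward and the soft probabilities used backward.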
With these three changes, DreamerV2 matched or exceeded Rainbow — a carefully tuned model-free deep Q-learning baseline — on the 200M-frame Atari benchmark (55 games), becoming the first agent to reach human-level Atari performance by learning behaviour purely inside a separately trained world model Hafner et al., 2021.
DreamerV3
DreamerV1 and V2 each required task-specific tuning of reward scaling, KL coefficients, and return normalisation. DreamerV3 Hafner et al., 2025 sets a more ambitious goal: a single agent with a single fixed set of hyperparameters that performs well across a qualitatively diverse collection of domains — Atari games, continuous locomotion from the DeepMind Control Suite, partially observable memory tasks from BSuite Osband et al., 2020, robotic manipulation, and the open-ended Minecraft environment.
Figure 8: DreamerV3’s universal agent. The same network and hyperparameters are applied across qualitatively different domain families without per-domain tuning. Three innovations enable this domain invariance: symlog transforms stabilise reconstruction and return targets across wildly different reward scales; free bits prevent latent collapse in environments with low-information observations; and percentile return normalisation makes the actor loss robust to varying reward magnitudes. The result is the first agent to collect a diamond in Minecraft from scratch.
The core challenge in building a universal agent is that the statistics of different environments vary enormously: Atari rewards range into the hundreds per frame; some continuous control tasks produce negative rewards throughout; Minecraft episodes last thousands of steps with sparse, delayed reward. A single set of loss coefficients cannot serve all of these simultaneously without explicit mechanisms to normalise the relevant quantities.
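Symlog transformations. DreamerV3 applies a symmetric logarithmic transform to network inputs, reconstruction targets, and return estimates:

$$\operatorname{symlog}(x) = \operatorname{sign}(x) \ln\big( |x| + 1 \big), \qquad \operatorname{symexp}(x) = \operatorname{sign}(x) \big( \exp(|x|) - 1 \big).$$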
Symlog compresses large values while preserving the sign and leaving small values near-linear. It is applied to the observation encoder inputs, to the targets of the observation decoder, and to return targets before they enter the actor and critic losses. The inverse transform is applied to the decoder output during generation. This single change eliminates the need to hand-tune observation and reward normalisation coefficients across domains.
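The behaviour described above (sign preserved, small values nearly unchanged, large values compressed) can be checked directly. This is a minimal sketch using only the standard library; the function names mirror the usual names of the transform and its inverse, and `log1p`/`expm1` are used here simply for numerical accuracy near zero.

```python
import math


def symlog(x):
    """Symmetric log: sign(x) * ln(|x| + 1)."""
    return math.copysign(math.log1p(abs(x)), x)


def symexp(y):
    """Inverse of symlog: sign(y) * (exp(|y|) - 1)."""
    return math.copysign(math.expm1(abs(y)), y)
```

For instance, `symlog(0.01)` stays within about 0.00005 of its input, while `symlog(1000.0)` is roughly 6.9, so rewards that differ by five orders of magnitude end up within one order of magnitude of each other after the transform.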
Free bits. Posterior collapse, where the RSSM latent carries no information about the observation, is a persistent hazard in environments where observations are nearly deterministic given the action history. DreamerV3 prevents collapse by clipping the per-category KL from below at a minimum value of 1 nat:

$$\mathcal{L}_{\mathrm{KL}} = \max\big( 1, \; \mathrm{KL}[\, \text{posterior} \,\|\, \text{prior} \,] \big).$$

This “free bits” mechanism Hafner et al., 2025 ensures the model always devotes at least 1 nat of capacity per latent category to the observation, regardless of how well the prior already predicts it. (In full, DreamerV3 applies the clip to two separately weighted KL terms: a dynamics loss training the prior, $\mathcal{L}_{\mathrm{dyn}}$, and a representation loss training the posterior, $\mathcal{L}_{\mathrm{rep}}$ (alongside the unclipped prediction loss, $\mathcal{L}_{\mathrm{pred}}$), carrying V2’s KL balancing forward; the free-bits clip is the ingredient that prevents collapse.)
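The clip itself is a one-liner, but it helps to see it next to the quantity it acts on. The sketch below is an illustration under stated assumptions: the function names are chosen here, a single categorical variable stands in for the latent, and whether the clip is applied per latent variable or to a summed KL is left to the caller rather than settled by this code.

```python
import math


def categorical_kl(p, q):
    """KL(p || q) in nats between two categorical distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def free_bits(kl, free_nats=1.0):
    """Clip a KL value from below; under the threshold the loss is constant."""
    return max(free_nats, kl)
```

When the KL is already below the threshold, `free_bits` returns a constant, so under automatic differentiation the term contributes no gradient and the optimiser stops squeezing the remaining information out of the latent. Above the threshold the KL passes through unchanged.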
Percentile return normalisation. The actor’s policy gradient is proportional to the advantage, the difference between the $\lambda$-return and the critic’s baseline. If returns vary over several orders of magnitude across an episode (as they do in Minecraft, where reward is extremely sparse for thousands of steps and then suddenly large upon collecting a rare resource), a fixed learning rate produces wildly inconsistent update magnitudes. DreamerV3 normalises the return signal by the inter-percentile range of recent returns:

$$\tilde{R}^\lambda_t = \frac{R^\lambda_t}{\max(1, S)}, \qquad S = \operatorname{Per}(R^\lambda_t, 95) - \operatorname{Per}(R^\lambda_t, 5),$$

where the percentile range is tracked as an exponential moving average over training batches. DreamerV3 only rescales the returns by the inter-percentile range; it does not subtract a baseline here, since the critic already supplies one. The denominator is clipped from below at 1 to avoid division by near-zero in the early stages of training. This normalisation makes the actor’s effective learning rate robust to the absolute scale of rewards across environments and across training phases within a single environment.
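A small running normaliser makes the mechanics of this rescaling concrete. The sketch below is an illustration under stated assumptions: the class and method names are chosen here, and the 5th and 95th percentiles and the decay of 0.99 are the values I recall from the DreamerV3 paper rather than values stated in this chapter.

```python
def percentile(values, q):
    """Linear-interpolation percentile of a list, with q in [0, 100]."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(xs) - 1)
    return xs[lower] + (xs[upper] - xs[lower]) * (pos - lower)


class ReturnNormalizer:
    """Tracks an exponential moving average of the inter-percentile return range."""

    def __init__(self, decay=0.99, low=5.0, high=95.0):
        self.decay = decay
        self.low = low
        self.high = high
        self.range = None  # no estimate until the first batch arrives

    def update(self, returns):
        batch_range = percentile(returns, self.high) - percentile(returns, self.low)
        if self.range is None:
            self.range = batch_range
        else:
            self.range = self.decay * self.range + (1.0 - self.decay) * batch_range
        return self.range

    def scale(self):
        # Never divide by less than 1, so small returns are left untouched.
        return max(1.0, self.range if self.range is not None else 0.0)

    def normalize(self, returns):
        s = self.scale()
        return [r / s for r in returns]
```

With returns spread evenly over 0 to 100 the tracked range is 90, so a return of 180 is scaled to 2; with returns that all lie below 1 the scale stays at 1 and nothing changes. Using a percentile range rather than the standard deviation keeps a handful of rare, very large returns from dominating the scale.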
The architecture also benefits from several engineering improvements over V2: a substantially larger model (200M parameters versus ~20M in V2), layer normalisation (LayerNorm) throughout, and SiLU activation functions replacing ELU. Together these changes — three principled algorithmic innovations plus improved engineering — yield an agent that achieves human-level performance across Atari 57, matches or exceeds specialised agents on the DeepMind Control Suite, solves BSuite memory tasks, and collected a diamond in Minecraft from scratch — the first agent to do so without any human demonstrations or domain-specific prior knowledge Hafner et al., 2025.
Notably, the same fixed-hyperparameter agent is also competitive in the sample-efficient regime that EfficientZero was built for (EfficientZero: Closing the Data-Efficiency Gap). On the Atari 100k benchmark — 26 games with a budget of only 400K frames, about two hours of play — DreamerV3 outperforms the best prior methods other than EfficientZero, including the transformer-based world models IRIS and TWM, the model-free SPR, and SimPLe Hafner et al., 2025. EfficientZero still holds the benchmark’s state of the art, but only by adding online tree search, prioritised replay, and early level resets — extra machinery that, as the DreamerV3 authors note, complicates a direct comparison. That a single world-model agent tuned for nothing in particular lands just behind a benchmark-specialised tree-search method is a striking illustration of how far latent imagination has come.
What’s Next in World Models
Every method in this chapter trains its world model from scratch, on data from the single task the agent is learning, and discards it when the task changes. The most active frontier in 2025–2026 reverses this assumption: it treats the world model as a pretrained, reusable artifact, much as the rest of deep learning has moved from task-specific training to large pretrained foundations. Three recent lines of work illustrate the shift.
DINO-WM Zhou et al., 2024 builds its world model on top of frozen DINOv2 patch features rather than learning a representation jointly with the dynamics. Because the visual backbone is pretrained and never updated, the model needs no pixel-reconstruction decoder at all — it simply predicts future patch embeddings with a Vision Transformer. At test time, planning is posed as visual goal-reaching: model-predictive control that minimises the latent distance to a goal image, solved zero-shot without task-specific reward models or demonstrations. This decouples the expensive representation learning, done once on broad image data, from the cheap, task-specific dynamics model.
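The goal-reaching formulation can be sketched in a few lines. The example below plans in a toy two-dimensional latent space by minimising the squared distance between the predicted final latent and a goal latent. Everything here is an illustrative assumption: `predict_next` is a hypothetical additive stand-in for a learned predictor over patch embeddings, the optimiser is the cross-entropy method (the text above only specifies model-predictive control), and the horizon and population sizes are arbitrary.

```python
import random


def predict_next(latent, action):
    """Hypothetical stand-in for a learned latent dynamics model."""
    return [z + a for z, a in zip(latent, action)]


def goal_cost(latent, actions, goal):
    """Roll the model forward and return the squared latent distance to the goal."""
    for action in actions:
        latent = predict_next(latent, action)
    return sum((z - g) ** 2 for z, g in zip(latent, goal))


def cem_plan(latent, goal, horizon=5, samples=200, elites=20, iters=10, seed=0):
    """Cross-entropy method over action sequences; returns the mean plan."""
    rng = random.Random(seed)
    dim = len(latent)
    mean = [[0.0] * dim for _ in range(horizon)]
    std = [[1.0] * dim for _ in range(horizon)]
    for _ in range(iters):
        # Sample candidate action sequences from the current Gaussian.
        plans = [
            [[rng.gauss(mean[t][d], std[t][d]) for d in range(dim)]
             for t in range(horizon)]
            for _ in range(samples)
        ]
        # Keep the sequences whose predicted final latent is closest to the goal.
        plans.sort(key=lambda plan: goal_cost(latent, plan, goal))
        best = plans[:elites]
        # Refit the sampling distribution to the elite set.
        for t in range(horizon):
            for d in range(dim):
                vals = [plan[t][d] for plan in best]
                m = sum(vals) / elites
                mean[t][d] = m
                std[t][d] = (sum((v - m) ** 2 for v in vals) / elites) ** 0.5 + 1e-3
    return mean


start, goal = [0.0, 0.0], [2.0, -1.0]
plan = cem_plan(start, goal)
first_action = plan[0]  # a receding-horizon controller executes this, then replans
final_cost = goal_cost(start, plan, goal)
```

Note that no reward model appears anywhere: the only task specification is the goal latent, which in the visual setting would be the embedding of a goal image.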
LeWorldModel Maes et al., 2026 pursues the same reconstruction-free goal from the joint-embedding predictive architecture (JEPA) direction LeCun, 2022, training end-to-end from raw pixels on a single GPU. Its key ingredient is a regulariser, SIGReg, that forces the latent embeddings toward an isotropic Gaussian and so directly prevents the representation collapse that has long destabilised reconstruction-free predictive models — the failure mode that observation decoders and stop-gradients were previously needed to avoid.
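To make the idea of pushing embeddings toward an isotropic Gaussian more concrete, here is a simplified sketch of a sliced Gaussianity penalty. It projects embeddings onto random unit directions and, for each one-dimensional projection, measures a weighted squared gap between the empirical characteristic function and that of a standard normal. This is only loosely in the spirit of such regularisers: the exact statistic, weighting, grid, and number of directions used by SIGReg are not given in this chapter, so every such choice below (and every function name) is an assumption.

```python
import cmath
import math
import random


def random_unit_vector(dim, rng):
    """Draw a direction uniformly from the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def gaussianity_penalty_1d(samples, t_grid):
    """Weighted squared gap between the empirical characteristic function and N(0, 1)'s."""
    total = 0.0
    for t in t_grid:
        ecf = sum(cmath.exp(1j * t * x) for x in samples) / len(samples)
        target = math.exp(-0.5 * t * t)  # characteristic function of N(0, 1)
        weight = math.exp(-0.5 * t * t)  # assumed weighting that down-weights large |t|
        total += weight * abs(ecf - target) ** 2
    return total / len(t_grid)


def sliced_gaussian_penalty(embeddings, num_directions=16, seed=0):
    """Average the 1-D penalty over random projections of the embeddings."""
    rng = random.Random(seed)
    dim = len(embeddings[0])
    t_grid = [0.25 * k for k in range(-12, 13)]
    penalties = []
    for _ in range(num_directions):
        direction = random_unit_vector(dim, rng)
        projected = [sum(a * b for a, b in zip(e, direction)) for e in embeddings]
        penalties.append(gaussianity_penalty_1d(projected, t_grid))
    return sum(penalties) / num_directions


data_rng = random.Random(1)
healthy = [[data_rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(256)]
collapsed = [[0.5] * 8 for _ in range(256)]  # every input maps to the same point

healthy_penalty = sliced_gaussian_penalty(healthy)
collapsed_penalty = sliced_gaussian_penalty(collapsed)
```

A collapsed representation, where every input maps to the same vector, produces a much larger penalty than well-spread Gaussian embeddings, which is the property that lets a term of this kind stand in for a decoder or stop-gradient as a collapse-prevention mechanism.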
DreamerV4 Hafner et al., 2025 scales the Dreamer recipe into a large block-causal transformer world model that attends jointly over spatial patches and time, made fast enough for real-time imagination by a shortcut-forcing training objective. Trained purely on a fixed offline dataset of recorded human Minecraft play — with no live environment interaction — it learns its behaviour entirely inside the learned model and becomes the first agent to obtain diamonds in Minecraft from offline data alone.
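The term block-causal can be illustrated with a small attention mask. In the generic pattern sketched below, every token may attend to all tokens in its own frame and in earlier frames, but never to tokens from future frames. This is a minimal sketch of the general idea, assuming a flat token ordering of frames followed by patches; the actual attention layout inside DreamerV4 is not described in this chapter, and `block_causal_mask` is a hypothetical helper name.

```python
def block_causal_mask(num_frames, tokens_per_frame):
    """Return mask[q][k], True when query token q may attend to key token k."""
    total = num_frames * tokens_per_frame
    frame_of = [i // tokens_per_frame for i in range(total)]
    return [
        [frame_of[k] <= frame_of[q] for k in range(total)]
        for q in range(total)
    ]


mask = block_causal_mask(num_frames=3, tokens_per_frame=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

For three frames of two tokens each this prints:

```
xx....
xx....
xxxx..
xxxx..
xxxxxx
xxxxxx
```

Each 2×2 block on the diagonal is fully visible (spatial attention within a frame), while the lower-triangular block structure enforces causality across time.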
Summary
This chapter developed the world-model paradigm through two complementary families. MuZero brought tree search to settings with no given rules, learning an abstract latent engine — dynamics, reward, value, and policy heads — shaped entirely by what MCTS needs and under no obligation to reconstruct observations; EfficientZero then made that recipe sample-efficient enough to rival human learning speed. The Dreamer family, descending from Ha and Schmidhuber’s V/M/C template and PlaNet’s RSSM, took the opposite stance on the latent space: it reconstructs observations and trains an actor-critic entirely inside imagined rollouts of the learned model, scaling from continuous control (V1) to discrete Atari dynamics (V2) to a single universal agent across dozens of domains (V3).
Two recurring themes run through every method. The first is compression. Every approach encodes high-dimensional observations into a compact latent state that retains what is necessary for predicting rewards and future states. The dimensionality reduction is not just a computational convenience — it is what makes planning tractable. Planning over a 30-dimensional Gaussian or a 1024-dimensional categorical code is qualitatively different from planning over a raw pixel frame.
The second is imagination. Once a compact latent model is available, the agent can generate synthetic experience — imagined rollouts, simulated MCTS expansions — at a fraction of the cost of real interaction. Dreamer uses this to train an actor-critic without touching the environment during behaviour learning. MuZero uses this to run MCTS without querying a simulator. In both cases, the quality of the imagination directly determines the quality of the resulting policy.
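As one concrete example of how imagined experience becomes a learning signal, the sketch below computes λ-returns over a single imagined rollout, the kind of target a Dreamer-style critic can be regressed onto. The indexing convention (one more value estimate than rewards, with the last value used for bootstrapping) and the default discount and λ are assumptions chosen for this illustration, not settings taken from any of the papers.

```python
def lambda_returns(rewards, values, gamma=0.99, lam=0.95):
    """Compute lambda-returns for one imagined trajectory.

    rewards[t] is the predicted reward at imagined step t, and values[t] is the
    critic's estimate for imagined state t, with values[-1] used to bootstrap
    beyond the imagination horizon.
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")
    returns = [0.0] * len(rewards)
    next_return = values[-1]
    for t in reversed(range(len(rewards))):
        # Blend the one-step bootstrap with the longer-horizon return.
        next_return = rewards[t] + gamma * (
            (1 - lam) * values[t + 1] + lam * next_return
        )
        returns[t] = next_return
    return returns


imagined_rewards = [1.0, 1.0, 1.0]
imagined_values = [0.0, 0.0, 0.0, 10.0]
one_step = lambda_returns(imagined_rewards, imagined_values, gamma=0.5, lam=0.0)
full_horizon = lambda_returns(imagined_rewards, imagined_values, gamma=0.5, lam=1.0)
```

With λ = 0 the targets reduce to one-step bootstrapped estimates, and with λ = 1 they become discounted sums over the whole imagined horizon plus a final bootstrap; intermediate values trade off the bias of the critic against the compounding error of long imagined rollouts.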
| Algorithm | Model | Planning | Key innovation |
|---|---|---|---|
| MuZero Schrittwieser et al., 2020 | Representation + dynamics + prediction networks (abstract) | MCTS in latent space | No rules needed; board games + Atari |
| EfficientZero Ye et al., 2021 | MuZero + consistency | MCTS in latent space | Self-supervised consistency; 1 000× efficiency |
| World Models Ha & Schmidhuber, 2018 | VAE + MDN-RNN | Dream training (CMA-ES) | Train controller inside learned model |
| PlaNet Hafner et al., 2019 | RSSM | CEM in latent space | Planning without policy; latent overshooting |
| Dreamer Hafner et al., 2020 | RSSM + decoders | Actor-critic in imagination | BPTT through world model; λ-returns |
| DreamerV2 Hafner et al., 2021 | RSSM + categorical | Actor-critic in imagination | Discrete latents; KL balancing |
| DreamerV3 Hafner et al., 2025 | RSSM (large) | Actor-critic in imagination | Symlog; free bits; universal hyperparameters |
The two families differ in their choice of latent space structure and planning mechanism. The Dreamer family maintains a recurrent stochastic latent that must decode to observations, enabling rich world models suitable for continuous-control tasks with complex visual inputs. MuZero and EfficientZero use an entirely abstract latent space shaped only by task-relevant predictions, enabling efficient tree search in discrete and combinatorial settings. Neither approach dominates universally: Dreamer’s imagination-based actor-critic is more sample-efficient than MCTS for high-frequency continuous control; MuZero’s tree search finds stronger combinatorial strategies than amortised policies in games. Open questions remain around bridging the two families — can imagination-based planning and tree search be combined? — and extending world models to partially observable, multi-agent, and non-stationary settings. A further shift is already underway: from world models trained per task toward pretrained, reusable ones — DINO-WM Zhou et al., 2024, LeWorldModel Maes et al., 2026, and DreamerV4 Hafner et al., 2025 — which aim to learn dynamics once on broad data and transfer them across tasks.
Further Reading
Danijar Hafner, Learning Latent Dynamics for Planning from Pixels — author walkthrough of PlaNet: RSSM world model, video prediction, and latent-space CEM planning (~200× sample efficiency on DeepMind Control).
Danijar Hafner, Dream to Control: Learning Behaviors by Latent Imagination — ICLR 2020 talk on Dreamer (v1): actor–critic training inside imagined RSSM trajectories with backprop through the world model.
Danijar Hafner, Mastering Atari with Discrete World Models — ICLR 2021 talk on DreamerV2: categorical latents, KL balancing, and human-level Atari from imagination in a separately trained world model.
Nicklas Hansen, dreamer4 — unofficial PyTorch reimplementation of DreamerV4 with multi-task DeepMind Control support and Hugging Face checkpoints.
LeCun, Yann, A Path Towards Autonomous Machine Intelligence LeCun, 2022 — vision-position paper on world models, hierarchical planning, and joint-embedding predictive architectures (JEPA).
Silver & Sutton, Welcome to the Era of Experience Silver & Sutton, 2025 — argument that the next AI leap will come from agents learning continuously from grounded experience streams rather than static human data alone.
- Ha, D., & Schmidhuber, J. (2018). Recurrent World Models Facilitate Policy Evolution. Advances in Neural Information Processing Systems, 31.
- Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., & Davidson, J. (2019). Learning Latent Dynamics for Planning from Pixels. Proceedings of the 36th International Conference on Machine Learning, 2555–2565.
- Moerland, T. M., Broekens, J., Plaat, A., & Jonker, C. M. (2023). Model-based Reinforcement Learning: A Survey. Foundations and Trends in Machine Learning, 16(1), 1–118. https://doi.org/10.1561/2200000086
- Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T., & Silver, D. (2020). Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model. Nature, 588(7839), 604–609. https://doi.org/10.1038/s41586-020-03051-4
- Hafner, D., Lillicrap, T., Ba, J., & Norouzi, M. (2020). Dream to Control: Learning Behaviors by Latent Imagination. International Conference on Learning Representations. https://arxiv.org/abs/1912.01603
- Hafner, D., Lillicrap, T., Norouzi, M., & Ba, J. (2021). Mastering Atari with Discrete World Models. International Conference on Learning Representations. https://arxiv.org/abs/2010.02193
- Hafner, D., Pasukonis, J., Ba, J., & Lillicrap, T. (2025). Mastering diverse control tasks through world models. Nature, 640(8059), 647–653. https://doi.org/10.1038/s41586-025-08744-2
- Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K., & Hassabis, D. (2018). A General Reinforcement Learning Algorithm that Masters Chess, Shogi, and Go through Self-Play. Science, 362(6419), 1140–1144. https://doi.org/10.1126/science.aar6404
- Ye, W., Liu, S., Kurutach, T., Abbeel, P., & Gao, Y. (2021). Mastering Atari Games with Limited Data. Advances in Neural Information Processing Systems, 34, 25476–25488.
- Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 1724–1734. https://doi.org/10.3115/v1/D14-1179
- Kingma, D. P., & Welling, M. (2014). Auto-Encoding Variational Bayes. Proceedings of the 2nd International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1312.6114
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. http://incompleteideas.net/book/the-book-2nd.html
- Osband, I., Doron, Y., Hessel, M., Aslanides, J., Sezener, E., Saraiva, A., McKinney, K., Lattimore, T., Szepesvári, C., Singh, S., Van Roy, B., Sutton, R., Silver, D., & van Hasselt, H. (2020). Behaviour Suite for Reinforcement Learning. Proceedings of the 8th International Conference on Learning Representations (ICLR).
- Zhou, G., Pan, H., LeCun, Y., & Pinto, L. (2024). DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning. arXiv Preprint arXiv:2411.04983. https://arxiv.org/abs/2411.04983
- Maes, L., Le Lidec, Q., Scieur, D., LeCun, Y., & Balestriero, R. (2026). LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels. arXiv Preprint arXiv:2603.19312. https://arxiv.org/abs/2603.19312