World Models

The previous chapter showed that access to a perfect simulator transforms the planning problem: given the rules of Go, AlphaZero can search millions of positions per second and reach superhuman strength from self-play alone. But perfect simulators are a luxury that the real world rarely provides. Robots interacting with physical environments, agents navigating markets, or systems exploring molecular space have no closed-form transition function to query — every piece of data must be earned through direct interaction. The natural response is to learn a model of the environment from collected experience and then use that model as a surrogate simulator for planning.

This chapter is concerned with a specific and powerful instantiation of that idea: the world model. A world model is a learned function that approximates the environment’s transition distribution $p(s', r \mid s, a)$, but crucially operates in a compact latent space rather than the raw observation space Ha & Schmidhuber, 2018. Instead of predicting the next pixel frame, a high-dimensional and largely redundant signal, the model encodes each observation into a small latent vector and performs all dynamics reasoning there, which results in order-of-magnitude faster planning than in pixel space Hafner et al., 2019; Moerland et al., 2023.

We organise the chapter around three milestones. We begin with MuZero Schrittwieser et al., 2020, which picks up exactly where AlphaZero left off: it brings tree search to settings where the rules are not given, by learning the game engine itself in an abstract latent space that need not decode back to observations. We then turn to the Dreamer family Hafner et al., 2020; Hafner et al., 2021; Hafner et al., 2025, which descends from the earliest modern world models, namely the V/M/C architecture of Ha and Schmidhuber (2018) and the recurrent state-space model of PlaNet Hafner et al., 2019, and trains an actor-critic entirely inside imagined rollouts of a reconstructive latent model. Finally, we look at where the field is heading: toward pretrained, reusable world models that decouple representation learning from any single task.

MuZero: Learning the Rules from Scratch

Chapter 7 showed that Monte Carlo Tree Search becomes extraordinarily powerful when paired with a learned policy-value network — provided a perfect simulator is available to expand the search tree. MuZero Schrittwieser et al., 2020 removes that proviso. It brings the tree-search machinery of AlphaZero (Chapter 7) to bear on settings where no game engine is given — Atari, where the rules are buried inside an emulator that cannot be queried symbolically, or robotics, where physics governs transitions that have no closed-form description — by learning the engine itself.

From AlphaZero to MuZero

Recall from Chapter 7 that AlphaZero Silver et al., 2018 runs MCTS using the game engine to expand nodes: at every node, the tree policy selects an action, the engine executes it, and the child state is obtained for free. This architecture is superhuman in Go, Chess, and Shogi, but only because the rules are provided. Without an engine to expand nodes, as in Atari or on a robot, AlphaZero’s search cannot be run at all.

MuZero’s key insight is that the agent does not need to predict observations in order to plan effectively. It only needs to predict the three quantities that drive MCTS: rewards (to evaluate paths), values (to estimate leaf nodes), and policies (to guide search). These can all be predicted in an abstract latent space that need not correspond to any observable quantity Schrittwieser et al., 2020.

The Three Learned Functions

MuZero learns three neural networks, shown in Figure 2:

Representation function $h_\theta$ :

s^0 = h_\theta(o_1, \ldots, o_t).

(1)

The representation function encodes a stack of the most recent observations — in Atari, together with the actions that produced them — into an initial hidden state $s^0$ from which MCTS begins. It is called once per real environment step and never called during search.

Dynamics function $g_\theta$ :

(r^k,\, s^k) = g_\theta(s^{k-1},\, a^k).

(2)

Given a latent state and an action, the dynamics function predicts the immediate reward and the next latent state. This function replaces the game engine inside MCTS: every node expansion calls $g_\theta$ rather than querying a simulator.

Prediction function $f_\theta$ :

(p^k,\, v^k) = f_\theta(s^k).

(3)

The prediction function reads a latent state and returns a policy prior $p^k$ over actions and a scalar value estimate $v^k$ . It plays the exact same role as the policy-value network in AlphaZero.
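How the three functions compose can be sketched in a few lines. This is a toy illustration only, not MuZero’s actual networks: `represent`, `dynamics`, `predict`, and `unroll` are hypothetical names, the hand-written arithmetic stands in for learned parameters, and the one-dimensional latent and two-action space are assumptions made for brevity.

```python
import math


def represent(observations):
    """h: encode the observation history into an initial latent state s^0."""
    # Toy encoder (assumption): the latent is the mean of the observations.
    return [sum(observations) / len(observations)]


def dynamics(state, action):
    """g: map (s^{k-1}, a^k) to the predicted reward r^k and next latent s^k."""
    shift = 1.0 if action == 1 else -1.0
    next_state = [0.9 * state[0] + shift]
    reward = next_state[0]  # toy reward head reads the latent directly
    return reward, next_state


def predict(state):
    """f: map a latent s^k to a policy prior p^k and a value estimate v^k."""
    logits = [-state[0], state[0]]
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    policy = [e / total for e in exps]
    value = math.tanh(state[0])
    return policy, value


def unroll(observations, actions):
    """Call h once, then alternate g and f along a given action sequence."""
    state = represent(observations)
    outputs = [(None, *predict(state))]  # step k = 0 has no predicted reward
    for action in actions:
        reward, state = dynamics(state, action)
        policy, value = predict(state)
        outputs.append((reward, policy, value))
    return outputs


# Each entry is (reward, policy, value) for unroll steps k = 0, 1, 2, 3.
steps = unroll(observations=[0.0, 1.0, 2.0], actions=[1, 1, 0])
```

Note that `represent` is called exactly once, while `dynamics` and `predict` are called once per hypothetical step; no observation is ever reconstructed along the way.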

Neural MCTS in Latent Space

The search procedure in MuZero is identical in structure to AlphaZero’s (see Monte Carlo Tree Search in Chapter 7): PUCT-guided selection, expansion, and backup, with the prediction function providing policy priors and value estimates. The only difference is that node expansion calls $g_\theta$ instead of the game engine, as illustrated in Figure 3.

Figure 3: MuZero MCTS in latent space: AlphaZero queries the game engine at every expansion; MuZero queries the learned dynamics function $g_\theta$ instead. The PUCT selection rule, backup, and recommended-action extraction (argmax over visit counts) are identical. The true environment is never queried during search. Cf. Schrittwieser et al., 2020.

At each real step, MuZero first encodes the current observation stack with $h_\theta$ to obtain $s^0$ , then runs $N_\text{sim}$ MCTS simulations. Each simulation selects a leaf by PUCT:

U(s, a) = Q(s, a) + c_\text{puct} \cdot p^k(a) \cdot \frac{\sqrt{N(s)}}{1 + N(s, a)},

(4)

expands the leaf by calling $g_\theta$ and $f_\theta$ , and backs up value estimates along the visited path. Here $c_\text{puct}$ stands in for a visit-count–dependent exploration term: the original algorithm (Equation 2 of Schrittwieser et al. (2020)) grows the exploration bonus logarithmically as a node is visited more often, which we fold into a single constant for clarity. The value backup in MuZero differs from AlphaZero’s in one important detail: it incorporates the predicted rewards accumulated along the unrolled path,

G^k = \sum_{\tau=0}^{l-1-k} \gamma^\tau\, r^{k+1+\tau} + \gamma^{\,l-k}\, v^{l},

(5)

where $l$ is the depth of the expanded leaf, the rewards $r^{k+1+\tau}$ are those predicted after each action along the path, and $v^{l}$ is the value output of the prediction function (Equation (3)) at the leaf. AlphaZero’s backup needs no reward terms: in board games the only reward is the final outcome, and the game engine flags terminal positions so that the true win or loss can be backed up from them. MuZero has no engine to flag terminal states, so it always bootstraps from the value head; terminal states are instead treated as absorbing during training, so the value head learns to emit a constant value beyond episode end.
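The selection rule of Equation (4) and the backup of Equation (5) can be sketched as follows. This is a minimal illustration under stated assumptions: the dictionary layout of `children`, the function names, and the default `c_puct=1.25` are all hypothetical choices made here, and $N(s)$ is taken to be the sum of the children’s visit counts.

```python
import math


def puct_score(q, prior, parent_visits, child_visits, c_puct=1.25):
    """Equation (4): value estimate plus a prior-weighted exploration bonus."""
    bonus = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q + bonus


def select_action(children, c_puct=1.25):
    """Pick the action whose child has the highest PUCT score.

    `children` maps action -> dict with keys 'q', 'prior', 'visits'
    (a hypothetical layout chosen for this sketch).
    """
    parent_visits = sum(child["visits"] for child in children.values())
    return max(
        children,
        key=lambda a: puct_score(
            children[a]["q"], children[a]["prior"],
            parent_visits, children[a]["visits"], c_puct,
        ),
    )


def backup_returns(path_rewards, leaf_value, gamma):
    """Equation (5): discounted return G^k for every node k on the search path.

    path_rewards[i] is the predicted reward r^{i+1} received after the
    (i+1)-th action on the path; the leaf sits at depth l = len(path_rewards).
    """
    returns = [0.0] * (len(path_rewards) + 1)
    returns[-1] = leaf_value  # G^l is the value estimate at the leaf
    for k in range(len(path_rewards) - 1, -1, -1):
        returns[k] = path_rewards[k] + gamma * returns[k + 1]
    return returns


children = {
    0: {"q": 0.5, "prior": 0.2, "visits": 10},  # well explored, decent value
    1: {"q": 0.1, "prior": 0.8, "visits": 0},   # unexplored, strong prior
}
chosen = select_action(children)  # the unexplored, high-prior action wins
returns = backup_returns([1.0, 0.0, 2.0], leaf_value=0.5, gamma=0.5)
```

The backward recursion $G^k = r^{k+1} + \gamma\, G^{k+1}$ is algebraically the same as the explicit sum in Equation (5), and it computes the return for every node on the path in a single pass.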

Training

During training, the MuZero network is unrolled for $K$ hypothetical steps on sequences sampled from the replay buffer Schrittwieser et al., 2020. Fix a real time index $t$ : the representation function receives past observations $o_1, \ldots, o_t$ , and the dynamics function is fed the real actions $a_{t+1}, \ldots, a_{t+K}$ . At each unroll step $k$ , the combined model predicts

p^k_t,\, v^k_t,\, r^k_t = \mu_\theta(o_1, \ldots, o_t,\, a_{t+1}, \ldots, a_{t+k}),

(6)

where $\mu_\theta$ chains $h_\theta$ , $g_\theta$ , and $f_\theta$ . Each step is aligned with stored MCTS targets $\pi_{t+k}$ , environment rewards $u_{t+k}$ , and value targets $z_{t+k}$ . Following Figure S2 of Schrittwieser et al. (2020),

z_{t+k} = \begin{cases} u_T & \text{board games} \\[0.4em] u_{t+k+1} + \gamma u_{t+k+2} + \cdots + \gamma^{n-1} u_{t+k+n} + \gamma^n \nu_{t+k+n} & \text{general MDPs } \end{cases}

(7)

In board games, final outcomes $\{\text{loss}, \text{draw}, \text{win}\}$ are encoded as rewards $u_T \in \{-1, 0, +1\}$ at the terminal step $T$, and the reward head is not trained ($l_r \equiv 0$). In Atari, $n = 10$ and $\nu_{t+k+n}$ is the MCTS value estimate at the bootstrap step. The per-trajectory loss is

l_t(\theta) = \sum_{k=0}^{K} \!\Big[ \underbrace{l_r(u_{t+k},\, r^k_t)}_{\text{reward loss}} + \underbrace{l_v(z_{t+k},\, v^k_t)}_{\text{value loss}} + \underbrace{l_p(\pi_{t+k},\, p^k_t)}_{\text{policy loss}} \Big] + \underbrace{c\|\theta\|^2}_{\text{$L_2$ regularisation}},

(8)

with the three head losses defined as

l_r(u, r) = \begin{cases} 0 & \text{board games,} \\[0.4em] -\boldsymbol{\phi}(u)^\top \log \mathbf{r} & \text{general MDPs,} \end{cases}

(9)

l_v(z, q) = \begin{cases} (z - q)^2 & \text{board games,} \\[0.4em] -\boldsymbol{\phi}(z)^\top \log \mathbf{q} & \text{general MDPs,} \end{cases}

(10)

l_p(\boldsymbol{\pi}, \mathbf{p}) = -\boldsymbol{\pi}^\top \log \mathbf{p}.

(11)

Here $\mathbf{r}$ and $\mathbf{q}$ are the softmax distributions over a fixed support of 601 integers ($-300, \ldots, 300$) emitted by the reward and value heads. Raw targets are first compressed by the invertible square-root rescaling $h(x) = \mathrm{sign}(x)\bigl(\sqrt{|x|+1} - 1\bigr) + \epsilon x$, described in Appendix F of Schrittwieser et al. (2020) (DreamerV3, below, later replaces this with the related but distinct symlog transform); $\boldsymbol{\phi}(x)$ then encodes the rescaled scalar $x$ as a convex combination of its two adjacent integer supports (e.g. a rescaled target of 3.7 becomes weight 0.3 on support 3 and 0.7 on support 4). Squared error on values sufficed for the bounded board-game targets, where the reward head is not trained at all; cross-entropy against this support representation was found more stable when scales vary widely in Atari. The cross-entropy losses $l_r$, $l_v$, and $l_p$ each carry a leading minus, so minimising them maximises the log-likelihood of the targets, matching AlphaZero’s policy loss (Equation (13)). To keep gradient magnitude roughly constant across unroll depths, each head loss is scaled by $1/K$ and the gradient entering the dynamics function is additionally scaled by $1/2$ Schrittwieser et al., 2020.
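The target pipeline for general MDPs (an $n$-step return as in Equation (7), the rescaling $h$, then the support encoding $\boldsymbol{\phi}$) can be sketched as below. The function names are hypothetical, and the value `EPS = 0.001` is an assumption for illustration; the text above does not fix $\epsilon$.

```python
import math

EPS = 0.001          # epsilon in h(x); assumed value for this sketch
SUPPORT_MIN = -300   # integer support -300, ..., 300 (601 bins)
SUPPORT_MAX = 300


def n_step_target(rewards, bootstrap_value, gamma):
    """General-MDP case of Equation (7): n discounted rewards plus a bootstrap."""
    target = sum(gamma ** i * u for i, u in enumerate(rewards))
    return target + gamma ** len(rewards) * bootstrap_value


def rescale(x):
    """h(x) = sign(x) * (sqrt(|x| + 1) - 1) + eps * x."""
    return math.copysign(math.sqrt(abs(x) + 1.0) - 1.0, x) + EPS * x


def two_hot(x):
    """phi(x): weights on the two integers adjacent to x, as a 601-vector."""
    x = min(max(x, SUPPORT_MIN), SUPPORT_MAX)  # clip to the support range
    low = math.floor(x)
    high_weight = x - low
    phi = [0.0] * (SUPPORT_MAX - SUPPORT_MIN + 1)
    phi[low - SUPPORT_MIN] = 1.0 - high_weight
    if high_weight > 0.0:
        phi[low + 1 - SUPPORT_MIN] = high_weight
    return phi


def cross_entropy(phi, probs):
    """-phi^T log(probs), the form shared by Equations (9) to (11)."""
    return -sum(w * math.log(p) for w, p in zip(phi, probs) if w > 0.0)


# Three rewards of 1, a bootstrap value of 10, and gamma = 0.5:
z = n_step_target([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5)  # 3.0
target_vector = two_hot(rescale(z))

# The worked example from the text: 3.7 splits between supports 3 and 4.
example = two_hot(3.7)
```

Because $\boldsymbol{\phi}(x)$ is a convex combination of two neighbouring integers, its expectation over the support recovers $x$ exactly, which is what allows the scalar to be decoded again from a predicted distribution.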

Results

In board games, MuZero matches AlphaZero on Go, Chess, and Shogi despite never being told the rules — it learns the equivalent of the game engine from self-play data. On the 57-game Atari suite, MuZero sets a new state of the art, surpassing model-free Rainbow; its sample-efficient MuZero Reanalyze variant reaches a 731% median human-normalised score from only 200M frames. A particularly striking finding, in Go, is that MuZero’s MCTS over its learned model matches — and at shorter search budgets slightly exceeds — search with the ground-truth simulator (AlphaZero), despite evaluating each node with fewer residual blocks (16 vs. 20). This suggests MuZero caches computation in the search tree, using each application of the dynamics model to deepen its understanding of the position Schrittwieser et al., 2020.

EfficientZero: Closing the Data-Efficiency Gap

MuZero’s superhuman Atari performance comes at a steep data cost — hundreds of millions of frames per game to reach its reported scores, or weeks of continuous play, whereas a human can master most Atari games in under two hours Ye et al., 2021. EfficientZero Ye et al., 2021 closes most of that gap with three targeted additions to MuZero, reaching above-human performance from the Atari 100k budget — 100 000 agent steps, about 400K frames or roughly two hours of play — for the first time, using on the order of 500× less data than a DQN trained on 200M frames.

First, a self-supervised consistency loss — a SimSiam-style objective — directly aligns the dynamics function’s predicted latent, $g_\theta(s_{t-1}, a_{t-1})$ , with the latent that the representation function produces from the actual next observation, $\mathrm{sg}\!\left[h_\theta(o_t)\right]$ . This supplies the dynamics model with a dense learning signal it otherwise receives only indirectly through reward and value losses (the stop-gradient is essential, otherwise the representation could trivially collapse to a constant).

Second, a value-prefix head uses a small LSTM to predict the cumulative discounted reward over the unroll before bootstrapping, separating short-term credit assignment from the long-term value estimate and making both easier to learn.

Third, model-based off-policy correction addresses value targets that grow stale as the policy improves: for older replayed transitions, the $n$ -step value target is computed over a shortened horizon and the remaining steps are filled in by re-running the search with the current model, so the bootstrap reflects up-to-date value estimates rather than the outdated behaviour policy that originally generated the data.

Together these let EfficientZero reach a median human-normalised score above 1.0 on the Atari 100k benchmark, where MuZero remains below human level, closing most of the data-efficiency gap with human learning speed Ye et al., 2021.

Dreamer: Imagination-Based Actor-Critic

MuZero shapes its latent space purely to serve tree search. The Dreamer family takes the opposite stance: it learns a latent transition model and uses it to train a policy by gradient-based latent imagination rather than search. The lineage runs back to the first modern world models.

Ha and Schmidhuber (2018) introduced the idea: compress each observation into a compact latent code, learn a recurrent model of how that code evolves, and train a small controller entirely inside the model’s own imagined rollouts (“dreaming”) before deploying it in the real environment. PlaNet Hafner et al., 2019 turned this template into the Recurrent State Space Model (RSSM), the world model used by the Dreamer family (V1–3), and solved continuous-control tasks from pixels by online planning directly in latent space, with no learned policy at all. The catch is cost: the planner re-optimises a fresh distribution over action sequences at every decision, requiring hundreds of latent rollouts per step. Dreamer Hafner et al., 2020 keeps PlaNet’s RSSM but amortises that planning into a policy: instead of searching at decision time, it trains an actor-critic inside imagined rollouts of the model, yielding a policy that acts in a single forward pass at test time.

Following the classical decomposition of agents that learn in imagination, the Dreamer agent interleaves three operations, which structure the rest of this section: fitting the RSSM world model to past experience, behavior learning inside imagined rollouts of that frozen model, and environment interaction to gather fresh experience.

RSSM World Model

Dreamer Hafner et al., 2020 couples its world model with an actor $\pi_\phi(a_\tau \mid s_\tau)$ and a critic $V_\psi(s_\tau)$ — introduced in the next subsection — all operating entirely in latent space. We begin with the world model itself. Dreamer’s world model is PlaNet’s Recurrent State Space Model (RSSM), a generative model of observation sequences built around a single compact model state $s_t$ with Markovian transitions: given $s_{t-1}$ and the action $a_{t-1}$ , the model predicts $s_t$ , and from $s_t$ it predicts the reward. The model is fitted to past experience by representation learning: encoding real observations into model states from which it can reconstruct the observation and predict the reward, as shown in Figure 4.

Figure 4: Learning the RSSM from experience, the “learn dynamics” component of Dreamer, unrolled over two steps. Each model state $s_t$ (orange) can be formed two ways. The transition/prior model $q_\theta(s_t \mid s_{t-1}, a_{t-1})$ (blue box, dashed arrow) predicts the next state from the previous state and action without seeing the observation; this is what lets the model imagine forward in latent space. When a real observation $o_t$ (green) is available, an encoder (purple) feeds the representation/posterior model $p_\theta(s_t \mid s_{t-1}, a_{t-1}, o_t)$ (blue box, solid arrow), which corrects the prior. From each model state a decoder reconstructs the observation $\hat o_t$ and a reward model $q_\theta(r_t \mid s_t)$ (orange box) predicts $\hat r_t$. All heads are fit jointly by maximising the sequence ELBO (Equation (13)), so the model state learns to capture exactly what is needed to reconstruct observations and predict rewards; the KL term pulls the prior toward the posterior, keeping imagination accurate when observations are absent.

Following the original Dreamer paper, we treat $s_t$ as a single latent variable and describe the world model by the conditional distributions it learns:

\begin{array}{rll} \text{Representation (posterior):} & p_\theta(s_t \mid s_{t-1}, a_{t-1}, o_t) \\[0.3em] \text{Transition (prior):} & q_\theta(s_t \mid s_{t-1}, a_{t-1}) \\[0.3em] \text{Reward model:} & q_\theta(r_t \mid s_t) \\[0.3em] \text{Observation decoder:} & q_\theta(o_t \mid s_t). \end{array}

(12)

Here, $p$ denotes the representation model — the posterior that infers the state with access to the observation $o_t$ — and $q$ denotes the transition model — the prior that predicts the next state from $s_{t-1}$ and $a_{t-1}$ without an observation — together with the reward and observation heads that read off $s_t$ . The transition is what lets the model imagine forward in latent space without decoding pixels; the posterior is used only when a real observation is available, to “correct” the prior. (For tasks with early termination, Dreamer V1 additionally predicts a discount factor from each model state to down-weight imagined steps beyond a likely episode end; this “continue” predictor is not part of the core continuous-control model, but it becomes a standard RSSM head from DreamerV2 onward.)

Both prior and posterior are diagonal Gaussians whose parameters are produced by small networks; the reward head is an MLP and the observation decoder is a transposed CNN. The table below summarises the modules and their roles in the sequence ELBO of Equation (13).

Component	Input → output	Architecture	Objective
Representation / posterior $p_\theta(s_t \mid s_{t-1}, a_{t-1}, o_t)$	$(s_{t-1}, a_{t-1}, o_t) \to s_t$	recurrent net + MLP → diagonal Gaussian	$\mathbb{E}[\log q_\theta(o_t \mid s_t) + \log q_\theta(r_t \mid s_t)] - \mathrm{KL}(p \,\|\, q)$ — inference when $o_t$ is available (training)
Transition / prior $q_\theta(s_t \mid s_{t-1}, a_{t-1})$	$(s_{t-1}, a_{t-1}) \to s_t$	recurrent net + MLP → diagonal Gaussian	$\mathrm{KL}\bigl[p_\theta(s_t \mid s_{t-1}, a_{t-1}, o_t) \,\|\, q_\theta(s_t \mid s_{t-1}, a_{t-1})\bigr]$ — predicts the next state without an observation (imagination)
Observation decoder $q_\theta(o_t \mid s_t)$	$s_t \to o_t$	Convolutional decoder	$\mathbb{E}[\log q_\theta(o_t \mid s_t)]$ — pixel reconstruction
Reward model $q_\theta(r_t \mid s_t)$	$s_t \to r_t$	MLP (Gaussian likelihood)	$\mathbb{E}[\log q_\theta(r_t \mid s_t)]$ — scalar reward prediction

All modules are trained jointly by maximising a sequence-level Evidence Lower Bound (ELBO):

\begin{array}{rl} \mathcal{L}(\theta) = \sum_t \mathbb{E}_{p_\theta}\!\Big[ & \underbrace{\log q_\theta(o_t \mid s_t)}_{\text{observation reconstruction}} \\[0.5em] + & \underbrace{\log q_\theta(r_t \mid s_t)}_{\text{reward prediction}} \\[0.5em] - & \underbrace{ \mathrm{KL}\bigl[ p_\theta(s_t \mid s_{t-1}, a_{t-1}, o_t) \,\|\, q_\theta(s_t \mid s_{t-1}, a_{t-1}) \bigr] }_{\text{KL regularisation}} \Big], \end{array}

(13)

The reconstruction terms push the model to predict both observations and rewards accurately. The KL term acts in both directions: it regularises the posterior toward the prior, limiting how much information the state absorbs from each observation, and it trains the prior toward the posterior, so that the prior, which must operate without observations during imagination, stays accurate enough to support useful predictions.
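For diagonal Gaussians, every term of Equation (13) has a closed form, which the following sketch evaluates for a single time step and a single sampled model state. It is an illustration under assumptions: the function names are hypothetical, the decoder and reward head are taken to be unit-variance Gaussians centred on their predictions, and the expectation over the posterior is replaced by one sample.

```python
import math


def gaussian_log_prob(x, mean, std):
    """Log density of a diagonal Gaussian, summed over dimensions."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * s * s) - (xi - m) ** 2 / (2.0 * s * s)
        for xi, m, s in zip(x, mean, std)
    )


def gaussian_kl(mean_p, std_p, mean_q, std_q):
    """KL[N(mean_p, std_p^2) || N(mean_q, std_q^2)] for diagonal Gaussians."""
    return sum(
        math.log(sq / sp) + (sp * sp + (mp - mq) ** 2) / (2.0 * sq * sq) - 0.5
        for mp, sp, mq, sq in zip(mean_p, std_p, mean_q, std_q)
    )


def elbo_step(obs, obs_mean, reward, reward_mean, posterior, prior):
    """One time step of Equation (13) for a single sampled model state.

    `posterior` and `prior` are (mean, std) pairs; `obs_mean` and
    `reward_mean` are the decoder and reward-head predictions.
    Unit-variance likelihoods are an assumption of this sketch.
    """
    reconstruction = gaussian_log_prob(obs, obs_mean, [1.0] * len(obs))
    reward_ll = gaussian_log_prob([reward], [reward_mean], [1.0])
    kl = gaussian_kl(*posterior, *prior)
    return reconstruction + reward_ll - kl


posterior = ([1.0, 0.0], [1.0, 1.0])   # informed by the observation
prior = ([0.0, 0.0], [1.0, 1.0])       # predicted from s_{t-1}, a_{t-1} only
kl = gaussian_kl(*posterior, *prior)   # 0.5: one unit of mean shift in one dim
value = elbo_step([0.2, 0.4], [0.1, 0.5], 1.0, 0.8, posterior, prior)
```

The KL is zero only when prior and posterior coincide, so any information the posterior extracts from $o_t$ that the prior could not have predicted is paid for in the objective.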

Note

Latent overshooting (a PlaNet aside). The standard ELBO of Equation (13) trains the posterior to match the observation at each step, but does not explicitly train the prior for multi-step predictions — yet multi-step accuracy is precisely what planning and imagination require. Latent overshooting Hafner et al., 2019 addresses this by additionally requiring the prior to match the posterior not just one step ahead but $d$ steps ahead without observations, for $d = 1, 2, \ldots, D$ . Concretely, an overshooting loss is added for each depth $d$ : unroll the prior $d$ steps forward from a starting posterior state (using only actions, no observations), then penalise the KL between the resulting prior distribution and the $d$ -step-ahead posterior. This keeps the prior calibrated over the multi-step horizons that the planner (in PlaNet) and the imagined rollouts (in Dreamer) actually use.

Behavior learning

With the world model fitted and held fixed, behaviour is learned entirely in imagination — this is the “learn behavior” component of Dreamer. Starting from posterior states inferred on real data, the agent imagines $H = 15$ steps into the future inside the world model, without interacting with the environment: the actor proposes actions, the world model advances the latent state, and the reward predictor supplies rewards. The actor and critic are updated on these purely imagined trajectories.

The key innovation is the gradient estimator used to train the actor. Because the world model and the actor are both differentiable, gradients can flow back through imagined states and rewards directly to the actor’s parameters — no REINFORCE estimator is needed. Concretely, the actor samples actions via reparameterisation:

a_\tau = \tanh\!\bigl(\mu_\phi(s_\tau) + \sigma_\phi(s_\tau) \cdot \varepsilon_\tau\bigr), \quad \varepsilon_\tau \sim \mathcal{N}(0, I),

(14)

so the gradient $\partial a_\tau / \partial \phi$ exists analytically. The imagined next state $s_{\tau+1} \sim q_\theta(s_{\tau+1} \mid s_\tau, a_\tau)$ — drawn from the transition model — is also reparameterised, so the gradient continues flowing backwards through the dynamics.

Note

The reparameterisation trick, in brief. Training through a sampling step requires the gradient $\partial / \partial\phi$ of an expectation over a distribution whose parameters depend on $\phi$ — but drawing a sample is not differentiable. The trick Kingma & Welling, 2014 rewrites the sample as a deterministic function of the parameters and an independent noise variable that carries no parameters. For a Gaussian, drawing $z \sim \mathcal{N}(\mu_\phi, \sigma_\phi^2)$ is equivalent to computing $z = \mu_\phi + \sigma_\phi \odot \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$ ; now $z$ depends on $\phi$ only through the deterministic $\mu_\phi$ and $\sigma_\phi$ , so $\partial z / \partial \phi$ is well defined and gradients flow through the sample. Dreamer applies this to both the actor’s actions (Equation (14)) and the RSSM’s stochastic states $s_\tau$ , which makes an entire imagined rollout differentiable end-to-end — the basis for backpropagating value gradients through the dynamics.
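A scalar version of Equation (14) makes this concrete. The sketch below is a toy check, not Dreamer’s actor: `mu` and `sigma` are plain numbers standing in for the network outputs $\mu_\phi(s_\tau)$ and $\sigma_\phi(s_\tau)$, and `eps` is one fixed noise draw. With the noise held fixed, the analytic derivatives of the sampled action agree with finite differences.

```python
import math


def action(mu, sigma, eps):
    """Equation (14) for a scalar action: a = tanh(mu + sigma * eps)."""
    return math.tanh(mu + sigma * eps)


def action_grads(mu, sigma, eps):
    """Analytic partial derivatives of the action w.r.t. mu and sigma."""
    slope = 1.0 - math.tanh(mu + sigma * eps) ** 2  # derivative of tanh
    return slope, slope * eps


mu, sigma, eps = 0.3, 0.5, -1.2   # eps plays the role of one draw from N(0, 1)
d_mu, d_sigma = action_grads(mu, sigma, eps)

# Finite-difference check: with eps held fixed, the sample is a smooth
# function of (mu, sigma), so ordinary derivatives exist.
h = 1e-6
fd_mu = (action(mu + h, sigma, eps) - action(mu - h, sigma, eps)) / (2 * h)
fd_sigma = (action(mu, sigma + h, eps) - action(mu, sigma - h, eps)) / (2 * h)
```

In an automatic-differentiation framework these derivatives are obtained for free; the point of the check is only that nothing random remains between the parameters and the sample once `eps` is drawn.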

Figure 5: Behavior learning inside imagined rollouts. An $H$-step trajectory of latent states $s_1, \ldots, s_H$ (orange) is rolled forward by the frozen world model’s dynamics (the prior). At each step the actor $\pi_\phi$ (blue) proposes an action via reparameterisation, a reward head predicts $\hat r_\tau$ (red), and the critic $V_\psi$ (orange) estimates the state value. The $\lambda$-returns $V^\lambda$ (red dashed, backward) serve as critic targets, while the actor’s analytic gradient $\nabla_\phi J$ (green) flows backward through the latent dynamics. A stop-gradient $\text{sg}[V^\lambda]$ keeps the critic target from interfering with the critic’s own gradient.

The critic is trained to predict $\lambda$ -returns — a weighted mixture of $n$ -step returns that trades off bias against variance Sutton & Barto, 2018:

V^\lambda_\tau = (1 - \lambda) \sum_{n=1}^{H-\tau-1} \lambda^{n-1} \hat{V}_n(\tau) \;+\; \lambda^{H-\tau-1} \hat{V}_{H-\tau}(\tau),

(15)

where each $n$ -step estimate $\hat{V}_n(\tau) = \sum_{k=0}^{n-1} \gamma^k \hat{r}_{\tau+k} + \gamma^n V_\psi(s_{\tau+n})$ sums imagined rewards and bootstraps from the critic. The actor and critic are then trained cooperatively, as in policy iteration: the actor maximises the predicted returns while the critic minimises its squared distance to them,

\max_\phi\; \mathbb{E}_{\pi_\phi,\, q_\theta}\!\left[\sum_{\tau=1}^{H} V^\lambda_\tau\right],

(16)

\min_\psi\; \mathbb{E}_{\pi_\phi,\, q_\theta}\!\left[\sum_{\tau=1}^{H} \tfrac{1}{2}\,\bigl\lVert V_\psi(s_\tau) - \mathrm{sg}\!\left(V^\lambda_\tau\right) \bigr\rVert^2\right],

(17)

where $\mathrm{sg}(\cdot)$ is the stop-gradient: the critic regresses toward the $\lambda$ -return as a fixed target, while the actor maximises it by propagating analytic gradients back through the learned dynamics (Equation (14)).
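The $\lambda$-return targets can be computed either as an explicit mixture of $n$-step returns or by a single backward pass. The sketch below does both so that they can be compared; the function names and the zero-based indexing are assumptions of this illustration. It uses the convention in which the longest available return, the one that runs all the way to the horizon, receives the remaining weight $\lambda^{H-\tau-1}$, so that the weights sum to one.

```python
def n_step_return(rewards, values, tau, n, gamma):
    """n-step estimate: n imagined rewards from step tau, then a bootstrap."""
    total = sum(gamma ** k * rewards[tau + k] for k in range(n))
    return total + gamma ** n * values[tau + n]


def lambda_return(rewards, values, tau, gamma, lam):
    """Mixture of n-step returns from step tau up to the horizon.

    rewards[i] and values[i] belong to imagined state i (zero-based); the
    last index of `values` is the horizon, used only for bootstrapping.
    """
    steps_left = len(values) - 1 - tau
    if steps_left == 0:
        return values[tau]  # at the horizon there is nothing left to mix
    mix = sum(
        (1.0 - lam) * lam ** (n - 1) * n_step_return(rewards, values, tau, n, gamma)
        for n in range(1, steps_left)
    )
    longest = n_step_return(rewards, values, tau, steps_left, gamma)
    return mix + lam ** (steps_left - 1) * longest


def lambda_returns_recursive(rewards, values, gamma, lam):
    """The same targets for all steps via one backward recursion."""
    horizon = len(values) - 1
    out = [0.0] * horizon
    nxt = values[horizon]
    for tau in range(horizon - 1, -1, -1):
        nxt = rewards[tau] + gamma * ((1.0 - lam) * values[tau + 1] + lam * nxt)
        out[tau] = nxt
    return out


rewards = [1.0, 0.5, 2.0, 0.0]       # imagined rewards (toy numbers)
values = [0.2, 0.4, 0.1, 0.3, 0.7]   # critic outputs, last one at the horizon
targets = lambda_returns_recursive(rewards, values, gamma=0.99, lam=0.95)
```

Setting `lam=0` reduces every target to the one-step return, and `lam=1` to the full discounted sum of imagined rewards plus a bootstrap at the horizon, which are the two ends of the bias-variance trade-off mentioned above.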

Note

Reparameterisation versus REINFORCE. REINFORCE policy gradients (Chapter 4) estimate $\nabla \mathbb{E}[R] = \mathbb{E}[R \nabla \log \pi]$ : the gradient is unbiased but has high variance, requiring many sample rollouts to converge. Reparameterisation computes $\partial a / \partial \phi$ analytically through the sampling process, producing a low-variance gradient at the cost of requiring a differentiable action distribution. Therefore Dreamer’s choice to use reparameterisation is important: it allows the actor to learn from trajectories as short as $H = 15$ imagined steps, where REINFORCE would produce gradients too noisy to be useful. The stop-gradient operator $\text{sg}[V^\lambda]$ on the critic’s target is equally important: without it, the critic’s own parameters would receive misleading gradients through the $\lambda$ -return definition, destabilising training.
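The variance gap is easy to see on a toy problem that has nothing to do with Dreamer itself: estimating $\frac{d}{d\mu}\,\mathbb{E}[z^2]$ for $z \sim \mathcal{N}(\mu, 1)$, whose true value is $2\mu$. The objective, the function name, and the sample count below are all assumptions chosen for illustration.

```python
import random
import statistics


def gradient_estimates(mu, num_samples, seed=0):
    """Per-sample estimates of d/dmu E[z^2] with z ~ N(mu, 1); truth is 2*mu."""
    rng = random.Random(seed)
    score_function, pathwise = [], []
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)
        z = mu + eps  # reparameterised sample
        # REINFORCE: f(z) * d/dmu log N(z; mu, 1) = z^2 * (z - mu)
        score_function.append(z * z * (z - mu))
        # Reparameterisation: d/dmu f(mu + eps) = 2 * (mu + eps)
        pathwise.append(2.0 * z)
    return score_function, pathwise


sf, pw = gradient_estimates(mu=1.0, num_samples=20000)
sf_mean, sf_var = statistics.mean(sf), statistics.variance(sf)
pw_mean, pw_var = statistics.mean(pw), statistics.variance(pw)
```

Both sample means land close to the true gradient of 2, but the per-sample variance of the score-function estimator is several times larger than that of the reparameterised one (for this toy objective the exact variances are 30 and 4), so the latter needs far fewer samples for the same accuracy.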

Environment interaction

Imagination alone cannot improve the agent indefinitely: the world model is only accurate where it has seen data, so the agent must periodically act in the real environment to collect fresh experience. This is the third Dreamer component, shown in Figure 6. To choose an action, the agent runs the RSSM forward over the history of the current episode — encoding each real observation through the posterior to obtain the model state $s_t$ — and queries the actor $\pi_\phi(a_t \mid s_t)$ , adding a small amount of exploration noise. Executing that action in the environment yields the next observation and reward, and the transition $(o_t, a_t, r_t)$ is appended to a replay buffer $\mathcal{D}$ .

The three components run in a single loop. The buffer is seeded with a handful of episodes collected under a random policy; thereafter the agent alternates between a fixed number of training updates (fitting the world model on sampled batches, then the actor-critic on rollouts imagined from those batches) and collecting one fresh episode with the current policy. Because behaviour is learned almost entirely from imagined rollouts rather than from the replayed transitions directly, Dreamer is strikingly sample-efficient: it substantially improved over PlaNet on the DeepMind Control Suite from pixels, and exceeded the strongest model-free baseline (D4PG) in final performance while using roughly 20× fewer environment steps, reaching an average score of 823 in $5\times10^6$ steps against D4PG’s 786 in $10^8$ steps Hafner et al., 2020.

The Dreamer Algorithm

Putting the three components together gives the full Dreamer agent of Hafner et al. (2020) (Algorithm 1 in the paper, with notation matched to this chapter):

Initialize the dataset $\mathcal{D}$ with $S$ random seed episodes.
Initialize the world-model parameters $\theta$ , actor parameters $\phi$ , and critic parameters $\psi$ randomly.
While not converged do
1. For update step $c = 1, \ldots, C$ do
  1. Dynamics learning. Draw $B$ data sequences $\{(o_t, a_t, r_t)\}_{t=k}^{k+L} \sim \mathcal{D}$ .
  2. Compute model states $s_t$ along each sequence from the representation/posterior model $p_\theta(s_t \mid s_{t-1}, a_{t-1}, o_t)$ .
  3. Update $\theta$ with one gradient step on the sequence ELBO (Equation (13)).
  4. Behavior learning. From each posterior state $s_t$ , imagine an $H$ -step trajectory $\{(s_\tau, a_\tau)\}_{\tau=t}^{t+H}$ under the frozen world model, with actions $a_\tau \sim \pi_\phi(a_\tau \mid s_\tau)$ .
  5. Predict imagined rewards $\hat r_\tau$ and values $V_\psi(s_\tau)$ , and compute the $\lambda$ -returns $V^\lambda_\tau$ (Equation (15)).
  6. Update the actor: $\phi \leftarrow \phi + \alpha\, \nabla_\phi \sum_{\tau} V^\lambda_\tau$ (Equation (16)).
  7. Update the critic: $\psi \leftarrow \psi - \alpha\, \nabla_\psi \sum_{\tau} \tfrac{1}{2}\bigl\lVert V_\psi(s_\tau) - \mathrm{sg}(V^\lambda_\tau)\bigr\rVert^2$ (Equation (17)).
2. Environment interaction. Reset the environment: $o_1 \leftarrow \texttt{env.reset()}$ .
3. For time step $t = 1, \ldots, T$ do
  1. Infer the model state $s_t$ from the history via the posterior $p_\theta(s_t \mid s_{t-1}, a_{t-1}, o_t)$ .
  2. Draw an action $a_t \sim \pi_\phi(a_t \mid s_t)$ and add exploration noise.
  3. Step the environment: $r_t, o_{t+1} \leftarrow \texttt{env.step}(a_t)$ .
4. Add the episode to the dataset: $\mathcal{D} \leftarrow \mathcal{D} \cup \{(o_t, a_t, r_t)\}_{t=1}^{T}$ .

The inner loop ($c = 1, \ldots, C$) interleaves dynamics learning and behavior learning, both consuming only batches drawn from $\mathcal{D}$; the outer loop then collects a single fresh episode with the current actor before the inner loop runs again. With one fixed setting across all continuous tasks ($S = 5$ seed episodes, $C = 100$ updates per collected episode, $B = 50$ sequences of length $L = 50$, imagination horizon $H = 15$, and $\lambda$-return parameters $\gamma = 0.99$, $\lambda = 0.95$), this loop is what reaches the sample efficiency reported above Hafner et al., 2020.

DreamerV2

Dreamer V1 was designed for continuous-control tasks with continuous Gaussian latents. Extending it to Atari — where observations are game frames, rewards are sparse, and the underlying dynamics have a discrete, switchlike character — required rethinking the latent representation. DreamerV2 Hafner et al., 2021 introduces three targeted changes to the world model of Figure 7.

Change 1: Categorical latents. DreamerV2 replaces the 30-dimensional continuous Gaussian latent with a vector of 32 categorical variables, each with 32 classes and each sampled from its own Categorical(32) distribution. The full latent $z_t = [z_t^1, \ldots, z_t^{32}]$ , stored as 32 one-hot vectors (1024 entries in total), can take $32^{32} = 2^{160} \approx 1.5 \times 10^{48}$ distinct values. Categorical distributions also naturally represent the discrete mode-switching dynamics that dominate Atari environments (a character picks up an item, the game transitions to a new level, a shot destroys a ship).
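To make the shape of this latent concrete, the following sketch samples one latent from random logits and counts the number of possible codes. It uses only the Python standard library; `sample_latent` and the randomly generated logits are hypothetical stand-ins for the outputs of the learned prior or posterior network.

```python
import math
import random

NUM_VARIABLES = 32  # categorical variables in the latent
NUM_CLASSES = 32    # classes per variable


def sample_latent(logits, rng):
    """Sample one one-hot vector per categorical variable.

    logits: NUM_VARIABLES lists of NUM_CLASSES unnormalised log-probabilities.
    Returns a flat list of NUM_VARIABLES * NUM_CLASSES zeros and ones.
    """
    flat = []
    for row in logits:
        peak = max(row)  # subtract the maximum for numerical stability
        weights = [math.exp(x - peak) for x in row]
        index = rng.choices(range(len(row)), weights=weights, k=1)[0]
        flat.extend(1.0 if k == index else 0.0 for k in range(len(row)))
    return flat


rng = random.Random(0)
# Hypothetical logits; in the agent these come from the world model.
logits = [[rng.gauss(0.0, 1.0) for _ in range(NUM_CLASSES)]
          for _ in range(NUM_VARIABLES)]
z = sample_latent(logits, rng)

num_codes = NUM_CLASSES ** NUM_VARIABLES
print(len(z), int(sum(z)))   # 1024 32
print(f"{num_codes:.2e}")    # 1.46e+48
```

The flat vector has 1024 entries but only 32 of them are non-zero, one per categorical variable.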

Change 2: Straight-through gradients. Drawing a one-hot sample from a categorical distribution is not a differentiable operation, so the reparameterisation trick used for Gaussian latents cannot be applied directly. DreamerV2 uses the straight-through gradient estimator (a standard trick for discrete latents, going back to Bengio et al. 2013 and related to the Gumbel-softmax of Jang et al. 2017): in the forward pass, $z_t^i$ is sampled as a one-hot vector from the softmax of learned logits; in the backward pass, the gradient is passed to the logits as if the one-hot sample were identical to the softmax output. This is a biased but low-variance estimator that has proven effective in practice for discrete latent variable models.
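The forward and backward passes of the straight-through estimator can be written out by hand for a single categorical variable. The sketch below is illustrative only: the function names and the toy linear loss are assumptions, and the backward pass is simply the softmax vector-Jacobian product that an autodiff framework would compute if the sample were replaced by the softmax probabilities.

```python
import math
import random


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def straight_through_forward(logits, rng):
    """Forward pass: draw a one-hot sample from softmax(logits)."""
    probs = softmax(logits)
    index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    one_hot = [1.0 if k == index else 0.0 for k in range(len(probs))]
    return one_hot, probs


def straight_through_backward(probs, grad_sample):
    """Backward pass: pretend the sample was softmax(logits).

    grad_sample is dLoss/dz for the one-hot sample z. The result is
    dLoss/dlogits_i = p_i * (g_i - sum_j p_j * g_j).
    """
    mean_grad = sum(p * g for p, g in zip(probs, grad_sample))
    return [p * (g - mean_grad) for p, g in zip(probs, grad_sample)]


rng = random.Random(0)
logits = [2.0, 0.5, -1.0, 0.0]
z, probs = straight_through_forward(logits, rng)

# Toy loss = sum_k w_k * z_k, so dLoss/dz = w (w is a made-up weight vector).
w = [1.0, -2.0, 0.5, 3.0]
grad_logits = straight_through_backward(probs, w)
```

In an autodiff framework the same effect is commonly obtained by writing the sample as `one_hot + probs - stop_gradient(probs)`, so that the forward value is the one-hot sample while the gradient follows `probs`.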

Change 3: KL balancing. The KL term in the ELBO plays two roles at once: it trains the prior (the transition model) toward the posterior, and it regularises the posterior toward the prior. Weighting both directions equally is harmful, because the transition prior is hard to learn and the posterior should not be dragged toward a poorly trained prior. DreamerV2 decouples the two directions with an asymmetric KL loss:

$$
\mathcal{L}_\text{KL} = \alpha \cdot \mathrm{KL}\!\left(\text{sg}[p] \,\|\, q\right) + (1-\alpha) \cdot \mathrm{KL}\!\left(p \,\|\, \text{sg}[q]\right), \tag{18}
$$

where $p$ is the posterior, $q$ is the prior, and $\alpha = 0.8$ . The stop-gradient ( $\text{sg}[\cdot]$ ) on the posterior in the first term means that this term, which carries 80% of the weight, sends gradient only into the prior, pushing the prior to match the better-informed posterior. The second term, with the stop-gradient on the prior and the remaining 20% of the weight, regularises the posterior toward the prior in the usual direction. The asymmetry trains the prior faster than it constrains the posterior: the prior becomes an accurate predictor of the latent dynamics, while the posterior retains the freedom to diverge from the prior when the observation is informative.
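Because the stop-gradients leave the forward value untouched, the effect of KL balancing is visible only in the gradients. The sketch below works this out analytically for a single categorical variable parameterised by logits. It is a hand-derived illustration under that assumption, not DreamerV2's implementation, and `balanced_kl` is a hypothetical name.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) in nats for two categorical distributions."""
    return sum(pi * (math.log(pi) - math.log(qi))
               for pi, qi in zip(p, q) if pi > 0.0)


def balanced_kl(post_logits, prior_logits, alpha=0.8):
    """Loss value and gradients under KL balancing.

    Forward value: alpha * KL + (1 - alpha) * KL = KL(posterior || prior).
    Backward pass:
      prior logits receive     alpha * dKL/d(prior logits),
      posterior logits receive (1 - alpha) * dKL/d(posterior logits).
    """
    p = softmax(post_logits)   # posterior
    q = softmax(prior_logits)  # prior
    value = kl(p, q)
    grad_prior = [alpha * (qi - pi) for pi, qi in zip(p, q)]
    grad_post = [
        (1.0 - alpha) * pi * (math.log(pi) - math.log(qi) - value)
        for pi, qi in zip(p, q)
    ]
    return value, grad_post, grad_prior


value, grad_post, grad_prior = balanced_kl([1.0, 0.0, -1.0], [0.0, 0.0, 0.0])
```

With $\alpha = 0.8$ the prior logits receive four times the weight that the posterior logits do, which is the asymmetry described above.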

With these three changes, DreamerV2 matched or exceeded Rainbow — a carefully tuned model-free deep Q-learning baseline — on the 200M-frame Atari benchmark (55 games), becoming the first agent to reach human-level Atari performance by learning behaviour purely inside a separately trained world model Hafner et al., 2021.

DreamerV3

DreamerV1 and V2 each required task-specific tuning of reward scaling, KL coefficients, and return normalisation. DreamerV3 Hafner et al., 2025 sets a more ambitious goal: a single agent with a single fixed set of hyperparameters that performs well across a qualitatively diverse collection of domains — Atari games, continuous locomotion from the DeepMind Control Suite, partially observable memory tasks from BSuite Osband et al., 2020, robotic manipulation, and the open-ended Minecraft environment.

Figure 8: DreamerV3’s universal agent. The same network and hyperparameters are applied across qualitatively different domain families without per-domain tuning. Three innovations enable this domain invariance: symlog transforms stabilise reconstruction and return targets across wildly different reward scales; free bits prevent latent collapse in environments with low-information observations; and percentile return normalisation makes the actor loss robust to varying reward magnitudes. The result is the first agent to collect a diamond in Minecraft from scratch.

The core challenge in building a universal agent is that the statistics of different environments vary enormously: Atari rewards range into the hundreds per frame; some continuous control tasks produce negative rewards throughout; Minecraft episodes last thousands of steps with sparse, delayed reward. A single set of loss coefficients cannot serve all of these simultaneously without explicit mechanisms to normalise the relevant quantities.

Symlog transformations. DreamerV3 applies a symmetric logarithmic transform to network inputs, reconstruction targets, and return estimates:

$$
\mathrm{symlog}(x) = \mathrm{sign}(x) \cdot \ln(|x| + 1). \tag{19}
$$

Symlog compresses large magnitudes while preserving the sign, and it is approximately linear near zero. It is applied to the observation encoder inputs, to the targets of the observation decoder, and to the reward and return targets that the reward predictor and critic are trained on; the actor instead sees returns rescaled by the percentile normalisation described below. The inverse transform $\mathrm{symexp}(x) = \mathrm{sign}(x)(\exp(|x|) - 1)$ is applied to network outputs to recover predictions on the original scale. This single change removes the need to hand-tune observation and reward normalisation coefficients across domains.
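Equation (19) and its inverse are short enough to write down directly. The sketch below is a scalar, standard-library version for illustration; a real agent would apply the same functions elementwise to tensors.

```python
import math


def symlog(x):
    """sign(x) * ln(|x| + 1), as in Equation (19)."""
    return math.copysign(math.log1p(abs(x)), x)


def symexp(x):
    """Inverse of symlog: sign(x) * (exp(|x|) - 1)."""
    return math.copysign(math.expm1(abs(x)), x)


for value in [-1000.0, -1.0, 0.0, 0.01, 1.0, 1000.0]:
    squashed = symlog(value)
    restored = symexp(squashed)
    print(f"{value:10.2f} -> {squashed:8.4f} -> {restored:10.2f}")
```

A reward of 1000 is mapped to about 6.91, while a reward of 0.01 stays close to 0.01, so targets from very different reward scales end up in a comparable range.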

Free bits. Posterior collapse, in which the RSSM latent carries no information about the observation, is a persistent hazard in environments where observations are nearly deterministic given the action history. DreamerV3 guards against it by clipping the KL term from below at a minimum value of 1 nat:

$$
\mathcal{L}_\text{KL} = \max\!\bigl(1,\, \mathrm{KL}[p \,\|\, q]\bigr). \tag{20}
$$

Note

Nats, in brief. A nat (natural unit of information) measures information using the natural logarithm $\ln$ (base $e$ ). Entropy and KL divergence in deep learning are almost always written with $\ln$ , so their values are reported in nats rather than bits (which use $\log_2$ ). The two units differ by a constant factor: $1\ \text{nat} = \tfrac{1}{\ln 2}\ \text{bits} \approx 1.44$ bits, and $1\ \text{bit} = \ln 2\ \text{nats} \approx 0.69$ nats (a nat is the larger unit, since $e > 2$ ). The free-bits floor of 1 nat in Equation (20) therefore leaves the latent a modest but non-zero information budget before the KL penalty kicks in; the mechanism is called “free bits” for historical reasons even though the threshold is stated in nats.

This “free bits” mechanism Hafner et al., 2025 switches the KL penalty off whenever the KL is already below 1 nat: the clipped loss is constant there, so it contributes no gradient, and the posterior can keep roughly 1 nat of information about the observation at no cost, regardless of how well the prior already predicts it. (In full, DreamerV3 applies the clip to two separately weighted KL terms: a dynamics loss training the prior, with $\beta_\text{dyn} = 0.5$ , and a representation loss training the posterior, with $\beta_\text{rep} = 0.1$ , alongside the unclipped prediction loss, with $\beta_\text{pred} = 1$ . This carries V2’s KL balancing forward; the free-bits clip is the ingredient that prevents collapse.)
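The clip in Equation (20) and the nat-to-bit conversion from the note can be checked with a few lines of Python. This is a scalar sketch with hypothetical function names; it shows only the clipping of a single KL value and omits the separate dynamics and representation weights.

```python
import math

FREE_NATS = 1.0  # threshold from Equation (20)


def free_bits(kl_nats, floor=FREE_NATS):
    """Clip a KL value (in nats) from below."""
    return max(floor, kl_nats)


def free_bits_grad(kl_nats, floor=FREE_NATS):
    """Derivative of the clipped loss with respect to the KL value.

    Below the floor the loss is constant, so no gradient flows.
    """
    return 1.0 if kl_nats > floor else 0.0


def nats_to_bits(nats):
    return nats / math.log(2.0)


for kl_value in [0.3, 1.0, 2.5]:
    print(kl_value, free_bits(kl_value), free_bits_grad(kl_value))

print(round(nats_to_bits(FREE_NATS), 2))  # 1.44
```

A KL of 0.3 nats is reported as a loss of 1.0 with zero gradient, which is what removes the pressure on the posterior to discard its remaining information.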

Percentile return normalisation. The actor’s policy gradient is proportional to the advantage — the difference between the $\lambda$ -return and the critic’s baseline. If returns vary over several orders of magnitude across an episode (as they do in Minecraft, where reward is extremely sparse for thousands of steps and then suddenly large upon collecting a rare resource), a fixed learning rate produces wildly inconsistent update magnitudes. DreamerV3 normalises the return signal by the inter-percentile range of recent returns:

$$
\hat{R} = \frac{R}{\max\!\bigl(\mathrm{perc}_{95}(R) - \mathrm{perc}_5(R),\, 1\bigr)}, \tag{21}
$$

where the percentiles are computed over the returns in each training batch and smoothed with an exponential moving average. DreamerV3 only rescales the returns by the inter-percentile range $S = \mathrm{perc}_{95} - \mathrm{perc}_5$ ; it does not subtract a baseline here, since the critic already supplies one. The denominator is clipped at 1 so that returns are only ever scaled down: when rewards are sparse and the range is close to zero, small returns are left untouched rather than being amplified along with their noise. This normalisation makes the actor’s effective learning rate robust to the absolute scale of rewards across environments and across training phases within a single environment.
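A minimal sketch of Equation (21) follows. The class name `ReturnNormalizer`, the decay constant, the choice to initialise the moving averages from the first batch, and the linear-interpolation percentile are all assumptions made for illustration; they are not taken from the DreamerV3 implementation.

```python
def percentile(values, q):
    """Linearly interpolated q-th percentile (0 <= q <= 100) of a list."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    frac = position - lower
    return ordered[lower] * (1.0 - frac) + ordered[upper] * frac


class ReturnNormalizer:
    """Tracks smoothed 5th/95th return percentiles and rescales returns."""

    def __init__(self, decay=0.99):  # decay value is an assumption
        self.decay = decay
        self.low = None
        self.high = None

    def update(self, returns):
        low, high = percentile(returns, 5), percentile(returns, 95)
        if self.low is None:  # first batch initialises the averages
            self.low, self.high = low, high
        else:
            self.low = self.decay * self.low + (1 - self.decay) * low
            self.high = self.decay * self.high + (1 - self.decay) * high

    def scale(self):
        # Clip at 1 so that small returns are never scaled up.
        return max(self.high - self.low, 1.0)

    def __call__(self, returns):
        s = self.scale()
        return [r / s for r in returns]


normalizer = ReturnNormalizer()
sparse = [0.0] * 95 + [0.2] * 5   # hypothetical sparse-reward batch
normalizer.update(sparse)
print(normalizer.scale())          # 1.0: range is below 1, returns unchanged
```

Only the scale is changed; no mean or baseline is subtracted, which matches the description above.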

The architecture also benefits from several engineering improvements over V2: a substantially larger model (200M parameters versus ~20M in V2), layer normalisation (LayerNorm) throughout, and SiLU activation functions replacing ELU. Together, these changes (three principled algorithmic innovations plus improved engineering) yield an agent that achieves human-level performance across Atari 57, matches or exceeds specialised agents on the DeepMind Control Suite, solves BSuite memory tasks, and collects a diamond in Minecraft from scratch, making it the first agent to do so without any human demonstrations or domain-specific prior knowledge Hafner et al., 2025.

Notably, the same fixed-hyperparameter agent is also competitive in the sample-efficient regime that EfficientZero was built for (EfficientZero: Closing the Data-Efficiency Gap). On the Atari 100k benchmark — 26 games with a budget of only 400K frames, about two hours of play — DreamerV3 outperforms the best prior methods other than EfficientZero, including the transformer-based world models IRIS and TWM, the model-free SPR, and SimPLe Hafner et al., 2025. EfficientZero still holds the benchmark’s state of the art, but only by adding online tree search, prioritised replay, and early level resets — extra machinery that, as the DreamerV3 authors note, complicates a direct comparison. That a single world-model agent tuned for nothing in particular lands just behind a benchmark-specialised tree-search method is a striking illustration of how far latent imagination has come.

What’s Next in World Models

Every method in this chapter trains its world model from scratch, on data from the single task the agent is learning, and discards it when the task changes. The most active frontier in 2025–2026 reverses this assumption: it treats the world model as a pretrained, reusable artifact, much as the rest of deep learning has moved from task-specific training to large pretrained foundations. Three recent lines of work illustrate the shift.

DINO-WM Zhou et al., 2024 builds its world model on top of frozen DINOv2 patch features rather than learning a representation jointly with the dynamics. Because the visual backbone is pretrained and never updated, the model needs no pixel-reconstruction decoder at all — it simply predicts future patch embeddings with a Vision Transformer. At test time, planning is posed as visual goal-reaching: model-predictive control that minimises the latent distance to a goal image, solved zero-shot without task-specific reward models or demonstrations. This decouples the expensive representation learning, done once on broad image data, from the cheap, task-specific dynamics model.

LeWorldModel Maes et al., 2026 pursues the same reconstruction-free goal from the joint-embedding predictive architecture (JEPA) direction LeCun, 2022, training end-to-end from raw pixels on a single GPU. Its key ingredient is a regulariser, SIGReg, that forces the latent embeddings toward an isotropic Gaussian and so directly prevents the representation collapse that has long destabilised reconstruction-free predictive models — the failure mode that observation decoders and stop-gradients were previously needed to avoid.

DreamerV4 Hafner et al., 2025 scales the Dreamer recipe into a large block-causal transformer world model that attends jointly over spatial patches and time, made fast enough for real-time imagination by a shortcut-forcing training objective. Trained purely on a fixed offline dataset of recorded human Minecraft play — with no live environment interaction — it learns its behaviour entirely inside the learned model and becomes the first agent to obtain diamonds in Minecraft from offline data alone.

Summary

This chapter developed the world-model paradigm through two complementary families. MuZero brought tree search to settings with no given rules, learning an abstract latent engine — dynamics, reward, value, and policy heads — shaped entirely by what MCTS needs and under no obligation to reconstruct observations; EfficientZero then made that recipe sample-efficient enough to rival human learning speed. The Dreamer family, descending from Ha and Schmidhuber’s V/M/C template and PlaNet’s RSSM, took the opposite stance on the latent space: it reconstructs observations and trains an actor-critic entirely inside imagined rollouts of the learned model, scaling from continuous control (V1) to discrete Atari dynamics (V2) to a single universal agent across dozens of domains (V3).

Two recurring themes run through every method. The first is compression. Every approach encodes high-dimensional observations into a compact latent state that retains what is necessary for predicting rewards and future states. The dimensionality reduction is not just a computational convenience — it is what makes planning tractable. Planning over a 30-dimensional Gaussian or a 1024-dimensional categorical code is qualitatively different from planning over a raw pixel frame.

The second is imagination. Once a compact latent model is available, the agent can generate synthetic experience — imagined rollouts, simulated MCTS expansions — at a fraction of the cost of real interaction. Dreamer uses this to train an actor-critic without touching the environment during behaviour learning. MuZero uses this to run MCTS without querying a simulator. In both cases, the quality of the imagination directly determines the quality of the resulting policy.

| Algorithm | Model | Planning | Key innovation |
| --- | --- | --- | --- |
| MuZero Schrittwieser et al., 2020 | $h$ + $g$ + $f$ (abstract) | MCTS in latent space | No rules needed; board games + Atari |
| EfficientZero Ye et al., 2021 | MuZero + consistency | MCTS in latent space | Self-supervised consistency; 1 000× efficiency |
| World Models Ha & Schmidhuber, 2018 | VAE + MDN-RNN | Dream training (CMA-ES) | Train controller inside learned model |
| PlaNet Hafner et al., 2019 | RSSM | CEM in latent space | Planning without policy; latent overshooting |
| Dreamer Hafner et al., 2020 | RSSM + decoders | Actor-critic in imagination | BPTT through world model; $\lambda$ -returns |
| DreamerV2 Hafner et al., 2021 | RSSM + categorical | Actor-critic in imagination | Discrete latents; KL balancing |
| DreamerV3 Hafner et al., 2025 | RSSM (large) | Actor-critic in imagination | Symlog; free bits; universal hyperparameters |

The two families differ in their choice of latent space structure and planning mechanism. The Dreamer family maintains a recurrent stochastic latent that must decode to observations, enabling rich world models suitable for continuous-control tasks with complex visual inputs. MuZero and EfficientZero use an entirely abstract latent space shaped only by task-relevant predictions, enabling efficient tree search in discrete and combinatorial settings. Neither approach dominates universally: Dreamer’s imagination-based actor-critic is more sample-efficient than MCTS for high-frequency continuous control; MuZero’s tree search finds stronger combinatorial strategies than amortised policies in games. Open questions remain around bridging the two families — can imagination-based planning and tree search be combined? — and extending world models to partially observable, multi-agent, and non-stationary settings. A further shift is already underway: from world models trained per task toward pretrained, reusable ones — DINO-WM Zhou et al., 2024, LeWorldModel Maes et al., 2026, and DreamerV4 Hafner et al., 2025 — which aim to learn dynamics once on broad data and transfer them across tasks.
