Model-Based Reinforcement Learning

Every approach in the previous chapters — value iteration, policy gradients, actor-critic methods — treats the environment as an inscrutable black box that you query by executing actions. The agent learns what to do but never explicitly learns what will happen. This chapter opens a different door: the agent also acquires a model of the environment’s dynamics and uses it to plan — to reason ahead about consequences before committing to an action.

The motivation is sample efficiency. Real environments are expensive: a robot burns time and hardware interacting with the physical world; a drug-discovery agent runs wet-lab assays; even a simulator can be the computational bottleneck. If the agent has a model, it can generate imagined experience at near-zero cost, multiplying the number of learning updates per unit of real interaction.

Two questions arise:

Is the model given or must it be learned? Game rules, physics simulators, and known differential equations are given. For most real problems the model must be estimated from data.
How is the model used for planning? It can generate extra transitions to augment a replay buffer (background planning), or it can be queried at decision time to look ahead and choose actions (decision-time planning).

This chapter works through these ideas from the ground up. We begin with a taxonomy that organises the space of model-based approaches, then cover background-planning methods and their deep-learning extensions (learned-model background planning). We then turn to decision-time planning, using two-player perfect-information games as our running example: self-play, Monte Carlo Tree Search, and Reinforcement Learning combine to master the game of Go (given-model decision-time planning).

Slides for this chapter (open full screen).

Taxonomy of Model-Based RL¶

A model is anything the agent can query to ask “if I take action $a$ in state $s$ , what state $s'$ do I reach and what reward $r$ do I receive?” — formally, any (approximation of the) transition distribution $p(s', r \mid s, a)$ Sutton & Barto, 2018Moerland et al., 2023. Model-based RL is organised by two orthogonal questions about that model: where does it come from? and how is it used to choose actions? Together these two axes carve out four canonical regions, shown in Figure 2; the remainder of this section works through them in turn.

Figure 2:A conceptual taxonomy of model-based reinforcement learning organised along two independent axes. The vertical axis distinguishes how the model is obtained: learned from experience (orange ellipse, top) or provided as a perfect simulator — such as a game engine or physics model — (green ellipse, bottom). These two regions do not overlap: a model is either given or must be estimated. The horizontal axis distinguishes how the model is used for planning: decision-time planning looks ahead before each action (cyan ellipse, left); background planning generates synthetic transitions to augment learning offline (blue ellipse, right). These two modes may overlap for methods that use both. Model-free methods (right, dashed box) use neither a model nor explicit planning.

Where the Model Comes From¶

Given models are perfect simulators handed to the agent for free. The canonical example is a game engine — the rules of Go, Chess, or Shogi fully determine the next position Silver et al., 2016Silver et al., 2018. Physics engines such as MuJoCo or Isaac Gym play the same role for continuous-control benchmarks Todorov et al., 2012Makoviychuk et al., 2021, and known ODE or PDE dynamics fill the same slot in classical control.

Their advantage is decisive: there is no model bias, the simulator can be queried arbitrarily many times, and search depth is limited only by compute. The drawbacks are practical rather than statistical. First, perfect simulators are rare in real-world settings — robots, markets, molecules and physiological systems have no closed-form transition function Moerland et al., 2023. Second, even when a simulator exists, its wall-clock cost can become the bottleneck: large-scale game-playing agents spend most of their compute inside the simulator, not on gradient steps Vinyals et al., 2019Berner et al., 2019. Third, when the “simulator” is itself an approximation — e.g. a randomised physics model used to train a robot policy — the sim-to-real gap turns into a model-bias problem in disguise Akkaya et al., 2019.

Learned models are estimated from the agent’s own experience. The agent fits an approximate dynamics function $\hat T$ and reward function $\hat R$ from the transitions it has collected so far, and can then query $(\hat T, \hat R)$ instead of (or in addition to) the real environment Sutton, 1991. Concrete instances range from Gaussian-process models in PILCO Deisenroth & Rasmussen, 2011 to probabilistic neural-network ensembles in PETS and MBPO Chua et al., 2018Janner et al., 2019, and on to compact latent-space world models trained directly from pixels Ha & Schmidhuber, 2018Hafner et al., 2025.

A learned model is the only viable option when no simulator exists, and it gets better with data in the same loop that collects it. The price is model bias: $\hat T$ is never perfect, and prediction errors compound on multi-step rollouts so that even small per-step inaccuracies derail long-horizon planning Janner et al., 2019Chua et al., 2018. A learned model also does not bypass the cost of exploration — it can only predict transitions in regions it has already seen.

How the Model Is Used¶

Planning means using a model to compute or improve a policy without additional real-world interaction Sutton & Barto, 2018Moerland et al., 2023. The classical instantiation is Dynamic Programming, summarised in the note below; modern model-based RL can be read as two complementary ways of approximating DP when full sweeps over the state space are not tractable.

Note

Dynamic Programming, in brief. When the full transition model $p(s', r \mid s, a)$ is known, the Bellman optimality equation Bellman, 1957Puterman, 1994

V^*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma V^*(s')\bigr]

(1)

decomposes $V^*$ in terms of itself. Value iteration turns this into an algorithm: repeatedly sweep over all states and apply the Bellman backup until convergence Bertsekas, 2019Sutton & Barto, 2018. No environment interaction is needed — each update uses the complete distribution over next states rather than a single sampled transition, which makes DP orders of magnitude more data-efficient than Monte Carlo or TD methods when applicable.

Two limits prevent pure DP from scaling: (i) the model is rarely given in closed form, and (ii) enumerating all states is hopeless for continuous or high-dimensional spaces. Q-learning Watkins & Dayan, 1992 can itself be read as a sampled single-step DP update — a Bellman backup that replaces the full expectation with one observed transition Sutton & Barto, 2018. Decision-time and background planning, defined next, are the two scalable approximations of DP that modern model-based RL relies on.

Decision-time planning queries the model online, at the moment an action must be chosen. From the current state $s_t$ the agent rolls forward a short trajectory, builds a search tree, or optimises a finite-horizon control sequence, and uses the result to pick a single action; the next state triggers a new search from scratch. The two canonical instances are Monte Carlo Tree Search in board games Silver et al., 2016Silver et al., 2017 and Model Predictive Control (MPC) in robotics and process control Camacho & Bordons, 2007Lazic et al., 2018, with sampling-based MPC variants such as MPPI now standard in high-frequency control Williams et al., 2017Chua et al., 2018.

Its benefits are structural. Each decision adapts to the current state rather than to an average state distribution, so fresh information can be exploited immediately. Model bias is contained to the planning horizon — errors influence the single action that is executed but never leak into a stored value function or policy Moerland et al., 2023. And no replay buffer of imagined transitions has to be maintained. The drawback is latency: every action incurs a search, and per-decision compute scales with horizon and branching factor, which rules out decision-time planning for high-frequency control unless the search itself is cheap.

Note

Model Predictive Control, in brief. MPC is the receding-horizon idea: at each step the agent (i) predicts the next $H$ states under a candidate action sequence $a_{t:t+H-1}$ , (ii) optimises that sequence against a cost over the predicted rollout, (iii) executes only the first action $a_t$ , and (iv) discards the rest, replanning from the new observed state on the next step Camacho & Bordons, 2007. The recipe originated in industrial process control in the 1970s–80s and is now a workhorse of robotics, autonomous driving, and even data-centre cooling Lazic et al., 2018.

Replanning is what makes MPC robust to model error: predictions deep in the planning horizon are inevitably wrong, but they are never executed — by the time the agent reaches step $t+5$ , it has re-optimised with five additional real observations. MPC is therefore a closed-loop policy built from open-loop optimisation, and it tolerates substantially more model error than the horizon $H$ alone would suggest Camacho & Bordons, 2007. Two optimiser families dominate. Gradient-based MPC — Differential Dynamic Programming Jacobson & Mayne, 1970 and its first-order variant iLQR Tassa et al., 2012 — exploits smooth, differentiable dynamics and converges in a handful of iterations. Sampling-based MPC — the Cross-Entropy Method Boer et al., 2005 and MPPI Williams et al., 2017 — treats the dynamics as a black box and weighs many parallel candidate sequences by predicted cost. The sampling flavour pairs naturally with learned neural-network dynamics, where gradients exist but are often noisy Chua et al., 2018.

Background planning queries the model offline, between real environment steps, and uses the resulting synthetic transitions to train a policy or value function. This is the architecture introduced by Dyna Sutton, 1991 and scaled to deep continuous control by MBPO Janner et al., 2019 and to discrete domains by SimPLe Kaiser et al., 2020. Each real step yields many imagined transitions, and the off-policy learner trains on a mixture of real and imagined data.

The benefits are complementary to decision-time planning. Inference at action time is cheap — the policy network alone acts, with no search in the loop — making background planning a natural fit for off-policy deep RL with replay buffers. Planning is amortised across many gradient updates rather than re-done from scratch per action, and the synthetic data multiplies the number of updates per unit of real interaction. The drawback is that model errors propagate into the learned value or policy: a wrong simulated transition becomes a wrong gradient. The rollout horizon must therefore be tuned carefully — in MBPO short rollouts ( $k \le 5$ ) work, longer ones quickly degrade the policy because per-step error compounds Janner et al., 2019.

Learned-Model Planning¶

The agent does not receive a simulator — it must estimate the environment’s dynamics from collected experience. We will first cover the canonical reference and then an instantiation of a deep RL algorithm.

DYNA: The Blueprint¶

Dyna Sutton, 1991 is the foundational learned-model background-planning architecture. It integrates three concurrent processes into a single agent loop:

Direct RL — learn from real transitions exactly as in Q-learning.
Model learning — fit a model $\hat T, \hat R$ to real transitions.
Planning — draw $n$ simulated transitions from $\hat T, \hat R$ and use each for an additional Q-learning update.

Figure 3:The DYNA architecture. Real interactions update the Q-table (path ①) and improve the model. The model then generates $n$ simulated transitions per real step for additional Q-learning updates (path ②). The dashed arrow marks simulated experience as cheaper than real experience.

Dyna-Q makes this concrete for tabular settings. After each real step $(s, a, r, s')$ :

Update $Q(s, a)$ with the real transition (standard TD update).
Update the model: $\hat T(s, a) \leftarrow s'$ ; $\hat R(s, a) \leftarrow r$ .
Repeat $n$ times: sample a previously-seen $(s, a)$ , query the model for $(\hat r, \hat s')$ , and apply a Q-learning update.

With $n = 50$ planning steps per real step, Dyna-Q matched the sample efficiency of a model-free agent receiving 50× more environment interactions in Sutton’s original maze experiments Sutton, 1990Sutton & Barto, 2018.

Three principles emerge from the Dyna blueprint:

Real experience is data-efficient — every transition trains both the policy and the model.
Simulated experience is compute-efficient — model queries are cheap, so $n \gg 1$ is feasible.
Model quality is the bottleneck — a wrong model steers planning in the wrong direction.

MBPO: Deep DYNA with Ensemble Dynamics¶

Model-Based Policy Optimization (MBPO) Janner et al., 2019 scales the Dyna idea to continuous control. The tabular Dyna model is a deterministic lookup table; in continuous spaces the model must output a distribution,

\hat T(s, a) = \mathcal{N}\!\bigl(\mu_\phi(s,a),\, \sigma^2_\phi(s,a)\bigr),

(2)

to capture both environment stochasticity and uncertainty about its own predictions. MBPO uses an ensemble of such probabilistic neural network dynamics models together with SAC (Soft Actor-Critic) Haarnoja et al., 2018 as the underlying off-policy algorithm.

The key innovation is the branched rollout scheme: MBPO starts each simulated rollout from a real state sampled from an environment replay buffer $\mathcal{D}_\text{env}$ and rolls forward only $k$ steps. The short rollouts are stored in $\mathcal{D}_\text{model}$ , and SAC trains on a mixture of both buffers.

MBPO’s branched rollout scheme. A probabilistic ensemble of N dynamics models is trained on \mathcal{D}_\text{env}. Each rollout starts from a real state and runs k steps; simulated transitions fill \mathcal{D}_\text{model}. SAC draws from both buffers. — Figure 4:MBPO’s branched rollout scheme. A probabilistic ensemble of $N$ dynamics models is trained on $\mathcal{D}_\text{env}$ . Each rollout starts from a real state and runs $k$ steps; simulated transitions fill $\mathcal{D}_\text{model}$ . SAC draws from both buffers.

The algorithm¶

Putting the pieces together gives the MBPO training loop of Janner et al. (2019) (Algorithm 2 in the paper, with notation matched to this chapter):

Initialize SAC policy $\pi_\psi$ , an ensemble of $N$ dynamics models $\hat T_{\phi_i}(s,a) = \mathcal{N}\!\bigl(\mu_{\phi_i}(s,a),\, \sigma^2_{\phi_i}(s,a)\bigr)$ for $i = 1,\ldots,N$ , environment buffer $\mathcal{D}_\text{env}$ , and model buffer $\mathcal{D}_\text{model}$ .
For $N_\text{epoch}$ epochs do
1. Train each $\hat T_{\phi_i}$ on $\mathcal{D}_\text{env}$ via maximum likelihood.
2. For $E$ environment steps do
  1. Observe state $s_t$ . Take $a_t \sim \pi_\psi(\cdot \mid s_t)$ ; execute $a_t$ , observe $r_t$ and $s_{t+1}$ ; add $(s_t, a_t, r_t, s_{t+1})$ to $\mathcal{D}_\text{env}$ .
  2. For $M$ model rollouts do
    1. Sample a start state $s$ uniformly from $\mathcal{D}_\text{env}$ .
    2. For $j = 1,\ldots,k$ do
      1. Sample an ensemble member $i \sim \mathcal{U}\{1,\ldots,N\}$ .
      2. Sample $a \sim \pi_\psi(\cdot \mid s)$ ; sample $(s', r) \sim \hat T_{\phi_i}(s,a)$ .
      3. Add $(s, a, r, s')$ to $\mathcal{D}_\text{model}$ ; set $s \leftarrow s'$ .
  3. For $G$ gradient updates do
    1. Sample mini-batches from $\mathcal{D}_\text{model}$ and take SAC steps on $\psi$ Haarnoja et al., 2018 (off-policy actor–critic updates with clipped double- $Q$ targets and an entropy bonus).

The outer loop collects real data and refits the model; the middle loop generates short branched rollouts; the inner loop amortises planning into many policy updates on imagined transitions. In practice, SAC is often trained on a mixture of $\mathcal{D}_\text{env}$ and $\mathcal{D}_\text{model}$ , even though the published pseudocode shows updates on $\mathcal{D}_\text{model}$ only Janner et al., 2019.

Why ensembles? An ensemble of $N$ independently initialised models (typically $N = 7$ ) captures epistemic uncertainty — uncertainty arising from limited data that shrinks as more experience arrives. This differs from aleatoric uncertainty (irreducible environment stochasticity), which is captured within each member’s output variance. High ensemble disagreement signals epistemically uncertain states that the agent should visit with real data rather than simulate.

Why short rollouts? Model error $\epsilon$ per step accumulates roughly as $k\epsilon$ over $k$ steps. For $k \in [1, 5]$ the planning benefit dominates; for $k \gg 5$ the policy degrades Janner et al., 2019.

MBPO achieves an order-of-magnitude improvement in sample efficiency over SAC alone on standard MuJoCo benchmarks, often reaching comparable final performance in $5\text{–}20\times$ fewer environment steps, though gains vary by environment.

Given-Model Planning¶

The agent has access to a perfect simulator — the rules of a board game or a known physics model — and never needs to estimate the transition function. The challenge shifts from model learning to effective search: how do you find the best action in a state space too large to enumerate?

Two-Player Perfect Information Games¶

The second major thread in model-based RL arose from board games — specifically from the question: can a computer reach human-level play through search and learning alone?

Two-player perfect information (TPIG) games are zero-sum sequential games in which both players see the complete game state at all times. Chess, Go, Shogi, Checkers, and Othello all fall into this class. Because the state is fully visible and the rules are deterministic, the game can be described by a perfect transition model — the game engine — that tells you exactly what follows from any action.

The game of Go is played on a 19×19 grid; players alternately place black and white stones, aiming to control territory Silver et al., 2016. Go has an average branching factor of approximately 250 legal moves per position (compared to roughly 35 in Chess) and typical game depths on the order of 150 moves Silver et al., 2016, yielding a game tree whose depth and breadth make exhaustive search completely infeasible Silver et al., 2016. For decades, Monte Carlo tree search programs that blended online search with offline-learned priors Gelly & Silver, 2007 topped out at strong amateur level on full boards Enzenberger et al., 2010. Chess followed a different trajectory: Deep Blue defeated the reigning world champion in 1997 using massively parallel alpha–beta search and a hand-crafted linear evaluation function—not reinforcement learning or deep networks Campbell et al., 2002. Deep learning nonetheless made earlier headway on chess: the Giraffe engine learned a neural evaluation function by self-play and reached roughly international-master strength in 2015 Lai, 2015, before AlphaGo’s first victory over a human professional on full-board Go Silver et al., 2016. The central difficulty in Go is evaluating who is winning from a given board position: reliable position evaluation was long believed intractable Silver et al., 2016, because it requires understanding long-range strategic patterns that do not reduce to simple heuristics.

Figure 5:The game of Go: the board and stone placement (left), and four core rules illustrated on mini-boards (right)—territory, capture (a surrounded group with one remaining liberty), ko, and end-of-game scoring.

TPIG games are unusually good for AI research: the reward signal is unambiguous (win/lose/draw), the state is fully observable, progress can be measured against human expertise, and the game engine provides a free, perfect simulator. These properties make them ideal laboratories for model-based planning — but they also mean that techniques developed here do not transfer unchanged to general RL settings, a limitation we return to at the end of this section.

Self-Play¶

A key ingredient in all superhuman game-playing agents is self-play: the agent trains against versions of itself rather than against a fixed opponent or human-provided data.

In self-play, the agent simultaneously takes both sides of the board. Each game produces training signal (who won) that can be used to improve the policy. As the agent improves, the opponent — a snapshot or the current version of the same agent — also becomes stronger, providing an automatic curriculum: the difficulty of the training opponent tracks the agent’s current level.

Self-play was first shown to work at scale by Tesauro’s TD-Gammon Tesauro, 1995, which learned to play Backgammon at near-expert level purely from self-play and temporal-difference learning. For Go, self-play is essential: there are no simple Backgammon-style rollout heuristics, so any improvement in the value estimate must come from experience against strong opposition.

The deeper reason self-play works is that it instantiates a form of policy iteration Silver et al., 2017: playing against the current policy evaluates it (revealing its weaknesses), and updating the network to win more frequently improves it. Because the “opponent” is always just the current self, the training distribution stays on-policy throughout — no fixed opponent can be “exploited” to overfit.

Monte Carlo Tree Search¶

Monte Carlo Tree Search (MCTS) was introduced by Coulom (2007) for the Go program Crazy Stone and given its now-canonical UCB-driven form — UCT (Upper Confidence bounds applied to Trees) — by Kocsis & Szepesvári (2006); see Browne et al. (2012) for a comprehensive survey. It is the planning backbone used by all AlphaGo-family agents Silver et al., 2016Silver et al., 2017Silver et al., 2018. It is a given-model algorithm: it requires a perfect simulator and uses it to build a search tree at decision time, not to augment training data Silver et al., 2016.

The four phases of MCTS. ① Selection: traverse from the root, choosing actions by UCB until an unexpanded node. ② Expansion: add a new leaf. ③ Simulation: run a rollout to a terminal state. ④ Backpropagation: update visit counts and value estimates along the path. — Figure 6:The four phases of MCTS. **① Selection:** traverse from the root, choosing actions by UCB until an unexpanded node. **② Expansion:** add a new leaf. **③ Simulation:** run a rollout to a terminal state. **④ Backpropagation:** update visit counts and value estimates along the path.

Inputs / prerequisites. MCTS needs (i) a generative model of the environment $p(s', r \mid s, a)$ that can be queried from arbitrary states — a perfect simulator in given-model settings such as a game engine, or a learned model in the variants we will see in Chapter 8; (ii) the current state $s_0$ from which to plan; (iii) a way to evaluate non-terminal leaf states — either by rolling out to termination with a default policy or by calling a learned value function; and (iv) a compute budget (number of iterations $N_\text{sim}$ or wall-clock time). MCTS is anytime: more iterations sharpen the recommendation but the search can be stopped at any point. It applies naturally to episodic, fully observable sequential decision problems — single-agent MDPs as well as two-player perfect-information games Kocsis & Szepesvári, 2006Browne et al., 2012.

Output. A statistics tree rooted at $s_0$ holding, for every visited $(s, a)$ , a visit count $N(s, a)$ and a value estimate $Q(s, a)$ . From this tree the planner extracts a recommended action — typically the most-visited root child $a^\star = \arg\max_a N(s_0, a)$ , which is more robust than $\arg\max_a Q$ because it weights confidence by sample size Coulom, 2007.

Components. Each iteration reuses four pieces: a tree policy (here UCB/UCT) that selects already-known children, the generative model that advances to the next state, a default (rollout) policy that estimates leaf value cheaply, and a backup operator that propagates the leaf estimate up the visited path. The four phases below — Selection, Expansion, Simulation, Backpropagation — are just the order in which these components are called inside one iteration.

Each iteration of MCTS has four phases Browne et al., 2012:

① Selection. From the root, select child nodes according to the Upper Confidence Bound (UCB) formula:

\text{UCB}(s, a) = Q(s, a) + c \sqrt{\frac{\ln N(s)}{N(s, a)}},

(3)

where $Q(s,a)$ is the mean value estimate, $N(s)$ the total visit count of node $s$ , $N(s,a)$ the visit count of the child reached by action $a$ , and $c>0$ an exploration constant. Combining UCB1 Auer et al., 2002 with tree search this way is the UCT algorithm of Kocsis & Szepesvári (2006), the variant used in the AlphaGo family.

② Expansion. Add one or more new child nodes to the selected leaf.

③ Simulation. From the new leaf, roll out to a terminal state using a default policy (random moves or a fast heuristic). Record the outcome $v \in \{-1, 0, +1\}$ .

④ Backpropagation. Walk back to the root, updating $N(s,a) \leftarrow N(s,a)+1$ and $Q(s,a) \leftarrow Q(s,a) + \frac{v - Q(s,a)}{N(s,a)}$ .

After many iterations the chosen action is $\arg\max_a N(s_0, a)$ — the most-visited child.

Pure UCT with random rollouts already plays Go at amateur level — Coulom’s Crazy Stone won the first UEC Cup computer Go tournament (19×19 board) with essentially this recipe Coulom, 2007. Its weakness: random rollouts are noisy value estimators, and Go’s branching factor ( $\approx 250$ ) makes deep search computationally prohibitive. The key idea of AlphaGo is to replace random rollouts with a trained neural network.

AlphaGo¶

AlphaGo Silver et al., 2016 was the first program to defeat a professional Go player at full strength. It is built from four separately trained neural networks that plug into Monte Carlo Tree Search at inference time, glued together by a three-stage training pipeline that mixes supervised learning, reinforcement learning, and regression. This subsection walks through the inputs the networks see (Inputs: Hand-crafted Feature Planes), what each of the four networks does (The Four Networks), how they are trained (Training Pipeline), and how MCTS uses them at decision time (Neural-Guided MCTS: PUCT and the Mixed Leaf Estimator). Figure 7 gives the high-level picture; the subsections below fill in the algorithmic detail.

Inputs: Hand-crafted Feature Planes¶

Before any network sees a Go board, AlphaGo encodes the board state $s$ as a $19 \times 19 \times 48$ tensor of binary feature planes Silver et al., 2016. Each plane is a $19 \times 19$ binary mask — the same spatial layout as the board itself — and the 48 channels split into two groups, summarised in Figure 8:

Raw board state (5 planes): the location of the current player’s stones, the opponent’s stones, and the empty points, plus two constant bias planes (all-ones and all-zeros) that give the convolution a translation-invariant constant signal.
Engineered Go knowledge (43 planes): decades of Go-programming heuristics encoded as features the network gets for free. These include how many turns ago a stone was last played at each point, how many liberties (empty adjacent points) each group has, how many stones we would capture by playing here, whether playing here would put our own group into atari (one liberty from death), and two binary features that fire on the ladder — one of Go’s most pattern-rich tactical motifs.

The 48 binary input planes of AlphaGo’s policy and value networks. The first 5 planes are raw board state; the remaining 43 are engineered Go knowledge that bakes in human heuristics about life-and-death, captures, and ladders. Every plane is 19 \times 19, so the networks are fully convolutional (no pooling) and exploit the translation equivariance of Go. AlphaGo Zero later drops every engineered feature and feeds in only 17 raw planes. — Figure 8:The 48 binary input planes of AlphaGo’s policy and value networks. The first 5 planes are raw board state; the remaining 43 are engineered Go knowledge that bakes in human heuristics about life-and-death, captures, and ladders. Every plane is $19 \times 19$ , so the networks are *fully convolutional* (no pooling) and exploit the translation equivariance of Go. AlphaGo Zero later drops every engineered feature and feeds in only 17 raw planes.

The translation equivariance is critical: the same tactical pattern at the corner, the side, and the centre of the board has the same meaning, so the networks share weights across all 361 intersections and never need fully-connected layers until the very last step (only the value network has one).

Engineering features for a deep network was standard practice for board-game programs in 2016. The lesson AlphaGo Zero will teach a year later is that, given enough self-play, none of these engineered features are necessary — but in AlphaGo they shave several percentage points off move-prediction accuracy and noticeably strengthen MCTS.

The Four Networks¶

AlphaGo trains four distinct networks, each playing a separate role at inference time. The table below lays out their architectures, training data, loss functions, outputs, and runtime responsibilities side-by-side.

	SL policy $p_\sigma$ supervised learning	RL policy $p_\rho$ self-play fine-tuning	Value net $v_\theta$ position evaluator	Fast rollout $p_\pi$ cheap simulator
Architecture	13-layer convolutional policy network (~4M parameters)	Same 13-layer CNN as $p_\sigma$ — weights initialised from $p_\sigma$ (~4M parameters)	13 conv layers + 1 fully-connected layer; single scalar output (tanh)	Linear softmax over hand-crafted local patterns — ~1500× faster than $p_\sigma$
Trained on	30 M positions from 160 k human-expert games (KGS Go server)	Self-play games vs. an opponent pool of past $p_\rho$ snapshots	30 M positions, one per game from $p_\rho$ self-play (breaks within-game correlation)	8 M positions from an online Go server (separate from KGS)
Loss / signal	Cross-entropy against the expert move ( $\rightarrow$ 57 % top-1 accuracy)	REINFORCE policy gradient with terminal reward $z \in \{-1, +1\}$ ( $\rightarrow$ 80 % win-rate vs. $p_\sigma$ )	MSE regression against game outcome: $(z - v_\theta(s))^2$	Cross-entropy against the human move ( $\rightarrow$ only 24 % top-1 accuracy)
Output	Move distribution $p_\sigma(a \mid s)$ over $19^2 + 1$ actions	Move distribution $p_\rho(a \mid s)$ (sharper, less diverse than $p_\sigma$ )	Scalar $v_\theta(s) \in [-1, +1]$ — estimated win probability from the current player’s view	Move distribution $p_\pi(a \mid s)$ — weak but extremely cheap
Role at inference	Move priors $P(s, a)$ in the PUCT bonus (diverse priors $\Rightarrow$ exploratory search)	Generates training data for $v_\theta$ — not called during MCTS itself	Leaf evaluation $v_\theta(s_L)$ — the low-variance, biased term in the blend	Plays out leaf to terminal $\rightarrow$ outcome $z_L$ is the noisy, unbiased blend term

Two design choices are worth flagging up front. First, the supervised policy $p_\sigma$ and the RL policy $p_\rho$ share the same architecture (13-layer convolutional network) and the same input encoding; $p_\rho$ is just $p_\sigma$ with REINFORCE applied on top. Second, $p_\pi$ exists for one reason: speed. A forward pass through $p_\sigma$ takes about $3\,\text{ms}$ on a single GPU — far too slow to play out an entire game inside MCTS. $p_\pi$ is a linear softmax over local-pattern features (no hidden layers) and runs in about $2\,\mu\text{s}$ , roughly $1500\times$ faster, at the cost of dropping move-prediction accuracy from 57% to 24%. That trade-off is decisive: a noisier but unbiased rollout, sampled cheaply, is more valuable inside MCTS than a slow accurate one.

Training Pipeline¶

The three training stages are run in sequence; each stage consumes the output of the previous one.

Stage 1 — Supervised learning of $p_\sigma$ . The SL policy network is trained on 30 million board positions drawn from 160 000 games between strong human amateurs on the KGS Go server Silver et al., 2016. The training signal is a standard categorical cross-entropy loss against the move the human actually played:

\mathcal{L}_\text{SL}(\sigma) \;=\; -\, \mathbb{E}_{(s, a) \sim \mathcal{D}_\text{KGS}}\!\left[\log p_\sigma(a \mid s)\right],

(4)

which is identical to how an image classifier is trained, except that the “class label” is which of the (up to) 361 legal moves the expert chose at position $s$ . After three weeks of training, $p_\sigma$ reaches 57% top-1 move-prediction accuracy, far better than any prior Go program. Crucially, $p_\sigma$ also has a diverse output distribution: even when the human move is obvious, $p_\sigma$ usually places non-trivial probability on several reasonable alternatives — a property that matters when we plug it into MCTS.

Stage 2 — Reinforcement learning of $p_\rho$ by self-play. The RL policy $p_\rho$ is initialised from $p_\sigma$ and fine-tuned by self-play with the REINFORCE policy-gradient estimator from Chapter 4 Williams, 1992. Each game is played to completion, the outcome $z \in \{-1, +1\}$ (loss vs. win for the player to move at $s$ ) is treated as the reward, and the parameters are updated by

\nabla_\rho \mathcal{J}(\rho) \;=\; \mathbb{E}_\text{self-play}\!\left[\, z \sum_{t=0}^{T} \nabla_\rho \log p_\rho(a_t \mid s_t) \,\right],

(5)

i.e. moves from games we won are made slightly more likely, moves from games we lost slightly less likely.

A naive implementation would have $p_\rho$ play against itself, but this causes two pathologies: (i) the gradient signal collapses because both sides improve in lock-step, and (ii) the agent overfits to its own tactics — defeating its current self does not mean defeating any other reasonable opponent. AlphaGo fixes this with an opponent pool: every few iterations a snapshot of $p_\rho$ is added to the pool, and each self-play game is played against a randomly sampled snapshot rather than the current parameters. The pool gives self-play a built-in diversity of opponents and prevents cyclic strategies of the form “I always open in the upper-right corner because I always beat myself when I do.” After this stage, $p_\rho$ wins 80% of games against $p_\sigma$ .

Stage 3 — Regression training of $v_\theta$ . The value network $v_\theta$ shares the same convolutional trunk as the policy networks but adds a final fully-connected layer producing a single scalar, with a tanh non-linearity that constrains its output to $[-1, +1]$ . It is trained by mean-squared regression against the outcome of a self-play game played by $p_\rho$ :

\mathcal{L}_\text{val}(\theta) \;=\; \mathbb{E}_{(s, z) \sim \mathcal{D}_\text{val}}\!\left[\,(z - v_\theta(s))^2\,\right].

(6)

The non-obvious part is how $\mathcal{D}_\text{val}$ is built. A direct approach — store every $(s_t, z)$ pair from each self-play game — fails badly because consecutive positions within a single game are extremely correlated: they share most stones, the same eventual winner, and the same strategic plan. Training a regressor on this distribution overfits within a few epochs (the network memorises games rather than learning to evaluate positions). AlphaGo sidesteps this by playing 30 million separate self-play games and keeping exactly one position per game. The resulting dataset is roughly i.i.d. at the position level and the value network generalises to unseen boards.

Note

Why pre-train with supervision before doing RL? Go has a branching factor of $\sim 250$ and games of length $\sim 150$ , so the space of trajectories is on the order of 250¹⁵⁰ — incomprehensibly larger than anything REINFORCE can sample. From a random initialisation the policy gradient would wander in this space for a very long time before producing any game with reward signal more informative than noise. Bootstrapping with supervised learning on $\mathcal{D}_\text{KGS}$ gives REINFORCE a starting point where most games are already coherent games of Go — wins and losses then carry meaningful credit. This supervised pre-training + RL fine-tuning pattern is not Go-specific: it is now the dominant recipe for training large language models, where it appears as supervised fine-tuning (SFT) followed by RL from human (or AI) feedback. AlphaGo is one of its earliest large-scale demonstrations.

Neural-Guided MCTS: PUCT and the Mixed Leaf Estimator¶

At inference, AlphaGo runs Monte Carlo Tree Search exactly as introduced in Monte Carlo Tree Search, but with two crucial modifications: the selection rule now uses neural-network priors, and the leaf evaluation no longer relies on random rollouts. Both modifications — and the role each of $p_\sigma$ , $v_\theta$ , and $p_\pi$ plays inside one iteration — are summarised in Figure 9.

One AlphaGo MCTS iteration. ① Selection (purple) descends the tree by PUCT (Equation ); ② Expansion adds a new leaf s_L; ③ Evaluation runs the value network v_\theta(s_L) (green) in parallel with a fast p_\pi rollout to the terminal outcome z_L (orange) and blends them into V(s_L) (Equation ); ④ Backup walks the visited path updating N and Q from V(s_L) (Equation ). Per iteration, p_\sigma fires once (at expansion, to seed child priors that are reused thereafter), v_\theta fires once (at evaluation), and p_\pi fires many times (throughout the rollout to termination). The RL policy p_\rho is not queried at inference — its only role is to generate training data for v_\theta. — Figure 9:One AlphaGo MCTS iteration. **① Selection** (purple) descends the tree by PUCT (Equation (7)); **② Expansion** adds a new leaf $s_L$ ; **③ Evaluation** runs the value network $v_\theta(s_L)$ (green) in parallel with a fast $p_\pi$ rollout to the terminal outcome $z_L$ (orange) and blends them into $V(s_L)$ (Equation (8)); **④ Backup** walks the visited path updating $N$ and $Q$ from $V(s_L)$ (Equation (9)). Per iteration, $p_\sigma$ fires *once* (at expansion, to seed child priors that are reused thereafter), $v_\theta$ fires *once* (at evaluation), and $p_\pi$ fires *many* times (throughout the rollout to termination). The RL policy $p_\rho$ is *not* queried at inference — its only role is to generate training data for $v_\theta$ .

① Selection. AlphaGo replaces the bare UCB formula in Equation (3) with PUCT — Predictor + UCB applied to Trees — which weights each candidate’s exploration bonus by the network’s prior probability for that move:

U(s, a) \;=\; Q(s, a) \;+\; c_\text{puct} \, P(s, a) \, \frac{\sqrt{N(s)}}{1 + N(s, a)},

(7)

Note

A note on the name “PUCT”. The PUCT label is now standard but its lineage is layered. Rosin Rosin, 2011 introduced PUCB — Predictor + UCB — for bandits with side information, with bonus $P(a)\sqrt{\log N}/N(a)$ . The AlphaGo paper Silver et al., 2016 describes its selection rule as “a variant of the PUCT algorithm” and names the surrounding search system APV-MCTS (Asynchronous Policy and Value MCTS); its bonus $P(s,a)\sqrt{N(s)}/(1+N(s,a))$ has a polynomial rather than logarithmic decay, which is why some authors re-read the P as Polynomial. AlphaGo Zero and AlphaZero subsequently dropped the “variant of” qualifier and the literature has settled on plain “PUCT”, which is the convention we follow here.

with prior $P(s, a) = p_\sigma(a \mid s)$ from the SL policy. PUCT has two natural regimes: when a child has been visited few times, $N(s, a)$ is small and the bonus reduces to roughly $c_\text{puct} \, P(s, a) \, \sqrt{N(s)}$ — the prior dominates, so high-prior moves are explored first. As $N(s, a)$ grows, the bonus shrinks and $Q(s, a)$ — the empirical win-rate estimate — takes over. Compared with vanilla UCB, PUCT concentrates the search budget on plausible moves rather than spreading it uniformly across the $\sim 250$ legal options.

Why the SL policy $p_\sigma$ for priors, not the stronger $p_\rho$ ? This is the most counter-intuitive design choice in AlphaGo. Empirically, using $p_\rho$ as the prior produces a weaker tree search than using $p_\sigma$ . The reason is distribution shape: $p_\rho$ has been trained to win, so it concentrates almost all probability on the single move it estimates is best. Plugging that into PUCT collapses the search onto a narrow subtree, and any move $p_\rho$ undervalues — including potentially the correct one in unfamiliar positions — receives essentially zero exploration. $p_\sigma$ , by contrast, has been trained to imitate humans, who themselves play diverse styles, so $p_\sigma$ assigns non-trivial probability to many reasonable moves. Diverse priors lead to a more exploratory search tree, and a more exploratory tree produces better moves than the stronger-but-narrower policy on its own. The general lesson — that the right way to use a network inside search is rarely “pick its top choice” — comes up again in AlphaGo Zero and MuZero.

② Expansion. When PUCT selects an edge that leads to a state not yet in the tree, that state becomes a new leaf $s_L$ . AlphaGo evaluates the SL policy once at this leaf — $p_\sigma(\cdot \mid s_L)$ — and stores the resulting distribution as the prior $P(s_L, \cdot)$ over all of $s_L$ ’s legal children. These priors are not recomputed: every later iteration that descends through $s_L$ reads the same cached vector inside the PUCT bonus of Equation (7). The single $p_\sigma$ call at expansion is therefore amortised over every subsequent visit to $s_L$ , which keeps the per-iteration cost of MCTS dominated by the leaf evaluation that follows.

③ Evaluation. When MCTS reaches an unexplored leaf $s_L$ , the original Monte Carlo Tree Search would estimate its value by a random rollout. AlphaGo replaces that with two parallel estimates of the same quantity, blended into one number:

V(s_L) \;=\; (1 - \lambda) \, v_\theta(s_L) \;+\; \lambda \, z_{p_\pi},

(8)

where $v_\theta(s_L) \in [-1, +1]$ is the value network’s win-probability estimate (a single forward pass) and $z_{p_\pi} \in \{-1, +1\}$ is the outcome of one fast rollout from $s_L$ played by $p_\pi$ on both sides until termination. The two estimators have complementary error profiles: $v_\theta$ has low variance but is possibly biased (its evaluation depends on whatever positions were in its training set), whereas $z_{p_\pi}$ is unbiased (a real game outcome under a fixed rollout policy) but has high variance (the outcome of a single rollout is a noisy sample). Blending them with $\lambda = 0.5$ — the value used in the paper — captures most of the bias correction from rollouts while inheriting most of the variance reduction from the value network.

④ Backup. With $V(s_L)$ in hand, MCTS walks back along the visited path. For every edge $(s, a)$ on that path it updates the visit count and the running mean of the value estimate:

N(s, a) \leftarrow N(s, a) + 1, \qquad Q(s, a) \leftarrow Q(s, a) + \frac{V(s_L) - Q(s, a)}{N(s, a)},

(9)

so $Q(s, a)$ converges to the empirical mean of the mixed leaf estimates collected along that edge. No network is called during backup: the only inputs are the cached statistics already in the tree and the leaf value computed in the previous phase. After $N_\text{sim}$ iterations, the recommended move at the root is, as in plain MCTS, the most-visited child $\arg\max_a N(s_0, a)$ .

Results¶

AlphaGo defeated Lee Sedol — at the time one of the world’s strongest players — 4–1 in a five-game match in March 2016, a result widely considered to be a decade ahead of schedule. The system’s main limitation is its dependence on a complex three-stage training pipeline and 30 million human expert positions.

AlphaGo Zero¶

AlphaGo Zero Silver et al., 2017 eliminates all human knowledge: no expert games, no hand-crafted features, no separate networks. Starting from random play and training purely through self-play, it surpassed every earlier AlphaGo variant. The core recipe — a dual-head residual network, PUCT-guided MCTS, and self-play distillation — is game-agnostic; the details below are the Go-specific 2017 instantiation that AlphaZero later generalises to Chess and Shogi (Paragraph).

Architecture¶

AlphaGo Zero uses a single dual-head residual network $f_\theta$ :

\bigl(p_\theta(s),\, v_\theta(s)\bigr) = f_\theta(s),

(10)

where $p_\theta(s)$ is a probability vector over legal moves (policy head) and $v_\theta(s) \in [-1, 1]$ is a scalar win estimate (value head). The input is 17 binary feature planes encoding the current position, the seven most recent positions for each player, and whose turn it is. Because Go rules are invariant to rotation and reflection, AlphaGo Zero augments training data with all eight symmetries of each position and applies a random board transform before every neural-network evaluation inside MCTS, averaging over orientation bias.

Neural MCTS with PUCT¶

AlphaGo Zero keeps the PUCT selection rule introduced for AlphaGo (Equation (7)), but feeds it priors from the unified network $f_\theta$ instead of a separate SL policy:

U(s, a) = Q(s, a) + c_\text{puct} \cdot p_\theta(a \mid s) \cdot \frac{\sqrt{N(s)}}{1 + N(s, a)}.

(11)

The leaf-evaluation step is where the largest simplification happens. When MCTS reaches an unvisited leaf $s_L$ , AlphaGo Zero evaluates $f_\theta(s_L)$ with a single forward pass and uses both of its outputs: $v_\theta(s_L)$ becomes the leaf value — the entire mixed estimator of Equation (8) is replaced by just the value head — and $p_\theta(\cdot \mid s_L)$ assigns priors to the newly expanded children. The fast rollout policy $p_\pi$ is gone, so there is no $\lambda$ to tune; the noisy simulation phase has been collapsed entirely into a deterministic network evaluation.

After $N_\text{sim} = 800$ MCTS simulations the search produces a visit-count policy:

\pi(a \mid s) = \frac{N(s, a)^{1/\tau}}{\sum_b N(s, b)^{1/\tau}},

(12)

with temperature $\tau = 1$ for the first 30 moves (encouraging exploration) and $\tau \to 0$ thereafter. This $\pi$ is stronger than the raw network output $p_\theta$ because it incorporates tree search.

Self-Play Training Loop¶

Training alternates two phases indefinitely:

Self-play. The current best network plays against itself. At each position, 800 MCTS simulations produce a visit-count vector $\vec\pi$ , stored as the policy target. The final game outcome $z \in \{-1, +1\}$ — Go has no draws under the Tromp–Taylor rules used — is stored as the value target. The replay buffer retains the most recent 500 000 games.
Network update. Mini-batches $(s, \vec\pi, z)$ are sampled uniformly from the buffer. Parameters $\theta$ are updated by minimising:

\ell(\theta) = \underbrace{(z - v_\theta(s))^2}_{\text{value loss}} - \underbrace{\vec\pi \cdot \log p_\theta(s)}_{\text{policy loss}} + \underbrace{\lambda \|\theta\|^2}_{\text{weight decay}}.

(13)

After each update the network plays 400 games against the previous best; if it wins $>55\%$ it becomes the new best and takes over self-play. This best-network promotion gate stabilises training by preventing a temporarily weaker checkpoint from polluting self-play data; search hyperparameters were tuned per run via Bayesian optimisation. AlphaZero removes the gate and reuses a single set of hyperparameters across games (Paragraph).

Why Does Self-Play Work?¶

The network and MCTS form a mutual improvement loop:

A stronger network yields better PUCT priors → the search finds stronger lines → the visit-count policy $\vec\pi$ is stronger than the raw network $p_\theta$ .
Training on stronger $\vec\pi$ targets improves the network.
Repeat.

This is generalised policy iteration: MCTS performs policy evaluation (producing a stronger policy $\vec\pi$ from the current network), while the network update performs policy improvement — with no human knowledge fixing the initial point. MCTS amplifies what the network knows; training distils it back into the weights. The process bootstraps from noise to superhuman strength.

Results¶

AlphaGo Zero trained from random initialisation in 72 hours:

vs. AlphaGo Lee (defeated Lee Sedol 4–1): wins 100–0.
vs. AlphaGo Master (60–0 against top human professionals): wins 89–11.

AlphaGo Zero rediscovered classical joseki and fuseki sequences, and invented novel strategic patterns that human experts had not previously considered. The same core recipe was generalised to Chess and Shogi a year later as AlphaZero (Paragraph).

AlphaZero¶

AlphaZero Silver et al., 2018 generalises the AlphaGo Zero recipe to Chess, Shogi, and Go — training a separate network instance per game from random play, with no human games, no handcrafted evaluation features, and no domain-specific search heuristics such as opening books or endgame tablebases. State-of-the-art engines like Stockfish and Elmo instead rely on decades of hand-tuned alpha–beta search, linear evaluation functions, and extensive domain adaptations Silver et al., 2018. AlphaZero replaces all of that with a single dual-head residual network and tabula-rasa self-play: the game rules alone serve as the transition model.

The core ingredients are unchanged from AlphaGo Zero: PUCT-guided MCTS, a policy/value network $f_\theta$ , and the mutual-improvement loop described above. What changes are the input encodings (board size and feature planes differ per game), a handful of algorithmic variants tuned for multi-game generality, and the targets the network learns (draws are possible in chess and shogi). AlphaZero demonstrates that the recipe is not Go-specific — it is a general method for two-player perfect-information games with a given simulator.

Differences from AlphaGo Zero¶

AlphaZero is best understood as a generalised variant of the AlphaGo Zero algorithm Silver et al., 2018. Four concrete differences matter:

Outcome target. AlphaGo Zero estimates the probability of winning, with terminal outcomes $z \in \{-1, +1\}$ only (Go has no draws under Tromp–Taylor scoring). AlphaZero instead optimises the expected score $z \in \{-1, 0, +1\}$ , treating draws as first-class outcomes — essential for chess, where drawn positions are common at superhuman level.
Symmetries. Go rules are invariant to rotation and reflection; AlphaGo Zero exploits this with 8-fold data augmentation during training and random board transforms at every MCTS leaf evaluation. Chess and shogi rules are asymmetric (castling, pawn direction, piece drops), so AlphaZero applies neither augmentation nor search-time transforms.
Network management. AlphaGo Zero trains in iterations: after each update, the new network must win $>55\%$ of 400 games against the previous best before taking over self-play (Figure 10). AlphaZero maintains a single network updated continually — self-play always uses the latest parameters $\theta$ , with no iteration boundary and no promotion gate.
Hyperparameters. AlphaGo Zero tuned search hyperparameters per run via Bayesian optimisation. AlphaZero reuses the same settings for all three games — 800 MCTS simulations, identical learning-rate schedule, and so on — with one exception: Dirichlet noise at the root is scaled inversely to the typical branching factor, with $\alpha \in \{0.3,\, 0.15,\, 0.03\}$ for chess, shogi, and Go respectively.

Input and action encoding¶

Unlike AlphaGo Zero, which hard-codes a $19 \times 19$ Go board, AlphaZero parameterises encoding by board size $N$ . The neural-network input is an $N \times N \times (MT + L)$ image stack: $T = 8$ consecutive board positions (zero-padded before move 1), each represented by $M$ binary feature planes, plus $L$ constant planes encoding whose turn it is, the total move count, and game-specific rule state (castling rights, repetition counts, prisoner tallies). The board is always oriented from the current player’s perspective. Figure 11 summarises the three encodings.

$AlphaZero input and policy-head shapes for Go, Chess, and Shogi. All three share an N \times N \times (MT + L) input stack with T = 8 position history; only the plane counts and policy tensor differ. Illegal moves are masked to zero probability and the distribution is re-normalised over legal moves.$

Figure 11:AlphaZero input and policy-head shapes for Go, Chess, and Shogi. All three share an $N \times N \times (MT + L)$ input stack with $T = 8$ position history; only the plane counts and policy tensor differ. Illegal moves are masked to zero probability and the distribution is re-normalised over legal moves.

The policy head also varies by game. Go uses the same flat distribution over $19 \times 19 + 1$ moves (stone placements plus pass) as AlphaGo Zero. Chess encodes moves as an $8 \times 8 \times 73$ tensor: for each origin square, 56 planes encode “queen moves” (distance 1–7 in each of eight compass directions), 8 encode knight hops, and 9 encode pawn underpromotions. Shogi uses a $9 \times 9 \times 139$ tensor with analogous move planes plus 7 drop planes for captured pieces returned to the board. In every case the network outputs a large logits tensor; illegal moves are zeroed and the remainder is re-normalised.

Game	Board	Input planes	Policy shape	Notable constant planes
Go	$19 \times 19$	17	$19 \times 19 + 1$ flat	Repetitions (2)
Chess	$8 \times 8$	119	$8 \times 8 \times 73$	Castling (4), 50-move rule, repetitions (3)
Shogi	$9 \times 9$	362	$9 \times 9 \times 139$	Prisoner counts (14), repetitions (3)

Architecture¶

The network architecture is the same dual-head residual network $f_\theta$ introduced for AlphaGo Zero:

\bigl(p_\theta(s),\, v_\theta(s)\bigr) = f_\theta(s),

(14)

with a convolutional trunk of residual blocks He et al., 2016 as in AlphaGo Zero and two heads: a policy head whose output shape matches the game’s action encoding, and a scalar value head $v_\theta(s) \in [-1, 1]$ . No fast rollout policy, no separate SL and RL networks — a single forward pass at each MCTS leaf supplies both the prior $p_\theta(\cdot \mid s_L)$ and the leaf value $v_\theta(s_L)$ .

Neural MCTS with PUCT¶

Search follows the same PUCT rule as AlphaGo Zero (Equation (7)): priors from $f_\theta$ , leaf values from the value head alone, $N_\text{sim} = 800$ simulations per move, and a visit-count policy $\pi$ with temperature $\tau = 1$ for the first 30 plies and $\tau \to 0$ thereafter. Two AlphaZero-specific details: Dirichlet noise $\text{Dir}(\alpha)$ is added to the root prior (with game-scaled $\alpha$ as above) to encourage opening exploration, and there is no symmetry averaging at leaf evaluation. At tournament time, moves are selected greedily by root visit count.

Self-Play Training Loop¶

Training alternates the same two phases as AlphaGo Zero, but with continual weight updates:

Self-play. The current network (always the latest $\theta$ ) plays against itself. Each position yields a visit-count policy target $\vec\pi$ and a terminal outcome $z \in \{-1, 0, +1\}$ . The replay buffer retains the most recent 500 000 games.
Network update. Mini-batches of size 4096 are sampled uniformly from the buffer and optimised with the loss in Equation (13). The learning rate starts at 0.2 and is dropped three times (to 0.02, 0.002, and 0.0002) over 700 000 mini-batch steps per game instance. Training used 5 000 first-generation TPUs for self-play data generation and 64 second-generation TPUs for network updates Silver et al., 2018.

Each game trained to completion in roughly 9 hours (chess), 12 hours (shogi), or 34 hours (Go). During training, AlphaZero surpassed its baseline opponents after approximately 4 hours / 300k steps (Stockfish 8 in chess), 2 hours / 110k steps (Elmo in shogi), and 8 hours / 165k steps (AlphaGo Lee in Go) — all measured on an Elo scale from 1-second-per-move evaluation games.

Why Does Self-Play Work?¶

The mutual-improvement argument is identical to the AlphaGo Zero case above: MCTS amplifies the network’s knowledge into a stronger visit-count policy $\vec\pi$ , and gradient descent distils $\vec\pi$ back into the weights. One chess-specific insight from the AlphaZero paper helps explain why neural MCTS can beat alpha–beta engines that evaluate orders of magnitude more positions: MCTS averages value estimates over all simulations visiting a subtree, which tends to cancel spurious neural approximation errors, whereas alpha–beta search propagates explicit minimax values and therefore amplifies the largest errors to the root Silver et al., 2018.

Results¶

100-game tournaments at 1 minute per move against the strongest available baselines (Stockfish 8 with 64 threads; Elmo with 64 threads; AlphaGo Zero trained for 3 days) yielded:

Game	Opponent	Wins	Draws	Losses
Chess	Stockfish 8	28	72	0
Shogi	Elmo	90	2	8
Go	AlphaGo Zero (3-day)	60	—	40

AlphaZero lost zero games to Stockfish and only eight to Elmo (combined over both colours).

Search efficiency tells a complementary story. At tournament settings, AlphaZero evaluates roughly 80 000, 40 000, and 16 000 positions per second in chess, shogi, and Go — compared to 70 million for Stockfish and 35 million for Elmo. AlphaZero compensates for roughly 1 000 $\times$ fewer node evaluations by using its deep network to focus search on the most promising variations, and its MCTS scales more effectively with thinking time than either baseline engine Silver et al., 2018.

During self-play, AlphaZero independently discovered the most popular human chess openings (English, Queen’s Gambit, Sicilian, Ruy Lopez, and others) and defeated Stockfish from each starting position in dedicated 100-game match probes. The result settled a long-standing question: AlphaGo Zero’s success was not a Go-specific artefact, but evidence for a general reinforcement-learning algorithm that masters diverse perfect-information games from self-play and the rules alone.

Why Can’t We Apply This Everywhere?¶

AlphaGo Zero and AlphaZero work because of three properties that most RL problems do not have:

A perfect simulator is available. The game engine is given and can be queried arbitrarily many times for free. In most real-world settings (robotics, chemistry, markets) no such model exists — it must be learned from data.
Self-play provides a natural reward signal. The zero-sum two-player structure means that improving against yourself is always useful. For single-agent tasks, cooperative games, or problems with dense reward, there is no obvious self-play analogue.
The state is fully observed. Both players see the entire board. Real-world sensors are noisy and partial; the agent must infer hidden state from observations.

These limitations motivate the world-model paradigm: if the agent can learn its own model of the environment, it can plan without a given simulator — and potentially extend planning-based methods to single-agent, partially observed, continuous-control settings.

Summary¶

This chapter introduced the two axes of model-based RL: how the model is obtained (learned from experience or provided as a perfect simulator) and how the model is used (background planning to generate cheap synthetic data, or decision-time planning to search before acting). Learned-model methods — DYNA-Q and MBPO — convert each real transition into many simulated ones, achieving substantial sample-efficiency gains over model-free counterparts at the cost of model bias. Given-model methods — the AlphaGo family — take a complementary approach: with a perfect game engine and self-play as an automatic curriculum, neural MCTS bootstraps from random play to superhuman strength without any human knowledge beyond the rules of the game.

Algorithm	Model	Planning	Data source	Key strength
DYNA-Q Sutton, 1991	Tabular lookup (learned)	Background Q-updates	Real + simulated	Conceptual clarity; DP meets model-free RL
MBPO Janner et al., 2019	Neural ensemble (learned)	Short rollouts ( $k \leq 5$ )	Real + simulated	Deep continuous control; $5\text{–}20\times$ sample efficiency
AlphaGo Silver et al., 2016	True simulator (given)	MCTS + neural guidance	Human games + self-play	First superhuman Go agent
AlphaGo Zero Silver et al., 2017	True simulator (given)	Neural MCTS (PUCT)	Self-play only	Tabula rasa; mutual improvement loop
AlphaZero Silver et al., 2018	True simulator (given)	Neural MCTS (PUCT)	Self-play only	Chess + Shogi + Go; draw outcomes; continual update
MuZero Schrittwieser et al., 2020	Learned latent (world model)	Latent MCTS	Self-play + Atari	No game rules needed — next chapter

The next chapter introduces world models as a substrate for imagination-based planning in continuous environments: PlaNet Hafner et al., 2019 and Dreamer Hafner et al., 2025 learn a full recurrent latent-dynamics model from pixels, enabling long-horizon planning without Monte Carlo Tree Search — and we return to MuZero in full detail.

Taxonomy of Model-Based RL¶

Where the Model Comes From¶

How the Model Is Used¶

Learned-Model Planning¶

DYNA: The Blueprint¶

MBPO: Deep DYNA with Ensemble Dynamics¶

The algorithm¶

Given-Model Planning¶

Two-Player Perfect Information Games¶

Self-Play¶

Monte Carlo Tree Search¶

AlphaGo¶

Inputs: Hand-crafted Feature Planes¶

The Four Networks¶

Training Pipeline¶

Neural-Guided MCTS: PUCT and the Mixed Leaf Estimator¶

Results¶

AlphaGo Zero¶

Architecture¶

Neural MCTS with PUCT¶

Self-Play Training Loop¶

Why Does Self-Play Work?¶

Results¶

AlphaZero¶

Differences from AlphaGo Zero¶

Input and action encoding¶

Architecture¶

Neural MCTS with PUCT¶

Self-Play Training Loop¶

Why Does Self-Play Work?¶

Results¶

Why Can’t We Apply This Everywhere?¶

Summary¶

Further Reading¶