Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Model-Based Reinforcement Learning

Planning, Self-Play, and Mastering the Game of Go

Every approach in the previous chapters — value iteration, policy gradients, actor-critic methods — treats the environment as an inscrutable black box that you query by executing actions. The agent learns what to do but never explicitly learns what will happen. This chapter opens a different door: the agent also acquires a model of the environment’s dynamics and uses it to plan — to reason ahead about consequences before committing to an action.

The motivation is sample efficiency. Real environments are expensive: a robot burns time and hardware interacting with the physical world; a drug-discovery agent runs wet-lab assays; even a simulator can be the computational bottleneck. If the agent has a model, it can generate imagined experience at near-zero cost, multiplying the number of learning updates per unit of real interaction.

Two questions arise:

  1. Is the model given or must it be learned? Game rules, physics simulators, and known differential equations are given. For most real problems the model must be estimated from data.

  2. How is the model used for planning? It can generate extra transitions to augment a replay buffer (background planning), or it can be queried at decision time to look ahead and choose actions (decision-time planning).

This chapter works through these ideas from the ground up. We begin with a taxonomy that organises the space of model-based approaches, then cover background-planning methods and their deep-learning extensions (learned-model background planning). We then turn to decision-time planning, using two-player perfect-information games as our running example: self-play, Monte Carlo Tree Search, and Reinforcement Learning combine to master the game of Go (given-model decision-time planning).

Slides for this chapter (open full screen).

Taxonomy of Model-Based RL

A model is anything the agent can query to ask “if I take action aa in state ss, what state ss' do I reach and what reward rr do I receive?” — formally, any (approximation of the) transition distribution p(s,rs,a)p(s', r \mid s, a) Sutton & Barto, 2018Moerland et al., 2023. Model-based RL is organised by two orthogonal questions about that model: where does it come from? and how is it used to choose actions? Together these two axes carve out four canonical regions, shown in Figure 2; the remainder of this section works through them in turn.

A conceptual taxonomy of model-based reinforcement learning organised along two independent axes. The vertical axis distinguishes how the model is obtained: learned from experience (orange ellipse, top) or provided as a perfect simulator — such as a game engine or physics model — (green ellipse, bottom). These two regions do not overlap: a model is either given or must be estimated. The horizontal axis distinguishes how the model is used for planning: decision-time planning looks ahead before each action (cyan ellipse, left); background planning generates synthetic transitions to augment learning offline (blue ellipse, right). These two modes may overlap for methods that use both. Model-free methods (right, dashed box) use neither a model nor explicit planning.

Figure 2:A conceptual taxonomy of model-based reinforcement learning organised along two independent axes. The vertical axis distinguishes how the model is obtained: learned from experience (orange ellipse, top) or provided as a perfect simulator — such as a game engine or physics model — (green ellipse, bottom). These two regions do not overlap: a model is either given or must be estimated. The horizontal axis distinguishes how the model is used for planning: decision-time planning looks ahead before each action (cyan ellipse, left); background planning generates synthetic transitions to augment learning offline (blue ellipse, right). These two modes may overlap for methods that use both. Model-free methods (right, dashed box) use neither a model nor explicit planning.

Where the Model Comes From

Given models are perfect simulators handed to the agent for free. The canonical example is a game engine — the rules of Go, Chess, or Shogi fully determine the next position Silver et al., 2016Silver et al., 2018. Physics engines such as MuJoCo or Isaac Gym play the same role for continuous-control benchmarks Todorov et al., 2012Makoviychuk et al., 2021, and known ODE or PDE dynamics fill the same slot in classical control.

Their advantage is decisive: there is no model bias, the simulator can be queried arbitrarily many times, and search depth is limited only by compute. The drawbacks are practical rather than statistical. First, perfect simulators are rare in real-world settings — robots, markets, molecules and physiological systems have no closed-form transition function Moerland et al., 2023. Second, even when a simulator exists, its wall-clock cost can become the bottleneck: large-scale game-playing agents spend most of their compute inside the simulator, not on gradient steps Vinyals et al., 2019Berner et al., 2019. Third, when the “simulator” is itself an approximation — e.g. a randomised physics model used to train a robot policy — the sim-to-real gap turns into a model-bias problem in disguise Akkaya et al., 2019.

Learned models are estimated from the agent’s own experience. The agent fits an approximate dynamics function T^\hat T and reward function R^\hat R from the transitions it has collected so far, and can then query (T^,R^)(\hat T, \hat R) instead of (or in addition to) the real environment Sutton, 1991. Concrete instances range from Gaussian-process models in PILCO Deisenroth & Rasmussen, 2011 to probabilistic neural-network ensembles in PETS and MBPO Chua et al., 2018Janner et al., 2019, and on to compact latent-space world models trained directly from pixels Ha & Schmidhuber, 2018Hafner et al., 2025.

A learned model is the only viable option when no simulator exists, and it gets better with data in the same loop that collects it. The price is model bias: T^\hat T is never perfect, and prediction errors compound on multi-step rollouts so that even small per-step inaccuracies derail long-horizon planning Janner et al., 2019Chua et al., 2018. A learned model also does not bypass the cost of exploration — it can only predict transitions in regions it has already seen.

How the Model Is Used

Planning means using a model to compute or improve a policy without additional real-world interaction Sutton & Barto, 2018Moerland et al., 2023. The classical instantiation is Dynamic Programming, summarised in the note below; modern model-based RL can be read as two complementary ways of approximating DP when full sweeps over the state space are not tractable.

Decision-time planning queries the model online, at the moment an action must be chosen. From the current state sts_t the agent rolls forward a short trajectory, builds a search tree, or optimises a finite-horizon control sequence, and uses the result to pick a single action; the next state triggers a new search from scratch. The two canonical instances are Monte Carlo Tree Search in board games Silver et al., 2016Silver et al., 2017 and Model Predictive Control (MPC) in robotics and process control Camacho & Bordons, 2007Lazic et al., 2018, with sampling-based MPC variants such as MPPI now standard in high-frequency control Williams et al., 2017Chua et al., 2018.

Its benefits are structural. Each decision adapts to the current state rather than to an average state distribution, so fresh information can be exploited immediately. Model bias is contained to the planning horizon — errors influence the single action that is executed but never leak into a stored value function or policy Moerland et al., 2023. And no replay buffer of imagined transitions has to be maintained. The drawback is latency: every action incurs a search, and per-decision compute scales with horizon and branching factor, which rules out decision-time planning for high-frequency control unless the search itself is cheap.

Background planning queries the model offline, between real environment steps, and uses the resulting synthetic transitions to train a policy or value function. This is the architecture introduced by Dyna Sutton, 1991 and scaled to deep continuous control by MBPO Janner et al., 2019 and to discrete domains by SimPLe Kaiser et al., 2020. Each real step yields many imagined transitions, and the off-policy learner trains on a mixture of real and imagined data.

The benefits are complementary to decision-time planning. Inference at action time is cheap — the policy network alone acts, with no search in the loop — making background planning a natural fit for off-policy deep RL with replay buffers. Planning is amortised across many gradient updates rather than re-done from scratch per action, and the synthetic data multiplies the number of updates per unit of real interaction. The drawback is that model errors propagate into the learned value or policy: a wrong simulated transition becomes a wrong gradient. The rollout horizon must therefore be tuned carefully — in MBPO short rollouts (k5k \le 5) work, longer ones quickly degrade the policy because per-step error compounds Janner et al., 2019.

Learned-Model Planning

The agent does not receive a simulator — it must estimate the environment’s dynamics from collected experience. We will first cover the canonical reference and then an instantiation of a deep RL algorithm.

DYNA: The Blueprint

Dyna Sutton, 1991 is the foundational learned-model background-planning architecture. It integrates three concurrent processes into a single agent loop:

  1. Direct RL — learn from real transitions exactly as in Q-learning.

  2. Model learning — fit a model T^,R^\hat T, \hat R to real transitions.

  3. Planning — draw nn simulated transitions from T^,R^\hat T, \hat R and use each for an additional Q-learning update.

The DYNA architecture. Real interactions update the Q-table (path ①) and improve the model. The model then generates n simulated transitions per real step for additional Q-learning updates (path ②). The dashed arrow marks simulated experience as cheaper than real experience.

Figure 3:The DYNA architecture. Real interactions update the Q-table (path ①) and improve the model. The model then generates nn simulated transitions per real step for additional Q-learning updates (path ②). The dashed arrow marks simulated experience as cheaper than real experience.

Dyna-Q makes this concrete for tabular settings. After each real step (s,a,r,s)(s, a, r, s'):

  1. Update Q(s,a)Q(s, a) with the real transition (standard TD update).

  2. Update the model: T^(s,a)s\hat T(s, a) \leftarrow s'; R^(s,a)r\hat R(s, a) \leftarrow r.

  3. Repeat nn times: sample a previously-seen (s,a)(s, a), query the model for (r^,s^)(\hat r, \hat s'), and apply a Q-learning update.

With n=50n = 50 planning steps per real step, Dyna-Q matched the sample efficiency of a model-free agent receiving 50× more environment interactions in Sutton’s original maze experiments Sutton, 1990Sutton & Barto, 2018.

Three principles emerge from the Dyna blueprint:

MBPO: Deep DYNA with Ensemble Dynamics

Model-Based Policy Optimization (MBPO) Janner et al., 2019 scales the Dyna idea to continuous control. The tabular Dyna model is a deterministic lookup table; in continuous spaces the model must output a distribution,

T^(s,a)=N ⁣(μϕ(s,a),σϕ2(s,a)),\hat T(s, a) = \mathcal{N}\!\bigl(\mu_\phi(s,a),\, \sigma^2_\phi(s,a)\bigr),

to capture both environment stochasticity and uncertainty about its own predictions. MBPO uses an ensemble of such probabilistic neural network dynamics models together with SAC (Soft Actor-Critic) Haarnoja et al., 2018 as the underlying off-policy algorithm.

The key innovation is the branched rollout scheme: MBPO starts each simulated rollout from a real state sampled from an environment replay buffer Denv\mathcal{D}_\text{env} and rolls forward only kk steps. The short rollouts are stored in Dmodel\mathcal{D}_\text{model}, and SAC trains on a mixture of both buffers.

MBPO’s branched rollout scheme. A probabilistic ensemble of N dynamics models is trained on \mathcal{D}_\text{env}. Each rollout starts from a real state and runs k steps; simulated transitions fill \mathcal{D}_\text{model}. SAC draws from both buffers.

Figure 4:MBPO’s branched rollout scheme. A probabilistic ensemble of NN dynamics models is trained on Denv\mathcal{D}_\text{env}. Each rollout starts from a real state and runs kk steps; simulated transitions fill Dmodel\mathcal{D}_\text{model}. SAC draws from both buffers.

The algorithm

Putting the pieces together gives the MBPO training loop of Janner et al. (2019) (Algorithm 2 in the paper, with notation matched to this chapter):

  1. Initialize SAC policy πψ\pi_\psi, an ensemble of NN dynamics models T^ϕi(s,a)=N ⁣(μϕi(s,a),σϕi2(s,a))\hat T_{\phi_i}(s,a) = \mathcal{N}\!\bigl(\mu_{\phi_i}(s,a),\, \sigma^2_{\phi_i}(s,a)\bigr) for i=1,,Ni = 1,\ldots,N, environment buffer Denv\mathcal{D}_\text{env}, and model buffer Dmodel\mathcal{D}_\text{model}.

  2. For NepochN_\text{epoch} epochs do

    1. Train each T^ϕi\hat T_{\phi_i} on Denv\mathcal{D}_\text{env} via maximum likelihood.

    2. For EE environment steps do

      1. Observe state sts_t. Take atπψ(st)a_t \sim \pi_\psi(\cdot \mid s_t); execute ata_t, observe rtr_t and st+1s_{t+1}; add (st,at,rt,st+1)(s_t, a_t, r_t, s_{t+1}) to Denv\mathcal{D}_\text{env}.

      2. For MM model rollouts do

        1. Sample a start state ss uniformly from Denv\mathcal{D}_\text{env}.

        2. For j=1,,kj = 1,\ldots,k do

          1. Sample an ensemble member iU{1,,N}i \sim \mathcal{U}\{1,\ldots,N\}.

          2. Sample aπψ(s)a \sim \pi_\psi(\cdot \mid s); sample (s,r)T^ϕi(s,a)(s', r) \sim \hat T_{\phi_i}(s,a).

          3. Add (s,a,r,s)(s, a, r, s') to Dmodel\mathcal{D}_\text{model}; set sss \leftarrow s'.

      3. For GG gradient updates do

        1. Sample mini-batches from Dmodel\mathcal{D}_\text{model} and take SAC steps on ψ\psi Haarnoja et al., 2018 (off-policy actor–critic updates with clipped double-QQ targets and an entropy bonus).

The outer loop collects real data and refits the model; the middle loop generates short branched rollouts; the inner loop amortises planning into many policy updates on imagined transitions. In practice, SAC is often trained on a mixture of Denv\mathcal{D}_\text{env} and Dmodel\mathcal{D}_\text{model}, even though the published pseudocode shows updates on Dmodel\mathcal{D}_\text{model} only Janner et al., 2019.

Why ensembles? An ensemble of NN independently initialised models (typically N=7N = 7) captures epistemic uncertainty — uncertainty arising from limited data that shrinks as more experience arrives. This differs from aleatoric uncertainty (irreducible environment stochasticity), which is captured within each member’s output variance. High ensemble disagreement signals epistemically uncertain states that the agent should visit with real data rather than simulate.

Why short rollouts? Model error ϵ\epsilon per step accumulates roughly as kϵk\epsilon over kk steps. For k[1,5]k \in [1, 5] the planning benefit dominates; for k5k \gg 5 the policy degrades Janner et al., 2019.

MBPO achieves an order-of-magnitude improvement in sample efficiency over SAC alone on standard MuJoCo benchmarks, often reaching comparable final performance in 520×5\text{–}20\times fewer environment steps, though gains vary by environment.

Given-Model Planning

The agent has access to a perfect simulator — the rules of a board game or a known physics model — and never needs to estimate the transition function. The challenge shifts from model learning to effective search: how do you find the best action in a state space too large to enumerate?

Two-Player Perfect Information Games

The second major thread in model-based RL arose from board games — specifically from the question: can a computer reach human-level play through search and learning alone?

Two-player perfect information (TPIG) games are zero-sum sequential games in which both players see the complete game state at all times. Chess, Go, Shogi, Checkers, and Othello all fall into this class. Because the state is fully visible and the rules are deterministic, the game can be described by a perfect transition model — the game engine — that tells you exactly what follows from any action.

The game of Go is played on a 19×19 grid; players alternately place black and white stones, aiming to control territory Silver et al., 2016. Go has an average branching factor of approximately 250 legal moves per position (compared to roughly 35 in Chess) and typical game depths on the order of 150 moves Silver et al., 2016, yielding a game tree whose depth and breadth make exhaustive search completely infeasible Silver et al., 2016. For decades, Monte Carlo tree search programs that blended online search with offline-learned priors Gelly & Silver, 2007 topped out at strong amateur level on full boards Enzenberger et al., 2010. Chess followed a different trajectory: Deep Blue defeated the reigning world champion in 1997 using massively parallel alpha–beta search and a hand-crafted linear evaluation function—not reinforcement learning or deep networks Campbell et al., 2002. Deep learning nonetheless made earlier headway on chess: the Giraffe engine learned a neural evaluation function by self-play and reached roughly international-master strength in 2015 Lai, 2015, before AlphaGo’s first victory over a human professional on full-board Go Silver et al., 2016. The central difficulty in Go is evaluating who is winning from a given board position: reliable position evaluation was long believed intractable Silver et al., 2016, because it requires understanding long-range strategic patterns that do not reduce to simple heuristics.

The game of Go: the board and stone placement (left), and four core rules illustrated on mini-boards (right)—territory, capture (a surrounded group with one remaining liberty), ko, and end-of-game scoring.

Figure 5:The game of Go: the board and stone placement (left), and four core rules illustrated on mini-boards (right)—territory, capture (a surrounded group with one remaining liberty), ko, and end-of-game scoring.

TPIG games are unusually good for AI research: the reward signal is unambiguous (win/lose/draw), the state is fully observable, progress can be measured against human expertise, and the game engine provides a free, perfect simulator. These properties make them ideal laboratories for model-based planning — but they also mean that techniques developed here do not transfer unchanged to general RL settings, a limitation we return to at the end of this section.

Self-Play

A key ingredient in all superhuman game-playing agents is self-play: the agent trains against versions of itself rather than against a fixed opponent or human-provided data.

In self-play, the agent simultaneously takes both sides of the board. Each game produces training signal (who won) that can be used to improve the policy. As the agent improves, the opponent — a snapshot or the current version of the same agent — also becomes stronger, providing an automatic curriculum: the difficulty of the training opponent tracks the agent’s current level.

Self-play was first shown to work at scale by Tesauro’s TD-Gammon Tesauro, 1995, which learned to play Backgammon at near-expert level purely from self-play and temporal-difference learning. For Go, self-play is essential: there are no simple Backgammon-style rollout heuristics, so any improvement in the value estimate must come from experience against strong opposition.

The deeper reason self-play works is that it instantiates a form of policy iteration Silver et al., 2017: playing against the current policy evaluates it (revealing its weaknesses), and updating the network to win more frequently improves it. Because the “opponent” is always just the current self, the training distribution stays on-policy throughout — no fixed opponent can be “exploited” to overfit.

Monte Carlo Tree Search

Monte Carlo Tree Search (MCTS) was introduced by Coulom (2007) for the Go program Crazy Stone and given its now-canonical UCB-driven form — UCT (Upper Confidence bounds applied to Trees) — by Kocsis & Szepesvári (2006); see Browne et al. (2012) for a comprehensive survey. It is the planning backbone used by all AlphaGo-family agents Silver et al., 2016Silver et al., 2017Silver et al., 2018. It is a given-model algorithm: it requires a perfect simulator and uses it to build a search tree at decision time, not to augment training data Silver et al., 2016.

The four phases of MCTS. ① Selection: traverse from the root, choosing actions by UCB until an unexpanded node. ② Expansion: add a new leaf. ③ Simulation: run a rollout to a terminal state. ④ Backpropagation: update visit counts and value estimates along the path.

Figure 6:The four phases of MCTS. ① Selection: traverse from the root, choosing actions by UCB until an unexpanded node. ② Expansion: add a new leaf. ③ Simulation: run a rollout to a terminal state. ④ Backpropagation: update visit counts and value estimates along the path.

Inputs / prerequisites. MCTS needs (i) a generative model of the environment p(s,rs,a)p(s', r \mid s, a) that can be queried from arbitrary states — a perfect simulator in given-model settings such as a game engine, or a learned model in the variants we will see in Chapter 8; (ii) the current state s0s_0 from which to plan; (iii) a way to evaluate non-terminal leaf states — either by rolling out to termination with a default policy or by calling a learned value function; and (iv) a compute budget (number of iterations NsimN_\text{sim} or wall-clock time). MCTS is anytime: more iterations sharpen the recommendation but the search can be stopped at any point. It applies naturally to episodic, fully observable sequential decision problems — single-agent MDPs as well as two-player perfect-information games Kocsis & Szepesvári, 2006Browne et al., 2012.

Output. A statistics tree rooted at s0s_0 holding, for every visited (s,a)(s, a), a visit count N(s,a)N(s, a) and a value estimate Q(s,a)Q(s, a). From this tree the planner extracts a recommended action — typically the most-visited root child a=argmaxaN(s0,a)a^\star = \arg\max_a N(s_0, a), which is more robust than argmaxaQ\arg\max_a Q because it weights confidence by sample size Coulom, 2007.

Components. Each iteration reuses four pieces: a tree policy (here UCB/UCT) that selects already-known children, the generative model that advances to the next state, a default (rollout) policy that estimates leaf value cheaply, and a backup operator that propagates the leaf estimate up the visited path. The four phases below — Selection, Expansion, Simulation, Backpropagation — are just the order in which these components are called inside one iteration.

Each iteration of MCTS has four phases Browne et al., 2012:

① Selection. From the root, select child nodes according to the Upper Confidence Bound (UCB) formula:

UCB(s,a)=Q(s,a)+clnN(s)N(s,a),\text{UCB}(s, a) = Q(s, a) + c \sqrt{\frac{\ln N(s)}{N(s, a)}},

where Q(s,a)Q(s,a) is the mean value estimate, N(s)N(s) the total visit count of node ss, N(s,a)N(s,a) the visit count of the child reached by action aa, and c>0c>0 an exploration constant. Combining UCB1 Auer et al., 2002 with tree search this way is the UCT algorithm of Kocsis & Szepesvári (2006), the variant used in the AlphaGo family.

② Expansion. Add one or more new child nodes to the selected leaf.

③ Simulation. From the new leaf, roll out to a terminal state using a default policy (random moves or a fast heuristic). Record the outcome v{1,0,+1}v \in \{-1, 0, +1\}.

④ Backpropagation. Walk back to the root, updating N(s,a)N(s,a)+1N(s,a) \leftarrow N(s,a)+1 and Q(s,a)Q(s,a)+vQ(s,a)N(s,a)Q(s,a) \leftarrow Q(s,a) + \frac{v - Q(s,a)}{N(s,a)}.

After many iterations the chosen action is argmaxaN(s0,a)\arg\max_a N(s_0, a) — the most-visited child.

Pure UCT with random rollouts already plays Go at amateur level — Coulom’s Crazy Stone won the first UEC Cup computer Go tournament (19×19 board) with essentially this recipe Coulom, 2007. Its weakness: random rollouts are noisy value estimators, and Go’s branching factor (250\approx 250) makes deep search computationally prohibitive. The key idea of AlphaGo is to replace random rollouts with a trained neural network.

AlphaGo

AlphaGo Silver et al., 2016 was the first program to defeat a professional Go player at full strength. It is built from four separately trained neural networks that plug into Monte Carlo Tree Search at inference time, glued together by a three-stage training pipeline that mixes supervised learning, reinforcement learning, and regression. This subsection walks through the inputs the networks see (Inputs: Hand-crafted Feature Planes), what each of the four networks does (The Four Networks), how they are trained (Training Pipeline), and how MCTS uses them at decision time (Neural-Guided MCTS: PUCT and the Mixed Leaf Estimator). Figure 7 gives the high-level picture; the subsections below fill in the algorithmic detail.

AlphaGo’s training pipeline (left) and inference (right). Three networks are trained: the SL policy p_\sigma on human expert positions, the RL policy p_\rho by self-play fine-tuning, and the value network v_\theta on self-play outcomes. At inference, MCTS uses p_\sigma for move priors and v_\theta for leaf evaluation. A fast rollout policy p_\pi (used inside MCTS simulation, not shown separately) completes games quickly; leaf values combine v_\theta and the p_\pi-rollout outcome.

Figure 7:AlphaGo’s training pipeline (left) and inference (right). Three networks are trained: the SL policy pσp_\sigma on human expert positions, the RL policy pρp_\rho by self-play fine-tuning, and the value network vθv_\theta on self-play outcomes. At inference, MCTS uses pσp_\sigma for move priors and vθv_\theta for leaf evaluation. A fast rollout policy pπp_\pi (used inside MCTS simulation, not shown separately) completes games quickly; leaf values combine vθv_\theta and the pπp_\pi-rollout outcome.

Inputs: Hand-crafted Feature Planes

Before any network sees a Go board, AlphaGo encodes the board state ss as a 19×19×4819 \times 19 \times 48 tensor of binary feature planes Silver et al., 2016. Each plane is a 19×1919 \times 19 binary mask — the same spatial layout as the board itself — and the 48 channels split into two groups, summarised in Figure 8:

The 48 binary input planes of AlphaGo’s policy and value networks. The first 5 planes are raw board state; the remaining 43 are engineered Go knowledge that bakes in human heuristics about life-and-death, captures, and ladders. Every plane is 19 \times 19, so the networks are fully convolutional (no pooling) and exploit the translation equivariance of Go. AlphaGo Zero later drops every engineered feature and feeds in only 17 raw planes.

Figure 8:The 48 binary input planes of AlphaGo’s policy and value networks. The first 5 planes are raw board state; the remaining 43 are engineered Go knowledge that bakes in human heuristics about life-and-death, captures, and ladders. Every plane is 19×1919 \times 19, so the networks are fully convolutional (no pooling) and exploit the translation equivariance of Go. AlphaGo Zero later drops every engineered feature and feeds in only 17 raw planes.

The translation equivariance is critical: the same tactical pattern at the corner, the side, and the centre of the board has the same meaning, so the networks share weights across all 361 intersections and never need fully-connected layers until the very last step (only the value network has one).

Engineering features for a deep network was standard practice for board-game programs in 2016. The lesson AlphaGo Zero will teach a year later is that, given enough self-play, none of these engineered features are necessary — but in AlphaGo they shave several percentage points off move-prediction accuracy and noticeably strengthen MCTS.

The Four Networks

AlphaGo trains four distinct networks, each playing a separate role at inference time. The table below lays out their architectures, training data, loss functions, outputs, and runtime responsibilities side-by-side.

SL policy pσp_\sigma
supervised learning
RL policy pρp_\rho
self-play fine-tuning
Value net vθv_\theta
position evaluator
Fast rollout pπp_\pi
cheap simulator
Architecture13-layer convolutional policy network (~4M parameters)Same 13-layer CNN as pσp_\sigma — weights initialised from pσp_\sigma (~4M parameters)13 conv layers + 1 fully-connected layer; single scalar output (tanh)Linear softmax over hand-crafted local patterns — ~1500× faster than pσp_\sigma
Trained on30 M positions from 160 k human-expert games (KGS Go server)Self-play games vs. an opponent pool of past pρp_\rho snapshots30 M positions, one per game from pρp_\rho self-play (breaks within-game correlation)8 M positions from an online Go server (separate from KGS)
Loss / signalCross-entropy against the expert move (\rightarrow 57 % top-1 accuracy)REINFORCE policy gradient with terminal reward z{1,+1}z \in \{-1, +1\} (\rightarrow 80 % win-rate vs. pσp_\sigma)MSE regression against game outcome: (zvθ(s))2(z - v_\theta(s))^2Cross-entropy against the human move (\rightarrow only 24 % top-1 accuracy)
OutputMove distribution pσ(as)p_\sigma(a \mid s) over 192+119^2 + 1 actionsMove distribution pρ(as)p_\rho(a \mid s) (sharper, less diverse than pσp_\sigma)Scalar vθ(s)[1,+1]v_\theta(s) \in [-1, +1] — estimated win probability from the current player’s viewMove distribution pπ(as)p_\pi(a \mid s) — weak but extremely cheap
Role at inferenceMove priors P(s,a)P(s, a) in the PUCT bonus (diverse priors \Rightarrow exploratory search)Generates training data for vθv_\theta — not called during MCTS itselfLeaf evaluation vθ(sL)v_\theta(s_L) — the low-variance, biased term in the blendPlays out leaf to terminal \rightarrow outcome zLz_L is the noisy, unbiased blend term

Two design choices are worth flagging up front. First, the supervised policy pσp_\sigma and the RL policy pρp_\rho share the same architecture (13-layer convolutional network) and the same input encoding; pρp_\rho is just pσp_\sigma with REINFORCE applied on top. Second, pπp_\pi exists for one reason: speed. A forward pass through pσp_\sigma takes about 3ms3\,\text{ms} on a single GPU — far too slow to play out an entire game inside MCTS. pπp_\pi is a linear softmax over local-pattern features (no hidden layers) and runs in about 2μs2\,\mu\text{s}, roughly 1500×1500\times faster, at the cost of dropping move-prediction accuracy from 57% to 24%. That trade-off is decisive: a noisier but unbiased rollout, sampled cheaply, is more valuable inside MCTS than a slow accurate one.

Training Pipeline

The three training stages are run in sequence; each stage consumes the output of the previous one.

Stage 1 — Supervised learning of pσp_\sigma. The SL policy network is trained on 30 million board positions drawn from 160 000 games between strong human amateurs on the KGS Go server Silver et al., 2016. The training signal is a standard categorical cross-entropy loss against the move the human actually played:

LSL(σ)  =  E(s,a)DKGS ⁣[logpσ(as)],\mathcal{L}_\text{SL}(\sigma) \;=\; -\, \mathbb{E}_{(s, a) \sim \mathcal{D}_\text{KGS}}\!\left[\log p_\sigma(a \mid s)\right],

which is identical to how an image classifier is trained, except that the “class label” is which of the (up to) 361 legal moves the expert chose at position ss. After three weeks of training, pσp_\sigma reaches 57% top-1 move-prediction accuracy, far better than any prior Go program. Crucially, pσp_\sigma also has a diverse output distribution: even when the human move is obvious, pσp_\sigma usually places non-trivial probability on several reasonable alternatives — a property that matters when we plug it into MCTS.

Stage 2 — Reinforcement learning of pρp_\rho by self-play. The RL policy pρp_\rho is initialised from pσp_\sigma and fine-tuned by self-play with the REINFORCE policy-gradient estimator from Chapter 4 Williams, 1992. Each game is played to completion, the outcome z{1,+1}z \in \{-1, +1\} (loss vs. win for the player to move at ss) is treated as the reward, and the parameters are updated by

ρJ(ρ)  =  Eself-play ⁣[zt=0Tρlogpρ(atst)],\nabla_\rho \mathcal{J}(\rho) \;=\; \mathbb{E}_\text{self-play}\!\left[\, z \sum_{t=0}^{T} \nabla_\rho \log p_\rho(a_t \mid s_t) \,\right],

i.e. moves from games we won are made slightly more likely, moves from games we lost slightly less likely.

A naive implementation would have pρp_\rho play against itself, but this causes two pathologies: (i) the gradient signal collapses because both sides improve in lock-step, and (ii) the agent overfits to its own tactics — defeating its current self does not mean defeating any other reasonable opponent. AlphaGo fixes this with an opponent pool: every few iterations a snapshot of pρp_\rho is added to the pool, and each self-play game is played against a randomly sampled snapshot rather than the current parameters. The pool gives self-play a built-in diversity of opponents and prevents cyclic strategies of the form “I always open in the upper-right corner because I always beat myself when I do.” After this stage, pρp_\rho wins 80% of games against pσp_\sigma.

Stage 3 — Regression training of vθv_\theta. The value network vθv_\theta shares the same convolutional trunk as the policy networks but adds a final fully-connected layer producing a single scalar, with a tanh non-linearity that constrains its output to [1,+1][-1, +1]. It is trained by mean-squared regression against the outcome of a self-play game played by pρp_\rho:

Lval(θ)  =  E(s,z)Dval ⁣[(zvθ(s))2].\mathcal{L}_\text{val}(\theta) \;=\; \mathbb{E}_{(s, z) \sim \mathcal{D}_\text{val}}\!\left[\,(z - v_\theta(s))^2\,\right].

The non-obvious part is how Dval\mathcal{D}_\text{val} is built. A direct approach — store every (st,z)(s_t, z) pair from each self-play game — fails badly because consecutive positions within a single game are extremely correlated: they share most stones, the same eventual winner, and the same strategic plan. Training a regressor on this distribution overfits within a few epochs (the network memorises games rather than learning to evaluate positions). AlphaGo sidesteps this by playing 30 million separate self-play games and keeping exactly one position per game. The resulting dataset is roughly i.i.d. at the position level and the value network generalises to unseen boards.

Neural-Guided MCTS: PUCT and the Mixed Leaf Estimator

At inference, AlphaGo runs Monte Carlo Tree Search exactly as introduced in Monte Carlo Tree Search, but with two crucial modifications: the selection rule now uses neural-network priors, and the leaf evaluation no longer relies on random rollouts. Both modifications — and the role each of pσp_\sigma, vθv_\theta, and pπp_\pi plays inside one iteration — are summarised in Figure 9.

One AlphaGo MCTS iteration. ① Selection (purple) descends the tree by PUCT (Equation ); ② Expansion adds a new leaf s_L; ③ Evaluation runs the value network v_\theta(s_L) (green) in parallel with a fast p_\pi rollout to the terminal outcome z_L (orange) and blends them into V(s_L) (Equation ); ④ Backup walks the visited path updating N and Q from V(s_L) (Equation ). Per iteration, p_\sigma fires once (at expansion, to seed child priors that are reused thereafter), v_\theta fires once (at evaluation), and p_\pi fires many times (throughout the rollout to termination). The RL policy p_\rho is not queried at inference — its only role is to generate training data for v_\theta.

Figure 9:One AlphaGo MCTS iteration. ① Selection (purple) descends the tree by PUCT (Equation (7)); ② Expansion adds a new leaf sLs_L; ③ Evaluation runs the value network vθ(sL)v_\theta(s_L) (green) in parallel with a fast pπp_\pi rollout to the terminal outcome zLz_L (orange) and blends them into V(sL)V(s_L) (Equation (8)); ④ Backup walks the visited path updating NN and QQ from V(sL)V(s_L) (Equation (9)). Per iteration, pσp_\sigma fires once (at expansion, to seed child priors that are reused thereafter), vθv_\theta fires once (at evaluation), and pπp_\pi fires many times (throughout the rollout to termination). The RL policy pρp_\rho is not queried at inference — its only role is to generate training data for vθv_\theta.

① Selection. AlphaGo replaces the bare UCB formula in Equation (3) with PUCTPredictor + UCB applied to Trees — which weights each candidate’s exploration bonus by the network’s prior probability for that move:

U(s,a)  =  Q(s,a)  +  cpuctP(s,a)N(s)1+N(s,a),U(s, a) \;=\; Q(s, a) \;+\; c_\text{puct} \, P(s, a) \, \frac{\sqrt{N(s)}}{1 + N(s, a)},

with prior P(s,a)=pσ(as)P(s, a) = p_\sigma(a \mid s) from the SL policy. PUCT has two natural regimes: when a child has been visited few times, N(s,a)N(s, a) is small and the bonus reduces to roughly cpuctP(s,a)N(s)c_\text{puct} \, P(s, a) \, \sqrt{N(s)}the prior dominates, so high-prior moves are explored first. As N(s,a)N(s, a) grows, the bonus shrinks and Q(s,a)Q(s, a) — the empirical win-rate estimate — takes over. Compared with vanilla UCB, PUCT concentrates the search budget on plausible moves rather than spreading it uniformly across the 250\sim 250 legal options.

Why the SL policy pσp_\sigma for priors, not the stronger pρp_\rho? This is the most counter-intuitive design choice in AlphaGo. Empirically, using pρp_\rho as the prior produces a weaker tree search than using pσp_\sigma. The reason is distribution shape: pρp_\rho has been trained to win, so it concentrates almost all probability on the single move it estimates is best. Plugging that into PUCT collapses the search onto a narrow subtree, and any move pρp_\rho undervalues — including potentially the correct one in unfamiliar positions — receives essentially zero exploration. pσp_\sigma, by contrast, has been trained to imitate humans, who themselves play diverse styles, so pσp_\sigma assigns non-trivial probability to many reasonable moves. Diverse priors lead to a more exploratory search tree, and a more exploratory tree produces better moves than the stronger-but-narrower policy on its own. The general lesson — that the right way to use a network inside search is rarely “pick its top choice” — comes up again in AlphaGo Zero and MuZero.

② Expansion. When PUCT selects an edge that leads to a state not yet in the tree, that state becomes a new leaf sLs_L. AlphaGo evaluates the SL policy once at this leaf — pσ(sL)p_\sigma(\cdot \mid s_L) — and stores the resulting distribution as the prior P(sL,)P(s_L, \cdot) over all of sLs_L’s legal children. These priors are not recomputed: every later iteration that descends through sLs_L reads the same cached vector inside the PUCT bonus of Equation (7). The single pσp_\sigma call at expansion is therefore amortised over every subsequent visit to sLs_L, which keeps the per-iteration cost of MCTS dominated by the leaf evaluation that follows.

③ Evaluation. When MCTS reaches an unexplored leaf sLs_L, the original Monte Carlo Tree Search would estimate its value by a random rollout. AlphaGo replaces that with two parallel estimates of the same quantity, blended into one number:

V(sL)  =  (1λ)vθ(sL)  +  λzpπ,V(s_L) \;=\; (1 - \lambda) \, v_\theta(s_L) \;+\; \lambda \, z_{p_\pi},

where vθ(sL)[1,+1]v_\theta(s_L) \in [-1, +1] is the value network’s win-probability estimate (a single forward pass) and zpπ{1,+1}z_{p_\pi} \in \{-1, +1\} is the outcome of one fast rollout from sLs_L played by pπp_\pi on both sides until termination. The two estimators have complementary error profiles: vθv_\theta has low variance but is possibly biased (its evaluation depends on whatever positions were in its training set), whereas zpπz_{p_\pi} is unbiased (a real game outcome under a fixed rollout policy) but has high variance (the outcome of a single rollout is a noisy sample). Blending them with λ=0.5\lambda = 0.5 — the value used in the paper — captures most of the bias correction from rollouts while inheriting most of the variance reduction from the value network.

④ Backup. With V(sL)V(s_L) in hand, MCTS walks back along the visited path. For every edge (s,a)(s, a) on that path it updates the visit count and the running mean of the value estimate:

N(s,a)N(s,a)+1,Q(s,a)Q(s,a)+V(sL)Q(s,a)N(s,a),N(s, a) \leftarrow N(s, a) + 1, \qquad Q(s, a) \leftarrow Q(s, a) + \frac{V(s_L) - Q(s, a)}{N(s, a)},

so Q(s,a)Q(s, a) converges to the empirical mean of the mixed leaf estimates collected along that edge. No network is called during backup: the only inputs are the cached statistics already in the tree and the leaf value computed in the previous phase. After NsimN_\text{sim} iterations, the recommended move at the root is, as in plain MCTS, the most-visited child argmaxaN(s0,a)\arg\max_a N(s_0, a).

Results

AlphaGo defeated Lee Sedol — at the time one of the world’s strongest players — 4–1 in a five-game match in March 2016, a result widely considered to be a decade ahead of schedule. The system’s main limitation is its dependence on a complex three-stage training pipeline and 30 million human expert positions.

AlphaGo Zero

AlphaGo Zero Silver et al., 2017 eliminates all human knowledge: no expert games, no hand-crafted features, no separate networks. Starting from random play and training purely through self-play, it surpassed every earlier AlphaGo variant. The core recipe — a dual-head residual network, PUCT-guided MCTS, and self-play distillation — is game-agnostic; the details below are the Go-specific 2017 instantiation that AlphaZero later generalises to Chess and Shogi (Paragraph).

Architecture

AlphaGo Zero uses a single dual-head residual network fθf_\theta:

(pθ(s),vθ(s))=fθ(s),\bigl(p_\theta(s),\, v_\theta(s)\bigr) = f_\theta(s),

where pθ(s)p_\theta(s) is a probability vector over legal moves (policy head) and vθ(s)[1,1]v_\theta(s) \in [-1, 1] is a scalar win estimate (value head). The input is 17 binary feature planes encoding the current position, the seven most recent positions for each player, and whose turn it is. Because Go rules are invariant to rotation and reflection, AlphaGo Zero augments training data with all eight symmetries of each position and applies a random board transform before every neural-network evaluation inside MCTS, averaging over orientation bias.

Neural MCTS with PUCT

AlphaGo Zero keeps the PUCT selection rule introduced for AlphaGo (Equation (7)), but feeds it priors from the unified network fθf_\theta instead of a separate SL policy:

U(s,a)=Q(s,a)+cpuctpθ(as)N(s)1+N(s,a).U(s, a) = Q(s, a) + c_\text{puct} \cdot p_\theta(a \mid s) \cdot \frac{\sqrt{N(s)}}{1 + N(s, a)}.

The leaf-evaluation step is where the largest simplification happens. When MCTS reaches an unvisited leaf sLs_L, AlphaGo Zero evaluates fθ(sL)f_\theta(s_L) with a single forward pass and uses both of its outputs: vθ(sL)v_\theta(s_L) becomes the leaf value — the entire mixed estimator of Equation (8) is replaced by just the value head — and pθ(sL)p_\theta(\cdot \mid s_L) assigns priors to the newly expanded children. The fast rollout policy pπp_\pi is gone, so there is no λ\lambda to tune; the noisy simulation phase has been collapsed entirely into a deterministic network evaluation.

After Nsim=800N_\text{sim} = 800 MCTS simulations the search produces a visit-count policy:

π(as)=N(s,a)1/τbN(s,b)1/τ,\pi(a \mid s) = \frac{N(s, a)^{1/\tau}}{\sum_b N(s, b)^{1/\tau}},

with temperature τ=1\tau = 1 for the first 30 moves (encouraging exploration) and τ0\tau \to 0 thereafter. This π\pi is stronger than the raw network output pθp_\theta because it incorporates tree search.

Self-Play Training Loop

AlphaGo Zero’s self-play training loop. The current best network drives MCTS self-play, producing (s_t, \vec\pi_t, z) triples stored in a replay buffer (most recent 500 000 games). Mini-batches train the network to predict both the MCTS policy \vec\pi and the game outcome z. The updated network is promoted to best-network status if it wins >55\% of 400 evaluation games — an AlphaGo Zero-specific gate that AlphaZero drops in favour of continual weight updates ().

Figure 10:AlphaGo Zero’s self-play training loop. The current best network drives MCTS self-play, producing (st,πt,z)(s_t, \vec\pi_t, z) triples stored in a replay buffer (most recent 500 000 games). Mini-batches train the network to predict both the MCTS policy π\vec\pi and the game outcome zz. The updated network is promoted to best-network status if it wins >55%>55\% of 400 evaluation games — an AlphaGo Zero-specific gate that AlphaZero drops in favour of continual weight updates (Figure 12).

Training alternates two phases indefinitely:

  1. Self-play. The current best network plays against itself. At each position, 800 MCTS simulations produce a visit-count vector π\vec\pi, stored as the policy target. The final game outcome z{1,+1}z \in \{-1, +1\} — Go has no draws under the Tromp–Taylor rules used — is stored as the value target. The replay buffer retains the most recent 500 000 games.

  2. Network update. Mini-batches (s,π,z)(s, \vec\pi, z) are sampled uniformly from the buffer. Parameters θ\theta are updated by minimising:

(θ)=(zvθ(s))2value lossπlogpθ(s)policy loss+λθ2weight decay.\ell(\theta) = \underbrace{(z - v_\theta(s))^2}_{\text{value loss}} - \underbrace{\vec\pi \cdot \log p_\theta(s)}_{\text{policy loss}} + \underbrace{\lambda \|\theta\|^2}_{\text{weight decay}}.

After each update the network plays 400 games against the previous best; if it wins >55%>55\% it becomes the new best and takes over self-play. This best-network promotion gate stabilises training by preventing a temporarily weaker checkpoint from polluting self-play data; search hyperparameters were tuned per run via Bayesian optimisation. AlphaZero removes the gate and reuses a single set of hyperparameters across games (Paragraph).

Why Does Self-Play Work?

The network and MCTS form a mutual improvement loop:

This is generalised policy iteration: MCTS performs policy evaluation (producing a stronger policy π\vec\pi from the current network), while the network update performs policy improvement — with no human knowledge fixing the initial point. MCTS amplifies what the network knows; training distils it back into the weights. The process bootstraps from noise to superhuman strength.

Results

AlphaGo Zero trained from random initialisation in 72 hours:

AlphaGo Zero rediscovered classical joseki and fuseki sequences, and invented novel strategic patterns that human experts had not previously considered. The same core recipe was generalised to Chess and Shogi a year later as AlphaZero (Paragraph).

AlphaZero

AlphaZero Silver et al., 2018 generalises the AlphaGo Zero recipe to Chess, Shogi, and Go — training a separate network instance per game from random play, with no human games, no handcrafted evaluation features, and no domain-specific search heuristics such as opening books or endgame tablebases. State-of-the-art engines like Stockfish and Elmo instead rely on decades of hand-tuned alpha–beta search, linear evaluation functions, and extensive domain adaptations Silver et al., 2018. AlphaZero replaces all of that with a single dual-head residual network and tabula-rasa self-play: the game rules alone serve as the transition model.

The core ingredients are unchanged from AlphaGo Zero: PUCT-guided MCTS, a policy/value network fθf_\theta, and the mutual-improvement loop described above. What changes are the input encodings (board size and feature planes differ per game), a handful of algorithmic variants tuned for multi-game generality, and the targets the network learns (draws are possible in chess and shogi). AlphaZero demonstrates that the recipe is not Go-specific — it is a general method for two-player perfect-information games with a given simulator.

Differences from AlphaGo Zero

AlphaZero is best understood as a generalised variant of the AlphaGo Zero algorithm Silver et al., 2018. Four concrete differences matter:

  1. Outcome target. AlphaGo Zero estimates the probability of winning, with terminal outcomes z{1,+1}z \in \{-1, +1\} only (Go has no draws under Tromp–Taylor scoring). AlphaZero instead optimises the expected score z{1,0,+1}z \in \{-1, 0, +1\}, treating draws as first-class outcomes — essential for chess, where drawn positions are common at superhuman level.

  2. Symmetries. Go rules are invariant to rotation and reflection; AlphaGo Zero exploits this with 8-fold data augmentation during training and random board transforms at every MCTS leaf evaluation. Chess and shogi rules are asymmetric (castling, pawn direction, piece drops), so AlphaZero applies neither augmentation nor search-time transforms.

  3. Network management. AlphaGo Zero trains in iterations: after each update, the new network must win >55%>55\% of 400 games against the previous best before taking over self-play (Figure 10). AlphaZero maintains a single network updated continually — self-play always uses the latest parameters θ\theta, with no iteration boundary and no promotion gate.

  4. Hyperparameters. AlphaGo Zero tuned search hyperparameters per run via Bayesian optimisation. AlphaZero reuses the same settings for all three games — 800 MCTS simulations, identical learning-rate schedule, and so on — with one exception: Dirichlet noise at the root is scaled inversely to the typical branching factor, with α{0.3,0.15,0.03}\alpha \in \{0.3,\, 0.15,\, 0.03\} for chess, shogi, and Go respectively.

Input and action encoding

Unlike AlphaGo Zero, which hard-codes a 19×1919 \times 19 Go board, AlphaZero parameterises encoding by board size NN. The neural-network input is an N×N×(MT+L)N \times N \times (MT + L) image stack: T=8T = 8 consecutive board positions (zero-padded before move 1), each represented by MM binary feature planes, plus LL constant planes encoding whose turn it is, the total move count, and game-specific rule state (castling rights, repetition counts, prisoner tallies). The board is always oriented from the current player’s perspective. Figure 11 summarises the three encodings.

AlphaZero input and policy-head shapes for Go, Chess, and Shogi. All three share an N \times N \times (MT + L) input stack with T = 8 position history; only the plane counts and policy tensor differ. Illegal moves are masked to zero probability and the distribution is re-normalised over legal moves.

Figure 11:AlphaZero input and policy-head shapes for Go, Chess, and Shogi. All three share an N×N×(MT+L)N \times N \times (MT + L) input stack with T=8T = 8 position history; only the plane counts and policy tensor differ. Illegal moves are masked to zero probability and the distribution is re-normalised over legal moves.

The policy head also varies by game. Go uses the same flat distribution over 19×19+119 \times 19 + 1 moves (stone placements plus pass) as AlphaGo Zero. Chess encodes moves as an 8×8×738 \times 8 \times 73 tensor: for each origin square, 56 planes encode “queen moves” (distance 1–7 in each of eight compass directions), 8 encode knight hops, and 9 encode pawn underpromotions. Shogi uses a 9×9×1399 \times 9 \times 139 tensor with analogous move planes plus 7 drop planes for captured pieces returned to the board. In every case the network outputs a large logits tensor; illegal moves are zeroed and the remainder is re-normalised.

GameBoardInput planesPolicy shapeNotable constant planes
Go19×1919 \times 191719×19+119 \times 19 + 1 flatRepetitions (2)
Chess8×88 \times 81198×8×738 \times 8 \times 73Castling (4), 50-move rule, repetitions (3)
Shogi9×99 \times 93629×9×1399 \times 9 \times 139Prisoner counts (14), repetitions (3)

Architecture

The network architecture is the same dual-head residual network fθf_\theta introduced for AlphaGo Zero:

(pθ(s),vθ(s))=fθ(s),\bigl(p_\theta(s),\, v_\theta(s)\bigr) = f_\theta(s),

with a convolutional trunk of residual blocks He et al., 2016 as in AlphaGo Zero and two heads: a policy head whose output shape matches the game’s action encoding, and a scalar value head vθ(s)[1,1]v_\theta(s) \in [-1, 1]. No fast rollout policy, no separate SL and RL networks — a single forward pass at each MCTS leaf supplies both the prior pθ(sL)p_\theta(\cdot \mid s_L) and the leaf value vθ(sL)v_\theta(s_L).

Neural MCTS with PUCT

Search follows the same PUCT rule as AlphaGo Zero (Equation (7)): priors from fθf_\theta, leaf values from the value head alone, Nsim=800N_\text{sim} = 800 simulations per move, and a visit-count policy π\pi with temperature τ=1\tau = 1 for the first 30 plies and τ0\tau \to 0 thereafter. Two AlphaZero-specific details: Dirichlet noise Dir(α)\text{Dir}(\alpha) is added to the root prior (with game-scaled α\alpha as above) to encourage opening exploration, and there is no symmetry averaging at leaf evaluation. At tournament time, moves are selected greedily by root visit count.

Self-Play Training Loop

AlphaZero’s self-play training loop. Unlike AlphaGo Zero (), there is no best-network promotion gate: the latest weights drive self-play immediately after each gradient update. Terminal outcomes z \in \{-1, 0, +1\} include draws.

Figure 12:AlphaZero’s self-play training loop. Unlike AlphaGo Zero (Figure 10), there is no best-network promotion gate: the latest weights drive self-play immediately after each gradient update. Terminal outcomes z{1,0,+1}z \in \{-1, 0, +1\} include draws.

Training alternates the same two phases as AlphaGo Zero, but with continual weight updates:

  1. Self-play. The current network (always the latest θ\theta) plays against itself. Each position yields a visit-count policy target π\vec\pi and a terminal outcome z{1,0,+1}z \in \{-1, 0, +1\}. The replay buffer retains the most recent 500 000 games.

  2. Network update. Mini-batches of size 4096 are sampled uniformly from the buffer and optimised with the loss in Equation (13). The learning rate starts at 0.2 and is dropped three times (to 0.02, 0.002, and 0.0002) over 700 000 mini-batch steps per game instance. Training used 5 000 first-generation TPUs for self-play data generation and 64 second-generation TPUs for network updates Silver et al., 2018.

Each game trained to completion in roughly 9 hours (chess), 12 hours (shogi), or 34 hours (Go). During training, AlphaZero surpassed its baseline opponents after approximately 4 hours / 300k steps (Stockfish 8 in chess), 2 hours / 110k steps (Elmo in shogi), and 8 hours / 165k steps (AlphaGo Lee in Go) — all measured on an Elo scale from 1-second-per-move evaluation games.

Why Does Self-Play Work?

The mutual-improvement argument is identical to the AlphaGo Zero case above: MCTS amplifies the network’s knowledge into a stronger visit-count policy π\vec\pi, and gradient descent distils π\vec\pi back into the weights. One chess-specific insight from the AlphaZero paper helps explain why neural MCTS can beat alpha–beta engines that evaluate orders of magnitude more positions: MCTS averages value estimates over all simulations visiting a subtree, which tends to cancel spurious neural approximation errors, whereas alpha–beta search propagates explicit minimax values and therefore amplifies the largest errors to the root Silver et al., 2018.

Results

100-game tournaments at 1 minute per move against the strongest available baselines (Stockfish 8 with 64 threads; Elmo with 64 threads; AlphaGo Zero trained for 3 days) yielded:

GameOpponentWinsDrawsLosses
ChessStockfish 828720
ShogiElmo9028
GoAlphaGo Zero (3-day)6040

AlphaZero lost zero games to Stockfish and only eight to Elmo (combined over both colours).

Search efficiency tells a complementary story. At tournament settings, AlphaZero evaluates roughly 80 000, 40 000, and 16 000 positions per second in chess, shogi, and Go — compared to 70 million for Stockfish and 35 million for Elmo. AlphaZero compensates for roughly 1 000×\times fewer node evaluations by using its deep network to focus search on the most promising variations, and its MCTS scales more effectively with thinking time than either baseline engine Silver et al., 2018.

During self-play, AlphaZero independently discovered the most popular human chess openings (English, Queen’s Gambit, Sicilian, Ruy Lopez, and others) and defeated Stockfish from each starting position in dedicated 100-game match probes. The result settled a long-standing question: AlphaGo Zero’s success was not a Go-specific artefact, but evidence for a general reinforcement-learning algorithm that masters diverse perfect-information games from self-play and the rules alone.

Why Can’t We Apply This Everywhere?

AlphaGo Zero and AlphaZero work because of three properties that most RL problems do not have:

  1. A perfect simulator is available. The game engine is given and can be queried arbitrarily many times for free. In most real-world settings (robotics, chemistry, markets) no such model exists — it must be learned from data.

  2. Self-play provides a natural reward signal. The zero-sum two-player structure means that improving against yourself is always useful. For single-agent tasks, cooperative games, or problems with dense reward, there is no obvious self-play analogue.

  3. The state is fully observed. Both players see the entire board. Real-world sensors are noisy and partial; the agent must infer hidden state from observations.

These limitations motivate the world-model paradigm: if the agent can learn its own model of the environment, it can plan without a given simulator — and potentially extend planning-based methods to single-agent, partially observed, continuous-control settings.

Summary

This chapter introduced the two axes of model-based RL: how the model is obtained (learned from experience or provided as a perfect simulator) and how the model is used (background planning to generate cheap synthetic data, or decision-time planning to search before acting). Learned-model methods — DYNA-Q and MBPO — convert each real transition into many simulated ones, achieving substantial sample-efficiency gains over model-free counterparts at the cost of model bias. Given-model methods — the AlphaGo family — take a complementary approach: with a perfect game engine and self-play as an automatic curriculum, neural MCTS bootstraps from random play to superhuman strength without any human knowledge beyond the rules of the game.

AlgorithmModelPlanningData sourceKey strength
DYNA-Q Sutton, 1991Tabular lookup (learned)Background Q-updatesReal + simulatedConceptual clarity; DP meets model-free RL
MBPO Janner et al., 2019Neural ensemble (learned)Short rollouts (k5k \leq 5)Real + simulatedDeep continuous control; 520×5\text{–}20\times sample efficiency
AlphaGo Silver et al., 2016True simulator (given)MCTS + neural guidanceHuman games + self-playFirst superhuman Go agent
AlphaGo Zero Silver et al., 2017True simulator (given)Neural MCTS (PUCT)Self-play onlyTabula rasa; mutual improvement loop
AlphaZero Silver et al., 2018True simulator (given)Neural MCTS (PUCT)Self-play onlyChess + Shogi + Go; draw outcomes; continual update
MuZero Schrittwieser et al., 2020Learned latent (world model)Latent MCTSSelf-play + AtariNo game rules needed — next chapter

The next chapter introduces world models as a substrate for imagination-based planning in continuous environments: PlaNet Hafner et al., 2019 and Dreamer Hafner et al., 2025 learn a full recurrent latent-dynamics model from pixels, enabling long-horizon planning without Monte Carlo Tree Search — and we return to MuZero in full detail.

Further Reading

References
  1. Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. http://incompleteideas.net/book/the-book-2nd.html
  2. Moerland, T. M., Broekens, J., Plaat, A., & Jonker, C. M. (2023). Model-based Reinforcement Learning: A Survey. Foundations and Trends in Machine Learning, 16(1), 1–118. 10.1561/2200000086
  3. Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., & Hassabis, D. (2016). Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature, 529(7587), 484–489. 10.1038/nature16961
  4. Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K., & Hassabis, D. (2018). A General Reinforcement Learning Algorithm that Masters Chess, Shogi, and Go through Self-Play. Science, 362(6419), 1140–1144. 10.1126/science.aar6404
  5. Todorov, E., Erez, T., & Tassa, Y. (2012). MuJoCo: A Physics Engine for Model-Based Control. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 5026–5033. 10.1109/IROS.2012.6386109
  6. Makoviychuk, V., Wawrzyniak, L., Guo, Y., Lu, M., Storey, J., Macklin, M., Hoeller, D., Rudin, N., Allshire, A., Handa, A., & State, G. (2021). Isaac Gym: High Performance GPU-Based Physics Simulation for Robot Learning. arXiv Preprint arXiv:2108.10470. https://arxiv.org/abs/2108.10470
  7. Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., Oh, J., Horgan, D., Kroiss, M., Danihelka, I., Huang, A., Sifre, L., Cai, T., Agapiou, J. P., Jaderberg, M., … Silver, D. (2019). Grandmaster Level in StarCraft II Using Multi-Agent Reinforcement Learning. Nature, 575(7782), 350–354. 10.1038/s41586-019-1724-z
  8. Berner, C., Brockman, G., Chan, B., Cheung, V., Dębiak, P., Dennison, C., Farhi, D., Fischer, Q., Hashme, S., Hesse, C., Józefowicz, R., Gray, S., Olsson, C., Pachocki, J., Petrov, M., de Oliveira Pinto, H. P., Raiman, J., Salimans, T., Schlatter, J., … Zhang, S. (2019). Dota 2 with Large Scale Deep Reinforcement Learning. arXiv Preprint arXiv:1912.06680.
  9. Akkaya, I., Andrychowicz, M., Chociej, M., Litwin, M., McGrew, B., Petron, A., Paino, A., Plappert, M., Powell, G., Ribas, R., Schneider, J., Tezak, N., Tworek, J., Welinder, P., Weng, L., Yuan, Q., Zaremba, W., & Zhang, L. (2019). Solving Rubik’s Cube with a Robot Hand. arXiv Preprint arXiv:1910.07113.
  10. Sutton, R. S. (1991). Dyna, an Integrated Architecture for Learning, Planning, and Reacting. ACM SIGART Bulletin, 2(4), 160–163. 10.1145/122344.122377
  11. Deisenroth, M. P., & Rasmussen, C. E. (2011). PILCO: A Model-Based and Data-Efficient Approach to Policy Search. Proceedings of the 28th International Conference on Machine Learning (ICML), 465–472.
  12. Chua, K., Calandra, R., McAllister, R., & Levine, S. (2018). Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models. Advances in Neural Information Processing Systems, 31.
  13. Janner, M., Fu, J., Zhang, M., & Levine, S. (2019). When to Trust Your Model: Model-Based Policy Optimization. Advances in Neural Information Processing Systems, 32.
  14. Ha, D., & Schmidhuber, J. (2018). Recurrent World Models Facilitate Policy Evolution. Advances in Neural Information Processing Systems, 31.
  15. Hafner, D., Pasukonis, J., Ba, J., & Lillicrap, T. (2025). Mastering diverse control tasks through world models. Nature, 640(8059), 647–653. 10.1038/s41586-025-08744-2