reinforcement-learning.com

AlphaZero and MuZero: Planning by Self-Play

How AlphaZero masters chess, shogi and Go from self-play, and how MuZero plans with a learned model — the math, MCTS, training, and real-world uses, explained.

Updated 2026-06-07 · 17 min read
Key takeaways
  • AlphaZero masters chess, shogi and Go from scratch using only self-play, a single deep network, and Monte Carlo Tree Search — no human games, no handcrafted evaluation.
  • Its engine is a loop: MCTS uses the network to plan a stronger move than the raw policy, the game outcome trains the network, and the better network makes the next search stronger.
  • MuZero removes the last piece of given knowledge — the rules — by learning a model that predicts only what matters for planning: reward, value, and policy, not pixels.
  • These ideas now run far beyond games: video compression at YouTube, faster matrix multiplication (AlphaTensor), and sorting algorithms in the C++ standard library (AlphaDev).

What are AlphaZero and MuZero?

AlphaZero and MuZero are DeepMind’s landmark systems for learning to plan. AlphaZero (2017) reaches superhuman play in chess, shogi and Go starting from random play, with no human games, no opening books, and no handcrafted position-scoring. It needs only the rules of the game. MuZero (2019) goes one step further: it throws away even the rules, learning a model of the environment and planning inside that learned model. The same algorithm that masters board games then masters the 57 Atari video games, where the rules are unknown and the only input is raw pixels.

Both belong to the model-based RL family: they don’t just react, they imagine ahead. The shared engine is Monte Carlo Tree Search (MCTS) guided by a neural network, trained by self-play — a pure loop where the system is its own opponent and its own teacher.

[Figure: loop diagram. Neural network (policy p, value v) → MCTS planning (search policy π) → self-play game (outcome z) → training on (π, z) → gradient update to the network; π and z are stored from each game.]
The self-play improvement loop shared by AlphaZero and MuZero. MCTS turns the network's raw policy into a stronger 'search policy'; the game outcome supplies a value target; training on both produces a stronger network, which makes the next search stronger still.

The lineage: from AlphaGo to MuZero

These systems are a four-step march toward removing human-supplied knowledge. Each version keeps less and learns more.

AlphaGo (2016)

Beat world champion Lee Sedol at Go. Bootstrapped from a database of human expert games, used two networks plus MCTS, and a handcrafted rollout policy. Proof that deep nets + search could crack Go.

AlphaGo Zero (2017)

Dropped the human games entirely — pure self-play from random play, one network, no rollouts. Surpassed the version that beat Lee Sedol after a few days of training.

AlphaZero (2017)

Generalized AlphaGo Zero into one algorithm for chess, shogi and Go. Given only the rules, it reached superhuman play in each, beating the strongest existing programs.

MuZero (2019)

Removed the rules too. It learns a model of the environment and plans inside it — matching AlphaZero on board games while mastering 57 Atari games from pixels.

  • 0: human games used by AlphaZero (pure self-play)
  • ~4h: self-play for AlphaZero to outplay Stockfish at chess
  • 57: Atari games MuZero mastered with a learned model

How AlphaZero works

AlphaZero couples one network with one search procedure, trained in a self-improving loop.

One network, two heads

A single deep residual network $f_\theta$ takes a board position $s$ and outputs two things:

$$(\mathbf{p},\, v) = f_\theta(s)$$

  • a policy vector $\mathbf{p}$: a prior probability over legal moves (“which moves look promising?”);
  • a scalar value $v \in [-1, 1]$: the predicted game outcome from this position (“who is winning?”).

This replaces the two separate networks of the original AlphaGo, and the handcrafted evaluation function of classical engines, with one learned function.
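As a rough illustration of the two outputs, the sketch below turns raw scores into $(\mathbf{p}, v)$. It assumes a hypothetical network body (not shown) that emits one raw score per move plus one raw scalar; the function name `two_heads`, the legal-move mask and the example numbers are illustrative choices, not the published architecture.

```python
import math

def two_heads(policy_logits, value_logit, legal):
    """Turn raw network outputs into (p, v).

    policy_logits: one raw score per move (hypothetical network output)
    value_logit:   one raw scalar score (hypothetical network output)
    legal:         list of booleans, True where the move is legal
    """
    # Softmax restricted to legal moves; illegal moves get probability 0.
    m = max(x for x, ok in zip(policy_logits, legal) if ok)
    exps = [math.exp(x - m) if ok else 0.0
            for x, ok in zip(policy_logits, legal)]
    total = sum(exps)
    p = [e / total for e in exps]
    # tanh squashes the value into [-1, 1].
    v = math.tanh(value_logit)
    return p, v

# Four candidate moves, the third one illegal in this position.
p, v = two_heads([2.0, 1.0, 0.5, -1.0], 0.3, [True, True, False, True])
print(p, v)
```

The policy sums to 1 over the legal moves only, and the value always lands in $[-1, 1]$, matching the ranges described above.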

MCTS: turning a guess into a plan

Raw network output is just intuition. AlphaZero runs Monte Carlo Tree Search to refine it, building a search tree where each edge $(s,a)$ stores a visit count $N$, a mean value $Q$, and the network’s prior $P$. Each of the (typically 800) simulations does three things:

1. Select

Walk down the tree from the root, at each node picking the action that maximizes a PUCT score, i.e. value plus an exploration bonus that favors high-prior, low-visit moves:

$$a^* = \arg\max_a \Big[\, Q(s,a) + c_{\text{puct}}\, P(s,a)\, \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)} \,\Big]$$

The $Q$ term exploits what looks good; the second term explores moves the prior likes but the search hasn’t tried much, a structured take on exploration vs exploitation.

2. Expand and evaluate

On reaching a leaf, apply the game rules to get the new position, then call the network $f_\theta$ once to get its prior $\mathbf{p}$ and value $v$. Add the leaf to the tree with these priors. There are no random rollouts: the value head replaces the Monte Carlo simulations of classical MCTS.

3. Backup

Propagate the leaf value $v$ back up the visited path, updating each edge’s running mean $Q$ and incrementing its visit count $N$. Positions that lead to good outcomes accumulate value and get searched more.
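The select and backup steps can be sketched in a few lines. This is a minimal, single-perspective sketch: the `Edge` class, the value `c_puct=1.5`, and treating unvisited edges as $Q = 0$ are assumptions made for illustration, and a real two-player implementation would also flip the sign of the value between plies.

```python
import math

class Edge:
    """Statistics stored on one edge (s, a) of the search tree."""
    def __init__(self, prior):
        self.P = prior   # network prior P(s, a)
        self.N = 0       # visit count N(s, a)
        self.W = 0.0     # sum of backed-up values

    @property
    def Q(self):
        # Running mean value; 0 for unvisited edges (assumption of this sketch).
        return self.W / self.N if self.N else 0.0

def select(edges, c_puct=1.5):
    """Pick the action maximizing Q + c_puct * P * sqrt(sum_b N_b) / (1 + N)."""
    sqrt_total = math.sqrt(sum(e.N for e in edges.values()))
    def score(a):
        e = edges[a]
        return e.Q + c_puct * e.P * sqrt_total / (1 + e.N)
    return max(edges, key=score)

def backup(path, v):
    """Add the leaf value v to every edge on the visited path."""
    for e in path:
        e.N += 1
        e.W += v

edges = {"a": Edge(0.6), "b": Edge(0.3), "c": Edge(0.1)}
backup([edges["a"]], 0.2)    # first visit to "a"
backup([edges["a"]], -0.4)   # second visit: Q(a) drops to -0.1
chosen = select(edges)
print(chosen)
```

In this toy state, move "a" has the highest prior but a disappointing mean value after two visits, so the exploration bonus steers the next simulation to the untried move "b".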

After all simulations, the improved policy $\boldsymbol{\pi}$ is read off from the root visit counts:

$$\pi(a \mid s) = \frac{N(s,a)^{1/\tau}}{\sum_b N(s,b)^{1/\tau}}$$

where $\tau$ is a temperature: $\tau = 1$ makes $\boldsymbol{\pi}$ proportional to how often each move was explored, and smaller $\tau$ concentrates it on the most-visited move. This $\boldsymbol{\pi}$ is almost always stronger than the raw policy head $\mathbf{p}$, because search corrected the network’s mistakes. MCTS acts as a policy improvement operator.
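A small worked example of the visit-count formula, assuming made-up root counts that sum to 800 simulations. The function name `search_policy` is illustrative.

```python
def search_policy(visit_counts, tau=1.0):
    """Return pi with pi(a) proportional to N(a) ** (1 / tau)."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    powered = [n ** (1.0 / tau) for n in visit_counts]
    total = sum(powered)
    return [x / total for x in powered]

counts = [500, 250, 50]                      # made-up root visit counts
pi_1 = search_policy(counts, tau=1.0)        # proportional to the counts
pi_sharp = search_policy(counts, tau=0.25)   # much closer to greedy
print(pi_1, pi_sharp)
```

With $\tau = 1$ the policy is simply the normalized counts; lowering $\tau$ pushes almost all probability onto the most-visited move.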

Self-play and training

AlphaZero plays games against itself using MCTS for every move. Each position $s_t$ yields a training tuple $(s_t, \boldsymbol{\pi}_t, z)$, where $z$ is the final game result ($+1$ win, $-1$ loss, $0$ draw). The network is trained to make its two heads match the search:

$$\ell = (z - v)^2 \;-\; \boldsymbol{\pi}^\top \log \mathbf{p} \;+\; c\,\lVert\theta\rVert^2$$

The value head learns to predict outcomes; the policy head learns to imitate the (stronger) search policy; weight decay regularizes. A stronger network then makes the next round of MCTS stronger — the policy-iteration flywheel that drives everything.
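The loss for a single training position can be written out directly. This is a sketch on plain Python lists; the function name, the toy numbers and the default `c=1e-4` are assumptions for illustration, and a real implementation would compute the same three terms on batched tensors.

```python
import math

def alphazero_loss(z, v, pi, p, theta, c=1e-4):
    """Squared value error + cross-entropy to the search policy + L2 penalty."""
    value_loss = (z - v) ** 2
    # Cross-entropy -pi^T log p; terms with pi(a) = 0 contribute nothing.
    policy_loss = -sum(t * math.log(q) for t, q in zip(pi, p) if t > 0)
    l2 = c * sum(w * w for w in theta)
    return value_loss + policy_loss + l2

pi = [0.7, 0.2, 0.1]       # search policy from MCTS (toy values)
theta = [0.1, -0.2]        # stand-in for the network weights
loss_off = alphazero_loss(1.0, 0.5, pi, [0.5, 0.3, 0.2], theta)
loss_matched = alphazero_loss(1.0, 0.5, pi, pi, theta)
print(loss_off, loss_matched)
```

The policy term is smallest when the policy head reproduces the search policy exactly, which is why `loss_matched` comes out lower than `loss_off`.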

Go deeper: why AlphaZero plays so differently from Stockfish

Classical engines like Stockfish search enormous trees (tens of millions of positions per second) guided by a handcrafted, human-tuned evaluation. AlphaZero searches far fewer positions (tens of thousands per second), but each is evaluated by a deep network that has learned what good positions feel like. The result is a famously “human-like but alien” style, with long-term positional sacrifices and piece activity valued over material, because nothing in its evaluation was hand-specified by human chess theory. To inject opening variety it adds Dirichlet noise to the root prior during self-play, ensuring it explores moves a deterministic search would ignore.
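The root-noise step mentioned above can be sketched with the standard library alone. The mixing weight `eps=0.25` and concentration `alpha=0.3` are treated here as assumed defaults (the appropriate concentration depends on the game), and `add_root_noise` is a hypothetical helper name.

```python
import random

def add_root_noise(priors, alpha=0.3, eps=0.25, rng=random):
    """Mix the root prior with Dirichlet(alpha) noise: (1 - eps) * p + eps * noise."""
    # A Dirichlet sample is a vector of Gamma(alpha, 1) draws, normalized.
    g = [rng.gammavariate(alpha, 1.0) for _ in priors]
    total = sum(g)
    if total == 0.0:
        return list(priors)  # degenerate draw; leave the prior unchanged
    noise = [x / total for x in g]
    return [(1 - eps) * p + eps * n for p, n in zip(priors, noise)]

rng = random.Random(0)            # seeded only to make the example repeatable
priors = [0.7, 0.2, 0.1]
noisy = add_root_noise(priors, rng=rng)
print(noisy)
```

Because the noise is itself a probability vector, the mixed prior still sums to 1, and every move keeps at least `(1 - eps)` of its original prior.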

How MuZero adds a learned model

AlphaZero still needs a perfect simulator: to expand a node, it applies the real game rules. That is fine for chess but impossible for Atari, robotics, or the real world, where you don’t have the rules. MuZero’s breakthrough is to learn a model and search inside it instead of the real environment.

Three functions, one latent space

MuZero never tries to predict the next screen of pixels. It learns three networks that operate on an abstract hidden state $s^k$:

1. Representation h

Encode the real observations (the recent frames / board) into an initial hidden state:

$$s^0 = h_\theta(o_1, \dots, o_t)$$

This $s^0$ is not a reconstruction of the board; it’s whatever internal representation is useful for planning.

2. Dynamics g

Given a hidden state and an action, predict the next hidden state and the immediate reward. This learned transition model lets search roll forward without the real environment:

$$(s^{k+1},\, r^{k+1}) = g_\theta(s^k, a^{k+1})$$

3. Prediction f

From any hidden state, predict the policy and value, exactly like AlphaZero’s two heads:

$$(\mathbf{p}^k,\, v^k) = f_\theta(s^k)$$

MCTS now runs entirely in this learned latent space: at the root it encodes the real observation with $h$, then expands the tree using $g$ to imagine successor states and $f$ to evaluate them. The planning loop (select → expand → backup) is identical to AlphaZero’s; only the “simulator” changed from given rules to a learned model.
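To show how the three functions fit together, here is a toy unroll along one fixed action sequence. The bodies of `h`, `g` and `f` are arbitrary arithmetic standing in for trained networks, and the two-number hidden state is purely illustrative; only the calling pattern reflects the description above.

```python
import math

def h(observations):
    """Representation: encode observations into a hidden state (toy encoding)."""
    return (sum(observations) / len(observations), observations[-1])

def g(state, action):
    """Dynamics: next hidden state and predicted reward (toy arithmetic)."""
    a, b = state
    nxt = (0.9 * a + 0.1 * action, 0.5 * b + action)
    reward = math.tanh(nxt[0] - nxt[1])
    return nxt, reward

def f(state):
    """Prediction: policy over two actions and a value in [-1, 1]."""
    a, b = state
    e0, e1 = math.exp(-a), math.exp(a)
    return [e0 / (e0 + e1), e1 / (e0 + e1)], math.tanh(a + b)

def unroll(observations, actions):
    """Encode once with h, then step only through g and f."""
    s = h(observations)
    steps = [(f(s), None)]          # root: no reward predicted yet
    for a in actions:
        s, r = g(s, a)              # the real environment is never queried
        steps.append((f(s), r))
    return steps

steps = unroll([0.1, 0.2, 0.3], actions=[0, 1])
for (policy, value), reward in steps:
    print(policy, value, reward)
```

The real observations are used exactly once, at the root; every later state, reward, policy and value comes from the learned functions.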

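[Figure: observations o₁…oₜ pass through h to give hidden state s⁰; two applications of g produce rewards r¹ and r²; f outputs (p⁰, v⁰), (p¹, v¹) and (p², v²).]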
MuZero unrolls a learned model. The representation h encodes observations into a hidden state s⁰; the dynamics g imagines next states and rewards; the prediction f outputs policy and value at every step. The model is trained only to predict reward, value and policy — never the raw observation.

Value equivalence: the key idea

Why doesn’t predicting the wrong pixels hurt MuZero? Because it is never asked to. The three networks are trained end-to-end so that, when unrolled $K$ steps, their predicted policy, value, and reward match the targets observed in real play:

$$\ell_t = \sum_{k=0}^{K}\Big[ \ell^p(\pi_{t+k},\,\mathbf{p}^k) + \ell^v(z_{t+k},\,v^k) + \ell^r(u_{t+k},\,r^k) \Big]$$

where $\boldsymbol{\pi}$ are MCTS policies, $z$ are bootstrapped value targets, and $u$ are real rewards. Nothing forces the hidden state to resemble the true environment state. The model only has to be value-equivalent, that is, accurate about the quantities planning consumes. This frees it to ignore irrelevant detail (the exact texture of an Atari background) and spend capacity on what changes the optimal decision. As MuZero’s first author Julian Schrittwieser puts it, the model “focuses only on important aspects of the environment.”
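One common way to build a bootstrapped value target is an n-step return: sum the next few real rewards, then add the discounted search value at the step where the sum stops. The sketch below assumes that form; the indexing convention, the function name and the defaults `n=10` and `gamma=0.997` are assumptions for illustration.

```python
def n_step_target(rewards, search_values, t, n=10, gamma=0.997):
    """Bootstrapped value target z_t.

    rewards[i]       reward observed after acting at step i
    search_values[i] MCTS root value recorded at step i
    """
    z = 0.0
    for i in range(n):
        if t + i >= len(rewards):
            break                    # episode ended before n steps
        z += (gamma ** i) * rewards[t + i]
    if t + n < len(search_values):
        # Bootstrap from the stored search value n steps ahead.
        z += (gamma ** n) * search_values[t + n]
    return z

rewards = [1.0, 0.0, 0.0, 1.0]
search_values = [0.5, 0.4, 0.3, 0.2]
z0 = n_step_target(rewards, search_values, t=0, n=2, gamma=0.5)
z3 = n_step_target(rewards, search_values, t=3, n=2, gamma=0.5)
print(z0, z3)
```

For `t=0` the target is `1.0 + 0.5 * 0.0 + 0.25 * 0.3`; for `t=3` the episode ends after one reward, so there is nothing to bootstrap from.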

Reanalyse: squeezing more from old games

MuZero gets more out of the data it already has. Reanalyse re-runs MCTS on already-played trajectories using the latest network, producing fresh, stronger policy and value targets without playing new games. This is what makes a sample-efficient version of MuZero practical for data-limited and offline RL settings: you keep learning from the same logs as your network improves.
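The idea can be sketched as a pass over stored trajectories that refreshes only the targets. Everything here is hypothetical scaffolding: the dictionary layout, the `reanalyse` function name, and `run_search`, which stands in for running MCTS with the current network.

```python
def reanalyse(trajectory, run_search):
    """Return a copy of the trajectory with fresh policy and value targets.

    trajectory: list of dicts with keys "obs", "action", "reward", "pi", "value"
    run_search: hypothetical callable obs -> (pi, value), i.e. MCTS with
                the latest network
    """
    refreshed = []
    for step in trajectory:
        pi, value = run_search(step["obs"])
        # Observations, actions and rewards stay as recorded;
        # only the training targets are replaced.
        refreshed.append({**step, "pi": pi, "value": value})
    return refreshed

def stub_search(obs):
    """Placeholder for a real search; returns fixed targets."""
    return [0.9, 0.1], 0.3

old = [
    {"obs": 0, "action": 1, "reward": 0.0, "pi": [0.5, 0.5], "value": 0.0},
    {"obs": 1, "action": 0, "reward": 1.0, "pi": [0.5, 0.5], "value": 0.0},
]
new = reanalyse(old, stub_search)
print(new[0]["pi"], new[0]["value"])
```

No new environment steps are taken: the logged experience is reused and only what the network is asked to predict changes.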

AlphaZero vs MuZero at a glance

| | AlphaZero | MuZero |
|---|---|---|
| Needs the rules? | Yes: uses a perfect simulator | No: learns its own model |
| Model used for search | The true game rules | Learned dynamics function $g$ |
| Networks | One (policy + value heads) | Three ($h$, $g$, $f$) |
| Search | MCTS over real states | MCTS over hidden states |
| Domains | Chess, shogi, Go (known dynamics) | The same, plus 57 Atari games (unknown dynamics) |
| Predicts observations? | n/a (has the simulator) | No: only reward, value, policy |
| Training signal | Game outcome + search policy | Reward + bootstrapped value + search policy |

The headline: MuZero matches AlphaZero on board games while extending to environments where you can’t write down the rules — the gap between a game engine and the real world.

Where these ideas are used

The same plan-with-a-model recipe has crossed from games into systems and science.

Video compression at YouTube

MuZero learns a rate-control policy for the VP9 codec, picking quantization parameters per frame. DeepMind reported an average ~4–6% bitrate reduction at equal quality on a large YouTube video set — a planning problem with no clean simulator.

AlphaTensor — matrix multiplication

Cast as a single-player game of decomposing tensors, AlphaZero-style search discovered a faster algorithm for 4×4 matrix multiplication (47 scalar multiplications instead of 49, in arithmetic modulo 2). It was the first improvement on Strassen’s two-level method in over 50 years.
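To make “fewer multiplications” concrete, the sketch below implements Strassen’s classic 2×2 scheme, which uses 7 scalar multiplications where the schoolbook method uses 8. This is Strassen’s algorithm, shown only as the kind of baseline such a search competes with; it is not the algorithm AlphaTensor found.

```python
def strassen_2x2(A, B):
    """Multiply two 2x2 matrices with 7 scalar multiplications (Strassen)."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4, m1 - m2 + m3 + m6]]

def naive_2x2(A, B):
    """Schoolbook product: 8 scalar multiplications."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(strassen_2x2(A, B))
print(naive_2x2(A, B))
```

Applying the 2×2 scheme recursively to a 4×4 matrix split into 2×2 blocks costs 7 × 7 = 49 multiplications, which is the two-level figure mentioned above.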

AlphaDev — sorting algorithms

AlphaDev treated assembly-code generation as a game and discovered shorter sorting routines for small inputs, now merged into LLVM’s libc++ standard library, which is used by billions of programs.

Stochastic & Sampled MuZero

Extensions handle chance (Stochastic MuZero matched the state of the art on 2048 and backgammon) and huge action spaces (Sampled MuZero plans over sampled actions for continuous control).

Strengths and limitations

  • Strength — superhuman planning from zero priors. No human data, no domain heuristics; given only objective and dynamics (or learned dynamics), the system discovers its own strategy.
  • Strength — search amplifies a fixed network. Spend more MCTS simulations at decision time and play gets stronger, trading compute for skill without retraining.
  • Limitation — compute-hungry. AlphaZero trained on thousands of TPUs; MCTS at every move is expensive at inference. This is the bitter lesson in action — search and learning scale, but they cost.
  • Limitation — discrete, well-defined action spaces. Vanilla MCTS assumes a manageable set of discrete actions and (for AlphaZero) deterministic dynamics; continuous control and stochasticity need the Sampled/Stochastic variants.
  • Limitation — opaque learned model. MuZero’s value-equivalent state is not a faithful world model you can inspect or transfer, which complicates debugging and safety analysis.

A short history

  • 2016: AlphaGo beats Lee Sedol. Deep nets + MCTS, bootstrapped from human games, win 4–1 against a top Go professional.
  • 2017: AlphaGo Zero. Pure self-play from random play, one network, no human data, and stronger than the Lee Sedol version.
  • 2017: AlphaZero. One algorithm for chess, shogi and Go; superhuman in each within hours, beating the best prior programs.
  • 2019: MuZero. Learns a model and plans inside it; matches AlphaZero on board games and masters 57 Atari games from pixels.
  • 2021–22: Real-world MuZero. VP9 video compression at YouTube; Sampled and Stochastic MuZero extend to large/continuous actions and chance.
  • 2022–23: AlphaTensor & AlphaDev. Search discovers faster matrix-multiplication and sorting algorithms, the latter shipped in LLVM’s libc++.

Researcher takes

Brown explains the theoretical reason AlphaZero-style self-play succeeds: in two-player zero-sum games it provably converges to a minimax equilibrium, which is exactly the right objective there. He also explains why that guarantee evaporates outside such games.

A pointed argument on the limits of self-play: the zero-sum structure of board games is precisely what makes AlphaZero-style self-play tractable, and its absence explains why the approach has not yet transferred to LLMs.

Frequently asked questions

What is the core difference between AlphaZero and MuZero?

AlphaZero is given the rules of the game and uses them as a perfect simulator during search. MuZero is not given the rules — it learns a model (representation, dynamics, prediction) and runs the same Monte Carlo Tree Search inside that learned model. That lets MuZero work where dynamics are unknown, like Atari from pixels.

Why doesn’t MuZero learn to predict the next frame or board?

Because it only needs information useful for planning. MuZero is trained so its unrolled model predicts reward, value, and policy accurately — a property called value equivalence. Its hidden state can ignore visually obvious but decision-irrelevant detail, which makes the model both cheaper and more focused than a full next-observation predictor.

Is MCTS in AlphaZero the same as classical Monte Carlo Tree Search?

It shares the select–expand–backup structure but replaces the random “rollout” with a neural value head, and guides selection with the network’s policy prior via the PUCT formula. So instead of playing thousands of random games to evaluate a leaf, AlphaZero asks the network once — far more accurate and sample-efficient.

How do AlphaZero and MuZero relate to RL for LLMs?

They’re the purest form of model-based RL with a verifiable environment reward (win/lose), which is why some researchers contrast them with RLHF’s learned reward. The “search to improve a policy, then train on the improved policy” idea also echoes in RL for reasoning, where test-time search and self-generated data play similar roles.
