- AlphaZero masters chess, shogi and Go from scratch using only self-play, a single deep network, and Monte Carlo Tree Search — no human games, no handcrafted evaluation.
- Its engine is a loop: MCTS uses the network to plan a stronger move than the raw policy, the game outcome trains the network, and the better network makes the next search stronger.
- MuZero removes the last piece of given knowledge — the rules — by learning a model that predicts only what matters for planning: reward, value, and policy, not pixels.
- These ideas now run far beyond games: video compression at YouTube, faster matrix multiplication (AlphaTensor), and sorting algorithms in the C++ standard library (AlphaDev).
What are AlphaZero and MuZero?
AlphaZero and MuZero are DeepMind’s landmark systems for learning to plan. AlphaZero (2017) reaches superhuman play in chess, shogi and Go starting from random play, with no human games, no opening books, and no handcrafted position scoring. It needs only the rules of the game. MuZero (2019) goes one step further: it throws away even the rules, learning a model of the environment and planning inside that learned model. The same algorithm that masters board games also masters the 57 Atari video games, where the rules are unknown and the only input is raw pixels.
Both belong to the model-based RL family: they don’t just react, they imagine ahead. The shared engine is Monte Carlo Tree Search (MCTS) guided by a neural network, trained by self-play — a pure loop where the system is its own opponent and its own teacher.
The lineage: from AlphaGo to MuZero
These systems are a four-step march toward removing human-supplied knowledge. Each version keeps less and learns more.
- AlphaGo (2016). Beat world champion Lee Sedol at Go. It bootstrapped from a database of human expert games and combined a policy network, a value network, and MCTS with a fast rollout policy built on handcrafted features. Proof that deep nets + search could crack Go.
- AlphaGo Zero (2017). Dropped the human games entirely: pure self-play from random play, one network, no rollouts. Surpassed the version that beat Lee Sedol after a few days of training.
- AlphaZero (2017). Generalized AlphaGo Zero into one algorithm for chess, shogi and Go. Given only the rules, it reached superhuman play in each, beating the strongest existing programs.
- MuZero (2019). Removed the rules too. It learns a model of the environment and plans inside it, matching AlphaZero on board games while mastering 57 Atari games from pixels.
How AlphaZero works
AlphaZero couples one network with one search procedure, trained in a self-improving loop.
One network, two heads
A single deep residual network takes a board position and outputs two things:
- a policy vector — a prior probability over legal moves (“which moves look promising?”);
- a scalar value — the predicted game outcome from this position (“who is winning?”).
This single learned function replaces both the separate policy and value networks of the original AlphaGo and the handcrafted evaluation function of classical engines.
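As a minimal sketch of what the two heads produce, the snippet below turns raw head outputs into a prior over legal moves and a scalar value. The function name, the move labels, and the example logits are illustrative assumptions, and the residual network that would produce the logits is not shown.

```python
import math

def policy_value_heads(policy_logits, value_logit, legal_moves):
    """Map raw head outputs to (prior over legal moves, value in [-1, 1])."""
    # Softmax over legal moves only; subtracting the max keeps exp() stable.
    top = max(policy_logits[m] for m in legal_moves)
    exps = {m: math.exp(policy_logits[m] - top) for m in legal_moves}
    total = sum(exps.values())
    prior = {m: e / total for m, e in exps.items()}
    # tanh squashes the value estimate into [-1, 1] (loss to win).
    value = math.tanh(value_logit)
    return prior, value

# Hypothetical logits: "a9" is not a legal move, so it gets no probability.
logits = {"e4": 2.0, "d4": 1.0, "Ke2": -1.0, "a9": 5.0}
prior, value = policy_value_heads(logits, 0.3, legal_moves=["e4", "d4", "Ke2"])
```

Masking before the softmax means an illegal move can never receive prior mass, no matter how large its logit is.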
MCTS: turning a guess into a plan
Raw network output is just intuition. AlphaZero runs Monte Carlo Tree Search to refine it, building a search tree where each edge $(s, a)$ stores a visit count $N(s,a)$, a mean value $Q(s,a)$, and the network’s prior $P(s,a)$. Each of the (typically 800) simulations does three things:
Select. Walk down the tree from the root, at each node picking the action that maximizes a PUCT score: the mean value plus an exploration bonus that favors high-prior, low-visit moves:
$$a^* = \arg\max_a \left[ Q(s,a) + c_{\text{puct}}\, P(s,a)\, \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)} \right]$$
The $Q$ term exploits what looks good; the second term explores moves the prior likes but the search hasn’t tried much, a structured take on exploration vs exploitation.
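The selection rule can be sketched in a few lines. Here each edge is a plain dict holding its visit count `N`, mean value `Q` and prior `P`; the dict layout, the function name, and the default `c_puct=1.5` are assumptions for illustration, not values taken from the paper.

```python
import math

def puct_select(edges, c_puct=1.5):
    """Pick the action with the highest Q + exploration bonus.

    edges: dict mapping action -> {"N": visits, "Q": mean value, "P": prior}.
    """
    total_visits = sum(e["N"] for e in edges.values())

    def score(e):
        # Bonus grows with the prior and shrinks as the edge gets visited.
        bonus = c_puct * e["P"] * math.sqrt(total_visits) / (1 + e["N"])
        return e["Q"] + bonus

    return max(edges, key=lambda a: score(edges[a]))

edges = {
    "a": {"N": 10, "Q": 0.5, "P": 0.2},  # well explored, looks good
    "b": {"N": 0, "Q": 0.0, "P": 0.6},   # unexplored, strong prior
    "c": {"N": 2, "Q": 0.1, "P": 0.2},
}
chosen = puct_select(edges)
```

With these numbers the unvisited, high-prior move `"b"` is selected. Setting `c_puct=0` removes the bonus and falls back to the greedy choice `"a"`.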
Expand and evaluate. On reaching a leaf, apply the game rules to get the new position, then call the network once to get its prior $p$ and value $v$. Add the leaf to the tree with these priors. There are no random rollouts: the value head replaces the Monte Carlo simulations of classical MCTS.
Backup. Propagate the leaf value $v$ back up the visited path, updating each edge’s running mean $Q$ and incrementing its visit count $N$. Positions that lead to good outcomes accumulate value and get searched more.
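A sketch of the backup step follows, using plain dicts with `N` and `Q` for the edges. It assumes one common sign convention for two-player zero-sum games: the leaf value is from the viewpoint of the player to move at the leaf, so it is negated at every ply on the way up. The function name and data layout are illustrative.

```python
def backup(path, leaf_value):
    """Update running means along the visited path.

    path: list of edge dicts from root to leaf, each with "N" and "Q".
    leaf_value: network value for the player to move at the leaf (assumption).
    """
    value = leaf_value
    for edge in reversed(path):
        # The player who chose this edge is the opponent of the player
        # whose viewpoint `value` currently reflects, so flip the sign.
        value = -value
        edge["N"] += 1
        # Incremental mean: Q <- Q + (value - Q) / N
        edge["Q"] += (value - edge["Q"]) / edge["N"]

path = [{"N": 0, "Q": 0.0}, {"N": 0, "Q": 0.0}]
backup(path, leaf_value=1.0)   # leaf is winning for the player to move there
backup(path, leaf_value=0.0)   # a second simulation evaluates to a draw
```

After the two simulations both edges have been visited twice, and their mean values have opposite signs because they belong to opposite players.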
After all simulations, the improved policy is read off from the root visit counts, proportional to how often each move was explored:
$$\pi(a \mid s_0) = \frac{N(s_0,a)^{1/\tau}}{\sum_b N(s_0,b)^{1/\tau}}$$
where $\tau$ is a temperature. This is almost always stronger than the raw policy head $p$, because search corrected the network’s mistakes. MCTS acts as a policy improvement operator.
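The conversion from visit counts to a move distribution is short enough to write out. This is a sketch under the assumption that counts are stored in a dict keyed by move; the function name and the example counts are made up for illustration.

```python
def visit_count_policy(visit_counts, temperature=1.0):
    """Turn root visit counts into a probability distribution over moves."""
    if temperature == 0:
        # Limit of temperature -> 0: all probability on the most-visited move.
        best = max(visit_counts, key=visit_counts.get)
        return {a: 1.0 if a == best else 0.0 for a in visit_counts}
    scaled = {a: n ** (1.0 / temperature) for a, n in visit_counts.items()}
    total = sum(scaled.values())
    return {a: x / total for a, x in scaled.items()}

counts = {"e4": 600, "d4": 150, "c4": 50}  # 800 simulations in total
exploratory = visit_count_policy(counts, temperature=1.0)  # e4 gets 0.75
sharper = visit_count_policy(counts, temperature=0.5)      # e4 gets more
greedy = visit_count_policy(counts, temperature=0)         # e4 gets 1.0
```

A temperature of 1 keeps the distribution proportional to the counts, which is useful for varied self-play; lowering it concentrates probability on the most-visited move.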
Self-play and training
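AlphaZero plays games against itself using MCTS for every move. Each position yields a training tuple $(s, \pi, z)$, where $s$ is the position, $\pi$ is the MCTS policy, and $z$ is the final game result ($+1$ win, $-1$ loss, $0$ draw). The network is trained to make its two heads match the search:
$$l = (z - v)^2 - \pi^{\top} \log p + c\,\lVert\theta\rVert^2$$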
The value head learns to predict outcomes; the policy head learns to imitate the (stronger) search policy; weight decay regularizes. A stronger network then makes the next round of MCTS stronger — the policy-iteration flywheel that drives everything.
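For a single training position, the three parts of the objective (squared value error, policy cross-entropy against the search policy, and an L2 penalty on the weights) can be sketched as below. The function name, the `eps` guard, and the default `c=1e-4` are assumptions for the example; a real implementation would compute this over batches inside an autodiff framework.

```python
import math

def alphazero_loss(p, v, pi, z, weights=(), c=1e-4, eps=1e-12):
    """Loss for one position.

    p: network policy (dict move -> prob), v: network value,
    pi: MCTS policy target (dict move -> prob), z: game outcome in {-1, 0, +1}.
    """
    value_loss = (z - v) ** 2
    # Cross-entropy between the search policy and the network policy.
    policy_loss = -sum(t * math.log(p[a] + eps) for a, t in pi.items() if t > 0)
    # Weight decay term; `weights` is a flat iterable of parameters.
    l2 = c * sum(w * w for w in weights)
    return value_loss + policy_loss + l2

pi = {"e4": 0.75, "d4": 0.25}
close = alphazero_loss(p={"e4": 0.7, "d4": 0.3}, v=0.8, pi=pi, z=1)
far = alphazero_loss(p={"e4": 0.1, "d4": 0.9}, v=-0.5, pi=pi, z=1)
```

The loss is smaller when the policy head agrees with the search and the value head agrees with the outcome, so `close` is lower than `far`.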
Go deeper: why AlphaZero plays so differently from Stockfish
Classical engines, like the Stockfish version AlphaZero played against, search enormous trees (tens of millions of positions per second) guided by a handcrafted, human-tuned evaluation. AlphaZero searches far fewer positions (tens of thousands per second), but each is evaluated by a deep network that has learned what good positions feel like. The result is a famously “human-like but alien” style, with long-term positional sacrifices and piece activity valued over material, because nothing in its evaluation was hand-specified from human chess theory. To inject opening variety, it adds Dirichlet noise to the root prior during self-play, so that it explores moves a deterministic search would ignore.
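The root-noise trick can be sketched with the standard library alone by sampling a Dirichlet vector from normalized Gamma draws. The defaults `alpha=0.3` and `epsilon=0.25` are the settings usually quoted for chess; treat them, the function name, and the uniform fallback as assumptions of this sketch.

```python
import random

def add_root_noise(prior, alpha=0.3, epsilon=0.25, rng=random):
    """Mix Dirichlet(alpha) noise into the root prior: (1 - eps) * P + eps * noise."""
    actions = list(prior)
    # A Dirichlet sample is a vector of Gamma(alpha, 1) draws, normalized.
    gammas = [rng.gammavariate(alpha, 1.0) for _ in actions]
    total = sum(gammas)
    if total == 0.0:
        # Extremely unlikely underflow; fall back to uniform noise.
        noise = [1.0 / len(actions)] * len(actions)
    else:
        noise = [g / total for g in gammas]
    return {a: (1 - epsilon) * prior[a] + epsilon * n
            for a, n in zip(actions, noise)}

prior = {"e4": 0.7, "d4": 0.2, "a3": 0.1}
noisy = add_root_noise(prior, rng=random.Random(0))
```

The result is still a probability distribution, and every move keeps at least `(1 - epsilon)` of its original prior, so the noise nudges exploration without overriding the network.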
How MuZero adds a learned model
AlphaZero still needs a perfect simulator: to expand a node, it applies the real game rules. That is fine for chess but impossible for Atari, robotics, or the real world, where you don’t have the rules. MuZero’s breakthrough is to learn a model and search inside it instead of the real environment.
Three functions, one latent space
MuZero never tries to predict the next screen of pixels. It learns three networks that operate on an abstract hidden state $s$:
Representation $h$. Encode the real observations (the recent frames / board) into an initial hidden state:
$$s^0 = h_\theta(o_1, \dots, o_t)$$
This is not a reconstruction of the board; it’s whatever internal representation is useful for planning.
Dynamics $g$. Given a hidden state and an action, predict the next hidden state and the immediate reward, a learned transition model that lets search roll forward without the real environment:
$$r^k, s^k = g_\theta(s^{k-1}, a^k)$$
Prediction $f$. From any hidden state, predict the policy and value, exactly like AlphaZero’s two heads:
$$p^k, v^k = f_\theta(s^k)$$
MCTS now runs entirely in this learned latent space: at the root it encodes the real observation with $h$, then expands the tree using $g$ to imagine successor states and $f$ to evaluate them. The planning loop (select → expand → backup) is identical to AlphaZero’s; only the “simulator” changed from given rules to a learned model.
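To make the division of labor concrete, here is a toy sketch in which three hand-written functions stand in for the learned representation, dynamics and prediction networks. Everything inside them (the one-number hidden state, the arithmetic, the discount `GAMMA`) is a made-up placeholder; only the interfaces and the way a rollout never touches the real environment mirror the description above.

```python
import math

GAMMA = 0.997  # discount factor; a hypothetical setting for this toy

def representation(observation):
    """Stand-in for h: observation -> hidden state (a 1-tuple here)."""
    return (sum(observation) / len(observation),)

def dynamics(state, action):
    """Stand-in for g: (hidden state, action) -> (reward, next hidden state)."""
    next_state = (0.9 * state[0] + 0.1 * action,)
    return 0.1 * action, next_state

def prediction(state):
    """Stand-in for f: hidden state -> (policy, value)."""
    return {-1: 0.5, +1: 0.5}, math.tanh(state[0])

def imagined_return(observation, actions, gamma=GAMMA):
    """Score an action sequence purely inside the model."""
    state = representation(observation)      # the only use of real data
    total, discount = 0.0, 1.0
    for action in actions:
        reward, state = dynamics(state, action)  # imagined step
        total += discount * reward
        discount *= gamma
    _, value = prediction(state)             # bootstrap from the last state
    return total + discount * value

estimate = imagined_return([0.2, 0.4], actions=[+1, +1, -1])
```

A full search would call `dynamics` and `prediction` once per expanded node rather than along a fixed action list, but the data flow is the same: observations enter once at the root, and everything after that is imagined.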
Value equivalence: the key idea
Why doesn’t predicting the wrong pixels hurt MuZero? Because it is never asked to. The three networks are trained end-to-end so that, when unrolled $K$ steps, their predicted policy, value, and reward match the targets observed in real play:
$$l_t(\theta) = \sum_{k=0}^{K} \left[ l^{p}(\pi_{t+k}, p^k) + l^{v}(z_{t+k}, v^k) + l^{r}(u_{t+k}, r^k) \right] + c\,\lVert\theta\rVert^2$$
where $\pi_{t+k}$ are MCTS policies, $z_{t+k}$ are bootstrapped value targets, and $u_{t+k}$ are real rewards. Nothing forces the hidden state to resemble the true environment state. The model only has to be value-equivalent, that is, accurate about the quantities planning consumes. This frees it to ignore irrelevant detail (the exact texture of an Atari background) and spend capacity on what changes the optimal decision. As MuZero’s first author Julian Schrittwieser puts it, the model “focuses only on important aspects of the environment.”
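The unrolled objective can be sketched as a sum of per-step losses over the imagined steps. This version assumes cross-entropy for the policy and squared error for value and reward, and it skips the reward term wherever no reward prediction exists (passed as `None`, for example at the root). Those choices and all names are assumptions of the sketch; the weight-decay term is omitted.

```python
import math

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between two distributions given as dicts."""
    return -sum(t * math.log(predicted[a] + eps)
                for a, t in target.items() if t > 0)

def muzero_loss(predictions, targets):
    """Sum policy, value and reward losses over the unrolled steps.

    predictions[k] = (p, v, r) from the model after k imagined steps.
    targets[k]     = (pi, z, u) observed k steps later in real play.
    """
    total = 0.0
    for (p, v, r), (pi, z, u) in zip(predictions, targets):
        total += cross_entropy(pi, p)   # policy vs MCTS policy
        total += (z - v) ** 2           # value vs bootstrapped target
        if r is not None:               # reward vs real reward
            total += (u - r) ** 2
    return total

targets = [({"L": 1.0, "R": 0.0}, 0.5, 0.0), ({"L": 0.0, "R": 1.0}, 0.4, 1.0)]
good = [({"L": 0.9, "R": 0.1}, 0.5, None), ({"L": 0.1, "R": 0.9}, 0.4, 1.0)]
bad = [({"L": 0.2, "R": 0.8}, -0.5, None), ({"L": 0.7, "R": 0.3}, 0.9, 0.0)]
loss_good, loss_bad = muzero_loss(good, targets), muzero_loss(bad, targets)
```

Note that no term compares a hidden state with an observation: the only pressure on the latent space comes from these three prediction errors.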
Reanalyse: squeezing more from old games
MuZero is greedy with data. Reanalyse re-runs MCTS on already-played trajectories using the latest network, producing fresh, stronger policy and value targets without playing new games. This is what makes a sample-efficient version of MuZero practical for data-limited and offline RL settings — you keep learning from the same logs as your network improves.
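In outline, Reanalyse is a loop over stored steps that replaces the old search targets while leaving the recorded experience untouched. In the sketch below, `run_search` is a hypothetical callable standing in for MCTS with the latest network, and the dict keys are illustrative names rather than anything from the published implementation.

```python
def reanalyse(trajectory, run_search):
    """Recompute search targets for a stored trajectory.

    trajectory: list of dicts with "observation", "action", "reward",
                "policy_target" and "search_value".
    run_search: callable observation -> (policy, root_value), assumed to
                run MCTS with the current network.
    """
    refreshed = []
    for step in trajectory:
        policy, root_value = run_search(step["observation"])
        # Observation, action and reward are real data and stay as recorded;
        # only the targets derived from search are replaced.
        refreshed.append({**step,
                          "policy_target": policy,
                          "search_value": root_value})
    return refreshed

old = [{"observation": (0, 1), "action": "L", "reward": 0.5,
        "policy_target": {"L": 0.5, "R": 0.5}, "search_value": 0.0}]
# Stub search standing in for MCTS with a newer, more confident network.
new = reanalyse(old, run_search=lambda obs: ({"L": 0.9, "R": 0.1}, 0.7))
```

Because the function returns fresh dicts, the original log stays intact and can be reanalysed again whenever the network improves.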
AlphaZero vs MuZero at a glance
| | AlphaZero | MuZero |
|---|---|---|
| Needs the rules? | Yes: uses a perfect simulator | No: learns its own model |
| Model used for search | The true game rules | Learned dynamics function |
| Networks | One (policy + value heads) | Three ($h$, $g$, $f$) |
| Search | MCTS over real states | MCTS over hidden states |
| Domains | Chess, shogi, Go (known dynamics) | Chess, shogi, Go, plus 57 Atari games (unknown dynamics) |
| Predicts observations? | n/a (has the simulator) | No: only reward, value, policy |
| Training signal | Game outcome + search policy | Reward + bootstrapped value + search policy |
The headline: MuZero matches AlphaZero on board games while extending to environments where you can’t write down the rules — the gap between a game engine and the real world.
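The “Training signal” row is where the two differ most in practice: AlphaZero’s value target is simply the final game result, while MuZero’s is bootstrapped from a few real rewards plus a later search value. A sketch of such an n-step target follows; the list layout, argument names, and example numbers are assumptions for illustration.

```python
def n_step_value_target(rewards, search_values, t, n, gamma):
    """Bootstrapped value target for step t.

    rewards[i]: real reward received after acting at step i.
    search_values[i]: MCTS root value estimated at step i.
    """
    horizon = min(t + n, len(rewards))
    # Discounted sum of up to n real rewards starting at step t.
    target = sum(gamma ** (i - t) * rewards[i] for i in range(t, horizon))
    # Bootstrap from the search value n steps ahead, if the episode
    # lasted that long; past the end the value is taken to be zero.
    if t + n < len(search_values):
        target += gamma ** n * search_values[t + n]
    return target

rewards = [1.0, 0.0, 2.0, 0.0, 1.0]
search_values = [0.5, 0.4, 0.3, 0.2, 0.1]
early = n_step_value_target(rewards, search_values, t=0, n=2, gamma=0.5)
late = n_step_value_target(rewards, search_values, t=3, n=5, gamma=0.5)
```

Here `early` mixes two real rewards with a discounted search value, while `late` runs off the end of the episode and uses real rewards only.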
Where these ideas are used
The same plan-with-a-model recipe has crossed from games into systems and science.
- Video compression (YouTube). MuZero learns a rate-control policy for the VP9 codec, picking quantization parameters per frame. DeepMind reported an average ~4–6% bitrate reduction at equal quality on a large YouTube video set, a planning problem with no clean simulator.
- AlphaTensor. Cast as a single-player game of decomposing tensors, AlphaZero-style search discovered a faster algorithm for 4×4 matrix multiplication in modular (mod 2) arithmetic, using 47 scalar multiplications where Strassen’s two-level method needs 49. It was the first improvement on that method in over 50 years.
- AlphaDev. Treated assembly-code generation as a game and discovered shorter sorting routines for small inputs, now merged into the sort implementation of LLVM’s libc++ standard library.
- Stochastic and Sampled MuZero. Extensions handle chance (Stochastic MuZero matched the state of the art on 2048 and backgammon) and huge action spaces (Sampled MuZero plans over sampled actions for continuous control).
Strengths and limitations
- Strength: superhuman planning from zero priors. No human data, no domain heuristics; given only an objective and the dynamics (or learned dynamics), the system discovers its own strategy.
- Strength: search amplifies a fixed network. Spend more MCTS simulations at decision time and play gets stronger, trading compute for skill without retraining.
- Limitation: compute-hungry. AlphaZero’s training ran on thousands of TPUs, and MCTS at every move is expensive at inference. This is the bitter lesson in action: search and learning scale, but they cost.
- Limitation: discrete, well-defined action spaces. Vanilla MCTS assumes a manageable set of discrete actions and (for AlphaZero) deterministic dynamics; continuous control and stochasticity need the Sampled/Stochastic variants.
- Limitation: opaque learned model. MuZero’s value-equivalent state is not a faithful world model you can inspect or transfer, which complicates debugging and safety analysis.
Researcher takes
Brown explains the theoretical reason AlphaZero-style self-play succeeds: in two-player zero-sum games it provably converges to a minimax equilibrium, which is exactly the right objective there. He also explains why that guarantee evaporates outside such games.
A pointed argument on the limits of self-play: the zero-sum structure of board games is precisely what makes AlphaZero-style self-play tractable, and its absence explains why the approach has not yet transferred to LLMs.
Frequently asked questions
What is the core difference between AlphaZero and MuZero?
AlphaZero is given the rules of the game and uses them as a perfect simulator during search. MuZero is not given the rules — it learns a model (representation, dynamics, prediction) and runs the same Monte Carlo Tree Search inside that learned model. That lets MuZero work where dynamics are unknown, like Atari from pixels.
Why doesn’t MuZero learn to predict the next frame or board?
Because it only needs information useful for planning. MuZero is trained so its unrolled model predicts reward, value, and policy accurately — a property called value equivalence. Its hidden state can ignore visually obvious but decision-irrelevant detail, which makes the model both cheaper and more focused than a full next-observation predictor.
Is MCTS in AlphaZero the same as classical Monte Carlo Tree Search?
It shares the select–expand–backup structure but replaces the random “rollout” with a neural value head, and guides selection with the network’s policy prior via the PUCT formula. So instead of playing thousands of random games to evaluate a leaf, AlphaZero asks the network once — far more accurate and sample-efficient.
How do AlphaZero and MuZero relate to RL for LLMs?
They’re the purest form of model-based RL with a verifiable environment reward (win/lose), which is why some researchers contrast them with RLHF’s learned reward. The “search to improve a policy, then train on the improved policy” idea also echoes in RL for reasoning, where test-time search and self-generated data play similar roles.
Key papers
- Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm (AlphaZero), Silver et al., 2017: one algorithm, three games, from rules alone.
- Mastering the Game of Go without Human Knowledge (AlphaGo Zero), Silver et al., 2017: pure self-play, the immediate precursor.
- Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model (MuZero), Schrittwieser et al., 2019/2020: the learned-model breakthrough.
- Online and Offline Reinforcement Learning by Planning with a Learned Model (MuZero Reanalyse), Schrittwieser et al., 2021: sample-efficient and offline.
- MuZero with Self-competition for Rate Control in VP9 Video Compression, Mandhane et al., 2022: real-world video compression.
- Monte-Carlo Tree Search as Regularized Policy Optimization, Grill et al., 2020: why MCTS works as policy improvement.