DeepMind — Science, December 2018
AlphaGo proved neural networks + search = superhuman Go. But AlphaGo was a Go program — every piece of it was designed specifically for Go. AlphaZero asks: can the same approach work for any board game, with zero game-specific engineering?
The answer is yes. AlphaZero plays chess, shogi, and Go using the exact same algorithm. No opening books. No endgame databases. No hand-crafted features. Just the rules of the game and self-play.
Zero losses to Stockfish in 100 games. Stockfish searches ~60 million positions per second. AlphaZero searches ~60 thousand. That’s 1,000× fewer positions — yet AlphaZero wins because it searches the right positions.
| Game | Opponent | Training Time | Result |
|---|---|---|---|
| Chess | Stockfish (TCEC champion) | 9 hours | 28 W, 0 L, 72 D |
| Shogi | Elmo (world champion) | 2 hours | 90 W, 8 L, 2 D |
| Go | AlphaGo Zero (3-day) | 34 hours | 60 W, 40 L |
| Feature | AlphaGo | AlphaZero |
|---|---|---|
| Human data | 30M human positions | None |
| Hand-crafted features | 48 input planes | None (raw pieces only) |
| Rollout policy | Yes | Removed |
| Game-specific code | Go only | Same code for all 3 games |
| Symmetry exploitation | 8-fold Go symmetry | No assumptions |
Board state (any game)
          ↓
Single ResNet (≈20 residual blocks)
          ↓
    ┌─────┴─────┐
Policy head       Value head
P(a|s) for        V(s) ∈ [-1, +1]
all legal moves   “Who’s winning?”
    └─────┬─────┘
          ↓
MCTS (800 simulations)
          ↓
Play best move
One network, two outputs. Same architecture for chess, shogi, and Go — the only differences are input encoding and output size.
Repeat forever:
1. Play a game against yourself using MCTS
2. Record each position’s board state, search results, and outcome
3. Train the network:
- Policy head: predict what MCTS recommends
- Value head: predict who actually won
4. Use the updated network for the next game
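Steps 2–3 reduce to a single training objective from the paper: squared value error plus policy cross-entropy plus weight decay, l = (z − v)² − π·log p + c‖θ‖². A minimal pure-Python sketch (the function name and the toy numbers are mine):

```python
import math

def alphazero_loss(z, v, pi, p, l2_term=0.0):
    """Combined AlphaZero loss for one position.

    z  : game outcome from this player's view (-1, 0, or +1)
    v  : value-head prediction
    pi : MCTS visit distribution (the policy target)
    p  : policy-head probabilities
    l2_term : optional weight-decay term c*||theta||^2
    """
    value_loss = (z - v) ** 2                       # squared error on the outcome
    eps = 1e-12                                     # guard against log(0)
    policy_loss = -sum(t * math.log(q + eps) for t, q in zip(pi, p))
    return value_loss + policy_loss + l2_term

# The loss falls as the policy head comes to match the search distribution:
pi = [0.7, 0.2, 0.1]
good = alphazero_loss(1.0, 0.9, pi, [0.7, 0.2, 0.1])
bad  = alphazero_loss(1.0, 0.9, pi, [0.1, 0.2, 0.7])
assert good < bad
```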
The network learns from MCTS, and MCTS uses the network. They bootstrap each other in a virtuous cycle.
Stockfish evaluates 60 million positions per second with a hand-crafted scoring function — wide but shallow understanding. AlphaZero evaluates 60 thousand positions per second with a deep neural network — narrow but deep understanding. The neural network evaluation is so much richer that AlphaZero needs 1,000× fewer evaluations to find better moves.
AlphaZero’s chess style shocked the chess world. Former world champion Garry Kasparov said AlphaZero plays “how I always imagined a superior being would play chess.”
AlphaZero is a generalization result. It doesn’t play chess better because it knows more about chess — it plays better because it has a more powerful learning algorithm that discovers any game’s strategies from scratch. The same code that masters chess also masters shogi and Go.
AlphaZero uses a single deep residual network with two output heads: 19 residual blocks after an initial convolutional layer, 256 filters throughout.
Input encoding (game-specific):
Chess: 8×8×119 planes
Shogi: 9×9×362 planes
Go: 19×19×17 planes
→ Conv 3×3, 256 filters, batch norm, ReLU
→ 19 residual blocks (Conv→BN→ReLU→Conv→BN→skip→ReLU)
→ Split into two heads:
Policy Head: Conv 1×1 → BN → ReLU → move planes → softmax
Value Head: Conv 1×1 → BN → ReLU → FC(256) → ReLU → FC(1) → tanh
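The trunk-plus-two-heads shape can be sketched structurally. The toy below substitutes tiny dense layers for the 256-filter convolutions and omits batch norm entirely — sizes, weights, and names are illustrative, not the paper's:

```python
import math, random

random.seed(0)

def dense(x, w):
    """Plain matrix-vector product: one linear layer, no bias."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def rand_w(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def softmax(x):
    m = max(x)
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]

DIM, MOVES = 8, 4                                   # toy sizes, not the paper's
w1, w2 = rand_w(DIM, DIM), rand_w(DIM, DIM)
w_policy, w_value = rand_w(MOVES, DIM), rand_w(1, DIM)

def residual_block(x):
    """Two layers with a skip connection: linear -> ReLU -> linear -> add -> ReLU."""
    h = [max(0.0, v) for v in dense(x, w1)]
    h = dense(h, w2)
    return [max(0.0, a + b) for a, b in zip(x, h)]  # the skip connection

def forward(board_vec):
    """Shared trunk feeding a policy head and a value head."""
    h = residual_block(board_vec)
    policy = softmax(dense(h, w_policy))            # P(a|s) over moves
    value = math.tanh(dense(h, w_value)[0])         # V(s) in [-1, +1]
    return policy, value

policy, value = forward([0.5] * DIM)
assert abs(sum(policy) - 1.0) < 1e-9 and -1.0 <= value <= 1.0
```

The shared trunk is the point: both heads read the same learned features, so position evaluation and move preferences reinforce each other during training.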
Each “plane” is a grid the size of the board encoding one specific fact. Most are binary (piece present or not); a few, such as the move counters, are constant-valued.
| Game | Board | Planes | Why |
|---|---|---|---|
| Chess | 8×8 | 119 | (6 piece types × 2 colors + repetition) × 8 history steps + castling rights, color, move counters |
| Shogi | 9×9 | 362 | (14 piece types × 2 colors + pieces in hand (drop rule) + repetition) × 8 history steps + color, move count |
| Go | 19×19 | 17 | 2 stone colors × 8 history steps + color to play |
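The plane counts can be sanity-checked arithmetically. The decompositions below follow my reading of the paper's supplementary tables — the shogi split in particular is an assumption:

```python
# Per-history-step planes × 8 steps + constant planes.
# Chess: (6 piece types × 2 colors + 2 repetition) per step, then
# castling rights (4), color (1), total-move count (1), no-progress count (1).
chess_planes = (6 * 2 + 2) * 8 + 4 + 1 + 1 + 1

# Shogi (assumed split): (14 piece types × 2 colors + 7 hand-piece types × 2
# colors + 3 repetition) per step, then color (1) and move count (1).
shogi_planes = (14 * 2 + 7 * 2 + 3) * 8 + 1 + 1

# Go: 2 stone colors per step, plus color to play.
go_planes = 2 * 8 + 1

assert (chess_planes, shogi_planes, go_planes) == (119, 362, 17)
```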
Go’s 17 planes vs. AlphaGo’s 48 is a massive simplification — no hand-crafted features (liberties, ladders, ko). The network learns to compute whatever features matter from raw board state.
AlphaGo Zero trained in discrete cycles (generate games → train → evaluate → repeat). AlphaZero uses continuous training:
Self-play actors: Always generating games with latest network
Trainer: Always training on most recent games
No explicit evaluation step — network always updating
This is simpler and faster — no checkpoint selection overhead. The risk: a bad update could produce bad self-play data that reinforces the bad update (a death spiral). In practice this doesn’t happen because updates are small relative to existing knowledge.
| Component | Scale |
|---|---|
| Self-play generation | 5,000 first-gen TPUs |
| Neural network training | 64 second-gen TPUs |
| Chess training time | ~9 hours (700K steps × 4,096 positions/step) |
| Shogi training time | ~12 hours |
| Go training time | ~13 days |
| Match hardware | AlphaZero: 4 first-gen TPUs (single machine); Stockfish: 44 CPU cores |
The original December 2017 preprint drew significant criticism:
| Criticism | Detail |
|---|---|
| Hash size | Stockfish given only 1 GB — developer Tord Romstad called this “suboptimal” |
| No opening book | Stockfish is optimized to work with opening books |
| Fixed time | 1 min/move doesn’t let Stockfish manage its time |
| Old version | 2017 match used Stockfish 8; newer versions were stronger |
DeepMind addressed every criticism: 3 hours + 15 sec/move time control; 1,000 games (155 wins, 6 losses, 839 draws); tested with an opening book (AlphaZero still won); tested against the latest Stockfish development version (same result). AlphaZero only started losing at 30:1 time odds.
Without any opening book, AlphaZero develops its own repertoire through self-play. It almost never plays 1.e4, preferring 1.c4 (English Opening) and 1.d4 systems — slower, positional openings where long-term strategic understanding dominates over tactical calculation.
The only game-specific components are input encoding, output encoding, and the exploration noise parameter. Everything else — architecture, training loop, search algorithm — is identical across chess, shogi, and Go.
At each node during tree traversal, AlphaZero selects the action maximizing:
a* = argmax_a [ Q(s,a) + c_puct · P(s,a) · √(Σ_b N(s,b)) / (1 + N(s,a)) ]
Q(s,a) = mean value of simulations through this action (exploitation)
P(s,a) = prior probability from policy network (learned intuition)
N(s,a) = visit count for this action
c_puct = exploration constant
Early in search (low N): the prior P(s,a) dominates — network intuition guides exploration. Late in search (high N): Q(s,a) dominates — empirical results override intuition.
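A sketch of the selection rule in plain Python. The `c_puct` value here is illustrative — the published pseudocode actually grows the exploration constant slowly with the parent's visit count:

```python
import math

def select_action(Q, P, N, c_puct=1.25):
    """PUCT action selection at one tree node.

    Q, P, N are per-action lists: mean simulation value, network prior,
    and visit count. Returns the index of the action to explore next.
    """
    total_visits = sum(N)
    sqrt_total = math.sqrt(total_visits) if total_visits > 0 else 1.0
    scores = [q + c_puct * p * sqrt_total / (1 + n)
              for q, p, n in zip(Q, P, N)]
    return max(range(len(scores)), key=scores.__getitem__)

# Early (all N = 0): the prior dominates, so the network's favorite is tried first.
assert select_action([0.0, 0.0], [0.2, 0.8], [0, 0]) == 1
# Late (large N): empirical Q overrides a weaker prior.
assert select_action([0.9, 0.1], [0.2, 0.8], [500, 500]) == 0
```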
At the root node only, noise is added to the prior probabilities:
P(s,a) = (1 - ε) · p_a + ε · η_a where η ~ Dir(α), ε = 0.25
75% network prior + 25% random noise. The α parameter is game-specific:
| Game | α | Avg Legal Moves | Relationship |
|---|---|---|---|
| Chess | 0.3 | ~30 | α ≈ 10/avg_moves |
| Shogi | 0.15 | ~80 | α ≈ 10/avg_moves |
| Go | 0.03 | ~250 | α ≈ 10/avg_moves |
Small α produces sparse noise (boosting a few moves). Large α produces uniform noise. In Go (250 legal moves), you want sparse exploration; in chess (30 moves), denser noise is fine.
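Python's standard library has no Dirichlet sampler, but one follows from normalized Gamma draws. A sketch of the root-noise mixing (the seed and priors are illustrative):

```python
import random

random.seed(42)

def dirichlet(alpha, k):
    """Sample Dir(alpha) as k Gamma(alpha, 1) draws, normalized (stdlib only)."""
    g = [random.gammavariate(alpha, 1.0) for _ in range(k)]
    s = sum(g)
    return [v / s for v in g]

def noisy_priors(p, alpha, eps=0.25):
    """Root-node mixing: (1 - eps) * prior + eps * Dirichlet noise."""
    eta = dirichlet(alpha, len(p))
    return [(1 - eps) * pi + eps * ni for pi, ni in zip(p, eta)]

priors = [0.5, 0.3, 0.15, 0.05]
mixed = noisy_priors(priors, alpha=0.3)      # chess-like alpha
assert abs(sum(mixed) - 1.0) < 1e-9
assert all(m > 0 for m in mixed)             # every move gets some exploration mass
```

Because the noise is only applied at the root, it diversifies which first moves get explored without corrupting evaluations deeper in the tree.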
Move selection: π(a|s) = N(s,a)^(1/τ) / Σ N(s,b)^(1/τ)
First 30 moves: τ = 1.0 (proportional to visit count — exploratory)
After move 30: τ → 0 (approaches argmax — greedy)
Competitive play: τ → 0 (always greedy)
Early-game diversity prevents self-play from collapsing into a single opening line. After move 30, positions are unique enough for greedy play.
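The visit-count-to-policy conversion in code (visit counts are illustrative):

```python
def search_policy(visits, tau):
    """pi(a) proportional to N(a)^(1/tau); tau -> 0 plays the most-visited move."""
    if tau < 1e-3:                       # greedy limit: probability 1 on argmax
        best = max(range(len(visits)), key=visits.__getitem__)
        return [1.0 if i == best else 0.0 for i in range(len(visits))]
    powered = [n ** (1.0 / tau) for n in visits]
    s = sum(powered)
    return [v / s for v in powered]

visits = [600, 150, 50]                  # visit counts after 800 simulations
assert search_policy(visits, tau=1.0) == [0.75, 0.1875, 0.0625]  # exploratory
assert search_policy(visits, tau=0.0) == [1.0, 0.0, 0.0]         # competitive play
```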
Chess Elo curve:
Hours 0-1: Random → discovers material values (~0 → ~1000 Elo)
Hours 1-2: Piece development, king safety, tactics (~1000 → ~2000)
Hours 2-4: Positional concepts, pawn structure (~2000 → ~3000, surpasses Stockfish)
Hours 4-9: Refinement, subtle endgames (~3000 → ~3400+, plateaus)
AlphaZero rediscovers centuries of human chess knowledge in about 4 hours. The remaining 5 hours are refinement.
GM Matthew Sadler and WIM Natasha Regan analyzed thousands of AlphaZero’s games for Game Changer (2019):
Classical principles it confirmed: king safety as a primary concern; central control; the bishop pair advantage; rooks on open files; connected passed pawns.
Principles it overturned: material matters less — it routinely sacrifices for initiative; pawn structure is flexible — it creates “weaknesses” deliberately; the king can be mobile — it sometimes marches the king to the center in the middlegame.
Novel concepts: “fawn pawns” — lone advanced pawns cramping the opponent (Leela Chess Zero independently discovered the same); prophylactic sacrifice — giving up material to deny the opponent comfortable positions; extreme patience — maneuvering 50+ moves in won positions.
The only game-specific “knowledge” is the Dirichlet α parameter — one number per game, derived from branching factor. Everything else is learned from scratch. The limitations (perfect info, two-player, zero-sum, deterministic) define exactly where MuZero needed to push next.
In August 2020, Stockfish integrated NNUE (Efficiently Updatable Neural Network) — a neural network evaluation originally developed for shogi. The result: +90 Elo from swapping the evaluation function alone.
Before NNUE (Stockfish 11): hand-crafted eval + alpha-beta search
After NNUE (Stockfish 12+): neural network eval + alpha-beta search (NOT MCTS)
The hybrid: neural evaluation + classical search
Stockfish didn’t adopt MCTS. It kept alpha-beta search but replaced hand-crafted evaluation with a neural network — taking AlphaZero’s key insight (learned evaluation) without its search algorithm. Modern Stockfish is stronger than AlphaZero ever was. AlphaZero’s contribution was proving neural evaluation is superior, which the traditional engine community then integrated.
Since AlphaZero was never open-sourced, the community built Leela Chess Zero (Lc0) — a faithful reimplementation trained through distributed volunteer computing. Lc0 has won multiple TCEC seasons, trading titles with Stockfish and pushing both engines far beyond where AlphaZero left off.
| Era | Dominant Approach |
|---|---|
| Pre-2017 | Hand-crafted eval + alpha-beta (Stockfish) |
| 2017–2020 | MCTS + neural eval begins competing (Lc0 rises) |
| 2020+ | NNUE hybrid emerges (Stockfish adopts neural eval) |
| 2024+ | Both approaches superhuman, trading TCEC wins |
For the first time, grandmasters studied computer games for strategic inspiration. Previous engines played “correct but uninspiring” chess. AlphaZero’s games were different — GM Sadler and WIM Regan spent a year analyzing them for Game Changer (2019).
A March 2026 paper (Zhou & Riis, Queen Mary University of London) tested AlphaZero-style self-play on Nim — a simple game with a known perfect strategy based on XOR of heap sizes.
Despite heavy training and search, agents developed blind spots — positions where they missed optimal moves. Performance degraded toward random as board size grew. The optimal Nim strategy is an abstract arithmetic rule (XOR), not a spatial pattern — and pattern-matching neural networks struggle with abstract rules.
AlphaZero still requires knowing the rules of the game. MuZero (2019 preprint, 2020 Nature) removes this final crutch:
| Feature | AlphaZero | MuZero |
|---|---|---|
| Knows game rules | ✓ Yes | ✗ Learns them |
| Game types | Board games | Board games + Atari |
| Search simulation | Uses real rules | Uses learned world model |
| Risk | None (rules are correct) | Model can be wrong → exploitation |
MuZero learns three networks: representation (observation → hidden state), dynamics (state + action → next state + reward), and prediction (state → policy + value). It achieves comparable performance to AlphaZero on chess, shogi, and Go — without knowing the rules — and masters 57 Atari games.
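The three-network split can be illustrated with a deliberately trivial toy — an integer “game” standing in for learned hidden states. Everything here is schematic (all names are mine); the point is only that planning unrolls the learned `dynamics`, never the real rules:

```python
# Toy MuZero-style interface. The "game": state is an integer, action 0
# means +1 and action 1 means -1, reward is 1.0 when the result is positive.

def representation(observation):
    """observation -> hidden state (identity here, a learned encoder in MuZero)."""
    return observation

def dynamics(state, action):
    """hidden state + action -> (next hidden state, predicted reward)."""
    nxt = state + (1 if action == 0 else -1)
    return nxt, float(nxt > 0)

def prediction(state):
    """hidden state -> (policy over actions, value estimate)."""
    policy = [0.9, 0.1] if state >= 0 else [0.1, 0.9]
    return policy, max(-1.0, min(1.0, state / 10.0))

# Planning loop: unroll the LEARNED model three steps from the start state.
s = representation(0)
for _ in range(3):
    policy, value = prediction(s)
    a = policy.index(max(policy))    # greedy for the sketch; MuZero uses MCTS
    s, r = dynamics(s, a)
assert s == 3 and r == 1.0
```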
AlphaZero’s deepest insight isn’t about games. It’s about the relationship between fast intuition and slow deliberation:
System 1 (fast): Neural network — instant pattern recognition
“This position looks good for White”
System 2 (slow): MCTS — deliberate search through possibilities
“Let me verify by exploring move sequences…”
Neither alone is sufficient. Pure intuition (network without search) plays strong amateur chess. Pure deliberation (search without network) plays intermediate chess. Together: superhuman. This dual-process paradigm — learning provides intuition, search provides verification — may be the most important architectural insight in modern AI.
| Dimension | Rating | Notes |
|---|---|---|
| Novelty | ★★★★★ | First domain-general superhuman game-playing algorithm |
| Evidence quality | ★★★★ | 1,000-game matches, time-odds, addressed all criticism in Science paper |
| Technical depth | ★★★★ | Elegant simplification from AlphaGo Zero; identical code for 3 games |
| Writing quality | ★★★★ | Clear and concise; supplementary tables thorough |
| Longevity | ★★★★★ | Spawned Stockfish NNUE, Lc0, MuZero; transformed chess culture; paradigm applies beyond games |
The paper that proved AI can be domain-general. Not just superhuman at one game, but at three fundamentally different games with zero game-specific engineering. AlphaZero’s search + learning paradigm — where neural intuition and deliberate search amplify each other — has become the template for AI systems from protein folding to mathematical reasoning. And its chess is beautiful.