
AlphaZero: Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

DeepMind — Science, December 2018

📄 Paper (Science)

TL;DR: A single algorithm — with zero game-specific knowledge beyond the rules — mastered chess, shogi, and Go through pure self-play. AlphaZero defeated world-champion engines Stockfish (chess, 28–0), Elmo (shogi, 90–8), and the 3-day AlphaGo Zero (Go, 60–40), while searching 1,000× fewer positions. It proved that the learning algorithm is domain-general, not game-specific.

Level 1 — Beginner

One algorithm, three games

AlphaGo proved neural networks + search = superhuman Go. But AlphaGo was a Go program — every piece of it was designed specifically for Go. AlphaZero asks: can the same approach work for any board game, with zero game-specific engineering?

The answer is yes. AlphaZero plays chess, shogi, and Go using the exact same algorithm. No opening books. No endgame databases. No hand-crafted features. Just the rules of the game and self-play.

Results

28–0
vs Stockfish
(chess champion)
90–8
vs Elmo
(shogi champion)
60–40
vs AlphaGo Zero
(3-day version)

Zero losses to Stockfish in 100 games. Stockfish searches ~60 million positions per second. AlphaZero searches ~60 thousand. That’s 1,000× fewer positions — yet AlphaZero wins because it searches the right positions.

Game    Opponent                    Training Time   Result
Chess   Stockfish (TCEC champion)   9 hours         28 W, 0 L, 72 D
Shogi   Elmo (world champion)       12 hours        90 W, 8 L, 2 D
Go      AlphaGo Zero (3-day)        34 hours        60 W, 40 L

What changed from AlphaGo?

Feature                  AlphaGo              AlphaZero
Human data               30M positions        None
Hand-crafted features    48 input planes      None (raw pieces only)
Rollout policy           Yes                  Removed
Game-specific code       Go only              Same code for all 3 games
Symmetry exploitation    8-fold Go symmetry   No assumptions

The architecture

    Board state (any game)
            ↓
  Single ResNet (19 residual blocks)
            ↓
      ┌─────┴─────┐
      ↓           ↓
 Policy head    Value head
 P(a|s) for     V(s) ∈ [-1, +1]
 all legal      “Who’s winning?”
 moves
      └─────┬─────┘
            ↓
   MCTS (800 simulations)
            ↓
      Play best move

One network, two outputs. Same architecture for chess, shogi, and Go — the only differences are input encoding and output size.

Self-play training loop

Repeat forever:
  1. Play a game against yourself using MCTS
  2. Record each position’s board state, search results, and outcome
  3. Train the network:
     - Policy head: predict what MCTS recommends
     - Value head: predict who actually won
  4. Use the updated network for the next game

The network learns from MCTS, and MCTS uses the network. They bootstrap each other in a virtuous cycle.
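The loop above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not DeepMind's code: `mcts_policy` stands in for an 800-simulation search, and the state, dynamics, and terminal condition are placeholders.

```python
import random

def play_self_play_game(mcts_policy, max_moves=200):
    """One self-play game producing (state, search_probs, outcome) training triples.

    `mcts_policy(state)` is a stand-in for MCTS returning visit-count
    probabilities over moves. The toy "state" is just a ply counter.
    """
    history = []           # (state, search_probs, player) per position
    state, player = 0, +1
    for _ in range(max_moves):
        probs = mcts_policy(state)                      # steps 1-2: search, record
        history.append((state, probs, player))
        move = random.choices(range(len(probs)), weights=probs)[0]
        state, player = state + 1, -player              # toy dynamics ignore `move`
        if state >= 10:                                 # toy terminal condition
            break
    z = random.choice([+1, -1])                         # toy outcome for player +1
    # step 3: each position is labelled with the final result from the mover's view;
    # the policy head trains toward `probs`, the value head toward this label
    return [(s, p_search, z * p) for s, p_search, p in history]

examples = play_self_play_game(lambda s: [0.7, 0.2, 0.1])
assert all(abs(v) == 1 for _, _, v in examples)   # value targets are ±1 (win/loss)
```

In the real system the policy target is the MCTS visit distribution and the value target is the game result, exactly as listed in steps 2-3 above.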

Why 1,000× fewer positions = better play

Quality beats quantity

Stockfish evaluates 60 million positions per second with a hand-crafted scoring function — wide but shallow understanding. AlphaZero evaluates 60 thousand positions per second with a deep neural network — narrow but deep understanding. The neural network evaluation is so much richer that AlphaZero needs 1,000× fewer evaluations to find better moves.

The chess revolution

AlphaZero’s chess style shocked the chess world. Former world champion Garry Kasparov said AlphaZero plays “how I always imagined a superior being would play chess.” Its style:

  • Sacrifices material for positional advantage — gives up pieces for long-term strategic compensation
  • Favors aggressive, open positions where pattern recognition dominates
  • Ignores conventional wisdom — plays moves that violate textbook principles, but they work

Key takeaway

AlphaZero is a generalization result. It doesn’t play chess better because it knows more about chess — it plays better because it has a more powerful learning algorithm that discovers any game’s strategies from scratch. The same code that masters chess also masters shogi and Go.

Quiz — Level 1
1. AlphaZero searches ~60,000 positions/sec while Stockfish searches ~60 million. Despite this 1,000× gap, AlphaZero wins. What best explains this?
The neural network captures complex positional patterns that hand-crafted rules miss. Quality of evaluation beats quantity of positions searched.
2. Which of the following did AlphaZero remove compared to AlphaGo Zero?
AlphaGo Zero exploited Go’s 8-fold board symmetry. AlphaZero drops all symmetry assumptions to keep the algorithm game-agnostic across chess, shogi, and Go.
3. AlphaZero’s training loop has the network learn from MCTS, while MCTS uses the network. What does this create?
The network and search bootstrap each other: better network → better search → better training data → even better network. This virtuous cycle drives continuous improvement.
4. Kasparov described AlphaZero’s chess as “how I always imagined a superior being would play.” What specifically was surprising?
Traditional engines are materialistic — piece values are hard-coded. AlphaZero learned that material isn’t everything, voluntarily sacrificing pieces for strategic compensation.
5. AlphaZero is described as a “generalization result.” What does this mean?
AlphaZero proves the algorithm is game-agnostic. No game-specific tuning needed — the same code, architecture, and hyperparameters work for three fundamentally different games.

Level 2 — Intermediate

Architecture deep dive

AlphaZero uses a single deep residual network with two output heads: an initial convolutional layer followed by 19 residual blocks, 256 filters throughout.

Input encoding (game-specific):
  Chess:  8×8×119 planes
  Shogi:  9×9×362 planes
  Go:     19×19×17 planes

→ Conv 3×3, 256 filters, batch norm, ReLU
→ 19 residual blocks (Conv→BN→ReLU→Conv→BN→skip→ReLU)
→ Split into two heads:

Policy Head: Conv 1×1 → BN → ReLU → move planes → softmax
Value Head:  Conv 1×1 → BN → ReLU → FC(256) → ReLU → FC(1) → tanh

Input planes — why the numbers differ

Each “plane” is a binary grid the size of the board encoding one specific fact:

Game    Board   Planes   Why
Chess   8×8     119      (12 piece planes + 2 repetition planes) × 8 history steps + 7 planes for castling rights, colour, and move counters
Shogi   9×9     362      (28 piece planes + 3 repetition + 14 pieces-in-hand planes, for the drop rule) × 8 history + colour and move count
Go      19×19   17       2 stone colours × 8 history steps + colour to play

Go’s 17 planes vs. AlphaGo’s 48 is a massive simplification — no hand-crafted features (liberties, ladders, ko). The network learns to compute whatever features matter from raw board state.
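The plane counts can be recomputed from the per-game breakdowns in the table above. A quick sanity check in Python (the groupings follow the paper's supplementary tables):

```python
# Chess: (6 own + 6 opponent piece types + 2 repetition planes) x 8 history steps,
# plus 7 constant planes (colour, move count, 4 castling rights, no-progress count)
chess = (6 + 6 + 2) * 8 + 7

# Shogi: (14 + 14 piece planes + 3 repetition + 7 + 7 pieces-in-hand counts)
# x 8 history steps, plus colour and move count
shogi = (14 + 14 + 3 + 7 + 7) * 8 + 2

# Go: 2 stone colours x 8 history steps + 1 colour-to-play plane
go = 2 * 8 + 1

assert (chess, shogi, go) == (119, 362, 17)
```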

Continuous training

AlphaGo Zero trained in discrete cycles (generate games → train → evaluate → repeat). AlphaZero uses continuous training:

Self-play actors: Always generating games with latest network
Trainer: Always training on most recent games
No explicit evaluation step — network always updating

This is simpler and faster — no checkpoint selection overhead. The risk: a bad update could produce bad self-play data that reinforces the bad update (a death spiral). In practice this doesn’t happen because updates are small relative to existing knowledge.

Training infrastructure

Component                 Scale
Self-play generation      5,000 first-gen TPUs
Neural network training   64 second-gen TPUs
Chess training time       ~9 hours (700K steps × 4,096 positions/step)
Shogi training time       ~12 hours
Go training time          ~13 days
Match hardware            4 TPUs + 44 CPU cores (single machine)

The Stockfish controversy

The original December 2017 preprint drew significant criticism:

Criticism         Detail
Hash size         Stockfish given only 1 GB — developer Tord Romstad called this “suboptimal”
No opening book   Stockfish is optimized to work with opening books
Fixed time        1 min/move doesn’t let Stockfish manage its time
Old version       2017 match used Stockfish 8; newer versions were stronger

Science paper response (Dec 2018)

DeepMind addressed every criticism: a 3-hour + 15-seconds-per-move tournament time control; 1,000 games (155 wins, 6 losses, 839 draws); tested with an opening book (AlphaZero still won); tested against the latest Stockfish development version (same result). AlphaZero only started losing at 30:1 time odds.

Chess opening preferences

Without any opening book, AlphaZero develops its own repertoire through self-play. It almost never plays 1.e4, preferring 1.c4 (English Opening) and 1.d4 systems — slower, positional openings where long-term strategic understanding dominates over tactical calculation.

Key takeaway

The only game-specific components are input encoding, output encoding, and the exploration noise parameter. Everything else — architecture, training loop, search algorithm — is identical across chess, shogi, and Go.

Quiz — Level 2
1. AlphaZero uses continuous training rather than AlphaGo Zero’s batched checkpoint approach. What risk does this introduce?
Without explicit checkpoint evaluation, there’s no gatekeeper preventing a bad update from propagating. In practice, the small update size relative to existing weights prevents this death spiral.
2. AlphaZero drops the symmetry exploitation that AlphaGo Zero used for Go. Why?
To keep the algorithm truly game-agnostic, no symmetry is assumed for any game. The king-side is different from the queen-side in chess, and pieces in hand break shogi’s near-symmetry.
3. The chess community criticized the 2017 match. Statement I says “Stockfish was given suboptimal hash size — addressed by providing tournament-standard 44 CPU cores.” Consider:
I. As stated above
II. No opening book — addressed by testing with a strong opening book; AlphaZero still won
III. Fixed time per move — addressed by using 3h+15s tournament time controls
IV. AlphaZero only starts losing at 30:1 time odds
Statement I connects the wrong fix to the wrong criticism. 44 CPU cores addresses hardware fairness, not hash size. The hash-size concern was about RAM allocation, not CPU count. II, III, and IV are all correctly stated.
4. Chess input encoding uses 119 planes while Go uses 17. Why is chess more complex per-square?
The number of planes is a deterministic consequence of the game’s rules: more piece types, more special rules (castling, en passant) = more planes needed to encode the full board state.
5. AlphaZero almost never opens with 1.e4 as White, preferring 1.c4 and 1.d4. What does this reveal?
AlphaZero’s neural evaluation excels at deep positional understanding. Quiet, strategic positions play to this strength more than sharp tactical positions where raw calculation depth matters.

Level 3 — Expert

MCTS — the PUCT formula

At each node during tree traversal, AlphaZero selects the action maximizing:

a = argmax [ Q(s,a) + c_puct · P(s,a) · sqrt(ΣN(s,b)) / (1 + N(s,a)) ]

Q(s,a)   = mean value of simulations through this action (exploitation)
P(s,a)   = prior probability from policy network (learned intuition)
N(s,a)   = visit count for this action
c_puct   = exploration constant

Early in search (low N): the prior P(s,a) dominates — network intuition guides exploration. Late in search (high N): Q(s,a) dominates — empirical results override intuition.
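A minimal Python sketch of this selection rule. The node layout and the constant `c_puct = 1.25` are illustrative choices, not DeepMind's implementation:

```python
import math

def puct_select(children, c_puct=1.25):
    """Pick the action maximizing Q(s,a) + c_puct * P(s,a) * sqrt(N_total) / (1 + N(s,a)).

    `children` maps action -> {"P": prior, "N": visit count, "W": total value}.
    """
    n_total = sum(ch["N"] for ch in children.values())

    def score(ch):
        q = ch["W"] / ch["N"] if ch["N"] > 0 else 0.0        # exploitation term
        u = c_puct * ch["P"] * math.sqrt(n_total) / (1 + ch["N"])  # exploration term
        return q + u

    return max(children, key=lambda a: score(children[a]))

# Despite a lower prior, d4's high empirical value and low visit count win out:
children = {"e4": {"P": 0.6, "N": 10, "W": 1.0},
            "d4": {"P": 0.4, "N": 1,  "W": 0.9}}
assert puct_select(children) == "d4"
```

Note how the `1 + N` denominator shrinks the exploration bonus for heavily visited actions, which is exactly the early-prior / late-Q behavior described above.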

Dirichlet noise — forcing exploration

At the root node only, noise is added to the prior probabilities:

P(s,a) = (1 - ε) · p_a + ε · η_a    where η ~ Dir(α), ε = 0.25

75% network prior + 25% random noise. The α parameter is game-specific:

Game    α      Avg Legal Moves   Relationship
Chess   0.3    ~30               α ≈ 10/avg_moves
Shogi   0.15   ~80               α ≈ 10/avg_moves
Go      0.03   ~250              α ≈ 10/avg_moves

Small α produces sparse noise (boosting a few moves). Large α produces uniform noise. In Go (250 legal moves), you want sparse exploration; in chess (30 moves), denser noise is fine.
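A sketch of the mixing step in plain Python, sampling a symmetric Dirichlet via `random.gammavariate` (the function name and layout are ours, not DeepMind's):

```python
import random

def add_root_noise(priors, alpha, eps=0.25):
    """Mix Dirichlet(alpha) noise into root priors: (1 - eps) * p + eps * eta.

    A symmetric Dirichlet sample is a vector of Gamma(alpha, 1) draws
    normalized to sum to 1, so no external libraries are needed.
    """
    gammas = [random.gammavariate(alpha, 1.0) for _ in priors]
    total = sum(gammas)
    noise = [g / total for g in gammas]
    return [(1 - eps) * p + eps * n for p, n in zip(priors, noise)]

random.seed(0)
mixed = add_root_noise([0.5, 0.3, 0.2], alpha=0.3)
assert abs(sum(mixed) - 1.0) < 1e-9   # still a probability distribution
assert all(m > 0 for m in mixed)      # every legal move keeps some mass
```

With a small α most of the noise mass lands on one or two random moves, forcing the search to occasionally try something the network would never suggest.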

Temperature scheduling

Move selection: π(a|s) = N(s,a)^(1/τ) / Σ N(s,b)^(1/τ)

First 30 moves:  τ = 1.0  (proportional to visit count — exploratory)
After move 30:   τ → 0   (approaches argmax — greedy)
Competitive play: τ → 0   (always greedy)

Early-game diversity prevents self-play from collapsing into a single opening line. After move 30, positions are unique enough for greedy play.
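The temperature rule can be sketched as follows (the τ → 0 limit is handled as an explicit greedy case to avoid numerical overflow):

```python
def visit_probs(visit_counts, tau):
    """Convert root visit counts into move probabilities at temperature tau.

    tau = 1.0 -> proportional to visits (exploratory);
    tau -> 0  -> all mass on the most-visited move (greedy).
    """
    if tau < 1e-3:  # greedy limit: argmax of visit counts
        best = max(range(len(visit_counts)), key=visit_counts.__getitem__)
        return [1.0 if i == best else 0.0 for i in range(len(visit_counts))]
    powered = [n ** (1.0 / tau) for n in visit_counts]
    total = sum(powered)
    return [p / total for p in powered]

counts = [500, 250, 50]  # root visit counts after 800 simulations
assert visit_probs(counts, tau=1.0) == [0.625, 0.3125, 0.0625]
assert visit_probs(counts, tau=0.0) == [1.0, 0.0, 0.0]
```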

Training dynamics — what AlphaZero learns when

Chess Elo curve:
  Hours 0-1:  Random → discovers material values       (~0 → ~1000 Elo)
  Hours 1-2:  Piece development, king safety, tactics  (~1000 → ~2000)
  Hours 2-4:  Positional concepts, pawn structure      (~2000 → ~3000, surpasses Stockfish)
  Hours 4-9:  Refinement, subtle endgames              (~3000 → ~3400+, plateaus)

AlphaZero rediscovers centuries of human chess knowledge in about 3 hours. The remaining 5 hours are refinement.

Concepts AlphaZero independently discovers

GM Matthew Sadler and WIM Natasha Regan analyzed thousands of AlphaZero’s games for Game Changer (2019):

Classical concepts preserved

King safety as primary concern. Central control. Bishop pair advantage. Rook on open files. Connected passed pawns.

Classical concepts modified

Material is less important — routinely sacrifices for initiative. Pawn structure flexibility — creates “weaknesses” deliberately. King mobility — sometimes marches the king to the center in the middlegame.

Novel strategies

“Fawn pawns” — lone advanced pawns cramping the opponent (Leela Chess Zero independently discovered the same). Prophylactic sacrifice — sacrificing to prevent comfortable positions. Extreme patience — maneuvering 50+ moves in won positions.

Limitations

  • Perfect information only — cannot play poker (hidden cards), StarCraft (fog of war)
  • Two-player zero-sum only — no cooperative, multi-player, or negotiation games
  • Deterministic only — no backgammon (dice), no card draws
  • No explainability — outputs V(s)=0.72 but can’t say why the position favors White

Key takeaway

The only game-specific “knowledge” is the Dirichlet α parameter — one number per game, derived from branching factor. Everything else is learned from scratch. The limitations (perfect info, two-player, zero-sum, deterministic) define exactly where MuZero needed to push next.

Quiz — Level 3
1. Dirichlet noise uses α = 0.3 (chess), 0.15 (shogi), 0.03 (Go). What determines these values?
The α parameter is inversely proportional to the average number of legal moves: α ≈ 10/avg_moves. This ensures the noise distribution matches the action space size.
2. During self-play, τ=1.0 for the first 30 moves then τ→0. What happens with τ→0 (greedy) from move 1?
Early-game diversity is critical. Without exploration in the opening, self-play generates the same positions repeatedly, and the network overfits to a narrow set of strategies.
3. GM Sadler’s analysis found several departures from conventional play:
I. AlphaZero routinely sacrifices material for initiative
II. It sometimes marches the king to center in the middlegame
III. It always preserves pawn structure, never creating isolated/doubled pawns
IV. It independently discovered “fawn pawns”
III is false. AlphaZero deliberately creates isolated and doubled pawns when the dynamic compensation (activity, initiative) outweighs the structural weakness. Pawn structure flexibility is one of its signature departures.
4. AlphaZero surpasses Elmo (shogi) in ~2 hours but Stockfish (chess) in ~4 hours, despite shogi being more complex. Why?
The speed to surpass a champion reflects the gap between the existing champion and the game’s skill ceiling. Elmo was further from shogi’s ceiling than Stockfish was from chess’s.
5. AlphaZero works for chess, shogi, and Go but not poker. What fundamental property of poker violates its assumptions?
MCTS requires knowing the full game state to simulate future positions. In poker, opponents’ cards are hidden, making simulation impossible without modeling beliefs about hidden information.

Level 4 — Frontier

AlphaZero’s legacy — how it changed chess engines

In August 2020, Stockfish integrated NNUE (Efficiently Updatable Neural Network) — a neural network evaluation originally developed for shogi. The result: +90 Elo from swapping the evaluation function alone.

Before NNUE (Stockfish 11):  hand-crafted eval + alpha-beta search
After NNUE (Stockfish 12+):  neural network eval + alpha-beta search (NOT MCTS)

The hybrid: neural evaluation + classical search

Stockfish didn’t adopt MCTS. It kept alpha-beta search but replaced hand-crafted evaluation with a neural network — taking AlphaZero’s key insight (learned evaluation) without its search algorithm. Modern Stockfish is stronger than AlphaZero ever was. AlphaZero’s contribution was proving neural evaluation is superior, which the traditional engine community then integrated.

Leela Chess Zero — the open-source clone

Since AlphaZero was never open-sourced, the community built Leela Chess Zero (Lc0) — a faithful reimplementation trained through distributed volunteer computing. Lc0 has won multiple TCEC seasons, trading titles with Stockfish and pushing both engines far beyond where AlphaZero left off.

Era         Dominant Approach
Pre-2017    Hand-crafted eval + alpha-beta (Stockfish)
2017–2020   MCTS + neural eval begins competing (Lc0 rises)
2020+       NNUE hybrid emerges (Stockfish adopts neural eval)
2024+       Both approaches superhuman, trading TCEC wins

Impact on human chess

For the first time, grandmasters studied computer games for strategic inspiration. Previous engines played “correct but uninspiring” chess. AlphaZero’s games were different — GM Sadler and WIM Regan spent a year analyzing them for Game Changer (2019).

  • Initiative over material — top players became more willing to sacrifice for dynamic compensation
  • 1.d4 resurgence — AlphaZero’s preference influenced both engine and human opening choices
  • Exchange sacrifice revival — giving rook for minor piece + positional compensation, previously rejected by engines

The Nim blind spot (2026)

A March 2026 paper (Zhou & Riis, Queen Mary University of London) tested AlphaZero-style self-play on Nim — a simple game with a known perfect strategy based on XOR of heap sizes.

Key finding

Despite heavy training and search, agents developed blind spots — positions where they missed optimal moves. Performance degraded toward random as board size grew. The optimal Nim strategy is an abstract arithmetic rule (XOR), not a spatial pattern — and pattern-matching neural networks struggle with abstract rules.

From AlphaZero to MuZero

AlphaZero still requires knowing the rules of the game. MuZero (2019 preprint, 2020 Nature) removes this final crutch:

Feature             AlphaZero                  MuZero
Knows game rules    ✓ Yes                      ✗ Learns them
Game types          Board games                Board games + Atari
Search simulation   Uses real rules            Uses learned world model
Risk                None (rules are correct)   Model can be wrong → exploitation

MuZero learns three networks: representation (observation → hidden state), dynamics (state + action → next state + reward), and prediction (state → policy + value). It achieves comparable performance to AlphaZero on chess, shogi, and Go — without knowing the rules — and masters 57 Atari games.
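The three-network design can be sketched abstractly. Here `h`, `g`, and `f` are stand-in callables (toy lambdas), not MuZero's actual networks; the point is the shape of the unroll, where search steps through learned dynamics instead of real rules:

```python
def muzero_plan(obs, actions, h, g, f):
    """Unroll MuZero's learned model along one action sequence.

    h: representation (observation -> hidden state)
    g: dynamics       (state, action -> next state, reward)
    f: prediction     (state -> policy, value)
    """
    state = h(obs)                  # encode the real observation once
    total_reward = 0.0
    for a in actions:
        state, reward = g(state, a) # imagined transitions, no real rules used
        total_reward += reward
    policy, value = f(state)        # evaluate the imagined leaf
    return total_reward, policy, value

# Toy model: state is a number, each action adds to it, reward equals the action
r, p, v = muzero_plan(
    obs=0,
    actions=[1, 2, 3],
    h=lambda o: o,
    g=lambda s, a: (s + a, float(a)),
    f=lambda s: ({"pass": 1.0}, s / 10.0),
)
assert (r, v) == (6.0, 0.6)
```

Because `g` is learned rather than given, errors in it compound over the unroll, which is exactly the model-exploitation risk flagged in the table above.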

The meta-lesson — System 1 + System 2

AlphaZero’s deepest insight isn’t about games. It’s about the relationship between fast intuition and slow deliberation:

System 1 (fast):  Neural network — instant pattern recognition
                  “This position looks good for White”

System 2 (slow):  MCTS — deliberate search through possibilities
                  “Let me verify by exploring move sequences…”

Neither alone is sufficient. Pure intuition (network without search) plays strong amateur chess. Pure deliberation (search without network) plays intermediate chess. Together: superhuman. This dual-process paradigm — learning provides intuition, search provides verification — may be the most important architectural insight in modern AI.

Scorecard

Dimension          Rating   Notes
Novelty            ★★★★★    First domain-general superhuman game-playing algorithm
Evidence quality   ★★★★     1,000-game matches, time-odds, addressed all criticism in Science paper
Technical depth    ★★★★     Elegant simplification from AlphaGo Zero; identical code for 3 games
Writing quality    ★★★★     Clear and concise; supplementary tables thorough
Longevity          ★★★★★    Spawned Stockfish NNUE, Lc0, MuZero; transformed chess culture; paradigm applies beyond games

Bottom line

The paper that proved AI can be domain-general. Not just superhuman at one game, but at three fundamentally different games with zero game-specific engineering. AlphaZero’s search + learning paradigm — where neural intuition and deliberate search amplify each other — has become the template for AI systems from protein folding to mathematical reasoning. And its chess is beautiful.

Quiz — Level 4
1. After AlphaZero, Stockfish integrated NNUE (neural evaluation) in 2020. What architectural choice differs from AlphaZero?
Stockfish took AlphaZero’s key insight (learned evaluation) but kept its own search (alpha-beta), creating a hybrid that’s now stronger than either pure approach.
2. A 2026 study tested AlphaZero-style self-play on Nim and found blind spots. What does this reveal?
Nim’s optimal strategy is an arithmetic XOR operation that doesn’t map to spatial patterns. Neural networks are powerful pattern matchers but can struggle with systematic rule-based reasoning.
3. MuZero removes AlphaZero’s need for game rules. Consider:
I. MuZero learns three networks: representation, dynamics, and prediction
II. MuZero uses the learned dynamics model instead of real rules during MCTS
III. MuZero achieves comparable performance without knowing the rules
IV. MuZero’s learned model is always more accurate than real rules
IV is false. A learned world model can be inaccurate, and the agent may find actions that score high in the imagined world but fail in reality — analogous to reward hacking.
4. AlphaZero learns each game independently from scratch. Why is this a significant limitation?
AlphaZero’s “generality” is in the algorithm, not in the learned knowledge. After 9 hours of chess training, it knows nothing about shogi. True generalization would transfer strategic concepts.
5. The neural network is “System 1” (fast intuition) and MCTS is “System 2” (slow deliberation). Why does this pairing outperform either alone?
Network intuition is fast but imperfect. Search is thorough but needs guidance. Together: the network narrows the search space, and search catches the network’s errors. This complementary pairing is the core architectural insight.