DeepMind — Nature, January 2016
By 2016, AI had conquered chess (Deep Blue, 1997), checkers (solved, 2007), and Jeopardy (Watson, 2011). But Go remained the “grand challenge” of game-playing AI.
| Dimension | Chess | Go |
|---|---|---|
| Board size | 8×8 (64 squares) | 19×19 (361 intersections) |
| Legal positions | ~10⁴⁷ | ~10¹⁷⁰ |
| Avg moves per turn | ~35 | ~250 |
| Game tree complexity | ~10¹²³ | ~10³⁶⁰ |
10³⁶⁰ is more than the number of atoms in the observable universe (~10⁸⁰) raised to the fourth power. Brute-force search — the approach that beat Kasparov — is physically impossible for Go.
Chess engines work by searching ahead: evaluate millions of future positions, pick the best move. This fails in Go for two reasons: the branching factor (~250 legal moves per turn) makes exhaustive lookahead intractable, and no one has found a hand-crafted function that reliably evaluates a Go position.
AlphaGo's insight: replace both components with neural networks:
Traditional game engine:
Hand-crafted evaluation → How good is this position?
Exhaustive search → What moves should I consider?
AlphaGo:
Value network → “How good is this position?” (learned)
Policy network → “What moves are worth considering?” (learned)
MCTS → Intelligent, guided search using both networks
Instead of searching everything, AlphaGo uses the policy network to focus search on promising moves (reducing breadth from 250 to ~10–20), and the value network to evaluate positions without searching to the end of the game (reducing depth).
Input: current board position (19×19 grid). Output: probability distribution over all 361 intersections.
Trained in two phases: (1) Supervised learning on 30 million positions from expert human games — predicts what the human played with 57% accuracy. (2) Reinforcement learning via self-play — discovers moves humans wouldn’t play. The RL policy wins 80% of games against the SL policy.
Input: current board position. Output: single number between −1 (black loses) and +1 (black wins).
Trained on 30 million self-play positions (one per game to avoid overfitting). Each position is labeled with the game’s eventual outcome.
Current position
Repeat 1600 times (simulations):
1. SELECT — Walk down tree, balancing exploitation and exploration
2. EXPAND — Use policy network for candidate moves + priors
3. EVALUATE — Blend value network score + fast rollout result
4. BACKUP — Propagate evaluation back up the tree
After all simulations: pick the most-visited move
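The loop above can be sketched in a few dozen lines of Python. This is a toy (the `policy` and `value` callables are stand-ins for the trained networks, and the sketch ignores whose turn it is when backing values up), but the four phases map one-to-one onto the steps listed:

```python
import math

C_PUCT = 1.0  # exploration constant

class Node:
    def __init__(self, prior):
        self.prior = prior       # P(s,a) from the policy network
        self.visits = 0          # N(s,a)
        self.value_sum = 0.0     # accumulated leaf evaluations
        self.children = {}       # move -> Node

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

def select_child(node):
    # 1. SELECT: PUCT score = Q + c * P * sqrt(N_parent) / (1 + N_child)
    total = sum(ch.visits for ch in node.children.values())
    return max(node.children.values(),
               key=lambda ch: ch.q()
               + C_PUCT * ch.prior * math.sqrt(total) / (1 + ch.visits))

def simulate(root, policy, value):
    path, node = [root], root
    while node.children:                     # walk down to a leaf
        node = select_child(node)
        path.append(node)
    for move, p in policy(node).items():     # 2. EXPAND with policy priors
        node.children[move] = Node(p)
    v = value(node)                          # 3. EVALUATE (rollout blend omitted)
    for n in path:                           # 4. BACKUP along the path
        n.visits += 1
        n.value_sum += v

root = Node(1.0)
for _ in range(100):
    simulate(root, policy=lambda n: {"A": 0.6, "B": 0.3, "C": 0.1},
             value=lambda n: 0.5)
# After all simulations: pick the most-visited move at the root
best = max(root.children, key=lambda m: root.children[m].visits)
```

In the real system each leaf evaluation also blends in a fast rollout; the visit-count argmax at the end is exactly the move-selection rule described above.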
Phase 1: Learn from humans
30M human positions (160K games) → SL policy network (57% accuracy)
Phase 2: Get better than humans
SL policy plays itself → RL policy (wins 80% vs SL)
Phase 3: Learn to evaluate positions
RL policy plays 30M self-play games → Value network
Phase 4: Combine everything
Policy + Value + MCTS = AlphaGo
AlphaGo’s Move 37 in Game 2 against Lee Sedol — a shoulder hit on the fifth line that no human would play — is considered one of the most creative moves in Go history. Professional commentators initially called it a mistake. It won the game.
AlphaGo proved that neural networks + search can solve problems that pure search and pure neural networks cannot solve alone. The policy network alone plays at strong amateur level. MCTS with random rollouts plays at intermediate level. Together: superhuman. This became the template for every DeepMind project that followed.
A 13-layer convolutional neural network:
Input: 19×19×48 tensor
└ 19×19 board (stone positions)
└ 48 feature planes (liberties, ko, recency, color, edge distance)
→ Layer 1: 5×5 filters, 192 channels
→ Layers 2–12: 3×3 filters, 192 channels
→ Layer 13: 1×1 filter → 1 channel
→ Softmax over 361 positions
→ Output: P(a|s) — probability of playing each intersection
No pooling layers — spatial resolution must be preserved because the output is a move on the board. The 48 input features encode domain knowledge: liberties, ladder status, stone recency, ko rules. This is significant manual feature engineering despite the “deep learning” framing.
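As a sanity check on the layer sizes listed above, a rough weight count (a sketch that ignores biases and any per-position bias terms in the final layer):

```python
# Rough parameter count for the 13-layer policy CNN described above
# (convolution weights only; biases ignored for simplicity).
def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

layer1 = conv_params(5, 48, 192)             # 5x5 filters, 48 -> 192 channels
layers2_12 = 11 * conv_params(3, 192, 192)   # eleven 3x3 layers, 192 channels
layer13 = conv_params(1, 192, 1)             # 1x1 filter -> 1 channel
total = layer1 + layers2_12 + layer13
print(total)  # 3880128, i.e. roughly 3.9M weights
```

The network is small by modern standards; much of its strength comes from the 48 domain-specific input planes and the search wrapped around it.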
Training data: 30 million position-move pairs from 160,000 games on the KGS Go server (6–9 dan amateur and professional games).
| Model | Prediction Accuracy |
|---|---|
| Previous SOTA (2015) | 44.4% |
| AlphaGo SL policy (12 layers) | 55.7% |
| AlphaGo SL policy (13 layers) | 57.0% |
57% sounds low, but when expanded to top-5 predictions, accuracy exceeds 85%. Multiple moves are often equally strong at any given position.
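The top-1 vs top-5 distinction is easy to make concrete. An illustrative sketch, with random distributions standing in for network outputs:

```python
import numpy as np

# A prediction counts as top-k correct if the true move is among the
# k highest-probability intersections.
def top_k_accuracy(probs, labels, k):
    # probs: (n, 361) move distributions; labels: (n,) true move indices
    topk = np.argsort(probs, axis=1)[:, -k:]
    return float(np.mean([labels[i] in topk[i] for i in range(len(labels))]))

rng = np.random.default_rng(0)
probs = rng.random((100, 361))
probs /= probs.sum(axis=1, keepdims=True)
labels = probs.argmax(axis=1)   # toy labels: true move = model's top choice
assert top_k_accuracy(probs, labels, 1) == 1.0
assert top_k_accuracy(probs, labels, 5) >= top_k_accuracy(probs, labels, 1)
```

Top-5 accuracy is by construction at least top-1 accuracy; for the real SL policy the gap (57% vs 85%+) reflects that several moves are often nearly equally strong.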
A separate lightweight model for fast game simulations during MCTS:
| | Main SL Policy | Rollout Policy |
|---|---|---|
| Architecture | 13-layer CNN | Linear softmax over local patterns |
| Accuracy | 57% | 24.2% |
| Speed | 3ms per move | 2µs per move (1,500× faster) |
Starting from the SL policy, RL fine-tunes by self-play using the REINFORCE algorithm:
Repeat:
1. Select opponent from pool of past RL versions
(random selection prevents overfitting to one opponent)
2. Play a full game: current RL policy vs. selected opponent
3. Reward: z = +1 if won, -1 if lost
4. Update: increase probability of moves that led to wins,
decrease moves that led to losses
The opponent pool prevents policy collapse — co-evolving into a narrow strategy that beats only itself.
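The numbered loop above can be sketched as follows (`play_game` and `update_policy` are hypothetical stand-ins for the real training machinery):

```python
import random

# One step of the self-play training loop with an opponent pool.
def train_step(current, pool, play_game, update_policy, rng=random):
    opponent = rng.choice(pool)                # 1. random past checkpoint
    moves, won = play_game(current, opponent)  # 2. play a full game
    z = 1 if won else -1                       # 3. terminal reward only
    update_policy(current, moves, z)           # 4. reinforce winning moves
    return z

# Toy run with dummies: the current policy "wins" its game.
updates = []
z = train_step(current="policy-v5", pool=["v1", "v2", "v3"],
               play_game=lambda a, b: (["m1", "m2"], True),
               update_policy=lambda p, m, z: updates.append((m, z)))
```

Periodically freezing the current policy into the pool is what turns this into the implicit population-based training described above.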
Same architecture as the policy network but outputs a single scalar via tanh (−1 to +1). Critically trained on self-play data with one position per game to avoid overfitting:
| Training Data | Train MSE | Test MSE |
|---|---|---|
| Human games (all positions) | 0.19 | 0.37 (memorization!) |
| Self-play (1 position/game) | 0.226 | 0.234 (generalizes) |
A single forward pass (~3ms) achieves accuracy comparable to averaging ~10,000 rollouts (~20 seconds) — a ~7,000× speedup.
At each node, the selection formula balances exploitation and exploration (PUCT):
a = argmax[ Q(s,a) + c · P(s,a) · sqrt(sum_b N(s,b)) / (1 + N(s,a)) ]
Q(s,a) = average value from simulations (exploitation)
P(s,a) = prior from policy network (guides exploration)
N(s,a) = visit count (more visits = less exploration bonus)
Leaf evaluation blends value network and rollout: V(s) = (1−λ)·v(s) + λ·z, where λ=0.5.
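A numeric illustration of the selection formula and the λ-blend (values chosen for illustration, not taken from the paper):

```python
import math

# PUCT score for one candidate move.
def puct(q, p, n_a, n_total, c=1.0):
    return q + c * p * math.sqrt(n_total) / (1 + n_a)

# Two candidates at a node visited 100 times:
# move A is well-explored and decent; move B is barely explored, high prior.
score_a = puct(q=0.55, p=0.10, n_a=80, n_total=100)
score_b = puct(q=0.50, p=0.40, n_a=5, n_total=100)
assert score_b > score_a   # the policy prior steers search toward B

# Leaf evaluation: blend value network v(s) with rollout outcome z.
def leaf_value(v, z, lam=0.5):
    return (1 - lam) * v + lam * z

assert abs(leaf_value(v=0.3, z=1.0) - 0.65) < 1e-12
```

As N(s,a) grows, the exploration bonus shrinks and Q(s,a) dominates, so search converges on moves that actually perform well in simulation.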
Subtle detail: MCTS uses the SL policy (not the stronger RL policy) for tree priors. The SL policy models a broader distribution of reasonable human moves, providing better exploration diversity. The RL policy is too peaked toward its own play style.
| Configuration | Elo |
|---|---|
| Raw RL policy (no search) | 2,181 |
| MCTS + rollouts only | 2,077 |
| MCTS + SL policy only | 2,890 |
| MCTS + value network only | 3,062 |
| MCTS + SL policy + value network | 3,492 |
| Full AlphaGo (all components) | 3,739 |
No single component is sufficient — the synergy is the breakthrough.
REINFORCE is the simplest policy gradient algorithm — and notoriously high-variance. Why not PPO or A2C? Neither existed yet (both appeared after this paper), and at this scale, averaging over millions of self-play games plus a simple baseline keeps the variance manageable.
The REINFORCE update with baseline: Δθ = α · Σ_t (z_t − v̄) · ∇_θ log π_θ(a_t | s_t)
The baseline v̅ (typically the value function estimate) reduces variance by crediting moves that performed better than expected, rather than all moves in winning games.
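A toy numeric check of the variance claim, with outcomes and score gradients drawn as independent random variables (a sketch, not the paper's actual training statistics):

```python
import numpy as np

# Subtracting a baseline from the returns shrinks the magnitude of the
# per-sample weights (z - baseline), which reduces the variance of the
# gradient estimate.
rng = np.random.default_rng(0)
grad_logp = rng.normal(size=1000)   # stand-in for grad log pi(a|s) samples
z = rng.choice([1.0, -1.0], size=1000, p=[0.9, 0.1])  # game outcomes
baseline = z.mean()                 # v-bar: expected outcome

g_plain = z * grad_logp             # vanilla REINFORCE weights
g_base = (z - baseline) * grad_logp # baseline-corrected weights

assert g_base.var() < g_plain.var()
```

In real REINFORCE the baseline leaves the expected gradient unchanged because E[∇ log π] = 0; only the variance drops.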
The RL policy maintains a pool of past checkpoints. Each game is played against a random past version, not the current self. Without this, self-play tends to collapse into a narrow strategy that exploits only its most recent opponent.
The pool forces robustness against diverse opponents — an implicit form of population-based training.
| Experiment | Train MSE | Test MSE | Verdict |
|---|---|---|---|
| Human games, all positions | 0.19 | 0.37 | Memorizes trajectories |
| Human games, held-out games | 0.19 | 0.34 | Still bad — similar positions across games |
| Self-play, 1 position/game | 0.226 | 0.234 | Generalizes — no overfitting |
General principle: When training examples are temporally correlated (adjacent timesteps, consecutive frames, nearby sentences), the network can memorize sequential patterns rather than learning the underlying function. Breaking correlation forces genuine learning. This applies to video understanding, RL, and any sequential prediction task.
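The fix is mechanically simple: sample one position per game instead of every position. A sketch:

```python
import random

# Decorrelation fix: draw a single (position, outcome) pair from each game,
# so training examples are (nearly) independent draws rather than
# temporally correlated trajectories.
def sample_decorrelated(games, rng=random):
    # games: list of games, each a list of (position, outcome) pairs
    return [rng.choice(game) for game in games]

# Toy data: 4 games of 50 positions each, labeled with the game outcome.
games = [[(f"g{g}_pos{t}", 1 if g % 2 == 0 else -1) for t in range(50)]
         for g in range(4)]
batch = sample_decorrelated(games)
assert len(batch) == 4                   # exactly one example per game
assert len({p for p, _ in batch}) == 4   # all from distinct games
```

The cost is data efficiency (49 of every 50 positions are discarded here), which is why the value network needed 30 million self-play games.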
The distributed version runs many MCTS threads in parallel. Problem: without coordination, multiple threads explore the same promising branch simultaneously. The solution is a virtual loss: each thread temporarily records a loss along the path it is currently exploring:
Thread 1 starts exploring move A:
Q(s, A) temporarily decreased
N(s, A) temporarily increased
→ Other threads see move A as less attractive
→ They explore moves B, C, D instead
→ Thread 1 finishes, removes virtual loss, updates real value
This spreads threads across different parts of the tree without explicit communication — an elegant lock-free parallelism technique.
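A minimal sketch of the bookkeeping on a single tree edge (the virtual-loss constant here is a hypothetical value, not taken from the paper):

```python
VIRTUAL_LOSS = 3  # assumed constant for illustration

class Edge:
    def __init__(self):
        self.n = 0        # visit count N(s,a)
        self.w = 0.0      # total value W(s,a)

    def q(self):
        return self.w / self.n if self.n else 0.0

    def apply_virtual_loss(self):
        # Pretend the in-flight simulation already visited and lost.
        self.n += VIRTUAL_LOSS
        self.w -= VIRTUAL_LOSS

    def revert_and_update(self, v):
        # Undo the pretend visits/losses, record the real result.
        self.n += 1 - VIRTUAL_LOSS
        self.w += VIRTUAL_LOSS + v

edge = Edge()
edge.apply_virtual_loss()
assert edge.q() < 0                 # looks unattractive while in flight
edge.revert_and_update(0.8)
assert edge.n == 1 and abs(edge.q() - 0.8) < 1e-9
```

Because the penalty is applied before the simulation and exactly undone after, the tree statistics end up identical to the sequential case; only the interleaving changes.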
AlphaGo uses adaptive time allocation based on position complexity.
Go boards have 8-fold symmetry (4 rotations × 2 reflections). Each network evaluation randomly selects one of 8 transformations, effectively averaging over symmetries without extra compute. During training, all 8 symmetries are used for data augmentation.
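Generating the eight transformations is a few lines of NumPy (a sketch):

```python
import numpy as np

# The eight board symmetries: four rotations, each optionally reflected.
def symmetries(board):
    out = []
    for k in range(4):
        r = np.rot90(board, k)
        out.append(r)
        out.append(np.fliplr(r))
    return out

board = np.arange(19 * 19).reshape(19, 19)
syms = symmetries(board)
assert len(syms) == 8
# For a generic (asymmetric) position, all eight transforms are distinct.
assert len({s.tobytes() for s in syms}) == 8
```

At evaluation time one transform is sampled per network call; at training time all eight serve as free data augmentation.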
Master (search control)
└ 1,920 CPUs: rollouts, tree traversal
└ 280 GPUs: policy + value network evaluation
Asynchronous pipeline keeps both fully utilized
| Version | Hardware | Elo |
|---|---|---|
| Single machine | 48 CPUs, 8 GPUs | 2,890 |
| Distributed | 1,920 CPUs, 280 GPUs | 3,739 |
~850 Elo improvement from distribution alone — same networks, more search. Equivalent to 2–3 professional ranks.
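Under the standard Elo model, an 849-point gap translates to a near-certain win for the stronger side:

```python
# Expected score under the Elo model: E = 1 / (1 + 10^(-diff / 400)).
def elo_expected(diff):
    return 1.0 / (1.0 + 10 ** (-diff / 400))

assert elo_expected(0) == 0.5            # equal players split games
p = elo_expected(3739 - 2890)            # distributed vs single machine
assert p > 0.99                          # wins essentially every game
```

This is why "same networks, more search" is such a striking result: the extra hardware buys nothing but deeper, wider MCTS.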
AlphaGo required the convergence of several advances: deep CNNs (2012–2015), GPU computing maturity, MCTS (Coulom, 2006), the UCT formula (Kocsis & Szepesvári, 2006), move prediction from CNNs (Clark & Storkey, 2014), and batch normalization (2015). Without any single one, AlphaGo couldn’t have been built.
Human data dependency: The pipeline starts with 30M human expert games, creating a ceiling effect (inherits human blind spots), domain specificity (48 hand-crafted features don’t transfer to chess), limited data availability, and a bootstrap problem (needs human experts to exist first).
Rollout weakness: The fast rollout policy (24.2% accuracy) is most unreliable in tactically sharp positions where one wrong move changes the outcome — precisely when evaluation matters most. AlphaGo Zero later dropped rollouts entirely, and performance improved.
Compute: The Lee Sedol match system used 1,920 CPUs + 280 GPUs, estimated at over $1M total. A research demonstration, not a scalable product.
AlphaGo (2016, Nature)
Human data + hand-crafted features + rollouts + value + policy
└→ AlphaGo Zero (2017, Nature)
│ NO human data, NO features, NO rollouts. Beat AlphaGo 100-0.
└→ AlphaZero (2017/2018, Science)
│ Generalized to chess + shogi. Same algorithm, three games.
└→ MuZero (2019/2020, Nature)
Doesn’t know the rules. Learns world model from interaction.
Each step removed a dependency: human data → game specificity → game rules. AlphaGo Zero surpassed the Lee Sedol version after approximately 3 days (~72 hours) of pure self-play training, proving the human data was not just unnecessary — it was a bottleneck.
AlphaGo’s architecture (neural network evaluation + tree search) became a blueprint for diverse domains:
| Domain | Descendant | What It Does |
|---|---|---|
| Protein folding | AlphaFold | Predicts 3D protein structures from amino acid sequences |
| Chip design | Google TPU layout | Optimizes component placement for wire length and power |
| Theorem proving | AlphaGeometry | Solves International Math Olympiad geometry problems |
| Matrix multiplication | AlphaTensor | Discovers faster algorithms for matrix operations |
| Code generation | AlphaCode | Generates competitive programming solutions |
The insight: any problem frameable as sequential decisions over evaluable states can benefit from the AlphaGo paradigm.
AlphaGo’s Move 37 in Game 2 — estimated by the policy network at 1-in-10,000 probability of being played by a human — sparked debate about machine creativity. It emerged from RL self-play, not human imitation. Professional commentators called it a mistake before realizing it was brilliant.
In Game 4, Lee Sedol played Move 78 — a wedge that the policy network also estimated at 1-in-10,000 probability. It exploited AlphaGo’s coverage problem: moves with near-zero policy priors are never explored deeply by MCTS, even if they’re brilliant. AlphaGo’s response was erratic. Lee Sedol won decisively.
Pure learning (no search): Fast but limited by network capacity. Strong amateur.
Pure search (no learning): Slow but theoretically optimal. Intermediate.
Learning + Search: Search amplifies learned knowledge. Superhuman.
Neural networks provide fast, approximate intuition. Search provides slow, precise verification. The combination is greater than either alone. This principle carries through to AlphaFold, AlphaTensor, AlphaGeometry, and beyond.
The Lee Sedol match (March 2016) generated 200 million Weibo pageviews and tens of millions of live stream viewers. Lee Sedol retired from professional Go in November 2019, citing AI’s dominance. The Ke Jie match (May 2017) saw AlphaGo win 3–0 against the world #1.
| Dimension | Rating | Notes |
|---|---|---|
| Novelty | ★★★★★ | First superhuman Go AI; combined known techniques in a new way |
| Impact | ★★★★★ | Spawned an entire research paradigm; changed public perception of AI |
| Reproducibility | ★★★ | Architecture is detailed but compute requirements make reproduction expensive |
| Technical depth | ★★★★ | Elegant integration of multiple components, each well-motivated |
| Writing quality | ★★★★★ | One of the best-written Nature papers in recent memory |
| Longevity | ★★★★★ | Search + learning paradigm is foundation of everything DeepMind has done since |
The paper that made the world take AI seriously. Not the first superhuman game AI (Deep Blue, 1997), but the first to suggest that AI could develop genuine strategic intuition — and the first to hint that this intuition might transfer beyond games. The search + learning paradigm it established is now the foundation of DeepMind’s work in protein folding, mathematics, chip design, and beyond.