Jumper, Evans, Pritzel et al. — DeepMind — Nature, July 2021
Proteins are molecular machines that do almost everything in your body: digest food, fight infections, carry oxygen, read DNA. Every protein starts as a chain of amino acids (think: a string of beads in 20 different colors, one color per amino acid type). A typical protein has 100–1,000 of them strung together.
This chain doesn’t stay flat. Typically within microseconds to seconds, it folds into a specific 3D shape. That shape determines what the protein does. Get the shape wrong, and the protein can malfunction; misfolding is implicated in diseases like Alzheimer’s, cystic fibrosis, and some cancers.
Given only the sequence of amino acids (the “string of beads”), predict the final 3D shape. This has been one of biology’s grand challenges for 50 years. Experimental methods (X-ray crystallography, cryo-EM) cost $50K–$100K per protein and take months to years. There are ~200 million known protein sequences but only ~170,000 experimentally solved structures.
CASP (Critical Assessment of Structure Prediction) is the Olympics of protein structure prediction, held every two years since 1994. Labs around the world try to predict structures for proteins whose real structures have been solved but not yet published. The metric is GDT (Global Distance Test): 0–100, where >90 is considered “experimental quality.”
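To make the GDT metric concrete, here is a minimal sketch of the commonly used GDT_TS variant, which averages the fraction of residues within 1, 2, 4 and 8 Å of their true positions. It assumes the predicted and experimental Cα coordinates are already superimposed (the real procedure searches over many superpositions for each cutoff), and the function name `gdt_ts` is purely illustrative.

```python
import numpy as np

def gdt_ts(pred_ca, true_ca, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Approximate GDT_TS (0-100) for pre-superimposed C-alpha coordinates.

    Assumption: the two structures are already aligned. The real GDT
    procedure searches over superpositions separately for each cutoff.
    """
    pred_ca = np.asarray(pred_ca, dtype=float)
    true_ca = np.asarray(true_ca, dtype=float)
    dist = np.linalg.norm(pred_ca - true_ca, axis=-1)  # per-residue error in Å
    fractions = [(dist <= c).mean() for c in cutoffs]   # share of residues within each cutoff
    return 100.0 * float(np.mean(fractions))

# Toy example: four residues with errors of 0.5, 1.5, 3.0 and 10.0 Å.
true_ca = np.zeros((4, 3))
pred_ca = np.array([[0.5, 0, 0], [1.5, 0, 0], [3.0, 0, 0], [10.0, 0, 0]])
score = gdt_ts(pred_ca, true_ca)
```

In this toy case the four cutoffs capture 25%, 50%, 75% and 75% of residues, so a single badly placed residue caps the score well below the 90 mark.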
At CASP14 (2020), AlphaFold 2 didn’t just win: its median GDT of about 92 cleared the experimental-quality threshold that the field thought was years away.
AlphaFold 2 has three key insights:
Your protein sequence has cousins across millions of species. By aligning these related sequences (a Multiple Sequence Alignment or MSA), you can spot patterns: “When position 10 changes, position 50 always changes too.” This co-evolution signal implies those positions are physically close in 3D — they need to change together to keep the protein functional.
Sequence 1: A L G V D K ... (human)
Sequence 2: A L D V D K ... (mouse)
Sequence 3: S L D I E K ... (fish)
Sequence 4: A M G V D K ... (bird)
Positions 3 and 5 co-vary:
G↔D, D↔D, D↔E, G↔D
→ These positions are likely close in 3D space
MSA is purely linear sequence data — no 3D information. But the co-evolutionary patterns hidden within it encode structural information. AlphaFold 2’s job is to decode that signal.
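One classical way to quantify this kind of co-variation is the mutual information between two alignment columns. The sketch below applies it to the toy alignment above; it is not how AlphaFold 2 extracts the signal (the network learns that end to end), and the helper name `column_mutual_information` is an illustrative assumption.

```python
from collections import Counter
from math import log2

def column_mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j (0-based)."""
    n = len(msa)
    col_i = [seq[i] for seq in msa]
    col_j = [seq[j] for seq in msa]
    count_i, count_j = Counter(col_i), Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in count_ij.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written in terms of raw counts
        mi += (count / n) * log2(count * n / (count_i[a] * count_j[b]))
    return mi

msa = ["ALGVDK", "ALDVDK", "SLDIEK", "AMGVDK"]
mi_3_5 = column_mutual_information(msa, 2, 4)  # positions 3 and 5 in the text
mi_3_6 = column_mutual_information(msa, 2, 5)  # position 6 is fully conserved
```

Positions 3 and 5 share about 0.31 bits, while a fully conserved column carries none. With only four sequences the estimate is very noisy, which is one reason deep alignments matter.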
Traditional methods looked at each amino acid independently. AlphaFold 2 maintains a pair representation — a matrix tracking what every residue “knows” about every other residue. This is like having a giant spreadsheet where row 10, column 50 says “these two residues are probably 5Å apart and co-evolved strongly.”
Instead of trying to fold a chain step by step, AlphaFold 2 starts with all residues floating freely in space (a “gas”) and gradually moves them into their correct positions. Each residue is represented as a rigid frame (position + orientation), and the network iteratively refines all frames simultaneously.
Amino acid sequence
↓
Database search (JackHMMER / HHblits)
↓
Multiple Sequence Alignment (MSA)
↓
┌─────────────────────────┐
│ EVOFORMER │ ← 48 blocks of attention
│ MSA repr ↔ Pair repr │ (the core innovation)
│ (rows × cols attention) │
└─────────────────────────┘
↓
┌─────────────────────────┐
│ STRUCTURE MODULE │ ← Invariant Point Attention
│ “Residue gas” → 3D │ (rotations/translations)
└─────────────────────────┘
↓
3D protein structure + confidence (pLDDT)
AlphaFold 2 doesn’t just predict a structure — it tells you how confident it is for each residue. The predicted Local Distance Difference Test (pLDDT) ranges from 0 to 100:
| pLDDT | Meaning |
|---|---|
| > 90 | Very high confidence — trust this prediction |
| 70–90 | Good confidence — backbone reliable, some side-chain uncertainty |
| 50–70 | Low confidence — treat with caution |
| < 50 | Very low — may be intrinsically disordered (no fixed structure) |
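When post-processing predictions, the bands in the table above are often applied per residue. A minimal sketch follows; the band labels and the handling of the exact boundary values (90 and 70) are assumptions, since the table does not specify them.

```python
def plddt_band(score):
    """Map a per-residue pLDDT (0-100) to a confidence band.

    Assumption: a score of exactly 90 falls in the 70-90 band and a score
    of exactly 70 or 50 falls in the higher of its two neighbouring bands.
    """
    if not 0 <= score <= 100:
        raise ValueError("pLDDT must lie in [0, 100]")
    if score > 90:
        return "very high"
    if score >= 70:
        return "confident"
    if score >= 50:
        return "low"
    return "very low"

per_residue = [95.2, 88.0, 63.5, 31.0]
bands = [plddt_band(s) for s in per_residue]
```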
AlphaFold 2 largely solved single-chain structure prediction by treating it as an information extraction problem (mining co-evolutionary signals from millions of related sequences) rather than a physics simulation. The key was building the right attention architecture (Evoformer) to decode that evolutionary signal into 3D coordinates.
The Evoformer is a stack of 48 identical blocks, each updating two representations simultaneously:
| Representation | Shape | What it encodes |
|---|---|---|
| MSA representation | Nseq × Nres × 256 | Per-sequence, per-position features — what each sequence “knows” about each position |
| Pair representation | Nres × Nres × 128 | Pairwise relationship between every residue pair — distance, orientation, co-evolution signals |
These two representations talk to each other every block through specific information pathways:
This is the mechanism that converts evolutionary information into pairwise structural information:
For each pair of residue positions (i, j):
1. Take column i from the MSA representation (N_seq vectors)
2. Take column j from the MSA representation (N_seq vectors)
3. Compute outer product of each pair of vectors
4. Average across all sequences
5. Project to update pair[i][j]
Intuition: “What do all the sequences collectively say
about the relationship between position i and position j?”
This is where co-evolution gets directly injected into the pair representation. If positions i and j co-evolve strongly, their MSA columns will have correlated patterns that produce a distinctive outer product signature.
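The five steps above can be sketched in NumPy as follows. This is a simplified illustration, not the reference implementation: it omits layer normalization and biases, random matrices stand in for learned weights, and the names `w_a`, `w_b`, `w_out` and the small channel sizes are assumptions chosen for readability.

```python
import numpy as np

def outer_product_mean(msa_repr, w_a, w_b, w_out):
    """msa_repr: (n_seq, n_res, c_m) -> pair update: (n_res, n_res, c_z)."""
    a = msa_repr @ w_a  # (n_seq, n_res, c): projected features, used for column i
    b = msa_repr @ w_b  # (n_seq, n_res, c): projected features, used for column j
    # Outer product of a[s, i] and b[s, j], averaged over sequences s.
    outer = np.einsum("sic,sjd->ijcd", a, b) / msa_repr.shape[0]
    n_res = msa_repr.shape[1]
    # Flatten the (c, c) outer product and project to the pair channel size.
    return outer.reshape(n_res, n_res, -1) @ w_out

rng = np.random.default_rng(0)
n_seq, n_res, c_m, c, c_z = 8, 5, 16, 4, 6
msa_repr = rng.normal(size=(n_seq, n_res, c_m))
w_a, w_b = rng.normal(size=(c_m, c)), rng.normal(size=(c_m, c))
w_out = rng.normal(size=(c * c, c_z))
pair_update = outer_product_mean(msa_repr, w_a, w_b, w_out)  # shape (5, 5, 6)
```

Projecting to a small channel size before the outer product keeps the intermediate tensor manageable, since its size grows with the square of that channel count.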
The pair representation must obey the triangle inequality: if residue A is close to B, and B is close to C, then A must be somewhat close to C. Standard attention doesn’t enforce this. Triangle updates do:
To update pair(i, j), consider ALL intermediate residues k:
“Outgoing edges”: pair(i,k) × pair(j,k) → update pair(i,j)
“Incoming edges”: pair(k,i) × pair(k,j) → update pair(i,j)
Intuition: “What does the rest of the protein tell me
about the relationship between i and j?”
i ─── j
\ /
\ /
k ← intermediate residue provides geometric constraint
This is computationally O(N³) per block — for each pair (i,j), you sum over all k — which is expensive but essential for geometric consistency.
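The core of both triangle updates is a sum over the intermediate residue k, which can be written as a single `einsum`. The sketch below shows only that contraction; it assumes `pair_a` and `pair_b` are already projected from the pair representation and leaves out the gating, normalization and output projection a full block would need. The function name is illustrative.

```python
import numpy as np

def triangle_multiply(pair_a, pair_b, outgoing=True):
    """Combine edges through every intermediate residue k.

    pair_a, pair_b: (n_res, n_res, c) projections of the pair representation.
    Returns an (n_res, n_res, c) update for pair(i, j).
    """
    if outgoing:
        # update[i, j] = sum_k a[i, k] * b[j, k]
        return np.einsum("ikc,jkc->ijc", pair_a, pair_b)
    # update[i, j] = sum_k a[k, i] * b[k, j]
    return np.einsum("kic,kjc->ijc", pair_a, pair_b)

rng = np.random.default_rng(0)
a = rng.normal(size=(6, 6, 3))
b = rng.normal(size=(6, 6, 3))
out_update = triangle_multiply(a, b, outgoing=True)
in_update = triangle_multiply(a, b, outgoing=False)
```

The cubic cost is visible in the index pattern: there are n_res² output entries, and each one sums over n_res values of k.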
When homologous structures exist in PDB, AlphaFold 2 can use them as templates. Template features (backbone distances, torsion angles) are projected and added as a bias to the pair representation. Two of the five models use templates; three don’t.
Templates from close homologs are very helpful. But templates from distant homologs can actually mislead the network, anchoring it to an incorrect fold. The template-free models avoid this risk entirely. Omitting templates also prevents the model from becoming dependent on template availability and helps when the target has a truly novel fold.
The Frame Aligned Point Error (FAPE) is AlphaFold 2’s primary loss function, measuring structural accuracy in a rotation/translation invariant way:
For each pair of residues (i, j):
1. Look at the predicted structure from residue i’s reference frame
2. Look at the true structure from residue i’s reference frame
3. Compute the distance between predicted and true position of j
4. Average over all (i, j) pairs
Why frames? Two structures can be identical but rotated
differently in space. FAPE compares local geometry,
not global orientation, making it invariant to rigid-body
transformations.
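The FAPE recipe above can be sketched in NumPy. This is a simplified version under stated assumptions: frames are given as rotation matrices plus origins, the clamp defaults to 10 Å, and the extra scaling and numerical-stability constants of a production loss are omitted. `fape` and `random_rotation` are illustrative names.

```python
import numpy as np

def fape(pred_rot, pred_trans, pred_pos, true_rot, true_trans, true_pos, clamp=10.0):
    """Clamped mean frame-aligned point error.

    *_rot: (n, 3, 3) rotation matrices, *_trans: (n, 3) frame origins,
    *_pos: (m, 3) atom positions, all in global coordinates (Å).
    """
    # Express every position j in every local frame i: x_local = R_i^T (x_j - t_i)
    pred_local = np.einsum("iba,ijb->ija", pred_rot, pred_pos[None] - pred_trans[:, None])
    true_local = np.einsum("iba,ijb->ija", true_rot, true_pos[None] - true_trans[:, None])
    err = np.linalg.norm(pred_local - true_local, axis=-1)  # (n, m) errors
    return float(np.minimum(err, clamp).mean())

def random_rotation(rng):
    """Random proper rotation matrix via QR decomposition."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))
    if np.linalg.det(q) < 0:  # flip one axis so the determinant is +1
        q[:, 0] = -q[:, 0]
    return q

rng = np.random.default_rng(0)
n = 5
true_rot = np.stack([random_rotation(rng) for _ in range(n)])
true_trans = rng.normal(size=(n, 3)) * 5
true_pos = true_trans.copy()  # use the frame origins as the points

# Apply one global rigid motion x -> G x + s to the whole structure.
G, s = random_rotation(rng), np.array([10.0, -3.0, 7.0])
moved_rot = np.einsum("ab,ibc->iac", G, true_rot)
moved_trans = true_trans @ G.T + s
moved_pos = true_pos @ G.T + s
same_shape = fape(moved_rot, moved_trans, moved_pos, true_rot, true_trans, true_pos)

# Displace a single point by 1 Å: it is off by 1 Å in each of the n frames.
perturbed = true_pos.copy()
perturbed[0] += [1.0, 0.0, 0.0]
one_off = fape(true_rot, true_trans, perturbed, true_rot, true_trans, true_pos)
```

The rigidly moved copy scores (numerically) zero, while the single displaced point contributes 1 Å in each of the five frames and nothing elsewhere, giving a mean of 0.2 Å.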
AlphaFold 2 runs the entire Evoformer + Structure Module pipeline three times, feeding the output of each cycle back as input to the next:
Cycle 1: MSA + pair repr → Evoformer → Structure → 3D coords (draft 1)
↑ │
└──────────── feed back pair + coords ─────────┘
Cycle 2: improved input → Evoformer → Structure → 3D coords (draft 2)
↑ │
└──────────── feed back pair + coords ─────────┘
Cycle 3: further refined → Evoformer → Structure → FINAL structure
Each cycle refines the structure. During training, the loss is computed on the final cycle only and gradients are stopped between cycles, so backpropagation runs through the last pass alone; all cycles share the same weights.
The Evoformer’s genius is the bidirectional information flow between MSA and pair representations. The outer product mean converts evolutionary signals to structural signals; triangle updates enforce geometric consistency; and recycling lets the network iteratively refine its predictions across multiple passes.
AlphaFold 2 was trained in carefully staged phases:
| Stage | Crop Size | Details |
|---|---|---|
| 1. Initial training | 256 residues | ~170K PDB structures (clustered at 40% seq identity), 128-seq MSA clusters, ~300K steps. Learn basic fold recognition and attention patterns. |
| 2. Fine-tuning | 384 residues | Larger crops, structure violation loss added. Handle longer proteins, enforce physical constraints. |
| 3. Self-distillation | 384 residues | “Noisy student”: use the trained model to predict structures for sequences with no experimental data, then retrain on real + predicted structures. |
Only high-confidence predictions (pLDDT > 70) were used as pseudo-labels. Low-confidence predictions were discarded to prevent training on garbage. This massively expanded the effective training set beyond the ~170K PDB structures to millions of protein sequences.
AlphaFold 2’s training uses six loss terms working together:
| Loss | What It Teaches | Detail |
|---|---|---|
| FAPE backbone | Global fold accuracy | Frame Aligned Point Error on Cα atoms; clamped at 10Å to prevent outlier residues from dominating gradients |
| FAPE sidechain | Local rotamer accuracy | Same metric on all-atom positions, using side-chain reference frames |
| Distogram | Pairwise distances | Cross-entropy on binned Cβ–Cβ distances (Cα for glycine; 64 bins, 2–22Å); regularizer for pair representation |
| Masked MSA | Evolutionary understanding | BERT-like: mask 15% of MSA positions, predict the amino acids. Forces genuine evolutionary pattern learning. |
| Violation | Physical realism | Penalizes bond length/angle violations, steric clashes, chain breaks. Added only in Stage 2. |
| Experimentally resolved | Confidence calibration | Per-residue prediction of whether each atom has experimental coordinates in the PDB entry. |
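As an illustration of the distogram row in the table above, the sketch below turns pairwise distances into class labels for a cross-entropy target. It uses the 64 bins over 2–22 Å quoted in the table with evenly spaced edges; the exact edges used by AlphaFold 2 may differ slightly, and `distogram_labels` is an illustrative name.

```python
import numpy as np

def distogram_labels(coords, d_min=2.0, d_max=22.0, n_bins=64):
    """Bin pairwise distances (Å) into n_bins classes for a cross-entropy target."""
    coords = np.asarray(coords, dtype=float)
    dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    edges = np.linspace(d_min, d_max, n_bins - 1)  # 63 edges define 64 bins
    # Bin 0 holds distances below d_min; the last bin holds everything >= d_max.
    return np.digitize(dist, edges)

coords = np.array([[0.0, 0, 0], [3.8, 0, 0], [30.0, 0, 0]])
labels = distogram_labels(coords)
```

Distances beyond the upper edge all land in the last bin, so the head only has to say "far apart" rather than regress large distances precisely.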
clamped_error(i, j) = min(error(i, j), 10Å)   (applied to each residue pair before averaging)
Without clamping:
One badly predicted residue 50Å away generates huge loss
→ Gradient dominated by one outlier
→ Network optimizes that residue at expense of everything else
With clamping at 10Å:
Outliers contribute at most 10Å of loss each
→ Balanced gradients across all residues
→ Network improves overall structure, not just worst cases
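This is similar in spirit to robust regression losses such as Huber loss, with one difference: a clamped error contributes a constant, so it passes no gradient at all, whereas Huber loss still penalizes outliers linearly.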
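A small numeric example shows the effect of clamping on a batch of per-pair errors with one 50 Å outlier. The error values are made up for illustration.

```python
import numpy as np

errors = np.array([0.5, 1.0, 0.8, 1.2, 50.0])  # per-pair errors in Å, one outlier
clamp = 10.0

raw_loss = errors.mean()
clamped_loss = np.minimum(errors, clamp).mean()

# Share of the total loss contributed by the outlier, before and after clamping.
raw_share = errors[-1] / errors.sum()
clamped_share = min(errors[-1], clamp) / np.minimum(errors, clamp).sum()

# Derivative of each clamped term with respect to its own error:
# 1 below the clamp, 0 above it (the clamped outlier passes no gradient).
clamped_grad = (errors < clamp).astype(float)
```

Here the raw mean is 10.7 Å with the outlier supplying about 93% of it; after clamping the mean drops to 2.7 Å, and the gradient comes entirely from the four well-behaved terms.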
Original MSA: A L G V D
A L D V D
S L D I E
A M G V D
Masked (15%): A L [M] V D
A [M] D V [M]
S L [M] I E
A M G [M] D
Task: predict masked amino acids from context
This BERT-like auxiliary loss forces the Evoformer to genuinely understand evolutionary patterns rather than simply passing MSA features through without processing them.
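The masking step can be sketched as below. It is a simplification: positions are hidden independently at a 15% rate with a single mask symbol, whereas BERT-style schemes also sometimes substitute a different residue or leave the position unchanged. The mask token `#` and the function name `mask_msa` are assumptions.

```python
import numpy as np

def mask_msa(msa, mask_rate=0.15, mask_token="#", seed=0):
    """Randomly hide a fraction of MSA positions; return the masked MSA and the mask."""
    rng = np.random.default_rng(seed)
    arr = np.array([list(seq) for seq in msa])   # (n_seq, n_res) array of characters
    mask = rng.random(arr.shape) < mask_rate     # True where a residue is hidden
    masked = np.where(mask, mask_token, arr)
    return ["".join(row) for row in masked], mask

msa = ["ALGVD", "ALDVD", "SLDIE", "AMGVD"]
masked_msa, mask = mask_msa(msa)
# The training targets are the original residues at the hidden positions.
targets = [msa[r][c] for r, c in zip(*np.nonzero(mask))]
```

The loss is then a cross-entropy between the network's predicted amino-acid distribution at each hidden position and the corresponding entry in `targets`.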
AlphaFold 2 trains five separate models with different configurations (two use templates, three do not). All five are run independently and the highest-confidence prediction (ranked by pLDDT) wins.
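The selection step amounts to a simple ranking. The sketch below assumes each model reports per-residue pLDDT and that models are compared by the mean over residues; the model names and scores are made up.

```python
import numpy as np

def pick_best_model(plddt_by_model):
    """Return (name, mean pLDDT) of the model with the highest mean per-residue pLDDT."""
    means = {name: float(np.mean(scores)) for name, scores in plddt_by_model.items()}
    best = max(means, key=means.get)
    return best, means[best]

plddt_by_model = {
    "model_1": [92.0, 88.0, 75.0],
    "model_2": [90.0, 91.0, 80.0],
    "model_3": [85.0, 70.0, 60.0],
}
best_name, best_score = pick_best_model(plddt_by_model)
```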
MSA Depth vs Accuracy:
>1000 sequences: Median GDT > 90 (near-experimental)
100–1000: Median GDT 70–85 (good, some details wrong)
30–100: Median GDT 50–70 (rough fold)
<30: Often fails completely
Single sequence: Near-random for most proteins
Co-evolution is the primary signal. With few sequences, the outer product mean has no data to extract — the pair representation stays uninformative. Orphan proteins (~10–15% of known families) remain AlphaFold 2’s biggest failure mode.
Skolnick (2021) argued AF2 is fundamentally a very sophisticated fold recognition algorithm: the library of single-domain protein folds in PDB is essentially complete — all possible domain topologies are already represented. AF2 has learned to map any sequence to the correct existing fold, then refine local details. This explains both its success (single domains) and its limitations (truly novel folds).
| Limitation | Detail |
|---|---|
| Intrinsically disordered regions | ~30% of human proteome has no fixed 3D structure. AF2 predicts one arbitrary conformation with low pLDDT, but cannot distinguish “genuinely disordered” from “insufficient data.” |
| Conformational states | Proteins often switch between states (e.g., active/inactive kinase). AF2 predicts the single dominant conformation in PDB training data. Alternative states are invisible. |
| Protein complexes | Designed for single chains. Multi-chain prediction requires AlphaFold-Multimer (2021) or AlphaFold 3 (2024). |
| Mutations & stability | Wild-type and mutant sequences often produce identical structures. AF2 is not a thermodynamic stability predictor. |
Pair representation: N_res × N_res × 128 (quadratic in protein length)
Triangle attention: O(N³) per Evoformer block
48 blocks × 3 cycles: ~144 Evoformer block evaluations per prediction
~1000 residues: ~16 GB GPU, minutes
~2000 residues: ~64 GB GPU, hours
>2500 residues: typically split into domains and predicted separately
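The scaling behind these numbers can be checked with back-of-the-envelope arithmetic. The sketch below counts only a single float32 copy of the pair tensor and the multiply-adds of one triangle update; real memory use is much higher because many intermediate activations are kept, so treat the absolute values as lower bounds.

```python
def pair_repr_bytes(n_res, channels=128, bytes_per_value=4):
    """Memory for one N_res x N_res x channels pair tensor (float32 by default)."""
    return n_res * n_res * channels * bytes_per_value

def triangle_ops(n_res, channels=128):
    """Multiply-adds for one triangle update: every (i, j) pair sums over all k."""
    return n_res ** 3 * channels

# Size in GiB and operation count for a few protein lengths.
scaling = {
    n: (pair_repr_bytes(n) / 2**30, triangle_ops(n))
    for n in (500, 1000, 2000)
}
```

Doubling the length from 1,000 to 2,000 residues quadruples the pair tensor (matching the 16 GB to 64 GB jump quoted above) and multiplies the triangle-update work by eight.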
AlphaFold 2’s training is a masterclass in engineering: staged curriculum, robust loss functions with FAPE clamping, BERT-like auxiliary losses for representation quality, self-distillation for data augmentation, and a five-model ensemble for robustness. The system’s accuracy depends critically on MSA depth, and its fundamental limitation is predicting static snapshots of dynamic proteins.
AlphaFold 3 (May 2024, Nature) isn’t an incremental update — it’s a fundamental architectural redesign co-developed by DeepMind and Isomorphic Labs.
| Dimension | AlphaFold 2 (2020) | AlphaFold 3 (2024) |
|---|---|---|
| Scope | Single protein chains | Proteins + DNA + RNA + ligands + ions |
| Input tokens | Per-residue (one token = one residue) | Per-residue or per-nucleotide for standard polymers; per-atom for ligands and modified residues |
| Trunk | Evoformer (MSA + pair repr) | Pairformer (pair repr only; MSA processed separately upstream) |
| Structure module | IPA — deterministic, one output | Diffusion — denoises from random noise, can sample multiple structures |
| Confidence | pLDDT + pTM + PAE | pLDDT + pTM/ipTM + PAE + PDE (predicted distance error) |
AlphaFold 2 Structure Module:
Input: pair repr → IPA → ONE deterministic output
Problem: one input → one structure, no structural uncertainty
AlphaFold 3 Diffusion Module:
Input: pair repr + NOISE → denoise → predicted coordinates
Different noise seeds → different structures
Run 5 times → 5 candidates → rank by confidence
Same paradigm shift as deterministic image encoders →
Stable Diffusion: one input, many possible outputs
Gained: Protein-ligand docking (50% better than prior best), protein-nucleic acid complexes, antibody-antigen interactions, multiple structure samples per input.
Lost: Single-chain protein accuracy slightly worse than AF2 (traded monomer accuracy for generality); hallucination risk from diffusion; ~4.4% chirality errors in predicted ligand poses.
While DeepMind built AF2 around MSA + co-evolution, Meta’s FAIR team asked: what if a protein language model already encodes structural information, and you don’t need MSA at all?
AlphaFold 2: Sequence → MSA (minutes-hours) → Evoformer → Structure
ESMFold: Sequence → ESM-2 (3B-parameter model; the family scales to 15B) → Structure Module → 3D
ESMFold: no MSA, no database search, just the raw sequence
~60× faster than AlphaFold 2
ESM-2 was trained on 250M protein sequences with masked language modeling (exactly like BERT). During training, it implicitly learns co-evolutionary patterns, structural motifs, and long-range contacts.
| Model | Monomer Accuracy | Speed | Orphan Proteins |
|---|---|---|---|
| AlphaFold 2 | 88% | Minutes–hours | Fails (needs MSA) |
| ESMFold | 76% | Seconds | Works (no MSA needed) |
ESMFold predicted 617 million metagenomic protein structures (the ESM Metagenomic Atlas), many of them for proteins with no close match among experimentally determined structures.
AlphaFold solves the forward problem: sequence → structure. The more valuable problem is the inverse: design a sequence that folds into a desired structure.
David Baker’s lab adapted diffusion models for protein backbone design:
1. Start with random noise in 3D coordinate space
2. Denoise using a fine-tuned RoseTTAFold network → novel backbone
3. Use ProteinMPNN to design a sequence for that backbone
4. Use AlphaFold 2 to VERIFY the sequence folds correctly
5. Synthesize in lab
Applications demonstrated:
• De novo binders (therapeutic antibodies)
• Symmetric nanocages (drug delivery)
• Custom enzyme active sites (RFdiffusion2, April 2025)
AlphaFold 2 serves as the verification step in this pipeline — closing the design loop by predicting whether designed sequences actually fold into the intended structures.
| Model | Lab | Year | Key Feature |
|---|---|---|---|
| AlphaFold 3 | DeepMind | 2024 | Gold standard; initially closed, later opened |
| Boltz-1 | MIT | 2024 | Fully open-source, AF3-level accuracy, “Boltz-steering” |
| Chai-1 | Chai Discovery | 2024 | Commercial; claims higher accuracy than AF3 |
| OpenFold | Columbia | 2022 | Open AF2 reimplementation |
| RoseTTAFold | Baker Lab | 2021–24 | Independent architecture, extended to all-atom |
| ESMFold | Meta FAIR | 2022 | No MSA; 60× faster |
| Status | Problem |
|---|---|
| ✅ Solved | Single-domain structure prediction; large-scale structural annotation (200M+ predictions) |
| 🟡 Partial | Multi-domain proteins; stable protein complexes; antibody CDR-H3 loops |
| ❌ Unsolved | Conformational ensembles (snapshot vs. movie); intrinsically disordered proteins; allosteric mechanisms; protein function prediction; folding pathways; membrane protein environments; post-translational modifications |
| Laureate | Contribution |
|---|---|
| Demis Hassabis + John Jumper | Protein structure prediction (AlphaFold) |
| David Baker | Computational protein design (Rosetta, RFdiffusion) |
The split is telling: Hassabis/Jumper = understanding (prediction), Baker = creation (design). Together they represent the full loop of programmable biology: predict structure → design new proteins → verify predictions.
AlphaFold 2 wasn’t just a better model — it was a better formulation. The key innovations (treating structure as a graph with frames, using attention over evolution, learning to refine iteratively) came from deeply understanding the domain and finding the right inductive biases. Raw scale alone wouldn’t have worked. This lesson applies across all of AI — the best models come from understanding what the data fundamentally is, not just throwing more compute at it.