Four-level explainers for deeply understanding research papers — from beginner to frontier. Each summary includes interactive quizzes to test your understanding.
Qualitative interview study of 25 leading AI researchers (Aug–Sep 2025) on automating AI R&D and intelligence explosion scenarios. 20/25 flagged ASARA as one of the most severe AI risks; 17/25 expect frontier models to be kept internal; a clear schism between frontier-lab researchers and academics on trajectory clarity.
820 expert-authored problems (~11 person-hours each) measuring integration density — coordination of multiple cognitive operations simultaneously — calibrated with IRT 2PL psychometrics across 47 model configurations. Key finding: best human+AI centaur (θ=2.26) beats best pure LLM (θ=2.16), but operator skill at directing AI is the differentiating variable.
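For readers new to the IRT 2PL model mentioned above, here is a minimal sketch of the standard two-parameter logistic item response function. The function name and the item parameters (`a`, `b`) are hypothetical, chosen for illustration and not taken from the paper; only the two ability values come from the summary.

```python
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """Standard 2PL item response function.

    theta: ability of the test taker (here, a model or human+AI configuration)
    a:     item discrimination (hypothetical value in the example below)
    b:     item difficulty (hypothetical value in the example below)
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Ability estimates quoted in the summary above.
centaur_theta = 2.26
llm_theta = 2.16

# A made-up hard item: discrimination 1.0, difficulty 2.0.
p_centaur = p_correct(centaur_theta, a=1.0, b=2.0)
p_llm = p_correct(llm_theta, a=1.0, b=2.0)
print(p_centaur > p_llm)  # higher ability gives a higher success probability
```

When ability equals item difficulty, the model gives a 50% chance of a correct answer, which is why θ values can be read on the same scale as problem difficulty.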
The 7-page consensus document that reshaped embodied AI — zero figures, zero tables, one equation. Defined SPL (Success weighted by inverse Path Length), the PointGoal/ObjectGoal/AreaGoal taxonomy, and 7 recommendations that became the foundation for the Habitat platform and every major navigation benchmark since 2018.
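As a rough illustration of the one equation, SPL averages, over episodes, the binary success indicator times the ratio of the shortest-path length to the longer of the agent's path and the shortest path. The sketch below assumes each episode is a `(success, shortest_path_length, agent_path_length)` tuple; that layout and the function name are illustrative choices, not an API from any navigation library.

```python
from typing import Iterable, Tuple

Episode = Tuple[bool, float, float]  # (success, shortest path length, agent path length)

def spl(episodes: Iterable[Episode]) -> float:
    """Success weighted by (normalized inverse) Path Length.

    Assumes every shortest path length is strictly positive.
    """
    episodes = list(episodes)
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for success, shortest, taken in episodes:
        # A failed episode contributes 0; a successful one is
        # discounted by how much longer the agent's path was.
        total += float(success) * shortest / max(taken, shortest)
    return total / len(episodes)

# Three hypothetical episodes: optimal success, success with a path
# twice the optimal length, and a failure.
print(spl([(True, 10.0, 10.0), (True, 10.0, 20.0), (False, 10.0, 5.0)]))  # 0.5
```

The `max` in the denominator keeps each episode's score at or below 1 even if the agent's measured path is shorter than the reference shortest path.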
ByteDance's dual-branch diffusion transformer co-generates audio and video, accepts 15 multimodal references (9 images + 3 videos + 3 audio), supports V2V editing and multi-shot narratives, and beat Sora 2, Veo 3.1, and Kling 3.0 in blind evaluation — all at ~$0.14 per clip.
One image generator, three specialist-killers. Google DeepMind shows that a single generative model (Nano Banana Pro), with lightweight instruction tuning, beats SAM 3, Depth Anything V3, and Lotus-2 — zero-shot. The thesis: image generation is to vision what next-token prediction is to language.
DeepMind solved protein folding — a 50-year grand challenge — by treating it as attention over evolution. AlphaFold 2's Evoformer extracts co-evolutionary signals from MSAs, predicting 3D structures at near-experimental accuracy (median GDT 92.4 at CASP14). Predicted 200M+ protein structures. 2024 Nobel Prize in Chemistry.
One algorithm, zero game-specific knowledge, three superhuman games. AlphaZero defeated Stockfish (chess, 28-0), Elmo (shogi, 90-8), and AlphaGo Zero (Go, 60-40) — searching 1,000× fewer positions but evaluating each one with deep neural network understanding. Proved that the learning algorithm is domain-general.
DeepMind's landmark AlphaGo paper, describing the system that went on to defeat Lee Sedol 4-1, combines policy networks (move prediction), value networks (position evaluation), and Monte Carlo Tree Search. Proved that neural networks + search can achieve superhuman performance in domains where brute-force search is impossible.
OpenAI's data-centric breakthrough: a custom-trained captioner re-describes every training image in rich detail, and a standard diffusion model is then trained on these synthetic captions. 71.7% human preference over SDXL. Proved that data quality trumps model architecture for image generation.
Meta's unified multimodal model that tokenizes everything — text, images, code — into discrete tokens and trains a single transformer with next-token prediction. Beats GPT-4V on mixed-modal reasoning. The architecture that Transfusion argues against.
A single transformer that uses next-token prediction for text and diffusion denoising for images — matching DALL-E 2/SDXL on image generation and LLaMA-1 on text, at less than 1/3 the compute of discrete tokenization approaches.
A hardened version of MMMU that patches shortcut exploitation — filtering text-answerable questions, expanding to 10 options, and embedding questions in screenshots. Performance dropped 17–27% across all models.
The first comprehensive multimodal benchmark testing college-level expert reasoning — 11,500 questions across 30 subjects with 30 heterogeneous image types. GPT-4V scored 56% vs. human experts at 76–89%.
A benchmark of 900 videos (11s to 1 hour) with 2,700 expert-annotated questions across 6 domains — revealing that all models degrade on longer videos and that subtitles/audio significantly help.
A unified 2×2 framework for making AI agents better after pre-training. Four paradigms — A1, A2, T1, T2 — organize 100+ methods. Key finding: training smarter tools (T2) can match full agent retraining with 70× less data.
AI agents autonomously discover state-of-the-art adversarial attack algorithms by recombining existing methods — achieving 100% attack success on a hardened model.
Level 1 — Beginner: Plain language, analogies, no jargon. Assumes no background.
Level 2 — Intermediate: How the methods work, key technical concepts, comparisons.
Level 3 — Expert: Full math, algorithms, related work connections, critical evaluation.
Level 4 — Frontier: Improvement vectors, latest follow-on work, open gaps scorecard.
Each level ends with a 5-question interactive quiz. Score 4/5 or higher to pass.