Research Paper Summaries

Four-level explainers for deeply understanding research papers — from beginner to frontier. Each summary includes interactive quizzes to test your understanding.

Papers

AI Researchers' Perspectives on Automating AI R&D and Intelligence Explosions

Qualitative interview study of 25 leading AI researchers (Aug–Sep 2025) on automating AI R&D and intelligence explosion scenarios. 20/25 flagged ASARA (AI systems for AI R&D automation) as one of the most severe AI risks; 17/25 expect frontier models to be kept internal; and the interviews reveal a clear schism between frontier-lab researchers and academics on trajectory clarity.

qualitative interviews intelligence explosion AI governance arXiv Mar 2026

GIM: Evaluating Models via Tasks that Integrate Multiple Cognitive Domains

820 expert-authored problems (~11 person-hours each) measuring integration density — coordination of multiple cognitive operations simultaneously — calibrated with IRT 2PL psychometrics across 47 model configurations. Key finding: best human+AI centaur (θ=2.26) beats best pure LLM (θ=2.16), but operator skill at directing AI is the differentiating variable.

evaluation IRT psychometrics centaur study integration density March 2026
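The θ values in the GIM entry above are ability estimates on an IRT 2PL scale. As a minimal sketch of what that scale means, the standard two-parameter logistic model gives the probability that a solver with ability θ answers an item correctly. The item parameters used here (`a` for discrimination, `b` for difficulty) are hypothetical illustration values, not numbers from the paper.

```python
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """Standard 2PL item response function.

    theta: solver ability, a: item discrimination, b: item difficulty.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical item: discrimination 1.5, difficulty 2.0 (not from the paper).
A_ITEM, B_ITEM = 1.5, 2.0

# Ability estimates quoted in the summary above.
centaur = p_correct(2.26, A_ITEM, B_ITEM)
pure_llm = p_correct(2.16, A_ITEM, B_ITEM)
print(f"centaur: {centaur:.3f}, pure LLM: {pure_llm:.3f}")
```

Under this model a solver whose ability equals the item difficulty succeeds half the time, so a gap of 0.10 in θ translates into only a modest gap in per-item success probability.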

On Evaluation of Embodied Navigation Agents

The 7-page consensus document that reshaped embodied AI — zero figures, zero tables, one equation. Defined SPL (Success weighted by inverse Path Length), the PointGoal/ObjectGoal/AreaGoal taxonomy, and 7 recommendations that became the foundation for the Habitat platform and every major navigation benchmark since 2018.

embodied AI navigation evaluation SPL metric arXiv Jul 2018
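The SPL metric named in the navigation entry above averages, over episodes, a success flag multiplied by the ratio of shortest-path length to the path the agent actually took (capped at 1). A minimal sketch follows, assuming each episode is given as a `(success, shortest, taken)` tuple with a positive shortest-path length; the function name and tuple layout are illustrative choices, not taken from the paper.

```python
from typing import Iterable, Tuple

Episode = Tuple[bool, float, float]  # (success, shortest path length, agent path length)

def spl(episodes: Iterable[Episode]) -> float:
    """Success weighted by (normalized inverse) Path Length.

    Each successful episode contributes shortest / max(taken, shortest);
    failed episodes contribute 0. Assumes shortest > 0 for every episode.
    """
    episodes = list(episodes)
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            total += shortest / max(taken, shortest)
    return total / len(episodes)

# One optimal success, one success with a path twice as long as needed, one failure.
score = spl([(True, 10.0, 10.0), (True, 10.0, 20.0), (False, 10.0, 5.0)])
print(score)
```

Because failures score zero regardless of path length, an agent cannot raise its SPL by stopping early.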

Seedance 2.0: Advancing Video Generation for World Complexity

ByteDance's dual-branch diffusion transformer co-generates audio and video, accepts 15 multimodal references (9 images + 3 videos + 3 audio), supports V2V editing and multi-shot narratives, and beat Sora 2, Veo 3.1, and Kling 3.0 in blind evaluation — all at ~$0.14 per clip.

video generation diffusion transformer audio-video RLHF arXiv Apr 2026

Vision Banana: Image Generators are Generalist Vision Learners

One image generator, three specialist-killers. Google DeepMind shows that a single generative model (Nano Banana Pro), with lightweight instruction tuning, beats SAM 3, Depth Anything V3, and Lotus-2 — zero-shot. The thesis: image generation is to vision what next-token prediction is to language.

generative vision segmentation depth estimation instruction tuning arXiv Apr 2026

AlphaFold 2: Highly Accurate Protein Structure Prediction with AlphaFold

DeepMind solved protein folding — a 50-year grand challenge — by treating it as attention over evolution. AlphaFold 2's Evoformer extracts co-evolutionary signals from MSAs, predicting 3D structures at near-experimental accuracy (median GDT 92.4 at CASP14). Predicted 200M+ protein structures. 2024 Nobel Prize in Chemistry.

protein structure attention co-evolution structural biology Nature 2021

AlphaZero: Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

One algorithm, zero game-specific knowledge, three superhuman games. AlphaZero defeated Stockfish (chess, 28-0), Elmo (shogi, 90-8), and AlphaGo Zero (Go, 60-40) — searching 1,000× fewer positions but evaluating each one with deep neural network understanding. Proved that the learning algorithm is domain-general.

reinforcement learning self-play MCTS game AI Science 2018

AlphaGo: Mastering the Game of Go with Deep Neural Networks and Tree Search

DeepMind's landmark paper behind the system that defeated Lee Sedol 4-1, combining policy networks (move prediction), value networks (position evaluation), and Monte Carlo Tree Search. Showed that neural networks + search can achieve superhuman performance in domains where brute-force search is infeasible.

reinforcement learning MCTS game AI Nature 2016

DALL-E 3: Improving Image Generation with Better Captions

OpenAI's data-centric breakthrough: a custom-trained captioner re-describes every training image in rich detail, and a standard diffusion model is then retrained on these synthetic captions. 71.7% human preference over SDXL. Proved that data quality trumps model architecture for image generation.

image generation data-centric AI recaptioning diffusion

Chameleon: Mixed-Modal Early-Fusion Foundation Models

Meta's unified multimodal model that tokenizes everything — text, images, code — into discrete tokens and trains a single transformer with next-token prediction. Beats GPT-4V on mixed-modal reasoning. The architecture that Transfusion argues against.

multimodal early fusion discrete tokens ICLR 2025

Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

A single transformer that uses next-token prediction for text and diffusion denoising for images — matching DALL-E 2/SDXL on image generation and LLaMA-1 on text, at less than 1/3 the compute of discrete tokenization approaches.

multimodal unified generation diffusion ICLR 2025 Oral

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

A hardened version of MMMU that patches shortcut exploitation — filtering text-answerable questions, expanding to 10 options, and embedding questions in screenshots. Performance dropped 17–27% across all models.

multimodal benchmarks robustness ACL 2025

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

The first comprehensive multimodal benchmark testing college-level expert reasoning — 11,500 questions across 30 subjects with 30 heterogeneous image types. GPT-4V scored 56% vs. human experts at 76–89%.

multimodal benchmarks expert reasoning CVPR 2024

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

A benchmark of 900 videos (11s to 1 hour) with 2,700 expert-annotated questions across 6 domains — revealing that all models degrade on longer videos and that subtitles/audio significantly help.

video understanding benchmarks MLLMs CVPR 2025

Adaptation of Agentic AI: A Survey of Post-Training, Memory, and Skills

A unified 2×2 framework for making AI agents better after pre-training. Four paradigms — A1, A2, T1, T2 — organize 100+ methods. Key finding: training smarter tools (T2) can match full agent retraining with 70× less data.

agentic AI post-training memory survey Dec 2025

Claudini: Autoresearch Discovers SOTA Adversarial Attack Algorithms for LLMs

AI agents autonomously discover state-of-the-art adversarial attack algorithms by recombining existing methods — achieving 100% attack success on a hardened model.

AI safety adversarial attacks autoresearch Mar 2026

How this works

Level 1 — Beginner: Plain language, analogies, no jargon. Assumes no background.

Level 2 — Intermediate: How the methods work, key technical concepts, comparisons.

Level 3 — Expert: Full math, algorithms, related work connections, critical evaluation.

Level 4 — Frontier: Improvement vectors, latest follow-on work, open gaps scorecard.

Each level ends with a 5-question interactive quiz. Score 4/5 or higher to pass.