Yue, Ni, Zhang, Zheng, Liu, Zhang, Stevens, Jiang, Ren, Sun, Wei, Yu, Yuan, Sun, Yin, Zheng, Yang, Liu, Huang, Sun, Su, Chen — November 2023 (CVPR 2024 Oral)
📄 arXiv:2311.16502 · 📥 PDF · 💻 GitHub · 🌐 Leaderboard
AI models like GPT-4V and Gemini can look at images and answer questions. But before MMMU, the "tests" we gave them were like kindergarten quizzes — "What animal is in this photo?" or "What color is the car?" MMMU asks: what if we gave these models actual college finals instead?
Think of MMMU as building a proper college final exam for AI. Previous benchmarks were like giving a college student only arithmetic quizzes — too easy, too narrow. MMMU is the first test comprehensive enough to measure how far AI is from expert-level performance across many fields.
1. Breadth. 30 subjects across 6 disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering.
2. Depth. Questions require college-level reasoning — applying Fourier transforms, diagnosing from medical imaging, analyzing circuit behavior. Prior benchmarks mostly tested everyday perception and common sense.
3. Heterogeneous images. 30 image types: circuit diagrams, MRI scans, sheet music, chemical structures, geometric proofs, paintings, charts, comics, and more. Not just photographs.
4. Interleaved text and images. Questions weave text and images together ("Given <image 1> and <image 2>, calculate..."). The model must jointly understand both.
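As a concrete illustration, a question using the paper's `<image n>` placeholder format can be split into alternating text and image segments before being fed to a model. The parser below is an illustrative sketch, not MMMU's official tooling:

```python
import re

def split_interleaved(question: str) -> list[tuple[str, str]]:
    """Split an MMMU-style question into (kind, content) segments,
    where kind is 'text' or 'image'. Placeholders like <image 1>
    mark where each image is interleaved with the text."""
    segments = []
    last = 0
    for m in re.finditer(r"<image (\d+)>", question):
        if m.start() > last:
            segments.append(("text", question[last:m.start()]))
        segments.append(("image", m.group(1)))  # image index as a string
        last = m.end()
    if last < len(question):
        segments.append(("text", question[last:]))
    return segments

q = "Given <image 1> and <image 2>, calculate the total impedance."
print(split_interleaved(q))
```

Because half of MMMU's images appear mid-question or at the end, a model (or its preprocessing code) has to preserve this ordering rather than prepending all images to the prompt.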
The authors analyzed 150 GPT-4V errors and found three root causes: 35% perceptual (misread the image), 29% knowledge gaps (didn't know the domain concept), and 26% reasoning errors (understood inputs but botched the logic). The problem isn't just "see better" — it's a combination of seeing, knowing, and thinking.
Adding OCR or image captions to text-only models didn't help. A text description of a circuit diagram can't capture the precise topology, component values, and connections. MMMU demands genuine multimodal understanding — not text proxies for vision.
If we want AI that can assist doctors reading MRI scans, engineers analyzing circuits, or art historians interpreting paintings, we need to measure how far away we are. MMMU is that measuring stick — and at launch, it showed the field was very far from expert-level multimodal AI.
The pipeline has three stages: collection → quality control → difficulty filtering.
Collection: 50+ college students (including co-authors) from different majors pulled questions from textbooks, online course materials, and exams. Annotators followed copyright rules and favored questions whose answers were not published alongside them, reducing the risk that models had seen the answer next to the question during training.
Quality control: Three passes — (1) lexical overlap + URL similarity for deduplication, (2) format/typo checking by different co-authors, (3) difficulty classification into four tiers.
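Pass (1) can be sketched as a token-level Jaccard comparison. The 0.8 threshold and helper names below are assumptions for illustration; the paper combines lexical overlap with source-URL similarity and human review:

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two question strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def flag_near_duplicates(questions: list[str], threshold: float = 0.8) -> list[tuple[int, int]]:
    """Return index pairs whose lexical overlap exceeds the threshold,
    as candidates for human review."""
    flagged = []
    for i in range(len(questions)):
        for j in range(i + 1, len(questions)):
            if jaccard(questions[i], questions[j]) >= threshold:
                flagged.append((i, j))
    return flagged
```

As the article notes later, this style of check catches near-identical wording but misses semantically equivalent questions phrased differently.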
Difficulty filtering: Bottom ~10% ("very easy") removed. Final distribution: 28% easy, 45% medium, 27% hard.
| Split | Count | Purpose |
|---|---|---|
| Development | 150 | Few-shot examples (5 per subject) |
| Validation | 900 | Hyperparameter tuning, quick eval |
| Test | 10,500 | Held-out evaluation (answers released Feb 2026) |
94% multiple-choice, 6% open-ended. 97.5% include at least one image. 7.4% have multiple images. Images appear at beginning (18%), middle (37%), or end (50%) of questions.
Prior benchmarks mostly had photographs. MMMU includes diagrams, tables, charts, chemical structures, paintings, geometric shapes, music sheets, medical scans, microscopic images, comics, and more. GPT-4V did well on photos and paintings but collapsed on geometric shapes, music sheets, and chemical structures — sometimes near random chance (25%). Models haven't generalized visual perception beyond common training distributions.
Zero-shot only in main results — no fine-tuning on MMMU. Tests genuine capability, not benchmark-specific adaptation.
Rule-based answer extraction using regex, not an LLM judge. Deterministic and reproducible — no judge variability.
Micro-averaged accuracy: each question counts equally regardless of subject, so subjects with more questions (Tech & Engineering: 2,784) carry more weight than smaller ones (Humanities: 947).
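The difference this makes is easy to see with two subjects of unequal size. The per-subject correct counts below are invented for illustration; only the question totals come from the article:

```python
# Per-subject (correct, total) counts. Correct counts are made up here;
# the totals (2,784 and 947) are MMMU's actual subject sizes.
subjects = {
    "Tech & Engineering": (1200, 2784),
    "Humanities & Social Science": (600, 947),
}

# Micro: pool all questions, so the larger subject dominates.
micro = sum(c for c, _ in subjects.values()) / sum(t for _, t in subjects.values())

# Macro: average per-subject accuracies, weighting subjects equally.
macro = sum(c / t for c, t in subjects.values()) / len(subjects)

print(f"micro: {micro:.3f}")
print(f"macro: {macro:.3f}")
```

With these numbers the macro average lands several points above the micro average, purely because the smaller subject has the higher accuracy; a model's headline MMMU score therefore leans toward its Tech & Engineering performance.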
90 college seniors — 3 per subject — took their subject's 30 validation questions with textbooks but no internet. Worst: 76.2%, median: 82.6%, best: 88.6%.
GPT-4V scored 76% on easy questions but only 31% on hard ones, barely above the 25% random-guess floor. Open-source models dropped from ~41% to ~27%. Hard questions require multi-step expert reasoning that hits a shared capability ceiling across all models.
MMMU tests three skills simultaneously: perception (can you see what's in the image?), knowledge (do you know the domain concepts?), and reasoning (can you chain the logic?). The error analysis shows these fail somewhat independently: 35% perceptual, 29% knowledge, 26% reasoning. A model could reason perfectly but still fail because it misread a diagram.
MMMU's construction — domain-expert annotators, heterogeneous image types, interleaved inputs, and three-skill decomposition — makes it diagnostic, not just a pass/fail score. It tells you where to invest improvement effort.
MMMU is framed around Morris et al.'s (2023) AGI taxonomy, targeting Level 3 ("Expert AGI") — an AI performing at the 90th percentile of skilled adults. The benchmark operationalizes this via college-level exams. The authors frame MMMU as a necessary condition (an Expert AGI should ace these) not a sufficient one (acing them doesn't make you Expert AGI).
The 50+ annotators were domain experts in their subjects (college students in those majors), not crowd workers. Questions reflect authentic expert assessment patterns.
Stage 1 — Deduplication: Lexical overlap + source URL similarity, then human review. Catches near-identical questions but misses semantically equivalent ones with different phrasing.
Stage 2 — Format standardization: Cross-checker was not the original annotator (weak independent review).
Stage 3 — Difficulty filtering: Labels assigned by authors, not calibrated against human solve rates. Human expert evaluation covers only 900 validation questions.
Zero-shot only in main results. Disadvantages models with strong few-shot learning (Flamingo variants); advantages models with strong RLHF instruction-following.
Micro-averaged accuracy means larger subjects dominate: Tech & Engineering (2,784 questions) has ~3× the weight of Humanities (947). Macro-averaging would weight all subjects equally.
MCQ dominance (94%) creates a 25% random floor and enables elimination strategies. Open-ended questions (6%) use brittle key-phrase matching.
Models with stronger vision encoders (GPT-4V, InternVL-Chat-V1.2) dramatically outperform weaker ones (Kosmos2, Fuyu-8B). But the language backbone matters equally — LLaVA-1.5-13B (CLIP ViT-L + Vicuna-13B) gets 34%; InternVL-Chat-V1.2 (InternViT-6B + stronger LLM) gets 46%.
Images at varying positions (beginning 18%, middle 37%, end 50%) with multi-image questions (7.4%) test cross-modal reference tracking. Architecturally demanding: Q-Former models (BLIP-2) compress images separately, potentially losing cross-image relationships. Direct projection models (LLaVA) embed images inline but at higher compute cost.
Genuine expert-level difficulty with authentic source materials. Heterogeneous image types expose real generalization failures. Actionable error decomposition. Deterministic evaluation. Meaningful human expert baseline.
No contamination analysis. Questions on public HuggingFace/arXiv; test answers released Feb 2026.
Uncalibrated difficulty labels. Author judgments, not validated against human solve rates on the test set.
Subject imbalance. Tech & Eng dominates under micro-averaging.
Static benchmark. Fixed questions with no refresh mechanism. Contamination risk grows over time.
MCQ ceiling. 25% floor allows elimination. Less discriminating as models approach human performance.
English-only. All QA pairs in English despite global applicability.
Single-turn, no tool use. Real experts use calculators, references, and iterative problem-solving.
MMMU established the template for expert-level multimodal evaluation. Its design decisions — domain-expert curation, heterogeneous images, interleaved inputs, three-skill decomposition — shaped how the field measures progress. The limitations are real but well-understood, and successors build on rather than replace it.
Six improvement vectors for MMMU, mapped against recent work (as of April 2026).
MMMU's 4-option MCQ format (94%) sets a 25% floor and allows elimination. Models can exploit text-option correlations without visual understanding.
MMMU-Pro (Sep 2024, ACL 2025) — Same team. Three-step hardening: filter text-answerable questions, expand to 10 options (floor drops to 10%), add vision-only setting where questions are embedded within images. Performance dropped 17–27% across all models. The single most important successor.
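The random-guess floors here are simple arithmetic; the sketch below also shows why eliminating distractors from the text alone inflates scores under the 4-option format:

```python
def guess_floor(n_options: int, eliminated: int = 0) -> float:
    """Expected accuracy from uniform random guessing among the options
    that remain after eliminating `eliminated` wrong ones."""
    return 1 / (n_options - eliminated)

print(guess_floor(4))      # MMMU's 4-option floor: 0.25
print(guess_floor(4, 2))   # eliminate two distractors from text alone: 0.5
print(guess_floor(10))     # MMMU-Pro's 10-option floor: 0.1
```

Ruling out just two implausible options from the text pushes a blind guesser to 50% on 4-option MCQ, which is why widening to 10 options (and embedding the question in the image) is such an effective hardening step.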
11,500 fixed questions on public HuggingFace, GitHub, and arXiv. Test answers released Feb 2026. Models like GPT-5 and Gemini explicitly cite MMMU as a primary benchmark. No contamination analysis, no dynamic refresh, no held-out rotation. This is the widest unresolved gap in the MMMU family.
Top models (o4 Mini High, GPT-5) now score ~79% — within the human expert range (76–89%). Models have surpassed the worst human experts. Progress appears asymptotic.
MMMU-Pro restores headroom: top scores ~81% (GPT-5.4, Gemini 3 Pro) with room to the ceiling. But the original MMMU is nearing its useful life as a discriminating benchmark for frontier models.
MMMU tests static images only. Real expert work often involves video.
Video-MMMU (Jan 2025) — Same 6 disciplines, 30 subjects, but with 300 educational videos and 900 questions. Introduces Δknowledge metric measuring learning gain from watching videos. Best model (GPT-4o) achieved only 15.6% knowledge gain vs. 33.1% for humans.
Uni-MMMU (Oct 2025) — Unifies understanding and generation evaluation across the same domains with bidirectional tasks.
94% of MMMU is MCQ. Real expert work involves generating explanations, proofs, and diagrams.
Uni-MMMU (Oct 2025) — Bidirectional tasks coupling generation and understanding. Models must generate visual outputs to demonstrate comprehension, and use generation as a reasoning scaffold. Strongest move toward open-ended evaluation in the MMMU family.
All questions are English-only. Expert knowledge is practiced globally with varying notation conventions, diagram styles, and domain terminology. No multilingual version of MMMU or MMMU-Pro exists. Creating equivalent expert-level questions in other languages would require new domain-expert annotators per language — a significant effort no one has undertaken.
| Improvement vector | Status | Key work |
|---|---|---|
| Shortcut exploitation / MCQ ceiling | Addressed | MMMU-Pro (4→10 options, vision-only) |
| Contamination resistance | Area to explore | No dynamic refresh; test answers public |
| Saturation on original MMMU | Partially addressed | MMMU-Pro restores headroom |
| Video / temporal extension | Addressed | Video-MMMU, Uni-MMMU |
| Open-ended / generation eval | Partially addressed | Uni-MMMU (bidirectional) |
| Multilingual coverage | Area to explore | No multilingual version exists |
MMMU catalyzed a family of successors. The original is nearing saturation, but MMMU-Pro has taken over as the active benchmark. The biggest unresolved gaps are contamination resistance and multilingual coverage. The frontier is moving fastest on video extension and unified understand+generate evaluation.