
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

Yue, Ni, Zhang, Zheng, Liu, Zhang, Stevens, Jiang, Ren, Sun, Wei, Yu, Yuan, Sun, Yin, Zheng, Yang, Liu, Huang, Sun, Su, Chen — November 2023 (CVPR 2024 Oral)

📄 arXiv:2311.16502  ·  📥 PDF  ·  💻 GitHub  ·  🌐 Leaderboard

TL;DR: The first comprehensive multimodal benchmark testing college-level expert reasoning — 11,500 questions across 30 subjects with 30 heterogeneous image types. GPT-4V scored only 56% vs. human experts at 76–89%, revealing massive gaps in perception, knowledge, and reasoning.

Level 1 — Beginner

What is this paper about?

AI models like GPT-4V and Gemini can look at images and answer questions. But before MMMU, the "tests" we gave them were like kindergarten quizzes — "What animal is in this photo?" or "What color is the car?" MMMU asks: what if we gave these models actual college finals instead?

The college final exam analogy

Core idea

Think of MMMU as building a proper college final exam for AI. Previous benchmarks were like giving a college student only arithmetic quizzes — too easy, too narrow. MMMU is the first test comprehensive enough to measure how far AI is from expert-level performance across many fields.

What makes MMMU different? Four things.

1. Breadth. 30 subjects across 6 disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering.

2. Depth. Questions require college-level reasoning — applying Fourier transforms, diagnosing from medical imaging, analyzing circuit behavior. Prior benchmarks tested common sense.

3. Heterogeneous images. 30 image types: circuit diagrams, MRI scans, sheet music, chemical structures, geometric proofs, paintings, charts, comics, and more. Not just photographs.

4. Interleaved text and images. Questions weave text and images together ("Given <image 1> and <image 2>, calculate..."). The model must jointly understand both.

Key results

56% — GPT-4V (best at publication)
34% — Best open-source (BLIP-2, LLaVA-1.5)
89% — Best human expert (college seniors)

Why do models struggle?

The authors analyzed 150 GPT-4V errors and found three root causes: 35% perceptual (misread the image), 29% knowledge gaps (didn't know the domain concept), and 26% reasoning errors (understood the inputs but botched the logic); the remaining ~10% fell into other categories. The problem isn't just "see better" — it's a combination of seeing, knowing, and thinking.

What about converting images to text?

Adding OCR or image captions to text-only models barely helped. A text description of a circuit diagram can't capture the precise topology, component values, and connections. MMMU demands genuine multimodal understanding — not text proxies for vision.

Key takeaway

If we want AI that can assist doctors reading MRI scans, engineers analyzing circuits, or art historians interpreting paintings, we need to measure how far away we are. MMMU is that measuring stick — and at launch, it showed the field was very far from expert-level multimodal AI.

Quiz — Level 1
1. What is MMMU's primary contribution compared to prior multimodal benchmarks?
MMMU is a benchmark (a test), not a model. Its contribution is testing expert-level questions across 30 subjects with 30 heterogeneous image types — far beyond prior benchmarks that focused on everyday knowledge.
2. What was the single largest category of GPT-4V errors?
35% of errors were perceptual — the model literally misread the image. This tells us that even the best models at publication had fundamental vision problems, not just reasoning or knowledge gaps.
3. Why didn't OCR or image captions help text-only LLMs on MMMU?
A caption like "a circuit with resistors" loses the exact topology, values, and connections needed to solve the problem. The information is inherently visual.
4. Which disciplines did models perform best on, and why?
Art & Design and Humanities use familiar image types (photos, paintings) and involve less mathematical reasoning. GPT-4V scored 65% on Art & Design and 76% on Humanities vs. 42% on Tech & Engineering.
5. Why do the authors say MMMU is "necessary but not sufficient" for Expert AGI?
Expert AGI means performing at the 90th percentile of skilled adults across broad tasks. College exams test knowledge and reasoning, but real-world expertise also involves tool use, collaboration, and open-ended problem solving.

Level 2 — Intermediate

How MMMU is constructed

The pipeline has three stages: collection → quality control → difficulty filtering.

Collection: 50+ college students (including co-authors) from different majors pulled questions from textbooks, online course materials, and exams. Annotators followed copyright rules and preferred questions whose answers are not printed alongside them (e.g., answers at the back of a textbook), reducing the risk that training data contains both question and answer together.

Quality control: Three passes — (1) lexical overlap + URL similarity for deduplication, (2) format/typo checking by different co-authors, (3) difficulty classification into four tiers.

Difficulty filtering: Bottom ~10% ("very easy") removed. Final distribution: 28% easy, 45% medium, 27% hard.
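
The lexical-overlap deduplication in the quality-control pass can be sketched with word-level Jaccard similarity. The authors don't publish their tokenization or thresholds, so the similarity function, the 0.8 cutoff, and the `url` field below are all illustrative assumptions:

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two question texts."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def flag_duplicates(questions, sim_threshold=0.8):
    """Flag question pairs as duplicate candidates when their lexical
    overlap is high or they share a source URL (hypothetical criteria)."""
    flagged = []
    for i in range(len(questions)):
        for j in range(i + 1, len(questions)):
            qi, qj = questions[i], questions[j]
            if qi["url"] == qj["url"] or jaccard(qi["text"], qj["text"]) >= sim_threshold:
                flagged.append((i, j))
    return flagged  # candidates then go to human review, as in the pipeline

qs = [
    {"text": "Find the current through R1 in the circuit", "url": "a"},
    {"text": "Find the current through R1 in this circuit", "url": "b"},
    {"text": "Name the painter of this artwork", "url": "c"},
]
print(flag_duplicates(qs))  # → [(0, 1)]
```

As the text notes, a purely lexical check like this misses semantically equivalent questions with different phrasing — which is exactly the gap flagged in the Level 3 discussion.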

Dataset at a glance

Split        Count   Purpose
Development  150     Few-shot examples (5 per subject)
Validation   900     Hyperparameter tuning, quick eval
Test         10,500  Held-out evaluation (answers released Feb 2026)

94% multiple-choice, 6% open-ended. 97.5% include at least one image. 7.4% have multiple images. Images appear at beginning (18%), middle (37%), or end (50%) of questions.

The 30 image types problem

Prior benchmarks mostly had photographs. MMMU includes diagrams, tables, charts, chemical structures, paintings, geometric shapes, music sheets, medical scans, microscopic images, comics, and more. GPT-4V did well on photos and paintings but collapsed on geometric shapes, music sheets, and chemical structures — sometimes near random chance (25%). Models haven't generalized visual perception beyond common training distributions.

Evaluation methodology

Zero-shot only in main results — no fine-tuning on MMMU. Tests genuine capability, not benchmark-specific adaptation.

Rule-based answer extraction using regex, not an LLM judge. Deterministic and reproducible — no judge variability.

Micro-averaged accuracy: each question counts equally, so subjects with more questions (Tech & Engineering: 2,784) carry more weight than smaller ones (Humanities: 947).
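
To make the weighting concrete, here is a toy two-subject aggregation using the question counts above and GPT-4V's per-discipline accuracies quoted earlier (42% Tech & Engineering, 76% Humanities) — illustrative only, since the real benchmark aggregates 30 subjects:

```python
# (subject, question_count, accuracy) — numbers from the text, combined for illustration
results = [
    ("Tech & Engineering", 2784, 0.42),
    ("Humanities & Social Science", 947, 0.76),
]

# Micro: every question weighs the same, so the big subject dominates.
micro = sum(n * acc for _, n, acc in results) / sum(n for _, n, _ in results)
# Macro: every subject weighs the same, regardless of size.
macro = sum(acc for _, _, acc in results) / len(results)

print(f"micro: {micro:.3f}")  # → micro: 0.506 (pulled toward 0.42)
print(f"macro: {macro:.3f}")  # → macro: 0.590
```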

The human expert baseline

90 college seniors — 3 per subject — took their subject's 30 validation questions with textbooks but no internet. Worst: 76.2%, median: 82.6%, best: 88.6%.

The difficulty cliff

GPT-4V scored 76% on easy questions but only 31% on hard ones — nearly random. Open-source models dropped from ~41% to ~27%. Hard questions require multi-step expert reasoning that hits a shared capability ceiling across all models.

The three-skill decomposition

Key insight

MMMU tests three skills simultaneously: perception (can you see what's in the image?), knowledge (do you know the domain concepts?), and reasoning (can you chain the logic?). The error analysis shows these fail somewhat independently: 35% perceptual, 29% knowledge, 26% reasoning. A model could reason perfectly but still fail because it misread a diagram.

Key takeaway

MMMU's construction — domain-expert annotators, heterogeneous image types, interleaved inputs, and three-skill decomposition — makes it diagnostic, not just a pass/fail score. It tells you where to invest improvement effort.

Quiz — Level 2
1. Why did the authors use rule-based regex extraction instead of an LLM judge?
Rule-based extraction gives the same result every time. Different labs evaluating on MMMU get identical scores for the same model outputs — no judge disagreements.
2. What does the performance gap across image types reveal?
Models trained on billions of natural images do well on photos but collapse on chemical structures, music notation, and geometric diagrams — formats with completely different spatial grammars.
3. What contamination mitigation did annotators use during collection?
By choosing questions whose answers live in separate documents or at the back of textbooks, they reduced the chance that training data would contain both Q and A together.
4. Why does GPT-4V's advantage disappear on hard questions?
GPT-4V scored 76% on easy but 31% on hard — near random. All models hit the same ceiling because hard questions demand reasoning chains that break down regardless of model scale.
5. Why is the three-skill decomposition diagnostically valuable?
35% perceptual = invest in vision encoders. 29% knowledge = add domain training data. 26% reasoning = improve chain-of-thought. Each failure mode has a different fix.

Level 3 — Expert

Benchmark design philosophy

MMMU is framed around Morris et al.'s (2023) AGI taxonomy, targeting Level 3 ("Expert AGI") — an AI performing at the 90th percentile of skilled adults. The benchmark operationalizes this via college-level exams. The authors frame MMMU as a necessary condition (an Expert AGI should ace these) not a sufficient one (acing them doesn't make you Expert AGI).

Data curation methodology

The 50+ annotators were domain experts in their subjects (college students in those majors), not crowd workers. Questions reflect authentic expert assessment patterns.

Quality control details

Stage 1 — Deduplication: Lexical overlap + source URL similarity, then human review. Catches near-identical questions but misses semantically equivalent ones with different phrasing.

Stage 2 — Format standardization: The cross-checker was a co-author other than the original annotator, so the review was independent of the annotator but still internal to the author team.

Stage 3 — Difficulty filtering: Labels assigned by authors, not calibrated against human solve rates. Human expert evaluation covers only 900 validation questions.

Evaluation design decisions

Zero-shot only in main results. Disadvantages models with strong few-shot learning (Flamingo variants); advantages models with strong RLHF instruction-following.

Micro-averaged accuracy means larger subjects dominate: Tech & Engineering (2,784 questions) has ~3× the weight of Humanities (947). Macro-averaging would weight all subjects equally.

MCQ dominance (94%) creates a 25% random floor and enables elimination strategies. Open-ended questions (6%) use brittle key-phrase matching.

Architectural implications

Models with stronger vision encoders (GPT-4V, InternVL-Chat-V1.2) dramatically outperform weaker ones (Kosmos2, Fuyu-8B). But the language backbone matters equally — LLaVA-1.5-13B (CLIP ViT-L + Vicuna-13B) gets 34%; InternVL-Chat-V1.2 (InternViT-6B + stronger LLM) gets 46%.

The interleaved input challenge

Images at varying positions (beginning 18%, middle 37%, end 50%) with multi-image questions (7.4%) test cross-modal reference tracking. Architecturally demanding: Q-Former models (BLIP-2) compress images separately, potentially losing cross-image relationships. Direct projection models (LLaVA) embed images inline but at higher compute cost.
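
One way to see why interleaving is demanding: the model's input must preserve where each image sits relative to the text. A sketch of splitting an MMMU-style question into ordered text/image segments (the `<image n>` placeholder syntax follows the example quoted earlier; the parsing code is our own illustration, not MMMU's loader):

```python
import re

def split_interleaved(question: str):
    """Split a question on <image n> placeholders into an ordered list of
    ("text", ...) and ("image", n) segments, preserving position."""
    segments = []
    pos = 0
    for m in re.finditer(r"<image (\d+)>", question):
        if m.start() > pos:
            segments.append(("text", question[pos:m.start()].strip()))
        segments.append(("image", int(m.group(1))))
        pos = m.end()
    if pos < len(question):
        segments.append(("text", question[pos:].strip()))
    return segments

q = "Given <image 1> and <image 2>, calculate the total impedance."
print(split_interleaved(q))
# → [('text', 'Given'), ('image', 1), ('text', 'and'), ('image', 2),
#    ('text', ', calculate the total impedance.')]
```

An architecture that encodes both images separately and concatenates them at the end loses exactly this ordering information.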

Critical evaluation

Strengths

Genuine expert-level difficulty with authentic source materials. Heterogeneous image types expose real generalization failures. Actionable error decomposition. Deterministic evaluation. Meaningful human expert baseline.

Weaknesses

No contamination analysis. Questions on public HuggingFace/arXiv; test answers released Feb 2026.

Uncalibrated difficulty labels. Author judgments, not validated against human solve rates on the test set.

Subject imbalance. Tech & Eng dominates under micro-averaging.

Static benchmark. Fixed questions with no refresh mechanism. Contamination risk grows over time.

MCQ ceiling. 25% floor allows elimination. Less discriminating as models approach human performance.

English-only. All QA pairs in English despite global applicability.

Single-turn, no tool use. Real experts use calculators, references, and iterative problem-solving.

Key takeaway

MMMU established the template for expert-level multimodal evaluation. Its design decisions — domain-expert curation, heterogeneous images, interleaved inputs, three-skill decomposition — shaped how the field measures progress. The limitations are real but well-understood, and successors build on rather than replace it.

Quiz — Level 3
1. Why does micro-averaged accuracy potentially misrepresent balanced performance?
Under micro-averaging, each question counts equally. A model could bomb Humanities but ace Tech & Engineering and still look decent because the larger subject contributes ~3× more to the overall score.
2. Why does interleaved image placement create an architectural challenge?
Q-Former architectures compress images separately and merge later, potentially losing cross-image relationships. Direct projection models preserve positional relationships but at higher compute cost.
3. What is the key limitation of MMMU's difficulty labels?
"Hard" according to authors may not align with "hard" as measured by actual human solve rates. The 10,500 test set labels remain uncalibrated.
4. Why does the static, fixed nature of MMMU become a growing concern?
With 11,500 questions on HuggingFace, GitHub, and arXiv — and test answers released in Feb 2026 — the probability of training data contamination increases continuously with no refresh mechanism.
5. What does the OCR/caption experiment prove about MMMU's design?
Adding OCR/captions to GPT-4 (text) improved it from 33.8% to only 34.9%. This validates MMMU's core design claim: these questions demand genuine vision-language integration, not text extraction.

Level 4 — Frontier

Six improvement vectors for MMMU, mapped against recent work (as of April 2026).

1. Shortcut exploitation and MCQ ceiling

Addressed

MMMU's 4-option MCQ format (94%) sets a 25% floor and allows elimination. Models can exploit text-option correlations without visual understanding.

Recent work

MMMU-Pro (Sep 2024, ACL 2025) — Same team. Three-step hardening: filter text-answerable questions, expand to 10 options (floor drops to 10%), add vision-only setting where questions are embedded within images. Performance dropped 17–27% across all models. The single most important successor.
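
The floor numbers follow from expected accuracy under uniform guessing; a toy calculation also shows why eliminating a couple of options is far more valuable with 4 choices than with 10:

```python
def random_floor(n_options: int, n_eliminated: int = 0) -> float:
    """Expected accuracy when guessing uniformly among the options left
    after eliminating some wrong ones (toy model of MCQ shortcuts)."""
    return 1 / (n_options - n_eliminated)

print(random_floor(4))      # MMMU:     0.25
print(random_floor(10))     # MMMU-Pro: 0.10
print(random_floor(4, 2))   # eliminate 2 of 4  → 0.5
print(random_floor(10, 2))  # eliminate 2 of 10 → 0.125
```

Under this toy model, partial elimination doubles a guesser's score on 4-option MMMU but barely moves it on 10-option MMMU-Pro.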

2. Contamination resistance

Area to explore

11,500 fixed questions on public HuggingFace, GitHub, and arXiv. Test answers released Feb 2026. Models like GPT-5 and Gemini explicitly cite MMMU as a primary benchmark. No contamination analysis, no dynamic refresh, no held-out rotation. This is the widest unresolved gap in the MMMU family.

3. Saturation on original MMMU

Partially addressed

Top models (o4 Mini High, GPT-5) now score ~79% — within the human expert range (76–89%). Models have surpassed the worst human experts. Progress appears asymptotic.

Recent work

MMMU-Pro restores headroom: top scores are ~81% (GPT-5.4, Gemini 3 Pro), leaving meaningful room below the ceiling. But the original MMMU is nearing the end of its useful life as a discriminating benchmark for frontier models.

4. Video and temporal extension

Addressed

MMMU tests static images only. Real expert work often involves video.

Recent work

Video-MMMU (Jan 2025) — Same 6 disciplines, 30 subjects, but with 300 educational videos and 900 questions. Introduces Δknowledge metric measuring learning gain from watching videos. Best model (GPT-4o) achieved only 15.6% knowledge gain vs. 33.1% for humans.
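
Δknowledge measures how much of a model's remaining headroom watching the video closes. The sketch below assumes the standard normalized-gain form; treat the exact formula as an assumption and check the Video-MMMU paper for the official definition:

```python
def delta_knowledge(acc_before: float, acc_after: float) -> float:
    """Normalized knowledge gain: improvement after watching the video,
    as a fraction of the headroom that existed before watching.
    (Assumed formula — normalized gain — not verified against Video-MMMU.)"""
    return (acc_after - acc_before) / (1.0 - acc_before)

# A model at 40% accuracy before the video and 49% after realizes
# 15% of its available headroom:
print(f"{delta_knowledge(0.40, 0.49):.2%}")  # → 15.00%
```

Normalizing by headroom keeps the metric comparable across models with very different starting accuracies.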

Uni-MMMU (Oct 2025) — Unifies understanding and generation evaluation across the same domains with bidirectional tasks.

5. Open-ended and generation evaluation

Partially addressed

94% of MMMU is MCQ. Real expert work involves generating explanations, proofs, and diagrams.

Recent work

Uni-MMMU (Oct 2025) — Bidirectional tasks coupling generation and understanding. Models must generate visual outputs to demonstrate comprehension, and use generation as a reasoning scaffold. Strongest move toward open-ended evaluation in the MMMU family.

6. Multilingual coverage

Area to explore

All questions are English-only. Expert knowledge is practiced globally with varying notation conventions, diagram styles, and domain terminology. No multilingual version of MMMU or MMMU-Pro exists. Creating equivalent expert-level questions in other languages would require new domain-expert annotators per language — a significant effort no one has undertaken.

Scorecard

Improvement vector                     Status               Key work
Shortcut exploitation / MCQ ceiling    Addressed            MMMU-Pro (4→10 options, vision-only)
Contamination resistance               Area to explore      No dynamic refresh; test answers public
Saturation on original MMMU            Partially addressed  MMMU-Pro restores headroom
Video / temporal extension             Addressed            Video-MMMU, Uni-MMMU
Open-ended / generation eval           Partially addressed  Uni-MMMU (bidirectional)
Multilingual coverage                  Area to explore      No multilingual version exists

Bottom line

MMMU catalyzed a family of successors. The original is nearing saturation, but MMMU-Pro has taken over as the active benchmark. The biggest unresolved gaps are contamination resistance and multilingual coverage. The frontier is moving fastest on video extension and unified understand+generate evaluation.
