
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

Yue, Zheng, Ni, Wang, Zhang, Tong, Sun, Yu, Zhang, Sun, Su, Chen, Neubig — September 2024 (ACL 2025 Main)

📄 arXiv:2409.02813  ·  📥 PDF  ·  💻 GitHub  ·  🌐 Leaderboard

TL;DR: A hardened version of MMMU that patches shortcut exploitation through three steps: filtering text-answerable questions, expanding from 4 to 10 options, and embedding questions inside screenshots. Performance dropped 17–27% across all tested models, strong evidence that they had been gaming the original benchmark.

Level 1 — Beginner

What is this paper about?

Remember MMMU — the college final exam for AI? It had a problem: some models were passing not because they understood the material, but because they were gaming the test. MMMU-Pro is the same team saying: "We caught you cheating. Here's a harder version."

The cheat-proofing analogy

Core idea

Imagine a professor discovers students are acing the exam by eliminating obviously wrong answers, guessing from option patterns, or answering questions without looking at the diagrams. Three fixes: throw out giveaway questions, add more choices, and embed everything in the image so they must actually look. That's exactly what MMMU-Pro does.

The three-step hardening process

Step 1: Filter text-answerable questions. Four text-only LLMs tried to answer MMMU questions without seeing images. If 3 out of 4 could answer a question most of the time, that question was removed.

Step 2: Expand from 4 to 10 options. Human experts added 6 more plausible wrong answers per question. Guessing floor dropped from 25% to 10%.

Step 3: Vision-only input. Human annotators photographed questions on screens with varying backgrounds, fonts, and layouts. The model receives only an image — no separate text input.

Key results

GPT-4o, Standard setting: 54% (was 69% on MMMU)
GPT-4o, Vision setting (embedded questions): 50%
Top models, Apr 2026 (GPT-5.4, Gemini 3 Pro): 81%

Two surprising findings

OCR prompts don't help. The stronger models already extract text from images well (85–92% OCR accuracy), so the bottleneck isn't reading — it's reasoning with what they read.

Chain-of-thought is a double-edged sword. CoT improved Claude 3.5 Sonnet by 12 points but hurt models like VILA-1.5-40B that couldn't follow the structured format.

Key takeaway

MMMU-Pro proves that a significant chunk of MMMU performance came from shortcuts, not genuine understanding. By removing those shortcuts, it reveals how far multimodal AI truly is from expert-level performance.

Quiz — Level 1
1. What was the core problem with MMMU that motivated MMMU-Pro?
Text-only LLMs could answer some questions without images. Models exploited option patterns rather than genuinely understanding the visuals.
2. What does expanding from 4 to 10 options accomplish?
It lowers the random-guessing floor from 25% to 10% and blunts elimination strategies. GPT-4o dropped 10.7% from this change alone, showing how much performance relied on option-space advantages.
3. What is the "vision-only input setting"?
Annotators captured screenshots with varied fonts, backgrounds, and layouts. The model must simultaneously read and see, the way you'd process a photographed textbook page.
4. Why didn't OCR prompts improve vision-only performance?
Models scored 85–92% OCR accuracy. Reading text is solved; reasoning with it in visual context is not.
5. How much did performance drop from MMMU to MMMU-Pro?
Every model dropped significantly. VILA-1.5-40B dropped 27%. This proves a large portion of MMMU performance came from shortcuts.

Level 2 — Intermediate

The three-step pipeline in detail

Step 1 — Text-only filtering. Four LLMs (Llama-3-70B, Qwen2-72B, Yi-1.5-34B, Mixtral-8×22B) attempted each question 10 times without images. "Answerable" = correct >5/10 times. If 3+ models flagged it, removed. 1,800 survived (60 per subject).
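
The filtering rule above reduces to a small predicate. A minimal sketch (the tallies are illustrative; the paper's exact evaluation harness may differ):

```python
# Step 1 filtering rule: a question is removed when 3+ of the 4 text-only
# LLMs answer it correctly more than 5 of 10 image-free attempts.
from typing import Dict

def is_text_answerable(correct_counts: Dict[str, int],
                       per_model_threshold: int = 5,
                       models_needed: int = 3) -> bool:
    """correct_counts maps each text-only LLM to its correct answers out of 10."""
    flags = sum(1 for c in correct_counts.values() if c > per_model_threshold)
    return flags >= models_needed

# Hypothetical tallies: three models exceed 5/10, so the question is filtered out.
question = {"Llama-3-70B": 8, "Qwen2-72B": 7, "Yi-1.5-34B": 6, "Mixtral-8x22B": 3}
print(is_text_answerable(question))  # True
```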

Step 2 — Option augmentation. GPT-4o generated candidates, Claude 3.5 Sonnet filtered, two human review rounds. 70 questions removed for weak image relevance. Final: 1,730 questions.

Step 3 — Vision-only screenshots. Manual capture varying backgrounds, fonts, sizes, devices. Manual (not automated) to prevent template-pattern learning.

Evaluation settings

Setting | Input format | Options | In MMMU-Pro score
Standard (10 opts) | Text + images separately | 10 | Yes
Vision | Everything in screenshot | 10 | Yes
Standard (4 opts) | Text + images separately | 4 | No (comparison only)

Delta decomposition

Δ1 = Standard(10) − MMMU(Val): text filtering + option augmentation. GPT-4o: −15.1%.

Δ2 = Vision − MMMU(Val): all three steps combined. GPT-4o: −19.4%.

Vision penalty = Δ2 − Δ1. GPT-4o: −4.3%. VILA-1.5-40B: −21.8%. Reveals which models genuinely integrate vision-language vs. depend on explicit text input.
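
Plugging in GPT-4o's quoted scores, the decomposition is a three-line computation (a minimal sketch, not the paper's evaluation code):

```python
# Delta decomposition using GPT-4o's figures (MMMU validation 69.1,
# Standard-10 54.0, Vision 49.7; all in percent).
def decompose(mmmu_val: float, standard10: float, vision: float):
    d1 = standard10 - mmmu_val   # Δ1: text filtering + option augmentation
    d2 = vision - mmmu_val       # Δ2: all three hardening steps combined
    return d1, d2, d2 - d1       # Δ2 − Δ1 is the vision penalty

d1, d2, penalty = decompose(mmmu_val=69.1, standard10=54.0, vision=49.7)
print(round(d1, 1), round(d2, 1), round(penalty, 1))  # -15.1 -19.4 -4.3
```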

The OCR paradox

OCR accuracy 36.6%–92.3%. Weak correlation with Vision performance. LLaVA-OneVision-72B matched InternVL2-76B on OCR but scored far lower on questions. The Vision setting tests integrated cognitive processing, not text extraction.

CoT across disciplines

CoT helped most in structured reasoning (Tech & Eng: +14.5% for GPT-4o). Helped least in interpretive domains (Art & Design: +1.6%). Hurt VILA-1.5-40B (−17.1% in Art & Design) — confused reasoning chains worse than direct answers.

Ranking changes as diagnostic

Ranking drops between MMMU and MMMU-Pro reveal shortcut exploitation. VILA-1.5-40B jumped 9 ranks on Δ2 despite good standard scores — heavily reliant on explicit text. Claude 3.5 Sonnet held steady — more genuine integration.

Key takeaway

MMMU-Pro's delta decomposition separates how much performance comes from anti-guessing measures (Δ1) vs. genuine vision-language integration (Vision penalty), revealing architectural differences invisible in raw scores.

Quiz — Level 2
1. Why did text-only filtering use 4 diverse LLMs instead of 1?
Each LLM has different biases. A question Llama-3 can't exploit might be trivially solvable by Qwen2. Four diverse models cast a wider net.
2. What does a large Vision penalty (Δ2 − Δ1) reveal about a model?
VILA-1.5-40B had a 21.8% Vision penalty vs. GPT-4o's 4.3%. This exposes which models break when text moves from input into the visual channel.
3. Why were vision-only screenshots captured manually?
Automated generation would produce consistent templates models could memorize. Manual variation in lighting, fonts, and backgrounds mimics real-world diversity.
4. Why does high OCR accuracy fail to predict strong Vision performance?
LLaVA-OneVision-72B had 87.8% OCR but only 24% on Vision questions. Reading text is solved; understanding its role in visual context is not.
5. Why did CoT hurt some models like VILA-1.5-40B?
CoT amplifies reasoning for models that follow structured formats, but generates noise for those that can't — leading to worse answers than direct guessing.

Level 3 — Expert

Construction methodology

Text-only filtering: 4 LLMs × 10 attempts per question, no images. "Answerable" = >5/10 correct. Exclusion = 3+ models flagged. Conservative filter. Reduced text-only accuracy: ~33% → ~17% (filtered) → ~12% (augmented).

Option augmentation: Random baseline: 1/4 = 25% → 1/10 = 10%. The empirical impact exceeds the 15-point theoretical drop because more options also blunt elimination strategies: with 4 options, ruling out 2 leaves a 50% guess; with 10, ruling out 2 still leaves only 1/8 = 12.5%.
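
The guessing-floor arithmetic can be sanity-checked in a few lines:

```python
# Chance of a correct random guess after eliminating some distractors.
def guess_floor(options: int, eliminated: int = 0) -> float:
    return 1 / (options - eliminated)

print(guess_floor(4))      # 0.25  -> original MMMU floor
print(guess_floor(10))     # 0.1   -> MMMU-Pro floor
print(guess_floor(10, 2))  # 0.125 -> even after eliminating 2 of 10 options
```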

Dataset: 1,730 questions × 2 settings = 3,460 evaluation items. ~58 per subject.

Evaluation framework

Score = average(Standard 10-opt, Vision). Both Direct and CoT evaluated; higher reported — measures each model's ceiling. Human expert range: 73–85% (approximated from original MMMU data, adjusted for 10-option random penalty).
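
A minimal sketch of this aggregation, assuming per-setting accuracies are already computed (the accuracy values below are hypothetical):

```python
# Reported score: average the Standard (10-option) and Vision settings,
# taking the better of Direct and CoT prompting within each setting.
def mmmu_pro_score(standard: dict, vision: dict) -> float:
    """Each dict maps a prompting style ('direct'/'cot') to accuracy in percent."""
    return (max(standard.values()) + max(vision.values())) / 2

# Hypothetical per-setting accuracies for one model:
score = mmmu_pro_score(standard={"direct": 52.0, "cot": 54.0},
                       vision={"direct": 48.0, "cot": 46.5})
print(score)  # best of each setting: (54.0 + 48.0) / 2 = 51.0
```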

The implicit Vision penalty

Vision penalty = Δ2 − Δ1. GPT-4o: −4.3%. VILA-1.5-40B: −21.8%. Reveals whether models process text and images as integrated information or as separate channels that break when text moves into the visual modality.

OCR-reasoning dissociation

OCR accuracy (Levenshtein) 36.6%–92.3%. Weak correlation with Vision performance. LLaVA-OneVision-72B matched InternVL2-76B on OCR but diverged on reasoning. Proves the Vision setting tests cognitive integration, not text extraction.
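
One plausible form of a Levenshtein-based OCR accuracy, assuming normalization by the longer string (the paper's exact normalization is not specified in this summary):

```python
# Classic dynamic-programming edit distance plus a normalized similarity score.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def ocr_accuracy(predicted: str, reference: str) -> float:
    """1.0 = exact transcription; lower as edits accumulate."""
    if not reference and not predicted:
        return 1.0
    return 1 - levenshtein(predicted, reference) / max(len(predicted), len(reference))

print(ocr_accuracy("What is the eigenvalue?", "What is the eigenvalue?"))  # 1.0
```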

CoT discipline dependence

CoT value proportional to problem decomposability. Tech & Eng: +14.5% (clear steps). Art & Design: +1.6% (interpretive). VILA-1.5-40B: −17.1% in Art & Design with CoT.

Critical evaluation

Strengths

Principled three-step hardening with each step measured independently. Vision-only tests genuine novel capability. Delta decomposition reveals architectural differences. OCR-reasoning dissociation is clean and important. Same 30 subjects enables direct MMMU comparison.

Weaknesses

Approximated human baseline — estimated, not measured. Vision-only assumption untested.

Construction circularity — GPT-4o generates distractors, Claude 3.5 Sonnet filters; both then evaluated.

No contamination analysis — inherits MMMU's vulnerability.

Uncontrolled vision variation — no analysis of presentation-confound effects.

Still MCQ-only — 10 options still allows elimination strategies.

Smaller scale — 1,730 vs. MMMU's 11,500. Limited per-subject power.

Key takeaway

MMMU-Pro is methodologically rigorous. Its three-step approach is principled, its delta decomposition analytically valuable, and the OCR-reasoning dissociation a genuine contribution. The weaknesses are real but don't undermine the core finding: models were substantially gaming MMMU.

Quiz — Level 3
1. Why is the implicit Vision penalty the most architecturally revealing metric?
GPT-4o loses 4.3% from the vision setting; VILA-1.5-40B loses 21.8%. This gap exposes genuine vs. channel-dependent vision-language integration.
2. What subtle circularity exists in MMMU-Pro's construction?
Distractors shaped by what GPT-4o considers plausible may be calibrated to a specific difficulty level for that model family, creating unmeasured bias.
3. Why report the higher of Direct and CoT scores?
CoT hurts some models but helps others dramatically. Reporting the max measures each model's ceiling, not penalizing architectures that respond differently to CoT.
4. What is an untested assumption in the human expert approximation?
Humans naturally integrate text and images, making this plausible. But varied fonts and noisy backgrounds could introduce friction — and the magnitude is unknown without testing.
5. Why does CoT's benefit correlate with problem decomposability?
Tech & Eng (+14.5%) decomposes into calculation steps; Art & Design (+1.6%) involves holistic interpretation. CoT shines when reasoning has an explicit chain.

Level 4 — Frontier

Six improvement vectors for MMMU-Pro, mapped against recent work (as of April 2026).

1. Contamination resistance

Area to explore

MMMU-Pro inherits MMMU's contamination vulnerability. 1,730 questions sourced from MMMU's public dataset; test answers released February 2026. No contamination detection or dynamic refresh in the MMMU family. Models like GPT-5 and Gemini 3 Pro explicitly cite MMMU-Pro, increasing contamination incentives.

2. Empirical human evaluation

Area to explore

Human expert baseline approximated from original MMMU data, not measured on MMMU-Pro itself. The vision-only assumption is untested. Running actual experts on even a subset would validate the approximation. No one has done this.

3. Saturation trajectory

Partially addressed

Top models score ~81% (GPT-5.4, Gemini 3 Pro) vs. estimated human ceiling of 80–85%. Models converging rapidly. MMMU-Pro restored headroom vs. original MMMU but may approach saturation within a year without further hardening or transition to open-ended evaluation.

4. Open-ended evaluation

Partially addressed

Still MCQ-only. Uni-MMMU (Oct 2025) introduces bidirectional understand+generate tasks across the same 30 subjects — strongest move toward open-ended evaluation in the MMMU family, but a separate benchmark.

5. Vision-only variation analysis

Area to explore

Manual screenshots introduce natural but uncontrolled variation. No analysis of how performance varies across different renderings of the same question (dark vs. light backgrounds, serif vs. sans-serif, different resolutions). A systematic study would determine whether presentation confounds affect reliability.

6. Construction pipeline circularity

Area to explore

GPT-4o generates distractors; Claude 3.5 Sonnet filters. Both evaluated. No analysis of whether construction-involved models are systematically advantaged or disadvantaged.

Scorecard

Improvement vector | Status | Key work
Contamination resistance | Area to explore | No dynamic refresh in MMMU family
Empirical human evaluation | Area to explore | Only approximated, never measured
Saturation trajectory | Partially addressed | Headroom exists but models converging ~81%
Open-ended evaluation | Partially addressed | Uni-MMMU (separate benchmark)
Vision-only variation | Area to explore | No presentation-confound study
Construction circularity | Area to explore | No bias analysis for involved models

Bottom line

MMMU-Pro is the current gold standard for expert multimodal evaluation. It successfully hardened MMMU's weaknesses and remains a meaningful discriminator. The biggest unresolved gaps: contamination resistance and empirical human validation. The single highest-impact experiment: run actual human experts on MMMU-Pro.
