Yue, Zheng, Ni, Wang, Zhang, Tong, Sun, Yu, Zhang, Sun, Su, Chen, Neubig — September 2024 (ACL 2025 Main)
📄 arXiv:2409.02813 · 📥 PDF · 💻 GitHub · 🌐 Leaderboard
Remember MMMU — the college final exam for AI? It had a problem: some models were passing not because they understood the material, but because they were gaming the test. MMMU-Pro is the same team saying: "We caught you cheating. Here's a harder version."
Imagine a professor discovers students are acing the exam by eliminating obviously wrong answers, guessing from option patterns, or answering questions without looking at the diagrams. Three fixes: throw out giveaway questions, add more choices, and embed everything in the image so they must actually look. That's exactly what MMMU-Pro does.
Step 1: Filter text-answerable questions. Four text-only LLMs tried to answer MMMU questions without seeing images. If 3 out of 4 could answer a question most of the time, that question was removed.
Step 2: Expand from 4 to 10 options. Human experts added 6 more plausible wrong answers per question. Guessing floor dropped from 25% to 10%.
Step 3: Vision-only input. Human annotators photographed questions on screens with varying backgrounds, fonts, and layouts. The model receives only an image — no separate text input.
OCR prompts don't help. Strong models already extract text from images well (top performers reach 85–92% accuracy). The bottleneck isn't reading — it's reasoning with what they read.
Chain-of-thought is a double-edged sword. CoT improved Claude 3.5 Sonnet by 12 points but hurt models like VILA-1.5-40B that couldn't follow the structured format.
MMMU-Pro proves that a significant chunk of MMMU performance came from shortcuts, not genuine understanding. By removing those shortcuts, it reveals how far multimodal AI truly is from expert-level performance.
Step 1 — Text-only filtering. Four LLMs (Llama-3-70B, Qwen2-72B, Yi-1.5-34B, Mixtral-8×22B) attempted each question 10 times without images. "Answerable" = correct >5/10 times. If 3+ models flagged it, removed. 1,800 survived (60 per subject).
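The filtering rule above can be expressed as a small helper. This is a minimal sketch with made-up data structures; `is_text_answerable` and `filter_questions` are my names, not the authors' code.

```python
def is_text_answerable(attempts, gold, threshold=5):
    """One model flags a question as text-answerable if it answers
    correctly on more than `threshold` of its 10 text-only attempts."""
    return sum(a == gold for a in attempts) > threshold

def filter_questions(answers, gold_answers, min_models=3):
    """Drop a question when 3+ of the 4 text-only LLMs flag it.

    answers: {model_name: {question_id: [10 predicted option letters]}}
    gold_answers: {question_id: correct option letter}
    """
    kept = []
    for qid, gold in gold_answers.items():
        n_flagged = sum(
            is_text_answerable(model_answers[qid], gold)
            for model_answers in answers.values()
        )
        if n_flagged < min_models:
            kept.append(qid)  # survives the filter
    return kept
```

Note the strict inequality: exactly 5/10 correct does not count as answerable, matching the ">5/10" rule.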
Step 2 — Option augmentation. GPT-4o generated candidates, Claude 3.5 Sonnet filtered, two human review rounds. 70 questions removed for weak image relevance. Final: 1,730 questions.
Step 3 — Vision-only screenshots. Annotators manually captured each question on screen, varying backgrounds, fonts, sizes, and devices. Capture was manual (not automated) to prevent models from learning a fixed template pattern.
| Setting | Input format | Options | Counted in MMMU-Pro score |
|---|---|---|---|
| Standard (10 opts) | Text + images separately | 10 | Yes |
| Vision | Everything in screenshot | 10 | Yes |
| Standard (4 opts) | Text + images separately | 4 | No (comparison only) |
Δ1 = Standard(10) − MMMU(Val): text filtering + option augmentation. GPT-4o: −15.1%.
Δ2 = Vision − MMMU(Val): all three steps combined. GPT-4o: −19.4%.
Vision penalty = Δ2 − Δ1. GPT-4o: −4.3%. VILA-1.5-40B: −21.8%. Reveals which models genuinely integrate vision-language vs. depend on explicit text input.
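The decomposition above is simple arithmetic over three scores. A minimal sketch, using hypothetical score values for illustration (the `decompose` helper is my own, not the paper's code):

```python
def decompose(mmmu_val, standard10, vision):
    """Split a model's MMMU-Pro drop into its two sources.

    mmmu_val:   accuracy on original MMMU validation (4 options)
    standard10: accuracy on MMMU-Pro Standard (10 options, separate text)
    vision:     accuracy on MMMU-Pro Vision (screenshot-only input)
    """
    delta1 = standard10 - mmmu_val   # filtering + option augmentation
    delta2 = vision - mmmu_val       # all three hardening steps combined
    vision_penalty = delta2 - delta1 # cost of vision-only input alone
    return delta1, delta2, vision_penalty

# Hypothetical model: 70.0 on MMMU-Val, 55.0 Standard(10), 50.0 Vision
# -> delta1 = -15.0, delta2 = -20.0, vision penalty = -5.0
```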
OCR accuracy ranges from 36.6% to 92.3%, with only weak correlation to Vision performance. LLaVA-OneVision-72B matched InternVL2-76B on OCR but scored far lower on the questions themselves. The Vision setting tests integrated cognitive processing, not text extraction.
CoT helped most in structured reasoning (Tech & Eng: +14.5% for GPT-4o). Helped least in interpretive domains (Art & Design: +1.6%). Hurt VILA-1.5-40B (−17.1% in Art & Design) — confused reasoning chains worse than direct answers.
Ranking shifts between MMMU and MMMU-Pro reveal shortcut exploitation. VILA-1.5-40B fell 9 ranks under Δ2 despite respectable standard scores — it is heavily reliant on explicit text input. Claude 3.5 Sonnet held steady — more genuine integration.
MMMU-Pro's delta decomposition separates how much performance comes from anti-guessing measures (Δ1) vs. genuine vision-language integration (Vision penalty), revealing architectural differences invisible in raw scores.
Text-only filtering: 4 LLMs × 10 attempts per question, no images. "Answerable" = >5/10 correct. Exclusion = 3+ models flagged. Conservative filter. Reduced text-only accuracy: ~33% → ~17% (filtered) → ~12% (augmented).
Option augmentation: Random baseline: 1/4=25% → 1/10=10%. Empirical impact exceeds 15-point theoretical drop because augmented options also reduce elimination: with 10 options, eliminating 2 still leaves 1/8=12.5%.
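The guessing-floor arithmetic above is easy to verify with exact fractions (`guess_floor` is a hypothetical helper for illustration):

```python
from fractions import Fraction

def guess_floor(n_options, n_eliminated=0):
    """Expected accuracy from uniform random guessing after a model
    eliminates `n_eliminated` distractors."""
    return Fraction(1, n_options - n_eliminated)

# Pure guessing:        4 options -> 1/4 = 25%, 10 options -> 1/10 = 10%
# After eliminating 2:  4 options -> 1/2 = 50%, 10 options -> 1/8 = 12.5%
```

The second row shows why the empirical drop exceeds the 15-point theoretical one: elimination strategies pay off far less when eight live options remain instead of two.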
Dataset: 1,730 questions × 2 settings = 3,460 evaluation items. ~58 per subject.
Score = average(Standard 10-opt, Vision). Both Direct and CoT evaluated; higher reported — measures each model's ceiling. Human expert range: 73–85% (approximated from original MMMU data, adjusted for 10-option random penalty).
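One plausible reading of that aggregation, sketched below, takes the better of Direct and CoT prompting per setting before averaging; the exact order of max and average is my interpretation, and `mmmu_pro_score` is a hypothetical helper, not the authors' code:

```python
def mmmu_pro_score(standard10_direct, standard10_cot,
                   vision_direct, vision_cot):
    """Ceiling-style score: best prompting strategy per setting,
    then the mean of the two settings."""
    standard = max(standard10_direct, standard10_cot)  # Standard (10 opts)
    vision = max(vision_direct, vision_cot)            # Vision (screenshot)
    return (standard + vision) / 2
```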
The vision penalty (Δ2 − Δ1; GPT-4o: −4.3%, VILA-1.5-40B: −21.8%) reveals whether models process text and images as integrated information or as separate channels that break when text moves into the visual modality.
OCR accuracy (Levenshtein) 36.6%–92.3%. Weak correlation with Vision performance. LLaVA-OneVision-72B matched InternVL2-76B on OCR but diverged on reasoning. Proves the Vision setting tests cognitive integration, not text extraction.
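The paper scores OCR via edit distance; a standard normalized-Levenshtein similarity is sketched below (the exact normalization the authors use may differ, and both function names are mine):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def ocr_accuracy(predicted: str, reference: str) -> float:
    """Similarity in [0, 1]: 1 minus edit distance normalized by
    the longer string's length."""
    if not reference:
        return 1.0 if not predicted else 0.0
    return 1 - levenshtein(predicted, reference) / max(len(predicted),
                                                       len(reference))
```

A model can score near 1.0 here (perfect transcription) and still fail the question — exactly the dissociation the LLaVA-OneVision-72B vs. InternVL2-76B comparison shows.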
CoT value proportional to problem decomposability. Tech & Eng: +14.5% (clear steps). Art & Design: +1.6% (interpretive). VILA-1.5-40B: −17.1% in Art & Design with CoT.
Principled three-step hardening with each step measured independently. Vision-only tests genuine novel capability. Delta decomposition reveals architectural differences. OCR-reasoning dissociation is clean and important. Same 30 subjects enables direct MMMU comparison.
Approximated human baseline — estimated from original MMMU data, not measured on MMMU-Pro; the assumption that expert accuracy carries over to the vision-only setting is untested.
Construction circularity — GPT-4o generates distractors, Claude 3.5 Sonnet filters; both then evaluated.
No contamination analysis — inherits MMMU's vulnerability.
Uncontrolled vision variation — no analysis of presentation-confound effects.
Still MCQ-only — 10 options still allows elimination strategies.
Smaller scale — 1,730 vs. MMMU's 11,500. Limited per-subject power.
MMMU-Pro is methodologically rigorous. Its three-step approach is principled, its delta decomposition analytically valuable, and the OCR-reasoning dissociation a genuine contribution. The weaknesses are real but don't undermine the core finding: models were substantially gaming MMMU.
Six improvement vectors for MMMU-Pro, mapped against recent work (as of April 2026).
MMMU-Pro inherits MMMU's contamination vulnerability. 1,730 questions sourced from MMMU's public dataset; test answers released February 2026. No contamination detection or dynamic refresh in the MMMU family. Models like GPT-5 and Gemini 3 Pro explicitly cite MMMU-Pro, increasing contamination incentives.
Human expert baseline approximated from original MMMU data, not measured on MMMU-Pro itself. The vision-only assumption is untested. Running actual experts on even a subset would validate the approximation. No one has done this.
Top models score ~81% (GPT-5.4, Gemini 3 Pro) vs. estimated human ceiling of 80–85%. Models converging rapidly. MMMU-Pro restored headroom vs. original MMMU but may approach saturation within a year without further hardening or transition to open-ended evaluation.
Still MCQ-only. Uni-MMMU (Oct 2025) introduces bidirectional understand+generate tasks across the same 30 subjects — strongest move toward open-ended evaluation in the MMMU family, but a separate benchmark.
Manual screenshots introduce natural but uncontrolled variation. No analysis of how performance varies across different renderings of the same question (dark vs. light backgrounds, serif vs. sans-serif, different resolutions). A systematic study would determine whether presentation confounds affect reliability.
GPT-4o generates distractors; Claude 3.5 Sonnet filters. Both evaluated. No analysis of whether construction-involved models are systematically advantaged or disadvantaged.
| Improvement vector | Status | Key work |
|---|---|---|
| Contamination resistance | Area to explore | No dynamic refresh in MMMU family |
| Empirical human evaluation | Area to explore | Only approximated, never measured |
| Saturation trajectory | Partially addressed | Headroom exists but models converging ~81% |
| Open-ended evaluation | Partially addressed | Uni-MMMU (separate benchmark) |
| Vision-only variation | Area to explore | No presentation-confound study |
| Construction circularity | Area to explore | No bias analysis for involved models |
MMMU-Pro is the current gold standard for expert multimodal evaluation. It successfully hardened MMMU's weaknesses and remains a meaningful discriminator. The biggest unresolved gaps: contamination resistance and empirical human validation. The single highest-impact experiment: run actual human experts on MMMU-Pro.