Fu, Dai, Luo, Li, Ren, Zhang, Wang, Zhou, Shen, Zhang, Chen, Li, Lin, Zhao, Li, Xu, Zheng, Chen, Shan, He, Sun · May 2024 (CVPR 2025)
arXiv:2405.21075 · PDF · GitHub · Project
AI models like GPT-4o and Gemini can look at images and answer questions about them. But the real world isn't a photograph: it's a video. Things move, change, and unfold over time. This paper asks: how well can these AI models actually understand videos?
The answer is "not as well as you'd think," and before this paper, we didn't even have a good way to measure it.
Think of Video-MME as building a proper report card for video-understanding AI. Before it, the existing "tests" were like giving a college student only arithmetic quizzes: too easy, too narrow, not covering enough subjects. Video-MME is the first test comprehensive enough to actually tell you how good (or bad) these models are.
1. Diversity: covers lots of subjects. The benchmark spans 6 big categories (knowledge, film & TV, sports, artistic performance, daily life, and multilingual content) broken into 30 specific subtypes: football replays, cooking tutorials, documentaries, magic shows, news reports, and more.
2. Duration: short to long. Videos range from 11 seconds to a full hour. Understanding a 10-second clip is fundamentally different from understanding a 30-minute documentary. Most prior benchmarks only had short clips.
3. Multiple modalities: not just the picture. Beyond video frames, the benchmark tests whether models can use subtitles and audio to improve understanding. Many videos are hard to understand from visuals alone.
4. Quality: humans wrote everything. All 2,700 questions (3 per video, 900 videos) were written and reviewed by human experts, not auto-generated. Any question a model could answer without watching the video was thrown out.
All models get worse as videos get longer. Gemini 1.5 Pro dropped from 81.7% on short videos to 67.4% on long ones, a 14-point cliff.
Subtitles and audio help a lot. Adding subtitles boosted Gemini 1.5 Pro by 6.2 percentage points overall, and by 10.1 points on long videos.
If you want AI that can be a doctor reviewing surgery footage, a sports analyst breaking down game film, or a tutor explaining a lecture video, you need models that truly understand video. Video-MME gives the field a reliable measuring stick for tracking progress toward that goal.
The pipeline has three stages: video collection → QA annotation → quality review.
Video collection starts with a domain hierarchy: 6 top-level domains drawn from popular YouTube trends, subdivided into 30 fine-grained categories. For each category, videos are collected at three duration tiers: short (<2 min), medium (4-15 min), and long (30-60 min). Subtitles are available for 744 of 900 videos; audio tracks for all 900.
QA annotation uses expert human annotators with strong English and vision-language research experience. Each annotator watches the entire video, then writes 3 multiple-choice questions with 4 options each. The 2,700 questions span 12 task types across perception and reasoning. Answer distribution is nearly uniform across A/B/C/D (25.1/27.2/25.3/22.4%).
Quality review is two-pass: a different annotator checks language and logic, then questions are text-only filtered with Gemini 1.5 Pro. If the model answers correctly without the video, the question is sent back for revision. Gemini scored <15% on text-only, confirming genuine video dependency.
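The text-only filtering step can be sketched as a simple predicate. This is a minimal sketch, not the paper's actual pipeline: the prompt wording and the `ask_model` stub are our assumptions, standing in for the real Gemini 1.5 Pro call.

```python
def text_only_filter(question, options, gold, ask_model):
    """Flag a question if a model answers it correctly WITHOUT seeing the video.

    ask_model: any callable taking a text prompt and returning a letter.
    Returns True when the question leaks (answerable text-only) and should
    be sent back for revision.
    """
    prompt = question + "\n" + "\n".join(options) + "\nAnswer with A, B, C, or D."
    return ask_model(prompt) == gold

# A question answerable from world knowledge alone gets flagged.
leaky = text_only_filter(
    "Which country hosted the 2008 Summer Olympics?",
    ["A. China", "B. France", "C. Brazil", "D. Japan"],
    "A",
    ask_model=lambda _: "A",  # stub model; always answers "A"
)
print(leaky)  # True
```

The <15% text-only score reported above corresponds to this predicate returning True for fewer than 15% of the final questions.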
The certificate length is the minimum total duration of video sub-clips needed to answer a question. It isolates temporal difficulty from total video length. Video-MME's median certificate lengths are 26s (short), 164.7s (medium), and 890.7s (long), far exceeding EgoSchema's ~100s.
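Given annotated evidence spans, the certificate length is just the total duration of the (merged) sub-clips. A minimal sketch, where the `(start, end)` span format is our assumption rather than the paper's release format:

```python
def certificate_length(evidence_spans):
    """Total duration (s) of the minimal evidence sub-clips for one question.

    evidence_spans: list of (start, end) times in seconds marked as necessary
    to answer the question. Overlapping spans are merged first so shared
    footage is not double-counted.
    """
    merged = []
    for start, end in sorted(evidence_spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return sum(end - start for start, end in merged)

# Two overlapping spans (10-40s, 30-60s) merge into one 50s clip,
# plus an isolated 20s clip at 100-120s.
print(certificate_length([(10, 40), (30, 60), (100, 120)]))  # 70
```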
Gemini 1.5 Pro samples at 1 fps for short/medium videos and 0.5 fps for long ones, leveraging its massive context window. Most open-source models are limited to a fixed frame count, often just 8-16 frames regardless of duration. 8 frames from a 30-minute video means one frame every ~225 seconds. Huge amounts of visual information are simply lost.
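The arithmetic above is easy to reproduce. A small sketch contrasting the two sampling regimes (uniform bin midpoints for the fixed-count case are our choice; implementations vary):

```python
def sample_timestamps(duration_s, num_frames=None, fps=None):
    """Frame timestamps under the two sampling regimes described above.

    Fixed-count models pick num_frames uniformly spaced frames; long-context
    models sample at a constant fps. Returns timestamps in seconds.
    """
    if fps is not None:
        return [t / fps for t in range(int(duration_s * fps))]
    step = duration_s / num_frames
    return [step * (i + 0.5) for i in range(num_frames)]  # midpoints of bins

thirty_min = 30 * 60
fixed = sample_timestamps(thirty_min, num_frames=8)
print(f"8 frames from 30 min -> one frame every {fixed[1] - fixed[0]:.0f} s")  # 225 s
dense = sample_timestamps(thirty_min, fps=0.5)  # Gemini-style long-video rate
print(f"0.5 fps from 30 min -> {len(dense)} frames")  # 900 frames
```

The 225-second gap means any event shorter than a few minutes can fall entirely between sampled frames.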
Subtitles consistently outperform audio. Overall +6.2% vs +4.3%. Subtitles are clean text; audio includes ambient noise that models handle less well.
The benefit scales with duration. Subtitles add +2.8% on short videos but +10.1% on long videos, compensating for sparser frame sampling.
Domain matters. Sports gets +9.1% from subtitles (commentary carries play-by-play). Artistic Performance gets only +2.7% (visual content dominates). Multilingual sees up to +16.7% on long videos.
Three compounding factors: (1) Task difficulty shifts: long videos emphasize reasoning over perception. (2) Frame sampling sparsity: fixed-frame models lose information density. (3) Long-context understanding is fundamentally hard: even with adequate frames, maintaining coherence across thousands of visual tokens remains a core challenge.
Video-MME's construction (domain hierarchy, certificate length analysis, multi-modal evaluation, and duration stratification) reveals that the long-video bottleneck is the critical frontier for MLLM development.
Models like Video-LLaMA use a ViT encoder with an image Q-Former per frame, then a video Q-Former for temporal modeling. The Q-Former's learned query vectors (typically 32) compress 196 patch tokens per frame via cross-attention, but this compression is trained on fixed distributions and can't adapt at inference to novel visual details. Fine-grained spatial detail, temporal micro-events, and cross-frame correspondences are systematically lost.
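A toy single-head version of this query-based compression, in numpy, shows the shape of the bottleneck. Random weights stand in for trained ones; real Q-Formers add learned projections, multiple heads, and layer norm:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_patches, n_queries = 64, 196, 32  # ViT patch count, Q-Former query count

patch_tokens = rng.normal(size=(n_patches, d))  # one frame's ViT outputs
queries = rng.normal(size=(n_queries, d))       # learned queries (fixed at inference)

def cross_attention(q, kv):
    """Single-head cross-attention: each query summarizes the patch tokens."""
    scores = q @ kv.T / np.sqrt(q.shape[1])          # (32, 196) attention logits
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)    # softmax over patches
    return weights @ kv                              # (32, d) compressed tokens

compressed = cross_attention(queries, patch_tokens)
print(compressed.shape)  # (32, 64): 196 patch tokens squeezed into 32
```

Whatever the 32 fixed queries don't attend to is gone before the LLM ever sees the frame, which is the adaptivity problem described above.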
Image MLLMs like InternVL-Chat-V1.5 treat video as independent frames. There is no temporal modeling: the LLM must implicitly reason about time from frame ordering. This achieves ~50% on Video-MME, competitive with some video models, validating that image understanding is foundational.
Gemini 1.5 Pro ingests frames at 1/0.5 fps, feeding hundreds of frames into its million-token context. GPT-4o samples up to 384 frames at 512×512. This avoids the information bottleneck but faces quadratic attention costs. Ring attention (as in Large World Models) distributes sequences across GPUs in a ring topology, enabling exact full attention for million-token contexts.
MCQ format enables deterministic evaluation (direct regex matching, no ChatGPT judge) but allows elimination strategies and has a 25% random baseline. Answer extraction uses a standardized prompt template requesting only the letter response.
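A minimal version of that deterministic extraction. The exact regex is our sketch; the benchmark specifies a letter-only response prompt, not this particular pattern:

```python
import re

def extract_choice(response):
    """Pull the first standalone A-D letter from a model response.

    Returns None when no choice letter is found, so refusals and
    off-format answers can be scored as incorrect.
    """
    match = re.search(r"\b([ABCD])\b", response.strip())
    return match.group(1) if match else None

print(extract_choice("C"))                   # C
print(extract_choice("The answer is (B)."))  # B
print(extract_choice("I am not sure."))      # None
```

Because scoring is a string match against the gold letter, no LLM judge is needed and results are exactly reproducible.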
Subtitle synchronization couples with frame sampling: if a model samples 10 frames, it gets the 10 subtitle segments matching those timestamps. Sparse-frame models get sparse subtitles, confounding the modality analysis with the sampling strategy.
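The coupling can be made concrete with a small sketch (toy subtitle data; the inclusion rule, a frame timestamp falling inside a segment, is our reading of the setup):

```python
def subtitles_for_frames(frame_times, subtitles):
    """Subtitle segments a model would see under timestamp-coupled sampling.

    subtitles: list of (start, end, text). A segment is included only if some
    sampled frame timestamp falls inside it.
    """
    return [text for start, end, text in subtitles
            if any(start <= t <= end for t in frame_times)]

subs = [(0, 4, "kickoff"), (5, 9, "pass to the wing"), (10, 14, "goal!")]
# Dense sampling sees every segment; sparse sampling misses the middle one.
print(subtitles_for_frames([0, 5, 10], subs))  # ['kickoff', 'pass to the wing', 'goal!']
print(subtitles_for_frames([2, 12], subs))     # ['kickoff', 'goal!']
```

So a model's measured "subtitle gain" partly reflects how many subtitle segments its sampling rate happens to hit.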
Perception tasks (object recognition, attribute perception, action recognition, OCR, counting, spatial reasoning) dominate short videos. Reasoning tasks (causal reasoning, temporal ordering, future prediction, information synthesis, summarization) dominate long videos. Counting is a joint bottleneck across all models.
| Category | Frames only | + Subtitles | Δ |
|---|---|---|---|
| Knowledge | 74.1% | 83.2% | +9.2% |
| Film & Television | 77.9% | 81.8% | +3.9% |
| Sports Competition | 68.6% | 77.7% | +9.1% |
| Artistic Performance | 78.8% | 81.5% | +2.7% |
| Life Record | 77.4% | 80.3% | +2.9% |
| Multilingual | 78.2% | 85.9% | +7.7% |
| Overall | 75.0% | 81.3% | +6.2% |
Rigorous human annotation + text-only filtering. First benchmark to systematically study subtitle/audio impact. Duration stratification reveals the long-video bottleneck. Clean evaluation without ChatGPT judge. Active leaderboard with 50+ models maintains ongoing relevance.
- Scale: 900 videos / 2,700 questions limits statistical power for 90-cell subcategory analysis (~30 questions each).
- MCQ ceiling: the 25% random floor allows elimination strategies.
- YouTube bias: no medical, surveillance, or broadcast content.
- English-centric: QA pairs are English-only despite a "multilingual" category.
- Static: fixed questions risk training data contamination, with no contamination analysis.
- No temporal grounding: tests "what happened" but not "when."
- Subtitle coupling: sparse-frame models get sparse subtitles, confounding the modality analysis.
Video-MME established the template for comprehensive video MLLM evaluation. Its design decisions (duration stratification, multi-modal inputs, certificate length analysis) directly shaped how the field measures progress. The limitations are real but well-understood, and successor benchmarks build on rather than replace it.
900 videos / 2,700 QAs across 90 subcategory-duration cells means ~30 questions each, too few for robust fine-grained analysis. ALLVB (2025) demonstrates that automated annotation pipelines (GPT-4o + human QC) can scale to 252K QAs.
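To see why ~30 questions per cell is thin, a quick normal-approximation (Wald) confidence interval on a cell-level accuracy estimate; the 75% accuracy used here is an illustrative value, not a reported number:

```python
import math

def wald_halfwidth(p_hat, n, z=1.96):
    """Half-width of a 95% normal-approximation CI for an accuracy estimate."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

for n in (30, 2700):  # one subcategory cell vs. the full benchmark
    hw = wald_halfwidth(0.75, n)
    print(f"n={n}: 75% accuracy is 75 ± {100 * hw:.1f} points")
```

At n=30 the interval spans roughly ±15 points, so differences between models within a single cell are mostly noise; only the aggregate n=2,700 scores are tightly estimated.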
MCQ-only has a 25% random baseline and allows elimination. MLVU adds open-ended captioning, summarization, and free-form QA alongside MCQ, a more discriminating evaluation format, though one that is harder to score.
Video-MME asks "what happened" but not "when." Temporal localization is critical for real applications. LongVideoBench introduces "referring reasoning" and CrossVid (2026) pushes into cross-video temporal reasoning.
Fixed benchmarks leak into training data. GPT-5 and Gemini 2.5 Pro both cite Video-MME as a primary benchmark, meaning their training pipelines are aware of it. No current video benchmark has robust dynamic or held-out evaluation.
Subtitle synchronization couples with frame sampling: sparse-frame models get sparse subtitles. Testing with full transcripts regardless of frame count would isolate the true modality effect.
All 900 videos are from YouTube. Medical imaging, surveillance, industrial inspection, and broadcast content have distinct characteristics. MLVU partially addresses this with surveillance and egocentric footage.
Successor benchmarks: MLVU (CVPR 2025) adds open-ended tasks and referring QA. ALLVB (AAAI 2025) scales to 1,376 videos averaging ~2 hours with 252K QAs. LongVideoBench introduces referring reasoning. CrossVid (2026) evaluates cross-video reasoning. MME-Unify (ICLR 2026) unifies understanding and generation evaluation.
Architectural advances: VideoChat-Flash proposes hierarchical compression (~1/50 ratio with minimal loss). QuoTA allocates tokens by query relevance (+3.2% across benchmarks). LP-Comp (NeurIPS 2025) achieves one token per frame via learnable progressive compression. The token compression field has exploded with dozens of methods (FlashVID, ForestPrune, DToMA, METok).
Industry adoption: Video-MME is cited by OpenAI (GPT-4.1 as "industry standard measure," GPT-5), Google (Gemini 2.5 Pro, Gemini 3 Pro) as a primary video understanding benchmark.
| Improvement vector | Status | Key work |
|---|---|---|
| Scale (more videos/questions) | Addressed | ALLVB (252K QAs) |
| Open-ended evaluation | Addressed | MLVU (open + MCQ) |
| Temporal grounding | Partial | LongVideoBench |
| Contamination resistance | Open gap | none yet |
| Decoupled subtitle eval | Open gap | none yet |
| Beyond YouTube domains | Partial | MLVU (surveillance, ego) |
| Cross-video reasoning | Addressed | CrossVid (2026) |
| Token compression efficiency | Active | VideoChat-Flash, QuoTA, LP-Comp |
| Unified understand + generate | Active | MME-Unify (ICLR 2026) |