Fu, Dai, Luo, Li, Ren, Zhang, Wang, Zhou, Shen, Zhang, Chen, Li, Lin, Zhao, Li, Xu, Zheng, Chen, Shan, He, Sun · May 2024 (CVPR 2025)
arXiv:2405.21075 · PDF · GitHub · Project
AI models like GPT-4o and Gemini can look at images and answer questions about them. But the real world isn't a photograph: it's a video. Things move, change, and unfold over time. This paper asks: how well can these AI models actually understand videos?
The answer is "not as well as you'd think," and before this paper, we didn't even have a good way to measure it.
Think of Video-MME as building a proper report card for video-understanding AI. Before it, the existing "tests" were like giving a college student only arithmetic quizzes: too easy, too narrow, not covering enough subjects. Video-MME is the first test comprehensive enough to actually tell you how good (or bad) these models are.
1. Diversity: covers lots of subjects. The benchmark spans 6 big categories (knowledge, film & TV, sports, artistic performance, daily life, and multilingual content) broken into 30 specific subtypes: football replays, cooking tutorials, documentaries, magic shows, news reports, and more.
2. Duration: short to long. Videos range from 11 seconds to a full hour. Understanding a 10-second clip is fundamentally different from understanding a 30-minute documentary. Most prior benchmarks only had short clips.
3. Multiple modalities: not just the picture. Beyond video frames, the benchmark tests whether models can use subtitles and audio to improve understanding. Many videos are hard to understand from visuals alone.
4. Quality: humans wrote everything. All 2,700 questions (3 per video, 900 videos) were written and reviewed by human experts, not auto-generated. Any question a model could answer without watching the video was thrown out.
All models get worse as videos get longer. Gemini 1.5 Pro dropped from 81.7% on short videos to 67.4% on long ones, a 14-point cliff.
Subtitles and audio help a lot. Adding subtitles boosted Gemini 1.5 Pro by 6.2 percentage points overall, and by 10.1 points on long videos.
If you want AI that can be a doctor reviewing surgery footage, a sports analyst breaking down game film, or a tutor explaining a lecture video, you need models that truly understand video. Video-MME gives the field a reliable measuring stick for tracking progress toward that goal.
The pipeline has three stages: video collection → QA annotation → quality review.
Video collection starts with a domain hierarchy: 6 top-level domains drawn from popular YouTube trends, subdivided into 30 fine-grained categories. For each category, videos are collected at three duration tiers: short (<2 min), medium (4-15 min), and long (30-60 min). Subtitles are available for 744 of 900 videos; audio tracks for all 900.
QA annotation uses expert human annotators with strong English and vision-language research experience. Each annotator watches the entire video, then writes 3 multiple-choice questions with 4 options each. The 2,700 questions span 12 task types across perception and reasoning. Answer distribution is nearly uniform across A/B/C/D (25.1/27.2/25.3/22.4%).
Quality review is two-pass: a different annotator checks language and logic, then questions are text-only filtered with Gemini 1.5 Pro. If the model answers correctly without the video, the question is sent back for revision. Gemini scored <15% on text-only, confirming genuine video dependency.
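The text-only filtering step can be sketched as a simple predicate. This is a minimal sketch, not the paper's actual pipeline: the prompt wording and the `ask_model` stub are our assumptions, standing in for the real Gemini 1.5 Pro call.

```python
def text_only_filter(question, options, gold, ask_model):
    """Flag a question if a model answers it correctly WITHOUT seeing the video.

    ask_model: any callable taking a text prompt and returning a letter.
    Returns True when the question leaks (answerable text-only) and should
    be sent back for revision.
    """
    prompt = question + "\n" + "\n".join(options) + "\nAnswer with A, B, C, or D."
    return ask_model(prompt) == gold

# A question answerable from world knowledge alone gets flagged.
leaky = text_only_filter(
    "Which country hosted the 2008 Summer Olympics?",
    ["A. China", "B. France", "C. Brazil", "D. Japan"],
    "A",
    ask_model=lambda _: "A",  # stub model; always answers "A"
)
print(leaky)  # True
```

The <15% text-only score reported above corresponds to this predicate returning True for fewer than 15% of the final questions.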
The certificate length is the minimum total duration of video sub-clips needed to answer a question. It isolates temporal difficulty from total video length. Video-MME's median certificate lengths are 26s (short), 164.7s (medium), and 890.7s (long), far exceeding EgoSchema's ~100s.
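Given annotated evidence spans, the certificate length is just the total duration of the (merged) sub-clips. A minimal sketch, where the `(start, end)` span format is our assumption rather than the paper's release format:

```python
def certificate_length(evidence_spans):
    """Total duration (s) of the minimal evidence sub-clips for one question.

    evidence_spans: list of (start, end) times in seconds marked as necessary
    to answer the question. Overlapping spans are merged first so shared
    footage is not double-counted.
    """
    merged = []
    for start, end in sorted(evidence_spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return sum(end - start for start, end in merged)

# Two overlapping spans (10-40s, 30-60s) merge into one 50s clip,
# plus an isolated 20s clip at 100-120s.
print(certificate_length([(10, 40), (30, 60), (100, 120)]))  # 70
```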
Gemini 1.5 Pro samples at 1 fps for short/medium videos and 0.5 fps for long ones, leveraging its massive context window. Most open-source models are limited to a fixed frame count, often just 8-16 frames regardless of duration. 8 frames from a 30-minute video means one frame every ~225 seconds. Huge amounts of visual information are simply lost.
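The arithmetic above is easy to reproduce. A small sketch contrasting the two sampling regimes (uniform bin midpoints for the fixed-count case are our choice; implementations vary):

```python
def sample_timestamps(duration_s, num_frames=None, fps=None):
    """Frame timestamps under the two sampling regimes described above.

    Fixed-count models pick num_frames uniformly spaced frames; long-context
    models sample at a constant fps. Returns timestamps in seconds.
    """
    if fps is not None:
        return [t / fps for t in range(int(duration_s * fps))]
    step = duration_s / num_frames
    return [step * (i + 0.5) for i in range(num_frames)]  # midpoints of bins

thirty_min = 30 * 60
fixed = sample_timestamps(thirty_min, num_frames=8)
print(f"8 frames from 30 min -> one frame every {fixed[1] - fixed[0]:.0f} s")  # 225 s
dense = sample_timestamps(thirty_min, fps=0.5)  # Gemini-style long-video rate
print(f"0.5 fps from 30 min -> {len(dense)} frames")  # 900 frames
```

The 225-second gap means any event shorter than a few minutes can fall entirely between sampled frames.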
Subtitles consistently outperform audio. Overall +6.2% vs +4.3%. Subtitles are clean text; audio includes ambient noise that models handle less well.
The benefit scales with duration. Subtitles add +2.8% on short videos but +10.1% on long videos, compensating for sparser frame sampling.
Domain matters. Sports gets +9.1% from subtitles (commentary carries play-by-play). Artistic Performance gets only +2.7% (visual content dominates). Multilingual sees up to +16.7% on long videos.
Three compounding factors: (1) Task difficulty shifts: long videos emphasize reasoning over perception. (2) Frame sampling sparsity: fixed-frame models lose information density. (3) Long-context understanding is fundamentally hard: even with adequate frames, maintaining coherence across thousands of visual tokens remains a core challenge.
Video-MME's construction (domain hierarchy, certificate length analysis, multi-modal evaluation, and duration stratification) reveals that the long-video bottleneck is the critical frontier for MLLM development.
Models like Video-LLaMA use a ViT encoder with an image Q-Former per frame, then a video Q-Former for temporal modeling. The Q-Former's learned query vectors (typically 32) compress 196 patch tokens per frame via cross-attention, but this compression is trained on fixed distributions and can't adapt at inference to novel visual details. Fine-grained spatial detail, temporal micro-events, and cross-frame correspondences are systematically lost.
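A toy single-head version of this query-based compression, in numpy, shows the shape of the bottleneck. Random weights stand in for trained ones; real Q-Formers add learned projections, multiple heads, and layer norm:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_patches, n_queries = 64, 196, 32  # ViT patch count, Q-Former query count

patch_tokens = rng.normal(size=(n_patches, d))  # one frame's ViT outputs
queries = rng.normal(size=(n_queries, d))       # learned queries (fixed at inference)

def cross_attention(q, kv):
    """Single-head cross-attention: each query summarizes the patch tokens."""
    scores = q @ kv.T / np.sqrt(q.shape[1])          # (32, 196) attention logits
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)    # softmax over patches
    return weights @ kv                              # (32, d) compressed tokens

compressed = cross_attention(queries, patch_tokens)
print(compressed.shape)  # (32, 64): 196 patch tokens squeezed into 32
```

Whatever the 32 fixed queries don't attend to is gone before the LLM ever sees the frame, which is the adaptivity problem described above.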
Image MLLMs like InternVL-Chat-V1.5 treat video as independent frames. There is no temporal modeling: the LLM must implicitly reason about time from frame ordering. This achieves ~50% on Video-MME, competitive with some video models, validating that image understanding is foundational.
Gemini 1.5 Pro ingests frames at 1/0.5 fps, feeding hundreds of frames into its million-token context. GPT-4o samples up to 384 frames at 512×512. This avoids the information bottleneck but faces quadratic attention costs. Ring attention (as in Large World Models) distributes sequences across GPUs in a ring topology, enabling exact full attention for million-token contexts.
MCQ format enables deterministic evaluation (direct regex matching, no ChatGPT judge) but allows elimination strategies and has a 25% random baseline. Answer extraction uses a standardized prompt template requesting only the letter response.
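A minimal version of that deterministic extraction. The exact regex is our sketch; the benchmark specifies a letter-only response prompt, not this particular pattern:

```python
import re

def extract_choice(response):
    """Pull the first standalone A-D letter from a model response.

    Returns None when no choice letter is found, so refusals and
    off-format answers can be scored as incorrect.
    """
    match = re.search(r"\b([ABCD])\b", response.strip())
    return match.group(1) if match else None

print(extract_choice("C"))                   # C
print(extract_choice("The answer is (B)."))  # B
print(extract_choice("I am not sure."))      # None
```

Because scoring is a string match against the gold letter, no LLM judge is needed and results are exactly reproducible.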
Subtitle synchronization couples with frame sampling: if a model samples 10 frames, it gets the 10 subtitle segments matching those timestamps. Sparse-frame models get sparse subtitles, confounding the modality analysis with the sampling strategy.
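The coupling can be made concrete with a small sketch (toy subtitle data; the inclusion rule, a frame timestamp falling inside a segment, is our reading of the setup):

```python
def subtitles_for_frames(frame_times, subtitles):
    """Subtitle segments a model would see under timestamp-coupled sampling.

    subtitles: list of (start, end, text). A segment is included only if some
    sampled frame timestamp falls inside it.
    """
    return [text for start, end, text in subtitles
            if any(start <= t <= end for t in frame_times)]

subs = [(0, 4, "kickoff"), (5, 9, "pass to the wing"), (10, 14, "goal!")]
# Dense sampling sees every segment; sparse sampling misses the middle one.
print(subtitles_for_frames([0, 5, 10], subs))  # ['kickoff', 'pass to the wing', 'goal!']
print(subtitles_for_frames([2, 12], subs))     # ['kickoff', 'goal!']
```

So a model's measured "subtitle gain" partly reflects how many subtitle segments its sampling rate happens to hit.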
Perception tasks (object recognition, attribute perception, action recognition, OCR, counting, spatial reasoning) dominate short videos. Reasoning tasks (causal reasoning, temporal ordering, future prediction, information synthesis, summarization) dominate long videos. Counting is a joint bottleneck across all models.
| Category | Frames only | + Subtitles | Δ |
|---|---|---|---|
| Knowledge | 74.1% | 83.2% | +9.2% |
| Film & Television | 77.9% | 81.8% | +3.9% |
| Sports Competition | 68.6% | 77.7% | +9.1% |
| Artistic Performance | 78.8% | 81.5% | +2.7% |
| Life Record | 77.4% | 80.3% | +2.9% |
| Multilingual | 78.2% | 85.9% | +7.7% |
| Overall | 75.0% | 81.3% | +6.2% |
Rigorous human annotation + text-only filtering. First benchmark to systematically study subtitle/audio impact. Duration stratification reveals the long-video bottleneck. Clean evaluation without ChatGPT judge. Active leaderboard with 50+ models maintains ongoing relevance.
- Scale: 900 videos / 2,700 questions limits statistical power for 90-cell subcategory analysis (~30 questions each).
- MCQ ceiling: the 25% random floor allows elimination strategies.
- YouTube bias: no medical, surveillance, or broadcast content.
- English-centric: QA pairs are English-only despite a "multilingual" category.
- Static: fixed questions risk training data contamination, with no contamination analysis.
- No temporal grounding: tests "what happened" but not "when."
- Subtitle coupling: sparse-frame models get sparse subtitles, confounding the modality analysis.
Video-MME established the template for comprehensive video MLLM evaluation. Its design decisions (duration stratification, multi-modal inputs, certificate length analysis) directly shaped how the field measures progress. The limitations are real but well-understood, and successor benchmarks build on rather than replace it.
900 videos / 2,700 QAs across 90 subcategory-duration cells means ~30 questions each, too few for robust fine-grained analysis. ALLVB (2025) demonstrates that automated annotation pipelines (GPT-4o + human QC) can scale to 252K QAs.
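To see why ~30 questions per cell is thin, a quick normal-approximation (Wald) confidence interval on a cell-level accuracy estimate; the 75% accuracy used here is an illustrative value, not a reported number:

```python
import math

def wald_halfwidth(p_hat, n, z=1.96):
    """Half-width of a 95% normal-approximation CI for an accuracy estimate."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

for n in (30, 2700):  # one subcategory cell vs. the full benchmark
    hw = wald_halfwidth(0.75, n)
    print(f"n={n}: 75% accuracy is 75 ± {100 * hw:.1f} points")
```

At n=30 the interval spans roughly ±15 points, so differences between models within a single cell are mostly noise; only the aggregate n=2,700 scores are tightly estimated.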
MCQ-only has a 25% random baseline and allows elimination. MLVU adds open-ended captioning, summarization, and free-form QA alongside MCQ, a more discriminating evaluation format, though one that is harder to score.
Video-MME asks "what happened" but not "when." Temporal localization is critical for real applications. LongVideoBench introduces "referring reasoning" and CrossVid (2026) pushes into cross-video temporal reasoning.
Fixed benchmarks leak into training data. GPT-5 and Gemini 2.5 Pro both cite Video-MME as a primary benchmark, meaning their training pipelines are aware of it. No current video benchmark has robust dynamic or held-out evaluation.
Subtitle synchronization couples with frame sampling: sparse-frame models get sparse subtitles. Testing with full transcripts regardless of frame count would isolate the true modality effect.
All 900 videos are from YouTube. Medical imaging, surveillance, industrial inspection, and broadcast content have distinct characteristics. MLVU partially addresses this with surveillance and egocentric footage.
Successor benchmarks: MLVU (CVPR 2025) adds open-ended tasks and referring QA. ALLVB (AAAI 2025) scales to 1,376 videos averaging ~2 hours with 252K QAs. LongVideoBench introduces referring reasoning. CrossVid (2026) evaluates cross-video reasoning. MME-Unify (ICLR 2026) unifies understanding and generation evaluation.
Architectural advances: VideoChat-Flash proposes hierarchical compression (~1/50 ratio with minimal loss). QuoTA allocates tokens by query relevance (+3.2% across benchmarks). LP-Comp (NeurIPS 2025) achieves one token per frame via learnable progressive compression. The token compression field has exploded with dozens of methods (FlashVID, ForestPrune, DToMA, METok).
Industry adoption: Video-MME is cited by OpenAI (GPT-4.1 as "industry standard measure," GPT-5), Google (Gemini 2.5 Pro, Gemini 3 Pro) as a primary video understanding benchmark.
| Improvement vector | Status | Key work |
|---|---|---|
| Scale (more videos/questions) | Addressed | ALLVB (252K QAs) |
| Open-ended evaluation | Addressed | MLVU (open + MCQ) |
| Temporal grounding | Partial | LongVideoBench |
| Contamination resistance | Open gap | none yet |
| Decoupled subtitle eval | Open gap | none yet |
| Beyond YouTube domains | Partial | MLVU (surveillance, ego) |
| Cross-video reasoning | Addressed | CrossVid (2026) |
| Token compression efficiency | Active | VideoChat-Flash, QuoTA, LP-Comp |
| Unified understand + generate | Active | MME-Unify (ICLR 2026) |