Rohit Patel, Danielle Belgrave, Erin Grant, Tom Zahavy, Jessica B. Hamrick, Kevin McClain et al. — March 2026
Most AI benchmarks test one thing at a time — math questions, reading comprehension, code generation. Models have gotten so good at these that the benchmarks saturate:
MMLU (2020): GPT-4 hit 86.4% → benchmark dead
GSM8K (2021): Multiple models hit 95%+ → benchmark dead
HumanEval (2021): Effectively solved → benchmark dead
Each lasted ~2-3 years before saturation.
The community is running out of single-domain tests.
Real-world intelligence doesn’t work in isolated domains. A doctor diagnosing a patient combines medical knowledge, visual interpretation of scans, probabilistic reasoning, communication, and temporal sequencing — all simultaneously. GIM measures that kind of integration.
SINGLE-DOMAIN PROBLEM:
"What is 15% of 240?"
→ One cognitive operation: arithmetic
→ Models ace this
INTEGRATION-DENSE PROBLEM (GIM-style):
A 1955 ZIP code problem:
"Given a historical postal routing map from 1955,
determine which modern ZIP codes correspond to
the original routing zones, accounting for the
fact that ZIP codes weren't introduced until 1963."
→ World Knowledge (postal history)
→ Temporal Reasoning (1955 vs 1963 timeline)
→ Spatial Reasoning (map interpretation)
→ Quantitative Reasoning (zone calculations)
→ Language (parsing the complex prompt)
= 5 cognitive operations, all SIMULTANEOUSLY required.
Fail at ANY ONE and you get the wrong answer.
GIM’s thesis: integration density — how many cognitive operations must be coordinated simultaneously — is the right axis of difficulty for measuring intelligence.
| Code | Category | Covers |
|---|---|---|
| LR | Linguistic Reasoning | Language structure, ambiguity |
| QR | Quantitative Reasoning | Math, logic, formal systems |
| SI | Spatial & Intuitive | Visual patterns, 3D reasoning |
| WK | World Knowledge | Facts, history, science |
| LN | Lateral & Novel Thinking | Creative problem-solving |
| PR | Procedural Reasoning | Multi-step processes, algorithms |
| CT | Constraints & Puzzles | Rule satisfaction, optimization |
Each problem maps to a primary category but requires drawing from multiple categories simultaneously. That’s the point — integration, not isolation.
CLASSIC VERSION (every model aces this):
A farmer must transport a wolf, goat, and cabbage
across a river. The boat holds the farmer + 1 item.
Wolf eats goat if left alone. Goat eats cabbage.
GIM VARIANT (integration-dense):
Same setup, but:
- The boat has a WEIGHT LIMIT of 150 kg
- The farmer weighs 80 kg
- Wolf: 90 kg, Goat: 40 kg, Cabbage: 30 kg
- The farmer can carry items on his back (limit: 50 kg)
- It's raining: the river rises 10 cm per crossing
- After 5 crossings, the boat can't make it back
Now you need:
CT Constraint satisfaction (weight limits)
QR Arithmetic (weight calculations per trip)
PR Procedural reasoning (optimal crossing sequence)
SI Spatial/intuitive (river level visualization)
LR Language parsing (complex multi-constraint prompt)
The FAMILIAR framing activates memorized solutions.
The ADDED constraints invalidate those solutions.
The model must DETECT the invalidation.
Then REASON through a novel path.
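A quick way to see how the added constraints invalidate the memorized solution is to check the boat's weight limit alone. The sketch below assumes each crossing carries the farmer plus exactly one item and ignores the back-carry and river-level rules; that reading is a simplification for illustration, not the benchmark's official interpretation.

```python
# Weights taken from the GIM-style variant above (kg).
BOAT_LIMIT = 150
FARMER = 80
ITEMS = {"wolf": 90, "goat": 40, "cabbage": 30}


def boat_load_ok(item: str) -> bool:
    """True if the farmer plus this single item fits under the boat limit."""
    return FARMER + ITEMS[item] <= BOAT_LIMIT


# Assumption: one item per crossing, as in the classic puzzle.
feasible = {name: boat_load_ok(name) for name in ITEMS}
print(feasible)  # {'wolf': False, 'goat': True, 'cabbage': True}
```

Under this reading the wolf (80 + 90 = 170 kg) can never ride with the farmer, so the classic seven-crossing answer fails before the crossing cap even comes into play.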
THE LEADERBOARD (simplified):
Best centaur (human+AI): θ = 2.26
Best pure LLM (GPT-5.4 Pro): θ = 2.16
Average centaur: θ = 0.11
Worst centaur: θ = -1.80
THE INSIGHT:
Best centaur BEATS best pure LLM.
But average centaur is MID-PACK.
The 2.15-logit gap between top and average
centaur EXCEEDS the gap between best and worst models.
→ It's not the AI that matters most.
→ It's the OPERATOR'S SKILL at directing the AI.
ORIGIN: Kasparov's freestyle chess (2005):
"A weak human + machine + better PROCESS was superior
to a strong computer alone and, more remarkably,
superior to a strong human + machine + inferior process."
— Garry Kasparov, Deep Thinking (2017)
21 years later, GIM quantitatively confirms this
with LLMs instead of chess engines.
GIM argues that difficulty isn’t about harder math or more obscure knowledge — it’s about how many cognitive operations must be coordinated simultaneously. And when humans pair with AI, the human’s skill at directing the AI matters more than which AI they choose.
GIM doesn’t use raw accuracy. It uses Item Response Theory (2-Parameter Logistic) — the same statistical framework used in standardized tests like the GRE and GMAT — to jointly estimate model ability and problem quality.
P(correct | θ, a_j, b_j) = 1 / (1 + exp(-a_j(θ - b_j)))
Where:
θ = model ability (the score we care about)
b_j = difficulty of problem j (higher = harder)
a_j = discrimination of problem j
(how sharply it separates strong from weak)
HIGH a_j: Problem sharply separates good from bad models.
Strong models get it right, weak ones don't.
= A GOOD test question.
LOW a_j: Random performance regardless of ability.
Even strong models sometimes fail, weak ones succeed.
= A NOISY or POORLY DESIGNED question.
WHY 2PL > RAW ACCURACY:
Raw accuracy: 80% on easy problems = 80% on hard ones
IRT: 80% on hard problems ≫ 80% on easy ones
IRT also DOWNWEIGHTS noisy questions (low a_j).
A question that strong models randomly fail barely
moves their score, because a low-a_j item carries little weight in θ.
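The 2PL curve and the downweighting effect can be sketched in a few lines. The item parameters below are hypothetical and chosen only to contrast a sharp item with a noisy one; `item_information` uses the standard 2PL Fisher information, a² · P · (1 − P), which is one common way to quantify how much an item contributes to the θ estimate.

```python
import math


def p_correct(theta: float, a: float, b: float) -> float:
    """2PL probability that a model with ability theta solves an item."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta: float, a: float, b: float) -> float:
    """Fisher information of a 2PL item at theta: a^2 * P * (1 - P)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


# Hypothetical items: same difficulty, very different discrimination.
sharp = {"a": 2.0, "b": 1.0}
noisy = {"a": 0.2, "b": 1.0}

for theta in (0.0, 1.0, 2.0):
    print(
        theta,
        round(p_correct(theta, **sharp), 3),  # swings from ~0.12 to ~0.88
        round(p_correct(theta, **noisy), 3),  # stays close to 0.5
    )

info_ratio = item_information(1.0, **sharp) / item_information(1.0, **noisy)
print(round(info_ratio))  # 100
```

At θ = b both items sit at P = 0.5, but the sharp item carries about 100 times the information, which is why a coin-flip question has so little pull on the final score.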
Each GIM problem has a detailed rubric — median 6 criteria per problem. The LLM judge scores each criterion independently, with a confidence weight:
Score for problem j, model i:
s_ij = Σ (c_k × r_k) / Σ c_k
Where:
r_k = rubric criterion k score (0 or 1)
c_k = judge's confidence in that criterion (0-1)
EXAMPLE — The 1955 ZIP Code Problem:
Criterion 1: Correctly identifies ZIP codes didn't
exist in 1955 → r=1, c=0.95
Criterion 2: Maps routing zones to modern codes
→ r=1, c=0.70
Criterion 3: Accounts for zone boundary changes
→ r=0, c=0.85
Criterion 4: Shows temporal reasoning chain
→ r=1, c=0.90
Criterion 5: Final answer correct → r=0, c=0.95
Criterion 6: Acknowledges ambiguity → r=1, c=0.60
Weighted score: (0.95+0.70+0+0.90+0+0.60) /
(0.95+0.70+0.85+0.90+0.95+0.60)
= 3.15 / 4.95 = 0.636
The confidence weights let the judge express uncertainty.
Criterion 2 (c=0.70): judge isn't sure if the mapping
is correct. That uncertainty is built into the score.
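The weighted score above is easy to reproduce. This is a minimal sketch of the formula as stated, using the six (r, c) pairs from the ZIP code example; the function name and the zero-confidence guard are illustrative additions.

```python
def weighted_rubric_score(criteria):
    """Confidence-weighted rubric score: sum(c_k * r_k) / sum(c_k).

    `criteria` is a list of (r_k, c_k) pairs, with r_k in {0, 1}
    and c_k in [0, 1].
    """
    total_conf = sum(c for _, c in criteria)
    if total_conf == 0:
        # Guard added for the sketch: no confident judgments, no score.
        raise ValueError("all confidence weights are zero")
    return sum(r * c for r, c in criteria) / total_conf


# (r_k, c_k) for the six criteria in the 1955 ZIP code example.
zip_example = [
    (1, 0.95),
    (1, 0.70),
    (0, 0.85),
    (1, 0.90),
    (0, 0.95),
    (1, 0.60),
]

score = weighted_rubric_score(zip_example)
print(round(score, 3))  # 0.636
```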
PRIMARY JUDGE: Gemini 3 Flash (structured output)
CROSS-VALIDATION: GPT-5.4
Agreement metrics:
Pearson r = 0.922 per prompt
Cohen's κ = 0.815 per rubric criterion
SCALE:
47 configs × 820 problems × 5 epochs = ~192,700 runs
Each run × ~6 rubric criteria = ~1.15 million judgments
No human panel could do this at this scale.
The cross-validation between model FAMILIES
(not just model versions) is the key defense
against self-preference bias.
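For readers unfamiliar with the per-criterion agreement metric, here is a small sketch of Cohen's κ on binary rubric decisions. The two rating lists are made up for illustration and are not GIM data; the paper's own computation may differ in detail.

```python
def cohens_kappa(judge_a, judge_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("ratings must be non-empty and the same length")
    n = len(judge_a)
    observed = sum(x == y for x, y in zip(judge_a, judge_b)) / n
    labels = set(judge_a) | set(judge_b)
    # Agreement expected by chance from each rater's label frequencies.
    expected = sum(
        (judge_a.count(label) / n) * (judge_b.count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        # Degenerate case: both raters always use the same single label.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical pass/fail decisions on ten rubric criteria.
primary = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
secondary = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]

kappa = cohens_kappa(primary, secondary)
print(round(kappa, 3))  # 0.783
```

The two judges agree on 9 of 10 criteria here, but κ lands near 0.78 rather than 0.90 because some of that agreement would happen by chance.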
| Category | Sub-categories | Multimodal % |
|---|---|---|
| LR Linguistic | Semantic, Pragmatic, Temporal | 12.3% |
| QR Quantitative | Algebraic, Probabilistic, Geometric | 28.7% |
| SI Spatial & Intuitive | Visual, Spatial, Physical | 59.5% |
| WK World Knowledge | Historical, Scientific, Cultural | 15.8% |
| LN Lateral & Novel | Creative, Analogical | 8.2% |
| PR Procedural | Sequential, Conditional | 22.4% |
| CT Constraints | Optimization, Satisfiability | 18.9% |
Note: SI has by far the highest multimodal percentage (59.5%). This matters for understanding model specialization profiles later.
THINKING BUDGET | θ RANGE | COST MULTIPLIER
───────────────────┼───────────────┼─────────────────
OFF | -1.5 to 0.2 | 1×
Low | -0.5 to 0.8 | ~5×
Medium | 0.3 to 1.4 | ~15×
High | 0.8 to 1.9 | ~40×
X-High | 1.2 to 2.2 | ~100×
WITHIN-FAMILY SPREAD:
GPT-5.4: OFF → X-High spans ~1.2 logits
Gemini 3.1: OFF → X-High spans ~0.9 logits
Claude 4.6: OFF → X-High spans ~1.1 logits
BETWEEN-FAMILY spread (all at same thinking):
At Medium: ~0.7 logit range across all families
Within-family spread (~0.9 to 1.2) ≥ between-family spread (~0.7)
→ Configuration is AT LEAST AS IMPORTANT as model choice
PARTICIPANTS: 246 humans + AI
SETUP:
- Free tool choice (any model, any number of models)
- Free process choice (how to use the tools)
- Same GIM problems as pure-model evaluation
- 5 response epochs, same as model evaluation
CONSTRAINT: Participants could NOT just copy-paste
the model's answer. They had to demonstrate they
understood and could evaluate the response.
RESULTS:
Best centaur: θ = 2.26
75th percentile: θ = 1.45
Median: θ = 0.87
25th percentile: θ = 0.11
Worst centaur: θ = -1.80
For comparison:
Best pure LLM: θ = 2.16 (GPT-5.4 Pro, X-High)
TOP CENTAUR BEAT BEST LLM.
But MEDIAN centaur scored below GPT-5.4 at Medium.
Per-category θ is within 0.2-0.3 logits of overall θ
for ALL frontier models. The profiles are remarkably flat.
Small peaks:
Muse Spark: SI (Spatial & Intuitive) +0.3
GPT-5.4 Pro: CT (Constraints/Puzzles) +0.2
Claude 4.6 Opus: LR-Temporal +0.25
Gemini 3.1 Pro: QR (Quantitative) +0.15
IMPLICATION: Pretraining produces generalists.
Specialization is marginal, not architectural.
GIM’s IRT 2PL framework extracts far more signal per problem than raw accuracy. The discrimination parameter (a_j) identifies which problems actually separate strong from weak models, and the confidence-weighted rubric scoring propagates judge uncertainty through the entire pipeline. The result: 820 precision instruments instead of 15,000 noisy thermometers.
MOST BENCHMARKS:
Scrape existing data (MMLU: exam questions)
Or auto-generate (GSM8K: template math problems)
Or crowdsource (cheap, fast, noisy)
Cost per problem: minutes to ~1 hour
Quality: variable. Contamination: likely.
GIM'S APPROACH:
820 original, expert-authored problems
~11 person-hours per problem
= ~9,000 total person-hours (~4 person-years)
Each problem goes through:
1. Design: author crafts problem requiring
specific cognitive integration
2. Rubric: median 6 criteria, each with
independent scoring guidelines
3. Anti-memorization check: verify the problem
can't be solved by recalling known answers
4. Difficulty calibration: test against models
to ensure it discriminates well
5. Category labeling: primary + secondary domains
6. Multimodal assets: images/PDFs where needed
THE SCALE TRADE-OFF:
820 problems × 11 hours = boutique benchmark
MMLU has 15,000+ questions
But GIM's 820 problems with IRT calibration
extract MORE signal per problem than MMLU's 15K.
Each GIM problem is a precision instrument.
MMLU problems are mass-produced thermometers.
LAYER 1: PUBLIC/PRIVATE SPLIT (615/205)
Independently estimate θ on each split:
θ_public vs. θ_private
Correlation: r ≈ 0.98
If a model trained on leaked public problems:
θ_public >> θ_private → contamination detected
r ≈ 0.98 means: no model shows suspicious
divergence between public and private scores.
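The Layer 1 check can be sketched as a correlation plus a per-model gap test. Everything numeric below is hypothetical: the model names, the θ values, and the 0.5-logit threshold are assumptions for illustration, not figures from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Hypothetical ability estimates on each split (not GIM results).
theta_public = {"model_a": 2.10, "model_b": 1.30, "model_c": 0.40,
                "model_d": -0.90, "model_e": 1.80}
theta_private = {"model_a": 2.05, "model_b": 1.35, "model_c": 0.35,
                 "model_d": -0.85, "model_e": 0.90}

GAP_THRESHOLD = 0.5  # assumed cut-off in logits


def flag_contamination(public, private, threshold=GAP_THRESHOLD):
    """Models whose public theta exceeds private theta by more than threshold."""
    return sorted(m for m in public if public[m] - private[m] > threshold)


names = list(theta_public)
r = pearson_r([theta_public[m] for m in names],
              [theta_private[m] for m in names])
flagged = flag_contamination(theta_public, theta_private)
print(flagged)  # ['model_e']
```

Only the one-sided gap (public above private) is flagged, since leaked public items would inflate the public score rather than the private one.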
LAYER 2: LEAVE-ONE-MODEL-OUT (LOMO)
Remove model X from IRT calibration entirely.
Re-estimate item parameters (a_j, b_j) without X.
Re-estimate X's θ using recalibrated parameters.
Recovery: within 0.087 logits of full estimate.
Why this matters: if one model's data is
distorting the calibration (e.g., it memorized
specific items, skewing their difficulty), LOMO
would show large recovery errors for that model.
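The last LOMO step, re-estimating θ against fixed item parameters, can be sketched as a one-dimensional maximum-likelihood solve. This is an assumption-laden illustration: the item parameters are invented, the "recalibrated" set is a hand-made perturbation standing in for a real refit without model X, and the paper's actual estimator may differ.

```python
import math


def p_correct(theta, a, b):
    """2PL success probability."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def estimate_theta(scores, items, lo=-6.0, hi=6.0, tol=1e-9):
    """Maximum-likelihood theta for fixed item parameters (a_j, b_j).

    Solves sum_j a_j * (s_j - P_j(theta)) = 0 by bisection. The left
    side is strictly decreasing in theta when every a_j > 0.
    """
    def gradient(theta):
        return sum(a * (s - p_correct(theta, a, b))
                   for s, (a, b) in zip(scores, items))

    if gradient(lo) <= 0:
        return lo
    if gradient(hi) >= 0:
        return hi
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if gradient(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical item parameters from the full calibration: (a_j, b_j).
items_full = [(1.5, -0.5), (0.8, 0.3), (2.0, 1.2), (1.1, 2.0)]
# Hypothetical parameters after refitting WITHOUT model X.
items_lomo = [(1.45, -0.45), (0.85, 0.3), (1.9, 1.25), (1.1, 1.95)]

# Model X's rubric scores, set to their expected values at theta = 1.0.
scores_x = [p_correct(1.0, a, b) for a, b in items_full]

theta_full = estimate_theta(scores_x, items_full)
theta_lomo = estimate_theta(scores_x, items_lomo)
recovery_error = abs(theta_full - theta_lomo)
print(round(theta_full, 4), round(recovery_error, 4))
```

A small `recovery_error` means model X's own responses were not propping up the item calibration; a large one would be the warning sign described above.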
LAYER 3: ORIGINAL CONTENT
All 820 problems are ORIGINAL.
Not scraped from exams, textbooks, or the web.
Not in any training corpus (probably).
Hard to contaminate what wasn't on the internet.
WHAT THIS DOESN'T CATCH:
Post-publication contamination IS possible.
The private split is the ongoing defense —
it should never be released.
THE FUNDAMENTAL TENSION:
You're evaluating LLMs using... an LLM.
If the judge is biased, every score is biased.
WHERE LLM JUDGES FAIL:
1. SELF-PREFERENCE BIAS
Gemini judging a Gemini response: might score higher.
Defense: cross-validation with DIFFERENT model family.
κ = 0.815 suggests agreement is real, not shared bias.
2. RUBRIC INTERPRETATION DRIFT
Judge might interpret criterion #3 differently
for response A vs. response B.
Defense: structured output + per-criterion scoring.
3. FORMAT SENSITIVITY
Well-formatted responses scored higher regardless
of correctness.
Defense: confidence weights c_k.
But format bias could be confident AND wrong.
4. CEILING EFFECTS
Gemini 3 Flash is BELOW the frontier models
it's judging. Could it miss subtle excellence?
THE HONEST ASSESSMENT:
LLM judges are the best scalable option.
~1.15 million individual judgments — no human
panel could do this.
Cross-validation is genuinely reassuring.
But the fundamental circularity (LLMs grading LLMs)
remains an unsolved epistemological problem.
TOTAL COMPUTE: 3.6 billion tokens, ~10²¹ FLOPs
MARGINAL COST OF θ (illustrative):
OFF → Low: ~$0.01 per 0.001θ improvement
Low → Medium: ~$0.10 per 0.001θ improvement
Medium → High: ~$1.00 per 0.001θ improvement
High → X-High: ~$100 per 0.001θ improvement
THE DEPLOYMENT IMPLICATION:
Medium thinking: ~90% of max performance at ~10% cost
X-High thinking: last ~10% at ~90% of total cost
For products serving billions: Medium or Low.
For benchmarks / bragging rights: X-High.
The same cost curve applies to media perception.
At Instagram-scale (billions of images/day), the
thinking-budget trade-off determines infrastructure
cost in the billions of dollars range.
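The diminishing returns can also be read straight off the thinking-budget table given earlier. The sketch below takes the midpoint of each θ range as a stand-in for typical performance at that level; the midpoint choice is an assumption made here, so the outputs are in cost-multiplier units per logit and are not comparable to the illustrative dollar figures.

```python
# (level, theta range, cost multiplier) from the thinking-budget table.
LEVELS = [
    ("OFF", (-1.5, 0.2), 1),
    ("Low", (-0.5, 0.8), 5),
    ("Medium", (0.3, 1.4), 15),
    ("High", (0.8, 1.9), 40),
    ("X-High", (1.2, 2.2), 100),
]


def midpoint(theta_range):
    """Midpoint of a (low, high) theta range; an assumed summary statistic."""
    low, high = theta_range
    return (low + high) / 2.0


def marginal_cost_per_logit(levels):
    """Extra cost multiplier paid per logit gained at each step up."""
    steps = []
    for (name0, range0, cost0), (name1, range1, cost1) in zip(levels, levels[1:]):
        gain = midpoint(range1) - midpoint(range0)
        steps.append((f"{name0} -> {name1}", (cost1 - cost0) / gain))
    return steps


steps = marginal_cost_per_logit(LEVELS)
for label, cost in steps:
    print(f"{label}: ~{cost:.0f}x per logit")
```

Under the midpoint assumption, each step up costs roughly three times more per logit than the one before it, rising from about 5× to about 170×.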
1. PRETRAINING PRODUCES GENERALISTS
Flat per-category profiles. No architecture
creates domain-specific cognitive modules.
2. SPECIALIZATION IS MARGINAL
Muse Spark's SI +0.3 may reflect generation-pretrained
visual representations (the Vision Banana thesis).
3. THINKING-DISABLED IS A BINARY BREAK
Non-thinking models cluster at the bottom.
NOT a gradual degradation — it's qualitative.
Chain-of-thought is a capability, not a boost.
4. QUANTIZATION MATTERS
Gemma 4 31B: bf16 > fp8 > fp4
Each precision step costs ~0.15 logits.
Quantizing for deployment speed has a
MEASURABLE cost in reasoning quality.
ACKNOWLEDGED BY THE PAPER:
✗ No audio or video modalities (text + images only)
✗ No multi-turn interaction
✗ No tool use (code execution, web search)
✗ No time pressure / real-time constraints
✗ No creative generation evaluation
NOT ACKNOWLEDGED BUT REAL:
✗ Single-session only — can't measure learning
✗ English-centric
✗ Static problems — real integration happens
over EVOLVING contexts
✗ Scale of integration — GIM tests 3-6 operations;
real expert work may require 20-50
✗ The embodied gap — 2D images, not 3D environments
COMPARE WITH EMBODIED NAV EVAL:
Both papers solved evaluation for ONE layer.
Embodied Nav: navigation eval, not manipulation + social
GIM: cognitive integration, not multi-turn + embodied
The FULL evaluation framework doesn't exist yet.
GIM’s three-layer contamination defense, IRT-based quality filtering, and confidence-weighted rubric scoring represent the current state of the art in benchmark design. But the epistemological circularity of LLM judges and the single-dimension limitation (integration density alone) remain fundamental open problems.
ALPHAGO (2016):
More MCTS simulations → better move quality
But: logarithmic returns after ~1,600 sims
Going from 1,600 → 160,000 sims (100×)
gives ~10% improvement in win rate
GIM (2026):
More thinking tokens → higher θ
But: logarithmic returns after Medium
Going from Medium → X-High (~10× compute)
gives ~10% improvement in θ
SAME SHAPE. SAME DIMINISHING RETURNS.
10 years apart, completely different domains.
This suggests a FUNDAMENTAL LAW:
Test-time compute follows logarithmic returns.
The first N tokens of "thinking" give you most
of the value. Each subsequent N gives less.
This isn't a model limitation — it may be a
property of problem complexity itself.
THE PRODUCT SPECTRUM:
MASS-MARKET (billions of users):
Average user = average centaur = θ ≈ 0.11
Pure LLM at Medium = θ ≈ 1.2
→ PURE LLM MODE IS BETTER for mass-market.
→ Average users make the AI WORSE by interfering.
→ The product should HIDE the complexity.
POWER USERS (expert operators):
Skilled operator = top centaur = θ ≈ 2.26
Best pure LLM = θ ≈ 2.16
→ CENTAUR MODE IS BETTER for power users.
→ The product should EXPOSE the controls.
→ Thinking budget, model selection, tool choice.
THE DESIGN QUESTION FOR MEDIA AI:
If you're building media generation for billions:
Pure LLM mode for the default experience.
Centaur controls for the power-user tier.
The centaur finding doesn't mean EVERYONE should
have AI copilots. It means the RIGHT PEOPLE
should have the RIGHT controls.
GIM'S THESIS:
Integration density = the right difficulty axis
THE CRITIQUE:
It's ONE axis. Important? Yes. Sufficient? No.
MISSING AXES:
1. DEPTH — How many levels of abstraction
must be traversed? Integration can be shallow
(many operations, all at surface level) or deep
(few operations, but each requires meta-reasoning).
2. NOVELTY — Can the model solve problems with
NO structural similarity to training data?
GIM's problems are novel but structured.
True novelty = no template at all.
3. ABSTRACTION — Can the model identify the
underlying principle rather than pattern-match?
Integration might be solvable by chaining
pattern matches without true understanding.
4. CREATIVITY — Can the model produce something
genuinely new, not just combine existing elements?
GIM can't measure this because rubric scoring
requires a known-correct answer.
THE HONEST MAP:
GIM measures: integration (one dimension)
We also need: depth, novelty, abstraction, creativity
Each is an independent axis.
No existing benchmark covers more than one well.
THE PERCEPTION-GENERATION CONNECTION:
For media AI, the missing axes matter:
- Depth: understanding nested visual narratives
- Novelty: generating truly original compositions
- Abstraction: grasping style vs. content
- Creativity: the thing that makes art art
GIM's integration axis captures coordination.
But the FULL picture of intelligence requires
all five axes measured simultaneously.
GIM'S CURRENT DEFENSE:
Original content + public/private split + LOMO
THE FUTURE THREAT:
As GIM becomes influential, model trainers will
encounter GIM-style problems in the wild.
Not direct contamination — INDIRECT:
Blog posts analyzing GIM problems.
Study guides for GIM-style reasoning.
Synthetic training data designed to improve
on GIM-like tasks.
The private split catches DIRECT leakage.
It CANNOT catch indirect capability transfer.
LOMO AS CALIBRATION INTEGRITY CHECK:
LOMO's 0.087-logit recovery indicates no single model
is distorting the calibration today.
But as more models train specifically to do well
on integration-dense tasks, the DISTRIBUTION
of model abilities shifts, and item parameters
(a_j, b_j) may need recalibration.
THE META-LESSON:
Every benchmark has a half-life.
GIM's is longer than most (original content,
private split, IRT calibration).
But it's not infinite.
THE ARC OF EVALUATION:
Embodied Nav (2018): "How do we measure navigation?"
→ SPL + goal taxonomy + generalization regimes
→ Solved: one-dimension spatial evaluation
MMLU (2020) → MMMU (2023) → MMMU-Pro (2024):
"How do we measure knowledge?"
→ Multi-domain accuracy + hardening against shortcuts
→ Partially solved: knowledge with anti-gaming
GIM (2026): "How do we measure INTEGRATION?"
→ IRT 2PL + integration density + centaur study
→ Solved: one-dimension cognitive integration
MISSING (2026+): "How do we measure EVERYTHING?"
→ Unified: spatial + cognitive + creative + embodied
→ Unsolved
GIM FILLS THE "HOW TO MEASURE" GAP:
The papers studied so far address:
- HOW TO BUILD: Transfusion, Chameleon, DALL-E 3,
Seedance 2.0
- HOW IT UNDERSTANDS: Vision Banana
- HOW TO SEARCH: AlphaGo, AlphaZero
- HOW TO PREDICT: AlphaFold 2
- HOW TO EVALUATE: GIM-Eval, Embodied Nav
GIM sits at the intersection of building and
evaluating — it measures whether what we build
actually achieves integrated intelligence.
THE CENTAUR BRIDGE:
GIM's centaur finding connects evaluation to product:
1955 (chess) → 2005 (freestyle) → 2026 (GIM centaur)
Each generation confirms: human+machine+process >
machine alone, IF the human has the right skill.
For media AI: the question isn't "will AI replace
human creators?" — it's "which human operators will
unlock capabilities no pure AI can reach?"
| Dimension | Score | Notes |
|---|---|---|
| Novelty | 9/10 | Integration density as THE difficulty axis is genuinely new; centaur study is unprecedented at this scale |
| Impact | 9/10 | Redefines how we think about benchmark difficulty; centaur findings have direct product implications |
| Reproducibility | 8/10 | Code released; IRT methodology well-documented; private split prevents full replication by design |
| Technical Depth | 9/10 | IRT 2PL, 3-layer contamination defense, confidence-weighted rubrics, LOMO validation |
| Writing | 7/10 | Dense and thorough; could be more accessible to non-psychometrics audiences |
| Longevity | 8/10 | Integration density concept will endure; specific problem set has a contamination half-life |
GIM’s deepest contribution isn’t the leaderboard — it’s the integration density thesis and the centaur finding. Integration density reframes how we think about what makes problems hard. The centaur finding, echoing Kasparov’s 2005 freestyle chess result with quantitative rigor, has immediate product implications: for mass-market AI, hide the controls; for expert tools, expose them. The right human+AI pairing isn’t a nice-to-have — it’s the actual product differentiator.