Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, Amir R. Zamir — Multi-institution — July 2018
Between 2016 and 2018, research on embodied navigation (agents learning to navigate 3D environments from visual input) grew rapidly. But every lab was evaluating it differently:
Lab A: "Our agent reaches the goal 80% of the time!"
Lab B: "Our agent reaches the goal 75% of the time!"
Sounds like A wins, right? Except:
- Lab A counts "success" if the agent passes WITHIN 3 METERS
of the goal, even accidentally, even if it keeps walking
- Lab B counts "success" only if the agent STOPS at the goal
and SIGNALS it's done
- Lab A measures distance as straight-line (through walls)
Lab B measures distance along navigable paths
- Lab A trained in the TEST environment for 100 hours
Lab B had ZERO prior exposure to the test environment
These results are COMPLETELY INCOMPARABLE.
It’s like comparing marathon times when one runner’s marathon is 26.2 miles and the other’s is 20 miles on a downhill course.
A working group of the field’s top researchers convened to fix this. They produced 7 consensus recommendations covering task definitions, generalization measurement, evaluation metrics, and simulation standards.
POINTGOAL: "Go to coordinates (5.2, 3.1)"
Agent gets: target location in meters
Challenge: path planning in cluttered space
Analogy: GPS navigation with turn-by-turn
OBJECTGOAL: "Find the refrigerator"
Agent gets: object category name
Challenge: path planning + world knowledge
("kitchens are usually near dining rooms")
Analogy: "Where did I leave my keys?"
AREAGOAL: "Go to the kitchen"
Agent gets: room/area category name
Challenge: path planning + area recognition
Analogy: Navigating a hotel you've never been in
Each goal type can be specified in different ways: coordinates, category labels, images, or natural language (“go to the room with the big window”).
Before this paper, most teams reported success rate alone. But success rate ignores efficiency:
AGENT A: Reaches goal 100% of the time.
Takes 500 meters to travel a 10-meter path.
Success rate: 100% (looks great!)
AGENT B: Reaches goal 80% of the time.
When successful, takes near-optimal paths.
Success rate: 80% (looks worse?)
Which is actually better? You can't tell from success rate alone.
SPL, short for Success weighted by (normalized inverse) Path Length, fixes this:
SPL = (1/N) × Σ Sᵢ × (ℓᵢ / max(pᵢ, ℓᵢ))
Where:
N = number of test episodes
Sᵢ = 1 if agent succeeded, 0 if not
ℓᵢ = shortest possible path (geodesic)
pᵢ = path the agent actually took
EXAMPLES:
100% success, always optimal paths → SPL = 1.0
100% success, always 2× optimal → SPL = 0.5
50% success, always optimal paths → SPL = 0.5
50% success, always 2× optimal → SPL = 0.25
SPL punishes BOTH failure AND inefficiency.
One number that captures what you actually care about.
SPL became the standard metric. Every embodied navigation paper since 2018 reports it.
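The SPL formula above can be sketched in a few lines of Python. This is a minimal illustration rather than any simulator's reference implementation. The function name `spl` and its argument names are hypothetical, and the sketch assumes every shortest path is longer than zero.

```python
from typing import Sequence


def spl(successes: Sequence[int],
        shortest: Sequence[float],
        taken: Sequence[float]) -> float:
    """Average over episodes of S_i * l_i / max(p_i, l_i).

    Assumes every shortest-path length l_i is greater than zero.
    """
    if not (len(successes) == len(shortest) == len(taken)):
        raise ValueError("all inputs need one entry per episode")
    if len(successes) == 0:
        raise ValueError("need at least one episode")
    total = 0.0
    for s, l, p in zip(successes, shortest, taken):
        # max(p, l) keeps each per-episode ratio at or below 1.0
        total += s * (l / max(p, l))
    return total / len(successes)


# The four example cases, using two episodes with a 10 m shortest path each.
print(spl([1, 1], [10, 10], [10, 10]))  # 1.0  (100% success, optimal)
print(spl([1, 1], [10, 10], [20, 20]))  # 0.5  (100% success, 2x optimal)
print(spl([1, 0], [10, 10], [10, 10]))  # 0.5  (50% success, optimal)
print(spl([1, 0], [10, 10], [20, 20]))  # 0.25 (50% success, 2x optimal)
```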
EVALUATION:
#1 DONE ACTION — Agent must explicitly signal "I'm done."
No accidental goal-reaching counts as success.
#2 GEODESIC DISTANCE — Measure shortest navigable path,
not straight-line through walls.
#3 SPL AS PRIMARY METRIC — Report success AND efficiency
in one number.
SIMULATION:
#4 CONTINUOUS STATE SPACES — No grid worlds.
Agents move in continuous 3D space.
#5 SI UNITS — Distance 1 = 1 meter. Period.
#6 SIM-TO-REAL SOFTWARE — Open-source code to deploy
trained agents onto physical robots.
ARCHITECTURE:
#7 INTERNAL REPRESENTATION MATTERS — Study what mental
map the agent builds, not just its success rate.
One of the paper’s key insights is that a result only means something once you know how much the agent has already seen of the test environment:
REGIME 1: ZERO PRIOR EXPLORATION
Agent drops into a completely novel environment.
Must navigate from scratch using only general knowledge.
= The hardest, most impressive regime.
REGIME 2: PRE-RECORDED EXPLORATION
Agent watches a recording of someone else exploring.
Builds a mental model from observation, then navigates.
= Moderate difficulty.
REGIME 3: TIME-LIMITED SELF-EXPLORATION
Agent gets a budget (e.g., 500m of movement) to freely
explore the environment before navigation episodes.
= Easier, but exploration budget is a quantified variable.
THE KEY: this must be REPORTED EXPLICITLY.
A paper claiming "80% success" means nothing if you
don't know which regime it's using.
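One way to make the regime impossible to omit is to store it next to the number it qualifies. The sketch below is only an illustration of that idea: `ExplorationRegime` and `NavResult` are hypothetical names, not part of the paper or of any benchmark toolkit, and the 80% / 500 m figures are placeholders taken from the examples above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ExplorationRegime(Enum):
    ZERO_PRIOR = "zero prior exploration"
    PRE_RECORDED = "pre-recorded exploration"
    TIME_LIMITED = "time-limited self-exploration"


@dataclass(frozen=True)
class NavResult:
    success_rate: float
    regime: ExplorationRegime          # required: no default value
    budget_m: Optional[float] = None   # exploration budget in meters

    def __post_init__(self) -> None:
        if not 0.0 <= self.success_rate <= 1.0:
            raise ValueError("success_rate must be in [0, 1]")
        # A time-limited result is not interpretable without its budget.
        if self.regime is ExplorationRegime.TIME_LIMITED and self.budget_m is None:
            raise ValueError("time-limited runs must state their exploration budget")

    def describe(self) -> str:
        text = f"{self.success_rate:.0%} success ({self.regime.value}"
        if self.budget_m is not None:
            text += f", budget {self.budget_m:g} m"
        return text + ")"


print(NavResult(0.8, ExplorationRegime.TIME_LIMITED, budget_m=500).describe())
# 80% success (time-limited self-exploration, budget 500 m)
```

Because `regime` has no default, a result cannot be constructed without saying which regime produced it.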
This is a 7-page document with zero figures, zero tables, and one equation. And it reshaped an entire field.
BEFORE THIS PAPER (2016–2018):
Every lab had its own task definitions
Metrics were incomparable
"State of the art" was meaningless
AFTER THIS PAPER (2019+):
SPL became universal
PointGoal/ObjectGoal/AreaGoal became standard vocabulary
Habitat platform built around these recommendations
Annual Habitat Challenge uses this exact framework
500+ citations and counting
The direct line: this paper → Habitat (the dominant embodied AI simulation platform) → every major navigation benchmark since 2019. Several co-authors went on to build Habitat, implementing every single recommendation from this document.
The formula has subtle design choices that matter:
SPL = (1/N) × Σᵢ Sᵢ × (ℓᵢ / max(pᵢ, ℓᵢ))
Sᵢ ∈ {0, 1}
Binary. Agent either reached goal AND called DONE = 1.
Otherwise = 0. No partial credit.
ℓᵢ = geodesic shortest path from start to goal
Computed by the simulator (ground truth).
This is the OPTIMAL path respecting walls/obstacles.
pᵢ = actual path length agent traveled
Total distance, including backtracking, dead ends,
circling. Every wasted meter counts against you.
max(pᵢ, ℓᵢ) — WHY THE MAX?
Prevents the ratio from exceeding 1.0.
Edge case: agent clips through a wall (collision)
and takes a "shortcut" where pᵢ < ℓᵢ.
Without the max, this cheating would score >1.
The max clamps it to 1.0 at best.
CRITICAL PROPERTY:
SPL ∈ [0, 1]
SPL = 0 → total failure (never reaches goal)
SPL = 1 → perfect (always reaches goal, optimal paths)
Why not compute success rate and average efficiency separately, then multiply them?
Per-episode weighting ties each efficiency ratio to that episode's outcome.
An episode where the agent wanders 10× the optimal path drags down the score MORE
than an episode where it's only 1.5× optimal, and a failed episode contributes 0
no matter how short its path was.
This prevents a few efficient successes from masking many wildly inefficient ones.
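A small numeric comparison makes the difference concrete. The episodes below are invented for illustration, and `decoupled_score` is a hypothetical name for the alternative being argued against. It assumes "average efficiency" is taken over all episodes regardless of outcome.

```python
def spl(successes, shortest, taken):
    """Mean over episodes of S_i * l_i / max(p_i, l_i)."""
    terms = [s * (l / max(p, l)) for s, l, p in zip(successes, shortest, taken)]
    return sum(terms) / len(terms)


def decoupled_score(successes, shortest, taken):
    """Success rate times mean path efficiency, each averaged on its own.

    Assumption: efficiency is averaged over ALL episodes, including failures.
    """
    n = len(successes)
    success_rate = sum(successes) / n
    mean_efficiency = sum(l / max(p, l) for l, p in zip(shortest, taken)) / n
    return success_rate * mean_efficiency


# Made-up episodes, each with a 10 m shortest path:
# two successes after 100 m of wandering, two failures that stopped after 5 m.
successes = [1, 1, 0, 0]
shortest = [10.0, 10.0, 10.0, 10.0]
taken = [100.0, 100.0, 5.0, 5.0]

print(round(spl(successes, shortest, taken), 3))              # 0.05
print(round(decoupled_score(successes, shortest, taken), 3))  # 0.275
```

In the decoupled version, the short paths of the failed episodes inflate the efficiency average, so the two wasteful successes look far better than they are. SPL only credits efficiency on episodes that actually succeeded.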
FLOOR PLAN:
┌─────────────────────────┐
│          WALL           │
│    ┌──────────────┐     │
│    │              │     │
│  A │              │  B  │
│    │              │     │
│    └──────┐       │     │
│           │ door  │     │
│           └───────┘     │
└─────────────────────────┘
EUCLIDEAN distance A→B: ~3 meters (through wall)
GEODESIC distance A→B: ~12 meters (around wall, through door)
With Euclidean: agent on the wrong side of a wall
gets credit for being "close."
With geodesic: agent must be REACHABLY close.
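The gap between the two distances can be reproduced on a toy map. The sketch below uses a small occupancy grid purely for brevity; real simulators compute geodesic distance over a continuous navigation mesh, so the grid, the 1 m cell size, and the helper names `find` and `geodesic_steps` are all assumptions made for illustration.

```python
import math
from collections import deque

# '#' is wall, '.' is free space; each cell is assumed to be 1 m x 1 m.
GRID = [
    "A#B",
    ".#.",
    "...",
]


def find(grid, ch):
    """Return (row, col) of the first cell containing ch."""
    for r, row in enumerate(grid):
        c = row.find(ch)
        if c != -1:
            return (r, c)
    raise ValueError(f"{ch!r} not in grid")


def geodesic_steps(grid, start, goal):
    """Breadth-first search over free cells; returns step count or None."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < len(grid) and 0 <= nc < len(grid[0])
            if inside and grid[nr][nc] != "#" and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None  # goal is unreachable


a, b = find(GRID, "A"), find(GRID, "B")
print(math.dist(a, b))             # 2.0 (straight line, through the wall)
print(geodesic_steps(GRID, a, b))  # 6   (around the wall)
```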
WITH DONE ACTION, the agent must:
1. Navigate to the goal (path planning)
2. Recognize it has arrived (perception/understanding)
3. Decide to stop (confidence/decision-making)
All three are required. Missing any one = failure.
SUCCESS THRESHOLD τ:
Default: τ = 2 × agent body width
For body width 0.2m → τ = 0.4m
For AreaGoal: center of mass inside target area
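For the distance-based case, the success rule can be written as a single predicate. This is a sketch under stated assumptions: the function names are hypothetical, the distance is assumed to be the geodesic distance to the goal at the moment the agent stops, and treating a distance exactly equal to τ as a success is a choice made here, not something the text above specifies.

```python
def success_threshold_m(body_width_m: float) -> float:
    """tau = 2 x agent body width, in meters."""
    return 2.0 * body_width_m


def episode_succeeded(called_done: bool,
                      distance_to_goal_m: float,
                      body_width_m: float = 0.2) -> bool:
    """Success requires BOTH an explicit DONE and stopping within tau."""
    return called_done and distance_to_goal_m <= success_threshold_m(body_width_m)


print(episode_succeeded(True, 0.3))   # True:  DONE called within 0.4 m
print(episode_succeeded(False, 0.0))  # False: on the goal but never signaled DONE
print(episode_succeeded(True, 0.5))   # False: DONE called too far away
```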
| Environment | Type | Scenes | Goals | Key Feature |
|---|---|---|---|---|
| SUNCG | Synthetic | 500 | Pt/Obj/Area | 41K objects, 110K m² |
| Matterport3D | Real scan | 90 | Pt/Obj/Area | Real homes/hotels/offices |
| AI2-THOR | Synthetic | 120 | Pt/Obj | Interactive objects (open cabinets) |
| Gibson | Real scan | 572 | Pt | Largest — 211K m², 1,447 floors |
SYNTHETIC (SUNCG, AI2-THOR):
+ Unlimited variations, perfect geometry
+ Interactive objects possible (AI2-THOR)
- May not reflect real-world complexity
- "Sim-to-real gap" when deploying to robots
REAL SCANS (Matterport3D, Gibson):
+ Authentic visual complexity and clutter
+ Real lighting, textures, layouts
- Fixed environments (can't generate new ones)
- Scanning artifacts (holes, stitching errors)
Agent gets a "budget" before navigation episodes:
Budget = 0m → zero-shot (Regime 1)
Budget = 500m → brief exploration
Budget = 1000m → moderate exploration
Budget = 2000m → extensive exploration
Budget = ∞ → full memorization
SPL
1.0 ┤
    │                              ●──── ceiling
0.8 ┤                        ●──
    │                  ●──
0.6 ┤            ●──
    │      ●──
0.4 ┤●──
    │
0.2 ┤
    │
0.0 ┼─────┬─────┬─────┬─────┬─────┬─
    0    500   1000  1500  2000   ∞
      Exploration Budget (meters)
THIS CURVE is the real evaluation, not a single number.
Different agents may dominate at different budgets.
LEVEL 0: PURELY REACTIVE
Current frame → Deep Network → Action
No memory. Can't build a mental map.
LEVEL 1: SHORT-TERM VECTORIAL MEMORY (LSTM/GRU)
Recurrent state carries compressed history.
Can remember "I tried going left already."
But fixed-size vector — lossy.
LEVEL 2: RICH INTERNAL REPRESENTATIONS
Agent builds explicit spatial maps, topologies,
semantic labels. Can plan paths through its model.
The paper deliberately doesn't recommend an architecture —
it highlights that internal representation IS the core
research question.
SPL uses max(pᵢ, ℓᵢ) in the denominator rather than simply pᵢ. What specific edge case does this address?
max(pᵢ, ℓᵢ) is a safety clamp. In simulation, collision handling can allow agents to pass through geometry and take physically impossible shortcuts where pᵢ < ℓᵢ. The max ensures no single episode can score above 1.0, preventing physics exploits from inflating SPL.
SPL became universal. But universality breeds complacency, and SPL has real limitations the field only discovered by using it for years:
BLIND SPOT 1: ALL FAILURES LOOK THE SAME
Agent stops 0.1m too early from goal → Sᵢ = 0
Agent wanders aimlessly for 500m → Sᵢ = 0
SPL score for both: 0. These are fundamentally different.
BLIND SPOT 2: PATH QUALITY IS ONE-DIMENSIONAL
Agent takes optimal-LENGTH path but scrapes every wall,
spins 360° at every intersection, moves in jerky bursts.
SPL: 1.0 (perfect!). You’d never deploy this robot.
BLIND SPOT 3: NO DYNAMICS
Agent A reaches goal in 30 seconds.
Agent B reaches goal in 300 seconds (same path length).
SPL: identical. Time is invisible.
BLIND SPOT 4: THRESHOLD SENSITIVITY
Agent consistently stops 0.39m from goal.
With τ = 0.4m → success rate: 100%
With τ = 0.35m → success rate: 0%
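The cliff in Blind Spot 4 is easy to reproduce by sweeping the threshold. The stop distances below are the hypothetical ones from the example above, not measured data, and `success_rate` is an illustrative helper name.

```python
def success_rate(stop_distances_m, tau_m):
    """Fraction of episodes whose final distance to goal is within tau."""
    hits = sum(1 for d in stop_distances_m if d <= tau_m)
    return hits / len(stop_distances_m)


# An agent that always stops 0.39 m from the goal (hypothetical).
stops = [0.39] * 10

for tau in (0.30, 0.35, 0.40, 0.45):
    print(f"tau = {tau:.2f} m -> success rate {success_rate(stops, tau):.0%}")
```

Reporting the success rate at a few nearby thresholds, as this loop does, is one simple way to show whether a headline number sits on such a cliff.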
These blind spots spawned follow-up metrics.
July 2018: This paper published
↓
Apr 2019: Habitat 1.0 released (Facebook AI Research)
↓ Built by co-authors Savva, Malik + others
↓ Implements EVERY recommendation
↓
2019: First Habitat Challenge — PointGoal focus
↓
2020: PointGoal essentially SOLVED
↓ DD-PPO agent: 0.97 SPL on Gibson
↓ Near-optimal paths in unseen environments
↓ Zero-shot generalization regime
↓
2020+: Challenge shifts to ObjectGoal
↓ SPL drops from 0.97 to ~0.3
↓ Adding recognition + world knowledge
↓ is MASSIVELY harder than pure path planning
↓
2022+: MultiObjectNav, Social Navigation, Rearrangement
↓
2023+: Foundation models enter navigation
LLM/VLM planners with small RL execution policies
WHY POINTGOAL WAS "EASY" (in retrospect):
Agent always KNOWS where the goal is.
Challenge is purely: find a collision-free path.
No recognition. No world knowledge. No language.
Deep RL + massive compute was enough.
WHY OBJECTGOAL IS STILL HARD:
"Find the refrigerator" requires:
1. Explore systematically (where to look?)
2. Recognize the target (is that a refrigerator?)
3. Use world knowledge (fridges are in kitchens)
4. Reason about layout (kitchen is near dining room)
5. Handle ambiguity (which refrigerator?)
Adding ONE cognitive layer cut SPL by more than HALF.
This is the navigation equivalent of "integration density"
— coordinating multiple cognitive operations is
fundamentally harder than any single operation.
Agent in simulation: SPL = 0.95
Same agent on robot: SPL = 0.30 (if you're lucky)
SOURCES OF THE GAP:
VISUAL DOMAIN: Clean renders vs. noisy sensor data
(motion blur, varying lighting, reflections)
ACTUATION: move_forward(0.25m) → moves EXACTLY 0.25m
in sim. Real robot: 0.23m, drifts 0.02m right, wheel slips.
DYNAMICS: Agent is a perfect cylinder in sim.
Real robot: complex geometry, shifting center of mass.
ENVIRONMENT: Sim is frozen at scan time.
Real world: furniture moves, people walk through,
doors change state, lighting varies hour to hour.
Even with proper train/test splits, memorization leaks:
1. ENVIRONMENT-TYPE MEMORIZATION
All training kitchens have fridges on south wall →
agent "learns" to always go south for fridge.
Works in training distribution. Fails in novel layouts.
2. SIMULATOR ARTIFACT MEMORIZATION
Agent learns Matterport3D-specific stitching patterns
as navigation cues. Useless in AI2-THOR or real world.
3. DISTRIBUTION MEMORIZATION
Training: 80% of goals are 3–10m away.
Agent calibrates to this range.
Novel environments with longer paths → degraded SPL.
The paper mentioned natural language goal specification but didn’t standardize it. Peter Anderson (this paper’s first author) simultaneously created Room-to-Room (R2R), which combined natural language instructions with navigation:
"Walk past the dining table, turn left at the hallway,
go through the second door on your right, and stop
in front of the bathroom mirror."
This is HARDER because:
- Language is ambiguous ("second door" from which side?)
- Instructions reference landmarks requiring recognition
- Instructions may be wrong or imprecise
- Grounding: mapping words to visual percepts in real-time
ImageNet (2009): standardized image classification eval
→ catalyzed the deep learning revolution
This paper (2018): standardized navigation eval
→ catalyzed embodied AI progress
The meta-lesson: EVALUATION DRIVES RESEARCH DIRECTION.
Whatever you measure is what people optimize.
If the metric is wrong, the field goes sideways.
If the metric is right, the field accelerates.
✓ "SPL of 0.5 would represent good performance"
Understated for PointGoal (hit 0.97), but prescient
for ObjectGoal (still ~0.4–0.5 in 2026).
✓ "Internal representation is central"
Became THE research question. The entire "world models"
movement is about what representations agents build.
✓ "Standardization will catalyze progress"
Habitat Challenge created a coordination mechanism
that drove steady improvement year over year.
✓ "Sim-to-real deployment matters"
Still the hardest problem. Their concern was justified.
✗ DIDN'T ANTICIPATE FOUNDATION MODELS
The paper assumes agents trained FROM SCRATCH via RL
in simulation. Few in 2018 expected that a model
trained on internet text + images could navigate at all.
✗ DIDN'T ANTICIPATE THE LANGUAGE SHIFT
Natural language became the PRIMARY interface for
embodied agents by 2024. VLN is now bigger than
pure ObjectGoal research.
✗ UNDERWEIGHTED MULTI-AGENT SCENARIOS
No mention of multiple agents, social navigation,
or collaborative tasks. These are now central.
✗ THE SIMULATOR MONOCULTURE
Standardizing on 4 environments created a
"teaching to the test" effect — agents optimized
for those specific environments rather than
developing general navigation ability.
THE 2018 PARADIGM (what this paper assumed):
Train an RL agent FROM SCRATCH in simulation:
Environment → RGB → CNN → LSTM → Policy → Action
Millions of episodes. Months of GPU time.
THE 2024+ PARADIGM (what actually happened):
Foundation model as "brain" + small RL policy for control:
┌────────────────────────────┐
│ FOUNDATION MODEL           │
│ (GPT-4V / Gemini / etc.)   │
│                            │
│ "I see a hallway with      │
│ doors. The kitchen is      │
│ likely ahead-left."        │
└─────────────┬──────────────┘
              │ high-level plan
              ▼
┌────────────────────────────┐
│ LOW-LEVEL RL POLICY        │
│ (obstacle avoidance,       │
│ motor commands)            │
└────────────────────────────┘
WHY THIS CHANGES EVERYTHING:
1. ZERO-SHOT GENERALIZATION: model already knows
"refrigerators are in kitchens" from pretraining
2. LANGUAGE IS NATIVE: VLN tasks need no special training
3. COMMON SENSE: hard for RL agents to learn from navigation reward alone
4. SIM-TO-REAL GAP SHRINKS for perception
(trained on real images, not renders)
THE CATCH: latency (2–5s per decision vs. <100ms needed),
cost, and hallucination risk.
SPL was designed for one agent, one goal, one episode, and a static environment. Modern embodied AI needs to evaluate far more:
MULTI-STEP TASKS:
"Go to kitchen, pick up red mug, bring to living room."
SPL measures: did you reach the kitchen?
Doesn't measure: did you grab the RIGHT mug?
SOCIAL NAVIGATION:
"Navigate to exit without making people uncomfortable."
SPL: did you reach the exit efficiently?
Doesn't measure: did you violate personal space?
OPEN-ENDED EXPLORATION:
"Explore this building and report what's here."
No single goal. No success/failure binary.
CONTINUOUS OPERATION:
Home robot that navigates ALL DAY.
How do you measure a month of navigation?
The field needs integrated evaluation that tests navigation
+ manipulation + language + social reasoning together —
not as separate tasks, but as one combined challenge.
The paper’s Recommendation 7 (“internal representation is central”) anticipated today’s world models debate:
2018: "Internal representation" = LSTM hidden state.
A black-box vector. We don't know what's in it.
2020: Spatial memory maps emerge.
Agents build top-down 2D maps from egocentric views.
2022: Neural scene representations (NeRF-based).
Encode 3D geometry + appearance from observations.
2024+: Foundation model "mental models."
Semantic, not geometric.
"I've seen a living room and two bedrooms.
Kitchen is probably beyond the hallway."
THE LIKELY SYNTHESIS:
Foundation model for HIGH-LEVEL reasoning
(which room to visit, what strategy to use)
+
Geometric map for LOW-LEVEL execution
(collision avoidance, precise path following)
Neither pure reasoning nor pure spatial computation
is optimal alone. The combination wins.
The paper’s goal taxonomy maps directly to both media perception and generation:
NAVIGATION GOALS AS A PERCEPTION STACK:
PointGoal → Spatial perception
(depth estimation, free-space segmentation)
ObjectGoal → Object perception + world knowledge
(detection, recognition, semantic segmentation)
AreaGoal → Scene perception + layout understanding
(room classification, spatial layout estimation)
VIDEO GENERATION NEEDS THE SAME CAPABILITIES:
A video generator that produces coherent sequences
must internally represent:
- Consistent 3D geometry across frames
- Object persistence through occlusion
- Physics-aware motion
- Navigable camera paths
A video generator IS a world model.
Generation = perception in reverse.
THE LOOP:
Generation pretraining → better representations
→ better perception → better spatial reasoning
→ better world models → better generation
It's a loop, not a pipeline.
SOLVED:
✓ PointGoal in simulation (0.97 SPL)
✓ Standard evaluation framework (this paper)
✓ High-quality simulators (Habitat, AI2-THOR)
PARTIALLY SOLVED:
◑ ObjectGoal (~0.5 SPL, improving)
◑ VLN (language-conditioned navigation)
◑ Foundation model integration
UNSOLVED:
✗ Sim-to-real transfer at scale
✗ Manipulation + navigation combined
✗ Social navigation (around people)
✗ Long-horizon multi-step tasks
✗ Continuous real-world operation
✗ Real-time foundation model planning
| Dimension | Score | Notes |
|---|---|---|
| Novelty | 7/10 | Not a new algorithm — but a new framework for an entire field |
| Impact | 10/10 | 500+ citations, created Habitat, defined the field’s vocabulary |
| Reproducibility | 9/10 | Standards doc — nothing to reproduce, everything to implement |
| Technical Depth | 6/10 | Deliberately high-level; depth is in the design choices |
| Writing | 9/10 | Remarkably clear, concise, well-structured for 7 pages |
| Longevity | 8/10 | SPL and goal taxonomy endure; sim-to-real and multi-agent gaps now visible |
This paper proves that a well-timed standards document — zero figures, zero tables, one equation — can be more consequential than any individual algorithmic advance. The seven recommendations didn’t just evaluate navigation — they shaped what navigation research became. Whatever you measure is what people build.