AI Researchers’ Perspectives on Automating AI R&D and Intelligence Explosions

Severin Field, Raymond Douglas, David Krueger (MATS Program / Berkeley) · arXiv:2603.03338v2 · March 2026
TL;DR. 25 leading AI researchers from frontier labs (OpenAI, Anthropic, DeepMind, Meta) and academia (Berkeley, Princeton, Stanford) were interviewed in Aug–Sep 2025 about automating AI R&D and intelligence explosions. 20/25 flagged it as one of the most severe and urgent AI risks. 17/25 expect frontier ASARA-capable models to be kept internal rather than deployed publicly. A “schism” emerged between frontier-lab researchers (clearer path to ASARA, more concerned) and academics (more skeptical, more obstacle-focused). Participants split evenly on whether red-line governance would work.

Level 1 — Beginner

What is this paper?

Imagine 25 chefs at the world’s most cutting-edge restaurants and culinary schools. A researcher asks each one: “Do you think AI cooking robots will eventually get so good they start designing better cooking robots themselves — and if they do, will that change everything overnight?”

The chefs from fancy new restaurants (where AI tools are already speeding up the kitchen) tend to say “yes, this is happening, and faster than most people realize.” The chefs from culinary schools say “slow down — we’ve heard hype like this before, and there are real obstacles you’re ignoring.”

That’s what this paper does, but with AI researchers and AI that builds AI. The authors interviewed 25 top researchers in August and September 2025 and published the patterns they found.

The big idea: “Intelligence Explosion”

The core idea was first proposed by I.J. Good in 1966:

If AI ever gets smart enough to improve itself, then a slightly smarter AI can design an even smarter AI, which can design an even smarter one, and so on — a runaway feedback loop where AI capability shoots up at a pace humans can’t follow.

The paper calls this kind of AI ASARA (AI Systems for AI R&D Automation), the technical name for “AI smart enough to help build better AI.”

What did the researchers find?

Four big takeaways:

1. Most experts take this seriously

Out of 25 leading researchers, 20 said this was one of the most severe and urgent AI risks — not a sci-fi distraction, but a real concern.

2. Industry vs. academia divide

Researchers inside frontier AI companies tend to see a relatively clear path to ASARA. Academic researchers are more skeptical — they’ve seen AI hype cycles fizzle before and their culture rewards skepticism.

3. Most expect ASARA to be kept secret

17 out of 25 thought frontier AI labs will keep their most powerful AI internal rather than release it publicly. It’s more valuable for accelerating their own research than for selling to customers.

4. Experts disagree on how to handle it

Some want “red lines” — bright-line rules like “no AI may improve itself without human approval.” Others think red lines are too rigid and prefer transparency requirements (mandatory reporting, government monitoring).

The three-stage path the experts described

The interviewees mostly agreed on the shape of the path to ASARA, even when they disagreed on timing:

  1. Speedup: AI is a power tool that makes researchers faster (where we are now — Cursor, Claude Code making coding 5× faster)
  2. Collaboration: AI handles whole sub-tasks on its own while humans set direction
  3. Full automation: AI runs the whole research loop. As one professor put it: “The human will become the bottleneck — companies will try to remove the humans by all means.”

Why this paper matters

Most discussions of “will AI take over” happen between two camps, enthusiastic believers and dismissive skeptics, shouting past each other. This paper is one of the first systematic studies of what the people actually building this technology privately think. It captures nuance: the same person who thinks ASARA is coming might also think red lines won’t work. The result is a thermometer reading of where expert opinion actually sat in late 2025.

Quiz — Level 1
1. What does ASARA stand for as defined in the paper?
ASARA = AI Systems for AI R&D Automation. The authors borrowed the term from Eth’s Forethought essay and use it throughout as shorthand for AI capable enough to meaningfully contribute to building frontier AI.
2. Who first introduced the concept of an “intelligence explosion”?
I.J. Good (British statistician, Bletchley Park codebreaker) proposed it in 1966 in “Speculations Concerning the First Ultraintelligent Machine.” Yudkowsky later formalized it via “return on cognitive investment,” Bostrom popularized it in Superintelligence (2014), and Chollet wrote a well-known critique (2018).
3. Out of 25 researchers interviewed, how many identified automating AI research as one of the most severe and urgent AI risks?
20/25. Don’t confuse with another headline number: 17/25 is the count who expect frontier models to be kept internal. The risk-perception number is 80% — four out of five flagged ASARA as a top-tier risk.
4. According to the paper, what is the key difference between frontier-lab researchers and academic researchers?
The paper calls this the “Schism Between Silicon Valley and Academia.” Lab researchers have firsthand experience with rapid capability gains; academic culture rewards skepticism. Academics are also generally less worried about ASARA, not more.
5. Which sequence best describes the three-stage path to ASARA that participants converged on?
17 of 25 participants described this exact 3-stage path. Stage 1: AI raises the floor for all researchers. Stage 2: AI autonomously handles sub-tasks. Stage 3: humans become the bottleneck and full research loops run autonomously.

Level 2 — Intermediate

The methodology: semi-structured qualitative interviews

This is a qualitative research study. The goal isn’t to count things or run statistical tests — it’s to surface reasoning patterns and capture why experts hold the views they do.

Sampling

182 researchers were invited and 25 agreed (a 13.7% response rate, normal for elite-expert interviews). Three recruitment channels were used deliberately to capture different vantage points:

Channel | Count | Captures
--- | --- | ---
Literature-based (Google Scholar) | 7 | Published authors on recursive improvement
Conference workshops (NeurIPS / ICLR 2024) | 8 | Active researchers at relevant venues
Network / snowball | 10 | “Recommend someone who disagrees with you”

Participant mix: 7 from frontier labs, 4 ex-frontier-lab, 9 academics, 3 industry, 2 nonprofit. The stratification is what lets them compare clusters later.

Interview protocol

14 core questions, 40–60 minutes each, organized into three sections:

  • Section A — Conceptualizing ASARA: what it’ll look like, trajectory, intelligence-explosion expectations
  • Section B — Organizational dynamics: deployment decisions, internal discussions
  • Section C — Risk assessment and governance: red lines, mitigations

Critical detail: when participants didn’t know a concept, the interviewer read a scripted definition. This standardizes the prompt so people react to the same idea.

Inductive coding + AI-assisted classification

The lead author developed codes inductively (grounded-theory style). For the categorical dot plots in Figures 1–4, the authors did something novel: they fed anonymized transcripts to Claude with structured prompts that forced classification into fixed codes. The full prompt is in Appendix B.

This makes the categorical results reproducible — but creates a methodological dependency on Claude that the paper acknowledges only partially.

The four figures — what they actually show

Figure 1: Deployment expectations
Categories: Expects internal, Nuanced, Expects public.
Of 20 with a clear position: 10 internal, 6 nuanced, 4 public. Frontier and ex-frontier dots cluster left (internal).

Figure 2: Trajectory clarity toward ASARA
Categories: Clear path, Major obstacles, Unknown unknowns.
Frontier lab researchers cluster at “clear path”; academics spread toward “major obstacles.” No one chose “unknown unknowns.”

Figure 3: Risk perception of ASARA
Categories: Primary driver, Major risk, Some concern, Minimal, Dismissive.
Most see ASARA as carrying serious risk. Academia is where variance lives; only 2 dismissed it outright.

Figure 4: Views on red lines as governance
Categories: Favorable, Conditional / mixed, Skeptical.
Roughly even split. Even supporters identified implementation challenges; even skeptics often wanted transparency-based mitigations.

Figures recreated from the paper’s text. Position totals are exact (stated in the paper); per-affiliation dot placements are inferred from qualitative descriptions. Original figures at arxiv.org/html/2603.03338v2.

The “ideation vs execution” framework

The most analytically interesting frame. 15 of 25 participants spontaneously distinguished:

Skill | What it is | Relative difficulty for AI
--- | --- | ---
Execution | Implementing experiments, training code, ablations | Easier: well-defined sub-goals, fast feedback
Ideation | Picking experiments, “research taste,” noticing what matters | Harder: long feedback loops, hard to evaluate, paradigm shifts have a long tail

A subset reframed the ideation problem more sharply: it’s not generating ideas that’s hard, it’s validating them. Expert humans with decades of experience struggle to recognize good ideas. And ML models tend to learn the mode of training data, not exceptional cases — exactly the wrong inductive bias for spotting paradigm shifts.

Internal vs. public deployment — the argument map

Why labs will keep ASARA internal (top three codes)

  1. Preserving competitive advantage (12 transcripts) — deploying gives competitors capability lifts
  2. Limited compute (10) — every GPU serving the API isn’t accelerating internal R&D. One striking quote: $100k of compute used for R&D might be worth $1M in researcher salary equivalent
  3. Avoiding diffusion (6) — public deployment enables distillation and reverse-engineering

Why labs will deploy publicly (top three)

  1. Financial pressures (10) — labs need to raise $20B+/year
  2. Government intervention (7) — regulators may force visibility
  3. Business model and culture (6) — Meta’s open-source culture vs OpenAI’s more closed approach

Constraints discussed

Sixteen participants flagged binding constraints. Three came up repeatedly:

  • Compute (11 mentions) — split between “engineering challenge, solvable” and “physical limit, real barrier”
  • Data (5 mentions) — “data determines the ceiling of self-improvement” (P3)
  • Bootstrapping level — the capability threshold required for sustained autonomous improvement. P1: “Can the model evaluate its own answers? Until relatively recently, the models just could not do that.”

Red lines vs transparency — comparing governance approaches

Red lines = bright-line rules that trigger major response when crossed (e.g. IDAIS Beijing: “No AI system should copy or improve itself without explicit human approval”).

Three implementation challenges identified even by supporters:

  1. Specification problem — the more precisely you define a threshold, the more it diverges from the abstract risk you’re trying to capture
  2. Verification & enforcement — “any kind of AI includes some intelligence and a compiler optimization” makes self-improvement hard to define
  3. Timing — too early handicaps beneficial development; too late and it’s useless

Trade-off captured by P8: red lines are “the dumbest possible supervisor but the most trustworthy” — crude but transparent vs. sophisticated but discretionary.

Headline numbers

Finding | Count | Meaning
--- | --- | ---
ASARA as severe / urgent risk | 20/25 | Strong elite consensus on risk salience
Expects internal deployment | 17/25 | Most expect frontier labs to withhold ASARA models
Three-stage path described | 17/25 | Convergence on trajectory shape
Ideation/execution distinction | 15/25 | Spontaneous frame for thinking about capabilities
Risk = “meta risk” (amplifies others) | 18/25 | Most common reason for concern
Concerned about adaptation lag | 17/25 | Second most common concern

Quiz — Level 2
1. How did the authors generate the categorical dot plots in Figures 1–4?
They fed anonymized transcripts to Claude with structured prompts that forced classification into fixed codes. The full prompt is in Appendix B. The inductive theme-coding was done by the first author alone — the paper notes inter-rater reliability was NOT calculated.
2. Why might “research ideation” be harder for AI than “research execution,” according to participants?
Good ideas have a long tail — most are mediocre, a few are paradigm-shifting. ML models tend to learn the mode of training data, exactly the wrong inductive bias. P24 framed this as “research taste” vs “experiment implementation.”
3. Among the 20 participants who took a clear position on deployment, what was the approximate split?
10 internal (half), 6 nuanced, 4 public (20%) of those with a clear position. 5 participants had no clear position and were omitted from Figure 1.
4. What is the “specification problem” with red lines as governance?
P24: “The more concrete your red line, the more decoupled it becomes from the abstract intelligence explosion risk that you’re worried about.” A specific benchmark threshold will catch AI that isn’t actually dangerous and miss other AI that is.
5. What does Participant 1 mean by a “bootstrapping level” in the context of recursive improvement?
P1’s specific example was self-evaluation: “Can the model evaluate its own answers? Can it rank its own answers?” This is a phase transition — below the level, human time gates progress; above, it doesn’t.

Level 3 — Expert

Since this is a qualitative interview study, L3 goes deep on methodology, the intelligence explosion theoretical literature this paper sits inside, and critical evaluation rather than equations and algorithms.

1. Qualitative methodology — what was done, what’s missing

Sampling architecture

40% of participants came through network/snowball channels, including MATS connections (Krueger is a MATS supervisor; the lead author is in the program). Snowball sampling is known to produce homophilous networks: even when you ask for disagreement, you get disagreement within your cluster. The likely shape of non-response bias is that those who declined included skeptics who find the framing uninteresting, which would systematically reduce the “skeptical academic” representation in the sample. If so, the bias works against the headline schism finding rather than manufacturing it, which is epistemically reassuring.

Inductive coding by a single author

Codes were developed inductively (grounded-theory style, letting concepts emerge). The Limitations section flags that inter-rater reliability was not calculated. The standard remedy is a second researcher independently coding a subset, with agreement reported as Cohen’s kappa or Krippendorff’s alpha. Values above ~0.8 indicate strong agreement; values below ~0.6 start to be concerning. This absence matters less for descriptive counts (“20 of 25 said X”) and more for interpretive claims like “a schism emerged.”
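To make the reliability statistic concrete, here is a minimal sketch of Cohen's kappa for two coders who each assign one categorical code per transcript. The labels and both coders' assignments are made-up illustration data, not the paper's, and `cohens_kappa` is a hypothetical helper name.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(coder_a)
    # Observed agreement: fraction of items both coders labeled identically.
    p_observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: product of each coder's marginal label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    p_chance = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical codes for 10 transcripts (illustration only).
human = ["internal", "internal", "public", "nuanced", "internal",
         "public", "nuanced", "internal", "internal", "public"]
model = ["internal", "internal", "public", "internal", "internal",
         "public", "nuanced", "internal", "nuanced", "public"]

kappa = cohens_kappa(human, model)
print(f"kappa = {kappa:.3f}")
```

On this toy data the raw agreement is 80% but kappa comes out near 0.68, because some of that agreement is expected by chance. That is the gap a reliability statistic is meant to expose.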

AI-assisted coding with Claude

For Figures 1–4, transcripts were fed to Claude with structured prompts forcing classification into fixed codes. Methodological strength: reproducible (anyone can re-run the prompt), removes manual coder bias, auditable via extracted quotes.

Three issues the paper doesn’t engage with:

  • Single-LLM coder problem — structurally analogous to a single human coder. No triangulation across model families.
  • Prompt-as-framing: forcing each participant into one of a small set of fixed buckets is itself an interpretive act; the structure isn’t validated against a held-out human-coded set.
  • Conflict-of-interest pattern — Anthropic researchers are in the sample, classified using an Anthropic product. A robustness check would have been appropriate.

2. The intelligence explosion theoretical lineage

Good 1966 — the syllogism

“An ultraintelligent machine could design even better machines; there would then unquestionably be an ‘intelligence explosion,’ and the intelligence of man would be left far behind.”

Three buried assumptions, each a fault line in later literature:

  1. Intelligence is sufficiently coherent (more intelligence → better at AI design)
  2. Marginal improvement is substantial, not trivial
  3. The design process doesn’t hit external constraints (treats AI design as pure intellectual work)

Yudkowsky — return on cognitive investment

Yudkowsky’s Intelligence Explosion Microeconomics (2013) formalizes Good’s argument with the k-factor:

k = (cognitive improvement per round) / (cognitive effort per round)

Mapping onto chain-reaction physics:

k < 1  →  subcritical    (improvements taper off)
k = 1  →  critical       (sustained linear improvement)
k > 1  →  supercritical  (each round produces more than it cost
                            → exponential explosion)

With I(t) the cognitive capability and R(t) the resources invested:

dI/dt = k · R(t), R(t) ∝ I(t), k > 1
⇒ I(t) = I(0) · e^{kt}   (proportionality constant absorbed into k)

Yudkowsky’s key move: the empirical question isn’t “will AI improve?” (yes) but is k persistently > 1? Most natural processes have k < 1 — diminishing returns dominate.

This formalism also describes the “bootstrapping level” P1 mentioned: the capability threshold below which k < 1 (human time gates progress) and above which k > 1 (system sustains its own improvement).
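The three regimes are easy to see in a discrete-round toy model. The sketch below assumes the chain-reaction reading literally: each round's gain is k times the previous round's gain. That recurrence, the function name, and the starting values are illustrative assumptions, not formulas taken from Yudkowsky or from the paper.

```python
def capability_after(rounds, k, first_gain=1.0, start=0.0):
    """Total capability after `rounds` reinvestment rounds.

    Assumption (chain-reaction reading, illustration only):
    each round's gain equals k times the previous round's gain.
    """
    capability, gain = start, first_gain
    for _ in range(rounds):
        capability += gain
        gain *= k  # reinvest: the next round yields k times this round's gain
    return capability

for k in (0.5, 1.0, 2.0):
    print(k, [capability_after(n, k) for n in (1, 5, 10, 20)])
```

With k = 0.5 the total plateaus just below 2, with k = 1 it grows by one unit per round, and with k = 2 it roughly doubles every round.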

Bostrom 2014 — the takeoff equation

Rate of improvement = (Optimization power applied) / (Recalcitrance to improvement)

Fast takeoff: optimization power grows faster than recalcitrance. In the software-only regime, optimization power can grow superlinearly while recalcitrance grows sublinearly. In the hardware regime, recalcitrance grows much faster — you have to fab chips, build datacenters, generate power.

This maps loosely onto the paper’s ideation/execution distinction: ideation is low-recalcitrance but hard to apply optimization power to; execution is higher-recalcitrance but easier to apply optimization power to.

Bostrom adds two theses absent from Good/Yudkowsky:

  • Orthogonality thesis: capability and goals are independent — you can have a superintelligent paperclip maximizer
  • Instrumental convergence: nearly any final goal implies similar subgoals (resource acquisition, self-preservation, cognitive enhancement)

The paper’s “meta risk” framing (ASARA amplifies all other risks) is a downstream consequence of instrumental convergence.

Davidson — the empirical model

Effective compute: E = C × A (physical compute × algorithmic efficiency).

dA/dt = α · R(t) · A(t)^β

Once AI labor dominates research (R ∝ A), substitution gives:

dA/dt = α · A(t) · A(t)^β = α · A(t)^{1+β}

Regimes:

Condition | Dynamics
--- | ---
β < 0 | Diminishing returns: sub-exponential
β = 0 | Constant returns: plain exponential
β > 0 | Increasing returns: finite-time singularity

When β > 0, the ODE produces a finite-time blow-up (in the idealized model). Whether this happens in reality depends on historical algorithmic-progress data once AI dominates research — data we don’t yet have.
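The regimes follow from solving the ODE by separation of variables: for β ≠ 0, A(t)^{−β} = A(0)^{−β} − αβt. A minimal sketch of that closed form is below. The function names and the parameter values used in the usage lines are hypothetical and chosen only so the arithmetic is easy to check.

```python
import math

def algorithmic_efficiency(t, a0, alpha, beta):
    """Closed-form solution of dA/dt = alpha * A**(1 + beta) with A(0) = a0."""
    if beta == 0:
        return a0 * math.exp(alpha * t)       # constant returns: plain exponential
    base = a0 ** (-beta) - alpha * beta * t   # from separating variables
    if base <= 0:
        return math.inf                       # at or past the blow-up time
    return base ** (-1.0 / beta)

def blow_up_time(a0, alpha, beta):
    """Finite time at which A diverges; None when beta <= 0 (no singularity)."""
    if beta <= 0:
        return None
    return a0 ** (-beta) / (alpha * beta)

# Illustrative parameters only (a0 = alpha = 1).
print(blow_up_time(1.0, 1.0, 1.0))                  # beta > 0: singularity at t = 1
print(algorithmic_efficiency(0.5, 1.0, 1.0, 1.0))   # A = 1 / (1 - t) before blow-up
print(algorithmic_efficiency(3.0, 1.0, 1.0, -1.0))  # beta = -1: linear growth, A = 1 + t
```

Note that the blow-up time scales as 1/α, which is the sense in which α sets the timescale while the sign of β sets the qualitative regime.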

Chollet — the principled critique (2018)

  1. Intelligence is not scalar — capability is domain-specific. In Yudkowsky’s notation: k > 1 in domain i doesn’t imply k > 1 overall.
  2. Feedback loops require reality, not just thought — algorithmic progress requires experiments, which need compute, data, and calendar time.
  3. Diminishing returns are the default — almost every system in nature exhibits them. Research productivity per researcher in physics has fallen by orders of magnitude.
  4. No general intelligence has ever exhibited recursive self-improvement — smart humans don’t have smarter children at faster rates.

Putting them on the same axes

The key disagreement isn’t whether AI will improve. It’s whether k > 1 (or β > 0) is sustained for AI-on-AI improvement.

Thinker | Core position | Implicit prior on k
--- | --- | ---
Good 1966 | Argues from definition | k > 1 by assumption
Yudkowsky | Framework for asking the question | Open empirical
Bostrom | Decomposes into opt power / recalcitrance | k > 1 likely in software regime
Davidson | Empirical model fit to ML history | k > 1 contingent on β estimate
Chollet | Principled skepticism | k ≤ 1, diminishing returns rule

The Field et al. participants are basically a poll across this fault line. Frontier-lab researchers cluster near Bostrom/Davidson; academic researchers cluster near Chollet.

3. Ideation vs execution — and the evaluation problem

The sharper frame: research progress is bottlenecked on evaluation, and ASARA-class self-improvement is bottlenecked on evaluating research direction.

Execution = generation + evaluation
         ≈ generation (med)   +  evaluation (easy, fast feedback)
Ideation = generation + evaluation
         ≈ generation (easy)  +  evaluation (hard)

Research taste as a reward model

“Research taste” is the human-research term for what ML calls a reward model:

RewardModel_human: ResearchIdea → ExpectedImpact

You don’t just need AI that proposes directions (the comparatively easy part); you need RewardModel_AI to match RewardModel_senior_human well enough that following its rankings produces good research. Two failure modes that echo RLHF:

  • Mode collapse on training distribution — reward models trained on what the community valued historically will undervalue paradigm shifts that the community would have rejected at the time
  • Goodharting on observable proxies — optimizing citation count, benchmark improvement, conference acceptance produces the proxy, not the underlying thing

The long tail of research impact

~80% of papers       →  cited < 10 times
~15% of papers       →  cited 10–100 times
~5% of papers        →  cited 100–1000 times
~0.1% of papers      →  cited > 10,000 times, define paradigms

Cumulative impact is dominated by the tail. A reward model 95% accurate on the median idea but mode-seeking on the tail produces 95% reasonable-looking work and 5% missed paradigm shifts — functionally similar to “no paradigm shifts.”
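A small simulation illustrates what “dominated by the tail” means. It assumes, purely for illustration, that idea impact follows a Pareto distribution with shape 1.5. Neither the distribution nor the shape value comes from the paper, and `top_share` is a hypothetical helper.

```python
import random

def top_share(impacts, top_fraction):
    """Fraction of total impact contributed by the top `top_fraction` of items."""
    ranked = sorted(impacts, reverse=True)
    cutoff = max(1, int(len(ranked) * top_fraction))
    return sum(ranked[:cutoff]) / sum(ranked)

rng = random.Random(0)
# Assumption: impact is Pareto-distributed with shape 1.5 (illustrative only).
impacts = [rng.paretovariate(1.5) for _ in range(100_000)]

for fraction in (0.001, 0.01, 0.05):
    share = top_share(impacts, fraction)
    print(f"top {fraction:.1%} of ideas -> {share:.1%} of total impact")
```

Under this assumed distribution a tiny fraction of ideas carries a disproportionate share of the total, so an evaluator that is accurate on typical ideas but misranks the top sliver loses much more than its headline accuracy suggests.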

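Mode-seeking vs mean-seeking in MLE

MLE minimizes forward KL:

KL(p_data ‖ p_θ) = E_{x∼p_data}[log(p_data(x) / p_θ(x))]

Strictly speaking, forward KL is mean-seeking (mass-covering), not mode-seeking: it penalizes the model for assigning low probability to high-data regions, so a limited-capacity model smears probability across modes. The technically mode-seeking objective is reverse KL, KL(p_θ ‖ p_data), which lets the model concentrate on one mode and ignore the rest. The colloquial sense of “mode-seeking” (“learning the typical case, missing the tail”) still applies to MLE in practice, because tail cases are rare in the training data and contribute little to the likelihood. Both the smearing and the sparse tail data cut against tail-case evaluation accuracy.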
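The difference between the two KL directions can be checked numerically. The sketch below fits a single Gaussian to a bimodal target by brute-force grid search, once minimizing forward KL and once minimizing reverse KL. The target mixture, the grids, and all function names are illustrative assumptions, and the integrals are approximated by Riemann sums.

```python
import math

def log_normal_pdf(x, mu, sigma):
    """Log density of a Gaussian N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

STEP = 0.1
XS = [-15 + STEP * i for i in range(301)]
# Target: an equal mixture of N(-3, 1) and N(3, 1), a stand-in for bimodal data.
LOG_P = [math.log(0.5 * math.exp(log_normal_pdf(x, -3, 1))
                  + 0.5 * math.exp(log_normal_pdf(x, 3, 1))) for x in XS]

def forward_kl(mu, sigma):
    """KL(p || q): the direction MLE minimizes (Riemann-sum approximation)."""
    return sum(math.exp(lp) * (lp - log_normal_pdf(x, mu, sigma))
               for x, lp in zip(XS, LOG_P)) * STEP

def reverse_kl(mu, sigma):
    """KL(q || p): the mode-seeking direction (Riemann-sum approximation)."""
    total = 0.0
    for x, lp in zip(XS, LOG_P):
        lq = log_normal_pdf(x, mu, sigma)
        total += math.exp(lq) * (lq - lp)
    return total * STEP

# Candidate Gaussians q = N(mu, sigma^2) to search over.
GRID = [(mu, sigma)
        for mu in (-4.0, -3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0)
        for sigma in (0.5, 1.0, 2.0, 3.0, 4.0)]
best_forward = min(GRID, key=lambda ms: forward_kl(*ms))
best_reverse = min(GRID, key=lambda ms: reverse_kl(*ms))
print("forward KL fit (mu, sigma):", best_forward)
print("reverse KL fit (mu, sigma):", best_reverse)
```

The forward-KL fit sits between the two modes with a wide spread, while the reverse-KL fit locks onto a single mode with a narrow spread. Which of the two symmetric modes it picks depends only on floating-point tie-breaking.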

RLHF and the Gao et al. 2023 result

Gao, Schulman, Hilton fit functional forms to gold reward vs proxy reward as KL grows. RL version:

R_gold(d) = d · (α_RL − β_RL · log d)

where d = √KL(π ‖ π_init). Gold reward follows an inverted-U: it rises, peaks, then declines. Proxy keeps rising monotonically. The two terms map to:

Term | Goodhart type | Mechanism | Reduced by
--- | --- | --- | ---
α (linear) | Regressional | Selection on noise in proxy features | Larger RM helps weakly
β log(d) | Extremal | Optimized samples drift OOD | Larger RM helps strongly
(not captured) | Adversarial | Policy manipulates the proxy | Open research problem
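The inverted-U shape of the RL functional form can be made explicit: differentiating d · (α − β · log d) and setting the result to zero gives a peak at d* = exp(α/β − 1), where the gold reward equals β · d*. The sketch below checks this numerically. The coefficient values are hypothetical, picked only to make the shape visible, and are not fitted values from Gao et al.

```python
import math

def gold_reward(d, alpha, beta):
    """RL functional form quoted above: R(d) = d * (alpha - beta * log d)."""
    return d * (alpha - beta * math.log(d))

def peak_distance(alpha, beta):
    """dR/dd = alpha - beta * log d - beta = 0  gives  d* = exp(alpha / beta - 1)."""
    return math.exp(alpha / beta - 1.0)

# Hypothetical coefficients, chosen only for illustration.
alpha, beta = 1.0, 0.25
d_star = peak_distance(alpha, beta)
print(f"peak at d = {d_star:.2f}, gold reward there = {gold_reward(d_star, alpha, beta):.2f}")
print(f"gold reward returns to zero at d = {math.exp(alpha / beta):.2f}")
```

Past d*, further optimization against the proxy lowers the gold reward, and past exp(α/β) the policy is worse under the gold reward than where it started.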

The iterated RLHF result — critical for ASARA

If you retrain the RM in k stages of distance d/k each (k here counts retraining stages and is unrelated to Yudkowsky’s k-factor):

R_RL(d, k iter) = d · (α_RL − β_RL · log d + β_RL · log k)

The new term β_RL · log k is positive but logarithmic in k, not linear: each doubling of the number of stages adds the same amount as the previous doubling. The α_RL term doesn’t move, so regressional Goodhart is unaffected by iteration.

Translation for ASARA: even idealized “AI keeps retraining its own evaluator” produces logarithmic gains, not exponential ones. Where Section 2’s k-factor > 1 regime needs each round to return more than the last, this math gives a marginal gain per extra stage that shrinks like 1/(number of stages): sub-exponential dynamics inside “the explosion.”
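The iterated formula can be reproduced by adding up the single-stage form over equal stages of length d divided by the number of stages. The sketch below assumes per-stage rewards simply add, which is the reading that recovers the formula above. The coefficients and the distance are hypothetical, and `stages` plays the role of k in that formula.

```python
import math

def rl_reward(d, alpha, beta):
    """Single-stage form: R(d) = d * (alpha - beta * log d)."""
    return d * (alpha - beta * math.log(d))

def iterated_reward(d, stages, alpha, beta):
    """Total over `stages` retraining stages, each covering distance d / stages."""
    return stages * rl_reward(d / stages, alpha, beta)

# Hypothetical coefficients and distance, for illustration only.
alpha, beta, d = 1.0, 0.25, 20.0
for stages in (1, 2, 4, 8, 16):
    closed_form = d * (alpha - beta * math.log(d) + beta * math.log(stages))
    summed = iterated_reward(d, stages, alpha, beta)
    print(stages, round(summed, 4), round(closed_form, 4))

# Each doubling of the stage count adds the same increment: beta * d * log 2.
print(round(beta * d * math.log(2), 4))
```

The summed and closed-form columns agree, and successive rows differ by a constant amount, which is the diminishing-returns pattern described above.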

What this means for the speed of any recursive loop

Research progress rate = min(generation rate, evaluation rate)

If generation rate → ∞ (AI is fast):
  Bottleneck shifts to evaluation rate
  Evaluation rate is bounded by:
    - quality of the reward model
    - true distribution of idea quality (long-tailed)
    - calibration on tail cases (poor by default)
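
As a toy illustration of the bottleneck argument, the sketch below holds the evaluation rate fixed and scales up the generation rate. The units (ideas per week) and all numbers are hypothetical.

```python
def research_progress_rate(generation_rate, evaluation_rate):
    """Progress is capped by the slower of idea generation and idea evaluation."""
    return min(generation_rate, evaluation_rate)

# Hypothetical units: ideas per week.
evaluation_rate = 5.0
for generation_rate in (1.0, 5.0, 50.0, 5000.0):
    rate = research_progress_rate(generation_rate, evaluation_rate)
    print(f"generation={generation_rate:>7} -> progress={rate}")
```

Once generation exceeds the evaluation rate, further speedups in generation leave the progress rate unchanged.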

Asymmetric capability growth pattern:

Capability | Trajectory | Reason
--- | --- | ---
Coding ability | Fast, sustained growth | Clear evaluators
Math problem solving | Fast, sustained growth | Clear evaluators
Benchmark performance | Fast growth, then saturation | Goodhart
True novel-research generation | Slow growth, possible plateau | No evaluator

That asymmetry is already visible in 2025–2026 data. Coding and math benchmarks have shot up; novel-paradigm research from AI has not appeared.

The L3 takeaway

Under this analysis, the recursive loop has a specific shape: fast on execution, gated on evaluation, with each evaluation improvement requiring increasing amounts of ground-truth signal that takes calendar time to accumulate. Not Chollet’s “no acceleration ever” and not Bostrom’s “fast takeoff” — something in between, dominated by long stretches of incremental work punctuated by paradigm shifts that arrive roughly on the historical schedule because that’s how long ground truth takes to accumulate.

Quiz — Level 3
1. In Yudkowsky’s framing, what specifically distinguishes an intelligence explosion regime from incremental AI progress?
Yudkowsky reframed the question precisely: explosion is the regime where each round of cognitive investment produces more than it cost. Below k = 1 subcritical, at k = 1 critical, above k = 1 supercritical. The whole disagreement is whether k > 1 is sustained.
2. Davidson’s software-intelligence-explosion model produces a finite-time singularity when which condition holds in dA/dt = α·R(t)·A(t)^β?
Once AI labor dominates research (R ∝ A), substitution gives dA/dt ∝ A^{1+β}. When the exponent exceeds 1 (i.e. β > 0), this ODE has finite-time blow-up. β = 0 gives plain exponential growth; β < 0 gives sub-exponential. α controls timescale, not the qualitative dynamics.
3. Why is the AI-assisted coding methodology worth flagging beyond what the paper notes in Limitations?
The defense of AI-assisted coding is reproducibility. But a single LLM coder is structurally analogous to a single human coder — one perspective, not triangulation. A stronger implementation would have run prompts on multiple model families and reported agreement, or held out a human-coded subset for validation.
4. Under Gao et al.’s 2023 scaling law R_RL(d, k iter) = d·(α − β·log d + β·log k), what does this predict for a recursive self-improvement loop that keeps retraining its evaluator?
The positive term is β·log k — gains grow logarithmically with diminishing returns built in. The α term (regressional Goodhart, selection on noise) is structural and unchanged by iteration. Even idealized “AI retrains its own evaluator” produces sub-exponential gains.
5. Why does the “long tail of research impact” argument predict that AI-driven research could plateau in quality even while quantitative outputs accelerate?
Research impact follows a power-law-like distribution where ~0.1% of papers define paradigms and contribute most cumulative impact. MLE-trained models fit the bulk of the training distribution well and have known calibration problems on tail cases, which are rare in the data. An AI evaluator can be 95% accurate on median ideas while being systematically miscalibrated on the paradigm-shifting tail.

Phase 4 — Frontier

Six improvement vectors

  1. Methodological triangulation — multiple human coders, multiple LLM classifiers, held-out validation
  2. Larger and more representative samples — supplement depth with breadth via survey instrument
  3. Longitudinal tracking — re-interview the same 25 every 6–12 months
  4. Operationalizing the observable milestones — concrete metrics with thresholds
  5. Empirical grounding for the theoretical disagreement — measure β in the AI-on-AI regime
  6. Cross-national and cross-institutional coverage — PRC researchers, open-source / decentralized contexts

What’s happened since

1. Methodological triangulation (status: Area to explore)

No follow-on study has reproduced this paper’s interviews with multiple human coders, multiple LLM classifiers, or a held-out validation set. The methodological dependency on a single LLM (Claude) and single human coder remains the largest unaddressed gap. This is the cheapest fix — running the existing 25 transcripts through a Claude-vs-GPT-vs-Gemini comparison could be done in a weekend.

2. Larger and more representative samples (status: Partial)

METR’s May 2026 self-reported productivity survey of 349 technical workers found median 1.4–2× value uplift from AI tools. But it’s about current productivity, not ASARA scenarios — it doesn’t ask about intelligence explosion, governance, or deployment. A semi-structured 200+ researcher study with proper stratification is still wide open.

3. Longitudinal tracking (status: Area to explore)

No team is running ASARA-belief surveys on a recurring cadence. Given how quickly the technology is moving — GPT-5 IMO gold August 2025, Claude Mythos May 2026, METR time horizons reaching 16 hours in May 2026 — a snapshot-and-update protocol would be the highest-information-density direction.

4. Operationalizing the observable milestones (status: Substantial)

METR has built much of this infrastructure:

  • Time-horizon metric — duration of human-expert-time tasks AI completes reliably. Doubling every ~7 months.
  • RE-Bench (Wijk et al., NeurIPS 2024) — 7 ML research engineering environments, best AI 4× human at 2-hour budgets, humans 2× AI at 32-hour budgets.
  • MLE-Bench (OpenAI) — autonomous ML engineering on Kaggle-style tasks.
  • MLR-Bench (2025) — open-ended ML research with LLM-judge validated against human reviewers.
  • MLRC-Bench (2025) — novel methods on cutting-edge research problems, best agent closes only 9.3% of the human gap — strongest empirical confirmation of the ideation/execution gap.
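
The time-horizon trend lends itself to simple back-of-envelope arithmetic. The sketch below assumes the “doubling every ~7 months” rate simply continues, which is an extrapolation and not a METR forecast. The 16-hour starting point is the figure mentioned earlier on this page, and the function names are hypothetical.

```python
import math

DOUBLING_MONTHS = 7.0  # "doubling every ~7 months", as quoted above

def horizon_after(start_hours, months, doubling_months=DOUBLING_MONTHS):
    """Task time horizon after `months`, assuming steady exponential growth."""
    return start_hours * 2 ** (months / doubling_months)

def months_to_reach(start_hours, target_hours, doubling_months=DOUBLING_MONTHS):
    """Months needed to grow from start_hours to target_hours on that trend."""
    return doubling_months * math.log2(target_hours / start_hours)

print(horizon_after(16, 14))                # two doublings from a 16-hour horizon
print(round(months_to_reach(16, 160), 1))   # months until a 10x longer horizon
```

Mapping a qualitative milestone onto this metric would then amount to choosing a target horizon and reading off the implied date, with the caveat that the trend itself may bend.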

Open: nobody has formally mapped Field et al.’s qualitative milestone descriptions onto these existing benchmarks.

5. Empirical grounding for the theoretical disagreement (status: Substantial)

Davidson, Halperin, Houlden, and Korinek — “When Does Automating AI Research Produce Explosive Growth? Feedback Loops in Innovation Networks” (NBER Working Paper 35155, 2026) — develops a semi-endogenous growth model with an innovation network and derives a clean analytical condition for superexponential (“explosive”) growth. Two reinforcing channels offset diminishing returns:

  1. Technological feedback across research sectors
  2. Economic feedback — higher output finances further research

This is the most direct extension of the theoretical lineage — bridges the qualitative k > 1 / β > 0 disagreement into a formal model with testable conditions. Still open: no measurements of β in the AI-on-AI regime, because we don’t yet have sustained AI doing the research at scale.

6. Cross-national and cross-institutional coverage (status: Partial)

Substantial reporting on Chinese AI development exists, but no comparable interview study of PRC researchers.

  • Carnegie documents Chinese views on AI safety shifting toward greater concern; senior Chinese scientists signed the 2024 IDAIS-Beijing consensus.
  • CSET documents that Chinese researchers favor embodied AI as a path to AGI rather than software-only recursive self-improvement — a substantively different model of the path to ASARA.
  • ChinaTalk: “Rather than a rapid software-driven intelligence explosion, Chinese thinking converges on something more embodied” — which would invert several Field et al. findings about trajectory shape.

A Field-et-al-style interview study with 20–25 Chinese AI researchers using the same protocol is still missing.

Emergent finding — the governance landscape moved fast

Developments during and after the interview window that directly affect the red-lines findings:

The Global Call for AI Red Lines (September 2025) gathered 200+ prominent signatories including Nobel laureates, calling for binding international agreement on AI red lines by end of 2026. Related events:

  • IDAIS-Shanghai (August 2025) — proposed an international coordination body for red line implementation
  • India AI Impact Summit (February 2026) — meant to operationalize specific thresholds
  • Anthropic-DoD dispute (April 2026) — the DoD requested Anthropic remove contractual red lines on fully autonomous weapons and mass domestic surveillance, providing a real-world test case

Field et al.’s participants in Aug–Sep 2025 were debating red lines as an abstract governance approach. By mid-2026, red lines became an active political fight — making the paper’s documentation of expert preferences a useful historical baseline.

Scorecard

Vector | Status | What exists | What’s still open
--- | --- | --- | ---
1. Methodological triangulation | Area to explore | None | Multi-model classification, multi-coder IRR, validation sets
2. Larger samples | Partial | METR n=349 (productivity only) | ASARA-specific large-N study
3. Longitudinal tracking | Area to explore | None | Recurring panel of same researchers
4. Operationalizing milestones | Substantial | METR time-horizon, RE-Bench, MLE-Bench, MLR-Bench, MLRC-Bench, AIFM | Formal mapping of paper’s qualitative milestones to benchmarks
5. Theoretical grounding | Substantial | Davidson-Halperin-Houlden-Korinek NBER 2026, AIFM | Direct measurement of β in AI-on-AI regime
6. Cross-national coverage | Partial | CSET, Carnegie, ChinaTalk reporting | Parallel interview study with PRC researchers

The single most useful follow-on

A longitudinal panel re-interviewing the original 25 every 9 months, expanded to include 25 PRC-based researchers using the same protocol, with multi-model AI-assisted coding and a human-validation subset. One design hits Vectors 1, 3, and 6 simultaneously and provides exactly the time-series data needed to test whether the “schism” persists or converges as capabilities advance.

Quiz — Phase 4 Frontier
1. Which 2026 paper most directly formalizes the theoretical disagreement between frontier and academic researchers about whether AI-on-AI research produces explosive growth?
The NBER paper develops a semi-endogenous growth model with an innovation network and derives a clean analytical condition for superexponential growth. METR measures capability; MLE-Bench measures ML engineering ability; IDAIS is a governance document. Only the NBER paper extends the Yudkowsky/Davidson theoretical lineage into a formal model.
2. Which existing benchmark provides the strongest empirical confirmation of the “ideation vs execution gap” that 15 of 25 Field et al. participants raised?
MLRC-Bench tests novel methodologies — problems that are NOT solvable by sufficient engineering effort. The 9.3% result is the cleanest empirical confirmation of asymmetric capability growth: execution-heavy benchmarks saturate, ideation-heavy benchmarks don’t.
3. Why is a longitudinal version of the Field et al. study (re-interviewing the same participants every 9–12 months) particularly valuable?
A panel design isolates belief revision by holding respondents fixed. Cross-sectional surveys conflate updating with sample-composition change. With major capability milestones happening on the same cadence as the proposed interval, you’d get the cleanest possible signal about which milestones move expert opinion.
4. How would a parallel Field-et-al-style interview study with PRC researchers likely change the headline findings?
CSET, ChinaTalk, and Carnegie all document that Chinese AI researchers tend to favor embodied AI as a path to AGI. This is a substantively different trajectory model — not software recursive self-improvement but AI grounded in physical-world interaction. The cleanest predicted change is on Figure 2 (trajectory), not Figure 1 or 3.
5. What is the strongest methodological reason a single LLM coder (Claude) for the Field et al. figures should be the first vector addressed in follow-on work?
The transcripts already exist, the prompts are published in Appendix B, and re-running on GPT/Gemini would take a small fraction of any other vector. The result would be a concrete inter-classifier-agreement number, directly addressing the missing reliability metric. Lowest cost, highest information yield.