Jiang, Lin, Shi, Wang, He, Wu, Zhong, Song, Zhang, Wang et al. (34 authors) — December 2025 (v3: March 2026)
📄 arXiv:2512.16301 · 📥 PDF · 💻 GitHub
AI assistants like ChatGPT or Claude can chat and answer questions. Now imagine giving them the ability to do things — search the web, run code, look up a calendar, query a database. That's an "AI agent." It doesn't just talk; it acts.
The problem: even the smartest AI models are clumsy agents out of the box. They call the wrong tool, forget what they were doing, or fall apart on multi-step tasks. This paper asks: how do you make these agents better after they're built?
Think of an AI agent as a smart student sitting an open-book exam. The student (the AI model) has a toolbox on the desk — a calculator, a dictionary, a search engine, a notebook for memory. Two ways to help:
1. Coach the student — teach them to reason better, use tools more skillfully, plan ahead.
2. Upgrade the toolbox — give them a better calculator, a smarter search engine, a better notebook.
The paper's big insight: the entire messy landscape of "making agents better" boils down to a clean 2x2 grid based on two questions: What are you improving? (student vs. toolbox) and What signal tells you it's working? (did the tool work, or did the final answer come out right?)
A1 — coach the student, tool-side signal: The student writes code, runs it, sees "all tests pass." That success signal teaches the student to write better code. Like learning from lab experiments — the equipment gives direct feedback.
A2 — coach the student, outcome signal: Did the final answer match the answer key? If yes, reinforce whatever the student did. Like grading a math test — you only care about the bottom line.
T1 — upgrade the toolbox, independent signal: Build a better search engine or calculator that any student can use, trained on its own without any specific student in mind. Plug-and-play.
T2 — upgrade the toolbox, outcome signal: The student is fixed (maybe a closed-source API). Tune the search engine to return results this particular student understands best. The student's performance tells you whether the upgrade worked.
The paper argues that the best strategy is often not to retrain the whole AI model (expensive, risky). Instead, upgrade the tools around it. A small 7B-parameter search tool trained with just 2,400 examples using T2 matched or beat a monolithic agent trained with 70x more data.
Real-world systems will combine strategies: rare, expensive retraining of the core model (A1/A2) with frequent, cheap upgrades to tools, memory, and helper modules (T1/T2). You don't retrain your CEO every week — you upgrade the tools and processes around them.
Every adaptation method optimizes some objective function over: Agent A(theta) — the foundation model; Tool T — everything external; Objective O — a measurable signal. A1/A2 update theta (agent weights); T1/T2 update phi (tool parameters). The difference is where the training signal comes from.
Toolformer (2023) — Self-supervised SFT. Inserts API calls, keeps calls that reduce perplexity.
Gorilla (2024) — AST-based correctness checking of generated API calls against schemas.
DeepRetrieval (2025) — Full RL. Query reformulation as MDP with reward = alpha*Recall@k + beta*nDCG@k + gamma*format_score. KL-regularized PPO. Recall improved from 24.7% to 65.1%.
A2 doesn't care how tools were used — only whether the final answer was right. Subtle trap: the agent can learn to ignore tools entirely and still improve if the answer happens to be correct.
Reasoning-centric RL (DeepSeek-R1) — Pure RL with verifiable rewards. Emergent chain-of-thought, self-correction.
Self-refinement (TextGrad) — LLM critiques own output, treats critique as a "textual gradient." Parameter-free, works on black-box APIs.
Multi-tool orchestration (ReTool) — Trains agents to decide when/how to use tools via outcome-based reward.
Trained independently, serves any agent: CLIP/SAM for vision, dense retrievers for search. An A1-trained agent can be frozen and reused as T1 — the paper calls this "graduation."
REPLUG — Trains retriever using frozen LLM's perplexity as reward.
s3 — 7B search subagent, "Gain Beyond RAG" reward: G = Quality(with_search) - Quality(RAG_only). 2,400 examples matched 170K+ example baselines.
Memory as T2 — Long-term memory is an external store with learned read/write, updated using frozen agent signals. Memento trains memory policies from binary outcome rewards.
| Dimension | A1/A2 (agent) | T1/T2 (tools) |
|---|---|---|
| Data efficiency | High data cost | T2 needs 70x less data |
| Forgetting risk | High — can destroy old skills | Low — changes localized |
| Modularity | Monolithic retraining | Swap/upgrade independently |
| Capability ceiling | Can improve fundamental reasoning | Limited by frozen agent |
Rare, expensive A1/A2 updates for the base model; frequent, cheap T1/T2 adaptation of retrievers, search policies, memory, and planners for robustness and scalability.
Adaptation over tuple (A, T, D, E, O): Agent A(theta), Tool set T, offline Data D, Environment E, Objective O.
A1: min_theta O_tool(T(A_theta(x))) — agent weights, tool-execution signal
A2: min_theta O_agent(A_theta(x, T(a))) — agent weights, output signal
T1: min_phi O_tool(T_phi(a)) — tool weights, independent signal
T2: min_phi O_agent(A(x, T_phi(a))) — tool weights, frozen-agent signal
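The four objectives can be sketched in toy Python with a scalar stand-in agent and tool. Everything below is illustrative — the functions, signals, and numeric forms are assumptions for exposition, not the paper's implementation:

```python
def agent(theta, x, tool_out=None):
    # Toy "agent": a linear score over the input plus any tool output.
    return theta * x + (tool_out if tool_out is not None else 0.0)

def tool(phi, a):
    # Toy "tool": scales the agent's query.
    return phi * a

def o_tool(result, target):   # verifiable tool-side signal
    return (result - target) ** 2

def o_agent(output, answer):  # holistic final-output signal
    return (output - answer) ** 2

# A1: optimize agent weights theta against a tool-execution signal.
def loss_A1(theta, x, target, phi=1.0):
    return o_tool(tool(phi, agent(theta, x)), target)

# A2: optimize theta against the final-output signal (tool is part of the env).
def loss_A2(theta, x, answer, phi=1.0):
    return o_agent(agent(theta, x, tool(phi, x)), answer)

# T1: optimize tool weights phi against an agent-independent signal.
def loss_T1(phi, a, target):
    return o_tool(tool(phi, a), target)

# T2: optimize phi using the frozen agent's output quality as the signal.
def loss_T2(phi, x, answer, theta_frozen=1.0):
    return o_agent(agent(theta_frozen, x, tool(phi, x)), answer)
```

The only differences across the four losses are which parameter is free (theta vs. phi) and which objective supplies the gradient — exactly the two axes of the 2x2 grid.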
Boundary cases: when both agent and tool are modified, the method is assigned to the quadrant of its dominant locus of optimization, with the secondary component flagged.
Phase 1 — Toolformer: Keep tool call c if Loss(text_with_c) < Loss(text_without_c).
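A minimal sketch of this filtering rule, with a hypothetical `lm_loss` standing in for a real language-model loss (the `margin` parameter is an assumption playing the role of Toolformer's filtering threshold):

```python
# Keep an inserted API call only if conditioning on its result lowers the
# LM loss on the tokens that follow the insertion point.
def keep_call(lm_loss, prefix, call_result, continuation, margin=0.0):
    loss_with = lm_loss(prefix + call_result, continuation)
    loss_without = lm_loss(prefix, continuation)
    return loss_with + margin < loss_without

# Toy loss: a continuation token is "cheap" when it already appears in context.
def toy_lm_loss(context, continuation):
    return sum(1.0 for tok in continuation.split() if tok not in context)

# The call result "42" makes predicting "42" free, so the call is kept.
print(keep_call(toy_lm_loss, "The answer is", " [calc -> 42]", "42"))  # True
# A wrong call result does not help, so it is discarded.
print(keep_call(toy_lm_loss, "The answer is", " [calc -> 7]", "42"))   # False
```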
Phase 2 — Gorilla: AST-based correctness. Parse API call, compare against schema.
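The check can be sketched with Python's `ast` module; the schema format below is an assumption, simplified from Gorilla's actual AST matcher:

```python
import ast

# Hypothetical schema: per-function required and optional keyword arguments.
SCHEMA = {"load_model": {"required": {"name"}, "optional": {"revision", "device"}}}

def check_call(code, schema=SCHEMA):
    # Parse the generated call; malformed code fails immediately.
    try:
        node = ast.parse(code, mode="eval").body
    except SyntaxError:
        return False
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        return False
    # The function must exist in the schema.
    spec = schema.get(node.func.id)
    if spec is None:
        return False
    # All required kwargs present, no kwargs outside the schema.
    kwargs = {kw.arg for kw in node.keywords}
    allowed = spec["required"] | spec["optional"]
    return spec["required"] <= kwargs and kwargs <= allowed

print(check_call('load_model(name="bert", device="cpu")'))  # True
print(check_call('load_model(device="cpu")'))               # False (missing name)
```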
Phase 3 — DeepRetrieval: MDP. R = alpha*Recall@k + beta*nDCG@k + gamma*format_score. max E[R(s,a)] - beta*KL(pi_theta || pi_ref). 65.1% vs 24.7% recall.
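The composite reward can be sketched as follows; the coefficient values and the binary format check are illustrative assumptions, not DeepRetrieval's exact settings:

```python
import math

def recall_at_k(retrieved, relevant, k):
    # Fraction of relevant documents appearing in the top k.
    return len(set(retrieved[:k]) & relevant) / max(len(relevant), 1)

def ndcg_at_k(retrieved, relevant, k):
    # Binary-relevance nDCG: discounted gain over the ideal ordering.
    dcg = sum(1.0 / math.log2(i + 2)
              for i, d in enumerate(retrieved[:k]) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0

def reward(retrieved, relevant, k=5, well_formed=True,
           alpha=1.0, beta=0.5, gamma=0.1):
    # R = alpha*Recall@k + beta*nDCG@k + gamma*format_score
    return (alpha * recall_at_k(retrieved, relevant, k)
            + beta * ndcg_at_k(retrieved, relevant, k)
            + gamma * (1.0 if well_formed else 0.0))
```

In the full method this scalar reward feeds a KL-regularized PPO objective, max E[R] - beta*KL(pi_theta || pi_ref), which keeps the reformulation policy close to the reference model.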
DeepSeek-R1: Pure RL, no SFT warm-start. Emergent reasoning at scale.
TextGrad: o_(t+1) = LLM(x, o_t, critique(o_t)). Gradient descent in natural language. Works on black-box APIs.
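The refinement loop can be sketched in a few lines, with hypothetical `llm` and `critic` callables standing in for real LLM API calls:

```python
def refine(x, o0, llm, critic, steps=3):
    # o_{t+1} = LLM(x, o_t, critique(o_t)): the critique acts as a
    # "textual gradient" applied to the current output.
    o = o0
    for _ in range(steps):
        feedback = critic(x, o)
        if feedback is None:      # critic is satisfied; stop early
            break
        o = llm(x, o, feedback)   # apply the feedback as an update
    return o

# Toy usage: the critic demands a trailing period until one appears.
toy_critic = lambda x, o: None if o.endswith(".") else "end with a period"
toy_llm = lambda x, o, fb: o + "."
print(refine("q", "draft", toy_llm, toy_critic))  # draft.
```

Because both callables are black boxes, nothing here requires model weights — which is why the technique works on closed-source APIs.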
ReTool: Bridges A1/A2. Outcome reward (A2) with tool execution inside the reasoning trace (A1-like). Classified as A2 by dominant locus.
REPLUG: min_phi E_x[-log P_theta(o | x, R_phi(x))], theta frozen.
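A sketch of that training signal, assuming a softmax retriever distribution and a target distribution derived from the frozen LM's likelihoods; the exact loss form is an assumption in the spirit of REPLUG's learned retriever:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def replug_loss(retriever_scores, lm_log_probs):
    # Retriever distribution over candidate docs (depends on phi).
    p_ret = softmax(retriever_scores)
    # Target distribution from the frozen LM: docs that make the gold
    # output likelier get more mass. No gradient flows into theta.
    q_lm = softmax(lm_log_probs)
    # Cross-entropy of the LM target under the retriever, minimized over phi.
    return -sum(q * math.log(p) for q, p in zip(q_lm, p_ret))
```

When the retriever ranks documents the way the frozen LM prefers them, this loss is minimized — the LM's perplexity is doing all the supervising.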
s3: G = Quality(answer_with_search) - Quality(answer_with_RAG_only). Marginal value of search. 2,400 examples = 170K+ baselines.
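The reward fits in a few lines; `toy_quality` is a hypothetical scorer standing in for the frozen generator's answer quality (e.g. F1 against the gold answer):

```python
def gain_beyond_rag(quality, question, gold, searched_ctx, rag_ctx):
    # Credit the searcher only for quality above the naive-RAG baseline.
    return quality(question, gold, searched_ctx) - quality(question, gold, rag_ctx)

# Toy quality: fraction of gold-answer tokens present in the context.
def toy_quality(question, gold, ctx):
    toks = gold.split()
    return sum(t in ctx for t in toks) / len(toks)

# The searched context recovers the answer; plain RAG does not, so gain = 1.0.
print(gain_beyond_rag(toy_quality, "capital?", "paris france",
                      "it says paris france", "it says lyon"))
```

Subtracting the RAG-only baseline zeroes out reward the generator could have earned anyway, so the search subagent is optimized purely for its marginal contribution.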
Memento: Memory read/write policies from binary outcome rewards. Pure T2, no annotation.
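A sketch of outcome-supervised memory in this spirit; the per-case value estimate and its update rule are assumptions for illustration, not Memento's actual policy:

```python
class CaseMemory:
    def __init__(self, lr=0.2):
        self.cases = {}   # case_id -> estimated value for future retrieval
        self.lr = lr

    def write(self, case_id):
        # New cases start at a neutral prior value.
        self.cases.setdefault(case_id, 0.5)

    def read(self, k=1):
        # Retrieval prefers the k cases with the highest estimated value.
        return sorted(self.cases, key=self.cases.get, reverse=True)[:k]

    def update(self, case_id, reward):
        # Binary outcome reward (0/1) nudges the value estimate; no
        # human annotation, only the task's success signal.
        v = self.cases[case_id]
        self.cases[case_id] = v + self.lr * (reward - v)
```

The frozen agent never changes; only the memory's read/write behavior adapts, which is what places this squarely in T2.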
| Dimension | A1/T1 | A2/T2 |
|---|---|---|
| Signal metrics | Verifiable: pass rate, Recall, nDCG | Holistic: exact match, F1, LLM-as-judge |
| Objectivity | High — deterministic | Lower — subjective judges |
| Benchmarks | SWE-bench, HumanEval, BEIR, MTEB | MMLU, MATH, task-specific |
Critical gap: almost no benchmarks measure T2 specifically.
Vs. agent surveys: Those catalog architectures; this focuses on adaptation. 2x2 taxonomy is novel.
Vs. RLHF: A2 is more general — includes non-parametric methods and tool-augmented settings.
Vs. RAG surveys: Decomposes RAG into T1/T2 retriever + A1/A2 generator.
Strengths: Clarifying taxonomy; concrete T2 efficiency case study; honest boundary-case treatment; v3 adds memory, skills, evaluation framework.
Weaknesses: 70x claim from single comparison; no controlled experiments; soft paradigm boundaries; single-agent scope only; evaluation framework not empirically validated.
The 2x2 framework provides a genuine organizing principle. The strongest practical insight: T2 — training tools under frozen agent supervision — achieves remarkable data efficiency and modularity, making it the recommended default for production systems on closed-source models.
Six improvement vectors for this paper, mapped against recent work (as of April 2026) that addresses — or doesn't address — each one.
The 70x T2 efficiency claim comes from comparing s3 against specific A2 baselines across different papers, models, and datasets. A rigorous version would hold everything constant — same base model, same task, same compute budget — and sweep A1/A2/T1/T2 head-to-head. No one has done this yet. This would be the single most valuable empirical contribution to validate the survey's claims.
The survey flags co-adaptation as a key open problem but offers no algorithms. Recent infrastructure work is laying the groundwork:
Agent Lightning (Microsoft, late 2025) — Separates task execution from model training. Universal (input, output, reward) trajectory format. RL with virtually no code modification. Works for multi-tool, multi-agent workflows.
NVIDIA NeMo Gym + NeMo RL (Jan 2026) — Modular RL infrastructure for scientific agents. GRPO support, end-to-end FP8 training.
The plumbing exists, but no one has published a formal co-adaptation algorithm with convergence guarantees or stability analysis. Game-theoretic formulations and credit assignment across the agent-tool boundary remain open.
The survey scopes to single-agent systems, but production systems increasingly involve multiple agents sharing tools.
Google Paradigms of Intelligence (Mar 2026) — Decentralized RL against mixed opponent pools produces cooperative multi-agent behavior without hardcoded coordination. Agents performed better with no prior information about adversaries, adapting through trial and error alone.
This shows multi-agent adaptation can emerge from standard training, but doesn't address the survey's specific gap: how to adapt shared tools when multiple agents depend on them without degrading any individual agent's performance.
AgeMem (Jan 2026) — Unifies long-term and short-term memory management into the agent's policy as learnable tool-based actions (add, update, delete, retrieve, summarize, filter). Three-stage progressive RL with step-wise GRPO. 4.8–8.6 percentage point improvements over baselines.
ALMA (Feb 2026) — Meta-learns the memory architecture itself. Discovers domain-specific designs that surpass human-designed baselines while being more cost-efficient. Challenges the assumption that memory architecture is fixed.
Strong advances, but formal guarantees remain missing — no theory of optimal forgetting, no bounds on memory staleness, no analysis of when memory-based T2 outperforms parametric A2.
The survey flags reward hacking and parasitic adaptation but proposes no concrete mitigations.
ICLR 2026 "Lifelong Agents" workshop — First unified forum bridging continual learning, RL, memory, and safety for long-lived agents. Covers continual fine-tuning, domain shift adaptation, and tool-use strategies.
Workshop-level discussion is happening, but concrete primitives are missing: no constrained RL formulations guaranteeing tool-use safety during adaptation, no automated detection of parasitic adaptation, no formal verification for adapted tools. This is the widest gap between identified risks and actual solutions.
The survey recommends combining paradigms but gives no decision procedure. ReTool provides indirect evidence (67% on AIME in 400 steps vs text-based RL at 40%/1080 steps), but this is one data point. No one has built a meta-benchmark or selection controller that recommends A1/A2/T1/T2 given task, budget, and model access constraints. This would transform the survey from a conceptual framework into an actionable design tool.
| Vector | Status | Key work |
|---|---|---|
| Controlled cross-paradigm experiments | Area to explore | No one has done this |
| Co-adaptation algorithms | Partially addressed | Agent Lightning, NeMo Gym |
| Multi-agent adaptation | Partially addressed | Google decentralized RL |
| Memory formalization | Substantially advanced | AgeMem, ALMA |
| Safety primitives | Area to explore | ICLR workshop only |
| Paradigm selection benchmark | Area to explore | No one has done this |
The survey's 2x2 framework is holding up — nothing since has broken the taxonomy. The frontier is moving fastest on memory (vector 4) and infrastructure (vector 2), while safety (vector 5), controlled experiments (vector 1), and paradigm selection (vector 6) remain wide open.