
Adaptation of Agentic AI: A Survey of Post-Training, Memory, and Skills

Jiang, Lin, Shi, Wang, He, Wu, Zhong, Song, Zhang, Wang et al. (34 authors) — December 2025 (v3: March 2026)

📄 arXiv:2512.16301  ·  📥 PDF  ·  💻 GitHub

TL;DR: A unified 2x2 framework for making AI agents better after pre-training. Four paradigms — A1, A2 (adapt the agent), T1, T2 (adapt the tools) — organize 100+ methods across post-training, memory, and skills. Key finding: training smarter tools (T2) can match full agent retraining with 70x less data.

Level 1 — Beginner

What is this paper about?

AI assistants like ChatGPT or Claude can chat and answer questions. Now imagine giving them the ability to do things — search the web, run code, look up a calendar, query a database. That's an "AI agent." It doesn't just talk; it acts.

The problem: even the smartest AI models are clumsy agents out of the box. They call the wrong tool, forget what they were doing, or fall apart on multi-step tasks. This paper asks: how do you make these agents better after they're built?

The student-with-a-toolbox analogy

Core idea

Think of an AI agent as a smart student sitting an open-book exam. The student (the AI model) has a toolbox on the desk — a calculator, a dictionary, a search engine, a notebook for memory. Two ways to help:

1. Coach the student — teach them to reason better, use tools more skillfully, plan ahead.

2. Upgrade the toolbox — give them a better calculator, a smarter search engine, a better notebook.

The four strategies (A1, A2, T1, T2)

The paper's big insight: the entire messy landscape of "making agents better" boils down to a clean 2x2 grid based on two questions: What are you improving? (student vs. toolbox) and What signal tells you it's working? (did the tool work, or did the final answer come out right?)

A1 — Coach the student using tool feedback

The student writes code, runs it, sees "all tests pass." That success signal teaches the student to write better code. Like learning from lab experiments — the equipment gives direct feedback.

A2 — Coach the student using final-answer feedback

Did the final answer match the answer key? If yes, reinforce whatever the student did. Like grading a math test — you only care about the bottom line.

T1 — Upgrade the tools independently

Build a better search engine or calculator that any student can use. Trained on their own, without any specific student in mind. Plug-and-play.

T2 — Upgrade tools for this specific student

The student is fixed (maybe a closed-source API). Tune the search engine to return results this particular student understands best. The student's performance tells you whether the upgrade worked.

Why does this matter?

The paper argues that the best strategy is often not to retrain the whole AI model (expensive, risky). Instead, upgrade the tools around it. A small 7B-parameter search tool trained with just 2,400 examples using T2 matched or beat a monolithic agent trained with 70x more data.

At a glance: 70x less data needed (T2 vs. A2); 100+ methods organized into 4 paradigms.
Key takeaway

Real-world systems will combine strategies: rare, expensive retraining of the core model (A1/A2) with frequent, cheap upgrades to tools, memory, and helper modules (T1/T2). You don't retrain your CEO every week — you upgrade the tools and processes around them.

Quiz — Level 1
1. What is the core problem this paper is trying to solve?
The paper focuses on adaptation — improving agents, tools, and their interaction after pre-training, not building models from scratch.
2. In the student-with-a-toolbox analogy, what does the "toolbox" represent?
Tools are external callable components — retrievers, APIs, code executors, memory modules — that the agent uses to extend its capabilities beyond its internal knowledge.
3. What makes A1 different from A2?
Both A1 and A2 adapt the agent. The difference is the signal: A1 learns from tool execution outcomes (e.g., code tests passed), while A2 learns from whether the final answer was correct.
4. Why is T2 particularly important for closed-source AI models?
T2 keeps the agent frozen and optimizes the tools around it using the agent's outputs as a signal — perfect when you can't touch the model's parameters.
5. What was the striking data efficiency finding about T2 vs. A2?
The s3 system (T2) trained a 7B search subagent with just 2,400 examples and matched agents trained with 70x more data.

Level 2 — Intermediate

The mathematical setup

Every adaptation method optimizes some objective function over: Agent A(theta) — the foundation model; Tool T — everything external; Objective O — a measurable signal. A1/A2 update theta (agent weights); T1/T2 update phi (tool parameters). The difference is where the training signal comes from.
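The two axes above can be sketched as a small lookup in code. This is purely illustrative scaffolding (the names `PARADIGMS` and `paradigm_for` are not from the paper); it just makes the 2x2 grid mechanical:

```python
# Illustrative mapping of the 2x2 framework: which parameters are trained
# (theta = agent weights, phi = tool parameters) and which signal supervises them.
PARADIGMS = {
    "A1": {"trains": "theta", "signal": "tool execution outcome"},
    "A2": {"trains": "theta", "signal": "final answer quality"},
    "T1": {"trains": "phi",   "signal": "tool-intrinsic metric"},
    "T2": {"trains": "phi",   "signal": "frozen agent's output quality"},
}

def paradigm_for(trains_agent: bool, signal_from_tool: bool) -> str:
    """Recover the paradigm label from the two axis questions:
    what is being improved, and where does the training signal come from?"""
    if trains_agent:
        return "A1" if signal_from_tool else "A2"
    return "T1" if signal_from_tool else "T2"
```

For example, tuning a retriever against a frozen LLM's answer quality gives `paradigm_for(False, False)`, i.e. T2.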

A1 in practice: learning from tool execution

The RLVR progression

Toolformer (2023) — Self-supervised SFT. Inserts API calls, keeps calls that reduce perplexity.

Gorilla (2024) — AST-based correctness checking of generated API calls against schemas.

DeepRetrieval (2025) — Full RL. Query reformulation as MDP with reward = alpha*Recall@k + beta*nDCG@k + gamma*format_score. KL-regularized PPO. 3x recall improvement (65.1% vs 24.7%).
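A minimal sketch of that composite reward, assuming binary relevance; the weights alpha, beta, gamma and the metric implementations here are illustrative placeholders, not DeepRetrieval's tuned values:

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance nDCG@k: discounted gain over the ideal ranking."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(retrieved[:k]) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0

def reward(retrieved, relevant, format_ok, k=10,
           alpha=0.7, beta=0.2, gamma=0.1):
    """Composite reward: alpha*Recall@k + beta*nDCG@k + gamma*format_score."""
    return (alpha * recall_at_k(retrieved, relevant, k)
            + beta * ndcg_at_k(retrieved, relevant, k)
            + gamma * (1.0 if format_ok else 0.0))
```

Every term is computed from tool execution (what the retriever actually returned), which is exactly what makes this A1 rather than A2.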

A2 in practice: learning from final outputs

A2 doesn't care how tools were used — only whether the final answer was right. Subtle trap: the agent can learn to ignore tools entirely and still improve if the answer happens to be correct.

Three branches

Reasoning-centric RL (DeepSeek-R1) — Pure RL with verifiable rewards. Emergent chain-of-thought, self-correction.

Self-refinement (TextGrad) — LLM critiques own output, treats critique as a "textual gradient." Parameter-free, works on black-box APIs.

Multi-tool orchestration (ReTool) — Trains agents to decide when/how to use tools via outcome-based reward.

T1: plug-and-play tools

Trained independently, serves any agent: CLIP/SAM for vision, dense retrievers for search. An A1-trained agent can be frozen and reused as T1 — the paper calls this "graduation."

T2: the "symbiotic inversion"

Instead of tools being trained in isolation to serve any agent, the frozen agent's own behavior becomes the training signal for the tools: the usual direction of adaptation is inverted.

Key T2 methods

REPLUG — Trains retriever using frozen LLM's perplexity as reward.

s3 — 7B search subagent, "Gain Beyond RAG" reward: G = Quality(with_search) - Quality(RAG_only). 2,400 examples matched 170K+ example baselines.
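A toy version of that reward. Token-overlap F1 stands in for s3's quality judge here purely for illustration; the actual system uses a stronger quality signal:

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-overlap F1, a common stand-in for answer quality."""
    pred, ref = prediction.split(), reference.split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def gain_beyond_rag(reference, answer_with_search, answer_rag_only):
    """G = Quality(with_search) - Quality(RAG_only): the search subagent is
    credited only for quality above the plain-RAG baseline."""
    return token_f1(answer_with_search, reference) - token_f1(answer_rag_only, reference)
```

Subtracting the RAG-only baseline means the searcher earns nothing for questions the retriever already handles, which focuses its 2,400 training examples on cases where search genuinely helps.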

Memory as T2 — Long-term memory is an external store with learned read/write, updated using frozen agent signals. Memento trains memory policies from binary outcome rewards.

When to use what

Dimension          | A1/A2 (agent)                     | T1/T2 (tools)
Data efficiency    | High data cost                    | T2 needs 70x less data
Forgetting risk    | High — can destroy old skills     | Low — changes localized
Modularity         | Monolithic retraining             | Swap/upgrade independently
Capability ceiling | Can improve fundamental reasoning | Limited by frozen agent
Key takeaway

Rare, expensive A1/A2 updates for the base model; frequent, cheap T1/T2 adaptation of retrievers, search policies, memory, and planners for robustness and scalability.

Quiz — Level 2
1. Why can A2 methods accidentally train the agent to ignore its tools?
Since A2 supervises only the final output, the agent can learn strategies that improve answer quality without invoking tools — the signal doesn't explicitly reward tool use.
2. How does DeepRetrieval frame query reformulation, and what makes it A1?
DeepRetrieval treats query rewriting as an MDP where the reward combines Recall, nDCG, and format correctness — all from tool execution, the defining feature of A1.
3. What makes s3 (T2) so data-efficient compared to monolithic A2?
T2 localizes adaptation to a small module learning one specific skill. A2 retrains billions of parameters across entangled capabilities, requiring far more data.
4. How can an A1-trained agent become a T1 tool?
The paper calls this "graduation" — an A1-trained agent is frozen and reused as a plug-and-play T1 tool in new systems.
5. What is "parasitic adaptation"?
A security risk where a compromised tool manipulates the RL reward signal to hijack the agent's learning — controlling what it learns.

Level 3 — Expert

Formal framework

Adaptation over tuple (A, T, D, E, O): Agent A(theta), Tool set T, offline Data D, Environment E, Objective O.

A1: min_theta O_tool(T(A_theta(x)))      -- agent weights, tool-execution signal
A2: min_theta O_agent(A_theta(x, T(a)))   -- agent weights, output signal
T1: min_phi   O_tool(T_phi(a))            -- tool weights, independent signal
T2: min_phi   O_agent(A(x, T_phi(a)))     -- tool weights, frozen-agent signal

Boundary cases: when both agent and tool are modified, assigned by dominant locus of optimization with secondary component flagged.

A1: the RLVR evolution

Three phases

Phase 1 — Toolformer: Keep tool call c if Loss(text_with_c) < Loss(text_without_c).
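A minimal sketch of that filtering rule; `lm_loss` is a placeholder scorer, and `tau` mirrors Toolformer's usefulness threshold (the real system scores the loss on the tokens following the call):

```python
def filter_tool_calls(candidates, lm_loss, tau=0.0):
    """Keep an inserted API call only if conditioning on its result lowers
    the language-modeling loss by at least tau.
    candidates: list of (text_with_call, text_without_call) pairs.
    lm_loss: callable mapping text -> loss (lower is better)."""
    return [with_call for with_call, without_call in candidates
            if lm_loss(without_call) - lm_loss(with_call) >= tau]
```

The surviving texts become self-supervised SFT data: the model generated the calls, and the loss reduction alone decided which ones to learn from.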

Phase 2 — Gorilla: AST-based correctness. Parse API call, compare against schema.

Phase 3 — DeepRetrieval: MDP. R = alpha*Recall@k + beta*nDCG@k + gamma*format_score. Policy objective: max E[R(s,a)] - eta*KL(pi_theta || pi_ref), with eta the KL coefficient. 65.1% vs 24.7% recall.

A2: DeepSeek-R1 and TextGrad

DeepSeek-R1: Pure RL, no SFT warm-start. Emergent reasoning at scale.

TextGrad: o_(t+1) = LLM(x, o_t, critique(o_t)). Gradient descent in natural language. Works on black-box APIs.
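The update rule can be sketched as a refinement loop; `llm` and `critic` below are stubs standing in for model calls, and the empty-critique stopping rule is an illustrative convention, not TextGrad's exact convergence test:

```python
def textual_gradient_descent(llm, critic, x, o0, steps=3):
    """TextGrad-style refinement: the critique plays the role of a gradient,
    and the LLM applies it to produce the next iterate o_(t+1).
    llm(x, o, critique) -> revised output; critic(o) -> natural-language critique."""
    o = o0
    for _ in range(steps):
        critique = critic(o)
        if not critique:  # empty critique ~ zero gradient: stop early
            return o
        o = llm(x, o, critique)
    return o
```

Because the loop only consumes and produces text, it runs against any black-box API: no weights, no backpropagation.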

ReTool: Bridges A1/A2. Outcome reward (A2), tool execution in reasoning trace (A1-like). Classified A2 by dominant locus.

T2: REPLUG, s3, and memory

Formal T2 objectives

REPLUG: min_phi E_x[-log P_theta(o | x, R_phi(x))], theta frozen.
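A sketch of that objective under a simplifying assumption: the LM marginalizes over a handful of retrieved documents, and the per-document log-likelihoods from the frozen LM (theta) are given. In the real system only the retriever scores (phi) would carry gradients:

```python
import math

def softmax(scores):
    """Convert raw retriever scores into a retrieval distribution P_phi(d|x)."""
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def replug_loss(retriever_scores, lm_logprobs):
    """-log sum_d P_phi(d|x) * P_theta(o|x,d).
    retriever_scores: per-document scores from the trainable retriever (phi).
    lm_logprobs: frozen LM log-likelihood of the gold output o given each doc."""
    p_retrieve = softmax(retriever_scores)
    marginal = sum(p * math.exp(lp) for p, lp in zip(p_retrieve, lm_logprobs))
    return -math.log(marginal)
```

The loss drops when the retriever up-weights documents under which the frozen LM finds the gold output more likely, so the LM's perplexity is the only supervision the retriever ever sees.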

s3: G = Quality(answer_with_search) - Quality(answer_with_RAG_only). Rewards only the marginal value of search. 2,400 training examples matched baselines trained on 170K+ examples.

Memento: Memory read/write policies from binary outcome rewards. Pure T2, no annotation.

Memory taxonomy (v3)

  • Dynamic stores — episodic buffers, key-value caches
  • Experiential/reflective — Reflexion-style failed-attempt storage
  • Structured — knowledge graphs, trees, learned read/write
  • Parametric — LoRA adapters / soft prompts (blurs agent/tool boundary)
  • Skill libraries — reusable procedural knowledge; skills from one agent become T1 for others

Evaluation framework (Section 7)

Dimension      | A1/T1                               | A2/T2
Signal metrics | Verifiable: pass rate, Recall, nDCG | Holistic: exact match, F1, LLM-as-judge
Objectivity    | High — deterministic                | Lower — subjective judges
Benchmarks     | SWE-bench, HumanEval, BEIR, MTEB    | MMLU, MATH, task-specific

Critical gap: almost no benchmarks measure T2 specifically.

Connections to related work

Positioning

Vs. agent surveys: Those catalog architectures; this focuses on adaptation. 2x2 taxonomy is novel.

Vs. RLHF: A2 is more general — includes non-parametric methods and tool-augmented settings.

Vs. RAG surveys: Decomposes RAG into T1/T2 retriever + A1/A2 generator.

Critical evaluation

Strengths and weaknesses

Strengths: Clarifying taxonomy; concrete T2 efficiency case study; honest boundary-case treatment; v3 adds memory, skills, evaluation framework.

Weaknesses: 70x claim from single comparison; no controlled experiments; soft paradigm boundaries; single-agent scope only; evaluation framework not empirically validated.

Key takeaway

The 2x2 framework provides a genuine organizing principle. The strongest practical insight: T2 — training tools under frozen agent supervision — achieves remarkable data efficiency and modularity, making it the recommended default for production systems on closed-source models.

Quiz — Level 3
1. How does the paper formally distinguish A1 from A2?
Both optimize theta, but A1's objective is over tool execution quality (O_tool) while A2's is over final output quality (O_agent). This formal axis separates them regardless of SFT vs RL.
2. What is "Gain Beyond RAG" in s3, and why is it effective?
By subtracting RAG-only quality from search-augmented quality, the reward isolates exactly how much the search subagent contributed beyond basic retrieval. This prevents the searcher from getting credit for what the retriever already handles.
3. Why does the paper identify a "T2 evaluation gap"?
BEIR measures retrieval (T1), MMLU measures agent answers (A2), but no benchmark asks: "did this tool get better at serving this specific agent?" T2's defining property lacks evaluation infrastructure.
4. How does TextGrad achieve A2 adaptation without model weights?
TextGrad frames self-critique as gradient descent in natural language space. Since it only needs text outputs, it works on any API — test-time compute scaling rather than training.
5. What criterion handles methods that modify both agent and tool?
Whichever component receives primary parameter updates determines the paradigm. The secondary is noted. Pragmatic but acknowledged as ambiguous in edge cases like ReTool.

Phase 4 — Frontier

Six improvement vectors for this paper, mapped against recent work (as of April 2026) that addresses — or doesn't address — each one.

1. Controlled cross-paradigm experiments

Area to explore

The 70x T2 efficiency claim comes from comparing s3 against specific A2 baselines across different papers, models, and datasets. A rigorous version would hold everything constant — same base model, same task, same compute budget — and sweep A1/A2/T1/T2 head-to-head. No one has done this yet. This would be the single most valuable empirical contribution to validate the survey's claims.

2. Co-adaptation: jointly training agent and tools

Partially addressed

The survey flags co-adaptation as a key open problem but offers no algorithms. Recent infrastructure work is laying the groundwork:

Recent work

Agent Lightning (Microsoft, late 2025) — Separates task execution from model training. Universal (input, output, reward) trajectory format. RL with virtually no code modification. Works for multi-tool, multi-agent workflows.

NVIDIA NeMo Gym + NeMo RL (Jan 2026) — Modular RL infrastructure for scientific agents. GRPO support, end-to-end FP8 training.

The plumbing exists, but no one has published a formal co-adaptation algorithm with convergence guarantees or stability analysis. Game-theoretic formulations and credit assignment across the agent-tool boundary remain open.

3. Multi-agent adaptation

Partially addressed

The survey scopes to single-agent systems, but production systems increasingly involve multiple agents sharing tools.

Recent work

Google Paradigms of Intelligence (Mar 2026) — Decentralized RL against mixed opponent pools produces cooperative multi-agent behavior without hardcoded coordination. Agents performed better with no prior information about adversaries, adapting through trial and error alone.

This shows multi-agent adaptation can emerge from standard training, but doesn't address the survey's specific gap: how to adapt shared tools when multiple agents depend on them without degrading any individual agent's performance.

4. Memory adaptation formalization

Substantially advanced
Recent work

AgeMem (Jan 2026) — Unifies long-term and short-term memory management into the agent's policy as learnable tool-based actions (add, update, delete, retrieve, summarize, filter). Three-stage progressive RL with step-wise GRPO. 4.8–8.6 percentage point improvements over baselines.

ALMA (Feb 2026) — Meta-learns the memory architecture itself. Discovers domain-specific designs that surpass human-designed baselines while being more cost-efficient. Challenges the assumption that memory architecture is fixed.

Strong advances, but formal guarantees remain missing — no theory of optimal forgetting, no bounds on memory staleness, no analysis of when memory-based T2 outperforms parametric A2.

5. Safety under adaptation

Area to explore

The survey flags reward hacking and parasitic adaptation but proposes no concrete mitigations.

Community response

ICLR 2026 "Lifelong Agents" workshop — First unified forum bridging continual learning, RL, memory, and safety for long-lived agents. Covers continual fine-tuning, domain shift adaptation, and tool-use strategies.

Workshop-level discussion is happening, but concrete primitives are missing: no constrained RL formulations guaranteeing tool-use safety during adaptation, no automated detection of parasitic adaptation, no formal verification for adapted tools. This is the widest gap between identified risks and actual solutions.

6. Paradigm selection benchmark

Area to explore

The survey recommends combining paradigms but gives no decision procedure. ReTool provides indirect evidence (67% on AIME in 400 steps vs text-based RL at 40%/1080 steps), but this is one data point. No one has built a meta-benchmark or selection controller that recommends A1/A2/T1/T2 given task, budget, and model access constraints. This would transform the survey from a conceptual framework into an actionable design tool.

Scorecard

Vector                                | Status                 | Key work
Controlled cross-paradigm experiments | Area to explore        | No one has done this
Co-adaptation algorithms              | Partially addressed    | Agent Lightning, NeMo Gym
Multi-agent adaptation                | Partially addressed    | Google decentralized RL
Memory formalization                  | Substantially advanced | AgeMem, ALMA
Safety primitives                     | Area to explore        | ICLR workshop only
Paradigm selection benchmark          | Area to explore        | No one has done this
Bottom line

The survey's 2x2 framework is holding up — nothing since has broken the taxonomy. The frontier is moving fastest on memory (vector 4) and infrastructure (vector 2), while safety (vector 5), controlled experiments (vector 1), and paradigm selection (vector 6) remain wide open.
