Toolformer: Language Models Can Teach Themselves to Use Tools
ArXiv: 2302.04761
🎯 Pitch
Toolformer finetunes a pretrained LM to autonomously decide when, which, and how to call simple text-based APIs (search, QA, calculator, translation, calendar) by self-generating candidate calls and keeping only those that measurably reduce next-token loss. This lets a single model combine fluent language generation with reliable external computation and retrieval, improving zero-shot factual, mathematical, multilingual, and time-sensitive performance without large human-labeled tool-call datasets.
1. Executive Summary (2-3 sentences)
Toolformer trains a pretrained language model to autonomously decide when to call external tools (via simple text APIs), which tool to call, what arguments to pass, and how to use the returned result during generation, using a self-supervised filtering signal based on improved next-token prediction (Section 2; Figure 2). Built on GPT-J (6.7B), this approach substantially improves zero-shot performance on factual recall, math word problems, multilingual QA, and time-sensitive questions while preserving standard language modeling perplexity when tools are disabled (Tables 3-8; Section 4.3). The significance is that tool use is learned without large human-labeled tool traces: only "a handful of demonstrations" per API are needed to bootstrap a large training set (Abstract; Figure 3; Appendix A.2).
2. Context and Motivation
- Problem / gap addressed.
  - Large LMs show strong zero-/few-shot task performance but have persistent weaknesses in:
    - precise arithmetic,
    - factual lookup / up-to-date knowledge,
    - low-resource language understanding,
    - temporal awareness (Introduction).
  - Existing ways to add tools often either:
    - require substantial human supervision for tool calls (e.g., tool-annotated data), or
    - restrict tool use to task-specific prompting/training setups, limiting generality (Introduction; Section 6 "Tool Use").
- Why it matters.
  - Tools like search, calculators, translation systems, and calendars can provide ground truth or computed information that a parametric LM may not reliably store or infer (Introduction; Figure 1 examples).
  - The goal is to get the "best of both worlds": fluent language modeling plus reliable external computation/lookup (Abstract).
- Prior approaches and shortcomings (as positioned here).
  - Tool-use systems with heavy human annotations are costly and may encode what humans think is helpful rather than what reduces model uncertainty (Introduction).
  - Few-shot tool prompting tailored per task requires users to specify tool usage patterns and does not generalize to arbitrary text contexts (Section 4.2 discussion; Section 6).
- How this paper positions itself.
  - Toolformer is designed to satisfy two desiderata (Introduction):
    - Self-supervised tool learning: no large-scale human tool traces; only a few demonstrations per tool.
    - General tool-use policy: the model decides when/how to use tools without being restricted to a specific downstream task format.
3. Technical Approach
3.1 Reader orientation (approachable technical breakdown)
- The system is a pretrained language model that learns to insert and use textual "API call" snippets inside ordinary text generation.
- It solves tool use by generating candidate tool calls in-context, executing them, keeping only those that measurably help predict future tokens, and then fine-tuning the LM on the resulting augmented corpus (Section 2; Figure 2).
3.2 Big-picture architecture (diagram in words)
- Inputs: a large plain-text corpus `C` (here, a subset of `CCNet`) and a pretrained LM `M` (here, `GPT-J`) (Section 4.1).
- Component A (call sampler): prompts `M` to propose tool calls at positions in each text (Section 2 "Sampling API Calls"; Figure 3).
- Component B (tool executor): runs each proposed call against the corresponding tool (calculator, QA model, BM25 Wikipedia search, MT model, calendar) to obtain a text result (Section 2 "Executing API Calls"; Section 3; Table 1).
- Component C (self-supervised filter): computes whether including the tool result reduces a weighted next-token loss compared to no call / no result, and keeps only helpful calls (Section 2 "Filtering API Calls"; Figure 2).
- Component D (fine-tuner): interleaves surviving calls into the original text to form `C*` and fine-tunes `M` with standard LM training (Section 2 "Model Finetuning"; Section 4.1; Appendix B).
- Inference loop: during decoding, when the model emits the tool-response separator token, the system pauses, executes the tool, inserts the result, and continues (Section 2 "Inference").
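To make Component B concrete, here is a minimal sketch of a text-in, text-out tool executor. It is an illustration under stated assumptions, not the authors' implementation: the names `calculator_tool`, `calendar_tool`, and `execute_call` are hypothetical, only two of the five tools are stubbed, and the calendar output wording is an assumed format.

```python
import ast
import operator
from datetime import date
from typing import Optional

# Hypothetical sketch: only the four basic arithmetic operations are supported.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def _evaluate(node: ast.AST) -> float:
    """Recursively evaluate an AST limited to +, -, *, / and numeric literals."""
    if isinstance(node, ast.Expression):
        return _evaluate(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_evaluate(node.operand)
    raise ValueError("unsupported expression")


def calculator_tool(expression: str) -> str:
    """Return the arithmetic result as text, rounded to two decimals."""
    return f"{_evaluate(ast.parse(expression, mode='eval')):.2f}"


def calendar_tool(today: Optional[date] = None) -> str:
    """Return the current date as text (the wording is an assumed format)."""
    today = today or date.today()
    return "Today is " + today.strftime("%A, %B %d, %Y") + "."


def execute_call(api_name: str, api_input: str, today: Optional[date] = None) -> str:
    """Dispatch a proposed call to its tool; every tool maps text to text."""
    if api_name == "Calculator":
        return calculator_tool(api_input)
    if api_name == "Calendar":
        return calendar_tool(today)
    raise KeyError(f"unknown tool: {api_name}")


print(execute_call("Calculator", "400 / 1400"))
```

The only contract the pipeline relies on is that inputs and outputs are plain text, so QA, search, and MT tools could be added to the same dispatcher as further branches.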
3.3 Roadmap for the deep dive
- Explain the API-call representation (how calls and results become tokens in the LM).
- Walk through candidate generation (how positions and calls are sampled).
- Detail the filtering objective (the loss comparison that decides which calls "help").
- Describe dataset construction and fine-tuning (how `C*` is built and how the LM is trained).
- Describe inference-time tool execution and decoding controls (how the model is induced to actually call tools).
- Summarize tools and implementation specifics (what each tool is and how data is selected for it).
3.4 Detailed, sentence-based technical breakdown
This is an algorithmic + empirical systems paper whose core idea is to bootstrap tool-use supervision from a pretrained LM itself, then keep only tool calls that reduce next-token prediction loss, and fine-tune on the resulting data (Section 2; Figure 2).
3.4.1 Tool-call representation (what the LM sees)
- Each tool call is represented as a tuple `c = (a_c, i_c)` where `a_c` is the API name and `i_c` is its textual input (Section 2).
- The paper defines two linearizations (Section 2):
  - `e(c) = <API> a_c(i_c) </API>` (call without result)
  - `e(c, r) = <API> a_c(i_c) → r </API>` (call with result `r`)
- Special tokens mark structure: `<API>`, `</API>`, and `→`. In practice, they use the token sequences `" ["`, `"]"`, and `"->"` so the method works without changing the LM vocabulary (footnote in Section 2).
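The two linearizations can be sketched as a single helper. This is a minimal illustration: the function name `linearize` and the exact whitespace around the separator are assumptions, while the three marker strings are the practical token sequences mentioned above.

```python
from typing import Optional

# Practical stand-ins for <API>, </API>, and the result separator.
API_START, API_END, RESULT_SEP = " [", "]", "->"


def linearize(api_name: str, api_input: str, result: Optional[str] = None) -> str:
    """Return e(c) when result is None, otherwise e(c, r)."""
    call = f"{api_name}({api_input})"
    if result is None:
        return f"{API_START}{call}{API_END}"
    # Spacing around the separator is an assumption made for readability.
    return f"{API_START}{call} {RESULT_SEP} {result}{API_END}"


question = "What other name is Pittsburgh known by?"
print(linearize("QA", question))
print(linearize("QA", question, "Steel City"))
```

Because the markers are ordinary character sequences, the same tokenizer and vocabulary can be reused unchanged.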
3.4.2 System/data pipeline diagram in words (explicit sequence)
What happens first, second, third (Figure 2; Section 2):
- Start with a plain-text example `x = x_1, …, x_n` from corpus `C`.
- Sample candidate positions `i` where the model might want to begin a tool call by computing `p_i = p_M(<API> | P(x), x_{1:i-1})` and keeping positions whose probability exceeds a threshold `τ_s`, capped to the top `k` positions (Section 2 "Sampling API Calls").
- Generate candidate tool calls at each selected position by sampling from `M` conditioned on `[P(x), x_1, …, x_{i-1}, <API>]` until the end token `</API>` (Section 2). The paper samples up to `m` calls per position and discards examples where `</API>` is not produced.
- Execute each candidate call against the tool implementation to obtain a textual response `r_i` (Section 2 "Executing API Calls").
- Filter calls using a self-supervised loss test: keep a call only if providing both call and response lowers a weighted loss on future tokens by at least `τ_f` compared to baselines (Section 2 "Filtering API Calls"; Figure 2).
  - The paper notes a key implementation choice: because `M` has not yet been fine-tuned on any API calls at this stage, the loss is computed with `e(c_i, r_i)` provided as a prefix rather than inserted mid-sequence, to avoid disrupting the pretrained LM's expected token patterns (Section 2, footnote 3).
- Construct an augmented training example `x*` by inserting the retained `e(c_i, r_i)` snippets into the text (Section 2 "Model Finetuning").
- Repeat across the corpus to create `C*`, then fine-tune the LM on `C*` with a standard LM objective (Section 2; Section 4.1; Appendix B).
- At inference, decode normally until the model outputs `→`, then pause decoding, run the tool, insert the response + `</API>`, and continue decoding (Section 2 "Inference").
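The construction of the augmented example can be sketched as a simple interleaving step. This is a minimal sketch under assumptions: tokens are represented as whitespace-split strings rather than model tokens, positions are 0-based, and the function name `augment` is hypothetical.

```python
from typing import Dict, List


def augment(tokens: List[str], calls: Dict[int, str]) -> List[str]:
    """Insert each retained call string immediately before the token at its position."""
    augmented: List[str] = []
    for index, token in enumerate(tokens):
        if index in calls:
            augmented.append(calls[index])
        augmented.append(token)
    return augmented


text = "Pittsburgh is also known as the Steel City .".split()
# Assumed retained call at position 5, i.e. right before "the".
retained = {5: "[QA(What other name is Pittsburgh known by?) -> Steel City]"}
print(" ".join(augment(text, retained)))
```

Removing the inserted snippets recovers the original text exactly, which is why the augmented corpus covers the same underlying content as the original one.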
3.4.3 The filtering objective (how "useful" tool calls are detected)
Toolformer's key technical mechanism is a perplexity/loss-based filter that converts tool use into a self-supervised decision problem (Section 2; Figure 2).
- Define a weighted cross-entropy loss over future tokens starting at position `i` (Section 2):
  - `L_i(z) = -Σ_{j=i..n} w_{j-i} · log p_M(x_j | z, x_{1:j-1})`
  - Here `z` is an optional prefix (e.g., the tool call + result), and weights `w_t` emphasize tokens close to the call site.
- Two variants are compared (Section 2):
  - `L_i^+ = L_i(e(c_i, r_i))` (tool call with result provided)
  - `L_i^- = min( L_i(ε), L_i(e(c_i, ε)) )`, where `ε` is the empty sequence, so `L_i(ε)` is the loss with no prefix and `e(c_i, ε)` is the call without its result.
- A call is kept if it satisfies (Section 2):
  - `L_i^- - L_i^+ ≥ τ_f`
  - Intuition (as implemented by the inequality): the response must provide predictive information beyond (a) no call, and (b) "doing a call" without actually learning anything (call input alone).
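The filter rule can be sketched directly from the three losses. This is a minimal sketch, assuming the per-token log-probabilities under each prefix have already been computed by the LM; the function names and the toy numbers are illustrative only.

```python
from typing import Sequence


def weighted_loss(logprobs: Sequence[float], weights: Sequence[float]) -> float:
    """L_i(z): negative weighted sum of log p(x_j | z, x_{1:j-1}) for j >= i."""
    return -sum(w * lp for w, lp in zip(weights, logprobs))


def keep_call(lp_with_result: Sequence[float],
              lp_call_only: Sequence[float],
              lp_no_call: Sequence[float],
              weights: Sequence[float],
              tau_f: float = 1.0) -> bool:
    """Keep the call only if L_i^- - L_i^+ >= tau_f."""
    loss_plus = weighted_loss(lp_with_result, weights)
    loss_minus = min(weighted_loss(lp_no_call, weights),
                     weighted_loss(lp_call_only, weights))
    return loss_minus - loss_plus >= tau_f


# Toy log-probabilities for two future tokens (made-up numbers).
weights = [0.5, 0.5]
print(keep_call([-0.1, -0.1], [-2.5, -2.5], [-3.0, -3.0], weights))
print(keep_call([-2.0, -2.0], [-2.5, -2.5], [-3.0, -3.0], weights))
```

Taking the minimum over the two baselines means a call is not rewarded merely because its input text restates something useful; the returned result itself has to help.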
Weighting near-term tokens.
- In the experiments, weights are chosen to emphasize locality (Section 4.1):
- unnormalized w̃_t = max(0, 1 - 0.2·t), then normalized so weights sum to 1.
- This biases the filter toward tool calls whose results help predict tokens soon after the call (Section 4.1).
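As a quick arithmetic check of the weighting scheme, the sketch below computes the normalized weights. The function name and the `horizon` parameter (how many future tokens are scored) are assumptions for illustration.

```python
from typing import List


def locality_weights(horizon: int = 5, slope: float = 0.2) -> List[float]:
    """Normalize max(0, 1 - slope * t) for t = 0..horizon-1 so the weights sum to 1."""
    raw = [max(0.0, 1.0 - slope * t) for t in range(horizon)]
    total = sum(raw)
    return [w / total for w in raw]


weights = locality_weights()
print([round(w, 3) for w in weights])
```

With a slope of 0.2 the unnormalized weights are 1.0, 0.8, 0.6, 0.4, 0.2 and then zero, so only the five tokens following the call site contribute, and the first of them carries one third of the total weight.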
Worked micro-example (based on Figure 2).
- Text: "Pittsburgh is also known as the Steel City."
- Candidate calls at position i might include:
- c_i^1 = QA("What other name is Pittsburgh known by?") returning Steel City
- c_i^2 = QA("Which country is Pittsburgh in?") returning United States
- The filter keeps the first call because providing Steel City makes predicting the next tokens ("the Steel City") easier, i.e., yields L_i^- - L_i^+ above threshold, while the second is discarded because "United States" does not help predict that continuation (Figure 2 illustration).
3.4.4 Sampling candidate calls (how tool supervision is bootstrapped)
- For each API, the authors write a prompt `P(x)` that instructs the LM to insert tool calls into a text; Figure 3 shows the QA prompt pattern, and Appendix A.2 lists prompts for all tools.
- Position sampling uses the probability the LM assigns to starting a tool call (`<API>`) at each position (Section 2). This is an important design choice: the model's own uncertainty/affordance for tool insertion guides where tool use is even considered.
- For each selected position, multiple call candidates are sampled, then tool execution + filtering selects those that empirically help.
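Position selection can be sketched as a threshold followed by a top-k cap. This is a minimal sketch, assuming the per-position probabilities of `<API>` have already been read off the LM; the function name and the toy probabilities are illustrative.

```python
from typing import List, Sequence


def select_positions(api_probs: Sequence[float], tau_s: float = 0.05, k: int = 5) -> List[int]:
    """Keep positions whose <API> probability exceeds tau_s, capped to the k most likely."""
    candidates = [i for i, p in enumerate(api_probs) if p > tau_s]
    candidates.sort(key=lambda i: api_probs[i], reverse=True)
    return sorted(candidates[:k])


# Made-up probabilities of starting a call at each of five positions.
probs = [0.01, 0.20, 0.06, 0.50, 0.04]
print(select_positions(probs))
print(select_positions(probs, k=2))
```

Setting `tau_s = 0.0` with a larger `k`, as done for the low-yield tools, turns this into a pure top-k selection.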
3.4.5 Building C* and fine-tuning M (training details and what is / isn't specified)
Data source and selection.
- Base corpus C is a subset of CCNet (Section 4.1).
- To reduce annotation cost, heuristics select subsets likely to benefit specific tools (Section 4.1; Appendix A.1):
- Calculator: documents with numeric patterns (e.g., at least three numbers; presence of "="; etc.), with an additional 1% subsampling for a broad criterion (Appendix A.1).
- Calendar: only texts where a date can be extracted from the URL; this leaves ~18% of documents (Appendix A.1).
- MT: identify non-English chunks bracketed by English text using fastText language ID with confidence > 0.8 and chunk size 10 tokens; filter out numeric/symbol-only chunks; also remove cases where the MT input appears only after the call (look-ahead artifact) (Appendix A.1).
- After filtering tool calls, they also drop examples where all calls were eliminated (Section 4.1, footnote 4).
Sampling/filtering hyperparameters.
- Default values for data generation are (Appendix A):
- τ_s = 0.05, τ_f = 1.0, keep top k = 5 positions, sample up to m = 5 calls per position.
- For Calculator and MT, to offset small yields, they use (Appendix A):
- τ_s = 0.0, k = 20, m = 10, and a lower filter threshold τ_f = 0.5.
- Table 2 quantifies how many examples with retained calls are produced under different τ_f settings (e.g., Wikipedia Search yields 60,974 examples at τ_f = 1.0, while Calculator yields 994 at τ_f = 1.0) (Table 2).
Fine-tuning configuration (what is explicitly given).
- The model is GPT-J (6.7B) (Introduction; Section 4.1).
- Fine-tuning uses (Section 4.1; Appendix B):
- batch size: 128 (effective batch size 128 in Appendix B),
- learning rate: 1e-5,
- linear warmup over first 10% of training,
- max sequence length: 1024,
- training steps: up to 2000, selecting the checkpoint with best dev perplexity (Appendix B),
- hardware: 8 × NVIDIA A100 40GB, BF16, using DeepSpeed ZeRO-3 (Appendix B).
- Not specified in the sections summarized here: optimizer type (e.g., AdamW), weight decay, gradient clipping, tokenizer details, and architectural hyperparameters (layers, hidden size, attention heads, context window beyond max sequence length). These are therefore left unstated rather than inferred.
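The warmup setting above can be illustrated with a small schedule function. This is a sketch under a loud assumption: the summary only states linear warmup over the first 10% of training, so holding the rate constant afterwards is a placeholder choice, not something the source confirms. The function name is hypothetical.

```python
def learning_rate(step: int,
                  total_steps: int = 2000,
                  base_lr: float = 1e-5,
                  warmup_fraction: float = 0.1) -> float:
    """Linear warmup to base_lr; constant afterwards (post-warmup shape is an assumption)."""
    warmup_steps = max(1, round(total_steps * warmup_fraction))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr


for step in (0, 99, 199, 1000):
    print(step, learning_rate(step))
```

With up to 2000 steps this gives 200 warmup steps, and the peak rate of 1e-5 is reached at the end of warmup.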
Why fine-tuning should preserve general LM ability (claimed mechanism).
- C* contains the same underlying texts as C, only with inserted tool call snippets at positions that reduce predictive loss, so training should not "specialize away" from language modeling (Section 2 "Model Finetuning"). Section 4.3 empirically checks this using perplexity.
3.4.6 Inference-time tool use and decoding controls
- The inference protocol triggers tool execution when the model outputs the `→` token, meaning it expects a tool response next (Section 2 "Inference").
- In experiments, they adjust decoding to increase tool usage (Section 4.2; Section 5):
  - Use greedy decoding, except they allow `<API>` to be generated whenever it is among the `k` most likely tokens, not only when it is top-1.
  - They set `k = 10` in main downstream experiments (Section 4.2).
  - They restrict to at most one API call per input to avoid loops (Section 4.2).
- Section 5 ("Decoding Strategy") shows tool-call rate and performance vary strongly with `k` (Table 9), and calibration (calling tools mostly when needed) is better at `k=1` but degrades at higher `k`.
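The modified greedy rule can be sketched as a single token-selection step. This is an illustration under assumptions: the next-token distribution is given as a plain dictionary, and the `allow_api` flag (a hypothetical name) stands in for both the one-call-per-input restriction and the tools-disabled setting.

```python
from typing import Dict


def next_token(probs: Dict[str, float], k: int = 10,
               api_token: str = "<API>", allow_api: bool = True) -> str:
    """Greedy choice, except <API> is emitted whenever it ranks among the top k tokens."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    if allow_api and api_token in ranked[:k]:
        return api_token
    # Otherwise take the most likely token that is not <API>.
    for token in ranked:
        if token != api_token:
            return token
    raise ValueError("no non-API token available")


step_probs = {"the": 0.5, "a": 0.3, "<API>": 0.2}
print(next_token(step_probs, k=1))
print(next_token(step_probs, k=3))
print(next_token(step_probs, k=3, allow_api=False))
```

After the first call in an input, a caller would pass `allow_api=False` for the remaining steps, which has the same effect as setting the probability of `<API>` to zero.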
4. Key Insights and Innovations
- (1) Self-supervised filtering criterion for tool usefulness.
  - Innovation: tool calls are kept only if including the tool's result reduces weighted next-token loss by at least `τ_f` compared to no call / no result (Section 2).
  - Why it's significant: this turns tool-use learning into a scalable self-supervised data-generation pipeline without human tool traces, and it directly optimizes for "helpfulness to the LM" rather than human intuition (Introduction; Figure 2).
- (2) Bootstrapping tool-use datasets from a pretrained LM with only a few demonstrations.
  - The LM is prompted (Figure 3; Appendix A.2) to propose tool calls across a large corpus, creating `C*` from `C` (Section 2).
  - This differs from task-specific prompting: the tool-use behavior is learned at the pretraining-corpus level and then transferred to downstream zero-shot tasks (Section 4.2).
- (3) Unified textual interface for multiple heterogeneous tools.
  - The approach only requires that tool inputs/outputs are representable as text sequences (Section 2; Section 3).
  - Tools included cover distinct failure modes (Section 3): `QA (Atlas)`, `WikiSearch (BM25 snippets)`, `Calculator (basic arithmetic, rounded to 2 decimals)`, `MT (NLLB 600M + fastText LID)`, and `Calendar (current date)`.
- (4) Evidence that tool-use capability "emerges" with scale.
  - Scaling experiment across `GPT-2` family sizes (124M to 1.6B) plus `GPT-J` suggests tool leveraging only becomes effective around ~`775M` parameters (Section 4.4; Figure 4).
  - This is positioned as an emergent capability: small models do not benefit much from tools, while larger ones learn to use them meaningfully (Figure 4 narrative).
5. Experimental Analysis
5.1 Evaluation methodology (datasets, metrics, baselines, setup)
Baselines (Section 4.1).
- GPT-J: pretrained, no fine-tuning.
- GPT-J + CC: fine-tuned on C (CCNet subset), no tool calls.
- Toolformer: GPT-J fine-tuned on C* (CCNet subset augmented with tool calls).
- Toolformer (disabled): same weights as Toolformer, but tool calls are disabled at decoding by setting probability of <API> to 0 (Section 4.1, footnote 5).
- Also compared (for many tasks): OPT (66B) and GPT-3 (175B) (davinci, not instruction-tuned) (Section 4.1).
Decoding and tool-call constraints (Section 4.2).
- Prompted zero-shot only: instructions are given, but no in-context examples for task solving.
- Greedy decoding with modified <API> triggering: <API> can be emitted if among top-k tokens; they use k=10.
- At most one API call per input.
Task suites and evaluation details.
- LAMA: SQuAD, Google-RE, T-REx subsets; filtered to left-to-right completion; metric: correct answer appears within first 5 generated words; Wikipedia Search disabled to avoid unfair advantage (Section 4.2.1).
- Math: ASDiv, SVAMP, MAWPS; metric: first generated number (with a special-case rule for equations like "=") (Section 4.2.2, footnote 7).
- Open-domain QA: WebQS, Natural Questions, TriviaQA; metric: answer appears within first 20 generated words; QA tool disabled to avoid triviality because the QA tool was trained on NQ (Section 4.2.3).
- MLQA: multilingual questions with English context; metric: answer in first 10 words; expects model to answer in English; tool usage rates reported (Section 4.2.4).
- Temporal: TEMPLAMA and DATESET; LAMA-like metric; DATESET requires knowing current date (Section 4.2.5; Appendix D).
- Language modeling preservation: perplexity on WikiText and held-out CCNet subset (Section 4.3; Table 8).
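The "answer within the first N generated words" criterion used for LAMA, open-domain QA, and MLQA can be sketched as follows. The matching details here (whitespace word splitting, case-insensitive substring match) and the function name are assumptions for illustration, not the paper's evaluation code.

```python
from typing import Iterable


def answer_in_first_words(generation: str, answers: Iterable[str], max_words: int) -> bool:
    """True if any gold answer occurs within the first max_words whitespace-separated words."""
    prefix = " ".join(generation.split()[:max_words]).lower()
    return any(answer.lower() in prefix for answer in answers)


output = "It is known as the Steel City of the United States"
print(answer_in_first_words(output, ["Steel City"], max_words=5))
print(answer_in_first_words(output, ["Steel City"], max_words=20))
```

The word budget matters: the same generation counts as wrong under the 5-word LAMA window and as correct under the 20-word QA window.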
5.2 Main quantitative results (with specific numbers)
LAMA factual completion (Table 3; Section 4.2.1)
- Table 3: `Toolformer` = 33.8 / 11.5 / 53.5 (SQuAD / Google-RE / T-REx) vs `GPT-J` = 17.8 / 4.9 / 31.9, and `GPT-J + CC` = 19.2 / 5.6 / 33.2. Also higher than `OPT (66B)` (21.6 / 2.9 / 30.1) and `GPT-3 (175B)` (26.8 / 7.0 / 39.8).
- Tool use behavior: Toolformer uses the QA tool for 98.1% of examples (Section 4.2.1).
Math word problems (Table 4; Section 4.2.2)
- Table 4: `Toolformer` = 40.4 / 29.4 / 44.0 (ASDiv / SVAMP / MAWPS) vs `GPT-J` = 7.5 / 5.2 / 9.9, and `GPT-3 (175B)` = 14.0 / 10.0 / 19.8.
- Tool use behavior: uses the calculator tool for 97.9% of examples (Section 4.2.2).
- Notable nuance: `Toolformer (disabled)` improves over `GPT-J` even without calling tools (e.g., ASDiv 14.8 vs 7.5), and the paper attributes this to being fine-tuned on many calculator call/result examples, which may strengthen internal math ability (Section 4.2.2).
Open-domain QA via Wikipedia search (Table 5; Section 4.2.3)
- Table 5: `Toolformer` = 26.3 / 17.7 / 48.8 (WebQS / NQ / TriviaQA) vs `GPT-J` = 18.5 / 12.8 / 43.9, and `GPT-3 (175B)` = 29.0 / 22.6 / 65.9.
- Tool usage: Wikipedia Search used for 99.3% of examples (Section 4.2.3).
- Toolformer improves substantially over same-size baselines but still trails GPT-3 on these datasets (Table 5).
Multilingual QA (MLQA) with MT tool (Table 6; Section 4.2.4)
- Table 6 (`Toolformer`): Es 20.6, De 13.5, Hi 1.4, Vi 10.6, Zh 16.8, Ar 3.7. Compared to `GPT-J`: Es 15.2, De 16.5, Hi 1.3, Vi 8.2, Zh 18.2, Ar 8.2.
- Toolformer's tool usage: MT is used for 63.8% to 94.9% of examples depending on language, except Hindi, where it is used for only 7.3% (Section 4.2.4).
- Important mixed outcome: Toolformer does not consistently outperform vanilla GPT-J, and the paper links this to `GPT-J + CC` fine-tuning harming some languages due to distribution shift (Section 4.2.4).
Temporal tasks (Table 7; Section 4.2.5)
- Table 7: `Toolformer` = 16.3 on TEMPLAMA and 27.3 on DATESET vs `GPT-J` = 13.7 / 3.9, and `GPT-3 (175B)` = 15.5 / 0.8.
- Tool use detail:
  - Calendar tool is used only 0.2% on TEMPLAMA; gains come mostly from Wikipedia search and QA tools (Section 4.2.5).
  - On DATESET, calendar tool is used 54.8% and is credited for the large improvement (Section 4.2.5).
Language modeling ability (Table 8; Section 4.3)
- Table 8 (perplexity):
  - `GPT-J`: WikiText 9.9, CCNet 10.6
  - `GPT-J + CC`: WikiText 10.3, CCNet 10.5
  - `Toolformer (disabled)`: WikiText 10.3, CCNet 10.5
- Interpretation grounded in Section 4.3: adding API-call training (evaluated with tools disabled) does not increase perplexity relative to CCNet fine-tuning without calls, suggesting core LM ability is preserved in this evaluation mode.
5.3 Decoding-strategy analysis (Table 9; Section 5)
- Table 9 shows that increasing the top-`k` threshold for emitting `<API>` increases tool-call rates and affects performance:
  - On T-REx: tool usage rises from 40.3% (k=1) to 98.1% (k=10), and overall score increases from 47.8 to 53.5 (Table 9).
  - On WebQS: tool usage rises from 8.5% (k=1) to 100% (k=10), and overall score rises from 19.3 to 26.3 (Table 9).
- The paper observes calibration-like behavior at `k=1`: the model tends to call tools on cases it would do poorly on otherwise, but this calibration is "lost" at higher `k` (Section 5).
5.4 Do the experiments support the claims?
- Support for "tools improve zero-shot performance": strong evidence across LAMA (Table 3), math (Table 4), QA (Table 5), and DATESET (Table 7), with explicit tool usage rates reported in several sections.
- Support for "no sacrifice in core LM ability": supported in the limited sense measured, since perplexity with tools disabled is unchanged vs CCNet fine-tuning (Table 8; Section 4.3). The paper explicitly notes that evaluating perplexity with tools enabled is intractable because it would require marginalizing over possible API calls (Section 4.3, footnote 8).
- Mixed evidence / caveats: MLQA results show tool use helps relative to `Toolformer (disabled)` but does not consistently beat `GPT-J` due to CCNet fine-tuning degrading multilingual performance (Table 6; Section 4.2.4).
6. Limitations and Trade-offs
Grounded in Section 7 (Limitations) plus task-specific analyses:
- No chaining of tools (single-step tool use).
  - Calls for each tool are generated independently, so the training data lacks examples where output of one tool becomes input to another (Section 7).
  - This interacts with the "one API call per example" decoding restriction used in experiments (Section 4.2), which prevents multi-step strategies like "call calendar → incorporate date → ask QA about time-dependent fact" (Section 4.2.5 discussion).
- Limited interactivity with tools (especially search).
  - Wikipedia search returns snippets and may be a poor match; Toolformer cannot refine queries or browse multiple results in an interactive loop (Section 4.2.3; Section 7).
- Sensitivity to wording / prompts.
  - Tool invocation is often sensitive to exact input phrasing, consistent with known prompt sensitivity of LMs (Section 7).
- Sample inefficiency for some tools.
  - The pipeline may process very large corpora but yield few useful calls for certain tools (explicitly mentioned for the calculator: "more than a million documents" leading to "only a few thousand" useful examples) (Section 7). Table 2's calculator counts (e.g., 994 examples at `τ_f=1.0`) are consistent with this sparsity.
- Tool-cost awareness is not modeled.
  - The decision to call a tool ignores tool-dependent computational cost (Section 7). This matters practically because QA models, MT models, or retrieval may be expensive.
- Evaluation/measurement limitation for LM perplexity with tools enabled.
  - The paper does not compute perplexity when API calls are enabled because doing so would require marginalization over potential calls at each position, deemed intractable (Section 4.3, footnote 8).
7. Implications and Future Directions
- How this changes the landscape (based on the provided paper).
  - Toolformer demonstrates a general recipe for turning a pretrained LM into a tool-using LM without extensive human tool-call annotations, using a loss-based filter that is aligned with the LM's own predictive needs (Section 2; Figure 2; Abstract).
  - The results suggest tool use can let smaller models (6.7B) outperform much larger ones (66B/175B) on tool-relevant tasks like factual completion and math word problems (Tables 3-4), though not universally (Table 5; Table 6).
- Follow-up research directions suggested by the paper's findings.
  - Chained tool use: create training data that includes multi-step tool traces (calendar → QA; search → re-search) to overcome the independent sampling limitation (Section 7; Section 4.2.5).
  - Interactive search behavior: enable query reformulation, multi-result browsing, and multi-call loops safely (Section 4.2.3; Section 7).
  - Iterative bootstrapping to improve sample efficiency: repeatedly re-run the generation/filtering pipeline, analogous to other bootstrapping approaches, to increase the density of useful calls (Section 7).
  - Cost-aware tool calling: incorporate tool-specific costs into the decision rule, so the model trades off expected gain vs compute (Section 7).
- Practical applications / downstream use cases (within the paper's tool set).
  - Reliable arithmetic and quantitative statements in generated text via `Calculator` (Figure 1; Table 4).
  - Factual augmentation for generation via `QA` or `WikiSearch` (Figure 1; Tables 3 and 5).
  - Better handling of mixed-language inputs (MLQA) via `MT` (Table 6).
  - Date-dependent answers (DATESET) via `Calendar` (Table 7; Appendix D).
- Repro/Integration Guidance (based on what is described).
  - Prefer this approach when:
    - tool I/O can be represented as text,
    - tools can be executed programmatically at training time for many candidate calls,
    - you want general tool-use behavior transferable across tasks rather than task-specific few-shot tool prompting (Section 2; Section 4.2).
  - Key knobs to tune in integration, as highlighted by experiments:
    - generation thresholds (`τ_s`, `k`, `m`) and filter threshold (`τ_f`) control data volume/quality (Section 2; Appendix A; Table 2),
    - inference decoding parameter `k` strongly affects tool-call rate and performance (Section 5; Table 9),
    - limiting tool calls per example prevents loops but blocks multi-step reasoning (Section 4.2; Section 4.2.5; Section 7).