Toolformer: Language Models Can Teach Themselves to Use Tools

ArXiv: 2302.04761

🎯 Pitch

Toolformer finetunes a pretrained LM to autonomously decide when, which, and how to call simple text-based APIs (search, QA, calculator, translation, calendar) by self-generating candidate calls and keeping only those that measurably reduce next-token loss. This lets a single model combine fluent language generation with reliable external computation and retrieval—improving zero-shot factual, mathematical, multilingual, and time-sensitive performance without large human-labeled tool-call datasets.


1. Executive Summary

Toolformer trains a pretrained language model to autonomously decide when to call external tools (via simple text APIs), which tool to call, what arguments to pass, and how to use the returned result during generation, using a self-supervised filtering signal based on improved next-token prediction (Section 2; Figure 2). Built on GPT-J (6.7B), this approach substantially improves zero-shot performance on factual recall, math word problems, multilingual QA, and time-sensitive questions while preserving standard language modeling perplexity when tools are disabled (Tables 3–8; Section 4.3). The significance is that tool use is learned without large human-labeled tool traces—only “a handful of demonstrations” per API are needed to bootstrap a large training set (Abstract; Figure 3; Appendix A.2).

2. Context and Motivation

  • Problem / gap addressed.
  • Large LMs show strong zero-/few-shot task performance but have persistent weaknesses in:
    • precise arithmetic,
    • factual lookup / up-to-date knowledge,
    • low-resource language understanding,
    • temporal awareness (Introduction).
  • Existing ways to add tools often either:

    • require substantial human supervision for tool calls (e.g., tool-annotated data), or
    • restrict tool use to task-specific prompting/training setups, limiting generality (Introduction; Section 6 “Tool Use”).
  • Why it matters.

  • Tools like search, calculators, translation systems, and calendars can provide ground truth or computed information that a parametric LM may not reliably store or infer (Introduction; Figure 1 examples).
  • The goal is to get “best of both worlds”: fluent language modeling plus reliable external computation/lookup (Abstract).

  • Prior approaches and shortcomings (as positioned here).

  • Tool-use systems with heavy human annotations are costly and may encode what humans think is helpful rather than what reduces model uncertainty (Introduction).
  • Few-shot tool prompting tailored per task requires users to specify tool usage patterns and does not generalize to arbitrary text contexts (Section 4.2 discussion; Section 6).

  • How this paper positions itself.

  • Toolformer is designed to satisfy two desiderata (Introduction):
    1. Self-supervised tool learning: no large-scale human tool traces; only a few demonstrations per tool.
    2. General tool-use policy: the model decides when/how to use tools without being restricted to a specific downstream task format.

3. Technical Approach

3.1 Reader orientation (approachable technical breakdown)

  • The system is a pretrained language model that learns to insert and use textual “API call” snippets inside ordinary text generation.
  • It solves tool use by generating candidate tool calls in-context, executing them, keeping only those that measurably help predict future tokens, and then fine-tuning the LM on the resulting augmented corpus (Section 2; Figure 2).

3.2 Big-picture architecture (diagram in words)

  • Inputs: a large plain-text corpus C (here, a subset of CCNet) and a pretrained LM M (here, GPT-J) (Section 4.1).
  • Component A — Call sampler: prompts M to propose tool calls at positions in each text (Section 2 “Sampling API Calls”; Figure 3).
  • Component B — Tool executor: runs each proposed call against the corresponding tool (calculator, QA model, BM25 Wikipedia search, MT model, calendar) to obtain a text result (Section 2 “Executing API Calls”; Section 3; Table 1).
  • Component C — Self-supervised filter: computes whether including the tool result reduces a weighted next-token loss compared to no call / no result, and keeps only helpful calls (Section 2 “Filtering API Calls”; Figure 2).
  • Component D — Fine-tuner: interleaves surviving calls into the original text to form C* and fine-tunes M with standard LM training (Section 2 “Model Finetuning”; Section 4.1; Appendix B).
  • Inference loop: during decoding, when the model emits the tool-response separator token, the system pauses, executes the tool, inserts the result, and continues (Section 2 “Inference”).

3.3 Roadmap for the deep dive

  • Explain the API-call representation (how calls and results become tokens in the LM).
  • Walk through candidate generation (how positions and calls are sampled).
  • Detail the filtering objective (the loss comparison that decides which calls “help”).
  • Describe dataset construction and fine-tuning (how C* is built and how the LM is trained).
  • Describe inference-time tool execution and decoding controls (how the model is induced to actually call tools).
  • Summarize tools and implementation specifics (what each tool is and how data is selected for it).

3.4 Detailed, sentence-based technical breakdown

This is an algorithmic + empirical systems paper whose core idea is to bootstrap tool-use supervision from a pretrained LM itself, then keep only tool calls that reduce next-token prediction loss, and fine-tune on the resulting data (Section 2; Figure 2).

3.4.1 Tool-call representation (what the LM sees)

  • Each tool call is represented as a tuple c = (a_c, i_c) where a_c is the API name and i_c is its textual input (Section 2).
  • The paper defines two linearizations (Section 2):
  • e(c) = <API> a_c(i_c) </API> (call without result)
  • e(c, r) = <API> a_c(i_c) → r </API> (call with result r)
  • Special tokens mark structure: <API>, </API>, and →. In practice, they use token sequences " [", "]", and "->" so the method works without changing the LM vocabulary (footnote in Section 2).
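
The two linearizations can be illustrated with a minimal sketch. It uses the plain-text stand-ins " [", "]", and "->" mentioned above; the names `ApiCall` and `linearize`, and the exact spacing around the arrow, are assumptions made for illustration rather than details taken from the paper.

```python
from typing import NamedTuple, Optional

# Plain-text stand-ins for <API>, </API>, and the result separator.
API_START = " ["
API_END = "]"
ARROW = "->"


class ApiCall(NamedTuple):
    """A call c = (a_c, i_c). Hypothetical container, not from the paper."""
    name: str  # a_c, the API name
    arg: str   # i_c, the textual input


def linearize(call: ApiCall, result: Optional[str] = None) -> str:
    """Return e(c) when result is None, otherwise e(c, r)."""
    body = f"{call.name}({call.arg})"
    if result is not None:
        # Spacing around the arrow is an assumption of this sketch.
        body += f" {ARROW} {result}"
    return f"{API_START}{body}{API_END}"


call = ApiCall("Calculator", "400 / 1400")
print(linearize(call))          # call without result
print(linearize(call, "0.29"))  # call with result
```

Because both forms are ordinary strings, they can be spliced into training text without adding tokens to the vocabulary.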

3.4.2 System/data pipeline diagram in words (explicit sequence)

What happens first, second, third (Figure 2; Section 2):

  1. Start with a plain-text example x = x1, …, xn from corpus C.
  2. Sample candidate positions i where the model might want to begin a tool call by computing
    p_i = p_M(<API> | P(x), x1:i-1) and keeping positions whose probability exceeds a threshold τ_s, capped to the top k positions (Section 2 “Sampling API Calls”).
  3. Generate candidate tool calls at each selected position by sampling from M conditioned on [P(x), x1, …, x_{i-1}, <API>] until the end token </API> (Section 2). The paper samples up to m calls per position and discards examples where </API> is not produced.
  4. Execute each candidate call against the tool implementation to obtain a textual response r_i (Section 2 “Executing API Calls”).
  5. Filter calls using a self-supervised loss test: keep a call only if providing both call and response lowers a weighted loss on future tokens by at least τ_f compared to baselines (Section 2 “Filtering API Calls”; Figure 2).
  6. Construct an augmented training example x* by inserting the retained e(c_i, r_i) snippets into the text (Section 2 “Model Finetuning”).
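  7. One implementation detail concerns the filtering step (step 5) rather than the final training text: because M has not yet been fine-tuned on any API calls, the loss L_i(e(c_i, r_i)) is computed with e(c_i, r_i) supplied as a prefix to the sequence instead of being inserted at position i, since a mid-sequence insertion would disrupt the token patterns the pretrained LM expects (Section 2, footnote 3).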
  8. Repeat across corpus to create C*, then fine-tune the LM on C* with a standard LM objective (Section 2; Section 4.1; Appendix B).
  9. At inference, decode normally until the model outputs →, then pause decoding, run the tool, insert the response + </API>, and continue decoding (Section 2 “Inference”).
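
Step 2 of the sequence above (choosing where a call may start) can be sketched as a threshold followed by a top-k cap. This is only an illustration: the function name `select_positions` is hypothetical, positions are 0-indexed list indices here, and the per-position probabilities are assumed to have been computed by the LM beforehand.

```python
from typing import List


def select_positions(api_start_probs: List[float], tau_s: float = 0.05, k: int = 5) -> List[int]:
    """Keep positions whose <API> probability exceeds tau_s, capped to the k most probable.

    api_start_probs[i] stands for p_M(<API> | P(x), x_{1:i-1}); computing it
    requires a forward pass of the LM, which this sketch does not include.
    """
    above = [(p, i) for i, p in enumerate(api_start_probs) if p > tau_s]
    # Highest probability first; ties broken by earlier position.
    above.sort(key=lambda pair: (-pair[0], pair[1]))
    return sorted(i for _, i in above[:k])


probs = [0.01, 0.2, 0.06, 0.5, 0.04, 0.3, 0.07, 0.08, 0.09]
print(select_positions(probs))  # → [1, 3, 5, 7, 8]
```

Setting tau_s to 0.0 with a larger k, as the summary reports for low-yield tools, simply widens the set of positions that are tried.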

3.4.3 The filtering objective (how “useful” tool calls are detected)

Toolformer’s key technical mechanism is a perplexity/loss-based filter that converts tool use into a self-supervised decision problem (Section 2; Figure 2).

  • Define a weighted cross-entropy loss over future tokens starting at position i (Section 2):
    • L_i(z) = - Σ_{j=i..n} w_{j-i} · log p_M(x_j | z, x1:j-1)
    • Here z is an optional prefix (e.g., the tool call + result), and weights w_t emphasize tokens close to the call site.
  • Two variants are compared (Section 2):
    • L_i^+ = L_i(e(c_i, r_i)) (tool call with result provided)
    • L_i^- = min( L_i(ε), L_i(e(c_i, ε)) ), where ε is the empty sequence, so L_i(ε) is the loss with no call at all and e(c_i, ε) is the call without its result.
  • A call is kept if it satisfies (Section 2):
    • L_i^- - L_i^+ ≥ τ_f
  • Intuition (as implemented by the inequality): the response must provide predictive information beyond (a) no call, and (b) “doing a call” without actually learning anything (call input alone).
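
The keep/discard rule can be written as a short sketch. It assumes the per-token log-probabilities under the three conditions (call with result, call without result, no call) have already been obtained from the LM; the function names and the toy numbers are illustrative, not taken from the paper.

```python
from typing import Sequence


def weighted_loss(token_logprobs: Sequence[float], weights: Sequence[float]) -> float:
    """L_i(z): negative weighted sum of log p_M(x_j | z, x_{1:j-1}) for j = i..n.

    zip() stops at the shorter sequence, so tokens beyond the weight window
    are treated as having zero weight.
    """
    return -sum(w * lp for w, lp in zip(weights, token_logprobs))


def keep_call(
    lp_with_result: Sequence[float],  # log-probs given the call and its result
    lp_call_only: Sequence[float],    # log-probs given the call without result
    lp_no_call: Sequence[float],      # log-probs given the empty prefix
    weights: Sequence[float],
    tau_f: float = 1.0,
) -> bool:
    """Keep the call only if L_i^- - L_i^+ >= tau_f."""
    l_plus = weighted_loss(lp_with_result, weights)
    l_minus = min(weighted_loss(lp_no_call, weights), weighted_loss(lp_call_only, weights))
    return l_minus - l_plus >= tau_f


w = [0.5, 0.5]
# A helpful result makes the continuation much more likely (toy numbers).
print(keep_call([-0.1, -0.1], [-2.5, -2.5], [-3.0, -3.0], w))  # → True
# An unhelpful result barely changes the loss.
print(keep_call([-2.4, -2.4], [-2.5, -2.5], [-3.0, -3.0], w))  # → False
```

Taking the minimum of the two baselines makes the test conservative: a call survives only if the result itself, not merely the presence of a call, improves prediction.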

Weighting near-term tokens.

  • In the experiments, weights are chosen to emphasize locality (Section 4.1):
    • unnormalized w̃_t = max(0, 1 - 0.2·t), then normalized so the weights sum to 1.
  • This biases the filter toward tool calls whose results help predict tokens soon after the call (Section 4.1).
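
As a quick check of what this weighting implies, the sketch below computes the normalized weights for a window of tokens. The function name `filter_weights` is hypothetical; the formula is the one stated above.

```python
from typing import List


def filter_weights(n_tokens: int) -> List[float]:
    """Normalized weights from the unnormalized form max(0, 1 - 0.2 * t)."""
    raw = [max(0.0, 1.0 - 0.2 * t) for t in range(n_tokens)]
    total = sum(raw)
    return [w / total for w in raw] if total > 0 else raw


weights = filter_weights(8)
print([round(w, 3) for w in weights])
```

Only the first five offsets (t = 0 to 4) receive nonzero weight, with the token at the call site carrying about one third of the total, so a result has to pay off almost immediately to pass the filter.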

Worked micro-example (based on Figure 2).

  • Text: “Pittsburgh is also known as the Steel City.”
  • Candidate calls at position i might include:
    • c_i^1 = QA("What other name is Pittsburgh known by?") returning Steel City
    • c_i^2 = QA("Which country is Pittsburgh in?") returning United States
  • The filter keeps the first call because providing Steel City makes predicting the next tokens (“the Steel City”) easier, i.e., yields L_i^- - L_i^+ above threshold, while the second is discarded because “United States” does not help predict that continuation (Figure 2 illustration).

3.4.4 Sampling candidate calls (how tool supervision is bootstrapped)

  • For each API, the authors write a prompt P(x) that instructs the LM to insert tool calls into a text; Figure 3 shows the QA prompt pattern, and Appendix A.2 lists prompts for all tools.
  • Position sampling uses the probability the LM assigns to starting a tool call (<API>) at each position (Section 2). This is an important design choice: the model’s own uncertainty/affordance for tool insertion guides where tool use is even considered.
  • For each selected position, multiple call candidates are sampled, then tool execution + filtering selects those that empirically help.

3.4.5 Building C* and fine-tuning M (training details and what is / isn’t specified)

Data source and selection.

  • Base corpus C is a subset of CCNet (Section 4.1).
  • To reduce the computational cost of annotating C with candidate calls, heuristics select subsets likely to benefit specific tools (Section 4.1; Appendix A.1):
    • Calculator: documents with numeric patterns (e.g., at least three numbers; presence of “=”; etc.), with an additional 1% subsampling for a broad criterion (Appendix A.1).
    • Calendar: only texts where a date can be extracted from the URL; this leaves ~18% of documents (Appendix A.1).
    • MT: identify non-English chunks bracketed by English text using fastText language ID with confidence > 0.8 and chunk size 10 tokens; filter out numeric/symbol-only chunks; also remove cases where the MT input appears only after the call (look-ahead artifact) (Appendix A.1).
  • After filtering tool calls, they also drop examples where all calls were eliminated (Section 4.1, footnote 4).

Sampling/filtering hyperparameters.

  • Default values for data generation are (Appendix A):
    • τ_s = 0.05, τ_f = 1.0, keep top k = 5 positions, sample up to m = 5 calls per position.
  • For Calculator and MT, to offset small yields, they use (Appendix A):
    • τ_s = 0.0, k = 20, m = 10, and a lower filter threshold τ_f = 0.5.
  • Table 2 quantifies how many examples with retained calls are produced under different τ_f settings (e.g., Wikipedia Search yields 60,974 examples at τ_f = 1.0, while Calculator yields 994 at τ_f = 1.0) (Table 2).
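
One way to keep these settings together in a reimplementation is a small immutable config with per-tool overrides. The values are the ones reported above; the class name `SamplingConfig`, the helper `config_for`, and the tool keys are assumptions of this sketch.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SamplingConfig:
    tau_s: float = 0.05  # minimum probability of starting a call at a position
    tau_f: float = 1.0   # minimum loss reduction required by the filter
    k: int = 5           # maximum positions kept per text
    m: int = 5           # maximum calls sampled per position


DEFAULT = SamplingConfig()
# Looser settings reported for the tools with small yields.
LOW_YIELD = replace(DEFAULT, tau_s=0.0, tau_f=0.5, k=20, m=10)

# Hypothetical tool keys; only Calculator and MT use the looser settings.
CONFIG_BY_TOOL = {"Calculator": LOW_YIELD, "MT": LOW_YIELD}


def config_for(tool: str) -> SamplingConfig:
    """Return the per-tool override if one exists, else the defaults."""
    return CONFIG_BY_TOOL.get(tool, DEFAULT)


print(config_for("QA"))
print(config_for("Calculator"))
```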

Fine-tuning configuration (what is explicitly given).

  • The model is GPT-J (6.7B) (Introduction; Section 4.1).
  • Fine-tuning uses (Section 4.1; Appendix B):
    • batch size: 128 (stated as the effective batch size in Appendix B),
    • learning rate: 1e-5,
    • linear warmup over the first 10% of training,
    • max sequence length: 1024,
    • training steps: up to 2000, selecting the checkpoint with the best dev perplexity (Appendix B),
    • hardware: 8 × NVIDIA A100 40GB, BF16, using DeepSpeed ZeRO-3 (Appendix B).
  • Not specified in the provided excerpt: optimizer type (e.g., AdamW), weight decay, gradient clipping, tokenizer details, and architectural hyperparameters (layers, hidden size, attention heads, context window beyond max sequence length). These are left unspecified here rather than inferred.
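
The warmup setting can be made concrete with a small schedule function. The peak rate, step count, and warmup fraction are the reported values; holding the rate constant after warmup is an assumption of this sketch, since the behavior after warmup is not described in the material summarized here.

```python
def learning_rate(step: int, total_steps: int = 2000, peak_lr: float = 1e-5,
                  warmup_frac: float = 0.1) -> float:
    """Linear warmup over the first warmup_frac of training.

    Assumption: the rate stays at peak_lr after warmup (no decay).
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


for s in (0, 99, 199, 1000):
    print(s, learning_rate(s))
```

With 2000 steps, warmup covers the first 200 steps, after which the rate sits at 1e-5 under the stated assumption.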

Why fine-tuning should preserve general LM ability (claimed mechanism).

  • C* contains the same underlying texts as C, only with inserted tool call snippets at positions that reduce predictive loss, so training should not “specialize away” from language modeling (Section 2 “Model Finetuning”). Section 4.3 empirically checks this using perplexity.

3.4.6 Inference-time tool use and decoding controls

  • The inference protocol triggers tool execution when the model outputs the → token, meaning it expects a tool response next (Section 2 “Inference”).
  • In experiments, they adjust decoding to increase tool usage (Section 4.2; Section 5):
  • Use greedy decoding, except they allow <API> to be generated whenever it is among the k most likely tokens, not only when it is top-1.
  • They set k = 10 in main downstream experiments (Section 4.2).
  • They restrict to at most one API call per input to avoid loops (Section 4.2).
  • Section 5 (“Decoding Strategy”) shows tool-call rate and performance vary strongly with k (Table 9), and calibration (calling tools mostly when needed) is better at k=1 but degrades at higher k.
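
The inference protocol and the relaxed top-k rule can be sketched with a scripted stand-in for the LM. Everything here is illustrative: `next_token` is a hypothetical callable that returns the next token for a given context, the tool is a stub, and the way extra calls are suppressed (later arrows are simply left unexecuted) is an assumption of this sketch.

```python
from typing import Callable, Dict, Iterable, List

API_START, ARROW, API_END = " [", "->", "]"


def pick_token(ranked: List[str], k: int = 10) -> str:
    """Greedy choice, except the call-start token wins if it is in the top k."""
    return API_START if API_START in ranked[:k] else ranked[0]


def generate_with_tools(next_token: Callable[[str], str],
                        tools: Dict[str, Callable[[str], str]],
                        prompt: str, max_tokens: int = 64, max_calls: int = 1) -> str:
    text, calls = prompt, 0
    for _ in range(max_tokens):
        token = next_token(text)
        if token == "<eos>":
            break
        text += token
        # The arrow signals that the model expects a tool response next.
        if token == ARROW and calls < max_calls and "[" in text:
            call = text[text.rindex("[") + 1 : -len(ARROW)].strip()
            name, _, rest = call.partition("(")
            tool = tools.get(name)
            if tool is not None:
                # Insert the response and the closing marker, then resume decoding.
                text += f" {tool(rest.rsplit(')', 1)[0])}{API_END}"
                calls += 1
    return text


def scripted_model(tokens: Iterable[str]) -> Callable[[str], str]:
    """Stand-in for an LM: emits a fixed token script and ignores the context."""
    stream = iter(tokens)
    return lambda _context: next(stream, "<eos>")


tools = {"Calculator": lambda expr: "5"}  # stub tool for the demo
model = scripted_model([" [", "Calculator(2 + 3)", " ", "->", " 5", "."])
print(generate_with_tools(model, tools, "The total is"))
# → The total is [Calculator(2 + 3) -> 5] 5.
```

In a real system `next_token` would wrap the fine-tuned LM and apply `pick_token` to its ranked vocabulary, which is where the choice of k changes how often calls are made.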

4. Key Insights and Innovations

  • (1) Self-supervised filtering criterion for tool usefulness.
  • Innovation: tool calls are kept only if including the tool’s result reduces weighted next-token loss by at least τ_f compared to no call / no result (Section 2).
  • Why it’s significant: this turns tool-use learning into a scalable self-supervised data-generation pipeline without human tool traces, and it directly optimizes for “helpfulness to the LM” rather than human intuition (Introduction; Figure 2).

  • (2) Bootstrapping tool-use datasets from a pretrained LM with only a few demonstrations.

  • The LM is prompted (Figure 3; Appendix A.2) to propose tool calls across a large corpus, creating C* from C (Section 2).
  • This differs from task-specific prompting: the tool-use behavior is learned at the pretraining-corpus level and then transferred to downstream zero-shot tasks (Section 4.2).

  • (3) Unified textual interface for multiple heterogeneous tools.

  • The approach only requires that tool inputs/outputs are representable as text sequences (Section 2; Section 3).
  • Tools included cover distinct failure modes (Section 3): QA (Atlas), WikiSearch (BM25 snippets), Calculator (basic arithmetic, rounded to 2 decimals), MT (NLLB 600M + fastText LID), and Calendar (current date).
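
To make the "text in, text out" requirement concrete, here is a minimal sketch of a calculator tool that rounds to two decimals. It is not the paper's implementation: the restriction to +, -, *, / and the output formatting via `round` and `str` are assumptions, and the expression is evaluated through Python's `ast` module instead of `eval` so that arbitrary code is rejected.

```python
import ast
import operator

# Supported binary operators (an assumption about what "basic arithmetic" covers).
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculator(expression: str) -> str:
    """Evaluate a basic arithmetic expression and return the result as text."""

    def evaluate(node: ast.AST) -> float:
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -evaluate(node.operand)
        raise ValueError("unsupported expression")

    value = evaluate(ast.parse(expression, mode="eval").body)
    return str(round(value, 2))


print(calculator("400 / 1400"))  # → 0.29
print(calculator("2 + 3 * 4"))   # → 14
```

Any other tool fits the same shape as long as it maps a string argument to a string result.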

  • (4) Evidence that tool-use capability “emerges” with scale.

  • Scaling experiment across GPT-2 family sizes (124M → 1.6B) plus GPT-J suggests tool leveraging only becomes effective around ~775M parameters (Section 4.4; Figure 4).
  • This is positioned as an emergent capability: small models do not benefit much from tools, while larger ones learn to use them meaningfully (Figure 4 narrative).

5. Experimental Analysis

5.1 Evaluation methodology (datasets, metrics, baselines, setup)

Baselines (Section 4.1).

  • GPT-J: pretrained, no fine-tuning.
  • GPT-J + CC: fine-tuned on C (CCNet subset), no tool calls.
  • Toolformer: GPT-J fine-tuned on C* (CCNet subset augmented with tool calls).
  • Toolformer (disabled): same weights as Toolformer, but tool calls are disabled at decoding by setting the probability of <API> to 0 (Section 4.1, footnote 5).
  • Also compared (for many tasks): OPT (66B) and GPT-3 (175B) (davinci, not instruction-tuned) (Section 4.1).

Decoding and tool-call constraints (Section 4.2).

  • Prompted zero-shot only: instructions are given, but no in-context examples for task solving.
  • Greedy decoding with modified <API> triggering: <API> can be emitted if it is among the top-k tokens; they use k=10.
  • At most one API call per input.

Task suites and evaluation details.

  • LAMA: SQuAD, Google-RE, T-REx subsets; filtered to left-to-right completion; metric: correct answer appears within the first 5 generated words; Wikipedia Search disabled to avoid unfair advantage (Section 4.2.1).
  • Math: ASDiv, SVAMP, MAWPS; metric: first generated number (with a special-case rule for equations like “=”) (Section 4.2.2, footnote 7).
  • Open-domain QA: WebQS, Natural Questions, TriviaQA; metric: answer appears within the first 20 generated words; QA tool disabled to avoid triviality because the QA tool was trained on NQ (Section 4.2.3).
  • MLQA: multilingual questions with English context; metric: answer in the first 10 words; expects the model to answer in English; tool usage rates reported (Section 4.2.4).
  • Temporal: TEMPLAMA and DATESET; LAMA-like metric; DATESET requires knowing the current date (Section 4.2.5; Appendix D).
  • Language modeling preservation: perplexity on WikiText and a held-out CCNet subset (Section 4.3; Table 8).
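
The "answer within the first N words" style of metric used by several of these tasks can be sketched as follows. The whitespace tokenization, lowercasing, and substring matching are assumptions of this sketch; the exact normalization used in the paper is not described in the material summarized here.

```python
from typing import Iterable


def answer_in_first_words(generation: str, answers: Iterable[str], n_words: int) -> bool:
    """True if any gold answer appears within the first n_words of the generation."""
    head = " ".join(generation.lower().split()[:n_words])
    return any(answer.lower() in head for answer in answers)


output = "The capital is Paris , of course"
print(answer_in_first_words(output, ["Paris"], n_words=5))  # → True
print(answer_in_first_words(output, ["Paris"], n_words=3))  # → False
```

The window size would be 5 for the LAMA-style tasks, 20 for open-domain QA, and 10 for MLQA, following the task descriptions above.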

5.2 Main quantitative results (with specific numbers)

LAMA factual completion (Table 3; Section 4.2.1)

Table 3:
Toolformer = 33.8 / 11.5 / 53.5 (SQuAD / Google-RE / T-REx)
vs GPT-J = 17.8 / 4.9 / 31.9, and GPT-J + CC = 19.2 / 5.6 / 33.2.
Also higher than OPT (66B) (21.6 / 2.9 / 30.1) and GPT-3 (175B) (26.8 / 7.0 / 39.8).

  • Tool use behavior: Toolformer uses the QA tool for 98.1% of examples (Section 4.2.1).

Math word problems (Table 4; Section 4.2.2)

Table 4:
Toolformer = 40.4 / 29.4 / 44.0 (ASDiv / SVAMP / MAWPS)
vs GPT-J = 7.5 / 5.2 / 9.9, and GPT-3 (175B) = 14.0 / 10.0 / 19.8.

  • Tool use behavior: uses the calculator tool for 97.9% of examples (Section 4.2.2).
  • Notable nuance: Toolformer (disabled) improves over GPT-J even without calling tools (e.g., ASDiv 14.8 vs 7.5), and the paper attributes this to being fine-tuned on many calculator call/result examples, which may strengthen internal math ability (Section 4.2.2).

Open-domain QA via Wikipedia search (Table 5; Section 4.2.3)

Table 5:
Toolformer = 26.3 / 17.7 / 48.8 (WebQS / NQ / TriviaQA)
vs GPT-J = 18.5 / 12.8 / 43.9, and GPT-3 (175B) = 29.0 / 22.6 / 65.9.

  • Tool usage: Wikipedia Search used for 99.3% of examples (Section 4.2.3).
  • Toolformer improves substantially over same-size baselines but still trails GPT-3 on these datasets (Table 5).

Multilingual QA (MLQA) with MT tool (Table 6; Section 4.2.4)

Table 6 (Toolformer):
Es 20.6, De 13.5, Hi 1.4, Vi 10.6, Zh 16.8, Ar 3.7.
Compared to GPT-J: Es 15.2, De 16.5, Hi 1.3, Vi 8.2, Zh 18.2, Ar 8.2.

  • Toolformer’s tool usage: MT used 63.8%–94.9% depending on language, except Hindi where it is used only 7.3% (Section 4.2.4).
  • Important mixed outcome: Toolformer does not consistently outperform vanilla GPT-J, and the paper links this to GPT-J + CC fine-tuning harming some languages due to distribution shift (Section 4.2.4).

Temporal tasks (Table 7; Section 4.2.5)

Table 7:
Toolformer = 16.3 on TEMPLAMA and 27.3 on DATESET
vs GPT-J = 13.7 / 3.9, and GPT-3 (175B) = 15.5 / 0.8.

  • Tool use detail:
  • Calendar tool is used only 0.2% on TEMPLAMA; gains come mostly from Wikipedia search and QA tools (Section 4.2.5).
  • On DATESET, calendar tool is used 54.8% and is credited for the large improvement (Section 4.2.5).

Language modeling ability (Table 8; Section 4.3)

Table 8 (perplexity):
GPT-J: WikiText 9.9, CCNet 10.6
GPT-J + CC: WikiText 10.3, CCNet 10.5
Toolformer (disabled): WikiText 10.3, CCNet 10.5

  • Interpretation grounded in Section 4.3: adding API-call training (evaluated with tools disabled) does not increase perplexity relative to CCNet fine-tuning without calls, suggesting core LM ability is preserved in this evaluation mode.

5.3 Decoding-strategy analysis (Table 9; Section 5)

  • Table 9 shows that increasing top-k threshold for emitting <API> increases tool-call rates and affects performance:
  • On T-REx: tool usage rises from 40.3% (k=1) to 98.1% (k=10), and overall score increases from 47.8 to 53.5 (Table 9).
  • On WebQS: tool usage rises from 8.5% (k=1) to 100% (k=10), and overall score rises from 19.3 to 26.3 (Table 9).
  • The paper observes calibration-like behavior at k=1: the model tends to call tools on cases it would do poorly on otherwise, but this calibration is “lost” at higher k (Section 5).

5.4 Do the experiments support the claims?

  • Support for “tools improve zero-shot performance”: strong evidence across LAMA (Table 3), math (Table 4), QA (Table 5), and DATESET (Table 7), with explicit tool usage rates reported in several sections.
  • Support for “no sacrifice in core LM ability”: supported in the limited sense measured—perplexity with tools disabled is unchanged vs CCNet fine-tuning (Table 8; Section 4.3). The paper explicitly notes that evaluating perplexity with tools enabled is intractable because it would require marginalizing over possible API calls (Section 4.3, footnote 8).
  • Mixed evidence / caveats: MLQA results show tool use helps relative to Toolformer (disabled) but does not consistently beat GPT-J due to CCNet fine-tuning degrading multilingual performance (Table 6; Section 4.2.4).

6. Limitations and Trade-offs

Grounded in Section 7 (Limitations) plus task-specific analyses:

  • No chaining of tools (single-step tool use).
  • Calls for each tool are generated independently, so the training data lacks examples where output of one tool becomes input to another (Section 7).
  • This interacts with the “one API call per example” decoding restriction used in experiments (Section 4.2), which prevents multi-step strategies like “call calendar → incorporate date → ask QA about time-dependent fact” (Section 4.2.5 discussion).

  • Limited interactivity with tools (especially search).

  • Wikipedia search returns snippets and may be a poor match; Toolformer cannot refine queries or browse multiple results in an interactive loop (Section 4.2.3; Section 7).

  • Sensitivity to wording / prompts.

  • Tool invocation is often sensitive to exact input phrasing, consistent with known prompt sensitivity of LMs (Section 7).

  • Sample inefficiency for some tools.

  • The pipeline may process very large corpora but yield few useful calls for certain tools (explicitly mentioned for the calculator: “more than a million documents” leading to “only a few thousand” useful examples) (Section 7). Table 2’s calculator counts (e.g., 994 examples at τ_f=1.0) are consistent with this sparsity.

  • Tool-cost awareness is not modeled.

  • The decision to call a tool ignores tool-dependent computational cost (Section 7). This matters practically because QA models, MT models, or retrieval may be expensive.

  • Evaluation/measurement limitation for LM perplexity with tools enabled.

  • The paper does not compute perplexity when API calls are enabled because doing so would require marginalization over potential calls at each position, deemed intractable (Section 4.3, footnote 8).

7. Implications and Future Directions

  • How this changes the landscape (based on the provided paper).
  • Toolformer demonstrates a general recipe for turning a pretrained LM into a tool-using LM without extensive human tool-call annotations, using a loss-based filter that is aligned with the LM’s own predictive needs (Section 2; Figure 2; Abstract).
  • The results suggest tool use can let smaller models (6.7B) outperform much larger ones (66B/175B) on tool-relevant tasks like factual completion and math word problems (Tables 3–4), though not universally (Table 5; Table 6).

  • Follow-up research directions suggested by the paper’s findings.

  • Chained tool use: create training data that includes multi-step tool traces (calendar → QA; search → re-search) to overcome the independent sampling limitation (Section 7; Section 4.2.5).
  • Interactive search behavior: enable query reformulation, multi-result browsing, and multi-call loops safely (Section 4.2.3; Section 7).
  • Iterative bootstrapping to improve sample efficiency: repeatedly re-run the generation/filtering pipeline, analogous to other bootstrapping approaches, to increase the density of useful calls (Section 7).
  • Cost-aware tool calling: incorporate tool-specific costs into the decision rule, so the model trades off expected gain vs compute (Section 7).

  • Practical applications / downstream use cases (within the paper’s tool set).

  • Reliable arithmetic and quantitative statements in generated text via Calculator (Figure 1; Table 4).
  • Factual augmentation for generation via QA or WikiSearch (Figure 1; Tables 3 and 5).
  • Better handling of mixed-language inputs (MLQA) via MT (Table 6).
  • Date-dependent answers (DATESET) via Calendar (Table 7; Appendix D).

  • Repro/Integration Guidance (based on what is described).

  • Prefer this approach when you can:
    • represent tool I/O as text,
    • execute tools programmatically at training-time for many candidate calls,
    • and you want general tool-use behavior transferable across tasks rather than task-specific few-shot tool prompting (Section 2; Section 4.2).
  • Key knobs to tune in integration, as highlighted by experiments:
    • generation thresholds (τ_s, k, m) and filter threshold (τ_f) control data volume/quality (Section 2; Appendix A; Table 2),
    • inference decoding parameter k strongly affects tool-call rate and performance (Section 5; Table 9),
    • limiting tool calls per example prevents loops but blocks multi-step reasoning (Section 4.2; Section 4.2.5; Section 7).