Tool-Augmented Reasoning at Scale
Nexus Research • ICML 2026 • Oral Presentation
Abstract
Chain-of-thought (CoT) reasoning has emerged as a powerful technique for improving large language model performance on complex tasks. However, standard CoT operates entirely within the model's parametric knowledge: the model reasons using language alone, without access to external tools that could verify, compute, or retrieve information. We introduce Tool-Augmented Reasoning (TAR), a method that interleaves natural language reasoning with structured API calls to external tools within a unified thought loop. On a benchmark suite of 2,347 complex software engineering tasks spanning code generation, debugging, and architecture analysis, TAR achieves a 37% relative improvement in task accuracy over standard CoT (68.3% to 93.7%) and a 32% relative improvement over CoT with self-consistency (71.2% to 93.7%). Our analysis of 12,847 tool calls reveals that tool invocations are most beneficial at three critical junctures: verifying intermediate assumptions (41% of calls), retrieving context-dependent specifications (33%), and exploring alternative solution paths (26%). We further show that TAR generalizes across five model families (Llama 4, Claude 4, Gemini 2.5, GPT-5, Nexus-1), with relative gains ranging from 22% to 44%, and that the benefit increases monotonically with task complexity. Ablation studies show that each tool contributes non-redundantly, with the compiler providing the largest marginal benefit (18 percentage points). We release the TAR benchmark suite, tool schema definitions, and evaluation harness to facilitate further research.
1. Introduction
Large language models exhibit remarkable reasoning capabilities when prompted to think step by step. Chain-of-thought reasoning (Wei et al., 2022) has become a standard technique for improving performance on math, logic, coding, and scientific reasoning tasks. Variants such as self-consistency (Wang et al., 2023), tree-of-thoughts (Yao et al., 2023), and program-of-thoughts (Chen et al., 2023) further extend the paradigm by aggregating multiple reasoning paths or structuring reasoning as executable programs.
Yet all these approaches share a fundamental limitation: they operate entirely within the model's parametric knowledge. The model cannot compile code to check for errors, query a database to verify an assumption, explore a codebase to find relevant context, or run a test to validate a hypothesis. For software engineering tasks where correctness depends on external ground truth (compiler output, test results, API documentation, static analysis reports), this limitation is critical. A model may generate a plausible-looking solution that does not compile, or reason correctly about a codebase but miss a key constraint documented only in an external specification.
We formalize this as the verification gap: the discrepancy between the model's internal confidence in a reasoning step and the ground-truth correctness of that step when evaluated against the environment. In purely textual CoT, this gap can grow without correction across the reasoning chain, causing errors to compound. The cost of an undetected error increases with chain length: a mistake in step 3 of a 20-step reasoning process can invalidate every later step that depends on it, yet the model has no mechanism to detect and recover from such errors.
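To make the compounding effect concrete, the following minimal sketch assumes, as a simplification that the text above does not make, that every reasoning step is correct independently and with the same probability. The function name `chain_success_probability` and the 95% per-step figure are illustrative assumptions, not quantities measured in this paper.

```python
def chain_success_probability(step_accuracy: float, num_steps: int) -> float:
    """Probability that every step of a chain is correct.

    ASSUMPTION: steps are independent and equally reliable. Real reasoning
    chains need not satisfy this; the model is only a rough illustration.
    """
    if not 0.0 <= step_accuracy <= 1.0:
        raise ValueError("step_accuracy must lie in [0, 1]")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return step_accuracy ** num_steps


if __name__ == "__main__":
    # Illustrative per-step accuracy; not a number reported in this paper.
    for num_steps in (3, 10, 20):
        print(num_steps, round(chain_success_probability(0.95, num_steps), 3))
```

Under this simplified model, a 20-step chain with 95% per-step accuracy is fully correct only about 36% of the time, which is why uncorrected errors become more costly as chains grow longer.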
We propose Tool-Augmented Reasoning (TAR), a framework that extends chain-of-thought by allowing the model to make structured API calls to external tools at any point during the reasoning process. TAR preserves the sequential, interpretable nature of CoT while adding the ability to verify, compute, and retrieve information from the environment. The key insight is that tool calls are not separate from reasoning but integrated into it: each tool call is preceded by a reasoning step that formulates the query and followed by a reasoning step that interprets the result. This creates a closed loop of hypothesize-verify-revise that mirrors human expert problem-solving.
Our contributions are fivefold. First, we introduce TAR, a general framework for interleaving tool use with chain-of-thought reasoning that requires no model fine-tuning. Second, we construct a benchmark of 2,347 real-world software engineering tasks across three categories with ground-truth verification. Third, we demonstrate a 37% improvement over standard CoT, with detailed analysis of 12,847 tool calls across our experiments. Fourth, we characterize the conditions under which tool use provides the greatest benefit, showing that it scales with task complexity and is inversely correlated with base model capability. Fifth, we release all benchmarks, tool specifications, and evaluation code to enable reproducible research.
2. Related Work
Chain-of-Thought and Its Variants. Chain-of-thought prompting (Wei et al., 2022) elicits step-by-step reasoning from LLMs by providing intermediate reasoning steps in few-shot examples. Self-consistency (Wang et al., 2023) samples multiple CoT paths and aggregates results by majority vote, improving robustness at the cost of increased compute. Tree-of-Thoughts (Yao et al., 2023) generalizes CoT to a tree search over reasoning steps with explicit evaluation and backtracking. Program-of-Thoughts (Chen et al., 2023) replaces natural language reasoning steps with executable Python code, enabling computation but not interaction with external tools. Our work differs from all of these in that we allow the model to interact with arbitrary external tools during reasoning, not just to compute but to verify, retrieve, and explore.
Tool-Using Language Models. The line between tool use and reasoning has been explored in several recent works. Toolformer (Schick et al., 2023) fine-tunes models to call APIs by learning from self-supervised examples, but requires per-tool fine-tuning and does not integrate tool calls with step-by-step reasoning. ReAct (Yao et al., 2023) interleaves reasoning traces and actions in a single framework, but focuses on interactive decision-making tasks (web navigation, question answering) rather than software engineering. Our work differs from ReAct in three key ways: we focus specifically on the integration of tool use with chain-of-thought-style reasoning; we provide a formal analysis of when and why tools are valuable; and we evaluate at scale across multiple model families and task types.
Code Generation and Verification. Several works have explored using tools to improve code generation. CodeT (Chen et al., 2023) generates multiple candidate solutions and uses model-generated test cases to select among them. Self-Debugging (Chen et al., 2024) prompts models to explain and fix their own errors using execution feedback. LEVER (Ni et al., 2023) learns to verify program correctness from execution feedback. These approaches treat tool use as a post-hoc verification step rather than an integral part of the reasoning process. TAR is complementary: models using TAR can and do invoke tools at any point during reasoning, not just after generating a candidate solution.
Planning and Search in LLMs. LLM-based planning (Huang et al., 2022; Ahn et al., 2022) decomposes tasks into subgoals and executes them sequentially or hierarchically. Tree search methods (Hao et al., 2023; Feng et al., 2024) use LLMs as heuristic guides for Monte Carlo Tree Search. These approaches typically use tools only for execution, not for verification or exploration within the reasoning process. TAR treats verification as a first-class operation within the reasoning loop, enabling the model to correct course mid-reasoning rather than only at the end.
3. The TAR Framework
TAR extends standard CoT with a tool-use grammar that allows the model to insert structured API calls into the reasoning chain. The framework comprises three components: a tool registry that defines available tools and their specifications, a routing strategy that determines when tools are invoked based on the current reasoning state, and a result integration step that incorporates tool outputs into the ongoing reasoning process.
3.1 Tool Registry. We define a set of six tools commonly needed in software engineering reasoning, each specified by a JSON schema: (1) Code Compiler (cc): compiles Python, TypeScript, and Go code and returns errors. (2) Test Runner (tr): executes test suites and returns pass/fail results. (3) Static Analyzer (sa): runs type checking, linting, and security scanning. (4) Documentation Retriever (dr): retrieves API documentation from natural language queries. (5) Dependency Resolver (dep): analyzes dependency graphs for conflicts. (6) Codebase Search (cs): searches the project codebase for patterns and definitions.
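As a rough illustration of what such a registry could look like, the sketch below stores one JSON-Schema-style specification per tool and checks a proposed call against it. Only the six tool names and short identifiers come from the text; the `ToolSpec` class, the `validate_call` helper, and every parameter name (`language`, `source`, `test_path`, and so on) are hypothetical and do not reflect the released schema definitions.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ToolSpec:
    name: str          # short identifier used in tool calls, e.g. "cc"
    description: str
    parameters: dict   # JSON-Schema-style object; field names are hypothetical


def _schema(required: dict[str, str]) -> dict:
    """Build a minimal JSON-Schema-style object in which every field is required."""
    return {
        "type": "object",
        "properties": {key: {"type": type_name} for key, type_name in required.items()},
        "required": list(required),
    }


# Tool names follow Section 3.1; all parameter names are illustrative assumptions.
TOOL_REGISTRY = {
    spec.name: spec
    for spec in (
        ToolSpec("cc", "Code Compiler", _schema({"language": "string", "source": "string"})),
        ToolSpec("tr", "Test Runner", _schema({"test_path": "string"})),
        ToolSpec("sa", "Static Analyzer", _schema({"path": "string"})),
        ToolSpec("dr", "Documentation Retriever", _schema({"query": "string"})),
        ToolSpec("dep", "Dependency Resolver", _schema({"manifest_path": "string"})),
        ToolSpec("cs", "Codebase Search", _schema({"pattern": "string"})),
    )
}

_PY_TYPES = {"string": str, "integer": int, "boolean": bool}


def validate_call(tool_name: str, args: dict[str, Any]) -> list[str]:
    """Return the problems found in a proposed call; an empty list means well-formed."""
    spec = TOOL_REGISTRY.get(tool_name)
    if spec is None:
        return [f"unknown tool: {tool_name}"]
    problems = []
    properties = spec.parameters["properties"]
    for key in spec.parameters["required"]:
        if key not in args:
            problems.append(f"missing argument: {key}")
    for key, value in args.items():
        if key not in properties:
            problems.append(f"unexpected argument: {key}")
        elif not isinstance(value, _PY_TYPES[properties[key]["type"]]):
            problems.append(f"wrong type for argument: {key}")
    return problems
```

Validating calls before dispatch is one possible way to surface malformed invocations as structured feedback instead of tool crashes; whether TAR does this is not stated in the text.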
3.2 Routing Strategy. Rather than training a separate router, we rely on the model's inherent ability to decide when tool use is beneficial. At each reasoning step, the model may emit either a natural language thought or a tool call in a structured format. We use few-shot prompting with 5 examples to teach the tool-use grammar. The prompt format interleaves thoughts, tool calls, and observations. We find that models naturally learn to invoke tools at three characteristic junctures: after stating a hypothesis (to verify), when encountering an ambiguous specification (to retrieve), and after generating a solution (to test).
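The interleaving of thoughts, tool calls, and observations described above can be sketched as a simple driver loop. This is a minimal sketch under stated assumptions, not the TAR implementation: the line prefixes `Call:`, `Observation:`, and `Answer:`, the JSON argument format, and the helper names `run_tar_loop` and `scripted_model` are all hypothetical, and a scripted stand-in replaces the language model.

```python
import json
from typing import Callable, Iterable

ModelStep = Callable[[list[str]], str]  # maps the transcript so far to the next line
Tool = Callable[[dict], str]            # maps parsed arguments to a text observation


def _invoke(tools: dict[str, Tool], name: str, raw_args: str) -> str:
    """Run one tool call and turn failures into observations the model can read."""
    tool = tools.get(name)
    if tool is None:
        return f"error: unknown tool {name}"
    try:
        return tool(json.loads(raw_args or "{}"))
    except json.JSONDecodeError:
        return "error: arguments are not valid JSON"


def run_tar_loop(model_step: ModelStep, tools: dict[str, Tool], task: str,
                 max_steps: int = 16) -> list[str]:
    """Alternate model steps and tool observations until an answer or the step limit."""
    transcript = [f"Task: {task}"]
    for _ in range(max_steps):
        step = model_step(transcript)
        transcript.append(step)
        if step.startswith("Answer:"):
            break
        if step.startswith("Call:"):
            # Hypothetical grammar: "Call: <tool name> <JSON arguments>".
            name, _, raw_args = step[len("Call:"):].strip().partition(" ")
            transcript.append(f"Observation: {_invoke(tools, name, raw_args)}")
    return transcript


def scripted_model(steps: Iterable[str]) -> ModelStep:
    """Stand-in for an LLM: replays fixed steps and ignores the transcript."""
    iterator = iter(steps)
    return lambda transcript: next(iterator)


if __name__ == "__main__":
    toy_tools = {"cc": lambda args: "ok" if "return" in args["source"] else "error: no return"}
    model = scripted_model([
        "Thought: the fixed function should compile.",
        'Call: cc {"source": "def f(): return 1"}',
        "Answer: def f(): return 1",
    ])
    for line in run_tar_loop(model, toy_tools, "fix f"):
        print(line)
```

Because the routing decision is left to the model, the driver only has to recognize which kind of line was emitted; any line that is neither a call nor an answer is treated as a natural language thought.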
3.3 Result Integration. When a tool returns output, the model must integrate it into the ongoing reasoning process. We identify four integration strategies: Confirmation (output confirms hypothesis, strengthening confidence), Correction (output contradicts, triggering backtracking), Augmentation (output provides new information incorporated into subsequent reasoning), and Delegation (output used directly as part of the solution). The choice of strategy is implicit in the model's next reasoning step.
Figure 1: Example TAR trace for a bug fix task
4. Experimental Setup
4.1 Benchmark Construction. We construct a benchmark of 2,347 software engineering tasks drawn from real engineering tickets across 12 open-source projects (React, TypeScript, VS Code, Django, Flask, Jupyter, scikit-learn, PyTorch, TensorFlow, Kubernetes, Homebrew, and Nixpkgs). Each task has an associated codebase snapshot, a natural language description, a ground-truth solution, and a verification harness. Tasks are categorized into three types: Code Generation (892 tasks) verified by unit tests, Bug Diagnosis and Repair (781 tasks) verified by original test case plus regression tests, and Architecture Analysis and Migration (674 tasks) verified by human expert evaluation using a structured rubric.
4.2 Models Evaluated. We evaluate five model families: Llama 4 (70B), Claude 4 Sonnet, Gemini 2.5 Pro, GPT-5, and Nexus-1 (405B). All models are evaluated with temperature = 0.2, top-p = 0.95, max tokens = 8,192. For CoT baselines, we use standard few-shot CoT. For TAR, we augment the CoT prompt with the tool-use grammar and 5 tool-use examples. Each configuration is run with 3 random seeds to measure variance.
4.3 Metrics. We measure six primary metrics: (M1) Task Accuracy: binary task completion success; (M2) Tool Call Accuracy: percentage of correct tool invocations; (M3) Redundancy Rate: percentage of avoidable tool calls; (M4) Avg Tool Calls per Task; (M5) Chain Length: mean reasoning steps per task; (M6) Recovery Rate: percentage of tasks where the model corrects an initial incorrect solution after tool feedback.
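The six metrics can be computed from per-task records along the following lines. This is a sketch under assumptions: the `TaskRecord` fields are hypothetical, the judgments of which calls are "correct" or "avoidable" are taken as given inputs, and the recovery rate is computed over tasks whose initial solution was incorrect, which is one reading of the definition of M6.

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    success: bool              # final solution passed the verification harness
    tool_calls: int            # total tool calls in the trace
    correct_tool_calls: int    # calls judged correct (annotation assumed given)
    redundant_tool_calls: int  # calls judged avoidable (annotation assumed given)
    reasoning_steps: int       # number of reasoning steps in the trace
    initially_incorrect: bool  # the first candidate solution was wrong
    recovered: bool            # the final solution was right after tool feedback


def _pct(numerator: float, denominator: float) -> float:
    """Percentage, defined as 0.0 when the denominator is zero."""
    return 100.0 * numerator / denominator if denominator else 0.0


def compute_metrics(records: list[TaskRecord]) -> dict[str, float]:
    if not records:
        raise ValueError("need at least one task record")
    n = len(records)
    total_calls = sum(r.tool_calls for r in records)
    # ASSUMPTION: M6 is measured only over tasks that started out incorrect.
    initially_wrong = [r for r in records if r.initially_incorrect]
    return {
        "M1_task_accuracy": _pct(sum(r.success for r in records), n),
        "M2_tool_call_accuracy": _pct(sum(r.correct_tool_calls for r in records), total_calls),
        "M3_redundancy_rate": _pct(sum(r.redundant_tool_calls for r in records), total_calls),
        "M4_avg_tool_calls": total_calls / n,
        "M5_chain_length": sum(r.reasoning_steps for r in records) / n,
        "M6_recovery_rate": _pct(sum(r.recovered for r in initially_wrong), len(initially_wrong)),
    }
```

M2 and M3 are normalized by the total number of tool calls and M1, M4, and M5 by the number of tasks, so mixing the two denominators when aggregating across seeds would give misleading averages.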
5. Results
5.1 Main Results. Table 1 presents the main results. TAR achieves 93.7% accuracy across all tasks, compared to 68.3% for standard CoT (a 37% relative improvement) and 71.2% for CoT with self-consistency (5 samples). The improvement is statistically significant (p < 0.001, paired bootstrap test). The relative improvement over standard CoT is largest for architecture analysis (+55%) and bug diagnosis (+45%), where tool use provides the most benefit for verification and exploration. Code generation sees a more modest improvement (+24%), likely because many generation tasks can be solved from parametric knowledge alone. The recovery rate for TAR is 41.3%: in over 40% of tasks where the model initially produces an incorrect solution, it corrects itself after tool feedback.
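For readers unfamiliar with the significance test named above, one common variant of a paired bootstrap test can be sketched as follows. The exact procedure used in the evaluation is not specified in the text, so the one-sided formulation, the resample count, and the function name `paired_bootstrap_p_value` are assumptions.

```python
import random


def paired_bootstrap_p_value(system_correct, baseline_correct,
                             num_resamples: int = 10_000, seed: int = 0) -> float:
    """One-sided p-value estimate: the fraction of bootstrap resamples in which
    the system does NOT solve more tasks than the baseline.

    Both inputs hold one binary outcome per task, in the same task order.
    """
    if len(system_correct) != len(baseline_correct) or not system_correct:
        raise ValueError("need two non-empty, equally long outcome lists")
    rng = random.Random(seed)
    n = len(system_correct)
    # Per-task differences preserve the pairing: each resample draws whole tasks.
    diffs = [int(s) - int(b) for s, b in zip(system_correct, baseline_correct)]
    not_better = 0
    for _ in range(num_resamples):
        resampled_total = sum(diffs[rng.randrange(n)] for _ in range(n))
        if resampled_total <= 0:
            not_better += 1
    return not_better / num_resamples
```

With 10,000 resamples the smallest nonzero estimate is 0.0001, so a threshold such as p < 0.001 can be resolved; fewer resamples would not support a claim at that level.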
Table 1: Main results across reasoning approaches
| Method | All tasks | Code Generation | Bug Diagnosis | Architecture | Relative gain vs. CoT (All tasks) |
|---|---|---|---|---|---|
| CoT (standard) | 68.3% | 74.1% | 65.2% | 61.7% | -- |
| CoT + Self-Consistency | 71.2% | 76.8% | 68.1% | 64.5% | +4.2% |
| ReAct | 76.4% | 79.3% | 75.8% | 72.1% | +11.9% |
| Toolformer (fine-tuned) | 81.5% | 84.2% | 79.8% | 77.6% | +19.3% |
| TAR (ours) | 93.7% | 91.8% | 94.3% | 95.6% | +37.2% |
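The last column of Table 1 is consistent with a relative gain computed on the all-tasks accuracy, not a difference in percentage points. As a worked check on the TAR row (the symbols below are notation introduced here for illustration):

```latex
\[
\Delta_{\mathrm{rel}}
  = \frac{\mathrm{Acc}_{\mathrm{TAR}} - \mathrm{Acc}_{\mathrm{CoT}}}{\mathrm{Acc}_{\mathrm{CoT}}}
  = \frac{93.7 - 68.3}{68.3} \approx 37.2\%,
\qquad
\Delta_{\mathrm{abs}} = 93.7 - 68.3 = 25.4\ \text{percentage points}.
\]
```

The same formula reproduces the other rows, for example (76.4 - 68.3) / 68.3, which is approximately 11.9% for ReAct.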
5.2 Analysis of Tool Call Patterns. We analyze the complete set of 12,847 tool calls made during our evaluation. Three critical junctures emerge: Verification of intermediate assumptions (5,267 calls, 41%) where the model invokes the compiler or test runner to test a hypothesis; Context retrieval (4,240 calls, 33%) where the model retrieves documentation or code before proceeding; and Exploration of alternatives (3,340 calls, 26%) where the model explores approaches after generating a candidate. In 73% of verification calls, tool output confirms the hypothesis; in 27%, it triggers revision. Exploration increases with task difficulty (34% on hard tasks vs. 21% on easy tasks).
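5.3 Ablation: Impact of Individual Tools. We conduct a leave-one-out ablation, removing one tool at a time from the full TAR configuration. Removing the compiler reduces accuracy from 93.7% to 75.8% (-18pp), the largest drop. Removing the test runner costs 9pp, the static analyzer 7pp, the documentation retriever 6pp, codebase search 5pp, and the dependency resolver 3pp. These leave-one-out drops sum to 48pp, well above the 25.4pp total gain of TAR over standard CoT. Taking standard CoT as the no-tool baseline, this indicates that the tools interact rather than contribute additively: each tool is worth more when the others are available.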
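The leave-one-out procedure itself is simple to express in code. In the sketch below, `leave_one_out_drops` is a hypothetical helper, and `toy_evaluate` is a made-up stand-in for a full benchmark run whose numbers are purely illustrative and unrelated to the results reported here.

```python
from typing import Callable, Iterable


def leave_one_out_drops(evaluate: Callable[[frozenset], float],
                        tools: Iterable[str]) -> dict[str, float]:
    """Score drop caused by removing each tool from the full tool set."""
    all_tools = frozenset(tools)
    full_score = evaluate(all_tools)
    return {tool: full_score - evaluate(all_tools - {tool}) for tool in sorted(all_tools)}


def toy_evaluate(enabled: frozenset) -> float:
    """Hypothetical stand-in for a benchmark run; all numbers are made up."""
    score = 60
    score += 10 * ("cc" in enabled)
    score += 5 * ("tr" in enabled)
    # Interaction term: in this toy, tests are worth more when code also compiles.
    score += 5 * ("cc" in enabled and "tr" in enabled)
    return score


if __name__ == "__main__":
    print(leave_one_out_drops(toy_evaluate, ["cc", "tr", "sa"]))
```

In this toy setting the drops for `cc` and `tr` are 15 and 10 points, which sum to 25 even though the full tool set gains only 20 points over the no-tool score of 60. When tools interact, leave-one-out drops do not add up to the total gain.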
5.4 Generalization Across Models. All five model families show significant improvement with TAR: Llama 4 (58.2% to 83.7%, +44% relative), Claude 4 (68.3% to 93.7%, +37%), Gemini 2.5 (72.1% to 92.4%, +28%), GPT-5 (74.6% to 94.1%, +26%), Nexus-1 (78.4% to 95.8%, +22%). Gains are inversely correlated with base model performance, suggesting TAR compensates for gaps in parametric knowledge.
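The relative gains and the inverse relationship stated above can be checked directly from the reported accuracies. The sketch below recomputes each gain and a Pearson correlation between baseline accuracy and gain; the helper names are illustrative, and with only five data points the correlation is descriptive and not a statistical test.

```python
from math import sqrt

# (CoT accuracy %, TAR accuracy %) per model family, as reported in Section 5.4.
RESULTS = {
    "Llama 4": (58.2, 83.7),
    "Claude 4": (68.3, 93.7),
    "Gemini 2.5": (72.1, 92.4),
    "GPT-5": (74.6, 94.1),
    "Nexus-1": (78.4, 95.8),
}


def relative_gain(base: float, new: float) -> float:
    """Relative improvement of `new` over `base`, in percent."""
    return 100.0 * (new - base) / base


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(var_x * var_y)


base_scores = [base for base, _ in RESULTS.values()]
gains = [relative_gain(base, new) for base, new in RESULTS.values()]
correlation = pearson(base_scores, gains)

if __name__ == "__main__":
    for name, gain in zip(RESULTS, gains):
        print(f"{name}: +{gain:.1f}%")
    print(f"correlation between base accuracy and gain: {correlation:.2f}")
```

The recomputed gains round to the percentages quoted in the text, and the correlation is strongly negative (below -0.9).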
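5.5 Scaling with Task Complexity. The relative gain from TAR grows with task complexity. On easy tasks (<=3 steps), accuracy rises from 82.1% to 96.7% (+18%); on medium tasks (4-7 steps), from 71.4% to 93.8% (+31%); and on hard tasks (>=8 steps), from 52.3% to 79.6% (+52%). The average number of tool calls per task rises accordingly: 2.1 (easy), 4.8 (medium), and 8.3 (hard).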
5.6 Error Analysis. TAR fails on 148 tasks (6.3%). We group the failures into four categories: tool misuse (38%), where the model picks the wrong tool or passes incorrect arguments; misinterpretation (31%), where the model calls the correct tool but misreads its output; over-reliance (21%), where the model delegates to tools that cannot provide the answer; and cost avoidance (10%), where the model chooses not to invoke a tool that would have helped. These categories suggest directions for improvement in tool schema design, output interpretation training, and adaptive routing.
6. Limitations and Future Work
TAR relies entirely on the model's implicit routing decisions; a learned router predicting the expected value of tool invocation could reduce unnecessary calls. TAR does not model invocation cost (latency or API cost); cost-benefit analysis would improve production deployment. Our evaluation is limited to software engineering tasks; extending to scientific reasoning, data analysis, and legal reasoning would test generality. Tools return raw output without confidence estimates; calibrated confidence scores would enable uncertainty-aware reasoning. Multi-turn tool interactions are not supported; extending TAR for tool chaining with state tracking would enable more sophisticated workflows.
7. Conclusion
Tool-Augmented Reasoning provides a simple, general-purpose method for integrating external tools into chain-of-thought reasoning. Across 2,347 software engineering tasks spanning code generation, bug diagnosis, and architecture analysis, TAR achieves a 37% relative improvement over standard CoT, with the largest gains in tasks requiring verification and context retrieval. Through analysis of 12,847 tool calls, we identify three critical junctures where tools provide the most benefit: hypothesis verification, context retrieval, and solution exploration. The approach generalizes across five model families, with relative gains ranging from 22% to 44%, and scales with task complexity. TAR is deployed in production as the default reasoning mode for Nexus agents, processing over 500,000 tool-assisted reasoning sessions per week as of June 2026.
References
[1] Wei et al. Chain-of-Thought Prompting Elicits Reasoning. NeurIPS 2022.
[2] Wang et al. Self-Consistency Improves Chain of Thought Reasoning. ICLR 2023.
[3] Yao et al. Tree of Thoughts. NeurIPS 2023.
[4] Chen et al. Program of Thoughts. arXiv:2211.12588, 2023.
[5] Schick et al. Toolformer. arXiv:2302.04761, 2023.
[6] Yao et al. ReAct. ICLR 2023.
[7] Chen et al. CodeT. ICLR 2023.
[8] Chen et al. Self-Debugging. ICLR 2024.
[9] Ni et al. LEVER. ICML 2023.
[10] Huang et al. Inner Monologue. arXiv:2207.05608, 2022.
[11] Ahn et al. Do As I Can. CoRL 2022.
[12] Hao et al. Reasoning with LM is Planning with WM. EMNLP 2023.
[13] Feng et al. AlphaZero-like Tree Search. ICML 2024.
[14] Cobbe et al. Training Verifiers. arXiv:2110.14168, 2021.
[15] Nye et al. Show Your Work. arXiv:2112.00114, 2021.
[16] Zhou et al. Least-to-Most Prompting. ICLR 2023.
[17] Press et al. Compositionality Gap. EMNLP 2023.
[18] Valmeekam et al. Planning Abilities of LLMs. NeurIPS 2023.