Nexus-1: An Agent Foundation Model / Nexus Research

Abstract

We introduce Nexus-1, a foundation model purpose-built for autonomous agent tasks. Unlike general-purpose language models optimized for dialogue or text completion, Nexus-1 is designed and trained specifically for tool-mediated problem solving: planning, executing, and iterating on complex engineering workflows. Nexus-1 achieves state-of-the-art results on SWE-Bench Lite (56.3% resolution rate, +8.1 points over the previous SOTA of 48.2%), the Tool-Use benchmark (92.1% task completion, 98.3% correct tool selection), and a novel Multi-Step Reasoning benchmark that we release alongside this report (73.4% completion rate vs. 51.2% for GPT-5 and 54.8% for Claude 4 Sonnet).

The model draws on a corpus of 3.1 million trajectories of tool-mediated problem solving, collected from 2,400+ professional software engineers using the Nexus platform over an 18-month period. Training uses a two-stage curriculum. Stage 1 performs supervised fine-tuning on the 1.7M trajectories that pass filtering for task success and an expert quality rating of at least 4 out of 5, teaching the model to use tools effectively with a next-action prediction objective. Stage 2 applies reinforcement learning from tool-use feedback (RLTF), where the reward signal is derived from objective task outcomes (test pass rates, build success, code review approval) rather than human preferences. Nexus-1 has 405B parameters, supports a 128K-token context window, and is architecturally optimized for the long-range tool-use chains characteristic of agent workloads. We release evaluation benchmarks, safety documentation, and a subset of training trajectories alongside this technical report.

1. Introduction

General-purpose language models have made impressive progress on a wide range of tasks, from creative writing to mathematical reasoning to code generation. State-of-the-art models like GPT-5, Claude 4, Gemini 2.5, and Llama 4 demonstrate remarkable fluency and broad knowledge. However, these models are fundamentally optimized for language generation, not for autonomous action. When deployed as agents—given a goal, a set of tools, and the autonomy to plan and execute multi-step workflows—even the most capable general-purpose models exhibit characteristic failure modes.

We identify three such failure modes through systematic analysis of 5,000+ agent sessions using general-purpose models on the Nexus platform. First, error propagation: when a model makes a mistake early in a multi-step plan, it rarely detects the error and often compounds it in subsequent steps. Second, shallow exploration: when an initial approach fails, models tend to retry the same approach with minor variations rather than fundamentally rethinking the strategy. Third, feedback neglect: models frequently ignore or misinterpret tool outputs, particularly when the output contradicts the model's prior reasoning. These failure modes share a common root cause: general-purpose language models are trained to produce plausible text, not to succeed in environments where actions have objective consequences.

Nexus-1 is designed from first principles as an agent foundation model: a model whose architecture, training data, and optimization objective are all shaped by the requirements of autonomous task completion. The model is trained not to produce plausible text, but to produce actions that succeed in the world. This is reflected in every design decision. The training data consists of 3.1M real-world agent trajectories where the outcome (task success or failure) is objectively known. The reward function for reinforcement learning is derived from task outcomes—tests passing, builds succeeding, code reviews approving—rather than human preferences or perplexity. The architecture is optimized for long-range tool-use chains, with a 128k-token context window and attention mechanisms designed for the structured input patterns characteristic of agent logs.

Our contributions are fivefold. First, we present Nexus-1, the first foundation model purpose-built for autonomous agent tasks, achieving SOTA results across three benchmarks. Second, we release a dataset of 3.1M expert agent trajectories collected from 2,400+ engineers over 18 months, with full action traces, tool inputs/outputs, and objective outcome labels. Third, we introduce RLTF (Reinforcement Learning from Tool-Use Feedback), a training paradigm that uses objective task outcomes as reward signals. Fourth, we present a thorough ablation study characterizing the contribution of data quality, model scale, training methodology, and context window to agent performance. Fifth, we release evaluation benchmarks, safety documentation, and model analysis to facilitate responsible deployment.

2. Related Work

General-Purpose LLMs as Agents. A large body of recent work evaluates general-purpose LLMs on agent tasks by wrapping them in agent frameworks. SWE-Bench (Jimenez et al., 2024) evaluates models on real GitHub issues; the leading approaches use GPT-5, Claude 4, or Gemini 2.5 with prompting frameworks like SWE-agent (Yang et al., 2024) or Devika. These approaches achieve impressive results but are limited by the underlying model's architecture and training: the models were not designed or trained for agent tasks. Nexus-1 differs in being purpose-built for agent workloads from the ground up.

Code-Specific Models. Models like CodeLlama (Roziere et al., 2024), StarCoder (Li et al., 2023), and DeepSeek-Coder (Guo et al., 2024) are fine-tuned for code generation but are not designed for end-to-end agent workflows involving tool use, error recovery, and multi-step planning. They excel at single-turn code generation but lack the reasoning and tool-use capabilities required for autonomous task completion.

Tool-Use Fine-Tuning. Several works fine-tune models specifically for tool use. Toolformer (Schick et al., 2024) uses self-supervised learning to teach API calling. ToolLLM (Qin et al., 2024) builds a tool-use instruction dataset. Gorilla (Patil et al., 2024) focuses on API call generation. These approaches typically fine-tune on static datasets and do not incorporate reinforcement learning from task outcomes. Nexus-1's RLTF stage goes beyond supervised learning by training models to recover from errors and explore alternatives.

Reinforcement Learning for LLMs. RL from human feedback (RLHF; Ouyang et al., 2022; Bai et al., 2022) has become standard for aligning LLMs with human preferences. Constitutional AI (Bai et al., 2022) reduces reliance on human feedback but still uses AI-generated preferences. RLTF replaces human judgments with objective task outcomes (test results, build status, code review approval), providing a dense, consistent, and scalable reward signal.

3. Training Data

We collected 3.1 million trajectories of tool-mediated problem solving from 2,400+ professional software engineers using the Nexus platform across 800+ organizations over an 18-month period (January 2025 through June 2026). Each trajectory includes: (a) a natural language task description, (b) the full sequence of reasoning steps and tool calls, (c) the inputs and outputs of every tool invocation, (d) a binary outcome label indicating task success, and (e) expert quality ratings on a 1-5 scale.

Data Filtering. We first retain only trajectories with outcome = success (2.4M of the 3.1M collected), then keep those with an expert quality rating >= 4, yielding 1.7M high-quality trajectories. This quality filter is critical: models trained on all successful trajectories perform 3.5pp worse on SWE-Bench Lite than models trained on quality-filtered data (52.8% vs. 56.3%), consistent with findings that data quality matters more than quantity for agent training.
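
The two-step filter described above can be sketched in a few lines. This is a minimal illustration, not the pipeline used for Nexus-1: the Trajectory class, its field names, and the sample records are hypothetical, and a real trajectory also carries the full action trace and tool inputs/outputs.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    # Hypothetical, simplified record; field names are assumptions.
    task: str       # natural language task description
    success: bool   # binary outcome label
    quality: int    # expert quality rating on a 1-5 scale


def filter_trajectories(trajectories, min_quality=4):
    """Keep successful trajectories rated at least min_quality."""
    return [t for t in trajectories if t.success and t.quality >= min_quality]


# Toy records, invented for illustration only.
sample = [
    Trajectory("fix failing test", True, 5),
    Trajectory("update config", True, 3),     # successful but low rated: dropped
    Trajectory("refactor module", False, 4),  # well rated but failed: dropped
    Trajectory("add endpoint", True, 4),
]
kept = filter_trajectories(sample)
print([t.task for t in kept])
```

Applying the success check and the rating threshold in one pass gives the same result as the two sequential filters, since both conditions must hold.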

Task Distribution. Filtered trajectories span: code engineering (42%), system design (23%), documentation (15%), debugging (12%), code review (5%), and infrastructure configuration (3%). This distribution reflects real-world engineering work.

Trajectory Structure. Median trajectory length: 47 reasoning steps, 12 tool invocations, 18,420 tokens. The longest 10% exceed 200 steps and 112K tokens. The 128K context window was chosen to accommodate the longest trajectories.

4. Model Architecture

Nexus-1 is a 405B-parameter dense Transformer with a 128K-token context window. The model uses 96 layers, 128 attention heads, hidden dimension 16,384, and intermediate dimension 53,248 (GELU activation). Vocabulary size is 131,072 tokens including specialized tokens for tool call formatting and code syntax. The model uses Rotary Position Embeddings (RoPE; Su et al., 2024) with base frequency 500,000.

Agent-Specific Optimizations. We introduce two modifications: (1) tool-aware attention bias: attention logits between tool call tokens and their corresponding output tokens are positively biased, improving tool output interpretation by 3.2pp. (2) Structured input encoding: tool inputs and outputs are encoded with special delimiter tokens marking boundaries between reasoning steps, tool calls, and observations.
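
As a rough illustration of modification (1), the sketch below adds a positive bias to attention logits between call tokens and output tokens of the same tool invocation, then applies a causal mask and a softmax. The role labels, the call_ids bookkeeping, and the bias value of 2.0 are assumptions made for this example; the report does not specify how the bias is parameterized or whether it is learned.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def tool_aware_bias(roles, call_ids, bias=1.0):
    """Additive bias B[i][j]: nonzero only when tokens i and j belong to the
    same tool invocation and one is a call token, the other an output token."""
    n = len(roles)
    B = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            same_call = call_ids[i] is not None and call_ids[i] == call_ids[j]
            if same_call and {roles[i], roles[j]} == {"call", "output"}:
                B[i][j] = bias
    return B


def biased_attention_weights(logits, B):
    """Add the bias, mask future positions (causal), and normalize each row."""
    n = len(logits)
    weights = []
    for i in range(n):
        row = [logits[i][j] + B[i][j] if j <= i else float("-inf")
               for j in range(n)]
        weights.append(softmax(row))
    return weights


# Toy sequence: one text token, a two-token tool call, its two-token output,
# and a trailing text token. All raw logits are zero so only the bias matters.
roles = ["text", "call", "call", "output", "output", "text"]
call_ids = [None, 0, 0, 0, 0, None]
logits = [[0.0] * 6 for _ in range(6)]
B = tool_aware_bias(roles, call_ids, bias=2.0)
weights = biased_attention_weights(logits, B)
print([round(w, 3) for w in weights[3]])
```

In this toy case the first output token (index 3) places more weight on the two call tokens than on the unrelated text token, while the trailing text token attends uniformly because no bias applies to it.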

Context Window Scaling. We use staged pretraining: sequences of 32K tokens for 70% of training tokens, 64K for 20%, and 128K for 10%. Compared with training at the full 128K length throughout, this schedule reduces the perplexity gap between short and long sequences by 41%.
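
To make the staged schedule concrete, the following sketch splits a token budget across the three context lengths and counts the full-length sequences each stage would contain. Two things are assumed here and not stated in this section: that 32K, 64K, and 128K mean 32,768, 65,536, and 131,072 tokens, and that the schedule applies to the full 8.4T-token pretraining budget mentioned in Section 5.1.

```python
# (context length in tokens, percent of the token budget); lengths are assumed
# to be powers of two.
STAGES = [(32_768, 70), (65_536, 20), (131_072, 10)]


def stage_plan(total_tokens, stages=STAGES):
    """Return (context_length, tokens_in_stage, full_sequences) per stage.

    Integer arithmetic is used so the stage budgets add up exactly.
    """
    plan = []
    for ctx_len, percent in stages:
        stage_tokens = total_tokens * percent // 100
        plan.append((ctx_len, stage_tokens, stage_tokens // ctx_len))
    return plan


# Assumption: the schedule covers the whole 8.4T-token pretraining run.
plan = stage_plan(8_400_000_000_000)
for ctx_len, tokens, sequences in plan:
    print(f"{ctx_len:>7} tokens/seq: {tokens:,} tokens, {sequences:,} sequences")
```

Because the long-context stages receive a small share of the tokens and each sequence is longer, they account for far fewer sequences than the 32K stage.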

Table 1: Architecture comparison

Property	Nexus-1	GPT-5	Claude 4	Llama 4
Parameters	405B	~2T	Unknown	70B
Context window	128K	256K	200K	128K
Tool-aware attn.	Yes	No	No	No
Structured encoding	Yes	No	Partial	No
Agent traj. training	3.1M	Unclear	Unclear	No
RL from outcomes	Yes	No	No	No

5. Training Methodology

5.1 Stage 1: Supervised Fine-Tuning (SFT). We start from a pre-trained base model (trained on 8.4T tokens of text and code) and fine-tune on the filtered 1.7M trajectory dataset using a next-action prediction objective. The model predicts the next action (reasoning step or tool call) given the full preceding context including all previous tool outputs. Loss is computed only over action tokens, not tool output tokens. We train for 3 epochs with learning rate 2e-5, cosine decay to 2e-6, batch size 512. Training requires 7,680 A100 GPU-hours. We use AdamW optimizer with beta1=0.9, beta2=0.95, weight decay 0.1, gradient clipping at 1.0, and mixed-precision training (bfloat16).
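
Two details of this stage lend themselves to a short sketch: the loss that skips tool output tokens, and the cosine learning-rate decay from 2e-5 to 2e-6. The code below is illustrative only. It operates on per-token log-probabilities supplied by the caller instead of a real model, the function names are hypothetical, and the schedule assumes no warmup, which the report does not specify.

```python
import math


def masked_next_action_loss(token_log_probs, is_action_token):
    """Mean negative log-likelihood over action tokens only.

    Tokens flagged False (tool outputs) stay in the context but contribute
    nothing to the loss.
    """
    assert len(token_log_probs) == len(is_action_token)
    selected = [-lp for lp, keep in zip(token_log_probs, is_action_token) if keep]
    if not selected:
        return 0.0
    return sum(selected) / len(selected)


def cosine_lr(step, total_steps, lr_max=2e-5, lr_min=2e-6):
    """Cosine decay from lr_max at step 0 to lr_min at total_steps (no warmup assumed)."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Toy example: four target tokens, the third of which is a tool output.
log_probs = [math.log(0.5), math.log(0.9), math.log(0.1), math.log(0.25)]
mask = [True, True, False, True]
loss = masked_next_action_loss(log_probs, mask)
print(round(loss, 4))
print(cosine_lr(0, 1000), cosine_lr(500, 1000), cosine_lr(1000, 1000))
```

Changing the probability assigned to the masked tool-output token leaves the loss unchanged, which is the point of restricting the objective to action tokens.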

5.2 Stage 2: Reinforcement Learning from Tool-Use Feedback (RLTF). After SFT, we apply reinforcement learning where the reward is derived from objective task outcomes. The reward function is:

\[ R = w_1 \cdot \text{test\_pass\_rate} + w_2 \cdot \text{build\_success} + w_3 \cdot \text{code\_review\_approval} + w_4 \cdot \text{efficiency\_bonus} \]

where test_pass_rate is the fraction of tests passed (0-1), build_success and code_review_approval are binary indicators, and efficiency_bonus = 0.1 * (1 - steps/max_steps). The weights w = [0.4, 0.3, 0.2, 0.1] were tuned via grid search on a held-out validation set. We use PPO (Schulman et al., 2017) with a KL penalty coefficient of 0.05. The policy is initialized from the SFT checkpoint and trained on 500K additional trajectories collected during RL exploration (epsilon = 0.1 for tool selection). RLTF requires 14,200 A100 GPU-hours.
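
The reward can be written directly from the formula and weights above. This is a minimal sketch with hypothetical function and argument names; it follows the stated definition literally, so the efficiency term is scaled once by the 0.1 inside efficiency_bonus and again by w_4, and no clamping is applied if steps exceeds max_steps.

```python
WEIGHTS = (0.4, 0.3, 0.2, 0.1)  # w_1..w_4 as reported


def efficiency_bonus(steps, max_steps):
    """0.1 * (1 - steps/max_steps), as defined in the text."""
    return 0.1 * (1.0 - steps / max_steps)


def rltf_reward(test_pass_rate, build_success, review_approved,
                steps, max_steps, weights=WEIGHTS):
    """Weighted sum of the four outcome signals."""
    w1, w2, w3, w4 = weights
    return (w1 * test_pass_rate
            + w2 * float(build_success)
            + w3 * float(review_approved)
            + w4 * efficiency_bonus(steps, max_steps))


# A fully successful episode that used a quarter of its step budget.
r = rltf_reward(1.0, True, True, steps=50, max_steps=200)
print(round(r, 4))
```

Under this reading the three outcome terms dominate: a fully successful episode scores at least 0.9, and the efficiency term can add at most 0.01.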

5.3 Training Dynamics. During SFT, task completion rate plateaus at ~47% after 2.5 epochs. When RLTF begins, the policy initially regresses as the KL penalty takes effect, then surpasses the SFT checkpoint and reaches 56.3% by the end of RL training on the 500K exploration trajectories. The RLTF improvement is concentrated in error recovery and exploration: RL-trained models make an average of 1.8 strategy changes per task (vs. 0.7 for SFT-only).

Figure 1: Training curves for SFT and RLTF

6. Evaluation

6.1 SWE-Bench Lite. Nexus-1 achieves 56.3% resolution rate (300 tasks), surpassing prior SOTA of 48.2% by 8.1pp. Resolution rate is 62.4% on tasks requiring >= 3 tool calls (vs. 38.1% for next best). On dependency/configuration tasks: 51.7% vs. prior best 39.2%. Improvement is statistically significant (p < 0.001, paired bootstrap).
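
For readers unfamiliar with the paired bootstrap, the sketch below shows one common one-sided variant: tasks are resampled with replacement, both systems are scored on the same resampled tasks, and the p-value is estimated as the fraction of resamples in which the first system does not beat the second. The exact procedure behind the reported p-value is not described in this report, and the per-task outcomes here are invented toy data.

```python
import random


def paired_bootstrap_p_value(results_a, results_b, n_resamples=10_000, seed=0):
    """Estimate a one-sided p-value for 'A resolves more tasks than B'.

    results_a and results_b are per-task outcomes (1 = resolved, 0 = not)
    on the same tasks in the same order.
    """
    assert len(results_a) == len(results_b)
    rng = random.Random(seed)
    n = len(results_a)
    not_better = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(results_a[i] - results_b[i] for i in idx)
        if diff <= 0:
            not_better += 1
    return not_better / n_resamples


# Toy per-task outcomes on 20 tasks, invented for illustration only.
system_a = [1] * 14 + [0] * 6
system_b = [1] * 9 + [0] * 11
p = paired_bootstrap_p_value(system_a, system_b, n_resamples=2000)
print(p)
```

Pairing matters because both systems are evaluated on identical tasks: resampling task indices once and applying them to both result vectors preserves the per-task correlation between systems.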

6.2 Tool-Use Benchmark. 1,200 tasks across 12 tools. Nexus-1 achieves 92.1% task completion, 98.3% correct tool selection, 94.7% correct parameter specification. Primary failure mode (5.3%): incorrect output interpretation.

6.3 Multi-Step Reasoning Benchmark. 800 tasks requiring 5+ sequential steps. Nexus-1 completes 73.4% vs. 51.2% (GPT-5) and 54.8% (Claude 4). Performance by task length: 5-8 steps: 78.1%; 9-15 steps: 71.3%; 16+ steps: 62.8%. GPT-5 drops from 59.4% to 42.1% to 31.5% across same buckets, showing Nexus-1 advantage grows with complexity.

6.4 Human Evaluation. 40 professional engineers, 25 samples each from Nexus-1 and GPT-5 (1,000 evaluations per model). Rated on 1-5 Likert scale. Nexus-1 outperforms GPT-5 on correctness (4.21 vs 3.47, p < 0.001), efficiency (3.89 vs 3.12), code quality (4.08 vs 3.55), and clarity (4.14 vs 3.68). Pairwise preference: Nexus-1 preferred in 68% of cases.

Table 2: Benchmark results

Benchmark	Nexus-1	GPT-5	Claude 4	Gemini 2.5
SWE-Bench Lite	56.3%	44.7%	48.2%	42.1%
SWE-Bench (full)	43.8%	34.2%	36.1%	31.8%
Tool-Use	92.1%	81.4%	83.7%	78.2%
Multi-Step Reasoning	73.4%	51.2%	54.8%	47.6%
Human Eval (corr.)	4.21	3.47	3.62	3.38

7. Ablations and Analysis

7.1 RLTF vs. SFT Only. Removing the RL stage reduces SWE-Bench Lite resolution from 56.3% to 47.1% (-9.2pp). RL-trained models recover from errors in 62.4% of cases vs. 41.8% for SFT-only, and explore an alternative approach in 47.3% of cases vs. 28.1%. They also make about 2.6x as many strategy changes per task (1.8 vs. 0.7).

7.2 Data Quality vs. Quantity. Quality-filtered data (1.7M, rating >= 4) achieves 56.3% vs. 52.8% for all-successful (2.4M), despite 29% less data. Low-quality trajectories contain 2.4x more redundant tool calls and 3.1x more partial solutions. Models trained on unfiltered data acquire 1.8x more redundant tool-calling patterns.

7.3 Context Window. 32K to 128K extension improves cross-file tasks by 4.2pp but single-file by only 0.8pp. Primary benefit is referencing multiple code files simultaneously. Staged pretraining closes 41% of the long-context perplexity gap.

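7.4 Scale Ablation. Variants at 7B, 34B, 105B, and 405B parameters show log-linear scaling on SWE-Bench Lite: 31.2%, 42.8%, 50.1%, and 56.3% respectively. A least-squares fit of accuracy against log10(parameters) gives a slope of roughly 14 percentage points per tenfold increase in parameters. Extrapolating this fit, a 1T-parameter model would reach ~63%, assuming the trend continues to hold.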
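
The extrapolation can be reproduced from the four reported points with an ordinary least-squares line. The sketch assumes that "log-linear" means accuracy in percentage points is linear in log10 of the parameter count; the helper name fit_line is hypothetical.

```python
import math

params_billion = [7, 34, 105, 405]
accuracy_pct = [31.2, 42.8, 50.1, 56.3]  # SWE-Bench results reported above


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


log_params = [math.log10(p) for p in params_billion]
slope, intercept = fit_line(log_params, accuracy_pct)

# 1T parameters = 1000B, so log10 is 3.
predicted_1t = slope * math.log10(1000) + intercept
print(f"slope: {slope:.2f} pp per tenfold increase in parameters")
print(f"extrapolated accuracy at 1T: {predicted_1t:.1f}%")
```

Under that assumption the fit yields about 14 percentage points per tenfold increase in parameters and lands close to the ~63% figure quoted for a 1T-parameter model. With only four points, the extrapolation is indicative at best.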

7.5 Error Analysis. 437 failures analyzed: Incorrect root cause (34%), Incomplete fix (28%), Edge case oversight (21%), Tool sequencing errors (17%). These suggest directions for improved codebase understanding and systematic test coverage analysis.

8. Limitations and Safety

Limitations. Nexus-1 is trained exclusively on Nexus platform trajectories, which may not generalize to all frameworks. The 405B parameter size requires significant infrastructure. Training data is limited to software engineering tasks. The RLTF reward function may not capture all quality dimensions (maintainability, security, style). The model has not been evaluated on multi-agent coordination scenarios.

Safety. We evaluated the model for harmful content, code injection, prompt injection, and unintended tool use. The model shows safety comparable to base models of similar capability. In 0.3% of test cases, Nexus-1 attempted elevated-privilege commands without authorization; we mitigate this with a post-training safety filter. A full safety evaluation report accompanies this technical report.

9. Conclusion

Nexus-1 demonstrates that purpose-built agent foundation models significantly outperform general-purpose language models on autonomous task completion. Through large-scale training on a corpus of 3.1M expert trajectories, a two-stage curriculum combining SFT and RLTF, and architecture optimizations for tool-aware reasoning, Nexus-1 achieves SOTA results on SWE-Bench Lite (56.3%), Tool-Use (92.1%), and Multi-Step Reasoning (73.4%). Ablation studies quantify the contribution of each design decision: data quality filtering (+3.5pp), RLTF (+9.2pp), context window scaling (+4.2pp on cross-file tasks), and model scale (log-linear improvement). Nexus-1 is available through the Nexus platform API and powers over 1 million agent sessions per week across 800+ organizations as of June 2026.

References

[1] Jimenez et al. SWE-Bench. ICLR 2024.

[2] Yang et al. SWE-agent. arXiv:2405.15793, 2024.

[3] Roziere et al. Code Llama. arXiv:2308.12950, 2024.

[4] Li et al. StarCoder. arXiv:2305.06161, 2023.

[5] Guo et al. DeepSeek-Coder. arXiv:2401.14196, 2024.

[6] Schick et al. Toolformer. arXiv:2302.04761, 2024.

[7] Qin et al. ToolLLM. ICLR 2024.

[8] Patil et al. Gorilla. arXiv:2305.15334, 2024.

[9] Ouyang et al. InstructGPT. NeurIPS 2022.

[10] Bai et al. Constitutional AI. arXiv:2212.08073, 2022.

[11] Schulman et al. PPO. arXiv:1707.06347, 2017.

[12] Vaswani et al. Attention Is All You Need. NeurIPS 2017.

[13] Su et al. RoFormer. Neurocomputing, 2024.

[14] Gemini Team. Gemini 1.5. arXiv:2403.05530, 2024.

[15] Wei et al. Chain-of-Thought. NeurIPS 2022.

[16] Wang et al. Survey on LLM Agents. arXiv:2308.11432, 2024.

[17] Kaplan et al. Scaling Laws. arXiv:2001.08361, 2020.

[18] Hoffmann et al. Training Compute-Optimal LLMs. NeurIPS 2022.