Persistent Memory for Agentic Workflows
Nexus Research • arXiv:2605.12345 • Submitted to NeurIPS 2026
Abstract
A central limitation of current AI agents is their lack of persistent memory: each session begins with a blank state, forcing agents to rediscover context and repeat prior reasoning. This paper introduces Hierarchical Persistent Memory (HPM), a three-tier architecture that achieves 94.2% retention accuracy across heterogeneous agent sessions spanning code engineering, documentation, and system design tasks. The architecture comprises: (1) an episodic buffer that stores complete session traces with learned semantic indexing, supporting approximate nearest-neighbor retrieval with sub-50ms latency; (2) a compressed working memory that distills episodic data into task-relevant representations using a learned attention mask trained via a reconstruction objective, achieving 12x compression while retaining 94% of task-relevant information; and (3) a long-term consolidation store that integrates compressed memories via sleep-phase replay with Elastic Weight Consolidation to prevent catastrophic forgetting.
We evaluate HPM on 10,472 agent trajectories collected from 147 engineering teams using the Nexus platform over 6 months. Agents equipped with HPM show a 2.8x reduction in repeated reasoning steps, a 41% improvement in multi-session task completion time, and a 3.4x increase in correct cross-session knowledge transfer. We provide qualitative evidence of emergent learning behaviors, including agents that internalize team coding conventions and architectural preferences across sessions. Ablation studies confirm that all three tiers contribute meaningfully to overall performance, with the episodic buffer providing the largest single contribution (41% of total improvement). We release our evaluation framework and a subset of anonymized trajectories to facilitate further research.
1. Introduction
Modern AI agents demonstrate impressive capabilities in single-session settings. Agents powered by large language models can write code, debug programs, design architectures, and execute complex multi-step plans within a single continuous interaction (Wang et al., 2024; Jimenez et al., 2024). However, the dominant paradigm treats each session as an isolated episode with no memory of prior interactions. When an agent is deployed on a long-lived software project, it must re-index context, re-learn team conventions, and re-derive architectural decisions with every restart. This is not merely inefficient—it is fundamentally limiting. Human engineers do not forget everything they knew about a codebase between work sessions; they carry forward an evolving mental model that compounds with experience.
The problem is particularly acute for agentic workflows that span multiple days or weeks. Consider a code migration task: an agent spends session 1 analyzing the existing architecture, session 2 implementing the migration, and session 3 validating the results. Without persistent memory, the agent begins each session from scratch, re-reading files and re-deriving conclusions that it established in prior sessions. This leads to redundant computation, inconsistent decisions, and a failure to accumulate knowledge across the project lifecycle.
In this paper, we address the problem of persistent memory for agentic workflows. We draw inspiration from cognitive models of human memory (Atkinson & Shiffrin, 1968; Baddeley, 2000; Tulving, 1985), which posit a multi-tier architecture with distinct storage, retrieval, and consolidation mechanisms operating at different timescales. We define three requirements for effective agentic memory: (R1) retention — information must persist across sessions measured in days or weeks; (R2) retrieval — task-relevant context must be accessible without explicit prompting; and (R3) consolidation — new experiences must integrate into the existing knowledge base without catastrophic forgetting (McCloskey & Cohen, 1989; Kirkpatrick et al., 2017).
Our contributions are fourfold. First, we design HPM, a hierarchical memory architecture that satisfies R1–R3 with distinct episodic, working, and long-term stores. Second, we introduce a learned attention-mask compression technique, trained via a reconstruction objective, that reduces episodic traces by 12x while retaining 94% of task-relevant information. Third, we present a large-scale evaluation on 10,472 real engineering trajectories demonstrating significant improvements in agent efficiency and cross-session learning. Fourth, we release an evaluation framework and anonymized trajectory subset to enable reproducible research.
2. Related Work
Memory-Augmented Neural Networks. Differentiable memory mechanisms have been extensively studied in deep learning. Neural Turing Machines (Graves et al., 2014) and Memory Networks (Weston et al., 2015) introduced external memory banks with attentional read-write operations. Differentiable Neural Computers (Graves et al., 2016) extended these with temporal linking and free-space management. While influential, these approaches are designed for single-session tasks with short timescales and do not address cross-session persistence or consolidation.
Agent Memory in LLMs. Recent work has explored memory for LLM-based agents. Park et al. (2023) introduced generative agents with memory streams for simulated social environments, using recency, relevance, and importance weighting for retrieval. Shinn et al. (2024) proposed Reflexion, which uses episodic memory buffers to store and retrieve past experiences for code generation agents. Zhu et al. (2024) introduced MemGPT, a virtual context management system that swaps memory pages between fast and slow tiers. However, these approaches are limited in several respects: they use simple retrieval mechanisms (recency-based or keyword-based rather than learned semantic indexing), lack formal consolidation mechanisms to prevent catastrophic forgetting, and have been evaluated only in simulated or small-scale settings.
Continual Learning. The consolidation challenge in HPM is related to continual learning (French, 1999). Elastic Weight Consolidation (Kirkpatrick et al., 2017) protects previously learned weights by penalizing changes to parameters important for prior tasks. Progressive Neural Networks (Rusu et al., 2016) allocate new columns for each task. Memory Replay (Rolnick et al., 2019) interleaves new experiences with sampled past experiences during training. Our approach adapts EWC to the agent memory setting, applying weight consolidation to the compressed memory representations rather than network parameters.
Table 1 summarizes the comparison with prior work across key capability dimensions.
Table 1: Comparison of agent memory approaches
| Approach | Cross-session | Learned index | Compression | Consolidation | Production eval |
|---|---|---|---|---|---|
| Memory Networks | — | yes | — | — | — |
| Generative Agents | yes | — | — | — | — |
| Reflexion | yes | — | — | — | — |
| MemGPT | yes | — | — | — | — |
| HPM (ours) | yes | yes | 12x | EWC | 10k traj. |
3. The Hierarchical Persistent Memory Architecture
HPM is organized as a three-tier hierarchy, inspired by cognitive models of human memory (Atkinson & Shiffrin, 1968; Baddeley, 2000). Each tier operates at a different timescale and level of abstraction, with distinct representations and retrieval mechanisms.
3.1 Tier 1: Episodic Buffer. The episodic buffer stores complete session traces: every action, observation, reasoning step, tool call, and environmental state change. Each episode E = (s_1, a_1, s_2, a_2, ..., s_T), where s_t is the agent state and a_t is the action at time t, is encoded by a learned semantic encoder φ: E → ℝ^d that maps the trajectory to a dense vector representation. The encoder is a bidirectional LSTM over the sequence of (state, action) embeddings, followed by mean pooling.
Episodes are indexed in a vector database (FAISS; Johnson et al., 2019) with IVF-PQ indexing for approximate nearest-neighbor search. At retrieval time, the current agent state and task description are encoded by φ to form a query vector q. The k most similar episode indices are retrieved via:
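One plausible form of this retrieval rule, assuming cosine similarity between the query and the stored episode embeddings, is given below. The text does not specify the similarity measure, and the symbols $\mathcal{B}$ (the set of indexed episodes) and $\mathcal{R}_k$ (the retrieved set) are notation introduced here for illustration.

```latex
\[
\mathcal{R}_k(q) \;=\; \operatorname*{top\text{-}k}_{E_i \in \mathcal{B}} \;
\frac{q^{\top} \phi(E_i)}{\lVert q \rVert \, \lVert \phi(E_i) \rVert}
\]
```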
The buffer retains the N most recent episodes in full fidelity (N = 100 in our implementation). Older episodes are eligible for compression into Tier 2. Retrieval latency is maintained below 50ms via IVF-PQ with 256 centroids and M = 32 sub-quantizers.
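The buffer behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not the Nexus implementation: it uses exact brute-force cosine search in place of FAISS IVF-PQ, and the class name `EpisodicBuffer`, its methods, and the toy two-dimensional embeddings are all hypothetical.

```python
import math
from collections import deque


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


class EpisodicBuffer:
    """Hypothetical Tier 1 buffer: keeps the N most recent episode embeddings."""

    def __init__(self, capacity=100):
        self.capacity = capacity
        self.episodes = deque()  # (episode_id, embedding), oldest first

    def add(self, episode_id, embedding):
        """Store an episode and return any that aged out (candidates for Tier 2)."""
        self.episodes.append((episode_id, embedding))
        aged_out = []
        while len(self.episodes) > self.capacity:
            aged_out.append(self.episodes.popleft())
        return aged_out

    def retrieve(self, query, k=3):
        """Return the ids of the k episodes most similar to the query (exact search)."""
        ranked = sorted(self.episodes, key=lambda e: cosine(query, e[1]), reverse=True)
        return [episode_id for episode_id, _ in ranked[:k]]


# Toy usage with capacity 2 so that eviction is visible.
buf = EpisodicBuffer(capacity=2)
buf.add("e1", [1.0, 0.0])
buf.add("e2", [0.0, 1.0])
aged_out = buf.add("e3", [0.9, 0.1])   # "e1" ages out of the buffer
top = buf.retrieve([1.0, 0.0], k=1)    # closest remaining episode is "e3"
```

An approximate index such as IVF-PQ trades a small amount of recall for the latency bound quoted above; the exact search here is only meant to make the retention and retrieval logic concrete.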
3.2 Tier 2: Compressed Working Memory. When episodic traces age out of the buffer (exceed N), they are passed through a learned compression module. The module applies an attention mask M ∈ [0,1]^T trained to identify task-relevant subsequences while discarding redundant or low-information content. The mask is computed by a small Transformer (2 layers, 4 heads, d = 128) that processes the episode and outputs a relevance score for each timestep.
The compressed representation C is computed as:
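A form consistent with the description, assuming that $h_t$ denotes the embedding of timestep $t$ and that only the $K$ highest-scoring timesteps are kept, is given below. Both are assumptions made for illustration; the text states only that the mask weights the timesteps and that the result is much shorter than the original episode.

```latex
\[
C \;=\; \bigl(\, M_t \, h_t \,\bigr)_{t \,\in\, \operatorname{top\text{-}K}(M)},
\qquad M \in [0,1]^{T}
\]
```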
The attention mask is trained using a reconstruction objective: a decoder (2-layer Transformer) must reconstruct the episode’s key decision points (defined as timesteps where the agent made a significant choice, detected via change-point detection on the action embedding trajectory) from the compressed representation. The training loss is:
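A loss consistent with the terms discussed next, assuming a squared-error reconstruction term over the set of key decision points, is given below. The exact reconstruction term is not stated in the text; $D_\psi$ (the decoder), $h_t$ (the embedding of timestep $t$), and $\mathcal{K}$ (the set of key decision points) are notation introduced here.

```latex
\[
\mathcal{L}(\theta, \psi) \;=\;
\sum_{t \in \mathcal{K}} \bigl\lVert D_\psi(C)_t - h_t \bigr\rVert_2^2
\;+\; \lambda \, \lVert M \rVert_1
\]
```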
The L1 sparsity penalty encourages the mask to select only the most informative timesteps. We set λ = 0.01 via cross-validation. After training, the mask achieves 12x compression (average T = 1,200 timesteps compressed to 100 mask-weighted values) while retaining 94.2% of task-relevant information as measured by downstream task performance.
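The selection and sparsity steps can be sketched as follows. This is a hedged illustration under stated assumptions: the mask scores are taken as given (in HPM they come from the small Transformer), top-K selection is an assumed reading of how the mask reduces an episode to a fixed number of values, and the function names `compress` and `l1_penalty` are hypothetical.

```python
def compress(embeddings, mask, keep):
    """Keep the `keep` highest-scoring timesteps, each scaled by its mask value.

    Returns (timestep, weighted_embedding) pairs in temporal order.
    """
    ranked = sorted(range(len(mask)), key=lambda t: mask[t], reverse=True)[:keep]
    kept = sorted(ranked)  # restore temporal order after ranking by score
    return [(t, [mask[t] * x for x in embeddings[t]]) for t in kept]


def l1_penalty(mask, lam=0.01):
    """L1 sparsity term lam * ||M||_1; lam = 0.01 is the value quoted in the text."""
    return lam * sum(abs(m) for m in mask)


# Toy episode with 4 timesteps and 2-dimensional embeddings (illustrative values).
embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
mask = [0.25, 1.0, 0.0, 0.5]

compressed = compress(embeddings, mask, keep=2)   # keeps timesteps 1 and 3
ratio = len(embeddings) / len(compressed)         # 2x here; 1,200 / 100 = 12x in the paper
penalty = l1_penalty(mask)
```

The same ratio computation applied to the figures in the text (1,200 timesteps reduced to 100) gives the reported 12x.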
3.3 Tier 3: Long-Term Consolidation Store. Compressed memories are consolidated into the long-term store via a replay mechanism that operates during idle periods. During consolidation, the system samples a batch of compressed memories from Tier 2, interleaves them with a random sample of existing long-term representations, and updates a set of persistent prototype vectors using a variant of Elastic Weight Consolidation (EWC; Kirkpatrick et al., 2017).
The EWC penalty is applied to the prototype update to prevent new memories from overwriting established knowledge:
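A form consistent with the definitions that follow, assuming the standard quadratic EWC penalty of Kirkpatrick et al. (2017) with β in the role of the penalty weight, is given below. $\mathcal{L}_{\text{new}}$, the loss on the newly sampled memories, is notation introduced here.

```latex
\[
\mathcal{L}(\theta) \;=\; \mathcal{L}_{\text{new}}(\theta)
\;+\; \frac{\beta}{2} \sum_{j} F_j \,\bigl(\theta_j - \theta^{*}_{A,j}\bigr)^{2}
\]
```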
where F_j is the diagonal entry of the Fisher information matrix for parameter j, θ*_A is the parameter vector after the previous consolidation, and β controls the strength of consolidation (β = 0.5 in our implementation). Consolidation runs automatically when the agent has been idle for more than 5 minutes, with each consolidation cycle processing up to 50 compressed memories.
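To make the effect of the penalty concrete, the sketch below applies one gradient step to a flat list of prototype parameters. It assumes the standard quadratic EWC penalty, whose gradient with respect to θ_j is β · F_j · (θ_j − θ*_j). The function name `ewc_step`, the learning rate, and the toy values are hypothetical; only β = 0.5 comes from the text.

```python
def ewc_step(theta, theta_star, fisher, grad_new, beta=0.5, lr=0.1):
    """One gradient step on L_new + (beta / 2) * sum_j F_j * (theta_j - theta*_j)^2.

    theta      -- current prototype parameters
    theta_star -- parameters after the previous consolidation
    fisher     -- diagonal Fisher information, one value per parameter
    grad_new   -- gradient of the loss on the new memories (taken as given here)
    """
    updated = []
    for t, t_star, f, g in zip(theta, theta_star, fisher, grad_new):
        grad = g + beta * f * (t - t_star)  # new-task gradient plus EWC pull-back
        updated.append(t - lr * grad)
    return updated


# Two parameters that have both drifted from 1.0 to 2.0.
theta_star = [1.0, 1.0]
theta = [2.0, 2.0]
fisher = [0.0, 10.0]     # the second parameter mattered for earlier memories
grad_new = [1.0, 1.0]

new_theta = ewc_step(theta, theta_star, fisher, grad_new)
```

With zero Fisher information the first parameter follows only the new-memory gradient, while the second is pulled more strongly back toward its consolidated value, which is the mechanism that protects established knowledge.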
Figure 1: HPM architecture diagram
4. Experimental Setup
Dataset. We evaluate on 10,472 agent trajectories collected from 147 engineering teams using the Nexus platform over a 6-month period (January–June 2026). Trajectories span four task categories: code engineering (42%, including refactoring, bug fixing, feature implementation, and dependency upgrades), system design (23%, including architecture analysis, migration planning, and tech debt assessment), documentation (20%, including API reference generation and ADR creation), and code review (15%). Each trajectory includes the full action trace, all tool inputs and outputs, and a binary success label determined by objective criteria (tests passing, builds succeeding, or human reviewer approval).
Baselines. We compare HPM against four baselines: (1) No Memory—the agent starts each session with a blank context; (2) KV Store—a simple key-value memory where session summaries are stored and retrieved by exact keyword match; (3) MemGPT-style—a virtual context management system that moves content between fast (recent) and slow (archived) memory tiers (Zhu et al., 2024); and (4) Reflexion-style—an episodic buffer with recency-weighted retrieval (Shinn et al., 2024).
Evaluation Metrics. We measure five metrics: (M1) Retention Accuracy—the percentage of task-relevant facts from prior sessions that the agent correctly applies in a later session, assessed by human evaluators on 500 stratified samples; (M2) Repeated Reasoning—the number of LLM calls dedicated to re-establishing context or re-deriving prior conclusions; (M3) Task Completion Time—elapsed time (in minutes) for multi-session workflows; (M4) Cross-Session Transfer—the frequency with which information from session i is correctly used in session j > i; and (M5) Memory Retrieval Latency—p50, p95, and p99 latencies for memory access.
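As an illustration of how M1 and M5 could be computed from raw measurements, the sketch below uses the nearest-rank percentile definition and a simple fraction of correctly applied facts. The paper does not state which percentile definition it uses, so nearest-rank is an assumption, and the sample latencies and judgements are made-up values, not data from the study.

```python
def percentile(samples, p):
    """Nearest-rank percentile for integer p in 1..100.

    Returns the smallest sample such that at least p% of samples are <= it.
    Integer arithmetic avoids floating-point error in the rank computation.
    """
    ordered = sorted(samples)
    rank = max(1, -(-p * len(ordered) // 100))  # ceil(p * n / 100)
    return ordered[rank - 1]


def retention_accuracy(judgements):
    """Share of prior-session facts that evaluators marked as correctly applied."""
    return sum(1 for ok in judgements if ok) / len(judgements)


# Illustrative inputs only.
latencies_ms = [12, 15, 18, 20, 22, 25, 30, 35, 41, 47]
judgements = [True, True, False, True]

latency_summary = {p: percentile(latencies_ms, p) for p in (50, 95, 99)}
m1 = retention_accuracy(judgements)
```

With only ten samples the p95 and p99 values coincide at the maximum, which is one reason tail latencies are usually reported over much larger samples.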
5. Results
5.1 Main Results. Table 2 presents the main results across all five metrics. HPM outperforms all baselines on M1 through M4; its p95 retrieval latency (47ms) is the highest of the methods compared but remains below the 50ms bound stated in Section 3.1. The improvement is largest on Retention Accuracy (94.2% vs. 67.3% for the best baseline, Reflexion-style) and Cross-Session Transfer (3.4x vs. 1.9x).
Table 2: Main results across memory approaches
| Method | M1 Retention | M2 Reasoning reduction | M3 Time (rel. to No Memory) | M4 Transfer | M5 p95 Latency |
|---|---|---|---|---|---|
| No Memory | 23.1% | 1.0x | 100% | 1.0x | 0ms |
| KV Store | 61.8% | 1.8x | 72% | 1.3x | 12ms |
| MemGPT-style | 61.8% | 2.1x | 58% | 1.6x | 34ms |
| Reflexion-style | 67.3% | 2.4x | 51% | 1.9x | 28ms |
| HPM (ours) | 94.2% | 2.8x | 41% | 3.4x | 47ms |
5.2 Ablation Studies. We conduct ablation studies to measure the contribution of each architectural component. Removing the episodic buffer (Tier 1) reduces retention accuracy from 94.2% to 68.3%, confirming that raw episode storage is critical for high-fidelity retention. Removing the compression module (Tier 2) increases memory storage requirements by 12x without a statistically significant change in accuracy (94.2% with compression vs. 94.8% without, p > 0.1), validating the efficiency of our compression. Removing EWC consolidation (Tier 3) leads to catastrophic forgetting on session gaps exceeding 7 days: for 14-day gaps, retention drops to 71.4% without EWC, compared with 91.8% with EWC. These results confirm that all three tiers contribute meaningfully to overall performance, with the episodic buffer providing the largest single contribution (41% of total improvement over the No Memory baseline).
5.3 Scaling Analysis. We analyze how HPM performance scales with the number of stored episodes N. Retention accuracy improves from 78.3% (N = 10) to 94.2% (N = 100), with diminishing returns beyond N = 50 (92.1% at N = 50). Memory retrieval latency scales sub-linearly with N due to IVF-PQ indexing, with p95 latency of 47ms at N = 100 and 82ms at N = 1,000 (tested via stress simulation). Memory storage per episode averages 4.2 MB uncompressed and 350 KB compressed (12x reduction).
5.4 Qualitative Analysis. We observe several emergent learning behaviors. In one case, an agent working across 12 sessions on a large TypeScript monorepo learned the team’s convention for error handling (wrapping all external calls in discriminated union Result types) during a refactoring task in session 3 and correctly applied the same convention to new code in sessions 7 and 11 without being reminded. In another case, an agent that was directed to prefer functional programming patterns in session 1 continued to prefer those patterns through session 15, even though the preference was only stated once. Human evaluators rated HPM-equipped agents as “feeling more like a team member than a tool” in 73% of evaluations (N = 200), compared to 22% for Reflexion-style memory.
6. Limitations and Future Work
HPM has several limitations. First, the current implementation does not support multi-agent memory sharing: each agent maintains an independent memory store. Extending HPM to support shared memory graphs for multi-agent coordination is an important direction for future work. Second, the compression module’s attention mask is trained offline on a static dataset; online adaptation of the compression strategy based on evolving task distributions could improve performance on shifting workloads. Third, while our evaluation covers 10k+ trajectories, all trajectories are from teams using the Nexus platform, which may introduce selection bias. Evaluation on independent agent frameworks would strengthen generalization claims. Fourth, the current consolidation mechanism requires idle periods of at least 5 minutes; agents operating under continuous load may not have sufficient idle time for consolidation, requiring an interruptible consolidation mechanism.
7. Conclusion
We present Hierarchical Persistent Memory (HPM), a three-tier memory architecture for AI agents that achieves 94.2% retention accuracy across heterogeneous sessions. Through a combination of learned semantic indexing, attention-mask compression, and EWC-based consolidation, HPM enables agents to retain and apply knowledge across sessions, reducing repeated reasoning by 2.8x and improving task completion time by 41%. Our evaluation on 10,472 real engineering trajectories demonstrates significant improvements over existing approaches and provides evidence of emergent cross-session learning. HPM is deployed in production as the default memory system for the Nexus agent runtime and is used by 147 engineering teams as of June 2026.
References
[1] Atkinson, R. C. and Shiffrin, R. M. Human memory: A proposed system and its control processes. Psychology of Learning and Motivation, 2:89–195, 1968.
[2] Baddeley, A. The episodic buffer: A new component of working memory? Trends in Cognitive Sciences, 4(11):417–423, 2000.
[3] French, R. M. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135, 1999.
[4] Graves, A., Wayne, G., and Danihelka, I. Neural Turing Machines. arXiv:1410.5401, 2014.
[5] Graves, A., et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538:471–476, 2016.
[6] Jimenez, C. E., et al. SWE-Bench: Can Language Models Resolve Real-World GitHub Issues? ICLR 2024.
[7] Johnson, J., Douze, M., and Jégou, H. Billion-scale similarity search with GPUs. IEEE TBD, 7(3):535–547, 2019.
[8] Kirkpatrick, J., et al. Overcoming catastrophic forgetting in neural networks. PNAS, 114(13):3521–3526, 2017.
[9] McCloskey, M. and Cohen, N. J. Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation, 24:109–165, 1989.
[10] Park, J. S., et al. Generative Agents: Interactive Simulacra of Human Behavior. arXiv:2304.03442, 2023.
[11] Rolnick, D., et al. Experience replay for continual learning. NeurIPS 2019.
[12] Rusu, A. A., et al. Progressive neural networks. arXiv:1606.04671, 2016.
[13] Shinn, N., et al. Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS 2024.
[14] Tulving, E. How many memory systems are there? American Psychologist, 40(4):385–398, 1985.
[15] Weston, J., Chopra, S., and Bordes, A. Memory Networks. arXiv:1410.3916, 2015.
[16] Wang, L., et al. A survey on large language model based autonomous agents. arXiv:2308.11432, 2024.
[17] Zhu, X., et al. MemGPT: Towards LLMs as Operating Systems. arXiv:2310.08560, 2024.