Agent memory at scale
Tom & Jakub · Sun, 22 Feb 2026
A primer on the memory types agents depend on — and why the difference matters when you have thousands of them running at once.
OpenClaw's approach to memory is a key reason it hit 150,000 GitHub stars in two months (now 217,000). The approach unlocked agent efficiency over longer run times — a key barrier to wider agent adoption — and spawned thousands of articles, videos, and a level of attention that has somehow reached our grandparents. Another Rubicon has been crossed.
People are obsessed with the idea of an AI that actually remembers. One that recalls what you discussed last Tuesday, knows the architecture decision you made in March, and doesn't ask you to repeat yourself for the hundredth time (a frustration felt all the more acutely as TTS models become increasingly natural-sounding).
In this celebration of capability, the lower-level details are largely glossed over: "memory" isn't one thing. It's now clear that agents require several distinct types of memory, each serving a different purpose, each with different failure modes. Understanding the difference is increasingly non-negotiable, especially as organisations move from one agent to many.
This is part education, part explanation of how we think about it at Ctx|.
The four memory types
The cognitive science literature — now being rapidly mapped onto agent architectures — identifies four memory types that matter. We'll use plain language.
Working memory is the agent's active information state — the current task, the accumulated findings, the decisions made so far in this session. It's often conflated with the context window, but they're not the same thing. The context window is a hard transport limit; a working session can span hours of computation and many sequential context loads. Working memory is what sits between those loads: a plan.md tracking decisions, a scratchpad of intermediate findings, a set of files held ready for re-injection. Information in working memory typically gets searched, summarised, or filtered before it enters the context window at all — a session may be huge, even if each context window is not. The fundamental tension in all agent memory design is this: the context window is limited, and everything else has to be retrieved into it from somewhere. What gets retrieved, when, and how — that's the whole game.
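The retrieval loop described above can be sketched as a small working-memory layer: the session accumulates far more than the window can hold, and a filter decides what actually gets injected. Everything here — the class name, the naive word-overlap filter, a character budget standing in for tokens — is illustrative, not any real framework's API:

```python
from dataclasses import dataclass, field

@dataclass
class WorkingMemory:
    budget_chars: int                      # stand-in for the context-window limit
    notes: list[str] = field(default_factory=list)

    def record(self, finding: str) -> None:
        """Append an intermediate finding to the session scratchpad."""
        self.notes.append(finding)

    def build_context(self, query: str) -> str:
        """Filter and trim the scratchpad to what fits in the window."""
        # Naive relevance filter: keep notes sharing a word with the query.
        words = set(query.lower().split())
        relevant = [n for n in self.notes if words & set(n.lower().split())]
        context, used = [], 0
        for note in relevant:              # trim to the transport budget
            if used + len(note) > self.budget_chars:
                break
            context.append(note)
            used += len(note)
        return "\n".join(context)

wm = WorkingMemory(budget_chars=200)
wm.record("auth middleware uses JWT validation in service A")
wm.record("payments domain boundary documented in ADR-12")
print(wm.build_context("how does auth validation work?"))
```

The point of the sketch is the shape, not the filter: in practice the relevance step would be semantic search or summarisation, but the window stays a hard limit either way.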
Episodic memory is the agent's diary. Timestamped events: what happened, when, in what sequence. "On the 14th, agent-7 modified the auth middleware and introduced a race condition. On the 15th, it was reverted." This is the memory of execution — specific, contextual, temporally ordered. OpenClaw implements this as daily Markdown files (YYYY-MM-DD.md), append-only logs of what occurred, like a daily journal. The critical engineering insight from their implementation is that episodic memory needs temporal decay — a six-month-old note should rank lower than yesterday's, even if semantically it's a better match for your query. Without decay, stale context wins. The tuning of that half-life is a real engineering decision: thirty days makes sense for a personal assistant's conversational notes; it's completely wrong for a codebase where a two-year-old architectural decision may still be the most authoritative thing an agent can know.
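That decay rule can be written down directly. The sketch below uses an exponential half-life (our assumption; the function name and constants are illustrative), and shows yesterday's weaker match outranking a semantically stronger six-month-old note:

```python
from datetime import datetime, timezone

def decayed_score(similarity: float, written_at: datetime,
                  now: datetime, half_life_days: float = 30.0) -> float:
    """Rank = semantic similarity discounted by age.

    half_life_days is the tuning knob: ~30 for conversational notes,
    much longer (or effectively disabled) for long-lived decisions."""
    age_days = (now - written_at).total_seconds() / 86400
    return similarity * 0.5 ** (age_days / half_life_days)

now = datetime(2026, 2, 22, tzinfo=timezone.utc)
# A weaker match from yesterday vs a stronger match from six months ago.
fresh = decayed_score(0.70, datetime(2026, 2, 21, tzinfo=timezone.utc), now)
stale = decayed_score(0.90, datetime(2025, 8, 22, tzinfo=timezone.utc), now)
print(fresh > stale)  # → True: without decay, the stale note would win
```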
Semantic memory is the knowledge graph. Not "what happened" but "what is true." Facts, relationships, entities. The auth service owns the JWT validation logic. The payments domain boundary sits here. This function calls that one. In a personal assistant context, semantic memory is "the user's brother is named Mark and is a software engineer." In a software engineering context, it's the entire ontology of your system — types, modules, services, owners, dependencies, decisions — and how they relate to each other. This is where RAG and vector search live, and also where RAG falls short: retrieving a relevant chunk is not the same as understanding a relationship. The graph exists to provide this.
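The difference between grep and traversal is easy to make concrete. A minimal sketch, assuming a toy triple store (the entity and relation names are invented for illustration): "what breaks if I change this?" becomes a reverse-dependency walk rather than a text search:

```python
from collections import defaultdict

# Edges as (subject, relation, object) triples — an invented toy schema.
edges = [
    ("auth-service", "owns", "jwt-validation"),
    ("api-gateway", "depends_on", "auth-service"),
    ("payments-service", "depends_on", "auth-service"),
    ("checkout-ui", "depends_on", "payments-service"),
]

# Index reverse dependencies so impact questions are a traversal, not a grep.
dependents = defaultdict(set)
for src, rel, dst in edges:
    if rel == "depends_on":
        dependents[dst].add(src)

def blast_radius(node: str) -> set[str]:
    """Transitively collect everything that depends on `node`."""
    seen, stack = set(), [node]
    while stack:
        for dep in dependents[stack.pop()]:
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

print(sorted(blast_radius("auth-service")))
# → ['api-gateway', 'checkout-ui', 'payments-service']
```

Note that checkout-ui never mentions auth-service anywhere in its text; only the relationship structure surfaces it.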
Procedural memory is learned approaches and patterns — how work gets done here, not just what is true. In cognitive science, this is implicit memory: knowledge encoded into action, like typing without thinking. For agents, it's the accumulated institutional knowledge of how to approach problems: the design patterns that have proven reliable, the architectural principles the team has converged on, the conventions that have been validated across many cycles. Critically, this isn't scripting — it's not "call this tool in this sequence." It's closer to "this is how we design authentication boundaries in this codebase" or "this is how we decompose services." Runbooks live here, but so do skills: not the mechanical execution of steps, but the deeper design intuitions that experienced engineers carry and that agents need to inherit. In OpenClaw's architecture, this is what gets promoted when a pattern proves itself repeatedly across episodes. In software engineering terms: your AGENTS.md, your skills, your validated runbooks — the instruction hierarchy.
There's also a fifth dimension worth naming that cuts across all of these: temporal awareness — not a separate store but a dimension applied to the others. When did a fact become true? When did it stop being true? The bi-temporal model (when something happened vs when the system learned it) matters enormously in living codebases. An architectural decision made in 2023 may have been superseded in 2024. Without temporal awareness, agents operating on your codebase are working with a snapshot that doesn't know its own age.
Why this taxonomy matters at scale
For a single developer using OpenClaw as a personal assistant, getting memory right is a quality-of-life problem. For an organisation deploying hundreds of concurrent coding agents across a large estate, it becomes a structural problem.
Consider what happens without each type:
- Without episodic memory, agents have no history. Every session is the first session. They repeat mistakes. They re-examine decisions the team has already resolved. They can't answer "what did we try last week?"
- Without semantic memory, agents have no map. They can search for relevant text, but they can't traverse relationships. They can't answer "what breaks if I change this?" or "who owns that service?" Without structure, every question becomes a grep — and at agent scale, that's not just slow and expensive, it's how you get confidently wrong answers. Finding the string isn't the same as understanding the system.
- Without procedural memory, agents have no institutional knowledge. They don't know how things are done here. Every agent reinvents the pattern from scratch, and divergence compounds. You end up with thirty slightly different approaches to the same problem across thirty repos. This is precisely why static Markdown files alone don't solve this — a file per agent, per session, per repo scales the storage without ever scaling the knowledge.
- Without temporal awareness, agents have no sense of time. They treat a superseded ADR with the same weight as a current one. They apply patterns that were deprecated. They lack the ability to distinguish "what was true" from "what is true now."
Most agent memory systems today solve one or two of these well. OpenClaw's Markdown approach handles episodic well and semantic partially — but it's designed for a single user, not a fleet. Vector databases handle semantic retrieval but have no temporal dimension. RAG handles factual recall but has no procedural layer.
How Ctx| approaches this
We didn't start from an abstract memory taxonomy and work forwards. We started from the problem — thousands of agents running concurrently across codebases — and realised all four types would be required simultaneously.
The knowledge graph is our semantic memory layer. Typed entities, traversable relationships, the full software engineering ontology: repos, modules, functions, services, owners, domains, dependencies. Not documents about code — the structural model of the code itself, connected.
Agent interactions enrich the graph. When an agent navigates a pattern, validates a decision, or makes a mistake that gets corrected, that becomes an observation. The graph learns. This is our episodic layer — but unlike daily markdown files, it's captured at the level of the system, not the individual session, and it feeds back into the graph rather than sitting in a file.
The instruction hierarchy is our procedural memory. AGENTS.md, skills, MCPs — versioned in git, promoted and demoted based on actual usage patterns, reviewed in PRs. When a pattern proves itself across many agent interactions, it gets promoted. When a pattern proves harmful, it gets demoted. The graph is the evidence base for those decisions.
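A toy version of that promotion/demotion rule, with invented thresholds (the actual policy is not described in this article — this only illustrates the shape of evidence-based review):

```python
def review_pattern(successes: int, failures: int,
                   promote_after: int = 10,
                   max_failure_rate: float = 0.2) -> str:
    """Decide a pattern's fate from graph-recorded outcomes.

    Thresholds are assumptions for illustration, not a real policy."""
    total = successes + failures
    if total == 0:
        return "insufficient evidence"
    if failures / total > max_failure_rate:
        return "demote"        # harmful pattern: pull it from the hierarchy
    if successes >= promote_after:
        return "promote"       # proven pattern: surface it to every agent
    return "keep observing"

print(review_pattern(successes=14, failures=1))  # → promote
```

The interesting design choice is that the decision is reviewable: because the hierarchy is versioned in git, a promotion lands as a PR with the graph evidence attached, not as a silent config change.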
Git is our temporal layer. Everything is versioned. The history of decisions is auditable. We know not just what is true now but what was true before, and when it changed. Temporal decay applies to retrieved context: recent changes rank higher than old ones, unless pinned.
The result is a system where each new agent connects — via a single MCP — to all four memory types simultaneously, without having to build or manage any of them. The memory is the infrastructure. The agents just use it.
The demands on memory infrastructure are different in the agent scaling era. You need memory that's shared across a fleet, governed at org scale, and embedded in the development workflow itself — rather than living in a sidecar Markdown file.
That's what we're building.
Ctx| is being built by Tom & Jakub. It has an open-source core, so you can deploy within your own infrastructure or use our managed hosting.
This article draws on the following recent research:
- [1] Zhang et al. (2026) — Memory in the Age of AI Agents — arxiv.org/abs/2512.13564
- [2] Liang et al. (2025) — AI Meets Brain — arxiv.org/abs/2512.23343
- [3] Terranova et al. (2025) — Evaluating Long-Term Memory — arxiv.org/abs/2510.23730
- [4] Anokhin et al. (2025) — AriGraph — IJCAI 2025 — ijcai.org
- [5] Mem^p (2025) — arxiv.org/abs/2508.06433