Infinite context is theoretically possible. That's just the start.

Tom & Jakub · Mon, 30 Mar 2026

Recursive Language Models push token-level working memory to millions of tokens — and make it obvious why lifting context window limits is necessary but not sufficient for real engineering organisations.

A paper dropped out of MIT CSAIL in December 2025 that's worth reading carefully if you're building in this space.

Recursive Language Models — authored by Alex L. Zhang, Tim Kraska, and Omar Khattab — showed that you can give an agent a Python REPL environment, load a document corpus or codebase as a variable, and let the model write code to navigate, chunk, and recursively sub-query itself across inputs well beyond its native context window. The results are striking: GPT-5, which degrades sharply on complex tasks well within its 272K token context window, maintains strong performance at 10 million tokens with the RLM harness. On their hardest benchmark — OOLONG-Pairs, requiring quadratic reasoning across input pairs — the base model achieves an F1 score of 0.04. The RLM achieves 58.00.¹

We read it carefully. It's impressive work on a real problem. It's also the clearest illustration we've seen of why solving context window limitations is necessary but not sufficient — and why the approaches that come after it introduce risks the paper doesn't address at all.


What RLMs actually solve

The RLM paper sits precisely within a taxonomy the memory research community has recently formalised. A concurrent survey — Memory in the Age of AI Agents by Hu, Liu, Yue, Zhang et al., spanning NUS, Fudan, Oxford, Peking University, and Georgia Tech — organises agent memory into three forms: token-level memory (context window), parametric memory (model weights), and latent memory (compressed hidden states). And three functions: working memory, factual memory, and experiential memory.²

RLMs are a sophisticated token-level working memory mechanism. The REPL is externalised working memory — the model offloads context into variables, processes it programmatically, sub-queries recursively, builds understanding within the session. It is excellent at this. The paper demonstrates it convincingly across four benchmark types.
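The shape of that mechanism can be sketched in a few lines. This is a toy reconstruction, not the paper's actual harness: `query_model` stands in for a real LLM call with a bounded window, and the recursion is the divide-and-recombine decomposition the paper describes.

```python
# Toy sketch of the RLM pattern (not the paper's actual harness).
# The corpus lives as a plain variable; sub-queries recurse over
# chunks so no single call ever sees the whole input.

def query_model(prompt: str, context: str) -> str:
    # Stand-in for an LLM call with a bounded context window.
    # Here it just extracts the lines that mention the prompt term.
    return "\n".join(l for l in context.splitlines() if prompt in l)

def rlm_answer(prompt: str, lines: list[str], window: int = 100) -> str:
    # If the input fits the window, query directly; otherwise split,
    # recurse on each half, and query over the partial answers.
    if len(lines) <= window:
        return query_model(prompt, "\n".join(lines))
    mid = len(lines) // 2
    partials = [rlm_answer(prompt, lines[:mid], window),
                rlm_answer(prompt, lines[mid:], window)]
    return query_model(prompt, "\n".join(partials))
```

In the real system the model itself writes this navigation code inside the REPL; the point of the sketch is only the recursive structure that keeps every individual call within its window.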

What it doesn't do: persist anything across sessions. Learn from what happened. Share observations across agents. Update shared organisational knowledge. Every RLM run starts from zero. The model that navigated your 10 million token codebase today knows nothing about it tomorrow.

The memory paper names this gap precisely. Among its "emerging yet underdeveloped research frontiers": shared memory for multi-agent systems — how a fleet of agents operating across an organisation contributes to and benefits from common knowledge.² The research community has identified what's missing.



Theoretically infinite context. Practically constrained in ways the paper doesn't discuss.

Let's follow the theoretical arc. The RLM paper extends context to 10 million tokens. Assume the trend continues — future models handle 100 million, a billion, eventually unlimited context. Infinite working memory. Does that solve the problem?

No. And the reasons are more fundamental than model architecture.

Processing takes compute and time. The RLM paper is honest about its cost variance — 95th-percentile runs are significantly more expensive than the baseline. Processing 10 million tokens isn't free, and it isn't instant. Scaling that to the full signal landscape of an engineering organisation — dozens of repos, years of history, continuous agent runs — makes the compute and latency costs significant before you reach the interesting tasks.

The bottleneck moves upstream. Even if processing were instantaneous, you still have to load the data from its sources. Repos, docs, observability systems, deployment logs, incident histories — these live in different places, in different formats, updated at different rates. Loading all of it for every agent run means traversing storage systems, parsing formats, and resolving conflicts between stale and current versions. That work has to happen somewhere. The context window limit isn't removed — it's relocated to an ingestion problem.

You cannot cheat the speed of light or network bandwidth. This sounds like a theoretical concern. It isn't. At the scale real engineering organisations generate data, the time to load and transmit that data to a model — across network boundaries, through storage systems, over infrastructure that was not designed for this access pattern — is a real constraint that no improvement to context window architecture addresses. The data has to travel. Bandwidth is finite. Latency is real.
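A quick back-of-envelope makes the point concrete. The figures here are illustrative assumptions, not measurements: roughly 4 bytes per token of raw text, a 1 Gbps effective link, and zero serialisation overhead.

```python
# Back-of-envelope: time to move raw text to a model over the network.
# Assumptions (illustrative, not measured): ~4 bytes per token,
# 1 Gbps of effective bandwidth, no serialisation overhead.

def transfer_seconds(tokens: int, bytes_per_token: float = 4.0,
                     gbps: float = 1.0) -> float:
    bits = tokens * bytes_per_token * 8
    return bits / (gbps * 1e9)

# 10M tokens (the paper's scale): well under a second.
# A hypothetical org-wide corpus of 250B tokens (~1 TB of text):
# over two hours per load, before any processing happens.
```

Even with generous bandwidth, per-run loading at organisational scale dominates everything else — which is why caching and selective routing become mandatory, and that is the identification problem again.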

So even in the theoretical future where context window limits disappear, the problem of identifying what to load, what to send, and what's actually relevant doesn't disappear with them. The limit isn't what constrains you — the identification and routing problem is. That problem gets harder as data volumes grow, not easier.


The more interesting question: what if models become stateful?

Here's a thought worth taking seriously. What if the solution isn't smarter retrieval, but persistent model state — you push all your data to the model once, and it maintains it indefinitely, reasoning over accumulated organisational context across every session?

It's a genuinely interesting architectural direction. It would mean the identification problem is solved by the model itself, not by an external system. It would mean continuity across sessions without explicit memory infrastructure.

But follow the implications.

For a model to hold the persistent organisational state of a real engineering organisation — years of commits, decisions, constraints, incidents, agent runs — the model provider would need to store gargantuan amounts of data. Not general training data. Your data. Proprietary architectural decisions, internal constraints, competitive context, organisational knowledge that your engineers have built over years.

That data lives at OpenAI, or Anthropic, or Google. It becomes the foundation of your agent fleet's intelligence. Switching means losing it — or migrating it, at significant cost, to a different provider's proprietary format. This is not a minor inconvenience. It is a switching cost that compounds with every agent run, every stored decision, every month of organisational history. The longer you use it, the more locked in you become.

And here's the part that concerns us most: model providers with access to your persistent organisational state would be structurally incentivised to prioritise their own agents over third-party ones. You already see this direction in the market — infrastructure decisions that advantage the provider's own tooling over alternatives. At the level of stateful organisational memory, that dynamic becomes significantly more consequential. Your ability to choose your agent tooling would be constrained by where your memory lives.

Open, agent-agnostic infrastructure — where the knowledge graph is owned by the engineering organisation, not the model provider — isn't just a nice-to-have. It becomes the only architecture that preserves organisational autonomy as agent fleets scale.


The problem that remains regardless

The RLM paper solves single-session, single-corpus context navigation. That's a real contribution. But real engineering organisations don't look like a clean corpus in a single session. Their knowledge is distributed across dozens of repos, written by people who've since left, contradicted by codebases that drifted from the decisions that shaped them, invisible to any agent that wasn't present when the decisions were made.

The context window, however large, cannot contain what was never written down. The Memory in the Age of AI Agents survey is clear on this: getting knowledge into factual memory — structured, queryable, connected — is a distinct problem from navigating it once it exists. And experiential memory — what the organisation has learned from doing, not what it documented — is a further problem that neither paper addresses.

This is where the real gap is. Not "how do we help an agent read more." But "how do we ensure that every agent run makes the organisation permanently smarter, and that intelligence is available to every agent and every human that comes after."

That requires ingestion infrastructure that captures signal continuously. A knowledge graph that structures decisions, constraints, and history in a form that compounds. A learning layer that updates the graph automatically from agent behaviour. And a reasoning layer that surfaces what matters proactively — before the agent makes the same mistake again, and before the human finds out too late.
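To make that loop concrete, here is a deliberately minimal sketch of the capture-and-surface cycle. Every name is a hypothetical illustration, not ctx|'s actual API:

```python
# Hypothetical sketch (not ctx|'s API): signals from agent runs are
# folded into a shared store with provenance, and later runs query it
# before acting.

class ContextLayer:
    def __init__(self):
        self.facts = {}  # topic -> list of (source, statement)

    def ingest(self, source: str, topic: str, statement: str) -> None:
        # Ingestion/learning: capture a signal under a topic,
        # keeping track of where it came from.
        self.facts.setdefault(topic, []).append((source, statement))

    def surface(self, topic: str) -> list[tuple[str, str]]:
        # Reasoning layer: return what the organisation has already
        # recorded on this topic, with provenance, before an agent acts.
        return self.facts.get(topic, [])
```

An agent run that touches databases would call `surface("database")` first and get back the decision that governs the stack — rather than rediscovering it, or contradicting it.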

And it requires all of that to live with the engineering organisation, not inside a model provider's proprietary infrastructure.


The demo question that stays with us

When we run the demo, the task is simple and ambiguous: "Add a SQL database to this application."

The agent is sitting in an application repo. No explicit database instructions anywhere in it. We don't point it at anything.

It comes back with PostgreSQL. Drizzle for ORM and migrations. Structured exactly the way a governance repo specifies — a repo the agent was never told existed.

ctx| had indexed it, connected the relevant ADR to the current service context across the repo boundary, and surfaced it at the moment the agent needed it. Without being asked.

Someone watching always asks the same question: "How did it know to use Drizzle?"

The answer is that it didn't retrieve from a single clean corpus in a single session. It surfaced a connection between two repos, an ADR written months ago, and a task happening right now. No context window contains that by default. No single-session harness builds that connection. It exists because the knowledge graph has been learning what belongs together — from ingestion, from previous runs, from the signals an organisation generates continuously.
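Structurally, what happened in the demo is a multi-hop walk over a graph whose edges were learned ahead of time. A minimal sketch, with entirely hypothetical node names and relations:

```python
# Illustrative sketch of the cross-repo connection (hypothetical
# schema): nodes are artefacts, typed edges link them, and a small
# breadth-first walk finds the decisions that reach the current task.
from collections import defaultdict, deque

def build_graph(edges):
    g = defaultdict(list)  # node -> [(relation, node)]
    for src, rel, dst in edges:
        g[src].append((rel, dst))
    return g

def decisions_reaching(graph, start, relation, max_hops=4):
    # BFS out to max_hops, collecting targets of `relation` edges.
    seen, queue, hits = {start}, deque([(start, 0)]), []
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue
        for rel, dst in graph[node]:
            if rel == relation:
                hits.append(dst)
            if dst not in seen:
                seen.add(dst)
                queue.append((dst, depth + 1))
    return hits

edges = [
    ("repo:app", "part_of", "org:acme"),
    ("org:acme", "governed_by", "repo:governance"),
    ("repo:governance", "contains", "adr:db-standards"),
    ("adr:db-standards", "mandates", "decision:postgres+drizzle"),
]
graph = build_graph(edges)
```

No single-session retrieval produces the edge between `repo:app` and `adr:db-standards`; the walk only succeeds because ingestion put that edge in place before the task arrived.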

That's the gap between context navigation and organisational knowledge.

Infinite context would let an agent read more. It wouldn't tell it where to look, what matters, or what the organisation already decided.

That's the work that remains.


ctx| is the open-source, agent-agnostic, self-learning context layer for AI engineering agent fleets. If you're deploying agents at scale and want the knowledge graph to live with you, not your model provider, get in touch or join the growing waitlist.



References

This article draws on the following recent research:

  • [1] Zhang, A. L., Kraska, T., Khattab, O. (2025). Recursive Language Models. MIT CSAIL. arxiv.org/abs/2512.24601
  • [2] Hu, Y., Liu, S., Yue, Y., Zhang, G. et al. (2025). Memory in the Age of AI Agents: A Survey. NUS, Fudan, Peking University, Oxford, Georgia Tech, et al. arxiv.org/abs/2512.13564