AGENTS.md is the wrong conversation

Tom & Jakub · Mon, 2 Mar 2026

Context files reduced task success rates and inflated inference costs in a new study. The debate that followed is useful, but it's pointing at the wrong solution.

A paper dropped this week that tested AGENTS.md files — the repo-level context documents that every AI coding tool now recommends — across multiple models and real GitHub issues. The result was uncomfortable: context files reduced task success rates compared to no file at all, while inflating inference costs by over 20%.

The reaction has been predictable. Theo (t3.gg) posted a sharp video dissecting why these files backfire in practice, running his own before/after test and finding that the run without a freshly generated CLAUDE.md was faster. The folks on Hacker News nodded knowingly. Addy Osmani posted a structural reframe: treat AGENTS.md as "a living list of codebase smells you haven't fixed yet." The tech-influencer class weighed in with hot takes about context management philosophy. Everyone agreed the files are being misused. Most proposed better usage guidelines.

None of them named the actual problem.


What the debate got right

To be fair, the diagnosis of symptoms is accurate. The paper found that auto-generated context files — the default output of /init commands — perform worst. Human-written files help slightly, but only when they contain non-discoverable information: tooling gotchas, non-obvious conventions, landmines the model would otherwise step on. Everything else is noise that sends agents down wrong paths and drives up cost.

Theo's explanation of why this happens is the clearest in the conversation. Your prompt is not the start of the context. There's a hierarchy: provider-level rules, system prompt, developer message, user messages — and AGENTS.md sits in the developer message layer, above your prompt, always present, biasing everything. The critical insight: whatever you put in context becomes more likely to happen. You can't mention tRPC "just in case" and expect the model not to reach for it. If you tell it about something, it will think about it — even when it's not relevant. That's the mechanism. That's why these files backfire.
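The layering Theo describes can be sketched in a few lines. This is a hypothetical illustration of how a coding agent might assemble its request, not any specific tool's API; the role names and file paths are assumptions. The point it shows: the context file is injected above the user prompt on every turn, so everything in it biases every completion.

```python
# Hypothetical sketch of how a coding agent stacks its context layers.
# Role names and paths are illustrative, not a real provider's API.

from pathlib import Path

def build_messages(user_prompt: str, repo_root: str) -> list[dict]:
    """Assemble context in priority order: everything above the user
    prompt is always present and biases every completion."""
    messages = [
        # Provider/system layer: rules the user never sees or edits.
        {"role": "system", "content": "You are a coding agent."},
    ]
    agents_md = Path(repo_root) / "AGENTS.md"
    if agents_md.exists():
        # The context file rides in the developer layer, above the prompt.
        # Mentioning tRPC here makes tRPC more likely in every answer,
        # relevant or not: whatever is in context becomes more probable.
        messages.append({"role": "developer", "content": agents_md.read_text()})
    messages.append({"role": "user", "content": user_prompt})
    return messages
```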

Addy's hierarchy point is the most structurally interesting take: a single AGENTS.md at the repo root is inadequate for any codebase of real complexity. What you actually need is a hierarchy: files placed at the relevant directory or module level, scoped precisely to the code each agent is working in. He's right, and that observation points directly at the structural failure of the current approach.

The HN thread surfaces practitioner wisdom that converges on the same place: only add to the file when the agent fails repeatedly at something. Use it to correct consistent mistakes, not to document the repo. Keep it lean. Keep it current. Theo's philosophy maps exactly: use it to correct consistent agent mistakes, and when the agent struggles, don't start by editing AGENTS.md — start by fixing the system. Better tests. Clearer project structure. Tighter feedback loops. Making it easier to do the right thing and harder to do the wrong thing beats writing a bigger instruction file every time.

This is all correct. It is also, fundamentally, a manual process. And manual processes don't scale.



The problem everyone is dancing around

Here is what the debate is actually describing, stripped of the context engineering vocabulary:

Every organisation running AI coding agents needs a system that knows the codebase, learns from what works and what fails, scopes knowledge to the right level of the stack, stays current without human maintenance, and propagates good patterns across the fleet.

AGENTS.md is not that system. It is a text file checked into a repo. It is written by a human, maintained (or not) by a human, scoped to wherever that human decided to put it, and contains whatever that human happened to think was important on the day they wrote it.

The paper's finding — that these files hurt performance — is not a finding about AGENTS.md specifically. It is a finding about what happens when you use a static, manually maintained document to do the job of a dynamic, self-updating knowledge system. The document is always wrong in some way. It is either out of date, too broad, too narrow, biased toward whatever the author was thinking about, or contradicted by something the model can discover itself. Filling the context with confidently incorrect information is worse than filling it with nothing.

Theo puts it well: if you fill your context with giant rule files, random skills you downloaded, MCP servers you're not using, and someone else's Cursor rules, you'll never be able to diagnose why the model is behaving badly. At one agent that's a debugging problem. At thirty, it's an organisational crisis.

There's one moment in Theo's video that points most directly at what's actually missing. He describes a single line he keeps in his AGENTS.md files: "This file exists to capture common confusion points. If you encounter something surprising, flag it and suggest an update." He doesn't want agents editing the file — he wants them to surface what confused them so he can fix the codebase. That feedback, he says, is gold.

He's right. That feedback loop is exactly what agent knowledge infrastructure should be doing — automatically, at scale, across every repo and every run. The instinct is correct. The tool is wrong.
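Done as infrastructure, that feedback loop might look something like the sketch below. Every name here is hypothetical: the idea is simply that "this surprised me" signals get aggregated across runs automatically, so recurring confusion surfaces as a fix candidate for the codebase instead of depending on one developer reading the flags.

```python
# Minimal sketch of the confusion-feedback loop as infrastructure rather
# than a line in a markdown file. All names are hypothetical.

from collections import Counter
from dataclasses import dataclass, field

@dataclass
class ConfusionLog:
    """Aggregates 'this surprised me' signals across agent runs."""
    events: Counter = field(default_factory=Counter)

    def flag(self, surprise: str) -> None:
        # Called whenever an agent hits something it found confusing.
        self.events[surprise] += 1

    def fix_candidates(self, min_runs: int = 3) -> list[str]:
        # Only repeated confusion is signal; one-off surprises are noise.
        return [s for s, n in self.events.most_common() if n >= min_runs]
```

The `min_runs` threshold is the judgment call a human currently makes by gut feel: one agent tripping once is noise, thirty agents tripping on the same thing is a codebase bug.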

The practitioners' response — "only add when the agent fails" — is the right instinct applied to the wrong mechanism. Yes, you should capture agent failures. Yes, you should feed corrections back into future runs. But doing this manually, per repo, per developer, is the one-agent solution. It breaks at three agents. It is unrecognisable at thirty.


The best answer so far — and why it still falls short

OpenAI's harness engineering post is the most serious public attempt to solve this problem. Worth reading in full, but the architecture in brief: AGENTS.md shrinks to ~100 lines acting as a table of contents, pointing to a structured docs/ directory of typed artifacts — design docs, architecture maps, execution plans, quality grades. Background agents run periodically, scan for stale documentation, and open cleanup PRs. Mechanical linters enforce architectural constraints, with error messages that double as remediation instructions for the agent. Documentation for agents, by agents. Plans versioned and co-located in the repo. The entire repository as the single system of record — if it isn't discoverable in the repo, it doesn't exist for the agent.
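One of those ideas is worth making concrete: a mechanical lint whose error message doubles as a remediation instruction the agent can act on. The sketch below is illustrative, not OpenAI's actual linter; the rule, paths, and message wording are all assumptions.

```python
# Sketch of a lint rule whose error text is written as remediation
# instructions for an agent. The rule and paths are illustrative.

import re

def check_no_cross_layer_import(path: str, source: str) -> list[str]:
    """Flag imports that cross an architectural boundary, phrasing the
    error as the fix, since the agent reads lint output as instructions."""
    errors = []
    for lineno, line in enumerate(source.splitlines(), 1):
        if path.startswith("ui/") and re.match(r"\s*from db\b|\s*import db\b", line):
            errors.append(
                f"{path}:{lineno}: ui/ must not import db directly. "
                "Go through the service layer instead: import the function "
                "you need from services/ and call it from the component."
            )
    return errors
```

The error message is the interesting part: a human linter says what is wrong, an agent-facing linter says what to do next.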

It works. They shipped a million lines of production code without a human writing a single line. That's not a thought experiment — it's a delivered product.

But read the prerequisites carefully. Greenfield codebase. Five months of dedicated harness-building before the product work began. A specialised team whose primary job was constructing the scaffolding. And crucially: single-repo scope.

The harness captures what lives in that repository. It doesn't see the incident postmortem in PagerDuty that explains why the retry logic is the way it is. It doesn't know that the payments team changed ownership last quarter. It doesn't connect the Jira ticket from eight months ago to the architectural decision it motivated. It doesn't share patterns across repositories, or promote a convention discovered in one service to a team-wide standard.

Addy's tweet-level version of this — a hierarchy of AGENTS.md files, scoped to modules, automatically maintained — is describing the same shape of solution, but without the implementation. "Automatically maintained" is doing enormous work in that sentence. Something has to watch agent runs and decide what to update. Something has to detect when a pattern has gone stale. Something has to know which module a piece of knowledge belongs to, prevent conflicting instructions at different hierarchy levels, and decide when a pattern discovered in one service is worth promoting to a fleet-wide rule.

OpenAI built that something. It took a dedicated team months, and it only works for one codebase.

We wrote about this gap directly when the harness engineering post dropped — the problem isn't the approach, it's that every organisation has to build it from scratch. What OpenAI constructed is a bespoke, single-org, single-repo solution to a problem that every engineering organisation running agents will hit. The harness engineering insight is correct. The output is a proof of concept, not infrastructure.

That's still rearranging deck chairs, just very well-engineered ones.


What the infrastructure actually looks like

The reason this conversation keeps circling without resolution is that the solution is not a better document format. It is infrastructure — specifically, a knowledge layer that sits beneath the agent layer and provides what static files cannot.

That infrastructure needs to do four things that no file can:

Learn from both outcomes and intentions — and keep both alive. An AGENTS.md file encodes what a developer intended the agent to know: the reasoning behind an architectural choice, the constraint that explains an unusual pattern, the direction the system is heading. That intent is genuinely valuable — arguably more valuable than the code itself. The public debate is converging on exactly this: context without intent produces agents that are locally correct but strategically incoherent, capable of writing perfect code while missing what the organisation is actually trying to build.

The failure of AGENTS.md is not that it captures intention. It's that it captures intention statically — a snapshot authored by one person on one day, which begins decaying the moment it's committed. Intent changes. Architectural direction shifts. A constraint that was critical six months ago may now be obsolete, or actively misleading.

Scope knowledge automatically. The hierarchy Addy describes — knowledge at the right level of the codebase — is the right structure. But the scoping should be derived from the actual structure of the codebase, not decided by a developer placing files in directories. A knowledge graph that models the relationships between repos, services, modules, and teams can answer "what does this agent need to know right now" without a human deciding in advance what "this" might be.

Persist organisational memory, not just technical patterns. The most valuable knowledge in an engineering organisation is not in the code. It is in the decisions that shaped the code: why the payments service has that retry logic, why you migrated off that database, why the API boundary is where it is. That knowledge lives in PRs, in Jira tickets, in Slack threads, in incident postmortems. An AGENTS.md file cannot reach it. A knowledge system that ingests those signals can.

Separate what agents know from what agents do. AGENTS.md conflates two different problems: giving agents information (context) and directing agent behaviour (instruction). Theo's context hierarchy framing makes this concrete — the developer message layer, where AGENTS.md lives, is being asked to do two jobs at once. The paper's finding — that context files increase exploration at the cost of correctness — is partly a finding about what happens when instructional noise is bundled with informational context. A proper knowledge layer separates the two: here is what you know about this codebase, and separately, here is how you should behave in this organisation. The second part is governance. It belongs in a hierarchy of instructions, not a markdown file in a repo.
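Two of those requirements, automatic scoping and the knowledge/instruction split, can be sketched together. This is a toy model under stated assumptions: the graph is a flat dict keyed by path, and the `kind` tag separating "what you know" from "how you should behave" is an invented convention, not an existing system.

```python
# Hedged sketch: knowledge attached to nodes of a codebase graph,
# retrieved by walking up from the file the agent is editing.
# The graph shape and the knowledge/instruction split are assumptions.

KNOWLEDGE = {
    # node -> list of (kind, text). 'kind' separates informational
    # context from behavioural rules so they can be delivered separately.
    "repo": [("instruction", "Never commit directly to main.")],
    "repo/payments": [("knowledge", "Retry logic exists because of the 2024 outage.")],
    "repo/payments/stripe": [("knowledge", "Webhooks are verified in middleware.")],
}

def scope(path: str, kind: str) -> list[str]:
    """Collect entries of one kind along the path from a node to the root."""
    out, node = [], path
    while node:
        out += [t for k, t in KNOWLEDGE.get(node, []) if k == kind]
        node = node.rpartition("/")[0]  # walk up one level in the graph
    return out
```

An agent editing `repo/payments/stripe` gets the payments and stripe knowledge but not, say, the search service's; the governance rules arrive through a separate channel regardless of where it is working.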


What this points to

The AGENTS.md debate is a useful discussion that points to where the industry is. Theo, Addy, the HN thread, the paper authors — these are smart people working on a genuinely hard problem and arriving at solutions that are genuinely better than what came before. Better scoping. Better maintenance discipline. Better separation of discoverable from non-discoverable information.

But the ceiling of this approach is visible. You can see it in the paper's results, in the engineering frustration across the thread, in the fundamental tension between "keep it lean" and "cover everything the agent might need." Theo's band-aid framing is exactly right: useful as a temporary patch, not a foundation. Every improvement to AGENTS.md practice is an improvement to a workaround. The underlying problem — that there is no system for persistent, evolving, organisationally scoped agent knowledge — remains unaddressed.

That system is what we are building at Ctx|.

The knowledge layer that learns from agent runs — holding both the intentions that explain why the codebase is the way it is, and the outcomes that validate or contradict those intentions over time — scopes context to the code being worked in, preserves the organisational memory that lives outside the codebase, and governs agent behaviour at fleet scale. When a pattern is confirmed by enough successful runs, it gets promoted. When it is repeatedly overridden or causes failures, it gets demoted — and the intention behind it gets surfaced for human review, not silently discarded. Theo's feedback loop — surface what confused the agent, fix the underlying system — is the right behaviour applied to both dimensions. It just needs to happen automatically, across every run, without a developer curating it by hand. The knowledge is maintained, not merely authored.
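The promote/demote lifecycle can be sketched as a small state machine. The thresholds, state names, and `record_run` helper are illustrative assumptions, not our implementation: the shape is what matters, where patterns earn status through confirming runs and lose it through failures, with the intent surfaced for review rather than deleted.

```python
# Illustrative sketch of the promote/demote lifecycle. Thresholds and
# state names are assumptions, not a real implementation.

from dataclasses import dataclass

@dataclass
class Pattern:
    text: str
    confirms: int = 0
    failures: int = 0
    status: str = "candidate"  # candidate -> promoted, or -> needs-review

def record_run(p: Pattern, succeeded: bool,
               promote_at: int = 5, demote_at: int = 3) -> Pattern:
    if succeeded:
        p.confirms += 1
        if p.status == "candidate" and p.confirms >= promote_at:
            p.status = "promoted"  # enough confirming runs: fleet-wide
    else:
        p.failures += 1
        if p.failures >= demote_at:
            # Demotion surfaces the intent for human review rather than
            # silently discarding it.
            p.status = "needs-review"
    return p
```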

That is the infrastructure underneath which AGENTS.md files become what they probably should always have been: a lightweight override mechanism for edge cases, not the primary source of agent knowledge.

The conversation about how to write better context files is useful. But the conversation worth having is about what replaces them.


Ctx| is being built by Tom & Jakub. It has an open-source core, so you can deploy within your own infrastructure or use our managed hosting.

Join the waitlist


References


  • [1] Theo, AGENTS.md video (youtube.com)
  • [2] Hacker News, AGENTS.md discussion (news.ycombinator.com)
  • [3] Addy Osmani, AGENTS.md as codebase smells (x.com)
  • [4] OpenAI, Harness engineering (openai.com)
  • [5] Research paper, AGENTS.md effectiveness study (arxiv.org)