Specificity compounds. Generality dilutes.

Tom & Jakub · Fri, 12 June 2026

General-purpose memory layers promise one graph for everything. For agentic software engineering, that's the wrong trade. Build the knowledge system on the ontology of software and nothing else, and it gets cheaper per token and more accurate per retrieval. It's also the only way a fleet of agents converges instead of drifting.


There's an appealing idea doing the rounds in agent infrastructure: build one knowledge layer for the whole business. Finance, ops, marketing, engineering - one graph, one memory, one retrieval endpoint. Every agent, every department, every question.

It sounds like leverage, but it's actually dilution.

We've made a deliberate, opinionated bet in the opposite direction: ctxpipe is scoped to the ontology of software engineering. Repos, services, modules, functions, dependencies, owners, ADRs, incidents, agent runs. Not "documents in a graph." Not "your company's knowledge." The specific, typed, traversable structure of how software gets built. We've written about what we're doing, but have not really defended this decision.

This post is about why that specificity isn't a limitation we'll grow out of. It's the mechanism that makes everything else work.


Words mean things, until they don't

Software engineering is unusually well-suited to a knowledge graph because its entities have crisp identity. A commit SHA is a commit SHA. A service has an owner, a contract, a dependency list. Relationships are unambiguous and typed: depends_on, deployed_to, caused_incident, governed_by. When the schema is this tight, extraction is precise, entity resolution is reliable, and traversal returns facts rather than vibes.

Mix in the rest of the business and the vocabulary collapses. "Pipeline" is a CI run to engineering, a sales funnel to revenue, a cash-flow projection to finance. "Deployment" means something different to the platform team and the field ops team. In a mixed graph you face an ugly choice: over-merge entities and get confidently wrong answers, or namespace everything and rebuild domain separation with worse ergonomics and a committee attached.

A schema that means everything to everyone ends up meaning nothing precisely. And agents - unlike senior engineers - can't read the room to clarify. They take the graph at face value.

The token efficiency argument

Here's the part that shows up on an invoice.

Every retrieval an agent makes spends context window - and not once, but on every step that follows, until it's evicted. Context windows are priced per token, reasoned over per token, and - as the long-context research keeps demonstrating - degraded per token. Recent work has shown that input length alone hurts model performance even when retrieval is perfect: pad the context with semantically empty filler and accuracy still drops substantially.[1] The question is never just "can we fit it?" It's "what is the signal density of what we fitted?"
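The "paid again on every step" point can be made concrete with a little arithmetic. What follows is a minimal sketch, assuming a simple agent loop in which nothing is evicted and every step re-reads the whole context. The function name and the token counts are hypothetical, chosen only to show the shape of the cost.

```python
def cumulative_input_tokens(retrieved_per_step: list[int]) -> int:
    """Total input tokens read across an agent run.

    Assumption (hypothetical loop): nothing is evicted, so tokens retrieved
    at step k are read again as input on every later step.
    """
    total = 0
    context = 0
    for tokens in retrieved_per_step:
        context += tokens   # the retrieval lands in the context window
        total += context    # the whole window is read on this step
    return total


# Illustrative numbers only: a 10-step run with one retrieval up front.
dense = cumulative_input_tokens([300] + [0] * 9)    # 300 relevant tokens
diluted = cumulative_input_tokens([900] + [0] * 9)  # the same 300 plus 600 of noise
print(dense, diluted, diluted - dense)  # 3000 9000 6000
```

Under these assumptions, 600 tokens of irrelevant retrieval become 6,000 tokens of extra input over ten steps. The noise is multiplied by the number of steps it survives.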

A domain-scoped graph maximises signal per token in two ways.

Retrieval stays in-neighbourhood. Graph retrieval works by expanding outward from seed nodes. In an engineering-only graph, two hops from a payments-service node lands on its dependencies, its owner, its recent incidents, the ADR that constrains it. In a mixed business graph, those same two hops can drag in the Q3 marketing campaign that shares a "project" node, the finance forecast linked through a quarter entity, a sales account that happens to mention payments. Every one of those nodes costs tokens to retrieve, tokens to reason over, and accuracy to ignore. The graph RAG literature has arrived at the same diagnosis: the PathRAG authors argue that the limitation of current graph-based retrieval is the redundancy of what comes back, not its insufficiency - and show that pruning retrieval down to key relational paths reduces noise and token consumption while improving answer quality.[2]

[Figure: Two hops from payments-service. Same retrieval radius, different signal density.
Mixed business graph: payments-service, auth-service, Q3-2026, proj-phoenix, ADR-012, campaign-q3-launch, forecast-fy26, acct-acme-renewal, okr-board-q3 (edges: depends_on, constrained_by, tagged, linked, mentions, filed_under). 9 nodes retrieved, 3 relevant; the 6 irrelevant nodes are retrieved, reasoned over, and paid for.
Engineering-scoped graph: payments-service, auth-service, billing-api, payments-team, prod-eu, INC-204, ADR-012, postmortem-204, pg-cluster-eu (edges: depends_on, calls, owned_by, deployed_to, caused, constrained_by, documented_in, runs_on). 9 nodes retrieved, 9 relevant; every token earns its place.
Every retrieved node is paid for again on every inference step that follows, relevant or not.]
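The two-hop comparison can be reproduced with a plain breadth-first expansion. This is a minimal sketch, not ctxpipe's retrieval code: the node and edge names come from the figure, but which edge connects which pair of nodes is our assumption, and "relevant" is simplified to "is an engineering node".

```python
from collections import deque


def neighbourhood(edges, seed, radius=2):
    """Return every node within `radius` hops of `seed` (seed included)."""
    adjacency = {}
    for src, _relation, dst in edges:
        adjacency.setdefault(src, set()).add(dst)
        adjacency.setdefault(dst, set()).add(src)  # expand in both directions
    seen = {seed}
    frontier = deque([(seed, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == radius:
            continue  # do not expand beyond the retrieval radius
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen


# Assumed wiring: the figure names the nodes and edge labels, not the topology.
ENGINEERING = [
    ("payments-service", "depends_on", "auth-service"),
    ("payments-service", "calls", "billing-api"),
    ("payments-service", "owned_by", "payments-team"),
    ("payments-service", "deployed_to", "prod-eu"),
    ("payments-service", "caused", "INC-204"),
    ("payments-service", "constrained_by", "ADR-012"),
    ("INC-204", "documented_in", "postmortem-204"),
    ("prod-eu", "runs_on", "pg-cluster-eu"),
]
MIXED = [
    ("payments-service", "depends_on", "auth-service"),
    ("payments-service", "constrained_by", "ADR-012"),
    ("payments-service", "tagged", "Q3-2026"),
    ("payments-service", "tagged", "proj-phoenix"),
    ("proj-phoenix", "linked", "campaign-q3-launch"),
    ("Q3-2026", "linked", "forecast-fy26"),
    ("acct-acme-renewal", "mentions", "payments-service"),
    ("okr-board-q3", "filed_under", "Q3-2026"),
]
ENGINEERING_NODES = {n for src, _rel, dst in ENGINEERING for n in (src, dst)}


def signal(edges, seed="payments-service"):
    """Return (nodes retrieved, nodes that are engineering entities)."""
    retrieved = neighbourhood(edges, seed)
    return len(retrieved), len(retrieved & ENGINEERING_NODES)


print(signal(MIXED))        # (9, 3)
print(signal(ENGINEERING))  # (9, 9)
```

The traversal code is identical in both cases. Only the graph it runs over changes, which is the point: signal density here is a property of the ontology, not of the retriever.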

Typed facts beat prose chunks. A graph that knows auth-service -owned_by-> security-team -constrained_by-> ADR-0005 can answer an agent's question in a handful of structured statements. A general-purpose layer retrieving top-k document chunks returns paragraphs of redundant prose hoping the fact is somewhere inside. The difference per query is modest. Multiply it by tens of thousands of retrievals a day across a fleet, and the per-token economics of specificity stop being a nicety and start being a line item.

This is the same argument we made about context files in "AGENTS.md is the wrong conversation" - and it's now empirically grounded: the ETH Zurich evaluation of repository context files found they tended to reduce task success rates compared to providing no repository context at all, while increasing inference cost by over 20%.[3] Stuffing context indiscriminately doesn't just fail to help. It actively costs you, twice. Specificity is how you stuff less and deliver more.

The quality argument

Token efficiency is about what retrieval costs. Quality is about whether it's right.

Extraction precision. Most knowledge graphs for agents are built by LLM-based entity and relation extraction. Constrain that extraction to a known engineering schema - these are the entity types, these are the legal relationships - and precision jumps while hallucinated edges drop. The construction research backs this directly: pipelines that enforce ontology-based type and relation constraints during extraction, and normalise entities against the schema, produce compact, consistent, deduplicated graphs - in one recent evaluation, the correct answer entity appeared in 96% of generated triplets.[4] Ask a model to extract "everything important" from a mixed corpus of contracts, campaign briefs, and codebases, and you get the opposite: a mushy, weakly-typed graph that's confidently wrong at the edges. Garbage edges don't just sit there inert. Agents traverse them.
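To illustrate what "constrain that extraction to a known engineering schema" can look like, here is a minimal sketch of a post-extraction filter. It assumes the LLM has already proposed candidate triples. The schema, the entity registry, and the `constrain` function are hypothetical examples, not ctxpipe's actual ontology or the pipeline from the cited paper.

```python
# Hypothetical schema: relation -> (required subject type, required object type)
SCHEMA = {
    "depends_on": ("service", "service"),
    "owned_by": ("service", "team"),
    "governed_by": ("service", "adr"),
    "caused_incident": ("service", "incident"),
}

# Hypothetical registry of known entities: canonical name -> type
ENTITY_TYPES = {
    "payments-service": "service",
    "auth-service": "service",
    "payments-team": "team",
    "adr-012": "adr",
    "inc-204": "incident",
}


def normalise(name: str) -> str:
    """Map surface variants ('Payments Service', 'payments_service') to one key."""
    return name.strip().lower().replace("_", "-").replace(" ", "-")


def constrain(candidates):
    """Split candidate triples into schema-valid (deduplicated) and rejected."""
    kept, rejected, seen = [], [], set()
    for subject, relation, obj in candidates:
        s, o = normalise(subject), normalise(obj)
        expected = SCHEMA.get(relation)
        actual = (ENTITY_TYPES.get(s), ENTITY_TYPES.get(o))
        if expected is None or actual != expected:
            rejected.append((subject, relation, obj))  # illegal relation or types
            continue
        if (s, relation, o) not in seen:  # deduplicate after normalisation
            seen.add((s, relation, o))
            kept.append((s, relation, o))
    return kept, rejected


candidates = [
    ("Payments Service", "depends_on", "auth-service"),
    ("payments_service", "depends_on", "Auth Service"),  # duplicate once normalised
    ("payments-service", "owned_by", "ADR-012"),         # wrong object type
    ("payments-service", "inspired_by", "auth-service"), # relation not in schema
    ("payments-service", "governed_by", "ADR-012"),
]
kept, rejected = constrain(candidates)
print(kept)
print(rejected)
```

With these inputs, two triples survive and two are rejected; the fifth candidate collapses into a duplicate. An edge that never enters the graph is one no agent can traverse.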

Resolution reliability. When every entity in the graph belongs to one domain, "the payments pipeline" resolves to exactly one node. Resolution errors in a knowledge graph are silent and compounding - a wrongly merged entity poisons every traversal that passes through it, and nobody notices until an agent acts on the corrupted picture.
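A small sketch of why scoping makes resolution tractable. The node IDs, the alias sets, and the `resolve` function are all hypothetical; the only assumption that matters is that a mention is matched against known aliases and that ambiguity is raised rather than guessed.

```python
def resolve(mention, nodes):
    """Return the single node whose aliases contain `mention`, else raise."""
    key = mention.strip().lower()
    matches = sorted(node_id for node_id, aliases in nodes.items() if key in aliases)
    if len(matches) != 1:
        raise LookupError(f"{mention!r} matched {len(matches)} nodes: {matches}")
    return matches[0]


# Hypothetical engineering-only graph: one meaning of "pipeline".
engineering = {
    "ci:payments-deploy": {"payments pipeline", "payments ci"},
}

# Hypothetical mixed graph: the same words, three departments.
mixed = {
    **engineering,
    "sales:payments-funnel": {"payments pipeline"},
    "finance:payments-cashflow": {"payments pipeline"},
}

print(resolve("the payments pipeline".removeprefix("the "), engineering))
try:
    resolve("payments pipeline", mixed)
except LookupError as err:
    print(err)
```

Raising on ambiguity is a deliberate choice in this sketch. The alternative, picking the "best" match, is exactly the silent over-merge described above.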

Answerable questions. The questions engineering agents actually need answered are structural: what breaks if I change this interface? Who owns the thing I'm about to touch? Has this pattern been tried and rejected before? These are multi-hop traversals over typed relationships. They are only answerable when the ontology models those relationships explicitly. A general-purpose document graph can tell an agent what was written about the auth service. An engineering ontology can tell it what the auth service is connected to - which is the question that prevents the incident. And this isn't theoretical for our domain: on enterprise legacy code migration tasks, graph-based retrieval outperformed vanilla vector baselines by up to 15% in production-oriented evaluation.[5]
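"What breaks if I change this interface?" and "who owns the thing I'm about to touch?" reduce to a reverse traversal over depends_on edges followed by an owned_by lookup. A minimal sketch follows, assuming the edges are available as plain pairs. `checkout-web`, `web-team`, and the specific wiring are hypothetical, added only to give the traversal some depth.

```python
def impacted_by(target, depends_on):
    """Everything that directly or transitively depends on `target`."""
    dependents = {}
    for consumer, provider in depends_on:
        dependents.setdefault(provider, set()).add(consumer)
    impacted, stack = set(), [target]
    while stack:
        node = stack.pop()
        for consumer in dependents.get(node, ()):
            if consumer not in impacted:  # also guards against dependency cycles
                impacted.add(consumer)
                stack.append(consumer)
    impacted.discard(target)  # a cycle must not report the target as its own victim
    return impacted


# (consumer, provider) pairs: consumer depends_on provider. Illustrative wiring.
DEPENDS_ON = [
    ("payments-service", "auth-service"),
    ("billing-api", "payments-service"),
    ("checkout-web", "billing-api"),
    ("auth-service", "pg-cluster-eu"),
]
OWNED_BY = {
    "payments-service": "payments-team",
    "billing-api": "payments-team",
    "checkout-web": "web-team",
    "auth-service": "security-team",
}

blast_radius = impacted_by("auth-service", DEPENDS_ON)
owners = {OWNED_BY[service] for service in blast_radius}
print(sorted(blast_radius))  # ['billing-api', 'checkout-web', 'payments-service']
print(sorted(owners))        # ['payments-team', 'web-team']
```

Neither answer is recoverable from text similarity alone: `checkout-web` may never mention the auth service in any document, yet it sits inside the blast radius.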

The consistency argument

This is the one that matters most at fleet scale, and it's the least discussed.

One agent retrieving slightly-off context produces one slightly-off PR. And the cost of off-context isn't marginal: when researchers injected irrelevant task trajectories into agent contexts, success rates collapsed from 40-50% to under 10%, with agents looping and losing track of their original objectives.[6] That study was on web agents, but the mechanism is general - and a thousand coding agents retrieving from an ambiguous, weakly-typed graph produce a thousand divergent interpretations of how your organisation builds things, at machine speed, replicated before any human notices.

A shared, specific ontology is what makes convergence possible:

Same types, same picture. When every agent - Cursor, Claude Code, Copilot, custom harnesses - retrieves against the same typed schema through the same MCP, they reason from the same model of the world. Consistency stops being a prompting aspiration and becomes a structural property of the infrastructure.

Governance needs something to attach to. An instruction hierarchy is only enforceable if the graph knows what a repo, a domain, and an org are. "This standard applies to all services in the payments domain" is a governance rule expressible in one edge - if service and domain are first-class types. In a generic graph, that rule is a paragraph of prose an agent may or may not retrieve, may or may not weight. Promotion and demotion of patterns, domain-scoped instructions, blast-radius-aware review - all of it presupposes the engineering ontology underneath.
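Here is a minimal sketch of a rule "expressible in one edge", assuming services roll up to domains and domains to an org. The standard names, the `acme` org, and the `PARENT` / `APPLIES_TO` structures are hypothetical, not ctxpipe's data model.

```python
# Hypothetical roll-up edges: child -> parent scope.
PARENT = {
    "payments-service": "domain:payments",
    "billing-api": "domain:payments",
    "auth-service": "domain:identity",
    "domain:payments": "org:acme",
    "domain:identity": "org:acme",
}

# Each governance rule is a single edge from a scope to a standard.
APPLIES_TO = {
    "domain:payments": ["STD-idempotent-writes"],
    "org:acme": ["STD-structured-logging"],
}


def standards_for(node):
    """Collect standards from the node's own scope up to the org, most specific first."""
    found = []
    current = node
    while current is not None:
        found.extend(APPLIES_TO.get(current, []))
        current = PARENT.get(current)  # None once we pass the org root
    return found


print(standards_for("payments-service"))  # ['STD-idempotent-writes', 'STD-structured-logging']
print(standards_for("auth-service"))      # ['STD-structured-logging']
```

Adding a service to the payments domain is one new roll-up edge, and the standard applies to it with no prose to retrieve and no weighting to hope for.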

Learning generalises along type lines. When an agent run produces a lesson - this migration pattern fails on services with this dependency shape - the fleet can only inherit it if "services with this dependency shape" is a queryable concept. Typed nodes are what let one agent's mistake become every agent's guardrail. In a soup of documents, the lesson is just another chunk competing for top-k.

The honest caveat

The strongest objection to domain scoping is real: the most valuable questions in a business are often cross-domain. The incident that hit the customer that triggered the churn that shaped the roadmap. Hard silos can't trace that thread.

Our answer isn't that those connections don't matter. It's that they should be deliberate edges, not accidental adjacency. The right architecture is a dense, precise domain graph with a thin shared spine - people, teams, products, time - that other systems can join against. What you don't want is an engineering agent wandering into compensation data because a quarterly planning doc happened to bridge two graph neighbourhoods. Cross-domain access for agents should be explicit, scoped, and audited - an escalation, not a default. Specificity isn't just an accuracy property. It's a security boundary.

What we're building

ctxpipe is unapologetically a software engineering brain. The ontology is engineering-native: repos roll up to domains, domains roll up to org. Services, owners, contracts, ADRs, incidents, agent runs - typed, traversable, governed. The graph compounds along those types with every commit and every agent run, and serves every agent the same precise picture through one MCP.

General-purpose memory layers will keep improving, and for general-purpose problems they're the right tool. But agentic software engineering isn't a general-purpose problem. It's a domain with crisp entities, structural questions, and a fleet that needs to converge on one way of building, even as that way of building evolves rapidly with the underlying tech.

Specificity compounds. Generality dilutes. We picked our side.


ctxpipe is the open-source, agent-agnostic, self-learning context layer for AI engineering agent fleets. If you're deploying agents at scale and want the knowledge graph to live with you, not your model provider, request SaaS early access.


References

This article draws on the following recent research:

  • [1] Du, Y., et al. (2025). Context Length Alone Hurts LLM Performance Despite Perfect Retrieval. Findings of EMNLP 2025. arxiv.org/abs/2510.05381
  • [2] Chen, B., Guo, Z., Yang, Z., Chen, Y., Chen, J., Liu, Z., Shi, C., Yang, C. (2025). PathRAG: Pruning Graph-based Retrieval Augmented Generation with Relational Paths. AAAI 2026. arxiv.org/abs/2502.14902
  • [3] Gloaguen, T., Mündler, N., Müller, M., Raychev, V., Vechev, M. (2026). Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents? ETH Zurich. arxiv.org/abs/2602.11988
  • [4] Chepurova, A., Bulatov, A., Burtsev, M., Kuratov, Y. (2025). Wikontic: Constructing Wikidata-Aligned, Ontology-Aware Knowledge Graphs with Large Language Models. arxiv.org/abs/2512.00590
  • [5] Min, C., Mathew, R., Pan, J., Bansal, S., Keshavarzi, A., Kannan, A. V. (2025). Efficient Knowledge Graph Construction and Retrieval from Unstructured Text for Large-Scale RAG Systems. arxiv.org/abs/2507.03226
  • [6] Chung, A., Zhang, Y., Lin, K., Rawal, A., Gao, Q., Chai, J. (2025). Evaluating Long-Context Reasoning in LLM-Based WebAgents. University of Michigan / Amazon. arxiv.org/abs/2512.04307