The company we documented is not the company we run

The model told me how our billing system worked. The system it described wasn't fully live yet.


The pile becomes the world

We're in the middle of a migration. A small subset of customers is on the new system, but most are still on the old one. Nothing in the billing topic made that parallel state obvious. The old systems were in the knowledge repo, but they didn't show up as part of the answer to how billing works today. So the model resolved the pile into one clean truth: the version with the most documentation.

A couple of weeks ago, I had a few spare hours and Karpathy had just posted his viral GitHub gist about the LLM wiki: basically, a way to build a knowledge base that agents can read. This is something I've been tinkering with in my role at Aiven anyway: how much of the company can we make understandable to AI? So I did the obvious thing and tried to build one for our internal context. I pointed AI at everything I could reach: Slack channels I had access to, Coda pages, dbt models, Omni metric models, system data, ownership notes, whatever I could feed into a script in one sitting. The output was a neat Obsidian repo, deployed to GitHub Pages as a Quartz site, with an MCP server on top. I could query it through Cowork and it felt like magic.

It broke on billing

The answer was confident. It explained the integrations, the ownership, the gotchas. It had a very good read on the new system. It even understood our new ARR model, which we had built ahead of go-live and which nothing was running on yet. If a new hire had read that summary, they'd have come away with a clean mental model of how billing works at our company. That mental model would have been wrong about almost every customer we have today.

The model wasn't hallucinating. It was reading the artifacts I'd given it and building the version of the company those artifacts most clearly described. And those artifacts were fresh. As fresh as they could be. Updated that day, backed by an active Slack channel, architecture docs, ticket trails, metric definitions. The problem was that most of them described the new billing system, because that's the one people were actively building. The old one was just running. There were traces of it in the repo, but almost nothing that said: this is still the system most customers are on. That kind of information disappears in most companies. Not because anyone deletes it, but because nobody wakes up thinking: today I should document the system that has been running for years.

So it picked the version with the most signal and treated it as how things work today.

The map had no edge

This is how we write things down, anyway. Documents accumulate around the work that's currently getting attention. Operations that just run don't generate much of anything. And once you give that pile to a model, the pile becomes the world. It doesn't know what is missing. It doesn't know which parts live only in people's heads. A human can catch one bad answer, sure. But that doesn't fix the failure mode. The same thing happens everywhere, big and small: the model treats the documented version as complete, even when it's only the part that happened to leave a trail.

The deeper failure isn't that the answer was wrong. It's that the model picked one version at all.

The honest answer to my billing question wasn't the new system or the old one. It was that two systems are running in parallel. They don't agree, and the right answer depends on which customer you're asking about. The model had enough signal to hesitate. Instead it picked the cleaner story and presented it as how things work.
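To make "it depends on which customer" concrete, here is a minimal sketch of a lookup that refuses to guess. Everything in it is an assumption for illustration: the customer IDs, the two rosters, and the function name are hypothetical, and nothing like this exists in the wiki I built.

```python
from typing import Optional

# Hypothetical rosters. In practice these would come from wherever the
# migration state is actually tracked, which is exactly the hard part.
MIGRATED_CUSTOMERS = {"cust-001", "cust-002"}   # on the new billing system
LEGACY_CUSTOMERS = {"cust-100", "cust-101"}     # confirmed still on the old one


def billing_system_for(customer_id: str) -> Optional[str]:
    """Return "new" or "old" for a known customer, or None when unknown."""
    if customer_id in MIGRATED_CUSTOMERS:
        return "new"
    if customer_id in LEGACY_CUSTOMERS:
        return "old"
    # Not on either roster: report that instead of defaulting to the
    # better-documented system.
    return None


for cid in ("cust-001", "cust-100", "cust-999"):
    system = billing_system_for(cid)
    print(cid, "->", system if system else "unknown, ask a human")
```

The point of the `None` branch is that "unknown" is a valid answer. The failure I hit was the equivalent of that branch quietly returning "new".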

The complete wiki will never stay complete

A better wiki could have fixed this answer. In principle, sure. Someone could maintain a perfect billing page that captured everything: what is intended, what is running, the manual work filling the gaps, who handles the exceptions, and why they exist. The model would do better. This specific failure would probably go away.

But that is the point. The answer depends on someone maintaining the boring middle forever. The messy transition state. Which customers are where. Which model is live. Which process is designed, and which process people actually follow because the designed one doesn't cover reality yet.

That page doesn't exist for long unless someone keeps doing the invisible maintenance work. Not because people are careless, but because nobody wakes up thinking: today I should document the gap between the system and the business. Every company runs on that layer of human work. People know which exception still matters. They know who to ask. They know which dashboard is technically right but practically misleading. Most of that never becomes an artifact.

So the model read the pile and treated it as the complete picture. The failure is that it had no way to know where the map ended.

The next version has to assume the complete wiki will never stay complete. Whatever it is, it has to work with partial sources, conflicting sources, and suspiciously clean sources. When the company is in the middle of changing, it should not collapse that into one answer. It should say which layer it's reading from, what seems live, what seems planned, and where it can't see enough to be confident.
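As a rough sketch of that behavior, assume every source carried a status tag saying whether it describes something live, planned, or unknown. That tag is hypothetical, as are the `Source` fields, the `summarize` function, and the example titles below. The idea is only that the summary reports each layer and flags conflicts and blind spots instead of choosing a winner.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical status labels; the sources in my wiki carry nothing like this.
LIVE, PLANNED, UNKNOWN = "live", "planned", "unknown"


@dataclass(frozen=True)
class Source:
    title: str
    system: str  # which system the artifact describes, e.g. "old billing"
    status: str  # LIVE, PLANNED, or UNKNOWN


def summarize(topic: str, sources: list[Source]) -> str:
    """Describe what the sources cover per system, without picking one."""
    by_system: dict[str, list[Source]] = defaultdict(list)
    for src in sources:
        by_system[src.system].append(src)

    lines = [f"Topic: {topic}"]
    for system, items in sorted(by_system.items()):
        statuses = sorted({s.status for s in items})
        lines.append(
            f"- {system}: {len(items)} source(s), status: {', '.join(statuses)}"
        )

    # Two systems with live sources means the answer depends on context.
    live_systems = sorted(
        name
        for name, items in by_system.items()
        if any(s.status == LIVE for s in items)
    )
    if len(live_systems) > 1:
        lines.append(
            "Warning: more than one system looks live: "
            + ", ".join(live_systems)
            + ". The answer may depend on the customer."
        )
    if any(s.status == UNKNOWN for s in sources):
        lines.append("Warning: some sources have no known status.")
    if not sources:
        lines.append("Warning: no sources found; the map may end here.")
    return "\n".join(lines)


# Made-up example inputs, loosely shaped like the billing situation.
example = [
    Source("Architecture doc", "new billing", LIVE),
    Source("ARR model definition", "new billing", PLANNED),
    Source("Migration ticket trail", "new billing", LIVE),
    Source("Legacy runbook", "old billing", LIVE),
    Source("Old invoice notes", "old billing", UNKNOWN),
]
print(summarize("billing", example))
```

Note that the count of sources is printed but never used to rank anything. Three documents about the new system against two about the old one says where attention went, not where the customers are.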

That is the next thing I want to figure out: not how to make the pile bigger, but how to show where it ends.