YouTube

LLMs Are Databases - So Query Them

Chris Hay · published 2026-04-13 · added 2026-04-18 · score 8/10
ai llm transformers interpretability graph-database mechanistic-interpretability

ELI5/TLDR

A large language model is secretly a messy graph database. The bits that store facts (who borders what, what capital belongs to which country) are really just a lookup table wearing the costume of a matrix. Chris Hay built a SQL-like language called Larql that reads the model’s weights directly, runs select and insert queries against them, and can even teach the model that Atlantis’s capital is Poseidon without any retraining.

The Full Story

The claim, stated flatly

Most people picture an LLM as a dense soup of numbers. Chris Hay says pick the soup apart and you find tables. Not metaphorical tables — literal nodes, edges, and relation labels hiding inside the feedforward layers. If that is true, three consequences follow: you can query the weights, you can insert new facts, and you can recompile the whole thing back into a normal model file that any inference engine will accept.

He demonstrates it by connecting Larql to the open-weights Gemma 3 4B model and typing queries such as “describe France” and “select * from edges where entity = France”.

Three layers do three different jobs

Running “describe France” against the model shows a pattern that, according to Hay, repeats in every transformer.

The syntax gathers the context, the edges store the knowledge, and the output commits to the token. And that’s the kind of three-stage architecture of every transformer.

The early layers (L1 to around L13) are figuring out what kind of question you asked. Middle layers (L14 to L27) pull knowledge out. Late layers (L28 onward) commit to a next-token guess. On a France query, middle layers light up with Europe, Italy, Spain, borders, nationality. They also light up — awkwardly — with CEO and fountain.

Features are edges, and edges are shared

A “feature” inside the FFN is just one column of weights: a gate vector that decides when the feature fires, and a down vector that decides what it contributes. Think of it like a switch with a payload. The switch watches the model’s running thought (the “residual stream”) and flips when that thought points in the right direction. When it flips, the payload gets added back in.
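
The switch-with-a-payload picture can be sketched in a few lines. This is a toy illustration, not Larql's code: the 4-dimensional vectors, the function names, and the choice of a GELU activation are all assumptions made for the example, and it keeps only the gate and down vectors described above.

```python
import math

def gelu(x: float) -> float:
    """Tanh approximation of GELU, used here as a stand-in activation (assumption)."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def apply_feature(residual, gate, down):
    """Fire one feature: the gate sets the strength, the down vector is the payload."""
    strength = gelu(dot(gate, residual))   # one scalar per feature
    return [r + strength * d for r, d in zip(residual, down)]

# Toy 4-dimensional residual stream (the summary quotes 2,560 dims for the real one).
gate = [1.0, 0.0, 0.0, 0.0]   # fires when the residual points along axis 0
down = [0.0, 0.0, 0.0, 1.0]   # payload: push the residual along axis 3

aligned = apply_feature([3.0, 0.0, 0.0, 0.0], gate, down)    # switch flips, payload added
opposed = apply_feature([-3.0, 0.0, 0.0, 0.0], gate, down)   # switch stays nearly off
```

When the residual points the same way as the gate, almost the full payload lands on axis 3. When it points the opposite way, the contribution is close to zero.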

The first surprise: features are not dedicated to one fact. Feature 9348 at layer 26 fires for Australia, Italy, Germany, and Spain all at once. That is called polysemanticity — one slot, many meanings.

That feature is not an Australia feature. That’s the key thing. It’s a Western nations feature. The model has compressed multiple countries into one slot because they appear in similar contexts.

The second surprise: the same feature index means completely different things at different layers. Feature 1484 is “planet” at layer 2, “foods” at layer 6, and “state capital” at layer 23. The index is reused. The knowledge is independent.

Why the knowledge looks messy

Each feature produces a single scalar — one number when it fires. The residual stream it is watching has 2,560 dimensions. You cannot cleanly project 2,560 dimensions down to one scalar without losing information, so the model packs unrelated facts into the same slot and trusts the next stage to sort them out.
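
A small example of why one scalar cannot tell inputs apart. The vectors and their labels are hypothetical, chosen only to show that two very different residual states can produce the same gate score.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# A gate vector only "sees" the part of the residual that lies along it.
gate = [1.0, 1.0, 0.0, 0.0]

# Two hypothetical residual states that differ in every coordinate.
state_a = [2.0, 0.0, 5.0, -1.0]
state_b = [0.0, 2.0, -3.0, 4.0]

score_a = dot(gate, state_a)
score_b = dot(gate, state_b)
print(score_a, score_b)  # the feature fires identically for both
```

Everything that distinguishes the two states sits in directions the gate ignores, so the feature alone cannot separate them. Some other part of the model has to.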

It’s not a bug, it’s a dimensionality constraint.

Attention is the thing that picks the right meaning

So how does a messy graph produce clean answers? Attention. When you ask “capital of France,” that phrase creates a specific pattern across all 2,560 dimensions. Attention heads at each layer score that pattern against every feature and decide which ones to amplify and which to ignore. The CEO-and-fountain noise that shares a slot with France-the-country gets suppressed because the query pattern does not point that way.

The features are the edges, attention is the routing. You need both.

The model learned a schema nobody taught it

Running show relations returns 1,489 probe-confirmed relation labels. Manufacturer has 76 features. League has 60. Genre has 52. Language has 46. Capital has 32.

Nobody taught the model the schema. The model learned these categories because that is how the world is structured. Things have makers, places have capitals, people have occupations. The FFN reinvented a relational schema from raw text.

This is the quiet punchline. A transformer trained on next-token prediction spontaneously built something that looks exactly like a relational database, because the world itself is relational.

Inference as a graph walk

Hay’s tool stores the weights in a custom format called V index. When inference runs, it does not do a matrix multiplication against the full FFN. It does a nearest-neighbor lookup — find the features whose gate vectors are closest to the current residual stream, fire those, add their down vectors, move to the next layer.
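
The lookup described above can be sketched as a top-k selection over gate vectors. This is a minimal illustration under stated assumptions, not the V index format: the dot product as the closeness measure, the ReLU activation, and the names `dense_ffn` and `knn_ffn` are all invented for the example.

```python
import heapq

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dense_ffn(residual, features):
    """Matrix-multiply view: every feature gets a chance to fire."""
    out = list(residual)
    for gate, down in features:
        strength = max(0.0, dot(gate, residual))   # ReLU stand-in (assumption)
        out = [o + strength * d for o, d in zip(out, down)]
    return out

def knn_ffn(residual, features, k):
    """Graph-walk view: fire only the k features whose gates best match the residual."""
    nearest = heapq.nlargest(k, features, key=lambda f: dot(f[0], residual))
    return dense_ffn(residual, nearest)

features = [
    ([1.0, 0.0, 0.0], [0.0, 0.0, 1.0]),    # strong match for the residual below
    ([0.0, 1.0, 0.0], [0.0, 1.0, 0.0]),    # weak match
    ([-1.0, 0.0, 0.0], [1.0, 0.0, 0.0]),   # points away, never fires
]
residual = [2.0, 0.1, 0.0]

full = dense_ffn(residual, features)
top2 = knn_ffn(residual, features, k=2)
top1 = knn_ffn(residual, features, k=1)
```

In this toy, `top2` matches `full` only because the skipped feature would not have fired anyway. With `k=1` the weak feature is dropped and the result differs slightly, so how closely a lookup reproduces the dense pass depends on how many features are kept.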

The FFN is the graph, attention is the navigator, and together they produce the forward pass.

The weights are identical to what Gemma shipped. Only the encoding changed. The matrix was the inefficient form; the graph is the honest form.

Editing a model without training

The demo that makes the rest of the video land: the query “infer the capital of Atlantis is” returns garbage, because Atlantis is not in the training data. Then Hay runs:

insert into edges values (Atlantis, capital, Poseidon)

The tool captures the model’s residual state at layer 26 for the canonical prompt, builds a gate vector pointing in that direction, builds a down vector pointing toward the Poseidon token, and writes the triple into a free feature slot. A “balancer” scales the new vectors so the fact lands strong enough to be top-1 but not so strong that it hijacks every other capital query.
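
The steps in that paragraph can be sketched roughly as follows. Everything here is hypothetical: the function names, the 3-dimensional vectors, and the single `strength` number standing in for the balancer are illustration only, and the video summary does not show how Larql actually implements any of it.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    """Scale a vector to length 1."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

def insert_fact(features, slot, prompt_residual, target_direction, strength):
    """Overwrite one free slot with a (gate, down) pair for the new fact."""
    gate = unit(prompt_residual)                            # fire on this prompt's direction
    down = [strength * x for x in unit(target_direction)]  # push toward the answer token
    features[slot] = (gate, down)

def ffn_delta(residual, features):
    """Sum of payloads, each weighted by how strongly its gate fires (ReLU stand-in)."""
    delta = [0.0] * len(residual)
    for gate, down in features:
        fire = max(0.0, dot(gate, residual))
        delta = [d + fire * x for d, x in zip(delta, down)]
    return delta

# Toy 3-dim model: axis 2 stands in for the "Poseidon" token direction.
zero = [0.0, 0.0, 0.0]
features = [(zero, zero)]            # one unused slot
atlantis_prompt = [0.0, 3.0, 4.0]    # hypothetical captured residual
france_prompt = [5.0, 0.0, 0.0]      # hypothetical unrelated residual
poseidon = [0.0, 0.0, 1.0]

insert_fact(features, 0, atlantis_prompt, poseidon, strength=2.0)
atlantis_delta = ffn_delta(atlantis_prompt, features)   # pushed along axis 2
france_delta = ffn_delta(france_prompt, features)       # left untouched
```

The France prompt is unaffected here only because its toy residual is orthogonal to the new gate. In a real model, prompts overlap, which is presumably why a balancer is needed to tune the strength.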

After the insert, “infer the capital of Atlantis is” returns Poseidon at 99.98%, and “infer the capital of France is” still returns Paris at 81%. Nothing bled. Running “compile current into V index” bakes the patch into a standalone weight file that exports back out to safetensors or GGUF and runs in any normal inference engine. No retraining, no fine-tuning, no LoRA.

Why the next video matters more than this one

At the end Hay drops the real implication. If the knowledge store is a graph, and attention is just the router, then the two can live on different machines. The knowledge graph could sit on a remote server. Attention runs locally. You could, he claims, run Gemma 4 31B on a laptop this way. Possibly bigger.

Key Takeaways

  • A transformer’s feedforward network is mathematically equivalent to a graph database. The matrix form is just an encoding.
  • Each “feature” in an FFN layer is a gate-vector (when to fire) plus a down-vector (what to add). That pair is an edge in the graph.
  • Features are polysemantic — one slot holds many unrelated concepts because the residual stream (2,560 dims) has to compress down to one scalar per feature.
  • The same feature index means different things at different layers. Feature 1484 is “planet” at L2, “foods” at L6, “state capital” at L23.
  • Transformers split work into three bands: syntax (early layers), knowledge (middle layers), output commitment (late layers).
  • Attention’s real job is disambiguation — it picks which polysemantic features to trust for the current query by matching across all 2,560 dimensions.
  • Gemma 3 4B has 1,489 distinct relation types that the model discovered during training. Nobody wrote this schema — the world is relational, and the model noticed.
  • Inference can be reframed as a KNN graph walk instead of a matrix multiply. Same weights, different encoding, same output.
  • New facts can be injected by writing a gate-up-down triple into a free feature slot, with a balancer to prevent collateral damage.
  • Model edits compile back to standard safetensors or GGUF, so the technique is portable across inference engines.
  • The deeper implication Hay hints at — knowledge storage and attention can be physically separated onto different machines — would change how large models are deployed.

Claude’s Take

This is one of the more interesting interpretability demos from the last year, partly because it is not just a paper but a working CLI doing things live. The intellectual move — reframing the FFN as a graph database rather than a linear projection — is not new in the research community (sparse autoencoders, dictionary learning, and the whole Anthropic circuits agenda point the same way), but Hay’s contribution is packaging it as select statements and making the equivalence obvious to anyone who has ever touched SQL. That is a real teaching achievement.

The score: 8. It loses a point because Hay is also the vendor. Larql is his tool, V index is his format, and the probe that discovers relation labels is doing real work under the hood that gets hand-waved in the demo. Probes can hallucinate structure — if you go looking for “capital” relations, you will find features that correlate with capitals, and it is easy to mistake correlation for clean separation. The fact that Atlantis injection worked is genuinely cool, but one demo is not a stress test. What happens when you insert 10,000 facts? Does interference scale? He does not say.

It loses another partial point for the closing pitch, which slides from “the FFN is a graph” (solid) to “you can run Gemma 4 31B on a laptop” (speculative bordering on hype). The decoupling argument is plausible in principle, since attention and FFN do different jobs and could be placed on different machines, but the cost of shipping residual streams back and forth at every layer is not trivial, and the video does not grapple with it.
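
A back-of-envelope estimate makes that cost concrete. Only the 2,560-dimension residual width comes from the summary. The layer count, activation precision, generation speed, and round-trip time below are assumptions picked for illustration, and the arithmetic uses the small model's width rather than the larger model from the pitch.

```python
HIDDEN_DIMS = 2_560          # residual width quoted in the summary
NUM_LAYERS = 34              # assumption: layer count is not stated in the summary
BYTES_PER_VALUE = 2          # assumption: 16-bit activations
TOKENS_PER_SECOND = 20       # assumption: target generation speed
ROUND_TRIP_SECONDS = 0.020   # assumption: 20 ms network round trip

# One residual vector goes out and one comes back at every layer.
bytes_per_hop = HIDDEN_DIMS * BYTES_PER_VALUE
bytes_per_token = bytes_per_hop * 2 * NUM_LAYERS
megabits_per_second = bytes_per_token * TOKENS_PER_SECOND * 8 / 1_000_000

# Layers run in sequence, so round trips add up per token.
network_wait_per_token = NUM_LAYERS * ROUND_TRIP_SECONDS

print(bytes_per_token)                     # 348160
print(round(megabits_per_second, 1))       # 55.7
print(round(network_wait_per_token, 2))    # 0.68
```

Under these assumptions the bandwidth is manageable, but the sequential round trips alone cost about 0.68 seconds per token, which is well short of the assumed 20 tokens per second. Latency, more than throughput, is the number the pitch would need to address.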

Still, the central claim lands. Transformers did not learn a blob. They learned a graph. That reframe changes how you think about editing models, pruning them, and serving them.

Further Reading

  • Anthropic, “Toy Models of Superposition” — the paper that formalized polysemanticity and why features share slots
  • Anthropic’s sparse autoencoder work (“Scaling Monosemanticity”) — the mainstream approach to extracting clean features from LLMs
  • Meng et al., “Locating and Editing Factual Associations in GPT” (ROME / MEMIT) — the family of techniques Hay references as “the Memmet technique” for baking facts into weights
  • Elhage et al., “A Mathematical Framework for Transformer Circuits” — the residual-stream picture that Larql relies on
  • Chris Hay’s earlier videos on the residual stream, referenced in passing during this one