Article

Simplest Hash Functions

published 2026-04-06 added 2026-04-12 score 8/10
computer-science hashing algorithms performance data-structures

ELI5 / TLDR

Hash functions turn data into fixed-size numbers — like assigning every book in a library a shelf number based on its contents. This article asks: how dumb can you make a hash function before it stops working? Turns out, for non-adversarial data (nobody’s actively trying to break it), you can get surprisingly far with just addition and multiplication. The tradeoffs are real but knowable, and the article maps them with actual collision counts and bit-level analysis.

The Full Story

The Setup: Why Go Simple?

Cryptographic hash functions like SHA-256 are built to withstand attackers. But most hash tables in everyday code don’t face attackers — they face dictionaries of words, lists of domain names, configuration keys. The author’s question: if nobody’s trying to sabotage your data, how bare-bones can a hash function get while still doing its job?

The job, specifically, is distributing keys across buckets in a hash table with minimal collisions. Not security. Not integrity verification. Just: spread things out reasonably well.

The Addition Hash: As Simple As It Gets

Imagine chopping your input into 32-bit chunks and just adding them up, letting the sum overflow naturally. That’s the addition hash. It reacts to insertions, deletions, and bit changes — but it can’t tell AB from BA. Reordering the chunks gives the same hash.

Fix: rotate the accumulated value before each addition. Now the operation isn't commutative — order matters. But this only fixes the ordering problem; addition on its own still diffuses bits poorly.
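
The two variants can be sketched in a few lines (helper names here are mine, not the article's):

```python
MASK32 = 0xFFFFFFFF

def chunks32(data: bytes):
    """Yield the input as little-endian 32-bit words, zero-padding the tail."""
    for i in range(0, len(data), 4):
        yield int.from_bytes(data[i:i + 4].ljust(4, b"\x00"), "little")

def add_hash(data: bytes) -> int:
    """Plain addition with wraparound: commutative, so chunk order is lost."""
    h = 0
    for w in chunks32(data):
        h = (h + w) & MASK32
    return h

def rotl32(x: int, r: int) -> int:
    return ((x << r) | (x >> (32 - r))) & MASK32

def rot_add_hash(data: bytes) -> int:
    """Rotate the accumulator before each add: the mix is no longer commutative."""
    h = 0
    for w in chunks32(data):
        h = (rotl32(h, 5) + w) & MASK32
    return h

ab, ba = b"AAAABBBB", b"BBBBAAAA"            # same 32-bit chunks, swapped
assert add_hash(ab) == add_hash(ba)          # plain addition can't tell them apart
assert rot_add_hash(ab) != rot_add_hash(ba)  # rotation makes order matter
```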

Testing on 3,579 paragraphs (average 217 bytes) hashed into 65,536 buckets: the addition hash produces 1,643 collisions; SHA-256 gets 1,530. Not bad — within spitting distance of the gold standard on long, varied text.

Where It Falls Apart: Short, Structured Data

Switch to 5,000 domain names averaging 12 bytes and the picture changes fast. SHA-256: 2,987 collisions. Addition hash: 5,471. Almost double. The problem is that ASCII characters only use specific bit patterns — lowercase letters all share certain high bits. Short strings don’t have enough entropy to wash that structure out through addition alone.

Think of it like mixing paint. Long paragraphs are twenty different colors — add them up and you get a uniform muddy brown. Domain names are three shades of blue — add them and you still get blue.

Foldmul: One Clever Trick

The fix is a technique called folded multiplication, or foldmul. Take your accumulated value, multiply it by a large constant to get a 128-bit result, then XOR the top 64 bits with the bottom 64. This is like putting your paint through a blender — multiplication spreads information across all bit positions far more aggressively than addition.
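
The operation itself is tiny. A sketch, where the multiplier is an illustrative large odd constant, not necessarily the one the article uses:

```python
MASK64 = (1 << 64) - 1
K = 0x9E3779B97F4A7C15  # illustrative large odd constant

def foldmul(x: int) -> int:
    """Widen by multiplication, then fold the high half back into the low half."""
    p = x * K                      # Python ints hold the full 128-bit product
    return (p & MASK64) ^ (p >> 64)

# A one-bit input difference now touches bits across the whole output word:
delta = foldmul(0x61) ^ foldmul(0x60)
assert bin(delta).count("1") > 8
```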

With foldmul as a finalization step, domain name collisions drop from 5,471 to 3,199 — much closer to SHA-256’s 2,987.

Measuring Quality: Bit Probabilities and Correlation

The author goes deeper than collision counts. For a perfect hash, each output bit should be 1 exactly 50% of the time, and no two bits should be correlated. On paragraphs, the addition hash stays within 3% of ideal — solid. Bit independence holds at roughly 4% deviation, with one 8% outlier.

On domain names, correlation explodes to 53%. The bits aren’t independent at all — they’re echoing the structure of the input. Foldmul brings this down dramatically but not perfectly.
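
The bit-probability statistic is straightforward to compute yourself. A sketch, using the well-known splitmix64 finalizer as a stand-in for the article's hashes:

```python
MASK64 = (1 << 64) - 1

def splitmix64(x: int) -> int:
    """Stand-in mixer with strong diffusion (not the article's hash)."""
    x = (x + 0x9E3779B97F4A7C15) & MASK64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return x ^ (x >> 31)

def bit_bias(hashes, bits=64):
    """For each output bit, |P(bit == 1) - 0.5| over the sample."""
    n = len(hashes)
    return [abs(sum((h >> i) & 1 for h in hashes) / n - 0.5)
            for i in range(bits)]

sample = [splitmix64(i) for i in range(10_000)]
assert max(bit_bias(sample)) < 0.05   # every bit stays close to the ideal 50%
```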

The Avalanche Question

A hash function has good “avalanche” if flipping one input bit flips roughly half the output bits. Foldmul doesn’t achieve this perfectly because carry propagation in multiplication is deterministic, not random. The constant 0xaaa...aaa (alternating bits) theoretically maximizes carry spread, but even then, it’s not uniform.
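
Avalanche is easy to spot-check: flip each input bit in turn and count how many output bits flip. The sketch below uses a single foldmul pass with the alternating-bit constant mentioned above; the uneven flip counts it produces are exactly the imperfection being described. Names and the sample input are illustrative.

```python
MASK64 = (1 << 64) - 1
K = 0xAAAAAAAAAAAAAAAA  # the alternating-bit constant discussed above

def foldmul(x: int) -> int:
    p = x * K
    return (p & MASK64) ^ (p >> 64)

def flips(h, x: int, bit: int) -> int:
    """How many output bits change when input `bit` of `x` is flipped (ideal ~32)."""
    return bin(h(x) ^ h(x ^ (1 << bit))).count("1")

counts = [flips(foldmul, 0x0123456789ABCDEF, b) for b in range(64)]
# A single foldmul pass is visibly uneven: different bits cascade differently.
assert len(set(counts)) > 1
```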

The practical solution: chain two foldmul operations. This is what foldhash does, and empirically it works well — though the author admits the mathematical justification remains somewhat mysterious.

The Punchline: Hash Tables Should Meet Hash Functions Halfway

Here’s the sharpest insight. Modern hash table implementations like Rust’s HashBrown and Google’s Abseil use open addressing with power-of-two bucket counts, which means they look at the lowest bits of the hash. But multiplication naturally concentrates entropy in the highest bits. So these tables force hash functions to do extra work redistributing bits downward — work that wouldn’t be necessary if the table just looked at the high bits instead.

Separate-chaining tables that use the top bits can get away with far simpler hash functions. The author argues the industry has collectively chosen a hash table design that demands unnecessarily complex hashing.

The Recipe

For non-adversarial use cases, the author’s formula is three steps: mix inputs together (addition works), apply an avalanche pass (foldmul, CRC, or even a byte-swap), and tailor the finalization to how your hash table actually indexes.
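
Put together, the recipe fits in a few lines. This is a sketch of the shape, not the article's exact function; the constant, rotation amount, and key set are all illustrative, and the finalization assumes a table that indexes with the top bits:

```python
MASK64 = (1 << 64) - 1
K = 0x9E3779B97F4A7C15  # illustrative odd constant

def foldmul(x: int) -> int:
    p = x * K
    return (p & MASK64) ^ (p >> 64)

def simple_hash(data: bytes) -> int:
    h = 0
    for i in range(0, len(data), 8):
        chunk = int.from_bytes(data[i:i + 8].ljust(8, b"\x00"), "little")
        h = (h + chunk) & MASK64                  # step 1: mix by addition
        h = ((h << 13) | (h >> 51)) & MASK64      # rotate so order matters
    return foldmul(h ^ len(data))                 # step 2: one avalanche pass

def bucket(h: int, table_bits: int = 10) -> int:
    return h >> (64 - table_bits)                 # step 3: index with top bits

names = [f"host{i}.example.com".encode() for i in range(1000)]
used = {bucket(simple_hash(n)) for n in names}
assert len(used) > 500    # 1,000 keys spread well over a 1,024-bucket table
```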

Claude’s Take

This is an unusually well-structured deep dive that manages to be both rigorous and readable. The author doesn’t just benchmark — they explain why each approach works or fails using bit-level analysis, which is rare even in academic treatments of hashing. The progression from “just add stuff up” to “here’s exactly where that breaks and here’s one multiplication that mostly fixes it” is genuinely instructive.

The sharpest contribution is the argument that hash table design and hash function design are co-dependent, and the industry has locked itself into a local optimum where open-addressing tables demand unnecessarily strong hash functions. That’s a real insight, not a theoretical curiosity.

Two caveats worth noting: the article explicitly warns against production use without verifying your data is non-adversarial, which is responsible. And the “mysterious” effectiveness of chained foldmul is an honest admission — the empirical results are ahead of the theoretical understanding.

Score: 8/10. Dense, original, well-evidenced. Loses points only because some of the bit-level analysis in the middle section requires comfort with binary arithmetic that even the clear writing can’t fully bridge, and the practical applicability is narrow (retrocomputing, embedded systems, static datasets). But for anyone who’s ever wondered what’s actually happening inside a hash function, this is one of the better explanations around.

Further Reading

  • foldhash — the production hash function the author references that chains two foldmul operations; used in Rust’s ecosystem
  • rapidhash — a fast non-cryptographic hash function discussed as a real-world comparison point
  • “The Art of Computer Programming” Vol. 3 by Donald Knuth — the canonical treatment of hashing, including multiplicative hashing theory
  • Robin Hood hashing / Swiss tables — the open-addressing designs (HashBrown, Abseil) the author critiques for demanding strong low-bit entropy