Advanced Data Structures - Lecture 01 (Summer 2026)

ELI5/TLDR

Lecture one of a masters-level course on advanced data structures at the University of Marburg. After the housekeeping (tools, tutorials, a polite “please don’t let ChatGPT do your homework”), Sebastian Wild walks through the motivation for the course: a handful of short stories about data structures that built billion-dollar companies (Akamai, Tokutek), enabled entire fields (genome mapping), or quietly hold the internet together (Merkle trees in git and Bitcoin). Then he sets up the first real topic: why the beloved balanced binary search tree, despite being “correct,” is annoying enough in practice that researchers have spent decades chasing simpler alternatives. The last stretch starts defining a “random binary search tree” rigorously so the next lecture can analyze it.

The Full Story

Why data structures matter, in three stories

Wild opens with a claim that sounds corny but lands: a single clever data structure can change the world. Three examples.

Akamai and consistent hashing. A content delivery network has to decide which server, out of many, stores which piece of data. The naive approach (hash the data, take modulo the number of servers) dies the moment a server crashes or you add capacity, because almost every piece of data suddenly belongs to a different server. Akamai’s trick, born at MIT: imagine a circle with circumference one. Hash the data to a point on the circle. Hash each server to a point on the circle too. Each piece of data belongs to the next server you hit walking clockwise. Now if a server dies, only the data pointing at that server moves — everyone else stays put. Clean, cute, built a company.

Think of it like a game of musical chairs where every player and every chair has an assigned spot on a circular track. When a chair disappears, only the players sitting in that chair scramble. The rest of the room doesn’t notice.

Tokutek and write-efficient B-trees. B-trees are the workhorse index of every serious database. Problem: each insert walks the whole tree down and writes. That’s a lot of disk activity for one small update. Tokutek batched the updates into buffers at each node and flushed them in bulk, playing nice with how real memory works. Got bought, made its founders rich.

Merkle trees in git and Bitcoin. A Merkle tree lets you check “are these two big piles of data identical?” by comparing a single hash at the top. Git is built on these. Wild’s dry side-note: git has a theoretical bug — it uses the hash of a file as the file’s name. If SHA-256 ever collides, git would treat two different files as the same. Mathematically troubling. Practically, if SHA-256 breaks, git’s the least of our problems.

A chat assistant he quizzed for this lecture told him git has “mathematical certainty” that hashes never collide. They don’t. Infinite inputs, finite outputs — pigeonhole says collisions must exist, we just never find one. The AI confidently hallucinated, and this becomes his segue into the course’s AI policy: fine for background reading, banned for exercises. His analogy: cheating in a tutorial is like doping in a training session. There’s nothing to win, and the exam has no AI, so you just hurt yourself.

The dictionary problem, recapped

Before any of the fancy stuff, a quick refresher on the abstract data type the whole course revolves around.

A dictionary (or symbol table, or map, or Python dict) lets you put a key-value pair in, look up a value by key, delete by key, ask whether a key is present. Bread and butter.

If the keys have an order on them, you get bonus operations:

min / max — the smallest or largest key
predecessor / successor — the nearest key smaller or larger than a given key
rank(k) — how many keys are smaller than k
select(i) — give me the i-th smallest key

Rank and select together behave like a sorted dynamic array. You can insert anywhere in the middle without shifting half the elements around, which is what you’d have to do in an actual array.

Imagine a library catalog where you don’t just want to look up “is this book here?” but also “which book is 47th in alphabetical order?” or “how many books come before ‘Moby Dick’ alphabetically?” That’s rank and select.

Binary search trees, undusted

A binary search tree is a tree where every node has a left child and a right child, and everything in the left subtree is smaller than the node, everything in the right subtree is larger. Search walks down the tree. Insert finds the leaf where the new key would have been, drops it there. Delete has an irritating special case: if the node has two children, you can’t just remove it, so you swap it with its in-order successor (or predecessor) and recurse — but you only ever recurse once, because that successor is guaranteed to be a leaf or have one child. This is called Hibbard deletion, after the person who first wrote it down.

Add a small piece of bookkeeping — each node stores the size of its own subtree — and you can now do rank and select as well. This is called an augmented binary search tree.

The height problem

How fast are all these operations? They all walk a path from the root to somewhere. The longest such path is the height. So everything is bounded by the height. If you’re unlucky and insert keys in sorted order, the tree collapses into a long chain and the height is n — linear, terrible. If you’re lucky, it’s log n.

For a completely unbalanced tree built from random insertions, the height turns out to be close to log n with high probability. That phrase deserves its own pause. “High probability” has a precise meaning in this world, and there are several flavors:

1 − o(1): the weakest. The probability converges to 1 as n grows, but no promise about speed.
1 − 2^(−n): exponentially strong. Basically never happens.
Something in between — the usual convention — is 1 − 1/n^c for some constant c.

Balanced BSTs and their discontents

If you don’t trust the input to be random, you can force balance: AVL trees, red-black trees, and a zoo of newer cousins. They all use rotations — a local flip of three pointers that changes the tree’s shape without breaking the search property. Single rotations, and when you need to move a middle subtree up, double rotations.

Think of it like rearranging a small pile of books on a shelf so everything stays sorted but nothing’s stacked too tall.

All these schemes guarantee O(log n) height at all times. They differ in the constants — AVL is tighter than red-black, for example. Nobody can prove what the expected height is when you build them from random insertions. Experiments show something like 1.01 × log₂(n), but it’s an open problem. Wild slips this in partly because he wanted an open research problem in lecture one.

So, problem solved? Not quite. His complaint: the cleanest red-black tree implementation he knows (Sedgewick and Wayne’s) is 750 lines. Strip it to the essentials and it’s still four screens of irritating case-distinctions — color flips, left rotation, right rotation, double rotations, delete (which is even worse). Memorize-for-a-desert-island scoring: poor. This is the whole motivation for the course.

The catch, given up front: nobody, not even Turing award winners like Robert Tarjan, has found a substantially simpler deterministic solution with the same worst-case guarantees. The alternatives coming up in this course will either be randomized (uses internal coin flips) or amortized (individual operations may be slow, but a sequence is fast on average). Those are the two knobs you trade balance-code-complexity against.

Average case vs. randomized: a distinction worth keeping straight

These two words both have “random” in them. They are very different things.

Average case analysis: your algorithm is deterministic. You assume the input is drawn from some random distribution, and you compute the expected cost. Run the same input twice, you get the same behavior.

Randomized algorithm: your algorithm flips coins internally. The input can be adversarial — someone trying to make you fail. Run the same input twice, you get different behavior each time.

Randomized is stronger as a guarantee (no assumption on the input) but makes practitioners twitchy because performance isn’t reproducible. Wild notes that standard libraries often won’t ship randomized data structures for exactly this reason, even though they’d be simpler and sometimes faster. Third-party libraries built inside tech companies often use them anyway.

Random binary search trees, defined

The course’s first real topic: what happens if you build an unbalanced BST by inserting keys in a random order? This is the “world is nice to us” assumption — a warm-up before building data structures that force this niceness to happen no matter what.

Formal definition: take the integers 1 through n, shuffle them into a uniformly random permutation (each of the n! orderings equally likely), insert them one by one. That’s a random BST.

Small example, n = 3. There are 3! = 6 permutations. They produce only 5 distinct trees — because the permutations [1,3,2] and [2,3,1] both happen to build the same balanced “cherry” tree (root 2, children 1 and 3). So the balanced tree has probability 2/6, the others 1/6 each.

A student notices: the other four trees look pretty unbalanced. Doesn’t that mean random BSTs are usually bad? Wild’s answer: n = 3 is unrepresentative. The number of maximally unbalanced trees (strict chains) grows like 2^n, because at each step you either attach the new node as a min or a max. But the total number of binary tree shapes on n nodes is the Catalan number, which grows like 4^n / n^(3/2). So unbalanced chains become exponentially rare as n grows. Meanwhile, n! grows faster than any exponential (Stirling: n! ≈ (n/e)^n), so many different insertion orders fold down to the same tree shape. The balanced ones absorb more permutations than the skewed ones.

Two small lemmas to make it all work

Lemma 1: Building a random BST step by step matches the distribution you’d get by fixing n up front. More precisely: if T is a random BST on n keys, and you insert a fresh key into a uniformly-chosen one of the n+1 gaps between existing keys (with appropriate relabeling), the result is distributed exactly like a random BST on n+1 keys. Proof is by a small bijection argument on permutations. This matters because it means “random BST” is a sensible, consistent object even when you build it incrementally.

Lemma 2: Instead of inserting integers 1..n in random order, you can insert n independent uniform reals from [0, 1]. Same distribution of tree shapes. Reason: the tree only ever compares keys, never reads their actual values, so integers and reals look identical to it. And two random reals are equal with probability zero, so there are no ties. Why bother? It’s often more convenient in proofs — each new insertion is independent of the past, and the labels never need renaming.

A probability refresher

The last stretch was a whiteboard-style recap of the probability vocabulary the next lecture will lean on: sample spaces, events, probability measures, the union bound (Pr[A∪B] ≤ Pr[A] + Pr[B]), independence (Pr[A∩B] = Pr[A]·Pr[B]), random variables, indicator variables, iid sequences. A student asks a good question: in practice, how “truly random” do the coin flips need to be? Wild’s answer: often you only need k-wise independence — any k of your coin flips look independent, but larger subsets might have structure. For many data structures a small k suffices. For quicksort’s random pivots, something like 5-wise independence is enough. The standard analysis usually assumes full independence, then a meta-analysis checks what you actually used.

Key Takeaways

A dictionary or symbol table supports put/get/delete/contains. If keys are ordered, you also get min, max, predecessor, successor, rank, select.
Rank(k) = how many keys are smaller than k. Select(i) = the i-th smallest key. Together they turn a sorted structure into an insert-anywhere dynamic array.
An augmented BST stores subtree sizes in each node, enabling O(height) rank and select.
BST height depends entirely on insertion order — somewhere between log n and n.
AVL trees and red-black trees guarantee O(log n) height via rotations. AVL is tighter. Both are implementationally painful.
Rotation = local pointer flip that changes tree shape while preserving the BST property. Single rotation moves one node up. Double rotation moves a middle subtree up.
High probability usually means probability at least 1 − 1/n^c for some constant c. Weaker variants: 1 − o(1). Stronger: 1 − 2^(−n).
Average case = deterministic algorithm on random input. Randomized = randomized algorithm on worst-case input. Different assumptions, different guarantees. Randomized is stronger but less reproducible.
Random BST on n keys = insert 1..n in a uniformly random permutation. Equivalently (same distribution), insert n iid uniform reals from [0, 1].
Count of BST shapes on n nodes = nth Catalan number ≈ 4^n / n^(3/2). Count of insertion orders = n!. The latter dominates, so many permutations produce the same tree — and balanced shapes win probability mass.
Building a random BST one key at a time (inserting into a uniformly chosen gap) produces the same distribution as fixing n up front.
k-wise independence is often enough for randomized data structures to work in practice; you don’t always need full independence.
Three real-world data-structure wins: consistent hashing (Akamai/CDNs), write-efficient B-trees (Tokutek/databases), Merkle trees (git, Bitcoin). Also: compressed text indices underpin modern DNA sequencing.

Claude’s Take

This is a competent opening lecture that doesn’t pretend to be more than it is — mostly housekeeping, mostly recap, closing with the first real definitions of the course. Wild has a nice dry register and a good instinct for the one concrete story that lands (the consistent hashing sketch is genuinely elegant, and he resists the temptation to over-explain it). The AI section is refreshingly honest about what these tools are and aren’t, and he frames the no-AI-on-tutorials rule as self-interest rather than moralism, which is smarter pedagogy than most professors manage.

The substance is uneven in a way that’s hard to fault, because it has to be. The first half is organizational throat-clearing. The middle third is a decent recap for students who already know BSTs — if you don’t, you won’t learn them here. The final third starts laying down the probability scaffolding and the random-BST definition carefully, which is where the real course begins. Tomorrow’s lecture will apparently do the actual expected-height analysis; this one sets up the table.

Score: 7/10. Good teaching voice, useful motivation, legitimately interesting open problem dropped in casually (expected height of AVL trees built from random insertions — unknown constant). Loses points only for being a first-lecture first-lecture: lots of it is admin, and the new material is thin. If you’re curious about randomized data structures, bookmark the whole series — this is the on-ramp, and Wild seems likely to deliver on the rest.