Transcript: Advanced Data Structures Lecture 01

TITLE: Advanced Data Structures - Lecture 01 (Summer 2026) CHANNEL: Sebastian Wild (Lectures) DATE: 2026-04-14 ---TRANSCRIPT--- Okay, watch for that. Amazing. Ready almost on in time. Uh despite being in the wrong room first, I I swear to God that Marvin at some point said lecture hall lecture hall 3. Uh it might be a few months ago. Uh, either way worked.

Okay, welcome everyone. Um, this is advanced data structures. Um, I’ll tell you more about what that is. Um, it’s probably a non-compulsory module for everyone. So, uh, I have to win you over to stay here. I’ll try my best. On the other hand, um, let’s not rush ahead of ourselves. Um, I wanted to kick us off a bit with an interactive part so that uh, everyone’s joining. Uh, in principle, the module is open to various courses of studies. It sounds like it’s mostly computer science, but let’s wait. It’s a few more people in the room now.

Um, this the system that you see here, Slido, I’ll say a few more more words about it in a second. Is something I’ll regularly use in the lecture once in a while to get get your thoughts on some questions. Uh, you can also use it to ask questions back to me. Uh, we’ll come to that.

Okay. Oh, how lovely. The clever AI combined your answers into computer science and MSE almost. Right.

Yeah. Right. We’re all scientists at the end of the day. There’s so many different ways to type. So, I think the official spelling of the abbreviation of Master of Science is M.Sc. dot. If you leave out the dots, that’s British. If you put in the dots, that’s American. I think a space between M dot and SC dot is somehow just uh uh the same but more spacious. I haven’t seen that as much. I don’t know. Thanks for trying all the different combinations.

One or two bachelor students as well. Okay. So, mostly computer science, few data science, mostly masters, few bachelors. Uh, was there any more esoteric ones? Let’s see. And we’re 13. There could be people online. Um, and indeed, hi there guys at home or wherever you’re joining from. Who knows? Uh today is not so lovely outside but the summer term sometimes that’s an option. As you may have seen or as you may know um I’m trying to live stream the lectures uh as a best effort on a best effort basis. Sometimes things go wrong like in the morning the undergrad lecture somehow had both mics enabled and the audio is a bit annoying. Uh that happens. Sometimes things entirely break. Hopefully not all the time. And whenever it works, you have some extra benefit of uh joining from elsewhere or watching the recordings if you want.

Okay. Uh thanks for playing along in in the little slidoh question. Um I’ll come back to that. Once again, welcome to advanced data structures. In the first part, I’ll quickly run through some organizational bits. If you’ve done modules with me, then uh this will be similar or will be familiar to you. Uh if not, I’ll quickly tell you some of the things we I I like to use and why, but I won’t spend uh a ton of time on it so that we can also get a glimpse of the actual material and you can make your decisions.

I always try to have one slide with all the key information. Uh, Sebastian Wild, that’s me. Um, Tomio Nakashima, my postto is not here, but he’ll take he’s not here today. He’s traveling this week. Uh, but he will take care of of the tutorials from next week onwards. Um, probably will will join the lecture and say hi at least once.

All the key information is on the module website. And maybe before we move on, let me show you what that looks like. I suppose uh people know how to navigate our uni website. Um you can find us from our department, the working groups at the very bottom. It’s alphabetic so my name comes last. There’s the algorithms group and from there you can go to teaching. From there there’s a link here. You can also find it in other ways. Uh so that’s the module website. The important part is this table which has the topics um that I plan to cover. It might change a little bit and the timing might change a little bit. Uh once we get there, there will be exercise uh sheets that pop up in the same table. And uh important part is you can click on these uh subunits once they’re filled with life and that’s where the slides are available and where the recordings will pop up. and sometimes more things like uh extra sources specifically to that unit I’ll I’ll put there as well. Um and some extra links down here. We’ll come back to that.

So, anything that can will be on the public website. I just find it more convenient myself and hope that you do too. Uh we’ll also use Campus Wire for a Q&A forum. This is a fairly small group so maybe it’s a bit overkill but it doesn’t hurt. It’s nice markdown and latte formulas and uh you can help each other. Um I’ll show you a bit later what this is. Uh what you need for joining is your marborg uh email and this pin otherwise the system doesn’t let you in. You’ve already seen slido for the questions and the other part is we’ll have some final exam. I guess from the group size, oral exams would be my preference. Uh to be admitted to the exam, you have to achieve 50% of the points.

I’m going to go through all of this a little more slowly and a little more in detail, but that’s basically the key information.

Um so the goals for today are to see a bit what’s this module is about and how we approach it.

Um slider was the first thing uh you’ve already seen the polls that’s what how I asked you about your courses of studies the more usual one is a multiple choice question but you can also do free text as you’ve seen you can use the same tool also for asking questions uh sometimes that’s uh convenient to just collect questions and then go over them uh at the end entire uh together I mostly like to do a slider to keep you awake once in a while. Since we’re a small group, it’s may it may not be as as important.

Um, for curiosity, let’s see what kind of tools people know. For some of them, I very much hope that we’ll see essentially 100%. Some others uh I care less. I’m just curious. Anyone still thinking we had 13 before, but maybe no need to uh wait forever. I’m very happy to see that people know Elas. There’s a course a module or a course for for this module uh which you should sign up for because that’s where you will hand up hand in your exercises. I I guess people seem to know Slido more. Okay, now it’s tied. Fun. Uh, cool. So, people also know chat GPT. I’m not so surprised. Not so many know campus wire. Um, hope we’ll change that. And then there’s a few other uh similar LLM based systems. Uh, of course there’s tons more, right? So, it’s a bit uh it’s annoying to keep up. It’s fun how Chachi beauty is still somehow dominant.

Uh any part you want to comment on this? Any specific reasons or is just uh whatever you try first you you’re stuck with? Any any thoughts you want to share on this?

That’s very true. Yeah. Yeah. Uh they were first. Um they’re still good, I guess. Uh ah there you go. Depends, I guess, what you what you want to do.

Okay. Uh I mean this this question was just an excuse to wake you up again a bit.

Let’s uh have a look at the content of the module. Um, my goals are to show you some just beautiful ideas in data structures that are somehow clever and and insightful. Um, make something possible that isn’t at first clear that it would be possible, but then with the right approach, it works. Um, the hope is that after seeing some examples of this, you’ll also be able to do this on new problems of a similar type. and it’s a master’s uh specialization module. So the goals should also be to bring you close to the state-of-the-art so that you would be uh equipped to do research in this area or I mean get close enough that with a bit of further reading you are there or depends a bit how how strictly you want to interpret that sentence but it’s a good goal to have.

Uh here’s the the rough outline of things I have planned. This is potentially a bit ambitious. I might have to thin out a few of the topics at the end if we run out of time. I was too enthusiastic at this point to kick them out right away. Uh but if there’s uh some of the last three that you’re extremely keen to see, maybe uh you you’ll have to let me know. Um we’ll roughly start with fairly elementary data structures. Well, that’s what we maybe start a little bit today and tomorrow um alternatives to the good old binary search trees and um maybe also the the reasons why you would maybe not want to use them sometimes.

Uh that’s the first two chapters in a way. Uh for for the experts if you’ve seen some of this uh look at at trips as one and at play trees at another but I’m not sure if you’ve seen the full glory of the um theoretical guarantees that they have.

Um I’m sure you’ve seen some priority cues, binary heaps, etc. Maybe someone has tortured you with Fibonacci heaps as well. Uh there’s a few more out there that are I think maybe more elegant. Um persistence is kind of time traveling for data structures. Very cool. Uh well this a fairly light topic. Sounds sounds harder than it is. And then we get into some a bit more advanced and more specialized topics. There’s data structures that are better than the general purpose data structures. If you know that you’re working with integers with numbers and assume uh the word RAM model again for the experts we’ll cover all of this in uh in slow pace and detail.

The next few are um something from closer to my research area space efficient data structures. Uh again, there’s a few that are very very clean and nice to see as uh an introduction and then we’ll try to get a bit closer to the state-of-the-art in some areas. And the last topics are a bit about the machine models that we otherwise assume in in algorithms and how they are maybe um how they need to be adapted. Okay, that’s the that’s the rough plan. The goal would be to cover roughly one of these units per week. And uh that might be ambitious. We’ll see.

In all of this, this is a list of areas. Uh in all of this in a way orthogonal to the topic is you can you can focus on what works well in implementations or you can look at what is the theoretically asmtoically best possible. I’ll strive for the second most of the time because that’s uh it’s still a theory specialization but we’ll also look a bit at uh implementation concerns say uh so from the data structures that work well in theory or have good guarantees in theory we’ll will look a bit a bit like um at what they could be implemented like and to maybe motivate a bit uh the module in general and data structures in general.

Let me tell you a few stories of where data structures really saved the day. Um, two are examples of tech startups that completely centered around the data structure. Maybe you even know about some of these. Akamay is is a I guess one of the first companies that started something like these content delivery delivery networks that now are completely responsible for you being able to stream efficiently. Uh a lot of a lot of uh data that passes through uh our backbones is essentially video streaming. And if we would send say all the Netflix series that all of us watch or any pick any other streaming service if they would all go through the uh transatlantic cables they would be completely clocked.

So instead the content delivery networks set up local data centers closer to you say somewhere in Germany and cache the data that you would want to access. But this is highly non-trivial how to do this efficiently in a way that adapts to how people access uh different parts of the data and uh without hard coding this. So it’s a somewhat recent thing if you read up about the story. uh some of the first was was Akcomi um I think it’s a startup out of MIT and the first kind of theory idea that that kickstarted it was a a type of hashing that is used to uh handle which server stores which piece of data in the abstract. And the challenge that they had to overcome there was that you uh you want this to be resilient if there’s servers coming and going because they crash or you want to add more to increase your capacity. If you do that in the standard hashing way with some say a hash number hash hash value then modulo the number of servers then whenever you add a new server a lot of the hash codes change and so all the servers would have to be exchanging the data because they’re now responsible for different pieces of the data. That doesn’t work at all. And so consistent hashing the first first piece that they uh invented is a way uh to circumvent that. And I can even sketch the ideas. It’s cute and simple. Uh you imagine a unit circle and now you hash both the data and the servers somewhere randomly onto this um in you know a real interval between zero and one. we just deliberately say this is this is zero and then it goes around here from zero to one.

You have a random position on this circle for each data item and for each server and then the consistent hashing scheme just says the data item always goes to the next server in cyclic order. And now if you add another server or if a server drops out some of the data has to move to a different server but most don’t. Very simple clever idea. Now it’s a bit more there’s a bit more to it because they also need to deal with uh unequal load but let’s leave it at that. Essentially a data structure that kickstarted this uh large company.

Another example on a slightly different scale uh less well-known is Toku Tech. It doesn’t exist anymore in this in this way. It’s been bought up several times and now things get confusing. Uh but it it was originally a database engine. So there’s just the storage part and you could use this uh on one of the uh one of the large well-known databases. Um and the innovation here was um a rightefficient B tree. So B tree as you know is the breadand butter data structure for indices indexes in data structures. Uh but if uh if you insert something you always move down the entire tree and then change something. So it’s a lot of writing for a single update. And you can try to cache these updates and bulk update the nodes that’s at the base of this plus then cache oblivious etc. uh properties that work with the memory better and at least here I know I know the people that founded it and that they made uh a nice nice uh amount of money out of it.

Some other examples um are not specific companies but entire applications that essentially became possible because of a data structure. One example is is uh genomewide read mapping, DNA mapping. That’s what you’ve essentially done or had done for you for a PCR test during the pandemic and what you still do with uh genomic analysis. uh and for that essentially the the data structure behind it is a compressed text index of of the sort. We’ll we’ll come a bit close to this but if you want to see this specifically in detail then you have to wait for the next round of algorithms in bioinformatics.

And the last example are merkellet trees that maybe people know are used both in in git the version control system and in bitcoin is a way to efficiently check consistency of data. Uh I think lenos towell said that this is a without this nothing would work in git. It’s entirely focused around this. I also learned when researching into this that git is fundamentally fun uh in that it completely relies on cryptographic hashing never colliding. If uh if you somehow can make uh the SHA 256 hash collide then git breaks because it treats two different files as the same. is essenti git is using the hash of your file as the name of that file. Um that’s uh mathematically speaking a bit troubling. On the other hand, the chances I mean if if cryptographic hashes are broken, we have other problems than git no longer recognizing our files. Maybe it’s a it’s fair game.

Yeah. Well, at least here we can quantify that it’s extremely unlikely because the r this the size of the hash uh is very big and nobody is seems to have found a way to force a collision but it happened before with the previous sha so it’s not an impossibility.

I’m trying to show you that uh although we focus on the theory of data structures, the gap between theory and practice is not that big. A good data structure can when implemented right quickly changed the practice of computing and that’s what makes it that’s part of what makes it so enticing. I also think it’s just a lot of clever and beautiful ideas. But both is a a good motivation and maybe you can add another one if you uh take something from this module and invent something cool then uh maybe you’re the next bullet on this on this list. Right back to the boring stuff.

Uh the assessments I already sketched. We essentially have the final uh oral examination. I would say is kind of clear uh that determines the module mark to be admitted. the standard rule I think it’s uh standard in the department you need half of all the uh points from our exercise problems and two of the sheets you can zero out at most standard rules and the tutorials themselves or the exercise sheets themselves are handed in in groups because we’re a small group maybe we don’t need four people per group I think uh you can handle a you a few more groups.

Um, all relatively standard stuff. As always for M’s modules, I consider the marking for the exercise sheets more as a a feedback for you than as a strict check that you followed everything single step. So, we’ll usually mark for serious attempts and not completely solved. Uh the the point being that it’s u it’s practice for you, prepare you for the exam, not so much checking on you. That’s what the exam’s for.

Now, we have to talk uh about about this guy, the elephant in the room these days of uh AI tools. uh I’ve opened the conversation on so most most people have used these tools and of course I I encourage you to use them and and play with them just not for the exercise problems uh for the for the following reason AI is is an exciting tool it’s it’s very amazing in some things and can be less amazing in others um I’ll come back to an example in a minute but it also makes us lazy that’s um among other things a problem.

So uh we need people who can check whether the outcome of what an AI produced is actually accurate. Here’s an example from before. Um, of course I interviewed some uh some chat assistants about nice data structures applications and it tried to convince me that precisely what I mentioned before. Git has mathematical certainty that the hashes never collide. And it kept insisting until I told it this is nonsense. You’re mapping infinitely many strings to a fixed set. There must be collisions. Oh yeah, you’re right. Uh I’ve I I uh gave you marketing certainty, not mathematical certainty, but it used literally the words mathematical certainty. So you have to call out to the systems. And to be able to do that, you need to know your stuff. How do you get there? By practicing it.

And so for the assignments, don’t cheat yourself. Is okay if you use it for background research or maybe language editing? uh don’t let it solve the exercise problems. You’re defeating the purpose and the uh the main person suffering is is you.

Of course, it’s fun to make pictures with uh these tools. So, this is kind of the the Pixar vision might be that some benevolent AI is taking good care of us and we become a bit lazy and round, but otherwise things are pretty dandy. I don’t think that’s realistic. I think it will soon look more like this. And maybe at some point if nobody fixes things even more like this. And specifically for tutorials, think of the following analogy. If you’re if you’re uh cheating in a competition, that’s unethical. It’s like doping in a in a sports competition. But the tutorials are your training sessions. And if you dope in the training, that’s out that’s utterly stupid. There’s not just no competition to be won. you’re not even winning the competition because the exam is without AI. Uh so you’re only getting the uh the negatives. I know it’s sometimes hard, time can be short, etc. But try to resist.

Enough of the uh dark language. Um a quick few words about the other tools uh for those who haven’t used them before. Campus Wire I would say is a bit like Reddit specific for lectures. It allows you to ask questions and people can answer the questions. The answers are sorted by up votes. Uh so you can work on or reply to questions and refine them. Um and you can also answer or ask questions anonymously. The goal would be to use this as our main out of classroom communication platform. uh just because then everyone gets the same answers I think is more effective is a nice markdown and latte support uh interface. So uh usually works fine. They have an a mobile app that works okay. Again, you need this pin to sign up. Um and the hope is that it’s not just a dump for your questions, but also that you answer other people’s questions. That’s the best way to learn.

Okay. Um the only other other system is Ilas which most of you know. Uh I find Ilas workable but not particularly nice to use. So I use it only for the things where it’s the only sensible option which is the official parts and marks and the assignment handins. Everything else will either be campus fire or the website.

And one last thing that I would like to try again is a collaborative question gallery for the exam. It’s a long way to go. So, uh I haven’t even put this on the website yet just to have a shared Google doc to collect questions that could be questions asked on an oral exam. And the double purpose is by making you think about these questions, you can try to anticipate what might be good questions. If you put them in, I promise to have a little comment saying, “Yeah, that’s a good question. I could have asked that right like it is or yeah, that maybe is too much or it leads too far or that’s fine, but it’s a rather shallow one.” I’ll comment on them so you get some feedback and maybe if I like your question, you know, could even be asked on your exam. When I say your question, this is the general view. It’s it’s just a Google doc, so nobody uh really tracks who says what. Okay. Engage in this early, but not just yet. It’s not up there. Uh plenty of time to go.

Questions on any of the organizational parts? Okay. Then let’s take uh I usually do a few minutes break in the middle. It’s not been a very taxing session so far I guess but it doesn’t hurt. Also it’s late night shows lot on a Monday. I think we also almost minimize the time between the two lecture slots where we see each other in a couple of hours again tomorrow morning. Sorry about that. It was a bit tricky to find slots and rooms. That makes sense. And that was the the least bad. It could have been the 8 a.m. slot. Can always be worse. Sweet.

Speaking of Elias, does anyone know how to not get an email for every student who signs up for the tutorials? And this this group’s fine, right? But algorithm data structures was a bit of a

participate on the list of participants. Yes. Okay. Worth trying. I’ll do that later. The the worst part is over by now. Oh yeah. What?

Okay, maybe we are slowly coming out of the break. Um, let’s start with a bit of recap. Which of these you may you potentially have seen before. I’m hoping to see 100% for some of these, but maybe not for some of the others. Ah, sorry. Yes, the default is a bit stupid on this.

Yeah. Try again now. Yeah. What is what is the meaning of seen? Uh this may be a bit unclear, but seen in a lecture at least or what’s a lecture? M let’s take it to mean you could roughly explain what it’s doing. Oh, cool. Nobody’s seen trips. Heaps. Uh, they got to I mean they are called trips because it sounds similar to heaps. That is indeed true. Okay, but skip lists. I guess this is uh maybe Ben had Sega’s doing. Yeah. After this, maybe we can have a debate about the last two. The last two you maybe you’ve seen something. Uh I found the last one is very very uh little known. On the other hand, the name is also very generic and uh it’s not clear if if we’re talking about the same thing. We’ll we’ll see. Uh it’s a fun fun alternative to skip list. I find this deserves to be more wellnown. There’s no great reason why it’s not. We’ll we’ll see. Uh let’s let’s have a discussion about this at some other point. Good. Um that’s roughly what I what I was expecting. um that that should should work fine.

In the first unit of actual content, which we’ll just start a little bit today, um we’ll look at at randomized trees. So, alternatives to uh standard binary search trees or roughly dictionaries that somehow use randomness. But you may ask why would you even want to? And to motivate that a bit, uh, I’ll I’ll recap a bit where we come from and then, uh, we’ll slowly work our way to various alternatives that all exhibit some some cute ideas, I think. Uh, but let’s first see uh, where we come from. And I expect that this is nothing new. Maybe you’ve seen it in slightly different language, but you this will be essentially very familiar.

A symbol table or a dictionary or one of these other lovely names is one of the key abstractions in computer science. I would say uh you can put in a mapping for a key, assign it a new value. You can read out the mapping or the value assigned to a key or you can delete uh the mapping uh or ask whether it’s contained plus a few other things. That’s the bread and butter thing we use all the time. Java calls this maps. Um, in Python, it’s even a built-in dim, I guess, data type. Of course, it’s also a data structure. So, very well known. I’m sure you’ve all used them in programming.

How do we implement these? Well, okay, before we get to implementing this, there’s a few other operations that you would sometimes like to have. If people just say map or or symbol, table or dictionary, they usually mean mean just this a key value mapping. Um you can also if the keys have a linear order or a total order support these other operations where min and max are maybe not so exciting but you can find for a given key what’s the the closest in key space that is stored in the dictionary either the one that’s smaller or larger than a given key. And rank and select are powerful primitives to build essentially something like a dynamic array where you can insert in any position. What do I mean by that? Well, rank gives you for a given key how many other keys in this data structure are smaller. That’s a bit like the position in that list in sorted order. And select is the opposite. it tells you uh for a given index uh what is the key such that I elements are smaller. So these two together you can use to implement something that looks like a sorted list but you can insert arbitrarily at any place um without copying like in an array uh where you would have to move everything to the side. That’s what this uh sentence here with the extra dot means. Um this of course has the additional assumption. This um assumes that the key universe is ordered whereas the previous didn’t need that assumption.

I’m sure you’ve seen both of these before, maybe with a different name. One option to implement both at the same time are binary search trees. I think all of you have seen those. So, uh it’s familiar to you. Uh you could say it’s a dynamic sorted array of sorts. The simplest version is composed of of pointer structures where you have nodes in this binary tree that each have a left and a right child and we have the search tree property which is crucial for searching in the tree. Okay, if I’m like interrupt me please if I’m going over this too quickly but I’m I’m hoping this is just undusting memory.

Uh if you look at these operations rank and select from just this it’s actually not clear how to do that uh at least not efficiently. Um, but there’s a standard trick that probably you’ve also seen usually goes by the name of augmented binary search trees, which just means every node stores its subree size. So if there’s a a node X and it has a certain number of other nodes below it, it really looks more like like this, right? We draw these triangles to abstract from the actual shape. uh then in this case this X would store 1 2 3 4 5 6 7 8 9 10 including itself. So maybe here on the side it would remember that its sub tree size is 10 and turns out with these numbers you can solve rank and select and you can maintain these sizes efficiently upon updates. So, binary search trees at the very least with this augmentation solve all of these previous operations.

Now, how fast are they? That depends. Um, first of all, here’s a binary search tree as you would want to draw it. Do we need an example for search? Again, maybe not. Um, you navigate down. If you want to find the 63, start at the root, you find 63 is smaller, so you go left, etc., etc. Um, insert, you do the same. The standard insert, I’ll I know this is baby stuff. We’ll we’ll come back to this at some point. So, I want to make the contrast. Insert in a binary search tree is normally done like this. You bind the leaf. If you want to insert 88, this leaf is where you would have expected to find 88 if it was in the tree, but at the moment it isn’t. So you replace that leaf, that null pointer with a new note with that key. That’s how we normally do insert.

And then delete is a bit more tricky. There’s this annoying case distinction. It’s easy. If you’re removing a leaf or a note with a single child, then you can just shortcut If you’re told to remove a binary node, there’s not enough parents to hold both children. So, you have to do something. And the usual version is to go to the predecessor or successor in sorted order, swap the value with that, and then recursively delete that guy. And it turns out you only ever recurse once because your predecessor or successor is guaranteed to be a unary node or leaf. It cannot have another left or right child because that would be you would be the 12. Again uh undusting memories standard deletion in binary search trees uh also known as as a hibarts deletion. uh that that is the person who first described this but again this is with almost 100% certainty this is how you’ve seen it one or the two one of the two versions either you go to the successor or the predecessor doesn’t matter okay

now how fast is this well it all depends on the shape of the tree all these operations uh for some you can be a bit clever if you just uh keep track of the global minimum and maximum in the tree with an extra variable. You can do this in constant time, but almost all of these operations have to follow down a path in the tree. I guess constructing from n keys means just n times insert. So um not even clear if we need that as a as an extra operation in in this case. And the path down the tree along a search path is never longer than the longest such path. That is called the height. Hence in the worst case all these operations are bounded by the height of the tree. Again we haven’t left baby data structures land. This is uh how you learned it back in the day.

Uh now unfortunately in unbalanced binary search trees which just means you insert as things come and don’t care about what happens to the shape of the tree. This height depends on the insertion order is always between log n and n. Uh maybe I should say log I always write lg or the binary logarithm because it pops up so often. it’s convenient to have this extra symbol. So that’s fine. But between log and n, there’s a lot of space. There’s an exponential space, right? Uh it turns out if you randomly insert things, specifically if uh the keys come in the order of a random permutation of the final set, then it’s actually close to this best case. We’ll look at that in more detail soon.

Uh, also what this means, this is where we leave BL baby data structures land. This term probably doesn’t come up. I don’t know if you know a formal definition of it. It’s a bit of a trick question because there’s more than one, but I’ll show you the the usual one. Someone want to Yeah. What’s What’s with high probability? What does it mean? It should be with probability somewhat close to one closeish. But what does that mean? What

define close define close? Yeah. Um it’s an asmtoic statement. As n becomes bigger, we get we converge to one to the probability one that this holds. Well, uh we’ll come back to that. I just want to mention it could mean different things. It could mean that the probability is 1 minus little O of one. So that’s the weakest form. It converges to one, but I’m not making any guarantees about how fast. Or it could be 1 minus 2 to the minus n exponential conversions. That’s great. It’s very quick. Maybe a bit too much. Uh we’ll settle for something in between.

So at this point, you’re not supposed to know this or you’re not supposed to see why that’s the case. I’m I’m telling you promise pinky promise. This is what will come out of the analysis that we’ll do. Okay. But that’s for the knowledge about unbalanced binary search trees. Now you’re going to say, “Yeah, I know how to solve this. We know schemes for balancing binary search trees. 85% new red black trees. Maybe the other 15 no AVL trees. uh one of the two probably you’ve seen. These are any kind of rules that upon updates, insertions and deletions somehow change the shape of the tree to guarantee that we have a logarithmic height.

As I said, the two most well-known are AVL trees and red black trees. Uh there’s others and there’s to be honest there’s a large family of of more modern ones that all have slightly different properties. Um all of them guarantee that the height is big O of login at all times. The constants hidden by this big O in terms of the the largest uh length of a search path they actually differ and this is the not just part of the analysis. There’s really uh some schemes like AVL trees are more balanced than red black trees. Uh okay, here it depends on the parameter. You can you can look further into this if you wanted to.

All of them essentially do rotations on the search path. I’ll briefly recap that on the next slide. Uh sometimes you get additional guarantees that are useful in some uh let’s call it fringe use cases that are not so fringe but uh we’re not going to uh dig into this uh that much. Um computational geometry often makes use of this property that you don’t do a lot of rotations high up in the tree.

What’s actually an open question is to look at a closer statement where most of these trees it seems our mathematical toolkit is not able to handle these questions if you do pick one of those rules for keeping the height logarithmic. We know an upper bound for the height that follows from the worst possible tree you could still find that satisfies say the AVL condition. But if you start with random insertions like previously where we say ah if if inputs if the insertions come in random order even the unbalanced tree is closely balanced. If you do that on either AVL trees or red black trees you essentially get 1.01 01 times log base 2 of n height or search costs they’re extremely balanced but nobody really can prove this is this you can do experiments and then people observe and I think there’s some some things known that it’s not exactly optimal log n without a constant in front but the constant is maybe indeed 1.01 01 or so something of that sort and we don’t know. It seems we’re not uh the the math tools that we have are not sharp enough to handle this question yet or maybe there’s a clever idea that nobody has tried is an open problem. I was just very keen to have an open research problem in the first lecture. So that is one that’s easy to state. Not sure how excited you are about this problem, but it’s a kind of tantalizing question that we don’t understand these very basic data structures all so well. I said all these um

yeah or there I don’t uh it’s a good so the question was is it proven that there is some constant times login as the cost uh I think it’s proven that if there is a constant it’s not one that’s I I remember a statement like this probably it’s not known whether it’s a constant times login. It could be some some term that is periodic with login or some crazy behavior that’s not a constant. It’s possible I guess I don’t know of a result saying it’s not not the case, but there’s some exact average cost for every n right is a well- definfined thing. We just don’t know it.

I mentioned that all these uh balancing schemes are based on rotations. So let’s briefly recap what the rotations do. There’s essentially two cases and then you build further rotations from that. But the key one is either this the topmost tree has um this structure and you care about moving y further up the symmetric version as y as a left child. Okay, x and y should look different. And here a, b and c are any kind of sub trees that could be empty or could be large sub trees. I’m just giving them some names. And by and large, the two rotations let you go from one to the other. If you move ah this is let me swap the two. This is very unintuitive. I’m sorry because now going from left to right is a right rotation. We um you move the right rotation says I’m moving this guy up and this guy down and so maybe in between they’re somehow like this I can’t really connect them well because it’s not a proper tree but then at the end this guy moves further down and it’s connected like this both before and after. Note that um cool X and Y are wrong. This was testing your attention. Obvious note that before and after. Y is smaller than X. Here is because it’s a left child and here because X is the right child. And A is everything smaller than Y. B is between Y and X and C is greater than Y uh than X. And that doesn’t change during the rotation. All that changes is the shape of the tree. And we’re just doing a local flip. So if you see, if you look at what changes, it’s this pointer and this pointer and this pointer that changes. So we res um take that back. This one doesn’t even change. It’s two double rotation changes. Three. Uh it’s two pointers that we that we change. And then we’ve slightly modified the tree. But what happens to the sub tree sizes or the heights is that this uh whole tree C moves a level up and the uh a level down if we go to the right and the whole sub tree A moves a level up or vice versa.

Now uh if you’re in the situation that you want to move this B up then you have to combine the two rotations. That’s why AVL trees and friends also use double rotations. If you just use these single ones, then uh you’re not moving B.

Okay, questions up to this point. It looks like uh case closed. We have defined an abstract data type. the uh dictionary and sorted dictionary and we found uh a data structure the bounce binaries the augmented balanced binary search trees that solve all the operations in login time which is pretty good. You can even say in the comparison model where all you can do is compare keys pair-wise that’s even the best you can you can hope for. So aren’t we done?

What’s wrong with that solution? What’s wrong with that solution, it’s just that it’s annoying if you actually implement it. It’s nothing that wouldn’t work. Most languages except Python have a balanced binary search tree as part of their standard library, which shows it’s very doable and they work well. However, even a um from my taste, the cleanest solution I know of either AVL trees or red black trees is the one from from Cedric and Wayne from the the red algorithms book, if you know what I mean. Uh the original full code is 750 lines. If you if you take out some of the comments and some of the additional operations, the core that you definitely need to keep a tree balanced is still this too small to read part that takes essentially four screens of text. And that’s because you have to distinguish a couple of cases. There’s the left rotations, the right rotations. In case of the red black trees, you have these color flips upon insert. There’s various different cases. Delete is even worse. So if you wanted to memorize say this implementation say uh you know data structures to take on the on the desert island uh with you that’s not a great choice and it’s not even that the cases are so super intuitive and uh easy to remember. They’re kind of annoying. You have to learn by heart what they all do. Uh I haven’t even included the rankbased operations. So really not that great.

Can we somehow do this better simpler in this case in this sense? And that’s what we’ll look at in that’s the motivation of this in the next unit. Uh I’ll also take it as a motivation to take you through a journey of just fun things. Uh but we’ll end up with some some candidates for indeed nicer and simpler solutions.

One disclaimer I have to give is that up to now and people have tried a lot uh including the original inventors of some of of some of these Turing award winners like uh Robert Targen and so on. Uh nobody has found a solution that’s substantially simpler than what we know and achieves the same guarantees in particular the worst case deterministic login height. Um indeed the alternatives I’m going to show you all have some slightly weaker guarantees. They’re either randomized or amortized. We’ll talk about amortized the next chapter. Um it’s a big topic in data structures anyways and then we’ll use the uh search trees to exemplify that a bit. Randomization is the other one which is clever use of randomness. In most applications that’s quite fine. We talked about git earlier. It’s not exactly randomness but is of the same type. There is a it’s not explicitly randomized but still we kind of hope and pray that this unlikely event of a hash collision doesn’t happen. We might as well hope and pray that some other unlikely things don’t happen. And indeed for randomized algorithms we have uh a dialing knob that we can make things as unlikely as we want.

Okay. Um the other thing of course is that simpler may often be faster by the constant factors. these uh worst case balanced deterministic search trees are optimal in terms of the big O class but the constant hidden inside there isn’t quite so optimal and that’s something you can measure as well

at this point let me quickly recall two important concepts that you don’t want to mess up again this is something you will have seen the difference between average case, average case analysis and randomization. The terms are similar. There’s random stuff in both going on. Um, but they’re distinctively different in how we apply the randomness. Average case analysis says we have a deterministic algorithm, say our unbalanced binary search trees, and we assume that the input is somehow uh randomly chosen. We look at an average input. But if I fix an input, the algorithm does the same thing each time.

By big contrast, a randomized algorithm rolls dice internally. So it uses randomness internally. Even if the input is the same, it’ll behave differently each time. This usually makes practitioners or at least maintainers of standard libraries very nervous. They hate this idea that performance isn’t repeatable or uh reproducible. That’s sad because oftentimes you can do so much easier and better solutions with the randomization. Uh external libraries, many many libraries built by tech companies use randomness where the standard libraries didn’t dare to. And that shows you that there’s this gap.

Okay, I’m I’m digressing. the definition deterministic algorithm on a random input versus non-deterministic randomized algorithm on a fixed input. Uh here to do the analysis there has to be some randomness that’s in the input distribution. We assume that the input is chosen randomly. Whereas for the analysis of a randomized algorithm we usually think of the input as worst case or it could be chosen adversarially by someone who tries to make us fail. Here we assume that the world is nice to us. It randomly chooses the inputs. It’s a weaker statement, right? A weaker assumption.

In both cases, we get somehow costs as random variables. They depend on random things here on the inputs and here on the internal random choices of the algorithm. What’s indeed confusing apart from the name is that the analysis often uses the same techniques but keep these two things separate. Uh they’re of course very different beasts because in this case you get a guarantee that is worst case over inputs. So it’s a robust solution and here you have to hope and pray that your assumption is realistic.

Okay, we have a bit of time. So, let’s start with this. Um, I guess the jump from from the previous slide to the next part needs a bit of a um a leap of faith first. In the next section, we’ll look at random binary search trees, which is precisely what I mentioned earlier. We study the unbalanced binary search trees under this generous assumption that the world is very nice to us. Namely, that it gives us insertions in random order and that that is what gives us the shape of the binary search tree. um and the goal will later be to make the data structure behave as if that was the case even if it isn’t the case. So first we’ll see if this goal of having someone of the data structure kind of trying to simulate being built randomly if that’s even worthwhile doing. So we’ll first study binary search trees built from random insertions. And if spoiler alert we will we find that these are well behaved then it’s a worthwhile goal to have a data structure explicitly use randomness to get there. Okay that’s that’s the idea.

H what does random mean? Well to be precise um that’s where the theory side comes. There’ll be some a few definitions and theorems. Now in the following uh don’t be scared they don’t bite and I just want to make things precise. A random binary search tree of a given size is one that we get if we start with an empty unbalanced binary search tree and then successively insert the elements um the numbers one up to n in random order. So every of the n factorial permutations is equally likely for n= 3. There’s not so many permutations, precisely six. So we can uh build all of these trees. For example, there’s the sorted permutation 1 2 3. There’s the reverse sorted permutation. There’s the 132. There’s the three one two and two one three two three one.

Let’s build some trees. Uh, and just look at the shape. I’m not going to draw all the uh labels here. We get one first, then we insert two, so that’s a right child. And then we insert three. So that’s what the tree looks like. And uh in the drawing before, I would have had these indicators for null pointers. Let’s not draw them. But you know that they’re are there. This guy is the opposite. We get the three first, then the two, then the one. So, it’s a mirror image of this, but they’re distinct as trees. Left and right is not the same. Here we get the one first, then the three, and then the two. So, we get this zigzag. And here it’s the other way around. Again, the two mirror images. Now, here happens what uh comes the fun thing. Both of these give us the same tree. The balanced cherry effectively effectively called affectionately called. So if each of these six permutations originally happened with probability 1 over six, there’s only five different trees. these happening with probability 1 over6 but the other one with probability one over3 and uh note how the single balance tree among those five is the one that gets the larger probability that’s already a good indication that things are going going well right yeah these are all equally bad.

Yeah. Ah, good point. That is totally correct. Um, I think it’s an artifact of three being very small. Um if you think about it, these are um these are essentially linear chains. And the number of linear chains as n becomes bigger is relatively small because to get a linear chain in the the strict sense to have the very bad case, the worst case where the tree is is one long linked list. Uh in each step, you only have two options. You can have the the minimum of the current thing or the maximum of the remaining. set of numbers. So you can go back and forth. So is a like such a a funnel and that could be what the tree looks like. But there’s only two to the n by and large of these trees because you have a choice whether you pick the min or the max in each step. As n becomes bigger, there’s way more trees than 2 to the n. not something I wanted to get into at this point. Uh, but as a as a quick one just as a as a remark that what we see with n equals 3 is a bit unrepresentative. That’s the number of maximally unbalanced trees.

The number of binary trees on nodes. Does anyone know is uh the nth Catalan number? Oops. Does anyone remember the formula for that? And that is that’s the precise formula. Now if you’re a a combinatorics person, you say joy. But as a computer scientist, how much bigger is this than that? It has to be bigger because these are all binary search trees, binary trees. But how much bigger is it? Huh? Asmtotics to the rescue. uh I’m not going to remember the asmtotics the precise asmtotics right but it’s in a wake sense closer to 4 to the n than any other exponential it’s a bit less it’s 4 to the n / n^ the 3s and then there’s some constants involving the square root of pi again if you want to see where did we do this um maybe advanced algorithms touches on it a bit. Um, at some point I’m I’m going to do a lecture on how to get to these numbers.

The point here is just that it’s way bigger than 2 to the n. So the probability of this maximally unbalanced tree is dropping like 2 to the n almost very unlikely. Now of course you will say hold on that’s only the most extreme case. There’s a lot in between and that’s right. Okay. But for n equals 3 it looks like these are twothirds of the cases and they’re larger n there aren’t um yeah that is what we mean by random binary search trees uniform over the n factorial permutations but then you build the trees now if you compare this number to n factorial that’s another huge step n factorial way bigger than this okay While we’re here, what’s a good approximation for n factorial? Sterling’s formula. Remember, uh, again, I’m going to do just the very rough one that I can remember from the top of my head. Something like n / e and the whole thing to the power n. And then there’s again funny uh additional factors like this. But the point is it’s not even just exponential, it’s factorial. uh it grows more like n to the n than a constant to the n if you uh close both eyes. So again much much more uh variability which means as n becomes bigger we’ll have some of these trees that are generated by a lot of different insertion orders and this example already shows you why once the two here is there the order in which you put things in the left and the right tree doesn’t matter because they’re separated at the root. If I put the one first and then the three or the other way around is invisible to the tree. Oh boy, that was a bit of digression. Um, I hope it was useful.

Uh, let’s first see that this model even matches what we wanted initially. Um, as a little statement, if you take a tree that is a random binary search tree over n keys, is it correct that if we add another key n +1 that we then get um a random binary search tree, a random BST over n plus1 keys? Because here what we mean is we fix the number of keys up front and insert them in order. What does that mean if we do it step by step? Uh, fortunately has a well- definfined meaning. It just says if you have n keys currently, let’s maybe again draw the example for n= 3 with some uh let’s do one of the not so balanced trees. And this time I’m going to add the uh exact values. Um, and in squares I’m going to add the I’m going to number also the leaves. And there’s always n plus one gaps between these n keys. And now I’m going to pick one of those all with equal probability. Then the result is a tree that’s one bigger and has the same shape as if it was built with this random permutation of the keys right from the get-go. Okay.

To formally prove this, you can do a bit of combinotaurics. Uh we have two sets. We want to show that they’re uh that they’re the same and even have the same distribution. We’ll show that there’s a bjection between the two. Um the insertion order for generating this this larger tree is injection with permutations of the keys that we have at the end. By definition that’s just the insertion order itself. uh and that can be put in byjection with inserting the first n keys and then the last value in in this range which is the position the the leaf that we pick to add the new value. Now if we do this here say we decide this is where we insert then instead of the two let me copy this uh now we’ll get a new node a new internal node but that should now be the formally what what is that should now be the three and what was formerly called three is now suddenly called four and this also needs leaves and these have to be again two three. So this needs to be renamed to five. So this insertion changes the labels of some nodes. That’s inevitable. But it doesn’t change the shape of the tree other than that it adds this new node. So we need to mimic in this bjection what happens here. This last value was this leaf that I picked and that’s a number in uh I wonder if I was consistent. I think the zero is what we later use. Then maybe here it’s rather a value in zero up to n and we increment all the values to the right of that. All all the bigger ones we increment by one and then we get a new insertion order. That’s the one we could have used to build this tree from the start if we didn’t do it step by step.

So here’s an example. We have first the uh insertion order here where we have the the three separated out. Um so this is the insertion order originally where we have the numbers one up to seven each occurring one time. And I’ve just decided to single out the last one as that is the last one. That’s why there’s this vertical bar. And this is in bjection with having the same permutation, but everything that’s above the three here is is reduced by one. And we have the same three. These are the same because I can add plus one on everything that’s bigger without losing information.

uh questions up to this point. I’ve just defined a model of generating random trees and it happens to be the one that we want even if we build it step by step. That’s all that this lema is supposed to tell you. only the set from 0 to n to n + one. Well, the the notation here this this notation means one up to n + one or in other words uh one two right that’s the more standard and here’s the zero not included that’s the okay depends it’s a definition uh depends who you ask I think it’s somewhat standard that the zero is not included in this one um and here the way I labeled them you would want to start at zero It doesn’t affect the bjection of course if you rename the things accordingly but uh yes and let me let me peek forward to how we use this later. Yeah, I think this is this is one of the things where um I modified earlier slides to be consistent with some of the books I build on and so that might be a mistake I copied. Sorry about that. Okay, other than that clear um nothing deep going on.

uh before the next statement which is going to be this one that we can also build these random binary search trees with a slightly different random model that is some sometimes more convenient. So I wanted to let you uh let you have this knowledge. uh you can pick pick uniformly distributed real numbers between zero and one and pick n of these identically distributed and independent then uh you insert those into a random binary search tree you get the same as if you had used this random permutation of the numbers one up to n basically the reason is just that they’re indistinguishable if you just do comparisons and the probability that two random real numbers are equal is zero. So you’re really essentially never going to see this. Okay.

Now to make to make this statement make sense, uh I thought it’s good to define this and this maybe even this. So I inserted a few slides that cover this um about um probability theory. This is hopefully just a refresher, but maybe it’s good to just uh recap the terminology there. Um we have some probability space which is a ground set. That’s the things that can happen. Um and then there’s a probability measure of the the probability we assign to sets of that. And an event is uh I think think of f in the discrete case. Uh in the discrete case f is just the subset the power set of omega. Nope. Uh sorry 2 to the omega as power set sets of sets that’s good enough. Forget about the measurable sets and so on. This is uh well the general definition is important but we don’t need that. And then an event is just some some subset. So maybe omega could be the six outcomes of an of a standard die where you have the numbers one up to six. And then maybe uh an event could be I get an even number. Um I’ll use this probability P is uh stands for probability and I can have the probability of some event. Maybe for the example with the die, the probability to get an even number is is exactly half if it’s a fair die.

uh we can have the counter events which is the uh complement set is one minus the probability of the other standard stuff. Um the union bound is often useful if you take the union of several events. You cannot exactly say what the probability of the union is but you can upper bound it by the sum of the probabilities of the uh subsets. Um and we will need independence. Remember there was this iid popping up. Um a collection of events is independent exactly when their probabilities factor. So the probability to see all of the events happen at the same time. That probability is the same as the product of the probability of the individual parts. That’s the definition of stochastic independence. You’ve seen this before, right? I hope if not then this is a bit rough but in principle this is is all on the slides there’s not much more to it but it’s a bit rough if you’ve not seen it before um I don’t think we need this necessarily right now

uh this was events so sets of things a random variable is something that maps this abstract space of potential outcomes to numbers will usually work with uh normal numbers whatever uh reals or or integers. Um so we can have indicator random variables which are one if the event that they assoc associated with occurs and otherwise zero. Um I often write these as this Iverson bracket notation. So this uh some conditional in square brackets means one if this condition is true and zero otherwise that’s that says else. Oh come on. Okay.

Not sure if we need these uh cumulative dist distribution and probability mass function. Um what we want is also independence for random variables. We had it for events. It’s essentially the same property that uh carries over for discrete variables. It’s the easiest to state that the probability that two random variables jointly assumed some values just factors. And then last thing, iid independent and identically distributed sequences of random variables means that they’re all mutually independent and they have the same distribution. I use this equal d to denote that two random variables have the same distribution. Uh that just means they have the same cumulative distribution function. A typical thing in algorithms is for randomized algorithms. We often assume we have access to some some random generator, some thing that generates random bits and it could be just flipping coins. We assume that one coin flip leaves the coin in the same state as it was before. So if you flip it another time, the first and the second outcome are independent.

Question. in in practice, how good does the randomness have to be for these data structures to work at? Yeah, it’s a very good question. Uh question was how good does the randomness have to be for the data structures to work? Uh the answer is it depends. For some uh you in indeed need a lot of actual randomness and for others um the the typical phrase is uh is this one. Um it’s often good enough if say the coin flips are kwise independent means if you pick any any or say five five wise independent that means if you pick five of the coin flips they’re perfectly independent you wouldn’t be able to distinguish them from perfect randomness but if you take six or more together you start seeing patterns many data structures have this property that a certain k and then kwise independence is good enough for the guarantees we want to prove. Uh I remember that there’s a a theorem about quick sort for example that’s you choose the pivots randomly. Uh it’s good enough to have I think five wise independence or something of that sort. Don’t don’t ask me about the proof for now. Uh that’s that’s the kind of thing um one can prove. Usually we’ll ignore that part or the analysis first and assume perfect randomness and then as a second step you can analyze in your analysis kind of meta analysis in your analysis what did you actually use about the sequence of random things and often times it just pops out that you didn’t really need full mushial independence uh in the case of bricksort is a bit more work

okay uh with this terminology at least you can make sense of this statement. Uh again the the proof is very simple that if you just do pair-wise comparisons these real numbers behave just as if they were the numbers the integers one up to n and so of course they are the same. The reason why this is a bit more convenient is that you look at uh the next insertion they’re all independent. So any value between 0 and one could be the next insertion and you don’t have this annoying renaming of things. If you keep the actual real numbers as labels in the tree, they don’t move. They just stay the real number they are. That’s why this is sometimes a bit con more convenient to think about. Okay, let’s leave it with that and um we get to the actual analysis of the random binary search trees uh tomorrow morning. Yeah, a couple hours.