SPEAKER 1: All right, so we are back. Welcome to CS50. This is the end of week seven. So recall that last time, we started looking at slightly more sophisticated data structures. Since up until now, all we had really at our disposal was this, an array. But before we discard the array as not all that interesting, which indeed it actually is, what are some of the pluses of this simple data structure thus far? What's it good at? So far as we've seen? What do you got? Nothing. STUDENT: [INAUDIBLE]. SPEAKER 1: What's that? STUDENT: [INAUDIBLE]. SPEAKER 1: Fixed size. OK, so why is fixed size good though? STUDENT: [INAUDIBLE]. SPEAKER 1: OK, so it's efficient in the sense that you can allocate a fixed amount of space, which hopefully is precisely as much space as you want. So that could be absolutely a plus. What's another up side of an array? Yeah? STUDENT: [INAUDIBLE]. SPEAKER 1: All the-- sorry? STUDENT: [INAUDIBLE]. SPEAKER 1: All the boxes in memory or next to each other. And that's helpful-- why? That's quite true. But how can we exploit that truth? STUDENT: [INAUDIBLE]. SPEAKER 1: Exactly, we can keep track of where everything is just by knowing one address, namely the address of the first byte of that chunk of memory. Or in the case of the string, the address of the first char in that string. And from there, we can find the end of the string. We can find the second element, the third element, and so forth. And so the fancy way of describing that feature is that arrays give us random access. Just by using the square bracket notation and a number, you can jump to a specific element in the array in constant time, big O of one, so to speak. But there's been some downsides. What an array not do very easily? What's it not good at? STUDENT: [INAUDIBLE]. SPEAKER 1: What's that? STUDENT: [INAUDIBLE]. SPEAKER 1: Expanding in size. So the downsides of the array are precisely the opposite of what the upsides are. So one of the downsides is that it's a fixed size. So you can't really grow it. You can reallocate a bigger chunk of memory, and then move the old elements into the new array. And then free the old array, for instance, by using malloc or a similar function called realloc, which reallocates memory. Realloc, as an aside, tries to give you memory that's next to the array that you already have. But it might move things around altogether. But in short, that's expensive, right? Because if you have a chunk of memory of this size, but you really want one of this size, and you want to preserve the original elements, you have roughly a linear time copying process that needs to happen from old array to new. And the reality is asking the operating system again and again and again for big chunks of memory can start to cost you some time as well. So it's both a blessing and a curse in disguise, the fact that these arrays are of fixed size. But if we introduce instead something like this, which we called a linked list, we get a few upsides and a few downsides here as well. So a linked list is simply a data structure made up of C structs in this case, where a struct, recall, is just a container for one or more specific types of variables. In this case, what do the data types appear to be inside of the struct that last time we called a node? Each of these rectangles is a node. And each of the smaller rectangles inside of it is a data type. What types did we say they were on Monday? Yeah? STUDENT: [INAUDIBLE]. SPEAKER 1: A variable and a pointer, or more specifically, an int, for n, and a pointer at the bottom. Both of those happen to be 32 bits, at least on a computer like this CS50 Appliance, and so they're drawn equally in size. So what are using the pointer though for apparently? Why add this arrow now when arrays were so nice and clean and simple? What is the pointer doing for us in each of these nodes? STUDENT: [INAUDIBLE]. SPEAKER 1: Exactly. It's telling you where the next one is. So I sort of use the analogy of using a thread to sort of thread these nodes together. And that's exactly what we're doing with pointers because each of these chunks of memory may or may not be contiguous, back to back to back inside of RAM, because each time you call malloc saying, give me enough bytes for a new node, it might be here or it might be here. It might be here. It might be here. You just don't know. But using pointers in the addresses of those nodes, you can stitch them together in a way that looks visually like a list even if these things are all spread out throughout your one or your two or your four gigabytes of RAM inside of your own computer. So the downside, then, of a linked list is what? What's a price we're apparently paying? STUDENT: [INAUDIBLE]. SPEAKER 1: More space, right? We've, in this case, doubled the amount of space because we've gone from 32 bits for each node, for each int, so now 64 bits because we have to keep around a pointer as well. You get more efficiency if your struct is bigger than this simple thing. If you actually have a student inside of which is a couple of strings for name and house, maybe an ID number, maybe some other fields altogether. So if you have a large enough struct, then maybe the cost of the pointer is not such a big deal. This is a bit of a corner case in that we're storing such a simple primitive inside of the linked list. But the point is the same. You're definitely spending more memory, but you're getting flexibility. Because now if I want to add an element at the beginning of this list, I have to allocate a new node. And I have to just update those arrows somehow by just moving some pointers around. If I want to insert something into the middle of the list, I don't have to push everyone aside like we did in weeks' past with our volunteers who represented an array. I can just allocate a new node and then just point the arrows in different directions because it doesn't have to remain in actual memory a true line like I've drawn it here on the screen. And then lastly, if you want to insert something at the end of the list, it's even easier. This is sort of arbitrary notation, but 34's pointer, take a guess. What is the value of its pointer most likely drawn sort of like an old school antenna there? STUDENT: [INAUDIBLE]. SPEAKER 1: It's probably null. And indeed that is one author's representation of null. And it's null because you absolutely need to know where the end of a linked list is, lest you keep following and following and following these arrows to some garbage value. So null will signify that there's no more nodes to the right of number 34, in this case. So we propose that we can implement this node in code. And we've seen this kind of syntax before. Typedef just defines a new type for us, gives us a synonym like string was for char*. In this case, it's going to give us shorthand notation so that struct node can instead just be written as node, which is a lot cleaner. It's a lot less verbose. Inside of a node is apparently an int called n, and then a struct node* which means exactly what we wanted the arrows to mean, a pointer to another node of the exact same data type. And I propose that we could implement a search function like this, which at first glance might seem a little complex. But let's see it in context. Let me go over to the appliance here. Let me open up a file called list zero dot h. And that only contains the definition we just saw a moment ago for this data type called a node. So we've put that into a dot h file. And as an aside, even though this program that you're about to see is not all that complex, it's indeed convention when writing a program to put things like data types, to pull constants sometimes, inside of your header file and not necessarily in your C file, certainly when your programs get larger and larger, so that you know where to look both for documentation in some cases, or for basics like this, the definition of some type. If I now open up list zero dot c, notice a few things. It includes a few header files, most of which we've seen before. It includes its own header file. And as an aside, why that's double quotes here, as opposed to the angle brackets on the line that I've highlighted there? STUDENT: [INAUDIBLE]. SPEAKER 1: Yeah so it's a local file. So if it's a local file of your own here on line 15, for instance, you use the double quotes instead of the angled brackets. Now this is kind of interesting. Notice that I've declared a global variable in this program on line 18 called first, the idea being this is going to be a pointer to the first node in my linked list, and I've initialized it to null, because I've not allocated any actual nodes just yet. So this represents, pictorially, what we saw a moment ago in the picture as that pointer on the far left hand side. So right now, that pointer does not have an arrow. It instead is just null. But it represents what will be the address of the first actual node in this list. So I've implemented it is a global because, as you'll see, all this program does in life is implement a linked list for me. Now I've got a few prototypes here. I decided to implement features like deletion, insertion, searching, and traversal-- the last just being walk across the list, printing out its elements. And now here's my main routine. And we won't spend too much time on these since this is sort of, hopefully old hat by now. I'm going to do the following, while the user cooperates. So one, I'm going to print out this menu. And I've formatted it as cleanly as I could. If the user types in one, that means they want to delete something. If the user types in two, that means they want to insert something. And so forth. I'm going to then prompt then for a command. And then I'm going to use GetInt. So this is a really simple menuing interface where you just have to type a number mapping to one of those commands. And now I have a nice clean switch statement that's going to switch on whatever the user typed in. And if they typed one, I'll call delete and break. If they typed two, I'll call insert and break. And now notice I've put each of these on the same line. This is just a stylistic decision. Typically we've seen something like this. But I just decided, frankly, my program looked more readable because it was only four cases to just list it like this. Totally legitimate use of style. And I'm going to do this so long as the user has not typed zero, which I decided will mean they want to quit. So now notice what I'm going to do here. I'm going to free the list apparently. But more on that in just a moment. Let's first run this program. So let me make a bigger terminal window, dot slash list 0. I'm going to go ahead and insert by typing two, a number like 50, and now you'll see the list is now 50. And my text just scrolled up a bit. So now notice the list contains the number 50. Let's do another insert by taking two. Let's type in the number like one. List is now one, followed by 50. So this is just a textual representation of the list. And let's insert one more number like the number 42, which is hopefully going to end up in the middle, because this program in particular sorts it elements as it inserts them. So there we have it. Super simple program that could absolutely have used an array, but I happen to be using a linked list just so I can dynamically grow and shrink it. So let's take a look for search, if I run command three, I want to search for, say, the number 43. And nothing was apparently found, because I got back no response. So let's do this again. Search. Let's search for 50, or rather search for 42, which has a nice little subtle meaning. And I found the meaning of life there. Number 42, if you don't know the reference, Google it. All right. So what has this program done for me? It's just allowed me to insert thus far and search for elements. Let's fast forward, then, to that function we glanced at on Monday as a teaser. So this function, I claim, searches for an element in the list by first one, prompting the user and then calling GetInt to get an actual int that you want to search for. Then notice this. I'm going to create a temporary variable in line 188 called pointer-- PTR-- could have called it anything. And it's a pointer to a node because I said node* there. And I'm initializing it to be equal to first so that I effectively have my finger, so to speak, on the very first element of the list. So if my right hand here is PTR I'm pointing at the same thing that first is pointing at. So now back in code, what happens next-- this is a common paradigm when iterating over a structure like a linked list. I'm going to do the following while pointer is not equal to null So while my finger is not pointing at some null value, if pointer arrow n equals n. We'll notice first that n is what the user typed in per GetInts call here. And pointer arrow n means what? Well if we go back to the picture here, if I have a finger pointing at that first node containing nine, the arrow essentially means go to that node and grab the value at location n, in this case, the data field called n. As an aside-- and we saw this a couple of weeks ago when someone asked-- this syntax is new, but it doesn't give us powers that we didn't already have. What was this phrase equivalent to using dot notation and star a couple of weeks ago when we peeled back this layer a bit prematurely? STUDENT: [INAUDIBLE]. SPEAKER 1: Exactly, it was star, and then it was star dot n, with parentheses here, which looks, frankly, I think a lot more cryptic to read. But star pointer, as always, means go there. And once you're there, what data field do you want to access? Well you use the dot notation to access a structs data field, and I specifically want n. Frankly, I would argue this is just harder to read. It's harder to remember where do the parentheses go, the star and all of that. So the world adopted some syntactic sugar, so to speak. Just a sexy way of saying, this is equivalent, and perhaps more intuitive. If pointer is indeed a pointer, the arrow notation means go there and find the field in this case called n. So if I find it, notice what I do. I simply print out, I found percent i, plugging in the value for that int. I call sleep for one second just to kind of pause things on the screen to give the user a second to absorb what just happened. And then I break. Otherwise, what do I do? I update pointer to equal pointer arrow next. So just to be clear, this means go there, using my old-school notation. So this just means to go to whatever you're pointing at, which in the very first case is I'm pointing at the struct with nine in it. So I've gone there. And then the dot notation means, get the value at next. But the value, even though it's drawn as an narrow, is just a number. It's a numeric address. So this one line of code, whether written like this, the more cryptic way, or like this, the slightly more intuitive way, just means move my hand from the first node to the next one, and then the next one, and then the next one, and so forth. So we won't dwell on the other implementations of insert and delete and traversal, the first two of which are fairly involved. And I think it's quite easy to get lost when doing it verbally. But what we can do here is try to determine how best to do this visually. Because I would propose that if we want to insert elements into this existing list, which has five elements-- 9, 17, 22, 26, and 33-- if I were going to implement this in code, I need to consider how to go about doing this. And I would propose taking baby steps whereby, in this case I mean, what are the possible scenarios that we might encounter in general? When implementing insert for a linked list, this just happens to be a specific example of size five. Well if you want to insert a number, like say the number one, and maintaining sorted order, where obviously does the number one need to go in this specific example? Like at the beginning. But what's interesting there is that if you want to insert one into this list, what special pointer needs to be updated apparently? First. So I would argue, this is the first case that we might want to consider, a scenario involving inserting at the beginning of the list. Let's pluck off maybe an as easy or even easier case, relatively speaking. Suppose I want to insert the number 35 in sorted order. It obviously belongs over there. So what pointer obviously is going to have to be updated in that scenario? 34's pointer becoming not null but the address of the struct containing the number 35. So that's case two. So already, I'm sort of quantizing how much work I have to do here. And finally, the obvious middle case is indeed, in the middle, if I want to insert something like say 23, that goes between the 23 and the 26, but now things get a little more involved because what pointers need to be changed? So 22 obviously needs to be changed because he can't point to 26 anymore. He needs to point to the new node that I'll have to allocate by calling malloc or some equivalent. But then I also need that new node, 23 in this case, to have its pointer pointing at whom? 26. And there's going to be an order of operations here. Because if I do this foolishly, and I for instance start at the beginning of the list, and my goal is to insert 23. And I check, does it belong here, near nine? No. Does it belong here, next to 17? No. Does it belongs here next to 22? Yes. Now if I'm foolish here, and not thinking this through, I might allocate my new node for 23. I might update the pointer from the node called 22, pointing it at the new node. And then what do I have to update the new node's pointer to be? STUDENT: [INAUDIBLE]. SPEAKER 1: Exactly. Pointing at 26. But dammit if I didn't already update 22's pointer to point at this guy, and now I have orphans, the rest of the list, so to speak. So order of operations here is going to be important. To do this could I steal, say, six volunteers. And let's see if we can't do this visually instead of code-wise. And we have some lovely stress balls for you today. OK, how about one, two, in the back-- on the end there. three, four, both of you guys on the end. And five, six. Sure. Five and six. All right and we'll come to you guys next time. All right, come on up. All right, since you're up here first, would you like to be the one awkwardly in Google Glass here? All right, so, OK, Glass, record a video. OK, you're good to go. All right, so if you guys can come over here, I have prepared in advance some numbers. All right, come on over here. And why don't you go a little further that way. And let's see, what's your name, with the Google Glass? STUDENT: Ben. SPEAKER 1: Ben? OK, Ben, you will be first, literally. So we're going to send you to the end of the stage. All right, and your name? STUDENT: Jason. SPEAKER 1: Jason, OK you'll be number nine. So if you want to follow Ben that way. STUDENT: Jill. SPEAKER 1: Jill, you're going to be 17, which if I'd done this more intelligently, I would have started at the other end. You go that way. 22. And you are? STUDENT: Mary. SPEAKER 1: Mary, you'll be 22. And your name is? STUDENT: Chris. SPEAKER 1: Chris, you'll be 26. And then lastly. STUDENT: Diana. SPEAKER 1: Diana, you'll be 34. So you come on over here. All right, so perfect sorted order already. And let's go ahead and do this so that we can really-- Ben you're just kind of looking out into nowhere there. OK, so let's go ahead and depict this using arms, much like I was, exactly, what's going on. So go ahead and give yourselves a foot or two between yourselves. And go ahead and point with one hand to whoever you should be pointing at based on this. And if you're null just point straight down to the floor. OK, so good. So now we have a linked list, and let me propose that I'll play the role of PTR, so I won't bother carrying this around. And then-- someone stupid convention-- you can call this anything you want-- predecessor pointer, pred pointer-- it's just the nickname we gave in our sample code to my left hand. The other hand that going to be keeping track of who is who in the following scenarios. So suppose, first, I want to pluck off that first example of inserting, say 20, into the list. So I'm going to need someone to embody the number 20 for us. So I need to malloc someone from the audience. Come on up. What's your name? STUDENT: Brian. SPEAKER 1: Brian, all right, so you shall be the node containing 20. All right, come on over here. And obviously, where does Brian belong? So, in the middle of-- actually, wait a minute. We're doing this out of order. We're making this a lot harder than it needs to be at first. OK, we're going to free Brian and realloc Brian as five. OK, so now we want to insert Brian as five. So come on over here next to Ben for just a moment. And you can presumably tell where this story is going. But let's think carefully about the order of operations. And it's precisely this visual that's going to line up with that sample code. So here I have PTR pointing initially not at Ben, per se, but at whatever value he contains, which in this case is-- what's your name again? STUDENT: Jason. SPEAKER 1: Jason, so both Ben and I are pointing at Jason at this moment. So now I have to determine, where does Brian belong? So the only thing I have access to right now is his n data item. So I'm going to check, is Brian less than Jason? The answer is true. So what now needs to happen, in the correct order? I need to update how many pointers in total in this story? Where my hand is still pointing at Jason, and your hand-- if you want to put your hand like, sort of, I don't know, a question mark. OK, good. All right, so you have a few candidates. Either Ben or I or Brian or Jason or everyone else, which pointers need to change? How many in total? OK, so two. My pointer doesn't really matter anymore because I'm just temporary. So it's these two guys, presumably, both Ben and Brian. So let me propose that we update Ben, since he's first. The first element of this list is now going to be Brian. So Ben point at Brian. OK, now what? Who gets pointed at whom? STUDENT: [INAUDIBLE]. SPEAKER 1: OK so Brian has to point at Jason. But have I lost track of that pointer? Do I know where Jason is? STUDENT: [INAUDIBLE]. SPEAKER 1: I do, since I'm the temporary pointer. And presumably, I have not changed to point at the new node. So we can simply have Brian point at whoever I'm pointing at. And we're done. So case one, insertion at the beginning of the list. There were two key steps. One, we have to update Ben, and then we also have to update Brian. And then I don't have to bother traipsing through the rest of the list, because we already found his location, because he belonged to the left of the first element. All right, so pretty straightforward. In fact, feels like we're almost making this too complicated. So let's now pluck off the end of the list, and see where the complexity starts. So if now, I alloc from the audience. Anyone want to play 55? All right, I saw your hand first. Come on up. Yeah. What's your name? STUDENT: [INAUDIBLE]. SPEAKER 1: Habata. OK, come on up. You'll be the number 55. So you, of course, belong at the end of the list. So let's replay the simulation with me being the PTR for just a moment. So I'm first going to point at whatever Ben's pointing at. We're both pointing now at Brian. So 55 is not less than five. So I'm going to update myself by pointing to Brian's next pointer, who now is of course Jason. 55 is not less than nine, so I'm going to update PTR. I'm going to update PTR. I'm going to update PTR I going to update PTR. And I'm going to-- hmm, what's your name again? STUDENT: Diana. SPEAKER 1: Diana is pointing, of course, at null with her left hand. So where does Habata actually belong clearly? To the left, here. So how do I know to put her Here I think I've screwed up. Because what is PTR art this moment in time? Null. So even though, visually, we can obviously see all of these guys here on stage. I've not kept track of the previous person in the list. I don't have a finger pointing out, in this case, the node number 34. So let's actually start this over. So now I actually do need a second local variable. And this is what you'll see in the actual sample C code, where as I go, when I update my right hand to point Jason, thereby leaving Brian behind, I better start using my left hand to update where I was, so that as I go through this list-- more awkwardly than I intended now here visually-- I'm going to get to the end of the list. This hand is still null, which is pretty useless, other than to indicate I'm clearly at the end of the list, but now at least I have this predecessor pointer pointing here, so now what hands and what pointers need to be updated? Whose hand do you want to reconfigure first? STUDENT: [INAUDIBLE]. SPEAKER 1: OK, so Diana's. Where do you want to point Diana's left pointer at? At 55, presumably, so that we've inserted there. And where should 55 pointer go? Down, representing null. And my hands, at this point, don't matter because they were just temporary variables. So now we're done. So the additional complexity there-- and it's not that hard to implement, but we need a secondary variable to make sure that before I move my right hand, I update the value of my left hand, pred pointer in this case, so that I have a trailing pointer to keep track of where I was. Now as an aside, if you're thinking this through, this feels like it's a little annoying to have to keep track of this left hand. What would another solution to this problem have been? If you got to redesign the data structure we're talking through right now? If this just kind of feels a little annoying to have, like, two pointers going through the list, who else could have, in an ideal world, maintained information that we need? Yeah? STUDENT: [INAUDIBLE]. SPEAKER 1: Exactly. Right so there's actually an interesting germ of an idea. And this idea of a previous pointer, pointing at the previous element. What if I just embodied that inside of the list itself? And it's going to be hard to visualize this without all the paper falling to the floor. But suppose that these guys used both of their hands to have a previous pointer, and a next pointer, thereby implementing what we'll call a doubly linked list. That would allow me to sort of rewind, much more easily without me, the programmer, having to keep track manually-- truly manually-- of where I had been previously in the list. So we won't do that. We'll keep it simple because that's going to come at a price, twice as much space for the pointers, if you want a second one. But that's indeed a common data structure known as a doubly linked list. Let's do the final example here and put these guys out of their misery. So malloc 20. Come on up from the aisle there. All right, what's your name? STUDENT: [INAUDIBLE]. SPEAKER 1: Sorry? STUDENT: [INAUDIBLE]. SPEAKER 1: Demeron? OK come on up. You shall be 20. You obviously are going to belong between 17 and 22. So let me learn my lesson. I'm going to start pointer pointing at Brian. And I'm going to have my left hand only update to Brian as I move to Jason, checking does 20 less than nine? No. Is 20 less than 17? No. Is 20 less than 22? Yes. So what pointers or hands need to change where they're pointing now? So we can do 17 pointing at 20. So that's fine. Where do we want to point your pointer now? At 22. And we know where 22 is, again thanks to my temporary pointer. So we're OK there. So because of this temporary storage I've kept track of where everyone is. And now you can visually go into where you belong, and now we need 1, 2, 3, 4, 5, 6, 7, 8, 9 stress balls, and a round of applause for these guys, if we could. Nicely done. [APPLAUSE] SPEAKER 1: All right. And you may keep the pieces of paper as mementos. All right, so, trust me it's a lot easier to walk through that with humans than it is with actual code. But what you'll find in just a moment now, is that same-- oh, thank you. Thank you-- is that you'll find that the same data structure, a linked list, can actually be used as a building block to even more sophisticated data structures. And realize too the theme here is that we've absolutely introduced more complexity into the implementation of this algorithm. Insertion, and if we went through it, deletion and searching, is a little more complicated than it was with an array. But we gain some dynamism. We get an adaptive data structure. But again, we pay a price of having some additional complexity, both in implementing it. And we're given up random access. And to be honest, there's not some nice clean slide I can give you that says here is why a linked list is better than an array. And leave it at that. Because the theme reoccurring now, even more so in the coming weeks, is that there's not necessarily a correct answer. This is why we have the separate axis of design for problem sets. It will be very context sensitive whether you want to use this data structure or that one, and it will depend on what matters to you in terms of resources and complexity. But let me propose that the ideal data structure, the holy grail, would be something that's constant time, irrespective of how much stuff is inside it, wouldn't it be amazing if a data structure returned answers in constant time. Yes. This word is in your huge dictionary. Or no, this word is not. Or any such problem there. Well let's see if we can't at least take a step toward that. Let me propose a new data structure that can be used for different things, in this case called a hash table. And so we're actually back to glancing at an array, in this case, and somewhat arbitrarily, I've drawn this hash table as an array with sort of a two-dimensional array-- or rather it's depicted here as a two dimensional array-- but this is just an array of size 26, such that if we call the array table, table bracket zero is the rectangle at the top. Table bracket 25 is the rectangle at the bottom. And this is how I might draw a data structure in which I want to store people's names. So for instance, and I won't draw the whole thing here on the overhead, if I had this array, which I'm now going to call a hash table, and this is again location zero. This here is location one, and so forth. I claim that I want to use this data structure, for the sake of discussion, to store people's names, Alice and Bob and Charlie and other such names. So think of this now as the beginnings of, say, a dictionary with lots of words. They happen to be names in our example here. And this is all too germane, perhaps, to implementing a spell checker, as we might for problem set six. So if we have an array of total size 26 so that this is the 25th location at the bottom, and I claim that Alice is the first word in the dictionary of names that I want to insert into RAM, into this data structure, where are instincts telling you that Alice's name should go in this array? We have 26 options. Where we want to put her? We want her in bracket zero, right? A for Alice, let's call that zero. And B will be one, and C will be two. So we're going to write Alice's name up here. If we then insert Bob, his name will go here. Charlie will go here. And so forth down through this data structure. This is a wonderful data structure. Why? Well what is the running time of inserting a human's name into this data structure right now? Given that this table is implemented, truly, as an array. Well it's constant time. It's order of one. Why? Well how do you determine where Alice belongs? You look at which letter of her name? The first. And you can get there, if it's a string, by just looking at string bracket zero. So the zeroth character of the string. That's easy. We did that in the crypto assignment weeks ago. And then once you know that Alice's letter is capital A, we can subtract off 65 or capital A itself, that gives us zero. So we now know that Alice belongs at location zero. And given a pointer to this data structure, of some sort, how long does it take me to find location zero in an array? Just one step, right It's constant time because of the random access we proposed was a feature of an array. So in short, figuring out what the index of Alice's name is, which is, in this case, is A, or let's just resolve that to zero, where B is one and C is two, figuring that out is constant time. I just have to look at her first letter, figuring out where zero is an array is also constant time. So technically that's like two steps now. But that's still constant. So we call that big O of one, so we've inserted Alice into this table in constant time. But of course, I'm being naive here, right? What if there's an Aaron in the class? Or Alicia? Or any other names starting with A. Where are we going to put that person, right? I mean, right now there's only three people on the table, so maybe we should put Aaron at location zero one two three. Right, I could put A here. But then, if we try to insert David into this list, where does David go? Now our system starts breaking down, right? Because now David ends up here if Aaron is actually here. And so now this whole idea of having a clean data structure that gives us constant time insertions is no longer constant time, because I have to check, oh, damnit, someone's already at Alice's location. Let me probe the rest of this data structure, looking for a spot to put someone like Aaron's name. And so that too is starting to take linear time. Moreover , if you now want to find the Aaron in this data structure, and you check, and Aaron's name is not here. Ideally, you would just say Aaron's not in the data structure. But if you do start making room for Aaron where there should have been a D or an E, you, worst case, have to check the whole data structure, in which case it devolves into something linear in the size of the table. So all right, I'll fix this. The problem here is that I had 26 elements in this array. Let me change it. Whoops. Let me change it so that rather being of size 26 in total, notice the bottom index is going to change to n minus 1. If 26 is clearly too small for humans' names, because there's thousands of names of the world, let's just make in 100 or 1,000 or 10,000. Let's just allocate a lot more space. Well that doesn't necessarily decrease the probability that we won't have two people with names starting with A, and so, you were going to try to put A names at location zero still. They're still going to collide, which means we still need a solution to put Alice and Aaron and Alicia and other names starting with A elsewhere. But how much of a problem is this? What's the probability that you have collisions in a data structure like this? Well, let me-- we'll come back to that question here. And look at how we might solve it first. Let me pull up this proposal here. What we just described is an algorithm, a heuristic called linear probing whereby, if you tried to insert something here in this data structure, which is called a hash table, and there's no room there, you truly probe the data structure checking, is this available? Is this Available is this available? Is this available? And when it finally is, you insert the name that you originally intended elsewhere at that location. But in the worst case, the only spot might be the very bottom of the data structure, the very end of the array. So linear probing, in the worst case, devolves into a linear algorithm where Aaron, if he happens to be inserted last in this data structure, he might collide with this first location, but then end by bad luck at the very end. So this is not a constant time holy grail for us. This approach of inserting elements into a data structure called a hash table does not seem to be constant time at least not in the general case. It can devolve into something linear. So what if we resolve collisions somewhat differently? So here's a more sophisticated approach to what's still called a hash table. And by hash, as an aside, what I mean is the index that I referred to earlier. To hash something can be thought of as a verb. So if you hash Alice's a name, a hash function, so to speak, should return a number. In this case is zero if she belongs at location zero, one if she belongs at location one, and so forth. So my hash function thus far has been super simple, only looking at the first letter in someone's name. But a hash function takes as input some piece of data, a string, an int, whatever. And it spits out typically a number. And that number is where that data element belongs in a data structure known here as a hash table. So just intuitively, this is a slightly different context. This actually is referring to an example involving birthdays, where there might be as many as 31 days in the month. But what did this person decide to do in the event of a collision? Context now being, not a collision of names, but a collision of birthdays, if two people have the same birthday on the 2nd of October, for instance. STUDENT: [INAUDIBLE]. SPEAKER 1: Yeah, so here we have the leveraging of linked lists. So it looks a little differently than we drew it earlier. But we appear to have to an array on the left hand side. That's one index, for no particular reason. But it's still an array. It's an array of pointers. And each of those elements, each of these circles or slashes--the slash representing null-- each of these pointers is apparently pointing to what data structure? A linked list. So now we have the ability to hard code into our program the size of the table. In this case, we know there's never more than 31 days in a month. So hard coding a value like 31 is reasonable in that context. In the context of names, hard coding 26 is not unreasonable it people's names only start with, for instance, the alphabet involving A through Z. We can cram them all into that data structure so long as, when we get a collision, we don't put the names here, we instead think of these cells not as strings themselves, but as pointers to, for instance, Alice. And then Alice can have another pointer to another name starting with A. And Bob actually goes over here. And if there's another name starting with B, he ends up over here. And so each of the elements of this table two, if we designed this a little more cleverly-- come on-- if we designed this a little more cleverly, now becomes an adaptive data structure, where there's no hard limit on how many elements you can insert into it because if you do have a collision, that's fine. Just go ahead and append it to what we saw a bit ago was known as a linked list. Well let's pause for just a moment. What is the probability of a collision in the first place? Right, maybe I'm over thinking, maybe I'm over engineering this problem, because you know what? Yes, I can come up with arbitrary examples off the top of my head like Allison and Aaron, but in reality, given a uniform distribution of inputs, that is some random insertions into a data structure, what really is the probability of a collision? Well turns out, it's actually super high. Let me generalize this problem is as this. So in a room of n CS50 students, what's the probability that at least two students in the room have the same birthday? So there's what. a few hund-- 200, 300 people here and several hundred people at home today. So if you wanted to ask ourselves what's the probability of two people in this room having the same birthday, we can figure this out. And I claim actually there are two people with the same birthday. For instance, does anyone have birthday today? Yesterday? Tomorrow? All right, so it feels like I'm going to have to do this 363 or so more times to actually figure out if we do have a collision. Or we could just do this mathematically rather than tediously doing this. And propose the following. So I propose that we could model the probability of two people having the same birthday as the probability of 1 minus the probability of no one having the same birthday. So to get this, and this is just the fancy way of writing this, for the first person in the room, he or she can have any one of the possible birthdays assuming 365 days in the year, with apologies to persons with the February 29th birthday. So the first person in this room is free to have any number of birthdays out of the 365 possibilities so that we'll do that 365 divided by 365, which is one. The next person in the room, if the goal is to avoid a collision, can only have his or her birthday on how many different possible days? 364. So the second term in this expression is essentially doing that math for us by subtracting off one possible day. And then the next day, the next day, the next day down to the total number of people in the room. And if we then consider, what then is the probability not of everyone having unique birthdays, but again 1 minus that, what we get is an expression that can very fancifully look like this. But it's more interesting to look at visually. This is a chart where on the x-axis is the number of people in the room, the number of birthdays. On the y-axis is the probability of a collision, two people having the same birthday. And the takeaway from this curve is that as soon as you get to like 40 students, you're up at a 90% probability combinatorically of two people or more having the same birthday. And once you get to like 58 people it's nearly 100% of a chance the two people in the room are going to have the same birthday, even though there's 365 or 366 possible buckets, and only 58 people in the room. Just statistically you're likely to get collisions, which in short motivates this discussion. That even if we get fancy here, and start having these chains, we're still going to have collisions. So that begs the question, what is the cost of doing insertions and deletions into a data structure like this? Well let me propose-- and let me go back to the screen over here-- if we have n elements in the list, so if we're trying to insert n elements, and we have how many total buckets? Let's say 31 total buckets in the case of birthdays. What's the maximum length of one of these chains potentially? If again there's 31 possible birthdays in a given month. And we're just clumping everyone-- actually that's a stupid example. Let's do 26 instead. So if actually have people whose names start with A through Z, thereby giving us 26 possibilities. And we're using a data structure like the one we just saw, whereby we have an array of pointers, each of which points to a linked list where the first list is everyone with the name Alice. The second list is every with the name starting with A, starting with B, and so forth. What's the likely length of each of those lists if we assume a nice clean distribution of names A through Z across the whole data structure? There's n people in the data structure divided by 26, if they're nicely spread out over the whole data structure. So the length of each of these chains is n divided by 26. But in big O notation, what is that? What is that really? So it's really just n, right? Because we've said in the past, that ugh you divide by 26. Yes, in reality it is faster. But in theory, it's not fundamentally all that faster. So we don't seem to be all that much closer to this holy grail. In fact, this is just linear time. Heck, at this point, why don't we just use one huge linked list? Why don't we just use one huge array to store the names of everyone in the room? Well, is there still something compelling about a hash table? Is there still something compelling about a data structure that looks like this? This. STUDENT: [INAUDIBLE]. SPEAKER 1: Right, and again if it's just a linear time algorithm, and a linear time data structure, why don't I just store everyone's name in a big array, or in a big linked list? And stop making CS so much harder than it needs to be? What is compelling about this, even though I scratched it out? STUDENT: [INAUDIBLE]. SPEAKER 1: Insertions aren't? Expensive anymore. So insertions potentially could still be constant time, even if your data structure looks like this, an array of pointers, each of which is pointing at potentially a linked list. How could you achieve constant time insertion of names? Stick it in the front, right? If we sacrifice a design goal from earlier, where we wanted to keep everyone's name, for instance, sorted, or all of the numbers on stage sorted, suppose that we have an unsorted linked list. It only costs us one or two steps, like in the case of Ben and Brian earlier, to insert an element at the beginning of the list. So if we don't care about sorting all of the names starting with A or all the names starting with B, we can still achieve constant time insertion. Now looking up Alice or Bob or any name more generally is still what? It's big O of n divided by 26, in the ideal case where everyone's uniformly distributed, where there's as many A's as there are Z's, which is probably unrealistic. But that's still linear. But here, we come back to the point of asymptotic notation being theoretically true. But in the real world, if I claim that my program can do something 26 times faster than yours, whose program are you going to prefer using? Yours or mine, which is 26 times faster? Realistically, the person whose is 26 times faster, even if theoretically our algorithms run in the same asymptotic running time. Let me propose a different solution altogether. And if this doesn't blow your mind, we're out of data structures. So this is it a trie-- kind of a stupid name. It comes from retrievals, and the word is spelled trie, t-r-i-e, because of course retrieval sounds like trie. But that's the history of the word trie. So a trie is indeed some kind of tree, and it's also a play on that word. And even though you can't quite see it with this visualization, a trie is a tree structured, like a family tree with one ancestor at the top and lots of grandchildren and great grandchildren as leaves on the bottom. But each node in a trie is an array. And it's in an array-- and let's oversimplify for a moment-- it's an array, in this case, of size 26, where each node again is an array of size 26, where the zeroth element in that array represents A, and the last element in each such array represents Z. So I propose, then, that this data structure, known as a trie, can be used also to store words. We saw a moment ago how we could store words, or in this case names, and we saw earlier how we can store numbers, but if we focus on names or strings here, notice what's interesting. I claim that the name Maxwell is inside of this data structure. Where do you see Maxwell? STUDENT: [INAUDIBLE]. SPEAKER 1: On the left. So what's interesting with this data structure is rather than store the string M-A-X-W-E-L-L backslash zero, all contiguously, what you instead do is following. If this is a trie like data structure, each of whose nodes is again an array, and you want to store Maxwell, you first index and so the root's node, so to speak, the topmost node, at location M, right, so roughly into the middle. And then from there, you follow a pointer to a child nodes, so to speak. So in the family tree sense, you follow it downward. And that lead you to another node on the left there, which is just another array. And then if you want to store Maxwell, you find the pointer that represents A, which is this one here. Then you go to the next node. And notice-- this is why the picture's a little deceiving-- this node look super tiny. But to the right of this is Y and Z. It's just the author has truncated the picture so that you actually see things. Otherwise this picture would be hugely wide. So now you index into location X, then W, Then E, then L, then L. Then what's this curiosity? Well, if we're using this sort of new take on how to store a string in a data structure, you still need to essentially check off in the data structure that a word ends here. In other words, each of these nodes somehow has to remember that we actually followed all of these pointers and are leaving a little bread crumb at the bottom here of this structure to indicate M-A-X-W-E-L-L is indeed in this data structure. So we might do this as follows. Each of the nodes in the picture we just saw has one, an array of size 27. And it's now 27, because in p set six, we'll actually give you an apostrophe, so we can have names like O'Reilly and others with apostrophes. But same idea. Each of those elements in the array points to a struct node, so just a node. So this is very reminiscent of our linked list. And then I have a Boolean, which I'll call word, which is just going to be true if a word ends at this node in the tree. It effectively represents the little triangle we saw a moment ago. So if a word ends at that node in the tree, that word field will be true, which is conceptually checking off, or we're drawing this triangle, yes there is a word here. So this is a trie. And now the question is, what is its running time? Is it big O of n? Is it something else? Well, if you have n names in this data structure, Maxwell being just one of them, what is the running time of inserting or finding Maxwell? What's the running time of inserting Maxwell? If there's n other names already in the table? Yeah? STUDENT: [INAUDIBLE]. SPEAKER 1: Yeah, it's the length of the name, right? So M-a-x-w-e-l-l so it feels like this algorithm is big O of seven. Now, of course, the name will vary in length. Maybe it's a short name. Maybe it's a longer name. But what's key here is that it's a constant number. And maybe it's not really constant, but god, if realistically, in a dictionary, there's probably some limit on the number of letters in a person's name in a particular country. And so we can assume that value is a constant. I don't know what it is. It's probably larger than we think it is. Because there's always some corner case with a crazy long name. So let's call it k, but it's still a constant presumably, because every name in the world, at least in a particular country, is that length or shorter, so it's constant. But when we've said something is big O of a constant value, what's that really equivalent to? That's really the same thing as saying constant time. Now we're kind of cheating, right? We're kind of leveraging some theory here to say that well, order of k is really just order of one, and it's constant time. But it really is. Because the key insight here is that if we have n names already in this data structure, and we insert Maxwell, is the amount of time it takes us to insert Maxwell at all affected by how many other people are in the data structure? Doesn't seem to be. If I had a billion more elements to this trie, and then insert Maxwell, is he at all affected? No. And that's unlike any of the day data structures we've seen thus far, where the running time of your algorithm is completely independent of how much stuff is or isn't already in that data structure. And so with this affords you now is an opportunity for p set six, which will again involve implementing your own spell checker, reading in 150,000 words, how best to store that isn't necessarily obvious. And though I've aspired to find the holy grail, I don't claim that a trie is. In fact, a hash table may very well prove to be much more efficient. But those are just-- that's just one of the design decisions you will have to make. But in closing let's take 50 or so seconds to take a peek at what lies ahead next week and beyond we transition from this command line world if C programs to things web based and languages like PHP and JavaScript and the internet itself, protocols like HTTP, which you've taken for granted for years now, and typed most every day, perhaps, or seen. And we'll start to peel back the layers of what is the internet. And what is the code that underlies today's tools. So 50 seconds of this teaser here. I give you Warriors of the Net. [VIDEO PLAYBACK] -He came with a message. With a protocol all his own. He came to a world of cruel firewalls, uncaring routers, and dangers far worse than death. He's fast. He's strong. He's TCPIP. And he's got your address. Warriors of the Net. [END VIDEO PLAYBACK] SPEAKER 1: That is how the internet shall work as of next week.