[MUSIC PLAYING] DAVID MALAN: All right. This is CS50, and this is week five. Recall that last week in week four, we introduced a few new building blocks, namely pointers, and spoke in great detail about how you can now manipulate a computer's memory and begin to do things at a lower level with it. Well today, we'll sort of use those basic building blocks to start creating things called data structures in the computer's memory. It turns out that once you have this ability to refer to different locations in the computer's memory, you can stitch together your own custom shapes, your own custom data structures, as they're called. And indeed, we'll start doing that by rolling back for just a moment to where we first saw a data structure in week two. So recall in week two, which was our second week of playing with C, we introduced you to the notion of an array. And an array is just a contiguous sequence of memory in which you can store a whole bunch of integers back to back to back, or maybe a whole bunch of chars back to back to back. And those arrays might have been represented pictorially like this. So this would be an array of size 3. And suppose that this were an array indeed of integers. We might have one, two, three inside of it. But suppose now-- and here's where we'll start to bump up against a problem, but also solve a problem today. Suppose that you want to add another number to this array, but you've only had the forethought to create an array of size 3. The catch with arrays in C is that they're not really easily resizable. Again, you all know that you have to decide in advance how big the array is going to be. So if you change your mind later or your program's running long enough that you need to store more values in it, you're in a bind. Like intuitively, if you wanted to insert the number 4 into this array, you would ideally just plop it right here at the end of the array and continue about your business. But the catch with an array is that that chunk of memory is not-- it doesn't exist in a vacuum. Recall if we zoom out and look at all of your computer's memory, this byte, this byte, and a whole bunch of other bytes might very well be in use by other variables or other aspects of your program. So for instance, for the sake of discussion, suppose that the program in question has one array of size 3 containing the integers 1, 2, 3. And then suppose that your same program has a string somewhere in the code that you've written and that string is storing hello, world. Well, next to this array in memory, just by chance, may very well be an H-E-L-L-O comma space W-O-R-L-D backslash 0. And there might be free memory, so to speak. Memory you could use that's filled with garbage values. And garbage isn't bad. It just means that you don't know, and you don't really care, what values are or were there. So there is free space, so to speak. Each of these Oscars represents effectively free space with some garbage value there. Remnants, maybe, of some execution past. But the problem here is that you don't have room to plop this 4 right where you might want to put it. So what's the solution? If we have this array of size 3 containing three integers-- 1, 2, 3-- but it's been painted into a corner whereby H-E-L-L and so forth are already immediately next to it, we can't just put the 4 there without sacrificing the H. And that doesn't really feel like a solution. So any thoughts on how we can solve this problem? Are we completely out of luck? Can you just not add a number to an array in a situation like this?
Or is there a solution perhaps that comes to mind? Even if you've never programmed before, you have on the screen there the lay of the land. Santiago, what do you think? AUDIENCE: I would say that maybe you could copy the elements in the original array and create a new array, but that's one size bigger or one element bigger. And then, add that new element. DAVID MALAN: Yeah. That's really good intuition. After all, there's all these Oscars on the screen right now, which again represent garbage values. Or in turn, free space. So I could put 1, 2, 3, 4 up here. I could put 1, 2, 3, 4 over here. I could put 1, 2, 3, 4 down here. So we have some flexibility, but Santiago is exactly right. Intuitively, all we really need to do is focus only on four of the available spots. A new array of size four, if you will, which initially has these four garbage values. But that's OK. Because as Santiago also notes, it suffices now to just copy the old array-- 1, 2, 3-- into the new array. And heck, maybe we can now even free the memory from the original array, much like we could if we used malloc. And that leaves us then, of course, with just an array of size 4 with that fourth garbage value. But now, we do have room for, like, the number 4 itself. So it would seem that there is a solution to this problem that doesn't violate the definition of an array. Again, the only definition of an array really is that the memory must be contiguous. You can't just plop the 4 anywhere in the computer's memory. It has to come right after your existing memory if this whole thing, this whole structure, is indeed going to still be an array. But I worry that might have cost us a bit of time. And in fact, let me go ahead and open up on the screen here in just a moment a question that you're all welcome to buzz in for. What would be the running time of inserting into an array? Let me go ahead and reveal the poll question here. Feel free to go to the usual URL. Which, Brian, if you wouldn't mind copying and pasting as usual. What do you think the running time is of inserting into an array? Inserting into an array. Recall in the past, we talked about arrays in terms of searching. But today, this is the first time we're really talking about inserting into them. And if we take a look at the results coming in, it looks like 80% of you feel that it's linear time, big O of n, whereby it might take as many as n steps to actually insert into an array. 5% of you propose n log n. 7% of you, n squared. And then, 2% and 5% respectively for the other values as well. So this is kind of an interesting mix because when we talked about arrays in the past and we talked about searching, recall that we typically achieved big O of log n. That's really good. Unfortunately, if we have to do what Santiago proposed and actually copy all of the elements from the old array into the new array, it's going to take us as many as n steps because we have to copy each of the original elements-- 1, 2, 3-- over into the new array, which is going to be of size n plus 1. So it's on the order of n steps in total. So the running time then of inserting into an array, in terms of an upper bound at least, is going to be indeed big O of n because you've got to copy potentially all of those elements over. But perhaps, we might feel differently if we consider a lower bound on the running time of insert. What might the lower bound of insert be when it comes to an array? Again, omega notation is what we can use here.
How many steps maybe in the best case might it take me to insert a value into an array? We won't do a poll for this one. Brian, why don't we go ahead and call on a hand for this. What's a lower bound on the running time of insert? Ryan, what do you think? AUDIENCE: Well, the best case scenario would be if there's only one element in the array. So you would just have a one or it would just be one element right into the array. DAVID MALAN: Yeah, so if you've got an array-- and let me emphasize that's already empty-- whereby you have room for the new element, then indeed. Omega of 1, constant time, is all you need so long as you have space available in the array. And it doesn't matter how large the array is. Maybe it's of size 4, but you've only put one number in it. That's OK because you can immediately put the new number in place. Recall that arrays support random access. You can use the square bracket notation and just jump to any location in so-called constant time, in just one step. So if your array is of sufficient size and it's not full, then yes. The lower bound on insertion into an array is going to be constant time, omega of 1. But as we saw in Santiago's situation, whereby you have an array that's already filled with elements and you want to add another, well in that case, the upper bound is indeed going to be big O of n because you have to do the additional labor of copying the values over from one to the other. Now, those of you who have programmed before, maybe you've used Java. You might be familiar with the phrase vector. A vector is like an array that can be resized, can grow and shrink. That's not what arrays are in C. Arrays in C are just contiguous blocks of memory with values back to back to back. But once you decide on their size, that is it. You're going to have to resize it, essentially, yourself. They're not going to automatically grow for you. So an array was the first and really the simplest of the data structures we'll see. But it's also not nearly as powerful as what we can do now that we have access to a computer's memory. Today, we're going to start to leverage these things called pointers, the addresses via which we can refer to locations in memory. And we're going to start to stitch together some fancier data structures-- first, one dimensional in some sense. Then, two dimensional in some sense-- by using some very basic building blocks. Recall these three pieces of syntax from the past weeks. Struct, recall, is this mechanism, this keyword in C, whereby we can define our own structures in memory. We saw one for a person whereby a person might have both a name and a number in something like a phone book. And you've seen the dot operator. The dot operator was how we go inside of such a structure and get at dot name or dot number. The specific variables inside of the struct. And then last week, recall we saw the star operator, the dereference operator, which colloquially means go to this particular address. So just by using these three ingredients are we going to be able to now build up our own custom data structures that are even fancier than arrays and can ultimately help us solve problems more efficiently. And in fact, this is such a common technique in programming in C that these two operators, the dot and the star, are often combined into one that intentionally looks like an arrow. So we'll see its usage in just a little bit.
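[Editor's note: to make those three pieces of syntax concrete, along with the arrow that combines the last two, here is a minimal sketch; the person struct is as described in lecture, and the sample values are purely illustrative.]

    #include <stdio.h>

    typedef struct
    {
        char *name;
        char *number;
    }
    person;

    int main(void)
    {
        person p;
        p.name = "David";                // dot operator: go inside the struct
        p.number = "+1-617-495-1000";    // illustrative value only

        person *ptr = &p;                // star: ptr stores the address of p
        printf("%s\n", (*ptr).name);     // dereference, then use the dot operator
        printf("%s\n", ptr->name);       // equivalent: the arrow combines star and dot
    }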
So syntactically today, it's pretty much building blocks past but we're going to use these building blocks now in new ways to start solving problems differently. And we'll first do this by way of something called linked lists. So for instance, a linked list is going to be a data structure that solves some of the problems with an array. And what's a problem arguably? Well if it takes big O of n steps to insert into an array, frankly that's kind of annoying. That's kind of expensive. Because over time if you're writing a really big program with lots of data-- you're the Googles of the world, the Twitters of the world-- it's fine if you want to store all of your data in a big array for the sake of efficient searching. Which recall, was big O of log n if we use something like binary search and keep everything sorted. But it's going to be pretty painful every time you want to add another tweet to the array, or some other web page to the array. Depending on what the problem is that you're solving, you might potentially-- as Santiago notes-- have to copy all of the contents of your original smaller array into a new bigger array just to add more tweets or more web pages or the like. So a linked list is going to be a data structure that's more dynamic, whereby you can grow and shrink the data structure without having to touch all of the original data and move it from old location to new. So what might this look like? Well, let's consider again our computer's memory and let's propose that I want to store those same values again. So like the number one-- and just for the sake of discussion, suppose that it's in my computer's memory at address 0x123. 0x just means it's a hexadecimal number. 1-2-3 is completely arbitrary. I made it up just for the sake of discussion. So let me stipulate that that's where the number 1 happens to be in the computer's memory in this new solution to the problem of storing lots of data. Suppose I want to store number 2. Maybe it's at address 0x456. And then, suppose I want to store number 3. Suppose it's at address 0x789. So notice deliberately, these numbers are spread out in the computer's memory. Because after all, the problem we ran into a moment ago with arrays is that you might have hello, world or other values from other parts of your program in the way. So I'm proposing instead that if you want to store one, and then two, and then three, that's fine. Plop them anywhere you want and you don't have to worry about where there are already existing values. Instead, you can just put these values where there is room. The problem, though, is that if you just start plopping values like one, two, three anywhere you want in the computer's memory, you can no longer very easily find those values, right? You might know where one is, but it no longer suffices to just look one location to the right to find the next value, or two to the right to find the next value after that. In an array, everything is contiguous. But if we instead start to treat the computer's memory as just a canvas where we can just draw numbers anywhere we want, that's fine so long as we can somehow help ourselves get from the first, to the second, to the third irrespective of all the other stuff that's cluttering up the computer's memory. So in fact, let me propose that we do this by maybe stealing a bit more space from the computer. So rather than use just enough memory to store one, two, and three, let me store twice as much information.
In addition to every number that I actually care about-- one, two, three, my data-- let me store a little metadata, so to speak. Values that I don't fundamentally care about, but that are going to help me keep track of my actual data. And let me propose that in this box here, I literally store the value 0x456. Again, it's written in hexadecimal, but that's just a number. That's the address of somewhere else in memory. In this box, let me propose that I store 0x789. And in this box, let me arbitrarily say 0x0, which is just all zero bits. Now why have I done this? Even if you've never seen this structure that's evolving to be what's called a linked list, why have I just done what I've done? In addition to storing one, two, and three respectively, I'm also now storing 0x456 in an additional chunk of memory and 0x789 in an additional chunk of memory. But why? Sofia. AUDIENCE: So that we know how the first element relates to the second or how they're linked together. DAVID MALAN: That's exactly-- AUDIENCE: --between the first and second. DAVID MALAN: Yeah. So now, I'm using essentially twice as much space to store each piece of data. One, two, and three respectively. But in the second chunk of space, I'm now storing a pointer to the next element in the thing I'll now think of as a list. So this is 0x456 because the number two, which is the next number in the list I care about, lives at 0x456. This number is 0x789 because the next number after that I care about is at address 0x789. So it's just a helpful way now of leaving myself breadcrumbs so that I can plop the one, the two, the three anywhere I want in the computer's memory wherever there's available space and still figure out how to get from one to the other to the other. And we've actually seen some of this syntax before. It turns out that 0x0, which is all zero bits, that's just the technical-- that's the numeric equivalent of what we've called null. N-U-L-L, which we introduced last week, is a special symbol indicating that something has gone wrong with memory or you're out of space. It's sort of the absence of an address. C guarantees that if you use 0x0, AKA null, that just indicates the absence of any useful address there. But you know what? Again, like last week, this is sort of getting way into the weeds. I don't really care about 0x123, 0x456, 0x789. So let's just kind of abstract this away and start thinking about this really as a list of numbers that are somehow linked together. Underneath the hood, the links are implemented by way of addresses, or pointers, those low level numbers like 0x123, 0x456, and 0x789. But pictorially, it suffices for us to just start thinking of this data structure, hereafter known as a linked list, as being a collection of nodes, so to speak-- N-O-D-E-- that are connected via pointers. So a node is just a generic computer science term that refers to some kind of structure that stores stuff you care about. What I care about here is a number and a pointer to another such structure. So this is a linked list and each of these rectangles represents a node. And in code, we'll implement those nodes ultimately by way of that thing in C called a struct. But let me pause here to see first if there are any questions about the structure we have built up. Any questions about this thing called a linked list before we see it in some code? Brian? BRIAN: Yeah, a question came in in the chat asking, isn't this kind of a waste of memory that we're now using all of these cells to store addresses too?
DAVID MALAN: Yeah, really good observation. Isn't this kind of a waste of memory in that we're storing all of these addresses in addition to the numbers 1, 2, 3 that we care about? Yes, and in fact, that is exactly the price we are paying, and this is going to be thematic. This week, last week, and really every week thereafter, any time we solve a problem in CS and programming in particular, there's always going to be some price paid. There's going to be some cost and really, there's going to be some trade-off. So if a moment ago, it was unacceptable that inserting into an array is big O of n because man, that's going to take so many steps to copy all of the values from the old array into the new array, if that is unacceptable to you for performance reasons or whatever the problem is that you're solving, well, that's fine. You can solve that problem and now have much more dynamism whereby you can plop your numbers anywhere in memory without having to move the existing numbers anywhere else, thereby saving yourself time. But the price you're going to pay indeed is more space. So at that point, it kind of depends what's more important to you, the computer's time, your human time, or maybe the space or the cost of the space, the more memory that you might really need to literally buy for that computer. So this is going to be thematic. This trade-off between space and time is omnipresent, really, in programming. Well, let's consider how we might translate this now into actual code. Recall that when we last saw structs in C, we did something like this to define a person as having two things associated with them, a name and a number. So today, we don't care about persons and names and numbers. We care about these things we're going to start calling nodes, so let me go ahead and rewind from that. Erase that and let's instead say that every node in this structure, renaming person as well to node, is going to have a number, which will be an int in our case here, and I've left room for one other value because we ultimately need to be able to store that second piece of data. That second piece of data is going to be a pointer. It's going to be an address of something, but how might I express this? This is going to be a little less obvious, but we laid the foundation last week with pointers. How could I describe this structure as having a pointer to another such structure? Any thoughts, verbally, on the syntax to use, or even if you're not sure of exactly the incantation, exactly what symbols we should use to express an address of another such node? BRIAN: Someone is suggesting we use a node * as a pointer to a node. DAVID MALAN: Node *? All right. So * I definitely remember from last week, indicating that if you have int *, this is the address of an int. If you have char *, this is the address of a char. So if all of these arrows really just represent addresses of nodes, it stands to reason that the syntax is probably going to be something akin to node *. Now, I can call this pointer anything I want. By convention, I'll call it next. Because literally, its purpose in life is to point to the next node in the data structure. And that's going to be in addition to the int called number that I'll propose describes the top of that individual data structure. But there's a subtle problem here in C. Recall that C is kind of a simplistic language, complicated though it might often seem, in that it doesn't understand anything that it hasn't seen before.
So at the moment, notice that the very first time I have mentioned node up until now was on the very last line of this snippet of code. The problem is that by nature of how typedef works, the thing called a node does not actually exist until the compiler is done reading that last line of code and the semicolon, which is to say that it would actually be incorrect to use or refer to quote, unquote "node" inside of this structure. Because literally, by definition of how typedef works, a node does not exist until, again, that last line of code and the semicolon have been read. Thankfully, there's a workaround. It's a little weird to look at. But what you can do in C is this. You can actually add an additional word after, literally, the keyword struct. And we'll keep it simple. We'll use the exact same word, though technically that's not necessary. And now, I'm going to change the inside of the structure to say this instead. So it feels a little verbose. And it feels a little bit of copy paste. But this is the way it is done in C. Typedef struct node is essentially a hint, similar in spirit to the prototypes we've talked about for functions, that gives the compiler a clue that, OK, something called a struct node is going to exist. You can then use it inside of that data structure and refer to it as struct node *. It's more of a mouthful this week. We've not seen multiple words like this. But it's similar to char * or int *, like last week. And I'm going to call that arbitrarily next. And down here, the same thing happens as in the past with persons. By calling this node at the very last line of the code, that just tells the compiler, you know what? You don't have to refer to it as struct node all over the place. You can just call this thing node. So it's a little verbose in this case. But all this has done is create for me, in the computer, the definition of a node as we have depicted it pictorially with that rectangle. All right, so how can we now translate this into more useful code, not just defining what these structures are, but how do we begin building up linked lists? Well, let me propose that a linked list really begins with just a pointer. And in fact, here we have, thanks to the theatre's prop shop, just kind of a null pointer, if you will. I'm going to call this variable "list." And list is currently pointing to nothing. The arrow, we'll say, is just pointing at the floor, which means it's null. It's not actually pointing at anything useful. Suppose I now want to begin to allocate a linked list with three numbers, 1, 2, and 3. Well, how am I going to do this? Well, at the moment, the only thing that exists in my program is going to be this variable called list. There's no array in this story. That was in Week 2. Today is all about linked lists. So how do I get myself a wooden block that represents 1, another wooden block that represents 2, and a third that represents 3? Well, we need to use our new friend from last week, malloc. Recall that malloc allows you to allocate memory, as much memory as you might want, so long as you tell it the size of that thing. So frankly, I think what we could do ultimately today is use malloc to allocate dynamically one struct and put the number 1 in it, another struct and put the number 2 in it, another struct and put the number 3 in it. And then, use these playful arrows here to actually stitch them together, having one point to the other. So thankfully, the prop shop has wonderfully created a whole bunch of these for us.
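[Editor's note: for reference, the definition just described-- the same one typed out again later in the lecture-- looks like this.]

    typedef struct node
    {
        int number;
        struct node *next;    // "struct node" must be spelled out here, since
                              // the one-word name "node" doesn't exist yet
    }
    node;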
Let me go ahead and malloc a very heavy node that has room for two values. And you'll see it has room for a number and a next pointer. So the number I'm going to first install here is going to be the number 1, for instance. And I'm going to leave its pointer as just pointing at the ground, indicating that this is a null pointer. It's not actually pointing at anything else. But now that I'm starting to instantiate, that is, create, this list, now I'm going to do something like this and say that, all right, my variable called "list," whose purpose in life is to keep track of where this list is in memory, I'm going to connect one to the other by actually having this variable point at this node. When it comes time, then, to allocate another node I want to insert into this linked list-- back in the world of arrays, I might have had to allocate a new chunk of memory and copy the old values over into the new. I don't have to do that. In the world of linked lists, I just call malloc for a second time and say, give me another chunk of memory big enough to fit a node. Thankfully, from the prop shop, we have another one of these. Let me go ahead and malloc another node here. At the moment, there's nothing in it. It's just a placeholder. So it's garbage values, if you will. I don't know what's there until I actually say that the number shall be the number 2. And then I go over to my linked list, whose variable name is "list," and I want to insert this thing. So I follow the arrow. I then point the next field of this node at this node here. So now I have a linked list of size 2. There's three things in the picture, but this is just a simple variable. This is a pointer that's pointing at the actual node which, in turn, is pointing at an actual other node. Now suppose I want to insert the number 3 into this linked list. Recall that malloc is powerful in that it takes memory from wherever it's available, the so-called "heap" of your computer. And that means, by definition, that it might not be contiguous. The next number might not actually be here metaphorically in the computer's memory. It might be way over there. So that might indeed happen. The third time I call malloc now and allocate a third node, it might not be available anywhere in the computer's memory except for, like, way over here. And that's fine. It doesn't have to be contiguous as it did in the world of our arrays. I can now put the number 3 in its place. But if I want to keep a pointer to that node, so that all of these things are stitched together, I can start at the beginning. I follow the arrows. I follow the arrows. And now, if I want to remember where that one is, I'm going to have to connect these things. And so now, that pointer needs to point at this block here. And this visual is very much deliberate. These nodes can be anywhere in the computer's memory. They're not necessarily going to be contiguous. The downside of that is that you cannot rely on binary search, our friend from Week 0 onward. Binary search was amazing in that it's Big O of log n. You can find stuff way faster than Big O of n. That was the whole point of even the phonebook example in the very first week. But the upside of this approach is that you don't have to actually keep allocating and copying more memory and all of your values any time you want to resize this thing. And I'm a little embarrassed to admit that I'm out of breath for some reason, just mallocing nodes here. But the point is, using malloc and building out the structure does come at some price.
It's exhausting, frankly. But it's also going to spread things out in memory. But you have this dynamism. And honestly, if you are the Twitters of the world, the Googles of the world, where you have lots and lots of data, it's going to kill you performance-wise to have to copy all of your data from one location into another as Santiago originally proposed as a solution to the array. So using these dynamic data structures, like a linked list, where you allocate more space wherever it's available, and you somehow remember where that is by stitching these things together, as with these physical pointers, is really the state of the art in how you can create these more dynamic structures if it's more important to you to have that dynamism. All right, any questions before we now translate these physical blocks to some samples of code? BRIAN: A question came in in the chat. When in this whole process of linked lists are we actually using malloc? And what is malloc being used for? DAVID MALAN: Really good question. So where are we using malloc? Every time I went off stage and grabbed one of these big blocks, that was my acting out the process of mallocing a node. So when you call malloc, that returns to you, per last week, the address of the first byte of a chunk of memory. And if you call malloc of one, that gives you the address of one byte. If you call malloc of 100, that gives you the address of the first of 100 bytes that are contiguous. So each of these nodes, then, represents the return value, if you will, of a single call to malloc. And in fact, perhaps, Brian, the best way to answer that question in more detail is to translate this now to some actual code. So let's do that now and then revisit. So here is, for instance, a line of code that represents the beginning of our story, where we only had a variable called "list" that was initialized to nothing initially. This arrow was not pointing at anything. And in fact, if it was just pointing up, down, left, or right, it would be considered a garbage value. This is just memory, which means there are values inside of it before I actually put actual values in it. So if I don't assign it a value, who knows what it's pointing to? But let me go ahead and change the code. This list variable, by default, has some garbage value unless I initialize it to something like null. So I'll represent that here figuratively by just pointing at the ground. That now represents null. This would be the corresponding line of code that just creates for you an empty linked list. That was the beginning of the story. Now recall that the next thing I did was I went off stage and I called malloc to bring back one of these big boxes. That code might instead look like this: node *n = malloc(sizeof(node)). So I could call the variable anything I want. So sizeof we might have seen briefly; it's just an operator in C that tells you how many bytes large any data type is. So I could do the math and figure out in my Mac or PC or CS50 IDE just how many bytes these nodes are supposed to take up. sizeof just answers that question for me. So malloc, again, takes one argument, the number of bytes you want to allocate dynamically, and it returns to you the address of the first of those bytes. So if you think of this as one of my earlier slides in yellow, it returns to you, like, the address of the top left byte of this chunk of memory, if you will. And I'm going to go ahead and assign that to a variable I'll call n to represent a node.
And it's node * because, again, malloc always returns an address. And the syntax we saw last week for storing addresses is to use this new * syntax here. So this gave me a block that initially just had garbage values. So there was no number in place, and who knows where the arrow was pointing? So it looked a little something like this, if I draw it now on the screen. List is initialized to null. It's not pointing at anything right now. But n, the variable I just declared, is pointing at a node. But inside of that node? Who knows what. It's garbage values. Number and next are just garbage values. Because that's what's there by default, remnants of the past. But now, let me propose that we do this code. So long as n is not null-- which is something you want to get into the habit now of always checking. Any time in C, when you call a function that returns to you a pointer, you should almost always check: is it null or is it not null? Because you do not want to try touching it if it is indeed null. Because that means there is no valid address there. That's the human convention when using pointers. But if n does not equal null, that's a good thing. That means it's a valid address somewhere in the computer's memory. Let me go ahead now and go to that address, the syntax for which is *n, just like last week. And then the dot operator means go inside of the structure that's there and go to the variable inside of it, called number in this case. So when I go ahead and do something like this, when I have a list that's initially of size 1, where this variable is pointing at the one and only node at the moment, it's going to have the number 1 in it as soon as I execute this line of code. *n means start at the address embodied here, go to it, and then put the number, 1 in this case, in place. But I need to do one other thing as well. I want to go ahead, too, and replace the garbage value that represents the next pointer in that structure with null. Null signifies that this is the end of the list. There's nothing there. I don't want it to be a garbage value. Because a garbage value is an arbitrary number that could be pointing this way, this way, this way. Metaphorically, I want to actually change that to be null, and so I can use the same syntax. But there's this clever approach now. I don't have to use star and then dot all over the place, as mentioned earlier. Star and dot come with some syntactic sugar. You can replace the star and the parentheses and the dot that we just saw with an arrow notation, which means follow the arrow. And then set this thing equal to null, which I'll represent again by just having the arrow literally point at the floor for clarity. So, again, rewinding from where we started, (*n).number, with the star and the n in parentheses, is the same thing as this. And the reason that most people prefer using the arrow notation, so to speak, is that it really does capture this physicality. You start at the address, you go there, and then you look at the field, number or next, respectively. But it's equivalent to the syntax we saw a moment ago with the star and the dot as before. So after these two steps, we have initialized this node to contain the number 1 and null inside of it. But what comes next? Well, at this point in the story, in this code, I have some other variable here that's not pictured. Because we're now transitioning from the world of woodwork to actual code. So there's another variable n, which I might as well represent myself.
If I am this temporary variable n, I, too, am pointing at this value. It's not until I actually execute this line of code, list = n;, that I remember where this node is in the computer's memory. So n, up until now, has really just been a temporary variable. It's a variable that I can use to actually keep track of this thing in memory. But if I want to add this node ultimately to my linked list-- list started out as null, recall that this pointer was pointing at the ground, representing null-- when I now want to remember that this linked list has a node in it, I need to actually execute a line of code like this, list = n. All right, what did we do next? Let's take one more step. So at this point in the story, if I was representing n, I'm also pointing at the same block. n is temporary, so it can eventually go away. But at this point in the story, we have a linked list of size 1. Let's go ahead and take this further. Suppose now I execute these lines of code. And we'll do it a little faster and all at once so that you can see it more in context. The first line of code is the same as before. Hey, malloc, give me a chunk of memory that's big enough for the size of a node. And, again, let's use this temporary variable n to point to that. And suppose that means that I, if I'm representing this temporary variable, am pointing at this new chunk of memory here. I then check: if n does not equal null, then and only then do I go ahead and install the number 2 as I did physically earlier, and do I initialize the pointer to point not at some other node, which doesn't yet exist, but at, let's just call it the floor, thereby representing null. And that's it. That has now allocated the second node. But notice, literally, this disconnect. Just because I've allocated a new node and put in the number I care about and initialized its next pointer to null, that doesn't mean it's part of the data structure. The linked list is still missing a pointer from old to new. So we need to execute one other line of code now so that we can get from this picture ultimately to the final one. And here's where we can use the same kind of syntax as before. If I start at my list variable, I follow the arrow as per that code. And then I update the next field to point to n, which is my newly allocated node. Only now, after that final line of code, do I have a linked list of size 2. Because I've not only allocated the node and initialized its two variables, number and next respectively, I've also chained it together with the existing node on the linked list. And let me do this one even slightly more quickly, only because it's the same thing, just so we can see it all together at once. Now that we have this picture, let's execute the same kind of code again. The only difference in this chunk of code is that I'm initializing number to 3. So what has this done for me? That chunk of code has malloced a third and final node. I've initialized the top to the number 3. I've initialized the bottom to null, as I'll represent by just pointing the arrow at the ground. There's one final step, then. If I want to go ahead and insert that third node into this linked list, I've got to go ahead now and not just point at it myself-- me representing, again, the temporary variable n-- I need to now do this. And this is syntax you won't do often. We'll see it in an actual program in just a moment. But it just speaks to the basic building blocks we're manipulating.
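[Editor's note: collected in one place, the steps just narrated for that second node look roughly like this-- a sketch, assuming list already points at the first node.]

    // Allocate a second node and chain it onto a one-node list
    node *n = malloc(sizeof(node));
    if (n != NULL)
    {
        n->number = 2;      // install the value we care about
        n->next = NULL;     // this node is, for now, the end of the list
        list->next = n;     // link the existing node to the new one
    }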
If I want to go ahead and link the 2 to the 3, I can start here at list. I can follow the arrow once. I can follow the arrow again. And then I can set that next field equal to n. Because, again, n is the current address of that most recently allocated node. So even though the syntax is getting a little new, the two arrows I'm using are literally just like a code manifestation of just follow the pointer, follow the pointer, boom, assign one value to the next. All right, so at this point in the story, the picture now looks like this. So long as I, n, am still in the picture. But if I just get rid of myself, because I'm a temporary variable, voila, we've just built up, step by step, a brand new linked list of size 3. Seems like a lot of work, but it allows us to now grow this thing dynamically. But let me pause here. Any questions or confusion on linked lists? BRIAN: One question just came in the chat. If we're trying to make a list that's going to be much longer, like, more than three elements, wouldn't it get tedious to have, like, arrow next, arrow next over and over again? DAVID MALAN: Yeah. Really good observation. And so that's why I said, you won't usually write it like this. We'll do it temporarily in a full-fledged program in just a moment just to demonstrate it. But in general, you'll probably use something like a loop. And what you'll do is use a temporary variable that points at this one, then iterates again, then iterates again. And let me stipulate for now that if you use a loop in the right way, you can end up writing just a single arrow by just updating the variable again and again. So there is a way to avoid that. And you do it much more dynamically. Let me go ahead, then, and ask you a question of my own here. If you'd like to buzz in to this question: what is the running time of searching a linked list? What's the running time of searching a linked list in Big O notation? So what's an upper bound, worst case, of searching a linked list like this one, whether it has three elements or even many more? So it looks like, as before, about 80% of the group is proposing that it's Big O of n. And this is actually correct. Because consider the worst case. Suppose you're looking for the number 3; you're going to have to look at all three numbers, Big O of n. If you're looking for the number 10, you're going to start here at the beginning, you're going to keep looking, looking, looking, looking. You're going to get to the end and realize, oh, the number 10 is not even here. At which point, you've already looked at n elements. But here's one of the trade-offs, again, of linked lists. With arrays, you could jump to the end of the list in constant time. You could just use a little bit of arithmetic. You could jump to the middle element, or the first element, all in constant time. A linked list, unfortunately, is represented ultimately by just a single address, the address that points to the very first node. And so even though you all on camera can see this node and this node and this one as humans all at once, the computer can only follow these breadcrumbs. And so searching a linked list is going to be Big O of n in that case. But let me ask a follow-up question. What's the running time of inserting into a linked list? So you've got some new number like the number 4 or 0 or 100 or negative 5, whatever it may be.
There's going to be a malloc involved, but that's constant time. It's just one function call. But you're going to have to insert it somewhere. And here it looks like 68% of you are proposing Big O of 1, which is interesting, constant time. 25% of you are proposing Big O of n. Would anyone be comfortable chiming in verbally or on the chat as to why you feel it's one or the other? It is indeed one of those answers. AUDIENCE: It could be O of n. Because of the fact that even though you're using malloc to create a new node, essentially, I think all the computer's doing when you assign it is going-- as you set those arrows-- like, we're going from one arrow to the next to the next to the next. And I would think it would be O of n. DAVID MALAN: It is O of n as you described it. But you, too, are making an assumption, like 25% of other people are making. You seem to be assuming that a new number, suppose it's the number 4, has to go at the end. Or if it's the number 5, it has to go at the end. And I kind of deliberately set things up that way. I've happened to maintain them in sorted order, 1, 2, 3, from left to right. But up until now, I have not imposed the condition that the linked list has to be sorted, even though the examples we've seen thus far are deliberately that way. But you know what, if you want to get fancy and a little more efficient, and you want to allocate the number 4, and frankly, you don't really care about keeping the linked list in sorted order, well, heck, just pull this out, put your new node here. Plug it in here. Plug the other one back in here. And just insert the new element at the beginning of the list. And for every number thereafter, malloc as before. But just keep inserting it here, inserting it here, inserting it here. Now it's not going to take a single step. Because as I verbalized it, there's, like, the malloc step. I have to unplug this. I have to plug it back in. So it's like three or four steps total. But four steps is also constant. That's big O of 1, because it's a fixed number of steps. So if you're able to sacrifice sorted order when it comes to this list, you can in constant time insert, insert, insert, insert. And the list is going to get longer and longer, but from the beginning of it rather than the end. So that's always a trade-off. If you don't care about sorted order, and none of your algorithms or code require that it be sorted, then you can go ahead and cut that corner and achieve constant time insert, which if you're a Twitter or Google or the like, maybe that's actually a net savings and a good thing. But, again, you sacrifice the sorted order in that case. Well, let's go ahead, I think, and let's translate this to some actual code. Let me go ahead here in CS50 IDE. And let's go ahead and write a couple of variants of a program that now actually do something with numbers and start to manipulate things in memory. So I'm going to go ahead here and create a program called list.c. And my first version is going to be very simplistic. I'm going to go ahead and include stdio. Give myself an int main(void). And then inside of here, let me go ahead, like we began, and give myself a list of integers of size 3. So this is going to be an array that's of size 3. And this array of size 3, I'm going to go ahead and hard code some new values into it. So at the very first location, I'll put the number 1. At the second location, I will put the number 2. At the third location, I will put the number 3.
And then, just to demonstrate that this is working as I think I intend, I'm going to do a quick for loop. So for int i get 0, i less than 3, i++. And then, inside this loop, I'm going to go ahead and print out %i. And then I'm going to print out the value of i. So now that I've printed out all of these values in my loop, let me go ahead and do make list. Let me go ahead then and do ./list and hit Enter. And indeed, I get-- oops, not what I wanted. So good teachable moment, not intended, admittedly. But what have I done wrong here? My goal is to print out the list. But somehow, I printed out 0, 1, 2. And those are indeed not the numbers in this list. Greg? AUDIENCE: So you printed i. You should have printed list of i. DAVID MALAN: Yes. So I should have printed the contents of the array, which is list[i]. So that was just a newbie mistake by me here. So let me fix that. Let me go ahead and recompile with make list, ./list, and voila. I've now printed the list. So this is sort of Week 2 stuff, when we first introduced arrays in Week 2. But now let me go ahead and transition to something more dynamic, where I don't have to commit in advance to creating an array; I can do this with a dynamically allocated chunk of memory. So let me delete everything I've done inside of main. And let me go ahead and give myself this. Let me go ahead and declare a list of values where list is now going to be an address, as per the star operator. And I'm going to go ahead and malloc-- let's see. I want space for three integers. So the simplest way to do this, if I'm just going to keep it simple, I can actually do this, 3 times sizeof(int). So this version of my program isn't going to use an array per se. It's going to use malloc. But it's going to dynamically allocate that array for me. And we'll see what the syntax for this is. As always now, any time you use malloc, I should check whether list equals equals null. And if so, you know what? I'm just going to return 1. Recall that you can return 0 or 1 or some other value from main to effectively quit your program. I'm going to go ahead and just return 1 if list is null, just assuming that something went very badly wrong, like I'm out of memory altogether. But now that I have this chunk of memory that's of size 3 times the size of an int, this is actually the malloc way to give yourself an array. Up until now, every time we've created arrays for ourselves, we've used square bracket notation. And you all have put a number inside the square brackets to give yourself an array of that size. But frankly, if we have malloc and the ability to just ask the computer for memory, well, if I want to store three integers, why don't I ask malloc for three times the size of an integer? And the way malloc works is it's actually going to return to me a contiguous chunk of memory of that size, so that many bytes back to back to back. And that's a technique that we'll use in just a moment when allocating actual nodes. So at this point in the story, so long as list does not equal null, I now have a chunk of memory that's big enough to fit three ints. And as before, I can go ahead and initialize those. The first element will be 1. The second element will be 2. The third element will be 3. And notice the sort of equivalence now between using arrays and using pointers. C is kind of versatile in this way, in that if you have a chunk of memory returned to you by malloc, you can, per last week, use square bracket notation.
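[Editor's note: as the next bit of lecture describes, the two notations are interchangeable for memory that malloc returns; a minimal sketch.]

    #include <stdlib.h>

    int main(void)
    {
        int *list = malloc(3 * sizeof(int));
        if (list == NULL)
        {
            return 1;
        }
        list[0] = 1;        // square-bracket syntax...
        *list = 1;          // ...is equivalent to this pointer arithmetic
        *(list + 1) = 2;    // same as list[1]
        *(list + 2) = 3;    // same as list[2]
        free(list);
    }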
You can use square bracket notation and treat that chunk of memory as an array. Because after all, what's an array? It's a contiguous block of memory, and that is exactly what malloc returns. If you want to be fancy instead, you could actually say go to that address and put the number 1 there. You could say go to that address plus 1 and put the next number there. You could say go to that address plus 2 and put the third number there. But honestly, this just very quickly becomes unreadable, at least to most people. This is that thing called pointer arithmetic-- you're doing arithmetic with pointers-- and it is equivalent to using the syntax that we've used for a while now, which is to just use the square brackets. And the nice thing about square brackets is that the computer will figure out for you how far apart each of those integers are because it knows the size of an int. But now, at this point in the story, things get interesting and also annoying. And Santiago, recall, was the one that helped us solve this earlier. Suppose I didn't plan ahead. I only allocated three integers on line five there. But now, at line 13, I'm, like, oh, dammit. Now I want to add a fourth integer to the list. I could obviously just redo all the code. But suppose that part of the story here is to go ahead and dynamically allocate more memory. Well, how can I do this? Well, let me go ahead and allocate another chunk of memory temporarily. So I'll call it tmp, by convention. And this time I'm going to go ahead and allocate 4 times sizeof(int). Because, again, for the sake of the story, I messed up and I want to allocate enough space now for that original-- I haven't so much messed up. I have now decided that I want to add a fourth number to this array. As always, I should check if tmp equals equals null. If so, you know what? I'm going to go ahead and free the memory I already allocated. And then I'm just going to get out of here, return 1. Something went wrong. There's nothing to demonstrate. So I'm going to exit out of main entirely. But if malloc did not return null, and all is well, what am I going to do? Well, let's first do what Santiago proposed when we first began this conversation. For int i get 0, i less than 3, i++, let's go ahead and copy into this new, temporary chunk of memory whatever is at the original chunk of memory. So when Santiago proposed that we copy 1, 2, 3 from the old array into the new array, here's how we might do that in code just using a simple for loop, a la Week 2. And then let me go ahead now and add one more value, tmp[3], which is the fourth location if you're starting from 0. I'm going to go ahead and put the number 4 there. And now, at this point, I'm going to go ahead and remember the fact that tmp is my new list. So I'm going to go ahead and free the original list. And I'm going to update my old list to point at the new list. And then lastly, I'm going to go ahead and use another for loop just to demonstrate that I think I did this correctly, this time iterating up to 4 instead of 3. I'm going to go ahead and print out with %i the contents of list[i]. So let's rewind real quick. We began the story by allocating an array of three integers. But we did it this time dynamically to demonstrate that malloc just returns a chunk of memory. And if you want to treat that chunk of memory as an array, you absolutely can. This stuff here, if list equals equals null, it's just error checking, just to make sure that nothing went wrong. The interesting code resumes here.
I'm putting the numbers 1, 2, and 3 at locations 0, 1, and 2, respectively, in that chunk of memory which, again, I'm treating like an array. But now at this point in the story, I've stipulated that, wait a minute, I want to go ahead and add a fourth value. How can I do that? And let's stipulate that I want to go back and change the existing program. Because suppose that, for the sake of discussion, this is code that's running at Google or Twitter over time, and it's only after receiving another tweet that their code realizes, oh, we need more space. So how do I do this here? On line 15, this time I allocate enough space for four integers. And I, again, do some error checking. If tmp equals null, then something bad happened. Let's just exit altogether. But if nothing bad happened, let's take Santiago's suggestion and translate his English advice into C. Let's use a for loop from 0 to 3 and copy into this new temporary chunk of memory the contents of the original chunk of memory. So tmp[i] = list[i]. And then here, which was the point of this exercise, let me add my fourth number at tmp[3], which is the fourth location if you start counting from 0. But at this point in the story, much like my earlier slide, I have both the 1, 2, 3 in an array of size 3, and I have 1, 2, 3 duplicated in the array of size 4. Let me go ahead and free the original list and give back to the computer that original chunk of memory. Let me then remember, using my better-named variable, what the address of this new chunk of memory is. And then, just to show off, let me go ahead, and with another for loop, this time counting four times, not three, let me print out all of those values. Now here's where I'll cross my fingers and compile my new program. It does not compile, because it looks like I have one too many parentheses there. So let's recompile the program with make list-- another error. So let me scroll up there. And oh, interesting, so this is a common mistake: implicitly declaring library function malloc, something something. So any time you get an implicitly declaring error, odds are it means you just did something simple like this: you forgot the requisite header file in which that function is declared. And indeed, recall from last week, malloc is in stdlib.h, as is free. So now let's do make list. Cross my fingers again. Phew. That time it worked. ./list, voila, 1, 2, 3, 4. So this is now a completely literal translation of all of that code into a working program that, again, starts off by using an array of size 3, having dynamically allocated it. And then it resizes it by creating a new one of size 4, copying old into new, freeing the old, and then proceeding as before. And I've deliberately used malloc both times here as follows. If you create an array in C using square bracket notation, you have painted yourself into a corner. You can't use any lines of code that we have seen and resize an array that you have declared using square brackets. More technically speaking, when you use the square brackets, you are statically allocating the array on the stack. You're putting it into the chunk of the computer's memory that belongs to that function, that function's stack frame, per the diagram from last week. If, however, you use malloc, our new tool from last week, and say give me a chunk of memory, that comes from the heap. And that you can resize. That you can give back and take more of and back and forth. And in fact, there's an even simpler way of doing this, relatively speaking.
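[Editor's note: put together, the program as just dictated looks roughly like this; line numbers differ from those cited aloud.]

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        // Dynamically allocate an "array" of three integers
        int *list = malloc(3 * sizeof(int));
        if (list == NULL)
        {
            return 1;
        }
        list[0] = 1;
        list[1] = 2;
        list[2] = 3;

        // Time to grow: allocate a bigger chunk of four integers
        int *tmp = malloc(4 * sizeof(int));
        if (tmp == NULL)
        {
            free(list);
            return 1;
        }

        // Copy the old values into the new chunk, then add the fourth
        for (int i = 0; i < 3; i++)
        {
            tmp[i] = list[i];
        }
        tmp[3] = 4;

        // Free the old chunk and remember the new one by the better name
        free(list);
        list = tmp;

        for (int i = 0; i < 4; i++)
        {
            printf("%i\n", list[i]);
        }

        // Per the Q&A that follows, this final chunk should be freed too
        free(list);
    }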
If you want to reallocate an array, a chunk of memory, by resizing it, you don't have to do all of this, which I did before. You don't have to use malloc twice. You can use malloc once at the beginning. And then, you can use a new function that's actually kind of helpful in this case, called realloc. And you can actually do this: realloc a chunk of memory of size 4 times sizeof(int). But specifically, reallocate the thing called list. So realloc is very similar to malloc, but it takes two arguments. Its very first argument is the address of a chunk of memory that you have already allocated, as with malloc. Its second argument is the size of the memory you want, whether bigger or smaller. So, again, at the top of the same program, recall that I used malloc to give myself a list that points at a chunk of memory big enough for three integers. On line 16, I'm now handing that address back to realloc, saying, wait a minute, here is that same address you gave me. Please now resize it, reallocate it, to be of size 4. And what the function does is, if all goes well, it returns to you the address of a chunk of memory that is now of sufficient size. Otherwise, it returns null if anything bad happened. So I'll leave that code alone. But what I don't have to do anymore is this. Realloc actually copies the old into the new for you. So, again, coming back to Santiago's story at the beginning of today, realloc will not only give you a bigger chunk of memory, if you ask for it by handing back the address of the memory you already requested; it's also going to hand you back the address of a chunk of memory that is big enough to fit all of those new values, with the old ones already copied in. And it's smart, too. If there happens to be room at the very end of the existing chunk of memory-- there's no hello, world, like we saw on my slide earlier-- then you're actually going to get back the exact same address. But the computer's operating system, Windows, macOS, or Linux, is going to remember, OK, yes, I know I gave you room for three integers originally. There happened to be room at the end of that chunk of memory. So now I'm going to remember, instead, that that same address has room for four integers, or whatever number you pass in. So, again, you don't have to bother copying yourself. You can let the computer actually do the reallocation for you. Any questions, then, on malloc, on realloc, on free, or fundamentally, on linked lists? Notice that this isn't yet a linked list. This is still an array. So we still need to take this program one step further and actually transition from this chunk of memory using arrays to these actual nodes. But before we do that, any questions or confusion? BRIAN: Yeah. A question came in, why do you not need to free tmp at the end of the program? DAVID MALAN: Why do I not need to free tmp at the end of the program? Because I'm an idiot and glossed over that key, important detail. You absolutely should free not tmp in this case, but list. So at this line here, 27, I use my list variable, which just has a better name, and I make it equal to tmp so that I can just refer to it as a bigger list. But you are quite right. That was an oversight on my part. Valgrind would not have liked that. At the very end of this program, I should absolutely free list. However, I don't need to free tmp, per se, because I've simply reused the variable name through that assignment. Good question and good catch. Unintended. Other questions or comments, Brian?
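[Editor's note: for reference, the realloc variant just described reduces to something like this-- a sketch; the extra tmp variable guards against realloc returning null, per the error checking "left alone" above.]

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int *list = malloc(3 * sizeof(int));
        if (list == NULL)
        {
            return 1;
        }
        list[0] = 1;
        list[1] = 2;
        list[2] = 3;

        // Hand the old address back to realloc; it copies the old values
        // into the new chunk for you if it has to move them
        int *tmp = realloc(list, 4 * sizeof(int));
        if (tmp == NULL)
        {
            free(list);    // the original chunk is still ours if realloc fails
            return 1;
        }
        list = tmp;
        list[3] = 4;

        for (int i = 0; i < 4; i++)
        {
            printf("%i\n", list[i]);
        }
        free(list);
    }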
BRIAN: Question came in, why does the linked list improve this situation if we can just use arrays and realloc and malloc to do all this stuff? DAVID MALAN: Yeah. Really good question. So how have we improved this situation if we can just use arrays in this way? Recall that this is kind of a regression. What I just did is a regression to where we started the story, whereby in any of the versions of code I just wrote, I reallocated more space for this array. Which means that I, manually with that for loop, or realloc with its own copying, had to copy all of the old values into the new. So the approach we've taken in all three versions of this program that I've written thus far on the fly, they've all been Big O of n when it comes to inserts. They have not given us the dynamism of a linked list to just add without that duplication. And we haven't had the ability yet to just do an insert, for instance, at the beginning of the structure in Big O of 1 time. So, again, this is the code translation, really, of that slower approach from which we began. So the ultimate goal now is going to be to change this code and give us that dynamism and actually implement things as a proper linked list, not just as an array of integers. But we're about an hour in. Let's go ahead and take our first five minute break here. And when we come back, we'll translate the nodes themselves to a full program. All right, we are back. And recall that we began today by revisiting arrays and pointing out that searching is great in arrays if you keep them sorted. You get the Big O of log n that we liked back from Week 0. But as soon as you want to start dynamically modifying an array, it gets very expensive quickly. It might take you Big O of n steps to copy the contents of an old, small array into a new, bigger array. And honestly, over time, especially for real world software with lots of data, even Big O of n is expensive. Like, you don't want to be constantly copying and copying and copying all of your data around the computer's memory. So we can avoid that by using pointers and, in turn, stitching together these structures called linked lists, albeit at a price of spending more memory. But with that additional memory, that additional cost, comes dynamism. So that if we want, we can even achieve constant time when it comes to inserting. But of course, then we have to sacrifice things like sortability. So this came with trade offs. We've just seen a few examples of actual C programs that implement first the old school array, per Week 2, where we just hardcode the array's length. And unfortunately, we painted ourselves into a corner, using the bracket notation alone. So we deployed malloc instead, which is a more versatile tool that lets us get as much memory as we want. And we used that to recreate the idea of a list implemented as arrays. But even then, we saw that I had to copy, using a for loop, or we had to copy, indirectly using realloc, old into new. And, again, for these small programs, you don't even notice the difference. The programs run just like that. But for large, real world software, all of that Big O of n time is going to add up quickly. So it's best if we can try to avoid it altogether and achieve dynamism. So the code via which you can add to a linked list dynamically is actually part of the challenge for Problem Set 5 this coming week. But let's see some of the building blocks via which we can syntactically start to allocate nodes and stitch them together when we know in advance how many we want.
Which is not going to be the case for Problem Set 5, but for now is indeed the case, because I only want three of these things. So I'm going to go back to my program from before. And I'm going to rewind and erase everything inside of main. And I'm going to go ahead and declare myself a type called struct node, initially with a number inside of it and a struct node * called next inside of that. And then I'm going to call this whole thing quite simply node. So that's quite similar to what we did with a person. But now it's a little fancier in that I'm giving the structure itself a temporary name, struct node. I'm referring to that temporary name inside of the structure so that I can have a pointer there, too. And then I'm renaming what was person to now node. Now let's go ahead and actually use this thing inside of main. So let me go ahead and create an empty linked list. The simplest way to translate the simple block with which we began today is just doing node *list;. Unfortunately, any time you declare a variable that does not have an assigned value, it's garbage. And garbage is bad in the world of pointers. Again, to be clear, if you create this variable called list and you do not explicitly initialize its value to be something like null, pointing at the ground, but instead leave it as a garbage value, it's the sort of metaphorical equivalent of this arrow pointing this way, this way, this other way. That is to say you might accidentally, in your own code, follow this arrow to a completely bogus place. And that's the point at which you have what are called segmentation faults, as some of you might have experienced already with Problem Set 4 when you touch memory that you shouldn't. So garbage values are bad even more so in the context of pointers for that reason. So you rarely want to do this. You almost always want to initialize the pointer to some known value. In the absence of an actual address, we're going to use null to indicate that there's nothing there. But that's deliberate on our part. Now, suppose I want to insert, just as I did physically by lugging the block number 1 onto stage before. Let me go ahead and allocate a node-- we'll call it n temporarily-- using malloc, this time asking for the size of a node. So the story is now changing. I'm not allocating individual ints. I'm allocating individual nodes, inside of which is enough room for an integer and another pointer to a node. And this sizeof operator figures out, from the definition of this structure up here above main, how much space is needed to store an integer and a pointer to a struct node. So as always now, I'm always going to check. If n equals equals null, I'm going to get out of this program immediately and just return 1. Because something went wrong, and there's just not enough memory. But if all went well, I'm going to go ahead now and go into that node n. I'm going to go into its number field and assign it the value 1. And I'm going to go into that node n and go into its next field and, for now, assign it the value null. So this is as though I've just allocated the wooden block with a 1 in it and I have initialized its next pointer to null. Now I'm going to go ahead and update the list itself to point at that value. So, again, my variable called list is the variable via which I'm representing the whole list. And now that I have an actual node to point to, I'm setting list which, again, is a pointer to a node, equal to whatever n is, the address of an actual node.
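In code, that struct and that first insertion might look roughly like this, a sketch using the same names as the walkthrough:

#include <stdlib.h>

// A node wraps a number together with a pointer to the next node.
typedef struct node
{
    int number;
    struct node *next;
}
node;

int main(void)
{
    // An empty linked list: explicitly NULL, never a garbage value.
    node *list = NULL;

    // Allocate one node on the heap.
    node *n = malloc(sizeof(node));
    if (n == NULL)
    {
        return 1;
    }
    n->number = 1;
    n->next = NULL;

    // The list now points at its first (and only) node.
    list = n;

    // (Printing and freeing are shown properly a bit later.)
    free(list);
    return 0;
}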
So at this point in the story, I have the small wooden block connected to the larger block containing 1. Let's suppose, for the sake of discussion, I now want to add the number 2 to this list, this time using a node as well, not just an integer. I'm going to go ahead and allocate n, using malloc, giving myself the size of another node. I'm again just going to check: if that thing equals null, let me go ahead and free the list so I don't leak memory. Then let me go ahead and return 1. So that's just a quick sanity check to make sure I free any memory I've already allocated before. But if all goes well, and that's what I'm hoping for, I'm going to go ahead and go into this node n and store in its number field literally the number 2. And then, because I'll insert it in sorted order for now, I'm going to go ahead and set its next field to NULL. And if I indeed want to put this number 2 node after the number 1 node, I can start at the top of the list and go to the first node. And inside of its next field, I can store n. So this line of code here, list->next = n, starts at the little block, follows the arrow, and then updates the next pointer of that first node, the 1 node, to instead store the address of this new node, n. And then lastly, let's do one more of these, so n = malloc(sizeof(node)); one last time. Let me go ahead and do my sanity check one more time. If n == NULL, something bad happened. So now-- and don't worry about the syntax just yet-- I'm going to go ahead and free(list->next). And I'm going to go ahead and free(list); and then I'm going to go ahead and return 1. But more on that another time. That's just in the corner case where something bad happened. But if nothing bad happened, I'm going to update the number field to be 3. I'm going to update the next field to be NULL. And now I'm going to update list->next->next, next block, next block, to equal this new one, n. And then here, after this, I can proceed to print all of these things if I want. And in fact, I'll go ahead and do this with a loop. The loop is going to look a little different from before in this case. But it turns out we can use for loops pretty powerfully here, too. But at this point in the story, my list pointer is pointing at the 1 node, which is pointing at the 2 node, which is pointing at the 3 node. And, again, as someone observed earlier, it's not common to use this double arrow notation in a case like this. I bet I could actually use a loop to iterate over these things one at a time. And we can see this here when it's time to print. Let me go ahead and do this for loop. And instead of using i-- because there really aren't any indexes in question. This is no longer an array, so I can't use square bracket notation or pointer arithmetic. I need to use pointers. So this might feel a little weird at first. But there's nothing stopping me with a for loop from doing this. Give me a temporary pointer to a node called tmp and initialize it to be whatever is at the beginning of the list. Keep doing the following so long as tmp does not equal NULL. And on each iteration of this loop, don't do something like i++ which, again, is not relevant now. But go ahead and update my temporary pointer to be whatever the value of the temporary pointer's next field is. So this looks crazy cryptic most likely, especially if you're new to pointers as of last week, as most of you are. But it's the same idea as a typical for loop. You initialize some variable before the semicolon. You check some condition after the first semicolon.
And you perform an update of that variable after the second semicolon. In this case, they're not integers, though. Instead, I'm saying give myself a temporary pointer to the beginning of the list, like my finger pointing at, or if you prefer, the foam finger pointing at some node in the list. Go ahead and call that temporary variable tmp. And now do the following. So long as tmp is not null, that is, so long as it's pointing at an actual legitimate wooden block, what do I want to do? Let me go ahead and print out, using printf and %i as always, whatever value is in the number field of that node there. And that's it. With this simple for loop, relatively simple for loop, I can essentially point at the very first node in my list and keep updating it to the next field, updating it to the next field, updating it to the next field. And I keep doing this until my finger sort of walks off the end of the list of wooden blocks, thereby pointing at null, 0x0, at which point the loop stops and there's nothing more to print. So in answer to that question earlier, do we need to use this double arrow notation? Short answer, no. This is kind of the secret ingredient here. This syntax inside of the for loop takes whatever you're pointing at, follows one arrow, and then updates the temporary variable now to point at that structure instead. So this is kind of the equivalent, in the world of pointers and linked lists, of doing i++. But it's not as simple as i++. You can't just look one byte to the right or to the left. Instead, you have to follow an arrow, follow an arrow. But by reassigning this temporary variable to wherever you just followed, it's a way of following each of these orange arrows as we did physically a moment ago. After this, I should, for good measure, go ahead and free the whole list. And let me just offer up a common way of freeing a linked list. I can actually do something like this. While list != NULL, so while the whole list itself does not equal null, go ahead and get a temporary pointer like this to the next field, so I remember what comes after the current head of the list. Free the list node itself. And then update list to be tmp. So, again, this probably looks crazy cryptic, and certainly in the coming days, especially with Problem Set 5, you'll work through this kind of logic a little more methodically, a little more pictorially, perhaps. But what am I doing here? First, I'm going to do the following, so long as my linked list is not null. And if I've got three nodes in it, by definition it's not null from the beginning. But my goal now is to free all of the memory I have allocated from left to right, so to speak. So how do I do that? Well, if I've got a wooden block in front of me, it's not safe to free that wooden block yet. Because that wooden block, recall, contains the pointer to the next node. So if I free this memory prematurely, I've then stranded all subsequent nodes. Because they are no longer accessible once I've told the computer you can take back this chunk of memory for the first node. So this line of code here, on line 52, is just saying temporarily give me a variable called tmp. And point it not at the list itself, the first node, but at the next node. So it's like using my right hand to point at the current node, my left hand to point at the next node, so that I can then, on line 53, free the list itself, which should not be taken literally. List represents the first node in the linked list, not the whole thing.
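Assembled in one place, the whole program just built up might look roughly like this, a sketch mirroring the walkthrough, including the free-list idiom being narrated here:

#include <stdio.h>
#include <stdlib.h>

typedef struct node
{
    int number;
    struct node *next;
}
node;

int main(void)
{
    node *list = NULL;

    // First node.
    node *n = malloc(sizeof(node));
    if (n == NULL)
    {
        return 1;
    }
    n->number = 1;
    n->next = NULL;
    list = n;

    // Second node, stitched in after the first.
    n = malloc(sizeof(node));
    if (n == NULL)
    {
        free(list);
        return 1;
    }
    n->number = 2;
    n->next = NULL;
    list->next = n;

    // Third node, stitched in after the second.
    n = malloc(sizeof(node));
    if (n == NULL)
    {
        free(list->next);
        free(list);
        return 1;
    }
    n->number = 3;
    n->next = NULL;
    list->next->next = n;

    // Print every node, following arrows until falling off the end.
    for (node *tmp = list; tmp != NULL; tmp = tmp->next)
    {
        printf("%i\n", tmp->number);
    }

    // Free left to right, peeking one node ahead before each free.
    while (list != NULL)
    {
        node *tmp = list->next;  // remember what comes next
        free(list);              // free the current head
        list = tmp;              // advance to the remembered node
    }
    return 0;
}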
So when you say free list, that's, like, freeing just the current node. But that's OK. Even now that this memory has been given back, I still have my left hand pointing at every subsequent node by way of the next one. So now I can update list to equal that temporary variable and just continue this loop. So it's a way of sort of Pacman style, like, gobbling up the entire linked list from left to right by freeing the first node, the second node, the third node, and then you're done. But by using a temporary variable to look one step ahead, you make sure you don't chomp, don't free the memory too soon and, therefore, lose access to all of those subsequent nodes. All right, phew. That was a big program. But it was meant to build in succession, starting with an array, transitioning into a dynamically allocated array, followed by, finally, an implementation using a linked list, albeit hardcoded to support only three nodes. But in that example, do you see some sample syntax via which you can manipulate these kinds of nodes? Questions or confusion that I can help address? BRIAN: Yes. Someone asked, similar to one of the examples you did before, why could we not have just done malloc three times size of node to get three nodes and do it that way? DAVID MALAN: Really good question. Could I not just use malloc and allocate all three at once? Absolutely. Yes. That is completely your prerogative. I did it a little more pedantically, one at a time. But you could absolutely do it all three at once. You would then need to use some pointer arithmetic, though, or you would need to use square bracket notation to treat that bigger chunk of memory as essentially an array of nodes, and then stitch them together. So I am assuming, for demonstration purposes, that even though we have these little simple examples that demonstrate the syntax, in a real world system, you're not going to be inserting 1, then 2, then 3. Odds are you're going to be inserting 1. Some time passes. Then you want to insert 2, so you allocate more memory. Then some more time passes. Then you want to insert 3. And so there are gaps in between these chunks of code in the real world. Other questions or confusion? BRIAN: Yeah. Another question came in. Why would malloc ever fail to allocate memory? DAVID MALAN: Why would malloc ever fail? It's rarely going to fail, but it can if the computer is out of memory. So essentially, if you're writing such a memory hungry program with so many variables, big arrays, big structures, lots of data, you may very well run out of memory. Maybe that's two gigabytes, maybe it's four gigabytes or more. But malloc may very well return null to you. And so you should always check for it. In fact, I dare say, on Macs and PCs, one of the most common reasons, to this day, for programs to freeze, to crash, for your home computer to reboot, is truly because someone did something stupid, like I've done multiple times now already today and last week, by touching memory that they shouldn't have. So in Problem Set 4 and now 5, any time you experience one of those segmentation faults whereby your program just crashes, that is the problem set version of, like, your whole Mac or PC crashing, because someone more experienced than you made that same mistake in their code. Just to reinforce this, let's take a quick final example involving linked lists which, again, are this very one dimensional structure, left to right. And then we'll add a second dimension and see what that buys us. But we've changed the numbers around here. Now we still have our list.
But it's first pointing at the number 2 here. And then the number 2 is pointing to some other chunk of memory that's been malloced way over here. And this, then, is the number 4. And this, then, is the number 5. So we have a linked list of size 3. But I've deliberately spread the numbers out this time, 2, 4, 5. Because suppose that we do want to insert more numbers into this list, but in sorted order. It turns out that we have to think a little bit differently when we're adding nodes not to the end and not to the beginning, but in the middle. Like, when we want to allocate more nodes in the middle, there's a bit more work that actually has to happen. So how might we go about doing this? Suppose that we want to allocate, for instance, the number 1, and I want to add the number 1. Well, we could use code like this. This is the same code as we used before. We allocate the size of a node. We check whether it equals null. We initialize it with a value we care about. And by default, we set next equal to null. And pictorially, it might look like this. It's kind of floating somewhere in the computer's memory. I have this temporary variable n, no longer pictured, that points at it when I allocate the number 1. So what does this look like? This is like having the number 1 maybe somewhere over here. And I'll just put it in place. We got lucky, and there was a chunk of memory right there. So what do I want to now do? Well, I want to go ahead and connect this. So what I could do, just intuitively, if 1 should go before 2, I can unplug this. And I can plug this into here, which makes sense. But there's already a problem. If I have done nothing else up until this point, I have just orphaned three nodes, 2, 4, and 5. To orphan a node means to forget where it is. And if I don't have another variable in my code, or if I'm not acting it out with one of my hands pointing at the original beginning of the list, I have literally orphaned the rest of the list. And the technical implication of that, per last week, is that now I have a massive memory leak. You have just leaked the size of three nodes in memory that you can literally never get back until you reboot the computer, for instance, or the program quits and the operating system cleans things up for you. So you don't want to do this. Order of operations actually matters. So what I should probably do is this. When I insert the number 1 first, I should probably recognize that, well, the 1 belongs at the beginning of the list. So what I should really do is point this arrow also at the same node. And we'll sort of do it a little sloppily like that. But let me stipulate those are both pointing at the same node. Now that my new node, AKA n in the code I showed, is pointing at this thing, now I can do kind of a switcheroo. Because I'm already pointing at the final destination there. And now I can remove this safely, because this is my list. This is n. Therefore, I have variables pointing to both and I can go ahead and insert that correctly. So long story short, order of operations matters. So graphically, if I were to do this as before, just by saying list equals n, if this is n, and this is list, and I adjust this arrow first, bad things are going to happen. Indeed, we end up orphaning 2, 4, and 5, thereby leaking a significant amount of memory potentially. And leaking any memory, typically, is bad. So I don't want to do that. So let's look at the correct code.
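In C, that correct ordering might look roughly like this, a sketch assuming the node type from before, a list currently containing 2, 4, 5, and, for the middle case, a hypothetical pointer called prev already aimed at the 2 node:

// (1) Inserting 1 at the front of the list 2 -> 4 -> 5.
node *n = malloc(sizeof(node));
if (n == NULL)
{
    return 1;
}
n->number = 1;
n->next = list;  // first, point the new node at the old head
list = n;        // only then move the head; nothing is ever orphaned

// (2) Inserting 3 into the middle, after the node containing 2.
// "prev" here is a hypothetical pointer already at that 2 node.
node *m = malloc(sizeof(node));
if (m == NULL)
{
    return 1;
}
m->number = 3;
m->next = prev->next;  // first, point the new node at the 4
prev->next = m;        // only then splice the new node in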
The correct code is going to be to start at n and update its next field, this arrow here, to point at the same thing as the list was originally pointing at, such that, for a moment, both of them are pointing at that node in duplicate. Then update the list to point at the new node. So, again, the code's a little different this time from before. Because before, we kept adding it to the end, or I proposed verbally that we just add it to the beginning. Here, we're adding it, indeed, at the beginning. And so the actual steps, the actual code, are a little bit different. Well, let's do one final example, if we want to allocate 3. Well, I've got to malloc another node, the number 3. Suppose that ends up somewhere in the computer's memory. Let's go ahead and plop this one over here. So now 3 is in place. How do I now insert this thing? Well, similar to before, I'm not going to want to update this pointer and go like this and then plug this guy in over here. Because now I've orphaned those two nodes. So that, again, is the wrong step. When you're inserting into the middle of a linked list, if you care about inserting in sorted order, the new node's own pointer should be updated first. And odds are I should kind of cheat and point this at the same thing, even though there's only one physical plug at the moment. So we'll just pretend that this is working. There we go. And now I can go ahead and safely say that n and the previous node are already pointing where they should be. So now it's safe for me to unplug this one and go ahead and update this final arrow to point at the new node in the correct location. So let's see that in code again. If I go here, I've got graphically the node 3 kind of floating in space. I first update its next field to point also at the 4 in duplicate. Then, I update the 2 to point to the 3. The goal being, again, to avoid any leaking of memory or orphaning of nodes. All right, we are about to leave linked lists behind. Because as several of you have noted or probably thought, they're good, but maybe not great. They're good in that they are dynamic and I can add to them, as by inserting at the beginning if I really want and don't care about sorted order. But there's still a good amount of work to do if I want to keep them in sorted order and insert into the middle or at the end. Because that's, like, Big O of n if I keep traversing all of these darn arrows. So we get the dynamism, but we don't necessarily get the performance increase. But we fundamentally have opened up a whole new world to ourselves. We can now stitch together these data structures in memory using pointers as our thread, if you will. We can just use memory as a canvas, painting on it any values we want. And we can sort of remember where all of those values are. But this is indeed very much one dimensional, left to right. What if we give ourselves a second dimension? What if we start thinking not just left to right, but also up and down? So, again, this is meaningless to the computer. The computer just thinks of memory as being byte 0, 1, 2, 3. But we humans can kind of think of these data structures a little more abstractly, a little more simply. And we can think about them in a way familiar, perhaps, to us in the real world. Trees, not the ones so much that grow from the ground, but if you're familiar with family trees.
Where you might have a matriarch or a patriarch and then sort of descendants hanging off of them graphically on a piece of paper, something you might have made in grade school, for instance. We can leverage this idea of a tree structure that has a root that kind of branches and branches and branches and grows top to bottom. So, again, more like a family tree than an actual tree in the soil. So with trees, it turns out, this idea of a tree, we can take some of the lessons learned from linked lists. But we can gain back some of the features of arrays. And we can do that as follows. Consider the following definition of what we're about to call a binary search tree. Binary search being a very good thing from the very first week, Week 0. And a tree now being the new idea. Here's an array from Week 0, Week 1, Week 2, whenever. And it's of size 7. And recall that if it's sorted, we can apply binary search to this array. And that's great. Because if we want to search for a value, we can start looking in the middle. Then we can go either left or right, halfway between each. And then we can similarly go left or right. So binary search on a sorted array was so powerful because it was Big O of log n, we've concluded, by just halving and halving and halving the problem again and again, tearing the phone book in half again and again and again. But the problem with binary search is that it requires that you use an array so that you have random access. You have to be able to index into the array in constant time using simple arithmetic, like bracket 0, bracket n minus 1, bracket n minus 1 divided by 2, to get the halfway point. You have to be able to do arithmetic on the data structure. And we've just proposed getting rid of that random access by transitioning to a dynamic data structure like a linked list instead of an array. But what if we do this? What if you and I start thinking not in one dimension, but in two dimensions? And what if we alter our thinking to be like this? So think of an array, perhaps, as being a two dimensional structure that has not only width or length, but also height. And so we maintain, it seems, visually this relationship between all of these values. But you know what? We can stitch all of these values together using what? Well, pointers. Pointers are this new building block that we can use to stitch together things in memory. If the things in memory are numbers, that's fine. They're integers. But if we throw a little more memory at them, if we use a node, and we kind of wrap the integer in a node such that that node contains not only numbers but pointers, we could probably draw a picture like this, not unlike a family tree, where there's a root node, at the very top in this case, and then children, so to speak. Left child and right child, and that definition repeats again and again. And it turns out that computer scientists do use this data structure in order to have the dynamism of a linked list, where you can add more and more nodes to the tree by just adding more and more squares even lower than the 1, the 3, the 5, and the 7. And just use more pointers to kind of stitch them together, to sort of grow the tree vertically, down, down, down, if you will. But a good computer scientist would recognize that you shouldn't just put these numbers in random locations. Otherwise, you're really just wasting your time. You should use some algorithm. And notice. Does anyone notice the pattern to this tree?
Can anyone verbalize or textualize in the chat what pattern is manifest by these seven nodes in this tree? They're not randomly ordered. They're very deliberately ordered left to right, top to bottom, in a certain way. Can anyone put their finger on what the definition of this thing is? What is the most important characteristic besides it just being drawn like a family tree? Greg? AUDIENCE: On top of them, you have put the middle number. For example, between 1 and 3, you have put 2. Between 5 and 7, you have put 6. So on top of them, you have put the middle number. DAVID MALAN: Exactly. There's this pattern to all of the numbers. Between 1 and 3 is 2. Between 5 and 7 is 6. Between 2 and 6 is 4. And it doesn't even have to be the middle number, per se. I can generalize it a little bit and comment that, if you pick any node in this tree, so to speak, its left child will be less than its value, and its right child will be greater than its value. And we can do that again and again. So here's 4. Its left child is 2. That's less than. Here's 4. Its right child is 6. That's greater than. We can do this again. Let's go to 2. Its left child is 1, which is less. Its right child is 3, which is more. 6, its left child is 5, which is less. 6, its right child is 7, which is more. And so, if you don't mind revisiting recursion from last week, this is actually a recursive definition. This is a recursive data structure. So it's not only algorithms or functions that can be recursive by calling themselves. A data structure can also be recursive. After all, what is this thing? This is a tree, yes. I'll stipulate. But it's technically a tree with two subtrees. Right? This node here, number 4, technically has two children. And each of those children is itself a tree. It's a smaller tree. But it's the same exact definition again and again. And any time we see a recursive data structure, it's actually going to be an opportunity to use recursive code, which we'll take a look at in just a moment. But for now, notice what we've achieved: again, the dynamism of using pointers so that we can, if we want, add more nodes to this tree, as by stringing them along the bottom in the correct order. And yet, we've preserved an important order for binary search, AKA the formal name of this data structure, binary search tree, by making sure that the left child is always less, the right child is always more. Because now, we can go about searching this thing more efficiently. How? Well, if I want to search for the number 3, what do I do? Well, I start at the beginning of the tree. Just like with a linked list, you start at the beginning of the linked list. So with the tree, you start with the root of the tree always. Suppose I want to search for 3. Well, what do I do? Well, 3 is obviously less than 4. So just like in Week 0 where I tore the phone book in half, you can now think of this as chopping down half of the tree. Because you know 3, if it's present, is definitely not going to be anywhere over here, so we can focus our attention down here. Here's the number 2. This is another tree. It's just a smaller subtree, if you will. How do I find the number 3? Well, I look to the right because it's greater than. And boom, I found it. But by contrast, suppose I were searching for the number 8. I would start here. I would look here. I would look here. And then conclude, no, it's not there.
But, again, every time I search for that 8, I'm ignoring this half of the tree, this half of the subtree, and so forth. So you're going to achieve, it would seem, the same kind of power, the same kind of performance as we saw from Week 0. So how do we translate this idea now into code? We have all the building blocks already. Let me go ahead and propose that, instead of the node we used before for a linked list, which looked like this, with a number and one pointer called next-- but, again, we could have called those things anything-- let's go ahead and make room for not just a number, but also two pointers, one that I'll call left, one that I'll call right. Both of those are still pointers to a struct node. So same terminology as before, but now I have two pointers instead of one, so that one can conceptually point to the left and point to a smaller subtree. One can point to the right and point to a larger subtree. So how do we go about implementing something like binary search? Well, let's actually see some code. And this is where recursion really gets kind of cool. We kind of forced it when building Mario's pyramid with recursion. Like, yeah, you can do it. And yes, the pyramid was, I claimed, a recursive physical structure or virtual structure in the game. But with data structures and pointers, now recursion really starts to shine. So let's consider this. If I declare a function in C whose purpose in life is to search a tree for a number, it's going to, by definition, search from the root on down. How do we implement this? Well, my function, I'll propose, is going to return a bool. True or false, the number is in the tree, yes or no. It's going to take two arguments, a pointer to a node, AKA tree. I could call it root or anything else. And it's going to take a number, which is the number I care about, whether it's 4 or 6 or 8 or anything else. So what's going to be my first chunk of code? Well, let me do the best practice that I keep preaching. Any time you're dealing with pointers, check for null so that your program doesn't freeze or crash or have some bad thing happen. Because who knows? Maybe you will accidentally, or maybe intentionally, pass this function a null pointer, because there's no tree. Maybe you'll screw up. And that's OK, so long as your code is self-defensive. So always check pointers for null. If so, if the tree is null, that is, there is no tree there, obviously, the number is not present. So you just return false. So that's one of our base cases, so to speak. Else, if the number you're looking for is less than the tree's own number-- so, again, this arrow notation means take tree, which is a node *, so take this pointer. Go there, to the actual node. And look in its own number field. If the number you're looking for from the argument is less than the number in the tree's own number field, well, that means that you want to go left. And whereas in the phone book, I went to the left of the phone book, here we're going to go to the left subtree. But how do I search a subtree? Here's where it's important that a tree is a recursive data structure. A tree is two subtrees with a new root node, as before. So I already have code via which I can search a smaller tree. So I can just say search the left subtree, as expressed here, which means start at the current node and go to the left child and pass in the same number. The number is not changing. But the tree is getting smaller. I've effectively, in code, chopped the tree in half. And I'm ignoring the right half.
And I'm returning whatever that answer is. Otherwise, if the number I care about is greater than the number in the current node, do the opposite. Search the right subtree, passing in the same number. So, again, just like with the phone book, which kept getting smaller and smaller, here I keep searching a smaller and smaller subtree. Because I keep chopping off branches, left or right, as I go from top to bottom. There's one final case. And let me toss this out to the group. There's a fourth case. Verbally or textually, what else should I be checking for and doing here? I've left room for one final case. BRIAN: A few people are suggesting if the tree itself is the number. DAVID MALAN: If the tree itself contains the number, yeah. So if the number in the tree equals equals the number I'm looking for, go ahead and, only in this case, return true. And so this is where the code-- recursion, rather-- gets a little mind-bending. I only have false up here, true here. But not in either of these middle two branches, no pun intended, if you will. But that's OK. Because my code is designed in such a way that if I search the left subtree and there's nothing there, like, literally I've gone past the leaf of the tree, so to speak, then it's going to return false. So that's fine. That's, like, if I search for the number 8. It's not even in the tree. I'm only going to realize that once I fall off the end of the tree and see, oops, null. I'll just return false. But if I ever see the number along the way, I will return true. And these two inner calls to search, these two recursive calls to search, are just kind of like passing the buck. Instead of answering true or false themselves, they're returning whatever the answer to a smaller question is by searching the left or right subtree instead, respectively. So, again, this is where recursion starts to feel not really forced, but really rather appropriate. When your data is itself recursive, then recursion as a coding technique really rather shines. So if ultimately we have-- oh, one minor optimization. As we might have noted in Week 0 with Scratch, we, of course, don't need to explicitly check if the number is equal. We can get rid of that last condition and just assume that if it's not null and it's not to the left and it's not to the right, we must be standing right on top of it. And so we just return true there. Well, let me summarize the picture here. This is now a two dimensional data structure. And it's sort of better than a linked list in that now it's two dimensions. I gain back binary search, which is amazing, so long as I keep my data in sorted order per this binary search tree definition. But I've surely paid a price. Right? Nothing is absolutely better than anything else in our story thus far. So what is the downside of a tree? What price have I secretly or not so secretly paid here, while preaching the upsides of trees? And, again, the answer in this context is often some resource: space, or time, or developer time, or money, or something personal, or physical, or real world. Yeah. How about over to [INAUDIBLE]? AUDIENCE: So I think that inserting is no longer constant time. And I guess we need more memory. You need memory to store two pointers instead of one this time. DAVID MALAN: Yeah. It seems insertion is no longer constant time. Because if I need to preserve sorted order, I can't just put it at the top. I can't just keep pushing everything else down, because things might get out of order.
In that case, it would seem. Or rather, even if I maintain the order, it might kind of get very long and stringy. If I add, for instance, another number, another number, and I keep jamming it at the top, I probably need to kind of keep things balanced, if you will. And yeah, the bigger point, too, is that immediately I'm using twice as many pointers. So now my node is getting even bigger. I now have room for not only a number and a pointer, but another pointer, which is, of course, going to cost me more space. Again, so a trade off there. And let's go ahead and ask the group here: why don't we consider for a moment what the running time of insertion might be when inserting into a binary search tree? If you'd like to pull up the URL as always, let me go ahead and present this one. What's the running time of inserting into a binary search tree? So if you want to insert the number 0 into that tree, if you want to insert the number 8, or anything in between or bigger or smaller, what are the answers here? So not quite as big of a victory for the tallest bar. About 60% of you think log n. And good instincts there, frankly, because that is going to be the right answer. And that's kind of the right instinct. Any time you have binary search, odds are you're talking something logarithmic. But we've also seen divide and conquer with merge sort at n log n, so it's not unreasonable that about 10% of you think that, too. n squared would actually be bad. So n squared is, like, the worst of the times we've seen thus far. And that would suggest that a tree is even worse than a linked list, even worse than an array. And thankfully, we're not at that point. But it's indeed log n, based on certain assumptions. So why is that? So if we consider the tree from a moment ago, it looked a little something like this. And what is involved in inserting into a tree? Well, suppose I want to insert the number 8. Well, I start here. And it obviously belongs to the right, because 8 is bigger. I go here. It belongs to the right, because 8 is bigger. I go here. It belongs to the right, because 8 is bigger. And so a new node is going to be created somewhere down here. And even though it doesn't fit on the screen, I could absolutely call malloc. I could update a couple of pointers. And boom, we've added an eighth node to the tree. So if it took me that many steps, starting at the root, 1, 2, 3, how do I generalize this into Big O notation? Well, a binary search tree, if you lay it out nice and prettily like this, nice and balanced if you will, the height of that binary search tree, it turns out, is going to be log of n. If n is the number of total nodes, 7 or 8 now in the story, then log base 2 of n is going to be the height of the tree. So if you take n nodes, n numbers, and you kind of balance them in this nice sorted way, the total height is going to be log n. So what is the running time of insert? Well, the running time of insert is equivalent to how many steps it takes you to find the location into which the new number belongs. Well, that's 1, 2, 3. And as it turns out, log base 2 of 8 is indeed 3. So the math actually works out perfectly in this case. Sometimes, there might be a little rounding error. But in general, it's going to indeed be Big O of log n. But what if we get a little sloppy? What if we get a little sloppy and we start inserting nodes that are giving us a bit of bad luck, if you will?
So for instance, suppose that I go ahead-- and let me do something on the fly here. Suppose that I go ahead and insert the number 1, the number 2, and the number 3 such that this is what logically happens. This adheres to the definition of a binary search tree. If this is the root, it's 1. It has no left subtree, and that's not strictly a problem, because there's nothing violating the definition of a search tree here. There's just nothing there. 2 is in the right place. 3 is in the right place. So this, too, technically is a binary search tree. But it's a bit of a corner case, a perverse case, if you will, where the way you inserted things ended up in the binary search tree actually resembling more of a what, would you say, if you want to chime in the chat? And Brian, if you might want to relay? BRIAN: A few people are saying it looks like a linked list. DAVID MALAN: Yeah. So even though I've drawn it sort of top down, so in the sort of second dimension, that's really just an artist's rendition. This tree is a binary search tree. But it's kind of sort of also a linked list. And so even the most well-intentioned data structures, given unfortunate inputs or some bad luck or some bad design, could devolve into a different data structure just by chance. So there's a way to solve this, though, even with these values. When I insert 1, 2, 3, I could allow for this perverse situation where it just gets long and stringy, at which point everything is Big O of n. It's just a linked list. It just happens to be drawn diagonally instead of left to right. But does anyone see a solution intuitively? Not in terms of code, no formal language, but there is a solution here to make sure that this tree with 1, 2, 3 does not get long and stringy in the first place. What might you do instead to solve this? BRIAN: A few people in the chat are suggesting you should make 2 the new root node at the top of the tree. DAVID MALAN: So if I instead make 2 the new root node, let me go ahead and mock this up real quickly. And in a moment, I'll reveal what I think you've just verbalized. What if instead, I make sure that when inserting these nodes I don't naively just keep going to the right, to the right, to the right? I exercise some judgment. And if I notice, maybe, that my data structure, my tree, is getting kind of long and stringy, maybe I should kind of rotate it around, using that second dimension for real, so that I change what the root is. And we won't go through the code for doing this. But it turns out this is the solution. That is exactly the right intuition. If you take a higher level class on data structures and algorithms specifically in computer science, you'll study trees like AVL trees or red-black trees, which are different types of tree data structures. They kind of have built into them the algorithms for kind of shifting things as needed to make sure that as you insert, or maybe as you delete, you constantly rebalance the tree. And long story short, doing so might cost you a little extra time. But if you've got a lot of data, keeping that thing balanced and logarithmic in height, so to speak, and not long and stringy and linear in height, is probably, depending on your application, going to save you quite a bit of time overall. So we might say that insert into a balanced binary search tree is indeed Big O of log n. But that is conditional on you making sure that you keep it balanced. And that's going to be more code than we'll go into today, but indeed, a possible design decision.
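For reference, the tree node and the recursive search function described a moment ago might look roughly like this in C, a sketch (the names tree and number follow the walkthrough, with the equality check kept explicit rather than optimized away):

#include <stdbool.h>
#include <stdlib.h>

// A tree node: a number plus pointers to its smaller and larger subtrees.
typedef struct node
{
    int number;
    struct node *left;
    struct node *right;
}
node;

bool search(node *tree, int number)
{
    // Base case: fell off the bottom of the tree, so the number isn't here.
    if (tree == NULL)
    {
        return false;
    }
    // The number, if present, can only be in the smaller (left) subtree.
    else if (number < tree->number)
    {
        return search(tree->left, number);
    }
    // The number, if present, can only be in the larger (right) subtree.
    else if (number > tree->number)
    {
        return search(tree->right, number);
    }
    // Otherwise, we're standing right on top of it.
    else
    {
        return true;
    }
}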
All right, any questions now on trees and binary search trees in particular? We started with arrays a few weeks ago. We've now gone from linked lists, which are good, better, but not great, to trees, which seem maybe to be great but, again, it's always a trade off. They're costing us more space. But I bet we can continue to stitch some of these ideas together for other structures still. Brian, anything outstanding? BRIAN: Yeah. One question came in as to why it's a problem if you have, like, the 1, and the 2, and the 3 all in just one sequence on the right side? DAVID MALAN: Yeah, a really good question. Why is it a problem? Maybe it isn't. If you don't have a very large data set and you don't have many values in the structure, honestly, who cares? If it's three elements, definitely don't care. If it's 10 elements, if it's 1,000, heck, if your computer is fast enough, there might be a million elements, and it's not a big deal. But if it's two million elements or a billion elements, then it totally depends, again, on what is the business you're building? What is the application you're writing? How big is your data? How fast or how slow is your computer? It might very well matter ultimately. And indeed, when we saw some of our algorithms a couple of weeks ago, when we compared bubble sort and selection sort and merge sort, even though those were in different categories of running times, n log n and n squared, just recall the appreciable difference. Log of n in the context of searching is way better than n. So if your data structure is devolving into something long and stringy, recall, that's like linearly searching a phone book of a thousand total pages. But binary search, and not letting it get long and stringy, gives you, like, 10 steps instead of 1,000 steps in order to search those same pages. So, again, even in Week 0 we saw the appreciable difference between these different categories of running times. All right, well, let's see if we can't maybe take some of the best of both worlds. Thus far, again, we've seen arrays. We've seen linked lists. We've seen trees. What if we kind of get a little Frankenstein here and mash things together and take sort of the best features of these things and build up something grander? In fact, I feel like the Holy Grail of a data structure would be something for which search and insertion aren't Big O of n, aren't even Big O of log n. But wouldn't it be amazing if there's a data structure out there where the running time is, like, constant time, Big O of 1? That's the Holy Grail. If you can cleverly lay out your computer's memory in such a way that if you want to search for or insert a value, boom, you're done. Boom, you're done, and none of this linear or logarithmic running time. So let's see if we can't pursue that goal. Let me propose that we introduce this topic called hash tables. A hash table is another data structure that's essentially an array of linked lists. So, again, it's this Frankenstein monster, whereby we've combined arrays ultimately with linked lists. Let's see how this is done. Let me propose that we first start with an array of size 26. And I'm going to start drawing my arrays vertically, just because it sort of works out better pictorially. But, again, these are all artist's renditions, anyway. Even though we always draw arrays left to right, that's completely arbitrary. So I'm going to start, for now, drawing my array top to bottom. And suppose that the data structure I care about now is going to be even more interesting than numbers.
Suppose I want to store things like names, as in dictionaries, or names like contacts in your phone. If you want to keep track of all the people you know, it would be great if it doesn't take linear time or logarithmic time to find people. Instead, constant time would seem to be even better. So here's an array, for instance, of size 26. And I proposed that deliberately. In English, there are 26 letters, A through Z. So let's consider location 0 to be A. Location 25 is Z. And if I now go and start inserting all of my friends into my new phone, into the contacts application, where might I put them? Well, let me go ahead and do this. Let me go ahead and think of each of these elements as, again, 0 through 25, or really A through Z. And let me, upon inserting a new friend or contact into my phone, put them into a location that has some relationship with the name itself. Let's not just start putting them at the very beginning. Let's not necessarily put them alphabetically, per se. Let's actually put them at a specific location in this array, not just top to bottom, but at a specific entry. So suppose the first person I want to add to my contacts is Albus. Well, I'm going to propose that because Albus starts with an A, he is going to go into the A location, so the very first entry in this array. Suppose I next want to add Zacharias. Well, his name starts with Z. So he's going to go in the very last location. So, again, I'm jumping around. I went from 0 to 25. But it's an array, and I can do that in constant time. You can randomly index to any element using square brackets. So this is still constant time. I don't have to just put him right after Albus. I can put him wherever I want. Suppose the third person is Hermione. Well, I'm going to put her at location H. Why? Because I can do the math, and I can figure out H. OK, I can just jump immediately to that letter of the alphabet. And in turn, thanks to ASCII and doing a bit of arithmetic, I convert that letter to a number as well. So she ends up at 0, 1, 2, 3, 4, 5, 6, 7, because H ends up mapping to the eighth letter, or location 7. All right. Who else? All these other people end up in my address book. And so they're all spread out. I don't have as many as 26 friends. So there are some gaps in the data there. But I fit everyone here. But there might be a problem. And you can perhaps see this coming. Thus far, I've kind of gotten lucky, and I've only known people whose names uniquely start with a given letter. But as soon as I meet someone at school and I add them to my contacts, well, now Harry, for instance, has to go in the same location. Now this is a problem if I want to store both Hermione and Harry, because their names both start with H. And, again, if it's just an array, that's absolutely a deal breaker. At that point, all things break down. Because I could, yes, grow the array. But if I grow the array, then it's size 27. And then it's, like, how do I know what number is what letter at that point? It just devolves into a complete mess. But if I borrow the idea of a linked list, what if I make my array an array of linked lists? So yes, even though there's this collision where both Hermione and Harry belong at the same location in the array, that's fine. In the event this happens, I'm just going to kind of stitch them together into a linked list from left to right. So it's not ideal. Because now it takes me two steps to get to Harry instead of one, using simple arithmetic and square bracket notation. But heck, at least I can still fit him in my address book.
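In code, a first draft of this idea might look roughly like this, a sketch, not the official Problem Set 5 code: the fixed-size name field, the one-letter hash, and the insert function are all assumptions for illustration, with ctype's toupper doing the ASCII math:

#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

// Each element of the hash table is a linked list ("chain") of these nodes.
typedef struct node
{
    char name[50];      // assumes names fit in 49 chars plus '\0'
    struct node *next;
}
node;

// The hash table itself: 26 chains, one per letter, initially all NULL.
node *table[26];

// A one-letter hash function: "Albus" maps to 0, "Zacharias" to 25.
unsigned int hash(const char *name)
{
    return toupper((unsigned char) name[0]) - 'A';
}

// Insert a name at the front of whichever chain it hashes to.
bool insert(const char *name)
{
    node *n = malloc(sizeof(node));
    if (n == NULL)
    {
        return false;
    }
    strcpy(n->name, name);
    unsigned int index = hash(name);
    n->next = table[index];  // new node points at the chain's old head
    table[index] = n;        // so Hermione and Harry can share bucket 7
    return true;
}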
So a bit of a trade off, but it feels reasonable. Well, someone else, Hagrid-- all right, it's not ideal that now it takes me three steps to get to Hagrid in my address book. But three is way better than not having him in there at all. So, again, we see a manifestation of a trade off here, too. But we've solved the problem. And a hash table is, indeed, exactly this data structure. It is an array of linked lists, at least it can be implemented as such. And it is predicated on introducing the notion of a hash function. This is actually something we'll see in other contexts before long. But a hash function is going to allow us to map not only Hermione, Harry, and Hagrid, but also Ron and Remus, Severus and Sirius, to their respective locations deterministically. That is, there's no randomness involved here. Every time I look at these people's names, I'm going to figure out the location at which they belong, and that location is never going to change. So how do I do this? Well, it turns out we can think back to problem solving itself and what functions are. So this is problem solving as we defined it. This is also a function in any language: something that takes inputs and produces outputs. A hash function is going to be sort of a secret sauce inside of this black box for now. And so what is a hash function? Well, a hash function is literally a function, either mathematically or in programming, that takes as input some string, in this case, like Hermione or Harry, and it returns some output. And the output of a hash function is usually a number. In this case, the number I want is going to be between 0 and 25. So in order to implement this notion of a hash table, not just pictorially on the screen, but in actual code, I'm literally going to have to write a function that takes a string, or if you will, a char *, as input and returns an int between 0 and 25 so that I know how to convert Hermione or Harry or Hagrid to the number 7 in this case. So what does this hash function do? It takes as input something like Albus and it outputs 0. It takes someone like Zacharias, and it outputs 25. And you can probably see the pattern here. The code I would write in order to implement something like this is probably going to look at the user's input, that char *. And it's going to look at the first character, which is A or Z, respectively, for these two. And it's then going to do a little bit of math and subtract off, like, 65 or whatnot. And it's going to get me a number between 0 and 25, just like with Caesar or some of our past manipulations of strings. So from here, we can now take this building block, though, and perhaps solve our problems a little more effectively. Like, I don't love the fact that even though, yes, I've made room for Harry, and Hermione, and Hagrid, and now Luna, and Lily, and Lucius, and Lavender, some of these linked lists are getting a little long. And there's another term you can use here. These are kind of like chains, if you will, because they look like chain link fences or little links in a chain. These are chains, or linked lists. But some of them are starting to get long. And it's a little stupid that I'm trying to achieve constant time, Big O of 1, but technically, even though some of the names literally take one step, some of them are taking two or three or four steps. So it's starting to devolve. So what would be an optimization here?
If you start to get uncomfortable because you're so popular and you've got so many names in your contacts that looking up the H's or looking up the L's is taking more time than the others, what could we do to improve the situation and still use a hash table, still use a hash function? But what would maybe the logical solution be when you have too many collisions, you have too many names colliding with one another? How could we improve our performance and get at locations, again, closer to one step, not two, not three, not four? So literally one step, because that's our Holy Grail here. BRIAN: A few people have suggested you should look at more than just the first letter. Like, you could look at the second letter, for example. DAVID MALAN: Yeah. Nice. So looking at one letter of the person's name is obviously insufficient, because a whole bunch of us have names that start with H or A or Z or the like. Well, why don't we look at two letters and thereby decrease the probability that we're going to have these collisions? So let me go ahead and restructure this. And focusing on the Hermione, Harry, and Hagrid problem, why don't we go ahead and take our array in the hash table, the vertical thing here, and let's think of it as maybe not just being H at that location. But what if we think of that location specifically as being HA, and then HB, HC, HD, HE, HF, all the way down to HZ. And then IA, IB, IC, and so forth, so we now enumerate all possible pairs of letters from AA to ZZ? But this would seem to spread things out, right? Because now Hermione goes in the HE location in the array. Now Harry goes in the HA. And now Hag-- oh, dammit. Like, Hagrid still goes in the same location as Harry. So what would maybe be a better fix? Again, this isn't horrible. Like, two steps is not a big deal, especially on fast computers. But, again, with large enough data sets, and if we're no longer talking about people in your contacts, but maybe all the people in the world who have Google accounts, or Twitter accounts, and the like, where you want to search this information quickly, you're going to have a lot of people whose names start with H and A and Z and everything else. It would be nice to spread them out further. So what could we do? Well, instead of using the first two letters, frankly, I think the logical extension of this is to use the first three letters. So maybe this is the HAA bucket. This is the HAB, HAC, HAD, HAE, dot, dot, dot, all the way down to HZZ, and then IAA. But now when we hash our three friends, Hermione goes in the HER bucket, so to speak, the element of the array. Harry, HAR, goes in that bucket. And Hagrid now gets his own bucket, HAG, as would everyone else. So it seems to have solved this specific problem. You could still imagine-- and I'd have to think harder and probably Google to see if there are other Harry Potter names that start with HAG or HAR or HER-- finding another collision, in which case you could imagine using four letters instead. But what price are we paying? Like, I'm solving this problem again and again. And I'm getting myself literally faster lookup time, because it's giving me one step. I can mathematically figure out, by just doing a bit of ASCII math, the number of the index that I should jump to in this bigger and bigger array. But what price am I paying? Brian, any thoughts you'd like to relay? BRIAN: Yeah. A few people are saying it's going to take a lot of memory. DAVID MALAN: Yeah. My God, like, this is taking a huge amount of memory now.
Previously, how much memory did it take? Well, let me pull up a little calculator here and do some quick math. So if we had originally 26 buckets, so to speak, elements in the array, that, of course, isn't that bad. That feels pretty reasonable, 26 slots. But the downside was that the chains might get kind of long, three names, four names, maybe even more. But if we have AA through ZZ, instead of A through Z, that's 26 times 26, that's 676 buckets. Doesn't sound like a huge deal, though that's bigger than most things we've done in memory thus far, not a huge deal. But if we have three letters, that's 26 possibilities times 26 times 26, for AAA through ZZZ. Now we have 17,576 buckets in my array. And the problem isn't so much that we're using that memory. Because honestly, if you need the memory, use it. That's fine. Just throw hardware at the problem. Buy or upgrade to more memory. But the problem is that I probably don't know that many people whose names start with HAA or AZZ or any number of these combinations of letters of the alphabet. A lot of those buckets are going to be empty. But it doesn't matter if they're empty. If you want an array and you want random access, they have to be present so that your arithmetic works out, per week two, where you just use square bracket notation and jump to any of the locations in memory that you care about. So finding that trade off, or finding the inflection point with those trade offs, is kind of an art and/or a science, figuring out for your particular data, your particular application, which is more important, time or space, or some happy medium in between the two. And with Problem Set 5, as you'll see, you'll actually have to figure out this balance in part by trying to minimize, ultimately, your own use of memory and your own use of computers' time. But let me point something out, actually. This notion of a hash table, which is, up until now, definitely the most sophisticated data structure that we've looked at, is kind of familiar to you in some way already. These are probably larger than the playing cards you have at home. But if you've ever played with a deck of cards, and the cards start out randomly, odds are you've, at some point, needed to sort them for one game or another. Sometimes you need to shuffle them entirely. If you want to be a little neat, you might sort them, not just by number, but also by suit. So hearts and spades and clubs and diamonds into separate categories. So honestly, I have this literally here just for the sake of the metaphor. We have four buckets here. And we've gone ahead and labeled them in advance, with a spade there. So that's one bucket. Here we have a diamond shape here. And here we have-- [GRUNTING] --here we have hearts here and then clubs here. So if you've ever sorted a deck of cards, odds are you haven't really thought about this very hard. Because it's not that interesting. You probably mindlessly start laying them out and sorting them by suits and then maybe by number. But if you've done that, you have hashed values before. If you take a look at the first card and you see that, oh, it's the ace of diamonds-- you know, yes, you might care ultimately that it's a diamond, that it's an ace. But for now, I'm just going to put it, for instance, into the diamond bucket. Here's the two of diamonds here. I'm going to put that into the diamond bucket. Here's the ace of clubs. So I'm going to put that over here. And you can just progressively hash one card after the other.
And, indeed, hashing really just means to look at some input and produce, in this case, some numeric output, like bucket 0, 1, 2, or 3, based on some characteristic of that input, whether it's actually the suit on the card, like I'm doing here, or maybe the letter of the alphabet, as before. And why am I doing this? Right? I'm not going to do the whole thing, because 52 steps is going to take a while and get boring quickly, if not already. But why am I doing this? Because odds are you've probably done this. Not with the drama of actual buckets, you've probably just kind of laid them out in front of you. But why have you done that, if that's indeed something you have done? Yeah. Over to Sophia? AUDIENCE: There's a possibility that we could actually get to things faster, like, if we know what bucket it is. We might be able to even search things for, like, 0, 1, or less. DAVID MALAN: Yeah. AUDIENCE: Something like that. DAVID MALAN: Yeah. You start to gain these optimizations, right? At least, as a human, honestly, I can process four smaller problems much more easily than one bigger problem that's size 52. I can solve four 13-card problems a little faster, especially if I'm looking for a particular card. Now I can find it among 13 cards instead of 52. So there's just kind of an optimization here. So you might take as input these cards, hash them into a particular bucket, and then proceed to solve the smaller problem. Now that's not what a hash table itself is all about. A hash table is about storing information, but storing information so as to get to it more quickly. So to Sophia's point, if indeed she just wants to find, like, the ace of diamonds, she now only has to look through a 13-sized problem, a linked list of size 13, if you will, instead of an array or a linked list of size 52. So a hash table allows you to bucketize your inputs, if you will, colloquially, and get access to data more quickly. Not necessarily in one step-- it might be two. It might be four. It might be 13 steps. But it's generally fewer steps than if you were doing something purely linearly, or even logarithmically. Ideally, you're trying to pick your hash function in such a way that you minimize the number of elements that collide, by using not A through Z, but AA through ZZ and so forth. So let me go ahead here and ask a question. What, then, is the running time when it comes to this data structure of a hash table? If you want to go ahead and search in a hash table, once all of the data is in there, once all of my contacts are there, how many steps does your phone have to take, given n contacts in your phone, to find Hermione or Hagrid or anyone else? So I see, again, 80% of you are saying constant time, Big O of 1. And, again, constant time might mean one step, two steps, four steps, but some fixed number, not dependent on n. 18% of you or so are saying linear time. And I have to admit, the 18% of you or so that said linear time are technically, asymptotically, mathematically correct. And here we begin to see sort of a distinction between the real world and academia. The real world programmer would say, just like Sophia did: obviously, a bucket with 13 cards in it is strictly better than one bigger bucket with 52 cards. That is just faster. It's literally four times as fast to find or to flip through those 13 cards instead of 52. That is objectively faster.
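In code, the card-bucketing version of a hash function might look like this-- the types and names here are purely illustrative, since the lecture does this physically rather than in code:

```c
// Hashing on a characteristic of the input, per the card demo: a
// card's suit determines its bucket, 0 through 3. These types are
// illustrative, not from any particular library.
typedef enum { SPADES, DIAMONDS, HEARTS, CLUBS } suit;

typedef struct
{
    suit s;
    int rank;  // 1 (ace) through 13 (king)
} card;

// Deterministic, like any good hash: the same card always lands
// in the same one of the four buckets
unsigned int hash_card(card c)
{
    return (unsigned int) c.s;
}
```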
But the academic would say, yes, but asymptotically-- and asymptotically is just a fancy way of saying as n gets really large, the sort of wave of the hand that I keep describing-- taking 13 steps is technically Big O of n. Why? Well, in the case of the cards here, it's technically n divided by 4. Yes, it's 13. But if there's n cards total, technically, the size of this bucket is going to end up being n divided by 4. And what did we talk about when we talked about Big O and omega? Well, you throw away the lower order terms. You get rid of the constants, like the divide by 4, or the plus something else. So we get rid of that. And so searching a hash table is technically still in Big O of n. But here, again, we see a contrast between the real world and the theoretical world. Like, yes, if you want to get into an academic debate, yes, it's still technically the same as a linked list or an array, at which point you might as well just search the thing left to right linearly, whether it's an array or a linked list. But come on. Like, if you actually hash these values in advance, and spread them out into 4, or 26, or 676 buckets, that is actually going to be faster when it comes to wall clock time. So when you literally look at the clock on the wall, less time will pass taking Sophia's approach than taking an array or linked list approach. So here, those of you who said Big O of n are correct. But when it comes to real world programming, honestly, if it's faster than actual n steps, that may very well be a net positive. And so perhaps we should be focusing more on practice and less, sometimes, on the theory of these things. And indeed, that's going to be the challenge. Problem Set 5, to which I keep alluding, is going to challenge you to implement one of these data structures, a hash table, with 140,000 plus English words. We're going to, in a nutshell, give you a big text file containing one English word per line. And among your goals is going to be to load all of those 140,000 plus words into your computer's memory using a hash table. Now if you are simplistic about it, and you use a hash table with 26 buckets, A through Z, you're going to have a lot of collisions. If there's 140,000 plus English words, there's a lot of words in there that start with A, or B, or Z, or anything in between. If you maybe then go with AA through ZZ, maybe that's better, or AAA through ZZZ, maybe that's better. But at some point, you're going to start to use too much memory for your own good. And one of the challenges, optionally, of Problem Set 5 is going to be to playfully challenge your classmates whereby, if you opt into this, you can run a command that will put you on the big board, which will show on the course's website exactly how much or how little RAM or memory you're using, and how little or how much time your code is taking to run. And so we just put aside the sort of academic waves of the hand saying, well, yes, all of your code is Big O of n. But n divided by 4, Sophia's approach, is way better in practice than n itself. And we'll begin to tease apart the dichotomy between theory here and practice. But these aren't the only ways to lay things out in memory. And we wanted to show you just a few other ideas that come out now that we have all of these building blocks, one of which is the data structure that we're going to call a trie. Trie is actually short for the word retrieval, even though it's not quite pronounced the same. It's also known as a prefix tree.
And it's a different type of tree that is typically used to store words, or other more sophisticated pieces of data, instead of just numbers alone. So a trie is actually a tree made up of arrays. So you can kind of see a pattern here. A hash table was an array of linked lists. A trie is a tree, each of whose nodes is an array. So at some point, computer scientists started getting a little creative and started just, like, literally smashing together different data structures to see what they could come up with, it seems. And so a trie begins to look like this. But it has this amazing property that's better, in theory, than anything we've seen before. Here is one node in a trie. And it's a node in the sense that this would be like a rectangle or square. But inside of that node is literally an array of size 26 in this case. And each of those locations, or buckets, if you will, represents A through Z. And what we're going to do is, any time we insert a word, like a name, like Harry, or Hagrid, or Hermione, or anyone else, we are going to walk through the letters of their name, like, H-A-G-R-I-D, and we are going to follow a series of pointers from one node to another as follows. So for instance, if this is A through Z, or 0 through 25, here is location H. So if the goal at the moment is to insert the first of our contacts, for instance, Harry, I'm going to start by looking at the first node, the root of the tree, looking up the H location. And I'm going to kind of make a mental note that Harry starts there, the H in Harry starts there. Then if I want to insert the A in Harry, I'm going to go ahead and add a new node, representing 26 letters. But I'm going to keep track of the fact that, OK, A is here. So now I'm going to have another pointer-- oh, I'm sorry-- not Harry, Hagrid first, H-A-G-R-I-D. So what have I just done? A trie, again, is a tree, each of whose nodes is an array. And each of those arrays is an array of pointers to other nodes. So, again, we're really just mashing up everything together here. But it's the same building blocks as before. Each node in this tree, top to bottom, is an array of pointers to other nodes. And so, if I wanted to check if Hagrid is in my contacts, I literally start at the first node, and I follow the H pointer. I then follow the A pointer. I then follow the G pointer, the R pointer, the I pointer. And then I check at the D pointer: is there a Boolean value inside of that structure-- more on that another time, perhaps-- that just says, yes or no, there is someone named H-A-G-R-I-D in my contacts? Notice there's no other letters noted at the moment. And there's no other green boxes. Green just denotes a Boolean value for our purposes now. So that means there's no one whose name is H-A-G-R-I-A, or R-I-B, or R-I-C. It's only H-A-G-R-I-D that exists in my contacts. But notice what happens next. Now if I go ahead and insert Harry, notice that Harry and Hagrid share the H, the A, and then this third node. But then Harry needs to follow a different pointer to store the R and the Y. And notice the green there. It's sort of a checkmark in the data structure, a Boolean value, that's saying, yes, I have someone in my contacts named H-A-R-R-Y. And then, if we add Hermione, she shares the H, and then also the second node. But Hermione requires some new nodes altogether.
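Though the lecture stops at pictures here, one plausible way to declare such a node in C, and to insert a name, is sketched below-- the field names are assumptions, and error handling for calloc is omitted for brevity:

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>

// One way to represent a trie node, as described: an array of 26
// pointers to child nodes, plus a Boolean (the "green box") marking
// whether some name ends at this node.
typedef struct node
{
    bool is_word;
    struct node *children[26];
} node;

// Insert a name, walking (and creating as needed) one node per letter.
// Assumes the name contains only alphabetic characters; in real code,
// calloc's return value should be checked.
void insert(node *root, const char *name)
{
    node *cursor = root;
    for (int i = 0; name[i] != '\0'; i++)
    {
        int index = toupper((unsigned char) name[i]) - 'A';
        if (cursor->children[index] == NULL)
        {
            // calloc zeroes the new node, so all 26 pointers start NULL
            cursor->children[index] = calloc(1, sizeof(node));
        }
        cursor = cursor->children[index];
    }
    cursor->is_word = true;  // the green checkmark for, e.g., H-A-G-R-I-D
}
```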
But notice this key property, the reason for this sort of complexity, because this is probably the weirdest structure we've seen thus far: even if I have a billion names in my phone book, how many steps literally does it take me to find Hagrid? Someone? Feel free to chime in in the chat window, if you like. Even if I have a billion names in my contacts, how many steps does it take for me to look up and check if Hagrid is among them? BRIAN: People are saying six. DAVID MALAN: Six, H-A-G-R-I-D. If I have 2 billion names, 4 billion names, how many steps does it take? Six. And this is what we mean by constant time-- constant time, at least, with respect to the length of the names in the data structure. So what does this mean? No matter how many names you cram into a trie data structure, the number of other names does not impact how many steps it takes for me to find Harry or Hagrid or Hermione or anyone else. It is only dependent on the length of their name. And here's where we can get a little academic. If you assume that there's a finite number of characters in any human's name, real or imaginary-- maybe it's 20, or 100, or whatever; it's more than 6, but it's probably fewer than hundreds-- then you can assume that that's constant. So it might be Big O of, like, 200 characters for someone with a super long name. But that's constant. And so technically a trie gives you that Holy Grail of look up times and insertion times of Big O of 1. Because it is not dependent on n, which is the number of other names in the data structure. It is dependent only on the length of the name you're inputting. And if you assume that all names in the world are reasonably limited in length, less than some finite value, like 6, or 200, or whatever it is, then you can call that constant. And it technically is Big O of 1. That is constant time. So here is the goal of this whole day, like, trying to get to constant time. Because constant time is better than, it would seem, linear time, or logarithmic time, or anything else we've seen. But-- but-- but-- again, there's always a trade off. What price have we just paid, if you see it? Why are tries not necessarily all that? There's still a catch. Daniel? AUDIENCE: For example, if let's say you have two people in your contact list. One person was named Daniel and one person was named Danielle. And you know that Daniel is in your list. So the L would have this Boolean operator of true. But then, how would you get to Danielle, if your L was an operator and it didn't point to another L for Danielle? DAVID MALAN: Really good question, a corner case, if you will. What if someone's name is a substring of another person's name? So Daniel, D-A-N-I-E-L, and I think you're saying Danielle, D-A-N-I-E-L-L-E. So the second name is a little longer. Let me stipulate that we can solve that. I have not shown code that represents each of the nodes in this tree. Let me propose that we could continue having arrows even below. So if we were to have Daniel in this tree, we could also have Danielle by just having a couple of more nodes below Daniel and just having another green checkmark. So it is solvable in code, even though it's not obvious from the graphical representation, but absolutely a corner case. And thankfully, it is solvable in tries. What might another downside be, though, of a trie? Like, you do get Big O of 1 time. You can solve the Daniel-Danielle problem. But there's still a price being paid. Any thoughts? Yeah. How about over to Ethan, if I'm saying it right? AUDIENCE: Yeah. I'm Ethan.
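Continuing the node sketch from above, searching would mirror insertion-- and it shows both why the step count depends only on the name's length and how the Boolean cleanly handles Daniel versus Danielle:

```c
// Searching mirrors insertion: follow one pointer per letter, then
// check the Boolean. The number of steps depends only on the length
// of the name, never on how many other names are stored. And because
// each node has its own is_word flag, Daniel's final L can be marked
// true while the trie continues below it to Danielle's final E.
bool search(const node *root, const char *name)
{
    const node *cursor = root;
    for (int i = 0; name[i] != '\0'; i++)
    {
        int index = toupper((unsigned char) name[i]) - 'A';
        if (cursor->children[index] == NULL)
        {
            return false;  // no such path, so no such name
        }
        cursor = cursor->children[index];
    }
    return cursor->is_word;
}
```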
DAVID MALAN: Ethan. AUDIENCE: I think it's because it would just take a lot of memory to house all of that. And that could take a lot of time. It could slow down the system. DAVID MALAN: Exactly. Yeah. You can kind of see it from my picture alone. We only added three names to this data structure. But my God, like, there's dozens, maybe 100 plus pointers pictured here, even though they might all be null. Right? The absence of an arrow here suggests that they're null, 0x0. But even storing null, 0x0, takes up space-- typically 8 bytes of all zero bits. So this is not lacking for actual memory usage; we're just not using it very efficiently. We have spent a huge number of bits, or bytes, or however you want to measure it. Because look, even with the H's, I'm using one pointer out of 26. 25 pointers are probably initialized to null, which means I'm wasting 25 pointers. And you can imagine, the lower and lower you get in this tree, the less likely there is to be another name that shares a prefix like H-A-G-R-I. Daniel came up with a good example with Danielle. But that's not going to happen often, certainly not the lower you get in this tree. So as Ethan says, you're wasting a huge amount of memory. So yes, you're gaining constant time look ups, but my God, at what price? You might be using megabytes, gigabytes of storage space. Because, again, the most important property of an array is that it's all contiguous. And therefore, you have random access. But if you have to have every node containing an array of size 26 or anything else, you have to spend that memory over and over again. So there, too, is a trade off. While this might be theoretically ideal, theory does not necessarily mean practice. And sometimes, designing something that is textbook less efficient might actually be more efficient in the real world. And in Problem Set 5, in your own spellchecker, when you build up this dictionary that we then use to spell check very large corpuses of text, you will begin to experience some of those real world trade offs yourself. Well, we wanted to end today with a look at what else you can do with these kinds of data structures, just to give you a taste of where else you can go with this and what other kinds of problems you can solve. Again, thus far, we've looked at arrays, which are really the simplest of data structures. And they're not even structures per se. It's just contiguous blocks of memory. We then introduced linked lists, of course, where you have this one dimensional data structure that allows you to stitch together nodes in memory, giving you dynamic allocation and deallocation, inserting and deleting nodes as you want. Then we had trees, which kind of give us the best of both worlds, arrays and linked lists, but we have to spend more space and use more pointers. Then, of course, hash tables kind of merged together two of those ideas, arrays and linked lists. And that starts to work well. And indeed, that's what you'll experiment with in your own spellchecker. But then, of course, there's tries, which at first glance seem better, but not without great cost, as Ethan says. So it turns out, with all of those building blocks at your disposal, you can actually use them as lower level implementation details to solve higher level problems. And this is what are known as abstract data structures, or abstract data types. An abstract data structure is kind of a mental structure that you can imagine modeling some real world problem, and that's typically implemented with some other, lower level data structure.
So you're sort of writing code at this level. But you're thinking about what you've built ultimately at this level. And that's abstraction, taking lower level implementation details, simplifying them for the sake of discussion or problem solving higher up. So what's one such data structure? A queue is a very common abstract data structure. What is a queue? Well, those of you who grew up, say, in Britain call a line outside of a store a queue. And that's, indeed, where it gets its name. A queue is a data structure that has certain properties. So if you're standing outside of a store or a restaurant, in healthier times, waiting to get in, you're generally in a queue. But there's an important property of a queue. At least, if you live in a fair society, you'd like to think that if you are the first one in line, you are the first one to get out of line. First in, first out. It would be kind of obnoxious if you're first in line and then they start letting people in who are behind you in that queue. So a queue, if it's implemented correctly, has a property known as FIFO, first in, first out. And we humans would think of that as just a fair property. And a queue generally has two operations associated with it at least, enqueue and dequeue. Those are just conventions. You could call them add and remove, or insert and delete, whatever. But enqueue and dequeue are sort of the more common ones to say. So enqueuing means you walk up to the store and you get in line, because you have to wait. Dequeuing means they're ready to serve you or have you, and so you get out of line. That's dequeue. And again, FIFO is just a fancy acronym that describes a key property of that, which is that it's first in, first out. So how could you implement a queue then? Well, what's interesting about that data structure is that it's abstract. Right? It's more of an idea than an actual thing in code. You want to implement some kind of fair queuing system, and so you think of it as a queue. But frankly, if we're going to translate that example to code, you could imagine using an array of persons. That could implement a queue. You could use a linked list of persons. Frankly, either of those would actually work. Underneath the hood are the lower level implementation details. But what would be a problem if we translate this real world analogy, like queuing up outside of a store to get in, into code and you used an array? What would be a downside of using an array to implement a queue, in general, even though we're making a bit of a leap from real world to code suddenly, Ryan? AUDIENCE: If you're using an array, you can't really just take out the existing values. Because if you were thinking about doing this in a line, you would have to take out the first person, take out the second person. But you can't really dynamically sort of change the memory after that. DAVID MALAN: Yeah. Yeah. That's a really good point. Think about a line. Suppose that there's a line that can fit 10 people outside the Apple store. Because Apple's pretty good right now, during the health crisis, about letting people in only so many at a time. So suppose they have room for 10 people, six feet apart. That's actually a pretty apt analogy, this year more than ever. But as Ryan says, if you want to dequeue someone, then the first person in line is going to go into the store. And then the second person, you're going to dequeue them. They go into the store.
The problem with an array, it would seem, is that now you have essentially empty spaces at the beginning of the line. But you still don't have room at the end of the line for new people. Now there's an obvious real world solution there. You just say, hey, everyone, would you mind taking a few steps forward? But that's inefficient. Not so much in the human world, where you've got to get into the store anyway. But in code, that's copying of values. You have to move, like, eight values two places over if two people were just let into the store. So now your dequeue operation is Big O of n. And that doesn't feel quite ideal. And we can do better than that if we're a little clever with some local variables and such. But one challenge of a queue, certainly, is just how well we can implement it using an array. An array is limited, too. Because it would kind of be obnoxious if you get to the Apple store, there's already 10 people in line, and they don't let you get in line. They say, sorry, we're all full for today, when they're obviously not, because eventually there'll be more room in the queue. A linked list would allow you to keep appending more and more people. And even if the line outside the store gets crazy long, at least the linked list allows you to service all of the customers who are showing up over time. An array of fixed size would make that harder. And, again, you could allocate a bigger array. But then you're going to have to ask all the customers, hey, could everyone come over here? No. Go back over there. I mean, you're constantly moving humans, or values in memory, back and forth. So that is only to say that this real world notion of a queue is very commonly used even in the computer world to represent certain ideas. For instance, the printer queue: when you send something to the printer, especially on a campus or in a company, there's a queue. And ideally, the first person who printed is the first one who gets their printouts thereafter. Queues are also used in software. But there are other abstract data types out there besides queues. One of them is called a stack. So a stack is a data structure that can also be implemented underneath the hood using arrays or linked lists or, heck, maybe something else. But stacks have a different property. It's last in, first out. Last in, first out. So if you think about the trays in the cafeteria, in healthier times when everyone was on campus, you'll recall, of course, that trays tend to get stacked like this. And the last tray to go on top of the stack is the first one to come out. If you go to a clothing store, or your own closet, if you don't hang things on hangers or put them in drawers, but kind of stack them, like all of these sweaters here, this is a stack of sweaters. And how do I get at a sweater I want? Well, the easiest way to do it is with last in, first out. So I constantly take the black sweater, the black sweater. But if I've stored all of my sweaters in this stack, you may never get to the sort of lower level ones, like the red or the blue sweater, because, again, of this data structure. So LIFO, last in, first out, is, in fact, the property used to characterize stacks. And stacks are useful or not useful, depending on the real world context. But even within computing, we'll see applications over time where stacks indeed come into play. And the two operations that stacks support are generally called push and pop.
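To make both abstractions concrete, here is a minimal C sketch of each over a fixed-size array-- the type and field names are assumptions, and the queue uses exactly the kind of "clever local variables" trick alluded to above, wrapping an index around rather than shifting everyone forward:

```c
#include <stdbool.h>

#define CAPACITY 10  // e.g., 10 spots in line outside the store

// A queue over a fixed-size array. The trick is the bookkeeping pair,
// front and size: dequeuing advances front (wrapping around) instead
// of shifting everyone forward, so both operations are O(1), not O(n).
typedef struct
{
    int people[CAPACITY];
    int front;  // index of the first person in line
    int size;   // how many people are currently in line
} queue;

bool enqueue(queue *q, int person)
{
    if (q->size == CAPACITY)
    {
        return false;  // sorry, the line is full
    }
    q->people[(q->front + q->size) % CAPACITY] = person;
    q->size++;
    return true;
}

bool dequeue(queue *q, int *person)
{
    if (q->size == 0)
    {
        return false;  // no one in line
    }
    *person = q->people[q->front];
    q->front = (q->front + 1) % CAPACITY;  // wrap around; no shifting
    q->size--;
    return true;
}

// A stack over the same idea: push and pop both happen at the top,
// so the last value in is the first value out (LIFO).
typedef struct
{
    int sweaters[CAPACITY];
    int top;  // number of values currently stacked
} stack;

bool push(stack *s, int sweater)
{
    if (s->top == CAPACITY)
    {
        return false;  // stack is full
    }
    s->sweaters[s->top++] = sweater;
    return true;
}

bool pop(stack *s, int *sweater)
{
    if (s->top == 0)
    {
        return false;  // stack is empty
    }
    *sweater = s->sweaters[--s->top];
    return true;
}
```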
It's the same thing as add or remove or insert or delete, but the terms of art are generally push and pop, where this is me popping a value off of the stack, and this is me pushing a value onto the stack. But, again, it's last in, first out, otherwise known as LIFO. And then there's this other data structure that actually has a very real world analog, known as a dictionary. A dictionary is an abstract data type, which means you can implement it with arrays, or linked lists, or hash tables, or tries, or whatever else, that allows you to associate keys with values. And the best analog here is indeed in the real world. What is a dictionary, like an old school dictionary that's actually printed on paper in book form? What is inside that book? A whole bunch of keys, a whole bunch of boldfaced words, like apple and banana and so forth, each of which has a definition, otherwise known as a value. And they're often alphabetized to make it easier for you to find things so that you can look things up more quickly. But a dictionary is an abstract data type that associates keys with values. And you look up the values by way of their keys, just like you look up a word's definition by way of the word itself. And dictionaries are actually kind of all around us, too. You don't think of them in these terms, probably. But if you've ever been to Sweet Green, for instance, in New Haven or in Cambridge or elsewhere, this is a salad place where nowadays, especially, you can order in advance online or on an app and then go into the store and pick up your food from a shelf. But the shelf, the way they do it here in Cambridge and in other cities, is they actually have letters of the alphabet on the shelves, A, B, C, D, E, F, all the way through Z. The idea being that if I go in to pick up my salad, it's probably in the D section, if Brian goes in to pick up his, it's in the B section, and so forth. Now here, too, you can imagine perverse corner cases where this data structure, this dictionary whereby letters of the alphabet map to values, which are people's salads, is not necessarily foolproof. Can you think of a perverse corner case where Sweet Green's very wonderful, methodical system actually breaks down? Can you think of a limitation here, even if you've never been to Sweet Green or never eaten a salad? What could break down with this system of going into a store and picking something up based on your name? Any thoughts? BRIAN: A few people say there might be a problem if two people have the same name. DAVID MALAN: Yeah. If two people have the same names, you start to stack things up. So literally, Sweet Green will start stacking one salad on top of the other. So there is actually an interesting incarnation of one data type being built on top of yet another data type. So, again, all of these are sort of like custom Scratch pieces, if you will, that we're constantly sort of reassembling into more interesting and powerful ideas. But at some point, if there's a lot of B names, or D names, or any letter of the alphabet, there's surely a finite height to this shelf. So it's kind of as though Sweet Green has implemented their dictionary using stacks with arrays. Because arrays are fixed size, there's surely only so many inches of space here vertically. So you can see a real world limitation. So what does Sweet Green do if that happens? They probably just kind of cheat and put the D's in the C section, or the D's in the E section. Like, who really cares in the real world?
Your eyes are probably going to skim left and right. But algorithmically, that is slowing things down. And in the worst case, if you're really late to pick up your salad, or if Sweet Green is really popular and there's a huge number of salads on the shelf, your name might be Albus, but your salad might end up way over here in the Z section if they're just out of room. And so that, too, is a valid algorithmic decision, to just make room somewhere else. But, again, trade offs between time and space. And so we thought we'd end on a note, thanks to some friends of ours at another institution who made a wonderful visualization that distinguishes these notions of stacks versus queues. A stack and, again, a queue are these abstract data types that can be implemented in different ways. They have different properties, respectively LIFO and FIFO. And here, for instance, is a final look, in our final moments together here today, at how these ideas manifest themselves, perhaps in the real world, not unlike this stack of sweaters here. [VIDEO PLAYBACK] [MUSIC PLAYING] - Once upon a time, there was a guy named Jack. When it came to making friends, Jack did not have the knack. So Jack went to talk to the most popular guy he knew. He went up to Lou and asked, "What do I do?" Lou saw that his friend was really distressed. "Well," Lou began, "just look how you're dressed. Don't you have any clothes with a different look?" "Yes," said Jack. "I sure do. Come to my house, and I'll show them to you." So they went off to Jack's. And Jack showed Lou the box where he kept all his shirts and his pants and his socks. Lou said, "I see you have all your clothes in a pile. Why don't you wear some others once in a while?" Jack said, "Well, when I remove clothes and socks, I wash them and put them away in the box. Then comes the next morning, and up I hop. I go to the box and get my clothes off the top." Lou quickly realized the problem with Jack. He kept clothes, CDs, and books in a stack. When he reached for something to read or to wear, he chose the top book or underwear. Then, when he was done, he would put it right back. Back it would go, on top of the stack. "I know the solution!" said a triumphant Lou. "You need to learn to start using a queue." Lou took Jack's clothes and hung them in a closet. And when he had emptied the box, he just tossed it. Then he said, "Now, Jack, at the end of the day, put your clothes on the left when you put them away. Then tomorrow morning when you see the sunshine, get your clothes from the right, from the end of the line. Don't you see?" said Lou, "it will be so nice. You'll wear everything once before you wear something twice." And with everything in queues in his closet and shelf, Jack started to feel quite sure of himself, all thanks to Lou and his wonderful queue. [END PLAYBACK] DAVID MALAN: All right. That's it for CS50. We will see you next time. [MUSIC PLAYING]