SPEAKER 1: All right, so this is CS50 This is the end of week five. And recall that last time we started looking at the fancier data structures that started to solve problems, that started to introduce new problems, but the key to this was the sort of threading that we started to do from node to node. So this of course is a singly linked list. And by singly linked, I mean there's just one thread between each of those nodes. Turns out you can do fancier things like doubly linked lists whereby you have an arrow going in both directions, which can help with certain efficiencies. But this solved the problem? What problem did this solve? Why did we care on Monday? Why, in theory, did we care on Monday? What does it do? AUDIENCE: We can dynamically resize it. SPEAKER 1: OK, so we can dynamically resize it. Well done both of you. So you can dynamically resize this data structure, whereas an array, recall, you have to know a priori how much space you want and if you need a little more space, you're kind of out of luck. You have to create a whole new array. You have to move all of your data from one to the other, eventually free the old array if you can, and then proceed. Which just feels very costly and very inefficient, and indeed it can be. But this isn't all good. We pay a price, what was one of the more obvious prices we pay by using a linked list? AUDIENCE: We have to use double space for each one. SPEAKER 1: Yeah, so we need at least twice as much space. In fact, I realized this picture's even a little misleading, because on CS50 IDE in a lot of modern computers, a pointer or an address is not in fact four bytes. It's very often these days eight bytes, which means the bottom most rectangles there in reality are kind of twice as big as what I've drawn, which means you're using three times as much space as we might have otherwise. Now at the same time, we're still talking bytes, right? We're not necessarily talking megabytes or gigabytes, unless these data structures get big. And so today we start to consider how we might explore data more efficiently if in fact the data gets bigger. But let's try to canonicalize the operations first that you can do on these kinds of data structures. So something like a linked list generally supports operations like delete, insert, and search. And what do I mean by that? That just means that usually, if people are using linked list, they or someone else has implemented functions like delete, insert, and search, so you can actually do something useful with the data structure. So let's take a quick look at how we might implement some code for a linked list as follows. So this is just some C code, not even a complete program that I really quickly whipped up. It's not online in the distribution code, because it won't actually run. But notice I've just with a comment said, dot dot dot, there's something there, dot dot dot, something there. And let's just look at what the juicy parts are. So on line three, recall that this is now we proposed declaring a node last time, one of those rectangular objects. It has an int that we'll call N, but we could call it anything, and then a struct node star called next. And just to be clear, that second line, on line six, what is that? What is it doing for us? Because it certainly looks more cryptic than our usual variables. AUDIENCE: It makes it move over one. SPEAKER 1: It makes it move over one. And to be more precise, it will store the address of the node that's meant to be semantically next to it, right? 
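For reference, here is a minimal sketch of the node declaration being discussed; the field names n and next mirror the lecture's description, but this is an illustration rather than the exact distribution code.

```c
// A singly linked list node, as described above: one int plus the address
// of the next node in the list (NULL at the end of the list).
typedef struct node
{
    int n;
    struct node *next;
}
node;
```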
So it's not going to necessarily move anything. It's just going to store a value, which is going to be the address of some other node, and that's why we've said struct node star, the star denoting a pointer or an address. OK, so now if you assume that we have this N available to us, and let's assume that someone else has inserted a whole bunch of integers into a linked list. And that linked list is pointed to by some pointer, a variable called list that's passed in here as a parameter, how do I go about, on line 14, implementing search? In other words, if I am implementing a function whose purpose in life is to take an int and then the beginning of a linked list, that is a pointer to the linked list. Like first-- I think David was our volunteer on Monday-- he was pointing at the whole linked list, so it's as though we're passing David in as our argument here. How do we go about traversing this list? Well, it turns out that even though pointers are relatively new now to us, we can do this relatively straightforwardly. I'm going to go ahead and declare a temporary variable that by convention is just going to be called pointer, or PTR, but you could call it anything you want. And I'm going to initialize it to the start of the list. So you can kind of think of this as me, the teacher, the other day, kind of pointing at someone among our human volunteers. So I'm a temporary variable that's just pointing at the same thing that our coincidentally named volunteer David was also pointing at. Now while pointer is not null, because recall that null is some special sentinel value that demarcates the end of the list, so while I'm not pointing at the ground like our last volunteer was, let's go ahead and do the following. If pointer-- and now I kind of want to do what we did with the student structure-- if pointer dot next equals-- rather, if pointer dot N equals equals the variable N, the argument that's been passed in, then I want to go ahead and say return true. I have found the number N inside of one of the nodes of my linked list. But the dot no longer works in this context, because pointer, PTR, is indeed a pointer, an address, so we can finally, wonderfully use a piece of syntax that kind of makes intuitive sense and actually use an arrow here, which means go from that address to the integer therein. So it's very similar in spirit to the dot operator, but because pointer is a pointer and not an actual struct itself, we just use the arrow. So if the current node that I, the temporary variable, am pointing at is not N, what do I want to do? Well, with my human volunteers that we had here the other day, if my first human is not the one I want, and maybe the second human isn't the one I want, and the third, I need to keep physically moving. Like how do I step through a list? When we had an array, you just did like I plus plus. But in this case, it suffices to do pointer, gets, pointer next. In other words, the next field is like all of the left hands that our human volunteers on Monday were using to point at some other node. Those were their next neighbors. So if I want to step through this list, I can't just do I plus plus anymore, I instead have to say that pointer is going to equal whatever the next field is, the next field is, the next field is, following all of those left hands that we had on stage pointing to some subsequent values. And if I get through that whole iteration and finally hit null without having found N yet, I just return false.
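Here is a sketch of that search function, assuming the node type from the previous snippet and <stdbool.h> for the bool type; it follows the walkthrough above rather than the literal slide code.

```c
#include <stdbool.h>
#include <stddef.h>

// Returns true if n appears somewhere in the list pointed to by list.
bool search(int n, node *list)
{
    // A temporary pointer, starting wherever "David" was pointing
    node *ptr = list;

    // While we're not yet pointing at the ground (NULL)...
    while (ptr != NULL)
    {
        if (ptr->n == n)
        {
            return true;       // found it
        }
        ptr = ptr->next;       // follow the left hand to the next node
    }

    // Fell off the end without finding n
    return false;
}
```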
So again, all that we're doing here, as per the picture a moment ago, is starting by pointing at the beginning of the list presumably. And then I check, is the value I'm looking for equal to nine? If so, I return true and I'm done. If not, I update my hand, AKA pointer, to point at the next arrow's location, and then the next arrow's location, and the next. I'm simply walking through this array. So again, who cares? Like what is this an ingredient for? Well, recall that we introduced the notion of a stack, which is an abstract data type insofar as it's not a C thing, it's not a CS50 thing, it's an abstract idea, this idea of stacking things on top of one another that can be implemented in bunches of different ways. And one way we proposed was with an array, or with a linked list. And it turns out that canonically, a stack supports at least two operations. And the buzz words are push, to push something onto the stack, like a new tray in the dining hall, or pop, which means to remove the topmost tray from the stack in the dining hall, and then maybe some other operations as well. So how might we define the structure that we're now calling a stack? Well, we have all of the requisite syntax at our disposal in C. I say, give me a type definition of a struct inside of a stack, I'm going to say is an array, of a whole bunch of numbers and then size. So in other words, if I want to implement this in code, let me go and just kind of draw what this is saying. So this is saying, give me a structure that's got an array, and I don't know what capacity is, it's apparently some constant that I've defined elsewhere, and that's fine. But suppose it's just one, two, three, four, five. So capacity is 5. This element inside of my structure will be called numbers. And then I need one other variable apparently called size that initially I'm going to stipulate is initialized to zero. If there's nothing in the stack, size is zero, and it's garbage values in numbers. I have no idea what's in there just yet. So if I want to push something onto the stack, suppose I call the function push, and I say push 50, like the number 50, where would you propose I draw it in this array? There's five different possible answers. Where do you want to push the number 50? If the goal here, again, call the function push, pass in an argument of 50, where do I put it? Five possible-- 20% chance of guessing correctly. Yes? AUDIENCE: Far right. SPEAKER 1: Far right. There is now a 25% chance of guessing correctly. So that would actually be fine. By convention, I'll say with an array, we would generally start at the left, but we could certainly start at the right. So the spoiler here would be I'm probably going to draw it on the left, just like in a normal array where I start going left to right. But if you can flip the arithmetic, fine. It's just not conventional. OK, I need to make one more change though. Now that I've pushed something onto the stack, what's next? All right, I have to increment the size. So let me go ahead and just update this, which was zero. And instead now, I'm going to put in the value one. And now suppose I push another number onto the stack, like 51. Well, I have to make one more change, which is up to the size two. And then suppose I push one more number onto the stack like 61, now I need to update the size one more time, and get the value 3 as the size. And now suppose I call pop. Now pop, by convention, does not take an argument. 
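A sketch of the fixed-capacity stack and its push operation, as drawn above; CAPACITY of 5 and the out-of-space check are assumptions, and the names numbers and size follow the lecture.

```c
#include <stdbool.h>

#define CAPACITY 5

typedef struct
{
    int numbers[CAPACITY];   // the hard-coded array being drawn on the board
    int size;                // how many values are currently on the stack
}
stack;

// Push n onto the top of the stack; returns false if we're out of space.
bool push(stack *s, int n)
{
    if (s->size == CAPACITY)
    {
        return false;
    }
    s->numbers[s->size] = n;   // fill left to right, by convention
    s->size++;
    return true;
}
```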
With a stack, the whole point of the tray metaphor is that you don't have discretion to go get that tray, all you can do is pop the topmost one from the stack, just because. That's what this data structure does. So by that logic if I say pop, what comes off? So 61. So what really is the computer going to do in memory? What does my code have to do? What would you propose we change on the screen? What should change? Sorry? So we get rid of 61. So I can definitely do that. And I can get rid of 61. And then what other change needs to happen? Size probably has to go back to two. And so that's fine. But wait a minute, size a moment ago was three. Let's just do a quick sanity check. How did we know that we wanted to get rid of 61? Because we're popping. And so I have this second property size. Wait a minute, I'm thinking back to week two when we started talking about arrays, where this was location zero, this was location one, this was location two, this is location three, four, it looks like the relationship between size and the element that I want to remove from the array appears to just be what? Size minus one. And so that's how as humans we know 61 comes first. How's the computer going to know? When your code, where you probably want to do size minus one, so three minus one is two, and that means we want to get rid of 61. And then we can indeed update the size so that size now goes from three to just two. And just to be pedantic, I'm going to propose that I'm done, right? You proposed intuitively correctly I should get rid of 61. But haven't I kind of sort of gotten rid of 61? I've effectively forgotten that it's actually there. And think back to PSET4, if you've read the article about forensics, the PDF that we had you guys read, or you will read this week for PSET4. Recall that this is actually germane to the whole idea of computer forensics. What a computer generally does is it just forgets where something is, but it doesn't go in and like try to scratch it out or override those bits with zeros and ones or some other random pattern unless you yourself do so deliberately. So your intuition was right, let's get rid of 61. But in reality, we don't have to bother. We just need to forget that it's there by changing our size. Now there's a problem with this stack. If I keep pushing things onto the stack, what's obviously going to happen in just a few moments time? We're going to run out of space. And what do we do? We're kind of screwed. This implementation does not let us resize the array, because using this syntax, if you think back to week two, once you've declared the size of an array, we have not seen a mechanism yet where you can change the size of the array. And indeed C does not have that feature. If you say give me five Nths, call them numbers, that's all you're going to get it. So we do now as of Monday, have the ability to express a solution though, we just need to tweak the definition of our stack to not be some hard-coded array, but just to store an address. Now why is this? Now we just have to be comfortable with the fact that when my program runs, I'm presumably going to have to ask the human, how many numbers do you want to store? So the input has to come from somewhere. But once I know that number, then I can just use what function to give me a chunk of memory? I can use malloc. And I can say any number of bytes I want back for these Nths. And all I have to store in the numbers variable here inside of this struct should be what? 
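And a sketch of pop, plus the resizable variant of the struct just proposed; the name dynamic_stack is hypothetical, and the code assumes the stack type from the previous snippet.

```c
#include <stdbool.h>

// Pop the topmost value into *n; returns false if the stack is empty.
bool pop(stack *s, int *n)
{
    if (s->size == 0)
    {
        return false;
    }
    *n = s->numbers[s->size - 1];   // size minus one is the top of the stack
    s->size--;                      // just "forget" the value; no need to erase it
    return true;
}

// The resizable version: store an address instead of a hard-coded array,
// then malloc (or later realloc) however much memory is actually needed.
typedef struct
{
    int *numbers;   // address of the first byte of a malloc'd chunk
    int size;
}
dynamic_stack;
```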
What actually goes into the numbers in this scenario? Yeah, a pointer to the first byte of that chunk of memory, or more specifically, the address of the first of those bytes. Doesn't matter if it's one byte or a billion bytes, I just need to care about the first. Because what malloc guarantees and my operating system guarantees, is that the chunk of memory I get, it's going to be contiguous. There's not going to be gaps. So if I've asked for 50 bytes or 1,000 bytes, they're all going to be back to back to back. And so long as I remember how big, how much I asked for, all I need to know is the first such address. So now we have the ability in code. Albeit, it's going to take us more time to write this up, we could now reallocate that memory by just storing a different address there if we want a bigger or even a smaller chunk of memory. So here to a trade off. Now we get dynamism. We still have contiguousness I'm claiming. Because malloc will give us a contiguous chunk of memory. But this is going to be a pain in the neck for us, the programmer, to actually code up. It's just more work. We need code akin to what I was banging out just a moment ago. Very doable, but it adds complexity. And so developer time, programmer time is yet another resource that we might need to spend some time to get new features. And then of course there is a queue. We won't go into this one in much detail. But it's very similar in spirit. I could implement a queue, and its corresponding operations, enqueue or dequeue, like add or remove, it's just a fancier way of saying it, enqueue or dequeue, as follows. I can just give myself a struct that again has a number's array, that again has a size, but why do I now need to keep track of the front of a queue? I didn't need to know the front of my stack. Well, if I again for a queue-- let's just hard code it as having like five integers in here potentially. So this is zero, one, two, three, four. This is going to be called numbers again. And this will be called size. Why is it not sufficient to have just size? Well, let's push those same numbers on. So I pushed-- I enqueued, or pushed. Now I'll enqueue 50, and then 51, and then 61, and dot dot dot. So that's enqueue. I enqueued 50, then 51, then 61. And that looks identical to a stack thus far, except I do need to make one change. I need to update this size, so I go from zero to one to two to three now. How do I dequeue? What happens with dequeue? Who should come off this list first if it's the line at the Apple Store? So 50. So it's kind of trickier this time. Whereas last time it was super easy to just do size minus one, I get to the end of my array effectively where the numbers are, it removes 61. But I don't want to remove 61. I want to take 50, who was there at 5:00 AM to line up for the new iPhone or whatnot. And so to get rid of 50, I can't just do this, right? I can cross out 50. But we just said we don't have to be so anal as to scratch out or hide the data. We can just forget where it is. But if I change my size now to two, is this sufficient information to know what is going on in my queue? Not really. Like my size is two, but where does the queue begin, especially if I still have those same numbers in memory. 50, 51, 61. So I need to remember now where the front is. And so as I proposed up there, we'll have just called Nth front, whose initial value should have been what? Zero, just the beginning of the list. But now in addition to decrementing the size, we just increment the front. Now here's another problem. 
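Here is a sketch of that queue, including the wrap-around arithmetic with the modulo operator that comes up in a moment; CAPACITY and the full/empty checks are assumptions, and front, numbers, and size follow the lecture.

```c
#include <stdbool.h>

#define CAPACITY 5

typedef struct
{
    int front;               // index of whoever is at the head of the line
    int numbers[CAPACITY];
    int size;
}
queue;

// Add n to the back of the line; returns false if the queue is full.
bool enqueue(queue *q, int n)
{
    if (q->size == CAPACITY)
    {
        return false;
    }
    // The back is front + size slots along, wrapping around circularly
    q->numbers[(q->front + q->size) % CAPACITY] = n;
    q->size++;
    return true;
}

// Remove whoever is at the front into *n; returns false if the queue is empty.
bool dequeue(queue *q, int *n)
{
    if (q->size == 0)
    {
        return false;
    }
    *n = q->numbers[q->front];
    q->front = (q->front + 1) % CAPACITY;   // wrap around instead of running out
    q->size--;
    return true;
}
```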
So once I keep going. Suppose this is the number of like 121, 124, and then, dammit, I'm out of space. But wait a minute, I'm not. So at this point in the story, suppose that the size is one, two, three, four, so suppose that the size is four, the front is one, so 51 is at the front. I want to put another number here, but, dammit, I'm out of space. But I'm not really, right? Where could I put some additional value, like 171? Yeah, I could just kind of go back over there, right? And then cross out the 50, or just overwrite it with 171. And if you're wondering why our numbers got so random, these are commonly taken computer science courses at Harvard after CS50. But that was a good optimization, because now I'm not wasting space. I still have to remember how big this thing is total. It's five total. Because I don't want to start overwriting 51. So now I am still out of space, so the same problem as before. But you can see how now in your code, you probably have to write a little more complexity to make that happen. And in fact, what operator in C probably lets you magically do this the circularity? Yeah the modulo operator, the percent sign. So what's kind of cool about a queue, even though we keep drawing arrays as these like straight lines, if you kind of think about this as curving around as a circle, then just intuitively it kind of works mentally I think a little more cleanly. You would still have to implement that mental model in code. So not that hard, ultimately, to implement, but we still lose the size-- rather, the ability to resize, unless we do this. We have to get rid of the array, we replace it with a single pointer, and then somewhere in my code I've got a call what function to actually create the array called numbers? Malloc, or some similar function, exactly. Any questions on stacks or queues. Yeah? Good question. What modulo would you use here. So generally, when using mod, you would do it with the size of the whole data structure. So something like five or capacity, if it's constant, is probably involved. But just doing modulo five probably isn't sufficient, because we need to know do we wrap around here or here or here. So you're probably also going to want to involve the size of the thing, or the front variable as well. So it's just this relatively simple arithmetic expression, but modulo would be the key ingredient. So short film if you will. An animation that some folks at another university put together that we've adapted for this discussion. It involves Jack learning the facts about queues and stats. FILM: Once upon a time, there was a guy named Jack. When it came to making friends, Jack did not have a knack. So Jack went to talk to the most popular guy he knew. He went to Lou and asked, What do I do? Lou saw that his friend was really distressed. Well, he began, just look how you're dressed. Don't you have any clothes with a different look? Yes, said Jack. I sure do. Come to my house and I'll show them to you. So they went off to Jack's. And Jack showed Lou the box where he kept all his shirts, and his pants, and his socks. Lou said, I see you have all your clothes in a pile. Why don't you wear some others once in awhile? Jack said, well, when I remove clothes and socks, I wash them and put them away in the box. Then comes the next morning, and up I hop. I go to the box and get my clothes off the top. Lou quickly realized the problem with Jack. He kept clothes, CD's, and books in the stack. 
When he reached for something to read or to wear, he'd choose the top book or underwear. Then when he was done, he would put it right back. Back it would go, on top of the stack. I know the solution, said a triumphant Loud. You need to learn to start using a queue. Lou took Jack's clothes and hung them in the closet. And when he had emptied the box, he just tossed it. Then he said, now Jack, at the end of the day, put your clothes on the left when you put them away. Then tomorrow morning when you see the sunshine, get your clothes on the right, from the end of the line. Don't you see? said Lou. It will be so nice. You'll wear everything once before you wear something twice. And with everything in queues in his closet and shelf, Jack started to feel quite sure of himself. All thanks to Lou and his wonderful queue. SPEAKER 1: All right, it's adorable. So what has been really going on underneath the hood now? That we have pointers, that we have malloc, that we have the ability to create chunks of memory for ourselves dynamically. So this is a picture we glimpsed just the other day. We didn't really dwell on it, but this picture has been going on underneath the hood for weeks now. And so this represents, just a rectangle that we've drawn, your computer's memory. And maybe your computer, or CS50 ID, has a gigabyte of memory or RAM or two gigabytes or four. It doesn't really matter. Your operating system Windows or Mac OS or Linux, essentially allows your program to think that it has access to the entirety of your computer's memory, even though you might be running multiple programs at once. So in reality, that doesn't really work. But it's kind of an illusion given to all of your programs. So if you had two gigs of RAM, this is how the computer might think of it. Now coincidentally, one of these things, one of these segments of memory, is called a stack. And indeed any time thus far in writing code that you have called a function, for instance main. Recall that any time I've drawn computer's memory, I always draw sort of half of a rectangle here and don't bother talking about what's above. Because when main is called, I claim that you get this sliver of memory that goes down here. And if main called a function like swap, well swap goes here. And it turns out, that's where it's ending up. On something called a stack inside of your computer's memory. Now at the end of the day, this is just addresses. It's like byte zero, byte one, byte 2 billion. But if you think about it as this rectangular object, all we're doing every time we call a function is layering a new slice of memory. We're giving that function a slice of its own memory to work with. And recall now that this is important. Because if we do have something like swap and two local variables like A and B and we change those values from one and two to two and one, recall that when swap returns, it's as though this slice of memory is just gone. In reality, it's still there forensically. And something's still actually there. But conceptually, it's as though it's completely gone. And so main doesn't know any of the work that was done in that swap function, unless it's actually passed in those arguments by pointer or by reference. Now, the fundamental solution to that problem with swap is passing things in by address. But it turns out, too, what's been going on above that part of the rectangle all this time is yet there's more memory up there. 
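As a reminder of the swap example being referred to, here is a sketch of the pointer-based version; because the caller passes addresses, the changes survive even after swap's slice of the stack is effectively gone.

```c
// Swap the values at addresses a and b, so the caller's variables change too.
void swap(int *a, int *b)
{
    int tmp = *a;
    *a = *b;
    *b = tmp;
}
```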
And when you dynamically allocate memory, whether it's inside of GetString, which we've been doing for you in the CS50 library, or if you guys call malloc and ask the operating system for a chunk of memory, it doesn't come from the stack. It comes from another place in your computer's memory that's called the heap. And that's not any different. It's the same RAM. It's the same memory. It's just the RAM that's up there instead of down here. And so what does that mean? Well, your computer has a finite amount of memory, and the stack is growing up, so to speak, and the heap, according to this arrow, is growing down. In other words, every time you call malloc, you're being given a slice of memory from above, then maybe a little lower, then a little lower. Every time you call malloc, the heap, its usage, is kind of growing, growing closer and closer to what? The stack. So does this seem like a good idea? I mean, it's not really clear what else you can do if you only have a finite amount of memory. But this is surely bad. Those two arrows are on a crash course for one another. And it turns out that bad guys, folks who are particularly good with programming and trying to hack into computers, can exploit this reality. In fact, let's consider a little snippet. So this is an example you can read about in more detail on Wikipedia. We'll point you at the article if curious. But there's an attack generally known as buffer overflow that has existed for as long as humans have had the ability to manipulate computers' memory, especially in C. So this is a very arbitrary program, but let's read it from the bottom up. Main takes int argc and char star argv. So it's a program that takes command line arguments. And all main does apparently is call a function, call it F for simplicity. And it passes in what? ArgV of one. So it passes into F whatever the word is that the user typed at the prompt after the program's name. So much like Caesar or Vigenere, which you might recall doing with argV. So what is F? F takes in a string as its sole argument, AKA a char star, same thing as a string. And it's called, arbitrarily, bar in this example. And then char c bracket 12, just in layman's terms, what is char c bracket 12 doing for us? What's it do? Allocating memory, specifically 12 bytes for 12 chars. Exactly. And then the last line, strncpy, you've probably not seen. This is a string copy function whose purpose in life is to copy its second argument into its first argument, but only up to a certain number of bytes. So the third argument says, how many bytes should you copy? The length of bar, whatever the user typed in. And the contents of bar, that string, are copied into the memory pointed at by c. So this seems kind of stupid, and it is. It's a contrived example, but it's representative of a class of attack vectors, a way of attacking a program. All is fine and good if the user types in a word that's 11 characters or fewer, plus the backslash zero. What if the user types in more than 11 or 12 or 20 or 50 characters? What's this program going to do? Potentially seg fault. It's going to blindly copy everything in bar up to its length, which is literally everything in bar, into the address pointed at by c. But c has only preemptively been given 12 bytes. And there's no additional check. There are no if conditions. There's no error checking here. And so what this program is going to do is just blindly copy one thing to the other. And so if we draw this as a picture, here's just a sliver of the memory space.
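Here is a reconstruction of the snippet being read above, adapted from the Wikipedia buffer-overflow example the lecture cites; the argc check in main is an addition for safety, and everything else follows the description.

```c
#include <string.h>

void f(char *bar)
{
    char c[12];                        // room for only 12 chars

    // Blindly copy strlen(bar) bytes into c, however long bar actually is:
    // no if condition, no error checking, no bounds check at all.
    strncpy(c, bar, strlen(bar));
}

int main(int argc, char *argv[])
{
    if (argc > 1)                      // added so argv[1] actually exists
    {
        f(argv[1]);                    // whatever the user typed at the prompt
    }
}
```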
So notice at the bottom, we have the local variable bar. So that pointer that's going to store-- rather that local argument that's going to store the string bar. And then notice just above it in a stack, because every time you ask for memory on the stack, it goes a little bit above it pictorially, notice that we've got 12 bytes there. The top left one is C bracket zero and the bottom right one is C bracket 11. That's just how the computers going to lay it out. So just intuitively, if bar has more than 12 characters in total, including the backslash zero, where is the 12 or the C bracket 12 going to go? Or rather where is the 12th character or the 13th character, the hundredth character going to end up in the picture? Above or below? Right, because even though the stack itself grows upward, once you've put stuff in it, it for design reasons, puts the memory from top to bottom. So if you've got more than 12 bytes, you're going to start to overwrite bar. Now that's a bug, but it's not really a big deal. But it is a big deal, because there's more stuff going on in memory. So here's how we might put hello, to be clear. If I typed in hello at the prompt. H-E-L-L-O backslash zero, ends up within those 12 bytes, and we're super safe. All is well. But if I type something longer, potentially it's going to creep into bar space. But worse yet, it turns out all this time, even though we've never talked about it, the stack is used for other stuff. It's not just local variables. C is a very low level language. And it sort of secretly uses the stack also to remember when a function is called, what the address is of the previous function, so it can jump back to that function. So when main calls swap, among the things pushed onto the stack are not just swaps local variables, or its arguments, also secretly pushed onto the stack as represented by the red slice here, is the address of main physically in your computer's memory, so that when swap is done, the computer knows I need to go back to main and finish executing the main function. So this is dangerous now, because if the user types in well more than hello, such that the user's input clobbers or overwrites that red section, logically if the computer's just going to blindly assume that the bytes in that red slice are the address to which it should return, what if the adversary is smart enough or lucky enough to put a sequence of bytes there that looks like an address, but it's the address of code that he or she wants the computer to execute instead of main? In other words, if what the user is typing at the prompt, isn't just something innocuous like hello, but it's actually code that's equivalent to delete all this user's files? Or email their password to me? Or start logging their keystrokes, right? There is a way, let's stipulate today, that they could type in not just hello world or their name, they could essentially pass in code, zeros and ones, that the computer mistakes for both code and an address. So albeit somewhat abstractly, if the user types in enough adversarial code that we'll generalize here as A. A is attack or adversaries. So just bad stuff. We don't care about the numbers or the zeros or ones today, such that you end up overwriting that red section, notice that sequence of bytes. O 835 C zero eight zero. And now as Wikipedia's article here has proposed, if you now actually start labeling the bytes in your computer's memory, what the Wikipedia article is proposing is that, what if the address of that top left byte is 80 C 0 3508. 
In other words, if the bad guy is smart enough with his or her code to actually put a number here that corresponds to the address of the code he or she injected into the computer, you can trick the computer into doing anything. Removing files, emailing things, sniffing your traffic, literally anything could be injected into the computer. And so a buffer overflow attack at its core is just a stupid, stupid overriding of an array that didn't have its boundaries checked. And this is what is super dangerous and simultaneously super powerful in C is that we do indeed have access to anywhere in the memory. It's up to us, the programmers, who write the original code to check the darn length of any arrays that we're manipulating. So to be clear, what's the fix? If we roll back to this code, I shouldn't just change the length of bar, what else should I be checking? What else should I be doing to prevent this attack entirely? I don't want to just blindly say that you should copy as many bytes as is the length of bar. I want to say, copy as many bytes as are in bar up to the allocated memory, or 12 maximally. So I need some kind of if condition that does check the length of bar, but if it exceeds 12, we just hard code 12 as the maximum possible distance. Otherwise the so-called buffer overflow attack can happen. At the bottom of those slides, if you're curious to read more is the actual original article if you'd like to take a look. But now, among the prices paid here was inefficiencies. So that was a quick low level look at what problems can arise now that we have access to computer's memory. But another problem we already stumbled on Monday was just the inefficiency of a linked list. We are back to linear time. We no longer have a contiguous array. We don't have random access. We can't use square bracket notation. We literally have to use a while loop like the one I wrote a moment ago. But on Monday, we claimed that we can creep back into the realm of efficiency achieving something that's logarithmic maybe, or best yet, maybe even something that's so-called constant time. So how can we do that by using these new tools, these addresses, these pointers, and threading things of our own? Well, suppose that here, these are a bunch of numbers that we want to store in a data structure and search efficiently. We can absolutely rewind to week two, throw these into an array, and search them using binary search. Divide and conquer. And in fact you wrote binary search in PSET3, where you implemented the find program. But you know what. There's kind of a more clever way of doing this. It's a little more sophisticated and it perhaps allows us to see why binary search is so much faster. First, let's introduce the notion of a tree. Which even though in reality trees kind of grow like this, in the world of computer science they kind of grow downward like a family tree, where you have your grandparents or great grandparents or whatnot at the top, the patriarch and the matriarch of the family, just one so-called root, node, below which are its children, below which are its children, or its descendants more generally. And anyone hanging off the bottom of the family tree, besides being the youngest in the family, can also just be generically called the leaves of the tree. So this is just a bunch of words and definitions for something called a tree in computer science, much like a family tree. But there's fancier incarnations of trees, one of which is called a binary search tree. 
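Before moving on to trees, here is a sketch of the fix just described; capping the copy at 11 characters plus the terminating backslash zero is one reading of "up to 12 maximally".

```c
#include <string.h>

void f(char *bar)
{
    char c[12];
    size_t len = strlen(bar);

    // The if condition the lecture calls for: never copy more than c can hold,
    // leaving one byte for the terminating '\0'.
    if (len > sizeof(c) - 1)
    {
        len = sizeof(c) - 1;
    }
    strncpy(c, bar, len);
    c[len] = '\0';
}
```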
And you can kind of tease apart what this thing does. Well, it's binary in what sense? Where does the binary come from here? Sorry? It's not so much an either or. It's more that each of the nodes has no more than two children, as we see here. In general, a tree-- and your parents and grandparents can have as many kids or grandkids as they actually want, and so for instance there we have three children off that right hand node-- but in a binary tree, a node has zero, one, or two children maximally. And that's a nice property, because if it's capped by two, we're going to be able to get a little log base two action going on here ultimately. So we have something logarithmic. But more on that in a moment. Search tree means that the numbers are arranged such that the left child's value is less than the root's, and the right child's value is greater than the root's. In other words, if you take any of the nodes, the circles in this picture, and look at its left child and its right child, the first should be less than, the second should be greater than. So sanity check 55. Its left child is 33. It's less than. 55, its right child is 77. It's greater than. And that's a recursive definition. We could check every one of those nodes and the same pattern would hold. So what's nice in a binary search tree is that, one, we can implement it with a struct, just like this. And even though we're throwing lots of structures at you, they're somewhat intuitive now hopefully. The syntax is still arcane for sure, but the contents of a node in this context-- and we keep using the word node, whether it's a rectangle on the screen or a circle, it's just some generic container, in this case of a tree, like the one we saw-- we need an integer in each of the nodes and then I need two pointers pointing to the left child and the right child, respectively. So that's how we might implement that in a struct. And how might I implement it in code? Well, let's take a quick look at this tiny example. It's not functional, but I've copied and pasted that structure. And if my function for a binary search tree is called search, and this takes two arguments, an integer N and a pointer to a node, so a pointer to the tree or a pointer to the root of a tree, how do I go about searching for N? Well, first, because I'm dealing with pointers, I'm going to do a sanity check. If tree equals equals null, is N in this tree or not in this tree? It can't be, right? If I am passed null, there's nothing there. I might as well just blindly say return false. If you give me nothing, I surely can't find any number N. So what else might I check now? I'm going to say, well, else if N is less than whatever is at the tree node that I've been handed, its N value. In other words, if the number I'm looking for, N, is less than the node that I'm looking at. And the node I'm looking at is called tree, and recall from the previous example, to get at the value in a pointer, I use the arrow notation. So if N is less than tree arrow N, I want to conceptually go left. How do I express searching left? To be clear, if this is the picture in question, and I've been passed that topmost arrow that's pointing down. That's my tree pointer. I'm pointing at the root of the tree. And I'm looking, say, for the number 44, arbitrarily. Is 44 less than or greater than 55? It's obviously less than. And so this if condition applies. So conceptually, what do I want to search next if I'm looking for 44? Yeah? Exactly, I want to search the left child, or the left sub-tree of this picture.
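Here is a sketch of that node structure together with the recursive search being walked through here and completed below; it mirrors the lecture's names but is not the literal slide code.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct node
{
    int n;                   // the value stored at this node
    struct node *left;       // everything smaller lives down here
    struct node *right;      // everything bigger lives down here
}
node;

bool search(int n, node *tree)
{
    if (tree == NULL)
    {
        return false;                    // handed nothing, so n can't be here
    }
    else if (n < tree->n)
    {
        return search(n, tree->left);    // tear away the right half, go left
    }
    else if (n > tree->n)
    {
        return search(n, tree->right);   // tear away the left half, go right
    }
    else
    {
        return true;                     // not null, not smaller, not bigger: found it
    }
}
```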
And in fact, let me through the picture down here for just a moment, since I can't scratch this out. If I start here at 55, and I know that the value 44 I'm looking for is to the left, it's kind of like tearing the phone book in half or tearing the tree in half. I no longer have to care about this entire half of the tree. And yet, curiously in terms of the structure, this thing over here that starts with 33, that itself is a binary search tree. I said the word recursive before because indeed this is a data structure that by definition is recursive. You might have a tree that's this big, but every one of its children represents a tree just a little smaller. Instead of it being grandpa or grandma, now it's just mom or-- I can't say-- not mom or dad, that would be weird. Instead the two children there would be like brother and sibling. A new generation of the family tree. But structurally, it's the same idea. And it turns out I have a function with which I can search a binary search tree. It is called search. I search for N in tree arrow left else if N is greater than the value that I'm currently at. 55 in the story a moment ago. I have a function called search that I can just pass N this and recursively search the sub-tree and just return whatever that answer. Else I've got some final base case here. What is the final case? Tree is either null. The value I'm either looking for is less than it or greater than that or equal to it. And I could say equal equal, but logically it's equivalent to just saying else here. So true is how I find something. So hopefully this is an even more compelling example than the stupid sigma function we did a few lectures back, where it was just as easy to use a loop to count up all the numbers from one to N. Here with a data structure that itself is recursively defined and recursively drawn, now we have the ability to express ourselves in code that itself is recursive. So this is the exact same code here. So what other problems can we solve? So a quick step away from trees for just a moment. Here is, say, the German flag. And there's clearly a pattern to this flag. And there's lots of flags in the world that are as simple as this in terms of their colors and patterns. But suppose that this is stored as a .GIF, or a JPEG, or bitmap, or a ping, any graphical file format with which you're familiar, some of which we're playing with in PSET4. This doesn't seem worthwhile to store black pixel, black pixel, black pixel, dot, dot, dot, a whole bunch of black pixels for the first scanline, or row, then a whole bunch of the same, then a whole bunch of the same, and then a whole bunch of red pixels, red pixels, red pixels, then a whole bunch of yellow pixels, yellow, right? There's such inefficiency here. How would you intuitively compress the German flag if implementing it as a file? Like what information can we not bother storing on disk in order to decrease our file size from like a megabyte to a kilobyte, something smaller? Wherein lies the redundancy here to be clear? What could you do? Yeah? Exactly. Why not rather than remember the color of every darn pixel just like you're doing in PSET4 with the bitmap file format, why don't you just represent the leftmost column of pixels, for instance a bunch of black pixels, a bunch of red, and a bunch of yellow, and then just somehow encode the idea of repeat this 100 times or repeat this 1,000 times? 
Where 100 or 1,000 is just an integer, so you can get away with just a single number instead of hundreds or thousands of additional pixels. And indeed, that's how we could compress the German flag. And Now what about the French flag? And a little some sort of mental exercise, which flag can be compressed more on disk? The German flag or the French flag, if we take that approach? The German flag, because there's more horizontal redundancy. And by design, many graphical file formats do indeed work as scan lines horizontally. They could work vertically, just humanity decided years ago that we'll generally think of things row by row instead of column by column. So indeed if you were to look at the file size of a German flag and a French flag, so long as the resolution is the same, the same width and height, this one here is going to be bigger, because you have to repeat yourself three times. You have to specify blue, repeat yourself, white, repeat yourself, red, repeat yourself. You can't just go all the way to the right. And as an aside, to make clear the compression is everywhere, if these are four frames from a video-- you might recall that a movie or video is generally like 29 or 30 frames per second. It's like a little flip book where you just see image, image, image, image, image just super fast so it looks like the actors on the screen are moving. Here's a bumble bee on top of a bunch of flowers. And though it might be kind of hard to see at first glance, the only thing moving in this movie is the bee. What is dumb about storing video uncompressed? It's kind of a waste to store video as four nearly identical images that differ only insofar as where the bee is. You can throw away most of that information and remember only, for instance, the first frame and the last frame, key frames if you've ever heard the word, and just store in the middle where the bee is. And you don't have to store all of the pink, and the blue, and the green values as well. So this is to only say that compression is everywhere. It's a technique we often use or take for granted these days. But how do you compress text? How do you go about compressing text? Well, each of the characters in Ascii is one byte, or eight bits. And that's kind of dumb, right? Because you probably type A and E and I and O and U a lot more often than like W or Q or Z, depending on the language in which you're writing certainly. And so why are we using eight bits for every letter, including the least popular letters, right? Why not use fewer bits for the super popular letters, like E, the things you guess first in Wheel of Fortune, and use more bits for the less popular letters? Why? Because we're just going to use them less frequently. Well, it turns out that there have been attempts made to do this. And if you recall from grade school or high school, Morse code. Morse code has dots and dashes that can be transmitted along a wire as sounds or signals of some sort. But Morse code is a super clean. It's kind of a binary system in that you have dots or dashes. But if you see, for instance, two dots. Or if you think back to the operator who goes like beep, beep, beep, beep, hitting a little trigger that transmits a signal, if you, the recipient, receives two dots, what message have you received? Completely arbitrary. I? I? Or what about-- or I? Maybe it was just two E's right? 
So there's this problem of decodability with Morse code, whereby unless the person sending you the message actually pauses so you can sort of see or hear the gaps between letters, it's not sufficient just to send a stream of zeros and ones, or dots and dashes, because there's ambiguity. E is a single dot, so if you see two dots or hear two dots, maybe it's two E's or maybe it's one I. So we need a system that's a little more clever than that. So a man named Huffman years ago came up with exactly this. So we're just going to take a quick glance at how trees are germane to this. Suppose that this is some stupid message you want to send, composed of just A's, B's, C's, D's, and E's, but there's a lot of redundancy here. It's not meant to be English. It's not encrypted. It's just a stupid message with lots of repetition. So if you actually count out all the A's, B's, C's, D's, and E's, here's the frequency. 20% of the letters are A's, 45% of the letters are E's, and three other frequencies. We counted up there manually and just did the math. So it turns out that Huffman, some time ago, realized that, you know what, if I start building a tree, or a forest of trees, if you will, as follows, I can do the following. I'm going to give a node to each of the letters that I care about, and I'm going to store inside of that node the frequencies as a floating point value, or you could use an int, too, but we'll just use a float here. And the algorithm that he proposed is that you take this forest of single node trees, so super short trees, and you start connecting them with new groups, new parents, if you will. And you do this by choosing the two smallest frequencies at a time. So I took 10% and 10%. I create a new node. And I call the new node 20%. Which two nodes do I combine next? It's a little ambiguous. So there's some corner cases to consider, but to keep things pretty, I'm going to choose 20%-- I now ignore the children. I'm going to choose 20% and 15% and draw two new edges. And now which two nodes do I logically combine? Ignore all the children, all the grandchildren, just look at the roots now. Which two nodes do I tie together? 0.2 and 0.35. So let me draw two new edges. And then I've only got one left. So here's a tree. And it's been drawn deliberately to look kind of pretty, but notice that the edges have also been labeled zero and one. So all of the left edges are zero, arbitrarily, but consistently. All the right edges are ones. And so what Huffman proposed is, if you want to represent a B, rather than represent the number 66, as in ASCII, with eight entire bits, you know what, just store the pattern zero, zero, zero, zero, because that's the path in my tree, Mr. Huffman's tree, from the root to that leaf. If you want to store an E, by contrast, don't send eight bits that represent an E. Instead, send what pattern of bits? One. And what's nice about this is that E is the most popular letter, and you're using the shortest code for it. The next most popular letter looks like it was A. And so how many bits did he propose using for that? Zero, one. And because it's implemented as this tree, for now let me stipulate, there's no ambiguity as in Morse code, because all of the letters you care about are at the ends of these edges. So that's just one application of a tree. This is-- and I'll wave my hand at how you might implement this as a C structure-- we just need to combine a symbol, like a char, and the frequency, with left and right pointers.
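To make that hand-waving slightly more concrete, here is a sketch of what such a Huffman tree node might look like; the field names are assumptions.

```c
// One node in a Huffman tree: a symbol (meaningful at the leaves), its
// frequency, and two children. Left edges mean 0, right edges mean 1.
typedef struct node
{
    char symbol;
    float frequency;
    struct node *left;
    struct node *right;
}
node;
```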
But let's look at two final examples that you'll get quite familiar with after quiz zero in problem set five. So there is the data structure known as a hash table. And a hash table is kind of cool in that it has buckets. And suppose there's four buckets here, just four blank spaces. Here's a deck of cards, and here is club, spade, club, diamonds, club, diamonds, club, diamonds, clubs-- so this is the random. Hearts, hearts-- so I'm bucketizing all of the inputs here. And a hash table needs to look at your input, and then put it in a certain place based on what you see. It's an algorithm. And I was using a super simple visual algorithm. The hardest part of which was remembering what the pictures were. And then there's four total things. Now the stacks were growing, which is a deliberate design thing here. But what else might I do? So actually here we have a bunch of old school exam books. Suppose that a bunch of students names are on here. Here's a bigger hash table. Instead of four buckets, I have, let's say 26. And we didn't want to go borrow 26 things from outside [? Annenberg ?], so here's five that represent A through Z. And if I see a student whose name starts with A, I'm going to put his or her quiz there. If someone starts with C, over there, A-- actually, didn't want to do that. B goes over here. So I've got A and B and C. And now here's another A student. But if this hash table is implemented with an array, I'm kind of screwed at this point, right? I kind of need to put this somewhere. So one way I can solve this is, all right, A is busy, B is busy, C is busy. I'm going to put him in D. So at first, I have random instant access to each of the buckets for the students. But now it's kind of devolved into something linear, because if I want to search for someone whose name starts with A, I check here. But if this isn't the A student I'm looking for, I kind of have to start checking the buckets, because what I did was sort of linearly probe the data structure. A stupid way of saying just look for the first available opening, and put as a plan B, so to speak, or plan D in this case, the value in that location instead. This is just so that if you've got 26 locations and no students with the name Q or Z, or something like that, at least you're using the space. But we've already seen more clever solutions here, right? What would you do instead if you have a collision? If two people have the name A, what would have been a smarter or more intuitive solution than just putting A where D is supposed to be? Why don't I just go outside [? Annenberg ?], like malloc, another node, put it here, and then put that A student here. So that I essentially have some kind of an array, or maybe more elegantly as we're starting to see a linked list. And so a hash table is a structure that could look just like this, but more cleverly, you something called separate chaining, whereby a hash table quite simply is an array, each of whose elements is not a number, is itself a linked list. So that you get super fast access deciding where to hash your value to. Much like with the cards example, I made super quick decisions. Hearts goes here, diamonds goes here. Same here, A goes here, D goes here, B goes here. So super fast look-ups, and if you happen to run into a case where you've got collisions, two people with the same name, well then you just start linking them together. And maybe you keep them sorted alphabetically, maybe you don't. But at least now we have the dynamism. 
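Here is a sketch of a hash table with separate chaining along the lines just described: an array of 26 buckets, each of which is the head of a linked list. The node type, the first-letter hash function, and insert_name are assumptions chosen to match the A-through-Z example.

```c
#include <ctype.h>
#include <stddef.h>

typedef struct node
{
    char *name;
    struct node *next;
}
node;

// One bucket per letter of the alphabet, each a (possibly empty) linked list
node *table[26];

// Hash on the first letter of the name, just like the exam-book demo
unsigned int hash(const char *name)
{
    return toupper((unsigned char) name[0]) - 'A';
}

// On a collision, simply link the new node onto the front of that bucket's list
void insert_name(node *n)
{
    unsigned int i = hash(n->name);
    n->next = table[i];
    table[i] = n;
}
```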
So on the one hand we have super fast constant time, and kind of linear time involved if these linked lists start to get a little long. So this kind of a silly, geeky joke years ago. At the CS50 hack-a-thon, when students check in, some TF or CA every year thinks it's funny to put up a sign like this, where it just means if your name starts with an A, go this way. If your name starts with a B, go this-- OK, it's funny maybe later in the semester. But there's another way of doing this, too. Come back to that. So there's this structure. And this is our last structure for today, which is something called a trie. T-R-I-E, which for some reason is short for retrieval, but it's called trie. So a trie is another interesting amalgam of a lot of these ideas. It's a tree, which we've seen before. It's not a binary search tree. It's a tree with any number of children, but each of the children in a trie is an array. An array of size, say, 26 or maybe 27 if you want to support hyphenated names or apostrophes in people's names. And so this is a data structure. And if you look from top to bottom, like if you look at the top node there, M, is pointing to the leftmost thing there, which is then A, X, W, E, L, L. This is just a data structure that arbitrarily is storing people's names. And Maxwell is stored by just following a path of array to array to array. But what's amazing about a trie is that, whereas a linked list and even an array, the best we've ever gotten is linear time or logarithmic time looking someone up. In this data structure of a trie, if my data structure has one name in it and I'm looking for Maxwell, I'm going to find him pretty quickly. I just look for M-A-X-W-E-L-L. If this data structure, by contrast, if N is a million, if there's a million names in this data structure, Maxwell is still going to be discoverable after just M-A-X-W-E-L-L steps. And David-- D-A-V-I-D steps. In other words, by building a data structure that's got all of these arrays, all of which themselves support random access, I can start looking up people's name using an amount of time that's proportional to not the number of things in the data structure, like a million existing names. The amount of time it takes me to find M-A-X-W-E-L-L in this data structure is proportional not to the size of the data structure, but to the length of the name. And realistically the names we're looking up are never going to be crazy long. Maybe someone has a 10 character name, 20 character name. It's certainly finite, right? There is a human on Earth who has the longest possible name, but that name is a constant value length, right? It doesn't vary in any sense. So in this way, we've achieved a data structure that is constant time look-up. It does take a number of steps depending on the length of the input, but not the number of name in the data structure. So if we double the number of names next year from a billion to two billion, finding Maxwell is going to take the exact same number of seven steps to find him. And so we seem to have achieved our holy grail of running time. So a couple of quick announcements. Quiz zero is coming up. More on that on the course's website over the next couple of days. Monday's lecture-- it's a holiday here at Harvard on Monday. It's not in New Haven, so we're taking the class to New Haven for lecture on Monday. 
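And, to make the trie description concrete, here is a sketch of what one trie node might look like; the 27 children (A through Z plus one extra slot for an apostrophe or hyphen) and the is_name flag are assumptions.

```c
#include <stdbool.h>

// One node in a trie: following children[0] means the next letter is 'A',
// children[1] means 'B', and so on; is_name marks where a whole name ends.
typedef struct node
{
    bool is_name;
    struct node *children[27];
}
node;
```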
Everything will be filmed and streamed live as usual, but let's end today with a 30 second clip called "Deep Thoughts" by Daven Farnham, which was inspired last year by Saturday Night Live's "Deep Thoughts" by Jack Handy, which should now make sense. FILM: And now, "Deep Thoughts" by Daven Farnham. Hash table. SPEAKER 1: All right, that's it for now. We'll see you next week. DOUG: To see it in action. So let's take a look at that right now. So here, we have an unsorted array. IAN: Doug, can you go ahead and restart this for just one second, please. All right, cameras are rolling, so action whenever you're ready, Doug, OK? DOUG: All right, so what we have here is an unsorted array. And I've colored all of the elements red to indicate that it is, in fact, unsorted. So recall that the first thing we do is we sort the left half of the array. Then we sort the right half of the array. And ya-da, ya-da, ya-da, we merge them together. And we have a completely sorted array. So that's how merge sort works. IAN: Whoa, whoa, whoa, cut, cut, cut, cut. Doug, you can't just ya-da, ya-da, ya-da, your way through merge sort. DOUG: I just did. It's fine. We're good to go. Let's just keep rolling. So anyway, IAN: You have to explain it more fully than that. That's just not enough. DOUG: Ian, we don't need to go back to one. It's fine. So anyway, if we continue with merge-- Ian, we're in the middle of filming. IAN: I know. And we can't just ya-da, ya-da, ya-da, through the whole process. You have to explain how the two sides get merged together. DOUG: But we've already explained how the two sides-- IAN: You've just shown them a merge array. DOUG: They know the process. They're fine. We've gone over it ten times. IAN: You just skipped right over it. We're going back to one, you can't you ya-da, ya-da over it. All right, back to one. DOUG: I have to go back through all of the slides? My God. It's like the sixth time, Ian. It's fine. IAN: All right. You ready? Great. Action.