[ Silence ]
>> This is CS50 and this is the end of week 7 and this is CS50's own Ansel Duff now available in iTunes. The link will be on today's lecture page who just so happens to be my advisee as well. So store.cs50.net has debuted thanks to your and the staff's contributions. This is the URL, realized that everything is sold at cause. The opportunity really is just to give you a chance to really live and wear CS50. If for reasons of financial aid you would feel, otherwise, excluded from anything involved in the course, please just reach out to me privately and we will ensure that you can get most anything you would like and we will take care of it. So, here's one design that was submitted, if you'd actually like to wear and dress as CS50 like this, so we enjoy this one. Another one that came out, if you'd like to wear this on campus, it says this on the front of your shirt. For those more comfortable, there's this option here. There was a surprising frequency of people who submitted screen shots of their appliance as their t-shirts. This one was creative though and that it used some ASCII art to convey the idea of taking CS50. This one took the idea of creating a CS50 slogan a little, literally. [Laughter] But perhaps my favorites are those that were submitted here and here.
[ Laughter ]
[ Applause ]
>> So get yours today followed by if you'd like these slightly creepier, this one here. [Laughter] So that store.cs50.net. So, today we finished our discussion really of C, next week we dive into web programing. If you'd like to have lunch with us this week even if you've done this before at the cs50.net/rsvp and so we can talk about where we're at, at this point and where we're going. If you've already been, that's fine. Again, if you wanna come back, 1:15 this Friday. But next week already we dive into web programing and even though we've spent the majority of the semester focused on C and on this fairly low level language hence forth we move into the world of HTML and CSS. These are not really, these are not programming languages, they are what we'd call markup languages but they are the languages in which webpages are written. In fact, if we go to most any website out there, we can go to the favorite Harvard FML which is actually a hosted blog, we can actually look at the source code here. And this is the kind of stuff with which you guys will soon be familiar. Some of this is a little messier, a little more advanced than we'll see in the first day but this stuff here is what's called Cascading Style Sheets and it apparently has something to do with font sizes and weights and colors and so forth and indeed aesthetics of a webpage. If we scroll down further we'll start to see lots of open brackets and closed brackets and spans, and divs and all of this color coated text here is what drives, again, any website out there on the internet. But it's one thing to actually make static webpages that simply display formation that's a lot more fun, a lot more empowering to make dynamic websites. They take user input, they produce dynamic output and Harvard FML is an example of that. I saw uharvard.com is another example alright? [Inaudible] the voice--the noise are not actually just posting these thing statically, they're actually receiving form submission from actual users and then they've written some code on the back end to actually display these posts. And all of this will be possible thanks to languages like PHP, a data base language called SQL, and in fact for your final projects you're welcome to go with those languages or beyond. We have students every year who dabble in C sharp, in Python, in Ruby, there're so many languages that will be available to you. And in fact, statistically, over 80 percent of you will end up doing web based final projects because one, it's relatively accessible, two, it's familiar, three, it's useful. Probably, not as exciting to tell your friends to come over to your laptop, sit down in front of your C50 appliance, and run your game or run your program. That might have satisfied you in the first week but it'd be nice now to just e-mail to your house list or your parents or your friends, or your URL and say, start using my amazing new website. So, that is what is on their horizon. But for today we close out our discussion of some of the more advanced data structures in C. And how we might actually build tools that we're going to soon start taking for granted. You're gonna see that a lot of Cs restrictions and nitpickingness are gonna fall by the way side for better or for worse in the upcoming weeks. So, that if you wanna big chunk of storage, you're gonna have to think a lot, lot less in PHP to actually get that memory. You can just take for granted that it's accessible for you. If you want something called a linked list or an array, you'll find that syndactyly it's a lot easier to get at it. But we're also gonna use fancier things, things called hash tables and trees and other data structures that will actually look at today and then use in practice in the coming weeks. So, this was again a photograph of Mather and this is an example of a stack of trays and we photographed it simply because it's a representative of a very common computer science data structure that is known as a stack. A stack is a LIFO data structure, last in, last out. I'm sorry, wait a minute, last in, first out, LIFO, last in, first out. And as you might infer from reality, if you put a tray on top of the stack, well, unless you're, I mean an idiot, you're not gonna go and take the bottom most tray that was first put there instead you're obviously gonna take the one on the top. So, this has actually very useful implications. I alluded to just one fairly arcane one involving validating HTML but this ability, to have a data structure that allows you to incredibly rapidly add to it, really in constant time, big-O of 1, and then remove from it also in constant time, big-O of 1 is a powerful technique because we've dabbling so far with array based data structures and others where we often need at least log in or N. So, really today and onward we wanna approximate or achieve constant time. So, if you have a data structure like this stack, what do you need--how do you go about implementing this, say in C? Well, the only syntax we have thus far for creating custom variables, custom types is a struct. So, if were to start typing, struct stack and then an open curly brace. And we want you clumped together one or more data types that collectively define a stack, much like an ID and a name and house collectively define a student, what do we need to be able to keep track of? If suppose we wanna stack, not of trays, but let's keep it simple, a stack of integers. I want a data structure where I can put a number and then a number and then a number and then a number and I want it to be easy to get at the most recently added number. What could I put inside of the struct and then start calling the whole thing a stack? What might you propose? Again, the goal is to store numbers, integers and a stack of them. Well, what existing data structure can we kind of steal to implement at least part of this idea, yeah? Okay, so we could use a linked list, we could use a linked list which we'll look at on Monday which is a dynamic data structure that allows you to grow and shrink your data structure as on demand, when you need more memory. And in fact, we could even regress a bit more. We could use even the simpler array, right, and that would actually kind of conceptually be pretty consistent with the stack because much like these trays, they are all right next to each other. So, within an array of integers, be right next to each other, but of course what's the downside of using an array really for most anything?
>> It's a fixed size.
>> It's fixed size, right? It might be fine but you're either gonna spend way more memory than you need by over allocating and then just wasting it or you're gonna under allocate and in that case you've ended up wasting time because if you allocate too few integers and you need more, you then have to call malloc again but with bigger--a bigger number and then you have to copy your old data into the new. And then you typically have to call free on the old array to give that memory back. So, we can have an array but if we just have an array of numbers to represent a stack, we need at least one more details. So, here is how in C we might represent a stack. So, recall that typedef defines your own type a synonym of sorts, struct says here comes a bunch of data types collectively call them, the following, and then the word at the end stack is just the arbitrary, but conventional name, we wanna give to this data structure. And I propose even simpler than a linked list, we could implement a stack with just 2 data members. Previously again, we implemented a student with three--ID, name and house. A stack might just be an array of numbers, an array of integers, capacity is probably a constant, sharp defined some number, a hundred and twenty eight, a thousand something, arbitrary but big. And why do I need to know about the size, and semantically we mean a slight distinction here. Capacity is the total capacity, how many total numbers you could put in here, so what might size actually represent in this context of a stack? Yeah, how many things are actually in there? So, initially capacity might be a big number, a thousand, so that gives me a thousand integers potentially in an array but initially there's nothing in there. So, I at least have to remember for this data structure stack that it size is zero. And then anytime I add a number to this stack, well what are you really doing? You're gonna add a number to the array and then another one, and then another, and another one. And so long as you remember how many times you've inserted numbers into this array, it doesn't even matter if there's a whole bunch garbage values because if you do it in order from left to right or conceptually from bottom up in the array so as long as you remember the size that it's 0 or 1 or 2, you know that the first--those first integers are legit and the rest might be garbage values. But there is a problem here, what happens when I insert the 1001 for instance number into this data structure? What, what would you do in that case? What do you say? I mean, same problem you might, if you just blindly go past the boundaries of this array by going to capacity or capacity plus 1 or plus 2.
>> You might very well seg fault so in the worst case you might have to either just accept that, which is not good, or two, you might have to actually re-implement this thing using malloc. It's probably not ideal to statically allocate that many integers inside of your structure 'cause in this definition you can never re-grow it. Because I'm not using pointers here, because I'm using a hard coded array, I can't just call malloc or realloc later so this is a design decision and again, when we talk about these axes along which we grade problems that's design takes these things into account If you are comfortable with the design constraint, that you can only have say a thousand numbers in this stack, well maybe that's actually quite fine but if that's not acceptable, well then we need to re-engineer this, and we do need to use malloc or pointers or somehow to get that dynamism or as you proposed use a linked list but these are tradeoffs. Like what might be the trade? What's the first tradeoff that comes to mind if I propose, alright, let's scrap this and let's actually re-implement this using linked list. So we don't have this upper bound. Just instinctively like what's the downside of that approach?
[ Inaudible Remark ]
>> Yeah, and I mean this is totally legitimate. It's harder, right? We saw on Monday and I deliberately didn't even go through all of those lines of code 'cause you know it's like a hundred or so lines to collectively implement those linked list that we first simulated with humans but if you actually coded up you need an insert function and a delete function and you glimpse toward the end of Monday how relatively complex the code was. You had to handle insertion at the head, at the tail, in the middle, deletion at the beginning, at the end and it's all doable and that code is in fact correct but it's just a lot of work, my God, I know how to use arrays and like week 2, here is the implementation of the data structure now, all I have to do is index into that array using some square bracket and dot notation, done. So this too is a valid design constraint especially in the real world, if you can either implement something that solves the problem in an hour or you can solve this problem and many others in a week. You know you have to decide for yourself what's actually worth it and so one of the goals of the final project too is going to try to rain in your own ambitions and you'll be asked to propose that you'd be happy would say this good version of a final project and this better version of a final project and this best version of your final project 'cause realistically you can sort of imagine implementing the whole world in every feature you might imagine but as you've seen with problem sets, just implementing one seemingly simple feature can sometimes waste or cost you some five hours of your life 'cause you just don't anticipate the things that arise. So getting better at estimating difficulty in time will be one of the key features of that process. So here is for instance a queue, and a queue is a FIFO--F-I-F-O, First-In-First-Out data structure and this one in the real world is certainly more fair, right? You probably--if you get there early again you wanna be the first one let into the store, and not the last and so a queue has its own applications and here too, a queue could very reasonably be implemented with a fixed to size array, right? For instance, if this is some kind of giveaway where they only have a finite number of iPhones or iPads, well, they don't need to let more than say 500 people in line 'cause the 500, the first person is not going to get into the store so it's sometimes whether in the real world or in a program, might be totally reasonable to balance the sides of your data structure so again, we could implement this thing called a queue perhaps a little something like this where we have another array called numbers, it too is initialized to capacity whether that's 500 or 1000, whatever, it's a constant but this time I have to keep track of two things potentially, two things, whereas the stack is kind of nice and simple in that you add a tray or add a number, add one, add one, add one, add one and then as soon as you remove one, it just comes off the top, off the top, off the top. Notice what stay in constant, the first thing you put in is staying where it is. Your elements are going up and down, and up and down but in a queue it's a little different because supposed in a queue some, someone gets in line and then another person gets in line and then by 5 a.m. you have like 10 people in line at the Apple store and then they start opening the store but the queue is not filled, right? It could still wrap around the block but those first few people start to go in. Well, in the real world to the rest of the humans who arrived later are going to probably shift in line, right? So the end of the line--the beginning of the line is always the same but the end of the line is kind of moving here and so when it comes time to implement something like a queue, we can definitely remember its size, how many people are in line right now, how many integers are in this array right now but if we want this thing to be dynamic in that we're implementing it with an array and our array looks like this where I'll just draw it as our typical array and I'll put in something like let's say, Alice and this is Bob who gets in line and this is Charlie who gets in line here and I still have spots for two more people in this line. Well, as soon as the store opens, Alice is effectively gonna be taken out of line but if we're now implementing this notion of a queue in a program you know we could move Bob and Charlie all the way to the left and every time someone gets into the store, we could shift them all to the left but what's the design concern with that approach?
[ Inaudible Remark ]
>> Yeah, it's O of N, right? It's kind of expensive. Every time we pop someone off the queue, we potentially shift everyone to the end of the array, I mean that might be nice and clean and it makes for nicer pictures if everyone is kind of moving up in the line but really in programming we could just remember where the start of the line is so rather than assume that the start of the line is at bracket 0, just now update some index variable, call it, what do call it here, call it head for head of the line and just do plus plus and remember that Bob is the head of the line and you know what this now represents, the end of the line, right? We can reuse this memory so that when this person and this person, and then the next person come in we don't have to kick them out of line, we can say okay, fine, go here but in computing we can just remember that the head of the line is somewhere else so that we can avoid this expense of potentially big O of N for shuffling people around. So in that sense, implementing a queue is actually a little more complex because we have to keep one more piece of information around if we wanna implement things efficiently. So notice here the tradeoff and this is a very minor tradeoff but by spending 32 more bits for one more int called head we can avoid spending what as a resource?
>> Time.
>> Time, right and this is a constant thing especially this week and next week and even with the final project. You have so often in programming this tradeoff between well, if you spend a little more time, you can save, rather if you spend a little more space, you can save time or vice versa and the first time we saw this, I think was with merge sort. Remember that merge sort was blazingly fast compared to bubble and selection sort but the little dirty secret was we technically needed an extra array and that's why we left more space on stage for all of these milk cartons because we needed some temporary space to actually merge the elements into. So we spent twice as much space but my God, we went from N square too N log N which we saw multiple times which is much faster. So there's this them of tradeoffs in terms of space, in terms of time, in terms of difficulty or human time spent implementing. All of these are legit resources. But the holy grail is gonna be a data structure that lets us just put stuff into it and take stuff out of it in ideally constant time, right? If you don't care about this notion of a queue or this concept of a stock, all you care about is having some kind of black box into which you can put stuff and then you wanna be able to get it back out by saying give me the--let's say give me the student with ID 13 or give me the student with ID 16, you don't care how this black box is implemented you just want the answer to come back like that. Now you can obviously implement a black box with an array and within your search but again in the worst case, if you wanna get a student with some ID, it might be at the end of your array or at the back of this black box so you could take big O of N time but the goal today is again constant time. So how might we go about implementing this? Well, the spoiler is that it's gonna be called a hash table. But how we actually implement this magical black box is a little more interesting if not obvious. So supposed we initially create a data structure using an array of some sort so we'll call it table but a table is effectively synonymous now with an array and supposed that the goal here is to store things like let's say people's names in this array. So supposed I want a data structure, we're even more simple than a student object, I just wanna store student's names and then I wanna be able to check is Alice in my black box? Is Bob in the black box or they--have I seen them before and put them into this data structure? I just wanna be able to answer those queries. So I propose implicitly here in array aka table that's of 26 and that's not the coincidence. What might the 26 conjure up ideas of here?
[ Inaudible Remark ]
>> Yeah, so alphabetical somehow, right? So what if--if we're given an arbitrary list of names Alice, Bob, Charlie, and so forth in random order of O, suppose I wanna keep track of whether or not I've seen those names before, where do I put them? Well, I could every time you hand me a name I can store it in this black box and just put it at the first available location. I can put it at the bracket 0, bracket 1, bracket 2, bracket 3 but if I insert things into a data structure in random order, what's gonna be the running time of search when I ask for the--that same structure back out? It's gonna big O of N, right? If you do something in random order and we've seen this multiple times, we saw it when Shaun was up here on video on the board just trying to find some number, well, if you put things in random order you're just differing your effort 'til the end of the process 'til you wanna get that thing back. So we don't want that. Today we want constant time ideally lookups. So as this table suggests, if we--if I hand you Charlie first rather than put Charlie at the first available location, where might you instead put him? So bracket 0, 1, 2, 3, 4, which one comes to mind?
>> So maybe bracket 2. Right, maybe you should just in advance decide that bracket 0 is going to be for people whose names start with A. Bracket 1 is going to be for people whose names start with B. Bracket 2 is going to be for people whose names start with C and then people whose names start with Z end up with at the end of this array at location 25 but because it's an array, you can get there instantly, right, using square bracket notation, you have so called random access to memory so you can jump right there. So even though it looks like Zeek is getting the short end of the stick by going all the way at the end of the black box, you still have instantaneous insertions. Except if what happens? Right. What if two people have the name, a name that starts with Z or maybe more realistically, two people who have a name that starts with A or B or some of the more common letters of the English alphabet. Well, now we kind of have a problem. If in advance, we only allocated 26 buckets, we can store 26 people but just statistically a class of 26 students is not going to have unique starts of everyone's name so what could we do just instinctively? Suppose I insert Charlie and he goes to bracket 2 and then Chuck comes along and I need to insert him as well, where do we put Chuck, whose name obviously starts with C as well?
>> 2D array.
>> Okay, so maybe a 2D array. Right, so that's actually not a bad idea so maybe instead of allocating a 1D array, I could allocate a 2D array and then just put Charlie there first and then put Chuck to the right him conceptually. We start drawing this is a grid, alright, but then Carmen comes along and so now we have a third C name and what do we do with Carmen? So right, we have a 2 times 3 D array, right? So we need another dimension so this is not a bad idea, right? If you need to be able to insert more people into the array, well just kind of increase your number of dimension so that you can have people vertically so to speak, and also horizontally. But the problem there is every time we double the dimensions from 2 by--1 by N to 2 by N we'll now allocating effectively more space for Z names and Y names and Q names and we're probably not going to have that many of them so there's a downside there. If we use arrays or 2-dimensional or 3-dimensional arrays, no less, we're kind of wasting space on the less common letters even though we're solving problems for the least common or for the more common letters. So what about this? Why don't we try to leverage this reality statistically that there's not that many people whose names starts with Q or Z, my apologies statistically to those whose names do start with Q or Z here. But suppose that we're willing to accept that for small data sizes we're not going to get someone with the Q name or the Z or a few other letters as well. So what if I first try to go to bracket 2 and I try to insert Charlie there and I get some kind of collision, well why don't I just--well if bracket 2 was busy, well, let me put in bracket 4 or 3 or bracket 4. In other words, just start probing the array so to speak, from top to bottom and the first available spot below Charlie is where you can put this other C name, so does this work? Does this not work? Well, definitely makes more efficient use of the space, right? Because now if we have 26 buckets, we're going to fill all of them at some point even if not everyone is exactly where we would like them to be but in the worst case, what's going to happen here? What's going to be the worst case lookup time for Alice if Alice just so happens to be inserted last, number 26 in the class? And there's already someone else with the letter A as their name. Yeah, it's going to take us big O of N again to get to Alice so not ideal, right? 'Cause Alice might get arbitrarily put at the end of the array and just finding her, again, this is kind of now devolving back into random order. And random is typically bad at least in terms of running time here so maybe this isn't such a big deal, right? I'm kind of making up these names. I don't even know where Carmen came from off the top of my head so maybe this idea of collisions is not all that likely. Well, we can actually answer this question if you assume a room full of, say, N of us, in CS50 students, what is the probability that two students here are going to have the same birthday, right? This is kind of the same question. There's 365 or 66 days in a year and that's a finite number. There's a finite number of us, less so on some days and you, presumably, if we have 367 students here today or earlier in the semester then there's gotta be a collision, right? This is what's called the pigeonhole principle. For more on this, take CS21. So we could do this. Actually, we can probably spend like 20 minutes trying to figure out who has the same birthdays here but this could actually very quickly devolve into a big mess but let's assume that this is probable. Let's actually see if we can slap a number on this so this is just a formula that actually answers the other question. What's the probability that every one of us has a different birthday? Well, this is actually a little easier to compute and we can do it as follows. If I pick one person on the audience what's the probability that you have a different birthday than everyone else at this point in time? 100 percent 'cause I haven't considered anyone else so that's 100 percent. That's 1 over 1, hence the 1 in this formula. Now, I move on to the second student in the room. Well, she can have any possible birthday out of 364 days because already one day was taken by someone else so if we want the probability that these two students do not have the same birthdays, we do a 100 percent times like 99 percent, give or take, or otherwise known as 1 minus 1 over 365 and that's just give us the probability at this point in the story that these two have different birthdays and then we keep multiplying. We keep multiplying. We keep multiplying and if you actually do all these out or look at the back of like a physics or a math book, you get this formula, which at first glance, admittedly is not all that enlightening. You're probably not graphing this mentally in your head all that quickly but if we do actually graph not this but 1 minus is this 'cause recall that the question was what's the probability of a collision? Right now, this is the probability of no collisions. I want to know if we're going to start talking about hash tables and using arrays and have to deal with multiple C names or A names or B names, how likely is that? Well as the birthday problem suggests, it's actually really, really likely so this is a plot whereby the X axis represents number of students in the room or number of birthdays equivalently the left--Y axis represents the probability of a collision of two people having the same birthday and what's remarkable about this is that as soon as you get to a class full of like 49, 52, 58 students, give or take, there is a nearly 100 percent probability even in a class that's much smaller than CS50. A class of just 50 students that there's going to be a collision and that's thanks to combinatorics and just the probability of someone actually having the same birthday of someone else. So what's the takeaway here? Well, even if you scroll back, you know this is kind of remarkable even in the class of like 28 students which is much smaller, there's like a 60 or 70 percent chance of two people having the same birthdays. Now, again what's the take away here? Well, if we now generalize this to the problem of data structures and this thing called apparently hash table. We might assume that we're not going to have Q names or Z names. This is not really a big deal. We're not going to have collisions but indeed we are even if we scale down to 26 people or 26 names, the probability of a collision two people having the same birthday or if we assume a uniform distribution of names, the probability of two people in a 26 student-class is 60 percent. So in short, this is a problem so it's not safe to sort of sweep this detail under the rug and just hope that, what are the odds that this happens? In fact, it's very often going to devolve into a situation where our lookups, our big O of N and the whole goal here is to avoid that. So I used this buzz word a moment ago, when your probing to actually start somewhere in the array and then just kind of probe for an open available space if you want to insert something into this array like a student or a number and its ideal location is already occupied but we already discussed ultimately the downside. In the very worst case, and that's generally what big O notation conjures up. In the very worst case, we might just insert Alice into this data structure last. There is already someone else like Adam with the name whose starts with A and there was a Bob and a Charlie and everyone else so the very last Alice gets stuck all the way at the end of the data structure which means when I say high is Alice in this list and I ask the black box, it's going to take it linear time. Big O of N and literally N steps to confirm or deny the answer to that question. So not ideal but very efficient use of space but what it's costing us, time. So there again is the straight off so rather than just do a 2-dimensional array, why don't we steal what we talked about Monday and start to merge these two ideas. So this thing here is a hash table but it's a little fancier looking in that what we really store vertically, so to speak, is still an array but this time, it's not an array of numbers, it's not an array of strings, it's just an array of pointers because pointers recall can point to dynamically allocated data structures whereby you don't know in advance how much member you're going to need or how many structures you're going to need. So why even pre-commit? We can just decide initially in this case, this is related to a birthday scenario, so in this case, I have an array of size 31 just because of the textbook that we scan is from it's not zero index but that's fine. There's 31 pointers and each of these now represents this idea of a birthday in a given month. What is the proba--rather who has a birthday of like January 1 or February 1? Will that person might go here or something 31 will go here or 27, 26 and so forth. So we only allocate a bunch of pointers so our table in this case is not size 26 anymore, it's size 31 but it's still finite but what we have now is the ability to grow laterally so I think the example that this author used was a famous people. I think that might be Sam Adams up top. So Sam Adams has a birthday I think and that's why I'm making this up of something 2 in the month and so he was inserted there.
>> So what does it mean now to insert into this data structure? We're not just plopping him or her into a preexisting array. What kind of code do we have to write to insert S. Adams into this data structure? What functions do we need to call if we implemented this in C? It's related to memory, it's related to allocation?
>> Malloc.
>> Malloc, okay. Actually, I think it was funny. I think we saw in the quiz a function calls like "mallocates", that was one. And yeah, there are some other funny ones I'll keep to myself. So, ask your TF. So if we want to actually insert S. Adams into this list and the location at which we want to insert him is initially null and null is suggested by all of these slash marks, that's just a null pointer. Well, we have to go ahead and call mallocates and create a new struct and inside of that is gonna be apparently a string like S dot Adams and then a new line character. But what is this thing probably represents? To the right of S. Adams in this data structure is another slash and another square, what does that represent perhaps?
[ Inaudible Remark ]
>> Yeah. So it represents null, it represents a pointer. So in fact these aren't strings per se, each of these rectangles that represent a person, S. Adams, J. Adams and so forth is apparently instead a struct. Inside of which is a string just like we saw previously inside of a student is an ID and a name and a house, the last two of which are strings. Here apparently one of these structures, let's just generically call it a node, N-O-D-E. So this might just be a string as part of the struct and then this other thing is just a pointer. And as we typically draw pointers as squares, that's probably 4 bytes or 32 bits and it's null because there's no one else in this data structure after S. Adams. But apparently, W. Floyd and J. Adams share the same birth dates of the month, and so we have to start chaining those two guys together in order to put them at the same location. Now, this two is not ideal, alright. This does avoid what problem, this is better than this design, why? What problem do we avoid completely or theoretically? What's that?
[ Inaudible Remark ]
>> Linear time so it's actually not quite, we'll come back to that. But what was the biggest deal breaker with this thing besides collisions? It's limited, right? It's finite space. We had 26 buckets in this case which means if we try to insert the 27th person, my god we have to jump through this hoop of reallocating more memory, copying everything over and again, this is typically not an ideal solution. So this approach called Separate Chaining definitely solves that problem. And I say theoretically because you could still run out of memory in total in your computer so you might not be able to allocate more of the structures here but that's only if you're talking like 2 gigabytes or something crazy inside of the program. So this at least is completely dynamic but technically, the running time now of finding someone in this data structure in this ideal case it's still constant time. In the ideal case it's just S. Adams, he's the only guy there, we jump right through that spot and bam, we found S. Adams. But if you kind of do this overtime in the worst case, what's your data structure gonna look like? If this is my array of pointers and I'm chaining off of this, all of the actual people, what's the worst case scenario in a class of students when you are trying to insert them into this array?
[ Inaudible Remark ]
>> Yeah, they all have the same birth date of the month or in our previous example, their names all oddly enough start with the letter A or some specific letter. So even though we might have a really smart design, it's possible that just by bad luck, these things could be ending up here, here and so forth. So what's this devolving into despite the vertical aspect of this picture? It's just the linked list, right? So the running time potentially of a hash table even in this again it's called a hash table, this is just a different version of it that's using separate chaining. Well, even in this case it could devolve back into linear time. But if we do this intelligently and we actually use, let's call it a hash function, that's more intelligent than just saying use the first letter of their name or use their birth date, the day of the month on which they were born because these clearly lend themselves to perhaps disproportionate collisions. You know what we could do in the best case, none of these chains so to speak is going to be that long. This one might be two--this one might have two nodes, this one might have two nodes. Ideally, if we have a whole bunch of pointers vertically and we create these chains laterally, ideally we're gonna have these chains being of roughly equal length. So ideally, what we have is a running time of big O of N where N is the total number of elements inside of this structure. And let's call it arbitrarily K where K might refer to what here? Why do I divide by K? The goal is to answer the question, what's the running time of Search ideally here?
>> The size of the array.
>> The size of the array, the size of the hash table, the height of the hash table. So if this is height K, so this is either 26 and 1 example or 31 and another. That's what we mean by K. So this is better, right? Big O of N over K is better than big O of N but only in the real world, right? If K is a fixed number like 26 or 31, what have we been doing for the past few weeks? We always kind of cross that out so here's where theory and practice kind of start to clash with one another. In the real world, N over K is absolutely better than N. Just arithmetically you divide some number by a positive number and you're going to get a smaller number than the original number, right? That just works out that way. So in the real world you're gonna spend less time or use less memory if you're dividing by that factor. But asymptotically sort of pre-quiz zero, in pre-quiz zero terms, we always said that this is still big O of N. So here's what's kind of neat and also what's neat in particular about problem set 6 which is this speed based, this speed-oriented spell checking challenge. Well, yes it's still big O of N, this notion of a hash table whether we implement it like this which has a couple of problems. This which is better but still has some problems. In the real world doing something in say 100 seconds is absolutely gonna be slower than doing something in 10 seconds if K is 10, right? So any kind of real world improvement we can make actually benefits real people even if the mathematicians and theorists might say in asymptotically or with big enough numbers, this is really quite irrelevant but it's not irrelevant in the real world. So this promise land of having a constant time lookup is really only in an ideal world. You can construct in fact what are called ideal hash functions which indeed take some data structure like this or even one like this. And if you're smart enough and put enough time into the process of writing the hash function, the thing that determines where an input goes. Well, you can actually come up with an ideal hash function which will literally put everyone at a specific place without any collisions but it requires a lot of work. And so invariably every semester we have folks particularly that are hacker types that try to implement 4 piece at 6 and ideal hash function and then they redefine and print where you're not allowed to do that in advance because you have to know something about the data in advance to actually construct that. So instead, you have to come up with better hash functions. The hash function that we've been using right now is actually been simple. You hand me a person, I look at, I being a hash function in this story, I look at the first letter of their name and I return a number between 0 and 25, so some letter of the alphabet. Or you hand me one of these apparently famous people and I hand you back a number which is the date of their birthday and then you use that index into an array. So hash function generally takes this input, some structure or some string or some number, and then returns a number within a specific range. What's that range? Well in this case it's 26, in this case it's 31, it's within some specific bound. So if you think back to your C operators, the modulo operator, it's actually quite common in this context. When you wanna compute something and then return a relatively smaller number. So let's do better. When I hand you the name of a person, if it's apparently not quite ideal to just look at their first letter, the first letter of their name, what could you do that's a little better that's still predictable, that you can repeat this process but it's not as naive as just looking at the first letter. What might you do instead? Alright, all you're given is a string and you have to somehow give me a number between let's say 0 and 25, but again this is arbitrary as well. What could you do besides looking at just the first letter, what else could you do?
>> Look at the second letter.
>> You're right. I mean that's actually pretty reasonable. You could look at the second letter. And what could you do? Well, remember that in C and in programming in general, a string is just a bunch of characters. A character is just a number so you can already think back to P-Set 2 with Caesar and Vigenere. You can already kind of treat strings as numbers. So maybe we could do something arbitrary but perhaps predictable like take the ASCII value of the first letter, add it to the ASCII value of the second letter, and that gives me a sum. And that sum might be a little too big, it might be 100 plus or something but that's okay. I can just use the mod operator, the percent operator in C and give me a number between 0 and 25. Now, whether or not that works is kind of unclear, right? There are still some dependencies in the real world between first letters and last letters like you don't generally see a name that's like ZK. You might see ZO or ZI but that's gonna be more common. So that might not be quite enough. So what can you do if the first two letters might not really be sort of independent or random enough? We'll do the first three or the first four.
>> And in fact the common hash function is given a string is to actually add up the ASCII values of all the numbers and then take some modulo operator of it to get back a smaller number. Or you can do even much fancier things especially if you know in advance what the data is going to look like whether it's human names or the like. So, how can we represent? Say, let me say something like this. So, we can do a little better than this in fact. If we introduce the notion of a tree, we can actually, and let me tease you with the crazy looking picture, we can actually implement a little something that looks like this. And even though I just said that with a hash table, we can only get to say linear time or linear divided by K where K is some constant, I actually can retract that a bit. If you're willing to spend a lot more RAM, a lot more memory and have a lot of arrays inside of your program, we can actually boil down the problem of looking up strings or doing dictionary look ups in literally constant time. But let's pause for a couple minutes for a musical interlude by our own Ansel Duff again and then we'll return in 5 minutes and answer how this can be achieved.
[ Music ]
>> Okay, we're back. So, before we get to this amazing new data structure that's gonna solve all of our problems, let's just paint a quick picture here of this just to make clear how we can even go about implementing this so that is not completely in verbal pseudocode here. So, here was so far our best implementation. Best only and as much as it is dynamic and can grow to fit as many things as we actually have even though we might still get some long chain. And in the very worse case, we might get a lot of Alice names, A names, A names, A names, so it could devolve into just a linked list but hopefully that's not actually the case. Well, how do we implement each of these things I called a node before? Well again, we just need a way of expressing that picture. So, here's a struct called node and inside of the curly braces, I have just 2 fields. One of those fields is going to be a string so to speak. So it's an array of characters and it's an array of characters each of which is of length, length plus 1. And because it's--we're in caps, I can infer that some constant define somewhere and it's, I don't know, maybe it's 16, maybe it's 128. It's some relatively reasonable if large constant that I'm assuming all of my words are shorter then. And the plus 1 is always for the--back slash zero for the null terminating character. So, this thing though is that for by pointer that allows me to chain S. Adams to someone else, to someone else, to someone else. And the reason that there's this apparent redundancy between the word node here and node here is again the rule of thumb is if you're creating a data structure inside of which you want to have a pointer to another one of your self as object--as a struct of the same type, you need to be able to express that with some word. So, if you say, struct node or struct foo or whatever you wanna call it, but for consistency I'd used the same, we can now because we've declared this here refer to it inside of the structure. But henceforth, thanks to the typedef we can just call this thing a node. So that's how we might actually implement that in code but let's actually now use the same idea, these pointers and this linked list so to speak and those concepts for Monday to start making more interesting structure. So, this is what in computer science, is generally known as a tree. It's kind of like a family tree where there's one patriarch or matriarch at the very top of the tree and then you have all the descendants beneath it. Well, what do these nodes each represent? Well, let's just toss some jargon out and we'll actually see the same jargon with HTML next week. So, anything at the top here, that has arrows emanating from it but not into, it is gonna be called the root node. So that's again the sort of oldest person in the family at the top of the tree. 2 and 3 here, we would generally say our children of node number 1, 4, 5, 6, 7. Our grandchildren of node number 1, but in turn, children of 2 and 3 and these things down here like 8 and 9 and even 5, 6, 7. Since we already have this tree analogy, well, this is going to be a--this is gonna be a leaves of the tree, anything that has arrows coming into it but not actually leaving it. So, these arrows, at this point in the story, just represent pointers. So, if you wanna create some kind of tree structure, you can actually use pointers to chain them together. So once you do this and have this ability to chain nodes together, we can actually start solving some problems in more sophisticated way. And again, the goal here of this weekend and 2 weeks ago in introducing this data structures is to not have to fall back on simple variables and simple arrays. We can now start building and wiring things together in memory. So this I claim is a binary search tree. It is conceptually identical to the notion of a tree that we just slap just some jargon on, but there's some key feature of this binary search tree. If you look at its numbers, what pattern do you see if any?
>> The leftmost is the smallest.
>> What's that?
>> The leftmost is the smallest.
>> Yeah. The leftmost is the smallest and conversely the rightmost is the biggest. It's actually a coincidence that they're all double digits here or identical digits. But notice that on the left are smaller numbers, on the right are small numbers, but more specifically, if you look at any specific node, its left child so to speak is less than it. And its child is greater than it. And with that recursive definition, we can infer that all of the nodes in this tree are indeed in sorted order from, so to speak, left to right. There're 22 and that's a child of 33 so 33 is indeed bigger than it. But 44 is the right child of 33 so he is bigger than 33, but 44 smaller than 55 because he is a left child descendant of 55, right? So, anytime you go left, everything to the left of 55 will be smaller. Anything to the right of 55 will be greater and that same definition can be replied recursively, so to speak, to 33 and 77. So, who cares? Why is this useful? Well, thus far we've been only using arrays but suppose that back in the day when we had all those pieces of paper on the board, we instead transform that array into a tree structure where we had a pointer pointing at nodes that were arranged in this way. Or just imagine students for instance standing awkwardly with these numbers but pointing at each other with multiple hands this time. What's the running time of doing search on a data structure that's called a binary search tree? How many steps does it take if there're N nodes or N numbers in this tree? So, it's actually log N. And what's nice here is that you can actually see this even if you're still s little rusty on logarithms. Logarithms in this class have generally involved dividing something by 2, dividing something by 2 again and again and again and again, dividing and conquering. Well, what are you doing in a picture like this? Well, if you take N nodes and you have one at a top but then each node has 2 children, 2 children, 2 children that's affectively the same as having, having, having the total height of this tree. So in other words, what's the height of this tree if you actually have 1, 2, 3, 4, 5, 6, 7 total nodes which is roughly 8 total nodes give or take. If I just have an off by 1 error. Well, it's actually log base 2 of 8 or 3. So, the height of this tree is 3 which mean if I wanna find like the number 44 and I start at the root of this tree which is just like a linked list, you start at one special pointer. In the context of a tree, you start at the top, the root. How do I find 44? Well, I first check. Is 44 less than or greater than 55? If it's less than, I go this way. But notice conceptually just like our phonebook that has the effect now snipping off this entire branch of the tree so I don't even need to go there anymore. So, even though it's not quite perfect division, the problem has just gone half, give or take, in size 'cause I now only consider the left branch. So, now I'm at 33. Is 44 less than or greater than 33? It's obviously greater than so now I chop off this branch of the tree. Not as exciting 'cause it's a tiny little leaf. But if you imagine now a much, much bigger picture, you're really pruning the size of this problem and huge branches, huge halves of problems are falling off with every descent into the tree. So how deep might 44 be? Log base 2 of N is the answer or just generally log N. So anytime you have a tree structure, you have this opportunity to really improve the performance of inserting things or finding things or even deleting things. And in fact, you probably at least heard the term "database" before. We'll use a database called MySQL toward the end of the semester and one of the key features of databases these days is to let you insert and delete and retrieve data really fast and they typically do this using tree structures like this. Database has actually used something a called a B-tree which is a little more sophisticated, but this idea is the same. You might have some node in memory pointing to some other nodes, to some other nodes, to some other nodes so these data structures might get kind of wide which is okay. They don't get very tall 'cause it's in the height that you end up wasting time searching deeper and deeper and deeper. So how do we implement this? So now it should be getting a little easier, right? Given a picture, you can kind of translate it directly to C code. Here is a new definition of a node, so I'm using the same word, "node." But inside of this node is gonna be an int N for 55, 44, 33, and so forth. And then each of these nodes needs a left pointer and a right pointer. And just to follow the logic, what is going to be the value of the left pointer and the pointer for any leaf in a tree? Just null, right, we already have a special value to represent that. So, we can literally build this kind of structure more sophisticated though it might be than anything we've done before using just these basic primitives, an int and a couple of pointers. So, let's continue this idea. Rather than have each of our nodes be a little tiny circle with just an int. What if I cram an entire array of pointers into a node? So this is what's called a trie, T-R-I-E.
>> The name apparently derives from retrieval even though people say "trie" so retrieval. But it's a trie and that's the plan word, it's a tree but it's used for retrieval. That's the quick--completely mangled history of the tries. So, a trie is a data structure that ultimately that looks like this. Each of its nodes is actually an array and typically, each of its arrays represents a letter of the alphabet. And it's an over simplification 'cause sometimes you need to deal with punctuation and what not. But a very common approach to implementing a dictionary for something like a website or spellchecker, is to use this trie data structure whereby each node again has an array, an array of pointers so that if you have a word like "Alice" how do you store Alice into a trie? Well, you start at this so-called "roots" which is the rectangle of the very top of the picture. You index into the left most elements of the array for A, then that brings you to a new node. Then you index into the L location then the I location then the C location then the E location, and you sort of implicitly store Alice in the tree by sort of marking the spot at the end of this path that you're taking from top to bottom in the trie saying, "Alice was here," so to speak. And now this picture has done it for what words? Well, we have a--turing on the right. So if we start with the letter T-U-R-I-N-G, all the way over there on the right. Now, realize this picture, it would be impossible to read if they drew all of the arrays so that's why they've chopped them off. So this refers to Alan Turing, a very famous computer scientist, the so called father of computer scientist, also take CS121 for more information. Who's all the way on the left? We have Maxwell, so, Maxwell's equation, M-A-X-W-E-L-L. And what's with these triangles? This is just arbitrary symbology here. What does that probably represent? You know of some sort, it's the sort of like conceptual check mark like Maxwell was here and Turing was here. And so that we have to somehow remember that a name stops here. So, the goal here is to insert a whole bunch of names or more generally in the dictionary, 150,000 words into a data structure. And what's curious here is that we don't actually store the names. Instead, we implicitly store the names by way of these pointers 'cause again, you have 26 pointers or maybe 27 or more if you have apostrophes or other punctuation that appears in names. You kind of walk through this tree allocating these nodes, as soon as you need them, and then just check off at the bottom or with this delta symbol, "Someone was here." So you implicitly store among all of these pointers some kind of identifier. So what's really powerful here? What's the running time of looking up someone in this data structure if the name is Maxwell? Well, how many steps does it take to answer the question, is Maxwell in this data structure? If you start at the top, you go M-A-X-W-E-L-L and maybe one more just to get to that special sentinel value, so 8 steps. Now, contrast this with one of our earlier designs which was actually pretty fancy at the time, a hash table with separate chaining, how many steps might it take here? Well, it might actually tale big-O of N over K 'cause again it doesn't matter how big or small Maxwell's name is here. The running time of a--a hash table seems more dependent on the number of other words in the data structure, right? Because if there's lots of Alices involved, the chain might be long. But in the case of a trie, what's kind of neat is that by construction, no one is in Maxwell's way as you kind of weave your way to the tree. So the running time of a trie for looking up a name and I need to come up an arbitrary new letter. Let's just call it M, big-O of M. But what does M represent? The length of the word, the name. And here's what's kind of neat asymptotically. If you assume that there is a finite limit on the length of people's name or just words in the English dictionary and that really is the case. You can search through the whole dictionary and figure out that their biggest word is of this many characters, 48 characters or something crazy, crazy large. Well technically then, M is a constant. So if it's a constant, we throw away constants, which means big-O of M is equivalent to big-O of 1 and here we have it, this holy grail of data structures, right? We can now look up words that are of some length but bounded length in effectively constant time. And heck, even if it's not constant time, it's like M-A-X-W-E-L-L plus 1. So, like 8 steps to find a letter, whereas before, if I had a million words in my dictionary or 150,000 words in my dictionary, I could maybe divide that by K, 26. I'm guessing that's gonna be more steps than just 8 as in this case. Now, this is not necessarily, unfortunately there's always a catch, the direction you wanna go in when implementing the fastest possible spell checker for instance because what are you paying to get this seemingly amazing performance? What's this feature costing you?
>> Memory.
>> Memory, right? This picture is kind of misleading and that they only show the letters or really the pointers that you're actually using. But what's not shown in this picture? For every one of those nodes, it's currently not showing the 25 other pointers that represent the other 25 letters of the alphabet that you're just wasting, right? There's absolutely words that just don't exist. So you might not have a word like A, A, A, B, B, C, Q, Q, Q, like random sequences of letters that technically you could follow pointers through the picture 'cause those might be prefixes of other words but there's no little check mark at the bottom, there's no triangle. And yet you have to spend that memory if you're using arrays. Well, okay, well, arrays are--solutions to arrays is always, well, if you're wasting space with an array, use what data structure instead? Like a linked list so in each of this node, wait a minute. Let's not waste 26 or more pointers for every letter. Let's just use one node per letter but the problem with linked list is that you give up random access. You can't just jump to the letter A, the letter B, letter C. If you devolve it back into a linked list, now you have to search. So then the process becomes not so much going from top to bottom like this as in the case of Maxwell. You instead have to kind of go left to right then down, left to right, down, left to right. So you have a lot of lateral searching which is gonna cost you time but save you space. And it turns out in terms of implementation, hash tables can in fact, even though theoretically and asymptotically they're much slower than constant time in reality because tries use so much memory, they can actually perform much more slowly. But really for P set 6 every year with the fastest spell checker possible, you typically end up choosing between one of these two data structures, hash table with chaining or trie and then deciding whether or not it was a good decision or how you can improve upon it. So, when a trie, how might we represent this notion of a check marker, that little triangle? Well, let's just call it a bull [phonetic], right? It is a word. Yes or no, Maxwell was here or Turing with here. If it's falls, there's no word that stops here. If it's true, there was a word and then here, struct no children. Well, notice I'm borrowing the lexicon of like a family tree where a node has a bunch of children. Each of which has a pointer to another node but 27 of them, why 27? Well, in this case as you'll see in this particular P set. We allow for alphabetical characters and then also apostrophes which are actually pretty common in English words. But we disallow other crazy punctuation that does exist in some words but isn't necessarily common. So we can solve though even more interesting problems here now that we have trees. So, Morse code is something you're probably generally familiar with as a kid, and you can bang out messages. You've seen these dots and dashes or sounds that represent short dots or long dashes. But the problem with Morse code is that it's not immediately decodable. In other words, if I give you the sounds or I show you the symbol, let's say dot slash or beep, beeeep. What did I just spell out?
>> A
>> Okay. You say A and you might be right or--
>> ET. So, Morse code is not immediately decodable because there're ambiguities that arise. If you have dot slash, it could be a dot and a slash or it could be dot slash or dot dash. So, how did they solve this? Well, the military folks who had hit the little beepy-dippy--in the old movies, well, they would just pause essentially between words or you get good enough at it that you just realize that in the context, obviously, you spelled the letter A not ET. You just got good at it. But fundamentally and certainly if a program were parsing this information, you need a split second pause. What's the downside there? Well, even though Morse code is just dots and dashes, there's actually a third digit which is pause, which is not pictured if you need to kinda delay yourself. Why is that bad? Well, it just kind of slows things down or leads to ambiguities. But it turns out you don't need these ambiguities. You can actually encode messages being very succinct, right? This should--what's neat about Morse code is relatively speaking how short this sequence is. And yet all this time in CS50, we've been using 8 bits to represent every letter of the alphabet whether it's a capital A for 65 or 97 for lower case A. I mean, they, the Morse code folks wheedled this down, the letter E, to a single dot which is effectively like a single bit. And it's no coincidence that E is pretty short, T is pretty short, I is pretty short and yet Z is kind of long and Y I kind of long 'cause again, the dashes represent longer tones of sound. So, what did they do? They optimized for the more common cases. E is pretty common, A is pretty common so they have relatively shorter sequences than J or Y or Z. So this suggests some form of compression. And we don't seem to have learned from Mr. Morse here whereby for ASCII and for computer encoding of letters, we've been using 8 bits this whole time. And the result of using 8 bits is obviously every letter takes up of 8 bits and so often, we're just wasting a lot of space, right?
>> If the letters of the alphabet only go from 65 up to 26 plus that and the 97 plus that, and you never use some of the crazy punctuation on the keyboard, let alone some of the non-printable characters, we're just wasting bits. So, what do we want to--what could we do instead? Well, let's try using one of these data structures in sort of a novel way to compress information and in turn, answer the question, how do you actually compress information? When you use zip or the compress feature in Windows or Mac OS like what is actually going on, well, if it's a text file, it might very well be something like this. So let's come up with a completely arbitrary text just to have a simple discussion point. So, this string here E, C, A, it's completely random. There's no pattern there. But what is noteworthy about it is that it's only composed of A's and B's and C's and D's and E's and if you add up all the letters in that crazy long string, which would normally take up 8 bits times the length of that string, you see that 20 percent of the letters are A's, hence the point 2, 10 percent are B's, 10 percent are C's, 15 percent D's and 45 percent E's. So already, if we were going to reinvent the ASCII system and use fewer than 8 bits to represent each of these letters, A through E, which one would you probably wanna use the fewest bits for it intuitively? So, E. So hopefully we can come up with a system, maybe using today's building blocks that somehow gives us very short sequences of bits for E's and, okay fine, we'll spend a few more bits on B's and C's than we would on A's or E's. In short, can we do better than ASCII? Can we compress information and thus save disk space? Well, let's try this. We have this ability now to talk about trees and also build trees using simple C structs. So, let's go ahead and draw 5 trees but 5 stupid trees. Trees that are only single node. So, they're all leaves, they're all roots but they are nonetheless trees. They just have no children, each of these five trees. And just to keep me saying I'm gonna put inside of each of these nodes its current weight based on the original weights in the sentence. Now, just to be clear, even though we're about to compress something completely inane like this, now imagine this is actually a large movie script or the King James Bible or some huge body of English text that is nonetheless implemented with ASCII just like this. And there are too, there's gonna be patterns, right? The word "the" pretty darn common in English and there's definitely common letters like E and so forth. So even though this is contrived, there is definitely statistical patterns in English words. So here are my 5 simple trees. I wanna go ahead and build one big tree out of each of these nodes. So you know what I'm gonna do. I'm gonna go ahead and combine the two smallest nodes possible at every decision point. So, the smallest nodes right now are 10 percent and 10 percent, point 1 and point 1. So I'm gonna go ahead and merge those into a new tree by creating a new root outdating its left and its right pointer to point to those guys and again to keep myself saying, I'm going to draw inside of the new node, the total of the old node. So point 1 plus point 1 gives me point 2. And notice just because I kind of know where this is going, I'm gonna start labeling the edges I'm drawing. I'm gonna arbitrarily but consistently say a left child's edge or arrow just represents a 0 bit and the right arrow represents a 1 bit, just because. So now let's repeat. So now let's combine two of the smallest. We actually have a little discretion here 'cause I've got point 2 as the root, point 1, 5 and point 2 is the root. I'm gonna go ahead and break this tie arbitrarily but consistently. And I'm gonna go ahead and combine that point 2 on the left with point 1, 5 on the right. Create a new node whose sum is point 2 plus point 1, 5, so that's point 3, 5 and I similarly label the edges. And now, just to make sure we're on the same page, what two nodes get combined next? We have point 3, 5 and point 2 'cause those are both smaller than point 4, 5. So, the picture is growing and now I realize I actually put the nodes in the right place so we get a pretty picture. But even if they weren't aligned in a pretty way, you would still get the same tree fundamentally. Let's now combine the final two nodes and now I claim I have a tree. This is actually called the Huffman tree, named after a computer scientist named Mr. Huffman who then proposed that using this policy of assigning 0 bits on the left edges and 1 bits on the right edges, you now have your own custom alternative to ASCII. What this system proposes now is that I'm going to start using for the letter A, B, C, and D, and I'll switch over to the chalkboard in just a second. The following encodings, if I am going to represent the letter A or B or C or D or E, I'm gonna start at the root and figure out what pattern of bit gets me to it. So let's start with A. I start at the root to get to A. I go left which gives me a 0, then I go to the right which gives me a 1. So I claim henceforth. Forget about this crazy 8-bit ASCII system. I'm gonna use 2 bits for the letter A, 0 and 1. For the letter B, what pattern of bits will I use?
[ Inaudible Remark ]
>> Yeah, 4 zeros. Again, if not quite following, you start with the root, we just find B and to get there, we go through a 0, 0, 0, 0. So, B is encoding, it's gonna be a little longer, still better than 8 bits but it's 4 bits this time. So, C is going to be 0, 0, 0, 0, 1 'cause I'm just following it, following the root down to it. D is gonna be 0, 0, 1 and E very nicely is going to be 1. And now, what's neat about this and I'll put this on the screen which maps that tree to numbers, what's neat about this is 1, we've optimized for the really common case followed by the next common cases and we're willing to spend a few more bits on the les common cases. But unlike Morse code, notice, none of these things are prefixes of one another. In other words, if I draw the numbers or let's say, 0, 1, 0, 0, 0, 0, what does this decode to? Well notice there is no ambiguity. If I start with 0, could be any one of these but if I then immediately see a 1, it can only be in A and that's pretty compelling. There's no ambiguity which means I don't need a break, I don't need to waste [inaudible] pause here or here's a gap between letters. Now I see another 0, it could be any one of these. I see 3 zeros, okay, it could any one of these three. I see 3 zeros, I've now whittled it down to this but then notice 4 zeros means aha, this has gotta be B. So, by contrast, Huffman's coding scheme is immediately decodable which means it is very efficient in that you don't waste bits or time in between letters. So, what's the takeaway? Well now, the takeaway is quite simple and you can now encode A's and B's and C's and D's and E's with far fewer than 8 bits. But what's the catch? If you're going to actually encode letters in this different way and you're gonna kind of shrug of ASCII, what do you have--what's the price you're paying? 'Cause right if Mr. Huffman had this brilliant insight and we're still using ASCII, there's gotta be a catch to this. So, why are we not all just using this scheme?
[ Inaudible Remark ]
>> Exactly, right. I'm kind naively assuming, well, if my world only has A's, B's, C's, D's and E's, well I don't really need to use ASCII because I have such a smaller address space so, of course I can use fewer than 8 bits. But if you actually wanna be able to let someone type any character on the keyboard, you're gonna eventually need to start spending more bits. So, what is really is, the takeaway really is that if you have patterns in your data, common letters or common words, you actually do have an opportunity then to compress data. And the fun thing about data is that there's almost always patterns, right? Even in things like music, there's choruses and parts of songs that repeat. And if you can leverage those commonalities and say, do I really need to store this chorus part of the song 3 different times just because the artist sings it 3 different times? Why not just store it once and then somehow digitally remember times 3 which you can presumably express much more succinctly. We can then save bits that way. And in fact we can. So just to be clear, these nodes we're now drawing, again, we can come up with kind of a simple C structure where we store the symbol that we're representing, the frequency as an int or as a percentage, as a flow. We can do it either way but generally, again, using ints instead of flows avoids issues of precision and accuracy. And then we need a pointer to the left or right but let's now map this same idea, not just the arcane text in A's, B's and C's and D's. Suppose we consider like the German flag here vis-a-vis say the French flag which is nice in its simplicity too in that where as these guys have horizontal bands, these guys have vertical bands. Any thoughts as to which of these 2 flags if stored as a GIF in internet or a file format, which one could compress more? They're pretty--they are the same size even if I didn't quite resize them correctly. They're the same number of pixels, which one might compress better? The French, yeah, it's kind of nice but this German's kind of redundant too from left to right. It turns out you can't really notice unless you know the GIF specification. But there are opportunities here clearly to compress, right? Why? Frankly, I could represent this flag with like 3 pixels and then just say repeat some number of times. Same deal for the French flag. Well, the authors of the GIF file format, G-I-F decided that, you know what? I'm just like in BMPs whereby images are represented as scan lines from left to right. What we're really gonna do is recognize that, alright. The top black band of this flag, let's just say that the top left pixel is black and then say repeat for the rest of the scan line. Then the next row, we say black, repeat, black repeat, black repeat then eventually we get to red and then we just red repeat, red repeat. So, it turns out the French flag if saved as GIF doesn't quite compress as well 'cause you have to keep changing your mind. You have to say blue repeat, white repeat, red repeat and then again. So, you're still compressing it but you can't repeat as long chunks as possible because just somewhat arbitrarily the GIF authors decided that we're gonna compress images effectively laterally.
>> But you can actually see this so I downloaded these same images into a folder and if you actually look at this, there's the same German flag, here's the same French flag so they look like the same size but look at their file sizes, and ignore the--I in the middle there. So, the French flag is twice as big, 8 kilobytes and that's simply because of how GIF is actually compressed. We can apply the same ideas, independent of these tree structures indeed whereby we can have say a picture of this apple, right? This apple, it's stored as a GIF, you do need some red and brown to represent the stem in the apple itself. But this blue background, you definitely don't need to remember all of those pixels. Instead, just remember some of them and then say, "repeat wherever it's white" or in the context, perhaps more a modernly with a movie. So, this is just a little cartoon of 2 scenes in a movie where on the top strip of film there. Notice the little RV, the truck and the house and the tree. Now in this simple little cartoon movie, the car is moving from left to right. But what's obviously not gonna change? Well, the tree's not moving and the house isn't moving, the background's not changing and so what can you do? Well actually in this case, the car, okay, so some of those things were moving. The tree and the house were effectively moving as the camera move but the sky was not changing, right? So, you can do the same thing. Throw away information that's not changing 'cause you definitely don't need to store in a movie the same picture again and again. And for those unfamiliar, a movie, a video, is pretty much like a bunch of images back to back to back to back, played really fast so it looks seamless. And here too, in the bottom, you're seeing a same idea where you can do interframe compression, whereby, you know what? If we know the car is starting over there on the left and then it's gonna end up there over on the right, do we really need to watch as the car does this or can't we with algorithms kind of predict where the car is gonna be and then throw away what might otherwise be useless information. And you can see it here too. So, this is a picture on top of an uncompressed video of a bee flying over some flowers and then at the bottom you see, well, the flowers aren't changing, only the bees moving, let's just throw away the background, throw away the purple and the blue and just keep track in our file where that bee is actually is. So, where are we going with all this? Well, we now have the ability to have images. You have in your appliance the ability to actually make webpages. Since you actually have inside of your appliance your own web server and your own database server as you'll see. I'm gonna go into the CS50 account here just for a moment and use a program that let's meet edit, the course's own website. So, you might recall from our course's website, we often have things like hungry RSVP, don't submit, submit this form and so forth. So this is HTML. We saw divs before, we saw angle brackets before. I didn't say anything about what they mean but I thought it would be fun to maybe just update this in real time as an example of where we can go. I'm gonna create another division of the page like this and I'm gonna say the style of this division I want the background color for this part of the website to be black that happens to represent black or I could just say, literally black. I wanna have some padding around it, 10 pixels worth which is a bunch of white space around it. I'm gonna go ahead and align everything in the center here and that feels good. I'm gonna go down div. I'm gonna create what's called a line break below that in whatever else is there. I'm gonna go ahead and do image. IMG is image, source equals, let's see, I called it ice.ping like this. Let me go ahead and actually do this. I'm gonna surround this image tag with A, H ref it goes go quote, unquote, dub, dub, dub--dot Facebook dot com slash Rboden91 [phonetic]. You've be surprised how many of your friends they made---the first we did this. Alright, so now let me go ahead and save that. And if I made no typos, here's before, here's after, oh, damn it. [Noise] What it's called? Not, they took it away. Ice.ping, ice--no, what did I call the image, hang on. Ice.ping, hold on. Where is the image? Damn it! [Laughter] Very seamless demo here. Okay, standby. This is a trick that you'll learn before long. Alright, I've just transferred the file and I'm gonna move the file into here. I'm gonna change my application. I'm gonna go over here. [Laughter] Damn it! What re--what is it David? Oh wait, ice--ice.ping. Oh, this is just catching issues. Okay hold on. [Laughter] Thank you!
[ Laughter ]
[ Applause ]
>> We will see you next week.