>> Alright. Welcome back to CS50. This is the end of week seven. So, there's one handout today. It might feel a little early, though granted the calendar has changed. So, it's going to be a little earlier. But I wanted to draw your attention to one of the URLs that's listed on this handout. So, CS50's final project is, as we say in the syllabus, the climax of this course. It's going to be the time where you guys get to choose what you do, how you do it, and what you ultimately make of the course. And these photos that we've been showing are of this thing called the CS50 fair, which we actually started for the very first time last year. And, it was actually, for us, a whole lot of fun. We had, again, all three hundred plus students last year. We had six hundred plus of their friends and their family, and random folks on campus, join us in one of the science buildings down the road. And, it was just an exhibition, and an opportunity to hang out, chat, and to see on each other's laptops exactly what you pulled off this semester. Now you might be feeling that thus far we've been handing you most of what you're capable of. So, we gave you a framework for p set three and for p set four and a little bit for p set five. So, we're going to have to come full circle and take those training wheels back off by the end of the semester. And, you will be able to do that, because the last two p sets of the course, seven and eight, will be web-focused, which caters in part to the interest among students in recent years to do web-based projects. And odds are you guys are going to want to go off on your own, quite possibly, and do even more than the course itself officially covers. And, so the teaching fellows, CAs, and some alumni and graduate students have kindly offered to put together this suite of seminars. So, these seminars are optional. They will be scheduled in early November based on the level of interest and on your actual schedules. More on that to come.
We ask right now that you RSVP for any of these things of interest. There's eleven on the board. Just to skim a couple, iPhone development has already proved popular. Though you probably won't make in just a couple of weeks the most amazing iPhone application, you can absolutely do something that excites you. You can certainly build on it if you choose in, say, spring term, and a seminar like this will help get you started. Those of you interested in making Facebook apps that your friends might actually use, we'll do a seminar on that as well. And then a whole bunch of other topics; another is this Android cell phone platform. And let me make explicit mention of this, which we did on Monday already: Google very kindly donated some cell phones to the course. We are happy to give these to those students who would be interested in tackling Android projects, but do go to the appropriate link here and find out who to contact, namely me and one of these CAs, if you would like to tackle an Android-specific project. And we will attach a couple of conditions, like you really should have T-Mobile already, or AT&T, because again these are GSM phones. But, if you're interested, express an interest now as per those directions and we'll get you started. There's a whole diversity of projects that students did last year. If you get a chance at some point, go to the fair's page where we have a bunch of videos. And, I see some of the TFs have started enhancing their privacy settings. So, some of these videos are no longer accessible. So, we'll fix that. We have archives. But, you'll also see at the top of this page a link to last year's program. So, it's rather inspiring, if not maybe admittedly a little intimidating at this point in the course. That's the wrong link. This program here, this is the official program that we handed out to attendees last year.
And, some of the specifics now you shouldn't care about, but if you skim this, for instance these charts, you'll see what kinds of languages students last year actually used. So, yeah, the course teaches C, the course teaches a bit of PHP and JavaScript. But, your predecessors bit off a whole bunch of other stuff, some of which they came into the course with, some of which they absolutely did not. And, you can actually see titles and descriptions for all of last year's projects here, and you probably will see some familiar friends' names. So, do check that out. It's on the fair page. And, do RSVP per the directions on the seminars page as soon as you get a chance. Let's see, any questions about that? OK, so even though p set five isn't yet due, it's been striking how many surveys have already been submitted for problem set five. I dare say it's a lot more fun and procrastination-worthy to click little bubbles and fill out circles and stuff online than to actually do the p sets. We've got, like, two hundred plus surveys that have already been submitted versus, like, five problem sets that have actually come in already. But, that's OK. But, it meant I was able to start reading through these things, because all the better I think for me if we start absorbing some of your feedback, good and bad, and see if we can't make the best of the remaining weeks in the course. I will personally make sure to read all of these surveys that still come in. So, don't think that that gate has closed, that that ship has sailed. But I thought I would excerpt some of the more serious comments that were posted and then also some of the more memorable ones. One student, one classmate, noted in answer to the question, what have you learned in the class? Well, we always enjoy things like this. So, that's good. But this one was clever, because we know the course is challenging, especially for a non-trivial percentage of the class that hasn't programmed before. This was just cute, I thought.
[ silence ]
[ laughter ] But what's sad is you understand this kind of humor now, so I'm not sure that's such a good thing. But, let me address a couple of things in seriousness. And, these are recurring themes. So, I was actually thinking, unrelated to the surveys, on Monday that it's actually pretty hard teaching in this space because it's very cavernous. Even I can hear my voice echoing in this room. And, it's hard I think as a teacher to actually reach out to you guys. I can kind of sense the audience, and maybe how you feel, as best I can. Like, you are in CS50 and this is just an awkward one-on-one conversation right now that you'll always remember now. But, it's hard in this room especially. And some classes that have been in here for years resort to this Oprah Winfrey-style approach of having mics in the audience, so you guys can stand up and ask your questions. I can't imagine anything more awkward. And, frankly, we'd get all of the gov majors in the room, perhaps, asking questions instead. I remember, I was one of them at one point. So, realize that I've actually been feeling this frustration myself, where it's hard to kind of connect in this room. I very often feel like those Simpsons moments where Skinner is standing up on stage. He says something stupid and the whole middle school is silent, except for that awkward [cough] in the background that makes you realize just how awkward it is. So, I hope you guys will raise your hands more often. Or, and I'll address this with some specific comments, raise your hands not when you have answers, but when you have questions. And, I'll admit it's hard for me to hear things up here. And, I realize it's a little awkward to say you don't understand something, because you literally have to speak loudly so I hear it. But, I'm hoping that we can address some of these comments by just making this even more interactive.
And so that you don't feel like you're exiting lecture having completely missed something, because there's no point in my plowing ahead or teaching something that I think I'm teaching well when clearly I'm not. So, please help me fix that by raising hands as often as you are inclined. If your hand is one of those going up too often, don't worry, I'll start to focus more on these students than, say, those, or something like that. But the workload in the course, so this is one of four themes that has already come up in the surveys and I'm sure will come up in the remaining couple of days. So, this course is a lot of work. With that said, this is all relative, because we often hear this frankly from freshmen and sophomores for whom this is a lot of work. And, I think perspectives do change over the course of time. So, keep that in mind. But, this is very much intentional, and as I said in the first week of the course, I mean this really is kind of one of those courses, one of those fields, where you really only learn by doing. And, if we were handing you problem sets that were half as long, or a quarter as long, I mean I dare say that you would get half as much out of the course as experienced, or a quarter as much. And, we like to think that this course, by contrast, is, yes, one that absolutely expects a lot of you during the course of the semester, and for some really squeezes your brain. But whether it's going to be in January of 2010 that you look back and feel, damn, that was for the best, or even several years from now when you look back on this course and maybe others that are kind of kicking your ass, as you may feel right now, I think that's frankly why we, why you, are all kind of here. I know I don't want to sound all too lofty. I realize that all of us have way too much going on, right? Typical Harvard students, myself included, you've probably bitten off way too many extracurriculars. Yes, you have three other classes.
So, realize if nothing else we appreciate where you're coming from, but this is why there's sixty staff members for the course, and why there's a bulletin board, and office hours, and all of this other stuff, because we certainly are going to get you through the experience. So, realize we hear you, but please do at least meet us halfway with those resources. OK, grades too. Quick on this. So, being Harvard students too, grades are of predominant interest in many courses. It took me years, until graduate school, to kind of calm down. I was absolutely one of those kids in high school, and even through college, that, like, work came first. And frankly I didn't even like school all that much. I just was kind of conditioned as a kid to work hard, do my work, you know, get my As. And, ironically it probably helped me get into grad school, and get me where I am today. But, it wasn't until grad school, having worked for three years after college, that I realized there's kind of more important things than worrying about grades. So, not doing well, but worrying about grades. And, I say this because this course does take a somewhat different approach from a lot of the science courses, math courses, where it pretty much is right or wrong; check or uncheck; plus or minus; where you can get a very quantitative response. I know some of you, per your own feedback, have been feeling frustrations, and this happens every semester certainly that I've been teaching the course, where you feel, you know, hey, I have undergrads being my teaching fellows. I'm getting very subjective, very disparate feedback from all of them. And, I know I've said many times that we will take all of this into account. And you can talk to past students about just how many of them were disappointed or not with their actual grades for some data points. But realize we view the course, so far as grading goes, much more like a humanities course.
Where if you're trying to write an essay it's really hard to say this is 95 percent or 75 percent. We could do this with problem sets, like check or plus or minus. You got this correct in this metric, right? But you hopefully realize now, if not eventually, that programming and solving problems is not just about yes or no, is it right. It's about how good it is. How fast your code is, how intelligent your code is. And, honestly these are things that can only be evaluated subjectively and qualitatively, with handwritten or typed feedback, and less with numbers. So, realize that's very much a design decision. I am quite certain that we will never convince some of you in the course of just this ideology when it comes to grading, but do realize, and perhaps consult past students, that it does tend I think to work out for the best. And, please speak with me personally if you are concerned about any specific scores. So, with that said, my availability: no one ever comes. So, this is linked on our course's homepage, my own URL. So, these, OK. That's maybe why no one comes.
[ laughter ] Alright. So, these are my office hours.
[ laughter ] So, please come. If you feel that I'm a little too aloof, or just this big voice on stage, or, God forbid, like, worse, scary or intimidating, just please drop by. Drop me a note, because frankly I tend not to be there otherwise. But, I will be there if you want me to be there. I essentially block off these and other times to sit down and chat. So, reach out. There's no reason I should not be accessible. Interesting. Alright. And, finally, so, now I will go back to a couple of actual comments. I'm not the only one hearing that, right?
[ laughter ] Interesting.
[ laughter ] Alright. Maybe it's this live wire. The iPhone has fixed it. Alright. So, these were some comments that were directed at lectures and content, because one theme that's very common in this course, and frankly in all the courses I've taught for ten years, is the rapidity of my voice, or the speed of class. And, I don't necessarily think that moving fast, or relatively fast, is a bad thing, because the flip side of course is that the course can risk going too slowly, and this too is one of those things where there is no chance that we're going to please everyone in this course. Much like, it will be very difficult to ever get 100% consensus that, yeah, the course is going perfectly. So, we do our best frankly to displease this half of the class and displease this half of the class, and so everything kind of averages out. I mean that's kind of the way of these things. So, we try to walk this fine line between going too fast and going too slow. Because the downside of course is if we devolve into something too slow then, you know, then there's other people who don't want to be here. But, this is the bigger question in my mind: so, we now compete, or people like me compete, with videotapes, like, of ourselves. And, I thought many times over the past few years, like, what really is the point of lectures? Because I would dare say that some of you, many of you, are here just because this is what you do in school: you go to class whether or not you feel that you're getting something out of lecture. And, those of you who are watching this at home on video, you've made a conscious choice that this is maybe not a good time of day to be awake, or a good experience. And, so this is a hard question, right? Because what is the point of lecture if you can watch this thing on video, if we have scribe notes detailing what happened, if we have slides that predict what would happen, if we have code of everything, right?
Even I ask myself this: what is the point of all of us being here in this room? And, so the answers I've come up with over time are, one, it's got to be engaging, ideally. Right? Otherwise who cares? Why come? Even I walk out of lectures all the time if it doesn't maintain my interest levels these days, because I have many, many other things more interesting or more challenging, stimulating that I would like to do. So, on the one hand my goal, whether it's with silly things like candy, or with fun demonstrations with humans on stage, is to actually make this an experience, and hopefully that's not something that you get solely by watching things on camera. But, two, the goal pedagogically is to set a framework, a mental model, each week so that when you do go into sections and you have those aha moments with your TFs, oh, that's how you implement this, or, that's how it works, that at least you've been prepped with sort of a big picture understanding of what is a data structure, or what are linked lists? You know, in general, what problems do they solve? And then it's when you actually get into these more intimate environments with your peers that you can really, you know, sink your teeth into something more concrete. So, there's a line too that I try to walk carefully between dwelling too much on code and syntax, which I know can get boring quickly, versus not doing enough of that, so that you walk out of here thinking, what the heck just went on? And so these are the kind of comments that give me pause and definitely make me, you know, worry. And granted there's other comments. There's good comments that balance these things out, hopefully on the whole. But, like, there's one thank you that I'm nice.
[ laughter ] But, two, something like this. Like, we can address, right? Because hands can go up and I can pause more frequently. So, realize that I'm happy to try to address these kinds of comments. You know, this too is worrisome, because this too we should fix. Because if you're not getting it, like, what the hell is the point of even being here, honestly? I don't want to be here if you're getting nothing out of this. We don't take attendance. We don't care if you're here. We want you to be here. And this, frankly, maybe this person was more a spokesperson for others. I mean these are the three comments particularly that I'd like to address starting now, as best we can. So, whether the lectures have been good, or sucked, I'd like them to be better moving forward. So, please at least meet me halfway by trying to raise hands, engage, stop me if I'm ever doing something stupid like a big white screen, certainly, but also if it's just something that's gone over your head. Now, let me counterbalance the seriousness, the serious tone, with this. I'll offer this with no comment.
[ laughter ] So,
[ laughter ] So, you're welcome, whoever that was. Happy birthday. And, this, I know, my God, it was like a six page or so survey. This one was just a cute note to end on. So, I'll end our discussion of that here. So, let me turn our attention to where we left off in terms of content. And try to do this, try to give you a sense of why are we here? Where have we been? Where have we been going? So, in my mind in the course, and I worry that sometimes the syllabus doesn't make this quite clear, is we started the course with just setting some basic mental fundamentals, like what's a loop? What's a condition? And, some of you knew that. But, we had to do it because there's a lot of students in the class that didn't have that. And, that's why we use something like Scratch. And, then weeks one and two were kind of syntax oriented. What's a for loop in C? What's a while loop? How do you compile? How do you debug? Stuff that, again, is kind of interesting. It's kind of geeky, but, too, we have to get that out of the way. And, hopefully we wrapped those kinds of, that kind of material in some domain-specific packaging. Like the crypto, which frankly makes for loops just a lot more interesting, or a lot more compelling, than a problem set that says implement the for loop that prints one through one hundred, which kind of gets to the same core material, but certainly isn't as interesting. Then in week four we introduced, we started talking more about arrays, and memory management. So, now things are getting at least technically more interesting, and certainly more difficult. And we started talking about data structures, very minimalist data structures. We had arrays. We had the C notion of a struct, where you can put multiple variables inside of some new data type. So, we had the basics of data structures.
Then in week five and now in week seven we take things up a notch further and talk about more sophisticated data structures, because you can't really get through life, you can't really get through problem-solving in CS, or in the sciences, or in any realm of life where coding is useful, without having more sophisticated tools in your toolkit. So we introduced a couple of weeks ago linked lists, which allow you to have data structures and storage that grows dynamically. But, we had this linearity problem. It's still pretty slow to search something that just has arrow after arrow, after arrow, because you can only go from left to right, or maybe right to left if it's doubly linked. But, you have this linear problem. So, this week, on Monday and today, the goal is to introduce some more clever solutions to the problem of storing data and getting at it quickly. And, the ideal, the holy grail of data structures, is one that grows infinitely, I would say, right? There's no fixed arbitrary upper bound like you have with arrays, but one where your search time, your insertion time, your deletion, all of those basic operations, are blindingly fast. Constant time, even. And, so that's when we introduce this idea of a hash table. So, a hash table is kind of the Swiss army knife of data structures, because it's supposed to address ideally precisely that problem. And, we're going to see hash tables, this topic today, again in the weeks to come when we look at web programming. The goal of the end of the course, realize, is to, one, make you realize we did not learn just C. We learned ideas that transcend this particular language. And, we'll look at PHP and JavaScript, both of which use hash tables. They call them something different, but they have them.
And, in the remaining weeks of the course we really focus on real world problems, problems frankly that you might be very inclined to encounter in your own lives, whether it's for personal fun or start-ups, or for data sets in other classes you need to process, or just stupid stuff like you run a student group and you've got a really long list of people who've registered and you really don't want to manually email a hundred people, or manually, you know, aggregate this data into some other format. You can write a script to do that. So, we're going to try to use these basic ideas from weeks one through eight and conclude the course with a look at these things that are more real world. And, the final project is meant to embrace that finale of real world applications. So, hash tables. So, what is the point here? So it would be really nice if we had a data structure that didn't have problems like arrays, which are a fixed length, right? They kind of suck, because it meant if we ever ran out of space what did we have to do?
>> [inaudible]
>> So, we had to reallocate. And reallocation, even though you haven't really felt it given the size of the programs we've been writing, is kind of slow. Anytime you have to interact with the computer itself, or the operating system, and say give me more memory, relative to other things, that's slow. So, it's ideal in principle to avoid having to allocate memory and deallocate memory, because you're just wasting time by not having had sufficient foresight. But, of course the flip side is you can say to the OS, well, give me an array that's two gigabytes large. I'm never going to need more than that. But the downside there is you can only run, like, one program at a time. So, again, the theme in computer science, or in programming specifically, is these trade-offs. Nothing comes for free. You pay for something with time, or with space, or with your own time, the developer's time, or just with your own, you know, mental anguish. There are things that are harder to implement, but my God, when they work, e.g. Google, they work really well. And, that's not something they implemented on a Thursday night. So, again, it's all trade-offs, whether it's machine related or human related. So, a hash table is meant to address that problem of arrays. They're fixed length. It's expensive to reallocate and deallocate them. It's just stupid to have to copy your data around just to grow the space. But that's why we introduced linked lists, right? Linked lists are these dynamic structures. So, even though I kind of answered my own question a bit ago, what's a problem with linked lists? With every solution we introduce a new problem, it seems. What's a downside of a linked list?
>> [inaudible]
>> So, it takes a long time to search it. If I want to search for something, well, let's contextualize this. What might I put in a linked list? Anything. We put numbers because that's simple, but you can imagine implementing a database of students for the registrar. Well, each of those linked list nodes contains a student. What's a student? ID number, phone number, grades, I mean all of this kind of stuff. So, there's these objects in memory, but searching them requires what, how much time to find Joe Smith in a linked list of 65,000 Harvard undergraduates? What's the running time of search on a linked list? So big O of N steps, right? Because in the worst case Smith is going to be at the end of the list. Or even if they're alphabetized, he's certainly not going to be right at the middle. So, worst case an arbitrary student is going to require big O of N steps. Well, wait a minute. We solved this in, like, week two. Why don't we just use binary search on linked lists if they're sorted? Go ahead.
>> [inaudible]
>> OK, so they are sorted, but there is another problem. Yeah?
>> [inaudible]
>> Yeah, you can't access the middle of it. With an array it was very easy. You know this end, which is bracket zero, and this end, which is bracket N minus one. You can figure out the middle by doing N over 2, and bam, you can go right there. So, arrays are random access. What happens if you take the address of this node in a linked list and subtract the address of this node in the linked list and then divide by two? Where do you end up? So, numerically you're right in the middle conceptually, but you have no idea where any of those linked list nodes might be, because the whole point of allocating them with malloc is that you're just handed a chunk of memory. But it can come from anywhere in the computer's RAM. It's not necessarily going to be contiguous. So, again, we get this upside of dynamism. Now we don't have to worry about painting ourselves into a corner in terms of space, but we now have this problem that we really hurt our search time. And linear search kind of sucks, especially when your data set starts to get larger and larger. But, my God, if there's this data structure that's been promised that gives us constant time, whereby if I have a new student to insert, and let's see. I'm going to need, I won't use students here. We'll use lollipop flavors to represent distinct elements. So, what I really want is a data structure where if I, some new student matriculates, this is not going to work very well. I really wanted a data structure where the lollipops would sit in the top, but it didn't work out. But we want essentially a black box. And, this is a common theme in computer science, if you've not heard the term before. Something that takes input, produces output. You don't really care how it works. You just care how well it works. Or, how quickly it works. So my goal is to insert Joe Smith into this black box, because later I want to be able to ask questions of the form, is Joe Smith registered at Harvard?
Or more specifically, give me Joe Smith's student record so I can look up his phone number or his email address. So, you want to be able to insert things into the data structure ideally in constant time; get in and get out. And then if you want that same element back, you want to be able to ask that question just as quickly; is Joe Smith in here? If so, give him back to me. So this constant time is kind of, again, the holy grail, but how do you even begin to implement something like that? Well, we already teased you with an answer to this question on Monday, right? We used an array. We used a little picture that looked like this. Ideally, what's the ideal, so suppose we have this data structure. Let's use an array, because it's so simple. I know how to declare it. There's no fanciness. We did this weeks ago. But I decide there's only going to be twenty-six students. Well, the ideal student body would be that Harvard admits each year only one person with a last name that starts with A. And one student whose name starts with B, and C. Because then we can contrive what's known as the ideal hash function. We can implement a function literally in code that takes as input the name of the student and then returns as output a number, which is what bucket that student should go in. Now what do we mean by this? Well, we can actually whip something like this up. So, I'm just going to open a terminal window here. Let me increase my font size. I'm going to call this hash.c. And, I just need to write a quick and dirty function here that's going to, let's see, return an int. I'm going to call it hash, or h, or whatever you want to call it. It's going to take as input a student, or really just their name. So, I'll do a char *s to represent a string there. And the goal of this hash function is just to answer a question that the black box can use.
So, the black box is a piece of storage, but with it also is a hash function that tells it where to put its input, what location to put its input. So, how do we decide this? Well, first let's do a little sanity check, just to practice what we've been preaching. If s equals equals NULL, what should I probably return?
>> [inaudible]
>> So, I could return one. Why do you say one?
>> [inaudible]
>> So, that it will, sorry? Keep running?
>> [inaudible]
>> OK, so I do want to return something so that the function won't keep executing. Let me push back now though and say if we return one, we're probably going to run into problems. Can you anticipate what the problem is going to be if the error code we're returning is itself one?
>> [inaudible]
>> Yeah.
>> [inaudible]
>> It's supposed to return an int anyway, which means I might have just been handed NULL, so my answer is, oh, put Mr. Null in bucket number one. But that's the wrong answer. We probably want to return a so-called sentinel value that indicates to the caller, maybe it's main or whoever's using this hash function, that a problem happened. So, what should I maybe instead return?
>> [inaudible]
>> So, negative one, right? A very common answer honestly is anything outside the range of numbers we care about. So, let's return negative one, and just expect that the caller is smart enough to know he should check for negative one before trusting the return value of this function. And actually, if we want to be really sophisticated here, we can do what's very common too. You know what? Let's define a constant called ERROR, and just define it to be the number negative one. So, now we can kind of abstract away and say return the error code called ERROR, and now the caller can just check if what was returned is ERROR, not negative one. So, there's conventions like this where we can clean up what's otherwise a very arbitrary choice like negative one. OK, so at this point I now need to figure out what actual number to return. So, I've been handed a string. Let's assume for simplicity today that it's an actual student's name, and that we don't have any crazy hyphenated names, or apostrophes, or anything like that. It's just ASCII alphabetical characters. So, I want to decide what bucket to put this input in. So, I just need to return a number. So, maybe I can do something like, let's see, int n gets, and then what do I want to do? Well, I could cast to an int, what? s[0]. That will return a number, right? And then I can do return n. This will return the numeric value of the first letter in the person's name, but what's the problem here?
>> [inaudible]
>> So, it's the numerical ASCII value, which means I might get back 65 or 66, probably some number that's much bigger than 0 to 25. So, what's the easy fix here?
>> [inaudible]
>> Yeah, subtract capital A. That's what we've done in the past. If you want to be really anal you can explicitly cast to an int, but again GCC is smart enough to realize, oh, if you're subtracting one char from another, just give me back the numeric value. So, this would work, but another approach might be this. Let's see. Instead of subtracting this off, what if I do something like mod 26? Would this return to me a number between 0 and 25, inclusive? OK, so this would work here too, and it would actually work just as well, right? Even though it would not put the letter A in location 0. It would essentially rotate the positions where things are. Would it put two people with different last names in the same bucket?
>> Yes.
>> Yes? OK, how?
>> [inaudible]
>> OK, so let's assume for simplicity here capital letters, just alphabetical then. But, otherwise yes, that would be a corner case. So, no, it should work, because our hash function is so simple and we're just looking at one letter. I'm going to even keep the hash function simple, just to be ever so clear. This will return a number between 0 and 25. So now the black box that's using this hash function realizes, oh, this is Joe Smith. Or this is Adam Aardvark. This returns to me the number 0. So, I'm going to put Adam in this location here, because Aardvark, stupid name, I know, I couldn't think on the fly, starts with the letter A, and that maps to 0. So, we put it there. So, then the question is, now we have all these students, 26 students in this black box, how do we answer questions of the form, is Adam enrolled at Harvard? Well, the black box has to use this same hash function. So, now you use the same exact function, given the input, and say, here's Adam Aardvark. What is the location of Adam in my black box? 0. Let me now check, and if there's actually something there, the string representing the name, the struct representing the student, then I know, ah-ha, that must be Adam. Unless our assumptions are no good. If we actually have more than 26 students, what's going to happen when we insert the 27th and the 28th? Just intuitively? What's going to happen when we insert another person with a last name that starts with A?
>> [inaudible]
>> OK, so right. We can just expel Adam, right? And put this 27th person in instead, but otherwise, and more generally, we get a collision. And, that was the point of introducing that depiction of a possible solution. So, we paint ourselves into a corner with an array, even though it does lend itself to constant-time insertion. Notice the code. What is the running time of this code? It is constant, because it takes one step to check for null, another step to compute the integral value n, and then, OK, a third step to return that actual number. Three steps, but that's big O of 1 time. That's constant time, because it always takes three steps. And, that's so much better than log N, N squared, N itself. I mean, that is very fast, and that is the ideal, but we risk this collision. So, there's the tradeoff: you get amazing speed, but you can only have 26 students. So, again, this theme is recurring, but the theme is the result of this lack of foresight. We have painted ourselves into a corner by only having a bucket that's that large. Well, the point of this teaser at the end of Monday was to ask the question, well, maybe this isn't such a big deal, right? Obviously one solution to this is let's make our bucket bigger. Let's not have 26 spots. Let's have 52 spots, right? Because then if we have more students we have more room for them, right? You know what? Forget 26; let's really minimize the probability of something ever going wrong. Even though we're a relatively small university, let's allocate an array for a million students. Because just probabilistically, the idea that we'd actually run into collisions using some kind of algorithm is much, much lower by nature of how much free space we have. But a downside of overcompensating is what, obviously?
>> [inaudible]
>> You're just wasting memory, right? And it takes longer to search that memory. And, even though we won't dwell on this in this course, the more memory you're using, the less effective things called caches are. So, as an aside, if you're the type in the class that likes hardware, you might know of concepts in your own personal computers called L1 cache and L2 cache; level one and level two. All of that relates to optimizing lookups in memory. Well, the more data you have in memory, the less likely it is that cache is going to benefit you. So, more on that in a future course if you're interested. So, here's a question then: suppose we have a room full of N CS50 students. What's the probability of a collision of just birthdays? Let's make this more real. Not first names, and let's assume a uniform distribution of birthdays, which itself I think statistically is not accurate if you read the literature around special events in history. Nine months later there have been lots more babies than in other months of the year. So, a uniform distribution of birthdays is actually not realistic. But, let's assume it is and ask ourselves: if you just have a bucket and you need to put students in that bucket, and you're going to put them at random locations, what's the probability of collision? And, now, what's the who-cares question here? Well, if we have the previous scenario, a black box with 26 spots, maybe 52 spots, maybe 100, maybe a million spots, clearly we're going to have collisions if our hash function is as simple as checking the first letter of a person's last name. Right? The moment we have 27 students we are guaranteed to have a collision. So, that begs a more complicated, or more sophisticated, hash function. This is too simplistic because it's too deterministic, based on inputs we know we're going to have. We're going to have two students whose names start with A, and with B.
So, how can we introduce something that's going to take in a name and put it not at the 0 location, but in any old location, but a location that we can reproduce, that we can refind by calling the hash function again. In other words how can we spread out the inputs across a big bucket? What's a smarter hash function than just this? Yep.
>> [inaudible]
>> OK, so look at more than just the first letter. Again, some of these questions, even if they're starting to kind of pile up over here, what's the problem? Well, just looking at the first letter is insufficient, because we're going to have multiple students with the same first letter of their last name. But, you know what? We're less likely to have two students, like, I regret this name already, Adam Aardvark, because the probability of having two students whose last name starts with AA, Aardvark, is not very likely. So, just taking into account the second letter of someone's last name might itself be a good thing. Now, how do we take into account two letters? Well, this is where you have to start getting a little creative. Maybe I could take the value of the first letter and then add the value of the second letter; that gives me a number between 0 and what?
>> [inaudible]
>> And actually we should, we should technically do this. Let's see, let me subtract off my capital A, and let me subtract off my capital A, so that now what I have is two numbers between 0 and 25 being added together. So, we're going to get a number between 0 and 50. So you know what this means? Let me go back up here. And, our hash table is something that's storing strings, so it's storing char*s; let's call this hash table. Previously, a moment ago, my world looked like this. So, that one line of code was essentially what I was describing in the form of this little milk crate. That was our hash table. But now we can get values from 0 to 50. Well, I need more space for this at least. So, does this help me here?
[ laughter ] Am I going to, yeah.
>> [inaudible]
>> Perfect. So, it's still, perfect answer, imperfect solution here, because now, to be clear, if we have someone's name that starts with Ab, that is going to give me the same sum as Ba. But that then begs the question, how likely is that? Well, OK, that's pretty likely; I am not even going to try to contrive some stupid name that begins with Ab, but it's pretty reasonable to expect that exists. Alright, so let me just push back on you and be like, alright, let's just decrease the probability. I don't know the distribution of names in this country, but I do know that the more and more letters I tack on, the less and less likely a collision is going to be. And in fact, this is what's very often done, at least with simple hash functions. This is starting to get messy, right? This is kind of a copy-paste job, so maybe I am at the point where I should really be rewriting this as this: int n gets 0; and then let me do for int i gets 0, length gets strlen of s; and then, i is less than length; and then, i plus plus. And do you see where I'm going with this? What should I add to n on each iteration, probably? s[i] minus A. And I can optimize this; I don't need to do all this subtraction of A all the time. But this now is already a more elegant solution, because how many letters of the person's name am I taking into account? All of them. So already this is more clever. So now this begs the question: if you take, you know, all of the most popular names in the world, the first names, and you start adding the numeric values of their letters together, are you going to get unique numbers? Well, no. If we spent enough time, we could probably find someone's name, and someone else's name, where if you add the numbers in those names together, you get the same result for two different students with completely different names, just by chance. But the question then is, how likely is that to happen?
Because it's not such a problem if it happens. Yeah, a collision would suck, but we can mitigate this by having those things called chains; we can introduce linked lists, which we will do in a moment. But let's see if we can't slap a number at least on randomness. So what you proposed, adding the second letter, the third letter, or maybe adding all of them together, is a good thing, because it's just taking in more input; it's kind of adding, in this pseudo way, some randomness. But let's just simplify with a model that we can wrap our minds around. Suppose we just use a random hash function. Suppose instead, and this is just conjecture, let's just do this: return something like rand times 26, and I'm going to return an int, right? This rand returns a number from zero to one; let's multiply it by 26, and this will return a random number. So that's pretty good. And in fact, if I don't do this as 26, if I actually plug in some other numbers here, I can return a number that's not between zero and 26; I can return a number that's between zero and a million, zero and fifty, however big I want to make my buckets. But how likely is even a random approach to give us a collision? This is one of these very famous but simple problems: what's the probability that two students in this room have the same birthday? Well, we could kill a lot of minutes by actually polling everyone: say, who has January 1st? Anyone? January 2nd? January 3rd? January 4th? That would have been really cool if we had had two January 4ths, because we would be done. Alright, let's cut it off there though and try to model this mathematically. This looks a little complicated, but if you think about it, it's not. What's the probability that two students in this room have the same birthday? Well, this is one of those statistics problems where it's actually easier to answer the opposite question: what's the probability that two students don't have the same birthday?
And then you take that answer, do one minus that answer, and voilà, you have the answer you care about. Well, this top expression there, this probability given N students: how many different birthdays can the first student in the class have that no one else has? Well, you can have any of the 365 birthdays, because you are not going to collide with anyone if you are the first student. So that's like 365 over 365, which is the number one. But then the second student walks through the door. That student cannot have the first student's birthday. So that second student can have 364 out of 365 birthdays. So now the product of those two numbers is the probability that those two students do not have the same birthday. OK, now the third student walks into the room; two birthdays have been used up, so now we multiply by 1 minus 2 over 365, which is 363 over 365. In other words, if you just compute the probability that this student does not have the same birthday, times the next student's, how many birthdays can they have, the next one, the next one, the next one? You get a formula that, in the end, actually, it's not all that complicated, but it looks like this. Which might remind you a little too much of certain math classes you never loved. But it's not all that complicated to derive. So that begs the question, what is this? Because that itself looks a little scary. Well, it's a lot easier for our purposes to graph it. And the interesting takeaway is this. If you have a room full of N CS50 students, the probability that two of you have the same birthday, put another way, the probability that two of you will hash, given a random hash function, to the same location in that black box, is strikingly high, which is to say the probability of collisions in hash tables is just by nature very high. So what is this Y axis?
The Y axis says probability of a match: what is the probability that two or more students have the same birthday? The X axis here is the number of students in the room. So what's kind of mind-blowing is that if you just have a class of forty students, you have a ninety percent chance that at least two students in that class have the same birthday. And my God, if you've got a class of 346 students like we do, I mean you are kind of literally off the chart, but you've got to have someone with the same birthday probabilistically, even though 346 is less than 365. Probability-wise you're going to have that collision. So, what is the takeaway? The whole point of the birthdays is just to make concrete this idea that if we have any kind of data structure that we're putting students into, objects into, we're going to have collisions. So, the motivation here is, how do we fix this problem? Well, the obvious solution is, if you're running out of space at a given location, just start making more space. Start chaining students, start chaining your outputs from that initial location. Because now, do you have any upper bound on the number of students you can fit into your data structure if, instead of putting the students in the array itself, you're simply adding the student to a linked list that begins at that location in the array? Do we now have this same upper limit? [silence] Doesn't feel like it, right? Because even though the size of this table is fixed, 31 here, 31 days, actually what was the motivation here? Yeah, it was day of the month, in this particular picture that we stole from the book there. So, what happens when two people have the same birthday? Or two people hash to the same location? You can't put them physically at the same spot, so where do you put the second person? After them, right? Or before them. Whatever. Keep it sorted. Keep it unsorted. It doesn't really matter. Just put them in that list. But what now is our running time of insertion?
What is our running time of searching? Is it constant time anymore? So, it's definitely not, right? As soon as you start seeing arrows and lists. Like, now this data structure is devolving back into what? Something,
>> Linear.
>> Something linear. But, I can push back. Suppose that we have K locations. There are K buckets, 31 in this case, 26 in that case. Doesn't that mean that my running time isn't big O of N for N students, because they can go in any of K locations? Doesn't that mean my running time is big O of N over K? [silence] Right, it's not linear. If it were linear that would mean that it's big O of N, and that means that my longest chain is going to be N students long. But, wait a minute. I have all of these other chains I can put students into; K of them. So, doesn't that mean in the worst case my chain is going to be of length N over K? And therefore my big O running time is N over K? There's a bug in my logic here. Yeah?
>> [inaudible]
>> Good. Good. So, if your hash function is bad, or just simply broken, or simply stupid like this, what is the running time, whoops, what is the running time of in fact finding a student? It is in fact always N, because you're putting every student in the same location. Now this is the extreme. Odds are you're not going to be so, you know, naive as to implement this as your hash function. But, what if you're just unlucky? Right? That's what big O notation is kind of about, asking questions like, what is the worst case? Well, in the worst case you're really unlucky. Everything is backwards, or all students' names start with the letter A. That's really bad, so even if our code does say s[0] minus big A, maybe just with some probability this happens. Well, this then begs the question, what is the ideal hash function? In other words, it puts the burden on you, the computer scientist, to figure out, what is the best function to use here? It's probably not checking the first letter. It might not even be checking all of the letters in the alphabet. Maybe you need to do something more intelligent. And the teaser here is with problem set six, which we'll release on Friday. This is going to be our last problem set in C. You're going to have to implement the fastest spell checker possible. And by coincidence those inputs are going to be words. You're going to be handed an empty file that you need to fill with a function that somehow loads 140,000 inputs, 140,000 words, into memory. So, you are going to be implementing this black box. How do you do that? Well, every time we feed you another word, another word, 140,000-plus times, you're going to have to decide where to put it inside this black box. What's going to be inside the black box? Well, that's going to be up to you. You can use an array, an array with 140,000 locations.
And you can load all of those words into memory pretty damn fast if you're just using an array, but what is your code going to suck at if you're just using an array?
>> [inaudible]
>> So, search. Now granted, if you keep them sorted then you can do big O of log N, and that's definitely better than N and certainly N squared, but again, we're trying to get to constant time. And constant time is, again, that ideal. So, the real cleverness in problem set six is going to be what hash function you should use so that you can actually put these words at the smartest location possible. Now as for this question, what is the running time of a hash table? Here is finally when we see a data structure where theory no longer is necessarily consistent with reality. And by that I mean this. Asymptotically, a hash table is, yes, big O of N over K. But what do we always do every time we talk about big O and omega with constant numbers? We kind of ignore them. Right? Because we said it's not interesting. Because hardware advancements will take away that constant factor within 18 months, right? The algorithm itself will just get faster because hardware is faster. So, we always cross out constant values. K is constant, because when you compile your program you're choosing K. Or, when you run your program you're choosing K, making that many buckets and then using that many buckets. So in fact a hash table asymptotically is big O of N. But here's the thing; this is where reality breaks off from theory. In the real world, navigating a list that's one Kth the size of a list of size N is, in real-world seconds, in human time, much faster. It's a factor of K faster than searching the whole list. So, hash tables have this interesting property where, you know, even though they're still a linear data structure, yeah, in the worst case your chains might devolve into these really long beasts if you're unlucky, or foolish. Well, if you're smart and your data has some properties like English words tend to, various patterns and distributions of letters, your hash table doesn't need to have really long chains.
In fact, ideal would be to have chains that are roughly the same length, because code that runs in O of N over K time is in real-world terms much faster than code that runs in just N time. And so the challenge of p set six is going to be to leverage these real-world implications of performance improvements to implement the fastest spell checker possible. Let's take our five minute break here.
[ multiple background voices ] Alright, we're back. So, this slightly retro song reminds me of other comments; some people love the music we play in 50. Some people think my music tastes are like two to three years out of date. That's kind of true. I rarely add songs to this iPhone, but I did add this song. I even paid like $1.29 for it. Those of you who are fans of The Office.
[ multiple background voices ] Yeah. So, I had never even seen the YouTube video that they recently spoofed, but I'm kind of into this song, like a year or so later. So, anyhow, we'll maybe link to the video, or conclude with it as you walk out today. So, a quick reframing then of where we came from and where we're about to go. So, the motivation was to come up with a data structure that puts to shame the things we've looked at thus far: arrays and linked lists. Constant time is what we want, but unfortunately constant time is not really possible unless we waste a ridiculous amount of memory, or come up with the so-called ideal hash function that puts every student in his or her right place without collisions, but that's hard. And, so we seem to be at the point now where we kind of have got to compromise. Right? We have to pick a size for our hash table, 31 in this case, 26 I kept saying in that case. We're going to have collisions, but let's at least have a way of dealing with them; we can deal with them with linked lists. And now we can in fact fit an arbitrary number of words, or students, or objects into memory, but still find things faster, in N over K time, than we could in just something linear, like a linked list. So, what might this look like in code? Well, let me go ahead and open something like this hashtable.c. This too is empty by default, and just as I think aloud here, how would I implement a hash table? Well, we kind of did it a moment ago. If I'm just going to store student names or words, maybe for p set six, we could call this my hash table. I need to pick a size. For problem set six you're probably not going to want to pick 26, probably not going to want to pick 31. Because when you have 140,000 words, I mean you might actually want something that's like 1,024 locations long. Or maybe even longer, 100,000 locations long. You'll still have some collisions; 40,000 or so words won't fit, obviously, in this table.
So, you'll have chains, but what will be true about those chains if your hash table is this big?
>> [inaudible]
>> They're really short. So, again, trade-off, right? More RAM, more memory, but faster running time. And, so what you'll find in problem set six, even though the competition aspect of it is purely for fun, opt in, you don't need to do it if you don't want to. But if you do test your code, follow the directions, and post the speed of your code and the amount of memory you're using on the course's website as part of that problem set, you'll see that some of the fun actually is in turning knobs, so to speak. Tweaking constants, like 140,000, and seeing, if I get more buckets does my code speed up? Or does it mean I'm using so much memory that the computer is actually finding things slightly more slowly? So, this would be really the first problem set where you really appreciate real-world considerations like tweaking settings like this. Alright, so what do we do now for my insert function? Well, what is problem set six going to have you do? Well, you're going to have a function called insert, or add, or something like that. Actually, the function you'll see is going to be called load, and you'll be passed a whole file, not a single word. But, let me write pseudocode for an insert function. I need to insert one word, S, into my hash table. So, what am I going to need to do? I'm going to have to find a location for S. And then what do I want to do? Insert S at that location; if collision, append to chain. So, that's the essence of inserting a node into this list. How do you actually find something? Well, find, let me just, for a spell checker I don't care about location here. Let me do a true/false thing. So, I'm going to do bool, and then I'm going to do find, and I'm going to take in as input a word, and the question being asked is, is this word in your dictionary? I just need to say yes or no. So, I'm going to do the same thing really. Find the location for S, because hopefully it's there already, and then return true iff found.
If you've never seen the terminology iff, it's not a typo. It means if and only if. The implication is return true if it's found, else false. But if we want to be a little more pedantic we can say else false explicitly. So, this is kind of problem set six in a nutshell. Now granted, it won't be just two lines of code, but it won't be terribly many. The problem set will ultimately be more about thought and then about experimentation with actual knob-turning, so to speak. But this is really what it's going to be about. But the digression is going to come in then as to how big this table should be, and how you implement those chains. Well, there was a reason we passed out that source code for linked list manipulation two weeks ago. We looked, I think, at the find function and the insert function. You realize, lest you forget next week, you actually have an implementation of a linked list in your hands, or in your binders. And, that's probably going to be very useful for implementing support for collisions, because you've been handed essentially the code that will implement chains. But what you're going to have to figure out is, how do you have not one linked list, how do you have 31, or 100,000 linked lists, and actually add words and then find words in this dictionary? But it turns out there are other approaches to dictionaries and to hashing in general. The real new idea today and on Monday was this idea of a hash function. Rather than take some input and just put it in some fixed location, we now have some dynamism, this hash function that takes input, massages it a little bit, or analyzes it a little bit, and then extracts from that analysis an answer like location zero, or location one. This is kind of a cool thing, and hashing in general is nice. So, why don't we just use it to the extreme? There are these data structures called tries. Silly name; it derives from the word retrieval, which for some reason is pronounced differently from the word tries.
But a trie is a tree; the word tree was already taken, as we'll see in a moment. So, a trie kind of looks like this. This is a really nice simplification of a trie. A trie is a tree; what's a tree? Think of a family tree. Right? You've got like grandma and grandpa up here, and then all of their children as these leaves, or edges, coming off of them. And then it forks out. So, it's a big upside-down tree, a family tree. That's what a tree is in a computer science context. So you know what each of the nodes in this family tree called a trie is? It's actually an array. So, another approach fundamentally to implementing a dictionary, especially for lots of words, is you start with a root node, which is an array of size, let's say, 26. And let's assume all lowercase, just alphabetical letters, no hyphenated words, no apostrophes in our dictionary, very simple words. So, my root node is a node containing an array with 26 letters. And you know what? How do I check if a word is in this dictionary? Well, I hash on the word letter by letter by letter. So, here's my first node. I've got A through Z in here. I take in a word like apple, and I look at the first letter, and it's an A. So, A in a trie will typically map to zero, because it's the 0th letter. So, what does that mean? That means I check the A location in my array. It's not depicted there in the top node. And then where do I go next in this tree? Well, I look at the second letter of my word. My word was apple. So, P. So, now I follow what's essentially going to be a pointer, an arrow, to another node. But the arrow I follow is from the location in the array that represents P. So, in other words, I have a tree, each of whose nodes is an array. And so to find whether or not a word is in this data structure, I go from node to node to node following the arrows appropriate for the i-th letter in my word.
And then I get to the bottom of this tree, and as this particular textbook depicted it, at the end of your tree, the so-called leaves, you have to have a special value, a Boolean, true or false, that says a word stops here. Because if a word stops here, what does that mean? Well, it means it was inserted at one point. So, now I can answer questions of the form, is this word in my data structure? Yes. Now, we'll come back to this in just a moment. But think tree. Think use of arrays. And think multi-level hashing: hash again, and again, and again, not just once, and we can get to this end game. This is what some of you will likely adopt, for experimentation's sake, for problem set six. So let's frame the new jargon. What is a tree? It looks a little something like this. You have a so-called root node. And we'll use trees again in a week or two's time with web programming, and HTML, and PHP. A tree in computer science speak is a node, depicted here as a circle. Those nodes might contain values. Here that node contains the number one. And trees can have children: a left child, a right child, or in fact any number of children, although trees with just left and right children are very common. They're called binary trees, bi meaning two. But this tree is just meant for jargon's sake. This thing up here is called the root node. Three is a child of one. Two is a child of one. Two is a sibling of three. So, they completely stole the family tree terminology. There's nothing new here. And then finally, anything at the very bottom that itself has no children we call a leaf. So, this is a tree. And, this is a specific instance in computer science of what's called a graph, where a graph is something with nodes and edges that go elsewhere. But a tree is nice because you have no cycles. Right? It's a very rare thing for a tree to have a branch that then grows back into itself. Right? Then it becomes a graph. More on that in a data structures class. So, this is a tree.
This is a binary search tree. I'm just going to make mention of this, because this is the kind of stuff you can revisit in a data structures course. We used binary search on arrays. Turns out you can use more sophisticated structures like trees and actually find data pretty fast. A binary search tree is a tree that's binary; each node has no more than two children, left and right. And it's a search tree in that if you store numbers in each of the nodes, so long as you store them in a smart way, you can find things as fast in a tree as you can in an array that's sorted. What does this mean? Well, notice that 55 is the root. That's kind of arbitrary. What's not arbitrary is that its child on the left is smaller than the root, and its child on the right, 77, is larger than the root. And that principle applies to all of the other children recursively. So notice 77's child on the left is a smaller number; on the right is a bigger number. So, if you adhere to this pattern in a tree, it turns out you can search this thing in log N time. And this just hands us some really cool data structures in computer science. There are things called 2-3 trees, red-black trees. The world has really leveraged this idea of a tree to come up with very sophisticated data structures. And when we in a couple of weeks use PHP and MySQL, a popular database engine, well, the database world is replete with use of trees, because you can do really fancy things. And if you like this kind of stuff, CS124 is the place to go in the spring or beyond. So, let's go use trees to solve a problem and come full circle back to this idea of tries, and then finish the day answering the question, how well can we do with storing things like words? Alright. So, there's this thing called Morse code. Some of you might know what it is. You kind of push that thing; beep, beep, beep, beep, beep. And it sends messages and it receives messages.
And it's pretty efficient, because if you want to send the letter A you just hold down a button quickly and then you push the button for a slightly longer time, and that's what a dash means. A dot and a dash. So, Morse code is a really cool way of encoding information. And so the domain-specific topic we'll introduce just for the sake of discussion today is compression. So, how many of you have ever compressed a file before on your computer? Alright, it's pretty commonplace these days. In fact, almost any web page you download these days is compressed, but then it's immediately decompressed by your browser before it shows it to you. In English, what does it mean to compress a file?
>> [inaudible]
>> Make it smaller, right? So, you've got a really long document and compressing it means you make it smaller. Alright. I can make your essays smaller. So if you just wrote a ten page paper I can compress it by deleting pages six through ten. Is that compression?
[ laughter ]
>> No.
>> So, it actually is compression. There is compression called lossy, L-O-S-S-Y compression. And that's actually a valid technique. It's not so good for essays, but it's reasonable for movies and graphics, right? If you've ever looked at a YouTube video, some of them are pretty low quality, but that's because someone compressed them. And that person compressed them lossily. That means they made it smaller by throwing out some of the fidelity, some of the quality. And that might be reasonable, but with English and with text it's generally bad to throw away information. So, there's also lossless compression. L-O-S-S-L-E-S-S, lossless compression. That's good for things like text. So, how in the world can you take a ten page essay, whether it's written in Notepad, or Microsoft Word, or Pages, or whatever, how conceptually could you make something smaller without throwing away words and your thesis and supporting paragraphs? What has to happen if you've got a text file with words, none of which you can afford to get rid of?
>> [inaudible]
>> Sorry? OK, so you get rid of the spaces, right? Actually that would be a fun exercise in expos or something. Really annoy your preceptor. Just eliminate all spaces, right? That's really easy: Control R, space, then replace all. And really annoy someone easily. Or, you can do stupid things like, maybe, let's take that, which is kind of crazy, and slightly improve on it by saying just capitalize the first letter of every word. So, then at least a human can more easily see where the word breaks are. But there too you're losing information. A space is a legitimate grammatical construct, and we're throwing away information if we take that approach. But, we're saving space. What else could we do besides actually affecting the quality of the document?
>> [inaudible]
>> OK. Good. So, if you use the same word a lot, why don't you just replace that word with the letter X, and tell your TF, much like you would a computer science TF, X equals American History, or something that you don't want to keep saying again, and again, and again. Right? Just declare a variable, declare a constant at the top, right? And then AH all over your document, right? That's another way to annoy your humanities teacher this week. OK, so that's actually pretty reasonable and speaks to a really clever idea: look for patterns. Look for some places where you're spending a lot of bits, and spend few bits but communicate the same ideas; factor out the common cases. Well, you might recall this chart from weeks ago. We had ASCIItable.com. The funny thing about ASCII is that you're using eight bits to represent every character on your keyboard, plus some others, right? This whole chart is kind of overwhelming, because there are so many characters. But in your English essay, or expos essay, I mean, how many of you are very often using the caret symbol, or the tilde, if it's an English document and not, like, Spanish? Or, you know, what else is not that common? Plus? You probably don't use plus that often. How many of you use the synchronize idol key on your keyboard? Probably not that much. In other words, ASCII is itself not very efficient, because it spends eight bits on every letter irrespective of the frequency of that character in your English prose. So, maybe we could take this idea of finding patterns, common words, to the real extreme and look for common patterns of bits. Right? If I see the bit pattern 111111 a whole lot, maybe I can represent these bit patterns more succinctly. And in fact you can. When you use on Mac OS or Windows the compress file feature, or the zip file feature, any of these kinds of ideas, you're taking not necessarily text documents, you can compress almost anything sometimes and make it smaller without sacrificing quality. 
And if we do this not at the alphabetical level, but rather at the bit level, we can do this pretty effectively. And how many of you for instance have taken a big file, whether it's a movie or something else and compressed it, that's made it smaller, but have you tried compressing it again? Right? I mean this is kind of fun. Take any file, compress it. Take your essay, compress it. It will make it smaller. Do it again. It will make it smaller, and smaller, and smaller, and smaller, until you can compress anything you've written into what? Hopefully one bit. Right?
>> [inaudible]
>> It's kind of crazy talk, right? That should not be possible. Right? And so there's actually an interesting upper bound here that's very theoretical. There are interesting ideas of entropy and randomness here, because at some point the only way you can really compress data is to eliminate patterns, right? Or exploit patterns and replace them with shorter sequences of bits. But eventually, if you do this again, and again, and again, at some point your document is essentially going to look random. You're not going to have long patterns of bits. You might have a 11 and 00, but you're not going to have twenty 1s in a row over here, and over here, and over here. You're going to essentially turn your document into something that's very random in appearance. And that's sort of the end game with compression. And encryption does the same thing. It tries to make everything look random. So, there's an interesting relationship there. But the question for today is how can we go about compressing data? Well, that is not the way. We could do something succinct like this. Why use eight bits? Why don't we use something like Morse code, where you represent the letter E with a dot, and the letter A with a dot dash, as it's called. And the reason for the different shapes here is a dot means the person running the show would push a switch like this really fast, and the dash means their finger goes down for, like, a second and then comes up. But there's a problem: if you are the recipient of a letter in Morse code and you receive a dot dash and then some other stuff, what letter or letters have you just received?
>> [inaudible]
>> Yeah, you might have received an A, dot dash. Or, you might have received E then T. So, the problem with Morse code at least is that there's this ambiguity: which is it? Right? So, you could maybe insert pauses, and this is what they did in the Morse code world. You kind of hesitate before sending the second character. Or you infer from context what the letter is. But that is the problem that we need to address. If we're going to compress information, we really can't have there being ambiguities in our zip files, or in our compressed files. Because you don't want to compress your essay into this and then get back a different essay, essentially, by having the words decompressed differently. So, we need something called immediate decodability. Now, the formal algorithm we'll look at for a second here is actually this, if you want to read through the specifics, but I think it's easier to think through just with simple pictures. And it turns out one of the neatest applications of this data structure, trees, and thus the motivation for this context, is compression. So, this is how you can compress data using a fairly simple, but sophisticated, data structure. Suppose for the sake of discussion I wrote an essay that looks like this, between quotes. Right? It's complete nonsense, but assume it's not. We just needed a very simple alphabet here. So, I've got a lot of As and Es, but what's interesting is there are some patterns there. Just glancing up at the board, I see there are lots of EE patterns, or EEE patterns. And the goal, as you proposed, is let's try to factor those out, or let's represent those more succinctly. Now, what I did here in advance was I counted all the Es, all the As, all the Bs, and so forth. This is their relative frequency. So, if you do a sanity check, twenty percent of the letters in this nonsense essay are As. Ten percent are Bs, ten percent are Cs, and so forth. So, I did a frequency analysis. 
I opened the file, went from left to right, counted up the letters using an array. And now I have my percentages. This is all it takes to compress this file intelligently. I'm going to do the following: I'm going to build a forest of trees. And, we're going to do this not in code today, but in words. So, a forest of trees is five trees, each of which represents a letter. And inside that node I'm going to put the frequency of that letter. So, think of these as my leaves for the moment. But, it's a forest of trees, and you know what? I actually have five trees here. It's just they don't have any children, so they're kind of stupid looking trees. But we're going to build them up into something more interesting. Mr. Huffman, a graduate student fifty plus years ago, came up with the following algorithm for compressing data. Analyze the frequency of your text, your essay or whatever. Though we can do similar ideas on binary encodings too. Analyze the frequency and start building one tree out of all of these letters by taking the two smallest nodes and combining them into a new parent node. So, this is step one. Make your forest of trees, just by copying those frequencies. Step two is join the two smallest nodes, ten percent, ten percent, and make a new node, and put the number in that node that's the sum of the children. So, we had ten percent, ten percent; this new node, the parent, is twenty percent. OK, so now repeat. And, one of the concepts we don't use all that much in this course, because it's hard to do in a compelling way until problems become more interesting, is recursion. But what I'm going to say is recurse. Apply that algorithm again. So, what does that mean? Find the two smallest nodes. Well, I see twenty percent and fifteen percent. Let's go ahead and merge those. So, I merge those; the new parent is 35 percent. But, now notice what Huffman proposed. He said take the edges and arbitrarily but consistently label them with a zero or a one. 
So, a zero is going on the left, a one on the right, and then again, and again. The goal is to make one tree all of whose edges have zeros and ones, because that's going to be the new sequence of bits we should use instead of ASCII for Mr. Huffman's scheme. So, 35 percent and 20 percent, we're going to get a new node now, 55 percent. Finally, the two smallest nodes, or the only two nodes, 55 and 45 percent, and thanks to percentages everything adds up to 100 percent. So, now I have a tree. This is a Huffman tree. This is a data structure you could fairly easily build out in C code. In fact, if we were to implement each of these nodes with some C code, with a struct, what pieces of data do you need inside of the struct? So, I say struct node; what goes inside of a node to implement this data structure? One such node. So, let's arbitrarily [inaudible] how do I embody the information point 45 in a struct? What data type do I need inside my struct?
>> [inaudible]
>> So, like a float? Float F. Call it whatever you want, but what else do I need for a node to implement this picture?
>> [inaudible]
>> Yeah, you need a star. So, we need like a struct node star for a left pointer and a struct node star for a right pointer. But, there are no new concepts here, because remember we did this with linked lists. Every node in a linked list had a next pointer, and on the quiz the doubly linked list had a next and previous pointer. So, it's the same idea. We're just now saying not next and previous, but left and right. And conceptually they point downward instead of laterally. So, a struct might look like this. So, we have frequency, and actually I flipped it around. A float would work too, but frequency would also work if you just used the original raw numbers. Apologies for that. Either approach is fine. We might want to retain the symbol though, right? For the leaf nodes, those actually have characters. We need to remember those. So, that's a char, and a struct node star left, a struct node star right. So, how do you actually compress information? Well, what Mr. Huffman said at this point in the story is if you want to represent the letter A, do not use the binary number for 65, because that's a waste of bits. That's eight bits. If the only letters in your essay are A through E, you absolutely don't need eight bits per letter. Right? That's just a waste. You never used those other characters. So, you can probably infer: how do we represent the letter A according to Mr. Huffman?
>> [inaudible]
>> Zero followed by one. And for letter B what bit sequence?
>> [inaudible]
>> Yeah. Right. And, again, it is this simple. I'm just starting from the root and going from the root to the letter B. What path am I taking? 0000. Mr. Huffman said use four zeros for B. What about C?
>> [inaudible]
>> Good. And D is? 001. And E is, finally? 1. And now notice what's cool about this. So, the data is already on the board, but let's go back to picture one. What was the most popular character? E; 45 percent of the time we have Es in this document. So, this is what's beautiful about Huffman coding: Mr. Huffman said, you know what? Just use a single bit, the number one, to represent an E, because my God, if the whole idea is to use the fewest bits possible, well, then optimize the common case. And this too is a theme in programming, or computer science. If E is the most common letter, then my God, use the fewest bits you can to represent E. And for something like B and C, ten percent of the time? Fine. Waste some bits on those letters, because they happen less often. And now, per this comment about immediate decodability, unlike Morse code, where we very quickly saw a problem, notice what's neat here. Does the letter E share a sequence, a prefix, with any other letter? The only letter that starts with 1 is E. So, that's good. We can't conflate it with any other letters. How about A? A is 01. Does any other letter start with 01? No. And, that's a really cool property, because with Huffman coding not only can you compress information by using fewer bits, there's also no ambiguity. So, to compress a file with Huffman coding you, one, analyze it, which takes big O of n time. You've got to count your characters left to right, and you build up a tree in memory using fairly familiar, even though we won't code them until problem set six, fairly familiar constructs like structs with pointers inside of them. You build up five nodes and then you loop over your nodes. And, you find the two smallest ones; create a new parent node; link them together; repeat until you're out of nodes. 
And then you just traverse the tree, and there actually exist recursive algorithms that are just a few lines of code that could spit out a little table in memory, or an array. That's actually kind of an interesting design problem itself. But, then finally, you analyze the file for frequency; you build your tree in memory; you figure out your encodings. The last step is to iterate over your file, your essay, top to bottom, left to right, one last time, and every time you see an A, you use the function fwrite, or something similar, to spit out a zero and a one. Anytime you see an E, you spit out a one. Anytime a D, this pattern. Then you close the file and voila, you have a file that now mathematically has been compressed optimally; that is, you've used as few bits as possible for the common cases and you've spent more bits, as is appropriate, for the less common cases. So for p set six you'll have this opportunity to implement your own tree structure, known as a trie, or your own hash table. So, more on that on Monday.
[ multiple voices in background ]
[ music ]
==== Transcribed by Automatic Sync Technologies ====
``` |