>> Alright. Hello. Welcome back to CS50. This is the end of week seven, so we've got a lot of fun stuff ahead. I thought we would wrap up a couple of the lower level details that we began on Monday today, but really focus today on some techniques and some concepts that you'll be using certainly throughout the remainder of the semester, albeit in a completely different context of web programming. And then if you ever pick up a computer and a compiler or a text editor in the future, odds are these same concepts will occur for quite some time. But first, a few, do you mind bringing the house volume down a bit? But first a few comments toward an end of teasing you with some ideas for your final projects and also to get the juices flowing with regards to final projects. We'll spend more time on this next week or so. We'll release the specification for the final project even though it's not due for a couple of months so that as you dive in to the web based aspect of the course and the web based pieces, you start to at least think about the kinds of things you might want to do. But realize you need not do something web based, you're welcome to do most anything, as you will see. So we surveyed all 1600+ students that ended up using this particular tool to shop for course this past fall. We set up arbitrary dates of releasing some updates today because you guys, if you've not heard, have to do this thing called pre term planning starting this year, which means registering what you are likely to take so that FAS can plan teaching fellows and room assignments more effectively. If you have no idea of what I'm talking about, you, odds are, have some email in your Harvard.edu email account somewhere, so go dig that up. I think that binds in a couple of weeks. But we want to use this as an opportunity to make this tool even better and to actually practice what we're preaching, take in feedback and try to iteratively improve the design of this thing. So now if you've never actually used the side it's supposedly pretty straight forward. You go to courses.cs50.net and the purpose in life of this application is really just meant to make navigating the course catalogue, which you probably do at least a couple of times a year, more, certainly simpler, easier, if not a little bit exciting in that you might very well stumble across courses that in the traditional format of a printed catalogue or in my.Harvard, you might not find these things. So a few of the shortcomings, we realized the previous versions, so there were some stupid things, frankly, whereby if you chose some faculty member's name from this drop down to see what courses this person taught, and then you wanted to undo that process, there was actually no bold faced link up to that said all faculty. You essentially had to reload the page or click clear or go through a non obvious process. So there was some low hanging fruits. But the funny thing is, as we'll discuss in a moment with regard to your pset4 gripes with software and machines that you've seen in your life, there are some lot of low hanging fruit out there in software whereby if you just fix that one stupid bug or shortcoming, you end up making a lot of people happy. So we certainly plucked off the easiest of things. We previously had these links at top left to link to all core courses, all gen ed, and we got comments like well, that's great, but I'd really like to see what the empirical and mathematical reasoning courses are. I don't want just gen ed in general. And you know the programmer in me pushed back and I was like well, in reading this comment you could just go over here and type empirical and mathematical reasoning, but the student's comment was fair. Because though it may have been obvious to me, the person who implemented that feature, well if the users aren't similarly identifying how to do something, that's my fault, frankly. And I think if you don't take ownership of this problem and just expect that your users figure it out, I think you're not going to end up designing good tools and good products for people to use. So now, for instance, this too, not a hard thing to implement in the end. You click gen ed and you actually can select all of the areas or individual areas. So again, some low hanging fruit. But then there were some more interesting problems to solve. If you used this tool this past fall, then previously toward the left of every course there were a couple of check boxes, check this if you want to shop this course, check this box if you've taken this course. And there was also a little Facebook icon that if you clicked it would trigger a comments box where you could actually click Facebook's like button and you could start a little Facebook comment thread about the course. And what was fascinating in actually looking at the logs on the server and asking people outright what features they used, like no one even realized you could click this little blue icon and no one really ended up commenting on courses. So though this was a great idea, we thought, have students have this sort of unofficial channel for commenting on courses, nobody used it. And so we ripped it out. Because if it's not being used, why leave it in there as a distraction, even though this idea we think still might have some potential. So now we realized it actually might just be nice to generalize the idea of adding a course to some kind of list because a lot of students didn't want to take a shopper course this fall, but they wanted to shop it in the spring and just because we hadn't thought this through fully, we didn't give them a way three weeks ago to actually shop courses in the spring. So now we've generalized this idea so you have a little drop down menu and you can choose courses you're shopping, courses you're taking, courses you've taken. And now it remains to be seen if you all even care about some of these features, but from the computer scientist in us what we like is the idea of being able to gather this kind of data and understand who's taking what, whose shopped what, who has shopped something and not taken it. Because if we can build up a corpus of data that really logs the kinds of behavior that people exhibit during shopping period, just imagine the kinds of questions you could start answering for people. For instance, I want to take Economics 10. Well what other course, well that's probably a bad example, because almost everyone takes Economics 10. I want to take ABC 10. All right? So ABC 10, you might then look at your data and say well, who else has taken this course? Well, if they've taken this course, what other courses have those people taken? And you can begin to infer with some probability that if you took this course, and you're now shopping this course, you might like these other suggestions as well. So for now, we just have some random suggestions at the left. Sometimes it's not so random. CS50 seems to pop up with some recurring frequency for some reason. [ Laughter ] But this is where we hope to go. And so hopefully now that you're beginning to understand how computers work and how computers think and how you can make them do your bidding, you can start to appreciate the kinds of problems you can solve once you've got some interesting data and really just some basic tools in your took kit. So if you want to use this tool, by all means feel free to. The funny thing was, too, we got a lot of comments about features that it would be great to add that were actually there. For instance, Q Guide data has been in here since the start, but admittedly it was not obvious that if you hover over a course you can actually get all of the data from materials, assignments, workload, difficulty, but our fault. If you didn't notice that you could actually get all of that data without having to click through to the course's individual Q Guide evaluations, again, our fault. But the funny thing is, it's very easy for people to make comments on tools and to design things sort of off the top of their head, but it's really only once you sit down and start trying to program this machine that you realize the tradeoffs you have to make. There are so many other good suggestions that we've received that we decided not to implement, at least not yet, because you also only have a limited amount of space and if you just keep throwing more and more content, and this too is a recurring theme in Problem Set 4, you just end up overwhelming your user. So the theme, at least for CS50's applications, which I don't think in general is bad advice, is to try to do few things really well rather than try to do a lot of things poorly. So case and point, another recurring theme that's been on people's minds is you know, I'd like to be able to tag courses for I might do this for gen ed, I might do this for pre med, I might do this for this, like a lot of really compelling use cases, but we felt at least for now if we tried cramming all of the more features into this and we didn't do them really, really well, then it would just be another crappy tool that gets commented on frankly next year for Problem Set 4. So for now we're going to restrict it to what we think is working well and then we'll continue to iterate. So again, take away is not look at how we implemented this particular tool, but look at the kind of questions that you yourselves should ask perhaps when it comes time to designing your final projects. It's really easy to brainstorm what you think is a brilliant idea on paper, in your head, but when you sit down in front of that keyboard and then have to start deciding how to implement this and where to put this feature, it's sometimes non obvious. So with that said, yes, we've pretty much rattled that off. So with this said, one of the fun, frankly, one of the most useful features that was proposed that was non-trivial for us to implement is this, I want to be able to ask what course, across any department can I take in order to satisfy the economics concentration or the government concentration. Obviously you can shop in the government catalogue, you can shop in the economics catalogue, but there's a lot of mathematics courses, a lot of the statistics courses, even CS courses that count for some of those concentrations. And believe it or not, to spite how many computers there are on this campus, there is no database that we can find that actually has that information. Rather you have to read through the student handbook, which is written by humans and pros with bulleted lists and so forth and figure it out. And so one of the things we thought we'd do which would be hugely compelling is to actually make this feature happen so that you can say I want to major in economics, show me all of the courses that I could take strategically to satisfy this concentration, even though they might not be inside of economics itself. There are 5,000 courses in the courses catalogue, there's like 40 concentration, all of which are written differently, so we thought we'd use a little word dejour [assumed spelling] these days of crowd sourcing. So if you would like to help us make this feature happen, the teaching fellows and CA's and I have been biting off some of the concentrations, we will share a Google doc with you, it will be prepped with a certain concentration's columns so that you can then just copy and paste into that spreadsheet which courses can count for a primary concentration or a secondary field for that particular department. And then within a few days, a few weeks, we will integrate this so that hopefully we will have a super duper app for everyone to benefit then from this spring. So drop me a note if you would like to do that. It's hard for one person to do this. With 500, hopefully, if we just get one percent of volunteers, we can do it pretty well. So idea. So when I gave on behalf of CS50 this little intro talk called Harvard Thinks Big, the gauntlet I threw down verbally was to have the crowd that was in Sanders that night actually give us ideas, not unlike Problem Set 4, that they would loved to see solved with computers, with a machine, with software, something, frankly, that a CS50 student can do. And if you go to ideas.CS50.net, there weren't terribly many ideas, right? You can only expect a certain percentage of people to actually follow through on that. But there were some really good ones and also some really fun ones. I thought I'd share some of the highlights, so one suggestion that was made by some random person on the internet after that night was this, a course catalogue Android app, Android being an operating system that's increasingly popular on mobile phones. This is absolutely within the grasp of CS50 students, particularly because for every one of, for every one of the CS50 apps that we've talked about thus far, if you go to follow the link called API, the course provides all API's, application programming interfaces, which is just a fancy way of saying we have implemented some functions that you can call over the internet from language like JavaScript and PHP, so that if you want to get data from Harvard courses, if you want to get data from all of the events happening on campus, if you want to get data from the dining hall menu, you don't have to go and copy paste these things or figure out how to do that. We will just hand you the data that you want and you can really dive into the fun part of the problem, solving a workflow, solving a problem that you think you could when handed that data. So on something like this, if you already have an Android phone, by all means can you dive into this. As an aside, industry tends to be nice to the course. Research in Motion, the company that makes Blackberries has donated a bunch of phones to the course, so if you would like specifically to tackle a Blackberry based final project, we will let you know how to sign up for a lottery to receive one of those phone with which to develop that. So another great idea that came out of ideas.CS50.net, Harvard Travel, a website where Harvard students traveling during J-Term or summer can find another Harvard student who have been there or live there so as to get some info about that place, maybe even be hosted by that person, so a very compelling opportunity for one of us this final project season. An application to download all Facebook photos tagged of me. So that's not bad, whatever one's motivation for that, but also tenable, Facebook, too, provides an API with which you can interface with a lot of their data and use it to implement your own idea. Case and point, Harvard courses, you log in with your Facebook account so that you can see once you log in what your friends are taking. And as an aside, if the idea of signing into anything with Facebook kind of troubles you, realize there's a checkbox you can check on our application, at least, that hides your presence from your friends when shopping. You don't have to divulge to the world what you're actually doing. So this one's good, too, to reserve a practice room freshmen still have to go to the FDO and register on a sheet of paper, save some trees and save some time with a website. So again, a low hanging fruit there. Now, not all of them were perhaps as technical as that one. This student took the time to write, type included. This was actually one of my favorite suggestions for how to improve Harvard with a tool or website or piece of software. [ Laughter ] Alright, is this even, is this a thing now? Because I've never heard this. This one was similarly hardware oriented. [ Laughter ] We got a lot of things like this. So this is form spam. There's a lot of people with way too much time and malicious motivations where they don't send you spam trying to buy things, they just send you crap like this, and this was submitted automatically by a bot of some sort, a spider, a program that just crawls the web looking for forms to submit. And I actually, a little timidly, visited all of these URL's before class just to make sure. I'm not saying go to sketchy.com here. None of them even work and this was just posted a few days ago. And this fascinating theme of the world of spammers is one, they rarely speak or use tools with which to write perfect English. So frankly, a really good heuristic for detecting if an email from Bank of America is legit, Bank of America's probably not going to make an email riddled with spelling mistakes, so that's one heuristic. And it's funny, because you would thing if you're trying to capture a lot of people you'd run it through like Google translator or whatever the tool might be, or Microsoft Word, and you might actually buy the domain names that you're telling people to actually go to, because these are, in fact, dead ends, at least as of this morning. Other things that, this was interesting, that was on suggestion for how to improve Harvard. Another one, same person. [ Laughter ] These were from Carlos. Hello Carlos. And then perhaps related in spirit was this. [ Laughter ] So perhaps beware of that particular applications. So frankly, some of the juiciest data that we received, those things aside, were those from Problem Set 4. So frankly we could spend all day, all week, all month reading through some really interesting pros. I printed this out this morning. These are 84 pages, two sided, 12 point font, of all the gripes you people have about this place and tools that you've used, but it's really quite fascinating. So what we're going to do on problem set five is ask you with a checkbox are you okay with our posting your comments anonymously to the website, because there's really some fun comments in here. There's some really astute observations. And there's really, I think, some food for thought when it comes to identifying these same kinds of problems in your own life. It's curious some of the themes were many people commented on tools close to home like my.Harvard, Hollis, CVS in the square recently put in this self checkout system, which at the moment is a disaster, I think. I've never waited longer in lines at CVS than I have over the past couple of weeks. And this is, you know, probably a boom for them because they can hire fewer humans, but they put the burden on us and you see lines perpetually. So that was one theme. Bank of America, people had some gripes with their ATMs. And similar, dining services web site, apparently it's hard to find out what's for dinner and what the nutritional content is. And also a theme with sports sites, ESPN and MLB.com, it seems that perhaps there's a theme in that industry of not giving as much thought to some of the user interface issues. There were also some comments, of course, on CS50 stuff, to which we are quite sensitive and will address. A few things on color blindness issues where there are some color things we've done incorrectly, so we will take a stab at fixing those. And what else? Oh, there was one comment that it's really hard to find the course websites for previous years, previous semester's instantiations of courses. So in Harvard courses, CS50.net, if you expand the course, you can now see all prior year's websites. We found all the URLs for you. And let's see, anything else juicy here? Oh, okay. So have you ever tried to play a simple game of Halo on the Xbox 306, man's best gaming system? After turning on the machine you were required to navigate through multiple menu items and obscure system settings simply to get to, quote unquote, play Halo. I would venture that most people who turn on the Xbox with the Halo disc in it are trying to play Halo, not customize their avatar's mustachio. [ Laughter ] So this, too, is actually a recurring theme. My soapbox that day with the MBTA was that too often our tools are problem solved without the common cases in mind. And again, in the case of the MBTA, my God, the common case, I'm going to guess, 90 percent, 80 percent of the time is someone wants to use the machine to just get on the next subway to some destination and it's eight steps, so a clear opportunity for doing something better. Here, you plug in, you turn on your Xbox to play Halo, apparently you have to jump through hoops just to play Halo. I have a PS3, which I rarely use, partly because every time I turn the damn thing on, because so many weeks have passed it wants to do a firmware update and download new software, but it forces you to do that before you can actually play the Blu-ray disc or play the game. You know, idiotic design decisions, not optimizing for the common case. So if you are amendable we will invite you to check off a box with which to share these ideas anonymously because it's certainly a good way at coming up with entrepreneurial ideas, coming up with good project ideas and really just making you a more astute technical person moving forward. The theme of this course and courses like it, CS1 and BIS and Quantitative Reasoning or Empirical Mathematical Reasoning, all of them from the CS department is really to output students that are not necessarily going to grow up to become computer scientists, not necessarily going to program ever again after this course, but are going to be in positions in society where you're making decisions and you're making technical decisions or you're making decisions that affect technical people and one of the best things we can do in any of these intricacies, frankly, is at least empower you to make more intelligent decisions than is currently done in all spheres of life, business and politics alike. So hopefully we will have some CS50 alumni in important places someday so as to make the world a more user friendly place, to say the lease. Alright, so a few, let's see, a few wrap up details here. So just to plant this seed, you can in fact manipulate data. This is a very awkward transition from changing the world to bit wise operations, but so be it. So bit wise operations, this is just a mechanism provided by a lot of languages with which you can manipulate data, variables, at a bit wise level. You can change individual bits, you can look at individual bits toward various ends. We looked at some simple examples on Monday at how to change things from uppercase to lower case, we generalized the idea of actually using am ask to extract data. And one of the themes of problem set five forensic is that kind of thing. Not necessarily using bit wise operators, but really controlling precisely the data that's on disc, the data that's in memory to extract those details that you want. So just so that you've seen these tools and can tuck them away if ever relevant in the short term, again the ampersand is for and-ing two bits. It's saying if this bit is one and this bit is one, then the output is also one, otherwise it's zero. Or it does something similar to Boolean operators. If it's a one or a, if this is a one or this is a one or they're both ones, the answer is again one. XOR is interesting. XOR actually gives you a lot of neat capabilities in life. XOR says that the answer is one if and only if only one of the inputs is the number one. So if your inputs are one and zero then the XOR of those two values is one. If your inputs are zero and one, then the output is one. But if the inputs are one and one or zero and zero, the output is zero. So exclusive or kind of does what the name implies. Exclusively one of the bits must be one and not the other. Now just as a teaser, using this operation like XOR you can do some neat things. In fact, if you've ever heard of RAID, an acronym, redundant array of independent disks, it has to do with the hardware world, but it's increasingly common in desktop computers as well as servers. Using this very simple operation you can ensure that if you have a computer with three hard drives you can end up using space equal to two of them and the third drive essentially just stores the XOR of all of the bits on the other two hard drives. And the implication of this very basic operation is that in such a computer that's using this technology called RAID, you have three hard drives, if any one of them dies, the additional one, thanks to XOR is enough information to recover all of the data. So it's kind of a special trick with which you can have error checking and error detection built into a system. Hugely advantageous. And if you're interested in that kind of stuff there are plenty of hardware courses down the pipeline. You can also do one neat thing. For the true geeks in the audience I just thought I'd show you this real fast, swap2.c. It was in Monday's printout. This is a copy paste from our many previous iterations of this notion of swapping. But it turns out I was lying to you when I said you need a temporary variable in order to swap two values, right? The problem always was that if you don't save A or B's value in a temp variable, you're going to clobber one, you're going to overwrite one and lost track of the other. Not so. If you really want to be a geek or you really want to be space conservative you can actually use the XOR operator, which is implemented in C as a little caret symbol, which is above the number six. This is the XOR operator. You can execute these three lines of code here down at the bottom and it turns out nearly magically these three lines together will simultaneously conceptually move the bits from A to B and B to A without losing any of them. So it's kind of a neat operation but not something we'll dwell on today. So things that we can do, though, include, or rather details that are relevant are this, one of the themes too that comes up in problem set five forensics is in actually reading bytes from disk and actually comparing them, looking for instance for the special signature that JPEGs have. If you haven't read the specs yet a JPEG image almost always starts with the same pattern of for bytes. And that pattern means here comes a JPEG. And if you see another such pattern that usually means here comes another JPEG. And leveraging that very simple hint, those for bytes alone, you can essentially, with high probability, recover lots of JPEGs. And that's one of the goals of that pset. Well, it turns out that depending on the operating system or the computer architecture, the CPU that you're using, sometimes numbers, if they are multi bytes, are integers are, as doubles are, as longs are, depending on the architecture, sometimes those bytes are laid out in opposite order. So just again to plant the seed so that you've seen it and if you run into certain bugs that the teaching fells address in section or office hours it makes a little more sense. Realize that there's two ways of implementing numbers on disk. You can either take all for bytes and essentially write them on disk left to right, or you can write them to disk from right to left. And the effects here, this is a little screen shot from Wikipedia, a nice little article that discusses this issue, essentially if you're using a big Indian architecture the number is laid out just as we've been thinking about it. But little Indian architectures, which is very much in vogue these days, Intel CPUs use this thing called little Indian, if you actually look at the bytes at a low level what you think should be over here is actually over here. And again, we won't spend time on this today, but just in case you run across certain bugs in problem set five and we point you toward this notion, just realize there is this subtlety that differs among architectures where numbers are stored differently. But you will be playing with pointers a bit, on this problem set. And one of the other files from Monday was this thing here, pointers1.c. This program is actually pretty old school at this point. This is just a program whose purpose in life is to get a string from the user using our get string function, then we store the return address in a variable called S. Why do we check for null here? What might actually go wrong here? [ Inaudible ] This may not be a string. And in what situations might get strings return null like that? Someone from over here, perhaps? [ Inaudible ] So not enough memory, right? And so if the user is being annoying or malicious or the computer is just very memory constrained, it the user types in something too long in memory then malloc is going to return null and recall that get string uses malloc and so this, too, will return null. If you don't check for null and just proceed along your way and dereference a potentially null pointer, what, of course happens? Seg fault with high probability, not always, but with high probability. So this part of the code here just does some old school stuff now. We loop from I equals zero on up to N, what is N? Well, we initialize N to the return value of string length. So this loop is going to iterate conceptually over every character in this string S and then it's just going to print it using S bracket I notation, each char one at a time, one character per line. So again, I say old school because there's nothing new here today. In fact, it's a bit of a regression using strings as though they were arrays, which is very convenient notation, but we've of course been talking about strings as pointers lately. So much so that we now have been preaching this habit of freeing any memory that you allocate with malloc. Or, in turn, any memory that you implicitly allocate using get string, because it again uses malloc, and so one of themes for Problem Set 6 will be to use that tool [inaudible] from Monday to actually help you find situations where you're forgetting things like free. But the fact that a string is really just a memory address, it turns out that you can do the same kind of program but treating S as thought it truly is a pointer. And so you can use something called pointer arithmetic. So again, just to plant the seed here that will recur perhaps in Problem Set 6 or in future courses if you continue on, this program, again, just gets a string from the user, stores the return address in S and does this little null sanity check. This for loop is identical up here. The only thing different about this program at the moment is that I'm using this, star parenthesis, S plus I close parenthesis. So star, remember, is the dereference operator. It means go there. So how is this actually doing the same thing? Well, just so it's clear, what both of these things are doing, let me run pointers, whoops, interesting bug, alright, standard lib. I screwed up so let me recompile. It was complaining that it couldn't find free. So let me run pointers to enter, hello, that's all this program and the predecessor did. It just prints the word, one character per line. But how does this version work? Well, if F is an address and it's pointing to conceptually a string, what it really is is the address of a character and the string is just the result of going step by step by step until you see what special character demarking the end of the string. So the null terminator, which is not N U L L, but backslash zero, right? That's the special character we look for. So what this is really doing is it's saying what? Take S, which is an address, it's like OX 1, 2, 3, 4 or 5, some location in RAM, it's adding I to it. Now initially I is zero. So initially this is meaningless, it's just S. So star S means go there. Well, when you go to S, what's waiting for you? Just a character right? If it's the address of a character and you go there, obviously what's waiting is a character and so we can continue using percent C and we print out that character. Well, now iterate one more time in the loop. So now I equals one, so S plus I is whatever the address is, 1, 2, 3, now we add one, now it's 1, 2, 4 and so now we say go to address 1, 2, 4. Well what's waiting for you there? You know, the next character, because arrays are continuous. So it's neat at this point in the course, now that we've pealed back this layer of the CS50 library and this layer of array notation for everything. In fact, when you have a pointer, you can go anywhere in memory that you want. And the fact that you can just come up with an arbitrary value, perhaps arithmetically like this with some addition, means that you have great power, but again, also great risk in your programs. As we've discussed in the context of buffer overrun exploits because if you can do it, so might someone who actually has malicious intent and somehow tricks you into running code that you, yourself, did not write. So that is generally known as pointer arithmetic, but it's really just how that square bracket notation has been working all of this time. So any questions about bit wise operators or pointers? >> Isn't it adding a pointer to an integer? >> Isn't it adding a pointer to an integer? Yes. But that's okay because in this case the integer will be essentially implicitly cast up to an address and it will work okay. This notion of pointer arithmetic is built into the language, so it is permissible, even though it is, in fact, two different types, a long and a pointer, but it's legit. Alright. So now the proverbial fun begins that I keep promising. If only because no we get to really think about and talk about some ideas that you can start taking for granted, frankly. Once we get to web based programming you'll find that in languages like C, I'm sorry, languages like PHP and JavaScript and Python and Ruby and Java, there's so many other languages out there, and rest assured, frankly, generally, you just need a good foundation in one language, maybe two, three languages, and the rest you can teach yourself. I mean, case and point in school I took CS50 where we learned C, CS51 where we learned a language called LISP and C plus plus, and that was it. I never again formally learned any other languages. I bootstrapped myself from that process. And this isn't because I was particularly cut out for CS, but it's just really is true that the marginal returns on some foundations and different types of languages is pretty huge, so one of the themes next week onward is that we're going to move pretty darn quickly through PHP and JavaScript. I'm not going to teach you the whole language, because frankly you're going to see that for loops look almost the same, y loops look almost the same. There's going to be some more functions in PHP, which is great, because it means you can implement fewer uninteresting things yourself. But what you'll find is that one of the goals of the next few weeks is to help you realize by doing it together that teaching yourself new languages is absolutely quite possible and quite useful when you're in some lab or some job. Even, frankly, in the financial and consulting industry, where a lot of my friends, case and point, never once programmed in school but are now doing it as part of their analytics work. So you learn. And better to learn in advance than have to learn on that kind of job. So we talked about stacks as a new tool with which to implement a data structure. And a data structure is generally just a convenient mechanism for representing data so that you can solve more interesting problems or solve problems more efficiently. So arrays were fairly limited, link lists gave us some dynamism, but the problem with link lists really is that it's just this long line of people, long line of numbers. We lost the division and conquer capabilities. We lost the binary search capabilities that we started the course with. So can we get some of those features back and get more of them? Well, just to tidy up a loose end, I felt bad in that I completely, I felt I did not express very clearly how one might go about implementing a stack or a queue and we won't spend great time on it because you can spend entire weeks discussing the implementation details, but I at least thought I'd propose a canonical example to kind of clean up my verbal mess from Monday. So here is how you might implement a stack using some basic building blocks. So again, a stack is akin to the stack of trays in a dormitory. So type def struct says here comes a struct, the last word there says stack, which means I'm going to give this structure the name stack, so really the interesting part is inside of the curly braces. Now a queue, or rather a stack, presumably has some fixed length. That's generally the assumption in the world of stacks that, you know, yes, you could stack cafeteria trays ad nauseum, but at some point it's going to fall over. So realistically there is some upper bound. And certainly in a computer there's going to be an upper bound based on the amount of memory that you have. So it stands to reason that this data structure generally has a fixed size, which means we can actually implement it using an array. If we know it's going to be a fixed size, we can just allocate that memory at the beginning. Let's assume that capacity, in all capitals here, is just a constant declared somewhere else that's a hundred or 200 or whatever, it's a fixed value. So this says the structure called stack is going to have an array inside of it, in this case called numbers, so this appears to be a stack that stores just numbers. So pretty simple application, just stores some numbers. But I need to remember some key details. Initially I might have space for 100 integers, but there's not necessarily 100 integers in there. I probably have 100 garbage values. Or if I take some care, I might have 100 zeros or negative ones, if I initialize the whole thing to some known values. But the point is that it's not sufficient to just say here's a chunk of memory, now you have a stack. You now need to have some meta data associated with the structure to remember what's in there and how much of it's in there so that you don't risk returning some garbage value or overstepping the bounds of your array. So I'm going to remember one, the size. So size is distinct from capacity. Capacity means how much space is there and size means how much space are you actually using. So initially I've probably initialized size to zero because there's nothing in an empty array by default, and then top. So I chose top specifically because if I'm stacking cafeteria trays, really the only number I care about is where in this array is the topmost tray? Where in this array is the topmost number? Now why would the topmost number not just always be at bracket zero? Right? If I have an array it feels like the logical thing to do is to put this top of the stack in bracket zero. Now why is that not, why is a stack's implementation not as simple as that? Why do I need to remember separately where the top of it is? Yes? >> [Inaudible] >> Yes. Exactly. So it's a design decision. I could absolutely start by putting numbers into this stack by just starting at bracket zero, then bracket one, then bracket two, right, old school style, totally reasonable. But if the point is to maintain some kind of order, I need to remember which one went in last so that I take it out first, so it's a LIFO structure, last in, first out. So in that case, if I start at bracket zero putting in some number, I put a tray down. Then I do it again at bracket one, that's another tray, then another number, that's a third tray. Really the number I care about at this point is not bracket zero, because that's actually the bottommost tray that I don't really care about until I get there. I need to remember bracket two at this point. And so using this variable, this meta data like top can allow me to remember that the current top of this tray, this stack, is just the number two. And size, by contrast, is also, is three in this case because I've put three elements onto the data structure. Yes? >> Is size always [inaudible]? >> Good question. Would size always be top+1? It could be, but what if we start putting in like 100 numbers then we start popping off numbers? Pop is the lexicon for removing things from a stack, much like a cafeteria stack. So if I do this, I start putting in number, number, number, number, number, number, number, and now I hit the end of the queue and I've started popping, I've kept going up and up and up and up, I've started putting numbers into the queue, actually, is this? Let me debug this in my head for a moment. Yes, it's kind of stupid. Alright, so this is, in fact, not necessary. I was conflating it with a queue. So I will fix this verbally in a moment. I almost had it correct this time. So let me come back in just a moment and transition to this, if the wave of a hand, but then we'll fix that one in just a moment. So the other notion was a queue. So a queue is the real world's concept of actually having like a line of people and you want to maintain fairness so it's a FIFO data structure, first in first out. The first person into the structure is the one you want to remove from the structure first, and so we might implement this as follows, whereby we have an array called numbers, again, with the same kind of capacity. Here we have size. But here, we might want to remember where the head of the list is. This is what I was thinking here. So in the context of a queue, you might very reasonably, in your code, put the first person at bracket zero, then the next person at bracket one, then the next person at bracket two. So at that point in the story, size is three, but head, the front of the list is zero, where ever the first person was. Now, if you remove the first person from the queue, you now conceptually have this hole in your array whereby you still have a space at bracket zero but there's no human actually waiting in line there. So what could you do? You could just move all of these humans, or move all these integer over by sliding them in the array. But what's the running time of shifting an array over one space to the right or to the left? It's linear, right? In the worst case you've got an almost full queue, you've got to move every darn number over to the right or to the left and what's the point, right? I can just remember where the head of the queue is. And so the size might be decreasing, but the head is moving. And then in this case with the queue, I might to save time and memory, I might want to wrap around at the end and reuse all space, in which case I might very well want to remember where the tail of the queue is so that I essentially treat my array conceptually as a circular structure. I don't care where the start or the end is in memory. I just have to keep track of these variables. So I'll go back before long and I'll clean up the stacks structure. But let's focus now on ones we'll deploy in Problem Set 6. So Problem Set 6, we'll hand you a huge text file with 140 some odd thousand words from an actual dictionary that you might use in Microsoft Word or really any tool that can spell check. Unfortunately it's just a text file. It's got one word per line, but all it is is a text file. And so one of the thing you're going to have to use is your newfound comfort, or by Friday new found comfort with file I/O to read those words in from disk and then to load them in from memory. Now the simplest implementation of a spell checker might just be how I allocate an array that's 140,000 in size so that you have one location for each string, and then you load each of the strings into those locations. The upside of this, because the file we'll give you is, in fact, alphabetized, which might be useful. The upside of this is that now you can use like binary search. If just conceptually you've got a huge array and you don't have numbers and you don't have humans in each of the slots, you have actual words, well there's that function called stir comp, stir compare, you can absolutely use binary search on a really big array to find some words so that when your program asks a question is the word the user just typed in spelled correctly? You just go boom, boom, boom, boom, boom using binary search on your array and you can return true or false. Now is that the best approach? It's not bad. What's the burning time of binary search on an array? It's a log end, and log end frankly has kicked the butt of every other algorithm we've looked at just yet, but as promised on Monday the holy grail really would be constant time. Like I don't even want to wait for log end time, which for large data sets it might be asymptotically fast, it might be theoretically fast, but if you have large enough data sets, not just 140,000 words, but a billion web pages or 500 million users on Facebook, even binary search on a data set that size might very well take some time. So certainly returning it in constant time, one step conceptually would be optimal. And so enter into this picture the notion of a hash table. A hash table is often implemented as an array, but it's implemented as an array that's bigger than it needs to be to store all of the words or the numbers or the whatever's that you want to store in it. And the idea is essentially if this is my hash table, this is just a big array and on the left hand side I've just numbered the locations in a traditional way, zero at the top, one, two, three, dot, dot, dot, down to 25 in this case, you can think of the hash table conceptually as being a data structure where you take some input, whether it's a number or a string, and then you, kind of like darts, throw it into the data structure and hope that it actually sticks and doesn't hit some existing value. Now that would be more dramatic if it actually stuck, but that was off the top of my head. We'll come with sticky things next year. But if you throw this input, this number or this string, into the data structure you're going to hope that it lands in some blank spot as the board currently is. And that's great, because how much effort did it take to pull it to my string or my int in pretty much a random location? It was one step. It took me one throw to actually hit the data structure. But there's a downside. What's an obvious downside? [ Inaudible ] So if there's something there, right? So this idea of probabilistic insertion into a data structure could get you into trouble. If you've got a huge data structure and only three pens to throw into it, you know, statistically you're not going to have any collisions and so odds are you'll have too much space, but at least you won't collide, which means all of your data will fit in there. And so that's pretty good. But if now you have lots of data in there, the odds of throwing a new item into your data structure and having it stick into some empty location is pretty actually low. And plus, even when you go through this process of throwing something into the data structure, well, how do you then find it? Right? Insertion is pretty damn easy. But what about search? I mean, who knows where it ended up? So we clearly can't just randomly pick the insertion point, otherwise we're killing the running time of the search algorithm. And arguably, what I care about most is finding this data in the future. Right? I'm putting it into a data structure presumably in order to create a situation where I can get it back in constant time. So randomness certainly isn't the case. But let's suppose that it is random. And let's asking ourselves, if you have a container, a table, a hash table that's initially empty and you just throw randomly some pens into it, you know, what really is the probability of having a collision? Well, it turns out, even though I said verbally, you know, the odds of my hitting another number it's pretty small, it's actually not. And if you've ever heard of a little mathematical problem called the birthday paradox, you can model this question very easily and see that, you know, even if you had a whole lot of memory and relatively few things to throw in it, you're still very likely to actually hit something that's already there. So we can frame it as this, in a room full of N CS50 students where N is declining over the weeks, what's the probability that at least two students have the same birthday? So this is kind of the same problem, it's just framed in kind of a silly way. But here the hash table is 365, or maybe 366 for leap years and such, but we have a hash table of 365. If we assume that your parents had children with by a uniform distribution over the years, which I don't think is actually statistically true, we can assume that if we throw all of your birthdays into this bucket that's the size of 365, that we'll get collisions eventually, but at what point? Now with a class of 501 students, you're going to thrown those birthdays into a data structure the size of 365 and obviously you're going to have collisions, because you can't fit that many people into that few bucket, but even if you're taking a smaller seminar course that's only 40 people, turns out it's a very high probability that even in smaller classes you're going to have collisions. And this is an experiment that you can run, frankly, in your next section or your next seminar class. And we can model it as follows, if you ask yourself when I throw the first birthday into this data structure, what's the probability of a collision, or what's the probability of it not colliding with anyone? Well, it's 100 percent. The data structure's initially empty. You have a 100 percent probability of not colliding with anyone else. So this actually turns out to be one of those problems where it's a little easier to answer at the opposite question. Not what's the probability of a collision, what's the probability of no collisions? So that's the question we're asking. Well, for that first person, it's one divided by one. There is a 100 percent certainty that I'm not going to collide if it's the first birthday. What about the second person? Well, it's a pretty easy problem. It just means that person obviously has a birthday, so it means that there's 364 other buckets that they can fall into without a collision and so we can model that as one minus one over 364, which is just 365 minus one over 365. It's 364 over 365. Now we do this again and again and again and again and we ultimately get a formula that we could plot and we could actually see at what point the probability becomes worrisomely high. And so in the end, if you work this out with the exponents and such, this is the concise way of expressing the probability we're discussing verbally there. But it looks like this. And this is the neat take away. So if you had just one person in the room, one birthday, there's a zero percent chance of a collision, of a match of two birthdays. But notice this curve, this is 100 percent change, 90 percent, 80 percent chance. I mean, notice how quickly this formula hits 90 percent. And that, I picked 40 for specifically this reason. It turns out that statistically if you assume a uniform distribution of reproduction then you will have, in a class of 40 people, with 90 percent probability a collision of birthdays. And once you have a class of like 58 people on the right, it's a near statistical certainty. So this is kind of a neat sociological curiosity. But for us, the question is actually very germane because if we're trying to create a data structure that has some fixed size, because we only have so much memory in the world, we want to over allocate memory, because we need to have some holes in this structure with which to avoid collisions with higher probability, but it looks like we need a super amount of memory, way more than 365 buckets just to fit a 40 person seminar class. So the assumption we've been making thus far is that it's this uniform distribution of numbers. What if we actually look at the input, the birthday or the number or the string that we're inputting and try to figure out its location, not randomly, but actually analytically so that we minimize the probability of collision. And so that's what we'll do after our five minute break. [ Silence ] Alright. So we are back. So one amendment. So I finally remember why in the implementation of a stack here and in a C data structure that I had both size and top. It was actually for one corner case. If you only have top to remember where the top of the stack is, there's one gotcha, which is when the stack is initially empty. And so if the stack is initially empty and you initialize top, for instance, to zero, you might otherwise return a garbage value. Now there's a way to fix this, of course. You could just declare top to be some sentinel value, a special value like negative one, to signify that there's actually not something here. Or you can take this alternative approach and just maintain it as a separate end, which is the approach I took here. But actually a decent take away here, a save for me, is to say that though we might introduce these structures in one way, there's absolutely different decisions you can make. And in fact, one of the themes of pset6, when you implement your spell checker, is going to be to decide for yourself what would be the best way to implement this structure or that and you'll be guided toward a couple of possibilities in this Sunday's walkthrough for pset6 in particular. So the question I had a moment ago was what's the probability of a collision and the take away is really, it's pretty damn high. So collisions are going to happen so we need to be able to mitigate these somehow or at least deal with them. So what's the simplest possible implementation of a hash table? Well, the simplest one arguably is just an array, a fixed size, and you take you input, let's say an integer, and you try to put it into the hash table at a specific location. And if there is a collision, as we say, if there is something else already there, you know what the simplest fix is just to move it then to the next bucket. And if there's something there, move it to the next bucket. You know, just make a best effort to find an empty location and if you miss, then just increment, plus, plus, plus, plus, plus, plus, plus until you find a hole in the data structure. Now what might this mean? So when you throw a pen at the board, you're doing something that's generally called hashing. To hash is a verb that generally means to take as input a number or a string or something and then return as output the intended location for that input in the data structure. So we might implement a very simple hash function as follows. If I need to return an int, because again, the function, the role of a hash function, what's called hash, is to take as input, let's say a number, and then return as output the location of that number in the hash table. It's answering the question where do I put this? And now we could, as I did physically with my pen, we could just do something like return rand, right? So that would be the simplest approach. And that's actually not quite right. I'd actually have to multiply it out so I get the range correctly, but we could use rand and actually get back a random value. But again, to be clear, what's the downside of putting an element in any data structure at some random location, which is pretty fast, but there's a downside. >> Search. >> Search, right? You can't find it again unless potentially you search through the whole data structure. And the worst case running time there is probably going to be big old event, because if it's in some random location, it might be in the last location you check just by bad luck. And so that's not the best goal here, right? We could have done better, frankly, with binary search and array. So that's a step backwards. But we could do this. Suppose that the input is N and let's suppose that we've got a pretty good data structure. You know what I could do if I want to store the number N in an array, where should I put it? Well, the, you know, the common sense in me says why don't I just return the number itself? So if I'm trying to insert into this data structure the number zero, where should I put it? How about bracket zero? And if I'm trying to insert number one, where should I put it? How about bracket one? How about the number 25? How about bracket 25? How about the number 26? Oh, there's the problem. Right? If your number, if your inputs might actually exceed the balance of your data structure, which isn't a deal breaker, right? Because if they're sparse, if you have a number here and a number over here and number here, you don't need a super big data structure in order to store it. Like some of you, for the hacker edition of pset2 might have to get linear sorting time. Well we need to actually exercise a bit of intelligence. So what's the best way to avoid wrapping around to the end of a data structure that's only of size let's say 26? How do I change this formula? Alright, so, modulo, right? So modulo is kind of our savior when it comes to wrapping around. So this is better. Now my goal is to insert an integer into some data structure and that data structure is of size 26. Well, now, thanks to modulo 26, no matter how big my input is, whether it's zero or 25 or 26 or 2000, I'm always going to get back a number between zero and 25, because of modulo and so that will definitely fit. But if you're allowed to have these super large numbers going into a fairly small space, what's going to happen when you has into as input the number zero? Well, it's going to go back at zero. What if you hash in the number 26, where's it going to go? >> Zero. >> Bracket zero, right? Because 26 mod 26 gives you no remainder, which means bracket zero. So again, you've got collision. So one of the solutions to this might very well be conceptually, we won't do this part in code, because it's easier to just do it verbally, but one of the solutions might be if I try to hash into location zero and something's there, we'll take the location I tried to hash into, just do plus, plus, and check location, in that case, one. If there's an opening, put it there. Now how do you find that element in the future? Well, it's the same process. If I am asked a question, search for the number N, well, I first run it through the same hash function and I get back the same answer, zero. So this function search is going to go ahead and check at bracket zero for the number N. is it there? If so, it returns true and search has done its job. If it's not there, should I immediately return false? >> No. >> So right, probably not, I don't want to return false, because where might it be? At you know, bracket one or bracket two or worst case, and here is the got you, worst case, what if the whole array was so damn full except for slot number bracket 25? You know, even my number zero or my number 26 could, by bad luck, end up all the way at the bottom of the hash table. So this method that we're describing is generally called linear probing whereby you probe at subsequent locations in the hash table just looking for an opening, looking, looking, looking, but it's linear in that you keep doing it like a plus, plus operator. And the downside of this, of course, is that if you've got a lot of input, or a lot of bad luck, and you keep hashing through the same locations and they're all kind of blocking together so you keep having to plus, plus, plus, plus, plus, plus, plus, plus, this algorithm ultimately degrades into big oh of what again. So end. So it's a little better perhaps than an even more naive approach, but it's not perfect. So we need something better. Yes? [ Inaudible ] >> There may be a number there, so this is incomplete. So what I really need to do here is in pseudo code if something at location, actually we'll do pseudo code, if something at location this, then I need to, not to induce giggles, but probe array. Alright, so keep probing down further into the array looking for, and that would be implemented with the while construct or a for loop or something like this, so just some pseudo code for now. Now this is not a very interesting hash function. This is just for integers. Let's still return an int, because hash tables ultimately boil down to the question where should I put this? But let's do something more interesting like a string. So if I want to take in a string S, I can't just return for instance return S ampersand N. I kind of could if I dealt with the casting, but S is a pointer. It's pretty meaningless to take the pointer and just hash it in this way. Although, you could do this if you really wanted to. But this doesn't feel very intelligent. Suppose that the strings that I'm trying to insert into this data structure are people's names, right, and so that's actually why this hash table, by default, has only 26 locations, because I'm assuming that there's a 26 letter alphabet, A through Z, and ignore uppercase and lowercase for now, let's just suppose that I really want to put names into this hash table with, ideally without having collisions. So what might be the best approach if my input is someone's name like Alice? I'm given a string, A L I C E backslash zero. How do I convert Alice to a number that tells me where to put her name into this hash table? >> [Inaudible] >> What, down here? >> [Inaudible] >> So we could add the ASCII values. Or you can actually do it a little simpler. I'll push back first so we know that a string is just a bunch of characters, so we know, and characters are just numbers. So if the goal is to get a number, we can kind of leverage the fact that characters are represented as numbers. And, in fact, let's not add them all together just yet, let's go super trivial here for now. Let me go ahead and return as bracket zero cast to an int. Right? Let me just look. It's a little heuristic. It might get me into trouble. But I could return the integral value of S bracket zero. But there is a bug here. What's this really going to return to me? This is going to return like 65 if it's capital A or 66. And already these numbers are beyond my boundaries. So really I should be doing something like this, return S bracket capital A. I need to whip out some code from like Caesar or Visionaire where I actually converted these things to numbers between zero and 25, so I could do something like this and I don't explicitly need the cast. It will happen for me. But this would take the character at the first letter in the word and subtract off capital A, and let's assume for simplicity right now it's all capital, and this is going to give me a number between zero and 25. Again, I need to improve this. I need to write better code to handle lower case numbers, maybe weird punctuation or strings that start with numbers. But in the simplest case, if I'm getting an all capital word, this hash function would work. But would it result in collisions? And if so under what circumstances? >> [Inaudible] >> Yes. So obviously you're going to have a collision if the two names start with the same letters. So Albert and Alice are both going to try to hash to location zero, so we have a collision. And I'd wager that that's going to be pretty common, right? If you've ever played Wheel of Fortune, there's certainly some letters in the English alphabet that recur more frequently and odds are if we look at the class list, I'm guessing we don't have that many students whose first names start with Q, probably not Z either, right? They're some relatively infrequent characters. If that is in fact your name, great. We have a spot for you in the hash table. But there's going to be a lot of people with B's and D's and C's and all of that that are just going to collide, collide, collide. So that kind of suggest that though this is kind of simple, it's constant time, right? There's no loop here. There's no algorithm. There's no looping process, which is why I pushed back on the idea of adding all the ASCII values, because that would take more time. Let me pluck off the first one, but it's flawed in that I'm going to have a lot of collisions for the A's and the B's and the C's, and I'm like always with high probability going to have an opening for this person where Q might be or where Z might be or where X might be, and this feels like I'm wasting space. So an opportunity for Problem Set 6, when you're given words from a dictionary, it's going to be to answer this question do you want to go naively but simply looking at the first letter? Or do you want to do something a little more sophisticated to minimize that probability and use something that takes more input into account? So, what's the solution really to this problem of collisions right? Even with linear probing, you're going to end up filling all of the spots eventually and then you're out of luck, right, then the data structure needs to be entirely reallocated which is hugely expensive, right. Making the decision of size in the beginning is probably better. So, a better approach is something like this. So, this picture here isn't about names but this one here is about birthdays. This is an array on the left hand side that is numbered from one to 31 to imply that we're assuming that every month has maximally 31 days. And the goal here is to take some people as input and put them into the hash table's location that corresponds to their birthday. So, if you're born on the first of the month, you go at location one. If you're born on the 29th of the month you go at location 29 irrespective of the specific month and year. So, this is called separate chaining. And this pretty much is an obvious solution to the problem of collisions. You know what, if you have a collision don't try probing the rest of the array just hoping you're going to find an empty spot because again, the downside of that is linear running time for search later on. You know what, stay where you are and just tack on the additional word or the additional input onto a list that might grow over time. But, at least stay where you are. So, here's what a hash table really generally looks like. These are something we'll take for granite in PHP and JavaScript in the weeks to come. But hash table, underneath the hood is very often implemented as something like this. It's still a fixed sized data structure where you have maybe 26 or 31 or a billion locations into which you put even more data than that, right. Linear probing was flawed in that you could only fit 26 elements. Separate chaining frees you from that arbitrary limitation by allowing you to have essentially this really clever amalgam of the thing we called an array and the thing we called on the right hand said here, link lists. It's kind of a merger of these two ideas. And so, with separate chaining, we still get not theoretically better running time but we get much better distribution of our values. If we've got, n elements, just to put some formulas on this, if we've got n number or n people that we're trying to put on this data structure. And suppose that the distribution of birthdays is uniform over 31 days. Well, how long is each of these chains going to be? If we've got n people and we've got chains of length and we've got 31 possible chains, well, if it's a uniform distribution, you divide n total inputs by the total number of buckets you have. So, this implies that after you start inserting a whole bunch of people into a data structure like this, even though this picture's a little ragged in that there's some short chains there's some long chains there's some no chains. In the end if you assume uniform distribution, you're going to have chains that are all roughly the same length. So, that kind of means that the running time for search in the end is going to be something like big O of n over 31. But, let's generalize this, n over k where k is the number of buckets that you have. Now, the mathematicians might realize that the k, if it's just a constant, every time we talk about asymptotic notation, we cross out things like constants. So, the little white lie here is that the hash table's running time is still big O of n. And so, fundamentally, theoretically, it's no better than linear probing, it's no better than an array, it's no better than a link list. But now, as we get to Problem Set 6 and the web based aspect of the course, when we start solving some more real world representative problems in the real world, taking a big number n and dividing it by a decent size number k, that's actually one kith, the running time that it would be if it were n. So, in the real world, though we might cross out these constants and asymptotic life, in the real world, constant factor differences, constant factor differences actually make a difference in terms of human time and the amount of cycles you're actually spending. So one of the goals of p set six will be to kind of say yeah, yeah, yeah, I know asymptotically this is still linear time algorithm. But, look how dam fast my code is because I have a lot of buckets and I have fairly short chains and I made some design decisions underneath the hood that really expedite in the real world the solution of this problem, namely, spellchecking. How might we implement this? Well, we might implement this with some nodes here. So again, I made the mistake two weeks ago of not defining what a node is. A node in computer science terms is just some kind of container. It's a data structure that generically is called a node. And inside of it is some stuff that's interesting. So, each of these little rectangles here, we're going to call a node. We talked about nodes in the context of link lists. It looks like based on this picture, each node has a string, a person's name, and then this little square which is meant to represent a pointer. So, the translations of this might be a node struct, and see like this. Every node in this data structure has a string, a word, someone's name. And let's suppose that length is a constant defined elsewhere. No one's name is going to be longer than like 50 characters or a thousand characters. There's some upper bound into which everyone's name in the world fits even if it's large and variant. Well, this just means maximally everyone's name can be length plus one characters long plus one being for the null character. Now, I also need, inside this node, another pointer to another such node. And that's why we have sort of this circular reference to something called the struct node. But this is very, very similar to what we did for a link list. It's just with link lists, we didn't have a character word array, I think we just had an int the day we looked at link list. So, this might be one approach for implementing that kind of structure there. Well, what more can we do? Now that we're starting to take arrays and this building block and link lists and we're wiring things together with pointers ever so cleverly, we can really start implementing arbitrary shapes and sizes inside memory toward an end of solving problems with data structures that are most conducive to the problem at hand. So, in computer science, there's this notion of a tree. It's kind of like what we'd think of as a family tree where you have the oldest member of the family at the top, the so called roots of the tree, and the children, maybe a left child and a right child hang off of them pictorially with arrows suggesting that each of these circles is a node, another type of node. We drew it as a circle. But, each of these pointers just points to another member of the family and another member of the family. And the idea of it being drawn from top down means this is like grandpa, this is mom and her brother and sister, this is her children, grandpa's grandchildren, and actually grandpa's here and so forth. But it doesn't have to be viewed as a family tree. Really the lexicon we care about today is this is the root of any such structure, the thing that's at the top that holds it all together. We call any descendants of the node the children, just like you would a family tree. And then, any nodes that are at the very bottom that have no children of their own are generally called, in graph theory, this is an example of a computer science graph, they're called leaves. So, let's do some jargon. Let's actually do something interesting with it. Well, we had an array back in the day for representing numbers that we were able to use binary search on. It turns out, even though this is not a fundamental improvement over an array because now we have some overhead of pointers. It turns out, now that you have the ability to wire things together dynamically and kind of branch out in different directions, you can resolve some of those same problems and will see more interesting problems using more interesting data structures. This, in fact, is kind of equivalent to a list in that we, an array of numbers, in that we have 22, 33, 44, 55, 66, 77, 88. And it's intentional that this picture is drawn in the way that it is. Notice that every number to the left of some node is smaller than that parent. And, every number to the right of some node is bigger than that parent. So, here's 55, notice how 33 is obviously smaller but so is 22 and 44 because all descendants of a node are, by definition in this structure, smaller. And same deal over here 66, 77, and 88 are all bigger than 55. Well now, what's really neat is that you can find things in this structure using not only binary search but finally that technique we called recursion. All right recursion was the act of a function calling itself, or more generally, it's the act of doing something and then doing that same something again but on a slightly smaller problem. Well how does that apply here? Suppose I am searching for the number 44. This data structure is generally called a binary search tree, the search meaning that the numbers are laid out in this special order, smaller than on the left, bigger than on the right. So, I'm handed the root, so just like a link list where I'm only given the firs node, same deal here. When you're given a tree in memory, all you need is one pointer to the root. And then everything else hangs off of it. Well notice here, what I might do is search for 44, I say search 44. I'm handed the root. Well is 44 equal to 55? No, obviously not. Is it greater than? No. Is it less than? Yes. And so now, I can chop this problem in half, just like the phonebook from week zero leaving me with a problem that's half as big but it's the exact same kind of problem. So, suppose that the search function that I'm describing here verbally takes its input in ints and it takes as input a pointer to a node of a tree. Well, you know what this thing on the left hand side here looks like? It kind of looks like another tree, so another piece of jargon people tend to use is yes 33 is the child of this node 55. But, you know what, 33 is also the root of his own tree, a smaller tree but that has his own children. So, if I'm implementing code now called search, and again it's a function that takes an int as a parameter and a pointer to a node as a parameter, well I can just call that same search function again on the left child pointer. And the only time I want my recursion to stop is if I find the number in question or there's no children. Right, I obviously don't want to keep calling this same search function on a child if it's not there. So this begs the question, how do you implement a node like this? Well, you can translate this pretty identically to the picture at hand. So I'm going to define this thing now as a node. So again, I'm redefining node all the time. And that's what I mean. A node is just a generic concept to describe some bucket of information in the picture. So, in this case of a binary search tree, well, we need a node to have a number in it. We also need that node to have a left pointer and also a right pointer. Now we can call these fu and bar we can call them anything we want. But, conceptually, they're meant to point to two other nodes in the tree. Now, you can get yourself into trouble right. You can start wiring this data structure together with loops and crazy sort of redirections. And you can get into cycles where then your code might very well get into an infinite loop. Well, what's the solution? Well don't do that. If you start implementing cycles, you're then implementing really what people call a graph and not a tree. When you're implementing a data structure like this, it's up to you, the burden is on the programmer to actually get the details right. Well now, if we are moving toward a tree, let me propose another approach to a spellchecker or a dictionary. Thus far we've emphasized hash table. And again hash table's pretty decent algorithm pretty decent data structure for a dictionary. To be clear, even though we talked about birthdays here, you can imagine taking as input a string, a word in a dictionary, and somehow using its asking values or something like that to come up with a location from one, or from zero on up to whatever the size of the hash table is. And then put that string there. Well, that could give us pretty good running time. Big O of n over k, is technically O of n. But as you'll see in the real world, hash tables tend to be pretty darn fast. But, there's also this thing called the trie, T R I E. It's meant to imply retrieval. A trie is an interesting data structure that we'll also talk about incidentally in this week's walk through that is essentially a tree each of whose nodes is itself in array. So, this is kind of trippy. So an array is specifically defined in trie usually as having a fixed number of characters, a fixed number of buckets, namely 26 or in fact in the p set you'll see 27 because we allow words to have an apostrophe for plurality and so forth, so let's supposed though for now that each of these arrays is a size 26 or maybe 27 for apostrophes. So what does it mean to store a dictionary inside of a trie, a data structure like this? Well, this happens to be a picture from a textbook that was implementing a trie for people's first names. And notice, even though it's not obvious at first glance, here's the root, the root node of this trie. And notice what each of the elements in the array is actually meant to do is it's meant to represent a pointer to another node. So, even though we've written, in this picture, letters of the alphabet, you can think of A as just mapping to bracket zero and B is mapping to bracket one and Z mapping to bracket 25 and apostrophe who knows where. You'd have to special case it but somewhere specific. So, what really, each of these elements in a trie's node represents is a pointer to another such node. And so, what you do with a trie is you, you implicitly store words as follows. If I want to insert the name Maxwell into my trie I first start at the root of the array, the root of the, the root of the trie. The trie initially has nothing in it. This is kind of later in the story. Initially the trie just has a single node with a whole bunch of null pointers. But, I want to store Maxwell into this trie. So, I look at the value of M which is 13, which is the 13th letter of the alphabet. So, let me subtract one for zero indexing in 12. So, I go to bracket 12 in the root nodes array. It's initially null. So I use malloc and I allocate another node. That's going to be this node over here. And I attach a pointer to location 12 that maps to that new node. Now, I haven't put the letter M inside of the trie's node. I've just implicitly assumed that the M is the 13th letter which means bracket 12 if zero index. So, it's pointer means, you know what, there exists a pointer here because there is some word in this data structure that starts with the letter M. So, it's all implicit. And this is what's pretty neat about this. Now, the story continues. The next letter in Maxwell's name is A. A maps to the first letter of the alphabet or bracket zero. So, I initially say, all right, give me another node. So, I use malloc and substantiate another node and pointers. And then, I hang a pointer off of bracket zero pointing at the new node. And now, for space efficiency notice that the author of this book did not draw all of these nodes out. A trie is interesting that it's a very bulky wide data structure because you have all of these arrays. So he's only showing you the letter that's actually relevant. But, if you continue the story, what a try really is it's a multilayered hash table where you initially hash on the first letter of the word, then you hash again on the second letter of the word. But the hash function's really simple. It always returns 0 through 25 which tells you what the number is in the array that's supposed to represent that character. And then, once you get down to LL in Maxwell, it turns out that you need a special marker in your structure that says word stops here. But in the end, what's neat about a tri, and it's a tricky one to wrap one's mind around the first time around, what's neat is that you're not actually storing any words per say in the tree it's all implicit. It's a huge data structure full of pointers not actual content. But, here's the magic. What then is the running time of search on a data structure like a trie? Has table recalled is big O of n divided by k but k is a constant so it was really big O of n. What's the running time of search on a tri? It's big O of, like well we need a new expression here. Let's call it m where m is the length of a word. Notice to find if a word is inside of the tri, you have to hash on that word's letters M A X W E L L so it looks like the running time of this is at least 7. Now, maybe if it's David I'm inserting, D A V I D, ooh looks like the running time of a tri is actually 5. But, the point is it's some constant. And, if the constant is defined as longest word, the running time of a trie's search function is maybe 10 if that's the longest word in the dictionary, maybe it's 20 maybe it's 40. But it's not depending on n. Which means, you can have a billion words in the tri, you could have a trillion words in the tri, and you know what, you can still find David in five operations, D A V I D. And that's what's particularly magical about this data structure. If you spend, frankly, a ridiculous amount of ram, the structure's pretty wide, you can get truly constant time look ups depending only on the length of the input not on how many things you're inserting. So, the challenge for you for Problem Set 6 will be to choose from among those data structures, and implement the fastest spellchecker possible. [ Music ]