>> David: Welcome back. This is Week 5 of CS50 and this is Sudoku in Problem Set 4. If you haven't dived in already this week you will be again handed some so called distribution code. Lines of code that we have written that provides you with a framework with which to begin this problem set and there, from there you then have to proceed to fill in some blanks namely to implement this game of Sudoku. Now for those unfamiliar, this is a game that's still pretty popular in various newspapers that you get on the subway and the like and you can fill in some boxes with numbers and it looks a little something like this. Now this screen shot here is of the staff's implementation of the hacker edition actually and what we've used is a little bit of so called ASCII using the characters on the keyboard in order to render a Graphical User Interface but what's interesting about problem set four is that with problem set three the game of 15, it too was a game and it too had a GUI of sorts, but recall that for problem set three to print that GUI out all we kept doing was calling print F from the top of the screen all the way down and when we had a special line of code that would clear the screen and then we would redraw the whole board from top to bottom. Now, this doesn't really lend itself to very good animation and it's certainly not terribly efficient if just to update the board you essentially have to redraw the whole screen. In fact, if you play with hence forth problem set three your implementation on a really slow Internet connection, maybe back home over vacation you might actually see the screen redrawing if your connection is sufficiently slow that you literally see each line getting drawn. So much more efficient at least conceptually would be if you want to update part of the screen don't redraw the whole darn board top to bottom, left to right. Just update the individual character that you care about. And so we introduce in problem set four a graphical library. It's called ncurses, for new curses, and this is a C library that is code that other people wrote that you can integrate into your own programs that allows you to have a much more compelling graphical interface. Now at the end of the day the graphics are still just characters from the keyboard, ASCII codes themselves. Toward the end of the semester we'll introduce actual graphics and in the context of web programming but as this board here suggests you can actually in this game ultimately move a cursor up, down, left, right as you can see with the actual running code here. So the game of Sudoku for those unfamiliar gives you a board like this, a grid. It's got nine numbers across, nine numbers down, and then it's divided into nine squares and you're given a bunch of numbers from the start such as those depicted here in yellow and the challenge of this game is to figure out what numbers you should put where all of these blanks are, where all of these dots are and the constraints are as follows. You can only put each number from 1 to 9 in a row once. So, you can have the numbers 1, 2, 3, 4, 5, 6, 7, 8, 9 in any order in that top row, but you can't repeat any of those digits. Similarly you have to obey that constraint with every vertical row in the chart and then finally you have to make sure that you never reuse the same number within each individual square. So those are the three primary constraints and to be honest even if you've never played this game before it's actually kind of neat in that especially if you play one of the n00b levels as we call them; one of the easier levels where you can pretty much always make forward progress if you look for logically a hole that you can fill in. The strategy that you can implement is something like this. Let's see if we can't find an easy one to fill out ourselves. So, for instance down here in the bottom left-hand corner if I move my cursor, I definitely cannot put 9, 7 or 5. I could put 6, 7, I could not also put 6, 7 or 3, but I could, for instance put for now the number 1, but no, I can't. Now, this is not a feature you have to implement in the standard edition, but notice in the hacker edition I changed the colors of all those dots to red. That's because I've violated one of the constraints. Why can I not put 1 in the corner? So 1 is already in that same square and so this is the basic idea and what's neat in the end is that you'll have the ability to actually play this game and if you choose to elect the hacker edition of this pset, well, much like God mode in the game of 15 you can just kind of hang on to the number 8 for hint and the game will be solved for you, but that's for this hacker edition only and the standard edition you implement the game itself. Some more on that in the specification. For those then who caught on to that definition of Sudoku pretty quickly, here's a little bit for geek humor. Right, pretty lame game at this point, but there is one correct solution to this problem. All right if you're not sure what that means, go ahead and ask in sections. It's a pretty simple game that one. So quiz zero is coming up. So this is the first of two milestones in the course. This is really meant to be not an opportunity just regurgitate things we've said, this is not an opportunity to see how well you can code on a piece of paper, but rather how well you've mastered certain concepts in the course, how you can spot say problems in code or how you can draw or how you can pen a few lines for code in the order of tens of lines of code in the context of a quiz. You're not going to have, of course, a computer with you, you're not going to have a compiler with you and so the quiz will be designed in a way that understands the constraints of this problem we'll hand out a PDF, print out of a PDF this Wednesday that will tell you exactly what to do next Wednesday so the quiz is in lieu of lecture next Wednesday so that you guys actually have something to draw on. We will not be in Sanders. We've reserved a few rooms elsewhere on campus with actual writing surfaces, and you'll be dispatched to different rooms according to this printout based on the first letter of your last name. So more on that on Wednesday, but know that there are innumerable resources coming up available to you. So, one, this Sunday during the regularly scheduled walk through time will be a course-wide review session. The details of that will be on this handout on Wednesday, but that will be 7PM this Sunday. All sections this coming Sunday, Monday, Tuesday will also involve quiz review. Because of the quiz there is no pset due next week. 

>> Student: Oh. [Laughter] 

>> David: Very sorry. There is no pset due next week so if there's a silver lining in having any sort of examination it is that there's no pset that corresponds to it. [Laughter] So, let's regroup here. So, more on that on Wednesday, but also actually in terms of resources there are three years' worth of past quizzes and sample solutions on the course's website. Just click quizzes so those are probably the best guide toward getting a sense of what format to expect and if you haven't noticed them already let me draw some explicit attention to what's on the lecture's page and has been here for some time. Everything we use and do in lecture ends up on this page. Now, if you expand everything at once, it looks a little overwhelming, but the juiciest stuff, for instance, are things like the scribe notes that one of our teaching fellows, Andrew Sellergren, produces. These are meant to be sort of an authoritative teaching fellow's guide to what happened in lectures and they're actually quite fun to read. Andrew like me is a fan of snarky little foot notes. In fact, he commented to me recently when writing the scribe notes why does everyone clap at the end of a computer science lecture? It's just a computer science lecture and I actually agree. You're all very kind sometimes to have this awkward applause at the end of lecture, but it's just a computer science lecture so feel free to get right up out of your seats as soon as we're done. No need for that. That will make Andrew happy, hello, Andrew. Andrew scribes these lectures based on the video so what we say ends up written down. All right. So, for those of you who are thinking of switching either to graded status or to pass/fail status, today is, in fact, that fifth Monday of the term. We will have pink sheets on the sides of the stage here. Feel free to just chat with me during break or afterward if you're a bit on the fence as to what to do. For those of you who are feeling like maybe you want neither of those options I thought I'd take a quick tour with you through bugs from years past that you can find actually in popular culture. These are all screen shots that random people some of them staff, some of them people on the Internet have taken and if you're feeling like your code is perpetually buggy and there's always something wrong, well, realize it is, in fact, not just you. So, this of course, was the first such bug with which we'll set the stage here that was actually found inside of one of the first computers. This is perhaps a screen that some of you are familiar with. It's Windows own blue screen of death. So, if you'll look here, there's not all much information that's useful to a normal person other than control alt delete at bottom left here, but there are some details that might start to feel at least a little comprehensible even if you're not quite sure what to make of them. On the top left, for instance, there's this cryptic looking sequence of digits, 0010E36. Well, this is just a hexadecimal number and this is just a way of encoding a number slightly more efficiently than decimal and what this likely refer to? So it might very well be an error code, it might very well be an address, too, as we've started to play with gdb recall that we often see these similarly looking codes. a little bit of numbers, a little bit of characters, these hexadecimal notation and it often refers to memory addresses. So, it seems that here a fatal exception, which just means error, zero E so this just happens to represent the number 14, but in this system called hexadecimal, but more on that to come. At this location, whatever that means, but it's mostly numeric or hexadecimal there. VXD happens to stand for Virtual Device Driver, which is just Window's way of saying some piece of hardware that came with like a CD or some driver that you installed into this computer itself probably had the bug and that triggered the entirety of Windows to crash in this way. So, a little more light heartedly, do you see things that are just a little playful? And here you see that this is clearly an address of some sort. We've seen things like this in gdb, but I've always thought it's kind of amusing at the end that they kind of simplify it with the memory could not be read with a little bit of finger quotes there with these same pop-ups though do you see errors within errors? [Laughter] So this is literally a screen shot. A slightly more amusing is literally this one. [Laughter] And no joke this is a bug from the mid-1990's that appeared in Windows. [Laughter] So that is what happens when you don't consider corner cases. Things can certainly get out of control. a few years ago pop-ups were all the rage on the Web. You get scenarios like this. a bit out of control. It turns out that operating systems like Windows, Mac OS, Linux and others are actually used in the real world including by brokers like Fidelity where if you run Windows on your sign you get blue signs of death. a little ticker outside some city corner there. This was from let's say, oh, this was one of my favorites back in the Vista era. This was from like a Best Buy or something like that where they were selling the product. [Laughter] So that's kind of amusing. Behind glass no less so it can't easily be fixed. Now with have Mac OS in fairness. So Apple kind of simplifies things for people. They don't tell you that something has gone wrong they just advise you you need to restart your computer now, but this is not, in fact, a good thing when this happens. Cent OS, which is a variant of Linux, one that we're actually using in the cluster, this is probably not what you need to do at this point in an installation process, inserting the disk number negative 99. This one is one of my favorites if only because I was ordering pizza late one night on Dominos.com, and I was like ah-ha, this is as perfect lecture example. What's the bug here? 

>> Student: [inaudible]. 

>> David: So 6.3, right? So this is printf error of some sort. Now they probably didn't implement their website in C. They used some other language, but this function printf and several variants of it that we'll see over time are very common in other languages as well, and it looks like the programmer that wrote this little advert for these chicken kickers did not realize that US currency is typically formatted to point to F as opposed to just percent F as a format string. Let's see, one other here that was given to us by a former student so this is bankofamerica.com, you log in but this particular day this user logged in and there is, in fact, a bug. Allow me to zoom in at top left. [Laughter] That really should not be, any hypotheses as to why this actually happened? It's probably something like a time zone difference or the time zone setting was wrong, something like this, but not really inspiring of confidence in a place where you're storing your money. So, real screen shot there. This is not actually a bug, but it's perhaps been written into infamy by the movie Office Space. It's something that you can still see in printers on campus. This is really just an example of an idiotic design decision when 99% of the time all you need to be told is to refill the thing, but instead HP decided to introduce this into the American lexicon. PC load letter anyone? Paper, cassette, load letter-sized paper, right? Instead of just add paper. So, if you've never seen this movie that I keep referring to, Office Space, do rent it and you will see three guys go to town on a printer like this in a very famous scene. So, without further ado, those are bugs that other people make. It's quite normal to be making them initially yourself and now is our transition to exactly the kind of code we've been writing so many bugs in. So, we had this picture, this mental model, for the past couple of days and this really just represents a computer's memory and we've started kind of arbitrarily allocating different areas of our RAM to different purposes. So at the very bottom there we have these things called environment variables, which we don't play with much, but they have to do with system-wide settings generally. The stack though has been of particular interest. It's on the stack that any local variables go. Any time you call a function it gets a chunk of memory called a frame on the stack and you keep layering them again and again and again every time one function calls another, but then as soon as that function recall stops executing, that memory goes away or at least it's no longer valid so it's not safe to trust, but we have on occasion seen variables containing garbage values. If you just declare int X semicolon, recall that on one or two occasions we've actually used printf and printed out its value and once I think it was zero just by luck, but then some other time it was some completely random value. Well, where did that come from? Well, if you're using this stack and reusing this stack and you're just changing what bits are where by assigning variables values and then you're just forgetting where those values were, certainly if you reuse that memory later on in your program might you have some bogus value inside a new variable because it just happens to have been put inside RAM where something once was. And since this is why it's terribly important to always initialize variables to some knowing state least you accidently use some value that's not yours or actually follow a pointer, an address, that's not, in fact, valid. This is actually a security concern, too. You can imagine a program whose purpose in life is to ask someone for their user name and their password and you check that their user name and password are correct by calling a function and so while you're in that function, you need to store what the user typed in for user name and for password in the frame on the stack. Well, as soon as you've checked that their user name and password is correct, this function might return, the program might go on executing, but where is the potential security danger here? Yeah? 

>> Student: [inaudible] 

>> David: Yeah. So, your user name and more importantly your password might actually be lingering somewhere there in memory. Now, so long as your program is trustworthy, you yourself have written it, that's maybe not such a big deal because only you, the programmer of that program, can later access that same memory, but we'll talk briefly this week about something called a buffer overflow exploit that is one of the most common ways, some random person on the Internet or some hacker can compromise a program by inserting data into your program and tricking it into running some other code that they wrote. So, if this is the case whereby you can leave secret stuff like passwords somewhere in your memory but there is a chance that someone else can access that memory, therein lies the potential security threat. So, more on that in more technical detail to come. This thing called the heap. What was the difference about the heap that we touched on at the end of last week? What's it used for? Yeah? 

>> Student: The memory that you allocate... [inaudible]. 

>> David: Okay, good. Let me tweak the jargon, but the memory you allocate on the so called heap does not get reused until you yourself free it up. So, if you ask the operating system for memory and you grab it from this thing called the heap, that memory is yours until you call a function that we'll see today called free that explicitly hands that memory back to the operating system. The up side of that is that any function you write can ask for as much memory as it wants and it's not going to disappear, it's not going to get reused because it's not coming from that place called the stack. Now, we introduced this toward the tail end of last week because we realized this is really stupid. If I have to write programs that have to know in advance how much memory the user is going to need, case in point was the quizzes example. What if the number of quizzes in some semester changes? I don't want to go back and change my source code, recompile my program just to change that hard-coded constant from 2 to 3. I'd much rather ask the user when they run the program how many quizzes were there this semester? And then dynamically get data from the user. Well, we introduced a function last week with which we can ask for chunks of memory and this thing was called what? So this was malloc, memory alloc, and this was just a function that took a number as an argument how many bytes of memory do you want and you get it and then we'll see today that you better return that memory otherwise you induce what's generally called a memory leak. So first let's actually play with some of these ideas in a real environment. So I'm going to go ahead into a file from last week called bar.c. So if you have your printouts from last week this is in bar.c. If not, not to worry. We're going to step through it deliberately on the screen. This is a pretty stupid program. It's been contrived only for the purposes of stepping through it with gdb, but so that we can see some interesting detail. So, at the very top of this function we have a function called main. It takes no command line arguments so I've explicitly said void. It seems to allocate an integer called a and then a char * called S and as an aside even though I've gotten into the habit of always putting the star right next to S, you may see in textbooks or the TF sections spaces sometimes. That's okay, but this is perhaps a little more clear. S is the pointer now. All right. So we have a char * and this is synonymous with what? So string, right? We started to take off that training wheel verbally last week. What we used to call string is now char * and this makes sense because what's really being stored in memory for a string is the address of a sequence of characters. The address of an array of characters. In this case, hello world. Now, this is just a line of printf, but notice I'm doing something a little funky here. I'm first treating S as though it's an array and that really is true conceptually. S bracket S gives me the seventh location, 0 index, of S. So what should that be? Well, this is 0 the H, 0, 1, 2, 3, 4, 5, 6, 7. So I count from 0 and I get W at location 7. So, S bracket 7 would give me a char namely W. What does ampersand S bracket 7 give me? It gives me the address of. So ampersand is the address of operator even though we haven't seen it in this context being prefixed to an array it's just the same question, at what address can I find the character W? And so it turns out that this is really going to be whatever the address of H is plus 7. So one of the tricks we'll start to see is that what's neat about pointers is that you can do arithmetic on them. You can start at one pointer, add some number like 7 and you can make your way to another part of the string its location because the computer does not care where strings begin. The only thing that's important here when I print out this here is that the string that starts at location 7 also still ends somewhere and what was the special character that demarks the end of a string? 

>> Student: [inaudible] 

>> David: Yeah, so 'o0' was a character, a special character. It's really just the number 0, but it's a character representing the number 0 that demarks the end of the string and conceptually it's right after the D here. We don't see it, but it is, in fact, going to be used there underneath the hood. So, in the end, what should this print? This line of code if I'm printing out the string starting at the address of the 7th character? What should print? Just world. It's the same idea. I'm passing in an address of a string to printf ampersand S says here comes a string so I am passing in a string but only part of the string because now I'm leveraging this fact that pointers can let me get to any location in memory so if I want to go to the 7th location in this string and then get the address of it, I can absolutely now pretend that's a string because notice that hello world and also just world will have what at the end of both of them? They both have that 'h0'. So I'm not changing where the string ends; I'm essentially just starting in the middle of the string at location 7. So we'll see this in just a moment as to what this means. Now in the next line of code here I just arbitrarily say a gets 5 so that I actually assign it a value. Then I call foo. If I step into that, I now have a parameter called N. It's going to initially have the value of 5 because I passed in A=5. I declare a local variable here called B. B arbitrarily gets assigned to N. I do a little bit of mathematics and now just because I pass B into this last function called bar. So this last function called bar simply says, hi, I'm bar as though to confirm that it was, in fact, called, but notice I'm not even using that int N and why? This is completely arbitrary. I just wanted some interesting constructs with which we could now play with the code and actually confirm that what we've been preaching verbally is, in fact, true in reality. So let's go ahead and make bar. I'm going to run gdb on this program called bar, and I'm going to go ahead and type break main to get a break point at my function and then notice this OX80483FD represents what perhaps? So that's the address where main is in memory. What does that mean? Well, recall from last week that that chunk of memory at the very top of RAM is the so-called .text segment. Those are the 0's and 1's that represent all of the functions I wrote and then compiled. So when gdb says, oh, well, main is at OX080483FD that jus means that is the numeric address of the function called main in that text segment. Generally not useful, but it turns out in C you can actually pass functions around just as you can variables and you can pass functions around by way of their addresses, but that's not something we'll generally do. Now just a quick crash course in something that'll be a little more fun in Visual once we get to Web programming since everyone likes to change colors of things it turns out that the same notation hexadecimal notation is used for color codes and Photoshop and Web development and so forth, but for now we just need a simple definition. So, in the world of binary, we represented the number 0 with 0, the number 1 with 1, the number 2 with? 10 and then the number 3 with 11 and then 100 an notice the pattern. So I didn't quite line them up optimally, but I can right now draw this, can I fix this? Not really. Well, let's just leave it like that. So, 0 goes to 1, 1 plus 1 is technically 2, but I don't have the digit 2 in the binary system so I have to carry that 1 and so I get 10. I then add 1, this is easy, get 11 or 1, 1. Then I add 1 again. Well, 1 plus 1 is 2, but 2 doesn't exist so I have to carry the 1 and so forth. That's how you end up getting the same number here 1 0 0. Now, in decimal, these things are pretty straightforward. So this is just 0 1 2 3 4. Well, what about in hexadecimal? Actually it starts off the same. Zero is 0, 1 is 1, 2 is 2, dot, dot, dot, but now let me do some dot, dot, dot down here. When I get to the number 9 and then 10 in decimal, now I've used up all of those digits, but it turns out in hexadecimal I can go up to 9. There is no 10 but there is, in fact, A, B, C, D, E, F. So in this notation called hexadecimal, you can effectively count from 0 on up to 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15. You can count to 15 by using 0 though 9 and then when you run out a through F. Now who cares? Why is this interesting? Well, the bigger picture taken away is that you know have more digits at your disposal so for representing really large numbers in the millions and the billions and higher, you can actually write them using fewer characters because you have more expressive capabilities with each single digit. That is it only takes one character F to represent 15 whereas in decimal it would take 1, 5, 2 and how about in binary? How do you represent 15? It's 1111. So one's place, two's place, four's, eight. So 8 plus 4 plus 2 plus 1 gives me 15. So this is the idea. With all these different base systems, really it's just a convention, a convenience for humans. Hexadecimal tends to be used when you start talking about larger numbers because you can fit more information into fewer characters. Now as an aside so that this feels a little more real and so that those of you already familiar with Photoshop can see the connection, turns out in the graphics world if you want to represent the number, the color red, you just express it with what's called and RGB code, red, green, blue. And so if you want to represent the number red, you would generally say FF0000. Now why this? Here's how much blue you want, here's how much green you want, here's how much red you want. So, because each F is 1111, FF is 11111111 so that's cranking up all the possible red that's available whereas 0 is no green, 0 is no blue so this is how the world represents red. How does the world represent green? Well, 00FF00 and you can infer from that. a good quiz question maybe what blue is actually represented with. So for now the whole point is just this is a different way of expressing numbers and the human convention is generally to put this character 0X at the start of any hexadecimal number just to make clear to the reader here comes a hexadecimal number. The 0 and the X are not information unto themselves, they're not part of the address, it's just a notation, a hint that says here comes a hexadecimal number. If you ever see any alphabetical letters though in a number, odds are it's hexadecimal. If you see G in a number, odds are it's not hexadecimal because we can't count that high as G. Okay, so back on track. Here's main. I've set a break point in this program called bar.c. Let me go ahead now and run my program by typing run. We've done this before. It hits the break point in main. That first line of code declares a string S, but wait a minute. Let me type list just so I can see that code again. Well, it turns out I have an int A. It didn't bother showing it to me because that was just allocating space. There was no statement, there was no assignment so let's do one sanity check. Let me go ahead and print out the value of a in gdb. What is this value? All right. Garbage value that was not put there by us necessarily. There was some start-up process that happened when I first ran my program whereby that memory on the stack was used for some purpose unbeknownst to me and I cannot now trust that a has been initialized to any fixed value because who knows what's there? But that's okay because in a few lines from now I'm going to explicitly initialize a to some value so it's a problem that it's initialized to some junk value. I just had better not use that for any purpose yet. So I'm going to go ahead now and print the value of S. Notice I haven't started executing the lines of code in this function. All I did was list the lines and then I printed a now I'm going to print S and what do I see? Well, what's this? It's definitely hexadecimal so it turns out that S2 is just a variable. It happens to be a variable of a different type. It's a pointer to a char, which means it's the address of some character in memory but at this point in the story this line of code, line 20, has not been executed yet. So because S is a variable just like a is S2 can have some garbage value. So the garbage value that's apparently an S is this thing here, OX312FF4. That's just whatever number happens to be there gdb is showing it in hexadecimal notation because it knows it should be an address but that doesn't mean it's not a garbage value. In fact, what is at that garbage value? Well, gdb is actually telling me it's checking in advance for me apparently the string that starts with a vertical bar then a curly brace then this special code there. So it's just some junk value. It's just a coincidence and it's a coincidence, too, that this string that's at that address is only a few characters long because gdb stops showing what's there as soon as it hits a ' so just by luck there is, in fact, a 0 byte in the vicinity of that address and that's why the string does not look longer. Now, why did this happen here? Who knows. It's just an artifact of having run this program on this machine. Yeah? 

>> Student: [inaudible]. 

>> David: s plus 30? Okay so we can do this. So jumping slightly ahead let me go ahead and print let's do s plus 30 and now I've advanced, if you do the math, I've advanced 30 bytes in memory and what's apparently 30 bytes away from S's current bogus address is this random character there. So, you can poke around with gdb anywhere you want in memory. In fact, if you continue on next year with a course called CS61, which is low-level systems programming, one of the first problem sets actually involves they hand you a program that the staff wrote that they compiled whose purpose in life is to ask for passwords and then check if you've typed in the right value. The problem is if you input the wrong password into this special problem set program, it emails the staff saying David got this question wrong. So the goal of that pset is to actually figure out what the user's passwords are without triggering this bomb to go off, without triggering the email to get sent out. So you obviously can't just run the program at the command line and start inputting your passwords because clearly the odds of your guessing the password right the first time is not going to happen and so you instead have to run, making I'm making next year's problem set too easy for you, you could certainly use a tool like gdb, right? Because with gdb can you pause execution at main and then start stepping through it poking around, poking around and if they've hard coded the right passwords into the program, surely by printing random parts and memory out can you poke around and see what's actually there. So, if you take CS61 remember that don't tell them I said this in lecture. [Laughter] All right. So now let's continue. Here's the program we're working on. Here's the program we're working on. Let me go ahead and remind you in main dot C the main function notice I've just initialized S or I'm about to initialize S to hello world and then we're going to start printing things out for real. So let me go ahead and type next in gdb so now I've executed line 20 so now if I print out what's at location S notice that S itself is still an address but it's now the address of this string in memory. Now where did that end up? That ended up specifically to be clear at this address. Now let's go ahead and call printf. So I'm going to go ahead and type next and sure enough it printed out the word world on the left hand side there and let's see why this is. Well, let me go ahead and printout S bracket 7. That should print out the 7th character in the string called S, and in fact, it does. It prints out W whose ASCII code if you look it up in the chart is 119 just as lower case a was 97. So, W is 119. Well, let me go ahead now and print the address of S bracket S, and I shouldn't get W per se. I should instead get the address at that location, which is this again and yet it shows me world because gdb realizes, oh, this is the start of a string, I'm going to just keep showing you character after character after character until I hit what special value? Back slash 0 and so I happen to see the word world again and that's exactly why printf itself printed out just the word world. So, now let's go ahead and advance. If I hit next now, I'm at line 22 so a is currently Print A, it's currently some bogus value, but now if I do next and now I Print A, notice that a is now 5, which is as expected. Now, if I type next again, foo will get called but gdb won't step into it. It will just step right over it, foo will do its thing and that's kind of useless because then my program will end so I'm instead going to type not next but step that steps me into recall the function, gdb is reminding me, all right, I'm in bar.c line 31 so you can do a little sanity check in another window or on a printout. I've called a function called foo with an argument of 5. What do I do here? Well, I'm going to go ahead and hit next, I'm going to go ahead and hit next and now let's see what N is. Well, N is still 5 because that was the argument that goes passed into foo. Well, what should B be at this point? So hopefully B is 10. That makes sense because I initialize B to N and then I multiplied B times 2. This special shorthand notation just means B equals B times 2 so now it's 10 and now what I'm going to do next let me go ahead and step into bar and now if I print out M the argument to bar it's indeed 10. So, at this point now I'm just doing some useless stuff because it's just going to print high on bar so I'm going to type continue because I'm getting bored with the process so the whole thing finishes executing without breaking again, program exists normally, which means it returned zero and voila, but the point for now is not what this program actually did but that using gdb where we able to confirm that this whole layout of memory is, in fact, the case and if we look specifically at the addresses without just saying hexadecimal, hexadecimal, if you actually look at the numbers, you'll see that all of the locations of these values are at different points in this process. Let's do one sanity check. Let me go ahead and rerun gdb on bar. I'm going to set another break point in main and now I'm going to type run and now I'm going to do this. So notice that print the value of a it is some bogus value again, but let's print the address of a and it looks like the address of a is here; OXBFFFFF688. So we'll leave that on the screen. Let me now type a couple of commands. Next. Next. Next. All right I'm going to step. I want to go into foo for a moment. Now I'm going to step into foo. Now, foo recall has an argument. It's called N. So on this picture foo should be higher on the stack than main's own variables, right? Because main has called foo and eventually foo is going to call bar. So the stack is growing upward in this way, but we're about to see a curiosity. Let me go ahead and print out N, N is indeed 5, let me go ahead and print out B. B is some bogus value, but if I execute B equals N by typing next now I print B and it should be what? Should be 5 also because I said B equal to N and now last step if you're following along let me go ahead and print out the address of B, which is a local variable in foo and now notice. So it's a little cryptic at first. Here is where a was and a is a local variable in main. Now down here B is a local variable in foo and notice their addresses. They're almost the same. They both start with BFFFF and then a 6, but then we have, what, 5C and then we have 88. So, in which direction do things seem to be growing? So it looks like even though we think of the address, even though we think of the stack as growing up, recall that main is going to be on the bottom here and then foo is going to be on top of it pictorially here and yet notice foo's address. Is it bigger or smaller if you compare the two? So, it's actually smaller so this is actually a problem and it lends itself to a security exploit we'll soon see even though conceptually frankly we call it the stack, it makes nice sense to say the frames get stacked up and up and up just like in a cafeteria, but in reality you'll actually see in some textbooks the pictures reversed where the stack grows down, which is fine. You just have to flip your mental model, but then the whole cafeteria example doesn't really make as much sense and it's kind of weird that they called it a stack, but so be it. But in this case with gdb we're actually confirming visually that here's a local variable in main and it's at a decently high, a big address. Notice that this is a local variable in foo, but its address is smaller. So, that means big addresses in RAM are actually at the bottom; small addresses in RAM at higher up. So even though I've kind of been cheating and saying suppose this is byte 0 and this is 1 and this is 2, it's actually the reverse on most systems and this is a problem because we'll see this week that you can exploit that feature of the order of memory and insert as big a chunk of data as you want into someone's function and what's going to happen is because these addresses are going from small to big, when you insert big chunk of memory, you're going to end up overwriting previous frames on the stack and by doing this can you take over the machine or take over that program. So that's very abstract for now. Why don't we go ahead and take a five-minute break and we'll resume with more detail. All right so that was a lot all at once and I recognize that it's kind of hard to follow through gdb when the screen starts becoming a mess. So, let's take a look at some perhaps simpler pictures just to set the stage here, again, mentally. The story we're telling is that memory is still laid out relatively speaking the same way it's been for the past few weeks, but it turns out that byte 0 in your computer's memory is actually up here and byte 2 billion if you have 2 gigabytes of RAM is lower in this picture even though the stack itself grows upward pictorially the addresses must be going from big address to smaller, to smaller to smaller. Now why is this of concern? Well, consider this chunk of code here. This is a pretty small C program, but it demonstrates one of the most common attacks that still compromises servers and programs today. So here's main down here. It does, in fact, take command line arguments. As an aside, you might see in sections or in textbooks there's this other notation for ARG V. We generally say char * and then ARG V with square brackets to make very clear that ARG V is an array of char *s aka strings and array of strings. Well, because an array can also be treated as we've started to see last week in this as itself just an address it is also possible to write ARG V as char * star so consider this just for now a teaser. You can have things called pointers, two pointers, that can be useful but for now will generally use our original notation with square brackets. But for now it's the same thing. So this program here calls foo, which is this function up here and it just blindly passes in ARG V bracket one. Now just instinctively what is potentially wrong with this implementation? So the real answer is going to be no bounds checking, but there's another problem in main itself. What am I not doing in main? So, I'm not checking ARG C. Right, even though thus far we've been writing main for you recently in game of 15 and Sudoku, think back to caesar in vigenere when you had to accept the command line argument assuming you implemented the requirements of the specification you probably at least checked is ARG C equal equal to the number 2 so that you knew there was going to be a number after caesar or after vigenere's name at the command line. Here we're just assuming blindly that ARG C is going to have a count of 2 or more, and I'm passing in bracket 1 of ARG V. So the first word typed after the program's name, but what if there is no word there? Well, it looks like I'm going to be jumping into memory that's not necessarily my own. So that's one problem, but now let's look at what foo actually does. It looks like foo takes a single argument, a string, aka char *. It then has a local variable called C, C is the name of an array with 12 characters available in it and now I'm using this function. So there's all sorts of fun memory-related functions in C. We saw one last week, malloc that will keep using. There's free as we'll see and there's memcpy, which does as the name implies. It's not a typo. Someone decided that saving that individual character just made the function easier to type so memcpy, C-P-Y, what does it do? Well, its purpose in life if we've read the man page is to take, use the first argument here as a destination, use the second argument as the source that is what you want to, what bytes do you want to copy into the destination and then it takes a third argument, which is how many bytes do you want to copy? So, casually speaking memcpy is supposed to copy this many bytes from bar into C. Now there's a comment here, which says no bounds checking, but what does that mean? What really is the danger here? Yeah? 

>> Student: [inaudible] 

>> David: Perfect. So if I type a really long word at the command line after this program's name, it's not just foo. It's not just 13. It is "a really long sentence". Which if you count that up, that's going to be more than a really long, I think that's more than 12 characters you're going to overflow it seems this buffer. In other words, I am not only just assuming there's something in ARG V bracket 1, I'm also then blindly copying the entirety of that string into C even though I've only allocated how many bytes for this variable C. So 12. So what does this mean in reality? Well, this means that it will work. It will copy that many bytes from bar into C, but there's a danger here as you've seen in your own code that if you start touching memory that's not yours, odds are the program is going to do what? It's going to seg fault, it's going to crash and that's because you're touching memory that's not your own, but if you don't type a crazy long string but one that's a little longer than 12, but not something ridiculous like a thousand or a million just, you know, something a little long that's still in that sweet spot where the computer doesn't realize that it should crash, it doesn't realize that that really isn't your memory, you can actually overwrite some really important information that's also on the stack that we've not mentioned thus far. Thus far we've always said that on the stack goes a function's local variables and its parameters and that repeats for each function that's called, but it turns out we've been taking for granted the fact that if main calls foo and foo calls bar, that the computer just knows how to get back from bar to foo and how to get back from foo to main. In other words, we've just waved our hands at this detail of unwinding the stack, removing frames and we've just assumed the computer knows where it came from, but that's not the case. C in a programming language is very low level. You need to tell the computer when this function is done executing where to go back to so it turns out that one of the other pieces of information that's been stored on the stack all this time not by you, but the compiler has been doing this for you, also what goes on the stack is something called a return address. So here's a little picture of just the stack. Consider this a zoomed in version where it says parent routine stack here. Suppose this is main. Now, main lives somewhere in memory in that text segment, but when you call foo, for instance, which is this guy up here. So this is foo up in the blue at the top here, foo takes this argument and has this local array called C, when foo gets called, once it's done executing the computer needs to know what address to go back to. That is it needs to know go back to main. So one of the things that happens inside memory is what goes here in red is the return address, the address of the line in code in main that should get executed after foo is done running. So, this red bar just represents, again, the address of the specific line of code that the computer should go back to once foo is done after you hit that curly brace. Now notice the danger here now. If addresses are small up here and big up here, notice if you pass in more bytes than is expected, where do those extra dangerous bytes end up? They end up going this way, right, and so as this redness suggests, going over the red part is not a good thing because it means you could clobber, you could change that return address to be something else. Now, if you're the bad guy and you are deliberately typing in a word at the command line that's not just a word, it actually represents code that you want to be executed, if through trial and error or really smart savvy you know exactly what you should type at the command line so that the different number ends up here in the return address, what you can do is trick the computer by passing in enough bytes to overload not just the blue segment and this green segment here but also this red segment you can trick the computer into returning to the address not of main but of the code you yourself just typed in. So what you can be passing in to this blue portion is actual executable code if you know how to represent it with ASCII characters and you can trick the program into running your own code what might the program then do? Well, if this is an installation of Photoshop you could be tricking it into skipping that next line of code that says ask user for their serial number or their registration code and just skip right over it and execute your line of code instead. If it's a server, you might be tricking the server into skipping over that very important line of code that says check user's password and if you just skip it, the user might have now unfettered access to the system. So if have the ability to control the flow of a program by changing where it's going in memory, you can essentially take over that program or server and do anything you want thereafter. So here's a picture so here is to be clear the array called C and just like I always do on the blackboard we're representing each byte in or each char in that array is just a square here's the 0th one, here's the 11th one, 0 index, so suppose we pass in an actual word like hello? It gets passed in as H-E-L-L-O back slash zero and then this rest of the 12 bytes that were allocated they're wasted, they're not being used, but nothing bad has happened, but now suppose I insert some attack code. So, just some letters or numbers that happen to represent executable code what can happen is you can over run that segment. So notice before hello is confined to this blue box. If I now insert some attack code, whatever it looks like, it might actually overflow everything here and notice because I've been really smart I've included some numbers, four hexadecimal numbers, at the very end of my attack code, which hopefully represent what? Hopefully represent the address of the code that I inserted where hello used to be. So, it's fairly sophisticated and yet a lot of these attacks are done by trial and error. It's pretty hard to know a priori if you didn't write the program what the return address in main or where rather the return address is. Where exactly this red bar is, but what attackers generally do both on the Internet and even with installations of Photoshop, Microsoft Word, whatever it is they're trying to crack, you just try running your attack code again and again and again and you know what? If you crash that server or crash that program or blue screen of death the computer, you know, that's not actually a bad thing if you're the attacker. It means you've found a bug, you've found an opportunity to exploit because if you can run programs and get them to crash, there's an opportunity there. Odds are you've passed it too much attack code or too much input so maybe if you scale back your attack and send fewer bytes and different addresses, maybe you can hit that sweet spot and actually trick the computer into executing your code. Now a lot of programs today that we all run on our computers are written in languages like C and C++ objective C and a whole bunch of other languages that do not do, offer a number of safety features that other languages do, for instance, what we're describing here you can't really do in Java and frankly even with modern compilers you can generally check a box in your compiler if you're using like Visual Studio that will essentially ward off this attack, but it's frankly just lack of knowledge or lack of savvy that continues to, that developers continue to write code that's vulnerable to this and if you read about some attack on the Internet involving so-called buffer overflow or buffer over run exploit, what they're describing is something like this. Someone had an array and someone did not check the bounds of that array and just blindly copy memory data into it or just blindly tried to print it or just generally did not check the bounds of that array. Yeah? 

>> Student: [inaudible]. 

>> David: Good question. If you yourself have written the code and you know how big the memory is and will be, it might be safe to say, okay, you know what you're doing then go ahead and not bother checking memory but the reality is if you actually pursue this in a professional capacity and research capacity, you're not going to be the only programmer on some projects. So, whereas you might know what you're doing the next person might not realize you made some assumption so frankly the best rule of thumb especially since we have millions of CPU cycles at our disposal these days is spin some on the additional if condition just to protect you from yourself or some colleague. So, with that said any other questions? Don't try and run off and do this tonight. Yeah? 

>> Student: [inaudible]. 

>> David: So parameter is the input to a function, the thing you put inside the parenthesis when you call it. That's an argument aka parameter. 

>> Student: [inaudible]. 

>> David: Yep, exactly. 

>> Student: [inaudible]. 

>> David: So, good question, what goes on the stack, what goes on the heap to recap. What goes on the stack as we've seen are local variables, parameters, aka arguments, and also it appears today something called return addresses and there's some other stuff, too, that I'll gloss over for today's purposes, but this frame pointer there's other sophisticated numbers that end up on the stack for useful purposes but they're not as germane right now. So those three things; arguments, aka parameters, local variables and return addresses. What goes on the heap any time you call malloc, which we're going to start doing today onward with increased frequency, ends up on the heap. So, anytime you call malloc, it ends up on the heap. So that's actually a good segue. So, let me go ahead and pull up some of the code we started that's among last week's printouts and that's namely this one, structs one dot C. And let's introduce a new feature that's actually quite useful so that we can get away from stupid little examples where all we do is store chars and ints. Let's actually start storing some more real-world data. So this is structs1.c. It is from, it's the last program from last week's printout and then we'll, oh, actually and I did just in case you didn't bring that it's also in today's printout at the very end. It's a duplicate of last weeks. So, in structs1.c, I appear to be doing something a big crazy. Even though we've only talked about ints and chars and floats and doubles and strings and pointers and all of this turns out there's a data type called student, but not quite. The goal of this program ultimately is to store some student's specific information. I want to actually implement a slightly more real-world program involving students, maybe like the registrar's database, and I'm assuming every student has an ID number like an int, a name, a string, and a house, a string also, but I don't want to have like three different variables for every student because if I have two students that's six variables, three students that's nine variables, a 100 students that's 300 variables feels like this is an opportunity not only for an array but it would also be nice to bundle up all of each student's variables into just one sort of metavariable. One container variable inside of which is their ID and their name and their house. So it turns out we can do that. At the very top of this file notice I'm just using this convention, which is pretty common. If you have some data structures that you yourself have created it's very typical to put them into a header file, a .h file and now notice as you might have seen in recent distribution code any time you're using a standard, official library whether it's CS50's or the standard libraries that come with C itself, you use angled brackets. If you want to write your own dot H file, which you're given, in fact, for Sudoku. There is a sudoku.h file. You use double quotes just to make clear this is my file in this directory it's not in the special directories owned by the system. So inside of structs.h must be something interesting. Let's go into structs.h, which you should have as well. Not much. It's mostly comments, but notice one new feature of C here. It turns out if you want to bundle together multiple variables just because conceptually it makes more sense to lump them together and call them by one name, for instance, student. You can say this, type def for type definition, struct. So typedef strict then a pair of curly braces and inside the curly braces you simply list one by one semicolons at the end of each line what variables, what types of variables you want to associate with this particular structure. Then outside the curly brace you give these type a name. I'm going to go with the obvious, I'm going to call this student so now hence forth C or gcc not only knows about ints and chars and floats and strings and all of that it will also know about a new data type that you created called student and inside of a student struct is going to be three variables that you can access by name as well. So now let's actually use this. There's a new piece of syntax here but it's actually pretty straightforward. Let's go ahead and open up structs1.c again. Let me scroll down. First take note that just so that I have a fixed sized program I didn't want to mix examples so I went back to the approach of using a constant. I'm not going to use malloc dynamically just yet. I'm going to define number of students supported by this program to be hard coded as three just so we have something slightly interesting to play with. Now notice what main does. main allocates an array called class and inside of that array are not ints, not chars not floats, but student structure. So the syntax is the same it's declaring an array of ints and array of chars or anything else I just now say student. Now here's a for loop. It's going to iterate from 0 up to 3 because students in all caps is a constant called initialize the 3 and then there's some slightly new syntax mixed with old. First I use printf and I say students ID, I'm calling get int, but now I don't have an int variable per se. I have a student variable inside of which is an int so the new piece of syntax here is dot notation. If you want to go inside of a struct and class[i] is going to be the i'th student in this array, if I want to get access not just to the student structure itself but inside that structure is an ID field. I go .id and I assign it the return value of GetInt. Now next recall that name and house were declared as char *s. So those are strings, which means I can call get string and assign them a value, technically assignment the address of a string, but I can assign them a return value of get string so I use the same notation. Get the i'th student in the array dot name equals get string. And now notice this would be wrong because what does class bracket I represent? It represents a student struct. So you can't assign a string on the right hand side to a student on the left hand side, but if you dive into that structure and say give me the name field inside of that student struct, then absolutely can you assign that name the return value of get string. Yeah? 

>> Student: [inaudible] 

>> David: How is a struct stored and memory? So essentially when you declare something simple like an int. We always draw the as something like this. So this is the int called a and it's represented as a box. If I now declare a student with something like student s semicolon, this too, gives me a chunk of memory but let's see it has enough memory for an int and then a char * and then another char *. So what actually ends up in memory here is an int further ID but then next to that in memory is also going to be a char *, which by coincidence is also 32 bits. The same as an int we said and then also allocated for a student is this field we called house, but this is also a char * so a pointer. So, in fact, I get three, 32 bit chunks of memory back to back to back or 12 bytes total contiguous. So it looks like this in memory or if you ignore the divisions it's just a bigger chunk of memory that you're handed because you've clustered it together. Now that's a slight white lie. It's possible that the compiler might give you more memory than you actually asked for just so things line up in a nice mathematical way, but conceptually that's what happens. Yeah? 

>> Student: [inaudible] 

>> David: Okay. 

>> Student: [inaudible] 

>> David: Correct. So, what's really happening then if I declare student class bracket 3 what happens is exactly the same story. I'm going to have to zoom out to make it fit better, but here is the first student's chunk of memory. Because this is an array all of the students will be back to back to back so the next student ends up here and then the last student ends up not to scale down there. So this bracket notation though is actually not saying go to byte one, go to byte two, go to byte three. When I say S or rather when I say class bracket I, it's not going to the i'th byte. It's going to the i'th data member in this array so the computer checks well how big is a student structure? It's 12 bytes so when you say class bracket one that's actually going 12 bytes away from the first student. It's not going for instance right there. So all of this happens for you thanks to something we'll look at actually before long called pointer arithmetic and it's done sort of magically for you, but for now just consider the utility of having this ability. We now have a struct, we can bundle up together inside one variable or an array of variables and ID and name and a house together now what's this chunk of code here that I've stripped the comments from actually doing? This for loop in the middle here? Let's first see if we can infer from context. So it's doing a comparison. strcmp is "string comparison." It checks two strings for equality. So what is this loop doing? 

>> Student: [inaudible] 

>> David: Yeah, it's just going to say, it's going to iterate over all the students, three students in this case, and it's just going to proclaim so and so [inaudible] if that's indeed the case. And so you'll see this in actually you will see this in Sudoku source code where we use strcmp. String compare takes two strings, left and right, and it returns 0 if they're equal but what's nice is it also returns negative 1 if 1 alphabetically should come before the other or it returns positive 1 if the other string should go before the other so you can actually use this as we'll eventually see the sort strings which can be useful for dictionaries and spell checkers and things like that, but for now we're just using it for a silly proclamation. So and so is in Mather but then down here notice what I'm doing. Just as I called malloc last week, it turns out you guys have been writing buggy code with our blessing for five weeks now. Every time you call get string that string has to be stored not in the stack because if the words you typed in were getting stored on the stack you could not trust what get string is returning. So, just conceptually based on the definition we've been using in recent days of memory when you call get string to get a string from the user, where in this picture must the words you type in be getting stored? So in the heap, right? But if they're stored in the heap and you're just using those strings and you're never actually saying to the operating system I am done with this string, you actually for five weeks now have been leaking memory sort of unbeknownst to you or maybe you didn't notice as much because what get string actually calls itself underneath the hood is this function called malloc and so now hence forth as we start to take off some of these training wheels because I've called get string which itself calls malloc if I want to free the memory that I asked for implicitly by calling get string, I just have to call this new function it's not hard to use, called free, and I pass it, the pointer that's pointing at the chunk of memory that was handed to me. So notice here I'm using get strings return value and remembering it as the name and we said last week that, yes, this is a string conceptually, but underneath the hood it's the address of those characters in memory. Same deal for house. Well, later it's really proper now to be freeing the memory that we've asked the operating system for. So, let's go ahead and run this program. So let me go ahead and make struct 1, let me go ahead and run structs 1 and let's do student ID 1, David, Mather. Student ID 2, Cansu, Eliot. Student ID 3, Yuhki, lives in Mather. Enter. And so we see David is in Mather, Yuhki is in Mather, but the point really is that underneath the hood I was able to store all three of those variables, data types, for each student in its own struct and then bundled them up together in print just the ones I cared about. Let me go ahead into the second version of this. So this is structs2.c. It's also toward the end of this week's printout. This is kind of a useless program I just ran. All I did was ask you for all this input, nine lines of input; ID numbers, names, and houses and then I did something stupid like check for "Mather" and then I quit which meant all that hard work from the user was really for naught. So suppose we want to do something reasonable like actually start storing information from users to disks. Well, thus far you guys have only stored things in variables, maybe in arrays, but at the end of the day the program runs any work you did while running that program, any keystrokes you typed, get lost forever. So suppose we now want to start creating own our files and actually store the data on disks it's actually pretty easy and we can use a function that's very, very similar to printf to do this. So, this is up top the exact same program. I declare an array of students called class. I then call this loop just to get some data from the human, but now rather than just do the silly little Mather comparison in the middle here notice I've added a chunk of code down here. Now at first glance it might look a little cryptic, but let's tease it apart line by line. So here's a new function, open. It's called "file open." So this opens a file. Now that file doesn't have to exist in advance because you can first specify the name of a file but then the second argument to F open is what's called the mode and if you put in "r" that means read this file. Load it from disk, but that's not the point of this program. The point of this program is to write to the disk, write to the hard drive, so you pass "w". And what this will do is if a file called database does not exist in your current working directory because I've said W it will create it for me. It will be empty initially, but it will now exist under the name database. Here's a little sanity check. F opens returns a pointer to a new data type so thus far we've only talked about built-in types like ints and chars and floats, we made up our own called student, but there is another one that's built in that's quite useful. For whatever reason it's in all caps, but it's called FILE so this is returning one of these things called a pointer but it's really conceptually a pointer to a file. Somewhere on disk, I don't know where it is, I don't really care where it is, I just know that via this file pointer, fp, by convention can I find that file on disk and add stuff to it, but first the sanity check. Almost always when you use pointers should you make sure they're not NULL because if you follow a NULL pointer what tends to happen? Seg faults and you've not probably done this too often if at all yourselves, but very soon will this be a risk. So here's a loop. I'm just iterating over all of my students and now I'm calling this other function that's probably new called F printf but F just stands for file. So this is not printf to the screen, this is file printf, print to a file. What file do I want to print to? Why, I don't redundantly say the name of the file again and again. I instead just paste in the name of the pointer that I got back. So what this means is I'm going to print to this file percent D back slash N. So this means print an integer to the file. Right integer to the file, what integer? Print the i'th students ID then go ahead on a new line print the name of that student, the i'th student in the classes name and then do the same thing for the house. So in the end, and then F close, F P once I'm all done just means close the file. It's like going to file save in most any program on the computer and then at the bottom here I again called free just to free up that RAM. Now as an aside, even though up until now we've been intentionally or accidentally writing memory leaks whereby we're asking for memory and never giving it back, the reality is when you quit a program generally the memory you asked for is returned so this isn't necessarily a horrible, horrible sin we've been committing, but it's certainly not good practice to ask for memory and never give it back since there will be situations where it doesn't get freed automatically. Case in point most of you I think a few weeks ago kind of raised your hand or nodded in familiarity when I asked have you ever sat in front of your computer after using it for a while and the thing is just getting slower and slower and slower? You try to tab between different programs or click a different icon and it takes forever for the windows to actually change. That's generally indicative of your running out of memory. Now that might not be your fault. It might be because some of the programs you ran since you first booted your computer asked for RAM and never actually gave it back and so the computer thinks it doesn't have much memory to play with so things slow down. So, let's see what actually happens here. So make structs 2 let me go ahead and run structs 2 and let's do the same kind of input. Student 1, David Mather, student 2, Cansu Eliot, student 3, Yuhki, and then Mather, Enter. Okay nothing seems to have happened, but if I type LS now for list files there's a whole bunch of stuff that we'll be able to play with her but the interesting one is, ah, this one, database. Let me go ahead and open up database. This was actually a file that I wrote. So now what does it mean to create a file? Well, most of you guys write essays in things called .doc files or .dock these days and Excel files are dot xls. And there's all of these file extensions, file formats that we're familiar with. What does that really mean? Well, this is now the CS50 file format. We just arbitrarily decided that to store students in a database we're just going to write each student on a triple of lines, one after the other, and then for the next student we're going to write the next lines of data for that student and then again and again. So, if we were to read these same students in for memory, we would have to have a loop that reads three lines at a time, grabs an int from each and then stores them somewhere in memory. So this is a file format in only so far as we came up with this arbitrary format and we've decided how this data is going to be laid out in memory. Same as Microsoft done with the dot doc format they just put things in places they want them to be and they call it their own proprietary format. So now let's look at what we've actually been using that's put these memory, that's created these leaks in the first place. So let me go ahead and scroll down here. This is CS50 dot C. So this is the first printout from today. We won't spend too much time on the library itself because it's a lot more fun to just take for granted that these things work but let's at least zoom in on one of them. So, if you scroll down visually to the function called get string, let's just get a taste of what get string has been doing for you all this time and frankly have perhaps a bit of appreciation for the headache that is getting user input on, say, Week 0 of a class if you're not handed helper functions like this. Even in languages like Java it's a little easier but it's not nearly as easy as just calling a function like get string. So what do I appear to be doing in this implementation of get string? So this is, again, the code in the CS50 library that you've been taking for granted for a couple of weeks. So the first line of code here string buffer gets NULL. So it turns out get string does not know in advance, of course, how long a string, how long a word or sentence you're going to type in so we assume 0. We assume that you're not going to type anything just so we start from an arbitrary starting point. Maybe not the most efficient, but at least we can't go wrong if we subsequently grow the amount of space we're using. So, I declare an int called capacity saying there are no bytes available yet and then I declare another int called N, which specifies how many characters are in the buffer. So in other words conceptually, get string has a buffer. It's initially this big and we keep track of not only how big it can be but also how big it currently is. So, at the moment it can fit 0 characters and it has 0 characters. Now what's the deal with unsigned? Turns out with a lot of C data types you can prefix them with keywords like signed or unsigned. Why might it make sense to call capacity and int unsigned ints? What's that? 

>> Student: [inaudible] 

>> David: I haven't initialized it, but more importantly it's not going to be negative, right? I can't have a negative amount of space. Contrary to some of the silly error messages we saw today, you can't have a negative amount of space and yet if I'm allowing my numbers to just be an int recall that an int can be negative and positive but that's pretty wasteful because if you use just a signed int by default integers are signed if you don't specify signed or unsigned, by default they can be negative or positive which means in 32 bit int you are throwing away 2 billion possible values if you know you're never going to use the negative ones you're just wasting all of those bits. So rather if I say unsigned, this means I can now have a string that's between 0 and not 2 billion but 0 and 4 billion characters long. Now it may be excessive in fairness, but it's such an easy fix here to not throw away half of my number space so I went ahead and prefixed it with unsigned. So, now let's go ahead and read through what happens next. I have an int called C, which is actually going to represent a character and it turns out we're using a C function called F get C to get 1 character after another from the user. So I'm not going to dwell too long on these lines because, again, we handed this to you for a reason because this is more sophistication early on than I think is all that enlightening, but notice conceptually what we're doing. F gets C's purpose in life is to get from a special thing from standard in, S-T-D-I-N. This essentially represents the keyboard. So this function F get C gets one character at a time from the keyboard. This check here is making sure that the current character C that we just got from the keyboard is not a special character called EOF. So the way a computer signifies, huh-uh, there's nothing else that the users typed. They hit enter or they hit control D or they stopped typing, EOF is a special symbol that signifies to the computer there's nothing more here and that's why this loop will terminate as soon as we hit EOF. Now, let's ignore all of these lines for just a moment and fast forward just to the bottom of this loop. At the very bottom of this loop is this line of code here. This buffer, which initially is this size, apparently I just keep storing character after character after character inside of this buffer, but I can't do that yet because initially the size of this buffer is zero. So the code I just glossed over is a whole bunch of lines of code whose purpose in life is to check all right is the current size, N, plus 1 over our capacity? In other words, if I allocated let's say 8 bytes and I'm already at 8 so 8 plus 1 is 9 and 9 is greater than 8, I'm out of space. That means the users typed a ninth character, my buffer is this big, and I need some more space over here so what do I do? You have to jump through some hoops. I first check and make sure that I'm not going to run out of memory completely because of the maximum size of an int but I'm going to completely wave my hands at this one and let's just focus on the one that uses a more familiar function or a cousin thereof. Notice what I do here. There's another function besides malloc called realloc whose name implies reallocate memory. So notice what I'm doing here is I'm reallocating the buffer to a new size because if we scroll up and this is, again, why we hand this function to you, the key detail is this one here. Every time I run out of space based on that line of code alone what do I apparently do to my buffer? I double it. So, I initially it starts at 0 and then I think we initialize it to, we add one at some point. I'd have to check the code. Then it goes to 2 then to 8, then to 4 then to 8 then to 16 then to 32 and so forth so any time we run out of space we ask for more memory but because we're getting things character by character at a time with fgetc, there's absolutely no risk in the CS50 library of someone exploiting our code because you can't possibly pass us a bigger string than we're expecting because we're only looking at it one at a time but notice the headache that is this code and why we bundle it all up in a function called get string but notice, too, if you look through this code at your leisure you'll see that almost all of the functions in the CS50 library use get string to first get the input from the user and then we convert those strings to integers. Now who cares about all of these details? Well, there are now ways in which we now have the expressive capabilities to store things on disk. Unfortunately, once you have this ability to store things on disks you have to worry about cleaning things up and most operating systems today including Windows, Mac OS and Linux leave a ridiculous number of bread crumbs, leave quite the paper trail on your computer even when you so called delete a program and even if you're just visiting some website and you're not saving any files but those images or that text is resident in RAM turns out that even those photos you're looking at could end up on your computer's hard drive in something called virtual memory but more on that and the solutions there to on Wednesday. 

[ Applause ] Thank you. Andrew will be please. 

==== Transcribed by Automatic Sync Technologies ====