>> All right. Welcome back to CS50. This is the start of Week 5 and this is Recursion, couldn't miss the opportunity here. So this fancy new technology here will hopefully empower some of you to actually see the blackboard if we actually sketch things over there so more on that to come. So, this might perhaps be the simplest Sudoku board that you might solve. For those unfamiliar, Sudoku is all about completing the rows, columns and squares within a little board like this but if nothing else, perhaps this little joke here makes a bit more sense now. So, this week and next and the week after, we introduce one of our newest problem domains. So the next Problem Set No. 5 is actually going to focus on the world of forensics and the world of forensics is all about recovering data that's been accidentally deleted or intentionally deleted or generally trying to answer questions that someone else does not want you to have the answer to. One of my favorite jobs to be honest years ago in grad school was an internship I started over the summer which was working for the Middlesex District Attorney's Office working for their forensic investigator and learning a lot about how this stuff works. In a nutshell what would happen is the local Mass State Police would drop off things like hard drives and USB sticks, sometimes keyboards and mice and monitors, which are actually forensically not all that useful, but we got everything from actual suspect's homes or offices. Our job was to answer questions of the form did he or she do it. Now quite often the answer was unclear because there isn't necessarily a smoking gun digitally on a computer but I will say within Middlesex County sometimes the answer was as easy as double-clicking a secret folder called My Documents inside of which evidence could quite often be found but more technically interesting was when we would have to actually recover data whether it was images that had been deleted intentionally or from a browser's cache, whether it was Excel spreadsheets with financial information that needed to be deleted or emails that had been sent or issuing subpoena's to local ISP's like Yahoo and Comcast and these folks to actually get records. It's fascinating and it's also terrifying to be honest. Increasingly just how many remnants we all leave digitally when we so much as check our email these days and we'll talk more about this towards semester's end in the context of the internet but this week we'll begin to focus on the lower level details of your own laptops and your own USB sticks and external hard drives as to just how much data is actually there even when you think you have deleted said information. Now with that said, shows like this have glorified the sexiness of doing forensic investigations and they've also conveyed to the population that we can do some amazing things with computers. So that's not always the case. For instance if you've ever been watching a TV show, something like Law and Order and the like and the cop's leaning over the shoulder of the computer technician and here she says something like can you enhance that? No, you can't usually enhance that whether that is a license plate or some suspect's clothes that they're wearing. We don't have infinite zoom. If you've taken a photo with your camera or with your digital camera and you start zooming in you typically start seeing things get pretty pixilated and pretty splotchy. So for the most part that's what the cops would also seize. So we'll also try to debunk some of these myths that are propagated by shows like this. Also too hopefully after taking this semester we're going to kind of ruin accidentally a whole bunch of movies for you because you're going to start noticing in movies and also TV that there's just crazy talk sometimes. All right, they'll spend a million dollars producing some TV show and not so much as ask a computer oriented intern what a technically accurate thing to say would be. You've all probably seen the movie Jurassic Park. Famous clip that you might not have noticed at the time but is with us on YouTube this day was this clip here. [ Movie clip ] >> Okay so first of all it's a UNIX system. It's essentially the same as it's a Mac or something ridiculous like hat and the program that she then proceeded to use with all those little squares and the rectangles that you saw depicted is apparently just a graphical file browser whereby your folders are depicted as cubes or squares on the screen. So these are essentially things a person would have double-clicked. Another one that we're fans of, this one related to CSI here. We haven't used all of this jargon here but it too is a bit of crazy talk by the writers. [ TV clip ] >> Okay that means nothing, like. I will create a graphical user interface using an old dated language to track an IP address. It's not the right technical solution to this problem. We'll give you let's see. One more. One more and this one I'm looking forward to using this spring in our iPhone programming class for reasons that will become clear in a moment. [ Video ] >> Now let me rewind a split second here and go back to this one in particular. Again not a language we ourselves will be using but we'll be using in CS164 this spring. Fast forward and so this is a programming language which is called Objective C used these days to program on the iPhone as well as on the Mac. Pretty sure whatever this hacker was doing has nothing to do with crayons as suggested down here in the source code. If you also look through this what this code is actually implementing is something completely benign. It's like a little toggle switch on an iPhone or an iPod Touch where you touch the little blue thing and it moves left to right. So there's a whole lot of Easter eggs ahead of you if you start pausing your TiVo or Hulu online but hopefully one of the goals of this course, not just to ruin certain popular media for you but really just to make sure after a course like this you're actually making educated and informed decisions when it actually comes to these technologies. So, without further ado Problem Set 4 this week challenges you with an even larger set of distribution code whereas last week you did get a bit of distribution code for Game of 15. This week we hand you more so the goal just like last week is to first understand code that someone else has written and this will be incredibly common. Not if you go off and necessarily program with for a company or with some partner working on some project but even if for your own final projects or side projects that you might want to implement you need to use something called an API, an application programming interface. Quite common at the end of the semester is not for you guys to implement your entire projects from scratch but rather to stand on the shoulders of other people who have written code that you can incorporate into your own programs so that your product is actually much more glamorous that it might be if you had to do absolutely everything from scratch. So we've provided you with a framework for the game of Sudoku out of the box when you run it, it looks a little something like this. Unfortunately it doesn't do anything if you hit up, down, left and right. That green cursor doesn't actually move but once you begin to fill in some of the blanks and this time we're a little less explicit as to where these blanks are. We don't say to do, to do, to do. We instead leave it to you to determine both on your own and with the help of the walkthrough where you need to start augmenting the code in order to add things like support for the cursor up, down, left, right and also detecting when the game has actually been won. So there's also the hacker edition this week whereas much like God mode last week in Game of 15, this week the hackers will have to implement a hint mode whereby hitting H will begin to spoil the fun of the game and hitting H again and again and again will start to reveal one after the other correctly positioned numbers on the board. So this is going to require that those of you tackling the hacker edition essentially figure out how to solve Sudoku and then reveal one at a time a potential number to the user that is playing. So that is Sudoku. So I'm wearing a little CS50 apparel here as you may have seen some of the course staff wear. So the course says there's this tradition of designing by students and staff alike fun designs that then in theory can then be sported at semesters end so that you don't need to just say it to your friends I took CS50, you can literally wear CS50. So you'll see in the problem set this week that you are cordially invited to participate in this year's design if you would like to contribute per those instructions there. And so at this point we have these pink forms here which is the pass fail forms. If you have any doubts as to whether or not you remain in the course, whether or not you're a little increasingly unsure, whether or not you can actually, whether or not this is something you can in fact succeed at realize that the aim of pass fail, which is wholeheartedly encouraged by the course in an opportunity really to take the edge off of a course like this. So that if you do find yourself and you have found yourself over the past few weeks making it to some 1 a.m., 2 a.m. the night before as some P sets do and you're feeling pretty good about it, but damn there's just some bug, some bugs that you can't quite chase down, one of the aims in our mind pass fail is to allow you to say you know what. I'm 95, 99 percent of the way there. That's fine. We're going to move on from there. So realize I'm happy to still sign these and I thought I would show you just two things and this might only be of interest to me sort of nostalgically but this was the very first sheet of notes that I wrote on my very first day of CS50. A very famous computer scientist named Brian Kernighan was there that year teaching us and I was just kind of amused to see 1) this was like the only day I took notes apparently I very quickly became one of those students but I was just kind of amused to see that even for me years ago this stuff wasn't completely new. What is an algorithm? I actually wrote it down. Some precise sequence of steps for getting something done. Apparently my takeaway from Lecture 1 that year was precision and correctness were really important. I was really in the exclamation point phase here where I scroll. Let's see. My last page of notes in my first section were three and four. So these are the top three things apparently that I took away from the first weeks of CS50 but this is to say too like all of this was new to me when I first dove in years ago. So realize you remain in very good company. Now one of your classmates, last clip for now, one of your classmates actually sent me this link after he had just finished P set 3 with great delight and it's perhaps representative of the emotions some of you have felt quite a bit. [ Video ] All right. [ Applause ] >> So this was bad. So this implementation of swap though at first glance seemingly correct in that it does swap two inputs A and B, we then emphasized multiple times that this is actually fundamentally flawed because as soon as the swap function returns and you hit that curly brace well that's it. All the work that you did with A and B and temp where we did last week with milk and orange juice, it's just kind of lost because the values are not actually returned. Now one of the problems in C is how many values maximally can you return from a function? So just one right. You can return nothing which is the case of the void or you can return an ENT or HR or a pointer to a string as we'll see but you can really only return one thing. Now you can actually return multiple things if you return like a two-element array or a three-element array or you return this thing called a pointer but these are not elegant solutions. So in the world of C typically if you want a function to take input and produce multiple outputs what you do is you do not pass in those inputs by value, by copy. In other words you don't do this. Instead what you do is the promised solution of last week of you do this ever so slightly different but those stars before the A and the B make all of the difference. This notation Star A and Star B mean don't pass A and B in by value. Don't pass in copies of them but rather pass them in by their address, by pointer. So pointers are every semester, for myself included back in the day, like very often a challenging concept but really if you just reduce it consistently in your mind to just the basic definition. A pointer is an address and so Star A means give me the address of some integer. You can typically reason through any sort of new type of code that might at first glance look pretty arcane. So this syntax here is saying when I take in two arguments, let's call them A and B, they're not going to be actual integers. They are going to be the addresses of integers and just as I did over here with the milk and OJ by kind of writing directions on a sheet of paper and then handing those to the swap function in our imaginary tale here. Well, that's exactly what's happening here. You're being handed a map to the computer's memory and you're saying A is going to be here, B is going to be there. If you want to change those values, go there, change them and now you've actually effectively not returned two values but you've affected two values. So this is a very common paradigm and will also enable us to do much more interesting things and in fact we'll begin now to peel back even more layers. The CS50 library, that thing called argv for command line Arguments. We've actually been using pointers and memory addresses for quite some time. We just haven't revealed such syntax. So let's actually see this work. So I've gone ahead here and launched the CS50 appliance. I've opened the file called swap dot c. I'm going to go ahead and zoom in on what is now the correct solution to this problem. So I've got some boilerplate stuff up top includes standard IO dot h. I then have the swap prototype and now this is copy paste of what we just saw in the green slide but it ends with a semi colon and that's just again the clue saying hey swap is going to exist. It's a message to GCC, the compiler. So now main. We've seen this half a dozen times now where I assign X a value, Y a value and then I claim I'm swapping them but then they're never actually swapped but there's one more key syntactic difference this week that we didn't look at last week. You do have to change the syntax that you use when you actually want to pass in a variable or two variables by their address. You can't just say X. You can't just say Y. You have to say give me the address of X and the address of Y and you can perhaps infer what the relatively simple syntax for that is, is the single ampersand. By putting an ampersand before a variable's name you're telling the compiler go find the address of X and pass that in. Do not pass X in itself. Now just a word on types. So X and Y are indeed integers. So what is the data type of ampersand X would you say? How would you answer that question? What's the type of ampersand X? So it's the address of operator so it is passing an address. Well what is that map to? Well that just maps to no need to project here. That just maps to INT star. So anytime you have a pointer, pointers are meant to point to specific things. So if you want a pointer to an INT you would typically if you were going to write it say INT star. If you want a pointer to a char, you say char star. If you want a pointer to a double, you say double star. So in this case ampersand X I don't have to mention X at all. I don't mention star because X already exists at this point in the story, in this highlighted line that's called a swap. All I want to do is tell GCC X whatever data type it is, happens to be an INT, get its address and pass that in. So now what does swap look like? It looks exactly like as was promised. In the swap function we say INT Star A, INT Star B. So now A and B are effectively local variables. They're parameters but they're local to swap. They only exist at this point in the story. What gets stored to be clear in A? So the address of X and what gets stored in B? The address of Y. So now we can't implement swap in precisely the same way we used to because notice what would immediately be bad about this. If I go in and manually delete all of these stars for just a moment, if I leave it alone as this what am I actually doing with the swap function now? What am I swapping? So I'm just swapping their addresses but that's not what I want. I want to go to whatever A is pointing at which is X. I want to go to whatever B is pointing at and that's Y. I want to change those values and then that's it. It would be completely meaningless just to change the addresses because at the end of the day if you put something at Address 1, 2, 3, 4, 5 that's where it is. If you put something else Y at 4, 5, 6, 7 that's where it is. If you want to change those values you have to go to those locations. You can't just change those locations themselves, the addresses themselves. So what does this mean then if we walk through these three final lines? So the left-hand side here declares an integer called temp and that's how many bits? Just a quick sanity check. So it's 32 bits in the appliance. Could be 64 on more modern hardware but in the appliance and in a lot of computers 32 bits. Now Star A means what in English? Okay the location of A but not quite the location of A. Star A, so the star in this case is the dereference operator so Star A means go to that address and find what's there. So a quick sanity check. If I hand you the address and it's called A and I say go to this address what are you going to find when you show up at that address? >> X >> X, the No. 1. So what is stored in temp right now is simply the No. 1. Okay so now in the second line here. Star A means go to A so what's currently at A? >> [Inaudible comment] >> So the No. 1 aka X so what am I putting there based on the assignment operator, based on that equal sign, well I'm saying go to B. So I'm going to B. What do I find there? >> [Inaudible comment] >> So I find 2 aka Y and so now it's saying take that value 2, put it at Star A which now has the effect of putting the No. 2 where the No. 1 once was. So at this point in the story there's X and there's Y and there's A and B respectively pointing at them but what two values, in this point in the story, are in locations X and Y? >> [Inaudible comment] >> They're both 2. They're both the No. 2 at this point. So all we have to do now is fix the mess we just made. We made two copies of 2. We need to now put the No. 1 back in the location Y. So temp you already said earlier was the No. 1 and there's nothing new there. Temp is an int. It's as simple as that. It's just this number we put there which was 1. Star B means go to wherever B is pointing and that is the location we think of as? So that's Y right. Star B is Y. So what do I put at Y location? The No. 1. So that's a bit of a long tale to tell of just verbally but recall how we can actually see this relatively simply step by step. Let me go into my source directory. Let me go ahead and make swap which is the name of this file. Let me go ahead and run it just as a sanity check and it does seem to be finally all these weeks later, working properly but that all happens too fast in underneath the hood so let me go ahead instead and run GDB of swap and Enter. I'm going to go ahead and run it here. Okay that wasn't all that useful but it did run into the confines of GDB. How do I actually pause execution and step through this? So I want to actually do break it. Let's just break at the beginning, break main Enter. It's now telling me there's a breakpoint at some crazy address and that's where the program physically is in memory. In the file swap dot c Line 22 and if I actually scrolled up the very start of main would indeed be at Line 22. So now I type Run to run the program but I'm instantly going to hit that breakpoint. So now we can just step through this one step at a time. So INT X gets 0. So let me do a Print X and that's already wrong. Why? >> [Inaudible comment] >> Yeah. >> I think it's the address. >> It's the address so it's actually not the address here. This is literally the value that's at X. >> You were saying what was sort of there before you have to make it to here. >> Exactly. So this is an example of one of these garbage values right. Just because we see that we're at Line 22 remember with GDB it shows you the line before it actually executes it for you. So at this point in the story I've not yet executed Line 22. So X has not been set to one so what in the world is X? Well we never know. It's probably some garbage value that's there from some previous function call or some previous use of that location. So if we ran this all day long on different computers might very well get different numbers again and again but the point is that it's just wrong. It's not a useful number. So now let's type Next and this will actually move me to the next line. It's not yet executed Line 23 but if I now do Print X I should see one and again don't be confused by the dollar sign too. This is just a fancier feature so that if you want to refer back to some value you print it earlier. You can still access it by way of this dollar sign notation but for now you'll rarely need to use that. All right. So let's just go ahead and do Next. Let's do Next. Notice now it all of a sudden said X is 1 so it's actually printing the printouts. Let's do Next. All right. It's about to say swapping so let's do Next one more time and now before we dive into the swap function let's do a little sanity check. Let me clear the screen and Print X. Okay let me Print Y. So the state of the world is as we expect. So if I do Type Next what's about to happen? >> [Inaudible comment] >> So they're going to get swapped but it's going to kind of happen all of a sudden. I'd actually like to see what the swap function is going to do. How do I actually go inside of the swap function? >> Step. >> So you do Step. So Step steps you into what's about to happen as opposed to Next which steps over. In both cases though they'll make the code execute but all at once or one step at a time. So now notice we're in swap and not to make things look a little too crazy, what I'm going to do, a new command. Back trace just in case you start to get lost in where you are, if you type back trace you'll see a quick if arcane summary of those frames that we keep drawing on the board as rectangles. So unfortunately it actually rather fortunately it draws them in the right order. Notice at the bottom of the stack is main. It's indicated with No. 1 here and then on top of that is No. 0. It's kind of unfortunate that they reversed the numbering but we can now see those stack frames and so what we have here, we're inside of the swap function. So what's about to happen is Line 42 is about to get executed. So if I print out temp, what's actually going to get displayed on the screen? If I print temp. >> [Inaudible comment] >> Yeah so it's going to be some garbage value right because this line has not executed so again there's no right answer, happens to be zero. We're just lucky. So if I actually now print out let's say Star A, what am I going to see? >> [Inaudible comment] >> So Star A. So star means go there. It's the d reference operator and if I go to A what am I going to see? >> One. >> So I should see 1 so let's do a quick check. Print Star A. I do see the No. 1. If I instead just Print A what should I instead see? >> Address. >> So the address and then again mostly meaningless but it is in fact the address in RAM of where A or rather of what is inside of A. So when we say the address of X is in A where is X? Well it happens to be at that address there. Okay so let's just move on with the functionality. So if I type Next, that line of code up here is going to execute. INT temp gets Star A. So when I type Next and now Print temp, I should see the No. 1 and if I print Star A I should also see the No. 1 because we've moved Star A into temp. The next line of code that's about to get executed is this. So as soon as I type in Next and then print out Star A, it should be this time? >> One. >> So not one now because we just executed this line that I've highlighted. So Star A equals whatever is at Star B. What was at Star B? >> Two. >> So Star A is now 2 as well. So we have seen to now put into Star A the No. 2 and in Star B is the No. 2 so we have that just one last step. So now the last step recall is go ahead and replace what's at location B with whatever's in temp. Well what's in temp? One. So what's at Star B? We already know it's 2. So in a moment when I type Next hopefully this will have clobbered the value at Star B and in fact it's the No. 1 and it's Star A is 2. So if I now, now I'm getting bored with this debugging. I'm just going to type Continue and that's going to run to the end of the program. Voila. It's claimed swap X is 2 Y is 1. So there should be two takeaways here. So 1) exactly what the difference is between just saying A, just saying B and saying Star A and Star B and what he difference is between saying X and Y versus saying ampersand X and ampersand Y and then lastly, more in practical terms again just going through this process of using GDB even though it might feel a little clunky or a bit tricky to wrap your mind around at first is incredible useful as you start to try to find bugs in your own code. >> Does the pointer have a scope? >> Does the pointer have a scope? Yes. So a pointer is still just a variable. So A and B only exist inside of the swap function here. So yes. There's no change in the notion of scope. All we've introduced is the Star and the star means this is not an actual integer. This is the address of an integer. Yeah? >> What's the difference between the asterisk, the star and the ampersand again? >> Good question. What's the difference between the asterisk, the star and the ampersand? So the star means unfortunately two things. When you specify a star in the prototype of a function, for instance, right here in the line I've highlighted, swap INT Star A and Star B. In that context you are saying A should be a pointer to an int. B should be a pointer to an int. This is identical to well, let me not go there yet. So that's all that's saying. That is the data type of A and B. Now in the context of the actual function, star takes on unfortunately a different meaning. What does Star A and Star B mean in the three highlighted lines now inside of the swap function? >> Go there. >> So it just means go there. So it's the dereference operator. It means go to that address and then lastly, this syntax up here and this is really the third and final piece of syntax, the ampersand means X is some variable. It's INT in this case. Ampersand X means get the address of X. So if I can hammer home just that one detail let me actually quickly go back into GDB. We still have our break point even though the program stopped running. Let me run the run command again. Now I'm back in main. Let me do Next. Let me do Next and let me clear the screen and just do a quick check. Print X. Print Y. They're 1 and 2 but now even in GDB I can play around with this. I can Print ampersand X and what do I get back? Well that in fact is the numeric address, albeit in hexadecimal notation which we started looking at last week of the variable X. So if you look at your two gigabytes worth of RAM in your computer and you literally go to that address and I can't translate hexadecimal in my head to English words. If you go to that address you can actually see the No. 1 at that location. Yeah. >> I was wondering why do you need to use a star to access the value inside the pointer? >> Do you need to use the star to access the value in the pointer? Yes if you want to go to that address to find its value you say Star A or Star B. >> If you just Print A that would just give you the address? >> If you just Print A or you just Print B, yes that will only show you the address which for the most part is useless unless you're using the address to actually go there. >> [Inaudible comment] >> [Inaudible comment] pointer? >> That is correct because A and B were declared as pointers, simply printing A or printing B would not be useful. It might be interesting in that you can see where they are but it would not be useful. Yeah? >> So when you [inaudible] allocate spaces in memory for A and B know that you're not overwriting something else or something which is more important so when that garbage might have come up [inaudible] program. >> That's a really okay so that's a really good question. So the garbage values we keep seeing and they're kind of a cute curiosity but thus far they've had no down sides but if we actually use these garbage values bad things can in fact happen. If you have a pointer that has some junk value in it because of some previous function call or some previous execution of code and you do star and then that pointer value, bad things will happen. In fact let's see if we can implement this very idea really fast here. So let me open up a new document in [inaudible] I'm going to go ahead and save this as bad dot C, I'm going to go ahead and at the top here let's just first start doing INTs main void, I'm not going to bother with any command line arguments, and this is the other context in which you might use the star notation that I promised a moment ago. I can actually say to the compiler give me a pointer to some data type. I can actually say INT star P and P denotes pointer here but notice I do not have in this simple example thus far an equal sign. There's no assignment operator, so what is actually at star P. If I were to do something like this, print a percent D like print an integer, what do I want to print? Go to P and print whatever integer's there. What's going to happen? What's gonna print on the screen? Yeah, it's kind of totally unclear, right? So P as a pointer and because I've not initialized it to anything it could be the number 0, it could be the number 1, it could be 555551212, it could be any number. And if I then presumptuously say it doesn't matter what it is, go there with star P well what if star P is in a particularly bad point in memory, it's the part of memory up at the top that is where your own program is. Maybe its address zero which we also said last week is bad because the null pointer is known as address zero. So in short I daresay the most common way for crashing a program with a seg fault or with a core dump that some of you guys have accidentally induced thus far is by doing exactly something like that. Not realizing that a pointer has just some bogus value or some null value and accidentally going there and then you get this so-called segmentation fault. So though this is a lot of new syntax, again there are only three ways in which we've used the star. Here, and this is just off the cuff, we'll actually see this in more useful context in the future. Here when you actually want a pointer, we saw it in the function prototype of the swap function. And then inside of the function itself where you say go there by way of the star notation. And we'll see another example before long. Any other questions? Yeah. >> [Inaudible comment] >> So what is the advantage of swapping them in the function? I missed the end of it. >> [Inaudible comment] >> Ah, so good question and if I can kind of reminisce what we did last week with the milk and OJ. When I had two slips of paper and I said here's the address of the milk, here's the address of the orange juice, then I handed them to someone to do the swapping. Suppose I now receive these pieces of paper. Well if I just said okay here is these pieces of paper back it doesn't actually change the contents of memory. All I've done is kind of a switcheroo of the addresses but the physical layout of RAM has been affected in no way and the goal typically of swapping two variables is to literally swap two variables, not just change you know the order in which we're talking about them, if that's not too hand wavy of an answer. Other questions? Yeah. >> [Inaudible comment] >> Yeah. >> [Inaudible comment] >> Perfect. Yes. So we could start to do crazy things whereby if you have-let's do this and again we'll see a more complete example in a moment. If I have INT X gets 1 and now I have INT star P, I could actually do this because X is an integer and every integer, every variable, is somewhere in memory, ampersand X means find the location in memory so INT star P means give me a pointer to an integer and what address do I actually want to put there is X. And so what I can then do is now I can either print out X like we did in week one and this would absolutely work. Print percent D as a placeholder, print out X, but what would also work now if I wanted to print the number 1? I could also say star P, or if you want to get crazy I could also do well print X but wait a minute, let's get the address of X, wait a minute changed my mind. Star X ampersand star and yes these would invert each other's behaviors. Completely useless, very bad style but it would in fact reverse the effects of those operators. So, all right so we've now swapped successfully. It's always around this point in the semester and you might have seen this in section already, but a quick interlude here. Now that you get this particular joke let's go ahead and take a five minute break and we'll regroup as to how these pointers can be used and abused. All right, so we are back and recall that this is the picture we keep drawing, the bottom part of our computer's memory and we typically call these rectangular slivers frames. And recall just a moment ago within GDB when I executed back trace what I was seeing was pairs of these frames. So even though I've drawn them at different heights here main is a frame and swap is a frame but I've kind of divided it out here pictorially whereby you see the parameters first, then the local variables, then swaps the parameters, then swaps locals and so forth. And this simple, relatively simple layout in this pattern is actually very easily exploitable. So I mentioned a couple weeks ago this very popular way of hacking into systems and taking over servers and defacing websites and so forth and that very often but not always reduces to leveraging and understanding of all of this memory and then somehow figuring out how to inject your own code into one of these frames so that you overwhelm that frame with way more bits that it was supposed to store. And the result is that your code, not the original programmers, gets executed. But more on that in just a bit. So we had a couple of programs on Wednesday, among them compare one dot C and this program recall if I scroll up to just the main function was actually flawed. Even if I typed in hello and then hello again it said you typed different things. And what now is the simple explanation for why typing hello and then hello again yields different things apparently? Yeah. >> [Inaudible comment] >> Exactly. What we're really doing with this one seemingly correct line of code is comparing two values but literally two values. We're comparing S-1 and S-2, well what are they? Well recall we peeled back this layer last week and we said you know what, this is a training wheel, this is actually char star, this too is actually char star and so S-1 and S-2 are pointers to chars. What char in fact? Well the very first character in the string. And recall that the convention the world adopted years ago for strings in C is that all you have to remember is the address of the first character because so long as you plant a little breadcrumb at the end, backslash zero at the end of the string, then you can figure out by just iterating from left to right exactly how long that string is and where it ends and therefore when you should stop printing out H-E-L-L-O. So what was the solution to fixing this problem? What's that? >> [Inaudible comment] >> So put star-okay, so we could do this, we could say star S-1, star S-2 but borrowing the jargon from before break today, what does star S-1 mean here? >> [Inaudible comment] >> Go to S-1 and what should we find there if I typed in hello both times? I'll find H and then I'll find H again, so this is a step toward correctness but unfortunately now it's going to say that any two words that start with hello, that start with H are actually going to be the same and in fact they're not. So we did have recall last week a loop where we started iterating from left to right and then just looking for difference, looking for difference, looking for difference, and if we find one difference whether it's the string length, whether it's a character that's not quite identical then it actually would say different. Otherwise if we get to the end of both strings we found nothing wrong, we can then say the same. Or an even simpler approach and again consistent with this idea that you need not always reinvent the wheel if someone else has put it into some kind of library and header file we could do this, and I've actually added one bit of error checking that we didn't have last week when we did it on the fly. Recall that we introduced this function, str com, string comparison and it can actually return three values but we only care about one. What did it mean if it returns the value zero? That the strings are actually equal, they're the same length, they're literally the same characters. They're equal and recall that str com could return a negative number or a positive number but that had to do with alphabetical ordering if you actually cared about whether one comes first or last. Now I added one check here and this is one of these rules of thumbs that even though it might not seem all that compelling at first, take our word for it. Any, any, anytime henceforth that you start using pointers which you've actually been using since week one, we just call them strings. We didn't call them char stars, you should always be checking if a pointer is null before you do anything with it because as per the conversation before break if you've got some garbage value in a pointer, if that pointer is itself zero either intentionally or accidentally and you say go to this address and that address is zero or some bogus garbage character, you're going to induce what? Site fault, but the problem is not always and this is why a lot of bugs go undetected by people like us, by professional programmers, sometimes for years because not always does a mistake like this induce an error. Typically we call these things frames and they also refer in general to segments of memory. Typically you're actually allowed to wander a little bit outside of the actual memory you've been given. You can go a little higher, a little lower, but if you go too far then the segmentation fault happens. But the problem is that if you don't venture terribly far or you just get lucky and you go to a garbage address but that address doesn't induce a crash, you might not realize that your code is in fact flawed. So one of the things we'll introduce in a couple of problem sets is a tool called Balgren [assumed spelling] which is a tool like GDB that you just run at the command line and amazingly it will actually analyze your program for you and try with some probability to figure out if you're misusing pointers in some way. But for now the approach is to be defensive and anytime you're about to do something with a pointer check if it's in fact null. And the reason is str com is actually a pretty simple function. It does what we did last week, it starts at the left of each string and compares, compares, compares, compares, compares. But it does not check if a pointer is bogus or if it's zero. So the problem is if you accidentally somehow make S-1 or S-2 null, for instance the computer is running low on memory or you've initialized them to null yourself, any number of ways, we'll see more over time, and you accidentally pass null pointers to str com it probably will crash on you. So in short always check the values of pointers. All right so we saw one more sophisticated example, namely trying to copy strings, and our first attempt at this also was flawed. So this was copy one dot C last week, recall that I typed in a word I think hello then too and I tried to capitalize it by just changing the first letter of the word. But every time I tried to capitalize that word hello I ended up changing both copies, the original and the copy because how did I try to copy these strings last week? Well I used this line S-2 gets S-1 and really I can simplify that and you might have tried this in previous problem sets. This even looks more convincing, this looks perfectly legit, string S-2 gets equal to S-1. But it's not right because what is really happening in this one line of code? Sorry. >> [Inaudible comment] >> Exactly. So you are creating a new pointer but you're not creating a new string. If we define a string as a bunch of contiguous characters all we're defining with char S-2, char star S-2, is another pointer. What are you pointing at? Well you are pointing at whatever the user typed in. So if this is S-1, this is S-2 and the user typed hello and that word is over here in RAM, at this point in the story they're both pointing to the same place. All you've done is create a new pointer, not a whole array of characters. So we did solve this and we introduced finally copy 2 dot C. So in copy 2 it was admittedly a little more complex but if you think about what each step is doing it actually solves the problem quite nicely. So the left hand side stays the same, char star S-2 and again if this is starting to intimidate it's again just string S-2 but you have to start remembering now that strings involve pointers so what now am I storing inside of S-2? Well I claim I'm putting in S-2 this time the address of a new chunk of memory. That memory initially has just junk in it. Who knows what's in there, but how much memory have I just allocated? Well malloc is memory alloc and it takes in arguments and that's the number, an integer, and that integer is how many bytes do you want. So if I typed in H-E-L-L-O I want five bytes from malloc, okay I actually want six, the sixth one being for the backslash zero. So I always want a +1 so I get the string length of S-1 +1 and then just because I've kind of done this before and I know these kinds of crazy corner cases that can arise, why did we say last week that I then want to multiply that number by the size of a char? Yeah. >> [Inaudible comment] >> Exactly. Let me tweak slightly, it could be different machines use different sizes for chars. So in the typical system certainly the appliance and most of the computers you will own typically a char would just be one byte or a bit but just in case we hand this compote off to a fancier computer or it lingers around for ten years and the next computer actually uses two bytes for every char, this way we're being a little defensive and the program will keep working. And silly though this example is, think about again the Y2K issue. Right, it was this mentality typically of either its memory is expensive, we need to shave off every byte we can and just use two digits or it was a brash mentality like who the heck is going to be running our code in 30 years. Well a lot of people were still running the code. So it's this kind of mentality trying to make sure that your code is future proofed and always correct that is a very good habit to get into. So what is malloc actually doing? Well it turns out that malloc, memory alloc, is not taking memory from anywhere you see here on the screen because recall that the stack, these frames that we keep drawing as slivers on this rectangle, they go away pretty quickly, right. As soon as the function returns that's it, the memory is no longer safe to use. So this feels problematic. If malloc actually borrowed RAM from this part of the picture, well as soon as the swap function returned or some other function returned, even if it asked malloc give me some memory. How much memory? Well whatever number you passed in as an argument, what if then someone else needs that memory? So in other words when you call malloc you want to allocate memory permanently or at least until you are ready to give it back. The stack is in very ephemeral, anything on the stack can go away at any time as soon as these functions return. If you want to allocate a chunk of memory for like a word like hello and you want that memory to stay active even after your functions themselves have returned, well you actually need to put it elsewhere. So we drew briefly in previous weeks this picture here. This whole rectangle now might represent your computer's RAM, at the bottom of this is the stack. There's also this other stuff that are generally called environment variables. These are special things like your user name. So it turns out that when you run a program the word Jay Harbard [assumed spelling] is actually put in RAM and that is accessible to you, the programmer. We typically don't need it or use it but there's stuff like that that's put below your stack so that it's there for main and anyone else who wants it. At the very very top of your RAM is the so-called text segment, and what does text mean? So it's a little misleading but for historical reasons text segment is the zeroes and ones that you wrote by way of compiling your code with GCC. So the text segment is literally the zeros and ones that compose your program Sudoku or Game of 15, anything like that, they're loaded into RAM. And that should just make intuitive sense, right? If you double click an icon or run a program's name for anything to happen those bits need to be loaded somewhere so the computer can start doing something with them, and indeed they're loaded into RAM in this way. Now lastly there's a couple other things that are put way up there, initialized data and uninitialized data. There's some subtleties there but in a nutshell on the rare occasions when we've used global variables, for instance the Game of 15 we had G which was the global board at the very top of the file, global variables go there too. And that should also now make intuitive sense because you don't want your global variables on the stack because then they might disappear mid-execution of program but if they're way at the top they're going to stay there is the takeaway here. So that leaves just one last segment. So this was someone's bright idea but you know why not have this thing called the stack row up, up, up, up, up, and we need some other space so let's have this other space grow down, down, down, down, down. Now of course this is like a problem waiting to happen and we've induced this problem before, right. When I wrote that silly function called foo and then I had foo called foo and I didn't have any kind of base case or if condition to stop foo from calling itself, what happened? Well, the stack went up, up, up, up, up, it clobbered the so-called heap even though we didn't call it that that day, and the program terminated. The operating system said segmentation fault, you have exceeded your segments. But this is kind of nice, right. This might seem kind of stupid in that this is like you know two trains on a track, this is not going to end well. But if you think about it in the real world with your computer you only have a finite amount of memory so it's not as though we could just flip this picture around and just say let the memory grow up for both of these sections of memory, both heap and stack, or eventually you're just going to hit your two gigabyte, your two billionth byte of RAM or however much you have. So this is actually nice in that it lets us use the whole potential memory and bad things happen only if we get greedy and we try to load in too big of a dataset, we call function too many times. But long story short malloc puts memory on the heap and it puts it way up there away from the stack. And again if this is two gigabytes it actually takes a bit of effort or a bit of stupidity to actually have the stack hit the heap, though it does sometimes happen. So let's take a look. Let me go ahead now and open up a little file called bar dot C which actually does not do anything that's useful but it does now play with a bunch of these various syntactic details. So if at any point something goes over your head, just look confused, raise your hand because all of these are just now re-emphasis on these same basic building blocks. So this is bar dot C, at the top these two things are called quick quiz prototypes, right, so apparently there's two functions foo and bar defined elsewhere in this file. I'm just informing GCC that they're going to exist at some point. So here's main. So this in English if you were asked what is this doing declare an integer called A. What does that really mean? Allocate 32 bits and call it A. Where are those bits stored though? Where is A in this picture? So it would be in the stack. This is a local variable, local variables have been and remain on the stack. The only time something is going to end up on the heap is if you explicitly call malloc. That is a safe rule of thumb. Now how about here? As a new site so you might see differences between me and the TFs or even textbooks. This is probably the best habit to get into whereby the star is immediately next to the variable name only because it can be confusing if you have multiple variables declared on the same line. So in short this is probably the best style to get into. But char star S means give me a pointer to a character, more technically give me 32 bits in which I can store what? >> Char. >> Not a char but the address of a character. So char star S means gives me 32 bits that are going to store an address. The address of what? The address of some char. All right, so on the right hand side we now have hello world. Now this is hard coded, right? This is not a variable, I'm not using guest string so just as an aside where is hello world stored? This is effectively global data so when I said before global variables are way at the top, so is a hard coded string like hello also put way up there. All right, so what else here do we have? We have print a, percent S, backslash N, now this looks a little scary. So what is S? Well first let me delete the scary part. So we just have S bracket zero. If I print out S bracket zero and then put percent C here, that prints a char. What char does it print? So I just figured it out too, right, W if I counted right. Zero, one, two, three, four, five, six, seven, so the seventh character zero index is W. But if I instead say uh uh, I don't want to print a char, I want to print a string so I change the percent C to percent S and I don't want to print S bracket seven because that's actually the seventh character zero index. What does the ampersand operator do in this context? What am I getting the address of to be clear? I'm getting the address of the seventh character so that's kind of interesting. Now that doesn't feel like a complete string, in fact that's smack dab in the middle of this bigger string. But guess what? What does this, what does the word world also coincidentally end in? The null terminator, right, that substring world also obviously ends in the same backslash zero, so ampersand S means get me the address of the seventh character. Well what is that? Well if we're all in agreement that W is it that's where we are in the story. But when you say ampersand S bracket seven that means okay I don't want W, I want the address of W. So that means now somewhere, we don't need to project this, it's just going to be a silly rectangle. But that means somewhere is the word H-E-L-L, okay now it's enough words to project, let's try our fancy new technology. That's how bad my handwriting looks for those who have never seen it. All right, so we have W-O-R-L-D backslash zero, now if I break this up into something that looks a bit more like an array, okay normally this would be a much prettier picture, but maybe this is not the best use of technology. Okay, that's the projector doing that. So when we say S, what is S? S is the address of the start of the string itself and it ends back here at the backslash zero. But if I say S bracket zero that takes me to the seventh character, zero index, and that's literally W. But if I say what's the address of it, that's like asking the question what is the address of this particular point in this whole array that we think of as a string. Well who knows what it is, it's like OX 1, 2, 3, 4, 5. But if I then say print the string that starts at that address what does print F do? Well it's a pretty simple function, it starts here and it prints, prints, prints, prints and it only stops when it sees backslash zero. So what word is going to get printed based on this second line of code in main? World. Now you know again, highlighting it now clearly doesn't work just as I highlighted it. So this is kind of a silly exercise here but again it just means that all of this stuff with pointers and ampersands and stars just reduces still to these basics. So let's do this now. A gets five, so what is A, what data type? All right so it's an INT, we declared that a couple lines before. A gets five, puts the number five in A, so that's kind of old school. Foo A, I don't know what foo's going to do so let's scroll down now. So foo takes in an argument and that argument is of type INT, so at this point in the story what value is an N? It's a five but think back to the picture. This actually looks a little different, so if this is now my picture we're talking about foo now not swap but whereas A exists in main's local frame, the second rectangle here, we call that foo so actually we have a computer here. Foo's locals, foo's parameters, so where is actually this number N stored? Well it's going to be stored in the thing called foo's parameters. So they're distinct copies, and we can see that now with this kind of picture. All right so what am I actually going to do with foo? So let's see INT B, so this is yet another local variable, B gets N's so now everyone in the world has a value of five it seems, so B is five, N is five, back over there A is five. So B star equals two, so what's the value of B? So ten, right, so five times two is ten, so I'm passing 10 into bar so we're almost done with this story. Okay, this is kind of stupid. All bar does is print this. Now what's the takeaway here? Well actually so the takeaway here is if I had thought about this we would have been walking through this with GDB but we did that earlier so it's not all that enlightening now to step through each of these frames. But if we now get down to bar here and I print out M, M is going to be-let's at least close this loop-is going to be ten, right. We're just not doing anything with it. So actually we can now do this. Let me go in here and I'm going to compile this program called bar. So make bar, all right I'm going to now run GDB on it, so GDB bar dot slash bar, enter. All right, so now I'm going to do a break point at-let's do a breakpoint, let's not step through this because it gets very boring, let's do a breakpoint at bar, this is the very last function that ever gets called in this program. So now let's go ahead and do run. Okay, so now this is more interesting than last time because we don't have to step through all the tedium with GDB. At this point in the story I'm kind of pretty deep. That picture is actually starting to stack up. If I type back trace and hit enter, notice I can see a reminder here what's going on. At the very bottom of my stack is main and its variables, in foo is its variables, in bar is its variables. And again this is even if you've not yet dabbled in GDB the goal here really is just to show you some opportunities. I can obviously print out in bar the value of M, again M is just this simple thing here, the input. But suppose I get curious in GDB and I've kind of gone too far and I actually want to know wait a minute, what was the value of B inside of the foo function, right. You might think that damn, now I have to kind of restart my programs, set a new breakpoint in foo, well you don't need to do that, right. We have paused execution by way of a breakpoint, you can see where you've been with back trace. What's really powerful about GDB is you can kind of go back in time or at least with regard to this picture you can access any of these frames you want within GDB simply by typing this command. If I'm curious as to what the value of B was in the foo function, even though I'm currently in the bar function because that's where I set my breakpoint, I can say frame number one and hit enter and voila, it kind of winds back time. It doesn't undo any of the work, it's now showing me what's inside of that frame number one so I now can do something like what was B and ten. So again this just hints that again the power of something like GDB. You can literally pause execution of your program and go anywhere you want in your memory and simply understanding the order in which your own functions are getting called will let you begin to traverse this. All right, so let's do one other thing now here. The CS50 library, so in the CS50 library which we've been taking for granted and will decreasingly actually be using now that we've revealed what a string actually is, a char star. Let me scroll down, there's get char, there's get double, there's get floats, there's get INTs, there's get long long, there's get string. Get string's the only one we're going to look at right now because it actually is used by all of those other functions. When you call get INT it's actually getting a string from the user and then effectively using HY or an equivalent to convert it to an integer for you. So what is get string doing? I'm going to skip over these local variables here and just reveal a little bit about what's going on. So inside of get string there's this function that rarely if ever you will have to use yourself. But F get C you can kind of guess what it might do. Get C, what are we probably getting? It's getting a char, one char at a time. The reason for this is that as I keep saying and we'll soon see it's really easy to overstep again the boundaries of your own memory in a programming language like C and so if you get greedy and you say give me 10 characters at once or give me 20, but the user gives you fewer than that or more than that again site fault, core dump, crashes, hacking, bad stuff is the takeaway there. So with get C even though it feels a little less efficient what get string in the CS50 library is really doing is it's getting one character at a time from the user. And this very paranoid approach to getting input from the keyboard ensures that we never go past what the user has actually typed in. Now what's really happening here? Well the way CS50's library works is that initially it arbitrarily allocates a buffer that looks like that that's size 32. We made an educated guess that typically 32 bytes is plenty for the types of words you guys might type into a program. But just in case that's wrong, and just in case you actually want to type in 33 characters or 100 characters, the get string function needs to be able to grow dynamically. It needs to be able to say to the operating system we screwed up, we asked for 32 bytes, we actually need 64 bytes, can we have more. So what the CS50 library actually does if you read through this is if we find that you know what the so-called buffer, the space we've allocated in get string is too little, turns out there's another function and this is not something you'll likely need to call at least during the term but called realloc that will do literally that, it will reallocate an array for you and find you more space. And if need be, if you've been allocated here and someone else is already here, what realloc will do is it will say all right you want more memory, I'm going to move your whole array here and put it elsewhere so I can then give you more space so that it all remains contiguous. And so what realloc does which is nice is you just can't move an array around in memory, you can't just move bytes from one place to another. But you can do what? You can copy them from one place to another and so realloc actually does this. So we'll come back to this before long but know that all this time the CS50 library has been using malloc and in turn realloc that's a family of functions to allocate a chunk of memory and if it needs more, more, more, more. And now where does get string put that memory that it's handing you when you call get string? Where must the strings from get string be getting put in this memory when they're handed back to you? It's got to be the heap, right, because if get string foolishly put its memory on the stack what would obviously happen the moment get string returns? Like that memory, you know it might still say H-E-L-L-O in that stack frame but now as soon as you call another function bam that string's going to get overwritten. So what was important to us at the start of this semester that anytime you allocate a string that it ends up in the heap so that your functions can get called, finish calling, call return, call return, and that string stays alive so all this time we've actually been using the heap. And we'll see before long yet more sophisticated approaches to that. So let me just give a sense now of why bad things happen so easily. Let me actually skip over the actual code here and focus on this. So if you want to read more about this, especially if you're the hacker edition type, this is actually excerpted this picture from a Wikipedia article that goes into much more detail. But this is a picture right now of a stack, right, it's from bottom to top that we keep looking at but it's got some colors depicted here to make things more friendly. But it turns out that I've kind of been simplifying the world. When I say that on the stack is something like main's parameters and main's local variables and foo's parameters and foo's local variables, there's actually been other stuff still. Long story short when a function calls another function the only way that guy, the second guy knows where he should hand control back to when he's done executing is because what also is going on the stack is each function's return address. In other words when you call a function even like print F, this is kind of like happening all these weeks behind the scenes. You're essentially writing on a piece of paper your address, your main function's address, your foo function's address, whatever function you wrote you're handing it to print F and saying do your thing and then return to this address which is my address. So, the problem though is that this picture then means on top of the stack is a whole bunch of juicy information, not only are there things like variables. Notice up here there's the variable, there's the variable bar whatever that is and the variable C whatever that is. But these things I just mentioned, particularly this one in red is a return address. So the return address is that sheet of paper that's been secretly being put on the stack every time you call a function. But here's the bad thing and here's how this all ties together with arrays. We've said before that an array is a fixed size and bad things happen if you go too far to the right of your array. But notice what happens. When you're calling functions notice how this arrow reminds us, the stack grows up, up, up, up, up, up, but it was someone's bright idea years ago that when you store an array in memory the first letter, for instance H, is going to get put here but then the next one goes there, there, there, and again this is a picture so it's not really a rectangle. But then you move down, down, down, down, wrap around, down, down, down, so what happens in this model if the string the user types is not H-E-L-L-O but is like 1,000 characters. Where does it go? What memory gets clobbered so to speak? Well all of this stuff, so what really sophisticated or lucky bad guys do is they try to-and this can be as simple as running a program at the prompt and typing in some crazy long string and seeing if it crashes-what a bad guy will do is try to inject let's call it attack code, generalize it as A for now. What is an attack code? It can be anything, it can be something stupid like spam everyone in the computer's address book or it could be erase everything on the hard drive. Attack code is whatever you want it to be. But a bad guy will essentially do this and the colors here are important. What did the red colors represent? The return address, that sheet of paper. So a really smart bad guy will somehow figure out to be honest quite often by trial and error, a whole lot of trials and a whole lot of errors, but it only takes one success to then break into the system. He or she will try to figure out you know where is this address, where is that sheet of paper being stored in this program because I'm going to type in some crazy long sequence of characters and it might not even be alphabetical [inaudible] characters, it can be like those unprintable characters where like it looks weird on the screen because it's just zeros and ones. I'm going to inject some kind of crazy long string, attack, attack, attack, attack, attack, but I just have to get one detail right. The very last thing I type in at the keyboard or pass through the server via the internet, say via URL, has to be the address of my own attack code. So if what you do when prompted by something like a get string is you type in the equivalent on your keyboard of an attack function that again deletes something, sends spam, whatever, but you get the very last detail right and you say the address of my own attack code is going to start at this address 80C0, whatever that is. So long as you put that address 8035C080 to refer back to you, what's going to happen? Whatever function this is, print F, get string or whatnot, he's going to finish executing, he's going to say all right I've been told to return to this address but what just happened, someone else wrote over what I wrote on this thereby telling me to return to a different return address and that return address is depicted in A there and that might literally mean go erase the hard drive now. So how do bad guys figure this out? It really is a lot of trial and error. Anytime you visit a website, anytime you run a program and it crashes, it core dumps, you get file not found or internal server error occurred, this is prime hacking opportunities for bad guys because what it suggests is not necessarily that there's a vulnerability but there's at least a mistake. Someone screwed up and did not check the length of an array, someone let me go too far in memory, and so if I can actually now guess by trial and error writing a program that generates a whole bunch of garbage just seeing which one actually gets me into the system, this is incredibly common how bad guys get into systems. And in fact if you ever run a web server as you might do for your final project and troll through your own web logs, you'll see that there are these people called script kitties on the internet who essentially run programs like this, sometimes not even knowing who you are or caring who you are, but trying to send garbage values, garbage values to your web server in particular and so you'll see crazy long URLs show up in your log because it's been incredibly common that web servers on Linux, on Windows, on MAC OS actually have flaws like this where if you get that input just right you can trick the computer into doing anything you want. So with that said, we'll see you on Wednesday. The end.