[ Silence ] >> Al right, welcome back to CS50. This is the end of week five and this one of my favorite problems that's on the horizon here. So if some of you have kinda been staring at this, trying to figure out what it is, letting your eyes zone in and zone out. Odd are, you didn't quite decipher it but if you had say from childhood one of these pieces of plastic that's came for instance with one of these cereal top boxes and you held it up to this thing, what you would see as did last year, students when they implemented this idea in code was that it was Professor Plumb [phonetic] in the lounge with a candlestick with this particular murder mystery. And so on the horizon for you this year is gonna be a little someone different and a little some place different with a little something different but this is a teaser for problems set 5 who's focus will be forensics and the recovery of information that's either been accidentally or deliberately lost. I went through my inbox to retrieve an email from one of your predecessors, I'll keep it anonymous but this was really quite fun to read since he sent this to us a few months after CS50 ended a year ago. And he wrote to us yesterday, my sister accidentally formatted her camera's digital media card and lost a year's worth of memorable photos, parenthetically she unfortunately isn't the best at backing up her data. This reminded me of problems set 5. So I thought I would try to run her memory card through the recover.c program that I wrote all the way back in October. So after four hours of figuring out how to create a raw forensic image from the formatted memory card, I came across this page and was able to do the following. After tinkering around with some of the command line arguments, I managed to create the forensic image, I installed and configured the CS50 virtual box, I managed to run the forensic image through my program and run all--recover all 1,027 of my sister's photos. I find it absolutely amazing that I was able to go from being one of those less comfortable with no programming knowledge whatsoever to having the ability to recover data off of a formatted memory card in an actual real life situation. So it does in fact happen. [Applause] So this will be problem set five. So quick invitation, CS50 lunch at the usual place cs50.net/rsvp this Friday at 1:15 p.m. We're gonna do a dinner in future weeks for those who consistently cannot make Friday, FYI. And now let's our attention back to this library. So very briefly last time we looked at this function get string, we've been taking this for--taking, taking this for granted for some time now. And underneath the hood we recall that it's starting to--it's actually been using some of these new fundamentals we looked at, this notion of memory management, dynamically allocating memory and just as a quick review. So the function we introduced with which you can ask the operating system for memory is now called, okay, malloc, for memory allocation. And that memory ends up on stack or heap. So it ends up on the heap and their intuitive explanation for that is that the stack memory constantly is coming and going. It's disappearing every time a function finishes executing so it's stands to reason that a function like get string, if you want whatever memory its allocating to actually survive its return which you do otherwise what's the point in getting a string, that memory needs to go on to the heap. So hence forth and we'll see this more on future P sets when you actually want memory dynamically, in other words you don't know how much you need in advanced and that's absolutely the case when you have no idea what the human's gonna type until he or she does or if you want to dynamically grow or shrink something, you're not just gonna declare local variables anymore, we're gonna start using a function called, "malloc." And it comes with an opposite function called "free" which is a function we'll start to need to call when you wanna tell the operating system, I am done with this memory. In fact a little dirty secret of the CS50 library is that currently it leaks memory whereby, when every time you call get string, we call that function malloc, asking the OS for memory, asking the OS for memory but never once probably for your problem sets and in even in lecture have we said to the operating system, "I'm done with this string, you can have these bytes back." And so that's actually a bug. It's a bug that's been in the CS50 library from the beginning of this semester but it's meant to simplify the process so that we can just assume that we're getting a string but now as we begin to dismantle these training wheels, anytime we actually ask the OS for memory, we're going to need to give it back. And so we're going to stop using the get string function since now we can appreciate perhaps that it's not quite doing us a huge number of favors. Now just as an aside, what do we mean by, what do we care about memory leaks for? Well, if you've ever been running your Mac or PC, you're Ubuntu computer, whatever you have and you find that overtime it's getting slower and slower and you're opening things like Gchat or your browser or Photoshop or whatever programs you typically run and you've been doing these a lot, loading things into memory, quitting, loading, well if any of those programs has one of these things called the memory leak or some programmer just screwed up and asks the OS for memory but didn't necessarily hand it back, what can happen in real terms is that your computer can slow down because your OS is gonna think that it's out of RAM. And so, future programs are going to use something called "virtual memory" which for today's purposes is slower. So if you're ever finding that your computer's inexplicably getting slower and slower and slower, it might just be frankly that you need to reboot because there's some buggy program or programs that haven't quite been playing nicely inside of your laptop. So let's take a quick look at the CS50 library. This is cs50.c, we started looking at this last time and it's okay if you don't understand all of the intricacies of get string but if a function that is a little easier is something called, "get int." And its turns out that, get long, get long, long, they're all pretty much equivalently implemented just for different data types and they all conveniently use get string. So let's assume for the moment that get string works and it allocates memory for us and it ultimately hands us back a line of text that the user has typed in. So here is the implementation of get ints that we've used as far back as week one and problem set 1 in C. So I first have kind of curious feature here. I am deliberately inducing an infinite loop while true means, do these forever. So hopefully somewhere in this loop I have a break statement or a return statement, something that's gonna deliberately break me out of this infinite loop. So let's see what I'm trying to do potentially infinitely. So, I have in the inside of this loop, I'm gonna try to get a line of text, so this is a familiar call. I'm now checking recall for line equals, equals null. So as an example, when might get string return this special keyword null? [ Inaudible Remark ] >> Yeah, so if there's nothing typed in and we can simulate that as we've discussed with control D which normal humans aren't likely to type, but there's another scenario that could very well happen and we'd get back null. [ Inaudible Remark ] >> It's too long a string, right. If they paste in a huge essay that would fully exhaust the computers RAM, get string and as much as it calls malloc, might end up returning this special pointer value called null which essentially means, "Something bad happens, I don't know what necessarily but I can't give you back a string." So we have to check for this otherwise we risk later in our program, those things called seg faults and core dumps. Now, why am I returning ints max? Well, this is just a convention in C. There's kind of this problem fundamentally in C, anytime you're implementing a function that's supposed to return a number, like an integer 'cause you can return, you know, upwards of 2 billion positive numbers maybe as big as negative 2 billion. But if you wanna return an error, you kinda have to arbitrarily say, "Okay, zero represents an error." But then that's problematic because if you want your function like get int to be able to legitimately return a zero, if the user typed in zero, well, you can't use zero as your special error value. So, you can say, "Alright, let's use negative 1." If the user--If the--something goes wring, we'll return negative 1 by default but that too is problematic, what if the user wants to type in negative 1? So the convention that most functions in C adopt is they don't choose these popular values like negative 1, 0 and 1. They'll choose like 2 billion and 1, something that's crazy large and just shy of the maximum possible value and we define that typically with a constant. So if we actually poked around a bunch of dot H files in the appliance, we would see that someone had done # define ints max and it's something like 2 billion. And so we're kind of deciding, "Ugh, yes, we're sacrificing that number but it's less likely to be useful than 0 or negative 1 or 2. So that's why we're returning int max just as a matter of convention. Now, here is where things get interesting and here's where the training wheels now come off. So this final chunk of code, this big branch here is how get int ultimately works and it uses a function that you may have seen in readings or other examples called sscanf, which means, String Scan Formatted. So it's kind of like the opposite of printf, printf obviously prints, scanf reads in from the keyboard. So if we didn't have the CS50 library, you're seeing the hoops, you would have needed to jump through in week 1 just to do something stupid like ask the user for a number. So how does this actually work? Well notice first just for temporary variables, I'm declaring an int N and a character C and I ultimately wanna put the number that the user types in into N. C, I just need to check if the user screws up, I need a character for the following reason. Here's the magical function, sscanf, takes a few arguments. The first argument is the line of text that the user typed in. So look at a string that the user typed in and recall from above, we called get string and we called the return value line. So, that' all this is. This is literally the array of characters, a.k.a. string, that the user typed in. So we're saying to scanf, "Scan this string and try to extract an integer for us." But how do we do that? Well, much like printf which is use for output, scanf does this for input, we say in parent--we say in quote marks, present D which is a placeholder saying, "Try to fill this placeholder with an actual integer that the user typed in that's inside of this string called line." >> We have a white space character to the left and a white space character in the right--to the right, just to be friendly. So that if the user accidentally hits the spacebar then a number, that's okay, or they hit a number and then the spacebar, that's okay, we're gonna forgive that. But we're not gonna forgive typing any character other than a space. If the user types in 42, space F or G or any letter of the alphabet or symbol on the keyboard? What's gonna happen is this, scanf is also being told, "Alright, put the first number you see in this placeholder but put the first character you see, alphabetical character, punctuation character that you see in what variable apparently?" In C. So this is just here as a bit of error detection. We only want to fill in. But just in case the user messes with us and we also happen to fill C, we wanna be able to detect that the user types not just a number but also a number and a letter of some sort. So we have to tell scanf where to put those variables and it is not correct certainly to do this because anytime you pass a variable into a function, how are you passing it, what's the jargon? So you're passing it by so called value, you're passing a copy of it in. But if you want scanf, sscanf to be able to fill those variables with some new values, how do you do that? Well, you have to write down that little scrap of paper, the visual we keep doing where you have to say to sscanf, "Put your answer on these two sheets of paper." So here is &D, here is--rather &N, here is &C, put your ints here, put your character here because then, I, the person who wrote this code is gonna check if sscanf returns 1, that's perfect. The return value of scanf signifies how many pieces of paper that function filled with values. So if it only filled one, the ints called N, great, we're good to go. We can immediately free the line, the string that the user typed in and that's new today and then we immediately return that integer. But if the user somehow did not cooperate, did not just type a number like 42 and instead we get back from sscanf the number 2 which signifies that they typed in some number plus some garbage character or if s scan F returns zero because they didn't type in anything somehow, the else condition is gonna apply. We're still going to free that line, that string that we got from the user and then we're gonna nag them, "Retry, retry, retry." So, if you've ever kind of wondered how that retry is being spit out automatically for you if the user doesn't cooperate? It's right there, everyone of these functions get ints, get float, get double, has that retry line and if you don't get there but instead the user cooperates, you instead return the ints or the floats or the double or the like. So that was our--the training wheels that are hence forth off. Questions? [ Inaudible Remark ] >> If it--So no. It has--Good question. So if the user just typed a character would it now still return 1? No. Then it would return 0 because when you have the format string, percent D, percent C, that's telling sscanf, "You have to try to put number then a letter." So if the user only puts a letter doesn't match that pattern, so sscanf returns 0. So, good, good corner case. Yeah? [ Inaudible Remark ] >> So, no--good question too. So while true is kind of a--sort of [inaudible] truth. True is always true so while true just runs forever no matter what, there's no way to break out of this loop unless you manually return from the inside of it. Now, we don't have to do it this way, in fact it's usually wrong to induce an infinite loop, it usually means you messed up and you made some mistakes but in this case we could have used a do-while, we could have used a for loop. But in this case, we decided as the staff, we don't wanna say we're gonna let the user try a hundred time in which case we have a for loop with the number hundred hard coded and we also noticed, did not wanna prompt the user initially and say, and saying something silly like give me an integer. We wanted to leave that to you guys, the ability to actually put that first print F, the only print F we wanted to spit out was the retry so we just decided that really what we want, really the construct that gets the job done is just do the following forever but we'll stop ourselves when we're ready. So it's just a design decision. Yeah? [ Inaudible Remark ] >> So what if you typed in a number? [ Inaudible Remark ] >> Good question. So what if you typed in a number and a character? So sscanf actually treats whites space inside of that second argument special. If you actually did that, it would ignore--it would still put number and then letter. You don't need to have the white space there. So that is a--It's sort of a--sort of a secret feature whereby this space is optional. [ Inaudible Remark ] >> It would return too in that case. If you typed in 4-2-F, for instance, you get the 42 and the F but that would still be wrong. Yeah? [ Inaudible Remark ] >> Why do your programs crash? What do you mean? >> Why do they not crash? >> Why do they not crash? [ Inaudible Remark ] >> Ah, good question. So why do your programs--not so much crash but why don't they hang and just kind of sit there waiting in perpetuity? Well remember, get string is actually the function that's first getting called and it's get string that sits there with that blinking prompt waiting for input. And so if the user doesn't ever type anything or even hit Control D, absolutely, your programs just gonna sit there and run in infinitely long but get string is actually what pauses this loop and waits with the blinking prompt for the user's input. Good question. Alright, so scanf, let's actually see. This is how we might use this manually. So this is an example. It's is among today's printouts on online. This is an example of called, scanf-1 and it demonstrates how you can sort of old school style get an integer from the user if you don't have the CS50 library. And it actually is relatively simple if all you wanna get is an int from the user of these exceptions of the user not cooperating aside, but it gets a little dangerous if we're not trying to get ints but we're trying to get characters or strings. Because recall, we've revealed that string involve pointers, pointers involve memory, memory involves the risk at least of screwing up and inducing core dumps and seg faults, so we're about to see that. So here's a program whose purpose in life is to say to the user, give me a number please, then I have already declared a local variable called X and I'm passing it in by reference so to speak, by pointer, by address, these are all synonymous phrases to a function called scanf. And scanf reads directly from the keyboard, sscanf, string scanf, reads from a string that you already have in a variable as we did earlier. Scanf reads directly from the user's keyboard, which is the goal right now. So right now the user is being prompted for an integer and if the user provides an integer, it's going to be stored inside of the variable X and it's going to be printed back out on the screen. So let's try this, let me go ahead and do make scanf-1, alright it seems to compile okay. I'm gonna go ahead and run scanf-1 and I'm gonna type in then number 42, thanks for the 42, it seems to work. So let's try it again. Let's try the number, let's say, 0. Always try the interesting cases. That seems to work. Let's try the number negative 1. That seems to work. Let's try the arbitrary word, like [inaudible] to this term, monkey, something went wrong here. So what happened? Can we infer from this short program what the bug now is? Yeah? [ Inaudible Remark ] >> Okay, good--it's a good thought. So perhaps it's converting the ints to the chars, alright. So other--so maybe. So X, no. But let's see some other ideas. But that's a good thought. Yeah? [ Inaudible Remark ] >> Yeah, so it's actually--it's a good thought but it is indeed this case whereby because sscanf did not detect an integer, it instead detected a word or the character M in particular, it could not populate the place holder and so nothing was put in X. So, its original value is unchanged and what is X's original value apparently? Well, it's just--who knows, it's some garbage value and that garbage value at that moment in time happen to be 2719732, who knows what it might have been. So this is another take away here too. If there's ever a risk in a program where your variables might not be assigned some value, it's actually a good habit just to detect such things to for instance initialize your variables to some known value, maybe it's zero, maybe it's negative one, maybe it's int max, but so that you yourself can check afterward what actually--whether or not the value in there is legitimate. So realize this is absolutely and option and so if there's ever a risk again of your variables not getting initialized, you might want to pre-initialize it yourself to what we keep calling a sentinel value, just some special number or special constant. Well, this is relatively straightforward for ints but let's look at version two here at trying to get a string. Let's just continue this logic. So we've taken the turning wheels off, I have removed the CS50 library from my appliance and so I cannot anymore say include CS50.h and so I instead, if I wanna declare a string? I have to go back to the old fashion way of saying char star. So char star buffer is going to represent my string and I'm calling this a buffer deliberately. In my mind a string really is an array, and an array is just a chunk of memory and computer scientists would typically call a chunk of memory a buffer, something into which you can read values the characters, numbers, whatever. >> So a buffer typically means an array of memory. But notice here that really what I've declared is char star buffer. So just to be proactive here, how much space have I actually allocated with this first line of code here? How may bites or how many bits have just been allocated by char star buffer? So it's pretty small, right? How big is a char star? It is what the question reduces to. It's probably--It's generally 32 bits, right. So anytime we have a pointer we've claimed at least for the appliance and for slightly older computers, a pointer is always 32 bits. So what does that mean, char star buffer? All I've allocated is 32 bits and what's supposed to go inside of those bits? Not a string, that's not really a buffer, that's kind of a misnomer at this point in the story because really I only have a pointer and I'm not supposed to put characters in pointers, I'm supposed to put memory addresses in pointers. So even if this still feels a little abstract at least take away from this that something here is wrong. And as an aside, just to tie this together. If you're pretty computer savvy and you've known it for sometime that for instance your PC can only have 2 gigabytes maximally, a RAM. You might be generally aware of these restrictions. Well, 2 gigabytes is actually a result of having a 32 bit CPU, Intel Inside if its 32 bits, the biggest possible number you can express is 2 to the 32, which is 4 billion. But for technical reasons that's actually half. So you have 31 buts that you can use to address your RAM. So long story short, you can't have more that 2 gigabytes of memory in some computer 'cause you don't have numbers big enough to actually say, "Put this here, put this here." You kind of can't count that high even if you have 10 gigabytes of memory. So another reason to having a 64-bit computer these days as most of you now probably do is actually a very good thing. [ Inaudible Remark ] >> Correct. So C does not have a native string class like something like Java does but it does know about strings in the sense of printf. But it really treats them as arrays of characters and it stops when it sees So it's very primitive in that sense. So what do I say? I say to the user, string please and then I use scanf, percent S buffer. So I'm saying to scanf, take whatever the user just typed in at his or her keyboard and then put it, where? At that memory address. Now what is that memory address? Well, what is the value of buffer at this point in the story? [ Inaudible Remark ] >> It's a garbage value, right? The answer to this question is always gonna be, it's just some unknown garbage value. If I and none of my code has not initialized as anything explicit. So what you're really saying is your allocating only 32 bits for a pointer to a character and that's got the number like OX1234. You are then handing this slip of paper to scanf and saying, "Put whatever the user types here." Well what's there? Well, you have no idea. It could be 0. It could be 1, 2, 3, 4, 5. It could be some part of memory that you don't even have access to. So scanf is going to erroneously put the string there which typically induces one of those seg faults. So let's see if we can confirm this. So let me go ahead and make scanf-2. And always keep in mind that some of these errors are not detectable because you get lucky and you actually touch memory that is yours but you really shouldn't be using it. Now we've configured make in such a way that you can't even compile this code because we are proactively checking, wait a minute, buffer is uninitialized in this function and so there's an error. Won't let me even compile. But let me see if I can manually override this. Rather than type make, I'm instead going to do GCC-standards equals C99, this is just a version of C we're using, and I'm gonna skip the -W error, which is the guy that's making make be so picky, scanf-2.c-O, scanf-2, Enter. So now I'm gonna run scanf-2. So it compiled even though I know there's a bug in this program, let's go ahead and run it, string please, monkey, enter, segmentation fault. So what's a possible solution here? Well, I can at least initialize this to some known value and the convention is typically null. Let me go ahead and recompile this. But notice even if I do this, this is no better, it's still buggy, but at least now printf is detecting as much in this case. So still a bug, but at least now it's obvious to you, the human, the programer, wait a minute, this is clearly not what I expected. I'm doing something wrong. So what's the solution perhaps to this? What do--How do we fix this problem, a buffer being some unknown value? Can I do something a little crazy like, well, null is obviously bad, I know that much. Well, let's put it at this address I keep quoting as popular. Well, we could do that but what you're really saying, now you're just arbitrarily saying, put what the user types over there and you have no idea where there is. So what function can we call that gives me a memory address of a legitimate chunk of memory? [ Inaudible Remark ] >> Right. So the solution here is malloc. So we could try this. So give me a string that's of size, the user is not gonna type in a word that's more that like 10 characters. So I'm gonna hard code 10. Now this should actually work because malloc, assuming there's RAM left in the computer is gonna give me a pointer and it's gonna say put this word that the user types here in memory and that address is now stored in buffer so scanf will put it there. Now, let me go ahead and try compiling this, let me recompile it with GCC, implicit declaration of function malloc. So we've not how to this manually yet because the library--the CS50 library is usually doing this for us, but there's another header that's popular, standard lib, standardlibrary.h, and that should make GCC happy now, not knowing about malloc. And indeed it did. So scanf-2, we go ahead and type monkey, nice. Thanks for the monkey. Alright but wait a minute, monkey, monkey, monkey, enter, interesting. So that kind of worked but monkey, monkey, monkey, monkey, with no spaces, interesting but what's happening here actually is that we are getting lucky. Let me see if I can make us unlucky by doing this ad nauseam. Let's see, that doesn't work. So let's do paste. Let's do paste. Let's do paste. Let's do zoom out. Let's do paste. Now obviously a user is not typically going to do something like this. But imagine it's actually, you know, a form field on a web page where--still working. I'm getting a little bored copying and pasting. But take my word for it today that if we did this long enough, we would traverse one of those segmentation barriers where right now we're within it and we're just getting lucky but we're gonna cross over at some point and in fact it's going to crash on us. Again, segmentation fault. So how do you fix this? Well, this is actually harder to fix. And now the CS50 libraries get string function motivated. Recall, we looked at it briefly on Monday and anytime you call get string, how many characters does it get at a time? Well, recall, it used a new function, get character, get char and it just got one a time, one at a time. It's incredibly paranoid, the get string function that we wrote so that it only slowly looks at what the user is typing in and only if it realizes, "Wait a minute, you just typed in 11 characters but I've only allocated 10 bytes." What is it gonna do? Well, recall, we saw briefly the realloc function which is like a cousin of malloc. And realloc as it's name suggests, takes the 10 bytes that you might have already allocated with malloc and doubles it or triples it and we repeat this process. So why was get string relatively complex compared to this? For this simple reason, there are so many programs to this day written out there where the--you, the programer has made a arbitrary and ultimately dangerous decision to say, no one is gonna have a name longer than 16 characters or a thousand characters. But these are precisely the opportunities that bad guys look for trying to crash programs because again we saw on Monday this opportunity for a buffer overflow exploit which essentially means typing something a little more sophisticated than monkey, monkey, monkey but rather code that you wanna execute and you can trick the computer into over flowing this buffer and executing your adversarial code. Yeah? [ Inaudible Remark ] >> The word string does not exist. It's in CS50.h. [ Inaudible Remark ] >> Good question, coincidence. So percent S is part of C. It's part of printf and percent--so the word string exists in programmer's vocabulary but the data type string does not exist in C. So percent S denotes char star. [ Inaudible Remark ] >> Really good question. So I'm contradicting my self here, right? In the previous example with S--with scanf-1, recall that I did this. I put a past in--in percent X to get the address of X. But we can kind of answer this just with our own jargon. What is buffer already? It's a memory address. So I don't need to use ampersand because I already have the answer to the question. The question is gonna be, where do you want me to put the user's input? Well, put it at that address. And the fundamental difference here is that in the previous example, we allocated in ints, it was on the stack as a local variable but there is no malloc involved. But as soon as you involve malloc, what you're literally getting is that address. So we don't need to use the ampersand in this case. And realize there's one other way we can create a buffer that's just as dangerous as hard coding a length. A very common approach in a program is to say something like, char buffer bracket 16. >> So char buffer 16 doesn't feel like a memory address really, but it is in fact an array of characters. And what though is an array? Well, an array really is just the address of the first chunk of memory. So what is this really doing? This also is allocating not 10 but 16 bytes this time. It's then passing the word of the name of the array buffer to scanf and think back now to P set 3 when you implemented sort or search, remember that you could pass in an array as an argument to a function and you didn't use the double brackets, you instead just wrote the arrays' name. That's because you can pass an array but its name, by its address and so scanf here would use that 16-byte buffer to put the users input. But what's gonna happen if the user types in 17 characters? Just by nature, by definition, you're gonna go beyond boundaries of that array and notice too in C, scanf and your own code has no idea how big the original array is. It has no idea how many bytes you asked malloc for. It is entirely up to you, the programmer to remember how many bytes you asked for or how many bytes you hard coded in the array. And so again, this is the prime--one of the primary reasons that so much code written in C and C++ and even in some modern languages is in fact exploitable because of these kinds of dangers. And if you don't believe that too, realize that these languages that you might already know a little bit, and we certainly will by semesters in like JavaScript and PHP and Python and Ruby, a lot of the times the programs called interpreters that you use to use those language, they're written in C themselves. So you might be writing PHP code, but it's being executed by a C program, so if that C program is buggy, your PHP code can be vulnerable as well. Yeah? [ Inaudible Remark ] >> Good question. So if the buffer is an address why do we not say star buffer as we did in our swap function on Monday? So the reason again boils down to the fundamental question we're trying to answer here. The question at hand for scanf is where do you want me to put the user's input? The answer to that question must be an address. But we already have an address, malloc gave us an address. So the simple answer to why we need no star and no ampersand here is because buffer is already an address. Because in this case, we called malloc and as I'm disclosing today the name of an array can be treated as though it is a pointer as well, an address. The only time we used the star is when we want to go there. Scanf will do star buffer, we do not. Yeah? >> Can you compare char buffer 16? Is that 16 characters or? >> 16 characters, 16 of whatever the data type is in green. [ Inaudible Remark ] >> No, it's still gonna be 16 bytes. A char is 1 byte. It's 8 bits. So, in this case we would get literally 16 bytes or 16 chars. If instead we're in buffer 16, then we would get 64, 'cause it'd be 4 bytes per integer. Alright, so recall then the danger that this leads to. Alright, we saw this picture and we kind of lambasted this design because you have this dangerous pointer called the return address in red and that was simple the address of what? What was this red return address used for? It tells the function what? This--The return address, is literally return address. It tells a function where it should return control of the computer to once it's done doing its thing. So if I'm the main function and I called the fu function, what I'm essentially doing conceptually is, I main. I'm gonna say I am address 1, 2, 3, 4, 5. You hand this piece of paper to fu, fu keeps it around in this red slice of memory and as soon as fu is done executing, it checks where did main tell me to go back to, 1, 2, 3, 4. Let me hand control to the CPU back to the address that was on this piece of paper. But the problem is that if fu has on a buffer say 16 bytes, or in this case 12 bytes, and the user types in not hello but something much longer that that, where does the space go? It goes from top left to bottom, and so you run the risk ultimately of overwriting this with the address of some bad guy's code. And unfortunately even though this simple solution might be to just say, alright well don't write hello from top down. Write it from bottom up. It turns out that only makes the problem a little harder for the bad guys. But the problem is that you can end up tricking future functions that gets called into exploiting codes. So there's actually not a simple fix for this. And again, this remains one of the most common ways of exploiting a program. Let me just peel back the layer of one other thing with regard to pointers. So this here is a program called "pointers.C." It's among our source code from today already. Notice that I'm using a few header files up here, using a few libraries just because I wanted to resort to the CS50 library for this. And now notice, the one new habit I'm getting into is anytime I call get string, I now need to say if the return value equals-equals null, something went wrong. I should yell at the user. I should return. I should exit. So I'm now checking that value. Now why can get string return null? Because it uses malloc and malloc can return null. Alright, so notice this trick though, we have previously printed strings and previously the syntax had been this, alright. This should probably remind you a little bit of week 1, week 2. If you wanna print a string that the users typed in, it should you remind you of P set 2, the Caesar cipher and Vigenere. Well, I can print each character, present C one at a time and then I can print that character by way of S, the name of the string bracket I. So comfort with this? Hopefully? So it turns out that all this time, these square brackets are what we would generally call syntactic sugar. It's just a nicer, prettier way of doing something that at the end of the day is actually more sophisticated. This code here that I just wrote is equivalent to S bracket 1. So let's go back to the fundamentals. First of all, what is S? Well, S we call string but really as of this week, what is S in more technical terms? It's an address, right? It is the address is RAM at which that string's characters live from left to right. So, star S recall means "go there." And if you go to that address, you're gonna see the letter M and then O and then N and then K, if the word is monkey, right. If you go to that address, you're gonna see those characters. But you don't wanna go to the same address every time. You wanna go to the start of the string which is identified by the name S, but each time you iterate in this loop how many steps to the right in memory do you wanna look? Well I, right, one more, one more, one more. So if inside of your loop, you take this address S and you add I to it, well the first time this loop goes through, I is initialized to what apparently? Zero. So S plus 0 is S. So what are you gonna print first? You're gonna print star S which means go to that address and print out if the word is monkey, the letter M. If you then take that same address and do plus 1 on your second iteration, that's not address 1, 2, 3, 4. That's like 1, 2, 3, 5 and what letter presumably is at that location? So, O. So star of that summation means print the O, print the N, print the K and so forth. And because we already checked in advance the length of S with this helpful function string length, we're not gonna crash. We're only gonna step over the characters one at a time and then we're gonna stop. But just realize all this time even as far back as problem set 2 in Caesar, we've been using pointers. We've been using memory addresses. We've been walking through your computer's RAM but we did it in a more user-friendly way with S bracket I, but really you've been using a feature called pointer arithmetic, taking an address and doing some mathematical arithmetic on it plus 1 minus 1. So realize all we've bee doing is the same topic all this time. Yeah? [ Inaudible Remark ] >> Really good question. So if we were instead iterating not over characters which are 1 byte typically, but instead over ints, would we have to do plus 4 times I so that we go 4 bytes, 4 bytes, 4 bytes? Short answer, no. The reason this feature has its own name, pointer arithmetic, is because the compiler will figure out that when you say S plus I, if S is actually a char star, it's gonna do literally plus 1. If instead though, S is an int star, well an inst star points to an int. By definition, ints on this machine are 4 bytes so plus 1 is actually gonna be implicitly converted to plus 4 than plus 8, plus 12, plus 16. So that's what's really cool about point arithmetic. You don't even have to think about those details. So your code will work on old machines, new machines because the compiler will figure this out for you, really good catch. Alright, yeah? [ Inaudible Remark ] >> Good question. Is this more computationally efficient than using square brackets? No. The compiler will actually effectively turn your square brackets into this. So when your code is running, you will notice no difference. Back in the day, you might notice a compilation difference but these days on a 2 gigahertz computer compiling Caesars instantaneous anyway, so it's a non-issue in modern times, alright. So that was a lot. Let's go ahead and take our 5-minute musical break here. Alright, so we are back, really good news, no problems set next week. [Cheering] I know. There it goes. So yeah, so quiz 0 is next Wednesday. There's no lecture on Monday because it's a holiday for the University. Quiz 0 is on Wednesday. We will announce via email on the course's website where to go next Wednesday. We're gonna try to book enough classrooms so that we have writing surfaces for everyone. So we most likely will not be here. So again, don't show up before checking your email. There will be a review session this Sunday at 7 p.m. in Northwest Science, same time, same place as the walkthroughs usually are. >> This will be a course wide review. It will be filmed, put online by the next to cover really the past 6 weeks of material and particularly filled questions from you. And we'll also have office hours next week on Monday and Tuesday night. They most likely will not be in the dining halls, so instead be in a classroom where we can use a white board and it'll be totally casual and an opportunity to get some last minute questions answered. Know too that there's 4 years worth of old quizzes on the course's website. So the best guidance t get a sense of what past quizzes have been like is to go there and you'll see not just the questions but also the answer, do just realize that the course evolves overtime. For instance in '07, we had three quizzes, thereafter it was just two but the material changes. So realize that if you have no idea how to answer some question on the quiz, that's either because you zoned out at some point at this semester or we just never talked about it this semester. So look ultimately to the syllabus and to the lecture slides and scribe notes in particular for guidance. And so if you've not realized this, it's always fascinating at the end of the semester in the cue guide to read that people are unaware to these scribe notes. So we have a wonderful teaching fellow who actually summarizes what goes on in class each day typically with snarky little footnotes which you might enjoy. And so this is meant to be an authoritative set of notes in lieu of your own potentially if you'd rather not so much scribble things down. Perfect! [ Laughter ] [ Inaudible Remark ] >> So realize these two are meant to be a very good guidance through the course and I would strongly urge too, when you do show up at office hours and/or the course review session, honestly you'll be doing yourself a service if you spend at least a little bit of time this weekend taking a past quiz so that you're not being hit with material for the first time. You can rather make better use of your time and ask questions only about this stuff you are forgetting or struggling with. Also, we put online these things. We have been transcribing all of the course's lectures as promised. So that now when you visit past lectures, videos, you will see not just the course's video on the left, you will also see every word that came out of my mouth, for better or for worst, and you will be able to hit play and just to give you a sense of what you too will be able to do by semester's end with a little JavaScript. You can even read what I'm saying in realtime as it highlights as the words come out of my mouth, then even subtitle it if you would prefer to watch in that fashion. So this also means two more compelling that you can search, Control F and so forth, looking for topics like pointers, looking for arrays, things that you might have struggled with. You can actually find that point in the lecture, scroll down to that spot, click on a specific sentence, and the lecture will immediately jump to that point in the class. So realize that's there and we will finish by the weekend's, today's, and Monday's lecture so that you have access to those online before the quiz. So I also got curious as to what words do actually come out of my mouth. And so, I uploaded them to a nice free tool online that creates a visualization of the words that you've pasted in and the bigger the word, the bigger the font, the more times I said it. The smaller the word or if it's not even there, the fewer times I said it. And this for instance was this year's very first lecture. It's kind of curious. I say the word "just" a lot, "actually" a lot, "course" that makes sense, CS50 is a little smaller there. I was worried I was saying Facebook too much the first week but that's actually pretty small at the top right. So that was reassuring. I then fast-forwarded it a few weeks and some themes definitely popped out, "just" again and "gonna," so I didn't realize I sound so intellectual in class. And then I looked at yet another week and like "just" is the theme. So now I'm never gonna be able this word without kind of tripping over myself but apparently that is the most popular takeaway, a word that I say in CS50. So, hi! So one exciting initiative at the university across all of Harvard schools has been working on, you may have read about it at some point in the Crimson is the Harvard Innovation Lab. This is a beautiful new space across the river, right next to HPS that is a few floors to it. The top two are being used by HPS classes. The bottom floor is meant to be an innovation space for undergraduates, GSAS students, HPS students from all across the university who have entrepreneurial ideas, who have projects they wanna work on collaboratively with friends and they just need space to work and they want to be around other smart people doing interesting things, technical people, so as to ask and to answer questions. And so just to give you a sense of the space, it has literally just opened in the past few days when you walk across the river, you'll see a building like this. You go in. It's very modern and high-tech, concrete floors, nice lighting, funky seating, and so forth, and a lot, a lot, a lot of workspace. What we thought we'd do for fun even though it is across the river, so it's a little farther than Leverett and Quincy and [inaudible] and Lowell is for just one week, not next week but the week after for problems set 5, is we've been cordially invited as a class to spend office hours there, Monday, Tuesday, Wednesday, Thursday, on pizza and soda, will be served throughout the evening. The shuttles will run back and forth between campus and it's actually not that far a walk, but just emotionally we'll make sure that you can hop on a shuttle and to actually get there. But it should be fun, the CS50 field trip, to see a couple of a hundred of us working on P set 5 in the Harvard Innovation Lab as its inaugural class. So more on that I'm sure overtime. So we stage on Monday for forensics and focusing on a problem domain, albeit using the same fundamentals of recovering information or covering your tracks when trying to get rid of information to begin to discuss how information is actually stored on something like a hard drive, we actually need to be able to represent the data on a hard drive in actual programatic terms. So just as a flashback, you might recall from week 0 what is actually inside of a hard drive. We'll just watch the first part of this for a few seconds. Recall that this was the story. [ Video Clip ] [ Background Music ] >> A hard drive is where your PC stores most of its permanent data. To do that, the data travels from RAM along with software signals that tell the hard drive how to store that data. The hard drive circuits translate those signals in voltage fluctuations. These in turn control the hard drive's moving parts, some of the few moving parts left in the modern computer. Some of the signals control a motor which spins metal coded platters. Your data is actually stored on these platters. Other signals move the read/write heads to read or write data on the platters. This machinery is so precise that a human hair couldn't even pass between the heads and spinning platers. Yet it all works at terrific speeds. >> So the film, recall, goes on to just discuss--this is gonna be a little messing with your mind. If we can focus the camera on the board here for a moment. So you'll recall that the video on perhaps to show those little blue and red magnetic particles and to reduce the problem of storing information on a hard drive to the orientation of magnetic particles either north south or south north thereby representing 1 or 0 respectively, some system like that. Well, what's really going on in a computer's hard drive is obviously gonna be higher level than that. There's got to be some notion of file name, some notion of file sizes, some notion of folders in which things are. So hard drives are not just zeros and ones, rather they actually have some metadata stored on them, and metadata means data but it's useful for other data you really care about. So let me just draw for instance one of those platters. So this is a metal disk that's inside of a typical hard drive and it spins around generally at thousands of times per minute and all along here are zeros and ones in some orientation. Now, what's really going on there is that clusters of these zeros and ones represent more interesting things. They represent your actual files, like your MP3S and your movies, but also things like the file name and where things are. So it turns out even though this might represent some movie you've downloaded from iTunes, there's a special part of the hard drive reserved for a table, sort of an Excel spreadsheet of sorts that has--to oversimplify, 2 columns. On the left is the name of the file and on the right is the address of the file. So, just like we can say that your RAM can be addressed from byte 0, 1, 2, 3, on up, similarly can your hard drive even though it's a circle be described in the same way. This is byte 0. This is byte 1, 2, 3, 4, and some system along those lines. So, how does your computer remember where data on your hard drive is? It uses this little cheat sheet. So when you create or save or download a file, there's this table in your operating system's memory that says, "Okay, you just downloaded movie.mov," some file name like that, and so the name gets written in the left column, in the right column gets written the address of say the first byte of that movie. Now hard drives are actually pretty fancy. And so, you can get what's called disk fragmentation. If you've got big files, they might not necessarily all end up here. You might get part of your movie here or here or here or here. So, if you've ever heard this term "defragment your hard drive", it's referring to the fact that your file might be spread out. So it's not sufficient just to remember the starting address of a file. It turns out there is usually a list of addresses, part 1 is here, part 2 is here and so forth. But now it's in the world of forensics. What happens when you actually drag a file to the recycle bin or to the trash can on Mac OS or Windows? So you probably figured this much out, right? Like what happens when you just drag it to that special icon? >> You'll forget the address? >> Yeah, and actually not even there, in fact we can rewind further, nothing happens, right? If you drag a file, something sketchy or private, if you wanna delete and you just put it in the recycle bin or the trash can, hopefully by now you figured out that your roommate can just double click your trash can or your recycle bin, right? So you actually have to do what? So, empty the recycle bin or empty the trash can in some way, but what really happens? Well despite years worth of our being trained by computer soci--a computer society that that deleting means deleting, deleting does not mean deleting, right? >> So, deleting means forgetting. So what really happens if you've got some financial data or some, you know, Dear John letter you didn't mean to send or you wanted to delete or something you don't want found, well if you delete it by dragging it to the recycle bin and clicking empty recycle bin or empty trash, all that's happening in this picture is that, the operating system just forgets where it is. Now, anyone who's versed in the art of forensics or anyone who has Goggle can find a program that can then search your hard drive looking for what we'll call signatures of known files. It turns out a lot of files on the internet and in general have signatures which just means they're identifiable by just a few bytes. For instance if you detect this sequence of bytes on a hard disk, FFD8, FFE0 and these are just hexadecimal numbers, if you detect that on a hard drive, that means with very high probability, you have just encountered the start of a JPEG, a JPEG is an image, a photograph. These are very commonly deleted from one's hard drive. And so you probably don't want it to be so easy for someone to just search your entire hard drive with an automated program and say JPEG, JPEG, JPEG, JPEG. But indeed, this is what forensic investigators do. I mean this is what your roommate could be doing. If you see Googles for such tools, you can recover files from a hard drive with very high probability because literally, the bits are still there and just because your computer forgot about them doesn't mean there's not enough hints scattered around your hard drive to recover them. So that then begs the question, "Okay, how do I really delete these files from my hard drive?" Alright, so what do you do or what do you need to do technically, intuitively? Yeah. >> After I delete all my lectures pictures, I can upload a bunch of--I can download a bunch of episodes of The Sopranos. >> Okay. You have a very concrete plan B here. So-- [ Laughter ] >> So after you delete your sketchy photos, you can download some very big Sopranos episodes online and that's actually quite clever. That would have the side effect of overwriting your sketchy photos because even though your computers forgotten that these files are there, as soon as you start downloading a huge amount of content, big movie files like Sopranos episodes with high probability, these zeros and ones, they're not gonna get erased per se but they're gonna get reused. Those magnetic particles are now gonna be part of some Sopranos episode. So now, 50 percent of your sketchy behavior is not gone, right? So if you've ever used something like Norton utilities or any of these programs that undelete files, what--it will sometimes say, "Oh, we can delete this file with 95 percent certainty." Well, that probably means because 5 percent of the zeros and ones have been reused by a Sopranos episode or the like. Now thankfully, there are even easier ways. This is sort of the like, "Oh my god, I really have to cover my tracks download a lot of content." Thankfully, there do exist built-in ways and different operating systems do this better. If you've never noticed, though you've had this probably if you own a Mac for some time under your finder menu, there is empty trash but there's also secure empty trash. So secure empty trash will not only erase what's in this table, it will also overwrite the zeros and ones with it minimally all zeros. And there are Department of Defense standards that actually say, "Well, you can overwrite the bits with zeros and ones randomly, some 7 times, even more times that that." The research literature, there has been no published evidence that simply writing all zeros over your data on modern hard drives is insufficient, so you probably don't need to use Department of Defense standards to actually cover your tracks, whatever it is you're trying to delete. It generally suffices to overwrite zeros and ones. And this isn't a Mac versus PC thing but Apple has been much, much better in recent years making this easy on Windows to my knowledge, there is no such super simple option as that. In fact in Mac OS 2, and it's maybe fine because we're preaching to the choir since most of you, a majority have Macs, you can also encrypt your whole hard drive. The thing called FileVault means all of your data is actually encrypted. So if it's lost or stolen, no one can actually see what your files are, they just look completely random. And there's even, if you're super paranoid, something called secure virtual memory. Long story short, when I mentioned earlier that your computer might slow down sometimes because you're loading lots of programs, well typically a modern operating system will not say like it used to, you have too many programs running. I will not launch Safari again or I will not launch Internet Explorer. Instead what it will do is create the illusion that you don't have maximally 2 gigs of RAM. The computer will pretend that you have 3 gigabytes of RAM. And to create that illusion, it will take one gigabytes worth of programs and files you have open temporarily copy them from RAM to your hard drive somewhere and then voila, you have a gigabyte that you can use for new programs. But which is slower, RAM or hard drives? Well the short answer is a hard drive is generally something mechanical, that this is becoming less and less true, anything mechanical is gonna be slower than anything electronic and RAM is purely electronic. So, one of the reasons your computer slows down when you're doing lots of things is because you're using virtual memory, hard disk space as though it were RAM. But here is the security worry, even if you're good about deleting your browser's cache and you even empty securely your recycle bin or trash can, it doesn't matter because those sketchy photos you had opened in your program might have been temporarily put into virtual memory which means put into some special part of the hard disk that you don't have easy access to delete. So unless you enable something and I think if I go in a Mac system preferences, security, general--yeah. So, it's great out right now but this option here, use secure virtual memory is checked here by default. That means that even your virtual memory is encrypted. So, we'll come back to this in problem set 5, but realize that there is a lot of ways to both cover or accidentally leave uncovered your tracks. But we need now some way of representing structures like this. Up until now, the only kinds of data structures we've had are things like chars and inst and floats and slightly more fancy, arrays. But even an array has just been a contiguous sequence of inst or chars. But what if we actually wanna represent something like a table like this or a file or maybe even more familiar lately, a student. And a student might have a name and an ID or you can imagine any number of real world entities, that'd be nice to kind of represent with your new custom data type. So we can actually do this. So this is a file this week called struct.h. And notice what you can do here. There's a new keyword that we've actually seen before but we're using it now proactively for the first time called typedef and another one called struct. And even though this is slightly new syntax, what this chunk of code means is declare a new type, similar in spirit to ints and float but a custom one, whose structure looks like that. Inside of apparently a struct that's gonna be called students is an integer called ID, a string called name and a string called house. So, I could literally write string but again, I'm trying to take off the training wheels of the CS50 library. But this here means give me a new variable type called student inside of which are 3 things. Now, why is this useful? Well, let me open up this file structs1.c. If I scroll down here, notice a couple of things. One, I'm kind of including some familiar or friendly things 'cause I wanna use GetString and some other stuff, printf and the like. But notice this too. Now that I have my own header file as we've had in some of our own P sets, I have to include my own file. And any time you're including a file you wrote, you use quotes. Any time you're using a file someone else wrote, you use angled brackets, so that's the one subtlety there. I'm apparently using this trick called the constant so that the total number of students in this program is 3. And let's before we look at the code, just look at what this thing is gonna do. So this is structs 1. So, let me go into my code and do make structs 1. Let me go ahead and run structs 1, alright. So, a student's ID is 1, student's name is David, Mather. Student's ID is 2. Let's say this is Rob. This is Kirkland. And 3, this is Matt. This is Kirkland. Okay, that's the only thing I did with this program, right? David didn't matter. So, how did we do this, right? So clearly, we threw away two-thirds of the information we collected here. But, what's actually going on? How did I store these? Now, to take a step back, you could totally implement this program in like week 1, right? You could have 9 variables, student ID 1, student name 1, student house 1, then you can have student ID 2, student house 2, student name 2. And then you could just kind of come up with some arbitrary convention like numbering your variables. But you have 9 of them. And conceptually, this should hopefully start to rub you the wrong way. This is a little inelegant, just to store 3 real world entities like teaching staff, I now need 9 variables and I need to give them all separate names, I kind of like to have a variable called student 1, student 2, student 3, something super simple inside of which are the nitty-gritty details. So that's exactly what we're doing here. Notice if I scroll down to my main program here, notice in my main function, I first declare an array of students. And I can now use jargon that just sounds more natural to me. I want a student data variable. I'm gonna call this a class of students and how many do I want? Well, this is just 3. Remember that we hard coded that up above. So this means give me an array of 3 students and call that array class which is just kind of consistent with the idea. Here is a for loop that iterates from 0 to 3. And then I just use some familiar functions. I use GetInts and GetString and GetString but notice the syntax. This is one new piece of syntax and that's it. To get the Ith students in the class, I do class bracket I. But if I wanna go inside of that student structure and say, edit its ID number, I just say dots or I say dot name or I say dot house. So this is a way of kind of clumping up together multiple variables in int, a char star, a char star but thinking of them and programming them as though they're one bigger entity like a student but still having access to all the nitty-gritty details. >> So, here's how only I was printed. If I iterate then overall of these students again in the array, recall this function, string comparison, so if the Ith students in the class, house equals "Mather" and I check that by checking for equal to equal to 0, remember that's what the string comparison does, the 0 if they're equal. I print out David or whoever is in Mather. And how do I get at David's name? Class bracket I dot name. But there is one thing I have to do now to get into this habit, notice at the very bottom, and I had this in my last example, I call free. Free is the opposite of malloc. Any, any, any time you call malloc, it is up to you and it's expected of you to call free at some point, not right away 'cause that would kind of defeat the purpose but eventually before you actually exit your program or return for main, so what I'm doing here is I'm freeing name and house for all 3 students but I'm not freeing ID. Why? Because what? [ Inaudible Remark ] >> I think you're right, they can't hear you. [ Inaudible Remark ] >> Okay, so I didn't print it but why did I make a conscious decision not to free ID? >> You didn't use malloc. >> I didn't use malloc. It's as simple as that. Because in my H file, notice my header file, because in my header file, I said that a student is an int and a char star and a house. Notice there's no malloc here but I did assign to name and house the return value of what function? GetString, GetString uses malloc. And so here too is this dirty little secret that I admitted to earlier, all this time we've been writing technically buggy programs. But the upside is my god, we didn't have to think about or talk about pointers in the first week, we could just use GetString and get a string from the user. But now, any time you call GetString, which will soon be no more or ultimately call malloc, you have to free memory. Otherwise, your program will "leak" and that generally results in slowdowns and it ultimately involves incorrectness of programs. So I thought I would disclose or emphasize all the more with a concrete example what is and is not possible in popular culture in terms of technical shows like this. So I dug out a 30 or so second clip from an actual TV show. We'll then counterbalance it with what really would happen had they consulted anyone remotely technical before shooting this episode. So, here we go. [ Inaudible Remark ] [ Background Music ] >> Back on. >> What do you see? [ Music ] [ Background Music ] >> Bring the space up, full screen. >> His glasses. >> There's a reflection. [ Music ] [ Noise ] [ Background Music ] >> It's a movie that has baseball team. That's a logo. >> And he's talking to whoever is wearing that jacket. >> Okay. [ Laughter ] >> So, any time you hear someone say, again, can you clean that up? Can you enhance that? You can't. If you see a little glimmer in someone's eye, for instance Rob's-- [ Laughter ] >> And you try to zoom in on that little white spec of some reflection there as I can do with some consumer program, let's call it Photoshop. Let me drag suspect.jpg into Photoshop. I'm going to zoom in on something suspicious here in his eye and I'm gonna zoom and I'm gonna zoom and I'm gonna zoom. Okay, so that glimmer in his eye as well as CSI's eye, 2 pixels of color. This is what happens when you enhance an actual image in reality. So I give you Rob Bowden's [phonetic] eye. This is CS50 end of week 5. We will see you next week. [ Applause ] [ Music ]