[MUSIC PLAYING] DAVID J. MALAN: All right. This is CS50. And this is the start of week 5. And as you may have noticed, some of the material is getting a little more complex, the little denser.
And it's very easy, especially if you've been in the habit for some time, to be trying to scribble down most anything we do, we're saying in class. But realize, that is not perhaps the ideal pedagogical approach to learning this kind of material, and material more generally. And so we are pleased to announce that CS50's own Gheng Gong has begun to prepare a canonical set of notes for the course, the hope of which is that, one, these not only serve as a reference and a resource for reviewing material and going back through material that might have escaped you the first time around, but also so that your heads can be more up than down, when it comes time to lecture, so that you might engage more thoughtfully, as opposed to more scribbly.
With that said, what you'll find on the website is such documents as this. And notice, at top left, there's not only a table of contents, but also time codes that will immediately jump you to the appropriate part in the video online. And what Chang here has done is, essentially, documented what happened in this particular lecture. And many of the lectures are already online now with this URL. And we'll continue to post the remainder of those by the end of this week, so do take advantage of that resource.
So without further ado, we started to peel back the layer that has been string for some time. And what did we say a string actually is last week? So char star. And char star, well, what did that really mean? Well, all this time, if we've been calling a function, like getString, and storing the so-called return value of getString in a variable-- it's called s type string-- we've been writing the line of code up there above. And it's only when I see my handwriting magnified here do I realize just how atrocious this is.
However, let's assume that, on the right-hand side is, nonetheless, a reasonable depiction of what's been going on all this time with getString. getString, of course, gets a string. But what does that really mean? It means it gets a chunk of memory from the operating system by calling a function, called malloc. But more on that later. And then it populates that chunk of memory with the letters the user has typed in, followed by, of course, a null character, or backslash zero at the very end.
Meanwhile, on the left-hand side of this story, all this time, we've been declaring a variable, like s. And that variable is what now will start calling a pointer. It's not a box inside of which we put the string, Daven, per se, but rather we put in that square box on the left what exactly? Yeah?
AUDIENCE: The address of where it's located in memory.
DAVID J. MALAN: Exactly. The address of where Daven is located in memory. And not where all of Daven is located, per se, but specifically the address of what? Yeah?
AUDIENCE: First character.
DAVID J. MALAN: The first character in Daven, which, in this case, I proposed was arbitrarily and unrealistically 1, Ox1, which just means the hexadecimal number of 1. But it's probably going to be a much bigger number that we might draw with a 0x as a prefix, representing a hexadecimal character. And because we don't need to know where the rest of the characters of Daven are, because of what simple design decision that was made many years ago? Yeah?
AUDIENCE: Backslash 0. DAVID J. MALAN: Yeah, exactly. The backslash 0 allows you, albeit in linear time, to traverse the string, walk from left to right, with a for loop, or a while loop, or something like that, and determine, oh, here is the end of this particular string. So with just the address at the beginning of a string, we can access the entirety of it, because all this while, a string has just been a char star.
So it's certainly fine to continue using the CS50 library and this abstraction, so to speak, but we'll begin to see exactly what's been going on underneath this whole time. So you may recall this example, too, from last time, compare 0, which didn't actually compare. But we began to solve this.
But as perhaps a refresher, might I interest someone in a pink elephant today, also made by Chang? How about you in front? [INAUDIBLE]. Come on up.
And in the meantime, as you come up, let's consider for just a moment what this code was actually doing. It's declaring two variables up top, s and t, and calling getString. This isn't a very user-friendly program, because it doesn't tell you what to do. But let's just assume we're focusing on the juicy part. And then we do, if s equals equals t, it should say printf, you typed the same thing. Hello. What's your name?
JANELLE: Janelle. DAVID J. MALAN: Janelle, nice to meet you. So your challenge at hand for this elephant is to first draw us a picture of what's being represented in those first two lines. So s and t might be represented how on the screen? And you can just draw it with your finger on this big screen.
So there's two halves to each side of that equation. So there's s on the left, and then getString on the right. And then there's t on the left, and then getString on the right. So how might we begin drawing a picture that represents what's going on here in memory, would you say? And let me let you explain what you're doing as you go.
JANELLE: OK. Well, first, it would be asking you to get the input string. And it would store-- oh, sorry. DAVID J. MALAN: OK. Good. And this is called what? Oh, OK. Keep going. I didn't mean to interrupt. JANELLE: Sorry. So it would input it into the address of-- not sure. I can't exactly remember the number, but I believe it was starting with 0.
DAVID J. MALAN: That's all right, because I made the numbers up, so there's no right answer.
JANELLE: Starting with the 0 arc.
DAVID J. MALAN: OK, so element 0. Sure.
JANELLE: And then if was like just a two-letter--
DAVID J. MALAN: OK, back to you.
JANELLE: So element 0, and then element 1 or element 2. DAVID J. MALAN: And which piece of the picture are you drawing right now? The call to getString? Or the declaration of s?
JANELLE: The declaration of s, I believe. Oh, the getString, because it would be inputted into each [? area. ?]
DAVID J. MALAN: Good. Exactly. Even though this effectively returns an array, recall, when we get back a string, we can index into that string using 01 and 2. Technically, these are probably represented by individual addresses, but that's fine.
So suppose, if I can just fast forward to where we left off last time, if one of the strings was g a b e, backslash 0, thereby representing gabe's input, how might we represent s now? If this is the memory that's been returned by getString?
JANELLE: Would it be represented by an arc?
DAVID J. MALAN: By an arc? Well, no. Let's just say, pictorially, let me just go ahead and propose that, if this is s, this is the return value of getString. And you've drawn this as 0, 1, 2, which is perfectly reasonable, because we can index into the string, as such. But just to be consistent with last time, let me go ahead and arbitrarily propose that this is address 1, this is address 2, this is address 3, and so forth. And so, just to be super clear, what's going to go in s as a result of that first line of code, would you say?
JANELLE: Address 1?
DAVID J. MALAN: Exactly. So address 0x1. And meanwhile, let me go ahead and duplicate much of what you've done and add my own t here. If I were to type in gabe again, a second time, when prompted with getString, where, of course, is gabe going to go? Well, presumably--
JANELLE: Like on here? DAVID J. MALAN: Yeah. JANELLE: Or it's also in the same boxes? DAVID J. MALAN: Let me propose, yeah, exactly, so in these additional boxes. But what's key now is that, even though I've drawn these pretty close together-- 0x1, this is 0x2-- in reality, this now might be address 0x10, for instance, and 0x11, and 0x12, and so forth. And so, if that's the case, what's going to end up here in t?
JANELLE: 0x10? DAVID J. MALAN: Exactly. So 0x10. And so now, final question. You have, by far, had to work the hardest for an elephant thus far. By now, if I pull up the code again, when I do, in line three, if s equals equals t, what am I actually comparing that we've drawn here?
JANELLE: The two addresses? DAVID J. MALAN: Exactly. So I'm saying is s equal equal to t? In other words, is 1 equal equal to 10? And of course, the obvious answer now is, no. And so this program is ultimately going to print what, would you say?
JANELLE: Would it be, you typed the same thing?
DAVID J. MALAN: So if s is 1 and t is 10?
JANELLE: You typed different things.
DAVID J. MALAN: Exactly. You typed different things. All right. So a round of applause, if we could, here. [APPLAUSE] That was painful. I know. Nicely done. So now let's see if we can't tease apart what the fix was. And of course, when we fixed this-- which I'll now represent in green-- we did a couple of enhancements here. First, just as a sanity check, I'm first checking if s equals null and t equals null. And just to be clear, when might s or t be null in code like this? When might s or t be null. Yeah?
DAVID J. MALAN: Exactly. If the string that the user typed in is way too long to fit into memory, or some weird corner case like that, getString, as we'll see, literally today, in its documentation, says it will return null as a special sentinel value, or just sort of a special symbol that means something went wrong. So we want to check for that, because it turns out that null is a very dangerous value.
Often, if you try to do something with null involving a function-- passing it as input, for instance-- that function might very will crash and, with it, take down your whole program. So this third line now is just a sanity check, error checking, if you will. That's a good habit now for us to get into any time we try to use a value that could, potentially, be null.
Now, in the fourth line here, "if strcmp(s, t)," well, what's that referring to? Well, we said this was a very succinctly named function for string comparison. And its purpose in life is to compare its first argument against it second, but not in terms of their addresses, as we did unintentionally a moment ago with the red code, but rather to compare those two strings in the humanly intuitive way by comparing this, against this, against this, against this, and then stopping if and when one or both of my fingers hits a backslash 0. So someone years ago implemented strcmp to implement for us the functionality that we hoped we would have gotten by just comparing two simple values.
Now frankly, I keep drawing all of these various numbers. But the reality is, I've been making these up the whole time. And so let me just go ahead and scribble these out to make a point that, at the end of the day and moving forward, we're not really going to care about what addresses things are actually in memory. So I'm not going to draw these kinds of numbers so much anymore, I'm just an abstract this away a little more friendly with just arrows.
In other words, if s is a pointer, well, let's just draw it, literally, as a pointer, an arrow pointing from itself to something else, and not worry too much more about the minutia of these addresses which, again, I made up anyway. But we'll see those addresses, sometimes, when debugging code.
Now meanwhile, this program up here fixes, of course, that problem by comparing those two strings. But we ran into another problem. This was from the copy program last time, whereby, I was trying to capitalize just the first character in a string. But what was the symptom we saw last time when a user typed in a value, like gabe in lowercase, for s, then we assigned s into t, as in the third line there, and then I tried to capitalize t bracket 0? What was the effect of changing t bracket 0 here?
AUDIENCE: It changed s.
DAVID J. MALAN: Yeah, I changed s, as well. Because what was really going on? Well, let me see if I can clean up this picture, as follows.
If s is, again, the word g, a, b, e, backslash, 0, and s we'll continue drawing as a box here, but no more addresses. Let's stop making things up. Let's just draw a picture to simplify the world.
When I declare t with string t, that creates that chunk of memory. Square happens to be 32 bits in most computers. In fact, if you've ever heard of a computer having a 32-bit architecture, really fancy-speak, that just means it uses 32-bit addresses. And as a technical aside, if you've ever wondered why older computers, if you actually tried to soup them up with lots of RAM, could only have a maximum of four gigabytes of RAM, well that's because, literally, your old computer could only count as high as 4 billion, 4 billion bytes, because it was using 32-bit numbers for addresses.
But in any case, in this example, story's much simpler. t is just another pointer, or really a char star, aka string. And how do I want to update this picture now with that second line of code, after the dot, dot, dot? When I do string t equals s semicolon, how does this picture change? Yeah?
DAVID J. MALAN: Yeah. Exactly. I just put an arrow from the t box to the same address, the same first letter in gave. Or technically, if this guy were still at 0x1, it's as though I had 0x1 here and 0x1 here. But again, who cares about the addresses? It's just the idea that now matters. So this is what's happening here. So of course, if you do t bracket 0, which is array notation, of course-- and frankly, it looks like there's an array over here, but now there's this weird thing. Know that the programming language, C, offers you this feature, whereby, even if t is a pointer, or s is a pointer, you can still use that familiar, comfortable square bracket notation to go to the first element, or the second element, or any element that that pointer is pointing to because, presumably, it is, as in this case, pointing at some array.
So how do we fix this? Frankly, this is where it got a little overwhelming at first glance. But here is a new and improved version.
So first, I'm getting rid of the CS50 library, just to expose that s is indeed a char star, just a synonym. And t is also a char star. But what is going on on the right-hand side of that line where t is assigned a value?
What is malloc? What it's strlen? What is sizeof(char)? Why the heck does this line look so complex? What's it doing at a high level? What's it storing in t? Yeah? AUDIENCE: It's allocating a certain amount of memory space. It's to store, I guess, letters [INAUDIBLE].
DAVID J. MALAN: Perfect. Perfect. It's allocating a certain amount of memory space to store, presumably, future letters. And in particular, malloc is therefore returning what?
AUDIENCE: Returning the [INAUDIBLE]? DAVID J. MALAN: Exactly. Returning the address of that memory, which is a fancy way of saying, returns the address of the first byte of that memory. The onus is on me to remember how much memory I actually allocated or asked malloc for.
Now how much is that? Well, even though there's a lot of parentheses here, malloc takes just a single argument. And I'm specifying strlen of s, so give me as many bytes as there are in s, but add one. Why? Yeah?
AUDIENCE: The backslash 0. DAVID J. MALAN: Exactly. We've got to do a little housekeeping. So because there's a backslash 0, we'd better remember that. Otherwise, we're going to create a string that doesn't have that special terminator.
Meanwhile, just to be super anal, I have sizeof(char), just in case someone runs my code not on the CS50 appliance, but maybe a different computer altogether where chars are one byte, by convention, but two bytes, or something bigger than that. It's just to be super, super averse to errors. Even though, in reality, it's most likely going to be a 1.
Now, meanwhile, I go ahead and copy the string, t bracket i equals t bracket s. And I will defer to last week's source code to see what's going on. But the key takeaway, and the reason I put the code now in green, is because that very last line, t bracket 0 equals toupper, has the effect of capitalizing which string? t and/or s? That last line of code.
Just t, because what's happened this time, if I slightly undo that last step, what's happened is, when I call malloc, I essentially get a chunk of memory that is the same size as the original, because that's the arithmetic I did. I'm storing in t the address of that chunk of memory. Even though this looks nice and pretty, nice and blank, the reality is there's, what we'll keep calling, garbage values in here. That chunk of memory might very well have been used before, a few seconds, a few minutes ago. So there could absolutely be numbers or letters there, just by accident. But they're not valid, until I myself populate this chunk of memory with actual chars, as I do in that for loop there. All right?
So now, the climax of these three examples that were seemingly broken last time, this Swap example, this function worked in the sense that it swapped a and b. But it didn't work in what other sense? Yeah?
DAVID J. MALAN: Exactly. If I were to call this function from another-- for instance, from a function like main, where I have a variable, x and y, as I did last week, same code, and I pass in x and y to Swap, and then call Swap-- this, of course, is the correct version is what we're about to see-- it did not work. So what is the fix?
Well, so just to be clear, let me go ahead and-- give me one second here, and see if I can show you the last one, which will be in-- let's see if I can find this real fast-- OK, [INAUDIBLE]. OK, there it is. So ignore the commands I'm just typing. I want it to retrieve at the last minute an example from last time, which is now called no Swap.
So no Swap is where we left off last time, whereby, I initialized x to 1 and y to 2. I then call Swap, passing in 1 and 2. And then this function worked in some sense, but it had no permanent effect on x and y. So the question at hand is, how now do we actually fix this problem? What is the solution at hand?
Well, in swap.c, which is new today, notice a couple of differences. x and y are the same. But what is clearly different about line 25? What's new there, if you remember what it looked like a second ago?
DAVID J. MALAN: Yeah. So the ampersands are a new piece of syntax not only in this program, but also more generally in CS50. To date, I don't think we've seen any examples or really talked about them in any detail, other than, maybe, preemptively in section, an ampersand like this. Well, it turns out ampersand is one of the last pieces of new syntax we're going to learn. All it means is the address of some variable. At what address does x live? But what address does y live? Because if the fundamental problem before was that x and y were being passed as copies, what we really want to do is provide Swap with like a treasure map that leads to where x and y actually are in RAM, so that Swap can follow that map and go to wherever x or y marks the spot and change the actual values 1 and 2 there.
So Swap needs to change slightly too. And at first glance, this might seem a little similar to char star. And indeed it is. So a is a pointer to what type of data, based on this highlighted portion? So it's an int.
So a is no longer an int, it's the address of an int. And similarly, b is now going to be the address of an int. So when I now call Swap from Main, I'm not going to give Swap 1 and 2. I'm going to give it like Ox-something and Ox-something, two addresses that will lead Swap to their actual locations in my computer's memory.
So now, my remaining implementation needs to change a tad. What's obviously different now in these three lines of code? There's these damn stars all over the place, all right? So what's going on here? Yeah?
AUDIENCE: It's obviously [INAUDIBLE].
DAVID J. MALAN: Exactly. So in this context-- and this was not the best design decision, admittedly, years ago. In this context, where you just have a star, and you don't have a data type, like int, immediately to the left, instead you have an equal sign, clearly, in this context, when you say star a, that means go to the address that's in a. Follow the treasure map, so to speak.
And meanwhile, in line 37, it means the same thing. Go to the address a, and put what there? Whatever is at the location that b specifies. In other words, go to b. Get that value. Go to a and, per the equal sign, the assignment operator, put that value there.
Similarly, int temp is just an int. Nothing needs to change about temp. It's just a spare glass from Annenberg for some milk or orange juice. But I do need to say, go to b. Go to that destination and put the value in temp there. So what's happening then? When I actually call Swap this time, if this first tray here represents Main, this second tray represents Swap, when I pass ampersand x and ampersand y from Main to Swap, just to be clear, what is this stack frame receiving? Yeah?
AUDIENCE: [INAUDIBLE]. DAVID J. MALAN: Exactly. The address of x and the address of y. And you can think of these like postal addresses. 33 Oxford Street and 35 Oxford Street, and you want to move the two buildings that are at those locations.
It's sort of a ridiculous idea, but that's all we mean by address. Where in the world can you find those two ints? Where in the world can you find those two buildings? So if finally, after all this time I go into today's source code and compile Swap and run ./swap, finally, for the first time do we actually see that my values have indeed been swapped successfully. And now, we can even take note of this in, say, gdb.
So let me go into the same file. Let me go ahead and run gdb of ./swap. And now, in Swap, I'm going to go ahead and set a break point in Main. And now I'm going to go ahead and run the program. And now we see my code paused at that line.
If I go ahead and print x, what should I see here? It's a question. Say again?
DAVID J. MALAN: So random numbers, maybe. Maybe I get lucky, and it's nice and simple, like 0. But maybe it's some random number. In this case, I got lucky. It just happens to be 0. But it is indeed luck, because not until I type next and then print x has that line of code, line 19, been executed.
Meanwhile, if I type next again, and now print out y, I'm going to see 2. Now, if I type next, it's going to get a little confusing, because now, the printf is going to appear on the screen, as it did. x is 1.
Let's do this again. And now, here's where things get interesting. Before I call Swap or even step into it, let's take a little peek. x is, again, 1. Y is, of course, quick sanity check, 2, so not hard there. But what is ampersand x? Answer, it's kind of funky looking. But the int star in parentheses is just gdp's way of saying this is an address. It's not an int, it's a pointer to an int, or otherwise known as an address.
What is this crazy thing? We've never seen something quite like that before. So this is the address in my computer's memory of where x happens to live. It's Ox-something. And this is, frankly, why I've started drawing arrows, instead of numbers, because who really cares that your int is at a particular address that's that big. But bffff0c4, these are all indeed hexadecimal digits, which are 0 through f.
So we're not going to dwell too long on what those things are. But if I print out y, of course, I see 2. But ampersand y, I see this address. And notice, for the curious, how far apart are x and y? You can ignore most of the address. Four bytes. And that's consistent with our earlier claim that how big is an int? Four bytes. So it looks like everything's lining up nicely, as you might hope, in memory.
So now, let's just fast forward to the end of this story. Let's go ahead and type step, to dive into the Swap function. Now notice, if I type a, it's identical to the address of x. If I type b, it's identical to the address of y. So what should I see if I say, go to the address a? So print star a. So star means go there, in this context. Ampersand means what's the address of. So star a means 1. And print star b gives me 2.
And let me assume, for the moment, that at least the code that proceeds to execute now can be reasoned through in that way. But we'll revisit this idea before long. So this version of Swap is now correct and allows us to swap this particular data type.
So any questions then on Swap? On star? On address of? And you'll see, with problem set 4, sort of, but problem set 5, definitely, how these things are useful and get much more comfortable with them, as a result. Anything at all? All right. So malloc is, again, this function that just allocates memory, memory allocation. And why is this useful? Well, all this time, you've been using malloc. If you consider now how getString works, presumably, it's been asking someone for a chunk of memory, anytime the user types a string in, because we certainly didn't know, as CS50 staff, how big those strings that humans are going to type might be.
So let's, for the first time, start to peel back how the CS50 library works, by way of a couple of examples that will lead us there. So if I open up gedit and open up scanf 0, we're going to see the following code. Scanf 0, available on the website for today, has relatively few lines of code here, 14 through 20. And let's see what it's doing. It declares an int, called x. It says something like, number please. And now it says, scanf %i, &x. So there's a bunch of new stuff there.
But scanf, you can kind of think of as the opposite of printf. printf, of course, prints to the screen. scanf sort of scans from the user's keyboard something he or she has typed.
%i is just like printf. This means expect the user to type an int. And now, why do you think I might be passing scanf &x? If the purpose in life of scanf is to get something from the user, what is the meaning of passing it, &x, now? Yeah?
AUDIENCE: [INAUDIBLE]. DAVID J. MALAN: Exactly. Whatever I, the human, type in, my input is going to be saved at that location. It's not sufficient, recall, to just pass in x, because we've seen already, any time you pass just a raw variable, like an int, to some other function, sure, it can change that variable, but not permanently. It can't have an effect on Main. It can only change its own local copy. But if, instead, you don't give me the actual int, but you give me directions to that int, I now, being scanf, surely, I can follow that address and put a number there so you have access to it as well.
So when I run this program, let's see. Make scanf 0 dot slash, scanf 0. And if I now type a number like 50, thanks for the 50. If I now type a number like negative 1, for the negative 1. I now type a number like 1.5, hm. Why did my program ignore me? Well, because simply, I told it to expect an int only. All right. So that's one version of this. Let's take things up a notch and propose that this is not good. And herein lies a very simple example of how we can start writing code that other people can exploit or compromise by doing bad things. So line 16, so similar in spirit to before, but I'm not declaring it int this time. I'm declaring it char star, aka string.
But what does that really mean? So if I don't specify an address-- and I'm calling it arbitrarily, buffer, but I could call it s, to be simple-- and then I do this, explain to me, if you could, based on the previous logic, what is scanf doing in line 18, if pass %s and buffer, which is an address? What is scanf, if you apply the exact same logic as version 0, going to try to do here, when the user types something in? Yeah?
DAVID J. MALAN: Exactly. Scanf, by the logic earlier, is going to take the string that the human typed in-- it's now a string, it's not a number, presumably, if he or she cooperates-- and it's going to try to put that string in memory at whatever address buffer specifies. And this is great, because buffer is indeed meant to be an address.
But I claim this program is buggy in a very serious way, because what value is buffer by default? What have I initialized into? What chunk of memory? I haven't, right?
So even though I've allocated a char star that's no longer called s, it's instead called, buffer-- so let's draw the variable's name now as buffer-- if I haven't called getString or malloc here, that effectively means that buffer is just some garbage value.
Now what does that mean? It means that I have told scanf to expect a string from the user. And you know what? Whatever this thing is pointing to-- and I draw question mark, but in reality, it's going to be something like Ox1, 2, 3, right? It's some bogus value that just happens to be there from before. So put another way, it's as though buffer is just pointing to something in memory. I have no idea what.
So if I type in gabe now, it's going to try to put g-a-b-e /0 there. But who knows what that is? And in the past, any time we've tried to touch memory that doesn't belong to us, what has happened? Or almost every time. Segmentation fault, right?
This arrow, I have no idea where it's pointing. it's just some random value. And of course, if you interpret a random value as an address, you're going to go to some random destination. So gabe might indeed crash my program in this case here.
So what can we do that's almost as bad? Consider this third and final example of scanf. This version is better in what sense? If you are comfortable with the previous problem, this is better. Why?
AUDIENCE: [INAUDIBLE]. DAVID J. MALAN: Good. So this case of line 16 is better, in the sense that we're explicitly allocating some memory. We're not using malloc, we're using the week 2 approach of just declaring an array. And we've said before that a string is just an array of characters, so this is totally legitimate. But it's, of course, as you note, fixed size, 16.
So this program is totally safe, if I type in one character strings, two character strings, 15 character strings. But as soon as I start typing 16, 17, 18, 1,000 character strings, where is that string going to end up? It's going to end up partly here. But then who knows what else is beyond the boundaries of this particular array?
It's as though I've declared 16 boxes here. So rather than draw out all 16, we'll just pretend that I've drawn 16. But if I then try to read a string that's much longer, like 50 characters, I'm going to start putting a, b, c, d, x, y, z. And this is presumably some other memory segment that, again, might cause my program to crash, because I've not asked for anything more than just 16 bytes.
So who cares? Well, here's the CS50 library. And most of this is just like instructions up top. The CS50 library, all this time, has had this line in line 52. We've seen typedef, or you will see typedef in pset 4, which just creates a synonym whereby char star can be more simply referred to as string. So this is one of the few training wheels we've used secretly underneath the hood.
Meanwhile, here's the function, getchar. Now apparently, there's no body to it. And in fact, if I keep scrolling, I don't actually see any implementations of these functions. As a sanity check, why is that?
AUDIENCE: [INAUDIBLE]. DAVID J. MALAN: Yeah. So this is the header file. And header files contain prototypes, plus some other stuff, it seems, like typedefs. But in CS50.c, which we've never given you outright, but has been in the CS50 appliance all this time, deep inside of its folders, notice that there's a whole bunch of functions in here.
In fact, let's scroll down. Let's ignore most of them, for now. But scroll down to getInt and see how getInt works. So here is getInt. And if you ever really cared how get int works, here is its documentation. And among the things it says is it tells you what the ranges of values it can return. It's essentially negative 2 billion to positive 2 billion, give or take.
And it turns out, all this time, even though we've never had you check for it, if something goes wrong, it turns out that all this time, getInt has been returning a special constant, not null, but rather int_max, which is just a programmer's convention. It means here is a special value. Make sure to check for this, just in case something goes wrong. But we've never bothered with that to date, because again, this is meant to simplify.
But how does getInt get implemented? Well, one, it takes no arguments. We know that. It returns an int. We know that. So how does it work underneath the hood?
So there's apparently an infinite loop, at least the appearance of one. Notice that we're using getString. So that's interesting. getInt calls our own function, getString. And now why might this be the case? Why am I being defensive here in line 165? What could happen in line 164, just to be clear? It's the same answer as before. Might just be out of memory. Something goes wrong with getString, we've got to be able to handle that. And the reason I don't return null is that, technically, null is a pointer. getInt has to return an int. So I've arbitrarily decided, essentially, that 2 billion, give or take, is going to be a special value that I can never actually get from the user. It's just the one value I'm going to waste to represent an error code.
So now, things get a little fancy. And it's not quite the same function as before, but it's very similar. So notice, I declare here, in line 172, both an int n and a char c. And then I use this funky line, sscanf, which it turns out doesn't scan a string from the keyboard. It stands an existing string that the user has already typed in. So I already called getString, which means I have a string in memory. sscanf is what you'd call a parsing function. It looks at the string I've typed in, character by character, and does something useful. That string is stored in line. And I know that only by going back up here and saying, oh, OK, I called it not s this time, but line.
And now this is a little different. But this effectively means, for reasons we'll somewhat wave our hands at today, that we are checking to see if the user typed in and int and maybe another character. If the user typed in an int, it's going to be stored in n, because I'm passing this by address, the new trick we've seen today. If the user also typed in like 123x, that x is going to end up a letter in character c.
Now it turns out that sscanf will tell me, intelligently, how many variables was sscanf successfully able to fill. So by this logic, if the function I'm implementing is getInt, but I'm checking, potentially, for the user to have typed in an int followed by something else, what do I want sscanf's return value truly to be? If the purpose is to get just an int from the user?
So if sscanf returns 2, what does that mean? The user typed in something like, literally, 123x, which is just nonsense. It's an error condition, and I want to check for that.
So if the user types this in, by this logic, what does sscanf return, would you say? So it's going to return 2, because the 123 is going to go in here, and the x is going to end up in here. But I don't want the x to get filled. I want to sscanf to only succeed in filling the first of its variables. And so that's why I want sscanf to return 1.
And if this is a bit over the head for the moment, that's totally fine. Realize though, that one of the values of getInt and getString is that we're doing a heck of a lot of error checking like this so that, to date, you can pretty much type anything at your keyboard, and we will catch it. And we certainly, the staff, will definitely not be the source of a bug in your program, because we're defensively checking for all of the stupid things that a user might do, like typing a string, when you really wanted int. So for now-- we'll come back to this before long-- but all this time, getString and getInt have been underneath the hood using this basic idea of addresses of memory.
So now, let's make things a little more user-friendly. As you may recall, from Binky last time-- if my mouse will cooperate-- so we had this code, which frankly, is fairly nonsensical. This code achieves nothing useful, but it was the example that professor Parlante used in order to represent what was going on in a program involving memory.
So let's retell this story super briefly. These first two lines, in English, do what, would you say? Just in reasonably human, but slightly technical terms, take a stab. AUDIENCE: [INAUDIBLE].
DAVID J. MALAN: OK, you're establishing addresses for your x and y variables. Not quite, because x and y are not variables in the traditional sense. x and y are addresses or will store address. So let's try this once more. Not a bad start, though. Yeah?
AUDIENCE: [INAUDIBLE]. DAVID J. MALAN: Good. I think that's a little cleaner. Declaring two pointers, two integers. And we're calling them x and y. Or if we were to draw this as a picture, again, recall quite simply that all we're doing with that first line is drawing a box like this, with some garbage value in it, and calling it x, and then another box like this, with some garbage value in it, calling it y. We've declared two pointers that ultimately will store the address of an int. So that's all there.
So when Binky did this, the clay just looked like this. And Nick just kind of wrapped up the arrows, as though they're not pointing anywhere in particular, because they're just garbage values. They're not explicitly initialized anywhere in particular.
Now the next line of code, recall, was this. So in reasonably user-friendly, but somewhat technical English, what is this line of code doing? Yeah?
DAVID J. MALAN: Perfect. It's allocating the chunk of the memory that's the size of an int. And that's half the answer. You answered the right half of the expression. What is happening on the left-hand side of the equal sign? Yeah? AUDIENCE: And assigns it to the variable x?
DAVID J. MALAN: And assigns it to the variable x. So to recap, right-hand side allocates enough memory to store an int. But malloc specifically returns the address of that chunk of memory, which you've just proposed gets stored in x.
So what Nick did last time with Binky is he dragged that pointer out, the clay, to point now at a white chunk of memory that is equal to the size of an int. And indeed, that's meant to represent four bytes.
Now, the next line of code did this, star x gets 42. So 42 is straightforward on the right-hand side, meaning of life. Left-hand side, star x means what? That too might have gone-- that's OK. OK.
AUDIENCE: Basically, go to the [INAUDIBLE] DAVID J. MALAN: Good. AUDIENCE: [INAUDIBLE]. DAVID J. MALAN: Exactly. Left-hand side means go to x. x is address. It's like 33 Oxford Street, or Ox1. And star x means go to that address and put what there? 42.
So indeed, that's exactly what Nick did. He started with by, essentially, mentally pointing a finger at x, following the arrow to the white box on the right-hand side, and putting the number 42 there. But then things got a little dangerous, right? Binky's about to lose his head.
Star y equals 13, bad luck, means what? So star y means go to the address in y. But what is the address in y? All right, it's garbage value, right? I drew it as a question mark. Nick drew it as a curled up arrow. And as soon as you try to do star y, saying go there, but there is not a legitimate address, it's some bogus location, the program's going to crash. And Binky's head is going to fly off here, as it did.
So in the end, this program was just flat out flaw. It was a buggy program. And it needed to be fixed. And the only way, really, to fix it would be, for instance, this line, which we didn't even get to, because the program crashed too soon. But if we were to fix this, what effect does doing y equal x have? Well, it essentially points y at whatever value x is pointing at.
So in Nick's story, or Binky's story, both x and y were pointing at the white chunk of memory, so that, finally, when you do star y equals 13 again, you end up putting 13 in the appropriate location. So all of these lines are perfectly legitimate, except for this one, when it happened before you actually assigned y some value.
Now thankfully, you don't have to reason through all of these kinds of issues on your own. Let me go ahead and open up a terminal window here and open up, for just a moment, a super short program that also is sort of pointless. It's ugly. It doesn't achieve anything useful. But it does demonstrate issues of memory, so let's take a look.
Main, super simple. It apparently calls a function, f, and then returns 0. It's kind of hard to mess this up. So Main is pretty good, so far.
So f is problematic. And just didn't put much effort into naming it here, to keep the focus on the code. f has two lines. And let's see what's now going on. So on the one hand here-- and let me make this consistent with the previous example-- on the one hand, the left-hand side is doing what, in English? It is-- AUDIENCE: Creating a pointer. DAVID J. MALAN: Creating a pointer to an int and calling it x. So it's creating one of those boxes I keep drawing on the touch screen. And now, on the right-hand side, malloc, of course, is allocating a chunk of memory. And just to be clear, how much memory is it apparently allocating, if you just kind of do the math here?
So it's 40 bytes. And I know that only because I know an int, on the CS50 appliance, at least, is four bytes. So 10 times 4 is 40. So this is storing an x, the address of the first out of 40 ints that have been allocated space back, to back, to back, to back.
And that's what's key about malloc. It doesn't take a little memory here, a little here, a little here. It gives you one chunk of memory, contiguously, from the operating system.
Now what about this, x bracket 10 equals 0? Arbitrary line of code. It doesn't achieve anything useful. But it is interesting, because x bracket 10--? Yeah?
DAVID J. MALAN: x bracket 10 doesn't have to be null. The null detail only comes into play with strings, at the end of a string. But a good thought.
How big is this array, even though I've allocated 40 bytes? It's 0 through nine, right? It's 10 ints, total. 40 bytes, but 10 ints, indexed 0 through 0.
So what is that x bracket 10? It's actually some unknown garbage value. It's memory that doesn't belong to me. I should not be touching that byte number 41, 42, 43, 44. I'm going slightly too far.
And indeed, if I run this program, it might very well crash. But sometimes, we'll get lucky. And so just to demonstrate this-- and frankly, you never know before you do it-- let's run this. It didn't actually crash.
But if I change this, for instance, to be like 1,000, to make this really deliberate, let's see if we can get it to crash this time. OK, it didn't crash. How about 100,000? Let's remake it, and now rerun it. OK. Phew. All right. So apparently, again, these segments of memory, so to speak, are reasonably big, so we can get lucky again and again. But eventually, once you get ridiculous and really go far out on the screen, you touch memory that really, really doesn't belong to you.
But frankly, these kinds of bugs are going to be harder and harder to figure out on your own. But thankfully, as programmers, we have tools that allow us to do this for us. So this is, perhaps, one of the ugliest programs, even uglier than gdb's output. But it always has a line or two that are super useful.
Valgrind is a program that helps you not debug a program, per se, but find memory-related problems, specifically. It will automatically run your code for you and look for at least two things. One, did you do something accidental like touch memory that didn't belong to you? It will help you find those cases.
And two, it will help you find something called memory leaks, which we have completely ignored, naively, for some time and blissfully. But it turns out, all this time, whenever you've called getString in so many of our programs, you're asking the operating system for memory, but you have any recollection of ever giving it back, doing unalloc, or free, as it's called. No, because we've never asked you to do so.
But all this time, the programs you've been writing in C have been leaking memory, asking the operating system for more and more memory for strings and whatnot, but never handing it back. And now this is a bit of a oversimplification, but if you've ever run your Mac or your PC for quite some time, opening lots of programs, maybe closing programs, and even though your computer hasn't crashed, it's getting so much slower, as though it's really using a lot of memory or resources, even though, if you're not even touching the keyboard, that could be-- but not always-- could be that the programs you're running have themselves memory leaks. And they keep asking the OS for more and more memory, but forgetting about it, not actually using it, but therefore taking memory away from other programs that might want it. So that's a common explanation. Now here's where Valgrind's output is completely atrocious to those less and more comfortable alike. But the interesting stuff is right up here. It is telling me an invalid write of size four happens in this program, in particular, at line 21 of memory.c.
If I go to line 21, hm, there indeed is an invalid write of size four. Why size four? Well, this number-- and it could be anything-- is an int. So it's four bytes. So I'm putting four bytes where they don't belong. That's what Valgrind is actually telling me. Moreover, it will also tell me, as we'll see, as you run this in a future pset, if and when you've leaked memory, which indeed I have, because I've called malloc, but I haven't actually called, in this case, free, which we'll eventually see is the opposite of malloc.
So now, I think, a final example. So this one's a little more arcane, but it's perhaps the biggest reason to be careful with memory, and the reason that many programs and/or web servers, even to this day, are taken over by bad guys somewhere on the internet who are somehow sending bogus packets to your server trying to compromise your accounts, or take your data, or just generally take over a machine. Buffer overflow, as the name suggests, means overflowing not an int, but a buffer. And a buffer is just a fancy way of saying it's a bunch of memory.
And indeed, I called a string before buffer, instead of s. Because if it's a buffer, like in the YouTube sense, or any time you're watching a video, you might have seen the word buffering, dot, dot, dot. It's incredibly annoying. And that just means that your video player is trying to download lots of bytes, lots of bytes from a video from the internet. But it's slow, so it's trying to download a bunch of them to fill a buffer, a container, so that you have enough bytes that it can then show you the video, without pausing constantly. But it turns out, you can have a buffer to this big. But try to put this much data in it, and very bad things can happen. So for instance, let's look at this final teaser of an example. This is another program that, at first glance, doesn't do anything super useful. It's got a Main function that calls that function, f. And that function, f, up here, has a char array, called c, of size 12. And then it's using this new function called strncpy.
It turns out that, with this simple, simple line of code, just two lines, we have made my entire program, and therefore, my entire computer, and my user account, and my hard drive potentially vulnerable to anyone who knows and is good enough to run this program with a certain command line argument. In other words, if this bad guy puts inside of argvargv by typing at the keyboard a very specially crafted string, not abc, 123, but essentially, binary symbols that represent executable code, a program that he or she wrote, with this simple program, which is representative of thousands of programs that are similarly vulnerable, daresay, he or she can ultimately delete all the files on my hard drive, get a blinking prompt so that he or she can type commands on their own, email all files to myself. Anything that I can do, he or she can do with this code.
We won't quite solve this yet. And in fact, it's going to involve a little picture like this, which we'll soon come to understand all the better. But for today, let's end on what's, hopefully, a slightly more understandable XKCD joke, until we resume next time. All right. See you on Wednesday.
SPEAKER: And now, deep thoughts, by Daven Farnham. Memory is like jumping into a pile of golden leaves on a Sunday afternoon. Wind blowing, tossing your hair-- oh, I miss the days when--