1 00:00:00,000 --> 00:00:02,860 [Week 5] 2 00:00:02,860 --> 00:00:04,860 [David J. Malan - Harvard University] 3 00:00:04,860 --> 00:00:07,260 [This is CS50. - CS50.TV] 4 00:00:07,260 --> 00:00:09,740 >> This is CS50, Week 5. 5 00:00:09,740 --> 00:00:12,900 Today and this week, we introduce a little bit of the world of forensics 6 00:00:12,900 --> 00:00:14,850 in the context of the Problem Set 4. 7 00:00:14,850 --> 00:00:18,480 Today will be an abbreviated lecture because there's a special event in here afterwards. 8 00:00:18,480 --> 00:00:21,940 So we'll take a peek and tease both students and parents alike today 9 00:00:21,940 --> 00:00:24,600 with some of the things that are on the horizon. 10 00:00:24,600 --> 00:00:29,050 >> Among them, as of Monday, you will have a few more classmates. 11 00:00:29,050 --> 00:00:32,980 edX, Harvard and MIT's new online initiative for OpenCourseWare and more, 12 00:00:32,980 --> 00:00:36,730 is launching on Harvard's campus on Monday, which means come Monday 13 00:00:36,730 --> 00:00:40,930 you will have, as of last count, 86,000 additional classmates 14 00:00:40,930 --> 00:00:43,680 who will be following along with CS50's lectures and sections 15 00:00:43,680 --> 00:00:45,890 and walkthroughs and problem sets. 16 00:00:45,890 --> 00:00:51,870 And as part of this, you will become members of the inaugural class of CS50 and now CS50x. 17 00:00:51,870 --> 00:00:56,150 As part of this now, realize that there will be some upsides as well. 18 00:00:56,150 --> 00:01:00,620 To get ready for this, for the massive number of students, 19 00:01:00,620 --> 00:01:03,820 suffice it to say that even though we have 108 TFs and CAs, 20 00:01:03,820 --> 00:01:07,560 it's not quite the best student-teacher ratio once we hit 80,000 of the students. 21 00:01:07,560 --> 00:01:09,830 We're not going to be grading so many problem sets manually, 22 00:01:09,830 --> 00:01:13,050 so introduced this week in the problem set will be CS50 Check, 23 00:01:13,050 --> 00:01:15,410 which is going to be a command-line utility within the appliance 24 00:01:15,410 --> 00:01:17,880 that you'll get once you update it later this weekend. 25 00:01:17,880 --> 00:01:21,030 You'll be able to run a command, check50, on your own pset, 26 00:01:21,030 --> 00:01:24,770 and you'll get instant feedback as to whether your program is correct or incorrect 27 00:01:24,770 --> 00:01:27,980 according to various design specifications that we have provided. 28 00:01:27,980 --> 00:01:30,310 More on that in the problem set specification. 29 00:01:30,310 --> 00:01:34,220 The CS50x classmates will be using this as well. 30 00:01:34,220 --> 00:01:36,170 >> Problem Set 4 is all about forensics, 31 00:01:36,170 --> 00:01:38,630 and this pset was really inspired by some real-life stuff 32 00:01:38,630 --> 00:01:41,210 whereby when I was in graduate school I interned for a while 33 00:01:41,210 --> 00:01:45,270 with Middlesex County's District Attorney's office doing forensic work 34 00:01:45,270 --> 00:01:47,660 with their lead forensic investigator. 35 00:01:47,660 --> 00:01:50,280 What this amounted to, as I think I mentioned a few weeks past, 36 00:01:50,280 --> 00:01:52,720 is the Mass State Police or others would come in, 37 00:01:52,720 --> 00:01:56,150 they would drop off things like hard drives and CDs and floppy disks and the like, 38 00:01:56,150 --> 00:01:58,770 and then the goal of the forensics office was to ascertain 39 00:01:58,770 --> 00:02:01,470 whether there was or was not evidence of some sort. 40 00:02:01,470 --> 00:02:04,730 This was the Special Investigations Unit, so it was white-collar crime. 41 00:02:04,730 --> 00:02:10,949 It was more troubling sort of crimes, anything involving some kind of digital media. 42 00:02:10,949 --> 00:02:16,450 It turns out that not that many people write an email saying, "I did it." 43 00:02:16,450 --> 00:02:20,490 So quite often, these forensic searches did not turn up all that much fruit, 44 00:02:20,490 --> 00:02:22,820 but sometimes people would write such emails. 45 00:02:22,820 --> 00:02:25,240 So sometimes, the efforts were rewarded. 46 00:02:25,240 --> 00:02:31,210 >> But to lead up to this forensic pset, we'll be introducing in pset4 a bit of graphics. 47 00:02:31,210 --> 00:02:35,410 You probably take these things for granted--JPEGs, GIFs, and the like--these days. 48 00:02:35,410 --> 00:02:38,320 But if you really think about it, an image, much like Rob's face, 49 00:02:38,320 --> 00:02:41,270 could be modeled as a sequence of dots or pixels. 50 00:02:41,270 --> 00:02:43,380 In the case of Rob's face, there's all sorts of colors, 51 00:02:43,380 --> 00:02:46,760 and we started to see the individual dots, otherwise known as pixels, 52 00:02:46,760 --> 00:02:48,610 once we started to zoom in. 53 00:02:48,610 --> 00:02:54,660 But if we simplify the world a bit and just say that this here is Rob in black and white, 54 00:02:54,660 --> 00:02:57,490 to represent black and white, we can just use binary. 55 00:02:57,490 --> 00:03:01,660 And if we're going to use binary, 1 or 0, we can express this same image 56 00:03:01,660 --> 00:03:06,140 of Rob's smiling face with this pattern of bits. 57 00:03:06,140 --> 00:03:12,100 11000011 represents white, white, black, black, black, black, white, white. 58 00:03:12,100 --> 00:03:16,150 And so it's not a huge leap then to start talking about colorful photographs, 59 00:03:16,150 --> 00:03:18,600 things that you'd see on Facebook or take with a digital camera. 60 00:03:18,600 --> 00:03:21,410 But certainly when it comes to colors, you need more bits. 61 00:03:21,410 --> 00:03:25,690 And quite common in the world of photographs is to use not 1-bit color, 62 00:03:25,690 --> 00:03:29,560 as this suggests, but 24-bit color, where you actually get millions of colors. 63 00:03:29,560 --> 00:03:32,250 So as in the case when we zoomed in on Rob's eye, 64 00:03:32,250 --> 00:03:36,370 that was any number of millions of different colorful possibilities. 65 00:03:36,370 --> 00:03:39,040 So we'll introduce this in Problem Set 4 as well as in the walkthrough, 66 00:03:39,040 --> 00:03:43,370 which will be today at 3:30 instead of the usual 2:30 because of Friday's lecture here. 67 00:03:43,370 --> 00:03:46,620 But video will be online as usual tomorrow. 68 00:03:46,620 --> 00:03:48,820 >> We'll also introduce you to another file format. 69 00:03:48,820 --> 00:03:51,270 This is deliberately meant to look intimidating at first, 70 00:03:51,270 --> 00:03:55,670 but this is just some documentation for a C struct. 71 00:03:55,670 --> 00:03:58,940 It turns out that Microsoft years ago helped popularize this format 72 00:03:58,940 --> 00:04:05,150 called the bitmap file format, bmp, and this was a super simple, colorful graphical file format 73 00:04:05,150 --> 00:04:10,150 that was used for quite some time and sometimes still for wallpapers on desktops. 74 00:04:10,150 --> 00:04:14,760 If you think back to Windows XP and the rolling hills and the blue sky, 75 00:04:14,760 --> 00:04:17,170 that was typically a bmp or bitmap image. 76 00:04:17,170 --> 00:04:19,959 Bitmaps are fun for us because they have a bit more complexity. 77 00:04:19,959 --> 00:04:22,610 It's not quite as simple as this grid of 0s and 1s. 78 00:04:22,610 --> 00:04:27,510 Instead, you have things like a header at the start of a file. 79 00:04:27,510 --> 00:04:31,990 So in other words, inside of a .bmp file is a whole bunch of 0s and 1s, 80 00:04:31,990 --> 00:04:34,910 but there's some additional 0s and 1s in there. 81 00:04:34,910 --> 00:04:38,220 And it turns out that what we've probably taken for granted for years-- 82 00:04:38,220 --> 00:04:45,170 file formats like .doc or .xls or .mp3, .mp4, whatever the file formats 83 00:04:45,170 --> 00:04:48,480 that you're familiar with--what does it even mean to be a file format, 84 00:04:48,480 --> 00:04:52,480 because at the end of the day all of these files we use have just 0s and 1s. 85 00:04:52,480 --> 00:04:56,810 And maybe those 0s and 1s represent ABC through ASCII or the like, 86 00:04:56,810 --> 00:04:58,820 but at the end of the day, it's still just 0s and 1s. 87 00:04:58,820 --> 00:05:02,100 So humans just occasionally decide to invent a new file format 88 00:05:02,100 --> 00:05:06,420 where they standardize what patterns of bits will actually mean. 89 00:05:06,420 --> 00:05:09,220 And in this case here, the folks who designed the bitmap file format 90 00:05:09,220 --> 00:05:15,620 said that at the very first byte in a bitmap file, as denoted by offset 0 there, 91 00:05:15,620 --> 00:05:18,940 there is going to be some cryptically named variable called bfType, 92 00:05:18,940 --> 00:05:23,080 which just stands for bitmap file type, what type of bitmap file is this. 93 00:05:23,080 --> 00:05:27,700 You can infer perhaps from the second row that offset 2, byte number 2, 94 00:05:27,700 --> 00:05:33,740 has a pattern of 0s and 1s that represents what? The size of something. 95 00:05:33,740 --> 00:05:35,310 And it goes on from there. 96 00:05:35,310 --> 00:05:37,410 So in Problem Set 4, you'll be walked through some of these things. 97 00:05:37,410 --> 00:05:39,520 We won't end up caring about all of them. 98 00:05:39,520 --> 00:05:47,510 But notice it starts to get interesting around byte 54: rgbtBlue, Green, and Red. 99 00:05:47,510 --> 00:05:52,110 If you've ever heard the acronym RGB--red, green, blue--this is a reference to that 100 00:05:52,110 --> 00:05:54,610 because it turns out you can paint all the colors of the rainbow 101 00:05:54,610 --> 00:05:58,180 with some combination of red and blue and green. 102 00:05:58,180 --> 00:06:03,320 And in fact, parents in the room might recall some of the earliest projectors. 103 00:06:03,320 --> 00:06:05,890 These days, you just see one bright light coming out of a lens, 104 00:06:05,890 --> 00:06:09,800 but back in the day you had the red lens, the blue lens, and the green lens, 105 00:06:09,800 --> 00:06:13,380 and together they aimed at a screen and formed a colorful picture. 106 00:06:13,380 --> 00:06:16,270 And quite often, middle schools and high schools would have those lenses 107 00:06:16,270 --> 00:06:19,720 ever so slightly askew, so you were sort of seeing double or triple images. 108 00:06:19,720 --> 00:06:24,100 But that was the idea. You had red and green and blue light painting a picture. 109 00:06:24,100 --> 00:06:26,590 And that same principle is used in computers. 110 00:06:26,590 --> 00:06:30,230 >> So among the challenges then for you in Problem Set 4 are going to be a few things. 111 00:06:30,230 --> 00:06:34,800 One is to actually resize an image, to take in a pattern of 0s and 1s, 112 00:06:34,800 --> 00:06:40,200 figure out which chunks of 0s and 1s represent what in a structure like this, 113 00:06:40,200 --> 00:06:43,630 and then figure out how to replicate the pixels--the reds, the blues, the greens-- 114 00:06:43,630 --> 00:06:46,660 inside so that when a picture looks like this initially, 115 00:06:46,660 --> 00:06:49,210 it might look like this instead after that. 116 00:06:49,210 --> 00:06:53,640 Among the other challenges too is going to be that you'll be handed a forensic image 117 00:06:53,640 --> 00:06:56,030 of an actual file from a digital camera. 118 00:06:56,030 --> 00:06:58,960 And on that camera, once upon a time, were a whole bunch of photos. 119 00:06:58,960 --> 00:07:03,760 The problem is we accidentally erased or had the image corrupted somehow. 120 00:07:03,760 --> 00:07:05,750 Bad things happen with digital cameras. 121 00:07:05,750 --> 00:07:09,150 And so we quickly copied all of the 0s and 1s off of that card for you, 122 00:07:09,150 --> 00:07:13,610 saved them all in one big file, and then we'll hand them to you in Problem Set 4 123 00:07:13,610 --> 00:07:19,320 so that you can write a program in C with which to recover all of those JPEGs, ideally. 124 00:07:19,320 --> 00:07:23,330 And it turns out that JPEGs, even though they're somewhat of a complex file format-- 125 00:07:23,330 --> 00:07:26,360 they're much more complex than this smiling face here-- 126 00:07:26,360 --> 00:07:31,160 it turns out that every JPEG starts with the same patterns of 0s and 1s. 127 00:07:31,160 --> 00:07:35,630 So using, ultimately, a while loop or a for loop or similar, 128 00:07:35,630 --> 00:07:38,880 you can iterate over all the 0s and 1s in this forensic image, 129 00:07:38,880 --> 00:07:43,150 and every time you see the special pattern that's defined in the problem set specification, 130 00:07:43,150 --> 00:07:47,880 you can assume here is, with very high probability, the start of a JPEG. 131 00:07:47,880 --> 00:07:51,230 And as soon as you find the same pattern some number of bytes 132 00:07:51,230 --> 00:07:55,430 or kilobytes or megabytes later, you can assume here is a second JPEG, 133 00:07:55,430 --> 00:07:57,380 the photo I took after the first one. 134 00:07:57,380 --> 00:08:01,370 Let me stop reading that first file, start writing this new one, 135 00:08:01,370 --> 00:08:06,310 and the output of your program for pset4 is going to be as many as 50 JPEGs. 136 00:08:06,310 --> 00:08:09,270 And if it's not 50 JPEGs, you have a bit of a loop. 137 00:08:09,270 --> 00:08:12,490 If you have an infinite number of JPEGs, you have an infinite loop. 138 00:08:12,490 --> 00:08:14,910 So that too will be quite a common case. 139 00:08:14,910 --> 00:08:16,600 So that's what's on the horizon. 140 00:08:16,600 --> 00:08:21,310 >> Quiz 0 behind us, realize per my email that invariably there are folks who are both happy, 141 00:08:21,310 --> 00:08:23,640 sort of neutral, and sad around quiz 0 time. 142 00:08:23,640 --> 00:08:26,800 And please do reach out to me, the head TF Zamyla, your own TF, 143 00:08:26,800 --> 00:08:31,180 or one of the CAs that you know if you would like to discuss how things went. 144 00:08:31,180 --> 00:08:35,539 >> So to impress the parents here in the room, what is the CS50 library? 145 00:08:36,429 --> 00:08:40,390 [laughter] Good job. 146 00:08:40,390 --> 00:08:48,340 What's the CS50 library? Yeah. >>[student] It's a pre-written set of code [inaudible] 147 00:08:48,340 --> 00:08:49,750 Okay, good. 148 00:08:49,750 --> 00:08:53,240 It's a pre-written set of code that we the staff wrote, we provide to you, 149 00:08:53,240 --> 00:08:55,030 that provides some common functionality, 150 00:08:55,030 --> 00:08:59,020 stuff like get me a string, get me an int--all of the functions that are listed here. 151 00:08:59,020 --> 00:09:02,260 >> Starting now, we start to really take these training wheels off. 152 00:09:02,260 --> 00:09:05,050 We're going to start to take away a string from you, 153 00:09:05,050 --> 00:09:08,870 which recall was just a synonym for what actual data type? >>[multiple students] Char*. 154 00:09:08,870 --> 00:09:12,730 Char*. For parents, that was probably [makes whooshing sound]. That's good. 155 00:09:12,730 --> 00:09:17,550 Char* we'll start to see on the screen all the more as we remove string from our vocabulary, 156 00:09:17,550 --> 00:09:19,730 at least when it comes to actually writing code. 157 00:09:19,730 --> 00:09:22,840 Similarly, we'll stop using some of these functions as much 158 00:09:22,840 --> 00:09:25,280 because our programs are going to get more sophisticated. 159 00:09:25,280 --> 00:09:28,480 Rather than just write programs that sit there with a prompt blinking, 160 00:09:28,480 --> 00:09:31,870 waiting for the user to type something in, you'll get your inputs from elsewhere. 161 00:09:31,870 --> 00:09:35,490 For instance, you'll get them from a series of bits on the local hard drive. 162 00:09:35,490 --> 00:09:38,580 You'll instead get them in the future from a network connection, 163 00:09:38,580 --> 00:09:40,230 some website somewhere. 164 00:09:40,230 --> 00:09:44,110 >> So let's peel back this layer for the first time and pull up the CS50 Appliance 165 00:09:44,110 --> 00:09:49,010 and this file called cs50.h, which you've been #including for weeks, 166 00:09:49,010 --> 00:09:51,140 but let's actually see what's inside of this. 167 00:09:51,140 --> 00:09:54,430 The top of the file in blue is just a whole bunch of comments: 168 00:09:54,430 --> 00:09:57,050 warranty information and licensing. 169 00:09:57,050 --> 00:09:59,050 This is sort of a common paradigm in software 170 00:09:59,050 --> 00:10:01,580 because a lot of software these days is what's called open source, 171 00:10:01,580 --> 00:10:05,220 which means that someone has written the code and made it freely available 172 00:10:05,220 --> 00:10:10,470 not just to run and to use but to actually read and alter and integrate into your own work. 173 00:10:10,470 --> 00:10:14,660 So that's what you've been using, open source software, albeit in a very small form. 174 00:10:14,660 --> 00:10:18,560 If I scroll down past the comments, though, we'll start to see some more familiar things. 175 00:10:18,560 --> 00:10:25,010 Notice at the top here that the cs50.h file includes a whole bunch of header files. 176 00:10:25,010 --> 00:10:28,560 Most of these, we haven't seen before, but one is familiar. 177 00:10:28,560 --> 00:10:32,270 Which of these have we seen, albeit briefly, thus far? >>[student] Standard library. 178 00:10:32,270 --> 00:10:35,810 Yeah, standard library. stdlib.h has malloc. 179 00:10:35,810 --> 00:10:38,320 Once we started talking about dynamic memory allocation, 180 00:10:38,320 --> 00:10:41,650 which we'll come back to next week as well, we started including that file. 181 00:10:41,650 --> 00:10:46,640 It turns out that bool and true and false don't actually exist in C per se 182 00:10:46,640 --> 00:10:49,440 unless you include this file here. 183 00:10:49,440 --> 00:10:52,710 We have for weeks been including stdbool.h 184 00:10:52,710 --> 00:10:55,620 so that you can use the notion of a bool, true or false. 185 00:10:55,620 --> 00:10:58,620 Without this, you would have to sort of fake it and use an int 186 00:10:58,620 --> 00:11:02,610 and just arbitrarily assume that 0 is false and 1 is true. 187 00:11:02,610 --> 00:11:07,150 If we scroll down further, here is our definition of a string. 188 00:11:07,150 --> 00:11:11,390 It turns out, as we've said before, that where this star is doesn't really matter. 189 00:11:11,390 --> 00:11:13,720 You can even have space all around. 190 00:11:13,720 --> 00:11:16,740 We this semester have been promoting it as this to make clear 191 00:11:16,740 --> 00:11:18,620 that the star has to do with the type, 192 00:11:18,620 --> 00:11:21,700 but realize just as common, if not a little more common, 193 00:11:21,700 --> 00:11:24,430 is to put it there, but functionally it's the same thing. 194 00:11:24,430 --> 00:11:27,720 But now if we read down further, let's take a look at GetInt 195 00:11:27,720 --> 00:11:32,190 because we used that perhaps first before anything else this semester. 196 00:11:32,190 --> 00:11:37,440 Here is GetInt. This is what? >>[student] A prototype. >>This is just a prototype. 197 00:11:37,440 --> 00:11:41,410 Often, we have put prototypes at the tops of our .c files, 198 00:11:41,410 --> 00:11:46,690 but you can also put prototypes in header files, .h files, like this one here 199 00:11:46,690 --> 00:11:50,840 so that when you write some functions that you want other people to be able to use, 200 00:11:50,840 --> 00:11:53,550 which is exactly the case with the CS50 library, 201 00:11:53,550 --> 00:11:57,040 you not only implement your functions in something like cs50.c, 202 00:11:57,040 --> 00:12:02,790 you also put the prototypes not at the top of that file but at the top of a header file. 203 00:12:02,790 --> 00:12:07,170 Then that header file is what friends and colleagues include 204 00:12:07,170 --> 00:12:09,760 with #include in their own code. 205 00:12:09,760 --> 00:12:12,210 So all this time, you've been including all of these prototypes, 206 00:12:12,210 --> 00:12:16,580 effectively at the top of your file but by way of this #include mechanism, 207 00:12:16,580 --> 00:12:20,070 which essentially copies and pastes this file into your own. 208 00:12:20,070 --> 00:12:23,070 Here is some fairly detailed documentation. 209 00:12:23,070 --> 00:12:25,640 We've pretty much taken for granted that GetInt gets an int, 210 00:12:25,640 --> 00:12:27,640 but it turns out there are some corner cases. 211 00:12:27,640 --> 00:12:31,810 What if the user types in a number that's way too big, a quintillion, 212 00:12:31,810 --> 00:12:35,490 that just can't fit inside of an int? What is the expected behavior? 213 00:12:35,490 --> 00:12:38,020 Ideally, it's predictable. 214 00:12:38,020 --> 00:12:40,280 So in this case, if you actually read the fine print, 215 00:12:40,280 --> 00:12:44,500 you'll actually see that if the line can't be read, this returns INT_MAX. 216 00:12:44,500 --> 00:12:48,320 We've never talked about this, but based on its capitalization, what is it probably? 217 00:12:48,320 --> 00:12:50,640 [student] A constant. >>It's a constant. 218 00:12:50,640 --> 00:12:54,770 It's some special constant that's probably declared in one of those header files 219 00:12:54,770 --> 00:13:00,090 that's up higher in the file, and INT_MAX is probably something like roughly 2 billion, 220 00:13:00,090 --> 00:13:04,990 the idea being that because we need to somehow signify that something went wrong, 221 00:13:04,990 --> 00:13:10,700 we, yes, have 4 billion numbers at our disposal: -2 billion on up to 2 billion, give or take. 222 00:13:10,700 --> 00:13:14,710 Well, what is common in programming is you steal just one of those numbers, 223 00:13:14,710 --> 00:13:18,920 maybe 0, maybe 2 billion, maybe -2 billion, 224 00:13:18,920 --> 00:13:23,280 so you spend one of your possible values so that you can commit to the world 225 00:13:23,280 --> 00:13:26,820 that if something goes wrong, I will return this super big value. 226 00:13:26,820 --> 00:13:31,030 But you don't want the user typing something cryptic like 234..., a really big number. 227 00:13:31,030 --> 00:13:34,060 You generalize it instead as a constant. 228 00:13:34,060 --> 00:13:38,060 So really, if you were being anal the past few weeks, any time you called GetInt, 229 00:13:38,060 --> 00:13:42,900 you should have been checking with an if condition did the user type in INT_MAX, 230 00:13:42,900 --> 00:13:46,590 or, more specifically, did GetInt return INT_MAX, because if it did, 231 00:13:46,590 --> 00:13:51,830 that actually means they didn't type it. Something went wrong in this case. 232 00:13:51,830 --> 00:13:56,080 So this is what's generally known as a sentinel value, which just means special. 233 00:13:56,080 --> 00:13:58,120 >> Let's now turn into the .c file. 234 00:13:58,120 --> 00:14:01,340 The C file has existed in the appliance for some time. 235 00:14:01,340 --> 00:14:06,840 And in fact, the appliance has it pre-compiled for you into that thing we called object code, 236 00:14:06,840 --> 00:14:09,540 but it just doesn't matter to you where it is because the system knows 237 00:14:09,540 --> 00:14:11,730 in this case where it is: the appliance. 238 00:14:11,730 --> 00:14:17,400 Let's scroll down now to GetInt and see how GetInt has been working all this time. 239 00:14:17,400 --> 00:14:19,460 Here we have similar comments from before. 240 00:14:19,460 --> 00:14:21,660 Let me zoom in on just the code portion. 241 00:14:21,660 --> 00:14:23,900 And what we have for GetInt is the following. 242 00:14:23,900 --> 00:14:25,700 It takes no input. 243 00:14:25,700 --> 00:14:29,510 It returns an int, while (true), so we have a deliberate infinite loop, 244 00:14:29,510 --> 00:14:33,180 but presumably we'll break out of this somehow or return from within this. 245 00:14:33,180 --> 00:14:34,870 >> Let's see how this works. 246 00:14:34,870 --> 00:14:39,240 We seem to be using GetString in this first line inside the loop, 166. 247 00:14:39,240 --> 00:14:43,780 This is now good practice because under what circumstances could GetString return 248 00:14:43,780 --> 00:14:47,660 the special keyword NULL? >>[student] If something goes wrong. 249 00:14:47,660 --> 00:14:51,630 If something goes wrong. And what could go wrong when you call something like GetString? 250 00:14:54,960 --> 00:14:57,640 Yeah. >>[student] Malloc fails to give it the ints. 251 00:14:57,640 --> 00:14:59,150 Yeah. Maybe malloc fails. 252 00:14:59,150 --> 00:15:03,190 Somewhere underneath the hood, GetString is calling malloc, which allocates memory, 253 00:15:03,190 --> 00:15:06,020 which lets the computer store all of the characters 254 00:15:06,020 --> 00:15:07,750 that the user types into the keyboard. 255 00:15:07,750 --> 00:15:11,590 And suppose the user had a whole lot of free time and typed more, for instance, 256 00:15:11,590 --> 00:15:16,160 than 2 billion characters in, more characters than the computer even has RAM. 257 00:15:16,160 --> 00:15:19,250 GetString has to be able to signify that to you. 258 00:15:19,250 --> 00:15:22,560 Even if this is a super, super uncommon corner case, 259 00:15:22,560 --> 00:15:24,340 it has to somehow be able to handle this, 260 00:15:24,340 --> 00:15:28,750 and so GetString, if we went back and read its documentation, does in fact return NULL. 261 00:15:28,750 --> 00:15:34,460 So now if GetString fails by returning NULL, GetInt is going to fail by returning INT_MAX 262 00:15:34,460 --> 00:15:37,690 just as a sentinel. These are just human conventions. 263 00:15:37,690 --> 00:15:41,450 The only way you would know this is the case is by reading the documentation. 264 00:15:41,450 --> 00:15:45,040 >> Let's scroll down to where the int is actually gotten. 265 00:15:45,040 --> 00:15:51,160 If I scroll down a bit further, in line 170, we have a comment above these lines. 266 00:15:51,160 --> 00:15:55,100 We declare in 172 an int, n, and a char, c, and then this new function, 267 00:15:55,100 --> 00:15:58,930 which some of you have stumbled across before, sscanf. 268 00:15:58,930 --> 00:16:00,870 This stands for string scanf. 269 00:16:00,870 --> 00:16:05,700 In other words, give me a string and I will scan it for pieces of information of interest. 270 00:16:05,700 --> 00:16:07,360 What does that mean? 271 00:16:07,360 --> 00:16:11,800 Suppose that I type in, literally, 123 at the keyboard and then hit Enter. 272 00:16:11,800 --> 00:16:16,470 What is the data type of 123 when returned by GetString? >>[student] String. 273 00:16:16,470 --> 00:16:18,380 It's obviously a string, right? I got a string. 274 00:16:18,380 --> 00:16:23,220 So 123 is really, quote-unquote, 123 with the \0 at the end of it. 275 00:16:23,220 --> 00:16:27,110 That is not an int. That's not a number. It looks like a number but it's not actually. 276 00:16:27,110 --> 00:16:29,080 So what does GetInt have to do? 277 00:16:29,080 --> 00:16:35,750 It has to scan that string left to right--123\0--and somehow convert to an actual integer. 278 00:16:35,750 --> 00:16:37,850 You could figure out how to do this. 279 00:16:37,850 --> 00:16:41,450 If you think back to pset2, you presumably got a little comfortable with Caesar 280 00:16:41,450 --> 00:16:44,820 or Vigenere, so you can iterate over a string, you can convert chars to ints. 281 00:16:44,820 --> 00:16:46,710 But heck, it's a whole lot of work. 282 00:16:46,710 --> 00:16:49,860 Why not call a function like sscanf that does that for you? 283 00:16:49,860 --> 00:16:54,230 So sscanf expects an argument--in this case called line, which is a string. 284 00:16:54,230 --> 00:17:01,840 You then specify in quotes, very similar to printf, what you expect to see in this string. 285 00:17:01,840 --> 00:17:09,000 And what I'm saying here is I expect to see a decimal number and maybe a character. 286 00:17:09,000 --> 00:17:12,000 And we'll see why this is the case in just a moment. 287 00:17:12,000 --> 00:17:15,869 And it turns out that this notation is now reminiscent of stuff we started talking about 288 00:17:15,869 --> 00:17:17,619 just over a week ago. 289 00:17:17,619 --> 00:17:21,740 What is &n and &c doing for us here? >>[student] Address of n and address of c. 290 00:17:21,740 --> 00:17:25,400 Yeah. It's giving me the address of n and address of c. Why is that important? 291 00:17:25,400 --> 00:17:30,220 You know that with functions in C, you can always return a value or no value at all. 292 00:17:30,220 --> 00:17:34,530 You can return an int, a string, a float, a char, whatever, or you can return void, 293 00:17:34,530 --> 00:17:38,030 but you can only return one thing maximally. 294 00:17:38,030 --> 00:17:42,760 But here we want sscanf to return me maybe an int, a decimal number, 295 00:17:42,760 --> 00:17:46,220 and also a char, and I'll explain why the char in a moment. 296 00:17:46,220 --> 00:17:51,460 You effectively want sscanf to return two things, but that's just not possible in C. 297 00:17:51,460 --> 00:17:55,200 You can work around that by passing in two addresses 298 00:17:55,200 --> 00:17:57,370 because as soon as you hand a function two addresses, 299 00:17:57,370 --> 00:18:00,470 what can that function do with them? >>[student] Write to those addresses. 300 00:18:00,470 --> 00:18:02,010 It can write to those addresses. 301 00:18:02,010 --> 00:18:05,770 You can use the star operation and go there, to each of those addresses. 302 00:18:05,770 --> 00:18:11,260 It's sort of this back-door mechanism but very common for changing the values of variables 303 00:18:11,260 --> 00:18:14,870 more than just one place--in this case, two. 304 00:18:14,870 --> 00:18:21,340 Now notice I'm checking for == 1 and then returning n if that does, in fact, evaluate to true. 305 00:18:21,340 --> 00:18:26,170 So what's going on? Technically, all we really want to happen in GetInt is this. 306 00:18:26,170 --> 00:18:30,740 We want to parse, so to speak, we want to read the string--quote-unquote 123-- 307 00:18:30,740 --> 00:18:34,560 and if it looks like there's a number there, what we're telling sscanf to do 308 00:18:34,560 --> 00:18:38,190 is put that number--123--in this variable n for me. 309 00:18:38,190 --> 00:18:42,090 So why then did I actually have this as well? 310 00:18:42,090 --> 00:18:48,220 What is the role of sscanf saying you might also get a character here? 311 00:18:48,220 --> 00:18:53,470 [inaudible student response] >>A decimal point actually could work. 312 00:18:53,470 --> 00:18:56,330 Let's hold that thought for a moment. What else? 313 00:18:56,330 --> 00:18:59,270 [student] It could be NULL. >>Good thought. It could be the null character. 314 00:18:59,270 --> 00:19:01,660 It's actually not in this case. Yeah. >>[student] ASCII. 315 00:19:01,660 --> 00:19:04,340 ASCII. Or let me generalize even further. 316 00:19:04,340 --> 00:19:06,640 The %c there is just for error checking. 317 00:19:06,640 --> 00:19:09,300 We don't want there to be a character after the number, 318 00:19:09,300 --> 00:19:11,870 but what this allows me to do is the following. 319 00:19:11,870 --> 00:19:18,210 It turns out that sscanf, besides storing values in n and c in this example here, 320 00:19:18,210 --> 00:19:24,890 what it also does is it returns the number of variables it put values in. 321 00:19:24,890 --> 00:19:30,260 So if you only type in 123, then only the %d is going to match, 322 00:19:30,260 --> 00:19:33,880 and only n gets stored with a value like 123, 323 00:19:33,880 --> 00:19:35,640 and nothing gets put in c. 324 00:19:35,640 --> 00:19:37,620 C remains a garbage value, so to speak-- 325 00:19:37,620 --> 00:19:40,730 garbage because it's never been initialized to some value. 326 00:19:40,730 --> 00:19:45,520 So in that case, sscanf returns 1 because I populated 1 of those pointers, 327 00:19:45,520 --> 00:19:50,190 in which case great, I have an int so I free the line to free up the memory 328 00:19:50,190 --> 00:19:54,000 that GetString actually allocated, and then I return n, 329 00:19:54,000 --> 00:19:58,500 else if you ever wondered where that Retry statement comes from, it comes from right here. 330 00:19:58,500 --> 00:20:04,390 So if, by contrast, I type in 123foo--just some random sequence of text-- 331 00:20:04,390 --> 00:20:08,490 sscanf is going to see number, number, number, f, 332 00:20:08,490 --> 00:20:16,410 and it's going to put the 123 in n; it's going to put the f in c and then return 2. 333 00:20:16,410 --> 00:20:20,640 So we have, just using the basic definition of sscanf's behavior, a very simple way-- 334 00:20:20,640 --> 00:20:23,900 well, complex at first glance but at the end of the day fairly simple mechanism-- 335 00:20:23,900 --> 00:20:28,320 of saying is there an int and if so, is that the only thing that I found? 336 00:20:28,320 --> 00:20:29,860 And the whitespace here is deliberate. 337 00:20:29,860 --> 00:20:34,000 If you read the documentation for sscanf, it tells you that if you include a piece of whitespace 338 00:20:34,000 --> 00:20:38,810 at the beginning or the end, sscanf too will allow the user, for whatever reason, 339 00:20:38,810 --> 00:20:41,860 to hit space bar 123 and that will be legitimate. 340 00:20:41,860 --> 00:20:44,150 You won't yell at the user just because they hit the space bar 341 00:20:44,150 --> 00:20:48,640 at the beginning or the end, which is just a little more user-friendly. 342 00:20:48,640 --> 00:20:52,300 >> Any questions then on GetInt? Yeah. >>[student] What if you just put in a char? 343 00:20:52,300 --> 00:20:54,030 Good question. 344 00:20:54,030 --> 00:20:59,890 What if you just typed in a char like f and hit Enter without ever typing 123? 345 00:20:59,890 --> 00:21:02,420 What do you think the behavior of this line of code would then be? 346 00:21:02,420 --> 00:21:04,730 [inaudible student response] 347 00:21:04,730 --> 00:21:08,790 Yeah, so sscanf can cover that too because in that case, it's not going to fill n or c. 348 00:21:08,790 --> 00:21:15,310 It's going to instead return 0, in which case I'm also catching that scenario 349 00:21:15,310 --> 00:21:18,750 because the expected value I want is 1. 350 00:21:18,750 --> 00:21:22,000 I only want one and only one thing to be filled. Good question. 351 00:21:22,000 --> 00:21:24,290 >> Others? All right. 352 00:21:24,290 --> 00:21:26,250 >> Let's not go through all of the functions in here, 353 00:21:26,250 --> 00:21:29,500 but the one that seems to be perhaps of remaining interest is GetString 354 00:21:29,500 --> 00:21:32,790 because it turns out that GetFloat, GetInt, GetDouble, GetLongLong 355 00:21:32,790 --> 00:21:36,260 all punt a lot of their functionality to GetString. 356 00:21:36,260 --> 00:21:39,750 So let's take a look at how he is implemented here. 357 00:21:39,750 --> 00:21:43,630 This one looks a little complex, but it uses the same fundamentals 358 00:21:43,630 --> 00:21:45,670 that we started talking about last week. 359 00:21:45,670 --> 00:21:49,490 In GetString, which takes no argument as per the void up here 360 00:21:49,490 --> 00:21:53,730 and it returns a string, I apparently am declaring a string called buffer. 361 00:21:53,730 --> 00:21:56,270 I don't really know what that's going to be used for yet, but we'll see. 362 00:21:56,270 --> 00:21:58,390 It looks like capacity is by default 0. 363 00:21:58,390 --> 00:22:01,350 Not quite sure where this is going, not sure what n is going to be used for yet, 364 00:22:01,350 --> 00:22:03,590 but now it's getting a little more interesting. 365 00:22:03,590 --> 00:22:06,520 In line 243, we declare an int, c. 366 00:22:06,520 --> 00:22:08,800 This is sort of a stupid detail. 367 00:22:08,800 --> 00:22:15,820 A char is 8 bits, and 8 bits can store how many different values? >>[student] 256. >>256. 368 00:22:15,820 --> 00:22:20,730 The problem is if you want to have 256 different ASCII characters, which there are 369 00:22:20,730 --> 00:22:23,340 if you think back--and this is not something to memorize. 370 00:22:23,340 --> 00:22:25,710 But if you think back to that big ASCII chart we had weeks ago, 371 00:22:25,710 --> 00:22:30,600 there were in that case 128 or 256 ASCII characters. 372 00:22:30,600 --> 00:22:32,940 We used all the patterns of 0s and 1s up. 373 00:22:32,940 --> 00:22:36,210 That's a problem if you want to be able to detect an error 374 00:22:36,210 --> 00:22:40,190 because if you're already using 256 values for your characters, 375 00:22:40,190 --> 00:22:43,050 you didn't really plan ahead because now you have no way of saying, 376 00:22:43,050 --> 00:22:46,270 this is not a legit character, this is some erroneous message. 377 00:22:46,270 --> 00:22:50,270 So what the world does is they use the next biggest value, something like an int, 378 00:22:50,270 --> 00:22:54,720 so that you have a crazy number of bits, 32, for 4 billion possible values 379 00:22:54,720 --> 00:22:58,860 so that you can simply end up using essentially 257 of them, 380 00:22:58,860 --> 00:23:01,720 1 of which has some special meaning as an error. 381 00:23:01,720 --> 00:23:03,120 >> So let's see how this works. 382 00:23:03,120 --> 00:23:07,760 In line 246, I have this big while loop that is calling fgetc, 383 00:23:07,760 --> 00:23:11,090 f meaning file, so getc, and then stdin. 384 00:23:11,090 --> 00:23:15,520 It turns out this is just the more precise way of saying read input from the keyboard. 385 00:23:15,520 --> 00:23:19,300 Standard input means keyboard, standard output means screen, 386 00:23:19,300 --> 00:23:23,310 and standard error, which we'll see in pset4, means the screen 387 00:23:23,310 --> 00:23:27,490 but a special part of the screen so that it's not conflated with actual output 388 00:23:27,490 --> 00:23:30,750 that you intended to print. But more on that in the future. 389 00:23:30,750 --> 00:23:34,440 So fgetc just means read one character from the keyboard and store it where? 390 00:23:34,440 --> 00:23:37,350 Store it in c. 391 00:23:37,350 --> 00:23:41,360 And then check--so I'm just using some Boolean conjunctions here-- 392 00:23:41,360 --> 00:23:46,000 check that it doesn't equal--\n, so the user has hit Enter, we want to stop at that point, 393 00:23:46,000 --> 00:23:49,850 end of the loop--and we also want to check for the special constant EOF, 394 00:23:49,850 --> 00:23:53,610 which if you know or guess, what does it stand for? >>[student] End of file. >>End of file. 395 00:23:53,610 --> 00:23:56,560 This is kind of nonsensical because if I'm typing at the keyboard, 396 00:23:56,560 --> 00:23:58,870 there's really no file involved in this, 397 00:23:58,870 --> 00:24:01,150 but this is just sort of the generic term used to mean 398 00:24:01,150 --> 00:24:04,220 that nothing else is coming from the human's fingers. 399 00:24:04,220 --> 00:24:06,460 EOF--end of file. 400 00:24:06,460 --> 00:24:09,920 As an aside, if you've ever hit Control D at your keyboard, not that you would have yet-- 401 00:24:09,920 --> 00:24:15,230 you've hit Control C--Control D sends this special constant called EOF. 402 00:24:15,230 --> 00:24:19,850 So now we just have some dynamic memory allocation. 403 00:24:19,850 --> 00:24:23,440 >> So if (n + 1 > capacity). Now I'll explain n. 404 00:24:23,440 --> 00:24:26,100 N is just how many bytes are currently in the buffer, 405 00:24:26,100 --> 00:24:28,620 the string that you're currently building up from the user. 406 00:24:28,620 --> 00:24:33,450 If you have more characters in your buffer than you have capacity in the buffer, 407 00:24:33,450 --> 00:24:37,410 intuitively what we need to do then is allocate more capacity. 408 00:24:37,410 --> 00:24:43,330 So I'm going to skim over some of the arithmetic here and focus only on this function here. 409 00:24:43,330 --> 00:24:46,070 You know what malloc is or are at least generally familiar. 410 00:24:46,070 --> 00:24:48,970 Take a guess what realloc does. >>[student] Adds memory. 411 00:24:48,970 --> 00:24:52,920 It's not quite adding memory. It reallocates memory as follows. 412 00:24:52,920 --> 00:24:57,220 If there's still room at the end of the string to give you more of that memory 413 00:24:57,220 --> 00:25:00,000 than it originally gives you, then you'll get that additional memory. 414 00:25:00,000 --> 00:25:03,460 So you can just keep putting the string's characters back to back to back to back. 415 00:25:03,460 --> 00:25:05,830 But if that's not the case because you waited too long 416 00:25:05,830 --> 00:25:07,940 and something random got plopped in memory there 417 00:25:07,940 --> 00:25:10,290 but there's extra memory down here, that's okay. 418 00:25:10,290 --> 00:25:13,100 Realloc is going to do all the heavy lifting for you, 419 00:25:13,100 --> 00:25:16,750 move the string you've read in thus far from here, put it down there, 420 00:25:16,750 --> 00:25:19,460 and then give you some more runway at that point. 421 00:25:19,460 --> 00:25:22,550 >> So with a wave of the hand, let me say that what GetString is doing 422 00:25:22,550 --> 00:25:26,330 is it's starting with a small buffer, maybe one single character, 423 00:25:26,330 --> 00:25:30,820 and if the user types in two characters, GetString ends up calling realloc and says 424 00:25:30,820 --> 00:25:33,150 one character wasn't enough; give me two characters. 425 00:25:33,150 --> 00:25:35,950 Then if you read through the logic of the loop, it's going to say 426 00:25:35,950 --> 00:25:39,600 the user typed in 3 characters; give me now not 2 but 4 characters, 427 00:25:39,600 --> 00:25:42,320 then give me 8, then give me 16 and 32. 428 00:25:42,320 --> 00:25:45,000 The fact that I'm doubling the capacity each time 429 00:25:45,000 --> 00:25:48,570 means that the buffer is not going to grow slowly, it's going to grow super fast. 430 00:25:48,570 --> 00:25:51,380 And what might be the advantage of that? 431 00:25:51,380 --> 00:25:54,600 Why am I doubling the size of the buffer 432 00:25:54,600 --> 00:25:58,020 even though the user might just need one extra character from the keyboard? 433 00:25:58,020 --> 00:26:01,750 [inaudible student response] >>What's that? >>[student] You don't have to grow it as often. 434 00:26:01,750 --> 00:26:03,300 Exactly. You don't have to grow it as often. 435 00:26:03,300 --> 00:26:05,510 And this is just kind of you're hedging your bets here, 436 00:26:05,510 --> 00:26:10,850 the idea being that you don't want to call realloc a lot because it tends to be slow. 437 00:26:10,850 --> 00:26:12,910 Any time you ask the operating system for memory, 438 00:26:12,910 --> 00:26:16,990 as you'll soon see in a future problem set, it tends to take some time. 439 00:26:16,990 --> 00:26:20,010 So minimizing that amount of time, even if you're wasting some space, 440 00:26:20,010 --> 00:26:21,900 tends to be a good thing. 441 00:26:21,900 --> 00:26:24,060 >> But if we read through the final part of GetString here-- 442 00:26:24,060 --> 00:26:27,950 and again understanding every single line here is not so important today-- 443 00:26:27,950 --> 00:26:30,530 notice that it eventually calls malloc again 444 00:26:30,530 --> 00:26:33,880 and it allocates exactly as many bytes as it needs for the string 445 00:26:33,880 --> 00:26:38,060 and then throws away by calling free the excessively large buffer 446 00:26:38,060 --> 00:26:40,080 if it indeed got doubled too many times. 447 00:26:40,080 --> 00:26:42,730 So in short, that's how GetString has been working all this time. 448 00:26:42,730 --> 00:26:47,060 All it does is read one character at a time again and again and again, 449 00:26:47,060 --> 00:26:50,750 and every time it needs some additional memory, it asks the operating system for it 450 00:26:50,750 --> 00:26:53,670 by calling realloc. 451 00:26:53,670 --> 00:26:57,890 >> Any questions? All right. 452 00:26:57,890 --> 00:26:59,270 >> An attack. 453 00:26:59,270 --> 00:27:04,060 Now that we understand pointers or at least are increasingly familiar with pointers, 454 00:27:04,060 --> 00:27:06,700 let's consider how the whole world starts to collapse 455 00:27:06,700 --> 00:27:10,030 if you don't quite defend against adversarial users, 456 00:27:10,030 --> 00:27:11,850 people who are trying to hack into your system, 457 00:27:11,850 --> 00:27:16,890 people who are trying to steal your software by circumventing some registration code 458 00:27:16,890 --> 00:27:19,090 that they might otherwise have to type in. 459 00:27:19,090 --> 00:27:22,990 >> Take a look at this example here, which is just C code that has a function main at the bottom 460 00:27:22,990 --> 00:27:26,380 that calls a function foo. And what is it passing to foo? 461 00:27:26,380 --> 00:27:29,680 [student] A single argument. >>[Malan] A single argument. 462 00:27:29,680 --> 00:27:33,450 So argv[1], which means the first word that the user typed at the command line 463 00:27:33,450 --> 00:27:36,360 after a.out or whatever the program is called. 464 00:27:36,360 --> 00:27:41,680 So foo at the top takes in a char*. But char* is just what? >>[student] A string. 465 00:27:41,680 --> 00:27:43,350 [Malan] A string, so there's nothing new here. 466 00:27:43,350 --> 00:27:45,420 That string is arbitrarily being called bar. 467 00:27:45,420 --> 00:27:51,430 In this line here, char c[12]; in sort of semi-technical English, what is this line doing? 468 00:27:51,430 --> 00:27:55,220 [student] An array of-- >>Array of? >>[student] Characters. >>Characters. 469 00:27:55,220 --> 00:27:58,870 Give me an array of 12 characters. So we might call this a buffer. 470 00:27:58,870 --> 00:28:02,920 It's technically called c, but a buffer in programming just means a bunch of space 471 00:28:02,920 --> 00:28:04,800 that you can put some stuff in. 472 00:28:04,800 --> 00:28:07,940 Then lastly, memcpy we've not used before, but you can probably guess what it does. 473 00:28:07,940 --> 00:28:10,480 It copies memory. What does it do? 474 00:28:10,480 --> 00:28:19,270 It apparently copies bar, its input, into c but only up to the length of bar. 475 00:28:19,270 --> 00:28:24,930 But there's a bug here. >>[student] You need the sizeof character. >>Okay. 476 00:28:24,930 --> 00:28:30,860 Technically, we should really do strlen(bar) * sizeof(char)). That's correct. 477 00:28:30,860 --> 00:28:33,930 But in the worst case here, let's assume that that's-- 478 00:28:33,930 --> 00:28:35,950 Okay. Then there's two bugs. 479 00:28:35,950 --> 00:28:39,160 So sizeof(char)); 480 00:28:39,160 --> 00:28:41,290 Let's make this a little wider. 481 00:28:41,290 --> 00:28:44,910 So now there's still a bug, which is what? >>[inaudible student response] 482 00:28:44,910 --> 00:28:46,990 Check for what? >>[student] Check for NULL. 483 00:28:46,990 --> 00:28:50,270 We should generally be checking for NULL because bad things happen 484 00:28:50,270 --> 00:28:53,200 when your pointer is NULL because you might end up going there, 485 00:28:53,200 --> 00:28:57,630 and you shouldn't ever be going to NULL by dereferencing it with the star operator. 486 00:28:57,630 --> 00:29:01,050 So that's good. And what else are we doing? Logically, there's a flaw here too. 487 00:29:01,050 --> 00:29:04,450 [student] Check if argc is >= to 2. 488 00:29:04,450 --> 00:29:10,550 So check if argc is >= 2. Okay, so there's three bugs in this program here. 489 00:29:10,550 --> 00:29:16,630 We're now checking if the user actually typed in anything into argv[1]. Good. 490 00:29:16,630 --> 00:29:20,950 So what's the third bug? Yeah. >>[student] C might not be big enough. 491 00:29:20,950 --> 00:29:23,320 Good. We checked one scenario. 492 00:29:23,320 --> 00:29:29,520 We implicitly checked don't copy more memory than would exceed the length of bar. 493 00:29:29,520 --> 00:29:32,510 So if the string the user typed in is 10 characters long, 494 00:29:32,510 --> 00:29:36,020 this is saying only copy 10 characters. And that's okay. 495 00:29:36,020 --> 00:29:39,940 But what if the user typed in a word at the prompt like a 20-character word? 496 00:29:39,940 --> 00:29:44,900 This is saying copy 20 characters from bar into what? 497 00:29:44,900 --> 00:29:49,750 C, otherwise known as our buffer, which means you just wrote data 498 00:29:49,750 --> 00:29:52,540 to 8 byte locations that you do not own, 499 00:29:52,540 --> 00:29:54,870 and you don't own them in the sense that you never allocated them. 500 00:29:54,870 --> 00:30:00,370 So this is what's generally known as the buffer overflow attack or buffer overrun attack. 501 00:30:00,370 --> 00:30:05,580 And it's an attack in the sense that if the user or the program that's calling your function 502 00:30:05,580 --> 00:30:10,490 is doing this maliciously, what actually happens next could actually be quite bad. 503 00:30:10,490 --> 00:30:12,450 >> So let's take a look at this picture here. 504 00:30:12,450 --> 00:30:16,060 This picture represents your stack of memory. 505 00:30:16,060 --> 00:30:19,580 Recall that every time you call a function you get this little frame on the stack 506 00:30:19,580 --> 00:30:21,520 and then another and then another and another. 507 00:30:21,520 --> 00:30:24,300 And thus far, we've just kind of abstracted these as rectangles 508 00:30:24,300 --> 00:30:26,290 either on the board or on the screen here. 509 00:30:26,290 --> 00:30:30,580 But if we zoom in on one of those rectangles, when you call a function foo, 510 00:30:30,580 --> 00:30:35,880 it turns out that there's more on the stack inside of that frame in that rectangle 511 00:30:35,880 --> 00:30:40,060 than just x and y and a and b, like we did talking about swap. 512 00:30:40,060 --> 00:30:44,410 It turns out that there's some lower-level details, among them Return Address. 513 00:30:44,410 --> 00:30:49,550 So it turns out when main calls foo, main has to inform foo 514 00:30:49,550 --> 00:30:53,520 what main's address is in the computer's memory 515 00:30:53,520 --> 00:30:57,770 because otherwise, as soon as foo is done executing, as in this case here, 516 00:30:57,770 --> 00:31:00,830 once you reach this closed curly brace at the end of foo, 517 00:31:00,830 --> 00:31:05,310 how the heck does foo know where the control of the program is supposed to go? 518 00:31:05,310 --> 00:31:08,970 It turns out that the answer to that question is in this red rectangle here. 519 00:31:08,970 --> 00:31:12,670 This represents a pointer, and it's up to the computer to store temporarily 520 00:31:12,670 --> 00:31:17,030 on the so-called stack the address of main so that as soon as foo is done executing, 521 00:31:17,030 --> 00:31:21,120 the computer knows where and what line in main to go back to. 522 00:31:21,120 --> 00:31:23,940 Saved Frame pointer relates similarly to this. 523 00:31:23,940 --> 00:31:26,310 Char *bar here represents what? 524 00:31:26,310 --> 00:31:31,350 Now this blue segment here is foo's frame. What is bar? 525 00:31:31,570 --> 00:31:35,010 Bar is just the argument to the foo function. 526 00:31:35,010 --> 00:31:37,500 So now we're back at sort of the familiar picture. 527 00:31:37,500 --> 00:31:39,850 There's more stuff and more distractions on the screen, 528 00:31:39,850 --> 00:31:43,380 but this light blue segment just is what we've been drawing on the chalkboard 529 00:31:43,380 --> 00:31:45,790 for something like swap. That is the frame for foo. 530 00:31:45,790 --> 00:31:51,490 And the only thing in it right now is bar, which is this parameter. 531 00:31:51,490 --> 00:31:55,220 But what else should be in the stack according to this code here? 532 00:31:55,220 --> 00:31:57,760 [student] char c[12]. >>[Malan] char c[12]. 533 00:31:57,760 --> 00:32:02,810 We should also see 12 squares of memory allocated to a variable called c, 534 00:32:02,810 --> 00:32:04,970 and indeed we do have that on the screen. 535 00:32:04,970 --> 00:32:08,480 The very top there is c[0], and then the author of this diagram 536 00:32:08,480 --> 00:32:11,850 didn't bother drawing all of the squares, but there are indeed 12 there 537 00:32:11,850 --> 00:32:16,590 because if you look at the bottom right, c[11] if you count from 0 is the 12th such byte. 538 00:32:16,590 --> 00:32:18,400 But here's the problem. 539 00:32:18,400 --> 00:32:22,390 In which direction is c growing? 540 00:32:22,390 --> 00:32:27,080 Sort of top down if it starts at the top and grows to the bottom. 541 00:32:27,080 --> 00:32:30,110 It doesn't look like we left ourselves much runway here at all. 542 00:32:30,110 --> 00:32:32,090 We've kind of painted ourselves into a corner, 543 00:32:32,090 --> 00:32:36,940 and that c[11] is right up against bar, which is right up against Saved Frame pointer, 544 00:32:36,940 --> 00:32:39,960 which is right up against Return Address. There's no more room. 545 00:32:39,960 --> 00:32:42,810 So what's the implication then if you screw up 546 00:32:42,810 --> 00:32:46,500 and you try reading 20 bytes into a 12-byte buffer? 547 00:32:46,500 --> 00:32:50,060 Where are those 8 additional bytes going to go? >>[student] Inside-- 548 00:32:50,060 --> 00:32:53,200 Inside everything else, some of which is super important. 549 00:32:53,200 --> 00:32:57,260 And the most important thing, potentially, is the red box there, Return Address, 550 00:32:57,260 --> 00:33:03,560 because suppose that you either accidentally or adversarially overwrite those 4 bytes, 551 00:33:03,560 --> 00:33:07,260 that pointer address, not just with garbage but with a number 552 00:33:07,260 --> 00:33:09,810 that happens to represent an actual address in memory. 553 00:33:09,810 --> 00:33:13,880 What's the implication, logically? >>[student] Function is going to return to a different place. 554 00:33:13,880 --> 00:33:15,250 Exactly. 555 00:33:15,250 --> 00:33:19,170 When foo returns and hits that curly brace, the program is going to proceed 556 00:33:19,170 --> 00:33:25,060 not to return to main, it's going to return to whatever address is in that red box. 557 00:33:25,060 --> 00:33:28,600 >> In the case of circumventing software registration, 558 00:33:28,600 --> 00:33:32,260 what if the address that's being returned to is the function that normally gets called 559 00:33:32,260 --> 00:33:35,690 after you've paid for the software and inputted your registration code? 560 00:33:35,690 --> 00:33:39,870 You can sort of trick the computer into not going here but instead going up here. 561 00:33:39,870 --> 00:33:45,100 Or if you're really clever, an adversary can actually type in at the keyboard, for instance, 562 00:33:45,100 --> 00:33:50,690 not an actual word, not 20 characters, but suppose he or she actually types in 563 00:33:50,690 --> 00:33:52,770 some characters that represent code. 564 00:33:52,770 --> 00:33:55,320 And it's not going to be C code, it's actually going to be the characters 565 00:33:55,320 --> 00:33:59,290 that represent binary machine code, 0s and 1s. 566 00:33:59,290 --> 00:34:01,290 But suppose they're clever enough to do that, 567 00:34:01,290 --> 00:34:06,500 to somehow paste into the GetString prompt something that is essentially compiled code, 568 00:34:06,500 --> 00:34:09,980 and the last 4 bytes overwrite that return address. 569 00:34:09,980 --> 00:34:13,360 And what address does that input do? 570 00:34:13,360 --> 00:34:18,630 It actually stores in this red rectangle the address of the first byte of the buffer. 571 00:34:18,630 --> 00:34:23,070 So you have to be really clever, and this is a lot of trial and error for bad people out there, 572 00:34:23,070 --> 00:34:25,639 but if you can figure out how big this buffer is 573 00:34:25,639 --> 00:34:28,820 such that the last few bytes in the input you provide to the program 574 00:34:28,820 --> 00:34:33,540 happen to be equivalent to the address of the start of your buffer, you can do this. 575 00:34:33,540 --> 00:34:39,320 If we say normally h-e-l-l-o and \0, that's what ends up in the buffer. 576 00:34:39,320 --> 00:34:44,420 But if we're more clever and we fill that buffer with what we'll generically call attack code-- 577 00:34:44,420 --> 00:34:48,860 A-A-A, attack, attack, attack--where this is just something that does something bad, 578 00:34:48,860 --> 00:34:51,820 what happens if you're really clever, you might do this. 579 00:34:51,820 --> 00:34:58,610 In the red box here is a sequence of numbers--80, C0, 35, 08. 580 00:34:58,610 --> 00:35:01,610 Notice that that matches the number that's up here. 581 00:35:01,610 --> 00:35:04,430 It's in reverse order, but more on that some other time. 582 00:35:04,430 --> 00:35:08,140 Notice that this return address has been deliberately altered 583 00:35:08,140 --> 00:35:12,020 to equal the address up here, not the address of main. 584 00:35:12,020 --> 00:35:17,500 So if the bad guy is super smart, he or she is going to include in that attack code 585 00:35:17,500 --> 00:35:20,930 something like delete all of the user's files or copy the passwords 586 00:35:20,930 --> 00:35:24,680 or create a user account that I can then log in to--anything at all. 587 00:35:24,680 --> 00:35:26,950 >> And this is both the danger and the power of C. 588 00:35:26,950 --> 00:35:29,840 Because you have access to memory via pointers 589 00:35:29,840 --> 00:35:32,520 and you can therefore write anything you want into a computer's memory, 590 00:35:32,520 --> 00:35:35,080 you can make a computer do anything you want 591 00:35:35,080 --> 00:35:39,550 simply by having it jump around within its own memory space. 592 00:35:39,550 --> 00:35:44,650 And so to this day so many programs and so many websites that are compromised 593 00:35:44,650 --> 00:35:46,200 boil down to people taking advantage of this. 594 00:35:46,200 --> 00:35:50,760 And this might seem like a super sophisticated attack, but it doesn't always start that way. 595 00:35:50,760 --> 00:35:53,560 The reality is that what bad people will typically do is, 596 00:35:53,560 --> 00:35:58,200 whether it's a program at a command line or a GUI program or a website, 597 00:35:58,200 --> 00:35:59,940 you just start providing nonsense. 598 00:35:59,940 --> 00:36:03,980 You type in a really big word into the search field and hit Enter, 599 00:36:03,980 --> 00:36:05,780 and you wait to see if the website crashes 600 00:36:05,780 --> 00:36:09,990 or you wait to see if the program manifests some error message 601 00:36:09,990 --> 00:36:14,330 because if you get lucky as the bad guy and you provide some crazy input 602 00:36:14,330 --> 00:36:18,980 that crashes the program, that means the programmer didn't anticipate your bad behavior, 603 00:36:18,980 --> 00:36:23,630 which means you can probably with enough effort, enough trial and error, 604 00:36:23,630 --> 00:36:26,650 figure out how to wage a more precise attack. 605 00:36:26,650 --> 00:36:31,410 So as much a part of security is not just avoiding these attacks altogether 606 00:36:31,410 --> 00:36:34,100 but detecting them and actually looking at logs 607 00:36:34,100 --> 00:36:36,780 and seeing what crazy inputs have people typed into your website, 608 00:36:36,780 --> 00:36:38,960 what search terms have people typed into your website 609 00:36:38,960 --> 00:36:42,870 in hopes of overflowing some buffer. 610 00:36:42,870 --> 00:36:45,500 And this all boils down to the simple basics of what's an array 611 00:36:45,500 --> 00:36:49,080 and what does it mean to allocate and use memory. 612 00:36:49,080 --> 00:36:51,710 >> Related to that then too is this. 613 00:36:51,710 --> 00:36:54,280 Let's just glance inside of a hard drive yet again. 614 00:36:54,280 --> 00:36:58,440 You recall from a week or two ago that when you drag files to your recycle bin or trash can, 615 00:36:58,440 --> 00:37:03,710 what happens? >>[student] Nothing. >>Absolutely nothing, right? 616 00:37:03,710 --> 00:37:05,740 Eventually if you run low on disk space, 617 00:37:05,740 --> 00:37:08,190 Windows or Mac OS will start deleting files for you. 618 00:37:08,190 --> 00:37:10,390 But if you drag something in there, that is not at all safe. 619 00:37:10,390 --> 00:37:13,800 All your roommate or friend or family member has to do is double click and, voila, 620 00:37:13,800 --> 00:37:16,310 there's all the sketchy files that you tried to delete. 621 00:37:16,310 --> 00:37:19,590 Most of us at least know that you have to right click or Control click 622 00:37:19,590 --> 00:37:22,310 and empty the trash or something like that. 623 00:37:22,310 --> 00:37:25,000 But even then that doesn't quite do the trick 624 00:37:25,000 --> 00:37:28,010 because what happens when you have a file on your hard drive 625 00:37:28,010 --> 00:37:32,770 that represents some Word document or some JPEG, and this represents your hard drive, 626 00:37:32,770 --> 00:37:35,350 and let's say this sliver here represents that file, 627 00:37:35,350 --> 00:37:38,390 and it's composed of a whole bunch of 0s and 1s. 628 00:37:38,390 --> 00:37:42,470 What happens when you not only drag that file to the trash can or recycle bin 629 00:37:42,470 --> 00:37:48,020 but also empty it? Sort of nothing. 630 00:37:48,020 --> 00:37:49,640 It's not absolutely nothing now. 631 00:37:49,640 --> 00:37:54,290 Now it's just nothing because a little something happens in the form of this table. 632 00:37:54,290 --> 00:37:58,370 So there's some kind of database or table inside of a computer's memory 633 00:37:58,370 --> 00:38:03,850 that essentially has one column for files' names and one column for files' location, 634 00:38:03,850 --> 00:38:07,720 where this might be location 123, just a random number. 635 00:38:07,720 --> 00:38:14,560 So we might have something like x.jpeg and location 123. 636 00:38:14,560 --> 00:38:18,800 What happens then when you actually empty your trash? 637 00:38:18,800 --> 00:38:20,330 That goes away. 638 00:38:20,330 --> 00:38:23,610 But what doesn't go away is the 0s and 1s. 639 00:38:23,610 --> 00:38:26,270 >> So what's then the connection to pset4? 640 00:38:26,270 --> 00:38:31,240 Well, with pset4, just because we've accidentally erased the compact flash card 641 00:38:31,240 --> 00:38:35,750 that had all of these photos or just because it by bad luck became corrupted 642 00:38:35,750 --> 00:38:38,000 doesn't mean that the 0s and 1s aren't still there. 643 00:38:38,000 --> 00:38:40,410 Maybe a few of them are lost because something got corrupted 644 00:38:40,410 --> 00:38:43,320 in the sense that some 0s became 1s and 1s became 0s. 645 00:38:43,320 --> 00:38:47,240 Bad things can happen because of buggy software or defective hardware. 646 00:38:47,240 --> 00:38:50,370 But many of those bits, maybe even 100% of them, are still there. 647 00:38:50,370 --> 00:38:55,050 It's just that the computer or the camera doesn't know where JPEG1 started 648 00:38:55,050 --> 00:38:56,910 and where JPEG2 started. 649 00:38:56,910 --> 00:39:01,070 But if you, the programmer, know with a bit of savvy where those JPEGs are 650 00:39:01,070 --> 00:39:06,010 or what they look like so you can analyze the 0s and 1s and say JPEG, JPEG, 651 00:39:06,010 --> 00:39:09,440 you can write a program with essentially just a for or while loop 652 00:39:09,440 --> 00:39:12,820 that recovers each one of those files. 653 00:39:12,820 --> 00:39:16,030 So the lesson then is to start securely erasing your files 654 00:39:16,030 --> 00:39:18,340 if you'd like to avoid this altogether. Yes. 655 00:39:18,340 --> 00:39:21,010 >> [student] How come it says on your computer 656 00:39:21,010 --> 00:39:23,550 that you have more memory than you did before? 657 00:39:23,550 --> 00:39:27,820 Have more memory than you did before-- >>[student] More available memory. 658 00:39:27,820 --> 00:39:29,630 Oh. Good question. 659 00:39:29,630 --> 00:39:32,360 So why then after emptying the trash does your computer tell you 660 00:39:32,360 --> 00:39:34,910 that you have more free space than you did before? 661 00:39:34,910 --> 00:39:36,770 In a nutshell, because it's lying. 662 00:39:36,770 --> 00:39:40,740 More technically, you do have more space because now you have said 663 00:39:40,740 --> 00:39:43,680 you can put other stuff where that file once was. 664 00:39:43,680 --> 00:39:45,450 But that doesn't mean the bits are going away, 665 00:39:45,450 --> 00:39:48,590 and that doesn't mean the bits are being changed to all 0s, for instance, 666 00:39:48,590 --> 00:39:50,150 for your protection. 667 00:39:50,150 --> 00:39:54,640 So by contrast, if you securely erase files or physically destroy the device, 668 00:39:54,640 --> 00:39:57,300 that really is the only way sometimes around that. 669 00:39:57,300 --> 00:40:02,020 >> So why don't we leave on that semi-scary note, and we will see you on Monday. 670 00:40:02,020 --> 00:40:07,000 [applause] 671 00:40:07,780 --> 00:40:10,000 >> [CS50.TV]