1 00:00:00,000 --> 00:00:02,760 [WEEK 5] 2 00:00:02,760 --> 00:00:04,760 [David J. Malan, Harvard University] 3 00:00:04,760 --> 00:00:11,990 [This is CS50.] [CS50.TV] 4 00:00:11,990 --> 00:00:17,780 [Woman] He's lying; about what, I don't know. 5 00:00:17,780 --> 00:00:20,300 [Man] So what do we know? 6 00:00:20,300 --> 00:00:24,120 [Woman] That at 9:15, Ray Santoya was at the ATM. 7 00:00:24,120 --> 00:00:27,420 [Man] So the question is, what was he doing at 9:16? 8 00:00:27,420 --> 00:00:29,980 [Woman] Shooting the 9 mm at something. 9 00:00:29,980 --> 00:00:31,900 Maybe he saw the sniper. 10 00:00:31,900 --> 00:00:34,000 [Man] Or he was working with him. 11 00:00:34,000 --> 00:00:36,330 [Woman] Wait. Go back one. 12 00:00:36,330 --> 00:00:38,330 [Man] What do you see? 13 00:00:38,330 --> 00:00:44,520 [♫Suspenseful music♫] 14 00:00:44,520 --> 00:00:48,320 [Woman] Bring his face up. Full screen. 15 00:00:48,320 --> 00:00:51,230 [Man] His glasses. >>There's a reflection. 16 00:00:51,230 --> 00:01:00,810 [♫Suspenseful music♫] 17 00:01:00,810 --> 00:01:03,580 [Man] That's the Nuevita's baseball team. That's their logo. 18 00:01:03,580 --> 00:01:07,790 [Woman] And he's talking to whoever's wearing that jacket. 19 00:01:07,790 --> 00:01:13,730 >> [David Malan] So, this is CS50 week 5, and today we ruin a bit of television and movie for you. 20 00:01:13,730 --> 00:01:16,170 So whenever you're watching a show like this one here, 21 00:01:16,170 --> 00:01:19,910 and the cops say "Can you clean that up?" or "Enhance," 22 00:01:19,910 --> 00:01:21,900 there is no enhance in the real world. 23 00:01:21,900 --> 00:01:25,220 In fact, what you really get is a little something like this. 24 00:01:25,220 --> 00:01:27,570 I've pulled up one of the staff photos from the page. 25 00:01:27,570 --> 00:01:30,980 This is a program called Photoshop. This is 1 of 2 Bowdens, 26 00:01:30,980 --> 00:01:36,300 1 of 3 Bowdens actually, today, because we have Mrs. Bowden here as well, with Rob and Paul. 27 00:01:36,300 --> 00:01:41,950 But here is Rob on the screen, and if we zoom in on that glint he's always had in his eye, 28 00:01:41,950 --> 00:01:47,600 what you actually see is that what you see is what you get. 29 00:01:47,600 --> 00:01:51,690 This is "enhanced," so "CSI" have it a bit wrong. 30 00:01:51,690 --> 00:01:55,190 There's one other clip, if we can pick on "CSI" just a little bit longer. 31 00:01:55,190 --> 00:01:58,500 This one is a nice phrase to utter henceforth if you want to 32 00:01:58,500 --> 00:02:10,280 sound technical with your friends when, really, you're saying absolutely nothing. 33 00:02:10,280 --> 00:02:12,970 >> [Man] For weeks I've been investigating the Cabby Killer murders 34 00:02:12,970 --> 00:02:15,360 with a certain morbid fascination. 35 00:02:15,360 --> 00:02:17,160 [Woman #1] This is in real time. 36 00:02:17,160 --> 00:02:22,930 [Woman #2] I'll create a GUI interface using Visual Basic, see if I can track an IP address. 37 00:02:22,930 --> 00:02:29,570 >> [Malan] So audio out of sync aside, creating a GUI interface using Visual Basic 38 00:02:29,570 --> 00:02:31,820 to track an IP address is complete nonsense. 39 00:02:31,820 --> 00:02:33,840 These days you wouldn't use Visual Basic, 40 00:02:33,840 --> 00:02:38,920 there's no need for a GUI, and IP address was a technically accurate term. 41 00:02:38,920 --> 00:02:41,730 So keep an eye out for these, and one of my favorites: 42 00:02:41,730 --> 00:02:45,070 This one's a little more arcane, because you need to know a different language. 43 00:02:45,070 --> 00:02:47,860 There's a language called Objective-C, which is a superset of C. 44 00:02:47,860 --> 00:02:51,960 Which means it's C plus some additional features, among them object-oriented programming. 45 00:02:51,960 --> 00:02:55,070 And this is the language that Apple has popularized for iOS programming. 46 00:02:55,070 --> 00:02:58,760 And so here's a clip from a different show altogether, from "Numbers," 47 00:02:58,760 --> 00:03:02,450 that if you actually look closely on your TiVo and pause at the right moment, 48 00:03:02,450 --> 00:03:07,700 you'll see that what they're looking at is not quite what is being described. 49 00:03:07,700 --> 00:03:11,170 And let me try a different audio connector here and see if we can't 50 00:03:11,170 --> 00:03:13,780 keep the audio in sync this time. 51 00:03:13,780 --> 00:03:20,530 I give you "Numbers." 52 00:03:20,530 --> 00:03:23,240 >> [Man #1] It's a 32-bit IPv4 address. 53 00:03:23,240 --> 00:03:38,930 [Man #2] IP, that's the Internet. >>Private network. It's Anita's private network. 54 00:03:38,930 --> 00:03:43,810 [Malan] Okay. This is Objective-C, and it's for some kid's coloring program, 55 00:03:43,810 --> 00:03:51,140 as you can perhaps infer from the name of the variable there. 56 00:03:51,140 --> 00:03:54,410 So that, then, was "Numbers." So today and this week we introduce 57 00:03:54,410 --> 00:03:57,740 a little bit of the world of forensics and the context in the problems therefore. 58 00:03:57,740 --> 00:04:00,590 Today will be an abbreviated lecture because there's a special event in here 59 00:04:00,590 --> 00:04:05,530 afterward, so we'll take a peek, and tease both students and parents alike today 60 00:04:05,530 --> 00:04:07,420 with some of the things that are on the horizon. 61 00:04:07,420 --> 00:04:12,240 Among them, as of Monday, you will have a few more classmates. 62 00:04:12,240 --> 00:04:16,050 EdX, Harvard and MITs new online initiative for open courseware 63 00:04:16,050 --> 00:04:19,120 and more, is launching on Harvard's campus on Monday. 64 00:04:19,120 --> 00:04:21,490 Which means come Monday you will have--as of last count, 65 00:04:21,490 --> 00:04:26,210 86,000 additional classmates will be following along with CS50's lectures 66 00:04:26,210 --> 00:04:29,170 and sections and walkthroughs and problem sets. 67 00:04:29,170 --> 00:04:32,350 And as part of this, you will become members of the inaugural class of 68 00:04:32,350 --> 00:04:35,090 CS50 and now CS50x. 69 00:04:35,090 --> 00:04:39,310 >> As part of this, now, realize that there will be some upsides as well. 70 00:04:39,310 --> 00:04:43,790 To get ready for this, for the massive number of students, 71 00:04:43,790 --> 00:04:47,180 suffice it to say that even though we have 108 TFs and CAs, 72 00:04:47,180 --> 00:04:50,790 not quite the best student/teacher ratio once we hit 80,000 other students. 73 00:04:50,790 --> 00:04:52,850 So we're not going to be grading so many problem sets manually. 74 00:04:52,850 --> 00:04:55,920 So introduced this week in the problem set will be CS50 Check, 75 00:04:55,920 --> 00:04:58,450 which is going to be a command line utility within the appliance 76 00:04:58,450 --> 00:05:01,200 that you'll get once you update it later this weekend, 77 00:05:01,200 --> 00:05:03,200 and you'll be able to run a command, check 50, 78 00:05:03,200 --> 00:05:06,500 on your own pset, and you'll get some feedback as to whether your program is 79 00:05:06,500 --> 00:05:11,160 correct or incorrect according to various design specifications that we have provided. 80 00:05:11,160 --> 00:05:13,580 So more on that and the problem set specification and 81 00:05:13,580 --> 00:05:17,240 the CS50x classmates will be using this as well. 82 00:05:17,240 --> 00:05:19,230 >> So problem set 4 is all about forensics. 83 00:05:19,230 --> 00:05:21,940 And this piece was inspired by some real-life stuff, 84 00:05:21,940 --> 00:05:24,620 whereby when I was in graduate school, I interned for a while with 85 00:05:24,620 --> 00:05:28,650 the Middlesex County's District Attorney's Office doing forensic work 86 00:05:28,650 --> 00:05:31,650 with their lead forensic investigator, and what this amounted to 87 00:05:31,650 --> 00:05:35,260 is, I think I mentioned a few week's past, is the Mass State police or others 88 00:05:35,260 --> 00:05:39,000 would come in, they would drop off things like hard drives and CDs and floppy disks 89 00:05:39,000 --> 00:05:42,340 and the like, and then the goal of the forensics office was to ascertain whether 90 00:05:42,340 --> 00:05:44,600 there was or was not evidence of some sort. 91 00:05:44,600 --> 00:05:48,010 This was the Special Investigations Unit, so it was white-collar crime, 92 00:05:48,010 --> 00:05:52,350 it was more troubling sort of crimes, 93 00:05:52,350 --> 00:05:55,990 anything involving some kind of digital media; turns out that not that many people 94 00:05:55,990 --> 00:05:59,370 write an email saying "I did it." 95 00:05:59,370 --> 00:06:03,290 So quite often these forensics searches did not turn up all that much fruit, 96 00:06:03,290 --> 00:06:05,850 but sometimes people would write such emails. 97 00:06:05,850 --> 00:06:08,490 So sometimes the efforts were rewarded. 98 00:06:08,490 --> 00:06:14,420 >> But to lead up to this forensic pset, we'll be introducing in pset 4 a bit of graphics. 99 00:06:14,420 --> 00:06:18,260 So you probably take these things for granted, JPEGs, GIFs and the like these days, 100 00:06:18,260 --> 00:06:21,640 but if you really think about it, an image, much like Rob's face, 101 00:06:21,640 --> 00:06:24,430 could be modeled as a sequence of dots, or pixels. 102 00:06:24,430 --> 00:06:26,680 Now, in the case of Rob's face, there's all sorts of colors, 103 00:06:26,680 --> 00:06:29,940 and we started to see the individual dots, otherwide known as pixels, 104 00:06:29,940 --> 00:06:31,610 once we started to zoom in. 105 00:06:31,610 --> 00:06:35,590 But if we simplify the world a bit, and just say that this here is Rob 106 00:06:35,590 --> 00:06:40,560 in black and white, well, to represent black and white we can just use binary. 107 00:06:40,560 --> 00:06:44,960 And if we're going to use binary, 1 or 0, we can express this same image 108 00:06:44,960 --> 00:06:51,970 of Rob's smiling face with this pattern of bits: 11000011 represents 109 00:06:51,970 --> 00:06:55,160 white, white, black, black, black, black, white white. 110 00:06:55,160 --> 00:06:59,290 And so it's not a huge leap, then, to start talking about colorful photographs. 111 00:06:59,290 --> 00:07:01,920 Things that you would see on Facebook or take with a digital camera, 112 00:07:01,920 --> 00:07:04,730 but, certainly, when it comes to colors, you need more bits. 113 00:07:04,730 --> 00:07:08,470 And quite common in the world of photographs is to use not 1-bit color, 114 00:07:08,470 --> 00:07:12,730 as this suggests, but 24-bit color, where you actually get millions of colors. 115 00:07:12,730 --> 00:07:15,430 So as in the case when we zoomed in on Rob's eye, 116 00:07:15,430 --> 00:07:19,270 that was any number of millions of different colorful possibilities. 117 00:07:19,270 --> 00:07:22,260 >> So we'll introduce this in problem set 4 as well as in the walkthrough, 118 00:07:22,260 --> 00:07:27,050 which will be today at 3:30 instead of the usual 2:30 because of Friday's lecture here. 119 00:07:27,050 --> 00:07:29,930 But the video will be online, as usual, tomorrow. 120 00:07:29,930 --> 00:07:31,880 We'll also introduce you to another file format. 121 00:07:31,880 --> 00:07:34,150 So this is deliberately meant to look intimidating at first, 122 00:07:34,150 --> 00:07:38,980 but this is just some documentation for a C struct. 123 00:07:38,980 --> 00:07:42,280 It turns out that Microsoft, years ago, helped popularize this format, 124 00:07:42,280 --> 00:07:46,630 called the bitmap file format, BMP, and this was a super-simple, 125 00:07:46,630 --> 00:07:50,390 colorful graphical file format that was used for quite some time 126 00:07:50,390 --> 00:07:53,640 and sometimes still for wallpapers on desktops. 127 00:07:53,640 --> 00:07:57,410 If you think back to Windows XP and the rolling hills and blue sky, 128 00:07:57,410 --> 00:08:00,660 that was typically a BMP, or bitmap image, and bitmaps 129 00:08:00,660 --> 00:08:03,340 are fun for us because they have a bit more complexity. 130 00:08:03,340 --> 00:08:05,640 It's not quite as simple as this grid of 0's and 1's; 131 00:08:05,640 --> 00:08:10,680 instead, you have things like a header at the start of a file. 132 00:08:10,680 --> 00:08:15,520 So in other words, inside a .bmp file is a whole bunch of 0's and 1's, 133 00:08:15,520 --> 00:08:18,070 but there's some additional 0's and 1's in there. 134 00:08:18,070 --> 00:08:21,450 And it turns out that what we've probably taken for granted for years, 135 00:08:21,450 --> 00:08:27,040 file formats like .doc or .xls or .mp3 or .mp4, 136 00:08:27,040 --> 00:08:29,910 whatever the file formats that you're familiar with. 137 00:08:29,910 --> 00:08:31,900 Well, what does it even mean to be a file format? 138 00:08:31,900 --> 00:08:35,740 Because at the end of the day, all of these files we use have just 0's and 1's 139 00:08:35,740 --> 00:08:39,950 and maybe those 0's and 1's represent a,b,c, through ASCII or the like, 140 00:08:39,950 --> 00:08:42,030 but through the end of the day, it's just 0's and 1's. 141 00:08:42,030 --> 00:08:45,300 >> So humans just occasionally decide to invent a new file format 142 00:08:45,300 --> 00:08:49,420 where they standardize what patterns of bits will actually mean. 143 00:08:49,420 --> 00:08:52,790 And in this case here, the folks who designed the bitmap file format 144 00:08:52,790 --> 00:08:58,260 said that at the very first byte in a bitmap file, as denoted by offset 0, there, 145 00:08:58,260 --> 00:09:02,320 there's going to be some cryptically named variable called bfType, 146 00:09:02,320 --> 00:09:06,510 which just stands for bitmap file type; what type of bitmap file this is. 147 00:09:06,510 --> 00:09:10,780 You can infer, perhaps, from the second row that offset 2, byte number 2, 148 00:09:10,780 --> 00:09:15,980 has a pattern of 0's and 1's that represents what? 149 00:09:15,980 --> 00:09:18,320 The size of something, and it goes on from there. 150 00:09:18,320 --> 00:09:20,660 So in problem set 4, you'll be walked through some of these things. 151 00:09:20,660 --> 00:09:24,480 >> We won't end up caring about all of them, but notice it starts to get interesting 152 00:09:24,480 --> 00:09:30,780 around line or byte 54, rgbtBlue, Green and Red. 153 00:09:30,780 --> 00:09:35,280 If you've ever heard the acronym RGB, red green blue, this is a reference to that. 154 00:09:35,280 --> 00:09:37,840 Because it turns out you can paint all the colors of the rainbow 155 00:09:37,840 --> 00:09:41,580 with some combination of red and blue and green. 156 00:09:41,580 --> 00:09:46,560 And, in fact, the parents in the room might recall some of the earliest projectors. 157 00:09:46,560 --> 00:09:49,360 These days, you just see 1 bright light coming out of a lens. 158 00:09:49,360 --> 00:09:52,870 But back in the day, you had the red lens, the blue lens, and the green lens 159 00:09:52,870 --> 00:09:56,620 and together they aimed at the screen and formed a colorful picture. 160 00:09:56,620 --> 00:09:59,590 And quite often middle schools and high schools would have those lenses 161 00:09:59,590 --> 00:10:02,680 ever-so-slightly askew, so you were sort of seeing double or triple images, 162 00:10:02,680 --> 00:10:07,500 but that was the idea. You had red and green and blue light painting a picture. 163 00:10:07,500 --> 00:10:09,570 And that same principle is used in computers. 164 00:10:09,570 --> 00:10:12,000 >> So among the challenges, then, for you in problem set 4 165 00:10:12,000 --> 00:10:16,080 are going to be a few things; one is to actually resize an image. 166 00:10:16,080 --> 00:10:18,050 To take in a pattern of 0's and 1's, 167 00:10:18,050 --> 00:10:22,840 figure out which chunks of 0's and 1's represent what in a structure like this, 168 00:10:22,840 --> 00:10:26,800 and then figure out how to replicate the pixels: the reds, the blues, the greens 169 00:10:26,800 --> 00:10:32,460 inside so that when a picture looks like this initially, might look like this instead after that. 170 00:10:32,460 --> 00:10:35,590 Among the other challenges, too, is going to be that you'll be handed 171 00:10:35,590 --> 00:10:38,900 a forensic image of an actual file from a digital camera 172 00:10:38,900 --> 00:10:42,410 and on that camera, once upon a time, were a whole bunch of photos. 173 00:10:42,410 --> 00:10:47,030 The problem is, we accidentally erased or had the image corrupted somehow. 174 00:10:47,030 --> 00:10:51,040 Bad things happen with digital cameras, and so we quickly copied all of the 0's and 1's 175 00:10:51,040 --> 00:10:55,410 off of that card for you, saved them all in 1 big file, and then we'll hand them to you 176 00:10:55,410 --> 00:11:00,000 in problem set 4 so that you can write a program in C with which to recover 177 00:11:00,000 --> 00:11:02,660 all of those JPEGs, ideally. 178 00:11:02,660 --> 00:11:06,280 And it turns out that JPEGs, even though they're somewhat of a complex file format, 179 00:11:06,280 --> 00:11:09,580 they're much more complex than this smiling face here. 180 00:11:09,580 --> 00:11:14,320 It turns out that every JPEG starts with the same patterns of 0's and 1's. 181 00:11:14,320 --> 00:11:18,820 So using a while loop or a for loop or similar, 182 00:11:18,820 --> 00:11:22,350 you can iterate over all the 0's and 1's in this forensic image 183 00:11:22,350 --> 00:11:26,670 and every time you see the special pattern that's defined in the problem set's specification, 184 00:11:26,670 --> 00:11:29,770 you can assume, 'Oh, here is, with very high probability, 185 00:11:29,770 --> 00:11:33,520 the start of a JPEG,' and as soon as you find the same pattern, 186 00:11:33,520 --> 00:11:36,050 some number of bytes or kilobytes or megabytes later, 187 00:11:36,050 --> 00:11:40,550 you can assume, 'Ooh! Here is a second JPEG, the photo I took after the first one. 188 00:11:40,550 --> 00:11:44,720 Let me stop reading that first file, start writing this new one.' 189 00:11:44,720 --> 00:11:49,980 And the output of your program for pset 4 is going to be as many as 50 JPEGs. 190 00:11:49,980 --> 00:11:52,400 And if it's not 50 JPEGs, you have a bit of a loop. 191 00:11:52,400 --> 00:11:55,580 If you have an infinite number of JPEGs, you have an infinite loop. 192 00:11:55,580 --> 00:11:58,280 So that, too, will be quite a common case. 193 00:11:58,280 --> 00:12:00,280 That's what's on the horizon. 194 00:12:00,280 --> 00:12:03,740 >> Quiz 0, behind us. Realize, per my email, that invariably there's folks 195 00:12:03,740 --> 00:12:06,820 who are both happy, sort of neutral, and sad around quiz 0 time. 196 00:12:06,820 --> 00:12:10,160 And please do reach out to me, the head TFs, Zamyla, your own TF 197 00:12:10,160 --> 00:12:14,120 or one of the CAs that you know if you would like to discuss how things went. 198 00:12:14,120 --> 00:12:16,460 >> So to impress the parents here in the room, 199 00:12:16,460 --> 00:12:23,990 what is the CS50 library? Good job. 200 00:12:23,990 --> 00:12:32,280 What's the CS50 library? Yeah? [Student answers, unintelligible] 201 00:12:32,280 --> 00:12:35,730 >>Okay, good. So it's a prewritten set of code that we, the staff, wrote, 202 00:12:35,730 --> 00:12:38,460 we provide to you, to provide some common functionalities. 203 00:12:38,460 --> 00:12:42,290 Stuff like get me a string; get me an int, all of the functions that are listed here. 204 00:12:42,290 --> 00:12:45,260 Starting now, we start to really take these training wheels off. 205 00:12:45,260 --> 00:12:48,230 So we're going to start to take away a "string" from you, 206 00:12:48,230 --> 00:12:52,790 which, recall, was just a synonym for what actual data type? char *. 207 00:12:52,790 --> 00:12:57,020 So for parents, that was probably--that's good, so char * we'll start to see 208 00:12:57,020 --> 00:13:00,810 on the screen all the more as we remove "string" from our vocabulary, 209 00:13:00,810 --> 00:13:02,760 at least when it comes to actually writing code. 210 00:13:02,760 --> 00:13:06,240 Similarly, we'll stop using some of these functions as much, 211 00:13:06,240 --> 00:13:08,390 because our programs are going to get more sophisticated 212 00:13:08,390 --> 00:13:11,370 rather than just write programs that sit there with a prompt blinking, 213 00:13:11,370 --> 00:13:13,580 waiting for the user to type something in. 214 00:13:13,580 --> 00:13:15,220 You'll get your inputs from elsewhere. 215 00:13:15,220 --> 00:13:18,720 For instance, you'll get them from a series of bits on the local hard drive. 216 00:13:18,720 --> 00:13:23,340 You'll instead get them in the future from a network connection, some website somewhere. 217 00:13:23,340 --> 00:13:27,460 So let's peel back this layer for the first time, and pull up the CS50 appliance 218 00:13:27,460 --> 00:13:32,300 and this file called CS50.h, which you've been sharp including for weeks. 219 00:13:32,300 --> 00:13:34,380 >> But let's actually see what's inside of this. 220 00:13:34,380 --> 00:13:38,250 So the top of the file in blue is just a whole bunch of comments, 221 00:13:38,250 --> 00:13:41,340 warranty information and licencing. This is sort of a common paradigm 222 00:13:41,340 --> 00:13:44,600 in software, because a lot of software these days is what's called "open source," 223 00:13:44,600 --> 00:13:46,940 which means that someone has written the code 224 00:13:46,940 --> 00:13:50,060 and made it freely available, not just to run and to use, 225 00:13:50,060 --> 00:13:53,660 but actually read and alter and integrate into your own work. 226 00:13:53,660 --> 00:13:55,790 So that's what you've been using, open source software, 227 00:13:55,790 --> 00:13:58,030 albeit in a very small form. 228 00:13:58,030 --> 00:14:01,860 If I scroll down past the comments, though, we'll start to see some more familiar things. 229 00:14:01,860 --> 00:14:08,090 So notice at the top here, that the CS50.h file includes a whole bunch of header files. 230 00:14:08,090 --> 00:14:11,160 Now, most of these we haven't seen before, but one is 231 00:14:11,160 --> 00:14:15,640 familiar; which of these have we seen, albeit briefly, thus far? 232 00:14:15,640 --> 00:14:18,720 Yeah, standard libraries. Stdlib.h has malloc, 233 00:14:18,720 --> 00:14:21,590 so once we started talking about dynamic memory allocation, 234 00:14:21,590 --> 00:14:24,960 which we'll come back to next week as well, we started including that file. 235 00:14:24,960 --> 00:14:29,660 It turns out that bool and true and false don't actually exist in C, per se, 236 00:14:29,660 --> 00:14:32,460 unless you include this file here. 237 00:14:32,460 --> 00:14:35,770 So we have, for weeks, been including standard bool.h 238 00:14:35,770 --> 00:14:39,020 so that you can use the notion of a bool, true or false. 239 00:14:39,020 --> 00:14:41,830 Without this, you would have to sort of fake it and use an int 240 00:14:41,830 --> 00:14:45,920 and just arbitrarily assume that 0 is false and 1 is true. 241 00:14:45,920 --> 00:14:49,980 >> Now, if we scroll down further, here is our definition of a string. 242 00:14:49,980 --> 00:14:54,820 It turns out, as we've said before, that where this * is doesn't really matter. 243 00:14:54,820 --> 00:14:56,750 You can even have space all around. 244 00:14:56,750 --> 00:15:01,550 We, this semester, have been promoting it as this to make clear that the * has to do with the type. 245 00:15:01,550 --> 00:15:05,370 But realize, just as common, if not a little more common, is to put it there 246 00:15:05,370 --> 00:15:07,480 but functionally it's the same thing. 247 00:15:07,480 --> 00:15:11,070 But now, if we read down further, let's take a look at say, GetInt, 248 00:15:11,070 --> 00:15:15,350 because we used that, perhaps, before anything else this semester. 249 00:15:15,350 --> 00:15:19,620 And here is GetInt. This is what? 250 00:15:19,620 --> 00:15:24,650 This is the prototype. So often, we have put prototypes at the tops of our .c files, 251 00:15:24,650 --> 00:15:28,190 but you can also put prototypes in header files, .h files, 252 00:15:28,190 --> 00:15:32,110 like this one here, so that when you write some functions 253 00:15:32,110 --> 00:15:36,790 that you want other people to be able to use, which is exactly the case with the CS50 library, 254 00:15:36,790 --> 00:15:40,900 you not only implement your functions in something like CS50.c, 255 00:15:40,900 --> 00:15:46,720 you also put the prototypes not at the top of that file, but at the top of a header file, 256 00:15:46,720 --> 00:15:50,810 then that header file is what friends and colleagues include, 257 00:15:50,810 --> 00:15:52,800 with sharp include in their own code. 258 00:15:52,800 --> 00:15:55,440 So all this time you've been including all of these prototypes 259 00:15:55,440 --> 00:15:59,870 effectively at the top of your file, but by way of this sharp include mechanism 260 00:15:59,870 --> 00:16:03,320 that essentially copies and pastes this file into your own. 261 00:16:03,320 --> 00:16:06,400 Now, here is some fairly detailed documentation. 262 00:16:06,400 --> 00:16:08,880 >> We've pretty much taken for granted that GetInt gets an int, 263 00:16:08,880 --> 00:16:10,740 but it turns out there's some corner cases, right? 264 00:16:10,740 --> 00:16:14,320 What if the user types in a number that's way too big? 265 00:16:14,320 --> 00:16:17,350 A quintillion, that just can't fit inside of an int? 266 00:16:17,350 --> 00:16:21,180 What is the expected behavior? Well, ideally, it's predictable. 267 00:16:21,180 --> 00:16:23,460 So in this case, if you actually read the fine print, 268 00:16:23,460 --> 00:16:27,850 you'll see that if the line can't be read, this returns INT_MAX. 269 00:16:27,850 --> 00:16:30,800 We've never talked about this, but based on its capitalization, 270 00:16:30,800 --> 00:16:33,030 what is it, probably? 271 00:16:33,030 --> 00:16:36,610 It's a constant, so it's some special constant that's probably declared 272 00:16:36,610 --> 00:16:39,460 in one of those header files that's up higher in the file, 273 00:16:39,460 --> 00:16:43,400 and INT_MAX is probably something like, roughly, 2 billion. 274 00:16:43,400 --> 00:16:48,160 The idea being that because we need to somehow signify that something went wrong, 275 00:16:48,160 --> 00:16:51,090 we, yes, have 4 billion numbers at our disposal, 276 00:16:51,090 --> 00:16:53,980 negative 2 billion on up to 2 billion, give or take. 277 00:16:53,980 --> 00:16:58,030 Well, what is common in programming is you steal just one of those numbers. 278 00:16:58,030 --> 00:17:02,250 Maybe 0, maybe 2 billion, maybe negative 2 billion. 279 00:17:02,250 --> 00:17:06,720 So you spend one of your possible values so that you can commit to the world 280 00:17:06,720 --> 00:17:10,089 that if something goes wrong, I will return this super-big value. 281 00:17:10,089 --> 00:17:13,329 But you don't want the user typing something cryptic like "2, 3, 4. . . " 282 00:17:13,329 --> 00:17:17,079 of really big number, where you generalize instead as a constant. 283 00:17:17,079 --> 00:17:19,380 So really, if you were being anal the past few weeks, 284 00:17:19,380 --> 00:17:23,800 anytime you call GetInt, you should have been checking with an if condition. 285 00:17:23,800 --> 00:17:27,109 Did the user type in INT_MAX, or more specifically, 286 00:17:27,109 --> 00:17:29,900 did GetInt return INT_MAX? Because if it did, 287 00:17:29,900 --> 00:17:35,140 that actually means they didn't type it; something went wrong in this case. 288 00:17:35,140 --> 00:17:38,970 So this is what's generally known as a "sentinel" value, which just means special. 289 00:17:38,970 --> 00:17:41,020 >> Well, let's now turn in to the .c files. 290 00:17:41,020 --> 00:17:44,500 The C file has existed in the appliance for some time, 291 00:17:44,500 --> 00:17:47,540 and, in fact, the appliance has it precompiled for you 292 00:17:47,540 --> 00:17:49,720 into that thing we called "object code," 293 00:17:49,720 --> 00:17:52,940 but it just doesn't matter to you where it is because the system knows, 294 00:17:52,940 --> 00:17:54,780 in this case, where it is, the appliance. 295 00:17:54,780 --> 00:18:00,620 But let's scroll down now to GetInt, and see how GetInt has been working all this time. 296 00:18:00,620 --> 00:18:02,380 So here we have similar comments from before. 297 00:18:02,380 --> 00:18:04,930 Let me zoom in on just the code portion, 298 00:18:04,930 --> 00:18:07,410 and what we have for GetInt is the following. 299 00:18:07,410 --> 00:18:12,770 It takes no input and it returns an int, while (true), so we have a deliberate infinite loop 300 00:18:12,770 --> 00:18:16,560 but, presumably, we'll break out of this somehow, or return from within this. 301 00:18:16,560 --> 00:18:19,890 So let's see how this works. Well, we seem to be using GetString 302 00:18:19,890 --> 00:18:22,550 in this first line inside the loop, 166. 303 00:18:22,550 --> 00:18:25,320 This is now good practice because under what circumstances 304 00:18:25,320 --> 00:18:30,820 could GetString return this special keyword, NULL? 305 00:18:30,820 --> 00:18:38,460 If something goes wrong. What could go wrong when you call something like GetString? 306 00:18:38,460 --> 00:18:42,550 Yeah? [Student answer, unintelligible] >>Yeah. So maybe malloc fails. 307 00:18:42,550 --> 00:18:45,310 Somewhere underneath the hood GetString is calling malloc, 308 00:18:45,310 --> 00:18:48,210 which allocates memory, which lets the computer store 309 00:18:48,210 --> 00:18:50,950 all of the characters that the user types into the keyboard. 310 00:18:50,950 --> 00:18:53,270 And suppose the user had a whole lot of free time 311 00:18:53,270 --> 00:18:56,470 and typed more, for instance, than 2 billion characters. 312 00:18:56,470 --> 00:18:59,600 More characters than the computer even has RAM. 313 00:18:59,600 --> 00:19:02,350 Well, GetString has to be able to signify that to you, 314 00:19:02,350 --> 00:19:05,650 even if this is a super, super uncommon corner case. 315 00:19:05,650 --> 00:19:08,490 It has to somehow be able to handle this, and so GetString, 316 00:19:08,490 --> 00:19:11,850 if we go back and read its documentation, does, in fact, return NULL. 317 00:19:11,850 --> 00:19:16,150 Now if GetString fails by returning NULL, GetInt is going to fail 318 00:19:16,150 --> 00:19:19,370 by returning INT_MAX, just as a sentinel. 319 00:19:19,370 --> 00:19:22,650 These are just human conventions. The only way you would know this is the case 320 00:19:22,650 --> 00:19:24,840 is by reading the documentation. 321 00:19:24,840 --> 00:19:28,200 So let's scroll down to where the int is actually GotInt. 322 00:19:28,200 --> 00:19:34,220 >> So if I scroll down a bit further, in line 170 we have a comment above these lines. 323 00:19:34,220 --> 00:19:38,470 So we declare, in 172, an int n and a char c, and then this new function 324 00:19:38,470 --> 00:19:41,870 which some of you have stumbled across before, but sscanf. 325 00:19:41,870 --> 00:19:44,190 This stands for string scan f. 326 00:19:44,190 --> 00:19:48,580 In other words, give me a string and I will scan it for pieces of information of interest. 327 00:19:48,580 --> 00:19:53,820 So what does that mean? Well, suppose that I type in, literally, 1 2 3 at the keyboard, 328 00:19:53,820 --> 00:19:59,730 and then hit enter. What is the data type of 1 2 3 when returned by GetString? 329 00:19:59,730 --> 00:20:05,010 It's obviously a string, right? I got a string, so 1 2 3 is really "1 2 3" 330 00:20:05,010 --> 00:20:07,260 with the \0 at the end of it. That is not an int. 331 00:20:07,260 --> 00:20:10,420 That's not a number. It looks like a number but it's not actually. 332 00:20:10,420 --> 00:20:14,680 So what does GetInt have to do? It has to scan that string left to right, 333 00:20:14,680 --> 00:20:19,010 1 2 3 \0, and somehow convert it to an actual integer. 334 00:20:19,010 --> 00:20:21,010 Now, you could figure out how to do this. 335 00:20:21,010 --> 00:20:24,240 If you think back to pset 2, you presumably got a little comfortable 336 00:20:24,240 --> 00:20:26,810 with Caesar or vigenere so you can iterate over a string, 337 00:20:26,810 --> 00:20:29,800 you can convert chars to ints with pick. That's a whole lot of work. 338 00:20:29,800 --> 00:20:32,800 Why not call a function like sscanf that does that for you? 339 00:20:32,800 --> 00:20:37,520 So sscanf expects an argument, in this case called line, which is a string. 340 00:20:37,520 --> 00:20:41,310 You then specify, in quotes, very similar to printf, 341 00:20:41,310 --> 00:20:44,960 what do you expect to see in this string? 342 00:20:44,960 --> 00:20:52,980 What I'm saying here is, I expect to see a decimal number and maybe a character. 343 00:20:52,980 --> 00:20:54,990 And we'll see why this is the case in just a moment. 344 00:20:54,990 --> 00:20:58,440 It turns out that this notation is now reminiscent of stuff 345 00:20:58,440 --> 00:21:00,840 we started talking about just over a week ago. 346 00:21:00,840 --> 00:21:05,430 >> What is &n and &c doing for us here? [Student answers, unintelligible] 347 00:21:05,430 --> 00:21:07,610 >>Yeah. It's giving me the address of n and address of c. 348 00:21:07,610 --> 00:21:10,440 Now, why is that important? Well, you know that with functions in C 349 00:21:10,440 --> 00:21:13,440 you can always return a value or no value at all. 350 00:21:13,440 --> 00:21:16,630 You can return an int, a string, a float, a char, whatever. 351 00:21:16,630 --> 00:21:21,150 Or you can return void, but you can only return 1 thing maximally. 352 00:21:21,150 --> 00:21:26,100 But here we want sscanf to return me maybe an int, a decimal number, 353 00:21:26,100 --> 00:21:29,240 and also a char, and I'll explain why the char in a moment. 354 00:21:29,240 --> 00:21:34,250 So you effectively want f to return 2 things; that's just not possible in C. 355 00:21:34,250 --> 00:21:38,460 So you can work around that by passing in 2 addresses, 356 00:21:38,460 --> 00:21:43,710 because as soon as you hand a function 2 addresses, what can that function do with them? 357 00:21:43,710 --> 00:21:49,880 It can write to those addresses. You can use the * operation and "go there" to each of those addresses. 358 00:21:49,880 --> 00:21:54,320 It's sort of this backdoor mechanism, but very common for changing the values of variables 359 00:21:54,320 --> 00:21:58,020 in more than just 1 place, in this case 2. 360 00:21:58,020 --> 00:22:04,590 Now, notice I'm checking for == to1, and then returning n if that does, in fact, evaluate to true. 361 00:22:04,590 --> 00:22:09,340 So what's going on? Well, technically, all we really want to happen in GetInt is this. 362 00:22:09,340 --> 00:22:12,340 We want to parse, so to speak; we want to read the string 363 00:22:12,340 --> 00:22:16,210 "1 2 3" and if it looks like there's a number there, 364 00:22:16,210 --> 00:22:21,360 what we're telling sscanf to do is put that number, 1 2 3, in this variable n for me. 365 00:22:21,360 --> 00:22:26,060 Why, then, did I have this as well? 366 00:22:26,060 --> 00:22:33,750 What is the role of also saying, sscanf, you might also get a character here. 367 00:22:33,750 --> 00:22:36,890 [Student speaking, unintelligible] >>Not--a decimal point could work. 368 00:22:36,890 --> 00:22:40,650 Let's hold that thought for a moment. What else? 369 00:22:40,650 --> 00:22:42,570 [Student, unintelligible] >>So, good thought, it could be the NULL character. 370 00:22:42,570 --> 00:22:44,970 It's actually not, in this case. Yeah? [Student, unintelligible] 371 00:22:44,970 --> 00:22:47,100 >> >> ASCII. Or, let me generalize even further. 372 00:22:47,100 --> 00:22:49,670 The %c there is just for error-checking. 373 00:22:49,670 --> 00:22:52,510 We don't want there to be character after the number, 374 00:22:52,510 --> 00:22:54,980 but what this allows me to do is the following: 375 00:22:54,980 --> 00:23:01,270 It turns out that sscanf, besides storing values in n and c, in this example here, 376 00:23:01,270 --> 00:23:08,170 what it also does is it returns the number of variables it put values in. 377 00:23:08,170 --> 00:23:13,330 So if you only type in 1 2 3, then only the %d is going to match 378 00:23:13,330 --> 00:23:18,830 and only n gets stored with a value like 1 2 3 and nothing gets put in c; 379 00:23:18,830 --> 00:23:20,870 c remains a garbage value, so to speak. 380 00:23:20,870 --> 00:23:23,550 Garbage because it's never been initialized as some value. 381 00:23:23,550 --> 00:23:29,390 So in that case, sscanf returns 1, because I populated one of those pointers, 382 00:23:29,390 --> 00:23:33,650 in which case, great. I have an int, so I free the line to free up the memory 383 00:23:33,650 --> 00:23:37,150 that GetString actually allocated, and then I return n. 384 00:23:37,150 --> 00:23:42,210 Else, if you ever wondered where that retry statement comes from, comes from right here. 385 00:23:42,210 --> 00:23:45,770 If, by contrast, I type in 1 2 3 foo, 386 00:23:45,770 --> 00:23:48,640 just some random sequence of text, sscanf is going to see, 387 00:23:48,640 --> 00:23:51,500 ooh, number, ooh, number, ooh, number, ooh--f. 388 00:23:51,500 --> 00:23:54,190 And it's going to put the 1 2 3 in n. 389 00:23:54,190 --> 00:23:59,970 It's going to put the f in c, and then return 2. 390 00:23:59,970 --> 00:24:02,980 So we have, just using the basic definition of scanf's behavior, 391 00:24:02,980 --> 00:24:06,170 a very simple way--well, complex at first glance, but at the end of the day, 392 00:24:06,170 --> 00:24:11,460 fairly simple mechanism of saying, is there an int, and if so, is that the only thing that I found? 393 00:24:11,460 --> 00:24:14,950 And the white space here is deliberate. If you read the documentation for sscanf, 394 00:24:14,950 --> 00:24:18,690 it tells you that if you include a piece of white space at the beginning or the end, 395 00:24:18,690 --> 00:24:24,990 sscanf too will allow the user, for whatever reason, to hit spacebar 1 2 3, and that will be legitimate. 396 00:24:24,990 --> 00:24:28,310 It won't yell at the user just because they hit the spacebar at the beginning or the end, 397 00:24:28,310 --> 00:24:32,160 which is just a little more user-friendly. 398 00:24:32,160 --> 00:24:34,160 >> Any questions, then, on GetInts? Yeah? 399 00:24:34,160 --> 00:24:36,820 [Student question, unintelligible] 400 00:24:36,820 --> 00:24:40,740 >>Good question. What if you just typed in a char, like f, and hit enter 401 00:24:40,740 --> 00:24:47,830 without ever typing 1 2 3; what do you think the behavior of this line of code would then be? 402 00:24:47,830 --> 00:24:50,500 So sscanf can cover that too, because in that case, 403 00:24:50,500 --> 00:24:56,280 it's not going to fill n or c; it's going to instead return 0. 404 00:24:56,280 --> 00:25:01,540 In which case, I'm also catching that scenario, because the expected value I want is 1. 405 00:25:01,540 --> 00:25:07,310 I only want 1, and only 1 thing to be filled. Good question. Others? 406 00:25:07,310 --> 00:25:09,610 >> All right, so let's not go through all of the functions in here, 407 00:25:09,610 --> 00:25:11,820 but the one that seems to be, perhaps, of remaining interest 408 00:25:11,820 --> 00:25:14,530 is GetString because it turns out that GetFloat, GetInt, 409 00:25:14,530 --> 00:25:19,490 GetDouble, GetLongLong all punt a lot of their functionality to GetString. 410 00:25:19,490 --> 00:25:22,860 So let's take a look at how he is implemented here. 411 00:25:22,860 --> 00:25:27,040 This one looks a little complex but it uses the same fundamentals 412 00:25:27,040 --> 00:25:29,680 that we started talking about last week. So in GetString, 413 00:25:29,680 --> 00:25:32,670 which takes no argument as per the void up here, 414 00:25:32,670 --> 00:25:37,110 and it returns a string; so I am declaring a string called buffer. 415 00:25:37,110 --> 00:25:39,670 I don't really know what that's going to be used for yet, but we'll see. 416 00:25:39,670 --> 00:25:42,950 Looks like capacity is, by default, 0; not quite sure where this is going. 417 00:25:42,950 --> 00:25:44,920 Not sure what n's going to be used for yet. 418 00:25:44,920 --> 00:25:47,860 But now it's getting a little more interesting, so in line 243, 419 00:25:47,860 --> 00:25:51,760 we declare an int c, this is sort of a stupid detail. 420 00:25:51,760 --> 00:25:58,080 A char is 8 bits, and 8 bits can store how many different values? 421 00:25:58,080 --> 00:26:03,310 256. The problem is, if you want to have 256 different ASCII characters, 422 00:26:03,310 --> 00:26:06,210 which there are, if you think back, and this is not something to memorize. 423 00:26:06,210 --> 00:26:09,100 But if you think back to that big ASCII chart we had weeks ago, 424 00:26:09,100 --> 00:26:13,780 there were, in that case, 128 or 256 ASCII characters. 425 00:26:13,780 --> 00:26:16,220 We used all the patterns of 0's and 1's up. 426 00:26:16,220 --> 00:26:19,410 That's a problem if you want to be able to detect an error. 427 00:26:19,410 --> 00:26:23,290 Because if you're already using 256 values for your characters, 428 00:26:23,290 --> 00:26:26,390 you didn't really plan ahead, because now you have no way of saying, 429 00:26:26,390 --> 00:26:29,750 "This is not a legit character; this is some erroneous message." 430 00:26:29,750 --> 00:26:32,430 So what the world does is, they use the next biggest value, 431 00:26:32,430 --> 00:26:35,790 something like an int so that you have a crazy number of bits, 432 00:26:35,790 --> 00:26:39,610 32 for 4 billion possible values, so that you can simply end up using, 433 00:26:39,610 --> 00:26:44,800 essentially, 257 of them, 1 of which has some special meaning as an error. 434 00:26:44,800 --> 00:26:49,190 >> So let's see how this works. In line 246, I have this big while loop 435 00:26:49,190 --> 00:26:54,530 that is calling fgetc; f meaning file, getc, and then stdin. 436 00:26:54,530 --> 00:26:59,030 Turns out this is just the more precise way of saying "read input from the keyboard." 437 00:26:59,030 --> 00:27:02,730 Standard input means keyboard, standard output means screen, 438 00:27:02,730 --> 00:27:06,920 and standard error, which we'll see in pset 4, means the screen, 439 00:27:06,920 --> 00:27:09,670 but a special part of the screen so that it's not conflated 440 00:27:09,670 --> 00:27:13,760 with actual output that you intended to print; but more on that in the future. 441 00:27:13,760 --> 00:27:19,430 So fgetc just means read one character from the keyboard, and store it where? 442 00:27:19,430 --> 00:27:24,000 Store it in c, and then check, so I'm just using some boolean conjunctions here, 443 00:27:24,000 --> 00:27:28,430 check that it doesn't equal \n, so the user has hit enter. 444 00:27:28,430 --> 00:27:31,510 We want to stop at that point, end of the loop, and we also want to check 445 00:27:31,510 --> 00:27:36,170 for the special constant, EOF, which if you know or guess-- what does it stand for? 446 00:27:36,170 --> 00:27:39,860 End of file. So this is kind of nonsensical, because if I'm typing at the keyboard, 447 00:27:39,860 --> 00:27:41,900 there's really no file involved in this, 448 00:27:41,900 --> 00:27:44,330 but this is just sort of the generic term used to mean 449 00:27:44,330 --> 00:27:50,320 that nothing else is coming from the human's fingers. EOF. End of file. 450 00:27:50,320 --> 00:27:52,600 As an aside, if you've ever hit control d at your keyboard, 451 00:27:52,600 --> 00:27:54,680 not that you would have yet; you've hit control c. 452 00:27:54,680 --> 00:27:57,920 But control d sends this special constant called EOF. 453 00:27:57,920 --> 00:28:03,100 >> So now we just have some dynamic memory allocation. 454 00:28:03,100 --> 00:28:06,460 So if n + 1 > capacity, now I'll explain n. 455 00:28:06,460 --> 00:28:09,380 n is just how many bytes are currently in the buffer, 456 00:28:09,380 --> 00:28:11,970 the string that you're currently building up from the user. 457 00:28:11,970 --> 00:28:16,240 If you have more characters in your buffer than you have capacity in the buffer, 458 00:28:16,240 --> 00:28:20,760 intuitively, what we need to do then is allocate more capacity. 459 00:28:20,760 --> 00:28:24,490 I'm going to skim over some of the arithmetic here 460 00:28:24,490 --> 00:28:26,900 and focus only on this function here. 461 00:28:26,900 --> 00:28:29,170 You know what malloc is, or at least generally familiar. 462 00:28:29,170 --> 00:28:32,380 Take a guess what realloc does. [Student answer, unintelligible] 463 00:28:32,380 --> 00:28:35,690 >>Yeah. And it's not quite adding memory; it reallocates memory as follows: 464 00:28:35,690 --> 00:28:40,530 If there's still room at the end of the string to give you more of that memory 465 00:28:40,530 --> 00:28:43,370 than it originally gives you, then you'll get that additional memory. 466 00:28:43,370 --> 00:28:46,640 So you can just putting the strings characters back to back to back to back. 467 00:28:46,640 --> 00:28:49,290 But if that's not the case, because you waited too long 468 00:28:49,290 --> 00:28:51,700 and something random got plopped into memory there, but there's extra 469 00:28:51,700 --> 00:28:56,480 memory down here, that's okay. Realloc is going to do all the heavy lifting for you, 470 00:28:56,480 --> 00:28:58,810 move the string you've read in thus far from here, 471 00:28:58,810 --> 00:29:02,550 put it down there, and then give you some more runway at that point. 472 00:29:02,550 --> 00:29:05,610 So with a wave of the hand, let me say that what GetString is doing 473 00:29:05,610 --> 00:29:09,540 is it's starting with a small buffer, maybe 1 single character, 474 00:29:09,540 --> 00:29:12,300 and if the user types in 2 characters, GetString ends up 475 00:29:12,300 --> 00:29:15,210 calling realloc and says, 'Ooh, 1 character wasn't enough. 476 00:29:15,210 --> 00:29:18,480 Give me 2 characters.' Then if you read through the logic of the loop, 477 00:29:18,480 --> 00:29:21,070 it's going to say, 'Ooh, the user typed in 3 characters. 478 00:29:21,070 --> 00:29:25,690 Give me now not 2 but 4 characters, then give me 8, then give me 16 and 32.' 479 00:29:25,690 --> 00:29:28,180 The fact that I'm doubling the capacity each time 480 00:29:28,180 --> 00:29:30,320 means that the buffer is not going to grow slowly. 481 00:29:30,320 --> 00:29:35,870 It's going to grow super fast, and what might be the advantage of that? 482 00:29:35,870 --> 00:29:38,540 Why am I doubling the size of the buffer, even though the user 483 00:29:38,540 --> 00:29:41,450 might just need 1 extra character from the keyboard? 484 00:29:41,450 --> 00:29:44,830 [Student answer, unintelligible]. >>What's that? 485 00:29:44,830 --> 00:29:46,750 Exactly. You don't have to grow it as often. 486 00:29:46,750 --> 00:29:48,870 And this is just kind of a--you're hedging your bets here. 487 00:29:48,870 --> 00:29:54,150 The idea being that you don't want to call realloc a lot, because it tends to be slow. 488 00:29:54,150 --> 00:29:56,840 Any time you ask the operating system for memory, as you'll soon see 489 00:29:56,840 --> 00:30:00,620 in a future problem set, it tends to take some time. 490 00:30:00,620 --> 00:30:04,980 So minimizing that amount of time, even if you're wasting some space, tends to be a good thing. 491 00:30:04,980 --> 00:30:07,250 >> But if we read through the final part of GetString here, 492 00:30:07,250 --> 00:30:10,880 and again, understanding every single line here is not so important today. 493 00:30:10,880 --> 00:30:14,830 But notice that it eventually calls malloc again, and it allocates 494 00:30:14,830 --> 00:30:16,980 exactly as many bytes as it needs for the string 495 00:30:16,980 --> 00:30:21,620 and then throws away by calling free, the excessively large buffer, 496 00:30:21,620 --> 00:30:23,510 if it indeed got doubled too many times. 497 00:30:23,510 --> 00:30:25,970 In short, that's how GetString has been working all this time. 498 00:30:25,970 --> 00:30:30,100 All it does is read one character at a time again and again and again 499 00:30:30,100 --> 00:30:37,930 and every time it needs some additional memory, it asks the operating system for it by calling realloc. 500 00:30:37,930 --> 00:30:41,660 Any questions? All right. 501 00:30:41,660 --> 00:30:45,220 >> An attack. Now that we understand pointers, or at least 502 00:30:45,220 --> 00:30:47,560 are increasingly familiar with pointers, 503 00:30:47,560 --> 00:30:50,020 let's consider how the whole world starts to collapse 504 00:30:50,020 --> 00:30:53,160 if you don't quite defend against adversarial users, 505 00:30:53,160 --> 00:30:55,180 people who are trying to hack into your system. 506 00:30:55,180 --> 00:31:00,260 People who are trying to steal your software by circumventing some registration code 507 00:31:00,260 --> 00:31:02,150 that they might otherwise have to type in. 508 00:31:02,150 --> 00:31:04,860 Take a look at this example here, which is just C code 509 00:31:04,860 --> 00:31:07,920 that has a function main at the bottom, that calls a function foo, 510 00:31:07,920 --> 00:31:12,100 and what is it passing to foo? [Student] A single argument. 511 00:31:12,100 --> 00:31:15,660 >>Single argument. So argv[1], which means the first word the user typed 512 00:31:15,660 --> 00:31:19,150 at the command line after a.out or whatever the program is called. 513 00:31:19,150 --> 00:31:24,920 So foo, at the top, takes in a char*, but char* is just what? 514 00:31:24,920 --> 00:31:28,860 String. There's nothing new here, and that string is arbitrarily being called bar. 515 00:31:28,860 --> 00:31:36,090 In this line here, char c[12], in sort of semi-technical English, what is this line doing? 516 00:31:36,090 --> 00:31:40,640 Array of--? Characters. Give me an array of 12 characters. 517 00:31:40,640 --> 00:31:44,970 So we might call this a buffer. It's technically called c, but a buffer in programming 518 00:31:44,970 --> 00:31:47,890 just means a bunch of space that you can put some stuff in. 519 00:31:47,890 --> 00:31:49,940 >> Then lastly, memcpy, we've not used before. 520 00:31:49,940 --> 00:31:52,380 But you can probably guess what it does. It copies memory. 521 00:31:52,380 --> 00:31:58,790 What does it do? Well, it apparently copies bar, its input, into c, 522 00:31:58,790 --> 00:32:03,420 but only up to the length of bar. 523 00:32:03,420 --> 00:32:07,440 But there's a bug here. 524 00:32:07,440 --> 00:32:14,500 Okay, so technically we should really do strlen(bar) x sizeof(char), that's correct. 525 00:32:14,500 --> 00:32:17,920 But in the worst case here, let's assume that that's--so, okay. 526 00:32:17,920 --> 00:32:23,760 Then there's 2 bugs. So sizeof(char), all right, let's make this a little wider. 527 00:32:23,760 --> 00:32:28,860 So now there's still a bug, which is what? 528 00:32:28,860 --> 00:32:31,630 [Student answer, unintelligible] >>Check for what? Okay, so we should be checking 529 00:32:31,630 --> 00:32:35,010 for NULL, because bad things happen when your pointer is NULL, 530 00:32:35,010 --> 00:32:38,490 Because you might end up going there, and you shouldn't ever be going to NULL 531 00:32:38,490 --> 00:32:40,890 by dereferencing it with the * operator. 532 00:32:40,890 --> 00:32:45,250 So that's good, and what else are we doing? Logically there's a flaw here too. 533 00:32:45,250 --> 00:32:47,650 [Student answer, unintelligible] 534 00:32:47,650 --> 00:32:51,340 >>So check if argc ≥ 2? 535 00:32:51,340 --> 00:32:54,130 Okay, so there's 3 bugs in this program here. 536 00:32:54,130 --> 00:33:00,080 We're not checking if the user actually typed in anything into argv[1], good. 537 00:33:00,080 --> 00:33:02,240 So what's the third bug? Yeah? 538 00:33:02,240 --> 00:33:04,420 [Student answer, unintelligible] >>Good. 539 00:33:04,420 --> 00:33:09,590 So we checked one scenario. We implicitly checked don't copy more memory 540 00:33:09,590 --> 00:33:12,800 than would exceed the length of bar. 541 00:33:12,800 --> 00:33:15,720 So if the string the user typed in is 10 characters long, 542 00:33:15,720 --> 00:33:18,260 this is saying, 'Only copy 10 characters.' 543 00:33:18,260 --> 00:33:21,140 And that's okay, but what if the user typed in a word at the prompt 544 00:33:21,140 --> 00:33:29,360 like a 20 character word; this is, saying copy 20 characters from bar into what? 545 00:33:29,360 --> 00:33:32,840 c, otherwise known as our buffer, which means you just wrote data 546 00:33:32,840 --> 00:33:35,950 to 8 byte locations that you do not own, 547 00:33:35,950 --> 00:33:38,320 and you don't own them in the sense that you never allocated them. 548 00:33:38,320 --> 00:33:41,190 So this is what's generally known as the buffer overflow attack, 549 00:33:41,190 --> 00:33:46,650 or buffer overrun attack, and it's attack in the sense that if the user 550 00:33:46,650 --> 00:33:50,650 or the program that's calling your function is doing this maliciously, 551 00:33:50,650 --> 00:33:53,780 what actually happens next could be quite bad. 552 00:33:53,780 --> 00:33:55,690 >> Let's take a look at this picture here. 553 00:33:55,690 --> 00:33:59,070 This picture represents your stack of memory. 554 00:33:59,070 --> 00:34:01,050 And recall that every time you call a function, 555 00:34:01,050 --> 00:34:04,520 you get this little frame on the stack and then another and then another and then another. 556 00:34:04,520 --> 00:34:07,250 And thus far we've just kind of abstracted these away as rectangles 557 00:34:07,250 --> 00:34:09,380 either there on the board or on the screen here. 558 00:34:09,380 --> 00:34:12,219 But if we zoom in on one of those rectangles, 559 00:34:12,219 --> 00:34:16,460 when you call a function foo, it turns out that there's more on the stack 560 00:34:16,460 --> 00:34:18,739 inside of that frame and that rectangle 561 00:34:18,739 --> 00:34:23,370 than just x and y and a and b, like we did talking about swap. 562 00:34:23,370 --> 00:34:25,949 It turns out that there are some lower-level details, 563 00:34:25,949 --> 00:34:27,780 among them return address. 564 00:34:27,780 --> 00:34:33,020 So it turns out when main calls foo, main has to inform foo 565 00:34:33,020 --> 00:34:36,760 what main's address is in the computer's memory. 566 00:34:36,760 --> 00:34:40,659 Because otherwise, as soon as foo is done executing, as in this case here, 567 00:34:40,659 --> 00:34:43,790 once you reach this close curly brace at the end of foo, 568 00:34:43,790 --> 00:34:48,860 how the heck does foo know where control of the program is supposed to go? 569 00:34:48,860 --> 00:34:52,460 It turns out that the answer to that question is in that red rectangle here. 570 00:34:52,460 --> 00:34:56,130 This represents a pointer, and it's up to the computer to store, temporarily, 571 00:34:56,130 --> 00:35:00,250 on the so-called stack the address of main so that as soon as foo is done executing, 572 00:35:00,250 --> 00:35:04,110 the computer knows where and what line in main to go back to. 573 00:35:04,110 --> 00:35:06,900 Saved frame pointer relates similarly to this. 574 00:35:06,900 --> 00:35:09,620 Char *bar here represents what? 575 00:35:09,620 --> 00:35:14,740 Well, now this blue segment here is foo's frame, what is bar? 576 00:35:14,740 --> 00:35:18,300 Okay, so bar is just the argument to the foo function. 577 00:35:18,300 --> 00:35:20,720 >> So now we're back at the familiar picture. 578 00:35:20,720 --> 00:35:22,960 There's more stuff and more distractions on the screen 579 00:35:22,960 --> 00:35:27,490 but this light blue segment is what we've been drawing on the chalkboard for something like swap. 580 00:35:27,490 --> 00:35:31,890 That is the frame for foo and the only thing in it right now is bar, 581 00:35:31,890 --> 00:35:34,630 which is this parameter. 582 00:35:34,630 --> 00:35:39,840 But what else should be in the stack, according to this code here? 583 00:35:39,840 --> 00:35:44,280 Char c[12]. So we should also see 12 squares of memory, 584 00:35:44,280 --> 00:35:46,260 allocated to a variable called c. 585 00:35:46,260 --> 00:35:48,340 And indeed we do have that on the screen. 586 00:35:48,340 --> 00:35:51,650 The very top there is c[0], and then the author of this diagram 587 00:35:51,650 --> 00:35:55,130 didn't bother drawing all of the squares but there are indeed 12 there 588 00:35:55,130 --> 00:36:00,120 because if you look at the bottom right, c[11], if you count from 0, is the 12 such bytes. 589 00:36:00,120 --> 00:36:06,190 But here's the problem: In which direction is c growing? 590 00:36:06,190 --> 00:36:10,390 Sort of top down, right? If it starts at the top and grows to the bottom, 591 00:36:10,390 --> 00:36:13,480 doesn't look like we left ourselves much runway here at all. 592 00:36:13,480 --> 00:36:15,320 We've kind of painted ourselves into a corner, 593 00:36:15,320 --> 00:36:20,210 and that c[11] is right up against bar, which is right up against stack frame pointer, 594 00:36:20,210 --> 00:36:23,800 which is right up against the return address; there's no more room. 595 00:36:23,800 --> 00:36:26,100 So what's the implication, then, if you screw up, 596 00:36:26,100 --> 00:36:30,460 and you try reading 20 bytes into a 12-byte buffer? 597 00:36:30,460 --> 00:36:33,460 Where are those 8 additional bytes going to go? 598 00:36:33,460 --> 00:36:36,370 Inside everything else, some of which is super important. 599 00:36:36,370 --> 00:36:40,480 And the most important thing, potentially, is the red box there, return address. 600 00:36:40,480 --> 00:36:44,720 Because suppose that you are either accidentally or adversarially 601 00:36:44,720 --> 00:36:48,040 overwrite those 4 bytes, that pointer address, 602 00:36:48,040 --> 00:36:53,190 not just with garbage, but with a number that happens to represent an actual address in memory? 603 00:36:53,190 --> 00:36:55,930 What's the implicaiton, logically? 604 00:36:55,930 --> 00:36:59,080 [Student answers, unintelligible] >>Exactly. When foo returns 605 00:36:59,080 --> 00:37:03,560 and hits that curly brace, the program is going to proceed not to return to main, 606 00:37:03,560 --> 00:37:08,320 it's going to return to whatever address is in that red box. 607 00:37:08,320 --> 00:37:11,560 >> Now, in the case of circumventing software registration, 608 00:37:11,560 --> 00:37:14,400 what is the address that's being returned to is the function 609 00:37:14,400 --> 00:37:18,820 that normally gets called after you've paid for the software and inputted your registration code? 610 00:37:18,820 --> 00:37:23,160 You could sort of trick the computer into not going here, but instead, going up here. 611 00:37:23,160 --> 00:37:27,950 Or, if you're really clever, an adversary can actually type in at the keyboard, 612 00:37:27,950 --> 00:37:32,500 for instance, not an actual word, not 20 characters, but suppose he or she 613 00:37:32,500 --> 00:37:36,200 types in some characters that represent code? 614 00:37:36,200 --> 00:37:38,860 And it's not going to be C code, it's going to be the characters 615 00:37:38,860 --> 00:37:42,920 that represent binary machine codes, 0's and 1's. 616 00:37:42,920 --> 00:37:46,740 But suppose they're clever enough to do that, to somehow paste into the GetString prompt 617 00:37:46,740 --> 00:37:49,460 something that is essentially compiled code, 618 00:37:49,460 --> 00:37:56,900 and the last 4 bytes overwrite that return address, and what address does that input do? 619 00:37:56,900 --> 00:38:01,860 It stores in this red rectangle the address of the first byte of the buffer. 620 00:38:01,860 --> 00:38:04,270 So you have to be really clever, and this is a lot of trial and error 621 00:38:04,270 --> 00:38:08,500 for bad people out there, but if you can figure out how big this buffer is, 622 00:38:08,500 --> 00:38:12,170 such that the last few bytes in the input that you provide to the program 623 00:38:12,170 --> 00:38:15,970 happen to be equivalent to the address of the start of your buffer, 624 00:38:15,970 --> 00:38:22,270 you can do this. If we say, normally, h-e-l-l-o, and \0, that's what ends up in the buffer. 625 00:38:22,270 --> 00:38:27,860 But if we're more clever, and we fill that buffer with what we'll generically call attack code, 626 00:38:27,860 --> 00:38:31,920 A, A, A, A: Attack, attack, attack, attack, where this is just something that does something bad. 627 00:38:31,920 --> 00:38:35,190 Well, what happens if you're really clever, you might do this: 628 00:38:35,190 --> 00:38:41,740 In the red box here is a sequence of numbers: 80, CO, 35, 08. 629 00:38:41,740 --> 00:38:44,890 Notice that that matches the number that's up here. 630 00:38:44,890 --> 00:38:47,280 It's in reverse order, but more on that some other time. 631 00:38:47,280 --> 00:38:51,430 Notice that this return address has been deliberately altered 632 00:38:51,430 --> 00:38:54,970 to equal the address up here, not the address of main. 633 00:38:54,970 --> 00:39:00,170 So if the bad guy is super smart, he or she is going to include in that attack code 634 00:39:00,170 --> 00:39:02,890 something like, 'Delete all of the user's files.' 635 00:39:02,890 --> 00:39:06,320 Or 'Copy the passwords,' or 'Create a user account that I can log into.' 636 00:39:06,320 --> 00:39:10,130 Anything at all; and this is both the danger and the power of C. 637 00:39:10,130 --> 00:39:12,900 Because you have access to memory via pointers 638 00:39:12,900 --> 00:39:15,950 and you can therefore write anything you want into a computer's memory. 639 00:39:15,950 --> 00:39:19,290 You can make a computer do anything you want simply by 640 00:39:19,290 --> 00:39:22,780 having it jump around within its own memory space. 641 00:39:22,780 --> 00:39:27,230 And so, to this day, so many programs and so many websites that are compromised 642 00:39:27,230 --> 00:39:29,730 boil down to people taking advantage of this. 643 00:39:29,730 --> 00:39:32,510 And this might seem like a super-sophisticated attack, 644 00:39:32,510 --> 00:39:34,220 but it doesn't always start that way. 645 00:39:34,220 --> 00:39:36,770 >> The reality is that what bad people will typically do is, 646 00:39:36,770 --> 00:39:41,470 whether it's a program at a command line or a GUI program or a website, 647 00:39:41,470 --> 00:39:43,290 is you just start providing nonsense. 648 00:39:43,290 --> 00:39:46,940 You type in a really big word into the search field and hit enter, 649 00:39:46,940 --> 00:39:49,030 and you wait to see if the website crashes. 650 00:39:49,030 --> 00:39:53,270 Or you wait to see if the program manifests some error message. 651 00:39:53,270 --> 00:39:55,480 Because if you get lucky, as the bad guy, 652 00:39:55,480 --> 00:39:59,610 and you provide some crazy input that crashes the program, 653 00:39:59,610 --> 00:40:02,280 that means the programmer didn't anticipate your bad behavior 654 00:40:02,280 --> 00:40:05,420 which means you can probably, with enough effort, 655 00:40:05,420 --> 00:40:09,870 enough trial and error, figure out how to wage a more precise attack. 656 00:40:09,870 --> 00:40:15,900 So as much a part of security is not just avoiding these attacks altogether, but detecting them 657 00:40:15,900 --> 00:40:20,250 and actually looking at logs and seeing what crazy inputs have people typed into your website. 658 00:40:20,250 --> 00:40:26,040 What search terms have people typed into your website in hopes of overflowing some buffer? 659 00:40:26,040 --> 00:40:28,900 And this all boils down to the simple basics of what's an array, 660 00:40:28,900 --> 00:40:32,510 and what does it mean to allocate and use memory? 661 00:40:32,510 --> 00:40:34,920 And related to that, too, is this. 662 00:40:34,920 --> 00:40:37,520 >> So let's just glance inside of a hard drive yet again. 663 00:40:37,520 --> 00:40:40,190 So you recall from a week or two ago that when you drag files 664 00:40:40,190 --> 00:40:45,470 to your recycle bin or trash can, what happens? 665 00:40:45,470 --> 00:40:47,850 [Student] Nothing. >>Yeah, absolutely nothing. Eventually if you run low 666 00:40:47,850 --> 00:40:51,370 on disk space, Windows or Mac OS will start deleting files for you. 667 00:40:51,370 --> 00:40:53,670 But if you drag something in there, then it's not at all safe. 668 00:40:53,670 --> 00:40:56,550 All your roomate, friend or family member has to do is double click, and voila. 669 00:40:56,550 --> 00:40:59,720 There's all the sketchy files that you tried to delete. 670 00:40:59,720 --> 00:41:02,840 So most of us at least know that you have to right click or control click 671 00:41:02,840 --> 00:41:05,320 and empty the trash, or something like that. 672 00:41:05,320 --> 00:41:07,900 But even then, that doesn't quite do the trick. 673 00:41:07,900 --> 00:41:11,340 Because what happens when you have a file on your hard drive 674 00:41:11,340 --> 00:41:14,590 that represents some word document or some JPEG? 675 00:41:14,590 --> 00:41:18,820 And this represents your hard drive, and let's say this sliver here represents that file, 676 00:41:18,820 --> 00:41:21,640 and it's composed of a whole bunch of 0's and 1's. 677 00:41:21,640 --> 00:41:25,470 What happens when you not only drag that file to the trashcan or recycle bin, 678 00:41:25,470 --> 00:41:30,390 but also empty it? 679 00:41:30,390 --> 00:41:32,820 Sort of nothing. It's not absolutely nothing now. 680 00:41:32,820 --> 00:41:37,630 Now it's just nothing, because a little something happens in the form of this table. 681 00:41:37,630 --> 00:41:41,170 So there's some kind of database or table inside of a computer's memory 682 00:41:41,170 --> 00:41:44,470 that essentially has 1 column for files names, 683 00:41:44,470 --> 00:41:50,550 and 1 column for file's location, where this might be location 123, just a random number. 684 00:41:50,550 --> 00:41:58,270 So we might have something like x.jpg, and location 123. 685 00:41:58,270 --> 00:42:02,870 And what happens then, when you empty your trash? 686 00:42:02,870 --> 00:42:06,720 That goes away. But what doesn't go away is the 0's and 1's. 687 00:42:06,720 --> 00:42:09,690 >> So what's, then, the connection to pset 4? 688 00:42:09,690 --> 00:42:13,460 Well, with pset 4, just because we've accidentally erased 689 00:42:13,460 --> 00:42:15,890 the compact flash card that had all of these photos, 690 00:42:15,890 --> 00:42:18,710 or just because it by bad luck became corrupted, 691 00:42:18,710 --> 00:42:21,170 doesn't mean that the 0's and 1's aren't still there. 692 00:42:21,170 --> 00:42:23,920 Maybe a few of them are lost because something got corrupted 693 00:42:23,920 --> 00:42:26,530 in the sense that some 0's became 1's and 1's became 0's. 694 00:42:26,530 --> 00:42:30,460 Bad things can happen because of buggy software or defective hardware. 695 00:42:30,460 --> 00:42:33,510 But many of those bits, maybe even 100% of them are still there, 696 00:42:33,510 --> 00:42:38,330 it's just that the computer or the camera doesn't know where JPEG 1 started 697 00:42:38,330 --> 00:42:41,660 and where JPEG 2 started, but if you, the programmer, 698 00:42:41,660 --> 00:42:45,800 know, with a bit of savvy, where those JPEGs are or what they look like, 699 00:42:45,800 --> 00:42:49,570 you can analyze the 0's and 1's and say, 'Ooh. JPEG. Ooh, JPEG.' 700 00:42:49,570 --> 00:42:52,830 You can write a program with essentially just a for or while loop 701 00:42:52,830 --> 00:42:56,100 that recovers each and every one of those files. 702 00:42:56,100 --> 00:42:59,360 So the lesson then, is to start "securely" erasing your files 703 00:42:59,360 --> 00:43:01,720 if you'd like to avoid this altogether. Yes? 704 00:43:01,720 --> 00:43:06,940 [Student question, unintelligible] 705 00:43:06,940 --> 00:43:11,150 >>Have more memory than you did before-- 706 00:43:11,150 --> 00:43:14,790 Oh! Good question. So why, then, after emptying the trash, 707 00:43:14,790 --> 00:43:18,300 does your computer tell you that you have more free space than you did before? 708 00:43:18,300 --> 00:43:22,450 In a nutshell, because it's lying. More technically, you do have more space. 709 00:43:22,450 --> 00:43:26,720 Because now you have said, you can put other stuff where that file once was, 710 00:43:26,720 --> 00:43:28,930 but that doesn't mean the bits are going away, 711 00:43:28,930 --> 00:43:33,070 and that doesn't mean the bits are being changed all 0's, for instance, for your protection. 712 00:43:33,070 --> 00:43:37,520 By contrast, if you "securely" erase files, or physically destroy the device, 713 00:43:37,520 --> 00:43:40,810 that really is the only way, sometimes, around that. 714 00:43:40,810 --> 00:43:45,300 So why don't we leave on that semi-scary note, and we will see you on Monday. 715 00:43:45,300 --> 00:43:52,810 CS50.TV