1 00:00:09,256 --> 00:00:12,566 >> David Malan: All right, welcome back to CS50! 2 00:00:12,566 --> 00:00:18,316 This is week five, so in problem set two, you'll recall that some 3 00:00:18,316 --> 00:00:21,176 of your classmates tackled the so-called hacker edition, 4 00:00:21,176 --> 00:00:24,556 and that hacker edition had them tackle a number 5 00:00:24,556 --> 00:00:26,776 of encrypted passwords, 6 00:00:27,116 --> 00:00:28,656 which looked a little something like this. 7 00:00:28,806 --> 00:00:31,836 So, long story short, if you didn't actually read this PDF 8 00:00:31,836 --> 00:00:33,646 at the time, this was the hacker edition. 9 00:00:33,906 --> 00:00:36,436 Passwords on the typical Linux system are stored 10 00:00:36,436 --> 00:00:39,516 in an encrypted form, so you have essentially this format 11 00:00:39,516 --> 00:00:42,966 in a file called XE Password Etcetera/Password. 12 00:00:43,686 --> 00:00:45,896 Julius would be the username, colon, 13 00:00:45,896 --> 00:00:48,406 and then some cryptic-looking string would be the 14 00:00:48,406 --> 00:00:49,096 encrypted password. 15 00:00:49,096 --> 00:00:51,326 And do you mind, Barry, taking my voice down a little bit? 16 00:00:51,326 --> 00:00:53,016 There's a bit of an echo up here. 17 00:00:53,086 --> 00:00:53,476 Much better. 18 00:00:53,886 --> 00:00:57,176 So, these students, who tackled the hacker edition, 19 00:00:57,176 --> 00:01:00,686 needed to decipher, needed to crack those passwords; but, 20 00:01:00,686 --> 00:01:03,266 unfortunately, the routine that's used 21 00:01:03,266 --> 00:01:04,556 in a typical Linux system 22 00:01:04,556 --> 00:01:07,796 to encrypt passwords is a one-way function, 23 00:01:07,796 --> 00:01:11,206 which means it's relatively easy and relatively fast to take 24 00:01:11,206 --> 00:01:13,276 in plaintext and generate cipher text, 25 00:01:13,636 --> 00:01:16,016 but you can't actually reverse the process. 26 00:01:16,126 --> 00:01:17,516 So, as I mentioned a few weeks ago, 27 00:01:17,516 --> 00:01:20,536 if you've ever lost a password to a typical website 28 00:01:20,536 --> 00:01:23,366 or to your FAS account, odds are you might've gotten annoyed 29 00:01:23,366 --> 00:01:25,126 at the text support person, because they insisted, 30 00:01:25,176 --> 00:01:26,716 "I can't tell you what your password is," 31 00:01:27,026 --> 00:01:28,116 but they can change it, 32 00:01:28,296 --> 00:01:30,626 and that's because of this mathematical property that's 33 00:01:30,626 --> 00:01:32,296 generally used for passwords. 34 00:01:32,606 --> 00:01:34,896 They're normally known as a one-way hash. 35 00:01:34,896 --> 00:01:37,616 And as the name implies, you can only encrypt 36 00:01:37,616 --> 00:01:40,326 in one direction; you can't decrypt. 37 00:01:40,786 --> 00:01:43,396 So, if you can't decrypt passwords in this way, 38 00:01:43,646 --> 00:01:47,536 how in the world are all of us logging in to FAS and logging 39 00:01:47,536 --> 00:01:51,356 in to Facebook and other such sites every day of the week? 40 00:01:51,836 --> 00:01:54,926 How is it that these sites are storing our passwords encrypted, 41 00:01:55,116 --> 00:01:58,836 and yet they're still somehow checking that what I type in, 42 00:01:58,836 --> 00:02:02,166 in plaintext, matches my password? 43 00:02:02,756 --> 00:02:05,326 So, in other words, if you can't decrypt the cipher text 44 00:02:05,326 --> 00:02:07,986 to then compare the plaintext I initially gave them, 45 00:02:07,986 --> 00:02:10,916 when I created my account, and the plaintext I just typed in, 46 00:02:10,916 --> 00:02:12,916 when visiting Facebook.com or equivalent, 47 00:02:13,476 --> 00:02:15,536 how do you actually compare what I typed in 48 00:02:15,536 --> 00:02:17,306 and what they know my password to be? 49 00:02:17,406 --> 00:02:17,516 Yeah? 50 00:02:17,926 --> 00:02:18,626 >> Just encrypt what [inaudible]. 51 00:02:19,136 --> 00:02:20,156 >> David: Just what? 52 00:02:20,416 --> 00:02:20,856 >> Encrypt what -- 53 00:02:20,856 --> 00:02:22,996 >> David: Yeah, just encrypt what I just typed in. 54 00:02:22,996 --> 00:02:25,006 So, in other words, what a typical website 55 00:02:25,006 --> 00:02:28,596 or server does these days is, again, it encrypts your password 56 00:02:28,596 --> 00:02:31,496 when you create your account and stores it in a crazy format 57 00:02:31,496 --> 00:02:34,346 like this; but, subsequently, when you log in, 58 00:02:34,706 --> 00:02:38,716 the server simply re-encrypts whatever you just typed in, 59 00:02:38,876 --> 00:02:40,146 using the same algorithm. 60 00:02:40,176 --> 00:02:42,496 And then if the two cipher texts match, 61 00:02:42,836 --> 00:02:44,176 does it actually let you pass. 62 00:02:44,516 --> 00:02:46,966 But the curious thing is, because of the mathematics 63 00:02:46,966 --> 00:02:48,496 of a lot of these implementations, 64 00:02:48,716 --> 00:02:51,716 there is a probability, though very, very low, 65 00:02:51,716 --> 00:02:55,846 that you could type in one word, and it would encrypt 66 00:02:55,966 --> 00:02:58,376 to some cipher text, and you could type in another word, 67 00:02:58,376 --> 00:03:00,906 a different word, but it might very well encrypt 68 00:03:01,136 --> 00:03:03,026 to that same cipher text; in other words, 69 00:03:03,026 --> 00:03:04,546 you might very well be able to login 70 00:03:04,546 --> 00:03:07,746 to some accounts you guys have, using any number of passwords, 71 00:03:08,036 --> 00:03:10,526 but the probability of that is very low, and the probability 72 00:03:10,586 --> 00:03:13,506 of figuring out what other words might hash, so to speak, 73 00:03:13,506 --> 00:03:16,016 to the same value is itself very low. 74 00:03:16,286 --> 00:03:20,136 So, a number of our students, who are more comfortable, did, 75 00:03:20,136 --> 00:03:24,766 in fact, determine that Julius' password was just 13. 76 00:03:25,436 --> 00:03:29,496 The clever, or those who like stupid jokes might understand 77 00:03:29,496 --> 00:03:30,736 what the reference there is. 78 00:03:30,736 --> 00:03:32,276 Julius Caesar, you know, 13. 79 00:03:32,276 --> 00:03:34,386 Okay. President Scroob, of course, 80 00:03:34,476 --> 00:03:37,616 in the Spaceballs, Smith hack. 81 00:03:37,616 --> 00:03:39,596 You've got to make fun of me when I do something like this. 82 00:03:39,646 --> 00:03:40,016 There we go. 83 00:03:40,336 --> 00:03:43,956 All right, President Scroob had a password of 12345. 84 00:03:43,956 --> 00:03:45,716 That was the second password these guys were tasked 85 00:03:45,976 --> 00:03:46,726 with figuring out. 86 00:03:46,966 --> 00:03:49,866 W. Brandus Voice, this is actually a reference to a movie 87 00:03:49,866 --> 00:03:51,436 that many of you probably haven't seen, 88 00:03:51,436 --> 00:03:54,076 but the movie Sneakers, where a guy authenticates himself 89 00:03:54,076 --> 00:03:57,196 as saying, "My voice is my password," a little example 90 00:03:57,196 --> 00:03:59,346 of using vocal algorithmics. 91 00:03:59,426 --> 00:04:00,316 That was his there. 92 00:04:00,446 --> 00:04:02,346 But notice what was important about voice was 93 00:04:02,346 --> 00:04:03,766 that it's just an English word. 94 00:04:03,766 --> 00:04:04,846 So any hacker students 95 00:04:04,846 --> 00:04:07,226 that actually used an English dictionary to try 96 00:04:07,226 --> 00:04:09,726 to crack our passwords probably would've gotten this one. 97 00:04:10,116 --> 00:04:12,826 Baravelli, which is a reference to an old Marx Brothers film. 98 00:04:12,826 --> 00:04:15,316 Swordfish, also a word that's probably in the dictionary. 99 00:04:15,556 --> 00:04:16,856 But then it got a little harder, 100 00:04:16,856 --> 00:04:21,806 so you may remember Mr. Blaise Visionaire [phonetic] of France, 101 00:04:21,966 --> 00:04:24,146 who we used Foobar as our keyword. 102 00:04:24,146 --> 00:04:24,806 A little harder. 103 00:04:24,806 --> 00:04:26,646 It's probably not in most dictionaries, 104 00:04:26,836 --> 00:04:28,116 let alone all capitalized. 105 00:04:28,386 --> 00:04:29,246 G. Costanza. 106 00:04:29,316 --> 00:04:31,336 Remember? Those familiar might recall 107 00:04:31,336 --> 00:04:34,476 that this was his ATM code, Bosco, a little chocolaty drink. 108 00:04:34,836 --> 00:04:37,906 And then Malan, I don't believe, was gotten by anyone, 109 00:04:37,946 --> 00:04:40,456 but that's because it was, in fact, rather difficult 110 00:04:40,456 --> 00:04:41,716 to crack and was this. 111 00:04:41,966 --> 00:04:45,196 For those who don't understand what the reference is, 112 00:04:45,196 --> 00:04:48,476 just Goggle FTW and 111, and you'll see 113 00:04:48,546 --> 00:04:50,606 such stupid illusions as those. 114 00:04:51,036 --> 00:04:54,636 So, I finally found a video that's actually educational 115 00:04:55,256 --> 00:04:55,776 in nature. 116 00:04:55,866 --> 00:04:58,086 [laughter] So, you might not think it, 117 00:04:58,806 --> 00:05:02,846 but this thing here actually comes to us 118 00:05:03,096 --> 00:05:05,066 by a very well-known professor, Nick Parlante 119 00:05:05,066 --> 00:05:06,766 at Stanford University. 120 00:05:06,766 --> 00:05:07,966 That's actually very well-known 121 00:05:07,966 --> 00:05:09,166 for its computer science program. 122 00:05:09,166 --> 00:05:11,516 So, this isn't a joke, but this is actually one 123 00:05:11,516 --> 00:05:13,176 of the most compelling ways I have, at least, 124 00:05:13,236 --> 00:05:16,106 seen in clay form to convey the idea 125 00:05:16,106 --> 00:05:18,096 of pointers and dereferencing. 126 00:05:18,416 --> 00:05:20,746 So, this is, for many students, a challenging topic. 127 00:05:20,746 --> 00:05:22,836 It certainly was for me years ago; but, frankly, 128 00:05:22,836 --> 00:05:24,056 we've just kind of taken ourselves 129 00:05:24,056 --> 00:05:25,586 down a notch with Binky here. 130 00:05:25,836 --> 00:05:29,236 Might hopefully elucidate what might otherwise, in C syntax, 131 00:05:29,316 --> 00:05:30,636 be a fairly cryptic topic. 132 00:05:30,786 --> 00:05:33,306 So, allow me to full screen, and allow me 133 00:05:33,406 --> 00:05:37,326 to give us this short three-minute film. 134 00:05:37,916 --> 00:05:38,366 Here we are. 135 00:05:38,366 --> 00:05:40,756 >> Nick Parlante: Hey, Binky. 136 00:05:40,856 --> 00:05:43,816 Wake up. It's time for pointer fun. 137 00:05:43,816 --> 00:05:44,546 >> Binky: What's that? 138 00:05:44,546 --> 00:05:46,316 Learn about pointers? 139 00:05:46,316 --> 00:05:47,096 Oh, goody! 140 00:05:47,186 --> 00:05:49,696 >> David: Mind you, this is a Stanford professor playing 141 00:05:49,696 --> 00:05:51,916 with clay, doing the voiceover here on camera. 142 00:05:51,916 --> 00:05:53,286 [laughter] 143 00:05:53,286 --> 00:05:54,846 >> Nick: Well, to get started, I guess we're going 144 00:05:54,846 --> 00:05:55,946 to need a couple of pointers. 145 00:05:56,416 --> 00:05:57,276 >> Binky: Okay. 146 00:05:57,276 --> 00:05:59,076 This code allocates two pointers, 147 00:05:59,076 --> 00:06:00,356 which can point to integers. 148 00:06:00,996 --> 00:06:02,876 >> Nick: Okay, well, I see the two pointers, 149 00:06:02,876 --> 00:06:04,716 but they don't seem to be pointing to anything. 150 00:06:04,936 --> 00:06:05,896 >> Binky That's right. 151 00:06:05,896 --> 00:06:07,716 Initially, pointers don't point to anything. 152 00:06:07,996 --> 00:06:10,056 The things they point to are called pointees, 153 00:06:10,056 --> 00:06:12,286 and setting them up is a separate step. 154 00:06:12,286 --> 00:06:13,006 >> Nick: Oh, right, right. 155 00:06:13,156 --> 00:06:13,516 I knew that. 156 00:06:13,776 --> 00:06:15,636 The pointees are separate. 157 00:06:15,636 --> 00:06:17,316 So, how do you allocate a pointee? 158 00:06:18,046 --> 00:06:21,516 >> Binky: Okay, well, this code allocates a new integer pointee, 159 00:06:21,516 --> 00:06:24,016 and this part sets X to point to it. 160 00:06:24,016 --> 00:06:25,836 >> Nick: Hey, that looks better. 161 00:06:26,146 --> 00:06:27,166 So, make it do something. 162 00:06:27,706 --> 00:06:30,386 >> Binky: Okay, I'll dereference the pointer X 163 00:06:30,516 --> 00:06:33,056 to store the number 42 into its pointee. 164 00:06:33,056 --> 00:06:36,376 For this trick, I'll need my magic wand of dereferencing. 165 00:06:37,016 --> 00:06:39,696 >> Nick: Your magic wand of dereferencing? 166 00:06:40,616 --> 00:06:41,046 That's great. 167 00:06:42,176 --> 00:06:43,596 >> Binky: This is what the code looks like. 168 00:06:43,946 --> 00:06:45,696 I'll just set up the number and-- [pop] 169 00:06:45,696 --> 00:06:48,396 >> Nick: Hey, look, there it goes. 170 00:06:49,006 --> 00:06:52,396 So, doing a dereference on X follows the arrow 171 00:06:52,396 --> 00:06:55,436 to access its pointee, in this case, to store 42 in there. 172 00:06:56,016 --> 00:06:58,286 Hey, try using it to store the number 13 173 00:06:58,286 --> 00:06:59,636 through the other pointer, Y. 174 00:07:00,416 --> 00:07:01,066 >> Binky: Okay. 175 00:07:01,066 --> 00:07:05,666 I'll just go over here to Y, and get the number 13 set up, 176 00:07:06,166 --> 00:07:09,826 and then take the wand of dereferencing, and just... 177 00:07:09,826 --> 00:07:09,893 [buzz] Oh! 178 00:07:09,986 --> 00:07:11,696 >> Nick: Oh, hey, that didn't work. 179 00:07:11,696 --> 00:07:17,606 Say, Binky, I don't think dereferencing Y is a good idea, 180 00:07:17,606 --> 00:07:20,726 'cause setting up the pointee is a separate step, 181 00:07:20,726 --> 00:07:23,066 and I don't think we ever did it. 182 00:07:23,256 --> 00:07:23,786 >> Binky: Mm, good point. 183 00:07:24,496 --> 00:07:27,726 >> Nick: Yeah, we allocated the pointer Y, but we never set it 184 00:07:27,726 --> 00:07:29,036 to point to a pointee. 185 00:07:29,086 --> 00:07:30,776 >> Binky: Mm, very observant. 186 00:07:31,526 --> 00:07:32,746 >> Nick: Hey, you're looking good there, Binky. 187 00:07:32,746 --> 00:07:35,936 Can you fix it so that Y points to the same pointee as X? 188 00:07:36,256 --> 00:07:38,016 >> Binky: Sure, I'll use my magic wand 189 00:07:38,016 --> 00:07:38,946 of pointer assignment. 190 00:07:38,946 --> 00:07:41,316 >> Nick: Is that going to be a problem like before? 191 00:07:41,316 --> 00:07:43,476 >> Binky: No, this doesn't touch the pointees. 192 00:07:43,476 --> 00:07:45,196 It just changes one pointer to point 193 00:07:45,196 --> 00:07:46,346 to the same thing as another. 194 00:07:47,206 --> 00:07:48,136 >> Nick: Oh, I see. 195 00:07:48,316 --> 00:07:51,946 Now Y points to the same place as X. So, wait. 196 00:07:51,946 --> 00:07:52,876 Now Y is fixed. 197 00:07:52,876 --> 00:07:55,846 It has a pointee, so you can try the wand of dereferencing again. 198 00:07:55,846 --> 00:07:56,926 So, send the 13 over. 199 00:07:57,806 --> 00:07:59,156 >> Binky: Oh, okay. 200 00:07:59,196 --> 00:07:59,856 Here goes. 201 00:08:00,966 --> 00:08:02,376 >> Nick: Hey, look at that. 202 00:08:02,376 --> 00:08:03,496 Now dereferencing works on Y. 203 00:08:03,906 --> 00:08:05,226 And because the pointers are sharing 204 00:08:05,226 --> 00:08:07,796 that one pointee, they both see the 13. 205 00:08:08,036 --> 00:08:09,396 >> Binky: Yeah, sharing, whatever. 206 00:08:09,736 --> 00:08:11,296 So, are we going to switch places now? 207 00:08:11,636 --> 00:08:12,336 >> Nick: Oh, look. 208 00:08:12,336 --> 00:08:13,216 We're out of time. 209 00:08:13,616 --> 00:08:13,746 >> Binky: But-- 210 00:08:14,436 --> 00:08:16,066 >> Nick: Just remember the three pointer rules. 211 00:08:16,396 --> 00:08:19,136 Number one, the basic structure is that you have a pointer, 212 00:08:19,366 --> 00:08:22,096 and it points over to a pointee, but the pointer 213 00:08:22,096 --> 00:08:24,746 and pointee are separate, and the common error is to set 214 00:08:24,746 --> 00:08:27,196 up a pointer, but to forget to give it a pointee. 215 00:08:27,876 --> 00:08:30,256 Number two, pointer dereferencing starts 216 00:08:30,256 --> 00:08:31,996 at the pointer and follows its arrow 217 00:08:31,996 --> 00:08:33,846 over to access its pointee. 218 00:08:33,926 --> 00:08:36,906 As we all know, this only works if there is a pointee, 219 00:08:36,956 --> 00:08:38,476 which kind of gets back to rule number one. 220 00:08:38,476 --> 00:08:42,666 Number three, pointer assignment takes one pointer and changes it 221 00:08:42,666 --> 00:08:44,686 to point to the same pointee as another pointer. 222 00:08:45,216 --> 00:08:46,066 So, after the assignment, 223 00:08:46,346 --> 00:08:48,786 the two pointers will point to the same pointee. 224 00:08:48,786 --> 00:08:50,176 Sometimes that's called sharing. 225 00:08:50,676 --> 00:08:51,966 And that's all there is to it, really. 226 00:08:52,606 --> 00:08:53,156 Bye-bye now. 227 00:08:54,426 --> 00:08:56,746 >> David: So if you actually still struggle with pointers 228 00:08:56,746 --> 00:08:58,776 at any point this semester, watch that video, 229 00:08:58,776 --> 00:09:00,416 and hopefully all your troubles will go away. 230 00:09:00,996 --> 00:09:03,596 So, I can't imagine, frankly, how long it took him to make it, 231 00:09:03,596 --> 00:09:05,446 because, if you notice, throughout the video, I mean, 232 00:09:05,446 --> 00:09:07,656 he was changing like the eyeballs' locations 233 00:09:07,656 --> 00:09:09,606 for almost every frame of the video. 234 00:09:09,696 --> 00:09:12,926 So that's an overnight project there, at least. 235 00:09:13,176 --> 00:09:15,166 I wanted to show you this demo, too. 236 00:09:15,166 --> 00:09:17,856 So, this is made by one of our own teaching fellows, Phil, 237 00:09:17,856 --> 00:09:20,116 who was a student in the course last year, who goes, 238 00:09:20,116 --> 00:09:21,956 even though it's only October here, 239 00:09:22,106 --> 00:09:24,606 final projects will soon upon us, 240 00:09:24,606 --> 00:09:26,576 or at least thoughts about final projects. 241 00:09:26,576 --> 00:09:29,286 And just as a teaser now, you'll get more formal documentation 242 00:09:29,286 --> 00:09:30,236 of this in the weeks to come. 243 00:09:30,486 --> 00:09:32,116 The final project in this course really is meant 244 00:09:32,116 --> 00:09:34,426 to be its culmination, where you guys can take your newfound 245 00:09:34,466 --> 00:09:36,936 savvy and comfort, with programming in any 246 00:09:36,936 --> 00:09:39,186 of the languages we cover or don't cover in the course 247 00:09:39,186 --> 00:09:41,636 and design some piece of software of your very own. 248 00:09:41,926 --> 00:09:45,066 So, what Phil did, actually, for fun, just before term started, 249 00:09:45,196 --> 00:09:47,586 was started playing with a fairly large dataset, 250 00:09:47,586 --> 00:09:48,866 that of the course catalog. 251 00:09:49,156 --> 00:09:52,006 So, he wrote, at the time, what's called a screen scraper, 252 00:09:52,286 --> 00:09:55,826 which is a fairly annoying but sometimes effective approach 253 00:09:55,826 --> 00:09:57,786 to getting data from some source 254 00:09:57,836 --> 00:10:00,466 that doesn't necessarily want you to have that data. 255 00:10:00,676 --> 00:10:04,456 So, the registrar's catalog is all HTML pages, all web pages, 256 00:10:04,496 --> 00:10:06,886 but they fortunately follow a standard format. 257 00:10:07,126 --> 00:10:10,066 So, if you wrote a program that essentially crawled 258 00:10:10,066 --> 00:10:12,026 to the registrar's website, much like Google 259 00:10:12,026 --> 00:10:15,006 and other search engines crawl the web, and then read 260 00:10:15,006 --> 00:10:18,106 into a database of his own, all of the information 261 00:10:18,106 --> 00:10:19,646 about this course and that course. 262 00:10:19,646 --> 00:10:23,636 And then what he did was he taught himself a library called 263 00:10:23,766 --> 00:10:28,896 Flair, that allows him to write software in Adobe Flash, 264 00:10:29,236 --> 00:10:30,826 if you're familiar with the Flash format. 265 00:10:31,146 --> 00:10:32,506 And what's neat about Flash is 266 00:10:32,506 --> 00:10:35,646 that you can do fairly sophisticated animations 267 00:10:35,876 --> 00:10:38,036 by writing in a language called ActionScript. 268 00:10:38,036 --> 00:10:42,956 So what this here is, is version 0.9 of a visualization 269 00:10:43,136 --> 00:10:45,716 of the course catalog for all computer science courses. 270 00:10:45,866 --> 00:10:50,076 And as is no accident, perhaps, he has 50 kind of interconnects, 271 00:10:50,076 --> 00:10:51,816 all of these guys here at the center. 272 00:10:51,936 --> 00:10:53,896 And if you hover over CS50, what the edges 273 00:10:53,896 --> 00:10:58,206 in this graph tell you is where you can go after CS50, 274 00:10:58,236 --> 00:11:00,256 for what courses 50 is a prerequisite, 275 00:11:00,256 --> 00:11:02,626 and you can follow the links for all these other courses 276 00:11:02,626 --> 00:11:04,026 and get the descriptions on the side. 277 00:11:04,026 --> 00:11:06,546 And I point this out, because Phil did not, at the time, 278 00:11:06,546 --> 00:11:09,656 so far as I know, he had not used this library before. 279 00:11:09,826 --> 00:11:10,786 He simply Googled. 280 00:11:10,786 --> 00:11:11,686 He read up on it. 281 00:11:11,976 --> 00:11:14,496 He took a little bit of input from the course, 282 00:11:14,496 --> 00:11:15,486 as to what would be neat, 283 00:11:15,646 --> 00:11:17,226 and then he started making what's starting 284 00:11:17,226 --> 00:11:19,166 out to be a really neat visualization. 285 00:11:19,166 --> 00:11:21,646 So, his hope is to do that for the rest of the course catalog. 286 00:11:21,646 --> 00:11:24,156 And this is just to seed your minds with just some random, 287 00:11:24,156 --> 00:11:27,196 crazy ideas that might interest you, whether it's this, 288 00:11:27,246 --> 00:11:28,856 doing something with shuttles, doing something 289 00:11:28,856 --> 00:11:29,716 with dining hall menus. 290 00:11:29,916 --> 00:11:32,476 The really neat thing, once you have these programming chops, 291 00:11:32,716 --> 00:11:35,306 is that you can start to do interesting things with data, 292 00:11:35,466 --> 00:11:37,446 and that data, again, may be shuttle schedules. 293 00:11:37,446 --> 00:11:38,506 It may be a course catalog. 294 00:11:38,696 --> 00:11:41,456 But there's so much interesting stuff out there, and soon, 295 00:11:41,456 --> 00:11:43,106 if not already, you will have the ability 296 00:11:43,226 --> 00:11:45,996 to actually manipulate that, and bend it to your will, 297 00:11:45,996 --> 00:11:47,936 and make interesting applications. 298 00:11:48,246 --> 00:11:51,336 So, speaking of applications, this week is about Sudoku. 299 00:11:51,336 --> 00:11:54,686 This is our logo for the week, made by our own Yuhki Yamashita. 300 00:11:54,686 --> 00:11:58,596 I double-checked, when he handed me this graphic, 301 00:11:58,726 --> 00:12:00,486 with another Japanese-speaking friend, 302 00:12:00,486 --> 00:12:02,926 that I wasn't putting something incredibly offensive on all 303 00:12:02,926 --> 00:12:04,046 of our documents for the week. 304 00:12:04,316 --> 00:12:08,016 So, this isn't that consistent with Sudoku in some form, 305 00:12:08,016 --> 00:12:10,676 but with Sudoku, those of you who have tackled it already, 306 00:12:10,966 --> 00:12:13,806 you know that it uses this library called ncurses, 307 00:12:13,806 --> 00:12:15,206 a graphical library. 308 00:12:15,206 --> 00:12:18,336 And we're not talking graphics in the Mac OS or Windows sense. 309 00:12:18,336 --> 00:12:21,096 It's still fairly primitive stuff within the confines 310 00:12:21,096 --> 00:12:22,206 of this terminal window. 311 00:12:22,516 --> 00:12:25,716 But those of you who might have been taking to heart our advice 312 00:12:25,756 --> 00:12:28,846 that you use GDB anytime you run into a problem, 313 00:12:28,846 --> 00:12:31,936 might've realized that once you have a graphical user interface 314 00:12:32,216 --> 00:12:36,156 between you and your code, it's kind of hard to use GDB and step 315 00:12:36,216 --> 00:12:39,306 through your program, if you've got a big game board right 316 00:12:39,306 --> 00:12:40,426 in the middle of the screen. 317 00:12:40,686 --> 00:12:43,176 So, Glen Holloway [assumed spelling], one of our sysadmins, 318 00:12:43,176 --> 00:12:44,216 actually posted last night 319 00:12:44,216 --> 00:12:45,526 on the bulletin board a little trick 320 00:12:45,526 --> 00:12:48,036 that I thought I'd show you real quickly here. 321 00:12:48,396 --> 00:12:51,926 What's really neat is if you connect a nice FAS 322 00:12:52,566 --> 00:12:56,336 in two windows simultaneously, now, there is a probability 323 00:12:56,536 --> 00:12:57,876 that you'll actually end up connecting 324 00:12:57,876 --> 00:12:59,946 to two different servers that FAS owns. 325 00:13:00,196 --> 00:13:02,616 Nice thought, FAS, even though we've been calling it one 326 00:13:02,666 --> 00:13:04,656 server, is actually a cluster of servers, 327 00:13:04,866 --> 00:13:06,976 so you can actually get pseudo randomly sent 328 00:13:06,976 --> 00:13:08,216 to one machine or the other. 329 00:13:08,466 --> 00:13:11,176 So look to the bulletin board and help at CS50.net as to how 330 00:13:11,176 --> 00:13:13,466 to mitigate this, if you don't find that this is working. 331 00:13:13,706 --> 00:13:14,316 But I've done that. 332 00:13:14,316 --> 00:13:15,736 I've connected to two machines. 333 00:13:15,876 --> 00:13:17,926 I'm going to do a quick sanity check. 334 00:13:17,926 --> 00:13:20,306 I'm going to type post name, and what I see there is 335 00:13:20,306 --> 00:13:22,656 that the nice machine that I'm connected 336 00:13:22,656 --> 00:13:25,496 to actually has a name of Ice 2. 337 00:13:25,776 --> 00:13:28,036 And if I do host name in this other window, 338 00:13:28,036 --> 00:13:29,936 I also have Ice 2 there. 339 00:13:30,236 --> 00:13:32,246 So there's Ice 1, Ice 2, Ice 3. 340 00:13:32,566 --> 00:13:34,106 And as an aside, if you find that you're 341 00:13:34,106 --> 00:13:37,786 on the wrong machine, you can actually just type SSH 342 00:13:37,786 --> 00:13:43,156 or username@ice2.fas.harbor.edu, and that will make sure 343 00:13:43,156 --> 00:13:45,856 that you connect from this machine to the right one. 344 00:13:45,856 --> 00:13:48,186 But the endgame, ultimately, is just to make sure you're 345 00:13:48,186 --> 00:13:50,386 on the same machine physically, and I am. 346 00:13:50,656 --> 00:13:51,806 And what I'm going to do now is this. 347 00:13:51,806 --> 00:13:55,206 I'm going to run Sudoku after compiling it here. 348 00:13:55,206 --> 00:13:58,086 I'm going to go ahead and run Sudoku 349 00:13:58,086 --> 00:14:00,466 and then using the newbie level, so I'm just going 350 00:14:00,466 --> 00:14:03,306 to pick a newbie game, but I'm going to type an ampersand 351 00:14:03,676 --> 00:14:04,646 at the end of the command. 352 00:14:05,086 --> 00:14:07,296 If I don't type an ampersand, I get Sudoku. 353 00:14:07,296 --> 00:14:09,376 Filling my screen, doing the nice little graphics 354 00:14:09,406 --> 00:14:11,666 that it ships with and your distribution code. 355 00:14:11,666 --> 00:14:14,536 But if I do an ampersand, nothing seems to happen. 356 00:14:15,046 --> 00:14:17,006 So, in a Linux or a Mac OS environment, 357 00:14:17,226 --> 00:14:18,616 putting an ampersand at the end 358 00:14:18,616 --> 00:14:21,616 of your commands actually backgrounds the application, 359 00:14:21,856 --> 00:14:24,796 which means it is, in fact, running, and I can see this 360 00:14:24,876 --> 00:14:28,096 by typing PS, which shows you the process list. 361 00:14:28,346 --> 00:14:29,776 A process list, in computer speak, 362 00:14:29,776 --> 00:14:30,976 is just a running program. 363 00:14:31,226 --> 00:14:33,436 If you've ever looked at your Task Manager in Windows 364 00:14:33,716 --> 00:14:36,466 or your Activity Monitor in Mac OS, it's the same thing, 365 00:14:36,466 --> 00:14:37,756 albeit in text form here. 366 00:14:37,996 --> 00:14:39,306 So, I'm running a few things. 367 00:14:39,606 --> 00:14:40,886 One is called Sudoku. 368 00:14:40,986 --> 00:14:43,636 One is called PS, because I literally just ran that, 369 00:14:43,636 --> 00:14:45,766 so PS notices that it's self-running. 370 00:14:46,016 --> 00:14:48,736 And then, out of curiosity, what is TCSH, T-C-S-H? 371 00:14:50,946 --> 00:14:54,826 TCSH? So, this is your shell. 372 00:14:54,946 --> 00:14:57,826 So, when we talked about your prompts or your blinking prompt, 373 00:14:57,976 --> 00:15:00,666 that's actually a program, a program that takes input. 374 00:15:00,726 --> 00:15:02,296 The input it takes is whatever you type 375 00:15:02,296 --> 00:15:03,926 in at the command prompt and hit enter. 376 00:15:04,236 --> 00:15:07,636 So, know, just FYI, there's a lot of shells out there. 377 00:15:07,636 --> 00:15:08,396 There's TCSH. 378 00:15:08,396 --> 00:15:09,086 There's CSH. 379 00:15:09,136 --> 00:15:10,036 There's KSH. 380 00:15:10,036 --> 00:15:10,916 There's the born shell. 381 00:15:10,916 --> 00:15:13,266 There's a whole bunch of shells that people have written 382 00:15:13,266 --> 00:15:15,446 over the years, most of which behave similarly. 383 00:15:15,696 --> 00:15:17,006 But if you ever, after CS50, 384 00:15:17,006 --> 00:15:19,886 connect to what you're told is a Linux or Unix machine, 385 00:15:20,106 --> 00:15:22,666 and things don't quite seem to be working as expected, 386 00:15:22,666 --> 00:15:24,826 it probably just means the shell is slightly different. 387 00:15:24,866 --> 00:15:26,776 So, basics like LS will still work. 388 00:15:26,846 --> 00:15:29,596 CD will still work, but there might be some other differences. 389 00:15:29,596 --> 00:15:31,506 But don't be startled by those. 390 00:15:31,546 --> 00:15:33,646 But what's important here is the following. 391 00:15:33,646 --> 00:15:37,326 When I just run this command, I backgrounded the application, 392 00:15:37,326 --> 00:15:39,616 and what I'm told is that the process I'm running, 393 00:15:39,796 --> 00:15:45,906 the first one I backgrounded, has a PID, process ID of 6373. 394 00:15:46,116 --> 00:15:48,836 This is just a unique number that Linux, or Windows, 395 00:15:49,066 --> 00:15:51,086 or Mac OS assigned to this program 396 00:15:51,346 --> 00:15:53,276 so that it can keep track of it in memory. 397 00:15:53,506 --> 00:15:57,216 But this is useful, because if I now type FG, for foreground, 398 00:15:57,516 --> 00:15:59,736 that's going to put things in the front of my screen, 399 00:15:59,736 --> 00:16:01,416 and it's going to let the program run as usual. 400 00:16:01,566 --> 00:16:03,846 So now I'm going to go to my second window here, 401 00:16:04,216 --> 00:16:10,576 and I'm going to run GDB on Sudoku on 6373. 402 00:16:10,866 --> 00:16:11,716 So a new little trick. 403 00:16:11,716 --> 00:16:14,126 At the command prompt, I'm going to type the name of the program, 404 00:16:14,166 --> 00:16:17,436 but also the process ID here, hit enter. 405 00:16:17,436 --> 00:16:20,026 What I'm now at is a prompt, GDB. 406 00:16:20,026 --> 00:16:21,006 There's a whole bunch of stuff 407 00:16:21,006 --> 00:16:22,446 that loaded, but never mind that. 408 00:16:22,726 --> 00:16:25,396 And I'm going to do something like set a breakpoint 409 00:16:25,396 --> 00:16:31,736 in Sudoku.c line 164, which I just so happen 410 00:16:31,736 --> 00:16:34,356 to know is the part of the code, if you've looked at it already, 411 00:16:34,576 --> 00:16:36,216 is where I'm dealing with user input. 412 00:16:36,216 --> 00:16:38,126 So I'm going to hit enter here. 413 00:16:38,306 --> 00:16:40,746 I have a breakpoint, but now I still seem 414 00:16:40,746 --> 00:16:41,886 to have broken the program. 415 00:16:41,886 --> 00:16:44,166 How do I tell the program to just keep running now 416 00:16:44,546 --> 00:16:46,366 until it hits that breakpoint? 417 00:16:47,056 --> 00:16:48,476 Remember continue? 418 00:16:49,216 --> 00:16:49,846 So continue. 419 00:16:49,846 --> 00:16:51,246 Now nothing seems to be happening, 420 00:16:51,246 --> 00:16:53,326 but that's because the program is now running. 421 00:16:53,556 --> 00:16:56,076 If, as you see in the bottom, we have a little menu here: 422 00:16:56,116 --> 00:16:58,306 N for new game, R for restart game. 423 00:16:58,546 --> 00:17:02,176 If I go ahead and type N, notice what happened. 424 00:17:02,176 --> 00:17:04,456 Nothing in the game, but on the right-hand side, 425 00:17:04,456 --> 00:17:06,916 GDB appears to have responded. 426 00:17:06,916 --> 00:17:08,526 So this is actually a really neat trick. 427 00:17:08,746 --> 00:17:11,356 I'm sort of remotely debugging one process 428 00:17:11,676 --> 00:17:13,176 in a separate window altogether. 429 00:17:13,176 --> 00:17:16,956 So now on the right-hand side, I've broken at line 164. 430 00:17:17,026 --> 00:17:19,226 For those who have looked at the distribution code already, 431 00:17:19,226 --> 00:17:22,426 this is the kind of stuff in Sudoku.c 432 00:17:22,426 --> 00:17:23,626 that we've given you this week. 433 00:17:23,806 --> 00:17:26,966 And long story short, at this part of the code, am I sitting 434 00:17:26,966 --> 00:17:28,636 in a loop waiting for user input? 435 00:17:28,636 --> 00:17:30,156 But what's really neat now is 436 00:17:30,186 --> 00:17:32,406 that I still have the game running in the left-hand side. 437 00:17:32,666 --> 00:17:35,866 On the right-hand side, I can do things like print the value 438 00:17:35,866 --> 00:17:38,406 of a variable called g.number, which is discussed 439 00:17:38,446 --> 00:17:40,036 in the PDF, for those unfamiliar. 440 00:17:40,296 --> 00:17:42,166 Apparently, it's 160 right now. 441 00:17:42,456 --> 00:17:43,426 Oh, look at that. 442 00:17:43,426 --> 00:17:46,996 I'm playing the 160th new board at the time, so there is, 443 00:17:46,996 --> 00:17:48,076 in fact, a correspondent. 444 00:17:48,076 --> 00:17:50,126 So, for those of you who have been kind of banging your head 445 00:17:50,126 --> 00:17:52,786 against the wall, struggling with various bugs, 446 00:17:53,096 --> 00:17:55,616 know that relatively easily can you use GDB 447 00:17:55,616 --> 00:17:57,966 in this still interactive format without getting caught 448 00:17:57,966 --> 00:17:59,276 up in the graphical details. 449 00:17:59,656 --> 00:18:02,786 So, courtesy of Glen, this particular trick. 450 00:18:02,786 --> 00:18:05,146 So, if you didn't get all that, just post it as a sticky 451 00:18:05,306 --> 00:18:06,966 on the course's bulletin board, 452 00:18:06,966 --> 00:18:09,136 so you can check the other category 453 00:18:09,136 --> 00:18:10,766 or the P set 4 category there. 454 00:18:11,086 --> 00:18:11,996 So, I'm just going to quit. 455 00:18:12,256 --> 00:18:16,526 I'm just going to quit here, and then voila! 456 00:18:16,966 --> 00:18:17,526 Any questions? 457 00:18:18,926 --> 00:18:20,396 We tricked fast, but we'll recover it 458 00:18:20,396 --> 00:18:21,276 in section, if need be. 459 00:18:21,606 --> 00:18:24,226 Okay, so there's never a more attentive audience 460 00:18:24,266 --> 00:18:25,676 than when you're talking about things 461 00:18:25,676 --> 00:18:28,486 that affect them personally or can get in their way 462 00:18:28,486 --> 00:18:30,296 of their privacy or their security. 463 00:18:30,566 --> 00:18:32,846 So what I thought we'd do-- because this week, 464 00:18:33,016 --> 00:18:35,936 we lay the foundation in terms of concepts 465 00:18:36,206 --> 00:18:38,816 for a specific domain, that of forensics. 466 00:18:38,956 --> 00:18:40,296 So, in a couple of weeks' time, 467 00:18:40,296 --> 00:18:42,016 we'll be distributing problem set five. 468 00:18:42,186 --> 00:18:44,566 Problem set five's theme is, in fact, forensics, 469 00:18:44,566 --> 00:18:45,246 where you're going to have 470 00:18:45,246 --> 00:18:47,796 to implement some digital forensic-type software 471 00:18:47,946 --> 00:18:51,916 to recover, among other things, some accidentally deleted JPEGs. 472 00:18:51,916 --> 00:18:54,816 So, if you see me and a couple of the teaching fellows 473 00:18:54,816 --> 00:18:56,836 on campus this coming week, taking photos, 474 00:18:57,066 --> 00:18:58,516 just keep walking, because that kind 475 00:18:58,596 --> 00:18:59,746 of spoils the scavenger hunt. 476 00:18:59,746 --> 00:19:02,596 In fact, last year, we were kind of trailed, as I recall, 477 00:19:02,596 --> 00:19:06,806 by one student, who was taking notice of where all the hidden, 478 00:19:06,806 --> 00:19:10,626 but-- let's see-- non-obvious, 479 00:19:11,036 --> 00:19:13,506 but identifiable locations on campus were. 480 00:19:13,506 --> 00:19:15,566 So, we'll be on the lookout. 481 00:19:16,266 --> 00:19:17,326 Buffer overflow attacks. 482 00:19:17,546 --> 00:19:18,306 How does this relate? 483 00:19:18,366 --> 00:19:21,556 So, we talked last week about security and bad stuff 484 00:19:21,556 --> 00:19:23,716 that can arise when you manipulate pointers 485 00:19:23,716 --> 00:19:25,586 or ultimately manage your own memory. 486 00:19:25,856 --> 00:19:28,276 And still, today, one of the most common ways 487 00:19:28,276 --> 00:19:31,016 of compromising a program or server is 488 00:19:31,016 --> 00:19:33,756 to exploit someone's mistake with regard to memory. 489 00:19:34,046 --> 00:19:37,376 If that program is written in a language like C or C++, 490 00:19:37,656 --> 00:19:40,546 that means specifically someone screwed up with regard 491 00:19:40,546 --> 00:19:43,246 to a pointer, or someone screwed up with regard to an array. 492 00:19:43,516 --> 00:19:45,916 So what I thought I'd do is make this a little more concrete, 493 00:19:45,916 --> 00:19:48,716 using some really nice visuals that someone put together here 494 00:19:48,976 --> 00:19:50,626 of a very simple program. 495 00:19:50,626 --> 00:19:53,706 So, this program here has a main routine at the bottom. 496 00:19:54,216 --> 00:19:56,966 It has N to ARDC char ** ARDV. 497 00:19:56,966 --> 00:20:00,826 This is the same thing, recall, as saying CHAR*ARDV [], 498 00:20:01,076 --> 00:20:06,306 because we've seen this relation between pointers and arrays, 499 00:20:06,306 --> 00:20:08,576 so that's just another way of saying the same thing. 500 00:20:08,846 --> 00:20:11,386 The only thing this program does when run, apparently, 501 00:20:11,616 --> 00:20:14,006 is call a function called foo, and it passed 502 00:20:14,116 --> 00:20:17,096 to foo the first command line argument, 503 00:20:17,096 --> 00:20:19,746 whatever the user typed after the program's name. 504 00:20:19,746 --> 00:20:20,316 What does foo do? 505 00:20:20,316 --> 00:20:24,126 Well, it apparently takes a string, aka CHAR*'s input. 506 00:20:24,426 --> 00:20:27,156 It then declares statically on the stack, 507 00:20:27,526 --> 00:20:30,866 because this is a local variable in an array called C. 508 00:20:31,166 --> 00:20:34,206 That array called C has 12 locations, 509 00:20:34,266 --> 00:20:35,366 each of which is a CHAR, 510 00:20:35,586 --> 00:20:37,976 and those locations are all contiguous. 511 00:20:38,016 --> 00:20:41,166 So, this is 12 one byte chunks 512 00:20:41,166 --> 00:20:44,866 of memory back-to-back called C on the stack. 513 00:20:45,356 --> 00:20:47,506 Then there's this function here, which we haven't really used, 514 00:20:47,506 --> 00:20:48,926 but you can infer its role. 515 00:20:49,046 --> 00:20:52,986 Memcpy simply copies memory from a source to a destination. 516 00:20:53,306 --> 00:20:57,636 So, apparently what this function is doing is passing 517 00:20:57,686 --> 00:21:02,886 in to C the value that's in bar, and it's doing this 518 00:21:02,956 --> 00:21:04,416 for how many characters? 519 00:21:04,586 --> 00:21:06,816 Well, however many characters were in bar. 520 00:21:07,116 --> 00:21:09,576 So, in other words, the destination is C. That's 521 00:21:09,576 --> 00:21:11,466 where we want to put a copy of ARDV1. 522 00:21:12,306 --> 00:21:15,016 The string we want to put there is, again, bar, 523 00:21:15,016 --> 00:21:17,236 because that's what we gave this variable's name. 524 00:21:17,466 --> 00:21:21,536 And then the length of bar tells us, tells probably memcpy's 525 00:21:21,586 --> 00:21:24,006 for loop how many characters to actually copy. 526 00:21:24,356 --> 00:21:26,696 But there's a bug in here, as implied by this comment. 527 00:21:27,076 --> 00:21:29,806 When it says no bounds checking, what does that mean? 528 00:21:29,806 --> 00:21:31,116 What are we not checking here? 529 00:21:31,786 --> 00:21:36,126 So the bounds of C, right? 530 00:21:36,126 --> 00:21:39,526 So we've declared C to be of size 12, 531 00:21:39,526 --> 00:21:41,366 and we are doing a sanity check. 532 00:21:41,366 --> 00:21:43,916 We're checking how long bar is, 533 00:21:43,916 --> 00:21:45,766 so that we tell memcpy, you know what? 534 00:21:45,876 --> 00:21:49,536 Only copy the first N characters of bar, 535 00:21:49,756 --> 00:21:52,116 where N equals stir length of bar. 536 00:21:52,676 --> 00:21:54,156 But what's the problem? 537 00:21:54,156 --> 00:21:58,186 What value of string length of bar could cause problems for us? 538 00:21:58,356 --> 00:21:59,646 Any value that's what? 539 00:22:00,626 --> 00:22:02,406 Any value that's greater than 12. 540 00:22:02,486 --> 00:22:05,756 Because if we copy more than 12 bites, that's perfectly fine, 541 00:22:05,836 --> 00:22:07,866 from bar's perspective, because bar has 542 00:22:08,066 --> 00:22:09,626 as many characters as he has. 543 00:22:09,836 --> 00:22:11,976 So, it's fine if we traverse all of those characters, 544 00:22:12,236 --> 00:22:15,506 but if we blindly copy more than 12 characters 545 00:22:15,636 --> 00:22:17,606 into the array called C, 546 00:22:17,896 --> 00:22:19,846 who knows where we're going to be putting them? 547 00:22:19,846 --> 00:22:22,296 We're going to start overlapping something else on the stack, 548 00:22:22,646 --> 00:22:25,566 and that's precisely what a buffer overflow attack is 549 00:22:25,566 --> 00:22:26,056 all about. 550 00:22:26,306 --> 00:22:28,546 So, here's a really nice picture that someone put together. 551 00:22:28,546 --> 00:22:31,476 Actually, if you'd like to read more on this stuff, 552 00:22:31,476 --> 00:22:34,306 the Wikipedia article, at the moment, is really nicely-- 553 00:22:34,836 --> 00:22:36,446 really nicely describes this process. 554 00:22:36,726 --> 00:22:39,086 So, notice what we have here, a rectangle very similar 555 00:22:39,086 --> 00:22:40,326 to the one we've been using in class. 556 00:22:40,756 --> 00:22:43,116 At the bottom there, where it says parent routine stack, 557 00:22:43,556 --> 00:22:44,686 so that's like main stack. 558 00:22:44,896 --> 00:22:46,866 So main or any functions down below. 559 00:22:47,146 --> 00:22:49,426 In red is highlighted the return address. 560 00:22:49,576 --> 00:22:52,106 So, we haven't talked about this, but one of the things 561 00:22:52,106 --> 00:22:54,906 that the stack is used for, besides parameters 562 00:22:54,976 --> 00:22:59,146 and besides local variables, is the address of the function 563 00:22:59,446 --> 00:23:02,966 that should be returned to after another function is 564 00:23:03,016 --> 00:23:03,956 done executing. 565 00:23:04,186 --> 00:23:06,626 So, in other words, if main calls foo, 566 00:23:07,036 --> 00:23:10,906 what happens is main's return address, main's address 567 00:23:11,056 --> 00:23:13,106 in memory-- remember the text segment? 568 00:23:13,106 --> 00:23:16,106 Mains address in memory is plopped on top of the stack, 569 00:23:16,646 --> 00:23:19,696 so that the computer, when it's done executing the function foo, 570 00:23:19,816 --> 00:23:24,376 it then checks that lowest level frame, that red bar up there, 571 00:23:24,376 --> 00:23:27,516 and says, "What address should I now return execution to? 572 00:23:27,516 --> 00:23:29,116 Where should I go back to executing? 573 00:23:29,116 --> 00:23:29,666 Because I forget." 574 00:23:30,156 --> 00:23:32,186 It checks that value and, hopefully, that's, 575 00:23:32,186 --> 00:23:33,246 in fact, main's address. 576 00:23:33,856 --> 00:23:36,626 So a smart adversary might recognize this 577 00:23:36,626 --> 00:23:39,286 and might recognize, okay, at the very top of this picture, 578 00:23:39,596 --> 00:23:43,256 we have the start of the array, called C, hence C bracket zero, 579 00:23:43,496 --> 00:23:44,606 and at the bottom right corner 580 00:23:44,606 --> 00:23:46,786 of that rectangle we have C bracket 11. 581 00:23:47,076 --> 00:23:51,846 In other words, the computer has given us 12 bytes from top left 582 00:23:52,046 --> 00:23:54,426 to bottom right there, but where, 583 00:23:54,426 --> 00:23:57,546 according to this picture, if you pass in or try to copy more 584 00:23:57,546 --> 00:23:59,366 than 12 bytes, are we going to start writing? 585 00:24:00,026 --> 00:24:01,746 We're first going to overwrite what value? 586 00:24:03,066 --> 00:24:03,416 >> Bar. 587 00:24:03,606 --> 00:24:06,036 >> David: Bar, because bar is right below C11. 588 00:24:06,146 --> 00:24:09,276 So, as the picture implies, if you try to write to C12, or C13, 589 00:24:09,276 --> 00:24:12,756 or C bracket 14, or anything like that, you're going to start 590 00:24:12,756 --> 00:24:14,556 to go lower and lower into the stack, and so we're going 591 00:24:14,556 --> 00:24:16,976 to start to corrupt or clobber bar's value. 592 00:24:17,276 --> 00:24:19,846 Worse, we might clobber this thing called the saved frame 593 00:24:19,846 --> 00:24:20,736 pointer in green. 594 00:24:20,936 --> 00:24:23,536 This is another detail that the computer uses to remember 595 00:24:23,536 --> 00:24:26,286 where in ram it currently is 596 00:24:26,286 --> 00:24:28,146 and which memory it's actually using. 597 00:24:28,146 --> 00:24:32,596 And worse, if we write so much data into that array, 598 00:24:32,596 --> 00:24:34,246 completely overflowing it, 599 00:24:34,246 --> 00:24:37,076 such that we also overwrite the red return address, 600 00:24:37,436 --> 00:24:40,456 what the adversary can effectively do is trick the 601 00:24:40,456 --> 00:24:44,496 computer into returning to the wrong address completely. 602 00:24:44,496 --> 00:24:46,946 The next function that gets executed is not going 603 00:24:46,946 --> 00:24:50,606 to be main, where it left off, but it could be anywhere in ram. 604 00:24:51,126 --> 00:24:53,866 So, if you have this power now to tell the computer 605 00:24:53,866 --> 00:24:56,566 to trick the computer to go anywhere in memory, 606 00:24:56,876 --> 00:25:00,596 that just begs the question, where do you want them to go? 607 00:25:00,596 --> 00:25:02,766 Well, you probably want, you being the adversary, 608 00:25:02,766 --> 00:25:05,026 you probably want them to go to a location 609 00:25:05,236 --> 00:25:08,656 that has some malicious code, code ideally that you wrote 610 00:25:08,966 --> 00:25:11,116 that maybe circumvents a serial number check, 611 00:25:11,336 --> 00:25:13,016 that maybe circumvents a password check 612 00:25:13,056 --> 00:25:15,536 that does whatever you want this thing to do, 613 00:25:15,726 --> 00:25:18,306 and that's precisely what a buffer overrun does. 614 00:25:18,306 --> 00:25:21,576 So, if you just type in hello, h-e-l-l-o, 615 00:25:21,576 --> 00:25:24,276 and then you get the 0 at the end of it, 616 00:25:24,606 --> 00:25:25,886 that's perfectly safe. 617 00:25:26,046 --> 00:25:28,766 But to be clear, notice that the string fills the stack 618 00:25:29,266 --> 00:25:32,036 from the top on down, because even though we've been drawing 619 00:25:32,036 --> 00:25:34,436 this picture, as though the stack is on the bottom 620 00:25:34,436 --> 00:25:37,836 and the heap is on the top, just because of convention, 621 00:25:38,066 --> 00:25:40,676 notice that it's a little small here, but the black arrow 622 00:25:40,676 --> 00:25:44,126 on the right says memory addresses grow downward. 623 00:25:44,246 --> 00:25:48,096 In other words, in memory, the smaller addresses are on top, 624 00:25:48,096 --> 00:25:50,116 and the larger addresses are on the bottom, 625 00:25:50,116 --> 00:25:52,136 counterintuitive though that may seem, 626 00:25:52,136 --> 00:25:54,906 given how we've been drawing the pictures, but the implication is 627 00:25:54,906 --> 00:25:58,136 that if you plop down H, that does in a low numbered address. 628 00:25:58,286 --> 00:26:02,386 E slightly higher, LLO slightly higher, slightly higher. 629 00:26:02,426 --> 00:26:06,046 If you keep writing well beyond those 12 chars, you're going 630 00:26:06,046 --> 00:26:08,086 to go to a higher and higher memory address, 631 00:26:08,336 --> 00:26:10,446 which pictorially is actually down below, 632 00:26:10,586 --> 00:26:13,266 at which point you will eventually start 633 00:26:13,266 --> 00:26:15,816 to overwrite that return value. 634 00:26:16,346 --> 00:26:17,976 So here's just an example. 635 00:26:18,066 --> 00:26:21,246 If the adversary has just written a bunch of junk, 636 00:26:21,406 --> 00:26:23,726 so represented here by the letter A, the character A, 637 00:26:23,726 --> 00:26:26,776 it's just junk, junk, junk, junk, junk, but then finally 638 00:26:26,976 --> 00:26:30,426 in the red boxes, just because the adversary is lucky or he 639 00:26:30,426 --> 00:26:33,536 or she is particularly adroit with figuring out what he 640 00:26:33,536 --> 00:26:35,446 or she needs to input into the machine, 641 00:26:35,716 --> 00:26:39,046 it turns out that we have a 32-bit integer now 642 00:26:39,046 --> 00:26:43,766 in the red box, so each of those red boxes represents a char, 643 00:26:44,276 --> 00:26:47,146 but if you have four of them, that's an ent, so these types 644 00:26:47,146 --> 00:26:49,476 of decimal values now just represent, collectively, 645 00:26:49,476 --> 00:26:52,626 a 32-bit value, a big number, a 32-bit number. 646 00:26:53,086 --> 00:26:56,936 But because the adversary was smart, he or she realized, 647 00:26:57,046 --> 00:26:59,846 through trial and error or some very clever handiwork, 648 00:27:00,146 --> 00:27:03,206 that if I write this address here, that's going 649 00:27:03,206 --> 00:27:06,616 to trick the computer into executing the code that's 650 00:27:06,616 --> 00:27:14,346 up here, because this address of 80C03508 just so happens to be 651 00:27:14,346 --> 00:27:17,256 up here at the start of this blue block labeled A. 652 00:27:18,166 --> 00:27:20,486 So the implication, if not too long a tale here, 653 00:27:20,726 --> 00:27:23,666 is that if the adversary points some malicious code, 654 00:27:23,666 --> 00:27:25,436 some password circumventing code, 655 00:27:25,516 --> 00:27:28,386 serial number circumventing code, at the top of the stack, 656 00:27:28,496 --> 00:27:31,996 where that first blue A is, and then very intelligently puts 657 00:27:32,046 --> 00:27:35,206 that address down here in the red boxes, 658 00:27:35,496 --> 00:27:37,176 this adversary can trick the computer 659 00:27:37,176 --> 00:27:39,816 into executing their code that they wrote, 660 00:27:40,066 --> 00:27:42,476 that were not actually in the original program. 661 00:27:42,786 --> 00:27:44,426 Now, what can this result in? 662 00:27:44,666 --> 00:27:45,596 Frankly, anything. 663 00:27:45,646 --> 00:27:48,346 If you have the ability to inject arbitrary code 664 00:27:48,346 --> 00:27:51,026 into the computer's memory, you can do anything you want, 665 00:27:51,026 --> 00:27:53,316 and a lot of the time, just crashes the server. 666 00:27:53,316 --> 00:27:55,126 So, if you ever happen to be a webmaster 667 00:27:55,126 --> 00:27:57,876 for some student group, you're trolling through your logs, 668 00:27:58,136 --> 00:27:59,636 just seeing what people are requesting, 669 00:27:59,856 --> 00:28:01,536 very often, you'll see just junk. 670 00:28:01,536 --> 00:28:03,836 You won't see valid URLs being requested. 671 00:28:03,836 --> 00:28:06,906 You'll see a part of a valid URL and a whole bunch of nonsense, 672 00:28:06,906 --> 00:28:08,786 binary data, and that's because one 673 00:28:08,786 --> 00:28:12,176 of the most common approaches to finding mistakes like this, 674 00:28:12,516 --> 00:28:14,946 finding a mistake like this, where I'm failing 675 00:28:14,946 --> 00:28:19,526 to check just how large C is, is just hammer on the machine. 676 00:28:19,796 --> 00:28:22,316 Throw as much data as you can on it, because you know what? 677 00:28:22,316 --> 00:28:24,646 If you actually cause that machine to crash, 678 00:28:25,686 --> 00:28:28,606 probably not your goal, but something interesting is there. 679 00:28:28,606 --> 00:28:30,646 There's some potential to do some bad work, 680 00:28:31,046 --> 00:28:33,436 so that's very often the first approach is just bang 681 00:28:33,436 --> 00:28:33,846 on the thing. 682 00:28:33,846 --> 00:28:36,846 And if it breaks, aha, now you've got a potential target. 683 00:28:37,296 --> 00:28:40,816 So this is a-- there's a more general approach to this. 684 00:28:40,816 --> 00:28:43,706 You don't have to be as precise as this picture implies. 685 00:28:43,996 --> 00:28:47,346 What a lot of people use are these things called NOP sleds. 686 00:28:47,466 --> 00:28:50,116 So, a NOP, N-O-P, is no operation, 687 00:28:50,296 --> 00:28:53,436 and this actually a CPU instruction that's supported 688 00:28:53,436 --> 00:28:54,556 by many CPUs. 689 00:28:54,556 --> 00:28:56,936 That just tells the CPU, for a moment in time, 690 00:28:56,936 --> 00:28:59,546 for a little clock tick, do nothing. 691 00:28:59,966 --> 00:29:02,776 So if you have enough of these, you can actually tell the CPU, 692 00:29:02,776 --> 00:29:05,496 for some number of nanoseconds or the like, do nothing, 693 00:29:05,496 --> 00:29:07,716 do nothing, do nothing, but keep continuing down. 694 00:29:07,986 --> 00:29:10,766 So even though this picture implies that the adversary had 695 00:29:10,766 --> 00:29:13,116 to know precisely where all 696 00:29:13,116 --> 00:29:15,406 of these addresses were, they usually don't. 697 00:29:15,606 --> 00:29:19,476 And so, usually, the code they inject has a whole lot of NOPs 698 00:29:19,566 --> 00:29:23,166 that fills the space, but tells the CPU just keep going, 699 00:29:23,166 --> 00:29:25,906 keep going, keep going until you actually see an instruction 700 00:29:25,906 --> 00:29:27,386 that's not a NOP. 701 00:29:27,716 --> 00:29:30,586 So shell code refers to the code they want to inject. 702 00:29:30,856 --> 00:29:33,216 NOP just means keep going, keep going, keep going. 703 00:29:33,436 --> 00:29:35,916 So it's just a way of kind of filling the space in hopes 704 00:29:35,986 --> 00:29:38,716 that you actually maximize probability 705 00:29:38,786 --> 00:29:40,006 of compromising the machine. 706 00:29:40,006 --> 00:29:44,136 So, if you like this kind of stuff, or want to do this kind 707 00:29:44,236 --> 00:29:47,946 of stuff, CS61 next fall is a course that actually allows you 708 00:29:47,946 --> 00:29:49,016 to dabble in things like this. 709 00:29:49,016 --> 00:29:51,806 In fact, one of the first problem sets is really neat. 710 00:29:51,806 --> 00:29:54,036 It's a ticking time bomb, of sorts, 711 00:29:54,036 --> 00:29:57,106 whereby the professor hands you a program compiled in Linux. 712 00:29:57,736 --> 00:30:01,186 It has a number of password prompts, and you're challenged 713 00:30:01,306 --> 00:30:05,636 to figure out what the passwords are without actually guessing 714 00:30:05,666 --> 00:30:08,586 and without detonating the so-called binary bomb. 715 00:30:08,586 --> 00:30:10,576 In other words, if you type the wrong password in-- 716 00:30:10,576 --> 00:30:11,836 it's actually quite clever. 717 00:30:12,176 --> 00:30:13,426 If you type the wrong password 718 00:30:13,426 --> 00:30:15,996 in to CS61's first problem set code and hit enter, 719 00:30:16,236 --> 00:30:18,606 and you're wrong, well, your TF gets emailed, 720 00:30:18,606 --> 00:30:20,076 and you lose minus one point. 721 00:30:20,186 --> 00:30:21,396 Do it again, minus two points. 722 00:30:21,396 --> 00:30:22,606 Do it again, minus three points. 723 00:30:22,606 --> 00:30:25,136 So, trial and error, brute force not so smart. 724 00:30:25,196 --> 00:30:26,626 So, what you'll find-- here's a hint 725 00:30:26,626 --> 00:30:28,086 for your homework 12 months from now. 726 00:30:28,606 --> 00:30:32,126 So, use GDB, because using GDB, as we've already seen, 727 00:30:32,126 --> 00:30:33,586 can you poke around code? 728 00:30:33,746 --> 00:30:34,986 Can you set break points? 729 00:30:34,986 --> 00:30:37,436 And, in fact, that's how you can go about tackling 730 00:30:37,436 --> 00:30:40,436 that particular problem set by dissecting the binary, 731 00:30:40,436 --> 00:30:42,646 not just by running it and crossing your fingers 732 00:30:42,646 --> 00:30:44,526 that you can guess what the professor's choice 733 00:30:44,576 --> 00:30:45,346 of passwords are. 734 00:30:45,346 --> 00:30:47,876 So, it's a really neat class to take after this, and you talk 735 00:30:47,876 --> 00:30:49,546 about these kinds of low level details. 736 00:30:49,546 --> 00:30:52,196 And those of you coming to this course with familiarity with 737 00:30:52,196 --> 00:30:55,736 or experience in Java, these are really neat juicy details 738 00:30:55,786 --> 00:30:57,496 that you just can't discuss in Java, 739 00:30:57,696 --> 00:30:59,586 because in Java there are no pointers. 740 00:30:59,836 --> 00:31:01,816 There are no direct memory addresses. 741 00:31:01,816 --> 00:31:04,976 All of it is abstracted away, for better or for worse, 742 00:31:05,096 --> 00:31:06,496 using things called references. 743 00:31:06,786 --> 00:31:09,386 But the downside is that you don't have nearly 744 00:31:09,386 --> 00:31:11,236 as much control of your machine. 745 00:31:11,326 --> 00:31:15,276 The upside is adversaries don't have nearly as much control 746 00:31:15,276 --> 00:31:18,416 over your machine, so you get something and lose something. 747 00:31:18,856 --> 00:31:21,576 So with that said, let's pave the way for a little chat 748 00:31:21,576 --> 00:31:24,446 about forensics, so not actually breaking machines, 749 00:31:24,446 --> 00:31:27,656 but actually understanding how data is represented in recovery, 750 00:31:27,956 --> 00:31:29,186 stuff we actually care about. 751 00:31:29,496 --> 00:31:31,286 So, thus far in the course, we've pretty much talked 752 00:31:31,286 --> 00:31:35,736 about chars, and strings, and ents, and floats, and doubles, 753 00:31:35,736 --> 00:31:39,376 all of these primitive data types, but it's hard to imagine, 754 00:31:39,586 --> 00:31:40,726 using just those primitives, 755 00:31:40,726 --> 00:31:43,206 how you can implement anything more sophisticated, 756 00:31:43,206 --> 00:31:45,806 like a Facebook website or a search engine, 757 00:31:46,016 --> 00:31:47,716 which clearly needs to store more 758 00:31:47,716 --> 00:31:49,156 than just individual numbers. 759 00:31:49,206 --> 00:31:50,786 For instance, you might want to store 760 00:31:50,976 --> 00:31:53,926 in some program you're writing a student, and a student 761 00:31:53,926 --> 00:31:55,656 on campus is a real world entity, 762 00:31:55,656 --> 00:31:57,306 and most students have an ID number. 763 00:31:57,516 --> 00:31:58,276 They have a name. 764 00:31:58,276 --> 00:32:00,286 And if they're undergraduates, they have a house. 765 00:32:00,486 --> 00:32:02,336 So, it'd kind of be nice conceptually 766 00:32:02,336 --> 00:32:05,146 if we could just kind of clump all of those fields together 767 00:32:05,416 --> 00:32:08,286 and wrap them in a new data structure that we, ourselves, 768 00:32:08,286 --> 00:32:10,406 can define, and that's where we can, again, 769 00:32:10,406 --> 00:32:12,286 use this trick called type def for. 770 00:32:12,546 --> 00:32:14,256 So type def, we saw once before 771 00:32:14,256 --> 00:32:17,106 to declare what in CS50's library? 772 00:32:18,246 --> 00:32:21,216 A string. So, we used a one-liner that said, 773 00:32:21,216 --> 00:32:25,446 "Using type def, make 'string' a synonym for char*." 774 00:32:25,546 --> 00:32:28,496 Well, using type def, can you define more interesting data 775 00:32:28,496 --> 00:32:29,966 structures that aren't just synonyms? 776 00:32:30,006 --> 00:32:32,166 They're brand new data structures that just don't exist 777 00:32:32,166 --> 00:32:36,116 by default in C. The syntax for this is to say type def struct, 778 00:32:36,566 --> 00:32:40,156 and then open brace and then enumerate with line by line, 779 00:32:40,226 --> 00:32:43,246 typically, with semicolons at the end, each of the fields, 780 00:32:43,496 --> 00:32:47,286 so to speak, that you want to put inside of a new data type 781 00:32:47,286 --> 00:32:49,346 that you will henceforth call student. 782 00:32:49,566 --> 00:32:50,856 I could call this anything I want, 783 00:32:50,856 --> 00:32:52,966 but conceptually it seems sensible to call it student. 784 00:32:53,176 --> 00:32:54,536 So if I now glance at the code, 785 00:32:54,536 --> 00:32:58,036 of which you have a printout today, let me go ahead and go 786 00:32:58,036 --> 00:33:03,506 into struct.h. This simply looks like this. 787 00:33:03,896 --> 00:33:07,796 And let me quickly speak to this whole .h file. 788 00:33:07,796 --> 00:33:11,076 So you guys have been using CS50.h, string.h, 789 00:33:11,606 --> 00:33:13,996 standard lib.h, all of these header files. 790 00:33:14,236 --> 00:33:16,256 The motivation for those files, again, 791 00:33:16,256 --> 00:33:18,536 is so that you can tell the compiler, "Hey, 792 00:33:18,596 --> 00:33:22,056 I want to use some functions that someone else wrote; and, 793 00:33:22,056 --> 00:33:25,156 therefore, here's the file that declares all of those functions, 794 00:33:25,186 --> 00:33:27,626 and maybe some constants, and maybe some other things." 795 00:33:27,906 --> 00:33:31,096 But similarly, is it very common to use header files in C 796 00:33:31,326 --> 00:33:34,546 to declare data structures that you, yourself, want to use, 797 00:33:34,736 --> 00:33:37,306 and that other programs that use your code, 798 00:33:37,616 --> 00:33:39,926 if you start writing more sophisticated programs 799 00:33:39,926 --> 00:33:42,146 or libraries, themselves might want to use. 800 00:33:42,146 --> 00:33:44,006 So in the interest of best practice, what I did, 801 00:33:44,006 --> 00:33:45,756 even though this is a fairly simple program, 802 00:33:45,986 --> 00:33:48,306 I factored out these new definitions, 803 00:33:48,336 --> 00:33:49,296 this thing called a student. 804 00:33:49,456 --> 00:33:53,096 I put it in its own file called structs.h, just by convention, 805 00:33:53,366 --> 00:33:56,516 and what I then did in structs.c is what, 806 00:33:56,516 --> 00:33:57,976 at the top of the file, presumably? 807 00:34:00,056 --> 00:34:03,886 In structs1.c, I just do this. 808 00:34:04,426 --> 00:34:06,476 So here's a slight difference in approach. 809 00:34:06,686 --> 00:34:09,796 So there's a fundamental different approach between, 810 00:34:09,796 --> 00:34:13,936 includes CS50.h, standard IO.h, standard lib.h, string.h, 811 00:34:14,166 --> 00:34:16,786 and structs.h. Syntaxically, what's the obvious difference 812 00:34:16,786 --> 00:34:19,306 between these five lines of code? 813 00:34:20,176 --> 00:34:20,336 >> Quotes. 814 00:34:20,496 --> 00:34:21,266 >> David: So, the quotes. 815 00:34:21,266 --> 00:34:23,386 So instead of using angled brackets, 816 00:34:23,386 --> 00:34:27,066 which means this is a default server library 817 00:34:27,066 --> 00:34:30,016 or server header file that someone else preinstalled 818 00:34:30,016 --> 00:34:33,146 on the server, you use double quotes around the file name just 819 00:34:33,146 --> 00:34:35,296 to indicate this is a file that's 820 00:34:35,336 --> 00:34:37,826 in the current directory that, presumably, I wrote. 821 00:34:37,826 --> 00:34:39,666 So realize there is that distinction there. 822 00:34:39,956 --> 00:34:41,366 The blue is a little hard to read here, 823 00:34:41,366 --> 00:34:44,616 but apparently I'm using the sharp define preprocessor 824 00:34:44,616 --> 00:34:47,036 directive to define a constant called student. 825 00:34:47,116 --> 00:34:48,976 I didn't want to get bored by typing 826 00:34:48,976 --> 00:34:51,416 in half a dozen students into this program. 827 00:34:51,416 --> 00:34:54,096 I'm just going to type in three, but I could very quickly change 828 00:34:54,096 --> 00:34:56,526 that by recompiling, after changing that constant. 829 00:34:56,776 --> 00:34:59,456 And down here I have the following, a main function 830 00:35:00,066 --> 00:35:02,646 that doesn't do all that much, but it does show us how 831 00:35:02,646 --> 00:35:04,976 to use this new data structure. 832 00:35:04,976 --> 00:35:07,056 So, again, you have a printout, if you'd like to follow along. 833 00:35:07,316 --> 00:35:08,426 Notice what I've done at the top. 834 00:35:08,426 --> 00:35:10,816 I want to declare a class of students. 835 00:35:10,816 --> 00:35:13,526 So now, finally, consistent with this preaching 836 00:35:13,526 --> 00:35:16,766 of using good style, declaring your variables with smart names, 837 00:35:17,036 --> 00:35:19,806 well, let's start thinking about what we want to model, 838 00:35:19,806 --> 00:35:21,046 just in colloquial terms. 839 00:35:21,046 --> 00:35:24,176 I want to represent a classroom full of students, so I'm going 840 00:35:24,176 --> 00:35:25,756 to call my variable class. 841 00:35:26,316 --> 00:35:28,456 It's going to be of type student, but it's going 842 00:35:28,456 --> 00:35:31,906 to be an array that has specifically three students 843 00:35:31,906 --> 00:35:32,436 in it. 844 00:35:32,436 --> 00:35:34,836 So this is just an array of three data structures 845 00:35:34,896 --> 00:35:36,716 of type student; and, collectively, 846 00:35:36,716 --> 00:35:38,306 I'm going to call that class. 847 00:35:38,306 --> 00:35:40,356 Okay, so it's pretty easy now to refer to this. 848 00:35:40,766 --> 00:35:41,726 So now I have a loop. 849 00:35:41,816 --> 00:35:45,086 So for each of these students, I'm going to iterate from zero 850 00:35:45,086 --> 00:35:46,846 up to students, which, again, is three. 851 00:35:47,176 --> 00:35:49,416 Now notice the new syntax, if unfamiliar. 852 00:35:49,626 --> 00:35:50,746 I'm first going to tell the user, 853 00:35:50,806 --> 00:35:52,136 what's the first students ID. 854 00:35:52,136 --> 00:35:54,796 Then I'm going to use the familiar, get ent, and I'm going 855 00:35:54,796 --> 00:35:56,706 to use class bracket I. 856 00:35:56,946 --> 00:35:59,116 So give me the Ith student in the class, 857 00:35:59,116 --> 00:36:01,916 the Ith data structure, and then .ID. 858 00:36:02,196 --> 00:36:04,986 This dot operator just says go inside this structure 859 00:36:05,156 --> 00:36:07,896 and assign it to the field called ID. 860 00:36:08,556 --> 00:36:12,006 So dot simply means go into the struct and access this field. 861 00:36:12,176 --> 00:36:13,706 New syntax, not all that hard. 862 00:36:14,026 --> 00:36:15,016 What's the student's name? 863 00:36:15,086 --> 00:36:16,816 I do the same thing with string. 864 00:36:17,166 --> 00:36:18,186 What's the student's house? 865 00:36:18,186 --> 00:36:19,706 I do the same thing with string. 866 00:36:19,906 --> 00:36:23,246 And then just to keep the formatting of my program nice, 867 00:36:23,316 --> 00:36:24,696 I just print a new line after that. 868 00:36:25,116 --> 00:36:26,736 So, now I've got a little curious. 869 00:36:26,736 --> 00:36:29,216 I wanted to just use some other string functions, 870 00:36:29,386 --> 00:36:31,666 and I decided somewhat arbitrarily I want to print 871 00:36:31,666 --> 00:36:35,126 out any student who happens to be in Mather, a little biased. 872 00:36:35,126 --> 00:36:37,476 So 4 ent I gets serial up to student, 873 00:36:37,476 --> 00:36:40,786 so iterate over the whole class again, and here's a function 874 00:36:40,786 --> 00:36:41,626 that should be familiar, 875 00:36:41,626 --> 00:36:43,546 for those of you who've tackled Sudoku already. 876 00:36:43,826 --> 00:36:50,106 If string compares, stir comp of the Ith student's house 877 00:36:50,106 --> 00:36:51,676 "Mather" equals, equals zero-- 878 00:36:51,676 --> 00:36:53,136 well, we saw this briefly last week. 879 00:36:53,136 --> 00:36:56,736 Stir comp returns zero if two strings are equal. 880 00:36:56,876 --> 00:36:58,326 It returns a positive value 881 00:36:58,326 --> 00:37:01,336 if one belongs before the other string, lexically, graphically, 882 00:37:01,336 --> 00:37:04,056 or in a dictionary, or it returns less than zero 883 00:37:04,056 --> 00:37:06,056 if it belongs after the word. 884 00:37:06,056 --> 00:37:08,346 So, you have this ability to sort strings, 885 00:37:08,346 --> 00:37:10,116 using this function as a building block. 886 00:37:10,316 --> 00:37:11,866 I'm just using it for equality. 887 00:37:11,866 --> 00:37:14,166 So, I'm checking, is the return value of stir comp, 888 00:37:14,446 --> 00:37:16,616 given this string and given this other string, 889 00:37:16,886 --> 00:37:17,956 equal, equal to zero? 890 00:37:17,956 --> 00:37:21,196 If so, let me go ahead and print out who's in Mather? 891 00:37:21,196 --> 00:37:23,206 What percent S is my placeholder? 892 00:37:23,336 --> 00:37:28,536 Class, bracket I.name gives me the Ith student's name, 893 00:37:28,846 --> 00:37:30,336 and it just happens to be a char*, 894 00:37:30,756 --> 00:37:32,576 which is why percent S suffices. 895 00:37:32,796 --> 00:37:34,256 And then finally, down here, 896 00:37:34,396 --> 00:37:37,246 one of the most new important things that I've done. 897 00:37:37,496 --> 00:37:41,556 Why do I now call free on every student's name 898 00:37:41,556 --> 00:37:43,646 and every student's house? 899 00:37:45,156 --> 00:37:46,236 >> To return memory. 900 00:37:46,336 --> 00:37:46,456 >> David: Sorry? 901 00:37:47,066 --> 00:37:47,886 >> To return memory. 902 00:37:47,886 --> 00:37:48,496 >> David: So to return the memory, 903 00:37:48,496 --> 00:37:50,456 but yet I didn't call malloc, right? 904 00:37:50,486 --> 00:37:52,166 The rule of thumb last week, very simply, 905 00:37:52,166 --> 00:37:56,796 was if you call malloc, you'd better call free at the end. 906 00:37:56,796 --> 00:37:58,856 So, why am I, nonetheless, calling free here? 907 00:37:59,326 --> 00:38:04,856 >> Because it raised our pointer. 908 00:38:04,996 --> 00:38:07,756 >> David: Okay, because it raised our pointers, 909 00:38:08,326 --> 00:38:11,276 raised our very much like pointers, 910 00:38:11,276 --> 00:38:13,586 so that's the right direction, but I didn't call malloc. 911 00:38:13,586 --> 00:38:15,506 Let's focus on that inconsistency. 912 00:38:15,736 --> 00:38:18,396 Why am I calling free, even though I did not call malloc? 913 00:38:18,396 --> 00:38:18,636 >> Get string. 914 00:38:19,876 --> 00:38:21,076 >> So get string dict, right? 915 00:38:21,076 --> 00:38:22,566 So one of the reasons for peeling back 916 00:38:22,566 --> 00:38:25,036 that layer last week and looking briefly at the implementation 917 00:38:25,036 --> 00:38:28,376 of get string is that all this time it's had this inherent 918 00:38:28,376 --> 00:38:32,376 memory elite, because get string calls malloc to get as much ram 919 00:38:32,376 --> 00:38:36,536 as is needed to store the string the user types in, but if you, 920 00:38:36,636 --> 00:38:39,116 the person who's entrusted with that pointer, 921 00:38:39,306 --> 00:38:42,926 after get string returns, does not free this, you can begin 922 00:38:42,926 --> 00:38:44,126 to experience in your programs 923 00:38:44,126 --> 00:38:47,006 that very familiar real world annoying feature 924 00:38:47,006 --> 00:38:48,736 of some software that's poorly written 925 00:38:48,966 --> 00:38:51,926 of your computer grinding to a halt eventually or having 926 00:38:51,926 --> 00:38:54,986 to reboot, because everything is getting really, really slow, 927 00:38:54,986 --> 00:38:57,906 even though you're not doing anything with the computer. 928 00:38:58,056 --> 00:38:59,916 Quite often, is it simply a memory leak? 929 00:38:59,916 --> 00:39:01,406 So, thereto is it important 930 00:39:01,406 --> 00:39:03,786 to know what the functions are actually doing. 931 00:39:04,056 --> 00:39:06,696 So, let's go ahead and compile structs 1. 932 00:39:06,696 --> 00:39:08,026 I'm going to use make here. 933 00:39:08,026 --> 00:39:10,516 I'm going to go ahead and run structs 1. 934 00:39:10,846 --> 00:39:14,526 Student's ID, we'll call this student 11, 935 00:39:15,016 --> 00:39:17,616 and this will be David, and David is in Mather. 936 00:39:17,806 --> 00:39:23,756 Okay, student 12 will be Joe, and he'll be in Leverett. 937 00:39:24,336 --> 00:39:27,566 And the student ID 13, this will be Sally. 938 00:39:27,566 --> 00:39:28,846 She'll be in Foho [phonetic]. 939 00:39:28,846 --> 00:39:29,696 All right. 940 00:39:30,866 --> 00:39:31,466 Hello, Sally. 941 00:39:31,556 --> 00:39:32,676 So, David is in Mather. 942 00:39:32,806 --> 00:39:34,446 Okay, so a very simple demonstration, 943 00:39:34,646 --> 00:39:37,446 but nonetheless actually introduces finally this ability 944 00:39:37,446 --> 00:39:39,726 to define our own date structures 945 00:39:39,766 --> 00:39:42,256 and to even store those data structures inside 946 00:39:42,256 --> 00:39:44,586 of other data structures like an array. 947 00:39:44,696 --> 00:39:47,446 So, an array, even though it's barely a data structure, 948 00:39:47,446 --> 00:39:49,996 it's just the same type of data type, back, to back, to back. 949 00:39:50,276 --> 00:39:52,916 It's one of the first date types that we've actually looked at, 950 00:39:53,146 --> 00:39:55,816 but this program is not terribly useful, 951 00:39:55,816 --> 00:39:58,496 because I'm not doing anything with this information. 952 00:39:58,496 --> 00:40:01,156 Once I hit enter, yes, it tells me who's in Mather, 953 00:40:01,396 --> 00:40:03,676 but then it throws away all of that information. 954 00:40:03,676 --> 00:40:06,686 In fact, the program very explicitly frees all 955 00:40:06,686 --> 00:40:07,466 of that memory. 956 00:40:07,786 --> 00:40:10,746 So, if I actually wanted to do something with this, 957 00:40:11,276 --> 00:40:14,376 maybe I should use something like version 2 here. 958 00:40:14,376 --> 00:40:15,956 So this, too, you have a printout of. 959 00:40:16,276 --> 00:40:19,716 Notice that it's very similar to the code before. 960 00:40:19,716 --> 00:40:23,026 I'm populating the class variable with the same data 961 00:40:23,026 --> 00:40:25,586 as before, but before I quit this time, 962 00:40:25,786 --> 00:40:29,446 I introduce a new loop that does this. 963 00:40:29,996 --> 00:40:31,886 So, one thing we've not yet introduced, 964 00:40:31,886 --> 00:40:33,866 and that will become invaluable when it becomes time 965 00:40:33,866 --> 00:40:36,916 to actually recover JPEGs and other such things 966 00:40:36,916 --> 00:40:38,046 for forensic purposes, 967 00:40:38,366 --> 00:40:40,576 let's actually save these students to disc. 968 00:40:40,986 --> 00:40:42,236 So a new function call here. 969 00:40:42,236 --> 00:40:44,916 There's a function that exists in C and a whole bunch 970 00:40:44,916 --> 00:40:47,326 of languages called F-open, file open. 971 00:40:47,616 --> 00:40:48,856 Takes at least two arguments. 972 00:40:48,926 --> 00:40:51,646 The first is the name of the file that you want to open. 973 00:40:52,066 --> 00:40:54,956 Typically, if it doesn't exist, it will be created as a result 974 00:40:54,956 --> 00:40:55,916 of this function call. 975 00:40:56,156 --> 00:40:58,846 And then a second argument, which in this case, is "W". 976 00:40:59,186 --> 00:41:01,116 Any guesses as to what the W denotes here? 977 00:41:02,656 --> 00:41:04,356 W for write. 978 00:41:04,646 --> 00:41:07,676 So this just means open the file, but open it in such a way 979 00:41:07,676 --> 00:41:09,606 that I can write to it and change it, 980 00:41:09,606 --> 00:41:10,766 not just read from it. 981 00:41:10,926 --> 00:41:12,846 So that's useful, too, since, in this case, 982 00:41:12,846 --> 00:41:14,596 we want data going out, not coming in. 983 00:41:14,866 --> 00:41:16,826 Now FP not equal to null. 984 00:41:17,056 --> 00:41:18,946 So, null, again, is this special value 985 00:41:18,946 --> 00:41:20,686 that generally signifies errors. 986 00:41:20,876 --> 00:41:23,066 Why, I might, just thinking intuitively, 987 00:41:23,146 --> 00:41:26,066 why or when might F open return null, do you think? 988 00:41:26,536 --> 00:41:30,996 Just a possible reason might it return null? 989 00:41:30,996 --> 00:41:32,246 >> Another file with that name already 990 00:41:32,816 --> 00:41:36,856 in the database folder with-- that's read only. 991 00:41:36,856 --> 00:41:39,536 >> David: Okay, so maybe there's already a file called database 992 00:41:39,536 --> 00:41:41,706 in my current directory, and maybe, 993 00:41:41,706 --> 00:41:44,116 for whatever reasons, it's read only. 994 00:41:44,176 --> 00:41:47,566 I, as the user, don't have the right to change that file, 995 00:41:47,566 --> 00:41:49,476 because maybe it came with the operating system. 996 00:41:49,476 --> 00:41:50,756 Maybe another user put it there. 997 00:41:50,986 --> 00:41:52,366 I just don't have the privileges. 998 00:41:52,366 --> 00:41:54,306 What's another reason that you might fail 999 00:41:54,306 --> 00:41:59,406 to write a file to disc? 1000 00:41:59,406 --> 00:41:59,576 >> [inaudible] 1001 00:41:59,576 --> 00:41:59,796 >> David: Sorry? 1002 00:41:59,796 --> 00:42:03,516 >> No disc space. 1003 00:42:03,746 --> 00:42:04,506 >> David: No disc space. 1004 00:42:04,666 --> 00:42:07,916 Okay, so at some point, we might realize that as much as I want 1005 00:42:07,916 --> 00:42:10,376 to put these bits on disc, there's just not room, 1006 00:42:10,376 --> 00:42:13,006 so something has to fail at some point. 1007 00:42:13,006 --> 00:42:14,246 So let's take a look now. 1008 00:42:14,246 --> 00:42:15,176 I, at least, double check. 1009 00:42:15,416 --> 00:42:16,776 Is FD not equal to null? 1010 00:42:16,776 --> 00:42:17,706 Because if it's not null, 1011 00:42:17,886 --> 00:42:21,676 that means I've gotten back what's called a file pointer 1012 00:42:21,726 --> 00:42:22,956 or a file handle. 1013 00:42:23,326 --> 00:42:25,576 It's called any number of things, but it's just a variable 1014 00:42:25,786 --> 00:42:29,046 that somehow represent that file's location on disc. 1015 00:42:29,306 --> 00:42:29,946 Now I do this. 1016 00:42:30,056 --> 00:42:33,336 I go ahead and iterate three times from zero up to students. 1017 00:42:33,336 --> 00:42:34,026 And what do I do? 1018 00:42:34,246 --> 00:42:37,246 Well, I use this slightly different function that's very 1019 00:42:37,246 --> 00:42:38,186 similar to the one we've used. 1020 00:42:38,186 --> 00:42:40,906 Not print F, but F print F. What's really neat 1021 00:42:40,906 --> 00:42:44,316 about F print F is that it's the file version of print F, 1022 00:42:44,316 --> 00:42:47,556 and it just takes one additional argument that allows it 1023 00:42:47,556 --> 00:42:51,396 to distinguish itself from print F. The first argument here is 1024 00:42:51,396 --> 00:42:55,076 not the string you want to print, but a pointer to the file 1025 00:42:55,076 --> 00:42:56,866 that you want to print this sting to. 1026 00:42:57,136 --> 00:43:01,636 So this means send the following string to FP, that file pointer. 1027 00:43:01,636 --> 00:43:02,196 What string? 1028 00:43:02,816 --> 00:43:03,626 Percent dn. 1029 00:43:03,856 --> 00:43:06,536 What is the ent that should go there? 1030 00:43:06,536 --> 00:43:07,786 Just the student's ID. 1031 00:43:07,956 --> 00:43:09,676 After that, put the student's name. 1032 00:43:09,806 --> 00:43:11,206 After that, put the student's house. 1033 00:43:11,206 --> 00:43:14,106 And then, finally, after this loop is done, close the file. 1034 00:43:14,456 --> 00:43:18,966 So, what I've done is dump into a file student, after student, 1035 00:43:19,006 --> 00:43:21,716 after student, and each student's fields happen 1036 00:43:21,716 --> 00:43:22,866 to be on their own line. 1037 00:43:23,116 --> 00:43:27,046 So let me go ahead and compile this, make structs 2, 1038 00:43:27,166 --> 00:43:29,956 because the rest of the program is identical to structs 1. 1039 00:43:30,206 --> 00:43:34,726 Let me go ahead and run structs 2, enter, and now if I do this, 1040 00:43:34,726 --> 00:43:38,506 and then David, and then Mather, and then 13, and Joe, 1041 00:43:38,506 --> 00:43:44,396 and Leverett, and then 14, and then Sally, and Foho, enter. 1042 00:43:44,646 --> 00:43:47,566 Okay, so the program has behaved the same, but if I do an LS now, 1043 00:43:47,566 --> 00:43:51,046 notice I have this file that if I do an LS-L, 1044 00:43:51,286 --> 00:43:53,536 was just created a moment ago. 1045 00:43:53,626 --> 00:43:56,016 And as an aside, this is the size of a file. 1046 00:43:56,186 --> 00:43:57,636 This is when it was last modified. 1047 00:43:57,786 --> 00:43:59,816 LS-L shows you some interesting tidbits. 1048 00:44:00,006 --> 00:44:01,586 Let me go ahead and open database, 1049 00:44:01,586 --> 00:44:03,846 and voila, there's my database. 1050 00:44:04,076 --> 00:44:06,716 So what is my interpretation of a database here? 1051 00:44:07,076 --> 00:44:09,136 Just a text file that's got this data line, 1052 00:44:09,136 --> 00:44:10,336 after line, after line. 1053 00:44:10,666 --> 00:44:13,676 So, in effect, what I've kind of done now, albeit it very quick 1054 00:44:13,676 --> 00:44:16,546 and dirtily, I've come up with my own file format. 1055 00:44:16,926 --> 00:44:19,696 This is my, sort of David Malan's file format 1056 00:44:19,696 --> 00:44:21,106 for students. 1057 00:44:21,106 --> 00:44:24,816 I just have to know that a student record is stored 1058 00:44:24,816 --> 00:44:27,746 in three lines, one after the other, and I, myself, 1059 00:44:27,746 --> 00:44:30,896 just have to remember that the ID comes first, then the name, 1060 00:44:31,066 --> 00:44:33,846 then the house, then repeat some number of times. 1061 00:44:34,276 --> 00:44:37,016 So, now, if we take a step back and consider the real world, 1062 00:44:37,236 --> 00:44:39,986 when you've saved the file in Microsoft Word format, 1063 00:44:40,346 --> 00:44:45,066 or in HTML format, or in any file format that has one 1064 00:44:45,066 --> 00:44:46,976 of those three letter extensions, typically, 1065 00:44:47,346 --> 00:44:50,106 you are simply telling the program, save the contents 1066 00:44:50,106 --> 00:44:53,066 of whatever it is you want to save, in the proprietary format 1067 00:44:53,066 --> 00:44:55,786 at Microsoft, or Apple, or whomever it came up with. 1068 00:44:56,056 --> 00:44:59,106 Now, that just means that they decided, on their own, 1069 00:44:59,336 --> 00:45:01,436 how they want to put bits or how they want 1070 00:45:01,436 --> 00:45:03,296 to put characters on disc. 1071 00:45:03,586 --> 00:45:06,196 Probably, Microsoft Word is not as naive 1072 00:45:06,196 --> 00:45:08,906 as just storing the words in your file from top to bottom, 1073 00:45:08,906 --> 00:45:10,576 left to right, because there's a whole bunch 1074 00:45:10,576 --> 00:45:12,296 of junk in any Word document. 1075 00:45:12,426 --> 00:45:15,636 There's bold facing, italics, tabular information. 1076 00:45:15,826 --> 00:45:17,406 There's a huge amount of metadata, 1077 00:45:17,626 --> 00:45:19,606 so odds are Microsoft can't get away 1078 00:45:19,606 --> 00:45:21,256 with such a simple file format, 1079 00:45:21,556 --> 00:45:23,646 but some formats are relatively simple. 1080 00:45:23,646 --> 00:45:26,576 So JPEG photographs, when you take them with a digital camera, 1081 00:45:26,976 --> 00:45:29,936 actually are stored on disc, whether it's your hard drive 1082 00:45:29,936 --> 00:45:32,516 or your compact flash card, or your SD micro card, 1083 00:45:32,516 --> 00:45:33,996 whatever your technology is. 1084 00:45:34,406 --> 00:45:39,086 They're stored by storing a pattern of bits at the very top 1085 00:45:39,086 --> 00:45:41,596 of the disc that say here comes a JPEG. 1086 00:45:41,866 --> 00:45:44,566 Then there's some complicated mathematics that implement 1087 00:45:44,566 --> 00:45:48,206 that JPEG with a whole bunch of bits, megabytes worth of bits, 1088 00:45:48,206 --> 00:45:50,926 usually, and then sometimes at the bottom, there's a few bits 1089 00:45:50,926 --> 00:45:53,376 that say stop, this is the end of the JPEG. 1090 00:45:53,876 --> 00:45:56,546 And what you'll ultimately leverage in problem set five is 1091 00:45:56,546 --> 00:45:59,056 that signature at the top of the JPEG. 1092 00:45:59,656 --> 00:46:02,556 Just as here I could make this leap of faith and assume 1093 00:46:02,556 --> 00:46:05,586 that every there lines represents a student, 1094 00:46:05,916 --> 00:46:08,806 so in the context of forensics can you sometimes assume 1095 00:46:08,806 --> 00:46:10,786 that any time you see this pattern of bits, 1096 00:46:10,876 --> 00:46:12,666 aha, here comes a JPEG. 1097 00:46:12,896 --> 00:46:14,946 Let me just keep reading, reading, reading bits 1098 00:46:14,946 --> 00:46:16,566 until I see that pattern again, 1099 00:46:16,756 --> 00:46:18,946 at which point I know I've encountered what? 1100 00:46:18,946 --> 00:46:20,136 >> [inaudible] 1101 00:46:20,136 --> 00:46:20,766 >> David: Another JPEG. 1102 00:46:21,126 --> 00:46:23,916 So digital cameras, frankly, are pretty stupid when it comes 1103 00:46:23,916 --> 00:46:25,026 to storing information. 1104 00:46:25,026 --> 00:46:28,276 They generally write one photo, then the other right after it, 1105 00:46:28,276 --> 00:46:30,856 then the other right after it, and the other right after it. 1106 00:46:30,856 --> 00:46:32,086 And so, in fact, you'll find 1107 00:46:32,086 --> 00:46:33,736 that using some basic building blocks 1108 00:46:33,776 --> 00:46:35,696 like these here today can recover 1109 00:46:35,986 --> 00:46:38,116 that data relatively easily. 1110 00:46:38,376 --> 00:46:41,426 Let me show you one other trick here, going back to a piece 1111 00:46:41,426 --> 00:46:42,666 of code we had last week. 1112 00:46:42,916 --> 00:46:44,006 We had this sample, 1113 00:46:44,076 --> 00:46:46,846 stupid program called Bar Dot C. You had a printout 1114 00:46:46,846 --> 00:46:48,746 of it last week, and it was really just meant 1115 00:46:48,746 --> 00:46:51,366 to demonstrate GDB, because it has a bunch of functions, 1116 00:46:51,366 --> 00:46:53,616 a bunch of variables, and I can step through each one. 1117 00:46:53,926 --> 00:46:56,616 Well, one of the most powerful features of GDB, especially now 1118 00:46:56,616 --> 00:46:59,186 that some of you are in experiencing seg faults 1119 00:46:59,236 --> 00:47:02,506 with increasing frequency, is to ask GDB, okay, 1120 00:47:02,506 --> 00:47:03,956 yes, I have a seg fault. 1121 00:47:03,956 --> 00:47:06,416 I know that, but where did it happen? 1122 00:47:06,976 --> 00:47:09,236 One of the things you can do with GDB is this. 1123 00:47:09,236 --> 00:47:12,346 I'm going to go ahead and run GDB on this program called Bar. 1124 00:47:12,346 --> 00:47:15,716 I'm going to set a break point at the function called bar, 1125 00:47:16,016 --> 00:47:18,586 because recall from this program that main calls foo 1126 00:47:18,776 --> 00:47:21,256 and foo calls bar, so I want to break, 1127 00:47:21,326 --> 00:47:23,796 once I'm pretty nested inside this program, 1128 00:47:23,796 --> 00:47:26,156 once I've got a bunch of frames on my stack. 1129 00:47:26,156 --> 00:47:28,076 So I'm going to go ahead and type run now. 1130 00:47:28,446 --> 00:47:31,886 It's broken right at the point where bar is executing. 1131 00:47:31,886 --> 00:47:34,086 Actually, let me get rid of that error message. 1132 00:47:34,506 --> 00:47:40,186 Make bar. Okay, so break in bar, run. 1133 00:47:40,186 --> 00:47:42,136 So it's going to break in the bar function. 1134 00:47:42,336 --> 00:47:44,676 So you can type list to see where you are in the program, 1135 00:47:44,676 --> 00:47:48,306 but what's really neat is you can type back trace or just BT 1136 00:47:48,356 --> 00:47:52,176 for short, and what this will show you, somewhat arcanely, 1137 00:47:52,176 --> 00:47:54,926 is a list of all of the frames you currently have 1138 00:47:55,006 --> 00:47:57,246 on your stack; in other words, all of the functions 1139 00:47:57,246 --> 00:47:58,926 that were called up until this point. 1140 00:47:59,186 --> 00:48:01,816 So, if you're running GDB, and you encounter a seg fault, 1141 00:48:01,816 --> 00:48:04,596 and you really don't know what went wrong, frankly, 1142 00:48:05,196 --> 00:48:07,936 you can just kind of infer where it happened 1143 00:48:07,936 --> 00:48:09,446 by typing back trace often. 1144 00:48:09,706 --> 00:48:11,936 You'll see all of the functions that may have been called. 1145 00:48:12,136 --> 00:48:14,066 Now, especially in Sudoku, where you're using 1146 00:48:14,066 --> 00:48:15,776 that fancy library called curses, 1147 00:48:16,036 --> 00:48:17,996 there might be not only your functions in there, 1148 00:48:17,996 --> 00:48:19,936 but a whole bunch of functions you didn't write. 1149 00:48:20,586 --> 00:48:22,486 Generally speaking, it's safe to assume 1150 00:48:22,486 --> 00:48:24,136 that those functions are correct. 1151 00:48:24,346 --> 00:48:26,186 So, much as you'd like to disagree, 1152 00:48:26,356 --> 00:48:28,196 odds are those functions, many of which have been 1153 00:48:28,196 --> 00:48:29,876 around for many, many years, are correct, 1154 00:48:30,316 --> 00:48:33,206 which means probably one of the functions you wrote, 1155 00:48:33,206 --> 00:48:36,296 that called that huge list of functions, is the culprit. 1156 00:48:36,546 --> 00:48:38,516 So, in this case, what I see here is this. 1157 00:48:38,516 --> 00:48:41,386 The oldest frame on my stack is this one here, 1158 00:48:41,386 --> 00:48:42,786 number two, for main. 1159 00:48:42,956 --> 00:48:45,326 After that, there's another frame on the stack for foo, 1160 00:48:45,326 --> 00:48:48,046 and then this third one for bar. 1161 00:48:48,046 --> 00:48:51,346 So what I can actually do here, BT for back trace again. 1162 00:48:51,566 --> 00:48:54,216 If I want to see, you know, what was going in main, 1163 00:48:54,216 --> 00:48:55,886 right now I'm in bar-- whoops. 1164 00:48:56,536 --> 00:49:00,246 Let me fix since I accidentally pasted something in. 1165 00:49:00,836 --> 00:49:02,866 Break bar, run. 1166 00:49:04,256 --> 00:49:05,686 Okay, so now this is bar. 1167 00:49:05,936 --> 00:49:07,456 Suppose that here's my back trace? 1168 00:49:07,456 --> 00:49:09,546 You can type frame, and then the number 1169 00:49:09,546 --> 00:49:11,796 of the frame you want to go to, frame two. 1170 00:49:12,026 --> 00:49:14,416 Now notice I'm in frame two. 1171 00:49:14,596 --> 00:49:18,616 It doesn't mean that I've said go continue executing main. 1172 00:49:18,616 --> 00:49:20,536 Go back to frame two and start executing. 1173 00:49:20,726 --> 00:49:23,806 This means let me poke around inside of this stack frame. 1174 00:49:23,806 --> 00:49:25,906 If we have this picture from last week 1175 00:49:25,906 --> 00:49:28,886 or even this fellow's here, when we have various frames 1176 00:49:28,886 --> 00:49:30,466 from the stack, this just means let me poke 1177 00:49:30,466 --> 00:49:33,686 around inside this frame, so that now I can type things 1178 00:49:33,686 --> 00:49:37,006 like print A, and I can actually see the value of A, 1179 00:49:37,066 --> 00:49:41,116 even though that variable is out of scope inside of bar. 1180 00:49:41,426 --> 00:49:44,796 And just to be kind of curious, notice what we can do here. 1181 00:49:44,796 --> 00:49:47,256 I'm going to rerun the program, so as to start from scratch. 1182 00:49:47,256 --> 00:49:50,726 So run is going to break at bar. 1183 00:49:50,726 --> 00:49:51,606 And let me do this. 1184 00:49:51,606 --> 00:49:55,546 Notice that bar has a local variable or really a parameter. 1185 00:49:55,546 --> 00:49:56,716 This is what bar looks like. 1186 00:49:56,986 --> 00:50:00,866 So I'm going to print M. All right, M, it was the value 10. 1187 00:50:01,026 --> 00:50:01,876 That's not interesting. 1188 00:50:01,876 --> 00:50:05,806 Let me print the address of M. So ampersand M is the address 1189 00:50:05,856 --> 00:50:08,826 of operator, so not I get a fairly looking hexadecimal 1190 00:50:08,826 --> 00:50:11,126 string, but that's just a fairly concise way, 1191 00:50:11,156 --> 00:50:13,276 we decided last week, of representing big numbers. 1192 00:50:13,376 --> 00:50:15,186 Hexadecimal, because you have more digits. 1193 00:50:15,436 --> 00:50:17,116 You can represent things more compactly. 1194 00:50:17,326 --> 00:50:18,306 And, yeah, look at this. 1195 00:50:18,346 --> 00:50:20,176 This is a pointer of type ent. 1196 00:50:20,686 --> 00:50:22,716 Okay, so this is a cryptic-looking address, 1197 00:50:22,906 --> 00:50:23,906 but let me now do this. 1198 00:50:24,026 --> 00:50:26,986 Let me go into the frame for main. 1199 00:50:27,366 --> 00:50:30,876 Main, itself, apparently has a variable A. I can do a sanity 1200 00:50:30,876 --> 00:50:33,046 check, print A. A was five, whatever, 1201 00:50:33,236 --> 00:50:37,776 but let's print the address of A, and now we have this value. 1202 00:50:38,076 --> 00:50:40,446 So now notice, even though this is a little cryptic, 1203 00:50:40,446 --> 00:50:43,066 this number is almost identical to this one. 1204 00:50:43,376 --> 00:50:44,776 This is from bar. 1205 00:50:44,776 --> 00:50:45,886 This is from main. 1206 00:50:46,126 --> 00:50:48,586 And now notice which of these numbers is bigger. 1207 00:50:48,586 --> 00:50:56,386 It's somewhat subtle, but which of these numbers is bigger? 1208 00:50:56,846 --> 00:50:57,786 They're almost the same. 1209 00:50:57,786 --> 00:51:02,236 Zero X, bravo, foxtrot, 976 Charley, 1210 00:51:02,236 --> 00:51:06,296 and then they only differ at two places, 60 and 9C. 1211 00:51:06,456 --> 00:51:07,866 Even if you're not quite comfortable 1212 00:51:07,866 --> 00:51:10,616 with hexadecimal yet, this one on the bottom is, in fact, 1213 00:51:10,746 --> 00:51:14,076 bigger, which confirms the prediction earlier 1214 00:51:14,076 --> 00:51:15,976 that when we are manipulating the stack, 1215 00:51:16,486 --> 00:51:19,666 these larger addresses are, in fact, down below. 1216 00:51:19,796 --> 00:51:23,736 Main, the address of Main's variable is bigger numerically, 1217 00:51:23,736 --> 00:51:26,496 its address, because it's lower in the stack. 1218 00:51:26,736 --> 00:51:28,286 It's one of the earlier stack frames. 1219 00:51:28,516 --> 00:51:31,336 So, here, too, using GDB, can we actually poke around with 1220 00:51:31,466 --> 00:51:35,336 such detail that we can see precisely the assumptions 1221 00:51:35,636 --> 00:51:37,436 that adversaries are exploiting. 1222 00:51:37,436 --> 00:51:38,886 So, that was a lot. 1223 00:51:38,886 --> 00:51:39,776 Let's take a five minute break. 1224 00:51:43,286 --> 00:51:44,816 All right, so we are back. 1225 00:51:45,556 --> 00:51:46,936 We are back, and so is lunch. 1226 00:51:46,936 --> 00:51:50,696 So not this Friday, but next Friday will be the next lunch 1227 00:51:50,726 --> 00:51:51,816 with CS50's team. 1228 00:51:51,816 --> 00:51:54,596 If you'd like to RSVP, just go to CS50.net/RSVP. 1229 00:51:54,636 --> 00:51:56,696 We will have to cap at ten, 1230 00:51:56,696 --> 00:51:58,366 just so we keep things sufficiently intimate, 1231 00:51:58,366 --> 00:51:59,876 so that we can actually talk with one another, 1232 00:52:00,276 --> 00:52:02,856 but check that out on the website, if you are available. 1233 00:52:03,016 --> 00:52:08,526 And free, so exit a small leap we've seen now multiple times. 1234 00:52:08,526 --> 00:52:11,756 It's going to recur, especially in the context of forensics 1235 00:52:11,916 --> 00:52:13,806 and our use of file IO. 1236 00:52:13,946 --> 00:52:17,416 Incidentally, file IO simply means input and output, 1237 00:52:17,586 --> 00:52:19,556 anytime you've seen that acronym somewhere. 1238 00:52:20,116 --> 00:52:22,166 And we've seen briefly here functions 1239 00:52:22,166 --> 00:52:23,536 like F-open and F-close. 1240 00:52:23,536 --> 00:52:26,656 We saw print F, the analog for reading is F scan F, 1241 00:52:27,096 --> 00:52:28,936 which we'll use before long. 1242 00:52:28,936 --> 00:52:31,786 F read and F write allows you to read and write, 1243 00:52:31,786 --> 00:52:34,686 not ASCII characters, but binary data, which is useful 1244 00:52:34,686 --> 00:52:37,286 when you're writing things that aren't, by nature, ASCII. 1245 00:52:37,656 --> 00:52:39,146 FEOF is just a function 1246 00:52:39,146 --> 00:52:41,836 that tells you am I at the end of file? 1247 00:52:42,066 --> 00:52:43,716 EOF. Yes or no. 1248 00:52:44,276 --> 00:52:45,986 So more on those in the time to come. 1249 00:52:45,986 --> 00:52:48,096 But before we get to this sexy picture, 1250 00:52:48,466 --> 00:52:51,006 a less sexy picture, hexadecimal. 1251 00:52:51,366 --> 00:52:51,966 So we saw this. 1252 00:52:52,146 --> 00:52:56,996 Zero, zero, zero, zero in binary represents what decimal number? 1253 00:52:56,996 --> 00:52:57,586 Easy question. 1254 00:52:58,086 --> 00:52:58,286 >> Zero. 1255 00:52:58,286 --> 00:52:59,036 >> David: All right, so zero. 1256 00:52:59,256 --> 00:53:02,786 But in hexadecimal, it also is represented as zero. 1257 00:53:03,086 --> 00:53:07,326 So hexadecimal digits conveniently can be represented, 1258 00:53:07,806 --> 00:53:11,166 hexadecimal digits conveniently represent four bits. 1259 00:53:11,486 --> 00:53:13,766 So, in other words, let's disregard decimal for today 1260 00:53:13,896 --> 00:53:15,506 and just say that this is hexadecimal 1261 00:53:15,506 --> 00:53:16,436 in the right-hand side. 1262 00:53:16,436 --> 00:53:19,456 If, instead, I want to represent the number one in binary, 1263 00:53:19,676 --> 00:53:21,826 but in hexadecimal this is simply one, 1264 00:53:21,826 --> 00:53:27,756 then if I fast forward to, say, 1111, this equals what in hex? 1265 00:53:28,616 --> 00:53:28,716 >> F. 1266 00:53:28,946 --> 00:53:32,266 >> David: So this equals F. So all ones, that is what? 1267 00:53:32,316 --> 00:53:38,036 The number-- let's see, this is the 8's column, plus 4's column. 1268 00:53:38,036 --> 00:53:39,846 That's 12, 13, 14, 15. 1269 00:53:40,056 --> 00:53:44,186 So 15 in decimal is F in hexadecimal. 1270 00:53:44,376 --> 00:53:46,086 All right, but we've been looking at numbers 1271 00:53:46,086 --> 00:53:48,796 that are much longer than just single characters, right? 1272 00:53:48,796 --> 00:53:51,286 We saw addresses like this thing here, 1273 00:53:51,486 --> 00:53:54,986 which clearly has multiple hexadecimal digits back-to-back. 1274 00:53:55,206 --> 00:53:57,846 So all this means is that you take the-- 1275 00:53:58,826 --> 00:54:00,706 you take those numbers and then figure 1276 00:54:00,706 --> 00:54:02,666 out what the corresponding bit patterns are. 1277 00:54:02,666 --> 00:54:05,226 So, in other words, if I had this, one, two, three, 1278 00:54:05,356 --> 00:54:09,386 four zeroes in binary, followed by four ones in binary, 1279 00:54:09,646 --> 00:54:13,016 this would be represented in hex as zero F. Or just 1280 00:54:13,056 --> 00:54:14,266 to make clear the notation, 1281 00:54:14,266 --> 00:54:16,856 zero X just says here comes a hexadecimal value. 1282 00:54:17,236 --> 00:54:18,266 So zero F here. 1283 00:54:18,576 --> 00:54:22,076 If, instead, I did all ones-- one, two, three, four, five, 1284 00:54:22,136 --> 00:54:25,456 six, seven, eight-- and it doesn't line up, 1285 00:54:25,496 --> 00:54:28,866 because the zeroes are fatter-- that happens to represent what? 1286 00:54:28,866 --> 00:54:30,726 Well, this is the 128's column. 1287 00:54:30,726 --> 00:54:32,056 This is the 64's column. 1288 00:54:32,206 --> 00:54:34,666 This represents the value 255. 1289 00:54:34,796 --> 00:54:37,116 So the largest decimal number you can represent 1290 00:54:37,116 --> 00:54:40,456 with eight bits, we've seen long ago, is 255. 1291 00:54:40,736 --> 00:54:43,496 In hexadecimal, though, that's just zero XFF. 1292 00:54:44,296 --> 00:54:47,056 So, again, those of you who have some HTML background or play 1293 00:54:47,056 --> 00:54:50,396 around with Photoshop, you very often see precisely these 1294 00:54:50,396 --> 00:54:51,496 hexadecimal codes. 1295 00:54:51,496 --> 00:54:52,566 In fact, let me see. 1296 00:54:52,966 --> 00:54:56,586 This is somewhat tangential to CS50 1297 00:54:56,586 --> 00:55:00,226 but does demonstrate exactly where this stuff does come up. 1298 00:55:00,226 --> 00:55:03,286 So this is Photoshop, a very fancy photo manipulation tool. 1299 00:55:03,486 --> 00:55:04,456 This is the color wheel. 1300 00:55:04,766 --> 00:55:06,256 You'll notice that there's this mention 1301 00:55:06,256 --> 00:55:08,546 of RGB, red, green, blue. 1302 00:55:08,546 --> 00:55:10,306 So the way that numbers are represented-- 1303 00:55:10,306 --> 00:55:12,896 and this actually will be useful for problem set five, as well, 1304 00:55:12,896 --> 00:55:16,306 as you'll see, they are represented as RGB triples, 1305 00:55:16,306 --> 00:55:18,666 a value for red, followed by a value for green, 1306 00:55:18,666 --> 00:55:20,516 followed by a value for blue. 1307 00:55:20,736 --> 00:55:23,366 And what this means is that pictures are made 1308 00:55:23,366 --> 00:55:25,196 up of pixels, little dots. 1309 00:55:25,476 --> 00:55:27,186 Every one of those dots has a color. 1310 00:55:27,436 --> 00:55:30,506 That color can be implemented with some amount of red, 1311 00:55:30,576 --> 00:55:32,376 some amount of green, and some amount of blue. 1312 00:55:32,376 --> 00:55:34,286 You can combine those in so many different ways 1313 00:55:34,316 --> 00:55:36,546 that you can essentially get every color of the rainbow, 1314 00:55:36,586 --> 00:55:37,786 just by using those values. 1315 00:55:38,146 --> 00:55:42,646 So, here, I have the value 00 00 00. 1316 00:55:42,896 --> 00:55:45,406 So that means no red, no green, no blue. 1317 00:55:45,406 --> 00:55:46,716 What do I get? 1318 00:55:46,716 --> 00:55:49,096 Well, the absence of all color here gives me black. 1319 00:55:49,396 --> 00:55:52,956 If, by contrast, I say give me a lot of red. 1320 00:55:53,066 --> 00:55:54,486 That's a lot of red. 1321 00:55:54,546 --> 00:55:57,556 So give me a lot of red, but now crank up the green, 1322 00:55:58,056 --> 00:56:00,506 then crank up the blue. 1323 00:56:00,736 --> 00:56:02,346 That gives me all of the colors. 1324 00:56:02,346 --> 00:56:04,836 And if you have all of the wavelengths 1325 00:56:04,836 --> 00:56:05,986 of light, you have white. 1326 00:56:06,346 --> 00:56:08,546 But as we just saw, this would be red. 1327 00:56:09,266 --> 00:56:11,296 If I want to see green, I would put the FF's 1328 00:56:11,356 --> 00:56:12,566 in the middle, because it's RGB. 1329 00:56:12,566 --> 00:56:16,236 And if I wanted to see blue, I have this on the end here. 1330 00:56:16,506 --> 00:56:19,016 So this just means you can string these hexadecimal 1331 00:56:19,016 --> 00:56:20,456 characters along and along 1332 00:56:20,656 --> 00:56:23,646 to form not four bit values but eight bit values. 1333 00:56:23,646 --> 00:56:25,886 So now, if I go back to GDB, if I want to try 1334 00:56:25,886 --> 00:56:27,166 to understand this a bit better, 1335 00:56:27,736 --> 00:56:30,346 this here is simply an eight bit value. 1336 00:56:30,406 --> 00:56:33,016 This here is simply an eight bit value. 1337 00:56:33,076 --> 00:56:34,806 This here is simply an eight bit value. 1338 00:56:34,806 --> 00:56:37,926 Again, because every hex digit represents four bits, 1339 00:56:38,186 --> 00:56:40,596 which is one of the motivations for using hex at all is 1340 00:56:40,596 --> 00:56:42,776 because there is this really clean equivalence. 1341 00:56:43,006 --> 00:56:45,776 And then, finally, this here is also eight bits. 1342 00:56:45,846 --> 00:56:47,666 So, I have eight, plus eight, plus eight, plus eight. 1343 00:56:47,836 --> 00:56:51,456 So this just happens to represent a 32 bit integer, 1344 00:56:51,726 --> 00:56:54,526 which is really nice to see, because we said last week 1345 00:56:54,526 --> 00:56:57,886 that pointers are themselves 32 bit ents. 1346 00:56:57,886 --> 00:56:59,856 When you have a 32 bit machine, 1347 00:57:00,046 --> 00:57:02,066 that means its pointers are 32 bits. 1348 00:57:02,066 --> 00:57:05,946 And, voila, there is that ent represented, albeit in hex. 1349 00:57:06,086 --> 00:57:08,536 Now, if you find yourself looking more closely at some 1350 00:57:08,536 --> 00:57:10,846 of this buffer overflow stuff back at home on the slides, 1351 00:57:11,116 --> 00:57:13,596 you'll actually notice that some of these digits, 1352 00:57:13,596 --> 00:57:16,726 some of these hex digits are backwards on the diagram, 1353 00:57:16,726 --> 00:57:17,596 and it's not a mistake. 1354 00:57:17,596 --> 00:57:20,306 And that will actually come up in problem set five, as well, 1355 00:57:20,556 --> 00:57:22,586 a discussion of what's called little Indian-ness 1356 00:57:22,586 --> 00:57:23,766 and big Indian-ness. 1357 00:57:24,086 --> 00:57:26,226 Essentially, years ago, the world had to decide, 1358 00:57:26,276 --> 00:57:29,356 when we want to lay out multi byte numbers in memory, 1359 00:57:29,696 --> 00:57:32,856 do we go from left to right, or do we go from right to left? 1360 00:57:33,306 --> 00:57:34,866 It's a bit of an oversimplification. 1361 00:57:34,866 --> 00:57:36,336 P set five will tease this apart. 1362 00:57:36,336 --> 00:57:38,806 But, in short, different computers do it different ways. 1363 00:57:38,806 --> 00:57:40,796 And what you're seeing here is an artifact 1364 00:57:40,796 --> 00:57:43,006 of a decision made many years ago, 1365 00:57:43,006 --> 00:57:44,536 but the computer figures it out. 1366 00:57:45,246 --> 00:57:48,866 So CSI. So this is just kind of a silly way 1367 00:57:48,866 --> 00:57:50,836 of introducing the context 1368 00:57:50,836 --> 00:57:52,866 in which P set five will have you dabble. 1369 00:57:52,866 --> 00:57:57,506 So, years ago, a friend of mine literally had taken photos 1370 00:57:57,506 --> 00:57:59,506 on a digital camera, and something went wrong. 1371 00:57:59,506 --> 00:58:02,606 Like, his compact flash card or whatever it was got corrupted. 1372 00:58:02,856 --> 00:58:05,276 And at the time, I was very conveniently interning 1373 00:58:05,276 --> 00:58:07,706 as a graduate student for the local Middlesex District 1374 00:58:07,706 --> 00:58:10,986 Attorney's Office, where I was doing forensics work for them, 1375 00:58:11,156 --> 00:58:13,476 and it was really one of the best summer jobs, frankly, 1376 00:58:13,476 --> 00:58:14,886 I've ever had, because we were literally, 1377 00:58:15,116 --> 00:58:16,316 as I think I said weeks ago, 1378 00:58:16,316 --> 00:58:18,296 were taking in from the Mass State Police things 1379 00:58:18,296 --> 00:58:21,026 like hard drives, and CDs, and floppy discs, 1380 00:58:21,626 --> 00:58:24,026 and sometimes things that don't actually have data on them, 1381 00:58:24,026 --> 00:58:25,446 like monitors, and keyboards. 1382 00:58:25,446 --> 00:58:27,566 And so we just kind of piled that in a separate pile, 1383 00:58:27,816 --> 00:58:30,116 but we then had to answer questions 1384 00:58:30,116 --> 00:58:32,096 of the form, did he do it? 1385 00:58:32,346 --> 00:58:34,436 Right? Or is there any evidence to confirm 1386 00:58:34,436 --> 00:58:37,326 or deny this particular bad thing that happened? 1387 00:58:37,326 --> 00:58:39,806 And so much of this job was about going 1388 00:58:39,806 --> 00:58:42,436 through these hard drives and CDs, and, yes, as I've said, 1389 00:58:42,616 --> 00:58:44,406 just double clicking folders sometimes-- 1390 00:58:44,406 --> 00:58:45,966 really not all that sophisticated, 1391 00:58:45,966 --> 00:58:47,136 some of the criminals in this county, 1392 00:58:47,446 --> 00:58:49,346 but sometimes people had, at least, 1393 00:58:49,346 --> 00:58:50,826 dragged things to the recycle bin. 1394 00:58:51,196 --> 00:58:53,876 Now, as most people in this room know, from experience, 1395 00:58:53,876 --> 00:58:56,376 or bad experience, dragging things to your recycle bin 1396 00:58:56,376 --> 00:58:58,706 or trash can doesn't really do anything 1397 00:58:58,706 --> 00:59:00,066 until you click what button? 1398 00:59:00,866 --> 00:59:01,056 >> Empty. 1399 00:59:01,326 --> 00:59:03,576 >> David: So empty trash or empty recycle bin. 1400 00:59:03,576 --> 00:59:05,036 So more on that in just a moment, 1401 00:59:05,246 --> 00:59:10,106 but there's even more worrisome aspects of data storage. 1402 00:59:10,106 --> 00:59:12,586 So, in fact, there was a really neat story a few months ago 1403 00:59:12,816 --> 00:59:13,676 about the iPhone. 1404 00:59:13,676 --> 00:59:15,836 Those of you with iPhones or iPod touches know 1405 00:59:15,836 --> 00:59:17,566 that when you push the button on the bottom there, 1406 00:59:17,856 --> 00:59:20,356 you actually get this fancy animation where everything kind 1407 00:59:20,356 --> 00:59:22,756 of shrinks down, and you then see your desktop. 1408 00:59:22,996 --> 00:59:24,666 Well, I believe the way Apple did, 1409 00:59:24,666 --> 00:59:26,646 or maybe still does implement that feature, 1410 00:59:26,926 --> 00:59:29,906 is they don't resize your whole window down to a little dot. 1411 00:59:29,906 --> 00:59:32,336 Rather, they very quickly take a screenshot 1412 00:59:32,696 --> 00:59:34,416 and then shrink hat screenshot, 1413 00:59:34,416 --> 00:59:36,236 because it's very easy to shrink an image. 1414 00:59:36,236 --> 00:59:38,826 It's a little harder to shrink actual applications. 1415 00:59:39,086 --> 00:59:42,486 Now the problem is a forensic analyst realized was 1416 00:59:42,486 --> 00:59:44,596 that people then, who might hit the home button, 1417 00:59:44,596 --> 00:59:47,676 while reading a really personal or really sketchy email 1418 00:59:47,676 --> 00:59:51,226 or IM had this paper trail on the flash memory 1419 00:59:51,226 --> 00:59:54,436 in their iPhones or iPods, because the moment they hit 1420 00:59:54,436 --> 00:59:57,326 that button, their phone or iPod took a screenshot. 1421 00:59:57,546 --> 01:00:00,456 That screenshot had to get saved somewhere on disc. 1422 01:00:00,716 --> 01:00:02,116 Now, these things don't use hard drives. 1423 01:00:02,116 --> 01:00:04,416 They happen to use flash memory, but it's the same idea. 1424 01:00:04,606 --> 01:00:06,076 Those zeros and ones got written 1425 01:00:06,076 --> 01:00:07,766 to the phone or the iPod's memory. 1426 01:00:08,076 --> 01:00:10,506 And if someone got curious or nosey enough 1427 01:00:10,506 --> 01:00:13,506 and had physical access to this phone, they could start reading 1428 01:00:13,506 --> 01:00:15,336 in all of those zeroes and ones. 1429 01:00:15,336 --> 01:00:17,456 And much like you will do for P set five, 1430 01:00:17,676 --> 01:00:18,886 they can just look for patterns. 1431 01:00:19,016 --> 01:00:22,606 And even if that screenshot was intentionally deleted 1432 01:00:22,606 --> 01:00:26,616 by the iPhone software, odds are it's still sticking around, 1433 01:00:26,946 --> 01:00:29,306 unless the user has hit that home button a whole bunch 1434 01:00:29,356 --> 01:00:31,966 of times or unless they've downloaded a whole bunch of MP3s 1435 01:00:31,966 --> 01:00:35,496 and movies since then that might probabilistically overwrite 1436 01:00:35,786 --> 01:00:36,406 those bits. 1437 01:00:36,616 --> 01:00:38,036 But there's yet other problems, too. 1438 01:00:38,036 --> 01:00:39,566 I mean, there's some very familiar ones. 1439 01:00:39,566 --> 01:00:42,386 If you've ever sat down at a computer and used a browser, 1440 01:00:42,726 --> 01:00:45,476 if you're visiting websites that you don't want anyone else 1441 01:00:45,476 --> 01:00:46,926 to know, you probably should do what 1442 01:00:46,926 --> 01:00:48,606 after stepping up from the computer? 1443 01:00:49,596 --> 01:00:52,856 Clear your cache or clear the history, right? 1444 01:00:53,226 --> 01:00:54,856 Or do any number of tasks, 1445 01:00:55,136 --> 01:00:57,066 but even those things are very imperfect. 1446 01:00:57,066 --> 01:01:00,136 And you've probably seen and maybe have kind of crossed 1447 01:01:00,186 --> 01:01:02,126 that ethical line of clicking your friend 1448 01:01:02,126 --> 01:01:04,536 or your roommate's history bar just to see explicitly 1449 01:01:04,806 --> 01:01:05,746 where they've been going. 1450 01:01:06,596 --> 01:01:08,766 [laughter] You laugh, because it's true. 1451 01:01:09,736 --> 01:01:12,456 [laughter] So it's because a lot of software, browsers being one 1452 01:01:12,456 --> 01:01:14,656 of the worst offenders, just remember a lot of information, 1453 01:01:14,656 --> 01:01:16,416 which tends to be useful for the user. 1454 01:01:16,416 --> 01:01:18,886 It means you can have auto completion, and it can remember 1455 01:01:18,886 --> 01:01:20,016 that you've been there, and it can cache. 1456 01:01:20,116 --> 01:01:24,056 So there's a lot of technically and UI useful aspects to this, 1457 01:01:24,486 --> 01:01:26,776 but all that data gets stored somewhere. 1458 01:01:26,776 --> 01:01:28,176 The history gets stored somewhere. 1459 01:01:28,176 --> 01:01:30,876 All of the JPEGs that comprise a web page that you visit have 1460 01:01:30,916 --> 01:01:31,946 to get stored somewhere. 1461 01:01:32,306 --> 01:01:35,586 And worse than that, computers these days have something called 1462 01:01:35,586 --> 01:01:36,616 virtual memory. 1463 01:01:36,876 --> 01:01:39,066 So long story short, most computers have, 1464 01:01:39,256 --> 01:01:40,066 these days, like what? 1465 01:01:40,066 --> 01:01:41,836 A gig of ram, two gigs of ram, 1466 01:01:42,126 --> 01:01:44,926 but the operating system actually lets itself think 1467 01:01:44,926 --> 01:01:47,636 that you have four gigs of ram or six gigs of ram, 1468 01:01:47,896 --> 01:01:48,816 because that's convenient. 1469 01:01:48,956 --> 01:01:52,346 And by having more ram, or more elusions of more ram, 1470 01:01:52,636 --> 01:01:55,426 can you then run more and more programs at once, 1471 01:01:55,466 --> 01:01:56,936 at least in the background, right? 1472 01:01:56,936 --> 01:01:58,586 You might not be interacting with all of them, 1473 01:01:58,826 --> 01:02:00,696 but if you tell, you trick your computer 1474 01:02:00,696 --> 01:02:02,596 into thinking this person has so much ram, 1475 01:02:02,866 --> 01:02:04,426 it means you can double click more icons 1476 01:02:04,426 --> 01:02:06,736 and have more stuff running at once, but the means 1477 01:02:06,736 --> 01:02:08,406 by which most computers implement 1478 01:02:08,486 --> 01:02:11,326 that elusion is they steal some hard disc space 1479 01:02:11,326 --> 01:02:12,776 or some flash memory space, 1480 01:02:13,006 --> 01:02:15,276 and they say pretend this space is ram. 1481 01:02:15,566 --> 01:02:17,536 And what the computer does, unbeknownst to you, 1482 01:02:17,746 --> 01:02:19,786 is when you start to run too many programs 1483 01:02:19,816 --> 01:02:23,106 that physically can't fit in ram, some of them get shuttled 1484 01:02:23,276 --> 01:02:24,986 to this thing called virtual memory. 1485 01:02:24,986 --> 01:02:28,676 They get copied from ram, which is purely electronic, to disc, 1486 01:02:28,676 --> 01:02:30,916 which is often mechanical, and it's slower, 1487 01:02:31,066 --> 01:02:32,526 but they do this transparently. 1488 01:02:32,526 --> 01:02:35,936 So you, the user, don't know where all of those bits are, 1489 01:02:36,236 --> 01:02:38,906 but you can sometimes infer where they are. 1490 01:02:38,946 --> 01:02:41,826 If you start to run more and more programs on your computer, 1491 01:02:41,826 --> 01:02:44,366 and you can simulate this tonight, what eventually happens 1492 01:02:44,366 --> 01:02:46,976 if you try running 10, 20, 30 programs at once? 1493 01:02:48,126 --> 01:02:51,976 Odds are they will open, but slowly. 1494 01:02:51,976 --> 01:02:55,286 The thing will slow to a crawl, and that's because a lot 1495 01:02:55,286 --> 01:02:57,636 of those programs, yes, they're getting loaded into memory, 1496 01:02:57,676 --> 01:03:00,536 but they don't fit in memory, so they go to virtual memory, 1497 01:03:00,856 --> 01:03:04,236 and just because of how this stuff is designed, hardware 1498 01:03:04,236 --> 01:03:07,336 and things like, mechanical things, like hard discs 1499 01:03:07,336 --> 01:03:11,336 and even flash memory are slower by nature and by cost 1500 01:03:11,336 --> 01:03:12,516 to then something like ram. 1501 01:03:12,516 --> 01:03:14,076 So you see this even in the real world. 1502 01:03:14,296 --> 01:03:17,396 Now, the takeaway here is that even if you cover your tracks, 1503 01:03:17,506 --> 01:03:22,016 and empty your recycle bin, and you clear your cookies, 1504 01:03:22,016 --> 01:03:24,686 and clear your history, and all of this, there is a good chunk 1505 01:03:24,686 --> 01:03:28,276 of hard disc space on your computer technically called swap 1506 01:03:28,316 --> 01:03:30,626 space-- many operating systems do this-- 1507 01:03:30,936 --> 01:03:33,926 that still has a copy of all of that stuff. 1508 01:03:33,956 --> 01:03:35,376 So, in fact, one of the neat things 1509 01:03:35,376 --> 01:03:36,996 that Mac OS actually has-- 1510 01:03:36,996 --> 01:03:39,136 I don't believe this has come built-in to windows yet. 1511 01:03:39,136 --> 01:03:40,226 Maybe 7 is different. 1512 01:03:40,546 --> 01:03:43,036 You have this little security tab here, 1513 01:03:43,286 --> 01:03:45,336 and there's this thing called FileVault. 1514 01:03:45,736 --> 01:03:48,116 So one of the things FileVault does is it encrypts your whole 1515 01:03:48,116 --> 01:03:51,216 hard drive so that even if you are doing sketchy things 1516 01:03:51,216 --> 01:03:52,766 or just private things on your computer, 1517 01:03:53,046 --> 01:03:55,246 all of that is encrypted using something a little more 1518 01:03:55,246 --> 01:03:58,526 sophisticated than Caesar Cipher, but the same exact idea. 1519 01:03:58,526 --> 01:04:02,906 And I actually don't remember-- oh, here it is. 1520 01:04:03,016 --> 01:04:06,836 So if I actually log in here, you'll see this. 1521 01:04:06,836 --> 01:04:08,776 There's this little icon, for those of you-- 1522 01:04:08,946 --> 01:04:10,606 so this benefits half of you. 1523 01:04:10,746 --> 01:04:11,666 The other half, good luck. 1524 01:04:11,956 --> 01:04:14,596 But half of you can open up your security control panel 1525 01:04:14,676 --> 01:04:15,666 and check this. 1526 01:04:15,666 --> 01:04:17,226 Use secure virtual memory, 1527 01:04:17,506 --> 01:04:20,056 which means that feature I just described of swap space, 1528 01:04:20,056 --> 01:04:22,866 aka virtual memory, you can tell the computer, 1529 01:04:22,866 --> 01:04:26,176 albeit at a slow performance hit, to encrypt that, as well, 1530 01:04:26,476 --> 01:04:28,906 so that so long as I log off of this machine, 1531 01:04:28,906 --> 01:04:30,756 if someone steals it, and runs away, 1532 01:04:30,756 --> 01:04:32,936 and performs various forensic analyses on it, 1533 01:04:33,476 --> 01:04:35,696 there would be very low probability can they figure 1534 01:04:35,696 --> 01:04:37,526 out what bits were actually there. 1535 01:04:37,526 --> 01:04:40,806 So cryptography is really, really your friend these days. 1536 01:04:40,806 --> 01:04:42,476 And if you are liking this kind of stuff, 1537 01:04:42,506 --> 01:04:44,946 do bear in mind computer science 120, 1538 01:04:45,246 --> 01:04:47,316 which is offered in alternating terms. 1539 01:04:47,316 --> 01:04:48,856 So what's the implication here? 1540 01:04:48,856 --> 01:04:50,946 So let's consider a very simple example 1541 01:04:51,216 --> 01:04:54,136 of just storing some files on disc. 1542 01:04:54,666 --> 01:04:57,806 So inside of a typical computer is this thing called a 1543 01:04:57,806 --> 01:04:58,436 hard drive. 1544 01:04:58,706 --> 01:05:01,756 If it's still mechanical, which it is in most of your laptops, 1545 01:05:01,786 --> 01:05:05,136 unless you paid a bit of a premium to get SSD 1546 01:05:05,136 --> 01:05:07,366 or solid state drives, which is flash memory, 1547 01:05:07,366 --> 01:05:09,776 which has no moving parts, but the story is the same. 1548 01:05:09,776 --> 01:05:11,256 Inside of your computer is a hard disc, 1549 01:05:11,556 --> 01:05:14,396 and a hard disc is typically made up of these platters. 1550 01:05:14,616 --> 01:05:16,876 So if you recall floppy discs from many years ago, 1551 01:05:16,876 --> 01:05:18,556 there's those little circular floppy things, 1552 01:05:18,876 --> 01:05:22,026 hard drives have the exact same idea, but they're much thicker, 1553 01:05:22,026 --> 01:05:24,026 they're much stronger, and they spin a lot faster, 1554 01:05:24,066 --> 01:05:25,906 so they're more robust and store more bits. 1555 01:05:25,906 --> 01:05:30,036 But when you save a file to disc-- say it is a JPEG-- 1556 01:05:30,376 --> 01:05:35,326 so a JPEG is going to be some kind of image that has kilobytes 1557 01:05:35,326 --> 01:05:36,496 or megabytes worth of data. 1558 01:05:36,866 --> 01:05:39,066 So supposed it takes up this many bits. 1559 01:05:39,366 --> 01:05:43,206 Now, I have no idea, as the user, where that JPEG ends 1560 01:05:43,206 --> 01:05:44,696 up on disc, so to speak. 1561 01:05:44,766 --> 01:05:46,706 The operating system decides for itself 1562 01:05:46,976 --> 01:05:48,436 where does it currently have space, 1563 01:05:48,706 --> 01:05:50,956 but it ends up somewhere there on disc. 1564 01:05:51,106 --> 01:05:54,376 What the operating system then does is it has a little table 1565 01:05:54,556 --> 01:05:56,826 called a directory table or something like that, 1566 01:05:57,526 --> 01:05:59,616 and it's essentially like an Excel spreadsheet 1567 01:05:59,656 --> 01:06:00,696 with at least two columns. 1568 01:06:00,766 --> 01:06:02,646 There's definitely more juicy stuff there, 1569 01:06:02,966 --> 01:06:05,146 and then there's one column called name, 1570 01:06:05,406 --> 01:06:08,246 and then another column called vocation or something like that. 1571 01:06:08,696 --> 01:06:11,716 And then the name of this file might be foo.jpeg. 1572 01:06:11,716 --> 01:06:14,216 Apologies for the handwriting. 1573 01:06:14,216 --> 01:06:15,896 And then location is a number. 1574 01:06:15,896 --> 01:06:20,456 We'll write it in hex, but it's just a number, 0X123. 1575 01:06:20,736 --> 01:06:22,426 And what that tells the operating system, 1576 01:06:22,426 --> 01:06:23,766 or reminds the operating system, 1577 01:06:23,766 --> 01:06:26,706 is the next time the user double clicks or searches 1578 01:06:26,706 --> 01:06:29,276 for foo.jpeg, go to what address? 1579 01:06:29,806 --> 01:06:32,686 Will go to this location on disc, 1580 01:06:33,056 --> 01:06:36,086 0X123 or whatever it happens to be. 1581 01:06:36,516 --> 01:06:38,646 Now what if the user decides, you know what, 1582 01:06:38,706 --> 01:06:40,246 this is a really private JPEG, 1583 01:06:40,246 --> 01:06:41,826 or this is just a really bad photograph, 1584 01:06:41,826 --> 01:06:42,866 let me just delete it. 1585 01:06:43,166 --> 01:06:46,636 And here she drags it to the recycle bin or trash can. 1586 01:06:46,706 --> 01:06:47,876 What happens to this picture? 1587 01:06:49,596 --> 01:06:50,936 So absolutely nothing. 1588 01:06:51,116 --> 01:06:52,546 So, first of all, most of you knew this. 1589 01:06:52,546 --> 01:06:54,946 If you drag something to the recycle bin or trash can, 1590 01:06:55,026 --> 01:06:57,276 unless you wait a few days, a month, 1591 01:06:57,276 --> 01:07:00,516 however long it is your OS keeps things around, nothing happens. 1592 01:07:00,516 --> 01:07:02,356 It's like dragging it to a separate folder. 1593 01:07:02,626 --> 01:07:06,306 But if you do have a little bit of savvy, and you do right click 1594 01:07:06,306 --> 01:07:08,326 and choose empty trash or empty recycle bin, 1595 01:07:08,326 --> 01:07:10,726 or wherever your menu option is, what then happens 1596 01:07:10,786 --> 01:07:13,156 to this picture, when you actually delete that file? 1597 01:07:13,156 --> 01:07:13,296 >> [inaudible] 1598 01:07:13,296 --> 01:07:17,486 >> David: Yeah, so this goes away. 1599 01:07:17,936 --> 01:07:20,466 Essentially, this goes away. 1600 01:07:20,596 --> 01:07:21,216 This goes away. 1601 01:07:21,426 --> 01:07:23,976 The operating system just forgets where that file is. 1602 01:07:24,276 --> 01:07:27,446 But for historical reasons and for performance reasons, 1603 01:07:27,446 --> 01:07:31,166 historically, this piece of the picture doesn't change, right? 1604 01:07:31,166 --> 01:07:34,316 Back in the day, computers were very slow, and the idea 1605 01:07:34,526 --> 01:07:37,756 of wasting time and making the user wait 1606 01:07:37,756 --> 01:07:40,896 so that you can then delete these bits, as well, well, 1607 01:07:41,056 --> 01:07:41,906 what would that even mean? 1608 01:07:41,906 --> 01:07:43,956 You can't just delete bits and erase them, right? 1609 01:07:43,956 --> 01:07:45,806 You need those bits; otherwise, your hard drive is going 1610 01:07:45,806 --> 01:07:46,966 to get perpetually smaller 1611 01:07:46,966 --> 01:07:48,526 and smaller every time you delete a file. 1612 01:07:48,806 --> 01:07:50,696 So, what would it mean, really, to delete a file? 1613 01:07:50,996 --> 01:07:53,716 Well, maybe change these zeroes and ones to all ones, 1614 01:07:53,716 --> 01:07:56,226 or all zeroes, or just random zeroes and ones. 1615 01:07:56,256 --> 01:07:57,326 But that used to take awhile. 1616 01:07:57,576 --> 01:07:59,976 So, to this day, most operating systems forget 1617 01:07:59,976 --> 01:08:03,666 where the file is, which is why programs like Norton Utilities, 1618 01:08:03,666 --> 01:08:06,216 if you're familiar, any of these tools that you might have 1619 01:08:06,216 --> 01:08:08,486 on your own computer, that lets you undelete files, 1620 01:08:09,176 --> 01:08:11,896 they generally just remember what this table used to look 1621 01:08:11,896 --> 01:08:14,066 like and then, with some probability, 1622 01:08:14,336 --> 01:08:17,826 can they go to the hard drive and get back those bits. 1623 01:08:17,916 --> 01:08:19,336 But I say, with some probability, 1624 01:08:19,336 --> 01:08:21,766 because what's probably going to happen, over time, 1625 01:08:22,226 --> 01:08:24,126 after you've deleted a file in this way? 1626 01:08:24,126 --> 01:08:24,256 >> [inaudible] 1627 01:08:24,256 --> 01:08:28,266 >> David: Yeah, so they're going to get reused. 1628 01:08:28,266 --> 01:08:30,456 The operating system eventually is going to say, you know what, 1629 01:08:30,456 --> 01:08:32,346 I need maybe not exactly those bits, 1630 01:08:32,556 --> 01:08:35,266 but I've got a pretty big JPEG that this user just downloaded, 1631 01:08:35,266 --> 01:08:36,456 and now I need these bits. 1632 01:08:36,796 --> 01:08:38,566 So, you know, in fact, and we used to see this 1633 01:08:38,566 --> 01:08:41,496 at the DA's office, we would see partial JPEGs. 1634 01:08:41,496 --> 01:08:43,566 We would lose part of the image, part of the gif, 1635 01:08:43,566 --> 01:08:46,206 whatever it was, but not necessarily all of it, 1636 01:08:46,206 --> 01:08:48,576 and that's just because, by chance, probabilistically, 1637 01:08:48,876 --> 01:08:50,806 some of those bits were reused elsewhere. 1638 01:08:51,116 --> 01:08:52,986 So you'll see in problem set five is what? 1639 01:08:52,986 --> 01:08:56,316 Problem set five is sort of like the end all of problems sets, 1640 01:08:56,316 --> 01:08:58,366 it seems, but it's really neat, 1641 01:08:58,366 --> 01:09:00,446 because it's really this intersection 1642 01:09:00,446 --> 01:09:01,936 of some very familiar technologies, 1643 01:09:01,936 --> 01:09:04,146 and we'll hand you this article written by a colleague of mine 1644 01:09:04,146 --> 01:09:05,846 from MIT, from grad school, 1645 01:09:05,846 --> 01:09:08,566 who did a really neat project himself, as a Ph.D. student. 1646 01:09:08,806 --> 01:09:11,246 He went on eBay with a bit of cash and started buying 1647 01:09:11,246 --> 01:09:13,496 up all these hard drives that people were selling, 1648 01:09:13,496 --> 01:09:16,286 100 gig drive here, a 500 gig drive here. 1649 01:09:16,846 --> 01:09:18,606 Then he imaged all of these drives, 1650 01:09:18,906 --> 01:09:20,426 much like Homeland Security 1651 01:09:20,426 --> 01:09:21,966 or the district attorney's office would, 1652 01:09:22,156 --> 01:09:24,626 and then he started analyzing all of these files, 1653 01:09:24,626 --> 01:09:25,576 all of these hard drives. 1654 01:09:25,626 --> 01:09:27,286 And what he found, as you might expect, 1655 01:09:27,286 --> 01:09:29,026 because these things get mentioned in the media once 1656 01:09:29,026 --> 01:09:31,516 in awhile, were people's social security numbers, 1657 01:09:31,516 --> 01:09:34,726 people's credit card numbers, pornography. 1658 01:09:34,726 --> 01:09:36,266 Anything that someone might have 1659 01:09:36,266 --> 01:09:39,006 on their hard drive ultimately made its way 1660 01:09:39,006 --> 01:09:40,516 to Maxwell Dorkin [phonetic] across the street. 1661 01:09:40,816 --> 01:09:44,156 But what was really neat-- so the point for him was not 1662 01:09:44,156 --> 01:09:47,506 to get some cheap pornography on eBay, but, rather, 1663 01:09:47,906 --> 01:09:52,146 to actually assess with what frequency this is happening 1664 01:09:52,246 --> 01:09:54,946 and what tools could we, could they, those users, 1665 01:09:54,946 --> 01:09:57,316 start using to more effectively, you know, 1666 01:09:57,316 --> 01:09:58,646 scrub their information. 1667 01:09:58,646 --> 01:09:59,626 And what the scary thing is-- 1668 01:09:59,626 --> 01:10:02,106 and we'll present you with this paper in PDF form, 1669 01:10:02,106 --> 01:10:04,726 this article-- so many of the tools out there, 1670 01:10:04,726 --> 01:10:07,536 that you would pay good money for to, you know, 1671 01:10:07,826 --> 01:10:09,946 make sure that your financial documents are safe, 1672 01:10:09,946 --> 01:10:13,276 your personal documents are safe, so many of them are buggy. 1673 01:10:13,276 --> 01:10:17,016 And one aspect of this article was to find innumerable faults 1674 01:10:17,016 --> 01:10:18,956 with all of these tools that would overlook things 1675 01:10:19,046 --> 01:10:21,336 like the swap space, when wiping your hard drive, 1676 01:10:21,566 --> 01:10:24,196 or they would overlook something called slack space, 1677 01:10:24,306 --> 01:10:25,636 which this article will discuss. 1678 01:10:25,896 --> 01:10:27,976 But this, again, is one of the points of a course like this, 1679 01:10:28,016 --> 01:10:29,796 because by understanding, from the ground up, 1680 01:10:29,886 --> 01:10:31,896 how all of these little things work, one, 1681 01:10:31,896 --> 01:10:34,266 this kind of literature is accessible to you; and, two, 1682 01:10:34,266 --> 01:10:37,106 you understand what this machine is you guys use every day. 1683 01:10:37,536 --> 01:10:42,106 So with that said, we'll see you Wednesday.