>> David Malan: All right, welcome back to CS50! This is week five, so in problem set two, you'll recall that some of your classmates tackled the so-called hacker edition, and that hacker edition had them tackle a number of encrypted passwords, which looked a little something like this. So, long story short, if you didn't actually read this PDF at the time, this was the hacker edition. Passwords on the typical Linux system are stored in an encrypted form, so you have essentially this format in a file called XE Password Etcetera/Password. Julius would be the username, colon, and then some cryptic-looking string would be the encrypted password. And do you mind, Barry, taking my voice down a little bit? There's a bit of an echo up here. Much better. So, these students, who tackled the hacker edition, needed to decipher, needed to crack those passwords; but, unfortunately, the routine that's used in a typical Linux system to encrypt passwords is a one-way function, which means it's relatively easy and relatively fast to take in plaintext and generate cipher text, but you can't actually reverse the process. So, as I mentioned a few weeks ago, if you've ever lost a password to a typical website or to your FAS account, odds are you might've gotten annoyed at the text support person, because they insisted, "I can't tell you what your password is," but they can change it, and that's because of this mathematical property that's generally used for passwords. They're normally known as a one-way hash. And as the name implies, you can only encrypt in one direction; you can't decrypt. So, if you can't decrypt passwords in this way, how in the world are all of us logging in to FAS and logging in to Facebook and other such sites every day of the week? How is it that these sites are storing our passwords encrypted, and yet they're still somehow checking that what I type in, in plaintext, matches my password? So, in other words, if you can't decrypt the cipher text to then compare the plaintext I initially gave them, when I created my account, and the plaintext I just typed in, when visiting Facebook.com or equivalent, how do you actually compare what I typed in and what they know my password to be? Yeah? >> Just encrypt what [inaudible]. >> David: Just what? >> Encrypt what -- >> David: Yeah, just encrypt what I just typed in. So, in other words, what a typical website or server does these days is, again, it encrypts your password when you create your account and stores it in a crazy format like this; but, subsequently, when you log in, the server simply re-encrypts whatever you just typed in, using the same algorithm. And then if the two cipher texts match, does it actually let you pass. But the curious thing is, because of the mathematics of a lot of these implementations, there is a probability, though very, very low, that you could type in one word, and it would encrypt to some cipher text, and you could type in another word, a different word, but it might very well encrypt to that same cipher text; in other words, you might very well be able to login to some accounts you guys have, using any number of passwords, but the probability of that is very low, and the probability of figuring out what other words might hash, so to speak, to the same value is itself very low. So, a number of our students, who are more comfortable, did, in fact, determine that Julius' password was just 13. The clever, or those who like stupid jokes might understand what the reference there is. Julius Caesar, you know, 13. Okay. President Scroob, of course, in the Spaceballs, Smith hack. You've got to make fun of me when I do something like this. There we go. All right, President Scroob had a password of 12345. That was the second password these guys were tasked with figuring out. W. Brandus Voice, this is actually a reference to a movie that many of you probably haven't seen, but the movie Sneakers, where a guy authenticates himself as saying, "My voice is my password," a little example of using vocal algorithmics. That was his there. But notice what was important about voice was that it's just an English word. So any hacker students that actually used an English dictionary to try to crack our passwords probably would've gotten this one. Baravelli, which is a reference to an old Marx Brothers film. Swordfish, also a word that's probably in the dictionary. But then it got a little harder, so you may remember Mr. Blaise Visionaire [phonetic] of France, who we used Foobar as our keyword. A little harder. It's probably not in most dictionaries, let alone all capitalized. G. Costanza. Remember? Those familiar might recall that this was his ATM code, Bosco, a little chocolaty drink. And then Malan, I don't believe, was gotten by anyone, but that's because it was, in fact, rather difficult to crack and was this. For those who don't understand what the reference is, just Goggle FTW and 111, and you'll see such stupid illusions as those. So, I finally found a video that's actually educational in nature. [laughter] So, you might not think it, but this thing here actually comes to us by a very well-known professor, Nick Parlante at Stanford University. That's actually very well-known for its computer science program. So, this isn't a joke, but this is actually one of the most compelling ways I have, at least, seen in clay form to convey the idea of pointers and dereferencing. So, this is, for many students, a challenging topic. It certainly was for me years ago; but, frankly, we've just kind of taken ourselves down a notch with Binky here. Might hopefully elucidate what might otherwise, in C syntax, be a fairly cryptic topic. So, allow me to full screen, and allow me to give us this short three-minute film. Here we are. >> Nick Parlante: Hey, Binky. Wake up. It's time for pointer fun. >> Binky: What's that? Learn about pointers? Oh, goody! >> David: Mind you, this is a Stanford professor playing with clay, doing the voiceover here on camera. [laughter] >> Nick: Well, to get started, I guess we're going to need a couple of pointers. >> Binky: Okay. This code allocates two pointers, which can point to integers. >> Nick: Okay, well, I see the two pointers, but they don't seem to be pointing to anything. >> Binky That's right. Initially, pointers don't point to anything. The things they point to are called pointees, and setting them up is a separate step. >> Nick: Oh, right, right. I knew that. The pointees are separate. So, how do you allocate a pointee? >> Binky: Okay, well, this code allocates a new integer pointee, and this part sets X to point to it. >> Nick: Hey, that looks better. So, make it do something. >> Binky: Okay, I'll dereference the pointer X to store the number 42 into its pointee. For this trick, I'll need my magic wand of dereferencing. >> Nick: Your magic wand of dereferencing? That's great. >> Binky: This is what the code looks like. I'll just set up the number and-- [pop] >> Nick: Hey, look, there it goes. So, doing a dereference on X follows the arrow to access its pointee, in this case, to store 42 in there. Hey, try using it to store the number 13 through the other pointer, Y. >> Binky: Okay. I'll just go over here to Y, and get the number 13 set up, and then take the wand of dereferencing, and just... [buzz] Oh! >> Nick: Oh, hey, that didn't work. Say, Binky, I don't think dereferencing Y is a good idea, 'cause setting up the pointee is a separate step, and I don't think we ever did it. >> Binky: Mm, good point. >> Nick: Yeah, we allocated the pointer Y, but we never set it to point to a pointee. >> Binky: Mm, very observant. >> Nick: Hey, you're looking good there, Binky. Can you fix it so that Y points to the same pointee as X? >> Binky: Sure, I'll use my magic wand of pointer assignment. >> Nick: Is that going to be a problem like before? >> Binky: No, this doesn't touch the pointees. It just changes one pointer to point to the same thing as another. >> Nick: Oh, I see. Now Y points to the same place as X. So, wait. Now Y is fixed. It has a pointee, so you can try the wand of dereferencing again. So, send the 13 over. >> Binky: Oh, okay. Here goes. >> Nick: Hey, look at that. Now dereferencing works on Y. And because the pointers are sharing that one pointee, they both see the 13. >> Binky: Yeah, sharing, whatever. So, are we going to switch places now? >> Nick: Oh, look. We're out of time. >> Binky: But-- >> Nick: Just remember the three pointer rules. Number one, the basic structure is that you have a pointer, and it points over to a pointee, but the pointer and pointee are separate, and the common error is to set up a pointer, but to forget to give it a pointee. Number two, pointer dereferencing starts at the pointer and follows its arrow over to access its pointee. As we all know, this only works if there is a pointee, which kind of gets back to rule number one. Number three, pointer assignment takes one pointer and changes it to point to the same pointee as another pointer. So, after the assignment, the two pointers will point to the same pointee. Sometimes that's called sharing. And that's all there is to it, really. Bye-bye now. >> David: So if you actually still struggle with pointers at any point this semester, watch that video, and hopefully all your troubles will go away. So, I can't imagine, frankly, how long it took him to make it, because, if you notice, throughout the video, I mean, he was changing like the eyeballs' locations for almost every frame of the video. So that's an overnight project there, at least. I wanted to show you this demo, too. So, this is made by one of our own teaching fellows, Phil, who was a student in the course last year, who goes, even though it's only October here, final projects will soon upon us, or at least thoughts about final projects. And just as a teaser now, you'll get more formal documentation of this in the weeks to come. The final project in this course really is meant to be its culmination, where you guys can take your newfound savvy and comfort, with programming in any of the languages we cover or don't cover in the course and design some piece of software of your very own. So, what Phil did, actually, for fun, just before term started, was started playing with a fairly large dataset, that of the course catalog. So, he wrote, at the time, what's called a screen scraper, which is a fairly annoying but sometimes effective approach to getting data from some source that doesn't necessarily want you to have that data. So, the registrar's catalog is all HTML pages, all web pages, but they fortunately follow a standard format. So, if you wrote a program that essentially crawled to the registrar's website, much like Google and other search engines crawl the web, and then read into a database of his own, all of the information about this course and that course. And then what he did was he taught himself a library called Flair, that allows him to write software in Adobe Flash, if you're familiar with the Flash format. And what's neat about Flash is that you can do fairly sophisticated animations by writing in a language called ActionScript. So what this here is, is version 0.9 of a visualization of the course catalog for all computer science courses. And as is no accident, perhaps, he has 50 kind of interconnects, all of these guys here at the center. And if you hover over CS50, what the edges in this graph tell you is where you can go after CS50, for what courses 50 is a prerequisite, and you can follow the links for all these other courses and get the descriptions on the side. And I point this out, because Phil did not, at the time, so far as I know, he had not used this library before. He simply Googled. He read up on it. He took a little bit of input from the course, as to what would be neat, and then he started making what's starting out to be a really neat visualization. So, his hope is to do that for the rest of the course catalog. And this is just to seed your minds with just some random, crazy ideas that might interest you, whether it's this, doing something with shuttles, doing something with dining hall menus. The really neat thing, once you have these programming chops, is that you can start to do interesting things with data, and that data, again, may be shuttle schedules. It may be a course catalog. But there's so much interesting stuff out there, and soon, if not already, you will have the ability to actually manipulate that, and bend it to your will, and make interesting applications. So, speaking of applications, this week is about Sudoku. This is our logo for the week, made by our own Yuhki Yamashita. I double-checked, when he handed me this graphic, with another Japanese-speaking friend, that I wasn't putting something incredibly offensive on all of our documents for the week. So, this isn't that consistent with Sudoku in some form, but with Sudoku, those of you who have tackled it already, you know that it uses this library called ncurses, a graphical library. And we're not talking graphics in the Mac OS or Windows sense. It's still fairly primitive stuff within the confines of this terminal window. But those of you who might have been taking to heart our advice that you use GDB anytime you run into a problem, might've realized that once you have a graphical user interface between you and your code, it's kind of hard to use GDB and step through your program, if you've got a big game board right in the middle of the screen. So, Glen Holloway [assumed spelling], one of our sysadmins, actually posted last night on the bulletin board a little trick that I thought I'd show you real quickly here. What's really neat is if you connect a nice FAS in two windows simultaneously, now, there is a probability that you'll actually end up connecting to two different servers that FAS owns. Nice thought, FAS, even though we've been calling it one server, is actually a cluster of servers, so you can actually get pseudo randomly sent to one machine or the other. So look to the bulletin board and help at CS50.net as to how to mitigate this, if you don't find that this is working. But I've done that. I've connected to two machines. I'm going to do a quick sanity check. I'm going to type post name, and what I see there is that the nice machine that I'm connected to actually has a name of Ice 2. And if I do host name in this other window, I also have Ice 2 there. So there's Ice 1, Ice 2, Ice 3. And as an aside, if you find that you're on the wrong machine, you can actually just type SSH or username@ice2.fas.harbor.edu, and that will make sure that you connect from this machine to the right one. But the endgame, ultimately, is just to make sure you're on the same machine physically, and I am. And what I'm going to do now is this. I'm going to run Sudoku after compiling it here. I'm going to go ahead and run Sudoku and then using the newbie level, so I'm just going to pick a newbie game, but I'm going to type an ampersand at the end of the command. If I don't type an ampersand, I get Sudoku. Filling my screen, doing the nice little graphics that it ships with and your distribution code. But if I do an ampersand, nothing seems to happen. So, in a Linux or a Mac OS environment, putting an ampersand at the end of your commands actually backgrounds the application, which means it is, in fact, running, and I can see this by typing PS, which shows you the process list. A process list, in computer speak, is just a running program. If you've ever looked at your Task Manager in Windows or your Activity Monitor in Mac OS, it's the same thing, albeit in text form here. So, I'm running a few things. One is called Sudoku. One is called PS, because I literally just ran that, so PS notices that it's self-running. And then, out of curiosity, what is TCSH, T-C-S-H? TCSH? So, this is your shell. So, when we talked about your prompts or your blinking prompt, that's actually a program, a program that takes input. The input it takes is whatever you type in at the command prompt and hit enter. So, know, just FYI, there's a lot of shells out there. There's TCSH. There's CSH. There's KSH. There's the born shell. There's a whole bunch of shells that people have written over the years, most of which behave similarly. But if you ever, after CS50, connect to what you're told is a Linux or Unix machine, and things don't quite seem to be working as expected, it probably just means the shell is slightly different. So, basics like LS will still work. CD will still work, but there might be some other differences. But don't be startled by those. But what's important here is the following. When I just run this command, I backgrounded the application, and what I'm told is that the process I'm running, the first one I backgrounded, has a PID, process ID of 6373. This is just a unique number that Linux, or Windows, or Mac OS assigned to this program so that it can keep track of it in memory. But this is useful, because if I now type FG, for foreground, that's going to put things in the front of my screen, and it's going to let the program run as usual. So now I'm going to go to my second window here, and I'm going to run GDB on Sudoku on 6373. So a new little trick. At the command prompt, I'm going to type the name of the program, but also the process ID here, hit enter. What I'm now at is a prompt, GDB. There's a whole bunch of stuff that loaded, but never mind that. And I'm going to do something like set a breakpoint in Sudoku.c line 164, which I just so happen to know is the part of the code, if you've looked at it already, is where I'm dealing with user input. So I'm going to hit enter here. I have a breakpoint, but now I still seem to have broken the program. How do I tell the program to just keep running now until it hits that breakpoint? Remember continue? So continue. Now nothing seems to be happening, but that's because the program is now running. If, as you see in the bottom, we have a little menu here: N for new game, R for restart game. If I go ahead and type N, notice what happened. Nothing in the game, but on the right-hand side, GDB appears to have responded. So this is actually a really neat trick. I'm sort of remotely debugging one process in a separate window altogether. So now on the right-hand side, I've broken at line 164. For those who have looked at the distribution code already, this is the kind of stuff in Sudoku.c that we've given you this week. And long story short, at this part of the code, am I sitting in a loop waiting for user input? But what's really neat now is that I still have the game running in the left-hand side. On the right-hand side, I can do things like print the value of a variable called g.number, which is discussed in the PDF, for those unfamiliar. Apparently, it's 160 right now. Oh, look at that. I'm playing the 160th new board at the time, so there is, in fact, a correspondent. So, for those of you who have been kind of banging your head against the wall, struggling with various bugs, know that relatively easily can you use GDB in this still interactive format without getting caught up in the graphical details. So, courtesy of Glen, this particular trick. So, if you didn't get all that, just post it as a sticky on the course's bulletin board, so you can check the other category or the P set 4 category there. So, I'm just going to quit. I'm just going to quit here, and then voila! Any questions? We tricked fast, but we'll recover it in section, if need be. Okay, so there's never a more attentive audience than when you're talking about things that affect them personally or can get in their way of their privacy or their security. So what I thought we'd do--because this week, we lay the foundation in terms of concepts for a specific domain, that of forensics. So, in a couple of weeks' time, we'll be distributing problem set five. Problem set five's theme is, in fact, forensics, where you're going to have to implement some digital forensic-type software to recover, among other things, some accidentally deleted JPEGs. So, if you see me and a couple of the teaching fellows on campus this coming week, taking photos, just keep walking, because that kind of spoils the scavenger hunt. In fact, last year, we were kind of trailed, as I recall, by one student, who was taking notice of where all the hidden, but--let's see--non-obvious, but identifiable locations on campus were. So, we'll be on the lookout. Buffer overflow attacks. How does this relate? So, we talked last week about security and bad stuff that can arise when you manipulate pointers or ultimately manage your own memory. And still, today, one of the most common ways of compromising a program or server is to exploit someone's mistake with regard to memory. If that program is written in a language like C or C++, that means specifically someone screwed up with regard to a pointer, or someone screwed up with regard to an array. So what I thought I'd do is make this a little more concrete, using some really nice visuals that someone put together here of a very simple program. So, this program here has a main routine at the bottom. It has N to ARDC char ** ARDV. This is the same thing, recall, as saying CHAR*ARDV [], because we've seen this relation between pointers and arrays, so that's just another way of saying the same thing. The only thing this program does when run, apparently, is call a function called foo, and it passed to foo the first command line argument, whatever the user typed after the program's name. What does foo do? Well, it apparently takes a string, aka CHAR*'s input. It then declares statically on the stack, because this is a local variable in an array called C. That array called C has 12 locations, each of which is a CHAR, and those locations are all contiguous. So, this is 12 one byte chunks of memory back-to-back called C on the stack. Then there's this function here, which we haven't really used, but you can infer its role. Memcpy simply copies memory from a source to a destination. So, apparently what this function is doing is passing in to C the value that's in bar, and it's doing this for how many characters? Well, however many characters were in bar. So, in other words, the destination is C. That's where we want to put a copy of ARDV1. The string we want to put there is, again, bar, because that's what we gave this variable's name. And then the length of bar tells us, tells probably memcpy's for loop how many characters to actually copy. But there's a bug in here, as implied by this comment. When it says no bounds checking, what does that mean? What are we not checking here? So the bounds of C, right? So we've declared C to be of size 12, and we are doing a sanity check. We're checking how long bar is, so that we tell memcpy, you know what? Only copy the first N characters of bar, where N equals stir length of bar. But what's the problem? What value of string length of bar could cause problems for us? Any value that's what? Any value that's greater than 12. Because if we copy more than 12 bites, that's perfectly fine, from bar's perspective, because bar has as many characters as he has. So, it's fine if we traverse all of those characters, but if we blindly copy more than 12 characters into the array called C, who knows where we're going to be putting them? We're going to start overlapping something else on the stack, and that's precisely what a buffer overflow attack is all about. So, here's a really nice picture that someone put together. Actually, if you'd like to read more on this stuff, the Wikipedia article, at the moment, is really nicely--really nicely describes this process. So, notice what we have here, a rectangle very similar to the one we've been using in class. At the bottom there, where it says parent routine stack, so that's like main stack. So main or any functions down below. In red is highlighted the return address. So, we haven't talked about this, but one of the things that the stack is used for, besides parameters and besides local variables, is the address of the function that should be returned to after another function is done executing. So, in other words, if main calls foo, what happens is main's return address, main's address in memory--remember the text segment? Mains address in memory is plopped on top of the stack, so that the computer, when it's done executing the function foo, it then checks that lowest level frame, that red bar up there, and says, "What address should I now return execution to? Where should I go back to executing? Because I forget." It checks that value and, hopefully, that's, in fact, main's address. So a smart adversary might recognize this and might recognize, okay, at the very top of this picture, we have the start of the array, called C, hence C bracket zero, and at the bottom right corner of that rectangle we have C bracket 11. In other words, the computer has given us 12 bytes from top left to bottom right there, but where, according to this picture, if you pass in or try to copy more than 12 bytes, are we going to start writing? We're first going to overwrite what value? >> Bar. >> David: Bar, because bar is right below C11. So, as the picture implies, if you try to write to C12, or C13, or C bracket 14, or anything like that, you're going to start to go lower and lower into the stack, and so we're going to start to corrupt or clobber bar's value. Worse, we might clobber this thing called the saved frame pointer in green. This is another detail that the computer uses to remember where in ram it currently is and which memory it's actually using. And worse, if we write so much data into that array, completely overflowing it, such that we also overwrite the red return address, what the adversary can effectively do is trick the computer into returning to the wrong address completely. The next function that gets executed is not going to be main, where it left off, but it could be anywhere in ram. So, if you have this power now to tell the computer to trick the computer to go anywhere in memory, that just begs the question, where do you want them to go? Well, you probably want, you being the adversary, you probably want them to go to a location that has some malicious code, code ideally that you wrote that maybe circumvents a serial number check, that maybe circumvents a password check that does whatever you want this thing to do, and that's precisely what a buffer overrun does. So, if you just type in hello, h-e-l-l-o, and then you get the at the end of it, that's perfectly safe. But to be clear, notice that the string fills the stack from the top on down, because even though we've been drawing this picture, as though the stack is on the bottom and the heap is on the top, just because of convention, notice that it's a little small here, but the black arrow on the right says memory addresses grow downward. In other words, in memory, the smaller addresses are on top, and the larger addresses are on the bottom, counterintuitive though that may seem, given how we've been drawing the pictures, but the implication is that if you plop down H, that does in a low numbered address. E slightly higher, LLO slightly higher, slightly higher. If you keep writing well beyond those 12 chars, you're going to go to a higher and higher memory address, which pictorially is actually down below, at which point you will eventually start to overwrite that return value. So here's just an example. If the adversary has just written a bunch of junk, so represented here by the letter A, the character A, it's just junk, junk, junk, junk, junk, but then finally in the red boxes, just because the adversary is lucky or he or she is particularly adroit with figuring out what he or she needs to input into the machine, it turns out that we have a 32-bit integer now in the red box, so each of those red boxes represents a char, but if you have four of them, that's an ent, so these types of decimal values now just represent, collectively, a 32-bit value, a big number, a 32-bit number. But because the adversary was smart, he or she realized, through trial and error or some very clever handiwork, that if I write this address here, that's going to trick the computer into executing the code that's up here, because this address of 80C03508 just so happens to be up here at the start of this blue block labeled A. So the implication, if not too long a tale here, is that if the adversary points some malicious code, some password circumventing code, serial number circumventing code, at the top of the stack, where that first blue A is, and then very intelligently puts that address down here in the red boxes, this adversary can trick the computer into executing their code that they wrote, that were not actually in the original program. Now, what can this result in? Frankly, anything. If you have the ability to inject arbitrary code into the computer's memory, you can do anything you want, and a lot of the time, just crashes the server. So, if you ever happen to be a webmaster for some student group, you're trolling through your logs, just seeing what people are requesting, very often, you'll see just junk. You won't see valid URLs being requested. You'll see a part of a valid URL and a whole bunch of nonsense, binary data, and that's because one of the most common approaches to finding mistakes like this, finding a mistake like this, where I'm failing to check just how large C is, is just hammer on the machine. Throw as much data as you can on it, because you know what? If you actually cause that machine to crash, probably not your goal, but something interesting is there. There's some potential to do some bad work, so that's very often the first approach is just bang on the thing. And if it breaks, aha, now you've got a potential target. So this is a--there's a more general approach to this. You don't have to be as precise as this picture implies. What a lot of people use are these things called NOP sleds. So, a NOP, N-O-P, is no operation, and this actually a CPU instruction that's supported by many CPUs. That just tells the CPU, for a moment in time, for a little clock tick, do nothing. So if you have enough of these, you can actually tell the CPU, for some number of nanoseconds or the like, do nothing, do nothing, do nothing, but keep continuing down. So even though this picture implies that the adversary had to know precisely where all of these addresses were, they usually don't. And so, usually, the code they inject has a whole lot of NOPs that fills the space, but tells the CPU just keep going, keep going, keep going until you actually see an instruction that's not a NOP. So shell code refers to the code they want to inject. NOP just means keep going, keep going, keep going. So it's just a way of kind of filling the space in hopes that you actually maximize probability of compromising the machine. So, if you like this kind of stuff, or want to do this kind of stuff, CS61 next fall is a course that actually allows you to dabble in things like this. In fact, one of the first problem sets is really neat. It's a ticking time bomb, of sorts, whereby the professor hands you a program compiled in Linux. It has a number of password prompts, and you're challenged to figure out what the passwords are without actually guessing and without detonating the so-called binary bomb. In other words, if you type the wrong password in--it's actually quite clever. If you type the wrong password in to CS61's first problem set code and hit enter, and you're wrong, well, your TF gets emailed, and you lose minus one point. Do it again, minus two points. Do it again, minus three points. So, trial and error, brute force not so smart. So, what you'll find--here's a hint for your homework 12 months from now. So, use GDB, because using GDB, as we've already seen, can you poke around code? Can you set break points? And, in fact, that's how you can go about tackling that particular problem set by dissecting the binary, not just by running it and crossing your fingers that you can guess what the professor's choice of passwords are. So, it's a really neat class to take after this, and you talk about these kinds of low level details. And those of you coming to this course with familiarity with or experience in Java, these are really neat juicy details that you just can't discuss in Java, because in Java there are no pointers. There are no direct memory addresses. All of it is abstracted away, for better or for worse, using things called references. But the downside is that you don't have nearly as much control of your machine. The upside is adversaries don't have nearly as much control over your machine, so you get something and lose something. So with that said, let's pave the way for a little chat about forensics, so not actually breaking machines, but actually understanding how data is represented in recovery, stuff we actually care about. So, thus far in the course, we've pretty much talked about chars, and strings, and ents, and floats, and doubles, all of these primitive data types, but it's hard to imagine, using just those primitives, how you can implement anything more sophisticated, like a Facebook website or a search engine, which clearly needs to store more than just individual numbers. For instance, you might want to store in some program you're writing a student, and a student on campus is a real world entity, and most students have an ID number. They have a name. And if they're undergraduates, they have a house. So, it'd kind of be nice conceptually if we could just kind of clump all of those fields together and wrap them in a new data structure that we, ourselves, can define, and that's where we can, again, use this trick called type def for. So type def, we saw once before to declare what in CS50's library? A string. So, we used a one-liner that said, "Using type def, make 'string' a synonym for char*." Well, using type def, can you define more interesting data structures that aren't just synonyms? They're brand new data structures that just don't exist by default in C. The syntax for this is to say type def struct, and then open brace and then enumerate with line by line, typically, with semicolons at the end, each of the fields, so to speak, that you want to put inside of a new data type that you will henceforth call student. I could call this anything I want, but conceptually it seems sensible to call it student. So if I now glance at the code, of which you have a printout today, let me go ahead and go into struct.h. This simply looks like this. And let me quickly speak to this whole .h file. So you guys have been using CS50.h, string.h, standard lib.h, all of these header files. The motivation for those files, again, is so that you can tell the compiler, "Hey, I want to use some functions that someone else wrote; and, therefore, here's the file that declares all of those functions, and maybe some constants, and maybe some other things." But similarly, is it very common to use header files in C to declare data structures that you, yourself, want to use, and that other programs that use your code, if you start writing more sophisticated programs or libraries, themselves might want to use. So in the interest of best practice, what I did, even though this is a fairly simple program, I factored out these new definitions, this thing called a student. I put it in its own file called structs.h, just by convention, and what I then did in structs.c is what, at the top of the file, presumably? In structs1.c, I just do this. So here's a slight difference in approach. So there's a fundamental different approach between, includes CS50.h, standard IO.h, standard lib.h, string.h, and structs.h. Syntaxically, what's the obvious difference between these five lines of code? >> Quotes. >> David: So, the quotes. So instead of using angled brackets, which means this is a default server library or server header file that someone else preinstalled on the server, you use double quotes around the file name just to indicate this is a file that's in the current directory that, presumably, I wrote. So realize there is that distinction there. The blue is a little hard to read here, but apparently I'm using the sharp define preprocessor directive to define a constant called student. I didn't want to get bored by typing in half a dozen students into this program. I'm just going to type in three, but I could very quickly change that by recompiling, after changing that constant. And down here I have the following, a main function that doesn't do all that much, but it does show us how to use this new data structure. So, again, you have a printout, if you'd like to follow along. Notice what I've done at the top. I want to declare a class of students. So now, finally, consistent with this preaching of using good style, declaring your variables with smart names, well, let's start thinking about what we want to model, just in colloquial terms. I want to represent a classroom full of students, so I'm going to call my variable class. It's going to be of type student, but it's going to be an array that has specifically three students in it. So this is just an array of three data structures of type student; and, collectively, I'm going to call that class. Okay, so it's pretty easy now to refer to this. So now I have a loop. So for each of these students, I'm going to iterate from zero up to students, which, again, is three. Now notice the new syntax, if unfamiliar. I'm first going to tell the user, what's the first students ID. Then I'm going to use the familiar, get ent, and I'm going to use class bracket I. So give me the Ith student in the class, the Ith data structure, and then .ID. This dot operator just says go inside this structure and assign it to the field called ID. So dot simply means go into the struct and access this field. New syntax, not all that hard. What's the student's name? I do the same thing with string. What's the student's house? I do the same thing with string. And then just to keep the formatting of my program nice, I just print a new line after that. So, now I've got a little curious. I wanted to just use some other string functions, and I decided somewhat arbitrarily I want to print out any student who happens to be in Mather, a little biased. So 4 ent I gets serial up to student, so iterate over the whole class again, and here's a function that should be familiar, for those of you who've tackled Sudoku already. If string compares, stir comp of the Ith student's house "Mather" equals, equals zero--well, we saw this briefly last week. Stir comp returns zero if two strings are equal. It returns a positive value if one belongs before the other string, lexically, graphically, or in a dictionary, or it returns less than zero if it belongs after the word. So, you have this ability to sort strings, using this function as a building block. I'm just using it for equality. So, I'm checking, is the return value of stir comp, given this string and given this other string, equal, equal to zero? If so, let me go ahead and print out who's in Mather? What percent S is my placeholder? Class, bracket I.name gives me the Ith student's name, and it just happens to be a char*, which is why percent S suffices. And then finally, down here, one of the most new important things that I've done. Why do I now call free on every student's name and every student's house? >> To return memory. >> David: Sorry? >> To return memory. >> David: So to return the memory, but yet I didn't call malloc, right? The rule of thumb last week, very simply, was if you call malloc, you'd better call free at the end. So, why am I, nonetheless, calling free here? >> Because it raised our pointer. >> David: Okay, because it raised our pointers, raised our very much like pointers, so that's the right direction, but I didn't call malloc. Let's focus on that inconsistency. Why am I calling free, even though I did not call malloc? >> Get string. >> So get string dict, right? So one of the reasons for peeling back that layer last week and looking briefly at the implementation of get string is that all this time it's had this inherent memory elite, because get string calls malloc to get as much ram as is needed to store the string the user types in, but if you, the person who's entrusted with that pointer, after get string returns, does not free this, you can begin to experience in your programs that very familiar real world annoying feature of some software that's poorly written of your computer grinding to a halt eventually or having to reboot, because everything is getting really, really slow, even though you're not doing anything with the computer. Quite often, is it simply a memory leak? So, thereto is it important to know what the functions are actually doing. So, let's go ahead and compile structs 1. I'm going to use make here. I'm going to go ahead and run structs 1. Student's ID, we'll call this student 11, and this will be David, and David is in Mather. Okay, student 12 will be Joe, and he'll be in Leverett. And the student ID 13, this will be Sally. She'll be in Foho [phonetic]. All right. Hello, Sally. So, David is in Mather. Okay, so a very simple demonstration, but nonetheless actually introduces finally this ability to define our own date structures and to even store those data structures inside of other data structures like an array. So, an array, even though it's barely a data structure, it's just the same type of data type, back, to back, to back. It's one of the first date types that we've actually looked at, but this program is not terribly useful, because I'm not doing anything with this information. Once I hit enter, yes, it tells me who's in Mather, but then it throws away all of that information. In fact, the program very explicitly frees all of that memory. So, if I actually wanted to do something with this, maybe I should use something like version 2 here. So this, too, you have a printout of. Notice that it's very similar to the code before. I'm populating the class variable with the same data as before, but before I quit this time, I introduce a new loop that does this. So, one thing we've not yet introduced, and that will become invaluable when it becomes time to actually recover JPEGs and other such things for forensic purposes, let's actually save these students to disc. So a new function call here. There's a function that exists in C and a whole bunch of languages called F-open, file open. Takes at least two arguments. The first is the name of the file that you want to open. Typically, if it doesn't exist, it will be created as a result of this function call. And then a second argument, which in this case, is "W". Any guesses as to what the W denotes here? W for write. So this just means open the file, but open it in such a way that I can write to it and change it, not just read from it. So that's useful, too, since, in this case, we want data going out, not coming in. Now FP not equal to null. So, null, again, is this special value that generally signifies errors. Why, I might, just thinking intuitively, why or when might F open return null, do you think? Just a possible reason might it return null? >> Another file with that name already in the database folder with--that's read only. >> David: Okay, so maybe there's already a file called database in my current directory, and maybe, for whatever reasons, it's read only. I, as the user, don't have the right to change that file, because maybe it came with the operating system. Maybe another user put it there. I just don't have the privileges. What's another reason that you might fail to write a file to disc? >> [inaudible] >> David: Sorry? >> No disc space. >> David: No disc space. Okay, so at some point, we might realize that as much as I want to put these bits on disc, there's just not room, so something has to fail at some point. So let's take a look now. I, at least, double check. Is FD not equal to null? Because if it's not null, that means I've gotten back what's called a file pointer or a file handle. It's called any number of things, but it's just a variable that somehow represent that file's location on disc. Now I do this. I go ahead and iterate three times from zero up to students. And what do I do? Well, I use this slightly different function that's very similar to the one we've used. Not print F, but F print F. What's really neat about F print F is that it's the file version of print F, and it just takes one additional argument that allows it to distinguish itself from print F. The first argument here is not the string you want to print, but a pointer to the file that you want to print this sting to. So this means send the following string to FP, that file pointer. What string? Percent den. What is the ent that should go there? Just the student's ID. After that, put the student's name. After that, put the student's house. And then, finally, after this loop is done, close the file. So, what I've done is dump into a file student, after student, after student, and each student's fields happen to be on their own line. So let me go ahead and compile this, make structs 2, because the rest of the program is identical to structs 1. Let me go ahead and run structs 2, enter, and now if I do this, and then David, and then Mather, and then 13, and Joe, and Leverett, and then 14, and then Sally, and Foho, enter. Okay, so the program has behaved the same, but if I do an LS now, notice I have this file that if I do an LS-L, was just created a moment ago. And as an aside, this is the size of a file. This is when it was last modified. LS-L shows you some interesting tidbits. Let me go ahead and open database, and voila, there's my database. So what is my interpretation of a database here? Just a text file that's got this data line, after line, after line. So, in effect, what I've kind of done now, albeit it very quick and dirtily, I've come up with my own file format. This is my, sort of David Malan's file format for students. I just have to know that a student record is stored in three lines, one after the other, and I, myself, just have to remember that the ID comes first, then the name, then the house, then repeat some number of times. So, now, if we take a step back and consider the real world, when you've saved the file in Microsoft Word format, or in HTML format, or in any file format that has one of those three letter extensions, typically, you are simply telling the program, save the contents of whatever it is you want to save, in the proprietary format at Microsoft, or Apple, or whomever it came up with. Now, that just means that they decided, on their own, how they want to put bits or how they want to put characters on disc. Probably, Microsoft Word is not as naive as just storing the words in your file from top to bottom, left to right, because there's a whole bunch of junk in any Word document. There's bold facing, italics, tabular information. There's a huge amount of metadata, so odds are Microsoft can't get away with such a simple file format, but some formats are relatively simple. So JPEG photographs, when you take them with a digital camera, actually are stored on disc, whether it's your hard drive or your compact flash card, or your SD micro card, whatever your technology is. They're stored by storing a pattern of bits at the very top of the disc that say here comes a JPEG. Then there's some complicated mathematics that implement that JPEG with a whole bunch of bits, megabytes worth of bits, usually, and then sometimes at the bottom, there's a few bits that say stop, this is the end of the JPEG. And what you'll ultimately leverage in problem set five is that signature at the top of the JPEG. Just as here I could make this leap of faith and assume that every there lines represents a student, so in the context of forensics can you sometimes assume that any time you see this pattern of bits, aha, here comes a JPEG. Let me just keep reading, reading, reading bits until I see that pattern again, at which point I know I've encountered what? >> [inaudible] >> David: Another JPEG. So digital cameras, frankly, are pretty stupid when it comes to storing information. They generally write one photo, then the other right after it, then the other right after it, and the other right after it. And so, in fact, you'll find that using some basic building blocks like these here today can recover that data relatively easily. Let me show you one other trick here, going back to a piece of code we had last week. We had this sample, stupid program called Bar Dot C. You had a printout of it last week, and it was really just meant to demonstrate GDB, because it has a bunch of functions, a bunch of variables, and I can step through each one. Well, one of the most powerful features of GDB, especially now that some of you are in experiencing seg faults with increasing frequency, is to ask GDB, okay, yes, I have a seg fault. I know that, but where did it happen? One of the things you can do with GDB is this. I'm going to go ahead and run GDB on this program called Bar. I'm going to set a break point at the function called bar, because recall from this program that main calls foo and foo calls bar, so I want to break, once I'm pretty nested inside this program, once I've got a bunch of frames on my stack. So I'm going to go ahead and type run now. It's broken right at the point where bar is executing. Actually, let me get rid of that error message. Make bar. Okay, so break in bar, run. So it's going to break in the bar function. So you can type list to see where you are in the program, but what's really neat is you can type back trace or just BT for short, and what this will show you, somewhat arcanely, is a list of all of the frames you currently have on your stack; in other words, all of the functions that were called up until this point. So, if you're running GDB, and you encounter a seg fault, and you really don't know what went wrong, frankly, you can just kind of infer where it happened by typing back trace often. You'll see all of the functions that may have been called. Now, especially in Sudoku, where you're using that fancy library called curses, there might be not only your functions in there, but a whole bunch of functions you didn't write. Generally speaking, it's safe to assume that those functions are correct. So, much as you'd like to disagree, odds are those functions, many of which have been around for many, many years, are correct, which means probably one of the functions you wrote, that called that huge list of functions, is the culprit. So, in this case, what I see here is this. The oldest frame on my stack is this one here, number two, for main. After that, there's another frame on the stack for foo, and then this third one for bar. So what I can actually do here, BT for back trace again. If I want to see, you know, what was going in main, right now I'm in bar--whoops. Let me fix since I accidentally pasted something in. Break bar, run. Okay, so now this is bar. Suppose that here's my back trace? You can type frame, and then the number of the frame you want to go to, frame two. Now notice I'm in frame two. It doesn't mean that I've said go continue executing main. Go back to frame two and start executing. This means let me poke around inside of this stack frame. If we have this picture from last week or even this fellow's here, when we have various frames from the stack, this just means let me poke around inside this frame, so that now I can type things like print A, and I can actually see the value of A, even though that variable is out of scope inside of bar. And just to be kind of curious, notice what we can do here. I'm going to rerun the program, so as to start from scratch. So run is going to break at bar. And let me do this. Notice that bar has a local variable or really a parameter. This is what bar looks like. So I'm going to print M. All right, M, it was the value 10. That's not interesting. Let me print the address of M. So ampersand M is the address of operator, so not I get a fairly looking hexadecimal string, but that's just a fairly concise way, we decided last week, of representing big numbers. Hexadecimal, because you have more digits. You can represent things more compactly. And, yeah, look at this. This is a pointer of type ent. Okay, so this is a cryptic-looking address, but let me now do this. Let me go into the frame for main. Main, itself, apparently has a variable A. I can do a sanity check, print A. A was five, whatever, but let's print the address of A, and now we have this value. So now notice, even though this is a little cryptic, this number is almost identical to this one. This is from bar. This is from main. And now notice which of these numbers is bigger. It's somewhat subtle, but which of these numbers is bigger? They're almost the same. Zero X, bravo, foxtrot, 976 Charley, and then they only differ at two places, 60 and 9C. Even if you're not quite comfortable with hexadecimal yet, this one on the bottom is, in fact, bigger, which confirms the prediction earlier that when we are manipulating the stack, these larger addresses are, in fact, down below. Main, the address of Main's variable is bigger numerically, its address, because it's lower in the stack. It's one of the earlier stack frames. So, here, too, using GDB, can we actually poke around with such detail that we can see precisely the assumptions that adversaries are exploiting. So, that was a lot. Let's take a five minute break. All right, so we are back. We are back, and so is lunch. So not this Friday, but next Friday will be the next lunch with CS50's team. If you'd like to RSVP, just go to CS50.net/RSVP. We will have to cap at ten, just so we keep things sufficiently intimate, so that we can actually talk with one another, but check that out on the website, if you are available. And free, so exit a small leap we've seen now multiple times. It's going to recur, especially in the context of forensics and our use of file IO. Incidentally, file IO simply means input and output, anytime you've seen that acronym somewhere. And we've seen briefly here functions like F-open and F-close. We saw print F, the analog for reading is F scan F, which we'll use before long. F read and F write allows you to read and write, not ASCII characters, but binary data, which is useful when you're writing things that aren't, by nature, ASCII. FEOF is just a function that tells you am I at the end of file? EOF. Yes or no. So more on those in the time to come. But before we get to this sexy picture, a less sexy picture, hexadecimal. So we saw this. Zero, zero, zero, zero in binary represents what decimal number? Easy question. >> Zero. >> David: All right, so zero. But in hexadecimal, it also is represented as zero. So hexadecimal digits conveniently can be represented, hexadecimal digits conveniently represent four bits. So, in other words, let's disregard decimal for today and just say that this is hexadecimal in the right-hand side. If, instead, I want to represent the number one in binary, but in hexadecimal this is simply one, then if I fast forward to, say, 1111, this equals what in hex? >> F. >> David: So this equals F. So all ones, that is what? The number--let's see, this is the 8's column, plus 4's column. That's 12, 13, 14, 15. So 15 in decimal is F in hexadecimal. All right, but we've been looking at numbers that are much longer than just single characters, right? We saw addresses like this thing here, which clearly has multiple hexadecimal digits back-to-back. So all this means is that you take the--you take those numbers and then figure out what the corresponding bit patterns are. So, in other words, if I had this, one, two, three, four zeroes in binary, followed by four ones in binary, this would be represented in hex as zero F. Or just to make clear the notation, zero X just says here comes a hexadecimal value. So zero F here. If, instead, I did all ones--one, two, three, four, five, six, seven, eight--and it doesn't line up, because the zeroes are fatter--that happens to represent what? Well, this is the 128's column. This is the 64's column. This represents the value 255. So the largest decimal number you can represent with eight bits, we've seen long ago, is 255. In hexadecimal, though, that's just zero XFF. So, again, those of you who have some HTML background or play around with Photoshop, you very often see precisely these hexadecimal codes. In fact, let me see. This is somewhat tangential to CS50 but does demonstrate exactly where this stuff does come up. So this is Photoshop, a very fancy photo manipulation tool. This is the color wheel. You'll notice that there's this mention of RGB, red, green, blue. So the way that numbers are represented--and this actually will be useful for problem set five, as well, as you'll see, they are represented as RGB triples, a value for red, followed by a value for green, followed by a value for blue. And what this means is that pictures are made up of pixels, little dots. Every one of those dots has a color. That color can be implemented with some amount of red, some amount of green, and some amount of blue. You can combine those in so many different ways that you can essentially get every color of the rainbow, just by using those values. So, here, I have the value 00 00 00. So that means no red, no green, no blue. What do I get? Well, the absence of all color here gives me black. If, by contrast, I say give me a lot of red. That's a lot of red. So give me a lot of red, but now crank up the green, then crank up the blue. That gives me all of the colors. And if you have all of the wavelengths of light, you have white. But as we just saw, this would be red. If I want to see green, I would put the FF's in the middle, because it's RGB. And if I wanted to see blue, I have this on the end here. So this just means you can string these hexadecimal characters along and along to form not four bit values but eight bit values. So now, if I go back to GDB, if I want to try to understand this a bit better, this here is simply an eight bit value. This here is simply an eight bit value. This here is simply an eight bit value. Again, because every hex digit represents four bits, which is one of the motivations for using hex at all is because there is this really clean equivalence. And then, finally, this here is also eight bits. So, I have eight, plus eight, plus eight, plus eight. So this just happens to represent a 32 bit integer, which is really nice to see, because we said last week that pointers are themselves 32 bit ents. When you have a 32 bit machine, that means its pointers are 32 bits. And, voila, there is that ent represented, albeit in hex. Now, if you find yourself looking more closely at some of this buffer overflow stuff back at home on the slides, you'll actually notice that some of these digits, some of these hex digits are backwards on the diagram, and it's not a mistake. And that will actually come up in problem set five, as well, a discussion of what's called little Indian-ness and big Indian-ness. Essentially, years ago, the world had to decide, when we want to lay out multi byte numbers in memory, do we go from left to right, or do we go from right to left? It's a bit of an oversimplification. P set five will tease this apart. But, in short, different computers do it different ways. And what you're seeing here is an artifact of a decision made many years ago, but the computer figures it out. So CSI. So this is just kind of a silly way of introducing the context in which P set five will have you dabble. So, years ago, a friend of mine literally had taken photos on a digital camera, and something went wrong. Like, his compact flash card or whatever it was got corrupted. And at the time, I was very conveniently interning as a graduate student for the local Middlesex District Attorney's Office, where I was doing forensics work for them, and it was really one of the best summer jobs, frankly, I've ever had, because we were literally, as I think I said weeks ago, were taking in from the Mass State Police things like hard drives, and CDs, and floppy discs, and sometimes things that don't actually have data on them, like monitors, and keyboards. And so we just kind of piled that in a separate pile, but we then had to answer questions of the form, did he do it? Right? Or is there any evidence to confirm or deny this particular bad thing that happened? And so much of this job was about going through these hard drives and CDs, and, yes, as I've said, just double clicking folders sometimes--really not all that sophisticated, some of the criminals in this county, but sometimes people had, at least, dragged things to the recycle bin. Now, as most people in this room know, from experience, or bad experience, dragging things to your recycle bin or trash can doesn't really do anything until you click what button? >> Empty. >> David: So empty trash or empty recycle bin. So more on that in just a moment, but there's even more worrisome aspects of data storage. So, in fact, there was a really neat story a few months ago about the iPhone. Those of you with iPhones or iPod touches know that when you push the button on the bottom there, you actually get this fancy animation where everything kind of shrinks down, and you then see your desktop. Well, I believe the way Apple did, or maybe still does implement that feature, is they don't resize your whole window down to a little dot. Rather, they very quickly take a screenshot and then shrink hat screenshot, because it's very easy to shrink an image. It's a little harder to shrink actual applications. Now the problem is a forensic analyst realized was that people then, who might hit the home button, while reading a really personal or really sketchy email or IM had this paper trail on the flash memory in their iPhones or iPods, because the moment they hit that button, their phone or iPod took a screenshot. That screenshot had to get saved somewhere on disc. Now, these things don't use hard drives. They happen to use flash memory, but it's the same idea. Those zeros and ones got written to the phone or the iPod's memory. And if someone got curious or nosey enough and had physical access to this phone, they could start reading in all of those zeroes and ones. And much like you will do for P set five, they can just look for patterns. And even if that screenshot was intentionally deleted by the iPhone software, odds are it's still sticking around, unless the user has hit that home button a whole bunch of times or unless they've downloaded a whole bunch of MP3s and movies since then that might probabilistically overwrite those bits. But there's yet other problems, too. I mean, there's some very familiar ones. If you've ever sat down at a computer and used a browser, if you're visiting websites that you don't want anyone else to know, you probably should do what after stepping up from the computer? Clear your cache or clear the history, right? Or do any number of tasks, but even those things are very imperfect. And you've probably seen and maybe have kind of crossed that ethical line of clicking your friend or your roommate's history bar just to see explicitly where they've been going. [laughter] You laugh, because it's true. [laughter] So it's because a lot of software, browsers being one of the worst offenders, just remember a lot of information, which tends to be useful for the user. It means you can have auto completion, and it can remember that you've been there, and it can cache. So there's a lot of technically and UI useful aspects to this, but all that data gets stored somewhere. The history gets stored somewhere. All of the JPEGs that comprise a web page that you visit have to get stored somewhere. And worse than that, computers these days have something called virtual memory. So long story short, most computers have, these days, like what? A gig of ram, two gigs of ram, but the operating system actually lets itself think that you have four gigs of ram or six gigs of ram, because that's convenient. And by having more ram, or more elusions of more ram, can you then run more and more programs at once, at least in the background, right? You might not be interacting with all of them, but if you tell, you trick your computer into thinking this person has so much ram, it means you can double click more icons and have more stuff running at once, but the means by which most computers implement that elusion is they steal some hard disc space or some flash memory space, and they say pretend this space is ram. And what the computer does, unbeknownst to you, is when you start to run too many programs that physically can't fit in ram, some of them get shuttled to this thing called virtual memory. They get copied from ram, which is purely electronic, to disc, which is often mechanical, and it's slower, but they do this transparently. So you, the user, don't know where all of those bits are, but you can sometimes infer where they are. If you start to run more and more programs on your computer, and you can simulate this tonight, what eventually happens if you try running 10, 20, 30 programs at once? Odds are they will open, but slowly. The thing will slow to a crawl, and that's because a lot of those programs, yes, they're getting loaded into memory, but they don't fit in memory, so they go to virtual memory, and just because of how this stuff is designed, hardware and things like, mechanical things, like hard discs and even flash memory are slower by nature and by cost to then something like ram. So you see this even in the real world. Now, the takeaway here is that even if you cover your tracks, and empty your recycle bin, and you clear your cookies, and clear your history, and all of this, there is a good chunk of hard disc space on your computer technically called swap space--many operating systems do this--that still has a copy of all of that stuff. So, in fact, one of the neat things that Mac OS actually has--I don't believe this has come built-in to windows yet. Maybe 7 is different. You have this little security tab here, and there's this thing called FileVault. So one of the things FileVault does is it encrypts your whole hard drive so that even if you are doing sketchy things or just private things on your computer, all of that is encrypted using something a little more sophisticated than Caesar Cipher, but the same exact idea. And I actually don't remember--oh, here it is. So if I actually log in here, you'll see this. There's this little icon, for those of you--so this benefits half of you. The other half, good luck. But half of you can open up your security control panel and check this. Use secure virtual memory, which means that feature I just described of swap space, aka virtual memory, you can tell the computer, albeit at a slow performance hit, to encrypt that, as well, so that so long as I log off of this machine, if someone steals it, and runs away, and performs various forensic analyses on it, there would be very low probability can they figure out what bits were actually there. So cryptography is really, really your friend these days. And if you are liking this kind of stuff, do bear in mind computer science 120, which is offered in alternating terms. So what's the implication here? So let's consider a very simple example of just storing some files on disc. So inside of a typical computer is this thing called a hard drive. If it's still mechanical, which it is in most of your laptops, unless you paid a bit of a premium to get SSD or solid state drives, which is flash memory, which has no moving parts, but the story is the same. Inside of your computer is a hard disc, and a hard disc is typically made up of these platters. So if you recall floppy discs from many years ago, there's those little circular floppy things, hard drives have the exact same idea, but they're much thicker, they're much stronger, and they spin a lot faster, so they're more robust and store more bits. But when you save a file to disc--say it is a JPEG--so a JPEG is going to be some kind of image that has kilobytes or megabytes worth of data. So supposed it takes up this many bits. Now, I have no idea, as the user, where that JPEG ends up on disc, so to speak. The operating system decides for itself where does it currently have space, but it ends up somewhere there on disc. What the operating system then does is it has a little table called a directory table or something like that, and it's essentially like an Excel spreadsheet with at least two columns. There's definitely more juicy stuff there, and then there's one column called name, and then another column called vocation or something like that. And then the name of this file might be foo.jpeg. Apologies for the handwriting. And then location is a number. We'll write it in hex, but it's just a number, 0X123. And what that tells the operating system, or reminds the operating system, is the next time the user double clicks or searches for foo.jpeg, go to what address? Will go to this location on disc, 0X123 or whatever it happens to be. Now what if the user decides, you know what, this is a really private JPEG, or this is just a really bad photograph, let me just delete it. And here she drags it to the recycle bin or trash can. What happens to this picture? So absolutely nothing. So, first of all, most of you knew this. If you drag something to the recycle bin or trash can, unless you wait a few days, a month, however long it is your OS keeps things around, nothing happens. It's like dragging it to a separate folder. But if you do have a little bit of savvy, and you do right click and choose empty trash or empty recycle bin, or wherever your menu option is, what then happens to this picture, when you actually delete that file? >> [inaudible] >> David: Yeah, so this goes away. Essentially, this goes away. This goes away. The operating system just forgets where that file is. But for historical reasons and for performance reasons, historically, this piece of the picture doesn't change, right? Back in the day, computers were very slow, and the idea of wasting time and making the user wait so that you can then delete these bits, as well, well, what would that even mean? You can't just delete bits and erase them, right? You need those bits; otherwise, your hard drive is going to get perpetually smaller and smaller every time you delete a file. So, what would it mean, really, to delete a file? Well, maybe change these zeroes and ones to all ones, or all zeroes, or just random zeroes and ones. But that used to take awhile. So, to this day, most operating systems forget where the file is, which is why programs like Norton Utilities, if you're familiar, any of these tools that you might have on your own computer, that lets you undelete files, they generally just remember what this table used to look like and then, with some probability, can they go to the hard drive and get back those bits. But I say, with some probability, because what's probably going to happen, over time, after you've deleted a file in this way? >> [inaudible] >> David: Yeah, so they're going to get reused. The operating system eventually is going to say, you know what, I need maybe not exactly those bits, but I've got a pretty big JPEG that this user just downloaded, and now I need these bits. So, you know, in fact, and we used to see this at the DA's office, we would see partial JPEGs. We would lose part of the image, part of the gif, whatever it was, but not necessarily all of it, and that's just because, by chance, probabilistically, some of those bits were reused elsewhere. So you'll see in problem set five is what? Problem set five is sort of like the end all of problems sets, it seems, but it's really neat, because it's really this intersection of some very familiar technologies, and we'll hand you this article written by a colleague of mine from MIT, from grad school, who did a really neat project himself, as a Ph.D. student. He went on eBay with a bit of cash and started buying up all these hard drives that people were selling, 100 gig drive here, a 500 gig drive here. Then he imaged all of these drives, much like Homeland Security or the district attorney's office would, and then he started analyzing all of these files, all of these hard drives. And what he found, as you might expect, because these things get mentioned in the media once in awhile, were people's social security numbers, people's credit card numbers, pornography. Anything that someone might have on their hard drive ultimately made its way to Maxwell Dorkin [phonetic] across the street. But what was really neat--so the point for him was not to get some cheap pornography on eBay, but, rather, to actually assess with what frequency this is happening and what tools could we, could they, those users, start using to more effectively, you know, scrub their information. And what the scary thing is--and we'll present you with this paper in PDF form, this article--so many of the tools out there, that you would pay good money for to, you know, make sure that your financial documents are safe, your personal documents are safe, so many of them are buggy. And one aspect of this article was to find innumerable faults with all of these tools that would overlook things like the swap space, when wiping your hard drive, or they would overlook something called slack space, which this article will discuss. But this, again, is one of the points of a course like this, because by understanding, from the ground up, how all of these little things work, one, this kind of literature is accessible to you; and, two, you understand what this machine is you guys use every day. So with that said, we'll see you Wednesday. ==== Transcribed by Automatic Sync Technologies ====