## Week 5 Friday Andrew Sellergren ### Announcements and Demos (0:00-5:00) + This is CS50. + Today we'll ruin a bit of TV and movies for you. You should know that whenever the crime scene investigators say "zoom and enhance," there's really no such thing as "enhance." At some point, you zoom in far enough that all you see are individual pixels. + Similarly, there's no such thing as a [GUI interface using Visual Basic to track an IP address](http://www.youtube.com/watch?v=hkDD03yeLnU). Frankly, that just doesn't make any sense: you wouldn't use Visual Basic and you probably wouldn't need a GUI, although you might actually be interested in an IP address in the end. + Even my beloved show Numb3rs is [guilty](http://www.youtube.com/watch?v=zvGW4mlwqos). The code which Charlie reads on his screen is Objective C and looks to be part of a kids' drawing program (hence the variable named `crayon`). + As of Monday, you'll have 86,000 additional classmates as your counterparts from edX join you in the CS50 saga! To help scale the class to this size, we'll be introducing CS50 Check, a new piece of the Appliance which will allow you to run your code against various correctness checks in realtime. + If you'd like to talk about your results from Quiz 0, don't hesitate to reach out to a member of the staff! ### Forensics (5:00-12:00) + In a previous lifetime, David worked for the Massachusetts District Attorney's office as a forensic investigator. From time to time, the police would drop off hard drives and other digital media that were suspected to be involved in white collar crime. David's job was to search them for incriminating evidence. + As you might have gathered from zooming in on Rob's face earlier, images can be represented as a series of dots, each of which is called a *pixel*. If we decide to simply things, each of those dots can be either black or white. We can represent a black dot with a 0 and a white dot with a 1 so that an entire image can be encoded in binary. We might call this 1-bit color since it only requires 1 bit. If we decide to be a little fancier, we might use 24 bits to represent each dot. Then we can have millions of colors! + Image files usually contain more than just pixels encoded as bytes. *Bitmap* files, for example, contain a header which stores information about the type and size of the file. In general, file formats are just standards for interpreting patterns of bytes. Within the bitmap file format, there is room set aside for RGB triples. RGB stands for red green blue, the three colors with which you can express all possible colors. An RGB triple is a three-byte chunk of data which encodes the color of a pixel as a certain combination of red, green, and blue. + Knowing that images are really just sequences of bytes, you'll be asked in Problem Set 4 to recover a number of images that were accidentally deleted from a disk. Although the images were "deleted," the bytes themselves remain intact. To recover the images, you'll simply need to search for the pattern of bytes which denotes the start of a JPEG file. ### The CS50 Library (12:00-31:00) + The CS50 Library is a set of functions and types we provided you to make it easier to get user input. In it, we also defined a `string` to be a `char *` until we could reveal to you what a pointer is. Going forward, we'll rely less on getting input from the command-line prompt and more on getting input from files and network connections. + The CS50 Library includes a number of other libraries, including `stdbool.h`, which contains the definitions of `true` and `false`, as well as `stdlib.h`, which contains the definition of `malloc`. The following line in `cs50.h` defines the `string` data type as a `char *`: typedef char* string; + `cs50.h` is actually an example of a `header file`. Thus far, we've been writing our prototypes at the tops of our files. However, we could also have written these prototypes in separate files named with the `.h` extension. The definitions of the functions themselves would then live in a file of the same name but with the `.c` extension. + The comment next to `GetInt` notes that if a line can't be read, it will return `INT_MAX`. In the case that the user types a number that is much too large to be stored in 32 bits (a googol, for example), the function will return this sentinel value. As you might guess, `INT_MAX` is a constant representing the largest possible 32-bit signed integer, or around 2 billion. We use this number to signify that something went wrong. Knowing this, your programs should actually be checking to see if the return value of `GetInt` is equal to `INT_MAX`, which would indicate that an error occurred. + `cs50.c` is actually pre-compiled into the Appliance so that you need only include `cs50.h` and link to it at compile-time if you want to use its functions. + The implementation of `GetInt` is as follows: /* * Reads a line of text from standard input and returns it as an * int in the range of [-2^31 + 1, 2^31 - 2], if possible; if text * does not represent such an int, user is prompted to retry. Leading * and trailing whitespace is ignored. For simplicity, overflow is not * detected. If line can't be read, returns INT_MAX. */ int GetInt(void) { // try to get an int from user while (true) { // get line of text, returning INT_MAX on failure string line = GetString(); if (line == NULL) return INT_MAX; // return an int if only an int (possibly with // leading and/or trailing whitespace) was provided int n; char c; if (sscanf(line, " %d %c", &n, &c) == 1) { free(line); return n; } else { free(line); printf("Retry: "); } } } + Remember that `GetString` returns `NULL` on failure, so we first check that here and return `INT_MAX` to pass the error along. + Once we have checked that `GetString` has returned a non-`NULL` value, we pass the user's input, stored in `line`, to a function `sscanf` which looks for format strings within it. You can think of `sscanf` as the opposite of `printf`: it reads in as opposed to writes out. In this case, we're looking for an integer followed by a character. What we're checking, however, is if only an integer was found. In that case, `sscanf` will return 1, the if condition will evaluate to true, and `GetInt` will return the integer. If both an integer and a character are captured, `sscanf` will return 2 and the user will be asked to retry. This is a clever bit of error detection: if the user types "123 f," then both `n` and `c` will be filled and we will ask him to retry; if the user types "123," then only `n` will be filled and `GetInt` will return the integer 123. As you can infer, `sscanf` returns the number of placeholders it filled. Because we put a space before and after the `%d`, the user is allowed to type a space when entering an integer; `sscanf` will ignore it. If the user types *only* a character, `sscanf` will fill neither `n` nor `c` so it will return 0. Note we're passing pointers to `n` and `c` so that the memory they point to can actually be filled by `sscanf`. + `GetInt` and many other CS50 Library functions call `GetString` to do their dirty work. Let's take a look at the implementation of `GetString`: /* * Reads a line of text from standard input and returns it as a * string (char*), sans trailing newline character. (Ergo, if * user inputs only "\n", returns "" not NULL.) Returns NULL * upon error or no input whatsoever (i.e., just EOF). Leading * and trailing whitespace is not ignored. Stores string on heap * (via malloc); memory must be freed by caller to avoid leak. */ string GetString(void) { // growable buffer for chars string buffer = NULL; // capacity of buffer unsigned int capacity = 0; // number of chars actually in buffer unsigned int n = 0; // character read or EOF int c; // iteratively get chars from standard input while ((c = fgetc(stdin)) != '\n' && c != EOF) { // grow buffer if necessary if (n + 1 > capacity) { // determine new capacity: start at 32 then double if (capacity == 0) capacity = 32; else if (capacity <= (UINT_MAX / 2)) capacity *= 2; else { free(buffer); return NULL; } // extend buffer's capacity string temp = realloc(buffer, capacity * sizeof(char)); if (temp == NULL) { free(buffer); return NULL; } buffer = temp; } // append current character to buffer buffer[n++] = c; } // return NULL if user provided no input if (n == 0 && c == EOF) return NULL; // minimize buffer string minimal = malloc((n + 1) * sizeof(char)); strncpy(minimal, buffer, n); free(buffer); // terminate string minimal[n] = '\0'; // return string return minimal; } + The first line of interest is the declaration of `c`. Although this will store a character, which only requires a maximum of 256, we'll use an `int` to store is so that we can use one additional value as an error indicator. + The loop condition calls a function `fgetc` with `stdin` as a single argument. `stdin` represents keyboard input, as opposed to `stdout`, which represents output to the screen, and `stderr`, which represents a special type of output to the screen. `fgetc` reads a single charater of keyboard input and returns it to be stored in `c`. We then check that `c` doesn't equal the newline character (if the user has hit Enter) or the `EOF` (end-of-file) character (if the user has hit Ctrl+D). + `n` holds the number of characters that the user has entered. If `n` is greater than `capacity`, then we have more characters to store than memory in which to store them. Thus, we need to ask the operating system for some more using `malloc`. Actually, we also use a function named `realloc` to ask the operating system for memory at the end of the currently allocated string, if possible. + If you follow the rest of the logic in `GetString`, you'll see that it doubles the size of the memory chunk until there's enough to store the user's input. This doubling will cause the size of the memory chunk to grow very rapidly meaning that `realloc`, which tends to be slow, will only need to be called a few times. At the end of the function, we ask for a memory chunk that is exactly the size that we need so we don't waste memory. ### Attack! (31:00-41:00) + Now that we know more about memory allocation in C, we need to begin to think about how it can be abused. Consider the following snippet of code: #include void foo(char* bar) { char c[12]; memcpy(c, bar, strlen(bar)); } int main(int argc, char* argv[]) { foo(argv[1]); return 0; } + So there's a few bugs here. First, we don't account for systems on which a `char` might be more than 1 byte. Second, we don't account for running the program without providing a command-line argument. + Most importantly, we're not checking that we have enough memory to store the user's input. `memcpy` takes as its third argument the maximum number of characters to copy. However, if `bar` is longer than 12, then we'll be writing more than 12 characters into only 12 bytes of memory. This is what's generally known as a buffer overrun attack. + In previous weeks, we represented the stack as having just local variables and arguments on it. However, it turns out there's other important information stored there as well. The following diagram represents the stack as it would exist in the middle of execution of `foo` as defined above: The stack. + When `main` calls `foo`, it informs `foo` of its return address. That way, when `foo` finishes executing, it knows where in memory to jump to continue the program's execution. This is represented by the red block in the diagram above. + Notice that `c` is growing downward. If we write more than 12 bytes to it, we might overwrite `bar`, something called the saved frame pointer, and the return address. If this return address were to be overwritten with a valid memory address of malicious code, then `foo` will execute this code when it returns. We might represent this like so: The exploited stack. + Within the red block, the numbers that represent the address 0x80C03508 have been written (in reverse order, known as Little Endian). When `foo` returns, it will jump to this address in memory, which is actually the start of the block of blue A's at the top, the same block where this malicious user's input was stored. Thus, the malicious user can type in some attack code followed by this memory address so that when `foo` returns, it will jump to the beginning of this block of blue A's and execute the attack code. Yikes! + Most of the time, this kind of exploit begins with trial and error in which a malicious user tries to crash your program. If he succeeds in crashing your program, it means that your program didn't expect his input. He can use this information to learn more about what your program accounts for and what it doesn't account for. ### Teaser (41:00-43:00) + Recall that emptying the trash on your computer doesn't actually delete the contents of the file, it only erases the information which links the filename to the location of the file's contents. + Although the computer has forgotten where the file contents are stored, we can come to its aid by scanning the whole disk for chunks of memory that have the JPEG file pattern. + Question: why does the computer tell you that you have more space when you empty your trash? Mostly because it's lying. But, in reality, you do have more space because the next time you need room for a file, it can overwrite the memory that was previously used for your now-deleted files.