## Week 6 Wednesday

Andrew Sellergren

### Announcements and Demos (0:00-6:00)

+ This is CS50.
+ There are now over 100,000 students registered for CS50x, which debuted this Monday! Head to [x.cs50.net](http://x.cs50.net) if you're curious. The material is largely the same, albeit a few weeks delayed and also translated into a number of different languages. CS50x will culminate in April with the CS50 Expo, an online version of the CS50 Fair. If you're interested in helping out the CS50x effort as a CA, let us know at [x.cs50.net/ca](http://x.cs50.net/ca).
+ Once you've taken CS50, you'll have the opportunity to *wear* CS50. Soon we'll ask you to submit your designs for the newest edition of CS50 apparel, which will then be available for purchase on [store.cs50.net](http://store.cs50.net).

### Debugging (6:00-20:00)

+ Hopefully, you've already made use of GDB to debug your programs. GDB allows you to step through your code line by line, printing variables and inspecting the stack along the way. You can also set breakpoints so that you can pause execution of your program at the point where you suspect there's a problem.
+ `style50` is a tool we introduced to clean up the style of your code automatically.
+ Another tool available to you is Valgrind, which helps diagnose memory errors. The command to run it is as follows:

        valgrind -v --leak-check=full ./a.out

+ `a.out` should be replaced with the name of your compiled program. The `-v` means "verbose," to get the maximum possible output. `--leak-check=full` asks Valgrind to be rigorous in checking for all memory leaks. Note that command-line flags are passed to programs with a single dash if they are one letter (e.g., `-v`) but with two dashes if they are longer than one letter (e.g., `--leak-check`).
+ Here's a sample program that we'll run through Valgrind:

        /****************************************************************************
         * memory.c
         *
         * David J. Malan
         * malan@harvard.edu
         *
         * Demonstrates memory-related errors.
         *
         * problem 1: heap block overrun
         * problem 2: memory leak -- x not freed
         *
         * Adapted from
         * http://valgrind.org/docs/manual/quick-start.html#quick-start.prepare.
         ***************************************************************************/

        #include <stdlib.h>

        void f(void)
        {
            int *x = malloc(10 * sizeof(int));
            x[10] = 0;
        }

        int main(void)
        {
            f();
            return 0;
        }

+ What does `f` do? First, it declares a pointer `x`. The pointer itself will take up 32 bits of memory, but it will point to a chunk of memory that is 10 times the size of an `int`, i.e., 40 bytes, which we've allocated using `malloc`. Note that we have to keep track of the size of this chunk of memory ourselves; there's no programmatic way to know where it ends.
+ Even though `x` is a pointer, we can use the array-style bracket notation to access the element at a given index, as we do by writing `x[10]`. The problem is that `x` doesn't have an index 10: we only allocated 10 `int`-size chunks, so 9 is the largest valid index. Interestingly, this program may not outright fail. Sometimes the operating system will be okay with us using one more chunk of memory than we asked for, and sometimes it won't. Regardless, this is bad practice.
+ Another bug in this program is that we're not freeing the memory we allocated. To fix this, we would actually need to call `free(x)` inside of `f`, because `x` will be out of scope once we return to `main`.
+ In a program of this size, these memory-related bugs may be obvious. In programs of much larger size, they may not be. Thankfully, Valgrind will help us identify them. When we run Valgrind on this program, we get the following messages (among a lot of other output):

        HEAP SUMMARY:
            in use at exit: 40 bytes in 1 blocks
          total heap usage: 1 allocs, 0 frees, 40 bytes allocated

        LEAK SUMMARY:
           definitely lost: 40 bytes in 1 blocks

+ 40 bytes sounds familiar: this was the size of the chunk of memory we allocated for `x`.
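+ Putting both fixes together—indexing no higher than 9 and freeing `x` before it goes out of scope—a corrected `f` might look like the sketch below. Returning an `int` status and checking `malloc`'s return value are our additions, not part of the original program:

    ```c
    #include <assert.h>
    #include <stdlib.h>

    // Corrected version of f: stays within bounds and frees its memory.
    // Returns 0 on success, -1 if malloc fails.
    int f(void)
    {
        int *x = malloc(10 * sizeof(int));
        if (x == NULL)
            return -1;

        // 9, not 10, is the largest valid index in a block of 10 ints
        x[9] = 0;

        // free the block here, since x goes out of scope when f returns
        free(x);
        return 0;
    }

    int main(void)
    {
        assert(f() == 0);
        return 0;
    }
    ```

    Run under Valgrind, a version like this should produce no "definitely lost" bytes and no invalid writes.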
+ Another message we get from Valgrind for this buggy program is the following:

        1 errors in context 1 of 2:
        Invalid write of size 4
           at 0x80484A0: f (memory.c:21)
           by 0x8048C1: main (memory.c:26)

+ What are these numbers that begin with "0x"? They're hexadecimal (base-16) numbers that represent memory addresses. They're a little cryptic, but at least the "Invalid write of size 4" message indicates that we tried to write to 4 bytes of memory that we don't own, i.e., when we tried to access index 10 of `x`.

### Linked Lists (20:00-37:00)

+ We introduced linked lists as a dynamically sized data structure. In contrast, arrays have a fixed size. Although we can actually grow arrays using `realloc`, this is a very expensive operation.
+ Linked lists have the advantage of being dynamic in size. We pay for this, though, by requiring more memory to store the same amount of data. To store a linked list of integers, we need memory for each integer as well as memory for the pointers that connect the linked list's nodes.
+ Another disadvantage of linked lists is the difficulty of accessing data. We cannot use binary search on linked lists, for example.
+ Recall the definition of a linked list node:

        /****************************************************************************
         * list1.h
         *
         * David J. Malan
         * malan@harvard.edu
         *
         * Defines a node for a linked list of integers.
         ***************************************************************************/

        typedef struct node
        {
            int n;
            struct node* next;
        }
        node;

+ The node contains an integer to store the actual data of interest as well as a pointer to another node. By convention, we write this definition in a header file named with the `.h` extension. This keeps our `.c` files more organized.
+ The actual linked list is defined as a single pointer to a node, declared as a global variable at the top of the file:

        // linked list
        node* first = NULL;

+ Generally, global variables should be avoided.
+ In this case, though, a global variable allows us to avoid passing the linked list to every single function. Since we know that all of the functions in this program will be manipulating the linked list, we're better off declaring it globally.
+ When we compile and run `list1.c`, we get a prompt giving us options to delete, find, insert, and traverse the list. To test it out, we can choose the insert option a few times and add the numbers 9, 17, and 22 to our linked list. How is this insertion actually happening? Here's the definition of the `insert` function:

        /**
         * Tries to insert a number into list.
         */
        void insert(void)
        {
            // try to instantiate node for number
            node* newptr = malloc(sizeof(node));
            if (newptr == NULL)
                return;

            // initialize node
            printf("Number to insert: ");
            newptr->n = GetInt();
            newptr->next = NULL;

            // check for empty list
            if (first == NULL)
                first = newptr;

            // else check if number belongs at list's head
            else if (newptr->n < first->n)
            {
                newptr->next = first;
                first = newptr;
            }

            // else try to insert number in middle or tail
            else
            {
                node* predptr = first;
                while (true)
                {
                    // avoid duplicates
                    if (predptr->n == newptr->n)
                    {
                        free(newptr);
                        break;
                    }

                    // check for insertion at tail
                    else if (predptr->next == NULL)
                    {
                        predptr->next = newptr;
                        break;
                    }

                    // check for insertion in middle
                    else if (predptr->next->n > newptr->n)
                    {
                        newptr->next = predptr->next;
                        predptr->next = newptr;
                        break;
                    }

                    // update pointer
                    predptr = predptr->next;
                }
            }

            // traverse list
            traverse();
        }

+ First, we allocate memory for a new node. We don't even need to know how many bytes a `node` is, since we can use the `sizeof` operator to figure it out. As always, we have to check the return value of `malloc` for `NULL` in case the memory couldn't be allocated for some reason.
+ Next, we prompt the user for the integer to store in the new node. We store it in `n` using the arrow syntax. Recall that writing `newptr->n = GetInt()` is equivalent to writing `(*newptr).n = GetInt()`.
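+ To make the same logic easier to test outside of `list1.c`, here's a sketch that takes the value as a parameter instead of prompting with `GetInt`, and passes the head in and out explicitly rather than using the global `first`. The name `insert_sorted` is ours:

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stdlib.h>

    typedef struct node
    {
        int n;
        struct node* next;
    } node;

    // Same four cases as list1.c's insert: empty list, new head,
    // middle, and tail, skipping duplicates. Returns the new head.
    node* insert_sorted(node* head, int value)
    {
        node* newptr = malloc(sizeof(node));
        if (newptr == NULL)
            return head;
        newptr->n = value;
        newptr->next = NULL;

        // empty list, or number belongs at list's head
        if (head == NULL || value < head->n)
        {
            newptr->next = head;
            return newptr;
        }

        node* predptr = head;
        while (true)
        {
            // avoid duplicates
            if (predptr->n == value)
            {
                free(newptr);
                return head;
            }
            // insertion at tail
            else if (predptr->next == NULL)
            {
                predptr->next = newptr;
                return head;
            }
            // insertion in middle
            else if (predptr->next->n > value)
            {
                newptr->next = predptr->next;
                predptr->next = newptr;
                return head;
            }
            predptr = predptr->next;
        }
    }

    int main(void)
    {
        // inserting 17, 9, 22 yields the sorted list 9 -> 17 -> 22
        node* head = NULL;
        head = insert_sorted(head, 17);
        head = insert_sorted(head, 9);
        head = insert_sorted(head, 22);
        assert(head->n == 9 && head->next->n == 17 && head->next->next->n == 22);
        return 0;
    }
    ```

    Returning the head from the function is what lets the "new head" case work without a global variable.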
+ Inserting into a linked list can be divided into four cases. If `first` is `NULL`, then the list is empty, so we point `first` at the new node and we're done. To insert at the head of the list, we point the new node at whatever `first` is pointing to and then point `first` at the new node:

        // else check if number belongs at list's head
        else if (newptr->n < first->n)
        {
            newptr->next = first;
            first = newptr;
        }

+ To insert at the tail of the list, we point the last node at the new node:

        // check for insertion at tail
        else if (predptr->next == NULL)
        {
            predptr->next = newptr;
            break;
        }

+ We don't need to point the new node at `NULL`, since we did that earlier when we initialized `newptr`.
+ When traversing this list, we need to enlist the help of a pointer (`predptr`) which will keep track of the node before the one we're considering, since we can't move backwards in a singly linked list.
+ The last case, inserting in the middle of the list, is a little tricky, so we'll wave our hands at it for now. Here it is in code, though:

        // check for insertion in middle
        else if (predptr->next->n > newptr->n)
        {
            newptr->next = predptr->next;
            predptr->next = newptr;
            break;
        }

+ In all of these cases, the order of operations is important to avoid orphaning some part of the list.
+ `list2.c` gives example code for creating a linked list of structs rather than integers.

### Hash Tables (37:00-53:00)

+ To achieve our goal of constant-time lookup, we'll turn our attention to the data structure called a hash table. If we want to store strings like "Alice" and "Bob" in a hash table, we need a heuristic to translate them into an integer. That integer corresponds to the index in the array where we'll store the string. This heuristic will be our *hash function*.
+ One hash function we could choose is to take the first character of the string and convert it to its numeric position in the alphabet. Thus, "Alice" would become 0, "Bob" would become 1, and so on. There's a problem, though.
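+ This first-letter heuristic is simple to write down. A minimal sketch, in which the function name `hash` and the uppercasing are our choices:

    ```c
    #include <assert.h>
    #include <ctype.h>

    // Map a string to the alphabetical position of its first letter:
    // "Alice" -> 0, "Bob" -> 1, ..., "Zoe" -> 25.
    int hash(const char* s)
    {
        return toupper((unsigned char) s[0]) - 'A';
    }

    int main(void)
    {
        assert(hash("Alice") == 0);
        assert(hash("Bob") == 1);
        return 0;
    }
    ```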
+ What if we want to insert both "Alice" and "Anita" into this hash table? One solution would be to put "Anita" in the next available slot. This technique is called *linear probing*. Unfortunately, that means our worst-case lookup time will be `n`, since the next available slot might be the very last one in the hash table.
+ How often does this kind of collision occur? Last time, we asked this question in a slightly different way:

  > In a room of *n* CS50 students, what's the probability that at least 2 students have the same birthday?

+ To answer this question, we'll consider the opposite: what's the probability that no 2 students have the same birthday? If there's only 1 student in the room, then the probability that no 2 students have the same birthday is 1. If there are 2 students in the room, then there are 364 possible birthdays out of 365 which the second student could have that would be different from the first student's, so the probability that no 2 students share a birthday in a room of 2 is 364/365. A third student's birthday must differ from both of the first two, which happens with probability 363/365. And so on. To get the total probability, we multiply all of these probabilities together. You can see this math here, courtesy of Wikipedia:

  ![The birthday problem.](birthday_problem.png "The birthday problem.")

+ This is much easier to interpret in the form of a graph, however:

  ![The birthday problem, graph.](birthday_problem_graph.png "The probability that two people will share a birthday.")

+ Notice that the probability passes 0.5 when there are only 23 students in the room. By the time we consider the case where there are 58 students in the room, the probability is almost 1. The implication for hash tables is that there are going to be collisions.
+ *Separate chaining* offers an attractive solution to handling collisions. In this technique, our hash table is actually an array of linked lists.
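+ Such an array of linked lists might be sketched as follows. The bucket count, the sum-of-characters hash, and the `LENGTH` constant here are our own illustrative choices, not code from lecture:

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stdlib.h>
    #include <string.h>

    #define LENGTH 45   // assumed longest word we'll store
    #define BUCKETS 31

    typedef struct node
    {
        char word[LENGTH + 1];
        struct node* next;
    } node;

    // hash table: an array of linked lists
    node* table[BUCKETS];

    // toy hash function: sum of characters, modulo the bucket count
    unsigned int hash(const char* word)
    {
        unsigned int sum = 0;
        for (int i = 0; word[i] != '\0'; i++)
            sum += (unsigned char) word[i];
        return sum % BUCKETS;
    }

    // constant-time insert: prepend at the head of the bucket's list
    void insert(const char* word)
    {
        node* newptr = malloc(sizeof(node));
        if (newptr == NULL)
            return;
        strncpy(newptr->word, word, LENGTH);
        newptr->word[LENGTH] = '\0';
        unsigned int i = hash(word);
        newptr->next = table[i];
        table[i] = newptr;
    }

    // lookup walks one bucket's chain; in the worst case, the whole chain
    bool lookup(const char* word)
    {
        for (node* ptr = table[hash(word)]; ptr != NULL; ptr = ptr->next)
            if (strcmp(ptr->word, word) == 0)
                return true;
        return false;
    }

    int main(void)
    {
        insert("Alice");
        insert("Anita");
        assert(lookup("Alice") && lookup("Anita") && !lookup("Bob"));
        return 0;
    }
    ```

    Note that `insert` prepends at the head of a bucket's list, which is exactly the constant-time, unsorted insertion described above.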
+ With separate chaining, when an element needs to be inserted into the same location as another element, we simply insert it into that location's linked list. If we don't care whether the list is sorted, we can insert the element in constant time just by putting it at the head. Of course, when we want to search for that element again, we're going to have to traverse the entire linked list in the worst case. If we consider the number of locations in our hash table to be *m*, then the lookup time for a hash table that uses separate chaining is *O*(*n*/*m*). *m* is a constant, though, so the lookup time is really just *O*(*n*). In the real world, however, *O*(*n*/*m*) can be much faster than *O*(*n*).
+ How do we implement a hash table in code? We might declare an array of linked lists like so:

        node* table[31];

+ If we wanted each node to store a string, we might define a node like so:

        typedef struct node
        {
            char word[LENGTH + 1];
            struct node* next;
        }
        node;

+ Question: why are we using an array for `word` instead of a `char *`? Just so we don't have to use `malloc`. Instead, we define an array which has enough indices to store the longest possible word.

### Trees and Tries (53:00-64:00)

+ More generally, we can talk about a data structure called a tree, which defines nodes that have parents, children, and siblings:

  ![A tree.](tree.png "A tree.")

+ This diagram shows a tree which has varying numbers of children for each node. But if we are a little more strict about how many children a node can have, we create a structure that's easily searchable. Consider a *binary search tree*:

  ![A binary search tree.](bst.png "A binary search tree.")

+ In a binary search tree, a node can have at most two children. We could define this in code like so:

        typedef struct node
        {
            int n;
            struct node* left;
            struct node* right;
        }
        node;

+ What's interesting about a binary search tree is that a node's left child is less than it and a node's right child is greater than it.
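+ That ordering property is exactly what makes searching fast. A recursive search over such nodes might be sketched like this (the function name `search` is ours):

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct node
    {
        int n;
        struct node* left;
        struct node* right;
    } node;

    // Search a binary search tree: go left for smaller values,
    // right for larger ones, stopping at a match or a NULL child.
    bool search(node* root, int value)
    {
        if (root == NULL)
            return false;
        else if (value < root->n)
            return search(root->left, value);
        else if (value > root->n)
            return search(root->right, value);
        else
            return true;
    }

    int main(void)
    {
        // a tiny tree:   7
        //               / \
        //              3   9
        node left = {3, NULL, NULL};
        node right = {9, NULL, NULL};
        node root = {7, &left, &right};
        assert(search(&root, 3) && search(&root, 9) && !search(&root, 4));
        return 0;
    }
    ```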
+ Searching a tree like this is trivial, then: if the current node's value is less than the value we're looking for, we move to the right child; if it's greater, we move to the left child.
+ A *trie* is another type of tree structure. The word "trie" comes from the word "retrieval," but is usually pronounced like "try." For our purposes, the nodes in a trie are arrays. We might use a trie to store a dictionary of words, as this diagram suggests:

  ![A trie.](trie.png "A trie.")

+ In this trie, each index in the array stands for a letter of the alphabet. Each of those indices also points to another array of letters. In the above, we're storing the names of famous scientists, e.g., Mendel, Mendeleev, Pavlov, Pasteur, and Turing. The Δ symbol denotes the end of a name. We have to keep track of where words end so that if one word actually contains another word (e.g., Mendeleev and Mendel), we know that both words exist. In code, the Δ symbol could be a Boolean flag in each node:

        typedef struct node
        {
            bool is_word;
            struct node* children[27];
        }
        node;

+ Note that we're not actually storing the word "Turing" in the trie; we're simply storing a series of pointers and a Boolean value.
+ One advantage of a trie is that insertion and search times are unaffected by the number of elements already stored. If there are 150,000 elements stored in the trie and you want to insert the value "Alice," it still takes just 5 steps, one for each letter. We might express this runtime as *O*(*k*), where *k* is the length of the longest possible word. But *k* is a constant, so we're actually just talking about *O*(1), or constant-time insertion and lookup. What's the catch? A trie takes up a large amount of memory.

### Teaser (64:00-67:00)

+ Before long, we'll transition to talking about web development, including HTML, PHP, and JavaScript. As a brief teaser, enjoy this trailer for [Warriors of the Net](http://www.youtube.com/watch?v=Cb8b1RMX6XY).