00:00:00,499 --> 00:00:02,810 ZAMYLA CHAN: Let's start by creating our dictionary. For each word found in the dictionary text file, we'll want to store it in the dictionary's data structure. And it's up to us to create a representation for our dictionary, whether it be a linked list, hash table, or try. If you choose to implement your dictionary as a try, then you can fast forward in this video, because we're going to be talking about linked lists and hash tables now. Hash tables can be thought of as an array of buckets, where a hash function will give us the bucket that a given key or value belongs to. Let's look at an example. Over my years working with CS50, I've accumulated a lot of CS50 Stress Balls. So much so that I'm having trouble organizing them all. I've numbered them in the order that I've received them. So, in total I have 10 CS50 Stress Balls. Now, I want to tidy up my room. So I put them all inside of a box. Say I wanted to reminisce about my very first CS50 fair. Then I would go to my box of stress balls and I would sift through to find the ball with the number 1. Now, I could be very lucky and I could find this ball right away. Or I could spend some time sifting through this box to find the right ball. Instead, I decide to make this process a little bit easier. And instead of putting all of my stress balls into one box, I put them into two, one for odds and one for evens. So now I have a reliable method of sorting my stress balls into two groups. So now, say I wanted to find the stress ball numbered 1. Then I would be able to discard the evens box and just look in the odds box. And with half as many balls as before, then it's more likely that I'll find that ball a lot faster. What I essentially did, here, was create a hash table. In this case, my hash table is small with only two buckets, my two boxes. And my hash function, in this case, checks the number of the ball. And if that number is odd, then that means that the ball belongs in the odd box, and otherwise in the even box. So, whenever I'm sorting my stress balls or looking for it, I know which box it'll belong to. Now, I hate to break it to you, but I have way more than just 10 CS50 Stress Balls in my closet. So, let's say I'm looking at just my first 20. Well then, I can do the same process as before, and sort all of the odds into the odd box and all of the evens into the even box. But now I'm faced with the same problem as before. With 10 stress balls per box, it might get a little bit more difficult to find the ball that we're looking for. So, I might decide that instead of just having two boxes, I'm going to sort my stress balls into four. With this organization, I have one box for balls numbered 1 through 5, another for 6 through 10, another for 11 through 15, and then my last box for 16 through 20. So, here I have a nicely balanced hash table. By that, I mean that I have the same number of keys-- or in this case, stress balls-- per box-- bucket. You can see how it might be unbalanced if instead of having these four boxes, like so, I had created three boxes, one for balls 1 through 5, another for 6 through 10, and then another for 11 through 20. So what we've learned is that a hash table is simply a collection, an array, of buckets. And the number of buckets that we have can change, as does the hash function, depending on what we're dealing with. Now in this case, we don't have physical boxes to sort our stress balls into. What we're going to deal with is linked lists. A hash table is simply an array of linked lists. Now you'll see this slide pop up a lot, because it's a really important concept to remember. Linked lists are comprised of nodes. The first component of the node is its value. Now the data type of this value depends on what kind of node you're dealing with. In the previous example with my stress balls, then the value would be an integer. But in this case when we're dealing with words, then the value is going to be a different type, most likely a char array. The second component of a node is a pointer to another node. This is important because in a linked list, we have our first node, which then contains a pointer to the node following it, which contains a pointer to the node after that, so on and so forth, until the very last node in the list contains a pointer to null. So when we're dealing with linked lists, it's very important not to lose any links. So I find it very useful to draw schematics like these, or even physically take hold of things and make sure that you don't lose any links prematurely. So how do we incorporate linked lists into hash tables? Well, as I said before, hash tables are just arrays of linked lists. So here, in this case, I have an integer hash table. So every node contains an integer value as well as a node pointer. And I have four buckets. Try and figure out if you can see what the hash function for this might be. Now, say I have three values that I want to insert into my hash table. Then what I'll do is I'll take each node at a time, hash the value, figure out which bucket or which linked list it belongs to, and insert it there. Now that we've got a nice grasp of these concepts, let's take a look at some code. So how do we go about creating nodes? Well, we'll start by allocating enough space in memory to store our node. To do that, we'll use the malloc function. Malloc will need to know how many bytes to allocate in memory. So then we'll use the size of function to calculate the size in bytes of a node. Once we have that, malloc will give us the pointer, which indicates a spot in memory where it's reserved for our char array and a pointer to a node. So the key takeaway here is that whenever we're creating a node, we actually just want to malloc a node pointer. So if, say I wanted to create another node, node 2, then I would do the same thing again. Allocating another spot in memory for that node. Now I want to set the values for these nodes. So node 1 and node 2 are actually node pointers. So I'll use arrow notation in order to access the word variable. And then I can set those two words, hello world. These are just two independent nodes. How do I link them? Well, I'll simply access the next variable for node 1 and point that to node 2. Great. So now we have a linked list of size 2. Remember that a hash table is simply an array of linked lists, where each element of the array is a node pointer. So your hash table might look something like this. You define your struct for a node. And then you create an array of nodes, the same way that you would create an array of any other variable type. Here, my hash table has 50 buckets. But depending on the hash function you decide, it could be a different number. Now that we've created our dictionary as a hash table, then it's time to populate the hash table with nodes containing words found in the dictionary. How do we do this? Well, let's look at the fscanf function. This piece of code will take the dictionary file-- in this case it's called file-- look for a string, and then put that string into a variable called word. And it will execute this loop until the end of the dictionary file is reached. Now what goes inside of this loop? Well, we know that for every word that we scan, we want to malloc a node for it. Now whenever we create a node, we always want to check to make sure malloc succeeded. But what happens if you run out of memory? Well, malloc will return null. So whenever we make a new node, you always need to check to make sure that the pointer to that node doesn't return null. If it does, then you need to unload your dictionary and return false so that speller will quit. But if it succeeds, then you can proceed and copy that word into your node by using the function string copy. OK. So now we have a new node with a value that we scanned in from the dictionary. Let's go on and insert that into a linked list. Remember I said that it's really important to maintain all links in your linked lists, and make sure not to let go of any links until you know it's safe. Let's look at an incorrect way of doing this. Say I wanted to insert my new node into the very beginning of my list and put that first. Well, if I have a node pointer, called head, pointing to the very first element in my linked list, then I might say, well, head now points to new node. But this won't do because then I lose the link to the A node. And then I lose the entire list. Let's look at a correct way to do this. Knowing that we're inserting new node into the very beginning, and that becomes the first value, then that means that new node should point to whichever node was previously the first value in the list. So let's point new node next to the pointer that head's pointing to. Now that we have two pointers holding on to that A node, we can safely reassign the head pointer to new node. And there we have a linked list. So now we know how to take a word from the dictionary and take that word and put it into a node, and take that node and put it into a linked list. So next we need to decide, well, how do we know how to put which word in to which linked list? Because remember, a hash table is simply an array of linked lists. Well, how did we know which box to put which stress ball in? We used a hash function. The hash function, in this case, will take a string. And from that string it will calculate an index, the index corresponding to which bucket-- which element in the hash table-- that word belongs to. Now hash functions all have to be deterministic. What I mean by this, is that the same value needs to map to the same bucket every time. So in the stress ball example, when I had the number 1 stress ball, I knew that with my hash function every time it would indicate that I'd have to put it in the odd box. So same with our spell checker. Every single time that I hash the word apple, it should give me the very same index, whether or not I'm putting it into the dictionary for the first time or checking it later. Once you've made your hash function, then you can hash your words. The word that you want to hash is contained within your new node, arrow, word. So hashing that will give you the index of whichever bucket in your hash table you should put that word in. From there, you insert into the linked list. How do you do that? Well, I've already given you the tools to do so. Knowing that a hash table is simply an array of linked lists and that each element of an array is a node pointer, pointing to the very first node in that linked list. Another way to represent the dictionary data structure is with a try. Now in tries, we're also going to make them up with nodes. But in this case, the nodes are going to be different than the nodes that we used in the hash table. Every node for a try contains an array of node pointers, one for every single letter in the alphabet and the apostrophe character. Notice, here, how I included the apostrophe character with a backslash in front of it? That way I allow the apostrophe to parse correctly. Now each element in that array will point to another node. And then if that is null, then that means that the letter that that array belongs to is not the next letter of any word in any sequence in the dictionary. Now just because a letter is contained within the path of the word doesn't mean that it's the end of a valid word in the dictionary. So every node will also indicate whether it's the last character of a valid word in the dictionary. So how do we represent this in a struct? Well, we'll have our Boolean indicating whether the node is a word or not. And then I'll have my array of node pointers. In this case, I'm calling it children. Finally, in order to keep track of the top or the beginning of your try, we're going to have another node pointer, in this case, called root. Scanning through the dictionary one word at a time will iterate through the try as follows. Given that each element in the children array corresponds to a different letter, we'll check the value at the appropriate index of children. Now, if that node is null, then we'll malloc a new node for it and have children at that index point to it. But if it's not null, then that means we've already created the node. So we can then move to that node and continue the process. Finally, once we reach the end of the dictionary word, we'll set the is word Boolean to true. Let's go through an example, storing fox into an empty dictionary. The word fox begins with the letter f. So for the letter f, we're going to malloc a new node. We're going to put this at children on index 5. That's because f is the sixth letter of the alphabet. So once we've done that, then we proceed to the next letter. Well, o comes after f. So then from that f node, we'll then access the corresponding node. In this case, children at index 14. And then we'll do the same thing for x. Finally, here I indicate, by shading the box in green, that is word is set to true. Now that we've loaded fox into our dictionary, let's go to the next valid word in our dictionary, foo. Now foo starts with the letter f. We've already created a node for that. So we don't need to create a new one. We progress to the second letter of this word, o. And we've also already created an o node that follows the first f node. So then the only node that we need to create is the last o in the word foo. Remember that whenever you make a new node, you have to allocate space for it in memory and make sure to check whether or not malloc returns null. Now if it doesn't, and if you succeed in creating a node, then you've reached the end of the word. So we can set is word to true. Let's go on and enter another word into our dictionary, dog. I start with the letter d. And d is the fourth letter of the alphabet. So then I access children at index 3. From there, I create another node for the letter o. Notice that this node is going to be a new node, because the letter o in dog follows the letter d, not the letter f as we previously created. So then, completing the chain for the word dog, I then create another node for the letter g, and then set is word to true. Now let's say, the next word that I need to enter is the word do. Well, do is a substring of dog. So I won't need to create any more nodes. What I'll need to do is navigate down to that o and set is word to true. So following this chain, you could either end a valid word at do or advance to dog. And there you have it. Whether or not you've implemented as a hash table or a try or perhaps a linked list, you've created your dictionary.