[ Silence ] >> Alright everybody. Welcome to Walkthrough 6 for pset 6: Misspellings, which is by far the culmination of your C experience in CS50 and perhaps your entire CS50 experience. So I would highly recommend you checkout my play list of epic music before you get started on this pset, you're gonna do much better I promise. So today I would just gonna go through the distribution code namely speller.c 'cause it looks really scary and intimidating, then we're gonna go through all the different date structures you're gonna need to implement these psets. So Linked List and Hash Tables and then Tries as an alternative to those two things. So inside of speller.c, it's basically doing four main things for you. The first it needs to do is it needs to load that dictionary file. So remember, this is some file that we've given you and it just contains correctly spelled words, one per line. And so this forms a dictionary because this is--if you were to look up some word in dictionary to see if it was correctly spelled, if it's not found in dictionary then it must be spelled incorrectly. So we're gonna now open up some text file, we have some great works of literature like the Austin Power screenplay. I was gonna through that and check every single word inside of that text and see which word is spelled correctly and which are not spelled correctly. So to do that, it's gonna call this function check on each of the words. You also need to implement a function called size. Now, that's gonna do is return how big is your dictionary, how many different words we spell checking against. And finally, unload, which is going to free up all the memory that you use because we're really good programmers and we would never do something like weak memory. So inside of dictionary.c is where those four functions are defined and that's where you're going to implement them. So just at a high level, let's see what's going on. You're loading that dictionary, you're going to every word, you're checking it against that whole dictionary, and if it's found in memory, if it's found on that dictionary file then it has to be spelled correctly because it matches some correct spelled word. And if it's not done on a dictionary file, well, then we can just assume it's not spelled correctly because we have a list of every correctly spelled word. So let's just walkthrough speller.c which you do not have to modify because you're only changing dictionary.c, but the speller.c is where the main is contained inside of your program. So notice here, we've just defined some constant and this is our default dictionary. So this dictionary is really, really big, when you're first starting of with the pset, it might not be so wise to keep this huge dictionary just because there're so many words there. What you can do is instead just create your own dictionary that have three or four words that you know are spelled correctly, save it somewhere in your appliance and then just change this constant or you can even just supply it when you actually run the program but more on that later. So here, here's the usage of our program, we're gonna run speller and then this first argument is optional, so this is the path to some dictionary file. If you don't specify it, then it's just gonna use this define constant up here, and if you do specify it, then you can use something other than that dictionary. So after that, it's going to be what text are you spell-checking against. So just like you might not wanna start off with the King James version of the bible, you might wanna have some smaller text file that you wrote yourself containing words that are either in your small dictionary or not and start testing against that. And once you think your implementation is working correctly, then starts scaling up to epic works of literature. So you'll see a lot of these timing functions because what we're doing is timing how fast it takes your program to spell-check each piece of literature. So we'll see a lot of that on the struct r usage is just for resource usage. So what time is it when you started to check words and what time is it now as just the means of calculating of how much time you took. So now, we're just determining the dictionary. This argument is optional. So if it doesn't exist then we'll just use that to find constant. And so now we're calling the first of those four functions. Here, we're calling load. So this is just gonna return a Boolean. It's just gonna return true. The dictionary load it correctly where we're recalling load on this dictionary text file. And you'll notice that we're specifying not some file pointer but just the path to the dictionary files. So inside of load you, you need to take care of opening that file up and reading it. So if for some reason, the load failed, we're just gonna quit. And now you'll see. Now we're calculating how much time it just took to load your dictionary. And so this speller.c is gonna save that and you don't have to worry about that. So now we're gonna do the same think of the text, try to open up. And now, here's where we're gonna start iterating through the text file. So int c equals f get c on fp. So you noticed that fp is what we called our file pointer when we opened up the text to read. So up here, we said file star fp is opening up the text file that you're reading through and spell-checking. And what f get c is going to do, it is going to go through that file character by character. And so we're gonna iterate until we reach the special character "eof" that signifies I've reach the end my file. So there's nothing else to read. And so that's the condition on which I'm going to stop reading through my file. And after each iteration of the loop, I'm gonna call f get c again. So I've move on to the next character. So a little untraditional looking for loop, but all this is doing is looping through every single character inside of the file that you are spell-checking. So we only want to allow alphabetical characters and apostrophes. This is just some guideline, some restriction that's set forth in the psets back. We say you don't have to worry about numbers inside of words. And so, we have this array, word. You notice it's initialized through some fixed link, + 1. So we're doing here is we basically initializing it to the biggest word possible. And so that way, we know we have enough space to hold any word that we might read in. So every time you read a character, we're inserting it into this word array. And then if we increment index, that means the next letter we read is going to go into the next position into this word array. So that way, we're filling up this array character by character with all of the characters inside of a single word. So if something is too long, then we say, well that's definitely not a word because it's bigger than the largest word in English language. So we're gonna is we're just gonna finish up reading that word and reset our index. So when we move on to the next word, we're gonna be starting at the beginning of the array again. So same thing here. Well, if we ever encounter a number inside of a word that we're spell-checking, we know that that can't be a word, so we're not gonna bother spell-checking it. We're just gonna consume the rest of the string, so skip over every other letter that follows that number until we hit a new world. And so, we're resetting the index again so that when we hit the next letter--when we hit the first letter in the next word, we're back to beginning of this word array. So now, if we have found some whole word, we're going to terminate it with the null terminator. So whenever--in C, we find this null terminator inside of a string, we know that the string ends there. Even if there's some stuff after this null terminator in the array, because we find this null terminator this says "this is where the string actually ends, so don't bother going looking pass this point." So at this point, we're gonna say, "Well, we've read one more word inside of our text file, and now we're calling this check function." And so this is a function that, again, you need to read--write yourself and this is gonna return true or false depending on whether or not this word which we just saw is coming from the file that we're spell-checking, we've read in every character. And so now, we're spell-checking this word against the dictionary. So this needs to return true or false depending on whether or not this word is spelled correctly. So again, this get resource usage is just timing, how long did it take for me to run this check function and I'm gonna keep track of how long it takes to run each one of them in this TI check variable. So now, if the word is misspelled, we're just gonna print it out. I'm gonna say, "Well, we found some other misspelled word which we're gonna print out at the end of our program." So then after that, we must have just spell-check the words, that means we finish reading it, so again, we're gonna reset our counter in the word array back to the beginning. SO next time we read a character, we're gonna be at the beginning of my word array 'cause where at a new word. So after that, we're gonna close the text file, the one we're spell-checking 'cause we don't need it anymore. And now we're calling our third function, size. And so this just needs to return an integer of how many words are inside of your dictionary. So again, just timing how long it took you to do that. And finally, we're calling unload. So after this function is done, everything that you loaded into memory, should now be gone and we're gonna--we can check that, with the check 50 command that you saw in the pset. It'll basically tell you, "Well, you left some things in memory so you must not have unloaded everything yet. So after we calculate how long that took, we're just gonna print out some stats, how many words are misspelled, et cetera, et cetera, and if you wanna know if yours is working correctly, you can always compare your implementation to the stat implementation. If they found 600 words misspelled, then you found 10,000, then you might have a problem somewhere and then that's it. So, any questions on how we are using speller? Yup? [ Inaudible Remark ] >> So how the speller.c knowing you're at the end of the word? >> Yeah. >> Yeah. So-- [ Inaudible Remark ] >> So let's see. So if it's a letter, that means we can't be at the end of the word. So if it's a digit, that means that we're wrong, and if it's anything but a letter or a digit, that must be the end of the word, so basically a space or something else, so that's just kind of the last condition, if it's not a letter or number, we must be at the end so then we can move on to the next word. Yup? [ Inaudible Remark ] >> So where that the sample text we gave you? So they should inside of your appliance, inside of--[inaudible] the CS50/pset6/text to something, it should be in the pset specification, but they just basically a bunch of text files. I can look that up in the spec after if you like, but we should have digits on your appliance after you run yum update and all that stuffs. >> Are there questions on the speller.c? Yup. >> When you're updating a word counter, is it words that [inaudible] you have like whole words? You just [inaudible] every time? >> So does it [inaudible] to make a--do a word array every time? So you'll notice that we're just reusing this single word array. So we only create one word array, but because we're always accessing at this index, we're just gonna keep changing where we are inside of this array and overwrite what was previously there. So we can use the same array and we're filling up this same array with every word in the text that we're spell-checking. >> Then why do we need word ++? >> So why do we need words ++? So that's a separate variable. So remember the array is called word and this one is called words. And so all this is doing is it keeping track of how many words you spell-checked just so we can print it out down there. So just for stats. Yup? >> Can you like clarify again how [inaudible] making our own dictionary or? >> So why are we keeping track of how many words in the dictionary? [ Inaudible Remark ] >> So you--So you can supply speller with any dictionary you want, this is the default dictionary, this whole CS50 pset6 dictionary is large. So this is just one big text file. So we want our program to be able to tell us, well, how big is this text file, how many words are contained inside of it. So if you were to specify something else, you can specify a smaller or a bigger dictionary and that size still return how many words are inside of it. >> How do we get like different dictionary? >> So how would you get a different dictionary? So this--all this is, is a text file. It's text file that each line has just a correctly spelled word. So if you can just make your own by saying, well, here's the word apple, here's the word banana, and that's it, every other word is misspelled as far as I'm concern. And this just happens to have most of the word in the English language. Other questions on speller? Okay, so now let's now move on to how we can actually implement those four functions. So the first thing we need to review is linked lists. So you've seen this in lecture. And a linked list is essentially a series of nodes. In each node can have one or more values. In this case, I have a linked list of integers. Each one of these nodes or these rectangles has a value associated with it that's an integer. So this person has a int value of 12, 99, 37. So the other part of this node is a pointer to the next element in my linked list. So in order to traverse the linked list, I can just keep following where these next thing is point to and the last element in my linked list is going to point to null. So if I ever find, well, the next element in my linked list is null that must mean that I traverse the entire linked list and there's nothing else there. So here's how we might want you to create a linked list. So the first thing we need to do is we need to define what each of these nodes look like. So because these nodes have more than one thing and they're not of the same type because I have an integer and a pointer, those aren't of the same type, I can't really use an array so I need to somehow group these variables into a single object, I'm going to use a struct. And so each of these nodes as we said, in this case, we're using integers; but in the case of speller, you're probably gonna wanna use words. So here each node has a word and a pointer to the next node. So in order to create a node, I wanna use malloc. And how much memory do I want? I just want a single node. So I can call sizeof node and a that's gonna give me the size of the struct which is just going to be 8 because it's gonna add those two pointers up. So I can create 2. And now to set the contents of these nodes, I'm gonna use the arrow notation because if we've use malloc, which means that we have a pointer, which mean if we have a pointer to a struct, we need to use this arrow so we can set word by saying this and is. So now I have two nodes, they have different values and if I wanna make my first node point to the other, I'm gonna set that next pointer equal to the address of my other struct. So now we have one node pointing to the other and the second node is the end of my linked list. So I'm just going to set its next to null, because it doesn't point anywhere, because it's at the end of the list. So is everyone clear on the syntax for creating linked list? Yup? [ Inaudible Remark ] >> So here, so I've created two nodes, node 1 and node 2. And so in my linked list I have node 1 followed by node 2. And so the next element after node 1 is node 2. So that next corresponds to the next that's inside of my struct and then node 2 is the other variable that I created on the heap. [ Inaudible Remark ] >> Oh, I should say node 2, sorry. That's just a typo. Yup? >> Why do you need a node star? >> So why do we need a node star? So that's a good question. So remember that if we allocate something on the heap, then it can only last as long as--if we allocate something on the stack, sorry. Then as soon as that function returns we can't really access it anymore. So inside of our spell checker, we only wanna have to load up the dictionary once. And after we load the dictionary, we wanna be able to keep using it every time we call check. So if we were to put this linked list on the stack and we load it in the stack frame of load or something, then we're gonna lose it immediately. Whereas if we put it in the heap then we ensure that no matter when this function returns, we still have access to it. That makes sense? Okay. Are there questions on creating linked lists? Yeah? >> So if we wanna have like the example on [inaudible] could have node star on those--the brackets [inaudible]? >> So if you wanted to create an array of nodes? So, sure. So if you have an array and that array just has a type, you could hypothetically create an array the held a bunch of nodes. It wouldn't be a linked list, strictly speaking, because you have an array and a linked list is different than an array. But because the struct is just like any other type in C, you could create arrays of [inaudible], you could do whatever you want. So that would be possible if just node star array and then give it some link, which we'll see later. Other questions on creating linked lists? Okay. So once we have some linked list in memory, we wanna be able to traverse it. By traverse, I mean just look at every single element and maybe do something with the value or compare the values. And to do that, we first wanna create some new pointer and we're gonna call this the iterator. So the iterator is just going to be pointing at every single element of the linked list and then from that iterator we can determine what the value of that node is, what its next pointer is et cetera. So we wanna loop until eventually this iterator is null. So because this special iterator pointer is gonna point at every single note, if eventually starts pointing at null that means that we must have traversed the entire linked list because remember we said that the last element has to have a next value of null. So every point in a loop, this iterator can point at a different element and from that we can print out what that element looks like. So here let's just assume that we have some linked list and we've just stored the root of the linked list or the first element in some pointer called "first." So now we're creating some new iterator node. It's initially pointing at the first element of my linked list so while the iterator is [inaudible] to the null that must mean that we're still at a node. So it means, we can keep going, we can print out the word associated with the current node. So right now, let's just print out the first word. So back to our example, this would say this. So after that, we're gonna change our iterator to point at the next element in our linked list. So now, once we get back up to the while loop we're still not null, because we have that second struct which has the word "is" so I print out is next. And so now, once we next again, iterator is gonna be null because the next element of the last element in our linked list is gonna be null. So that means our loop is gonna terminate and our iterator has gone through our very short linked list. So does that make sense to everyone, how we're gonna iterate through? Awesome! So now we have to worry about freeing linked list, because remember we need to unload everything. So in order to free linked list, we can't just free the first element. Free isn't smart enough to know it's a linked list and to traverse the whole thing and free everything. Instead, we need to explicitly free everything in our linked list individually. But what's important to note is that once you call free on some node, that means you can't access what was inside if it's word or it's next pointer anymore. So freeing it means you can't do anything with it anymore, it's effectively gone. So in order to do this, before we determine what--before we can free, we need to figure out where we're gonna go next. So this could look something like this. So again, we have this iterator and this iterator is going to point at every single element in our linked list separately. So we're starting at the beginning of the list. And so now, we're gonna save where this iterator was. So this is just some address, remember. So now, we can move our irritator to the next element but this node, n, and still remember where the iterator was. So even though that this iterator pointer is on the next element of our list, n remembers where it was. So now we can free n because we already know where iterator needs to go, and if we wipeout n that's no problem, and then we can continue with our loop until we hit null. And so this is one way that you can free an entire linked list. So you could also do this recursively, right. We could say, well, while irritator was not null, we wanna keep calling recursively our free functions or whatever function where--like free linked list or whatever we call it. And we keep calling that until eventually we hit null that's gonna be our base case and then we wanna start calling free. With that it will effectively do is free the linked list backwards. And so you'll hit every single element at least once and your whole linked list will be free. So both of those implantations are totally valid and they're gonna do the same thing at the end of the day. So any questions on why we needed this second node or why we can't just free the first one? >> Okay. So that's it for linked list. So now let's move on to a more complex data structure and one that you'll likely use to represent your dictionary inside of speller.c. So a hash table is a data structure that's going to map keys to values. So a hash table has some fixed number of buckets. So in C we already have some data structure that keeps track of a list of a fixed size. That's just an array. So we can represent our hash table as an array where each bucket is a different element of our array. So also essential to our hash table is some hash function. So what a hash function does it is going to map each value that we wanna look up on our hash table to some bucket in our array. Now it's important that this hash function is the same every time so if we call hash CS50 and get back a number then we call hash CS50 again, we need to make sure we get back the same number or else our hash function is not going to work correctly. So we can't do things like returning a random value because of you call that twice on the same input there's a good chance you're gonna get back different values so that's not really gonna work. So what are--in this case what our hash function needs to do is it needs to map some word to some number. But what's important is that this number does not exceed the length of our hash table. If we have ten buckets that hash table and our hash function gives us back the number 20, we're in trouble because we don't know what to do with that number 20. So when you're designing your hash function you need to make sure that it doesn't exceed the number of buckets or return some negative number or something that's not even a number or else you're not gonna be able to get your hash table working. So here's an example of a hash function for this hash table. So here on the left here we have our keys. So in this case, these could be the words inside of your dictionary. So these keys are gonna be put through some hash function that's going to return a number. So you see John Smith here. After you put that through a magical hash function it's gonna map to the number 152. And then Lisa Smith is gonna map to 1 and so we can insert these nodes onto that bucket inside of our hash table. So here's what this could look like. Well, if you gave me John Smith, I'm gonna returns to you 152. If you gave me Lisa Smith, I'm gonna return your 1. So this is probably not the best hash function ever because we're returning anything that's not one of these four names. We're just sending them to bucket 153. So here's another example to hash function that's still not the great. So what we're doing here is we say, "Okay, we're gonna start off my result at 0." So result is gonna be whatever this hash table returns. So now we're gonna iterate over every single character in a word. So we know that because that this is a string, each element of this word array is a character remember because we have ASCII, we could map each of these characters to a number automatically. So what this does, it effectively just sums up the ASCII values of everything inside of you word. And so now, we can't just return result. Because if you have a hash table of size 10, this result is probably gonna be some big number in a hundreds or thousands. So it's just not gonna work. So in order to make sure that we never exceed the length of our hash table, we can use this modulus operator. So what this does remember is it effectively establishes some upper bound. So we say, okay, if we do result mod hash table size with this just constant that I defined somewhere, remember it's gonna be constant because our hash table is never changing size, we still always have the same number of buckets. If we modify that hash table size, we're gonna get some number that's definitely less than hash table size which mean it's definitely a valid thing--a valid place to insert our node. Does that make sense? Why this hash function isn't so great but it's a little better than this one? So the reason that this is a problem is because we can have a lot values mapped to the same bucket. Now if two values mapped to the same bucket, we can't just wipe out whatever was there, right. Because then you could effectively only have a dictionary that has size of over many buckets on your hash table. We just start wiping things out. What we need to do is keep tract of everything that's still mapped to that same bucket. So one way to do this is instead of just storing a word inside of your hash table is to store a linked list that contains everything that hash to that value. So why does this need to need to be a linked list? Well, we don't know in advance how many words are going to hash to the same value. So the size of this list is going to be dynamic. Remember in C, we can't just dynamically resize an array we have to create a new one and copy everything over. With the linked list, we see, it's really easy to just insert a new element. So by representing this as a linked list in our hash table, if we say, okay, well I found another word that hash to 152, I can just add it on to that linked list. So that's what's going on in this diagram. You'll see that John Smith and Sandra Dee just maps to the same value. They both got a 152. So we can't just wipeout one of them and only keep the last one to hash that value. Instead what we're doing is creating a linked list. So this, what slot, 152 is going to point to some node that's just a node in a linked list. And so it could have a word value of John Smith and then as its next pointer, we're gonna point to the other node that maps to 152 and if this--the next pointer in this node points to nothing then we know that we have our entire linked list, so that conceptually makes sense to everybody? Okay. So this could be how we're going to represent our hash stable inside of dictionary.c. So again, we have some node and inside of it we want to keep track of some word. So now we could do something like, well, we know that this word is going to be at most linked letters and we need to add one for the null character. So that mode's length, whatever this constant link is, + 1. So that saves us the trouble of having to worry about how long the word is, right. Because if we some array that's definitely big enough, then we can just fill up that array with everything inside of our word and we could have some wasted space, but then that's okay. If you want to save yourself a little bit of space, then you'd have to use string length to get the number of words inside of your string. And then you could use malloc to create a dynamically size string. So this is a little bit of a simpler approach, we can just say, "Well, let's just allocate enough space for everywhere to not worry about having to create a bigger arrays." Or if you want to save yourself some space, then you can start malloc'ing word that's based on the size of the word that's gonna go inside of this node. So again, we still need to have this next pointer because we have a linked list and the only way to traverse the linked list is to keep looking at each of these next pointers. So that's the representation of a single node. So now our hash table is going to be an array of pointers to nodes. So why pointers? Well remember that we need to puts these things on a heap or else there is no real good place to put them because the stack is gonna wipe them out as soon as functions returns. So there need to be pointers because everything on the heap is just a pointer to wherever that memory block was. So now, how many buckets do we have? Well we can just define some arbitrary constant, something like hash table size. And that's gonna tell us how many buckets are inside of our hash table and each of those buckets is going to point to a node. So if I access in this array which is just like any old array. If I said hash table I. Well, that's going to give me the value of the address of the first node in the linked list of everything that maps to that bucket. So that's just gonna be the equivalent of the root of some linked list. And we can traverse that linked list just like we traverse any old other linked list. So questions on how our hash table is gonna look in memory? Nice. Alright, so now that's kind of a high level overview of what everything is gonna look like. So now let's get into how we're actually gonna implement those four functions load, check, size and unload. So remember the goal for load is to get every word in the dictionary into memory somehow. So you just need to get it there so you don't need to continue to keep opening dictionary, iterating through it which would work but it would be really, really slow and we probably wouldn't do so well. So in order to do this, we need to somehow iterate over every word in the dictionary text file because we know that we need to eventually load those words into memory. So remember from the last pset, we have this nice function called feof that's going to return if or at the end of some file. So if we're looking to iterate over the entire dictionary file, we wanna loop until this end of file returns true, so very similar to the last pset. And now, once we read a word, we need to insert each word individually into our hash table. So to do that, we need to create some node for each word that we read in. So that means that we wanna malloc some new node. We're gonna call this node n throughout the example. So now we have some new node. It lives in memory somewhere but there's nothing in it. So we need to fill in that node which looks just like this with each of these values that nodes need to have a word and that node needs to have a pointer to the next element in the linked list. So to get the word, we can use this handy-dandy function fscanf. What this is going to do is it's going to read a string from the dictionary and it's gonna store it into some point to that you specify. So I can just store this inside of n word or the word array that's inside of my node, n, and I can read it from my file pointer whatever that may look like. So if I continue to call fscanf in a loop and we're going to read in our dictionary, one word at a time. So let's just take a look how we can do that. >> So here's my fscanf. And so I have this file text.txt. And so this, as you probably guess at this point, just says this is CS50 one word per line. So now we can open it up. With f open, we're gonna specify r 'cause we're only reading this file. And so now, we can create this array that's definitely big enough to hold the word, right, 46 there's no word that's more than 46 characters inside of this CS50. So now when we call fscanf on fp and we specify percent s, this tells fscanf that I want to read in a string. Where I wanna store that string, I wanna store it at word which is an array so that's an address. So now after this function executes, I now have that word saved inside of this null string and I can print out both the word itself. So just by using percent s and I can print out its length. So if we were to execute this I can say, make fscanf. And if I call fscanf, I've just printed out my first word which is this and I printed out the length of the word which is 4. So now if I call this twice if I said here I just called it again fscanf fp percent s word, what is this gonna do? [ Inaudible Remark ] >> "Is" exactly. So we don't need to do something like fseek, remove the cursor over just like in the last pset if we call fscanf again it's gonna know, okay, we've already read one word so my cursors moved over a little bit and the next time I call it we're gonna get a different word. So if I call fscanf, we're now on to "is." So does everyone understand how that works? 'Cause we're all pros at File I/O after pset5 advice. So now that we've read in this word--oh, question first? [ Inaudible Remark ] >> So does fscanf ignore spaces? Yes, because the string is going to ignore spaces, so because you said percent s, you could have said like something percent s space percent s and that would read into, but because there's no s--space inside that specifier it will ignore them. Other questions? Okay, so now that we have our word, we now need to hash it. So because we've used fscanf correctly that means that this word now contains some word from our dictionary. So now, we just need to hash it, which is great because remember our hash function takes as an argument a string, it's going to return an integer so some string value is gonna return where to put it. And so remember that our hash table is going to--is going to contain pointers to the start of linked list. So after we hash some word, we basically have two cases. So this variable index is gonna be where inside of my hash table I want to put this word. So the first case is that when we go and look at that bucket there's nothing there. So that means that this is the first word so far that's going to map to the hash value index. So in that case we just need to start creating our linked list. So that's really easy, we just want hash table index which it makes an array of pointers, we just wanna make that point to the node we just created, so in this case n. And now because n is now the end of our linked list 'cause our linked list just has one element in it, we wanna make that point to null. So now we've just started off our linked list, it only has one element and that's the node we just created. Does that make sense? So our other case is if it's not null. So that means that, okay, well, I've made a collision. Some word has already hashed to the value that I'm trying to put this word at. So that means that this can't be pointed to null, instead it's pointing to what is the start of by linked list. So we wanna insert to the beginning of the linked list and not the end because in order to insert to the end we're gonna need to traverse the whole thing. Now speller is all about being really efficient with our time and so to go through the entire linked list just for the sake of inserting a word is gonna be a big waste of time. So to avoid having to traverse every element which could take awhile for a linked list is pretty long, we can just insert it the beginning of the list. And so how do we do that? Well, we need to do two things. We need to change the pointer of our new node. We need to make that point to somewhere in our old list and we need to change the value of hash table. So the first thing we wanna do is make our new node point to what was at the beginning of the linked list. So that way, we now have two things pointing at what was the beginning. We have hash tables pointing there and now our new node points there. So now that our new node points there, we can change what hash table index pointed to 'cause if we did this in the reverse order we would effectively lose where our list started which is a problem. So now that our first element is now pointing to what is now the second element we can say "Okay, well, this is my new first element." So that means hash table index should point to the node that I just created. That makes sense? Alright. So now on to size. So this is probably the easiest part right. So given some hash table to calculate size, you could do something like iterate over every bucket on the hash table, for every bucket on hash table go through to every element in the linked list and have some counter that counts it up and do that every time that you call size. But again spellers are all about speed and so that's kind of a waste of time. So instead what you can do is while you're loading the dictionary, just keep track of how many words you've loaded in some global variable or something on the heap or wherever you wanna put it. We can just say, "I loaded a word, I increment the number of words that I loaded." So when I call size I can just return that value which I've already calculated and the size of the dictionary is never going to change as you spell-check through your text because there's just one dictionary you're only loading it once so you can just keep returning that same size that you've calculated. Does that make sense? Awesome! So now we need to move on to our check function. So our check function is going to return a Boolean, true or false, true being the word was spelled correctly, false meaning the word was not spelled correctly. So as an input, it's going to take a string and that string is going to be from our text. So if the word exists in our hash table that means that the word "must" have been contained inside of our dictionary file which means that word is spelled correctly. Now if our word is not found in our hash table, that means it can't spelled correctly because it would have been found if it where. So we don't need to search through the entire hash table for the word. And so this is kind of the power of hash tables. Instead we only need to search the bucket for the hash value of the word I'm looking at, right. Because in order for this word to be inside of my hash table, it must hash to the same value as the same exact word inside of dictionary. So remember that's where our hash function needs to be deterministic. When I hash some word in my dictionary, I need to get the same value as when I hash some word inside of my text file. So if I do, that takes my hash table of size 10,000 and limits it down to one bucket. So inside of that one bucket I just need to look, well, there might be more than one word that hash to this value. So just saying that it hash to this value is not enough, right. i can have some word like Z, Q, X, Y that hashes to the same value as the word apple. So if I don't actually go through the linked list, then I could assume that Z, X, Q or whatever I just said is a word when it's not. Just because two things hash to the same value, does not mean that they're the same word. So we actually need to go through and traverse our linked list. But that's okay because the linked list is probably gonna be pretty small so that's not gonna be a huge waste of time. So we already know how to traverse the linked list, right. There's nothing special about this linked list being in a hash table it's still just a linked list at the end of a day and we just went over how we can go through that. So some function you might find comes in handy is the strcmp. So this is going to compare two strings. And if string 1 and string 2 are the same string, it's going to perhaps kind of intuitively return 0. If string 1 is greater than string 2, it's gonna return 1 or the other way around it's gonna return negative 1. So if you're checking if two strings are the same with this function, you need to make sure that it returns 0. Now the only problem with this is that our spell checker needs to be case sensitive. So you're guaranteed that everything the dictionary is gonna be already lowercase but you're not guaranteed that everything in your text is going to be a lowercase. So if your dictionary contains the word apple and your text contains the word Apple with the capital A, well, that apple is still spelled correctly even though based on the highest function we saw earlier just summing up the ASCII values, you're gonna get different values. And so you might return, well, capital A is not in my dictionary but lowercase A is, you need to account for that. So you need to make sure that any word with different case inside of your text is still going to be a correctly spelled word. So an easy way to do that is to just convert the case of the entire word so make everything lower case or make everything upper case or something like that and then hash it. So don't forget about that function to lower, to upper also exist and is upper and is lower, returns a Bool if a character is uppercase or not. So remember those functions work on characters. It's an order to convert an entire string that means you still need to iterate through it. And so now if--while you're traversing your linked list you find that two strings are equal, so some string inside of your linked list and the string that you're looking at, if those are the same then that means that--well, that word must be spelled correctly. But if you reached the end of your linked list that means that the word was not contained in the linked list. And now you know that word must not be spelled correctly even though you didn't check the entire dictionary, you only check maybe five or six words, but that's the problem of our hash table. >> You only check the five or six words that could be the same word as you're looking at. Any question on how that works? Alright so let's just take a look at an example. So here is my hash table. It has ten elements, when I started off, everything is pointing to null because there's nothing there. So now I'm going to insert the word "this" into my hash table. So "this" has a hash value of 1 using the magical hash function I just made up. So that means that we've hit our first case when I go to hash table "I" I'm going to get null. So that means that there's nothing there and I need to create some new linked list. That's pretty easy I can just set it equal to the address of the node. And now the same thing's gonna happen when I use my magical hash function on the word "is" so that's just gonna go in the fifth slot. Again, I'm creating a new linked list. Now when I get into CS50, there's gonna be a problem because my magical hash function just returned to 5. So what we need to do is add on CS50 to the end or at the beginning which is faster just the end in this case because "is CS50" is better than "CS50 is." So we're adding it to the end of our linked list and so now we have a linked list existing at 5. So this is my--if my dictionary just said this is CS50, this is what my hash table is gonna look like after I just called load. So now that I've loaded everything, I can start calling check on some words. So let's say that we were--called check on the word "isn't." And so my hash function is going to return a value of 3 for the word "isn't". So now, we go to our third bucket so 0, 1, 2, 3 and we see a null. And so that means that "isn't" is definitely not spelled correctly, because it does not exist anywhere inside of our hash table. If instead we called "this" so we can say "Okay, we're gonna go to bucket 1," bucket one is not null so it's--we can't just assume it's spelled correctly, we need to look at our linked list. In this case, we just have one element and the strings are the same so this is spelled correctly. Now in the case we called CS50, again, it's not null and now we need to traverse our entire linked list. So we're gonna say, "Okay, well, CS50 and is," are those the same string? No, but that doesn't mean we're wrong yet because we have another element to look at. So now we're gonna say, "Okay, CS50 and CS50, those are the same string, they're spelled correctly. So any questions on that example? Yup? >>Why did you add CS50 to the end of the linked list? >> So why did I add CS50 to the end of the linked list. So you'd actually be adding it to the beginning, I decided it to the end so it say this is CS50. [ Inaudible Remark ] >> Which is a good idea at the time. Other questions? Alright, so now we need to worry about unload. So remember the goal is we want to free our entire hash table for memory. So that includes every single node that's anywhere in any linked list, it needs to be gone for memory. So a nice thing to note is that you don't need to free anything on the stack. So if you put something on the stack whether it'd be in a function or be it'd be a global variable, you don't need to free it because you just don't free things that are on the stack. The only things you need to free are going to be those that you malloc. So if you have some global array that says like node star array, if this is a global variable you don't need to free it because it's on the stack. If you malloc something, like all of those nodes that we malloc, then we do need to free them. And so how can we free a hash table? Well we know our hash table has some fixed size and wanna iterate through every single bucket in our hash table in order to free the linked list that go there. So for every element in my hash table, I'm either gonna have null or I'm going to have a linked list. So now for every element in the linked list we are you know how to do this. We can just free the element, move on to the next element. As soon as we hit null, we're done and we can move on to the next bucket inside of our hash table. Questions on unload? Alright so this pset is going to introduce us to a new code--a new tool called Valgrind or Valgrind however you say it? And so what Valgrind is gonna do is it's going to check for us if we had any memory leaks. So let's take a look at how to use it. So here I have a program called "helpmevalgrind" and it looks pretty [inaudible]. So I'm creating linked list, I'm creating 1, 2, 3, 4 nodes, I have a root and N1, N2, N3 each of these nodes has a value that's an integer and it's pointing to some other node. And so now this is how I set up my linked list. Well, the root node is gonna point to N1, N1 is pointing to N2 and N2 is pointing to N3. So I'm not really gonna do anything about linked list. But now that my program is done, I'm gonna be a good programmer and I'm gonna free the linked list. And because I free the first element the whole thing is free, right? [ Inaudible Remark ] >> Exactly! So, it's not. So when we run Valgrind we can say--so here's by helpmevalgrind, I can say valgrind. This V is for robust. You wanna say tell me as much as you possibly can. I'm gonna say dash, dash, leak, check equals full. So we're gonna check everything possible. And now we're gonna say dot slash and the name of our binary. So don't forget about that dot slash. We hit Enter. Now we got a whole lot of output, but luckily what's in the middle of the screen here is what we care about. So we have some problems here. We definitely lost 8 bytes and we might have lost 16 bytes, so really we just lost 24 bytes of memory. So that's how much memory we've leaked inside of this leak summary. Everything in there is leak memory that you now care about. So now let's look above this. Well, okay. So add some address something. Now this looks interesting. So this is the name of our C file and it's going to give us a line number after it. So, in file, helpmevalgrind,c12, we lost the memory that was created there. So let's open up, get it, and now let's look at line 12. Oh, look at that. So this is where it created n1, and we never freed n1. So if you go into a fix version which valgrind helped us solve. If we just free, 3, 2, 1, or 1, 2, 3 in the root. Now when we run Valgrind, we can say, "Valgrind saves the day. All heap blocks were freed, no leaks are possible." So this s what your output is gonna look like when you're done and you're really happy. So no leaks possible. Nothing at use when we're done and it even tells us how many allocs we made. Remember, we had four nodes on a linked list. And remember, as we made the malloc four times and you use that many bytes 'cause 8 times 4 is 32. Questions? Yup? [ Inaudible Remark ] >> Yup. [ Inaudible Remark ] >> So yeah. So this is--So that's also a problem. So if we were trying to traverse this linked list, we probably run into an issue because it's not null. So you would want to set that null if you're actually using this linked list to diverse something, exactly. Other questions on Valgrind. Yup? [ Inaudible Remark ] >> So if we look back at our output, we're gonna go back to here. So right here. Right above the leak summary, it's gonna try to tell you and may or may not work. It's gonna try to tell well this one's gone. So you might wanna look there in your C file. So, yes. So the question was, is where is Valgrind telling us where we should maybe start to look and just write above leak summary? It's gonna say some helpful information among some other unhelpful information. Any other questions on Valgrind? Okay. So let's take a look at another way that we can implement our dictionary. So any questions on the hashtable approach or is everyone good on that. Yup? >> Is it up to us to figure out like what the ideal length is for the hash table? >> Is it up to you to figure out what the ideal length is for your hash table? Yes, it is. And it's going to depend on your hash function. So maybe your hash function does better with this size table and the best way to do that is to like I did is used to find constants. So if I say a hash table size, I just say, "I printing my hash table. It's an array of size hash table link." So as I'm debugging my program, if I just change that one constant, I can change the size of the hash table for everything. So something you might wanna do is just go through. Let's try a thousand. Let's try 10,000. Let's try 2,000. And just see what gives you the best performance. Other questions? Okay. So now this approach uses a totally different data structure than hash tables. So you would never use a combination of these two. These are kind of two different approaches to solving the same problem. And you should go with whichever you like better. So this is the data structure called the "trie." It was originally pronounced tree 'cause it comes from the word "retrieval" but tree also had it spelled T-R-E-E, so we decided to pronounce this tries. So the basic difference between tries and linked list is that instead of having a single word inside of our trie element, it's going to contain an array with an element foe each possible character inside of our word. So, the value of an element in array is going to--is going to have a corresponding letter. So each of my corresponding letters has some slot in my array that's found in each node. So inside of that array, it's going to point to new nodes. So here's my--so here's my example. So I have some node at the top here and this line is the array at that node. So the letter M is a valid letter. That means that the value at that corresponding place in my array is going to point to a new node. So now, this new node has another array, again, containing a slot for every possible letter inside of any word. >> So if you look at the letter A, that's going to point to another node. Now we can see all this kinda falls down into a valid word. We've made the word Maxwell. Now instead we were gonna say something like MB, when we go to the slot for B inside of this array because remember all the arrays of the same and there's an element for each possible letter, B is gonna point to null. So at that point, we know that there's no word in our dictionary that starts with MB. So as I'm going through this, if I ever encounter a word that starts with MB, it's definitely misspelled. And I don't even need to look at the rest of the word because there's nothing that starts with MB. So if the word you're looking at starts with MB, it's definitely wrong and you can end to there. So does everybody understand what's going here? So these other--these delta things at the bottom just signify that that's the end of a word. So we can represent that a little differently using this representation. So again, this is a single node and the node has an array for every possible character. So in this pset, don't forget about the apostrophe which is a valid character. So all the letters in the apostrophe, that's gonna make 27 different valid characters in a word. So we also wanna keep track whether or not this node is the valid end of a word. So if some word ends with this node. So that can just be a Boolean flag. So now to load something into a trie, now instead of just iterating through every word of the dictionary, we need to also iterate through every character n every word in the dictionary. So as we iterate through this word, we basically want to also iterate through our trie. So let's say we're inserting that word "Maxwell." So we're gonna say, okay, I just read the word Maxwell from my dictionary and I want to iterate through every letter. So we're gonna say okay so that M, I'm gonna jump to that spot M. Now if that's null, that mean that I need to create a node because the word I'm reading is definitely valid so we wanna insert it. So if this n is null, we want to create a new node that's gonna represent the sequence MA. So now you can see that I'm just going to iterate though that entire word creating nodes as necessary. Now if for example, it already existed like M, that node, already existing. We don't need to create it. Instead, we just wanna jump there. So if now if I'm inserting this word M-E-N, so Mendel, after I've inserted Maxwell, well, this node that's corresponds to the first letter M, it already exist. So I don't wanna create a new one. I just wanna use the one that already exist there and then fill in the slot for E. So now the same node has slots build in for A and E and we're gonna have a different sequence of nodes to represent the word Mendel. Does that make sense? So this code is actually super, super elegant if you can get it working correctly. So that's--once we get to the end of the word and this case a new line because we know that all that word the dictionary must end to the new line. Well, that's--and we can set that flag. That is the end of the word. We can set that to true because we reach the end of the word. We can put a null term in there. We can do whatever we want. So that's how we're going to insert. Does that make sense? So size, same thing. Nothing complicated. You don't need to traverse the whole thing and do some complicated recursion. Just need to keep track of the words as you're loading them. I found a new word. Total words loaded plus, plus done. So now, check is where it gets really, really cool. So in order to check if a word is inside your dictionary, you basically need to find some path along your trie that corresponds to that word. So let's say instead of inserting the word Maxwell, we were looking for it. So we need to again iterate through every single character inside of the word. So we're gonna start off at M. And if M is not null, then we're gonna jump to the next node. Now we also need to increment the letter that we're looking at. So now we're looking at A. So A is not null, we're gonna jump to the next level inside of our trie and jump to the next letter inside of our word, X, and keep going. So in order to check if a word is correctly--is correctly spelled, we need to look at every letter. But the real power of a trie comes in when you're looking at words that are not correctly spelled. So if we're looking for something like Dworkin, right, Maxwell-Dworkin, well, we're gonna say, okay, D, D is null, done. It cannot be spelled correctly because there's no word in here that starts with D. Or if you're looking at a hash table, your hash function is probably independent of what the word starts with. So you probably have to go through some linked list of a bunch of dissimilar words. Where with my trie, as soon as I can't finish my path, that word might be misspelled because there's no way to make a path through this trie that spells out the word. Does that make sense? Alright. So let's check. And so finally unload, gets a little more complicated because like a link, one way or freeing a linked list is to kind of free of backwards. You wanna do the same thing with this trie. So in order to free it, you wanna travel to the lowest possible level and free all of your pointers inside of that array. So if those points to something that was malloc, you wanna free them. So once you get to the lowest possible node, you then wanna backtrack upwards. So in order to free this trie, you could do something. If I get all the way down here to this V and free everything, now backtrack up, I can free everything that was apparent of that; I can free everything that's apparent of that. And so by freeing it, bottom to top, you make sure that you don't accidentally like free this N without freeing this G. 'Cause once you free this N, you have no idea where this is because you're not allowed to access that memory anymore. So in order to free this correctly, we wanna go like this. So as we've seen with recursion, this kind of natural lends itself to recursive implantation. For our base case, is the one right at the end, then once we free this and we recurs back up the stack, we're gonna hit everything in the order that we saw it. So we can easily just backtrack. If our base case is when were at the end. Does that make sense? Okay. So any questions on tries or hash tables or anything about pset6? Yup? >> So, [inaudible] dictionary place like how would you give like on the new nodes of different names [inaudible]? >> So how do we give the new nodes different links? >> Different names. >> Oh, different names? So if we're inside of a loop, so how would we give these new nodes different names? So if we're inside of a loop, we don't really need to name them something separate, right. Once we put it on the heap, that means that that node exists somewhere in the heap. So as long as we keep tracking of that, either inserting into the hash table, inserting to a linked list, we don't care what it was originally named. So if you're inside of a loop, you can basically reuse the same name but just keep malloc'ing new space. And make sure that once you malloc a node, you don't loose track of it. Because as soon as you lose track of it, you have a memory leak because you can never free it, because you have no idea where it is. Other questions on anything? Alright. Well in that case, then good luck on pset6. [ Silence ]