1 00:00:00,506 --> 00:00:08,886 [ Silence ] 2 00:00:09,386 --> 00:00:10,236 >> Alright everybody. 3 00:00:10,236 --> 00:00:14,326 Welcome to Walkthrough 6 for pset 6: Misspellings, 4 00:00:14,326 --> 00:00:18,146 which is by far the culmination of your C experience in CS50 5 00:00:18,146 --> 00:00:20,856 and perhaps your entire CS50 experience. 6 00:00:21,286 --> 00:00:23,446 So I would highly recommend you checkout my play list 7 00:00:23,446 --> 00:00:25,826 of epic music before you get started on this pset, 8 00:00:25,826 --> 00:00:27,636 you're gonna do much better I promise. 9 00:00:27,846 --> 00:00:30,256 So today I would just gonna go 10 00:00:30,256 --> 00:00:33,216 through the distribution code namely speller.c 'cause it looks 11 00:00:33,216 --> 00:00:35,406 really scary and intimidating, then we're gonna go 12 00:00:35,406 --> 00:00:37,096 through all the different date structures you're gonna need 13 00:00:37,096 --> 00:00:38,106 to implement these psets. 14 00:00:38,106 --> 00:00:40,866 So Linked List and Hash Tables and then Tries 15 00:00:40,866 --> 00:00:42,856 as an alternative to those two things. 16 00:00:43,806 --> 00:00:45,306 So inside of speller.c, 17 00:00:45,306 --> 00:00:47,686 it's basically doing four main things for you. 18 00:00:47,976 --> 00:00:49,456 The first it needs to do is it needs 19 00:00:49,456 --> 00:00:51,166 to load that dictionary file. 20 00:00:51,556 --> 00:00:53,666 So remember, this is some file that we've given you 21 00:00:53,666 --> 00:00:56,996 and it just contains correctly spelled words, one per line. 22 00:00:57,316 --> 00:00:59,036 And so this forms a dictionary because this is-- 23 00:00:59,036 --> 00:01:00,936 if you were to look up some word in dictionary to see 24 00:01:00,936 --> 00:01:02,896 if it was correctly spelled, if it's not found 25 00:01:02,896 --> 00:01:04,876 in dictionary then it must be spelled incorrectly. 26 00:01:05,706 --> 00:01:08,346 So we're gonna now open up some text file, 27 00:01:08,346 --> 00:01:09,766 we have some great works of literature 28 00:01:09,766 --> 00:01:11,426 like the Austin Power screenplay. 29 00:01:11,726 --> 00:01:14,526 I was gonna through that and check every single word inside 30 00:01:14,526 --> 00:01:16,886 of that text and see which word is spelled correctly 31 00:01:17,036 --> 00:01:18,246 and which are not spelled correctly. 32 00:01:18,246 --> 00:01:20,526 So to do that, it's gonna call this function check 33 00:01:20,996 --> 00:01:22,016 on each of the words. 34 00:01:22,606 --> 00:01:25,076 You also need to implement a function called size. 35 00:01:25,386 --> 00:01:27,706 Now, that's gonna do is return how big is your dictionary, 36 00:01:27,706 --> 00:01:30,546 how many different words we spell checking against. 37 00:01:30,906 --> 00:01:33,736 And finally, unload, which is going to free up all the memory 38 00:01:33,736 --> 00:01:35,706 that you use because we're really good programmers 39 00:01:35,706 --> 00:01:37,506 and we would never do something like weak memory. 40 00:01:38,086 --> 00:01:40,876 So inside of dictionary.c is 41 00:01:40,876 --> 00:01:43,196 where those four functions are defined and that's 42 00:01:43,196 --> 00:01:44,946 where you're going to implement them. 43 00:01:45,286 --> 00:01:47,956 So just at a high level, let's see what's going on. 44 00:01:47,956 --> 00:01:50,226 You're loading that dictionary, you're going to every word, 45 00:01:50,226 --> 00:01:51,996 you're checking it against that whole dictionary, 46 00:01:51,996 --> 00:01:54,086 and if it's found in memory, if it's found 47 00:01:54,166 --> 00:01:56,456 on that dictionary file then it has to be spelled correctly 48 00:01:56,556 --> 00:01:58,266 because it matches some correct spelled word. 49 00:01:58,586 --> 00:02:00,816 And if it's not done on a dictionary file, well, 50 00:02:00,816 --> 00:02:02,486 then we can just assume it's not spelled correctly 51 00:02:02,486 --> 00:02:04,536 because we have a list of every correctly spelled word. 52 00:02:05,566 --> 00:02:08,696 So let's just walkthrough speller.c which you do not have 53 00:02:08,696 --> 00:02:11,396 to modify because you're only changing dictionary.c, 54 00:02:11,646 --> 00:02:16,706 but the speller.c is where the main is contained inside 55 00:02:16,706 --> 00:02:17,316 of your program. 56 00:02:18,246 --> 00:02:20,136 So notice here, we've just defined some constant 57 00:02:20,136 --> 00:02:21,806 and this is our default dictionary. 58 00:02:21,806 --> 00:02:24,096 So this dictionary is really, really big, 59 00:02:24,316 --> 00:02:26,576 when you're first starting of with the pset, it might not be 60 00:02:26,576 --> 00:02:28,506 so wise to keep this huge dictionary just 61 00:02:28,506 --> 00:02:29,756 because there're so many words there. 62 00:02:30,016 --> 00:02:32,476 What you can do is instead just create your own dictionary 63 00:02:32,676 --> 00:02:33,886 that have three or four words 64 00:02:33,886 --> 00:02:36,156 that you know are spelled correctly, save it somewhere 65 00:02:36,156 --> 00:02:38,226 in your appliance and then just change this constant 66 00:02:39,026 --> 00:02:40,286 or you can even just supply it 67 00:02:40,286 --> 00:02:42,886 when you actually run the program but more on that later. 68 00:02:42,886 --> 00:02:46,306 So here, here's the usage of our program, we're gonna run speller 69 00:02:46,306 --> 00:02:49,316 and then this first argument is optional, so this is the path 70 00:02:49,756 --> 00:02:50,656 to some dictionary file. 71 00:02:51,056 --> 00:02:52,176 If you don't specify it, 72 00:02:52,436 --> 00:02:54,956 then it's just gonna use this define constant up here, 73 00:02:55,616 --> 00:02:56,756 and if you do specify it, 74 00:02:56,756 --> 00:02:58,716 then you can use something other than that dictionary. 75 00:02:59,556 --> 00:03:01,486 So after that, it's going 76 00:03:01,486 --> 00:03:03,406 to be what text are you spell-checking against. 77 00:03:03,806 --> 00:03:05,396 So just like you might not wanna start off 78 00:03:05,396 --> 00:03:06,976 with the King James version of the bible, 79 00:03:06,976 --> 00:03:08,776 you might wanna have some smaller text file 80 00:03:08,776 --> 00:03:11,446 that you wrote yourself containing words that are either 81 00:03:11,446 --> 00:03:14,066 in your small dictionary or not and start testing against that. 82 00:03:14,166 --> 00:03:15,896 And once you think your implementation is working 83 00:03:15,896 --> 00:03:19,226 correctly, then starts scaling up to epic works of literature. 84 00:03:19,506 --> 00:03:23,216 So you'll see a lot of these timing functions 85 00:03:23,216 --> 00:03:26,476 because what we're doing is timing how fast it takes your 86 00:03:26,476 --> 00:03:28,926 program to spell-check each piece of literature. 87 00:03:28,926 --> 00:03:29,756 So we'll see a lot of that 88 00:03:30,196 --> 00:03:33,016 on the struct r usage is just for resource usage. 89 00:03:33,016 --> 00:03:35,486 So what time is it when you started to check words 90 00:03:35,486 --> 00:03:38,036 and what time is it now as just the means of calculating 91 00:03:38,036 --> 00:03:38,996 of how much time you took. 92 00:03:39,876 --> 00:03:41,616 So now, we're just determining the dictionary. 93 00:03:41,946 --> 00:03:43,656 This argument is optional. 94 00:03:43,656 --> 00:03:45,166 So if it doesn't exist then we'll just use 95 00:03:45,166 --> 00:03:46,136 that to find constant. 96 00:03:46,506 --> 00:03:49,156 And so now we're calling the first of those four functions. 97 00:03:49,436 --> 00:03:50,586 Here, we're calling load. 98 00:03:51,206 --> 00:03:52,906 So this is just gonna return a Boolean. 99 00:03:52,906 --> 00:03:54,116 It's just gonna return true. 100 00:03:54,116 --> 00:03:56,576 The dictionary load it correctly where we're recalling load 101 00:03:56,576 --> 00:03:58,096 on this dictionary text file. 102 00:03:59,166 --> 00:04:02,036 And you'll notice that we're specifying not some file pointer 103 00:04:02,036 --> 00:04:04,096 but just the path to the dictionary files. 104 00:04:04,096 --> 00:04:06,076 So inside of load you, you need to take care of opening 105 00:04:06,076 --> 00:04:07,166 that file up and reading it. 106 00:04:08,176 --> 00:04:11,096 So if for some reason, the load failed, we're just gonna quit. 107 00:04:11,716 --> 00:04:12,916 And now you'll see. 108 00:04:13,106 --> 00:04:15,406 Now we're calculating how much time it just took 109 00:04:15,676 --> 00:04:16,636 to load your dictionary. 110 00:04:16,636 --> 00:04:18,486 And so this speller.c is gonna save that 111 00:04:18,486 --> 00:04:19,846 and you don't have to worry about that. 112 00:04:19,846 --> 00:04:21,626 So now we're gonna do the same think 113 00:04:21,626 --> 00:04:23,016 of the text, try to open up. 114 00:04:23,476 --> 00:04:26,276 And now, here's where we're gonna start iterating 115 00:04:26,276 --> 00:04:27,186 through the text file. 116 00:04:28,076 --> 00:04:30,456 So int c equals f get c on fp. 117 00:04:30,456 --> 00:04:34,196 So you noticed that fp is what we called our file pointer 118 00:04:34,196 --> 00:04:35,636 when we opened up the text to read. 119 00:04:36,006 --> 00:04:39,396 So up here, we said file star fp is opening up the text file 120 00:04:39,396 --> 00:04:40,986 that you're reading through and spell-checking. 121 00:04:42,186 --> 00:04:45,096 And what f get c is going to do, it is going to go 122 00:04:45,096 --> 00:04:47,236 through that file character by character. 123 00:04:47,466 --> 00:04:51,366 And so we're gonna iterate until we reach the special character 124 00:04:51,366 --> 00:04:54,246 "eof" that signifies I've reach the end my file. 125 00:04:54,336 --> 00:04:55,506 So there's nothing else to read. 126 00:04:55,696 --> 00:04:57,156 And so that's the condition on which I'm going 127 00:04:57,156 --> 00:04:59,266 to stop reading through my file. 128 00:05:00,076 --> 00:05:01,736 And after each iteration of the loop, 129 00:05:01,736 --> 00:05:03,386 I'm gonna call f get c again. 130 00:05:03,666 --> 00:05:05,396 So I've move on to the next character. 131 00:05:06,006 --> 00:05:07,766 So a little untraditional looking for loop, 132 00:05:07,766 --> 00:05:09,156 but all this is doing is looping 133 00:05:09,156 --> 00:05:11,676 through every single character inside of the file 134 00:05:11,676 --> 00:05:12,716 that you are spell-checking. 135 00:05:14,096 --> 00:05:16,456 So we only want to allow alphabetical characters 136 00:05:16,456 --> 00:05:17,216 and apostrophes. 137 00:05:17,216 --> 00:05:18,686 This is just some guideline, 138 00:05:18,686 --> 00:05:21,066 some restriction that's set forth in the psets back. 139 00:05:21,066 --> 00:05:23,696 We say you don't have to worry about numbers inside of words. 140 00:05:24,546 --> 00:05:27,446 And so, we have this array, word. 141 00:05:27,876 --> 00:05:31,426 You notice it's initialized through some fixed link, + 1. 142 00:05:31,896 --> 00:05:34,016 So we're doing here is we basically initializing it 143 00:05:34,016 --> 00:05:35,346 to the biggest word possible. 144 00:05:36,146 --> 00:05:39,556 And so that way, we know we have enough space to hold any word 145 00:05:39,556 --> 00:05:40,466 that we might read in. 146 00:05:41,296 --> 00:05:43,006 So every time you read a character, 147 00:05:43,236 --> 00:05:45,006 we're inserting it into this word array. 148 00:05:45,006 --> 00:05:46,846 And then if we increment index, 149 00:05:46,966 --> 00:05:49,136 that means the next letter we read is going to go 150 00:05:49,136 --> 00:05:51,036 into the next position into this word array. 151 00:05:51,446 --> 00:05:53,566 So that way, we're filling up this array character 152 00:05:53,566 --> 00:05:54,816 by character with all 153 00:05:54,816 --> 00:05:56,766 of the characters inside of a single word. 154 00:05:56,766 --> 00:06:00,186 So if something is too long, then we say, 155 00:06:00,186 --> 00:06:02,646 well that's definitely not a word because it's bigger 156 00:06:02,646 --> 00:06:04,496 than the largest word in English language. 157 00:06:04,866 --> 00:06:06,526 So we're gonna is we're just gonna finish up reading 158 00:06:06,526 --> 00:06:08,316 that word and reset our index. 159 00:06:08,316 --> 00:06:11,066 So when we move on to the next word, we're gonna be starting 160 00:06:11,066 --> 00:06:13,266 at the beginning of the array again. 161 00:06:14,436 --> 00:06:15,546 So same thing here. 162 00:06:15,716 --> 00:06:18,446 Well, if we ever encounter a number inside of a word 163 00:06:18,446 --> 00:06:20,566 that we're spell-checking, we know that that can't be a word, 164 00:06:20,756 --> 00:06:22,246 so we're not gonna bother spell-checking it. 165 00:06:22,246 --> 00:06:24,616 We're just gonna consume the rest of the string, 166 00:06:24,616 --> 00:06:27,476 so skip over every other letter that follows that number 167 00:06:27,666 --> 00:06:29,296 until we hit a new world. 168 00:06:29,646 --> 00:06:31,426 And so, we're resetting the index again so that 169 00:06:31,426 --> 00:06:33,776 when we hit the next letter-- when we hit the first letter 170 00:06:33,776 --> 00:06:36,106 in the next word, we're back to beginning of this word array. 171 00:06:37,216 --> 00:06:39,816 So now, if we have found some whole word, 172 00:06:39,816 --> 00:06:42,376 we're going to terminate it with the null terminator. 173 00:06:43,056 --> 00:06:46,256 So whenever-- in C, we find this null terminator inside 174 00:06:46,256 --> 00:06:48,206 of a string, we know that the string ends there. 175 00:06:48,626 --> 00:06:50,896 Even if there's some stuff after this null terminator 176 00:06:50,896 --> 00:06:53,806 in the array, because we find this null terminator this says 177 00:06:53,806 --> 00:06:55,506 "this is where the string actually ends, 178 00:06:55,766 --> 00:06:57,956 so don't bother going looking pass this point." 179 00:06:59,236 --> 00:07:00,656 So at this point, we're gonna say, "Well, 180 00:07:00,656 --> 00:07:02,686 we've read one more word inside of our text file, 181 00:07:02,936 --> 00:07:05,856 and now we're calling this check function." 182 00:07:06,126 --> 00:07:08,726 And so this is a function that, again, you need to read-- 183 00:07:08,796 --> 00:07:11,616 write yourself and this is gonna return true or false depending 184 00:07:11,616 --> 00:07:14,586 on whether or not this word which we just saw is coming 185 00:07:14,586 --> 00:07:16,086 from the file that we're spell-checking, 186 00:07:16,086 --> 00:07:17,086 we've read in every character. 187 00:07:17,086 --> 00:07:18,876 And so now, we're spell-checking this word 188 00:07:18,946 --> 00:07:19,946 against the dictionary. 189 00:07:19,946 --> 00:07:22,516 So this needs to return true or false depending on whether 190 00:07:22,516 --> 00:07:24,126 or not this word is spelled correctly. 191 00:07:24,656 --> 00:07:27,196 So again, this get resource usage is just timing, 192 00:07:27,306 --> 00:07:29,686 how long did it take for me to run this check function 193 00:07:29,896 --> 00:07:33,086 and I'm gonna keep track of how long it takes to run each one 194 00:07:33,086 --> 00:07:34,716 of them in this TI check variable. 195 00:07:35,676 --> 00:07:37,396 So now, if the word is misspelled, 196 00:07:37,396 --> 00:07:38,286 we're just gonna print it out. 197 00:07:38,286 --> 00:07:40,626 I'm gonna say, "Well, we found some other misspelled word 198 00:07:40,626 --> 00:07:42,996 which we're gonna print out at the end of our program." 199 00:07:43,626 --> 00:07:45,976 So then after that, we must have just spell-check the words, 200 00:07:45,976 --> 00:07:47,566 that means we finish reading it, so again, 201 00:07:47,566 --> 00:07:48,826 we're gonna reset our counter 202 00:07:48,826 --> 00:07:50,456 in the word array back to the beginning. 203 00:07:50,456 --> 00:07:52,216 SO next time we read a character, we're gonna be 204 00:07:52,216 --> 00:07:54,426 at the beginning of my word array 'cause 205 00:07:54,426 --> 00:07:55,306 where at a new word. 206 00:07:56,066 --> 00:07:59,126 So after that, we're gonna close the text file, 207 00:07:59,386 --> 00:08:01,656 the one we're spell-checking 'cause we don't need it anymore. 208 00:08:01,656 --> 00:08:03,636 And now we're calling our third function, size. 209 00:08:04,226 --> 00:08:06,606 And so this just needs to return an integer 210 00:08:06,606 --> 00:08:08,606 of how many words are inside of your dictionary. 211 00:08:09,556 --> 00:08:11,536 So again, just timing how long it took you to do that. 212 00:08:12,386 --> 00:08:14,286 And finally, we're calling unload. 213 00:08:14,286 --> 00:08:17,636 So after this function is done, everything that you loaded 214 00:08:17,636 --> 00:08:20,426 into memory, should now be gone and we're gonna-- 215 00:08:20,426 --> 00:08:24,526 we can check that, with the check 50 command 216 00:08:24,526 --> 00:08:26,706 that you saw in the pset. 217 00:08:27,096 --> 00:08:29,656 It'll basically tell you, "Well, you left some things in memory 218 00:08:29,656 --> 00:08:31,376 so you must not have unloaded everything yet. 219 00:08:32,106 --> 00:08:34,116 So after we calculate how long that took, 220 00:08:34,276 --> 00:08:35,406 we're just gonna print out some stats, 221 00:08:35,556 --> 00:08:38,166 how many words are misspelled, et cetera, et cetera, 222 00:08:38,826 --> 00:08:40,856 and if you wanna know if yours is working correctly, 223 00:08:40,856 --> 00:08:42,746 you can always compare your implementation 224 00:08:42,746 --> 00:08:43,896 to the stat implementation. 225 00:08:44,146 --> 00:08:45,996 If they found 600 words misspelled, 226 00:08:45,996 --> 00:08:48,706 then you found 10,000, then you might have a problem somewhere 227 00:08:48,706 --> 00:08:50,416 and then that's it. 228 00:08:51,506 --> 00:08:54,166 So, any questions on how we are using speller? 229 00:08:54,846 --> 00:08:55,266 Yup? 230 00:08:55,266 --> 00:08:55,476 [ Inaudible Remark ] 231 00:08:55,476 --> 00:09:06,596 >> So how the speller.c knowing you're at the end of the word? 232 00:09:06,756 --> 00:09:07,016 >> Yeah. 233 00:09:07,586 --> 00:09:08,826 >> Yeah. So-- 234 00:09:08,826 --> 00:09:09,156 [ Inaudible Remark ] 235 00:09:09,156 --> 00:09:11,036 >> So let's see. 236 00:09:11,036 --> 00:09:12,846 So if it's a letter, that means we can't be 237 00:09:12,846 --> 00:09:13,566 at the end of the word. 238 00:09:14,596 --> 00:09:16,786 So if it's a digit, that means that we're wrong, 239 00:09:17,406 --> 00:09:20,546 and if it's anything but a letter or a digit, 240 00:09:20,546 --> 00:09:23,486 that must be the end of the word, so basically a space 241 00:09:23,486 --> 00:09:26,196 or something else, so that's just kind of the last condition, 242 00:09:26,196 --> 00:09:28,186 if it's not a letter or number, we must be at the end 243 00:09:28,186 --> 00:09:29,646 so then we can move on to the next word. 244 00:09:31,056 --> 00:09:31,416 Yup? 245 00:09:32,516 --> 00:09:40,676 [ Inaudible Remark ] 246 00:09:41,176 --> 00:09:43,986 >> So where that the sample text we gave you? 247 00:09:44,166 --> 00:09:47,686 So they should inside of your appliance, inside of-- 248 00:09:47,786 --> 00:09:51,326 [inaudible] the CS50/pset6/text to something, it should be 249 00:09:51,326 --> 00:09:52,556 in the pset specification, 250 00:09:52,746 --> 00:09:54,956 but they just basically a bunch of text files. 251 00:09:55,546 --> 00:09:57,266 I can look that up in the spec after if you like, 252 00:09:57,266 --> 00:09:58,906 but we should have digits on your appliance 253 00:09:58,906 --> 00:10:00,636 after you run yum update and all that stuffs. 254 00:10:01,986 --> 00:10:03,616 >> Are there questions on the speller.c? 255 00:10:03,736 --> 00:10:04,646 Yup. 256 00:10:04,646 --> 00:10:06,786 >> When you're updating a word counter, 257 00:10:06,786 --> 00:10:11,276 is it words that [inaudible] you have like whole words? 258 00:10:11,276 --> 00:10:13,716 You just [inaudible] every time? 259 00:10:13,896 --> 00:10:14,796 >> So does it [inaudible] to make a-- 260 00:10:14,796 --> 00:10:15,836 do a word array every time? 261 00:10:15,836 --> 00:10:18,596 So you'll notice that we're just reusing this single word array. 262 00:10:19,206 --> 00:10:20,756 So we only create one word array, 263 00:10:21,056 --> 00:10:23,946 but because we're always accessing at this index, 264 00:10:24,256 --> 00:10:26,586 we're just gonna keep changing where we are inside 265 00:10:26,586 --> 00:10:29,086 of this array and overwrite what was previously there. 266 00:10:29,276 --> 00:10:32,086 So we can use the same array and we're filling up this same array 267 00:10:32,086 --> 00:10:33,996 with every word in the text that we're spell-checking. 268 00:10:34,576 --> 00:10:36,746 >> Then why do we need word ++? 269 00:10:36,916 --> 00:10:38,646 >> So why do we need words ++? 270 00:10:38,766 --> 00:10:40,026 So that's a separate variable. 271 00:10:40,026 --> 00:10:41,536 So remember the array is called word 272 00:10:41,536 --> 00:10:43,146 and this one is called words. 273 00:10:43,596 --> 00:10:45,496 And so all this is doing is it keeping track 274 00:10:45,496 --> 00:10:47,346 of how many words you spell-checked just 275 00:10:47,346 --> 00:10:49,956 so we can print it out down there. 276 00:10:49,956 --> 00:10:52,326 So just for stats. 277 00:10:52,846 --> 00:10:53,096 Yup? 278 00:10:53,416 --> 00:10:56,586 >> Can you like clarify again how [inaudible] making our own 279 00:11:00,116 --> 00:11:00,456 dictionary or? 280 00:11:00,456 --> 00:11:01,536 >> So why are we keeping track 281 00:11:01,536 --> 00:11:04,746 of how many words in the dictionary? 282 00:11:04,816 --> 00:11:06,816 [ Inaudible Remark ] 283 00:11:06,886 --> 00:11:08,916 >> So you-- So you can supply speller 284 00:11:08,916 --> 00:11:12,236 with any dictionary you want, this is the default dictionary, 285 00:11:12,366 --> 00:11:15,016 this whole CS50 pset6 dictionary is large. 286 00:11:15,016 --> 00:11:16,596 So this is just one big text file. 287 00:11:16,996 --> 00:11:19,066 So we want our program to be able to tell us, well, 288 00:11:19,066 --> 00:11:20,486 how big is this text file, 289 00:11:20,486 --> 00:11:22,636 how many words are contained inside of it. 290 00:11:22,816 --> 00:11:24,466 So if you were to specify something else, 291 00:11:24,466 --> 00:11:26,476 you can specify a smaller or a bigger dictionary 292 00:11:26,476 --> 00:11:29,206 and that size still return how many words are inside of it. 293 00:11:29,676 --> 00:11:32,896 >> How do we get like different dictionary? 294 00:11:32,896 --> 00:11:34,496 >> So how would you get a different dictionary? 295 00:11:34,626 --> 00:11:36,676 So this-- all this is, is a text file. 296 00:11:37,416 --> 00:11:40,766 It's text file that each line has just a correctly 297 00:11:40,766 --> 00:11:41,386 spelled word. 298 00:11:41,826 --> 00:11:43,626 So if you can just make your own by saying, well, 299 00:11:43,626 --> 00:11:46,166 here's the word apple, here's the word banana, and that's it, 300 00:11:46,166 --> 00:11:48,226 every other word is misspelled as far as I'm concern. 301 00:11:48,226 --> 00:11:50,836 And this just happens to have most of the word 302 00:11:50,836 --> 00:11:52,276 in the English language. 303 00:11:53,546 --> 00:11:54,916 Other questions on speller? 304 00:11:55,436 --> 00:11:58,956 Okay, so now let's now move 305 00:11:58,956 --> 00:12:02,286 on to how we can actually implement those four functions. 306 00:12:02,706 --> 00:12:06,336 So the first thing we need to review is linked lists. 307 00:12:06,466 --> 00:12:07,676 So you've seen this in lecture. 308 00:12:07,676 --> 00:12:11,176 And a linked list is essentially a series of nodes. 309 00:12:11,596 --> 00:12:13,746 In each node can have one or more values. 310 00:12:14,086 --> 00:12:16,216 In this case, I have a linked list of integers. 311 00:12:16,486 --> 00:12:19,176 Each one of these nodes or these rectangles has a value 312 00:12:19,176 --> 00:12:20,936 associated with it that's an integer. 313 00:12:21,046 --> 00:12:23,796 So this person has a int value of 12, 99, 37. 314 00:12:24,316 --> 00:12:26,746 So the other part of this node is a pointer 315 00:12:26,746 --> 00:12:29,856 to the next element in my linked list. 316 00:12:29,856 --> 00:12:31,326 So in order to traverse the linked list, 317 00:12:31,326 --> 00:12:34,446 I can just keep following where these next thing is point to 318 00:12:34,596 --> 00:12:35,676 and the last element 319 00:12:35,886 --> 00:12:37,726 in my linked list is going to point to null. 320 00:12:38,136 --> 00:12:39,786 So if I ever find, well, the next element 321 00:12:39,786 --> 00:12:41,526 in my linked list is null that must mean 322 00:12:41,526 --> 00:12:43,136 that I traverse the entire linked list 323 00:12:43,136 --> 00:12:44,566 and there's nothing else there. 324 00:12:45,786 --> 00:12:48,196 So here's how we might want you to create a linked list. 325 00:12:48,436 --> 00:12:50,906 So the first thing we need to do is we need to define what each 326 00:12:50,906 --> 00:12:52,096 of these nodes look like. 327 00:12:52,826 --> 00:12:56,296 So because these nodes have more than one thing and they're not 328 00:12:56,296 --> 00:12:59,156 of the same type because I have an integer and a pointer, 329 00:12:59,356 --> 00:13:01,536 those aren't of the same type, I can't really use an array 330 00:13:01,706 --> 00:13:04,526 so I need to somehow group these variables into a single object, 331 00:13:04,636 --> 00:13:05,886 I'm going to use a struct. 332 00:13:06,406 --> 00:13:09,446 And so each of these nodes as we said, in this case, 333 00:13:09,446 --> 00:13:11,456 we're using integers; but in the case of speller, 334 00:13:11,456 --> 00:13:12,926 you're probably gonna wanna use words. 335 00:13:13,596 --> 00:13:17,036 So here each node has a word and a pointer to the next node. 336 00:13:17,546 --> 00:13:19,996 So in order to create a node, I wanna use malloc. 337 00:13:20,326 --> 00:13:21,536 And how much memory do I want? 338 00:13:21,536 --> 00:13:22,766 I just want a single node. 339 00:13:22,766 --> 00:13:26,226 So I can call sizeof node and a that's gonna give me the size 340 00:13:26,226 --> 00:13:28,046 of the struct which is just going to be 8 341 00:13:28,046 --> 00:13:29,866 because it's gonna add those two pointers up. 342 00:13:30,766 --> 00:13:31,816 So I can create 2. 343 00:13:32,026 --> 00:13:34,006 And now to set the contents of these nodes, 344 00:13:34,006 --> 00:13:37,256 I'm gonna use the arrow notation because if we've use malloc, 345 00:13:37,256 --> 00:13:38,576 which means that we have a pointer, 346 00:13:38,826 --> 00:13:40,456 which mean if we have a pointer to a struct, 347 00:13:40,456 --> 00:13:43,126 we need to use this arrow so we can set word 348 00:13:43,126 --> 00:13:44,656 by saying this and is. 349 00:13:44,816 --> 00:13:47,066 So now I have two nodes, they have different values 350 00:13:47,406 --> 00:13:49,916 and if I wanna make my first node point to the other, 351 00:13:49,916 --> 00:13:51,976 I'm gonna set that next pointer equal 352 00:13:51,976 --> 00:13:53,416 to the address of my other struct. 353 00:13:53,756 --> 00:13:56,446 So now we have one node pointing to the other 354 00:13:56,686 --> 00:13:58,816 and the second node is the end of my linked list. 355 00:13:58,816 --> 00:14:00,696 So I'm just going to set its next to null, 356 00:14:00,696 --> 00:14:02,636 because it doesn't point anywhere, because it's 357 00:14:02,636 --> 00:14:03,566 at the end of the list. 358 00:14:03,566 --> 00:14:07,726 So is everyone clear on the syntax for creating linked list? 359 00:14:08,366 --> 00:14:08,466 Yup? 360 00:14:08,466 --> 00:14:09,996 [ Inaudible Remark ] 361 00:14:09,996 --> 00:14:17,636 >> So here, so I've created two nodes, node 1 and node 2. 362 00:14:17,746 --> 00:14:21,586 And so in my linked list I have node 1 followed by node 2. 363 00:14:21,796 --> 00:14:24,986 And so the next element after node 1 is node 2. 364 00:14:26,176 --> 00:14:29,206 So that next corresponds to the next that's inside of my struct 365 00:14:29,326 --> 00:14:31,146 and then node 2 is the other variable 366 00:14:31,146 --> 00:14:35,416 that I created on the heap. 367 00:14:35,416 --> 00:14:35,483 [ Inaudible Remark ] 368 00:14:35,483 --> 00:14:36,446 >> Oh, I should say node 2, sorry. 369 00:14:36,806 --> 00:14:37,446 That's just a typo. 370 00:14:38,616 --> 00:14:38,776 Yup? 371 00:14:39,886 --> 00:14:42,316 >> Why do you need a node star? 372 00:14:42,536 --> 00:14:43,976 >> So why do we need a node star? 373 00:14:43,976 --> 00:14:44,926 So that's a good question. 374 00:14:45,686 --> 00:14:48,546 So remember that if we allocate something on the heap, 375 00:14:48,966 --> 00:14:50,736 then it can only last as long as-- 376 00:14:50,736 --> 00:14:52,196 if we allocate something on the stack, sorry. 377 00:14:52,446 --> 00:14:55,096 Then as soon as that function returns we can't really access 378 00:14:55,096 --> 00:14:55,606 it anymore. 379 00:14:56,376 --> 00:14:58,996 So inside of our spell checker, we only wanna have to load 380 00:14:58,996 --> 00:15:00,296 up the dictionary once. 381 00:15:00,616 --> 00:15:03,446 And after we load the dictionary, we wanna be able 382 00:15:03,446 --> 00:15:05,356 to keep using it every time we call check. 383 00:15:05,866 --> 00:15:08,876 So if we were to put this linked list on the stack and we load it 384 00:15:08,876 --> 00:15:10,456 in the stack frame of load or something, 385 00:15:10,456 --> 00:15:12,086 then we're gonna lose it immediately. 386 00:15:12,516 --> 00:15:15,066 Whereas if we put it in the heap then we ensure that no matter 387 00:15:15,066 --> 00:15:18,246 when this function returns, we still have access to it. 388 00:15:18,816 --> 00:15:19,556 That makes sense? 389 00:15:20,286 --> 00:15:24,716 Okay. Are there questions on creating linked lists? 390 00:15:25,866 --> 00:15:25,986 Yeah? 391 00:15:25,986 --> 00:15:31,136 >> So if we wanna have like the example 392 00:15:31,296 --> 00:15:32,766 on [inaudible] could have node star on those-- 393 00:15:32,766 --> 00:15:34,276 the brackets [inaudible]? 394 00:15:34,386 --> 00:15:36,256 >> So if you wanted to create an array of nodes? 395 00:15:37,516 --> 00:15:39,286 So, sure. So if you have an array 396 00:15:39,286 --> 00:15:40,436 and that array just has a type, 397 00:15:40,556 --> 00:15:42,936 you could hypothetically create an array the held a bunch 398 00:15:42,936 --> 00:15:43,496 of nodes. 399 00:15:43,766 --> 00:15:46,036 It wouldn't be a linked list, strictly speaking, 400 00:15:46,036 --> 00:15:47,036 because you have an array 401 00:15:47,036 --> 00:15:48,546 and a linked list is different than an array. 402 00:15:48,876 --> 00:15:51,916 But because the struct is just like any other type in C, 403 00:15:51,916 --> 00:15:52,986 you could create arrays of [inaudible], 404 00:15:52,986 --> 00:15:53,906 you could do whatever you want. 405 00:15:54,576 --> 00:15:56,626 So that would be possible if just node star array 406 00:15:56,626 --> 00:15:58,926 and then give it some link, which we'll see later. 407 00:15:59,406 --> 00:16:02,896 Other questions on creating linked lists? 408 00:16:05,126 --> 00:16:08,746 Okay. So once we have some linked list in memory, 409 00:16:08,746 --> 00:16:10,056 we wanna be able to traverse it. 410 00:16:10,466 --> 00:16:13,096 By traverse, I mean just look at every single element 411 00:16:13,246 --> 00:16:15,816 and maybe do something with the value or compare the values. 412 00:16:16,076 --> 00:16:19,576 And to do that, we first wanna create some new pointer 413 00:16:19,576 --> 00:16:21,286 and we're gonna call this the iterator. 414 00:16:21,286 --> 00:16:23,276 So the iterator is just going to be pointing 415 00:16:23,276 --> 00:16:25,496 at every single element of the linked list and then 416 00:16:25,496 --> 00:16:27,496 from that iterator we can determine what the value 417 00:16:27,496 --> 00:16:29,906 of that node is, what its next pointer is et cetera. 418 00:16:30,706 --> 00:16:31,536 So we wanna loop 419 00:16:31,536 --> 00:16:33,506 until eventually this iterator is null. 420 00:16:33,856 --> 00:16:36,736 So because this special iterator pointer is gonna point 421 00:16:36,736 --> 00:16:39,276 at every single note, if eventually starts pointing 422 00:16:39,276 --> 00:16:42,006 at null that means that we must have traversed the entire linked 423 00:16:42,006 --> 00:16:45,156 list because remember we said that the last element has 424 00:16:45,156 --> 00:16:46,646 to have a next value of null. 425 00:16:47,736 --> 00:16:50,056 So every point in a loop, this iterator can point 426 00:16:50,056 --> 00:16:53,526 at a different element and from that we can print out what 427 00:16:53,526 --> 00:16:54,476 that element looks like. 428 00:16:55,006 --> 00:16:57,846 So here let's just assume that we have some linked list 429 00:16:58,096 --> 00:17:00,376 and we've just stored the root of the linked list 430 00:17:00,376 --> 00:17:02,996 or the first element in some pointer called "first." 431 00:17:03,656 --> 00:17:05,826 So now we're creating some new iterator node. 432 00:17:06,046 --> 00:17:10,206 It's initially pointing at the first element of my linked list 433 00:17:11,266 --> 00:17:13,356 so while the iterator is [inaudible] to the null 434 00:17:13,356 --> 00:17:15,526 that must mean that we're still at a node. 435 00:17:15,526 --> 00:17:18,386 So it means, we can keep going, we can print 436 00:17:18,386 --> 00:17:21,566 out the word associated with the current node. 437 00:17:22,026 --> 00:17:24,006 So right now, let's just print out the first word. 438 00:17:24,006 --> 00:17:26,076 So back to our example, this would say this. 439 00:17:26,496 --> 00:17:29,196 So after that, we're gonna change our iterator to point 440 00:17:29,376 --> 00:17:31,486 at the next element in our linked list. 441 00:17:32,146 --> 00:17:33,516 So now, once we get back 442 00:17:33,516 --> 00:17:35,846 up to the while loop we're still not null, because we have 443 00:17:35,846 --> 00:17:37,836 that second struct which has the word "is" 444 00:17:38,136 --> 00:17:39,706 so I print out is next. 445 00:17:40,276 --> 00:17:43,936 And so now, once we next again, iterator is gonna be null 446 00:17:43,936 --> 00:17:46,326 because the next element of the last element 447 00:17:46,326 --> 00:17:47,786 in our linked list is gonna be null. 448 00:17:48,096 --> 00:17:49,696 So that means our loop is gonna terminate 449 00:17:49,696 --> 00:17:50,686 and our iterator has gone 450 00:17:50,686 --> 00:17:52,526 through our very short linked list. 451 00:17:53,046 --> 00:17:55,216 So does that make sense to everyone, 452 00:17:55,216 --> 00:17:56,896 how we're gonna iterate through? 453 00:17:57,306 --> 00:18:03,566 Awesome! So now we have to worry about freeing linked list, 454 00:18:03,566 --> 00:18:06,076 because remember we need to unload everything. 455 00:18:06,666 --> 00:18:08,386 So in order to free linked list, 456 00:18:08,386 --> 00:18:10,096 we can't just free the first element. 457 00:18:10,176 --> 00:18:12,296 Free isn't smart enough to know it's a linked list 458 00:18:12,296 --> 00:18:14,046 and to traverse the whole thing and free everything. 459 00:18:14,446 --> 00:18:17,396 Instead, we need to explicitly free everything 460 00:18:17,396 --> 00:18:19,056 in our linked list individually. 461 00:18:19,676 --> 00:18:22,826 But what's important to note is that once you call free 462 00:18:22,826 --> 00:18:26,356 on some node, that means you can't access what was inside 463 00:18:26,356 --> 00:18:28,486 if it's word or it's next pointer anymore. 464 00:18:28,906 --> 00:18:31,266 So freeing it means you can't do anything with it anymore, 465 00:18:31,496 --> 00:18:32,526 it's effectively gone. 466 00:18:33,416 --> 00:18:37,976 So in order to do this, before we determine what-- 467 00:18:37,976 --> 00:18:39,716 before we can free, we need to figure 468 00:18:39,716 --> 00:18:40,916 out where we're gonna go next. 469 00:18:41,566 --> 00:18:43,146 So this could look something like this. 470 00:18:43,506 --> 00:18:46,386 So again, we have this iterator and this iterator is going 471 00:18:46,386 --> 00:18:49,266 to point at every single element in our linked list separately. 472 00:18:49,346 --> 00:18:51,476 So we're starting at the beginning of the list. 473 00:18:51,976 --> 00:18:55,416 And so now, we're gonna save where this iterator was. 474 00:18:55,886 --> 00:18:57,966 So this is just some address, remember. 475 00:18:58,456 --> 00:19:01,426 So now, we can move our irritator to the next element 476 00:19:01,806 --> 00:19:04,766 but this node, n, and still remember where the iterator was. 477 00:19:05,116 --> 00:19:07,146 So even though that this iterator pointer is 478 00:19:07,146 --> 00:19:10,376 on the next element of our list, n remembers where it was. 479 00:19:10,866 --> 00:19:13,716 So now we can free n because we already know 480 00:19:13,716 --> 00:19:14,926 where iterator needs to go, 481 00:19:14,926 --> 00:19:16,956 and if we wipeout n that's no problem, 482 00:19:16,956 --> 00:19:19,706 and then we can continue with our loop until we hit null. 483 00:19:20,286 --> 00:19:23,776 And so this is one way that you can free an entire linked list. 484 00:19:23,776 --> 00:19:26,556 So you could also do this recursively, right. 485 00:19:26,556 --> 00:19:29,396 We could say, well, while irritator was not null, 486 00:19:29,396 --> 00:19:32,296 we wanna keep calling recursively our free functions 487 00:19:32,296 --> 00:19:33,266 or whatever function where-- 488 00:19:33,266 --> 00:19:34,926 like free linked list or whatever we call it. 489 00:19:34,926 --> 00:19:38,316 And we keep calling that until eventually we hit null that's 490 00:19:38,316 --> 00:19:41,286 gonna be our base case and then we wanna start calling free. 491 00:19:41,286 --> 00:19:43,296 With that it will effectively do is free the linked 492 00:19:43,296 --> 00:19:44,256 list backwards. 493 00:19:44,696 --> 00:19:46,756 And so you'll hit every single element at least once 494 00:19:46,996 --> 00:19:48,256 and your whole linked list will be free. 495 00:19:48,906 --> 00:19:51,086 So both of those implantations are totally valid 496 00:19:51,086 --> 00:19:53,416 and they're gonna do the same thing at the end of the day. 497 00:19:54,466 --> 00:19:56,686 So any questions on why we needed this second node 498 00:19:56,686 --> 00:19:58,776 or why we can't just free the first one? 499 00:20:01,296 --> 00:20:04,076 >> Okay. So that's it for linked list. 500 00:20:04,076 --> 00:20:07,836 So now let's move on to a more complex data structure and one 501 00:20:07,836 --> 00:20:10,736 that you'll likely use to represent your dictionary inside 502 00:20:10,736 --> 00:20:15,036 of speller.c. So a hash table is a data structure that's going 503 00:20:15,036 --> 00:20:17,166 to map keys to values. 504 00:20:18,216 --> 00:20:21,716 So a hash table has some fixed number of buckets. 505 00:20:22,206 --> 00:20:25,486 So in C we already have some data structure that keeps track 506 00:20:25,486 --> 00:20:27,036 of a list of a fixed size. 507 00:20:27,266 --> 00:20:27,926 That's just an array. 508 00:20:28,516 --> 00:20:31,026 So we can represent our hash table as an array 509 00:20:31,506 --> 00:20:34,346 where each bucket is a different element of our array. 510 00:20:35,196 --> 00:20:38,336 So also essential to our hash table is some hash function. 511 00:20:38,996 --> 00:20:43,026 So what a hash function does it is going to map each value 512 00:20:43,026 --> 00:20:44,726 that we wanna look up on our hash table 513 00:20:44,956 --> 00:20:46,766 to some bucket in our array. 514 00:20:46,996 --> 00:20:50,396 Now it's important that this hash function is the same every 515 00:20:50,396 --> 00:20:52,146 time so if we call hash CS50 516 00:20:52,146 --> 00:20:54,816 and get back a number then we call hash CS50 again, 517 00:20:55,326 --> 00:20:58,006 we need to make sure we get back the same number 518 00:20:58,286 --> 00:21:00,726 or else our hash function is not going to work correctly. 519 00:21:00,726 --> 00:21:03,516 So we can't do things like returning a random value 520 00:21:03,846 --> 00:21:05,116 because of you call that twice 521 00:21:05,166 --> 00:21:07,516 on the same input there's a good chance you're gonna get back 522 00:21:07,856 --> 00:21:11,506 different values so that's not really gonna work. 523 00:21:11,696 --> 00:21:14,436 So what are-- in this case what our hash function needs 524 00:21:14,436 --> 00:21:17,896 to do is it needs to map some word to some number. 525 00:21:18,366 --> 00:21:20,396 But what's important is 526 00:21:20,396 --> 00:21:23,956 that this number does not exceed the length of our hash table. 527 00:21:24,436 --> 00:21:26,746 If we have ten buckets that hash table 528 00:21:26,746 --> 00:21:28,936 and our hash function gives us back the number 20, 529 00:21:29,286 --> 00:21:30,856 we're in trouble because we don't know what 530 00:21:30,856 --> 00:21:32,046 to do with that number 20. 531 00:21:32,426 --> 00:21:33,996 So when you're designing your hash function you need 532 00:21:33,996 --> 00:21:36,776 to make sure that it doesn't exceed the number of buckets 533 00:21:36,776 --> 00:21:38,276 or return some negative number 534 00:21:38,276 --> 00:21:39,836 or something that's not even a number 535 00:21:40,196 --> 00:21:42,286 or else you're not gonna be able to get your hash table working. 536 00:21:42,666 --> 00:21:46,866 So here's an example of a hash function for this hash table. 537 00:21:47,316 --> 00:21:49,456 So here on the left here we have our keys. 538 00:21:49,796 --> 00:21:51,766 So in this case, these could be the words inside 539 00:21:51,766 --> 00:21:52,436 of your dictionary. 540 00:21:53,216 --> 00:21:54,716 So these keys are gonna be put 541 00:21:54,716 --> 00:21:57,566 through some hash function that's going to return a number. 542 00:21:57,566 --> 00:21:59,696 So you see John Smith here. 543 00:21:59,696 --> 00:22:02,246 After you put that through a magical hash function it's gonna 544 00:22:02,246 --> 00:22:04,566 map to the number 152. 545 00:22:05,456 --> 00:22:07,496 And then Lisa Smith is gonna map to 1 546 00:22:08,186 --> 00:22:09,546 and so we can insert these nodes 547 00:22:09,546 --> 00:22:11,746 onto that bucket inside of our hash table. 548 00:22:12,306 --> 00:22:13,476 So here's what this could look like. 549 00:22:13,876 --> 00:22:16,816 Well, if you gave me John Smith, I'm gonna returns to you 152. 550 00:22:17,186 --> 00:22:19,256 If you gave me Lisa Smith, I'm gonna return your 1. 551 00:22:19,786 --> 00:22:21,956 So this is probably not the best hash function ever 552 00:22:22,166 --> 00:22:24,136 because we're returning anything that's not one 553 00:22:24,136 --> 00:22:24,896 of these four names. 554 00:22:24,896 --> 00:22:27,276 We're just sending them to bucket 153. 555 00:22:28,456 --> 00:22:29,566 So here's another example 556 00:22:29,566 --> 00:22:31,526 to hash function that's still not the great. 557 00:22:31,526 --> 00:22:34,126 So what we're doing here is we say, "Okay, 558 00:22:34,126 --> 00:22:36,026 we're gonna start off my result at 0." 559 00:22:36,406 --> 00:22:38,896 So result is gonna be whatever this hash table returns. 560 00:22:39,676 --> 00:22:40,716 So now we're gonna iterate 561 00:22:40,716 --> 00:22:43,076 over every single character in a word. 562 00:22:44,206 --> 00:22:47,396 So we know that because that this is a string, each element 563 00:22:47,396 --> 00:22:49,786 of this word array is a character remember 564 00:22:49,786 --> 00:22:52,506 because we have ASCII, we could map each of these characters 565 00:22:52,546 --> 00:22:53,916 to a number automatically. 566 00:22:54,466 --> 00:22:56,596 So what this does, it effectively just sums 567 00:22:56,596 --> 00:22:59,036 up the ASCII values of everything inside of you word. 568 00:22:59,746 --> 00:23:02,386 And so now, we can't just return result. 569 00:23:02,986 --> 00:23:04,976 Because if you have a hash table of size 10, 570 00:23:05,286 --> 00:23:07,436 this result is probably gonna be some big number 571 00:23:07,436 --> 00:23:08,506 in a hundreds or thousands. 572 00:23:08,926 --> 00:23:09,996 So it's just not gonna work. 573 00:23:10,566 --> 00:23:13,846 So in order to make sure that we never exceed the length 574 00:23:13,846 --> 00:23:16,366 of our hash table, we can use this modulus operator. 575 00:23:16,866 --> 00:23:19,636 So what this does remember is it effectively establishes some 576 00:23:19,636 --> 00:23:20,306 upper bound. 577 00:23:20,906 --> 00:23:24,036 So we say, okay, if we do result mod hash table size 578 00:23:24,306 --> 00:23:26,576 with this just constant that I defined somewhere, 579 00:23:26,876 --> 00:23:27,926 remember it's gonna be constant 580 00:23:27,926 --> 00:23:30,086 because our hash table is never changing size, 581 00:23:30,086 --> 00:23:32,226 we still always have the same number of buckets. 582 00:23:32,796 --> 00:23:34,536 If we modify that hash table size, 583 00:23:34,536 --> 00:23:37,036 we're gonna get some number that's definitely less 584 00:23:37,596 --> 00:23:41,216 than hash table size which mean it's definitely a valid thing-- 585 00:23:41,216 --> 00:23:44,026 a valid place to insert our node. 586 00:23:44,696 --> 00:23:45,816 Does that make sense? 587 00:23:46,036 --> 00:23:47,496 Why this hash function isn't so great 588 00:23:47,496 --> 00:23:48,706 but it's a little better than this one? 589 00:23:49,206 --> 00:23:54,106 So the reason that this is a problem is 590 00:23:54,106 --> 00:23:57,946 because we can have a lot values mapped to the same bucket. 591 00:23:58,896 --> 00:24:01,966 Now if two values mapped to the same bucket, we can't just wipe 592 00:24:01,966 --> 00:24:03,686 out whatever was there, right. 593 00:24:03,686 --> 00:24:05,666 Because then you could effectively only have a 594 00:24:05,666 --> 00:24:09,226 dictionary that has size of over many buckets on your hash table. 595 00:24:09,256 --> 00:24:10,446 We just start wiping things out. 596 00:24:10,446 --> 00:24:12,786 What we need to do is keep tract 597 00:24:12,786 --> 00:24:15,636 of everything that's still mapped to that same bucket. 598 00:24:16,276 --> 00:24:20,086 So one way to do this is instead of just storing a word inside 599 00:24:20,086 --> 00:24:22,706 of your hash table is to store a linked list 600 00:24:23,086 --> 00:24:25,756 that contains everything that hash to that value. 601 00:24:26,636 --> 00:24:28,326 So why does this need to need to be a linked list? 602 00:24:28,456 --> 00:24:31,546 Well, we don't know in advance how many words are going 603 00:24:31,546 --> 00:24:32,716 to hash to the same value. 604 00:24:33,376 --> 00:24:35,726 So the size of this list is going to be dynamic. 605 00:24:36,306 --> 00:24:40,166 Remember in C, we can't just dynamically resize an array we 606 00:24:40,166 --> 00:24:42,066 have to create a new one and copy everything over. 607 00:24:42,516 --> 00:24:44,836 With the linked list, we see, it's really easy 608 00:24:44,836 --> 00:24:46,676 to just insert a new element. 609 00:24:47,276 --> 00:24:50,086 So by representing this as a linked list in our hash table, 610 00:24:50,336 --> 00:24:53,436 if we say, okay, well I found another word that hash to 152, 611 00:24:53,476 --> 00:24:55,406 I can just add it on to that linked list. 612 00:24:55,806 --> 00:24:58,256 So that's what's going on in this diagram. 613 00:24:58,516 --> 00:24:59,756 You'll see that John Smith 614 00:24:59,756 --> 00:25:01,996 and Sandra Dee just maps to the same value. 615 00:25:01,996 --> 00:25:03,606 They both got a 152. 616 00:25:03,896 --> 00:25:05,556 So we can't just wipeout one of them 617 00:25:05,556 --> 00:25:07,736 and only keep the last one to hash that value. 618 00:25:08,136 --> 00:25:10,346 Instead what we're doing is creating a linked list. 619 00:25:10,966 --> 00:25:14,316 So this, what slot, 152 is going to point 620 00:25:14,646 --> 00:25:17,276 to some node that's just a node in a linked list. 621 00:25:17,656 --> 00:25:20,316 And so it could have a word value of John Smith and then 622 00:25:20,316 --> 00:25:23,116 as its next pointer, we're gonna point to the other node 623 00:25:23,116 --> 00:25:25,936 that maps to 152 and if this-- 624 00:25:25,936 --> 00:25:28,776 the next pointer in this node points to nothing then we know 625 00:25:28,776 --> 00:25:30,826 that we have our entire linked list, 626 00:25:31,756 --> 00:25:37,836 so that conceptually makes sense to everybody? 627 00:25:38,396 --> 00:25:42,406 Okay. So this could be how we're going 628 00:25:42,406 --> 00:25:45,446 to represent our hash stable inside of dictionary.c. 629 00:25:45,946 --> 00:25:48,826 So again, we have some node and inside of it we want 630 00:25:48,826 --> 00:25:50,696 to keep track of some word. 631 00:25:51,736 --> 00:25:54,346 So now we could do something like, well, 632 00:25:54,556 --> 00:25:58,716 we know that this word is going to be at most linked letters 633 00:25:58,716 --> 00:26:01,296 and we need to add one for the null character. 634 00:26:01,296 --> 00:26:04,566 So that mode's length, whatever this constant link is, + 1. 635 00:26:05,216 --> 00:26:07,076 So that saves us the trouble of having to worry 636 00:26:07,076 --> 00:26:08,786 about how long the word is, right. 637 00:26:08,786 --> 00:26:11,436 Because if we some array that's definitely big enough, 638 00:26:11,886 --> 00:26:14,096 then we can just fill up that array with everything inside 639 00:26:14,096 --> 00:26:15,816 of our word and we could have some wasted space, 640 00:26:16,096 --> 00:26:16,986 but then that's okay. 641 00:26:17,706 --> 00:26:19,846 If you want to save yourself a little bit of space, 642 00:26:20,236 --> 00:26:23,506 then you'd have to use string length to get the number 643 00:26:23,506 --> 00:26:24,806 of words inside of your string. 644 00:26:24,806 --> 00:26:25,946 And then you could use malloc 645 00:26:26,226 --> 00:26:27,976 to create a dynamically size string. 646 00:26:28,476 --> 00:26:30,196 So this is a little bit of a simpler approach, 647 00:26:30,196 --> 00:26:32,686 we can just say, "Well, let's just allocate enough space 648 00:26:32,686 --> 00:26:34,346 for everywhere to not worry about having 649 00:26:34,346 --> 00:26:35,346 to create a bigger arrays." 650 00:26:35,646 --> 00:26:37,426 Or if you want to save yourself some space, 651 00:26:37,616 --> 00:26:40,776 then you can start malloc'ing word that's based on the size 652 00:26:40,776 --> 00:26:42,856 of the word that's gonna go inside of this node. 653 00:26:43,806 --> 00:26:46,206 So again, we still need to have this next pointer 654 00:26:46,546 --> 00:26:48,646 because we have a linked list and the only way 655 00:26:48,646 --> 00:26:51,226 to traverse the linked list is to keep looking at each 656 00:26:51,226 --> 00:26:52,466 of these next pointers. 657 00:26:53,086 --> 00:26:55,916 So that's the representation of a single node. 658 00:26:56,426 --> 00:26:58,346 So now our hash table is going 659 00:26:58,346 --> 00:27:01,306 to be an array of pointers to nodes. 660 00:27:02,166 --> 00:27:03,026 So why pointers? 661 00:27:03,026 --> 00:27:05,636 Well remember that we need to puts these things on a heap 662 00:27:05,906 --> 00:27:07,896 or else there is no real good place to put them 663 00:27:07,896 --> 00:27:09,226 because the stack is gonna wipe them 664 00:27:09,226 --> 00:27:10,706 out as soon as functions returns. 665 00:27:11,206 --> 00:27:13,196 So there need to be pointers because everything 666 00:27:13,196 --> 00:27:16,066 on the heap is just a pointer to wherever that memory block was. 667 00:27:16,606 --> 00:27:18,356 So now, how many buckets do we have? 668 00:27:18,676 --> 00:27:20,626 Well we can just define some arbitrary constant, 669 00:27:20,666 --> 00:27:22,026 something like hash table size. 670 00:27:22,706 --> 00:27:25,036 And that's gonna tell us how many buckets are inside 671 00:27:25,036 --> 00:27:26,396 of our hash table and each 672 00:27:26,396 --> 00:27:28,846 of those buckets is going to point to a node. 673 00:27:29,536 --> 00:27:31,316 So if I access in this array 674 00:27:31,406 --> 00:27:32,806 which is just like any old array. 675 00:27:32,806 --> 00:27:34,376 If I said hash table I. 676 00:27:34,896 --> 00:27:38,396 Well, that's going to give me the value of the address 677 00:27:38,606 --> 00:27:41,576 of the first node in the linked list of everything 678 00:27:41,576 --> 00:27:42,706 that maps to that bucket. 679 00:27:43,936 --> 00:27:45,666 So that's just gonna be the equivalent 680 00:27:45,666 --> 00:27:47,296 of the root of some linked list. 681 00:27:47,566 --> 00:27:49,656 And we can traverse that linked list just 682 00:27:49,656 --> 00:27:51,266 like we traverse any old other linked list. 683 00:27:52,336 --> 00:27:58,316 So questions on how our hash table is gonna look in memory? 684 00:27:58,406 --> 00:28:03,246 Nice. Alright, so now that's kind of a high level overview 685 00:28:03,246 --> 00:28:05,076 of what everything is gonna look like. 686 00:28:05,506 --> 00:28:08,276 So now let's get into how we're actually gonna implement those 687 00:28:08,276 --> 00:28:10,546 four functions load, check, size and unload. 688 00:28:10,546 --> 00:28:14,666 So remember the goal for load is to get every word 689 00:28:14,666 --> 00:28:16,526 in the dictionary into memory somehow. 690 00:28:16,526 --> 00:28:19,176 So you just need to get it there so you don't need to continue 691 00:28:19,516 --> 00:28:21,826 to keep opening dictionary, iterating through it 692 00:28:21,826 --> 00:28:23,966 which would work but it would be really, really slow 693 00:28:23,966 --> 00:28:25,086 and we probably wouldn't do so well. 694 00:28:26,366 --> 00:28:29,186 So in order to do this, we need to somehow iterate 695 00:28:29,386 --> 00:28:31,726 over every word in the dictionary text file 696 00:28:31,966 --> 00:28:32,926 because we know that we need 697 00:28:32,926 --> 00:28:34,876 to eventually load those words into memory. 698 00:28:35,696 --> 00:28:37,276 So remember from the last pset, 699 00:28:37,276 --> 00:28:40,566 we have this nice function called feof that's going 700 00:28:40,566 --> 00:28:43,016 to return if or at the end of some file. 701 00:28:44,116 --> 00:28:46,976 So if we're looking to iterate over the entire dictionary file, 702 00:28:47,356 --> 00:28:50,596 we wanna loop until this end of file returns true, 703 00:28:51,176 --> 00:28:54,016 so very similar to the last pset. 704 00:28:54,016 --> 00:28:55,766 And now, once we read a word, 705 00:28:55,766 --> 00:28:58,076 we need to insert each word individually 706 00:28:58,306 --> 00:28:59,226 into our hash table. 707 00:28:59,706 --> 00:29:03,676 So to do that, we need to create some node 708 00:29:03,866 --> 00:29:05,336 for each word that we read in. 709 00:29:05,336 --> 00:29:08,286 So that means that we wanna malloc some new node. 710 00:29:08,286 --> 00:29:12,236 We're gonna call this node n throughout the example. 711 00:29:12,236 --> 00:29:13,636 So now we have some new node. 712 00:29:13,636 --> 00:29:15,566 It lives in memory somewhere but there's nothing in it. 713 00:29:16,336 --> 00:29:20,036 So we need to fill in that node which looks just like this 714 00:29:20,496 --> 00:29:23,276 with each of these values that nodes need to have a word 715 00:29:23,276 --> 00:29:25,536 and that node needs to have a pointer 716 00:29:25,536 --> 00:29:27,336 to the next element in the linked list. 717 00:29:28,586 --> 00:29:31,216 So to get the word, we can use this handy-dandy 718 00:29:31,216 --> 00:29:32,246 function fscanf. 719 00:29:33,076 --> 00:29:36,036 What this is going to do is it's going to read a string 720 00:29:37,016 --> 00:29:39,406 from the dictionary and it's gonna store it 721 00:29:39,406 --> 00:29:40,716 into some point to that you specify. 722 00:29:41,606 --> 00:29:43,886 So I can just store this inside of n word 723 00:29:44,336 --> 00:29:47,696 or the word array that's inside of my node, n, and I can read it 724 00:29:47,696 --> 00:29:49,686 from my file pointer whatever that may look like. 725 00:29:50,416 --> 00:29:53,676 So if I continue to call fscanf in a loop and we're going 726 00:29:53,676 --> 00:29:56,386 to read in our dictionary, one word at a time. 727 00:29:56,906 --> 00:29:59,286 So let's just take a look how we can do that. 728 00:30:00,576 --> 00:30:01,916 >> So here's my fscanf. 729 00:30:02,806 --> 00:30:04,676 And so I have this file text.txt. 730 00:30:04,876 --> 00:30:07,886 And so this, as you probably guess at this point, 731 00:30:07,916 --> 00:30:10,406 just says this is CS50 one word per line. 732 00:30:11,486 --> 00:30:13,486 So now we can open it up. 733 00:30:13,486 --> 00:30:16,216 With f open, we're gonna specify r 'cause we're only reading 734 00:30:16,216 --> 00:30:16,756 this file. 735 00:30:17,316 --> 00:30:20,196 And so now, we can create this array that's definitely big 736 00:30:20,196 --> 00:30:23,766 enough to hold the word, right, 46 there's no word that's more 737 00:30:23,766 --> 00:30:25,836 than 46 characters inside of this CS50. 738 00:30:26,796 --> 00:30:31,106 So now when we call fscanf on fp and we specify percent s, 739 00:30:31,496 --> 00:30:34,396 this tells fscanf that I want to read in a string. 740 00:30:35,006 --> 00:30:38,116 Where I wanna store that string, I wanna store it at word 741 00:30:38,526 --> 00:30:40,296 which is an array so that's an address. 742 00:30:41,156 --> 00:30:43,326 So now after this function executes, 743 00:30:43,546 --> 00:30:46,886 I now have that word saved inside of this null string 744 00:30:47,086 --> 00:30:49,546 and I can print out both the word itself. 745 00:30:49,656 --> 00:30:52,786 So just by using percent s and I can print out its length. 746 00:30:53,276 --> 00:30:55,806 So if we were to execute this I can say, make fscanf. 747 00:30:56,016 --> 00:31:01,136 And if I call fscanf, I've just printed out my first word 748 00:31:01,136 --> 00:31:03,056 which is this and I printed out the length 749 00:31:03,056 --> 00:31:04,556 of the word which is 4. 750 00:31:05,176 --> 00:31:06,746 So now if I call this twice 751 00:31:07,646 --> 00:31:13,026 if I said here I just called it again fscanf fp percent s word, 752 00:31:13,116 --> 00:31:17,516 what is this gonna do? 753 00:31:17,516 --> 00:31:17,746 [ Inaudible Remark ] 754 00:31:17,746 --> 00:31:18,446 >> "Is" exactly. 755 00:31:18,446 --> 00:31:21,186 So we don't need to do something like fseek, remove the cursor 756 00:31:21,186 --> 00:31:22,686 over just like in the last pset 757 00:31:22,686 --> 00:31:26,086 if we call fscanf again it's gonna know, okay, 758 00:31:26,086 --> 00:31:28,606 we've already read one word so my cursors moved 759 00:31:28,606 --> 00:31:31,186 over a little bit and the next time I call it we're gonna get a 760 00:31:31,186 --> 00:31:31,806 different word. 761 00:31:32,276 --> 00:31:35,296 So if I call fscanf, we're now on to "is." 762 00:31:35,296 --> 00:31:38,196 So does everyone understand how that works? 763 00:31:38,706 --> 00:31:42,856 'Cause we're all pros at File I/O after pset5 advice. 764 00:31:43,266 --> 00:31:47,686 So now that we've read in this word-- oh, question first? 765 00:31:47,686 --> 00:31:49,306 [ Inaudible Remark ] 766 00:31:49,306 --> 00:31:53,496 >> So does fscanf ignore spaces? 767 00:31:53,496 --> 00:31:56,356 Yes, because the string is going to ignore spaces, 768 00:31:56,356 --> 00:31:59,116 so because you said percent s, you could have said 769 00:31:59,116 --> 00:32:00,896 like something percent s space percent s 770 00:32:00,896 --> 00:32:03,206 and that would read into, but because there's no s-- 771 00:32:03,736 --> 00:32:06,456 space inside that specifier it will ignore them. 772 00:32:07,246 --> 00:32:08,196 Other questions? 773 00:32:08,766 --> 00:32:14,206 Okay, so now that we have our word, we now need to hash it. 774 00:32:14,966 --> 00:32:17,406 So because we've used fscanf correctly that means 775 00:32:17,406 --> 00:32:20,846 that this word now contains some word from our dictionary. 776 00:32:21,356 --> 00:32:24,246 So now, we just need to hash it, which is great 777 00:32:24,246 --> 00:32:25,956 because remember our hash function takes 778 00:32:25,956 --> 00:32:29,666 as an argument a string, it's going to return an integer 779 00:32:29,976 --> 00:32:32,846 so some string value is gonna return where to put it. 780 00:32:32,846 --> 00:32:36,606 And so remember that our hash table is going to-- 781 00:32:36,606 --> 00:32:39,756 is going to contain pointers to the start of linked list. 782 00:32:40,296 --> 00:32:43,696 So after we hash some word, we basically have two cases. 783 00:32:44,166 --> 00:32:47,166 So this variable index is gonna be where inside 784 00:32:47,166 --> 00:32:49,116 of my hash table I want to put this word. 785 00:32:49,746 --> 00:32:51,626 So the first case is that when we go and look 786 00:32:51,626 --> 00:32:53,276 at that bucket there's nothing there. 787 00:32:54,086 --> 00:32:57,116 So that means that this is the first word so far that's going 788 00:32:57,116 --> 00:32:59,036 to map to the hash value index. 789 00:32:59,686 --> 00:33:02,726 So in that case we just need to start creating our linked list. 790 00:33:02,726 --> 00:33:05,916 So that's really easy, we just want hash table index 791 00:33:06,266 --> 00:33:08,366 which it makes an array of pointers, we just wanna make 792 00:33:08,366 --> 00:33:12,756 that point to the node we just created, so in this case n. 793 00:33:12,796 --> 00:33:14,426 And now because n is now the end 794 00:33:14,426 --> 00:33:17,176 of our linked list 'cause our linked list just has one element 795 00:33:17,176 --> 00:33:19,176 in it, we wanna make that point to null. 796 00:33:20,076 --> 00:33:21,936 So now we've just started off our linked list, 797 00:33:21,936 --> 00:33:24,676 it only has one element and that's the node we just created. 798 00:33:24,676 --> 00:33:28,126 Does that make sense? 799 00:33:28,326 --> 00:33:30,436 So our other case is if it's not null. 800 00:33:30,976 --> 00:33:33,866 So that means that, okay, well, I've made a collision. 801 00:33:34,116 --> 00:33:36,506 Some word has already hashed to the value 802 00:33:36,506 --> 00:33:37,956 that I'm trying to put this word at. 803 00:33:38,746 --> 00:33:41,236 So that means that this can't be pointed to null, 804 00:33:41,426 --> 00:33:45,636 instead it's pointing to what is the start of by linked list. 805 00:33:46,586 --> 00:33:49,616 So we wanna insert to the beginning of the linked list 806 00:33:49,876 --> 00:33:52,456 and not the end because in order to insert 807 00:33:52,456 --> 00:33:55,486 to the end we're gonna need to traverse the whole thing. 808 00:33:55,956 --> 00:33:58,596 Now speller is all about being really efficient with our time 809 00:33:58,866 --> 00:34:01,526 and so to go through the entire linked list just for the sake 810 00:34:01,526 --> 00:34:04,106 of inserting a word is gonna be a big waste of time. 811 00:34:04,746 --> 00:34:06,836 So to avoid having to traverse every element 812 00:34:06,836 --> 00:34:08,916 which could take awhile for a linked list is pretty long, 813 00:34:09,176 --> 00:34:11,716 we can just insert it the beginning of the list. 814 00:34:12,326 --> 00:34:13,416 And so how do we do that? 815 00:34:13,986 --> 00:34:15,416 Well, we need to do two things. 816 00:34:15,786 --> 00:34:18,236 We need to change the pointer of our new node. 817 00:34:18,236 --> 00:34:21,306 We need to make that point to somewhere in our old list 818 00:34:21,786 --> 00:34:23,606 and we need to change the value of hash table. 819 00:34:24,426 --> 00:34:27,726 So the first thing we wanna do is make our new node point 820 00:34:27,726 --> 00:34:30,466 to what was at the beginning of the linked list. 821 00:34:31,786 --> 00:34:34,416 So that way, we now have two things pointing 822 00:34:34,416 --> 00:34:35,326 at what was the beginning. 823 00:34:35,326 --> 00:34:36,886 We have hash tables pointing there 824 00:34:37,106 --> 00:34:38,896 and now our new node points there. 825 00:34:39,546 --> 00:34:41,346 So now that our new node points there, 826 00:34:41,486 --> 00:34:44,846 we can change what hash table index pointed to 'cause 827 00:34:44,846 --> 00:34:47,406 if we did this in the reverse order we would effectively lose 828 00:34:47,406 --> 00:34:49,136 where our list started which is a problem. 829 00:34:49,806 --> 00:34:52,196 So now that our first element is now pointing 830 00:34:52,196 --> 00:34:54,406 to what is now the second element we can say "Okay, well, 831 00:34:54,406 --> 00:34:55,676 this is my new first element." 832 00:34:55,796 --> 00:34:57,936 So that means hash table index should point 833 00:34:57,936 --> 00:34:59,616 to the node that I just created. 834 00:35:00,026 --> 00:35:01,176 That makes sense? 835 00:35:01,176 --> 00:35:05,286 Alright. So now on to size. 836 00:35:05,866 --> 00:35:10,666 So this is probably the easiest part right. 837 00:35:11,136 --> 00:35:13,556 So given some hash table to calculate size, 838 00:35:13,556 --> 00:35:16,336 you could do something like iterate over every bucket 839 00:35:16,336 --> 00:35:18,826 on the hash table, for every bucket on hash table go 840 00:35:18,826 --> 00:35:20,276 through to every element in the linked list 841 00:35:20,276 --> 00:35:22,436 and have some counter that counts it up and do 842 00:35:22,436 --> 00:35:23,746 that every time that you call size. 843 00:35:24,896 --> 00:35:27,046 But again spellers are all about speed 844 00:35:27,046 --> 00:35:29,156 and so that's kind of a waste of time. 845 00:35:29,156 --> 00:35:32,116 So instead what you can do is while you're loading the 846 00:35:32,116 --> 00:35:34,886 dictionary, just keep track of how many words you've loaded 847 00:35:35,236 --> 00:35:37,256 in some global variable or something on the heap 848 00:35:37,256 --> 00:35:38,306 or wherever you wanna put it. 849 00:35:38,506 --> 00:35:41,006 We can just say, "I loaded a word, I increment the number 850 00:35:41,006 --> 00:35:41,796 of words that I loaded." 851 00:35:42,386 --> 00:35:44,756 So when I call size I can just return that value 852 00:35:45,036 --> 00:35:47,416 which I've already calculated and the size 853 00:35:47,416 --> 00:35:50,066 of the dictionary is never going to change as you spell-check 854 00:35:50,126 --> 00:35:52,056 through your text because there's just one dictionary 855 00:35:52,256 --> 00:35:54,536 you're only loading it once so you can just keep returning 856 00:35:54,536 --> 00:35:56,286 that same size that you've calculated. 857 00:35:56,736 --> 00:36:00,686 Does that make sense? 858 00:36:00,866 --> 00:36:04,996 Awesome! So now we need to move on to our check function. 859 00:36:06,006 --> 00:36:08,726 So our check function is going to return a Boolean, 860 00:36:09,126 --> 00:36:11,546 true or false, true being the word was spelled correctly, 861 00:36:11,766 --> 00:36:13,386 false meaning the word was not spelled correctly. 862 00:36:14,326 --> 00:36:16,786 So as an input, it's going to take a string 863 00:36:16,786 --> 00:36:19,616 and that string is going to be from our text. 864 00:36:20,776 --> 00:36:24,086 So if the word exists in our hash table that means 865 00:36:24,086 --> 00:36:26,116 that the word "must" have been contained inside 866 00:36:26,116 --> 00:36:27,826 of our dictionary file which means 867 00:36:27,826 --> 00:36:29,036 that word is spelled correctly. 868 00:36:29,866 --> 00:36:32,386 Now if our word is not found in our hash table, 869 00:36:32,616 --> 00:36:33,976 that means it can't spelled correctly 870 00:36:34,186 --> 00:36:35,716 because it would have been found if it where. 871 00:36:36,306 --> 00:36:38,546 So we don't need to search 872 00:36:38,546 --> 00:36:41,116 through the entire hash table for the word. 873 00:36:41,606 --> 00:36:43,456 And so this is kind of the power of hash tables. 874 00:36:43,976 --> 00:36:48,366 Instead we only need to search the bucket for the hash value 875 00:36:48,366 --> 00:36:51,226 of the word I'm looking at, right. 876 00:36:51,226 --> 00:36:54,676 Because in order for this word to be inside of my hash table, 877 00:36:54,886 --> 00:36:57,316 it must hash to the same value 878 00:36:57,626 --> 00:36:59,576 as the same exact word inside of dictionary. 879 00:37:00,046 --> 00:37:02,726 So remember that's where our hash function needs 880 00:37:02,726 --> 00:37:03,576 to be deterministic. 881 00:37:03,926 --> 00:37:05,736 When I hash some word in my dictionary, 882 00:37:05,986 --> 00:37:07,526 I need to get the same value 883 00:37:07,666 --> 00:37:10,166 as when I hash some word inside of my text file. 884 00:37:11,076 --> 00:37:14,426 So if I do, that takes my hash table of size 10,000 885 00:37:14,626 --> 00:37:16,116 and limits it down to one bucket. 886 00:37:17,026 --> 00:37:19,766 So inside of that one bucket I just need to look, well, 887 00:37:19,986 --> 00:37:22,166 there might be more than one word that hash to this value. 888 00:37:22,526 --> 00:37:26,566 So just saying that it hash to this value is not enough, right. 889 00:37:26,566 --> 00:37:29,726 i can have some word like Z, Q, X, Y that hashes 890 00:37:29,726 --> 00:37:31,306 to the same value as the word apple. 891 00:37:31,306 --> 00:37:34,426 So if I don't actually go through the linked list, 892 00:37:34,886 --> 00:37:36,416 then I could assume that Z, X, 893 00:37:36,416 --> 00:37:39,316 Q or whatever I just said is a word when it's not. 894 00:37:40,166 --> 00:37:43,476 Just because two things hash to the same value, does not mean 895 00:37:43,476 --> 00:37:44,426 that they're the same word. 896 00:37:45,046 --> 00:37:46,686 So we actually need to go through 897 00:37:46,686 --> 00:37:48,586 and traverse our linked list. 898 00:37:49,486 --> 00:37:52,006 But that's okay because the linked list is probably gonna be 899 00:37:52,006 --> 00:37:54,516 pretty small so that's not gonna be a huge waste of time. 900 00:37:55,306 --> 00:37:57,956 So we already know how to traverse the linked list, right. 901 00:37:57,956 --> 00:37:59,926 There's nothing special about this linked list being 902 00:37:59,926 --> 00:38:02,286 in a hash table it's still just a linked list at the end 903 00:38:02,286 --> 00:38:05,756 of a day and we just went over how we can go through that. 904 00:38:05,756 --> 00:38:09,666 So some function you might find comes in handy is the strcmp. 905 00:38:09,666 --> 00:38:12,936 So this is going to compare two strings. 906 00:38:13,596 --> 00:38:16,916 And if string 1 and string 2 are the same string, 907 00:38:17,136 --> 00:38:19,816 it's going to perhaps kind of intuitively return 0. 908 00:38:20,186 --> 00:38:23,836 If string 1 is greater than string 2, it's gonna return 1 909 00:38:23,836 --> 00:38:25,866 or the other way around it's gonna return negative 1. 910 00:38:26,146 --> 00:38:28,226 So if you're checking if two strings are the same 911 00:38:28,396 --> 00:38:31,506 with this function, you need to make sure that it returns 0. 912 00:38:32,416 --> 00:38:34,316 Now the only problem with this is 913 00:38:34,316 --> 00:38:36,446 that our spell checker needs to be case sensitive. 914 00:38:37,106 --> 00:38:37,976 So you're guaranteed 915 00:38:37,976 --> 00:38:40,596 that everything the dictionary is gonna be already lowercase 916 00:38:40,816 --> 00:38:42,836 but you're not guaranteed that everything 917 00:38:42,836 --> 00:38:44,716 in your text is going to be a lowercase. 918 00:38:44,716 --> 00:38:47,126 So if your dictionary contains the word apple 919 00:38:47,256 --> 00:38:50,576 and your text contains the word Apple with the capital A, well, 920 00:38:51,136 --> 00:38:54,156 that apple is still spelled correctly even though based 921 00:38:54,156 --> 00:38:56,006 on the highest function we saw earlier just summing 922 00:38:56,006 --> 00:38:58,366 up the ASCII values, you're gonna get different values. 923 00:38:58,366 --> 00:39:00,746 And so you might return, well, capital A is not 924 00:39:00,746 --> 00:39:02,206 in my dictionary but lowercase A is, 925 00:39:02,206 --> 00:39:03,746 you need to account for that. 926 00:39:03,936 --> 00:39:06,406 So you need to make sure that any word 927 00:39:06,406 --> 00:39:08,836 with different case inside of your text is still going 928 00:39:08,836 --> 00:39:10,026 to be a correctly spelled word. 929 00:39:10,466 --> 00:39:12,856 So an easy way to do that is to just convert the case 930 00:39:12,856 --> 00:39:15,116 of the entire word so make everything lower case 931 00:39:15,116 --> 00:39:16,656 or make everything upper case or something 932 00:39:16,656 --> 00:39:18,966 like that and then hash it. 933 00:39:19,336 --> 00:39:21,176 So don't forget about that function to lower, 934 00:39:21,416 --> 00:39:25,336 to upper also exist and is upper and is lower, returns a Bool 935 00:39:25,336 --> 00:39:27,586 if a character is uppercase or not. 936 00:39:27,586 --> 00:39:29,946 So remember those functions work on characters. 937 00:39:30,356 --> 00:39:32,186 It's an order to convert an entire string 938 00:39:32,186 --> 00:39:35,076 that means you still need to iterate through it. 939 00:39:35,306 --> 00:39:38,706 And so now if-- while you're traversing your linked list you 940 00:39:38,706 --> 00:39:41,546 find that two strings are equal, so some string inside 941 00:39:41,546 --> 00:39:43,996 of your linked list and the string that you're looking at, 942 00:39:43,996 --> 00:39:46,196 if those are the same then that means that-- 943 00:39:46,196 --> 00:39:48,036 well, that word must be spelled correctly. 944 00:39:48,746 --> 00:39:51,356 But if you reached the end of your linked list that means 945 00:39:51,356 --> 00:39:53,276 that the word was not contained in the linked list. 946 00:39:53,716 --> 00:39:57,526 And now you know that word must not be spelled correctly even 947 00:39:57,526 --> 00:39:59,856 though you didn't check the entire dictionary, 948 00:40:00,186 --> 00:40:02,006 you only check maybe five or six words, 949 00:40:02,516 --> 00:40:03,976 but that's the problem of our hash table. 950 00:40:04,106 --> 00:40:05,666 >> You only check the five or six words 951 00:40:05,906 --> 00:40:09,266 that could be the same word as you're looking at. 952 00:40:09,996 --> 00:40:11,806 Any question on how that works? 953 00:40:12,246 --> 00:40:18,206 Alright so let's just take a look at an example. 954 00:40:18,516 --> 00:40:19,506 So here is my hash table. 955 00:40:19,506 --> 00:40:21,806 It has ten elements, when I started off, 956 00:40:21,866 --> 00:40:24,236 everything is pointing to null because there's nothing there. 957 00:40:25,166 --> 00:40:28,976 So now I'm going to insert the word "this" into my hash table. 958 00:40:29,326 --> 00:40:31,106 So "this" has a hash value 959 00:40:31,106 --> 00:40:33,506 of 1 using the magical hash function I just made up. 960 00:40:34,296 --> 00:40:37,126 So that means that we've hit our first case when I go 961 00:40:37,126 --> 00:40:39,476 to hash table "I" I'm going to get null. 962 00:40:40,186 --> 00:40:42,206 So that means that there's nothing there and I need 963 00:40:42,206 --> 00:40:43,876 to create some new linked list. 964 00:40:44,306 --> 00:40:46,036 That's pretty easy I can just set it equal 965 00:40:46,036 --> 00:40:47,156 to the address of the node. 966 00:40:48,066 --> 00:40:49,506 And now the same thing's gonna happen 967 00:40:49,696 --> 00:40:51,916 when I use my magical hash function on the word "is" 968 00:40:52,696 --> 00:40:54,046 so that's just gonna go in the fifth slot. 969 00:40:54,196 --> 00:40:55,616 Again, I'm creating a new linked list. 970 00:40:56,256 --> 00:40:59,536 Now when I get into CS50, there's gonna be a problem 971 00:40:59,536 --> 00:41:01,596 because my magical hash function just returned to 5. 972 00:41:01,806 --> 00:41:06,796 So what we need to do is add on CS50 to the end 973 00:41:06,796 --> 00:41:09,546 or at the beginning which is faster just the end in this case 974 00:41:09,546 --> 00:41:11,646 because "is CS50" is better than "CS50 is." 975 00:41:12,046 --> 00:41:14,376 So we're adding it to the end of our linked list 976 00:41:14,656 --> 00:41:17,706 and so now we have a linked list existing at 5. 977 00:41:18,136 --> 00:41:21,156 So this is my-- if my dictionary just said this is CS50, 978 00:41:21,156 --> 00:41:22,616 this is what my hash table is gonna look 979 00:41:22,616 --> 00:41:24,396 like after I just called load. 980 00:41:25,396 --> 00:41:26,846 So now that I've loaded everything, 981 00:41:26,846 --> 00:41:29,096 I can start calling check on some words. 982 00:41:29,916 --> 00:41:31,016 So let's say that we were-- 983 00:41:31,016 --> 00:41:32,726 called check on the word "isn't." 984 00:41:33,556 --> 00:41:35,736 And so my hash function is going to return a value 985 00:41:35,736 --> 00:41:37,726 of 3 for the word "isn't". 986 00:41:38,456 --> 00:41:40,786 So now, we go to our third bucket so 0, 1, 987 00:41:40,786 --> 00:41:42,246 2, 3 and we see a null. 988 00:41:43,066 --> 00:41:46,476 And so that means that "isn't" is definitely not spelled 989 00:41:46,476 --> 00:41:49,666 correctly, because it does not exist anywhere inside 990 00:41:49,666 --> 00:41:50,286 of our hash table. 991 00:41:50,846 --> 00:41:56,076 If instead we called "this" so we can say "Okay, we're gonna go 992 00:41:56,076 --> 00:41:58,686 to bucket 1," bucket one is not null so it's-- 993 00:41:58,686 --> 00:42:00,596 we can't just assume it's spelled correctly, 994 00:42:00,806 --> 00:42:01,966 we need to look at our linked list. 995 00:42:01,966 --> 00:42:03,766 In this case, we just have one element 996 00:42:03,766 --> 00:42:06,816 and the strings are the same so this is spelled correctly. 997 00:42:06,816 --> 00:42:10,726 Now in the case we called CS50, again, it's not null 998 00:42:10,786 --> 00:42:12,906 and now we need to traverse our entire linked list. 999 00:42:12,906 --> 00:42:15,436 So we're gonna say, "Okay, well, CS50 and is," 1000 00:42:15,436 --> 00:42:16,606 are those the same string? 1001 00:42:17,016 --> 00:42:18,746 No, but that doesn't mean we're wrong 1002 00:42:18,746 --> 00:42:21,126 yet because we have another element to look at. 1003 00:42:21,386 --> 00:42:25,336 So now we're gonna say, "Okay, CS50 and CS50, 1004 00:42:25,336 --> 00:42:27,516 those are the same string, they're spelled correctly. 1005 00:42:28,896 --> 00:42:30,496 So any questions on that example? 1006 00:42:30,896 --> 00:42:31,086 Yup? 1007 00:42:31,576 --> 00:42:34,106 >>Why did you add CS50 to the end of the linked list? 1008 00:42:34,106 --> 00:42:36,036 >> So why did I add CS50 to the end of the linked list. 1009 00:42:36,036 --> 00:42:39,746 So you'd actually be adding it to the beginning, I decided it 1010 00:42:39,746 --> 00:42:43,856 to the end so it say this is CS50. 1011 00:42:43,856 --> 00:42:43,923 [ Inaudible Remark ] 1012 00:42:43,923 --> 00:42:49,536 >> Which is a good idea at the time. 1013 00:42:49,536 --> 00:42:51,176 Other questions? 1014 00:42:51,266 --> 00:42:53,086 Alright, so now we need to worry about unload. 1015 00:42:53,636 --> 00:42:54,966 So remember the goal is we want 1016 00:42:54,966 --> 00:42:57,476 to free our entire hash table for memory. 1017 00:42:57,956 --> 00:43:00,286 So that includes every single node that's anywhere 1018 00:43:00,286 --> 00:43:02,346 in any linked list, it needs to be gone for memory. 1019 00:43:03,226 --> 00:43:05,566 So a nice thing to note is that you don't need 1020 00:43:05,566 --> 00:43:07,496 to free anything on the stack. 1021 00:43:08,376 --> 00:43:10,426 So if you put something on the stack whether it'd be 1022 00:43:10,426 --> 00:43:13,586 in a function or be it'd be a global variable, you don't need 1023 00:43:13,586 --> 00:43:15,456 to free it because you just don't free things 1024 00:43:15,456 --> 00:43:15,956 that are on the stack. 1025 00:43:16,606 --> 00:43:18,926 The only things you need to free are going 1026 00:43:18,926 --> 00:43:20,066 to be those that you malloc. 1027 00:43:20,676 --> 00:43:24,126 So if you have some global array that says like node star array, 1028 00:43:24,126 --> 00:43:27,006 if this is a global variable you don't need to free it 1029 00:43:27,176 --> 00:43:28,076 because it's on the stack. 1030 00:43:28,736 --> 00:43:31,046 If you malloc something, like all of those nodes 1031 00:43:31,046 --> 00:43:33,516 that we malloc, then we do need to free them. 1032 00:43:33,806 --> 00:43:36,326 And so how can we free a hash table? 1033 00:43:36,636 --> 00:43:38,846 Well we know our hash table has some fixed size 1034 00:43:39,536 --> 00:43:42,776 and wanna iterate through every single bucket in our hash table 1035 00:43:42,776 --> 00:43:44,676 in order to free the linked list that go there. 1036 00:43:44,676 --> 00:43:47,566 So for every element in my hash table, 1037 00:43:47,866 --> 00:43:50,856 I'm either gonna have null or I'm going to have a linked list. 1038 00:43:51,796 --> 00:43:52,996 So now for every element 1039 00:43:52,996 --> 00:43:55,256 in the linked list we are you know how to do this. 1040 00:43:55,256 --> 00:43:57,956 We can just free the element, move on to the next element. 1041 00:43:58,066 --> 00:44:00,576 As soon as we hit null, we're done and we can move 1042 00:44:00,576 --> 00:44:02,896 on to the next bucket inside of our hash table. 1043 00:44:04,476 --> 00:44:05,986 Questions on unload? 1044 00:44:06,486 --> 00:44:15,156 Alright so this pset is going to introduce us to a new code-- 1045 00:44:15,156 --> 00:44:17,846 a new tool called Valgrind or Valgrind however you say it? 1046 00:44:17,846 --> 00:44:21,696 And so what Valgrind is gonna do is it's going to check for us 1047 00:44:21,796 --> 00:44:23,426 if we had any memory leaks. 1048 00:44:24,276 --> 00:44:26,406 So let's take a look at how to use it. 1049 00:44:26,636 --> 00:44:30,186 So here I have a program called "helpmevalgrind" 1050 00:44:30,216 --> 00:44:32,426 and it looks pretty [inaudible]. 1051 00:44:32,696 --> 00:44:36,046 So I'm creating linked list, I'm creating 1, 2, 3, 4 nodes, 1052 00:44:36,046 --> 00:44:37,616 I have a root and N1, N2, 1053 00:44:37,646 --> 00:44:41,356 N3 each of these nodes has a value that's an integer 1054 00:44:41,356 --> 00:44:42,646 and it's pointing to some other node. 1055 00:44:42,646 --> 00:44:45,166 And so now this is how I set up my linked list. 1056 00:44:45,596 --> 00:44:48,416 Well, the root node is gonna point to N1, N1 is pointing 1057 00:44:48,516 --> 00:44:51,496 to N2 and N2 is pointing to N3. 1058 00:44:51,496 --> 00:44:53,266 So I'm not really gonna do anything about linked list. 1059 00:44:54,026 --> 00:44:56,686 But now that my program is done, I'm gonna be a good programmer 1060 00:44:56,686 --> 00:44:57,816 and I'm gonna free the linked list. 1061 00:44:58,276 --> 00:45:00,526 And because I free the first element the whole thing is free, 1062 00:45:01,346 --> 00:45:01,446 right? 1063 00:45:01,446 --> 00:45:02,156 [ Inaudible Remark ] 1064 00:45:02,156 --> 00:45:03,006 >> Exactly! 1065 00:45:03,006 --> 00:45:03,436 So, it's not. 1066 00:45:03,846 --> 00:45:06,226 So when we run Valgrind we can say-- 1067 00:45:06,496 --> 00:45:09,186 so here's by helpmevalgrind, I can say valgrind. 1068 00:45:09,906 --> 00:45:11,386 This V is for robust. 1069 00:45:11,386 --> 00:45:13,706 You wanna say tell me as much as you possibly can. 1070 00:45:14,056 --> 00:45:17,196 I'm gonna say dash, dash, leak, check equals full. 1071 00:45:17,556 --> 00:45:19,166 So we're gonna check everything possible. 1072 00:45:19,166 --> 00:45:20,886 And now we're gonna say dot slash 1073 00:45:21,506 --> 00:45:22,556 and the name of our binary. 1074 00:45:22,746 --> 00:45:25,376 So don't forget about that dot slash. 1075 00:45:25,376 --> 00:45:26,016 We hit Enter. 1076 00:45:26,016 --> 00:45:29,666 Now we got a whole lot of output, but luckily what's 1077 00:45:29,666 --> 00:45:31,846 in the middle of the screen here is what we care about. 1078 00:45:32,776 --> 00:45:33,946 So we have some problems here. 1079 00:45:33,946 --> 00:45:37,276 We definitely lost 8 bytes and we might have lost 16 bytes, 1080 00:45:37,276 --> 00:45:39,766 so really we just lost 24 bytes of memory. 1081 00:45:39,916 --> 00:45:42,256 So that's how much memory we've leaked inside 1082 00:45:42,256 --> 00:45:43,206 of this leak summary. 1083 00:45:43,236 --> 00:45:46,286 Everything in there is leak memory that you now care about. 1084 00:45:47,126 --> 00:45:48,266 So now let's look above this. 1085 00:45:48,656 --> 00:45:49,296 Well, okay. 1086 00:45:49,296 --> 00:45:51,356 So add some address something. 1087 00:45:51,546 --> 00:45:52,536 Now this looks interesting. 1088 00:45:53,576 --> 00:45:56,326 So this is the name of our C file and it's going 1089 00:45:56,326 --> 00:45:58,146 to give us a line number after it. 1090 00:45:58,576 --> 00:46:03,816 So, in file, helpmevalgrind,c12, we lost the memory 1091 00:46:03,816 --> 00:46:04,786 that was created there. 1092 00:46:05,336 --> 00:46:07,566 So let's open up, get it, and now let's look at line 12. 1093 00:46:07,976 --> 00:46:08,626 Oh, look at that. 1094 00:46:08,626 --> 00:46:13,296 So this is where it created n1, and we never freed n1. 1095 00:46:13,296 --> 00:46:17,756 So if you go into a fix version which valgrind helped us solve. 1096 00:46:18,226 --> 00:46:21,846 If we just free, 3, 2, 1, or 1, 2, 3 in the root. 1097 00:46:22,336 --> 00:46:31,126 Now when we run Valgrind, we can say, "Valgrind saves the day. 1098 00:46:31,896 --> 00:46:34,746 All heap blocks were freed, no leaks are possible." 1099 00:46:35,396 --> 00:46:38,156 So this s what your output is gonna look like when you're done 1100 00:46:38,156 --> 00:46:38,936 and you're really happy. 1101 00:46:39,726 --> 00:46:41,086 So no leaks possible. 1102 00:46:41,306 --> 00:46:43,416 Nothing at use when we're done 1103 00:46:43,416 --> 00:46:45,296 and it even tells us how many allocs we made. 1104 00:46:45,296 --> 00:46:47,016 Remember, we had four nodes on a linked list. 1105 00:46:47,016 --> 00:46:49,926 And remember, as we made the malloc four times and you use 1106 00:46:49,926 --> 00:46:55,086 that many bytes 'cause 8 times 4 is 32. 1107 00:46:55,186 --> 00:46:55,616 Questions? 1108 00:46:55,676 --> 00:46:56,156 Yup? 1109 00:46:56,156 --> 00:46:56,346 [ Inaudible Remark ] 1110 00:46:56,346 --> 00:46:56,546 >> Yup. 1111 00:46:56,546 --> 00:46:58,216 [ Inaudible Remark ] 1112 00:46:58,216 --> 00:46:59,026 >> So yeah. 1113 00:47:03,456 --> 00:47:05,796 So this is-- So that's also a problem. 1114 00:47:05,796 --> 00:47:07,696 So if we were trying to traverse this linked list, 1115 00:47:07,696 --> 00:47:09,766 we probably run into an issue because it's not null. 1116 00:47:10,036 --> 00:47:11,596 So you would want to set that null 1117 00:47:11,726 --> 00:47:13,296 if you're actually using this linked list 1118 00:47:13,546 --> 00:47:14,856 to diverse something, exactly. 1119 00:47:15,366 --> 00:47:18,326 Other questions on Valgrind. 1120 00:47:18,456 --> 00:47:19,346 Yup? 1121 00:47:19,701 --> 00:47:21,701 [ Inaudible Remark ] 1122 00:47:22,056 --> 00:47:24,266 >> So if we look back at our output, 1123 00:47:24,266 --> 00:47:25,586 we're gonna go back to here. 1124 00:47:26,876 --> 00:47:27,926 So right here. 1125 00:47:27,976 --> 00:47:30,876 Right above the leak summary, it's gonna try to tell you 1126 00:47:30,876 --> 00:47:31,686 and may or may not work. 1127 00:47:31,686 --> 00:47:33,406 It's gonna try to tell well this one's gone. 1128 00:47:33,406 --> 00:47:37,636 So you might wanna look there in your C file. 1129 00:47:37,856 --> 00:47:40,936 So, yes. So the question was, is where is Valgrind telling us 1130 00:47:40,936 --> 00:47:42,366 where we should maybe start to look 1131 00:47:42,696 --> 00:47:44,426 and just write above leak summary? 1132 00:47:44,886 --> 00:47:48,146 It's gonna say some helpful information among some other 1133 00:47:48,146 --> 00:47:48,946 unhelpful information. 1134 00:47:49,386 --> 00:47:52,176 Any other questions on Valgrind? 1135 00:47:52,636 --> 00:47:58,386 Okay. So let's take a look at another way 1136 00:47:59,056 --> 00:48:00,866 that we can implement our dictionary. 1137 00:48:00,866 --> 00:48:02,826 So any questions on the hashtable approach 1138 00:48:03,106 --> 00:48:04,266 or is everyone good on that. 1139 00:48:04,266 --> 00:48:04,396 Yup? 1140 00:48:04,666 --> 00:48:06,036 >> Is it up to us to figure 1141 00:48:06,036 --> 00:48:08,696 out like what the ideal length is for the hash table? 1142 00:48:08,856 --> 00:48:09,686 >> Is it up to you to figure 1143 00:48:09,686 --> 00:48:11,606 out what the ideal length is for your hash table? 1144 00:48:11,606 --> 00:48:12,406 Yes, it is. 1145 00:48:12,486 --> 00:48:14,576 And it's going to depend on your hash function. 1146 00:48:15,216 --> 00:48:17,756 So maybe your hash function does better with this size table 1147 00:48:18,006 --> 00:48:19,696 and the best way to do that is 1148 00:48:19,696 --> 00:48:21,936 to like I did is used to find constants. 1149 00:48:22,346 --> 00:48:25,096 So if I say a hash table size, I just say, 1150 00:48:25,096 --> 00:48:26,166 "I printing my hash table. 1151 00:48:26,376 --> 00:48:28,146 It's an array of size hash table link." 1152 00:48:28,746 --> 00:48:31,216 So as I'm debugging my program, if I just change 1153 00:48:31,216 --> 00:48:33,426 that one constant, I can change the size 1154 00:48:33,426 --> 00:48:34,706 of the hash table for everything. 1155 00:48:35,266 --> 00:48:36,886 So something you might wanna do is just go through. 1156 00:48:36,886 --> 00:48:37,686 Let's try a thousand. 1157 00:48:37,686 --> 00:48:38,496 Let's try 10,000. 1158 00:48:38,496 --> 00:48:39,466 Let's try 2,000. 1159 00:48:39,816 --> 00:48:41,676 And just see what gives you the best performance. 1160 00:48:42,136 --> 00:48:44,706 Other questions? 1161 00:48:45,236 --> 00:48:52,026 Okay. So now this approach uses a totally different data 1162 00:48:52,026 --> 00:48:53,176 structure than hash tables. 1163 00:48:54,306 --> 00:48:56,146 So you would never use a combination of these two. 1164 00:48:56,146 --> 00:48:57,746 These are kind of two different approaches 1165 00:48:57,946 --> 00:48:59,096 to solving the same problem. 1166 00:48:59,286 --> 00:49:01,306 And you should go with whichever you like better. 1167 00:49:02,256 --> 00:49:04,476 So this is the data structure called the "trie." 1168 00:49:05,246 --> 00:49:06,956 It was originally pronounced tree 'cause it comes 1169 00:49:06,956 --> 00:49:07,856 from the word "retrieval" 1170 00:49:08,076 --> 00:49:11,866 but tree also had it spelled T-R-E-E, so we decided 1171 00:49:11,866 --> 00:49:12,766 to pronounce this tries. 1172 00:49:13,876 --> 00:49:16,376 So the basic difference between tries and linked list is 1173 00:49:16,376 --> 00:49:20,296 that instead of having a single word inside of our trie element, 1174 00:49:20,626 --> 00:49:22,296 it's going to contain an array 1175 00:49:22,396 --> 00:49:25,386 with an element foe each possible character inside 1176 00:49:25,386 --> 00:49:25,976 of our word. 1177 00:49:25,976 --> 00:49:29,886 So, the value of an element in array is going to-- 1178 00:49:30,166 --> 00:49:31,726 is going to have a corresponding letter. 1179 00:49:31,846 --> 00:49:34,486 So each of my corresponding letters has some slot 1180 00:49:34,696 --> 00:49:36,316 in my array that's found in each node. 1181 00:49:37,246 --> 00:49:40,606 So inside of that array, it's going to point to new nodes. 1182 00:49:41,496 --> 00:49:43,186 So here's my-- so here's my example. 1183 00:49:43,396 --> 00:49:45,246 So I have some node at the top here 1184 00:49:45,246 --> 00:49:47,986 and this line is the array at that node. 1185 00:49:48,796 --> 00:49:51,066 So the letter M is a valid letter. 1186 00:49:51,416 --> 00:49:54,116 That means that the value at that corresponding place 1187 00:49:54,116 --> 00:49:56,516 in my array is going to point to a new node. 1188 00:49:57,726 --> 00:50:00,226 So now, this new node has another array, again, 1189 00:50:00,226 --> 00:50:03,476 containing a slot for every possible letter inside 1190 00:50:03,476 --> 00:50:04,036 of any word. 1191 00:50:05,116 --> 00:50:07,636 >> So if you look at the letter A, that's going 1192 00:50:07,636 --> 00:50:08,836 to point to another node. 1193 00:50:08,946 --> 00:50:11,806 Now we can see all this kinda falls down into a valid word. 1194 00:50:12,086 --> 00:50:13,186 We've made the word Maxwell. 1195 00:50:14,176 --> 00:50:15,886 Now instead we were gonna say something like MB, 1196 00:50:15,886 --> 00:50:19,396 when we go to the slot for B inside of this array 1197 00:50:19,396 --> 00:50:21,086 because remember all the arrays of the same 1198 00:50:21,206 --> 00:50:23,396 and there's an element for each possible letter, 1199 00:50:23,606 --> 00:50:25,026 B is gonna point to null. 1200 00:50:25,566 --> 00:50:27,886 So at that point, we know that there's no word 1201 00:50:27,886 --> 00:50:29,946 in our dictionary that starts with MB. 1202 00:50:30,696 --> 00:50:33,516 So as I'm going through this, if I ever encounter a word 1203 00:50:33,516 --> 00:50:35,866 that starts with MB, it's definitely misspelled. 1204 00:50:35,966 --> 00:50:38,116 And I don't even need to look at the rest of the word 1205 00:50:38,116 --> 00:50:39,396 because there's nothing that starts with MB. 1206 00:50:39,396 --> 00:50:41,226 So if the word you're looking at starts with MB, 1207 00:50:41,226 --> 00:50:43,036 it's definitely wrong and you can end to there. 1208 00:50:43,496 --> 00:50:46,096 So does everybody understand what's going here? 1209 00:50:46,716 --> 00:50:48,236 So these other-- these delta things 1210 00:50:48,236 --> 00:50:50,446 at the bottom just signify that that's the end of a word. 1211 00:50:51,126 --> 00:50:51,846 So we can represent 1212 00:50:51,846 --> 00:50:55,036 that a little differently using this representation. 1213 00:50:55,626 --> 00:50:59,196 So again, this is a single node and the node has an array 1214 00:50:59,476 --> 00:51:00,956 for every possible character. 1215 00:51:01,566 --> 00:51:03,786 So in this pset, don't forget about the apostrophe 1216 00:51:03,996 --> 00:51:05,096 which is a valid character. 1217 00:51:05,096 --> 00:51:06,506 So all the letters in the apostrophe, 1218 00:51:06,506 --> 00:51:09,286 that's gonna make 27 different valid characters in a word. 1219 00:51:10,076 --> 00:51:11,696 So we also wanna keep track whether 1220 00:51:11,696 --> 00:51:14,126 or not this node is the valid end of a word. 1221 00:51:14,536 --> 00:51:17,146 So if some word ends with this node. 1222 00:51:17,786 --> 00:51:19,496 So that can just be a Boolean flag. 1223 00:51:20,816 --> 00:51:22,786 So now to load something into a trie, 1224 00:51:23,386 --> 00:51:25,466 now instead of just iterating through every word 1225 00:51:25,466 --> 00:51:27,346 of the dictionary, we need to also iterate 1226 00:51:27,346 --> 00:51:30,076 through every character n every word in the dictionary. 1227 00:51:31,316 --> 00:51:34,526 So as we iterate through this word, we basically want 1228 00:51:34,526 --> 00:51:36,156 to also iterate through our trie. 1229 00:51:36,716 --> 00:51:38,716 So let's say we're inserting that word "Maxwell." 1230 00:51:39,326 --> 00:51:41,356 So we're gonna say, okay, I just read the word Maxwell 1231 00:51:41,356 --> 00:51:43,846 from my dictionary and I want to iterate through every letter. 1232 00:51:43,846 --> 00:51:46,486 So we're gonna say okay so that M, I'm gonna jump 1233 00:51:46,486 --> 00:51:50,756 to that spot M. Now if that's null, that mean that I need 1234 00:51:50,756 --> 00:51:54,096 to create a node because the word I'm reading is definitely 1235 00:51:54,096 --> 00:51:55,546 valid so we wanna insert it. 1236 00:51:55,866 --> 00:51:57,756 So if this n is null, we want 1237 00:51:57,756 --> 00:52:02,106 to create a new node that's gonna represent the sequence MA. 1238 00:52:02,226 --> 00:52:03,856 So now you can see that I'm just going to iterate though 1239 00:52:03,856 --> 00:52:06,456 that entire word creating nodes as necessary. 1240 00:52:07,346 --> 00:52:10,366 Now if for example, it already existed like M, 1241 00:52:10,366 --> 00:52:11,866 that node, already existing. 1242 00:52:12,176 --> 00:52:13,526 We don't need to create it. 1243 00:52:13,526 --> 00:52:14,896 Instead, we just wanna jump there. 1244 00:52:15,366 --> 00:52:18,146 So if now if I'm inserting this word M-E-N, so Mendel, 1245 00:52:18,146 --> 00:52:20,346 after I've inserted Maxwell, well, 1246 00:52:20,346 --> 00:52:21,946 this node that's corresponds 1247 00:52:21,946 --> 00:52:24,276 to the first letter M, it already exist. 1248 00:52:24,276 --> 00:52:25,936 So I don't wanna create a new one. 1249 00:52:26,216 --> 00:52:28,056 I just wanna use the one that already exist there 1250 00:52:28,056 --> 00:52:30,346 and then fill in the slot for E. 1251 00:52:30,966 --> 00:52:33,926 So now the same node has slots build in for A and E 1252 00:52:34,046 --> 00:52:36,046 and we're gonna have a different sequence of nodes 1253 00:52:36,356 --> 00:52:38,056 to represent the word Mendel. 1254 00:52:38,056 --> 00:52:39,596 Does that make sense? 1255 00:52:40,186 --> 00:52:43,096 So this code is actually super, super elegant 1256 00:52:43,096 --> 00:52:45,086 if you can get it working correctly. 1257 00:52:46,266 --> 00:52:49,186 So that's-- once we get to the end of the word 1258 00:52:49,186 --> 00:52:51,526 and this case a new line because we know that all 1259 00:52:51,526 --> 00:52:53,286 that word the dictionary must end to the new line. 1260 00:52:53,676 --> 00:52:55,196 Well, that's-- and we can set that flag. 1261 00:52:55,266 --> 00:52:56,456 That is the end of the word. 1262 00:52:56,456 --> 00:52:59,336 We can set that to true because we reach the end of the word. 1263 00:52:59,586 --> 00:53:01,056 We can put a null term in there. 1264 00:53:01,056 --> 00:53:03,576 We can do whatever we want. 1265 00:53:03,726 --> 00:53:05,316 So that's how we're going to insert. 1266 00:53:06,306 --> 00:53:09,086 Does that make sense? 1267 00:53:09,276 --> 00:53:10,416 So size, same thing. 1268 00:53:10,416 --> 00:53:11,386 Nothing complicated. 1269 00:53:11,386 --> 00:53:12,986 You don't need to traverse the whole thing 1270 00:53:12,986 --> 00:53:14,456 and do some complicated recursion. 1271 00:53:14,776 --> 00:53:17,136 Just need to keep track of the words as you're loading them. 1272 00:53:17,346 --> 00:53:18,226 I found a new word. 1273 00:53:18,646 --> 00:53:22,366 Total words loaded plus, plus done. 1274 00:53:22,366 --> 00:53:25,216 So now, check is where it gets really, really cool. 1275 00:53:25,546 --> 00:53:28,646 So in order to check if a word is inside your dictionary, 1276 00:53:28,856 --> 00:53:31,826 you basically need to find some path along your trie 1277 00:53:32,226 --> 00:53:33,776 that corresponds to that word. 1278 00:53:34,216 --> 00:53:36,396 So let's say instead of inserting the word Maxwell, 1279 00:53:36,396 --> 00:53:37,836 we were looking for it. 1280 00:53:37,876 --> 00:53:39,286 So we need to again iterate 1281 00:53:39,286 --> 00:53:41,296 through every single character inside of the word. 1282 00:53:41,296 --> 00:53:44,826 So we're gonna start off at M. And if M is not null, 1283 00:53:44,876 --> 00:53:46,026 then we're gonna jump to the next node. 1284 00:53:46,666 --> 00:53:48,206 Now we also need to increment the letter 1285 00:53:48,206 --> 00:53:48,976 that we're looking at. 1286 00:53:49,146 --> 00:53:52,086 So now we're looking at A. So A is not null, we're gonna jump 1287 00:53:52,086 --> 00:53:54,326 to the next level inside of our trie and jump 1288 00:53:54,326 --> 00:53:57,316 to the next letter inside of our word, X, and keep going. 1289 00:53:58,146 --> 00:54:00,336 So in order to check if a word is correctly-- 1290 00:54:00,336 --> 00:54:02,866 is correctly spelled, we need to look at every letter. 1291 00:54:03,446 --> 00:54:05,246 But the real power of a trie comes in when you're looking 1292 00:54:05,246 --> 00:54:06,806 at words that are not correctly spelled. 1293 00:54:07,246 --> 00:54:10,536 So if we're looking for something like Dworkin, right, 1294 00:54:10,536 --> 00:54:12,156 Maxwell-Dworkin, well, we're gonna say, 1295 00:54:12,156 --> 00:54:13,946 okay, D, D is null, done. 1296 00:54:14,666 --> 00:54:17,136 It cannot be spelled correctly because there's no word in here 1297 00:54:17,136 --> 00:54:20,026 that starts with D. Or if you're looking at a hash table, 1298 00:54:20,306 --> 00:54:22,006 your hash function is probably independent 1299 00:54:22,006 --> 00:54:22,986 of what the word starts with. 1300 00:54:23,026 --> 00:54:25,216 So you probably have to go through some linked list 1301 00:54:25,216 --> 00:54:26,706 of a bunch of dissimilar words. 1302 00:54:27,256 --> 00:54:30,146 Where with my trie, as soon as I can't finish my path, 1303 00:54:30,876 --> 00:54:33,106 that word might be misspelled because there's no way 1304 00:54:33,256 --> 00:54:35,946 to make a path through this trie that spells out the word. 1305 00:54:36,996 --> 00:54:40,436 Does that make sense? 1306 00:54:40,986 --> 00:54:43,456 Alright. So let's check. 1307 00:54:44,336 --> 00:54:47,186 And so finally unload, gets a little more complicated 1308 00:54:47,186 --> 00:54:49,266 because like a link, one way or freeing a linked list is 1309 00:54:49,266 --> 00:54:50,236 to kind of free of backwards. 1310 00:54:50,366 --> 00:54:52,326 You wanna do the same thing with this trie. 1311 00:54:53,096 --> 00:54:55,176 So in order to free it, you wanna travel 1312 00:54:55,176 --> 00:54:57,826 to the lowest possible level and free all 1313 00:54:57,826 --> 00:54:59,286 of your pointers inside of that array. 1314 00:54:59,806 --> 00:55:00,846 So if those points to something 1315 00:55:00,846 --> 00:55:02,296 that was malloc, you wanna free them. 1316 00:55:03,136 --> 00:55:05,306 So once you get to the lowest possible node, 1317 00:55:05,576 --> 00:55:07,056 you then wanna backtrack upwards. 1318 00:55:07,496 --> 00:55:10,136 So in order to free this trie, you could do something. 1319 00:55:10,136 --> 00:55:13,956 If I get all the way down here to this V and free everything, 1320 00:55:14,106 --> 00:55:16,246 now backtrack up, I can free everything that was apparent 1321 00:55:16,246 --> 00:55:18,186 of that; I can free everything that's apparent of that. 1322 00:55:18,436 --> 00:55:21,776 And so by freeing it, bottom to top, you make sure 1323 00:55:21,776 --> 00:55:24,316 that you don't accidentally like free this N 1324 00:55:24,316 --> 00:55:26,796 without freeing this G. 'Cause once you free this N, 1325 00:55:26,796 --> 00:55:29,086 you have no idea where this is because you're not allowed 1326 00:55:29,086 --> 00:55:30,206 to access that memory anymore. 1327 00:55:30,636 --> 00:55:32,056 So in order to free this correctly, 1328 00:55:32,056 --> 00:55:33,056 we wanna go like this. 1329 00:55:33,636 --> 00:55:34,846 So as we've seen with recursion, 1330 00:55:34,846 --> 00:55:36,186 this kind of natural lends itself 1331 00:55:36,186 --> 00:55:37,416 to recursive implantation. 1332 00:55:37,726 --> 00:55:39,586 For our base case, is the one right at the end, 1333 00:55:39,856 --> 00:55:42,186 then once we free this and we recurs back up the stack, 1334 00:55:42,516 --> 00:55:45,716 we're gonna hit everything in the order that we saw it. 1335 00:55:45,716 --> 00:55:47,166 So we can easily just backtrack. 1336 00:55:47,166 --> 00:55:50,546 If our base case is when were at the end. 1337 00:55:50,546 --> 00:55:53,596 Does that make sense? 1338 00:55:53,596 --> 00:55:56,296 Okay. So any questions on tries or hash tables 1339 00:55:56,296 --> 00:55:58,376 or anything about pset6? 1340 00:55:59,086 --> 00:55:59,186 Yup? 1341 00:55:59,936 --> 00:56:06,216 >> So, [inaudible] dictionary place like how would you give 1342 00:56:06,466 --> 00:56:10,826 like on the new nodes of different names [inaudible]? 1343 00:56:10,826 --> 00:56:12,656 >> So how do we give the new nodes different links? 1344 00:56:13,886 --> 00:56:14,816 >> Different names. 1345 00:56:14,816 --> 00:56:16,566 >> Oh, different names? 1346 00:56:16,566 --> 00:56:18,276 So if we're inside of a loop, 1347 00:56:18,276 --> 00:56:19,886 so how would we give these new nodes different names? 1348 00:56:19,886 --> 00:56:21,896 So if we're inside of a loop, we don't really need 1349 00:56:21,896 --> 00:56:23,506 to name them something separate, right. 1350 00:56:23,966 --> 00:56:25,336 Once we put it on the heap, 1351 00:56:25,716 --> 00:56:28,416 that means that that node exists somewhere in the heap. 1352 00:56:29,016 --> 00:56:32,176 So as long as we keep tracking of that, either inserting 1353 00:56:32,176 --> 00:56:34,066 into the hash table, inserting to a linked list, 1354 00:56:34,306 --> 00:56:36,156 we don't care what it was originally named. 1355 00:56:36,736 --> 00:56:38,096 So if you're inside of a loop, 1356 00:56:38,096 --> 00:56:40,276 you can basically reuse the same name 1357 00:56:40,566 --> 00:56:42,176 but just keep malloc'ing new space. 1358 00:56:42,336 --> 00:56:43,956 And make sure that once you malloc a node, 1359 00:56:43,956 --> 00:56:45,176 you don't loose track of it. 1360 00:56:45,896 --> 00:56:47,316 Because as soon as you lose track of it, 1361 00:56:47,316 --> 00:56:49,736 you have a memory leak because you can never free it, 1362 00:56:49,796 --> 00:56:52,636 because you have no idea where it is. 1363 00:56:52,846 --> 00:56:56,116 Other questions on anything? 1364 00:56:56,216 --> 00:57:02,936 Alright. Well in that case, then good luck on pset6. 1365 00:57:03,516 --> 00:57:08,420 [ Silence ]