[MUSIC PLAYING]

DOUG LLOYD: By now you know a lot about arrays, and you know a lot about linked lists. And we've discussed the pros and cons. We've discussed that linked lists can get bigger and smaller, but they take up more space. Arrays are much more straightforward to use, but they're restrictive inasmuch as we have to set the size of the array at the very beginning, and then we're stuck with it.

With that, we've pretty much exhausted all of our topics about linked lists and arrays. Or have we? Maybe we can do something even more creative. And that sort of lends itself to the idea of a hash table.

So in a hash table we're going to try to combine an array with a linked list. We're going to take the advantages of the array, like random access, being able to just go to array element 4 or array element 8 without having to iterate across. That's pretty fast, right?

But we also want our data structure to be able to grow and shrink. We don't want to be restricted. And we want to be able to add and remove things very easily, which, if you recall, is very complex with an array. And we can call this new thing a hash table.

If implemented correctly, we're sort of taking the advantages of both data structures you've already seen, arrays and linked lists. Insertion can start to tend toward theta of 1. Theta we haven't really discussed, but theta is just the average case, what's actually going to happen. You're not always going to have the worst-case scenario, and you're not always going to have the best-case scenario, so what's the average scenario?

Well, an average insertion into a hash table can start to get close to constant time. And deletion can get close to constant time. And lookup can get close to constant time. We don't have a data structure yet that can do that, and so this already sounds like a pretty great thing. We've really mitigated the disadvantages of each on its own.

To get this performance upgrade, though, we need to rethink how we add data into the structure.
Specifically, we want the data itself to tell us where it should go in the structure. And if we then need to see if it's in the structure, if we need to find it, we want to look at the data again and, using the data, be able to effectively randomly access it. Just by looking at the data, we should have an idea of exactly where we're going to find it in the hash table.

Now the downside of a hash table is that they're really pretty bad at ordering or sorting data. In fact, if you start to use them to order or sort data, you lose all of the advantages you previously had in terms of insertion and deletion. The time becomes closer to theta of n, and we've basically regressed into a linked list. And so we only want to use hash tables if we don't care whether the data is sorted. For the context in which you'll use them in CS50, you probably don't care that the data is sorted.

So a hash table is a combination of two distinct pieces with which we're familiar. The first is a function, which we usually call a hash function. And that hash function is going to return some non-negative integer, which we usually call a hash code, OK? The second piece is an array, which is capable of storing data of the type we want to place into the data structure. We'll hold off on the linked list element for now and just start with the basics of a hash table to get your head around it, and then maybe we'll blow your mind a little bit when we combine arrays and linked lists together.

The basic idea, though, is we take some data. We run that data through the hash function. The data is processed and it spits out a number, OK? And then with that number, we just store the data we want to store in the array at that location.

So for example, maybe we have this hash table of strings. It's got 10 elements in it, so we can fit 10 strings in it. Let's say we want to hash John. So John is the data we want to insert into this hash table somewhere. Where do we put it? Well, typically with an array, so far we probably would put it in array location 0. But now we have this new hash function.
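As a minimal sketch of that basic idea in C, assuming a hypothetical hash function hash() that maps a string to a number from 0 to 9 -- the names hashtable, insert(), and lookup() are illustrative, and collisions are ignored here for now:

```c
#include <stdbool.h>
#include <string.h>

#define HASH_MAX 10              // our hypothetical table holds 10 strings

char *hashtable[HASH_MAX];       // the array; as a global, every slot starts out NULL

// Some hash function that maps a string to a number from 0 to HASH_MAX - 1.
unsigned int hash(char *str);

// Insert: ask the data where it goes, then put it there.
void insert(char *str)
{
    hashtable[hash(str)] = str;
}

// Lookup: hash the string again and check that one location.
bool lookup(char *str)
{
    unsigned int code = hash(str);
    return hashtable[code] != NULL && strcmp(hashtable[code], str) == 0;
}
```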
And let's say that we run John through this hash function and it spits out 4. Well, that's where we're going to want to put John. We want to put John in array location 4, because if we hash John again-- let's say later we want to search and see if John exists in this hash table-- all we need to do is run it through the same hash function, get the number 4 out, and be able to find John immediately in our data structure. That's pretty good.

Let's say we now do this again; we want to hash Paul. We want to add Paul into this hash table. Let's say that this time we run Paul through the hash function, and the hash code that is generated is 6. Well, now we can put Paul in array location 6. And if we need to look up whether Paul is in this hash table, all we need to do is run Paul through the hash function again, and we're going to get 6 out again. And then we just look at array location 6. Is Paul there? If so, he's in the hash table. Is Paul not there? Then he's not in the hash table. It's pretty straightforward.

Now how do you define a hash function? Well, there's really no limit to the number of possible hash functions. In fact, there are a number of really, really good ones on the internet. There are a number of really, really bad ones on the internet. And it's also pretty easy to write a bad one.

So what makes up a good hash function? Well, a good hash function should use only the data being hashed, and all of the data being hashed. We don't want to incorporate anything else other than the data. And we want to use all of the data-- we don't want to just use a piece of it, we want to use all of it.

A hash function should also be deterministic. What does that mean? Well, it means that every time we pass the exact same piece of data into the hash function, we always get the same hash code out. If I pass John into the hash function, I get out 4. I should be able to do that 10,000 times and I'll always get 4. So no random numbers, effectively, can be involved in our hash functions.
A hash function should also uniformly distribute data. If every time you run data through the hash function you get the hash code 0, that's probably not so great, right? You probably want a big range of hash codes, so things can be spread out throughout the table. And it would also be great if really similar data, like John and Jonathan, were spread out to very different locations in the hash table. That would be a nice advantage.

Here's an example of a hash function. I wrote this one up earlier. It's not a particularly good hash function, for reasons that don't really bear going into right now. But do you see what's going on here? It seems like we're declaring a variable called sum and setting it equal to 0. And then apparently I'm doing something so long as strstr[j] is not equal to backslash 0. What am I doing there?

This is basically just another way of implementing strlen and detecting when you've reached the end of the string. So I don't have to actually calculate the length of the string; when I hit the backslash 0 character, I know I've reached the end of the string. And then I'm going to keep iterating through that string, adding strstr[j] to sum, and then at the end of the day I'm going to return sum mod HASH_MAX.

Basically, all this hash function is doing is adding up all of the ASCII values of my string, and then it's returning some hash code modded by HASH_MAX, which is probably the size of my array, right? If my array is of size 10, I don't want to be getting out hash codes 11, 12, 13-- I can't put things into those locations of the array; that would be illegal. I'd suffer a segmentation fault.

Now here's another quick aside. Generally, you're probably not going to want to write your own hash functions. It's actually a bit of an art, not a science, and there's a lot that goes into them. The internet, like I said, is full of really good hash functions, and you should use the internet to find hash functions, because it's really just kind of an unnecessary waste of time to create your own.
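Pieced together from that description, here's roughly what such a function might look like -- treat it as a reconstruction of the example being described, not the exact code on screen, and not a hash function you'd actually want to use:

```c
#define HASH_MAX 10   // assumed to match the size of the array

// Add up the ASCII values of the characters in the string, then mod by
// HASH_MAX so the result is always a valid array index. The parameter
// name strstr follows the transcript; the library function strstr()
// is not used here.
unsigned int hash(char *strstr)
{
    unsigned int sum = 0;

    // Loop until the terminating '\0', which saves us a call to strlen.
    for (int j = 0; strstr[j] != '\0'; j++)
    {
        sum += strstr[j];
    }

    return sum % HASH_MAX;
}
```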
You can write simple ones for testing purposes. But when you actually start hashing data and storing it in a hash table, you're probably going to want to use some function that was generated for you, that exists on the internet. If you do, just be sure to cite your sources. There's no reason to plagiarize anything here.

The computer science community is definitely growing, and it really values open source, and it's really important to cite your sources so that people can get attribution for the work that they're doing to the benefit of the community. So always be sure-- and not just for hash functions, but generally when you use code from an outside source-- to always cite your source. Give credit to the person who did some of the work so you don't have to.

OK, so let's revisit this hash table for a second. This is where we left off after we inserted John and Paul into this hash table. Do you see a problem here? You might see two. But in particular, do you see this possible problem?

What if I hash Ringo, and it turns out that after processing that data through the hash function, Ringo also generates the hash code 6? I've already got data at array location 6. So that's probably going to be a bit of a problem for me now, right?

We call this a collision. A collision occurs when two pieces of data run through the same hash function yield the same hash code. Presumably we still want to get both pieces of data into the hash table; otherwise we wouldn't be running Ringo arbitrarily through the hash function. We presumably want to get Ringo into that array.

How do we do it, though, if he and Paul both yield hash code 6? We don't want to overwrite Paul; we want Paul to be there too. So we need to find a way to get elements into the hash table that still preserves our quick insertion and quick lookup. And one way to deal with it is to do something called linear probing.

Using this method, if we have a collision, what do we do? Well, we can't put him in array location 6, or whatever hash code was generated, so let's put him at hash code plus 1. And if that's full, let's put him at hash code plus 2.
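Here's one way that probing rule might look in code -- a sketch that reuses the hypothetical hashtable array and hash() function from the earlier sketches, and wraps around the end of the array so the probe sequence stays inside it:

```c
#include <stdbool.h>
#include <stddef.h>

#define HASH_MAX 10

// Same hypothetical pieces as before.
char *hashtable[HASH_MAX];
unsigned int hash(char *str);

// Insertion with linear probing: start at the hash code and, if that slot
// is taken, step forward (wrapping around) until we find an empty one.
// Returns false if every slot is already full.
bool insert(char *str)
{
    unsigned int start = hash(str);

    for (unsigned int offset = 0; offset < HASH_MAX; offset++)
    {
        unsigned int index = (start + offset) % HASH_MAX;

        if (hashtable[index] == NULL)
        {
            hashtable[index] = str;
            return true;
        }
    }

    return false;   // the table is full -- a limitation we'll run into shortly
}
```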
The benefit of this is that if he's not exactly where we think he is and we have to start searching, maybe we don't have to go too far. Maybe we don't have to search all n elements of the hash table; maybe we only have to search a couple of them. And so we're still tending toward that average case being close to 1 versus close to n, so maybe that'll work.

So let's see how this might work out in reality, and let's see if maybe we can detect the problem that might occur here. Let's say we hash Bart. So now we're going to run a new set of strings through the hash function. We run Bart through the hash function and we get hash code 6. We take a look, we see 6 is empty, so we can put Bart there.

Now we hash Lisa, and that also generates hash code 6. Well, now that we're using this linear probing method, we start at 6, and we see that 6 is full. We can't put Lisa in 6. So where do we go? Let's go to 7. 7 is empty, so that works. So let's put Lisa there.

Now we hash Homer and we get 7. OK, well, we know that 7 is full now, so we can't put Homer there. So let's go to 8. Is 8 available? Yes, and 8 is close to 7, so if we have to start searching, we're not going to have to go too far. So let's put Homer at 8.

Now we hash Maggie and it returns 3. Thank goodness, we're able to just put Maggie there; we don't have to do any sort of probing for that. Now we hash Marge, and Marge also returns 6. Well, 6 is full, 7 is full, 8 is full, 9-- all right, thank God, 9 is empty. I can put Marge at 9.

Already we can see that we're starting to have this problem where we're starting to stretch things kind of far away from their hash codes. And that theta of 1, that average case of being constant time, is starting to tend a little more toward theta of n. We're starting to lose that advantage of hash tables.

This problem that we just saw is something called clustering.
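To see what that stretching costs, here's the matching lookup sketch under linear probing, again with the same hypothetical names as above; the longer the run of occupied slots around a hash code grows, the more of these probe steps both insertion and lookup have to take:

```c
#include <stdbool.h>
#include <string.h>

#define HASH_MAX 10

// Same hypothetical table and hash function as above.
extern char *hashtable[HASH_MAX];
unsigned int hash(char *str);

// Lookup walks the same probe sequence as insertion: start at the hash
// code and step forward until we find the string, hit an empty slot
// (meaning it was never inserted, assuming nothing is ever deleted),
// or have checked every slot.
bool lookup(char *str)
{
    unsigned int start = hash(str);

    for (unsigned int offset = 0; offset < HASH_MAX; offset++)
    {
        unsigned int index = (start + offset) % HASH_MAX;

        if (hashtable[index] == NULL)
        {
            return false;
        }
        if (strcmp(hashtable[index], str) == 0)
        {
            return true;
        }
    }

    return false;
}
```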
And what's really bad about clustering is that once you have two elements that are side by side, it makes it even more likely-- you have double the chance-- that you're going to have another collision with that cluster, and the cluster will grow by one. And you'll keep growing and growing your likelihood of having a collision. Eventually it's just as bad as not sorting the data at all.

The other problem, though, is that so far, up to this point, we've just been sort of understanding what a hash table is, and we still only have room for 10 strings. If we want to continue to hash the citizens of Springfield, we can only get 10 of them in there. And if we try to add an 11th or 12th, we don't have a place to put them. We could just be spinning around in circles trying to find an empty spot, and maybe we get stuck in an infinite loop.

So this sort of lends itself to the idea of something called chaining. And this is where we're going to bring linked lists back into the picture. What if, instead of storing just the data itself in the array, every element of the array could hold multiple pieces of data? Well, that doesn't make sense, right? We know that each element of an array can only hold one piece of data of that data type.

But what if that data type is a linked list, right? What if every element of the array was a pointer to the head of a linked list? Then we could build those linked lists and grow them arbitrarily, because linked lists allow us to grow and shrink a lot more flexibly than an array does. So what if we leverage this, right? We start to grow these chains out of these array locations.

Now we can fit an infinite amount of data-- or not infinite, an arbitrary amount of data-- into our hash table without a collision ever leaving us with nowhere to put it. We've also eliminated clustering by doing this.

And we know that when we insert into a linked list-- if you recall from our video on linked lists, singly linked lists and doubly linked lists-- it's a constant-time operation. We're just adding to the front. And for lookup, well, we do know that lookup in a linked list can be a problem, right?
We have to search through it from beginning to end. There's no random access in a linked list. But if, instead of having one linked list where a lookup would be O of n, we now have 10 linked lists, or 1,000 linked lists, now it's O of n divided by 10, or O of n divided by 1,000.

And while we were talking theoretically about complexity we disregard constants, in the real world these things actually matter, right? We actually will notice that this happens to run 10 times faster, or 1,000 times faster, because we're distributing one long chain across 1,000 smaller chains. And so each time we have to search through one of those chains, we can ignore the 999 chains we don't care about and just search that one, which on average is going to be 1,000 times shorter. And so we're still sort of tending toward this average case of being constant time, but only because we're leveraging division by some huge constant factor. Let's see how this might actually look, though.

So this was the hash table we had before, where we declared a hash table that was capable of storing 10 strings. We're not going to do that anymore; we already know the limitations of that method. Now our hash table is going to be an array of 10 node pointers-- pointers to the heads of linked lists. And right now each of those 10 pointers is null. There's nothing in our hash table right now.

Now let's start to put some things into this hash table, and let's see how this method is going to benefit us a little bit. Let's now hash Joey. We'll run the string Joey through a hash function and it returns 6. Well, what do we do now?

Well, now we're working with linked lists; we're not working with arrays. And when we're working with linked lists, we know we need to start dynamically allocating space and building chains. Those are the core elements of building a linked list. So let's dynamically allocate space for Joey, and then let's add him to the chain.

So now look at what we've done. When we hashed Joey we got the hash code 6.
Now the pointer at array location 6 points to the head of a linked list, and right now it's the only element of that linked list. And the node in that linked list is Joey.

So if we need to look up Joey later, we just hash Joey again, and we get 6 again, because our hash function is deterministic. Then we start at the head of the linked list pointed to by array location 6, and we can iterate across it trying to find Joey. And if we build our hash table effectively, and our hash function distributes the data well, on average each of those linked lists at every array location will be 1/10 the size of what we'd have if we just kept everything in a single huge linked list. If we distribute that huge linked list across 10 linked lists, each list will be 1/10 the size, and thus 10 times quicker to search through.

So let's do this again. Let's now hash Ross. And let's say that when we do that, the hash code we get back is 2. Well, now we dynamically allocate a new node, we put Ross in that node, and we say that array location 2, instead of pointing to null, now points to the head of a linked list whose only node is Ross.

And we can do this one more time. We can hash Rachel and get hash code 4. We malloc a new node, put Rachel in the node, and say array location 4 now points to the head of a linked list whose only element happens to be Rachel.

OK, but what happens if we have a collision? Let's see how we handle collisions using the separate chaining method. Let's hash Phoebe. We get the hash code 6. In our previous example we were just storing the strings directly in the array, and this was a problem. We don't want to clobber Joey, and we've already seen that we can get some clustering problems if we try to step through and probe.

But what if we just treat this the same way we'd treat adding an element to the head of a linked list? Let's just malloc space for Phoebe. We'll say Phoebe's next pointer points to the old head of the linked list, and then 6 just points to the new head of the linked list. And now look, we've chained Phoebe in.
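Putting those pieces together, here's a sketch of what this chained hash table might look like in C. The names node, insert(), lookup(), and hash() are illustrative, not the actual code from the video, and a real version would probably copy the string rather than store the caller's pointer:

```c
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>

#define HASH_MAX 10

// Each slot of the table is now the head of a (possibly empty) linked list.
typedef struct node
{
    char *str;            // kept as a pointer here to keep the sketch short
    struct node *next;
}
node;

node *hashtable[HASH_MAX];    // 10 pointers, all starting out NULL

// Assume hash() is the same hypothetical function as before.
unsigned int hash(char *str);

// Insert: hash the string, malloc a node, and chain it in at the head
// of the list at that location -- constant time, collisions or not.
bool insert(char *str)
{
    node *n = malloc(sizeof(node));
    if (n == NULL)
    {
        return false;
    }
    unsigned int code = hash(str);
    n->str = str;
    n->next = hashtable[code];    // new node points to the old head
    hashtable[code] = n;          // the slot now points to the new head
    return true;
}

// Lookup: hash the string, then walk only the one chain at that location.
bool lookup(char *str)
{
    for (node *ptr = hashtable[hash(str)]; ptr != NULL; ptr = ptr->next)
    {
        if (strcmp(ptr->str, str) == 0)
        {
            return true;
        }
    }
    return false;
}
```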
We can now store two elements with hash code 6, and we don't have any problems. That's pretty much all there is to chaining. And chaining is definitely the method that's going to be most effective for you if you are storing data in a hash table.

This combination of arrays and linked lists together to form a hash table really dramatically improves your ability to store large amounts of data, and to very quickly and efficiently search through that data.

There's still one more data structure out there that might even be a bit better in terms of guaranteeing that our insertion, deletion, and lookup times are even faster. And we'll see that in a video on tries.

I'm Doug Lloyd, this is CS50.