1 00:00:00,500 --> 00:00:01,900 [MUSIC PLAYING] 2 00:00:05,710 --> 00:00:09,150 >> DOUG LLOYD: By now you know a lot about arrays, 3 00:00:09,150 --> 00:00:11,610 and you know a lot about linked lists. 4 00:00:11,610 --> 00:00:13,650 And we've discuss the pros and cons, we've 5 00:00:13,650 --> 00:00:16,620 discussed that linked lists can get bigger and smaller, 6 00:00:16,620 --> 00:00:18,630 but they take up more size. 7 00:00:18,630 --> 00:00:22,359 Arrays are much more straightforward to use, but they're restrictive in as much 8 00:00:22,359 --> 00:00:24,900 as we have to set the size of the array at the very beginning 9 00:00:24,900 --> 00:00:26,910 and then we're stuck with it. 10 00:00:26,910 --> 00:00:30,470 >> But that's, we've pretty much exhausted all of our topics 11 00:00:30,470 --> 00:00:33,040 about linked lists and arrays. 12 00:00:33,040 --> 00:00:34,950 Or have we? 13 00:00:34,950 --> 00:00:37,720 Maybe we can do something even more creative. 14 00:00:37,720 --> 00:00:40,950 And that sort of lends the idea of a hash table. 15 00:00:40,950 --> 00:00:46,680 >> So in a hash table we're going to try combine an array with a linked list. 16 00:00:46,680 --> 00:00:49,520 We're going to take the advantages of the array, like random access, 17 00:00:49,520 --> 00:00:53,510 being able to just go to array element 4 or array element 8 18 00:00:53,510 --> 00:00:55,560 without having to iterate across. 19 00:00:55,560 --> 00:00:57,260 That's pretty fast, right? 20 00:00:57,260 --> 00:01:00,714 >> But we also want to have our data structure be able to grow and shrink. 21 00:01:00,714 --> 00:01:02,630 We don't need, we don't want to be restricted. 22 00:01:02,630 --> 00:01:04,588 And we want to be able to add and remove things 23 00:01:04,588 --> 00:01:08,430 very easily, which if you recall, is very complex with an array. 24 00:01:08,430 --> 00:01:11,650 And we can call this new thing a hash table. 25 00:01:11,650 --> 00:01:15,190 >> And if implemented correctly, we're sort of taking 26 00:01:15,190 --> 00:01:18,150 the advantages of both data structures you've already seen, 27 00:01:18,150 --> 00:01:19,880 arrays and linked lists. 28 00:01:19,880 --> 00:01:22,480 >> Well an average insertion into a hash table 29 00:01:22,480 --> 00:01:25,350 can start to get close to constant time. 30 00:01:25,350 --> 00:01:27,550 And deletion can get close to constant time. 31 00:01:27,550 --> 00:01:30,420 And lookup can get close to constant time. 32 00:01:30,420 --> 00:01:34,580 That's-- we don't have a data structure yet that can do that, 33 00:01:34,580 --> 00:01:37,160 and so this already sounds like a pretty great thing. 34 00:01:37,160 --> 00:01:42,990 We've really mitigated the disadvantages of each on its own. 35 00:01:42,990 --> 00:01:45,140 >> To get this performance upgrade though, we 36 00:01:45,140 --> 00:01:49,560 need to rethink how we add data into the structure. 37 00:01:49,560 --> 00:01:53,360 Specifically we want the data itself to tell us 38 00:01:53,360 --> 00:01:55,700 where it should go in the structure. 39 00:01:55,700 --> 00:01:58,830 And if we then need to see if it's in the structure, if we need to find it, 40 00:01:58,830 --> 00:02:02,590 we want to look at the data again and be able to effectively, 41 00:02:02,590 --> 00:02:04,890 using the data, randomly access it. 42 00:02:04,890 --> 00:02:06,680 Just by looking at the data we should have 43 00:02:06,680 --> 00:02:11,870 an idea of where exactly we're going to find it in the hash table. 44 00:02:11,870 --> 00:02:14,750 >> Now the downside of a hash table is that they're really 45 00:02:14,750 --> 00:02:17,750 pretty bad at ordering or sorting data. 46 00:02:17,750 --> 00:02:20,950 And in fact, if you start to use them to order or sort 47 00:02:20,950 --> 00:02:23,970 data you lose all of the advantages you previously 48 00:02:23,970 --> 00:02:26,690 had in terms of insertion and deletion. 49 00:02:26,690 --> 00:02:30,040 The time becomes closer to theta of n, and we've basically 50 00:02:30,040 --> 00:02:31,510 regressed into a linked list. 51 00:02:31,510 --> 00:02:34,830 And so we only want to use hash tables if we don't care about 52 00:02:34,830 --> 00:02:37,470 whether data is sorted. 53 00:02:37,470 --> 00:02:40,270 For the context in which you'll use them in CS50 54 00:02:40,270 --> 00:02:44,880 you probably don't care that the data is sorted. 55 00:02:44,880 --> 00:02:49,150 >> So a hash table is a combination of two distinct pieces 56 00:02:49,150 --> 00:02:50,960 with which we're familiar. 57 00:02:50,960 --> 00:02:53,910 The first is a function, which we usually call a hash function. 58 00:02:53,910 --> 00:02:57,740 And that hash function is going to return some non-negative integer, which 59 00:02:57,740 --> 00:03:00,850 we usually call a hashcode, OK? 60 00:03:00,850 --> 00:03:06,210 The second piece is an array, which is capable of storing data of the type we 61 00:03:06,210 --> 00:03:08,170 want to place into the data structure. 62 00:03:08,170 --> 00:03:10,290 We'll hold off on the linked list element for now 63 00:03:10,290 --> 00:03:13,790 and just start with the basics of a hash table to get your head around it, 64 00:03:13,790 --> 00:03:16,190 and then we'll maybe blow your mind a little bit when we 65 00:03:16,190 --> 00:03:18,900 combine arrays and link lists together. 66 00:03:18,900 --> 00:03:21,570 >> The basic idea though is we take some data. 67 00:03:21,570 --> 00:03:23,840 We run that data through the hash function. 68 00:03:23,840 --> 00:03:27,960 And so the data is processed and it spits out a number, OK? 69 00:03:27,960 --> 00:03:30,870 And then with that number we just store the data 70 00:03:30,870 --> 00:03:34,910 we want to store in the array at that location. 71 00:03:34,910 --> 00:03:39,970 So for example we have maybe this hash table of strings. 72 00:03:39,970 --> 00:03:43,330 It's got 10 elements in it, so we can fit 10 strings in it. 73 00:03:43,330 --> 00:03:45,300 >> Let's say we want to hash John. 74 00:03:45,300 --> 00:03:48,950 So John as the data we want to insert into this hash table somewhere. 75 00:03:48,950 --> 00:03:49,750 Where do we put it? 76 00:03:49,750 --> 00:03:51,700 Well typically with an array so far we probably 77 00:03:51,700 --> 00:03:53,590 would put it in array location 0. 78 00:03:53,590 --> 00:03:55,940 But now we have this new hash function. 79 00:03:55,940 --> 00:03:59,160 >> And let's say that we run John through this hash function 80 00:03:59,160 --> 00:04:01,390 and it's spits out 4. 81 00:04:01,390 --> 00:04:03,480 Well that's where we're going to want to put John. 82 00:04:03,480 --> 00:04:08,030 We want to put John in array location 4, because if we hash John again-- 83 00:04:08,030 --> 00:04:09,790 let's say later we want to search and see 84 00:04:09,790 --> 00:04:12,940 if John exists in this hash table-- all we need to do 85 00:04:12,940 --> 00:04:16,290 is run it through the same hash function, get the number 4 out, 86 00:04:16,290 --> 00:04:19,900 and be able to find John immediately in our data structure. 87 00:04:19,900 --> 00:04:20,700 That's pretty good. 88 00:04:20,700 --> 00:04:22,900 >> Let's say we now do this again, we want to hash Paul. 89 00:04:22,900 --> 00:04:25,420 We want to add Paul into this hash table. 90 00:04:25,420 --> 00:04:28,050 Let's say that this time we run Paul through the hash function, 91 00:04:28,050 --> 00:04:30,580 the hashcode that is generated is 6. 92 00:04:30,580 --> 00:04:33,320 Well now we can put Paul in the array location 6. 93 00:04:33,320 --> 00:04:36,020 And if we need to look up whether Paul is in this hash table, 94 00:04:36,020 --> 00:04:39,880 all we need to do is run Paul through the hash function again 95 00:04:39,880 --> 00:04:41,810 and we're going to get 6 out again. 96 00:04:41,810 --> 00:04:43,570 >> And then we just look at array location 6. 97 00:04:43,570 --> 00:04:44,890 Is Paul there? 98 00:04:44,890 --> 00:04:46,140 If so, he's in the hash table. 99 00:04:46,140 --> 00:04:47,850 Is Paul not there? 100 00:04:47,850 --> 00:04:48,980 He's not in the hash table. 101 00:04:48,980 --> 00:04:51,700 It's pretty straightforward. 102 00:04:51,700 --> 00:04:53,750 >> Now how do you define a hash function? 103 00:04:53,750 --> 00:04:57,440 Well there's really no limit to the number of possible hash functions. 104 00:04:57,440 --> 00:05:00,330 In fact there's a number of really, really good ones on the internet. 105 00:05:00,330 --> 00:05:03,540 There's a number of really, really bad ones on the internet. 106 00:05:03,540 --> 00:05:07,060 It's also pretty easy to write a bad one. 107 00:05:07,060 --> 00:05:09,740 >> So what makes up a good hash function, right? 108 00:05:09,740 --> 00:05:13,260 Well a good hash function should use only the data being hashed, 109 00:05:13,260 --> 00:05:15,400 and all of the data being hashed. 110 00:05:15,400 --> 00:05:18,480 So we don't want to use anything-- we don't incorporate anything 111 00:05:18,480 --> 00:05:21,540 else other than the data. 112 00:05:21,540 --> 00:05:23,060 And we want to use all of the data. 113 00:05:23,060 --> 00:05:25,810 We don't want to just use a piece of it, we want to use all of it. 114 00:05:25,810 --> 00:05:27,690 A hash function should also be deterministic. 115 00:05:27,690 --> 00:05:28,530 What does that mean? 116 00:05:28,530 --> 00:05:32,180 Well it means that every time we pass the exact same piece of data 117 00:05:32,180 --> 00:05:36,020 into the hash function we always get the same hashcode out. 118 00:05:36,020 --> 00:05:38,850 If I pass John into the hash function I get out 4. 119 00:05:38,850 --> 00:05:42,090 I should be able to do that 10,000 times and I'll always get 4. 120 00:05:42,090 --> 00:05:46,790 So no random numbers effectively can be involved in our hash tables-- 121 00:05:46,790 --> 00:05:48,730 in our hash functions. 122 00:05:48,730 --> 00:05:51,730 >> A hash function should also uniformly distribute data. 123 00:05:51,730 --> 00:05:55,680 If every time you run data through the hash function you get the hashcode 0, 124 00:05:55,680 --> 00:05:57,180 that's probably not so great, right? 125 00:05:57,180 --> 00:06:00,580 You probably want to big a range of hash codes. 126 00:06:00,580 --> 00:06:03,170 Also things can be spread out throughout the table. 127 00:06:03,170 --> 00:06:09,190 And also it would be great if really similar data, like John and Jonathan, 128 00:06:09,190 --> 00:06:12,300 maybe were spread out to weigh different locations in the hash table. 129 00:06:12,300 --> 00:06:14,730 That would be a nice advantage. 130 00:06:14,730 --> 00:06:17,230 >> Here's an example of a hash function. 131 00:06:17,230 --> 00:06:19,130 I wrote this one up earlier. 132 00:06:19,130 --> 00:06:21,020 It's not a particularly good hash function 133 00:06:21,020 --> 00:06:23,360 for reasons that don't really bear going into right now. 134 00:06:23,360 --> 00:06:27,020 But do you see what's going on here? 135 00:06:27,020 --> 00:06:30,400 It seems like we're declaring a variable called sum and setting it equal to 0. 136 00:06:30,400 --> 00:06:35,990 And then apparently I'm doing something so long as strstr[j] is not equal 137 00:06:35,990 --> 00:06:38,470 to backslash 0. 138 00:06:38,470 --> 00:06:40,850 What am I doing there? 139 00:06:40,850 --> 00:06:43,420 >> This is basically just another way of implementing [? strl ?] 140 00:06:43,420 --> 00:06:45,750 and detecting when you've reached the end of the string. 141 00:06:45,750 --> 00:06:48,460 So I don't have to actually calculate the length of the string, 142 00:06:48,460 --> 00:06:51,620 I'm just using when I hit the backslash 0 character I know 143 00:06:51,620 --> 00:06:53,160 I've reached the end of the string. 144 00:06:53,160 --> 00:06:55,540 And then I'm going to keep iterating through that string, 145 00:06:55,540 --> 00:07:01,290 adding strstr[j] to sum, and then at the end of the day going to return sum mod 146 00:07:01,290 --> 00:07:02,180 HASH_MAX. 147 00:07:02,180 --> 00:07:04,680 >> Basically all this hash function is doing is adding up 148 00:07:04,680 --> 00:07:09,430 all of the ASCII values of my string, and then it's 149 00:07:09,430 --> 00:07:12,370 returning some hashcode modded by HASH_MAX. 150 00:07:12,370 --> 00:07:15,770 It's probably the size of my array, right? 151 00:07:15,770 --> 00:07:19,140 I don't want to be getting hash codes if my array is of size 10, 152 00:07:19,140 --> 00:07:21,690 I don't want to be getting out hash codes 11, 12, 153 00:07:21,690 --> 00:07:24,670 13, I can't put things into those locations of the array, 154 00:07:24,670 --> 00:07:25,770 that would be illegal. 155 00:07:25,770 --> 00:07:28,110 I'd suffer a segmentation fault. 156 00:07:28,110 --> 00:07:31,210 >> Now here is another quick aside. 157 00:07:31,210 --> 00:07:34,450 Generally you're probably not going to want to write your own hash functions. 158 00:07:34,450 --> 00:07:36,970 It is actually a bit of an art, not a science. 159 00:07:36,970 --> 00:07:38,560 And there's a lot that goes into them. 160 00:07:38,560 --> 00:07:41,260 The internet, like I said, is full of really good hash functions, 161 00:07:41,260 --> 00:07:44,450 and you should use the internet to find hash functions because it's really 162 00:07:44,450 --> 00:07:49,210 just kind of an unnecessary waste of time to create your own. 163 00:07:49,210 --> 00:07:51,470 >> You can write simple ones for testing purposes. 164 00:07:51,470 --> 00:07:54,300 But when you actually are going to start hashing data and storing it 165 00:07:54,300 --> 00:07:56,760 into a hash table you're probably going to want 166 00:07:56,760 --> 00:08:00,560 to use some function that was generated for you, that exists on the internet. 167 00:08:00,560 --> 00:08:03,220 If you do just be sure to cite your sources. 168 00:08:03,220 --> 00:08:07,720 There's no reason to plagiarize anything here. 169 00:08:07,720 --> 00:08:11,350 >> The computer science community is definitely growing, and really values 170 00:08:11,350 --> 00:08:14,770 open source, and it's really important to cite your sources so that people 171 00:08:14,770 --> 00:08:16,650 can get attribution for the work that they're 172 00:08:16,650 --> 00:08:18,290 doing to the benefit of the community. 173 00:08:18,290 --> 00:08:20,000 So always be sure-- and not just for hash 174 00:08:20,000 --> 00:08:23,250 functions, but generally when you use code from an outside source, 175 00:08:23,250 --> 00:08:24,270 always cite your source. 176 00:08:24,270 --> 00:08:29,480 Give credit to the person who did some of the work so you don't have to. 177 00:08:29,480 --> 00:08:32,700 >> OK so let's revisit this hash table for a second. 178 00:08:32,700 --> 00:08:34,760 This is where we left off after we inserted 179 00:08:34,760 --> 00:08:39,280 John and Paul into this hash table. 180 00:08:39,280 --> 00:08:41,160 Do you see a problem here? 181 00:08:41,160 --> 00:08:44,390 You might see two. 182 00:08:44,390 --> 00:08:48,220 But in particular, do you see this possible problem? 183 00:08:48,220 --> 00:08:52,750 >> What if I hash Ringo, and it turns out that after processing 184 00:08:52,750 --> 00:08:59,980 that data through the hash function Ringo also generated the hashcode 6. 185 00:08:59,980 --> 00:09:02,790 I've already got data at hashcode-- array location 6. 186 00:09:02,790 --> 00:09:07,980 So it's probably going to be a bit of a problem for me now, right? 187 00:09:07,980 --> 00:09:09,040 >> We call this a collision. 188 00:09:09,040 --> 00:09:12,440 And the collision occurs when two pieces of data run through the same hash 189 00:09:12,440 --> 00:09:15,180 function yield the same hashcode. 190 00:09:15,180 --> 00:09:18,830 Presumably we still want to get both pieces of data into the hash table, 191 00:09:18,830 --> 00:09:22,310 otherwise we wouldn't be running Ringo arbitrarily through the hash function. 192 00:09:22,310 --> 00:09:26,850 We presumably want to get Ringo into that array. 193 00:09:26,850 --> 00:09:32,050 >> How do we do it though, if he and Paul both yield hashcode 6? 194 00:09:32,050 --> 00:09:35,500 We don't want to overwrite Paul, we want Paul to be there too. 195 00:09:35,500 --> 00:09:38,770 So we need to find a way to get elements into the hash table that 196 00:09:38,770 --> 00:09:42,530 still preserves our quick insertion and quick look up. 197 00:09:42,530 --> 00:09:47,330 And one way to deal with it is to do something called linear probing. 198 00:09:47,330 --> 00:09:50,150 >> Using this method if we have a collision, well, what do we do? 199 00:09:50,150 --> 00:09:54,560 Well we can't put him in array location 6, or whatever hashcode was generated, 200 00:09:54,560 --> 00:09:56,800 let's put him at hashcode plus 1. 201 00:09:56,800 --> 00:09:59,650 And if that's full let's put him in hashcode plus 2. 202 00:09:59,650 --> 00:10:03,400 The benefit of this being if he's not exactly where we think he is, 203 00:10:03,400 --> 00:10:06,830 and we have to start searching, maybe we don't have to go too far. 204 00:10:06,830 --> 00:10:09,880 Maybe we don't have to search all n elements of the hash table. 205 00:10:09,880 --> 00:10:11,770 Maybe we have to search a couple of them. 206 00:10:11,770 --> 00:10:16,660 >> And so we're still tending towards that average case being close to 1 vs 207 00:10:16,660 --> 00:10:20,260 close to n, so maybe that'll work. 208 00:10:20,260 --> 00:10:23,990 So let's see how this might work out in reality. 209 00:10:23,990 --> 00:10:27,440 And let's see if maybe we can detect the problem that might occur here. 210 00:10:27,440 --> 00:10:28,520 >> Let's say we hash Bart. 211 00:10:28,520 --> 00:10:31,560 So now we're going to run a new set of strings through the hash function, 212 00:10:31,560 --> 00:10:34,440 and we run Bart through the hash function, we get hashcode 6. 213 00:10:34,440 --> 00:10:38,080 We take a look, we see 6 is empty, so we can put Bart there. 214 00:10:38,080 --> 00:10:41,390 >> Now we hash Lisa and that also generates hashcode 6. 215 00:10:41,390 --> 00:10:44,310 Well now that we're using this linear probing method we start at 6, 216 00:10:44,310 --> 00:10:45,310 we see that 6 is full. 217 00:10:45,310 --> 00:10:46,970 We can't put Lisa in 6. 218 00:10:46,970 --> 00:10:48,070 So where do we go? 219 00:10:48,070 --> 00:10:49,260 Let's go to 7. 220 00:10:49,260 --> 00:10:50,590 7's empty, so that works. 221 00:10:50,590 --> 00:10:52,490 So let's put Lisa there. 222 00:10:52,490 --> 00:10:56,180 >> Now we hash Homer and we get 7. 223 00:10:56,180 --> 00:10:59,360 OK well we know that 7's full now, so we can't put Homer there. 224 00:10:59,360 --> 00:11:00,830 So let's go to 8. 225 00:11:00,830 --> 00:11:01,640 Is 8 available? 226 00:11:01,640 --> 00:11:04,310 Yeah, and 8's close to 7, so if we have to start searching we're 227 00:11:04,310 --> 00:11:06,000 not going to have to go too far. 228 00:11:06,000 --> 00:11:08,840 And so let's put Homer at 8. 229 00:11:08,840 --> 00:11:11,130 >> Now we hash Maggie and returns 3, thank goodness 230 00:11:11,130 --> 00:11:12,630 we're able to just put Maggie there. 231 00:11:12,630 --> 00:11:15,050 We don't have to do any sort of probing for that. 232 00:11:15,050 --> 00:11:17,980 Now we hash Marge, and Marge also returns 6. 233 00:11:17,980 --> 00:11:25,540 >> Well 6 is full, 7 is full, 8 is full, 9, all right thank God, 9 is empty. 234 00:11:25,540 --> 00:11:27,580 I can put Marge at 9. 235 00:11:27,580 --> 00:11:31,260 Already we can see that we're starting to have this problem where now we're 236 00:11:31,260 --> 00:11:34,760 starting to stretch things kind of far away from their hash codes. 237 00:11:34,760 --> 00:11:38,940 And that theta of 1, that average case of being constant time, 238 00:11:38,940 --> 00:11:42,540 is starting to get a little more-- starting to tend a little more 239 00:11:42,540 --> 00:11:44,030 towards theta of n. 240 00:11:44,030 --> 00:11:47,250 We're starting to lose that advantage of hash tables. 241 00:11:47,250 --> 00:11:49,880 >> This problem that we just saw is something called clustering. 242 00:11:49,880 --> 00:11:52,340 And what's really bad about clustering is that once you now 243 00:11:52,340 --> 00:11:55,580 have two elements that are side by side it makes it even more likely, 244 00:11:55,580 --> 00:11:57,460 you have double the chance, that you're going 245 00:11:57,460 --> 00:11:59,490 to have another collision with that cluster, 246 00:11:59,490 --> 00:12:00,870 and the cluster will grow by one. 247 00:12:00,870 --> 00:12:05,620 And you'll keep growing and growing your likelihood of having a collision. 248 00:12:05,620 --> 00:12:10,450 And eventually it's just as bad as not sorting the data at all. 249 00:12:10,450 --> 00:12:13,570 >> The other problem though is we still, and so far up to this point, 250 00:12:13,570 --> 00:12:16,310 we've just been sort of understanding what a hash table is, 251 00:12:16,310 --> 00:12:17,940 we still only have room for 10 strings. 252 00:12:17,940 --> 00:12:23,010 If we want to continue to hash the citizens of Springfield, 253 00:12:23,010 --> 00:12:24,770 we can only get 10 of them in there. 254 00:12:24,770 --> 00:12:28,590 And if we try and add an 11th or 12th, we don't have a place to put them. 255 00:12:28,590 --> 00:12:31,640 We could just be spinning around in circles trying to find an empty spot, 256 00:12:31,640 --> 00:12:34,980 and we maybe get stuck in an infinite loop. 257 00:12:34,980 --> 00:12:37,790 >> So this sort of lends to the idea of something called chaining. 258 00:12:37,790 --> 00:12:41,770 And this is where we're going to bring linked lists back into the picture. 259 00:12:41,770 --> 00:12:47,280 What if instead of storing just the data itself in the array, 260 00:12:47,280 --> 00:12:50,450 every element of the array could hold multiple pieces of data? 261 00:12:50,450 --> 00:12:51,950 Well that doesn't make sense, right? 262 00:12:51,950 --> 00:12:55,260 We know that an array can only hold-- each element of an array 263 00:12:55,260 --> 00:12:58,910 can only hold one piece of data of that data type. 264 00:12:58,910 --> 00:13:02,730 >> But what if that data type is a linked list, right? 265 00:13:02,730 --> 00:13:05,810 So what if every element of the array was 266 00:13:05,810 --> 00:13:08,770 a pointer to the head of a linked list? 267 00:13:08,770 --> 00:13:10,660 And then we could build those linked lists 268 00:13:10,660 --> 00:13:13,100 and grow them arbitrarily, because linked lists allow 269 00:13:13,100 --> 00:13:18,070 us to grow and shrink a lot more flexibly than an array does. 270 00:13:18,070 --> 00:13:20,190 So what if we now use, we leverage this, right? 271 00:13:20,190 --> 00:13:23,740 We start to grow these chains out of these array locations. 272 00:13:23,740 --> 00:13:26,720 >> Now we can fit an infinite amount of data, or not infinite, 273 00:13:26,720 --> 00:13:30,150 an arbitrary amount of data, into our hash table 274 00:13:30,150 --> 00:13:33,740 without ever running into the problem of collision. 275 00:13:33,740 --> 00:13:36,720 We've also eliminated clustering by doing this. 276 00:13:36,720 --> 00:13:40,360 And well we know that when we insert into a linked list, if you recall 277 00:13:40,360 --> 00:13:43,750 from our video on linked lists, singly linked lists and doubly linked lists, 278 00:13:43,750 --> 00:13:45,050 it's a constant time operation. 279 00:13:45,050 --> 00:13:46,690 We're just adding to the front. 280 00:13:46,690 --> 00:13:50,380 >> And for look up, well we do know that look up in a linked list 281 00:13:50,380 --> 00:13:51,760 can be a problem, right? 282 00:13:51,760 --> 00:13:53,890 We have to search through it from beginning to end. 283 00:13:53,890 --> 00:13:56,440 There's no random access in a linked list. 284 00:13:56,440 --> 00:14:01,590 But if instead of having one linked list where a lookup would be O of n, 285 00:14:01,590 --> 00:14:05,570 we now have 10 linked lists, or 1,000 linked lists, 286 00:14:05,570 --> 00:14:10,100 now it's O of n divided by 10, or O of n divided by 1,000. 287 00:14:10,100 --> 00:14:12,920 >> And while we were talking theoretically about complexity 288 00:14:12,920 --> 00:14:16,040 we disregard constants, in the real world these things actually matter, 289 00:14:16,040 --> 00:14:16,540 right? 290 00:14:16,540 --> 00:14:19,060 We actually will notice that this happens 291 00:14:19,060 --> 00:14:22,590 to run 10 times faster, or 1,000 times faster, 292 00:14:22,590 --> 00:14:27,550 because we're distributing one long chain across 1,000 smaller chains. 293 00:14:27,550 --> 00:14:31,070 And so each time we have to search through one of those chains we can 294 00:14:31,070 --> 00:14:36,270 ignore the 999 chains we don't care about , and just search that one. 295 00:14:36,270 --> 00:14:39,200 >> Which is on average to be 1,000 times shorter. 296 00:14:39,200 --> 00:14:42,440 And so we still are sort of tending towards this average case 297 00:14:42,440 --> 00:14:47,590 of being constant time, but only because we are leveraging 298 00:14:47,590 --> 00:14:49,710 dividing by some huge constant factor. 299 00:14:49,710 --> 00:14:51,780 Let's see how this might actually look though. 300 00:14:51,780 --> 00:14:54,700 So this was the hash table we had before we declared a hash table that 301 00:14:54,700 --> 00:14:56,250 was capable of storing 10 strings. 302 00:14:56,250 --> 00:14:57,700 We're not going to do that anymore. 303 00:14:57,700 --> 00:14:59,860 We already know the limitations of that method. 304 00:14:59,860 --> 00:15:06,150 Now our hash table's going to be an array of 10 nodes, pointers 305 00:15:06,150 --> 00:15:08,100 to heads of linked lists. 306 00:15:08,100 --> 00:15:09,810 >> And right now it's null. 307 00:15:09,810 --> 00:15:12,150 Each one of those 10 pointers is null. 308 00:15:12,150 --> 00:15:15,850 There's nothing in our hash table right now. 309 00:15:15,850 --> 00:15:18,670 >> Now let's start to put some things into this hash table. 310 00:15:18,670 --> 00:15:21,420 And let's see how this method is going to benefit us a little bit. 311 00:15:21,420 --> 00:15:22,740 Let's now hash Joey. 312 00:15:22,740 --> 00:15:26,190 We'll will run the string Joey through a hash function and we return 6. 313 00:15:26,190 --> 00:15:27,180 Well what do we do now? 314 00:15:27,180 --> 00:15:30,070 >> Well now working with linked lists, we're not working with arrays. 315 00:15:30,070 --> 00:15:31,860 And when we're working with linked lists we 316 00:15:31,860 --> 00:15:35,770 know we need to start dynamically allocating space and building chains. 317 00:15:35,770 --> 00:15:40,000 That's sort of how-- those are the core elements of building a linked list. 318 00:15:40,000 --> 00:15:43,650 So let's dynamically allocate space for Joey, 319 00:15:43,650 --> 00:15:46,370 and then let's add him to the chain. 320 00:15:46,370 --> 00:15:49,150 >> So now look what we've done. 321 00:15:49,150 --> 00:15:52,420 When we hash Joey we got the hashcode 6. 322 00:15:52,420 --> 00:15:57,770 Now the pointer at array location 6 points to the head of a linked list, 323 00:15:57,770 --> 00:16:00,880 and right now it's the only element of a linked list. 324 00:16:00,880 --> 00:16:04,330 And the node in that linked list is Joey. 325 00:16:04,330 --> 00:16:08,280 >> So if we need to look up Joey later, we just hash Joey again, 326 00:16:08,280 --> 00:16:12,280 we get 6 again because our hash function is deterministic. 327 00:16:12,280 --> 00:16:16,380 And then we start at the head of the linked list pointed 328 00:16:16,380 --> 00:16:19,340 to by array location 6, and we can iterate 329 00:16:19,340 --> 00:16:21,630 across that trying to find Joey. 330 00:16:21,630 --> 00:16:23,760 And if we build our hash table effectively, 331 00:16:23,760 --> 00:16:27,770 and our hash function effectively to distribute data well, 332 00:16:27,770 --> 00:16:31,750 on average each of those linked lists at every array location 333 00:16:31,750 --> 00:16:36,090 will be 1/10 the size of if we just had it as a single huge 334 00:16:36,090 --> 00:16:37,630 linked list with everything in it. 335 00:16:37,630 --> 00:16:41,650 >> If we distribute that huge linked list across 10 linked lists 336 00:16:41,650 --> 00:16:43,740 each list will be 1/10 the size. 337 00:16:43,740 --> 00:16:47,410 And thus 10 times quicker to search through. 338 00:16:47,410 --> 00:16:48,370 So let's do this again. 339 00:16:48,370 --> 00:16:50,270 Let's now hash Ross. 340 00:16:50,270 --> 00:16:53,520 >> And let's say Ross, when we do that the hash code we get back is 2. 341 00:16:53,520 --> 00:16:56,610 Well now we dynamically allocate a new node, we put Ross in that node, 342 00:16:56,610 --> 00:17:00,880 and we say now array location 2, instead of pointing to null, 343 00:17:00,880 --> 00:17:04,550 points to the head of a linked list whose only node is Ross. 344 00:17:04,550 --> 00:17:08,640 And we can do this one more time, we can hash Rachel and get hashcode 4. 345 00:17:08,640 --> 00:17:11,470 malloc a new node, put Rachel in the node, and say a array location 346 00:17:11,470 --> 00:17:13,810 4 now points to the head of a linked list whose 347 00:17:13,810 --> 00:17:17,400 only element happens to be Rachel. 348 00:17:17,400 --> 00:17:19,450 >> OK but what happens if we have a collision? 349 00:17:19,450 --> 00:17:24,540 Let's see how we handle collisions using the separate chaining method. 350 00:17:24,540 --> 00:17:25,780 Let's hash Phoebe. 351 00:17:25,780 --> 00:17:27,070 We get the hashcode 6. 352 00:17:27,070 --> 00:17:29,990 In our previous example we were just storing the strings in the array. 353 00:17:29,990 --> 00:17:31,960 This was a problem. 354 00:17:31,960 --> 00:17:34,420 >> We don't want to clobber Joey, and we've already 355 00:17:34,420 --> 00:17:37,090 seen that we can get some clustering problems if we try and step 356 00:17:37,090 --> 00:17:38,180 through and probe. 357 00:17:38,180 --> 00:17:40,640 But what if we just kind of treat this the same way, right? 358 00:17:40,640 --> 00:17:43,420 It's just like adding an element to the head of a linked list. 359 00:17:43,420 --> 00:17:46,200 Let's just malloc space for Phoebe. 360 00:17:46,200 --> 00:17:50,740 >> We'll say Phoebe's next pointer points to the old head of the linked list, 361 00:17:50,740 --> 00:17:53,160 and then 6 just points to the new head of the linked list. 362 00:17:53,160 --> 00:17:56,130 And now look, we've changed Phoebe in. 363 00:17:56,130 --> 00:18:00,190 We can now store two elements with hashcode 6, 364 00:18:00,190 --> 00:18:03,150 and we don't have any problems. 365 00:18:03,150 --> 00:18:06,120 >> That's pretty much all there is to chaining. 366 00:18:06,120 --> 00:18:07,960 And chaining is definitely the method that's 367 00:18:07,960 --> 00:18:13,370 going to be most effective for you if you are storing data in a hash table. 368 00:18:13,370 --> 00:18:16,870 But this combination of arrays and linked lists 369 00:18:16,870 --> 00:18:22,060 together to form a hash table really dramatically improves your ability 370 00:18:22,060 --> 00:18:26,530 to store large amounts of data, and very quickly and efficiently search 371 00:18:26,530 --> 00:18:27,610 through that data. 372 00:18:27,610 --> 00:18:30,130 >> There's still one more data structure out there 373 00:18:30,130 --> 00:18:34,680 that might even be a bit better in terms of guaranteeing 374 00:18:34,680 --> 00:18:37,900 that our insertion, deletion, and look up times are even faster. 375 00:18:37,900 --> 00:18:41,610 And we'll see that in a video on tries. 376 00:18:41,610 --> 00:18:44,910 I'm Doug Lloyd, this is CS50.