[MUSIC PLAYING]

DOUG LLOYD: By now you know a lot about arrays, and you know a lot about linked lists. And we've discussed the pros and cons. We've discussed that linked lists can get bigger and smaller, but they take up more space. Arrays are much more straightforward to use, but they're restrictive inasmuch as we have to set the size of the array at the very beginning, and then we're stuck with it.

With that, we've pretty much exhausted all of our topics about linked lists and arrays. Or have we? Maybe we can do something even more creative. And that sort of lends itself to the idea of a hash table.

So in a hash table we're going to try to combine an array with a linked list. We're going to take the advantages of the array, like random access, being able to just go to array element 4 or array element 8 without having to iterate across. That's pretty fast, right?

But we also want our data structure to be able to grow and shrink. We don't want to be restricted. And we want to be able to add and remove things very easily, which, if you recall, is very complex with an array. And we can call this new thing a hash table.

If implemented correctly, we're sort of taking the advantages of both data structures you've already seen, arrays and linked lists. Insertion can start to tend toward theta of 1. Theta we haven't really discussed, but theta is just the average case, what's actually going to happen. You're not always going to have the worst-case scenario, and you're not always going to have the best-case scenario, so what's the average scenario?

Well, an average insertion into a hash table can start to get close to constant time. And deletion can get close to constant time. And lookup can get close to constant time. We don't have a data structure yet that can do that, and so this already sounds like a pretty great thing. We've really mitigated the disadvantages of each on its own.

To get this performance upgrade, though, we need to rethink how we add data into the structure.
Specifically, we want the data itself to tell us where it should go in the structure. And if we then need to see if it's in the structure, if we need to find it, we want to look at the data again and, using the data, be able to effectively randomly access it. Just by looking at the data, we should have an idea of exactly where we're going to find it in the hash table.

Now the downside of a hash table is that they're really pretty bad at ordering or sorting data. In fact, if you start to use them to order or sort data, you lose all of the advantages you previously had in terms of insertion and deletion. The time becomes closer to theta of n, and we've basically regressed into a linked list. And so we only want to use hash tables if we don't care whether the data is sorted. For the context in which you'll use them in CS50, you probably don't care that the data is sorted.

So a hash table is a combination of two distinct pieces with which we're familiar. The first is a function, which we usually call a hash function. And that hash function is going to return some non-negative integer, which we usually call a hash code, OK? The second piece is an array, which is capable of storing data of the type we want to place into the data structure. We'll hold off on the linked list element for now and just start with the basics of a hash table to get your head around it, and then maybe we'll blow your mind a little bit when we combine arrays and linked lists together.

The basic idea, though, is we take some data. We run that data through the hash function. The data is processed and it spits out a number, OK? And then with that number, we just store the data we want to store in the array at that location.

So for example, maybe we have this hash table of strings. It's got 10 elements in it, so we can fit 10 strings in it. Let's say we want to hash John. So John is the data we want to insert into this hash table somewhere. Where do we put it? Well, typically with an array, so far we probably would put it in array location 0. But now we have this new hash function.
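As a minimal sketch of that basic idea in C, assuming a hypothetical hash function hash() that maps a string to a number from 0 to 9 -- the names hashtable, insert(), and lookup() are illustrative, and collisions are ignored here for now:

```c
#include <stdbool.h>
#include <string.h>

#define HASH_MAX 10              // our hypothetical table holds 10 strings

char *hashtable[HASH_MAX];       // the array; as a global, every slot starts out NULL

// Some hash function that maps a string to a number from 0 to HASH_MAX - 1.
unsigned int hash(char *str);

// Insert: ask the data where it goes, then put it there.
void insert(char *str)
{
    hashtable[hash(str)] = str;
}

// Lookup: hash the string again and check that one location.
bool lookup(char *str)
{
    unsigned int code = hash(str);
    return hashtable[code] != NULL && strcmp(hashtable[code], str) == 0;
}
```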
And let's say that we run John through this hash function and it spits out 4. Well, that's where we're going to want to put John. We want to put John in array location 4, because if we hash John again-- let's say later we want to search and see if John exists in this hash table-- all we need to do is run it through the same hash function, get the number 4 out, and be able to find John immediately in our data structure. That's pretty good.

Let's say we now do this again; we want to hash Paul. We want to add Paul into this hash table. Let's say that this time we run Paul through the hash function, and the hash code that is generated is 6. Well, now we can put Paul in array location 6. And if we need to look up whether Paul is in this hash table, all we need to do is run Paul through the hash function again, and we're going to get 6 out again. And then we just look at array location 6. Is Paul there? If so, he's in the hash table. Is Paul not there? Then he's not in the hash table. It's pretty straightforward.

Now how do you define a hash function? Well, there's really no limit to the number of possible hash functions. In fact, there are a number of really, really good ones on the internet. There are a number of really, really bad ones on the internet. And it's also pretty easy to write a bad one.

So what makes up a good hash function? Well, a good hash function should use only the data being hashed, and all of the data being hashed. We don't want to incorporate anything else other than the data. And we want to use all of the data-- we don't want to just use a piece of it, we want to use all of it.

A hash function should also be deterministic. What does that mean? Well, it means that every time we pass the exact same piece of data into the hash function, we always get the same hash code out. If I pass John into the hash function, I get out 4. I should be able to do that 10,000 times and I'll always get 4. So no random numbers, effectively, can be involved in our hash functions.
A hash function should also uniformly distribute data. If every time you run data through the hash function you get the hash code 0, that's probably not so great, right? You probably want a big range of hash codes, so things can be spread out throughout the table. And it would also be great if really similar data, like John and Jonathan, were spread out to very different locations in the hash table. That would be a nice advantage.

Here's an example of a hash function. I wrote this one up earlier. It's not a particularly good hash function, for reasons that don't really bear going into right now. But do you see what's going on here? It seems like we're declaring a variable called sum and setting it equal to 0. And then apparently I'm doing something so long as strstr[j] is not equal to backslash 0. What am I doing there?

This is basically just another way of implementing strlen and detecting when you've reached the end of the string. So I don't have to actually calculate the length of the string; when I hit the backslash 0 character, I know I've reached the end of the string. And then I'm going to keep iterating through that string, adding strstr[j] to sum, and then at the end of the day I'm going to return sum mod HASH_MAX.

Basically, all this hash function is doing is adding up all of the ASCII values of my string, and then it's returning some hash code modded by HASH_MAX, which is probably the size of my array, right? If my array is of size 10, I don't want to be getting out hash codes 11, 12, 13-- I can't put things into those locations of the array; that would be illegal. I'd suffer a segmentation fault.

Now here's another quick aside. Generally, you're probably not going to want to write your own hash functions. It's actually a bit of an art, not a science, and there's a lot that goes into them. The internet, like I said, is full of really good hash functions, and you should use the internet to find hash functions, because it's really just kind of an unnecessary waste of time to create your own.
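Pieced together from that description, here's roughly what such a function might look like -- treat it as a reconstruction of the example being described, not the exact code on screen, and not a hash function you'd actually want to use:

```c
#define HASH_MAX 10   // assumed to match the size of the array

// Add up the ASCII values of the characters in the string, then mod by
// HASH_MAX so the result is always a valid array index. The parameter
// name strstr follows the transcript; the library function strstr()
// is not used here.
unsigned int hash(char *strstr)
{
    unsigned int sum = 0;

    // Loop until the terminating '\0', which saves us a call to strlen.
    for (int j = 0; strstr[j] != '\0'; j++)
    {
        sum += strstr[j];
    }

    return sum % HASH_MAX;
}
```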
You can write simple ones for testing purposes. But when you actually start hashing data and storing it in a hash table, you're probably going to want to use some function that was generated for you, that exists on the internet. If you do, just be sure to cite your sources. There's no reason to plagiarize anything here.

The computer science community is definitely growing, and it really values open source, and it's really important to cite your sources so that people can get attribution for the work that they're doing to the benefit of the community. So always be sure-- and not just for hash functions, but generally when you use code from an outside source-- to always cite your source. Give credit to the person who did some of the work so you don't have to.

OK, so let's revisit this hash table for a second. This is where we left off after we inserted John and Paul into this hash table. Do you see a problem here? You might see two. But in particular, do you see this possible problem?

What if I hash Ringo, and it turns out that after processing that data through the hash function, Ringo also generates the hash code 6? I've already got data at array location 6. So that's probably going to be a bit of a problem for me now, right?

We call this a collision. A collision occurs when two pieces of data run through the same hash function yield the same hash code. Presumably we still want to get both pieces of data into the hash table; otherwise we wouldn't be running Ringo arbitrarily through the hash function. We presumably want to get Ringo into that array.

How do we do it, though, if he and Paul both yield hash code 6? We don't want to overwrite Paul; we want Paul to be there too. So we need to find a way to get elements into the hash table that still preserves our quick insertion and quick lookup. And one way to deal with it is to do something called linear probing.

Using this method, if we have a collision, what do we do? Well, we can't put him in array location 6, or whatever hash code was generated, so let's put him at hash code plus 1. And if that's full, let's put him at hash code plus 2.
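Here's one way that probing rule might look in code -- a sketch that reuses the hypothetical hashtable array and hash() function from the earlier sketches, and wraps around the end of the array so the probe sequence stays inside it:

```c
#include <stdbool.h>
#include <stddef.h>

#define HASH_MAX 10

// Same hypothetical pieces as before.
char *hashtable[HASH_MAX];
unsigned int hash(char *str);

// Insertion with linear probing: start at the hash code and, if that slot
// is taken, step forward (wrapping around) until we find an empty one.
// Returns false if every slot is already full.
bool insert(char *str)
{
    unsigned int start = hash(str);

    for (unsigned int offset = 0; offset < HASH_MAX; offset++)
    {
        unsigned int index = (start + offset) % HASH_MAX;

        if (hashtable[index] == NULL)
        {
            hashtable[index] = str;
            return true;
        }
    }

    return false;   // the table is full -- a limitation we'll run into shortly
}
```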
The benefit of this is that if he's not exactly where we think he is and we have to start searching, maybe we don't have to go too far. Maybe we don't have to search all n elements of the hash table; maybe we only have to search a couple of them. And so we're still tending toward that average case being close to 1 versus close to n, so maybe that'll work.

So let's see how this might work out in reality, and let's see if maybe we can detect the problem that might occur here. Let's say we hash Bart. So now we're going to run a new set of strings through the hash function. We run Bart through the hash function and we get hash code 6. We take a look, we see 6 is empty, so we can put Bart there.

Now we hash Lisa, and that also generates hash code 6. Well, now that we're using this linear probing method, we start at 6, and we see that 6 is full. We can't put Lisa in 6. So where do we go? Let's go to 7. 7 is empty, so that works. So let's put Lisa there.

Now we hash Homer and we get 7. OK, well, we know that 7 is full now, so we can't put Homer there. So let's go to 8. Is 8 available? Yes, and 8 is close to 7, so if we have to start searching, we're not going to have to go too far. So let's put Homer at 8.

Now we hash Maggie and it returns 3. Thank goodness, we're able to just put Maggie there; we don't have to do any sort of probing for that. Now we hash Marge, and Marge also returns 6. Well, 6 is full, 7 is full, 8 is full, 9-- all right, thank God, 9 is empty. I can put Marge at 9.

Already we can see that we're starting to have this problem where we're starting to stretch things kind of far away from their hash codes. And that theta of 1, that average case of being constant time, is starting to tend a little more toward theta of n. We're starting to lose that advantage of hash tables.

This problem that we just saw is something called clustering.
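To see what that stretching costs, here's the matching lookup sketch under linear probing, again with the same hypothetical names as above; the longer the run of occupied slots around a hash code grows, the more of these probe steps both insertion and lookup have to take:

```c
#include <stdbool.h>
#include <string.h>

#define HASH_MAX 10

// Same hypothetical table and hash function as above.
extern char *hashtable[HASH_MAX];
unsigned int hash(char *str);

// Lookup walks the same probe sequence as insertion: start at the hash
// code and step forward until we find the string, hit an empty slot
// (meaning it was never inserted, assuming nothing is ever deleted),
// or have checked every slot.
bool lookup(char *str)
{
    unsigned int start = hash(str);

    for (unsigned int offset = 0; offset < HASH_MAX; offset++)
    {
        unsigned int index = (start + offset) % HASH_MAX;

        if (hashtable[index] == NULL)
        {
            return false;
        }
        if (strcmp(hashtable[index], str) == 0)
        {
            return true;
        }
    }

    return false;
}
```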
And what's really bad about clustering is that once you have two elements that are side by side, it makes it even more likely-- you have double the chance-- that you're going to have another collision with that cluster, and the cluster will grow by one. And you'll keep growing and growing your likelihood of having a collision. Eventually it's just as bad as not sorting the data at all.

The other problem, though, is that so far, up to this point, we've just been sort of understanding what a hash table is, and we still only have room for 10 strings. If we want to continue to hash the citizens of Springfield, we can only get 10 of them in there. And if we try to add an 11th or 12th, we don't have a place to put them. We could just be spinning around in circles trying to find an empty spot, and maybe we get stuck in an infinite loop.

So this sort of lends itself to the idea of something called chaining. And this is where we're going to bring linked lists back into the picture. What if, instead of storing just the data itself in the array, every element of the array could hold multiple pieces of data? Well, that doesn't make sense, right? We know that each element of an array can only hold one piece of data of that data type.

But what if that data type is a linked list, right? What if every element of the array was a pointer to the head of a linked list? Then we could build those linked lists and grow them arbitrarily, because linked lists allow us to grow and shrink a lot more flexibly than an array does. So what if we leverage this, right? We start to grow these chains out of these array locations.

Now we can fit an infinite amount of data-- or not infinite, an arbitrary amount of data-- into our hash table without a collision ever leaving us with nowhere to put it. We've also eliminated clustering by doing this.

And we know that when we insert into a linked list-- if you recall from our video on linked lists, singly linked lists and doubly linked lists-- it's a constant-time operation. We're just adding to the front. And for lookup, well, we do know that lookup in a linked list can be a problem, right?
We have to search through it from beginning to end. There's no random access in a linked list. But if, instead of having one linked list where a lookup would be O of n, we now have 10 linked lists, or 1,000 linked lists, now it's O of n divided by 10, or O of n divided by 1,000.

And while we were talking theoretically about complexity we disregard constants, in the real world these things actually matter, right? We actually will notice that this happens to run 10 times faster, or 1,000 times faster, because we're distributing one long chain across 1,000 smaller chains. And so each time we have to search through one of those chains, we can ignore the 999 chains we don't care about and just search that one, which on average is going to be 1,000 times shorter. And so we're still sort of tending toward this average case of being constant time, but only because we're leveraging division by some huge constant factor. Let's see how this might actually look, though.

So this was the hash table we had before, where we declared a hash table that was capable of storing 10 strings. We're not going to do that anymore; we already know the limitations of that method. Now our hash table is going to be an array of 10 node pointers-- pointers to the heads of linked lists. And right now each of those 10 pointers is null. There's nothing in our hash table right now.

Now let's start to put some things into this hash table, and let's see how this method is going to benefit us a little bit. Let's now hash Joey. We'll run the string Joey through a hash function and it returns 6. Well, what do we do now?

Well, now we're working with linked lists; we're not working with arrays. And when we're working with linked lists, we know we need to start dynamically allocating space and building chains. Those are the core elements of building a linked list. So let's dynamically allocate space for Joey, and then let's add him to the chain.

So now look at what we've done. When we hashed Joey we got the hash code 6.
Now the pointer at array location 6 points to the head of a linked list, and right now it's the only element of that linked list. And the node in that linked list is Joey.

So if we need to look up Joey later, we just hash Joey again, and we get 6 again, because our hash function is deterministic. Then we start at the head of the linked list pointed to by array location 6, and we can iterate across it trying to find Joey. And if we build our hash table effectively, and our hash function distributes the data well, on average each of those linked lists at every array location will be 1/10 the size of what we'd have if we just kept everything in a single huge linked list. If we distribute that huge linked list across 10 linked lists, each list will be 1/10 the size, and thus 10 times quicker to search through.

So let's do this again. Let's now hash Ross. And let's say that when we do that, the hash code we get back is 2. Well, now we dynamically allocate a new node, we put Ross in that node, and we say that array location 2, instead of pointing to null, now points to the head of a linked list whose only node is Ross.

And we can do this one more time. We can hash Rachel and get hash code 4. We malloc a new node, put Rachel in the node, and say array location 4 now points to the head of a linked list whose only element happens to be Rachel.

OK, but what happens if we have a collision? Let's see how we handle collisions using the separate chaining method. Let's hash Phoebe. We get the hash code 6. In our previous example we were just storing the strings directly in the array, and this was a problem. We don't want to clobber Joey, and we've already seen that we can get some clustering problems if we try to step through and probe.

But what if we just treat this the same way we'd treat adding an element to the head of a linked list? Let's just malloc space for Phoebe. We'll say Phoebe's next pointer points to the old head of the linked list, and then 6 just points to the new head of the linked list. And now look, we've chained Phoebe in.
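Putting those pieces together, here's a sketch of what this chained hash table might look like in C. The names node, insert(), lookup(), and hash() are illustrative, not the actual code from the video, and a real version would probably copy the string rather than store the caller's pointer:

```c
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>

#define HASH_MAX 10

// Each slot of the table is now the head of a (possibly empty) linked list.
typedef struct node
{
    char *str;            // kept as a pointer here to keep the sketch short
    struct node *next;
}
node;

node *hashtable[HASH_MAX];    // 10 pointers, all starting out NULL

// Assume hash() is the same hypothetical function as before.
unsigned int hash(char *str);

// Insert: hash the string, malloc a node, and chain it in at the head
// of the list at that location -- constant time, collisions or not.
bool insert(char *str)
{
    node *n = malloc(sizeof(node));
    if (n == NULL)
    {
        return false;
    }
    unsigned int code = hash(str);
    n->str = str;
    n->next = hashtable[code];    // new node points to the old head
    hashtable[code] = n;          // the slot now points to the new head
    return true;
}

// Lookup: hash the string, then walk only the one chain at that location.
bool lookup(char *str)
{
    for (node *ptr = hashtable[hash(str)]; ptr != NULL; ptr = ptr->next)
    {
        if (strcmp(ptr->str, str) == 0)
        {
            return true;
        }
    }
    return false;
}
```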
We can now store two elements with hash code 6, and we don't have any problems. That's pretty much all there is to chaining. And chaining is definitely the method that's going to be most effective for you if you are storing data in a hash table.

This combination of arrays and linked lists together to form a hash table really dramatically improves your ability to store large amounts of data, and to very quickly and efficiently search through that data.

There's still one more data structure out there that might even be a bit better in terms of guaranteeing that our insertion, deletion, and lookup times are even faster. And we'll see that in a video on tries.

I'm Doug Lloyd, this is CS50.