1
00:00:00,500 --> 00:00:01,900
[MUSIC PLAYING]

2
00:00:05,710 --> 00:00:09,150
>> DOUG LLOYD: By now you
know a lot about arrays,

3
00:00:09,150 --> 00:00:11,610
and you know a lot about linked lists.

4
00:00:11,610 --> 00:00:13,650
And we've discuss the
pros and cons, we've

5
00:00:13,650 --> 00:00:16,620
discussed that linked lists
can get bigger and smaller,

6
00:00:16,620 --> 00:00:18,630
but they take up more size.

7
00:00:18,630 --> 00:00:22,359
Arrays are much more straightforward to
use, but they're restrictive in as much

8
00:00:22,359 --> 00:00:24,900
as we have to set the size of
the array at the very beginning

9
00:00:24,900 --> 00:00:26,910
and then we're stuck with it.

10
00:00:26,910 --> 00:00:30,470
>> But that's, we've pretty much
exhausted all of our topics

11
00:00:30,470 --> 00:00:33,040
about linked lists and arrays.

12
00:00:33,040 --> 00:00:34,950
Or have we?

13
00:00:34,950 --> 00:00:37,720
Maybe we can do something
even more creative.

14
00:00:37,720 --> 00:00:40,950
And that sort of lends
the idea of a hash table.

15
00:00:40,950 --> 00:00:46,680
>> So in a hash table we're going to try
combine an array with a linked list.

16
00:00:46,680 --> 00:00:49,520
We're going to take the advantages
of the array, like random access,

17
00:00:49,520 --> 00:00:53,510
being able to just go to array
element 4 or array element 8

18
00:00:53,510 --> 00:00:55,560
without having to iterate across.

19
00:00:55,560 --> 00:00:57,260
That's pretty fast, right?

20
00:00:57,260 --> 00:01:00,714
>> But we also want to have our data
structure be able to grow and shrink.

21
00:01:00,714 --> 00:01:02,630
We don't need, we don't
want to be restricted.

22
00:01:02,630 --> 00:01:04,588
And we want to be able
to add and remove things

23
00:01:04,588 --> 00:01:08,430
very easily, which if you recall,
is very complex with an array.

24
00:01:08,430 --> 00:01:11,650
And we can call this
new thing a hash table.

25
00:01:11,650 --> 00:01:15,190
>> And if implemented correctly,
we're sort of taking

26
00:01:15,190 --> 00:01:18,150
the advantages of both data
structures you've already seen,

27
00:01:18,150 --> 00:01:19,880
arrays and linked lists.

28
00:01:19,880 --> 00:01:22,480
>> Well an average insertion
into a hash table

29
00:01:22,480 --> 00:01:25,350
can start to get close to constant time.

30
00:01:25,350 --> 00:01:27,550
And deletion can get
close to constant time.

31
00:01:27,550 --> 00:01:30,420
And lookup can get
close to constant time.

32
00:01:30,420 --> 00:01:34,580
That's-- we don't have a data
structure yet that can do that,

33
00:01:34,580 --> 00:01:37,160
and so this already sounds
like a pretty great thing.

34
00:01:37,160 --> 00:01:42,990
We've really mitigated the
disadvantages of each on its own.

35
00:01:42,990 --> 00:01:45,140
>> To get this performance
upgrade though, we

36
00:01:45,140 --> 00:01:49,560
need to rethink how we add
data into the structure.

37
00:01:49,560 --> 00:01:53,360
Specifically we want the
data itself to tell us

38
00:01:53,360 --> 00:01:55,700
where it should go in the structure.

39
00:01:55,700 --> 00:01:58,830
And if we then need to see if it's in
the structure, if we need to find it,

40
00:01:58,830 --> 00:02:02,590
we want to look at the data
again and be able to effectively,

41
00:02:02,590 --> 00:02:04,890
using the data, randomly access it.

42
00:02:04,890 --> 00:02:06,680
Just by looking at the
data we should have

43
00:02:06,680 --> 00:02:11,870
an idea of where exactly we're
going to find it in the hash table.

44
00:02:11,870 --> 00:02:14,750
>> Now the downside of a hash
table is that they're really

45
00:02:14,750 --> 00:02:17,750
pretty bad at ordering or sorting data.

46
00:02:17,750 --> 00:02:20,950
And in fact, if you start
to use them to order or sort

47
00:02:20,950 --> 00:02:23,970
data you lose all of the
advantages you previously

48
00:02:23,970 --> 00:02:26,690
had in terms of insertion and deletion.

49
00:02:26,690 --> 00:02:30,040
The time becomes closer to
theta of n, and we've basically

50
00:02:30,040 --> 00:02:31,510
regressed into a linked list.

51
00:02:31,510 --> 00:02:34,830
And so we only want to use hash
tables if we don't care about

52
00:02:34,830 --> 00:02:37,470
whether data is sorted.

53
00:02:37,470 --> 00:02:40,270
For the context in which
you'll use them in CS50

54
00:02:40,270 --> 00:02:44,880
you probably don't care
that the data is sorted.

55
00:02:44,880 --> 00:02:49,150
>> So a hash table is a combination
of two distinct pieces

56
00:02:49,150 --> 00:02:50,960
with which we're familiar.

57
00:02:50,960 --> 00:02:53,910
The first is a function, which
we usually call a hash function.

58
00:02:53,910 --> 00:02:57,740
And that hash function is going to
return some non-negative integer, which

59
00:02:57,740 --> 00:03:00,850
we usually call a hashcode, OK?

60
00:03:00,850 --> 00:03:06,210
The second piece is an array, which is
capable of storing data of the type we

61
00:03:06,210 --> 00:03:08,170
want to place into the data structure.

62
00:03:08,170 --> 00:03:10,290
We'll hold off on the
linked list element for now

63
00:03:10,290 --> 00:03:13,790
and just start with the basics of a
hash table to get your head around it,

64
00:03:13,790 --> 00:03:16,190
and then we'll maybe blow
your mind a little bit when we

65
00:03:16,190 --> 00:03:18,900
combine arrays and link lists together.

66
00:03:18,900 --> 00:03:21,570
>> The basic idea though
is we take some data.

67
00:03:21,570 --> 00:03:23,840
We run that data through
the hash function.

68
00:03:23,840 --> 00:03:27,960
And so the data is processed
and it spits out a number, OK?

69
00:03:27,960 --> 00:03:30,870
And then with that number
we just store the data

70
00:03:30,870 --> 00:03:34,910
we want to store in the
array at that location.

71
00:03:34,910 --> 00:03:39,970
So for example we have maybe
this hash table of strings.

72
00:03:39,970 --> 00:03:43,330
It's got 10 elements in it, so
we can fit 10 strings in it.

73
00:03:43,330 --> 00:03:45,300
>> Let's say we want to hash John.

74
00:03:45,300 --> 00:03:48,950
So John as the data we want to insert
into this hash table somewhere.

75
00:03:48,950 --> 00:03:49,750
Where do we put it?

76
00:03:49,750 --> 00:03:51,700
Well typically with an
array so far we probably

77
00:03:51,700 --> 00:03:53,590
would put it in array location 0.

78
00:03:53,590 --> 00:03:55,940
But now we have this new hash function.

79
00:03:55,940 --> 00:03:59,160
>> And let's say that we run John
through this hash function

80
00:03:59,160 --> 00:04:01,390
and it's spits out 4.

81
00:04:01,390 --> 00:04:03,480
Well that's where we're
going to want to put John.

82
00:04:03,480 --> 00:04:08,030
We want to put John in array location
4, because if we hash John again--

83
00:04:08,030 --> 00:04:09,790
let's say later we
want to search and see

84
00:04:09,790 --> 00:04:12,940
if John exists in this hash
table-- all we need to do

85
00:04:12,940 --> 00:04:16,290
is run it through the same hash
function, get the number 4 out,

86
00:04:16,290 --> 00:04:19,900
and be able to find John
immediately in our data structure.

87
00:04:19,900 --> 00:04:20,700
That's pretty good.

88
00:04:20,700 --> 00:04:22,900
>> Let's say we now do this
again, we want to hash Paul.

89
00:04:22,900 --> 00:04:25,420
We want to add Paul
into this hash table.

90
00:04:25,420 --> 00:04:28,050
Let's say that this time we run
Paul through the hash function,

91
00:04:28,050 --> 00:04:30,580
the hashcode that is generated is 6.

92
00:04:30,580 --> 00:04:33,320
Well now we can put Paul
in the array location 6.

93
00:04:33,320 --> 00:04:36,020
And if we need to look up whether
Paul is in this hash table,

94
00:04:36,020 --> 00:04:39,880
all we need to do is run Paul
through the hash function again

95
00:04:39,880 --> 00:04:41,810
and we're going to get 6 out again.

96
00:04:41,810 --> 00:04:43,570
>> And then we just look
at array location 6.

97
00:04:43,570 --> 00:04:44,890
Is Paul there?

98
00:04:44,890 --> 00:04:46,140
If so, he's in the hash table.

99
00:04:46,140 --> 00:04:47,850
Is Paul not there?

100
00:04:47,850 --> 00:04:48,980
He's not in the hash table.

101
00:04:48,980 --> 00:04:51,700
It's pretty straightforward.

102
00:04:51,700 --> 00:04:53,750
>> Now how do you define a hash function?

103
00:04:53,750 --> 00:04:57,440
Well there's really no limit to the
number of possible hash functions.

104
00:04:57,440 --> 00:05:00,330
In fact there's a number of really,
really good ones on the internet.

105
00:05:00,330 --> 00:05:03,540
There's a number of really,
really bad ones on the internet.

106
00:05:03,540 --> 00:05:07,060
It's also pretty easy
to write a bad one.

107
00:05:07,060 --> 00:05:09,740
>> So what makes up a good
hash function, right?

108
00:05:09,740 --> 00:05:13,260
Well a good hash function should
use only the data being hashed,

109
00:05:13,260 --> 00:05:15,400
and all of the data being hashed.

110
00:05:15,400 --> 00:05:18,480
So we don't want to use anything--
we don't incorporate anything

111
00:05:18,480 --> 00:05:21,540
else other than the data.

112
00:05:21,540 --> 00:05:23,060
And we want to use all of the data.

113
00:05:23,060 --> 00:05:25,810
We don't want to just use a piece
of it, we want to use all of it.

114
00:05:25,810 --> 00:05:27,690
A hash function should
also be deterministic.

115
00:05:27,690 --> 00:05:28,530
What does that mean?

116
00:05:28,530 --> 00:05:32,180
Well it means that every time we
pass the exact same piece of data

117
00:05:32,180 --> 00:05:36,020
into the hash function we always
get the same hashcode out.

118
00:05:36,020 --> 00:05:38,850
If I pass John into the
hash function I get out 4.

119
00:05:38,850 --> 00:05:42,090
I should be able to do that 10,000
times and I'll always get 4.

120
00:05:42,090 --> 00:05:46,790
So no random numbers effectively
can be involved in our hash tables--

121
00:05:46,790 --> 00:05:48,730
in our hash functions.

122
00:05:48,730 --> 00:05:51,730
>> A hash function should also
uniformly distribute data.

123
00:05:51,730 --> 00:05:55,680
If every time you run data through the
hash function you get the hashcode 0,

124
00:05:55,680 --> 00:05:57,180
that's probably not so great, right?

125
00:05:57,180 --> 00:06:00,580
You probably want to big
a range of hash codes.

126
00:06:00,580 --> 00:06:03,170
Also things can be spread
out throughout the table.

127
00:06:03,170 --> 00:06:09,190
And also it would be great if really
similar data, like John and Jonathan,

128
00:06:09,190 --> 00:06:12,300
maybe were spread out to weigh
different locations in the hash table.

129
00:06:12,300 --> 00:06:14,730
That would be a nice advantage.

130
00:06:14,730 --> 00:06:17,230
>> Here's an example of a hash function.

131
00:06:17,230 --> 00:06:19,130
I wrote this one up earlier.

132
00:06:19,130 --> 00:06:21,020
It's not a particularly
good hash function

133
00:06:21,020 --> 00:06:23,360
for reasons that don't really
bear going into right now.

134
00:06:23,360 --> 00:06:27,020
But do you see what's going on here?

135
00:06:27,020 --> 00:06:30,400
It seems like we're declaring a variable
called sum and setting it equal to 0.

136
00:06:30,400 --> 00:06:35,990
And then apparently I'm doing something
so long as strstr[j] is not equal

137
00:06:35,990 --> 00:06:38,470
to backslash 0.

138
00:06:38,470 --> 00:06:40,850
What am I doing there?

139
00:06:40,850 --> 00:06:43,420
>> This is basically just another
way of implementing [? strl ?]

140
00:06:43,420 --> 00:06:45,750
and detecting when you've
reached the end of the string.

141
00:06:45,750 --> 00:06:48,460
So I don't have to actually
calculate the length of the string,

142
00:06:48,460 --> 00:06:51,620
I'm just using when I hit the
backslash 0 character I know

143
00:06:51,620 --> 00:06:53,160
I've reached the end of the string.

144
00:06:53,160 --> 00:06:55,540
And then I'm going to keep
iterating through that string,

145
00:06:55,540 --> 00:07:01,290
adding strstr[j] to sum, and then at the
end of the day going to return sum mod

146
00:07:01,290 --> 00:07:02,180
HASH_MAX.

147
00:07:02,180 --> 00:07:04,680
>> Basically all this hash
function is doing is adding up

148
00:07:04,680 --> 00:07:09,430
all of the ASCII values of
my string, and then it's

149
00:07:09,430 --> 00:07:12,370
returning some hashcode
modded by HASH_MAX.

150
00:07:12,370 --> 00:07:15,770
It's probably the size
of my array, right?

151
00:07:15,770 --> 00:07:19,140
I don't want to be getting hash
codes if my array is of size 10,

152
00:07:19,140 --> 00:07:21,690
I don't want to be getting
out hash codes 11, 12,

153
00:07:21,690 --> 00:07:24,670
13, I can't put things into
those locations of the array,

154
00:07:24,670 --> 00:07:25,770
that would be illegal.

155
00:07:25,770 --> 00:07:28,110
I'd suffer a segmentation fault.

156
00:07:28,110 --> 00:07:31,210
>> Now here is another quick aside.

157
00:07:31,210 --> 00:07:34,450
Generally you're probably not going to
want to write your own hash functions.

158
00:07:34,450 --> 00:07:36,970
It is actually a bit of
an art, not a science.

159
00:07:36,970 --> 00:07:38,560
And there's a lot that goes into them.

160
00:07:38,560 --> 00:07:41,260
The internet, like I said, is full
of really good hash functions,

161
00:07:41,260 --> 00:07:44,450
and you should use the internet to
find hash functions because it's really

162
00:07:44,450 --> 00:07:49,210
just kind of an unnecessary
waste of time to create your own.

163
00:07:49,210 --> 00:07:51,470
>> You can write simple ones
for testing purposes.

164
00:07:51,470 --> 00:07:54,300
But when you actually are going to
start hashing data and storing it

165
00:07:54,300 --> 00:07:56,760
into a hash table you're
probably going to want

166
00:07:56,760 --> 00:08:00,560
to use some function that was generated
for you, that exists on the internet.

167
00:08:00,560 --> 00:08:03,220
If you do just be sure
to cite your sources.

168
00:08:03,220 --> 00:08:07,720
There's no reason to
plagiarize anything here.

169
00:08:07,720 --> 00:08:11,350
>> The computer science community is
definitely growing, and really values

170
00:08:11,350 --> 00:08:14,770
open source, and it's really important
to cite your sources so that people

171
00:08:14,770 --> 00:08:16,650
can get attribution for
the work that they're

172
00:08:16,650 --> 00:08:18,290
doing to the benefit of the community.

173
00:08:18,290 --> 00:08:20,000
So always be sure--
and not just for hash

174
00:08:20,000 --> 00:08:23,250
functions, but generally when you
use code from an outside source,

175
00:08:23,250 --> 00:08:24,270
always cite your source.

176
00:08:24,270 --> 00:08:29,480
Give credit to the person who did
some of the work so you don't have to.

177
00:08:29,480 --> 00:08:32,700
>> OK so let's revisit this
hash table for a second.

178
00:08:32,700 --> 00:08:34,760
This is where we left
off after we inserted

179
00:08:34,760 --> 00:08:39,280
John and Paul into this hash table.

180
00:08:39,280 --> 00:08:41,160
Do you see a problem here?

181
00:08:41,160 --> 00:08:44,390
You might see two.

182
00:08:44,390 --> 00:08:48,220
But in particular, do you
see this possible problem?

183
00:08:48,220 --> 00:08:52,750
>> What if I hash Ringo, and it
turns out that after processing

184
00:08:52,750 --> 00:08:59,980
that data through the hash function
Ringo also generated the hashcode 6.

185
00:08:59,980 --> 00:09:02,790
I've already got data at
hashcode-- array location 6.

186
00:09:02,790 --> 00:09:07,980
So it's probably going to be a bit
of a problem for me now, right?

187
00:09:07,980 --> 00:09:09,040
>> We call this a collision.

188
00:09:09,040 --> 00:09:12,440
And the collision occurs when two
pieces of data run through the same hash

189
00:09:12,440 --> 00:09:15,180
function yield the same hashcode.

190
00:09:15,180 --> 00:09:18,830
Presumably we still want to get both
pieces of data into the hash table,

191
00:09:18,830 --> 00:09:22,310
otherwise we wouldn't be running Ringo
arbitrarily through the hash function.

192
00:09:22,310 --> 00:09:26,850
We presumably want to get
Ringo into that array.

193
00:09:26,850 --> 00:09:32,050
>> How do we do it though, if he
and Paul both yield hashcode 6?

194
00:09:32,050 --> 00:09:35,500
We don't want to overwrite Paul,
we want Paul to be there too.

195
00:09:35,500 --> 00:09:38,770
So we need to find a way to get
elements into the hash table that

196
00:09:38,770 --> 00:09:42,530
still preserves our quick
insertion and quick look up.

197
00:09:42,530 --> 00:09:47,330
And one way to deal with it is to
do something called linear probing.

198
00:09:47,330 --> 00:09:50,150
>> Using this method if we have a
collision, well, what do we do?

199
00:09:50,150 --> 00:09:54,560
Well we can't put him in array location
6, or whatever hashcode was generated,

200
00:09:54,560 --> 00:09:56,800
let's put him at hashcode plus 1.

201
00:09:56,800 --> 00:09:59,650
And if that's full let's
put him in hashcode plus 2.

202
00:09:59,650 --> 00:10:03,400
The benefit of this being if he's
not exactly where we think he is,

203
00:10:03,400 --> 00:10:06,830
and we have to start searching,
maybe we don't have to go too far.

204
00:10:06,830 --> 00:10:09,880
Maybe we don't have to search
all n elements of the hash table.

205
00:10:09,880 --> 00:10:11,770
Maybe we have to search
a couple of them.

206
00:10:11,770 --> 00:10:16,660
>> And so we're still tending towards
that average case being close to 1 vs

207
00:10:16,660 --> 00:10:20,260
close to n, so maybe that'll work.

208
00:10:20,260 --> 00:10:23,990
So let's see how this
might work out in reality.

209
00:10:23,990 --> 00:10:27,440
And let's see if maybe we can detect
the problem that might occur here.

210
00:10:27,440 --> 00:10:28,520
>> Let's say we hash Bart.

211
00:10:28,520 --> 00:10:31,560
So now we're going to run a new set
of strings through the hash function,

212
00:10:31,560 --> 00:10:34,440
and we run Bart through the hash
function, we get hashcode 6.

213
00:10:34,440 --> 00:10:38,080
We take a look, we see 6 is
empty, so we can put Bart there.

214
00:10:38,080 --> 00:10:41,390
>> Now we hash Lisa and that
also generates hashcode 6.

215
00:10:41,390 --> 00:10:44,310
Well now that we're using this
linear probing method we start at 6,

216
00:10:44,310 --> 00:10:45,310
we see that 6 is full.

217
00:10:45,310 --> 00:10:46,970
We can't put Lisa in 6.

218
00:10:46,970 --> 00:10:48,070
So where do we go?

219
00:10:48,070 --> 00:10:49,260
Let's go to 7.

220
00:10:49,260 --> 00:10:50,590
7's empty, so that works.

221
00:10:50,590 --> 00:10:52,490
So let's put Lisa there.

222
00:10:52,490 --> 00:10:56,180
>> Now we hash Homer and we get 7.

223
00:10:56,180 --> 00:10:59,360
OK well we know that 7's full
now, so we can't put Homer there.

224
00:10:59,360 --> 00:11:00,830
So let's go to 8.

225
00:11:00,830 --> 00:11:01,640
Is 8 available?

226
00:11:01,640 --> 00:11:04,310
Yeah, and 8's close to 7, so if
we have to start searching we're

227
00:11:04,310 --> 00:11:06,000
not going to have to go too far.

228
00:11:06,000 --> 00:11:08,840
And so let's put Homer at 8.

229
00:11:08,840 --> 00:11:11,130
>> Now we hash Maggie and
returns 3, thank goodness

230
00:11:11,130 --> 00:11:12,630
we're able to just put Maggie there.

231
00:11:12,630 --> 00:11:15,050
We don't have to do any
sort of probing for that.

232
00:11:15,050 --> 00:11:17,980
Now we hash Marge, and
Marge also returns 6.

233
00:11:17,980 --> 00:11:25,540
>> Well 6 is full, 7 is full, 8 is full,
9, all right thank God, 9 is empty.

234
00:11:25,540 --> 00:11:27,580
I can put Marge at 9.

235
00:11:27,580 --> 00:11:31,260
Already we can see that we're starting
to have this problem where now we're

236
00:11:31,260 --> 00:11:34,760
starting to stretch things kind
of far away from their hash codes.

237
00:11:34,760 --> 00:11:38,940
And that theta of 1, that average
case of being constant time,

238
00:11:38,940 --> 00:11:42,540
is starting to get a little more--
starting to tend a little more

239
00:11:42,540 --> 00:11:44,030
towards theta of n.

240
00:11:44,030 --> 00:11:47,250
We're starting to lose that
advantage of hash tables.

241
00:11:47,250 --> 00:11:49,880
>> This problem that we just saw
is something called clustering.

242
00:11:49,880 --> 00:11:52,340
And what's really bad about
clustering is that once you now

243
00:11:52,340 --> 00:11:55,580
have two elements that are side by
side it makes it even more likely,

244
00:11:55,580 --> 00:11:57,460
you have double the
chance, that you're going

245
00:11:57,460 --> 00:11:59,490
to have another collision
with that cluster,

246
00:11:59,490 --> 00:12:00,870
and the cluster will grow by one.

247
00:12:00,870 --> 00:12:05,620
And you'll keep growing and growing
your likelihood of having a collision.

248
00:12:05,620 --> 00:12:10,450
And eventually it's just as bad
as not sorting the data at all.

249
00:12:10,450 --> 00:12:13,570
>> The other problem though is we
still, and so far up to this point,

250
00:12:13,570 --> 00:12:16,310
we've just been sort of
understanding what a hash table is,

251
00:12:16,310 --> 00:12:17,940
we still only have room for 10 strings.

252
00:12:17,940 --> 00:12:23,010
If we want to continue to hash
the citizens of Springfield,

253
00:12:23,010 --> 00:12:24,770
we can only get 10 of them in there.

254
00:12:24,770 --> 00:12:28,590
And if we try and add an 11th or 12th,
we don't have a place to put them.

255
00:12:28,590 --> 00:12:31,640
We could just be spinning around in
circles trying to find an empty spot,

256
00:12:31,640 --> 00:12:34,980
and we maybe get stuck
in an infinite loop.

257
00:12:34,980 --> 00:12:37,790
>> So this sort of lends to the idea
of something called chaining.

258
00:12:37,790 --> 00:12:41,770
And this is where we're going to bring
linked lists back into the picture.

259
00:12:41,770 --> 00:12:47,280
What if instead of storing just
the data itself in the array,

260
00:12:47,280 --> 00:12:50,450
every element of the array could
hold multiple pieces of data?

261
00:12:50,450 --> 00:12:51,950
Well that doesn't make sense, right?

262
00:12:51,950 --> 00:12:55,260
We know that an array can only
hold-- each element of an array

263
00:12:55,260 --> 00:12:58,910
can only hold one piece
of data of that data type.

264
00:12:58,910 --> 00:13:02,730
>> But what if that data type
is a linked list, right?

265
00:13:02,730 --> 00:13:05,810
So what if every
element of the array was

266
00:13:05,810 --> 00:13:08,770
a pointer to the head of a linked list?

267
00:13:08,770 --> 00:13:10,660
And then we could build
those linked lists

268
00:13:10,660 --> 00:13:13,100
and grow them arbitrarily,
because linked lists allow

269
00:13:13,100 --> 00:13:18,070
us to grow and shrink a lot more
flexibly than an array does.

270
00:13:18,070 --> 00:13:20,190
So what if we now use,
we leverage this, right?

271
00:13:20,190 --> 00:13:23,740
We start to grow these chains
out of these array locations.

272
00:13:23,740 --> 00:13:26,720
>> Now we can fit an infinite
amount of data, or not infinite,

273
00:13:26,720 --> 00:13:30,150
an arbitrary amount of
data, into our hash table

274
00:13:30,150 --> 00:13:33,740
without ever running into
the problem of collision.

275
00:13:33,740 --> 00:13:36,720
We've also eliminated
clustering by doing this.

276
00:13:36,720 --> 00:13:40,360
And well we know that when we insert
into a linked list, if you recall

277
00:13:40,360 --> 00:13:43,750
from our video on linked lists, singly
linked lists and doubly linked lists,

278
00:13:43,750 --> 00:13:45,050
it's a constant time operation.

279
00:13:45,050 --> 00:13:46,690
We're just adding to the front.

280
00:13:46,690 --> 00:13:50,380
>> And for look up, well we do know
that look up in a linked list

281
00:13:50,380 --> 00:13:51,760
can be a problem, right?

282
00:13:51,760 --> 00:13:53,890
We have to search through
it from beginning to end.

283
00:13:53,890 --> 00:13:56,440
There's no random
access in a linked list.

284
00:13:56,440 --> 00:14:01,590
But if instead of having one linked
list where a lookup would be O of n,

285
00:14:01,590 --> 00:14:05,570
we now have 10 linked lists,
or 1,000 linked lists,

286
00:14:05,570 --> 00:14:10,100
now it's O of n divided by 10,
or O of n divided by 1,000.

287
00:14:10,100 --> 00:14:12,920
>> And while we were talking
theoretically about complexity

288
00:14:12,920 --> 00:14:16,040
we disregard constants, in the real
world these things actually matter,

289
00:14:16,040 --> 00:14:16,540
right?

290
00:14:16,540 --> 00:14:19,060
We actually will notice
that this happens

291
00:14:19,060 --> 00:14:22,590
to run 10 times faster,
or 1,000 times faster,

292
00:14:22,590 --> 00:14:27,550
because we're distributing one long
chain across 1,000 smaller chains.

293
00:14:27,550 --> 00:14:31,070
And so each time we have to search
through one of those chains we can

294
00:14:31,070 --> 00:14:36,270
ignore the 999 chains we don't care
about , and just search that one.

295
00:14:36,270 --> 00:14:39,200
>> Which is on average to
be 1,000 times shorter.

296
00:14:39,200 --> 00:14:42,440
And so we still are sort of
tending towards this average case

297
00:14:42,440 --> 00:14:47,590
of being constant time, but
only because we are leveraging

298
00:14:47,590 --> 00:14:49,710
dividing by some huge constant factor.

299
00:14:49,710 --> 00:14:51,780
Let's see how this might
actually look though.

300
00:14:51,780 --> 00:14:54,700
So this was the hash table we had
before we declared a hash table that

301
00:14:54,700 --> 00:14:56,250
was capable of storing 10 strings.

302
00:14:56,250 --> 00:14:57,700
We're not going to do that anymore.

303
00:14:57,700 --> 00:14:59,860
We already know the
limitations of that method.

304
00:14:59,860 --> 00:15:06,150
Now our hash table's going to be
an array of 10 nodes, pointers

305
00:15:06,150 --> 00:15:08,100
to heads of linked lists.

306
00:15:08,100 --> 00:15:09,810
>> And right now it's null.

307
00:15:09,810 --> 00:15:12,150
Each one of those 10 pointers is null.

308
00:15:12,150 --> 00:15:15,850
There's nothing in our
hash table right now.

309
00:15:15,850 --> 00:15:18,670
>> Now let's start to put some
things into this hash table.

310
00:15:18,670 --> 00:15:21,420
And let's see how this method is
going to benefit us a little bit.

311
00:15:21,420 --> 00:15:22,740
Let's now hash Joey.

312
00:15:22,740 --> 00:15:26,190
We'll will run the string Joey through
a hash function and we return 6.

313
00:15:26,190 --> 00:15:27,180
Well what do we do now?

314
00:15:27,180 --> 00:15:30,070
>> Well now working with linked lists,
we're not working with arrays.

315
00:15:30,070 --> 00:15:31,860
And when we're working
with linked lists we

316
00:15:31,860 --> 00:15:35,770
know we need to start dynamically
allocating space and building chains.

317
00:15:35,770 --> 00:15:40,000
That's sort of how-- those are the core
elements of building a linked list.

318
00:15:40,000 --> 00:15:43,650
So let's dynamically
allocate space for Joey,

319
00:15:43,650 --> 00:15:46,370
and then let's add him to the chain.

320
00:15:46,370 --> 00:15:49,150
>> So now look what we've done.

321
00:15:49,150 --> 00:15:52,420
When we hash Joey we got the hashcode 6.

322
00:15:52,420 --> 00:15:57,770
Now the pointer at array location 6
points to the head of a linked list,

323
00:15:57,770 --> 00:16:00,880
and right now it's the only
element of a linked list.

324
00:16:00,880 --> 00:16:04,330
And the node in that
linked list is Joey.

325
00:16:04,330 --> 00:16:08,280
>> So if we need to look up Joey
later, we just hash Joey again,

326
00:16:08,280 --> 00:16:12,280
we get 6 again because our
hash function is deterministic.

327
00:16:12,280 --> 00:16:16,380
And then we start at the head
of the linked list pointed

328
00:16:16,380 --> 00:16:19,340
to by array location
6, and we can iterate

329
00:16:19,340 --> 00:16:21,630
across that trying to find Joey.

330
00:16:21,630 --> 00:16:23,760
And if we build our
hash table effectively,

331
00:16:23,760 --> 00:16:27,770
and our hash function effectively
to distribute data well,

332
00:16:27,770 --> 00:16:31,750
on average each of those linked
lists at every array location

333
00:16:31,750 --> 00:16:36,090
will be 1/10 the size of if we
just had it as a single huge

334
00:16:36,090 --> 00:16:37,630
linked list with everything in it.

335
00:16:37,630 --> 00:16:41,650
>> If we distribute that huge linked
list across 10 linked lists

336
00:16:41,650 --> 00:16:43,740
each list will be 1/10 the size.

337
00:16:43,740 --> 00:16:47,410
And thus 10 times quicker
to search through.

338
00:16:47,410 --> 00:16:48,370
So let's do this again.

339
00:16:48,370 --> 00:16:50,270
Let's now hash Ross.

340
00:16:50,270 --> 00:16:53,520
>> And let's say Ross, when we do that
the hash code we get back is 2.

341
00:16:53,520 --> 00:16:56,610
Well now we dynamically allocate a
new node, we put Ross in that node,

342
00:16:56,610 --> 00:17:00,880
and we say now array location
2, instead of pointing to null,

343
00:17:00,880 --> 00:17:04,550
points to the head of a linked
list whose only node is Ross.

344
00:17:04,550 --> 00:17:08,640
And we can do this one more time, we
can hash Rachel and get hashcode 4.

345
00:17:08,640 --> 00:17:11,470
malloc a new node, put Rachel in
the node, and say a array location

346
00:17:11,470 --> 00:17:13,810
4 now points to the head
of a linked list whose

347
00:17:13,810 --> 00:17:17,400
only element happens to be Rachel.

348
00:17:17,400 --> 00:17:19,450
>> OK but what happens if
we have a collision?

349
00:17:19,450 --> 00:17:24,540
Let's see how we handle collisions
using the separate chaining method.

350
00:17:24,540 --> 00:17:25,780
Let's hash Phoebe.

351
00:17:25,780 --> 00:17:27,070
We get the hashcode 6.

352
00:17:27,070 --> 00:17:29,990
In our previous example we were just
storing the strings in the array.

353
00:17:29,990 --> 00:17:31,960
This was a problem.

354
00:17:31,960 --> 00:17:34,420
>> We don't want to clobber
Joey, and we've already

355
00:17:34,420 --> 00:17:37,090
seen that we can get some clustering
problems if we try and step

356
00:17:37,090 --> 00:17:38,180
through and probe.

357
00:17:38,180 --> 00:17:40,640
But what if we just kind of
treat this the same way, right?

358
00:17:40,640 --> 00:17:43,420
It's just like adding an element
to the head of a linked list.

359
00:17:43,420 --> 00:17:46,200
Let's just malloc space for Phoebe.

360
00:17:46,200 --> 00:17:50,740
>> We'll say Phoebe's next pointer points
to the old head of the linked list,

361
00:17:50,740 --> 00:17:53,160
and then 6 just points to the
new head of the linked list.

362
00:17:53,160 --> 00:17:56,130
And now look, we've changed Phoebe in.

363
00:17:56,130 --> 00:18:00,190
We can now store two
elements with hashcode 6,

364
00:18:00,190 --> 00:18:03,150
and we don't have any problems.

365
00:18:03,150 --> 00:18:06,120
>> That's pretty much all
there is to chaining.

366
00:18:06,120 --> 00:18:07,960
And chaining is definitely
the method that's

367
00:18:07,960 --> 00:18:13,370
going to be most effective for you if
you are storing data in a hash table.

368
00:18:13,370 --> 00:18:16,870
But this combination of
arrays and linked lists

369
00:18:16,870 --> 00:18:22,060
together to form a hash table really
dramatically improves your ability

370
00:18:22,060 --> 00:18:26,530
to store large amounts of data, and
very quickly and efficiently search

371
00:18:26,530 --> 00:18:27,610
through that data.

372
00:18:27,610 --> 00:18:30,130
>> There's still one more
data structure out there

373
00:18:30,130 --> 00:18:34,680
that might even be a bit
better in terms of guaranteeing

374
00:18:34,680 --> 00:18:37,900
that our insertion, deletion, and
look up times are even faster.

375
00:18:37,900 --> 00:18:41,610
And we'll see that in a video on tries.

376
00:18:41,610 --> 00:18:44,910
I'm Doug Lloyd, this is CS50.