1
00:00:00,000 --> 00:00:07,700

2
00:00:07,700 --> 00:00:10,890
>> KEVIN SCHMID: Sometimes, when building a
program, you might want to utilize a

3
00:00:10,890 --> 00:00:13,190
data structure known as a dictionary.

4
00:00:13,190 --> 00:00:17,960
A dictionary maps keys, which are
usually strings, to values: ints,

5
00:00:17,960 --> 00:00:21,900
chars, a pointer to some object,
whatever we want.

6
00:00:21,900 --> 00:00:26,510
It's just like ordinary dictionaries
that map words to definitions.

7
00:00:26,510 --> 00:00:29,440
>> Dictionaries provide us with the
ability to store information

8
00:00:29,440 --> 00:00:32,750
associated with something
and look it up later.

9
00:00:32,750 --> 00:00:36,620
So how do we actually implement a
dictionary in, say, C code that we can

10
00:00:36,620 --> 00:00:38,460
use in one of our programs?

11
00:00:38,460 --> 00:00:41,790
Well, there are a lot of ways that
we could implement a dictionary.

12
00:00:41,790 --> 00:00:45,930
>> For one, we could use an array that we
dynamically re-size or we could use a

13
00:00:45,930 --> 00:00:49,150
linked list, a hash table,
or a binary tree.

14
00:00:49,150 --> 00:00:52,250
But whatever we choose, we should
be mindful of the efficiency and

15
00:00:52,250 --> 00:00:54,300
performance of the implementation.

16
00:00:54,300 --> 00:00:57,930
We should think about the algorithm used
to insert and look up items into

17
00:00:57,930 --> 00:00:59,120
our data structure.

18
00:00:59,120 --> 00:01:03,060
>> For now, let's assume that we
want to use strings as keys.

19
00:01:03,060 --> 00:01:07,290
Let's talk about one possibility,
a data structure called a trie.

20
00:01:07,290 --> 00:01:11,210
So here's a visual representation
of a trie.

21
00:01:11,210 --> 00:01:14,590
>> As the picture suggests, a trie
is a tree data structure with

22
00:01:14,590 --> 00:01:16,050
nodes linked together.

23
00:01:16,050 --> 00:01:19,420
We see that there's clearly a root
node with some links extending to

24
00:01:19,420 --> 00:01:20,500
other nodes.

25
00:01:20,500 --> 00:01:23,040
But what does each node consist of?

26
00:01:23,040 --> 00:01:26,700
If we assume that we're storing keys
with only alphabetic characters, and

27
00:01:26,700 --> 00:01:30,150
we don't care about capitalization,
here's a definition of a node that

28
00:01:30,150 --> 00:01:31,100
will suffice.

29
00:01:31,100 --> 00:01:34,130
>> An object whose type is struct
node has two parts

30
00:01:34,130 --> 00:01:35,740
called data and children.

31
00:01:35,740 --> 00:01:39,200
We've left the data part as a comment
to be replaced by a component

32
00:01:39,200 --> 00:01:43,190
declaration when struct node is
incorporated in a C program.

33
00:01:43,190 --> 00:01:47,040
The data part of a node might be a
Boolean value to indicate whether or

34
00:01:47,040 --> 00:01:51,160
not the node represents the completion
of a dictionary key or it might be a

35
00:01:51,160 --> 00:01:54,240
string representing the definition
of a word in the dictionary.

36
00:01:54,240 --> 00:01:58,870
>> We'll use a smiley face to indicate
when data is present in a node.

37
00:01:58,870 --> 00:02:02,310
There are 26 elements in our
children array, one index

38
00:02:02,310 --> 00:02:03,690
per alphabetic character.

39
00:02:03,690 --> 00:02:06,570
We'll see the significance
of this soon.
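
The node just described could be sketched in C roughly as follows. This is only a sketch: it assumes the data part is a simple Boolean, and the names `is_key`, `ALPHABET_SIZE`, and `node_create` are our own choices for illustration, not taken from the slide.

```c
#include <stdbool.h>
#include <stdlib.h>

#define ALPHABET_SIZE 26 /* one slot per letter, a through z */

typedef struct node
{
    /* The "data" part: true if a key ends at this node (the smiley face). */
    bool is_key;
    /* children[0] is for 'a', children[1] for 'b', ..., children[25] for 'z'. */
    struct node *children[ALPHABET_SIZE];
}
node;

/* Allocate a node with no data and all 26 child pointers set to NULL.
   Returns NULL if the allocation fails. */
node *node_create(void)
{
    node *n = malloc(sizeof(node));
    if (n == NULL)
    {
        return NULL;
    }
    n->is_key = false;
    for (int i = 0; i < ALPHABET_SIZE; i++)
    {
        n->children[i] = NULL;
    }
    return n;
}
```

If the dictionary should store definitions rather than just membership, `is_key` could be swapped for, say, a `char *` that is NULL when no key ends at the node.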

40
00:02:06,570 --> 00:02:10,759
>> Let's take a closer look at the root node
in our diagram, which has no data

41
00:02:10,759 --> 00:02:14,740
associated with it, as indicated by the
absence of the smiley face in the

42
00:02:14,740 --> 00:02:16,110
data portion.

43
00:02:16,110 --> 00:02:19,910
The arrows extending from the parts of
the children array represent non-null

44
00:02:19,910 --> 00:02:21,640
pointers to other nodes.

45
00:02:21,640 --> 00:02:25,500
For example, the arrow extending from
the second element of children

46
00:02:25,500 --> 00:02:28,400
represents the letter B
in a dictionary key.

47
00:02:28,400 --> 00:02:31,920
And in the larger diagram
we label it with a B.

48
00:02:31,920 --> 00:02:35,810
>> Note that in the larger diagram, when we
draw a pointer to another node, it

49
00:02:35,810 --> 00:02:39,100
doesn't matter where the arrowhead
meets that other node.

50
00:02:39,100 --> 00:02:43,850
Our sample dictionary trie contains
two words, bat and zoom.

51
00:02:43,850 --> 00:02:47,040
Let's walk through an example of
looking up data for a key.

52
00:02:47,040 --> 00:02:50,800
>> Suppose we wanted to look up the
corresponding value for the key bat.

53
00:02:50,800 --> 00:02:53,610
We'll begin our look up
at the root node.

54
00:02:53,610 --> 00:02:57,870
Then we'll take the first letter of our
key, B, and find the corresponding

55
00:02:57,870 --> 00:03:00,020
spot in our children array.

56
00:03:00,020 --> 00:03:04,490
Notice that there are exactly 26 spots
in the array, one for each letter of

57
00:03:04,490 --> 00:03:05,330
the alphabet.

58
00:03:05,330 --> 00:03:08,800
And we'll have the spots represent the
letters of the alphabet in order.

59
00:03:08,800 --> 00:03:13,960
>> We'll look at the second index then,
index one, for B. In general, if we

60
00:03:13,960 --> 00:03:17,990
have some alphabetic character c, we
could determine the corresponding spot

61
00:03:17,990 --> 00:03:21,520
in the children array using
a calculation like this.
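
The calculation shown on screen isn't captured in this transcript. One common way to do it, assuming a character set such as ASCII where the letters a through z are contiguous, is sketched here. The function name `child_index` is hypothetical.

```c
#include <ctype.h>

/* Map an alphabetic character to a slot from 0 to 25, ignoring case.
   Returns -1 for anything that is not a letter a-z or A-Z.
   Assumes 'a'..'z' are contiguous, as they are in ASCII. */
int child_index(char c)
{
    /* Cast to unsigned char first: tolower expects a value in that range. */
    int lower = tolower((unsigned char) c);
    if (lower < 'a' || lower > 'z')
    {
        return -1;
    }
    return lower - 'a';
}
```

With this, both 'b' and 'B' land on index 1, the second spot in the children array.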

62
00:03:21,520 --> 00:03:25,140
We could have used a larger children
array if we wanted to offer look up of

63
00:03:25,140 --> 00:03:28,380
keys with a wider range of characters,
such as the entire

64
00:03:28,380 --> 00:03:29,880
ASCII character set.

65
00:03:29,880 --> 00:03:32,630
>> In this case, the pointer
in our children array at

66
00:03:32,630 --> 00:03:34,320
index one is not null.

67
00:03:34,320 --> 00:03:36,600
So we'll continue looking
up the key bat.

68
00:03:36,600 --> 00:03:40,130
If we ever encountered a null pointer
at the proper spot in the children

69
00:03:40,130 --> 00:03:43,230
array while we traversed the nodes,
then we'd have to say that we

70
00:03:43,230 --> 00:03:45,630
couldn't find anything for that key.

71
00:03:45,630 --> 00:03:49,370
>> Now, we'll take the second letter of
our key, A, and continue following

72
00:03:49,370 --> 00:03:52,400
pointers in this way until we
reach the end of our key.

73
00:03:52,400 --> 00:03:56,530
If we reach the end of the key without
hitting any dead ends, null pointers,

74
00:03:56,530 --> 00:03:59,730
as is the case here, then we only
have to check one more thing.

75
00:03:59,730 --> 00:04:02,110
Is this key actually
in the dictionary?

76
00:04:02,110 --> 00:04:07,660
>> If so, we should find a value (well, a
smiley face icon in our diagram) where

77
00:04:07,660 --> 00:04:08,750
the word ends.

78
00:04:08,750 --> 00:04:12,270
If there is something else stored with
the data, then we can return it.

79
00:04:12,270 --> 00:04:16,500
For example, the key zoo is not in the
dictionary, even though we could have

80
00:04:16,500 --> 00:04:19,810
reached the end of this key without ever
hitting a null pointer, while we

81
00:04:19,810 --> 00:04:21,089
iterate through the trie.

82
00:04:21,089 --> 00:04:25,436
>> If we tried to look up the key bath, the
second to last node's array index,

83
00:04:25,436 --> 00:04:28,750
corresponding to the letter H, would
have held a null pointer.

84
00:04:28,750 --> 00:04:31,120
So bath is not in the dictionary.
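
The lookup walk above might be sketched in C like this. It's a minimal sketch under a few assumptions: the data part is a Boolean called `is_key`, keys are case-insensitive letters only, and the names `trie_contains` and `child_index` are hypothetical.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

#define ALPHABET_SIZE 26

typedef struct node
{
    bool is_key;                          /* true if a key ends at this node */
    struct node *children[ALPHABET_SIZE]; /* one slot per letter, in order */
}
node;

/* Slot for a letter, ignoring case, or -1 if c is not a letter (assumes ASCII). */
static int child_index(char c)
{
    int lower = tolower((unsigned char) c);
    return (lower >= 'a' && lower <= 'z') ? lower - 'a' : -1;
}

/* Follow one pointer per letter of key, starting at root.
   Returns true only if we reach the end of the key without hitting
   a NULL pointer AND the node we end on is marked as a key. */
bool trie_contains(const node *root, const char *key)
{
    const node *cur = root;
    for (size_t i = 0; cur != NULL && key[i] != '\0'; i++)
    {
        int idx = child_index(key[i]);
        if (idx < 0)
        {
            return false; /* not a letter, so it can't be in this trie */
        }
        cur = cur->children[idx]; /* may become NULL: a dead end */
    }
    return cur != NULL && cur->is_key;
}
```

The final check on `is_key` is what makes a prefix like zoo come back as not found when only zoom has been stored.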

85
00:04:31,120 --> 00:04:34,800
And so a trie is unique in that the keys
are never explicitly stored in

86
00:04:34,800 --> 00:04:36,650
the data structure.

87
00:04:36,650 --> 00:04:38,810
So how do we insert something
into a trie?

88
00:04:38,810 --> 00:04:41,780
>> Let's insert the key
zoo into our trie.

89
00:04:41,780 --> 00:04:46,120
Remember that a smiley face at a node
could correspond in code to a simple

90
00:04:46,120 --> 00:04:50,170
Boolean value to indicate that zoo
is in the dictionary or it could

91
00:04:50,170 --> 00:04:53,710
correspond to more information that we
wish to associate with the key zoo,

92
00:04:53,710 --> 00:04:56,860
like the definition of the
word or something else.

93
00:04:56,860 --> 00:05:00,350
In some ways, the process to insert
something into a trie is similar to

94
00:05:00,350 --> 00:05:02,060
looking up something in a trie.

95
00:05:02,060 --> 00:05:05,720
>> We'll start with the root node again,
following pointers corresponding to

96
00:05:05,720 --> 00:05:07,990
the letters of our key.

97
00:05:07,990 --> 00:05:11,310
Luckily, we were able to follow pointers
all the way until we reached

98
00:05:11,310 --> 00:05:12,770
the end of the key.

99
00:05:12,770 --> 00:05:16,480
Since zoo is a prefix of the word
zoom, which is a member of the

100
00:05:16,480 --> 00:05:19,440
dictionary, we don't need to
allocate any new nodes.

101
00:05:19,440 --> 00:05:23,140
>> We can modify the node to indicate that
the path of characters leading to

102
00:05:23,140 --> 00:05:25,360
it represents a key in our dictionary.

103
00:05:25,360 --> 00:05:28,630
Now, let's try inserting the
key bath into the trie.

104
00:05:28,630 --> 00:05:32,260
We'll start at the root node
and follow pointers again.

105
00:05:32,260 --> 00:05:35,620
But in this situation, we hit a dead
end before we're able to get to the

106
00:05:35,620 --> 00:05:36,940
end of the key.

107
00:05:36,940 --> 00:05:40,980
Now, we'll need to allocate some new
nodes. We'll need to allocate one new

108
00:05:40,980 --> 00:05:43,660
node for each remaining
letter of our key.

109
00:05:43,660 --> 00:05:46,740
>> In this case, we just need
to allocate one new node.

110
00:05:46,740 --> 00:05:50,590
Then we'll need to make the H index
reference this new node.

111
00:05:50,590 --> 00:05:54,070
Once again, we can modify the node to
indicate that the path of characters

112
00:05:54,070 --> 00:05:57,120
leading to it represents a
key in our dictionary.
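
The insertion procedure could look something like the following in C. This is a sketch, not the course's reference code: it assumes a Boolean data part, letters-only keys, and the hypothetical names `node_create`, `child_index`, and `trie_insert`.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define ALPHABET_SIZE 26

typedef struct node
{
    bool is_key;                          /* true if a key ends at this node */
    struct node *children[ALPHABET_SIZE]; /* one slot per letter, in order */
}
node;

/* Slot for a letter, ignoring case, or -1 if c is not a letter (assumes ASCII). */
static int child_index(char c)
{
    int lower = tolower((unsigned char) c);
    return (lower >= 'a' && lower <= 'z') ? lower - 'a' : -1;
}

/* New node with no data and all children NULL; NULL on allocation failure. */
node *node_create(void)
{
    node *n = malloc(sizeof(node));
    if (n != NULL)
    {
        n->is_key = false;
        for (int i = 0; i < ALPHABET_SIZE; i++)
        {
            n->children[i] = NULL;
        }
    }
    return n;
}

/* Insert key into the trie rooted at root.
   Returns false if root is NULL, the key has a non-letter, or allocation fails. */
bool trie_insert(node *root, const char *key)
{
    if (root == NULL)
    {
        return false;
    }
    /* Check the whole key first so a bad key never changes the trie. */
    for (size_t i = 0; key[i] != '\0'; i++)
    {
        if (child_index(key[i]) < 0)
        {
            return false;
        }
    }
    node *cur = root;
    for (size_t i = 0; key[i] != '\0'; i++)
    {
        int idx = child_index(key[i]);
        if (cur->children[idx] == NULL)
        {
            /* Dead end: allocate one new node for this letter. */
            cur->children[idx] = node_create();
            if (cur->children[idx] == NULL)
            {
                return false;
            }
        }
        cur = cur->children[idx];
    }
    cur->is_key = true; /* the path leading here now represents a key */
    return true;
}
```

Inserting zoo after zoom allocates nothing and only flips `is_key`, while inserting bath after bat allocates exactly one node, for the h.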

113
00:05:57,120 --> 00:06:00,730
Let's reason about the asymptotic
complexity of our procedures for these

114
00:06:00,730 --> 00:06:02,110
two operations.

115
00:06:02,110 --> 00:06:06,420
>> We notice that in both cases the number
of steps our algorithm took was

116
00:06:06,420 --> 00:06:09,470
proportional to the number of
letters in the key.

117
00:06:09,470 --> 00:06:10,220
That's right.

118
00:06:10,220 --> 00:06:13,470
When you want to look up a word in a
trie you just need to iterate through

119
00:06:13,470 --> 00:06:17,100
the letters one by one until you either
reach the end of the word or

120
00:06:17,100 --> 00:06:19,060
hit a dead end in the trie.

121
00:06:19,060 --> 00:06:22,470
>> And when you wish to insert a key-value
pair into a trie using the

122
00:06:22,470 --> 00:06:26,250
procedure we discussed, the worst case
will have you allocating a new node

123
00:06:26,250 --> 00:06:27,550
for each letter.

124
00:06:27,550 --> 00:06:31,290
And we'll assume that allocation
is a constant time operation.

125
00:06:31,290 --> 00:06:35,850
So if we assume that the key length is
bounded by a fixed constant, both

126
00:06:35,850 --> 00:06:39,400
insertion and look up are constant
time operations for a trie.

127
00:06:39,400 --> 00:06:42,930
>> If we don't make this assumption that
the key length is bounded by a fixed

128
00:06:42,930 --> 00:06:46,650
constant, then insertion and look up,
in the worst case, are linear in the

129
00:06:46,650 --> 00:06:48,240
length of the key.

130
00:06:48,240 --> 00:06:51,800
Notice that the number of items stored
in the trie doesn't affect the look up

131
00:06:51,800 --> 00:06:52,820
or insertion time.

132
00:06:52,820 --> 00:06:55,360
It's only impacted by the
length of the key.

133
00:06:55,360 --> 00:06:59,300
>> By contrast, adding entries to, say,
a hash table tends to make

134
00:06:59,300 --> 00:07:01,250
future lookups slower.

135
00:07:01,250 --> 00:07:04,520
While this may sound appealing at first,
we should keep in mind that a

136
00:07:04,520 --> 00:07:08,740
favorable asymptotic complexity doesn't
mean that in practice the data

137
00:07:08,740 --> 00:07:11,410
structure is necessarily
beyond reproach.

138
00:07:11,410 --> 00:07:15,860
We must also consider that to store a
word in a trie we need, in the worst

139
00:07:15,860 --> 00:07:19,700
case, a number of nodes proportional
to the length of the word itself.

140
00:07:19,700 --> 00:07:21,880
>> Tries tend to use a lot of space.

141
00:07:21,880 --> 00:07:25,620
That's in contrast to a hash table,
where we only need one new node to

142
00:07:25,620 --> 00:07:27,940
store some key-value pair.

143
00:07:27,940 --> 00:07:31,370
Now, again in theory, large space
consumption doesn't seem like a big

144
00:07:31,370 --> 00:07:34,620
deal, especially given that modern
computers have gigabytes and

145
00:07:34,620 --> 00:07:36,180
gigabytes of memory.

146
00:07:36,180 --> 00:07:39,200
But it turns out that we still have
to worry about memory usage and

147
00:07:39,200 --> 00:07:42,540
organization for the sake of
performance, since modern computers

148
00:07:42,540 --> 00:07:46,960
have mechanisms in place under the
hood to speed up memory access.

149
00:07:46,960 --> 00:07:51,180
>> But these mechanisms work best when
memory accesses are made in compact

150
00:07:51,180 --> 00:07:52,810
regions or areas.

151
00:07:52,810 --> 00:07:55,910
And the nodes of a trie could reside
anywhere in the heap.

152
00:07:55,910 --> 00:07:58,390
But these are trade-offs
that we must consider.

153
00:07:58,390 --> 00:08:01,440
>> Remember that, when choosing a data
structure for a certain task, we

154
00:08:01,440 --> 00:08:04,420
should think about what kinds of
operations the data structure needs to

155
00:08:04,420 --> 00:08:07,140
support and how much the performance
of each of those

156
00:08:07,140 --> 00:08:09,080
operations matters to us.

157
00:08:09,080 --> 00:08:11,300
These operations may even
extend beyond just

158
00:08:11,300 --> 00:08:13,430
basic look up and insertion.

159
00:08:13,430 --> 00:08:17,010
Suppose we wanted to implement a kind
of auto-complete functionality, much

160
00:08:17,010 --> 00:08:18,890
like Google's search engine does.

161
00:08:18,890 --> 00:08:22,210
That is, return all the keys and
potentially values which

162
00:08:22,210 --> 00:08:24,130
have a given prefix.

163
00:08:24,130 --> 00:08:27,050
>> A trie is uniquely useful
for this operation.

164
00:08:27,050 --> 00:08:29,890
It's straightforward to iterate through
the trie for each character of

165
00:08:29,890 --> 00:08:30,950
the prefix.

166
00:08:30,950 --> 00:08:33,559
Just like a look up operation,
we could follow pointers

167
00:08:33,559 --> 00:08:35,400
character by character.

168
00:08:35,400 --> 00:08:38,659
Then, when we arrive at the end of the
prefix, we could iterate through the

169
00:08:38,659 --> 00:08:42,049
remaining portion of the data structure
since any of the keys beyond

170
00:08:42,049 --> 00:08:43,980
this point have the prefix.

171
00:08:43,980 --> 00:08:47,670
>> It's also easy to obtain this listing
in alphabetical order since the

172
00:08:47,670 --> 00:08:50,970
elements of the children array
are ordered alphabetically.
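
A prefix listing along these lines might be sketched in C as below. Several things here are assumptions made for illustration: the Boolean `is_key` field, the `MAX_KEY` bound on key length, the caller-supplied output array, and the names `collect` and `trie_keys_with_prefix`.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define ALPHABET_SIZE 26
#define MAX_KEY 45 /* assumed upper bound on key length for this sketch */

typedef struct node
{
    bool is_key;                          /* true if a key ends at this node */
    struct node *children[ALPHABET_SIZE]; /* one slot per letter, in order */
}
node;

/* Slot for a letter, ignoring case, or -1 if c is not a letter (assumes ASCII). */
static int child_index(char c)
{
    int lower = tolower((unsigned char) c);
    return (lower >= 'a' && lower <= 'z') ? lower - 'a' : -1;
}

/* Depth-first walk below n. buf holds the letters on the path so far.
   Visiting children in index order yields keys in alphabetical order.
   Returns the updated number of keys copied into out. */
static size_t collect(const node *n, char *buf, size_t depth,
                      char out[][MAX_KEY + 1], size_t max_out, size_t count)
{
    if (n->is_key && count < max_out)
    {
        buf[depth] = '\0';
        strcpy(out[count], buf);
        count++;
    }
    if (depth == MAX_KEY)
    {
        return count; /* don't descend past the assumed length bound */
    }
    for (int i = 0; i < ALPHABET_SIZE; i++)
    {
        if (n->children[i] != NULL)
        {
            buf[depth] = (char) ('a' + i);
            count = collect(n->children[i], buf, depth + 1, out, max_out, count);
        }
    }
    return count;
}

/* Copy up to max_out keys that start with prefix into out, in alphabetical
   order (lowercased). Returns how many keys were copied. */
size_t trie_keys_with_prefix(const node *root, const char *prefix,
                             char out[][MAX_KEY + 1], size_t max_out)
{
    size_t len = strlen(prefix);
    if (root == NULL || len > MAX_KEY)
    {
        return 0;
    }
    char buf[MAX_KEY + 1];
    const node *cur = root;
    /* First, follow pointers letter by letter, just like a lookup. */
    for (size_t i = 0; i < len; i++)
    {
        int idx = child_index(prefix[i]);
        if (idx < 0 || cur->children[idx] == NULL)
        {
            return 0; /* no key has this prefix */
        }
        cur = cur->children[idx];
        buf[i] = (char) ('a' + idx);
    }
    /* Then every key at or below this node has the prefix. */
    return collect(cur, buf, len, out, max_out, 0);
}
```

Because the keys aren't stored explicitly, each one is rebuilt in `buf` from the path of indices taken to reach it.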

173
00:08:50,970 --> 00:08:54,420
So hopefully you'll consider
giving tries a try.

174
00:08:54,420 --> 00:08:56,085
I'm Kevin Schmid, and this is CS50.

175
00:08:56,085 --> 00:08:58,745

176
00:08:58,745 --> 00:09:00,790
>> Ah, this is the beginning
of the decline.

177
00:09:00,790 --> 00:09:01,350
I'm sorry.

178
00:09:01,350 --> 00:09:01,870
Sorry.

179
00:09:01,870 --> 00:09:02,480
Sorry.

180
00:09:02,480 --> 00:09:03,130
Sorry.

181
00:09:03,130 --> 00:09:03,950
>> Strike four.

182
00:09:03,950 --> 00:09:04,360
I'm out.

183
00:09:04,360 --> 00:09:05,280
Sorry.

184
00:09:05,280 --> 00:09:06,500
Sorry.

185
00:09:06,500 --> 00:09:07,490
Sorry.

186
00:09:07,490 --> 00:09:12,352
Sorry for making the person who
has to edit this go crazy.

187
00:09:12,352 --> 00:09:13,280
>> Sorry.

188
00:09:13,280 --> 00:09:13,880
Sorry.

189
00:09:13,880 --> 00:09:15,080
Sorry.

190
00:09:15,080 --> 00:09:15,680
Sorry.

191
00:09:15,680 --> 00:09:16,280
>> SPEAKER 1: Well done.

192
00:09:16,280 --> 00:09:17,530
That was really well done.

193
00:09:17,530 --> 00:09:18,430