1 00:00:00,000 --> 00:00:07,700 2 00:00:07,700 --> 00:00:10,890 >> KEVIN SCHMID: Sometimes, when building a program, you might want to utilize a 3 00:00:10,890 --> 00:00:13,190 data structure known as a dictionary. 4 00:00:13,190 --> 00:00:17,960 A dictionary maps keys, which are usually strings, to values: ints, 5 00:00:17,960 --> 00:00:21,900 chars, a pointer to some object, whatever we want. 6 00:00:21,900 --> 00:00:26,510 It's just like ordinary dictionaries that map words to definitions. 7 00:00:26,510 --> 00:00:29,440 >> Dictionaries provide us with the ability to store information 8 00:00:29,440 --> 00:00:32,750 associated with something and look it up later. 9 00:00:32,750 --> 00:00:36,620 So how do we actually implement a dictionary in, say, C code that we can 10 00:00:36,620 --> 00:00:38,460 use in one of our programs? 11 00:00:38,460 --> 00:00:41,790 Well, there are a lot of ways that we could implement a dictionary. 12 00:00:41,790 --> 00:00:45,930 >> For one, we could use an array that we dynamically resize, or we could use a 13 00:00:45,930 --> 00:00:49,150 linked list, a hash table, or a binary tree. 14 00:00:49,150 --> 00:00:52,250 But whatever we choose, we should be mindful of the efficiency and 15 00:00:52,250 --> 00:00:54,300 performance of the implementation. 16 00:00:54,300 --> 00:00:57,930 We should think about the algorithms used to insert items into and look up items in 17 00:00:57,930 --> 00:00:59,120 our data structure. 18 00:00:59,120 --> 00:01:03,060 >> For now, let's assume that we want to use strings as keys. 19 00:01:03,060 --> 00:01:07,290 Let's talk about one possibility, a data structure called a trie. 20 00:01:07,290 --> 00:01:11,210 So here's a visual representation of a trie. 21 00:01:11,210 --> 00:01:14,590 >> As the picture suggests, a trie is a tree data structure with 22 00:01:14,590 --> 00:01:16,050 nodes linked together. 
23 00:01:16,050 --> 00:01:19,420 We see that there's clearly a root node with some links extending to 24 00:01:19,420 --> 00:01:20,500 other nodes. 25 00:01:20,500 --> 00:01:23,040 But what does each node consist of? 26 00:01:23,040 --> 00:01:26,700 If we assume that we're storing keys with only alphabetic characters, and 27 00:01:26,700 --> 00:01:30,150 we don't care about capitalization, here's a definition of a node that 28 00:01:30,150 --> 00:01:31,100 will suffice. 29 00:01:31,100 --> 00:01:34,130 >> An object whose type is struct node has two parts 30 00:01:34,130 --> 00:01:35,740 called data and children. 31 00:01:35,740 --> 00:01:39,200 We've left the data part as a comment to be replaced by a component 32 00:01:39,200 --> 00:01:43,190 declaration when struct node is incorporated in a C program. 33 00:01:43,190 --> 00:01:47,040 The data part of a node might be a Boolean value to indicate whether or 34 00:01:47,040 --> 00:01:51,160 not the node represents the completion of a dictionary key, or it might be a 35 00:01:51,160 --> 00:01:54,240 string representing the definition of a word in the dictionary. 36 00:01:54,240 --> 00:01:58,870 >> We'll use a smiley face to indicate when data is present in a node. 37 00:01:58,870 --> 00:02:02,310 There are 26 elements in our children array, one index 38 00:02:02,310 --> 00:02:03,690 per alphabetic character. 39 00:02:03,690 --> 00:02:06,570 We'll see the significance of this soon. 40 00:02:06,570 --> 00:02:10,759 >> Let's take a closer look at the root node in our diagram, which has no data 41 00:02:10,759 --> 00:02:14,740 associated with it, as indicated by the absence of the smiley face in the 42 00:02:14,740 --> 00:02:16,110 data portion. 43 00:02:16,110 --> 00:02:19,910 The arrows extending from the parts of the children array represent non-null 44 00:02:19,910 --> 00:02:21,640 pointers to other nodes. 
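The node definition described above appeared on screen rather than in the audio. A minimal sketch of what it might look like in C follows, assuming the data part is a simple Boolean flag (the first of the two options mentioned). The names `is_key`, `ALPHABET_SIZE`, and `node_create` are illustrative assumptions, not taken from the video.

```c
#include <stdbool.h>
#include <stdlib.h>

#define ALPHABET_SIZE 26 // one slot per letter, ignoring capitalization

// One trie node: a data part plus an array of child pointers.
struct node
{
    bool is_key;                          // assumed data part: true if a key ends at this node
    struct node *children[ALPHABET_SIZE]; // children[0] is 'a', children[25] is 'z'
};

// Hypothetical helper: allocate a node with no data and no children.
// Returns NULL if the allocation fails.
struct node *node_create(void)
{
    struct node *n = malloc(sizeof(struct node));
    if (n == NULL)
    {
        return NULL;
    }
    n->is_key = false;
    for (int i = 0; i < ALPHABET_SIZE; i++)
    {
        n->children[i] = NULL; // a null pointer means "no link for this letter"
    }
    return n;
}
```

A freshly created node corresponds to a box in the diagram with no smiley face and no outgoing arrows.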
45 00:02:21,640 --> 00:02:25,500 For example, the arrow extending from the second element of children 46 00:02:25,500 --> 00:02:28,400 represents the letter B in a dictionary key. 47 00:02:28,400 --> 00:02:31,920 And in the larger diagram we label it with a B. 48 00:02:31,920 --> 00:02:35,810 >> Note that in the larger diagram, when we draw a pointer to another node, it 49 00:02:35,810 --> 00:02:39,100 doesn't matter where the arrowhead meets that other node. 50 00:02:39,100 --> 00:02:43,850 Our sample dictionary trie contains two words, bat and zoom. 51 00:02:43,850 --> 00:02:47,040 Let's walk through an example of looking up data for a key. 52 00:02:47,040 --> 00:02:50,800 >> Suppose we wanted to look up the corresponding value for the key bat. 53 00:02:50,800 --> 00:02:53,610 We'll begin our lookup at the root node. 54 00:02:53,610 --> 00:02:57,870 Then we'll take the first letter of our key, B, and find the corresponding 55 00:02:57,870 --> 00:03:00,020 spot in our children array. 56 00:03:00,020 --> 00:03:04,490 Notice that there are exactly 26 spots in the array, one for each letter of 57 00:03:04,490 --> 00:03:05,330 the alphabet. 58 00:03:05,330 --> 00:03:08,800 And we'll have the spots represent the letters of the alphabet in order. 59 00:03:08,800 --> 00:03:13,960 >> We'll look at the second index then, index one, for B. In general, if we 60 00:03:13,960 --> 00:03:17,990 have some alphabetic character c, we could determine the corresponding spot 61 00:03:17,990 --> 00:03:21,520 in the children array using a calculation like this. 62 00:03:21,520 --> 00:03:25,140 We could have used a larger children array if we wanted to offer lookup of 63 00:03:25,140 --> 00:03:28,380 keys with a wider range of characters, such as the entire 64 00:03:28,380 --> 00:03:29,880 ASCII character set. 65 00:03:29,880 --> 00:03:32,630 >> In this case, the pointer in our children array at 66 00:03:32,630 --> 00:03:34,320 index one is not null. 
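The index calculation mentioned above was shown on screen and is not captured in the transcript. One plausible form is sketched here, assuming a character encoding such as ASCII in which the letters a through z are contiguous. The function name `child_index` and the -1 return value for non-letters are assumptions made for illustration.

```c
#include <ctype.h>

// Hypothetical helper: map an alphabetic character to its spot in the
// children array (0 for 'a' or 'A', 1 for 'b' or 'B', and so on).
// Returns -1 if c is not a letter from a to z.
int child_index(char c)
{
    // Fold to lowercase so capitalization does not matter.
    int lower = tolower((unsigned char) c);
    if (lower < 'a' || lower > 'z')
    {
        return -1;
    }
    // Assumes 'a' through 'z' have consecutive codes, as in ASCII.
    return lower - 'a';
}
```

For the key in the walkthrough, the letter B maps to index one, the second element of the array.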
67 00:03:34,320 --> 00:03:36,600 So we'll continue looking up the key bat. 68 00:03:36,600 --> 00:03:40,130 If we ever encounter a null pointer at the proper spot in the children 69 00:03:40,130 --> 00:03:43,230 array while we traverse the nodes, then we'll have to say that we 70 00:03:43,230 --> 00:03:45,630 couldn't find anything for that key. 71 00:03:45,630 --> 00:03:49,370 >> Now, we'll take the second letter of our key, A, and continue following 72 00:03:49,370 --> 00:03:52,400 pointers in this way until we reach the end of our key. 73 00:03:52,400 --> 00:03:56,530 If we reach the end of the key without hitting any dead ends, null pointers, 74 00:03:56,530 --> 00:03:59,730 as is the case here, then we only have to check one more thing. 75 00:03:59,730 --> 00:04:02,110 Is this key actually in the dictionary? 76 00:04:02,110 --> 00:04:07,660 >> If so, we should find a value, well, a smiley face icon in our diagram, where 77 00:04:07,660 --> 00:04:08,750 the word ends. 78 00:04:08,750 --> 00:04:12,270 If there is something else stored with the data, then we can return it. 79 00:04:12,270 --> 00:04:16,500 For example, the key zoo is not in the dictionary, even though we could have 80 00:04:16,500 --> 00:04:19,810 reached the end of this key without ever hitting a null pointer while we 81 00:04:19,810 --> 00:04:21,089 iterate through the trie. 82 00:04:21,089 --> 00:04:25,436 >> If we tried to look up the key bath, the second to last node's array index, 83 00:04:25,436 --> 00:04:28,750 corresponding to the letter H, would have held a null pointer. 84 00:04:28,750 --> 00:04:31,120 So bath is not in the dictionary. 85 00:04:31,120 --> 00:04:34,800 And so a trie is unique in that the keys are never explicitly stored in 86 00:04:34,800 --> 00:04:36,650 the data structure. 87 00:04:36,650 --> 00:04:38,810 So how do we insert something into a trie? 88 00:04:38,810 --> 00:04:41,780 >> Let's insert the key zoo into our trie. 
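The lookup procedure walked through above can be sketched in C roughly as follows. This is one possible implementation under stated assumptions, not the code from the video: the data part is assumed to be a Boolean flag named `is_key`, and the function name `trie_contains` is hypothetical.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>

#define ALPHABET_SIZE 26

struct node
{
    bool is_key;                          // assumed data part (the "smiley face")
    struct node *children[ALPHABET_SIZE]; // one pointer per letter
};

// Hypothetical lookup: returns true only if key is stored in the trie.
bool trie_contains(const struct node *root, const char *key)
{
    const struct node *cur = root;
    for (const char *p = key; *p != '\0'; p++)
    {
        int lower = tolower((unsigned char) *p);
        if (lower < 'a' || lower > 'z')
        {
            return false; // only alphabetic keys are supported in this sketch
        }
        cur = cur->children[lower - 'a'];
        if (cur == NULL)
        {
            return false; // dead end: no key has this prefix
        }
    }
    // We reached the end of the key; it counts only if data is present here.
    return cur->is_key;
}
```

The final check on `is_key` is what separates a stored key from a mere prefix of one: in a trie holding only bat and zoom, the path for zoo exists, but the node at its end carries no data.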
89 00:04:41,780 --> 00:04:46,120 Remember that a smiley face at a node could correspond in code to a simple 90 00:04:46,120 --> 00:04:50,170 Boolean value to indicate that zoo is in the dictionary, or it could 91 00:04:50,170 --> 00:04:53,710 correspond to more information that we wish to associate with the key zoo, 92 00:04:53,710 --> 00:04:56,860 like the definition of the word or something else. 93 00:04:56,860 --> 00:05:00,350 In some ways, the process to insert something into a trie is similar to 94 00:05:00,350 --> 00:05:02,060 looking up something in a trie. 95 00:05:02,060 --> 00:05:05,720 >> We'll start with the root node again, following pointers corresponding to 96 00:05:05,720 --> 00:05:07,990 the letters of our key. 97 00:05:07,990 --> 00:05:11,310 Luckily, we were able to follow pointers all the way until we reached 98 00:05:11,310 --> 00:05:12,770 the end of the key. 99 00:05:12,770 --> 00:05:16,480 Since zoo is a prefix of the word zoom, which is a member of the 100 00:05:16,480 --> 00:05:19,440 dictionary, we don't need to allocate any new nodes. 101 00:05:19,440 --> 00:05:23,140 >> We can modify the node to indicate that the path of characters leading to 102 00:05:23,140 --> 00:05:25,360 it represents a key in our dictionary. 103 00:05:25,360 --> 00:05:28,630 Now, let's try inserting the key bath into the trie. 104 00:05:28,630 --> 00:05:32,260 We'll start at the root node and follow pointers again. 105 00:05:32,260 --> 00:05:35,620 But in this situation, we hit a dead end before we're able to get to the 106 00:05:35,620 --> 00:05:36,940 end of the key. 107 00:05:36,940 --> 00:05:40,980 Now, we'll need to allocate some new nodes. We'll need to allocate one new 108 00:05:40,980 --> 00:05:43,660 node for each remaining letter of our key. 109 00:05:43,660 --> 00:05:46,740 >> In this case, we just need to allocate one new node. 110 00:05:46,740 --> 00:05:50,590 Then we'll need to make the H index reference this new node. 
111 00:05:50,590 --> 00:05:54,070 Once again, we can modify the node to indicate that the path of characters 112 00:05:54,070 --> 00:05:57,120 leading to it represents a key in our dictionary. 113 00:05:57,120 --> 00:06:00,730 Let's reason about the asymptotic complexity of our procedures for these 114 00:06:00,730 --> 00:06:02,110 two operations. 115 00:06:02,110 --> 00:06:06,420 >> We notice that in both cases the number of steps our algorithm took was 116 00:06:06,420 --> 00:06:09,470 proportional to the number of letters in the key. 117 00:06:09,470 --> 00:06:10,220 That's right. 118 00:06:10,220 --> 00:06:13,470 When you want to look up a word in a trie, you just need to iterate through 119 00:06:13,470 --> 00:06:17,100 the letters one by one until you either reach the end of the word or 120 00:06:17,100 --> 00:06:19,060 hit a dead end in the trie. 121 00:06:19,060 --> 00:06:22,470 >> And when you wish to insert a key-value pair into a trie using the 122 00:06:22,470 --> 00:06:26,250 procedure we discussed, the worst case will have you allocating a new node 123 00:06:26,250 --> 00:06:27,550 for each letter. 124 00:06:27,550 --> 00:06:31,290 And we'll assume that allocation is a constant-time operation. 125 00:06:31,290 --> 00:06:35,850 So if we assume that the key length is bounded by a fixed constant, both 126 00:06:35,850 --> 00:06:39,400 insertion and lookup are constant-time operations for a trie. 127 00:06:39,400 --> 00:06:42,930 >> If we don't make this assumption that the key length is bounded by a fixed 128 00:06:42,930 --> 00:06:46,650 constant, then insertion and lookup, in the worst case, are linear in the 129 00:06:46,650 --> 00:06:48,240 length of the key. 130 00:06:48,240 --> 00:06:51,800 Notice that the number of items stored in the trie doesn't affect the lookup 131 00:06:51,800 --> 00:06:52,820 or insertion time. 132 00:06:52,820 --> 00:06:55,360 It's only impacted by the length of the key. 
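The insertion procedure described earlier (follow existing pointers, allocate a node wherever a dead end is hit, then mark the final node) might be sketched in C like this. It assumes a Boolean data part named `is_key`, and the names `node_create` and `trie_insert` are hypothetical. The single loop over the letters of the key is what makes the running time linear in the key length and independent of how many keys are already stored.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>

#define ALPHABET_SIZE 26

struct node
{
    bool is_key;                          // assumed data part (the "smiley face")
    struct node *children[ALPHABET_SIZE]; // one pointer per letter
};

// Hypothetical helper: allocate an empty node, or return NULL on failure.
struct node *node_create(void)
{
    struct node *n = malloc(sizeof(struct node));
    if (n == NULL)
    {
        return NULL;
    }
    n->is_key = false;
    for (int i = 0; i < ALPHABET_SIZE; i++)
    {
        n->children[i] = NULL;
    }
    return n;
}

// Hypothetical insert: returns false on a non-alphabetic character or a
// failed allocation. In that case nodes created so far stay in the trie,
// which is harmless because none of them is marked as a key.
bool trie_insert(struct node *root, const char *key)
{
    struct node *cur = root;
    for (const char *p = key; *p != '\0'; p++)
    {
        int lower = tolower((unsigned char) *p);
        if (lower < 'a' || lower > 'z')
        {
            return false;
        }
        int i = lower - 'a';
        if (cur->children[i] == NULL)
        {
            // Dead end: allocate a new node for this letter.
            cur->children[i] = node_create();
            if (cur->children[i] == NULL)
            {
                return false;
            }
        }
        cur = cur->children[i];
    }
    cur->is_key = true; // the path leading here now represents a key
    return true;
}
```

Inserting zoo into a trie that already holds zoom allocates nothing and only sets the flag, while inserting bath after bat allocates exactly one node, matching the two cases in the walkthrough.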
133 00:06:55,360 --> 00:06:59,300 >> By contrast, adding entries to, say, a hash table tends to make 134 00:06:59,300 --> 00:07:01,250 future lookups slower. 135 00:07:01,250 --> 00:07:04,520 While this may sound appealing at first, we should keep in mind that a 136 00:07:04,520 --> 00:07:08,740 favorable asymptotic complexity doesn't mean that in practice the data 137 00:07:08,740 --> 00:07:11,410 structure is necessarily beyond reproach. 138 00:07:11,410 --> 00:07:15,860 We must also consider that to store a word in a trie we need, in the worst 139 00:07:15,860 --> 00:07:19,700 case, a number of nodes proportional to the length of the word itself. 140 00:07:19,700 --> 00:07:21,880 >> Tries tend to use a lot of space. 141 00:07:21,880 --> 00:07:25,620 That's in contrast to a hash table, where we only need one new node to 142 00:07:25,620 --> 00:07:27,940 store some key-value pair. 143 00:07:27,940 --> 00:07:31,370 Now, again in theory, large space consumption doesn't seem like a big 144 00:07:31,370 --> 00:07:34,620 deal, especially given that modern computers have gigabytes and 145 00:07:34,620 --> 00:07:36,180 gigabytes of memory. 146 00:07:36,180 --> 00:07:39,200 But it turns out that we still have to worry about memory usage and 147 00:07:39,200 --> 00:07:42,540 organization for the sake of performance, since modern computers 148 00:07:42,540 --> 00:07:46,960 have mechanisms in place under the hood to speed up memory access. 149 00:07:46,960 --> 00:07:51,180 >> But these mechanisms work best when memory accesses are made in compact 150 00:07:51,180 --> 00:07:52,810 regions or areas. 151 00:07:52,810 --> 00:07:55,910 And the nodes of a trie could reside anywhere in the heap. 152 00:07:55,910 --> 00:07:58,390 But these are trade-offs that we must consider. 
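To put a rough number on the space cost, here is a small sketch that computes the worst-case memory for the new nodes one key might need, assuming a node with a Boolean data part and 26 child pointers. The helper name `worst_case_bytes` is hypothetical, and actual sizes depend on the platform: with 8-byte pointers, as on many 64-bit systems, each node takes at least 26 × 8 = 208 bytes.

```c
#include <stdbool.h>
#include <stddef.h>

#define ALPHABET_SIZE 26

struct node
{
    bool is_key;                          // assumed data part
    struct node *children[ALPHABET_SIZE]; // 26 pointers dominate the node's size
};

// Hypothetical helper: bytes of new nodes needed to store one key in the
// worst case, where no prefix of the key exists yet and every letter
// requires its own node. Allocator overhead is not counted.
size_t worst_case_bytes(size_t key_length)
{
    return key_length * sizeof(struct node);
}
```

Under the 8-byte pointer assumption, a four-letter key such as bath could need more than 800 bytes in the worst case, even though the key itself is only four characters long.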
153 00:07:58,390 --> 00:08:01,440 >> Remember that, when choosing a data structure for a certain task, we 154 00:08:01,440 --> 00:08:04,420 should think about what kinds of operations the data structure needs to 155 00:08:04,420 --> 00:08:07,140 support and how much the performance of each of those 156 00:08:07,140 --> 00:08:09,080 operations matters to us. 157 00:08:09,080 --> 00:08:11,300 These operations may even extend beyond just 158 00:08:11,300 --> 00:08:13,430 basic lookup and insertion. 159 00:08:13,430 --> 00:08:17,010 Suppose we wanted to implement a kind of auto-complete functionality, much 160 00:08:17,010 --> 00:08:18,890 like Google's search engine does. 161 00:08:18,890 --> 00:08:22,210 That is, return all the keys, and potentially values, which 162 00:08:22,210 --> 00:08:24,130 have a given prefix. 163 00:08:24,130 --> 00:08:27,050 >> A trie is uniquely useful for this operation. 164 00:08:27,050 --> 00:08:29,890 It's straightforward to iterate through the trie for each character of 165 00:08:29,890 --> 00:08:30,950 the prefix. 166 00:08:30,950 --> 00:08:33,559 Just like a lookup operation, we could follow pointers 167 00:08:33,559 --> 00:08:35,400 character by character. 168 00:08:35,400 --> 00:08:38,659 Then, when we arrive at the end of the prefix, we could iterate through the 169 00:08:38,659 --> 00:08:42,049 remaining portion of the data structure, since any of the keys beyond 170 00:08:42,049 --> 00:08:43,980 this point have the prefix. 171 00:08:43,980 --> 00:08:47,670 >> It's also easy to obtain this listing in alphabetical order, since the 172 00:08:47,670 --> 00:08:50,970 elements of the children array are ordered alphabetically. 173 00:08:50,970 --> 00:08:54,420 So hopefully you'll consider giving tries a try. 174 00:08:54,420 --> 00:08:56,085 I'm Kevin Schmid, and this is CS50. 
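The auto-complete idea described above (follow the prefix, then visit everything beneath it in child order) could be sketched in C as follows. This is an illustrative sketch, not code from the video: the names `trie_print_with_prefix` and `print_subtree`, the Boolean `is_key` data part, and the `MAX_KEY_LENGTH` bound on key length are all assumptions.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define ALPHABET_SIZE 26
#define MAX_KEY_LENGTH 45 // assumed upper bound on key length for this sketch

struct node
{
    bool is_key;                          // assumed data part (the "smiley face")
    struct node *children[ALPHABET_SIZE]; // one pointer per letter
};

// Depth-first walk below n. buf[0..len) holds the letters on the path to n.
// Prints each key found and returns how many were printed.
static int print_subtree(const struct node *n, char *buf, size_t len)
{
    int count = 0;
    if (n->is_key)
    {
        buf[len] = '\0';
        puts(buf);
        count++;
    }
    if (len == MAX_KEY_LENGTH)
    {
        return count; // no room in buf for longer keys
    }
    // Visiting children in index order yields alphabetical output.
    for (int i = 0; i < ALPHABET_SIZE; i++)
    {
        if (n->children[i] != NULL)
        {
            buf[len] = (char) ('a' + i);
            count += print_subtree(n->children[i], buf, len + 1);
        }
    }
    return count;
}

// Hypothetical auto-complete: prints every stored key that starts with
// prefix, in alphabetical order, and returns the number of keys printed.
int trie_print_with_prefix(const struct node *root, const char *prefix)
{
    char buf[MAX_KEY_LENGTH + 1];
    size_t len = 0;
    const struct node *cur = root;
    for (const char *p = prefix; *p != '\0'; p++)
    {
        int lower = tolower((unsigned char) *p);
        if (lower < 'a' || lower > 'z' || len == MAX_KEY_LENGTH)
        {
            return 0;
        }
        cur = cur->children[lower - 'a'];
        if (cur == NULL)
        {
            return 0; // dead end: no key has this prefix
        }
        buf[len++] = (char) lower;
    }
    return print_subtree(cur, buf, len);
}
```

Because a node's own key is printed before its children are visited, a shorter key such as zoo is listed before a longer key that extends it, such as zoom.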