DOUG LLOYD: So in CS50, we've covered a lot of different data structures, right? We've seen arrays, and linked lists, and hash tables, and tries, stacks and queues. We'll also learn a little about trees and heaps, but really these all just end up being variations on a theme. There really are these kind of four basic ideas that everything else can boil down to: arrays, linked lists, hash tables, and tries. And like I said, there are variations on them, but this is pretty much going to summarize everything we're going to talk about in this class in terms of C.

But how do these all measure up, right? We've talked about the pros and cons of each in separate videos on them, but there's a lot of numbers getting thrown around, and a lot of general thoughts getting thrown around. Let's try and consolidate it into just one place. Let's weigh the pros against the cons, and consider which data structure might be the right data structure for your particular situation, whatever kind of data you're storing. You don't necessarily always need to use the super fast insertion, deletion, and lookup of a trie if you really don't care about inserting and deleting too much.
If you just need quick random access, maybe an array is better. So let's distill that. Let's talk about each of the four major kinds of data structures that we've talked about, and just see when they might be good, and when they might not be so good.

So let's start with arrays. Insertion, that's kind of bad. Insertion at the end of an array is OK, if we're building an array as we go. But if we need to insert elements into the middle (think back to insertion sort), there's a lot of shifting to fit an element in there. And so if we're going to insert anywhere but the end of an array, that's probably not so great.

Similarly, deletion, unless we're deleting from the end of an array, is probably also not so great if we don't want to leave empty gaps, which usually we don't. We want to remove an element, and then sort of make it snug again. And so deleting elements from an array is also not so great.

Lookup, though, is great. We have random access, constant time lookup. We just say seven, and we go to array location seven. We say 20, and we go to array location 20. We don't have to iterate across. That's pretty good.
Arrays are also relatively easy to sort. Every time we talked about a sorting algorithm, such as selection sort, insertion sort, bubble sort, or merge sort, we always used arrays to do it, because arrays are pretty easy to sort, relative to the data structures we've seen so far.

They're also relatively small. There's not a lot of extra space. You just set aside exactly as much as you need to hold your data, and that's pretty much it. So they're pretty small and efficient in that way.

But another downside is that they are fixed in size. We have to declare exactly how big we want our array to be, and we only get one shot at it. We can't grow and shrink it. If we need to grow or shrink it, we need to declare an entirely new array, and copy all of the elements of the first array into the second array. And if we miscalculated that time, we need to do it again. Not so great. So arrays don't give us the flexibility to have variable numbers of elements.

With a linked list, insertion is pretty easy. We just tack onto the front. Deletion is also pretty easy. We have to find the element first, and that involves some searching.
But once you've found the element you're looking for, all you need to do is change a pointer, possibly two if you have a doubly linked list, and then you can just free the node. You don't have to shift everything around. You just change two pointers, so that's pretty quick.

Lookup is bad, though, right? In order for us to find an element in a linked list, whether singly or doubly linked, we have to linear search it. We have to start at the beginning and move to the end, or start at the end and move to the beginning. We don't have random access anymore. So if we're doing a lot of searching, maybe a linked list isn't quite so good for us.

They're also really difficult to sort, right? The only way you can really sort a linked list is to sort it as you construct it. But if you sort it as you construct it, you're no longer making quick insertions anymore. You're not just tacking things onto the front. You have to find the right spot to put it, and then your insertion becomes just about as bad as inserting into an array. So linked lists are not so great for sorting data.

They're also pretty small, size-wise.
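Those two properties, constant-time insertion at the front and linear-time lookup, can be sketched in C like this. The names `node`, `prepend`, and `find` are just illustrative:

```c
#include <assert.h>
#include <stdlib.h>

// A minimal singly linked list node.
typedef struct node
{
    int value;
    struct node *next;
} node;

// O(1) insertion: tack a new node onto the front of the list and
// return the new head.
node *prepend(node *head, int value)
{
    node *n = malloc(sizeof(node));
    if (n == NULL)
    {
        return head;  // allocation failed; leave the list unchanged
    }
    n->value = value;
    n->next = head;
    return n;
}

// O(n) lookup: linear search from the head, since there's no random access.
node *find(node *head, int value)
{
    for (node *cur = head; cur != NULL; cur = cur->next)
    {
        if (cur->value == value)
        {
            return cur;
        }
    }
    return NULL;
}
```

Note that `prepend` never touches any node but the new one, which is exactly why it stays constant time no matter how long the list gets.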
Doubly linked lists are slightly larger than singly linked lists, which are slightly larger than arrays, but it's not a huge amount of wasted space. So if space is at a premium, but not a really intense premium, this might be the right way to go.

Hash tables. Insertion into a hash table is fairly straightforward. It's a two-step process. First we need to run our data through a hash function to get a hash code, and then we insert the element into the hash table at that hash code location.

Deletion, similar to a linked list, is easy once you find the element. You have to find it first, but then when you delete it, you just need to exchange a couple of pointers, if you're using separate chaining. If you're using probing, or if you're not using chaining at all in your hash table, deletion is actually really easy. All you need to do is hash the data, and then go to that location. And assuming you don't have any collisions, you'll be able to delete very quickly.

Now, lookup is where things get a little more complicated. It's on average better than linked lists.
If you're using chaining, you still have a linked list, which means you still have the search detriment of a linked list. But because you're taking your linked list and splitting it over 100 or 1,000 or n locations in your hash table, your linked lists are all one nth the size. They're all substantially smaller. You have n linked lists instead of one linked list of size n.

And so this real-world constant factor, which we generally don't talk about in time complexity, does actually make a difference here. So lookup is still linear search if you're using chaining, but the length of the list you're searching through is very, very short by comparison.

Again, if sorting is your goal here, a hash table's probably not the right way to go. Just use an array if sorting is really important to you.

And they can run the gamut of size. It's hard to say whether a hash table is small or big, because it really depends on how large your hash table is. If you're only going to be storing five elements in your hash table, and you have a hash table with 10,000 locations in it, you're probably wasting a lot of space.
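Here's a toy chained hash table in C showing that two-step insertion, hash first, then prepend onto the bucket's short chain. The bucket count and the multiply-by-31 hash function are just illustrative choices, not recommendations:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define NUM_BUCKETS 16

typedef struct entry
{
    char key[32];
    struct entry *next;
} entry;

entry *buckets[NUM_BUCKETS];  // globals start out all NULL

// Step 1 of insertion: hash the data to pick a bucket.
unsigned int hash(const char *key)
{
    unsigned int h = 0;
    for (int i = 0; key[i] != '\0'; i++)
    {
        h = h * 31 + (unsigned char) key[i];
    }
    return h % NUM_BUCKETS;
}

// Step 2: prepend onto that bucket's (short) linked list.
void insert(const char *key)
{
    entry *e = malloc(sizeof(entry));
    if (e == NULL)
    {
        return;  // allocation failed
    }
    strncpy(e->key, key, sizeof(e->key) - 1);
    e->key[sizeof(e->key) - 1] = '\0';
    unsigned int b = hash(key);
    e->next = buckets[b];
    buckets[b] = e;
}

// Lookup: hash, then linear-search only that one bucket's chain,
// not the whole data set.
bool contains(const char *key)
{
    for (entry *cur = buckets[hash(key)]; cur != NULL; cur = cur->next)
    {
        if (strcmp(cur->key, key) == 0)
        {
            return true;
        }
    }
    return false;
}
```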
You can also have very compact hash tables, but the smaller your hash table gets, the longer each of those linked lists gets. And so there's really no way to define exactly the size of a hash table, but it's probably safe to say it's generally going to be bigger than a linked list storing the same data, but smaller than a trie.

And tries are the fourth of these structures that we've been talking about. Inserting into a trie is complex. There's a lot of dynamic memory allocation, especially at the beginning, as you're starting to build. But it's constant time. It's only the human element here that makes it tricky. Having to encounter a null pointer, malloc space, go there, possibly malloc space from there again. The sort of intimidation factor of pointers and dynamic memory allocation is the hurdle to clear. But once you've cleared it, insertion actually becomes quite simple, and it certainly is constant time.

Deletion is easy. All you need to do is navigate down a couple of pointers and free the node, so that's pretty good. Lookup is also pretty fast. It's only based on the length of your data.
So if all of your data is five-character strings, for example, and you're storing those five-character strings in your trie, it only takes five steps to find what you're looking for. Five is just a constant factor, so again, insertion, deletion, and lookup here are all effectively constant time.

Another thing is that your trie is actually kind of already sorted, right? By virtue of how we're inserting elements, by going letter by letter of the key, or digit by digit of the key, typically, your trie ends up being kind of sorted as you build it. It doesn't really make sense to think about sorting in the same way we think about it with arrays, or linked lists, or hash tables. But in some sense, your trie is sorted as you go.

The downside, of course, is that a trie rapidly becomes huge. From every junction point, if your key consists of digits, you have 10 other places you can go, which means that every node contains information about the data you want to store at that node, plus 10 pointers. Which, on the CS50 IDE, is 80 bytes. So it's at least 80 bytes for every node that you create, and that's not even counting data.
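A sketch of that digit-keyed trie in C makes both points concrete: the 10-pointer node layout (80 bytes of child pointers alone with 8-byte pointers), and lookup that costs one step per character of the key, no matter how many other keys the trie holds. The names `trienode`, `trie_insert`, and `trie_contains` are just for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

// One node of a digit-keyed trie: 10 child pointers plus a flag.
typedef struct trienode
{
    bool is_key;                    // does some key end at this node?
    struct trienode *children[10];  // 80 bytes on an 8-byte-pointer system
} trienode;

// Insertion: walk digit by digit, mallocing nodes only where the path
// doesn't exist yet. "key" is a string of digit characters, e.g. "1234".
void trie_insert(trienode *root, const char *key)
{
    trienode *cur = root;
    for (int i = 0; key[i] != '\0'; i++)
    {
        int d = key[i] - '0';
        if (cur->children[d] == NULL)
        {
            cur->children[d] = calloc(1, sizeof(trienode));
            if (cur->children[d] == NULL)
            {
                return;  // allocation failed; give up on this insert
            }
        }
        cur = cur->children[d];
    }
    cur->is_key = true;
}

// Lookup: one pointer-follow per character of the key.
bool trie_contains(trienode *root, const char *key)
{
    trienode *cur = root;
    for (int i = 0; key[i] != '\0'; i++)
    {
        cur = cur->children[key[i] - '0'];
        if (cur == NULL)
        {
            return false;
        }
    }
    return cur->is_key;
}
```

Swapping the 10-way `children` array for a 26-way (or 52-way) one is all it takes to key on letters instead, which is exactly where the size blowup discussed next comes from.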
And if your nodes are keyed on letters instead of digits, now you have 26 pointers from every location. And 26 times 8 is 208 bytes, or thereabouts. And then you have capital and lowercase; you can see where I'm going with this, right? Your nodes can get really big, and so the trie itself, overall, can get really big, too. So if space is at a high premium on your system, a trie might not be the right way to go, even though its other benefits come into play.

I'm Doug Lloyd. This is CS50.