DOUG LLOYD: So in CS50, we've covered a lot of different data structures, right? We've seen arrays, and linked lists, and hash tables, and tries, stacks and queues. We'll also learn a little about trees and heaps, but really these all just end up being variations on a theme. There really are these four basic ideas that everything else can boil down to: arrays, linked lists, hash tables, and tries. And like I said, there are variations on them, but this is pretty much going to summarize everything we're going to talk about in this class in terms of C.

But how do these all measure up, right? We've talked about the pros and cons of each in separate videos on them, but there's a lot of numbers getting thrown around, and a lot of general thoughts getting thrown around. Let's try and consolidate it all into just one place. Let's weigh the pros against the cons, and consider which data structure might be the right data structure for your particular situation, whatever kind of data you're storing. You don't necessarily always need the super-fast insertion, deletion, and lookup of a trie if you really don't care about inserting and deleting too much.
If you just need quick random access, maybe an array is better. So let's distill that. Let's talk about each of the four major kinds of data structures that we've discussed, and just see when they might be good, and when they might not be so good. So let's start with arrays.

So insertion, that's kind of bad. Insertion at the end of an array is OK, if we're building an array as we go. But if we need to insert elements into the middle (think back to insertion sort), there's a lot of shifting to fit an element in there. And so if we're going to insert anywhere but the end of an array, that's probably not so great.

Similarly, deletion, unless we're deleting from the end of an array, is probably also not so great if we don't want to leave empty gaps, which usually we don't. We want to remove an element, and then sort of make it snug again. And so deleting elements from an array is also not so great.

Lookup, though, is great. We have random access, constant-time lookup. We just say seven, and we go to array location seven. We say 20, and we go to array location 20. We don't have to iterate across. That's pretty good.

Arrays are also relatively easy to sort.
Every time we talked about a sorting algorithm, such as selection sort, insertion sort, bubble sort, or merge sort, we always used arrays to do it, because arrays are pretty easy to sort, relative to the data structures we've seen so far.

They're also relatively small. There's not a lot of extra space. You just set aside exactly as much as you need to hold your data, and that's pretty much it. So they're pretty small and efficient in that way. Another downside, though, is that they are fixed in size. We have to declare exactly how big we want our array to be, and we only get one shot at it. We can't grow or shrink it.

If we need to grow or shrink it, we need to declare an entirely new array and copy all of the elements of the first array into the second array. And if we miscalculated that time, we need to do it again. Not so great. So arrays don't give us the flexibility to have variable numbers of elements.

With a linked list, insertion is pretty easy. We just tack onto the front. Deletion is also pretty easy. We do have to find the element first, though, which involves some searching.
But once you've found the element you're looking for, all you need to do is change a pointer, possibly two if you have a doubly linked list, and then you can just free the node. You don't have to shift everything around. You just change a pointer or two, so that's pretty quick.

Lookup is bad, though, right? In order for us to find an element in a linked list, whether singly or doubly linked, we have to linear-search it. We have to start at the beginning and move to the end, or start at the end and move to the beginning. We don't have random access anymore. So if we're doing a lot of searching, maybe a linked list isn't quite so good for us.

They're also really difficult to sort, right? The only way you can really sort a linked list is to sort it as you construct it. But if you sort it as you construct it, you're no longer making quick insertions anymore. You're not just tacking things onto the front. You have to find the right spot to put each element, and then your insertion becomes just about as bad as inserting into an array. So linked lists are not so great for sorting data.

They're also pretty small, size-wise.
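The two linked-list operations just described, constant-time insertion at the front and deletion by rewiring a pointer, might look like this in C. This is a sketch, not code from the video, and the function names are illustrative:

```c
// A singly linked list: prepending is constant time; deletion is a
// linear search followed by a single pointer change, as described above.
#include <stdlib.h>

typedef struct node
{
    int value;
    struct node *next;
} node;

// Constant-time insertion: tack the new node onto the front.
node *prepend(node *head, int value)
{
    node *n = malloc(sizeof(node));
    if (n == NULL)
        return head;          // allocation failed; leave list unchanged
    n->value = value;
    n->next = head;
    return n;                 // the new node is the new head
}

// Linear search, then one pointer change to unlink the node.
node *delete_value(node *head, int value)
{
    node **pp = &head;
    while (*pp != NULL && (*pp)->value != value)
        pp = &(*pp)->next;
    if (*pp != NULL)
    {
        node *doomed = *pp;
        *pp = doomed->next;   // rewire around the node
        free(doomed);
    }
    return head;
}
```

The pointer-to-pointer in `delete_value` is one common way to avoid special-casing deletion of the head node.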
Doubly linked lists are slightly larger than singly linked lists, which are slightly larger than arrays, but it's not a huge amount of wasted space. So if space is at a premium, but not a really intense premium, this might be the right way to go.

Hash tables. Insertion into a hash table is fairly straightforward. It's a two-step process. First we run our data through a hash function to get a hash code, and then we insert the element into the hash table at that hash code's location.

Deletion, similar to linked lists, is easy once you find the element. You have to find it first, but then when you delete it, you just need to exchange a couple of pointers, if you're using separate chaining. If you're using probing, or if you're not using chaining at all in your hash table, deletion is actually really easy. All you need to do is hash the data, and then go to that location. And assuming you don't have any collisions, you'll be able to delete very quickly.

Now, lookup is where things get a little more complicated. On average it's better than linked lists. If you're using chaining, you still have a linked list, which means you still have the search detriment of a linked list.
But because you're taking your linked list and splitting it over 100 or 1,000 or n locations in your hash table, your linked lists are all one nth the size. They're all substantially smaller. You have n linked lists instead of one linked list of size n.

And so this real-world constant factor, which we generally don't talk about in time complexity, does actually make a difference here. So lookup is still a linear search if you're using chaining, but the length of the list you're searching through is very, very short by comparison. Again, if sorting is your goal here, a hash table's probably not the right way to go. Just use an array if sorting is really important to you.

And they can run the gamut of size. It's hard to say whether a hash table is small or big, because it really depends on how large your hash table is. If you're only going to be storing five elements, and you have a hash table with 10,000 locations in it, you're probably wasting a lot of space. By contrast, you can also have very compact hash tables, but the smaller your hash table gets, the longer each of those linked lists gets.
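The two-step insertion and the shortened chained lookup described above can be sketched in C. This is a minimal illustration, not the video's code; the table size of 16 and the character-sum hash function are arbitrary toy choices:

```c
// A tiny hash table with separate chaining: insertion hashes the key,
// then prepends into the linked list at that location; lookup is a
// linear search, but only through one short chain.
#include <stdlib.h>
#include <string.h>

#define BUCKETS 16

typedef struct entry
{
    char key[32];
    struct entry *next;
} entry;

entry *table[BUCKETS];            // zero-initialized: all chains empty

// Toy hash function: sum the characters, mod the table size.
unsigned int hash(const char *key)
{
    unsigned int h = 0;
    for (int i = 0; key[i] != '\0'; i++)
        h += (unsigned char) key[i];
    return h % BUCKETS;
}

// Step 1: hash the key. Step 2: insert at that location (chaining).
void insert(const char *key)
{
    entry *e = malloc(sizeof(entry));
    if (e == NULL)
        return;
    strncpy(e->key, key, sizeof(e->key) - 1);
    e->key[sizeof(e->key) - 1] = '\0';
    unsigned int code = hash(key);
    e->next = table[code];        // prepend into this bucket's chain
    table[code] = e;
}

// Lookup: search only the one chain the key hashes to.
int contains(const char *key)
{
    for (entry *e = table[hash(key)]; e != NULL; e = e->next)
        if (strcmp(e->key, key) == 0)
            return 1;
    return 0;
}
```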
And so there's really no way to define exactly the size of a hash table, but it's probably safe to say it's generally going to be bigger than a linked list storing the same data, but smaller than a trie.

And tries are the fourth of these structures that we've been talking about. Inserting into a trie is complex. There's a lot of dynamic memory allocation, especially at the beginning, as you're starting to build. But it's constant time. It's only the human element here that makes it tricky: having to encounter a null pointer, malloc space, go there, possibly malloc space from there again. The sort of intimidation factor of pointers and dynamic memory allocation is the hurdle to clear. But once you've cleared it, insertion actually becomes quite simple, and it certainly is constant time.

Deletion is easy. All you need to do is navigate down a couple of pointers and free the node, so that's pretty good. Lookup is also pretty fast. It's only based on the length of your data. So if all of your data is five-character strings, for example (you're storing five-character strings in your trie), it only takes five steps to find what you're looking for.
Five is just a constant factor, so again, insertion, deletion, and lookup here are all effectively constant time.

Another thing is that your trie is actually kind of already sorted, right? By virtue of how we're inserting elements, by going letter by letter of the key, or digit by digit of the key, typically, your trie ends up being kind of sorted as you build it. It doesn't really make sense to think about sorting in the same way we think about it with arrays, or linked lists, or hash tables. But in some sense, your trie is sorted as you go.

The downside, of course, is that a trie rapidly becomes huge. From every junction point, you might have 10 other places you can go, if your key consists of digits, which means that every node contains information about the data you want to store at that node, plus 10 pointers. Which, on CS50 IDE, is 80 bytes. So it's at least 80 bytes for every node that you create, and that's not even counting data. And if your nodes branch on letters instead of digits, now you have 26 pointers from every location. And 26 times 8 is 208 bytes. And if you add capital letters on top of lowercase, you can see where I'm going with this, right?
Your nodes can get really big, and so the trie itself, overall, can get really big, too. So if space is at a high premium on your system, a trie might not be the right way to go, even though its other benefits come into play.

I'm Doug Lloyd. This is CS50.