WEBVTT X-TIMESTAMP-MAP=LOCAL:00:00:00.000,MPEGTS:900000 00:00:00.000 --> 00:00:02.480 [MUSIC PLAYING] 00:00:49.022 --> 00:00:49.980 DAVID MALAN: All right. 00:00:49.980 --> 00:00:52.860 This CS50 and this is week five. 00:00:52.860 --> 00:00:56.640 Recall that last week in week four, we introduced a few new building blocks, 00:00:56.640 --> 00:00:58.920 namely pointers and spoke in great detail 00:00:58.920 --> 00:01:01.230 about how you can now manipulate a computer's memory 00:01:01.230 --> 00:01:04.120 and begin to do things at a lower level with it. 00:01:04.120 --> 00:01:06.690 Well today, we'll sort of use those basic building blocks 00:01:06.690 --> 00:01:10.290 to start creating things called data structures in the computer's memory. 00:01:10.290 --> 00:01:12.330 It turns out that once you have this ability 00:01:12.330 --> 00:01:15.120 to refer to different locations in the computer's memory, 00:01:15.120 --> 00:01:18.390 you can stitch together your own custom shapes, 00:01:18.390 --> 00:01:20.910 your own custom data structures, as they're called. 00:01:20.910 --> 00:01:23.520 And indeed, we'll start doing that by rolling back 00:01:23.520 --> 00:01:27.150 for just a moment to where we first saw a data structure in week two. 00:01:27.150 --> 00:01:30.837 So recall in week two which was our second week of playing with C, 00:01:30.837 --> 00:01:32.670 we introduced you to the notion of an array. 00:01:32.670 --> 00:01:35.160 And an array is just a contiguous sequence of memory 00:01:35.160 --> 00:01:38.280 in which you can store a whole bunch of integers back to back to back, 00:01:38.280 --> 00:01:40.980 or maybe a whole bunch of chars back to back to back. 00:01:40.980 --> 00:01:45.280 And those arrays might have been represented pictorially like this. 00:01:45.280 --> 00:01:46.890 So this would be an array of size 3. 00:01:46.890 --> 00:01:49.560 And suppose that this were an array indeed of integers. 00:01:49.560 --> 00:01:52.650 We might have one, two, three inside of it. 00:01:52.650 --> 00:01:57.100 But suppose now-- and here's where we'll start to bump up against a problem, 00:01:57.100 --> 00:01:58.620 but also solve a problem today. 00:01:58.620 --> 00:02:01.260 Suppose that you want to add another number to this array, 00:02:01.260 --> 00:02:05.100 but you've only had the forethought to create an array of size 3. 00:02:05.100 --> 00:02:09.655 The catch with arrays in C is that they're not really easily resizable. 00:02:09.655 --> 00:02:12.780 Again, you all know that you have to decide in advance how big the array is 00:02:12.780 --> 00:02:13.630 going to be. 00:02:13.630 --> 00:02:16.200 So if you change your mind later or your program's 00:02:16.200 --> 00:02:19.260 running long enough where you need to store more values in it, 00:02:19.260 --> 00:02:20.550 you're in a bind. 00:02:20.550 --> 00:02:24.300 Like intuitively, if you wanted to insert the number 4 into this array, 00:02:24.300 --> 00:02:27.990 you would ideally just plop it right here at the end of the array 00:02:27.990 --> 00:02:30.660 and continue about your business. 00:02:30.660 --> 00:02:36.120 But the catch with an array is that that chunk of memory is not-- 00:02:36.120 --> 00:02:37.500 it doesn't exist in a vacuum. 00:02:37.500 --> 00:02:40.500 Recall if we zoom out and look at all of your computer's memory, 00:02:40.500 --> 00:02:43.170 this byte, this byte and a whole bunch of other bytes 00:02:43.170 --> 00:02:46.020 might very well be in use by other variables 00:02:46.020 --> 00:02:47.610 or other aspects of your program. 00:02:47.610 --> 00:02:49.710 So for instance for the sake of discussion, 00:02:49.710 --> 00:02:53.880 suppose that the program in question has one array of size 3 containing 00:02:53.880 --> 00:02:55.410 the integers 1, 2 3. 00:02:55.410 --> 00:02:59.520 And then suppose that your same program has a string somewhere in the code 00:02:59.520 --> 00:03:02.670 that you've written and that string is storing hello, world. 00:03:02.670 --> 00:03:05.970 Well next to this array in memory just by chance, 00:03:05.970 --> 00:03:13.380 may very well be an H-E-L-L-O comma space W-O-R-L-D backslash 0. 00:03:13.380 --> 00:03:15.660 And there might be free memory, so to speak. 00:03:15.660 --> 00:03:19.073 Memory you could use that's filled with garbage values and garbage isn't bad. 00:03:19.073 --> 00:03:20.490 It just means that you don't know. 00:03:20.490 --> 00:03:23.710 You don't really care what values are or were there. 00:03:23.710 --> 00:03:25.740 So there is free space, so to speak. 00:03:25.740 --> 00:03:28.560 Each of these Oscars represents effectively free space 00:03:28.560 --> 00:03:30.030 with some garbage value there. 00:03:30.030 --> 00:03:32.940 Remnants, maybe, of some execution past. 00:03:32.940 --> 00:03:36.330 But the problem here is that you don't have room 00:03:36.330 --> 00:03:40.380 to plop this four right where you might want to put it. 00:03:40.380 --> 00:03:41.670 So what's the solution? 00:03:41.670 --> 00:03:44.190 If we have this array of size 3 containing three integers-- 00:03:44.190 --> 00:03:50.070 1, 2 3-- but it's been painted into a corner whereby H-E-L-L and so forth are 00:03:50.070 --> 00:03:54.180 already immediately next to it, we can't just put the four there without 00:03:54.180 --> 00:03:57.640 sacrificing the H. And that doesn't really feel like a solution. 00:03:57.640 --> 00:04:01.350 So any thoughts on how we can solve this problem? 00:04:01.350 --> 00:04:02.970 Are we completely out of luck? 00:04:02.970 --> 00:04:08.310 Can you just not add a number to an array in a situation like this? 00:04:08.310 --> 00:04:10.780 Or is there a solution perhaps that comes to mind? 00:04:10.780 --> 00:04:14.760 Even if you've never programmed before, if on the screen there-- 00:04:14.760 --> 00:04:15.780 the lay of the land. 00:04:15.780 --> 00:04:17.579 Santiago, what do you think? 00:04:17.579 --> 00:04:21.060 AUDIENCE: I would say that maybe you could 00:04:21.060 --> 00:04:26.050 copy the elements in the original array and create a new array, 00:04:26.050 --> 00:04:29.640 but that's one size bigger or one element bigger. 00:04:29.640 --> 00:04:31.230 And then, add that new element. 00:04:31.230 --> 00:04:31.980 DAVID MALAN: Yeah. 00:04:31.980 --> 00:04:33.120 That's really good intuition. 00:04:33.120 --> 00:04:34.950 After all, there's all these Oscars on the screen 00:04:34.950 --> 00:04:37.050 right now, which again represent garbage values. 00:04:37.050 --> 00:04:38.490 Or in turn, free space. 00:04:38.490 --> 00:04:40.680 So I could put 1, 2, 3, 4 up here. 00:04:40.680 --> 00:04:42.780 I could put 1, 2, 3, 4 over here. 00:04:42.780 --> 00:04:44.640 I could put 1, 2, 3, 4 down here. 00:04:44.640 --> 00:04:48.090 So we some flexibility, but Santiago is exactly right. 00:04:48.090 --> 00:04:51.450 Intuitively, all we really need to do is let's focus only 00:04:51.450 --> 00:04:54.120 on four of the available spots. 00:04:54.120 --> 00:04:56.910 A new array of size four, if you will, which initially 00:04:56.910 --> 00:04:58.290 has these four garbage values. 00:04:58.290 --> 00:04:59.280 But that's OK. 00:04:59.280 --> 00:05:01.950 Because as Santiago also notes, it suffices now 00:05:01.950 --> 00:05:03.420 to just copy the old array-- 00:05:03.420 --> 00:05:06.840 1, 2, 3-- into the new array. 00:05:06.840 --> 00:05:10.860 And heck, maybe we can now even free the memory from the original array, 00:05:10.860 --> 00:05:12.870 much like we could if we used malloc. 00:05:12.870 --> 00:05:15.870 And that leaves us then, of course, with just an array of size 00:05:15.870 --> 00:05:17.430 4 with that fourth garbage value. 00:05:17.430 --> 00:05:21.923 But now, we do have room for, like the number 4 itself. 00:05:21.923 --> 00:05:24.840 So it would seem that there is a solution to this problem that doesn't 00:05:24.840 --> 00:05:26.490 violate the definition of an array. 00:05:26.490 --> 00:05:29.220 Again, the only definition of an array really 00:05:29.220 --> 00:05:31.320 is that the memory must be contiguous. 00:05:31.320 --> 00:05:34.620 You can't just plop the 4 anywhere in the computer's memory. 00:05:34.620 --> 00:05:37.710 It has to become right after your existing memory 00:05:37.710 --> 00:05:39.660 if this whole thing, this whole structure, 00:05:39.660 --> 00:05:42.630 is indeed going to still be an array. 00:05:42.630 --> 00:05:47.530 But I worry that might have cost us a bit of time. 00:05:47.530 --> 00:05:50.200 And in fact, let me go ahead and open up on the screen here 00:05:50.200 --> 00:05:54.670 in just a moment a question that you're all welcome to buzz in for. 00:05:54.670 --> 00:06:00.220 What would be the running time of inserting into an array? 00:06:00.220 --> 00:06:03.220 Let me go ahead and reveal the poll question here. 00:06:03.220 --> 00:06:05.050 Feel free to go to the usual URL. 00:06:05.050 --> 00:06:08.080 Which Brian, if you wouldn't mind copying and pasting as usual. 00:06:08.080 --> 00:06:13.450 What do you think the running time is of inserting into an array? 00:06:13.450 --> 00:06:15.400 Inserting into an array. 00:06:15.400 --> 00:06:19.303 Recall in the past, we talked about arrays in terms of searching. 00:06:19.303 --> 00:06:21.220 But today, this is the first time we're really 00:06:21.220 --> 00:06:22.660 talking about inserting into them. 00:06:22.660 --> 00:06:27.280 And if we take a look at the results coming in, it looks like 80% of you 00:06:27.280 --> 00:06:30.580 feel that it's linear time, big O of n, whereby 00:06:30.580 --> 00:06:34.180 it might take as many as n steps to actually insert into an array. 00:06:34.180 --> 00:06:36.130 5% of you propose n log n. 00:06:36.130 --> 00:06:37.690 7% of you, n squared. 00:06:37.690 --> 00:06:40.827 And then, 2% and 5% respectively for the other values as well. 00:06:40.827 --> 00:06:42.910 So this is kind of an interesting mix because when 00:06:42.910 --> 00:06:46.150 we talked about arrays in the past and we talked about searching, 00:06:46.150 --> 00:06:49.330 recall that we typically achieve big O of log n. 00:06:49.330 --> 00:06:50.380 That's really good. 00:06:50.380 --> 00:06:53.560 Unfortunately if we have to do with Santiago proposed 00:06:53.560 --> 00:06:58.540 and actually copy all of the elements from the old array into the new array, 00:06:58.540 --> 00:07:03.490 it's going to take us as many as n steps because we have 00:07:03.490 --> 00:07:05.800 to copy each of the original elements-- 00:07:05.800 --> 00:07:10.670 1, 2, 3-- over into the new array, which is going to be of size n plus 1. 00:07:10.670 --> 00:07:13.690 So it's on the order of n steps in total. 00:07:13.690 --> 00:07:16.870 So the running time then of inserting into an array, in terms 00:07:16.870 --> 00:07:19.780 of an upper bound at least, is going to be indeed big O of n 00:07:19.780 --> 00:07:23.560 because you've got a copy potentially all of those elements over. 00:07:23.560 --> 00:07:27.880 But perhaps, we might feel differently if we consider a lower 00:07:27.880 --> 00:07:30.400 bound on the running time of insert. 00:07:30.400 --> 00:07:35.290 What might the lower bound of insert be when it comes to an array? 00:07:35.290 --> 00:07:38.140 Again, omega notation is what we can use here. 00:07:38.140 --> 00:07:40.510 How many steps maybe in the best case might it 00:07:40.510 --> 00:07:42.520 take me to insert a value into an array? 00:07:42.520 --> 00:07:44.435 We won't do a poll for this one. 00:07:44.435 --> 00:07:46.810 Brian, why don't we go ahead and call on a hand for this. 00:07:46.810 --> 00:07:49.240 What's a lower bound on the running time of insert? 00:07:49.240 --> 00:07:50.240 Ryan, what do you think? 00:07:50.240 --> 00:07:52.073 AUDIENCE: Well, the best case scenario would 00:07:52.073 --> 00:07:54.270 be if there's only one element in the array. 00:07:54.270 --> 00:07:56.860 So you would just have a one or it would just 00:07:56.860 --> 00:07:58.633 be one element right into the array. 00:07:58.633 --> 00:08:00.550 DAVID MALAN: Yeah, so if you've got an array-- 00:08:00.550 --> 00:08:02.980 and let me emphasize that's already empty-- 00:08:02.980 --> 00:08:06.400 whereby you have room for the new element, then indeed. 00:08:06.400 --> 00:08:09.460 Omega of 1, constant time, is all you need so long as you 00:08:09.460 --> 00:08:11.350 have space available in the array. 00:08:11.350 --> 00:08:13.270 And it doesn't matter how large the array is. 00:08:13.270 --> 00:08:16.910 Maybe it's a size 4, but you've only put one number in it. 00:08:16.910 --> 00:08:20.110 That's OK because you can immediately put the new number in place. 00:08:20.110 --> 00:08:22.130 Recall that array support random access. 00:08:22.130 --> 00:08:24.130 You can use the square bracket notation and just 00:08:24.130 --> 00:08:28.280 jump to any location in so-called constant time in just one step. 00:08:28.280 --> 00:08:32.409 So if your array is of sufficient size and it's not full, then yes. 00:08:32.409 --> 00:08:34.690 The lower bound on the insertion into an array 00:08:34.690 --> 00:08:37.630 is going to be constant time, omega of 1. 00:08:37.630 --> 00:08:39.940 But as we saw in Santiago's situation whereby 00:08:39.940 --> 00:08:42.250 you have an array that's already filled with elements 00:08:42.250 --> 00:08:44.810 and you want to add another, well in that case, 00:08:44.810 --> 00:08:47.740 then upper bound is indeed going to be big O of n 00:08:47.740 --> 00:08:51.370 because you have to do the additional labor of copying the values over 00:08:51.370 --> 00:08:52.315 from one to the other. 00:08:52.315 --> 00:08:54.940 Now those of you have programmed before, maybe you've use Java. 00:08:54.940 --> 00:08:56.900 You might be familiar with the phrase vector. 00:08:56.900 --> 00:09:01.000 A vector is like an array that can be resized, can grow and shrink. 00:09:01.000 --> 00:09:05.530 That's not what arrays in C. Arrays in C are just contiguous blocks of memory 00:09:05.530 --> 00:09:07.100 with values back to back to back. 00:09:07.100 --> 00:09:10.330 But once you decide on their size, that is it. 00:09:10.330 --> 00:09:12.700 You're going to have to resize it, essentially yourself. 00:09:12.700 --> 00:09:16.280 They're not going to automatically grow for you. 00:09:16.280 --> 00:09:19.960 So an array was the first and really the simplest of the data 00:09:19.960 --> 00:09:21.130 structures we'll see. 00:09:21.130 --> 00:09:23.500 But it's also not nearly as powerful as what 00:09:23.500 --> 00:09:25.870 we can do now that we have access to a computers memory. 00:09:25.870 --> 00:09:28.328 Today, we're going to start to leverage these things called 00:09:28.328 --> 00:09:31.997 pointers, the addresses via which we can refer to locations in memory. 00:09:31.997 --> 00:09:35.080 And we're going to start to stitch together some fancier data structures-- 00:09:35.080 --> 00:09:37.030 first, one dimensional in some sense. 00:09:37.030 --> 00:09:41.530 Then, two dimensional in some sense-- by using some very basic building blocks. 00:09:41.530 --> 00:09:44.980 Recall these three pieces of syntax from the past weeks. 00:09:44.980 --> 00:09:48.250 Struct, recall, is this mechanism, this keyword in C, 00:09:48.250 --> 00:09:51.190 whereby we can define our own structures in memory. 00:09:51.190 --> 00:09:55.780 We saw one for a person whereby a person might have both a name and a number 00:09:55.780 --> 00:09:57.610 in something like a phone book. 00:09:57.610 --> 00:10:00.850 And you've seen the dot operator The dot operator was 00:10:00.850 --> 00:10:05.020 how we go inside of such a structure and get at dot name or dot number. 00:10:05.020 --> 00:10:07.570 The specific variables inside of the struct. 00:10:07.570 --> 00:10:11.020 And then last week, recall we saw the star operator 00:10:11.020 --> 00:10:14.410 whereby the dereference operator, which colloquially means 00:10:14.410 --> 00:10:17.000 go to this particular address. 00:10:17.000 --> 00:10:19.940 So just by using these three ingredients are 00:10:19.940 --> 00:10:22.690 we going to be able to now build up our own custom data structures 00:10:22.690 --> 00:10:26.062 that are even fancier than arrays and can ultimately help us solve problems 00:10:26.062 --> 00:10:26.770 more efficiently. 00:10:26.770 --> 00:10:30.610 And in fact, this is such a common technique in programming in C 00:10:30.610 --> 00:10:33.790 that these two operators, the dot and the star, 00:10:33.790 --> 00:10:38.230 are often combined into one that intentionally looks like an arrow. 00:10:38.230 --> 00:10:41.500 So we'll see its usage in just a little bit. 00:10:41.500 --> 00:10:45.190 So syntactically today, it's pretty much building blocks 00:10:45.190 --> 00:10:48.170 past but we're going to use these building blocks now in new ways 00:10:48.170 --> 00:10:50.780 to start solving problems differently. 00:10:50.780 --> 00:10:54.620 And we'll first do this by way of something called linked lists. 00:10:54.620 --> 00:11:00.110 So for instance, a linked list is going to be a data structure that solves 00:11:00.110 --> 00:11:01.940 some of the problems with an array. 00:11:01.940 --> 00:11:03.740 And what's a problem arguably? 00:11:03.740 --> 00:11:07.117 Well if it takes big O of n steps to insert into an array, 00:11:07.117 --> 00:11:08.450 frankly that's kind of annoying. 00:11:08.450 --> 00:11:09.710 That's kind of expensive. 00:11:09.710 --> 00:11:13.273 Because over time if you're writing a really big program with lots of data-- 00:11:13.273 --> 00:11:16.190 you're the Googles of the world, the Twitters of the world-- it's fine 00:11:16.190 --> 00:11:18.710 if you want to store all of your data in a big array 00:11:18.710 --> 00:11:20.640 for the sake of efficient searching. 00:11:20.640 --> 00:11:23.870 Which recall, was big O of log n if we use something like binary search 00:11:23.870 --> 00:11:25.280 and keep everything sorted. 00:11:25.280 --> 00:11:27.740 But it's going to be pretty painful if every time you 00:11:27.740 --> 00:11:32.432 want to add another tweet to the array, or some other web page to the array. 00:11:32.432 --> 00:11:34.640 Depending on what the problem is that you're solving, 00:11:34.640 --> 00:11:36.432 you might potentially-- as Santiago notes-- 00:11:36.432 --> 00:11:40.400 have to copy all of the contents of your original smaller array 00:11:40.400 --> 00:11:45.660 into a new bigger array just to add more tweets or more web pages or the like. 00:11:45.660 --> 00:11:47.990 So a linked list is going to be a data structure that's 00:11:47.990 --> 00:11:52.700 more dynamic, whereby you can grow and shrink the data 00:11:52.700 --> 00:11:56.210 structure without having to touch all of the original data 00:11:56.210 --> 00:11:58.820 and move it from old location to new. 00:11:58.820 --> 00:12:01.073 So what might this look like? 00:12:01.073 --> 00:12:02.990 Well let's consider again our computers memory 00:12:02.990 --> 00:12:05.615 and let's propose that I want to store those same values again. 00:12:05.615 --> 00:12:08.180 So like the number one-- and just for the sake of discussion, 00:12:08.180 --> 00:12:12.620 suppose that it's in my computer's memory at address 0x123. 00:12:12.620 --> 00:12:14.660 0x just means it's a hexadecimal number. 00:12:14.660 --> 00:12:16.430 1-2-3- is completely arbitrary. 00:12:16.430 --> 00:12:18.450 I made it up just for the sake of discussion. 00:12:18.450 --> 00:12:21.020 So let me stipulate that that's where the number 1 happens 00:12:21.020 --> 00:12:24.290 to be in the computer's memory in this new solution 00:12:24.290 --> 00:12:26.395 to the problem of storing lots of data. 00:12:26.395 --> 00:12:27.770 Suppose I want to store number 2. 00:12:27.770 --> 00:12:30.230 Maybe it's at address 0x456. 00:12:30.230 --> 00:12:32.340 And then, suppose I want to store number 3. 00:12:32.340 --> 00:12:36.260 Suppose it's at address 0x789. 00:12:36.260 --> 00:12:40.010 So notice deliberately, these numbers are spread out 00:12:40.010 --> 00:12:41.180 in the computer's memory. 00:12:41.180 --> 00:12:44.330 Because after all, the problem we ran into a moment ago with arrays 00:12:44.330 --> 00:12:48.080 is that you might have hello, world or other values from other parts 00:12:48.080 --> 00:12:50.340 of your program in the way. 00:12:50.340 --> 00:12:52.310 So if I'm proposing instead that if you want 00:12:52.310 --> 00:12:55.850 to store one, and then two, and then three, that's fine. 00:12:55.850 --> 00:12:58.640 Plop them anywhere you want and you don't 00:12:58.640 --> 00:13:02.520 have to worry about where there is already existing values. 00:13:02.520 --> 00:13:06.022 Instead, you can just put these values where there is room. 00:13:06.022 --> 00:13:09.230 The problem, though, is that if you just start plopping values like one, two, 00:13:09.230 --> 00:13:11.630 three anywhere you want in the computer's memory, 00:13:11.630 --> 00:13:15.050 you can no longer very easily find those values, right? 00:13:15.050 --> 00:13:17.630 You might know where one is, but it no longer 00:13:17.630 --> 00:13:21.950 suffices to just look one location to the right to find the next value 00:13:21.950 --> 00:13:25.790 or add two find the next value after that. 00:13:25.790 --> 00:13:28.100 In an array, everything is contiguous. 00:13:28.100 --> 00:13:30.450 But if we instead start to treat the computer's memory 00:13:30.450 --> 00:13:33.980 as just a canvas where we can just draw numbers anywhere 00:13:33.980 --> 00:13:37.370 we want, that's fine so long as we can somehow 00:13:37.370 --> 00:13:39.830 help ourselves get from the first, to the second, 00:13:39.830 --> 00:13:43.310 to the third irrespective of all the other stuff that's 00:13:43.310 --> 00:13:45.330 cluttering up the computer's memory. 00:13:45.330 --> 00:13:48.980 So in fact, let me propose that we do this by maybe stealing 00:13:48.980 --> 00:13:51.590 a bit more space from the computer. 00:13:51.590 --> 00:13:55.710 So rather than use just enough memory to store one, two, and three, 00:13:55.710 --> 00:13:59.060 let me store twice as much information. 00:13:59.060 --> 00:14:02.630 In addition to every number that I actually care about-- one, two, three, 00:14:02.630 --> 00:14:03.560 my data-- 00:14:03.560 --> 00:14:05.418 let me store a little metadata, so to speak. 00:14:05.418 --> 00:14:07.460 The values that I don't fundamentally care about, 00:14:07.460 --> 00:14:10.970 but that are going to help me keep track of my actual data. 00:14:10.970 --> 00:14:13.650 And let me propose that in this box here, 00:14:13.650 --> 00:14:16.355 I literally store the value 0x456. 00:14:16.355 --> 00:14:18.980 Again, it's written in hexadecimal, but that's just the number. 00:14:18.980 --> 00:14:21.350 That's the address of somewhere else in memory. 00:14:21.350 --> 00:14:24.530 In this box, let me propose that I store 0x789. 00:14:24.530 --> 00:14:31.070 And in this box, let me arbitrarily say 0x0, which is just all zero bits. 00:14:31.070 --> 00:14:32.810 Now why have I done this? 00:14:32.810 --> 00:14:35.720 Even if you've never seen this structure that's 00:14:35.720 --> 00:14:38.210 evolving to be what's called a linked list, 00:14:38.210 --> 00:14:40.670 why have I just done what I've done? 00:14:40.670 --> 00:14:43.280 In addition to storing one, two, and three respectively, 00:14:43.280 --> 00:14:49.700 I'm also now storing 0x456 and an additional chunk of memory and 0x789 00:14:49.700 --> 00:14:51.410 and an additional chunk of memory. 00:14:51.410 --> 00:14:52.760 But why? 00:14:52.760 --> 00:14:53.720 Sofia. 00:14:53.720 --> 00:14:57.380 AUDIENCE: So that we know how the first element relates to the second 00:14:57.380 --> 00:14:58.692 or how they're linked together. 00:14:58.692 --> 00:14:59.900 DAVID MALAN: That's exactly-- 00:14:59.900 --> 00:15:01.130 AUDIENCE: --between the first and second. 00:15:01.130 --> 00:15:01.880 DAVID MALAN: Yeah. 00:15:01.880 --> 00:15:05.540 So now, I'm using essentially twice as much space to store each piece of data. 00:15:05.540 --> 00:15:07.550 One, two, and three respectively. 00:15:07.550 --> 00:15:10.280 But in the second chunk of space, I'm now 00:15:10.280 --> 00:15:16.410 storing a pointer to the next element in the thing I'll now think of as a list. 00:15:16.410 --> 00:15:19.970 So this is 0x456 because the number two, which 00:15:19.970 --> 00:15:24.650 is the next number in the list I care about, lives at 0x456. 00:15:24.650 --> 00:15:28.640 This number is 0x789 because the next number after that I care about 00:15:28.640 --> 00:15:31.310 is at address 0x789. 00:15:31.310 --> 00:15:33.650 So it's just a helpful way now of leaving 00:15:33.650 --> 00:15:37.100 myself breadcrumbs so that I can plop the one, the two, the three anywhere 00:15:37.100 --> 00:15:40.550 I want in the computer's memory wherever there's available space 00:15:40.550 --> 00:15:44.720 and still figure out how to get from one to the other to the other. 00:15:44.720 --> 00:15:47.400 And we've actually seen some of this syntax before. 00:15:47.400 --> 00:15:53.680 It turns out that 0x0, which is all zero bits, that's just the technical-- 00:15:53.680 --> 00:15:56.080 that's the numeric equivalent of what we've called null. 00:15:56.080 --> 00:16:00.850 N-U-L-L, which we introduced last week, is a special symbol indicating that 00:16:00.850 --> 00:16:03.910 something has gone wrong with memory or you're out of space. 00:16:03.910 --> 00:16:06.070 It's sort of the absence of an address. 00:16:06.070 --> 00:16:09.850 C guarantees that if you use 0x0, AKA null, 00:16:09.850 --> 00:16:13.240 that just indicates the absence of any useful address there. 00:16:13.240 --> 00:16:13.990 But you know what? 00:16:13.990 --> 00:16:16.630 Again like last week, this is of getting way into the weeds. 00:16:16.630 --> 00:16:19.690 I don't really care about 0x123456789. 00:16:19.690 --> 00:16:23.680 So let's just kind of abstract this away and start thinking about this really 00:16:23.680 --> 00:16:26.950 as a list of numbers that are somehow linked together. 00:16:26.950 --> 00:16:31.210 Underneath the hood, the links are implemented by way of addresses, 00:16:31.210 --> 00:16:36.280 or pointers, those low level numbers like 0x123456789. 00:16:36.280 --> 00:16:39.040 But pictorially, it suffices for us to just start 00:16:39.040 --> 00:16:44.170 thinking of this data structure, hereafter known as a linked list, 00:16:44.170 --> 00:16:47.590 as being a collection of nodes, so to speak-- 00:16:47.590 --> 00:16:50.830 N-O-D-E-- that are connected via pointers. 00:16:50.830 --> 00:16:52.710 So a node is just a generic computer science 00:16:52.710 --> 00:16:54.460 term that refers to some kind of structure 00:16:54.460 --> 00:16:56.230 that store stuff you care about. 00:16:56.230 --> 00:17:00.260 What I care about here is a number and a pointer to another such structure. 00:17:00.260 --> 00:17:05.680 So this is a linked list and each of these rectangles represents a node. 00:17:05.680 --> 00:17:08.740 And in code, we'll implement those nodes ultimately 00:17:08.740 --> 00:17:12.069 by way of that thing in C called a struct. 00:17:12.069 --> 00:17:14.589 But let me pause here to see first if there 00:17:14.589 --> 00:17:18.730 are any questions about the structure we have built up. 00:17:18.730 --> 00:17:22.180 Any questions about this thing called a linked list 00:17:22.180 --> 00:17:24.530 before we see it in some code? 00:17:24.530 --> 00:17:25.030 Brian? 00:17:25.030 --> 00:17:27.010 BRIAN: Yeah, a question came in in the chat asking, 00:17:27.010 --> 00:17:30.177 isn't this kind of a waste of memory that we're now using all of these cells 00:17:30.177 --> 00:17:31.390 to store addresses too? 00:17:31.390 --> 00:17:33.310 DAVID MALAN: Yeah, really good observation. 00:17:33.310 --> 00:17:35.393 Isn't this kind of a waste of memory in that we're 00:17:35.393 --> 00:17:37.150 storing all of these addresses in addition 00:17:37.150 --> 00:17:39.710 to the numbers 1, 2, 3 that we care about? 00:17:39.710 --> 00:17:42.375 Yes, and in fact, that is exactly the price we are paying, 00:17:42.375 --> 00:17:43.750 and this is going to be thematic. 00:17:43.750 --> 00:17:46.270 This week, last week and really every week thereafter, 00:17:46.270 --> 00:17:50.170 any time we solve a problem in CS and programming in particular, 00:17:50.170 --> 00:17:52.268 there's always going to be some price paid. 00:17:52.268 --> 00:17:54.310 There's going to be some cost and really, there's 00:17:54.310 --> 00:17:55.610 going to be some trade-off. 00:17:55.610 --> 00:18:00.400 So if a moment ago, it was unacceptable that inserting into an array 00:18:00.400 --> 00:18:04.570 is in big O of n because man, that's going to take so many steps to copy all 00:18:04.570 --> 00:18:07.420 of the values from the old array into the new array, 00:18:07.420 --> 00:18:10.300 if that is unacceptable to you for performance reasons 00:18:10.300 --> 00:18:13.870 or whatever the problem is that you're solving, well, that's fine. 00:18:13.870 --> 00:18:17.410 You can solve that problem and now have much more dynamism 00:18:17.410 --> 00:18:19.900 whereby you can plop your numbers anywhere in memory 00:18:19.900 --> 00:18:24.190 without having to move the existing numbers anywhere else, thereby saving 00:18:24.190 --> 00:18:27.948 yourself time, but the price you're going to pay indeed is more space. 00:18:27.948 --> 00:18:29.740 So at that point, it kind of depends what's 00:18:29.740 --> 00:18:31.540 more important to you, the computer's time, 00:18:31.540 --> 00:18:35.980 your human time, or maybe the space or the cost of the space, the more memory 00:18:35.980 --> 00:18:38.577 that you might really need to literally buy for that computer. 00:18:38.577 --> 00:18:39.910 So this is going to be thematic. 00:18:39.910 --> 00:18:44.590 This trade-off between space and time is omnipresent, really, in programming. 00:18:44.590 --> 00:18:47.780 Well, let's consider how we might translate this now into actual code. 00:18:47.780 --> 00:18:51.010 Recall that when we last saw structs in C, 00:18:51.010 --> 00:18:55.390 we did something like this to define a person as having two things associated 00:18:55.390 --> 00:18:56.980 with them, a name and a number. 00:18:56.980 --> 00:18:59.650 So today, we don't care about persons and names and numbers. 00:18:59.650 --> 00:19:01.780 We care about these things we're going to start calling nodes, 00:19:01.780 --> 00:19:03.447 so let me go ahead and rewind from that. 00:19:03.447 --> 00:19:08.440 Erase that and let's instead say that every node in this structure, 00:19:08.440 --> 00:19:10.840 renaming person as well to node, is going 00:19:10.840 --> 00:19:15.430 to have a number, which will be an int in our case here, 00:19:15.430 --> 00:19:18.670 and I've left room for one other value because we ultimately 00:19:18.670 --> 00:19:21.255 need to be able to store that second piece of data. 00:19:21.255 --> 00:19:23.380 That second piece of data is going to be a pointer. 00:19:23.380 --> 00:19:26.860 It's going to be an address of something, but how might 00:19:26.860 --> 00:19:29.200 I express this? 00:19:29.200 --> 00:19:31.270 This is going to be a little less obvious 00:19:31.270 --> 00:19:34.900 but we laid the foundation last week with pointers. 00:19:34.900 --> 00:19:39.220 How could I describe this structure as having 00:19:39.220 --> 00:19:42.750 a pointer to another such structure? 00:19:42.750 --> 00:19:45.880 Any thoughts on verbally the syntax to use, 00:19:45.880 --> 00:19:51.010 or even if you're not sure of exactly the incantation, exactly what symbols 00:19:51.010 --> 00:19:56.350 we should use to express and address to another such node? 00:19:56.350 --> 00:19:59.597 BRIAN: Someone is suggesting we use a node * as a pointer to a node. 00:19:59.597 --> 00:20:00.430 DAVID MALAN: Node *? 00:20:00.430 --> 00:20:00.930 All right. 00:20:00.930 --> 00:20:02.740 So * I definitely remember from last week, 00:20:02.740 --> 00:20:06.310 indicating that if you have int *, this is the address of an int. 00:20:06.310 --> 00:20:08.710 If you have char *, this is the address of a char. 00:20:08.710 --> 00:20:12.130 So if all of these arrows really just represent addresses of nodes, 00:20:12.130 --> 00:20:14.200 it stands to reason that syntax is probably 00:20:14.200 --> 00:20:16.660 going to be something akin to node *. 00:20:16.660 --> 00:20:18.940 Now I can call this pointer anything I want. 00:20:18.940 --> 00:20:20.322 By convention, I'll call it next. 00:20:20.322 --> 00:20:22.030 Because literally, its purpose in life is 00:20:22.030 --> 00:20:24.940 to point to the next node in the data structure. 00:20:24.940 --> 00:20:29.140 And that's going to be in addition to the int called number that I'll propose 00:20:29.140 --> 00:20:32.120 describes the top of that individual data structure. 00:20:32.120 --> 00:20:34.930 But there's a subtle problem here in C. Recall 00:20:34.930 --> 00:20:38.680 that in C, it's kind of a simplistic language, complicated 00:20:38.680 --> 00:20:42.550 though it might often seem, in that it doesn't understand anything 00:20:42.550 --> 00:20:44.480 that it hasn't seen before. 00:20:44.480 --> 00:20:47.620 So at the moment, notice that the very first time 00:20:47.620 --> 00:20:50.650 I have mentioned node up until now was on the very 00:20:50.650 --> 00:20:53.200 last line of this snippet of code. 00:20:53.200 --> 00:20:56.530 The problem is that by nature of how typedef works, 00:20:56.530 --> 00:20:59.260 the thing called a node does not actually 00:20:59.260 --> 00:21:02.980 exist until the compiler is done reading that last line of code 00:21:02.980 --> 00:21:06.070 and the semicolon, which is to say that it would actually 00:21:06.070 --> 00:21:10.240 be incorrect to use or refer to quote, unquote 00:21:10.240 --> 00:21:12.310 "node" inside of this structure. 00:21:12.310 --> 00:21:14.860 Because literally, by definition of how typedef works, 00:21:14.860 --> 00:21:19.270 a node does not exist until, again, that last line of code and the semicolon 00:21:19.270 --> 00:21:19.905 are executed. 00:21:19.905 --> 00:21:21.280 Thankfully, there's a workaround. 00:21:21.280 --> 00:21:22.660 It's a little weird to look at. 00:21:22.660 --> 00:21:24.445 But what you can do in C is this. 00:21:24.445 --> 00:21:29.208 You can actually add an additional word after, literally, the keyword struct. 00:21:29.208 --> 00:21:30.250 And we'll keep it simple. 00:21:30.250 --> 00:21:33.208 We'll use the exact same word, though technically that's not necessary. 00:21:33.208 --> 00:21:37.160 And now, I'm going to change the inside of the structure to say this instead. 00:21:37.160 --> 00:21:38.740 So it feels a little verbose. 00:21:38.740 --> 00:21:40.510 And it feels a little bit of copy paste. 00:21:40.510 --> 00:21:44.980 But this is the way it is done in C. Typedef struct node is essentially 00:21:44.980 --> 00:21:47.200 a hint, similar in spirit to the prototypes 00:21:47.200 --> 00:21:50.680 we've talked about for functions that gives the compiler a clue that, 00:21:50.680 --> 00:21:54.310 OK, something called a struct node is going to exist. 00:21:54.310 --> 00:21:56.830 You can then use it inside of that data structure 00:21:56.830 --> 00:21:59.763 and refer to it as struct node *. 00:21:59.763 --> 00:22:01.180 It's more of a mouthful this week. 00:22:01.180 --> 00:22:02.847 We've not seen multiple words like this. 00:22:02.847 --> 00:22:06.130 But it's similar to char * or int *, like last week. 00:22:06.130 --> 00:22:08.140 And I'm going to call that arbitrarily next. 00:22:08.140 --> 00:22:12.550 And down here, the same thing happens as in the past with persons. 00:22:12.550 --> 00:22:15.130 By calling this node at the very last line of the code, that 00:22:15.130 --> 00:22:16.600 just tells the compiler, you know what? 00:22:16.600 --> 00:22:19.280 You don't have to refer to it as struct node all over the place. 00:22:19.280 --> 00:22:21.310 You can just call this thing node. 00:22:21.310 --> 00:22:25.390 So it's a little verbose in this case. 00:22:25.390 --> 00:22:29.470 But all this is done is create for me, in the computer, 00:22:29.470 --> 00:22:33.490 the definition of a node as we have depicted it pictorially 00:22:33.490 --> 00:22:36.430 with that rectangle. 00:22:36.430 --> 00:22:40.490 All right, so how can we now translate this into more useful code, not 00:22:40.490 --> 00:22:42.490 just defining what these structures are, but how 00:22:42.490 --> 00:22:44.470 do we begin building up linked lists? 00:22:44.470 --> 00:22:48.460 Well, let me propose that a linked list really begins with just a pointer. 00:22:48.460 --> 00:22:51.880 And in fact, here we have, thanks to the theatre's prop shop, 00:22:51.880 --> 00:22:54.370 just kind of a null pointer, if you will. 00:22:54.370 --> 00:22:56.710 I'm going to call this variable "list." 00:22:56.710 --> 00:22:58.460 And list is currently pointing to nothing. 00:22:58.460 --> 00:23:01.585 The arrow, we'll say, is just pointing at the floor, which means it's null. 00:23:01.585 --> 00:23:03.610 It's not actually pointing at anything useful. 00:23:03.610 --> 00:23:06.160 Suppose I now want to start to begin to allocate a linked 00:23:06.160 --> 00:23:09.157 list with three numbers, 1, 2, and 3. 00:23:09.157 --> 00:23:10.490 Well, how am I going to do this? 00:23:10.490 --> 00:23:13.032 Well, at the moment, the only thing that exists in my program 00:23:13.032 --> 00:23:14.800 is going to be this variable called list. 00:23:14.800 --> 00:23:17.260 There's no array in this story. 00:23:17.260 --> 00:23:19.077 That was in Week 2. 00:23:19.077 --> 00:23:20.410 Today is all about linked lists. 00:23:20.410 --> 00:23:22.570 So how do I get myself a wooden block that 00:23:22.570 --> 00:23:25.240 represents 1, another wooden block that represents 2, 00:23:25.240 --> 00:23:27.012 and a third that represents 3? 00:23:27.012 --> 00:23:29.470 Well, we need to use our new friend from last week, malloc. 00:23:29.470 --> 00:23:32.710 Recall that malloc allows you to allocate memory, 00:23:32.710 --> 00:23:35.110 as much memory as you might want, so long as you tell it 00:23:35.110 --> 00:23:36.400 the size of that thing. 00:23:36.400 --> 00:23:38.740 So frankly, I think what we could do ultimately today 00:23:38.740 --> 00:23:44.020 is use malloc to allocate dynamically one struct and put the number 1 in it, 00:23:44.020 --> 00:23:48.010 another struct put the number 2 on it, another struct put the number 3 in it. 00:23:48.010 --> 00:23:52.160 And then, use these playful arrows here to actually stitch them together, 00:23:52.160 --> 00:23:53.930 having one point to the other. 00:23:53.930 --> 00:23:55.810 So thankfully, the prop shop has wonderfully 00:23:55.810 --> 00:23:57.860 created a whole bunch of these for us. 00:23:57.860 --> 00:24:04.460 Let me go ahead and malloc a very heavy node that has room for two values. 00:24:04.460 --> 00:24:07.795 And you'll see it has room for a number and a next pointer. 00:24:07.795 --> 00:24:09.670 So the number I'm going to first install here 00:24:09.670 --> 00:24:12.760 is going to be the number 1, for instance. 00:24:12.760 --> 00:24:15.640 And I'm going to leave its pointer as just pointing at the ground, 00:24:15.640 --> 00:24:17.800 indicating that this is a null pointer. 00:24:17.800 --> 00:24:19.660 It's not actually pointing at anything else. 00:24:19.660 --> 00:24:22.510 But now that I'm starting to instantiate, 00:24:22.510 --> 00:24:25.690 that is create this list, now I'm going to do something like this 00:24:25.690 --> 00:24:29.860 and say that, all right, my variable called "list," whose purpose in life 00:24:29.860 --> 00:24:32.530 is to keep track of where this list is in memory, 00:24:32.530 --> 00:24:34.960 I'm going to connect one to the other by actually having 00:24:34.960 --> 00:24:38.080 this variable point at this node. 00:24:38.080 --> 00:24:40.540 When it comes time, then, to allocate another node, 00:24:40.540 --> 00:24:42.353 I want to insert into this linked list. 00:24:42.353 --> 00:24:45.520 Back in the world of arrays, I might have to allocate a new chunk of memory, 00:24:45.520 --> 00:24:48.160 copy this value over into the new values. 00:24:48.160 --> 00:24:49.300 I don't have to do that. 00:24:49.300 --> 00:24:52.570 In the world of linked lists, I just call malloc for a second time 00:24:52.570 --> 00:24:55.960 and say, give me another chunk of memory big enough to fit a node. 00:24:55.960 --> 00:24:59.170 Thankfully, from the prop shop, we have another one of these. 00:24:59.170 --> 00:25:03.628 Let me go ahead and malloc another node here. 00:25:03.628 --> 00:25:05.170 At the moment, there's nothing in it. 00:25:05.170 --> 00:25:06.400 It's just the placeholder. 00:25:06.400 --> 00:25:08.200 So it's garbage values, if you will. 00:25:08.200 --> 00:25:10.060 I don't know what's there until I actually 00:25:10.060 --> 00:25:13.090 say that the number shall be number 2. 00:25:13.090 --> 00:25:17.710 And then I go over to my linked list whose variable name is "list," 00:25:17.710 --> 00:25:19.060 and I want to insert this thing. 00:25:19.060 --> 00:25:20.710 So I follow the arrow. 00:25:20.710 --> 00:25:26.180 I then point the next field of this node at this node here. 00:25:26.180 --> 00:25:28.600 So now I have a linked list of size 2. 00:25:28.600 --> 00:25:31.990 There's three things in the picture, but this is just a simple variable. 00:25:31.990 --> 00:25:35.710 This is a pointer that's pointing at the actual node which, in turn, is 00:25:35.710 --> 00:25:37.370 pointing at an actual other node. 00:25:37.370 --> 00:25:40.990 Now suppose I want to insert the number 3 into this linked list. 00:25:40.990 --> 00:25:43.420 Recall that malloc is powerful and that it 00:25:43.420 --> 00:25:45.550 takes memory from wherever it's available, 00:25:45.550 --> 00:25:48.080 the so-called "heap" from your computer. 00:25:48.080 --> 00:25:51.340 And that means, by definition, that it might not be contiguous. 00:25:51.340 --> 00:25:54.730 The next number might not actually be here metaphorically 00:25:54.730 --> 00:25:56.050 in the computer's memory. 00:25:56.050 --> 00:25:57.953 It might be way over there. 00:25:57.953 --> 00:25:59.120 So that might indeed happen. 00:25:59.120 --> 00:26:03.250 The third time I call malloc now and allocate a third node, 00:26:03.250 --> 00:26:07.000 it might not be available anywhere in the computer's memory except for, 00:26:07.000 --> 00:26:09.040 like, way over here. 00:26:09.040 --> 00:26:10.280 And that's fine. 00:26:10.280 --> 00:26:15.500 It doesn't have to be contiguous as it did in the world of our arrays. 00:26:15.500 --> 00:26:18.140 I can now put the number 3 in its place. 00:26:18.140 --> 00:26:22.683 But if I want to keep a pointer to that node, so that all of these things 00:26:22.683 --> 00:26:24.850 are stitched together, I can start at the beginning. 00:26:24.850 --> 00:26:26.200 I follow the arrows. 00:26:26.200 --> 00:26:27.350 I follow the arrows. 00:26:27.350 --> 00:26:29.590 And now, if I want to remember where that one is, 00:26:29.590 --> 00:26:32.120 I'm going to have to connect these things. 00:26:32.120 --> 00:26:38.120 And so now, that pointer needs to point at this block here. 00:26:38.120 --> 00:26:39.850 And this visual is very much deliberate. 00:26:39.850 --> 00:26:42.820 These nodes can be anywhere in the computer's memory. 00:26:42.820 --> 00:26:45.370 They're not necessarily going to be contiguous. 00:26:45.370 --> 00:26:49.450 The downside of that is that you cannot rely on binary search, 00:26:49.450 --> 00:26:51.460 our friend from Week 0 onward. 00:26:51.460 --> 00:26:55.210 Binary search was amazing in that it's Big O of log n. 00:26:55.210 --> 00:26:57.883 You can find stuff way faster than Big O of n. 00:26:57.883 --> 00:26:59.800 That was the whole point of even the phonebook 00:26:59.800 --> 00:27:01.720 example in the very first week. 00:27:01.720 --> 00:27:05.080 But the upside of this approach is that you 00:27:05.080 --> 00:27:09.310 don't have to actually keep allocating and copying more memory 00:27:09.310 --> 00:27:11.893 and all of your values any time you want to resize this thing. 00:27:11.893 --> 00:27:15.102 And I'm a little embarrassed to admit that I'm out of breath for some reason, 00:27:15.102 --> 00:27:16.360 just mallocing nodes here. 00:27:16.360 --> 00:27:18.820 But the point is using malloc and building out 00:27:18.820 --> 00:27:20.800 the structure does come at some price. 00:27:20.800 --> 00:27:22.755 It's exhausting, frankly. 00:27:22.755 --> 00:27:24.880 But it's also going to spread things out in memory. 00:27:24.880 --> 00:27:26.020 But you have this dynamism. 00:27:26.020 --> 00:27:28.270 And honestly, if you are the Twitters of the world, the Googles 00:27:28.270 --> 00:27:30.395 of the world, where you have lots and lots of data, 00:27:30.395 --> 00:27:34.360 it's going to kill you performance wise to have to copy all of your data 00:27:34.360 --> 00:27:36.810 from one location into another as Santiago originally 00:27:36.810 --> 00:27:39.450 proposed as a solution to the array. 00:27:39.450 --> 00:27:43.380 So using these dynamic data structures, like a linked list, where 00:27:43.380 --> 00:27:45.690 you allocate more space wherever it's available, 00:27:45.690 --> 00:27:48.750 and you somehow remember where that is by stitching [? team ?] things 00:27:48.750 --> 00:27:50.940 together, as with these physical pointers, 00:27:50.940 --> 00:27:53.160 is really the state of the art in how you 00:27:53.160 --> 00:27:56.730 can create these more dynamic structures if it's more important to you 00:27:56.730 --> 00:27:59.100 to have that dynamism. 00:27:59.100 --> 00:28:04.010 All right, any questions before we now translate these physical blocks 00:28:04.010 --> 00:28:06.897 to some samples of code? 00:28:06.897 --> 00:28:08.480 BRIAN: A question came in in the chat. 00:28:08.480 --> 00:28:11.660 When in this whole process of linked lists are we actually using malloc? 00:28:11.660 --> 00:28:13.220 And what is malloc being used for? 00:28:13.220 --> 00:28:14.280 DAVID MALAN: Really good question. 00:28:14.280 --> 00:28:15.488 So where are we using malloc? 00:28:15.488 --> 00:28:17.600 Every time I went off stage and grabbed one 00:28:17.600 --> 00:28:23.270 of these big blocks, that was my acting out the process of mallocing a node. 00:28:23.270 --> 00:28:25.520 So when you call malloc, that returns to you, 00:28:25.520 --> 00:28:30.240 per last week, the address of the first byte of a chunk of memory. 00:28:30.240 --> 00:28:33.200 And if you call malloc of one, that gives you the address of one byte. 00:28:33.200 --> 00:28:35.930 If you call malloc of 100, that gives you the first address 00:28:35.930 --> 00:28:38.580 of 100 bytes that are contiguous. 00:28:38.580 --> 00:28:43.110 So each of these nodes, then, represents the return value, if you will, 00:28:43.110 --> 00:28:45.200 of a single call to malloc. 00:28:45.200 --> 00:28:47.600 And in fact, perhaps, Brian, the best way 00:28:47.600 --> 00:28:50.720 to answer that question in more detail is to translate this now 00:28:50.720 --> 00:28:51.627 to some actual code. 00:28:51.627 --> 00:28:53.210 So let's do that now and then revisit. 00:28:53.210 --> 00:28:56.000 So here is, for instance, a line of code that 00:28:56.000 --> 00:28:58.310 represents the beginning of our story, where we only 00:28:58.310 --> 00:29:02.960 had a variable called "list" that was initialized to nothing initially. 00:29:02.960 --> 00:29:05.023 This arrow was not pointing at anything. 00:29:05.023 --> 00:29:07.190 And in fact, if it was just pointing up, down, left, 00:29:07.190 --> 00:29:09.650 or right, it would be considered a garbage value. 00:29:09.650 --> 00:29:12.440 This is just memory, which means there are values inside of it 00:29:12.440 --> 00:29:15.240 before I actually put actual values in it. 00:29:15.240 --> 00:29:18.273 So if I don't assign it a value, who knows what it's pointing to? 00:29:18.273 --> 00:29:19.940 But let me go ahead and change the code. 00:29:19.940 --> 00:29:24.290 This list variable, by default, has some garbage value unless I initialize it 00:29:24.290 --> 00:29:25.490 to something like null. 00:29:25.490 --> 00:29:28.940 So I'll represent that here figuratively by just pointing at the ground. 00:29:28.940 --> 00:29:30.260 That now represents null. 00:29:30.260 --> 00:29:32.510 This would be the corresponding line of code that just 00:29:32.510 --> 00:29:35.420 creates for you an empty linked list. 00:29:35.420 --> 00:29:36.975 That was the beginning of the story. 00:29:36.975 --> 00:29:39.350 Now recall that the next thing I did was I went off stage 00:29:39.350 --> 00:29:42.620 and I called malloc to bring back one of these big boxes. 00:29:42.620 --> 00:29:45.980 That code might instead look like this. 00:29:45.980 --> 00:29:47.360 Node *n. 00:29:47.360 --> 00:29:52.040 So I could call the variable anything I want, malloc, size of node. 00:29:52.040 --> 00:29:55.040 So size of we might have seen briefly in that it's just 00:29:55.040 --> 00:29:59.550 an operator in C that tells you how many bytes large any data type is. 00:29:59.550 --> 00:30:03.710 So I could do the math and figure out in my Mac or PC or CS50 IDE just how 00:30:03.710 --> 00:30:05.690 many bytes these nodes are supposed to take up. 00:30:05.690 --> 00:30:07.970 Size of just answers that question for me. 00:30:07.970 --> 00:30:10.370 So malloc, again, takes one argument, the number 00:30:10.370 --> 00:30:13.010 of bytes you want to allocate dynamically, 00:30:13.010 --> 00:30:15.930 and it returns to the address of the first of those bytes. 00:30:15.930 --> 00:30:18.822 So if you think of this as one of my earlier slides in yellow, 00:30:18.822 --> 00:30:20.780 it returns to you, like, the address of the top 00:30:20.780 --> 00:30:23.670 left byte of this chunk of memory, if you will. 00:30:23.670 --> 00:30:26.160 And I'm going to go ahead and assign that to a variable 00:30:26.160 --> 00:30:28.430 I'll call n to represent a node. 00:30:28.430 --> 00:30:33.410 And its node * because, again, malloc returns always an address. 00:30:33.410 --> 00:30:36.050 And the syntax we saw last week for storing addresses 00:30:36.050 --> 00:30:39.300 is to use this new * syntax here. 00:30:39.300 --> 00:30:44.310 So this gave me a block that initially just had garbage values. 00:30:44.310 --> 00:30:48.450 So there was no number in place and who knows where the arrow was pointing at? 00:30:48.450 --> 00:30:52.010 So it looked a little something like this, if I draw it now on the screen. 00:30:52.010 --> 00:30:53.523 List is initialized to null. 00:30:53.523 --> 00:30:55.190 It's not pointing at anything right now. 00:30:55.190 --> 00:30:59.330 But n, the variable I just declared, is pointing at a node. 00:30:59.330 --> 00:31:01.220 But inside of that node, or who knows what? 00:31:01.220 --> 00:31:02.120 It's garbage values. 00:31:02.120 --> 00:31:04.160 Number and next are just garbage values. 00:31:04.160 --> 00:31:07.760 Because that's what's there by default, remnants of the past. 00:31:07.760 --> 00:31:10.310 But now, let me propose that we do this code. 00:31:10.310 --> 00:31:12.098 So long as n is not null-- 00:31:12.098 --> 00:31:15.140 which is something you want to get into the habit always now of checking, 00:31:15.140 --> 00:31:19.580 any time in C, when you call a function that returns to you a pointer, 00:31:19.580 --> 00:31:23.030 you should almost always check is it null or is it not null. 00:31:23.030 --> 00:31:26.150 Because you do not want to try touching it if it is indeed null. 00:31:26.150 --> 00:31:28.490 Because that means there is no valid address here. 00:31:28.490 --> 00:31:30.950 That's the human convention when using pointers. 00:31:30.950 --> 00:31:34.578 But if n does not equal null, that's a good thing. 00:31:34.578 --> 00:31:37.370 That means it's a valid address somewhere in the computer's memory. 00:31:37.370 --> 00:31:43.040 Let me go ahead now and go to that address, the syntax for which is *n, 00:31:43.040 --> 00:31:44.480 just like last week. 00:31:44.480 --> 00:31:48.530 And then the operator means go inside of the structure that's there 00:31:48.530 --> 00:31:52.230 and go into the variable inside of it, called number in this case. 00:31:52.230 --> 00:31:54.680 So when I go ahead and do something like this, 00:31:54.680 --> 00:31:58.400 when I have a list that's initially of size 1, 00:31:58.400 --> 00:32:04.070 where this variable is pointing at the one and only node at the moment, 00:32:04.070 --> 00:32:07.970 it's going to have the number 1 in it as soon as I execute this line of code. 00:32:07.970 --> 00:32:12.710 *n means start at the address embodied here, go to it, 00:32:12.710 --> 00:32:16.560 and then put the number, 1 in this case, in place. 00:32:16.560 --> 00:32:18.740 But I need to do one other thing as well. 00:32:18.740 --> 00:32:20.990 I want to go ahead, too, and replace the garbage 00:32:20.990 --> 00:32:23.570 value that represents the next pointer in that structure 00:32:23.570 --> 00:32:24.680 and replace it with null. 00:32:24.680 --> 00:32:27.020 Null [INAUDIBLE] find that this is the end of the list. 00:32:27.020 --> 00:32:27.770 There's nothing there. 00:32:27.770 --> 00:32:29.150 I don't want it to be garbage value. 00:32:29.150 --> 00:32:31.067 Because a garbage value is an arbitrary number 00:32:31.067 --> 00:32:33.380 that could be pointing this way, this way, this way. 00:32:33.380 --> 00:32:36.560 Metaphorically, I want to actually change that to be null 00:32:36.560 --> 00:32:38.348 and so I can use the same syntax. 00:32:38.348 --> 00:32:39.890 But there's this clever approach now. 00:32:39.890 --> 00:32:43.850 I don't have to use star and then dot over the place, as mentioned earlier. 00:32:43.850 --> 00:32:47.390 Star and dot comes with some syntactic sugar. 00:32:47.390 --> 00:32:50.290 You can replace the star and the parentheses and the dot 00:32:50.290 --> 00:32:54.050 that we just saw with an arrow notation, which means follow the arrow. 00:32:54.050 --> 00:32:56.390 And then set this thing equal to null, which 00:32:56.390 --> 00:32:59.240 I'll represent again by just having the arrow literally points 00:32:59.240 --> 00:33:02.040 at the floor for clarity. 00:33:02.040 --> 00:33:07.490 So, again, rewinding from where we started, *n.number, 00:33:07.490 --> 00:33:11.270 with the first of those in parentheses, is the same thing as this. 00:33:11.270 --> 00:33:15.270 And the reason that most people prefer using the arrow notation, so to speak, 00:33:15.270 --> 00:33:18.020 is that it really does capture this physicality. 00:33:18.020 --> 00:33:20.870 You start at the address, you go there, and then 00:33:20.870 --> 00:33:24.050 you look at the field, number or next, respectively. 00:33:24.050 --> 00:33:27.875 But it's equivalent to the syntax we saw a moment ago with the start and the dot 00:33:27.875 --> 00:33:29.090 as before. 00:33:29.090 --> 00:33:30.980 So after these two steps, have we initialized 00:33:30.980 --> 00:33:36.760 this node to containing the number 1 and null inside of it. 00:33:36.760 --> 00:33:38.540 But what comes next? 00:33:38.540 --> 00:33:39.380 What comes next? 00:33:39.380 --> 00:33:41.470 Well, at point in the story, in this code, 00:33:41.470 --> 00:33:43.690 I have some other variable here that's not pictured. 00:33:43.690 --> 00:33:47.090 Because we're now transitioning from the world of woodwork to actual code. 00:33:47.090 --> 00:33:51.130 So there's another variable n, which I might as well be representing myself. 00:33:51.130 --> 00:33:54.280 If I am this temporary variable n, I, too, am pointing at this value. 00:33:54.280 --> 00:33:59.830 It's not until I actually execute this line of code, list = n;, 00:33:59.830 --> 00:34:04.357 that I remember where this node is in the computer's memory. 00:34:04.357 --> 00:34:06.940 So n, up until now, has really just been a temporary variable. 00:34:06.940 --> 00:34:09.190 It's a variable that I can use to actually keep 00:34:09.190 --> 00:34:10.969 track of this thing in memory. 00:34:10.969 --> 00:34:14.530 But if I want to add this node ultimately to my linked list, 00:34:14.530 --> 00:34:18.460 list started out as null, recall that this pointer was pointing 00:34:18.460 --> 00:34:20.170 at the ground, representing null. 00:34:20.170 --> 00:34:24.250 But when I now want to remember that this linked list has a node in it, 00:34:24.250 --> 00:34:30.489 I need to actually execute a line of code like this, list = null. 00:34:30.489 --> 00:34:31.900 All right, what did we next do? 00:34:31.900 --> 00:34:33.530 Let's take one more step further. 00:34:33.530 --> 00:34:36.159 So at this point in the story, if I was representing n, 00:34:36.159 --> 00:34:38.650 I'm also pointing at the same block. n is temporary, 00:34:38.650 --> 00:34:39.969 so it can eventually go away. 00:34:39.969 --> 00:34:43.018 But at this point in the story, we have a linked list of size 1. 00:34:43.018 --> 00:34:44.560 Let's go ahead and take this further. 00:34:44.560 --> 00:34:46.389 Suppose now I execute these lines of code. 00:34:46.389 --> 00:34:48.489 And we'll do it a little faster and all at once so 00:34:48.489 --> 00:34:50.230 that you can see it more in context. 00:34:50.230 --> 00:34:52.389 The first line of code is the same as before. 00:34:52.389 --> 00:34:54.722 Hey, malloc, give me a chunk of memory that's 00:34:54.722 --> 00:34:56.139 big enough for the size of a node. 00:34:56.139 --> 00:34:59.465 And, again, let's use this temporary variable n to point to that. 00:34:59.465 --> 00:35:02.590 And suppose that means that I, if I'm representing this temporary variable, 00:35:02.590 --> 00:35:04.630 I'm pointing at this new chunk of memory here. 00:35:04.630 --> 00:35:08.260 I then check, if n does not equal null, then 00:35:08.260 --> 00:35:10.870 and only then do I go ahead and install the number 00:35:10.870 --> 00:35:16.450 2 as I did physically earlier and do I initialize the pointer originally 00:35:16.450 --> 00:35:21.520 to pointing not at some other node which doesn't yet exist, but pointing at, 00:35:21.520 --> 00:35:24.640 let's just call it the floor, thereby representing null. 00:35:24.640 --> 00:35:26.020 And that's it. 00:35:26.020 --> 00:35:28.180 That has now allocated the second node. 00:35:28.180 --> 00:35:30.760 But notice, literally, this disconnect. 00:35:30.760 --> 00:35:33.010 Just because I've allocated a new node and put 00:35:33.010 --> 00:35:36.220 the number I care about and initialized its next pointer to null, 00:35:36.220 --> 00:35:38.480 that doesn't mean it's part of the data structure. 00:35:38.480 --> 00:35:43.030 The linked list is still missing a pointer from old to new. 00:35:43.030 --> 00:35:45.200 So we need to execute one other line of code 00:35:45.200 --> 00:35:49.150 now so that we can get from this picture ultimately to the final one. 00:35:49.150 --> 00:35:52.420 And here's where we can use the same kind of syntax as before. 00:35:52.420 --> 00:35:57.550 If I start at my list variable, I follow the arrow as per that code. 00:35:57.550 --> 00:36:00.280 And then I update the next field to point 00:36:00.280 --> 00:36:04.940 to n, which is my newly allocated node. 00:36:04.940 --> 00:36:09.700 Now, after that final line of code, do I have a linked list of size 2. 00:36:09.700 --> 00:36:12.880 Because I've not only allocated the node, initialized its two 00:36:12.880 --> 00:36:17.050 variables, a number and next respectively, 00:36:17.050 --> 00:36:23.727 I've also chained it together with the existing node on the linked list. 00:36:23.727 --> 00:36:25.810 And let me do this one even slightly more quickly, 00:36:25.810 --> 00:36:29.320 only because it's the same thing, just so we can see it all together at once. 00:36:29.320 --> 00:36:32.660 Now that we have this picture, let's execute the same kind of code again. 00:36:32.660 --> 00:36:34.390 The only difference in this chunk of code 00:36:34.390 --> 00:36:36.430 is that I'm initializing number to 3. 00:36:36.430 --> 00:36:37.660 So what has this done for me? 00:36:37.660 --> 00:36:40.540 That chunk of code has malloced a third and final node. 00:36:40.540 --> 00:36:42.430 I've initialized the top to the number 3. 00:36:42.430 --> 00:36:44.710 I've initialized the bottom to null, as I'll 00:36:44.710 --> 00:36:46.990 represent by just pointing the arrow at the ground. 00:36:46.990 --> 00:36:49.060 There's one final step, then. 00:36:49.060 --> 00:36:52.270 If I want to go ahead and insert that third node into this linked list, 00:36:52.270 --> 00:36:55.180 I've got to go ahead now and not just point it at myself, 00:36:55.180 --> 00:36:59.350 me representing, again, the temporary variable n, I need to now do this. 00:36:59.350 --> 00:37:01.270 And this is syntax you won't do often. 00:37:01.270 --> 00:37:03.830 We'll see it in an actual program in just a moment. 00:37:03.830 --> 00:37:07.090 But it just speaks to the basic building blocks we're manipulating. 00:37:07.090 --> 00:37:12.250 If I want to go ahead and link the 2 to the 3, I can start here at list. 00:37:12.250 --> 00:37:14.140 I can follow the arrow once. 00:37:14.140 --> 00:37:15.850 I can follow the arrow again. 00:37:15.850 --> 00:37:20.140 And then I can set that next field equal to n. 00:37:20.140 --> 00:37:29.110 Because, again, n is the current address of that most recently allocated node. 00:37:29.110 --> 00:37:34.770 So even though the syntax is getting a little new, the two arrows I'm using 00:37:34.770 --> 00:37:37.740 are literally just like a code manifestation of just 00:37:37.740 --> 00:37:43.950 follow the pointer, follow the pointer, boom, assign one value to the next. 00:37:43.950 --> 00:37:47.290 All right, so at this point in the story, the picture now looks like this. 00:37:47.290 --> 00:37:49.020 So long as I, n, am still in the picture. 00:37:49.020 --> 00:37:50.853 But if I just get rid of myself, because I'm 00:37:50.853 --> 00:37:54.480 a temporary variable, voila, we've just built up, step by step, 00:37:54.480 --> 00:38:00.480 a brand new linked list of size 3. 00:38:00.480 --> 00:38:06.973 Seems like a lot of work, but it allows us to now grow this thing dynamically. 00:38:06.973 --> 00:38:07.890 But let me pause here. 00:38:07.890 --> 00:38:12.460 Any questions or confusion on linked lists? 00:38:12.460 --> 00:38:15.210 Any questions or confusions on linked lists? 00:38:15.210 --> 00:38:17.267 BRIAN: One question just came in the chat. 00:38:17.267 --> 00:38:20.100 If we're trying to make a list that's going to be much longer, like, 00:38:20.100 --> 00:38:22.475 more than three elements, wouldn't it get tedius to have, 00:38:22.475 --> 00:38:24.630 like, arrow next, arrow next over and over again? 00:38:24.630 --> 00:38:25.050 DAVID MALAN: Yeah. 00:38:25.050 --> 00:38:25.990 Really good observation. 00:38:25.990 --> 00:38:28.470 And so that's why I said, you won't usually write it like this. 00:38:28.470 --> 00:38:31.220 We'll do it temporarily in a full-fledged program in just a moment 00:38:31.220 --> 00:38:32.220 just to demonstrate it. 00:38:32.220 --> 00:38:34.920 But in general, you'll probably use something like a loop. 00:38:34.920 --> 00:38:38.190 And what you'll do is use a temporary variable that points at this one, 00:38:38.190 --> 00:38:40.530 then iterates again, then iterates again. 00:38:40.530 --> 00:38:43.930 And let me stipulate for now that if you use a loop in the right way, 00:38:43.930 --> 00:38:46.230 you can end up writing just a single arrow by just 00:38:46.230 --> 00:38:48.447 keep updating the variable again and again. 00:38:48.447 --> 00:38:49.780 So there is a way to avoid that. 00:38:49.780 --> 00:38:53.430 And you do it much more dynamically. 00:38:53.430 --> 00:38:56.410 Let me go ahead then and ask you a question of my own here. 00:38:56.410 --> 00:38:58.170 Let me go ahead and ask this one. 00:38:58.170 --> 00:39:00.780 If you'd like to buzz in to this question, 00:39:00.780 --> 00:39:05.170 what is the running time of searching a linked list? 00:39:05.170 --> 00:39:08.800 What's the running time of searching a linked list in Big O notation? 00:39:08.800 --> 00:39:12.690 So what's an upper bound, worst case of searching 00:39:12.690 --> 00:39:17.460 a linked list like this one, whether it has three elements or even many more? 00:39:17.460 --> 00:39:20.190 So it looks like, as before, about 80% of the group 00:39:20.190 --> 00:39:21.750 is proposing that it's Big O of n. 00:39:21.750 --> 00:39:23.027 And this is actually correct. 00:39:23.027 --> 00:39:24.360 Because consider the worst case. 00:39:24.360 --> 00:39:26.318 Suppose you're looking for the number 3, you're 00:39:26.318 --> 00:39:29.400 going to have to look at all three numbers, Big O of n. 00:39:29.400 --> 00:39:31.260 If you're looking for the number 10, you're 00:39:31.260 --> 00:39:32.670 going to start here at the beginning, you're 00:39:32.670 --> 00:39:34.110 going to keep looking, looking, looking, looking. 00:39:34.110 --> 00:39:37.235 You're going to get to the end realize, oh, the number 10 is not even here. 00:39:37.235 --> 00:39:39.840 At which point, you've already looked at n elements. 00:39:39.840 --> 00:39:42.690 But here's one of the trade, again, of linked lists. 00:39:42.690 --> 00:39:46.413 With arrays, you could jump to the end of the list in constant time. 00:39:46.413 --> 00:39:48.330 You could just use a little bit of arithmetic. 00:39:48.330 --> 00:39:50.830 You could jump to the middle elements, or the first element, 00:39:50.830 --> 00:39:51.840 all using constant time. 00:39:51.840 --> 00:39:55.140 A linked list, unfortunately, is represented ultimately 00:39:55.140 --> 00:39:59.730 by just a single address, the address that points to the very first node. 00:39:59.730 --> 00:40:02.850 And so even though you all on camera can see this node and this node 00:40:02.850 --> 00:40:05.460 and this one as humans all at once, the computer 00:40:05.460 --> 00:40:07.350 can only follow these breadcrumbs. 00:40:07.350 --> 00:40:12.300 And so searching a linked list is going to be Big O of n in that case. 00:40:12.300 --> 00:40:14.460 But let me ask a follow up question. 00:40:14.460 --> 00:40:19.930 What's the running time of inserting into a linked list? 00:40:19.930 --> 00:40:23.220 What's the running time of inserting into a linked list? 00:40:23.220 --> 00:40:29.880 So you've got some new number like the number 4 or 0 or 100 or negative 5, 00:40:29.880 --> 00:40:31.350 whatever it may be. 00:40:31.350 --> 00:40:34.260 There's going to be a malloc involved, but that's constant time. 00:40:34.260 --> 00:40:36.010 It's just one function call. 00:40:36.010 --> 00:40:38.010 But you're going to have to insert it somewhere. 00:40:38.010 --> 00:40:42.720 And here it looks like 68% of you are proposing Big O of 1, 00:40:42.720 --> 00:40:44.820 which is interesting, constant time. 00:40:44.820 --> 00:40:47.880 25% of you are proposing Big O of n. 00:40:47.880 --> 00:40:51.360 Would anyone be comfortable chiming in verbally or on the chat 00:40:51.360 --> 00:40:54.970 as to why you feel it's one or the other? 00:40:54.970 --> 00:40:57.570 It is indeed one of those answers. 00:40:57.570 --> 00:41:00.450 AUDIENCE: It could be O of n. 00:41:00.450 --> 00:41:03.330 Because of the fact that even though you're 00:41:03.330 --> 00:41:06.570 using malloc to create a new node, essentially, 00:41:06.570 --> 00:41:11.820 I think all the computer's doing when you assign it is [? going-- ?] 00:41:11.820 --> 00:41:14.820 as you cite those arrows, like we're going from one arrow to the next 00:41:14.820 --> 00:41:16.750 to the next to the next. 00:41:16.750 --> 00:41:19.260 And I would think it would be O of n. 00:41:19.260 --> 00:41:21.300 DAVID MALAN: It is O of n as you described it. 00:41:21.300 --> 00:41:27.930 But you, too, are making an assumption, like 25% of other people are making. 00:41:27.930 --> 00:41:31.770 You seem to be assuming that if a new number, suppose it's number 4, 00:41:31.770 --> 00:41:33.240 has to go at the end. 00:41:33.240 --> 00:41:35.440 Or if it's the number 5, it has to go at the end. 00:41:35.440 --> 00:41:37.523 And I kind of deliberately set things up that way. 00:41:37.523 --> 00:41:41.310 I've happened to maintain them in sorted order, 1, 2, 3, from left to right. 00:41:41.310 --> 00:41:43.440 But up until now, I have not made the condition 00:41:43.440 --> 00:41:45.700 that the linked list has to be sorted, even 00:41:45.700 --> 00:41:48.450 though the examples we've seen thus far are deliberately that way. 00:41:48.450 --> 00:41:51.587 But you know what, if you want to get fancy and a little more efficient, 00:41:51.587 --> 00:41:54.420 and you want to allocate the number 4, and frankly, you don't really 00:41:54.420 --> 00:41:57.570 care about keeping the linked list in sorted order, 00:41:57.570 --> 00:42:00.930 well, heck, just pull this out, put your new node here. 00:42:00.930 --> 00:42:01.950 Plug it in here. 00:42:01.950 --> 00:42:03.420 Plug the other one back in here. 00:42:03.420 --> 00:42:06.870 And just insert the new element at the beginning of the list. 00:42:06.870 --> 00:42:10.110 And for every number thereafter, malloc as before. 00:42:10.110 --> 00:42:13.292 But just keep inserting it here, inserting it here, inserting it here. 00:42:13.292 --> 00:42:15.000 Now it's not going to take a single step. 00:42:15.000 --> 00:42:17.190 Because as I verbalized it, there's, like, the malloc step. 00:42:17.190 --> 00:42:18.180 I have to unplug this. 00:42:18.180 --> 00:42:19.290 I have to plug it back in. 00:42:19.290 --> 00:42:21.720 So it's like three or four steps total. 00:42:21.720 --> 00:42:23.640 But four steps is also constant. 00:42:23.640 --> 00:42:26.530 That's big O of 1, because it's a fixed number of steps. 00:42:26.530 --> 00:42:31.600 So if you're able to sacrifice sorted order when it comes to this list, 00:42:31.600 --> 00:42:34.620 you can in constant time insert, insert, insert, insert. 00:42:34.620 --> 00:42:39.690 And the list is going to get longer and longer, but from the beginning of it 00:42:39.690 --> 00:42:40.680 rather than the end. 00:42:40.680 --> 00:42:41.928 So that's always a trade off. 00:42:41.928 --> 00:42:44.970 If you don't care about sorted order, and none of your algorithms or code 00:42:44.970 --> 00:42:49.260 require that it be sorted, then you can go ahead and cut that corner 00:42:49.260 --> 00:42:51.600 and achieve constant time insert, which if you're 00:42:51.600 --> 00:42:54.420 a Twitter or Google or the like, maybe that's actually 00:42:54.420 --> 00:42:56.800 a net savings and a good thing. 00:42:56.800 --> 00:42:59.350 But, again, you sacrifice the sorted order in that case. 00:42:59.350 --> 00:43:00.600 Well, let's go ahead, I think. 00:43:00.600 --> 00:43:03.210 And let's translate this to some actual code. 00:43:03.210 --> 00:43:06.118 Let me go ahead here in CS50 IDE. 00:43:06.118 --> 00:43:08.160 And let's go ahead and write a couple of variants 00:43:08.160 --> 00:43:11.700 of a program that now actually do something with numbers 00:43:11.700 --> 00:43:14.770 and start to manipulate things in memory. 00:43:14.770 --> 00:43:17.910 So I'm going to go ahead here and create a program called list.c. 00:43:17.910 --> 00:43:20.160 And my first version is going to be very simplistic. 00:43:20.160 --> 00:43:22.446 I'm going to go ahead and include stdio. 00:43:22.446 --> 00:43:24.360 Give myself an int main(void). 00:43:24.360 --> 00:43:27.250 And then inside of here, let me go ahead, like we began, 00:43:27.250 --> 00:43:30.450 and give myself a list of integers of size 3. 00:43:30.450 --> 00:43:32.857 So this is going to be an array that's of size 3. 00:43:32.857 --> 00:43:35.190 And this array of size 3, I'm going to go ahead and hard 00:43:35.190 --> 00:43:36.700 code some new values into it. 00:43:36.700 --> 00:43:38.910 So at the very first location, I'll put the number 1. 00:43:38.910 --> 00:43:41.580 At the second location, I will put the number 2. 00:43:41.580 --> 00:43:44.910 At the third location, I will put the number 3. 00:43:44.910 --> 00:43:48.600 And then, just to demonstrate that this is working as I think I intend, 00:43:48.600 --> 00:43:49.990 I'm going to do a quick for loop. 00:43:49.990 --> 00:43:54.300 So for int i get 0, i less than 3, i++. 00:43:54.300 --> 00:43:58.200 And then, inside this loop, I'm going to go ahead and print out %i. 00:43:58.200 --> 00:44:01.530 And then I'm going to print out the value of i. 00:44:01.530 --> 00:44:04.770 So now that I've printed out all of these values in my loop, let 00:44:04.770 --> 00:44:08.230 me go ahead and do make list. 00:44:08.230 --> 00:44:12.060 Let me go ahead then and do ./list and hit Enter. 00:44:12.060 --> 00:44:15.750 And indeed, I get-- oops, not what I wanted. 00:44:15.750 --> 00:44:18.900 So good teachable moment, not intended, admittedly. 00:44:18.900 --> 00:44:21.305 But what have I done wrong here? 00:44:21.305 --> 00:44:22.680 My goal is we print out the list. 00:44:22.680 --> 00:44:24.390 But somehow, I printed out 0, 1, 2. 00:44:24.390 --> 00:44:27.570 And those are indeed not the numbers in this list. 00:44:27.570 --> 00:44:28.600 [? Greg? ?] 00:44:28.600 --> 00:44:30.330 AUDIENCE: So you printed i. 00:44:30.330 --> 00:44:32.172 You should have printed list of i. 00:44:32.172 --> 00:44:32.880 DAVID MALAN: Yes. 00:44:32.880 --> 00:44:35.755 So I should have printed the contents of the array, which is list[i]. 00:44:35.755 --> 00:44:38.430 So that was just a newbie mistake by me here. 00:44:38.430 --> 00:44:39.400 So let me fix that. 00:44:39.400 --> 00:44:42.927 Let me go ahead and recompile make list, ./list., and voila. 00:44:42.927 --> 00:44:44.010 I've now printed the list. 00:44:44.010 --> 00:44:48.600 So this is sort of Week 2 stuff, when we first introduced arrays in Week 2. 00:44:48.600 --> 00:44:51.660 But now let me go ahead and transition now to something more dynamic. 00:44:51.660 --> 00:44:56.190 Where I don't have to commit in advance to creating an array, 00:44:56.190 --> 00:44:59.080 I can do this with a dynamically allocated chunk of memory. 00:44:59.080 --> 00:45:01.350 So let me delete everything I've done inside of main. 00:45:01.350 --> 00:45:03.058 And let me go ahead and give myself this. 00:45:03.058 --> 00:45:06.180 Let me go ahead and declare a list of values 00:45:06.180 --> 00:45:09.600 where list is now going to be an address as per the star operator. 00:45:09.600 --> 00:45:11.530 And I'm going to go ahead and malloc-- 00:45:11.530 --> 00:45:12.030 let's see. 00:45:12.030 --> 00:45:14.270 I want space for three integers. 00:45:14.270 --> 00:45:17.370 So the simplest way to do this, if I'm just going to keep it simple, 00:45:17.370 --> 00:45:21.630 I can actually do this, 3 times size of int. 00:45:21.630 --> 00:45:26.040 So this version of my program isn't going to use an array per se. 00:45:26.040 --> 00:45:27.420 It's going to use malloc. 00:45:27.420 --> 00:45:30.660 But it's going to dynamically allocate that array for me. 00:45:30.660 --> 00:45:32.640 And we'll see what the syntax for this is. 00:45:32.640 --> 00:45:35.100 As always now, any time you use malloc, I 00:45:35.100 --> 00:45:39.098 should check whether list equals equals null. 00:45:39.098 --> 00:45:40.140 And if so, you know what? 00:45:40.140 --> 00:45:41.490 I'm just going to return 1. 00:45:41.490 --> 00:45:44.250 Recall that you can return 0 or 1 or some other value 00:45:44.250 --> 00:45:46.230 from main to effectively quit your program. 00:45:46.230 --> 00:45:49.170 I'm going to go ahead and just return 1 if list is null, 00:45:49.170 --> 00:45:52.260 just assuming that something very badly went wrong, like I'm out of memory 00:45:52.260 --> 00:45:53.430 altogether. 00:45:53.430 --> 00:45:59.190 But now that I have this chunk of memory that's of size 3 times 00:45:59.190 --> 00:46:05.030 the size of an int, this is actually the malloc way to give yourself an array. 00:46:05.030 --> 00:46:07.530 Up until now, every time we've created arrays for ourselves, 00:46:07.530 --> 00:46:09.030 we've used square bracket notation. 00:46:09.030 --> 00:46:11.700 And you all have put a number inside the square brackets 00:46:11.700 --> 00:46:13.770 to give yourself an array of that size. 00:46:13.770 --> 00:46:15.990 But frankly, if we have malloc and the ability 00:46:15.990 --> 00:46:18.480 to just ask the computer for memory, well, 00:46:18.480 --> 00:46:21.450 if I want to store three integers, why don't I ask malloc 00:46:21.450 --> 00:46:23.940 for three times the size of an integer? 00:46:23.940 --> 00:46:25.710 And the way malloc works is it's actually 00:46:25.710 --> 00:46:29.580 going to return to me a contiguous chunk of memory of that size, 00:46:29.580 --> 00:46:31.410 so that many bytes back to back to back. 00:46:31.410 --> 00:46:36.480 And that's a technique that we'll use in just a moment 00:46:36.480 --> 00:46:38.130 when allocating actual nodes. 00:46:38.130 --> 00:46:41.730 So at this point in the story, so long as list does not equal null, 00:46:41.730 --> 00:46:46.470 I now have a chunk of memory that's big enough to fit the size of three ints. 00:46:46.470 --> 00:46:48.960 And as before, I can go ahead and initialize those. 00:46:48.960 --> 00:46:50.370 The first element will be 1. 00:46:50.370 --> 00:46:51.960 The second element will be 2. 00:46:51.960 --> 00:46:54.090 The third element will be 3. 00:46:54.090 --> 00:46:58.290 And notice the sort of equivalence now between using arrays 00:46:58.290 --> 00:46:59.730 and using pointers. 00:46:59.730 --> 00:47:02.520 C is kind of versatile in this way and that if you 00:47:02.520 --> 00:47:07.020 have a chunk of memory returned to you by malloc, you can, per last week, 00:47:07.020 --> 00:47:09.030 use square bracket notation. 00:47:09.030 --> 00:47:12.532 You can use square bracket notation and treat that chunk of memory as an array. 00:47:12.532 --> 00:47:13.990 Because after all, what's an array? 00:47:13.990 --> 00:47:18.360 It's a contiguous block of memory and that is exactly what malloc returns. 00:47:18.360 --> 00:47:22.980 If you want to be fancy instead, you could actually say go to that address 00:47:22.980 --> 00:47:24.360 and put the number 1 there. 00:47:24.360 --> 00:47:28.170 You could say go to that address plus 1 and put the next number there. 00:47:28.170 --> 00:47:33.730 You could say go to that address plus 2 and put the third number there. 00:47:33.730 --> 00:47:36.750 But honestly, this just very quickly becomes unreadable, at least 00:47:36.750 --> 00:47:37.380 to most people. 00:47:37.380 --> 00:47:39.510 This is that thing called pointer arithmetic, 00:47:39.510 --> 00:47:41.580 you're doing arithmetic with pointers, that 00:47:41.580 --> 00:47:45.900 is equivalent to using the syntax that we've used for a while now, which 00:47:45.900 --> 00:47:47.528 is to just use the square brackets. 00:47:47.528 --> 00:47:49.320 And the nice thing about square brackets is 00:47:49.320 --> 00:47:51.810 that the computer will figure out for you 00:47:51.810 --> 00:47:56.790 how far apart each of those integers are because it knows the size of an int. 00:47:56.790 --> 00:47:59.120 But now, at this point in the story, things 00:47:59.120 --> 00:48:01.310 get interesting and also annoying. 00:48:01.310 --> 00:48:04.160 And Santiago, recall, was the one that helped us solve this earlier. 00:48:04.160 --> 00:48:06.020 Suppose I didn't plan ahead. 00:48:06.020 --> 00:48:10.590 I only allocated three integers on line five there. 00:48:10.590 --> 00:48:12.800 But now, at line 13, I'm, like, oh, dammit. 00:48:12.800 --> 00:48:15.240 Now I want to add a fourth integer to the list. 00:48:15.240 --> 00:48:17.210 I could obviously just redo all the code. 00:48:17.210 --> 00:48:19.520 But suppose that part of the story here is to go ahead 00:48:19.520 --> 00:48:21.920 and dynamically allocate more memory. 00:48:21.920 --> 00:48:23.222 Well, how can I do this? 00:48:23.222 --> 00:48:26.180 Well, let me go ahead and allocate another chunk of memory temporarily. 00:48:26.180 --> 00:48:28.160 So I'll call it temp, by convention. 00:48:28.160 --> 00:48:31.550 And this time I'm going to go ahead and allocate 4 times size of int. 00:48:31.550 --> 00:48:34.130 Because, again, for the sake of the story, I messed up 00:48:34.130 --> 00:48:37.523 and I want to allocate enough space now for that original-- 00:48:37.523 --> 00:48:38.690 I haven't so much messed up. 00:48:38.690 --> 00:48:42.050 I have now decided that I want to add a fourth number to this array. 00:48:42.050 --> 00:48:46.057 As always, I should check if temp equals equals null. 00:48:46.057 --> 00:48:46.640 You know what? 00:48:46.640 --> 00:48:49.430 I'm going to go ahead and free the memory I already allocated. 00:48:49.430 --> 00:48:51.230 And then I'm just going to get out of here, return 1. 00:48:51.230 --> 00:48:52.020 Something went wrong. 00:48:52.020 --> 00:48:53.312 There's nothing to demonstrate. 00:48:53.312 --> 00:48:56.300 So I'm going to exit out of main entirely. 00:48:56.300 --> 00:49:00.718 But if malloc did not return null, and all is well, what am I going to do? 00:49:00.718 --> 00:49:02.510 Well, let's first do what Santiago proposed 00:49:02.510 --> 00:49:05.000 when we first began this conversation. 00:49:05.000 --> 00:49:11.540 For int i get 0, i less than 3, i++, let's go ahead and copy into this new, 00:49:11.540 --> 00:49:15.660 temporary chunk of memory whatever is at the original chunk of memory. 00:49:15.660 --> 00:49:17.900 So when Santiago proposed that we copy 1, 2, 00:49:17.900 --> 00:49:20.960 3 from the old array into the new array, here's 00:49:20.960 --> 00:49:25.370 how we might do that in code just using a simple for loop, a la Week 2. 00:49:25.370 --> 00:49:30.890 And then let me go ahead now and add one more value, tmp[3], 00:49:30.890 --> 00:49:33.620 which is the fourth location if you're starting from 0. 00:49:33.620 --> 00:49:36.530 I'm going to go ahead and put the number 4 there. 00:49:36.530 --> 00:49:39.440 And now, at this point, I'm going to go ahead and remember 00:49:39.440 --> 00:49:43.513 the fact that tmp is my new list. 00:49:43.513 --> 00:49:45.680 So I'm going to go ahead and free the original list. 00:49:45.680 --> 00:49:49.580 And I'm going to update my old list to point at the new list. 00:49:49.580 --> 00:49:53.480 And then lastly, I'm going to go ahead and use another for loop just 00:49:53.480 --> 00:49:57.920 to demonstrate that I think I did this correctly, this time iterating up to 4 00:49:57.920 --> 00:49:59.120 instead of 3. 00:49:59.120 --> 00:50:04.280 I'm going to go ahead and print out with i the contents of list[i]. 00:50:04.280 --> 00:50:06.320 So let's rewind real quick. 00:50:06.320 --> 00:50:12.350 We began the story by allocating an array of three integers. 00:50:12.350 --> 00:50:15.817 But we did it this time dynamically to demonstrate that malloc just 00:50:15.817 --> 00:50:16.900 returns a chunk of memory. 00:50:16.900 --> 00:50:20.510 And if you want to treat that chunk of memory as an array, you absolutely can. 00:50:20.510 --> 00:50:22.477 This stuff here, if list equals equals null, 00:50:22.477 --> 00:50:25.310 it's just error checking, just to make sure that nothing went wrong. 00:50:25.310 --> 00:50:27.170 The interesting code resumes here. 00:50:27.170 --> 00:50:31.640 I'm putting the numbers 1, 2, and 3 at location 0, 1, and 2 respectively 00:50:31.640 --> 00:50:34.790 in that chunk of memory which, again, I'm treating like an array. 00:50:34.790 --> 00:50:38.180 But now at this point in the story, I've stipulated that, wait a minute, 00:50:38.180 --> 00:50:39.950 I want to go ahead and add a fourth value. 00:50:39.950 --> 00:50:40.920 How can I do that? 00:50:40.920 --> 00:50:43.790 And let's stipulate that I want to go back and change the existing program. 00:50:43.790 --> 00:50:45.790 Because suppose that for the sake of discussion, 00:50:45.790 --> 00:50:48.500 this is code that's running at Google or Twitter over time 00:50:48.500 --> 00:50:52.170 and it's only after receiving another tweet that their code realizes, 00:50:52.170 --> 00:50:53.730 oh, we need more space. 00:50:53.730 --> 00:50:54.890 So how do I do this here? 00:50:54.890 --> 00:50:59.690 On line 15, this time I allocate enough space for four integers. 00:50:59.690 --> 00:51:01.910 And I, again, do some error checking. 00:51:01.910 --> 00:51:04.820 If tmp equals null, then something bad happened. 00:51:04.820 --> 00:51:07.010 Let's just exit all together. 00:51:07.010 --> 00:51:09.620 But if nothing bad happened, let's take Santiago's suggestion 00:51:09.620 --> 00:51:14.960 and translate his English advice into C. Let's use a for loop from 0 to 3 00:51:14.960 --> 00:51:17.480 and copy into this new temporary chunk of memory 00:51:17.480 --> 00:51:20.190 the contents of the original chunk of memory. 00:51:20.190 --> 00:51:22.160 So tmp[i] = list[i]. 00:51:22.160 --> 00:51:24.530 And then here, which was the point of this exercise, 00:51:24.530 --> 00:51:29.870 let me add my fourth number at tmp[3], which is the fourth location if you 00:51:29.870 --> 00:51:31.370 start counting from 0. 00:51:31.370 --> 00:51:34.280 But at this point in the story, much like my earlier slide, 00:51:34.280 --> 00:51:38.000 I have both the 1, 2, 3 in an array of size 3, 00:51:38.000 --> 00:51:41.690 and I have 1, 2, 3 duplicated in the array of size 4. 00:51:41.690 --> 00:51:45.230 Let me go ahead and free the original list and give back to the computer 00:51:45.230 --> 00:51:46.820 that original chunk of memory. 00:51:46.820 --> 00:51:49.340 Let me then remember, using my better names 00:51:49.340 --> 00:51:52.730 variable, what the address of this new chunk of memory is. 00:51:52.730 --> 00:51:55.010 And then, just to show off, let me go ahead, 00:51:55.010 --> 00:51:58.430 and with another for loop, this time counting four times, not three, 00:51:58.430 --> 00:52:00.720 let me print out all of those values. 00:52:00.720 --> 00:52:04.200 Now here's where I'll cross my fingers, compile my new program. 00:52:04.200 --> 00:52:05.900 It does not compile OK. 00:52:05.900 --> 00:52:09.150 Because it looks like I have one too many parentheses there. 00:52:09.150 --> 00:52:11.840 So let's recompile the program with make list-- 00:52:11.840 --> 00:52:13.440 another error. 00:52:13.440 --> 00:52:14.810 So let me scroll up there. 00:52:14.810 --> 00:52:22.040 And oh, interesting, so this is a common mistake, implicitly declaring library 00:52:22.040 --> 00:52:24.170 function malloc something something. 00:52:24.170 --> 00:52:27.140 So any time you get an implicitly declaring error, 00:52:27.140 --> 00:52:31.010 odds are it means you just did something simple like this, 00:52:31.010 --> 00:52:35.150 you forgot the requisite header file in which that function is defined. 00:52:35.150 --> 00:52:39.120 And indeed, recall from last week, malloc is in standard lib, as is free. 00:52:39.120 --> 00:52:40.220 So now let's do make list. 00:52:40.220 --> 00:52:42.210 Cross my fingers again. 00:52:42.210 --> 00:52:42.710 Phew. 00:52:42.710 --> 00:52:43.880 That time it worked. 00:52:43.880 --> 00:52:47.360 ./list, voila, 1, 2, 3, 4. 00:52:47.360 --> 00:52:50.480 So this is now a completely literal translation 00:52:50.480 --> 00:52:53.990 of all of that code into a working program 00:52:53.990 --> 00:52:58.870 that, again, starts off by using an array of size 3, 00:52:58.870 --> 00:53:01.270 having dynamically allocated it. 00:53:01.270 --> 00:53:05.200 And then it resizes it by creating a new one of size 4, copying old into new, 00:53:05.200 --> 00:53:08.170 freeing the old, and then proceeding as before. 00:53:08.170 --> 00:53:11.240 And I've deliberately used malloc both times here as follows. 00:53:11.240 --> 00:53:14.350 If you create an array in C using square bracket notation, 00:53:14.350 --> 00:53:16.210 you have painted yourself into a corner. 00:53:16.210 --> 00:53:19.210 You can't use any lines of code that we have seen 00:53:19.210 --> 00:53:22.770 and resize an array that you have declared using square brackets. 00:53:22.770 --> 00:53:25.270 More technically speaking, when you use the square brackets, 00:53:25.270 --> 00:53:28.960 you are statically allocating the array on the stack. 00:53:28.960 --> 00:53:31.750 You're putting it into the frame of the computer's memory 00:53:31.750 --> 00:53:37.930 that belongs to that computer, to that function's stack frame, per the diagram 00:53:37.930 --> 00:53:38.950 last week. 00:53:38.950 --> 00:53:44.170 If, however, you use malloc, our new tool from last week, 00:53:44.170 --> 00:53:47.170 and say give me a chunk of memory, that comes from the heap. 00:53:47.170 --> 00:53:49.420 And that you can resize. 00:53:49.420 --> 00:53:52.520 That you can give back and take more of and back and forth. 00:53:52.520 --> 00:53:57.050 And in fact, there's even a more simple way of doing this, relatively speaking. 00:53:57.050 --> 00:54:02.140 If you want to reallocate an array, a chunk of memory, by resizing it, 00:54:02.140 --> 00:54:04.570 you don't have to do all of this, which I did before. 00:54:04.570 --> 00:54:06.460 You don't have to use malloc twice. 00:54:06.460 --> 00:54:08.710 You can use malloc once at the beginning. 00:54:08.710 --> 00:54:10.780 And then, you can use a new function that's 00:54:10.780 --> 00:54:13.930 actually kind of helpful in this case, called realloc. 00:54:13.930 --> 00:54:19.730 And you can actually do this, realloc a chunk of memory of size 4 times 00:54:19.730 --> 00:54:20.500 size of int. 00:54:20.500 --> 00:54:24.790 But specifically, reallocate the thing called list. 00:54:24.790 --> 00:54:27.250 So realloc is very similar to malloc. 00:54:27.250 --> 00:54:28.790 But it takes two arguments. 00:54:28.790 --> 00:54:32.140 One is the size of the memory you want, whether bigger or smaller. 00:54:32.140 --> 00:54:33.670 But it takes a second argument. 00:54:33.670 --> 00:54:38.170 Its very first argument now is the address of a chunk of memory 00:54:38.170 --> 00:54:41.800 that you have already allocated, as with malloc. 00:54:41.800 --> 00:54:43.660 So, again, at the top of the same program, 00:54:43.660 --> 00:54:48.950 recall that I used malloc to give myself a list that points at a chunk of memory 00:54:48.950 --> 00:54:50.380 big enough for three integers. 00:54:50.380 --> 00:54:55.030 On line 16, I'm now handing that address back to realloc, saying, wait a minute, 00:54:55.030 --> 00:54:56.800 here is that same address you gave me. 00:54:56.800 --> 00:55:01.090 Please now resize it, reallocate it to be of size 4. 00:55:01.090 --> 00:55:03.760 And what the function does is, if all goes well, 00:55:03.760 --> 00:55:10.300 it returns to the address in memory that it is now of sufficient size. 00:55:10.300 --> 00:55:13.030 Otherwise, it returns null if anything bad happened. 00:55:13.030 --> 00:55:14.570 So I'll leave that code alone. 00:55:14.570 --> 00:55:16.870 But what I don't have to do anymore is this. 00:55:16.870 --> 00:55:20.850 Realloc actually copies the old into the new for you. 00:55:20.850 --> 00:55:23.740 So, again, coming back to Santiago's story at the beginning of today, 00:55:23.740 --> 00:55:26.260 realloc will not only give you a bigger chunk 00:55:26.260 --> 00:55:29.740 of memory, if you ask for it, by handing back the address of the memory you 00:55:29.740 --> 00:55:31.150 already requested. 00:55:31.150 --> 00:55:35.020 And it's going to hand you back the address of a new chunk of memory 00:55:35.020 --> 00:55:37.810 that is big enough to fit all of those new values. 00:55:37.810 --> 00:55:39.130 And it's smart, too. 00:55:39.130 --> 00:55:42.880 If there happens to be room at the very end of the existing chunk of memory, 00:55:42.880 --> 00:55:45.397 there's no hello, world, like we saw on my slide earlier, 00:55:45.397 --> 00:55:47.980 then you're actually going to get back the exact same address. 00:55:47.980 --> 00:55:51.100 But the computer's operating system, Windows, Mac, OS, or Linux, 00:55:51.100 --> 00:55:54.160 is going to remember, OK, yes, I know I gave you three bytes originally. 00:55:54.160 --> 00:55:56.720 There happened to be room at the end of that chunk of memory. 00:55:56.720 --> 00:56:00.040 So now I'm going to remember, instead, that that same address has 00:56:00.040 --> 00:56:03.230 room for four integers, or whatever number you pass in. 00:56:03.230 --> 00:56:06.070 So, again, you don't have to bother copying yourself. 00:56:06.070 --> 00:56:10.880 You can let the computer actually do the reallocation for you. 00:56:10.880 --> 00:56:18.345 Any questions, then on malloc, on realloc, on free, or fundamentally, 00:56:18.345 --> 00:56:22.300 on linked lists? 00:56:22.300 --> 00:56:24.040 Notice that this isn't yet a list. 00:56:24.040 --> 00:56:25.780 This is still an array. 00:56:25.780 --> 00:56:28.628 So we still need to take this program one step further 00:56:28.628 --> 00:56:30.670 and actually transition from this chunk of memory 00:56:30.670 --> 00:56:34.120 using arrays to these actual nodes. 00:56:34.120 --> 00:56:37.310 But before we do that, any questions or confusion? 00:56:37.310 --> 00:56:37.810 BRIAN: Yeah. 00:56:37.810 --> 00:56:41.775 A question came in, why do you not need to free tmp at the end of the program? 00:56:41.775 --> 00:56:44.650 DAVID MALAN: Why do I not need to free tmp at the end of the program? 00:56:44.650 --> 00:56:48.610 Because I'm an idiot and glossed over that key, important detail. 00:56:48.610 --> 00:56:52.200 You absolutely should free not tmp in this case, but list. 00:56:52.200 --> 00:56:58.900 So at this line here, 27, I use my list variable, which just has a better name, 00:56:58.900 --> 00:57:03.520 and I make it equal to tmp so that I can just refer to it as a bigger list. 00:57:03.520 --> 00:57:04.630 But you are quite right. 00:57:04.630 --> 00:57:06.400 That was an oversight on my part. 00:57:06.400 --> 00:57:07.900 Valgrind would not have liked that. 00:57:07.900 --> 00:57:12.850 At the very end of this program, I should absolutely free list. 00:57:12.850 --> 00:57:15.790 However, I don't need to free tmp, per se, 00:57:15.790 --> 00:57:20.450 because I've simply reused the variable name through that assignment. 00:57:20.450 --> 00:57:22.730 Good question and good catch. 00:57:22.730 --> 00:57:24.370 Unintended. 00:57:24.370 --> 00:57:26.195 Other questions or comments, Brian? 00:57:26.195 --> 00:57:28.570 BRIAN: Question came in, why does the linked list improve 00:57:28.570 --> 00:57:32.050 this situation if we can just use arrays and realloc and malloc to do 00:57:32.050 --> 00:57:32.900 all this stuff? 00:57:32.900 --> 00:57:33.230 DAVID MALAN: Yeah. 00:57:33.230 --> 00:57:34.105 Really good question. 00:57:34.105 --> 00:57:38.410 So how have we improved this situation if we can just use arrays in this way? 00:57:38.410 --> 00:57:40.570 Recall that this is kind of a regression. 00:57:40.570 --> 00:57:44.920 What I just did is a regression to where we started the story, whereby 00:57:44.920 --> 00:57:46.990 in any of the versions of code I just wrote, 00:57:46.990 --> 00:57:50.270 I reallocated more space for this array. 00:57:50.270 --> 00:57:54.160 Which means that I, manually with that for loop, or realloc with its own 00:57:54.160 --> 00:57:57.490 for loop, had to copy all of the old values into the new. 00:57:57.490 --> 00:58:00.640 So the approach we've taken in all three versions of this program 00:58:00.640 --> 00:58:04.450 that I've written thus far on the fly, they've all been Big O of n. 00:58:04.450 --> 00:58:10.220 When it comes to inserts, they have not given us the dynamism of a linked list 00:58:10.220 --> 00:58:12.403 to just add without that duplication. 00:58:12.403 --> 00:58:15.320 And we haven't had the ability yet to just do an insert, for instance, 00:58:15.320 --> 00:58:18.460 at the beginning of the structure in Big O of 1 time. 00:58:18.460 --> 00:58:20.800 So, again, this is the code translation, really, 00:58:20.800 --> 00:58:23.740 of that slower approach from which we began. 00:58:23.740 --> 00:58:26.500 So the ultimate goal now is going to be to change this code 00:58:26.500 --> 00:58:29.830 and give us that dynamism and actually implement things as a proper linked 00:58:29.830 --> 00:58:32.905 list, not just as an array of integers. 00:58:32.905 --> 00:58:34.030 But we're about an hour in. 00:58:34.030 --> 00:58:36.405 Let's go ahead and take our first five minute break here. 00:58:36.405 --> 00:58:40.600 And when we come back, we'll translate the nodes themselves to a full program. 00:58:40.600 --> 00:58:42.400 All right, we are back. 00:58:42.400 --> 00:58:46.930 And recall that we began today by revisiting arrays and pointing out 00:58:46.930 --> 00:58:49.510 that searching is great in arrays if you keep them sorted. 00:58:49.510 --> 00:58:52.990 You get the Big O of log n that we liked back from Week 0. 00:58:52.990 --> 00:58:56.020 But as soon as you want to start dynamically modifying an array, 00:58:56.020 --> 00:58:58.050 it gets very expensive quickly. 00:58:58.050 --> 00:59:00.640 It might take you Big O of n steps to copy 00:59:00.640 --> 00:59:03.790 the contents of an old, small array into a new, bigger array. 00:59:03.790 --> 00:59:07.360 And honestly, over time, especially for real world software with lots of data, 00:59:07.360 --> 00:59:09.440 even Big O of n is expensive. 00:59:09.440 --> 00:59:12.640 Like, you don't want to be constantly copying and copying and copying 00:59:12.640 --> 00:59:15.170 all of your data around the computer's memory. 00:59:15.170 --> 00:59:18.730 So we can avoid that by using pointers and, in turn, 00:59:18.730 --> 00:59:21.400 stitching together these structures called linked lists, 00:59:21.400 --> 00:59:24.010 albeit at a price of spending more memory. 00:59:24.010 --> 00:59:27.693 But with that additional memory, that additional cost, comes dynamism. 00:59:27.693 --> 00:59:29.860 So that if we want we can even achieve constant time 00:59:29.860 --> 00:59:31.270 when it comes to inserting. 00:59:31.270 --> 00:59:34.540 But of course, then we have to sacrifice things like sortability. 00:59:34.540 --> 00:59:36.430 So this came with trade offs. 00:59:36.430 --> 00:59:39.400 We've just seen a few examples of actual C programs 00:59:39.400 --> 00:59:42.070 that implement first the old school array, per Week 0, 00:59:42.070 --> 00:59:43.820 where we just hardcode the array's length. 00:59:43.820 --> 00:59:46.210 And unfortunately, we painted ourselves into a corner, 00:59:46.210 --> 00:59:47.830 using the bracket notation alone. 00:59:47.830 --> 00:59:51.490 So we deployed instead m which is more versatile tool that 00:59:51.490 --> 00:59:53.770 lets us get as much memory as we want. 00:59:53.770 --> 00:59:58.120 And we used that to recreate the idea of a list implemented as arrays. 00:59:58.120 --> 01:00:00.880 But even then, we saw that I had to copy using a for loop 01:00:00.880 --> 01:00:04.865 or we had to copy using, indirectly, realloc, old into new. 01:00:04.865 --> 01:00:07.990 And, again, for these small programs, you don't even notice the difference. 01:00:07.990 --> 01:00:09.320 The programs run like that. 01:00:09.320 --> 01:00:12.580 But for large, real world software, all of that Big O of n time 01:00:12.580 --> 01:00:14.060 is going to add up quickly. 01:00:14.060 --> 01:00:18.020 So it's best if we can try to avoid it altogether and achieve dynamism. 01:00:18.020 --> 01:00:22.810 So the code via which you can add to linked list dynamically 01:00:22.810 --> 01:00:26.060 is actually part of the challenge for Problem Set 5 this coming week. 01:00:26.060 --> 01:00:30.100 But let's see some of the building blocks via which we can syntactically 01:00:30.100 --> 01:00:33.160 start to allocate nodes and stitch them together when 01:00:33.160 --> 01:00:35.163 we know in advance how many we want. 01:00:35.163 --> 01:00:37.330 Which is not going to be the case for Problem Set 5, 01:00:37.330 --> 01:00:40.857 but for now is indeed the case, because I only want three of these things. 01:00:40.857 --> 01:00:42.940 So I'm going to go back to my program from before. 01:00:42.940 --> 01:00:45.520 And I'm going to rewind and race everything inside of main. 01:00:45.520 --> 01:00:47.470 And I'm going to go ahead and declare myself 01:00:47.470 --> 01:00:54.490 a type called struct node initially with a number inside of it and a struct node 01:00:54.490 --> 01:00:56.737 * called next inside of that. 01:00:56.737 --> 01:00:59.320 And then I'm going to call this whole thing quite simply node. 01:00:59.320 --> 01:01:02.110 So that's quite similar to what we did with a person. 01:01:02.110 --> 01:01:05.440 But now it's a little fancier in that I'm giving the structure itself 01:01:05.440 --> 01:01:07.150 a temporary name, struct node. 01:01:07.150 --> 01:01:10.270 I'm referring to that temporary name inside of the structure 01:01:10.270 --> 01:01:12.430 so that I can have a pointer there, too. 01:01:12.430 --> 01:01:16.310 And then I'm renaming what was person to now node. 01:01:16.310 --> 01:01:19.330 Now let's go ahead and actually use this thing inside of main. 01:01:19.330 --> 01:01:22.840 So let me go ahead and create an empty linked list. 01:01:22.840 --> 01:01:27.760 The simplest way to translate the simple block with which we began today is just 01:01:27.760 --> 01:01:31.120 doing node *list;. 01:01:31.120 --> 01:01:33.970 Unfortunately, any time you declare a variable 01:01:33.970 --> 01:01:36.730 that does not have an assigned value, it's garbage. 01:01:36.730 --> 01:01:38.980 And garbage is bad in the world of pointers. 01:01:38.980 --> 01:01:42.430 Again, to be clear, if you create this variable called list 01:01:42.430 --> 01:01:46.690 and you do not explicitly initialize its value 01:01:46.690 --> 01:01:49.570 to be something like null pointing at the ground, 01:01:49.570 --> 01:01:51.563 but instead leave it as a garbage value, it's 01:01:51.563 --> 01:01:53.980 the sort of metaphorical equivalent of this arrow pointing 01:01:53.980 --> 01:01:55.840 this way, this way, this other way. 01:01:55.840 --> 01:01:58.570 That is to say you might accidentally, in your own code, 01:01:58.570 --> 01:02:01.570 follow this arrow to a completely bogus place. 01:02:01.570 --> 01:02:05.038 And that's the point at which you have what are called segmentation faults, 01:02:05.038 --> 01:02:08.080 as some of you might have experienced already with Problem Set 4 when you 01:02:08.080 --> 01:02:09.880 touch memory that you shouldn't. 01:02:09.880 --> 01:02:12.580 So garbage values are bad ever more so in the context 01:02:12.580 --> 01:02:14.240 of pointers for that reason. 01:02:14.240 --> 01:02:15.745 So you rarely want to do this. 01:02:15.745 --> 01:02:19.540 You almost always want to initialize the pointer to some known value. 01:02:19.540 --> 01:02:21.850 In the absence of an actual address, we're 01:02:21.850 --> 01:02:24.840 going to use null to indicate that there's nothing there. 01:02:24.840 --> 01:02:26.810 But that's deliberate on our part. 01:02:26.810 --> 01:02:29.600 Now, suppose I want to insert, just as I did physically 01:02:29.600 --> 01:02:32.960 by lugging the block number 1 onto stage before, let me go ahead 01:02:32.960 --> 01:02:36.320 and allocate a node-- we'll call it n temporarily-- 01:02:36.320 --> 01:02:40.410 using malloc, this time asking for the size of a node. 01:02:40.410 --> 01:02:41.900 So the story is now changing. 01:02:41.900 --> 01:02:43.640 I'm not allocating individual ints. 01:02:43.640 --> 01:02:49.340 I'm allocating individual nodes inside of which is enough room for an integer 01:02:49.340 --> 01:02:51.410 and another pointer to a node. 01:02:51.410 --> 01:02:55.100 And this size of operator figures out, from the definition 01:02:55.100 --> 01:02:58.310 of this structure up here above main, how much space 01:02:58.310 --> 01:03:03.560 is needed to store an integer and a pointer to a struct node. 01:03:03.560 --> 01:03:06.110 So as always now, I'm always going to check. 01:03:06.110 --> 01:03:09.860 If n equals equals null, I'm going to get out of this program 01:03:09.860 --> 01:03:11.330 immediately, and just return 1. 01:03:11.330 --> 01:03:14.130 Because something went wrong, and there's just not enough memory. 01:03:14.130 --> 01:03:19.010 But if all went well, I'm going to go ahead now and go into that node n. 01:03:19.010 --> 01:03:23.180 I'm going to go into its number field and assign it the value 1. 01:03:23.180 --> 01:03:27.410 And I'm going to go into that node n and go into its next field and, for now, 01:03:27.410 --> 01:03:28.950 assign it the value null. 01:03:28.950 --> 01:03:32.600 So this is as though I've just allocated the wooden block with a 1 in it 01:03:32.600 --> 01:03:35.700 and I have initialized its next pointer to null. 01:03:35.700 --> 01:03:40.440 Now I'm going to go ahead and update the list itself to point at that value. 01:03:40.440 --> 01:03:44.270 So, again, my variable called list is the variable via which 01:03:44.270 --> 01:03:46.640 I'm representing the whole list. 01:03:46.640 --> 01:03:49.460 And now that I have an actual node to point to, 01:03:49.460 --> 01:03:55.260 I'm setting list which, again, is a pointer, to a node, equal to whatever 01:03:55.260 --> 01:03:58.850 n is, the address of an actual node. 01:03:58.850 --> 01:04:01.940 So at this point in the story, I have the small wooden block connected 01:04:01.940 --> 01:04:04.330 to the larger block containing 1. 01:04:04.330 --> 01:04:06.080 Let's suppose, for the sake of discussion, 01:04:06.080 --> 01:04:09.800 I now want to add the number 2 to this list, this time using 01:04:09.800 --> 01:04:11.870 a node as well, not just an integer. 01:04:11.870 --> 01:04:15.500 I'm going to go ahead and allocate n, using malloc, 01:04:15.500 --> 01:04:17.872 giving myself the size of another node. 01:04:17.872 --> 01:04:20.330 I'm going to again just going to check if that thing equals 01:04:20.330 --> 01:04:24.170 null let me go ahead and free the list so I don't leak memory. 01:04:24.170 --> 01:04:25.618 Then let me go ahead and return 1. 01:04:25.618 --> 01:04:27.410 So that's just a quick sanity check to make 01:04:27.410 --> 01:04:30.770 sure I free any memory I've already allocated before. 01:04:30.770 --> 01:04:33.260 But if all goes well, and that's what I'm hoping for, 01:04:33.260 --> 01:04:37.790 I'm going to go ahead and go into this node n and store in its number field 01:04:37.790 --> 01:04:39.103 literally the number 2. 01:04:39.103 --> 01:04:40.520 And then now, because this thing-- 01:04:40.520 --> 01:04:42.770 I'll insert it in sorted order for now-- 01:04:42.770 --> 01:04:47.000 I'm going to go ahead and insert this next = NULL. 01:04:47.000 --> 01:04:52.340 And if I indeed want to put this number 2 node after the number 1 node, 01:04:52.340 --> 01:04:55.940 I can start at the top of the list, I can go to the next node. 01:04:55.940 --> 01:05:01.170 And inside of its value, I can say n. 01:05:01.170 --> 01:05:05.700 So this line of code here starts at the little block, follows the arrow, 01:05:05.700 --> 01:05:10.050 and then updates the next pointer of that first node, the 1 node, 01:05:10.050 --> 01:05:14.290 to instead store the address of this new node, n. 01:05:14.290 --> 01:05:18.690 And then lastly, let's do one more of these so n = malloc(sizeof(node)); 01:05:18.690 --> 01:05:19.722 one last time. 01:05:19.722 --> 01:05:21.930 Let me go ahead and do my sanity check one more time. 01:05:21.930 --> 01:05:24.402 If n = = NULL something bad happened. 01:05:24.402 --> 01:05:27.360 So now I'm going to go ahead and don't worry about the syntax just yet. 01:05:27.360 --> 01:05:29.730 But I'm going to go ahead and free list next. 01:05:29.730 --> 01:05:31.620 And I'm going to go ahead and free(list); 01:05:31.620 --> 01:05:33.453 and then I'm going to go ahead and return 1. 01:05:33.453 --> 01:05:34.830 But more on that another time. 01:05:34.830 --> 01:05:37.440 That's just in the corner case where something bad happened. 01:05:37.440 --> 01:05:41.310 But if nothing bad happened, I'm going to update the number field to be 3. 01:05:41.310 --> 01:05:45.450 I'm going to update the next field to be NULL. 01:05:45.450 --> 01:05:49.560 And now I'm going to update the list next block 01:05:49.560 --> 01:05:53.340 next block to equal this new one n. 01:05:53.340 --> 01:05:57.570 And then here, after this, I can proceed to print all of these things if I want. 01:05:57.570 --> 01:06:00.270 And in fact, I'll go ahead and do this with a loop. 01:06:00.270 --> 01:06:03.900 The loop is going to look a little different from before in this case. 01:06:03.900 --> 01:06:07.080 But it turns out we can use for loops pretty powerfully here, too. 01:06:07.080 --> 01:06:09.930 But at this point in the story, my list pointer 01:06:09.930 --> 01:06:12.300 is pointing at the 1 node, which is pointing at the 2 01:06:12.300 --> 01:06:14.190 node, which is pointing at the 3 node. 01:06:14.190 --> 01:06:16.110 And, again, as someone observed earlier, it's 01:06:16.110 --> 01:06:19.590 not common to use this double arrow notation in this case. 01:06:19.590 --> 01:06:22.680 I bet I could actually use a loop to iterate over these things 01:06:22.680 --> 01:06:23.280 one at a time. 01:06:23.280 --> 01:06:26.250 And we can see this here when it's time to print. 01:06:26.250 --> 01:06:28.650 Let me go ahead and do this for. 01:06:28.650 --> 01:06:32.310 And instead of using i, because there really aren't any numbers in question. 01:06:32.310 --> 01:06:36.420 This is no longer an array, so I can't use square bracket notation or pointer 01:06:36.420 --> 01:06:37.050 arithmetic. 01:06:37.050 --> 01:06:38.670 I need to use pointers. 01:06:38.670 --> 01:06:40.590 So this might feel a little weird at first. 01:06:40.590 --> 01:06:43.500 But there's nothing stopping me with a for loop from doing this. 01:06:43.500 --> 01:06:46.920 Give me a temporary pointer to a node called tmp 01:06:46.920 --> 01:06:51.600 and initialize it to be whatever is at the beginning of the list. 01:06:51.600 --> 01:06:56.040 Keep doing the following so long as tmp does not equal NULL. 01:06:56.040 --> 01:06:59.580 And on each iteration of this loop, don't do something like i++ which, 01:06:59.580 --> 01:07:01.320 again, is not relevant now. 01:07:01.320 --> 01:07:04.410 But go ahead and update my temporary pointer 01:07:04.410 --> 01:07:09.300 to be whatever the value of the temporary pointer's next field is. 01:07:09.300 --> 01:07:12.000 So this looks crazy cryptic most likely, especially 01:07:12.000 --> 01:07:16.020 if you're new to pointers as of last week, as most of you are. 01:07:16.020 --> 01:07:18.330 But it's the same idea is a typical for loop. 01:07:18.330 --> 01:07:21.780 You initialize some variable before the semicolon. 01:07:21.780 --> 01:07:24.720 You check some condition after the first semicolon. 01:07:24.720 --> 01:07:28.898 And you perform an update of that variable after the second semicolon. 01:07:28.898 --> 01:07:30.690 In this case, they're not integers, though. 01:07:30.690 --> 01:07:33.840 Instead, I'm saying give myself a temporary pointer to the beginning 01:07:33.840 --> 01:07:37.800 of the list, like my finger pointing at, or if you prefer, the foam finger 01:07:37.800 --> 01:07:40.620 pointing at some node in the list. 01:07:40.620 --> 01:07:43.950 Go ahead and call that temporary variable tmp. 01:07:43.950 --> 01:07:45.310 And now do the following. 01:07:45.310 --> 01:07:48.150 So long as tmp is not null, that is, so long as it's 01:07:48.150 --> 01:07:51.960 pointing at an actual legitimate wooden block, what do I want to do? 01:07:51.960 --> 01:07:57.300 Let me go ahead and print out, using printf and %i as always, 01:07:57.300 --> 01:08:04.230 whatever value is in the number field of that node there. 01:08:04.230 --> 01:08:05.190 And that's it. 01:08:05.190 --> 01:08:08.340 With this simple for loop, relatively simple for loop, 01:08:08.340 --> 01:08:11.910 I can essentially point at the very first node in my list 01:08:11.910 --> 01:08:14.850 and keep updating it to the next field, updating it to the next field, 01:08:14.850 --> 01:08:16.100 updating it to the next field. 01:08:16.100 --> 01:08:18.720 And I keep doing this until my finger sort of walks off 01:08:18.720 --> 01:08:21.420 the end of the list of wooden blocks, thereby 01:08:21.420 --> 01:08:25.493 pointing at null, 0x0, at which point the loop stops 01:08:25.493 --> 01:08:26.910 and there's nothing more to print. 01:08:26.910 --> 01:08:29.995 So in answer to that question earlier, do we 01:08:29.995 --> 01:08:31.620 need to use this double arrow notation? 01:08:31.620 --> 01:08:35.010 Short answer, no, this is kind of the secret ingredient here. 01:08:35.010 --> 01:08:38.520 This syntax inside of the for loop takes whatever you're pointing at, 01:08:38.520 --> 01:08:41.939 follows one arrow, and then updates the temporary variable now 01:08:41.939 --> 01:08:43.710 to point at that structure instead. 01:08:43.710 --> 01:08:47.189 So this is kind of the equivalent in the world of pointers and linked lists 01:08:47.189 --> 01:08:48.450 of doing i++. 01:08:48.450 --> 01:08:49.770 But it's not as simple as i++. 01:08:49.770 --> 01:08:52.170 You can't just look one byte to the right or to the left. 01:08:52.170 --> 01:08:55.380 Instead, you have to follow an arrow, follow an arrow. 01:08:55.380 --> 01:08:59.609 But by reassigning this temporary variable to wherever you just followed, 01:08:59.609 --> 01:09:02.790 it's a way of following each of these orange arrows 01:09:02.790 --> 01:09:06.420 as we did physically a moment ago. 01:09:06.420 --> 01:09:10.350 After this, I should, for good measure, go ahead and free the whole list. 01:09:10.350 --> 01:09:13.217 And let me just offer up a common way of freeing a linked list. 01:09:13.217 --> 01:09:14.800 I can actually do something like this. 01:09:14.800 --> 01:09:19.470 While list != NULL, so while the whole list itself does not equal null, 01:09:19.470 --> 01:09:23.970 go ahead and get a temporary pointer like this to the next field so I 01:09:23.970 --> 01:09:27.689 remember what comes after the current head of the list. 01:09:27.689 --> 01:09:29.460 Free the list node itself. 01:09:29.460 --> 01:09:32.010 And then update list to be tmp. 01:09:32.010 --> 01:09:35.040 So, again, this probably looks crazy cryptic and certainly in the coming 01:09:35.040 --> 01:09:36.899 days, especially with Problem Set 5, you'll 01:09:36.899 --> 01:09:40.439 work through this kind of logic a little more logically, a little more 01:09:40.439 --> 01:09:41.830 pictorially, perhaps. 01:09:41.830 --> 01:09:43.080 But what am I doing here? 01:09:43.080 --> 01:09:46.479 First, I'm going to do the following, so long as my linked list is not null. 01:09:46.479 --> 01:09:48.479 And if I've got three nodes in it, by definition 01:09:48.479 --> 01:09:49.854 it's not null from the beginning. 01:09:49.854 --> 01:09:52.590 But my goal now is to free all of the memory I have allocated 01:09:52.590 --> 01:09:54.430 from left to right, so to speak. 01:09:54.430 --> 01:09:55.390 So how do I do that? 01:09:55.390 --> 01:09:58.950 Well, if I've got a wooden block in front of me, 01:09:58.950 --> 01:10:02.180 it's not safe to free that wooden block yet. 01:10:02.180 --> 01:10:06.630 Because that wooden block, recall, contains the pointer to the next node. 01:10:06.630 --> 01:10:09.740 So if I free this memory prematurely, I've 01:10:09.740 --> 01:10:12.230 then stranded all subsequent nodes. 01:10:12.230 --> 01:10:15.020 Because they are no longer accessible once I've told the computer 01:10:15.020 --> 01:10:17.660 you can take back this chunk of memory for the first node. 01:10:17.660 --> 01:10:22.250 So this line of code here, on line 52, is just saying temporarily 01:10:22.250 --> 01:10:24.020 give me a variable call tmp. 01:10:24.020 --> 01:10:28.740 And point it not at the list itself, the first node, point at the next node. 01:10:28.740 --> 01:10:32.060 So it's like using my right hand to point at the current node, my left hand 01:10:32.060 --> 01:10:37.140 to point at the next node, so that I can then, on line 53, free the list itself, 01:10:37.140 --> 01:10:38.750 which should not be taken literally. 01:10:38.750 --> 01:10:43.260 List represents the first node in the linked list, not the whole thing. 01:10:43.260 --> 01:10:46.460 So when you say free list, that's, like, freeing just the current node. 01:10:46.460 --> 01:10:47.270 But that's OK. 01:10:47.270 --> 01:10:49.580 Even now this memory has been given back, 01:10:49.580 --> 01:10:52.610 I still have my left hand pointing at every subsequent node 01:10:52.610 --> 01:10:54.240 by way of the next one. 01:10:54.240 --> 01:10:58.250 So now I can update list to equal that temporary variable 01:10:58.250 --> 01:10:59.820 and just continue this loop. 01:10:59.820 --> 01:11:01.700 So it's a way of sort of Pacman style, like, 01:11:01.700 --> 01:11:04.250 gobbling up the entire linked list from left 01:11:04.250 --> 01:11:07.970 to right by freeing the first node, the second node, the third node, and then 01:11:07.970 --> 01:11:08.490 you're done. 01:11:08.490 --> 01:11:12.170 But by using a temporary variable to look one step ahead to make sure 01:11:12.170 --> 01:11:15.230 you don't chomp, free the memory too soon 01:11:15.230 --> 01:11:20.250 and, therefore, lose access to all of those subsequent nodes. 01:11:20.250 --> 01:11:21.420 All right, phew. 01:11:21.420 --> 01:11:22.560 That was a big program. 01:11:22.560 --> 01:11:25.860 But it was meant to be in succession, starting with an array, 01:11:25.860 --> 01:11:29.910 transitioning into a dynamically allocated array, followed by, finally, 01:11:29.910 --> 01:11:32.400 an implementation using linked list, albeit hardcoded 01:11:32.400 --> 01:11:34.000 to support only three nodes. 01:11:34.000 --> 01:11:37.200 But in that example, do you see some sample syntax via which you 01:11:37.200 --> 01:11:39.930 can manipulate these kinds of nodes? 01:11:39.930 --> 01:11:44.950 Questions or confusion that I can help address? 01:11:44.950 --> 01:11:45.450 BRIAN: Yes. 01:11:45.450 --> 01:11:48.060 Someone asked, similar to one of the examples you did before, 01:11:48.060 --> 01:11:51.780 why could we not have just done malloc three times size of node 01:11:51.780 --> 01:11:53.880 to get three nodes and do it that way? 01:11:53.880 --> 01:11:54.760 DAVID MALAN: Really good question. 01:11:54.760 --> 01:11:56.910 Could I not just use malloc and allocate all three at once? 01:11:56.910 --> 01:11:57.430 Absolutely. 01:11:57.430 --> 01:11:57.930 Yes. 01:11:57.930 --> 01:11:59.430 That is completely your prerogative. 01:11:59.430 --> 01:12:01.770 I did it a little more pedantically, one at a time. 01:12:01.770 --> 01:12:05.550 But you could absolutely do it all three at once. 01:12:05.550 --> 01:12:09.150 You would then need to use some pointer arithmetic though, 01:12:09.150 --> 01:12:11.370 or you would need to square bracket notation 01:12:11.370 --> 01:12:16.050 to treat that bigger chunk of memory as essentially an array of nodes, 01:12:16.050 --> 01:12:17.570 and then stitch them together. 01:12:17.570 --> 01:12:19.823 So I am assuming, for demonstration purposes, 01:12:19.823 --> 01:12:22.740 that even though we have these little simple examples that demonstrate 01:12:22.740 --> 01:12:26.670 the syntax, in a real world system, you're not going to be inserting 1, 01:12:26.670 --> 01:12:28.110 then 2, then 3. 01:12:28.110 --> 01:12:29.910 Odds are you're going to be inserting 1. 01:12:29.910 --> 01:12:31.170 Some time passes. 01:12:31.170 --> 01:12:33.750 Then you want to insert 2, so you allocate more memory. 01:12:33.750 --> 01:12:35.190 Then some more time passes. 01:12:35.190 --> 01:12:36.940 Then you want to insert 3. 01:12:36.940 --> 01:12:42.150 And so there's gaps in between these chunks of code in the real world. 01:12:42.150 --> 01:12:43.530 Other questions or confusion? 01:12:43.530 --> 01:12:43.800 BRIAN: Yeah. 01:12:43.800 --> 01:12:44.850 Another question came in. 01:12:44.850 --> 01:12:47.113 Why would malloc ever fail to allocate memory? 01:12:47.113 --> 01:12:48.780 DAVID MALAN: Why would malloc ever fail? 01:12:48.780 --> 01:12:52.660 It's rarely going to fail, but if the computer is out of memory. 01:12:52.660 --> 01:12:56.640 So essentially, if you're writing such a memory hungry program 01:12:56.640 --> 01:13:00.000 with so many variables, big arrays, big structures, lots of data, 01:13:00.000 --> 01:13:01.870 you may very well run out of memory. 01:13:01.870 --> 01:13:04.680 Maybe that's two gigabytes, maybe it's four gigabytes or more. 01:13:04.680 --> 01:13:07.080 But malloc may very well return null to you. 01:13:07.080 --> 01:13:09.490 And so you should always check for it. 01:13:09.490 --> 01:13:11.970 In fact, I dare say, on Macs and PCs, one 01:13:11.970 --> 01:13:15.780 of the most common reasons, to this day, for programs to freeze, 01:13:15.780 --> 01:13:17.910 to crash, for your home computer to reboot, 01:13:17.910 --> 01:13:20.250 is truly because someone did something stupid, like I've 01:13:20.250 --> 01:13:22.920 done multiple times now already today and last week, 01:13:22.920 --> 01:13:25.080 by touching memory that you shouldn't have. 01:13:25.080 --> 01:13:27.840 So in Problem Set 4 and now 5, any time you 01:13:27.840 --> 01:13:30.210 experience one of those segmentation faults 01:13:30.210 --> 01:13:35.910 whereby your program just crashes, that is the problem set 01:13:35.910 --> 01:13:38.760 version of, like, your whole Mac or PC crashing, 01:13:38.760 --> 01:13:40.530 because someone more experienced than you 01:13:40.530 --> 01:13:43.740 made that same mistake in their code. 01:13:43.740 --> 01:13:46.710 Just to reinforce this, let's take a quick final example 01:13:46.710 --> 01:13:50.550 involving linked lists which, again, are this very one dimensional structure, 01:13:50.550 --> 01:13:51.153 left to right. 01:13:51.153 --> 01:13:53.820 And then we'll add a second dimension and see what that buys us. 01:13:53.820 --> 01:13:55.300 But we've changed the numbers around here. 01:13:55.300 --> 01:13:56.580 Now we still have our list. 01:13:56.580 --> 01:13:58.770 But it's first pointing at the number two here. 01:13:58.770 --> 01:14:01.770 And then the number two is pointing to some other chunk of memory that's 01:14:01.770 --> 01:14:03.570 been malloced way over here. 01:14:03.570 --> 01:14:05.190 And this, then, is the number 4. 01:14:05.190 --> 01:14:06.820 And this, then, is the number 5. 01:14:06.820 --> 01:14:08.700 So we have a linked list of size 3. 01:14:08.700 --> 01:14:12.060 But I've deliberately spread the numbers out this time, 2, 4, 5. 01:14:12.060 --> 01:14:15.000 Because suppose that we do want to insert more numbers into this list, 01:14:15.000 --> 01:14:17.730 but in sorted order, it turns out that we 01:14:17.730 --> 01:14:19.770 have to think a little bit differently when 01:14:19.770 --> 01:14:23.280 we're adding nodes not to the end and not to the beginning, 01:14:23.280 --> 01:14:24.247 but in the middle. 01:14:24.247 --> 01:14:26.580 Like, when we want to allocate more nodes in the middle, 01:14:26.580 --> 01:14:29.560 there's a bit more work that actually has to happen. 01:14:29.560 --> 01:14:31.140 So how might we go about doing this? 01:14:31.140 --> 01:14:34.260 Suppose that we want to allocate, for instance, the number 1. 01:14:34.260 --> 01:14:35.820 And I want to add the number 1. 01:14:35.820 --> 01:14:37.290 Well, we could use code like this. 01:14:37.290 --> 01:14:39.060 This is the same code as we used before. 01:14:39.060 --> 01:14:40.770 We allocate the size of a node. 01:14:40.770 --> 01:14:43.080 We check whether it equals null, we initialize it 01:14:43.080 --> 01:14:46.412 with a value we care about, and by default, we sit next equal to null. 01:14:46.412 --> 01:14:48.120 And pictorially, it might look like this. 01:14:48.120 --> 01:14:50.495 It's kind of floating somewhere in the computer's memory. 01:14:50.495 --> 01:14:53.130 I have this temporary variable n, no longer pictured, 01:14:53.130 --> 01:14:55.470 that I'm just pointing at when I allocate the number 1. 01:14:55.470 --> 01:14:56.820 So what does this look like? 01:14:56.820 --> 01:15:01.510 This is like having the number 1 maybe somewhere over here. 01:15:01.510 --> 01:15:02.760 And I'll just put it in place. 01:15:02.760 --> 01:15:05.410 We got lucky, and there was a chunk of memory right there. 01:15:05.410 --> 01:15:07.140 So what do I want to now do? 01:15:07.140 --> 01:15:08.920 Well, I want to go ahead and connect this. 01:15:08.920 --> 01:15:12.720 So what I could do, just intuitively, if 1 should go before 2, 01:15:12.720 --> 01:15:14.220 I can unplug this. 01:15:14.220 --> 01:15:17.980 And I can plug this into here, which makes sense. 01:15:17.980 --> 01:15:19.560 But there's already a problem. 01:15:19.560 --> 01:15:22.290 If I have done nothing else up until this point, 01:15:22.290 --> 01:15:26.580 I have just orphaned three nodes, 2, 4, and 5. 01:15:26.580 --> 01:15:29.370 To orphan a node means to forget where it is. 01:15:29.370 --> 01:15:32.070 And if I don't have another variable in my code, 01:15:32.070 --> 01:15:34.080 or if I'm not acting out with one of my hands 01:15:34.080 --> 01:15:36.960 pointing at the original beginning of the list, 01:15:36.960 --> 01:15:39.330 I have literally orphaned the rest of the list. 01:15:39.330 --> 01:15:41.970 And the technical implication of that, per last week, 01:15:41.970 --> 01:15:44.010 is that now I have a massive memory leak. 01:15:44.010 --> 01:15:46.560 You have just leaked the size of three nodes 01:15:46.560 --> 01:15:49.320 in memory that you can literally never get back 01:15:49.320 --> 01:15:51.310 until you reboot the computer, for instance, 01:15:51.310 --> 01:15:54.742 or the program quits and the operating system cleans things up for you. 01:15:54.742 --> 01:15:55.950 So you don't want to do this. 01:15:55.950 --> 01:15:58.570 Order of operations actually matters. 01:15:58.570 --> 01:16:01.140 So what I should probably do is this. 01:16:01.140 --> 01:16:04.340 When I insert the number 1 first, I should probably 01:16:04.340 --> 01:16:07.090 recognize that, well, the one begins at the beginning of the list. 01:16:07.090 --> 01:16:10.980 So what I should really do is point this arrow also at the same node. 01:16:10.980 --> 01:16:13.630 And we'll sort of do it a little sloppily like that. 01:16:13.630 --> 01:16:16.720 But let me stipulate those are both pointing at the same node. 01:16:16.720 --> 01:16:21.990 Now that my new node, AKA n and the code I showed is pointing at this thing, 01:16:21.990 --> 01:16:23.820 now I can do kind of a switcheroo. 01:16:23.820 --> 01:16:27.750 Because I'm already pointing at the final destination there. 01:16:27.750 --> 01:16:30.720 And now I can remove this safely, because this is my list. 01:16:30.720 --> 01:16:31.710 This is n. 01:16:31.710 --> 01:16:33.625 Therefore, I have variables pointing to both 01:16:33.625 --> 01:16:35.500 and I can go ahead and insert that correctly. 01:16:35.500 --> 01:16:38.280 So long story short, order of operations matters. 01:16:38.280 --> 01:16:43.800 So graphically, if I were to do this as before, just by saying list equals n, 01:16:43.800 --> 01:16:48.950 if this is n, and this is list, and I adjust this arrow first, 01:16:48.950 --> 01:16:50.420 bad things are going to happen. 01:16:50.420 --> 01:16:54.890 Indeed, we end up orphaning 2, 4, and 5, thereby leaking a significant amount 01:16:54.890 --> 01:16:55.850 of memory potentially. 01:16:55.850 --> 01:16:57.645 And leaking any memory, typically, is bad. 01:16:57.645 --> 01:16:58.770 So I don't want to do that. 01:16:58.770 --> 01:17:00.200 So let's look at the correct code. 01:17:00.200 --> 01:17:03.350 The correct code is going to be to start at n 01:17:03.350 --> 01:17:08.210 and update its next field, this arrow here, to point at the same thing 01:17:08.210 --> 01:17:10.550 as the list was originally pointing at. 01:17:10.550 --> 01:17:14.600 And then go ahead and update the list, such that both of them 01:17:14.600 --> 01:17:16.220 are currently pointing in duplicate. 01:17:16.220 --> 01:17:18.750 Then update the list to point to the new node. 01:17:18.750 --> 01:17:21.410 So, again, the code's a little different this time from before. 01:17:21.410 --> 01:17:23.243 Because before we kept adding it to the end, 01:17:23.243 --> 01:17:26.390 or I proposed verbally that we just add it to the beginning. 01:17:26.390 --> 01:17:28.530 Here, we're adding it, indeed, at the beginning. 01:17:28.530 --> 01:17:31.792 And so the actual steps, the actual code, are a little bit different. 01:17:31.792 --> 01:17:34.250 Well, let's do one final example, if we want to allocate 3. 01:17:34.250 --> 01:17:37.460 Well, I've got to malloc another node, the number 3. 01:17:37.460 --> 01:17:42.090 Suppose that ends up somewhere in the computer's memory. 01:17:42.090 --> 01:17:45.330 Let's go ahead and plop this one over here. 01:17:45.330 --> 01:17:46.850 So now 3 is in place. 01:17:46.850 --> 01:17:48.630 How do I now insert this thing? 01:17:48.630 --> 01:17:52.460 Well, similar to before, I'm not going to want to update this pointer 01:17:52.460 --> 01:17:56.060 and go like this and then plug this guy in over here. 01:17:56.060 --> 01:17:57.830 Because now I've orphaned those two nodes. 01:17:57.830 --> 01:17:59.870 So that, again is the wrong step. 01:17:59.870 --> 01:18:03.052 When you're inside the middle of a linked list, any code 01:18:03.052 --> 01:18:04.760 that you write to insert into the middle, 01:18:04.760 --> 01:18:09.260 if you care about inserting in sorted order, this should be updated first. 01:18:09.260 --> 01:18:12.500 And odds are I should kind of cheat and point this at the same thing, 01:18:12.500 --> 01:18:14.930 even though there's only one physical plug at the moment. 01:18:14.930 --> 01:18:18.300 So we'll just pretend that this is working. 01:18:18.300 --> 01:18:19.140 There we go. 01:18:19.140 --> 01:18:23.430 And now I can go ahead and safely say that n and the previous node 01:18:23.430 --> 01:18:25.200 are already pointing where they should be. 01:18:25.200 --> 01:18:28.490 So now it's safe for me to unplug this one 01:18:28.490 --> 01:18:31.820 and go ahead and update this final arrow to point 01:18:31.820 --> 01:18:35.820 at the new node in the correct location. 01:18:35.820 --> 01:18:37.250 So let's see that in code again. 01:18:37.250 --> 01:18:41.480 If I go here, I've got graphically the node 3 kind of floating in space. 01:18:41.480 --> 01:18:45.830 I first update its next field to point also at the 4 in duplicate. 01:18:45.830 --> 01:18:48.620 Then, I update the 2 to point to the 3. 01:18:48.620 --> 01:18:53.630 The goal being, again, to avoid any leaking of memory or orphaning 01:18:53.630 --> 01:18:55.250 of nodes. 01:18:55.250 --> 01:18:59.340 All right , we are about to leave linked lists behind. 01:18:59.340 --> 01:19:02.760 Because as multiple of you have noted or probably thought, 01:19:02.760 --> 01:19:05.300 they're good, but maybe not great. 01:19:05.300 --> 01:19:08.160 They're good in that they are dynamic and I can add to them, as 01:19:08.160 --> 01:19:10.590 by inserting at the beginning if I really want 01:19:10.590 --> 01:19:12.030 and don't care about sorted order. 01:19:12.030 --> 01:19:13.980 But they're still a good amount of work to do 01:19:13.980 --> 01:19:16.770 if I want to keep them in sorted order and I insert them 01:19:16.770 --> 01:19:17.940 in the middle or the end. 01:19:17.940 --> 01:19:19.830 Because that's, like, Big O of n if I keep 01:19:19.830 --> 01:19:22.090 traversing all of these darn arrows. 01:19:22.090 --> 01:19:24.600 So we get the dynamism, but we don't necessarily 01:19:24.600 --> 01:19:26.250 get the performance increase. 01:19:26.250 --> 01:19:29.580 But we fundamentally have opened up a whole new world to ourselves. 01:19:29.580 --> 01:19:32.670 We can now stitch together these data structures 01:19:32.670 --> 01:19:35.910 in memory using pointers as our thread, if you will. 01:19:35.910 --> 01:19:39.690 We can just use memory as a canvas, painting on it any values we want. 01:19:39.690 --> 01:19:42.510 And we can sort of remember where all of those values are. 01:19:42.510 --> 01:19:47.010 But this is indeed very single, one dimensional, left to right. 01:19:47.010 --> 01:19:49.763 What if we give ourselves a second dimension? 01:19:49.763 --> 01:19:51.930 What if we start thinking sort of not left to right, 01:19:51.930 --> 01:19:53.632 but also left to right, up, down. 01:19:53.632 --> 01:19:55.590 So, again, this is meaningless to the computer. 01:19:55.590 --> 01:19:59.100 The computer just thinks of memory as being byte 0, 1, 2, 3. 01:19:59.100 --> 01:20:01.680 But we humans can kind of think of these data structures 01:20:01.680 --> 01:20:04.620 a little more abstractly, a little more simply. 01:20:04.620 --> 01:20:06.780 And we can think about them in a way familiar, 01:20:06.780 --> 01:20:08.170 perhaps, to us in the real world. 01:20:08.170 --> 01:20:10.630 Trees, not the ones so much that grow from the ground, 01:20:10.630 --> 01:20:12.412 but if you're familiar with family trees. 01:20:12.412 --> 01:20:14.370 Where you might have a matriarch or a patriarch 01:20:14.370 --> 01:20:16.870 and then sort of descendants hanging off of them graphically 01:20:16.870 --> 01:20:19.662 on a piece of paper, something you might have made in grade school, 01:20:19.662 --> 01:20:20.470 for instance. 01:20:20.470 --> 01:20:22.860 We can leverage this idea of a tree structure 01:20:22.860 --> 01:20:26.730 that has a root that kind of branches and branches and branches 01:20:26.730 --> 01:20:28.390 and grows top to bottom. 01:20:28.390 --> 01:20:32.610 So, again, more like a family tree than an actual tree in the soil. 01:20:32.610 --> 01:20:36.720 So with trees, it turns out, this idea of a tree, 01:20:36.720 --> 01:20:40.410 we can take some of the lessons learned from linked lists. 01:20:40.410 --> 01:20:43.800 But we can gain back some of the features of arrays. 01:20:43.800 --> 01:20:45.220 And we can do that as follows. 01:20:45.220 --> 01:20:48.420 Consider the following definition of what we're 01:20:48.420 --> 01:20:50.400 about to call a binary search tree. 01:20:50.400 --> 01:20:54.360 Binary search being a very good thing from the first, Week 0. 01:20:54.360 --> 01:20:57.030 And a tree now being the new idea. 01:20:57.030 --> 01:21:00.750 Here's an array from Week 0, Week 1, Week 2, whenever. 01:21:00.750 --> 01:21:02.160 And it's of size 7. 01:21:02.160 --> 01:21:05.802 And recall that if it's sorted, we can apply binary search to this array. 01:21:05.802 --> 01:21:06.510 And that's great. 01:21:06.510 --> 01:21:08.218 Because if we want to search for a value, 01:21:08.218 --> 01:21:09.780 we can start looking in the middle. 01:21:09.780 --> 01:21:13.080 Then we can go either left or right, halfway between each. 01:21:13.080 --> 01:21:15.840 And then we can similarly go left or right. 01:21:15.840 --> 01:21:20.460 So binary search on a sorted array was so powerful 01:21:20.460 --> 01:21:23.010 because it was Big O of log n, we've concluded, 01:21:23.010 --> 01:21:27.330 by just having and having and having the problem again and again, 01:21:27.330 --> 01:21:30.300 tearing the phone book in half again and again and again. 01:21:30.300 --> 01:21:33.300 But the problem with binary search is that it 01:21:33.300 --> 01:21:38.010 requires that you use an array so that you have random access. 01:21:38.010 --> 01:21:42.240 You have to be able to index into the array in constant time 01:21:42.240 --> 01:21:46.830 using simple arithmetic, like bracket 0, bracket n minus 1, bracket n minus 1 01:21:46.830 --> 01:21:48.720 divided by 2, to get the halfway point. 01:21:48.720 --> 01:21:52.470 You have to be able to do arithmetic on the data structure. 01:21:52.470 --> 01:21:56.280 And we've just proposed getting rid of that random access 01:21:56.280 --> 01:21:59.070 by transitioning to a dynamic data structure 01:21:59.070 --> 01:22:01.530 like a linked list instead of an array. 01:22:01.530 --> 01:22:02.740 But what if we do this? 01:22:02.740 --> 01:22:07.140 What if you and I start thinking not on one dimension, but on two dimensions? 01:22:07.140 --> 01:22:10.270 And what if we alter our thinking to be like this? 01:22:10.270 --> 01:22:13.710 So think of an array, perhaps, as being a two dimensional 01:22:13.710 --> 01:22:18.000 structure that has not only with or length, but also height. 01:22:18.000 --> 01:22:21.750 And so we maintain, it seems, visually this relationship 01:22:21.750 --> 01:22:23.470 between all of these values. 01:22:23.470 --> 01:22:24.240 But you know what? 01:22:24.240 --> 01:22:28.830 We can stitch all of these values together using what? 01:22:28.830 --> 01:22:29.700 Well, pointers. 01:22:29.700 --> 01:22:32.130 Pointers are this new building block that we can use 01:22:32.130 --> 01:22:34.410 to stitch together things in memory. 01:22:34.410 --> 01:22:36.480 If the things in memory are numbers, that's fine. 01:22:36.480 --> 01:22:37.230 They're integers. 01:22:37.230 --> 01:22:40.440 But if we throw a little more memory at them, if we use a node, 01:22:40.440 --> 01:22:42.840 and we kind of wrap the integer in a node such 01:22:42.840 --> 01:22:46.200 that that node contains not only numbers but pointers, 01:22:46.200 --> 01:22:48.420 we could probably draw a picture like this, not 01:22:48.420 --> 01:22:51.510 unlike a family tree, where there's a root 01:22:51.510 --> 01:22:54.960 node, at the very top in this case, and then children, so to speak. 01:22:54.960 --> 01:22:59.670 Left child and right child, and that definition repeats again and again. 01:22:59.670 --> 01:23:04.170 And it turns out the computer scientists do use this data structure in order 01:23:04.170 --> 01:23:07.320 to have the dynamism of a linked list, where 01:23:07.320 --> 01:23:10.680 you can add more and more nodes to the tree 01:23:10.680 --> 01:23:13.740 by just adding more and more squares even lower 01:23:13.740 --> 01:23:15.390 than the 1, the 3, the 5, and the 7. 01:23:15.390 --> 01:23:18.307 And just use more pointers to kind of stitch them together, to sort of 01:23:18.307 --> 01:23:22.020 grow the tree vertically, down, down, down, if you will. 01:23:22.020 --> 01:23:25.530 But a good computer scientist would recognize that you shouldn't just 01:23:25.530 --> 01:23:27.780 put these numbers in random locations, otherwise, 01:23:27.780 --> 01:23:29.490 you're really just wasting your time. 01:23:29.490 --> 01:23:30.830 You should use some algorithm. 01:23:30.830 --> 01:23:31.330 And notice. 01:23:31.330 --> 01:23:36.640 Does anyone notice the pattern to this tree? 01:23:36.640 --> 01:23:39.850 Can anyone verbalize or textualize in the chat 01:23:39.850 --> 01:23:44.410 what pattern is manifest by these seven nodes in this tree? 01:23:44.410 --> 01:23:46.130 They're not randomly ordered. 01:23:46.130 --> 01:23:50.230 They're very deliberately ordered left to right, top to bottom, 01:23:50.230 --> 01:23:51.040 in a certain way. 01:23:51.040 --> 01:23:55.360 Can anyone put their finger on what the definition of this thing is? 01:23:55.360 --> 01:23:57.700 What is the most important characteristic besides it 01:23:57.700 --> 01:24:00.820 just being drawn like a family tree? 01:24:00.820 --> 01:24:02.050 [? Greg? ?] 01:24:02.050 --> 01:24:05.353 AUDIENCE: You have put in the middle of them, 01:24:05.353 --> 01:24:07.270 on top of them you have put the middle number. 01:24:07.270 --> 01:24:09.550 For example, between 1 and 3, you have put 2. 01:24:09.550 --> 01:24:13.240 Between 5 and 7, you have put 6. 01:24:13.240 --> 01:24:15.605 So on top of them, you have put the middle number. 01:24:15.605 --> 01:24:16.480 DAVID MALAN: Exactly. 01:24:16.480 --> 01:24:18.272 There's this pattern to all of the numbers. 01:24:18.272 --> 01:24:19.330 Between 1 and 3 is 2. 01:24:19.330 --> 01:24:20.740 Between 5 and 7 is 6. 01:24:20.740 --> 01:24:22.595 Between 2 and 6 is 4. 01:24:22.595 --> 01:24:24.970 And it doesn't even have to be the middle number, per se. 01:24:24.970 --> 01:24:27.310 I can generalize it a little bit and comment 01:24:27.310 --> 01:24:30.760 that, if you pick any node in this tree, so to speak, 01:24:30.760 --> 01:24:33.700 its left child will be less than its value, 01:24:33.700 --> 01:24:36.610 and its right child's will be greater than its value. 01:24:36.610 --> 01:24:38.290 And we can do that again and again. 01:24:38.290 --> 01:24:39.100 So here's 4. 01:24:39.100 --> 01:24:40.270 Its left child is 2. 01:24:40.270 --> 01:24:41.050 That's less than. 01:24:41.050 --> 01:24:41.650 Here's 4. 01:24:41.650 --> 01:24:42.760 It's right child to 6. 01:24:42.760 --> 01:24:43.750 That's greater than. 01:24:43.750 --> 01:24:44.750 We can do this again. 01:24:44.750 --> 01:24:45.490 Let's go to 2. 01:24:45.490 --> 01:24:47.350 Its left child is 1, which is less. 01:24:47.350 --> 01:24:49.480 Its right child is 3, which is more. 01:24:49.480 --> 01:24:51.850 6, its left child is 5, which is less. 01:24:51.850 --> 01:24:54.660 6, its right child is 7, which is more. 01:24:54.660 --> 01:24:56.410 And so this is actually, if you don't mind 01:24:56.410 --> 01:25:00.730 the revisiting recursion from last week, this is a recursive definition. 01:25:00.730 --> 01:25:03.140 This is a recursive data structure. 01:25:03.140 --> 01:25:05.230 So it's not only algorithms or functions that 01:25:05.230 --> 01:25:07.330 can be recursive by calling themselves. 01:25:07.330 --> 01:25:10.300 A data structure can also be recursive. 01:25:10.300 --> 01:25:12.440 After all, what is this thing? 01:25:12.440 --> 01:25:13.570 This is a tree, yes. 01:25:13.570 --> 01:25:14.590 I'll stipulate. 01:25:14.590 --> 01:25:18.520 But it's technically a tree with two trees. 01:25:18.520 --> 01:25:19.120 Right? 01:25:19.120 --> 01:25:23.140 This node here, number four, technically has two children. 01:25:23.140 --> 01:25:25.750 And each of those children is itself a tree. 01:25:25.750 --> 01:25:27.070 It's a smaller tree. 01:25:27.070 --> 01:25:29.830 But it's the same exact definition again and again. 01:25:29.830 --> 01:25:32.710 And any time we see a recursive data structure, 01:25:32.710 --> 01:25:35.350 it's actually going to be an opportunity to use recursive code, 01:25:35.350 --> 01:25:37.183 which we'll take a look at in just a moment. 01:25:37.183 --> 01:25:39.290 But for now, notice what we've achieved again, 01:25:39.290 --> 01:25:42.700 the dynamism of using pointer so that we can, if we want, 01:25:42.700 --> 01:25:45.490 add more nodes to this tree, as by stringing them 01:25:45.490 --> 01:25:47.710 along the bottom in the correct order. 01:25:47.710 --> 01:25:51.970 And yet, we've preserved an important order for binary search, 01:25:51.970 --> 01:25:55.960 AKA, the formal name of this data structure, binary search tree, 01:25:55.960 --> 01:26:00.190 by making sure that left child is always less, right child is always more. 01:26:00.190 --> 01:26:03.790 Because now, we can go about searching this thing more efficiently. 01:26:03.790 --> 01:26:04.300 How? 01:26:04.300 --> 01:26:06.800 Well, if I want to search for the number 3, what do I do? 01:26:06.800 --> 01:26:08.592 Well, I start at the beginning of the tree. 01:26:08.592 --> 01:26:13.028 Just like with a linked list, you start at the beginning of the linked list. 01:26:13.028 --> 01:26:15.820 So with the tree, you start with the root of the tree [? always. ?] 01:26:15.820 --> 01:26:17.410 Suppose I want to search for 3. 01:26:17.410 --> 01:26:18.800 Well, what do I do? 01:26:18.800 --> 01:26:21.220 Well, 3 is obviously less than 4. 01:26:21.220 --> 01:26:23.980 So just like in Week 0 where I tore the phone book in half, 01:26:23.980 --> 01:26:27.340 you can now think of this as chopping down half of the tree. 01:26:27.340 --> 01:26:30.130 Because you know 3, if it's present, is definitely not 01:26:30.130 --> 01:26:33.790 going to be anywhere over here, so we can focus our attention down here. 01:26:33.790 --> 01:26:34.835 Here's the number 2. 01:26:34.835 --> 01:26:35.710 This is another tree. 01:26:35.710 --> 01:26:37.780 It's just a smaller subtree, if you will. 01:26:37.780 --> 01:26:39.010 How do I find the number 3? 01:26:39.010 --> 01:26:41.290 Well, I look to the right because it's greater than. 01:26:41.290 --> 01:26:43.370 And boom, I found it. 01:26:43.370 --> 01:26:46.330 But by contrast, suppose I were searching for the number 8. 01:26:46.330 --> 01:26:47.350 I would start here. 01:26:47.350 --> 01:26:48.130 I would look here. 01:26:48.130 --> 01:26:48.880 I would look here. 01:26:48.880 --> 01:26:51.560 And then conclude, no, it's not there. 01:26:51.560 --> 01:26:53.680 But, again, every time I search for that 8, 01:26:53.680 --> 01:26:58.400 I'm ignoring this half of the tree, this half of the subtree, and so forth. 01:26:58.400 --> 01:27:00.220 So you're going to achieve, it would seem, 01:27:00.220 --> 01:27:03.730 the same kind of power, the same kind of performance as we saw from Week 0. 01:27:03.730 --> 01:27:06.250 So how do we translate this idea now into code? 01:27:06.250 --> 01:27:08.110 We have all the building blocks already. 01:27:08.110 --> 01:27:11.020 Let me go ahead and propose that, instead of the node 01:27:11.020 --> 01:27:15.130 we used before for a linked list, which looked like this, with a number and one 01:27:15.130 --> 01:27:16.480 pointer called next-- 01:27:16.480 --> 01:27:19.438 but, again, we could have called those things anything-- let's go ahead 01:27:19.438 --> 01:27:24.580 and make room for not just a number, but also two pointers, one 01:27:24.580 --> 01:27:27.370 that I'll call left, one that I'll call right. 01:27:27.370 --> 01:27:30.730 Both of those is still a pointer to a struct node. 01:27:30.730 --> 01:27:35.170 So same terminology as before, but now I have two pointers instead of one 01:27:35.170 --> 01:27:38.560 so that one can conceptually point to the left and point to a smaller 01:27:38.560 --> 01:27:39.190 subtree. 01:27:39.190 --> 01:27:43.120 One can point to the right and point to a larger subtree. 01:27:43.120 --> 01:27:46.102 So how do we go about implementing something like binary search? 01:27:46.102 --> 01:27:47.560 Well, let's actually see some code. 01:27:47.560 --> 01:27:50.140 And this is where recursion really gets kind of cool. 01:27:50.140 --> 01:27:53.560 We kind of forced it when building [? Maro's ?] pyramid with recursion. 01:27:53.560 --> 01:27:55.180 Like, yeah, you can do it. 01:27:55.180 --> 01:27:59.110 And yes, the pyramid was, I claimed, a recursive physical structure 01:27:59.110 --> 01:28:00.880 or virtual structure in the game. 01:28:00.880 --> 01:28:05.680 But with data structures and pointers, now recursion really starts to shine. 01:28:05.680 --> 01:28:06.970 So let's consider this. 01:28:06.970 --> 01:28:10.060 If I declare a function in C whose purpose in life 01:28:10.060 --> 01:28:13.660 is to search a tree for a number, it's going to, by definition, 01:28:13.660 --> 01:28:15.580 search from the root on down. 01:28:15.580 --> 01:28:16.820 How do we implement this? 01:28:16.820 --> 01:28:19.278 Well, my function, I'll propose, is going to return a bool. 01:28:19.278 --> 01:28:22.180 True or false, the number is in the tree, yes or no. 01:28:22.180 --> 01:28:27.070 It's going to take two arguments, a pointer to a node, AKA tree. 01:28:27.070 --> 01:28:29.442 I could call it root or anything else. 01:28:29.442 --> 01:28:31.150 And it's going to take a number, which is 01:28:31.150 --> 01:28:36.020 the number I care about, whether it's four or six or eight or anything else. 01:28:36.020 --> 01:28:38.450 So what's going to be my first chunk of code? 01:28:38.450 --> 01:28:41.760 Well, let me do the best practice that I keep preaching. 01:28:41.760 --> 01:28:44.860 Any time you're dealing with pointers, check for null 01:28:44.860 --> 01:28:47.710 so that your program doesn't freeze or crash or bad thing happens. 01:28:47.710 --> 01:28:48.460 Because who knows? 01:28:48.460 --> 01:28:51.190 Maybe you will accidentally or maybe intentionally 01:28:51.190 --> 01:28:53.830 pass this function on null pointer, because there's no tree. 01:28:53.830 --> 01:28:55.000 Maybe you'll screw up. 01:28:55.000 --> 01:28:57.880 And that's OK, so long as your code is self-defensive. 01:28:57.880 --> 01:28:59.740 So always check pointers for null. 01:28:59.740 --> 01:29:02.710 If so, if the tree is null, that is, there is no tree there, 01:29:02.710 --> 01:29:04.610 obviously, the number is not present. 01:29:04.610 --> 01:29:06.110 So you just return false. 01:29:06.110 --> 01:29:08.410 So that's one of our base cases, so to speak. 01:29:08.410 --> 01:29:14.870 Else, if the number you're looking for is less than the tree's own number-- 01:29:14.870 --> 01:29:19.390 so, again, this arrow notation means take tree, which is a node *, 01:29:19.390 --> 01:29:20.740 so take this pointer. 01:29:20.740 --> 01:29:22.960 Go there to the actual node. 01:29:22.960 --> 01:29:25.450 And look in its own number field. 01:29:25.450 --> 01:29:28.390 If the number you're looking for from the argument 01:29:28.390 --> 01:29:32.010 is less than the number in the tree's own number field, 01:29:32.010 --> 01:29:33.760 well, that means that you want to go left. 01:29:33.760 --> 01:29:36.593 And whereas in the phone book, I went to the left of the phone book, 01:29:36.593 --> 01:29:39.130 here we're going to go to the left subtree. 01:29:39.130 --> 01:29:40.630 But how do I search a subtree? 01:29:40.630 --> 01:29:45.430 Here's where it's important that a tree is a recursive data structure. 01:29:45.430 --> 01:29:50.000 A tree is two subtrees with a new root node, as before. 01:29:50.000 --> 01:29:54.340 So I already have code via which I can search a smaller tree. 01:29:54.340 --> 01:29:57.730 So I can just say search the left subtree, 01:29:57.730 --> 01:30:00.580 as expressed here, which means start at the current node 01:30:00.580 --> 01:30:04.340 and go to the left child and pass in the same number. 01:30:04.340 --> 01:30:05.530 The number is not changing. 01:30:05.530 --> 01:30:06.940 But the tree is getting smaller. 01:30:06.940 --> 01:30:09.790 I've effectively, in code, chopped the tree in half. 01:30:09.790 --> 01:30:11.320 And I'm ignoring the right half. 01:30:11.320 --> 01:30:13.990 And I'm returning whatever that answer is. 01:30:13.990 --> 01:30:16.390 Otherwise, if the number I care about is greater 01:30:16.390 --> 01:30:18.760 than the number in the current node, do the opposite. 01:30:18.760 --> 01:30:22.132 Search the right subtree, passing in the same number. 01:30:22.132 --> 01:30:23.840 So, again, just like with the phone book, 01:30:23.840 --> 01:30:25.340 it kept getting smaller and smaller? 01:30:25.340 --> 01:30:28.750 Here I keep searching a smaller and smaller subtree. 01:30:28.750 --> 01:30:31.810 Because I keep chopping off branches left or right as 01:30:31.810 --> 01:30:33.710 I go from top to bottom. 01:30:33.710 --> 01:30:35.530 There's one final case. 01:30:35.530 --> 01:30:38.720 And let me toss this out to the group. 01:30:38.720 --> 01:30:41.110 There's a fourth case. 01:30:41.110 --> 01:30:45.850 Verbally or textually, what else should I be checking for and doing here? 01:30:45.850 --> 01:30:48.370 I've left room for one final case. 01:30:48.370 --> 01:30:51.833 BRIAN: A few people are suggesting if the tree itself is the number. 01:30:51.833 --> 01:30:54.250 DAVID MALAN: If the tree itself contains the number, yeah. 01:30:54.250 --> 01:30:57.850 So if the number in the tree equals equals 01:30:57.850 --> 01:31:02.300 the number I'm looking for, go ahead and, only in this case, return true. 01:31:02.300 --> 01:31:04.840 And so this is where the code, again, gets kind of-- 01:31:04.840 --> 01:31:07.570 recursion, rather, gets a little mind-bending. 01:31:07.570 --> 01:31:11.120 I only have false up here, true here. 01:31:11.120 --> 01:31:16.040 But not in either of these middle two branches, no pun intended, if you will. 01:31:16.040 --> 01:31:17.170 But that's OK. 01:31:17.170 --> 01:31:21.220 Because my code is designed in such a way that if I search the left subtree 01:31:21.220 --> 01:31:24.580 and there's nothing there, like, literally I'm at the leaf of the tree, 01:31:24.580 --> 01:31:26.535 so to speak, then it's going to return false. 01:31:26.535 --> 01:31:27.160 So that's fine. 01:31:27.160 --> 01:31:28.690 That's, like, if I search for the number 8. 01:31:28.690 --> 01:31:29.890 It's not even in the tree. 01:31:29.890 --> 01:31:32.640 I'm only going to realize that once I fall off the end of the tree 01:31:32.640 --> 01:31:33.740 and see, oops, null. 01:31:33.740 --> 01:31:35.180 I'll just return false. 01:31:35.180 --> 01:31:38.750 But if I ever see the number along the way, I will return true. 01:31:38.750 --> 01:31:43.145 And these two inner calls to search, these two recursive calls to search, 01:31:43.145 --> 01:31:44.770 are just kind of like passing the buck. 01:31:44.770 --> 01:31:47.590 Instead of answering true or false themselves, 01:31:47.590 --> 01:31:51.070 they're returning whatever the answer to a smaller question 01:31:51.070 --> 01:31:55.670 is by searching the left or right tree instead respectively. 01:31:55.670 --> 01:31:57.520 So, again, this is where recursion starts 01:31:57.520 --> 01:32:02.620 to get not really forced or even necessarily really as forced, 01:32:02.620 --> 01:32:04.330 but really as appropriate. 01:32:04.330 --> 01:32:08.500 When your data is itself recursive, then recursion in as a coding technique 01:32:08.500 --> 01:32:10.600 really rather shines. 01:32:10.600 --> 01:32:14.020 So if ultimately, we have-- oh, [? an ?] minor optimization. 01:32:14.020 --> 01:32:15.975 As we might have noted in Week 0 with Scratch, 01:32:15.975 --> 01:32:18.850 we, of course, don't need to explicitly check if the number is equal. 01:32:18.850 --> 01:32:22.695 We can get rid of that last condition and just assume that if it's not null 01:32:22.695 --> 01:32:24.820 and it's not to the left and it's not to the right, 01:32:24.820 --> 01:32:27.260 we must be standing right on top of it. 01:32:27.260 --> 01:32:29.230 And so we just return true there. 01:32:29.230 --> 01:32:31.840 Well, let me summarize the picture here. 01:32:31.840 --> 01:32:33.940 This is now a two dimensional data structure. 01:32:33.940 --> 01:32:38.020 And it's sort of better than a linked list in that now it's two dimensions. 01:32:38.020 --> 01:32:40.750 I gain back binary search, which is amazing, so 01:32:40.750 --> 01:32:45.460 long as I keep my data in sorted order per this binary search tree definition. 01:32:45.460 --> 01:32:47.480 But I've surely paid a price. 01:32:47.480 --> 01:32:47.980 Right? 01:32:47.980 --> 01:32:52.990 Nothing is absolutely better than anything else in our story thus far. 01:32:52.990 --> 01:32:56.080 So what is the downside of a tree? 01:32:56.080 --> 01:32:59.590 What price have I secretly or not so secretly paid here, 01:32:59.590 --> 01:33:02.710 while preaching the upsides of trees? 01:33:02.710 --> 01:33:06.700 And, again, the answer is often in this context sort of space, or time, 01:33:06.700 --> 01:33:13.140 or developer time, or money, or some resource, personal, or physical, 01:33:13.140 --> 01:33:14.800 or real world. 01:33:14.800 --> 01:33:15.300 Yeah. 01:33:15.300 --> 01:33:17.970 How about over to [INAUDIBLE]? 01:33:17.970 --> 01:33:24.000 AUDIENCE: So I think that inserting is no longer constant time. 01:33:24.000 --> 01:33:25.650 And I guess we need more memory. 01:33:25.650 --> 01:33:28.680 You need memory to [? sort ?] two pointers instead of one this time. 01:33:28.680 --> 01:33:29.430 DAVID MALAN: Yeah. 01:33:29.430 --> 01:33:32.220 It seems insertion is no longer constant time. 01:33:32.220 --> 01:33:35.940 Because if I need to preserve sorted order, I can't just put it at the top. 01:33:35.940 --> 01:33:37.980 I can't just keep pushing everything else down 01:33:37.980 --> 01:33:39.563 because things might get out of order. 01:33:39.563 --> 01:33:43.110 In that case, it would [? seem. ?] Or rather, even if I maintain the order, 01:33:43.110 --> 01:33:45.180 it might kind of get very long and stringy. 01:33:45.180 --> 01:33:48.507 If I add, for instance, another number, another number, 01:33:48.507 --> 01:33:50.340 and I keep jamming it at the top, I probably 01:33:50.340 --> 01:33:52.890 need to kind of keep things balanced, if you will. 01:33:52.890 --> 01:33:55.590 And yeah, the bigger point, too, is that immediately I'm 01:33:55.590 --> 01:33:57.660 using twice as many pointers. 01:33:57.660 --> 01:34:00.720 So now my node is getting even bigger than these things. 01:34:00.720 --> 01:34:03.360 I now have room for not only a number and a pointer, 01:34:03.360 --> 01:34:06.780 but another pointer, which is, of course, going to cost me more space. 01:34:06.780 --> 01:34:08.030 Again, so a trade off there. 01:34:08.030 --> 01:34:09.780 And let's go ahead and ask the group here, 01:34:09.780 --> 01:34:13.080 when it comes to insertion, why don't we consider for a moment what 01:34:13.080 --> 01:34:17.500 the running time of insertion might be when inserting into a binary search 01:34:17.500 --> 01:34:18.000 tree. 01:34:18.000 --> 01:34:20.458 If you'd like to pull up the URL as always, let me go ahead 01:34:20.458 --> 01:34:22.450 and present this one. 01:34:22.450 --> 01:34:26.650 What's the running time of inserting into a binary search tree? 01:34:26.650 --> 01:34:29.370 So if you want to insert the number 0 into that tree, 01:34:29.370 --> 01:34:36.390 if you want to insert the number 8, or anything in between or bigger 01:34:36.390 --> 01:34:39.580 or smaller, what are the answers here? 01:34:39.580 --> 01:34:42.540 So not quite as big of a victory for the tallest bar. 01:34:42.540 --> 01:34:44.680 About 60% of you think log n. 01:34:44.680 --> 01:34:46.680 And the good instincts there, frankly, are so 01:34:46.680 --> 01:34:47.970 that is going to be the right answer. 01:34:47.970 --> 01:34:49.553 And that's kind of the right instinct. 01:34:49.553 --> 01:34:52.920 Anytime you have binary search, odds are you're talking something logarithmic. 01:34:52.920 --> 01:34:56.730 But we've also seen divide and conquer and merge sort with n log n, 01:34:56.730 --> 01:34:59.385 so not unreasonable that about 10% of you think that, too. 01:34:59.385 --> 01:35:01.230 n squared would actually be bad. 01:35:01.230 --> 01:35:04.830 So n squared is, like, the worst of the times we've seen thus far. 01:35:04.830 --> 01:35:08.343 And that would suggest that a tree is even worse than a linked list, 01:35:08.343 --> 01:35:09.510 is even worse than an array. 01:35:09.510 --> 01:35:11.340 And thankfully, we're not at that point. 01:35:11.340 --> 01:35:14.790 But it's indeed log n, based on certain assumptions. 01:35:14.790 --> 01:35:16.360 So why is that? 01:35:16.360 --> 01:35:19.350 So if we consider the tree from a moment ago, 01:35:19.350 --> 01:35:21.430 it looked a little something like this. 01:35:21.430 --> 01:35:25.740 And what is involved in inserting into a tree? 01:35:25.740 --> 01:35:27.780 Well, suppose I want to insert the number 8. 01:35:27.780 --> 01:35:28.830 Well, I start here. 01:35:28.830 --> 01:35:31.410 And it obviously belongs to the right because 8 is bigger. 01:35:31.410 --> 01:35:32.157 I go here. 01:35:32.157 --> 01:35:33.990 It belongs to the right because 8 is bigger. 01:35:33.990 --> 01:35:34.570 I go here. 01:35:34.570 --> 01:35:36.250 It belongs to the right, because 8 is bigger. 01:35:36.250 --> 01:35:38.790 And so a new node is going to be created somewhere down here. 01:35:38.790 --> 01:35:41.520 And even though it doesn't fit on the screen, I could absolutely call malloc. 01:35:41.520 --> 01:35:43.020 I could update a couple of pointers. 01:35:43.020 --> 01:35:46.120 And boom, we've added an eighth node to the tree. 01:35:46.120 --> 01:35:51.720 So if it took me that many steps, starting at the root, 1, 2, 3, 01:35:51.720 --> 01:35:55.660 how do I generalize this into Big O notation? 01:35:55.660 --> 01:36:00.360 Well, a binary search tree, if you lay it out nice and prettily like this, 01:36:00.360 --> 01:36:04.290 nice and balanced if you will, the height of that binary search tree, 01:36:04.290 --> 01:36:06.810 it turns out, is going to be log of n. 01:36:06.810 --> 01:36:11.310 If n is the number of total nodes, 7 or 8 now in the story, then 01:36:11.310 --> 01:36:14.770 log base 2 of n is going to be the height of the tree. 01:36:14.770 --> 01:36:18.060 So if you take n nodes, n numbers, and you kind of 01:36:18.060 --> 01:36:22.770 balance them in this nice sorted way, the total height is going to be log n. 01:36:22.770 --> 01:36:24.690 So what is the running time of insert? 01:36:24.690 --> 01:36:27.660 Well, the running time of insert is equivalent to how many 01:36:27.660 --> 01:36:30.960 steps does it take you to find the location into which 01:36:30.960 --> 01:36:32.010 the new number belongs? 01:36:32.010 --> 01:36:33.480 Well, that's 1, 2, 3. 01:36:33.480 --> 01:36:36.720 And as it turns out, log base 2 of 8 is indeed 3. 01:36:36.720 --> 01:36:39.300 So the math actually works out perfectly in this case. 01:36:39.300 --> 01:36:41.383 Sometimes, there might be a little rounding error. 01:36:41.383 --> 01:36:45.990 But in general it's going to indeed be Big O of log n. 01:36:45.990 --> 01:36:48.420 But what if we get a little sloppy? 01:36:48.420 --> 01:36:50.550 What if we get a little sloppy and we start 01:36:50.550 --> 01:36:56.710 inserting nodes that are giving us a bit of bad luck, if you will? 01:36:56.710 --> 01:37:00.090 So for instance, suppose that I go ahead-- and let 01:37:00.090 --> 01:37:02.430 me do something on the fly here. 01:37:02.430 --> 01:37:09.570 Suppose that I go ahead and insert the number 1, the number 2, 01:37:09.570 --> 01:37:13.590 and the number 3 such that this is what logically happens. 01:37:13.590 --> 01:37:16.230 This adheres to the definition of a binary search tree. 01:37:16.230 --> 01:37:18.210 If this is the roots, it's 1. 01:37:18.210 --> 01:37:21.462 It has no left subtree, and that's not strictly a problem, 01:37:21.462 --> 01:37:24.420 because there's nothing violating the definition of a search tree here. 01:37:24.420 --> 01:37:25.560 There's just nothing there. 01:37:25.560 --> 01:37:26.850 2 is in the right place. 01:37:26.850 --> 01:37:28.690 3 is in the right place. 01:37:28.690 --> 01:37:32.460 So this, too, technically is a binary search tree. 01:37:32.460 --> 01:37:35.890 But it's a bit of a corner case, a perverse case, if you will, 01:37:35.890 --> 01:37:40.410 where the way you inserted things ended up in the binary search tree actually 01:37:40.410 --> 01:37:44.790 resembling more of a what, would you say, if you want to chime in the chat? 01:37:44.790 --> 01:37:46.712 And Brian, if you might want to relay? 01:37:46.712 --> 01:37:49.170 BRIAN: A few people are saying it looks like a linked list. 01:37:49.170 --> 01:37:49.920 DAVID MALAN: Yeah. 01:37:49.920 --> 01:37:52.050 So even though I've drawn it sort of top down, 01:37:52.050 --> 01:37:53.820 so in the sort of second dimension, that's 01:37:53.820 --> 01:37:55.590 really just an artist's rendition. 01:37:55.590 --> 01:37:58.800 This tree is a binary search tree. 01:37:58.800 --> 01:38:00.960 But it's kind of sort of also a linked list. 01:38:00.960 --> 01:38:03.450 And so even the most well-intentioned data 01:38:03.450 --> 01:38:08.490 structures, given unfortunate inputs or some bad luck or some bad design, 01:38:08.490 --> 01:38:12.790 could devolve into a different data structure just by chance. 01:38:12.790 --> 01:38:18.040 So there's a way to solve this though, even with these values. 01:38:18.040 --> 01:38:23.820 When I insert 1, 2, 3, I could allow for this perverse situation 01:38:23.820 --> 01:38:27.060 where it just gets long and stringy, at which point everything is Big O of n. 01:38:27.060 --> 01:38:28.140 It's just a linked list. 01:38:28.140 --> 01:38:30.900 It just happens to be drawn diagonally instead of left right. 01:38:30.900 --> 01:38:33.510 But does anyone see a solution intuitively? 01:38:33.510 --> 01:38:36.480 Not in terms of code, no formal language, 01:38:36.480 --> 01:38:40.500 but there is a solution here to make sure that this tree with 1, 2, 3 01:38:40.500 --> 01:38:43.500 does not get long and stringy in the first place. 01:38:43.500 --> 01:38:46.083 What might you do instead to solve this? 01:38:46.083 --> 01:38:48.000 BRIAN: A few people in the chat are suggesting 01:38:48.000 --> 01:38:51.100 you should make 2 the new root node at the top of the tree. 01:38:51.100 --> 01:38:54.390 DAVID MALAN: So if I instead make 2 the new root node, let me go ahead 01:38:54.390 --> 01:38:57.030 and mock this up real quickly. 01:38:57.030 --> 01:39:00.180 And in a moment, I'll reveal what I think you've just verbalized. 01:39:00.180 --> 01:39:02.808 What if instead, I make sure that when inserting these nodes 01:39:02.808 --> 01:39:05.850 I don't naively just keep going to the right, to the right, to the right? 01:39:05.850 --> 01:39:07.420 I exercise some judgment. 01:39:07.420 --> 01:39:09.930 And if I notice, maybe, that my data structure, my tree 01:39:09.930 --> 01:39:11.820 is getting kind of long and stringy, maybe I 01:39:11.820 --> 01:39:15.600 should kind of rotate it around, using that second dimension for real 01:39:15.600 --> 01:39:17.370 so that I change what the root is. 01:39:17.370 --> 01:39:19.470 And we won't go through the code for doing this. 01:39:19.470 --> 01:39:21.240 But it turns out this is the solution. 01:39:21.240 --> 01:39:22.980 That is exactly the right intuition. 01:39:22.980 --> 01:39:25.320 If you take a higher level class on data structures 01:39:25.320 --> 01:39:27.390 and algorithms specifically in computer science, 01:39:27.390 --> 01:39:30.960 you'll study trees like AVL trees, or red, black trees, 01:39:30.960 --> 01:39:33.510 which are different types of tree data structures. 01:39:33.510 --> 01:39:37.080 They kind of have built into them the algorithms for kind of shifting things 01:39:37.080 --> 01:39:40.530 as needed to make sure that as you insert, or maybe as you delete, 01:39:40.530 --> 01:39:42.840 you constantly rebalance the tree. 01:39:42.840 --> 01:39:46.870 And long story short, doing so might cost you a little extra time. 01:39:46.870 --> 01:39:50.970 But if you've got a lot of data, keeping that thing balanced and logarithmic 01:39:50.970 --> 01:39:54.750 in height, so to speak, and not long and stringy in the linear in height, 01:39:54.750 --> 01:39:58.290 it's probably, depending on your application, going to save you 01:39:58.290 --> 01:40:01.170 quite a bit of time overall. 01:40:01.170 --> 01:40:05.250 So we might say that insert into a balanced binary search tree 01:40:05.250 --> 01:40:07.500 is indeed Big O of log n. 01:40:07.500 --> 01:40:11.100 But that is conditional on you making sure that you keep it balanced. 01:40:11.100 --> 01:40:13.620 And that's going to be more code than we'll go into today, 01:40:13.620 --> 01:40:17.790 but indeed, a possible design decision. 01:40:17.790 --> 01:40:24.040 All right, any questions then on now trees and binary search trees, 01:40:24.040 --> 01:40:25.150 in particular? 01:40:25.150 --> 01:40:27.340 We started with arrays a few weeks ago. 01:40:27.340 --> 01:40:30.940 We've now got linked lists, which are good, better, but not great. 01:40:30.940 --> 01:40:34.900 To trees, which seemed maybe to be great but, again, it's always a trade off. 01:40:34.900 --> 01:40:37.000 They're costing us more space. 01:40:37.000 --> 01:40:39.490 But I bet we can continue to stitch some of these ideas 01:40:39.490 --> 01:40:41.860 together for other structures still. 01:40:41.860 --> 01:40:44.510 Brian, anything outstanding? 01:40:44.510 --> 01:40:45.010 BRIAN: Yeah. 01:40:45.010 --> 01:40:48.070 One question came in as to why it's a problem if you have, 01:40:48.070 --> 01:40:51.278 like, the 1, and the 2, and the 3 all in just one sequence on the right side? 01:40:51.278 --> 01:40:53.028 DAVID MALAN: Yeah, a really good question. 01:40:53.028 --> 01:40:54.070 Why is it a problem? 01:40:54.070 --> 01:40:55.150 Maybe it isn't. 01:40:55.150 --> 01:40:57.820 If you don't have a very large data set and you 01:40:57.820 --> 01:41:00.550 don't have many values in the structure, honestly, who cares? 01:41:00.550 --> 01:41:02.607 If it's three elements, definitely don't care. 01:41:02.607 --> 01:41:05.690 If it's 10 elements, if it's 1,000, heck, if your computer is fast enough, 01:41:05.690 --> 01:41:08.148 there might be a million elements, and it's not a big deal. 01:41:08.148 --> 01:41:10.510 But if it's two million elements or a billion elements, 01:41:10.510 --> 01:41:13.655 then it totally depends, again, on what is the business you're building? 01:41:13.655 --> 01:41:15.280 What is the application you're writing? 01:41:15.280 --> 01:41:16.270 How big is your data? 01:41:16.270 --> 01:41:18.220 How fast or how slow is your computer? 01:41:18.220 --> 01:41:20.187 It might very well matter ultimately. 01:41:20.187 --> 01:41:23.020 And indeed, when we've seen some of our algorithms a couple of weeks 01:41:23.020 --> 01:41:26.470 ago, when we compared bubble sort and selection sort and merge sort, 01:41:26.470 --> 01:41:29.200 even though those were in a different category of running times, 01:41:29.200 --> 01:41:33.640 n log n and n squared, just recall the appreciable difference. 01:41:33.640 --> 01:41:38.500 Log of n in the context of searching is way better than n. 01:41:38.500 --> 01:41:41.710 So if your data structure is devolving into something long and stringy, 01:41:41.710 --> 01:41:46.120 recall that's, like, searching a phone book a thousand total pages. 01:41:46.120 --> 01:41:49.960 But binary search and not letting it get long and stringy 01:41:49.960 --> 01:41:53.170 gives you, like, 10 steps instead of 1,000 01:41:53.170 --> 01:41:55.510 steps in order to search those same pages. 01:41:55.510 --> 01:41:58.900 So, again, even in Week 0 we saw the appreciable difference 01:41:58.900 --> 01:42:04.010 between these different categories of running times. 01:42:04.010 --> 01:42:06.248 All right, well, let's see if we can't maybe 01:42:06.248 --> 01:42:07.790 take some of the best of both worlds. 01:42:07.790 --> 01:42:09.260 Thus far, again, we've seen arrays. 01:42:09.260 --> 01:42:10.370 We've seen linked lists. 01:42:10.370 --> 01:42:11.780 We've seen trees. 01:42:11.780 --> 01:42:15.110 What if we kind of get a little Frankenstein here and mash things 01:42:15.110 --> 01:42:18.528 together and take sort of the best features of these things 01:42:18.528 --> 01:42:19.820 and build up something grander. 01:42:19.820 --> 01:42:22.580 In fact, I feel like the Holy Grail of a data structure 01:42:22.580 --> 01:42:27.140 would be something for which cert and insertion aren't n, Big O of n, 01:42:27.140 --> 01:42:28.340 aren't Big O of log n. 01:42:28.340 --> 01:42:30.590 But wouldn't it be amazing if there's a data structure 01:42:30.590 --> 01:42:33.950 out there where the running time is like constant time, Big O of 1. 01:42:33.950 --> 01:42:35.090 That's the Holy Grail. 01:42:35.090 --> 01:42:37.957 If you can cleverly lay out your computer's memory in such a way 01:42:37.957 --> 01:42:40.790 that if you want to search for or insert a value, boom, you're done. 01:42:40.790 --> 01:42:44.740 Boom, you're done, and none of this linear or logarithmic running time. 01:42:44.740 --> 01:42:46.820 So let's see if we can't pursue that goal. 01:42:46.820 --> 01:42:50.360 Let me propose that we introduce this topic called hash tables. 01:42:50.360 --> 01:42:53.390 Hash table is another data structure that's 01:42:53.390 --> 01:42:56.010 essentially an array of linked lists. 01:42:56.010 --> 01:42:57.980 So, again, it's this Frankenstein monster, 01:42:57.980 --> 01:43:01.200 whereby we've combined arrays ultimately with linked list. 01:43:01.200 --> 01:43:02.540 Let's see how this is done. 01:43:02.540 --> 01:43:05.347 Let me propose that we first start with an array of size 26. 01:43:05.347 --> 01:43:08.180 And I'm going to start drawing my arrays vertically, just because it 01:43:08.180 --> 01:43:09.890 sort of works out better pictorially. 01:43:09.890 --> 01:43:11.840 But, again, these are all artist's renditions, anyway. 01:43:11.840 --> 01:43:15.048 Even though we always draw arrays left to right, that's completely arbitrary. 01:43:15.048 --> 01:43:17.780 So I'm going to start, for now, drawing my array top to bottom. 01:43:17.780 --> 01:43:20.510 And suppose that the data structure I care about now 01:43:20.510 --> 01:43:22.620 is going to be even more interesting the numbers. 01:43:22.620 --> 01:43:26.450 Suppose I want to store things like names, like dictionaries, or names 01:43:26.450 --> 01:43:28.412 like contacts in your phone. 01:43:28.412 --> 01:43:30.620 If you want to keep track of all the people you know, 01:43:30.620 --> 01:43:33.620 it would be great if it doesn't take linear time or logarithmic time 01:43:33.620 --> 01:43:34.640 to find people. 01:43:34.640 --> 01:43:37.980 Instead, constant time would seem to be even better. 01:43:37.980 --> 01:43:40.945 So here's an array, for instance, size 26. 01:43:40.945 --> 01:43:42.320 And I proposed that deliberately. 01:43:42.320 --> 01:43:44.720 In English, there's 26 letters, A through Z. So 01:43:44.720 --> 01:43:48.590 let's consider location 0 is A. Location 25 is Z. 01:43:48.590 --> 01:43:53.270 And if I now go and start inserting all of my friends into my new phone, 01:43:53.270 --> 01:43:56.210 into the contacts application, where might I put them? 01:43:56.210 --> 01:43:57.650 Well, let me go ahead and do this. 01:43:57.650 --> 01:44:01.400 Let me go ahead and think of each of these elements as, again, 0 through 25, 01:44:01.400 --> 01:44:03.170 or really A through Z. 01:44:03.170 --> 01:44:07.670 And let me, upon inserting a new friend or contact into my phone, 01:44:07.670 --> 01:44:12.260 let me put them into a location that has some relationship with the name itself. 01:44:12.260 --> 01:44:15.120 Let's not just start putting them at the very beginning. 01:44:15.120 --> 01:44:17.810 Let's not necessarily put them alphabetically per se. 01:44:17.810 --> 01:44:22.310 Let's actually put them at a specific location in this array, not just top 01:44:22.310 --> 01:44:24.980 to bottom, but at a specific entry. 01:44:24.980 --> 01:44:28.310 So suppose the first person I want to add to my contacts is Albus. 01:44:28.310 --> 01:44:31.820 Well, I'm going to propose that because Albus starts with an A, 01:44:31.820 --> 01:44:37.880 he is going to go into the A location, so the very first entry in this array. 01:44:37.880 --> 01:44:40.160 Suppose I next want to add Zacharias. 01:44:40.160 --> 01:44:43.700 Well, his name starts with Z. So he's going to go in the very last location. 01:44:43.700 --> 01:44:45.350 So, again, I'm jumping around. 01:44:45.350 --> 01:44:46.970 I went from 0 to 25. 01:44:46.970 --> 01:44:50.030 But it's an array, and I can do that in constant time. 01:44:50.030 --> 01:44:53.400 You can randomly index to any element using square brackets. 01:44:53.400 --> 01:44:54.672 So this is both constant time. 01:44:54.672 --> 01:44:56.630 I don't have to just put him right after Albus. 01:44:56.630 --> 01:44:57.980 I can put him wherever I want. 01:44:57.980 --> 01:44:59.930 Suppose the third person is Hermione. 01:44:59.930 --> 01:45:02.360 Well, I'm going to put her at location H. Why? 01:45:02.360 --> 01:45:04.610 Because I can do the math, and I can figure out H. OK, 01:45:04.610 --> 01:45:07.580 I can just jump immediately to that letter of the alphabet. 01:45:07.580 --> 01:45:10.760 And in turn, thanks to ASCII and doing a bit of arithmetic, 01:45:10.760 --> 01:45:13.110 I convert that to a number as well. 01:45:13.110 --> 01:45:16.310 So she ends up at 0, 1, 2, 3, 4, 5, 6, 7, 01:45:16.310 --> 01:45:21.570 because H ends up mapping to the eighth character, or location 7. 01:45:21.570 --> 01:45:22.820 All right. 01:45:22.820 --> 01:45:23.630 Who else? 01:45:23.630 --> 01:45:25.950 All these other people end up in my address book. 01:45:25.950 --> 01:45:27.210 And so they're all spread out. 01:45:27.210 --> 01:45:28.670 I don't have as many as 26 friends. 01:45:28.670 --> 01:45:30.360 So there's some gaps in the data there. 01:45:30.360 --> 01:45:32.180 But I fit everyone here. 01:45:32.180 --> 01:45:33.750 But there might be a problem. 01:45:33.750 --> 01:45:35.480 And you can perhaps see this coming. 01:45:35.480 --> 01:45:37.010 Thus far, I've kind of gotten lucky. 01:45:37.010 --> 01:45:40.790 And I've only know people whose names are-- uniquely start with a letter. 01:45:40.790 --> 01:45:45.710 But as soon as I meet someone at school and I add them to my contacts, 01:45:45.710 --> 01:45:49.560 well, now Harry, for instance, has to go in the same location. 01:45:49.560 --> 01:45:52.850 Now this is a problem if I want to store both Hermione and Harry, 01:45:52.850 --> 01:45:54.680 because their names both start with H. 01:45:54.680 --> 01:45:57.770 But, again if it's an array, it's absolutely a deal breaker. 01:45:57.770 --> 01:45:59.840 At that point, all things break down. 01:45:59.840 --> 01:46:01.880 Because I could, yes, grow the array. 01:46:01.880 --> 01:46:04.880 But if I grow the array, then it's size 27. 01:46:04.880 --> 01:46:08.120 And then it's, like, how do I know what number is what letter at that point? 01:46:08.120 --> 01:46:09.950 It just devolves into a complete mess. 01:46:09.950 --> 01:46:12.530 But if I borrow the idea of a linked list, what if I 01:46:12.530 --> 01:46:15.860 make my array an array of linked lists? 01:46:15.860 --> 01:46:17.900 So yes, even though there's this collision 01:46:17.900 --> 01:46:22.400 where both Hermione and Harry belong at the same location in the array, 01:46:22.400 --> 01:46:23.180 that's fine. 01:46:23.180 --> 01:46:26.570 In the event this happens, I'm just going to kind of stitch them together 01:46:26.570 --> 01:46:28.830 into a linked list from left to right. 01:46:28.830 --> 01:46:29.690 So it's not ideal. 01:46:29.690 --> 01:46:32.570 Because now it takes me two steps to get to Harry instead of 1, 01:46:32.570 --> 01:46:35.060 using simple arithmetic and square bracket notation. 01:46:35.060 --> 01:46:37.880 But heck, at least I can still fit him in my address book. 01:46:37.880 --> 01:46:40.220 So a bit of a trade off, but feels reasonable. 01:46:40.220 --> 01:46:41.630 Well, someone else, Hagrid-- 01:46:41.630 --> 01:46:43.970 all right, it's not ideal that now it takes me 01:46:43.970 --> 01:46:46.370 three steps to get to Hagrid in my address book. 01:46:46.370 --> 01:46:50.280 But three is way better than not having him in there at all. 01:46:50.280 --> 01:46:53.000 So, again, we see a manifestation of a trade off here, too. 01:46:53.000 --> 01:46:54.710 But we've solved the problem. 01:46:54.710 --> 01:46:57.680 And a hash table is, indeed, exactly this data structure. 01:46:57.680 --> 01:47:02.000 It is an array of linked lists, at least it can be implemented as such. 01:47:02.000 --> 01:47:05.708 And it is predicated on introducing the notion of a hash function. 01:47:05.708 --> 01:47:08.500 This is actually something we'll see in other contexts before long. 01:47:08.500 --> 01:47:12.760 But a hash function is going to allow us to map not only Hermione, Harry, 01:47:12.760 --> 01:47:15.880 and Hagrid, but also Ron and Remus, Severus and Sirius 01:47:15.880 --> 01:47:19.180 to their respective locations deterministically. 01:47:19.180 --> 01:47:21.910 That is, there's no randomness involved here. 01:47:21.910 --> 01:47:24.100 Every time I look at these people's names, 01:47:24.100 --> 01:47:27.760 I'm going to figure out the location at which they belong 01:47:27.760 --> 01:47:30.170 and that location is never going to change. 01:47:30.170 --> 01:47:31.190 So how do I do this? 01:47:31.190 --> 01:47:34.212 Well, it turns out we can think back to problem solving itself 01:47:34.212 --> 01:47:35.170 and what functions are. 01:47:35.170 --> 01:47:37.003 So this is problem solving as we defined it. 01:47:37.003 --> 01:47:41.140 This is also a function in any language that takes inputs and outputs. 01:47:41.140 --> 01:47:43.720 A hash function is going to be sort of a secret sauce 01:47:43.720 --> 01:47:45.600 inside of this black box for now. 01:47:45.600 --> 01:47:47.050 And so what is a hash function? 01:47:47.050 --> 01:47:50.620 Well, a hash function is literally a function, either mathematically 01:47:50.620 --> 01:47:55.150 or in programming, that takes as input some string, in this case, 01:47:55.150 --> 01:47:58.430 like Hermione or Harry, and it returns some output. 01:47:58.430 --> 01:48:01.480 And the output of a hash function is usually a number. 01:48:01.480 --> 01:48:06.170 In this case, the number I want is going to be between 0 and 25. 01:48:06.170 --> 01:48:09.040 So in order to implement this notion of a hash table, 01:48:09.040 --> 01:48:11.740 not just pictorially on the screen, but in actual code, 01:48:11.740 --> 01:48:15.130 I'm literally going to have to write a function that takes a string, 01:48:15.130 --> 01:48:20.800 or if you will, char*, as input and returns an int between 0 and 25 so that 01:48:20.800 --> 01:48:27.380 I know how to convert Hermione or Harry or Hagrid to the number 7 in this case. 01:48:27.380 --> 01:48:29.080 So what does this hash function do? 01:48:29.080 --> 01:48:32.050 It takes as input something like Albus and it outputs 0. 01:48:32.050 --> 01:48:35.110 It takes someone like Zacharias, and it outputs 25. 01:48:35.110 --> 01:48:37.220 And you can probably see the pattern here. 01:48:37.220 --> 01:48:40.120 The code I would write in order to implement something like this is 01:48:40.120 --> 01:48:43.120 probably going to look at the user's input, that char*. 01:48:43.120 --> 01:48:45.880 And it's going to look at the first character, which 01:48:45.880 --> 01:48:48.763 is A or Z, respectively, for these two. 01:48:48.763 --> 01:48:50.680 And it's then going to do a little bit of math 01:48:50.680 --> 01:48:53.660 and subtract off, like, 65 or whatnot. 01:48:53.660 --> 01:48:57.040 And it's going to get me a number between 0 and 25, 01:48:57.040 --> 01:49:01.910 just like with Caesar or some of our past manipulations of strings. 01:49:01.910 --> 01:49:06.790 So from here, we can now take this building block, though, and perhaps 01:49:06.790 --> 01:49:09.010 solve our problems a little more effectively. 01:49:09.010 --> 01:49:11.710 Like, I don't love the fact that even though, yes, I've 01:49:11.710 --> 01:49:15.010 made room for Harry, and Hermione, and Hagrid, 01:49:15.010 --> 01:49:19.510 and now Luna, and Lily, and Lucius, and Lavender, some of these linked lists 01:49:19.510 --> 01:49:20.620 are getting a little long. 01:49:20.620 --> 01:49:22.090 And there's another term you can use here. 01:49:22.090 --> 01:49:23.882 These are kind of like chains, if you will, 01:49:23.882 --> 01:49:28.390 because they look like chain link fences or little links in a chain. 01:49:28.390 --> 01:49:29.860 These are chains or linked lists. 01:49:29.860 --> 01:49:31.610 But some of them are starting to get long. 01:49:31.610 --> 01:49:35.620 And it's a little stupid that I'm trying to achieve constant time, Big O of 1, 01:49:35.620 --> 01:49:39.580 but technically, even though some of the names literally take one step, 01:49:39.580 --> 01:49:42.400 some of them are taking two or three or four steps. 01:49:42.400 --> 01:49:44.080 So it's starting to devolve. 01:49:44.080 --> 01:49:45.762 So what would be an optimization here? 01:49:45.762 --> 01:49:48.220 If you start to get uncomfortable because you're so popular 01:49:48.220 --> 01:49:50.890 and you've got so many names in your contacts that looking up 01:49:50.890 --> 01:49:56.790 the H's, looking up the L's is taking more time than the others, 01:49:56.790 --> 01:50:01.200 what could we do to improve the situation and still use a hash table, 01:50:01.200 --> 01:50:03.300 still use a hash function? 01:50:03.300 --> 01:50:05.790 But what would maybe the logical solution 01:50:05.790 --> 01:50:08.880 be when you have too many collisions, you 01:50:08.880 --> 01:50:12.150 have too many names colliding with one another? 01:50:12.150 --> 01:50:15.570 How could we improve our performance and get at locations, again, 01:50:15.570 --> 01:50:18.630 closer to one step, not two, not three, not four? 01:50:18.630 --> 01:50:23.327 So literally one step, because that's our Holy Grail here. 01:50:23.327 --> 01:50:26.160 BRIAN: A few people have suggested you should look at more than just 01:50:26.160 --> 01:50:26.880 the first letter. 01:50:26.880 --> 01:50:28.710 Like, you could look at the second letter, for example. 01:50:28.710 --> 01:50:29.160 DAVID MALAN: Yeah. 01:50:29.160 --> 01:50:29.660 Nice. 01:50:29.660 --> 01:50:32.220 So if looking at one letter of the person's name 01:50:32.220 --> 01:50:33.420 is obviously insufficient. 01:50:33.420 --> 01:50:37.530 Because a whole bunch of us have names that start with H or A 01:50:37.530 --> 01:50:40.680 or Z or the like, well, why don't we look at two letters 01:50:40.680 --> 01:50:42.930 and therefore decrease the probability that we're 01:50:42.930 --> 01:50:44.320 going to have these collisions? 01:50:44.320 --> 01:50:46.140 So let me go ahead and restructure this. 01:50:46.140 --> 01:50:48.840 And focusing on the Hermione, Harry, and Hagrid problem, 01:50:48.840 --> 01:50:52.830 why don't we go ahead and take our array in the hash table, the vertical thing 01:50:52.830 --> 01:50:56.567 here, and let's think of it as maybe not just being H at that location. 01:50:56.567 --> 01:50:58.650 But what if we think of that location specifically 01:50:58.650 --> 01:51:05.010 as being HA, and then HB, HC, HD, HE, HF, all the way down to HZ. 01:51:05.010 --> 01:51:09.390 And then IA, IB, IC, and so forth, so we now 01:51:09.390 --> 01:51:16.060 enumerate all possible pairs of letters from AA to ZZ? 01:51:16.060 --> 01:51:18.060 But this would seem to spread things out, right? 01:51:18.060 --> 01:51:21.850 Because now Hermione goes in the HE location in the array. 01:51:21.850 --> 01:51:23.340 Now Harry goes in the HA. 01:51:23.340 --> 01:51:24.870 And now Hag-- oh, dammit. 01:51:24.870 --> 01:51:28.120 Like, Hagrid still goes in the same location. 01:51:28.120 --> 01:51:30.060 So what would maybe be a better fix? 01:51:30.060 --> 01:51:33.640 Again, this isn't horrible, like two steps is not a big deal, 01:51:33.640 --> 01:51:34.890 especially on fast computers. 01:51:34.890 --> 01:51:37.153 But, again, with large enough data sets, and if we're 01:51:37.153 --> 01:51:39.570 no longer talking about people in your contacts, but maybe 01:51:39.570 --> 01:51:43.773 all the people in the world who have Google accounts, or Twitter accounts, 01:51:43.773 --> 01:51:46.440 and the like, where you want to search this information quickly, 01:51:46.440 --> 01:51:49.410 you're going to have a lot of people whose names start with H and A 01:51:49.410 --> 01:51:50.760 and Z and everything else. 01:51:50.760 --> 01:51:52.877 It would be nice to spread them out further. 01:51:52.877 --> 01:51:53.710 So what could we do? 01:51:53.710 --> 01:51:55.740 Well, instead of using the first two letters, 01:51:55.740 --> 01:51:59.590 frankly I think the logical extension of this is to use the first three letters. 01:51:59.590 --> 01:52:01.740 So maybe this is the HAA bucket. 01:52:01.740 --> 01:52:07.560 This is the HAB, HAC, HAD, HAE, dot, dot, dot, all the way down to HZZ, 01:52:07.560 --> 01:52:08.630 and then IAA. 01:52:08.630 --> 01:52:13.320 But now when we hash our three friends, Hermione 01:52:13.320 --> 01:52:17.400 goes in the HER bucket so to speak, the elements of the array. 01:52:17.400 --> 01:52:19.530 Harry, HAR, goes in that bucket. 01:52:19.530 --> 01:52:24.360 And Hagrid now gets his own bucket, HAG, as would everyone else. 01:52:24.360 --> 01:52:26.622 So it seems to have solved this specific problem. 01:52:26.622 --> 01:52:29.580 You could still imagine, and I have to think harder and probably Google 01:52:29.580 --> 01:52:33.402 to see if there's other Harry Potter names that start with HAG or HAR 01:52:33.402 --> 01:52:35.880 or HER to find another collision. 01:52:35.880 --> 01:52:39.420 Because you could imagine using four letters instead. 01:52:39.420 --> 01:52:41.460 But what price are we paying? 01:52:41.460 --> 01:52:44.190 Like, I'm solving this problem again and again. 01:52:44.190 --> 01:52:48.030 And I'm getting myself literally faster look up time, 01:52:48.030 --> 01:52:49.590 because it's giving me one step. 01:52:49.590 --> 01:52:51.690 I can mathematically figure out by just doing 01:52:51.690 --> 01:52:55.705 a bit of ASCII math, what the number of the index 01:52:55.705 --> 01:52:58.080 is that I should jump to in this bigger and bigger array. 01:52:58.080 --> 01:53:00.995 But what price am I paying? 01:53:00.995 --> 01:53:03.470 Brian, any thoughts you'd like to relay? 01:53:03.470 --> 01:53:03.970 BRIAN: Yeah. 01:53:03.970 --> 01:53:06.180 A few people are saying it's going to take a lot of memory. 01:53:06.180 --> 01:53:06.930 DAVID MALAN: Yeah. 01:53:06.930 --> 01:53:09.690 My God, like, this is taking a huge amount of memory now. 01:53:09.690 --> 01:53:11.660 Previously, how much memory did it take? 01:53:11.660 --> 01:53:15.420 Well, let me pull up a little calculator here and do some quick math. 01:53:15.420 --> 01:53:21.327 So if we had originally 26 buckets, so to speak, elements in the array, that, 01:53:21.327 --> 01:53:22.410 of course, isn't that bad. 01:53:22.410 --> 01:53:24.270 That feels pretty reasonable, 26 slots. 01:53:24.270 --> 01:53:26.280 But the downside was that the chains might 01:53:26.280 --> 01:53:28.990 get kind of long, three names, four names, maybe even more. 01:53:28.990 --> 01:53:35.100 But if we have AA through ZZ, instead of A through Z, that's 26 times 26, 01:53:35.100 --> 01:53:36.845 that's 676 buckets. 01:53:36.845 --> 01:53:39.720 Doesn't sound like a huge deal, though that's bigger than most things 01:53:39.720 --> 01:53:41.950 we've done in memory thus far, not a huge deal. 01:53:41.950 --> 01:53:46.440 But if we have three, that's 26 possibilities times 26 times 26, 01:53:46.440 --> 01:53:48.600 for AAA through ZZZ. 01:53:48.600 --> 01:53:54.540 Now we have 17,576 buckets in my array. 01:53:54.540 --> 01:53:57.060 And the problem isn't so much that we're using that memory. 01:53:57.060 --> 01:53:58.840 Because honestly, if you need the memory, use it. 01:53:58.840 --> 01:53:59.460 That's fine. 01:53:59.460 --> 01:54:01.050 Just throw hardware at the problem. 01:54:01.050 --> 01:54:02.940 Buy and upgrade more memory. 01:54:02.940 --> 01:54:05.610 But the problem is that I probably don't know 01:54:05.610 --> 01:54:11.310 that many people whose names start with HAA or AZZ 01:54:11.310 --> 01:54:14.700 or any number of these combinations of letters of the alphabet. 01:54:14.700 --> 01:54:16.765 A lot of those buckets are going to be empty. 01:54:16.765 --> 01:54:18.390 But it doesn't matter if they're empty. 01:54:18.390 --> 01:54:22.870 If you want an array and you want random access, they have to be present 01:54:22.870 --> 01:54:25.980 so that your arithmetic works out, per Week 2, 01:54:25.980 --> 01:54:27.750 where you just use square bracket notation 01:54:27.750 --> 01:54:32.160 and jump to any of the locations in memory that you care about. 01:54:32.160 --> 01:54:36.060 So finding that trade off, or finding the inflection point with those trade 01:54:36.060 --> 01:54:39.330 offs, is kind of an art and/or a science, figuring out 01:54:39.330 --> 01:54:42.330 for your particular data, your particular application, which 01:54:42.330 --> 01:54:46.990 is more important, time or space or some happy medium in between the two. 01:54:46.990 --> 01:54:49.110 And with Problem Set 5, as you'll see, you'll 01:54:49.110 --> 01:54:51.270 actually have to figure out this balance in part 01:54:51.270 --> 01:54:54.300 by trying to minimize, ultimately, your own use of memory 01:54:54.300 --> 01:54:56.792 and your own use of computers' time. 01:54:56.792 --> 01:54:58.500 But let me point something out, actually. 01:54:58.500 --> 01:55:00.540 This notion of hash table, which up until now, 01:55:00.540 --> 01:55:03.457 definitely the most sophisticated data structure that we've looked at, 01:55:03.457 --> 01:55:06.380 it's kind of familiar to you in some way already. 01:55:06.380 --> 01:55:09.130 These are probably larger than the playing cards you have at home. 01:55:09.130 --> 01:55:11.088 But if you've ever played with a deck of cards, 01:55:11.088 --> 01:55:14.400 and the cards start out randomly, odds are you've, at some point, 01:55:14.400 --> 01:55:16.440 needed to sort them for one game or another. 01:55:16.440 --> 01:55:18.330 Sometimes you need to shuffle them entirely. 01:55:18.330 --> 01:55:20.538 If you want to be a little neat, you might sort them, 01:55:20.538 --> 01:55:22.440 not just by number, but also by suit. 01:55:22.440 --> 01:55:27.820 So hearts and spades and clubs and diamonds into separate categories. 01:55:27.820 --> 01:55:31.050 So honestly, I have this literally here just for the sake of the metaphor. 01:55:31.050 --> 01:55:32.550 We have four buckets here. 01:55:32.550 --> 01:55:36.430 And we've gone ahead and labeled them in advance with spade there. 01:55:36.430 --> 01:55:37.740 So that's one bucket. 01:55:37.740 --> 01:55:41.880 Here we have diamond shape here. 01:55:41.880 --> 01:55:43.920 And here we have-- 01:55:43.920 --> 01:55:45.510 [GRUNTING] 01:55:45.510 --> 01:55:50.230 --here we have hearts here and then clubs here. 01:55:50.230 --> 01:55:52.375 So if you've ever sorted a deck of cards, 01:55:52.375 --> 01:55:54.750 odds are you haven't really thought about this very hard. 01:55:54.750 --> 01:55:55.860 Because it's not that interesting. 01:55:55.860 --> 01:55:58.443 You probably mindlessly start laying them out and sorting them 01:55:58.443 --> 01:56:01.140 by suits and then maybe by number. 01:56:01.140 --> 01:56:04.798 But if you've done that, you have hashed values before. 01:56:04.798 --> 01:56:06.840 If you take a look at the first card and you see, 01:56:06.840 --> 01:56:09.540 that, oh, it's the ace of diamonds. 01:56:09.540 --> 01:56:12.930 You know, yes, you might care ultimately that it's a diamond, that it's an ace. 01:56:12.930 --> 01:56:18.240 But for now, I'm just going to put it, for instance, into the diamond bucket. 01:56:18.240 --> 01:56:19.623 Here's the two of diamonds here. 01:56:19.623 --> 01:56:21.540 I'm going to put that into the diamond bucket. 01:56:21.540 --> 01:56:23.650 Here's the ace of clubs. 01:56:23.650 --> 01:56:25.380 So I'm going to put that over here. 01:56:25.380 --> 01:56:29.970 And you can just progressively hash one card after the other. 01:56:29.970 --> 01:56:34.260 And, indeed, hashing really just means to look at some input and produce, 01:56:34.260 --> 01:56:38.070 in this case, some numeric output that outputs the like bucket 0, 1, 2, 01:56:38.070 --> 01:56:41.010 or 3 based on some characteristic of that input, 01:56:41.010 --> 01:56:43.950 whether it's actually the suit on the card like I'm doing here, 01:56:43.950 --> 01:56:46.950 or maybe it's based on the letter of the alphabet here. 01:56:46.950 --> 01:56:48.010 And why am I doing this? 01:56:48.010 --> 01:56:48.510 Right? 01:56:48.510 --> 01:56:49.410 I'm not going to do the whole thing. 01:56:49.410 --> 01:56:52.530 Because 52 steps is going to take a while and get boring quickly, if not 01:56:52.530 --> 01:56:53.520 already. 01:56:53.520 --> 01:56:54.750 But why am I doing this? 01:56:54.750 --> 01:56:56.190 Because odds are you've probably done this. 01:56:56.190 --> 01:56:59.370 Not with the drama of actual buckets, you've probably just kind of laid them 01:56:59.370 --> 01:57:00.203 out in front of you. 01:57:00.203 --> 01:57:05.560 But why have you done that, if that's indeed something you have done? 01:57:05.560 --> 01:57:06.060 Yeah. 01:57:06.060 --> 01:57:08.340 Over to Sophia? 01:57:08.340 --> 01:57:11.520 AUDIENCE: There's a possibility that we could actually get to things faster, 01:57:11.520 --> 01:57:12.990 like, if we know what bucket it is. 01:57:12.990 --> 01:57:16.530 We might be able to even search things for, like, 0, 1, or less. 01:57:16.530 --> 01:57:17.760 DAVID MALAN: Yeah. 01:57:17.760 --> 01:57:18.150 AUDIENCE: Something like that. 01:57:18.150 --> 01:57:18.450 DAVID MALAN: Yeah. 01:57:18.450 --> 01:57:20.325 You start to gain these optimizations, right? 01:57:20.325 --> 01:57:24.360 At least, as a human, honestly, I can process four smaller problems 01:57:24.360 --> 01:57:27.540 just much easier than one bigger problem, that's size 52. 01:57:27.540 --> 01:57:32.238 I can solve four 13 card problems a little faster, especially 01:57:32.238 --> 01:57:33.780 if I'm looking for a particular card. 01:57:33.780 --> 01:57:36.370 Now I can find it among 13 cards instead of 52. 01:57:36.370 --> 01:57:38.310 So there's just kind of an optimization here. 01:57:38.310 --> 01:57:41.790 So you might take as input these cards, hash them into a particular bucket, 01:57:41.790 --> 01:57:43.890 and then proceed to solve the smaller problem. 01:57:43.890 --> 01:57:46.500 Now that's not what a hash table itself is all about. 01:57:46.500 --> 01:57:50.580 A hash table is about storing information, but storing information 01:57:50.580 --> 01:57:52.330 so as to get to it more quickly. 01:57:52.330 --> 01:58:00.390 So to Sophia's point, if indeed she just wants to find like the ace of diamonds, 01:58:00.390 --> 01:58:03.900 she now only has to look through a 13 sized problem, a linked 01:58:03.900 --> 01:58:09.480 list of size 13, if you will, instead of an array or a linked list of size 52. 01:58:09.480 --> 01:58:12.870 So a hash table allows you to bucketize your inputs, if you will, 01:58:12.870 --> 01:58:15.480 colloquially, and get access to data more quickly. 01:58:15.480 --> 01:58:20.262 Not necessarily in time one, in one step, it might be two. 01:58:20.262 --> 01:58:20.970 It might be four. 01:58:20.970 --> 01:58:22.560 It might be 13 steps. 01:58:22.560 --> 01:58:25.530 But it's generally fewer steps than if you 01:58:25.530 --> 01:58:29.970 were doing something purely linearly, or even logarithmically. 01:58:29.970 --> 01:58:32.850 Ideally, you're trying to pick your hash function in such a way 01:58:32.850 --> 01:58:38.790 that you minimize the number of elements that collide by using not A through Z, 01:58:38.790 --> 01:58:41.290 but AA through ZZ and so forth. 01:58:41.290 --> 01:58:43.440 So let me go ahead here and ask a question. 01:58:43.440 --> 01:58:45.720 What, then, is the running time when it comes 01:58:45.720 --> 01:58:48.180 to this data structure of a hash table? 01:58:48.180 --> 01:58:52.860 If you want to go ahead and search in a hash table, once all of the data 01:58:52.860 --> 01:58:55.560 is in there, once all of my contacts are there, 01:58:55.560 --> 01:59:00.240 how many steps does your phone have to take, given n contacts in your phone, 01:59:00.240 --> 01:59:06.440 to find Hermione or Hagrid or anyone else? 01:59:06.440 --> 01:59:10.570 So I see, again, 80% of you are saying constant time, Big O of 1. 01:59:10.570 --> 01:59:13.480 And, again, constant time might mean one step, two steps, four steps, 01:59:13.480 --> 01:59:16.390 but some fixed number, not dependent on n. 01:59:16.390 --> 01:59:19.900 18% of you or so are saying linear time. 01:59:19.900 --> 01:59:22.570 And I have to admit, the 20% of you or so that 01:59:22.570 --> 01:59:29.060 said linear time are technically, asymptotically, mathematically correct. 01:59:29.060 --> 01:59:33.010 And here we begin to see sort of a distinction between the real world 01:59:33.010 --> 01:59:33.820 and academia. 01:59:33.820 --> 01:59:40.510 So the academic here, or rather the real world here, the real world programmer 01:59:40.510 --> 01:59:44.590 would say, just like Sophia did, obviously, a bucket with 13 cards in it 01:59:44.590 --> 01:59:48.227 is strictly better than one bigger bucket with 52 cards. 01:59:48.227 --> 01:59:49.060 That is just faster. 01:59:49.060 --> 01:59:52.480 It's literally four times as fast to find or to flip 01:59:52.480 --> 01:59:54.820 through those 13 cards instead of 52. 01:59:54.820 --> 01:59:56.770 That is objectively faster. 01:59:56.770 --> 02:00:00.670 But the academic would say yes, but asymptotically-- and asymptotically 02:00:00.670 --> 02:00:03.310 is just a fancy way of saying as n gets really large, 02:00:03.310 --> 02:00:07.510 the sort of wave of the hand that I keep describing, asymptotically taking 02:00:07.510 --> 02:00:11.170 13 steps is technically big O of n. 02:00:11.170 --> 02:00:11.800 Why? 02:00:11.800 --> 02:00:15.520 Well, in the case of the cards here, it's technically n divided by 4. 02:00:15.520 --> 02:00:16.360 Yes, it's 13. 02:00:16.360 --> 02:00:19.630 But if there's n cards total, technically, the size of this bucket 02:00:19.630 --> 02:00:21.400 is going to end up being m divided by 4. 02:00:21.400 --> 02:00:24.280 And what did we talk about when we talked about Big O and omega? 02:00:24.280 --> 02:00:26.450 Well, you throw away the lower order terms. 02:00:26.450 --> 02:00:28.810 You get rid of the constants, like the divide by 4, 02:00:28.810 --> 02:00:31.613 or the plus something else. 02:00:31.613 --> 02:00:32.530 So we get rid of that. 02:00:32.530 --> 02:00:35.860 And it's technically a hash table searching. 02:00:35.860 --> 02:00:38.050 It is still in Big O of n. 02:00:38.050 --> 02:00:41.260 But here, again, we see a contrast between the real world 02:00:41.260 --> 02:00:42.640 and the theoretical world. 02:00:42.640 --> 02:00:44.920 Like, yes, if you want to get into an academic debate, 02:00:44.920 --> 02:00:46.920 yes, it's still technically the same as a linked 02:00:46.920 --> 02:00:49.630 list or an array, at which point you might as well just search 02:00:49.630 --> 02:00:52.672 the thing left to right linearly, whether it's an array or a linked list. 02:00:52.672 --> 02:00:53.290 But come on. 02:00:53.290 --> 02:00:55.780 Like, if you actually hash these values in advance, 02:00:55.780 --> 02:01:00.370 and spread them out into 4, or 26, or 576 buckets, 02:01:00.370 --> 02:01:04.353 that is actually going to be faster when it comes to wall clock time. 02:01:04.353 --> 02:01:06.520 So when you literally look at the clock on the wall, 02:01:06.520 --> 02:01:09.610 less time will pass taking Sophia's approach than taking 02:01:09.610 --> 02:01:12.020 an array or linked list approach. 02:01:12.020 --> 02:01:16.030 So here, those of you who said Big O of n are correct. 02:01:16.030 --> 02:01:18.280 But when it comes to the real world programming, 02:01:18.280 --> 02:01:21.760 honestly, if it's faster than actual n steps, 02:01:21.760 --> 02:01:23.950 that may very well be a net positive. 02:01:23.950 --> 02:01:28.430 And so perhaps we should be focusing more on practice and less, sometimes, 02:01:28.430 --> 02:01:29.680 on the theory of these things. 02:01:29.680 --> 02:01:31.513 And indeed that's going to be the challenge. 02:01:31.513 --> 02:01:33.520 The Problem Set 5 to which I keep alluding 02:01:33.520 --> 02:01:35.860 is going to challenge you to implement one 02:01:35.860 --> 02:01:40.630 of these data structures, a hash table, with 100,000 plus English words. 02:01:40.630 --> 02:01:43.780 We're going to, in a nutshell, give you a big text file containing 02:01:43.780 --> 02:01:45.260 one English word per line. 02:01:45.260 --> 02:01:50.140 And among your goals is going to be to load all of those 140,000 plus words 02:01:50.140 --> 02:01:53.380 into your computer's memory using a hash table. 02:01:53.380 --> 02:01:56.500 Now if you are simplistic about it, and you 02:01:56.500 --> 02:01:59.712 use a hash table with 26 buckets, A through Z, 02:01:59.712 --> 02:02:01.420 you're going to have a lot of collisions. 02:02:01.420 --> 02:02:04.510 If there's 140,000 plus English words, there's a lot of words in there 02:02:04.510 --> 02:02:08.080 that start with A, or B, or Z, or anything in between. 02:02:08.080 --> 02:02:11.560 If you maybe then go with AA through ZZ, maybe that's better, 02:02:11.560 --> 02:02:13.562 or AAA through ZZZ, maybe that's better. 02:02:13.562 --> 02:02:15.520 But at some point, you're going to start to use 02:02:15.520 --> 02:02:17.630 too much memory for your own good. 02:02:17.630 --> 02:02:20.260 And one of the challenges, optionally, of Problem Set 5, 02:02:20.260 --> 02:02:23.148 is going to be to playfully challenge your classmates whereby, 02:02:23.148 --> 02:02:25.690 if you opt into this, you can run a command that will put you 02:02:25.690 --> 02:02:29.650 on the big board which will show on the course's website exactly how much 02:02:29.650 --> 02:02:33.760 or how little RAM or memory you're using, and how little 02:02:33.760 --> 02:02:36.740 or how much time your code is taking to run. 02:02:36.740 --> 02:02:41.000 And so we just put aside the sort of academic waves of the hand saying, 02:02:41.000 --> 02:02:43.420 well, yes, all of your code is Big O of n. 02:02:43.420 --> 02:02:49.450 But n divided by 4, Sophia's approach, is way better in practice than n 02:02:49.450 --> 02:02:50.020 itself. 02:02:50.020 --> 02:02:51.910 And we'll begin to tease apart the dichotomy 02:02:51.910 --> 02:02:54.310 between theory here and practice. 02:02:54.310 --> 02:02:57.040 But these aren't the only ways to lay things out in memory. 02:02:57.040 --> 02:03:00.730 And we wanted to show you just a few other ideas that come out now 02:03:00.730 --> 02:03:02.560 that we have all of these building blocks. 02:03:02.560 --> 02:03:06.400 One of which is the data structure that we're going to call a trie. 02:03:06.400 --> 02:03:08.410 Trie is actually short for the word retrieval, 02:03:08.410 --> 02:03:10.368 even though it's not quite pronounced the same. 02:03:10.368 --> 02:03:11.770 It's also known as a prefix tree. 02:03:11.770 --> 02:03:14.170 And it's a different type of tree that is typically 02:03:14.170 --> 02:03:17.470 used to store words, or other more sophisticated pieces of data 02:03:17.470 --> 02:03:19.630 instead of just numbers alone. 02:03:19.630 --> 02:03:24.497 So a trie is actually a tree made up of arrays. 02:03:24.497 --> 02:03:26.080 So you can kind of see a pattern here. 02:03:26.080 --> 02:03:29.290 A hash table was an array of linked lists. 02:03:29.290 --> 02:03:33.592 A trie is a tree, each of whose nodes is an array. 02:03:33.592 --> 02:03:36.550 So at some point, computer scientists started getting a little creative 02:03:36.550 --> 02:03:38.620 and started just, like, literally smashing together 02:03:38.620 --> 02:03:41.620 different data structures to see what they could come up with, it seems. 02:03:41.620 --> 02:03:44.230 And so a trie begins to look like this. 02:03:44.230 --> 02:03:48.130 But it has this amazing property that's better than anything in theory 02:03:48.130 --> 02:03:49.300 we've seen before. 02:03:49.300 --> 02:03:52.300 Here is one node in a try. 02:03:52.300 --> 02:03:55.540 And it's a node in the sense that this would be like a rectangle or square. 02:03:55.540 --> 02:03:59.920 But inside of that node is literally an array of size 26 in this case. 02:03:59.920 --> 02:04:05.640 And each of those locations or buckets, if you will represent A through Z. 02:04:05.640 --> 02:04:09.840 And what we're going to do is any time we insert a word, like a name, 02:04:09.840 --> 02:04:12.690 like Harry, or Hagrid, or Hermione, or anyone else, 02:04:12.690 --> 02:04:17.490 we are going to walk through the letters of their name, like, H-A-G-R-I-D, 02:04:17.490 --> 02:04:23.340 and we are going to follow a series of pointers from one node to another 02:04:23.340 --> 02:04:23.980 as follows. 02:04:23.980 --> 02:04:28.560 So for instance, if this is A through Z, or 0 through 25, here is location H. 02:04:28.560 --> 02:04:33.570 So if the goal at the moment is to insert the first of our contacts, 02:04:33.570 --> 02:04:36.540 for instance, Harry, I'm going to start by looking 02:04:36.540 --> 02:04:40.290 at the first node, the root of the tree, looking up the H location. 02:04:40.290 --> 02:04:43.170 And I'm going to kind of make mental note that Harry starts there, 02:04:43.170 --> 02:04:45.000 the H in Harry starts there. 02:04:45.000 --> 02:04:48.120 Then if I want to insert the A in Harry, I'm 02:04:48.120 --> 02:04:53.670 going to go ahead and add a new node, representing 26 letters. 02:04:53.670 --> 02:04:57.280 But I'm going to keep track of the fact that, OK, A is here. 02:04:57.280 --> 02:04:59.170 So now I'm going to have another pointer-- 02:04:59.170 --> 02:05:07.410 oh, I'm sorry-- not Harry, Hagrid first, H-A-G-R-I-D. So what have I just done? 02:05:07.410 --> 02:05:11.520 A trie, again, is a tree, each of whose nodes is an array. 02:05:11.520 --> 02:05:17.140 And each of those arrays is an array of pointers to other nodes. 02:05:17.140 --> 02:05:19.920 So, again, we're really just mashing up everything together here. 02:05:19.920 --> 02:05:21.930 But it's the same building blocks as before. 02:05:21.930 --> 02:05:27.630 Each node in this tree, top to bottom, is an array of pointers to other nodes. 02:05:27.630 --> 02:05:32.370 And so, if I wanted to check if Hagrid is in my contacts, 02:05:32.370 --> 02:05:35.910 I literally start at the first node, and I follow the H pointer. 02:05:35.910 --> 02:05:37.500 I then follow the A pointer. 02:05:37.500 --> 02:05:40.650 I then follow the G pointer, the R pointer, the I pointer. 02:05:40.650 --> 02:05:42.480 And then I check at the D pointer. 02:05:42.480 --> 02:05:46.930 Is there a Boolean value inside of that structure-- more on that another time, 02:05:46.930 --> 02:05:47.850 perhaps-- 02:05:47.850 --> 02:05:53.070 that just says, yes or no, there is someone named H-A-G-R-I-D 02:05:53.070 --> 02:05:54.450 in my contacts. 02:05:54.450 --> 02:05:57.215 Notice there's no other letters noted at the moment. 02:05:57.215 --> 02:05:58.590 And there's no other green boxes. 02:05:58.590 --> 02:06:01.290 Green just denotes a Boolean value for our purposes now. 02:06:01.290 --> 02:06:08.190 So that means there's no one whose name is H-A-G-R-I-A, or R-I-B, or R-I-C. 02:06:08.190 --> 02:06:13.412 It's only H-A-G-R-I-D that exists in my contacts. 02:06:13.412 --> 02:06:14.620 But notice what happens next. 02:06:14.620 --> 02:06:16.980 Now if I go ahead and insert Harry, notice 02:06:16.980 --> 02:06:21.840 that Harry and Hagrid share the H, the A, and then this third node. 02:06:21.840 --> 02:06:24.510 But then Harry needs to follow a different pointer 02:06:24.510 --> 02:06:27.120 to store the R and the Y. And notice the green there. 02:06:27.120 --> 02:06:29.520 It's sort of a checkmark in the data structure, a Boolean value, that's 02:06:29.520 --> 02:06:30.120 saying yes. 02:06:30.120 --> 02:06:32.640 I have someone in my context named H-A-R-R-Y. 02:06:32.640 --> 02:06:37.740 And then, if we add Hermione, she shares the H, and then also the second node. 02:06:37.740 --> 02:06:42.280 But Hermione requires some new nodes all together. 02:06:42.280 --> 02:06:46.950 But notice this key property, the reason for this sort of complexity, 02:06:46.950 --> 02:06:50.100 because this is probably the weirdest structure we've seen thus far, 02:06:50.100 --> 02:06:54.060 is that even if I have a billion names in my phone book, 02:06:54.060 --> 02:06:57.930 how many steps literally does it take me to find Hagrid? 02:06:57.930 --> 02:06:58.620 Someone? 02:06:58.620 --> 02:07:01.680 Feel free to chime in the text, the chat window if you like? 02:07:01.680 --> 02:07:04.530 Even if I have a billion names in my contacts, 02:07:04.530 --> 02:07:07.320 how many steps does it take for me to look up and check 02:07:07.320 --> 02:07:09.232 if Hagrid is among them? 02:07:09.232 --> 02:07:10.440 BRIAN: People are saying six. 02:07:10.440 --> 02:07:15.810 DAVID MALAN: Six, H-A-G-R-I-D. If I have 2 billion names, 4 billion names, 02:07:15.810 --> 02:07:17.130 how many steps does it take? 02:07:17.130 --> 02:07:21.930 6, and this is what we mean by constant time, constant time at least 02:07:21.930 --> 02:07:25.510 in the length of the humans' names in the data structure. 02:07:25.510 --> 02:07:26.760 So what does this mean? 02:07:26.760 --> 02:07:30.450 No matter how many names you cram into a trie data 02:07:30.450 --> 02:07:34.020 structure, the number of other names does not 02:07:34.020 --> 02:07:38.040 impact how many steps it takes for me to find Harry or Hagrid or Hermione 02:07:38.040 --> 02:07:39.360 or anyone else. 02:07:39.360 --> 02:07:41.793 It is only dependent on the length of their name. 02:07:41.793 --> 02:07:43.710 And here's where we can get a little academic. 02:07:43.710 --> 02:07:46.043 If you assume that there's a finite number of characters 02:07:46.043 --> 02:07:50.370 in any human, real or imaginary's name, maybe it's 20, or 100, or whatever. 02:07:50.370 --> 02:07:52.930 It's more than 6, but it's probably fewer than hundreds. 02:07:52.930 --> 02:07:55.030 And then you can assume that that's constant. 02:07:55.030 --> 02:07:57.420 So it might be Big O of, like, 200 characters 02:07:57.420 --> 02:07:59.190 for someone with a super long name. 02:07:59.190 --> 02:08:00.600 But that's constant. 02:08:00.600 --> 02:08:05.100 And so technically a try gives you that Holy Grail 02:08:05.100 --> 02:08:09.360 of look up times and insertion times of Big O of 1. 02:08:09.360 --> 02:08:12.360 Because it is not dependent on n, which is the number 02:08:12.360 --> 02:08:14.850 of other names in the data structure. 02:08:14.850 --> 02:08:18.300 It is dependent only on the length of the name you're inputting. 02:08:18.300 --> 02:08:21.630 And if you assume that all names in the world are reasonably linked, 02:08:21.630 --> 02:08:24.780 less than some finite value, like 6, or 200, or whatever 02:08:24.780 --> 02:08:26.610 it is, then you can call that. 02:08:26.610 --> 02:08:29.580 And it technically is Big O of 1. 02:08:29.580 --> 02:08:32.140 That is constant time. 02:08:32.140 --> 02:08:36.633 So here is the goal of this whole day, like, trying to get to constant time. 02:08:36.633 --> 02:08:38.550 Because constant time is better than, it would 02:08:38.550 --> 02:08:42.240 seem, linear time, or logarithmic time, or anything else we've seen. 02:08:42.240 --> 02:08:45.540 But-- but-- but-- again, there's always a trade off. 02:08:45.540 --> 02:08:49.710 What price have we just paid if you see it? 02:08:49.710 --> 02:08:55.630 Why are tries not necessarily all that? 02:08:55.630 --> 02:08:57.850 There's still a catch. 02:08:57.850 --> 02:08:59.230 Daniel? 02:08:59.230 --> 02:09:03.010 AUDIENCE: For example, if let's say you have two people in your contact list. 02:09:03.010 --> 02:09:07.120 One person was named Daniel and one person was named Danielle. 02:09:07.120 --> 02:09:09.870 And you know that Daniel is in your list. 02:09:09.870 --> 02:09:13.360 So the L would have this Boolean operator of true. 02:09:13.360 --> 02:09:17.500 But then, how would you get to Danielle, if your L was an operator 02:09:17.500 --> 02:09:19.630 and it didn't point to another L for Danielle? 02:09:19.630 --> 02:09:22.213 DAVID MALAN: Really good question, a corner case, if you will. 02:09:22.213 --> 02:09:26.110 What if someone's name is a substring of another person's name? 02:09:26.110 --> 02:09:29.950 So Daniel, D-A-N-I-E-L, and I think you're saying Danielle, 02:09:29.950 --> 02:09:35.410 D-A-N-I-E-L-L-E. So the second name is a little longer. 02:09:35.410 --> 02:09:37.210 Let me stipulate that we can solve that. 02:09:37.210 --> 02:09:40.960 I have not shown code that represents each of the nodes in this tree. 02:09:40.960 --> 02:09:44.180 Let me propose that we could continue having arrows even below. 02:09:44.180 --> 02:09:46.540 So if we were to have Daniel in this tree, 02:09:46.540 --> 02:09:48.580 we could also have Danielle by just having 02:09:48.580 --> 02:09:52.540 a couple of more nodes below Daniel and just having another green checkmark. 02:09:52.540 --> 02:09:54.850 So it is solvable in code, even though it's not 02:09:54.850 --> 02:09:57.790 obvious from the graphical representation, but absolutely 02:09:57.790 --> 02:09:58.510 a corner case. 02:09:58.510 --> 02:10:01.210 But thankfully, it is solvable in tries. 02:10:01.210 --> 02:10:04.000 What might another downside be, though, of a trie? 02:10:04.000 --> 02:10:05.740 Like, you do get Big O of 1 time. 02:10:05.740 --> 02:10:08.450 You can solve the Daniel-Danielle problem. 02:10:08.450 --> 02:10:10.195 But there's still a price being paid. 02:10:13.290 --> 02:10:15.950 Any thoughts? 02:10:15.950 --> 02:10:16.450 Yeah. 02:10:16.450 --> 02:10:20.285 How about over to Ethan, if I'm saying it right? 02:10:20.285 --> 02:10:20.910 AUDIENCE: Yeah. 02:10:20.910 --> 02:10:21.690 I'm Ethan. 02:10:21.690 --> 02:10:22.482 DAVID MALAN: Ethan. 02:10:22.482 --> 02:10:25.200 AUDIENCE: I think it's because it would just take a lot of memory 02:10:25.200 --> 02:10:26.580 to house all of that. 02:10:26.580 --> 02:10:28.410 And that could take a lot of time. 02:10:28.410 --> 02:10:29.790 It could slow down the system. 02:10:29.790 --> 02:10:30.820 DAVID MALAN: Exactly. 02:10:30.820 --> 02:10:31.320 Yeah. 02:10:31.320 --> 02:10:33.195 You can kind of see it from my picture alone. 02:10:33.195 --> 02:10:35.410 We only added three names to this data structure. 02:10:35.410 --> 02:10:39.390 But my God, like, there's dozens, maybe 100 plus pointers 02:10:39.390 --> 02:10:41.830 pictured here, even though they might all be null. 02:10:41.830 --> 02:10:42.330 Right? 02:10:42.330 --> 02:10:46.200 If the absence of an arrow here suggests that they're null, 0x0, but even 02:10:46.200 --> 02:10:49.180 storing null, 0x0, is 8 0 bits. 02:10:49.180 --> 02:10:52.710 So this is not lacking for actual memory usage, 02:10:52.710 --> 02:10:54.960 we're just not using it very efficiently. 02:10:54.960 --> 02:10:58.650 We have spent a huge number of bits, or bytes, 02:10:58.650 --> 02:11:00.570 or however you want to measure it. 02:11:00.570 --> 02:11:04.440 Because look, even with the H's, I'm using one pointer out of 26. 02:11:04.440 --> 02:11:06.780 25 pointers are probably initialized to null, 02:11:06.780 --> 02:11:08.910 which means I'm wasting 25 pointers. 02:11:08.910 --> 02:11:11.550 And you can imagine the lower and lower you get in this tree, 02:11:11.550 --> 02:11:16.560 the less likely there is to be a name that even starts with H-A-G-R-I-D. 02:11:16.560 --> 02:11:18.990 Daniel came up with a good example with Danielle. 02:11:18.990 --> 02:11:20.850 But that's not going to often happen. 02:11:20.850 --> 02:11:22.940 Certainly, the lower you get in this tree. 02:11:22.940 --> 02:11:26.330 So as Ethan says, you're wasting a huge amount of memory. 02:11:26.330 --> 02:11:30.900 So yes, you're gaining constant time look ups, but my God, at what price? 02:11:30.900 --> 02:11:34.680 You might be using megabytes, gigabytes of storage space. 02:11:34.680 --> 02:11:37.200 Because, again, the most important property of an array 02:11:37.200 --> 02:11:38.700 is that it's all contiguous. 02:11:38.700 --> 02:11:40.300 And therefore, you have random access. 02:11:40.300 --> 02:11:42.330 But if you have to have every node containing 02:11:42.330 --> 02:11:45.660 an array of size 26 or anything else, you 02:11:45.660 --> 02:11:49.960 have to spend the memory over and over again. 02:11:49.960 --> 02:11:51.210 So there, too, is a trade off. 02:11:51.210 --> 02:11:53.760 While this might be theoretically ideal, theory 02:11:53.760 --> 02:11:55.680 does not necessarily mean practice. 02:11:55.680 --> 02:11:59.400 And sometimes, designing something that is textbook less efficient, 02:11:59.400 --> 02:12:01.980 might actually be more efficient in the real world. 02:12:01.980 --> 02:12:05.733 And in Problem Set 5, in your own spellchecker, 02:12:05.733 --> 02:12:07.650 when you build up this dictionary that we then 02:12:07.650 --> 02:12:10.440 use to spell check very large corpuses of text, 02:12:10.440 --> 02:12:14.310 will you begin to experience some of those real world trade offs yourself. 02:12:14.310 --> 02:12:17.400 Well, we wanted to end today with a look at what else 02:12:17.400 --> 02:12:19.380 you can do with these kinds of data structures, 02:12:19.380 --> 02:12:21.990 just to give you a taste of where else you can go with this 02:12:21.990 --> 02:12:23.760 and what other kinds of problems you can solve. 02:12:23.760 --> 02:12:26.135 Again, thus far, we've looked at arrays, which are really 02:12:26.135 --> 02:12:27.468 the simplest of data structures. 02:12:27.468 --> 02:12:29.093 And they're not even structures per se. 02:12:29.093 --> 02:12:30.940 It's just contiguous blocks of memory. 02:12:30.940 --> 02:12:33.090 We then introduced linked lists, of course, 02:12:33.090 --> 02:12:35.640 where you have this one dimensional data structure that 02:12:35.640 --> 02:12:39.120 allows you to stitch together nodes and memory, giving you dynamic allocation. 02:12:39.120 --> 02:12:42.570 And if you want deallocation, inserting and deleting nodes if you want, 02:12:42.570 --> 02:12:46.380 then we had trees, which kind of gives us the best of both worlds, 02:12:46.380 --> 02:12:48.090 arrays and linked lists. 02:12:48.090 --> 02:12:51.113 But we have to spend more space and use more pointers. 02:12:51.113 --> 02:12:53.280 Then, of course, hash tables kind of merged together 02:12:53.280 --> 02:12:56.430 two of those ideas, arrays and the linked lists. 02:12:56.430 --> 02:12:57.790 And that starts to work well. 02:12:57.790 --> 02:13:00.480 And indeed, that's what you'll experiment with your own spellchecker. 02:13:00.480 --> 02:13:02.855 But then, of course, there's tries, which at first glance 02:13:02.855 --> 02:13:07.680 seem better, but not without great cost, as Ethan says. 02:13:07.680 --> 02:13:11.220 So it turns out, with all of those building blocks at your disposal, 02:13:11.220 --> 02:13:14.820 you can actually use them as lower level implementation details 02:13:14.820 --> 02:13:16.740 to solve higher level problems. 02:13:16.740 --> 02:13:20.400 And this is what are known as abstract data structures, or abstract data 02:13:20.400 --> 02:13:21.060 types. 02:13:21.060 --> 02:13:24.122 An abstract data structure is kind of a mental structure 02:13:24.122 --> 02:13:26.580 that you can imagine, implementing some real world problem, 02:13:26.580 --> 02:13:30.135 typically, that's implemented with some other data structure. 02:13:30.135 --> 02:13:32.010 So you're sort of writing code at this level. 02:13:32.010 --> 02:13:34.885 But you're thinking about what you've built ultimately at this level. 02:13:34.885 --> 02:13:37.710 And that's abstraction, taking lower level implementation 02:13:37.710 --> 02:13:40.830 details, simplifying them for the sake of discussion or problem 02:13:40.830 --> 02:13:42.600 solving higher up. 02:13:42.600 --> 02:13:44.670 So what's one such data structure? 02:13:44.670 --> 02:13:48.180 A queue is a very common abstract data structure. 02:13:48.180 --> 02:13:48.855 What is a queue? 02:13:48.855 --> 02:13:50.813 Well, those of you who grew up, say, in Britain 02:13:50.813 --> 02:13:53.040 call a line outside of a store a queue. 02:13:53.040 --> 02:13:54.840 And that's, indeed, where it gets its name. 02:13:54.840 --> 02:13:58.465 A queue is a data structure that has certain properties. 02:13:58.465 --> 02:14:00.840 So if you're standing outside of a store or a restaurant, 02:14:00.840 --> 02:14:04.005 in healthier times, waiting to get in, you're generally in a queue. 02:14:04.005 --> 02:14:05.880 But there's an important property of a queue. 02:14:05.880 --> 02:14:08.550 At least, if you live in a fair society, you'd 02:14:08.550 --> 02:14:11.760 like to think that if you are the first one in line, 02:14:11.760 --> 02:14:13.920 you are the first one to get out of line. 02:14:13.920 --> 02:14:15.750 First in, first out. 02:14:15.750 --> 02:14:18.150 It would be kind of obnoxious if you're first in line 02:14:18.150 --> 02:14:21.180 and then they start letting people in who are behind you in that queue. 02:14:21.180 --> 02:14:23.880 So a queue, if it's implemented correctly, 02:14:23.880 --> 02:14:27.630 has a property known as FIFO, first in, first out. 02:14:27.630 --> 02:14:30.180 And we humans would think of that as just a fair property. 02:14:30.180 --> 02:14:33.870 And a queue generally has two operations associated with it at least, 02:14:33.870 --> 02:14:35.745 enqueue and dequeue. 02:14:35.745 --> 02:14:36.870 Those are just conventions. 02:14:36.870 --> 02:14:40.020 You could call it add and remove, or insert, delete, whatever. 02:14:40.020 --> 02:14:43.170 But enqueue and dequeue are sort of the more common ones to say. 02:14:43.170 --> 02:14:45.120 So enqueuing means you walk up to the store 02:14:45.120 --> 02:14:47.037 and you get in line, because you have to wait. 02:14:47.037 --> 02:14:49.755 Dequeue means they're ready to serve you or have you. 02:14:49.755 --> 02:14:50.880 And so you get out of line. 02:14:50.880 --> 02:14:51.780 That's dequeue. 02:14:51.780 --> 02:14:54.120 And again, FIFO is just a fancy acronym that 02:14:54.120 --> 02:14:59.100 describes a key property of that, which is that it's first in, first out. 02:14:59.100 --> 02:15:03.390 So how could you implement a queue then? 02:15:03.390 --> 02:15:07.150 Well, what's interesting about that data structure is that it's abstract. 02:15:07.150 --> 02:15:07.650 Right? 02:15:07.650 --> 02:15:10.740 It's more of an idea than an actual thing in code. 02:15:10.740 --> 02:15:14.020 You want to implement some kind of fair queuing system. 02:15:14.020 --> 02:15:15.450 And so you think of it as a queue. 02:15:15.450 --> 02:15:18.770 But frankly, if we're going to translate that example to code, 02:15:18.770 --> 02:15:21.590 you could imagine using an array of persons. 02:15:21.590 --> 02:15:23.450 That could implement a queue. 02:15:23.450 --> 02:15:25.175 You could use a linked list of persons. 02:15:25.175 --> 02:15:27.050 Frankly, either of those would actually work. 02:15:27.050 --> 02:15:30.470 Underneath the hood is the lower level implementation details. 02:15:30.470 --> 02:15:33.860 But what would be a problem, if we translate this real world 02:15:33.860 --> 02:15:38.810 analogy like queuing up outside of a store to get in, into code 02:15:38.810 --> 02:15:39.830 and you used an array? 02:15:39.830 --> 02:15:42.980 What would a downside be of using an array 02:15:42.980 --> 02:15:48.380 to represent a queue, in general, even though we're making a bit of a leap 02:15:48.380 --> 02:15:50.060 from real world to code suddenly? 02:15:50.060 --> 02:15:52.760 But there's probably a downside. 02:15:52.760 --> 02:15:56.720 What might a downside be of using an array to implement a queue, Ryan? 02:15:56.720 --> 02:15:59.690 AUDIENCE: If you're using an array, you can't really 02:15:59.690 --> 02:16:03.327 just take out the existing values. 02:16:03.327 --> 02:16:05.660 Because if you were thinking about doing this in a line, 02:16:05.660 --> 02:16:09.500 you would have to take out the first person, take out the second person. 02:16:09.500 --> 02:16:13.130 But you can't really dynamically sort of change the memory after that. 02:16:13.130 --> 02:16:13.910 DAVID MALAN: Yeah. 02:16:13.910 --> 02:16:14.270 Yeah. 02:16:14.270 --> 02:16:15.180 That's a really good point. 02:16:15.180 --> 02:16:15.972 Think about a line. 02:16:15.972 --> 02:16:19.328 Suppose that there's a line that can fit 10 people outside the Apple store. 02:16:19.328 --> 02:16:22.370 Because Apple's pretty good right now during the health crisis of letting 02:16:22.370 --> 02:16:24.210 people in only so many at a time. 02:16:24.210 --> 02:16:26.840 So suppose they have room for 10 people, six feet apart. 02:16:26.840 --> 02:16:30.000 That's actually a pretty apt analogy, this year more than ever. 02:16:30.000 --> 02:16:35.480 But as Ryan says, if you want to dequeue someone, then the first person in line 02:16:35.480 --> 02:16:36.770 is going to go into the store. 02:16:36.770 --> 02:16:38.600 And then the second person, you're going to dequeue them. 02:16:38.600 --> 02:16:40.049 They go into the store. 02:16:40.049 --> 02:16:41.840 The problem with an array, it would seem, 02:16:41.840 --> 02:16:44.976 is that now you have essentially empty spaces at the beginning of the line. 02:16:44.976 --> 02:16:47.809 But you still don't have room at the end of the line for new people. 02:16:47.809 --> 02:16:49.851 Now there's an obvious real world solution there. 02:16:49.851 --> 02:16:52.879 You just say, hey, everyone, would you mind taking a few steps forward? 02:16:52.879 --> 02:16:54.139 But that's inefficient. 02:16:54.139 --> 02:16:56.847 Not so much in the human world, you've got to get into the store. 02:16:56.847 --> 02:16:58.940 But in code, that's copying of values. 02:16:58.940 --> 02:17:03.410 You have to move, like, eight values two places over if two people were just 02:17:03.410 --> 02:17:04.440 let into the store. 02:17:04.440 --> 02:17:07.520 So now your dequeue operation is Big O of n. 02:17:07.520 --> 02:17:08.990 And that doesn't feel quite ideal. 02:17:08.990 --> 02:17:10.459 And we can do better than that if we're a little 02:17:10.459 --> 02:17:12.440 clever with some local variables and such. 02:17:12.440 --> 02:17:14.870 But that would be one challenge of a queue, 02:17:14.870 --> 02:17:19.722 certainly, is just how well we could implement it using an array. 02:17:19.722 --> 02:17:21.889 So you might imagine, too, an array is limited, too. 02:17:21.889 --> 02:17:24.590 Because it would kind of be obnoxious if you get to the Apple store. 02:17:24.590 --> 02:17:26.100 There's already 10 people in line. 02:17:26.100 --> 02:17:27.559 And they don't let you get in line. 02:17:27.559 --> 02:17:30.650 They say, sorry, we're all full for today, when they're obviously not. 02:17:30.650 --> 02:17:32.900 Because eventually there'll be more room in the queue. 02:17:32.900 --> 02:17:36.209 A linked list would allow you to keep appending more and more people. 02:17:36.209 --> 02:17:38.510 And even if the line outside the store gets crazy long, 02:17:38.510 --> 02:17:41.059 at least the linked list allows you to service 02:17:41.059 --> 02:17:43.790 all of the customers who are showing up over time. 02:17:43.790 --> 02:17:46.400 An array of fixed size would make that harder. 02:17:46.400 --> 02:17:48.379 And, again, you could allocate a bigger array. 02:17:48.379 --> 02:17:50.090 But then you're going to have to ask all the customers, hey, 02:17:50.090 --> 02:17:51.340 could everyone come over here? 02:17:51.340 --> 02:17:51.840 No. 02:17:51.840 --> 02:17:52.730 Go back over there. 02:17:52.730 --> 02:17:57.690 I mean, you're constantly moving humans, or values and memory, back and forth. 02:17:57.690 --> 02:18:00.320 So that is only to say that to implement this real world 02:18:00.320 --> 02:18:03.559 notion of a queue, which is very commonly used even in the computer 02:18:03.559 --> 02:18:09.139 world to represent certain ideas, for instance, the printer queue, when you 02:18:09.139 --> 02:18:12.230 send something to the printer, especially on a campus or in a company, 02:18:12.230 --> 02:18:12.952 there's a queue. 02:18:12.952 --> 02:18:14.660 And ideally, the first person who printed 02:18:14.660 --> 02:18:17.240 is the first one who gets their printouts thereafter. 02:18:17.240 --> 02:18:19.100 Queues are also used in software. 02:18:19.100 --> 02:18:22.760 But there's other abstract data types out there besides queues. 02:18:22.760 --> 02:18:24.379 One of them is called a stack. 02:18:24.379 --> 02:18:27.230 So a stack is a data structure that can also 02:18:27.230 --> 02:18:31.459 be implemented underneath the hood using arrays or linked lists or, 02:18:31.459 --> 02:18:32.870 heck, maybe something else. 02:18:32.870 --> 02:18:35.090 But stacks have a different property. 02:18:35.090 --> 02:18:37.910 It's last in, first out. 02:18:37.910 --> 02:18:39.469 Last in, first out. 02:18:39.469 --> 02:18:41.930 So if you think about the trays in the cafeteria, 02:18:41.930 --> 02:18:46.070 in healthier times when everyone was on campus using trays from a cafeteria, 02:18:46.070 --> 02:18:48.950 you'll recall, of course, that trays tend to get stacked like this. 02:18:48.950 --> 02:18:56.930 And the last tray to go on top of the stack is the first one to come out. 02:18:56.930 --> 02:18:59.000 If you go to a clothing store, your own closet, 02:18:59.000 --> 02:19:01.040 if you don't hang things on hangers or put them in drawers, 02:19:01.040 --> 02:19:03.920 but kind of stack them, like here, like all of these sweaters, 02:19:03.920 --> 02:19:05.540 this is a stack of sweaters. 02:19:05.540 --> 02:19:07.459 And how do I get at a sweater I want? 02:19:07.459 --> 02:19:10.879 Well, the easiest way to do it is with last in, first out. 02:19:10.879 --> 02:19:13.879 So I constantly take the black sweater, the black sweater. 02:19:13.879 --> 02:19:16.100 But if I've stored all of my sweaters in this stack, 02:19:16.100 --> 02:19:18.709 you may never get to the sort of lower level ones, 02:19:18.709 --> 02:19:22.379 like the red or the blue sweater because, again, of this data structure. 02:19:22.379 --> 02:19:26.660 So LIFO, last in, first out, is in fact, the property 02:19:26.660 --> 02:19:28.940 used to characterize stacks. 02:19:28.940 --> 02:19:33.379 And stacks are useful or not useful, depending on the real world context. 02:19:33.379 --> 02:19:36.170 But even within computing, we'll see applications over time 02:19:36.170 --> 02:19:38.480 where stacks indeed come into play. 02:19:38.480 --> 02:19:41.450 And those two operations that those things support are generally called 02:19:41.450 --> 02:19:42.740 push and pop. 02:19:42.740 --> 02:19:46.400 It's the same thing as add or remove or insert or delete, but the terms of art 02:19:46.400 --> 02:19:51.230 are generally push and pop, where this is me popping a value off of the stack. 02:19:51.230 --> 02:19:54.260 This is me pushing a value onto the stack. 02:19:54.260 --> 02:19:58.520 But, again, it's last in, first out, otherwise known as LIFO. 02:19:58.520 --> 02:20:00.410 And then there's this other data structure 02:20:00.410 --> 02:20:04.740 that actually has a very real world analog, known as a dictionary. 02:20:04.740 --> 02:20:07.370 A dictionary is an abstract data type, which 02:20:07.370 --> 02:20:10.370 means you can implement it with arrays, or linked lists, or hash tables, 02:20:10.370 --> 02:20:13.280 or tries, or whatever else, an abstract data type 02:20:13.280 --> 02:20:16.130 that allows you to associate keys with values. 02:20:16.130 --> 02:20:18.560 And the best analog here is indeed in the real world. 02:20:18.560 --> 02:20:20.720 What is a dictionary, like an old school dictionary 02:20:20.720 --> 02:20:23.330 that's actually printed on paper in book form? 02:20:23.330 --> 02:20:25.550 What is inside that book? 02:20:25.550 --> 02:20:30.380 A whole bunch of keys, a whole bunch of boldfaced words, like apple and banana 02:20:30.380 --> 02:20:35.180 and so forth, each of which have definitions, otherwise known as values. 02:20:35.180 --> 02:20:37.548 And they're often alphabetized to make it easier 02:20:37.548 --> 02:20:40.340 for you to find things so that you can look things up more quickly. 02:20:40.340 --> 02:20:45.560 But a dictionary is an abstract data type that associates keys with values. 02:20:45.560 --> 02:20:48.320 And you look up the values by way of their keys, 02:20:48.320 --> 02:20:53.070 just like you look up a word's definition by way of the word itself. 02:20:53.070 --> 02:20:57.318 And dictionaries are actually kind of all around us, too. 02:20:57.318 --> 02:20:59.360 You don't think of them in these terms, probably. 02:20:59.360 --> 02:21:02.330 But if you've ever been to Sweet Green, for instance, in New Haven 02:21:02.330 --> 02:21:05.060 or in Cambridge or elsewhere, this is a salad place 02:21:05.060 --> 02:21:08.840 where nowadays, especially, you can order in advance online or on an app 02:21:08.840 --> 02:21:12.920 and then go into the store and pick up your food from a shelf. 02:21:12.920 --> 02:21:16.580 But the shelf, the way they do it here in Cambridge and in other cities, 02:21:16.580 --> 02:21:20.765 is they actually have letters of the alphabet on the shelves, A, B, C, D, E, 02:21:20.765 --> 02:21:24.500 F, all the way through Z. The idea being that if I go in to pick up my salad, 02:21:24.500 --> 02:21:27.710 it's probably on the D section, if Brian goes in to pick up his, 02:21:27.710 --> 02:21:30.210 it's in the B section, and so forth. 02:21:30.210 --> 02:21:33.680 Now here, too, you can imagine perverse corner cases 02:21:33.680 --> 02:21:37.100 where this data structure, this dictionary whereby 02:21:37.100 --> 02:21:40.910 letters of the alphabet map to values, which are people's salad, 02:21:40.910 --> 02:21:42.920 is not necessarily fail proof. 02:21:42.920 --> 02:21:45.440 Can you think of a perverse corner case where 02:21:45.440 --> 02:21:50.060 Sweet Green's very wonderful, methodical system actually breaks down? 02:21:50.060 --> 02:21:54.640 Can you think of a limitation here, even if you've never been to Sweet Green 02:21:54.640 --> 02:21:58.210 or never eaten a salad, what could break down with this system 02:21:58.210 --> 02:22:02.890 if going into a store and picking something up based on your name? 02:22:02.890 --> 02:22:04.070 Any thoughts? 02:22:04.070 --> 02:22:06.070 BRIAN: A few people say there might be a problem 02:22:06.070 --> 02:22:07.730 if two people have the same name. 02:22:07.730 --> 02:22:08.480 DAVID MALAN: Yeah. 02:22:08.480 --> 02:22:11.500 If two people have the same names, you start to stack things up. 02:22:11.500 --> 02:22:14.667 So literally, Sweet Green will start stacking one salad on top of the other. 02:22:14.667 --> 02:22:16.625 So there is actually an interesting incarnation 02:22:16.625 --> 02:22:19.220 of one data type being built on top of yet another data type. 02:22:19.220 --> 02:22:22.593 So, again, all of these are sort of like custom Scratch pieces, if you will, 02:22:22.593 --> 02:22:24.760 that we're constantly sort of reassembling into more 02:22:24.760 --> 02:22:26.380 interesting and powerful ideas. 02:22:26.380 --> 02:22:29.470 But at some point, if there's a lot of B names, or D names, 02:22:29.470 --> 02:22:33.670 or any letter of the alphabet, I surely see a finite height to this shelf. 02:22:33.670 --> 02:22:37.270 So it's kind of as though Sweet Green has implemented their dictionary 02:22:37.270 --> 02:22:39.920 using stacks with arrays. 02:22:39.920 --> 02:22:41.300 Because arrays are fixed size. 02:22:41.300 --> 02:22:45.040 So there's surely only so many inches of space here vertically. 02:22:45.040 --> 02:22:47.000 So you can see a real world limitation. 02:22:47.000 --> 02:22:48.833 So what does Sweet Green do if that happens? 02:22:48.833 --> 02:22:51.880 They probably just kind of cheat and put the D's in the C section, 02:22:51.880 --> 02:22:53.092 or the D's in the E section. 02:22:53.092 --> 02:22:54.800 Like, who really cares in the real world? 02:22:54.800 --> 02:22:56.967 Your eyes are probably going to skim left and right. 02:22:56.967 --> 02:22:59.530 But algorithmically, that is slowing things down. 02:22:59.530 --> 02:23:02.732 And in the worst case, if you're really late to pick up your salad, 02:23:02.732 --> 02:23:04.690 or if Sweet Green is really popular and there's 02:23:04.690 --> 02:23:08.050 a huge number of salads on the shelf, your name might be Albus, 02:23:08.050 --> 02:23:11.530 but your salad might end up way over here in the Z section 02:23:11.530 --> 02:23:12.820 if they're just out of room. 02:23:12.820 --> 02:23:16.150 And so that, too, is a valid algorithmic decision, 02:23:16.150 --> 02:23:17.930 to just make room somewhere else. 02:23:17.930 --> 02:23:21.650 But, again, trade offs between time and space. 02:23:21.650 --> 02:23:24.847 And so we thought we'd end on a note, thanks to some friends of ours 02:23:24.847 --> 02:23:27.430 at another institution who made a wonderful visualization that 02:23:27.430 --> 02:23:31.480 distinguished these notions of stacks versus queues. 02:23:31.480 --> 02:23:34.448 Stack and, again, a queue are these abstract data types that 02:23:34.448 --> 02:23:35.990 can be implemented in different ways. 02:23:35.990 --> 02:23:40.420 They have different properties, each of them respectively FIFO or LIFO. 02:23:40.420 --> 02:23:43.210 And here, for instance, is a final look in our final moments 02:23:43.210 --> 02:23:47.440 together here today about how these ideas manifest themselves, perhaps 02:23:47.440 --> 02:23:50.170 in the real world, not unlike this stack of sweaters here. 02:23:50.170 --> 02:23:50.837 [VIDEO PLAYBACK] 02:23:50.837 --> 02:23:55.220 [MUSIC PLAYING] 02:23:55.220 --> 02:23:58.130 - Once upon a time, there was a guy named Jack. 02:23:58.130 --> 02:24:01.550 When it came to making friends, Jack did not have the knack. 02:24:01.550 --> 02:24:04.460 So Jack went to talk to the most popular guy he knew. 02:24:04.460 --> 02:24:07.070 He went up to Lou and asked, "What do I do?" 02:24:07.070 --> 02:24:09.620 Lou saw that his friend was really distressed. 02:24:09.620 --> 02:24:12.180 "Well," Lou began, "just look how you're dressed. 02:24:12.180 --> 02:24:14.900 Don't you have any clothes with a different look?" 02:24:14.900 --> 02:24:15.900 "Yes," said Jack. 02:24:15.900 --> 02:24:17.330 "I sure do. 02:24:17.330 --> 02:24:19.590 Come to my house, and I'll show them to you." 02:24:19.590 --> 02:24:20.720 So they went off to Jack's. 02:24:20.720 --> 02:24:23.750 And Jack showed Lou the box where he kept all his shirts 02:24:23.750 --> 02:24:25.490 and his pants and his socks. 02:24:25.490 --> 02:24:28.520 Lou said, "I see you have all your clothes in a pile. 02:24:28.520 --> 02:24:31.070 Why don't you wear some others once in a While?" 02:24:31.070 --> 02:24:34.190 Jack said, "Well, when I remove clothes and socks, 02:24:34.190 --> 02:24:36.980 I wash them and put them away in the box. 02:24:36.980 --> 02:24:39.410 Then comes the next morning, and up I hop. 02:24:39.410 --> 02:24:42.620 I go to the box and get my clothes off the top." 02:24:42.620 --> 02:24:45.320 Lou quickly realized the problem with Jack. 02:24:45.320 --> 02:24:48.290 He kept clothes, CDS, and books in the stack. 02:24:48.290 --> 02:24:50.690 When he reached for something to read or to wear, 02:24:50.690 --> 02:24:53.330 he chose top book or underwear. 02:24:53.330 --> 02:24:55.700 Then, when he was done, he would put it right back. 02:24:55.700 --> 02:24:58.160 Back it would go, on top of the stack. 02:24:58.160 --> 02:25:00.680 "I know the solution!" said a triumphant Lou. 02:25:00.680 --> 02:25:03.290 "You need to learn to start using a queue." 02:25:03.290 --> 02:25:06.050 Lou took Jack's clothes and hung them in a closet. 02:25:06.050 --> 02:25:08.900 And when he had emptied the box, he just tossed it. 02:25:08.900 --> 02:25:12.710 Then he said, "Now, Jack, at the end of the day, put your clothes on the left 02:25:12.710 --> 02:25:14.250 when you put them away. 02:25:14.250 --> 02:25:16.880 Then tomorrow morning when you see the sunshine, get 02:25:16.880 --> 02:25:19.640 your clothes from the right, from the end of the line. 02:25:19.640 --> 02:25:22.550 Don't you see?" said Lou, "it will be so nice. 02:25:22.550 --> 02:25:25.940 You'll wear everything once before you wear something twice." 02:25:25.940 --> 02:25:28.850 And with everything in queues in his closet and shelf, 02:25:28.850 --> 02:25:31.910 Jack started to feel quite sure of himself, all thanks 02:25:31.910 --> 02:25:34.237 to Lou and his wonderful queue. 02:25:34.237 --> 02:25:34.820 [END PLAYBACK] 02:25:34.820 --> 02:25:35.360 DAVID MALAN: All right. 02:25:35.360 --> 02:25:36.470 That's it for CS50. 02:25:36.470 --> 02:25:38.090 We will see you next time. 02:25:38.090 --> 02:25:41.440 [MUSIC PLAYING]