1 00:00:00,000 --> 00:00:02,480 [MUSIC PLAYING] 2 00:00:02,480 --> 00:00:49,022 3 00:00:49,022 --> 00:00:49,980 DAVID MALAN: All right. 4 00:00:49,980 --> 00:00:52,860 This CS50 and this is week five. 5 00:00:52,860 --> 00:00:56,640 Recall that last week in week four, we introduced a few new building blocks, 6 00:00:56,640 --> 00:00:58,920 namely pointers and spoke in great detail 7 00:00:58,920 --> 00:01:01,230 about how you can now manipulate a computer's memory 8 00:01:01,230 --> 00:01:04,120 and begin to do things at a lower level with it. 9 00:01:04,120 --> 00:01:06,690 Well today, we'll sort of use those basic building blocks 10 00:01:06,690 --> 00:01:10,290 to start creating things called data structures in the computer's memory. 11 00:01:10,290 --> 00:01:12,330 It turns out that once you have this ability 12 00:01:12,330 --> 00:01:15,120 to refer to different locations in the computer's memory, 13 00:01:15,120 --> 00:01:18,390 you can stitch together your own custom shapes, 14 00:01:18,390 --> 00:01:20,910 your own custom data structures, as they're called. 15 00:01:20,910 --> 00:01:23,520 And indeed, we'll start doing that by rolling back 16 00:01:23,520 --> 00:01:27,150 for just a moment to where we first saw a data structure in week two. 17 00:01:27,150 --> 00:01:30,837 So recall in week two which was our second week of playing with C, 18 00:01:30,837 --> 00:01:32,670 we introduced you to the notion of an array. 19 00:01:32,670 --> 00:01:35,160 And an array is just a contiguous sequence of memory 20 00:01:35,160 --> 00:01:38,280 in which you can store a whole bunch of integers back to back to back, 21 00:01:38,280 --> 00:01:40,980 or maybe a whole bunch of chars back to back to back. 22 00:01:40,980 --> 00:01:45,280 And those arrays might have been represented pictorially like this. 23 00:01:45,280 --> 00:01:46,890 So this would be an array of size 3. 24 00:01:46,890 --> 00:01:49,560 And suppose that this were an array indeed of integers. 25 00:01:49,560 --> 00:01:52,650 We might have one, two, three inside of it. 26 00:01:52,650 --> 00:01:57,100 But suppose now-- and here's where we'll start to bump up against a problem, 27 00:01:57,100 --> 00:01:58,620 but also solve a problem today. 28 00:01:58,620 --> 00:02:01,260 Suppose that you want to add another number to this array, 29 00:02:01,260 --> 00:02:05,100 but you've only had the forethought to create an array of size 3. 30 00:02:05,100 --> 00:02:09,655 The catch with arrays in C is that they're not really easily resizable. 31 00:02:09,655 --> 00:02:12,780 Again, you all know that you have to decide in advance how big the array is 32 00:02:12,780 --> 00:02:13,630 going to be. 33 00:02:13,630 --> 00:02:16,200 So if you change your mind later or your program's 34 00:02:16,200 --> 00:02:19,260 running long enough where you need to store more values in it, 35 00:02:19,260 --> 00:02:20,550 you're in a bind. 36 00:02:20,550 --> 00:02:24,300 Like intuitively, if you wanted to insert the number 4 into this array, 37 00:02:24,300 --> 00:02:27,990 you would ideally just plop it right here at the end of the array 38 00:02:27,990 --> 00:02:30,660 and continue about your business. 39 00:02:30,660 --> 00:02:36,120 But the catch with an array is that that chunk of memory is not-- 40 00:02:36,120 --> 00:02:37,500 it doesn't exist in a vacuum. 41 00:02:37,500 --> 00:02:40,500 Recall if we zoom out and look at all of your computer's memory, 42 00:02:40,500 --> 00:02:43,170 this byte, this byte and a whole bunch of other bytes 43 00:02:43,170 --> 00:02:46,020 might very well be in use by other variables 44 00:02:46,020 --> 00:02:47,610 or other aspects of your program. 45 00:02:47,610 --> 00:02:49,710 So for instance for the sake of discussion, 46 00:02:49,710 --> 00:02:53,880 suppose that the program in question has one array of size 3 containing 47 00:02:53,880 --> 00:02:55,410 the integers 1, 2 3. 48 00:02:55,410 --> 00:02:59,520 And then suppose that your same program has a string somewhere in the code 49 00:02:59,520 --> 00:03:02,670 that you've written and that string is storing hello, world. 50 00:03:02,670 --> 00:03:05,970 Well next to this array in memory just by chance, 51 00:03:05,970 --> 00:03:13,380 may very well be an H-E-L-L-O comma space W-O-R-L-D backslash 0. 52 00:03:13,380 --> 00:03:15,660 And there might be free memory, so to speak. 53 00:03:15,660 --> 00:03:19,073 Memory you could use that's filled with garbage values and garbage isn't bad. 54 00:03:19,073 --> 00:03:20,490 It just means that you don't know. 55 00:03:20,490 --> 00:03:23,710 You don't really care what values are or were there. 56 00:03:23,710 --> 00:03:25,740 So there is free space, so to speak. 57 00:03:25,740 --> 00:03:28,560 Each of these Oscars represents effectively free space 58 00:03:28,560 --> 00:03:30,030 with some garbage value there. 59 00:03:30,030 --> 00:03:32,940 Remnants, maybe, of some execution past. 60 00:03:32,940 --> 00:03:36,330 But the problem here is that you don't have room 61 00:03:36,330 --> 00:03:40,380 to plop this four right where you might want to put it. 62 00:03:40,380 --> 00:03:41,670 So what's the solution? 63 00:03:41,670 --> 00:03:44,190 If we have this array of size 3 containing three integers-- 64 00:03:44,190 --> 00:03:50,070 1, 2 3-- but it's been painted into a corner whereby H-E-L-L and so forth are 65 00:03:50,070 --> 00:03:54,180 already immediately next to it, we can't just put the four there without 66 00:03:54,180 --> 00:03:57,640 sacrificing the H. And that doesn't really feel like a solution. 67 00:03:57,640 --> 00:04:01,350 So any thoughts on how we can solve this problem? 68 00:04:01,350 --> 00:04:02,970 Are we completely out of luck? 69 00:04:02,970 --> 00:04:08,310 Can you just not add a number to an array in a situation like this? 70 00:04:08,310 --> 00:04:10,780 Or is there a solution perhaps that comes to mind? 71 00:04:10,780 --> 00:04:14,760 Even if you've never programmed before, if on the screen there-- 72 00:04:14,760 --> 00:04:15,780 the lay of the land. 73 00:04:15,780 --> 00:04:17,579 Santiago, what do you think? 74 00:04:17,579 --> 00:04:21,060 AUDIENCE: I would say that maybe you could 75 00:04:21,060 --> 00:04:26,050 copy the elements in the original array and create a new array, 76 00:04:26,050 --> 00:04:29,640 but that's one size bigger or one element bigger. 77 00:04:29,640 --> 00:04:31,230 And then, add that new element. 78 00:04:31,230 --> 00:04:31,980 DAVID MALAN: Yeah. 79 00:04:31,980 --> 00:04:33,120 That's really good intuition. 80 00:04:33,120 --> 00:04:34,950 After all, there's all these Oscars on the screen 81 00:04:34,950 --> 00:04:37,050 right now, which again represent garbage values. 82 00:04:37,050 --> 00:04:38,490 Or in turn, free space. 83 00:04:38,490 --> 00:04:40,680 So I could put 1, 2, 3, 4 up here. 84 00:04:40,680 --> 00:04:42,780 I could put 1, 2, 3, 4 over here. 85 00:04:42,780 --> 00:04:44,640 I could put 1, 2, 3, 4 down here. 86 00:04:44,640 --> 00:04:48,090 So we some flexibility, but Santiago is exactly right. 87 00:04:48,090 --> 00:04:51,450 Intuitively, all we really need to do is let's focus only 88 00:04:51,450 --> 00:04:54,120 on four of the available spots. 89 00:04:54,120 --> 00:04:56,910 A new array of size four, if you will, which initially 90 00:04:56,910 --> 00:04:58,290 has these four garbage values. 91 00:04:58,290 --> 00:04:59,280 But that's OK. 92 00:04:59,280 --> 00:05:01,950 Because as Santiago also notes, it suffices now 93 00:05:01,950 --> 00:05:03,420 to just copy the old array-- 94 00:05:03,420 --> 00:05:06,840 1, 2, 3-- into the new array. 95 00:05:06,840 --> 00:05:10,860 And heck, maybe we can now even free the memory from the original array, 96 00:05:10,860 --> 00:05:12,870 much like we could if we used malloc. 97 00:05:12,870 --> 00:05:15,870 And that leaves us then, of course, with just an array of size 98 00:05:15,870 --> 00:05:17,430 4 with that fourth garbage value. 99 00:05:17,430 --> 00:05:21,923 But now, we do have room for, like the number 4 itself. 100 00:05:21,923 --> 00:05:24,840 So it would seem that there is a solution to this problem that doesn't 101 00:05:24,840 --> 00:05:26,490 violate the definition of an array. 102 00:05:26,490 --> 00:05:29,220 Again, the only definition of an array really 103 00:05:29,220 --> 00:05:31,320 is that the memory must be contiguous. 104 00:05:31,320 --> 00:05:34,620 You can't just plop the 4 anywhere in the computer's memory. 105 00:05:34,620 --> 00:05:37,710 It has to become right after your existing memory 106 00:05:37,710 --> 00:05:39,660 if this whole thing, this whole structure, 107 00:05:39,660 --> 00:05:42,630 is indeed going to still be an array. 108 00:05:42,630 --> 00:05:47,530 But I worry that might have cost us a bit of time. 109 00:05:47,530 --> 00:05:50,200 And in fact, let me go ahead and open up on the screen here 110 00:05:50,200 --> 00:05:54,670 in just a moment a question that you're all welcome to buzz in for. 111 00:05:54,670 --> 00:06:00,220 What would be the running time of inserting into an array? 112 00:06:00,220 --> 00:06:03,220 Let me go ahead and reveal the poll question here. 113 00:06:03,220 --> 00:06:05,050 Feel free to go to the usual URL. 114 00:06:05,050 --> 00:06:08,080 Which Brian, if you wouldn't mind copying and pasting as usual. 115 00:06:08,080 --> 00:06:13,450 What do you think the running time is of inserting into an array? 116 00:06:13,450 --> 00:06:15,400 Inserting into an array. 117 00:06:15,400 --> 00:06:19,303 Recall in the past, we talked about arrays in terms of searching. 118 00:06:19,303 --> 00:06:21,220 But today, this is the first time we're really 119 00:06:21,220 --> 00:06:22,660 talking about inserting into them. 120 00:06:22,660 --> 00:06:27,280 And if we take a look at the results coming in, it looks like 80% of you 121 00:06:27,280 --> 00:06:30,580 feel that it's linear time, big O of n, whereby 122 00:06:30,580 --> 00:06:34,180 it might take as many as n steps to actually insert into an array. 123 00:06:34,180 --> 00:06:36,130 5% of you propose n log n. 124 00:06:36,130 --> 00:06:37,690 7% of you, n squared. 125 00:06:37,690 --> 00:06:40,827 And then, 2% and 5% respectively for the other values as well. 126 00:06:40,827 --> 00:06:42,910 So this is kind of an interesting mix because when 127 00:06:42,910 --> 00:06:46,150 we talked about arrays in the past and we talked about searching, 128 00:06:46,150 --> 00:06:49,330 recall that we typically achieve big O of log n. 129 00:06:49,330 --> 00:06:50,380 That's really good. 130 00:06:50,380 --> 00:06:53,560 Unfortunately if we have to do with Santiago proposed 131 00:06:53,560 --> 00:06:58,540 and actually copy all of the elements from the old array into the new array, 132 00:06:58,540 --> 00:07:03,490 it's going to take us as many as n steps because we have 133 00:07:03,490 --> 00:07:05,800 to copy each of the original elements-- 134 00:07:05,800 --> 00:07:10,670 1, 2, 3-- over into the new array, which is going to be of size n plus 1. 135 00:07:10,670 --> 00:07:13,690 So it's on the order of n steps in total. 136 00:07:13,690 --> 00:07:16,870 So the running time then of inserting into an array, in terms 137 00:07:16,870 --> 00:07:19,780 of an upper bound at least, is going to be indeed big O of n 138 00:07:19,780 --> 00:07:23,560 because you've got a copy potentially all of those elements over. 139 00:07:23,560 --> 00:07:27,880 But perhaps, we might feel differently if we consider a lower 140 00:07:27,880 --> 00:07:30,400 bound on the running time of insert. 141 00:07:30,400 --> 00:07:35,290 What might the lower bound of insert be when it comes to an array? 142 00:07:35,290 --> 00:07:38,140 Again, omega notation is what we can use here. 143 00:07:38,140 --> 00:07:40,510 How many steps maybe in the best case might it 144 00:07:40,510 --> 00:07:42,520 take me to insert a value into an array? 145 00:07:42,520 --> 00:07:44,435 We won't do a poll for this one. 146 00:07:44,435 --> 00:07:46,810 Brian, why don't we go ahead and call on a hand for this. 147 00:07:46,810 --> 00:07:49,240 What's a lower bound on the running time of insert? 148 00:07:49,240 --> 00:07:50,240 Ryan, what do you think? 149 00:07:50,240 --> 00:07:52,073 AUDIENCE: Well, the best case scenario would 150 00:07:52,073 --> 00:07:54,270 be if there's only one element in the array. 151 00:07:54,270 --> 00:07:56,860 So you would just have a one or it would just 152 00:07:56,860 --> 00:07:58,633 be one element right into the array. 153 00:07:58,633 --> 00:08:00,550 DAVID MALAN: Yeah, so if you've got an array-- 154 00:08:00,550 --> 00:08:02,980 and let me emphasize that's already empty-- 155 00:08:02,980 --> 00:08:06,400 whereby you have room for the new element, then indeed. 156 00:08:06,400 --> 00:08:09,460 Omega of 1, constant time, is all you need so long as you 157 00:08:09,460 --> 00:08:11,350 have space available in the array. 158 00:08:11,350 --> 00:08:13,270 And it doesn't matter how large the array is. 159 00:08:13,270 --> 00:08:16,910 Maybe it's a size 4, but you've only put one number in it. 160 00:08:16,910 --> 00:08:20,110 That's OK because you can immediately put the new number in place. 161 00:08:20,110 --> 00:08:22,130 Recall that array support random access. 162 00:08:22,130 --> 00:08:24,130 You can use the square bracket notation and just 163 00:08:24,130 --> 00:08:28,280 jump to any location in so-called constant time in just one step. 164 00:08:28,280 --> 00:08:32,409 So if your array is of sufficient size and it's not full, then yes. 165 00:08:32,409 --> 00:08:34,690 The lower bound on the insertion into an array 166 00:08:34,690 --> 00:08:37,630 is going to be constant time, omega of 1. 167 00:08:37,630 --> 00:08:39,940 But as we saw in Santiago's situation whereby 168 00:08:39,940 --> 00:08:42,250 you have an array that's already filled with elements 169 00:08:42,250 --> 00:08:44,810 and you want to add another, well in that case, 170 00:08:44,810 --> 00:08:47,740 then upper bound is indeed going to be big O of n 171 00:08:47,740 --> 00:08:51,370 because you have to do the additional labor of copying the values over 172 00:08:51,370 --> 00:08:52,315 from one to the other. 173 00:08:52,315 --> 00:08:54,940 Now those of you have programmed before, maybe you've use Java. 174 00:08:54,940 --> 00:08:56,900 You might be familiar with the phrase vector. 175 00:08:56,900 --> 00:09:01,000 A vector is like an array that can be resized, can grow and shrink. 176 00:09:01,000 --> 00:09:05,530 That's not what arrays in C. Arrays in C are just contiguous blocks of memory 177 00:09:05,530 --> 00:09:07,100 with values back to back to back. 178 00:09:07,100 --> 00:09:10,330 But once you decide on their size, that is it. 179 00:09:10,330 --> 00:09:12,700 You're going to have to resize it, essentially yourself. 180 00:09:12,700 --> 00:09:16,280 They're not going to automatically grow for you. 181 00:09:16,280 --> 00:09:19,960 So an array was the first and really the simplest of the data 182 00:09:19,960 --> 00:09:21,130 structures we'll see. 183 00:09:21,130 --> 00:09:23,500 But it's also not nearly as powerful as what 184 00:09:23,500 --> 00:09:25,870 we can do now that we have access to a computers memory. 185 00:09:25,870 --> 00:09:28,328 Today, we're going to start to leverage these things called 186 00:09:28,328 --> 00:09:31,997 pointers, the addresses via which we can refer to locations in memory. 187 00:09:31,997 --> 00:09:35,080 And we're going to start to stitch together some fancier data structures-- 188 00:09:35,080 --> 00:09:37,030 first, one dimensional in some sense. 189 00:09:37,030 --> 00:09:41,530 Then, two dimensional in some sense-- by using some very basic building blocks. 190 00:09:41,530 --> 00:09:44,980 Recall these three pieces of syntax from the past weeks. 191 00:09:44,980 --> 00:09:48,250 Struct, recall, is this mechanism, this keyword in C, 192 00:09:48,250 --> 00:09:51,190 whereby we can define our own structures in memory. 193 00:09:51,190 --> 00:09:55,780 We saw one for a person whereby a person might have both a name and a number 194 00:09:55,780 --> 00:09:57,610 in something like a phone book. 195 00:09:57,610 --> 00:10:00,850 And you've seen the dot operator The dot operator was 196 00:10:00,850 --> 00:10:05,020 how we go inside of such a structure and get at dot name or dot number. 197 00:10:05,020 --> 00:10:07,570 The specific variables inside of the struct. 198 00:10:07,570 --> 00:10:11,020 And then last week, recall we saw the star operator 199 00:10:11,020 --> 00:10:14,410 whereby the dereference operator, which colloquially means 200 00:10:14,410 --> 00:10:17,000 go to this particular address. 201 00:10:17,000 --> 00:10:19,940 So just by using these three ingredients are 202 00:10:19,940 --> 00:10:22,690 we going to be able to now build up our own custom data structures 203 00:10:22,690 --> 00:10:26,062 that are even fancier than arrays and can ultimately help us solve problems 204 00:10:26,062 --> 00:10:26,770 more efficiently. 205 00:10:26,770 --> 00:10:30,610 And in fact, this is such a common technique in programming in C 206 00:10:30,610 --> 00:10:33,790 that these two operators, the dot and the star, 207 00:10:33,790 --> 00:10:38,230 are often combined into one that intentionally looks like an arrow. 208 00:10:38,230 --> 00:10:41,500 So we'll see its usage in just a little bit. 209 00:10:41,500 --> 00:10:45,190 So syntactically today, it's pretty much building blocks 210 00:10:45,190 --> 00:10:48,170 past but we're going to use these building blocks now in new ways 211 00:10:48,170 --> 00:10:50,780 to start solving problems differently. 212 00:10:50,780 --> 00:10:54,620 And we'll first do this by way of something called linked lists. 213 00:10:54,620 --> 00:11:00,110 So for instance, a linked list is going to be a data structure that solves 214 00:11:00,110 --> 00:11:01,940 some of the problems with an array. 215 00:11:01,940 --> 00:11:03,740 And what's a problem arguably? 216 00:11:03,740 --> 00:11:07,117 Well if it takes big O of n steps to insert into an array, 217 00:11:07,117 --> 00:11:08,450 frankly that's kind of annoying. 218 00:11:08,450 --> 00:11:09,710 That's kind of expensive. 219 00:11:09,710 --> 00:11:13,273 Because over time if you're writing a really big program with lots of data-- 220 00:11:13,273 --> 00:11:16,190 you're the Googles of the world, the Twitters of the world-- it's fine 221 00:11:16,190 --> 00:11:18,710 if you want to store all of your data in a big array 222 00:11:18,710 --> 00:11:20,640 for the sake of efficient searching. 223 00:11:20,640 --> 00:11:23,870 Which recall, was big O of log n if we use something like binary search 224 00:11:23,870 --> 00:11:25,280 and keep everything sorted. 225 00:11:25,280 --> 00:11:27,740 But it's going to be pretty painful if every time you 226 00:11:27,740 --> 00:11:32,432 want to add another tweet to the array, or some other web page to the array. 227 00:11:32,432 --> 00:11:34,640 Depending on what the problem is that you're solving, 228 00:11:34,640 --> 00:11:36,432 you might potentially-- as Santiago notes-- 229 00:11:36,432 --> 00:11:40,400 have to copy all of the contents of your original smaller array 230 00:11:40,400 --> 00:11:45,660 into a new bigger array just to add more tweets or more web pages or the like. 231 00:11:45,660 --> 00:11:47,990 So a linked list is going to be a data structure that's 232 00:11:47,990 --> 00:11:52,700 more dynamic, whereby you can grow and shrink the data 233 00:11:52,700 --> 00:11:56,210 structure without having to touch all of the original data 234 00:11:56,210 --> 00:11:58,820 and move it from old location to new. 235 00:11:58,820 --> 00:12:01,073 So what might this look like? 236 00:12:01,073 --> 00:12:02,990 Well let's consider again our computers memory 237 00:12:02,990 --> 00:12:05,615 and let's propose that I want to store those same values again. 238 00:12:05,615 --> 00:12:08,180 So like the number one-- and just for the sake of discussion, 239 00:12:08,180 --> 00:12:12,620 suppose that it's in my computer's memory at address 0x123. 240 00:12:12,620 --> 00:12:14,660 0x just means it's a hexadecimal number. 241 00:12:14,660 --> 00:12:16,430 1-2-3- is completely arbitrary. 242 00:12:16,430 --> 00:12:18,450 I made it up just for the sake of discussion. 243 00:12:18,450 --> 00:12:21,020 So let me stipulate that that's where the number 1 happens 244 00:12:21,020 --> 00:12:24,290 to be in the computer's memory in this new solution 245 00:12:24,290 --> 00:12:26,395 to the problem of storing lots of data. 246 00:12:26,395 --> 00:12:27,770 Suppose I want to store number 2. 247 00:12:27,770 --> 00:12:30,230 Maybe it's at address 0x456. 248 00:12:30,230 --> 00:12:32,340 And then, suppose I want to store number 3. 249 00:12:32,340 --> 00:12:36,260 Suppose it's at address 0x789. 250 00:12:36,260 --> 00:12:40,010 So notice deliberately, these numbers are spread out 251 00:12:40,010 --> 00:12:41,180 in the computer's memory. 252 00:12:41,180 --> 00:12:44,330 Because after all, the problem we ran into a moment ago with arrays 253 00:12:44,330 --> 00:12:48,080 is that you might have hello, world or other values from other parts 254 00:12:48,080 --> 00:12:50,340 of your program in the way. 255 00:12:50,340 --> 00:12:52,310 So if I'm proposing instead that if you want 256 00:12:52,310 --> 00:12:55,850 to store one, and then two, and then three, that's fine. 257 00:12:55,850 --> 00:12:58,640 Plop them anywhere you want and you don't 258 00:12:58,640 --> 00:13:02,520 have to worry about where there is already existing values. 259 00:13:02,520 --> 00:13:06,022 Instead, you can just put these values where there is room. 260 00:13:06,022 --> 00:13:09,230 The problem, though, is that if you just start plopping values like one, two, 261 00:13:09,230 --> 00:13:11,630 three anywhere you want in the computer's memory, 262 00:13:11,630 --> 00:13:15,050 you can no longer very easily find those values, right? 263 00:13:15,050 --> 00:13:17,630 You might know where one is, but it no longer 264 00:13:17,630 --> 00:13:21,950 suffices to just look one location to the right to find the next value 265 00:13:21,950 --> 00:13:25,790 or add two find the next value after that. 266 00:13:25,790 --> 00:13:28,100 In an array, everything is contiguous. 267 00:13:28,100 --> 00:13:30,450 But if we instead start to treat the computer's memory 268 00:13:30,450 --> 00:13:33,980 as just a canvas where we can just draw numbers anywhere 269 00:13:33,980 --> 00:13:37,370 we want, that's fine so long as we can somehow 270 00:13:37,370 --> 00:13:39,830 help ourselves get from the first, to the second, 271 00:13:39,830 --> 00:13:43,310 to the third irrespective of all the other stuff that's 272 00:13:43,310 --> 00:13:45,330 cluttering up the computer's memory. 273 00:13:45,330 --> 00:13:48,980 So in fact, let me propose that we do this by maybe stealing 274 00:13:48,980 --> 00:13:51,590 a bit more space from the computer. 275 00:13:51,590 --> 00:13:55,710 So rather than use just enough memory to store one, two, and three, 276 00:13:55,710 --> 00:13:59,060 let me store twice as much information. 277 00:13:59,060 --> 00:14:02,630 In addition to every number that I actually care about-- one, two, three, 278 00:14:02,630 --> 00:14:03,560 my data-- 279 00:14:03,560 --> 00:14:05,418 let me store a little metadata, so to speak. 280 00:14:05,418 --> 00:14:07,460 The values that I don't fundamentally care about, 281 00:14:07,460 --> 00:14:10,970 but that are going to help me keep track of my actual data. 282 00:14:10,970 --> 00:14:13,650 And let me propose that in this box here, 283 00:14:13,650 --> 00:14:16,355 I literally store the value 0x456. 284 00:14:16,355 --> 00:14:18,980 Again, it's written in hexadecimal, but that's just the number. 285 00:14:18,980 --> 00:14:21,350 That's the address of somewhere else in memory. 286 00:14:21,350 --> 00:14:24,530 In this box, let me propose that I store 0x789. 287 00:14:24,530 --> 00:14:31,070 And in this box, let me arbitrarily say 0x0, which is just all zero bits. 288 00:14:31,070 --> 00:14:32,810 Now why have I done this? 289 00:14:32,810 --> 00:14:35,720 Even if you've never seen this structure that's 290 00:14:35,720 --> 00:14:38,210 evolving to be what's called a linked list, 291 00:14:38,210 --> 00:14:40,670 why have I just done what I've done? 292 00:14:40,670 --> 00:14:43,280 In addition to storing one, two, and three respectively, 293 00:14:43,280 --> 00:14:49,700 I'm also now storing 0x456 and an additional chunk of memory and 0x789 294 00:14:49,700 --> 00:14:51,410 and an additional chunk of memory. 295 00:14:51,410 --> 00:14:52,760 But why? 296 00:14:52,760 --> 00:14:53,720 Sofia. 297 00:14:53,720 --> 00:14:57,380 AUDIENCE: So that we know how the first element relates to the second 298 00:14:57,380 --> 00:14:58,692 or how they're linked together. 299 00:14:58,692 --> 00:14:59,900 DAVID MALAN: That's exactly-- 300 00:14:59,900 --> 00:15:01,130 AUDIENCE: --between the first and second. 301 00:15:01,130 --> 00:15:01,880 DAVID MALAN: Yeah. 302 00:15:01,880 --> 00:15:05,540 So now, I'm using essentially twice as much space to store each piece of data. 303 00:15:05,540 --> 00:15:07,550 One, two, and three respectively. 304 00:15:07,550 --> 00:15:10,280 But in the second chunk of space, I'm now 305 00:15:10,280 --> 00:15:16,410 storing a pointer to the next element in the thing I'll now think of as a list. 306 00:15:16,410 --> 00:15:19,970 So this is 0x456 because the number two, which 307 00:15:19,970 --> 00:15:24,650 is the next number in the list I care about, lives at 0x456. 308 00:15:24,650 --> 00:15:28,640 This number is 0x789 because the next number after that I care about 309 00:15:28,640 --> 00:15:31,310 is at address 0x789. 310 00:15:31,310 --> 00:15:33,650 So it's just a helpful way now of leaving 311 00:15:33,650 --> 00:15:37,100 myself breadcrumbs so that I can plop the one, the two, the three anywhere 312 00:15:37,100 --> 00:15:40,550 I want in the computer's memory wherever there's available space 313 00:15:40,550 --> 00:15:44,720 and still figure out how to get from one to the other to the other. 314 00:15:44,720 --> 00:15:47,400 And we've actually seen some of this syntax before. 315 00:15:47,400 --> 00:15:53,680 It turns out that 0x0, which is all zero bits, that's just the technical-- 316 00:15:53,680 --> 00:15:56,080 that's the numeric equivalent of what we've called null. 317 00:15:56,080 --> 00:16:00,850 N-U-L-L, which we introduced last week, is a special symbol indicating that 318 00:16:00,850 --> 00:16:03,910 something has gone wrong with memory or you're out of space. 319 00:16:03,910 --> 00:16:06,070 It's sort of the absence of an address. 320 00:16:06,070 --> 00:16:09,850 C guarantees that if you use 0x0, AKA null, 321 00:16:09,850 --> 00:16:13,240 that just indicates the absence of any useful address there. 322 00:16:13,240 --> 00:16:13,990 But you know what? 323 00:16:13,990 --> 00:16:16,630 Again like last week, this is of getting way into the weeds. 324 00:16:16,630 --> 00:16:19,690 I don't really care about 0x123456789. 325 00:16:19,690 --> 00:16:23,680 So let's just kind of abstract this away and start thinking about this really 326 00:16:23,680 --> 00:16:26,950 as a list of numbers that are somehow linked together. 327 00:16:26,950 --> 00:16:31,210 Underneath the hood, the links are implemented by way of addresses, 328 00:16:31,210 --> 00:16:36,280 or pointers, those low level numbers like 0x123456789. 329 00:16:36,280 --> 00:16:39,040 But pictorially, it suffices for us to just start 330 00:16:39,040 --> 00:16:44,170 thinking of this data structure, hereafter known as a linked list, 331 00:16:44,170 --> 00:16:47,590 as being a collection of nodes, so to speak-- 332 00:16:47,590 --> 00:16:50,830 N-O-D-E-- that are connected via pointers. 333 00:16:50,830 --> 00:16:52,710 So a node is just a generic computer science 334 00:16:52,710 --> 00:16:54,460 term that refers to some kind of structure 335 00:16:54,460 --> 00:16:56,230 that store stuff you care about. 336 00:16:56,230 --> 00:17:00,260 What I care about here is a number and a pointer to another such structure. 337 00:17:00,260 --> 00:17:05,680 So this is a linked list and each of these rectangles represents a node. 338 00:17:05,680 --> 00:17:08,740 And in code, we'll implement those nodes ultimately 339 00:17:08,740 --> 00:17:12,069 by way of that thing in C called a struct. 340 00:17:12,069 --> 00:17:14,589 But let me pause here to see first if there 341 00:17:14,589 --> 00:17:18,730 are any questions about the structure we have built up. 342 00:17:18,730 --> 00:17:22,180 Any questions about this thing called a linked list 343 00:17:22,180 --> 00:17:24,530 before we see it in some code? 344 00:17:24,530 --> 00:17:25,030 Brian? 345 00:17:25,030 --> 00:17:27,010 BRIAN: Yeah, a question came in in the chat asking, 346 00:17:27,010 --> 00:17:30,177 isn't this kind of a waste of memory that we're now using all of these cells 347 00:17:30,177 --> 00:17:31,390 to store addresses too? 348 00:17:31,390 --> 00:17:33,310 DAVID MALAN: Yeah, really good observation. 349 00:17:33,310 --> 00:17:35,393 Isn't this kind of a waste of memory in that we're 350 00:17:35,393 --> 00:17:37,150 storing all of these addresses in addition 351 00:17:37,150 --> 00:17:39,710 to the numbers 1, 2, 3 that we care about? 352 00:17:39,710 --> 00:17:42,375 Yes, and in fact, that is exactly the price we are paying, 353 00:17:42,375 --> 00:17:43,750 and this is going to be thematic. 354 00:17:43,750 --> 00:17:46,270 This week, last week and really every week thereafter, 355 00:17:46,270 --> 00:17:50,170 any time we solve a problem in CS and programming in particular, 356 00:17:50,170 --> 00:17:52,268 there's always going to be some price paid. 357 00:17:52,268 --> 00:17:54,310 There's going to be some cost and really, there's 358 00:17:54,310 --> 00:17:55,610 going to be some trade-off. 359 00:17:55,610 --> 00:18:00,400 So if a moment ago, it was unacceptable that inserting into an array 360 00:18:00,400 --> 00:18:04,570 is in big O of n because man, that's going to take so many steps to copy all 361 00:18:04,570 --> 00:18:07,420 of the values from the old array into the new array, 362 00:18:07,420 --> 00:18:10,300 if that is unacceptable to you for performance reasons 363 00:18:10,300 --> 00:18:13,870 or whatever the problem is that you're solving, well, that's fine. 364 00:18:13,870 --> 00:18:17,410 You can solve that problem and now have much more dynamism 365 00:18:17,410 --> 00:18:19,900 whereby you can plop your numbers anywhere in memory 366 00:18:19,900 --> 00:18:24,190 without having to move the existing numbers anywhere else, thereby saving 367 00:18:24,190 --> 00:18:27,948 yourself time, but the price you're going to pay indeed is more space. 368 00:18:27,948 --> 00:18:29,740 So at that point, it kind of depends what's 369 00:18:29,740 --> 00:18:31,540 more important to you, the computer's time, 370 00:18:31,540 --> 00:18:35,980 your human time, or maybe the space or the cost of the space, the more memory 371 00:18:35,980 --> 00:18:38,577 that you might really need to literally buy for that computer. 372 00:18:38,577 --> 00:18:39,910 So this is going to be thematic. 373 00:18:39,910 --> 00:18:44,590 This trade-off between space and time is omnipresent, really, in programming. 374 00:18:44,590 --> 00:18:47,780 Well, let's consider how we might translate this now into actual code. 375 00:18:47,780 --> 00:18:51,010 Recall that when we last saw structs in C, 376 00:18:51,010 --> 00:18:55,390 we did something like this to define a person as having two things associated 377 00:18:55,390 --> 00:18:56,980 with them, a name and a number. 378 00:18:56,980 --> 00:18:59,650 So today, we don't care about persons and names and numbers. 379 00:18:59,650 --> 00:19:01,780 We care about these things we're going to start calling nodes, 380 00:19:01,780 --> 00:19:03,447 so let me go ahead and rewind from that. 381 00:19:03,447 --> 00:19:08,440 Erase that and let's instead say that every node in this structure, 382 00:19:08,440 --> 00:19:10,840 renaming person as well to node, is going 383 00:19:10,840 --> 00:19:15,430 to have a number, which will be an int in our case here, 384 00:19:15,430 --> 00:19:18,670 and I've left room for one other value because we ultimately 385 00:19:18,670 --> 00:19:21,255 need to be able to store that second piece of data. 386 00:19:21,255 --> 00:19:23,380 That second piece of data is going to be a pointer. 387 00:19:23,380 --> 00:19:26,860 It's going to be an address of something, but how might 388 00:19:26,860 --> 00:19:29,200 I express this? 389 00:19:29,200 --> 00:19:31,270 This is going to be a little less obvious 390 00:19:31,270 --> 00:19:34,900 but we laid the foundation last week with pointers. 391 00:19:34,900 --> 00:19:39,220 How could I describe this structure as having 392 00:19:39,220 --> 00:19:42,750 a pointer to another such structure? 393 00:19:42,750 --> 00:19:45,880 Any thoughts on verbally the syntax to use, 394 00:19:45,880 --> 00:19:51,010 or even if you're not sure of exactly the incantation, exactly what symbols 395 00:19:51,010 --> 00:19:56,350 we should use to express and address to another such node? 396 00:19:56,350 --> 00:19:59,597 BRIAN: Someone is suggesting we use a node * as a pointer to a node. 397 00:19:59,597 --> 00:20:00,430 DAVID MALAN: Node *? 398 00:20:00,430 --> 00:20:00,930 All right. 399 00:20:00,930 --> 00:20:02,740 So * I definitely remember from last week, 400 00:20:02,740 --> 00:20:06,310 indicating that if you have int *, this is the address of an int. 401 00:20:06,310 --> 00:20:08,710 If you have char *, this is the address of a char. 402 00:20:08,710 --> 00:20:12,130 So if all of these arrows really just represent addresses of nodes, 403 00:20:12,130 --> 00:20:14,200 it stands to reason that syntax is probably 404 00:20:14,200 --> 00:20:16,660 going to be something akin to node *. 405 00:20:16,660 --> 00:20:18,940 Now I can call this pointer anything I want. 406 00:20:18,940 --> 00:20:20,322 By convention, I'll call it next. 407 00:20:20,322 --> 00:20:22,030 Because literally, its purpose in life is 408 00:20:22,030 --> 00:20:24,940 to point to the next node in the data structure. 409 00:20:24,940 --> 00:20:29,140 And that's going to be in addition to the int called number that I'll propose 410 00:20:29,140 --> 00:20:32,120 describes the top of that individual data structure. 411 00:20:32,120 --> 00:20:34,930 But there's a subtle problem here in C. Recall 412 00:20:34,930 --> 00:20:38,680 that in C, it's kind of a simplistic language, complicated 413 00:20:38,680 --> 00:20:42,550 though it might often seem, in that it doesn't understand anything 414 00:20:42,550 --> 00:20:44,480 that it hasn't seen before. 415 00:20:44,480 --> 00:20:47,620 So at the moment, notice that the very first time 416 00:20:47,620 --> 00:20:50,650 I have mentioned node up until now was on the very 417 00:20:50,650 --> 00:20:53,200 last line of this snippet of code. 418 00:20:53,200 --> 00:20:56,530 The problem is that by nature of how typedef works, 419 00:20:56,530 --> 00:20:59,260 the thing called a node does not actually 420 00:20:59,260 --> 00:21:02,980 exist until the compiler is done reading that last line of code 421 00:21:02,980 --> 00:21:06,070 and the semicolon, which is to say that it would actually 422 00:21:06,070 --> 00:21:10,240 be incorrect to use or refer to quote, unquote 423 00:21:10,240 --> 00:21:12,310 "node" inside of this structure. 424 00:21:12,310 --> 00:21:14,860 Because literally, by definition of how typedef works, 425 00:21:14,860 --> 00:21:19,270 a node does not exist until, again, that last line of code and the semicolon 426 00:21:19,270 --> 00:21:19,905 are executed. 427 00:21:19,905 --> 00:21:21,280 Thankfully, there's a workaround. 428 00:21:21,280 --> 00:21:22,660 It's a little weird to look at. 429 00:21:22,660 --> 00:21:24,445 But what you can do in C is this. 430 00:21:24,445 --> 00:21:29,208 You can actually add an additional word after, literally, the keyword struct. 431 00:21:29,208 --> 00:21:30,250 And we'll keep it simple. 432 00:21:30,250 --> 00:21:33,208 We'll use the exact same word, though technically that's not necessary. 433 00:21:33,208 --> 00:21:37,160 And now, I'm going to change the inside of the structure to say this instead. 434 00:21:37,160 --> 00:21:38,740 So it feels a little verbose. 435 00:21:38,740 --> 00:21:40,510 And it feels a little bit of copy paste. 436 00:21:40,510 --> 00:21:44,980 But this is the way it is done in C. Typedef struct node is essentially 437 00:21:44,980 --> 00:21:47,200 a hint, similar in spirit to the prototypes 438 00:21:47,200 --> 00:21:50,680 we've talked about for functions that gives the compiler a clue that, 439 00:21:50,680 --> 00:21:54,310 OK, something called a struct node is going to exist. 440 00:21:54,310 --> 00:21:56,830 You can then use it inside of that data structure 441 00:21:56,830 --> 00:21:59,763 and refer to it as struct node *. 442 00:21:59,763 --> 00:22:01,180 It's more of a mouthful this week. 443 00:22:01,180 --> 00:22:02,847 We've not seen multiple words like this. 444 00:22:02,847 --> 00:22:06,130 But it's similar to char * or int *, like last week. 445 00:22:06,130 --> 00:22:08,140 And I'm going to call that arbitrarily next. 446 00:22:08,140 --> 00:22:12,550 And down here, the same thing happens as in the past with persons. 447 00:22:12,550 --> 00:22:15,130 By calling this node at the very last line of the code, that 448 00:22:15,130 --> 00:22:16,600 just tells the compiler, you know what? 449 00:22:16,600 --> 00:22:19,280 You don't have to refer to it as struct node all over the place. 450 00:22:19,280 --> 00:22:21,310 You can just call this thing node. 451 00:22:21,310 --> 00:22:25,390 So it's a little verbose in this case. 452 00:22:25,390 --> 00:22:29,470 But all this is done is create for me, in the computer, 453 00:22:29,470 --> 00:22:33,490 the definition of a node as we have depicted it pictorially 454 00:22:33,490 --> 00:22:36,430 with that rectangle. 455 00:22:36,430 --> 00:22:40,490 All right, so how can we now translate this into more useful code, not 456 00:22:40,490 --> 00:22:42,490 just defining what these structures are, but how 457 00:22:42,490 --> 00:22:44,470 do we begin building up linked lists? 458 00:22:44,470 --> 00:22:48,460 Well, let me propose that a linked list really begins with just a pointer. 459 00:22:48,460 --> 00:22:51,880 And in fact, here we have, thanks to the theatre's prop shop, 460 00:22:51,880 --> 00:22:54,370 just kind of a null pointer, if you will. 461 00:22:54,370 --> 00:22:56,710 I'm going to call this variable "list." 462 00:22:56,710 --> 00:22:58,460 And list is currently pointing to nothing. 463 00:22:58,460 --> 00:23:01,585 The arrow, we'll say, is just pointing at the floor, which means it's null. 464 00:23:01,585 --> 00:23:03,610 It's not actually pointing at anything useful. 465 00:23:03,610 --> 00:23:06,160 Suppose I now want to start to begin to allocate a linked 466 00:23:06,160 --> 00:23:09,157 list with three numbers, 1, 2, and 3. 467 00:23:09,157 --> 00:23:10,490 Well, how am I going to do this? 468 00:23:10,490 --> 00:23:13,032 Well, at the moment, the only thing that exists in my program 469 00:23:13,032 --> 00:23:14,800 is going to be this variable called list. 470 00:23:14,800 --> 00:23:17,260 There's no array in this story. 471 00:23:17,260 --> 00:23:19,077 That was in Week 2. 472 00:23:19,077 --> 00:23:20,410 Today is all about linked lists. 473 00:23:20,410 --> 00:23:22,570 So how do I get myself a wooden block that 474 00:23:22,570 --> 00:23:25,240 represents 1, another wooden block that represents 2, 475 00:23:25,240 --> 00:23:27,012 and a third that represents 3? 476 00:23:27,012 --> 00:23:29,470 Well, we need to use our new friend from last week, malloc. 477 00:23:29,470 --> 00:23:32,710 Recall that malloc allows you to allocate memory, 478 00:23:32,710 --> 00:23:35,110 as much memory as you might want, so long as you tell it 479 00:23:35,110 --> 00:23:36,400 the size of that thing. 480 00:23:36,400 --> 00:23:38,740 So frankly, I think what we could do ultimately today 481 00:23:38,740 --> 00:23:44,020 is use malloc to allocate dynamically one struct and put the number 1 in it, 482 00:23:44,020 --> 00:23:48,010 another struct put the number 2 on it, another struct put the number 3 in it. 483 00:23:48,010 --> 00:23:52,160 And then, use these playful arrows here to actually stitch them together, 484 00:23:52,160 --> 00:23:53,930 having one point to the other. 485 00:23:53,930 --> 00:23:55,810 So thankfully, the prop shop has wonderfully 486 00:23:55,810 --> 00:23:57,860 created a whole bunch of these for us. 487 00:23:57,860 --> 00:24:04,460 Let me go ahead and malloc a very heavy node that has room for two values. 488 00:24:04,460 --> 00:24:07,795 And you'll see it has room for a number and a next pointer. 489 00:24:07,795 --> 00:24:09,670 So the number I'm going to first install here 490 00:24:09,670 --> 00:24:12,760 is going to be the number 1, for instance. 491 00:24:12,760 --> 00:24:15,640 And I'm going to leave its pointer as just pointing at the ground, 492 00:24:15,640 --> 00:24:17,800 indicating that this is a null pointer. 493 00:24:17,800 --> 00:24:19,660 It's not actually pointing at anything else. 494 00:24:19,660 --> 00:24:22,510 But now that I'm starting to instantiate, 495 00:24:22,510 --> 00:24:25,690 that is create this list, now I'm going to do something like this 496 00:24:25,690 --> 00:24:29,860 and say that, all right, my variable called "list," whose purpose in life 497 00:24:29,860 --> 00:24:32,530 is to keep track of where this list is in memory, 498 00:24:32,530 --> 00:24:34,960 I'm going to connect one to the other by actually having 499 00:24:34,960 --> 00:24:38,080 this variable point at this node. 500 00:24:38,080 --> 00:24:40,540 When it comes time, then, to allocate another node, 501 00:24:40,540 --> 00:24:42,353 I want to insert into this linked list. 502 00:24:42,353 --> 00:24:45,520 Back in the world of arrays, I might have to allocate a new chunk of memory, 503 00:24:45,520 --> 00:24:48,160 copy this value over into the new values. 504 00:24:48,160 --> 00:24:49,300 I don't have to do that. 505 00:24:49,300 --> 00:24:52,570 In the world of linked lists, I just call malloc for a second time 506 00:24:52,570 --> 00:24:55,960 and say, give me another chunk of memory big enough to fit a node. 507 00:24:55,960 --> 00:24:59,170 Thankfully, from the prop shop, we have another one of these. 508 00:24:59,170 --> 00:25:03,628 Let me go ahead and malloc another node here. 509 00:25:03,628 --> 00:25:05,170 At the moment, there's nothing in it. 510 00:25:05,170 --> 00:25:06,400 It's just the placeholder. 511 00:25:06,400 --> 00:25:08,200 So it's garbage values, if you will. 512 00:25:08,200 --> 00:25:10,060 I don't know what's there until I actually 513 00:25:10,060 --> 00:25:13,090 say that the number shall be number 2. 514 00:25:13,090 --> 00:25:17,710 And then I go over to my linked list whose variable name is "list," 515 00:25:17,710 --> 00:25:19,060 and I want to insert this thing. 516 00:25:19,060 --> 00:25:20,710 So I follow the arrow. 517 00:25:20,710 --> 00:25:26,180 I then point the next field of this node at this node here. 518 00:25:26,180 --> 00:25:28,600 So now I have a linked list of size 2. 519 00:25:28,600 --> 00:25:31,990 There's three things in the picture, but this is just a simple variable. 520 00:25:31,990 --> 00:25:35,710 This is a pointer that's pointing at the actual node which, in turn, is 521 00:25:35,710 --> 00:25:37,370 pointing at an actual other node. 522 00:25:37,370 --> 00:25:40,990 Now suppose I want to insert the number 3 into this linked list. 523 00:25:40,990 --> 00:25:43,420 Recall that malloc is powerful and that it 524 00:25:43,420 --> 00:25:45,550 takes memory from wherever it's available, 525 00:25:45,550 --> 00:25:48,080 the so-called "heap" from your computer. 526 00:25:48,080 --> 00:25:51,340 And that means, by definition, that it might not be contiguous. 527 00:25:51,340 --> 00:25:54,730 The next number might not actually be here metaphorically 528 00:25:54,730 --> 00:25:56,050 in the computer's memory. 529 00:25:56,050 --> 00:25:57,953 It might be way over there. 530 00:25:57,953 --> 00:25:59,120 So that might indeed happen. 531 00:25:59,120 --> 00:26:03,250 The third time I call malloc now and allocate a third node, 532 00:26:03,250 --> 00:26:07,000 it might not be available anywhere in the computer's memory except for, 533 00:26:07,000 --> 00:26:09,040 like, way over here. 534 00:26:09,040 --> 00:26:10,280 And that's fine. 535 00:26:10,280 --> 00:26:15,500 It doesn't have to be contiguous as it did in the world of our arrays. 536 00:26:15,500 --> 00:26:18,140 I can now put the number 3 in its place. 537 00:26:18,140 --> 00:26:22,683 But if I want to keep a pointer to that node, so that all of these things 538 00:26:22,683 --> 00:26:24,850 are stitched together, I can start at the beginning. 539 00:26:24,850 --> 00:26:26,200 I follow the arrows. 540 00:26:26,200 --> 00:26:27,350 I follow the arrows. 541 00:26:27,350 --> 00:26:29,590 And now, if I want to remember where that one is, 542 00:26:29,590 --> 00:26:32,120 I'm going to have to connect these things. 543 00:26:32,120 --> 00:26:38,120 And so now, that pointer needs to point at this block here. 544 00:26:38,120 --> 00:26:39,850 And this visual is very much deliberate. 545 00:26:39,850 --> 00:26:42,820 These nodes can be anywhere in the computer's memory. 546 00:26:42,820 --> 00:26:45,370 They're not necessarily going to be contiguous. 547 00:26:45,370 --> 00:26:49,450 The downside of that is that you cannot rely on binary search, 548 00:26:49,450 --> 00:26:51,460 our friend from Week 0 onward. 549 00:26:51,460 --> 00:26:55,210 Binary search was amazing in that it's Big O of log n. 550 00:26:55,210 --> 00:26:57,883 You can find stuff way faster than Big O of n. 551 00:26:57,883 --> 00:26:59,800 That was the whole point of even the phonebook 552 00:26:59,800 --> 00:27:01,720 example in the very first week. 553 00:27:01,720 --> 00:27:05,080 But the upside of this approach is that you 554 00:27:05,080 --> 00:27:09,310 don't have to actually keep allocating and copying more memory 555 00:27:09,310 --> 00:27:11,893 and all of your values any time you want to resize this thing. 556 00:27:11,893 --> 00:27:15,102 And I'm a little embarrassed to admit that I'm out of breath for some reason, 557 00:27:15,102 --> 00:27:16,360 just mallocing nodes here. 558 00:27:16,360 --> 00:27:18,820 But the point is using malloc and building out 559 00:27:18,820 --> 00:27:20,800 the structure does come at some price. 560 00:27:20,800 --> 00:27:22,755 It's exhausting, frankly. 561 00:27:22,755 --> 00:27:24,880 But it's also going to spread things out in memory. 562 00:27:24,880 --> 00:27:26,020 But you have this dynamism. 563 00:27:26,020 --> 00:27:28,270 And honestly, if you are the Twitters of the world, the Googles 564 00:27:28,270 --> 00:27:30,395 of the world, where you have lots and lots of data, 565 00:27:30,395 --> 00:27:34,360 it's going to kill you performance wise to have to copy all of your data 566 00:27:34,360 --> 00:27:36,810 from one location into another as Santiago originally 567 00:27:36,810 --> 00:27:39,450 proposed as a solution to the array. 568 00:27:39,450 --> 00:27:43,380 So using these dynamic data structures, like a linked list, where 569 00:27:43,380 --> 00:27:45,690 you allocate more space wherever it's available, 570 00:27:45,690 --> 00:27:48,750 and you somehow remember where that is by stitching [? team ?] things 571 00:27:48,750 --> 00:27:50,940 together, as with these physical pointers, 572 00:27:50,940 --> 00:27:53,160 is really the state of the art in how you 573 00:27:53,160 --> 00:27:56,730 can create these more dynamic structures if it's more important to you 574 00:27:56,730 --> 00:27:59,100 to have that dynamism. 575 00:27:59,100 --> 00:28:04,010 All right, any questions before we now translate these physical blocks 576 00:28:04,010 --> 00:28:06,897 to some samples of code? 577 00:28:06,897 --> 00:28:08,480 BRIAN: A question came in in the chat. 578 00:28:08,480 --> 00:28:11,660 When in this whole process of linked lists are we actually using malloc? 579 00:28:11,660 --> 00:28:13,220 And what is malloc being used for? 580 00:28:13,220 --> 00:28:14,280 DAVID MALAN: Really good question. 581 00:28:14,280 --> 00:28:15,488 So where are we using malloc? 582 00:28:15,488 --> 00:28:17,600 Every time I went off stage and grabbed one 583 00:28:17,600 --> 00:28:23,270 of these big blocks, that was my acting out the process of mallocing a node. 584 00:28:23,270 --> 00:28:25,520 So when you call malloc, that returns to you, 585 00:28:25,520 --> 00:28:30,240 per last week, the address of the first byte of a chunk of memory. 586 00:28:30,240 --> 00:28:33,200 And if you call malloc of one, that gives you the address of one byte. 587 00:28:33,200 --> 00:28:35,930 If you call malloc of 100, that gives you the first address 588 00:28:35,930 --> 00:28:38,580 of 100 bytes that are contiguous. 589 00:28:38,580 --> 00:28:43,110 So each of these nodes, then, represents the return value, if you will, 590 00:28:43,110 --> 00:28:45,200 of a single call to malloc. 591 00:28:45,200 --> 00:28:47,600 And in fact, perhaps, Brian, the best way 592 00:28:47,600 --> 00:28:50,720 to answer that question in more detail is to translate this now 593 00:28:50,720 --> 00:28:51,627 to some actual code. 594 00:28:51,627 --> 00:28:53,210 So let's do that now and then revisit. 595 00:28:53,210 --> 00:28:56,000 So here is, for instance, a line of code that 596 00:28:56,000 --> 00:28:58,310 represents the beginning of our story, where we only 597 00:28:58,310 --> 00:29:02,960 had a variable called "list" that was initialized to nothing initially. 598 00:29:02,960 --> 00:29:05,023 This arrow was not pointing at anything. 599 00:29:05,023 --> 00:29:07,190 And in fact, if it was just pointing up, down, left, 600 00:29:07,190 --> 00:29:09,650 or right, it would be considered a garbage value. 601 00:29:09,650 --> 00:29:12,440 This is just memory, which means there are values inside of it 602 00:29:12,440 --> 00:29:15,240 before I actually put actual values in it. 603 00:29:15,240 --> 00:29:18,273 So if I don't assign it a value, who knows what it's pointing to? 604 00:29:18,273 --> 00:29:19,940 But let me go ahead and change the code. 605 00:29:19,940 --> 00:29:24,290 This list variable, by default, has some garbage value unless I initialize it 606 00:29:24,290 --> 00:29:25,490 to something like null. 607 00:29:25,490 --> 00:29:28,940 So I'll represent that here figuratively by just pointing at the ground. 608 00:29:28,940 --> 00:29:30,260 That now represents null. 609 00:29:30,260 --> 00:29:32,510 This would be the corresponding line of code that just 610 00:29:32,510 --> 00:29:35,420 creates for you an empty linked list. 611 00:29:35,420 --> 00:29:36,975 That was the beginning of the story. 612 00:29:36,975 --> 00:29:39,350 Now recall that the next thing I did was I went off stage 613 00:29:39,350 --> 00:29:42,620 and I called malloc to bring back one of these big boxes. 614 00:29:42,620 --> 00:29:45,980 That code might instead look like this. 615 00:29:45,980 --> 00:29:47,360 Node *n. 616 00:29:47,360 --> 00:29:52,040 So I could call the variable anything I want, malloc, size of node. 617 00:29:52,040 --> 00:29:55,040 So size of we might have seen briefly in that it's just 618 00:29:55,040 --> 00:29:59,550 an operator in C that tells you how many bytes large any data type is. 619 00:29:59,550 --> 00:30:03,710 So I could do the math and figure out in my Mac or PC or CS50 IDE just how 620 00:30:03,710 --> 00:30:05,690 many bytes these nodes are supposed to take up. 621 00:30:05,690 --> 00:30:07,970 Size of just answers that question for me. 622 00:30:07,970 --> 00:30:10,370 So malloc, again, takes one argument, the number 623 00:30:10,370 --> 00:30:13,010 of bytes you want to allocate dynamically, 624 00:30:13,010 --> 00:30:15,930 and it returns to the address of the first of those bytes. 625 00:30:15,930 --> 00:30:18,822 So if you think of this as one of my earlier slides in yellow, 626 00:30:18,822 --> 00:30:20,780 it returns to you, like, the address of the top 627 00:30:20,780 --> 00:30:23,670 left byte of this chunk of memory, if you will. 628 00:30:23,670 --> 00:30:26,160 And I'm going to go ahead and assign that to a variable 629 00:30:26,160 --> 00:30:28,430 I'll call n to represent a node. 630 00:30:28,430 --> 00:30:33,410 And its node * because, again, malloc returns always an address. 631 00:30:33,410 --> 00:30:36,050 And the syntax we saw last week for storing addresses 632 00:30:36,050 --> 00:30:39,300 is to use this new * syntax here. 633 00:30:39,300 --> 00:30:44,310 So this gave me a block that initially just had garbage values. 634 00:30:44,310 --> 00:30:48,450 So there was no number in place and who knows where the arrow was pointing at? 635 00:30:48,450 --> 00:30:52,010 So it looked a little something like this, if I draw it now on the screen. 636 00:30:52,010 --> 00:30:53,523 List is initialized to null. 637 00:30:53,523 --> 00:30:55,190 It's not pointing at anything right now. 638 00:30:55,190 --> 00:30:59,330 But n, the variable I just declared, is pointing at a node. 639 00:30:59,330 --> 00:31:01,220 But inside of that node, or who knows what? 640 00:31:01,220 --> 00:31:02,120 It's garbage values. 641 00:31:02,120 --> 00:31:04,160 Number and next are just garbage values. 642 00:31:04,160 --> 00:31:07,760 Because that's what's there by default, remnants of the past. 643 00:31:07,760 --> 00:31:10,310 But now, let me propose that we do this code. 644 00:31:10,310 --> 00:31:12,098 So long as n is not null-- 645 00:31:12,098 --> 00:31:15,140 which is something you want to get into the habit always now of checking, 646 00:31:15,140 --> 00:31:19,580 any time in C, when you call a function that returns to you a pointer, 647 00:31:19,580 --> 00:31:23,030 you should almost always check is it null or is it not null. 648 00:31:23,030 --> 00:31:26,150 Because you do not want to try touching it if it is indeed null. 649 00:31:26,150 --> 00:31:28,490 Because that means there is no valid address here. 650 00:31:28,490 --> 00:31:30,950 That's the human convention when using pointers. 651 00:31:30,950 --> 00:31:34,578 But if n does not equal null, that's a good thing. 652 00:31:34,578 --> 00:31:37,370 That means it's a valid address somewhere in the computer's memory. 653 00:31:37,370 --> 00:31:43,040 Let me go ahead now and go to that address, the syntax for which is *n, 654 00:31:43,040 --> 00:31:44,480 just like last week. 655 00:31:44,480 --> 00:31:48,530 And then the operator means go inside of the structure that's there 656 00:31:48,530 --> 00:31:52,230 and go into the variable inside of it, called number in this case. 657 00:31:52,230 --> 00:31:54,680 So when I go ahead and do something like this, 658 00:31:54,680 --> 00:31:58,400 when I have a list that's initially of size 1, 659 00:31:58,400 --> 00:32:04,070 where this variable is pointing at the one and only node at the moment, 660 00:32:04,070 --> 00:32:07,970 it's going to have the number 1 in it as soon as I execute this line of code. 661 00:32:07,970 --> 00:32:12,710 *n means start at the address embodied here, go to it, 662 00:32:12,710 --> 00:32:16,560 and then put the number, 1 in this case, in place. 663 00:32:16,560 --> 00:32:18,740 But I need to do one other thing as well. 664 00:32:18,740 --> 00:32:20,990 I want to go ahead, too, and replace the garbage 665 00:32:20,990 --> 00:32:23,570 value that represents the next pointer in that structure 666 00:32:23,570 --> 00:32:24,680 and replace it with null. 667 00:32:24,680 --> 00:32:27,020 Null [INAUDIBLE] find that this is the end of the list. 668 00:32:27,020 --> 00:32:27,770 There's nothing there. 669 00:32:27,770 --> 00:32:29,150 I don't want it to be garbage value. 670 00:32:29,150 --> 00:32:31,067 Because a garbage value is an arbitrary number 671 00:32:31,067 --> 00:32:33,380 that could be pointing this way, this way, this way. 672 00:32:33,380 --> 00:32:36,560 Metaphorically, I want to actually change that to be null 673 00:32:36,560 --> 00:32:38,348 and so I can use the same syntax. 674 00:32:38,348 --> 00:32:39,890 But there's this clever approach now. 675 00:32:39,890 --> 00:32:43,850 I don't have to use star and then dot over the place, as mentioned earlier. 676 00:32:43,850 --> 00:32:47,390 Star and dot comes with some syntactic sugar. 677 00:32:47,390 --> 00:32:50,290 You can replace the star and the parentheses and the dot 678 00:32:50,290 --> 00:32:54,050 that we just saw with an arrow notation, which means follow the arrow. 679 00:32:54,050 --> 00:32:56,390 And then set this thing equal to null, which 680 00:32:56,390 --> 00:32:59,240 I'll represent again by just having the arrow literally points 681 00:32:59,240 --> 00:33:02,040 at the floor for clarity. 682 00:33:02,040 --> 00:33:07,490 So, again, rewinding from where we started, *n.number, 683 00:33:07,490 --> 00:33:11,270 with the first of those in parentheses, is the same thing as this. 684 00:33:11,270 --> 00:33:15,270 And the reason that most people prefer using the arrow notation, so to speak, 685 00:33:15,270 --> 00:33:18,020 is that it really does capture this physicality. 686 00:33:18,020 --> 00:33:20,870 You start at the address, you go there, and then 687 00:33:20,870 --> 00:33:24,050 you look at the field, number or next, respectively. 688 00:33:24,050 --> 00:33:27,875 But it's equivalent to the syntax we saw a moment ago with the start and the dot 689 00:33:27,875 --> 00:33:29,090 as before. 690 00:33:29,090 --> 00:33:30,980 So after these two steps, have we initialized 691 00:33:30,980 --> 00:33:36,760 this node to containing the number 1 and null inside of it. 692 00:33:36,760 --> 00:33:38,540 But what comes next? 693 00:33:38,540 --> 00:33:39,380 What comes next? 694 00:33:39,380 --> 00:33:41,470 Well, at point in the story, in this code, 695 00:33:41,470 --> 00:33:43,690 I have some other variable here that's not pictured. 696 00:33:43,690 --> 00:33:47,090 Because we're now transitioning from the world of woodwork to actual code. 697 00:33:47,090 --> 00:33:51,130 So there's another variable n, which I might as well be representing myself. 698 00:33:51,130 --> 00:33:54,280 If I am this temporary variable n, I, too, am pointing at this value. 699 00:33:54,280 --> 00:33:59,830 It's not until I actually execute this line of code, list = n;, 700 00:33:59,830 --> 00:34:04,357 that I remember where this node is in the computer's memory. 701 00:34:04,357 --> 00:34:06,940 So n, up until now, has really just been a temporary variable. 702 00:34:06,940 --> 00:34:09,190 It's a variable that I can use to actually keep 703 00:34:09,190 --> 00:34:10,969 track of this thing in memory. 704 00:34:10,969 --> 00:34:14,530 But if I want to add this node ultimately to my linked list, 705 00:34:14,530 --> 00:34:18,460 list started out as null, recall that this pointer was pointing 706 00:34:18,460 --> 00:34:20,170 at the ground, representing null. 707 00:34:20,170 --> 00:34:24,250 But when I now want to remember that this linked list has a node in it, 708 00:34:24,250 --> 00:34:30,489 I need to actually execute a line of code like this, list = null. 709 00:34:30,489 --> 00:34:31,900 All right, what did we next do? 710 00:34:31,900 --> 00:34:33,530 Let's take one more step further. 711 00:34:33,530 --> 00:34:36,159 So at this point in the story, if I was representing n, 712 00:34:36,159 --> 00:34:38,650 I'm also pointing at the same block. n is temporary, 713 00:34:38,650 --> 00:34:39,969 so it can eventually go away. 714 00:34:39,969 --> 00:34:43,018 But at this point in the story, we have a linked list of size 1. 715 00:34:43,018 --> 00:34:44,560 Let's go ahead and take this further. 716 00:34:44,560 --> 00:34:46,389 Suppose now I execute these lines of code. 717 00:34:46,389 --> 00:34:48,489 And we'll do it a little faster and all at once so 718 00:34:48,489 --> 00:34:50,230 that you can see it more in context. 719 00:34:50,230 --> 00:34:52,389 The first line of code is the same as before. 720 00:34:52,389 --> 00:34:54,722 Hey, malloc, give me a chunk of memory that's 721 00:34:54,722 --> 00:34:56,139 big enough for the size of a node. 722 00:34:56,139 --> 00:34:59,465 And, again, let's use this temporary variable n to point to that. 723 00:34:59,465 --> 00:35:02,590 And suppose that means that I, if I'm representing this temporary variable, 724 00:35:02,590 --> 00:35:04,630 I'm pointing at this new chunk of memory here. 725 00:35:04,630 --> 00:35:08,260 I then check, if n does not equal null, then 726 00:35:08,260 --> 00:35:10,870 and only then do I go ahead and install the number 727 00:35:10,870 --> 00:35:16,450 2 as I did physically earlier and do I initialize the pointer originally 728 00:35:16,450 --> 00:35:21,520 to pointing not at some other node which doesn't yet exist, but pointing at, 729 00:35:21,520 --> 00:35:24,640 let's just call it the floor, thereby representing null. 730 00:35:24,640 --> 00:35:26,020 And that's it. 731 00:35:26,020 --> 00:35:28,180 That has now allocated the second node. 732 00:35:28,180 --> 00:35:30,760 But notice, literally, this disconnect. 733 00:35:30,760 --> 00:35:33,010 Just because I've allocated a new node and put 734 00:35:33,010 --> 00:35:36,220 the number I care about and initialized its next pointer to null, 735 00:35:36,220 --> 00:35:38,480 that doesn't mean it's part of the data structure. 736 00:35:38,480 --> 00:35:43,030 The linked list is still missing a pointer from old to new. 737 00:35:43,030 --> 00:35:45,200 So we need to execute one other line of code 738 00:35:45,200 --> 00:35:49,150 now so that we can get from this picture ultimately to the final one. 739 00:35:49,150 --> 00:35:52,420 And here's where we can use the same kind of syntax as before. 740 00:35:52,420 --> 00:35:57,550 If I start at my list variable, I follow the arrow as per that code. 741 00:35:57,550 --> 00:36:00,280 And then I update the next field to point 742 00:36:00,280 --> 00:36:04,940 to n, which is my newly allocated node. 743 00:36:04,940 --> 00:36:09,700 Now, after that final line of code, do I have a linked list of size 2. 744 00:36:09,700 --> 00:36:12,880 Because I've not only allocated the node, initialized its two 745 00:36:12,880 --> 00:36:17,050 variables, a number and next respectively, 746 00:36:17,050 --> 00:36:23,727 I've also chained it together with the existing node on the linked list. 747 00:36:23,727 --> 00:36:25,810 And let me do this one even slightly more quickly, 748 00:36:25,810 --> 00:36:29,320 only because it's the same thing, just so we can see it all together at once. 749 00:36:29,320 --> 00:36:32,660 Now that we have this picture, let's execute the same kind of code again. 750 00:36:32,660 --> 00:36:34,390 The only difference in this chunk of code 751 00:36:34,390 --> 00:36:36,430 is that I'm initializing number to 3. 752 00:36:36,430 --> 00:36:37,660 So what has this done for me? 753 00:36:37,660 --> 00:36:40,540 That chunk of code has malloced a third and final node. 754 00:36:40,540 --> 00:36:42,430 I've initialized the top to the number 3. 755 00:36:42,430 --> 00:36:44,710 I've initialized the bottom to null, as I'll 756 00:36:44,710 --> 00:36:46,990 represent by just pointing the arrow at the ground. 757 00:36:46,990 --> 00:36:49,060 There's one final step, then. 758 00:36:49,060 --> 00:36:52,270 If I want to go ahead and insert that third node into this linked list, 759 00:36:52,270 --> 00:36:55,180 I've got to go ahead now and not just point it at myself, 760 00:36:55,180 --> 00:36:59,350 me representing, again, the temporary variable n, I need to now do this. 761 00:36:59,350 --> 00:37:01,270 And this is syntax you won't do often. 762 00:37:01,270 --> 00:37:03,830 We'll see it in an actual program in just a moment. 763 00:37:03,830 --> 00:37:07,090 But it just speaks to the basic building blocks we're manipulating. 764 00:37:07,090 --> 00:37:12,250 If I want to go ahead and link the 2 to the 3, I can start here at list. 765 00:37:12,250 --> 00:37:14,140 I can follow the arrow once. 766 00:37:14,140 --> 00:37:15,850 I can follow the arrow again. 767 00:37:15,850 --> 00:37:20,140 And then I can set that next field equal to n. 768 00:37:20,140 --> 00:37:29,110 Because, again, n is the current address of that most recently allocated node. 769 00:37:29,110 --> 00:37:34,770 So even though the syntax is getting a little new, the two arrows I'm using 770 00:37:34,770 --> 00:37:37,740 are literally just like a code manifestation of just 771 00:37:37,740 --> 00:37:43,950 follow the pointer, follow the pointer, boom, assign one value to the next. 772 00:37:43,950 --> 00:37:47,290 All right, so at this point in the story, the picture now looks like this. 773 00:37:47,290 --> 00:37:49,020 So long as I, n, am still in the picture. 774 00:37:49,020 --> 00:37:50,853 But if I just get rid of myself, because I'm 775 00:37:50,853 --> 00:37:54,480 a temporary variable, voila, we've just built up, step by step, 776 00:37:54,480 --> 00:38:00,480 a brand new linked list of size 3. 777 00:38:00,480 --> 00:38:06,973 Seems like a lot of work, but it allows us to now grow this thing dynamically. 778 00:38:06,973 --> 00:38:07,890 But let me pause here. 779 00:38:07,890 --> 00:38:12,460 Any questions or confusion on linked lists? 780 00:38:12,460 --> 00:38:15,210 Any questions or confusions on linked lists? 781 00:38:15,210 --> 00:38:17,267 BRIAN: One question just came in the chat. 782 00:38:17,267 --> 00:38:20,100 If we're trying to make a list that's going to be much longer, like, 783 00:38:20,100 --> 00:38:22,475 more than three elements, wouldn't it get tedius to have, 784 00:38:22,475 --> 00:38:24,630 like, arrow next, arrow next over and over again? 785 00:38:24,630 --> 00:38:25,050 DAVID MALAN: Yeah. 786 00:38:25,050 --> 00:38:25,990 Really good observation. 787 00:38:25,990 --> 00:38:28,470 And so that's why I said, you won't usually write it like this. 788 00:38:28,470 --> 00:38:31,220 We'll do it temporarily in a full-fledged program in just a moment 789 00:38:31,220 --> 00:38:32,220 just to demonstrate it. 790 00:38:32,220 --> 00:38:34,920 But in general, you'll probably use something like a loop. 791 00:38:34,920 --> 00:38:38,190 And what you'll do is use a temporary variable that points at this one, 792 00:38:38,190 --> 00:38:40,530 then iterates again, then iterates again. 793 00:38:40,530 --> 00:38:43,930 And let me stipulate for now that if you use a loop in the right way, 794 00:38:43,930 --> 00:38:46,230 you can end up writing just a single arrow by just 795 00:38:46,230 --> 00:38:48,447 keep updating the variable again and again. 796 00:38:48,447 --> 00:38:49,780 So there is a way to avoid that. 797 00:38:49,780 --> 00:38:53,430 And you do it much more dynamically. 798 00:38:53,430 --> 00:38:56,410 Let me go ahead then and ask you a question of my own here. 799 00:38:56,410 --> 00:38:58,170 Let me go ahead and ask this one. 800 00:38:58,170 --> 00:39:00,780 If you'd like to buzz in to this question, 801 00:39:00,780 --> 00:39:05,170 what is the running time of searching a linked list? 802 00:39:05,170 --> 00:39:08,800 What's the running time of searching a linked list in Big O notation? 803 00:39:08,800 --> 00:39:12,690 So what's an upper bound, worst case of searching 804 00:39:12,690 --> 00:39:17,460 a linked list like this one, whether it has three elements or even many more? 805 00:39:17,460 --> 00:39:20,190 So it looks like, as before, about 80% of the group 806 00:39:20,190 --> 00:39:21,750 is proposing that it's Big O of n. 807 00:39:21,750 --> 00:39:23,027 And this is actually correct. 808 00:39:23,027 --> 00:39:24,360 Because consider the worst case. 809 00:39:24,360 --> 00:39:26,318 Suppose you're looking for the number 3, you're 810 00:39:26,318 --> 00:39:29,400 going to have to look at all three numbers, Big O of n. 811 00:39:29,400 --> 00:39:31,260 If you're looking for the number 10, you're 812 00:39:31,260 --> 00:39:32,670 going to start here at the beginning, you're 813 00:39:32,670 --> 00:39:34,110 going to keep looking, looking, looking, looking. 814 00:39:34,110 --> 00:39:37,235 You're going to get to the end realize, oh, the number 10 is not even here. 815 00:39:37,235 --> 00:39:39,840 At which point, you've already looked at n elements. 816 00:39:39,840 --> 00:39:42,690 But here's one of the trade, again, of linked lists. 817 00:39:42,690 --> 00:39:46,413 With arrays, you could jump to the end of the list in constant time. 818 00:39:46,413 --> 00:39:48,330 You could just use a little bit of arithmetic. 819 00:39:48,330 --> 00:39:50,830 You could jump to the middle elements, or the first element, 820 00:39:50,830 --> 00:39:51,840 all using constant time. 821 00:39:51,840 --> 00:39:55,140 A linked list, unfortunately, is represented ultimately 822 00:39:55,140 --> 00:39:59,730 by just a single address, the address that points to the very first node. 823 00:39:59,730 --> 00:40:02,850 And so even though you all on camera can see this node and this node 824 00:40:02,850 --> 00:40:05,460 and this one as humans all at once, the computer 825 00:40:05,460 --> 00:40:07,350 can only follow these breadcrumbs. 826 00:40:07,350 --> 00:40:12,300 And so searching a linked list is going to be Big O of n in that case. 827 00:40:12,300 --> 00:40:14,460 But let me ask a follow up question. 828 00:40:14,460 --> 00:40:19,930 What's the running time of inserting into a linked list? 829 00:40:19,930 --> 00:40:23,220 What's the running time of inserting into a linked list? 830 00:40:23,220 --> 00:40:29,880 So you've got some new number like the number 4 or 0 or 100 or negative 5, 831 00:40:29,880 --> 00:40:31,350 whatever it may be. 832 00:40:31,350 --> 00:40:34,260 There's going to be a malloc involved, but that's constant time. 833 00:40:34,260 --> 00:40:36,010 It's just one function call. 834 00:40:36,010 --> 00:40:38,010 But you're going to have to insert it somewhere. 835 00:40:38,010 --> 00:40:42,720 And here it looks like 68% of you are proposing Big O of 1, 836 00:40:42,720 --> 00:40:44,820 which is interesting, constant time. 837 00:40:44,820 --> 00:40:47,880 25% of you are proposing Big O of n. 838 00:40:47,880 --> 00:40:51,360 Would anyone be comfortable chiming in verbally or on the chat 839 00:40:51,360 --> 00:40:54,970 as to why you feel it's one or the other? 840 00:40:54,970 --> 00:40:57,570 It is indeed one of those answers. 841 00:40:57,570 --> 00:41:00,450 AUDIENCE: It could be O of n. 842 00:41:00,450 --> 00:41:03,330 Because of the fact that even though you're 843 00:41:03,330 --> 00:41:06,570 using malloc to create a new node, essentially, 844 00:41:06,570 --> 00:41:11,820 I think all the computer's doing when you assign it is [? going-- ?] 845 00:41:11,820 --> 00:41:14,820 as you cite those arrows, like we're going from one arrow to the next 846 00:41:14,820 --> 00:41:16,750 to the next to the next. 847 00:41:16,750 --> 00:41:19,260 And I would think it would be O of n. 848 00:41:19,260 --> 00:41:21,300 DAVID MALAN: It is O of n as you described it. 849 00:41:21,300 --> 00:41:27,930 But you, too, are making an assumption, like 25% of other people are making. 850 00:41:27,930 --> 00:41:31,770 You seem to be assuming that if a new number, suppose it's number 4, 851 00:41:31,770 --> 00:41:33,240 has to go at the end. 852 00:41:33,240 --> 00:41:35,440 Or if it's the number 5, it has to go at the end. 853 00:41:35,440 --> 00:41:37,523 And I kind of deliberately set things up that way. 854 00:41:37,523 --> 00:41:41,310 I've happened to maintain them in sorted order, 1, 2, 3, from left to right. 855 00:41:41,310 --> 00:41:43,440 But up until now, I have not made the condition 856 00:41:43,440 --> 00:41:45,700 that the linked list has to be sorted, even 857 00:41:45,700 --> 00:41:48,450 though the examples we've seen thus far are deliberately that way. 858 00:41:48,450 --> 00:41:51,587 But you know what, if you want to get fancy and a little more efficient, 859 00:41:51,587 --> 00:41:54,420 and you want to allocate the number 4, and frankly, you don't really 860 00:41:54,420 --> 00:41:57,570 care about keeping the linked list in sorted order, 861 00:41:57,570 --> 00:42:00,930 well, heck, just pull this out, put your new node here. 862 00:42:00,930 --> 00:42:01,950 Plug it in here. 863 00:42:01,950 --> 00:42:03,420 Plug the other one back in here. 864 00:42:03,420 --> 00:42:06,870 And just insert the new element at the beginning of the list. 865 00:42:06,870 --> 00:42:10,110 And for every number thereafter, malloc as before. 866 00:42:10,110 --> 00:42:13,292 But just keep inserting it here, inserting it here, inserting it here. 867 00:42:13,292 --> 00:42:15,000 Now it's not going to take a single step. 868 00:42:15,000 --> 00:42:17,190 Because as I verbalized it, there's, like, the malloc step. 869 00:42:17,190 --> 00:42:18,180 I have to unplug this. 870 00:42:18,180 --> 00:42:19,290 I have to plug it back in. 871 00:42:19,290 --> 00:42:21,720 So it's like three or four steps total. 872 00:42:21,720 --> 00:42:23,640 But four steps is also constant. 873 00:42:23,640 --> 00:42:26,530 That's big O of 1, because it's a fixed number of steps. 874 00:42:26,530 --> 00:42:31,600 So if you're able to sacrifice sorted order when it comes to this list, 875 00:42:31,600 --> 00:42:34,620 you can in constant time insert, insert, insert, insert. 876 00:42:34,620 --> 00:42:39,690 And the list is going to get longer and longer, but from the beginning of it 877 00:42:39,690 --> 00:42:40,680 rather than the end. 878 00:42:40,680 --> 00:42:41,928 So that's always a trade off. 879 00:42:41,928 --> 00:42:44,970 If you don't care about sorted order, and none of your algorithms or code 880 00:42:44,970 --> 00:42:49,260 require that it be sorted, then you can go ahead and cut that corner 881 00:42:49,260 --> 00:42:51,600 and achieve constant time insert, which if you're 882 00:42:51,600 --> 00:42:54,420 a Twitter or Google or the like, maybe that's actually 883 00:42:54,420 --> 00:42:56,800 a net savings and a good thing. 884 00:42:56,800 --> 00:42:59,350 But, again, you sacrifice the sorted order in that case. 885 00:42:59,350 --> 00:43:00,600 Well, let's go ahead, I think. 886 00:43:00,600 --> 00:43:03,210 And let's translate this to some actual code. 887 00:43:03,210 --> 00:43:06,118 Let me go ahead here in CS50 IDE. 888 00:43:06,118 --> 00:43:08,160 And let's go ahead and write a couple of variants 889 00:43:08,160 --> 00:43:11,700 of a program that now actually do something with numbers 890 00:43:11,700 --> 00:43:14,770 and start to manipulate things in memory. 891 00:43:14,770 --> 00:43:17,910 So I'm going to go ahead here and create a program called list.c. 892 00:43:17,910 --> 00:43:20,160 And my first version is going to be very simplistic. 893 00:43:20,160 --> 00:43:22,446 I'm going to go ahead and include stdio. 894 00:43:22,446 --> 00:43:24,360 Give myself an int main(void). 895 00:43:24,360 --> 00:43:27,250 And then inside of here, let me go ahead, like we began, 896 00:43:27,250 --> 00:43:30,450 and give myself a list of integers of size 3. 897 00:43:30,450 --> 00:43:32,857 So this is going to be an array that's of size 3. 898 00:43:32,857 --> 00:43:35,190 And this array of size 3, I'm going to go ahead and hard 899 00:43:35,190 --> 00:43:36,700 code some new values into it. 900 00:43:36,700 --> 00:43:38,910 So at the very first location, I'll put the number 1. 901 00:43:38,910 --> 00:43:41,580 At the second location, I will put the number 2. 902 00:43:41,580 --> 00:43:44,910 At the third location, I will put the number 3. 903 00:43:44,910 --> 00:43:48,600 And then, just to demonstrate that this is working as I think I intend, 904 00:43:48,600 --> 00:43:49,990 I'm going to do a quick for loop. 905 00:43:49,990 --> 00:43:54,300 So for int i get 0, i less than 3, i++. 906 00:43:54,300 --> 00:43:58,200 And then, inside this loop, I'm going to go ahead and print out %i. 907 00:43:58,200 --> 00:44:01,530 And then I'm going to print out the value of i. 908 00:44:01,530 --> 00:44:04,770 So now that I've printed out all of these values in my loop, let 909 00:44:04,770 --> 00:44:08,230 me go ahead and do make list. 910 00:44:08,230 --> 00:44:12,060 Let me go ahead then and do ./list and hit Enter. 911 00:44:12,060 --> 00:44:15,750 And indeed, I get-- oops, not what I wanted. 912 00:44:15,750 --> 00:44:18,900 So good teachable moment, not intended, admittedly. 913 00:44:18,900 --> 00:44:21,305 But what have I done wrong here? 914 00:44:21,305 --> 00:44:22,680 My goal is we print out the list. 915 00:44:22,680 --> 00:44:24,390 But somehow, I printed out 0, 1, 2. 916 00:44:24,390 --> 00:44:27,570 And those are indeed not the numbers in this list. 917 00:44:27,570 --> 00:44:28,600 [? Greg? ?] 918 00:44:28,600 --> 00:44:30,330 AUDIENCE: So you printed i. 919 00:44:30,330 --> 00:44:32,172 You should have printed list of i. 920 00:44:32,172 --> 00:44:32,880 DAVID MALAN: Yes. 921 00:44:32,880 --> 00:44:35,755 So I should have printed the contents of the array, which is list[i]. 922 00:44:35,755 --> 00:44:38,430 So that was just a newbie mistake by me here. 923 00:44:38,430 --> 00:44:39,400 So let me fix that. 924 00:44:39,400 --> 00:44:42,927 Let me go ahead and recompile make list, ./list., and voila. 925 00:44:42,927 --> 00:44:44,010 I've now printed the list. 926 00:44:44,010 --> 00:44:48,600 So this is sort of Week 2 stuff, when we first introduced arrays in Week 2. 927 00:44:48,600 --> 00:44:51,660 But now let me go ahead and transition now to something more dynamic. 928 00:44:51,660 --> 00:44:56,190 Where I don't have to commit in advance to creating an array, 929 00:44:56,190 --> 00:44:59,080 I can do this with a dynamically allocated chunk of memory. 930 00:44:59,080 --> 00:45:01,350 So let me delete everything I've done inside of main. 931 00:45:01,350 --> 00:45:03,058 And let me go ahead and give myself this. 932 00:45:03,058 --> 00:45:06,180 Let me go ahead and declare a list of values 933 00:45:06,180 --> 00:45:09,600 where list is now going to be an address as per the star operator. 934 00:45:09,600 --> 00:45:11,530 And I'm going to go ahead and malloc-- 935 00:45:11,530 --> 00:45:12,030 let's see. 936 00:45:12,030 --> 00:45:14,270 I want space for three integers. 937 00:45:14,270 --> 00:45:17,370 So the simplest way to do this, if I'm just going to keep it simple, 938 00:45:17,370 --> 00:45:21,630 I can actually do this, 3 times size of int. 939 00:45:21,630 --> 00:45:26,040 So this version of my program isn't going to use an array per se. 940 00:45:26,040 --> 00:45:27,420 It's going to use malloc. 941 00:45:27,420 --> 00:45:30,660 But it's going to dynamically allocate that array for me. 942 00:45:30,660 --> 00:45:32,640 And we'll see what the syntax for this is. 943 00:45:32,640 --> 00:45:35,100 As always now, any time you use malloc, I 944 00:45:35,100 --> 00:45:39,098 should check whether list equals equals null. 945 00:45:39,098 --> 00:45:40,140 And if so, you know what? 946 00:45:40,140 --> 00:45:41,490 I'm just going to return 1. 947 00:45:41,490 --> 00:45:44,250 Recall that you can return 0 or 1 or some other value 948 00:45:44,250 --> 00:45:46,230 from main to effectively quit your program. 949 00:45:46,230 --> 00:45:49,170 I'm going to go ahead and just return 1 if list is null, 950 00:45:49,170 --> 00:45:52,260 just assuming that something very badly went wrong, like I'm out of memory 951 00:45:52,260 --> 00:45:53,430 altogether. 952 00:45:53,430 --> 00:45:59,190 But now that I have this chunk of memory that's of size 3 times 953 00:45:59,190 --> 00:46:05,030 the size of an int, this is actually the malloc way to give yourself an array. 954 00:46:05,030 --> 00:46:07,530 Up until now, every time we've created arrays for ourselves, 955 00:46:07,530 --> 00:46:09,030 we've used square bracket notation. 956 00:46:09,030 --> 00:46:11,700 And you all have put a number inside the square brackets 957 00:46:11,700 --> 00:46:13,770 to give yourself an array of that size. 958 00:46:13,770 --> 00:46:15,990 But frankly, if we have malloc and the ability 959 00:46:15,990 --> 00:46:18,480 to just ask the computer for memory, well, 960 00:46:18,480 --> 00:46:21,450 if I want to store three integers, why don't I ask malloc 961 00:46:21,450 --> 00:46:23,940 for three times the size of an integer? 962 00:46:23,940 --> 00:46:25,710 And the way malloc works is it's actually 963 00:46:25,710 --> 00:46:29,580 going to return to me a contiguous chunk of memory of that size, 964 00:46:29,580 --> 00:46:31,410 so that many bytes back to back to back. 965 00:46:31,410 --> 00:46:36,480 And that's a technique that we'll use in just a moment 966 00:46:36,480 --> 00:46:38,130 when allocating actual nodes. 967 00:46:38,130 --> 00:46:41,730 So at this point in the story, so long as list does not equal null, 968 00:46:41,730 --> 00:46:46,470 I now have a chunk of memory that's big enough to fit the size of three ints. 969 00:46:46,470 --> 00:46:48,960 And as before, I can go ahead and initialize those. 970 00:46:48,960 --> 00:46:50,370 The first element will be 1. 971 00:46:50,370 --> 00:46:51,960 The second element will be 2. 972 00:46:51,960 --> 00:46:54,090 The third element will be 3. 973 00:46:54,090 --> 00:46:58,290 And notice the sort of equivalence now between using arrays 974 00:46:58,290 --> 00:46:59,730 and using pointers. 975 00:46:59,730 --> 00:47:02,520 C is kind of versatile in this way and that if you 976 00:47:02,520 --> 00:47:07,020 have a chunk of memory returned to you by malloc, you can, per last week, 977 00:47:07,020 --> 00:47:09,030 use square bracket notation. 978 00:47:09,030 --> 00:47:12,532 You can use square bracket notation and treat that chunk of memory as an array. 979 00:47:12,532 --> 00:47:13,990 Because after all, what's an array? 980 00:47:13,990 --> 00:47:18,360 It's a contiguous block of memory and that is exactly what malloc returns. 981 00:47:18,360 --> 00:47:22,980 If you want to be fancy instead, you could actually say go to that address 982 00:47:22,980 --> 00:47:24,360 and put the number 1 there. 983 00:47:24,360 --> 00:47:28,170 You could say go to that address plus 1 and put the next number there. 984 00:47:28,170 --> 00:47:33,730 You could say go to that address plus 2 and put the third number there. 985 00:47:33,730 --> 00:47:36,750 But honestly, this just very quickly becomes unreadable, at least 986 00:47:36,750 --> 00:47:37,380 to most people. 987 00:47:37,380 --> 00:47:39,510 This is that thing called pointer arithmetic, 988 00:47:39,510 --> 00:47:41,580 you're doing arithmetic with pointers, that 989 00:47:41,580 --> 00:47:45,900 is equivalent to using the syntax that we've used for a while now, which 990 00:47:45,900 --> 00:47:47,528 is to just use the square brackets. 991 00:47:47,528 --> 00:47:49,320 And the nice thing about square brackets is 992 00:47:49,320 --> 00:47:51,810 that the computer will figure out for you 993 00:47:51,810 --> 00:47:56,790 how far apart each of those integers are because it knows the size of an int. 994 00:47:56,790 --> 00:47:59,120 But now, at this point in the story, things 995 00:47:59,120 --> 00:48:01,310 get interesting and also annoying. 996 00:48:01,310 --> 00:48:04,160 And Santiago, recall, was the one that helped us solve this earlier. 997 00:48:04,160 --> 00:48:06,020 Suppose I didn't plan ahead. 998 00:48:06,020 --> 00:48:10,590 I only allocated three integers on line five there. 999 00:48:10,590 --> 00:48:12,800 But now, at line 13, I'm, like, oh, dammit. 1000 00:48:12,800 --> 00:48:15,240 Now I want to add a fourth integer to the list. 1001 00:48:15,240 --> 00:48:17,210 I could obviously just redo all the code. 1002 00:48:17,210 --> 00:48:19,520 But suppose that part of the story here is to go ahead 1003 00:48:19,520 --> 00:48:21,920 and dynamically allocate more memory. 1004 00:48:21,920 --> 00:48:23,222 Well, how can I do this? 1005 00:48:23,222 --> 00:48:26,180 Well, let me go ahead and allocate another chunk of memory temporarily. 1006 00:48:26,180 --> 00:48:28,160 So I'll call it temp, by convention. 1007 00:48:28,160 --> 00:48:31,550 And this time I'm going to go ahead and allocate 4 times size of int. 1008 00:48:31,550 --> 00:48:34,130 Because, again, for the sake of the story, I messed up 1009 00:48:34,130 --> 00:48:37,523 and I want to allocate enough space now for that original-- 1010 00:48:37,523 --> 00:48:38,690 I haven't so much messed up. 1011 00:48:38,690 --> 00:48:42,050 I have now decided that I want to add a fourth number to this array. 1012 00:48:42,050 --> 00:48:46,057 As always, I should check if temp equals equals null. 1013 00:48:46,057 --> 00:48:46,640 You know what? 1014 00:48:46,640 --> 00:48:49,430 I'm going to go ahead and free the memory I already allocated. 1015 00:48:49,430 --> 00:48:51,230 And then I'm just going to get out of here, return 1. 1016 00:48:51,230 --> 00:48:52,020 Something went wrong. 1017 00:48:52,020 --> 00:48:53,312 There's nothing to demonstrate. 1018 00:48:53,312 --> 00:48:56,300 So I'm going to exit out of main entirely. 1019 00:48:56,300 --> 00:49:00,718 But if malloc did not return null, and all is well, what am I going to do? 1020 00:49:00,718 --> 00:49:02,510 Well, let's first do what Santiago proposed 1021 00:49:02,510 --> 00:49:05,000 when we first began this conversation. 1022 00:49:05,000 --> 00:49:11,540 For int i get 0, i less than 3, i++, let's go ahead and copy into this new, 1023 00:49:11,540 --> 00:49:15,660 temporary chunk of memory whatever is at the original chunk of memory. 1024 00:49:15,660 --> 00:49:17,900 So when Santiago proposed that we copy 1, 2, 1025 00:49:17,900 --> 00:49:20,960 3 from the old array into the new array, here's 1026 00:49:20,960 --> 00:49:25,370 how we might do that in code just using a simple for loop, a la Week 2. 1027 00:49:25,370 --> 00:49:30,890 And then let me go ahead now and add one more value, tmp[3], 1028 00:49:30,890 --> 00:49:33,620 which is the fourth location if you're starting from 0. 1029 00:49:33,620 --> 00:49:36,530 I'm going to go ahead and put the number 4 there. 1030 00:49:36,530 --> 00:49:39,440 And now, at this point, I'm going to go ahead and remember 1031 00:49:39,440 --> 00:49:43,513 the fact that tmp is my new list. 1032 00:49:43,513 --> 00:49:45,680 So I'm going to go ahead and free the original list. 1033 00:49:45,680 --> 00:49:49,580 And I'm going to update my old list to point at the new list. 1034 00:49:49,580 --> 00:49:53,480 And then lastly, I'm going to go ahead and use another for loop just 1035 00:49:53,480 --> 00:49:57,920 to demonstrate that I think I did this correctly, this time iterating up to 4 1036 00:49:57,920 --> 00:49:59,120 instead of 3. 1037 00:49:59,120 --> 00:50:04,280 I'm going to go ahead and print out with i the contents of list[i]. 1038 00:50:04,280 --> 00:50:06,320 So let's rewind real quick. 1039 00:50:06,320 --> 00:50:12,350 We began the story by allocating an array of three integers. 1040 00:50:12,350 --> 00:50:15,817 But we did it this time dynamically to demonstrate that malloc just 1041 00:50:15,817 --> 00:50:16,900 returns a chunk of memory. 1042 00:50:16,900 --> 00:50:20,510 And if you want to treat that chunk of memory as an array, you absolutely can. 1043 00:50:20,510 --> 00:50:22,477 This stuff here, if list equals equals null, 1044 00:50:22,477 --> 00:50:25,310 it's just error checking, just to make sure that nothing went wrong. 1045 00:50:25,310 --> 00:50:27,170 The interesting code resumes here. 1046 00:50:27,170 --> 00:50:31,640 I'm putting the numbers 1, 2, and 3 at location 0, 1, and 2 respectively 1047 00:50:31,640 --> 00:50:34,790 in that chunk of memory which, again, I'm treating like an array. 1048 00:50:34,790 --> 00:50:38,180 But now at this point in the story, I've stipulated that, wait a minute, 1049 00:50:38,180 --> 00:50:39,950 I want to go ahead and add a fourth value. 1050 00:50:39,950 --> 00:50:40,920 How can I do that? 1051 00:50:40,920 --> 00:50:43,790 And let's stipulate that I want to go back and change the existing program. 1052 00:50:43,790 --> 00:50:45,790 Because suppose that for the sake of discussion, 1053 00:50:45,790 --> 00:50:48,500 this is code that's running at Google or Twitter over time 1054 00:50:48,500 --> 00:50:52,170 and it's only after receiving another tweet that their code realizes, 1055 00:50:52,170 --> 00:50:53,730 oh, we need more space. 1056 00:50:53,730 --> 00:50:54,890 So how do I do this here? 1057 00:50:54,890 --> 00:50:59,690 On line 15, this time I allocate enough space for four integers. 1058 00:50:59,690 --> 00:51:01,910 And I, again, do some error checking. 1059 00:51:01,910 --> 00:51:04,820 If tmp equals null, then something bad happened. 1060 00:51:04,820 --> 00:51:07,010 Let's just exit all together. 1061 00:51:07,010 --> 00:51:09,620 But if nothing bad happened, let's take Santiago's suggestion 1062 00:51:09,620 --> 00:51:14,960 and translate his English advice into C. Let's use a for loop from 0 to 3 1063 00:51:14,960 --> 00:51:17,480 and copy into this new temporary chunk of memory 1064 00:51:17,480 --> 00:51:20,190 the contents of the original chunk of memory. 1065 00:51:20,190 --> 00:51:22,160 So tmp[i] = list[i]. 1066 00:51:22,160 --> 00:51:24,530 And then here, which was the point of this exercise, 1067 00:51:24,530 --> 00:51:29,870 let me add my fourth number at tmp[3], which is the fourth location if you 1068 00:51:29,870 --> 00:51:31,370 start counting from 0. 1069 00:51:31,370 --> 00:51:34,280 But at this point in the story, much like my earlier slide, 1070 00:51:34,280 --> 00:51:38,000 I have both the 1, 2, 3 in an array of size 3, 1071 00:51:38,000 --> 00:51:41,690 and I have 1, 2, 3 duplicated in the array of size 4. 1072 00:51:41,690 --> 00:51:45,230 Let me go ahead and free the original list and give back to the computer 1073 00:51:45,230 --> 00:51:46,820 that original chunk of memory. 1074 00:51:46,820 --> 00:51:49,340 Let me then remember, using my better names 1075 00:51:49,340 --> 00:51:52,730 variable, what the address of this new chunk of memory is. 1076 00:51:52,730 --> 00:51:55,010 And then, just to show off, let me go ahead, 1077 00:51:55,010 --> 00:51:58,430 and with another for loop, this time counting four times, not three, 1078 00:51:58,430 --> 00:52:00,720 let me print out all of those values. 1079 00:52:00,720 --> 00:52:04,200 Now here's where I'll cross my fingers, compile my new program. 1080 00:52:04,200 --> 00:52:05,900 It does not compile OK. 1081 00:52:05,900 --> 00:52:09,150 Because it looks like I have one too many parentheses there. 1082 00:52:09,150 --> 00:52:11,840 So let's recompile the program with make list-- 1083 00:52:11,840 --> 00:52:13,440 another error. 1084 00:52:13,440 --> 00:52:14,810 So let me scroll up there. 1085 00:52:14,810 --> 00:52:22,040 And oh, interesting, so this is a common mistake, implicitly declaring library 1086 00:52:22,040 --> 00:52:24,170 function malloc something something. 1087 00:52:24,170 --> 00:52:27,140 So any time you get an implicitly declaring error, 1088 00:52:27,140 --> 00:52:31,010 odds are it means you just did something simple like this, 1089 00:52:31,010 --> 00:52:35,150 you forgot the requisite header file in which that function is defined. 1090 00:52:35,150 --> 00:52:39,120 And indeed, recall from last week, malloc is in standard lib, as is free. 1091 00:52:39,120 --> 00:52:40,220 So now let's do make list. 1092 00:52:40,220 --> 00:52:42,210 Cross my fingers again. 1093 00:52:42,210 --> 00:52:42,710 Phew. 1094 00:52:42,710 --> 00:52:43,880 That time it worked. 1095 00:52:43,880 --> 00:52:47,360 ./list, voila, 1, 2, 3, 4. 1096 00:52:47,360 --> 00:52:50,480 So this is now a completely literal translation 1097 00:52:50,480 --> 00:52:53,990 of all of that code into a working program 1098 00:52:53,990 --> 00:52:58,870 that, again, starts off by using an array of size 3, 1099 00:52:58,870 --> 00:53:01,270 having dynamically allocated it. 1100 00:53:01,270 --> 00:53:05,200 And then it resizes it by creating a new one of size 4, copying old into new, 1101 00:53:05,200 --> 00:53:08,170 freeing the old, and then proceeding as before. 1102 00:53:08,170 --> 00:53:11,240 And I've deliberately used malloc both times here as follows. 1103 00:53:11,240 --> 00:53:14,350 If you create an array in C using square bracket notation, 1104 00:53:14,350 --> 00:53:16,210 you have painted yourself into a corner. 1105 00:53:16,210 --> 00:53:19,210 You can't use any lines of code that we have seen 1106 00:53:19,210 --> 00:53:22,770 and resize an array that you have declared using square brackets. 1107 00:53:22,770 --> 00:53:25,270 More technically speaking, when you use the square brackets, 1108 00:53:25,270 --> 00:53:28,960 you are statically allocating the array on the stack. 1109 00:53:28,960 --> 00:53:31,750 You're putting it into the frame of the computer's memory 1110 00:53:31,750 --> 00:53:37,930 that belongs to that computer, to that function's stack frame, per the diagram 1111 00:53:37,930 --> 00:53:38,950 last week. 1112 00:53:38,950 --> 00:53:44,170 If, however, you use malloc, our new tool from last week, 1113 00:53:44,170 --> 00:53:47,170 and say give me a chunk of memory, that comes from the heap. 1114 00:53:47,170 --> 00:53:49,420 And that you can resize. 1115 00:53:49,420 --> 00:53:52,520 That you can give back and take more of and back and forth. 1116 00:53:52,520 --> 00:53:57,050 And in fact, there's even a more simple way of doing this, relatively speaking. 1117 00:53:57,050 --> 00:54:02,140 If you want to reallocate an array, a chunk of memory, by resizing it, 1118 00:54:02,140 --> 00:54:04,570 you don't have to do all of this, which I did before. 1119 00:54:04,570 --> 00:54:06,460 You don't have to use malloc twice. 1120 00:54:06,460 --> 00:54:08,710 You can use malloc once at the beginning. 1121 00:54:08,710 --> 00:54:10,780 And then, you can use a new function that's 1122 00:54:10,780 --> 00:54:13,930 actually kind of helpful in this case, called realloc. 1123 00:54:13,930 --> 00:54:19,730 And you can actually do this, realloc a chunk of memory of size 4 times 1124 00:54:19,730 --> 00:54:20,500 size of int. 1125 00:54:20,500 --> 00:54:24,790 But specifically, reallocate the thing called list. 1126 00:54:24,790 --> 00:54:27,250 So realloc is very similar to malloc. 1127 00:54:27,250 --> 00:54:28,790 But it takes two arguments. 1128 00:54:28,790 --> 00:54:32,140 One is the size of the memory you want, whether bigger or smaller. 1129 00:54:32,140 --> 00:54:33,670 But it takes a second argument. 1130 00:54:33,670 --> 00:54:38,170 Its very first argument now is the address of a chunk of memory 1131 00:54:38,170 --> 00:54:41,800 that you have already allocated, as with malloc. 1132 00:54:41,800 --> 00:54:43,660 So, again, at the top of the same program, 1133 00:54:43,660 --> 00:54:48,950 recall that I used malloc to give myself a list that points at a chunk of memory 1134 00:54:48,950 --> 00:54:50,380 big enough for three integers. 1135 00:54:50,380 --> 00:54:55,030 On line 16, I'm now handing that address back to realloc, saying, wait a minute, 1136 00:54:55,030 --> 00:54:56,800 here is that same address you gave me. 1137 00:54:56,800 --> 00:55:01,090 Please now resize it, reallocate it to be of size 4. 1138 00:55:01,090 --> 00:55:03,760 And what the function does is, if all goes well, 1139 00:55:03,760 --> 00:55:10,300 it returns to the address in memory that it is now of sufficient size. 1140 00:55:10,300 --> 00:55:13,030 Otherwise, it returns null if anything bad happened. 1141 00:55:13,030 --> 00:55:14,570 So I'll leave that code alone. 1142 00:55:14,570 --> 00:55:16,870 But what I don't have to do anymore is this. 1143 00:55:16,870 --> 00:55:20,850 Realloc actually copies the old into the new for you. 1144 00:55:20,850 --> 00:55:23,740 So, again, coming back to Santiago's story at the beginning of today, 1145 00:55:23,740 --> 00:55:26,260 realloc will not only give you a bigger chunk 1146 00:55:26,260 --> 00:55:29,740 of memory, if you ask for it, by handing back the address of the memory you 1147 00:55:29,740 --> 00:55:31,150 already requested. 1148 00:55:31,150 --> 00:55:35,020 And it's going to hand you back the address of a new chunk of memory 1149 00:55:35,020 --> 00:55:37,810 that is big enough to fit all of those new values. 1150 00:55:37,810 --> 00:55:39,130 And it's smart, too. 1151 00:55:39,130 --> 00:55:42,880 If there happens to be room at the very end of the existing chunk of memory, 1152 00:55:42,880 --> 00:55:45,397 there's no hello, world, like we saw on my slide earlier, 1153 00:55:45,397 --> 00:55:47,980 then you're actually going to get back the exact same address. 1154 00:55:47,980 --> 00:55:51,100 But the computer's operating system, Windows, Mac, OS, or Linux, 1155 00:55:51,100 --> 00:55:54,160 is going to remember, OK, yes, I know I gave you three bytes originally. 1156 00:55:54,160 --> 00:55:56,720 There happened to be room at the end of that chunk of memory. 1157 00:55:56,720 --> 00:56:00,040 So now I'm going to remember, instead, that that same address has 1158 00:56:00,040 --> 00:56:03,230 room for four integers, or whatever number you pass in. 1159 00:56:03,230 --> 00:56:06,070 So, again, you don't have to bother copying yourself. 1160 00:56:06,070 --> 00:56:10,880 You can let the computer actually do the reallocation for you. 1161 00:56:10,880 --> 00:56:18,345 Any questions, then on malloc, on realloc, on free, or fundamentally, 1162 00:56:18,345 --> 00:56:22,300 on linked lists? 1163 00:56:22,300 --> 00:56:24,040 Notice that this isn't yet a list. 1164 00:56:24,040 --> 00:56:25,780 This is still an array. 1165 00:56:25,780 --> 00:56:28,628 So we still need to take this program one step further 1166 00:56:28,628 --> 00:56:30,670 and actually transition from this chunk of memory 1167 00:56:30,670 --> 00:56:34,120 using arrays to these actual nodes. 1168 00:56:34,120 --> 00:56:37,310 But before we do that, any questions or confusion? 1169 00:56:37,310 --> 00:56:37,810 BRIAN: Yeah. 1170 00:56:37,810 --> 00:56:41,775 A question came in, why do you not need to free tmp at the end of the program? 1171 00:56:41,775 --> 00:56:44,650 DAVID MALAN: Why do I not need to free tmp at the end of the program? 1172 00:56:44,650 --> 00:56:48,610 Because I'm an idiot and glossed over that key, important detail. 1173 00:56:48,610 --> 00:56:52,200 You absolutely should free not tmp in this case, but list. 1174 00:56:52,200 --> 00:56:58,900 So at this line here, 27, I use my list variable, which just has a better name, 1175 00:56:58,900 --> 00:57:03,520 and I make it equal to tmp so that I can just refer to it as a bigger list. 1176 00:57:03,520 --> 00:57:04,630 But you are quite right. 1177 00:57:04,630 --> 00:57:06,400 That was an oversight on my part. 1178 00:57:06,400 --> 00:57:07,900 Valgrind would not have liked that. 1179 00:57:07,900 --> 00:57:12,850 At the very end of this program, I should absolutely free list. 1180 00:57:12,850 --> 00:57:15,790 However, I don't need to free tmp, per se, 1181 00:57:15,790 --> 00:57:20,450 because I've simply reused the variable name through that assignment. 1182 00:57:20,450 --> 00:57:22,730 Good question and good catch. 1183 00:57:22,730 --> 00:57:24,370 Unintended. 1184 00:57:24,370 --> 00:57:26,195 Other questions or comments, Brian? 1185 00:57:26,195 --> 00:57:28,570 BRIAN: Question came in, why does the linked list improve 1186 00:57:28,570 --> 00:57:32,050 this situation if we can just use arrays and realloc and malloc to do 1187 00:57:32,050 --> 00:57:32,900 all this stuff? 1188 00:57:32,900 --> 00:57:33,230 DAVID MALAN: Yeah. 1189 00:57:33,230 --> 00:57:34,105 Really good question. 1190 00:57:34,105 --> 00:57:38,410 So how have we improved this situation if we can just use arrays in this way? 1191 00:57:38,410 --> 00:57:40,570 Recall that this is kind of a regression. 1192 00:57:40,570 --> 00:57:44,920 What I just did is a regression to where we started the story, whereby 1193 00:57:44,920 --> 00:57:46,990 in any of the versions of code I just wrote, 1194 00:57:46,990 --> 00:57:50,270 I reallocated more space for this array. 1195 00:57:50,270 --> 00:57:54,160 Which means that I, manually with that for loop, or realloc with its own 1196 00:57:54,160 --> 00:57:57,490 for loop, had to copy all of the old values into the new. 1197 00:57:57,490 --> 00:58:00,640 So the approach we've taken in all three versions of this program 1198 00:58:00,640 --> 00:58:04,450 that I've written thus far on the fly, they've all been Big O of n. 1199 00:58:04,450 --> 00:58:10,220 When it comes to inserts, they have not given us the dynamism of a linked list 1200 00:58:10,220 --> 00:58:12,403 to just add without that duplication. 1201 00:58:12,403 --> 00:58:15,320 And we haven't had the ability yet to just do an insert, for instance, 1202 00:58:15,320 --> 00:58:18,460 at the beginning of the structure in Big O of 1 time. 1203 00:58:18,460 --> 00:58:20,800 So, again, this is the code translation, really, 1204 00:58:20,800 --> 00:58:23,740 of that slower approach from which we began. 1205 00:58:23,740 --> 00:58:26,500 So the ultimate goal now is going to be to change this code 1206 00:58:26,500 --> 00:58:29,830 and give us that dynamism and actually implement things as a proper linked 1207 00:58:29,830 --> 00:58:32,905 list, not just as an array of integers. 1208 00:58:32,905 --> 00:58:34,030 But we're about an hour in. 1209 00:58:34,030 --> 00:58:36,405 Let's go ahead and take our first five minute break here. 1210 00:58:36,405 --> 00:58:40,600 And when we come back, we'll translate the nodes themselves to a full program. 1211 00:58:40,600 --> 00:58:42,400 All right, we are back. 1212 00:58:42,400 --> 00:58:46,930 And recall that we began today by revisiting arrays and pointing out 1213 00:58:46,930 --> 00:58:49,510 that searching is great in arrays if you keep them sorted. 1214 00:58:49,510 --> 00:58:52,990 You get the Big O of log n that we liked back from Week 0. 1215 00:58:52,990 --> 00:58:56,020 But as soon as you want to start dynamically modifying an array, 1216 00:58:56,020 --> 00:58:58,050 it gets very expensive quickly. 1217 00:58:58,050 --> 00:59:00,640 It might take you Big O of n steps to copy 1218 00:59:00,640 --> 00:59:03,790 the contents of an old, small array into a new, bigger array. 1219 00:59:03,790 --> 00:59:07,360 And honestly, over time, especially for real world software with lots of data, 1220 00:59:07,360 --> 00:59:09,440 even Big O of n is expensive. 1221 00:59:09,440 --> 00:59:12,640 Like, you don't want to be constantly copying and copying and copying 1222 00:59:12,640 --> 00:59:15,170 all of your data around the computer's memory. 1223 00:59:15,170 --> 00:59:18,730 So we can avoid that by using pointers and, in turn, 1224 00:59:18,730 --> 00:59:21,400 stitching together these structures called linked lists, 1225 00:59:21,400 --> 00:59:24,010 albeit at a price of spending more memory. 1226 00:59:24,010 --> 00:59:27,693 But with that additional memory, that additional cost, comes dynamism. 1227 00:59:27,693 --> 00:59:29,860 So that if we want we can even achieve constant time 1228 00:59:29,860 --> 00:59:31,270 when it comes to inserting. 1229 00:59:31,270 --> 00:59:34,540 But of course, then we have to sacrifice things like sortability. 1230 00:59:34,540 --> 00:59:36,430 So this came with trade offs. 1231 00:59:36,430 --> 00:59:39,400 We've just seen a few examples of actual C programs 1232 00:59:39,400 --> 00:59:42,070 that implement first the old school array, per Week 0, 1233 00:59:42,070 --> 00:59:43,820 where we just hardcode the array's length. 1234 00:59:43,820 --> 00:59:46,210 And unfortunately, we painted ourselves into a corner, 1235 00:59:46,210 --> 00:59:47,830 using the bracket notation alone. 1236 00:59:47,830 --> 00:59:51,490 So we deployed instead m which is more versatile tool that 1237 00:59:51,490 --> 00:59:53,770 lets us get as much memory as we want. 1238 00:59:53,770 --> 00:59:58,120 And we used that to recreate the idea of a list implemented as arrays. 1239 00:59:58,120 --> 01:00:00,880 But even then, we saw that I had to copy using a for loop 1240 01:00:00,880 --> 01:00:04,865 or we had to copy using, indirectly, realloc, old into new. 1241 01:00:04,865 --> 01:00:07,990 And, again, for these small programs, you don't even notice the difference. 1242 01:00:07,990 --> 01:00:09,320 The programs run like that. 1243 01:00:09,320 --> 01:00:12,580 But for large, real world software, all of that Big O of n time 1244 01:00:12,580 --> 01:00:14,060 is going to add up quickly. 1245 01:00:14,060 --> 01:00:18,020 So it's best if we can try to avoid it altogether and achieve dynamism. 1246 01:00:18,020 --> 01:00:22,810 So the code via which you can add to linked list dynamically 1247 01:00:22,810 --> 01:00:26,060 is actually part of the challenge for Problem Set 5 this coming week. 1248 01:00:26,060 --> 01:00:30,100 But let's see some of the building blocks via which we can syntactically 1249 01:00:30,100 --> 01:00:33,160 start to allocate nodes and stitch them together when 1250 01:00:33,160 --> 01:00:35,163 we know in advance how many we want. 1251 01:00:35,163 --> 01:00:37,330 Which is not going to be the case for Problem Set 5, 1252 01:00:37,330 --> 01:00:40,857 but for now is indeed the case, because I only want three of these things. 1253 01:00:40,857 --> 01:00:42,940 So I'm going to go back to my program from before. 1254 01:00:42,940 --> 01:00:45,520 And I'm going to rewind and race everything inside of main. 1255 01:00:45,520 --> 01:00:47,470 And I'm going to go ahead and declare myself 1256 01:00:47,470 --> 01:00:54,490 a type called struct node initially with a number inside of it and a struct node 1257 01:00:54,490 --> 01:00:56,737 * called next inside of that. 1258 01:00:56,737 --> 01:00:59,320 And then I'm going to call this whole thing quite simply node. 1259 01:00:59,320 --> 01:01:02,110 So that's quite similar to what we did with a person. 1260 01:01:02,110 --> 01:01:05,440 But now it's a little fancier in that I'm giving the structure itself 1261 01:01:05,440 --> 01:01:07,150 a temporary name, struct node. 1262 01:01:07,150 --> 01:01:10,270 I'm referring to that temporary name inside of the structure 1263 01:01:10,270 --> 01:01:12,430 so that I can have a pointer there, too. 1264 01:01:12,430 --> 01:01:16,310 And then I'm renaming what was person to now node. 1265 01:01:16,310 --> 01:01:19,330 Now let's go ahead and actually use this thing inside of main. 1266 01:01:19,330 --> 01:01:22,840 So let me go ahead and create an empty linked list. 1267 01:01:22,840 --> 01:01:27,760 The simplest way to translate the simple block with which we began today is just 1268 01:01:27,760 --> 01:01:31,120 doing node *list;. 1269 01:01:31,120 --> 01:01:33,970 Unfortunately, any time you declare a variable 1270 01:01:33,970 --> 01:01:36,730 that does not have an assigned value, it's garbage. 1271 01:01:36,730 --> 01:01:38,980 And garbage is bad in the world of pointers. 1272 01:01:38,980 --> 01:01:42,430 Again, to be clear, if you create this variable called list 1273 01:01:42,430 --> 01:01:46,690 and you do not explicitly initialize its value 1274 01:01:46,690 --> 01:01:49,570 to be something like null pointing at the ground, 1275 01:01:49,570 --> 01:01:51,563 but instead leave it as a garbage value, it's 1276 01:01:51,563 --> 01:01:53,980 the sort of metaphorical equivalent of this arrow pointing 1277 01:01:53,980 --> 01:01:55,840 this way, this way, this other way. 1278 01:01:55,840 --> 01:01:58,570 That is to say you might accidentally, in your own code, 1279 01:01:58,570 --> 01:02:01,570 follow this arrow to a completely bogus place. 1280 01:02:01,570 --> 01:02:05,038 And that's the point at which you have what are called segmentation faults, 1281 01:02:05,038 --> 01:02:08,080 as some of you might have experienced already with Problem Set 4 when you 1282 01:02:08,080 --> 01:02:09,880 touch memory that you shouldn't. 1283 01:02:09,880 --> 01:02:12,580 So garbage values are bad ever more so in the context 1284 01:02:12,580 --> 01:02:14,240 of pointers for that reason. 1285 01:02:14,240 --> 01:02:15,745 So you rarely want to do this. 1286 01:02:15,745 --> 01:02:19,540 You almost always want to initialize the pointer to some known value. 1287 01:02:19,540 --> 01:02:21,850 In the absence of an actual address, we're 1288 01:02:21,850 --> 01:02:24,840 going to use null to indicate that there's nothing there. 1289 01:02:24,840 --> 01:02:26,810 But that's deliberate on our part. 1290 01:02:26,810 --> 01:02:29,600 Now, suppose I want to insert, just as I did physically 1291 01:02:29,600 --> 01:02:32,960 by lugging the block number 1 onto stage before, let me go ahead 1292 01:02:32,960 --> 01:02:36,320 and allocate a node-- we'll call it n temporarily-- 1293 01:02:36,320 --> 01:02:40,410 using malloc, this time asking for the size of a node. 1294 01:02:40,410 --> 01:02:41,900 So the story is now changing. 1295 01:02:41,900 --> 01:02:43,640 I'm not allocating individual ints. 1296 01:02:43,640 --> 01:02:49,340 I'm allocating individual nodes inside of which is enough room for an integer 1297 01:02:49,340 --> 01:02:51,410 and another pointer to a node. 1298 01:02:51,410 --> 01:02:55,100 And this size of operator figures out, from the definition 1299 01:02:55,100 --> 01:02:58,310 of this structure up here above main, how much space 1300 01:02:58,310 --> 01:03:03,560 is needed to store an integer and a pointer to a struct node. 1301 01:03:03,560 --> 01:03:06,110 So as always now, I'm always going to check. 1302 01:03:06,110 --> 01:03:09,860 If n equals equals null, I'm going to get out of this program 1303 01:03:09,860 --> 01:03:11,330 immediately, and just return 1. 1304 01:03:11,330 --> 01:03:14,130 Because something went wrong, and there's just not enough memory. 1305 01:03:14,130 --> 01:03:19,010 But if all went well, I'm going to go ahead now and go into that node n. 1306 01:03:19,010 --> 01:03:23,180 I'm going to go into its number field and assign it the value 1. 1307 01:03:23,180 --> 01:03:27,410 And I'm going to go into that node n and go into its next field and, for now, 1308 01:03:27,410 --> 01:03:28,950 assign it the value null. 1309 01:03:28,950 --> 01:03:32,600 So this is as though I've just allocated the wooden block with a 1 in it 1310 01:03:32,600 --> 01:03:35,700 and I have initialized its next pointer to null. 1311 01:03:35,700 --> 01:03:40,440 Now I'm going to go ahead and update the list itself to point at that value. 1312 01:03:40,440 --> 01:03:44,270 So, again, my variable called list is the variable via which 1313 01:03:44,270 --> 01:03:46,640 I'm representing the whole list. 1314 01:03:46,640 --> 01:03:49,460 And now that I have an actual node to point to, 1315 01:03:49,460 --> 01:03:55,260 I'm setting list which, again, is a pointer, to a node, equal to whatever 1316 01:03:55,260 --> 01:03:58,850 n is, the address of an actual node. 1317 01:03:58,850 --> 01:04:01,940 So at this point in the story, I have the small wooden block connected 1318 01:04:01,940 --> 01:04:04,330 to the larger block containing 1. 1319 01:04:04,330 --> 01:04:06,080 Let's suppose, for the sake of discussion, 1320 01:04:06,080 --> 01:04:09,800 I now want to add the number 2 to this list, this time using 1321 01:04:09,800 --> 01:04:11,870 a node as well, not just an integer. 1322 01:04:11,870 --> 01:04:15,500 I'm going to go ahead and allocate n, using malloc, 1323 01:04:15,500 --> 01:04:17,872 giving myself the size of another node. 1324 01:04:17,872 --> 01:04:20,330 I'm going to again just going to check if that thing equals 1325 01:04:20,330 --> 01:04:24,170 null let me go ahead and free the list so I don't leak memory. 1326 01:04:24,170 --> 01:04:25,618 Then let me go ahead and return 1. 1327 01:04:25,618 --> 01:04:27,410 So that's just a quick sanity check to make 1328 01:04:27,410 --> 01:04:30,770 sure I free any memory I've already allocated before. 1329 01:04:30,770 --> 01:04:33,260 But if all goes well, and that's what I'm hoping for, 1330 01:04:33,260 --> 01:04:37,790 I'm going to go ahead and go into this node n and store in its number field 1331 01:04:37,790 --> 01:04:39,103 literally the number 2. 1332 01:04:39,103 --> 01:04:40,520 And then now, because this thing-- 1333 01:04:40,520 --> 01:04:42,770 I'll insert it in sorted order for now-- 1334 01:04:42,770 --> 01:04:47,000 I'm going to go ahead and insert this next = NULL. 1335 01:04:47,000 --> 01:04:52,340 And if I indeed want to put this number 2 node after the number 1 node, 1336 01:04:52,340 --> 01:04:55,940 I can start at the top of the list, I can go to the next node. 1337 01:04:55,940 --> 01:05:01,170 And inside of its value, I can say n. 1338 01:05:01,170 --> 01:05:05,700 So this line of code here starts at the little block, follows the arrow, 1339 01:05:05,700 --> 01:05:10,050 and then updates the next pointer of that first node, the 1 node, 1340 01:05:10,050 --> 01:05:14,290 to instead store the address of this new node, n. 1341 01:05:14,290 --> 01:05:18,690 And then lastly, let's do one more of these so n = malloc(sizeof(node)); 1342 01:05:18,690 --> 01:05:19,722 one last time. 1343 01:05:19,722 --> 01:05:21,930 Let me go ahead and do my sanity check one more time. 1344 01:05:21,930 --> 01:05:24,402 If n = = NULL something bad happened. 1345 01:05:24,402 --> 01:05:27,360 So now I'm going to go ahead and don't worry about the syntax just yet. 1346 01:05:27,360 --> 01:05:29,730 But I'm going to go ahead and free list next. 1347 01:05:29,730 --> 01:05:31,620 And I'm going to go ahead and free(list); 1348 01:05:31,620 --> 01:05:33,453 and then I'm going to go ahead and return 1. 1349 01:05:33,453 --> 01:05:34,830 But more on that another time. 1350 01:05:34,830 --> 01:05:37,440 That's just in the corner case where something bad happened. 1351 01:05:37,440 --> 01:05:41,310 But if nothing bad happened, I'm going to update the number field to be 3. 1352 01:05:41,310 --> 01:05:45,450 I'm going to update the next field to be NULL. 1353 01:05:45,450 --> 01:05:49,560 And now I'm going to update the list next block 1354 01:05:49,560 --> 01:05:53,340 next block to equal this new one n. 1355 01:05:53,340 --> 01:05:57,570 And then here, after this, I can proceed to print all of these things if I want. 1356 01:05:57,570 --> 01:06:00,270 And in fact, I'll go ahead and do this with a loop. 1357 01:06:00,270 --> 01:06:03,900 The loop is going to look a little different from before in this case. 1358 01:06:03,900 --> 01:06:07,080 But it turns out we can use for loops pretty powerfully here, too. 1359 01:06:07,080 --> 01:06:09,930 But at this point in the story, my list pointer 1360 01:06:09,930 --> 01:06:12,300 is pointing at the 1 node, which is pointing at the 2 1361 01:06:12,300 --> 01:06:14,190 node, which is pointing at the 3 node. 1362 01:06:14,190 --> 01:06:16,110 And, again, as someone observed earlier, it's 1363 01:06:16,110 --> 01:06:19,590 not common to use this double arrow notation in this case. 1364 01:06:19,590 --> 01:06:22,680 I bet I could actually use a loop to iterate over these things 1365 01:06:22,680 --> 01:06:23,280 one at a time. 1366 01:06:23,280 --> 01:06:26,250 And we can see this here when it's time to print. 1367 01:06:26,250 --> 01:06:28,650 Let me go ahead and do this for. 1368 01:06:28,650 --> 01:06:32,310 And instead of using i, because there really aren't any numbers in question. 1369 01:06:32,310 --> 01:06:36,420 This is no longer an array, so I can't use square bracket notation or pointer 1370 01:06:36,420 --> 01:06:37,050 arithmetic. 1371 01:06:37,050 --> 01:06:38,670 I need to use pointers. 1372 01:06:38,670 --> 01:06:40,590 So this might feel a little weird at first. 1373 01:06:40,590 --> 01:06:43,500 But there's nothing stopping me with a for loop from doing this. 1374 01:06:43,500 --> 01:06:46,920 Give me a temporary pointer to a node called tmp 1375 01:06:46,920 --> 01:06:51,600 and initialize it to be whatever is at the beginning of the list. 1376 01:06:51,600 --> 01:06:56,040 Keep doing the following so long as tmp does not equal NULL. 1377 01:06:56,040 --> 01:06:59,580 And on each iteration of this loop, don't do something like i++ which, 1378 01:06:59,580 --> 01:07:01,320 again, is not relevant now. 1379 01:07:01,320 --> 01:07:04,410 But go ahead and update my temporary pointer 1380 01:07:04,410 --> 01:07:09,300 to be whatever the value of the temporary pointer's next field is. 1381 01:07:09,300 --> 01:07:12,000 So this looks crazy cryptic most likely, especially 1382 01:07:12,000 --> 01:07:16,020 if you're new to pointers as of last week, as most of you are. 1383 01:07:16,020 --> 01:07:18,330 But it's the same idea is a typical for loop. 1384 01:07:18,330 --> 01:07:21,780 You initialize some variable before the semicolon. 1385 01:07:21,780 --> 01:07:24,720 You check some condition after the first semicolon. 1386 01:07:24,720 --> 01:07:28,898 And you perform an update of that variable after the second semicolon. 1387 01:07:28,898 --> 01:07:30,690 In this case, they're not integers, though. 1388 01:07:30,690 --> 01:07:33,840 Instead, I'm saying give myself a temporary pointer to the beginning 1389 01:07:33,840 --> 01:07:37,800 of the list, like my finger pointing at, or if you prefer, the foam finger 1390 01:07:37,800 --> 01:07:40,620 pointing at some node in the list. 1391 01:07:40,620 --> 01:07:43,950 Go ahead and call that temporary variable tmp. 1392 01:07:43,950 --> 01:07:45,310 And now do the following. 1393 01:07:45,310 --> 01:07:48,150 So long as tmp is not null, that is, so long as it's 1394 01:07:48,150 --> 01:07:51,960 pointing at an actual legitimate wooden block, what do I want to do? 1395 01:07:51,960 --> 01:07:57,300 Let me go ahead and print out, using printf and %i as always, 1396 01:07:57,300 --> 01:08:04,230 whatever value is in the number field of that node there. 1397 01:08:04,230 --> 01:08:05,190 And that's it. 1398 01:08:05,190 --> 01:08:08,340 With this simple for loop, relatively simple for loop, 1399 01:08:08,340 --> 01:08:11,910 I can essentially point at the very first node in my list 1400 01:08:11,910 --> 01:08:14,850 and keep updating it to the next field, updating it to the next field, 1401 01:08:14,850 --> 01:08:16,100 updating it to the next field. 1402 01:08:16,100 --> 01:08:18,720 And I keep doing this until my finger sort of walks off 1403 01:08:18,720 --> 01:08:21,420 the end of the list of wooden blocks, thereby 1404 01:08:21,420 --> 01:08:25,493 pointing at null, 0x0, at which point the loop stops 1405 01:08:25,493 --> 01:08:26,910 and there's nothing more to print. 1406 01:08:26,910 --> 01:08:29,995 So in answer to that question earlier, do we 1407 01:08:29,995 --> 01:08:31,620 need to use this double arrow notation? 1408 01:08:31,620 --> 01:08:35,010 Short answer, no, this is kind of the secret ingredient here. 1409 01:08:35,010 --> 01:08:38,520 This syntax inside of the for loop takes whatever you're pointing at, 1410 01:08:38,520 --> 01:08:41,939 follows one arrow, and then updates the temporary variable now 1411 01:08:41,939 --> 01:08:43,710 to point at that structure instead. 1412 01:08:43,710 --> 01:08:47,189 So this is kind of the equivalent in the world of pointers and linked lists 1413 01:08:47,189 --> 01:08:48,450 of doing i++. 1414 01:08:48,450 --> 01:08:49,770 But it's not as simple as i++. 1415 01:08:49,770 --> 01:08:52,170 You can't just look one byte to the right or to the left. 1416 01:08:52,170 --> 01:08:55,380 Instead, you have to follow an arrow, follow an arrow. 1417 01:08:55,380 --> 01:08:59,609 But by reassigning this temporary variable to wherever you just followed, 1418 01:08:59,609 --> 01:09:02,790 it's a way of following each of these orange arrows 1419 01:09:02,790 --> 01:09:06,420 as we did physically a moment ago. 1420 01:09:06,420 --> 01:09:10,350 After this, I should, for good measure, go ahead and free the whole list. 1421 01:09:10,350 --> 01:09:13,217 And let me just offer up a common way of freeing a linked list. 1422 01:09:13,217 --> 01:09:14,800 I can actually do something like this. 1423 01:09:14,800 --> 01:09:19,470 While list != NULL, so while the whole list itself does not equal null, 1424 01:09:19,470 --> 01:09:23,970 go ahead and get a temporary pointer like this to the next field so I 1425 01:09:23,970 --> 01:09:27,689 remember what comes after the current head of the list. 1426 01:09:27,689 --> 01:09:29,460 Free the list node itself. 1427 01:09:29,460 --> 01:09:32,010 And then update list to be tmp. 1428 01:09:32,010 --> 01:09:35,040 So, again, this probably looks crazy cryptic and certainly in the coming 1429 01:09:35,040 --> 01:09:36,899 days, especially with Problem Set 5, you'll 1430 01:09:36,899 --> 01:09:40,439 work through this kind of logic a little more logically, a little more 1431 01:09:40,439 --> 01:09:41,830 pictorially, perhaps. 1432 01:09:41,830 --> 01:09:43,080 But what am I doing here? 1433 01:09:43,080 --> 01:09:46,479 First, I'm going to do the following, so long as my linked list is not null. 1434 01:09:46,479 --> 01:09:48,479 And if I've got three nodes in it, by definition 1435 01:09:48,479 --> 01:09:49,854 it's not null from the beginning. 1436 01:09:49,854 --> 01:09:52,590 But my goal now is to free all of the memory I have allocated 1437 01:09:52,590 --> 01:09:54,430 from left to right, so to speak. 1438 01:09:54,430 --> 01:09:55,390 So how do I do that? 1439 01:09:55,390 --> 01:09:58,950 Well, if I've got a wooden block in front of me, 1440 01:09:58,950 --> 01:10:02,180 it's not safe to free that wooden block yet. 1441 01:10:02,180 --> 01:10:06,630 Because that wooden block, recall, contains the pointer to the next node. 1442 01:10:06,630 --> 01:10:09,740 So if I free this memory prematurely, I've 1443 01:10:09,740 --> 01:10:12,230 then stranded all subsequent nodes. 1444 01:10:12,230 --> 01:10:15,020 Because they are no longer accessible once I've told the computer 1445 01:10:15,020 --> 01:10:17,660 you can take back this chunk of memory for the first node. 1446 01:10:17,660 --> 01:10:22,250 So this line of code here, on line 52, is just saying temporarily 1447 01:10:22,250 --> 01:10:24,020 give me a variable call tmp. 1448 01:10:24,020 --> 01:10:28,740 And point it not at the list itself, the first node, point at the next node. 1449 01:10:28,740 --> 01:10:32,060 So it's like using my right hand to point at the current node, my left hand 1450 01:10:32,060 --> 01:10:37,140 to point at the next node, so that I can then, on line 53, free the list itself, 1451 01:10:37,140 --> 01:10:38,750 which should not be taken literally. 1452 01:10:38,750 --> 01:10:43,260 List represents the first node in the linked list, not the whole thing. 1453 01:10:43,260 --> 01:10:46,460 So when you say free list, that's, like, freeing just the current node. 1454 01:10:46,460 --> 01:10:47,270 But that's OK. 1455 01:10:47,270 --> 01:10:49,580 Even now this memory has been given back, 1456 01:10:49,580 --> 01:10:52,610 I still have my left hand pointing at every subsequent node 1457 01:10:52,610 --> 01:10:54,240 by way of the next one. 1458 01:10:54,240 --> 01:10:58,250 So now I can update list to equal that temporary variable 1459 01:10:58,250 --> 01:10:59,820 and just continue this loop. 1460 01:10:59,820 --> 01:11:01,700 So it's a way of sort of Pacman style, like, 1461 01:11:01,700 --> 01:11:04,250 gobbling up the entire linked list from left 1462 01:11:04,250 --> 01:11:07,970 to right by freeing the first node, the second node, the third node, and then 1463 01:11:07,970 --> 01:11:08,490 you're done. 1464 01:11:08,490 --> 01:11:12,170 But by using a temporary variable to look one step ahead to make sure 1465 01:11:12,170 --> 01:11:15,230 you don't chomp, free the memory too soon 1466 01:11:15,230 --> 01:11:20,250 and, therefore, lose access to all of those subsequent nodes. 1467 01:11:20,250 --> 01:11:21,420 All right, phew. 1468 01:11:21,420 --> 01:11:22,560 That was a big program. 1469 01:11:22,560 --> 01:11:25,860 But it was meant to be in succession, starting with an array, 1470 01:11:25,860 --> 01:11:29,910 transitioning into a dynamically allocated array, followed by, finally, 1471 01:11:29,910 --> 01:11:32,400 an implementation using linked list, albeit hardcoded 1472 01:11:32,400 --> 01:11:34,000 to support only three nodes. 1473 01:11:34,000 --> 01:11:37,200 But in that example, do you see some sample syntax via which you 1474 01:11:37,200 --> 01:11:39,930 can manipulate these kinds of nodes? 1475 01:11:39,930 --> 01:11:44,950 Questions or confusion that I can help address? 1476 01:11:44,950 --> 01:11:45,450 BRIAN: Yes. 1477 01:11:45,450 --> 01:11:48,060 Someone asked, similar to one of the examples you did before, 1478 01:11:48,060 --> 01:11:51,780 why could we not have just done malloc three times size of node 1479 01:11:51,780 --> 01:11:53,880 to get three nodes and do it that way? 1480 01:11:53,880 --> 01:11:54,760 DAVID MALAN: Really good question. 1481 01:11:54,760 --> 01:11:56,910 Could I not just use malloc and allocate all three at once? 1482 01:11:56,910 --> 01:11:57,430 Absolutely. 1483 01:11:57,430 --> 01:11:57,930 Yes. 1484 01:11:57,930 --> 01:11:59,430 That is completely your prerogative. 1485 01:11:59,430 --> 01:12:01,770 I did it a little more pedantically, one at a time. 1486 01:12:01,770 --> 01:12:05,550 But you could absolutely do it all three at once. 1487 01:12:05,550 --> 01:12:09,150 You would then need to use some pointer arithmetic though, 1488 01:12:09,150 --> 01:12:11,370 or you would need to square bracket notation 1489 01:12:11,370 --> 01:12:16,050 to treat that bigger chunk of memory as essentially an array of nodes, 1490 01:12:16,050 --> 01:12:17,570 and then stitch them together. 1491 01:12:17,570 --> 01:12:19,823 So I am assuming, for demonstration purposes, 1492 01:12:19,823 --> 01:12:22,740 that even though we have these little simple examples that demonstrate 1493 01:12:22,740 --> 01:12:26,670 the syntax, in a real world system, you're not going to be inserting 1, 1494 01:12:26,670 --> 01:12:28,110 then 2, then 3. 1495 01:12:28,110 --> 01:12:29,910 Odds are you're going to be inserting 1. 1496 01:12:29,910 --> 01:12:31,170 Some time passes. 1497 01:12:31,170 --> 01:12:33,750 Then you want to insert 2, so you allocate more memory. 1498 01:12:33,750 --> 01:12:35,190 Then some more time passes. 1499 01:12:35,190 --> 01:12:36,940 Then you want to insert 3. 1500 01:12:36,940 --> 01:12:42,150 And so there's gaps in between these chunks of code in the real world. 1501 01:12:42,150 --> 01:12:43,530 Other questions or confusion? 1502 01:12:43,530 --> 01:12:43,800 BRIAN: Yeah. 1503 01:12:43,800 --> 01:12:44,850 Another question came in. 1504 01:12:44,850 --> 01:12:47,113 Why would malloc ever fail to allocate memory? 1505 01:12:47,113 --> 01:12:48,780 DAVID MALAN: Why would malloc ever fail? 1506 01:12:48,780 --> 01:12:52,660 It's rarely going to fail, but if the computer is out of memory. 1507 01:12:52,660 --> 01:12:56,640 So essentially, if you're writing such a memory hungry program 1508 01:12:56,640 --> 01:13:00,000 with so many variables, big arrays, big structures, lots of data, 1509 01:13:00,000 --> 01:13:01,870 you may very well run out of memory. 1510 01:13:01,870 --> 01:13:04,680 Maybe that's two gigabytes, maybe it's four gigabytes or more. 1511 01:13:04,680 --> 01:13:07,080 But malloc may very well return null to you. 1512 01:13:07,080 --> 01:13:09,490 And so you should always check for it. 1513 01:13:09,490 --> 01:13:11,970 In fact, I dare say, on Macs and PCs, one 1514 01:13:11,970 --> 01:13:15,780 of the most common reasons, to this day, for programs to freeze, 1515 01:13:15,780 --> 01:13:17,910 to crash, for your home computer to reboot, 1516 01:13:17,910 --> 01:13:20,250 is truly because someone did something stupid, like I've 1517 01:13:20,250 --> 01:13:22,920 done multiple times now already today and last week, 1518 01:13:22,920 --> 01:13:25,080 by touching memory that you shouldn't have. 1519 01:13:25,080 --> 01:13:27,840 So in Problem Set 4 and now 5, any time you 1520 01:13:27,840 --> 01:13:30,210 experience one of those segmentation faults 1521 01:13:30,210 --> 01:13:35,910 whereby your program just crashes, that is the problem set 1522 01:13:35,910 --> 01:13:38,760 version of, like, your whole Mac or PC crashing, 1523 01:13:38,760 --> 01:13:40,530 because someone more experienced than you 1524 01:13:40,530 --> 01:13:43,740 made that same mistake in their code. 1525 01:13:43,740 --> 01:13:46,710 Just to reinforce this, let's take a quick final example 1526 01:13:46,710 --> 01:13:50,550 involving linked lists which, again, are this very one dimensional structure, 1527 01:13:50,550 --> 01:13:51,153 left to right. 1528 01:13:51,153 --> 01:13:53,820 And then we'll add a second dimension and see what that buys us. 1529 01:13:53,820 --> 01:13:55,300 But we've changed the numbers around here. 1530 01:13:55,300 --> 01:13:56,580 Now we still have our list. 1531 01:13:56,580 --> 01:13:58,770 But it's first pointing at the number two here. 1532 01:13:58,770 --> 01:14:01,770 And then the number two is pointing to some other chunk of memory that's 1533 01:14:01,770 --> 01:14:03,570 been malloced way over here. 1534 01:14:03,570 --> 01:14:05,190 And this, then, is the number 4. 1535 01:14:05,190 --> 01:14:06,820 And this, then, is the number 5. 1536 01:14:06,820 --> 01:14:08,700 So we have a linked list of size 3. 1537 01:14:08,700 --> 01:14:12,060 But I've deliberately spread the numbers out this time, 2, 4, 5. 1538 01:14:12,060 --> 01:14:15,000 Because suppose that we do want to insert more numbers into this list, 1539 01:14:15,000 --> 01:14:17,730 but in sorted order, it turns out that we 1540 01:14:17,730 --> 01:14:19,770 have to think a little bit differently when 1541 01:14:19,770 --> 01:14:23,280 we're adding nodes not to the end and not to the beginning, 1542 01:14:23,280 --> 01:14:24,247 but in the middle. 1543 01:14:24,247 --> 01:14:26,580 Like, when we want to allocate more nodes in the middle, 1544 01:14:26,580 --> 01:14:29,560 there's a bit more work that actually has to happen. 1545 01:14:29,560 --> 01:14:31,140 So how might we go about doing this? 1546 01:14:31,140 --> 01:14:34,260 Suppose that we want to allocate, for instance, the number 1. 1547 01:14:34,260 --> 01:14:35,820 And I want to add the number 1. 1548 01:14:35,820 --> 01:14:37,290 Well, we could use code like this. 1549 01:14:37,290 --> 01:14:39,060 This is the same code as we used before. 1550 01:14:39,060 --> 01:14:40,770 We allocate the size of a node. 1551 01:14:40,770 --> 01:14:43,080 We check whether it equals null, we initialize it 1552 01:14:43,080 --> 01:14:46,412 with a value we care about, and by default, we sit next equal to null. 1553 01:14:46,412 --> 01:14:48,120 And pictorially, it might look like this. 1554 01:14:48,120 --> 01:14:50,495 It's kind of floating somewhere in the computer's memory. 1555 01:14:50,495 --> 01:14:53,130 I have this temporary variable n, no longer pictured, 1556 01:14:53,130 --> 01:14:55,470 that I'm just pointing at when I allocate the number 1. 1557 01:14:55,470 --> 01:14:56,820 So what does this look like? 1558 01:14:56,820 --> 01:15:01,510 This is like having the number 1 maybe somewhere over here. 1559 01:15:01,510 --> 01:15:02,760 And I'll just put it in place. 1560 01:15:02,760 --> 01:15:05,410 We got lucky, and there was a chunk of memory right there. 1561 01:15:05,410 --> 01:15:07,140 So what do I want to now do? 1562 01:15:07,140 --> 01:15:08,920 Well, I want to go ahead and connect this. 1563 01:15:08,920 --> 01:15:12,720 So what I could do, just intuitively, if 1 should go before 2, 1564 01:15:12,720 --> 01:15:14,220 I can unplug this. 1565 01:15:14,220 --> 01:15:17,980 And I can plug this into here, which makes sense. 1566 01:15:17,980 --> 01:15:19,560 But there's already a problem. 1567 01:15:19,560 --> 01:15:22,290 If I have done nothing else up until this point, 1568 01:15:22,290 --> 01:15:26,580 I have just orphaned three nodes, 2, 4, and 5. 1569 01:15:26,580 --> 01:15:29,370 To orphan a node means to forget where it is. 1570 01:15:29,370 --> 01:15:32,070 And if I don't have another variable in my code, 1571 01:15:32,070 --> 01:15:34,080 or if I'm not acting out with one of my hands 1572 01:15:34,080 --> 01:15:36,960 pointing at the original beginning of the list, 1573 01:15:36,960 --> 01:15:39,330 I have literally orphaned the rest of the list. 1574 01:15:39,330 --> 01:15:41,970 And the technical implication of that, per last week, 1575 01:15:41,970 --> 01:15:44,010 is that now I have a massive memory leak. 1576 01:15:44,010 --> 01:15:46,560 You have just leaked the size of three nodes 1577 01:15:46,560 --> 01:15:49,320 in memory that you can literally never get back 1578 01:15:49,320 --> 01:15:51,310 until you reboot the computer, for instance, 1579 01:15:51,310 --> 01:15:54,742 or the program quits and the operating system cleans things up for you. 1580 01:15:54,742 --> 01:15:55,950 So you don't want to do this. 1581 01:15:55,950 --> 01:15:58,570 Order of operations actually matters. 1582 01:15:58,570 --> 01:16:01,140 So what I should probably do is this. 1583 01:16:01,140 --> 01:16:04,340 When I insert the number 1 first, I should probably 1584 01:16:04,340 --> 01:16:07,090 recognize that, well, the one begins at the beginning of the list. 1585 01:16:07,090 --> 01:16:10,980 So what I should really do is point this arrow also at the same node. 1586 01:16:10,980 --> 01:16:13,630 And we'll sort of do it a little sloppily like that. 1587 01:16:13,630 --> 01:16:16,720 But let me stipulate those are both pointing at the same node. 1588 01:16:16,720 --> 01:16:21,990 Now that my new node, AKA n and the code I showed is pointing at this thing, 1589 01:16:21,990 --> 01:16:23,820 now I can do kind of a switcheroo. 1590 01:16:23,820 --> 01:16:27,750 Because I'm already pointing at the final destination there. 1591 01:16:27,750 --> 01:16:30,720 And now I can remove this safely, because this is my list. 1592 01:16:30,720 --> 01:16:31,710 This is n. 1593 01:16:31,710 --> 01:16:33,625 Therefore, I have variables pointing to both 1594 01:16:33,625 --> 01:16:35,500 and I can go ahead and insert that correctly. 1595 01:16:35,500 --> 01:16:38,280 So long story short, order of operations matters. 1596 01:16:38,280 --> 01:16:43,800 So graphically, if I were to do this as before, just by saying list equals n, 1597 01:16:43,800 --> 01:16:48,950 if this is n, and this is list, and I adjust this arrow first, 1598 01:16:48,950 --> 01:16:50,420 bad things are going to happen. 1599 01:16:50,420 --> 01:16:54,890 Indeed, we end up orphaning 2, 4, and 5, thereby leaking a significant amount 1600 01:16:54,890 --> 01:16:55,850 of memory potentially. 1601 01:16:55,850 --> 01:16:57,645 And leaking any memory, typically, is bad. 1602 01:16:57,645 --> 01:16:58,770 So I don't want to do that. 1603 01:16:58,770 --> 01:17:00,200 So let's look at the correct code. 1604 01:17:00,200 --> 01:17:03,350 The correct code is going to be to start at n 1605 01:17:03,350 --> 01:17:08,210 and update its next field, this arrow here, to point at the same thing 1606 01:17:08,210 --> 01:17:10,550 as the list was originally pointing at. 1607 01:17:10,550 --> 01:17:14,600 And then go ahead and update the list, such that both of them 1608 01:17:14,600 --> 01:17:16,220 are currently pointing in duplicate. 1609 01:17:16,220 --> 01:17:18,750 Then update the list to point to the new node. 1610 01:17:18,750 --> 01:17:21,410 So, again, the code's a little different this time from before. 1611 01:17:21,410 --> 01:17:23,243 Because before we kept adding it to the end, 1612 01:17:23,243 --> 01:17:26,390 or I proposed verbally that we just add it to the beginning. 1613 01:17:26,390 --> 01:17:28,530 Here, we're adding it, indeed, at the beginning. 1614 01:17:28,530 --> 01:17:31,792 And so the actual steps, the actual code, are a little bit different. 1615 01:17:31,792 --> 01:17:34,250 Well, let's do one final example, if we want to allocate 3. 1616 01:17:34,250 --> 01:17:37,460 Well, I've got to malloc another node, the number 3. 1617 01:17:37,460 --> 01:17:42,090 Suppose that ends up somewhere in the computer's memory. 1618 01:17:42,090 --> 01:17:45,330 Let's go ahead and plop this one over here. 1619 01:17:45,330 --> 01:17:46,850 So now 3 is in place. 1620 01:17:46,850 --> 01:17:48,630 How do I now insert this thing? 1621 01:17:48,630 --> 01:17:52,460 Well, similar to before, I'm not going to want to update this pointer 1622 01:17:52,460 --> 01:17:56,060 and go like this and then plug this guy in over here. 1623 01:17:56,060 --> 01:17:57,830 Because now I've orphaned those two nodes. 1624 01:17:57,830 --> 01:17:59,870 So that, again is the wrong step. 1625 01:17:59,870 --> 01:18:03,052 When you're inside the middle of a linked list, any code 1626 01:18:03,052 --> 01:18:04,760 that you write to insert into the middle, 1627 01:18:04,760 --> 01:18:09,260 if you care about inserting in sorted order, this should be updated first. 1628 01:18:09,260 --> 01:18:12,500 And odds are I should kind of cheat and point this at the same thing, 1629 01:18:12,500 --> 01:18:14,930 even though there's only one physical plug at the moment. 1630 01:18:14,930 --> 01:18:18,300 So we'll just pretend that this is working. 1631 01:18:18,300 --> 01:18:19,140 There we go. 1632 01:18:19,140 --> 01:18:23,430 And now I can go ahead and safely say that n and the previous node 1633 01:18:23,430 --> 01:18:25,200 are already pointing where they should be. 1634 01:18:25,200 --> 01:18:28,490 So now it's safe for me to unplug this one 1635 01:18:28,490 --> 01:18:31,820 and go ahead and update this final arrow to point 1636 01:18:31,820 --> 01:18:35,820 at the new node in the correct location. 1637 01:18:35,820 --> 01:18:37,250 So let's see that in code again. 1638 01:18:37,250 --> 01:18:41,480 If I go here, I've got graphically the node 3 kind of floating in space. 1639 01:18:41,480 --> 01:18:45,830 I first update its next field to point also at the 4 in duplicate. 1640 01:18:45,830 --> 01:18:48,620 Then, I update the 2 to point to the 3. 1641 01:18:48,620 --> 01:18:53,630 The goal being, again, to avoid any leaking of memory or orphaning 1642 01:18:53,630 --> 01:18:55,250 of nodes. 1643 01:18:55,250 --> 01:18:59,340 All right , we are about to leave linked lists behind. 1644 01:18:59,340 --> 01:19:02,760 Because as multiple of you have noted or probably thought, 1645 01:19:02,760 --> 01:19:05,300 they're good, but maybe not great. 1646 01:19:05,300 --> 01:19:08,160 They're good in that they are dynamic and I can add to them, as 1647 01:19:08,160 --> 01:19:10,590 by inserting at the beginning if I really want 1648 01:19:10,590 --> 01:19:12,030 and don't care about sorted order. 1649 01:19:12,030 --> 01:19:13,980 But they're still a good amount of work to do 1650 01:19:13,980 --> 01:19:16,770 if I want to keep them in sorted order and I insert them 1651 01:19:16,770 --> 01:19:17,940 in the middle or the end. 1652 01:19:17,940 --> 01:19:19,830 Because that's, like, Big O of n if I keep 1653 01:19:19,830 --> 01:19:22,090 traversing all of these darn arrows. 1654 01:19:22,090 --> 01:19:24,600 So we get the dynamism, but we don't necessarily 1655 01:19:24,600 --> 01:19:26,250 get the performance increase. 1656 01:19:26,250 --> 01:19:29,580 But we fundamentally have opened up a whole new world to ourselves. 1657 01:19:29,580 --> 01:19:32,670 We can now stitch together these data structures 1658 01:19:32,670 --> 01:19:35,910 in memory using pointers as our thread, if you will. 1659 01:19:35,910 --> 01:19:39,690 We can just use memory as a canvas, painting on it any values we want. 1660 01:19:39,690 --> 01:19:42,510 And we can sort of remember where all of those values are. 1661 01:19:42,510 --> 01:19:47,010 But this is indeed very single, one dimensional, left to right. 1662 01:19:47,010 --> 01:19:49,763 What if we give ourselves a second dimension? 1663 01:19:49,763 --> 01:19:51,930 What if we start thinking sort of not left to right, 1664 01:19:51,930 --> 01:19:53,632 but also left to right, up, down. 1665 01:19:53,632 --> 01:19:55,590 So, again, this is meaningless to the computer. 1666 01:19:55,590 --> 01:19:59,100 The computer just thinks of memory as being byte 0, 1, 2, 3. 1667 01:19:59,100 --> 01:20:01,680 But we humans can kind of think of these data structures 1668 01:20:01,680 --> 01:20:04,620 a little more abstractly, a little more simply. 1669 01:20:04,620 --> 01:20:06,780 And we can think about them in a way familiar, 1670 01:20:06,780 --> 01:20:08,170 perhaps, to us in the real world. 1671 01:20:08,170 --> 01:20:10,630 Trees, not the ones so much that grow from the ground, 1672 01:20:10,630 --> 01:20:12,412 but if you're familiar with family trees. 1673 01:20:12,412 --> 01:20:14,370 Where you might have a matriarch or a patriarch 1674 01:20:14,370 --> 01:20:16,870 and then sort of descendants hanging off of them graphically 1675 01:20:16,870 --> 01:20:19,662 on a piece of paper, something you might have made in grade school, 1676 01:20:19,662 --> 01:20:20,470 for instance. 1677 01:20:20,470 --> 01:20:22,860 We can leverage this idea of a tree structure 1678 01:20:22,860 --> 01:20:26,730 that has a root that kind of branches and branches and branches 1679 01:20:26,730 --> 01:20:28,390 and grows top to bottom. 1680 01:20:28,390 --> 01:20:32,610 So, again, more like a family tree than an actual tree in the soil. 1681 01:20:32,610 --> 01:20:36,720 So with trees, it turns out, this idea of a tree, 1682 01:20:36,720 --> 01:20:40,410 we can take some of the lessons learned from linked lists. 1683 01:20:40,410 --> 01:20:43,800 But we can gain back some of the features of arrays. 1684 01:20:43,800 --> 01:20:45,220 And we can do that as follows. 1685 01:20:45,220 --> 01:20:48,420 Consider the following definition of what we're 1686 01:20:48,420 --> 01:20:50,400 about to call a binary search tree. 1687 01:20:50,400 --> 01:20:54,360 Binary search being a very good thing from the first, Week 0. 1688 01:20:54,360 --> 01:20:57,030 And a tree now being the new idea. 1689 01:20:57,030 --> 01:21:00,750 Here's an array from Week 0, Week 1, Week 2, whenever. 1690 01:21:00,750 --> 01:21:02,160 And it's of size 7. 1691 01:21:02,160 --> 01:21:05,802 And recall that if it's sorted, we can apply binary search to this array. 1692 01:21:05,802 --> 01:21:06,510 And that's great. 1693 01:21:06,510 --> 01:21:08,218 Because if we want to search for a value, 1694 01:21:08,218 --> 01:21:09,780 we can start looking in the middle. 1695 01:21:09,780 --> 01:21:13,080 Then we can go either left or right, halfway between each. 1696 01:21:13,080 --> 01:21:15,840 And then we can similarly go left or right. 1697 01:21:15,840 --> 01:21:20,460 So binary search on a sorted array was so powerful 1698 01:21:20,460 --> 01:21:23,010 because it was Big O of log n, we've concluded, 1699 01:21:23,010 --> 01:21:27,330 by just having and having and having the problem again and again, 1700 01:21:27,330 --> 01:21:30,300 tearing the phone book in half again and again and again. 1701 01:21:30,300 --> 01:21:33,300 But the problem with binary search is that it 1702 01:21:33,300 --> 01:21:38,010 requires that you use an array so that you have random access. 1703 01:21:38,010 --> 01:21:42,240 You have to be able to index into the array in constant time 1704 01:21:42,240 --> 01:21:46,830 using simple arithmetic, like bracket 0, bracket n minus 1, bracket n minus 1 1705 01:21:46,830 --> 01:21:48,720 divided by 2, to get the halfway point. 1706 01:21:48,720 --> 01:21:52,470 You have to be able to do arithmetic on the data structure. 1707 01:21:52,470 --> 01:21:56,280 And we've just proposed getting rid of that random access 1708 01:21:56,280 --> 01:21:59,070 by transitioning to a dynamic data structure 1709 01:21:59,070 --> 01:22:01,530 like a linked list instead of an array. 1710 01:22:01,530 --> 01:22:02,740 But what if we do this? 1711 01:22:02,740 --> 01:22:07,140 What if you and I start thinking not on one dimension, but on two dimensions? 1712 01:22:07,140 --> 01:22:10,270 And what if we alter our thinking to be like this? 1713 01:22:10,270 --> 01:22:13,710 So think of an array, perhaps, as being a two dimensional 1714 01:22:13,710 --> 01:22:18,000 structure that has not only with or length, but also height. 1715 01:22:18,000 --> 01:22:21,750 And so we maintain, it seems, visually this relationship 1716 01:22:21,750 --> 01:22:23,470 between all of these values. 1717 01:22:23,470 --> 01:22:24,240 But you know what? 1718 01:22:24,240 --> 01:22:28,830 We can stitch all of these values together using what? 1719 01:22:28,830 --> 01:22:29,700 Well, pointers. 1720 01:22:29,700 --> 01:22:32,130 Pointers are this new building block that we can use 1721 01:22:32,130 --> 01:22:34,410 to stitch together things in memory. 1722 01:22:34,410 --> 01:22:36,480 If the things in memory are numbers, that's fine. 1723 01:22:36,480 --> 01:22:37,230 They're integers. 1724 01:22:37,230 --> 01:22:40,440 But if we throw a little more memory at them, if we use a node, 1725 01:22:40,440 --> 01:22:42,840 and we kind of wrap the integer in a node such 1726 01:22:42,840 --> 01:22:46,200 that that node contains not only numbers but pointers, 1727 01:22:46,200 --> 01:22:48,420 we could probably draw a picture like this, not 1728 01:22:48,420 --> 01:22:51,510 unlike a family tree, where there's a root 1729 01:22:51,510 --> 01:22:54,960 node, at the very top in this case, and then children, so to speak. 1730 01:22:54,960 --> 01:22:59,670 Left child and right child, and that definition repeats again and again. 1731 01:22:59,670 --> 01:23:04,170 And it turns out the computer scientists do use this data structure in order 1732 01:23:04,170 --> 01:23:07,320 to have the dynamism of a linked list, where 1733 01:23:07,320 --> 01:23:10,680 you can add more and more nodes to the tree 1734 01:23:10,680 --> 01:23:13,740 by just adding more and more squares even lower 1735 01:23:13,740 --> 01:23:15,390 than the 1, the 3, the 5, and the 7. 1736 01:23:15,390 --> 01:23:18,307 And just use more pointers to kind of stitch them together, to sort of 1737 01:23:18,307 --> 01:23:22,020 grow the tree vertically, down, down, down, if you will. 1738 01:23:22,020 --> 01:23:25,530 But a good computer scientist would recognize that you shouldn't just 1739 01:23:25,530 --> 01:23:27,780 put these numbers in random locations, otherwise, 1740 01:23:27,780 --> 01:23:29,490 you're really just wasting your time. 1741 01:23:29,490 --> 01:23:30,830 You should use some algorithm. 1742 01:23:30,830 --> 01:23:31,330 And notice. 1743 01:23:31,330 --> 01:23:36,640 Does anyone notice the pattern to this tree? 1744 01:23:36,640 --> 01:23:39,850 Can anyone verbalize or textualize in the chat 1745 01:23:39,850 --> 01:23:44,410 what pattern is manifest by these seven nodes in this tree? 1746 01:23:44,410 --> 01:23:46,130 They're not randomly ordered. 1747 01:23:46,130 --> 01:23:50,230 They're very deliberately ordered left to right, top to bottom, 1748 01:23:50,230 --> 01:23:51,040 in a certain way. 1749 01:23:51,040 --> 01:23:55,360 Can anyone put their finger on what the definition of this thing is? 1750 01:23:55,360 --> 01:23:57,700 What is the most important characteristic besides it 1751 01:23:57,700 --> 01:24:00,820 just being drawn like a family tree? 1752 01:24:00,820 --> 01:24:02,050 [? Greg? ?] 1753 01:24:02,050 --> 01:24:05,353 AUDIENCE: You have put in the middle of them, 1754 01:24:05,353 --> 01:24:07,270 on top of them you have put the middle number. 1755 01:24:07,270 --> 01:24:09,550 For example, between 1 and 3, you have put 2. 1756 01:24:09,550 --> 01:24:13,240 Between 5 and 7, you have put 6. 1757 01:24:13,240 --> 01:24:15,605 So on top of them, you have put the middle number. 1758 01:24:15,605 --> 01:24:16,480 DAVID MALAN: Exactly. 1759 01:24:16,480 --> 01:24:18,272 There's this pattern to all of the numbers. 1760 01:24:18,272 --> 01:24:19,330 Between 1 and 3 is 2. 1761 01:24:19,330 --> 01:24:20,740 Between 5 and 7 is 6. 1762 01:24:20,740 --> 01:24:22,595 Between 2 and 6 is 4. 1763 01:24:22,595 --> 01:24:24,970 And it doesn't even have to be the middle number, per se. 1764 01:24:24,970 --> 01:24:27,310 I can generalize it a little bit and comment 1765 01:24:27,310 --> 01:24:30,760 that, if you pick any node in this tree, so to speak, 1766 01:24:30,760 --> 01:24:33,700 its left child will be less than its value, 1767 01:24:33,700 --> 01:24:36,610 and its right child's will be greater than its value. 1768 01:24:36,610 --> 01:24:38,290 And we can do that again and again. 1769 01:24:38,290 --> 01:24:39,100 So here's 4. 1770 01:24:39,100 --> 01:24:40,270 Its left child is 2. 1771 01:24:40,270 --> 01:24:41,050 That's less than. 1772 01:24:41,050 --> 01:24:41,650 Here's 4. 1773 01:24:41,650 --> 01:24:42,760 It's right child to 6. 1774 01:24:42,760 --> 01:24:43,750 That's greater than. 1775 01:24:43,750 --> 01:24:44,750 We can do this again. 1776 01:24:44,750 --> 01:24:45,490 Let's go to 2. 1777 01:24:45,490 --> 01:24:47,350 Its left child is 1, which is less. 1778 01:24:47,350 --> 01:24:49,480 Its right child is 3, which is more. 1779 01:24:49,480 --> 01:24:51,850 6, its left child is 5, which is less. 1780 01:24:51,850 --> 01:24:54,660 6, its right child is 7, which is more. 1781 01:24:54,660 --> 01:24:56,410 And so this is actually, if you don't mind 1782 01:24:56,410 --> 01:25:00,730 the revisiting recursion from last week, this is a recursive definition. 1783 01:25:00,730 --> 01:25:03,140 This is a recursive data structure. 1784 01:25:03,140 --> 01:25:05,230 So it's not only algorithms or functions that 1785 01:25:05,230 --> 01:25:07,330 can be recursive by calling themselves. 1786 01:25:07,330 --> 01:25:10,300 A data structure can also be recursive. 1787 01:25:10,300 --> 01:25:12,440 After all, what is this thing? 1788 01:25:12,440 --> 01:25:13,570 This is a tree, yes. 1789 01:25:13,570 --> 01:25:14,590 I'll stipulate. 1790 01:25:14,590 --> 01:25:18,520 But it's technically a tree with two trees. 1791 01:25:18,520 --> 01:25:19,120 Right? 1792 01:25:19,120 --> 01:25:23,140 This node here, number four, technically has two children. 1793 01:25:23,140 --> 01:25:25,750 And each of those children is itself a tree. 1794 01:25:25,750 --> 01:25:27,070 It's a smaller tree. 1795 01:25:27,070 --> 01:25:29,830 But it's the same exact definition again and again. 1796 01:25:29,830 --> 01:25:32,710 And any time we see a recursive data structure, 1797 01:25:32,710 --> 01:25:35,350 it's actually going to be an opportunity to use recursive code, 1798 01:25:35,350 --> 01:25:37,183 which we'll take a look at in just a moment. 1799 01:25:37,183 --> 01:25:39,290 But for now, notice what we've achieved again, 1800 01:25:39,290 --> 01:25:42,700 the dynamism of using pointer so that we can, if we want, 1801 01:25:42,700 --> 01:25:45,490 add more nodes to this tree, as by stringing them 1802 01:25:45,490 --> 01:25:47,710 along the bottom in the correct order. 1803 01:25:47,710 --> 01:25:51,970 And yet, we've preserved an important order for binary search, 1804 01:25:51,970 --> 01:25:55,960 AKA, the formal name of this data structure, binary search tree, 1805 01:25:55,960 --> 01:26:00,190 by making sure that left child is always less, right child is always more. 1806 01:26:00,190 --> 01:26:03,790 Because now, we can go about searching this thing more efficiently. 1807 01:26:03,790 --> 01:26:04,300 How? 1808 01:26:04,300 --> 01:26:06,800 Well, if I want to search for the number 3, what do I do? 1809 01:26:06,800 --> 01:26:08,592 Well, I start at the beginning of the tree. 1810 01:26:08,592 --> 01:26:13,028 Just like with a linked list, you start at the beginning of the linked list. 1811 01:26:13,028 --> 01:26:15,820 So with the tree, you start with the root of the tree [? always. ?] 1812 01:26:15,820 --> 01:26:17,410 Suppose I want to search for 3. 1813 01:26:17,410 --> 01:26:18,800 Well, what do I do? 1814 01:26:18,800 --> 01:26:21,220 Well, 3 is obviously less than 4. 1815 01:26:21,220 --> 01:26:23,980 So just like in Week 0 where I tore the phone book in half, 1816 01:26:23,980 --> 01:26:27,340 you can now think of this as chopping down half of the tree. 1817 01:26:27,340 --> 01:26:30,130 Because you know 3, if it's present, is definitely not 1818 01:26:30,130 --> 01:26:33,790 going to be anywhere over here, so we can focus our attention down here. 1819 01:26:33,790 --> 01:26:34,835 Here's the number 2. 1820 01:26:34,835 --> 01:26:35,710 This is another tree. 1821 01:26:35,710 --> 01:26:37,780 It's just a smaller subtree, if you will. 1822 01:26:37,780 --> 01:26:39,010 How do I find the number 3? 1823 01:26:39,010 --> 01:26:41,290 Well, I look to the right because it's greater than. 1824 01:26:41,290 --> 01:26:43,370 And boom, I found it. 1825 01:26:43,370 --> 01:26:46,330 But by contrast, suppose I were searching for the number 8. 1826 01:26:46,330 --> 01:26:47,350 I would start here. 1827 01:26:47,350 --> 01:26:48,130 I would look here. 1828 01:26:48,130 --> 01:26:48,880 I would look here. 1829 01:26:48,880 --> 01:26:51,560 And then conclude, no, it's not there. 1830 01:26:51,560 --> 01:26:53,680 But, again, every time I search for that 8, 1831 01:26:53,680 --> 01:26:58,400 I'm ignoring this half of the tree, this half of the subtree, and so forth. 1832 01:26:58,400 --> 01:27:00,220 So you're going to achieve, it would seem, 1833 01:27:00,220 --> 01:27:03,730 the same kind of power, the same kind of performance as we saw from Week 0. 1834 01:27:03,730 --> 01:27:06,250 So how do we translate this idea now into code? 1835 01:27:06,250 --> 01:27:08,110 We have all the building blocks already. 1836 01:27:08,110 --> 01:27:11,020 Let me go ahead and propose that, instead of the node 1837 01:27:11,020 --> 01:27:15,130 we used before for a linked list, which looked like this, with a number and one 1838 01:27:15,130 --> 01:27:16,480 pointer called next-- 1839 01:27:16,480 --> 01:27:19,438 but, again, we could have called those things anything-- let's go ahead 1840 01:27:19,438 --> 01:27:24,580 and make room for not just a number, but also two pointers, one 1841 01:27:24,580 --> 01:27:27,370 that I'll call left, one that I'll call right. 1842 01:27:27,370 --> 01:27:30,730 Both of those is still a pointer to a struct node. 1843 01:27:30,730 --> 01:27:35,170 So same terminology as before, but now I have two pointers instead of one 1844 01:27:35,170 --> 01:27:38,560 so that one can conceptually point to the left and point to a smaller 1845 01:27:38,560 --> 01:27:39,190 subtree. 1846 01:27:39,190 --> 01:27:43,120 One can point to the right and point to a larger subtree. 1847 01:27:43,120 --> 01:27:46,102 So how do we go about implementing something like binary search? 1848 01:27:46,102 --> 01:27:47,560 Well, let's actually see some code. 1849 01:27:47,560 --> 01:27:50,140 And this is where recursion really gets kind of cool. 1850 01:27:50,140 --> 01:27:53,560 We kind of forced it when building [? Maro's ?] pyramid with recursion. 1851 01:27:53,560 --> 01:27:55,180 Like, yeah, you can do it. 1852 01:27:55,180 --> 01:27:59,110 And yes, the pyramid was, I claimed, a recursive physical structure 1853 01:27:59,110 --> 01:28:00,880 or virtual structure in the game. 1854 01:28:00,880 --> 01:28:05,680 But with data structures and pointers, now recursion really starts to shine. 1855 01:28:05,680 --> 01:28:06,970 So let's consider this. 1856 01:28:06,970 --> 01:28:10,060 If I declare a function in C whose purpose in life 1857 01:28:10,060 --> 01:28:13,660 is to search a tree for a number, it's going to, by definition, 1858 01:28:13,660 --> 01:28:15,580 search from the root on down. 1859 01:28:15,580 --> 01:28:16,820 How do we implement this? 1860 01:28:16,820 --> 01:28:19,278 Well, my function, I'll propose, is going to return a bool. 1861 01:28:19,278 --> 01:28:22,180 True or false, the number is in the tree, yes or no. 1862 01:28:22,180 --> 01:28:27,070 It's going to take two arguments, a pointer to a node, AKA tree. 1863 01:28:27,070 --> 01:28:29,442 I could call it root or anything else. 1864 01:28:29,442 --> 01:28:31,150 And it's going to take a number, which is 1865 01:28:31,150 --> 01:28:36,020 the number I care about, whether it's four or six or eight or anything else. 1866 01:28:36,020 --> 01:28:38,450 So what's going to be my first chunk of code? 1867 01:28:38,450 --> 01:28:41,760 Well, let me do the best practice that I keep preaching. 1868 01:28:41,760 --> 01:28:44,860 Any time you're dealing with pointers, check for null 1869 01:28:44,860 --> 01:28:47,710 so that your program doesn't freeze or crash or bad thing happens. 1870 01:28:47,710 --> 01:28:48,460 Because who knows? 1871 01:28:48,460 --> 01:28:51,190 Maybe you will accidentally or maybe intentionally 1872 01:28:51,190 --> 01:28:53,830 pass this function on null pointer, because there's no tree. 1873 01:28:53,830 --> 01:28:55,000 Maybe you'll screw up. 1874 01:28:55,000 --> 01:28:57,880 And that's OK, so long as your code is self-defensive. 1875 01:28:57,880 --> 01:28:59,740 So always check pointers for null. 1876 01:28:59,740 --> 01:29:02,710 If so, if the tree is null, that is, there is no tree there, 1877 01:29:02,710 --> 01:29:04,610 obviously, the number is not present. 1878 01:29:04,610 --> 01:29:06,110 So you just return false. 1879 01:29:06,110 --> 01:29:08,410 So that's one of our base cases, so to speak. 1880 01:29:08,410 --> 01:29:14,870 Else, if the number you're looking for is less than the tree's own number-- 1881 01:29:14,870 --> 01:29:19,390 so, again, this arrow notation means take tree, which is a node *, 1882 01:29:19,390 --> 01:29:20,740 so take this pointer. 1883 01:29:20,740 --> 01:29:22,960 Go there to the actual node. 1884 01:29:22,960 --> 01:29:25,450 And look in its own number field. 1885 01:29:25,450 --> 01:29:28,390 If the number you're looking for from the argument 1886 01:29:28,390 --> 01:29:32,010 is less than the number in the tree's own number field, 1887 01:29:32,010 --> 01:29:33,760 well, that means that you want to go left. 1888 01:29:33,760 --> 01:29:36,593 And whereas in the phone book, I went to the left of the phone book, 1889 01:29:36,593 --> 01:29:39,130 here we're going to go to the left subtree. 1890 01:29:39,130 --> 01:29:40,630 But how do I search a subtree? 1891 01:29:40,630 --> 01:29:45,430 Here's where it's important that a tree is a recursive data structure. 1892 01:29:45,430 --> 01:29:50,000 A tree is two subtrees with a new root node, as before. 1893 01:29:50,000 --> 01:29:54,340 So I already have code via which I can search a smaller tree. 1894 01:29:54,340 --> 01:29:57,730 So I can just say search the left subtree, 1895 01:29:57,730 --> 01:30:00,580 as expressed here, which means start at the current node 1896 01:30:00,580 --> 01:30:04,340 and go to the left child and pass in the same number. 1897 01:30:04,340 --> 01:30:05,530 The number is not changing. 1898 01:30:05,530 --> 01:30:06,940 But the tree is getting smaller. 1899 01:30:06,940 --> 01:30:09,790 I've effectively, in code, chopped the tree in half. 1900 01:30:09,790 --> 01:30:11,320 And I'm ignoring the right half. 1901 01:30:11,320 --> 01:30:13,990 And I'm returning whatever that answer is. 1902 01:30:13,990 --> 01:30:16,390 Otherwise, if the number I care about is greater 1903 01:30:16,390 --> 01:30:18,760 than the number in the current node, do the opposite. 1904 01:30:18,760 --> 01:30:22,132 Search the right subtree, passing in the same number. 1905 01:30:22,132 --> 01:30:23,840 So, again, just like with the phone book, 1906 01:30:23,840 --> 01:30:25,340 it kept getting smaller and smaller? 1907 01:30:25,340 --> 01:30:28,750 Here I keep searching a smaller and smaller subtree. 1908 01:30:28,750 --> 01:30:31,810 Because I keep chopping off branches left or right as 1909 01:30:31,810 --> 01:30:33,710 I go from top to bottom. 1910 01:30:33,710 --> 01:30:35,530 There's one final case. 1911 01:30:35,530 --> 01:30:38,720 And let me toss this out to the group. 1912 01:30:38,720 --> 01:30:41,110 There's a fourth case. 1913 01:30:41,110 --> 01:30:45,850 Verbally or textually, what else should I be checking for and doing here? 1914 01:30:45,850 --> 01:30:48,370 I've left room for one final case. 1915 01:30:48,370 --> 01:30:51,833 BRIAN: A few people are suggesting if the tree itself is the number. 1916 01:30:51,833 --> 01:30:54,250 DAVID MALAN: If the tree itself contains the number, yeah. 1917 01:30:54,250 --> 01:30:57,850 So if the number in the tree equals equals 1918 01:30:57,850 --> 01:31:02,300 the number I'm looking for, go ahead and, only in this case, return true. 1919 01:31:02,300 --> 01:31:04,840 And so this is where the code, again, gets kind of-- 1920 01:31:04,840 --> 01:31:07,570 recursion, rather, gets a little mind-bending. 1921 01:31:07,570 --> 01:31:11,120 I only have false up here, true here. 1922 01:31:11,120 --> 01:31:16,040 But not in either of these middle two branches, no pun intended, if you will. 1923 01:31:16,040 --> 01:31:17,170 But that's OK. 1924 01:31:17,170 --> 01:31:21,220 Because my code is designed in such a way that if I search the left subtree 1925 01:31:21,220 --> 01:31:24,580 and there's nothing there, like, literally I'm at the leaf of the tree, 1926 01:31:24,580 --> 01:31:26,535 so to speak, then it's going to return false. 1927 01:31:26,535 --> 01:31:27,160 So that's fine. 1928 01:31:27,160 --> 01:31:28,690 That's, like, if I search for the number 8. 1929 01:31:28,690 --> 01:31:29,890 It's not even in the tree. 1930 01:31:29,890 --> 01:31:32,640 I'm only going to realize that once I fall off the end of the tree 1931 01:31:32,640 --> 01:31:33,740 and see, oops, null. 1932 01:31:33,740 --> 01:31:35,180 I'll just return false. 1933 01:31:35,180 --> 01:31:38,750 But if I ever see the number along the way, I will return true. 1934 01:31:38,750 --> 01:31:43,145 And these two inner calls to search, these two recursive calls to search, 1935 01:31:43,145 --> 01:31:44,770 are just kind of like passing the buck. 1936 01:31:44,770 --> 01:31:47,590 Instead of answering true or false themselves, 1937 01:31:47,590 --> 01:31:51,070 they're returning whatever the answer to a smaller question 1938 01:31:51,070 --> 01:31:55,670 is by searching the left or right tree instead respectively. 1939 01:31:55,670 --> 01:31:57,520 So, again, this is where recursion starts 1940 01:31:57,520 --> 01:32:02,620 to get not really forced or even necessarily really as forced, 1941 01:32:02,620 --> 01:32:04,330 but really as appropriate. 1942 01:32:04,330 --> 01:32:08,500 When your data is itself recursive, then recursion in as a coding technique 1943 01:32:08,500 --> 01:32:10,600 really rather shines. 1944 01:32:10,600 --> 01:32:14,020 So if ultimately, we have-- oh, [? an ?] minor optimization. 1945 01:32:14,020 --> 01:32:15,975 As we might have noted in Week 0 with Scratch, 1946 01:32:15,975 --> 01:32:18,850 we, of course, don't need to explicitly check if the number is equal. 1947 01:32:18,850 --> 01:32:22,695 We can get rid of that last condition and just assume that if it's not null 1948 01:32:22,695 --> 01:32:24,820 and it's not to the left and it's not to the right, 1949 01:32:24,820 --> 01:32:27,260 we must be standing right on top of it. 1950 01:32:27,260 --> 01:32:29,230 And so we just return true there. 1951 01:32:29,230 --> 01:32:31,840 Well, let me summarize the picture here. 1952 01:32:31,840 --> 01:32:33,940 This is now a two dimensional data structure. 1953 01:32:33,940 --> 01:32:38,020 And it's sort of better than a linked list in that now it's two dimensions. 1954 01:32:38,020 --> 01:32:40,750 I gain back binary search, which is amazing, so 1955 01:32:40,750 --> 01:32:45,460 long as I keep my data in sorted order per this binary search tree definition. 1956 01:32:45,460 --> 01:32:47,480 But I've surely paid a price. 1957 01:32:47,480 --> 01:32:47,980 Right? 1958 01:32:47,980 --> 01:32:52,990 Nothing is absolutely better than anything else in our story thus far. 1959 01:32:52,990 --> 01:32:56,080 So what is the downside of a tree? 1960 01:32:56,080 --> 01:32:59,590 What price have I secretly or not so secretly paid here, 1961 01:32:59,590 --> 01:33:02,710 while preaching the upsides of trees? 1962 01:33:02,710 --> 01:33:06,700 And, again, the answer is often in this context sort of space, or time, 1963 01:33:06,700 --> 01:33:13,140 or developer time, or money, or some resource, personal, or physical, 1964 01:33:13,140 --> 01:33:14,800 or real world. 1965 01:33:14,800 --> 01:33:15,300 Yeah. 1966 01:33:15,300 --> 01:33:17,970 How about over to [INAUDIBLE]? 1967 01:33:17,970 --> 01:33:24,000 AUDIENCE: So I think that inserting is no longer constant time. 1968 01:33:24,000 --> 01:33:25,650 And I guess we need more memory. 1969 01:33:25,650 --> 01:33:28,680 You need memory to [? sort ?] two pointers instead of one this time. 1970 01:33:28,680 --> 01:33:29,430 DAVID MALAN: Yeah. 1971 01:33:29,430 --> 01:33:32,220 It seems insertion is no longer constant time. 1972 01:33:32,220 --> 01:33:35,940 Because if I need to preserve sorted order, I can't just put it at the top. 1973 01:33:35,940 --> 01:33:37,980 I can't just keep pushing everything else down 1974 01:33:37,980 --> 01:33:39,563 because things might get out of order. 1975 01:33:39,563 --> 01:33:43,110 In that case, it would [? seem. ?] Or rather, even if I maintain the order, 1976 01:33:43,110 --> 01:33:45,180 it might kind of get very long and stringy. 1977 01:33:45,180 --> 01:33:48,507 If I add, for instance, another number, another number, 1978 01:33:48,507 --> 01:33:50,340 and I keep jamming it at the top, I probably 1979 01:33:50,340 --> 01:33:52,890 need to kind of keep things balanced, if you will. 1980 01:33:52,890 --> 01:33:55,590 And yeah, the bigger point, too, is that immediately I'm 1981 01:33:55,590 --> 01:33:57,660 using twice as many pointers. 1982 01:33:57,660 --> 01:34:00,720 So now my node is getting even bigger than these things. 1983 01:34:00,720 --> 01:34:03,360 I now have room for not only a number and a pointer, 1984 01:34:03,360 --> 01:34:06,780 but another pointer, which is, of course, going to cost me more space. 1985 01:34:06,780 --> 01:34:08,030 Again, so a trade off there. 1986 01:34:08,030 --> 01:34:09,780 And let's go ahead and ask the group here, 1987 01:34:09,780 --> 01:34:13,080 when it comes to insertion, why don't we consider for a moment what 1988 01:34:13,080 --> 01:34:17,500 the running time of insertion might be when inserting into a binary search 1989 01:34:17,500 --> 01:34:18,000 tree. 1990 01:34:18,000 --> 01:34:20,458 If you'd like to pull up the URL as always, let me go ahead 1991 01:34:20,458 --> 01:34:22,450 and present this one. 1992 01:34:22,450 --> 01:34:26,650 What's the running time of inserting into a binary search tree? 1993 01:34:26,650 --> 01:34:29,370 So if you want to insert the number 0 into that tree, 1994 01:34:29,370 --> 01:34:36,390 if you want to insert the number 8, or anything in between or bigger 1995 01:34:36,390 --> 01:34:39,580 or smaller, what are the answers here? 1996 01:34:39,580 --> 01:34:42,540 So not quite as big of a victory for the tallest bar. 1997 01:34:42,540 --> 01:34:44,680 About 60% of you think log n. 1998 01:34:44,680 --> 01:34:46,680 And the good instincts there, frankly, are so 1999 01:34:46,680 --> 01:34:47,970 that is going to be the right answer. 2000 01:34:47,970 --> 01:34:49,553 And that's kind of the right instinct. 2001 01:34:49,553 --> 01:34:52,920 Anytime you have binary search, odds are you're talking something logarithmic. 2002 01:34:52,920 --> 01:34:56,730 But we've also seen divide and conquer and merge sort with n log n, 2003 01:34:56,730 --> 01:34:59,385 so not unreasonable that about 10% of you think that, too. 2004 01:34:59,385 --> 01:35:01,230 n squared would actually be bad. 2005 01:35:01,230 --> 01:35:04,830 So n squared is, like, the worst of the times we've seen thus far. 2006 01:35:04,830 --> 01:35:08,343 And that would suggest that a tree is even worse than a linked list, 2007 01:35:08,343 --> 01:35:09,510 is even worse than an array. 2008 01:35:09,510 --> 01:35:11,340 And thankfully, we're not at that point. 2009 01:35:11,340 --> 01:35:14,790 But it's indeed log n, based on certain assumptions. 2010 01:35:14,790 --> 01:35:16,360 So why is that? 2011 01:35:16,360 --> 01:35:19,350 So if we consider the tree from a moment ago, 2012 01:35:19,350 --> 01:35:21,430 it looked a little something like this. 2013 01:35:21,430 --> 01:35:25,740 And what is involved in inserting into a tree? 2014 01:35:25,740 --> 01:35:27,780 Well, suppose I want to insert the number 8. 2015 01:35:27,780 --> 01:35:28,830 Well, I start here. 2016 01:35:28,830 --> 01:35:31,410 And it obviously belongs to the right because 8 is bigger. 2017 01:35:31,410 --> 01:35:32,157 I go here. 2018 01:35:32,157 --> 01:35:33,990 It belongs to the right because 8 is bigger. 2019 01:35:33,990 --> 01:35:34,570 I go here. 2020 01:35:34,570 --> 01:35:36,250 It belongs to the right, because 8 is bigger. 2021 01:35:36,250 --> 01:35:38,790 And so a new node is going to be created somewhere down here. 2022 01:35:38,790 --> 01:35:41,520 And even though it doesn't fit on the screen, I could absolutely call malloc. 2023 01:35:41,520 --> 01:35:43,020 I could update a couple of pointers. 2024 01:35:43,020 --> 01:35:46,120 And boom, we've added an eighth node to the tree. 2025 01:35:46,120 --> 01:35:51,720 So if it took me that many steps, starting at the root, 1, 2, 3, 2026 01:35:51,720 --> 01:35:55,660 how do I generalize this into Big O notation? 2027 01:35:55,660 --> 01:36:00,360 Well, a binary search tree, if you lay it out nice and prettily like this, 2028 01:36:00,360 --> 01:36:04,290 nice and balanced if you will, the height of that binary search tree, 2029 01:36:04,290 --> 01:36:06,810 it turns out, is going to be log of n. 2030 01:36:06,810 --> 01:36:11,310 If n is the number of total nodes, 7 or 8 now in the story, then 2031 01:36:11,310 --> 01:36:14,770 log base 2 of n is going to be the height of the tree. 2032 01:36:14,770 --> 01:36:18,060 So if you take n nodes, n numbers, and you kind of 2033 01:36:18,060 --> 01:36:22,770 balance them in this nice sorted way, the total height is going to be log n. 2034 01:36:22,770 --> 01:36:24,690 So what is the running time of insert? 2035 01:36:24,690 --> 01:36:27,660 Well, the running time of insert is equivalent to how many 2036 01:36:27,660 --> 01:36:30,960 steps does it take you to find the location into which 2037 01:36:30,960 --> 01:36:32,010 the new number belongs? 2038 01:36:32,010 --> 01:36:33,480 Well, that's 1, 2, 3. 2039 01:36:33,480 --> 01:36:36,720 And as it turns out, log base 2 of 8 is indeed 3. 2040 01:36:36,720 --> 01:36:39,300 So the math actually works out perfectly in this case. 2041 01:36:39,300 --> 01:36:41,383 Sometimes, there might be a little rounding error. 2042 01:36:41,383 --> 01:36:45,990 But in general it's going to indeed be Big O of log n. 2043 01:36:45,990 --> 01:36:48,420 But what if we get a little sloppy? 2044 01:36:48,420 --> 01:36:50,550 What if we get a little sloppy and we start 2045 01:36:50,550 --> 01:36:56,710 inserting nodes that are giving us a bit of bad luck, if you will? 2046 01:36:56,710 --> 01:37:00,090 So for instance, suppose that I go ahead-- and let 2047 01:37:00,090 --> 01:37:02,430 me do something on the fly here. 2048 01:37:02,430 --> 01:37:09,570 Suppose that I go ahead and insert the number 1, the number 2, 2049 01:37:09,570 --> 01:37:13,590 and the number 3 such that this is what logically happens. 2050 01:37:13,590 --> 01:37:16,230 This adheres to the definition of a binary search tree. 2051 01:37:16,230 --> 01:37:18,210 If this is the roots, it's 1. 2052 01:37:18,210 --> 01:37:21,462 It has no left subtree, and that's not strictly a problem, 2053 01:37:21,462 --> 01:37:24,420 because there's nothing violating the definition of a search tree here. 2054 01:37:24,420 --> 01:37:25,560 There's just nothing there. 2055 01:37:25,560 --> 01:37:26,850 2 is in the right place. 2056 01:37:26,850 --> 01:37:28,690 3 is in the right place. 2057 01:37:28,690 --> 01:37:32,460 So this, too, technically is a binary search tree. 2058 01:37:32,460 --> 01:37:35,890 But it's a bit of a corner case, a perverse case, if you will, 2059 01:37:35,890 --> 01:37:40,410 where the way you inserted things ended up in the binary search tree actually 2060 01:37:40,410 --> 01:37:44,790 resembling more of a what, would you say, if you want to chime in the chat? 2061 01:37:44,790 --> 01:37:46,712 And Brian, if you might want to relay? 2062 01:37:46,712 --> 01:37:49,170 BRIAN: A few people are saying it looks like a linked list. 2063 01:37:49,170 --> 01:37:49,920 DAVID MALAN: Yeah. 2064 01:37:49,920 --> 01:37:52,050 So even though I've drawn it sort of top down, 2065 01:37:52,050 --> 01:37:53,820 so in the sort of second dimension, that's 2066 01:37:53,820 --> 01:37:55,590 really just an artist's rendition. 2067 01:37:55,590 --> 01:37:58,800 This tree is a binary search tree. 2068 01:37:58,800 --> 01:38:00,960 But it's kind of sort of also a linked list. 2069 01:38:00,960 --> 01:38:03,450 And so even the most well-intentioned data 2070 01:38:03,450 --> 01:38:08,490 structures, given unfortunate inputs or some bad luck or some bad design, 2071 01:38:08,490 --> 01:38:12,790 could devolve into a different data structure just by chance. 2072 01:38:12,790 --> 01:38:18,040 So there's a way to solve this though, even with these values. 2073 01:38:18,040 --> 01:38:23,820 When I insert 1, 2, 3, I could allow for this perverse situation 2074 01:38:23,820 --> 01:38:27,060 where it just gets long and stringy, at which point everything is Big O of n. 2075 01:38:27,060 --> 01:38:28,140 It's just a linked list. 2076 01:38:28,140 --> 01:38:30,900 It just happens to be drawn diagonally instead of left right. 2077 01:38:30,900 --> 01:38:33,510 But does anyone see a solution intuitively? 2078 01:38:33,510 --> 01:38:36,480 Not in terms of code, no formal language, 2079 01:38:36,480 --> 01:38:40,500 but there is a solution here to make sure that this tree with 1, 2, 3 2080 01:38:40,500 --> 01:38:43,500 does not get long and stringy in the first place. 2081 01:38:43,500 --> 01:38:46,083 What might you do instead to solve this? 2082 01:38:46,083 --> 01:38:48,000 BRIAN: A few people in the chat are suggesting 2083 01:38:48,000 --> 01:38:51,100 you should make 2 the new root node at the top of the tree. 2084 01:38:51,100 --> 01:38:54,390 DAVID MALAN: So if I instead make 2 the new root node, let me go ahead 2085 01:38:54,390 --> 01:38:57,030 and mock this up real quickly. 2086 01:38:57,030 --> 01:39:00,180 And in a moment, I'll reveal what I think you've just verbalized. 2087 01:39:00,180 --> 01:39:02,808 What if instead, I make sure that when inserting these nodes 2088 01:39:02,808 --> 01:39:05,850 I don't naively just keep going to the right, to the right, to the right? 2089 01:39:05,850 --> 01:39:07,420 I exercise some judgment. 2090 01:39:07,420 --> 01:39:09,930 And if I notice, maybe, that my data structure, my tree 2091 01:39:09,930 --> 01:39:11,820 is getting kind of long and stringy, maybe I 2092 01:39:11,820 --> 01:39:15,600 should kind of rotate it around, using that second dimension for real 2093 01:39:15,600 --> 01:39:17,370 so that I change what the root is. 2094 01:39:17,370 --> 01:39:19,470 And we won't go through the code for doing this. 2095 01:39:19,470 --> 01:39:21,240 But it turns out this is the solution. 2096 01:39:21,240 --> 01:39:22,980 That is exactly the right intuition. 2097 01:39:22,980 --> 01:39:25,320 If you take a higher level class on data structures 2098 01:39:25,320 --> 01:39:27,390 and algorithms specifically in computer science, 2099 01:39:27,390 --> 01:39:30,960 you'll study trees like AVL trees, or red, black trees, 2100 01:39:30,960 --> 01:39:33,510 which are different types of tree data structures. 2101 01:39:33,510 --> 01:39:37,080 They kind of have built into them the algorithms for kind of shifting things 2102 01:39:37,080 --> 01:39:40,530 as needed to make sure that as you insert, or maybe as you delete, 2103 01:39:40,530 --> 01:39:42,840 you constantly rebalance the tree. 2104 01:39:42,840 --> 01:39:46,870 And long story short, doing so might cost you a little extra time. 2105 01:39:46,870 --> 01:39:50,970 But if you've got a lot of data, keeping that thing balanced and logarithmic 2106 01:39:50,970 --> 01:39:54,750 in height, so to speak, and not long and stringy in the linear in height, 2107 01:39:54,750 --> 01:39:58,290 it's probably, depending on your application, going to save you 2108 01:39:58,290 --> 01:40:01,170 quite a bit of time overall. 2109 01:40:01,170 --> 01:40:05,250 So we might say that insert into a balanced binary search tree 2110 01:40:05,250 --> 01:40:07,500 is indeed Big O of log n. 2111 01:40:07,500 --> 01:40:11,100 But that is conditional on you making sure that you keep it balanced. 2112 01:40:11,100 --> 01:40:13,620 And that's going to be more code than we'll go into today, 2113 01:40:13,620 --> 01:40:17,790 but indeed, a possible design decision. 2114 01:40:17,790 --> 01:40:24,040 All right, any questions then on now trees and binary search trees, 2115 01:40:24,040 --> 01:40:25,150 in particular? 2116 01:40:25,150 --> 01:40:27,340 We started with arrays a few weeks ago. 2117 01:40:27,340 --> 01:40:30,940 We've now got linked lists, which are good, better, but not great. 2118 01:40:30,940 --> 01:40:34,900 To trees, which seemed maybe to be great but, again, it's always a trade off. 2119 01:40:34,900 --> 01:40:37,000 They're costing us more space. 2120 01:40:37,000 --> 01:40:39,490 But I bet we can continue to stitch some of these ideas 2121 01:40:39,490 --> 01:40:41,860 together for other structures still. 2122 01:40:41,860 --> 01:40:44,510 Brian, anything outstanding? 2123 01:40:44,510 --> 01:40:45,010 BRIAN: Yeah. 2124 01:40:45,010 --> 01:40:48,070 One question came in as to why it's a problem if you have, 2125 01:40:48,070 --> 01:40:51,278 like, the 1, and the 2, and the 3 all in just one sequence on the right side? 2126 01:40:51,278 --> 01:40:53,028 DAVID MALAN: Yeah, a really good question. 2127 01:40:53,028 --> 01:40:54,070 Why is it a problem? 2128 01:40:54,070 --> 01:40:55,150 Maybe it isn't. 2129 01:40:55,150 --> 01:40:57,820 If you don't have a very large data set and you 2130 01:40:57,820 --> 01:41:00,550 don't have many values in the structure, honestly, who cares? 2131 01:41:00,550 --> 01:41:02,607 If it's three elements, definitely don't care. 2132 01:41:02,607 --> 01:41:05,690 If it's 10 elements, if it's 1,000, heck, if your computer is fast enough, 2133 01:41:05,690 --> 01:41:08,148 there might be a million elements, and it's not a big deal. 2134 01:41:08,148 --> 01:41:10,510 But if it's two million elements or a billion elements, 2135 01:41:10,510 --> 01:41:13,655 then it totally depends, again, on what is the business you're building? 2136 01:41:13,655 --> 01:41:15,280 What is the application you're writing? 2137 01:41:15,280 --> 01:41:16,270 How big is your data? 2138 01:41:16,270 --> 01:41:18,220 How fast or how slow is your computer? 2139 01:41:18,220 --> 01:41:20,187 It might very well matter ultimately. 2140 01:41:20,187 --> 01:41:23,020 And indeed, when we've seen some of our algorithms a couple of weeks 2141 01:41:23,020 --> 01:41:26,470 ago, when we compared bubble sort and selection sort and merge sort, 2142 01:41:26,470 --> 01:41:29,200 even though those were in a different category of running times, 2143 01:41:29,200 --> 01:41:33,640 n log n and n squared, just recall the appreciable difference. 2144 01:41:33,640 --> 01:41:38,500 Log of n in the context of searching is way better than n. 2145 01:41:38,500 --> 01:41:41,710 So if your data structure is devolving into something long and stringy, 2146 01:41:41,710 --> 01:41:46,120 recall that's, like, searching a phone book a thousand total pages. 2147 01:41:46,120 --> 01:41:49,960 But binary search and not letting it get long and stringy 2148 01:41:49,960 --> 01:41:53,170 gives you, like, 10 steps instead of 1,000 2149 01:41:53,170 --> 01:41:55,510 steps in order to search those same pages. 2150 01:41:55,510 --> 01:41:58,900 So, again, even in Week 0 we saw the appreciable difference 2151 01:41:58,900 --> 01:42:04,010 between these different categories of running times. 2152 01:42:04,010 --> 01:42:06,248 All right, well, let's see if we can't maybe 2153 01:42:06,248 --> 01:42:07,790 take some of the best of both worlds. 2154 01:42:07,790 --> 01:42:09,260 Thus far, again, we've seen arrays. 2155 01:42:09,260 --> 01:42:10,370 We've seen linked lists. 2156 01:42:10,370 --> 01:42:11,780 We've seen trees. 2157 01:42:11,780 --> 01:42:15,110 What if we kind of get a little Frankenstein here and mash things 2158 01:42:15,110 --> 01:42:18,528 together and take sort of the best features of these things 2159 01:42:18,528 --> 01:42:19,820 and build up something grander. 2160 01:42:19,820 --> 01:42:22,580 In fact, I feel like the Holy Grail of a data structure 2161 01:42:22,580 --> 01:42:27,140 would be something for which cert and insertion aren't n, Big O of n, 2162 01:42:27,140 --> 01:42:28,340 aren't Big O of log n. 2163 01:42:28,340 --> 01:42:30,590 But wouldn't it be amazing if there's a data structure 2164 01:42:30,590 --> 01:42:33,950 out there where the running time is like constant time, Big O of 1. 2165 01:42:33,950 --> 01:42:35,090 That's the Holy Grail. 2166 01:42:35,090 --> 01:42:37,957 If you can cleverly lay out your computer's memory in such a way 2167 01:42:37,957 --> 01:42:40,790 that if you want to search for or insert a value, boom, you're done. 2168 01:42:40,790 --> 01:42:44,740 Boom, you're done, and none of this linear or logarithmic running time. 2169 01:42:44,740 --> 01:42:46,820 So let's see if we can't pursue that goal. 2170 01:42:46,820 --> 01:42:50,360 Let me propose that we introduce this topic called hash tables. 2171 01:42:50,360 --> 01:42:53,390 Hash table is another data structure that's 2172 01:42:53,390 --> 01:42:56,010 essentially an array of linked lists. 2173 01:42:56,010 --> 01:42:57,980 So, again, it's this Frankenstein monster, 2174 01:42:57,980 --> 01:43:01,200 whereby we've combined arrays ultimately with linked list. 2175 01:43:01,200 --> 01:43:02,540 Let's see how this is done. 2176 01:43:02,540 --> 01:43:05,347 Let me propose that we first start with an array of size 26. 2177 01:43:05,347 --> 01:43:08,180 And I'm going to start drawing my arrays vertically, just because it 2178 01:43:08,180 --> 01:43:09,890 sort of works out better pictorially. 2179 01:43:09,890 --> 01:43:11,840 But, again, these are all artist's renditions, anyway. 2180 01:43:11,840 --> 01:43:15,048 Even though we always draw arrays left to right, that's completely arbitrary. 2181 01:43:15,048 --> 01:43:17,780 So I'm going to start, for now, drawing my array top to bottom. 2182 01:43:17,780 --> 01:43:20,510 And suppose that the data structure I care about now 2183 01:43:20,510 --> 01:43:22,620 is going to be even more interesting the numbers. 2184 01:43:22,620 --> 01:43:26,450 Suppose I want to store things like names, like dictionaries, or names 2185 01:43:26,450 --> 01:43:28,412 like contacts in your phone. 2186 01:43:28,412 --> 01:43:30,620 If you want to keep track of all the people you know, 2187 01:43:30,620 --> 01:43:33,620 it would be great if it doesn't take linear time or logarithmic time 2188 01:43:33,620 --> 01:43:34,640 to find people. 2189 01:43:34,640 --> 01:43:37,980 Instead, constant time would seem to be even better. 2190 01:43:37,980 --> 01:43:40,945 So here's an array, for instance, size 26. 2191 01:43:40,945 --> 01:43:42,320 And I proposed that deliberately. 2192 01:43:42,320 --> 01:43:44,720 In English, there's 26 letters, A through Z. So 2193 01:43:44,720 --> 01:43:48,590 let's consider location 0 is A. Location 25 is Z. 2194 01:43:48,590 --> 01:43:53,270 And if I now go and start inserting all of my friends into my new phone, 2195 01:43:53,270 --> 01:43:56,210 into the contacts application, where might I put them? 2196 01:43:56,210 --> 01:43:57,650 Well, let me go ahead and do this. 2197 01:43:57,650 --> 01:44:01,400 Let me go ahead and think of each of these elements as, again, 0 through 25, 2198 01:44:01,400 --> 01:44:03,170 or really A through Z. 2199 01:44:03,170 --> 01:44:07,670 And let me, upon inserting a new friend or contact into my phone, 2200 01:44:07,670 --> 01:44:12,260 let me put them into a location that has some relationship with the name itself. 2201 01:44:12,260 --> 01:44:15,120 Let's not just start putting them at the very beginning. 2202 01:44:15,120 --> 01:44:17,810 Let's not necessarily put them alphabetically per se. 2203 01:44:17,810 --> 01:44:22,310 Let's actually put them at a specific location in this array, not just top 2204 01:44:22,310 --> 01:44:24,980 to bottom, but at a specific entry. 2205 01:44:24,980 --> 01:44:28,310 So suppose the first person I want to add to my contacts is Albus. 2206 01:44:28,310 --> 01:44:31,820 Well, I'm going to propose that because Albus starts with an A, 2207 01:44:31,820 --> 01:44:37,880 he is going to go into the A location, so the very first entry in this array. 2208 01:44:37,880 --> 01:44:40,160 Suppose I next want to add Zacharias. 2209 01:44:40,160 --> 01:44:43,700 Well, his name starts with Z. So he's going to go in the very last location. 2210 01:44:43,700 --> 01:44:45,350 So, again, I'm jumping around. 2211 01:44:45,350 --> 01:44:46,970 I went from 0 to 25. 2212 01:44:46,970 --> 01:44:50,030 But it's an array, and I can do that in constant time. 2213 01:44:50,030 --> 01:44:53,400 You can randomly index to any element using square brackets. 2214 01:44:53,400 --> 01:44:54,672 So this is both constant time. 2215 01:44:54,672 --> 01:44:56,630 I don't have to just put him right after Albus. 2216 01:44:56,630 --> 01:44:57,980 I can put him wherever I want. 2217 01:44:57,980 --> 01:44:59,930 Suppose the third person is Hermione. 2218 01:44:59,930 --> 01:45:02,360 Well, I'm going to put her at location H. Why? 2219 01:45:02,360 --> 01:45:04,610 Because I can do the math, and I can figure out H. OK, 2220 01:45:04,610 --> 01:45:07,580 I can just jump immediately to that letter of the alphabet. 2221 01:45:07,580 --> 01:45:10,760 And in turn, thanks to ASCII and doing a bit of arithmetic, 2222 01:45:10,760 --> 01:45:13,110 I convert that to a number as well. 2223 01:45:13,110 --> 01:45:16,310 So she ends up at 0, 1, 2, 3, 4, 5, 6, 7, 2224 01:45:16,310 --> 01:45:21,570 because H ends up mapping to the eighth character, or location 7. 2225 01:45:21,570 --> 01:45:22,820 All right. 2226 01:45:22,820 --> 01:45:23,630 Who else? 2227 01:45:23,630 --> 01:45:25,950 All these other people end up in my address book. 2228 01:45:25,950 --> 01:45:27,210 And so they're all spread out. 2229 01:45:27,210 --> 01:45:28,670 I don't have as many as 26 friends. 2230 01:45:28,670 --> 01:45:30,360 So there's some gaps in the data there. 2231 01:45:30,360 --> 01:45:32,180 But I fit everyone here. 2232 01:45:32,180 --> 01:45:33,750 But there might be a problem. 2233 01:45:33,750 --> 01:45:35,480 And you can perhaps see this coming. 2234 01:45:35,480 --> 01:45:37,010 Thus far, I've kind of gotten lucky. 2235 01:45:37,010 --> 01:45:40,790 And I've only know people whose names are-- uniquely start with a letter. 2236 01:45:40,790 --> 01:45:45,710 But as soon as I meet someone at school and I add them to my contacts, 2237 01:45:45,710 --> 01:45:49,560 well, now Harry, for instance, has to go in the same location. 2238 01:45:49,560 --> 01:45:52,850 Now this is a problem if I want to store both Hermione and Harry, 2239 01:45:52,850 --> 01:45:54,680 because their names both start with H. 2240 01:45:54,680 --> 01:45:57,770 But, again if it's an array, it's absolutely a deal breaker. 2241 01:45:57,770 --> 01:45:59,840 At that point, all things break down. 2242 01:45:59,840 --> 01:46:01,880 Because I could, yes, grow the array. 2243 01:46:01,880 --> 01:46:04,880 But if I grow the array, then it's size 27. 2244 01:46:04,880 --> 01:46:08,120 And then it's, like, how do I know what number is what letter at that point? 2245 01:46:08,120 --> 01:46:09,950 It just devolves into a complete mess. 2246 01:46:09,950 --> 01:46:12,530 But if I borrow the idea of a linked list, what if I 2247 01:46:12,530 --> 01:46:15,860 make my array an array of linked lists? 2248 01:46:15,860 --> 01:46:17,900 So yes, even though there's this collision 2249 01:46:17,900 --> 01:46:22,400 where both Hermione and Harry belong at the same location in the array, 2250 01:46:22,400 --> 01:46:23,180 that's fine. 2251 01:46:23,180 --> 01:46:26,570 In the event this happens, I'm just going to kind of stitch them together 2252 01:46:26,570 --> 01:46:28,830 into a linked list from left to right. 2253 01:46:28,830 --> 01:46:29,690 So it's not ideal. 2254 01:46:29,690 --> 01:46:32,570 Because now it takes me two steps to get to Harry instead of 1, 2255 01:46:32,570 --> 01:46:35,060 using simple arithmetic and square bracket notation. 2256 01:46:35,060 --> 01:46:37,880 But heck, at least I can still fit him in my address book. 2257 01:46:37,880 --> 01:46:40,220 So a bit of a trade off, but feels reasonable. 2258 01:46:40,220 --> 01:46:41,630 Well, someone else, Hagrid-- 2259 01:46:41,630 --> 01:46:43,970 all right, it's not ideal that now it takes me 2260 01:46:43,970 --> 01:46:46,370 three steps to get to Hagrid in my address book. 2261 01:46:46,370 --> 01:46:50,280 But three is way better than not having him in there at all. 2262 01:46:50,280 --> 01:46:53,000 So, again, we see a manifestation of a trade off here, too. 2263 01:46:53,000 --> 01:46:54,710 But we've solved the problem. 2264 01:46:54,710 --> 01:46:57,680 And a hash table is, indeed, exactly this data structure. 2265 01:46:57,680 --> 01:47:02,000 It is an array of linked lists, at least it can be implemented as such. 2266 01:47:02,000 --> 01:47:05,708 And it is predicated on introducing the notion of a hash function. 2267 01:47:05,708 --> 01:47:08,500 This is actually something we'll see in other contexts before long. 2268 01:47:08,500 --> 01:47:12,760 But a hash function is going to allow us to map not only Hermione, Harry, 2269 01:47:12,760 --> 01:47:15,880 and Hagrid, but also Ron and Remus, Severus and Sirius 2270 01:47:15,880 --> 01:47:19,180 to their respective locations deterministically. 2271 01:47:19,180 --> 01:47:21,910 That is, there's no randomness involved here. 2272 01:47:21,910 --> 01:47:24,100 Every time I look at these people's names, 2273 01:47:24,100 --> 01:47:27,760 I'm going to figure out the location at which they belong 2274 01:47:27,760 --> 01:47:30,170 and that location is never going to change. 2275 01:47:30,170 --> 01:47:31,190 So how do I do this? 2276 01:47:31,190 --> 01:47:34,212 Well, it turns out we can think back to problem solving itself 2277 01:47:34,212 --> 01:47:35,170 and what functions are. 2278 01:47:35,170 --> 01:47:37,003 So this is problem solving as we defined it. 2279 01:47:37,003 --> 01:47:41,140 This is also a function in any language that takes inputs and outputs. 2280 01:47:41,140 --> 01:47:43,720 A hash function is going to be sort of a secret sauce 2281 01:47:43,720 --> 01:47:45,600 inside of this black box for now. 2282 01:47:45,600 --> 01:47:47,050 And so what is a hash function? 2283 01:47:47,050 --> 01:47:50,620 Well, a hash function is literally a function, either mathematically 2284 01:47:50,620 --> 01:47:55,150 or in programming, that takes as input some string, in this case, 2285 01:47:55,150 --> 01:47:58,430 like Hermione or Harry, and it returns some output. 2286 01:47:58,430 --> 01:48:01,480 And the output of a hash function is usually a number. 2287 01:48:01,480 --> 01:48:06,170 In this case, the number I want is going to be between 0 and 25. 2288 01:48:06,170 --> 01:48:09,040 So in order to implement this notion of a hash table, 2289 01:48:09,040 --> 01:48:11,740 not just pictorially on the screen, but in actual code, 2290 01:48:11,740 --> 01:48:15,130 I'm literally going to have to write a function that takes a string, 2291 01:48:15,130 --> 01:48:20,800 or if you will, char*, as input and returns an int between 0 and 25 so that 2292 01:48:20,800 --> 01:48:27,380 I know how to convert Hermione or Harry or Hagrid to the number 7 in this case. 2293 01:48:27,380 --> 01:48:29,080 So what does this hash function do? 2294 01:48:29,080 --> 01:48:32,050 It takes as input something like Albus and it outputs 0. 2295 01:48:32,050 --> 01:48:35,110 It takes someone like Zacharias, and it outputs 25. 2296 01:48:35,110 --> 01:48:37,220 And you can probably see the pattern here. 2297 01:48:37,220 --> 01:48:40,120 The code I would write in order to implement something like this is 2298 01:48:40,120 --> 01:48:43,120 probably going to look at the user's input, that char*. 2299 01:48:43,120 --> 01:48:45,880 And it's going to look at the first character, which 2300 01:48:45,880 --> 01:48:48,763 is A or Z, respectively, for these two. 2301 01:48:48,763 --> 01:48:50,680 And it's then going to do a little bit of math 2302 01:48:50,680 --> 01:48:53,660 and subtract off, like, 65 or whatnot. 2303 01:48:53,660 --> 01:48:57,040 And it's going to get me a number between 0 and 25, 2304 01:48:57,040 --> 01:49:01,910 just like with Caesar or some of our past manipulations of strings. 2305 01:49:01,910 --> 01:49:06,790 So from here, we can now take this building block, though, and perhaps 2306 01:49:06,790 --> 01:49:09,010 solve our problems a little more effectively. 2307 01:49:09,010 --> 01:49:11,710 Like, I don't love the fact that even though, yes, I've 2308 01:49:11,710 --> 01:49:15,010 made room for Harry, and Hermione, and Hagrid, 2309 01:49:15,010 --> 01:49:19,510 and now Luna, and Lily, and Lucius, and Lavender, some of these linked lists 2310 01:49:19,510 --> 01:49:20,620 are getting a little long. 2311 01:49:20,620 --> 01:49:22,090 And there's another term you can use here. 2312 01:49:22,090 --> 01:49:23,882 These are kind of like chains, if you will, 2313 01:49:23,882 --> 01:49:28,390 because they look like chain link fences or little links in a chain. 2314 01:49:28,390 --> 01:49:29,860 These are chains or linked lists. 2315 01:49:29,860 --> 01:49:31,610 But some of them are starting to get long. 2316 01:49:31,610 --> 01:49:35,620 And it's a little stupid that I'm trying to achieve constant time, Big O of 1, 2317 01:49:35,620 --> 01:49:39,580 but technically, even though some of the names literally take one step, 2318 01:49:39,580 --> 01:49:42,400 some of them are taking two or three or four steps. 2319 01:49:42,400 --> 01:49:44,080 So it's starting to devolve. 2320 01:49:44,080 --> 01:49:45,762 So what would be an optimization here? 2321 01:49:45,762 --> 01:49:48,220 If you start to get uncomfortable because you're so popular 2322 01:49:48,220 --> 01:49:50,890 and you've got so many names in your contacts that looking up 2323 01:49:50,890 --> 01:49:56,790 the H's, looking up the L's is taking more time than the others, 2324 01:49:56,790 --> 01:50:01,200 what could we do to improve the situation and still use a hash table, 2325 01:50:01,200 --> 01:50:03,300 still use a hash function? 2326 01:50:03,300 --> 01:50:05,790 But what would maybe the logical solution 2327 01:50:05,790 --> 01:50:08,880 be when you have too many collisions, you 2328 01:50:08,880 --> 01:50:12,150 have too many names colliding with one another? 2329 01:50:12,150 --> 01:50:15,570 How could we improve our performance and get at locations, again, 2330 01:50:15,570 --> 01:50:18,630 closer to one step, not two, not three, not four? 2331 01:50:18,630 --> 01:50:23,327 So literally one step, because that's our Holy Grail here. 2332 01:50:23,327 --> 01:50:26,160 BRIAN: A few people have suggested you should look at more than just 2333 01:50:26,160 --> 01:50:26,880 the first letter. 2334 01:50:26,880 --> 01:50:28,710 Like, you could look at the second letter, for example. 2335 01:50:28,710 --> 01:50:29,160 DAVID MALAN: Yeah. 2336 01:50:29,160 --> 01:50:29,660 Nice. 2337 01:50:29,660 --> 01:50:32,220 So if looking at one letter of the person's name 2338 01:50:32,220 --> 01:50:33,420 is obviously insufficient. 2339 01:50:33,420 --> 01:50:37,530 Because a whole bunch of us have names that start with H or A 2340 01:50:37,530 --> 01:50:40,680 or Z or the like, well, why don't we look at two letters 2341 01:50:40,680 --> 01:50:42,930 and therefore decrease the probability that we're 2342 01:50:42,930 --> 01:50:44,320 going to have these collisions? 2343 01:50:44,320 --> 01:50:46,140 So let me go ahead and restructure this. 2344 01:50:46,140 --> 01:50:48,840 And focusing on the Hermione, Harry, and Hagrid problem, 2345 01:50:48,840 --> 01:50:52,830 why don't we go ahead and take our array in the hash table, the vertical thing 2346 01:50:52,830 --> 01:50:56,567 here, and let's think of it as maybe not just being H at that location. 2347 01:50:56,567 --> 01:50:58,650 But what if we think of that location specifically 2348 01:50:58,650 --> 01:51:05,010 as being HA, and then HB, HC, HD, HE, HF, all the way down to HZ. 2349 01:51:05,010 --> 01:51:09,390 And then IA, IB, IC, and so forth, so we now 2350 01:51:09,390 --> 01:51:16,060 enumerate all possible pairs of letters from AA to ZZ? 2351 01:51:16,060 --> 01:51:18,060 But this would seem to spread things out, right? 2352 01:51:18,060 --> 01:51:21,850 Because now Hermione goes in the HE location in the array. 2353 01:51:21,850 --> 01:51:23,340 Now Harry goes in the HA. 2354 01:51:23,340 --> 01:51:24,870 And now Hag-- oh, dammit. 2355 01:51:24,870 --> 01:51:28,120 Like, Hagrid still goes in the same location. 2356 01:51:28,120 --> 01:51:30,060 So what would maybe be a better fix? 2357 01:51:30,060 --> 01:51:33,640 Again, this isn't horrible, like two steps is not a big deal, 2358 01:51:33,640 --> 01:51:34,890 especially on fast computers. 2359 01:51:34,890 --> 01:51:37,153 But, again, with large enough data sets, and if we're 2360 01:51:37,153 --> 01:51:39,570 no longer talking about people in your contacts, but maybe 2361 01:51:39,570 --> 01:51:43,773 all the people in the world who have Google accounts, or Twitter accounts, 2362 01:51:43,773 --> 01:51:46,440 and the like, where you want to search this information quickly, 2363 01:51:46,440 --> 01:51:49,410 you're going to have a lot of people whose names start with H and A 2364 01:51:49,410 --> 01:51:50,760 and Z and everything else. 2365 01:51:50,760 --> 01:51:52,877 It would be nice to spread them out further. 2366 01:51:52,877 --> 01:51:53,710 So what could we do? 2367 01:51:53,710 --> 01:51:55,740 Well, instead of using the first two letters, 2368 01:51:55,740 --> 01:51:59,590 frankly I think the logical extension of this is to use the first three letters. 2369 01:51:59,590 --> 01:52:01,740 So maybe this is the HAA bucket. 2370 01:52:01,740 --> 01:52:07,560 This is the HAB, HAC, HAD, HAE, dot, dot, dot, all the way down to HZZ, 2371 01:52:07,560 --> 01:52:08,630 and then IAA. 2372 01:52:08,630 --> 01:52:13,320 But now when we hash our three friends, Hermione 2373 01:52:13,320 --> 01:52:17,400 goes in the HER bucket so to speak, the elements of the array. 2374 01:52:17,400 --> 01:52:19,530 Harry, HAR, goes in that bucket. 2375 01:52:19,530 --> 01:52:24,360 And Hagrid now gets his own bucket, HAG, as would everyone else. 2376 01:52:24,360 --> 01:52:26,622 So it seems to have solved this specific problem. 2377 01:52:26,622 --> 01:52:29,580 You could still imagine, and I have to think harder and probably Google 2378 01:52:29,580 --> 01:52:33,402 to see if there's other Harry Potter names that start with HAG or HAR 2379 01:52:33,402 --> 01:52:35,880 or HER to find another collision. 2380 01:52:35,880 --> 01:52:39,420 Because you could imagine using four letters instead. 2381 01:52:39,420 --> 01:52:41,460 But what price are we paying? 2382 01:52:41,460 --> 01:52:44,190 Like, I'm solving this problem again and again. 2383 01:52:44,190 --> 01:52:48,030 And I'm getting myself literally faster look up time, 2384 01:52:48,030 --> 01:52:49,590 because it's giving me one step. 2385 01:52:49,590 --> 01:52:51,690 I can mathematically figure out by just doing 2386 01:52:51,690 --> 01:52:55,705 a bit of ASCII math, what the number of the index 2387 01:52:55,705 --> 01:52:58,080 is that I should jump to in this bigger and bigger array. 2388 01:52:58,080 --> 01:53:00,995 But what price am I paying? 2389 01:53:00,995 --> 01:53:03,470 Brian, any thoughts you'd like to relay? 2390 01:53:03,470 --> 01:53:03,970 BRIAN: Yeah. 2391 01:53:03,970 --> 01:53:06,180 A few people are saying it's going to take a lot of memory. 2392 01:53:06,180 --> 01:53:06,930 DAVID MALAN: Yeah. 2393 01:53:06,930 --> 01:53:09,690 My God, like, this is taking a huge amount of memory now. 2394 01:53:09,690 --> 01:53:11,660 Previously, how much memory did it take? 2395 01:53:11,660 --> 01:53:15,420 Well, let me pull up a little calculator here and do some quick math. 2396 01:53:15,420 --> 01:53:21,327 So if we had originally 26 buckets, so to speak, elements in the array, that, 2397 01:53:21,327 --> 01:53:22,410 of course, isn't that bad. 2398 01:53:22,410 --> 01:53:24,270 That feels pretty reasonable, 26 slots. 2399 01:53:24,270 --> 01:53:26,280 But the downside was that the chains might 2400 01:53:26,280 --> 01:53:28,990 get kind of long, three names, four names, maybe even more. 2401 01:53:28,990 --> 01:53:35,100 But if we have AA through ZZ, instead of A through Z, that's 26 times 26, 2402 01:53:35,100 --> 01:53:36,845 that's 676 buckets. 2403 01:53:36,845 --> 01:53:39,720 Doesn't sound like a huge deal, though that's bigger than most things 2404 01:53:39,720 --> 01:53:41,950 we've done in memory thus far, not a huge deal. 2405 01:53:41,950 --> 01:53:46,440 But if we have three, that's 26 possibilities times 26 times 26, 2406 01:53:46,440 --> 01:53:48,600 for AAA through ZZZ. 2407 01:53:48,600 --> 01:53:54,540 Now we have 17,576 buckets in my array. 2408 01:53:54,540 --> 01:53:57,060 And the problem isn't so much that we're using that memory. 2409 01:53:57,060 --> 01:53:58,840 Because honestly, if you need the memory, use it. 2410 01:53:58,840 --> 01:53:59,460 That's fine. 2411 01:53:59,460 --> 01:54:01,050 Just throw hardware at the problem. 2412 01:54:01,050 --> 01:54:02,940 Buy and upgrade more memory. 2413 01:54:02,940 --> 01:54:05,610 But the problem is that I probably don't know 2414 01:54:05,610 --> 01:54:11,310 that many people whose names start with HAA or AZZ 2415 01:54:11,310 --> 01:54:14,700 or any number of these combinations of letters of the alphabet. 2416 01:54:14,700 --> 01:54:16,765 A lot of those buckets are going to be empty. 2417 01:54:16,765 --> 01:54:18,390 But it doesn't matter if they're empty. 2418 01:54:18,390 --> 01:54:22,870 If you want an array and you want random access, they have to be present 2419 01:54:22,870 --> 01:54:25,980 so that your arithmetic works out, per Week 2, 2420 01:54:25,980 --> 01:54:27,750 where you just use square bracket notation 2421 01:54:27,750 --> 01:54:32,160 and jump to any of the locations in memory that you care about. 2422 01:54:32,160 --> 01:54:36,060 So finding that trade off, or finding the inflection point with those trade 2423 01:54:36,060 --> 01:54:39,330 offs, is kind of an art and/or a science, figuring out 2424 01:54:39,330 --> 01:54:42,330 for your particular data, your particular application, which 2425 01:54:42,330 --> 01:54:46,990 is more important, time or space or some happy medium in between the two. 2426 01:54:46,990 --> 01:54:49,110 And with Problem Set 5, as you'll see, you'll 2427 01:54:49,110 --> 01:54:51,270 actually have to figure out this balance in part 2428 01:54:51,270 --> 01:54:54,300 by trying to minimize, ultimately, your own use of memory 2429 01:54:54,300 --> 01:54:56,792 and your own use of computers' time. 2430 01:54:56,792 --> 01:54:58,500 But let me point something out, actually. 2431 01:54:58,500 --> 01:55:00,540 This notion of hash table, which up until now, 2432 01:55:00,540 --> 01:55:03,457 definitely the most sophisticated data structure that we've looked at, 2433 01:55:03,457 --> 01:55:06,380 it's kind of familiar to you in some way already. 2434 01:55:06,380 --> 01:55:09,130 These are probably larger than the playing cards you have at home. 2435 01:55:09,130 --> 01:55:11,088 But if you've ever played with a deck of cards, 2436 01:55:11,088 --> 01:55:14,400 and the cards start out randomly, odds are you've, at some point, 2437 01:55:14,400 --> 01:55:16,440 needed to sort them for one game or another. 2438 01:55:16,440 --> 01:55:18,330 Sometimes you need to shuffle them entirely. 2439 01:55:18,330 --> 01:55:20,538 If you want to be a little neat, you might sort them, 2440 01:55:20,538 --> 01:55:22,440 not just by number, but also by suit. 2441 01:55:22,440 --> 01:55:27,820 So hearts and spades and clubs and diamonds into separate categories. 2442 01:55:27,820 --> 01:55:31,050 So honestly, I have this literally here just for the sake of the metaphor. 2443 01:55:31,050 --> 01:55:32,550 We have four buckets here. 2444 01:55:32,550 --> 01:55:36,430 And we've gone ahead and labeled them in advance with spade there. 2445 01:55:36,430 --> 01:55:37,740 So that's one bucket. 2446 01:55:37,740 --> 01:55:41,880 Here we have diamond shape here. 2447 01:55:41,880 --> 01:55:43,920 And here we have-- 2448 01:55:43,920 --> 01:55:45,510 [GRUNTING] 2449 01:55:45,510 --> 01:55:50,230 --here we have hearts here and then clubs here. 2450 01:55:50,230 --> 01:55:52,375 So if you've ever sorted a deck of cards, 2451 01:55:52,375 --> 01:55:54,750 odds are you haven't really thought about this very hard. 2452 01:55:54,750 --> 01:55:55,860 Because it's not that interesting. 2453 01:55:55,860 --> 01:55:58,443 You probably mindlessly start laying them out and sorting them 2454 01:55:58,443 --> 01:56:01,140 by suits and then maybe by number. 2455 01:56:01,140 --> 01:56:04,798 But if you've done that, you have hashed values before. 2456 01:56:04,798 --> 01:56:06,840 If you take a look at the first card and you see, 2457 01:56:06,840 --> 01:56:09,540 that, oh, it's the ace of diamonds. 2458 01:56:09,540 --> 01:56:12,930 You know, yes, you might care ultimately that it's a diamond, that it's an ace. 2459 01:56:12,930 --> 01:56:18,240 But for now, I'm just going to put it, for instance, into the diamond bucket. 2460 01:56:18,240 --> 01:56:19,623 Here's the two of diamonds here. 2461 01:56:19,623 --> 01:56:21,540 I'm going to put that into the diamond bucket. 2462 01:56:21,540 --> 01:56:23,650 Here's the ace of clubs. 2463 01:56:23,650 --> 01:56:25,380 So I'm going to put that over here. 2464 01:56:25,380 --> 01:56:29,970 And you can just progressively hash one card after the other. 2465 01:56:29,970 --> 01:56:34,260 And, indeed, hashing really just means to look at some input and produce, 2466 01:56:34,260 --> 01:56:38,070 in this case, some numeric output that outputs the like bucket 0, 1, 2, 2467 01:56:38,070 --> 01:56:41,010 or 3 based on some characteristic of that input, 2468 01:56:41,010 --> 01:56:43,950 whether it's actually the suit on the card like I'm doing here, 2469 01:56:43,950 --> 01:56:46,950 or maybe it's based on the letter of the alphabet here. 2470 01:56:46,950 --> 01:56:48,010 And why am I doing this? 2471 01:56:48,010 --> 01:56:48,510 Right? 2472 01:56:48,510 --> 01:56:49,410 I'm not going to do the whole thing. 2473 01:56:49,410 --> 01:56:52,530 Because 52 steps is going to take a while and get boring quickly, if not 2474 01:56:52,530 --> 01:56:53,520 already. 2475 01:56:53,520 --> 01:56:54,750 But why am I doing this? 2476 01:56:54,750 --> 01:56:56,190 Because odds are you've probably done this. 2477 01:56:56,190 --> 01:56:59,370 Not with the drama of actual buckets, you've probably just kind of laid them 2478 01:56:59,370 --> 01:57:00,203 out in front of you. 2479 01:57:00,203 --> 01:57:05,560 But why have you done that, if that's indeed something you have done? 2480 01:57:05,560 --> 01:57:06,060 Yeah. 2481 01:57:06,060 --> 01:57:08,340 Over to Sophia? 2482 01:57:08,340 --> 01:57:11,520 AUDIENCE: There's a possibility that we could actually get to things faster, 2483 01:57:11,520 --> 01:57:12,990 like, if we know what bucket it is. 2484 01:57:12,990 --> 01:57:16,530 We might be able to even search things for, like, 0, 1, or less. 2485 01:57:16,530 --> 01:57:17,760 DAVID MALAN: Yeah. 2486 01:57:17,760 --> 01:57:18,150 AUDIENCE: Something like that. 2487 01:57:18,150 --> 01:57:18,450 DAVID MALAN: Yeah. 2488 01:57:18,450 --> 01:57:20,325 You start to gain these optimizations, right? 2489 01:57:20,325 --> 01:57:24,360 At least, as a human, honestly, I can process four smaller problems 2490 01:57:24,360 --> 01:57:27,540 just much easier than one bigger problem, that's size 52. 2491 01:57:27,540 --> 01:57:32,238 I can solve four 13 card problems a little faster, especially 2492 01:57:32,238 --> 01:57:33,780 if I'm looking for a particular card. 2493 01:57:33,780 --> 01:57:36,370 Now I can find it among 13 cards instead of 52. 2494 01:57:36,370 --> 01:57:38,310 So there's just kind of an optimization here. 2495 01:57:38,310 --> 01:57:41,790 So you might take as input these cards, hash them into a particular bucket, 2496 01:57:41,790 --> 01:57:43,890 and then proceed to solve the smaller problem. 2497 01:57:43,890 --> 01:57:46,500 Now that's not what a hash table itself is all about. 2498 01:57:46,500 --> 01:57:50,580 A hash table is about storing information, but storing information 2499 01:57:50,580 --> 01:57:52,330 so as to get to it more quickly. 2500 01:57:52,330 --> 01:58:00,390 So to Sophia's point, if indeed she just wants to find like the ace of diamonds, 2501 01:58:00,390 --> 01:58:03,900 she now only has to look through a 13 sized problem, a linked 2502 01:58:03,900 --> 01:58:09,480 list of size 13, if you will, instead of an array or a linked list of size 52. 2503 01:58:09,480 --> 01:58:12,870 So a hash table allows you to bucketize your inputs, if you will, 2504 01:58:12,870 --> 01:58:15,480 colloquially, and get access to data more quickly. 2505 01:58:15,480 --> 01:58:20,262 Not necessarily in time one, in one step, it might be two. 2506 01:58:20,262 --> 01:58:20,970 It might be four. 2507 01:58:20,970 --> 01:58:22,560 It might be 13 steps. 2508 01:58:22,560 --> 01:58:25,530 But it's generally fewer steps than if you 2509 01:58:25,530 --> 01:58:29,970 were doing something purely linearly, or even logarithmically. 2510 01:58:29,970 --> 01:58:32,850 Ideally, you're trying to pick your hash function in such a way 2511 01:58:32,850 --> 01:58:38,790 that you minimize the number of elements that collide by using not A through Z, 2512 01:58:38,790 --> 01:58:41,290 but AA through ZZ and so forth. 2513 01:58:41,290 --> 01:58:43,440 So let me go ahead here and ask a question. 2514 01:58:43,440 --> 01:58:45,720 What, then, is the running time when it comes 2515 01:58:45,720 --> 01:58:48,180 to this data structure of a hash table? 2516 01:58:48,180 --> 01:58:52,860 If you want to go ahead and search in a hash table, once all of the data 2517 01:58:52,860 --> 01:58:55,560 is in there, once all of my contacts are there, 2518 01:58:55,560 --> 01:59:00,240 how many steps does your phone have to take, given n contacts in your phone, 2519 01:59:00,240 --> 01:59:06,440 to find Hermione or Hagrid or anyone else? 2520 01:59:06,440 --> 01:59:10,570 So I see, again, 80% of you are saying constant time, Big O of 1. 2521 01:59:10,570 --> 01:59:13,480 And, again, constant time might mean one step, two steps, four steps, 2522 01:59:13,480 --> 01:59:16,390 but some fixed number, not dependent on n. 2523 01:59:16,390 --> 01:59:19,900 18% of you or so are saying linear time. 2524 01:59:19,900 --> 01:59:22,570 And I have to admit, the 20% of you or so that 2525 01:59:22,570 --> 01:59:29,060 said linear time are technically, asymptotically, mathematically correct. 2526 01:59:29,060 --> 01:59:33,010 And here we begin to see sort of a distinction between the real world 2527 01:59:33,010 --> 01:59:33,820 and academia. 2528 01:59:33,820 --> 01:59:40,510 So the academic here, or rather the real world here, the real world programmer 2529 01:59:40,510 --> 01:59:44,590 would say, just like Sophia did, obviously, a bucket with 13 cards in it 2530 01:59:44,590 --> 01:59:48,227 is strictly better than one bigger bucket with 52 cards. 2531 01:59:48,227 --> 01:59:49,060 That is just faster. 2532 01:59:49,060 --> 01:59:52,480 It's literally four times as fast to find or to flip 2533 01:59:52,480 --> 01:59:54,820 through those 13 cards instead of 52. 2534 01:59:54,820 --> 01:59:56,770 That is objectively faster. 2535 01:59:56,770 --> 02:00:00,670 But the academic would say yes, but asymptotically-- and asymptotically 2536 02:00:00,670 --> 02:00:03,310 is just a fancy way of saying as n gets really large, 2537 02:00:03,310 --> 02:00:07,510 the sort of wave of the hand that I keep describing, asymptotically taking 2538 02:00:07,510 --> 02:00:11,170 13 steps is technically big O of n. 2539 02:00:11,170 --> 02:00:11,800 Why? 2540 02:00:11,800 --> 02:00:15,520 Well, in the case of the cards here, it's technically n divided by 4. 2541 02:00:15,520 --> 02:00:16,360 Yes, it's 13. 2542 02:00:16,360 --> 02:00:19,630 But if there's n cards total, technically, the size of this bucket 2543 02:00:19,630 --> 02:00:21,400 is going to end up being m divided by 4. 2544 02:00:21,400 --> 02:00:24,280 And what did we talk about when we talked about Big O and omega? 2545 02:00:24,280 --> 02:00:26,450 Well, you throw away the lower order terms. 2546 02:00:26,450 --> 02:00:28,810 You get rid of the constants, like the divide by 4, 2547 02:00:28,810 --> 02:00:31,613 or the plus something else. 2548 02:00:31,613 --> 02:00:32,530 So we get rid of that. 2549 02:00:32,530 --> 02:00:35,860 And it's technically a hash table searching. 2550 02:00:35,860 --> 02:00:38,050 It is still in Big O of n. 2551 02:00:38,050 --> 02:00:41,260 But here, again, we see a contrast between the real world 2552 02:00:41,260 --> 02:00:42,640 and the theoretical world. 2553 02:00:42,640 --> 02:00:44,920 Like, yes, if you want to get into an academic debate, 2554 02:00:44,920 --> 02:00:46,920 yes, it's still technically the same as a linked 2555 02:00:46,920 --> 02:00:49,630 list or an array, at which point you might as well just search 2556 02:00:49,630 --> 02:00:52,672 the thing left to right linearly, whether it's an array or a linked list. 2557 02:00:52,672 --> 02:00:53,290 But come on. 2558 02:00:53,290 --> 02:00:55,780 Like, if you actually hash these values in advance, 2559 02:00:55,780 --> 02:01:00,370 and spread them out into 4, or 26, or 576 buckets, 2560 02:01:00,370 --> 02:01:04,353 that is actually going to be faster when it comes to wall clock time. 2561 02:01:04,353 --> 02:01:06,520 So when you literally look at the clock on the wall, 2562 02:01:06,520 --> 02:01:09,610 less time will pass taking Sophia's approach than taking 2563 02:01:09,610 --> 02:01:12,020 an array or linked list approach. 2564 02:01:12,020 --> 02:01:16,030 So here, those of you who said Big O of n are correct. 2565 02:01:16,030 --> 02:01:18,280 But when it comes to the real world programming, 2566 02:01:18,280 --> 02:01:21,760 honestly, if it's faster than actual n steps, 2567 02:01:21,760 --> 02:01:23,950 that may very well be a net positive. 2568 02:01:23,950 --> 02:01:28,430 And so perhaps we should be focusing more on practice and less, sometimes, 2569 02:01:28,430 --> 02:01:29,680 on the theory of these things. 2570 02:01:29,680 --> 02:01:31,513 And indeed that's going to be the challenge. 2571 02:01:31,513 --> 02:01:33,520 The Problem Set 5 to which I keep alluding 2572 02:01:33,520 --> 02:01:35,860 is going to challenge you to implement one 2573 02:01:35,860 --> 02:01:40,630 of these data structures, a hash table, with 100,000 plus English words. 2574 02:01:40,630 --> 02:01:43,780 We're going to, in a nutshell, give you a big text file containing 2575 02:01:43,780 --> 02:01:45,260 one English word per line. 2576 02:01:45,260 --> 02:01:50,140 And among your goals is going to be to load all of those 140,000 plus words 2577 02:01:50,140 --> 02:01:53,380 into your computer's memory using a hash table. 2578 02:01:53,380 --> 02:01:56,500 Now if you are simplistic about it, and you 2579 02:01:56,500 --> 02:01:59,712 use a hash table with 26 buckets, A through Z, 2580 02:01:59,712 --> 02:02:01,420 you're going to have a lot of collisions. 2581 02:02:01,420 --> 02:02:04,510 If there's 140,000 plus English words, there's a lot of words in there 2582 02:02:04,510 --> 02:02:08,080 that start with A, or B, or Z, or anything in between. 2583 02:02:08,080 --> 02:02:11,560 If you maybe then go with AA through ZZ, maybe that's better, 2584 02:02:11,560 --> 02:02:13,562 or AAA through ZZZ, maybe that's better. 2585 02:02:13,562 --> 02:02:15,520 But at some point, you're going to start to use 2586 02:02:15,520 --> 02:02:17,630 too much memory for your own good. 2587 02:02:17,630 --> 02:02:20,260 And one of the challenges, optionally, of Problem Set 5, 2588 02:02:20,260 --> 02:02:23,148 is going to be to playfully challenge your classmates whereby, 2589 02:02:23,148 --> 02:02:25,690 if you opt into this, you can run a command that will put you 2590 02:02:25,690 --> 02:02:29,650 on the big board which will show on the course's website exactly how much 2591 02:02:29,650 --> 02:02:33,760 or how little RAM or memory you're using, and how little 2592 02:02:33,760 --> 02:02:36,740 or how much time your code is taking to run. 2593 02:02:36,740 --> 02:02:41,000 And so we just put aside the sort of academic waves of the hand saying, 2594 02:02:41,000 --> 02:02:43,420 well, yes, all of your code is Big O of n. 2595 02:02:43,420 --> 02:02:49,450 But n divided by 4, Sophia's approach, is way better in practice than n 2596 02:02:49,450 --> 02:02:50,020 itself. 2597 02:02:50,020 --> 02:02:51,910 And we'll begin to tease apart the dichotomy 2598 02:02:51,910 --> 02:02:54,310 between theory here and practice. 2599 02:02:54,310 --> 02:02:57,040 But these aren't the only ways to lay things out in memory. 2600 02:02:57,040 --> 02:03:00,730 And we wanted to show you just a few other ideas that come out now 2601 02:03:00,730 --> 02:03:02,560 that we have all of these building blocks. 2602 02:03:02,560 --> 02:03:06,400 One of which is the data structure that we're going to call a trie. 2603 02:03:06,400 --> 02:03:08,410 Trie is actually short for the word retrieval, 2604 02:03:08,410 --> 02:03:10,368 even though it's not quite pronounced the same. 2605 02:03:10,368 --> 02:03:11,770 It's also known as a prefix tree. 2606 02:03:11,770 --> 02:03:14,170 And it's a different type of tree that is typically 2607 02:03:14,170 --> 02:03:17,470 used to store words, or other more sophisticated pieces of data 2608 02:03:17,470 --> 02:03:19,630 instead of just numbers alone. 2609 02:03:19,630 --> 02:03:24,497 So a trie is actually a tree made up of arrays. 2610 02:03:24,497 --> 02:03:26,080 So you can kind of see a pattern here. 2611 02:03:26,080 --> 02:03:29,290 A hash table was an array of linked lists. 2612 02:03:29,290 --> 02:03:33,592 A trie is a tree, each of whose nodes is an array. 2613 02:03:33,592 --> 02:03:36,550 So at some point, computer scientists started getting a little creative 2614 02:03:36,550 --> 02:03:38,620 and started just, like, literally smashing together 2615 02:03:38,620 --> 02:03:41,620 different data structures to see what they could come up with, it seems. 2616 02:03:41,620 --> 02:03:44,230 And so a trie begins to look like this. 2617 02:03:44,230 --> 02:03:48,130 But it has this amazing property that's better than anything in theory 2618 02:03:48,130 --> 02:03:49,300 we've seen before. 2619 02:03:49,300 --> 02:03:52,300 Here is one node in a try. 2620 02:03:52,300 --> 02:03:55,540 And it's a node in the sense that this would be like a rectangle or square. 2621 02:03:55,540 --> 02:03:59,920 But inside of that node is literally an array of size 26 in this case. 2622 02:03:59,920 --> 02:04:05,640 And each of those locations or buckets, if you will represent A through Z. 2623 02:04:05,640 --> 02:04:09,840 And what we're going to do is any time we insert a word, like a name, 2624 02:04:09,840 --> 02:04:12,690 like Harry, or Hagrid, or Hermione, or anyone else, 2625 02:04:12,690 --> 02:04:17,490 we are going to walk through the letters of their name, like, H-A-G-R-I-D, 2626 02:04:17,490 --> 02:04:23,340 and we are going to follow a series of pointers from one node to another 2627 02:04:23,340 --> 02:04:23,980 as follows. 2628 02:04:23,980 --> 02:04:28,560 So for instance, if this is A through Z, or 0 through 25, here is location H. 2629 02:04:28,560 --> 02:04:33,570 So if the goal at the moment is to insert the first of our contacts, 2630 02:04:33,570 --> 02:04:36,540 for instance, Harry, I'm going to start by looking 2631 02:04:36,540 --> 02:04:40,290 at the first node, the root of the tree, looking up the H location. 2632 02:04:40,290 --> 02:04:43,170 And I'm going to kind of make mental note that Harry starts there, 2633 02:04:43,170 --> 02:04:45,000 the H in Harry starts there. 2634 02:04:45,000 --> 02:04:48,120 Then if I want to insert the A in Harry, I'm 2635 02:04:48,120 --> 02:04:53,670 going to go ahead and add a new node, representing 26 letters. 2636 02:04:53,670 --> 02:04:57,280 But I'm going to keep track of the fact that, OK, A is here. 2637 02:04:57,280 --> 02:04:59,170 So now I'm going to have another pointer-- 2638 02:04:59,170 --> 02:05:07,410 oh, I'm sorry-- not Harry, Hagrid first, H-A-G-R-I-D. So what have I just done? 2639 02:05:07,410 --> 02:05:11,520 A trie, again, is a tree, each of whose nodes is an array. 2640 02:05:11,520 --> 02:05:17,140 And each of those arrays is an array of pointers to other nodes. 2641 02:05:17,140 --> 02:05:19,920 So, again, we're really just mashing up everything together here. 2642 02:05:19,920 --> 02:05:21,930 But it's the same building blocks as before. 2643 02:05:21,930 --> 02:05:27,630 Each node in this tree, top to bottom, is an array of pointers to other nodes. 2644 02:05:27,630 --> 02:05:32,370 And so, if I wanted to check if Hagrid is in my contacts, 2645 02:05:32,370 --> 02:05:35,910 I literally start at the first node, and I follow the H pointer. 2646 02:05:35,910 --> 02:05:37,500 I then follow the A pointer. 2647 02:05:37,500 --> 02:05:40,650 I then follow the G pointer, the R pointer, the I pointer. 2648 02:05:40,650 --> 02:05:42,480 And then I check at the D pointer. 2649 02:05:42,480 --> 02:05:46,930 Is there a Boolean value inside of that structure-- more on that another time, 2650 02:05:46,930 --> 02:05:47,850 perhaps-- 2651 02:05:47,850 --> 02:05:53,070 that just says, yes or no, there is someone named H-A-G-R-I-D 2652 02:05:53,070 --> 02:05:54,450 in my contacts. 2653 02:05:54,450 --> 02:05:57,215 Notice there's no other letters noted at the moment. 2654 02:05:57,215 --> 02:05:58,590 And there's no other green boxes. 2655 02:05:58,590 --> 02:06:01,290 Green just denotes a Boolean value for our purposes now. 2656 02:06:01,290 --> 02:06:08,190 So that means there's no one whose name is H-A-G-R-I-A, or R-I-B, or R-I-C. 2657 02:06:08,190 --> 02:06:13,412 It's only H-A-G-R-I-D that exists in my contacts. 2658 02:06:13,412 --> 02:06:14,620 But notice what happens next. 2659 02:06:14,620 --> 02:06:16,980 Now if I go ahead and insert Harry, notice 2660 02:06:16,980 --> 02:06:21,840 that Harry and Hagrid share the H, the A, and then this third node. 2661 02:06:21,840 --> 02:06:24,510 But then Harry needs to follow a different pointer 2662 02:06:24,510 --> 02:06:27,120 to store the R and the Y. And notice the green there. 2663 02:06:27,120 --> 02:06:29,520 It's sort of a checkmark in the data structure, a Boolean value, that's 2664 02:06:29,520 --> 02:06:30,120 saying yes. 2665 02:06:30,120 --> 02:06:32,640 I have someone in my context named H-A-R-R-Y. 2666 02:06:32,640 --> 02:06:37,740 And then, if we add Hermione, she shares the H, and then also the second node. 2667 02:06:37,740 --> 02:06:42,280 But Hermione requires some new nodes all together. 2668 02:06:42,280 --> 02:06:46,950 But notice this key property, the reason for this sort of complexity, 2669 02:06:46,950 --> 02:06:50,100 because this is probably the weirdest structure we've seen thus far, 2670 02:06:50,100 --> 02:06:54,060 is that even if I have a billion names in my phone book, 2671 02:06:54,060 --> 02:06:57,930 how many steps literally does it take me to find Hagrid? 2672 02:06:57,930 --> 02:06:58,620 Someone? 2673 02:06:58,620 --> 02:07:01,680 Feel free to chime in the text, the chat window if you like? 2674 02:07:01,680 --> 02:07:04,530 Even if I have a billion names in my contacts, 2675 02:07:04,530 --> 02:07:07,320 how many steps does it take for me to look up and check 2676 02:07:07,320 --> 02:07:09,232 if Hagrid is among them? 2677 02:07:09,232 --> 02:07:10,440 BRIAN: People are saying six. 2678 02:07:10,440 --> 02:07:15,810 DAVID MALAN: Six, H-A-G-R-I-D. If I have 2 billion names, 4 billion names, 2679 02:07:15,810 --> 02:07:17,130 how many steps does it take? 2680 02:07:17,130 --> 02:07:21,930 6, and this is what we mean by constant time, constant time at least 2681 02:07:21,930 --> 02:07:25,510 in the length of the humans' names in the data structure. 2682 02:07:25,510 --> 02:07:26,760 So what does this mean? 2683 02:07:26,760 --> 02:07:30,450 No matter how many names you cram into a trie data 2684 02:07:30,450 --> 02:07:34,020 structure, the number of other names does not 2685 02:07:34,020 --> 02:07:38,040 impact how many steps it takes for me to find Harry or Hagrid or Hermione 2686 02:07:38,040 --> 02:07:39,360 or anyone else. 2687 02:07:39,360 --> 02:07:41,793 It is only dependent on the length of their name. 2688 02:07:41,793 --> 02:07:43,710 And here's where we can get a little academic. 2689 02:07:43,710 --> 02:07:46,043 If you assume that there's a finite number of characters 2690 02:07:46,043 --> 02:07:50,370 in any human, real or imaginary's name, maybe it's 20, or 100, or whatever. 2691 02:07:50,370 --> 02:07:52,930 It's more than 6, but it's probably fewer than hundreds. 2692 02:07:52,930 --> 02:07:55,030 And then you can assume that that's constant. 2693 02:07:55,030 --> 02:07:57,420 So it might be Big O of, like, 200 characters 2694 02:07:57,420 --> 02:07:59,190 for someone with a super long name. 2695 02:07:59,190 --> 02:08:00,600 But that's constant. 2696 02:08:00,600 --> 02:08:05,100 And so technically a try gives you that Holy Grail 2697 02:08:05,100 --> 02:08:09,360 of look up times and insertion times of Big O of 1. 2698 02:08:09,360 --> 02:08:12,360 Because it is not dependent on n, which is the number 2699 02:08:12,360 --> 02:08:14,850 of other names in the data structure. 2700 02:08:14,850 --> 02:08:18,300 It is dependent only on the length of the name you're inputting. 2701 02:08:18,300 --> 02:08:21,630 And if you assume that all names in the world are reasonably linked, 2702 02:08:21,630 --> 02:08:24,780 less than some finite value, like 6, or 200, or whatever 2703 02:08:24,780 --> 02:08:26,610 it is, then you can call that. 2704 02:08:26,610 --> 02:08:29,580 And it technically is Big O of 1. 2705 02:08:29,580 --> 02:08:32,140 That is constant time. 2706 02:08:32,140 --> 02:08:36,633 So here is the goal of this whole day, like, trying to get to constant time. 2707 02:08:36,633 --> 02:08:38,550 Because constant time is better than, it would 2708 02:08:38,550 --> 02:08:42,240 seem, linear time, or logarithmic time, or anything else we've seen. 2709 02:08:42,240 --> 02:08:45,540 But-- but-- but-- again, there's always a trade off. 2710 02:08:45,540 --> 02:08:49,710 What price have we just paid if you see it? 2711 02:08:49,710 --> 02:08:55,630 Why are tries not necessarily all that? 2712 02:08:55,630 --> 02:08:57,850 There's still a catch. 2713 02:08:57,850 --> 02:08:59,230 Daniel? 2714 02:08:59,230 --> 02:09:03,010 AUDIENCE: For example, if let's say you have two people in your contact list. 2715 02:09:03,010 --> 02:09:07,120 One person was named Daniel and one person was named Danielle. 2716 02:09:07,120 --> 02:09:09,870 And you know that Daniel is in your list. 2717 02:09:09,870 --> 02:09:13,360 So the L would have this Boolean operator of true. 2718 02:09:13,360 --> 02:09:17,500 But then, how would you get to Danielle, if your L was an operator 2719 02:09:17,500 --> 02:09:19,630 and it didn't point to another L for Danielle? 2720 02:09:19,630 --> 02:09:22,213 DAVID MALAN: Really good question, a corner case, if you will. 2721 02:09:22,213 --> 02:09:26,110 What if someone's name is a substring of another person's name? 2722 02:09:26,110 --> 02:09:29,950 So Daniel, D-A-N-I-E-L, and I think you're saying Danielle, 2723 02:09:29,950 --> 02:09:35,410 D-A-N-I-E-L-L-E. So the second name is a little longer. 2724 02:09:35,410 --> 02:09:37,210 Let me stipulate that we can solve that. 2725 02:09:37,210 --> 02:09:40,960 I have not shown code that represents each of the nodes in this tree. 2726 02:09:40,960 --> 02:09:44,180 Let me propose that we could continue having arrows even below. 2727 02:09:44,180 --> 02:09:46,540 So if we were to have Daniel in this tree, 2728 02:09:46,540 --> 02:09:48,580 we could also have Danielle by just having 2729 02:09:48,580 --> 02:09:52,540 a couple of more nodes below Daniel and just having another green checkmark. 2730 02:09:52,540 --> 02:09:54,850 So it is solvable in code, even though it's not 2731 02:09:54,850 --> 02:09:57,790 obvious from the graphical representation, but absolutely 2732 02:09:57,790 --> 02:09:58,510 a corner case. 2733 02:09:58,510 --> 02:10:01,210 But thankfully, it is solvable in tries. 2734 02:10:01,210 --> 02:10:04,000 What might another downside be, though, of a trie? 2735 02:10:04,000 --> 02:10:05,740 Like, you do get Big O of 1 time. 2736 02:10:05,740 --> 02:10:08,450 You can solve the Daniel-Danielle problem. 2737 02:10:08,450 --> 02:10:10,195 But there's still a price being paid. 2738 02:10:10,195 --> 02:10:13,290 2739 02:10:13,290 --> 02:10:15,950 Any thoughts? 2740 02:10:15,950 --> 02:10:16,450 Yeah. 2741 02:10:16,450 --> 02:10:20,285 How about over to Ethan, if I'm saying it right? 2742 02:10:20,285 --> 02:10:20,910 AUDIENCE: Yeah. 2743 02:10:20,910 --> 02:10:21,690 I'm Ethan. 2744 02:10:21,690 --> 02:10:22,482 DAVID MALAN: Ethan. 2745 02:10:22,482 --> 02:10:25,200 AUDIENCE: I think it's because it would just take a lot of memory 2746 02:10:25,200 --> 02:10:26,580 to house all of that. 2747 02:10:26,580 --> 02:10:28,410 And that could take a lot of time. 2748 02:10:28,410 --> 02:10:29,790 It could slow down the system. 2749 02:10:29,790 --> 02:10:30,820 DAVID MALAN: Exactly. 2750 02:10:30,820 --> 02:10:31,320 Yeah. 2751 02:10:31,320 --> 02:10:33,195 You can kind of see it from my picture alone. 2752 02:10:33,195 --> 02:10:35,410 We only added three names to this data structure. 2753 02:10:35,410 --> 02:10:39,390 But my God, like, there's dozens, maybe 100 plus pointers 2754 02:10:39,390 --> 02:10:41,830 pictured here, even though they might all be null. 2755 02:10:41,830 --> 02:10:42,330 Right? 2756 02:10:42,330 --> 02:10:46,200 If the absence of an arrow here suggests that they're null, 0x0, but even 2757 02:10:46,200 --> 02:10:49,180 storing null, 0x0, is 8 0 bits. 2758 02:10:49,180 --> 02:10:52,710 So this is not lacking for actual memory usage, 2759 02:10:52,710 --> 02:10:54,960 we're just not using it very efficiently. 2760 02:10:54,960 --> 02:10:58,650 We have spent a huge number of bits, or bytes, 2761 02:10:58,650 --> 02:11:00,570 or however you want to measure it. 2762 02:11:00,570 --> 02:11:04,440 Because look, even with the H's, I'm using one pointer out of 26. 2763 02:11:04,440 --> 02:11:06,780 25 pointers are probably initialized to null, 2764 02:11:06,780 --> 02:11:08,910 which means I'm wasting 25 pointers. 2765 02:11:08,910 --> 02:11:11,550 And you can imagine the lower and lower you get in this tree, 2766 02:11:11,550 --> 02:11:16,560 the less likely there is to be a name that even starts with H-A-G-R-I-D. 2767 02:11:16,560 --> 02:11:18,990 Daniel came up with a good example with Danielle. 2768 02:11:18,990 --> 02:11:20,850 But that's not going to often happen. 2769 02:11:20,850 --> 02:11:22,940 Certainly, the lower you get in this tree. 2770 02:11:22,940 --> 02:11:26,330 So as Ethan says, you're wasting a huge amount of memory. 2771 02:11:26,330 --> 02:11:30,900 So yes, you're gaining constant time look ups, but my God, at what price? 2772 02:11:30,900 --> 02:11:34,680 You might be using megabytes, gigabytes of storage space. 2773 02:11:34,680 --> 02:11:37,200 Because, again, the most important property of an array 2774 02:11:37,200 --> 02:11:38,700 is that it's all contiguous. 2775 02:11:38,700 --> 02:11:40,300 And therefore, you have random access. 2776 02:11:40,300 --> 02:11:42,330 But if you have to have every node containing 2777 02:11:42,330 --> 02:11:45,660 an array of size 26 or anything else, you 2778 02:11:45,660 --> 02:11:49,960 have to spend the memory over and over again. 2779 02:11:49,960 --> 02:11:51,210 So there, too, is a trade off. 2780 02:11:51,210 --> 02:11:53,760 While this might be theoretically ideal, theory 2781 02:11:53,760 --> 02:11:55,680 does not necessarily mean practice. 2782 02:11:55,680 --> 02:11:59,400 And sometimes, designing something that is textbook less efficient, 2783 02:11:59,400 --> 02:12:01,980 might actually be more efficient in the real world. 2784 02:12:01,980 --> 02:12:05,733 And in Problem Set 5, in your own spellchecker, 2785 02:12:05,733 --> 02:12:07,650 when you build up this dictionary that we then 2786 02:12:07,650 --> 02:12:10,440 use to spell check very large corpuses of text, 2787 02:12:10,440 --> 02:12:14,310 will you begin to experience some of those real world trade offs yourself. 2788 02:12:14,310 --> 02:12:17,400 Well, we wanted to end today with a look at what else 2789 02:12:17,400 --> 02:12:19,380 you can do with these kinds of data structures, 2790 02:12:19,380 --> 02:12:21,990 just to give you a taste of where else you can go with this 2791 02:12:21,990 --> 02:12:23,760 and what other kinds of problems you can solve. 2792 02:12:23,760 --> 02:12:26,135 Again, thus far, we've looked at arrays, which are really 2793 02:12:26,135 --> 02:12:27,468 the simplest of data structures. 2794 02:12:27,468 --> 02:12:29,093 And they're not even structures per se. 2795 02:12:29,093 --> 02:12:30,940 It's just contiguous blocks of memory. 2796 02:12:30,940 --> 02:12:33,090 We then introduced linked lists, of course, 2797 02:12:33,090 --> 02:12:35,640 where you have this one dimensional data structure that 2798 02:12:35,640 --> 02:12:39,120 allows you to stitch together nodes and memory, giving you dynamic allocation. 2799 02:12:39,120 --> 02:12:42,570 And if you want deallocation, inserting and deleting nodes if you want, 2800 02:12:42,570 --> 02:12:46,380 then we had trees, which kind of gives us the best of both worlds, 2801 02:12:46,380 --> 02:12:48,090 arrays and linked lists. 2802 02:12:48,090 --> 02:12:51,113 But we have to spend more space and use more pointers. 2803 02:12:51,113 --> 02:12:53,280 Then, of course, hash tables kind of merged together 2804 02:12:53,280 --> 02:12:56,430 two of those ideas, arrays and the linked lists. 2805 02:12:56,430 --> 02:12:57,790 And that starts to work well. 2806 02:12:57,790 --> 02:13:00,480 And indeed, that's what you'll experiment with your own spellchecker. 2807 02:13:00,480 --> 02:13:02,855 But then, of course, there's tries, which at first glance 2808 02:13:02,855 --> 02:13:07,680 seem better, but not without great cost, as Ethan says. 2809 02:13:07,680 --> 02:13:11,220 So it turns out, with all of those building blocks at your disposal, 2810 02:13:11,220 --> 02:13:14,820 you can actually use them as lower level implementation details 2811 02:13:14,820 --> 02:13:16,740 to solve higher level problems. 2812 02:13:16,740 --> 02:13:20,400 And this is what are known as abstract data structures, or abstract data 2813 02:13:20,400 --> 02:13:21,060 types. 2814 02:13:21,060 --> 02:13:24,122 An abstract data structure is kind of a mental structure 2815 02:13:24,122 --> 02:13:26,580 that you can imagine, implementing some real world problem, 2816 02:13:26,580 --> 02:13:30,135 typically, that's implemented with some other data structure. 2817 02:13:30,135 --> 02:13:32,010 So you're sort of writing code at this level. 2818 02:13:32,010 --> 02:13:34,885 But you're thinking about what you've built ultimately at this level. 2819 02:13:34,885 --> 02:13:37,710 And that's abstraction, taking lower level implementation 2820 02:13:37,710 --> 02:13:40,830 details, simplifying them for the sake of discussion or problem 2821 02:13:40,830 --> 02:13:42,600 solving higher up. 2822 02:13:42,600 --> 02:13:44,670 So what's one such data structure? 2823 02:13:44,670 --> 02:13:48,180 A queue is a very common abstract data structure. 2824 02:13:48,180 --> 02:13:48,855 What is a queue? 2825 02:13:48,855 --> 02:13:50,813 Well, those of you who grew up, say, in Britain 2826 02:13:50,813 --> 02:13:53,040 call a line outside of a store a queue. 2827 02:13:53,040 --> 02:13:54,840 And that's, indeed, where it gets its name. 2828 02:13:54,840 --> 02:13:58,465 A queue is a data structure that has certain properties. 2829 02:13:58,465 --> 02:14:00,840 So if you're standing outside of a store or a restaurant, 2830 02:14:00,840 --> 02:14:04,005 in healthier times, waiting to get in, you're generally in a queue. 2831 02:14:04,005 --> 02:14:05,880 But there's an important property of a queue. 2832 02:14:05,880 --> 02:14:08,550 At least, if you live in a fair society, you'd 2833 02:14:08,550 --> 02:14:11,760 like to think that if you are the first one in line, 2834 02:14:11,760 --> 02:14:13,920 you are the first one to get out of line. 2835 02:14:13,920 --> 02:14:15,750 First in, first out. 2836 02:14:15,750 --> 02:14:18,150 It would be kind of obnoxious if you're first in line 2837 02:14:18,150 --> 02:14:21,180 and then they start letting people in who are behind you in that queue. 2838 02:14:21,180 --> 02:14:23,880 So a queue, if it's implemented correctly, 2839 02:14:23,880 --> 02:14:27,630 has a property known as FIFO, first in, first out. 2840 02:14:27,630 --> 02:14:30,180 And we humans would think of that as just a fair property. 2841 02:14:30,180 --> 02:14:33,870 And a queue generally has two operations associated with it at least, 2842 02:14:33,870 --> 02:14:35,745 enqueue and dequeue. 2843 02:14:35,745 --> 02:14:36,870 Those are just conventions. 2844 02:14:36,870 --> 02:14:40,020 You could call it add and remove, or insert, delete, whatever. 2845 02:14:40,020 --> 02:14:43,170 But enqueue and dequeue are sort of the more common ones to say. 2846 02:14:43,170 --> 02:14:45,120 So enqueuing means you walk up to the store 2847 02:14:45,120 --> 02:14:47,037 and you get in line, because you have to wait. 2848 02:14:47,037 --> 02:14:49,755 Dequeue means they're ready to serve you or have you. 2849 02:14:49,755 --> 02:14:50,880 And so you get out of line. 2850 02:14:50,880 --> 02:14:51,780 That's dequeue. 2851 02:14:51,780 --> 02:14:54,120 And again, FIFO is just a fancy acronym that 2852 02:14:54,120 --> 02:14:59,100 describes a key property of that, which is that it's first in, first out. 2853 02:14:59,100 --> 02:15:03,390 So how could you implement a queue then? 2854 02:15:03,390 --> 02:15:07,150 Well, what's interesting about that data structure is that it's abstract. 2855 02:15:07,150 --> 02:15:07,650 Right? 2856 02:15:07,650 --> 02:15:10,740 It's more of an idea than an actual thing in code. 2857 02:15:10,740 --> 02:15:14,020 You want to implement some kind of fair queuing system. 2858 02:15:14,020 --> 02:15:15,450 And so you think of it as a queue. 2859 02:15:15,450 --> 02:15:18,770 But frankly, if we're going to translate that example to code, 2860 02:15:18,770 --> 02:15:21,590 you could imagine using an array of persons. 2861 02:15:21,590 --> 02:15:23,450 That could implement a queue. 2862 02:15:23,450 --> 02:15:25,175 You could use a linked list of persons. 2863 02:15:25,175 --> 02:15:27,050 Frankly, either of those would actually work. 2864 02:15:27,050 --> 02:15:30,470 Underneath the hood is the lower level implementation details. 2865 02:15:30,470 --> 02:15:33,860 But what would be a problem, if we translate this real world 2866 02:15:33,860 --> 02:15:38,810 analogy like queuing up outside of a store to get in, into code 2867 02:15:38,810 --> 02:15:39,830 and you used an array? 2868 02:15:39,830 --> 02:15:42,980 What would a downside be of using an array 2869 02:15:42,980 --> 02:15:48,380 to represent a queue, in general, even though we're making a bit of a leap 2870 02:15:48,380 --> 02:15:50,060 from real world to code suddenly? 2871 02:15:50,060 --> 02:15:52,760 But there's probably a downside. 2872 02:15:52,760 --> 02:15:56,720 What might a downside be of using an array to implement a queue, Ryan? 2873 02:15:56,720 --> 02:15:59,690 AUDIENCE: If you're using an array, you can't really 2874 02:15:59,690 --> 02:16:03,327 just take out the existing values. 2875 02:16:03,327 --> 02:16:05,660 Because if you were thinking about doing this in a line, 2876 02:16:05,660 --> 02:16:09,500 you would have to take out the first person, take out the second person. 2877 02:16:09,500 --> 02:16:13,130 But you can't really dynamically sort of change the memory after that. 2878 02:16:13,130 --> 02:16:13,910 DAVID MALAN: Yeah. 2879 02:16:13,910 --> 02:16:14,270 Yeah. 2880 02:16:14,270 --> 02:16:15,180 That's a really good point. 2881 02:16:15,180 --> 02:16:15,972 Think about a line. 2882 02:16:15,972 --> 02:16:19,328 Suppose that there's a line that can fit 10 people outside the Apple store. 2883 02:16:19,328 --> 02:16:22,370 Because Apple's pretty good right now during the health crisis of letting 2884 02:16:22,370 --> 02:16:24,210 people in only so many at a time. 2885 02:16:24,210 --> 02:16:26,840 So suppose they have room for 10 people, six feet apart. 2886 02:16:26,840 --> 02:16:30,000 That's actually a pretty apt analogy, this year more than ever. 2887 02:16:30,000 --> 02:16:35,480 But as Ryan says, if you want to dequeue someone, then the first person in line 2888 02:16:35,480 --> 02:16:36,770 is going to go into the store. 2889 02:16:36,770 --> 02:16:38,600 And then the second person, you're going to dequeue them. 2890 02:16:38,600 --> 02:16:40,049 They go into the store. 2891 02:16:40,049 --> 02:16:41,840 The problem with an array, it would seem, 2892 02:16:41,840 --> 02:16:44,976 is that now you have essentially empty spaces at the beginning of the line. 2893 02:16:44,976 --> 02:16:47,809 But you still don't have room at the end of the line for new people. 2894 02:16:47,809 --> 02:16:49,851 Now there's an obvious real world solution there. 2895 02:16:49,851 --> 02:16:52,879 You just say, hey, everyone, would you mind taking a few steps forward? 2896 02:16:52,879 --> 02:16:54,139 But that's inefficient. 2897 02:16:54,139 --> 02:16:56,847 Not so much in the human world, you've got to get into the store. 2898 02:16:56,847 --> 02:16:58,940 But in code, that's copying of values. 2899 02:16:58,940 --> 02:17:03,410 You have to move, like, eight values two places over if two people were just 2900 02:17:03,410 --> 02:17:04,440 let into the store. 2901 02:17:04,440 --> 02:17:07,520 So now your dequeue operation is Big O of n. 2902 02:17:07,520 --> 02:17:08,990 And that doesn't feel quite ideal. 2903 02:17:08,990 --> 02:17:10,459 And we can do better than that if we're a little 2904 02:17:10,459 --> 02:17:12,440 clever with some local variables and such. 2905 02:17:12,440 --> 02:17:14,870 But that would be one challenge of a queue, 2906 02:17:14,870 --> 02:17:19,722 certainly, is just how well we could implement it using an array. 2907 02:17:19,722 --> 02:17:21,889 So you might imagine, too, an array is limited, too. 2908 02:17:21,889 --> 02:17:24,590 Because it would kind of be obnoxious if you get to the Apple store. 2909 02:17:24,590 --> 02:17:26,100 There's already 10 people in line. 2910 02:17:26,100 --> 02:17:27,559 And they don't let you get in line. 2911 02:17:27,559 --> 02:17:30,650 They say, sorry, we're all full for today, when they're obviously not. 2912 02:17:30,650 --> 02:17:32,900 Because eventually there'll be more room in the queue. 2913 02:17:32,900 --> 02:17:36,209 A linked list would allow you to keep appending more and more people. 2914 02:17:36,209 --> 02:17:38,510 And even if the line outside the store gets crazy long, 2915 02:17:38,510 --> 02:17:41,059 at least the linked list allows you to service 2916 02:17:41,059 --> 02:17:43,790 all of the customers who are showing up over time. 2917 02:17:43,790 --> 02:17:46,400 An array of fixed size would make that harder. 2918 02:17:46,400 --> 02:17:48,379 And, again, you could allocate a bigger array. 2919 02:17:48,379 --> 02:17:50,090 But then you're going to have to ask all the customers, hey, 2920 02:17:50,090 --> 02:17:51,340 could everyone come over here? 2921 02:17:51,340 --> 02:17:51,840 No. 2922 02:17:51,840 --> 02:17:52,730 Go back over there. 2923 02:17:52,730 --> 02:17:57,690 I mean, you're constantly moving humans, or values and memory, back and forth. 2924 02:17:57,690 --> 02:18:00,320 So that is only to say that to implement this real world 2925 02:18:00,320 --> 02:18:03,559 notion of a queue, which is very commonly used even in the computer 2926 02:18:03,559 --> 02:18:09,139 world to represent certain ideas, for instance, the printer queue, when you 2927 02:18:09,139 --> 02:18:12,230 send something to the printer, especially on a campus or in a company, 2928 02:18:12,230 --> 02:18:12,952 there's a queue. 2929 02:18:12,952 --> 02:18:14,660 And ideally, the first person who printed 2930 02:18:14,660 --> 02:18:17,240 is the first one who gets their printouts thereafter. 2931 02:18:17,240 --> 02:18:19,100 Queues are also used in software. 2932 02:18:19,100 --> 02:18:22,760 But there's other abstract data types out there besides queues. 2933 02:18:22,760 --> 02:18:24,379 One of them is called a stack. 2934 02:18:24,379 --> 02:18:27,230 So a stack is a data structure that can also 2935 02:18:27,230 --> 02:18:31,459 be implemented underneath the hood using arrays or linked lists or, 2936 02:18:31,459 --> 02:18:32,870 heck, maybe something else. 2937 02:18:32,870 --> 02:18:35,090 But stacks have a different property. 2938 02:18:35,090 --> 02:18:37,910 It's last in, first out. 2939 02:18:37,910 --> 02:18:39,469 Last in, first out. 2940 02:18:39,469 --> 02:18:41,930 So if you think about the trays in the cafeteria, 2941 02:18:41,930 --> 02:18:46,070 in healthier times when everyone was on campus using trays from a cafeteria, 2942 02:18:46,070 --> 02:18:48,950 you'll recall, of course, that trays tend to get stacked like this. 2943 02:18:48,950 --> 02:18:56,930 And the last tray to go on top of the stack is the first one to come out. 2944 02:18:56,930 --> 02:18:59,000 If you go to a clothing store, your own closet, 2945 02:18:59,000 --> 02:19:01,040 if you don't hang things on hangers or put them in drawers, 2946 02:19:01,040 --> 02:19:03,920 but kind of stack them, like here, like all of these sweaters, 2947 02:19:03,920 --> 02:19:05,540 this is a stack of sweaters. 2948 02:19:05,540 --> 02:19:07,459 And how do I get at a sweater I want? 2949 02:19:07,459 --> 02:19:10,879 Well, the easiest way to do it is with last in, first out. 2950 02:19:10,879 --> 02:19:13,879 So I constantly take the black sweater, the black sweater. 2951 02:19:13,879 --> 02:19:16,100 But if I've stored all of my sweaters in this stack, 2952 02:19:16,100 --> 02:19:18,709 you may never get to the sort of lower level ones, 2953 02:19:18,709 --> 02:19:22,379 like the red or the blue sweater because, again, of this data structure. 2954 02:19:22,379 --> 02:19:26,660 So LIFO, last in, first out, is in fact, the property 2955 02:19:26,660 --> 02:19:28,940 used to characterize stacks. 2956 02:19:28,940 --> 02:19:33,379 And stacks are useful or not useful, depending on the real world context. 2957 02:19:33,379 --> 02:19:36,170 But even within computing, we'll see applications over time 2958 02:19:36,170 --> 02:19:38,480 where stacks indeed come into play. 2959 02:19:38,480 --> 02:19:41,450 And those two operations that those things support are generally called 2960 02:19:41,450 --> 02:19:42,740 push and pop. 2961 02:19:42,740 --> 02:19:46,400 It's the same thing as add or remove or insert or delete, but the terms of art 2962 02:19:46,400 --> 02:19:51,230 are generally push and pop, where this is me popping a value off of the stack. 2963 02:19:51,230 --> 02:19:54,260 This is me pushing a value onto the stack. 2964 02:19:54,260 --> 02:19:58,520 But, again, it's last in, first out, otherwise known as LIFO. 2965 02:19:58,520 --> 02:20:00,410 And then there's this other data structure 2966 02:20:00,410 --> 02:20:04,740 that actually has a very real world analog, known as a dictionary. 2967 02:20:04,740 --> 02:20:07,370 A dictionary is an abstract data type, which 2968 02:20:07,370 --> 02:20:10,370 means you can implement it with arrays, or linked lists, or hash tables, 2969 02:20:10,370 --> 02:20:13,280 or tries, or whatever else, an abstract data type 2970 02:20:13,280 --> 02:20:16,130 that allows you to associate keys with values. 2971 02:20:16,130 --> 02:20:18,560 And the best analog here is indeed in the real world. 2972 02:20:18,560 --> 02:20:20,720 What is a dictionary, like an old school dictionary 2973 02:20:20,720 --> 02:20:23,330 that's actually printed on paper in book form? 2974 02:20:23,330 --> 02:20:25,550 What is inside that book? 2975 02:20:25,550 --> 02:20:30,380 A whole bunch of keys, a whole bunch of boldfaced words, like apple and banana 2976 02:20:30,380 --> 02:20:35,180 and so forth, each of which have definitions, otherwise known as values. 2977 02:20:35,180 --> 02:20:37,548 And they're often alphabetized to make it easier 2978 02:20:37,548 --> 02:20:40,340 for you to find things so that you can look things up more quickly. 2979 02:20:40,340 --> 02:20:45,560 But a dictionary is an abstract data type that associates keys with values. 2980 02:20:45,560 --> 02:20:48,320 And you look up the values by way of their keys, 2981 02:20:48,320 --> 02:20:53,070 just like you look up a word's definition by way of the word itself. 2982 02:20:53,070 --> 02:20:57,318 And dictionaries are actually kind of all around us, too. 2983 02:20:57,318 --> 02:20:59,360 You don't think of them in these terms, probably. 2984 02:20:59,360 --> 02:21:02,330 But if you've ever been to Sweet Green, for instance, in New Haven 2985 02:21:02,330 --> 02:21:05,060 or in Cambridge or elsewhere, this is a salad place 2986 02:21:05,060 --> 02:21:08,840 where nowadays, especially, you can order in advance online or on an app 2987 02:21:08,840 --> 02:21:12,920 and then go into the store and pick up your food from a shelf. 2988 02:21:12,920 --> 02:21:16,580 But the shelf, the way they do it here in Cambridge and in other cities, 2989 02:21:16,580 --> 02:21:20,765 is they actually have letters of the alphabet on the shelves, A, B, C, D, E, 2990 02:21:20,765 --> 02:21:24,500 F, all the way through Z. The idea being that if I go in to pick up my salad, 2991 02:21:24,500 --> 02:21:27,710 it's probably on the D section, if Brian goes in to pick up his, 2992 02:21:27,710 --> 02:21:30,210 it's in the B section, and so forth. 2993 02:21:30,210 --> 02:21:33,680 Now here, too, you can imagine perverse corner cases 2994 02:21:33,680 --> 02:21:37,100 where this data structure, this dictionary whereby 2995 02:21:37,100 --> 02:21:40,910 letters of the alphabet map to values, which are people's salad, 2996 02:21:40,910 --> 02:21:42,920 is not necessarily fail proof. 2997 02:21:42,920 --> 02:21:45,440 Can you think of a perverse corner case where 2998 02:21:45,440 --> 02:21:50,060 Sweet Green's very wonderful, methodical system actually breaks down? 2999 02:21:50,060 --> 02:21:54,640 Can you think of a limitation here, even if you've never been to Sweet Green 3000 02:21:54,640 --> 02:21:58,210 or never eaten a salad, what could break down with this system 3001 02:21:58,210 --> 02:22:02,890 if going into a store and picking something up based on your name? 3002 02:22:02,890 --> 02:22:04,070 Any thoughts? 3003 02:22:04,070 --> 02:22:06,070 BRIAN: A few people say there might be a problem 3004 02:22:06,070 --> 02:22:07,730 if two people have the same name. 3005 02:22:07,730 --> 02:22:08,480 DAVID MALAN: Yeah. 3006 02:22:08,480 --> 02:22:11,500 If two people have the same names, you start to stack things up. 3007 02:22:11,500 --> 02:22:14,667 So literally, Sweet Green will start stacking one salad on top of the other. 3008 02:22:14,667 --> 02:22:16,625 So there is actually an interesting incarnation 3009 02:22:16,625 --> 02:22:19,220 of one data type being built on top of yet another data type. 3010 02:22:19,220 --> 02:22:22,593 So, again, all of these are sort of like custom Scratch pieces, if you will, 3011 02:22:22,593 --> 02:22:24,760 that we're constantly sort of reassembling into more 3012 02:22:24,760 --> 02:22:26,380 interesting and powerful ideas. 3013 02:22:26,380 --> 02:22:29,470 But at some point, if there's a lot of B names, or D names, 3014 02:22:29,470 --> 02:22:33,670 or any letter of the alphabet, I surely see a finite height to this shelf. 3015 02:22:33,670 --> 02:22:37,270 So it's kind of as though Sweet Green has implemented their dictionary 3016 02:22:37,270 --> 02:22:39,920 using stacks with arrays. 3017 02:22:39,920 --> 02:22:41,300 Because arrays are fixed size. 3018 02:22:41,300 --> 02:22:45,040 So there's surely only so many inches of space here vertically. 3019 02:22:45,040 --> 02:22:47,000 So you can see a real world limitation. 3020 02:22:47,000 --> 02:22:48,833 So what does Sweet Green do if that happens? 3021 02:22:48,833 --> 02:22:51,880 They probably just kind of cheat and put the D's in the C section, 3022 02:22:51,880 --> 02:22:53,092 or the D's in the E section. 3023 02:22:53,092 --> 02:22:54,800 Like, who really cares in the real world? 3024 02:22:54,800 --> 02:22:56,967 Your eyes are probably going to skim left and right. 3025 02:22:56,967 --> 02:22:59,530 But algorithmically, that is slowing things down. 3026 02:22:59,530 --> 02:23:02,732 And in the worst case, if you're really late to pick up your salad, 3027 02:23:02,732 --> 02:23:04,690 or if Sweet Green is really popular and there's 3028 02:23:04,690 --> 02:23:08,050 a huge number of salads on the shelf, your name might be Albus, 3029 02:23:08,050 --> 02:23:11,530 but your salad might end up way over here in the Z section 3030 02:23:11,530 --> 02:23:12,820 if they're just out of room. 3031 02:23:12,820 --> 02:23:16,150 And so that, too, is a valid algorithmic decision, 3032 02:23:16,150 --> 02:23:17,930 to just make room somewhere else. 3033 02:23:17,930 --> 02:23:21,650 But, again, trade offs between time and space. 3034 02:23:21,650 --> 02:23:24,847 And so we thought we'd end on a note, thanks to some friends of ours 3035 02:23:24,847 --> 02:23:27,430 at another institution who made a wonderful visualization that 3036 02:23:27,430 --> 02:23:31,480 distinguished these notions of stacks versus queues. 3037 02:23:31,480 --> 02:23:34,448 Stack and, again, a queue are these abstract data types that 3038 02:23:34,448 --> 02:23:35,990 can be implemented in different ways. 3039 02:23:35,990 --> 02:23:40,420 They have different properties, each of them respectively FIFO or LIFO. 3040 02:23:40,420 --> 02:23:43,210 And here, for instance, is a final look in our final moments 3041 02:23:43,210 --> 02:23:47,440 together here today about how these ideas manifest themselves, perhaps 3042 02:23:47,440 --> 02:23:50,170 in the real world, not unlike this stack of sweaters here. 3043 02:23:50,170 --> 02:23:50,837 [VIDEO PLAYBACK] 3044 02:23:50,837 --> 02:23:55,220 [MUSIC PLAYING] 3045 02:23:55,220 --> 02:23:58,130 - Once upon a time, there was a guy named Jack. 3046 02:23:58,130 --> 02:24:01,550 When it came to making friends, Jack did not have the knack. 3047 02:24:01,550 --> 02:24:04,460 So Jack went to talk to the most popular guy he knew. 3048 02:24:04,460 --> 02:24:07,070 He went up to Lou and asked, "What do I do?" 3049 02:24:07,070 --> 02:24:09,620 Lou saw that his friend was really distressed. 3050 02:24:09,620 --> 02:24:12,180 "Well," Lou began, "just look how you're dressed. 3051 02:24:12,180 --> 02:24:14,900 Don't you have any clothes with a different look?" 3052 02:24:14,900 --> 02:24:15,900 "Yes," said Jack. 3053 02:24:15,900 --> 02:24:17,330 "I sure do. 3054 02:24:17,330 --> 02:24:19,590 Come to my house, and I'll show them to you." 3055 02:24:19,590 --> 02:24:20,720 So they went off to Jack's. 3056 02:24:20,720 --> 02:24:23,750 And Jack showed Lou the box where he kept all his shirts 3057 02:24:23,750 --> 02:24:25,490 and his pants and his socks. 3058 02:24:25,490 --> 02:24:28,520 Lou said, "I see you have all your clothes in a pile. 3059 02:24:28,520 --> 02:24:31,070 Why don't you wear some others once in a While?" 3060 02:24:31,070 --> 02:24:34,190 Jack said, "Well, when I remove clothes and socks, 3061 02:24:34,190 --> 02:24:36,980 I wash them and put them away in the box. 3062 02:24:36,980 --> 02:24:39,410 Then comes the next morning, and up I hop. 3063 02:24:39,410 --> 02:24:42,620 I go to the box and get my clothes off the top." 3064 02:24:42,620 --> 02:24:45,320 Lou quickly realized the problem with Jack. 3065 02:24:45,320 --> 02:24:48,290 He kept clothes, CDS, and books in the stack. 3066 02:24:48,290 --> 02:24:50,690 When he reached for something to read or to wear, 3067 02:24:50,690 --> 02:24:53,330 he chose top book or underwear. 3068 02:24:53,330 --> 02:24:55,700 Then, when he was done, he would put it right back. 3069 02:24:55,700 --> 02:24:58,160 Back it would go, on top of the stack. 3070 02:24:58,160 --> 02:25:00,680 "I know the solution!" said a triumphant Lou. 3071 02:25:00,680 --> 02:25:03,290 "You need to learn to start using a queue." 3072 02:25:03,290 --> 02:25:06,050 Lou took Jack's clothes and hung them in a closet. 3073 02:25:06,050 --> 02:25:08,900 And when he had emptied the box, he just tossed it. 3074 02:25:08,900 --> 02:25:12,710 Then he said, "Now, Jack, at the end of the day, put your clothes on the left 3075 02:25:12,710 --> 02:25:14,250 when you put them away. 3076 02:25:14,250 --> 02:25:16,880 Then tomorrow morning when you see the sunshine, get 3077 02:25:16,880 --> 02:25:19,640 your clothes from the right, from the end of the line. 3078 02:25:19,640 --> 02:25:22,550 Don't you see?" said Lou, "it will be so nice. 3079 02:25:22,550 --> 02:25:25,940 You'll wear everything once before you wear something twice." 3080 02:25:25,940 --> 02:25:28,850 And with everything in queues in his closet and shelf, 3081 02:25:28,850 --> 02:25:31,910 Jack started to feel quite sure of himself, all thanks 3082 02:25:31,910 --> 02:25:34,237 to Lou and his wonderful queue. 3083 02:25:34,237 --> 02:25:34,820 [END PLAYBACK] 3084 02:25:34,820 --> 02:25:35,360 DAVID MALAN: All right. 3085 02:25:35,360 --> 02:25:36,470 That's it for CS50. 3086 02:25:36,470 --> 02:25:38,090 We will see you next time. 3087 02:25:38,090 --> 02:25:41,440 [MUSIC PLAYING] 3088 02:25:41,440 --> 02:26:39,000