1 00:00:00,000 --> 00:00:00,500 2 00:00:00,500 --> 00:00:10,850 [MUSIC PLAYING] 3 00:00:10,850 --> 00:00:11,820 DAVID MALAN: All right. 4 00:00:11,820 --> 00:00:14,540 This is CS50 and this is lecture five. 5 00:00:14,540 --> 00:00:18,180 And you'll recall perhaps that in this most current week, 6 00:00:18,180 --> 00:00:20,760 you've probably been tackling a little something like this. 7 00:00:20,760 --> 00:00:23,949 And if you don't actually have this experience from childhood, 8 00:00:23,949 --> 00:00:25,740 what we're referring to in this problem set 9 00:00:25,740 --> 00:00:29,414 is these kind of glasses that aren't 3-D glasses because it's both red eyes 10 00:00:29,414 --> 00:00:31,330 or sometimes it's just a big piece of plastic. 11 00:00:31,330 --> 00:00:34,680 But with this can you actually look up and see 12 00:00:34,680 --> 00:00:37,210 what the answer is supposed to be. 13 00:00:37,210 --> 00:00:40,770 And so this is the allusion to which we're referring 14 00:00:40,770 --> 00:00:43,230 and the goal ultimately in problem set five's who done 15 00:00:43,230 --> 00:00:47,037 it is to actually figure out how to implement that kind of red filter. 16 00:00:47,037 --> 00:00:48,870 But to do that, you first have to understand 17 00:00:48,870 --> 00:00:52,380 this thing, which at first glance, admittedly, looks pretty complicated. 18 00:00:52,380 --> 00:00:54,131 But if you dived into the problem already, 19 00:00:54,131 --> 00:00:56,380 you'll probably have wrapped your mind around at least 20 00:00:56,380 --> 00:00:59,640 a few of these fields like the size of the image or the width of the image 21 00:00:59,640 --> 00:01:02,940 and the height of the image, which should be a little more reasonably 22 00:01:02,940 --> 00:01:03,950 straightforward. 23 00:01:03,950 --> 00:01:07,680 But to implement this, you've had to deal with something 24 00:01:07,680 --> 00:01:10,200 called a struct or a structure. 
25 00:01:10,200 --> 00:01:12,337 And so in C, we have this feature, recall. 26 00:01:12,337 --> 00:01:14,670 And we didn't really play with this that much last time. 27 00:01:14,670 --> 00:01:17,640 But you've seen it now in forensics, or you soon will. 28 00:01:17,640 --> 00:01:20,070 And here we have the definition of a student. 29 00:01:20,070 --> 00:01:22,950 So when C was invented decades ago, they didn't foresee 30 00:01:22,950 --> 00:01:24,570 the need for a student data type. 31 00:01:24,570 --> 00:01:26,790 They had int and char and float and double. 32 00:01:26,790 --> 00:01:29,520 But we can invent our own data types much like in Scratch. 33 00:01:29,520 --> 00:01:32,340 We can make our own puzzle pieces as follows. 34 00:01:32,340 --> 00:01:35,940 typedef to define a type, struct to say here comes a structure. 35 00:01:35,940 --> 00:01:37,732 And what is the structure known as? student. 36 00:01:37,732 --> 00:01:40,481 Well, in this case, I arbitrarily decided that a student would just 37 00:01:40,481 --> 00:01:42,930 have a name and a dorm and both of those would be strings. 38 00:01:42,930 --> 00:01:44,971 And you can imagine putting other things in there 39 00:01:44,971 --> 00:01:47,580 like ID numbers, phone numbers, email addresses, or whatnot 40 00:01:47,580 --> 00:01:50,890 so that you can actually combine all of this information together. 41 00:01:50,890 --> 00:01:54,300 So let's just take a quick look at how you might use code like this. 42 00:01:54,300 --> 00:01:57,200 Here is a file called struct.h. 43 00:01:57,200 --> 00:02:01,620 It's common, but not necessary, to declare your structures inside 44 00:02:01,620 --> 00:02:05,490 of a file whose name ends with .h so that we can share it across multiple 45 00:02:05,490 --> 00:02:08,162 programs just like with other libraries' header files.
46 00:02:08,162 --> 00:02:09,870 And here I've taken those training wheels 47 00:02:09,870 --> 00:02:13,830 off as before where, string is actually just a white lie for char star. 48 00:02:13,830 --> 00:02:15,720 But this is really the same data structure 49 00:02:15,720 --> 00:02:17,940 and it's in a file called struct.h. 50 00:02:17,940 --> 00:02:20,820 So let's take a quick look now at a program that actually uses 51 00:02:20,820 --> 00:02:25,660 this in struct0.C. So let's take a look at what we've done here. 52 00:02:25,660 --> 00:02:28,180 In struct0.C we have some header files up top. 53 00:02:28,180 --> 00:02:31,260 But we also include this header file so that we have access 54 00:02:31,260 --> 00:02:33,030 to this new custom data type. 55 00:02:33,030 --> 00:02:35,740 And then in main we do a few things. 56 00:02:35,740 --> 00:02:39,120 We first go ahead and ask the user for an integer called enrollment. 57 00:02:39,120 --> 00:02:41,550 So hopefully they'll give us a positive number. 58 00:02:41,550 --> 00:02:45,570 If we then do get back a number as expected in line 13, 59 00:02:45,570 --> 00:02:48,060 what do we do in English here? 60 00:02:48,060 --> 00:02:52,280 How would you just describe what line 13 is doing at this point in the term? 61 00:02:52,280 --> 00:02:54,124 Anyone, yeah? 62 00:02:54,124 --> 00:02:56,510 AUDIENCE: [INAUDIBLE] 63 00:02:56,510 --> 00:03:00,170 DAVID MALAN: Yeah, give us an array of students of size enrollment. 64 00:03:00,170 --> 00:03:03,050 So even though on line 11 and prior we didn't 65 00:03:03,050 --> 00:03:07,160 know how many students we needed, we did get on line 12 66 00:03:07,160 --> 00:03:09,030 that answer, the enrollment. 67 00:03:09,030 --> 00:03:12,440 And so on line 13 we declare an array using a variable saying, 68 00:03:12,440 --> 00:03:17,250 give me this many elements, this many students in my array to store things. 69 00:03:17,250 --> 00:03:20,260 And then we proceed, in the lines below, as follows. 
70 00:03:20,260 --> 00:03:24,200 We start iterating over the enrollment from zero on up to enrollment. 71 00:03:24,200 --> 00:03:28,364 And we prompt the user on each iteration for a student's name and dorm. 72 00:03:28,364 --> 00:03:31,280 And the right hand side of those two lines of code is pretty familiar. 73 00:03:31,280 --> 00:03:32,860 You're just calling get_string. 74 00:03:32,860 --> 00:03:36,230 But on the left hand side, we do have a slightly new piece of syntax. 75 00:03:36,230 --> 00:03:39,620 We have students[i], which gives you the i-th student in the array. 76 00:03:39,620 --> 00:03:42,380 But what piece of syntax perhaps jumps out at you? 77 00:03:42,380 --> 00:03:45,140 Especially if you've never programmed before? 78 00:03:45,140 --> 00:03:48,355 And we've not used this symbol just yet in this context. 79 00:03:48,355 --> 00:03:49,230 What looks different? 80 00:03:49,230 --> 00:03:49,810 Yeah. 81 00:03:49,810 --> 00:03:50,710 AUDIENCE: [INAUDIBLE] 82 00:03:50,710 --> 00:03:52,043 DAVID MALAN: Yeah, so the .name. 83 00:03:52,043 --> 00:03:56,270 So you can probably infer that .name and .dorm are somehow accessing 84 00:03:56,270 --> 00:03:58,170 the student's name and the student's dorm. 85 00:03:58,170 --> 00:03:59,919 And that's literally all that's happening. 86 00:03:59,919 --> 00:04:04,400 This dot operator tells the computer, go inside of the student structure 87 00:04:04,400 --> 00:04:09,080 at that i-th location and store the string that's coming 88 00:04:09,080 --> 00:04:11,060 back from get_string in that variable. 89 00:04:11,060 --> 00:04:13,114 And similarly, do the same for dorm. 90 00:04:13,114 --> 00:04:15,530 So it's like taking a struct and then looking inside of it 91 00:04:15,530 --> 00:04:17,990 and going very specifically to one of the elements therein.
92 00:04:17,990 --> 00:04:21,740 We've never needed this dot operator before because in the past, any array, 93 00:04:21,740 --> 00:04:25,310 any variable we've had has just been a string or an int or float. 94 00:04:25,310 --> 00:04:27,320 We haven't had anything to dive deeper into. 95 00:04:27,320 --> 00:04:28,820 So that's all that's going on there. 96 00:04:28,820 --> 00:04:33,200 We've encapsulated, so to speak, inside of a student structure, name and dorm. 97 00:04:33,200 --> 00:04:35,990 And then this last part is actually just a printing out 98 00:04:35,990 --> 00:04:37,680 of that same information. 99 00:04:37,680 --> 00:04:40,740 We're just printing out, so-and-so is in such-and-such a dorm 100 00:04:40,740 --> 00:04:44,812 by passing in those two strings using our familiar %s. 101 00:04:44,812 --> 00:04:47,270 Now this program at the end of the day is kind of pointless 102 00:04:47,270 --> 00:04:50,311 because you prompt the user for some number of students' names and dorms, 103 00:04:50,311 --> 00:04:52,810 you print them out, and then you throw them away forever. 104 00:04:52,810 --> 00:04:55,670 And that's not all that useful of a program long term. 105 00:04:55,670 --> 00:05:00,320 And so we have in our second version of this program, struct1.c, a new trick, 106 00:05:00,320 --> 00:05:01,160 too. 107 00:05:01,160 --> 00:05:03,680 That's a teaser as well for the direction 108 00:05:03,680 --> 00:05:05,870 we're going in this problem set, next problem set, 109 00:05:05,870 --> 00:05:08,892 and beyond, where we're actually using files, where files on a computer 110 00:05:08,892 --> 00:05:10,850 are just a whole bunch of bits, zeros and ones. 111 00:05:10,850 --> 00:05:13,050 Those zeros and ones follow some pattern. 112 00:05:13,050 --> 00:05:16,280 But we have yet to see a mechanism for actually saving files. 113 00:05:16,280 --> 00:05:17,570 But here's how we can do it.
114 00:05:17,570 --> 00:05:20,720 So above this line here, 21 and above, same program. 115 00:05:20,720 --> 00:05:24,140 Just get a bunch of students from the user, per their name and dorm. 116 00:05:24,140 --> 00:05:26,390 Then here line 24 we see something that's 117 00:05:26,390 --> 00:05:30,740 a little new, though you have seen this in the forensics problem so far. 118 00:05:30,740 --> 00:05:33,825 We call a function called fopen, meaning file open. 119 00:05:33,825 --> 00:05:36,200 That takes two arguments, according to its documentation. 120 00:05:36,200 --> 00:05:38,840 The name of the file you want open and then the second argument 121 00:05:38,840 --> 00:05:40,760 is how you want to open the file. 122 00:05:40,760 --> 00:05:42,650 And even if you've never seen this before, 123 00:05:42,650 --> 00:05:45,740 what might the w there represent? 124 00:05:45,740 --> 00:05:46,699 AUDIENCE: [INAUDIBLE] 125 00:05:46,699 --> 00:05:47,740 DAVID MALAN: Yeah, right. 126 00:05:47,740 --> 00:05:51,140 So read and write are two of the most common operations for a computer. 127 00:05:51,140 --> 00:05:53,040 r would be read, w would be write. 128 00:05:53,040 --> 00:05:57,050 And this kind of makes sense if the goal now is to save these students to a file 129 00:05:57,050 --> 00:06:00,020 so that the program is actually useful if you run it again and again. 130 00:06:00,020 --> 00:06:03,110 So here we have a new data type on the left. 131 00:06:03,110 --> 00:06:06,260 It's all caps, which is a bit of an anomaly, even in C. 132 00:06:06,260 --> 00:06:10,700 But FILE *file just says, hey, give me a variable of type FILE * 133 00:06:10,700 --> 00:06:13,880 that can store the address of a file, so to speak. 134 00:06:13,880 --> 00:06:16,820 And technically, that's not the address of the file on disk. 135 00:06:16,820 --> 00:06:19,441 That's the address in RAM once you've opened the file.
136 00:06:19,441 --> 00:06:22,190 But for now, just assume that this is an abstraction for the file. 137 00:06:22,190 --> 00:06:23,720 And it's just called literally file. 138 00:06:23,720 --> 00:06:27,452 So what is line 25's purpose in life? 139 00:06:27,452 --> 00:06:29,660 Even though we've never written this code before now. 140 00:06:29,660 --> 00:06:32,010 141 00:06:32,010 --> 00:06:33,010 Yeah, what do you think? 142 00:06:33,010 --> 00:06:34,680 AUDIENCE: If the file exists. 143 00:06:34,680 --> 00:06:38,760 DAVID MALAN: If the file exists, or more generally, if the file was successfully 144 00:06:38,760 --> 00:06:39,540 opened. 145 00:06:39,540 --> 00:06:41,831 Because there could be bunches of things that go wrong. 146 00:06:41,831 --> 00:06:43,960 One, as you've implied, the file might not exist. 147 00:06:43,960 --> 00:06:47,190 Two, maybe you don't have permission so it exists but you can't open it. 148 00:06:47,190 --> 00:06:49,440 Three, maybe there's not enough memory in the computer 149 00:06:49,440 --> 00:06:50,920 to open yet one more file. 150 00:06:50,920 --> 00:06:52,740 So any number of things could go wrong. 151 00:06:52,740 --> 00:06:58,600 And recall, what is the special sentinel value that typically represents errors 152 00:06:58,600 --> 00:07:01,040 now that we're in the world of pointers? 153 00:07:01,040 --> 00:07:05,580 NULL, so N-U-L-L in all caps is sort of the opposite of having a valid address. 154 00:07:05,580 --> 00:07:06,660 It means no such address. 155 00:07:06,660 --> 00:07:08,035 So we use it to check for errors. 156 00:07:08,035 --> 00:07:09,750 So that's all line 25 is doing. 157 00:07:09,750 --> 00:07:13,350 And then the rest of this is almost identical to the previous program, 158 00:07:13,350 --> 00:07:16,800 except we're not using printf. What are we apparently using? 159 00:07:16,800 --> 00:07:21,010 fprintf. And take a guess as to what the first f in fprintf stands for?
160 00:07:21,010 --> 00:07:24,060 Yeah, so file printf. So it works almost the same 161 00:07:24,060 --> 00:07:25,680 except it takes one more argument. 162 00:07:25,680 --> 00:07:27,900 The very first argument now is not a format string 163 00:07:27,900 --> 00:07:29,640 like it's been for ages now. 164 00:07:29,640 --> 00:07:34,800 It's instead a variable that references an open file and then the rest of this 165 00:07:34,800 --> 00:07:35,480 is all the same. 166 00:07:35,480 --> 00:07:37,980 So what's really cool about fprintf is that you don't just 167 00:07:37,980 --> 00:07:38,820 print to the screen. 168 00:07:38,820 --> 00:07:40,830 You actually print to whatever file you've 169 00:07:40,830 --> 00:07:43,020 opened so that then, on the very last line 170 00:07:43,020 --> 00:07:47,430 here, when I call fclose, that's equivalent to saving the file. 171 00:07:47,430 --> 00:07:48,990 And then the program just quits. 172 00:07:48,990 --> 00:07:50,790 And so what's neat about this in the end is 173 00:07:50,790 --> 00:07:53,040 if I go ahead and scroll up here, and let 174 00:07:53,040 --> 00:07:56,200 me go into my source five directory. 175 00:07:56,200 --> 00:08:02,980 Let me go ahead and make struct1. 176 00:08:02,980 --> 00:08:04,820 Structs.h not found. 177 00:08:04,820 --> 00:08:05,590 What did I do? 178 00:08:05,590 --> 00:08:08,590 Well, I just screwed up before class and misnamed this file 179 00:08:08,590 --> 00:08:11,080 so that didn't just happen. 180 00:08:11,080 --> 00:08:12,670 So now I've compiled the program. 181 00:08:12,670 --> 00:08:17,560 All right, so now let me go ahead and run this, ./struct1. Enrollment will be 182 00:08:17,560 --> 00:08:18,550 three. 183 00:08:18,550 --> 00:08:28,100 And we'll say that it'll be Maria who is in Cabot House and Brian who 184 00:08:28,100 --> 00:08:29,380 is in Winthrop. 185 00:08:29,380 --> 00:08:33,010 And say David was, say, in Mather.
186 00:08:33,010 --> 00:08:34,929 Enter and nothing seems to happen. 187 00:08:34,929 --> 00:08:38,630 But if I type ls now and look inside this directory, 188 00:08:38,630 --> 00:08:41,860 notice that I have a file called students.csv deliberately named 189 00:08:41,860 --> 00:08:43,960 because if you've ever used Excel or Numbers, 190 00:08:43,960 --> 00:08:48,610 a very common file format is what's called CSV, the comma-separated values 191 00:08:48,610 --> 00:08:49,270 format. 192 00:08:49,270 --> 00:08:52,390 And this is sort of like a very simplistic database. 193 00:08:52,390 --> 00:08:55,930 If I open this you'll see that indeed, the contents of this file 194 00:08:55,930 --> 00:08:58,760 are separated by commas. 195 00:08:58,760 --> 00:09:04,090 And if I were to actually open this file up in Excel, each of these columns 196 00:09:04,090 --> 00:09:06,920 would open up visually in exactly that. 197 00:09:06,920 --> 00:09:09,690 So what I did with my printf, if I go back to structs1.c, 198 00:09:09,690 --> 00:09:14,590 notice that I consciously included that comma there, 199 00:09:14,590 --> 00:09:17,560 to create this sort of faux database format. 200 00:09:17,560 --> 00:09:19,750 And just for good measure, let me see if I 201 00:09:19,750 --> 00:09:23,050 go to download from the IDE's file manager 202 00:09:23,050 --> 00:09:26,260 and I go ahead and open up students.csv. 203 00:09:26,260 --> 00:09:32,830 And then if the program cooperates here, we have Microsoft Excel. 204 00:09:32,830 --> 00:09:35,240 And now I've made myself a tiny little spreadsheet. 205 00:09:35,240 --> 00:09:36,520 Now using C code. 206 00:09:36,520 --> 00:09:39,130 Now we're going to find pretty quickly that this is not 207 00:09:39,130 --> 00:09:40,840 all that useful to make CSV files. 208 00:09:40,840 --> 00:09:44,025 Because the more and more rows we add to these files, the slower and slower 209 00:09:44,025 --> 00:09:45,400 it's going to get to search them.
210 00:09:45,400 --> 00:09:48,670 And so before long, as we transition next week and beyond to web programming, 211 00:09:48,670 --> 00:09:52,450 we're actually going to replace spreadsheets or CSVs like this 212 00:09:52,450 --> 00:09:56,860 and actually replace them with something more powerful, namely databases. 213 00:09:56,860 --> 00:09:59,210 So that's a teaser then of what's to come. 214 00:09:59,210 --> 00:10:00,991 But where did we begin this conversation? 215 00:10:00,991 --> 00:10:02,740 It all kind of keeps coming back to what's 216 00:10:02,740 --> 00:10:05,870 inside of our computer, which we can continue abstracting away. 217 00:10:05,870 --> 00:10:08,080 You don't have to understand how this hardware works. 218 00:10:08,080 --> 00:10:10,600 But we previously had said that you can at least think 219 00:10:10,600 --> 00:10:13,360 about chopping up your computer's memory into a grid 220 00:10:13,360 --> 00:10:15,370 so that you can just number the bytes. 221 00:10:15,370 --> 00:10:17,320 So that you have specific locations otherwise 222 00:10:17,320 --> 00:10:19,420 known as addresses or pointers. 223 00:10:19,420 --> 00:10:22,840 Last time we clarified that not all memory is 224 00:10:22,840 --> 00:10:24,640 treated equally inside of the computer. 225 00:10:24,640 --> 00:10:27,630 Rather, different chunks of memory are used differently. 226 00:10:27,630 --> 00:10:32,200 So the top portion, so to speak, but there is no notion of top in reality. 227 00:10:32,200 --> 00:10:34,140 This is just an artist's rendition. 228 00:10:34,140 --> 00:10:35,890 So the top of your computer's memory might 229 00:10:35,890 --> 00:10:40,600 be the heap, whereby you store certain types of values 230 00:10:40,600 --> 00:10:42,700 and then down here is the so-called stack where 231 00:10:42,700 --> 00:10:44,690 you store other types of values. 232 00:10:44,690 --> 00:10:48,590 And if we zoom out there's actually different layers still of memory.
233 00:10:48,590 --> 00:10:51,010 So let's actually tease apart what's going on here. 234 00:10:51,010 --> 00:10:53,410 If, when you run a program, you have access 235 00:10:53,410 --> 00:10:56,170 to a gigabyte of RAM or two gigabytes, and indeed, that's 236 00:10:56,170 --> 00:10:57,310 what your Mac or PC does. 237 00:10:57,310 --> 00:10:59,184 No matter how much RAM you have, the computer 238 00:10:59,184 --> 00:11:02,890 typically gives you the illusion of having access to all of it. 239 00:11:02,890 --> 00:11:05,500 And so this might be two gigabytes, then, of memory. 240 00:11:05,500 --> 00:11:07,270 Well, one of the first things that happens 241 00:11:07,270 --> 00:11:10,330 is that the zeros and ones that compose your program, whether it's 242 00:11:10,330 --> 00:11:14,680 called A.out or Caesar or Vigenere or Structs One, those zeros and ones 243 00:11:14,680 --> 00:11:18,140 are loaded way up top here in your computer's memory. 244 00:11:18,140 --> 00:11:21,100 So the text segment in memory is a weird name 245 00:11:21,100 --> 00:11:23,320 for the zeros and ones of your actual program. 246 00:11:23,320 --> 00:11:24,610 It's not ASCII text. 247 00:11:24,610 --> 00:11:28,180 It's like literally zeros and ones of your compiled program. 248 00:11:28,180 --> 00:11:31,900 Below that are what are generally called initialized data or uninitialized data. 249 00:11:31,900 --> 00:11:34,510 And this essentially just means any global variables 250 00:11:34,510 --> 00:11:38,230 you have in your program are stored here or here. 251 00:11:38,230 --> 00:11:40,910 If you gave them values at the top of your program, 252 00:11:40,910 --> 00:11:42,582 they're initialized by definition. 253 00:11:42,582 --> 00:11:44,290 And if you didn't, they're uninitialized. 254 00:11:44,290 --> 00:11:47,230 So the compiler just kind of lays those out a little bit differently. 
255 00:11:47,230 --> 00:11:49,580 At the very bottom are something called environment variables, 256 00:11:49,580 --> 00:11:51,400 which we don't use too much but you'll use them 257 00:11:51,400 --> 00:11:52,858 in a few weeks for web programming. 258 00:11:52,858 --> 00:11:56,500 You'll often store things like user names and passwords or other values 259 00:11:56,500 --> 00:11:58,430 that you don't want to save in your code. 260 00:11:58,430 --> 00:12:02,440 But you want the Mac or PC or server to somehow have access to. 261 00:12:02,440 --> 00:12:05,811 But these are the ones we'll talk about the most, stack and heap. 262 00:12:05,811 --> 00:12:08,560 And we saw a couple of examples of each of these, though, briefly. 263 00:12:08,560 --> 00:12:11,976 What did we use the stack for or claim it's used for last time? 264 00:12:11,976 --> 00:12:12,850 AUDIENCE: [INAUDIBLE] 265 00:12:12,850 --> 00:12:14,046 DAVID MALAN: Say again? 266 00:12:14,046 --> 00:12:14,420 AUDIENCE: [INAUDIBLE] 267 00:12:14,420 --> 00:12:16,450 DAVID MALAN: int main(void), and more generally, 268 00:12:16,450 --> 00:12:18,130 functions when they are called. 269 00:12:18,130 --> 00:12:22,120 Main is, of course, the go-to one for all programs' beginnings. 270 00:12:22,120 --> 00:12:25,780 But it might call another function like swap. 271 00:12:25,780 --> 00:12:29,650 And swap itself might call something else, maybe printf gets called. 272 00:12:29,650 --> 00:12:34,990 So every time you call a function, it gets a slice of or a frame of memory. 273 00:12:34,990 --> 00:12:38,920 And they go up and up and up as those functions get called. 274 00:12:38,920 --> 00:12:43,870 And this was ultimately illuminating, at least theoretically, 275 00:12:43,870 --> 00:12:46,330 as to why this program was broken. 276 00:12:46,330 --> 00:12:49,900 The program we looked at last time, the swap and no swap programs, 277 00:12:49,900 --> 00:12:51,940 we claim that this implementation was wrong.
278 00:12:51,940 --> 00:12:55,480 And yet I think when Kate came up and we did the example with switching 279 00:12:55,480 --> 00:12:57,580 the Gatorade flavors, this is pretty much 280 00:12:57,580 --> 00:13:00,220 an interpretation of that into code. 281 00:13:00,220 --> 00:13:03,630 And it's correct in one sense and it's incorrect in another. 282 00:13:03,630 --> 00:13:06,510 In what sense was this code actually correct? 283 00:13:06,510 --> 00:13:07,890 In the no swap program. 284 00:13:07,890 --> 00:13:10,804 Because we did walk through it briefly with the debugger. 285 00:13:10,804 --> 00:13:13,070 AUDIENCE: [INAUDIBLE] 286 00:13:13,070 --> 00:13:15,870 DAVID MALAN: OK, we needed a temporary container or variable in which 287 00:13:15,870 --> 00:13:18,960 to store one of the values or one of the Gatorade flavors. 288 00:13:18,960 --> 00:13:21,240 And by the time we got to this third and final line 289 00:13:21,240 --> 00:13:24,900 in this function, what could you say about A and B? 290 00:13:24,900 --> 00:13:26,070 AUDIENCE: [INAUDIBLE] 291 00:13:26,070 --> 00:13:27,810 DAVID MALAN: Yeah, they had in fact been swapped. 292 00:13:27,810 --> 00:13:31,060 And I saw that, I think, by plugging in, I think I struggled with the debugger 293 00:13:31,060 --> 00:13:32,940 so I used eprintf at the last minute just 294 00:13:32,940 --> 00:13:36,120 to see what had happened after that very third line. 295 00:13:36,120 --> 00:13:37,110 So it works. 296 00:13:37,110 --> 00:13:40,586 This function does swap A and B. But it did not swap what? 297 00:13:40,586 --> 00:13:41,460 AUDIENCE: [INAUDIBLE] 298 00:13:41,460 --> 00:13:44,040 DAVID MALAN: X and Y, which were the variables in main.
299 00:13:44,040 --> 00:13:47,010 And so recall the story we told last time, 300 00:13:47,010 --> 00:13:49,680 was that if we focus only on your computer stack, 301 00:13:49,680 --> 00:13:52,890 that sort of bottom portion of memory, when main is called, 302 00:13:52,890 --> 00:13:55,157 it gets a chunk of memory down here at the bottom 303 00:13:55,157 --> 00:13:57,240 because it's the very first function to be called. 304 00:13:57,240 --> 00:14:01,680 And it had variables, recall, called X and Y whose values were one and two. 305 00:14:01,680 --> 00:14:05,150 When main called swap, the other function we just saw, 306 00:14:05,150 --> 00:14:12,180 it had values called A, B, and also temp that initially were one and two. 307 00:14:12,180 --> 00:14:14,844 And eventually became two and one. 308 00:14:14,844 --> 00:14:17,010 But that picture kind of answers the whole question. 309 00:14:17,010 --> 00:14:19,620 The reason X and Y didn't change is because you literally 310 00:14:19,620 --> 00:14:23,370 change in that red version, just A and B. 311 00:14:23,370 --> 00:14:27,416 So we solved this problem, recall, last time with what new feature of C? 312 00:14:27,416 --> 00:14:28,290 AUDIENCE: [INAUDIBLE] 313 00:14:28,290 --> 00:14:29,820 DAVID MALAN: Pointers, so addresses. 314 00:14:29,820 --> 00:14:33,510 Rather than just hand the function, the values you want to swap, 315 00:14:33,510 --> 00:14:36,690 give the function a road map, so to speak, to those values 316 00:14:36,690 --> 00:14:40,140 so that the function can go to the values you actually care about 317 00:14:40,140 --> 00:14:42,010 and move them wherever they are. 318 00:14:42,010 --> 00:14:43,871 And it's a strange looking syntax at first. 319 00:14:43,871 --> 00:14:45,870 It looks like multiplication all over the place. 320 00:14:45,870 --> 00:14:47,520 But it had two different uses. 
321 00:14:47,520 --> 00:14:52,350 If you have the star or asterisk up here and a data type like int next to it, 322 00:14:52,350 --> 00:14:53,760 this is saying, hey computer. 323 00:14:53,760 --> 00:14:57,060 Give me a variable called A. But that's not going to store an int. 324 00:14:57,060 --> 00:14:58,525 It's going to store what? 325 00:14:58,525 --> 00:14:59,400 AUDIENCE: [INAUDIBLE] 326 00:14:59,400 --> 00:15:02,520 DAVID MALAN: Yeah, the memory address of an int, just as B 327 00:15:02,520 --> 00:15:05,424 is going to store the address of another integer. 328 00:15:05,424 --> 00:15:07,590 So those are kind of placeholders for the addresses. 329 00:15:07,590 --> 00:15:09,300 So down the road is the Science Center. 330 00:15:09,300 --> 00:15:12,300 So if the address of the Science Center is 1 Oxford Street, Cambridge, 331 00:15:12,300 --> 00:15:16,290 Mass 02138, that is its postal address and it uniquely identifies that 332 00:15:16,290 --> 00:15:16,920 building. 333 00:15:16,920 --> 00:15:20,697 Similarly inside of a computer do values have unique addresses. 334 00:15:20,697 --> 00:15:21,780 They're just much simpler. 335 00:15:21,780 --> 00:15:24,738 They're numbers, they don't have streets and zip codes and all of that. 336 00:15:24,738 --> 00:15:26,590 But it's the same exact idea. 337 00:15:26,590 --> 00:15:29,867 So here we still have a variable temp of type int. 338 00:15:29,867 --> 00:15:31,950 So give me a temporary variable just like Kate had 339 00:15:31,950 --> 00:15:34,310 the extra cup that was initially empty. 340 00:15:34,310 --> 00:15:38,240 *A, though, without a data type to the left of it, was like saying what? 341 00:15:38,240 --> 00:15:41,850 342 00:15:41,850 --> 00:15:45,710 *A in this context is a sort of different English sentence than this 343 00:15:45,710 --> 00:15:46,280 one.
344 00:15:46,280 --> 00:15:50,480 This means give me a pointer to an int or declare for me a variable 345 00:15:50,480 --> 00:15:52,430 that will store the address of an int. 346 00:15:52,430 --> 00:15:55,340 But this says, go to that address. 347 00:15:55,340 --> 00:15:58,730 So it means A is an address, *A means go to that address so you can get 348 00:15:58,730 --> 00:16:01,970 at the value, which probably in the story, is one. 349 00:16:01,970 --> 00:16:03,260 *A means the same thing. 350 00:16:03,260 --> 00:16:04,850 Again, go to that address. 351 00:16:04,850 --> 00:16:07,010 And then, by the way, go to that address B. 352 00:16:07,010 --> 00:16:11,120 Whatever that value is, put it where this finger is already pointing. 353 00:16:11,120 --> 00:16:15,710 And then *B means go to that address and put whatever was in the temporary cup 354 00:16:15,710 --> 00:16:20,010 of Gatorade, or in this case, the value one. 355 00:16:20,010 --> 00:16:22,730 So pointers, though kind of a very convoluted way 356 00:16:22,730 --> 00:16:26,390 of fixing this, solve the problem fundamentally, because now 357 00:16:26,390 --> 00:16:32,450 rather than passing one and two, we instead passed in here the address of X 358 00:16:32,450 --> 00:16:35,600 and the address of Y that allowed the computer 359 00:16:35,600 --> 00:16:38,540 to then go to those locations in memory and actually 360 00:16:38,540 --> 00:16:42,000 do what it is we wanted it to do. 361 00:16:42,000 --> 00:16:45,290 So long story short, this is how the computer's stack is used. 362 00:16:45,290 --> 00:16:48,080 When you call a function, it gets a new slice of memory on top 363 00:16:48,080 --> 00:16:49,750 of whatever function called it. 364 00:16:49,750 --> 00:16:52,280 As soon as that function is done executing, 365 00:16:52,280 --> 00:16:54,890 this memory effectively goes away. 366 00:16:54,890 --> 00:16:57,290 It's technically still there, because it's hardware.
367 00:16:57,290 --> 00:16:59,510 It's not going to physically disappear. 368 00:16:59,510 --> 00:17:04,069 But now whatever next function main calls, maybe printf, maybe something 369 00:17:04,069 --> 00:17:08,030 else, it will reuse this memory in any number of ways 370 00:17:08,030 --> 00:17:11,750 that it wants for its own local values and parameters. 371 00:17:11,750 --> 00:17:14,180 So given that definition, the fact that the stack, kind 372 00:17:14,180 --> 00:17:16,644 of like trays in the cafeteria, grows up and down 373 00:17:16,644 --> 00:17:19,310 and up and down as a program calls functions and those functions 374 00:17:19,310 --> 00:17:22,960 return, where do garbage values actually come from? 375 00:17:22,960 --> 00:17:26,322 376 00:17:26,322 --> 00:17:27,599 AUDIENCE: [INAUDIBLE] 377 00:17:27,599 --> 00:17:29,640 DAVID MALAN: Old things that were in that memory. 378 00:17:29,640 --> 00:17:31,681 So I kind of made this sweeping claim in the past 379 00:17:31,681 --> 00:17:34,410 that you shouldn't trust variables if you yourself 380 00:17:34,410 --> 00:17:37,450 haven't put values in them, because they have so-called garbage values. 381 00:17:37,450 --> 00:17:39,510 But that actually has a very precise meaning. 382 00:17:39,510 --> 00:17:41,700 That means if you have called a function and it 383 00:17:41,700 --> 00:17:45,330 needs a variable that just happens to be here, and this is like a minute 384 00:17:45,330 --> 00:17:48,270 into your program's running, a whole bunch of different functions 385 00:17:48,270 --> 00:17:51,900 might have used and unused, used and unused that portion of memory. 386 00:17:51,900 --> 00:17:56,190 So they're just going to have zeros and ones lingering there in some pattern. 387 00:17:56,190 --> 00:18:00,210 The computer could be really defensive and it could just change 388 00:18:00,210 --> 00:18:02,430 these all to zero bits all the time.
389 00:18:02,430 --> 00:18:04,680 And that could have been a reasonable design decision, 390 00:18:04,680 --> 00:18:06,389 but long story short, C does not do that. 391 00:18:06,389 --> 00:18:08,888 It would have just been time consuming, especially years ago 392 00:18:08,888 --> 00:18:10,200 when computers were slower. 393 00:18:10,200 --> 00:18:11,430 The language was younger. 394 00:18:11,430 --> 00:18:13,210 And it just wasn't compelling to do that. 395 00:18:13,210 --> 00:18:15,480 So you just get garbage values, which I have typically 396 00:18:15,480 --> 00:18:16,920 just written as question marks. 397 00:18:16,920 --> 00:18:17,810 But that's why. 398 00:18:17,810 --> 00:18:21,900 There are garbage values there because they're your own previous values, 399 00:18:21,900 --> 00:18:23,890 or those of some functions. 400 00:18:23,890 --> 00:18:27,430 So the heap, meanwhile, was fundamentally different. 401 00:18:27,430 --> 00:18:30,090 So the heap is this upper portion of memory 402 00:18:30,090 --> 00:18:34,470 that is in some sense conceptually above the stack. 403 00:18:34,470 --> 00:18:35,580 And it's up here. 404 00:18:35,580 --> 00:18:39,120 And that's different in the sense that it's more for long term storage. 405 00:18:39,120 --> 00:18:41,060 The stack is for short term storage, just 406 00:18:41,060 --> 00:18:43,320 to use locally when a function is executing. 407 00:18:43,320 --> 00:18:45,720 But suppose your program is to run for a while, 408 00:18:45,720 --> 00:18:49,080 or suppose you want a function to allocate memory that does stick around 409 00:18:49,080 --> 00:18:51,670 and does not just immediately become garbage values. 410 00:18:51,670 --> 00:18:53,310 In fact, think about GetString. 411 00:18:53,310 --> 00:18:55,860 GetString is a function we wrote and its purpose in life 412 00:18:55,860 --> 00:18:59,280 is to get a string from the user, which is a whole bunch of characters. 413 00:18:59,280 --> 00:19:00,150 And consider this. 
414 00:19:00,150 --> 00:19:05,910 If GetString is called and therefore gets a slice of memory on the stack, 415 00:19:05,910 --> 00:19:12,210 and I type in Maria's name, M-A-R-I-A, and then that gets a secret \0. 416 00:19:12,210 --> 00:19:16,800 Where is the M-A-R-I-A and \0 stored? 417 00:19:16,800 --> 00:19:20,280 Can it be, by this definition, stored on the stack? 418 00:19:20,280 --> 00:19:22,398 Why? 419 00:19:22,398 --> 00:19:24,445 AUDIENCE: [INAUDIBLE] 420 00:19:24,445 --> 00:19:25,320 DAVID MALAN: Exactly. 421 00:19:25,320 --> 00:19:29,970 So if we allocated space on the stack, which we could do with an array, 422 00:19:29,970 --> 00:19:32,580 and then return the address of that string, 423 00:19:32,580 --> 00:19:35,970 that would be valid in some sense, because you 424 00:19:35,970 --> 00:19:37,830 would have allocated the memory. 425 00:19:37,830 --> 00:19:39,600 And I'll go ahead and draw it like this. 426 00:19:39,600 --> 00:19:44,340 If we had a function, GetString, being called, and GetString's 427 00:19:44,340 --> 00:19:48,660 being called by main, that might mean that GetString has this chunk of memory 428 00:19:48,660 --> 00:19:49,320 here. 429 00:19:49,320 --> 00:19:52,080 And if Maria or I go ahead and type in her name, 430 00:19:52,080 --> 00:19:57,270 that's like allocating space for M-A-R-I-A \0. 431 00:19:57,270 --> 00:20:00,520 And you can think of this as just being a whole bunch of bytes in that frame. 432 00:20:00,520 --> 00:20:02,894 So they do exist and they literally are stored in memory. 433 00:20:02,894 --> 00:20:06,510 And I could return the address of this first byte, whatever this is. 434 00:20:06,510 --> 00:20:09,490 Maybe this is byte 10 or 100 or whatnot. 435 00:20:09,490 --> 00:20:13,110 And I could return that address to main as GetString's return value. 436 00:20:13,110 --> 00:20:16,530 But as soon as I do that, yeah, exactly. 437 00:20:16,530 --> 00:20:19,140 The memory doesn't technically go anywhere.
438 00:20:19,140 --> 00:20:21,390 But it's no longer trustworthy. 439 00:20:21,390 --> 00:20:23,190 All of that is now garbage value. 440 00:20:23,190 --> 00:20:24,630 So you might get lucky. 441 00:20:24,630 --> 00:20:28,530 And if you try to print GetString's return value you might see Maria 442 00:20:28,530 --> 00:20:31,800 but maybe briefly, because the next time you call a line of code 443 00:20:31,800 --> 00:20:34,440 or somehow that memory's reused, Maria's name 444 00:20:34,440 --> 00:20:36,360 might get overwritten with some other values 445 00:20:36,360 --> 00:20:39,264 because her name becomes, by definition, a garbage value. 446 00:20:39,264 --> 00:20:41,680 And you don't know when it's actually going to get reused. 447 00:20:41,680 --> 00:20:43,000 So that's not safe. 448 00:20:43,000 --> 00:20:44,610 So this is why the heap exists. 449 00:20:44,610 --> 00:20:48,990 If you need to keep your memory around for a while, like GetString is supposed 450 00:20:48,990 --> 00:20:54,570 to do, turns out you can allocate it just elsewhere that won't disappear 451 00:20:54,570 --> 00:20:56,040 until you yourself free it. 452 00:20:56,040 --> 00:20:59,190 And so that's when we introduced last time a couple of new functions, 453 00:20:59,190 --> 00:21:02,250 malloc for memory allocation and it turns out 454 00:21:02,250 --> 00:21:04,410 there's an opposite of it, free, which you'll 455 00:21:04,410 --> 00:21:08,100 need to use for future problem sets dealing with memory management 456 00:21:08,100 --> 00:21:10,237 in order to undo the allocation here. 457 00:21:10,237 --> 00:21:12,570 Otherwise you end up having what's called a memory leak. 458 00:21:12,570 --> 00:21:14,970 And the computer might slow down, run out of memory, 459 00:21:14,970 --> 00:21:16,510 because you're not giving it back. 460 00:21:16,510 --> 00:21:18,884 And as an aside, it turns out there's a couple of cousins 461 00:21:18,884 --> 00:21:21,390 of malloc, calloc and realloc. 
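A minimal sketch of the heap-based approach described here. The helper name copy_to_heap is hypothetical (it is not from the lecture or the CS50 library); it only illustrates that memory obtained from malloc stays valid after the function returns, until the caller frees it.

```c
#include <stdlib.h>
#include <string.h>

// Hypothetical helper: return a heap-allocated copy of s, or NULL if
// malloc fails. Unlike a local array, this memory outlives the call.
char *copy_to_heap(const char *s)
{
    size_t n = strlen(s) + 1;   // +1 for the terminating '\0'
    char *p = malloc(n);        // allocated on the heap, not in this stack frame
    if (p == NULL)
    {
        return NULL;
    }
    memcpy(p, s, n);            // copy the characters and the '\0'
    return p;                   // caller must eventually call free(p)
}
```

A caller might write char *name = copy_to_heap("Maria"); use name for as long as needed, and then call free(name); skipping that last step is the memory leak mentioned above.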
462 00:21:21,390 --> 00:21:24,630 Calloc is kind of cool in that the C means clear. 463 00:21:24,630 --> 00:21:28,900 So calloc is much like malloc, but it zeros the entire chunk of memory 464 00:21:28,900 --> 00:21:29,400 for you. 465 00:21:29,400 --> 00:21:32,760 So if you just want to initialize to have no garbage values whatsoever, 466 00:21:32,760 --> 00:21:36,750 you can use calloc instead of doing it yourself with a for loop or something 467 00:21:36,750 --> 00:21:37,530 like that. 468 00:21:37,530 --> 00:21:40,170 Realloc, we're going to see, is a more powerful function 469 00:21:40,170 --> 00:21:43,860 that allows you to take a chunk of memory and somehow grow it. 470 00:21:43,860 --> 00:21:46,470 But we'll see what that actually means in a moment. 471 00:21:46,470 --> 00:21:49,150 But with this power comes great responsibility. 472 00:21:49,150 --> 00:21:52,260 And we saw that things can go horribly wrong for Binky 473 00:21:52,260 --> 00:21:54,390 when you misuse memory addresses. 474 00:21:54,390 --> 00:21:57,870 And recall that we looked briefly at this program by way of Nick's video 475 00:21:57,870 --> 00:21:59,070 from Stanford. 476 00:21:59,070 --> 00:22:02,720 And let's see what these lines of code actually represent here. 477 00:22:02,720 --> 00:22:07,530 So here and here I'm declaring two variables, X and Y, 478 00:22:07,530 --> 00:22:11,760 that are going to store what, generally speaking? 479 00:22:11,760 --> 00:22:14,190 Addresses of integers. 480 00:22:14,190 --> 00:22:16,180 So that's all that's happening there. 481 00:22:16,180 --> 00:22:18,390 This, now, was a new line of code last time, 482 00:22:18,390 --> 00:22:21,334 where it's saying, call malloc, so allocate some amount of memory. 483 00:22:21,334 --> 00:22:22,500 How much memory do you want? 484 00:22:22,500 --> 00:22:24,420 Whatever the size of an int is.
485 00:22:24,420 --> 00:22:28,800 Odds are it's going to be four, maybe eight, some power of two 486 00:22:28,800 --> 00:22:32,550 or some multiple of two here, or a multiple of four. 487 00:22:32,550 --> 00:22:34,332 So here we get back what? 488 00:22:34,332 --> 00:22:36,540 A chunk of memory or, specifically, its address, and we 489 00:22:36,540 --> 00:22:40,620 store that in X. Meanwhile this line says, go to that address 490 00:22:40,620 --> 00:22:43,320 and put the special number 42 there. 491 00:22:43,320 --> 00:22:47,730 This next line blindly says, go to the address in Y 492 00:22:47,730 --> 00:22:50,190 and put the unlucky number 13 there. 493 00:22:50,190 --> 00:22:54,480 But that's where Binky had an accident because what 494 00:22:54,480 --> 00:22:56,580 was inside of Y at that moment in time? 495 00:22:56,580 --> 00:22:59,418 496 00:22:59,418 --> 00:23:00,294 AUDIENCE: [INAUDIBLE] 497 00:23:00,294 --> 00:23:02,251 DAVID MALAN: Memory it wasn't allowed to touch. 498 00:23:02,251 --> 00:23:02,790 And why? 499 00:23:02,790 --> 00:23:05,654 Be more precise. 500 00:23:05,654 --> 00:23:13,094 AUDIENCE: [INAUDIBLE] 501 00:23:13,094 --> 00:23:14,010 DAVID MALAN: So close. 502 00:23:14,010 --> 00:23:17,339 That's going to explain the symptom ultimately but? 503 00:23:17,339 --> 00:23:19,155 AUDIENCE: [INAUDIBLE] 504 00:23:19,155 --> 00:23:20,030 DAVID MALAN: Exactly. 505 00:23:20,030 --> 00:23:21,740 I didn't ask for any memory whatsoever. 506 00:23:21,740 --> 00:23:24,320 So by default, even though this looks funky, 507 00:23:24,320 --> 00:23:27,410 int *y just means give me space for an address. 508 00:23:27,410 --> 00:23:31,460 So that means, give me four or eight bytes of memory in the computer's RAM. 509 00:23:31,460 --> 00:23:35,020 But what is inside of a variable by default, have we claimed? 510 00:23:35,020 --> 00:23:35,980 Just garbage values. 511 00:23:35,980 --> 00:23:38,960 So there's going to be a number there, and numbers are addresses.
512 00:23:38,960 --> 00:23:41,126 So it's going to look like there's an address there. 513 00:23:41,126 --> 00:23:43,470 So it is technically correct to say, go there. 514 00:23:43,470 --> 00:23:46,550 But it's like following a map where you have no idea where you're going. 515 00:23:46,550 --> 00:23:49,220 You might sort of walk off the edge of wherever you are. 516 00:23:49,220 --> 00:23:51,050 And that's when bad things happen. 517 00:23:51,050 --> 00:23:54,437 And so the visualization that Nick put together with claymation was this. 518 00:23:54,437 --> 00:23:56,270 If you have this, and it turns out it doesn't 519 00:23:56,270 --> 00:23:58,380 matter if the star is on the left or on the right here, 520 00:23:58,380 --> 00:24:00,410 but we have conventionally put it over onto the right side, 521 00:24:00,410 --> 00:24:01,640 next to the variable. 522 00:24:01,640 --> 00:24:04,970 When you do int *X and int *Y, that's like saying, 523 00:24:04,970 --> 00:24:08,780 give me a chunk of memory or clay for each of these variables. 524 00:24:08,780 --> 00:24:12,230 And he just kind of circled the little arrowhead there on the string 525 00:24:12,230 --> 00:24:13,672 because there's memory for it. 526 00:24:13,672 --> 00:24:15,380 It's just not pointing anywhere specific. 527 00:24:15,380 --> 00:24:17,660 It's a garbage value at this moment in time. 528 00:24:17,660 --> 00:24:20,280 The next chapter in this story was this. 529 00:24:20,280 --> 00:24:23,960 Allocate space for an int, drawn in white clay here. 530 00:24:23,960 --> 00:24:27,744 And Nick, because of the assignment, said X, which again is a pointer, 531 00:24:27,744 --> 00:24:29,660 is now going to point to that chunk of memory. 532 00:24:29,660 --> 00:24:31,250 So it's no longer a garbage value. 533 00:24:31,250 --> 00:24:32,840 It points somewhere specific. 534 00:24:32,840 --> 00:24:35,360 That is why Nick was then able to say, go there.
535 00:24:35,360 --> 00:24:38,650 Follow the arrow, and put the number 42 there. 536 00:24:38,650 --> 00:24:42,530 But the next line of code, this one went horribly wrong 537 00:24:42,530 --> 00:24:45,550 because Y was not pointing anywhere valid. 538 00:24:45,550 --> 00:24:49,820 Nick tried to say, go there and put 13, but there is nowhere valid to go, so 539 00:24:49,820 --> 00:24:50,900 the program crashes. 540 00:24:50,900 --> 00:24:54,650 A segmentation fault, meaning that you touched a segment of memory 541 00:24:54,650 --> 00:24:57,530 that you should not have because you tried to go somewhere 542 00:24:57,530 --> 00:24:59,600 where it was just some garbage value. 543 00:24:59,600 --> 00:25:03,050 And the fix, recall, might be this, or a solution. 544 00:25:03,050 --> 00:25:05,900 If we instead kind of rewind and fix Binky 545 00:25:05,900 --> 00:25:10,820 and say Y equals X, that's not allocating extra space. 546 00:25:10,820 --> 00:25:16,371 That's just saying, have Y point at the same chunk of memory as X. 547 00:25:16,371 --> 00:25:18,120 Because again, X and Y are just addresses. 548 00:25:18,120 --> 00:25:22,520 So if the address is 100 in memory, now X is 100, Y is 100. 549 00:25:22,520 --> 00:25:24,920 They're both pointing at the same chunk of white clay. 550 00:25:24,920 --> 00:25:29,360 So if he then did *Y gets 13, that says, go there. 551 00:25:29,360 --> 00:25:30,590 Update the number. 552 00:25:30,590 --> 00:25:32,390 And now 42 became 13. 553 00:25:32,390 --> 00:25:36,170 Very similar in spirit, in fact, to our capitalization example, 554 00:25:36,170 --> 00:25:41,510 when we pointed two strings last time at the same chunk of memory. 555 00:25:41,510 --> 00:25:47,720 So any questions on the stack or, as depicted here by Binky, the heap? 556 00:25:47,720 --> 00:25:50,360 Malloc allocates memory from the heap.
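The corrected sequence from the Binky story can be sketched roughly as follows. The function name binky_fixed and the NULL check are additions for illustration, not part of the program shown in the video; the buggy line is kept only as a comment because dereferencing an uninitialized pointer is undefined behavior.

```c
#include <stdlib.h>

// Sketch of the fixed pointer sequence. Returns the final value stored
// in the pointee (13), or -1 if malloc fails.
int binky_fixed(void)
{
    int *x;                     // space for an address, currently a garbage value
    int *y;                     // likewise: not yet safe to dereference

    x = malloc(sizeof(int));    // x now points at a real chunk of heap memory
    if (x == NULL)
    {
        return -1;
    }

    *x = 42;                    // go to that address and store 42

    // *y = 13;                 // the bug: y still holds a garbage address

    y = x;                      // copy the address; no new memory is allocated
    *y = 13;                    // go to the same chunk and overwrite 42 with 13

    int result = *x;            // x and y point at the same int, so this is 13
    free(x);                    // give the memory back exactly once
    return result;
}
```

Because y = x copies only the address, there is still just one allocated int, which is why a single call to free suffices.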
557 00:25:50,360 --> 00:25:53,870 But anytime you declare local variables or arrays 558 00:25:53,870 --> 00:25:56,120 inside of a function, that ends up on the stack. 559 00:25:56,120 --> 00:25:59,660 And thus far malloc is the only tool, or calloc or realloc, 560 00:25:59,660 --> 00:26:03,080 that gives us access to this new portion of memory depicted in white clay 561 00:26:03,080 --> 00:26:07,760 here and sort of depicted in our diagram up above. 562 00:26:07,760 --> 00:26:08,370 Any questions? 563 00:26:08,370 --> 00:26:10,450 Yeah. 564 00:26:10,450 --> 00:26:19,366 AUDIENCE: [INAUDIBLE] 565 00:26:19,366 --> 00:26:20,490 DAVID MALAN: Good question. 566 00:26:20,490 --> 00:26:23,540 So keep in mind that X and Y are both the same data type. 567 00:26:23,540 --> 00:26:25,310 They're both pointers, that is, addresses. 568 00:26:25,310 --> 00:26:28,010 So as such, if you're going to set one equal to the other, 569 00:26:28,010 --> 00:26:32,570 you have to just store the value that's in one inside of the other. 570 00:26:32,570 --> 00:26:34,700 Otherwise it would be trying to put an integer 571 00:26:34,700 --> 00:26:37,640 inside of a pointer, which isn't quite correct even though they're 572 00:26:37,640 --> 00:26:39,330 technically both numbers. 573 00:26:39,330 --> 00:26:43,460 So this is just saying, whatever address is in X, 574 00:26:43,460 --> 00:26:47,690 put that same address in Y. It says nothing about going to that address. 575 00:26:47,690 --> 00:26:51,410 It's like making a photocopy of a map but not actually following 576 00:26:51,410 --> 00:26:52,340 that map yet. 577 00:26:52,340 --> 00:26:56,090 Until this line, which then says, go follow the copy of the map. 578 00:26:56,090 --> 00:26:58,630 And it turns out it leads you to the same location. 579 00:26:58,630 --> 00:26:59,708 Yeah? 580 00:26:59,708 --> 00:27:05,870 AUDIENCE: [INAUDIBLE] 581 00:27:05,870 --> 00:27:09,690 DAVID MALAN: So if you return, in C you can only return one value.
582 00:27:09,690 --> 00:27:13,590 So you're kind of in a bad spot because you ideally want to return two values, 583 00:27:13,590 --> 00:27:14,090 right? 584 00:27:14,090 --> 00:27:16,700 Both A and B; you want to return B and A. There 585 00:27:16,700 --> 00:27:19,820 are ways you can work around that and kind of sort of return two values 586 00:27:19,820 --> 00:27:20,319 in C. 587 00:27:20,319 --> 00:27:23,150 But in short, it's much easier said than done. 588 00:27:23,150 --> 00:27:25,830 Python will actually make that much easier. 589 00:27:25,830 --> 00:27:28,550 But the short of it in C is you can't return multiple values. 590 00:27:28,550 --> 00:27:31,800 And that ties our hands in this case. 591 00:27:31,800 --> 00:27:34,860 OK, so odds are, related to all of this, you've 592 00:27:34,860 --> 00:27:37,637 heard about this website, which is enormously useful when you're 593 00:27:37,637 --> 00:27:40,470 trying to learn some new language, when you're out in the real world 594 00:27:40,470 --> 00:27:42,287 trying to solve some problem because this 595 00:27:42,287 --> 00:27:45,120 is this wonderful community of people who post questions and answers 596 00:27:45,120 --> 00:27:48,570 and ideally link to canonical references so you can kind of understand 597 00:27:48,570 --> 00:27:50,590 why some answer is the way it is. 598 00:27:50,590 --> 00:27:53,920 But this actually has a very technical meaning, Stack Overflow. 599 00:27:53,920 --> 00:27:56,310 And stack, of course, is now a familiar concept. 600 00:27:56,310 --> 00:28:00,630 And you can imagine something like this picture eventually 601 00:28:00,630 --> 00:28:07,230 going very wrong if you call so many functions that you just kind of run out 602 00:28:07,230 --> 00:28:07,920 of memory. 603 00:28:07,920 --> 00:28:10,252 Or not even run out per se.
604 00:28:10,252 --> 00:28:12,960 What are you going to hit eventually if you keep calling function 605 00:28:12,960 --> 00:28:15,590 after function after function? 606 00:28:15,590 --> 00:28:16,200 The heap. 607 00:28:16,200 --> 00:28:17,866 And then bad things are going to happen. 608 00:28:17,866 --> 00:28:21,120 If your stack frames and your local variables 609 00:28:21,120 --> 00:28:24,480 are so numerous that you start overwriting the heap memory, 610 00:28:24,480 --> 00:28:29,380 now values that you allocated with malloc might themselves get clobbered. 611 00:28:29,380 --> 00:28:32,590 So this is not the best design decision but it is the reality. 612 00:28:32,590 --> 00:28:35,040 And it's mitigated only by using a compiler that 613 00:28:35,040 --> 00:28:37,980 might help you notice this or by actually using 614 00:28:37,980 --> 00:28:39,820 more memory than you actually need. 615 00:28:39,820 --> 00:28:44,170 Now, stack overflow actually has a very technical meaning here, 616 00:28:44,170 --> 00:28:45,690 as does heap overflow. 617 00:28:45,690 --> 00:28:47,640 Stack overflow means you overflow the stack. 618 00:28:47,640 --> 00:28:50,610 You just call so many darn functions that you just touch memory you 619 00:28:50,610 --> 00:28:51,130 shouldn't. 620 00:28:51,130 --> 00:28:52,630 Heap overflow would be the opposite. 621 00:28:52,630 --> 00:28:54,505 You keep calling malloc and malloc and malloc 622 00:28:54,505 --> 00:28:56,460 and the heap overflows into the stack. 623 00:28:56,460 --> 00:28:59,526 Because for better or for worse, the stack is growing this way 624 00:28:59,526 --> 00:29:00,900 and the heap is growing this way. 625 00:29:00,900 --> 00:29:03,600 And eventually they're going to strike each other. 626 00:29:03,600 --> 00:29:07,470 So these are specific cases of what are more generally called buffer overflows. 627 00:29:07,470 --> 00:29:11,517 A buffer is just a chunk of memory that stores data or values.
628 00:29:11,517 --> 00:29:13,350 People in the real world might have heard 629 00:29:13,350 --> 00:29:16,740 of this in the context of video, YouTube, or various video players, 630 00:29:16,740 --> 00:29:19,407 if you've ever seen the expression buffering, dot dot dot. 631 00:29:19,407 --> 00:29:20,740 It's the most infuriating thing. 632 00:29:20,740 --> 00:29:22,800 Something's about to happen in the movie or show you're watching. 633 00:29:22,800 --> 00:29:25,710 And the damn thing starts buffering, buffering, buffering. 634 00:29:25,710 --> 00:29:26,830 What does that mean? 635 00:29:26,830 --> 00:29:28,980 Well, the video player, YouTube or something else, 636 00:29:28,980 --> 00:29:32,090 has a chunk of memory, which you can think of as an array. 637 00:29:32,090 --> 00:29:34,527 And loaded into that array are the zeros and ones 638 00:29:34,527 --> 00:29:36,610 that compose the movie or TV show you're watching. 639 00:29:36,610 --> 00:29:38,443 And those were downloaded over the internet. 640 00:29:38,443 --> 00:29:41,640 And what happens is, hopefully your internet speed is faster 641 00:29:41,640 --> 00:29:43,530 than the movie's own playback. 642 00:29:43,530 --> 00:29:46,710 So that even though you might be at minute 10 in the video, 643 00:29:46,710 --> 00:29:50,460 hopefully your computer has downloaded minute 10 through 11 644 00:29:50,460 --> 00:29:52,649 so that you have this built-up buffer of bytes, 645 00:29:52,649 --> 00:29:54,690 so that you have a whole minute that you can watch, 646 00:29:54,690 --> 00:29:57,090 even if your internet connection goes out. 647 00:29:57,090 --> 00:30:01,290 But when your video is buffering, it means you have this array of memory 648 00:30:01,290 --> 00:30:04,410 and you've kind of looked at or watched all of the bytes in it. 649 00:30:04,410 --> 00:30:06,030 And the buffer is now empty. 650 00:30:06,030 --> 00:30:07,650 But the opposite can happen, too.
651 00:30:07,650 --> 00:30:11,070 If you try downloading more bytes than you have memory for, 652 00:30:11,070 --> 00:30:13,680 you might try putting minutes of the video someplace they 653 00:30:13,680 --> 00:30:15,780 don't belong in your computer. 654 00:30:15,780 --> 00:30:19,230 Or if you call too many functions or if you call malloc too many times, 655 00:30:19,230 --> 00:30:21,810 you might overflow the chunk of memory that's been allocated. 656 00:30:21,810 --> 00:30:23,310 So buffers are all over the place. 657 00:30:23,310 --> 00:30:26,580 And indeed, a string as we know it is just a buffer. 658 00:30:26,580 --> 00:30:28,530 It's an array of memory and hopefully you 659 00:30:28,530 --> 00:30:32,520 will only put as many characters in that string 660 00:30:32,520 --> 00:30:34,600 as can fit in that chunk of memory. 661 00:30:34,600 --> 00:30:36,150 So what kinds of things can go wrong? 662 00:30:36,150 --> 00:30:37,920 This is a bit of a contrived example, but it 663 00:30:37,920 --> 00:30:39,753 comes with a couple of visuals just to paint 664 00:30:39,753 --> 00:30:44,430 a picture of how adversaries can hack into systems 665 00:30:44,430 --> 00:30:48,180 that are written in languages like C. So here's a quick program. 666 00:30:48,180 --> 00:30:49,800 We're going to include string.h. 667 00:30:49,800 --> 00:30:53,730 And down here we have int main that takes command line arguments. 668 00:30:53,730 --> 00:30:56,220 Notice this function does not do any error checking at all. 669 00:30:56,220 --> 00:30:57,240 It's pretty stupid. 670 00:30:57,240 --> 00:31:01,350 It just calls a function foo and passes in argv[1]. 671 00:31:01,350 --> 00:31:03,540 So the idea here is that this is a program 672 00:31:03,540 --> 00:31:06,750 that, if you give it a command line argument, a word after the program's 673 00:31:06,750 --> 00:31:09,900 name, just blindly passes it into the foo function. 674 00:31:09,900 --> 00:31:10,440 OK.
675 00:31:10,440 --> 00:31:12,030 So next, what does foo do? 676 00:31:12,030 --> 00:31:17,220 It accepts as input a string, a.k.a. char *, and we're just calling it bar. 677 00:31:17,220 --> 00:31:23,077 And then it allocates an array on the stack called C of size 12. 678 00:31:23,077 --> 00:31:25,410 And then even if you've never seen this function before, 679 00:31:25,410 --> 00:31:29,490 you can maybe kind of infer from its name, memcpy, like memory copy. 680 00:31:29,490 --> 00:31:33,420 So it turns out this is going to copy into this memory whatever is 681 00:31:33,420 --> 00:31:36,750 in this memory up to this many bytes. 682 00:31:36,750 --> 00:31:42,540 So if I type in Maria as the command line argument, M-A-R-I-A is five. 683 00:31:42,540 --> 00:31:45,090 So that means the length I typed in is going to be five. 684 00:31:45,090 --> 00:31:52,140 And this is going to copy five bytes from bar into C. That's it. 685 00:31:52,140 --> 00:31:53,640 Now it's meant to be just a demonstration. 686 00:31:53,640 --> 00:31:55,848 This program is pretty useless at the end of the day. 687 00:31:55,848 --> 00:31:59,670 But it's kind of distilled a threat into the fewest lines of code. 688 00:31:59,670 --> 00:32:02,230 What does this actually look like or what's happening? 689 00:32:02,230 --> 00:32:03,660 We've called the function. 690 00:32:03,660 --> 00:32:05,970 We've allocated 12 bytes. 691 00:32:05,970 --> 00:32:08,670 We've copied those five bytes into those 12 bytes. 692 00:32:08,670 --> 00:32:10,140 So all is well in this story. 693 00:32:10,140 --> 00:32:12,670 694 00:32:12,670 --> 00:32:14,320 But what actually happens in memory? 695 00:32:14,320 --> 00:32:18,360 So here's a picture of the stack kind of zoomed in and nicely colorized. 696 00:32:18,360 --> 00:32:19,920 So stack is going this way. 697 00:32:19,920 --> 00:32:21,480 Heap is growing this way. 698 00:32:21,480 --> 00:32:25,320 And this is just showing you technically how things are laid out on the stack.
699 00:32:25,320 --> 00:32:27,900 I keep kind of simplifying the world by just drawing things 700 00:32:27,900 --> 00:32:31,430 as X and Y and A and B. But they actually follow a precise order. 701 00:32:31,430 --> 00:32:33,800 So specifically, if we have a local variable 702 00:32:33,800 --> 00:32:35,660 called bar, which we did for this function, 703 00:32:35,660 --> 00:32:38,060 it goes right there at the bottom. 704 00:32:38,060 --> 00:32:41,776 If you then declare an array called C, it goes right up there on top. 705 00:32:41,776 --> 00:32:43,150 And these are sized proportional. 706 00:32:43,150 --> 00:32:43,910 This is four bytes. 707 00:32:43,910 --> 00:32:45,118 This is going to be 12 bytes. 708 00:32:45,118 --> 00:32:47,870 So it all is kind of proportional size. 709 00:32:47,870 --> 00:32:50,550 And then it turns out, and we won't go into too much detail, 710 00:32:50,550 --> 00:32:53,420 but if you like this stuff, CS 61 and other classes will explore it, 711 00:32:53,420 --> 00:32:57,140 it turns out another thing that has always been tucked away on the stack 712 00:32:57,140 --> 00:33:01,080 secretly is what's called a return address. 713 00:33:01,080 --> 00:33:05,390 So when main calls swap, it's like handing the keys to the car 714 00:33:05,390 --> 00:33:06,320 off to someone else. 715 00:33:06,320 --> 00:33:08,210 Like swap, go do your thing. 716 00:33:08,210 --> 00:33:12,290 But main kind of has to tell swap or any function it calls, 717 00:33:12,290 --> 00:33:17,600 how to get back to its chunk of memory so that execution can resume with main 718 00:33:17,600 --> 00:33:20,090 as soon as swap is done executing. 719 00:33:20,090 --> 00:33:22,280 And it's not its stack memory, per se. 720 00:33:22,280 --> 00:33:25,500 Recall that top portion of memory that I described as the text segment? 721 00:33:25,500 --> 00:33:28,040 All the zeros and ones that compose your program? 
722 00:33:28,040 --> 00:33:32,360 It turns out that main, when it calls swap or some other function, 723 00:33:32,360 --> 00:33:35,630 tucks its own return address, the address of the appropriate zeros 724 00:33:35,630 --> 00:33:40,640 and ones in that text segment, into four bytes here, or maybe eight bytes, 725 00:33:40,640 --> 00:33:44,382 the address to which swap should hand the keys back, so to speak. 726 00:33:44,382 --> 00:33:47,090 Otherwise it's like main handing the keys off to another function 727 00:33:47,090 --> 00:33:49,881 and then it never hears from it again so main's other lines of code 728 00:33:49,881 --> 00:33:51,030 never get executed. 729 00:33:51,030 --> 00:33:55,640 So long story short, there needs to be an address or a map tucked away 730 00:33:55,640 --> 00:33:59,060 on the stack so that swap can hand control back to main. 731 00:33:59,060 --> 00:34:02,847 But what happens here when you actually use this memory? 732 00:34:02,847 --> 00:34:05,930 Well, it turns out that if we just number the bytes on the stack, and that 733 00:34:05,930 --> 00:34:09,199 was of size 12, the first one is zero, and the last one is 11. 734 00:34:09,199 --> 00:34:11,570 So zero through 11 gives us 12 total. 735 00:34:11,570 --> 00:34:16,190 So if we type in something like Maria or maybe more generally, hello, H-E-L-L-O, 736 00:34:16,190 --> 00:34:19,219 which is the same length, that's using six bytes technically, 737 00:34:19,219 --> 00:34:21,170 because of the \0, and all is well. 738 00:34:21,170 --> 00:34:25,070 Fits comfortably in C and we've got room to breathe. 739 00:34:25,070 --> 00:34:29,429 But what if we don't type in Maria or hello? 740 00:34:29,429 --> 00:34:35,090 What if we type in a very long sentence that's more than 12 characters? 741 00:34:35,090 --> 00:34:36,860 Where are they going to end up?
742 00:34:36,860 --> 00:34:40,370 If you type in a longer string at the command line in argv[1], 743 00:34:40,370 --> 00:34:42,650 notice the code is flawed. 744 00:34:42,650 --> 00:34:46,310 You're going to check the length of the word that the user typed in, 745 00:34:46,310 --> 00:34:50,120 and copy all of its bytes from bar into C. But what 746 00:34:50,120 --> 00:34:53,400 if the length of the string you typed in is 13? 747 00:34:53,400 --> 00:34:55,020 What are you going to do? 748 00:34:55,020 --> 00:34:59,000 You're going to copy 13 bytes from bar into C, 749 00:34:59,000 --> 00:35:02,450 thereby filling all of these 12 bytes plus one 750 00:35:02,450 --> 00:35:04,440 more that you shouldn't be touching. 751 00:35:04,440 --> 00:35:06,350 And if the string is even longer than 13, 752 00:35:06,350 --> 00:35:11,610 if the adversary really typed a long sentence or phrase or word or whatnot, 753 00:35:11,610 --> 00:35:14,000 you're going to really exceed the boundaries 754 00:35:14,000 --> 00:35:15,870 of that buffer or that array. 755 00:35:15,870 --> 00:35:18,710 So what does this look like? Well, if you type in a much longer word, 756 00:35:18,710 --> 00:35:23,780 like A-A-A-A-A-A-A, you could end up overwriting these 12 bytes, 757 00:35:23,780 --> 00:35:27,500 also these four bytes, also these green bytes, whatever they are, 758 00:35:27,500 --> 00:35:31,820 and most importantly, even the red bytes that I described as the return address. 759 00:35:31,820 --> 00:35:34,880 Now A-A-A-A-A really isn't going to cause anyone any trouble 760 00:35:34,880 --> 00:35:37,460 because it's just a sequence of random ASCII characters. 761 00:35:37,460 --> 00:35:39,920 But characters at the end of the day are just numbers, 762 00:35:39,920 --> 00:35:43,400 and numbers are just bits, and programs are just bits.
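One way to close the hole described here is to check the input's length against the buffer's size before copying. The sketch below assumes the program on the slide copies strlen(bar) bytes into a 12-byte array named c, as the narration suggests; the function name foo_checked and its return codes are hypothetical and are not part of the original program.

```c
#include <stddef.h>
#include <string.h>

// Sketch of a bounds-checked variant of foo.
// Returns 0 if bar (plus its '\0') fit into the 12-byte buffer,
// or -1 if the input was too long and nothing was copied.
int foo_checked(const char *bar)
{
    char c[12];
    size_t len = strlen(bar);

    // The flawed version described above would do, unconditionally:
    //     memcpy(c, bar, strlen(bar));
    // which writes past c whenever the input is longer than 12 bytes.

    if (len >= sizeof(c))       // no room for len characters plus '\0'
    {
        return -1;
    }

    memcpy(c, bar, len + 1);    // safe: at most 12 bytes, including '\0'
    return 0;
}
```

With this check, "hello" (six bytes with its terminator) is copied, while a string of 12 or more characters is rejected instead of spilling into neighboring stack memory such as the return address.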
763 00:35:43,400 --> 00:35:47,180 So the threat here is that if you're a pretty sophisticated adversary, someone 764 00:35:47,180 --> 00:35:49,580 who really knows programming, you could technically 765 00:35:49,580 --> 00:35:52,430 write a program that does something really bad like delete 766 00:35:52,430 --> 00:35:53,870 all the files on a hard drive. 767 00:35:53,870 --> 00:35:55,676 Or send spam to everyone in your contacts. 768 00:35:55,676 --> 00:35:57,800 Or anything like that because at the end of the day 769 00:35:57,800 --> 00:36:00,800 the program that he or she has just written is just zeros and ones. 770 00:36:00,800 --> 00:36:04,730 If you then convert that program's zeros and ones to the corresponding, 771 00:36:04,730 --> 00:36:10,010 even if weird-looking, ASCII values, you could technically type a program 772 00:36:10,010 --> 00:36:14,239 at the command line in argv[1] just by typing out the funky characters 773 00:36:14,239 --> 00:36:17,030 on the keys that are not going to make sense to a human reading it. 774 00:36:17,030 --> 00:36:20,180 But those ASCII characters in the context of a program 775 00:36:20,180 --> 00:36:22,650 are going to be interpreted as code. 776 00:36:22,650 --> 00:36:24,850 And if you're really good, and frankly, it's 777 00:36:24,850 --> 00:36:26,350 not so much that you're really good. 778 00:36:26,350 --> 00:36:29,240 If by a lot of trial and error, you happen 779 00:36:29,240 --> 00:36:32,360 to overwrite the return address in a program, 780 00:36:32,360 --> 00:36:34,580 you can trick the computer into not returning 781 00:36:34,580 --> 00:36:39,750 back to main, but jumping to the very input you passed into the program. 782 00:36:39,750 --> 00:36:42,930 So A here implies attack, like attack code.
783 00:36:42,930 --> 00:36:46,520 So if you're really clever, you can pass in an appropriate pattern of zeros 784 00:36:46,520 --> 00:36:50,600 and ones, convert it to ASCII so the human can type them in at the prompt, 785 00:36:50,600 --> 00:36:54,350 overwrite this return address, and trick the computer program 786 00:36:54,350 --> 00:36:58,910 to return from this function not to main, but to like, this byte up here. 787 00:36:58,910 --> 00:37:01,460 And maybe this byte coupled with all of these others 788 00:37:01,460 --> 00:37:03,760 means delete all the files on this user's hard drive, 789 00:37:03,760 --> 00:37:06,110 send spam to everyone in their contacts. 790 00:37:06,110 --> 00:37:08,750 This is called a buffer overflow exploit, 791 00:37:08,750 --> 00:37:11,880 and it's incredibly shockingly common even these days. 792 00:37:11,880 --> 00:37:15,440 C is not commonly used for a lot of programs but it still is everywhere. 793 00:37:15,440 --> 00:37:18,456 And there's other languages, too, like C++ that lend itself to this. 794 00:37:18,456 --> 00:37:20,330 And even though this is still a little arcane 795 00:37:20,330 --> 00:37:22,490 and you don't need to worry too much about the addresses on the screen, 796 00:37:22,490 --> 00:37:25,130 the fundamental threat here is that if you do not 797 00:37:25,130 --> 00:37:29,000 check the boundaries of your arrays and the amount of memory you've allocated 798 00:37:29,000 --> 00:37:32,570 and you touch memory you should not, very bad things can happen. 799 00:37:32,570 --> 00:37:34,490 You're effectively giving control to anyone 800 00:37:34,490 --> 00:37:38,870 on the internet who can use your program because he or she can be clever enough 801 00:37:38,870 --> 00:37:45,491 to inject their own zeros and ones into your program for execution. 802 00:37:45,491 --> 00:37:45,990 OK. 803 00:37:45,990 --> 00:37:49,670 So dear God, this is scary in a computer science sense. 
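One way to guard against the overflow just described is to check the input's length against the buffer's size before copying. The following is only a minimal sketch, not the code shown in lecture: the function name safe_copy and its return convention are assumptions made for illustration, and unlike the lecture's version it also reserves one byte for a terminating NUL.

```c
#include <stddef.h>
#include <string.h>

// Hypothetical helper (not from the lecture): copy src into dst without
// ever writing more than size bytes, and always NUL-terminate the result.
// Returns 1 if all of src fit, 0 if it had to be truncated.
int safe_copy(char *dst, size_t size, const char *src)
{
    if (size == 0)
    {
        return 0; // no room even for the terminator
    }

    size_t n = strlen(src);
    int fits = n < size; // need n bytes plus 1 for '\0'
    if (!fits)
    {
        n = size - 1; // truncate instead of overflowing
    }

    memcpy(dst, src, n);
    dst[n] = '\0';
    return fits;
}
```

With a 12-byte buffer like the lecture's c, a call such as safe_copy(c, sizeof c, argv[1]) would write at most 12 bytes no matter how long the command-line argument is, so the neighboring bytes and the return address are left alone.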
804 00:37:49,670 --> 00:37:53,247 So what can we do to defend against this beyond just not writing bugs, 805 00:37:53,247 --> 00:37:54,830 which is never going to happen, right? 806 00:37:54,830 --> 00:37:57,050 Even the most advanced, best programmers still 807 00:37:57,050 --> 00:37:59,940 make bugs, especially as the software gets more and more complicated. 808 00:37:59,940 --> 00:38:03,050 We have eprintf and we have help50 and we have debug50 809 00:38:03,050 --> 00:38:05,900 but there's other tools, too, like Valgrind 810 00:38:05,900 --> 00:38:09,320 which happens to be a tool for detecting memory leaks in a program 811 00:38:09,320 --> 00:38:11,190 and other memory-related issues. 812 00:38:11,190 --> 00:38:14,690 So let me actually go ahead and open this program, memory.c. 813 00:38:14,690 --> 00:38:16,670 And it looks like this. 814 00:38:16,670 --> 00:38:21,090 And let's see if we can't tease apart what is buggy about this program. 815 00:38:21,090 --> 00:38:22,510 So here's the program here. 816 00:38:22,510 --> 00:38:26,240 817 00:38:26,240 --> 00:38:31,570 So, include stdlib.h, function f, function main, main calls f 818 00:38:31,570 --> 00:38:32,310 and returns 0. 819 00:38:32,310 --> 00:38:34,060 Nothing really interesting going on there. 820 00:38:34,060 --> 00:38:35,200 So what's in f? 821 00:38:35,200 --> 00:38:39,460 F on the right hand side allocates space for 10 integers, I think. 822 00:38:39,460 --> 00:38:45,880 Malloc returns the address of that chunk of memory and stores it in x 823 00:38:45,880 --> 00:38:51,190 and then line 8 is the bug, I think. 824 00:38:51,190 --> 00:38:52,570 What's wrong with line 8? 825 00:38:52,570 --> 00:38:58,200 826 00:38:58,200 --> 00:39:00,150 Let me go here first. 
827 00:39:00,150 --> 00:39:04,039 AUDIENCE: [INAUDIBLE] 828 00:39:04,039 --> 00:39:06,330 DAVID MALAN: Exactly, because we start counting at zero 829 00:39:06,330 --> 00:39:08,829 and because we're asking the computer for space for 10 ints, 830 00:39:08,829 --> 00:39:09,930 we get them back. 831 00:39:09,930 --> 00:39:14,430 But that's going to be accessible via [0] through, as you say, [9]. 832 00:39:14,430 --> 00:39:19,320 So [10] is like writing one int past the buffer, a.k.a. 833 00:39:19,320 --> 00:39:20,610 a buffer overflow. 834 00:39:20,610 --> 00:39:23,610 Now it might be hard to see this, especially when the program isn't 835 00:39:23,610 --> 00:39:27,360 as relatively short as this but is buried in a dozen lines, 100 836 00:39:27,360 --> 00:39:29,010 lines, thousands of lines of code. 837 00:39:29,010 --> 00:39:30,040 But tools can help. 838 00:39:30,040 --> 00:39:32,820 So if I go ahead and make this program with make memory. 839 00:39:32,820 --> 00:39:40,050 And then I go ahead and execute valgrind ./memory, Enter, 840 00:39:40,050 --> 00:39:43,110 the one downside of this program is that its output is just completely 841 00:39:43,110 --> 00:39:43,890 overwhelming. 842 00:39:43,890 --> 00:39:47,910 But let's see if we can't tease apart some recognizable terms. 843 00:39:47,910 --> 00:39:50,320 So all of this on the left is a bit of a distraction. 844 00:39:50,320 --> 00:39:51,810 This is just copyright information. 845 00:39:51,810 --> 00:39:54,090 So the interesting stuff seems to happen here. 846 00:39:54,090 --> 00:39:56,430 I see invalid write of size 4. 847 00:39:56,430 --> 00:39:58,860 Not quite sure what all of this means, but I 848 00:39:58,860 --> 00:40:02,700 do see that it's somehow related to uh-oh, line 8 of memory.c, line 849 00:40:02,700 --> 00:40:07,470 13, line 8, line 13, a lot of bugs in the same places, it seems. 
850 00:40:07,470 --> 00:40:11,360 And then here, address such and such is 0 bytes after a block of size 851 00:40:11,360 --> 00:40:14,340 40 alloc'd, it's a little hard to wrap my mind around this. 852 00:40:14,340 --> 00:40:18,330 So as always, let's at least initially just run this through help50 853 00:40:18,330 --> 00:40:20,280 and see if it can help tease this apart. 854 00:40:20,280 --> 00:40:21,900 So we see the same output. 855 00:40:21,900 --> 00:40:24,270 It recognizes something familiar in yellow here. 856 00:40:24,270 --> 00:40:26,580 Invalid write of size four and it highlights the lines. 857 00:40:26,580 --> 00:40:29,010 And our TF-like feedback is, looks like you're 858 00:40:29,010 --> 00:40:32,889 trying to modify four bytes of memory that isn't yours, question mark? 859 00:40:32,889 --> 00:40:35,430 Did you try to store something beyond the bounds of an array? 860 00:40:35,430 --> 00:40:38,420 Take a closer look at line 8 of memory.c. 861 00:40:38,420 --> 00:40:40,170 So that's kind of a mouthful but it's just 862 00:40:40,170 --> 00:40:44,110 because we have practiced reading stuff that's pretty arcane like this. 863 00:40:44,110 --> 00:40:47,760 So we've extracted all of the salient details like line eight of memory.c. 864 00:40:47,760 --> 00:40:53,040 So line eight of memory.c, as we noted, is already the dangerous line. 865 00:40:53,040 --> 00:40:56,610 So what might it mean to have an invalid write of size four? 866 00:40:56,610 --> 00:41:01,260 Well, it turns out an int in the IDE is how many bytes? 867 00:41:01,260 --> 00:41:03,100 Four, or 32 bits. 868 00:41:03,100 --> 00:41:07,560 So invalid write of size four just means that this int here, zero, is an int, 869 00:41:07,560 --> 00:41:10,710 it's four bytes, it's just invalidly being written, as you say, 870 00:41:10,710 --> 00:41:11,940 to the wrong location. 871 00:41:11,940 --> 00:41:16,060 So this is Valgrind's pretty terse way of communicating that idea. 
872 00:41:16,060 --> 00:41:17,850 So here we have then an explanation. 873 00:41:17,850 --> 00:41:19,000 So how do I fix this? 874 00:41:19,000 --> 00:41:22,620 Well, if my intent was just to update the last area there, 875 00:41:22,620 --> 00:41:32,610 let me go ahead and do make memory enter, valgrind ./memory enter. 876 00:41:32,610 --> 00:41:37,710 And now this is a good thing except we've made some progress. 877 00:41:37,710 --> 00:41:39,750 Let me scroll up to fit more on the screen. 878 00:41:39,750 --> 00:41:42,840 So I got rid of that message, invalid write of size four, 879 00:41:42,840 --> 00:41:44,910 but this does not sound good either. 880 00:41:44,910 --> 00:41:49,450 40 bytes in one block are definitely lost in loss record one of one, 881 00:41:49,450 --> 00:41:49,950 all right? 882 00:41:49,950 --> 00:41:51,366 So I need a little help with that. 883 00:41:51,366 --> 00:41:54,630 So let me do help50 again until I get familiar with the syntax. 884 00:41:54,630 --> 00:41:57,360 And it's highlighted that chunk of output. 885 00:41:57,360 --> 00:41:59,520 Looks like your program leaked 40 bytes of memory. 886 00:41:59,520 --> 00:42:02,550 Did you forget to free memory that you allocated via malloc? 887 00:42:02,550 --> 00:42:05,830 Take a closer look at line seven of memory.c. 888 00:42:05,830 --> 00:42:07,290 So let's do exactly that. 889 00:42:07,290 --> 00:42:11,580 So line 7 of memory.c, OK, here's where I malloc the memory. 890 00:42:11,580 --> 00:42:15,930 And per help50's own feedback, what have I apparently not done? 891 00:42:15,930 --> 00:42:16,500 Freed it. 892 00:42:16,500 --> 00:42:18,958 And it turns out freeing is actually pretty straightforward 893 00:42:18,958 --> 00:42:20,370 so long as you remember to do it. 894 00:42:20,370 --> 00:42:22,590 You just call free, passing in the same pointer. 895 00:42:22,590 --> 00:42:24,350 You don't have to remember how long it is. 
896 00:42:24,350 --> 00:42:27,190 It's up to the operating system to remember how long it is. 897 00:42:27,190 --> 00:42:29,730 But now if I do make memory. 898 00:42:29,730 --> 00:42:35,430 And now I do again valgrind ./memory enter, heap summary, 899 00:42:35,430 --> 00:42:36,560 all heap blocks were freed. 900 00:42:36,560 --> 00:42:37,740 No leaks are possible. 901 00:42:37,740 --> 00:42:40,020 I see nothing particularly worrisome. 902 00:42:40,020 --> 00:42:41,460 And the program is bug free now. 903 00:42:41,460 --> 00:42:43,530 So Valgrind is another tool in your toolkit 904 00:42:43,530 --> 00:42:46,350 that doesn't help you find logical bugs per se. 905 00:42:46,350 --> 00:42:50,130 It helps you find memory-related errors, which might be logical bugs. 906 00:42:50,130 --> 00:42:53,670 But it helps you hone in on them and see them in a way that you as a human 907 00:42:53,670 --> 00:42:57,940 might not otherwise, especially if it's buried in many, many lines of code. 908 00:42:57,940 --> 00:43:01,050 Now you'll notice, too, real briefly, in Valgrind's output 909 00:43:01,050 --> 00:43:04,240 in these several examples, there are all of these funky numbers. 910 00:43:04,240 --> 00:43:07,780 So if I go back to the original version here just a moment ago, 911 00:43:07,780 --> 00:43:10,050 where it was in fact buggy in a couple of ways. 912 00:43:10,050 --> 00:43:13,410 And I rerun make and I rerun Valgrind, you'll 913 00:43:13,410 --> 00:43:15,780 see a whole bunch of things like this. 914 00:43:15,780 --> 00:43:21,870 At 0x such and such, at 0x such and such, 0x, what did 0x denote last time? 915 00:43:21,870 --> 00:43:23,970 So hexadecimal, so this is just a succinct 916 00:43:23,970 --> 00:43:27,450 if weird-looking way of representing numbers, generally memory addresses. 917 00:43:27,450 --> 00:43:32,340 And so this very specifically is saying that line 13 of memory.c 918 00:43:32,340 --> 00:43:35,400 happens to be using memory at this location. 
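Putting the two fixes from the Valgrind walkthrough together (write within bounds, then free what was malloc'd), the repaired function might look roughly like the sketch below. This is a reconstruction from the spoken description rather than the actual memory.c: the NULL check and the int return value are assumptions added here so the result can be checked.

```c
#include <stdlib.h>

// Sketch of the repaired f described in lecture. Returning int (0 on
// success, 1 if malloc failed) is an assumption added for illustration.
int f(void)
{
    // Space for 10 ints: 40 bytes when an int is 4 bytes
    int *x = malloc(10 * sizeof(int));
    if (x == NULL)
    {
        return 1;
    }

    // Valid indices are 0 through 9; x[10] would be the
    // "invalid write of size 4" that Valgrind reported
    x[9] = 0;

    // Without this call, Valgrind reports the 40 bytes as definitely lost
    free(x);
    return 0;
}
```

Compiled into a program whose main simply calls f, this version should produce neither the invalid write message nor the leak report when run under valgrind.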
919 00:43:35,400 --> 00:43:37,917 It's not particularly useful to us the programmers. 920 00:43:37,917 --> 00:43:39,000 But that's why you see it. 921 00:43:39,000 --> 00:43:41,370 And Valgrind is arguably a more advanced tool, 922 00:43:41,370 --> 00:43:45,330 which is to say that memory addresses in tools like this and even in debuggers 923 00:43:45,330 --> 00:43:48,810 tends to be written using hexadecimal notation like that. 924 00:43:48,810 --> 00:43:50,700 Of course, you've seen hexadecimal converted. 925 00:43:50,700 --> 00:43:53,070 Like these are the first three bytes in a JPEG, 926 00:43:53,070 --> 00:43:56,310 which are typically thought of using hexadecimal like this. 927 00:43:56,310 --> 00:43:59,730 But even though this looks new, it's the exact same idea. 928 00:43:59,730 --> 00:44:02,170 And I thought I'd tease perhaps with a joke 929 00:44:02,170 --> 00:44:07,140 that only a computer scientist can understand. 930 00:44:07,140 --> 00:44:10,800 OK, so that's a good one that goes around each year. 931 00:44:10,800 --> 00:44:13,350 So that of course is alluding to just these addresses. 932 00:44:13,350 --> 00:44:16,300 And now let me propose one other debugging technique and explain like, 933 00:44:16,300 --> 00:44:18,670 what the hell is going on here on stage today, too. 934 00:44:18,670 --> 00:44:22,080 So you have of course debug50, which is a tool for debugging and walking 935 00:44:22,080 --> 00:44:22,830 through code. 936 00:44:22,830 --> 00:44:24,750 And silly though this is, there is actually 937 00:44:24,750 --> 00:44:28,180 this thing in the world of programming, rubber duck debugging. 938 00:44:28,180 --> 00:44:31,175 This is, in the absence of having a TF or a CA to bounce ideas off of, 939 00:44:31,175 --> 00:44:34,050 this is in the absence of having a roommate around or roommate around 940 00:44:34,050 --> 00:44:36,390 who wants to talk to you about your code. 
941 00:44:36,390 --> 00:44:40,450 It's recommended that if you have some bug in your program, 942 00:44:40,450 --> 00:44:42,900 that you keep something like this on your desk. 943 00:44:42,900 --> 00:44:45,780 And in the absence of roommates and friends 944 00:44:45,780 --> 00:44:49,620 and hopefully doors closed, you talk to the rubber duck. 945 00:44:49,620 --> 00:44:53,100 And I feel silly even saying this but there's a Wikipedia article 946 00:44:53,100 --> 00:44:54,870 on this. It's a real thing. 947 00:44:54,870 --> 00:44:57,450 The idea here is that if you've ever been in office hours 948 00:44:57,450 --> 00:44:58,710 or you've been chatting with a TF or friend 949 00:44:58,710 --> 00:45:01,290 and just like talking about your code and talking about what 950 00:45:01,290 --> 00:45:04,710 it is you think your code is doing, just very often that act 951 00:45:04,710 --> 00:45:08,430 of saying something and hearing yourself say it can help reinforce one, 952 00:45:08,430 --> 00:45:10,200 what your code is in fact doing. 953 00:45:10,200 --> 00:45:12,450 Or if you realize verbally, wait a minute. 954 00:45:12,450 --> 00:45:14,910 What I just said does not seem to line up with the code, 955 00:45:14,910 --> 00:45:16,685 finally that light bulb goes off. 956 00:45:16,685 --> 00:45:18,060 And it doesn't have to be a duck. 957 00:45:18,060 --> 00:45:20,643 I mean, you can talk to the wall but that's a little stranger. 958 00:45:20,643 --> 00:45:24,090 So at least this is a personification of having someone like a colleague 959 00:45:24,090 --> 00:45:25,164 to talk to. 960 00:45:25,164 --> 00:45:27,330 So at the end of today or during break by all means, 961 00:45:27,330 --> 00:45:30,715 grab yourself a rubber duck the debugger and keep it on your shelf. 962 00:45:30,715 --> 00:45:32,340 It doesn't have to quite be this large. 963 00:45:32,340 --> 00:45:34,301 But this is a genuine debugging technique. 
964 00:45:34,301 --> 00:45:36,300 Like, in the absence of understanding something, 965 00:45:36,300 --> 00:45:39,870 don't necessarily turn only to CS50 discourse or to office hours 966 00:45:39,870 --> 00:45:41,220 or to sections or the like. 967 00:45:41,220 --> 00:45:43,350 Literally try talking yourself through it, 968 00:45:43,350 --> 00:45:45,180 even if it feels a little bit silly. 969 00:45:45,180 --> 00:45:47,520 And if it does really feel silly, just look at him 970 00:45:47,520 --> 00:45:49,530 and talk to yourself in your head perhaps. 971 00:45:49,530 --> 00:45:54,120 But that kind of enunciation of what your code is doing or should be doing 972 00:45:54,120 --> 00:45:57,516 will hopefully help all the more light bulbs go off. 973 00:45:57,516 --> 00:45:59,640 And eventually you can just keep them on your shelf 974 00:45:59,640 --> 00:46:01,655 and take off that training wheel as well. 975 00:46:01,655 --> 00:46:03,780 Let's go ahead and take our five minute break here. 976 00:46:03,780 --> 00:46:08,190 Grab a duck if you'd like and we'll come back with more. 977 00:46:08,190 --> 00:46:09,400 All right. 978 00:46:09,400 --> 00:46:11,920 So we're back and we keep thinking about memory. 979 00:46:11,920 --> 00:46:14,756 Is this generally laid out as having addresses, 980 00:46:14,756 --> 00:46:17,380 but of course we've clarified that a little bit in that we have 981 00:46:17,380 --> 00:46:19,540 more of a canvas at our disposal now. 982 00:46:19,540 --> 00:46:21,880 But even then we keep talking about having things back 983 00:46:21,880 --> 00:46:23,770 to back to back in memory. 984 00:46:23,770 --> 00:46:25,300 But that simply needn't be the case. 985 00:46:25,300 --> 00:46:28,390 Like, what we have now with pointers and with malloc 986 00:46:28,390 --> 00:46:31,360 and these kinds of functions is the ability to get memory from anywhere 987 00:46:31,360 --> 00:46:36,050 we want and somehow stitch it together or connect these things. 
988 00:46:36,050 --> 00:46:39,730 But how do we actually do that with the ingredients we now have? 989 00:46:39,730 --> 00:46:41,230 And why might we want to? 990 00:46:41,230 --> 00:46:43,822 So here is how we keep representing something like an array. 991 00:46:43,822 --> 00:46:46,030 An array, again, is just a contiguous chunk of memory 992 00:46:46,030 --> 00:46:48,730 where you store things literally back to back to back. 993 00:46:48,730 --> 00:46:52,840 But suppose that I've put six things into this array, six numbers, one, two, 994 00:46:52,840 --> 00:46:54,430 three, four, five, six. 995 00:46:54,430 --> 00:46:56,710 What happens if I try to put seven into this array? 996 00:46:56,710 --> 00:46:59,605 997 00:46:59,605 --> 00:47:01,480 What do I have to think about or worry about? 998 00:47:01,480 --> 00:47:04,524 999 00:47:04,524 --> 00:47:06,440 Touching memory that I'm not allowed to touch. 1000 00:47:06,440 --> 00:47:08,850 So I'd better not put it over here. 1001 00:47:08,850 --> 00:47:11,120 But what if seven must go in this array? 1002 00:47:11,120 --> 00:47:12,620 Well, I don't have too many options. 1003 00:47:12,620 --> 00:47:16,771 Like, if I fill the space I have to either overwrite some value or put it 1004 00:47:16,771 --> 00:47:18,770 somewhere it shouldn't be, and that should never 1005 00:47:18,770 --> 00:47:21,530 be an option because the program could or will crash. 1006 00:47:21,530 --> 00:47:25,080 And so I could alternatively just allocate more memory. 1007 00:47:25,080 --> 00:47:26,370 So how do I do that? 1008 00:47:26,370 --> 00:47:29,660 Well, if I've allocated this array initially to be a size six, 1009 00:47:29,660 --> 00:47:37,428 I could, in code, just allocate a new array of size seven and then do what? 1010 00:47:37,428 --> 00:47:38,790 AUDIENCE: [INAUDIBLE] 1011 00:47:38,790 --> 00:47:40,831 DAVID MALAN: Yeah, add all the numbers and seven. 
1012 00:47:40,831 --> 00:47:44,040 So I can take this array, I can give myself another one elsewhere in memory, 1013 00:47:44,040 --> 00:47:48,136 copy the values from old to new, then maybe free the old. 1014 00:47:48,136 --> 00:47:50,760 And then move on with my life because now I have enough memory. 1015 00:47:50,760 --> 00:47:52,189 Now, that fixes the problem. 1016 00:47:52,189 --> 00:47:53,980 And if we implemented it in code correctly, 1017 00:47:53,980 --> 00:47:58,050 it would be by nature correct, assuming there's enough memory in the computer. 1018 00:47:58,050 --> 00:48:01,387 But why is that arguably bad design? 1019 00:48:01,387 --> 00:48:02,720 AUDIENCE: It's a waste of space. 1020 00:48:02,720 --> 00:48:05,078 DAVID MALAN: It's a waste of space how? 1021 00:48:05,078 --> 00:48:09,047 AUDIENCE: [INAUDIBLE] 1022 00:48:09,047 --> 00:48:11,880 DAVID MALAN: Yeah, even though I don't keep both around in the story 1023 00:48:11,880 --> 00:48:14,550 I'm telling, it's temporarily pretty inefficient in that I'm 1024 00:48:14,550 --> 00:48:18,720 using twice as much memory as I actually need only to then kind of downscale. 1025 00:48:18,720 --> 00:48:19,740 What else? 1026 00:48:19,740 --> 00:48:21,323 Yeah. 1027 00:48:21,323 --> 00:48:24,469 AUDIENCE: [INAUDIBLE] 1028 00:48:24,469 --> 00:48:27,260 DAVID MALAN: Yeah, if I want to change the size of the array again, 1029 00:48:27,260 --> 00:48:31,190 whether bigger or even smaller perhaps, if I remove items from the list, then 1030 00:48:31,190 --> 00:48:34,100 I just have to keep allocating new memory, which is wasteful 1031 00:48:34,100 --> 00:48:36,530 and more importantly, it's not just space inefficient. 1032 00:48:36,530 --> 00:48:38,830 In what other sense is inefficient? 1033 00:48:38,830 --> 00:48:39,330 Time, why? 1034 00:48:39,330 --> 00:48:41,699 Where is the time coming from? 
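The grow-by-copying approach just described (allocate a bigger array, copy old to new, free the old) could be sketched as follows. The function name append_by_copy and its signature are hypothetical, chosen only for illustration, and the old array is assumed to have come from malloc.

```c
#include <stdlib.h>
#include <string.h>

// Hypothetical helper: returns a new heap array holding the n old values
// followed by value, and frees the old array. On allocation failure it
// returns NULL and leaves the old array untouched.
int *append_by_copy(int *old, size_t n, int value)
{
    // While this runs, both arrays exist at once: roughly double the memory
    int *bigger = malloc((n + 1) * sizeof(int));
    if (bigger == NULL)
    {
        return NULL;
    }

    // Copying every existing element is the linear-time cost
    if (n > 0)
    {
        memcpy(bigger, old, n * sizeof(int));
    }
    bigger[n] = value;

    free(old);
    return bigger;
}
```

Every call copies all n existing values, so adding a single number costs time proportional to the size of the whole array, and the new block must be available as one contiguous chunk.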
1035 00:48:41,699 --> 00:48:43,920 AUDIENCE: [INAUDIBLE] 1036 00:48:43,920 --> 00:48:45,860 DAVID MALAN: So that's an asymptotic notation. 1037 00:48:45,860 --> 00:48:49,330 This is copying something from one array to another would be in big O of what? 1038 00:48:49,330 --> 00:48:52,434 1039 00:48:52,434 --> 00:48:53,350 Big O of size, yeah. 1040 00:48:53,350 --> 00:48:55,930 So if we just genericize size as n it's like big O of n. 1041 00:48:55,930 --> 00:48:58,660 It's a linear time operation, which isn't horrible, right? 1042 00:48:58,660 --> 00:49:00,220 N squared is bad. 1043 00:49:00,220 --> 00:49:03,910 Linear has never really been bad but we already know that log n is better, 1044 00:49:03,910 --> 00:49:05,500 constant time is best. 1045 00:49:05,500 --> 00:49:09,340 And so just wasting any amount of time doesn't feel like optimal design. 1046 00:49:09,340 --> 00:49:12,640 And that all is a function of arrays being 1047 00:49:12,640 --> 00:49:16,780 a fixed size and contiguous in memory, back to back to back. 1048 00:49:16,780 --> 00:49:19,580 In fact, arguably there's one other issue that could occur. 1049 00:49:19,580 --> 00:49:21,850 It's not so much if you have a very small array. 1050 00:49:21,850 --> 00:49:25,540 But what if you have a huge amount of memory available 1051 00:49:25,540 --> 00:49:29,200 but it's only in size five or six increments? 1052 00:49:29,200 --> 00:49:31,000 Like, for whatever reasons your computer's 1053 00:49:31,000 --> 00:49:33,916 been using some of this memory, this memory, this memory, this memory, 1054 00:49:33,916 --> 00:49:36,850 and if you add up all the available memory there's a lot of free space 1055 00:49:36,850 --> 00:49:39,640 but they're always separated by memory that's in use. 1056 00:49:39,640 --> 00:49:42,790 So maybe this memory is free, then there's a byte that's in use. 1057 00:49:42,790 --> 00:49:44,780 This memory is free, there's a byte in use. 
1058 00:49:44,780 --> 00:49:47,380 So your memory is quote unquote very fragmented. 1059 00:49:47,380 --> 00:49:50,680 So you have lots of available memory but it's not contiguous. 1060 00:49:50,680 --> 00:49:55,420 You cannot, in this model, allocate an array of size seven if you don't have 1061 00:49:55,420 --> 00:49:57,910 that memory available contiguously. 1062 00:49:57,910 --> 00:50:00,624 So not as big of a concern given enough memory, 1063 00:50:00,624 --> 00:50:02,290 but at least something that could arise. 1064 00:50:02,290 --> 00:50:04,600 So let's introduce the solution. 1065 00:50:04,600 --> 00:50:06,610 Something here called a linked list. 1066 00:50:06,610 --> 00:50:08,860 And the name kind of describes what it is. 1067 00:50:08,860 --> 00:50:13,120 It's still a list of numbers but it's linked by way of these arrows. 1068 00:50:13,120 --> 00:50:14,440 And we've used arrows before. 1069 00:50:14,440 --> 00:50:18,620 What have we used arrows to represent in pictures past? 1070 00:50:18,620 --> 00:50:19,480 Yeah, so pointers. 1071 00:50:19,480 --> 00:50:24,070 So now that we have the expressiveness of pointers, you can kind of digitally 1072 00:50:24,070 --> 00:50:28,220 stitch your data structures together if you spend a little bit more memory. 1073 00:50:28,220 --> 00:50:30,510 So we've not really solved the problem you identified, 1074 00:50:30,510 --> 00:50:31,510 which was the space use. 1075 00:50:31,510 --> 00:50:33,550 But if you're tolerant of that and if you've 1076 00:50:33,550 --> 00:50:35,800 got enough memory at your disposal and can 1077 00:50:35,800 --> 00:50:38,740 afford to spend it, why don't we store for every number 1078 00:50:38,740 --> 00:50:42,490 not just the number but also space for a pointer? 1079 00:50:42,490 --> 00:50:45,010 So each of the boxes I've drawn here now doesn't just 1080 00:50:45,010 --> 00:50:47,160 have a box for the number itself n. 
1081 00:50:47,160 --> 00:50:50,830 It's got really two boxes together, one for n and one for something 1082 00:50:50,830 --> 00:50:54,250 we'll call next, which is going to be a pointer to 1083 00:50:54,250 --> 00:50:58,060 or equivalently the address of the next node, as we'll call it, 1084 00:50:58,060 --> 00:51:00,050 the next box in this list. 1085 00:51:00,050 --> 00:51:03,250 Now even though we've drawn it here very prettily from left to right, 1086 00:51:03,250 --> 00:51:05,980 technically these boxes could be anywhere in memory, 1087 00:51:05,980 --> 00:51:07,955 specifically in the heap, we're going to see. 1088 00:51:07,955 --> 00:51:09,580 But they don't have to be back to back. 1089 00:51:09,580 --> 00:51:12,744 And so the fact that there are these gaps in between the nodes 1090 00:51:12,744 --> 00:51:15,160 deliberately paints a picture that these things don't have 1091 00:51:15,160 --> 00:51:16,840 to be back to back to back any more. 1092 00:51:16,840 --> 00:51:17,830 They can be anywhere. 1093 00:51:17,830 --> 00:51:21,490 And now suppose I've got these five numbers, nine through 34. 1094 00:51:21,490 --> 00:51:24,400 Suppose I want to add another number. 1095 00:51:24,400 --> 00:51:25,210 Where do I put it? 1096 00:51:25,210 --> 00:51:27,130 I don't seem to have room. 1097 00:51:27,130 --> 00:51:31,535 But based on this picture, how might you infer we're going to engineer this? 1098 00:51:31,535 --> 00:51:32,511 AUDIENCE: [INAUDIBLE] 1099 00:51:32,511 --> 00:51:35,260 DAVID MALAN: Yeah, so why don't we just allocate space for another 1100 00:51:35,260 --> 00:51:37,676 and it's not going to fit on the board here but who cares? 1101 00:51:37,676 --> 00:51:38,620 We can put it here. 1102 00:51:38,620 --> 00:51:43,840 Put the new number in it and then just add a line, an arrow from it to that. 1103 00:51:43,840 --> 00:51:47,270 And so this then is going to be what we call a linked list 1104 00:51:47,270 --> 00:51:49,060 and it gives us dynamism. 
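The two-box picture, one box for n and one for next, maps naturally onto a struct. The sketch below is one plausible way to write it, using the field names n and next from the lecture; the helper functions make_node and append are assumptions introduced here for illustration.

```c
#include <stdlib.h>

typedef struct node
{
    int n;             // the number stored in this box
    struct node *next; // address of the next node, or NULL at the end
}
node;

// Hypothetical helper: allocate one node on the heap.
// Returns NULL if malloc fails.
node *make_node(int n)
{
    node *ptr = malloc(sizeof(node));
    if (ptr != NULL)
    {
        ptr->n = n;
        ptr->next = NULL;
    }
    return ptr;
}

// Hypothetical helper: link a new node holding n onto the end of the
// list whose first-node pointer is *first. Returns 1 on success, 0 on failure.
int append(node **first, int n)
{
    node *new_node = make_node(n);
    if (new_node == NULL)
    {
        return 0;
    }

    if (*first == NULL)
    {
        *first = new_node; // empty list: the new node becomes the first
        return 1;
    }

    // Follow the arrows to the last node, then add one more arrow
    node *last = *first;
    while (last->next != NULL)
    {
        last = last->next;
    }
    last->next = new_node;
    return 1;
}
```

Nothing is copied when the list grows: each new number costs one malloc for its own node, wherever in the heap that happens to land, plus one pointer update.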
1105 00:51:49,060 --> 00:51:52,120 It gives us the ability to grow or shrink our data structure, addressing 1106 00:51:52,120 --> 00:51:53,620 your concern, not necessarily yours. 1107 00:51:53,620 --> 00:51:57,490 It's still a little space wasteful, but we gain benefits for both of 1108 00:51:57,490 --> 00:52:00,130 your concerns: time is a lot better now because we don't 1109 00:52:00,130 --> 00:52:01,930 have to waste time copying the values. 1110 00:52:01,930 --> 00:52:03,710 We just add to the values. 1111 00:52:03,710 --> 00:52:06,520 And if we want to grow or subtract it we only do as much work 1112 00:52:06,520 --> 00:52:08,320 as we're trying to add or subtract. 1113 00:52:08,320 --> 00:52:10,660 We don't have to worry about fixing everything. 1114 00:52:10,660 --> 00:52:12,524 But there is some complexity here. 1115 00:52:12,524 --> 00:52:14,440 And given that we have a whole bunch of these, 1116 00:52:14,440 --> 00:52:16,150 which would make problem set five a little easier, 1117 00:52:16,150 --> 00:52:17,900 if you haven't quite finished Whodunit. 1118 00:52:17,900 --> 00:52:21,820 Could we get for just one demo today six volunteers? 1119 00:52:21,820 --> 00:52:22,810 Six volunteers? 1120 00:52:22,810 --> 00:52:23,410 All right. 1121 00:52:23,410 --> 00:52:26,331 Come on down, right in front here. 1122 00:52:26,331 --> 00:52:26,830 All right. 1123 00:52:26,830 --> 00:52:30,120 Well actually, come on up. 1124 00:52:30,120 --> 00:52:35,500 Come on up and have a two and three over there, four. 1125 00:52:35,500 --> 00:52:37,080 No one over here today. 1126 00:52:37,080 --> 00:52:38,520 OK, OK, five. 1127 00:52:38,520 --> 00:52:39,510 OK, six. 1128 00:52:39,510 --> 00:52:40,540 OK, six, six. 1129 00:52:40,540 --> 00:52:41,110 Come on up. 1130 00:52:41,110 --> 00:52:41,260 All right. 1131 00:52:41,260 --> 00:52:42,730 We'll save this till the very end. 1132 00:52:42,730 --> 00:52:45,792 But let me give you guys in the meantime, these numbers. 
1133 00:52:45,792 --> 00:52:47,500 And if you don't mind holding the numbers 1134 00:52:47,500 --> 00:52:49,660 out I want you to go over to the left there 1135 00:52:49,660 --> 00:52:55,990 and just order yourselves just as the picture on the screen OK, you'll be 17. 1136 00:52:55,990 --> 00:52:57,370 Let's see, let's see, nine. 1137 00:52:57,370 --> 00:52:59,110 I might have given out too many numbers. 1138 00:52:59,110 --> 00:53:04,200 OK, let me free that and give you nine instead, if I may. 1139 00:53:04,200 --> 00:53:06,640 And give you, let's see. 1140 00:53:06,640 --> 00:53:07,750 You have 17? 1141 00:53:07,750 --> 00:53:08,822 OK, so 17. 1142 00:53:08,822 --> 00:53:11,780 And yeah, I'll have you go ahead and flip yourselves if you don't mind. 1143 00:53:11,780 --> 00:53:15,560 22, and then 26. 1144 00:53:15,560 --> 00:53:16,370 And what do we got? 1145 00:53:16,370 --> 00:53:19,180 20, 34. 1146 00:53:19,180 --> 00:53:20,030 34. 1147 00:53:20,030 --> 00:53:22,690 OK, and you guys will be slightly special. 1148 00:53:22,690 --> 00:53:27,550 So who wants to be literally first? 1149 00:53:27,550 --> 00:53:29,470 OK, so here first. 1150 00:53:29,470 --> 00:53:31,732 And who wants to be temporary? 1151 00:53:31,732 --> 00:53:33,190 OK, you'll be temporary, all right. 1152 00:53:33,190 --> 00:53:33,910 Come on over here. 1153 00:53:33,910 --> 00:53:36,618 OK, come on over here and if you guys could step a little closer. 1154 00:53:36,618 --> 00:53:41,110 So we have 9, 17, 22, 26, 34, and give yourselves like a foot in between. 1155 00:53:41,110 --> 00:53:45,340 And if you guys could use your left arms to represent the pointers to visualize 1156 00:53:45,340 --> 00:53:47,397 who is linked to whom. 1157 00:53:47,397 --> 00:53:49,980 OK, and why don't you just point, yes, very deliberately down. 1158 00:53:49,980 --> 00:53:51,404 So what's your name? 1159 00:53:51,404 --> 00:53:51,945 NAZLI: Nazli. 1160 00:53:51,945 --> 00:53:52,170 DAVID MALAN: Nazli. 
1161 00:53:52,170 --> 00:53:54,270 So Nazli's left hand will be a null pointer. 1162 00:53:54,270 --> 00:53:55,080 It's not pointing at anyone. 1163 00:53:55,080 --> 00:53:58,038 So literally just pointed down to the ground, like ground electrically. 1164 00:53:58,038 --> 00:53:58,620 OK. 1165 00:53:58,620 --> 00:54:01,300 So now we just have to have some first node. 1166 00:54:01,300 --> 00:54:02,305 So what's your name? 1167 00:54:02,305 --> 00:54:02,660 OLIVIA: Olivia. 1168 00:54:02,660 --> 00:54:05,550 DAVID MALAN: Olivia is a little special here in that her paper has a word 1169 00:54:05,550 --> 00:54:06,720 and it's not just a number. 1170 00:54:06,720 --> 00:54:09,480 She represents a distinct variable called first. 1171 00:54:09,480 --> 00:54:11,940 Because that one catch with the linked list 1172 00:54:11,940 --> 00:54:15,630 is that you don't remember it by way of the address of a contiguous 1173 00:54:15,630 --> 00:54:16,620 chunk of memory. 1174 00:54:16,620 --> 00:54:20,500 You remember a linked list by way of the address of the first node in the linked 1175 00:54:20,500 --> 00:54:21,000 list. 1176 00:54:21,000 --> 00:54:21,490 And what's your name? 1177 00:54:21,490 --> 00:54:22,380 ACHMED: Achmed. 1178 00:54:22,380 --> 00:54:23,213 DAVID MALAN: Achmed. 1179 00:54:23,213 --> 00:54:26,427 So Achmed here happens to be the very first node in the list right now. 1180 00:54:26,427 --> 00:54:28,260 So Olivia's left arm is going to be pointing 1181 00:54:28,260 --> 00:54:32,040 to Achmed to represent that he is the first node in the list. 1182 00:54:32,040 --> 00:54:33,770 OK, and what's your name? 1183 00:54:33,770 --> 00:54:34,430 JESS: Jess. 1184 00:54:34,430 --> 00:54:34,570 DAVID MALAN: Jess. 1185 00:54:34,570 --> 00:54:37,590 We're going to use Jess in just a moment to complete some operations. 1186 00:54:37,590 --> 00:54:42,450 So suppose that we actually want to insert some value into this list, 1187 00:54:42,450 --> 00:54:44,670 like the number 55. 
1188 00:54:44,670 --> 00:54:47,800 All right, so the number 55 is going to require a little bit of cleverness 1189 00:54:47,800 --> 00:54:48,300 here. 1190 00:54:48,300 --> 00:54:50,640 And so I need some place to store this. 1191 00:54:50,640 --> 00:54:51,389 I need to malloc. 1192 00:54:51,389 --> 00:54:52,680 So OK, you've been volunteered. 1193 00:54:52,680 --> 00:54:53,070 What's your name? 1194 00:54:53,070 --> 00:54:53,760 STELLA: Stella. 1195 00:54:53,760 --> 00:54:55,093 DAVID MALAN: Stella, come on up. 1196 00:54:55,093 --> 00:54:57,030 So malloc Stella. 1197 00:54:57,030 --> 00:55:01,344 And we will store the number 55 in Stella's node. 1198 00:55:01,344 --> 00:55:04,260 And right now if you could just kind of point your left hand anywhere. 1199 00:55:04,260 --> 00:55:06,670 It's kind of a garbage value. 1200 00:55:06,670 --> 00:55:08,399 OK, thank you. 1201 00:55:08,399 --> 00:55:09,690 And now what's your name again? 1202 00:55:09,690 --> 00:55:10,110 JESS: Jess. 1203 00:55:10,110 --> 00:55:10,860 DAVID MALAN: Jess. 1204 00:55:10,860 --> 00:55:13,650 OK, so Jess is going to help us find the right space here. 1205 00:55:13,650 --> 00:55:17,310 So we can obviously see where 55 belongs if we're keeping this sorted. 1206 00:55:17,310 --> 00:55:19,200 But again, computers don't have that luxury. 1207 00:55:19,200 --> 00:55:21,300 Moreover, we no longer have random access. 1208 00:55:21,300 --> 00:55:25,630 We can't just jump to [0], [1], [2] because there are these gaps between 1209 00:55:25,630 --> 00:55:26,130 them. 1210 00:55:26,130 --> 00:55:28,530 And just to make this more clear, can every other of you 1211 00:55:28,530 --> 00:55:32,050 step forward or back so that it just looks a little weird? 1212 00:55:32,050 --> 00:55:34,986 So you can no longer index into this data structure 1213 00:55:34,986 --> 00:55:36,360 because again, it's not an array. 1214 00:55:36,360 --> 00:55:37,230 It's not back to back. 
1215 00:55:37,230 --> 00:55:40,021 These things could be anywhere in memory and it's only the pointers 1216 00:55:40,021 --> 00:55:43,280 that are linking everything together. 1217 00:55:43,280 --> 00:55:47,910 So Jess now is going to initially point at the very same thing 1218 00:55:47,910 --> 00:55:50,610 that Olivia is pointing at, the same address or Achmed. 1219 00:55:50,610 --> 00:55:53,070 All right, so now we have a bit of redundancy. 1220 00:55:53,070 --> 00:55:55,440 But suppose we want to insert 55. 1221 00:55:55,440 --> 00:55:57,750 What kind of logic, what's the pseudocode here for Jess 1222 00:55:57,750 --> 00:56:02,430 to find the location for Stella? 1223 00:56:02,430 --> 00:56:03,390 What should Jess do? 1224 00:56:03,390 --> 00:56:04,058 Yeah. 1225 00:56:04,058 --> 00:56:05,437 AUDIENCE: [INAUDIBLE] 1226 00:56:05,437 --> 00:56:08,020 DAVID MALAN: OK, keeps going until she finds the null pointer, 1227 00:56:08,020 --> 00:56:11,940 or more specifically, till she finds the null pointer or the right location 1228 00:56:11,940 --> 00:56:13,816 for this number if we want to keep it sorted. 1229 00:56:13,816 --> 00:56:14,523 So let's do that. 1230 00:56:14,523 --> 00:56:15,840 So you're pointing at Achmed. 1231 00:56:15,840 --> 00:56:19,620 The number nine is not greater than 55. 1232 00:56:19,620 --> 00:56:22,330 So Stella belongs after Achmed. 1233 00:56:22,330 --> 00:56:24,060 So what are you going to do? 1234 00:56:24,060 --> 00:56:25,980 Good, you're going to point at Maria. 1235 00:56:25,980 --> 00:56:28,156 So you know to point at Maria why? 1236 00:56:28,156 --> 00:56:30,340 JESS: Because nine is less than 55. 1237 00:56:30,340 --> 00:56:34,150 DAVID MALAN: Nine is less than 55 but also, Achmed isn't just storing nine. 1238 00:56:34,150 --> 00:56:36,990 Right, he has this next pointer that's telling 1239 00:56:36,990 --> 00:56:39,270 Jess where the next value is to look.
1240 00:56:39,270 --> 00:56:43,830 So his left hand is the substitute for what would have been just ++ in the world 1241 00:56:43,830 --> 00:56:44,560 of an array. 1242 00:56:44,560 --> 00:56:47,800 So you go ahead and physically walk and let's just walk through this. 1243 00:56:47,800 --> 00:56:49,320 So now we're pointing here at 17. 1244 00:56:49,320 --> 00:56:51,120 It's not greater than. 1245 00:56:51,120 --> 00:56:53,680 We point next at Arunev. 1246 00:56:53,680 --> 00:56:55,290 OK, that's a 22. 1247 00:56:55,290 --> 00:56:56,507 Still not the right value. 1248 00:56:56,507 --> 00:56:57,090 We keep going. 1249 00:56:57,090 --> 00:56:58,600 What's your name? 1250 00:56:58,600 --> 00:56:59,400 Jeung Wan? 1251 00:56:59,400 --> 00:57:02,230 OK, that's not the right number because he's holding 26. 1252 00:57:02,230 --> 00:57:03,870 And now we catch you again? 1253 00:57:03,870 --> 00:57:04,860 Nazli. 1254 00:57:04,860 --> 00:57:08,610 Still no good and now go ahead and follow her left hand. 1255 00:57:08,610 --> 00:57:12,450 OK, so now we know that this has got to be the right space because we haven't 1256 00:57:12,450 --> 00:57:14,400 found numerically the right space. 1257 00:57:14,400 --> 00:57:16,837 So if we could borrow you, Stella, all the way over here. 1258 00:57:16,837 --> 00:57:19,170 Well, you're not technically physically moving in memory 1259 00:57:19,170 --> 00:57:21,520 but this will just make the story better. 1260 00:57:21,520 --> 00:57:25,449 OK, so yes, we're re-alloc-ing sort of. 1261 00:57:25,449 --> 00:57:27,240 So what are you going to do now with Stella 1262 00:57:27,240 --> 00:57:30,860 now that you found the right location? 1263 00:57:30,860 --> 00:57:32,610 Leave her here, OK. 1264 00:57:32,610 --> 00:57:34,680 But she's just kind of orphaned now. 1265 00:57:34,680 --> 00:57:37,400 She's pointing at nowhere and no one's pointing at her. 1266 00:57:37,400 --> 00:57:39,082 It's kind of sad.
1267 00:57:39,082 --> 00:57:40,290 And this is actually perfect. 1268 00:57:40,290 --> 00:57:41,340 Memory leak. 1269 00:57:41,340 --> 00:57:44,130 OK, so let's fix. 1270 00:57:44,130 --> 00:57:45,880 Who has to point at whom? 1271 00:57:45,880 --> 00:57:47,020 OK, good. 1272 00:57:47,020 --> 00:57:50,440 And now what should Stella point at? 1273 00:57:50,440 --> 00:57:52,740 Since now she is the end of the list and she's just 1274 00:57:52,740 --> 00:57:54,974 pointing to some garbage value. 1275 00:57:54,974 --> 00:57:58,140 And she's pointing, to be clear, at some garbage value because when you call 1276 00:57:58,140 --> 00:58:00,420 malloc you just get garbage values. 1277 00:58:00,420 --> 00:58:03,376 We overrode one of those garbage values with 55 for n, 1278 00:58:03,376 --> 00:58:05,250 but the pointer has not yet been overwritten. 1279 00:58:05,250 --> 00:58:06,458 So what do you want to do, Jess? 1280 00:58:06,458 --> 00:58:11,090 1281 00:58:11,090 --> 00:58:11,750 To whom? 1282 00:58:11,750 --> 00:58:12,300 That's OK. 1283 00:58:12,300 --> 00:58:12,890 It's close. 1284 00:58:12,890 --> 00:58:18,410 What should her value be if there's no one to her left? 1285 00:58:18,410 --> 00:58:19,040 Should be null. 1286 00:58:19,040 --> 00:58:22,250 OK, and how did we represent null before? 1287 00:58:22,250 --> 00:58:23,150 Yeah, exactly. 1288 00:58:23,150 --> 00:58:27,110 Null, so now we have a list and now just to fix things, your pointer, 1289 00:58:27,110 --> 00:58:29,030 so Jess is kind of temporary. 1290 00:58:29,030 --> 00:58:30,940 We don't really care what her value is. 1291 00:58:30,940 --> 00:58:32,190 But who's important over here? 1292 00:58:32,190 --> 00:58:33,200 What's your name again? 1293 00:58:33,200 --> 00:58:39,320 Olivia was first and now do we have a list that is still linked? 1294 00:58:39,320 --> 00:58:39,996 We do. 1295 00:58:39,996 --> 00:58:42,620 And now, it of course took a little while to walk through this.
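The sorted insertion the volunteers just acted out can be sketched in C roughly as follows. This is only a sketch under assumptions: the node type mirrors the number and next fields described in the lecture, while the function name insert_sorted and its signature are hypothetical and not taken from the lecture's files.

```c
#include <stdbool.h>
#include <stdlib.h>

typedef struct node
{
    int number;         // the number on the volunteer's paper
    struct node *next;  // the left hand: next node, or NULL at the end
}
node;

// Hypothetical helper: insert number into an ascending list.
// first is the address of the list's "first" pointer (Olivia).
// Returns false only if malloc fails.
bool insert_sorted(node **first, int number)
{
    // "malloc Stella" and overwrite both garbage values
    node *n = malloc(sizeof(node));
    if (n == NULL)
    {
        return false;
    }
    n->number = number;
    n->next = NULL;

    // Empty list, or the new number belongs before the current first node
    if (*first == NULL || number < (*first)->number)
    {
        n->next = *first;
        *first = n;
        return true;
    }

    // "Jess" walks until the following node is missing or not smaller
    node *ptr = *first;
    while (ptr->next != NULL && ptr->next->number < number)
    {
        ptr = ptr->next;
    }

    // Re-stitch: the new node takes over ptr's old neighbor first,
    // so nothing is orphaned, then ptr points at the new node
    n->next = ptr->next;
    ptr->next = n;
    return true;
}
```

The order of the last two assignments matters: if ptr->next were overwritten first, the rest of the list would be lost, which is the orphaning problem from the demonstration.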
1296 00:58:42,620 --> 00:58:44,304 And frankly I kind of told a lie. 1297 00:58:44,304 --> 00:58:46,220 I haven't really made this faster because what 1298 00:58:46,220 --> 00:58:49,290 was the running time of this algorithm? 1299 00:58:49,290 --> 00:58:50,750 It was still big O of n. 1300 00:58:50,750 --> 00:58:51,950 But that's because what? 1301 00:58:51,950 --> 00:58:54,390 I was trying to maintain what property? 1302 00:58:54,390 --> 00:58:55,260 Sorted. 1303 00:58:55,260 --> 00:58:57,020 So suppose I relax that constraint. 1304 00:58:57,020 --> 00:58:59,870 Suppose that I didn't care about being sorted order. 1305 00:58:59,870 --> 00:59:06,610 Can I do better than O of n in order to have inserted Stella? 1306 00:59:06,610 --> 00:59:07,490 AUDIENCE: [INAUDIBLE] 1307 00:59:07,490 --> 00:59:10,222 DAVID MALAN: OK, constant time, where can I put her then? 1308 00:59:10,222 --> 00:59:11,215 AUDIENCE: [INAUDIBLE] 1309 00:59:11,215 --> 00:59:12,090 DAVID MALAN: Exactly. 1310 00:59:12,090 --> 00:59:13,798 So if we don't care about sorted order, I 1311 00:59:13,798 --> 00:59:15,720 could have saved myself a huge amount of time 1312 00:59:15,720 --> 00:59:19,890 and we could have inserted Stella here, updated Olivia's hands, and updated 1313 00:59:19,890 --> 00:59:21,510 Stella's hands to point at Achmed. 1314 00:59:21,510 --> 00:59:22,650 And then we're done. 1315 00:59:22,650 --> 00:59:23,240 Constant time. 1316 00:59:23,240 --> 00:59:26,430 And here's an example of why big O of one doesn't mean one step. 1317 00:59:26,430 --> 00:59:30,276 That's constant time because it's like moving Olivia's hand and Stella's hand 1318 00:59:30,276 --> 00:59:30,900 but not Achmed. 1319 00:59:30,900 --> 00:59:34,530 So that's at least two steps but it's still always two steps. 1320 00:59:34,530 --> 00:59:37,505 So if we could get a round of applause for our volunteers here? 1321 00:59:37,505 --> 00:59:42,360 [CHEERING] You can keep both your numbers and these if you'd like.
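The constant-time insertion at the front of the list described above might look something like this in C. It is a sketch, assuming the same node layout as in the lecture; the name prepend is hypothetical.

```c
#include <stdbool.h>
#include <stdlib.h>

typedef struct node
{
    int number;
    struct node *next;
}
node;

// Hypothetical helper: put number at the front of the list in O(1).
// first is the address of the list's "first" pointer (Olivia).
bool prepend(node **first, int number)
{
    node *n = malloc(sizeof(node));
    if (n == NULL)
    {
        return false;
    }
    n->number = number;

    // Step one: the new node (Stella) points at the old first node (Achmed)
    n->next = *first;

    // Step two: the first pointer (Olivia) now points at the new node
    *first = n;
    return true;
}
```

However long the list is, only these two pointer updates happen, which is why the cost is constant even though it is more than one step.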
1322 00:59:42,360 --> 00:59:43,950 Thank you so much. 1323 00:59:43,950 --> 00:59:47,570 May this help you with P set five. 1324 00:59:47,570 --> 00:59:49,680 Oh, sorry. 1325 00:59:49,680 --> 00:59:50,850 All right, thank you. 1326 00:59:50,850 --> 00:59:56,911 So that's of course just one operation. 1327 00:59:56,911 --> 00:59:58,410 There could have been other numbers. 1328 00:59:58,410 --> 01:00:01,230 Like, if we were trying to insert five in sorted order, 1329 01:00:01,230 --> 01:00:04,049 we would have gotten our constant time or maybe our omega of one 1330 01:00:04,049 --> 01:00:06,840 because in the best case, the number might end up at the beginning. 1331 01:00:06,840 --> 01:00:09,120 20, had we inserted it with our humans, might 1332 01:00:09,120 --> 01:00:12,270 have been a little more involved because we kind of have to walk the list 1333 01:00:12,270 --> 01:00:13,170 as we did with Jess. 1334 01:00:13,170 --> 01:00:16,710 But then she's kind of got to point at the person behind her 1335 01:00:16,710 --> 01:00:20,340 and the person in front of her because she has to update more hands. 1336 01:00:20,340 --> 01:00:23,760 Which is to say, that doing insertions or even deletions 1337 01:00:23,760 --> 01:00:26,272 requires a bit of re-stitching. 1338 01:00:26,272 --> 01:00:28,230 It's like kind of fixing clothes here if you're 1339 01:00:28,230 --> 01:00:31,717 trying to maintain a contiguous thread through all of these data structures. 1340 01:00:31,717 --> 01:00:34,050 But at the end of the day, even though it uses pointers, 1341 01:00:34,050 --> 01:00:37,170 it really just boils down to getting this logic right. 1342 01:00:37,170 --> 01:00:41,790 And in fact, let me do an example here of some code of how we might do this. 1343 01:00:41,790 --> 01:00:45,810 I'm going to go ahead and open up list0.c 1344 01:00:45,810 --> 01:00:48,400 and take a look at how this here works. 1345 01:00:48,400 --> 01:00:54,780 So in list0.c we have the following code.
1346 01:00:54,780 --> 01:00:59,262 We first, in main, get a positive number from the user. 1347 01:00:59,262 --> 01:01:02,470 So I'm going to wave my hands at this because this is kind of old school now. 1348 01:01:02,470 --> 01:01:05,470 We've been using a do while loop to get a positive number from the user. 1349 01:01:05,470 --> 01:01:07,060 And I'm calling it capacity. 1350 01:01:07,060 --> 01:01:10,940 So this is an example of getting an int from the user, calling it capacity. 1351 01:01:10,940 --> 01:01:14,040 Capacity meaning the maximum possible size for a data structure. 1352 01:01:14,040 --> 01:01:15,690 That would be a term of art there. 1353 01:01:15,690 --> 01:01:19,740 Here now is where, per week two of the class, 1354 01:01:19,740 --> 01:01:24,510 I allocate an array of ints, this many ints: capacity. 1355 01:01:24,510 --> 01:01:26,010 So that, too, is hopefully familiar. 1356 01:01:26,010 --> 01:01:27,330 There's no pointers yet. 1357 01:01:27,330 --> 01:01:28,450 There's nothing too fancy. 1358 01:01:28,450 --> 01:01:32,584 I'm just allocating an array of size based on what the human just typed in. 1359 01:01:32,584 --> 01:01:34,500 But here's where it gets a little interesting. 1360 01:01:34,500 --> 01:01:38,427 The purpose of this program is to just prompt the user for that many numbers 1361 01:01:38,427 --> 01:01:39,510 again and again and again. 1362 01:01:39,510 --> 01:01:44,350 So I can type in 1, 2, 3 or 5, 17, 20, 22, and so forth. 1363 01:01:44,350 --> 01:01:47,200 And just build up an array of numbers in memory, 1364 01:01:47,200 --> 01:01:50,830 but I'm going to trip over the problem we identified a little bit ago. 1365 01:01:50,830 --> 01:01:51,990 So here we go. 1366 01:01:51,990 --> 01:01:54,240 I initialize size to zero because initially there's 1367 01:01:54,240 --> 01:01:55,650 nothing in the structure.
1368 01:01:55,650 --> 01:01:58,590 While size is less than my capacity, so while the current size 1369 01:01:58,590 --> 01:02:02,500 is less than the max, so to speak, do the following. 1370 01:02:02,500 --> 01:02:04,620 First I'm going to get a number from the user. 1371 01:02:04,620 --> 01:02:07,590 And the goal is to now insert this number into this list. 1372 01:02:07,590 --> 01:02:09,780 But now I'm going to do a quick sanity check. 1373 01:02:09,780 --> 01:02:14,250 Let me check and make sure this number is or isn't in the list already, 1374 01:02:14,250 --> 01:02:15,930 because I don't want to have duplicates. 1375 01:02:15,930 --> 01:02:16,350 Why? 1376 01:02:16,350 --> 01:02:16,950 Just because. 1377 01:02:16,950 --> 01:02:19,500 I want this to be a very clean list, no duplicates. 1378 01:02:19,500 --> 01:02:24,120 And so this loop of code here, maybe from week one, week two, week three, 1379 01:02:24,120 --> 01:02:28,620 is just an example of iterating through an array, looking for the number, 1380 01:02:28,620 --> 01:02:31,500 and if so, remembering that I found it with a Boolean 1381 01:02:31,500 --> 01:02:35,170 so that I have an answer found or not found in a Boolean variable. 1382 01:02:35,170 --> 01:02:35,670 OK. 1383 01:02:35,670 --> 01:02:37,044 So that's all that code is doing. 1384 01:02:37,044 --> 01:02:38,400 Still no magic. 1385 01:02:38,400 --> 01:02:41,940 So now is where the interesting part of the story happens. 1386 01:02:41,940 --> 01:02:48,650 So if the number was not found in the list already, here is how per week two, 1387 01:02:48,650 --> 01:02:51,630 we add a number to the end of an array. 1388 01:02:51,630 --> 01:02:55,530 Because if the size is initially zero, numbers[0] 1389 01:02:55,530 --> 01:02:57,090 is where the first number should go. 1390 01:02:57,090 --> 01:03:00,434 If size is instead one at this point, numbers[1] 1391 01:03:00,434 --> 01:03:01,850 is where the new number should go.
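The duplicate check and the add-at-index-size step can be sketched as one small function. This is an illustrative reconstruction, not the actual contents of list0.c: the name add_unique and the choice to return the new size are assumptions made to keep the example self-contained.

```c
#include <stdbool.h>

// Hypothetical helper: add number to the end of numbers[] if it is not
// already present and there is still room. Returns the (possibly
// unchanged) size of the list.
int add_unique(int numbers[], int size, int capacity, int number)
{
    // Linear search, remembering the answer in a Boolean
    bool found = false;
    for (int i = 0; i < size; i++)
    {
        if (numbers[i] == number)
        {
            found = true;
            break;
        }
    }

    // The next free slot is always at index size
    if (!found && size < capacity)
    {
        numbers[size] = number;
        size++;
    }
    return size;
}
```

Once size reaches capacity, further numbers are simply dropped, which is the constraint the array version cannot escape.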
1392 01:03:01,850 --> 01:03:04,200 And then I should increment size. 1393 01:03:04,200 --> 01:03:08,130 But there's a problem here in that once I print out these numbers 1394 01:03:08,130 --> 01:03:12,270 and the program ends, I can only have inputted as many numbers 1395 01:03:12,270 --> 01:03:16,620 as are available, as I have capacity for. 1396 01:03:16,620 --> 01:03:18,390 So it's kind of constrained. 1397 01:03:18,390 --> 01:03:20,660 So what if I want to do a little better than this 1398 01:03:20,660 --> 01:03:25,050 in list1.c, which does now introduce new material, or at least an application 1399 01:03:25,050 --> 01:03:27,480 of the topics this week and last? 1400 01:03:27,480 --> 01:03:32,010 Here in line nine is technically how I can allocate 1401 01:03:32,010 --> 01:03:35,730 an array before I know the size of it. 1402 01:03:35,730 --> 01:03:38,610 So an array, recall, is just a chunk of memory 1403 01:03:38,610 --> 01:03:42,300 identified by some word, like numbers or students or whatnot. 1404 01:03:42,300 --> 01:03:45,270 But technically we've seen that there's kind of this equivalence, where 1405 01:03:45,270 --> 01:03:48,720 if an array is just the chunk of memory, you can technically refer to an array 1406 01:03:48,720 --> 01:03:51,940 by an address, the address of its first byte just like a string. 1407 01:03:51,940 --> 01:03:54,690 So this on line nine is a little new. 1408 01:03:54,690 --> 01:03:56,040 But it's kind of that idea. 1409 01:03:56,040 --> 01:03:58,710 Give me a pointer called numbers but initialize it to null. 1410 01:03:58,710 --> 01:04:00,290 There's no space for the number. 1411 01:04:00,290 --> 01:04:03,210 But this pointer therefore doesn't point to any chunk of memory.
1412 01:04:03,210 --> 01:04:05,940 It would be like Olivia standing up here awkwardly with no one 1413 01:04:05,940 --> 01:04:10,050 to point at because we've only allocated space for the first pointer, 1414 01:04:10,050 --> 01:04:12,540 not for everyone else on the stage. 1415 01:04:12,540 --> 01:04:14,970 So capacity is by default zero. 1416 01:04:14,970 --> 01:04:18,270 So here the rest of the program is pretty similar. 1417 01:04:18,270 --> 01:04:21,820 I go ahead and infinitely prompt the user for a number. 1418 01:04:21,820 --> 01:04:22,612 I check for errors. 1419 01:04:22,612 --> 01:04:24,361 It turns out if you read the documentation 1420 01:04:24,361 --> 01:04:26,875 for get_int it will return a special constant called INT_MAX 1421 01:04:26,875 --> 01:04:29,610 if the user stops providing input. 1422 01:04:29,610 --> 01:04:32,070 Here I'm just checking with a loop and a Boolean, 1423 01:04:32,070 --> 01:04:35,010 whether or not this number is in the list, same as before. 1424 01:04:35,010 --> 01:04:37,740 But here's where I start to use some new functionality. 1425 01:04:37,740 --> 01:04:43,610 If the number was not found already in the list, and the size of the list 1426 01:04:43,610 --> 01:04:47,850 already equals its capacity, that is if it is filled, 1427 01:04:47,850 --> 01:04:50,420 what do I have to do conceptually now? 1428 01:04:50,420 --> 01:04:53,229 If I've got an array whose purpose in life, as we proposed earlier, 1429 01:04:53,229 --> 01:04:54,020 is just to grow it? 1430 01:04:54,020 --> 01:04:57,130 1431 01:04:57,130 --> 01:04:58,500 I need to add space for it. 1432 01:04:58,500 --> 01:05:00,420 So I need to add space, as we were proposing. 1433 01:05:00,420 --> 01:05:03,211 Even though this is kind of lame in that it's a little inefficient, 1434 01:05:03,211 --> 01:05:04,500 here's how we do it.
1435 01:05:04,500 --> 01:05:10,500 I can simply call realloc, passing in the array whose memory 1436 01:05:10,500 --> 01:05:12,030 I want to reallocate. 1437 01:05:12,030 --> 01:05:15,540 And I just tell realloc how much space I now want. 1438 01:05:15,540 --> 01:05:19,330 So here, this is the size of the int, four bytes. 1439 01:05:19,330 --> 01:05:20,940 And this is how many bytes I want. 1440 01:05:20,940 --> 01:05:25,830 So whatever the current size is, realloc give me one more byte. 1441 01:05:25,830 --> 01:05:29,760 And then realloc gets assigned to numbers 1442 01:05:29,760 --> 01:05:32,172 and I check if it's null or not null. 1443 01:05:32,172 --> 01:05:33,630 And I'm keeping it a little simple. 1444 01:05:33,630 --> 01:05:36,588 We could add some additional error checking here, but what does realloc 1445 01:05:36,588 --> 01:05:37,180 do? 1446 01:05:37,180 --> 01:05:39,900 Realloc is pretty cool because if you pass to realloc, 1447 01:05:39,900 --> 01:05:43,426 a pointer to a chunk of memory that's like of this size, realloc 1448 01:05:43,426 --> 01:05:45,300 will look in your computer's memory and if it 1449 01:05:45,300 --> 01:05:47,760 sees a bigger chunk of memory over here, it 1450 01:05:47,760 --> 01:05:51,420 will handle the copying of everything over to it for you. 1451 01:05:51,420 --> 01:05:54,990 And then return the address of the new chunk of memory 1452 01:05:54,990 --> 01:05:56,470 and free the old for you. 1453 01:05:56,470 --> 01:05:57,900 So does the switcheroo. 1454 01:05:57,900 --> 01:06:01,200 It's still linear time but this is how you would use it 1455 01:06:01,200 --> 01:06:07,630 without having to malloc and free and use a for loop like we described before. 1456 01:06:07,630 --> 01:06:10,870 And then you can go ahead and put the number in the list as before.
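A rough sketch of that realloc step follows, under stated assumptions: the function name append and its pointer parameters are hypothetical, and unlike the version described in the lecture, the result of realloc goes into a temporary first so the old block is not lost if the call fails. Note that the request is sizeof(int) times (size + 1), so the array grows by one int's worth of bytes each time.

```c
#include <stdbool.h>
#include <stdlib.h>

// Hypothetical helper: append number to a heap-allocated array,
// growing it by room for one more int whenever it is full.
bool append(int **numbers, int *size, int *capacity, int number)
{
    if (*size == *capacity)
    {
        // realloc(NULL, n) behaves like malloc(n), so an initially
        // NULL pointer with capacity 0 works here too
        int *tmp = realloc(*numbers, sizeof(int) * (*size + 1));
        if (tmp == NULL)
        {
            // On failure the original block is untouched and still valid
            return false;
        }
        *numbers = tmp;
        *capacity = *size + 1;
    }

    // Same as before: the next free slot is at index size
    (*numbers)[*size] = number;
    (*size)++;
    return true;
}
```

Growing by a single element keeps the sketch close to the lecture; each call may still copy the whole array, so the cost per insertion remains linear in the worst case.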
1457 01:06:10,870 --> 01:06:13,380 So the only new thing here, even though we're going through it quickly, 1458 01:06:13,380 --> 01:06:14,970 is that this is how you call realloc. 1459 01:06:14,970 --> 01:06:16,974 You pass in a pointer that's previously pointing 1460 01:06:16,974 --> 01:06:18,390 to a chunk of memory or even null. 1461 01:06:18,390 --> 01:06:18,899 That's OK. 1462 01:06:18,899 --> 01:06:20,940 If you pass it in a pointer that's pointing null, 1463 01:06:20,940 --> 01:06:23,981 it will give you back the address of just one byte and then the next time 1464 01:06:23,981 --> 01:06:26,910 two bytes, three bytes, and four bytes. 1465 01:06:26,910 --> 01:06:30,180 But with linked lists things get a little more interesting. 1466 01:06:30,180 --> 01:06:33,440 And the syntax is going to be a little funky but let's see. 1467 01:06:33,440 --> 01:06:37,770 Here it turns out is how we can implement each of our human volunteers 1468 01:06:37,770 --> 01:06:38,670 in code. 1469 01:06:38,670 --> 01:06:41,400 Each of them I called a node and node is a term of art in CS. 1470 01:06:41,400 --> 01:06:44,740 It refers to some data structure that contains some information. 1471 01:06:44,740 --> 01:06:48,030 Each of them was holding a number, which we called an int. 1472 01:06:48,030 --> 01:06:50,100 And then each of them, this is kind of funky, 1473 01:06:50,100 --> 01:06:53,820 had a left hand called next that was meant to point to someone 1474 01:06:53,820 --> 01:06:55,950 who looked just like them structurally. 1475 01:06:55,950 --> 01:06:59,871 So the idea here is that we don't want to just have another structure 1476 01:06:59,871 --> 01:07:01,620 inside of a structure, otherwise you would 1477 01:07:01,620 --> 01:07:05,550 get this sort of infinite Russian doll kind of thing going on. 
1478 01:07:05,550 --> 01:07:08,730 You instead want to say, each of these structures 1479 01:07:08,730 --> 01:07:12,840 has a pointer to someone else who looks like them structurally, too. 1480 01:07:12,840 --> 01:07:16,930 And that's how we get the left arm metaphor implemented in code. 1481 01:07:16,930 --> 01:07:20,100 So that just defines a node, one of our volunteers. 1482 01:07:20,100 --> 01:07:24,810 Meanwhile though, here's how we would implement Olivia in one line of code. 1483 01:07:24,810 --> 01:07:28,440 So Olivia was herself a pointer to a node. 1484 01:07:28,440 --> 01:07:29,850 She didn't have a number, right? 1485 01:07:29,850 --> 01:07:31,300 Her sign just said first. 1486 01:07:31,300 --> 01:07:33,040 She was not holding a number. 1487 01:07:33,040 --> 01:07:35,130 So we don't need a whole structure for Olivia. 1488 01:07:35,130 --> 01:07:38,970 We just need a pointer to one such node structure. 1489 01:07:38,970 --> 01:07:41,850 But initially she was just kind of standing here so we'll just 1490 01:07:41,850 --> 01:07:43,840 say she was null initially. 1491 01:07:43,840 --> 01:07:47,730 So the rest of this code is presumably about malloc-ing someone like Stella 1492 01:07:47,730 --> 01:07:52,620 from the audience, updating Olivia, using Jess to actually update pointers 1493 01:07:52,620 --> 01:07:53,280 temporarily. 1494 01:07:53,280 --> 01:07:55,900 So let's see what this looks like in code. 1495 01:07:55,900 --> 01:07:59,760 So while true, I'm just going to prompt the user for numbers like before. 1496 01:07:59,760 --> 01:08:03,460 As before, I'm going to check for errors in the same way. 1497 01:08:03,460 --> 01:08:04,870 Here's a little different. 1498 01:08:04,870 --> 01:08:09,660 Here's the block of code wherein I just check if my current linked list already 1499 01:08:09,660 --> 01:08:11,760 has the number I'm trying to insert. 1500 01:08:11,760 --> 01:08:15,150 But remember, we took away the expressiveness of square brackets. 
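For reference, the node structure and the single pointer that plays Olivia's role, as described a moment ago, can be written along these lines. The field names number and next and the variable name numbers follow the lecture's description; treat the exact layout as a sketch rather than a copy of the lecture's file.

```c
#include <stddef.h>

// Each node holds one number plus a pointer to another node of the
// same type. The struct needs the tag "node" up front so that it can
// refer to itself before the typedef is complete.
typedef struct node
{
    int number;         // the value this volunteer is holding
    struct node *next;  // the left hand: address of the next node, or NULL
}
node;

// The list itself is remembered only by the address of its first node.
// Initially there are no nodes, so the pointer is NULL.
node *numbers = NULL;
```

Storing a pointer to a node, rather than a whole node, inside the struct is what avoids the infinitely nested "Russian doll" problem: a pointer has a fixed size no matter what it points at.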
1501 01:08:15,150 --> 01:08:16,560 Can't do that anymore. 1502 01:08:16,560 --> 01:08:18,840 I have to now do this with pointers. 1503 01:08:18,840 --> 01:08:19,945 So here we go. 1504 01:08:19,945 --> 01:08:23,490 I, with my for loop, initialize a pointer, Jess, 1505 01:08:23,490 --> 01:08:26,710 to point at the same thing Olivia was pointing at, numbers. 1506 01:08:26,710 --> 01:08:28,941 So again, Jess was also just a pointer. 1507 01:08:28,941 --> 01:08:30,149 She was not holding a number. 1508 01:08:30,149 --> 01:08:33,689 She was holding ptr, so she was just one pointer 1509 01:08:33,689 --> 01:08:36,390 pointing at the same thing as Olivia. 1510 01:08:36,390 --> 01:08:40,410 Here we're saying, so long as Jess is not equal to null, 1511 01:08:40,410 --> 01:08:43,800 so as long as Jess doesn't walk off the edge of the stage, 1512 01:08:43,800 --> 01:08:45,359 go ahead and do the following. 1513 01:08:45,359 --> 01:08:46,529 What do I want to do? 1514 01:08:46,529 --> 01:08:48,149 And this syntax is new. 1515 01:08:48,149 --> 01:08:51,630 We saw at the beginning of today the dot operator, 1516 01:08:51,630 --> 01:08:54,300 which says take a data structure like students 1517 01:08:54,300 --> 01:08:56,380 and go into it with the dot operator. 1518 01:08:56,380 --> 01:08:58,470 Get their name and their dorm. 1519 01:08:58,470 --> 01:09:00,990 That was because the first demo today did not use pointers. 1520 01:09:00,990 --> 01:09:02,490 It just used structures. 1521 01:09:02,490 --> 01:09:05,250 Now we're using structures and pointers. 1522 01:09:05,250 --> 01:09:07,740 And so the syntax changes just a tiny bit.
1523 01:09:07,740 --> 01:09:10,920 When you have a pointer that is a pointer to a structure 1524 01:09:10,920 --> 01:09:14,160 and you want to follow that pointer and go to that structure, 1525 01:09:14,160 --> 01:09:18,720 the one piece of syntax in C that maybe actually maps to reality or concept 1526 01:09:18,720 --> 01:09:22,140 is this arrow operator, which means follow the left hand, 1527 01:09:22,140 --> 01:09:25,240 look at the structure, and get at that number. 1528 01:09:25,240 --> 01:09:30,660 And so if the volunteer's number equals the number that Jess was looking for, 1529 01:09:30,660 --> 01:09:32,649 go ahead and say found is true. 1530 01:09:32,649 --> 01:09:36,870 Otherwise update Jess or pointer to equal 1531 01:09:36,870 --> 01:09:39,939 whatever her left hand is pointing at. 1532 01:09:39,939 --> 01:09:42,630 So if Jess was temporarily pointing here, 1533 01:09:42,630 --> 01:09:46,140 she would then update herself by pointing there. 1534 01:09:46,140 --> 01:09:48,270 And so that's all this code is doing. 1535 01:09:48,270 --> 01:09:51,660 Jess starts to point at whatever her left hand was pointing at. 1536 01:09:51,660 --> 01:09:54,390 She moves physically on the stage. 1537 01:09:54,390 --> 01:09:56,975 All right, so now is where things get a little ugly. 1538 01:09:56,975 --> 01:09:59,850 And we'll do this with a hand-wave because I think this one is better 1539 01:09:59,850 --> 01:10:02,340 done at a slower pace on one's own. 1540 01:10:02,340 --> 01:10:05,490 And we'll come over these kinds of things in section and beyond. 1541 01:10:05,490 --> 01:10:07,710 Here's how I allocate space for a new node. 1542 01:10:07,710 --> 01:10:11,730 When I said malloc Stella, it's this line of code here, 45. 1543 01:10:11,730 --> 01:10:17,820 Malloc space for the size of a node and store it in the person 1544 01:10:17,820 --> 01:10:19,030 that Stella embodied. 
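The search loop with the arrow operator, as just walked through, might be sketched like this. The function name contains is hypothetical; the loop variable ptr plays Jess's role.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct node
{
    int number;
    struct node *next;
}
node;

// Hypothetical helper: report whether number is already in the list.
bool contains(node *first, int number)
{
    // ptr starts where the first pointer points, then follows each
    // node's next pointer until it runs off the end (NULL)
    for (node *ptr = first; ptr != NULL; ptr = ptr->next)
    {
        // ptr->number means: follow the pointer to the struct,
        // then read its number field. Same as (*ptr).number
        if (ptr->number == number)
        {
            return true;
        }
    }
    return false;
}
```

The update ptr = ptr->next is the linked-list substitute for i++ in an array loop.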
1545 01:10:19,030 --> 01:10:21,780 Otherwise, if there is not enough memory, if something goes wrong, 1546 01:10:21,780 --> 01:10:23,220 return one. 1547 01:10:23,220 --> 01:10:25,900 Meanwhile, here's how we add the number to the list. 1548 01:10:25,900 --> 01:10:30,450 So this is exactly what Jess ended up acting out. 1549 01:10:30,450 --> 01:10:37,350 First we handed Stella her number, which is line 52 here. 1550 01:10:37,350 --> 01:10:40,177 We technically told her to point at a garbage value, 1551 01:10:40,177 --> 01:10:41,510 so I've improved the code since. 1552 01:10:41,510 --> 01:10:46,120 So line 53 would be like telling Stella, point here, not here. 1553 01:10:46,120 --> 01:10:48,789 So that's just cleaning up that omission last time. 1554 01:10:48,789 --> 01:10:50,580 And then here we have the same kind of code 1555 01:10:50,580 --> 01:10:52,380 again, a for loop that looks kind of funky 1556 01:10:52,380 --> 01:10:55,834 but it's just like updating the hand as you walk through the list. 1557 01:10:55,834 --> 01:10:57,750 And here's where the interesting part happens. 1558 01:10:57,750 --> 01:11:01,930 At the very end of our story, Jess kind of manipulated our volunteer's arms. 1559 01:11:01,930 --> 01:11:08,250 So if not pointer next, which is a cryptic way of saying, 1560 01:11:08,250 --> 01:11:10,890 if pointer next equals equals null. 1561 01:11:10,890 --> 01:11:14,220 So if Jess has found the end of the list, 1562 01:11:14,220 --> 01:11:18,060 go ahead and update whoever she is pointing at's 1563 01:11:18,060 --> 01:11:21,810 left hand to point to Stella, the new node. 1564 01:11:21,810 --> 01:11:26,280 Then break out because we're done. 1565 01:11:26,280 --> 01:11:28,500 So syntactically, this is hard and problem 1566 01:11:28,500 --> 01:11:31,870 set five will afford us opportunities to walk through very similar code.
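Putting those steps together, appending a new node at the end of the list could look roughly like the following. This is a sketch, not the lecture's file: the function name append, its signature, and the Boolean return value are assumptions (the code described in the lecture does this inline in main and returns one on failure).

```c
#include <stdbool.h>
#include <stdlib.h>

typedef struct node
{
    int number;
    struct node *next;
}
node;

// Hypothetical helper: add number to the end of the list.
// first is the address of the list's first pointer.
bool append(node **first, int number)
{
    // Allocate the new node and overwrite both garbage values
    node *n = malloc(sizeof(node));
    if (n == NULL)
    {
        return false;
    }
    n->number = number;
    n->next = NULL;

    // Empty list: the new node becomes the first node
    if (*first == NULL)
    {
        *first = n;
        return true;
    }

    // Otherwise walk until the node whose next pointer is NULL
    for (node *ptr = *first; ptr != NULL; ptr = ptr->next)
    {
        // !ptr->next is shorthand for ptr->next == NULL
        if (!ptr->next)
        {
            ptr->next = n;
            break;
        }
    }
    return true;
}
```

The break matters: after linking, ptr->next is no longer NULL, so without it the loop would step onto the new node and link it to itself.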
1567 01:11:31,870 --> 01:11:34,380 But for now, just realize that all we're doing 1568 01:11:34,380 --> 01:11:37,380 is instead of just using super simple arithmetic, plus one, plus one, 1569 01:11:37,380 --> 01:11:40,950 plus one, we're just kind of following these arrows, following these arrows. 1570 01:11:40,950 --> 01:11:43,270 And the kind of syntax we'll use for that 1571 01:11:43,270 --> 01:11:45,849 is just this, which is not very readable at first glance. 1572 01:11:45,849 --> 01:11:48,390 But that's why I grasp onto, if you are a more visual person, 1573 01:11:48,390 --> 01:11:51,390 the kinds of hand manipulation and arm changes 1574 01:11:51,390 --> 01:11:55,800 that we were doing here physically with our volunteers. 1575 01:11:55,800 --> 01:11:57,551 And then we, again, print up [INAUDIBLE]. 1576 01:11:57,551 --> 01:11:59,258 The last thing here I'll note, and you'll 1577 01:11:59,258 --> 01:12:01,500 do this in problem set five, is here's how you might 1578 01:12:01,500 --> 01:12:03,120 free a whole linked list of numbers. 1579 01:12:03,120 --> 01:12:05,580 I just kind of congratulated our volunteers and everyone 1580 01:12:05,580 --> 01:12:07,142 left the stage, thereby being freed. 1581 01:12:07,142 --> 01:12:10,100 But if we wanted to do this more methodically, we could use a for loop 1582 01:12:10,100 --> 01:12:12,141 but here I chose to do a while loop, because it's 1583 01:12:12,141 --> 01:12:13,710 a little more succinct design wise. 1584 01:12:13,710 --> 01:12:18,510 Here was our pointer, temporary pointer pointing at numbers. 1585 01:12:18,510 --> 01:12:21,420 And here I can say while pointer is not null because if it's null 1586 01:12:21,420 --> 01:12:22,530 my work is done. 1587 01:12:22,530 --> 01:12:26,220 Here I go ahead and say, update this value next to equal 1588 01:12:26,220 --> 01:12:28,550 whoever's next in the list. 1589 01:12:28,550 --> 01:12:31,290 Free whoever's currently in the list.
1590 01:12:31,290 --> 01:12:33,360 And then update the next pointer. 1591 01:12:33,360 --> 01:12:36,750 So again, don't worry too much about the lower level details here. 1592 01:12:36,750 --> 01:12:41,730 But just take away for today that we do now have a way of implementing, 1593 01:12:41,730 --> 01:12:45,420 in code, the higher level intuition that we derived 1594 01:12:45,420 --> 01:12:46,980 from this kind of data structure. 1595 01:12:46,980 --> 01:12:53,280 But don't fret yet about the code itself. 1596 01:12:53,280 --> 01:12:57,480 But we now have the ability to stitch data structures together like this. 1597 01:12:57,480 --> 01:12:59,640 Upside of which is now we get dynamism, right? 1598 01:12:59,640 --> 01:13:01,590 We're no longer stuck painting ourselves 1599 01:13:01,590 --> 01:13:04,650 into the proverbial corner with arrays by not allocating enough memory. 1600 01:13:04,650 --> 01:13:07,950 Or conversely, wasting memory by allocating way too much just so we 1601 01:13:07,950 --> 01:13:09,630 don't have to deal with the problem. 1602 01:13:09,630 --> 01:13:11,550 But we pay a price with the linked list. 1603 01:13:11,550 --> 01:13:15,750 We get dynamism and can more efficiently add a node, subtract a node, 1604 01:13:15,750 --> 01:13:18,480 and we just have to in constant time, update those pointers. 1605 01:13:18,480 --> 01:13:22,020 But we spend more memory for all these darn pointers. 1606 01:13:22,020 --> 01:13:24,120 And frankly, the code is more complex. 1607 01:13:24,120 --> 01:13:27,900 So recall from our first or second week, human time, programmer time 1608 01:13:27,900 --> 01:13:29,550 is a valuable resource. 1609 01:13:29,550 --> 01:13:32,670 And making something harder and more time consuming to implement 1610 01:13:32,670 --> 01:13:34,440 might not be a price you want to pay.
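
The while loop for freeing a list, described a little earlier, might look something like this. It is a sketch under assumptions: the node type and the name free_list are hypothetical, and the returned count of freed nodes is added here only so the behavior can be checked. The key detail is remembering who is next before freeing the current node, because a freed node's next field must not be read.

```c
#include <stdlib.h>

// Hypothetical node type, as in a singly linked list of numbers
typedef struct node
{
    int number;
    struct node *next;
}
node;

// Frees every node in the list and returns how many nodes were freed
int free_list(node *numbers)
{
    int freed = 0;

    // Temporary pointer, initially pointing at the first node
    node *ptr = numbers;

    // If ptr is NULL, the work is done
    while (ptr != NULL)
    {
        // Remember whoever is next in the list
        node *next = ptr->next;

        // Free whoever is currently in the list
        free(ptr);

        // Move on to the node remembered above
        ptr = next;
        freed++;
    }
    return freed;
}
```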
1611 01:13:34,440 --> 01:13:36,669 And so even I was just chatting with a colleague 1612 01:13:36,669 --> 01:13:39,210 yesterday about how in graduate school I used to cut corners, 1613 01:13:39,210 --> 01:13:41,140 especially late at night when writing code. 1614 01:13:41,140 --> 01:13:45,390 And I would write sometimes deliberately really bad code 1615 01:13:45,390 --> 01:13:47,280 that might take like eight hours to analyze 1616 01:13:47,280 --> 01:13:50,790 some data set for some research project I was working on because you know what? 1617 01:13:50,790 --> 01:13:54,960 I realized it was faster for me to write bad code, poorly designed, that 1618 01:13:54,960 --> 01:13:57,090 takes eight hours because in those eight hours 1619 01:13:57,090 --> 01:13:59,310 I could just go to sleep, frankly. 1620 01:13:59,310 --> 01:14:03,180 Now I would say that was only because my advisor was not grading me 1621 01:14:03,180 --> 01:14:05,900 on correctness and design and style. 1622 01:14:05,900 --> 01:14:09,900 But it is a manifestation of a very actual resource 1623 01:14:09,900 --> 01:14:12,990 that I don't recommend you cut that particular corner for now, 1624 01:14:12,990 --> 01:14:16,110 since one of the goals of being in a class is to get better at design. 1625 01:14:16,110 --> 01:14:18,110 But at the end of the day and in the real world, 1626 01:14:18,110 --> 01:14:20,790 even CS50 staff and I are constantly making decisions. 1627 01:14:20,790 --> 01:14:23,040 Well, yeah, we could improve this feature of help50 1628 01:14:23,040 --> 01:14:24,810 but it's going to take a week to do it. 1629 01:14:24,810 --> 01:14:27,930 Or we can just throw in some extra line of code and get it done now. 1630 01:14:27,930 --> 01:14:28,920 And it's a trade off. 1631 01:14:28,920 --> 01:14:30,974 And this is what makes code good and bad. 
1632 01:14:30,974 --> 01:14:33,390 And when you start to cut these corners in the real world, 1633 01:14:33,390 --> 01:14:36,300 you start to accumulate what the world would call technical debt. 1634 01:14:36,300 --> 01:14:38,310 And debt tends not to be such a good thing. 1635 01:14:38,310 --> 01:14:41,040 And that speaks to the price you're paying in the long term 1636 01:14:41,040 --> 01:14:43,415 because it might take me and the staff longer this summer 1637 01:14:43,415 --> 01:14:45,150 now to go back in and clean all that up. 1638 01:14:45,150 --> 01:14:46,941 And God forbid, overnight frankly, and this 1639 01:14:46,941 --> 01:14:50,010 happened more often than I should admit, my code was buggy and bailed out 1640 01:14:50,010 --> 01:14:53,330 at like 2:00 AM. I wake up eight hours later thinking, my data's ready. 1641 01:14:53,330 --> 01:14:54,330 No. 1642 01:14:54,330 --> 01:14:56,230 I should have done it right the first time so 1643 01:14:56,230 --> 01:14:58,850 I could rerun the code again and again. 1644 01:14:58,850 --> 01:15:01,670 So what else do we get now from this ability 1645 01:15:01,670 --> 01:15:03,740 to have pointers in data structures? 1646 01:15:03,740 --> 01:15:06,280 So there's this picture here from Mather's Dining Hall. 1647 01:15:06,280 --> 01:15:08,660 They represent the notion of actual trays. 1648 01:15:08,660 --> 01:15:10,760 And we've been using the stack in a very low level 1649 01:15:10,760 --> 01:15:12,801 arcane way to talk about memory management, which 1650 01:15:12,801 --> 01:15:15,800 isn't all that useful to us for solving problems. 1651 01:15:15,800 --> 01:15:17,300 But the data structure is. 1652 01:15:17,300 --> 01:15:21,410 It turns out there is a data structure in computer science called a stack.
1653 01:15:21,410 --> 01:15:23,630 And your computer, Mac or PC, is constantly 1654 01:15:23,630 --> 01:15:27,650 using it to manage functions and memory, but we can use it, too, 1655 01:15:27,650 --> 01:15:29,300 for various applications. 1656 01:15:29,300 --> 01:15:32,910 We can implement a data structure wherein I have two operations. 1657 01:15:32,910 --> 01:15:34,940 They're conventionally called push and pop. 1658 01:15:34,940 --> 01:15:36,040 Though it's like add and subtract. 1659 01:15:36,040 --> 01:15:38,539 You can call it anything you want but most programmers would 1660 01:15:38,539 --> 01:15:39,980 call it push and pop. 1661 01:15:39,980 --> 01:15:44,390 Push is like adding a tray to the stack and pop is like taking one off. 1662 01:15:44,390 --> 01:15:46,910 But just as the name implies with the stack, 1663 01:15:46,910 --> 01:15:53,480 what's characteristic of a stack is that it is an example of a LIFO data 1664 01:15:53,480 --> 01:15:57,470 structure, last in first out, L-I-F-O. 1665 01:15:57,470 --> 01:15:58,650 Now what does that mean? 1666 01:15:58,650 --> 01:16:02,492 Well, if one of the staff from the dining hall comes by with a new tray 1667 01:16:02,492 --> 01:16:05,450 that's just been cleaned and he or she puts it on the top of the stack, 1668 01:16:05,450 --> 01:16:09,110 which one is a normal human being going to grab first? 1669 01:16:09,110 --> 01:16:10,930 The last one in, right? 1670 01:16:10,930 --> 01:16:13,700 It'd be strange and kind of difficult to get down on your knees 1671 01:16:13,700 --> 01:16:16,850 and pull out the bottom one, even though that would be more fair, right? 1672 01:16:16,850 --> 01:16:20,750 Like that little tray down there has been waiting the longest to be used. 1673 01:16:20,750 --> 01:16:25,610 But it's under the weight of the whole stack, literally. 1674 01:16:25,610 --> 01:16:28,127 But that, nonetheless, is how a stack would work.
1675 01:16:28,127 --> 01:16:30,460 And you can implement the stack now in a couple of ways. 1676 01:16:30,460 --> 01:16:33,290 And here's where the world gets interesting in programming, 1677 01:16:33,290 --> 01:16:36,380 in that there is this distinction between design 1678 01:16:36,380 --> 01:16:39,410 of data structures and low level implementation details. 1679 01:16:39,410 --> 01:16:42,540 A stack is as I've described it, a LIFO data structure. 1680 01:16:42,540 --> 01:16:45,320 Push and pop, last in, first out. 1681 01:16:45,320 --> 01:16:46,340 That's it. 1682 01:16:46,340 --> 01:16:48,660 How you implement that could be any number of ways. 1683 01:16:48,660 --> 01:16:54,530 For instance, I could implement a stack as a C data type, custom one, 1684 01:16:54,530 --> 01:16:57,680 that has an array of numbers of this capacity where capacity 1685 01:16:57,680 --> 01:17:00,920 is some big constant like 100, 1,000, however many trays I want 1686 01:17:00,920 --> 01:17:04,010 to store so long as I keep track of the size 1687 01:17:04,010 --> 01:17:07,400 of how many trays are in it so that I can always make sure its size is less than 1688 01:17:07,400 --> 01:17:08,690 or equal to the capacity. 1689 01:17:08,690 --> 01:17:11,630 Just to make sure I don't try to cram too many trays in there. 1690 01:17:11,630 --> 01:17:14,450 But what's a downside of this implementation 1691 01:17:14,450 --> 01:17:17,080 of Mather House's stack of trays? 1692 01:17:17,080 --> 01:17:20,900 1693 01:17:20,900 --> 01:17:23,750 AUDIENCE: [INAUDIBLE] 1694 01:17:23,750 --> 01:17:25,250 DAVID MALAN: Limited space, exactly. 1695 01:17:25,250 --> 01:17:29,110 I have consciously hard coded capacity to be some fixed value. 1696 01:17:29,110 --> 01:17:33,780 So if we buy a new tray or a whole box of trays arrive, 1697 01:17:33,780 --> 01:17:35,020 might not fit there, right?
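
The fixed-capacity stack just described might be sketched like this. It is a minimal illustration under assumptions: the field names numbers and size, the value 100 for CAPACITY, and the bool return values that signal a full or empty stack are all choices made here, not necessarily what the lecture slide shows.

```c
#include <stdbool.h>

// Assumed constant: the most trays the stack can ever hold
#define CAPACITY 100

typedef struct
{
    int numbers[CAPACITY];  // storage, preallocated at full capacity
    int size;               // how many values are currently on the stack
}
stack;

// Adds n to the top of the stack; returns false if the stack is full
bool push(stack *s, int n)
{
    if (s->size == CAPACITY)
    {
        return false;  // out of space: the downside of a hard-coded capacity
    }
    s->numbers[s->size] = n;
    s->size++;
    return true;
}

// Removes the most recently pushed value into *n; returns false if empty
bool pop(stack *s, int *n)
{
    if (s->size == 0)
    {
        return false;
    }
    s->size--;
    *n = s->numbers[s->size];
    return true;
}
```

After pushing 1, 2, and 3, successive pops yield 3, then 2, then 1, which is the last in, first out behavior.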
1698 01:17:35,020 --> 01:17:37,964 Once I exhaust this remaining space, I need to make a new pile 1699 01:17:37,964 --> 01:17:39,380 or I need to store them elsewhere. 1700 01:17:39,380 --> 01:17:40,780 I'm just out of space. 1701 01:17:40,780 --> 01:17:43,950 So maybe this is a good design decision in that it reflects reality. 1702 01:17:43,950 --> 01:17:47,130 Or maybe it's stupid because now I can't store even more trays 1703 01:17:47,130 --> 01:17:49,260 when they come in via shipment. 1704 01:17:49,260 --> 01:17:50,520 So I could solve that. 1705 01:17:50,520 --> 01:17:53,096 We know from our brief example a moment ago, 1706 01:17:53,096 --> 01:17:54,720 you could just make your array dynamic. 1707 01:17:54,720 --> 01:17:57,330 Don't preallocate it to be of size capacity. 1708 01:17:57,330 --> 01:18:00,270 Just declare it to be a pointer that will eventually 1709 01:18:00,270 --> 01:18:03,090 point to maybe space for one tray or 100 trays 1710 01:18:03,090 --> 01:18:08,280 or 1,000 trays or maybe 1,001 trays if we realloc the space again and again. 1711 01:18:08,280 --> 01:18:11,280 And when you start writing code that involves other people, whether it's 1712 01:18:11,280 --> 01:18:15,090 for some school project or a personal project or just in the real world, 1713 01:18:15,090 --> 01:18:16,860 this is where life gets more interesting, 1714 01:18:16,860 --> 01:18:20,910 too, because so long as you and I, if my colleague kind of decide, OK. 1715 01:18:20,910 --> 01:18:23,670 I'm going to expose push and pop as the operations. 1716 01:18:23,670 --> 01:18:25,230 I will implement push and pop. 1717 01:18:25,230 --> 01:18:28,350 You don't have to worry about the low level implementation 1718 01:18:28,350 --> 01:18:30,450 details in my own design decisions. 
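
To make the point about implementation details concrete, here is a sketch of the same push and pop operations backed by a pointer and realloc instead of a fixed array. The names and the grow-by-one policy are assumptions made for illustration, as is the hypothetical destroy helper; a real implementation would more likely grow in larger chunks to avoid calling realloc on every push.

```c
#include <stdbool.h>
#include <stdlib.h>

typedef struct
{
    int *numbers;  // points to heap memory; NULL while nothing is allocated
    int size;      // how many values are currently on the stack
}
stack;

// Adds n to the top of the stack, growing the memory by one int
bool push(stack *s, int n)
{
    // realloc(NULL, ...) behaves like malloc, so the first push works too
    int *tmp = realloc(s->numbers, (s->size + 1) * sizeof(int));
    if (tmp == NULL)
    {
        return false;  // the old memory is still valid and untouched
    }
    s->numbers = tmp;
    s->numbers[s->size] = n;
    s->size++;
    return true;
}

// Removes the most recently pushed value into *n; returns false if empty.
// For simplicity this sketch does not shrink the allocation.
bool pop(stack *s, int *n)
{
    if (s->size == 0)
    {
        return false;
    }
    s->size--;
    *n = s->numbers[s->size];
    return true;
}

// Hypothetical helper: releases the stack's memory
void destroy(stack *s)
{
    free(s->numbers);
    s->numbers = NULL;
    s->size = 0;
}
```

Code that only calls push and pop would not need to change when switching from a fixed array to this version.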
1719 01:18:30,450 --> 01:18:33,120 You just have to read my documentation and not 1720 01:18:33,120 --> 01:18:37,650 care how I've implemented it because I have abstracted that away for you 1721 01:18:37,650 --> 01:18:39,000 and given you just an API. 1722 01:18:39,000 --> 01:18:42,630 Push and pop would be an API, application programming interface. 1723 01:18:42,630 --> 01:18:44,880 All you need to know is that you can trust 1724 01:18:44,880 --> 01:18:46,710 that I will implement push and pop. 1725 01:18:46,710 --> 01:18:50,756 And you might dislike it ultimately, if I limit your space, 1726 01:18:50,756 --> 01:18:53,130 but to understand that you need to read the documentation 1727 01:18:53,130 --> 01:18:56,610 to know what features my implementation is providing. 1728 01:18:56,610 --> 01:19:00,960 Now this of course is the ridiculousness that ensues every year or so, whereby 1729 01:19:00,960 --> 01:19:03,000 people line up to buy an iPhone. 1730 01:19:03,000 --> 01:19:06,840 Now, why would it be a bad thing if Apple used a stack when people 1731 01:19:06,840 --> 01:19:10,330 arrive at 3:00 AM for their iPhones? 1732 01:19:10,330 --> 01:19:12,800 Yeah. 1733 01:19:12,800 --> 01:19:14,306 AUDIENCE: [INAUDIBLE] 1734 01:19:14,306 --> 01:19:15,180 DAVID MALAN: Exactly. 1735 01:19:15,180 --> 01:19:17,388 The person who came last would get their phone first, 1736 01:19:17,388 --> 01:19:19,650 which is fantastic for that person. 1737 01:19:19,650 --> 01:19:21,730 But it's really unfair to everyone else. 1738 01:19:21,730 --> 01:19:24,660 So Apple of course, like most stores, if they even have this problem, 1739 01:19:24,660 --> 01:19:30,750 have queues or lines whereby it's a FIFO data structure, first in, first out. 1740 01:19:30,750 --> 01:19:34,260 Where the first person in line hopefully gets his or her iPhone first. 1741 01:19:34,260 --> 01:19:37,160 So how can you implement those operations or that API?
1742 01:19:37,160 --> 01:19:40,750 Might call it enqueue and dequeue, but add, subtract, whatever. 1743 01:19:40,750 --> 01:19:42,180 But these are the terms of art. 1744 01:19:42,180 --> 01:19:43,830 And we might implement it as follows. 1745 01:19:43,830 --> 01:19:48,521 A queue might just need a little bit more information than capacity and size 1746 01:19:48,521 --> 01:19:49,020 alone. 1747 01:19:49,020 --> 01:19:51,420 You have to remember who's in front, potentially, 1748 01:19:51,420 --> 01:19:53,790 just so that when that person gets out of line, 1749 01:19:53,790 --> 01:19:56,490 you don't have to move all of your data in the array 1750 01:19:56,490 --> 01:19:58,232 just like the humans would walk forward. 1751 01:19:58,232 --> 01:19:59,190 That's a waste of time. 1752 01:19:59,190 --> 01:20:01,320 Every time someone's ready to buy their phone, 1753 01:20:01,320 --> 01:20:04,320 why do n-1 people have to take a step forward? 1754 01:20:04,320 --> 01:20:08,850 Why not just bring the phone to them and save that inefficient use of time? 1755 01:20:08,850 --> 01:20:12,150 Or we could do it like this, more of a dynamic data structure. 1756 01:20:12,150 --> 01:20:13,770 And we won't do the code here. 1757 01:20:13,770 --> 01:20:17,080 But we've seen, for instance, the example in our list zero, one, 1758 01:20:17,080 --> 01:20:20,790 and two code, how you could start with a fixed size array, 1759 01:20:20,790 --> 01:20:24,930 make it dynamic with malloc and realloc, and how you might further 1760 01:20:24,930 --> 01:20:28,980 make it dynamic with a linked list, albeit with trade offs of time 1761 01:20:28,980 --> 01:20:29,970 and space.
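
The array-based queue with a remembered front might be sketched as follows. This is an illustration under assumptions: the field names, the value of CAPACITY, and especially the wrap-around arithmetic with the % operator are choices made here so that freed slots at the start of the array can be reused; the lecture only says that the queue needs to remember who is in front.

```c
#include <stdbool.h>

// Assumed constant: the most people the line can hold at once
#define CAPACITY 100

typedef struct
{
    int numbers[CAPACITY];
    int front;  // index of whoever is currently first in line
    int size;   // how many values are currently in the queue
}
queue;

// Adds n to the back of the line; returns false if the queue is full
bool enqueue(queue *q, int n)
{
    if (q->size == CAPACITY)
    {
        return false;
    }
    // The back is size slots past the front, wrapping around the array
    int back = (q->front + q->size) % CAPACITY;
    q->numbers[back] = n;
    q->size++;
    return true;
}

// Removes the value at the front into *n; returns false if empty
bool dequeue(queue *q, int *n)
{
    if (q->size == 0)
    {
        return false;
    }
    *n = q->numbers[q->front];
    // Move the front forward instead of shifting everyone else
    q->front = (q->front + 1) % CAPACITY;
    q->size--;
    return true;
}
```

Because only front and size change on dequeue, nobody else in the array has to take a step forward.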
1762 01:20:29,970 --> 01:20:33,686 There's this great short video I thought I'd share here, wherein Jack 1763 01:20:33,686 --> 01:20:36,810 learns the facts about queues and stacks which distinguishes these two data 1764 01:20:36,810 --> 01:20:39,660 structures in a way that actually paints an even more clear picture of how 1765 01:20:39,660 --> 01:20:40,230 they're distinct. 1766 01:20:40,230 --> 01:20:42,230 If we can dim the lights for 60 seconds or so. 1767 01:20:42,230 --> 01:20:49,354 1768 01:20:49,354 --> 01:20:50,020 [VIDEO PLAYBACK] 1769 01:20:50,020 --> 01:20:52,930 - Once upon a time there was a guy named Jack. 1770 01:20:52,930 --> 01:20:56,350 When it came to making friends, Jack did not have the knack. 1771 01:20:56,350 --> 01:20:59,260 So Jack went to talk to the most popular guy he knew. 1772 01:20:59,260 --> 01:21:01,880 He went up to Lu and asked, what do I do? 1773 01:21:01,880 --> 01:21:04,480 Lu saw that his friend was really distressed. 1774 01:21:04,480 --> 01:21:06,970 Well, Lu began, just look how you're dressed. 1775 01:21:06,970 --> 01:21:09,730 Don't you have any clothes with a different look? 1776 01:21:09,730 --> 01:21:10,690 Yes, said Jack. 1777 01:21:10,690 --> 01:21:11,860 I sure do. 1778 01:21:11,860 --> 01:21:14,320 Come to my house and I'll show them to you. 1779 01:21:14,320 --> 01:21:17,260 So they went off to Jack's and Jack showed Lu the box 1780 01:21:17,260 --> 01:21:20,270 where he kept all his shirts and his pants and his socks. 1781 01:21:20,270 --> 01:21:23,350 Lu said, I see you have all your clothes in a pile. 1782 01:21:23,350 --> 01:21:25,870 Why don't you wear some others once in awhile? 1783 01:21:25,870 --> 01:21:29,050 Jack said, well, when I remove clothes and socks, 1784 01:21:29,050 --> 01:21:31,810 I wash them and put them away in the box. 1785 01:21:31,810 --> 01:21:34,260 Then comes the next morning and up I hop. 1786 01:21:34,260 --> 01:21:37,470 I go to the box and get my clothes off the top.
1787 01:21:37,470 --> 01:21:40,120 Lu quickly realized the problem with Jack. 1788 01:21:40,120 --> 01:21:43,150 He kept clothes, CDs, and books in a stack. 1789 01:21:43,150 --> 01:21:45,460 When he reached for something to read or to wear, 1790 01:21:45,460 --> 01:21:48,130 he chose the top book or underwear. 1791 01:21:48,130 --> 01:21:50,430 Then when he was done he would put it right back. 1792 01:21:50,430 --> 01:21:52,990 Back it would go, on top of the stack. 1793 01:21:52,990 --> 01:21:55,510 I know the solution, said a triumphant Lu. 1794 01:21:55,510 --> 01:21:58,090 You need to learn to start using a queue. 1795 01:21:58,090 --> 01:22:00,880 Lu took Jack's clothes and hung them in a closet. 1796 01:22:00,880 --> 01:22:03,730 And when he had emptied the box, he just tossed it. 1797 01:22:03,730 --> 01:22:07,480 Then he said, now Jack, at the end of day, put your clothes on the left 1798 01:22:07,480 --> 01:22:09,066 when you put them away. 1799 01:22:09,066 --> 01:22:10,815 Then tomorrow morning when you see the sun 1800 01:22:10,815 --> 01:22:14,440 shine, get your clothes from the right, from the end of the line. 1801 01:22:14,440 --> 01:22:17,410 Don't you see, said Lu, it will be so nice. 1802 01:22:17,410 --> 01:22:20,790 You'll wear everything once before you wear something twice. 1803 01:22:20,790 --> 01:22:23,640 And with everything in queues in his closet and shelf, 1804 01:22:23,640 --> 01:22:26,680 Jack started to feel quite sure of himself all thanks 1805 01:22:26,680 --> 01:22:30,157 to Lu and his wonderful queue. 1806 01:22:30,157 --> 01:22:30,740 [END PLAYBACK] 1807 01:22:30,740 --> 01:22:34,460 DAVID MALAN: So that isn't to say that queues are all that 1808 01:22:34,460 --> 01:22:36,032 and stacks are a bad data structure. 1809 01:22:36,032 --> 01:22:37,990 They actually each have their own applications.
1810 01:22:37,990 --> 01:22:40,990 And in fact, one common use for stacks beyond memory management, 1811 01:22:40,990 --> 01:22:44,350 as we'll discuss in a couple of weeks when we start exploring HTML and web 1812 01:22:44,350 --> 01:22:47,342 programming, you'll see that HTML itself, this 1813 01:22:47,342 --> 01:22:49,300 is the language in which web pages are written. 1814 01:22:49,300 --> 01:22:51,520 That you'll soon be able to write if not already. 1815 01:22:51,520 --> 01:22:55,030 This is a language that actually has a nested hierarchy to it. 1816 01:22:55,030 --> 01:23:00,040 Whereby a browser might actually use a stack to analyze the HTML 1817 01:23:00,040 --> 01:23:03,940 that composes a web page to determine, for instance, if it is correct or not. 1818 01:23:03,940 --> 01:23:07,330 But there's so many other tools that we can now add to your toolkit. 1819 01:23:07,330 --> 01:23:10,270 And even though we'll look at each of these just briefly, each of them 1820 01:23:10,270 --> 01:23:13,450 derives from these two very simple principles, the ability 1821 01:23:13,450 --> 01:23:17,170 to come up with custom data structures inside of which are pointers, 1822 01:23:17,170 --> 01:23:19,630 or the ability to stitch one thing to another. 1823 01:23:19,630 --> 01:23:22,840 So here's an example of what a computer scientist would call a tree. 1824 01:23:22,840 --> 01:23:25,360 The nodes here we've drawn as circles just because. 1825 01:23:25,360 --> 01:23:29,080 But the nodes in a tree are much like a family tree, where 1826 01:23:29,080 --> 01:23:32,470 each node has zero or more children or descendants, maybe 1827 01:23:32,470 --> 01:23:35,140 a parent or other ancestors. 1828 01:23:35,140 --> 01:23:38,110 And so we'll call things like the first node 1829 01:23:38,110 --> 01:23:42,010 at the very top in a data structure called the tree, the root of the tree, 1830 01:23:42,010 --> 01:23:44,440 albeit growing downward like this like a family tree.
1831 01:23:44,440 --> 01:23:48,340 Anything at the very bottom of the tree that only has arrows going into it 1832 01:23:48,340 --> 01:23:51,380 will be called children or leaves of the tree. 1833 01:23:51,380 --> 01:23:55,010 And so this might be a way to lay out data in a useful way. 1834 01:23:55,010 --> 01:23:57,970 In fact, if you think back to when we had things like numbers 1835 01:23:57,970 --> 01:24:01,660 like this, thus far any time we dealt with numbers or words or Mike Smiths, 1836 01:24:01,660 --> 01:24:04,690 we would just order them from left to right in an array 1837 01:24:04,690 --> 01:24:08,720 and then search the array either in big O of n time linearly from left 1838 01:24:08,720 --> 01:24:09,220 to right. 1839 01:24:09,220 --> 01:24:10,810 But we did better using what? 1840 01:24:10,810 --> 01:24:13,450 1841 01:24:13,450 --> 01:24:16,750 Binary search, but for binary search it needed to be an array 1842 01:24:16,750 --> 01:24:19,270 and it needed to be sorted. 1843 01:24:19,270 --> 01:24:22,120 And the problem I never dealt with was we never 1844 01:24:22,120 --> 01:24:24,520 actually added another page to the phone book. 1845 01:24:24,520 --> 01:24:27,070 We never actually tried to add more numbers to our array. 1846 01:24:27,070 --> 01:24:30,700 And yet today, we've kind of identified these very glaring issues with arrays, 1847 01:24:30,700 --> 01:24:33,532 which is that you're kind of painted into a corner. 1848 01:24:33,532 --> 01:24:36,490 If you allocate only so much space, you use it all up and then darn it, 1849 01:24:36,490 --> 01:24:38,860 you want to add more to the array.
1850 01:24:38,860 --> 01:24:44,110 So how can we maybe still lay out data in sorted order, still 1851 01:24:44,110 --> 01:24:46,450 leverage something like logarithmic time and divide 1852 01:24:46,450 --> 01:24:50,740 and conquer, but get today's benefit of the dynamism whereby 1853 01:24:50,740 --> 01:24:54,670 we can grow the data structure and shrink it very incrementally, 1854 01:24:54,670 --> 01:24:58,330 without having to all of a sudden reallocate the whole structure? 1855 01:24:58,330 --> 01:25:01,120 Well, instead of laying out these numbers, which are conveniently 1856 01:25:01,120 --> 01:25:05,050 numbered as multiples of 11 here, 22, 33, 55, 1857 01:25:05,050 --> 01:25:07,960 what if we laid them out like this in memory? 1858 01:25:07,960 --> 01:25:11,590 We won't look at the code for this, but think of each of these circles 1859 01:25:11,590 --> 01:25:15,340 as a structure, a data structure, inside of which there's 1860 01:25:15,340 --> 01:25:18,940 an int n, how many pointers apparently? 1861 01:25:18,940 --> 01:25:21,820 1862 01:25:21,820 --> 01:25:23,220 Seven total on the screen. 1863 01:25:23,220 --> 01:25:25,560 But how about within each node, like this one here? 1864 01:25:25,560 --> 01:25:27,630 There's a number n, 55. 1865 01:25:27,630 --> 01:25:28,330 And what else? 1866 01:25:28,330 --> 01:25:29,850 How many pointers? 1867 01:25:29,850 --> 01:25:30,914 Just two. 1868 01:25:30,914 --> 01:25:33,330 Two maximally, in fact, because the leaves, it would seem, 1869 01:25:33,330 --> 01:25:34,980 have zero by definition. 1870 01:25:34,980 --> 01:25:38,640 And technically, if I hadn't added 22, maybe there could just be one child. 1871 01:25:38,640 --> 01:25:42,960 This is what we call a binary tree because every node has 1872 01:25:42,960 --> 01:25:45,330 at most two children, 0, 1, or 2. 1873 01:25:45,330 --> 01:25:48,900 And it's technically a binary search tree because of a special property. 
1874 01:25:48,900 --> 01:25:55,110 It's very searchable because if you look at any node, its left child is smaller. 1875 01:25:55,110 --> 01:25:58,180 And if you look at any node, its right child is bigger. 1876 01:25:58,180 --> 01:26:00,210 And that's a recursive definition, so to speak. 1877 01:26:00,210 --> 01:26:04,020 You can look at any node in the tree and that definition is true. 1878 01:26:04,020 --> 01:26:06,660 Even the leaves, because it's sort of a vacuous statement 1879 01:26:06,660 --> 01:26:09,600 to say it's greater than its left child if there is no left child. 1880 01:26:09,600 --> 01:26:12,220 It's sort of trivially true. 1881 01:26:12,220 --> 01:26:14,550 So what's nice about this data structure? 1882 01:26:14,550 --> 01:26:17,700 Well, suppose I want to search for the number 22. 1883 01:26:17,700 --> 01:26:21,840 Like our linked list, and like Olivia being the special first pointer, 1884 01:26:21,840 --> 01:26:24,330 a binary tree in a computer's memory would just 1885 01:26:24,330 --> 01:26:28,050 have one special pointer, called root or first or whatever you want to call it. 1886 01:26:28,050 --> 01:26:31,770 And if you want to look for 22 just like Olivia and Jess were, 1887 01:26:31,770 --> 01:26:32,700 you might look here. 1888 01:26:32,700 --> 01:26:35,130 And say, hmm, 55 is greater than 22. 1889 01:26:35,130 --> 01:26:37,080 So which way do I go? 1890 01:26:37,080 --> 01:26:37,877 Left obviously. 1891 01:26:37,877 --> 01:26:39,960 And here, you know, if we were doing this visually 1892 01:26:39,960 --> 01:26:42,120 we could snip off that whole subtree. 1893 01:26:42,120 --> 01:26:45,210 And you would see half of the problem be torn away like the phone book. 1894 01:26:45,210 --> 01:26:48,130 22 versus 33, of course, this is greater. 1895 01:26:48,130 --> 01:26:50,080 So we go here and we find it. 
1896 01:26:50,080 --> 01:26:52,204 And long story short, that was not linear 1897 01:26:52,204 --> 01:26:54,120 because we weren't searching all of the nodes. 1898 01:26:54,120 --> 01:26:57,240 And if conceptually we were chopping the tree in half, 1899 01:26:57,240 --> 01:26:59,910 in half, in half every time we went left or right, 1900 01:26:59,910 --> 01:27:04,080 what should be the running time of search on a binary search tree? 1901 01:27:04,080 --> 01:27:07,410 Log base 2 of n, or just logarithmic as we've seen. 1902 01:27:07,410 --> 01:27:10,510 Now it's not necessarily always as prettily balanced. 1903 01:27:10,510 --> 01:27:12,370 This is very deliberately chosen. 1904 01:27:12,370 --> 01:27:15,150 You can get perverse situations where it just kind of devolves 1905 01:27:15,150 --> 01:27:17,220 into a long linked list. 1906 01:27:17,220 --> 01:27:18,900 But it still is a binary search tree. 1907 01:27:18,900 --> 01:27:21,900 It was just poorly built. But at least if we keep a balance like this, 1908 01:27:21,900 --> 01:27:23,190 we can gain some benefits. 1909 01:27:23,190 --> 01:27:27,171 And here's how we would implement your proposed integer and two pointers. 1910 01:27:27,171 --> 01:27:28,920 Instead of just calling it next, I'm going 1911 01:27:28,920 --> 01:27:32,100 to call it more semantically usefully left and right. 1912 01:27:32,100 --> 01:27:35,550 And notice that struct node star is just a pointer called left. 1913 01:27:35,550 --> 01:27:37,614 Struct node star is a pointer called right. 1914 01:27:37,614 --> 01:27:39,030 And that's how we implement these. 1915 01:27:39,030 --> 01:27:43,380 And what do you think the leaves have as their values for left and right? 1916 01:27:43,380 --> 01:27:46,230 The leaves of the tree had no children, by definition, so what's 1917 01:27:46,230 --> 01:27:48,041 the value of left and right? 1918 01:27:48,041 --> 01:27:48,540 Null.
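
The node described here, an integer plus left and right pointers, might be declared like this. It is a sketch: the field names n, left, and right follow the spoken description, while make_leaf is a hypothetical helper added only to show that a leaf's two pointers are NULL.

```c
#include <stdlib.h>

// A binary tree node: one number and at most two children
typedef struct node
{
    int n;
    struct node *left;   // smaller values live down this side
    struct node *right;  // bigger values live down this side
}
node;

// Hypothetical helper: allocates a node with no children (a leaf).
// Returns NULL if there is not enough memory.
node *make_leaf(int n)
{
    node *t = malloc(sizeof(node));
    if (t == NULL)
    {
        return NULL;
    }
    t->n = n;
    t->left = NULL;
    t->right = NULL;
    return t;
}
```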
1919 01:27:48,540 --> 01:27:53,380 So they're just sort of pointing down at the floor as zero, null, values. 1920 01:27:53,380 --> 01:27:56,160 So we're not going to write the code for this now, 1921 01:27:56,160 --> 01:27:59,820 but we can leverage week zero's ideas. 1922 01:27:59,820 --> 01:28:01,240 Divide and conquer, binary search. 1923 01:28:01,240 --> 01:28:04,020 We can leverage last week and this week's ideas of structures 1924 01:28:04,020 --> 01:28:07,050 and dynamic memory and technically the heap in order 1925 01:28:07,050 --> 01:28:10,140 to start to build up data structures like this, that now give us dynamism 1926 01:28:10,140 --> 01:28:12,540 that can grow and shrink as needed. 1927 01:28:12,540 --> 01:28:14,640 And just so you've seen the code, here might 1928 01:28:14,640 --> 01:28:18,780 be an implementation of a function for binary search tree 1929 01:28:18,780 --> 01:28:22,584 that, given the root of the tree, finds for you true or false, 1930 01:28:22,584 --> 01:28:24,000 whether or not something is in it. 1931 01:28:24,000 --> 01:28:27,090 So I want to search for a number n in this tree. 1932 01:28:27,090 --> 01:28:29,760 So this here is, again, a pointer to the root 1933 01:28:29,760 --> 01:28:34,350 just like Olivia was a pointer to the first node in the linked list. 1934 01:28:34,350 --> 01:28:37,205 So if the tree is null, return false because it's 1935 01:28:37,205 --> 01:28:38,580 kind of a stupid question to ask. 1936 01:28:38,580 --> 01:28:41,250 If there's no tree being passed in, it's clearly not there. 1937 01:28:41,250 --> 01:28:42,030 So return false. 1938 01:28:42,030 --> 01:28:46,200 That's our special case to ensure that we don't dive too deeply. 1939 01:28:46,200 --> 01:28:49,650 But here's a very cool application of a past idea.
1940 01:28:49,650 --> 01:28:55,740 If n is less than the n at the current node in the tree, 1941 01:28:55,740 --> 01:28:58,950 and remember the arrow just says, go there and look at n, 1942 01:28:58,950 --> 01:29:02,430 we know we want to look at the left hand side of the tree. 1943 01:29:02,430 --> 01:29:07,002 So do we have an algorithm to search a tree for a specific value? 1944 01:29:07,002 --> 01:29:09,210 Just so happens that tree is now smaller because it's 1945 01:29:09,210 --> 01:29:11,340 this half of the tree on the left. 1946 01:29:11,340 --> 01:29:12,230 We do. 1947 01:29:12,230 --> 01:29:15,960 We have a function called search that takes a number as input 1948 01:29:15,960 --> 01:29:17,786 and takes a tree as a pointer. 1949 01:29:17,786 --> 01:29:19,410 That doesn't have to be the whole tree. 1950 01:29:19,410 --> 01:29:23,550 It can be a sub-tree because again, a tree is kind of recursively defined, 1951 01:29:23,550 --> 01:29:27,340 because every left child and right child itself might have children. 1952 01:29:27,340 --> 01:29:29,790 So it's a smaller tree but it's still a tree. 1953 01:29:29,790 --> 01:29:31,170 So I can answer this question. 1954 01:29:31,170 --> 01:29:34,920 If n is less than the current node's own n value, 1955 01:29:34,920 --> 01:29:39,690 I can just return the answer to calling search on the same number, 1956 01:29:39,690 --> 01:29:41,670 but passing in just the left half of the tree. 1957 01:29:41,670 --> 01:29:45,780 So this is like the tree version of tearing the phone book in half 1958 01:29:45,780 --> 01:29:48,120 and searching only the left half. 1959 01:29:48,120 --> 01:29:51,750 And you can perhaps guess, if you're following along at this point, 1960 01:29:51,750 --> 01:29:53,970 if n is greater than the current node, we're 1961 01:29:53,970 --> 01:29:56,100 just going to search to the right. 1962 01:29:56,100 --> 01:29:57,300 And that's three cases. 1963 01:29:57,300 --> 01:30:00,630 What's the fourth possible case? 
1964 01:30:00,630 --> 01:30:03,030 Yeah, if n equals the current node. 1965 01:30:03,030 --> 01:30:06,060 And so in that case I'm just going to trivially return true. 1966 01:30:06,060 --> 01:30:07,770 And this is kind of beautiful. 1967 01:30:07,770 --> 01:30:10,230 It's not from one perspective. 1968 01:30:10,230 --> 01:30:12,600 It's not obvious at first glance, how this works. 1969 01:30:12,600 --> 01:30:15,999 And it's not comfortable necessarily if you're not used to recursion that much. 1970 01:30:15,999 --> 01:30:17,790 But what's beautiful about this, especially 1971 01:30:17,790 --> 01:30:20,290 if we get rid of the stupid curly braces and a lot of stuff 1972 01:30:20,290 --> 01:30:22,180 that's not really intellectually interesting, 1973 01:30:22,180 --> 01:30:26,330 you are reducing this problem to really just these lines of code. 1974 01:30:26,330 --> 01:30:28,780 Check for null, return false. 1975 01:30:28,780 --> 01:30:31,630 Check if it's less than, just recurse on the left. 1976 01:30:31,630 --> 01:30:33,880 Check if it's greater than, recurse on the right. 1977 01:30:33,880 --> 01:30:35,410 Otherwise you found it. 1978 01:30:35,410 --> 01:30:39,370 It's literally the same idea or spirit as our divide and conquer 1979 01:30:39,370 --> 01:30:41,890 approach for the phone book, just implemented now 1980 01:30:41,890 --> 01:30:46,700 using trees or nodes linked together in a tree. 1981 01:30:46,700 --> 01:30:47,900 Yeah? 1982 01:30:47,900 --> 01:30:51,580 AUDIENCE: [INAUDIBLE]
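
The recursive search summarized above (check for null, recurse left, recurse right, otherwise found) can be sketched as follows. The node type and the parameter order are assumptions chosen to match the spoken description of a function called search that takes a number and a pointer to a tree.

```c
#include <stdbool.h>
#include <stddef.h>

// Binary search tree node: smaller values to the left, bigger to the right
typedef struct node
{
    int n;
    struct node *left;
    struct node *right;
}
node;

// Returns true if n is somewhere in the tree (or subtree) that tree points to
bool search(int n, node *tree)
{
    // No tree at all: n is clearly not there
    if (tree == NULL)
    {
        return false;
    }
    // n is smaller than this node's number: search only the left subtree
    else if (n < tree->n)
    {
        return search(n, tree->left);
    }
    // n is bigger than this node's number: search only the right subtree
    else if (n > tree->n)
    {
        return search(n, tree->right);
    }
    // Otherwise n equals this node's number: found it
    else
    {
        return true;
    }
}
```

Each recursive call is handed a smaller tree, which is why the null check is needed to stop the recursion once a leaf's missing child is reached.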
1988 01:31:07,540 --> 01:31:09,670 So tree, we could call it anything we want. 1989 01:31:09,670 --> 01:31:13,430 I just called it tree, represents that. 1990 01:31:13,430 --> 01:31:18,229 And meanwhile, tree left would be like if Olivia was pointing at a node here. 1991 01:31:18,229 --> 01:31:20,770 Actually, if Olivia is pointing at the root of the tree here, 1992 01:31:20,770 --> 01:31:24,040 tree left would be go look at the left half of the tree 1993 01:31:24,040 --> 01:31:25,660 or the right half of the tree. 1994 01:31:25,660 --> 01:31:28,990 If again, our volunteers were laid out on stage like a fan, 1995 01:31:28,990 --> 01:31:31,310 like a tree instead of a list. 1996 01:31:31,310 --> 01:31:34,690 So we've seen a whole bunch of algorithms that might 1997 01:31:34,690 --> 01:31:36,520 have any number of these running times. 1998 01:31:36,520 --> 01:31:38,830 And up until now kind of the best running time 1999 01:31:38,830 --> 01:31:42,044 really has been this for the fanciest of algorithms. 2000 01:31:42,044 --> 01:31:43,960 But we have seen constant time here and there. 2001 01:31:43,960 --> 01:31:46,420 And even today if we want to insert into a linked list 2002 01:31:46,420 --> 01:31:48,190 and we don't really care about the order, 2003 01:31:48,190 --> 01:31:52,870 we can just plug the new value right there after Olivia and before Achmed 2004 01:31:52,870 --> 01:31:54,160 and get our constant time. 2005 01:31:54,160 --> 01:31:57,790 But wouldn't it be nice if more operations were constant timed? 2006 01:31:57,790 --> 01:32:00,790 One step, two step, three stepped or some finite number? 2007 01:32:00,790 --> 01:32:04,240 And it turns out we can achieve this with a bit of thought. 2008 01:32:04,240 --> 01:32:07,840 And we can leverage another sort of familiar idea as follows. 
2009 01:32:07,840 --> 01:32:12,340 So like here, for instance, are some unusually large playing cards, 2010 01:32:12,340 --> 01:32:15,280 which actually do exist if you just Google jumbo playing cards 2011 01:32:15,280 --> 01:32:17,620 and look for them on Amazon. 2012 01:32:17,620 --> 01:32:20,260 Suppose I wanted to sort this deck of cards. 2013 01:32:20,260 --> 01:32:23,320 I could go through the deck one at a time 2014 01:32:23,320 --> 01:32:26,380 and order them both by their suit, like clubs and hearts 2015 01:32:26,380 --> 01:32:28,180 and so forth, and also by their numbers. 2016 01:32:28,180 --> 01:32:30,680 But odds are if you're like me, you're going to probably try 2017 01:32:30,680 --> 01:32:32,260 to make the problem a little simpler. 2018 01:32:32,260 --> 01:32:35,937 And if you see the king of spades here, I put him over here. 2019 01:32:35,937 --> 01:32:37,770 Nine of spades, I'm going to put that there. 2020 01:32:37,770 --> 01:32:39,120 10 of spades coincidentally. 2021 01:32:39,120 --> 01:32:40,453 I'm going to put that over here. 2022 01:32:40,453 --> 01:32:42,790 Then I'm going to do what with the hearts, probably? 2023 01:32:42,790 --> 01:32:46,030 You know, probably I'm not going to go through one at a time. 2024 01:32:46,030 --> 01:32:48,340 I'm going to kind of bucketize each of the cards. 2025 01:32:48,340 --> 01:32:51,714 So here's the ace of clubs, so I'm going to make a third pile. 2026 01:32:51,714 --> 01:32:52,880 Here's a couple of diamonds. 2027 01:32:52,880 --> 01:32:53,921 So that's my fourth pile. 2028 01:32:53,921 --> 01:32:57,340 And then I'm just going to repeat this, because it's a nice simple algorithm. 2029 01:32:57,340 --> 01:33:00,040 It's going to make my life a little easier in just a moment 2030 01:33:00,040 --> 01:33:05,440 once everything is in the right pile. 2031 01:33:05,440 --> 01:33:08,860 But this is a general notion of what we'll call hashing.
2032 01:33:08,860 --> 01:33:11,470 And I'm not going to finish it because surprise, surprise, 2033 01:33:11,470 --> 01:33:13,480 we're going to get 13 cards in each pile. 2034 01:33:13,480 --> 01:33:17,110 But this is a more fundamental notion of hashing. 2035 01:33:17,110 --> 01:33:20,710 You take as input something from your list of inputs. 2036 01:33:20,710 --> 01:33:23,080 You look at it and you make a decision based on it. 2037 01:33:23,080 --> 01:33:26,530 And in this case, my hash value is going to be zero, one, two, or three; 2038 01:33:26,530 --> 01:33:28,840 two, because it's going to go into the hearts pile. 2039 01:33:28,840 --> 01:33:30,151 And what is a hash function? 2040 01:33:30,151 --> 01:33:32,650 It's just going to be a function in code or in my brain that 2041 01:33:32,650 --> 01:33:36,820 just makes a decision based on that input and outputs a hash value, which 2042 01:33:36,820 --> 01:33:40,030 in this case is going to be that pile, that pile, that pile, that pile, 2043 01:33:40,030 --> 01:33:43,180 or if we want to be more precise, zero, one, two, or three. 2044 01:33:43,180 --> 01:33:45,880 If those four piles are implemented as, like, four arrays 2045 01:33:45,880 --> 01:33:47,410 or some kind of stacks, really. 2046 01:33:47,410 --> 01:33:49,630 I seem to be making a stack of cards. 2047 01:33:49,630 --> 01:33:50,440 Now it's not done. 2048 01:33:50,440 --> 01:33:52,510 If I want to sort these things later, I'm 2049 01:33:52,510 --> 01:33:54,880 still going to have to sort each of the piles of 13. 2050 01:33:54,880 --> 01:33:57,379 But I've kind of made the problem a little easier for myself 2051 01:33:57,379 --> 01:34:00,220 in that I've spread it out over four equivalent problems. 2052 01:34:00,220 --> 01:34:03,020 But the key ingredient here is that notion of hashing.
2053 01:34:03,020 --> 01:34:06,580 And honestly, if you've ever watched a TA or professor deal with these things 2054 01:34:06,580 --> 01:34:09,670 at the end of some class that has blue books, if a whole bunch of students 2055 01:34:09,670 --> 01:34:12,070 at the end of the hour come down and start handing in their blue books 2056 01:34:12,070 --> 01:34:13,180 it's a complete mess. 2057 01:34:13,180 --> 01:34:16,217 And if the TFs or professor wants to organize these, 2058 01:34:16,217 --> 01:34:17,800 you might make a whole bunch of piles. 2059 01:34:17,800 --> 01:34:19,300 All the L last names will go there. 2060 01:34:19,300 --> 01:34:20,080 E will go there. 2061 01:34:20,080 --> 01:34:21,100 F will go there. 2062 01:34:21,100 --> 01:34:24,010 And maybe in this case you'll alphabetize as you go, thereby 2063 01:34:24,010 --> 01:34:25,600 making this problem easier, too. 2064 01:34:25,600 --> 01:34:27,020 That is a hash function. 2065 01:34:27,020 --> 01:34:29,230 You take as input a student's name, you look 2066 01:34:29,230 --> 01:34:31,540 at the first letter of his or her last name, 2067 01:34:31,540 --> 01:34:34,270 and you decide whether it goes in bucket zero, one, 2068 01:34:34,270 --> 01:34:37,662 or maybe 25 if you're indeed hashing based on the English alphabet. 2069 01:34:37,662 --> 01:34:40,120 So hashing is something we've all done, even if we've never 2070 01:34:40,120 --> 01:34:42,530 slapped that name on it before. 2071 01:34:42,530 --> 01:34:44,530 So how might we leverage this kind of ingredient 2072 01:34:44,530 --> 01:34:47,980 and get ourselves closer to the holy grail of data structures, which 2073 01:34:47,980 --> 01:34:49,780 would be constant time for everything? 2074 01:34:49,780 --> 01:34:53,230 Like none of this linear, none of this logarithmic time. 2075 01:34:53,230 --> 01:34:57,637 So suppose we have an array or a table, we'll call it, like this. 
2076 01:34:57,637 --> 01:35:00,220 I'm going to call this a hash table because I want to leverage 2077 01:35:00,220 --> 01:35:01,930 the idea of this hash function. 2078 01:35:01,930 --> 01:35:06,760 And suppose that what I want to store in here are just things like names. 2079 01:35:06,760 --> 01:35:08,530 And I want to go ahead and store the name 2080 01:35:08,530 --> 01:35:11,840 Alice, because she turned in her exam first. 2081 01:35:11,840 --> 01:35:17,230 So here I might have [0] through [25] or in general, n-1. 2082 01:35:17,230 --> 01:35:19,616 So there's 26 buckets total. 2083 01:35:19,616 --> 01:35:21,240 Where might I be inclined to put Alice? 2084 01:35:21,240 --> 01:35:24,140 2085 01:35:24,140 --> 01:35:26,090 I might just hash her to zero because Alice, 2086 01:35:26,090 --> 01:35:27,800 we'll use her first name, not last, because she never 2087 01:35:27,800 --> 01:35:29,130 seems to have a last name. 2088 01:35:29,130 --> 01:35:31,310 So first name, Alice, brackets zero, she goes there. 2089 01:35:31,310 --> 01:35:32,900 Then Bob comes up, turns in his exam. 2090 01:35:32,900 --> 01:35:34,530 Where does he go? 2091 01:35:34,530 --> 01:35:39,200 [1] and then maybe Brendan comes over. 2092 01:35:39,200 --> 01:35:39,710 Damn it. 2093 01:35:39,710 --> 01:35:40,981 No room for Brendan's exam. 2094 01:35:40,981 --> 01:35:41,480 Why? 2095 01:35:41,480 --> 01:35:43,917 Because he hashes to the same value. 2096 01:35:43,917 --> 01:35:44,750 And this can happen. 2097 01:35:44,750 --> 01:35:46,640 Like, you might hash to the same value. 2098 01:35:46,640 --> 01:35:47,990 And here it was not a big deal. 2099 01:35:47,990 --> 01:35:49,698 I kept getting diamond, diamond, diamond. 2100 01:35:49,698 --> 01:35:51,770 That's fine because this data structure grows. 2101 01:35:51,770 --> 01:35:54,140 But this is an array, it would seem. 
2102 01:35:54,140 --> 01:35:56,540 And I could write Alice here, I could write Bob here, 2103 01:35:56,540 --> 01:35:58,490 but Brendan should be written there too. 2104 01:35:58,490 --> 01:36:02,150 But I don't want to give Bob his exam back just to accept Brendan's. 2105 01:36:02,150 --> 01:36:03,646 So where could I put Brendan? 2106 01:36:03,646 --> 01:36:06,770 Maybe I'll kind of cheat and just put him here because there's room, right? 2107 01:36:06,770 --> 01:36:08,930 This is all free in this story so far. 2108 01:36:08,930 --> 01:36:11,150 But then Charlie comes forward. 2109 01:36:11,150 --> 01:36:12,350 What do we do with Charlie? 2110 01:36:12,350 --> 01:36:14,660 Now Brendan is where Charlie should be. 2111 01:36:14,660 --> 01:36:18,105 So now I've just kind of made a mess but I have so much free space 2112 01:36:18,105 --> 01:36:21,230 and odds are I'm not going to have a student, no offense, whose name starts 2113 01:36:21,230 --> 01:36:24,565 with a Z or an X or some of the statistically less likely ones. 2114 01:36:24,565 --> 01:36:25,940 So why don't we use those spaces? 2115 01:36:25,940 --> 01:36:28,640 And we could, but this is an example algorithmically 2116 01:36:28,640 --> 01:36:32,270 of linear probing, where you linearly top to bottom just kind of probe 2117 01:36:32,270 --> 01:36:34,700 the data structure looking for space and just drop 2118 01:36:34,700 --> 01:36:37,010 the values in the first available. 2119 01:36:37,010 --> 01:36:40,280 And initially it's nice and clean and nice and efficient 2120 01:36:40,280 --> 01:36:42,680 because if I want to look for Alice's exam later, boom, 2121 01:36:42,680 --> 01:36:44,180 she's on the top of the pile. 2122 01:36:44,180 --> 01:36:45,860 Bob, boom, second in the pile. 2123 01:36:45,860 --> 01:36:48,860 But then Charlie, not quite where he should be.
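The Alice, Bob, Brendan, Charlie story can be sketched in C as below. It is a minimal illustration, not the course's code: the names table, hash, and insert are hypothetical, names are assumed to start with an English letter, and wrapping around from bucket 25 back to bucket 0 is an added assumption (the lecture only says to probe top to bottom).

```c
#include <ctype.h>
#include <stddef.h>

#define BUCKETS 26

// Hypothetical table: one name per slot, NULL means the slot is free.
const char *table[BUCKETS];

// Hash on the first letter: 'A' -> 0, 'B' -> 1, ..., 'Z' -> 25.
// Assumes name starts with an English letter.
int hash(const char *name)
{
    return toupper((unsigned char) name[0]) - 'A';
}

// Linear probing: start at the hashed bucket and take the first free slot.
// Returns the slot used, or -1 if the whole table is full.
int insert(const char *name)
{
    int start = hash(name);
    for (int i = 0; i < BUCKETS; i++)
    {
        int slot = (start + i) % BUCKETS; // wrap-around is an assumption
        if (table[slot] == NULL)
        {
            table[slot] = name;
            return slot;
        }
    }
    return -1;
}
```

Inserting Alice, Bob, Brendan, and Charlie in that order puts Brendan in bucket 2 and pushes Charlie to bucket 3, one slot past where his first letter says he belongs.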
2124 01:36:48,860 --> 01:36:51,260 So eventually with this approach of linear probing 2125 01:36:51,260 --> 01:36:55,130 it's space efficient in that you pack everyone into your data structure. 2126 01:36:55,130 --> 01:36:57,770 But it eventually devolves into something linear. 2127 01:36:57,770 --> 01:37:01,210 If Alice came and handed in her exam last, by nature of space, 2128 01:37:01,210 --> 01:37:02,960 she might end up at the bottom of the pile 2129 01:37:02,960 --> 01:37:05,720 and that does not make her easy to find later. 2130 01:37:05,720 --> 01:37:09,560 So what if we instead change the data structure 2131 01:37:09,560 --> 01:37:12,980 and use elements from today and past? 2132 01:37:12,980 --> 01:37:17,480 Let's use an array here of pointers drawn vertically just because. 2133 01:37:17,480 --> 01:37:22,054 And then why don't we string students' names off the right of this? 2134 01:37:22,054 --> 01:37:25,220 So this is an excerpt from a text that explores exactly this data structure. 2135 01:37:25,220 --> 01:37:27,830 It's called a hash table, not with linear probing, 2136 01:37:27,830 --> 01:37:32,090 but with separate chaining, whereby your data structure, your hash table, 2137 01:37:32,090 --> 01:37:34,370 is technically an array. 2138 01:37:34,370 --> 01:37:36,980 This time it's of size 31, because the book's example was 2139 01:37:36,980 --> 01:37:40,170 about day of the month for birthdays. 2140 01:37:40,170 --> 01:37:43,790 And so the data structure has not just an array, though, but what other data 2141 01:37:43,790 --> 01:37:47,030 structure combined with it? 2142 01:37:47,030 --> 01:37:48,710 It's a kind of linked list. 2143 01:37:48,710 --> 01:37:53,561 So what's nice here is that S Adams, so Adams, starting with A in our story. 2144 01:37:53,561 --> 01:37:56,060 Now they're using birthdays if you read this in the context.
2145 01:37:56,060 --> 01:37:58,400 But suppose that Adams is the only one with birthday 2146 01:37:58,400 --> 01:37:59,720 on the second of some month. 2147 01:37:59,720 --> 01:38:01,880 Well, he or she might end up here. 2148 01:38:01,880 --> 01:38:05,480 And that's no big deal if someone else has the same birthday in this example, 2149 01:38:05,480 --> 01:38:08,960 because we can either walk the list as we did with Jess 2150 01:38:08,960 --> 01:38:11,750 and just string him or her at the end of this data structure. 2151 01:38:11,750 --> 01:38:14,930 Or we can just kind of insert them at the very beginning 2152 01:38:14,930 --> 01:38:19,610 and just use some constant time changes to peoples' left hand to fit them in. 2153 01:38:19,610 --> 01:38:25,370 The point is though, the data structure no longer is an array only. 2154 01:38:25,370 --> 01:38:32,280 It's an array of 31 buckets, four piles, 26 piles, 31 piles. 2155 01:38:32,280 --> 01:38:34,430 But each of those piles can grow vertically, 2156 01:38:34,430 --> 01:38:37,190 so to speak, or in this case laterally because we're 2157 01:38:37,190 --> 01:38:39,120 implementing the idea of these data structures 2158 01:38:39,120 --> 01:38:42,200 now by using an actual linked list. 2159 01:38:42,200 --> 01:38:45,170 So why is this actually better or worse? 2160 01:38:45,170 --> 01:38:49,220 Well one, is there any limit now on how many students can turn in their exams 2161 01:38:49,220 --> 01:38:50,930 or have birthdays? 2162 01:38:50,930 --> 01:38:54,320 No because we just keep growing it wider and wider and wider. 2163 01:38:54,320 --> 01:38:56,551 Why is this then a good thing? 
2164 01:38:56,551 --> 01:38:59,300 Well now if I want to look someone up, if I know their name starts 2165 01:38:59,300 --> 01:39:01,130 with A or in the birthday example, I know 2166 01:39:01,130 --> 01:39:03,050 their birthday is on the second of the month, 2167 01:39:03,050 --> 01:39:06,280 I know deterministically, no matter what, 2168 01:39:06,280 --> 01:39:09,630 what bucket they will be in in the array. 2169 01:39:09,630 --> 01:39:13,490 Now, they might be in a long string of people with similar names or birthdays. 2170 01:39:13,490 --> 01:39:16,490 But they're going to be there, deterministically, predictably, 2171 01:39:16,490 --> 01:39:17,240 again and again. 2172 01:39:17,240 --> 01:39:22,130 And the beautiful thing is if my hash function is well-implemented, uniform 2173 01:39:22,130 --> 01:39:26,600 so to speak statistically, then it would be nice if almost all of these chains 2174 01:39:26,600 --> 01:39:27,910 are roughly the same length. 2175 01:39:27,910 --> 01:39:30,802 It would be pretty lame if this chain were really huge 2176 01:39:30,802 --> 01:39:33,260 and then every other chain were shorter because that's just 2177 01:39:33,260 --> 01:39:35,370 an opportunity for better design. 2178 01:39:35,370 --> 01:39:39,080 So in real terms, a hash table, when implemented like this, 2179 01:39:39,080 --> 01:39:43,700 should decrease in this concrete case, by a factor of like 31, 2180 01:39:43,700 --> 01:39:45,560 how long it takes to find someone. 2181 01:39:45,560 --> 01:39:49,190 So the time is one divided by 31 because if all the chains are roughly 2182 01:39:49,190 --> 01:39:53,510 the same length, you have chopped up your data set into four piles, 26 2183 01:39:53,510 --> 01:39:57,830 piles, 31 piles, each of which is one fourth or 1/26th 2184 01:39:57,830 --> 01:40:01,980 or 1/31st the size of the whole data set.
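Here is a rough C sketch of a hash table with separate chaining: an array of buckets, each the head of a linked list. The identifiers (node, table, hash, insert, lookup) and the choice of 26 first-letter buckets are assumptions for illustration; the textbook figure uses 31 buckets keyed on birthdays instead. New names go at the front of the chain, which is the constant-time option mentioned above.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define BUCKETS 26

// One link in a chain: a name plus a pointer to the next link.
typedef struct node
{
    const char *name;
    struct node *next;
}
node;

// The hash table itself: an array of pointers to chains (all NULL at first).
node *table[BUCKETS];

// Hash on the first letter; the modulo keeps unexpected input in range.
unsigned int hash(const char *name)
{
    return ((unsigned int) toupper((unsigned char) name[0]) - 'A') % BUCKETS;
}

// Insert at the head of the bucket's chain: constant time, no probing.
bool insert(const char *name)
{
    node *n = malloc(sizeof(node));
    if (n == NULL)
    {
        return false;
    }
    unsigned int b = hash(name);
    n->name = name;
    n->next = table[b];
    table[b] = n;
    return true;
}

// Look only in the one bucket the name can possibly be in.
bool lookup(const char *name)
{
    for (node *p = table[hash(name)]; p != NULL; p = p->next)
    {
        if (strcmp(p->name, name) == 0)
        {
            return true;
        }
    }
    return false;
}
```

With this layout Bob and Brendan simply share bucket 1, and looking up either one walks only that chain rather than the whole table.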
2185 01:40:01,980 --> 01:40:04,640 Now asymptotically, per couple of weeks ago, 2186 01:40:04,640 --> 01:40:07,130 that is algorithmically irrelevant. 2187 01:40:07,130 --> 01:40:10,440 That's big O of the same thing, so to speak. 2188 01:40:10,440 --> 01:40:13,700 But in real terms, having it take a quarter as much time, 2189 01:40:13,700 --> 01:40:16,220 1/26th the amount of time, 1/31st the real time 2190 01:40:16,220 --> 01:40:19,070 is literally going to save us time on our watches. 2191 01:40:19,070 --> 01:40:21,890 Like that in real human times will save time. 2192 01:40:21,890 --> 01:40:24,890 And in fact, what you'll see in problem set five, in which you implement 2193 01:40:24,890 --> 01:40:27,824 your first spell checker, you'll see that that's 2194 01:40:27,824 --> 01:40:29,240 what we're trying to optimize for. 2195 01:40:29,240 --> 01:40:32,900 In fact, as a quick teaser before we look at our final data structure here, 2196 01:40:32,900 --> 01:40:35,660 you'll be challenged as part of this problem set optionally, 2197 01:40:35,660 --> 01:40:38,660 if you'd like to opt in, to compete on the big board. 2198 01:40:38,660 --> 01:40:40,847 Once your code is working per check50, you 2199 01:40:40,847 --> 01:40:44,180 can actually run a separate command with check50 to post it to the leader board 2200 01:40:44,180 --> 01:40:44,680 here. 2201 01:40:44,680 --> 01:40:47,300 And right now, damn it, Brian is beating both Doug and me 2202 01:40:47,300 --> 01:40:49,790 because his implementation of the spell checker 2203 01:40:49,790 --> 01:40:55,490 takes only 4.81 seconds and only 7.4 kilobytes versus my 82 megabytes 2204 01:40:55,490 --> 01:40:59,930 of memory implementing a spell check over the whole lot of words. 2205 01:40:59,930 --> 01:41:03,162 But how do you decide how to minimize space or minimize time 2206 01:41:03,162 --> 01:41:05,120 and how do you mitigate some of the trade-offs?
2207 01:41:05,120 --> 01:41:08,000 Well, let's look at one final data structure to consider. 2208 01:41:08,000 --> 01:41:13,564 This is perhaps the most sophisticated and it takes up more space 2209 01:41:13,564 --> 01:41:15,230 and so it's hard to paint on the screen. 2210 01:41:15,230 --> 01:41:16,700 But suppose we did this. 2211 01:41:16,700 --> 01:41:20,660 Suppose we were trying to store in our data structure people's names. 2212 01:41:20,660 --> 01:41:23,142 We could do this with an array of a lot of strings. 2213 01:41:23,142 --> 01:41:25,100 And we could do linear search and Brian or Doug 2214 01:41:25,100 --> 01:41:28,490 or I could just use linear search big O of n and find any one you want. 2215 01:41:28,490 --> 01:41:29,510 That's not so great. 2216 01:41:29,510 --> 01:41:32,840 We could somehow use binary search if we used a tree or an array 2217 01:41:32,840 --> 01:41:34,170 but kept the names sorted. 2218 01:41:34,170 --> 01:41:35,240 We know we can do better. 2219 01:41:35,240 --> 01:41:38,030 Just as we found Mike Smith pretty quickly in week zero. 2220 01:41:38,030 --> 01:41:41,240 But what if we could find names in constant time? 2221 01:41:41,240 --> 01:41:44,990 Whereby no matter how many words are in the tree, no matter 2222 01:41:44,990 --> 01:41:47,930 how many words are in the dictionary more generally, 2223 01:41:47,930 --> 01:41:50,872 still takes me the same amount of time to find anyone? 2224 01:41:50,872 --> 01:41:53,330 And it doesn't get longer and longer the more names we add? 2225 01:41:53,330 --> 01:41:59,480 So here is a type of tree goofily called a trie, T-R-I-E, 2226 01:41:59,480 --> 01:42:02,700 which is an excerpt from retrieval, which is weird because it's retrieval 2227 01:42:02,700 --> 01:42:06,290 and retrival but this is a trie, T-R-I-E. 2228 01:42:06,290 --> 01:42:10,351 Each of the nodes in a trie, essentially, are an array themselves. 
2229 01:42:10,351 --> 01:42:13,100 Technically they're a structure with a little more inside of them. 2230 01:42:13,100 --> 01:42:15,950 And you'll see this in the walkthrough that Zamyla put together. 2231 01:42:15,950 --> 01:42:19,580 But each of the nodes in the trie are an array. 2232 01:42:19,580 --> 01:42:23,660 Each of those arrays' elements is a pointer to another such array. 2233 01:42:23,660 --> 01:42:27,710 And the way you store words in a trie is not with characters, 2234 01:42:27,710 --> 01:42:30,290 but implicitly with pointers. 2235 01:42:30,290 --> 01:42:33,650 So if we want to put someone's name like Maxwell in here, 2236 01:42:33,650 --> 01:42:38,960 we hash into this trie using the first letter of Maxwell's name, which 2237 01:42:38,960 --> 01:42:40,040 is of course M. 2238 01:42:40,040 --> 01:42:45,470 And that's going to be the 13th element in the array in my 26-element array 2239 01:42:45,470 --> 01:42:46,190 here. 2240 01:42:46,190 --> 01:42:48,800 I'm going to change that originally null pointer 2241 01:42:48,800 --> 01:42:50,850 to be a pointer to another node. 2242 01:42:50,850 --> 01:42:54,230 And then I'm going to hash on the second letter of Maxwell's name, which is A, 2243 01:42:54,230 --> 01:42:58,130 and I'm going to allocate a pointer to another array. 2244 01:42:58,130 --> 01:43:01,770 And then repeat that process for every letter in his name. 2245 01:43:01,770 --> 01:43:04,910 So if I hash on his first letter, second letter, third letter, every time 2246 01:43:04,910 --> 01:43:07,290 I do that it leads me to a new array. 2247 01:43:07,290 --> 01:43:10,340 What's not shown here is that each of these arrays is size 26. 2248 01:43:10,340 --> 01:43:12,440 It would just be atrocious to see on the screen. 2249 01:43:12,440 --> 01:43:14,270 So it does use a bunch of memory.
2250 01:43:14,270 --> 01:43:18,330 But at the end of this, there's a special symbol, drawn here as a delta symbol, 2251 01:43:18,330 --> 01:43:21,180 but it can be anything, that just means Maxwell stops here. 2252 01:43:21,180 --> 01:43:22,790 There's a word here. 2253 01:43:22,790 --> 01:43:26,210 So how many steps does it take to find any name in the tree? 2254 01:43:26,210 --> 01:43:32,480 Well, to find Maxwell it's M-A-X-W-E-L-L. So that's seven steps. 2255 01:43:32,480 --> 01:43:36,080 For Maria it'd be M-A-R-I-A. That would be five steps. 2256 01:43:36,080 --> 01:43:40,010 So it's still dependent on the number of letters in the name. 2257 01:43:40,010 --> 01:43:43,130 But if there are a billion names in this dictionary, 2258 01:43:43,130 --> 01:43:48,280 per this definition, how many more steps does it take to find Maxwell? 2259 01:43:48,280 --> 01:43:50,254 M-A-X-W-E-L-L. 2260 01:43:50,254 --> 01:43:52,670 How about if there's four billion names in the dictionary? 2261 01:43:52,670 --> 01:43:54,590 How long does it take to find Maxwell? 2262 01:43:54,590 --> 01:43:57,500 M-A-X-W-E-L-L. It's invariant. 2263 01:43:57,500 --> 01:44:00,590 And if we assume that no human name is going to be super long, 2264 01:44:00,590 --> 01:44:03,920 it's effectively constant whether it's 10 characters, maybe 30 characters 2265 01:44:03,920 --> 01:44:04,950 or whatnot. 2266 01:44:04,950 --> 01:44:09,830 That's effectively constant, which means a trie gives you constant time look up 2267 01:44:09,830 --> 01:44:15,500 or big O of one, which means it's in theory the fastest of data structures. 2268 01:44:15,500 --> 01:44:18,226 But of course you pay a price with more memory. 2269 01:44:18,226 --> 01:44:21,350 And I know we're one minute over but let me tease you with this final look. 2270 01:44:21,350 --> 01:44:24,350 And you'll see this data structure's implementation with Zamyla.
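A trie along the lines just described might look roughly like this in C. Treat it as a sketch under assumptions, not the walkthrough's implementation: the names trie_node, create, insert, and check are hypothetical, only the 26 English letters are handled, and a bool field called is_word stands in for the delta symbol that marks where a name stops.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>

#define LETTERS 26

// Each node is an array of 26 pointers plus a flag meaning
// "a word stops here" (the delta symbol in the picture).
typedef struct trie_node
{
    bool is_word;
    struct trie_node *children[LETTERS];
}
trie_node;

// Allocate a node with every child pointer set to NULL.
trie_node *create(void)
{
    trie_node *t = malloc(sizeof(trie_node));
    if (t != NULL)
    {
        t->is_word = false;
        for (int i = 0; i < LETTERS; i++)
        {
            t->children[i] = NULL;
        }
    }
    return t;
}

// Follow (and create as needed) one pointer per letter of word.
bool insert(trie_node *root, const char *word)
{
    trie_node *t = root;
    for (int i = 0; word[i] != '\0'; i++)
    {
        int c = tolower((unsigned char) word[i]) - 'a';
        if (c < 0 || c >= LETTERS)
        {
            return false; // only English letters in this sketch
        }
        if (t->children[c] == NULL)
        {
            t->children[c] = create();
            if (t->children[c] == NULL)
            {
                return false;
            }
        }
        t = t->children[c];
    }
    t->is_word = true;
    return true;
}

// One step per letter, however many other words the trie holds.
bool check(const trie_node *root, const char *word)
{
    const trie_node *t = root;
    for (int i = 0; word[i] != '\0'; i++)
    {
        int c = tolower((unsigned char) word[i]) - 'a';
        if (c < 0 || c >= LETTERS || t->children[c] == NULL)
        {
            return false;
        }
        t = t->children[c];
    }
    return t->is_word;
}
```

The loop in check runs seven times for Maxwell and five for Maria regardless of how many other names were inserted, while each node carries 26 pointers, which is where the extra memory goes.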
2271 01:44:24,350 --> 01:44:27,230 But we begin to do transitionally now, especially if you're 2272 01:44:27,230 --> 01:44:29,730 a little worried, especially as we're coming on the midpoint of the semester, 2273 01:44:29,730 --> 01:44:30,350 like oh my god. 2274 01:44:30,350 --> 01:44:31,890 Things are getting more and more sophisticated. 2275 01:44:31,890 --> 01:44:34,640 We're kind of at the peak of a hill here because after problem set 2276 01:44:34,640 --> 01:44:38,916 five do we transition to HTML and CSS and Python and JavaScript and web 2277 01:44:38,916 --> 01:44:40,040 programming more generally. 2278 01:44:40,040 --> 01:44:43,844 And next week, how the internet works. 2279 01:44:43,844 --> 01:44:44,510 [VIDEO PLAYBACK] 2280 01:44:44,510 --> 01:44:46,760 [MUSIC PLAYING] 2281 01:44:46,760 --> 01:44:49,780 2282 01:44:49,780 --> 01:44:59,176 - He came with a message, with a protocol all his own. 2283 01:44:59,176 --> 01:45:02,302 2284 01:45:02,302 --> 01:45:04,472 [MUSIC PLAYING] 2285 01:45:04,472 --> 01:45:12,190 2286 01:45:12,190 --> 01:45:18,040 He came to a world of cruel firewalls, uncaring routers, and dangers 2287 01:45:18,040 --> 01:45:21,670 far worse than death. 2288 01:45:21,670 --> 01:45:22,630 He's fast. 2289 01:45:22,630 --> 01:45:23,950 He's strong. 2290 01:45:23,950 --> 01:45:26,310 He's TCP/IP. 2291 01:45:26,310 --> 01:45:28,685 And he's got your address. 2292 01:45:28,685 --> 01:45:31,370 2293 01:45:31,370 --> 01:45:32,987 Warriors of the net. 2294 01:45:32,987 --> 01:45:33,570 [END PLAYBACK] 2295 01:45:33,570 --> 01:45:33,980 DAVID MALAN: All right. 2296 01:45:33,980 --> 01:45:35,430 All that and more next week. 2297 01:45:35,430 --> 01:45:37,240 We'll see you then. 2298 01:45:37,240 --> 01:45:39,836