1 00:00:00,000 --> 00:00:02,520 [Week 6, Continued] 2 00:00:02,520 --> 00:00:04,160 [David J. Malan] [Harvard University] 3 00:00:04,160 --> 00:00:08,720 [This is CS50.] [CS50.TV] 4 00:00:08,720 --> 00:00:12,970 This is CS50 and this is the end of week 6. 5 00:00:12,970 --> 00:00:17,970 So CS50x, one of Harvard's first courses involved in the edX initiative 6 00:00:17,970 --> 00:00:20,590 indeed debuted this past Monday. 7 00:00:20,590 --> 00:00:23,460 If you would like to get a glimpse of what others on the Internet 8 00:00:23,460 --> 00:00:27,180 are now following along with, you can head to x.cs50.net. 9 00:00:27,180 --> 00:00:30,350 That will redirect you to the appropriate place on edx.org, 10 00:00:30,350 --> 00:00:34,160 which was where this and other courses from MIT and Berkeley now live. 11 00:00:34,160 --> 00:00:38,140 You'll have to sign up for an account; you'll find that the material is largely the same 12 00:00:38,140 --> 00:00:42,170 as you've had this semester, albeit a few weeks delayed, as we get everything ready. 13 00:00:42,170 --> 00:00:46,930 But what students in CS50x will now see is an interface quite like this one. 14 00:00:46,930 --> 00:00:50,040 This, for instance, is Zamyla leading the walkthrough for problem set 0. 15 00:00:50,040 --> 00:00:54,230 Upon logging in to edx.org, a CS50x student sees the sorts of things 16 00:00:54,230 --> 00:00:57,170 you would expect to see in a course: the lecture for the Monday, 17 00:00:57,170 --> 00:01:01,650 lecture for Wednesday, various shorts, the problem sets, the walkthroughs, PDFs. 18 00:01:01,650 --> 00:01:04,459 In addition, as you see here, machine translations 19 00:01:04,459 --> 00:01:08,390 of English transcripts into Chinese, Japanese, Spanish, Italian, 20 00:01:08,390 --> 00:01:12,810 and a whole bunch of other languages that will certainly be imperfect 21 00:01:12,810 --> 00:01:15,840 as we roll them out programmatically using something called an API, 22 00:01:15,840 --> 00:01:18,360 or application programming interface, from Google 23 00:01:18,360 --> 00:01:21,360 that allows us to convert English to these other languages. 24 00:01:21,360 --> 00:01:24,100 But thanks to the wonderful spirit of some hundred-plus volunteers, 25 00:01:24,100 --> 00:01:26,940 random people on the Internet who have kindly offered to get involved 26 00:01:26,940 --> 00:01:30,180 in this project, we'll gradually be improving the quality of those translations 27 00:01:30,180 --> 00:01:35,790 by having humans correct the mistakes that our computers have made. 28 00:01:35,790 --> 00:01:42,330 >> So it turns out we had a few more students show up on Monday than we initially expected. 29 00:01:42,330 --> 00:01:48,980 In fact, now CS50x has 100,000 people following along at home. 30 00:01:48,980 --> 00:01:54,430 So realize you are all part of this inaugural class of making this course in computer science 31 00:01:54,430 --> 00:01:57,370 education more generally, more broadly, accessible. 32 00:01:57,370 --> 00:02:00,130 And the reality is now, with some of these massive online courses, 33 00:02:00,130 --> 00:02:04,070 they all start with these very high numbers, as we seem to have done here. 34 00:02:04,070 --> 00:02:08,759 But the goal, ultimately, for CS50x is really to get as many people to the finish line as possible. 35 00:02:08,759 --> 00:02:12,000 By design, CS50x is going to be offered from this past Monday 36 00:02:12,000 --> 00:02:17,430 all the way through April 15, 2013, so that folks who have school commitments elsewhere, 37 00:02:17,430 --> 00:02:20,990 work, family, other conflicts and the like, have a bit more flexibility 38 00:02:20,990 --> 00:02:23,640 with which to dive into this course, which, suffice it to say, 39 00:02:23,640 --> 00:02:30,540 is quite ambitiously done if only over the course of just three months during a usual semester. 40 00:02:30,540 --> 00:02:34,190 But these students will be tackling the same problem sets, viewing the same content, 41 00:02:34,190 --> 00:02:36,350 having access to the same shorts and the like. 42 00:02:36,350 --> 00:02:38,990 So realize that we are all really in this together. 43 00:02:38,990 --> 00:02:42,360 And one of the end goals of CS50x is not just to get as many folks 44 00:02:42,360 --> 00:02:45,720 to the finish line and give them this newfound understanding of computer science 45 00:02:45,720 --> 00:02:49,000 and programming but also to have them have this shared experience. 46 00:02:49,000 --> 00:02:52,010 One of the defining characteristics of 50 on campus, we hope, 47 00:02:52,010 --> 00:02:56,260 has been this sort of communal experience, for better or for worse, sometimes, 48 00:02:56,260 --> 00:02:59,480 but having these people to turn to to the left and to the right, 49 00:02:59,480 --> 00:03:01,830 and office hours and the hackathon and the fair. 50 00:03:01,830 --> 00:03:04,560 It's a little harder to do that in person with folks online, 51 00:03:04,560 --> 00:03:10,580 but CS50x will conclude in April with the first ever CS50 Expo, 52 00:03:10,580 --> 00:03:13,630 which will be an online adaptation of our idea of the fair 53 00:03:13,630 --> 00:03:18,250 where these thousands of students will all be invited to submit a 1- to 2-minute video, 54 00:03:18,250 --> 00:03:22,480 either a screencast of their final project or video of them waving hello 55 00:03:22,480 --> 00:03:24,490 and talking about their project and demoing it, 56 00:03:24,490 --> 00:03:27,610 much like your predecessors have done here on campus in the fair, 57 00:03:27,610 --> 00:03:31,400 so that by semester's end, the hope is to have a global exhibition 58 00:03:31,400 --> 00:03:37,080 of the CS50x students' final projects, much like that which awaits you this December here on campus. 59 00:03:37,080 --> 00:03:39,680 So more on that in the months to come. 60 00:03:39,680 --> 00:03:43,640 >> With 100,000 students, though, comes a need for a few more CAs. 61 00:03:43,640 --> 00:03:47,590 Given that you guys are blazing the trail here and taking CS50 62 00:03:47,590 --> 00:03:51,630 several weeks in advance of this material's release to the folks on edX, 63 00:03:51,630 --> 00:03:55,330 realize we would love to involve as many of our own students as possible in this initiative, 64 00:03:55,330 --> 00:03:58,720 both during the semester as well as this winter and this coming spring. 65 00:03:58,720 --> 00:04:01,620 So if you would like to get involved in CS50x, 66 00:04:01,620 --> 00:04:07,450 particularly joining in on CS50x Discuss, the edX version of CS50 Discuss, 67 00:04:07,450 --> 00:04:10,140 which many of you have been using on campus, the online bulletin board, 68 00:04:10,140 --> 00:04:13,040 please do head to that URL, let us know who you are, 69 00:04:13,040 --> 00:04:16,450 because we'd love to build up a team of students and staff and faculty alike 70 00:04:16,450 --> 00:04:19,630 on campus who are simply playing along and helping out. 71 00:04:19,630 --> 00:04:21,720 And when they see a question that's familiar to them, 72 00:04:21,720 --> 00:04:25,320 you hear a student reporting some bug somewhere out there in some country on the Internet, 73 00:04:25,320 --> 00:04:27,450 and that rings a bell because you too had that same issue 74 00:04:27,450 --> 00:04:32,620 in your d-hall some time ago, hopefully then you can chime in and share your own experience. 75 00:04:32,620 --> 00:04:37,300 So please do partake if you would like. 76 00:04:37,300 --> 00:04:39,360 >> Computer science courses at Harvard have a bit of a tradition, 77 00:04:39,360 --> 00:04:44,730 CS50 among them, of having some apparel, some clothes, that you can wear proudly 78 00:04:44,730 --> 00:04:49,090 at semester's end, saying quite proudly that you finished CS50 79 00:04:49,090 --> 00:04:51,830 and took CS50 and the like, and we always try to involve students 80 00:04:51,830 --> 00:04:54,540 in this process as much as possible, whereby we invite, 81 00:04:54,540 --> 00:04:56,900 around this time of the semester, students to submit designs 82 00:04:56,900 --> 00:04:59,330 using Photoshop, or whatever tool of choice you'd like to use 83 00:04:59,330 --> 00:05:02,330 if you're a designer, to submit designs for T-shirts and sweatshirts 84 00:05:02,330 --> 00:05:06,100 and umbrellas and little bandanas for dogs we now have and the like. 85 00:05:06,100 --> 00:05:09,370 And everything is then--the winners each year are then exhibited 86 00:05:09,370 --> 00:05:12,700 on the course's website at store.cs50.net. 87 00:05:12,700 --> 00:05:15,790 Everything is sold at cost there, but the website just runs itself 88 00:05:15,790 --> 00:05:18,330 and allows people to choose the colors and designs that they like. 89 00:05:18,330 --> 00:05:20,420 So I thought we'd just share some of last year's designs 90 00:05:20,420 --> 00:05:25,130 that were on the website besides this one here, which is an annual tradition. 91 00:05:25,130 --> 00:05:29,410 "Every Day I'm Seg Faultn" was one of the submissions last year, 92 00:05:29,410 --> 00:05:32,290 which is still available there for alumni. 93 00:05:32,290 --> 00:05:35,820 We had this one, "CS50, Established 1989." 94 00:05:35,820 --> 00:05:39,010 One of our Bowdens, Rob, was very popular last year. 95 00:05:39,010 --> 00:05:43,480 "Team Bowden" was born, this design was submitted, among the top sellers. 96 00:05:43,480 --> 00:05:49,040 As was this one here. Many people had "Bowden Fever" according to the sales logs. 97 00:05:49,040 --> 00:05:52,650 Realize that that could now be your design there, up on the Internet. 98 00:05:52,650 --> 00:05:57,510 More details on this in the next problem sets to come. 99 00:05:57,510 --> 00:06:00,330 >> One more tool: you've had some exposure and hopefully now 100 00:06:00,330 --> 00:06:02,350 some hands-on experience with GDB, 101 00:06:02,350 --> 00:06:04,570 which is, of course, a debugger and allows you to manipulate 102 00:06:04,570 --> 00:06:09,500 your program at a fairly low level, doing what kinds of things? 103 00:06:09,500 --> 00:06:13,030 What does GDB let you do? 104 00:06:13,030 --> 00:06:15,030 Yeah? Give me something. [Student answer, unintelligible] 105 00:06:15,030 --> 00:06:18,120 Good. Step into function, so you don't just have to type run 106 00:06:18,120 --> 00:06:22,310 and have the program blow through its entirety, printing out things to standard output. 107 00:06:22,310 --> 00:06:25,190 Rather, you can step through it line by line, either typing next 108 00:06:25,190 --> 00:06:30,300 to go line by line by line or step to dive into a function, typically one that you wrote. 109 00:06:30,300 --> 00:06:35,240 What else does GDB let you do? Yeah? [Student answer, unintelligible] 110 00:06:35,240 --> 00:06:38,100 Print variables. So if you want to do a little introspection inside of your program 111 00:06:38,100 --> 00:06:41,500 without having to resort to writing printf statements all over the place, 112 00:06:41,500 --> 00:06:44,600 you can just print a variable or display a variable. 113 00:06:44,600 --> 00:06:46,710 What else can you do with a debugger like GDB? 114 00:06:46,710 --> 00:06:49,170 [Student answer, unintelligible] 115 00:06:49,170 --> 00:06:52,080 Exactly. You can set breakpoints; you can say break execution 116 00:06:52,080 --> 00:06:54,020 at the main function or the foo function. 117 00:06:54,020 --> 00:06:56,800 You can say break execution at line 123. 118 00:06:56,800 --> 00:06:58,950 And breakpoints are a really powerful technique 119 00:06:58,950 --> 00:07:01,110 because if you have a general sense of where your problem 120 00:07:01,110 --> 00:07:05,360 probably is, you don't have to waste time stepping through the program's entirety. 121 00:07:05,360 --> 00:07:08,250 You can essentially jump right there and then start typing-- 122 00:07:08,250 --> 00:07:10,970 stepping through it with step or next or the like. 123 00:07:10,970 --> 00:07:14,340 But the catch with something like GDB is that it helps you, the human, 124 00:07:14,340 --> 00:07:16,940 find your problems and find your bugs. 125 00:07:16,940 --> 00:07:19,470 It doesn't necessarily find them so much for you. 126 00:07:19,470 --> 00:07:23,070 >> So we introduced the other day style50, which is a short command line tool 127 00:07:23,070 --> 00:07:27,500 that tries to stylize your code a little bit more cleanly than you, the human, might have done. 128 00:07:27,500 --> 00:07:29,530 But that, too, is really just an aesthetic thing. 129 00:07:29,530 --> 00:07:34,110 But it turns out there's this other tool called Valgrind that is a little more arcane to use. 130 00:07:34,110 --> 00:07:36,860 Its output is atrociously cryptic at first glance. 131 00:07:36,860 --> 00:07:39,420 But it's wonderfully useful, especially now that we're at the part of the term 132 00:07:39,420 --> 00:07:43,080 where you're starting to use malloc and dynamic memory allocation. 133 00:07:43,080 --> 00:07:45,420 Things can go really, really wrong quickly. 134 00:07:45,420 --> 00:07:49,320 Because if you forget to free your memory, or you dereference some NULL pointer, 135 00:07:49,320 --> 00:07:55,770 or you dereference some garbage pointer, what is typically the symptom that results? 136 00:07:55,770 --> 00:07:59,470 Seg fault. And you get this core file of some number of kilobytes or megabytes 137 00:07:59,470 --> 00:08:02,990 that represents the state of your program's memory when it crashed, 138 00:08:02,990 --> 00:08:05,730 but your program ultimately seg faults, segmentation fault, 139 00:08:05,730 --> 00:08:08,450 which means something bad happened almost always related 140 00:08:08,450 --> 00:08:11,750 to a memory-related mistake that you made somewhere. 141 00:08:11,750 --> 00:08:14,100 So Valgrind helps you find things like this. 142 00:08:14,100 --> 00:08:17,720 It's a tool that you run, like GDB, after you've compiled your program, 143 00:08:17,720 --> 00:08:20,330 but rather than run your program directly, you run Valgrind 144 00:08:20,330 --> 00:08:23,960 and you pass to it your program, just like you do with GDB. 145 00:08:23,960 --> 00:08:26,220 Now, the usage, to get the best kind of output, 146 00:08:26,220 --> 00:08:30,410 is a little long, so right there atop of the screen you'll see Valgrind -v. 147 00:08:30,410 --> 00:08:35,350 "-v" almost universally means verbose when you're using programs on a Linux computer. 148 00:08:35,350 --> 00:08:38,770 So it means spit out more data than you might by default. 149 00:08:38,770 --> 00:08:45,510 "--leak-check=full." This is just saying check for all possible memory leaks, 150 00:08:45,510 --> 00:08:49,430 mistakes that I might have made. This, too, is a common paradigm with Linux programs. 151 00:08:49,430 --> 00:08:52,710 Generally, if you have a command line argument that's a "switch", 152 00:08:52,710 --> 00:08:55,830 that's supposed to change the program's behavior, and it's a single letter, 153 00:08:55,830 --> 00:09:00,310 it's -v, but if that's switched, just by design of the programmer, 154 00:09:00,310 --> 00:09:05,150 is a full word or series of words, the command line argument starts with --. 155 00:09:05,150 --> 00:09:08,190 These are just human conventions, but you'll see them increasingly. 156 00:09:08,190 --> 00:09:12,410 And then, finally, "a.out" is the arbitrary name for the program in this particular example. 157 00:09:12,410 --> 00:09:14,640 And here's some representative output. 158 00:09:14,640 --> 00:09:22,890 >> Before we look at what that might mean, let me go over to a snippet of code over here. 159 00:09:22,890 --> 00:09:26,390 And let me move this out of the way, coming soon, 160 00:09:26,390 --> 00:09:32,120 and let's take a look at memory.c, which is this short example here. 161 00:09:32,120 --> 00:09:36,290 So in this program, let me zoom in on the functions and questions. 162 00:09:36,290 --> 00:09:39,430 We have a function main that calls a function, f, 163 00:09:39,430 --> 00:09:45,590 and then what does f proceed to do, in slightly technical English? 164 00:09:45,590 --> 00:09:49,760 What does f proceed to do? 165 00:09:49,760 --> 00:09:53,680 How about I'll start with line 20, and the star's location doesn't matter, 166 00:09:53,680 --> 00:09:56,720 but I'll just be consistent here with last lecture. 167 00:09:56,720 --> 00:09:59,910 What's line 20 do for us? On the left hand side. We'll break it down further. 168 00:09:59,910 --> 00:10:02,410 Int* x: what does that do? 169 00:10:02,410 --> 00:10:04,940 Okay. It's declaring a pointer, and now let's be even more technical. 170 00:10:04,940 --> 00:10:10,230 What does it mean, very concretely, to declare a pointer? Someone else? 171 00:10:10,230 --> 00:10:15,050 Yeah? [Student answer, unintelligible] Too far. 172 00:10:15,050 --> 00:10:17,060 So you're reading to the right-hand side of the equal sign. 173 00:10:17,060 --> 00:10:20,290 Let's focus just on the left, just on int* x. 174 00:10:20,290 --> 00:10:24,700 This does "declare" a pointer, but now let's dive in deeper to that definition. 175 00:10:24,700 --> 00:10:28,330 What does that concretely, technically mean? Yeah? 176 00:10:28,330 --> 00:10:31,940 [Student answer, unintelligible] 177 00:10:31,940 --> 00:10:35,090 Okay. It's preparing to save an address in memory. 178 00:10:35,090 --> 00:10:40,680 Good. And let's take this one step further; it's declaring a variable, x, that's 32 bits. 179 00:10:40,680 --> 00:10:44,440 And I know it's 32 bits because--? 180 00:10:44,440 --> 00:10:47,390 It's not because it's an int, because it's a pointer in this case. 181 00:10:47,390 --> 00:10:49,650 Coincidence that it's one and the same with an int, 182 00:10:49,650 --> 00:10:51,970 but the fact that there's the star there means this is a pointer 183 00:10:51,970 --> 00:10:57,300 and in the appliance, as with many computers, but not all, pointers are 32 bits. 184 00:10:57,300 --> 00:11:01,850 On more modern hardware like the latest Macs, the latest PCs, you might have 64-bit pointers, 185 00:11:01,850 --> 00:11:04,160 but in the appliance, these things are 32 bits. 186 00:11:04,160 --> 00:11:08,380 So we'll standardize on that. More concretely, the story goes as follows: 187 00:11:08,380 --> 00:11:10,820 We "declare" a pointer; what does that mean? 188 00:11:10,820 --> 00:11:12,810 We prepare to store a memory address. 189 00:11:12,810 --> 00:11:15,530 What does that mean? 190 00:11:15,530 --> 00:11:19,810 We create a variable called x that takes up 32 bits 191 00:11:19,810 --> 00:11:23,810 that will soon store the address of an integer. 192 00:11:23,810 --> 00:11:26,470 And that's probably about as precise as we can get. 193 00:11:26,470 --> 00:11:31,810 It's fine moving forward to simplify the world and just say declare a pointer called x. 194 00:11:31,810 --> 00:11:35,380 Declare a pointer, but realize and understand what's actually going on 195 00:11:35,380 --> 00:11:38,490 even in just those few characters. 196 00:11:38,490 --> 00:11:42,040 >> Now, this one's almost a little easier, even though it's a longer expression. 197 00:11:42,040 --> 00:11:48,160 So what is this doing, that's highlighted now: "malloc(10* sizeof (int));" Yeah? 198 00:11:48,160 --> 00:11:52,350 [Student answer, unintelligible] 199 00:11:52,350 --> 00:11:58,250 Good. And I'll take it there. It's allocating a chunk of memory for ten integers. 200 00:11:58,250 --> 00:12:02,190 And now let's dive in slightly deeper; it's allocating a chunk of memory for ten integers. 201 00:12:02,190 --> 00:12:05,390 What is malloc then returning? 202 00:12:05,390 --> 00:12:10,390 The address of that chunk, or, more concretely, the address of the first byte of that chunk. 203 00:12:10,390 --> 00:12:14,080 How then am I, the programmer, to know where that chunk of memory ends? 204 00:12:14,080 --> 00:12:18,340 I know that it's contiguous. Malloc, by definition, will give you a contiguous chunk of memory. 205 00:12:18,340 --> 00:12:21,240 No gaps in it. You have access to every byte in that chunk, 206 00:12:21,240 --> 00:12:26,760 back to back to back, but how do I know where the end of this chunk of memory is? 207 00:12:26,760 --> 00:12:28,850 When you use malloc? [Student answer, unintelligible] Good. 208 00:12:28,850 --> 00:12:30,670 You don't. You have to remember. 209 00:12:30,670 --> 00:12:35,960 I have to remember that I used the value 10, and I don't even seem to have done that here. 210 00:12:35,960 --> 00:12:41,000 But the onus is entirely on me. Strlen, which we've become slightly reliant on for strings, 211 00:12:41,000 --> 00:12:45,860 works only because of this convention of having \0 212 00:12:45,860 --> 00:12:48,840 or this special nul character, N-U-L, at the end of a string. 213 00:12:48,840 --> 00:12:51,740 That does not hold for just arbitrary chunks of memory. 214 00:12:51,740 --> 00:12:58,590 It's up to you. So line 20, then, allocates a chunk of memory 215 00:12:58,590 --> 00:13:02,590 that can store ten integers, and it stores the address of the first byte 216 00:13:02,590 --> 00:13:05,610 of that chunk of memory in the variable called x. 217 00:13:05,610 --> 00:13:11,140 Ergo, which is a pointer. So line 21, unfortunately, was a mistake. 218 00:13:11,140 --> 00:13:16,110 But first, what is it doing? It's saying store at location 10, 0 indexed, 219 00:13:16,110 --> 00:13:19,480 of the chunk of memory called x the value 0. 220 00:13:19,480 --> 00:13:21,510 >> So notice a couple of things are going on. 221 00:13:21,510 --> 00:13:25,420 Even though x is a pointer, recall from a couple weeks ago 222 00:13:25,420 --> 00:13:29,440 that you can still use the array-style square bracket notation. 223 00:13:29,440 --> 00:13:36,180 Because that's actually short-hand notation for the more cryptic-looking pointer arithmetic. 224 00:13:36,180 --> 00:13:40,320 where we would do something like this: Take the address x, move 10 spots over, 225 00:13:40,320 --> 00:13:44,550 then go there to whatever address is stored at that location. 226 00:13:44,550 --> 00:13:48,090 But frankly, this is just atrocious to read and get comfortable with. 227 00:13:48,090 --> 00:13:52,900 So the world typically uses the square brackets just because it's so much more human-friendly to read. 228 00:13:52,900 --> 00:13:55,140 But that's what's really going on underneath the hood; 229 00:13:55,140 --> 00:13:58,190 x is an address, not an array, per se. 230 00:13:58,190 --> 00:14:02,410 So this is storing 0 at location 10 in x. 231 00:14:02,410 --> 00:14:06,120 Why is this bad? Yeah? 232 00:14:06,120 --> 00:14:17,370 [Student answer, unintelligible] Exactly. 233 00:14:17,370 --> 00:14:21,100 We only allocated ten ints, but we count from 0 when programming in C, 234 00:14:21,100 --> 00:14:25,690 so you have access to 0 1 2 3 4 5 6 7 8 9, but not 10. 235 00:14:25,690 --> 00:14:30,270 So either the program is going to seg fault or it's not. 236 00:14:30,270 --> 00:14:32,900 But we don't really know; this is sort of a nondeterministic behavior. 237 00:14:32,900 --> 00:14:35,600 It really depends on whether we get lucky. 238 00:14:35,600 --> 00:14:40,650 If it turns out that the operating system doesn't mind if I use that extra byte, 239 00:14:40,650 --> 00:14:43,360 even though it hasn't given it to me, my program might not crash. 240 00:14:43,360 --> 00:14:46,780 It's raw, it's buggy, but you might not see that symptom, 241 00:14:46,780 --> 00:14:48,960 or you might see it only once in a while. 242 00:14:48,960 --> 00:14:51,230 But the reality is that the bug is, in fact, there. 243 00:14:51,230 --> 00:14:54,320 And it's really problematic if you've written a program that you want to be correct, 244 00:14:54,320 --> 00:14:58,840 that you've sold the program that people are using that every once in a while crashes 245 00:14:58,840 --> 00:15:02,450 because, of course, this is not good. In fact, if you have an Android phone or an iPhone 246 00:15:02,450 --> 00:15:05,550 and you download apps these days, if you've ever had an app just quit, 247 00:15:05,550 --> 00:15:10,040 all of a sudden it disappears, that's almost always the result of some memory-related issue, 248 00:15:10,040 --> 00:15:12,830 whereby the programmer screwed up and dereferenced a pointer 249 00:15:12,830 --> 00:15:18,670 that he or she shouldn't have, and the result of iOS or Android is to just kill the program altogether 250 00:15:18,670 --> 00:15:23,080 rather than risk undefined behavior or some kind of security compromise. 251 00:15:23,080 --> 00:15:25,950 >> There's one other bug in this program besides this one. 252 00:15:25,950 --> 00:15:30,180 What else have I screwed up in this program? 253 00:15:30,180 --> 00:15:32,740 I've not practiced what I've preached. Yeah? 254 00:15:32,740 --> 00:15:34,760 [Student answer, unintelligible] Good. 255 00:15:34,760 --> 00:15:36,880 I haven't freed the memory. So the rule of thumb now 256 00:15:36,880 --> 00:15:43,150 has to be anytime you call malloc, you must call free when you are done using that memory. 257 00:15:43,150 --> 00:15:45,610 Now, when would I want to free this memory? 258 00:15:45,610 --> 00:15:49,780 Probably, assuming this first line was correct, I would want to do it here. 259 00:15:49,780 --> 00:15:55,710 Because I couldn't, for instance, do it down here. Why? 260 00:15:55,710 --> 00:15:57,860 Just out of scope. So even though we're talking about pointers, 261 00:15:57,860 --> 00:16:04,830 this is a week 2 or 3 issue, where x is only in scope inside of the curly braces where it was declared. 262 00:16:04,830 --> 00:16:11,000 So you definitely can't free it there. My only chance to free it is roughly after line 21. 263 00:16:11,000 --> 00:16:15,170 This is a fairly simple program; it was fairly easy once you kind of wrapped your mind 264 00:16:15,170 --> 00:16:17,870 around what the program's doing, where the mistakes were. 265 00:16:17,870 --> 00:16:20,500 And even if you didn't see it at first, hopefully it's a little obvious now 266 00:16:20,500 --> 00:16:23,870 that these mistakes are pretty easily solved and easily made. 267 00:16:23,870 --> 00:16:28,720 But when a program is more than 12 lines long, it's 50 lines long, 100 lines long, 268 00:16:28,720 --> 00:16:31,150 walking through your code line by line, thinking through it logically, 269 00:16:31,150 --> 00:16:35,110 is possible but not particularly fun to do, constantly looking for bugs, 270 00:16:35,110 --> 00:16:38,340 and it's also difficult to do, and that's why a tool like Valgrind exists. 271 00:16:38,340 --> 00:16:40,900 Let me go ahead and do this: let me open my terminal window, 272 00:16:40,900 --> 00:16:45,400 and let me not just run memory, because memory seems to be fine. 273 00:16:45,400 --> 00:16:49,180 I'm getting lucky. Going to that additional byte at the end of the array 274 00:16:49,180 --> 00:16:51,060 doesn't seem to be too problematic. 275 00:16:51,060 --> 00:16:56,370 But let me, nonetheless, do a sanity check, which just means to check 276 00:16:56,370 --> 00:16:58,320 whether or not this is actually correct. 277 00:16:58,320 --> 00:17:04,690 >> So let's do valgrind -v --leak -check=full, 278 00:17:04,690 --> 00:17:07,520 and then the name of the program in this case is memory, not a.out. 279 00:17:07,520 --> 00:17:10,760 So let me go ahead and do this. Hit Enter. 280 00:17:10,760 --> 00:17:14,109 Dear God. This is its output, and this is what I alluded to earlier. 281 00:17:14,109 --> 00:17:17,550 But, if you learn to read through all of the nonsense here, 282 00:17:17,550 --> 00:17:20,760 most of this is just diagnostic output that's not that interesting. 283 00:17:20,760 --> 00:17:24,829 What your eye really wants to be looking for is any mention of error or invalid. 284 00:17:24,829 --> 00:17:26,800 Words that suggest problems. 285 00:17:26,800 --> 00:17:29,340 And indeed, let's see what's going wrong down here. 286 00:17:29,340 --> 00:17:35,230 I have a summary of some sort, "in use at exit: 40 bytes in 1 blocks." 287 00:17:35,230 --> 00:17:38,750 I'm not really sure what a block is yet, but 40 bytes 288 00:17:38,750 --> 00:17:41,260 actually feels like I could figure out where that's coming from. 289 00:17:41,260 --> 00:17:45,030 40 bytes. Why are 40 bytes in use at exit? 290 00:17:45,030 --> 00:17:48,780 And more specifically, if we scroll down here, 291 00:17:48,780 --> 00:17:54,520 why have I definitely lost 40 bytes? Yeah? 292 00:17:54,520 --> 00:17:59,520 [Student answer, unintelligible] Perfect. Yeah, exactly. 293 00:17:59,520 --> 00:18:03,540 There were ten integers, and each of those is size of 4, or 32 bits, 294 00:18:03,540 --> 00:18:08,300 so I've lost precisely 40 bytes because, as you proposed, I haven't called free. 295 00:18:08,300 --> 00:18:13,460 That's one bug, and now let's look down a little further and see next to this, 296 00:18:13,460 --> 00:18:16,900 "invalid write of size 4." Now what is this? 297 00:18:16,900 --> 00:18:21,150 This address is expressed what base notation, apparently? 298 00:18:21,150 --> 00:18:23,640 This is hexadecimal, and any time you see a number starting with 0x, 299 00:18:23,640 --> 00:18:29,410 it means hexadecimal, which we did way back in, I think, pset 0's section of questions, 300 00:18:29,410 --> 00:18:34,090 which was just to do a warmup exercise, converting decimal to hex to binary and so forth. 301 00:18:34,090 --> 00:18:39,220 Hexadecimal, just by human convention, is usually used to represent pointers 302 00:18:39,220 --> 00:18:41,570 or, more generally, addresses. It's just a convention, 303 00:18:41,570 --> 00:18:45,340 because it's a little easier to read, it's a little more compact than something like decimal, 304 00:18:45,340 --> 00:18:47,720 and binary is useless for most humans to use. 305 00:18:47,720 --> 00:18:50,840 So now what does this mean? Well, it looks like there's an invalid write 306 00:18:50,840 --> 00:18:54,480 of size 4 on line 21 of memory.c. 307 00:18:54,480 --> 00:18:59,180 So let's go back to line 21, and indeed, here is that invalid write. 308 00:18:59,180 --> 00:19:02,640 So Valgrind isn't going to completely hold my hand and tell me what the fix is, 309 00:19:02,640 --> 00:19:05,520 but it is detecting that I'm doing an invalid write. 310 00:19:05,520 --> 00:19:08,800 I'm touching 4 bytes that I shouldn't be, and apparently that's because, 311 00:19:08,800 --> 00:19:13,960 as you pointed out, I'm doing [10] instead of [9] maximally 312 00:19:13,960 --> 00:19:16,660 or [0] or something in between. 313 00:19:16,660 --> 00:19:19,690 With Valgrind, realize any time you're now writing a program 314 00:19:19,690 --> 00:19:24,190 that uses pointers and uses memory, and malloc more specifically, 315 00:19:24,190 --> 00:19:27,080 definitely get into the habit of running this long 316 00:19:27,080 --> 00:19:30,890 but very easily copied and pasted command of Valgrind 317 00:19:30,890 --> 00:19:32,650 to see if there's some errors in there. 318 00:19:32,650 --> 00:19:34,820 And it'll be overwhelming every time you see the output, 319 00:19:34,820 --> 00:19:39,430 but just parse through visually all of the output and see if you see mentions of errors 320 00:19:39,430 --> 00:19:43,190 or warnings or invalid or lost. 321 00:19:43,190 --> 00:19:46,200 Any words that sound like you screwed up somewhere. 322 00:19:46,200 --> 00:19:48,580 So realize that's a new tool in your toolkit. 323 00:19:48,580 --> 00:19:51,270 >> Now on Monday, we had a whole bunch of folks come up here 324 00:19:51,270 --> 00:19:53,150 and represent the notion of a linked list. 325 00:19:53,150 --> 00:20:00,970 And we introduced the linked list as a solution to what problem? 326 00:20:00,970 --> 00:20:04,590 Yeah? [Student answer, unintelligible] Good. 327 00:20:04,590 --> 00:20:06,530 Arrays cannot have memory added to them. 328 00:20:06,530 --> 00:20:09,440 If you allocate an array of size 10, that's all you get. 329 00:20:09,440 --> 00:20:13,690 You can call a function like realloc if you initially called malloc, 330 00:20:13,690 --> 00:20:17,580 and that can try to grow the array if there is space toward the end of it 331 00:20:17,580 --> 00:20:21,610 that no one else is using, and if there's not, it will just find you a bigger chunk somewhere else. 332 00:20:21,610 --> 00:20:25,040 But then it will copy all of those bytes into the new array. 333 00:20:25,040 --> 00:20:28,310 This sounds like a very correct solution. 334 00:20:28,310 --> 00:20:34,790 Why is this unattractive? 335 00:20:34,790 --> 00:20:36,940 I mean it works, humans have solved this problem. 336 00:20:36,940 --> 00:20:40,710 Why did we need to resolve it on Monday with linked lists? Yeah? 337 00:20:40,710 --> 00:20:44,060 [Student answer, unintelligible] It could take a long time. 338 00:20:44,060 --> 00:20:49,260 In fact, any time you're calling malloc or realloc or calloc, which is yet another one, 339 00:20:49,260 --> 00:20:52,470 any time you, the program, are talking to the operating system, 340 00:20:52,470 --> 00:20:54,310 you tend to slow the program down. 341 00:20:54,310 --> 00:20:57,470 And if you're doing these kinds of things in loops, you're really slowing things down. 342 00:20:57,470 --> 00:21:00,740 You're not going to notice this for the simplest of "hello world" type programs, 343 00:21:00,740 --> 00:21:04,300 but in much larger programs, asking the operating system again and again for memory 344 00:21:04,300 --> 00:21:07,520 or giving it back again and again tends not to be a good thing. 345 00:21:07,520 --> 00:21:11,210 Plus, it's just sort of intellectually--it's a complete waste of time. 346 00:21:11,210 --> 00:21:16,490 Why allocate more and more memory, risk copying everything into the new array, 347 00:21:16,490 --> 00:21:21,980 if you have an alternative that lets you allocate only as much memory as you actually need? 348 00:21:21,980 --> 00:21:24,130 So there's pluses and minuses in here. 349 00:21:24,130 --> 00:21:26,730 One of the pluses now is that we have dynamism. 350 00:21:26,730 --> 00:21:29,100 Doesn't matter where the chunks of memory are that are free, 351 00:21:29,100 --> 00:21:32,070 I can just sort of create these bread crumbs via pointers 352 00:21:32,070 --> 00:21:34,470 to string my whole linked list together. 353 00:21:34,470 --> 00:21:36,470 But I pay at least one price. 354 00:21:36,470 --> 00:21:40,060 >> What do I have to give up in gaining linked lists? 355 00:21:40,060 --> 00:21:42,470 Yeah? [Student answer, unintelligible] Good. 356 00:21:42,470 --> 00:21:45,650 You need more memory. Now I need space for these pointers, 357 00:21:45,650 --> 00:21:47,900 and in the case of this super simple linked list 358 00:21:47,900 --> 00:21:51,410 that is only trying to store integers, which are 4 bytes, we keep saying 359 00:21:51,410 --> 00:21:54,240 well, a pointer is 4 bytes, so now I've literally doubled 360 00:21:54,240 --> 00:21:57,290 the amount of memory I need just to store this list. 361 00:21:57,290 --> 00:21:59,680 But again, this is a constant tradeoff in computer science 362 00:21:59,680 --> 00:22:03,440 between time and space and development, effort and other resources. 363 00:22:03,440 --> 00:22:06,630 What's another downside of using a linked list? Yeah? 364 00:22:06,630 --> 00:22:10,150 [Student answer, unintelligible] 365 00:22:10,150 --> 00:22:12,600 Good. Not as easy to access. We can no longer leverage 366 00:22:12,600 --> 00:22:15,530 week 0 principles like divide and conquer. 367 00:22:15,530 --> 00:22:18,220 And more specifically, binary search. Because even though we humans 368 00:22:18,220 --> 00:22:20,400 can see roughly where the middle of this list is, 369 00:22:20,400 --> 00:22:25,840 the computer only knows that this linked list starts at address called first. 370 00:22:25,840 --> 00:22:28,250 And that's 0x123 or something like that. 371 00:22:28,250 --> 00:22:30,990 And the only way the program can find the middle element 372 00:22:30,990 --> 00:22:33,350 is to actually search the whole list. 373 00:22:33,350 --> 00:22:35,500 And even then, it literally has to search the whole list because 374 00:22:35,500 --> 00:22:38,950 even once you reach the middle element by following the pointers, 375 00:22:38,950 --> 00:22:42,380 you, the program, have no idea how long this list is, potentially, 376 00:22:42,380 --> 00:22:45,250 until you hit the end of it, and how do you know programmatically 377 00:22:45,250 --> 00:22:48,600 that you're at the end of a linked list? 378 00:22:48,600 --> 00:22:51,120 There's a special NULL pointer, so again, a convention. 379 00:22:51,120 --> 00:22:53,870 Rather than use this pointer, we definitely don't want it to be some garbage value 380 00:22:53,870 --> 00:22:57,750 pointing off stage somewhere; we want it to be hand down, NULL, 381 00:22:57,750 --> 00:23:01,530 so that we have this terminus in this data structure so we know where it ends. 382 00:23:01,530 --> 00:23:03,410 >> What if we want to manipulate this? 383 00:23:03,410 --> 00:23:05,980 We did most of this visually, and with humans, 384 00:23:05,980 --> 00:23:07,630 but what if we want to do an insertion? 385 00:23:07,630 --> 00:23:12,360 So the original list was 9, 17, 20, 22, 29, 34. 386 00:23:12,360 --> 00:23:16,730 What if we then wanted to malloc space for number 55, a node for it, 387 00:23:16,730 --> 00:23:20,730 and then we want to insert 55 into the list just as we did on Monday? 388 00:23:20,730 --> 00:23:23,690 How do we do this? Well, Anita came up and she essentially walked the list. 389 00:23:23,690 --> 00:23:27,500 She started at the first element, then the next, the next, the next, the next, the next. 390 00:23:27,500 --> 00:23:29,500 Finally hit the left-hand all the way down 391 00:23:29,500 --> 00:23:34,480 and realized oh, this is NULL. So what pointer manipulation needed to be done? 392 00:23:34,480 --> 00:23:37,980 The person who was on the end, number 34, needed his left hand raised 393 00:23:37,980 --> 00:23:46,220 to point at 55, 55 needed their left arm pointing down to be the new NULL terminator. Done. 394 00:23:46,220 --> 00:23:49,540 Pretty easy to insert 55 into a sorted list. 395 00:23:49,540 --> 00:23:51,800 And how might this look? 396 00:23:51,800 --> 00:23:55,690 >> Let me go ahead and open up some code example here. 397 00:23:55,690 --> 00:23:58,120 I'll open up gedit, and let me open up two files first. 398 00:23:58,120 --> 00:24:02,050 One is list1.h, and let me just remind that this was the chunk of code 399 00:24:02,050 --> 00:24:04,920 that we used to represent a node. 400 00:24:04,920 --> 00:24:13,040 A node has both an int called n and a pointer called next that just points to the next thing in the list. 401 00:24:13,040 --> 00:24:15,450 That is now in a .h file. Why? 402 00:24:15,450 --> 00:24:19,090 There's this convention, and we haven't taken advantage of this a huge amount ourselves, 403 00:24:19,090 --> 00:24:22,220 but the person who wrote printf and other functions 404 00:24:22,220 --> 00:24:27,150 gave as a gift to the world all of those functions by writing a file called stdio.h. 405 00:24:27,150 --> 00:24:30,950 And then there's string.h, and then there's map.h, and there's all these h files 406 00:24:30,950 --> 00:24:34,410 that you might have seen or used during the term written by other people. 407 00:24:34,410 --> 00:24:38,470 Typically in those .h files are only things like typedefs 408 00:24:38,470 --> 00:24:42,310 or declarations of custom types or declarations of constants. 409 00:24:42,310 --> 00:24:47,890 You don't put functions' implementations in header files. 410 00:24:47,890 --> 00:24:50,570 You put, instead, just their prototypes. 411 00:24:50,570 --> 00:24:53,050 You put things you want to share with the world what they need 412 00:24:53,050 --> 00:24:55,640 in order to compile their code. So just to get into this habit, 413 00:24:55,640 --> 00:24:59,110 we decided to do the same thing. There's not much in list1.h, 414 00:24:59,110 --> 00:25:02,070 but we've put something that might be of interest to people in the world 415 00:25:02,070 --> 00:25:05,030 who want to use our linked list implementation. 416 00:25:05,030 --> 00:25:08,040 Now, in list1.c, I won't go through this whole thing 417 00:25:08,040 --> 00:25:11,390 because it's a bit long, this program, but let's run it real quickly at the prompt. 418 00:25:11,390 --> 00:25:15,720 Let me compile list1, let me then run list1, and what you'll see is 419 00:25:15,720 --> 00:25:18,070 we've simulated a simple little program here 420 00:25:18,070 --> 00:25:20,990 that's going to allow me to add and remove numbers to a list. 421 00:25:20,990 --> 00:25:24,310 So let me go ahead and type 3 for the menu option 3. 422 00:25:24,310 --> 00:25:27,880 I want to insert the number--let's do the first number, which was 9, 423 00:25:27,880 --> 00:25:30,550 and now I'm told the list is now 9. 424 00:25:30,550 --> 00:25:33,760 Let me go ahead and do another insertion, so I hit menu option 3. 425 00:25:33,760 --> 00:25:36,760 What number do I want to insert? 17. 426 00:25:36,760 --> 00:25:39,220 Enter. And I'll do just one more. 427 00:25:39,220 --> 00:25:41,720 Let me insert the number 22. 428 00:25:41,720 --> 00:25:45,850 So we have the beginnings of the linked list that we had in slide form a moment ago. 429 00:25:45,850 --> 00:25:48,740 How is this insertion actually happening? 430 00:25:48,740 --> 00:25:52,000 Indeed, 22 is now at the end of the list. 431 00:25:52,000 --> 00:25:55,050 So the story we told on stage on Monday and recapped just now 432 00:25:55,050 --> 00:25:57,460 must actually be happening in code. 433 00:25:57,460 --> 00:25:59,700 Let's take a look. Let me scroll down in this file. 434 00:25:59,700 --> 00:26:01,720 We'll gloss over some of the functions, 435 00:26:01,720 --> 00:26:05,630 but we'll go down to, say, the insert function. 436 00:26:05,630 --> 00:26:11,720 >> Let's see how we go about inserting a new node into this linked list. 437 00:26:11,720 --> 00:26:14,550 Where is the list declared? Well, let's scroll all the way up at the top, 438 00:26:14,550 --> 00:26:19,970 and notice that my linked list is essentially declared as a single pointer that's initially NULL. 439 00:26:19,970 --> 00:26:23,180 So I'm using a global variable here, which in general we've preached against 440 00:26:23,180 --> 00:26:25,280 because it makes your code a little messy to maintain, 441 00:26:25,280 --> 00:26:29,080 it's sort of lazy, usually, but it's not lazy and it's not wrong and it's not bad 442 00:26:29,080 --> 00:26:33,660 if your program's sole purpose in life is to simulate one linked list. 443 00:26:33,660 --> 00:26:35,460 Which is exactly what we're doing. 444 00:26:35,460 --> 00:26:39,100 So rather than declare this in main and then have to pass it to every function 445 00:26:39,100 --> 00:26:42,640 we've written in this program, we instead realize oh, let's just make it global 446 00:26:42,640 --> 00:26:47,060 because the whole purpose of this program is to demonstrate one and only one linked list. 447 00:26:47,060 --> 00:26:50,700 So that feels okay. Here are my prototypes, and we won't go through all of these, 448 00:26:50,700 --> 00:26:55,800 but I wrote a delete function, a find function, an insert function, and a traverse function. 449 00:26:55,800 --> 00:26:59,080 But let's now go back down to the insert function 450 00:26:59,080 --> 00:27:01,490 and see how this one works here. 451 00:27:01,490 --> 00:27:09,940 Insert is on line--here we go. 452 00:27:09,940 --> 00:27:12,850 Insert. So it doesn't take any arguments, because we're going to ask 453 00:27:12,850 --> 00:27:15,930 the user inside of this function for the number they want to insert. 454 00:27:15,930 --> 00:27:19,410 But first, we prepare to give them some space. 455 00:27:19,410 --> 00:27:22,050 This is sort of copy and paste from the other example. 456 00:27:22,050 --> 00:27:25,110 In that case, we were allocating an int; this time we're allocating a node. 457 00:27:25,110 --> 00:27:27,910 I don't really remember how many bytes a node is, but that's fine. 458 00:27:27,910 --> 00:27:30,460 Sizeof can figure that out for me. 459 00:27:30,460 --> 00:27:33,340 And why am I checking for NULL in line 120? 460 00:27:33,340 --> 00:27:37,530 What could go wrong in line 119? Yeah? 461 00:27:37,530 --> 00:27:40,530 [Student answer, unintelligible] 462 00:27:40,530 --> 00:27:43,440 Good. Just could be the case that I've asked for too much memory 463 00:27:43,440 --> 00:27:47,020 or something's wrong and the operating system doesn't have enough bytes to give me, 464 00:27:47,020 --> 00:27:50,640 so it signals as much by returning NULL, and if I don't check for that 465 00:27:50,640 --> 00:27:54,710 and I just blindly proceed to use the address returned, it could be NULL. 466 00:27:54,710 --> 00:27:58,400 It could be some unknown value; not a good thing unless I-- 467 00:27:58,400 --> 00:28:00,590 actually won't be an unknown value. It could be NULL, so I don't want 468 00:28:00,590 --> 00:28:02,550 to abuse it and risk dereferencing it. 469 00:28:02,550 --> 00:28:07,410 If that happens, I just return and we'll pretend like I didn't get back any memory at all. 470 00:28:07,410 --> 00:28:12,270 >> Otherwise, I tell the user give me a number to insert, I call our old friend GetInt, 471 00:28:12,270 --> 00:28:15,530 and then this was the new syntax we introduced on Monday. 472 00:28:15,530 --> 00:28:20,320 'newptr->n' means take the address that you were given by malloc 473 00:28:20,320 --> 00:28:23,490 which represents the first byte of a new node object, 474 00:28:23,490 --> 00:28:26,860 and then go to the field called n. 475 00:28:26,860 --> 00:28:35,270 A little trivia question: This is equivalent to what more cryptic line of code? 476 00:28:35,270 --> 00:28:38,110 How else could I have written this? Want to take a stab? 477 00:28:38,110 --> 00:28:41,480 [Student answer, unintelligible] 478 00:28:41,480 --> 00:28:44,870 Good. Using the .n, but it's not quite as simple as this. 479 00:28:44,870 --> 00:28:47,090 What do I first need to do? [Student answer, unintelligible] 480 00:28:47,090 --> 00:28:52,730 Good. I need to do *newptr.n. 481 00:28:52,730 --> 00:28:55,610 So this is saying new pointer's obviously an address. Why? 482 00:28:55,610 --> 00:28:59,520 Because it was returned by malloc. The *newptr saying "go there," 483 00:28:59,520 --> 00:29:02,970 and then once you're there, then you can use the more familiar .n, 484 00:29:02,970 --> 00:29:05,730 but this just looks a little ugly, especially if we humans are going to 485 00:29:05,730 --> 00:29:10,360 draw pointers with arrows all the time; the world has standardized on this arrow notation, 486 00:29:10,360 --> 00:29:12,320 which does exactly the same thing. 487 00:29:12,320 --> 00:29:16,070 So you only use the -> notation when the thing on the left is a pointer. 488 00:29:16,070 --> 00:29:18,790 Otherwise, if it's an actual struct, use the .n. 489 00:29:18,790 --> 00:29:25,800 And then this: Why do I initialize newptr->next to NULL? 490 00:29:25,800 --> 00:29:28,610 We don't want a dangling left hand off of the end of the stage. 491 00:29:28,610 --> 00:29:31,630 We want it pointing straight down, which means the end of this list 492 00:29:31,630 --> 00:29:34,980 could potentially be at this node, so we better make sure it is NULL. 493 00:29:34,980 --> 00:29:38,460 And, in general, initializing your variables or your data members and structs 494 00:29:38,460 --> 00:29:40,470 to something is just good practice. 495 00:29:40,470 --> 00:29:45,170 Just letting garbage exist and continue to exist generally gets you in trouble 496 00:29:45,170 --> 00:29:48,650 if you forget to do something later on. 497 00:29:48,650 --> 00:29:51,590 >> Here's a few cases. This, again, is the insert function, 498 00:29:51,590 --> 00:29:54,930 and the first thing I check for is if the variable called first, 499 00:29:54,930 --> 00:29:58,240 that global variable is NULL, that means there is no linked list. 500 00:29:58,240 --> 00:30:02,460 We haven't inserted any numbers, so it's trivial to insert this current number 501 00:30:02,460 --> 00:30:05,240 into the list, because it just belongs at the start of the list. 502 00:30:05,240 --> 00:30:08,100 So this was when Anita was just standing up here alone, pretending 503 00:30:08,100 --> 00:30:11,390 no one else was up here on stage until we allocated a node, 504 00:30:11,390 --> 00:30:13,940 then she could raise her hand for the first time, 505 00:30:13,940 --> 00:30:17,420 if everyone else had come up on the stage after her on Monday. 506 00:30:17,420 --> 00:30:22,900 Now here, this is a little check where I have to say if the new node's value of n 507 00:30:22,900 --> 00:30:27,370 is < the value of n in the current first node, 508 00:30:27,370 --> 00:30:29,930 that means there is a linked list that's begun. 509 00:30:29,930 --> 00:30:32,330 There's at least one node in the list, but this new guy 510 00:30:32,330 --> 00:30:37,230 belongs before it, so we need to move things around. 511 00:30:37,230 --> 00:30:43,450 In other words, if the list has started with just, let's say, 512 00:30:43,450 --> 00:30:48,100 just the number 17, that's the--actually, we can do this more clearly. 513 00:30:48,100 --> 00:30:56,010 If we start our story with a pointer here called first, 514 00:30:56,010 --> 00:30:59,870 and initially it's NULL, and we insert the number 9, 515 00:30:59,870 --> 00:31:02,510 the number 9 clearly belongs at the start of the list. 516 00:31:02,510 --> 00:31:07,400 So let's pretend we just malloced the address or the number 9 and put it here. 517 00:31:07,400 --> 00:31:13,170 If first is 9 by default, the first scenario we discussed just means let's point this guy here, 518 00:31:13,170 --> 00:31:15,790 leave this as NULL; now we have the number 9. 519 00:31:15,790 --> 00:31:18,280 The next number we want to insert is 17. 520 00:31:18,280 --> 00:31:22,420 17 belongs over here, so we're going to have to do some logical stepping through this. 521 00:31:22,420 --> 00:31:26,060 So let's instead, before we do that, let's pretend that we wanted to insert the number 8. 522 00:31:26,060 --> 00:31:28,650 >> So just for convenience's sake, I'm going to draw here. 523 00:31:28,650 --> 00:31:30,760 But remember, malloc can put it most anywhere. 524 00:31:30,760 --> 00:31:33,460 But for drawing's sake, I'll put it here. 525 00:31:33,460 --> 00:31:38,440 So pretend I've just allocated a node for the number 8; this is NULL by default. 526 00:31:38,440 --> 00:31:42,800 What now has to happen? A couple of things. 527 00:31:42,800 --> 00:31:47,090 We made this mistake on stage on Monday where we updated a pointer like this, 528 00:31:47,090 --> 00:31:51,890 then did this, and then we claimed--we orphaned everyone else on stage. 529 00:31:51,890 --> 00:31:54,350 Because you can't--the order of operations here is important, 530 00:31:54,350 --> 00:31:58,760 because now we've lost this node 9 that is just sort of floating in space. 531 00:31:58,760 --> 00:32:01,150 So this was not the right approach on Monday. 532 00:32:01,150 --> 00:32:03,330 We first have to do something else. 533 00:32:03,330 --> 00:32:06,280 The state of the world looks like this. Initially, 8 has been allocated. 534 00:32:06,280 --> 00:32:10,550 What would be a better way of inserting 8? 535 00:32:10,550 --> 00:32:14,720 Instead of updating this pointer first, just update this one here instead. 536 00:32:14,720 --> 00:32:17,720 So we need a line of code that's going to turn this NULL character 537 00:32:17,720 --> 00:32:22,020 into an actual pointer that's pointing at node 9, 538 00:32:22,020 --> 00:32:27,970 and then we can safely change first to point at this guy here. 539 00:32:27,970 --> 00:32:31,330 Now we have a list, a linked list, of two elements. 540 00:32:31,330 --> 00:32:33,580 And what does this actually look like here? 541 00:32:33,580 --> 00:32:36,900 If we look at the code, notice that I've done exactly that. 542 00:32:36,900 --> 00:32:41,970 I've said newptr, and in this story, newptr was pointing at this guy. 543 00:32:41,970 --> 00:32:45,520 >> So let me draw one more thing, and I should have left a little more room for this. 544 00:32:45,520 --> 00:32:48,540 So forgive the tiny little drawing. 545 00:32:48,540 --> 00:32:52,140 This guy is called newptr. 546 00:32:52,140 --> 00:32:57,940 That is the variable we declared a few lines earlier, in line--just above 25. 547 00:32:57,940 --> 00:33:03,430 And it's pointing to 8. So when I say newptr->next, that means go to the struct 548 00:33:03,430 --> 00:33:07,910 that's being pointed at by newptr, so here we are, go there. 549 00:33:07,910 --> 00:33:13,990 Then the arrow is saying get the next field, and then the = is saying put what value there? 550 00:33:13,990 --> 00:33:17,280 The value that was in first; what value was in first? 551 00:33:17,280 --> 00:33:21,930 First was pointing at this node, so that means this should now point at this node. 552 00:33:21,930 --> 00:33:25,660 In other words, what looks albeit a ridiculous mess with my handwriting, 553 00:33:25,660 --> 00:33:28,620 what's a simple idea of just moving these arrows around 554 00:33:28,620 --> 00:33:31,560 translates to code with just this one liner. 555 00:33:31,560 --> 00:33:38,110 Store what is in first in the next field and then update what first actually is. 556 00:33:38,110 --> 00:33:40,900 Let's go ahead and fast-forward through some of this, 557 00:33:40,900 --> 00:33:44,220 and look only at this tail insertion for now. 558 00:33:44,220 --> 00:33:51,210 Suppose I get to the point where I find that the next field of some node is NULL. 559 00:33:51,210 --> 00:33:53,410 And at this point in the story, a detail that I'm glossing over 560 00:33:53,410 --> 00:33:58,170 is that I've introduced another pointer up here in line 142, predecessor pointer. 561 00:33:58,170 --> 00:34:01,320 Essentially, at this point in the story, once the list gets long, 562 00:34:01,320 --> 00:34:04,800 I kind of need to walk it with two fingers because if I go too far, 563 00:34:04,800 --> 00:34:08,219 remember in a single-length list, you can't go backwards. 564 00:34:08,219 --> 00:34:13,659 So this idea of predptr is my left finger, and newptr--not newptr. 565 00:34:13,659 --> 00:34:17,199 Another pointer that's here is my other finger, and I'm just kind of walking the list. 566 00:34:17,199 --> 00:34:22,179 That's why that exists. But let's only consider one of the simpler cases here. 567 00:34:22,179 --> 00:34:26,620 If that pointer's next field is NULL, what's the logical implication? 568 00:34:26,620 --> 00:34:30,840 If you are traversing this list and you hit a NULL pointer? 569 00:34:30,840 --> 00:34:35,780 You're at the end of the list, and so the code to then append this one additional element 570 00:34:35,780 --> 00:34:41,230 is sort of the intuitive will take that node whose next pointer is NULL, 571 00:34:41,230 --> 00:34:46,120 so this is currently NULL, and change it, though, to be the address of the new node. 572 00:34:46,120 --> 00:34:52,260 So we're just drawing in code the arrow that we drew on stage by raising someone's left hand. 573 00:34:52,260 --> 00:34:54,070 >> And the case that I'll wave my hands at for now, 574 00:34:54,070 --> 00:34:58,020 just because I think it's easy to get lost when we do it in this sort of environment, 575 00:34:58,020 --> 00:35:00,600 is checking for insertion at the list's middle. 576 00:35:00,600 --> 00:35:03,220 But just intuitively, what needs to happen if you want to figure out 577 00:35:03,220 --> 00:35:06,600 where some number belongs in the middle is you do have to walk it 578 00:35:06,600 --> 00:35:09,510 with more than one finger, more than one pointer, 579 00:35:09,510 --> 00:35:12,920 figure out where it belongs by checking is the element < the current one, 580 00:35:12,920 --> 00:35:15,450 > the current one, and once you find that place, 581 00:35:15,450 --> 00:35:20,400 then you have to do this sort of shell game where you move the pointers around very carefully. 582 00:35:20,400 --> 00:35:23,850 And that answer, if you'd like to reason through this at home on your own, 583 00:35:23,850 --> 00:35:28,340 boils down just to these two lines of code, but the order of those lines is super important. 584 00:35:28,340 --> 00:35:31,390 Because if you drop someone's hand and raise someone else's in the wrong order, 585 00:35:31,390 --> 00:35:34,580 again, you could end up orphaning the list. 586 00:35:34,580 --> 00:35:39,500 To summarize more conceptually, the insertion at the tail is relatively straightforward. 587 00:35:39,500 --> 00:35:42,940 The insertion at the head is also relatively straightforward, 588 00:35:42,940 --> 00:35:45,580 but you need to update an additional pointer this time 589 00:35:45,580 --> 00:35:47,930 to squeeze number 5 into the list here, 590 00:35:47,930 --> 00:35:51,560 and then insertion in the middle involves even more effort, 591 00:35:51,560 --> 00:35:56,130 to very carefully insert the number 20 in its correct location, 592 00:35:56,130 --> 00:35:58,350 which is between 17 and 22. 593 00:35:58,350 --> 00:36:02,700 So you need to do something like have the new node 20 point to 22, 594 00:36:02,700 --> 00:36:08,470 and then, which node's pointer needs to be updated last? 595 00:36:08,470 --> 00:36:10,630 It's 17, to actually insert it. 596 00:36:10,630 --> 00:36:14,080 So again, I'll defer the actual code for that particular implementation. 597 00:36:14,080 --> 00:36:17,280 >> At first glance, it's a little overwhelming, but it's really just an infinite loop 598 00:36:17,280 --> 00:36:21,770 that's looping, looping, looping, looping, and breaking as soon as you hit the NULL pointer, 599 00:36:21,770 --> 00:36:24,590 at which point you can do the requisite insertion. 600 00:36:24,590 --> 00:36:30,960 This, then, is representative linked list insertion code. 601 00:36:30,960 --> 00:36:34,590 That was kind of a lot, and it feels like we've solved one problem, 602 00:36:34,590 --> 00:36:36,940 but we've introduced a whole other one. Frankly, we've spent all this time 603 00:36:36,940 --> 00:36:40,540 on big O and Ω and running time, trying to solve problems more quickly, 604 00:36:40,540 --> 00:36:43,270 and here we are taking a big step backwards, it feels. 605 00:36:43,270 --> 00:36:45,380 And yet, if the goal is to store data, 606 00:36:45,380 --> 00:36:48,010 it feels like the Holy Grail, as we said on Monday, would really be 607 00:36:48,010 --> 00:36:50,470 to store things instantly. 608 00:36:50,470 --> 00:36:53,930 >> In fact, suppose that we did put aside linked list for a moment 609 00:36:53,930 --> 00:36:56,000 and we instead introduced the notion of a table. 610 00:36:56,000 --> 00:36:59,110 And let's just think of a table for a moment as an array. 611 00:36:59,110 --> 00:37:03,790 This array and this case here has some 26 elements, 0 through 25, 612 00:37:03,790 --> 00:37:07,940 and suppose that you needed some chunk of storage for names: 613 00:37:07,940 --> 00:37:10,350 Alice and Bob and Charlie and the like. 614 00:37:10,350 --> 00:37:12,880 And you need some data structure to store those names. 615 00:37:12,880 --> 00:37:15,000 Well, you could use something like a linked list 616 00:37:15,000 --> 00:37:20,260 and you could walk the list inserting Alice before Bob and Charlie after Bob and so forth. 617 00:37:20,260 --> 00:37:23,850 And, in fact, if you want to see code like that as an aside, 618 00:37:23,850 --> 00:37:27,230 know that in list2.h, we do exactly that. 619 00:37:27,230 --> 00:37:30,610 We won't go through this code, but this is a variant of the first example 620 00:37:30,610 --> 00:37:34,640 that introduces one other struct we've seen before called student, 621 00:37:34,640 --> 00:37:40,330 and then what it actually stores in the linked list is a pointer to a student structure 622 00:37:40,330 --> 00:37:44,520 rather than a simple little integer, n. 623 00:37:44,520 --> 00:37:46,900 So realize there's code there that involves actual strings, 624 00:37:46,900 --> 00:37:49,940 but if the goal at hand really now is to address the efficiency problem, 625 00:37:49,940 --> 00:37:53,380 wouldn't it be nice if we're given an object called Alice, 626 00:37:53,380 --> 00:37:56,020 we want to put her into the right location in a data structure, 627 00:37:56,020 --> 00:37:58,860 it feels like it'd be really nice to just put Alice, 628 00:37:58,860 --> 00:38:01,180 whose name starts with A, in the first location. 629 00:38:01,180 --> 00:38:05,270 And Bob, whose name starts with B, in the second location. 630 00:38:05,270 --> 00:38:09,580 With an array, or let's start calling it a table, a hash table at that, 631 00:38:09,580 --> 00:38:13,650 we can do exactly that. If we are given a name like Alice, 632 00:38:13,650 --> 00:38:16,700 a string like Alice, where do you put A-l-i-c-e? 633 00:38:16,700 --> 00:38:20,540 We need a hueristic. We need a function to take some input like Alice 634 00:38:20,540 --> 00:38:24,610 and return an answer, "Put Alice at this location." 635 00:38:24,610 --> 00:38:28,720 And this function, this black box, is going to be called a hash function. 636 00:38:28,720 --> 00:38:32,330 >> A hash function is something that takes an input, like "Alice", 637 00:38:32,330 --> 00:38:38,080 and returns to you, typically, the numeric location in some data structure where Alice belongs. 638 00:38:38,080 --> 00:38:40,830 In this case, our hash function should be relatively simple. 639 00:38:40,830 --> 00:38:47,510 Our hash function should say, if you are given "Alice", which character should I care about? 640 00:38:47,510 --> 00:38:55,660 The first one. So I look at [0], and then I say if [0] character is A, return the number 0. 641 00:38:55,660 --> 00:39:01,130 If it's B, return 1. If it's C, return 2, and so forth. 642 00:39:01,130 --> 00:39:05,940 All 0 index, and that would allow me to insert Alice and then Bob and then Charlie and so forth 643 00:39:05,940 --> 00:39:10,960 into this data structure. But there's a problem. 644 00:39:10,960 --> 00:39:13,060 What if Anita comes along again? 645 00:39:13,060 --> 00:39:17,510 Where do we put Anita? Her name, too, starts with the letter A, 646 00:39:17,510 --> 00:39:20,330 and it feels like we've made an even bigger mess of this problem. 647 00:39:20,330 --> 00:39:24,380 We now have immediate insertion, constant time insertion, into a data structure 648 00:39:24,380 --> 00:39:27,100 rather than worse-case linear, 649 00:39:27,100 --> 00:39:29,510 but what can we do with Anita in this case? 650 00:39:29,510 --> 00:39:34,110 What are the two options, really? Yeah? 651 00:39:34,110 --> 00:39:37,410 [Student answer, unintelligible] Okay, so we could have another dimension. 652 00:39:37,410 --> 00:39:42,320 That's good. So we can build things out in 3D like we talked about verbally on Monday. 653 00:39:42,320 --> 00:39:46,700 We could add another access here, but suppose that no, I'm trying to keep this simple. 654 00:39:46,700 --> 00:39:50,160 The whole goal here is to have immediate constant-time access, 655 00:39:50,160 --> 00:39:52,170 so that's adding too much complexity. 656 00:39:52,170 --> 00:39:55,970 What are other options when trying to insert Anita into this data structure? Yeah? 657 00:39:55,970 --> 00:39:58,610 [Student answer, unintelligible] Good. So we could move everyone else down, 658 00:39:58,610 --> 00:40:03,040 like Charlie nudges down Bob and Alice, and then we put Anita where she really wants to be. 659 00:40:03,040 --> 00:40:05,660 >> Of course, now, there's a side effect of this. 660 00:40:05,660 --> 00:40:09,000 This data structure is probably useful not because we want to insert people once 661 00:40:09,000 --> 00:40:11,250 but because we want to check if they're there later 662 00:40:11,250 --> 00:40:13,600 if we want to print out all of the names in the data structure. 663 00:40:13,600 --> 00:40:15,850 We're going to do something with this data eventually. 664 00:40:15,850 --> 00:40:20,810 So now we've kind of screwed over Alice, who's no longer where she's supposed to be. 665 00:40:20,810 --> 00:40:23,880 Nor is Bob, nor is Charlie. 666 00:40:23,880 --> 00:40:26,060 So maybe this isn't such a good idea. 667 00:40:26,060 --> 00:40:28,830 But indeed, this is one option. We could shift everyone down, 668 00:40:28,830 --> 00:40:32,240 or heck, Anita came late to the game, why don't we just put Anita 669 00:40:32,240 --> 00:40:35,870 not here, not here, not here, let's just put her a little lower in the list. 670 00:40:35,870 --> 00:40:38,680 But then this problem starts to devolve again. 671 00:40:38,680 --> 00:40:41,630 You might be able to find Alice instantly, based on her first name. 672 00:40:41,630 --> 00:40:44,320 And Bob instantly, and Charlie. But then you look for Anita, 673 00:40:44,320 --> 00:40:46,360 and you see, hmm, Alice is in the way. 674 00:40:46,360 --> 00:40:48,770 Well, let me check below Alice. Bob is not Anita. 675 00:40:48,770 --> 00:40:51,850 Charlie is not Anita. Oh, there is Anita. 676 00:40:51,850 --> 00:40:54,720 And if you continue that train of logic all the way, 677 00:40:54,720 --> 00:41:00,690 what's the worst-case running time of finding or inserting Anita into this new data structure? 678 00:41:00,690 --> 00:41:03,280 It's O(n), right? 679 00:41:03,280 --> 00:41:06,280 Because in the worst case, there's Alice, Bob, Charlie . . . 680 00:41:06,280 --> 00:41:10,150 all the way down to someone named "Y", so there's only one spot left. 681 00:41:10,150 --> 00:41:13,950 Thankfully, we have no one called "Z", so we put Anita at the very bottom. 682 00:41:13,950 --> 00:41:16,040 >> We haven't really solved that problem. 683 00:41:16,040 --> 00:41:19,890 So maybe we do need to introduce this third dimension. 684 00:41:19,890 --> 00:41:22,230 And it turns out, if we do introduce this third dimension, 685 00:41:22,230 --> 00:41:25,240 we can't do this perfectly, but the Holy Grail is going to be getting 686 00:41:25,240 --> 00:41:28,370 constant-time insertion and dynamic insertions so that 687 00:41:28,370 --> 00:41:30,960 we don't have to hard-code an array of size 26. 688 00:41:30,960 --> 00:41:34,400 We can insert as many names as we want, but let's take our 5-minute break here 689 00:41:34,400 --> 00:41:38,790 and then do that properly. 690 00:41:38,790 --> 00:41:46,020 All right. I set the story up pretty artificially there 691 00:41:46,020 --> 00:41:48,670 by choosing Alice and then Bob and then Charlie and then Anita, 692 00:41:48,670 --> 00:41:51,000 whose name was obviously going to collide with Alice. 693 00:41:51,000 --> 00:41:54,120 But the question we ended on Monday with is just how probable is it 694 00:41:54,120 --> 00:41:56,370 that you would get these kinds of collisions? In other words, 695 00:41:56,370 --> 00:42:00,490 if we start to use this tabular structure, which is really just an array, 696 00:42:00,490 --> 00:42:02,460 in this case of 26 locations, 697 00:42:02,460 --> 00:42:05,740 what if our inputs are instead uniformly distributed? 698 00:42:05,740 --> 00:42:09,620 It's not artificially Alice and Bob and Charlie and David and so forth alphabetically, 699 00:42:09,620 --> 00:42:12,380 it's uniformly distributed over A through Z. 700 00:42:12,380 --> 00:42:15,220 >> Maybe we'll just get lucky and we're not going to have two A's or two B's 701 00:42:15,220 --> 00:42:17,640 with very high probability, but as someone pointed out, 702 00:42:17,640 --> 00:42:20,730 if we generalized this problem and not do 0 to 25 703 00:42:20,730 --> 00:42:26,060 but, say, 0 through 364 or 65, often the number of days in a typical year, 704 00:42:26,060 --> 00:42:31,170 and asked the question, "What's the probability that two of us in this room have the same birthday?" 705 00:42:31,170 --> 00:42:34,600 Put it another way, what's the probability that two of us have a name starting with A? 706 00:42:34,600 --> 00:42:37,190 The sort of question is the same, but this address space, 707 00:42:37,190 --> 00:42:39,940 this search space, is bigger in the case of birthdays, 708 00:42:39,940 --> 00:42:42,820 because we have so many more days in the year than letters in the alphabet. 709 00:42:42,820 --> 00:42:44,910 What's the probability of a collision? 710 00:42:44,910 --> 00:42:48,410 Well, we can think of this by figuring out the math the opposite way. 711 00:42:48,410 --> 00:42:50,580 What's the probability of no collisions? 712 00:42:50,580 --> 00:42:53,970 Well, this expression here says that what's the probability 713 00:42:53,970 --> 00:42:58,770 if there's just one person in this room, that they have a unique birthday? 714 00:42:58,770 --> 00:43:01,190 It's 100%. Because if there's only one person in the room, 715 00:43:01,190 --> 00:43:03,940 his or her birthday can be any of the 365 days out of the year. 716 00:43:03,940 --> 00:43:08,650 So 365/365 options gives me a value of 1. 717 00:43:08,650 --> 00:43:11,250 So the probability in question at the moment is just 1. 718 00:43:11,250 --> 00:43:13,270 But if there's a second person in the room, 719 00:43:13,270 --> 00:43:16,490 what's the probability that their birthday is different? 720 00:43:16,490 --> 00:43:20,680 There's only 364 possible days, ignoring leap years, 721 00:43:20,680 --> 00:43:23,580 for their birthday not to collide with the other persons. 722 00:43:23,580 --> 00:43:31,920 So 364/365. If a third person comes in, it's 363/365, and so forth. 723 00:43:31,920 --> 00:43:35,790 So we keep multiplying together these fractions, which are getting smaller and smaller, 724 00:43:35,790 --> 00:43:40,720 to figure out what is the probability that all of us have unique birthdays? 725 00:43:40,720 --> 00:43:43,570 But then we can, of course, just take that answer and flip it around 726 00:43:43,570 --> 00:43:47,210 and do 1 minus all of that, an expression we'll eventually get 727 00:43:47,210 --> 00:43:51,250 if you remember the back of your math books, it looks a little something like this, 728 00:43:51,250 --> 00:43:54,590 which is much more easily interpreted graphically. 729 00:43:54,590 --> 00:43:57,820 And this graphic here has on the x axis the number of birthdays, 730 00:43:57,820 --> 00:44:02,030 or number of people with birthdays, and on the y axis is the probability of a match. 731 00:44:02,030 --> 00:44:06,060 And what this is saying is that if you have, let's say, even, 732 00:44:06,060 --> 00:44:10,860 let's choose something like 22, 23. 733 00:44:10,860 --> 00:44:13,160 If there's 22 or 23 people in the room, 734 00:44:13,160 --> 00:44:17,100 the probability that two of those very few people are going to have the same birthday 735 00:44:17,100 --> 00:44:19,560 is actually super high, combinatorially. 736 00:44:19,560 --> 00:44:23,450 50% odds that in a class of just 22 people, a seminar, practically, 737 00:44:23,450 --> 00:44:25,790 2 of those people are going to have the same birthday. 738 00:44:25,790 --> 00:44:28,520 Because there's so many ways in which you can have the same birthday. 739 00:44:28,520 --> 00:44:31,110 Even worse, if you look at the right-hand side of the chart, 740 00:44:31,110 --> 00:44:34,040 by the time you have a class with 58 students in it, 741 00:44:34,040 --> 00:44:39,270 the probability of 2 people having a birthday is super, super high, nearly 100%. 742 00:44:39,270 --> 00:44:41,880 Now, that's sort of a fun fact about real life. 743 00:44:41,880 --> 00:44:45,850 >> But the implications, now, for data structures and storing information 744 00:44:45,850 --> 00:44:51,100 means that just assuming you have a nice, clean, uniform distribution of data 745 00:44:51,100 --> 00:44:53,650 and you have a big enough array to fit a bunch of things 746 00:44:53,650 --> 00:44:59,360 doesn't mean you're going to get people in unique locations. 747 00:44:59,360 --> 00:45:03,810 You're going to have collisions. So this notion of hashing, as it's called, 748 00:45:03,810 --> 00:45:07,450 taking an input like "Alice" and massaging it in some way 749 00:45:07,450 --> 00:45:10,190 and then getting back an answer like 0 or 1 or 2. 750 00:45:10,190 --> 00:45:17,500 Getting back some output from that function is plagued by this probability of collision. 751 00:45:17,500 --> 00:45:19,530 So how can we handle those collisions? 752 00:45:19,530 --> 00:45:21,940 Well, on the one case, we can take the idea that was suggested. 753 00:45:21,940 --> 00:45:25,100 We can just shift everyone down, or maybe, a little more simply, 754 00:45:25,100 --> 00:45:29,870 rather than move everyone else, let's just move Anita to the bottom of the available spot. 755 00:45:29,870 --> 00:45:32,810 So if Alice is in 0, Bob is in 1, Charlie is in 2, 756 00:45:32,810 --> 00:45:35,260 we'll just put Anita at location 3. 757 00:45:35,260 --> 00:45:38,860 And this is a technique in data structures called linear probing. 758 00:45:38,860 --> 00:45:41,310 Linear because you're just walking this line, and you're sort of probing 759 00:45:41,310 --> 00:45:43,640 for available spots in the data structure. 760 00:45:43,640 --> 00:45:46,210 Of course, this devolves into O(n). 761 00:45:46,210 --> 00:45:49,590 If the data structure's really full, there's 25 people in it already, 762 00:45:49,590 --> 00:45:54,120 and then Anita comes along, she ends up at what would be location Z, and that's fine. 763 00:45:54,120 --> 00:45:56,540 She still fits, and we can find her later. 764 00:45:56,540 --> 00:46:00,100 >> But this was contrary to the goal of speeding things up. 765 00:46:00,100 --> 00:46:02,530 So what if we instead introduced this third dimension? 766 00:46:02,530 --> 00:46:06,400 That technique is generally called separate chaining, or having chains. 767 00:46:06,400 --> 00:46:10,030 And what a hash table now is, this tabular structure, 768 00:46:10,030 --> 00:46:13,450 your table is just an array of pointers. 769 00:46:13,450 --> 00:46:18,230 But what those pointers point to is guess what? 770 00:46:18,230 --> 00:46:21,970 A linked list. So what if we take the best of both of these worlds? 771 00:46:21,970 --> 00:46:26,500 We use arrays for the initial indexes 772 00:46:26,500 --> 00:46:32,070 into the data structure so we can instantly go to [0] [1], [30] or so forth, 773 00:46:32,070 --> 00:46:36,480 but so that we have some flexibility and we can fit Anita and Alice and Adam 774 00:46:36,480 --> 00:46:38,630 and any other A name, 775 00:46:38,630 --> 00:46:43,470 we instead let the other axis grow arbitrarily. 776 00:46:43,470 --> 00:46:47,340 And we finally, as of Monday, have that expressive capability with linked list. 777 00:46:47,340 --> 00:46:49,530 We can grow a data structure arbitrarily. 778 00:46:49,530 --> 00:46:52,450 Alternatively, we could just make a huge 2-dimensional array, 779 00:46:52,450 --> 00:46:57,190 but that's going to be an awful situation if one of the rows in a 2-dimensional array 780 00:46:57,190 --> 00:47:01,280 isn't large enough for the additional person whose name happens to start with A. 781 00:47:01,280 --> 00:47:04,200 God forbid we have to reallocate a huge 2-dimensional structure 782 00:47:04,200 --> 00:47:06,600 just because there's so many people named A, 783 00:47:06,600 --> 00:47:09,480 especially when there's so few people named Z something. 784 00:47:09,480 --> 00:47:12,170 It's just going to be a very sparse data structure. 785 00:47:12,170 --> 00:47:15,400 So it's not perfect by any means, but now we at least have the ability 786 00:47:15,400 --> 00:47:19,090 to instantly find where Alice or Anita belongs, 787 00:47:19,090 --> 00:47:21,090 at least in terms of the vertical axis, 788 00:47:21,090 --> 00:47:25,850 and then we just have to decide where to put Anita or Alice in this linked list. 789 00:47:25,850 --> 00:47:32,480 If we don't care about sorting things, how quickly could we insert Alice into a structure like this? 790 00:47:32,480 --> 00:47:35,370 It's constant time. We index into [0], and if no one's there, 791 00:47:35,370 --> 00:47:37,550 Alice goes at the start of that linked list. 792 00:47:37,550 --> 00:47:40,000 But that's not a huge deal. Because if Anita then comes along 793 00:47:40,000 --> 00:47:42,160 some number of steps later, where does Anita belong? 794 00:47:42,160 --> 00:47:45,140 Well, [0]. Oop. Alice is already in that linked list. 795 00:47:45,140 --> 00:47:47,760 >> But if we don't care about sorting these names, 796 00:47:47,760 --> 00:47:53,580 we can just move Alice over, insert Anita, but even that is constant time. 797 00:47:53,580 --> 00:47:57,010 Even if there's Alice and Adam and all these other A names, 798 00:47:57,010 --> 00:47:59,410 it's not really shifting them physically. Why? 799 00:47:59,410 --> 00:48:04,090 Because we just did here with linked list, who knows were these nodes are anyway? 800 00:48:04,090 --> 00:48:06,550 All you have to do is move the bread crumbs. 801 00:48:06,550 --> 00:48:10,930 Move the arrows around; you don't have to physically move any data around. 802 00:48:10,930 --> 00:48:14,610 So we can insert Anita, in that case, instantly. Constant time. 803 00:48:14,610 --> 00:48:20,250 So we have constant-time lookup, and constant-time insertion of someone like Anita. 804 00:48:20,250 --> 00:48:22,740 But kind of oversimplifying the world. 805 00:48:22,740 --> 00:48:28,510 What if we later want to find Alice? 806 00:48:28,510 --> 00:48:31,050 What if we later want to find Alice? 807 00:48:31,050 --> 00:48:35,690 How many steps is that going to take? 808 00:48:35,690 --> 00:48:37,850 [Student answer, unintelligible] 809 00:48:37,850 --> 00:48:40,950 Exactly. The number of people before Alice in the linked list. 810 00:48:40,950 --> 00:48:45,420 So it's not quite perfect, because our data structure, again, has this vertical access 811 00:48:45,420 --> 00:48:50,240 and then it has these linked lists hanging--actually, let's not draw it an an array. 812 00:48:50,240 --> 00:48:56,020 It has these linked lists hanging off of it that looks a little something like this. 813 00:48:56,020 --> 00:48:59,110 But the problem is if Alice and Adam and all these other A names 814 00:48:59,110 --> 00:49:01,720 end up more and more over there, 815 00:49:01,720 --> 00:49:04,810 finding someone could end up taking a bunch of steps, 816 00:49:04,810 --> 00:49:06,670 bcause you have to traverse the linked list, 817 00:49:06,670 --> 00:49:08,090 which is a linear operation. 818 00:49:08,090 --> 00:49:14,270 So really, then, the insertion time ultimately is O(n), where n is the number of elements in the list. 819 00:49:14,270 --> 00:49:21,780 Divided by, let's arbitrarily call it m, where m is the number of linked lists 820 00:49:21,780 --> 00:49:24,500 that we have in this vertical axis. 821 00:49:24,500 --> 00:49:27,180 In other words, if we truly assume a uniform distribution of names, 822 00:49:27,180 --> 00:49:30,150 totally unrealistic. There's obviously more of some letters than others. 823 00:49:30,150 --> 00:49:32,580 >> But if we assume for the moment a uniform distribution, 824 00:49:32,580 --> 00:49:37,350 and we have n total people, and m total chains 825 00:49:37,350 --> 00:49:40,630 available to us, then the length of each of these chains 826 00:49:40,630 --> 00:49:44,380 fairly simply is going to be the total, n divided by the number of chains. 827 00:49:44,380 --> 00:49:48,900 So n/m. But here's where we can be all mathematically clever. 828 00:49:48,900 --> 00:49:53,030 m is a constant, because there's a fixed number of these. 829 00:49:53,030 --> 00:49:54,620 You're going to declare your array at the beginning, 830 00:49:54,620 --> 00:49:58,450 and we're not resizing the vertical axis. By definition, that stays fixed. 831 00:49:58,450 --> 00:50:01,220 It's only the horizontal axis, so to speak, that's changing. 832 00:50:01,220 --> 00:50:04,760 So technically, this is a constant. So now, the insertion time 833 00:50:04,760 --> 00:50:09,700 is pretty much O(n). 834 00:50:09,700 --> 00:50:12,410 So that doesn't feel all that much better. 835 00:50:12,410 --> 00:50:14,940 But what's the truth here? Well, all this time, for weeks, 836 00:50:14,940 --> 00:50:20,640 we've been saying O(n²). O(n), 2 x n², - n, divided by 2. . . ech. 837 00:50:20,640 --> 00:50:23,580 It's just n². But now, in this part of the semester, 838 00:50:23,580 --> 00:50:25,560 we can start talking about the real world again. 839 00:50:25,560 --> 00:50:31,520 And n/m is absolutely faster than just n alone. 840 00:50:31,520 --> 00:50:35,170 If you have a thousand names, and you break them up into multiple buckets 841 00:50:35,170 --> 00:50:37,820 so that you have only ten names in each of these chains, 842 00:50:37,820 --> 00:50:41,670 absolutely searching ten things is going to be faster than a thousand things. 843 00:50:41,670 --> 00:50:43,740 And so one of the upcoming problem sets is going to challenge you 844 00:50:43,740 --> 00:50:46,100 to think about exactly that even though, yeah, 845 00:50:46,100 --> 00:50:49,520 asymptotically and mathematically, this is still just linear, 846 00:50:49,520 --> 00:50:51,700 which sucks in general when trying to find things. 847 00:50:51,700 --> 00:50:54,530 In reality, it's going to be faster than that 848 00:50:54,530 --> 00:50:56,520 because of this divisor. 849 00:50:56,520 --> 00:50:58,310 And so there's again going to be this trade-off 850 00:50:58,310 --> 00:51:01,390 and this conflict between theory and reality, 851 00:51:01,390 --> 00:51:03,550 and one of the knobs will start turning at this point in the semester 852 00:51:03,550 --> 00:51:07,510 is more of the reality one as we sort of prepare for semster's end, 853 00:51:07,510 --> 00:51:09,280 as we introduce the world of web programming, 854 00:51:09,280 --> 00:51:11,530 where really, performance is going to count because your users are going to 855 00:51:11,530 --> 00:51:14,880 start to feel and appreciate poor design decisions. 856 00:51:14,880 --> 00:51:19,950 >> So how do you go about implementing a linked--a hash table with 31 elements? 857 00:51:19,950 --> 00:51:22,600 And the previous example was arbitrarily about birthdays. 858 00:51:22,600 --> 00:51:26,190 If someone has a birthday of January 1 or February 1, we'll put them in this bucket. 859 00:51:26,190 --> 00:51:28,960 If it's January 2, February 2, March 2, we'll put them in this bucket. 860 00:51:28,960 --> 00:51:32,220 That's why it was 31. How do you declare a hash table? 861 00:51:32,220 --> 00:51:37,480 It can be pretty simple, node* table is my arbitrary name for it, [31]. 862 00:51:37,480 --> 00:51:42,400 This gives me 31 pointers to nodes, 863 00:51:42,400 --> 00:51:45,370 and that allows me to have 31 pointers to linked lists 864 00:51:45,370 --> 00:51:48,800 even if those chains are initially NULL. 865 00:51:48,800 --> 00:51:54,860 What do I want to put if I want to store "Alice," "Bob," "Charlie"? 866 00:51:54,860 --> 00:51:57,010 Well, we need to wrap those things in a structure 867 00:51:57,010 --> 00:52:00,530 because we need Alice to point to Bob, to point to Charlie, and so forth. 868 00:52:00,530 --> 00:52:04,940 We can't just have the names alone, so I could create a new structure called node here. 869 00:52:04,940 --> 00:52:08,310 >> What is an actual node? What is a node in this new linked list? 870 00:52:08,310 --> 00:52:11,840 The first thing, called word, is for the person's name. 871 00:52:11,840 --> 00:52:14,340 LENGTH, presumably, relates to the maximum length of a human's name, 872 00:52:14,340 --> 00:52:18,210 whatever that is, 20, 30, 40 characters in crazy corner cases, 873 00:52:18,210 --> 00:52:22,680 and +1 is for what? It's just the extra NULL character, \0. 874 00:52:22,680 --> 00:52:27,410 So this node is wrapping "something" inside of itself, 875 00:52:27,410 --> 00:52:29,640 but it also declares a pointer called next 876 00:52:29,640 --> 00:52:32,580 so that we can chain Alice to Bob to Charlie and so forth. 877 00:52:32,580 --> 00:52:36,700 Can be NULL but doesn't necessarily have to be. 878 00:52:36,700 --> 00:52:40,110 Any questions on these hash tables? Yeah? 879 00:52:40,110 --> 00:52:46,190 [Student asking question, unintelligible] An array--good question. 880 00:52:46,190 --> 00:52:50,120 Why is this char word in an array rather than just char*? 881 00:52:50,120 --> 00:52:53,830 In this somewhat arbitrary example, I did not want to have to resort 882 00:52:53,830 --> 00:52:56,190 to malloc for each of the original names. 883 00:52:56,190 --> 00:52:59,530 I wanted to declare a maximum amount of memory for the string 884 00:52:59,530 --> 00:53:06,020 so that I could copy into the structure A-l-i-c-e\0 and not have to deal with malloc and free and the like. 885 00:53:06,020 --> 00:53:11,710 But I could do that if I wanted to be more conscious of space use. Good question. 886 00:53:11,710 --> 00:53:14,780 So let's try to generalize away from this 887 00:53:14,780 --> 00:53:18,350 and focus the remainder of today on data structures more generally 888 00:53:18,350 --> 00:53:21,170 and other problems that we can solve using the same fundamentals 889 00:53:21,170 --> 00:53:24,590 even though the data structures themselves might differ in their particulars. 890 00:53:24,590 --> 00:53:27,910 >> So it turns out in computer science, trees are very common. 891 00:53:27,910 --> 00:53:29,760 And you can think of a tree sort of like a family tree, 892 00:53:29,760 --> 00:53:31,830 where there's some roots, some matriarch or patriarch, 893 00:53:31,830 --> 00:53:34,540 grandma or grandpa or earlier back, 894 00:53:34,540 --> 00:53:38,880 beneath which are mom and dad or various siblings or the like. 895 00:53:38,880 --> 00:53:42,500 So a tree structure has nodes and it has children, 896 00:53:42,500 --> 00:53:45,260 usually 0 or more children for each node. 897 00:53:45,260 --> 00:53:47,320 And some of the jargon that you see in this picture here 898 00:53:47,320 --> 00:53:50,630 is any of the little kids or grandkids on the edges 899 00:53:50,630 --> 00:53:52,330 who have no arrows emanating from them, 900 00:53:52,330 --> 00:53:55,070 those are the so-called leaves, and anyone on the inside 901 00:53:55,070 --> 00:53:58,790 is an inner node; you can call it anything along those lines. 902 00:53:58,790 --> 00:54:01,430 But this structure is pretty common. This one's a little arbitrary. 903 00:54:01,430 --> 00:54:04,930 We have one child on the left, we have three children on the right, 904 00:54:04,930 --> 00:54:06,830 two children on the bottom left. 905 00:54:06,830 --> 00:54:10,740 So we can have different-sized trees, but if we start to standardize things, 906 00:54:10,740 --> 00:54:15,330 and you might recall this from Patrick's video on binary search from a previous short 907 00:54:15,330 --> 00:54:19,490 online, binary search doesn't have to be implemented with an array 908 00:54:19,490 --> 00:54:21,410 or pieces of paper on a blackboard. 909 00:54:21,410 --> 00:54:25,490 Suppose that you wanted to store your numbers in a more sophisticated data structure. 910 00:54:25,490 --> 00:54:27,680 You could create a tree like this. 911 00:54:27,680 --> 00:54:35,290 You could have a node declared in C, and that node can have at least two elements inside of it. 912 00:54:35,290 --> 00:54:39,470 One is the number you want to store, and the other is--well, we need one more. 913 00:54:39,470 --> 00:54:41,540 The other is its children. 914 00:54:41,540 --> 00:54:45,150 So here's another data structure. This time, a node is defined as storing a number n 915 00:54:45,150 --> 00:54:49,060 and then two pointers; left child and right child. 916 00:54:49,060 --> 00:54:52,100 And they're not arbitrary. What's interesting about this tree? 917 00:54:52,100 --> 00:55:00,550 >> What's the pattern in how we've laid this out or how Patrick laid it out in his video? 918 00:55:00,550 --> 00:55:02,790 It's kind of obvious that there's some sorting going on here, 919 00:55:02,790 --> 00:55:04,460 but what's the simple rule? Yeah? 920 00:55:04,460 --> 00:55:08,350 [Student answer, unintelligible] 921 00:55:08,350 --> 00:55:12,040 Perfect. If you glance at this, you see the small numbers on the left, 922 00:55:12,040 --> 00:55:14,690 big numbers on the left, but that's true for every node. 923 00:55:14,690 --> 00:55:20,370 For each node, its left child less than it, and its right child greater than it. 924 00:55:20,370 --> 00:55:25,210 What this means now is if I want to search this data structure for, say, the number 44, 925 00:55:25,210 --> 00:55:29,320 I have to start at the root, because as with all of these more complex data structures now, 926 00:55:29,320 --> 00:55:31,910 we only have a pointer to one thing, the beginning. 927 00:55:31,910 --> 00:55:35,010 And in this case, the beginning is the root. It's not the left end, 928 00:55:35,010 --> 00:55:39,530 it's the root of this structure. So I see here's 55, and I'm looking for 44. 929 00:55:39,530 --> 00:55:41,430 Which direction do I want to go? 930 00:55:41,430 --> 00:55:45,680 Well, I want to go to the left, because obviously, to the right is going to be too big. 931 00:55:45,680 --> 00:55:49,050 So notice here, you're sort of conceptually chopping the tree in half 932 00:55:49,050 --> 00:55:51,700 because you're never going down to the right-hand side. 933 00:55:51,700 --> 00:55:55,410 So now I go from the 55 to the 33. It's too small of a number. 934 00:55:55,410 --> 00:56:01,590 I'm looking for 44, but now I know if 44 is in this tree, I can go obviously to the right. 935 00:56:01,590 --> 00:56:04,460 So again, I'm pruning the tree in half. 936 00:56:04,460 --> 00:56:06,780 It's pretty much identical conceptually to the phone book. 937 00:56:06,780 --> 00:56:09,510 It's identical to what we did with the papers on the blackboard, 938 00:56:09,510 --> 00:56:13,940 but it's a more sophisticated structure that allows us to actually do 939 00:56:13,940 --> 00:56:16,880 this divide and conquer by design of the algorithm, 940 00:56:16,880 --> 00:56:19,420 and in fact, traversing a structure like this--whoops. 941 00:56:19,420 --> 00:56:22,870 Traversing a structure like this, where it's only "go this way or go that way," 942 00:56:22,870 --> 00:56:26,870 means all that code that bent your mind at first when implementing it in section 943 00:56:26,870 --> 00:56:31,270 or walking through it at home, for binary search, using recursion or iteration, 944 00:56:31,270 --> 00:56:35,060 it's a pain in the neck. Find the middle element, then do your rounding up or down. 945 00:56:35,060 --> 00:56:39,230 >> There's a beauty to this because we can now use recursion again, 946 00:56:39,230 --> 00:56:43,760 but much more cleanly. Indeed, if you're at the number 55 and you want to find 44, 947 00:56:43,760 --> 00:56:48,450 you go left in this case, then what do you do? You run the exact same algorithm. 948 00:56:48,450 --> 00:56:51,560 You check the value of the node, then you go left or right. 949 00:56:51,560 --> 00:56:53,670 Then you check the value of the node, go left or right. 950 00:56:53,670 --> 00:56:56,710 This is perfectly suited to recursion. 951 00:56:56,710 --> 00:57:00,920 So even though in the past we've done some fairly arbitrary examples involving recursion 952 00:57:00,920 --> 00:57:03,430 that didn't need to be recursive, with data stuctures, 953 00:57:03,430 --> 00:57:07,820 especially trees, it's a perfect application of this idea of taking a problem, 954 00:57:07,820 --> 00:57:12,920 shrinking it, and then solving the same type of, but smaller, program. 955 00:57:12,920 --> 00:57:14,590 >> So there's another data structure that we can introduce. 956 00:57:14,590 --> 00:57:18,760 This one is designed at first glance to look cryptic, but this one's amazing. 957 00:57:18,760 --> 00:57:25,090 So this is a data structure called a trie, t-r-i-e, which is inherited from the word retrieval, 958 00:57:25,090 --> 00:57:30,210 which is not pronounced re-try-val, but that's what the world calls these things. Tries. T-r-i-e. 959 00:57:30,210 --> 00:57:35,190 It is a tree structure of some sort, but each of the nodes in a trie 960 00:57:35,190 --> 00:57:41,280 appears to be what? And this is a bit misleading because it's kind of abbreviated. 961 00:57:41,280 --> 00:57:45,960 But it looks like each node in this trie is actually an array. 962 00:57:45,960 --> 00:57:48,840 And even though the author of this diagram hasn't shown it, 963 00:57:48,840 --> 00:57:54,130 in this case, this trie is a data structure whose purpose in life is to store words 964 00:57:54,130 --> 00:57:57,330 like A-l-i-c-e or B-o-b. 965 00:57:57,330 --> 00:58:02,480 And the way in which this data stores Alice and Bob and Charlie and Anita and so forth 966 00:58:02,480 --> 00:58:06,970 is it uses an array whereby to store Alice in a trie, 967 00:58:06,970 --> 00:58:09,820 we start at the root node which looks like an array, 968 00:58:09,820 --> 00:58:12,080 and it's been written in shorthand notation. 969 00:58:12,080 --> 00:58:15,070 The author omitted a b c d e f g because there were no names with that. 970 00:58:15,070 --> 00:58:19,150 They only showed M and P and T, but in this case, 971 00:58:19,150 --> 00:58:22,780 let's move away from Alice and Bob and Charlie to some names that are here. 972 00:58:22,780 --> 00:58:25,670 Maxwell is actually in this diagram. 973 00:58:25,670 --> 00:58:29,570 So how did the author store M-a-x-w-e-l-l? 974 00:58:29,570 --> 00:58:36,990 He or she started at the root node, and went to [M], so roughly 13, the 13th location in the array. 975 00:58:36,990 --> 00:58:40,750 Then from there, there's a pointer. 976 00:58:40,750 --> 00:58:42,760 A pointer leading to another array. 977 00:58:42,760 --> 00:58:47,880 From there the author indexed into that array at location A, as depicted there at top left, 978 00:58:47,880 --> 00:58:52,250 and then he or she followed that pointer to another array, 979 00:58:52,250 --> 00:58:55,460 and went to the pointer at location X. 980 00:58:55,460 --> 00:58:59,840 Then in the next array location W, E, L, L, and so forth, 981 00:58:59,840 --> 00:59:03,090 and finally, let's actually try to put a picture to this. 982 00:59:03,090 --> 00:59:05,380 What does a node look like in code? 983 00:59:05,380 --> 00:59:11,820 A node in a trie contains an array of pointers to more nodes. 984 00:59:11,820 --> 00:59:16,090 But there's also got to be some kind of boolean value, at least in this implementation. 985 00:59:16,090 --> 00:59:18,770 I happen to call it is_word. Why? 986 00:59:18,770 --> 00:59:22,670 Because when you're inserting M-a-x-w-e-l-l, you're not inserting 987 00:59:22,670 --> 00:59:25,300 anything into this data structure. 988 00:59:25,300 --> 00:59:27,480 You're not writing M. You're not writing X. 989 00:59:27,480 --> 00:59:30,240 All you're doing is following pointers. 990 00:59:30,240 --> 00:59:33,360 The pointer that represents M, then the pointer that represents A, 991 00:59:33,360 --> 00:59:36,310 then the pointer that represents X, then W, E, L, L, 992 00:59:36,310 --> 00:59:41,950 but what you need to do at the end is sort of go, check, I reached this location. 993 00:59:41,950 --> 00:59:45,560 There was a word that ends here in the data structure. 994 00:59:45,560 --> 00:59:48,190 >> So what a trie is really filled with and the author chose to represent 995 00:59:48,190 --> 00:59:51,880 these terminuses with little triangles. 996 00:59:51,880 --> 00:59:56,470 This just means that the fact this triangle's here, this boolean value of true 997 00:59:56,470 --> 00:59:59,200 means if you go backwards in the tree, 998 00:59:59,200 --> 01:00:02,420 that means a word named Maxwell is in this. 999 01:00:02,420 --> 01:00:04,870 But the word foo, for instance, 1000 01:00:04,870 --> 01:00:07,970 is not in the tree, because if I start at the root node up here at top, 1001 01:00:07,970 --> 01:00:14,030 There's no f pointer, no o pointer, no o pointer. Foo is not a name in this dictionary. 1002 01:00:14,030 --> 01:00:22,460 But by contrast, turing, t-u-r-i-n-g. Again, I didn't store t or u or r or i or n or g. 1003 01:00:22,460 --> 01:00:29,820 But I did store in this data structure a value of true way down here in this node--in the tree 1004 01:00:29,820 --> 01:00:33,030 by setting this boolean value of is_word to true. 1005 01:00:33,030 --> 01:00:35,740 So a trie is kind of this very interesting meta structure, 1006 01:00:35,740 --> 01:00:39,810 where you're not really storing the words themselves for this kind of dictionary. 1007 01:00:39,810 --> 01:00:45,100 To be clear, you're just storing yes or no, there is a word that ends here. 1008 01:00:45,100 --> 01:00:46,430 >> Now what's the implication? 1009 01:00:46,430 --> 01:00:51,120 If you have 150,000 words in a dictionary that you're trying to store in memory 1010 01:00:51,120 --> 01:00:53,400 using something like a linked list, 1011 01:00:53,400 --> 01:00:56,870 you are going to have 150,000 nodes in your linked list. 1012 01:00:56,870 --> 01:01:00,250 And finding one of those words alphabetically could take O(n) time. 1013 01:01:00,250 --> 01:01:04,370 Linear time. But in the case here of a trie, 1014 01:01:04,370 --> 01:01:09,210 what's the running time of finding a word? 1015 01:01:09,210 --> 01:01:17,390 It turns out the beauty here is that even if you have 149,999 words already in this dictionary, 1016 01:01:17,390 --> 01:01:20,170 as implemented with this data structure, 1017 01:01:20,170 --> 01:01:25,560 how much time does it take to find or insert one more person into that, like Alice, A-l-i-c-e? 1018 01:01:25,560 --> 01:01:30,640 Well, it's only 5, maybe 6 steps for the trailing character. 1019 01:01:30,640 --> 01:01:32,880 Because the presense of other names in the structure 1020 01:01:32,880 --> 01:01:35,340 does not get in the way of inserting Alice. 1021 01:01:35,340 --> 01:01:39,640 Moreover, finding Alice once there are 150,000 words in this dictionary 1022 01:01:39,640 --> 01:01:41,960 does not get in your way of finding Alice at all, 1023 01:01:41,960 --> 01:01:46,880 because Alice is . . . . . here, because I found a boolean value. 1024 01:01:46,880 --> 01:01:50,920 And if there is no boolean true, then Alice is not in this data structure of words. 1025 01:01:50,920 --> 01:01:56,220 In other words, the running time of finding things and inserting things into this new 1026 01:01:56,220 --> 01:02:01,920 data structure of trie is O of--it's not n. 1027 01:02:01,920 --> 01:02:05,730 Because the presense of 150,000 people has no effect on Alice, it seems. 1028 01:02:05,730 --> 01:02:11,560 So let's call it k, where k is the maximum length of a word in English 1029 01:02:11,560 --> 01:02:14,050 which is typically no more than 20-something characters. 1030 01:02:14,050 --> 01:02:17,940 So k is a constant. So the Holy Grail we seem to have found now 1031 01:02:17,940 --> 01:02:26,000 is that of a trie, constant time for inserts, for lookups, for deletions. 1032 01:02:26,000 --> 01:02:29,170 Because the number of things already in the structure, 1033 01:02:29,170 --> 01:02:32,600 which aren't even physically there. Again, they're just sort of checked off, yes or no, 1034 01:02:32,600 --> 01:02:35,050 has no impact on its future running time. 1035 01:02:35,050 --> 01:02:37,940 >> But there's got to be a catch, otherwise we wouldn't have wasted so much time 1036 01:02:37,940 --> 01:02:41,460 on all these other data structures just to finally get to the secret one that's amazing. 1037 01:02:41,460 --> 01:02:46,410 So what price are we paying to achieve this greatness here? Space. 1038 01:02:46,410 --> 01:02:49,010 This thing is massive. And the reason that the author 1039 01:02:49,010 --> 01:02:52,400 didn't present it here, notice that all of these things that look like arrays, 1040 01:02:52,400 --> 01:02:55,400 he didn't draw the rest of the tree, the rest of the trie, 1041 01:02:55,400 --> 01:02:58,060 because they're just not relevant to the story. 1042 01:02:58,060 --> 01:03:01,870 But all of these nodes are super wide, and every node in the tree takes up 1043 01:03:01,870 --> 01:03:07,780 26 or actually, could be 27 characters because in this case I was including space for the apostrophe 1044 01:03:07,780 --> 01:03:09,980 so that we could have apostrophized words. 1045 01:03:09,980 --> 01:03:14,450 In this case, these are wide arrays. So even though they're not picutured, 1046 01:03:14,450 --> 01:03:18,190 this takes up a massive amount of RAM. 1047 01:03:18,190 --> 01:03:20,670 Which might be fine, especilly in modern hardware, 1048 01:03:20,670 --> 01:03:25,650 but that's the tradeoff. We get less time by spending more space. 1049 01:03:25,650 --> 01:03:28,580 So where is this all going? 1050 01:03:28,580 --> 01:03:32,640 Well, let's do--let's see here. 1051 01:03:32,640 --> 01:03:39,510 Let's do a jump to this guy here. 1052 01:03:39,510 --> 01:03:43,450 >> Believe it or not, as much fun as C has been for some time now, 1053 01:03:43,450 --> 01:03:48,130 we're reaching the point in the semester where it's time to transition to things more modern. 1054 01:03:48,130 --> 01:03:50,950 Things on a higher level. And even though for the next couple of weeks 1055 01:03:50,950 --> 01:03:54,580 we'll still continue to immerse ourselves in the world of pointers and memory management 1056 01:03:54,580 --> 01:03:57,210 to get that comfort with which we can then build on, 1057 01:03:57,210 --> 01:04:01,270 the end game is ultimately to introduce,ironically, not this language. 1058 01:04:01,270 --> 01:04:03,330 We'll spend, like 10 minutes talking about HTML. 1059 01:04:03,330 --> 01:04:05,950 All HTML is is a markup language, and what a markup language is 1060 01:04:05,950 --> 01:04:10,220 is these series of open brackets and closed brackets that say 'make this bold' 1061 01:04:10,220 --> 01:04:12,000 'make this italics' 'make this centered.' 1062 01:04:12,000 --> 01:04:14,250 It's not all that intellectually interesting, but it's super useful. 1063 01:04:14,250 --> 01:04:16,650 And it's certainly omnipresent these days. 1064 01:04:16,650 --> 01:04:19,450 But what's powerful about the world of HTML, and web programming more generally, 1065 01:04:19,450 --> 01:04:25,910 is building dynamic things; writing code in languages like PHP or Python or Ruby or Java or C#. 1066 01:04:25,910 --> 01:04:30,360 Really, whatever your language of choice is, and generating HTML dynamically. 1067 01:04:30,360 --> 01:04:32,960 Generating something called CSS dynamically. 1068 01:04:32,960 --> 01:04:35,810 Cascading style sheets, which is also about aesthetics. 1069 01:04:35,810 --> 01:04:41,360 And so even though, today, if I go to some website like the familiar Google.com, 1070 01:04:41,360 --> 01:04:46,100 and I go to view, developer, view source, which maybe you've done before, 1071 01:04:46,100 --> 01:04:49,800 but going to view source, this stuff probably looks pretty cryptic. 1072 01:04:49,800 --> 01:04:55,320 But this is the underlying code that implements Google.com. 1073 01:04:55,320 --> 01:04:57,940 On the front end. And actually all this is fluffy aesthetics stuff. 1074 01:04:57,940 --> 01:05:01,740 This is CSS up here. If I keep scrolling down we'll get some color-coded stuff. 1075 01:05:01,740 --> 01:05:06,890 This is HTML. Google's code looks like a mess, but if I actually open up a different window, 1076 01:05:06,890 --> 01:05:09,380 we can see some structure to this. 1077 01:05:09,380 --> 01:05:12,640 If I open this up, notice here, it's a little more readable. 1078 01:05:12,640 --> 01:05:16,850 We're going to see before long this tag, [word] is a tag, 1079 01:05:16,850 --> 01:05:23,520 HTML, head, body, div, script, text area, span, centered, div. 1080 01:05:23,520 --> 01:05:26,770 And this is also sort of cryptic-looking at first glance, 1081 01:05:26,770 --> 01:05:30,890 but all of this mess follows certain patterns, and repeatable patterns, 1082 01:05:30,890 --> 01:05:33,850 so that once we get the basics down, you'll be able to write code like this 1083 01:05:33,850 --> 01:05:37,550 and then manipulate code like this using yet another language, called JavaScript. 1084 01:05:37,550 --> 01:05:40,440 And JavaScript is a language that runs inside of a browser 1085 01:05:40,440 --> 01:05:44,380 today that we use on Harvard courses, for the course shopping tool that Google maps uses 1086 01:05:44,380 --> 01:05:48,660 to give you a whole bunch of dynamism, Facebook gives you to show instant status updates, 1087 01:05:48,660 --> 01:05:51,430 Twitter uses it to show you tweets instantly. 1088 01:05:51,430 --> 01:05:53,820 All of this we will begin to immerse ourselves in. 1089 01:05:53,820 --> 01:05:57,190 But to get there, we need to understand a little something about the Internet. 1090 01:05:57,190 --> 01:06:01,130 This clip here is just a minute long, and let's assume for now this is, in fact, 1091 01:06:01,130 --> 01:06:08,380 how the Internet works as a teaser for what's about to come. I give you "Warriors of the Net." 1092 01:06:08,380 --> 01:06:14,720 >> [♫Slow chorus music♫] 1093 01:06:14,720 --> 01:06:20,450 [Male narrator] He came with a message. 1094 01:06:20,450 --> 01:06:23,770 With a protocol all his own. 1095 01:06:23,770 --> 01:06:37,270 [♫Faster electronic music♫] 1096 01:06:37,270 --> 01:06:41,330 He came to a world of cool firewalls, uncaring routers, 1097 01:06:41,330 --> 01:06:45,690 and dangers far worse than death. 1098 01:06:45,690 --> 01:06:55,400 He's fast. He's strong. He's TCP/IP, and he's got your address. 1099 01:06:55,400 --> 01:06:59,250 Warriors of the Net. 1100 01:06:59,250 --> 01:07:05,290 [Malan] Next week, then. The Internet. Web programming. This is CS50. 1101 01:07:05,290 --> 01:07:08,290 [CS50.TV]