1 00:00:09,156 --> 00:00:09,536 >> Alright. 2 00:00:10,226 --> 00:00:12,486 Welcome back to CS50. 3 00:00:12,486 --> 00:00:14,866 This is the end of week seven. 4 00:00:14,866 --> 00:00:16,236 So, there's one handout today. 5 00:00:16,236 --> 00:00:17,546 It might feel a little early, 6 00:00:17,546 --> 00:00:19,026 though granted the calendar has changed. 7 00:00:19,026 --> 00:00:20,426 So, it's going to be a little earlier. 8 00:00:20,686 --> 00:00:22,746 But I wanted to draw your attention to one 9 00:00:22,746 --> 00:00:25,256 of the URLs that's listed on this handout. 10 00:00:25,256 --> 00:00:28,066 So, CS50's final project is, as we say in the syllabus, 11 00:00:28,126 --> 00:00:29,386 the climax of this course. 12 00:00:29,386 --> 00:00:31,386 It's going to be the time where you guys get 13 00:00:31,386 --> 00:00:34,366 to choose what you do, how you do it, 14 00:00:34,366 --> 00:00:36,216 and what you ultimately make of the course. 15 00:00:36,216 --> 00:00:38,506 And these photos that we've been showing are 16 00:00:38,506 --> 00:00:41,046 of this thing called the CS50 fair, which we actually started 17 00:00:41,046 --> 00:00:42,376 for the very first time last year. 18 00:00:42,676 --> 00:00:44,896 And, it was actually for us, a whole lot of fun. 19 00:00:44,946 --> 00:00:47,566 We had, again, all three hundred plus students last year. 20 00:00:47,566 --> 00:00:50,736 We had six hundred plus of their friends and their family, 21 00:00:50,736 --> 00:00:53,196 and random folks on campus, join us in one 22 00:00:53,196 --> 00:00:54,836 of the science buildings down the road. 23 00:00:54,956 --> 00:00:57,346 And, it was just an exhibition, and an opportunity to hang out, 24 00:00:57,346 --> 00:01:00,366 chat, and to see on each other's laptops exactly what you pulled 25 00:01:00,366 --> 00:01:01,256 off this semester. 26 00:01:01,256 --> 00:01:02,326 Now you might be feeling 27 00:01:02,326 --> 00:01:04,636 that thus far we've been handing you most 28 00:01:04,636 --> 00:01:05,676 of what you're capable of. 29 00:01:05,766 --> 00:01:08,866 So, we gave you a framework for p set three and for p set four 30 00:01:08,866 --> 00:01:10,506 and a little bit for p set five. 31 00:01:10,836 --> 00:01:12,396 So, we're going to have to come full circle 32 00:01:12,396 --> 00:01:14,056 and take those training wheels back off 33 00:01:14,126 --> 00:01:15,196 by the end of the semester. 34 00:01:15,546 --> 00:01:17,816 And, you will be able to do that, 35 00:01:17,816 --> 00:01:20,266 because in the last two p sets of the course seven and eight, 36 00:01:20,526 --> 00:01:22,826 these will be web-focused which caters in part 37 00:01:22,826 --> 00:01:25,076 to the interest among students in recent years 38 00:01:25,346 --> 00:01:26,646 to do web-based projects. 39 00:01:26,966 --> 00:01:28,766 What odds are you guys are going to want to go off 40 00:01:28,806 --> 00:01:31,176 on your own quite possibly and do even more 41 00:01:31,176 --> 00:01:33,006 than the course itself officially covers. 42 00:01:33,006 --> 00:01:35,996 And, so the teaching fellows, CAs and some alumni 43 00:01:35,996 --> 00:01:37,756 and graduate students have kindly offered 44 00:01:37,756 --> 00:01:40,126 to put together this suite of seminars. 45 00:01:40,126 --> 00:01:41,786 So, these seminars are optional. 46 00:01:42,186 --> 00:01:45,226 They will be scheduled in early November based on the level 47 00:01:45,226 --> 00:01:47,296 of interest and on your actual schedules. 48 00:01:47,296 --> 00:01:48,216 Some more on that to come. 49 00:01:48,216 --> 00:01:49,766 We ask right now that you RSVP for any 50 00:01:49,836 --> 00:01:51,116 of these things of interest. 51 00:01:51,116 --> 00:01:52,076 There's eleven on the board. 52 00:01:52,406 --> 00:01:55,076 Just to skim a couple, I-Phone development has already 53 00:01:55,076 --> 00:01:55,826 proved popular. 54 00:01:56,276 --> 00:01:59,456 Though you probably won't make in just a couple 55 00:01:59,456 --> 00:02:02,546 of weeks the most amazing I-Phone application, 56 00:02:02,546 --> 00:02:05,486 you can absolutely do something that excites you. 57 00:02:05,486 --> 00:02:08,586 You can certainly build on it if you choose in say Spring term, 58 00:02:08,586 --> 00:02:10,686 and a seminar like this will help get you started. 59 00:02:10,906 --> 00:02:12,716 Those of you interested in making facebook apps 60 00:02:13,016 --> 00:02:14,446 that your friends might actually use, 61 00:02:14,446 --> 00:02:16,256 we'll do a seminar on that as well. 62 00:02:16,256 --> 00:02:18,356 And then a whole bunch of other topics, 63 00:02:18,356 --> 00:02:21,486 another is this Android cell phone platform. 64 00:02:21,486 --> 00:02:23,576 And let me make explicit mention on this, which we did 65 00:02:23,876 --> 00:02:27,536 on Monday already, Google very kindly donated some cell phones 66 00:02:27,536 --> 00:02:27,996 to the course. 67 00:02:28,376 --> 00:02:31,446 We are happy to give these to those students 68 00:02:31,446 --> 00:02:34,136 who would be interested in tackling Android projects, 69 00:02:34,166 --> 00:02:37,476 but do go to the appropriate link here and find out who 70 00:02:37,476 --> 00:02:40,916 to contact, namely me and one of these CAs if you would 71 00:02:40,916 --> 00:02:42,916 like to tackle an Android-specific project. 72 00:02:43,156 --> 00:02:44,746 And we will attach a couple of conditions 73 00:02:44,746 --> 00:02:47,436 like you really should have T-Mobile already, or AT&T, 74 00:02:47,436 --> 00:02:48,826 because again these are GSM phones. 75 00:02:49,116 --> 00:02:51,256 But, if you're interested express an interest now 76 00:02:51,426 --> 00:02:53,546 as per those directions and we'll get you started. 77 00:02:53,576 --> 00:02:55,576 There's a whole diversity of projects 78 00:02:55,576 --> 00:02:56,536 that students did last year. 79 00:02:56,536 --> 00:02:59,196 If you get a chance at some point go to the fair's page 80 00:02:59,196 --> 00:03:00,376 where we have a bunch of videos. 81 00:03:00,376 --> 00:03:02,446 And, I see some of the TFs have started enhancing their 82 00:03:02,446 --> 00:03:03,316 privacy settings. 83 00:03:03,346 --> 00:03:05,866 So, some of these videos are no longer accessible. 84 00:03:05,866 --> 00:03:07,036 So, we'll fix that. 85 00:03:07,036 --> 00:03:07,786 We have archives. 86 00:03:08,576 --> 00:03:11,016 But, you'll also see at the top of this page a link 87 00:03:11,016 --> 00:03:12,396 to last year's program. 88 00:03:12,396 --> 00:03:14,116 So, it's rather inspiring. 89 00:03:14,116 --> 00:03:16,356 If not maybe admittedly a little intimidating 90 00:03:16,356 --> 00:03:17,426 at this point in the course. 91 00:03:17,486 --> 00:03:18,246 That's the wrong link. 92 00:03:19,446 --> 00:03:24,146 This program here, this is the official program that we handed 93 00:03:24,146 --> 00:03:25,516 out to attendees last year. 94 00:03:25,666 --> 00:03:27,676 And, some of the specifics now you shouldn't care about, 95 00:03:27,786 --> 00:03:29,116 but if you skim this, 96 00:03:29,276 --> 00:03:31,246 for instance these charts you'll see what kinds 97 00:03:31,246 --> 00:03:33,656 of languages students last year actually used. 98 00:03:33,656 --> 00:03:36,416 So, yeah, the course teaches C, the course teaches a bit 99 00:03:36,416 --> 00:03:37,326 of PHP and JavaScript. 100 00:03:37,576 --> 00:03:40,856 But, your predecessors bit off a whole bunch of other stuff, 101 00:03:40,856 --> 00:03:42,536 some of which they came into the course with, 102 00:03:42,536 --> 00:03:44,236 some of which they absolutely did not. 103 00:03:44,546 --> 00:03:47,956 And, you can actually see titles and descriptions for all 104 00:03:47,956 --> 00:03:49,276 of last year's projects here, 105 00:03:49,276 --> 00:03:51,816 and you probably will see some familiar friends' names. 106 00:03:51,816 --> 00:03:52,526 So, do check that out. 107 00:03:52,566 --> 00:03:53,726 It's on the fair page. 108 00:03:53,906 --> 00:03:56,756 And, do RSVP per the directions on the seminar's page 109 00:03:57,056 --> 00:03:58,826 as soon as you get a chance. 110 00:04:00,176 --> 00:04:04,856 Let's see, any questions about that? 111 00:04:04,856 --> 00:04:07,536 OK, so even though p set five isn't yet due, 112 00:04:07,536 --> 00:04:10,516 it's been striking how many surveys have already been 113 00:04:10,516 --> 00:04:11,906 submitted for problem set five. 114 00:04:11,906 --> 00:04:14,716 I dare say it's a lot more fun and procrastination-worthy 115 00:04:14,716 --> 00:04:17,226 to click little bubbles and fill out circles and stuff online 116 00:04:17,226 --> 00:04:19,166 than to actually do the p sets. 117 00:04:19,166 --> 00:04:21,496 We got over, like, two hundred plus surveys 118 00:04:21,496 --> 00:04:24,736 that have already been submitted in like five problem sets 119 00:04:24,786 --> 00:04:25,786 that actually came in already. 120 00:04:25,786 --> 00:04:26,356 But, that's OK. 121 00:04:26,356 --> 00:04:28,416 But, it meant I was able to start reading 122 00:04:28,416 --> 00:04:30,656 through these things, because all the better I think for me 123 00:04:30,866 --> 00:04:33,326 if we start absorbing some of your feedback, good and bad, 124 00:04:33,326 --> 00:04:34,726 and see if we can't make the best 125 00:04:34,726 --> 00:04:36,146 of the remaining weeks in the course. 126 00:04:36,506 --> 00:04:39,536 We will, I will personally make sure to read all 127 00:04:39,536 --> 00:04:40,876 of these surveys that still come in. 128 00:04:40,936 --> 00:04:43,956 So, don't think that those gates, that ship has sailed, 129 00:04:44,016 --> 00:04:45,256 but I thought I would excerpt some 130 00:04:45,256 --> 00:04:46,986 of the more serious comments that were posted 131 00:04:46,986 --> 00:04:50,316 and then also some of the more memorable ones. 132 00:04:50,836 --> 00:04:53,696 One student, one classmate noted in answer to the question, 133 00:04:54,016 --> 00:04:57,386 what have you learned in the class? 134 00:04:57,386 --> 00:04:59,696 Well, we always enjoy things like this. 135 00:05:00,056 --> 00:05:00,656 So, that's good. 136 00:05:00,886 --> 00:05:02,666 But this one was clever, because we know the course is 137 00:05:02,666 --> 00:05:05,626 challenging, especially for a non-trivial percentage 138 00:05:05,626 --> 00:05:07,306 of the class that's not programmed before. 139 00:05:07,386 --> 00:05:12,836 This was just cute I thought. 140 00:05:12,836 --> 00:05:12,956 [ silence ] 141 00:05:12,956 --> 00:05:13,076 [ laughter ] 142 00:05:13,076 --> 00:05:15,896 But what's sad is you understand this kind of humor now, 143 00:05:15,896 --> 00:05:17,466 so I'm not sure that's such a good thing. 144 00:05:17,746 --> 00:05:21,026 But, let me address a couple of things in seriousness. 145 00:05:21,026 --> 00:05:22,166 And, these are recurring themes. 146 00:05:22,216 --> 00:05:24,936 So, I was actually thinking unrelated to the surveys 147 00:05:24,936 --> 00:05:26,986 on Monday that it's actually pretty hard teaching 148 00:05:26,986 --> 00:05:29,016 in this space because it's very cavernous. 149 00:05:29,136 --> 00:05:31,656 Even I can hear my voice echoing in this room. 150 00:05:31,656 --> 00:05:33,096 And, it's hard I think as a teacher 151 00:05:33,096 --> 00:05:34,466 to actually reach out to you guys. 152 00:05:34,516 --> 00:05:35,806 I can kind of sense the audience, 153 00:05:35,806 --> 00:05:37,736 and maybe you feel as best I can. 154 00:05:37,736 --> 00:05:40,156 Like, you are in CS50 and this is just an awkward one 155 00:05:40,156 --> 00:05:42,666 on one conversation right now that you'll always remember now. 156 00:05:43,186 --> 00:05:45,276 But, it's hard in this room especially. 157 00:05:45,276 --> 00:05:47,636 And some classes for years that have been in here resort 158 00:05:47,636 --> 00:05:50,246 to this Oprah Winfrey-style approach of having mics 159 00:05:50,246 --> 00:05:52,816 in the audience, so you guys can stand up and ask your questions. 160 00:05:52,846 --> 00:05:54,886 I can't imagine anything more awkward. 161 00:05:54,886 --> 00:05:58,196 And, frankly we'd get all of the gov majors in the room, perhaps, 162 00:05:58,196 --> 00:05:59,436 asking questions instead. 163 00:05:59,476 --> 00:06:01,896 I remember, I was one of the at one point. 164 00:06:02,026 --> 00:06:04,566 So, realize that I've actually been feeling this frustration 165 00:06:04,566 --> 00:06:06,966 myself where it's hard to kind of connect in this room. 166 00:06:07,326 --> 00:06:09,636 I very often feel like those Simpson moments 167 00:06:09,636 --> 00:06:11,706 where Skinner is standing up on stage. 168 00:06:11,706 --> 00:06:14,596 He says something stupid and the whole middle school is silent, 169 00:06:14,956 --> 00:06:18,016 except for that awkward [cough] in the background 170 00:06:18,016 --> 00:06:19,896 that makes you really just how awkward it is. 171 00:06:20,216 --> 00:06:22,906 So, I hope you guys will raise your hands more often. 172 00:06:22,906 --> 00:06:25,096 Or, and I'll address this with some specific comments, 173 00:06:25,396 --> 00:06:27,106 raise your hands not when you have answers, 174 00:06:27,106 --> 00:06:28,116 but when you have questions. 175 00:06:28,346 --> 00:06:30,826 And, I'll admit it's hard for me to hear things up here. 176 00:06:31,176 --> 00:06:32,706 And, I realize it's a little awkward 177 00:06:32,706 --> 00:06:34,806 to say I don't understand something, 178 00:06:34,806 --> 00:06:37,286 because you literally have to speak loudly so I hear it. 179 00:06:37,686 --> 00:06:39,916 But, I'm hoping that we can address some of these comments 180 00:06:39,966 --> 00:06:42,476 by just making this even more interactive. 181 00:06:42,476 --> 00:06:43,616 And so that you don't feel 182 00:06:43,616 --> 00:06:46,826 like your exiting lecture having completely missed something, 183 00:06:46,826 --> 00:06:49,596 because there's no point in my plowing ahead 184 00:06:49,726 --> 00:06:50,916 or teaching something 185 00:06:50,916 --> 00:06:53,226 that I think I'm teaching well when clearly I'm not. 186 00:06:53,416 --> 00:06:56,906 So, please help me fix that by raising hands 187 00:06:56,906 --> 00:06:58,916 as often as you are inclined. 188 00:06:59,036 --> 00:07:01,996 If your hand is one of those going up too often don't worry, 189 00:07:01,996 --> 00:07:04,786 I'll start to focus more on these students then say those, 190 00:07:04,786 --> 00:07:05,606 or something like that. 191 00:07:05,846 --> 00:07:08,076 But the workload in the course, so one of four themes 192 00:07:08,116 --> 00:07:10,446 that has already come up in the surveys and I'm sure will come 193 00:07:10,446 --> 00:07:12,186 up in the remaining couple of days. 194 00:07:12,326 --> 00:07:13,436 So, this course is a lot of work. 195 00:07:14,146 --> 00:07:15,896 With that said, this is all relative 196 00:07:15,976 --> 00:07:17,996 because we often hear this frankly from freshman 197 00:07:17,996 --> 00:07:21,466 and sophomores for whom this is a lot of work. 198 00:07:21,466 --> 00:07:24,296 And, I think perspectives do change over the course of time. 199 00:07:24,626 --> 00:07:25,856 So, keep that in mind. 200 00:07:26,196 --> 00:07:28,846 But, this is very much intentional, and as I said 201 00:07:28,846 --> 00:07:31,606 in the first week of the course, I mean this really is kind 202 00:07:31,606 --> 00:07:33,346 of one of those courses, one of those fields, 203 00:07:33,346 --> 00:07:35,526 where you really only learn by doing. 204 00:07:35,526 --> 00:07:38,006 And, if we were handing you problem sets that were half 205 00:07:38,006 --> 00:07:40,856 as long, or quarter as long, I mean I dare say 206 00:07:40,856 --> 00:07:42,896 that you would get half as much out of the course 207 00:07:42,896 --> 00:07:44,586 as experienced, or a quarter as much. 208 00:07:44,856 --> 00:07:47,236 And, we like to think that this course by contrast is, yes, 209 00:07:47,356 --> 00:07:50,186 one that absolutely expects a lot of you during the course 210 00:07:50,186 --> 00:07:53,276 of the semester, and for some really squeezes your brain, 211 00:07:53,486 --> 00:07:56,116 but whether it's going to be in January of 2010 212 00:07:56,116 --> 00:07:59,226 that you look back and feel, damn, that was for the best. 213 00:07:59,556 --> 00:08:01,506 Or, even several years from now when you look back 214 00:08:01,506 --> 00:08:03,106 on this course and maybe others that are kind 215 00:08:03,166 --> 00:08:05,066 of kicking your ass, so you may feel right now, 216 00:08:05,596 --> 00:08:10,306 I think that's frankly why we, why you, are all kind of here. 217 00:08:10,306 --> 00:08:12,856 I know I don't want to sound all too lofty. 218 00:08:12,896 --> 00:08:14,836 I realize that all of us have way too much going on, right? 219 00:08:14,836 --> 00:08:16,626 To Fall Harvard students, myself included, 220 00:08:16,626 --> 00:08:19,136 you'd probably bite off way too many extracurricular. 221 00:08:19,196 --> 00:08:20,406 Yes, you have three other classes. 222 00:08:20,806 --> 00:08:23,186 So, realize if nothing else we appreciate 223 00:08:23,186 --> 00:08:23,956 where you're coming from, 224 00:08:23,996 --> 00:08:26,496 but this is why there's sixty staff members for the course, 225 00:08:26,496 --> 00:08:28,326 and why there's a bulletin board, and office hours, 226 00:08:28,326 --> 00:08:31,066 and all of this other stuff because we certainly are going 227 00:08:31,066 --> 00:08:32,846 to get you through the experience. 228 00:08:33,006 --> 00:08:36,046 So, realize we hear you, but please do 229 00:08:36,166 --> 00:08:38,256 at least meet us halfway with those resources. 230 00:08:38,486 --> 00:08:39,306 OK, grades too. 231 00:08:39,306 --> 00:08:39,786 Quick on this. 232 00:08:40,136 --> 00:08:41,126 So, being Harvard students too, 233 00:08:41,126 --> 00:08:44,136 grades are of predominate interest in many courses. 234 00:08:44,136 --> 00:08:48,356 It took me years, until graduate school, to kind of calm down. 235 00:08:48,536 --> 00:08:50,506 I was absolutely one of those kids in high school, 236 00:08:50,506 --> 00:08:52,636 and even through college that, like, work came first. 237 00:08:52,636 --> 00:08:54,596 And frankly I didn't even like school all that much. 238 00:08:54,596 --> 00:08:57,376 I just was kind of conditioned as a kid to work hard, 239 00:08:57,416 --> 00:08:59,636 do my work, you know, get my As. 240 00:08:59,796 --> 00:09:02,626 And, ironically it probably helped me get into grad school, 241 00:09:02,626 --> 00:09:04,196 and get me where I am today. 242 00:09:04,196 --> 00:09:06,056 But, it wasn't until grad school, having worked 243 00:09:06,056 --> 00:09:09,576 for three years after college, that I realized there's kind 244 00:09:09,576 --> 00:09:11,966 of more important things than worrying about grades. 245 00:09:12,116 --> 00:09:14,416 So, not doing well, but worrying about grades. 246 00:09:14,956 --> 00:09:17,726 And, I say this because this course does take a somewhat 247 00:09:17,726 --> 00:09:19,756 different approach from a lot of the science courses, 248 00:09:19,756 --> 00:09:22,116 math courses where it pretty much is right or wrong; 249 00:09:22,116 --> 00:09:24,416 check or uncheck; plus or minus, 250 00:09:24,616 --> 00:09:26,676 where you can get a very qualitative response. 251 00:09:26,716 --> 00:09:29,866 I know some of you per your own feedback have been feeling 252 00:09:29,866 --> 00:09:32,596 frustrations, and this happens every semester certainly 253 00:09:32,596 --> 00:09:34,906 that I've been teaching the course where you feel, you know, 254 00:09:34,906 --> 00:09:37,356 hey I have undergrads being my teaching fellows. 255 00:09:37,536 --> 00:09:40,186 I'm getting only, getting very subjective, 256 00:09:40,186 --> 00:09:42,286 I'm getting very desperate feedback from all of them. 257 00:09:42,286 --> 00:09:43,366 And, I know I've said many times 258 00:09:43,366 --> 00:09:45,116 that we will take all of this into account. 259 00:09:45,356 --> 00:09:47,486 And you can talk to past students about just how many 260 00:09:47,486 --> 00:09:48,826 of them were disappointed or not 261 00:09:49,156 --> 00:09:51,106 with their actual grades for some data points. 262 00:09:51,436 --> 00:09:53,766 But realize we view the course much more like, 263 00:09:54,506 --> 00:09:56,576 so far as grading goes a humanities course. 264 00:09:56,806 --> 00:09:59,376 Where if you're trying to write an essay it's really hard 265 00:09:59,376 --> 00:10:02,116 to say this is 95 percent or 75 percent. 266 00:10:02,116 --> 00:10:03,626 We could do this with problem sets, 267 00:10:03,686 --> 00:10:06,126 like check or plus or minus. 268 00:10:06,126 --> 00:10:07,826 You got this correct in this metric right, 269 00:10:08,226 --> 00:10:12,386 but you hopefully realize now, not eventually, that programming 270 00:10:12,386 --> 00:10:16,186 and solving problems is not just about yes or no, it's right. 271 00:10:16,186 --> 00:10:17,206 It's about how good it is. 272 00:10:17,206 --> 00:10:19,556 How fast your code is, how intelligent your code is. 273 00:10:19,856 --> 00:10:21,516 And, honestly these are only things 274 00:10:21,516 --> 00:10:24,676 that can be evaluated subjectively and qualitatively 275 00:10:24,676 --> 00:10:27,316 with handwritten or typed feedback and less with numbers. 276 00:10:27,316 --> 00:10:29,166 So, realize that's very much a design decision. 277 00:10:29,376 --> 00:10:32,116 I am quite certain that we will never convince some of you 278 00:10:32,566 --> 00:10:36,806 in the course of just this ideology when it comes 279 00:10:36,976 --> 00:10:40,406 to grading, but do realize and perhaps consult past students 280 00:10:40,476 --> 00:10:42,906 that it does tend I think to work out for the best. 281 00:10:43,406 --> 00:10:45,546 And, please speak with me personally if you are concerned 282 00:10:46,596 --> 00:10:47,986 about any specific scores. 283 00:10:48,396 --> 00:10:52,146 So, with that said my availability, no one ever comes. 284 00:10:52,706 --> 00:10:55,936 So, this is linked on our course's homepage, my own URL. 285 00:10:55,936 --> 00:10:58,086 So, these, OK. 286 00:10:58,146 --> 00:10:59,256 That's maybe why no one comes. 287 00:10:59,256 --> 00:10:59,606 [ laughter ] 288 00:10:59,606 --> 00:11:03,746 Alright. So, these are my office hours. 289 00:11:03,746 --> 00:11:04,196 [ laughter ] 290 00:11:04,196 --> 00:11:05,056 So, please come. 291 00:11:05,056 --> 00:11:07,256 If you feel that I'm a little too aloof, 292 00:11:07,256 --> 00:11:10,866 or just this big voice on stage, or God-forbid, like, worse, 293 00:11:10,866 --> 00:11:13,076 scary or intimidating just please drop by. 294 00:11:13,076 --> 00:11:15,126 Drop me a note because frankly I tend not to be there. 295 00:11:15,406 --> 00:11:17,466 But, I will be there if you want to be there. 296 00:11:17,866 --> 00:11:19,156 I essentially block off these 297 00:11:19,156 --> 00:11:20,466 and other times to sit down and chat. 298 00:11:20,646 --> 00:11:21,366 So, reach out. 299 00:11:21,416 --> 00:11:26,586 There's no reason I should not be accessible. 300 00:11:27,416 --> 00:11:27,926 Interesting. 301 00:11:28,216 --> 00:11:30,896 Alright. And, finally, so, now I will go back 302 00:11:30,956 --> 00:11:32,786 to a couple of actual comments. 303 00:11:33,366 --> 00:11:38,686 I'm not the only one hearing that right? 304 00:11:38,686 --> 00:11:39,476 [ laughter ] 305 00:11:39,476 --> 00:11:40,226 Interesting. 306 00:11:40,226 --> 00:11:40,586 [ laughter ] 307 00:11:40,586 --> 00:11:47,386 Alright. Maybe it's this live wire. 308 00:11:49,416 --> 00:11:53,976 The I-Phone has fixed it. 309 00:11:54,116 --> 00:11:57,006 Alright. So, these were some comments that were directed 310 00:11:57,006 --> 00:11:58,196 about lectures and content, 311 00:11:58,236 --> 00:12:01,986 because one theme that's very common in this course 312 00:12:01,986 --> 00:12:03,426 and frankly in all the courses I've taught 313 00:12:03,426 --> 00:12:07,556 for ten years is the rapidity of my voice, or the speed of class. 314 00:12:07,816 --> 00:12:09,936 And, I don't necessarily think that moving fast 315 00:12:10,736 --> 00:12:13,746 or relatively fast is a bad thing because the flip side 316 00:12:13,746 --> 00:12:16,616 of course is that course can risk going too slowly, 317 00:12:16,616 --> 00:12:19,216 and this too is one of those things where there is no chance 318 00:12:19,256 --> 00:12:21,296 that we're going to please everyone in this course. 319 00:12:21,576 --> 00:12:24,796 Much like, it will be very difficult 320 00:12:24,796 --> 00:12:26,736 to ever get 100% consensus that, yeah, 321 00:12:26,846 --> 00:12:28,306 the course is going perfectly. 322 00:12:28,306 --> 00:12:32,436 So, we do our best frankly to displease this half of the class 323 00:12:32,436 --> 00:12:34,046 and displease this half of the class, 324 00:12:34,046 --> 00:12:35,726 and so everything kind of averages out. 325 00:12:35,726 --> 00:12:37,136 I mean that's kind of the way of these things. 326 00:12:37,136 --> 00:12:38,406 So, we try to walk this fine line 327 00:12:38,756 --> 00:12:42,206 between going too fast and going too slow. 328 00:12:42,206 --> 00:12:44,126 Because the downside of course is if we devolve 329 00:12:44,126 --> 00:12:46,426 into something too slow then, you know, 330 00:12:46,426 --> 00:12:48,446 then there's other people who don't want to be here. 331 00:12:48,746 --> 00:12:51,556 But, this is the bigger question in my mind, so we now compete, 332 00:12:51,556 --> 00:12:55,136 or people like me compete with video tapes, like, of ourselves. 333 00:12:55,136 --> 00:12:57,446 And, I thought many times over the past few years, like, 334 00:12:57,446 --> 00:12:59,546 what really is the point of lectures, 335 00:12:59,546 --> 00:13:01,396 because I would dare say that some of you, 336 00:13:01,396 --> 00:13:04,436 many of you are here just because this is what you do 337 00:13:04,436 --> 00:13:07,246 in school, you go to class whether or not you feel 338 00:13:07,246 --> 00:13:08,886 that you're getting something out of lecture. 339 00:13:08,886 --> 00:13:11,316 And, those of you who are watching this at home 340 00:13:11,316 --> 00:13:13,746 on video you've made a conscious choice 341 00:13:13,746 --> 00:13:17,706 that this is maybe not a good time of day 342 00:13:17,706 --> 00:13:19,426 to be awake, or a good experience. 343 00:13:19,426 --> 00:13:21,426 And, so this is a hard question, right? 344 00:13:21,426 --> 00:13:22,646 Because what is the point of lecture 345 00:13:22,646 --> 00:13:24,716 if you can watch this thing on video, 346 00:13:24,896 --> 00:13:27,306 if we have scribe notes detailing what happened, 347 00:13:27,306 --> 00:13:29,916 if we have slides that predict what would happen, 348 00:13:29,916 --> 00:13:31,326 if we have code of everything, right? 349 00:13:31,326 --> 00:13:33,766 Even I ask myself this, what is the point of all 350 00:13:33,766 --> 00:13:34,786 of us being here in this room? 351 00:13:35,106 --> 00:13:37,866 And, so the answers I've come up with over time are, one, 352 00:13:38,106 --> 00:13:39,846 it's got to be engaging, ideally. 353 00:13:39,846 --> 00:13:41,076 Right? Otherwise who cares? 354 00:13:41,076 --> 00:13:43,506 Why come? Even I walk out of lectures all the time 355 00:13:43,506 --> 00:13:45,576 if it doesn't maintain my interest levels these days 356 00:13:45,576 --> 00:13:48,236 because I have many, many other things more interesting 357 00:13:48,516 --> 00:13:50,986 or more challenging, stimulating that I would like to do. 358 00:13:51,256 --> 00:13:53,566 So, on the one hand my goal whether it's with silly things 359 00:13:53,566 --> 00:13:56,546 like candy, or with fun demonstrations with humans 360 00:13:56,546 --> 00:13:58,816 on stage is to actually make this an experience, 361 00:13:58,996 --> 00:14:01,556 and hopefully that's not something that you get solely 362 00:14:01,556 --> 00:14:03,076 by watching things on camera. 363 00:14:03,366 --> 00:14:05,906 But, two, the goal pedagogically is to set a framework, 364 00:14:06,126 --> 00:14:09,336 a mental model each week so that when you do go into sections 365 00:14:09,336 --> 00:14:11,396 and you those ah-ha moments with your TS, 366 00:14:11,826 --> 00:14:13,406 oh that's how you implement this. 367 00:14:13,406 --> 00:14:14,686 Or, that's how it works. 368 00:14:14,936 --> 00:14:16,916 That at least you've been prepped with sort 369 00:14:16,916 --> 00:14:19,616 of a big picture understanding of what is data structure, 370 00:14:19,616 --> 00:14:21,346 or what are link lists? 371 00:14:21,616 --> 00:14:23,526 You know, in general, what problems do they solve 372 00:14:23,726 --> 00:14:24,926 and then it's when you actually get 373 00:14:24,926 --> 00:14:26,946 into these more intimate environments with your peers 374 00:14:26,946 --> 00:14:29,286 that you can really, you know, sink your teeth 375 00:14:29,286 --> 00:14:30,546 into something more concrete. 376 00:14:30,546 --> 00:14:33,276 So, there's a line too that I try to walk carefully 377 00:14:33,276 --> 00:14:36,056 between dwelling too much on code and syntax, 378 00:14:36,056 --> 00:14:37,676 which I know can get boring quickly 379 00:14:37,986 --> 00:14:39,326 versus not doing enough of that. 380 00:14:39,326 --> 00:14:40,866 So, that you walk out of here thinking, 381 00:14:41,176 --> 00:14:42,486 what the heck just went on? 382 00:14:42,486 --> 00:14:45,426 And so these are the kind of comments that give me pause 383 00:14:45,426 --> 00:14:47,676 and definitely make me, you know, worry. 384 00:14:47,676 --> 00:14:48,906 And granted there's other comments. 385 00:14:48,906 --> 00:14:51,196 There's good comments that balance these things out, 386 00:14:51,196 --> 00:14:51,986 hopefully in the whole. 387 00:14:52,266 --> 00:14:55,256 But, like, there's one thank you that I'm nice. 388 00:14:55,256 --> 00:14:55,323 [ laughter ] 389 00:14:55,323 --> 00:14:58,036 But, two, something like this. 390 00:14:58,076 --> 00:14:59,356 Like, we can address, right? 391 00:14:59,356 --> 00:15:02,036 Because hands can go up and I can pause more frequently. 392 00:15:02,106 --> 00:15:03,956 So, realize that I'm happy to try 393 00:15:03,956 --> 00:15:05,746 to address these kinds of comments. 394 00:15:05,926 --> 00:15:08,806 You know, this too is worrisome, because this too we should fix. 395 00:15:09,066 --> 00:15:11,566 Because if you're not getting, like what the hell is the point 396 00:15:11,566 --> 00:15:12,866 of even being here, honestly? 397 00:15:12,916 --> 00:15:13,756 I don't want to be here 398 00:15:13,756 --> 00:15:14,906 if you're getting nothing out of this. 399 00:15:14,906 --> 00:15:15,796 We don't take attendance. 400 00:15:15,796 --> 00:15:16,786 We don't care if you're here. 401 00:15:16,786 --> 00:15:18,416 We want you to be here. 402 00:15:18,726 --> 00:15:23,476 And this, frankly, maybe this person was more spokesperson 403 00:15:23,476 --> 00:15:26,656 for others, I mean these are the three comments particularly 404 00:15:26,656 --> 00:15:29,746 that I'd like to address starting now, as best we can. 405 00:15:29,966 --> 00:15:31,726 So, whether the lectures have been good, or sucked, 406 00:15:31,836 --> 00:15:33,886 I'd like them to be better moving forward. 407 00:15:34,026 --> 00:15:39,146 So, please at least help meet me by trying to raise hands, 408 00:15:39,146 --> 00:15:41,706 engage, stop me if I'm ever doing something stupid 409 00:15:41,706 --> 00:15:43,586 like big white screen, certainly, 410 00:15:43,636 --> 00:15:45,656 but also if it's just something that's gone over your head. 411 00:15:45,956 --> 00:15:47,976 Now, let me counterbalance the seriousness, 412 00:15:48,336 --> 00:15:50,086 the serious tone, with this. 413 00:15:50,086 --> 00:15:57,806 I'll offer this with no comment. 414 00:15:57,806 --> 00:15:57,873 [ laughter ] 415 00:15:57,873 --> 00:15:57,940 So, 416 00:15:57,940 --> 00:15:58,096 [ laughter ] 417 00:15:58,096 --> 00:16:04,306 So, you're welcome whoever that was. 418 00:16:04,306 --> 00:16:05,166 Happy birthday. 419 00:16:05,426 --> 00:16:09,096 And, this I know, my God, it was like a six page or so survey. 420 00:16:09,096 --> 00:16:11,016 This one was just a cute note to end on. 421 00:16:11,016 --> 00:16:13,466 So, I'll end our discussion of that here. 422 00:16:13,506 --> 00:16:16,126 So, let me turn our attention 423 00:16:16,546 --> 00:16:19,076 to where we left off in terms of content. 424 00:16:19,406 --> 00:16:22,906 And try to do this, try to give you a sense of why are we here? 425 00:16:22,906 --> 00:16:23,716 Where have we been? 426 00:16:23,716 --> 00:16:24,586 Where have we been going? 427 00:16:24,586 --> 00:16:25,996 So, in my mind in the course, 428 00:16:26,326 --> 00:16:28,926 and I worry that sometimes the syllabus doesn't make this quite 429 00:16:28,926 --> 00:16:30,326 clear is we started the course 430 00:16:30,326 --> 00:16:33,186 with just sitting some basic mental fundamentals, 431 00:16:33,346 --> 00:16:34,306 like what's a loop? 432 00:16:34,586 --> 00:16:35,336 What's a condition? 433 00:16:35,336 --> 00:16:36,136 And, some of you knew that. 434 00:16:36,166 --> 00:16:37,976 But, we had to do it because there's a lot of students 435 00:16:37,976 --> 00:16:39,196 in the class that didn't have that. 436 00:16:39,196 --> 00:16:40,766 And, that's why we use something like Scratch. 437 00:16:41,056 --> 00:16:44,056 And, then weeks one and two were kind of syntax oriented. 438 00:16:44,126 --> 00:16:45,456 What's a four loop in C? 439 00:16:45,456 --> 00:16:46,336 What's a while loop? 440 00:16:46,336 --> 00:16:47,316 How do you compile? 441 00:16:47,316 --> 00:16:48,606 How do you debug? 442 00:16:48,706 --> 00:16:50,726 Stuff that again, is kind of interesting. 443 00:16:50,726 --> 00:16:53,146 It's kind of geeky, but too, we have to get that out of the way. 444 00:16:53,146 --> 00:16:56,706 And, hopefully we wrapped those kinds of, that kind of material 445 00:16:56,706 --> 00:16:59,506 in some domain-specific packaging. 446 00:16:59,506 --> 00:17:02,206 Like the crypto, which frankly makes four loops just a lot more 447 00:17:02,206 --> 00:17:03,636 interesting, or a lot more compelling 448 00:17:03,876 --> 00:17:06,616 than what a problem set that says implement the four loop 449 00:17:06,616 --> 00:17:09,186 that prints one through one hundred, which kind of gets 450 00:17:09,186 --> 00:17:12,146 to the same core material, but certainly not interesting. 451 00:17:12,396 --> 00:17:16,766 Then in week four we introduced, we started talking more 452 00:17:16,766 --> 00:17:18,936 about arrays, and memory management. 453 00:17:18,936 --> 00:17:19,826 So, now things are getting 454 00:17:19,826 --> 00:17:22,066 at least technically more interesting, 455 00:17:22,066 --> 00:17:23,676 and certainly more difficult. 456 00:17:23,676 --> 00:17:24,576 And we started talking 457 00:17:24,576 --> 00:17:27,626 about data structures very minimalist data structures. 458 00:17:27,626 --> 00:17:28,336 We had arrays. 459 00:17:28,626 --> 00:17:30,466 We the C notion of a struct, 460 00:17:30,466 --> 00:17:33,636 where you can multiple variables inside of some new data type. 461 00:17:33,896 --> 00:17:35,946 So, we had the basics of data structures. 462 00:17:36,166 --> 00:17:39,376 Then in week five and now in week seven we take things 463 00:17:39,376 --> 00:17:41,206 up a notch further and talk 464 00:17:41,206 --> 00:17:43,476 about more sophisticated data structures, 465 00:17:43,476 --> 00:17:45,226 because you can't really get through life, 466 00:17:45,516 --> 00:17:48,166 you can't really get through problem-solving in CS, 467 00:17:48,166 --> 00:17:50,246 or in the sciences, or in any realm of life 468 00:17:50,246 --> 00:17:53,756 where coding is useful without having more sophisticated tools 469 00:17:53,756 --> 00:17:54,556 in your toolkit. 470 00:17:54,726 --> 00:17:56,516 So we introduced a couple of weeks ago linked lists, 471 00:17:56,936 --> 00:17:59,556 which allow you to have data structures and storage, 472 00:17:59,716 --> 00:18:01,276 but that grows dynamically. 473 00:18:01,276 --> 00:18:03,026 But, we had this linearity problem. 474 00:18:03,026 --> 00:18:06,346 It's still pretty slow to search something that just has arrow 475 00:18:06,346 --> 00:18:08,696 after arrow, after arrow, because you can only go 476 00:18:08,696 --> 00:18:11,706 from left to right, or maybe right to left 477 00:18:11,916 --> 00:18:12,806 if it's doubly linked. 478 00:18:12,806 --> 00:18:14,206 But, you have this linear problem. 479 00:18:14,466 --> 00:18:17,156 So, this week, on Monday and today the goal is 480 00:18:17,156 --> 00:18:20,306 to introduce some more clever solutions to the problem 481 00:18:20,306 --> 00:18:22,666 of storing data and getting at it quickly. 482 00:18:22,666 --> 00:18:26,386 And, the ideal, the holy grail of data structures is one 483 00:18:26,386 --> 00:18:28,516 that grows infinitely I would say, right? 484 00:18:28,516 --> 00:18:31,076 There's no fixed arbitrary upper bounds like you have 485 00:18:31,106 --> 00:18:33,566 with arrays, but one where you're search time, 486 00:18:33,566 --> 00:18:35,156 your searching time, your deletion, 487 00:18:35,156 --> 00:18:38,306 all of those basic operations are blindly fast. 488 00:18:38,696 --> 00:18:39,886 Constant time even. 489 00:18:40,236 --> 00:18:43,956 And, so that's when we introduce this idea of a hash table. 490 00:18:44,116 --> 00:18:47,116 So, a hash table is kind of the Swiss army knife 491 00:18:47,416 --> 00:18:49,456 of data structures, because it's supposed 492 00:18:49,456 --> 00:18:52,926 to address ideally precisely that problem. 493 00:18:52,926 --> 00:18:55,176 And, we're going to see hash table, this topic today, 494 00:18:55,446 --> 00:18:58,456 again in the weeks to come when we look at web programming. 495 00:18:58,696 --> 00:19:00,416 The goal of the end of the course realize is 496 00:19:00,416 --> 00:19:02,996 to one make you realize we did not learn just C. We learned 497 00:19:02,996 --> 00:19:06,246 ideas that transcend this particular language. 498 00:19:06,246 --> 00:19:07,856 And, we'll look at PHP and JavaScript, 499 00:19:07,916 --> 00:19:09,466 both of which use hash tables. 500 00:19:09,496 --> 00:19:11,556 They call them something different, but they have them. 501 00:19:12,026 --> 00:19:14,816 And, in the remaining weeks of the course where we really focus 502 00:19:14,816 --> 00:19:17,616 on real world problems, problems frankly 503 00:19:17,616 --> 00:19:20,656 that you might be very inclined to encounter in your own lives, 504 00:19:20,656 --> 00:19:25,096 whether it's for personal fun or start-ups, or for data sets 505 00:19:25,096 --> 00:19:27,926 in other classes you need to process, or just stupid stuff 506 00:19:27,926 --> 00:19:30,236 like you run a student group and you've got a really long list 507 00:19:30,236 --> 00:19:32,066 of people who've registered and you really don't want 508 00:19:32,066 --> 00:19:35,176 to manually email a hundred people, or manually, you know, 509 00:19:35,256 --> 00:19:37,066 aggregate this data into some other format. 510 00:19:37,066 --> 00:19:38,416 You can write a script to do that. 511 00:19:38,586 --> 00:19:41,356 So, were going to try to use these basic ideas from weeks one 512 00:19:41,356 --> 00:19:44,296 through eight and conclude the course with a look 513 00:19:44,296 --> 00:19:45,506 at these things that are more real world. 514 00:19:45,506 --> 00:19:48,766 And, the final project is meant to embrace that finale 515 00:19:49,046 --> 00:19:50,556 of real world applications. 516 00:19:50,626 --> 00:19:52,206 So, hash tables. 517 00:19:52,446 --> 00:19:53,696 So, what is the point here? 518 00:19:53,696 --> 00:19:56,306 So it would be really nice if we had a data structure 519 00:19:56,556 --> 00:19:59,076 that didn't have problems like arrays, 520 00:19:59,236 --> 00:20:00,716 which are a fixed length, right? 521 00:20:00,716 --> 00:20:02,786 They kind of suck, because it meant if we ever ran 522 00:20:02,786 --> 00:20:05,146 out of space what did we have to do? 523 00:20:05,146 --> 00:20:05,556 >> [inaudible] 524 00:20:05,556 --> 00:20:06,786 >> So, we had to reallocated. 525 00:20:06,786 --> 00:20:09,386 And reallocation, even though you haven't really felt it given 526 00:20:09,386 --> 00:20:12,416 the size of the programs we've been writing, is kind of slow. 527 00:20:12,416 --> 00:20:15,346 Anytime you have to interact with the computer itself, 528 00:20:15,346 --> 00:20:18,556 or the operating system and say give me more memory relative 529 00:20:18,556 --> 00:20:20,596 to other things, that's slow. 530 00:20:20,596 --> 00:20:24,526 So, it's ideally in principle to avoid having to allocate memory 531 00:20:24,526 --> 00:20:26,996 and deallocate memory, because you're just wasting time 532 00:20:27,216 --> 00:20:29,036 by not having had sufficient foresight. 533 00:20:29,036 --> 00:20:31,966 But, of course the flip side is you can say to the OS, well, 534 00:20:31,966 --> 00:20:34,026 give me an array that's two gigabytes large. 535 00:20:34,026 --> 00:20:35,106 I'm never going to need more than that. 536 00:20:35,556 --> 00:20:37,396 But, the downside there is you can only run, like, 537 00:20:37,466 --> 00:20:38,846 one program at a time. 538 00:20:39,106 --> 00:20:41,666 So, again, the theme in computer science, 539 00:20:41,666 --> 00:20:44,446 or in programming specifically is these trade-offs. 540 00:20:44,446 --> 00:20:45,546 Nothing comes for free. 541 00:20:45,756 --> 00:20:48,166 You pay for something with time, or with space, 542 00:20:48,166 --> 00:20:50,876 or with your own time, the developer's time, 543 00:20:50,926 --> 00:20:53,736 or just with your own, you know, mental anguish. 544 00:20:53,736 --> 00:20:55,476 There are things that are harder to implement, 545 00:20:55,476 --> 00:20:58,976 but my God when they work, e.g. Google, they work really well. 546 00:20:58,976 --> 00:21:01,436 And, that's not something they implemented on a Thursday night. 547 00:21:01,556 --> 00:21:02,906 So, again, it's all tradeoffs, 548 00:21:02,906 --> 00:21:05,996 whether it's machine related or human related. 549 00:21:06,196 --> 00:21:09,316 So, a hash table is meant to address that problem of arrays. 550 00:21:09,576 --> 00:21:10,466 They're fix length. 551 00:21:10,676 --> 00:21:13,096 It's expensive to reallocate and deallocate them. 552 00:21:13,096 --> 00:21:15,576 It's just stupid to have to copy your data around just 553 00:21:15,576 --> 00:21:18,596 to grow the space, but that's why we introduced link lists, 554 00:21:19,006 --> 00:21:19,166 right? 555 00:21:19,166 --> 00:21:21,036 Link lists are these dynamic structures. 556 00:21:21,616 --> 00:21:22,326 So, even though I kind 557 00:21:22,326 --> 00:21:23,806 of answered my own question a bit ago, 558 00:21:24,036 --> 00:21:26,066 what's a problem with link lists? 559 00:21:26,936 --> 00:21:29,316 With every solution we introduce a new problem it seems. 560 00:21:30,026 --> 00:21:31,366 What's a downside of a link list? 561 00:21:31,366 --> 00:21:31,446 >> [inaudible] 562 00:21:31,446 --> 00:21:35,416 >> So, it takes a long time to search it. 563 00:21:35,416 --> 00:21:38,036 If I want to search for something, well, 564 00:21:38,106 --> 00:21:39,176 and let's contextualize this. 565 00:21:39,176 --> 00:21:40,356 What might I put in a linked list? 566 00:21:40,756 --> 00:21:42,876 Anything. We put numbers because that's simple, 567 00:21:42,876 --> 00:21:44,566 but you can imagine implementing a database 568 00:21:44,566 --> 00:21:45,996 of students for the registrar. 569 00:21:46,196 --> 00:21:48,936 Well, each of those link list nodes contains a student. 570 00:21:49,196 --> 00:21:50,216 What's a student? 571 00:21:50,216 --> 00:21:52,366 ID number, phone number, grades, 572 00:21:52,366 --> 00:21:53,586 I mean all of this kind of stuff. 573 00:21:53,696 --> 00:21:54,976 So, there's these objects in memory, 574 00:21:54,976 --> 00:21:57,766 but searching them requires what, how much time 575 00:21:57,766 --> 00:22:00,266 to find Joe Smith in a linked list 576 00:22:00,376 --> 00:22:02,726 of 65,000 Harvard undergraduates? 577 00:22:03,086 --> 00:22:07,036 What's the running time of search on a linked list? 578 00:22:08,046 --> 00:22:09,106 So big ON steps, right? 579 00:22:09,106 --> 00:22:10,646 Because in the worst case Smith is going 580 00:22:10,646 --> 00:22:11,836 to be the end of the list. 581 00:22:12,146 --> 00:22:14,436 Or even if they're alphabetized he's certainly not going 582 00:22:14,436 --> 00:22:15,276 to be right at the middle. 583 00:22:15,476 --> 00:22:18,056 So, worst case an arbitrary student is going 584 00:22:18,056 --> 00:22:19,456 to require big ON steps. 585 00:22:19,456 --> 00:22:20,446 Well, wait a minute. 586 00:22:20,446 --> 00:22:21,786 We solved this in, like, week two. 587 00:22:21,956 --> 00:22:23,316 Why don't we just use binary search 588 00:22:24,026 --> 00:22:26,566 on linked lists if they're sorted? 589 00:22:26,566 --> 00:22:30,006 Go ahead. 590 00:22:30,006 --> 00:22:31,276 >> [inaudible] 591 00:22:31,276 --> 00:22:33,526 >> OK, so they are sorted, but there is another problem. 592 00:22:34,236 --> 00:22:34,886 Yeah? 593 00:22:34,886 --> 00:22:36,066 >> [inaudible] 594 00:22:36,066 --> 00:22:37,896 >> Yeah, you can't access the middle of it. 595 00:22:37,896 --> 00:22:39,446 With an array it was very easy. 596 00:22:39,446 --> 00:22:41,226 Know the end, which is bracket zero. 597 00:22:41,426 --> 00:22:43,316 Know this end, which is bracket zero; 598 00:22:43,316 --> 00:22:44,956 this end, which is minus one. 599 00:22:45,146 --> 00:22:48,136 You can figure out the middle by doing N over 2, and bam. 600 00:22:48,186 --> 00:22:49,036 You can go right there. 601 00:22:49,036 --> 00:22:50,506 So, arrays are random access. 602 00:22:50,806 --> 00:22:53,216 What happens if you take the address of this node 603 00:22:53,216 --> 00:22:56,246 in a linked list and subtract the address of this node 604 00:22:56,246 --> 00:22:57,946 in the linked list and then divide by two? 605 00:22:57,946 --> 00:22:59,006 Where do you end up? 606 00:22:59,956 --> 00:23:02,936 So, numerically here right in the middle conceptually, 607 00:23:03,196 --> 00:23:04,696 but you have no idea where any 608 00:23:04,696 --> 00:23:07,066 of those linked list nodes might be, because the whole point 609 00:23:07,066 --> 00:23:08,746 of allocating them with malock is 610 00:23:08,746 --> 00:23:10,366 that you're just handed a chunk of memory. 611 00:23:10,406 --> 00:23:13,276 But it can come from anywhere in the computer's ram. 612 00:23:13,276 --> 00:23:15,766 It's not necessarily going to be contiguous. 613 00:23:16,146 --> 00:23:18,416 So, again, we get this upside of dynamism. 614 00:23:18,416 --> 00:23:21,136 Now we don't have to worry about painting ourselves into a corner 615 00:23:21,136 --> 00:23:24,406 in terms of space, but we now have this problem 616 00:23:24,406 --> 00:23:26,066 that we really hurt our search time. 617 00:23:26,066 --> 00:23:27,686 And linear search kind of sucks, 618 00:23:27,686 --> 00:23:30,956 especially when your data sets starts to get larger and larger. 619 00:23:31,196 --> 00:23:33,726 But, my God if there's this data structure that's been promised 620 00:23:33,976 --> 00:23:37,246 that gives us constant time whereby if I have a new student 621 00:23:37,246 --> 00:23:38,986 to insert, and let's see. 622 00:23:38,986 --> 00:23:41,336 I'm going to need, I won't use students here. 623 00:23:41,336 --> 00:23:44,016 We'll use lollypop flavors to represent distinct elements. 624 00:23:45,076 --> 00:23:50,336 So, what I really want is a data structure where if I, 625 00:23:50,656 --> 00:23:52,116 some new student matriculates, 626 00:23:52,116 --> 00:23:53,056 this is not going to work very well. 627 00:23:53,056 --> 00:23:56,296 I really wanted a data structure where the lollypops would sit 628 00:23:56,446 --> 00:23:57,886 in the top, but it didn't work out. 629 00:23:58,286 --> 00:24:00,866 But we want essentially a black box. 630 00:24:00,866 --> 00:24:02,736 And, this is a common theme in computer science, 631 00:24:02,736 --> 00:24:03,966 if you've not heard the term before. 632 00:24:04,146 --> 00:24:06,866 Something that does something takes input, produces output. 633 00:24:07,156 --> 00:24:08,676 You don't really care how it works. 634 00:24:08,676 --> 00:24:10,406 You just care how well it works. 635 00:24:10,456 --> 00:24:12,196 Or, how quickly it works. 636 00:24:12,426 --> 00:24:15,606 So my goal is to insert Joe Smith into this black box, 637 00:24:15,726 --> 00:24:18,396 because later I want to be able to ask question of the form, 638 00:24:18,616 --> 00:24:20,396 is Joe Smith registered at Harvard? 639 00:24:20,396 --> 00:24:23,416 Or more specifically give me Joe Smith's student records 640 00:24:23,416 --> 00:24:25,886 so I can look up his phone number or his email address. 641 00:24:25,886 --> 00:24:27,746 So, you want to be able to insert things 642 00:24:27,786 --> 00:24:29,306 into the data structure ideally 643 00:24:29,306 --> 00:24:31,156 in constant time; get in and get out. 644 00:24:31,256 --> 00:24:34,216 And then if you want that same element back you want to be able 645 00:24:34,216 --> 00:24:37,036 to ask that question just as quickly; is Joe Smith in here? 646 00:24:37,036 --> 00:24:38,666 If so, give him back to me. 647 00:24:39,016 --> 00:24:42,326 So this constant time is kind of, again, the holy grail, 648 00:24:42,366 --> 00:24:45,956 but how do you even begin to implement something like that? 649 00:24:46,626 --> 00:24:48,486 Well, we already teased you with an answer 650 00:24:48,486 --> 00:24:49,896 to this question on Monday, right? 651 00:24:50,166 --> 00:24:51,466 We used an array. 652 00:24:51,716 --> 00:24:54,106 We used a little picture that looked like this. 653 00:24:55,056 --> 00:24:57,216 Ideally, what's the ideal, 654 00:24:57,216 --> 00:24:59,156 so suppose we have this data structure. 655 00:24:59,256 --> 00:25:00,826 Let's use an array, because it's so simple. 656 00:25:00,826 --> 00:25:01,746 I know how to declare it. 657 00:25:01,746 --> 00:25:02,636 There's no fanciness. 658 00:25:02,636 --> 00:25:03,556 We did this weeks ago. 659 00:25:04,086 --> 00:25:06,276 But I decide there's only going to be twenty six students. 660 00:25:06,546 --> 00:25:08,876 Well, the ideal student body would be 661 00:25:08,876 --> 00:25:11,856 that Harvard admits each year only one person with a last name 662 00:25:11,856 --> 00:25:14,246 that starts with A. And one student who's name starts 663 00:25:14,246 --> 00:25:17,976 with B, and C. Because then we can contrive what's known 664 00:25:17,976 --> 00:25:20,036 as the ideal hash function. 665 00:25:20,466 --> 00:25:23,316 We can implement a function literally in code that takes 666 00:25:23,316 --> 00:25:26,256 as input the name of the student and then returns 667 00:25:26,256 --> 00:25:28,776 as output a number, which is what bucket 668 00:25:28,816 --> 00:25:30,106 that student should go in. 669 00:25:30,606 --> 00:25:31,746 Now what do we mean by this? 670 00:25:31,746 --> 00:25:33,736 Well, we can actually whip something like this up. 671 00:25:33,736 --> 00:25:35,526 So, I'm just going to open a terminal window here. 672 00:25:35,526 --> 00:25:36,986 Let me increase my font size. 673 00:25:36,986 --> 00:25:39,786 I'm going to call this hash.c And, 674 00:25:39,786 --> 00:25:40,736 I just need to write a quick 675 00:25:40,736 --> 00:25:42,916 and dirty function here that's going to, 676 00:25:42,916 --> 00:25:43,886 let's see we're turning int. 677 00:25:44,266 --> 00:25:47,146 I'm going to call it hash, or h, or whatever you want to call it. 678 00:25:47,146 --> 00:25:49,046 It's going to take as input a student, 679 00:25:49,046 --> 00:25:49,956 or really just their name. 680 00:25:49,956 --> 00:25:53,666 So, I'll do a char*s to represent a string there. 681 00:25:53,856 --> 00:25:56,266 And the goal of this hash function is just 682 00:25:56,266 --> 00:25:58,706 to answer a question that the black box can use. 683 00:25:58,936 --> 00:26:00,706 So, the black box is a piece of storage, 684 00:26:00,956 --> 00:26:03,406 but with it also is a hash function that tells it 685 00:26:03,636 --> 00:26:06,376 where to put its input, what location to put its input. 686 00:26:06,506 --> 00:26:08,556 So, how do we decide this? 687 00:26:08,946 --> 00:26:10,816 Well, first let's do a little sanity check, 688 00:26:10,876 --> 00:26:12,276 just to practice what we've been preaching. 689 00:26:12,276 --> 00:26:15,556 If s equals equals nill, what should I probably return? 690 00:26:15,556 --> 00:26:16,146 >> [inaudible] 691 00:26:16,146 --> 00:26:18,426 >> So, I could return one. 692 00:26:18,426 --> 00:26:19,136 Why do you say one? 693 00:26:19,136 --> 00:26:19,886 >> [inaudible] 694 00:26:19,886 --> 00:26:25,336 >> So, that it will, sorry? 695 00:26:25,416 --> 00:26:27,706 Keep running? 696 00:26:27,706 --> 00:26:27,836 >> [inaudible] 697 00:26:27,836 --> 00:26:29,426 >> OK, so I do want to return something 698 00:26:29,426 --> 00:26:31,596 so that the function won't keep executing. 699 00:26:31,866 --> 00:26:34,186 Let me push back now though and say if we return one, 700 00:26:34,186 --> 00:26:36,226 we're probably going to run into problems. 701 00:26:36,446 --> 00:26:38,056 Can you anticipate what the problem is going to be 702 00:26:38,056 --> 00:26:40,186 if the error code we're returning is itself one? 703 00:26:40,186 --> 00:26:41,586 >> [inaudible] 704 00:26:41,586 --> 00:26:41,686 >> Yeah. 705 00:26:41,991 --> 00:26:43,991 >> [inaudible] 706 00:26:44,296 --> 00:26:45,926 >> It's suppose to return int anyway, 707 00:26:46,116 --> 00:26:48,456 which means I might have just been handed null, 708 00:26:48,456 --> 00:26:51,776 so my answer is, oh, put Mr. Null in bucket number one. 709 00:26:51,926 --> 00:26:52,976 But that's the wrong answer. 710 00:26:52,976 --> 00:26:55,326 We probably want to return a so-called sential [assumed 711 00:26:55,326 --> 00:26:58,776 spelling] value that indicates to the caller, maybe it's main 712 00:26:58,776 --> 00:27:01,686 or whoever's using this hash function that problem happened. 713 00:27:01,726 --> 00:27:03,476 So, what should I maybe instead return? 714 00:27:03,476 --> 00:27:04,326 >> [inaudible] 715 00:27:04,326 --> 00:27:05,226 >> So, negative one, right? 716 00:27:05,226 --> 00:27:08,446 A very common answer honestly is anything other than the range 717 00:27:08,446 --> 00:27:09,466 of numbers we care about. 718 00:27:09,466 --> 00:27:11,886 So, let's return negative one, and just expect 719 00:27:11,886 --> 00:27:15,076 that the caller is smart enough to know he should check 720 00:27:15,076 --> 00:27:18,006 for negative one before trusting the return value 721 00:27:18,006 --> 00:27:18,616 of this function. 722 00:27:18,856 --> 00:27:19,786 And actually if we want 723 00:27:19,786 --> 00:27:22,736 to be really sophisticated here we can do what's very 724 00:27:22,736 --> 00:27:23,186 common too. 725 00:27:23,186 --> 00:27:23,536 You know what? 726 00:27:23,536 --> 00:27:27,026 Let's define a constant called error, and just define it 727 00:27:27,026 --> 00:27:28,586 to be the number negative one. 728 00:27:28,846 --> 00:27:30,636 So, now we can kind of abstract away 729 00:27:30,636 --> 00:27:33,346 and say return the error code called error, 730 00:27:33,536 --> 00:27:35,236 and now the caller can just check 731 00:27:35,336 --> 00:27:38,386 if what was returned is error, not negative one. 732 00:27:38,496 --> 00:27:40,526 So, there's conventions like this where we can clean 733 00:27:40,526 --> 00:27:42,696 up what's otherwise a very arbitrary choice 734 00:27:42,696 --> 00:27:43,496 like negative one. 735 00:27:43,756 --> 00:27:45,626 OK, so that this point I now need to figure 736 00:27:45,626 --> 00:27:47,596 out what actual number to return. 737 00:27:47,596 --> 00:27:48,906 So, I've been handed a string. 738 00:27:49,156 --> 00:27:50,756 Let's assume for simplicity today 739 00:27:50,756 --> 00:27:52,306 that it's an actual student's name, 740 00:27:52,306 --> 00:27:54,506 and that we don't have any crazy hyphenated names, 741 00:27:54,506 --> 00:27:56,536 or apostrophes, or anything like that. 742 00:27:56,536 --> 00:27:59,076 It's just ASCII alphabetical characters. 743 00:27:59,436 --> 00:28:02,286 So, I want to decide what bucket to put this input in. 744 00:28:02,526 --> 00:28:04,096 So, I just need to return a number. 745 00:28:04,096 --> 00:28:10,526 So, maybe I can do something like, let's see, int N gets, 746 00:28:10,726 --> 00:28:12,406 and then what do I want to do? 747 00:28:12,406 --> 00:28:15,036 Well, I could do cast to an int, what? 748 00:28:15,036 --> 00:28:18,716 S [0]. That will return a number, right? 749 00:28:18,716 --> 00:28:20,506 And then I can do return N? 750 00:28:20,946 --> 00:28:24,476 This will return the numeric value 751 00:28:24,506 --> 00:28:27,206 of the person's first letter in their name, 752 00:28:27,206 --> 00:28:28,156 but what's the problem here? 753 00:28:28,156 --> 00:28:28,846 >> [inaudible] 754 00:28:28,846 --> 00:28:31,726 >> So, it's the numerical ASCII value, 755 00:28:31,726 --> 00:28:34,636 which means I might get back 65 or 66, 756 00:28:34,636 --> 00:28:38,376 probably some number that's much bigger than 0-26. 757 00:28:38,376 --> 00:28:40,256 So, what's the easy fix here? 758 00:28:40,256 --> 00:28:41,536 >> [inaudible] 759 00:28:41,536 --> 00:28:43,896 >> Yeah, subtract capital A. That's what we've done 760 00:28:43,896 --> 00:28:44,306 in the past. 761 00:28:44,306 --> 00:28:46,686 If you want to be really anal you can explicitly say int, 762 00:28:46,686 --> 00:28:48,546 but again GCC is smart enough to realize, oh, 763 00:28:48,546 --> 00:28:50,996 if you're subtracting int just give me back the numeric 764 00:28:50,996 --> 00:28:51,396 value you. 765 00:28:51,656 --> 00:28:54,096 So, this would work, but another approach too might be this. 766 00:28:54,476 --> 00:28:54,926 Let's see. 767 00:28:54,926 --> 00:28:56,326 Instead of subtracting this off, 768 00:28:56,656 --> 00:28:58,856 what if I do something like mod 26? 769 00:28:59,586 --> 00:29:02,846 Would this return to me a number between 0-25 inclusive? 770 00:29:04,276 --> 00:29:05,986 OK, so this would work here too, 771 00:29:05,986 --> 00:29:08,616 and it would actually work just as well, right? 772 00:29:08,686 --> 00:29:13,276 Even though it would not put the letter A in location 0. 773 00:29:13,576 --> 00:29:16,366 It would essentially rotate the positions where things are. 774 00:29:16,676 --> 00:29:18,696 Would it put two people 775 00:29:18,696 --> 00:29:21,646 with different last names in the same bucket? 776 00:29:23,096 --> 00:29:23,226 >> Yes. 777 00:29:23,416 --> 00:29:25,316 >> Yes? OK, how? 778 00:29:25,316 --> 00:29:25,396 >> [inaudible] 779 00:29:25,396 --> 00:29:29,296 >> OK, so let's assume simplicity here capital letters, 780 00:29:29,326 --> 00:29:30,756 just alphabetical then. 781 00:29:31,016 --> 00:29:31,656 But, otherwise yes. 782 00:29:31,716 --> 00:29:33,796 That would be a corner case. 783 00:29:33,796 --> 00:29:36,126 So, no. It should work because our hash function is so simple 784 00:29:36,126 --> 00:29:37,336 and we're just looking at one letter. 785 00:29:37,336 --> 00:29:39,456 I'm going to even keep the hash function simple, 786 00:29:39,706 --> 00:29:40,746 just to be ever so clear. 787 00:29:40,926 --> 00:29:43,196 This will return a number between 0 and 25. 788 00:29:43,446 --> 00:29:47,346 So now the black box that's using this hash function 789 00:29:47,346 --> 00:29:49,266 realizes, oh, this is Joe Smith. 790 00:29:49,316 --> 00:29:52,256 This returned, or this is Adam Aardvark. 791 00:29:52,446 --> 00:29:54,616 This returns to me the number 0. 792 00:29:54,686 --> 00:29:57,416 So, I'm going to put Adam in this location here, 793 00:29:57,416 --> 00:29:59,146 because Aardvark, stupid name, I know. 794 00:29:59,146 --> 00:29:59,916 I couldn't think on the fly. 795 00:30:00,216 --> 00:30:02,526 Starts with the letter A, and that maps to 0. 796 00:30:02,526 --> 00:30:03,276 So, we put it there. 797 00:30:03,536 --> 00:30:05,786 So, then the question is now we have all these students, 798 00:30:05,786 --> 00:30:08,586 26 students in this black box, how do we answer questions 799 00:30:08,586 --> 00:30:11,286 of the form, is Adam enrolled at Harvard? 800 00:30:11,596 --> 00:30:14,186 Well, the black box has to use this same hash function. 801 00:30:14,406 --> 00:30:16,926 So, now you use the same exact function, given the input 802 00:30:16,926 --> 00:30:19,096 and see here's Adam, here's Adam Aardvark. 803 00:30:19,376 --> 00:30:22,006 What is the name, what is the location 804 00:30:22,006 --> 00:30:23,696 of the Adam in my black box? 805 00:30:24,016 --> 00:30:25,426 0. Let me now check 806 00:30:25,676 --> 00:30:27,966 and if there's actually something there, 807 00:30:27,966 --> 00:30:29,636 the string representing the name, 808 00:30:29,636 --> 00:30:32,346 the struct representing the student then I know ah-ha. 809 00:30:32,346 --> 00:30:33,486 That must be Adam. 810 00:30:34,216 --> 00:30:37,306 Unless our assumptions are no good. 811 00:30:37,406 --> 00:30:42,016 If we actually have more than 26 students what's going to happen 812 00:30:42,016 --> 00:30:45,066 when we insert the 27th and the 28th? 813 00:30:45,626 --> 00:30:46,436 Just intuitively? 814 00:30:47,006 --> 00:30:51,426 What's going to happen when we insert another person 815 00:30:51,426 --> 00:30:55,416 with a last name that starts with A? 816 00:30:55,416 --> 00:30:55,483 >> [inaudible] 817 00:30:55,483 --> 00:30:56,056 >> OK, so right. 818 00:30:56,056 --> 00:30:57,496 We can just expel Adam, right? 819 00:30:57,496 --> 00:31:00,746 And put this 27thh person in instead, but otherwise 820 00:31:00,746 --> 00:31:02,456 in more generally we get a collision. 821 00:31:02,696 --> 00:31:05,536 And, that was the point of introducing this then depiction 822 00:31:05,536 --> 00:31:06,806 of a possible solution. 823 00:31:07,046 --> 00:31:10,726 So, if we paint ourselves into a corner with an array, 824 00:31:10,726 --> 00:31:12,416 which even though it does lend itself 825 00:31:12,446 --> 00:31:14,996 to constant time insertion, notice the code. 826 00:31:15,286 --> 00:31:16,936 What is the running time of this code? 827 00:31:17,156 --> 00:31:19,996 It is constant because it takes one step to check for null, 828 00:31:20,246 --> 00:31:23,386 another step to check what the integral value N is. 829 00:31:23,386 --> 00:31:26,106 And then, OK, a third step to return that actual number. 830 00:31:26,356 --> 00:31:28,656 Three steps, but that's big O1 time. 831 00:31:28,736 --> 00:31:31,456 That's constant time, because it always takes three steps. 832 00:31:31,456 --> 00:31:35,886 And, that's so much better than log N, N squared, N itself. 833 00:31:35,886 --> 00:31:38,266 I mean that is very fast, and that is the ideal, 834 00:31:38,646 --> 00:31:40,066 but we risk this collision. 835 00:31:40,066 --> 00:31:42,396 So, there's the tradeoff, you get amazing speed, 836 00:31:42,566 --> 00:31:44,886 but you can only have 26 students. 837 00:31:44,886 --> 00:31:46,276 So, again, this theme is recurring, 838 00:31:46,276 --> 00:31:49,156 but the theme is the result of this lack of foresight. 839 00:31:49,296 --> 00:31:51,386 We have painted ourselves into a corner 840 00:31:51,616 --> 00:31:53,786 by only having a bucket that's that large. 841 00:31:54,106 --> 00:31:56,716 Well, the point of this teaser at the end of Monday was 842 00:31:56,716 --> 00:31:58,486 to ask the question, well maybe this isn't 843 00:31:58,486 --> 00:31:59,886 such a big deal, right? 844 00:31:59,936 --> 00:32:03,276 Obviously one solution to this is let's make our bucket bigger. 845 00:32:03,346 --> 00:32:05,056 Let's not have 26 spots. 846 00:32:05,356 --> 00:32:07,406 Let's have 52 spots, right? 847 00:32:07,406 --> 00:32:10,536 Because then if we have more students we have more room 848 00:32:10,536 --> 00:32:11,406 for them, right? 849 00:32:11,406 --> 00:32:12,916 And we can find more room for them. 850 00:32:12,916 --> 00:32:13,316 You know what? 851 00:32:13,316 --> 00:32:15,666 26, let's really minimize the probability 852 00:32:15,666 --> 00:32:17,016 of something ever going wrong, 853 00:32:17,226 --> 00:32:19,436 even though we're a relatively small university, 854 00:32:19,646 --> 00:32:21,896 let's allocate an array for a million students. 855 00:32:22,146 --> 00:32:25,056 Because just probabilistically the idea that we'd actually run 856 00:32:25,056 --> 00:32:27,936 into collisions using some kind of algorithm is much, 857 00:32:27,936 --> 00:32:30,286 much lower by nature of how much free space we have. 858 00:32:30,606 --> 00:32:33,596 But a downside of over compensating is what obviously? 859 00:32:33,596 --> 00:32:34,946 >> [inaudible] 860 00:32:34,946 --> 00:32:36,456 >> You're just wasting memory, right? 861 00:32:36,456 --> 00:32:38,386 And it takes longer to search that memory. 862 00:32:38,386 --> 00:32:40,226 And, even though we won't dwell on this in this course, 863 00:32:40,426 --> 00:32:42,536 the more memory you're using, 864 00:32:42,736 --> 00:32:45,446 the less effective things called cache are. 865 00:32:45,446 --> 00:32:47,106 So, as a side if you're the type in the class 866 00:32:47,246 --> 00:32:49,416 that likes hardware you might know of concepts 867 00:32:49,416 --> 00:32:52,496 in your own personal computers called L1 cache, L2 cache; 868 00:32:52,496 --> 00:32:53,506 level one and level two. 869 00:32:53,786 --> 00:32:56,786 All of that relates to optimizing look ups and memory. 870 00:32:56,996 --> 00:32:58,306 Well, just the more data you have 871 00:32:58,306 --> 00:32:59,816 in memory the less likely it is 872 00:32:59,816 --> 00:33:01,566 that cache is going to benefit you. 873 00:33:01,656 --> 00:33:03,846 So, more on that in a future course if you're interested. 874 00:33:04,196 --> 00:33:05,926 So, here's a question then, 875 00:33:05,926 --> 00:33:08,496 suppose we have a room full of NCS 50 students. 876 00:33:08,746 --> 00:33:11,376 What's the probability of a collision of just birthdays. 877 00:33:11,526 --> 00:33:12,486 Let's make this more real. 878 00:33:12,486 --> 00:33:16,166 Not do first names, and let's assume a uniform distribution 879 00:33:16,296 --> 00:33:19,166 of birthdays, which itself I think statistically is not 880 00:33:19,166 --> 00:33:21,206 accurate if you read like the literature 881 00:33:21,206 --> 00:33:23,056 around like special events in history. 882 00:33:23,056 --> 00:33:25,266 Nine months later there have been lots more babies 883 00:33:25,266 --> 00:33:26,306 than in other months of the year. 884 00:33:26,306 --> 00:33:27,426 So, uniformed distribution 885 00:33:27,426 --> 00:33:29,576 of birthdays is actually not realistic. 886 00:33:29,806 --> 00:33:33,186 But, let's assume it is and ask ourselves what's the probability 887 00:33:33,456 --> 00:33:36,086 that if you just have a bucket and you need to put students 888 00:33:36,086 --> 00:33:39,126 in that bucket, and you're going to put them at random locations, 889 00:33:39,386 --> 00:33:40,826 what's the probability of collision? 890 00:33:41,176 --> 00:33:43,036 And, now what's the who cares question here? 891 00:33:43,316 --> 00:33:45,866 Well, if we have the previous scenario, 892 00:33:46,116 --> 00:33:50,306 a black box with 26 spots, maybe 52 spots, maybe 100, 893 00:33:50,306 --> 00:33:53,956 maybe a million spots, clearly we're going to have collisions 894 00:33:53,956 --> 00:33:55,716 if our hash function is as simple 895 00:33:55,716 --> 00:33:58,796 as checking the first letter of a person's last name. 896 00:33:59,146 --> 00:34:01,986 Right? The moment we have 27 students we are guaranteed 897 00:34:01,986 --> 00:34:02,796 to have a collision. 898 00:34:03,126 --> 00:34:05,926 So, that begs a more complicated, 899 00:34:06,026 --> 00:34:07,886 or more sophisticated hash function. 900 00:34:08,216 --> 00:34:12,356 This is too simplistic because it's too deterministic based 901 00:34:12,356 --> 00:34:14,446 on inputs we know we're going to have. 902 00:34:14,446 --> 00:34:16,166 We're going to have two students whose names start with A, 903 00:34:16,166 --> 00:34:21,036 and with B. So, how can we introduce something that's going 904 00:34:21,036 --> 00:34:24,426 to take in a name and put it not at the 0 location, 905 00:34:24,516 --> 00:34:28,146 but in any old location, but a location that we can reproduce, 906 00:34:28,146 --> 00:34:30,956 that we can refind by calling the hash function again. 907 00:34:31,366 --> 00:34:32,686 In other words how can we spread 908 00:34:32,686 --> 00:34:34,736 out the inputs across a big bucket? 909 00:34:35,576 --> 00:34:39,446 What's a smarter hash function than just this? 910 00:34:39,616 --> 00:34:39,796 Yep. 911 00:34:39,796 --> 00:34:40,336 >> [inaudible] 912 00:34:40,336 --> 00:34:44,126 >> OK, so look at more than just the first letter. 913 00:34:44,126 --> 00:34:47,156 Again, some of these questions even if they're starving to kind 914 00:34:47,156 --> 00:34:49,116 of heap over here, what's the problem? 915 00:34:49,116 --> 00:34:51,626 Well, just looking at the first letter is insufficient 916 00:34:51,826 --> 00:34:53,516 because we're going to have multiple students 917 00:34:53,516 --> 00:34:56,266 with the same first letter of their last name. 918 00:34:56,526 --> 00:34:57,046 But, you know what? 919 00:34:57,046 --> 00:34:59,106 We're less likely to have two students, like, 920 00:34:59,266 --> 00:35:01,796 I regret already this name, Adam Aardvark, 921 00:35:01,946 --> 00:35:02,876 because the probability 922 00:35:02,876 --> 00:35:06,176 of having two students whose last name starts with AA, 923 00:35:06,176 --> 00:35:09,486 Aardvark, is not very likely. 924 00:35:09,736 --> 00:35:13,086 So, just taking into account the second letter 925 00:35:13,086 --> 00:35:16,236 of someone's last name itself might be a good thing. 926 00:35:16,236 --> 00:35:18,146 Now, how do we take into account two letters? 927 00:35:18,426 --> 00:35:21,236 Well, this is where you have to start getting a little creative. 928 00:35:21,526 --> 00:35:23,886 Maybe I could take the value of the first letter 929 00:35:23,886 --> 00:35:26,626 and then take the value of the second letter, 930 00:35:26,626 --> 00:35:29,456 that gives me a number between 0 and what? 931 00:35:29,956 --> 00:35:32,706 >> [inaudible] 932 00:35:33,206 --> 00:35:36,226 >> And actually we should, we should technically do this. 933 00:35:36,226 --> 00:35:38,476 Let's see, let me subtract off my capital A, 934 00:35:38,476 --> 00:35:41,036 and let me subtract off my capital A, 935 00:35:41,316 --> 00:35:43,886 so that now what I have is two numbers between 0 936 00:35:43,886 --> 00:35:45,296 and 25 being added together. 937 00:35:46,016 --> 00:35:49,166 So, we're going to get a number between 0 and 50. 938 00:35:49,366 --> 00:35:50,366 So you know what this means? 939 00:35:50,426 --> 00:35:51,646 Let me go back up here. 940 00:35:51,646 --> 00:35:54,146 And, if our hash table is something that's 941 00:35:54,146 --> 00:35:55,176 storing strings. 942 00:35:55,456 --> 00:35:58,396 So, it's storing char*s, let's call this hash table. 943 00:35:58,626 --> 00:36:01,556 Previously, a moment ago, my world looked like this. 944 00:36:01,676 --> 00:36:04,246 So, that one line of code was essentially what I was 945 00:36:04,246 --> 00:36:06,046 describing in the form of this little milk crate. 946 00:36:06,046 --> 00:36:07,096 That was our hash table. 947 00:36:07,396 --> 00:36:12,786 But now if we can get values from 0 to 50, right? 948 00:36:12,786 --> 00:36:14,826 Or, yeah, 0 to 50. 949 00:36:14,966 --> 00:36:17,236 Well, I need more space for this at least. 950 00:36:17,996 --> 00:36:19,896 So, does this help me here? 951 00:36:19,896 --> 00:36:20,916 [ laughter ] 952 00:36:20,916 --> 00:36:27,246 Am I going to, yeah. 953 00:36:27,246 --> 00:36:27,366 >> [inaudible] 954 00:36:27,366 --> 00:36:28,926 >> Perfect. 955 00:36:29,196 --> 00:36:34,456 So, it's still, perfect answer, imperfect solution here, 956 00:36:34,456 --> 00:36:37,216 because now, the be clear, if we have someone's name who starts 957 00:36:37,216 --> 00:36:40,036 with Ab, that is going to give me the same sum as Ba, 958 00:36:40,036 --> 00:36:43,656 but that then begs the question how likely is that? 959 00:36:43,656 --> 00:36:46,506 Well, OK, that's pretty likely, I am not even going to try 960 00:36:46,506 --> 00:36:48,766 to contrive some stupid name that begins with Ab. 961 00:36:48,936 --> 00:36:50,746 But it's pretty reasonable to expect that exists. 962 00:36:50,916 --> 00:36:53,296 Alright, so let me just push back on you and be like alright, 963 00:36:53,746 --> 00:36:56,436 let's just decrease the probability. 964 00:36:56,436 --> 00:36:58,676 I don't know the distribution of names in this country, 965 00:36:58,676 --> 00:37:02,026 but I do know that the more and more letters I tack on, the less 966 00:37:02,026 --> 00:37:04,736 and less likely a collision is going to be. 967 00:37:04,856 --> 00:37:06,626 And in fact, this is what's very often done, 968 00:37:06,626 --> 00:37:08,346 at least with simple hash functions, 969 00:37:08,616 --> 00:37:10,106 this is starting to get messy, right? 970 00:37:10,106 --> 00:37:13,436 This is kind of a copy paste job, so maybe I am at the point 971 00:37:13,656 --> 00:37:16,856 where I should really re-writing this as this, int N gets 0; 972 00:37:16,856 --> 00:37:23,776 and then let me do for int I get 0, length get strlen of S 973 00:37:23,776 --> 00:37:26,956 And then, I is less than length. 974 00:37:27,056 --> 00:37:29,756 And then, I plus plus and do you see where I'm going with this? 975 00:37:29,756 --> 00:37:32,016 What should I add to N on each iteration probably? 976 00:37:33,336 --> 00:37:36,716 [i-a] and I can optimize this. 977 00:37:36,776 --> 00:37:41,336 I don't need to do all this subtraction of A all the time. 978 00:37:41,576 --> 00:37:43,936 But this now is already a more elegant solution, 979 00:37:43,936 --> 00:37:45,006 because how many letters 980 00:37:45,006 --> 00:37:47,236 of the person's name am I taking into account? 981 00:37:48,196 --> 00:37:48,866 So all of them. 982 00:37:48,866 --> 00:37:52,086 So already this is more clever, so now this begs the question, 983 00:37:52,086 --> 00:37:54,526 if you take, you know, all of the most popular names 984 00:37:54,526 --> 00:37:56,156 in the world, the first names, 985 00:37:56,326 --> 00:37:59,116 and you start adding their letters, their numeric values 986 00:37:59,116 --> 00:38:01,796 of their names together, are you going to get unique numbers? 987 00:38:01,826 --> 00:38:03,506 Well, no, if we spent enough time, 988 00:38:03,506 --> 00:38:06,856 we could probably find someone's name, and someone else's name 989 00:38:06,856 --> 00:38:10,006 where if you add the numbers in those names together, 990 00:38:10,196 --> 00:38:12,066 you get the same results for two different students 991 00:38:12,066 --> 00:38:13,916 with completely different names, just by chance. 992 00:38:14,386 --> 00:38:17,426 But the question then is how likely is that to happen? 993 00:38:17,936 --> 00:38:20,476 Because it's not such a problem if it happens, yeah, 994 00:38:20,476 --> 00:38:22,816 a collision would suck, but we can mitigate this 995 00:38:22,856 --> 00:38:24,866 by having those things called chains, 996 00:38:24,866 --> 00:38:27,406 we can introduce linked lists, which we will do in a moment, 997 00:38:27,746 --> 00:38:30,746 but let's see if we can't slap a number at least on randomness. 998 00:38:30,836 --> 00:38:33,986 So what you proposed, adding the second letter, the third letter, 999 00:38:33,986 --> 00:38:36,446 or maybe adding all of them together, is a good thing, 1000 00:38:36,446 --> 00:38:39,426 because it's just taking in more input, it's kind of adding, 1001 00:38:39,426 --> 00:38:41,226 in this psuedo way, some randomness. 1002 00:38:41,806 --> 00:38:43,816 But let's just simplify with a model 1003 00:38:43,816 --> 00:38:45,156 that we can wrap our minds around, 1004 00:38:45,346 --> 00:38:47,486 suppose we just use a random hash function. 1005 00:38:47,706 --> 00:38:50,046 Suppose instead, this is just a waste of time, 1006 00:38:50,046 --> 00:38:51,346 I am just conjecturing. 1007 00:38:51,346 --> 00:38:55,996 Let's just do this, return something like rand times 26, 1008 00:38:55,996 --> 00:39:01,776 and I'm going to return an int, right? 1009 00:39:01,776 --> 00:39:04,856 This returns a number from zero to one, let's multiply it by 26, 1010 00:39:05,076 --> 00:39:06,486 this will return a random number. 1011 00:39:06,486 --> 00:39:07,206 So that's pretty good. 1012 00:39:07,546 --> 00:39:13,606 And in fact, if I don't do this as 26, if I actually add 1013 00:39:13,836 --> 00:39:16,446 in some other numbers here, I can return a number that's not 1014 00:39:16,446 --> 00:39:18,886 between zero and 26, I can return a number that's 1015 00:39:18,886 --> 00:39:20,876 between zero and a million, zero and fifty, 1016 00:39:21,066 --> 00:39:22,576 however big I want to make my buckets. 1017 00:39:23,226 --> 00:39:27,386 But how likely is even a random approach to give us a collision? 1018 00:39:27,386 --> 00:39:30,886 This is one of these very sort of famous but simple problems, 1019 00:39:30,886 --> 00:39:32,616 what's the probability that two students 1020 00:39:32,616 --> 00:39:33,956 in this room have the same birthday? 1021 00:39:33,956 --> 00:39:35,666 Well, we could kill a lot of minutes 1022 00:39:35,696 --> 00:39:38,586 by actually polling everyone, say, who has January 1st? 1023 00:39:39,276 --> 00:39:41,196 Anyone? January 2nd? 1024 00:39:42,256 --> 00:39:42,886 January 3rd? 1025 00:39:43,616 --> 00:39:44,246 January 4th? 1026 00:39:44,246 --> 00:39:46,456 That would have been really cool 1027 00:39:46,456 --> 00:39:48,646 if we would have had two January 4ths, because we would be done. 1028 00:39:48,866 --> 00:39:50,396 Alright, let's cut it off there though and try 1029 00:39:50,396 --> 00:39:51,626 to model this mathematically. 1030 00:39:52,026 --> 00:39:53,366 This looks a little complicated, 1031 00:39:53,366 --> 00:39:54,496 but if you think about it, it's not. 1032 00:39:54,896 --> 00:39:56,926 What's the probability that two students 1033 00:39:56,926 --> 00:39:58,326 in this room have the same birthday? 1034 00:39:58,496 --> 00:40:00,296 Well, this is one of those statistics problems 1035 00:40:00,296 --> 00:40:02,746 where it's actually easier to answer the opposite question. 1036 00:40:02,986 --> 00:40:05,436 What's the probability that two students don't have the 1037 00:40:05,436 --> 00:40:06,156 same birthday. 1038 00:40:06,156 --> 00:40:08,806 And then you take that answer and do one minus that answer, 1039 00:40:08,806 --> 00:40:10,536 and voila`, you have the answer you care about. 1040 00:40:10,896 --> 00:40:13,146 Well, this top expression there, this probability 1041 00:40:13,146 --> 00:40:16,206 that given N students, well, what's the probability 1042 00:40:16,206 --> 00:40:17,046 that the first student 1043 00:40:17,046 --> 00:40:19,866 in the room has the same birthday as someone else? 1044 00:40:20,356 --> 00:40:23,626 Or rather, how many different birthdays can the first student 1045 00:40:23,626 --> 00:40:25,416 in the class have that no-one else has? 1046 00:40:25,596 --> 00:40:28,356 Well, you can have any of the 365 birthdays, 1047 00:40:28,356 --> 00:40:29,756 because you are not going to collide with anyone 1048 00:40:29,756 --> 00:40:30,666 if you are the first student. 1049 00:40:30,826 --> 00:40:34,806 So that's like 365 over 365, which is the number one. 1050 00:40:35,026 --> 00:40:36,846 But the second student who walks through the door. 1051 00:40:36,846 --> 00:40:40,406 That student can not have that same student's birthday. 1052 00:40:40,406 --> 00:40:45,176 So that same student can have 364 divided 365 birthdays. 1053 00:40:45,176 --> 00:40:47,956 So now the product of those two numbers is the probability 1054 00:40:47,956 --> 00:40:50,706 that those two students do not have the same birthday. 1055 00:40:51,716 --> 00:40:53,616 OK, now the third student walks into the room, 1056 00:40:53,616 --> 00:40:57,266 two birthdays have been used up, so now we have 1 over, 1057 00:40:57,456 --> 00:41:03,106 1 minus 2 over 365, the times 1 over 3 minus 365. 1058 00:41:03,506 --> 00:41:05,686 In other words, if you just compute the probabilities 1059 00:41:05,746 --> 00:41:09,136 that this student has, does not have the same birthday times the 1060 00:41:09,136 --> 00:41:11,116 next students, how many birthdays can they have, 1061 00:41:11,406 --> 00:41:12,706 the next one, the next one, the next one? 1062 00:41:12,926 --> 00:41:15,126 You get a very complicated formula that, in the end, 1063 00:41:15,126 --> 00:41:16,966 actually, it's not all that complicated, 1064 00:41:16,966 --> 00:41:18,576 but, it looks like this. 1065 00:41:18,576 --> 00:41:19,876 Which might remind you a little too much 1066 00:41:19,876 --> 00:41:21,386 of certain math classes you never loved. 1067 00:41:21,386 --> 00:41:23,546 But it's not all that complicated to derive. 1068 00:41:23,886 --> 00:41:25,386 So that begs the question, what is this? 1069 00:41:25,386 --> 00:41:27,786 Because that itself looks a little scary, well, 1070 00:41:27,786 --> 00:41:30,296 it's a lot easier for our purposes to graph it. 1071 00:41:30,296 --> 00:41:32,056 And the interesting take away is this. 1072 00:41:32,056 --> 00:41:35,826 If you have a room full of NCS50 students, the probability 1073 00:41:35,826 --> 00:41:38,366 that two of you have the same birthday, put another way, 1074 00:41:38,576 --> 00:41:41,026 the probability that two of you will hash, 1075 00:41:41,236 --> 00:41:43,836 given a random hash function, to the same location 1076 00:41:43,836 --> 00:41:46,196 in that black box, is strikingly high, 1077 00:41:46,426 --> 00:41:48,416 which is to say the probability of collisions 1078 00:41:48,416 --> 00:41:51,926 in hash tables is just by nature very high. 1079 00:41:52,346 --> 00:41:54,446 So what is this Y axis? 1080 00:41:55,146 --> 00:41:58,346 The Y axis says probability of a match, what is the probability 1081 00:41:58,346 --> 00:42:01,106 that two students or more have the same birthday? 1082 00:42:01,446 --> 00:42:04,566 The X axis here is the number of students in the room, 1083 00:42:04,816 --> 00:42:08,026 so what's kind of mind blowing is that if you just have a class 1084 00:42:08,226 --> 00:42:12,046 of forty students, you have a ninety percent chance 1085 00:42:12,206 --> 00:42:13,646 that two students at least 1086 00:42:13,646 --> 00:42:15,766 in that class have the same birthday, 1087 00:42:15,936 --> 00:42:19,626 and my God if you've got a class of 346 students like we do, 1088 00:42:19,826 --> 00:42:22,996 I mean you are kind of literally off the chart, but you've got 1089 00:42:23,066 --> 00:42:25,826 to have someone with the same birthday probabilistically, 1090 00:42:26,146 --> 00:42:28,846 even though 346 is less than 365. 1091 00:42:28,846 --> 00:42:31,236 Probability-wise you're going to have that collision. 1092 00:42:31,496 --> 00:42:32,876 So, what is the take away? 1093 00:42:32,876 --> 00:42:35,086 The whole point of the birthdays is just 1094 00:42:35,086 --> 00:42:40,936 to make concrete this idea that if we have any kind 1095 00:42:40,936 --> 00:42:43,456 of data structure that we're putting students into, 1096 00:42:43,456 --> 00:42:46,306 objects into, we're going to have collisions. 1097 00:42:46,376 --> 00:42:49,666 So, the motivation here is how do we fix this problem? 1098 00:42:49,936 --> 00:42:53,546 Well, the obvious solution is if you're running out of space 1099 00:42:53,546 --> 00:42:56,296 at a given location just start making more space. 1100 00:42:56,606 --> 00:42:59,736 Start chaining students, start chaining your outputs 1101 00:43:00,216 --> 00:43:02,076 from that initial location, 1102 00:43:02,396 --> 00:43:05,616 because now do you have any upper bounds on the number 1103 00:43:05,616 --> 00:43:09,366 of students you can fit into your data structure if instead 1104 00:43:09,366 --> 00:43:10,266 of putting the students 1105 00:43:10,316 --> 00:43:13,436 in the array itself you're simply adding the student 1106 00:43:13,436 --> 00:43:18,066 to a linked list that begins at that location in the array? 1107 00:43:18,566 --> 00:43:22,066 Do we now have this same upper limit? 1108 00:43:23,656 --> 00:43:25,096 [silence] Doesn't feel like it, right? 1109 00:43:25,096 --> 00:43:27,296 Because even though the tide of this table is fixed, 1110 00:43:27,686 --> 00:43:29,586 31 being 31 birthdays here. 1111 00:43:29,586 --> 00:43:34,016 31 days, actually what was the motivation here? 1112 00:43:34,016 --> 00:43:35,266 Yeah, it was day of the month. 1113 00:43:35,436 --> 00:43:37,666 This particular picture that we stole from the book there. 1114 00:43:38,006 --> 00:43:40,816 So, what happens when two people have the same birthday? 1115 00:43:41,006 --> 00:43:42,936 Or two people hash to the same location? 1116 00:43:43,116 --> 00:43:44,906 You can't put them physically at the same spot, 1117 00:43:44,906 --> 00:43:46,286 so where do you put the second person? 1118 00:43:46,286 --> 00:43:49,226 After them, right? 1119 00:43:49,226 --> 00:43:49,846 Or before them. 1120 00:43:49,846 --> 00:43:51,146 Whatever. Keep it sorted. 1121 00:43:51,146 --> 00:43:51,786 Keep it unsorted. 1122 00:43:51,786 --> 00:43:52,576 It doesn't really matter. 1123 00:43:52,746 --> 00:43:53,876 Just put them in that list. 1124 00:43:54,466 --> 00:43:57,186 But what now is our running time of insertion? 1125 00:43:57,686 --> 00:44:00,206 What is our running time of searching? 1126 00:44:00,776 --> 00:44:03,966 Is it constant time anymore? 1127 00:44:04,666 --> 00:44:07,066 So, it's definitely not, right? 1128 00:44:07,066 --> 00:44:08,766 As soon as you start seeing arrows and lists. 1129 00:44:08,896 --> 00:44:11,706 Like, now this data structure is devolving back into what? 1130 00:44:11,706 --> 00:44:12,206 Something, 1131 00:44:12,726 --> 00:44:12,906 >> Linear. 1132 00:44:13,716 --> 00:44:14,906 >> Something linear. 1133 00:44:14,956 --> 00:44:16,346 But, but I can push back. 1134 00:44:16,836 --> 00:44:19,446 Suppose that we have K locations. 1135 00:44:19,676 --> 00:44:23,146 There's K buckets, 31 in this case, 26 in this case. 1136 00:44:23,376 --> 00:44:26,486 That actually means that my running time isn't that big O1 1137 00:44:26,486 --> 00:44:30,346 of N students, but they can go in any of K locations. 1138 00:44:30,346 --> 00:44:36,226 Doesn't that mean my running time is big O1 of N over K? 1139 00:44:36,436 --> 00:44:37,256 [silence] Right, it's not linear. 1140 00:44:37,256 --> 00:44:40,666 If it were linear that would mean that it's big O1 of N 1141 00:44:41,006 --> 00:44:43,966 and so that means that my longest chain is going 1142 00:44:43,966 --> 00:44:45,606 to be N students long. 1143 00:44:45,606 --> 00:44:46,146 But, wait a minute. 1144 00:44:46,146 --> 00:44:49,156 I have all of these other chains I can put students into; 1145 00:44:49,216 --> 00:44:49,686 K of them. 1146 00:44:50,146 --> 00:44:50,956 So, doesn't that mean 1147 00:44:50,956 --> 00:44:53,476 in the worst case my chain is going to be of length K? 1148 00:44:53,476 --> 00:44:56,206 And therefore my big O1 running time is N over K? 1149 00:44:57,496 --> 00:44:59,586 There's a bug in my logic here. 1150 00:44:59,586 --> 00:45:00,526 Yeah? 1151 00:45:00,526 --> 00:45:01,256 >> [inaudible] 1152 00:45:01,256 --> 00:45:06,706 >> Good. Good. 1153 00:45:06,916 --> 00:45:11,126 So, if your hash function is bad, or just simply broken, 1154 00:45:11,126 --> 00:45:15,266 or simply stupid like this, what is the running time, whoops. 1155 00:45:15,266 --> 00:45:18,416 What is the running time of in fact finding a student? 1156 00:45:19,376 --> 00:45:22,326 It is in fact always N because you're putting every student 1157 00:45:22,326 --> 00:45:23,356 in the same location. 1158 00:45:23,516 --> 00:45:24,616 Now this is the extreme. 1159 00:45:24,616 --> 00:45:26,516 Odds are you're not going to be so, you know, 1160 00:45:26,516 --> 00:45:28,626 naive as to implement this as your hash function. 1161 00:45:28,756 --> 00:45:29,936 But, what if you're just unlucky? 1162 00:45:29,936 --> 00:45:32,046 Right? That's what big O1 notion is kind of about, 1163 00:45:32,226 --> 00:45:34,336 asking questions like what is the worst case? 1164 00:45:34,336 --> 00:45:36,766 Well, in the worst case you're really unlucky. 1165 00:45:37,036 --> 00:45:40,056 Everything is backwards, or all students names start 1166 00:45:40,056 --> 00:45:41,486 with the letter A. That's really bad, 1167 00:45:41,916 --> 00:45:46,316 so even if our code does say S [0 minus big A] maybe just 1168 00:45:46,316 --> 00:45:48,906 with some probability this happens, well, 1169 00:45:49,076 --> 00:45:52,886 this then begs the question what is the ideal hash function? 1170 00:45:53,086 --> 00:45:55,526 In other words it puts the burden on you, 1171 00:45:55,526 --> 00:45:57,136 the computer scientist to figure 1172 00:45:57,136 --> 00:45:59,706 out what is the best function to use here? 1173 00:45:59,876 --> 00:46:01,746 It's probably not checking the first letter. 1174 00:46:01,916 --> 00:46:03,796 It might not even be checking all 1175 00:46:03,796 --> 00:46:04,926 of the letters in the alphabet. 1176 00:46:04,926 --> 00:46:06,806 Maybe you need to do something more intelligent. 1177 00:46:07,076 --> 00:46:09,586 And the teaser here is with problem set six, 1178 00:46:09,836 --> 00:46:11,016 which we'll release on Friday. 1179 00:46:11,016 --> 00:46:13,786 This is going to be our last problem set in C. You're going 1180 00:46:13,786 --> 00:46:16,536 to have to implement the fastest spellchecker possible. 1181 00:46:16,536 --> 00:46:19,946 And by coincidence those inputs are going to be words. 1182 00:46:19,946 --> 00:46:23,076 You're going to be handed an empty file that you need to fill 1183 00:46:23,076 --> 00:46:27,166 with a function that somehow loads 140,000 inputs, 1184 00:46:27,166 --> 00:46:28,966 140,000 words into memory. 1185 00:46:29,276 --> 00:46:31,526 So, you are going to be implementing this black box. 1186 00:46:31,656 --> 00:46:32,356 How do you do that? 1187 00:46:32,726 --> 00:46:35,636 Well, every time we feed you another word, another word, 1188 00:46:35,636 --> 00:46:38,436 140,000 plus thousand times you're going to have to decide 1189 00:46:38,626 --> 00:46:40,606 where to put it inside this black box. 1190 00:46:40,606 --> 00:46:42,106 What's going to be inside the black box? 1191 00:46:42,416 --> 00:46:43,586 Well, that's going to be up to you. 1192 00:46:43,626 --> 00:46:47,376 You can use an array, an array with 140,000 locations. 1193 00:46:47,376 --> 00:46:49,646 And you can load all of those words 1194 00:46:49,646 --> 00:46:52,056 into memory pretty damn fast if you're just using an array, 1195 00:46:52,056 --> 00:46:54,636 but what is your code going to suck 1196 00:46:54,936 --> 00:46:58,396 at if you're just using an array? 1197 00:46:58,396 --> 00:46:58,606 >> [inaudible] 1198 00:46:58,606 --> 00:47:00,246 >> So, search, now granted 1199 00:47:00,246 --> 00:47:04,706 if you keep them sorted then you can do big O1 of log N 1200 00:47:04,706 --> 00:47:08,356 and that's definitely better than N and certainly N squared, 1201 00:47:08,606 --> 00:47:11,006 but again, we're trying to get to constant time. 1202 00:47:11,006 --> 00:47:13,376 And constant time is, again, that ideal. 1203 00:47:13,576 --> 00:47:16,366 So, the real cleverness in problem set six is going 1204 00:47:16,366 --> 00:47:18,526 to be what hash function should you use 1205 00:47:18,526 --> 00:47:20,216 so that you can actually put these words 1206 00:47:20,396 --> 00:47:22,216 at the smartest location possible. 1207 00:47:22,516 --> 00:47:23,646 Now as for this question, 1208 00:47:23,646 --> 00:47:25,246 what is the running time of a hash table. 1209 00:47:25,596 --> 00:47:27,676 Here is finally when we see a data structure 1210 00:47:27,836 --> 00:47:30,716 where theory no longer is necessarily consistent 1211 00:47:30,716 --> 00:47:31,356 with reality. 1212 00:47:31,716 --> 00:47:32,606 And by that I mean this. 1213 00:47:32,606 --> 00:47:36,556 Asymptotically, a hash table is, yes, big O1 of N over K. 1214 00:47:36,556 --> 00:47:39,456 But what do we always do every time we talk about big O1 1215 00:47:39,456 --> 00:47:41,936 and omega with constant numbers? 1216 00:47:43,156 --> 00:47:44,096 We kind of ignore them. 1217 00:47:44,096 --> 00:47:45,906 Right? Because we said it's not interesting. 1218 00:47:45,906 --> 00:47:48,376 Because, hardware advancements will take away 1219 00:47:48,376 --> 00:47:50,426 that constant factor within 18 months, right? 1220 00:47:50,516 --> 00:47:52,326 The algorithm itself will just get faster 1221 00:47:52,326 --> 00:47:53,426 because hardware is faster. 1222 00:47:53,696 --> 00:47:55,446 So, we always cross out constant values. 1223 00:47:55,816 --> 00:47:56,746 K is constant. 1224 00:47:56,916 --> 00:47:59,896 Because when you compile your program you're choosing K. Or, 1225 00:47:59,896 --> 00:48:01,716 when you run your program you're choosing K, 1226 00:48:01,716 --> 00:48:04,276 making that many buckets and then using that many buckets. 1227 00:48:04,546 --> 00:48:07,716 So in fact a hash table asymptotically is the big O1 1228 00:48:07,716 --> 00:48:10,336 of N. But here's this, this is 1229 00:48:10,336 --> 00:48:13,116 where reality should break off from theory. 1230 00:48:13,116 --> 00:48:15,546 In the real world doing something, 1231 00:48:15,866 --> 00:48:20,896 navigating a list that's one Kth the size of a list of size N, 1232 00:48:21,156 --> 00:48:25,806 like that is in real world seconds, human time much faster. 1233 00:48:26,166 --> 00:48:29,246 It's a factor of K faster than searching the whole list. 1234 00:48:29,586 --> 00:48:32,176 So, hash tables have this interesting property where, 1235 00:48:32,536 --> 00:48:35,296 you know, even though there's still a linear data structure, 1236 00:48:35,296 --> 00:48:38,316 yeah in the worst case you're chains might devolve 1237 00:48:38,316 --> 00:48:42,066 into these really long beasts if you're unlucky, or foolish. 1238 00:48:42,516 --> 00:48:46,786 Well, if you're smart and your data has some properties 1239 00:48:46,786 --> 00:48:49,256 like English words tend to, various patterns 1240 00:48:49,256 --> 00:48:50,546 and distributions of letters. 1241 00:48:51,186 --> 00:48:53,996 Your hash table doesn't need to have really long chains. 1242 00:48:53,996 --> 00:48:56,046 In fact ideal would be to have chains 1243 00:48:56,086 --> 00:49:00,216 that are roughly the same length because code that runs in O of, 1244 00:49:00,216 --> 00:49:03,826 N over K time is in real world terms much faster than code 1245 00:49:03,826 --> 00:49:06,036 that runs over in just N time. 1246 00:49:06,566 --> 00:49:09,376 And so the challenge of P set six is going to be the leverage, 1247 00:49:09,486 --> 00:49:14,026 these real world, these real world implications 1248 00:49:14,026 --> 00:49:15,236 of performance improvements 1249 00:49:15,286 --> 00:49:17,736 to implement the fastest spell checker possible. 1250 00:49:18,526 --> 00:49:22,786 Let's take our five minute break here. 1251 00:49:22,786 --> 00:49:23,906 [ multiple background voices ] 1252 00:49:23,906 --> 00:49:24,746 Alright we're back. 1253 00:49:24,856 --> 00:49:28,806 So, this slightly retro song reminds me of other comments, 1254 00:49:28,806 --> 00:49:30,786 which some people love the music we play in 50. 1255 00:49:30,846 --> 00:49:32,946 Some people think my music tastes are like two 1256 00:49:32,946 --> 00:49:33,886 to three years out of date. 1257 00:49:34,206 --> 00:49:35,016 That's kind of true. 1258 00:49:35,016 --> 00:49:37,586 I rarely add songs to this I-phone, 1259 00:49:37,586 --> 00:49:38,696 but I did add this song. 1260 00:49:38,696 --> 00:49:40,496 I even paid like a $1.29 for it. 1261 00:49:40,736 --> 00:49:42,076 Those of you who are fans of The Office. 1262 00:49:42,076 --> 00:49:43,256 [ multiple background voices ] 1263 00:49:43,256 --> 00:49:46,436 Yeah. So, I had never even seen the You Tube video 1264 00:49:46,436 --> 00:49:49,196 that they recently spoofed, but I'm kind of into this song, 1265 00:49:49,196 --> 00:49:50,146 like a year or so later. 1266 00:49:50,496 --> 00:49:53,296 So, anyhow we'll maybe link to the video, or conclude with it 1267 00:49:53,296 --> 00:49:54,336 as you walk out today. 1268 00:49:55,056 --> 00:49:59,336 So, quick, quick reframing then of where we came from 1269 00:49:59,376 --> 00:50:00,376 and where we're about to go. 1270 00:50:00,646 --> 00:50:03,056 So, the motivation was to come up with a data structure 1271 00:50:03,056 --> 00:50:05,936 that puts to shame the things we've looked at thus far; 1272 00:50:05,976 --> 00:50:07,026 arrays in linked lists. 1273 00:50:07,356 --> 00:50:08,826 Constant time is what we want, 1274 00:50:09,086 --> 00:50:12,716 but unfortunately constant time is not really possible unless we 1275 00:50:12,716 --> 00:50:14,476 waste a ridiculous amount of memory. 1276 00:50:14,756 --> 00:50:17,556 Or come up with the so-called ideal hash function 1277 00:50:17,556 --> 00:50:20,636 that puts every student in his or her right place 1278 00:50:20,636 --> 00:50:22,346 without collisions, but that's hard. 1279 00:50:22,646 --> 00:50:25,276 And, so we seem to be at the point now where we kind 1280 00:50:25,276 --> 00:50:26,386 of have got to compromise. 1281 00:50:26,746 --> 00:50:29,376 Right? We have to pick a size for our hash table, 1282 00:50:29,716 --> 00:50:32,586 31 in this case, 26 I kept saying in that case. 1283 00:50:32,986 --> 00:50:35,746 We're going to have collisions, but let's at least have a way 1284 00:50:35,746 --> 00:50:36,596 of dealing with them, 1285 00:50:36,706 --> 00:50:38,516 we can deal with them with linked lists. 1286 00:50:38,516 --> 00:50:42,426 And now we can in fact fit an arbitrary number of words, 1287 00:50:42,426 --> 00:50:45,306 or students, or objects into memory, 1288 00:50:45,976 --> 00:50:49,876 but still find things fasters in N over K time than we could 1289 00:50:49,876 --> 00:50:52,176 in just something linear, like a linked list. 1290 00:50:52,616 --> 00:50:54,986 So, what might this look like in code? 1291 00:50:55,276 --> 00:50:57,676 Well, let me go ahead and open something 1292 00:50:57,676 --> 00:51:01,766 like this hashtable.c This too is empty by default, 1293 00:51:01,766 --> 00:51:03,236 and just as I think aloud here, 1294 00:51:03,236 --> 00:51:04,796 how would I implement a hash table? 1295 00:51:04,796 --> 00:51:06,246 Well, we kind of did it a moment ago. 1296 00:51:06,506 --> 00:51:08,886 If I'm just going to store student names or words, 1297 00:51:08,886 --> 00:51:13,126 maybe for P set six we could call this my hash table. 1298 00:51:13,386 --> 00:51:14,596 I need to pick a side. 1299 00:51:14,946 --> 00:51:16,986 For problem set six you're probably not going to want 1300 00:51:16,986 --> 00:51:19,566 to pick 26, probably not going to want to pick 31. 1301 00:51:19,756 --> 00:51:21,866 Because when you have 140,000 words, 1302 00:51:21,866 --> 00:51:23,796 I mean you might actually want something that's 1303 00:51:23,796 --> 00:51:25,926 like 1,024 locations long. 1304 00:51:26,186 --> 00:51:27,276 Or maybe even longer. 1305 00:51:27,276 --> 00:51:29,776 100 locations long. 1306 00:51:29,776 --> 00:51:30,936 You'll still have some collisions. 1307 00:51:31,156 --> 00:51:34,136 40,000 or so words won't fit obviously in this table. 1308 00:51:34,346 --> 00:51:37,896 So, you'll have chains, but what will be true about those chains 1309 00:51:37,956 --> 00:51:39,196 if your hash table is this big? 1310 00:51:39,196 --> 00:51:39,263 >> [inaudible] 1311 00:51:39,263 --> 00:51:41,386 >> They're really short. 1312 00:51:41,636 --> 00:51:42,996 So, again trade off, right? 1313 00:51:42,996 --> 00:51:46,986 More RAM, more memory, but faster running time. 1314 00:51:47,216 --> 00:51:49,206 And, so what you'll find in problem set six, 1315 00:51:49,206 --> 00:51:51,656 even though the competition aspect of it is purely for fun. 1316 00:51:51,656 --> 00:51:53,636 Opt in. You don't need to do it if you don't want to. 1317 00:51:53,856 --> 00:51:56,506 But if you do test your code, you're following the directions 1318 00:51:56,736 --> 00:51:59,136 and post the speed of your code and the amount 1319 00:51:59,136 --> 00:52:01,316 of memory you're using on the course's website, 1320 00:52:01,316 --> 00:52:03,386 as part of that problem set you'll see that some 1321 00:52:03,386 --> 00:52:06,796 of the fun actually is in turning knobs, so to speak. 1322 00:52:06,846 --> 00:52:09,496 Tweaking constants, like 140,000 and seeing 1323 00:52:09,666 --> 00:52:11,626 if I get more buckets does my code speed up? 1324 00:52:11,906 --> 00:52:13,916 Or does it mean I'm using so much memory 1325 00:52:13,916 --> 00:52:16,846 that the computer is actually finding things slightly 1326 00:52:16,846 --> 00:52:17,616 more slowly. 1327 00:52:17,896 --> 00:52:19,946 So, this would be really the first problem set 1328 00:52:19,946 --> 00:52:23,146 where you really appreciate real world considerations 1329 00:52:23,146 --> 00:52:25,006 like tweaking settings like this. 1330 00:52:25,516 --> 00:52:28,336 Alright so what do we do now for my insert function? 1331 00:52:28,926 --> 00:52:31,106 Well, what is problem set six going to have you do? 1332 00:52:31,106 --> 00:52:33,816 Well, you're going to have a function called insert, or add, 1333 00:52:33,816 --> 00:52:34,736 or something like that. 1334 00:52:34,786 --> 00:52:37,436 Actually, the function you'll see is going to be called load. 1335 00:52:37,506 --> 00:52:39,726 And you'll be passed a whole file, not a single word. 1336 00:52:39,906 --> 00:52:43,706 But, let me implement psuedo code for insert function. 1337 00:52:43,706 --> 00:52:47,416 I need to insert one word, S, into my hash table. 1338 00:52:47,756 --> 00:52:50,416 So, what am I going to need to do? 1339 00:52:50,676 --> 00:52:53,746 I'm going to have to find a location for S. 1340 00:52:53,746 --> 00:52:55,386 And then what do I want to do? 1341 00:52:55,386 --> 00:53:02,576 Insert S at that location if collision append to chain. 1342 00:53:02,946 --> 00:53:06,576 So, that's the essence of inserting a note into this list. 1343 00:53:06,576 --> 00:53:08,076 How do you actually find something? 1344 00:53:08,076 --> 00:53:09,356 Well, find, let me just 1345 00:53:09,416 --> 00:53:11,886 for spell checker I don't care about location here. 1346 00:53:11,886 --> 00:53:13,036 Let me do a true/false thing. 1347 00:53:13,036 --> 00:53:15,686 So, I'm going to do bool and then I'm going to do find, 1348 00:53:15,686 --> 00:53:17,646 and I'm going to take in as input a word, 1349 00:53:17,836 --> 00:53:21,536 and the question being asked is, is this word in your dictionary? 1350 00:53:21,536 --> 00:53:22,886 I just need to say yes or no. 1351 00:53:23,236 --> 00:53:24,916 So, I'm going to do the same thing really. 1352 00:53:24,916 --> 00:53:29,006 Find the location for S, because hopefully it's there already. 1353 00:53:29,446 --> 00:53:35,136 And then return true if found. 1354 00:53:35,136 --> 00:53:36,436 If you've never seen the terminology Iff, 1355 00:53:36,436 --> 00:53:37,966 it's not a typo. 1356 00:53:37,966 --> 00:53:39,316 It means if and only if. 1357 00:53:39,316 --> 00:53:42,816 The implication is return true if it's found, else false. 1358 00:53:43,236 --> 00:53:45,426 But if we want to be a little more pedantic we can say 1359 00:53:45,426 --> 00:53:46,526 else false. 1360 00:53:46,796 --> 00:53:49,146 So, this is kind of problem set six in a nut shell. 1361 00:53:49,196 --> 00:53:50,996 Now granted it won't be use two lines of code. 1362 00:53:50,996 --> 00:53:52,046 It won't be terribly many. 1363 00:53:52,236 --> 00:53:54,746 The problem set will ultimately be more about thought and then 1364 00:53:54,746 --> 00:53:55,926 about experimentation 1365 00:53:55,926 --> 00:53:58,706 with actual knob-turning, so to speak. 1366 00:53:59,066 --> 00:54:01,036 But this is really what it's going to be about. 1367 00:54:01,076 --> 00:54:02,686 But the digression is going to come in then 1368 00:54:02,886 --> 00:54:04,776 as to how big should this table be, 1369 00:54:04,836 --> 00:54:06,746 and how do you implement those chains. 1370 00:54:06,926 --> 00:54:09,686 Well, there was a reason two weeks ago passed 1371 00:54:09,896 --> 00:54:13,436 out that source code for linked list manipulation. 1372 00:54:13,746 --> 00:54:15,696 We looked, I think, at the find function 1373 00:54:15,696 --> 00:54:16,876 and the insert function. 1374 00:54:17,236 --> 00:54:19,426 You realize, lest you forget next week, 1375 00:54:19,426 --> 00:54:22,236 you actually have an implementation of linked list 1376 00:54:22,236 --> 00:54:24,136 in your hands, or in your binders. 1377 00:54:24,336 --> 00:54:25,746 And, that's probably going to be very useful 1378 00:54:25,806 --> 00:54:28,246 for implementing support for collisions, 1379 00:54:28,426 --> 00:54:30,276 because you've been handed essentially the code 1380 00:54:30,276 --> 00:54:31,666 that will implement chains. 1381 00:54:31,906 --> 00:54:33,146 But what you're going to have to figure 1382 00:54:33,146 --> 00:54:36,786 out is how do you have not one linked list, how do you have 31, 1383 00:54:36,786 --> 00:54:41,416 or 100,000 linked list, and actually add words 1384 00:54:41,696 --> 00:54:43,606 and then find words in this dictionary. 1385 00:54:43,936 --> 00:54:46,706 But it turns out there's other approaches to dictionaries 1386 00:54:46,946 --> 00:54:48,206 and to hashing in general. 1387 00:54:48,326 --> 00:54:50,376 The real new idea today 1388 00:54:50,376 --> 00:54:52,856 and on Monday was this idea of a hash function. 1389 00:54:53,156 --> 00:54:55,926 Rather than take some input and just put it 1390 00:54:55,926 --> 00:54:58,736 in some fixed location we now have some dynamism, 1391 00:54:59,026 --> 00:55:02,006 this hash function that takes input massages it a little bit, 1392 00:55:02,006 --> 00:55:04,186 or analyzes it a little bit and then extracts 1393 00:55:04,186 --> 00:55:05,786 from that analysis an answer 1394 00:55:05,786 --> 00:55:08,226 like location zero, or location one. 1395 00:55:08,516 --> 00:55:11,776 This is kind of a cool thing, and hashing in general is nice. 1396 00:55:11,776 --> 00:55:13,776 So, why don't we just use it to the extreme? 1397 00:55:14,166 --> 00:55:16,776 There are these data structures called tris. 1398 00:55:17,186 --> 00:55:19,186 Silly name, derives from the word retrival, 1399 00:55:19,186 --> 00:55:20,816 which for some reason is pronounced different 1400 00:55:20,816 --> 00:55:21,686 from the word tris. 1401 00:55:21,956 --> 00:55:25,546 But a tri is a tree, the word tree was already taken 1402 00:55:25,546 --> 00:55:26,556 as we'll see in a moment. 1403 00:55:26,786 --> 00:55:28,836 So, a tri kind of looks like this. 1404 00:55:28,956 --> 00:55:31,056 This is a really nice simplification of tri. 1405 00:55:31,536 --> 00:55:34,786 A tri is a tree, what's a tree? 1406 00:55:34,846 --> 00:55:35,856 Think of a family tree. 1407 00:55:35,856 --> 00:55:37,566 Right? You've got like grandma and grandpa up here, 1408 00:55:37,566 --> 00:55:39,586 and then all of their children as these leaves, 1409 00:55:39,586 --> 00:55:41,026 or edges coming off of them. 1410 00:55:41,026 --> 00:55:41,836 And, then it forks out. 1411 00:55:41,956 --> 00:55:43,876 So, it's a big upside down tree, a family tree. 1412 00:55:44,136 --> 00:55:46,176 That's what a tree is in computer science context. 1413 00:55:46,556 --> 00:55:48,236 So you know what each of the nodes 1414 00:55:48,236 --> 00:55:50,526 in this family tree called a tri is? 1415 00:55:50,866 --> 00:55:52,356 It's actually an array. 1416 00:55:52,356 --> 00:55:54,436 So, another approach fundamentally 1417 00:55:54,496 --> 00:55:57,496 to implementing a dictionary, especially for lots of words is, 1418 00:55:57,496 --> 00:55:59,256 is you start with a root node, 1419 00:55:59,376 --> 00:56:01,546 which is an array of size let's say 26. 1420 00:56:01,836 --> 00:56:04,566 And let's assume all lower case just alphabetically letters, 1421 00:56:04,566 --> 00:56:06,636 no hyphenated words, no apostrophes 1422 00:56:06,736 --> 00:56:08,376 in our dictionary, very simple words. 1423 00:56:08,616 --> 00:56:12,826 So, my root node is a node containing an array 1424 00:56:12,826 --> 00:56:13,886 with 26 letters. 1425 00:56:13,886 --> 00:56:14,766 And you know what? 1426 00:56:14,766 --> 00:56:17,516 How do I check if a word is in this dictionary? 1427 00:56:17,866 --> 00:56:21,636 Well, I hash on the word letter by letter by letter. 1428 00:56:21,936 --> 00:56:23,166 So, here's my first node. 1429 00:56:23,166 --> 00:56:24,646 I've got A through Z in here. 1430 00:56:24,696 --> 00:56:27,786 I take in a word like apple, and I look at the first letter, 1431 00:56:27,786 --> 00:56:31,176 and it's an A. So, A in a tri will typically map to zero, 1432 00:56:31,176 --> 00:56:32,806 because it's the 0th letter. 1433 00:56:33,116 --> 00:56:33,966 So, what does that mean? 1434 00:56:33,966 --> 00:56:37,016 That means I check the A location in my array, 1435 00:56:37,296 --> 00:56:41,056 so location A. It's not depicted there in the top node. 1436 00:56:41,136 --> 00:56:44,556 A. And then where do I go next in this tree? 1437 00:56:44,866 --> 00:56:47,166 Well, I look at the second letter of my word. 1438 00:56:47,166 --> 00:56:48,366 My word was apple. 1439 00:56:48,366 --> 00:56:50,826 So, P. So, now I follow what's essentially going 1440 00:56:50,826 --> 00:56:53,236 to be a pointer, an arrow to another node. 1441 00:56:53,596 --> 00:56:58,976 But the arrow I follow is from the location in the array 1442 00:56:58,976 --> 00:57:02,366 that represents P. So, in other words I have a tree, 1443 00:57:02,576 --> 00:57:04,406 each of whose nodes is an array. 1444 00:57:04,406 --> 00:57:07,486 And so to find whether or not a word is 1445 00:57:07,486 --> 00:57:11,446 in this data structure I go from node to node to node to node 1446 00:57:11,446 --> 00:57:13,656 to node following the arrows appropriate 1447 00:57:14,276 --> 00:57:15,746 for the ith [assumed spelling] letter in my word. 1448 00:57:16,006 --> 00:57:17,936 And then I get to the bottom of this tree, 1449 00:57:17,936 --> 00:57:19,896 and as this particular textbook depicted it, 1450 00:57:19,896 --> 00:57:22,586 at the end of your tree, the so called leaves you have 1451 00:57:22,586 --> 00:57:25,326 to have a special value, a Boolean; true or false 1452 00:57:25,426 --> 00:57:27,236 that says a word stops here. 1453 00:57:27,746 --> 00:57:30,326 Because if a word stops here what does that mean? 1454 00:57:30,516 --> 00:57:32,536 Well, it means it was inserted at one point. 1455 00:57:33,106 --> 00:57:35,196 So, now I can answer questions of the form, 1456 00:57:35,196 --> 00:57:36,946 is this word in my data structure? 1457 00:57:37,506 --> 00:57:40,466 Yes. Now this, we'll come back to this a just a moment. 1458 00:57:40,466 --> 00:57:41,306 But think tree. 1459 00:57:41,716 --> 00:57:42,766 Think use of arrays. 1460 00:57:42,766 --> 00:57:44,536 And think multi-level hashing. 1461 00:57:44,856 --> 00:57:46,856 Hash again, and again, and again, not just once 1462 00:57:46,886 --> 00:57:48,246 and we can get to this end game. 1463 00:57:48,246 --> 00:57:51,506 This is what some of you will likely adopt 1464 00:57:51,736 --> 00:57:53,736 for experimentation sake for problem set six. 1465 00:57:54,266 --> 00:57:55,826 So let's frame the new jargon. 1466 00:57:56,106 --> 00:57:56,786 What is a tree? 1467 00:57:56,786 --> 00:57:58,856 It looks a little something like this. 1468 00:57:58,856 --> 00:58:00,346 You have a so called root node. 1469 00:58:00,346 --> 00:58:02,646 And we'll use trees again in a week or two's time 1470 00:58:02,646 --> 00:58:05,316 with web programming, and HTML, and PHP. 1471 00:58:05,316 --> 00:58:07,996 A tree in computer science speak is a node depicted here 1472 00:58:07,996 --> 00:58:08,546 as a circle. 1473 00:58:08,906 --> 00:58:10,496 Those nodes might contain values. 1474 00:58:10,656 --> 00:58:12,496 Here that node contains the number one. 1475 00:58:12,746 --> 00:58:14,336 And trees can have children. 1476 00:58:14,626 --> 00:58:18,116 A left child, a right child, or in fact any number of children, 1477 00:58:18,196 --> 00:58:19,876 although trees with just left 1478 00:58:19,876 --> 00:58:21,196 and right children are very common. 1479 00:58:21,316 --> 00:58:22,386 They're called binary trees. 1480 00:58:22,436 --> 00:58:23,246 Bi meaning two. 1481 00:58:23,606 --> 00:58:25,396 But this tree is just meant for jargon's sake. 1482 00:58:25,636 --> 00:58:27,406 This thing up here is called the root node. 1483 00:58:27,656 --> 00:58:29,186 Three is a child of one. 1484 00:58:29,336 --> 00:58:30,806 Two is a child of one. 1485 00:58:31,086 --> 00:58:33,426 Two is a sibling of three. 1486 00:58:33,496 --> 00:58:35,996 So, they completely stole the family tree terminology. 1487 00:58:35,996 --> 00:58:37,146 There's nothing new here. 1488 00:58:37,146 --> 00:58:39,806 And then finally anything at the very bottom 1489 00:58:39,896 --> 00:58:42,936 that itself has no children we call a leaf. 1490 00:58:43,376 --> 00:58:44,096 So, this is a tree. 1491 00:58:44,316 --> 00:58:46,466 And, this is a specific instance in computer science 1492 00:58:46,506 --> 00:58:49,416 of what's called a graph, where a graph is something with nodes, 1493 00:58:49,416 --> 00:58:51,196 and arrows, and edges that go elsewhere. 1494 00:58:51,366 --> 00:58:53,276 But a tree is nice because you have no circles. 1495 00:58:53,276 --> 00:58:56,306 Right? It's a very rare thing for a tree to have a branch 1496 00:58:56,346 --> 00:58:57,586 that then grows back into itself. 1497 00:58:57,786 --> 00:58:59,006 Right? Then it becomes a graph. 1498 00:58:59,356 --> 00:59:00,756 More on that in a data structures class. 1499 00:59:01,196 --> 00:59:02,056 So, this is a tree. 1500 00:59:02,296 --> 00:59:03,766 This is a binary search tree. 1501 00:59:03,766 --> 00:59:05,106 I'm just going to make mention of this, 1502 00:59:05,186 --> 00:59:06,866 because this is the kind of stuff you can revisit 1503 00:59:06,866 --> 00:59:08,146 in a data structures course. 1504 00:59:08,416 --> 00:59:10,426 We used binary search on arrays. 1505 00:59:10,766 --> 00:59:13,246 Turns out you can use more sophisticated structures 1506 00:59:13,246 --> 00:59:15,776 like trees and actually find data pretty fast. 1507 00:59:16,186 --> 00:59:19,316 A binary search tree is a tree that's binary. 1508 00:59:19,646 --> 00:59:22,156 Each node has no more than two children; left and right. 1509 00:59:22,536 --> 00:59:25,836 And it's a search tree in that if you store numbers in each 1510 00:59:25,836 --> 00:59:27,866 of the nodes so long as you store them 1511 00:59:27,866 --> 00:59:31,276 in a smart way you can find things as fast in a tree 1512 00:59:31,276 --> 00:59:34,036 as you can in an array that's also sorted. 1513 00:59:34,326 --> 00:59:34,996 What does this mean? 1514 00:59:35,126 --> 00:59:36,826 Well, notice that 55 is the root. 1515 00:59:36,826 --> 00:59:37,766 That's kind of arbitrary. 1516 00:59:38,096 --> 00:59:40,006 What's not arbitrary that its child 1517 00:59:40,006 --> 00:59:42,136 on the left is smaller than the root. 1518 00:59:42,486 --> 00:59:46,106 And its child on the right, 77, is larger than the root. 1519 00:59:46,416 --> 00:59:47,956 And that principle applies to all 1520 00:59:47,956 --> 00:59:49,926 of the other children recursively. 1521 00:59:50,166 --> 00:59:53,346 So notice 77's children on the left is a smaller number. 1522 00:59:53,346 --> 00:59:55,286 On the right is a bigger number. 1523 00:59:55,536 --> 00:59:58,056 So, if you adhere to this pattern in a tree it turns 1524 00:59:58,056 --> 01:00:00,866 out you can search this thing in log, in time. 1525 01:00:01,306 --> 01:00:03,826 And this just hands us some really cool data structures 1526 01:00:03,826 --> 01:00:04,596 in computer science. 1527 01:00:04,746 --> 01:00:06,886 There's things called two three trees, red black trees. 1528 01:00:07,106 --> 01:00:10,146 The world has really leveraged this idea of a tree to come 1529 01:00:10,146 --> 01:00:11,936 up with very sophisticated data structures. 1530 01:00:11,936 --> 01:00:16,246 And when we in a couple of weeks use PHP my MySQL, 1531 01:00:16,526 --> 01:00:17,826 a popular database engine. 1532 01:00:18,106 --> 01:00:21,926 The database world is replete with use of trees, 1533 01:00:21,996 --> 01:00:24,056 because you can do really fancy things. 1534 01:00:24,056 --> 01:00:27,706 And if you like this kind of stuff CS124 is the place to go 1535 01:00:27,706 --> 01:00:28,896 in the spring or beyond. 1536 01:00:29,156 --> 01:00:32,196 So, let's go use trees to solve a problem 1537 01:00:32,196 --> 01:00:34,436 and come full circle back to this idea of tris 1538 01:00:34,436 --> 01:00:37,186 and then finish the day answering the question how well 1539 01:00:37,186 --> 01:00:39,896 can we do with storing things like words. 1540 01:00:40,316 --> 01:00:42,966 Alright. So, there's this thing called Morse code. 1541 01:00:43,036 --> 01:00:44,176 Some of you might know what it is. 1542 01:00:44,176 --> 01:00:46,396 You kind of push that thing; beep, beep, beep, beep, beep. 1543 01:00:46,396 --> 01:00:48,516 And, it sends messages and it receives messages. 1544 01:00:48,676 --> 01:00:50,236 And it's pretty efficient, because if you want 1545 01:00:50,236 --> 01:00:53,036 to send the letter A you just hold down a button quickly 1546 01:00:53,036 --> 01:00:55,896 and then you push the button for like a slightly longer time 1547 01:00:55,896 --> 01:00:56,856 and that's what a dash means. 1548 01:00:57,046 --> 01:00:58,386 A dot and a dash. 1549 01:00:58,696 --> 01:01:01,886 So, Morse code is a really cool way of encoding information. 1550 01:01:02,186 --> 01:01:05,036 And so the domain specific topic we'll introduce just 1551 01:01:05,036 --> 01:01:07,126 for the sake of discussion today is compression. 1552 01:01:07,496 --> 01:01:09,696 So, how many of you have ever compressed a file before 1553 01:01:09,786 --> 01:01:10,386 on your computer? 1554 01:01:10,906 --> 01:01:12,836 Alright it's pretty commonplace these days. 1555 01:01:12,836 --> 01:01:15,356 In fact almost any web page you download these days is 1556 01:01:15,356 --> 01:01:17,886 compressed, but then it's immediately decompressed 1557 01:01:17,996 --> 01:01:19,976 by your browser before it shows you it. 1558 01:01:20,326 --> 01:01:24,626 In English what does it mean to decompress a file? 1559 01:01:24,626 --> 01:01:24,786 >> [inaudible] 1560 01:01:24,786 --> 01:01:25,766 >> Make it smaller, right? 1561 01:01:25,766 --> 01:01:27,556 So, you've got a really long document 1562 01:01:27,676 --> 01:01:30,136 and compressing it means you make it smaller. 1563 01:01:30,136 --> 01:01:31,876 Alright. I can make your essays smaller. 1564 01:01:31,876 --> 01:01:34,176 So if you just wrote a ten page paper I can compress it 1565 01:01:34,176 --> 01:01:36,296 by deleting pages six through ten. 1566 01:01:36,746 --> 01:01:37,646 Is that compression? 1567 01:01:37,646 --> 01:01:38,566 [ laughter ] 1568 01:01:38,566 --> 01:01:38,646 >> No. 1569 01:01:38,846 --> 01:01:40,226 >> So, it actually is compression. 1570 01:01:40,366 --> 01:01:45,016 There is compression called lossy, L-O-S-S-Y compression. 1571 01:01:45,276 --> 01:01:46,826 And that's actually a valid technique. 1572 01:01:46,826 --> 01:01:49,206 It's not so good for essays, but it's reasonable 1573 01:01:49,206 --> 01:01:51,046 for movie and graphics, right? 1574 01:01:51,046 --> 01:01:52,766 If you've ever looked a You Tube video some 1575 01:01:52,766 --> 01:01:54,196 of them are pretty low quality, 1576 01:01:54,416 --> 01:01:56,076 but that's because someone compressed them. 1577 01:01:56,076 --> 01:02:00,396 And that person compressed them lossily. 1578 01:02:00,976 --> 01:02:05,106 That means they made it smaller by throwing out some 1579 01:02:05,106 --> 01:02:06,636 of the fidelity, some of the quality. 1580 01:02:06,636 --> 01:02:08,296 And that might be reasonable, but with English 1581 01:02:08,296 --> 01:02:10,596 and with text it's generally bad to throw information. 1582 01:02:10,836 --> 01:02:13,276 So, there's also lossless compression. 1583 01:02:13,576 --> 01:02:17,516 L-O-S-S-L-E-S-S, lossless compression. 1584 01:02:17,766 --> 01:02:19,346 That's good for things like things like text. 1585 01:02:19,636 --> 01:02:23,076 So, how in the world can you take a ten page essay whether 1586 01:02:23,076 --> 01:02:26,026 it's written in notepad, or Microsoft Word, or pages, 1587 01:02:26,026 --> 01:02:29,436 or whatever, how conceptually could you make something smaller 1588 01:02:29,436 --> 01:02:32,976 without throwing away words and importation thesis 1589 01:02:32,976 --> 01:02:34,526 and supporting paragraphs? 1590 01:02:35,066 --> 01:02:40,136 What has to happen if you've got a text file with words, 1591 01:02:40,136 --> 01:02:42,766 none of which you can afford to get rid of? 1592 01:02:42,766 --> 01:02:43,856 >> [inaudible] 1593 01:02:43,856 --> 01:02:46,626 >> Sorry? OK, so you get rid of the spaces, right? 1594 01:02:46,626 --> 01:02:48,196 Actually that would be a fun exercise 1595 01:02:48,196 --> 01:02:49,076 in expos [assumed spelling] something. 1596 01:02:49,076 --> 01:02:50,766 Really annoy your preceptor. 1597 01:02:50,976 --> 01:02:52,466 Just eliminate all spaces, right? 1598 01:02:52,466 --> 01:02:56,266 That's really easy, Control R, space, then replace all. 1599 01:02:56,266 --> 01:02:58,186 And really annoy someone easily. 1600 01:02:58,636 --> 01:03:01,306 Or, you can do stupid things like, maybe you just, 1601 01:03:01,306 --> 01:03:04,096 let's take that, which is kind of crazy and slightly improve 1602 01:03:04,096 --> 01:03:06,846 on it by saying just capitalize the first letter of every word. 1603 01:03:06,936 --> 01:03:08,816 So, then at least a human can more easily see 1604 01:03:08,816 --> 01:03:09,806 where the word breaks are. 1605 01:03:10,116 --> 01:03:11,856 But there too you're losing information. 1606 01:03:12,096 --> 01:03:15,286 So a space is a legitimate grammatical construct 1607 01:03:15,426 --> 01:03:16,726 and we're throwing away information 1608 01:03:16,726 --> 01:03:17,466 if we take that approach. 1609 01:03:17,526 --> 01:03:18,996 But, we're saving space. 1610 01:03:19,666 --> 01:03:23,546 What else could we do besides actually affecting the quality 1611 01:03:25,056 --> 01:03:31,216 of the document. 1612 01:03:31,216 --> 01:03:31,283 >> [inaudible] 1613 01:03:31,283 --> 01:03:31,866 >> OK. Good. 1614 01:03:31,866 --> 01:03:36,276 So, if you use the same word a lot, why don't you just replace 1615 01:03:36,326 --> 01:03:38,596 that word with the letter X, and tell your TF, 1616 01:03:38,596 --> 01:03:40,536 much like you would a computer science TF, 1617 01:03:40,866 --> 01:03:46,236 X equals American History, or something that you don't want 1618 01:03:46,236 --> 01:03:48,036 to keep saying again and again, and again. 1619 01:03:48,036 --> 01:03:49,166 Right? Just declare a variable, 1620 01:03:49,166 --> 01:03:50,826 declare a constant at the top, right? 1621 01:03:51,406 --> 01:03:53,866 And then AH all over your documents, right? 1622 01:03:53,866 --> 01:03:56,086 That's another way to annoy your humanities teacher this week. 1623 01:03:56,336 --> 01:03:58,446 OK, so that's actually pretty reasonable and speaks 1624 01:03:58,446 --> 01:04:01,006 to a really clever idea, look for patterns. 1625 01:04:01,766 --> 01:04:04,826 Look for some places where you're spending a lot of bits 1626 01:04:04,826 --> 01:04:07,896 and spend few bits but communicate the same ideas, 1627 01:04:08,136 --> 01:04:09,766 factor out the common cases. 1628 01:04:09,996 --> 01:04:12,046 Well, you might recall this chart from weeks ago. 1629 01:04:12,046 --> 01:04:15,836 We had ASCIItable.com the funny thing about ASCII is 1630 01:04:15,886 --> 01:04:20,456 that you're using eight bits to represent every character 1631 01:04:20,456 --> 01:04:22,496 on your keyboard plus some others, right? 1632 01:04:22,496 --> 01:04:24,306 This whole chart is kind of overwhelming, 1633 01:04:24,306 --> 01:04:25,446 because there's so many characters. 1634 01:04:25,726 --> 01:04:28,786 But in your English essay, or expos essay, I mean how many 1635 01:04:28,786 --> 01:04:32,566 of you are very often using the carrot symbol, or the tilde 1636 01:04:32,566 --> 01:04:35,096 if it's an English document and not like Spanish? 1637 01:04:35,096 --> 01:04:38,026 Or, you know, what else is not that common? 1638 01:04:38,206 --> 01:04:40,196 Plus? You probably don't use plus that often. 1639 01:04:40,476 --> 01:04:44,596 How many of you use the synchronize idol key 1640 01:04:44,596 --> 01:04:45,296 on your keyboard? 1641 01:04:45,586 --> 01:04:46,576 Probably not that much. 1642 01:04:46,576 --> 01:04:49,136 In other words ASCII is itself not very efficient 1643 01:04:49,136 --> 01:04:52,516 because it spends eight bits on every letter irrespective 1644 01:04:52,516 --> 01:04:56,266 of the frequency of that character in your English pros. 1645 01:04:56,516 --> 01:04:59,106 So, maybe we could take this idea of finding patterns, 1646 01:04:59,286 --> 01:05:01,666 common words to the real extreme and look 1647 01:05:01,706 --> 01:05:03,366 for common patterns of bits. 1648 01:05:03,686 --> 01:05:07,686 Right? If I see the bit pattern 111111 a whole lot, 1649 01:05:08,026 --> 01:05:11,426 maybe I can represent these bit patterns more succinctly. 1650 01:05:12,026 --> 01:05:13,026 And in fact you can. 1651 01:05:13,386 --> 01:05:14,826 When you use on Mac OS 1652 01:05:14,826 --> 01:05:17,016 or Windows the compress file feature, 1653 01:05:17,256 --> 01:05:19,456 or the zip file feature, any of these kinds 1654 01:05:19,456 --> 01:05:22,216 of ideas you're taking not necessarily text documents, 1655 01:05:22,216 --> 01:05:26,416 you can compress almost anything sometimes and make it smaller 1656 01:05:26,626 --> 01:05:28,376 without sacrificing quality. 1657 01:05:28,596 --> 01:05:31,016 And if we do this not at the alphabetical level, 1658 01:05:31,106 --> 01:05:34,396 but rather at the bit level, we can do this pretty effectively. 1659 01:05:34,616 --> 01:05:36,806 And how many of you for instance have taken a big file, 1660 01:05:37,076 --> 01:05:38,666 whether it's a movie or something else 1661 01:05:38,666 --> 01:05:40,346 and compressed it, that's made it smaller, 1662 01:05:40,346 --> 01:05:41,816 but have you tried compressing it again? 1663 01:05:43,466 --> 01:05:44,496 Right? I mean this is kind of fun. 1664 01:05:44,566 --> 01:05:46,126 Take any file, compress it. 1665 01:05:46,126 --> 01:05:47,056 Take your essay, compress it. 1666 01:05:47,056 --> 01:05:48,086 It will make it smaller. 1667 01:05:48,486 --> 01:05:49,426 Do it again. 1668 01:05:49,426 --> 01:05:52,476 It will make it smaller, and smaller, and smaller, 1669 01:05:52,476 --> 01:05:54,596 and smaller, until you can compress anything you've written 1670 01:05:54,596 --> 01:05:55,216 into what? 1671 01:05:55,486 --> 01:05:56,896 Hopefully one bit. 1672 01:05:57,316 --> 01:05:57,556 Right? 1673 01:05:57,556 --> 01:05:58,316 >> [inaudible] 1674 01:05:58,316 --> 01:06:01,136 >> It's kind of crazy talk, right? 1675 01:06:01,136 --> 01:06:02,246 That should not be possible. 1676 01:06:02,566 --> 01:06:05,296 Right? And so there's actually an interesting upper bound here 1677 01:06:05,296 --> 01:06:06,356 that's very theoretical. 1678 01:06:06,586 --> 01:06:09,816 There's interesting ideas of entropy and randomness here, 1679 01:06:09,816 --> 01:06:13,826 because at some point the only way you can really compress data 1680 01:06:13,826 --> 01:06:15,846 is to eliminate patterns, right? 1681 01:06:15,846 --> 01:06:17,826 Or exploit patterns and replace them 1682 01:06:17,826 --> 01:06:19,436 with shorter sequences of bits. 1683 01:06:19,676 --> 01:06:22,196 But eventually if you do this again and again, and again, 1684 01:06:22,396 --> 01:06:24,906 at some point your document is essentially going 1685 01:06:24,906 --> 01:06:25,996 to look random. 1686 01:06:26,206 --> 01:06:28,726 You're not going to have long patterns of bits. 1687 01:06:28,986 --> 01:06:31,766 You might have a 11 and 00, but you're not going 1688 01:06:31,766 --> 01:06:34,886 to have twenty 1s in a row over here, 1689 01:06:34,886 --> 01:06:36,076 and over here, and over here. 1690 01:06:36,176 --> 01:06:37,996 You're going to essentially turn your document 1691 01:06:37,996 --> 01:06:40,696 into something that's very random in appearance. 1692 01:06:40,776 --> 01:06:42,656 And that's sort of the end game with compression. 1693 01:06:42,856 --> 01:06:44,366 And encryption does the same thing. 1694 01:06:44,366 --> 01:06:45,876 It tries to make everything look random. 1695 01:06:45,876 --> 01:06:47,606 So, there's an interesting relationship there. 1696 01:06:47,976 --> 01:06:50,116 But the question for today is how can we go 1697 01:06:50,116 --> 01:06:51,236 about compressing data? 1698 01:06:51,906 --> 01:06:53,306 Well, that is not the way. 1699 01:06:54,786 --> 01:06:56,666 We could do something succinct like this. 1700 01:06:56,666 --> 01:07:00,026 Why use eight bits, why don't we use something like Morse code 1701 01:07:00,286 --> 01:07:05,746 where you represent with a dot, the letter E, and the letter A 1702 01:07:05,746 --> 01:07:08,586 with a dot dash, as it's called. 1703 01:07:08,586 --> 01:07:10,736 And the reason for the different shapes here is dot means the 1704 01:07:10,736 --> 01:07:13,186 person running the show would push a switch 1705 01:07:13,186 --> 01:07:15,456 like this really fast and the dash means their finger goes 1706 01:07:15,456 --> 01:07:17,086 down for, like, a second and then comes up. 1707 01:07:17,466 --> 01:07:20,406 But there's a problem, if you are the recipient of a letter 1708 01:07:20,406 --> 01:07:23,676 in Morse code and you receive a dot dash 1709 01:07:23,676 --> 01:07:25,986 and then some other stuff, what letter 1710 01:07:25,986 --> 01:07:29,296 or letters have you just received? 1711 01:07:29,296 --> 01:07:29,436 >> [inaudible] 1712 01:07:29,436 --> 01:07:31,866 >> Yeah, you might have received A dot dash. 1713 01:07:31,986 --> 01:07:35,386 Or, you might have received ET. 1714 01:07:35,386 --> 01:07:39,166 So, the problem with Morse code at least is that it's not, 1715 01:07:39,166 --> 01:07:40,976 there's this ambiguity, which is it? 1716 01:07:41,156 --> 01:07:42,776 Right? So, you could maybe insert pauses, 1717 01:07:42,776 --> 01:07:44,426 and this is what they did in the Morse code world. 1718 01:07:44,426 --> 01:07:46,756 You kind of hesitate before sending the second character. 1719 01:07:46,996 --> 01:07:49,226 Or you infer from context what the letter is. 1720 01:07:49,576 --> 01:07:51,536 But there is the problem that we need to address. 1721 01:07:51,536 --> 01:07:54,286 If we're going to compress information we really can't have 1722 01:07:54,286 --> 01:07:56,456 their being ambiguities in our zip files, 1723 01:07:56,456 --> 01:07:57,646 or in our compressed files. 1724 01:07:57,646 --> 01:07:59,716 Because you don't want to compress your essay into this 1725 01:07:59,716 --> 01:08:01,876 and then get back a different essay, essentially, 1726 01:08:01,876 --> 01:08:04,136 by having the words decompressed differently. 1727 01:08:04,436 --> 01:08:06,746 So, we need something called immediate decodability. 1728 01:08:06,896 --> 01:08:08,416 Now the formal algorithm we'll look 1729 01:08:08,416 --> 01:08:10,286 at for a second here is actually this. 1730 01:08:10,386 --> 01:08:11,586 If you want to read through the specifics, 1731 01:08:11,586 --> 01:08:12,876 but I think it's easier to think 1732 01:08:12,876 --> 01:08:14,236 through just with simple pictures. 1733 01:08:14,556 --> 01:08:16,636 And it turns out when the neatest application 1734 01:08:16,766 --> 01:08:19,906 of this data structure trees, and thus the motivation 1735 01:08:19,906 --> 01:08:22,886 for this context, is compression. 1736 01:08:23,346 --> 01:08:26,476 So, this is how you can compress data using a fairly simple, 1737 01:08:26,476 --> 01:08:28,056 but sophisticated data structure. 1738 01:08:28,356 --> 01:08:30,876 Suppose for the sake of discussion I wrote an essay 1739 01:08:30,876 --> 01:08:32,416 that looks like this, between quotes. 1740 01:08:32,546 --> 01:08:35,536 Right? It's complete non-sense, but assume it's not. 1741 01:08:35,816 --> 01:08:37,566 We just needed very simple alphabet here. 1742 01:08:37,566 --> 01:08:39,296 So, I've got a lot of As and Es, 1743 01:08:39,416 --> 01:08:41,356 but what's interesting is there are some patterns there. 1744 01:08:41,356 --> 01:08:43,746 I see just at glancing up at the board there's lots 1745 01:08:43,746 --> 01:08:46,076 of EE patterns, or EEE patterns. 1746 01:08:46,076 --> 01:08:49,186 And the goal as you proposed is let's try to factor those out, 1747 01:08:49,346 --> 01:08:51,116 or let's represent those more succinctly. 1748 01:08:51,386 --> 01:08:54,666 Now, what I did here in advance was I counted all the Es, 1749 01:08:54,666 --> 01:08:56,616 all the As, all the Bs, and so forth. 1750 01:08:56,836 --> 01:08:58,256 This is their relative frequency. 1751 01:08:58,366 --> 01:09:01,126 So, if you do a sanity check twenty percent of the letters 1752 01:09:01,126 --> 01:09:03,126 in this nonsense essay are As. 1753 01:09:03,426 --> 01:09:05,766 Ten percent are Bs, ten percent are Cs, and so forth. 1754 01:09:05,916 --> 01:09:07,406 So, I did a frequency analysis. 1755 01:09:07,636 --> 01:09:09,236 I opened the file, went from left to right, 1756 01:09:09,236 --> 01:09:10,716 counted up the letters using an array. 1757 01:09:10,766 --> 01:09:12,176 And now I have my percentages. 1758 01:09:12,556 --> 01:09:15,546 This is all it takes to compress this file intelligently. 1759 01:09:15,896 --> 01:09:16,916 I'm going to do the following, 1760 01:09:17,046 --> 01:09:19,416 I'm going to build a forest of trees. 1761 01:09:19,786 --> 01:09:22,566 And, we're going to do this not in code today, but in Word. 1762 01:09:22,566 --> 01:09:25,186 So, a forest of trees is five trees, 1763 01:09:25,546 --> 01:09:27,046 each of which represents a letter. 1764 01:09:27,046 --> 01:09:28,786 And inside that node I'm going 1765 01:09:28,786 --> 01:09:30,156 to put the frequency of that letter. 1766 01:09:30,366 --> 01:09:32,296 So, think of these as my leaves for the moment. 1767 01:09:32,496 --> 01:09:34,356 But, it's a forest of trees and you know what? 1768 01:09:34,356 --> 01:09:35,836 I actually have five trees here. 1769 01:09:36,116 --> 01:09:37,866 It's just they don't have any children, so they're kind 1770 01:09:37,866 --> 01:09:38,816 of stupid looking trees. 1771 01:09:39,076 --> 01:09:41,556 But we're going to build them up into something more interesting. 1772 01:09:41,896 --> 01:09:45,236 Mr. Huffman, a graduate student fifty plus years ago, 1773 01:09:45,506 --> 01:09:48,046 came up with the following algorithm for compressing data. 1774 01:09:48,386 --> 01:09:51,126 Analysis the frequency of your text, your essay or whatever. 1775 01:09:51,366 --> 01:09:54,236 Though we can do similar ideas on binary encodings too. 1776 01:09:54,626 --> 01:09:59,206 Analysis the frequency and start building one tree out of all 1777 01:09:59,206 --> 01:10:02,666 of these letters by taking the two smallest nodes 1778 01:10:02,966 --> 01:10:05,196 and combining them into a new parent node. 1779 01:10:05,546 --> 01:10:06,546 So, this is step one. 1780 01:10:06,616 --> 01:10:09,176 Make your forest of trees, just by copying those frequencies. 1781 01:10:09,226 --> 01:10:13,036 Step two is join the two smallest nodes, ten percent, 1782 01:10:13,036 --> 01:10:15,926 ten percent, and make a new node, and put the number 1783 01:10:15,926 --> 01:10:17,526 in its node, that's the sum of the children. 1784 01:10:17,616 --> 01:10:19,506 So, we had ten percent, ten percent, this new node; 1785 01:10:19,506 --> 01:10:21,316 the parent is twenty percent. 1786 01:10:21,766 --> 01:10:23,376 OK, so now repeat. 1787 01:10:23,776 --> 01:10:27,116 And, one of the concepts, we don't use all that much 1788 01:10:27,116 --> 01:10:29,956 in this course because it's hard to do it in a compelling way 1789 01:10:29,956 --> 01:10:32,546 until problems become more interesting as recursion. 1790 01:10:32,806 --> 01:10:35,106 But what I'm going to say is recurse. 1791 01:10:35,446 --> 01:10:36,596 Apply that algorithm again. 1792 01:10:36,956 --> 01:10:37,586 So, what doest that mean? 1793 01:10:37,586 --> 01:10:38,846 Find the two smallest nodes. 1794 01:10:38,846 --> 01:10:41,346 Well, I see twenty percent and fifteen percent. 1795 01:10:41,586 --> 01:10:43,686 Let's go ahead and merge those. 1796 01:10:43,686 --> 01:10:46,296 So, I merge those new parents, 35 percent. 1797 01:10:46,666 --> 01:10:48,346 But, now notice what Huffan proposed. 1798 01:10:48,346 --> 01:10:51,176 He said take the edges and arbitrarily 1799 01:10:51,176 --> 01:10:54,066 but consistently label them with a zero or a one. 1800 01:10:54,416 --> 01:10:56,716 So, a zero is going on the left, one on the right, 1801 01:10:56,716 --> 01:10:57,766 and then again, and again. 1802 01:10:58,076 --> 01:11:00,606 The goal is to make one tree all 1803 01:11:00,606 --> 01:11:02,196 of whose edges have zeros and ones. 1804 01:11:02,196 --> 01:11:03,976 Because that's going to be the new sequence 1805 01:11:03,976 --> 01:11:07,886 of bits we should use instead of ASCII for Mr. Huffman's scheme. 1806 01:11:08,196 --> 01:11:10,156 So, 35 percent and 20 percent, 1807 01:11:10,156 --> 01:11:13,496 we're going to get a new node now, because 55 percent. 1808 01:11:13,916 --> 01:11:16,646 Finally the two smallest nodes, or the only two nodes 55 1809 01:11:16,646 --> 01:11:18,206 and 45 percent, and thanks 1810 01:11:18,206 --> 01:11:21,036 to percentages everything adds up to 100 percent. 1811 01:11:21,536 --> 01:11:22,496 So, now I have a tree. 1812 01:11:22,836 --> 01:11:23,936 This is Huffman tree. 1813 01:11:24,176 --> 01:11:26,756 This is a data structure you could fairly easily build 1814 01:11:26,756 --> 01:11:30,066 out in C code, in fact if we were to implement each 1815 01:11:30,066 --> 01:11:32,926 of these nodes with some C code with a struct, 1816 01:11:33,346 --> 01:11:38,396 what pieces of data do you need inside of the struct? 1817 01:11:39,166 --> 01:11:41,906 So, I say struct, node; what goes inside of a node 1818 01:11:41,906 --> 01:11:46,706 to implement this data structure? 1819 01:11:46,836 --> 01:11:47,606 One such node. 1820 01:11:47,606 --> 01:11:50,066 So, let's arbitrarily [inaudible] how do I embody the 1821 01:11:50,066 --> 01:11:52,006 information point 45 in a struct? 1822 01:11:52,726 --> 01:11:54,306 What data type do I need inside my struct? 1823 01:11:54,306 --> 01:11:55,636 >> [inaudible] 1824 01:11:55,636 --> 01:11:56,576 >> So, like a float? 1825 01:11:56,576 --> 01:11:59,116 Float F. Call it whatever you want, but what else do I need 1826 01:11:59,116 --> 01:12:01,076 for a node to implement this picture? 1827 01:12:01,076 --> 01:12:01,876 >> [inaudible] 1828 01:12:01,876 --> 01:12:02,786 >> Yeah, you need star. 1829 01:12:02,786 --> 01:12:05,746 So, we need like a struct node star for left pointer 1830 01:12:05,746 --> 01:12:08,256 and a struct node star for a right pointer. 1831 01:12:08,256 --> 01:12:10,186 But, there's no new concepts here, 1832 01:12:10,186 --> 01:12:11,756 because remember we did this with linked lists. 1833 01:12:11,756 --> 01:12:14,536 Every node in a linked list had a next pointer, 1834 01:12:14,536 --> 01:12:16,916 and on the quiz the doubly linked list had a next 1835 01:12:16,916 --> 01:12:18,036 and previous pointer. 1836 01:12:18,246 --> 01:12:19,056 So, it's the same idea. 1837 01:12:19,056 --> 01:12:20,386 We're just now saying not next 1838 01:12:20,386 --> 01:12:22,246 and previous, but left and right. 1839 01:12:22,246 --> 01:12:24,996 And conceptually they point downward instead of laterally. 1840 01:12:25,226 --> 01:12:27,156 So, a struct might look like this. 1841 01:12:27,336 --> 01:12:30,286 So, we have frequency, and actually I flipped it around. 1842 01:12:30,346 --> 01:12:32,566 Float would work too, but frequency would also work 1843 01:12:32,566 --> 01:12:34,646 if you just used the original raw numbers. 1844 01:12:34,816 --> 01:12:36,586 Apologizes for that. 1845 01:12:36,586 --> 01:12:37,386 Either approach is fine. 1846 01:12:37,776 --> 01:12:39,466 We might want to retain the symbol though, right? 1847 01:12:39,466 --> 01:12:41,576 For the leaf nodes, those actually have characters. 1848 01:12:41,576 --> 01:12:42,496 We need to remember those. 1849 01:12:42,496 --> 01:12:45,226 So, that's a char and a struct node; left struct node, right. 1850 01:12:45,596 --> 01:12:47,636 So, how do you actually compress information? 1851 01:12:47,836 --> 01:12:50,616 Well, what Mr. Huffman said at this point in the story is 1852 01:12:50,616 --> 01:12:53,426 if you want to represent the letter A, 1853 01:12:53,426 --> 01:12:57,056 do not use the binary number for 65, 1854 01:12:57,056 --> 01:12:57,886 because that's a waste of bits. 1855 01:12:57,886 --> 01:12:59,306 That's eight bits. 1856 01:12:59,306 --> 01:13:01,326 If the only letters in your essay are A 1857 01:13:01,326 --> 01:13:05,386 through E you absolutely don't need eight bits per letter. 1858 01:13:05,466 --> 01:13:06,396 Right? That's just a waste. 1859 01:13:06,396 --> 01:13:07,936 You never used those other characters. 1860 01:13:08,246 --> 01:13:10,806 So, you can probably infer how do we represent the number A 1861 01:13:10,806 --> 01:13:13,816 according to Mr. Huffman? 1862 01:13:13,816 --> 01:13:13,883 >> [inaudible] 1863 01:13:13,883 --> 01:13:15,146 >> Zero followed by one. 1864 01:13:15,636 --> 01:13:17,366 And for letter B what bit sequence? 1865 01:13:17,366 --> 01:13:18,936 >> [inaudible] 1866 01:13:18,936 --> 01:13:19,646 >> Yeah. Right. 1867 01:13:19,646 --> 01:13:21,096 And, again it is this simple. 1868 01:13:21,096 --> 01:13:23,516 Just I'm starting from the root and I'm going from root 1869 01:13:23,516 --> 01:13:26,036 to the letter B. What path am I taking? 1870 01:13:26,036 --> 01:13:30,956 0000, Mr. Huffman said use four zeros for B. What about C? 1871 01:13:31,141 --> 01:13:33,141 >> [inaudible] 1872 01:13:33,326 --> 01:13:36,526 >> Good. And D is? 1873 01:13:36,526 --> 01:13:39,186 001. And E is finally? 1874 01:13:40,276 --> 01:13:43,096 1. And now notice what's cool about this, so it's, 1875 01:13:43,176 --> 01:13:44,576 the data is already on the board, 1876 01:13:44,576 --> 01:13:46,136 but let's go back to picture one. 1877 01:13:46,136 --> 01:13:47,546 What was the most popular character? 1878 01:13:48,876 --> 01:13:52,246 So, Es 45 percent of the time do we have Es in this document. 1879 01:13:52,376 --> 01:13:55,296 So, this is what's beautiful about Huffman coding is 1880 01:13:55,296 --> 01:13:57,136 that Mr. Huffman said, you know what? 1881 01:13:57,136 --> 01:14:00,716 Just use a single bit, the number one to represent an E, 1882 01:14:00,716 --> 01:14:02,636 because my God if the whole idea is 1883 01:14:02,636 --> 01:14:04,886 to use the fewest bits possible, well, 1884 01:14:04,886 --> 01:14:06,466 then optimize the common case. 1885 01:14:06,466 --> 01:14:07,526 And this too is a theme 1886 01:14:07,526 --> 01:14:09,006 in programming, or computer science. 1887 01:14:09,336 --> 01:14:11,866 If E is the most common letter, then my God, 1888 01:14:11,866 --> 01:14:15,156 use the fewest bits you can to represent E. And for something 1889 01:14:15,156 --> 01:14:17,356 like B and C, ten percent of the time. 1890 01:14:17,356 --> 01:14:20,806 Fine. Waste some bits on those letters, 1891 01:14:20,996 --> 01:14:22,186 because they happen less often. 1892 01:14:22,536 --> 01:14:26,256 And now per this comment about immediate decodablity, 1893 01:14:26,606 --> 01:14:29,776 unlike Morse code where we very quickly saw a problem, 1894 01:14:30,096 --> 01:14:31,276 notice what's neat here. 1895 01:14:31,586 --> 01:14:34,446 Does the letter E share a sequence, 1896 01:14:34,446 --> 01:14:36,416 a prefix with any other letter? 1897 01:14:36,416 --> 01:14:41,406 The only letter that starts with 1 is E. So, that's good. 1898 01:14:41,406 --> 01:14:43,176 We can't conflate it with any other letters. 1899 01:14:43,176 --> 01:14:44,076 How about A? 1900 01:14:44,566 --> 01:14:48,176 A is 01. Does any other letter start with 01? 1901 01:14:49,336 --> 01:14:51,236 So, no. And, that's a really cool property 1902 01:14:51,236 --> 01:14:53,636 because with Huffman coding not only can you compress 1903 01:14:53,706 --> 01:14:58,006 information by using fewer bits, there's also no ambiguity. 1904 01:14:58,216 --> 01:15:01,266 So, to compress a file with Huffman coding you, one, 1905 01:15:01,266 --> 01:15:03,356 analyze it, which takes big O1 of N time. 1906 01:15:03,356 --> 01:15:04,966 You've got to count your characters left to right, 1907 01:15:04,966 --> 01:15:09,266 and you build up a tree in memory using fairly familiar, 1908 01:15:09,586 --> 01:15:11,696 even though we haven't, we won't code them until problem set six, 1909 01:15:11,696 --> 01:15:13,916 fairly familiar constructs like structs 1910 01:15:13,916 --> 01:15:15,236 with pointers inside of them. 1911 01:15:15,596 --> 01:15:18,486 You build up five nodes and then you loop over your nodes. 1912 01:15:18,486 --> 01:15:20,886 And, you find the two smallest ones; create a new parent node; 1913 01:15:20,966 --> 01:15:23,406 link them together; repeat until you're out of nodes. 1914 01:15:23,696 --> 01:15:25,736 And then you just traverse the tree, 1915 01:15:25,816 --> 01:15:28,316 and there exists actually recursive algorithms 1916 01:15:28,536 --> 01:15:30,736 that are just a few lines of code that could spit 1917 01:15:30,736 --> 01:15:32,806 out a little table in memory, or an array. 1918 01:15:32,806 --> 01:15:33,476 That's actually kind 1919 01:15:33,476 --> 01:15:35,216 of an interesting design problem itself. 1920 01:15:35,476 --> 01:15:38,486 But, then finally you analyze the file for frequency; 1921 01:15:38,606 --> 01:15:41,576 you build your tree in memory; you figure out your encodings. 1922 01:15:41,576 --> 01:15:44,396 The last step is to iterate over your file, your essay, 1923 01:15:44,626 --> 01:15:46,656 top to bottom, left to right, one last time 1924 01:15:46,896 --> 01:15:50,536 and every time you see an A, you use the function F right, 1925 01:15:50,716 --> 01:15:52,876 or something similar to spit out a zero and a one. 1926 01:15:53,156 --> 01:15:55,296 Anytime you see an E, you spit out a one. 1927 01:15:55,376 --> 01:15:59,376 Anytime a D, this pattern; then you close the file and voila, 1928 01:15:59,376 --> 01:16:02,006 you have a file that now mathematically has been 1929 01:16:02,046 --> 01:16:05,756 compressed optimally, that is you've used as few bits 1930 01:16:05,756 --> 01:16:09,146 as possible for the common cases and you've spent more bits 1931 01:16:09,146 --> 01:16:12,006 as is appropriate for the less common cases. 1932 01:16:12,006 --> 01:16:14,376 So for p set six you'll have this opportunity 1933 01:16:14,416 --> 01:16:16,996 to implement your own tree structure known as a tri, 1934 01:16:17,296 --> 01:16:19,676 or your own hash table, known as a hash table. 1935 01:16:20,076 --> 01:16:20,906 So, more on that on Monday. 1936 01:16:21,516 --> 01:18:20,878 [ multiple voices in background ] 1937 01:18:21,378 --> 01:20:20,740 [ music ]