1 00:00:09,176 --> 00:00:09,976 >> Alright. 2 00:00:10,206 --> 00:00:11,756 Hello. Welcome back to CS50. 3 00:00:11,756 --> 00:00:13,466 This is the end of week seven, 4 00:00:13,466 --> 00:00:15,326 so we've got a lot of fun stuff ahead. 5 00:00:15,606 --> 00:00:18,186 I thought we would wrap up a couple 6 00:00:18,186 --> 00:00:20,916 of the lower level details that we began on Monday today, 7 00:00:20,916 --> 00:00:25,416 but really focus today on some techniques and some concepts 8 00:00:25,416 --> 00:00:27,556 that you'll be using certainly throughout the remainder 9 00:00:27,556 --> 00:00:28,456 of the semester, albeit 10 00:00:28,456 --> 00:00:30,716 in a completely different context of web programming. 11 00:00:30,716 --> 00:00:33,966 And then if you ever pick up a computer and a compiler 12 00:00:33,966 --> 00:00:35,736 or a text editor in the future, 13 00:00:35,736 --> 00:00:38,916 odds are these same concepts will occur for quite some time. 14 00:00:38,916 --> 00:00:42,656 But first, a few, do you mind bringing the house volume 15 00:00:42,656 --> 00:00:43,046 down a bit? 16 00:00:43,356 --> 00:00:46,856 But first a few comments toward an end of teasing you 17 00:00:46,856 --> 00:00:48,946 with some ideas for your final projects and also 18 00:00:48,946 --> 00:00:52,066 to get the juices flowing with regards to final projects. 19 00:00:52,066 --> 00:00:53,946 We'll spend more time on this next week or so. 20 00:00:54,136 --> 00:00:55,896 We'll release the specification 21 00:00:55,896 --> 00:00:58,056 for the final project even though it's not due for a couple 22 00:00:58,056 --> 00:01:01,096 of months so that as you dive in to the web based aspect 23 00:01:01,096 --> 00:01:02,756 of the course and the web based pieces, 24 00:01:02,986 --> 00:01:04,576 you start to at least think about the kinds 25 00:01:04,576 --> 00:01:05,736 of things you might want to do. 26 00:01:05,936 --> 00:01:08,166 But realize you need not do something web based, 27 00:01:08,166 --> 00:01:10,826 you're welcome to do most anything, as you will see. 28 00:01:10,826 --> 00:01:14,296 So we surveyed all 1600+ students that ended 29 00:01:14,296 --> 00:01:18,056 up using this particular tool to shop for course this past fall. 30 00:01:18,056 --> 00:01:20,996 We set up arbitrary dates of releasing some updates today 31 00:01:20,996 --> 00:01:22,576 because you guys, if you've not heard, 32 00:01:22,786 --> 00:01:25,246 have to do this thing called pre term planning starting this 33 00:01:25,246 --> 00:01:29,226 year, which means registering what you are likely to take 34 00:01:29,226 --> 00:01:31,366 so that FAS can plan teaching fellows 35 00:01:31,366 --> 00:01:32,726 and room assignments more effectively. 36 00:01:32,726 --> 00:01:35,126 If you have no idea of what I'm talking about, you, odds are, 37 00:01:35,126 --> 00:01:37,546 have some email in your Harvard.edu email account 38 00:01:37,546 --> 00:01:38,856 somewhere, so go dig that up. 39 00:01:38,856 --> 00:01:40,336 I think that binds in a couple of weeks. 40 00:01:40,376 --> 00:01:42,616 But we want to use this as an opportunity 41 00:01:42,996 --> 00:01:44,796 to make this tool even better 42 00:01:44,796 --> 00:01:46,666 and to actually practice what we're preaching, 43 00:01:46,666 --> 00:01:47,796 take in feedback and try 44 00:01:47,796 --> 00:01:50,226 to iteratively improve the design of this thing. 45 00:01:50,226 --> 00:01:53,396 So now if you've never actually used the side it's supposedly 46 00:01:53,646 --> 00:01:54,386 pretty straight forward. 47 00:01:54,386 --> 00:01:57,036 You go to courses.cs50.net and the purpose in life 48 00:01:57,036 --> 00:01:58,756 of this application is really just meant 49 00:01:58,986 --> 00:02:01,366 to make navigating the course catalogue, which you probably do 50 00:02:01,366 --> 00:02:04,606 at least a couple of times a year, more, certainly simpler, 51 00:02:04,696 --> 00:02:08,416 easier, if not a little bit exciting 52 00:02:08,416 --> 00:02:10,866 in that you might very well stumble across courses 53 00:02:10,866 --> 00:02:13,446 that in the traditional format of a printed catalogue 54 00:02:13,446 --> 00:02:15,696 or in my.Harvard, you might not find these things. 55 00:02:15,696 --> 00:02:19,816 So a few of the shortcomings, we realized the previous versions, 56 00:02:19,816 --> 00:02:22,236 so there were some stupid things, frankly, 57 00:02:22,496 --> 00:02:24,606 whereby if you chose some faculty member's name 58 00:02:24,606 --> 00:02:27,176 from this drop down to see what courses this person taught, 59 00:02:27,286 --> 00:02:28,956 and then you wanted to undo that process, 60 00:02:28,956 --> 00:02:30,976 there was actually no bold faced link up to 61 00:02:30,976 --> 00:02:32,076 that said all faculty. 62 00:02:32,076 --> 00:02:35,526 You essentially had to reload the page or click clear or go 63 00:02:35,526 --> 00:02:36,696 through a non obvious process. 64 00:02:36,966 --> 00:02:38,466 So there was some low hanging fruits. 65 00:02:38,526 --> 00:02:42,076 But the funny thing is, as we'll discuss in a moment with regard 66 00:02:42,156 --> 00:02:44,756 to your pset4 gripes with software and machines 67 00:02:44,756 --> 00:02:47,216 that you've seen in your life, there are some lot 68 00:02:47,216 --> 00:02:49,716 of low hanging fruit out there in software whereby 69 00:02:49,716 --> 00:02:52,246 if you just fix that one stupid bug or shortcoming, 70 00:02:52,576 --> 00:02:53,996 you end up making a lot of people happy. 71 00:02:53,996 --> 00:02:55,906 So we certainly plucked off the easiest of things. 72 00:02:56,086 --> 00:02:58,686 We previously had these links at top left to link 73 00:02:58,716 --> 00:03:02,016 to all core courses, all gen ed, and we got comments like well, 74 00:03:02,016 --> 00:03:04,536 that's great, but I'd really like to see what the empirical 75 00:03:04,536 --> 00:03:05,956 and mathematical reasoning courses are. 76 00:03:05,956 --> 00:03:07,656 I don't want just gen ed in general. 77 00:03:08,046 --> 00:03:10,816 And you know the programmer in me pushed back and I was 78 00:03:10,816 --> 00:03:13,396 like well, in reading this comment you could just go 79 00:03:13,396 --> 00:03:16,036 over here and type empirical and mathematical reasoning, 80 00:03:16,376 --> 00:03:17,816 but the student's comment was fair. 81 00:03:17,856 --> 00:03:19,536 Because though it may have been obvious to me, 82 00:03:19,536 --> 00:03:21,376 the person who implemented that feature, 83 00:03:21,566 --> 00:03:25,286 well if the users aren't similarly identifying how 84 00:03:25,286 --> 00:03:27,356 to do something, that's my fault, frankly. 85 00:03:27,426 --> 00:03:29,586 And I think if you don't take ownership of this problem 86 00:03:29,586 --> 00:03:31,446 and just expect that your users figure it out, 87 00:03:31,636 --> 00:03:33,956 I think you're not going to end up designing good tools 88 00:03:33,956 --> 00:03:36,406 and good products for people to use. 89 00:03:36,406 --> 00:03:39,426 So now, for instance, this too, not a hard thing 90 00:03:39,426 --> 00:03:40,276 to implement in the end. 91 00:03:40,486 --> 00:03:42,596 You click gen ed and you actually can select all 92 00:03:42,596 --> 00:03:44,026 of the areas or individual areas. 93 00:03:44,026 --> 00:03:45,256 So again, some low hanging fruit. 94 00:03:45,256 --> 00:03:47,206 But then there were some more interesting problems to solve. 95 00:03:47,206 --> 00:03:49,196 If you used this tool this past fall, 96 00:03:49,446 --> 00:03:51,686 then previously toward the left 97 00:03:51,686 --> 00:03:53,606 of every course there were a couple of check boxes, 98 00:03:53,636 --> 00:03:56,566 check this if you want to shop this course, check this box 99 00:03:56,566 --> 00:03:57,806 if you've taken this course. 100 00:03:57,806 --> 00:03:59,466 And there was also a little Facebook icon 101 00:03:59,756 --> 00:04:02,416 that if you clicked it would trigger a comments box 102 00:04:02,416 --> 00:04:04,506 where you could actually click Facebook's like button 103 00:04:04,706 --> 00:04:06,446 and you could start a little Facebook comment thread 104 00:04:06,446 --> 00:04:07,106 about the course. 105 00:04:07,106 --> 00:04:09,276 And what was fascinating in actually looking at the logs 106 00:04:09,276 --> 00:04:12,356 on the server and asking people outright what features they 107 00:04:12,356 --> 00:04:15,036 used, like no one even realized you could click this little blue 108 00:04:15,036 --> 00:04:18,016 icon and no one really ended up commenting on courses. 109 00:04:18,016 --> 00:04:20,066 So though this was a great idea, we thought, 110 00:04:20,286 --> 00:04:22,406 have students have this sort of unofficial channel 111 00:04:22,406 --> 00:04:24,376 for commenting on courses, nobody used it. 112 00:04:24,636 --> 00:04:25,616 And so we ripped it out. 113 00:04:25,886 --> 00:04:27,776 Because if it's not being used, why leave it in there 114 00:04:27,776 --> 00:04:30,686 as a distraction, even though this idea we think still might 115 00:04:30,686 --> 00:04:31,566 have some potential. 116 00:04:31,566 --> 00:04:33,466 So now we realized it actually might just be nice 117 00:04:33,466 --> 00:04:37,056 to generalize the idea of adding a course to some kind of list 118 00:04:37,316 --> 00:04:38,876 because a lot of students didn't want 119 00:04:38,876 --> 00:04:41,626 to take a shopper course this fall, but they wanted to shop it 120 00:04:41,626 --> 00:04:43,956 in the spring and just because we hadn't thought this 121 00:04:43,956 --> 00:04:46,866 through fully, we didn't give them a way three weeks ago 122 00:04:46,866 --> 00:04:48,866 to actually shop courses in the spring. 123 00:04:48,866 --> 00:04:51,356 So now we've generalized this idea so you have a little drop 124 00:04:51,356 --> 00:04:53,096 down menu and you can choose courses you're shopping, 125 00:04:53,356 --> 00:04:55,266 courses you're taking, courses you've taken. 126 00:04:55,476 --> 00:04:58,996 And now it remains to be seen if you all even care about some 127 00:04:58,996 --> 00:05:02,356 of these features, but from the computer scientist in us what we 128 00:05:02,356 --> 00:05:05,016 like is the idea of being able to gather this kind of data 129 00:05:05,186 --> 00:05:07,626 and understand who's taking what, whose shopped what, 130 00:05:07,846 --> 00:05:09,756 who has shopped something and not taken it. 131 00:05:09,996 --> 00:05:12,346 Because if we can build up a corpus of data 132 00:05:12,576 --> 00:05:15,516 that really logs the kinds of behavior 133 00:05:15,516 --> 00:05:17,006 that people exhibit during shopping period, 134 00:05:17,256 --> 00:05:18,246 just imagine the kinds 135 00:05:18,246 --> 00:05:20,256 of questions you could start answering for people. 136 00:05:20,256 --> 00:05:23,096 For instance, I want to take Economics 10. 137 00:05:23,606 --> 00:05:26,176 Well what other course, well that's probably a bad example, 138 00:05:26,176 --> 00:05:28,586 because almost everyone takes Economics 10. 139 00:05:28,586 --> 00:05:30,936 I want to take ABC 10. 140 00:05:31,236 --> 00:05:31,526 All right? 141 00:05:31,526 --> 00:05:34,696 So ABC 10, you might then look at your data and say well, 142 00:05:34,696 --> 00:05:35,746 who else has taken this course? 143 00:05:35,956 --> 00:05:37,176 Well, if they've taken this course, 144 00:05:37,176 --> 00:05:38,766 what other courses have those people taken? 145 00:05:38,766 --> 00:05:40,886 And you can begin to infer with some probability 146 00:05:41,156 --> 00:05:42,346 that if you took this course, 147 00:05:42,346 --> 00:05:43,576 and you're now shopping this course, 148 00:05:43,856 --> 00:05:46,146 you might like these other suggestions as well. 149 00:05:46,146 --> 00:05:50,326 So for now, we just have some random suggestions at the left. 150 00:05:50,356 --> 00:05:51,566 Sometimes it's not so random. 151 00:05:51,646 --> 00:05:54,176 CS50 seems to pop up with some recurring frequency 152 00:05:54,176 --> 00:05:54,786 for some reason. 153 00:05:54,786 --> 00:05:55,246 [ Laughter ] 154 00:05:55,246 --> 00:05:57,516 But this is where we hope to go. 155 00:05:57,516 --> 00:05:59,326 And so hopefully now that you're beginning 156 00:05:59,326 --> 00:06:03,326 to understand how computers work and how computers think 157 00:06:03,326 --> 00:06:05,896 and how you can make them do your bidding, you can start 158 00:06:05,896 --> 00:06:06,876 to appreciate the kinds 159 00:06:06,876 --> 00:06:09,466 of problems you can solve once you've got some interesting data 160 00:06:09,466 --> 00:06:12,496 and really just some basic tools in your took kit. 161 00:06:12,496 --> 00:06:14,906 So if you want to use this tool, by all means feel free to. 162 00:06:14,906 --> 00:06:17,956 The funny thing was, too, we got a lot of comments about features 163 00:06:17,956 --> 00:06:20,816 that it would be great to add that were actually there. 164 00:06:20,926 --> 00:06:22,976 For instance, Q Guide data has been in here 165 00:06:22,976 --> 00:06:25,676 since the start, but admittedly it was not obvious 166 00:06:25,676 --> 00:06:28,096 that if you hover over a course you can actually get all 167 00:06:28,096 --> 00:06:31,096 of the data from materials, assignments, workload, 168 00:06:31,096 --> 00:06:32,766 difficulty, but our fault. 169 00:06:32,766 --> 00:06:34,426 If you didn't notice that you could actually get all 170 00:06:34,426 --> 00:06:36,056 of that data without having to click 171 00:06:36,056 --> 00:06:37,786 through to the course's individual Q Guide 172 00:06:37,836 --> 00:06:40,156 evaluations, again, our fault. 173 00:06:40,156 --> 00:06:43,496 But the funny thing is, it's very easy for people 174 00:06:43,496 --> 00:06:46,096 to make comments on tools and to design things sort 175 00:06:46,096 --> 00:06:47,466 of off the top of their head, 176 00:06:47,786 --> 00:06:50,176 but it's really only once you sit down and start trying 177 00:06:50,176 --> 00:06:53,206 to program this machine that you realize the tradeoffs you have 178 00:06:53,256 --> 00:06:53,616 to make. 179 00:06:53,666 --> 00:06:56,226 There are so many other good suggestions that we've received 180 00:06:56,526 --> 00:06:59,516 that we decided not to implement, at least not yet, 181 00:06:59,716 --> 00:07:02,276 because you also only have a limited amount of space 182 00:07:02,276 --> 00:07:04,266 and if you just keep throwing more and more content, 183 00:07:04,326 --> 00:07:06,586 and this too is a recurring theme in Problem Set 4, 184 00:07:06,916 --> 00:07:08,526 you just end up overwhelming your user. 185 00:07:08,526 --> 00:07:10,786 So the theme, at least for CS50's applications, 186 00:07:10,786 --> 00:07:13,056 which I don't think in general is bad advice, 187 00:07:13,426 --> 00:07:17,016 is to try to do few things really well rather than try 188 00:07:17,016 --> 00:07:18,476 to do a lot of things poorly. 189 00:07:18,476 --> 00:07:20,716 So case and point, another recurring theme that's been 190 00:07:20,716 --> 00:07:22,716 on people's minds is you know, I'd like to be able 191 00:07:22,716 --> 00:07:25,956 to tag courses for I might do this for gen ed, I might do this 192 00:07:25,956 --> 00:07:27,436 for pre med, I might do this for this, 193 00:07:27,536 --> 00:07:31,136 like a lot of really compelling use cases, but we felt at least 194 00:07:31,136 --> 00:07:33,816 for now if we tried cramming all of the more features into this 195 00:07:33,816 --> 00:07:36,266 and we didn't do them really, really well, 196 00:07:36,486 --> 00:07:38,786 then it would just be another crappy tool that gets commented 197 00:07:38,786 --> 00:07:40,586 on frankly next year for Problem Set 4. 198 00:07:40,586 --> 00:07:41,786 So for now we're going to restrict it 199 00:07:41,786 --> 00:07:43,286 to what we think is working well 200 00:07:43,456 --> 00:07:44,706 and then we'll continue to iterate. 201 00:07:44,706 --> 00:07:47,216 So again, take away is not look 202 00:07:47,216 --> 00:07:49,166 at how we implemented this particular tool, 203 00:07:49,166 --> 00:07:50,746 but look at the kind of questions 204 00:07:50,746 --> 00:07:53,286 that you yourselves should ask perhaps when it comes time 205 00:07:53,286 --> 00:07:54,926 to designing your final projects. 206 00:07:54,926 --> 00:07:57,146 It's really easy to brainstorm what you think is a brilliant 207 00:07:57,146 --> 00:07:59,796 idea on paper, in your head, but when you sit down in front 208 00:07:59,796 --> 00:08:01,976 of that keyboard and then have to start deciding how 209 00:08:01,976 --> 00:08:04,506 to implement this and where to put this feature, 210 00:08:04,806 --> 00:08:06,786 it's sometimes non obvious. 211 00:08:06,946 --> 00:08:13,276 So with that said, yes, we've pretty much rattled that off. 212 00:08:13,896 --> 00:08:16,526 So with this said, one of the fun, frankly, 213 00:08:16,566 --> 00:08:18,626 one of the most useful features that was proposed 214 00:08:18,626 --> 00:08:20,836 that was non-trivial for us to implement is this, 215 00:08:21,276 --> 00:08:23,726 I want to be able to ask what course, 216 00:08:23,776 --> 00:08:26,256 across any department can I take in order 217 00:08:26,256 --> 00:08:28,256 to satisfy the economics concentration 218 00:08:28,256 --> 00:08:29,536 or the government concentration. 219 00:08:29,536 --> 00:08:31,616 Obviously you can shop in the government catalogue, 220 00:08:31,826 --> 00:08:33,786 you can shop in the economics catalogue, but there's a lot 221 00:08:33,786 --> 00:08:36,216 of mathematics courses, a lot of the statistics courses, 222 00:08:36,216 --> 00:08:38,546 even CS courses that count for some of those concentrations. 223 00:08:38,776 --> 00:08:41,076 And believe it or not, to spite how many computers there are 224 00:08:41,076 --> 00:08:43,776 on this campus, there is no database that we can find 225 00:08:44,006 --> 00:08:45,956 that actually has that information. 226 00:08:45,956 --> 00:08:48,176 Rather you have to read through the student handbook, 227 00:08:48,176 --> 00:08:50,636 which is written by humans and pros with bulleted lists 228 00:08:50,636 --> 00:08:52,406 and so forth and figure it out. 229 00:08:52,406 --> 00:08:53,986 And so one of the things we thought we'd do 230 00:08:53,986 --> 00:08:55,676 which would be hugely compelling is 231 00:08:55,676 --> 00:08:57,386 to actually make this feature happen 232 00:08:57,386 --> 00:08:59,776 so that you can say I want to major in economics, 233 00:08:59,776 --> 00:09:02,946 show me all of the courses that I could take strategically 234 00:09:02,946 --> 00:09:04,466 to satisfy this concentration, 235 00:09:04,916 --> 00:09:07,816 even though they might not be inside of economics itself. 236 00:09:07,816 --> 00:09:10,166 There are 5,000 courses in the courses catalogue, 237 00:09:10,166 --> 00:09:11,776 there's like 40 concentration, 238 00:09:11,776 --> 00:09:13,246 all of which are written differently, 239 00:09:13,246 --> 00:09:16,266 so we thought we'd use a little word dejour [assumed spelling] 240 00:09:16,266 --> 00:09:18,086 these days of crowd sourcing. 241 00:09:18,086 --> 00:09:20,446 So if you would like to help us make this feature happen, 242 00:09:20,446 --> 00:09:22,996 the teaching fellows and CA's and I have been biting off some 243 00:09:22,996 --> 00:09:25,326 of the concentrations, we will share a Google doc with you, 244 00:09:25,586 --> 00:09:28,976 it will be prepped with a certain concentration's columns 245 00:09:29,226 --> 00:09:32,076 so that you can then just copy and paste into that spreadsheet 246 00:09:32,286 --> 00:09:35,096 which courses can count for a primary concentration 247 00:09:35,096 --> 00:09:37,786 or a secondary field for that particular department. 248 00:09:37,786 --> 00:09:39,136 And then within a few days, a few weeks, 249 00:09:39,136 --> 00:09:41,706 we will integrate this so that hopefully we will have a super 250 00:09:41,706 --> 00:09:44,566 duper app for everyone to benefit then from this spring. 251 00:09:44,566 --> 00:09:46,416 So drop me a note if you would like to do that. 252 00:09:46,526 --> 00:09:48,336 It's hard for one person to do this. 253 00:09:48,416 --> 00:09:51,246 With 500, hopefully, if we just get one percent of volunteers, 254 00:09:51,246 --> 00:09:52,596 we can do it pretty well. 255 00:09:52,686 --> 00:09:56,156 So idea. So when I gave on behalf 256 00:09:56,156 --> 00:09:59,726 of CS50 this little intro talk called Harvard Thinks Big, 257 00:09:59,926 --> 00:10:03,126 the gauntlet I threw down verbally was to have the crowd 258 00:10:03,126 --> 00:10:05,996 that was in Sanders that night actually give us ideas, 259 00:10:06,056 --> 00:10:08,596 not unlike problem set four, that they would loved 260 00:10:08,596 --> 00:10:11,766 to see solved with computers, with a machine, with software, 261 00:10:11,766 --> 00:10:14,776 something, frankly, that a CS50 student can do. 262 00:10:14,776 --> 00:10:17,356 And if you go to ideas.CS50.net, 263 00:10:17,666 --> 00:10:19,276 there weren't terribly many ideas, right? 264 00:10:19,276 --> 00:10:21,006 You can only expect a certain percentage of people 265 00:10:21,006 --> 00:10:22,986 to actually follow through on that. 266 00:10:23,016 --> 00:10:24,246 But there were some really good ones 267 00:10:24,246 --> 00:10:25,616 and also some really fun ones. 268 00:10:25,896 --> 00:10:28,876 I thought I'd share some of the highlights, so one suggestion 269 00:10:28,876 --> 00:10:30,866 that was made by some random person on the internet 270 00:10:30,866 --> 00:10:34,186 after that night was this, a course catalogue Android app, 271 00:10:34,186 --> 00:10:36,236 Android being an operating system that's increasingly 272 00:10:36,236 --> 00:10:37,356 popular on mobile phones. 273 00:10:37,636 --> 00:10:40,636 This is absolutely within the grasp of CS50 students, 274 00:10:40,636 --> 00:10:45,936 particularly because for every one of, for every one 275 00:10:45,936 --> 00:10:49,176 of the CS50 apps that we've talked about thus far, 276 00:10:49,176 --> 00:10:52,586 if you go to follow the link called API, 277 00:10:53,326 --> 00:10:55,406 the course provides all API's, 278 00:10:55,776 --> 00:10:57,696 application programming interfaces, 279 00:10:57,696 --> 00:10:58,846 which is just a fancy way 280 00:10:58,846 --> 00:11:02,176 of saying we have implemented some functions that you can call 281 00:11:02,476 --> 00:11:05,816 over the internet from language like JavaScript and PHP, 282 00:11:05,816 --> 00:11:08,056 so that if you want to get data from Harvard courses, 283 00:11:08,056 --> 00:11:10,016 if you want to get data from all of the events happening 284 00:11:10,016 --> 00:11:12,876 on campus, if you want to get data from the dining hall menu, 285 00:11:13,106 --> 00:11:14,806 you don't have to go and copy paste these things 286 00:11:14,806 --> 00:11:15,706 or figure out how to do that. 287 00:11:15,706 --> 00:11:17,576 We will just hand you the data that you want 288 00:11:17,706 --> 00:11:20,016 and you can really dive into the fun part of the problem, 289 00:11:20,016 --> 00:11:23,376 solving a workflow, solving a problem that you think you could 290 00:11:23,646 --> 00:11:24,916 when handed that data. 291 00:11:24,916 --> 00:11:26,586 So on something like this, 292 00:11:26,636 --> 00:11:28,486 if you already have an Android phone, 293 00:11:28,526 --> 00:11:30,076 by all means can you dive into this. 294 00:11:30,076 --> 00:11:33,066 As an aside, industry tends to be nice to the course. 295 00:11:33,476 --> 00:11:34,676 Research in Motion, the company 296 00:11:34,676 --> 00:11:36,656 that makes Blackberries has donated a bunch of phones 297 00:11:36,656 --> 00:11:38,456 to the course, so if you would like specifically 298 00:11:38,456 --> 00:11:41,016 to tackle a Blackberry based final project, 299 00:11:41,316 --> 00:11:45,496 we will let you know how to sign up for a lottery to receive one 300 00:11:45,496 --> 00:11:47,396 of those phone with which to develop that. 301 00:11:47,596 --> 00:11:50,596 So another great idea that came out of ideas.CS50.net, 302 00:11:50,596 --> 00:11:51,826 Harvard Travel, a website 303 00:11:51,826 --> 00:11:53,386 where Harvard students traveling during J-Term 304 00:11:53,386 --> 00:11:55,086 or summer can find another Harvard student 305 00:11:55,086 --> 00:11:57,676 who have been there or live there so as to get some info 306 00:11:57,676 --> 00:12:00,266 about that place, maybe even be hosted by that person, 307 00:12:00,266 --> 00:12:03,016 so a very compelling opportunity for one 308 00:12:03,016 --> 00:12:04,636 of us this final project season. 309 00:12:04,886 --> 00:12:07,996 An application to download all Facebook photos tagged of me. 310 00:12:08,056 --> 00:12:11,056 So that's not bad, whatever one's motivation for that, 311 00:12:11,206 --> 00:12:14,486 but also tenable, Facebook, too, provides an API 312 00:12:14,826 --> 00:12:17,286 with which you can interface with a lot of their data 313 00:12:17,506 --> 00:12:19,906 and use it to implement your own idea. 314 00:12:20,056 --> 00:12:21,466 Case and point, Harvard courses, 315 00:12:21,736 --> 00:12:24,286 you log in with your Facebook account 316 00:12:24,286 --> 00:12:28,376 so that you can see once you log in what your friends are taking. 317 00:12:28,376 --> 00:12:30,926 And as an aside, if the idea of signing into anything 318 00:12:30,926 --> 00:12:32,256 with Facebook kind of troubles you, 319 00:12:32,486 --> 00:12:34,986 realize there's a checkbox you can check on our application, 320 00:12:34,986 --> 00:12:36,456 at least, that hides your presence 321 00:12:36,456 --> 00:12:38,386 from your friends when shopping. 322 00:12:38,386 --> 00:12:39,226 You don't have to divulge 323 00:12:39,226 --> 00:12:40,486 to the world what you're actually doing. 324 00:12:40,806 --> 00:12:41,636 So this one's good, too, 325 00:12:41,636 --> 00:12:43,666 to reserve a practice room freshmen still have to go 326 00:12:43,666 --> 00:12:46,596 to the FDO and register on a sheet of paper, save some trees 327 00:12:46,596 --> 00:12:48,696 and save some time with a website. 328 00:12:48,776 --> 00:12:51,196 So again, a low hanging fruit there. 329 00:12:51,196 --> 00:12:53,386 Now, not all of them were perhaps 330 00:12:53,386 --> 00:12:55,266 as technical as that one. 331 00:12:55,266 --> 00:13:01,776 This student took the time to write, type included. 332 00:13:02,346 --> 00:13:05,506 This was actually one of my favorite suggestions for how 333 00:13:05,506 --> 00:13:09,616 to improve Harvard with a tool or website or piece of software. 334 00:13:09,616 --> 00:13:11,366 [ Laughter ] 335 00:13:11,366 --> 00:13:15,606 Alright, is this even, is this a thing now? 336 00:13:15,726 --> 00:13:17,776 Because I've never heard this. 337 00:13:17,776 --> 00:13:20,256 This one was similarly hardware oriented. 338 00:13:20,716 --> 00:13:22,716 [ Laughter ] 339 00:13:23,176 --> 00:13:25,686 We got a lot of things like this. 340 00:13:26,476 --> 00:13:28,356 So this is form spam. 341 00:13:28,356 --> 00:13:30,406 There's a lot of people with way too much time 342 00:13:30,406 --> 00:13:33,446 and malicious motivations where they don't send you spam trying 343 00:13:33,446 --> 00:13:35,646 to buy things, they just send you crap like this, 344 00:13:35,936 --> 00:13:38,146 and this was submitted automatically by a bot 345 00:13:38,146 --> 00:13:39,196 of some sort, a spider, 346 00:13:39,196 --> 00:13:41,726 a program that just crawls the web looking for forms to submit. 347 00:13:41,986 --> 00:13:43,806 And I actually, a little timidly, 348 00:13:43,806 --> 00:13:46,526 visited all of these URL's before class just to make sure. 349 00:13:46,526 --> 00:13:49,066 I'm not saying go to sketchy.com here. 350 00:13:49,336 --> 00:13:51,886 None of them even work and this was just posted a few days ago. 351 00:13:51,886 --> 00:13:55,626 And this fascinating theme of the world of spammers is one, 352 00:13:55,656 --> 00:13:59,216 they rarely speak or use tools with which 353 00:13:59,216 --> 00:14:00,606 to write perfect English. 354 00:14:00,606 --> 00:14:03,176 So frankly, a really good heuristic for detecting 355 00:14:03,176 --> 00:14:05,256 if an email from Bank of America is legit, 356 00:14:05,576 --> 00:14:08,256 Bank of America's probably not going to make an email riddled 357 00:14:08,256 --> 00:14:10,196 with spelling mistakes, so that's one heuristic. 358 00:14:10,196 --> 00:14:12,186 And it's funny, because you would thing if you're trying 359 00:14:12,186 --> 00:14:13,706 to capture a lot of people you'd run it 360 00:14:13,706 --> 00:14:16,796 through like Google translator or whatever the tool might be, 361 00:14:16,796 --> 00:14:19,976 or Microsoft Word, and you might actually buy the domain names 362 00:14:19,976 --> 00:14:21,736 that you're telling people to actually go to, 363 00:14:21,736 --> 00:14:23,676 because these are, in fact, dead ends, 364 00:14:23,676 --> 00:14:25,106 at least as of this morning. 365 00:14:25,366 --> 00:14:28,066 Other things that, this was interesting, 366 00:14:28,606 --> 00:14:31,186 that was on suggestion for how to improve Harvard. 367 00:14:31,836 --> 00:14:33,646 Another one, same person. 368 00:14:33,866 --> 00:14:35,866 [ Laughter ] 369 00:14:36,086 --> 00:14:37,196 These were from Carlos. 370 00:14:37,376 --> 00:14:38,066 Hello Carlos. 371 00:14:38,346 --> 00:14:44,076 And then perhaps related in spirit was this. 372 00:14:44,076 --> 00:14:44,143 [ Laughter ] 373 00:14:44,143 --> 00:14:47,906 So perhaps beware of that particular applications. 374 00:14:47,906 --> 00:14:52,326 So frankly, some of the juiciest data that we received, 375 00:14:52,326 --> 00:14:55,166 those things aside, were those from Problem Set 4. 376 00:14:55,166 --> 00:14:58,596 So frankly we could spend all day, all week, all month reading 377 00:14:58,596 --> 00:15:00,456 through some really interesting pros. 378 00:15:00,456 --> 00:15:01,626 I printed this out this morning. 379 00:15:01,626 --> 00:15:05,056 These are 84 pages, two sided, 12 point font, 380 00:15:05,396 --> 00:15:08,176 of all the gripes you people have about this place and tools 381 00:15:08,176 --> 00:15:10,176 that you've used, but it's really quite fascinating. 382 00:15:10,176 --> 00:15:12,606 So what we're going to do on problem set five is ask you 383 00:15:12,606 --> 00:15:13,906 with a checkbox are you okay 384 00:15:13,906 --> 00:15:16,166 with our posting your comments anonymously to the website, 385 00:15:16,326 --> 00:15:18,676 because there's really some fun comments in here. 386 00:15:18,676 --> 00:15:20,376 There's some really astute observations. 387 00:15:20,616 --> 00:15:22,606 And there's really, I think, some food for thought 388 00:15:22,606 --> 00:15:25,636 when it comes to identifying these same kinds 389 00:15:25,636 --> 00:15:27,296 of problems in your own life. 390 00:15:27,456 --> 00:15:31,036 It's curious some of the themes were many people commented 391 00:15:31,036 --> 00:15:34,896 on tools close to home like my.Harvard, Hollis, 392 00:15:35,546 --> 00:15:39,626 CVS in the square recently put in this self checkout system, 393 00:15:39,626 --> 00:15:41,316 which at the moment is a disaster, I think. 394 00:15:41,316 --> 00:15:43,646 I've never waited longer in lines at CVS than I have 395 00:15:43,826 --> 00:15:45,716 over the past couple of weeks. 396 00:15:45,716 --> 00:15:48,056 And this is, you know, probably a boom for them 397 00:15:48,056 --> 00:15:50,906 because they can hire fewer humans, but they put the burden 398 00:15:50,906 --> 00:15:52,696 on us and you see lines perpetually. 399 00:15:52,946 --> 00:15:53,746 So that was one theme. 400 00:15:53,746 --> 00:15:55,976 Bank of America, people had some gripes with their ATMs. 401 00:15:55,976 --> 00:15:59,246 And similar, dining services web site, apparently it's hard 402 00:15:59,246 --> 00:16:00,156 to find out what's for dinner 403 00:16:00,156 --> 00:16:01,526 and what the nutritional content is. 404 00:16:01,526 --> 00:16:06,016 And also a theme with sports sites, ESPN and MLB.com, 405 00:16:06,016 --> 00:16:08,196 it seems that perhaps there's a theme in that industry 406 00:16:08,466 --> 00:16:10,096 of not giving as much thought to some 407 00:16:10,096 --> 00:16:12,606 of the user interface issues. 408 00:16:12,656 --> 00:16:15,636 There were also some comments, of course, on CS50 stuff, 409 00:16:15,686 --> 00:16:18,326 to which we are quite sensitive and will address. 410 00:16:18,326 --> 00:16:20,606 A few things on color blindness issues 411 00:16:20,606 --> 00:16:22,376 where there are some color things we've done incorrectly, 412 00:16:22,376 --> 00:16:25,746 so we will take a stab at fixing those. 413 00:16:26,476 --> 00:16:27,616 And what else? 414 00:16:27,616 --> 00:16:29,446 Oh, there was one comment that it's really hard 415 00:16:29,446 --> 00:16:32,696 to find the course websites for previous years, 416 00:16:32,696 --> 00:16:35,446 previous semester's instantiations of courses. 417 00:16:35,876 --> 00:16:38,946 So in Harvard courses, CS50.net, if you expand the course, 418 00:16:38,946 --> 00:16:40,646 you can now see all prior year's websites. 419 00:16:40,646 --> 00:16:42,006 We found all the URLs for you. 420 00:16:42,506 --> 00:16:47,276 And let's see, anything else juicy here? 421 00:16:48,396 --> 00:16:51,426 Oh, okay. So have you ever tried to play a simple game of Halo 422 00:16:51,426 --> 00:16:54,356 on the Xbox 306, man's best gaming system? 423 00:16:54,676 --> 00:16:57,096 After turning on the machine you were required to navigate 424 00:16:57,096 --> 00:17:00,406 through multiple menu items and obscure system settings simply 425 00:17:00,406 --> 00:17:02,796 to get to, quote unquote, play Halo. 426 00:17:02,796 --> 00:17:05,236 I would venture that most people who turn on the Xbox 427 00:17:05,236 --> 00:17:08,106 with the Halo disc in it are trying to play Halo, 428 00:17:08,296 --> 00:17:10,636 not customize their avatar's mustachio. 429 00:17:10,636 --> 00:17:10,996 [ Laughter ] 430 00:17:10,996 --> 00:17:15,116 So this, too, is actually a recurring theme. 431 00:17:15,276 --> 00:17:17,466 My soapbox that day with the MBTA was 432 00:17:17,466 --> 00:17:20,256 that too often our tools are problem solved 433 00:17:20,396 --> 00:17:22,126 without the common cases in mind. 434 00:17:22,126 --> 00:17:24,796 And again, in the case of the MBTA, my God, the common case, 435 00:17:24,796 --> 00:17:26,446 I'm going to guess, 90 percent, 436 00:17:26,446 --> 00:17:28,796 80 percent of the time is someone wants to use the machine 437 00:17:28,796 --> 00:17:31,646 to just get on the next subway to some destination 438 00:17:31,816 --> 00:17:34,136 and it's eight steps, so a clear opportunity 439 00:17:34,136 --> 00:17:35,076 for doing something better. 440 00:17:35,146 --> 00:17:38,626 Here, you plug in, you turn on your Xbox to play Halo, 441 00:17:38,626 --> 00:17:41,166 apparently you have to jump through hoops just to play Halo. 442 00:17:41,166 --> 00:17:43,226 I have a PS3, which I rarely use, 443 00:17:43,526 --> 00:17:45,896 partly because every time I turn the damn thing on, 444 00:17:45,896 --> 00:17:47,446 because so many weeks have passed it wants 445 00:17:47,446 --> 00:17:49,766 to do a firmware update and download new software, 446 00:17:50,066 --> 00:17:51,436 but it forces you to do 447 00:17:51,436 --> 00:17:53,816 that before you can actually play the Blu-ray disc 448 00:17:53,816 --> 00:17:54,716 or play the game. 449 00:17:54,866 --> 00:17:56,756 You know, idiotic design decisions, 450 00:17:56,756 --> 00:17:59,026 not optimizing for the common case. 451 00:17:59,026 --> 00:18:02,456 So if you are amendable we will invite you to check off a box 452 00:18:02,456 --> 00:18:04,346 with which to share these ideas anonymously 453 00:18:04,626 --> 00:18:07,016 because it's certainly a good way at coming 454 00:18:07,016 --> 00:18:08,526 up with entrepreneurial ideas, 455 00:18:08,556 --> 00:18:10,236 coming up with good project ideas 456 00:18:10,476 --> 00:18:14,246 and really just making you a more astute technical person 457 00:18:14,316 --> 00:18:15,126 moving forward. 458 00:18:15,126 --> 00:18:18,776 The theme of this course and courses like it, CS1 and BIS 459 00:18:18,776 --> 00:18:20,026 and Quantitative Reasoning 460 00:18:20,026 --> 00:18:21,546 or Empirical Mathematical Reasoning, 461 00:18:21,546 --> 00:18:23,646 all of them from the CS department is really 462 00:18:23,646 --> 00:18:25,596 to output students that are not necessarily going to grow 463 00:18:25,596 --> 00:18:26,806 up to become computer scientists, 464 00:18:27,006 --> 00:18:30,526 not necessarily going to program ever again after this course, 465 00:18:30,526 --> 00:18:32,846 but are going to be in positions in society 466 00:18:32,846 --> 00:18:34,096 where you're making decisions 467 00:18:34,096 --> 00:18:35,596 and you're making technical decisions 468 00:18:35,596 --> 00:18:38,326 or you're making decisions that affect technical people and one 469 00:18:38,326 --> 00:18:40,976 of the best things we can do in any of these intricacies, 470 00:18:40,976 --> 00:18:42,826 frankly, is at least empower you 471 00:18:42,826 --> 00:18:47,206 to make more intelligent decisions than is currently done 472 00:18:47,276 --> 00:18:50,486 in all spheres of life, business and politics alike. 473 00:18:50,486 --> 00:18:53,256 So hopefully we will have some CS50 alumni 474 00:18:53,256 --> 00:18:55,086 in important places someday so as 475 00:18:55,116 --> 00:18:58,426 to make the world a more user friendly place, 476 00:18:58,546 --> 00:18:59,096 to say the lease. 477 00:18:59,396 --> 00:19:04,106 Alright, so a few, let's see, a few wrap up details here. 478 00:19:04,396 --> 00:19:08,456 So just to plant this seed, you can in fact manipulate data. 479 00:19:08,676 --> 00:19:10,086 This is a very awkward transition 480 00:19:10,086 --> 00:19:13,006 from changing the world to bit wise operations, but so be it. 481 00:19:13,416 --> 00:19:16,336 So bit wise operations, this is just a mechanism provided 482 00:19:16,336 --> 00:19:19,106 by a lot of languages with which you can manipulate data, 483 00:19:19,496 --> 00:19:22,606 variables, at a bit wise level. 484 00:19:22,606 --> 00:19:24,656 You can change individual bits, you can look 485 00:19:24,656 --> 00:19:26,966 at individual bits toward various ends. 486 00:19:27,006 --> 00:19:29,266 We looked at some simple examples on Monday at how 487 00:19:29,266 --> 00:19:31,226 to change things from uppercase to lower case, 488 00:19:31,506 --> 00:19:32,276 we generalized the idea 489 00:19:32,276 --> 00:19:34,636 of actually using am ask to extract data. 490 00:19:34,636 --> 00:19:37,476 And one of the themes of problem set five forensic is 491 00:19:37,476 --> 00:19:38,506 that kind of thing. 492 00:19:38,506 --> 00:19:40,456 Not necessarily using bit wise operators, 493 00:19:40,706 --> 00:19:43,466 but really controlling precisely the data that's on disc, 494 00:19:43,516 --> 00:19:44,506 the data that's in memory 495 00:19:44,756 --> 00:19:46,476 to extract those details that you want. 496 00:19:46,476 --> 00:19:48,166 So just so that you've seen these tools 497 00:19:48,166 --> 00:19:50,856 and can tuck them away if ever relevant in the short term, 498 00:19:51,196 --> 00:19:53,476 again the ampersand is for and-ing two bits. 499 00:19:53,566 --> 00:19:56,596 It's saying if this bit is one and this bit is one, 500 00:19:56,596 --> 00:19:59,096 then the output is also one, otherwise it's zero. 501 00:19:59,426 --> 00:20:02,326 Or it does something similar to Boolean operators. 502 00:20:02,326 --> 00:20:06,056 If it's a one or a, if this is a one or this is a one 503 00:20:06,146 --> 00:20:08,316 or they're both ones, the answer is again one. 504 00:20:08,666 --> 00:20:09,896 XOR is interesting. 505 00:20:09,896 --> 00:20:12,506 XOR actually gives you a lot of neat capabilities in life. 506 00:20:12,506 --> 00:20:17,606 XOR says that the answer is one if and only if only one 507 00:20:17,606 --> 00:20:19,356 of the inputs is the number one. 508 00:20:19,606 --> 00:20:22,906 So if your inputs are one and zero then the XOR 509 00:20:22,906 --> 00:20:25,086 of those two values is one. 510 00:20:25,386 --> 00:20:28,696 If your inputs are zero and one, then the output is one. 511 00:20:28,986 --> 00:20:31,416 But if the inputs are one and one or zero 512 00:20:31,416 --> 00:20:32,996 and zero, the output is zero. 513 00:20:32,996 --> 00:20:35,166 So exclusive or kind of does what the name implies. 514 00:20:35,406 --> 00:20:39,316 Exclusively one of the bits must be one and not the other. 515 00:20:39,316 --> 00:20:41,376 Now just as a teaser, using this operation 516 00:20:41,376 --> 00:20:43,156 like XOR you can do some neat things. 517 00:20:43,626 --> 00:20:46,526 In fact, if you've ever heard of RAID, an acronym, 518 00:20:46,526 --> 00:20:48,936 redundant array of independent disks, it has to do 519 00:20:48,936 --> 00:20:50,896 with the hardware world, but it's increasingly common 520 00:20:50,896 --> 00:20:52,856 in desktop computers as well as servers. 521 00:20:53,196 --> 00:20:57,156 Using this very simple operation you can ensure 522 00:20:57,476 --> 00:21:00,996 that if you have a computer with three hard drives you can end 523 00:21:00,996 --> 00:21:03,526 up using space equal to two of them 524 00:21:03,786 --> 00:21:07,656 and the third drive essentially just stores the XOR of all 525 00:21:07,656 --> 00:21:09,506 of the bits on the other two hard drives. 526 00:21:09,726 --> 00:21:13,466 And the implication of this very basic operation is that in 527 00:21:13,466 --> 00:21:15,906 such a computer that's using this technology called RAID, 528 00:21:16,006 --> 00:21:18,526 you have three hard drives, if any one of them dies, 529 00:21:18,896 --> 00:21:22,156 the additional one, thanks to XOR is enough information 530 00:21:22,376 --> 00:21:24,096 to recover all of the data. 531 00:21:24,416 --> 00:21:25,996 So it's kind of a special trick 532 00:21:26,256 --> 00:21:29,066 with which you can have error checking 533 00:21:29,126 --> 00:21:31,036 and error detection built into a system. 534 00:21:31,036 --> 00:21:32,266 Hugely advantageous. 535 00:21:32,556 --> 00:21:34,576 And if you're interested in that kind of stuff there are plenty 536 00:21:34,576 --> 00:21:36,776 of hardware courses down the pipeline. 537 00:21:36,776 --> 00:21:38,526 You can also do one neat thing. 538 00:21:38,526 --> 00:21:39,536 For the true geeks 539 00:21:39,536 --> 00:21:42,426 in the audience I just thought I'd show you this real fast, 540 00:21:42,426 --> 00:21:44,796 swap2.c. It was in Monday's printout. 541 00:21:45,056 --> 00:21:48,036 This is a copy paste from our many previous iterations 542 00:21:48,036 --> 00:21:49,216 of this notion of swapping. 543 00:21:49,486 --> 00:21:50,736 But it turns out I was lying to you 544 00:21:50,736 --> 00:21:53,006 when I said you need a temporary variable in order 545 00:21:53,006 --> 00:21:54,846 to swap two values, right? 546 00:21:54,846 --> 00:21:58,356 The problem always was that if you don't save A or B's value 547 00:21:58,356 --> 00:22:00,696 in a temp variable, you're going to clobber one, you're going 548 00:22:00,696 --> 00:22:02,166 to overwrite one and lost track of the other. 549 00:22:02,726 --> 00:22:05,466 Not so. If you really want to be a geek or you really want 550 00:22:05,466 --> 00:22:09,316 to be space conservative you can actually use the XOR operator, 551 00:22:09,376 --> 00:22:11,836 which is implemented in C as a little caret symbol, 552 00:22:12,126 --> 00:22:14,116 which is above the number six. 553 00:22:14,666 --> 00:22:16,436 This is the XOR operator. 554 00:22:16,756 --> 00:22:18,896 You can execute these three lines of code here 555 00:22:18,896 --> 00:22:20,526 down at the bottom and it turns 556 00:22:20,526 --> 00:22:24,856 out nearly magically these three lines together will 557 00:22:24,966 --> 00:22:31,666 simultaneously conceptually move the bits from A to B and B to A 558 00:22:31,666 --> 00:22:33,256 without losing any of them. 559 00:22:33,496 --> 00:22:35,056 So it's kind of a neat operation 560 00:22:35,326 --> 00:22:37,436 but not something we'll dwell on today. 561 00:22:38,106 --> 00:22:41,896 So things that we can do, though, include, 562 00:22:42,166 --> 00:22:44,406 or rather details that are relevant are this, 563 00:22:44,406 --> 00:22:45,756 one of the themes too that comes 564 00:22:45,756 --> 00:22:49,636 up in problem set five forensics is in actually reading bytes 565 00:22:49,636 --> 00:22:52,456 from disk and actually comparing them, looking for instance 566 00:22:52,456 --> 00:22:54,856 for the special signature that JPEGs have. 567 00:22:54,856 --> 00:22:56,146 If you haven't read the specs 568 00:22:56,146 --> 00:22:58,716 yet a JPEG image almost always starts 569 00:22:58,716 --> 00:23:00,596 with the same pattern of for bytes. 570 00:23:01,166 --> 00:23:02,976 And that pattern means here comes a JPEG. 571 00:23:02,976 --> 00:23:04,686 And if you see another such pattern 572 00:23:04,686 --> 00:23:07,006 that usually means here comes another JPEG. 573 00:23:07,006 --> 00:23:10,906 And leveraging that very simple hint, those for bytes alone, 574 00:23:11,136 --> 00:23:12,746 you can essentially, with high probability, 575 00:23:12,776 --> 00:23:13,796 recover lots of JPEGs. 576 00:23:13,796 --> 00:23:15,656 And that's one of the goals of that pset. 577 00:23:15,846 --> 00:23:16,986 Well, it turns out that depending 578 00:23:16,986 --> 00:23:19,356 on the operating system or the computer architecture, 579 00:23:19,496 --> 00:23:22,146 the CPU that you're using, sometimes numbers, 580 00:23:22,226 --> 00:23:25,586 if they are multi bytes, are integers are, as doubles are, 581 00:23:25,586 --> 00:23:28,706 as longs are, depending on the architecture, 582 00:23:28,706 --> 00:23:31,796 sometimes those bytes are laid out in opposite order. 583 00:23:31,796 --> 00:23:34,286 So just again to plant the seed so that you've seen it 584 00:23:34,286 --> 00:23:37,276 and if you run into certain bugs that the teaching fells address 585 00:23:37,276 --> 00:23:39,396 in section or office hours it makes a little more sense. 586 00:23:39,806 --> 00:23:42,986 Realize that there's two ways of implementing numbers on disk. 587 00:23:43,496 --> 00:23:47,296 You can either take all for bytes and essentially write them 588 00:23:47,296 --> 00:23:50,466 on disk left to right, or you can write them 589 00:23:50,466 --> 00:23:51,776 to disk from right to left. 590 00:23:51,976 --> 00:23:53,726 And the effects here, this is a little screen shot 591 00:23:53,726 --> 00:23:55,086 from Wikipedia, a nice little article 592 00:23:55,086 --> 00:23:56,186 that discusses this issue, 593 00:23:56,486 --> 00:23:59,906 essentially if you're using a big Indian architecture the 594 00:23:59,906 --> 00:24:02,116 number is laid out just as we've been thinking about it. 595 00:24:02,156 --> 00:24:04,606 But little Indian architectures, which is very much 596 00:24:04,606 --> 00:24:07,386 in vogue these days, Intel CPUs use this thing called little 597 00:24:07,386 --> 00:24:09,086 Indian, if you actually look at the bytes 598 00:24:09,086 --> 00:24:10,906 at a low level what you think should be 599 00:24:10,906 --> 00:24:12,876 over here is actually over here. 600 00:24:12,876 --> 00:24:14,566 And again, we won't spend time on this today, 601 00:24:14,566 --> 00:24:17,886 but just in case you run across certain bugs in problem set five 602 00:24:18,096 --> 00:24:19,676 and we point you toward this notion, 603 00:24:19,676 --> 00:24:21,546 just realize there is this subtlety 604 00:24:21,546 --> 00:24:22,986 that differs among architectures 605 00:24:22,986 --> 00:24:25,416 where numbers are stored differently. 606 00:24:25,636 --> 00:24:26,396 But you will be playing 607 00:24:26,396 --> 00:24:28,366 with pointers a bit, on this problem set. 608 00:24:28,476 --> 00:24:30,796 And one of the other files from Monday was this thing here, 609 00:24:30,796 --> 00:24:34,086 pointers1.c. This program is actually pretty old school 610 00:24:34,086 --> 00:24:34,636 at this point. 611 00:24:34,886 --> 00:24:36,736 This is just a program whose purpose in life is 612 00:24:36,766 --> 00:24:39,866 to get a string from the user using our get string function, 613 00:24:39,956 --> 00:24:41,706 then we store the return address 614 00:24:41,916 --> 00:24:45,166 in a variable called S. Why do we check for null here? 615 00:24:45,446 --> 00:24:46,756 What might actually go wrong here? 616 00:24:47,096 --> 00:24:49,096 [ Inaudible ] 617 00:24:49,436 --> 00:24:50,686 This may not be a string. 618 00:24:50,686 --> 00:24:54,216 And in what situations might get strings return null like that? 619 00:24:55,436 --> 00:24:57,116 Someone from over here, perhaps? 620 00:24:57,116 --> 00:24:58,296 [ Inaudible ] 621 00:24:58,296 --> 00:24:59,786 So not enough memory, right? 622 00:24:59,786 --> 00:25:01,636 And so if the user is being annoying or malicious 623 00:25:01,636 --> 00:25:03,496 or the computer is just very memory constrained, 624 00:25:03,686 --> 00:25:05,426 it the user types in something too long 625 00:25:05,506 --> 00:25:08,896 in memory then malloc is going to return null and recall 626 00:25:08,896 --> 00:25:12,136 that get string uses malloc and so this, too, will return null. 627 00:25:12,136 --> 00:25:15,356 If you don't check for null and just proceed along your way 628 00:25:15,356 --> 00:25:17,176 and dereference a potentially null pointer, 629 00:25:17,176 --> 00:25:18,006 what, of course happens? 630 00:25:19,146 --> 00:25:21,786 Seg fault with high probability, not always, 631 00:25:21,786 --> 00:25:22,756 but with high probability. 632 00:25:22,756 --> 00:25:25,186 So this part of the code here just does some old school 633 00:25:25,186 --> 00:25:25,696 stuff now. 634 00:25:25,696 --> 00:25:29,446 We loop from I equals zero on up to N, what is N? 635 00:25:29,446 --> 00:25:32,086 Well, we initialize N to the return value of string length. 636 00:25:32,086 --> 00:25:34,106 So this loop is going to iterate conceptually 637 00:25:34,106 --> 00:25:36,066 over every character in this string S 638 00:25:36,066 --> 00:25:37,476 and then it's just going 639 00:25:37,476 --> 00:25:41,076 to print it using S bracket I notation, each char one 640 00:25:41,076 --> 00:25:43,076 at a time, one character per line. 641 00:25:43,076 --> 00:25:44,486 So again, I say old school 642 00:25:44,486 --> 00:25:46,226 because there's nothing new here today. 643 00:25:46,396 --> 00:25:49,386 In fact, it's a bit of a regression using strings 644 00:25:49,386 --> 00:25:50,356 as though they were arrays, 645 00:25:50,356 --> 00:25:52,346 which is very convenient notation, 646 00:25:52,576 --> 00:25:53,716 but we've of course been talking 647 00:25:53,716 --> 00:25:55,346 about strings as pointers lately. 648 00:25:55,546 --> 00:25:58,096 So much so that we now have been preaching this habit 649 00:25:58,096 --> 00:26:01,256 of freeing any memory that you allocate with malloc. 650 00:26:01,646 --> 00:26:06,066 Or, in turn, any memory that you implicitly allocate using get 651 00:26:06,066 --> 00:26:08,606 string, because it again uses malloc, and so one of themes 652 00:26:08,606 --> 00:26:10,716 for Problem Set 6 will be to use that tool [inaudible] 653 00:26:10,956 --> 00:26:13,856 from Monday to actually help you find situations 654 00:26:13,986 --> 00:26:15,776 where you're forgetting things like free. 655 00:26:16,096 --> 00:26:19,026 But the fact that a string is really just a memory address, 656 00:26:19,676 --> 00:26:22,266 it turns out that you can do the same kind of program 657 00:26:22,626 --> 00:26:25,706 but treating S as thought it truly is a pointer. 658 00:26:25,956 --> 00:26:28,236 And so you can use something called pointer arithmetic. 659 00:26:28,236 --> 00:26:31,866 So again, just to plant the seed here that will recur perhaps 660 00:26:31,866 --> 00:26:34,706 in Problem Set 6 or in future courses if you continue on, 661 00:26:34,966 --> 00:26:37,096 this program, again, just gets a string from the user, 662 00:26:37,406 --> 00:26:38,806 stores the return address in S 663 00:26:38,806 --> 00:26:40,446 and does this little null sanity check. 664 00:26:40,836 --> 00:26:42,776 This for loop is identical up here. 665 00:26:43,006 --> 00:26:45,986 The only thing different about this program at the moment is 666 00:26:45,986 --> 00:26:48,836 that I'm using this, star parenthesis, 667 00:26:49,006 --> 00:26:50,956 S plus I close parenthesis. 668 00:26:51,416 --> 00:26:53,756 So star, remember, is the dereference operator. 669 00:26:53,806 --> 00:26:55,186 It means go there. 670 00:26:55,516 --> 00:26:57,936 So how is this actually doing the same thing? 671 00:26:58,226 --> 00:27:00,606 Well, just so it's clear, what both of these things are doing, 672 00:27:00,606 --> 00:27:05,596 let me run pointers, whoops, interesting bug, 673 00:27:06,156 --> 00:27:07,506 alright, standard lib. 674 00:27:07,666 --> 00:27:09,316 I screwed up so let me recompile. 675 00:27:09,516 --> 00:27:11,166 It was complaining that it couldn't find free. 676 00:27:11,526 --> 00:27:16,406 So let me run pointers to enter, hello, that's all this program 677 00:27:16,486 --> 00:27:18,136 and the predecessor did. 678 00:27:18,186 --> 00:27:20,776 It just prints the word, one character per line. 679 00:27:21,046 --> 00:27:22,516 But how does this version work? 680 00:27:22,766 --> 00:27:25,636 Well, if F is an address and it's pointing 681 00:27:25,636 --> 00:27:28,916 to conceptually a string, what it really is is the address 682 00:27:28,916 --> 00:27:33,106 of a character and the string is just the result of going step 683 00:27:33,106 --> 00:27:36,186 by step by step until you see what special character demarking 684 00:27:36,186 --> 00:27:36,886 the end of the string. 685 00:27:37,966 --> 00:27:40,636 So the null terminator, which is not N U L L, 686 00:27:40,636 --> 00:27:41,866 but backslash zero, right? 687 00:27:41,866 --> 00:27:43,716 That's the special character we look for. 688 00:27:43,956 --> 00:27:46,646 So what this is really doing is it's saying what? 689 00:27:46,646 --> 00:27:49,976 Take S, which is an address, it's like OX 1, 2, 3, 4 or 5, 690 00:27:49,976 --> 00:27:53,026 some location in RAM, it's adding I to it. 691 00:27:53,026 --> 00:27:54,336 Now initially I is zero. 692 00:27:54,336 --> 00:27:56,846 So initially this is meaningless, it's just S. 693 00:27:57,036 --> 00:27:59,066 So star S means go there. 694 00:27:59,156 --> 00:28:01,466 Well, when you go to S, what's waiting for you? 695 00:28:01,956 --> 00:28:03,866 Just a character right? 696 00:28:03,866 --> 00:28:05,816 If it's the address of a character and you go there, 697 00:28:05,816 --> 00:28:07,536 obviously what's waiting is a character 698 00:28:07,686 --> 00:28:09,876 and so we can continue using percent C 699 00:28:09,876 --> 00:28:11,326 and we print out that character. 700 00:28:11,426 --> 00:28:13,396 Well, now iterate one more time in the loop. 701 00:28:13,396 --> 00:28:18,186 So now I equals one, so S plus I is whatever the address is, 1, 702 00:28:18,186 --> 00:28:20,796 2, 3, now we add one, now it's 1, 2, 703 00:28:20,996 --> 00:28:24,906 4 and so now we say go to address 1, 2, 4. 704 00:28:24,906 --> 00:28:26,086 Well what's waiting for you there? 705 00:28:27,106 --> 00:28:29,486 You know, the next character, because arrays are continuous. 706 00:28:29,696 --> 00:28:31,276 So it's neat at this point in the course, 707 00:28:31,276 --> 00:28:34,336 now that we've pealed back this layer of the CS50 library 708 00:28:34,336 --> 00:28:36,896 and this layer of array notation for everything. 709 00:28:37,426 --> 00:28:40,026 In fact, when you have a pointer, you can go anywhere 710 00:28:40,026 --> 00:28:41,176 in memory that you want. 711 00:28:41,176 --> 00:28:43,506 And the fact that you can just come up with an arbitrary value, 712 00:28:43,716 --> 00:28:45,906 perhaps arithmetically like this with some addition, 713 00:28:46,156 --> 00:28:48,076 means that you have great power, but again, 714 00:28:48,366 --> 00:28:50,276 also great risk in your programs. 715 00:28:50,276 --> 00:28:51,546 As we've discussed in the context 716 00:28:51,546 --> 00:28:53,856 of buffer overrun exploits because if you can do it, 717 00:28:54,116 --> 00:28:57,356 so might someone who actually has malicious intent 718 00:28:57,356 --> 00:29:00,376 and somehow tricks you into running code that you, 719 00:29:00,376 --> 00:29:01,626 yourself, did not write. 720 00:29:01,686 --> 00:29:03,826 So that is generally known as pointer arithmetic, 721 00:29:04,046 --> 00:29:05,296 but it's really just how 722 00:29:05,296 --> 00:29:08,556 that square bracket notation has been working all of this time. 723 00:29:08,906 --> 00:29:12,516 So any questions about bit wise operators or pointers? 724 00:29:12,966 --> 00:29:17,826 >> Isn't it adding a pointer to an integer? 725 00:29:17,826 --> 00:29:19,626 >> Isn't it adding a pointer to an integer? 726 00:29:19,746 --> 00:29:23,546 Yes. But that's okay because in this case the integer will be 727 00:29:23,546 --> 00:29:25,626 essentially implicitly cast up to an address 728 00:29:25,626 --> 00:29:26,596 and it will work okay. 729 00:29:26,936 --> 00:29:29,136 This notion of pointer arithmetic is built 730 00:29:29,136 --> 00:29:31,936 into the language, so it is permissible, even though it is, 731 00:29:31,936 --> 00:29:33,346 in fact, two different types, 732 00:29:33,456 --> 00:29:37,026 a long and a pointer, but it's legit. 733 00:29:37,266 --> 00:29:41,286 Alright. So now the proverbial fun begins 734 00:29:41,286 --> 00:29:42,196 that I keep promising. 735 00:29:42,196 --> 00:29:44,966 If only because no we get to really think about and talk 736 00:29:44,966 --> 00:29:46,966 about some ideas that you can start taking 737 00:29:46,966 --> 00:29:47,986 for granted, frankly. 738 00:29:47,986 --> 00:29:50,276 Once we get to web based programming you'll find 739 00:29:50,276 --> 00:29:54,886 that in languages like C, I'm sorry, languages like PHP 740 00:29:54,886 --> 00:29:58,436 and JavaScript and Python and Ruby and Java, 741 00:29:58,436 --> 00:30:01,116 there's so many other languages out there, and rest assured, 742 00:30:01,116 --> 00:30:03,936 frankly, generally, you just need a good foundation 743 00:30:03,936 --> 00:30:06,236 in one language, maybe two, three languages, 744 00:30:06,236 --> 00:30:07,946 and the rest you can teach yourself. 745 00:30:07,946 --> 00:30:11,216 I mean, case and point in school I took CS50 where we learned C, 746 00:30:11,696 --> 00:30:13,726 CS51 where we learned a language called LISP 747 00:30:13,816 --> 00:30:15,656 and C plus plus, and that was it. 748 00:30:15,656 --> 00:30:17,836 I never again formally learned any other languages. 749 00:30:17,836 --> 00:30:19,686 I bootstrapped myself from that process. 750 00:30:19,916 --> 00:30:23,136 And this isn't because I was particularly cut out for CS, 751 00:30:23,136 --> 00:30:26,596 but it's just really is true that the marginal returns 752 00:30:26,596 --> 00:30:28,676 on some foundations and different types 753 00:30:28,676 --> 00:30:30,146 of languages is pretty huge, 754 00:30:30,146 --> 00:30:32,656 so one of the themes next week onward is that we're going 755 00:30:32,656 --> 00:30:35,146 to move pretty darn quickly through PHP and JavaScript. 756 00:30:35,146 --> 00:30:37,146 I'm not going to teach you the whole language, 757 00:30:37,146 --> 00:30:38,926 because frankly you're going to see 758 00:30:38,926 --> 00:30:40,336 that for loops look almost the same, 759 00:30:40,336 --> 00:30:41,706 y loops look almost the same. 760 00:30:41,896 --> 00:30:43,246 There's going to be some more functions in PHP, 761 00:30:43,246 --> 00:30:45,916 which is great, because it means you can implement fewer 762 00:30:45,916 --> 00:30:47,886 uninteresting things yourself. 763 00:30:47,886 --> 00:30:49,716 But what you'll find is that one of the goals 764 00:30:49,716 --> 00:30:52,896 of the next few weeks is to help you realize by doing it together 765 00:30:53,186 --> 00:30:56,676 that teaching yourself new languages is absolutely quite 766 00:30:56,676 --> 00:30:59,626 possible and quite useful when you're in some lab or some job. 767 00:30:59,996 --> 00:31:01,946 Even, frankly, in the financial and consulting industry, 768 00:31:01,946 --> 00:31:04,646 where a lot of my friends, case and point, never once programmed 769 00:31:04,646 --> 00:31:07,616 in school but are now doing it as part of their analytics work. 770 00:31:07,616 --> 00:31:08,066 So you learn. 771 00:31:08,256 --> 00:31:09,876 And better to learn in advance than have 772 00:31:09,956 --> 00:31:12,606 to learn on that kind of job. 773 00:31:12,606 --> 00:31:15,486 So we talked about stacks as a new tool with which 774 00:31:15,486 --> 00:31:17,846 to implement a data structure. 775 00:31:17,846 --> 00:31:19,766 And a data structure is generally just a convenient 776 00:31:19,766 --> 00:31:21,266 mechanism for representing data 777 00:31:21,596 --> 00:31:23,876 so that you can solve more interesting problems 778 00:31:23,876 --> 00:31:26,616 or solve problems more efficiently. 779 00:31:27,006 --> 00:31:28,686 So arrays were fairly limited, 780 00:31:28,686 --> 00:31:31,196 link lists gave us some dynamism, but the problem 781 00:31:31,196 --> 00:31:34,136 with link lists really is that it's just this long line 782 00:31:34,136 --> 00:31:35,826 of people, long line of numbers. 783 00:31:35,826 --> 00:31:39,186 We lost the division and conquer capabilities. 784 00:31:39,186 --> 00:31:40,806 We lost the binary search capabilities 785 00:31:40,806 --> 00:31:42,056 that we started the course with. 786 00:31:42,286 --> 00:31:43,906 So can we get some of those features back 787 00:31:43,906 --> 00:31:45,206 and get more of them? 788 00:31:45,376 --> 00:31:46,796 Well, just to tidy up a loose end, 789 00:31:46,796 --> 00:31:48,546 I felt bad in that I completely, 790 00:31:48,546 --> 00:31:52,816 I felt I did not express very clearly how one might go 791 00:31:52,816 --> 00:31:54,346 about implementing a stack or a queue 792 00:31:54,346 --> 00:31:55,856 and we won't spend great time on it 793 00:31:55,856 --> 00:31:58,806 because you can spend entire weeks discussing the 794 00:31:58,806 --> 00:32:00,016 implementation details, but I 795 00:32:00,016 --> 00:32:02,356 at least thought I'd propose a canonical example to kind 796 00:32:02,356 --> 00:32:03,866 of clean up my verbal mess from Monday. 797 00:32:04,356 --> 00:32:09,186 So here is how you might implement a stack using some 798 00:32:09,186 --> 00:32:10,096 basic building blocks. 799 00:32:10,096 --> 00:32:13,746 So again, a stack is akin to the stack of trays in a dormitory. 800 00:32:14,006 --> 00:32:16,126 So type def struct says here comes a struct, 801 00:32:16,416 --> 00:32:18,716 the last word there says stack, which means I'm going 802 00:32:18,716 --> 00:32:20,296 to give this structure the name stack, 803 00:32:20,626 --> 00:32:23,156 so really the interesting part is inside of the curly braces. 804 00:32:23,376 --> 00:32:25,126 Now a queue, or rather a stack, 805 00:32:25,716 --> 00:32:27,276 presumably has some fixed length. 806 00:32:27,276 --> 00:32:29,746 That's generally the assumption in the world of stacks that, 807 00:32:29,996 --> 00:32:33,716 you know, yes, you could stack cafeteria trays ad nauseum, 808 00:32:33,756 --> 00:32:35,056 but at some point it's going to fall over. 809 00:32:35,056 --> 00:32:36,776 So realistically there is some upper bound. 810 00:32:36,916 --> 00:32:38,426 And certainly in a computer there's going 811 00:32:38,426 --> 00:32:39,836 to be an upper bound based on the amount 812 00:32:39,836 --> 00:32:40,836 of memory that you have. 813 00:32:41,046 --> 00:32:43,576 So it stands to reason that this data structure generally has a 814 00:32:43,576 --> 00:32:46,676 fixed size, which means we can actually implement it using 815 00:32:46,676 --> 00:32:47,126 an array. 816 00:32:47,206 --> 00:32:49,806 If we know it's going to be a fixed size, we can just allocate 817 00:32:49,806 --> 00:32:51,346 that memory at the beginning. 818 00:32:51,626 --> 00:32:54,466 Let's assume that capacity, in all capitals here, 819 00:32:54,466 --> 00:32:57,206 is just a constant declared somewhere else that's a hundred 820 00:32:57,206 --> 00:32:59,276 or 200 or whatever, it's a fixed value. 821 00:32:59,646 --> 00:33:02,246 So this says the structure called stack is going 822 00:33:02,246 --> 00:33:05,906 to have an array inside of it, in this case called numbers, 823 00:33:06,116 --> 00:33:09,216 so this appears to be a stack that stores just numbers. 824 00:33:09,216 --> 00:33:11,536 So pretty simple application, just stores some numbers. 825 00:33:11,796 --> 00:33:13,506 But I need to remember some key details. 826 00:33:13,776 --> 00:33:16,536 Initially I might have space for 100 integers, 827 00:33:16,776 --> 00:33:19,096 but there's not necessarily 100 integers in there. 828 00:33:19,126 --> 00:33:21,486 I probably have 100 garbage values. 829 00:33:21,656 --> 00:33:25,196 Or if I take some care, I might have 100 zeros or negative ones, 830 00:33:25,196 --> 00:33:27,636 if I initialize the whole thing to some known values. 831 00:33:27,926 --> 00:33:30,166 But the point is that it's not sufficient 832 00:33:30,166 --> 00:33:32,656 to just say here's a chunk of memory, now you have a stack. 833 00:33:33,156 --> 00:33:35,616 You now need to have some meta data associated 834 00:33:35,616 --> 00:33:38,326 with the structure to remember what's in there and how much 835 00:33:38,326 --> 00:33:41,266 of it's in there so that you don't risk returning some 836 00:33:41,266 --> 00:33:43,846 garbage value or overstepping the bounds of your array. 837 00:33:44,166 --> 00:33:46,266 So I'm going to remember one, the size. 838 00:33:46,266 --> 00:33:48,256 So size is distinct from capacity. 839 00:33:48,256 --> 00:33:50,436 Capacity means how much space is there 840 00:33:50,696 --> 00:33:53,256 and size means how much space are you actually using. 841 00:33:53,256 --> 00:33:55,886 So initially I've probably initialized size to zero 842 00:33:56,126 --> 00:33:58,006 because there's nothing in an empty array 843 00:33:58,296 --> 00:34:00,086 by default, and then top. 844 00:34:00,246 --> 00:34:01,716 So I chose top specifically 845 00:34:01,716 --> 00:34:03,466 because if I'm stacking cafeteria trays, 846 00:34:03,836 --> 00:34:06,896 really the only number I care about is 847 00:34:06,896 --> 00:34:10,136 where in this array is the topmost tray? 848 00:34:10,186 --> 00:34:12,326 Where in this array is the topmost number? 849 00:34:12,606 --> 00:34:16,156 Now why would the topmost number not just always be 850 00:34:16,206 --> 00:34:17,266 at bracket zero? 851 00:34:17,446 --> 00:34:19,556 Right? If I have an array it feels like the logical thing 852 00:34:19,556 --> 00:34:22,806 to do is to put this top of the stack in bracket zero. 853 00:34:23,766 --> 00:34:26,936 Now why is that not, why is a stack's implementation not 854 00:34:26,936 --> 00:34:27,696 as simple as that? 855 00:34:27,696 --> 00:34:29,066 Why do I need to remember separately 856 00:34:29,066 --> 00:34:30,146 where the top of it is? 857 00:34:30,866 --> 00:34:30,986 Yes? 858 00:34:30,986 --> 00:34:31,716 >> [Inaudible] 859 00:34:31,716 --> 00:34:33,046 >> Yes. Exactly. 860 00:34:33,046 --> 00:34:36,856 So it's a design decision. 861 00:34:36,856 --> 00:34:40,596 I could absolutely start by putting numbers into this stack 862 00:34:41,016 --> 00:34:43,236 by just starting at bracket zero, then bracket one, 863 00:34:43,236 --> 00:34:44,046 then bracket two, right, 864 00:34:44,046 --> 00:34:45,696 old school style, totally reasonable. 865 00:34:46,026 --> 00:34:50,296 But if the point is to maintain some kind of order, 866 00:34:50,296 --> 00:34:54,446 I need to remember which one went in last so that I take it 867 00:34:54,446 --> 00:34:58,046 out first, so it's a LIFO structure, last in, first out. 868 00:34:58,276 --> 00:35:00,776 So in that case, if I start at bracket zero putting 869 00:35:00,776 --> 00:35:02,556 in some number, I put a tray down. 870 00:35:02,736 --> 00:35:05,016 Then I do it again at bracket one, that's another tray, 871 00:35:05,016 --> 00:35:07,816 then another number, that's a third tray. 872 00:35:08,066 --> 00:35:09,056 Really the number I care 873 00:35:09,056 --> 00:35:10,776 about at this point is not bracket zero, 874 00:35:10,776 --> 00:35:13,186 because that's actually the bottommost tray 875 00:35:13,186 --> 00:35:14,896 that I don't really care about until I get there. 876 00:35:15,156 --> 00:35:17,306 I need to remember bracket two at this point. 877 00:35:17,566 --> 00:35:21,556 And so using this variable, this meta data like top can allow me 878 00:35:21,556 --> 00:35:24,346 to remember that the current top of this tray, this stack, 879 00:35:24,776 --> 00:35:26,346 is just the number two. 880 00:35:26,346 --> 00:35:31,556 And size, by contrast, is also, is three in this case 881 00:35:31,596 --> 00:35:34,016 because I've put three elements onto the data structure. 882 00:35:34,166 --> 00:35:34,296 Yes? 883 00:35:34,886 --> 00:35:37,876 >> Is size always top+1? 884 00:35:37,876 --> 00:35:38,446 >> Good question. 885 00:35:38,446 --> 00:35:40,756 Would size always be top+1? 886 00:35:40,996 --> 00:35:44,506 It could be, but what if we start putting 887 00:35:44,506 --> 00:35:47,696 in like 100 numbers then we start popping off numbers? 888 00:35:47,696 --> 00:35:50,216 Pop is the lexicon for removing things from a stack, 889 00:35:50,576 --> 00:35:52,066 much like a cafeteria stack. 890 00:35:52,476 --> 00:35:55,666 So if I do this, I start putting in number, number, number, 891 00:35:55,666 --> 00:35:59,126 number, number, number, number, and now I hit the end 892 00:35:59,536 --> 00:36:03,426 of the queue and I've started popping, I've kept going up 893 00:36:03,626 --> 00:36:06,826 and up and up and up, I've started putting numbers 894 00:36:06,826 --> 00:36:09,546 into the queue, actually, is this? 895 00:36:10,036 --> 00:36:12,826 Let me debug this in my head for a moment. 896 00:36:12,826 --> 00:36:17,346 Yes, it's kind of stupid. 897 00:36:17,636 --> 00:36:19,256 Alright, so this is, in fact, not necessary. 898 00:36:19,256 --> 00:36:20,386 I was conflating it with a queue. 899 00:36:20,386 --> 00:36:21,756 So I will fix this verbally in a moment. 900 00:36:21,796 --> 00:36:23,306 I almost had it correct this time. 901 00:36:23,526 --> 00:36:26,666 So let me come back in just a moment and transition to this, 902 00:36:26,666 --> 00:36:28,536 if the wave of a hand, but then we'll fix 903 00:36:28,536 --> 00:36:29,526 that one in just a moment. 904 00:36:29,526 --> 00:36:30,646 So the other notion was a queue. 905 00:36:30,886 --> 00:36:33,556 So a queue is the real world's concept of actually having 906 00:36:33,556 --> 00:36:35,636 like a line of people and you want to maintain fairness 907 00:36:35,636 --> 00:36:38,136 so it's a FIFO data structure, first in first out. 908 00:36:38,396 --> 00:36:40,756 The first person into the structure is the one you want 909 00:36:40,756 --> 00:36:42,976 to remove from the structure first, 910 00:36:43,256 --> 00:36:45,316 and so we might implement this as follows, 911 00:36:45,316 --> 00:36:48,756 whereby we have an array called numbers, again, 912 00:36:48,816 --> 00:36:50,146 with the same kind of capacity. 913 00:36:50,466 --> 00:36:51,956 Here we have size. 914 00:36:52,246 --> 00:36:54,026 But here, we might want to remember 915 00:36:54,116 --> 00:36:56,376 where the head of the list is. 916 00:36:56,426 --> 00:36:57,546 This is what I was thinking here. 917 00:36:57,546 --> 00:37:00,236 So in the context of a queue, you might very reasonably, 918 00:37:00,236 --> 00:37:02,206 in your code, put the first person at bracket zero, 919 00:37:02,316 --> 00:37:03,526 then the next person at bracket one, 920 00:37:03,656 --> 00:37:05,116 then the next person at bracket two. 921 00:37:05,576 --> 00:37:08,686 So at that point in the story, size is three, but head, 922 00:37:08,826 --> 00:37:12,496 the front of the list is zero, where ever the first person was. 923 00:37:12,696 --> 00:37:15,216 Now, if you remove the first person from the queue, 924 00:37:15,406 --> 00:37:17,036 you now conceptually have this hole 925 00:37:17,036 --> 00:37:20,386 in your array whereby you still have a space at bracket zero 926 00:37:20,386 --> 00:37:22,436 but there's no human actually waiting in line there. 927 00:37:22,436 --> 00:37:23,266 So what could you do? 928 00:37:23,536 --> 00:37:24,856 You could just move all of these humans, 929 00:37:24,856 --> 00:37:28,036 or move all these integer over by sliding them in the array. 930 00:37:28,036 --> 00:37:31,446 But what's the running time of shifting an array over one space 931 00:37:31,496 --> 00:37:32,766 to the right or to the left? 932 00:37:34,366 --> 00:37:35,226 It's linear, right? 933 00:37:35,226 --> 00:37:37,336 In the worst case you've got an almost full queue, 934 00:37:37,336 --> 00:37:39,686 you've got to move every darn number over to the right 935 00:37:39,946 --> 00:37:42,026 or to the left and what's the point, right? 936 00:37:42,026 --> 00:37:45,306 I can just remember where the head of the queue is. 937 00:37:45,406 --> 00:37:47,156 And so the size might be decreasing, 938 00:37:47,406 --> 00:37:48,436 but the head is moving. 939 00:37:48,436 --> 00:37:51,596 And then in this case with the queue, I might to save time 940 00:37:51,636 --> 00:37:54,386 and memory, I might want to wrap around at the end 941 00:37:54,386 --> 00:37:57,346 and reuse all space, in which case I might very well want 942 00:37:57,346 --> 00:37:59,446 to remember where the tail of the queue is 943 00:37:59,686 --> 00:38:02,946 so that I essentially treat my array conceptually 944 00:38:02,946 --> 00:38:04,086 as a circular structure. 945 00:38:04,086 --> 00:38:06,816 I don't care where the start or the end is in memory. 946 00:38:06,816 --> 00:38:09,096 I just have to keep track of these variables. 947 00:38:09,096 --> 00:38:10,536 So I'll go back before long and I'll clean 948 00:38:10,956 --> 00:38:12,466 up the stacks structure. 949 00:38:12,976 --> 00:38:16,676 But let's focus now on ones we'll deploy in Problem Set 6. 950 00:38:16,706 --> 00:38:19,586 So Problem Set 6, we'll hand you a huge text file 951 00:38:19,586 --> 00:38:23,006 with 140 some odd thousand words from an actual dictionary 952 00:38:23,006 --> 00:38:24,586 that you might use in Microsoft Word 953 00:38:24,826 --> 00:38:27,266 or really any tool that can spell check. 954 00:38:27,346 --> 00:38:29,276 Unfortunately it's just a text file. 955 00:38:29,276 --> 00:38:32,856 It's got one word per line, but all it is is a text file. 956 00:38:33,046 --> 00:38:34,416 And so one of the thing you're going to have 957 00:38:34,416 --> 00:38:37,056 to use is your newfound comfort, or by Friday new found comfort 958 00:38:37,056 --> 00:38:40,946 with file I/O to read those words in from disk and then 959 00:38:40,946 --> 00:38:42,176 to load them in from memory. 960 00:38:42,176 --> 00:38:43,606 Now the simplest implementation 961 00:38:43,606 --> 00:38:47,676 of a spell checker might just be how I allocate an array that's 962 00:38:47,676 --> 00:38:54,286 140,000 in size so that you have one location for each string, 963 00:38:54,496 --> 00:38:57,076 and then you load each of the strings into those locations. 964 00:38:57,336 --> 00:39:00,416 The upside of this, because the file we'll give you is, in fact, 965 00:39:00,416 --> 00:39:02,086 alphabetized, which might be useful. 966 00:39:02,426 --> 00:39:05,846 The upside of this is that now you can use like binary search. 967 00:39:05,846 --> 00:39:07,606 If just conceptually you've got a huge array 968 00:39:07,606 --> 00:39:09,836 and you don't have numbers and you don't have humans in each 969 00:39:09,836 --> 00:39:12,086 of the slots, you have actual words, well there's 970 00:39:12,086 --> 00:39:14,266 that function called stir comp, stir compare, 971 00:39:14,296 --> 00:39:17,636 you can absolutely use binary search on a really big array 972 00:39:17,926 --> 00:39:19,186 to find some words so that 973 00:39:19,186 --> 00:39:21,966 when your program asks a question is the word the user 974 00:39:21,966 --> 00:39:23,416 just typed in spelled correctly? 975 00:39:23,616 --> 00:39:26,016 You just go boom, boom, boom, boom, boom using binary search 976 00:39:26,016 --> 00:39:28,536 on your array and you can return true or false. 977 00:39:28,896 --> 00:39:30,986 Now is that the best approach? 978 00:39:31,496 --> 00:39:32,396 It's not bad. 979 00:39:32,426 --> 00:39:34,306 What's the burning time of binary search on an array? 980 00:39:35,676 --> 00:39:38,056 It's a log end, and log end frankly has kicked the butt 981 00:39:38,056 --> 00:39:40,446 of every other algorithm we've looked at just yet, 982 00:39:40,506 --> 00:39:43,236 but as promised on Monday the holy grail really would be 983 00:39:43,506 --> 00:39:44,406 constant time. 984 00:39:44,406 --> 00:39:46,606 Like I don't even want to wait for log end time, 985 00:39:46,606 --> 00:39:49,546 which for large data sets it might be asymptotically fast, 986 00:39:49,546 --> 00:39:51,366 it might be theoretically fast, 987 00:39:51,476 --> 00:39:53,066 but if you have large enough data sets, 988 00:39:53,066 --> 00:39:55,946 not just 140,000 words, but a billion web pages 989 00:39:55,946 --> 00:40:00,106 or 500 million users on Facebook, even binary search 990 00:40:00,106 --> 00:40:02,706 on a data set that size might very well take some time. 991 00:40:02,906 --> 00:40:05,246 So certainly returning it in constant time, 992 00:40:05,246 --> 00:40:07,836 one step conceptually would be optimal. 993 00:40:07,836 --> 00:40:11,206 And so enter into this picture the notion of a hash table. 994 00:40:11,266 --> 00:40:14,346 A hash table is often implemented as an array, 995 00:40:14,776 --> 00:40:17,936 but it's implemented as an array that's bigger than it needs 996 00:40:17,936 --> 00:40:22,806 to be to store all of the words or the numbers or the whatever's 997 00:40:23,066 --> 00:40:24,516 that you want to store in it. 998 00:40:24,516 --> 00:40:28,256 And the idea is essentially if this is my hash table, 999 00:40:28,256 --> 00:40:29,216 this is just a big array 1000 00:40:29,216 --> 00:40:31,536 and on the left hand side I've just numbered the locations 1001 00:40:31,536 --> 00:40:34,616 in a traditional way, zero at the top, one, two, three, dot, 1002 00:40:34,616 --> 00:40:37,456 dot, dot, down to 25 in this case, you can think 1003 00:40:37,456 --> 00:40:41,056 of the hash table conceptually as being a data structure 1004 00:40:41,056 --> 00:40:43,626 where you take some input, whether it's a number 1005 00:40:43,626 --> 00:40:46,466 or a string, and then you, kind of like darts, 1006 00:40:46,866 --> 00:40:50,656 throw it into the data structure and hope that it actually sticks 1007 00:40:51,026 --> 00:40:53,446 and doesn't hit some existing value. 1008 00:40:53,556 --> 00:40:55,066 Now that would be more dramatic if it actually stuck, 1009 00:40:55,276 --> 00:40:57,456 but that was off the top of my head. 1010 00:40:57,456 --> 00:40:58,816 We'll come with sticky things next year. 1011 00:40:59,076 --> 00:41:03,576 But if you throw this input, this number or this string, 1012 00:41:03,576 --> 00:41:06,756 into the data structure you're going to hope that it lands 1013 00:41:06,756 --> 00:41:08,886 in some blank spot as the board currently is. 1014 00:41:09,036 --> 00:41:10,826 And that's great, because how much effort did it take 1015 00:41:10,826 --> 00:41:12,956 to pull it to my string or my int 1016 00:41:12,956 --> 00:41:14,546 in pretty much a random location? 1017 00:41:14,866 --> 00:41:15,516 It was one step. 1018 00:41:15,516 --> 00:41:18,176 It took me one throw to actually hit the data structure. 1019 00:41:18,296 --> 00:41:19,226 But there's a downside. 1020 00:41:19,276 --> 00:41:20,566 What's an obvious downside? 1021 00:41:20,566 --> 00:41:21,906 [ Inaudible ] 1022 00:41:21,906 --> 00:41:23,656 So if there's something there, right? 1023 00:41:23,656 --> 00:41:26,086 So this idea of probabilistic insertion 1024 00:41:26,086 --> 00:41:28,346 into a data structure could get you into trouble. 1025 00:41:28,346 --> 00:41:31,106 If you've got a huge data structure and only three pens 1026 00:41:31,146 --> 00:41:33,816 to throw into it, you know, statistically you're not going 1027 00:41:33,816 --> 00:41:35,226 to have any collisions 1028 00:41:35,496 --> 00:41:38,706 and so odds are you'll have too much space, 1029 00:41:38,706 --> 00:41:40,856 but at least you won't collide, which means all 1030 00:41:40,856 --> 00:41:42,376 of your data will fit in there. 1031 00:41:42,826 --> 00:41:43,846 And so that's pretty good. 1032 00:41:43,846 --> 00:41:46,256 But if now you have lots of data in there, 1033 00:41:46,486 --> 00:41:49,086 the odds of throwing a new item into your data structure 1034 00:41:49,086 --> 00:41:51,906 and having it stick into some empty location is pretty 1035 00:41:51,906 --> 00:41:52,806 actually low. 1036 00:41:53,056 --> 00:41:55,646 And plus, even when you go through this process 1037 00:41:55,646 --> 00:41:58,526 of throwing something into the data structure, well, 1038 00:41:58,526 --> 00:41:59,766 how do you then find it? 1039 00:42:00,976 --> 00:42:02,846 Right? Insertion is pretty damn easy. 1040 00:42:02,846 --> 00:42:03,916 But what about search? 1041 00:42:04,166 --> 00:42:06,496 I mean, who knows where it ended up? 1042 00:42:06,646 --> 00:42:09,226 So we clearly can't just randomly pick the insertion 1043 00:42:09,226 --> 00:42:11,286 point, otherwise we're killing the running time 1044 00:42:11,516 --> 00:42:12,706 of the search algorithm. 1045 00:42:12,706 --> 00:42:14,236 And arguably, what I care 1046 00:42:14,236 --> 00:42:16,756 about most is finding this data in the future. 1047 00:42:16,916 --> 00:42:19,316 Right? I'm putting it into a data structure presumably 1048 00:42:19,526 --> 00:42:20,876 in order to create a situation 1049 00:42:20,876 --> 00:42:22,596 where I can get it back in constant time. 1050 00:42:22,786 --> 00:42:24,936 So randomness certainly isn't the case. 1051 00:42:25,076 --> 00:42:26,606 But let's suppose that it is random. 1052 00:42:26,606 --> 00:42:29,346 And let's asking ourselves, if you have a container, a table, 1053 00:42:29,546 --> 00:42:31,216 a hash table that's initially empty 1054 00:42:31,476 --> 00:42:34,366 and you just throw randomly some pens into it, you know, 1055 00:42:34,366 --> 00:42:36,816 what really is the probability of having a collision? 1056 00:42:37,086 --> 00:42:39,476 Well, it turns out, even though I said verbally, you know, 1057 00:42:39,676 --> 00:42:42,046 the odds of my hitting another number it's pretty small, 1058 00:42:42,046 --> 00:42:43,046 it's actually not. 1059 00:42:43,116 --> 00:42:44,006 And if you've ever heard 1060 00:42:44,006 --> 00:42:46,846 of a little mathematical problem called the birthday paradox, 1061 00:42:46,846 --> 00:42:50,496 you can model this question very easily and see that, you know, 1062 00:42:50,756 --> 00:42:54,046 even if you had a whole lot of memory and relatively few things 1063 00:42:54,126 --> 00:42:56,526 to throw in it, you're still very likely 1064 00:42:56,526 --> 00:42:58,836 to actually hit something that's already there. 1065 00:42:58,836 --> 00:43:01,976 So we can frame it as this, in a room full of N CS50 students 1066 00:43:02,286 --> 00:43:06,176 where N is declining over the weeks, what's the probability 1067 00:43:06,176 --> 00:43:08,066 that at least two students have the same birthday? 1068 00:43:08,396 --> 00:43:10,306 So this is kind of the same problem, it's just framed 1069 00:43:10,306 --> 00:43:11,436 in kind of a silly way. 1070 00:43:11,696 --> 00:43:16,486 But here the hash table is 365, or maybe 366 for leap years 1071 00:43:16,486 --> 00:43:19,426 and such, but we have a hash table of 365. 1072 00:43:19,716 --> 00:43:22,566 If we assume that your parents had children 1073 00:43:22,566 --> 00:43:24,896 with by a uniform distribution over the years, 1074 00:43:24,896 --> 00:43:26,646 which I don't think is actually statistically true, 1075 00:43:26,926 --> 00:43:29,426 we can assume that if we throw all of your birthdays 1076 00:43:29,426 --> 00:43:31,826 into this bucket that's the size of 365, 1077 00:43:32,216 --> 00:43:36,906 that we'll get collisions eventually, but at what point? 1078 00:43:37,006 --> 00:43:39,996 Now with a class of 501 students, you're going 1079 00:43:39,996 --> 00:43:43,196 to thrown those birthdays into a data structure the size of 365 1080 00:43:43,346 --> 00:43:45,486 and obviously you're going to have collisions, 1081 00:43:45,486 --> 00:43:47,876 because you can't fit that many people into that few bucket, 1082 00:43:48,256 --> 00:43:50,576 but even if you're taking a smaller seminar course that's 1083 00:43:50,576 --> 00:43:54,566 only 40 people, turns out it's a very high probability that even 1084 00:43:54,566 --> 00:43:56,626 in smaller classes you're going to have collisions. 1085 00:43:56,626 --> 00:43:58,266 And this is an experiment that you can run, frankly, 1086 00:43:58,266 --> 00:44:00,336 in your next section or your next seminar class. 1087 00:44:00,466 --> 00:44:03,326 And we can model it as follows, if you ask yourself 1088 00:44:03,976 --> 00:44:06,466 when I throw the first birthday into this data structure, 1089 00:44:06,736 --> 00:44:08,406 what's the probability of a collision, 1090 00:44:08,446 --> 00:44:11,986 or what's the probability of it not colliding with anyone? 1091 00:44:12,036 --> 00:44:13,356 Well, it's 100 percent. 1092 00:44:13,486 --> 00:44:14,946 The data structure's initially empty. 1093 00:44:15,176 --> 00:44:17,126 You have a 100 percent probability 1094 00:44:17,126 --> 00:44:19,166 of not colliding with anyone else. 1095 00:44:19,496 --> 00:44:21,546 So this actually turns out to be one of those problems 1096 00:44:21,546 --> 00:44:24,016 where it's a little easier to answer at the opposite question. 1097 00:44:24,316 --> 00:44:26,136 Not what's the probability of a collision, 1098 00:44:26,366 --> 00:44:28,066 what's the probability of no collisions? 1099 00:44:28,236 --> 00:44:29,466 So that's the question we're asking. 1100 00:44:29,776 --> 00:44:32,476 Well, for that first person, it's one divided by one. 1101 00:44:32,646 --> 00:44:35,426 There is a 100 percent certainty that I'm not going to collide 1102 00:44:35,426 --> 00:44:36,446 if it's the first birthday. 1103 00:44:36,776 --> 00:44:37,926 What about the second person? 1104 00:44:37,926 --> 00:44:39,256 Well, it's a pretty easy problem. 1105 00:44:39,256 --> 00:44:41,506 It just means that person obviously has a birthday, 1106 00:44:41,746 --> 00:44:45,916 so it means that there's 364 other buckets that they can fall 1107 00:44:45,916 --> 00:44:48,546 into without a collision and so we can model 1108 00:44:48,546 --> 00:44:50,756 that as one minus one over 364, 1109 00:44:50,866 --> 00:44:55,006 which is just 365 minus one over 365. 1110 00:44:55,006 --> 00:44:57,106 It's 364 over 365. 1111 00:44:57,326 --> 00:44:59,966 Now we do this again and again and again and again 1112 00:45:00,126 --> 00:45:03,046 and we ultimately get a formula that we could plot 1113 00:45:03,246 --> 00:45:04,496 and we could actually see 1114 00:45:04,496 --> 00:45:08,886 at what point the probability becomes worrisomely high. 1115 00:45:08,966 --> 00:45:12,686 And so in the end, if you work this out with the exponents 1116 00:45:12,686 --> 00:45:14,786 and such, this is the concise way 1117 00:45:14,786 --> 00:45:17,646 of expressing the probability we're discussing verbally there. 1118 00:45:17,996 --> 00:45:18,786 But it looks like this. 1119 00:45:18,786 --> 00:45:20,186 And this is the neat take away. 1120 00:45:20,516 --> 00:45:23,676 So if you had just one person in the room, one birthday, 1121 00:45:23,906 --> 00:45:27,126 there's a zero percent chance of a collision, 1122 00:45:27,126 --> 00:45:28,546 of a match of two birthdays. 1123 00:45:28,876 --> 00:45:31,246 But notice this curve, this is 100 percent change, 1124 00:45:31,286 --> 00:45:32,716 90 percent, 80 percent chance. 1125 00:45:32,956 --> 00:45:37,126 I mean, notice how quickly this formula hits 90 percent. 1126 00:45:37,126 --> 00:45:40,256 And that, I picked 40 for specifically this reason. 1127 00:45:40,256 --> 00:45:41,846 It turns out that statistically 1128 00:45:41,846 --> 00:45:43,906 if you assume a uniform distribution 1129 00:45:43,986 --> 00:45:47,976 of reproduction then you will have, in a class of 40 people, 1130 00:45:47,976 --> 00:45:51,276 with 90 percent probability a collision of birthdays. 1131 00:45:51,276 --> 00:45:53,886 And once you have a class of like 58 people on the right, 1132 00:45:54,176 --> 00:45:56,276 it's a near statistical certainty. 1133 00:45:56,536 --> 00:45:59,546 So this is kind of a neat sociological curiosity. 1134 00:45:59,546 --> 00:46:02,096 But for us, the question is actually very germane 1135 00:46:02,096 --> 00:46:04,006 because if we're trying to create a data structure 1136 00:46:04,346 --> 00:46:06,236 that has some fixed size, because we only have 1137 00:46:06,236 --> 00:46:10,286 so much memory in the world, we want to over allocate memory, 1138 00:46:10,286 --> 00:46:13,026 because we need to have some holes in this structure 1139 00:46:13,336 --> 00:46:16,966 with which to avoid collisions with higher probability, 1140 00:46:17,276 --> 00:46:19,856 but it looks like we need a super amount of memory, 1141 00:46:19,936 --> 00:46:22,466 way more than 365 buckets just 1142 00:46:22,466 --> 00:46:24,196 to fit a 40 person seminar class. 1143 00:46:24,786 --> 00:46:26,726 So the assumption we've been making thus far is 1144 00:46:26,726 --> 00:46:28,966 that it's this uniform distribution of numbers. 1145 00:46:29,196 --> 00:46:31,746 What if we actually look at the input, the birthday 1146 00:46:32,066 --> 00:46:35,216 or the number or the string that we're inputting and try 1147 00:46:35,216 --> 00:46:37,946 to figure out its location, not randomly, 1148 00:46:38,236 --> 00:46:39,856 but actually analytically 1149 00:46:39,856 --> 00:46:42,256 so that we minimize the probability of collision. 1150 00:46:42,586 --> 00:46:48,746 And so that's what we'll do after our five minute break. 1151 00:46:48,746 --> 00:46:48,936 [ Silence ] 1152 00:46:48,936 --> 00:46:50,246 Alright. So we are back. 1153 00:46:50,246 --> 00:46:51,376 So one amendment. 1154 00:46:51,376 --> 00:46:55,066 So I finally remember why in the implementation of a stack here 1155 00:46:55,066 --> 00:46:57,746 and in a C data structure that I had both size and top. 1156 00:46:57,746 --> 00:46:59,276 It was actually for one corner case. 1157 00:46:59,446 --> 00:47:03,456 If you only have top to remember where the top of the stack is, 1158 00:47:03,736 --> 00:47:04,866 there's one gotcha, which is 1159 00:47:04,866 --> 00:47:06,296 when the stack is initially empty. 1160 00:47:06,436 --> 00:47:08,936 And so if the stack is initially empty and you initialize top, 1161 00:47:08,936 --> 00:47:11,646 for instance, to zero, you might otherwise return a 1162 00:47:11,646 --> 00:47:12,436 garbage value. 1163 00:47:12,436 --> 00:47:14,396 Now there's a way to fix this, of course. 1164 00:47:14,626 --> 00:47:17,856 You could just declare top to be some sentinel value, 1165 00:47:17,856 --> 00:47:19,346 a special value like negative one, 1166 00:47:19,586 --> 00:47:21,726 to signify that there's actually not something here. 1167 00:47:21,986 --> 00:47:24,326 Or you can take this alternative approach and just maintain it 1168 00:47:24,326 --> 00:47:26,196 as a separate end, which is the approach I took here. 1169 00:47:26,466 --> 00:47:29,306 But actually a decent take away here, a save for me, 1170 00:47:29,306 --> 00:47:31,786 is to say that though we might introduce these structures 1171 00:47:31,786 --> 00:47:34,436 in one way, there's absolutely different decisions you 1172 00:47:34,436 --> 00:47:34,866 can make. 1173 00:47:34,866 --> 00:47:36,576 And in fact, one of the themes of pset6, 1174 00:47:36,676 --> 00:47:38,886 when you implement your spell checker, is going to be 1175 00:47:38,886 --> 00:47:41,326 to decide for yourself what would be the best way 1176 00:47:41,326 --> 00:47:43,096 to implement this structure or that 1177 00:47:43,316 --> 00:47:45,286 and you'll be guided toward a couple of possibilities 1178 00:47:45,616 --> 00:47:48,696 in this Sunday's walkthrough for pset6 in particular. 1179 00:47:48,786 --> 00:47:52,816 So the question I had a moment ago was what's the probability 1180 00:47:52,816 --> 00:47:54,636 of a collision and the take away is really, 1181 00:47:54,666 --> 00:47:55,596 it's pretty damn high. 1182 00:47:55,596 --> 00:47:58,576 So collisions are going to happen so we need to be able 1183 00:47:58,576 --> 00:48:01,146 to mitigate these somehow or at least deal with them. 1184 00:48:01,386 --> 00:48:04,706 So what's the simplest possible implementation of a hash table? 1185 00:48:04,886 --> 00:48:08,486 Well, the simplest one arguably is just an array, a fixed size, 1186 00:48:08,816 --> 00:48:12,266 and you take you input, let's say an integer, 1187 00:48:12,486 --> 00:48:14,336 and you try to put it into the hash table 1188 00:48:14,336 --> 00:48:15,456 at a specific location. 1189 00:48:15,606 --> 00:48:17,776 And if there is a collision, as we say, 1190 00:48:17,776 --> 00:48:19,246 if there is something else already there, 1191 00:48:19,606 --> 00:48:21,156 you know what the simplest fix is just 1192 00:48:21,156 --> 00:48:22,446 to move it then to the next bucket. 1193 00:48:22,446 --> 00:48:24,576 And if there's something there, move it to the next bucket. 1194 00:48:24,826 --> 00:48:27,156 You know, just make a best effort to find an empty location 1195 00:48:27,156 --> 00:48:29,636 and if you miss, then just increment, plus, plus, plus, 1196 00:48:29,636 --> 00:48:31,596 plus, plus, plus, plus until you find a hole 1197 00:48:31,896 --> 00:48:32,846 in the data structure. 1198 00:48:32,846 --> 00:48:33,866 Now what might this mean? 1199 00:48:33,866 --> 00:48:36,526 So when you throw a pen at the board, 1200 00:48:36,526 --> 00:48:38,716 you're doing something that's generally called hashing. 1201 00:48:38,796 --> 00:48:42,406 To hash is a verb that generally means to take as input a number 1202 00:48:42,406 --> 00:48:44,966 or a string or something and then return 1203 00:48:44,966 --> 00:48:48,246 as output the intended location for that input 1204 00:48:48,656 --> 00:48:49,726 in the data structure. 1205 00:48:49,726 --> 00:48:52,576 So we might implement a very simple hash function as follows. 1206 00:48:52,866 --> 00:48:55,896 If I need to return an int, because again, the function, 1207 00:48:55,896 --> 00:48:58,386 the role of a hash function, what's called hash, 1208 00:48:58,676 --> 00:49:01,826 is to take as input, let's say a number, and then return 1209 00:49:01,826 --> 00:49:06,126 as output the location of that number in the hash table. 1210 00:49:06,126 --> 00:49:07,876 It's answering the question where do I put this? 1211 00:49:08,196 --> 00:49:11,706 And now we could, as I did physically with my pen, 1212 00:49:11,996 --> 00:49:15,496 we could just do something like return rand, right? 1213 00:49:15,496 --> 00:49:16,936 So that would be the simplest approach. 1214 00:49:17,086 --> 00:49:18,226 And that's actually not quite right. 1215 00:49:18,226 --> 00:49:19,356 I'd actually have to multiply it 1216 00:49:19,356 --> 00:49:21,816 out so I get the range correctly, but we could use rand 1217 00:49:22,036 --> 00:49:23,396 and actually get back a random value. 1218 00:49:23,396 --> 00:49:25,326 But again, to be clear, what's the downside 1219 00:49:25,326 --> 00:49:27,816 of putting an element in any data structure 1220 00:49:27,816 --> 00:49:29,736 at some random location, which is pretty fast, 1221 00:49:30,426 --> 00:49:31,136 but there's a downside. 1222 00:49:32,376 --> 00:49:32,526 >> Search. 1223 00:49:32,526 --> 00:49:33,066 >> Search, right? 1224 00:49:33,066 --> 00:49:35,446 You can't find it again unless potentially you search 1225 00:49:35,446 --> 00:49:36,576 through the whole data structure. 1226 00:49:36,576 --> 00:49:38,596 And the worst case running time there is probably going 1227 00:49:38,596 --> 00:49:41,346 to be big old event, because if it's in some random location, 1228 00:49:41,346 --> 00:49:44,436 it might be in the last location you check just by bad luck. 1229 00:49:44,686 --> 00:49:46,586 And so that's not the best goal here, right? 1230 00:49:46,586 --> 00:49:47,786 We could have done better, frankly, 1231 00:49:47,786 --> 00:49:49,116 with binary search and array. 1232 00:49:49,116 --> 00:49:50,256 So that's a step backwards. 1233 00:49:50,596 --> 00:49:51,716 But we could do this. 1234 00:49:51,716 --> 00:49:54,736 Suppose that the input is N and let's suppose 1235 00:49:54,736 --> 00:49:56,816 that we've got a pretty good data structure. 1236 00:49:57,066 --> 00:49:59,816 You know what I could do if I want to store the number N 1237 00:49:59,816 --> 00:50:01,666 in an array, where should I put it? 1238 00:50:01,946 --> 00:50:04,496 Well, the, you know, the common sense 1239 00:50:04,586 --> 00:50:07,476 in me says why don't I just return the number itself? 1240 00:50:07,866 --> 00:50:09,326 So if I'm trying to insert 1241 00:50:09,326 --> 00:50:11,786 into this data structure the number zero, 1242 00:50:12,166 --> 00:50:13,206 where should I put it? 1243 00:50:13,206 --> 00:50:14,256 How about bracket zero? 1244 00:50:14,256 --> 00:50:15,836 And if I'm trying to insert number one, 1245 00:50:15,836 --> 00:50:16,486 where should I put it? 1246 00:50:16,696 --> 00:50:17,706 How about bracket one? 1247 00:50:17,706 --> 00:50:18,946 How about the number 25? 1248 00:50:18,946 --> 00:50:20,136 How about bracket 25? 1249 00:50:20,136 --> 00:50:21,216 How about the number 26? 1250 00:50:21,666 --> 00:50:22,496 Oh, there's the problem. 1251 00:50:22,646 --> 00:50:26,326 Right? If your number, if your inputs might actually exceed the 1252 00:50:26,326 --> 00:50:27,936 balance of your data structure, 1253 00:50:28,196 --> 00:50:29,816 which isn't a deal breaker, right? 1254 00:50:29,816 --> 00:50:32,576 Because if they're sparse, if you have a number here 1255 00:50:32,576 --> 00:50:33,866 and a number over here and number here, 1256 00:50:34,066 --> 00:50:37,156 you don't need a super big data structure in order to store it. 1257 00:50:37,156 --> 00:50:40,246 Like some of you, for the hacker edition of pset2 might have 1258 00:50:40,306 --> 00:50:42,216 to get linear sorting time. 1259 00:50:42,746 --> 00:50:46,446 Well we need to actually exercise a bit of intelligence. 1260 00:50:46,446 --> 00:50:50,056 So what's the best way to avoid wrapping around to the end 1261 00:50:50,056 --> 00:50:52,836 of a data structure that's only of size let's say 26? 1262 00:50:53,446 --> 00:50:54,626 How do I change this formula? 1263 00:50:55,976 --> 00:50:57,086 Alright, so, modulo, right? 1264 00:50:57,116 --> 00:50:58,666 So modulo is kind of our savior 1265 00:50:58,666 --> 00:51:00,846 when it comes to wrapping around. 1266 00:51:01,056 --> 00:51:01,806 So this is better. 1267 00:51:01,986 --> 00:51:05,046 Now my goal is to insert an integer into some data structure 1268 00:51:05,046 --> 00:51:07,246 and that data structure is of size 26. 1269 00:51:07,546 --> 00:51:09,796 Well, now, thanks to modulo 26, 1270 00:51:10,006 --> 00:51:14,236 no matter how big my input is, whether it's zero or 25 or 26 1271 00:51:14,236 --> 00:51:17,746 or 2000, I'm always going to get back a number between zero 1272 00:51:17,746 --> 00:51:21,846 and 25, because of modulo and so that will definitely fit. 1273 00:51:22,296 --> 00:51:25,926 But if you're allowed to have these super large numbers going 1274 00:51:25,926 --> 00:51:28,566 into a fairly small space, what's going to happen 1275 00:51:28,566 --> 00:51:31,876 when you has into as input the number zero? 1276 00:51:31,876 --> 00:51:33,296 Well, it's going to go back at zero. 1277 00:51:33,586 --> 00:51:38,206 What if you hash in the number 26, where's it going to go? 1278 00:51:38,386 --> 00:51:38,516 >> Zero. 1279 00:51:38,676 --> 00:51:39,486 >> Bracket zero, right? 1280 00:51:39,486 --> 00:51:41,586 Because 26 mod 26 gives you no remainder, 1281 00:51:41,586 --> 00:51:42,676 which means bracket zero. 1282 00:51:42,676 --> 00:51:43,956 So again, you've got collision. 1283 00:51:43,956 --> 00:51:47,476 So one of the solutions to this might very well be conceptually, 1284 00:51:47,476 --> 00:51:49,186 we won't do this part in code, because it's easier 1285 00:51:49,186 --> 00:51:51,666 to just do it verbally, but one of the solutions might be 1286 00:51:51,666 --> 00:51:54,586 if I try to hash into location zero and something's there, 1287 00:51:54,726 --> 00:51:57,386 we'll take the location I tried to hash into, just do plus, 1288 00:51:57,386 --> 00:51:59,916 plus, and check location, in that case, one. 1289 00:52:00,176 --> 00:52:01,896 If there's an opening, put it there. 1290 00:52:02,186 --> 00:52:05,126 Now how do you find that element in the future? 1291 00:52:05,606 --> 00:52:07,086 Well, it's the same process. 1292 00:52:07,186 --> 00:52:10,526 If I am asked a question, search for the number N, well, 1293 00:52:10,526 --> 00:52:12,356 I first run it through the same hash function 1294 00:52:12,356 --> 00:52:13,956 and I get back the same answer, zero. 1295 00:52:14,406 --> 00:52:17,846 So this function search is going to go ahead and check 1296 00:52:17,846 --> 00:52:21,286 at bracket zero for the number N. is it there? 1297 00:52:21,376 --> 00:52:24,026 If so, it returns true and search has done its job. 1298 00:52:24,266 --> 00:52:26,216 If it's not there, should I immediately return false? 1299 00:52:26,996 --> 00:52:27,063 >> No. 1300 00:52:27,776 --> 00:52:30,286 >> So right, probably not, I don't want to return false, 1301 00:52:30,286 --> 00:52:31,256 because where might it be? 1302 00:52:31,596 --> 00:52:34,796 At you know, bracket one or bracket two or worst case, 1303 00:52:34,796 --> 00:52:36,666 and here is the got you, worst case, 1304 00:52:36,666 --> 00:52:39,176 what if the whole array was so damn full except 1305 00:52:39,296 --> 00:52:41,756 for slot number bracket 25? 1306 00:52:42,066 --> 00:52:46,056 You know, even my number zero or my number 26 could, by bad luck, 1307 00:52:46,316 --> 00:52:48,826 end up all the way at the bottom of the hash table. 1308 00:52:49,106 --> 00:52:51,856 So this method that we're describing is generally called 1309 00:52:51,936 --> 00:52:55,686 linear probing whereby you probe at subsequent locations 1310 00:52:55,686 --> 00:52:58,056 in the hash table just looking for an opening, looking, 1311 00:52:58,056 --> 00:53:00,836 looking, looking, but it's linear in that you keep doing it 1312 00:53:00,836 --> 00:53:02,146 like a plus, plus operator. 1313 00:53:02,146 --> 00:53:03,816 And the downside of this, of course, 1314 00:53:04,066 --> 00:53:07,056 is that if you've got a lot of input, or a lot of bad luck, 1315 00:53:07,126 --> 00:53:09,326 and you keep hashing through the same locations 1316 00:53:09,326 --> 00:53:11,936 and they're all kind of blocking together so you keep having 1317 00:53:11,936 --> 00:53:13,596 to plus, plus, plus, plus, plus, plus, plus, plus, 1318 00:53:13,596 --> 00:53:15,486 this algorithm ultimately degrades 1319 00:53:15,486 --> 00:53:16,626 into big oh of what again. 1320 00:53:17,956 --> 00:53:20,166 So end. So it's a little better perhaps 1321 00:53:20,236 --> 00:53:23,686 than an even more naive approach, but it's not perfect. 1322 00:53:23,896 --> 00:53:24,886 So we need something better. 1323 00:53:24,886 --> 00:53:24,976 Yes? 1324 00:53:25,516 --> 00:53:35,556 [ Inaudible ] 1325 00:53:36,056 --> 00:53:38,436 >> There may be a number there, so this is incomplete. 1326 00:53:38,436 --> 00:53:44,126 So what I really need to do here is in pseudo code if something 1327 00:53:44,126 --> 00:53:47,716 at location, actually we'll do pseudo code, if something 1328 00:53:47,716 --> 00:53:51,166 at location this, then I need to, 1329 00:53:51,796 --> 00:53:55,636 not to induce giggles, but probe array. 1330 00:53:55,926 --> 00:53:57,746 Alright, so keep probing down further 1331 00:53:57,746 --> 00:53:59,606 into the array looking for, and that would be implemented 1332 00:53:59,606 --> 00:54:01,846 with the while construct or a for loop or something like this, 1333 00:54:01,846 --> 00:54:02,936 so just some pseudo code for now. 1334 00:54:03,176 --> 00:54:05,186 Now this is not a very interesting hash function. 1335 00:54:05,186 --> 00:54:06,216 This is just for integers. 1336 00:54:06,456 --> 00:54:07,806 Let's still return an int, 1337 00:54:07,806 --> 00:54:09,986 because hash tables ultimately boil down to the question 1338 00:54:09,986 --> 00:54:10,796 where should I put this? 1339 00:54:10,796 --> 00:54:13,076 But let's do something more interesting like a string. 1340 00:54:13,506 --> 00:54:16,916 So if I want to take in a string S, I can't just return 1341 00:54:16,916 --> 00:54:21,556 for instance return S ampersand N. I kind of could if I dealt 1342 00:54:21,556 --> 00:54:23,456 with the casting, but S is a pointer. 1343 00:54:23,636 --> 00:54:25,506 It's pretty meaningless to take the pointer 1344 00:54:25,566 --> 00:54:28,006 and just hash it in this way. 1345 00:54:28,006 --> 00:54:29,976 Although, you could do this if you really wanted to. 1346 00:54:30,216 --> 00:54:32,856 But this doesn't feel very intelligent. 1347 00:54:32,856 --> 00:54:34,676 Suppose that the strings that I'm trying to insert 1348 00:54:34,676 --> 00:54:37,586 into this data structure are people's names, right, 1349 00:54:37,586 --> 00:54:39,816 and so that's actually why this hash table, by default, 1350 00:54:39,816 --> 00:54:43,366 has only 26 locations, because I'm assuming 1351 00:54:43,366 --> 00:54:45,646 that there's a 26 letter alphabet, A through Z, 1352 00:54:45,876 --> 00:54:47,906 and ignore uppercase and lowercase for now, 1353 00:54:47,906 --> 00:54:51,966 let's just suppose that I really want to put names 1354 00:54:51,966 --> 00:54:53,286 into this hash table with, 1355 00:54:53,416 --> 00:54:55,086 ideally without having collisions. 1356 00:54:55,506 --> 00:54:57,526 So what might be the best approach 1357 00:54:57,756 --> 00:54:59,926 if my input is someone's name like Alice? 1358 00:55:00,336 --> 00:55:03,926 I'm given a string, A L I C E backslash zero. 1359 00:55:04,336 --> 00:55:07,186 How do I convert Alice to a number that tells me 1360 00:55:07,186 --> 00:55:09,476 where to put her name into this hash table? 1361 00:55:09,476 --> 00:55:10,826 >> [Inaudible] 1362 00:55:10,826 --> 00:55:12,006 >> What, down here? 1363 00:55:12,006 --> 00:55:12,106 >> [Inaudible] 1364 00:55:12,106 --> 00:55:15,326 >> So we could add the ASCII values. 1365 00:55:15,626 --> 00:55:17,676 Or you can actually do it a little simpler. 1366 00:55:17,676 --> 00:55:20,156 I'll push back first so we know that a string is just a bunch 1367 00:55:20,156 --> 00:55:23,596 of characters, so we know, and characters are just numbers. 1368 00:55:23,776 --> 00:55:25,376 So if the goal is to get a number, 1369 00:55:25,376 --> 00:55:26,606 we can kind of leverage the fact 1370 00:55:26,606 --> 00:55:28,326 that characters are represented as numbers. 1371 00:55:28,526 --> 00:55:30,316 And, in fact, let's not add them all together just yet, 1372 00:55:30,316 --> 00:55:32,096 let's go super trivial here for now. 1373 00:55:32,366 --> 00:55:38,906 Let me go ahead and return as bracket zero cast to an int. 1374 00:55:38,906 --> 00:55:41,256 Right? Let me just look. 1375 00:55:41,256 --> 00:55:42,716 It's a little heuristic. 1376 00:55:42,716 --> 00:55:43,816 It might get me into trouble. 1377 00:55:44,086 --> 00:55:48,156 But I could return the integral value of S bracket zero. 1378 00:55:48,156 --> 00:55:49,016 But there is a bug here. 1379 00:55:49,256 --> 00:55:50,506 What's this really going to return to me? 1380 00:55:51,146 --> 00:55:55,966 This is going to return like 65 if it's capital A or 66. 1381 00:55:55,966 --> 00:55:58,146 And already these numbers are beyond my boundaries. 1382 00:55:58,386 --> 00:56:00,236 So really I should be doing something like this, 1383 00:56:00,236 --> 00:56:04,086 return S bracket capital A. I need to whip out some code 1384 00:56:04,086 --> 00:56:05,526 from like Caesar or Visionaire 1385 00:56:05,526 --> 00:56:08,006 where I actually converted these things to numbers between zero 1386 00:56:08,006 --> 00:56:10,006 and 25, so I could do something like this 1387 00:56:10,006 --> 00:56:11,306 and I don't explicitly need the cast. 1388 00:56:11,586 --> 00:56:12,446 It will happen for me. 1389 00:56:12,676 --> 00:56:15,206 But this would take the character at the first letter 1390 00:56:15,206 --> 00:56:18,366 in the word and subtract off capital A, and let's assume 1391 00:56:18,366 --> 00:56:21,136 for simplicity right now it's all capital, and this is going 1392 00:56:21,136 --> 00:56:23,756 to give me a number between zero and 25. 1393 00:56:23,756 --> 00:56:24,876 Again, I need to improve this. 1394 00:56:24,876 --> 00:56:26,986 I need to write better code to handle lower case numbers, 1395 00:56:27,196 --> 00:56:29,816 maybe weird punctuation or strings that start with numbers. 1396 00:56:30,056 --> 00:56:32,426 But in the simplest case, if I'm getting an all capital word, 1397 00:56:32,696 --> 00:56:34,256 this hash function would work. 1398 00:56:34,516 --> 00:56:37,396 But would it result in collisions? 1399 00:56:38,306 --> 00:56:39,956 And if so under what circumstances? 1400 00:56:39,956 --> 00:56:40,446 >> [Inaudible] 1401 00:56:40,446 --> 00:56:44,466 >> Yes. So obviously you're going to have a collision 1402 00:56:44,466 --> 00:56:46,746 if the two names start with the same letters. 1403 00:56:46,746 --> 00:56:49,466 So Albert and Alice are both going to try to hash 1404 00:56:49,526 --> 00:56:51,976 to location zero, so we have a collision. 1405 00:56:52,086 --> 00:56:54,986 And I'd wager that that's going to be pretty common, right? 1406 00:56:54,986 --> 00:56:57,656 If you've ever played Wheel of Fortune, 1407 00:56:57,656 --> 00:56:59,496 there's certainly some letters in the English alphabet 1408 00:56:59,496 --> 00:57:01,636 that recur more frequently and odds are if we look 1409 00:57:01,636 --> 00:57:03,836 at the class list, I'm guessing we don't have 1410 00:57:03,836 --> 00:57:06,376 that many students whose first names start with Q, 1411 00:57:06,776 --> 00:57:09,026 probably not Z either, right? 1412 00:57:09,026 --> 00:57:11,506 They're some relatively infrequent characters. 1413 00:57:11,506 --> 00:57:12,936 If that is in fact your name, great. 1414 00:57:12,936 --> 00:57:14,776 We have a spot for you in the hash table. 1415 00:57:15,036 --> 00:57:18,126 But there's going to be a lot of people with B's and D's and C's 1416 00:57:18,126 --> 00:57:19,556 and all of that that are just going 1417 00:57:19,556 --> 00:57:20,856 to collide, collide, collide. 1418 00:57:21,046 --> 00:57:23,626 So that kind of suggest that though this is kind of simple, 1419 00:57:23,816 --> 00:57:25,356 it's constant time, right? 1420 00:57:25,356 --> 00:57:26,176 There's no loop here. 1421 00:57:26,176 --> 00:57:27,066 There's no algorithm. 1422 00:57:27,066 --> 00:57:29,586 There's no looping process, which is why I pushed back 1423 00:57:29,586 --> 00:57:31,546 on the idea of adding all the ASCII values, 1424 00:57:31,726 --> 00:57:32,896 because that would take more time. 1425 00:57:32,896 --> 00:57:35,896 Let me pluck off the first one, but it's flawed 1426 00:57:35,896 --> 00:57:38,116 in that I'm going to have a lot of collisions for the A's 1427 00:57:38,116 --> 00:57:40,056 and the B's and the C's, and I'm like always 1428 00:57:40,056 --> 00:57:42,506 with high probability going to have an opening for this person 1429 00:57:42,746 --> 00:57:46,226 where Q might be or where Z might be or where X might be, 1430 00:57:46,446 --> 00:57:48,376 and this feels like I'm wasting space. 1431 00:57:48,576 --> 00:57:51,406 So an opportunity for problem set six, when you're given words 1432 00:57:51,406 --> 00:57:52,586 from a dictionary, it's going to be 1433 00:57:52,586 --> 00:57:55,246 to answer this question do you want to go naively 1434 00:57:55,246 --> 00:57:58,686 but simply looking at the first letter? 1435 00:57:59,066 --> 00:58:01,146 Or do you want to do something a little more sophisticated 1436 00:58:01,146 --> 00:58:04,116 to minimize that probability and use something 1437 00:58:04,116 --> 00:58:06,296 that takes more input into account? 1438 00:58:06,706 --> 00:58:08,196 So, what's the solution really 1439 00:58:08,196 --> 00:58:10,166 to this problem of collisions right? 1440 00:58:10,166 --> 00:58:13,346 Even with linear probing, you're going to end up filling all 1441 00:58:13,346 --> 00:58:16,096 of the spots eventually and then you're out of luck, right, 1442 00:58:16,126 --> 00:58:18,526 then the data structure needs to be entirely reallocated 1443 00:58:18,746 --> 00:58:20,366 which is hugely expensive, right. 1444 00:58:20,366 --> 00:58:21,906 Making the decision of size 1445 00:58:21,906 --> 00:58:23,646 in the beginning is probably better. 1446 00:58:23,646 --> 00:58:25,926 So, a better approach is something like this. 1447 00:58:25,926 --> 00:58:27,826 So, this picture here isn't about names 1448 00:58:27,826 --> 00:58:29,476 but this one here is about birthdays. 1449 00:58:29,806 --> 00:58:33,446 This is an array on the left hand side that is numbered 1450 00:58:33,446 --> 00:58:36,236 from one to 31 to imply that we're assuming 1451 00:58:36,236 --> 00:58:38,526 that every month has maximally 31 days. 1452 00:58:38,816 --> 00:58:41,086 And the goal here is to take some people as input 1453 00:58:41,326 --> 00:58:43,626 and put them into the hash table's location 1454 00:58:43,816 --> 00:58:45,666 that corresponds to their birthday. 1455 00:58:45,796 --> 00:58:48,776 So, if you're born on the first of the month, 1456 00:58:48,776 --> 00:58:49,846 you go at location one. 1457 00:58:49,846 --> 00:58:51,566 If you're born on the 29th of the month you go 1458 00:58:51,566 --> 00:58:55,146 at location 29 irrespective of the specific month and year. 1459 00:58:55,546 --> 00:58:57,856 So, this is called separate chaining. 1460 00:58:57,886 --> 00:59:00,136 And this pretty much is an obvious solution 1461 00:59:00,136 --> 00:59:01,396 to the problem of collisions. 1462 00:59:01,396 --> 00:59:04,496 You know what, if you have a collision don't try probing the 1463 00:59:04,496 --> 00:59:06,036 rest of the array just hoping you're going 1464 00:59:06,036 --> 00:59:07,996 to find an empty spot because again, the downside 1465 00:59:07,996 --> 00:59:10,756 of that is linear running time for search later on. 1466 00:59:11,066 --> 00:59:13,366 You know what, stay where you are and just tack 1467 00:59:13,616 --> 00:59:16,736 on the additional word or the additional input onto a list 1468 00:59:16,896 --> 00:59:18,106 that might grow over time. 1469 00:59:18,336 --> 00:59:19,756 But, at least stay where you are. 1470 00:59:20,086 --> 00:59:23,116 So, here's what a hash table really generally looks like. 1471 00:59:23,116 --> 00:59:24,996 These are something we'll take for granite in PHP 1472 00:59:24,996 --> 00:59:26,446 and JavaScript in the weeks to come. 1473 00:59:26,716 --> 00:59:29,876 But hash table, underneath the hood is very often implemented 1474 00:59:29,876 --> 00:59:30,616 as something like this. 1475 00:59:30,866 --> 00:59:32,796 It's still a fixed sized data structure 1476 00:59:33,026 --> 00:59:37,106 where you have maybe 26 or 31 or a billion locations 1477 00:59:37,436 --> 00:59:42,706 into which you put even more data than that, right. 1478 00:59:42,706 --> 00:59:44,036 Linear probing was flawed 1479 00:59:44,036 --> 00:59:46,016 in that you could only fit 26 elements. 1480 00:59:46,336 --> 00:59:49,086 Separate chaining frees you from that arbitrary limitation 1481 00:59:49,336 --> 00:59:52,286 by allowing you to have essentially this really clever 1482 00:59:52,286 --> 00:59:55,586 amalgam of the thing we called an array and the thing we called 1483 00:59:55,666 --> 00:59:58,056 on the right hand said here, link lists. 1484 00:59:58,296 --> 01:00:00,306 It's kind of a merger of these two ideas. 1485 01:00:00,436 --> 01:00:02,106 And so, with separate chaining, 1486 01:00:02,386 --> 01:00:06,036 we still get not theoretically better running time 1487 01:00:06,036 --> 01:00:09,566 but we get much better distribution of our values. 1488 01:00:09,566 --> 01:00:13,606 If we've got, n elements, just to put some formulas on this, 1489 01:00:13,836 --> 01:00:16,876 if we've got n number or n people that we're trying to put 1490 01:00:16,876 --> 01:00:18,076 on this data structure. 1491 01:00:18,476 --> 01:00:21,536 And suppose that the distribution 1492 01:00:21,536 --> 01:00:23,796 of birthdays is uniform over 31 days. 1493 01:00:24,106 --> 01:00:26,806 Well, how long is each of these chains going to be? 1494 01:00:27,786 --> 01:00:31,136 If we've got n people and we've got chains of length 1495 01:00:31,226 --> 01:00:34,886 and we've got 31 possible chains, well, 1496 01:00:34,886 --> 01:00:38,106 if it's a uniform distribution, you divide n total inputs 1497 01:00:38,136 --> 01:00:39,826 by the total number of buckets you have. 1498 01:00:40,046 --> 01:00:42,386 So, this implies that after you start inserting a whole bunch 1499 01:00:42,386 --> 01:00:44,016 of people into a data structure like this, 1500 01:00:44,246 --> 01:00:46,026 even though this picture's a little ragged 1501 01:00:46,246 --> 01:00:48,486 in that there's some short chains there's some long chains 1502 01:00:48,486 --> 01:00:49,506 there's some no chains. 1503 01:00:49,786 --> 01:00:52,256 In the end if you assume uniform distribution, you're going 1504 01:00:52,256 --> 01:00:54,386 to have chains that are all roughly the same length. 1505 01:00:54,626 --> 01:00:57,166 So, that kind of means that the running time for search 1506 01:00:57,326 --> 01:00:59,426 in the end is going to be something 1507 01:00:59,426 --> 01:01:01,456 like big O of n over 31. 1508 01:01:01,456 --> 01:01:04,616 But, let's generalize this, n over k where k is the number 1509 01:01:04,616 --> 01:01:05,986 of buckets that you have. 1510 01:01:06,356 --> 01:01:08,506 Now, the mathematicians might realize that the k, 1511 01:01:08,506 --> 01:01:10,626 if it's just a constant, every time we talk 1512 01:01:10,626 --> 01:01:13,486 about asymptotic notation, we cross out things like constants. 1513 01:01:13,756 --> 01:01:15,666 So, the little white lie here is 1514 01:01:15,666 --> 01:01:18,206 that the hash table's running time is still big O of n. 1515 01:01:18,646 --> 01:01:21,466 And so, fundamentally, theoretically, it's no better 1516 01:01:21,466 --> 01:01:23,526 than linear probing, it's no better than an array, 1517 01:01:23,526 --> 01:01:24,806 it's no better than a link list. 1518 01:01:25,116 --> 01:01:28,946 But now, as we get to problem set six and the web based aspect 1519 01:01:28,946 --> 01:01:31,466 of the course, when we start solving some more real world 1520 01:01:31,466 --> 01:01:33,726 representative problems in the real world, 1521 01:01:33,926 --> 01:01:36,066 taking a big number n and dividing it 1522 01:01:36,096 --> 01:01:39,186 by a decent size number k, that's actually one kith, 1523 01:01:39,436 --> 01:01:41,856 the running time that it would be if it were n. So, 1524 01:01:41,856 --> 01:01:45,226 in the real world, though we might cross out these constants 1525 01:01:45,226 --> 01:01:47,946 and asymptotic life, in the real world, 1526 01:01:48,126 --> 01:01:49,686 constant factor differences, 1527 01:01:49,966 --> 01:01:52,946 constant factor differences actually make a difference 1528 01:01:52,946 --> 01:01:54,626 in terms of human time and the amount 1529 01:01:54,626 --> 01:01:56,096 of cycles you're actually spending. 1530 01:01:56,096 --> 01:01:58,786 So one of the goals of p set six will be to kind of say yeah, 1531 01:01:58,786 --> 01:01:59,286 yeah, yeah, 1532 01:01:59,286 --> 01:02:01,896 I know asymptotically this is still linear time algorithm. 1533 01:02:02,126 --> 01:02:04,586 But, look how dam fast my code is because I have a lot 1534 01:02:04,586 --> 01:02:06,266 of buckets and I have fairly short chains 1535 01:02:06,466 --> 01:02:08,946 and I made some design decisions underneath the hood 1536 01:02:09,176 --> 01:02:11,546 that really expedite in the real world the solution 1537 01:02:11,546 --> 01:02:13,986 of this problem, namely, spellchecking. 1538 01:02:13,986 --> 01:02:15,156 How might we implement this? 1539 01:02:15,516 --> 01:02:20,886 Well, we might implement this with some nodes here. 1540 01:02:20,886 --> 01:02:24,226 So again, I made the mistake two weeks ago 1541 01:02:24,226 --> 01:02:25,506 of not defining what a node is. 1542 01:02:25,746 --> 01:02:28,326 A node in computer science terms is just some kind of container. 1543 01:02:28,326 --> 01:02:30,776 It's a data structure that generically is called a node. 1544 01:02:30,776 --> 01:02:32,996 And inside of it is some stuff that's interesting. 1545 01:02:33,366 --> 01:02:35,206 So, each of these little rectangles here, 1546 01:02:35,326 --> 01:02:36,326 we're going to call a node. 1547 01:02:36,376 --> 01:02:38,396 We talked about nodes in the context of link lists. 1548 01:02:38,606 --> 01:02:41,576 It looks like based on this picture, each node has a string, 1549 01:02:41,576 --> 01:02:44,546 a person's name, and then this little square which is meant 1550 01:02:44,546 --> 01:02:45,716 to represent a pointer. 1551 01:02:45,946 --> 01:02:50,086 So, the translations of this might be a node struct, 1552 01:02:50,086 --> 01:02:51,446 and see like this. 1553 01:02:51,706 --> 01:02:55,636 Every node in this data structure has a string, 1554 01:02:55,636 --> 01:02:57,176 a word, someone's name. 1555 01:02:57,446 --> 01:03:00,256 And let's suppose that length is a constant defined elsewhere. 1556 01:03:00,256 --> 01:03:02,666 No one's name is going to be longer than like 50 characters 1557 01:03:02,666 --> 01:03:03,686 or a thousand characters. 1558 01:03:03,686 --> 01:03:05,756 There's some upper bound into which everyone's name 1559 01:03:05,756 --> 01:03:07,706 in the world fits even if it's large and variant. 1560 01:03:08,136 --> 01:03:11,486 Well, this just means maximally everyone's name can be length 1561 01:03:11,666 --> 01:03:13,776 plus one characters long plus one being 1562 01:03:13,776 --> 01:03:14,766 for the null character. 1563 01:03:15,046 --> 01:03:17,116 Now, I also need, inside this node, 1564 01:03:17,346 --> 01:03:20,676 another pointer to another such node. 1565 01:03:20,806 --> 01:03:23,066 And that's why we have sort of this circular reference 1566 01:03:23,356 --> 01:03:24,716 to something called the struct node. 1567 01:03:24,716 --> 01:03:28,026 But this is very, very similar to what we did for a link list. 1568 01:03:28,166 --> 01:03:29,286 It's just with link lists, 1569 01:03:29,526 --> 01:03:31,376 we didn't have a character word array, 1570 01:03:31,596 --> 01:03:34,466 I think we just had an int the day we looked at link list. 1571 01:03:34,466 --> 01:03:36,316 So, this might be one approach for implementing 1572 01:03:36,656 --> 01:03:37,906 that kind of structure there. 1573 01:03:38,246 --> 01:03:39,456 Well, what more can we do? 1574 01:03:39,456 --> 01:03:42,456 Now that we're starting to take arrays and this building block 1575 01:03:42,456 --> 01:03:44,686 and link lists and we're wiring things together 1576 01:03:44,686 --> 01:03:46,656 with pointers ever so cleverly, 1577 01:03:46,936 --> 01:03:49,536 we can really start implementing arbitrary shapes 1578 01:03:49,536 --> 01:03:52,966 and sizes inside memory toward an end of solving problems 1579 01:03:53,306 --> 01:03:55,506 with data structures that are most conducive 1580 01:03:55,556 --> 01:03:56,556 to the problem at hand. 1581 01:03:56,556 --> 01:03:59,106 So, in computer science, there's this notion of a tree. 1582 01:03:59,566 --> 01:04:02,296 It's kind of like what we'd think of as a family tree 1583 01:04:02,536 --> 01:04:05,116 where you have the oldest member of the family at the top, 1584 01:04:05,116 --> 01:04:07,806 the so called roots of the tree, and the children, 1585 01:04:07,806 --> 01:04:10,226 maybe a left child and a right child hang off 1586 01:04:10,226 --> 01:04:12,956 of them pictorially with arrows suggesting that each 1587 01:04:12,956 --> 01:04:15,286 of these circles is a node, another type of node. 1588 01:04:15,286 --> 01:04:16,176 We drew it as a circle. 1589 01:04:16,526 --> 01:04:19,056 But, each of these pointers just points to another member 1590 01:04:19,056 --> 01:04:21,066 of the family and another member of the family. 1591 01:04:21,306 --> 01:04:24,336 And the idea of it being drawn from top down means this is 1592 01:04:24,676 --> 01:04:29,276 like grandpa, this is mom and her brother and sister, 1593 01:04:29,566 --> 01:04:32,496 this is her children, grandpa's grandchildren, 1594 01:04:32,826 --> 01:04:35,336 and actually grandpa's here and so forth. 1595 01:04:35,376 --> 01:04:37,146 But it doesn't have to be viewed as a family tree. 1596 01:04:37,426 --> 01:04:40,746 Really the lexicon we care about today is this is the root of any 1597 01:04:40,746 --> 01:04:42,556 such structure, the thing that's at the top 1598 01:04:42,556 --> 01:04:43,606 that holds it all together. 1599 01:04:43,886 --> 01:04:46,306 We call any descendants of the node the children, 1600 01:04:46,306 --> 01:04:47,516 just like you would a family tree. 1601 01:04:47,756 --> 01:04:49,666 And then, any nodes that are at the very bottom 1602 01:04:49,666 --> 01:04:52,236 that have no children of their own are generally called, 1603 01:04:52,236 --> 01:04:54,676 in graph theory, this is an example 1604 01:04:54,676 --> 01:04:57,046 of a computer science graph, they're called leaves. 1605 01:04:57,166 --> 01:04:58,236 So, let's do some jargon. 1606 01:04:58,236 --> 01:04:59,946 Let's actually do something interesting with it. 1607 01:04:59,946 --> 01:05:05,756 Well, we had an array back in the day for representing numbers 1608 01:05:06,016 --> 01:05:07,776 that we were able to use binary search on. 1609 01:05:07,996 --> 01:05:10,696 It turns out, even though this is not a fundamental improvement 1610 01:05:10,696 --> 01:05:13,086 over an array because now we have some overhead of pointers. 1611 01:05:13,416 --> 01:05:15,486 It turns out, now that you have the ability 1612 01:05:15,486 --> 01:05:18,596 to wire things together dynamically and kind of branch 1613 01:05:18,596 --> 01:05:21,256 out in different directions, you can resolve some 1614 01:05:21,256 --> 01:05:24,646 of those same problems and will see more interesting problems 1615 01:05:24,896 --> 01:05:26,816 using more interesting data structures. 1616 01:05:26,816 --> 01:05:30,706 This, in fact, is kind of equivalent to a list in that we, 1617 01:05:30,906 --> 01:05:38,906 an array of numbers, in that we have 22, 33, 44, 55, 66, 77, 88. 1618 01:05:39,126 --> 01:05:41,286 And it's intentional that this picture is drawn 1619 01:05:41,286 --> 01:05:42,176 in the way that it is. 1620 01:05:42,176 --> 01:05:44,276 Notice that every number to the left 1621 01:05:44,276 --> 01:05:47,106 of some node is smaller than that parent. 1622 01:05:47,386 --> 01:05:48,396 And, every number to the right 1623 01:05:48,396 --> 01:05:50,736 of some node is bigger than that parent. 1624 01:05:50,736 --> 01:05:55,336 So, here's 55, notice how 33 is obviously smaller but so is 22 1625 01:05:55,336 --> 01:05:59,676 and 44 because all descendants of a node are, by definition 1626 01:05:59,676 --> 01:06:00,806 in this structure, smaller. 1627 01:06:00,996 --> 01:06:05,086 And same deal over here 66, 77, and 88 are all bigger than 55. 1628 01:06:05,376 --> 01:06:09,206 Well now, what's really neat is that you can find things 1629 01:06:09,206 --> 01:06:12,666 in this structure using not only binary search but finally 1630 01:06:12,666 --> 01:06:14,286 that technique we called recursion. 1631 01:06:14,636 --> 01:06:17,846 All right recursion was the act of a function calling itself, 1632 01:06:17,846 --> 01:06:19,996 or more generally, it's the act of doing something 1633 01:06:20,266 --> 01:06:21,986 and then doing that same something again 1634 01:06:21,986 --> 01:06:23,656 but on a slightly smaller problem. 1635 01:06:23,856 --> 01:06:25,066 Well how does that apply here? 1636 01:06:25,066 --> 01:06:27,926 Suppose I am searching for the number 44. 1637 01:06:28,176 --> 01:06:31,236 This data structure is generally called a binary search tree, 1638 01:06:31,456 --> 01:06:33,116 the search meaning that the numbers are laid 1639 01:06:33,116 --> 01:06:35,496 out in this special order, smaller than on the left, 1640 01:06:35,566 --> 01:06:36,446 bigger than on the right. 1641 01:06:36,886 --> 01:06:39,396 So, I'm handed the root, so just like a link list 1642 01:06:39,396 --> 01:06:41,866 where I'm only given the firs node, same deal here. 1643 01:06:41,866 --> 01:06:43,486 When you're given a tree in memory, 1644 01:06:43,516 --> 01:06:45,266 all you need is one pointer to the root. 1645 01:06:45,266 --> 01:06:46,776 And then everything else hangs off of it. 1646 01:06:47,076 --> 01:06:48,826 Well notice here, what I might do is search 1647 01:06:48,826 --> 01:06:51,026 for 44, I say search 44. 1648 01:06:51,156 --> 01:06:52,026 I'm handed the root. 1649 01:06:52,246 --> 01:06:54,716 Well is 44 equal to 55? 1650 01:06:54,906 --> 01:06:56,026 No, obviously not. 1651 01:06:56,026 --> 01:06:56,826 Is it greater than? 1652 01:06:56,936 --> 01:06:58,076 No. Is it less than? 1653 01:06:58,376 --> 01:07:01,516 Yes. And so now, I can chop this problem in half, 1654 01:07:01,686 --> 01:07:04,186 just like the phonebook from week zero leaving me 1655 01:07:04,186 --> 01:07:05,756 with a problem that's half as big 1656 01:07:05,986 --> 01:07:07,626 but it's the exact same kind of problem. 1657 01:07:07,806 --> 01:07:09,576 So, suppose that the search function 1658 01:07:09,676 --> 01:07:12,876 that I'm describing here verbally takes its input in ints 1659 01:07:13,216 --> 01:07:16,736 and it takes as input a pointer to a node of a tree. 1660 01:07:17,056 --> 01:07:18,016 Well, you know what this thing 1661 01:07:18,016 --> 01:07:19,666 on the left hand side here looks like? 1662 01:07:19,956 --> 01:07:23,186 It kind of looks like another tree, so another piece 1663 01:07:23,186 --> 01:07:26,286 of jargon people tend to use is yes 33 is the child 1664 01:07:26,286 --> 01:07:27,816 of this node 55. 1665 01:07:28,126 --> 01:07:32,156 But, you know what, 33 is also the root of his own tree, 1666 01:07:32,376 --> 01:07:35,036 a smaller tree but that has his own children. 1667 01:07:35,276 --> 01:07:37,706 So, if I'm implementing code now called search, 1668 01:07:37,926 --> 01:07:40,506 and again it's a function that takes an int as a parameter 1669 01:07:40,506 --> 01:07:43,736 and a pointer to a node as a parameter, well I can just call 1670 01:07:43,736 --> 01:07:47,266 that same search function again on the left child pointer. 1671 01:07:47,556 --> 01:07:50,436 And the only time I want my recursion to stop is 1672 01:07:50,436 --> 01:07:56,876 if I find the number in question or there's no children. 1673 01:07:57,166 --> 01:07:58,376 Right, I obviously don't want 1674 01:07:58,376 --> 01:08:00,216 to keep calling this same search function 1675 01:08:00,216 --> 01:08:01,686 on a child if it's not there. 1676 01:08:01,916 --> 01:08:02,916 So this begs the question, 1677 01:08:02,916 --> 01:08:04,546 how do you implement a node like this? 1678 01:08:04,676 --> 01:08:07,016 Well, you can translate this pretty identically 1679 01:08:07,016 --> 01:08:08,526 to the picture at hand. 1680 01:08:08,796 --> 01:08:10,806 So I'm going to define this thing now as a node. 1681 01:08:10,806 --> 01:08:12,366 So again, I'm redefining node all the time. 1682 01:08:12,436 --> 01:08:13,046 And that's what I mean. 1683 01:08:13,046 --> 01:08:16,036 A node is just a generic concept to describe some bucket 1684 01:08:16,036 --> 01:08:17,206 of information in the picture. 1685 01:08:17,816 --> 01:08:20,516 So, in this case of a binary search tree, well, 1686 01:08:20,516 --> 01:08:22,276 we need a node to have a number in it. 1687 01:08:22,586 --> 01:08:25,146 We also need that node to have a left pointer 1688 01:08:25,476 --> 01:08:26,696 and also a right pointer. 1689 01:08:26,696 --> 01:08:29,166 Now we can call these fu and bar we can call them anything 1690 01:08:29,166 --> 01:08:29,586 we want. 1691 01:08:29,816 --> 01:08:32,116 But, conceptually, they're meant to point 1692 01:08:32,116 --> 01:08:34,766 to two other nodes in the tree. 1693 01:08:34,926 --> 01:08:36,586 Now, you can get yourself into trouble right. 1694 01:08:36,586 --> 01:08:39,116 You can start wiring this data structure together with loops 1695 01:08:39,116 --> 01:08:40,656 and crazy sort of redirections. 1696 01:08:40,656 --> 01:08:41,806 And you can get into cycles 1697 01:08:41,806 --> 01:08:44,496 where then your code might very well get into an infinite loop. 1698 01:08:44,636 --> 01:08:45,566 Well, what's the solution? 1699 01:08:45,836 --> 01:08:46,626 Well don't do that. 1700 01:08:46,626 --> 01:08:48,206 If you start implementing cycles, 1701 01:08:48,206 --> 01:08:50,966 you're then implementing really what people call a graph 1702 01:08:50,966 --> 01:08:52,136 and not a tree. 1703 01:08:52,166 --> 01:08:54,016 When you're implementing a data structure like this, 1704 01:08:54,256 --> 01:08:56,576 it's up to you, the burden is on the programmer 1705 01:08:56,576 --> 01:08:58,626 to actually get the details right. 1706 01:08:59,046 --> 01:09:01,366 Well now, if we are moving toward a tree, 1707 01:09:01,366 --> 01:09:02,736 let me propose another approach 1708 01:09:02,736 --> 01:09:04,466 to a spellchecker or a dictionary. 1709 01:09:04,956 --> 01:09:07,396 Thus far we've emphasized hash table. 1710 01:09:07,396 --> 01:09:10,986 And again hash table's pretty decent algorithm pretty decent 1711 01:09:10,986 --> 01:09:12,766 data structure for a dictionary. 1712 01:09:12,766 --> 01:09:15,186 To be clear, even though we talked about birthdays here, 1713 01:09:15,536 --> 01:09:17,916 you can imagine taking as input a string, 1714 01:09:18,026 --> 01:09:22,276 a word in a dictionary, and somehow using its asking values 1715 01:09:22,276 --> 01:09:25,276 or something like that to come up with a location from one, 1716 01:09:25,276 --> 01:09:28,256 or from zero on up to whatever the size of the hash table is. 1717 01:09:28,566 --> 01:09:30,046 And then put that string there. 1718 01:09:30,676 --> 01:09:33,746 Well, that could give us pretty good running time. 1719 01:09:33,746 --> 01:09:36,566 Big O of n over k, is technically O of n. 1720 01:09:36,826 --> 01:09:38,966 But as you'll see in the real world, hash tables tend 1721 01:09:38,966 --> 01:09:39,926 to be pretty darn fast. 1722 01:09:40,386 --> 01:09:42,146 But, there's also this thing called the trie, 1723 01:09:42,386 --> 01:09:45,916 T R I E. It's meant to imply retrieval. 1724 01:09:46,276 --> 01:09:48,916 A trie is an interesting data structure that we'll also talk 1725 01:09:48,916 --> 01:09:50,656 about incidentally in this week's walk 1726 01:09:50,656 --> 01:09:54,276 through that is essentially a tree each 1727 01:09:54,276 --> 01:09:57,586 of whose nodes is itself in array. 1728 01:09:58,426 --> 01:09:59,526 So, this is kind of trippy. 1729 01:09:59,526 --> 01:10:03,416 So an array is specifically defined in trie usually 1730 01:10:03,416 --> 01:10:05,616 as having a fixed number of characters, a fixed number 1731 01:10:05,616 --> 01:10:09,576 of buckets, namely 26 or in fact in the p set you'll see 27 1732 01:10:09,726 --> 01:10:12,636 because we allow words to have an apostrophe for plurality 1733 01:10:12,636 --> 01:10:15,426 and so forth, so let's supposed though for now that each 1734 01:10:15,426 --> 01:10:18,816 of these arrays is a size 26 or maybe 27 for apostrophes. 1735 01:10:19,376 --> 01:10:23,396 So what does it mean to store a dictionary inside of a trie, 1736 01:10:23,496 --> 01:10:24,656 a data structure like this? 1737 01:10:25,296 --> 01:10:27,196 Well, this happens to be a picture from a textbook 1738 01:10:27,396 --> 01:10:30,186 that was implementing a trie for people's first names. 1739 01:10:30,416 --> 01:10:32,666 And notice, even though it's not obvious at first glance, 1740 01:10:32,666 --> 01:10:35,516 here's the root, the root node of this trie. 1741 01:10:35,856 --> 01:10:38,116 And notice what each of the elements 1742 01:10:38,316 --> 01:10:41,766 in the array is actually meant to do is it's meant 1743 01:10:41,766 --> 01:10:45,286 to represent a pointer to another node. 1744 01:10:45,686 --> 01:10:48,056 So, even though we've written, in this picture, 1745 01:10:48,056 --> 01:10:51,386 letters of the alphabet, you can think of A as just mapping 1746 01:10:51,386 --> 01:10:56,056 to bracket zero and B is mapping to bracket one and Z mapping 1747 01:10:56,056 --> 01:10:58,306 to bracket 25 and apostrophe who knows where. 1748 01:10:58,306 --> 01:11:00,546 You'd have to special case it but somewhere specific. 1749 01:11:00,966 --> 01:11:03,496 So, what really, each of these elements 1750 01:11:03,716 --> 01:11:08,706 in a trie's node represents is a pointer to another such node. 1751 01:11:08,956 --> 01:11:11,616 And so, what you do with a trie is you, 1752 01:11:12,286 --> 01:11:16,526 you implicitly store words as follows. 1753 01:11:16,626 --> 01:11:18,896 If I want to insert the name Maxwell 1754 01:11:19,086 --> 01:11:22,426 into my trie I first start at the root of the array, 1755 01:11:22,426 --> 01:11:25,296 the root of the, the root of the trie. 1756 01:11:25,926 --> 01:11:27,446 The trie initially has nothing in it. 1757 01:11:27,516 --> 01:11:28,886 This is kind of later in the story. 1758 01:11:28,886 --> 01:11:30,736 Initially the trie just has a single node 1759 01:11:30,736 --> 01:11:32,516 with a whole bunch of null pointers. 1760 01:11:32,826 --> 01:11:35,336 But, I want to store Maxwell into this trie. 1761 01:11:35,666 --> 01:11:39,786 So, I look at the value of M which is 13, 1762 01:11:39,886 --> 01:11:42,156 which is the 13th letter of the alphabet. 1763 01:11:42,156 --> 01:11:44,356 So, let me subtract one for zero indexing in 12. 1764 01:11:44,646 --> 01:11:47,886 So, I go to bracket 12 in the root nodes array. 1765 01:11:48,196 --> 01:11:49,206 It's initially null. 1766 01:11:49,426 --> 01:11:51,926 So I use malloc and I allocate another node. 1767 01:11:51,926 --> 01:11:53,736 That's going to be this node over here. 1768 01:11:53,976 --> 01:11:56,596 And I attach a pointer to location 12 1769 01:11:56,936 --> 01:11:58,946 that maps to that new node. 1770 01:11:59,576 --> 01:12:04,576 Now, I haven't put the letter M inside of the trie's node. 1771 01:12:04,836 --> 01:12:08,376 I've just implicitly assumed that the M is the 13th letter 1772 01:12:08,526 --> 01:12:10,536 which means bracket 12 if zero index. 1773 01:12:10,806 --> 01:12:12,526 So, it's pointer means, you know what, 1774 01:12:12,826 --> 01:12:16,656 there exists a pointer here because there is some word 1775 01:12:16,656 --> 01:12:18,026 in this data structure that starts 1776 01:12:18,026 --> 01:12:19,726 with the letter M. So, it's all implicit. 1777 01:12:19,806 --> 01:12:21,186 And this is what's pretty neat about this. 1778 01:12:21,506 --> 01:12:22,806 Now, the story continues. 1779 01:12:22,846 --> 01:12:25,116 The next letter in Maxwell's name is A. A maps 1780 01:12:25,196 --> 01:12:28,396 to the first letter of the alphabet or bracket zero. 1781 01:12:28,696 --> 01:12:31,696 So, I initially say, all right, give me another node. 1782 01:12:31,786 --> 01:12:32,606 So, I use malloc 1783 01:12:32,606 --> 01:12:34,596 and substantiate another node and pointers. 1784 01:12:34,906 --> 01:12:37,376 And then, I hang a pointer off 1785 01:12:37,376 --> 01:12:39,856 of bracket zero pointing at the new node. 1786 01:12:39,856 --> 01:12:42,816 And now, for space efficiency notice that the author 1787 01:12:42,816 --> 01:12:45,696 of this book did not draw all of these nodes out. 1788 01:12:45,696 --> 01:12:49,756 A trie is interesting that it's a very bulky wide data structure 1789 01:12:49,756 --> 01:12:50,996 because you have all of these arrays. 1790 01:12:51,266 --> 01:12:53,896 So he's only showing you the letter that's actually relevant. 1791 01:12:53,896 --> 01:12:55,446 But, if you continue the story, 1792 01:12:55,446 --> 01:12:58,596 what a try really is it's a multilayered hash table 1793 01:12:58,596 --> 01:13:00,996 where you initially hash on the first letter of the word, 1794 01:13:01,336 --> 01:13:03,306 then you hash again on the second letter of the word. 1795 01:13:03,396 --> 01:13:04,976 But the hash function's really simple. 1796 01:13:04,976 --> 01:13:07,246 It always returns 0 through 25 1797 01:13:07,496 --> 01:13:10,866 which tells you what the number is in the array that's supposed 1798 01:13:10,866 --> 01:13:12,156 to represent that character. 1799 01:13:12,476 --> 01:13:15,946 And then, once you get down to LL in Maxwell, 1800 01:13:16,276 --> 01:13:19,226 it turns out that you need a special marker in your structure 1801 01:13:19,226 --> 01:13:21,256 that says word stops here. 1802 01:13:21,726 --> 01:13:23,346 But in the end, what's neat about a tri, 1803 01:13:23,346 --> 01:13:25,406 and it's a tricky one to wrap one's mind 1804 01:13:25,406 --> 01:13:27,426 around the first time around, what's neat is 1805 01:13:27,426 --> 01:13:30,186 that you're not actually storing any words per say 1806 01:13:30,186 --> 01:13:31,806 in the tree it's all implicit. 1807 01:13:31,876 --> 01:13:35,966 It's a huge data structure full of pointers not actual content. 1808 01:13:36,296 --> 01:13:37,426 But, here's the magic. 1809 01:13:37,426 --> 01:13:39,366 What then is the running time of search 1810 01:13:39,366 --> 01:13:40,916 on a data structure like a trie? 1811 01:13:41,746 --> 01:13:45,496 Has table recalled is big O of n divided by k but k is a constant 1812 01:13:45,496 --> 01:13:46,296 so it was really big O 1813 01:13:46,296 --> 01:13:49,696 of n. What's the running time of search on a tri? 1814 01:13:50,296 --> 01:13:57,416 It's big O of, like well we need a new expression here. 1815 01:13:57,416 --> 01:14:00,836 Let's call it m where m is the length of a word. 1816 01:14:01,106 --> 01:14:04,346 Notice to find if a word is inside of the tri, 1817 01:14:04,696 --> 01:14:10,266 you have to hash on that word's letters M A X W E L L 1818 01:14:10,586 --> 01:14:13,296 so it looks like the running time of this is at least 7. 1819 01:14:13,726 --> 01:14:16,736 Now, maybe if it's David I'm inserting, D A V I D, 1820 01:14:16,736 --> 01:14:19,346 ooh looks like the running time of a tri is actually 5. 1821 01:14:19,646 --> 01:14:21,616 But, the point is it's some constant. 1822 01:14:22,006 --> 01:14:25,316 And, if the constant is defined as longest word, 1823 01:14:25,316 --> 01:14:29,436 the running time of a trie's search function is maybe 10 1824 01:14:29,436 --> 01:14:30,896 if that's the longest word in the dictionary, 1825 01:14:30,896 --> 01:14:32,526 maybe it's 20 maybe it's 40. 1826 01:14:32,856 --> 01:14:35,836 But it's not depending on n. Which means, 1827 01:14:35,836 --> 01:14:37,916 you can have a billion words in the tri, 1828 01:14:37,916 --> 01:14:40,656 you could have a trillion words in the tri, and you know what, 1829 01:14:40,656 --> 01:14:45,196 you can still find David in five operations, D A V I D. 1830 01:14:45,196 --> 01:14:46,746 And that's what's particularly magical 1831 01:14:46,746 --> 01:14:47,716 about this data structure. 1832 01:14:47,716 --> 01:14:50,076 If you spend, frankly, a ridiculous amount of ram, 1833 01:14:50,376 --> 01:14:51,546 the structure's pretty wide, 1834 01:14:51,826 --> 01:14:55,706 you can get truly constant time look ups depending only 1835 01:14:55,706 --> 01:14:57,456 on the length of the input not 1836 01:14:57,456 --> 01:14:59,106 on how many things you're inserting. 1837 01:14:59,446 --> 01:15:01,346 So, the challenge for you for Problem Set 6 will be 1838 01:15:01,346 --> 01:15:02,976 to choose from among those data structures, 1839 01:15:03,226 --> 01:15:05,976 and implement the fastest spellchecker possible. 1840 01:15:06,516 --> 01:15:13,380 [ Music ]