1 00:00:00,506 --> 00:00:10,576 [ Silence ] 2 00:00:11,076 --> 00:00:14,036 >> This is CS50 and this is the end of week 7 3 00:00:14,036 --> 00:00:18,186 and this is CS50's own Ansel Duff now available in iTunes. 4 00:00:18,186 --> 00:00:21,756 The link will be on today's lecture page who just so happens 5 00:00:21,756 --> 00:00:23,246 to be my advisee as well. 6 00:00:23,886 --> 00:00:27,836 So store.cs50.net has debuted thanks to your 7 00:00:27,836 --> 00:00:29,856 and the staff's contributions. 8 00:00:30,036 --> 00:00:33,556 This is the URL, realized that everything is sold at cause. 9 00:00:33,616 --> 00:00:35,916 The opportunity really is just to give you a chance 10 00:00:35,916 --> 00:00:37,816 to really live and wear CS50. 11 00:00:38,176 --> 00:00:40,796 If for reasons of financial aid you would feel, otherwise, 12 00:00:40,796 --> 00:00:43,286 excluded from anything involved in the course, please just reach 13 00:00:43,286 --> 00:00:45,366 out to me privately and we will ensure 14 00:00:45,366 --> 00:00:47,706 that you can get most anything you would like 15 00:00:47,706 --> 00:00:48,796 and we will take care of it. 16 00:00:49,196 --> 00:00:51,496 So, here's one design that was submitted, if you'd actually 17 00:00:51,496 --> 00:00:55,076 like to wear and dress as CS50 like this, so we enjoy this one. 18 00:00:55,586 --> 00:00:58,216 Another one that came out, if you'd like to wear this 19 00:00:58,216 --> 00:01:00,936 on campus, it says this on the front of your shirt. 20 00:01:01,316 --> 00:01:03,916 For those more comfortable, there's this option here. 21 00:01:03,916 --> 00:01:06,646 There was a surprising frequency of people 22 00:01:06,646 --> 00:01:08,676 who submitted screen shots 23 00:01:08,676 --> 00:01:10,606 of their appliance as their t-shirts. 24 00:01:10,896 --> 00:01:13,486 This one was creative though and that it used some ASCII art 25 00:01:13,486 --> 00:01:16,066 to convey the idea of taking CS50. 26 00:01:16,066 --> 00:01:17,166 This one took the idea 27 00:01:17,166 --> 00:01:20,266 of creating a CS50 slogan a little, literally. 28 00:01:21,336 --> 00:01:28,066 [Laughter] But perhaps my favorites are those 29 00:01:28,066 --> 00:01:31,536 that were submitted here and here. 30 00:01:31,536 --> 00:01:33,276 [ Laughter ] 31 00:01:33,276 --> 00:01:35,016 [ Applause ] 32 00:01:35,016 --> 00:01:38,276 >> So get yours today followed by if you'd 33 00:01:38,276 --> 00:01:40,506 like these slightly creepier, this one here. 34 00:01:40,506 --> 00:01:44,216 [Laughter] So that store.cs50.net. 35 00:01:44,216 --> 00:01:51,846 So, today we finished our discussion really of C, 36 00:01:51,846 --> 00:01:54,076 next week we dive into web programing. 37 00:01:54,076 --> 00:01:56,046 If you'd like to have lunch with us this week even 38 00:01:56,046 --> 00:01:59,076 if you've done this before at the cs50.net/rsvp 39 00:01:59,076 --> 00:02:01,956 and so we can talk about where we're at, at this point 40 00:02:01,956 --> 00:02:02,836 and where we're going. 41 00:02:02,836 --> 00:02:04,046 If you've already been, that's fine. 42 00:02:04,086 --> 00:02:06,536 Again, if you wanna come back, 1:15 this Friday. 43 00:02:06,996 --> 00:02:09,816 But next week already we dive into web programing 44 00:02:09,816 --> 00:02:12,336 and even though we've spent the majority of the semester focused 45 00:02:12,336 --> 00:02:17,446 on C and on this fairly low level language hence forth we 46 00:02:17,446 --> 00:02:19,456 move into the world of HTML and CSS. 47 00:02:19,646 --> 00:02:21,796 These are not really, these are not programming languages, 48 00:02:21,796 --> 00:02:23,786 they are what we'd call markup languages 49 00:02:23,816 --> 00:02:26,446 but they are the languages in which webpages are written. 50 00:02:26,706 --> 00:02:29,426 In fact, if we go to most any website out there, 51 00:02:29,426 --> 00:02:32,216 we can go to the favorite Harvard FML 52 00:02:32,216 --> 00:02:34,766 which is actually a hosted blog, we can actually look 53 00:02:34,766 --> 00:02:36,516 at the source code here. 54 00:02:36,516 --> 00:02:38,046 And this is the kind of stuff 55 00:02:38,046 --> 00:02:39,916 with which you guys will soon be familiar. 56 00:02:40,166 --> 00:02:42,656 Some of this is a little messier, a little more advanced 57 00:02:42,656 --> 00:02:43,916 than we'll see in the first day 58 00:02:44,156 --> 00:02:46,826 but this stuff here is what's called Cascading Style Sheets 59 00:02:46,826 --> 00:02:49,316 and it apparently has something to do with font sizes 60 00:02:49,316 --> 00:02:50,976 and weights and colors and so forth 61 00:02:50,976 --> 00:02:52,576 and indeed aesthetics of a webpage. 62 00:02:52,836 --> 00:02:54,816 If we scroll down further we'll start to see lots 63 00:02:54,816 --> 00:02:59,506 of open brackets and closed brackets and spans, and divs 64 00:02:59,756 --> 00:03:02,436 and all of this color coated text here is what drives, again, 65 00:03:02,696 --> 00:03:04,696 any website out there on the internet. 66 00:03:04,696 --> 00:03:07,816 But it's one thing to actually make static webpages 67 00:03:07,816 --> 00:03:10,136 that simply display formation that's a lot more fun, 68 00:03:10,136 --> 00:03:12,126 a lot more empowering to make dynamic websites. 69 00:03:12,156 --> 00:03:14,906 They take user input, they produce dynamic output 70 00:03:14,906 --> 00:03:16,566 and Harvard FML is an example of that. 71 00:03:16,906 --> 00:03:19,786 I saw uharvard.com is another example alright? 72 00:03:19,786 --> 00:03:20,816 [Inaudible] the voice-- 73 00:03:20,816 --> 00:03:23,726 the noise are not actually just posting these thing statically, 74 00:03:23,976 --> 00:03:26,936 they're actually receiving form submission from actual users 75 00:03:27,156 --> 00:03:28,876 and then they've written some code on the back end 76 00:03:28,876 --> 00:03:31,126 to actually display these posts. 77 00:03:31,126 --> 00:03:33,686 And all of this will be possible thanks to languages like PHP, 78 00:03:33,686 --> 00:03:36,996 a data base language called SQL, 79 00:03:36,996 --> 00:03:39,376 and in fact for your final projects you're welcome to go 80 00:03:39,376 --> 00:03:41,236 with those languages or beyond. 81 00:03:41,496 --> 00:03:45,506 We have students every year who dabble in C sharp, in Python, 82 00:03:45,506 --> 00:03:47,406 in Ruby, there're so many languages 83 00:03:47,466 --> 00:03:48,836 that will be available to you. 84 00:03:48,836 --> 00:03:52,036 And in fact, statistically, over 80 percent of you will end 85 00:03:52,036 --> 00:03:54,506 up doing web based final projects because one, 86 00:03:54,506 --> 00:03:55,996 it's relatively accessible, two, 87 00:03:55,996 --> 00:03:58,156 it's familiar, three, it's useful. 88 00:03:58,156 --> 00:04:00,636 Probably, not as exciting to tell your friends to come 89 00:04:00,636 --> 00:04:03,176 over to your laptop, sit down in front of your C50 appliance, 90 00:04:03,456 --> 00:04:05,246 and run your game or run your program. 91 00:04:05,246 --> 00:04:06,926 That might have satisfied you in the first week 92 00:04:06,926 --> 00:04:09,206 but it'd be nice now to just e-mail to your house list 93 00:04:09,206 --> 00:04:11,176 or your parents or your friends, or your URL and say, 94 00:04:11,476 --> 00:04:13,876 start using my amazing new website. 95 00:04:13,996 --> 00:04:16,816 So, that is what is on their horizon. 96 00:04:16,816 --> 00:04:19,726 But for today we close out our discussion of some 97 00:04:19,726 --> 00:04:21,986 of the more advanced data structures in C. 98 00:04:21,986 --> 00:04:25,546 And how we might actually build tools that we're going 99 00:04:25,546 --> 00:04:27,766 to soon start taking for granted. 100 00:04:27,766 --> 00:04:30,396 You're gonna see that a lot of Cs restrictions 101 00:04:30,396 --> 00:04:33,116 and nitpickingness are gonna fall by the way side for better 102 00:04:33,116 --> 00:04:34,556 or for worse in the upcoming weeks. 103 00:04:34,896 --> 00:04:37,236 So, that if you wanna big chunk of storage, you're gonna have 104 00:04:37,236 --> 00:04:40,996 to think a lot, lot less in PHP to actually get that memory. 105 00:04:40,996 --> 00:04:43,266 You can just take for granted that it's accessible for you. 106 00:04:43,266 --> 00:04:45,786 If you want something called a linked list or an array, 107 00:04:45,926 --> 00:04:48,496 you'll find that syndactyly it's a lot easier to get at it. 108 00:04:48,616 --> 00:04:50,446 But we're also gonna use fancier things, 109 00:04:50,446 --> 00:04:54,516 things called hash tables and trees and other data structures 110 00:04:54,516 --> 00:04:57,296 that will actually look at today and then use 111 00:04:57,296 --> 00:04:58,846 in practice in the coming weeks. 112 00:04:59,076 --> 00:05:01,596 So, this was again a photograph of Mather and this is an example 113 00:05:01,596 --> 00:05:04,616 of a stack of trays and we photographed it simply 114 00:05:04,616 --> 00:05:05,606 because it's a representative 115 00:05:05,606 --> 00:05:08,096 of a very common computer science data structure 116 00:05:08,326 --> 00:05:09,236 that is known as a stack. 117 00:05:09,706 --> 00:05:14,926 A stack is a LIFO data structure, last in, last out. 118 00:05:15,306 --> 00:05:18,836 I'm sorry, wait a minute, last in, first out, 119 00:05:19,036 --> 00:05:21,006 LIFO, last in, first out. 120 00:05:21,236 --> 00:05:24,756 And as you might infer from reality, if you put a tray 121 00:05:24,756 --> 00:05:28,386 on top of the stack, well, unless you're, I mean an idiot, 122 00:05:28,386 --> 00:05:30,356 you're not gonna go and take the bottom most tray 123 00:05:30,356 --> 00:05:32,606 that was first put there instead you're obviously gonna take the 124 00:05:32,606 --> 00:05:33,336 one on the top. 125 00:05:33,536 --> 00:05:35,776 So, this has actually very useful implications. 126 00:05:35,776 --> 00:05:39,376 I alluded to just one fairly arcane one involving validating 127 00:05:39,376 --> 00:05:43,436 HTML but this ability, to have a data structure that allows you 128 00:05:43,436 --> 00:05:46,706 to incredibly rapidly add to it, really in constant time, 129 00:05:46,706 --> 00:05:50,256 big-O of 1, and then remove from it also in constant time, 130 00:05:50,256 --> 00:05:53,966 big-O of 1 is a powerful technique because we've dabbling 131 00:05:53,966 --> 00:05:56,466 so far with array based data structures and others 132 00:05:56,466 --> 00:06:00,036 where we often need at least log in or N. So, really today 133 00:06:00,036 --> 00:06:03,406 and onward we wanna approximate or achieve constant time. 134 00:06:03,706 --> 00:06:06,546 So, if you have a data structure like this stack, 135 00:06:07,036 --> 00:06:08,426 what do you need-- how do you go 136 00:06:08,426 --> 00:06:10,146 about implementing this, say in C? 137 00:06:10,146 --> 00:06:12,126 Well, the only syntax we have thus far 138 00:06:12,126 --> 00:06:16,066 for creating custom variables, custom types is a struct. 139 00:06:16,416 --> 00:06:18,516 So, if were to start typing, struct stack 140 00:06:18,876 --> 00:06:21,086 and then an open curly brace. 141 00:06:21,086 --> 00:06:24,736 And we want you clumped together one or more data types 142 00:06:24,736 --> 00:06:28,596 that collectively define a stack, much like an ID 143 00:06:28,596 --> 00:06:31,256 and a name and house collectively define a student, 144 00:06:31,476 --> 00:06:33,286 what do we need to be able to keep track of? 145 00:06:33,336 --> 00:06:35,426 If suppose we wanna stack, not of trays, 146 00:06:35,426 --> 00:06:37,516 but let's keep it simple, a stack of integers. 147 00:06:37,516 --> 00:06:39,746 I want a data structure where I can put a number 148 00:06:39,746 --> 00:06:41,626 and then a number and then a number and then a number 149 00:06:41,736 --> 00:06:43,166 and I want it to be easy to get 150 00:06:43,166 --> 00:06:45,696 at the most recently added number. 151 00:06:46,246 --> 00:06:47,956 What could I put inside of the struct 152 00:06:47,956 --> 00:06:50,326 and then start calling the whole thing a stack? 153 00:06:51,636 --> 00:06:52,546 What might you propose? 154 00:06:53,976 --> 00:06:55,506 Again, the goal is to store numbers, 155 00:06:55,576 --> 00:06:58,266 integers and a stack of them. 156 00:06:59,136 --> 00:07:01,736 Well, what existing data structure can we kind of steal 157 00:07:01,736 --> 00:07:04,036 to implement at least part of this idea, yeah? 158 00:07:04,906 --> 00:07:07,416 Okay, so we could use a linked list, we could use a linked list 159 00:07:07,416 --> 00:07:08,406 which we'll look at on Monday 160 00:07:08,406 --> 00:07:10,376 which is a dynamic data structure that allows you 161 00:07:10,376 --> 00:07:13,526 to grow and shrink your data structure as on demand, 162 00:07:13,526 --> 00:07:14,386 when you need more memory. 163 00:07:14,576 --> 00:07:16,606 And in fact, we could even regress a bit more. 164 00:07:16,606 --> 00:07:19,066 We could use even the simpler array, right, 165 00:07:19,066 --> 00:07:20,476 and that would actually kind 166 00:07:20,476 --> 00:07:22,536 of conceptually be pretty consistent with the stack 167 00:07:22,536 --> 00:07:23,796 because much like these trays, 168 00:07:23,796 --> 00:07:25,156 they are all right next to each other. 169 00:07:25,316 --> 00:07:27,556 So, within an array of integers, be right next to each other, 170 00:07:27,556 --> 00:07:28,906 but of course what's the downside 171 00:07:28,906 --> 00:07:30,736 of using an array really for most anything? 172 00:07:30,866 --> 00:07:32,636 >> It's a fixed size. 173 00:07:32,866 --> 00:07:33,906 >> It's fixed size, right? 174 00:07:33,906 --> 00:07:37,386 It might be fine but you're either gonna spend way more 175 00:07:37,386 --> 00:07:39,486 memory than you need by over allocating 176 00:07:39,486 --> 00:07:41,936 and then just wasting it or you're gonna under allocate 177 00:07:41,936 --> 00:07:43,986 and in that case you've ended up wasting time 178 00:07:44,076 --> 00:07:46,526 because if you allocate too few integers and you need more, 179 00:07:46,776 --> 00:07:49,036 you then have to call malloc again but with bigger-- 180 00:07:49,306 --> 00:07:50,666 a bigger number and then you have 181 00:07:50,666 --> 00:07:52,576 to copy your old data into the new. 182 00:07:52,576 --> 00:07:56,846 And then you typically have to call free on the old array 183 00:07:56,846 --> 00:07:57,886 to give that memory back. 184 00:07:58,086 --> 00:08:01,376 So, we can have an array but if we just have an array of numbers 185 00:08:01,726 --> 00:08:04,506 to represent a stack, we need at least one more details. 186 00:08:04,506 --> 00:08:06,936 So, here is how in C we might represent a stack. 187 00:08:07,176 --> 00:08:09,876 So, recall that typedef defines your own type a synonym 188 00:08:10,106 --> 00:08:12,346 of sorts, struct says here comes a bunch 189 00:08:12,346 --> 00:08:14,866 of data types collectively call them, the following, 190 00:08:15,026 --> 00:08:17,346 and then the word at the end stack is just the arbitrary, 191 00:08:17,536 --> 00:08:19,796 but conventional name, we wanna give to this data structure. 192 00:08:19,996 --> 00:08:22,426 And I propose even simpler than a linked list, 193 00:08:22,626 --> 00:08:25,256 we could implement a stack with just 2 data members. 194 00:08:25,256 --> 00:08:27,266 Previously again, we implemented a student 195 00:08:27,266 --> 00:08:28,886 with three-- ID, name and house. 196 00:08:29,186 --> 00:08:32,066 A stack might just be an array of numbers, 197 00:08:32,066 --> 00:08:35,286 an array of integers, capacity is probably a constant, 198 00:08:35,506 --> 00:08:38,346 sharp defined some number, a hundred and twenty eight, 199 00:08:38,346 --> 00:08:40,266 a thousand something, arbitrary but big. 200 00:08:40,886 --> 00:08:43,376 And why do I need to know about the size, 201 00:08:43,836 --> 00:08:46,116 and semantically we mean a slight distinction here. 202 00:08:46,116 --> 00:08:48,196 Capacity is the total capacity, 203 00:08:48,796 --> 00:08:50,756 how many total numbers you could put in here, 204 00:08:50,756 --> 00:08:52,646 so what might size actually represent 205 00:08:52,776 --> 00:08:53,766 in this context of a stack? 206 00:08:54,116 --> 00:08:57,066 Yeah, how many things are actually in there? 207 00:08:57,256 --> 00:08:59,646 So, initially capacity might be a big number, a thousand, 208 00:08:59,646 --> 00:09:03,046 so that gives me a thousand integers potentially in an array 209 00:09:03,336 --> 00:09:04,886 but initially there's nothing in there. 210 00:09:05,086 --> 00:09:06,496 So, I at least have to remember 211 00:09:06,496 --> 00:09:09,806 for this data structure stack that it size is zero. 212 00:09:09,806 --> 00:09:12,496 And then anytime I add a number to this stack, 213 00:09:12,496 --> 00:09:13,786 well what are you really doing? 214 00:09:13,786 --> 00:09:16,116 You're gonna add a number to the array and then another one, 215 00:09:16,116 --> 00:09:17,386 and then another, and another one. 216 00:09:17,386 --> 00:09:20,096 And so long as you remember how many times you've inserted 217 00:09:20,096 --> 00:09:22,286 numbers into this array, it doesn't even matter 218 00:09:22,286 --> 00:09:24,986 if there's a whole bunch garbage values because if you do it 219 00:09:24,986 --> 00:09:27,736 in order from left to right or conceptually from bottom 220 00:09:27,936 --> 00:09:31,096 up in the array so as long as you remember the size 221 00:09:31,486 --> 00:09:34,506 that it's 0 or 1 or 2, you know that the first-- 222 00:09:34,766 --> 00:09:36,996 those first integers are legit 223 00:09:37,146 --> 00:09:38,926 and the rest might be garbage values. 224 00:09:39,236 --> 00:09:40,926 But there is a problem here, what happens 225 00:09:40,926 --> 00:09:44,686 when I insert the 1001 for instance number 226 00:09:44,856 --> 00:09:45,936 into this data structure? 227 00:09:45,936 --> 00:09:50,006 What, what would you do in that case? 228 00:09:50,216 --> 00:09:51,436 What do you say? 229 00:09:51,436 --> 00:09:52,816 I mean, same problem you might, 230 00:09:52,816 --> 00:09:56,446 if you just blindly go past the boundaries of this array 231 00:09:56,446 --> 00:09:59,636 by going to capacity or capacity plus 1 or plus 2. 232 00:10:00,046 --> 00:10:01,216 >> You might very well seg fault 233 00:10:01,526 --> 00:10:04,496 so in the worst case you might have to either just accept that, 234 00:10:04,556 --> 00:10:07,116 which is not good, or two, you might have 235 00:10:07,246 --> 00:10:09,816 to actually re-implement this thing using malloc. 236 00:10:09,816 --> 00:10:13,416 It's probably not ideal to statically allocate 237 00:10:13,416 --> 00:10:16,016 that many integers inside of your structure 'cause 238 00:10:16,016 --> 00:10:18,376 in this definition you can never re-grow it. 239 00:10:18,376 --> 00:10:19,956 Because I'm not using pointers here, 240 00:10:19,956 --> 00:10:23,016 because I'm using a hard coded array, I can't just call malloc 241 00:10:23,016 --> 00:10:26,246 or realloc later so this is a design decision and again, 242 00:10:26,246 --> 00:10:27,956 when we talk about these axes along 243 00:10:27,956 --> 00:10:30,726 which we grade problems that's design takes these things 244 00:10:30,726 --> 00:10:32,536 into account If you are comfortable 245 00:10:32,536 --> 00:10:33,796 with the design constraint, 246 00:10:34,036 --> 00:10:36,826 that you can only have say a thousand numbers in this stack, 247 00:10:37,306 --> 00:10:39,306 well maybe that's actually quite fine 248 00:10:39,606 --> 00:10:41,746 but if that's not acceptable, well then we need 249 00:10:41,746 --> 00:10:44,756 to re-engineer this, and we do need to use malloc or pointers 250 00:10:44,756 --> 00:10:46,526 or somehow to get that dynamism 251 00:10:46,556 --> 00:10:49,696 or as you proposed use a linked list but these are tradeoffs. 252 00:10:49,896 --> 00:10:50,886 Like what might be the trade? 253 00:10:50,886 --> 00:10:52,826 What's the first tradeoff that comes to mind if I propose, 254 00:10:52,826 --> 00:10:53,836 alright, let's scrap this 255 00:10:54,066 --> 00:10:56,166 and let's actually re-implement this using linked list. 256 00:10:56,536 --> 00:10:57,706 So we don't have this upper bound. 257 00:10:58,146 --> 00:11:03,056 Just instinctively like what's the downside of that approach? 258 00:11:03,056 --> 00:11:03,576 [ Inaudible Remark ] 259 00:11:03,576 --> 00:11:06,506 >> Yeah, and I mean this is totally legitimate. 260 00:11:06,506 --> 00:11:07,466 It's harder, right? 261 00:11:07,466 --> 00:11:09,676 We saw on Monday and I deliberately didn't even go 262 00:11:09,676 --> 00:11:11,826 through all of those lines of code 'cause you know it's 263 00:11:11,826 --> 00:11:13,146 like a hundred or so lines 264 00:11:13,186 --> 00:11:15,166 to collectively implement those linked list 265 00:11:15,166 --> 00:11:18,136 that we first simulated with humans but if you actually coded 266 00:11:18,136 --> 00:11:20,686 up you need an insert function and a delete function 267 00:11:20,686 --> 00:11:21,856 and you glimpse toward the end 268 00:11:21,856 --> 00:11:24,486 of Monday how relatively complex the code was. 269 00:11:24,486 --> 00:11:27,176 You had to handle insertion at the head, at the tail, 270 00:11:27,176 --> 00:11:28,996 in the middle, deletion at the beginning, at the end 271 00:11:28,996 --> 00:11:31,736 and it's all doable and that code is in fact correct 272 00:11:32,056 --> 00:11:34,506 but it's just a lot of work, my God, I know how to use arrays 273 00:11:34,506 --> 00:11:36,646 and like week 2, here is the implementation 274 00:11:36,646 --> 00:11:38,986 of the data structure now, all I have to do is index 275 00:11:38,986 --> 00:11:41,496 into that array using some square bracket 276 00:11:41,496 --> 00:11:43,076 and dot notation, done. 277 00:11:43,076 --> 00:11:45,596 So this too is a valid design constraint especially 278 00:11:45,596 --> 00:11:47,896 in the real world, if you can either implement something 279 00:11:47,896 --> 00:11:50,136 that solves the problem in an hour 280 00:11:50,376 --> 00:11:53,456 or you can solve this problem and many others in a week. 281 00:11:53,766 --> 00:11:54,546 You know you have to decide 282 00:11:54,546 --> 00:11:57,876 for yourself what's actually worth it and so one of the goals 283 00:11:57,876 --> 00:12:00,196 of the final project too is going to try to rain 284 00:12:00,196 --> 00:12:02,866 in your own ambitions and you'll be asked to propose 285 00:12:03,116 --> 00:12:05,356 that you'd be happy would say this good version 286 00:12:05,356 --> 00:12:07,316 of a final project and this better version 287 00:12:07,316 --> 00:12:09,486 of a final project and this best version 288 00:12:09,486 --> 00:12:11,746 of your final project 'cause realistically you can sort 289 00:12:11,746 --> 00:12:13,806 of imagine implementing the whole world 290 00:12:13,806 --> 00:12:15,866 in every feature you might imagine but as you've seen 291 00:12:15,866 --> 00:12:16,726 with problem sets, 292 00:12:16,726 --> 00:12:20,306 just implementing one seemingly simple feature can sometimes 293 00:12:20,306 --> 00:12:22,276 waste or cost you some five hours 294 00:12:22,276 --> 00:12:24,446 of your life 'cause you just don't anticipate the things 295 00:12:24,446 --> 00:12:24,886 that arise. 296 00:12:25,106 --> 00:12:28,586 So getting better at estimating difficulty in time will be one 297 00:12:28,586 --> 00:12:30,856 of the key features of that process. 298 00:12:31,256 --> 00:12:35,346 So here is for instance a queue, and a queue is a FIFO-- 299 00:12:35,496 --> 00:12:39,106 F-I-F-O, First-In-First-Out data structure and this one 300 00:12:39,106 --> 00:12:41,686 in the real world is certainly more fair, right? 301 00:12:41,686 --> 00:12:43,766 You probably-- if you get there early again you wanna be the 302 00:12:43,766 --> 00:12:45,976 first one let into the store, and not the last 303 00:12:45,976 --> 00:12:48,806 and so a queue has its own applications and here too, 304 00:12:48,806 --> 00:12:51,306 a queue could very reasonably be implemented 305 00:12:51,306 --> 00:12:53,666 with a fixed to size array, right? 306 00:12:53,816 --> 00:12:55,906 For instance, if this is some kind of giveaway 307 00:12:55,906 --> 00:12:58,986 where they only have a finite number of iPhones or iPads, 308 00:12:59,256 --> 00:13:01,856 well, they don't need to let more than say 500 people 309 00:13:01,856 --> 00:13:04,806 in line 'cause the 500, the first person is not going to get 310 00:13:04,806 --> 00:13:07,736 into the store so it's sometimes whether in the real world 311 00:13:07,736 --> 00:13:09,966 or in a program, might be totally reasonable 312 00:13:09,966 --> 00:13:12,636 to balance the sides of your data structure so again, 313 00:13:12,636 --> 00:13:15,686 we could implement this thing called a queue perhaps a little 314 00:13:15,686 --> 00:13:18,126 something like this where we have another array called 315 00:13:18,126 --> 00:13:21,576 numbers, it too is initialized to capacity whether that's 500 316 00:13:21,576 --> 00:13:24,466 or 1000, whatever, it's a constant but this time I have 317 00:13:24,496 --> 00:13:27,986 to keep track of two things potentially, two things, 318 00:13:27,986 --> 00:13:30,556 whereas the stack is kind of nice and simple 319 00:13:30,556 --> 00:13:33,326 in that you add a tray or add a number, add one, add one, 320 00:13:33,456 --> 00:13:35,356 add one, add one and then as soon as you remove one, 321 00:13:35,556 --> 00:13:38,346 it just comes off the top, off the top, off the top. 322 00:13:38,566 --> 00:13:40,726 Notice what stay in constant, the first thing you put 323 00:13:40,726 --> 00:13:42,456 in is staying where it is. 324 00:13:42,456 --> 00:13:44,696 Your elements are going up and down, and up and down 325 00:13:45,036 --> 00:13:47,736 but in a queue it's a little different because supposed 326 00:13:47,736 --> 00:13:50,896 in a queue some, someone gets in line 327 00:13:51,296 --> 00:13:54,216 and then another person gets in line and then by 5 a.m. you have 328 00:13:54,216 --> 00:13:56,336 like 10 people in line at the Apple store 329 00:13:56,606 --> 00:13:58,096 and then they start opening the store 330 00:13:58,096 --> 00:13:59,246 but the queue is not filled, right? 331 00:13:59,246 --> 00:14:00,596 It could still wrap around the block 332 00:14:00,856 --> 00:14:02,836 but those first few people start to go in. 333 00:14:03,086 --> 00:14:05,266 Well, in the real world to the rest of the humans 334 00:14:05,266 --> 00:14:08,846 who arrived later are going to probably shift in line, right? 335 00:14:08,846 --> 00:14:10,836 So the end of the line-- 336 00:14:11,056 --> 00:14:12,906 the beginning of the line is always the same but the end 337 00:14:12,906 --> 00:14:15,926 of the line is kind of moving here and so when it comes time 338 00:14:15,926 --> 00:14:17,686 to implement something like a queue, 339 00:14:18,066 --> 00:14:21,206 we can definitely remember its size, how many people are 340 00:14:21,206 --> 00:14:23,226 in line right now, how many integers are 341 00:14:23,226 --> 00:14:27,306 in this array right now but if we want this thing to be dynamic 342 00:14:27,576 --> 00:14:29,676 in that we're implementing it with an array 343 00:14:30,006 --> 00:14:33,796 and our array looks like this where I'll just draw it 344 00:14:33,986 --> 00:14:39,676 as our typical array and I'll put in something like let's say, 345 00:14:40,016 --> 00:14:44,256 Alice and this is Bob who gets in line and this is Charlie 346 00:14:44,256 --> 00:14:46,296 who gets in line here and I still have spots 347 00:14:46,296 --> 00:14:47,746 for two more people in this line. 348 00:14:47,916 --> 00:14:49,786 Well, as soon as the store opens, 349 00:14:49,836 --> 00:14:52,086 Alice is effectively gonna be taken out of line 350 00:14:52,306 --> 00:14:55,376 but if we're now implementing this notion of a queue 351 00:14:55,376 --> 00:14:59,896 in a program you know we could move Bob and Charlie all the way 352 00:14:59,896 --> 00:15:02,116 to the left and every time someone gets into the store, 353 00:15:02,116 --> 00:15:03,476 we could shift them all to the left 354 00:15:03,476 --> 00:15:07,136 but what's the design concern with that approach? 355 00:15:07,136 --> 00:15:07,203 [ Inaudible Remark ] 356 00:15:07,203 --> 00:15:09,716 >> Yeah, it's O of N, right? 357 00:15:09,716 --> 00:15:11,156 It's kind of expensive. 358 00:15:11,156 --> 00:15:13,806 Every time we pop someone off the queue, 359 00:15:13,936 --> 00:15:16,766 we potentially shift everyone to the end of the array, 360 00:15:16,766 --> 00:15:19,056 I mean that might be nice and clean and it makes 361 00:15:19,056 --> 00:15:22,146 for nicer pictures if everyone is kind of moving up in the line 362 00:15:22,446 --> 00:15:25,116 but really in programming we could just remember 363 00:15:25,376 --> 00:15:27,716 where the start of the line is so rather than assume 364 00:15:27,716 --> 00:15:29,736 that the start of the line is at bracket 0, 365 00:15:29,996 --> 00:15:32,506 just now update some index variable, call it, 366 00:15:32,586 --> 00:15:35,616 what do call it here, call it head for head of the line 367 00:15:35,886 --> 00:15:38,236 and just do plus plus and remember that Bob is the head 368 00:15:38,236 --> 00:15:40,316 of the line and you know what this now represents, 369 00:15:41,106 --> 00:15:42,166 the end of the line, right? 370 00:15:42,166 --> 00:15:44,676 We can reuse this memory so that when this person 371 00:15:44,676 --> 00:15:46,416 and this person, and then the next person come 372 00:15:46,416 --> 00:15:49,026 in we don't have to kick them out of line, we can say okay, 373 00:15:49,026 --> 00:15:52,386 fine, go here but in computing we can just remember 374 00:15:52,386 --> 00:15:54,196 that the head of the line is somewhere else 375 00:15:54,276 --> 00:15:57,396 so that we can avoid this expense of potentially big O 376 00:15:57,396 --> 00:15:59,146 of N for shuffling people around. 377 00:15:59,466 --> 00:16:03,086 So in that sense, implementing a queue is actually a little more 378 00:16:03,086 --> 00:16:06,256 complex because we have to keep one more piece of information 379 00:16:06,256 --> 00:16:08,746 around if we wanna implement things efficiently. 380 00:16:09,016 --> 00:16:11,776 So notice here the tradeoff and this is a very minor tradeoff 381 00:16:11,996 --> 00:16:14,196 but by spending 32 more bits 382 00:16:14,276 --> 00:16:18,006 for one more int called head we can avoid spending what 383 00:16:18,386 --> 00:16:19,376 as a resource? 384 00:16:19,826 --> 00:16:19,926 >> Time. 385 00:16:20,856 --> 00:16:24,276 >> Time, right and this is a constant thing especially this 386 00:16:24,276 --> 00:16:26,446 week and next week and even with the final project. 387 00:16:26,716 --> 00:16:30,346 You have so often in programming this tradeoff between well, 388 00:16:30,346 --> 00:16:32,406 if you spend a little more time, you can save, 389 00:16:32,666 --> 00:16:35,366 rather if you spend a little more space, you can save time 390 00:16:35,806 --> 00:16:37,976 or vice versa and the first time we saw this, 391 00:16:37,976 --> 00:16:39,156 I think was with merge sort. 392 00:16:39,156 --> 00:16:41,706 Remember that merge sort was blazingly fast compared 393 00:16:41,706 --> 00:16:43,066 to bubble and selection sort 394 00:16:43,356 --> 00:16:46,806 but the little dirty secret was we technically needed an extra 395 00:16:46,806 --> 00:16:49,256 array and that's why we left more space on stage for all 396 00:16:49,256 --> 00:16:52,156 of these milk cartons because we needed some temporary space 397 00:16:52,156 --> 00:16:53,926 to actually merge the elements into. 398 00:16:54,136 --> 00:16:56,446 So we spent twice as much space but my God, 399 00:16:56,446 --> 00:16:58,756 we went from N square too N log N 400 00:16:58,756 --> 00:17:01,456 which we saw multiple times which is much faster. 401 00:17:01,456 --> 00:17:03,896 So there's this them of tradeoffs in terms of space, 402 00:17:03,896 --> 00:17:05,816 in terms of time, in terms of difficulty 403 00:17:06,016 --> 00:17:07,676 or human time spent implementing. 404 00:17:07,956 --> 00:17:10,046 All of these are legit resources. 405 00:17:10,046 --> 00:17:13,246 But the holy grail is gonna be a data structure 406 00:17:13,526 --> 00:17:16,886 that lets us just put stuff into it and take stuff out of it 407 00:17:17,296 --> 00:17:19,466 in ideally constant time, right? 408 00:17:19,466 --> 00:17:21,506 If you don't care about this notion of a queue 409 00:17:21,756 --> 00:17:24,266 or this concept of a stock, all you care 410 00:17:24,266 --> 00:17:26,006 about is having some kind of black box 411 00:17:26,066 --> 00:17:28,446 into which you can put stuff and then you wanna be able 412 00:17:28,446 --> 00:17:31,656 to get it back out by saying give me the-- 413 00:17:31,876 --> 00:17:34,016 let's say give me the student with ID 13 414 00:17:34,016 --> 00:17:36,306 or give me the student with ID 16, 415 00:17:36,306 --> 00:17:39,366 you don't care how this black box is implemented you just want 416 00:17:39,366 --> 00:17:40,916 the answer to come back like that. 417 00:17:41,396 --> 00:17:43,936 Now you can obviously implement a black box with an array 418 00:17:43,936 --> 00:17:46,286 and within your search but again in the worst case, 419 00:17:46,286 --> 00:17:49,176 if you wanna get a student with some ID, it might be at the end 420 00:17:49,176 --> 00:17:51,716 of your array or at the back of this black box 421 00:17:51,716 --> 00:17:53,406 so you could take big O of N time 422 00:17:53,616 --> 00:17:55,926 but the goal today is again constant time. 423 00:17:56,236 --> 00:17:58,216 So how might we go about implementing this? 424 00:17:58,216 --> 00:18:01,326 Well, the spoiler is that it's gonna be called a hash table. 425 00:18:01,566 --> 00:18:05,786 But how we actually implement this magical black box is a 426 00:18:05,786 --> 00:18:07,676 little more interesting if not obvious. 427 00:18:07,976 --> 00:18:12,096 So supposed we initially create a data structure using an array 428 00:18:12,096 --> 00:18:14,046 of some sort so we'll call it table 429 00:18:14,046 --> 00:18:17,296 but a table is effectively synonymous now with an array 430 00:18:17,666 --> 00:18:22,096 and supposed that the goal here is to store things 431 00:18:22,096 --> 00:18:26,316 like let's say people's names in this array. 432 00:18:26,606 --> 00:18:28,436 So supposed I want a data structure, 433 00:18:28,436 --> 00:18:31,206 we're even more simple than a student object, 434 00:18:31,206 --> 00:18:33,926 I just wanna store student's names and then I wanna be able 435 00:18:33,926 --> 00:18:36,266 to check is Alice in my black box? 436 00:18:36,266 --> 00:18:39,016 Is Bob in the black box or they-- have I seen them before 437 00:18:39,016 --> 00:18:40,446 and put them into this data structure? 438 00:18:40,446 --> 00:18:42,206 I just wanna be able to answer those queries. 439 00:18:42,606 --> 00:18:47,896 So I propose implicitly here in array aka table that's of 26 440 00:18:48,416 --> 00:18:49,646 and that's not the coincidence. 441 00:18:49,646 --> 00:18:52,276 What might the 26 conjure up ideas of here? 442 00:18:52,276 --> 00:18:53,026 [ Inaudible Remark ] 443 00:18:53,026 --> 00:18:54,906 >> Yeah, so alphabetical somehow, right? 444 00:18:54,906 --> 00:18:57,916 So what if-- if we're given an arbitrary list of names Alice, 445 00:18:57,956 --> 00:19:00,376 Bob, Charlie, and so forth in random order of O, 446 00:19:00,666 --> 00:19:02,386 suppose I wanna keep track of whether 447 00:19:02,386 --> 00:19:05,336 or not I've seen those names before, where do I put them? 448 00:19:05,786 --> 00:19:08,796 Well, I could every time you hand me a name I can store it 449 00:19:08,796 --> 00:19:10,146 in this black box and just put it 450 00:19:10,146 --> 00:19:11,636 at the first available location. 451 00:19:11,636 --> 00:19:15,056 I can put it at the bracket 0, bracket 1, bracket 2, bracket 3 452 00:19:15,556 --> 00:19:19,456 but if I insert things into a data structure in random order, 453 00:19:19,586 --> 00:19:22,856 what's gonna be the running time of search when I ask for the-- 454 00:19:23,156 --> 00:19:24,566 that same structure back out? 455 00:19:25,566 --> 00:19:26,726 It's gonna big O of N, right? 456 00:19:26,726 --> 00:19:27,906 If you do something in random order 457 00:19:27,906 --> 00:19:29,966 and we've seen this multiple times, we saw it when Shaun was 458 00:19:29,966 --> 00:19:32,736 up here on video on the board just trying to find some number, 459 00:19:32,946 --> 00:19:33,756 well, if you put things 460 00:19:33,756 --> 00:19:36,786 in random order you're just differing your effort 'til the 461 00:19:36,816 --> 00:19:39,076 end of the process 'til you wanna get that thing back. 462 00:19:39,246 --> 00:19:39,926 So we don't want that. 463 00:19:39,926 --> 00:19:42,336 Today we want constant time ideally lookups. 464 00:19:42,646 --> 00:19:44,916 So as this table suggests, if we-- 465 00:19:44,916 --> 00:19:48,446 if I hand you Charlie first rather than put Charlie 466 00:19:48,446 --> 00:19:52,076 at the first available location, where might you instead put him? 467 00:19:52,976 --> 00:19:56,406 So bracket 0, 1, 2, 3, 4, which one comes to mind? 468 00:19:57,236 --> 00:19:58,216 >> So maybe bracket 2. 469 00:19:58,396 --> 00:20:00,466 Right, maybe you should just in advance decide 470 00:20:00,466 --> 00:20:02,656 that bracket 0 is going to be for people whose names start 471 00:20:02,656 --> 00:20:05,296 with A. Bracket 1 is going to be for people whose names start 472 00:20:05,296 --> 00:20:08,226 with B. Bracket 2 is going to be for people whose names start 473 00:20:08,226 --> 00:20:11,466 with C and then people whose names start with Z end up with 474 00:20:11,466 --> 00:20:14,146 at the end of this array at location 25 475 00:20:14,196 --> 00:20:17,216 but because it's an array, you can get there instantly, right, 476 00:20:17,216 --> 00:20:20,036 using square bracket notation, you have so called random access 477 00:20:20,086 --> 00:20:21,806 to memory so you can jump right there. 478 00:20:21,806 --> 00:20:26,016 So even though it looks like Zeek is getting the short end 479 00:20:26,016 --> 00:20:28,556 of the stick by going all the way at the end of the black box, 480 00:20:28,726 --> 00:20:30,606 you still have instantaneous insertions. 481 00:20:31,266 --> 00:20:32,846 Except if what happens? 482 00:20:33,496 --> 00:20:37,176 Right. What if two people have the name, a name that starts 483 00:20:37,176 --> 00:20:39,356 with Z or maybe more realistically, 484 00:20:39,356 --> 00:20:42,026 two people who have a name that starts with A or B or some 485 00:20:42,026 --> 00:20:44,056 of the more common letters of the English alphabet. 486 00:20:44,326 --> 00:20:45,786 Well, now we kind of have a problem. 487 00:20:45,986 --> 00:20:48,846 If in advance, we only allocated 26 buckets, 488 00:20:48,846 --> 00:20:52,536 we can store 26 people but just statistically a class 489 00:20:52,536 --> 00:20:55,896 of 26 students is not going to have unique starts 490 00:20:56,216 --> 00:20:59,946 of everyone's name so what could we do just instinctively? 491 00:20:59,946 --> 00:21:03,036 Suppose I insert Charlie and he goes to bracket 2 492 00:21:03,396 --> 00:21:07,056 and then Chuck comes along and I need to insert him as well, 493 00:21:07,556 --> 00:21:08,446 where do we put Chuck, 494 00:21:08,816 --> 00:21:10,546 whose name obviously starts with C as well? 495 00:21:11,016 --> 00:21:12,016 >> 2D array. 496 00:21:12,246 --> 00:21:13,306 >> Okay, so maybe a 2D array. 497 00:21:13,306 --> 00:21:16,506 Right, so that's actually not a bad idea so maybe instead 498 00:21:16,506 --> 00:21:19,716 of allocating a 1D array, I could allocate a 2D array 499 00:21:19,946 --> 00:21:22,386 and then just put Charlie there first and then put Chuck 500 00:21:22,386 --> 00:21:23,636 to the right him conceptually. 501 00:21:23,636 --> 00:21:25,516 We start drawing this is a grid, alright, 502 00:21:25,596 --> 00:21:31,006 but then Carmen comes along and so now we have a third C name 503 00:21:31,316 --> 00:21:32,586 and what do we do with Carmen? 504 00:21:34,286 --> 00:21:36,726 So right, we have a 2 times 3 D array, right? 505 00:21:36,726 --> 00:21:39,566 So we need another dimension so this is not a bad idea, right? 506 00:21:39,566 --> 00:21:42,096 If you need to be able to insert more people into the array, 507 00:21:42,096 --> 00:21:44,406 well just kind of increase your number of dimension 508 00:21:44,406 --> 00:21:47,006 so that you can have people vertically so to speak, 509 00:21:47,226 --> 00:21:48,396 and also horizontally. 510 00:21:48,586 --> 00:21:51,186 But the problem there is every time we double the dimensions 511 00:21:51,186 --> 00:21:54,176 from 2 by-- 1 by N to 2 512 00:21:54,176 --> 00:21:57,706 by N we'll now allocating effectively more space 513 00:21:57,706 --> 00:22:00,356 for Z names and Y names and Q names 514 00:22:00,356 --> 00:22:01,866 and we're probably not going to have that many of them 515 00:22:02,086 --> 00:22:03,316 so there's a downside there. 516 00:22:03,316 --> 00:22:05,626 If we use arrays or 2-dimensional 517 00:22:05,656 --> 00:22:07,026 or 3-dimensional arrays, no less, 518 00:22:07,306 --> 00:22:08,646 we're kind of wasting space 519 00:22:08,646 --> 00:22:11,646 on the less common letters even though we're solving problems 520 00:22:11,646 --> 00:22:14,286 for the least common or for the more common letters. 521 00:22:14,616 --> 00:22:16,066 So what about this? 522 00:22:16,186 --> 00:22:19,206 Why don't we try to leverage this reality statistically 523 00:22:19,206 --> 00:22:21,886 that there's not that many people whose names starts with Q 524 00:22:21,886 --> 00:22:23,926 or Z, my apologies statistically 525 00:22:23,926 --> 00:22:25,946 to those whose names do start with Q or Z here. 526 00:22:26,366 --> 00:22:29,006 But suppose that we're willing to accept 527 00:22:29,006 --> 00:22:31,626 that for small data sizes we're not going to get someone 528 00:22:31,626 --> 00:22:34,266 with the Q name or the Z or a few other letters as well. 529 00:22:34,516 --> 00:22:38,176 So what if I first try to go to bracket 2 and I try 530 00:22:38,176 --> 00:22:40,806 to insert Charlie there and I get some kind of collision, 531 00:22:41,116 --> 00:22:45,106 well why don't I just-- well if bracket 2 was busy, well, 532 00:22:45,106 --> 00:22:47,476 let me put in bracket 4 or 3 or bracket 4. 533 00:22:47,476 --> 00:22:49,986 In other words, just start probing the array so to speak, 534 00:22:49,986 --> 00:22:53,926 from top to bottom and the first available spot below Charlie is 535 00:22:53,926 --> 00:22:58,526 where you can put this other C name, so does this work? 536 00:22:58,526 --> 00:22:59,206 Does this not work? 537 00:22:59,206 --> 00:23:01,246 Well, definitely makes more efficient use 538 00:23:01,246 --> 00:23:02,426 of the space, right? 539 00:23:02,426 --> 00:23:06,306 Because now if we have 26 buckets, we're going to fill all 540 00:23:06,306 --> 00:23:08,986 of them at some point even if not everyone is exactly 541 00:23:08,986 --> 00:23:11,546 where we would like them to be but in the worst case, 542 00:23:11,546 --> 00:23:12,536 what's going to happen here? 543 00:23:13,646 --> 00:23:15,766 What's going to be the worst case lookup time for Alice 544 00:23:15,866 --> 00:23:18,066 if Alice just so happens to be inserted last, 545 00:23:18,356 --> 00:23:19,496 number 26 in the class? 546 00:23:20,126 --> 00:23:22,746 And there's already someone else with the letter A as their name. 547 00:23:22,826 --> 00:23:28,036 Yeah, it's going to take us big O of N again to get 548 00:23:28,036 --> 00:23:29,766 to Alice so not ideal, right? 549 00:23:29,766 --> 00:23:31,866 'Cause Alice might get arbitrarily put at the end 550 00:23:31,866 --> 00:23:34,566 of the array and just finding her, again, this is kind 551 00:23:34,566 --> 00:23:37,056 of now devolving back into random order. 552 00:23:37,056 --> 00:23:39,106 And random is typically bad at least in terms 553 00:23:39,106 --> 00:23:41,546 of running time here so maybe this isn't 554 00:23:41,546 --> 00:23:42,366 such a big deal, right? 555 00:23:42,366 --> 00:23:43,646 I'm kind of making up these names. 556 00:23:43,646 --> 00:23:45,966 I don't even know where Carmen came from off the top of my head 557 00:23:45,966 --> 00:23:49,376 so maybe this idea of collisions is not all that likely. 558 00:23:49,576 --> 00:23:51,586 Well, we can actually answer this question 559 00:23:52,106 --> 00:23:56,016 if you assume a room full of, say, N of us, in CS50 students, 560 00:23:56,176 --> 00:23:59,336 what is the probability that two students here are going 561 00:23:59,336 --> 00:24:01,126 to have the same birthday, right? 562 00:24:01,126 --> 00:24:02,566 This is kind of the same question. 563 00:24:02,566 --> 00:24:06,956 There's 365 or 66 days in a year and that's a finite number. 564 00:24:06,956 --> 00:24:11,286 There's a finite number of us, less so on some days and you, 565 00:24:11,286 --> 00:24:16,436 presumably, if we have 367 students here today or earlier 566 00:24:16,436 --> 00:24:19,306 in the semester then there's gotta be a collision, right? 567 00:24:19,306 --> 00:24:20,956 This is what's called the pigeonhole principle. 568 00:24:20,956 --> 00:24:22,066 For more on this, take CS21. 569 00:24:22,066 --> 00:24:22,716 So we could do this. 570 00:24:22,816 --> 00:24:27,416 Actually, we can probably spend like 20 minutes trying to figure 571 00:24:27,416 --> 00:24:29,616 out who has the same birthdays here 572 00:24:29,616 --> 00:24:32,016 but this could actually very quickly devolve into a big mess 573 00:24:32,246 --> 00:24:34,306 but let's assume that this is probable. 574 00:24:34,426 --> 00:24:36,536 Let's actually see if we can slap a number on this 575 00:24:36,956 --> 00:24:39,426 so this is just a formula 576 00:24:39,786 --> 00:24:41,506 that actually answers the other question. 577 00:24:41,506 --> 00:24:43,126 What's the probability that every one 578 00:24:43,126 --> 00:24:44,586 of us has a different birthday? 579 00:24:44,976 --> 00:24:46,906 Well, this is actually a little easier to compute 580 00:24:47,206 --> 00:24:48,816 and we can do it as follows. 581 00:24:48,816 --> 00:24:51,356 If I pick one person on the audience what's the probability 582 00:24:51,356 --> 00:24:53,926 that you have a different birthday than everyone else 583 00:24:53,926 --> 00:24:54,846 at this point in time? 584 00:24:54,846 --> 00:24:57,456 100 percent 'cause I haven't considered anyone else 585 00:24:57,686 --> 00:24:58,806 so that's 100 percent. 586 00:24:58,806 --> 00:25:01,556 That's 1 over 1, hence the 1 in this formula. 587 00:25:01,726 --> 00:25:03,806 Now, I move on to the second student in the room. 588 00:25:03,806 --> 00:25:08,836 Well, she can have any possible birthday out of 364 days 589 00:25:08,866 --> 00:25:10,816 because already one day was taken by someone else 590 00:25:10,816 --> 00:25:12,196 so if we want the probability 591 00:25:12,196 --> 00:25:15,166 that these two students do not have the same birthdays, 592 00:25:15,516 --> 00:25:19,386 we do a 100 percent times like 99 percent, give or take, 593 00:25:19,386 --> 00:25:22,466 or otherwise known as 1 minus 1 over 365 594 00:25:22,716 --> 00:25:24,466 and that's just give us the probability at this point 595 00:25:24,466 --> 00:25:26,986 in the story that these two have different birthdays 596 00:25:27,026 --> 00:25:28,216 and then we keep multiplying. 597 00:25:28,216 --> 00:25:29,066 We keep multiplying. 598 00:25:29,066 --> 00:25:31,486 We keep multiplying and if you actually do all these out 599 00:25:31,486 --> 00:25:33,576 or look at the back of like a physics or a math book, 600 00:25:33,916 --> 00:25:35,846 you get this formula, which at first glance, 601 00:25:35,906 --> 00:25:37,406 admittedly is not all that enlightening. 602 00:25:37,456 --> 00:25:40,356 You're probably not graphing this mentally in your head all 603 00:25:40,356 --> 00:25:43,276 that quickly but if we do actually graph not this 604 00:25:43,836 --> 00:25:45,976 but 1 minus is this 'cause recall 605 00:25:45,976 --> 00:25:48,936 that the question was what's the probability of a collision? 606 00:25:49,156 --> 00:25:51,486 Right now, this is the probability of no collisions. 607 00:25:51,486 --> 00:25:54,456 I want to know if we're going to start talking about hash tables 608 00:25:54,456 --> 00:25:57,586 and using arrays and have to deal with multiple C names 609 00:25:57,586 --> 00:26:00,156 or A names or B names, how likely is that? 610 00:26:00,526 --> 00:26:04,766 Well as the birthday problem suggests, it's actually really, 611 00:26:04,766 --> 00:26:08,846 really likely so this is a plot whereby the X axis represents 612 00:26:08,846 --> 00:26:11,126 number of students in the room or number 613 00:26:11,126 --> 00:26:13,756 of birthdays equivalently the left-- 614 00:26:13,756 --> 00:26:16,516 Y axis represents the probability of a collision 615 00:26:16,516 --> 00:26:20,496 of two people having the same birthday and what's remarkable 616 00:26:20,496 --> 00:26:23,586 about this is that as soon as you get to a class full 617 00:26:23,586 --> 00:26:27,496 of like 49, 52, 58 students, give or take, 618 00:26:27,756 --> 00:26:31,036 there is a nearly 100 percent probability even 619 00:26:31,036 --> 00:26:33,166 in a class that's much smaller than CS50. 620 00:26:33,216 --> 00:26:36,356 A class of just 50 students that there's going to be a collision 621 00:26:36,356 --> 00:26:37,816 and that's thanks to combinatorics 622 00:26:37,816 --> 00:26:38,676 and just the probability 623 00:26:38,676 --> 00:26:41,806 of someone actually having the same birthday of someone else. 624 00:26:42,066 --> 00:26:43,126 So what's the takeaway here? 625 00:26:43,126 --> 00:26:45,336 Well, even if you scroll back, you know this is kind 626 00:26:45,336 --> 00:26:48,496 of remarkable even in the class of like 28 students 627 00:26:48,876 --> 00:26:52,626 which is much smaller, there's like a 60 or 70 percent chance 628 00:26:52,956 --> 00:26:54,716 of two people having the same birthdays. 629 00:26:55,056 --> 00:26:56,446 Now, again what's the take away here? 630 00:26:56,446 --> 00:26:58,466 Well, if we now generalize this to the problem 631 00:26:58,536 --> 00:27:01,126 of data structures and this thing called apparently 632 00:27:01,126 --> 00:27:01,926 hash table. 633 00:27:02,856 --> 00:27:04,926 We might assume that we're not going 634 00:27:04,956 --> 00:27:06,476 to have Q names or Z names. 635 00:27:06,476 --> 00:27:07,776 This is not really a big deal. 636 00:27:07,776 --> 00:27:08,776 We're not going to have collisions 637 00:27:09,016 --> 00:27:12,366 but indeed we are even if we scale down to 26 people 638 00:27:12,366 --> 00:27:14,266 or 26 names, the probability 639 00:27:14,266 --> 00:27:16,956 of a collision two people having the same birthday 640 00:27:16,956 --> 00:27:19,306 or if we assume a uniform distribution of names, 641 00:27:19,306 --> 00:27:21,266 the probability of two people 642 00:27:21,266 --> 00:27:23,326 in a 26 student-class is 60 percent. 643 00:27:23,566 --> 00:27:26,306 So in short, this is a problem so it's not safe to sort 644 00:27:26,306 --> 00:27:29,056 of sweep this detail under the rug and just hope that, 645 00:27:29,446 --> 00:27:30,856 what are the odds that this happens? 646 00:27:31,146 --> 00:27:34,646 In fact, it's very often going to devolve into a situation 647 00:27:34,646 --> 00:27:36,416 where our lookups, our big O of N 648 00:27:36,666 --> 00:27:39,276 and the whole goal here is to avoid that. 649 00:27:39,376 --> 00:27:42,346 So I used this buzz word a moment ago, when your probing 650 00:27:42,346 --> 00:27:45,666 to actually start somewhere in the array and then just kind 651 00:27:45,666 --> 00:27:48,626 of probe for an open available space if you want 652 00:27:48,626 --> 00:27:51,746 to insert something into this array like a student or a number 653 00:27:52,066 --> 00:27:55,306 and its ideal location is already occupied 654 00:27:55,306 --> 00:27:57,306 but we already discussed ultimately the downside. 655 00:27:57,306 --> 00:27:58,416 In the very worst case, 656 00:27:58,416 --> 00:28:01,446 and that's generally what big O notation conjures up. 657 00:28:01,446 --> 00:28:04,676 In the very worst case, we might just insert Alice 658 00:28:04,676 --> 00:28:06,066 into this data structure last. 659 00:28:06,316 --> 00:28:08,186 There is already someone else like Adam 660 00:28:08,186 --> 00:28:10,366 with the name whose starts with A and there was a Bob 661 00:28:10,366 --> 00:28:11,696 and a Charlie and everyone else 662 00:28:11,966 --> 00:28:14,976 so the very last Alice gets stuck all the way at the end 663 00:28:14,976 --> 00:28:18,626 of the data structure which means when I say high is Alice 664 00:28:18,626 --> 00:28:20,746 in this list and I ask the black box, 665 00:28:20,976 --> 00:28:22,486 it's going to take it linear time. 666 00:28:22,486 --> 00:28:25,406 Big O of N and literally N steps to confirm 667 00:28:25,546 --> 00:28:27,186 or deny the answer to that question. 668 00:28:27,186 --> 00:28:30,526 So not ideal but very efficient use of space 669 00:28:30,526 --> 00:28:32,426 but what it's costing us, time. 670 00:28:32,786 --> 00:28:34,986 So there again is the straight off so rather 671 00:28:34,986 --> 00:28:36,476 than just do a 2-dimensional array, 672 00:28:37,146 --> 00:28:39,986 why don't we steal what we talked about Monday and start 673 00:28:39,986 --> 00:28:41,646 to merge these two ideas. 674 00:28:42,066 --> 00:28:44,806 So this thing here is a hash table 675 00:28:44,806 --> 00:28:46,416 but it's a little fancier looking 676 00:28:46,646 --> 00:28:49,706 in that what we really store vertically, so to speak, 677 00:28:49,776 --> 00:28:53,296 is still an array but this time, it's not an array of numbers, 678 00:28:53,296 --> 00:28:56,396 it's not an array of strings, it's just an array of pointers 679 00:28:56,726 --> 00:28:58,476 because pointers recall can point 680 00:28:58,476 --> 00:29:01,266 to dynamically allocated data structures whereby you don't 681 00:29:01,266 --> 00:29:03,636 know in advance how much member you're going to need 682 00:29:03,636 --> 00:29:05,096 or how many structures you're going to need. 683 00:29:05,316 --> 00:29:06,516 So why even pre-commit? 684 00:29:06,876 --> 00:29:09,716 We can just decide initially in this case, this is related 685 00:29:09,716 --> 00:29:12,776 to a birthday scenario, so in this case, I have an array 686 00:29:12,776 --> 00:29:15,886 of size 31 just because of the textbook that we scan is 687 00:29:15,886 --> 00:29:18,016 from it's not zero index but that's fine. 688 00:29:18,266 --> 00:29:22,576 There's 31 pointers and each of these now represents this idea 689 00:29:22,576 --> 00:29:24,016 of a birthday in a given month. 690 00:29:24,476 --> 00:29:27,686 What is the proba-- rather who has a birthday 691 00:29:27,686 --> 00:29:30,086 of like January 1 or February 1? 692 00:29:30,086 --> 00:29:34,036 Will that person might go here or something 31 will go here 693 00:29:34,036 --> 00:29:36,016 or 27, 26 and so forth. 694 00:29:36,226 --> 00:29:39,506 So we only allocate a bunch of pointers so our table 695 00:29:39,506 --> 00:29:42,636 in this case is not size 26 anymore, it's size 31 696 00:29:42,636 --> 00:29:45,736 but it's still finite but what we have now is the ability 697 00:29:45,736 --> 00:29:48,486 to grow laterally so I think the example 698 00:29:48,486 --> 00:29:52,406 that this author used was a famous people. 699 00:29:52,406 --> 00:29:54,016 I think that might be Sam Adams up top. 700 00:29:54,296 --> 00:29:56,956 So Sam Adams has a birthday I think 701 00:29:56,956 --> 00:29:59,566 and that's why I'm making this up of something 2 in the month 702 00:29:59,566 --> 00:30:00,976 and so he was inserted there. 703 00:30:01,076 --> 00:30:04,526 >> So what does it mean now to insert into this data structure? 704 00:30:04,526 --> 00:30:07,846 We're not just plopping him or her into a preexisting array. 705 00:30:08,466 --> 00:30:11,906 What kind of code do we have to write to insert S. Adams 706 00:30:12,416 --> 00:30:13,706 into this data structure? 707 00:30:15,026 --> 00:30:16,276 What functions do we need to call 708 00:30:16,276 --> 00:30:22,186 if we implemented this in C? 709 00:30:22,446 --> 00:30:25,446 It's related to memory, it's related to allocation? 710 00:30:26,336 --> 00:30:26,496 >> Malloc. 711 00:30:26,496 --> 00:30:27,356 >> Malloc, okay. 712 00:30:27,666 --> 00:30:29,966 Actually, I think it was funny. 713 00:30:29,966 --> 00:30:31,806 I think we saw in the quiz a function calls like 714 00:30:31,806 --> 00:30:34,426 "mallocates", that was one. 715 00:30:34,886 --> 00:30:39,336 And yeah, there are some other funny ones I'll keep to myself. 716 00:30:40,196 --> 00:30:42,296 So, ask your TF. 717 00:30:42,296 --> 00:30:45,946 So if we want to actually insert S. Adams into this list 718 00:30:45,946 --> 00:30:48,186 and the location at which we want 719 00:30:48,186 --> 00:30:51,256 to insert him is initially null and null is suggested by all 720 00:30:51,256 --> 00:30:53,246 of these slash marks, that's just a null pointer. 721 00:30:53,516 --> 00:30:55,306 Well, we have to go ahead and call mallocates 722 00:30:55,306 --> 00:30:56,916 and create a new struct and inside 723 00:30:56,916 --> 00:31:00,136 of that is gonna be apparently a string like S dot Adams 724 00:31:00,136 --> 00:31:01,296 and then a new line character. 725 00:31:01,506 --> 00:31:03,486 But what is this thing probably represents? 726 00:31:03,486 --> 00:31:06,276 To the right of S. Adams in this data structure is another slash 727 00:31:06,556 --> 00:31:09,116 and another square, what does that represent perhaps? 728 00:31:09,116 --> 00:31:10,306 [ Inaudible Remark ] 729 00:31:10,306 --> 00:31:12,696 >> Yeah. So it represents null, it represents a pointer. 730 00:31:12,856 --> 00:31:16,496 So in fact these aren't strings per se, each of these rectangles 731 00:31:16,496 --> 00:31:18,996 that represent a person, S. Adams, J. Adams 732 00:31:18,996 --> 00:31:22,436 and so forth is apparently instead a struct. 733 00:31:22,436 --> 00:31:26,016 Inside of which is a string just like we saw previously inside 734 00:31:26,016 --> 00:31:28,346 of a student is an ID and a name and a house, 735 00:31:28,376 --> 00:31:30,096 the last two of which are strings. 736 00:31:30,506 --> 00:31:32,326 Here apparently one of these structures, 737 00:31:32,326 --> 00:31:35,086 let's just generically call it a node, N-O-D-E. 738 00:31:35,396 --> 00:31:38,096 So this might just be a string as part of the struct 739 00:31:38,306 --> 00:31:40,506 and then this other thing is just a pointer. 740 00:31:40,796 --> 00:31:42,756 And as we typically draw pointers as squares, 741 00:31:42,756 --> 00:31:45,096 that's probably 4 bytes or 32 bits and it's null 742 00:31:45,336 --> 00:31:46,326 because there's no one else 743 00:31:46,326 --> 00:31:48,856 in this data structure after S. Adams. 744 00:31:48,906 --> 00:31:52,966 But apparently, W. Floyd and J. Adams share the same birth dates 745 00:31:53,066 --> 00:31:54,336 of the month, and so we have 746 00:31:54,336 --> 00:31:57,476 to start chaining those two guys together in order 747 00:31:57,476 --> 00:31:59,016 to put them at the same location. 748 00:31:59,306 --> 00:32:01,466 Now, this two is not ideal, alright. 749 00:32:01,466 --> 00:32:03,846 This does avoid what problem, this is better 750 00:32:03,846 --> 00:32:05,676 than this design, why? 751 00:32:06,386 --> 00:32:11,396 What problem do we avoid completely or theoretically? 752 00:32:11,986 --> 00:32:14,526 What's that? 753 00:32:14,526 --> 00:32:15,166 [ Inaudible Remark ] 754 00:32:15,166 --> 00:32:17,356 >> Linear time so it's actually not quite, 755 00:32:17,356 --> 00:32:18,146 we'll come back to that. 756 00:32:18,256 --> 00:32:20,166 But what was the biggest deal breaker 757 00:32:20,166 --> 00:32:22,386 with this thing besides collisions? 758 00:32:23,876 --> 00:32:24,576 It's limited, right? 759 00:32:24,576 --> 00:32:25,446 It's finite space. 760 00:32:25,446 --> 00:32:27,656 We had 26 buckets in this case which means if we try 761 00:32:27,656 --> 00:32:30,126 to insert the 27th person, my god we have to jump 762 00:32:30,126 --> 00:32:32,026 through this hoop of reallocating more memory, 763 00:32:32,026 --> 00:32:33,966 copying everything over and again, 764 00:32:33,996 --> 00:32:35,636 this is typically not an ideal solution. 765 00:32:35,636 --> 00:32:38,956 So this approach called Separate Chaining definitely solves 766 00:32:38,956 --> 00:32:39,656 that problem. 767 00:32:39,806 --> 00:32:42,736 And I say theoretically because you could still run 768 00:32:42,736 --> 00:32:44,856 out of memory in total in your computer 769 00:32:44,856 --> 00:32:48,086 so you might not be able to allocate more 770 00:32:48,086 --> 00:32:50,866 of the structures here but that's only if you're talking 771 00:32:50,866 --> 00:32:53,546 like 2 gigabytes or something crazy inside of the program. 772 00:32:53,816 --> 00:32:58,516 So this at least is completely dynamic but technically, 773 00:32:58,776 --> 00:33:02,166 the running time now of finding someone in this data structure 774 00:33:03,006 --> 00:33:06,116 in this ideal case it's still constant time. 775 00:33:06,116 --> 00:33:08,746 In the ideal case it's just S. Adams, he's the only guy there, 776 00:33:08,746 --> 00:33:11,656 we jump right through that spot and bam, we found S. Adams. 777 00:33:11,996 --> 00:33:15,396 But if you kind of do this overtime in the worst case, 778 00:33:15,396 --> 00:33:17,036 what's your data structure gonna look like? 779 00:33:17,036 --> 00:33:21,036 If this is my array of pointers and I'm chaining off of this, 780 00:33:21,036 --> 00:33:23,506 all of the actual people, what's the worst case scenario 781 00:33:23,506 --> 00:33:25,336 in a class of students when you are trying 782 00:33:25,336 --> 00:33:26,366 to insert them into this array? 783 00:33:26,366 --> 00:33:27,076 [ Inaudible Remark ] 784 00:33:27,076 --> 00:33:30,956 >> Yeah, they all have the same birth date of the month 785 00:33:30,956 --> 00:33:32,116 or in our previous example, 786 00:33:32,116 --> 00:33:34,726 their names all oddly enough start with the letter A 787 00:33:34,726 --> 00:33:35,846 or some specific letter. 788 00:33:36,066 --> 00:33:39,526 So even though we might have a really smart design, 789 00:33:39,776 --> 00:33:42,516 it's possible that just by bad luck, 790 00:33:42,756 --> 00:33:46,006 these things could be ending up here, here and so forth. 791 00:33:46,166 --> 00:33:47,216 So what's this devolving 792 00:33:47,216 --> 00:33:49,546 into despite the vertical aspect of this picture? 793 00:33:50,686 --> 00:33:52,126 It's just the linked list, right? 794 00:33:52,126 --> 00:33:54,726 So the running time potentially of a hash table even 795 00:33:54,726 --> 00:33:56,126 in this again it's called a hash table, 796 00:33:56,126 --> 00:33:57,466 this is just a different version 797 00:33:57,466 --> 00:33:58,916 of it that's using separate chaining. 798 00:33:59,256 --> 00:34:04,026 Well, even in this case it could devolve back into linear time. 799 00:34:04,026 --> 00:34:08,206 But if we do this intelligently and we actually use, 800 00:34:08,316 --> 00:34:11,876 let's call it a hash function, that's more intelligent 801 00:34:12,176 --> 00:34:14,346 than just saying use the first letter of their name 802 00:34:14,346 --> 00:34:16,606 or use their birth date, the day of the month 803 00:34:16,606 --> 00:34:18,966 on which they were born because these clearly lend themselves 804 00:34:18,966 --> 00:34:21,346 to perhaps disproportionate collisions. 805 00:34:21,676 --> 00:34:26,076 You know what we could do in the best case, none of these chains 806 00:34:26,286 --> 00:34:28,416 so to speak is going to be that long. 807 00:34:28,416 --> 00:34:31,166 This one might be two-- this one might have two nodes, 808 00:34:31,166 --> 00:34:32,476 this one might have two nodes. 809 00:34:32,746 --> 00:34:37,566 Ideally, if we have a whole bunch of pointers vertically 810 00:34:37,856 --> 00:34:40,106 and we create these chains laterally, 811 00:34:40,196 --> 00:34:42,856 ideally we're gonna have these chains being 812 00:34:42,856 --> 00:34:44,226 of roughly equal length. 813 00:34:44,656 --> 00:34:48,726 So ideally, what we have is a running time of big O of N 814 00:34:48,726 --> 00:34:49,856 where N is the total number 815 00:34:49,856 --> 00:34:52,006 of elements inside of this structure. 816 00:34:52,386 --> 00:34:54,456 And let's call it arbitrarily K 817 00:34:55,196 --> 00:35:00,376 where K might refer to what here? 818 00:35:00,596 --> 00:35:01,916 Why do I divide by K? 819 00:35:02,896 --> 00:35:04,186 The goal is to answer the question, 820 00:35:04,186 --> 00:35:06,746 what's the running time of Search ideally here? 821 00:35:07,506 --> 00:35:08,256 >> The size of the array. 822 00:35:08,326 --> 00:35:10,616 >> The size of the array, the size of the hash table, 823 00:35:10,616 --> 00:35:12,086 the height of the hash table. 824 00:35:12,346 --> 00:35:15,446 So if this is height K, so this is either 26 825 00:35:15,446 --> 00:35:17,296 and 1 example or 31 and another. 826 00:35:17,506 --> 00:35:20,306 That's what we mean by K. So this is better, right? 827 00:35:20,306 --> 00:35:25,516 Big O of N over K is better than big O of N but only 828 00:35:25,516 --> 00:35:26,556 in the real world, right? 829 00:35:26,556 --> 00:35:30,096 If K is a fixed number like 26 or 31, what have we been doing 830 00:35:30,096 --> 00:35:31,106 for the past few weeks? 831 00:35:31,376 --> 00:35:34,766 We always kind of cross that out so here's where theory 832 00:35:34,766 --> 00:35:38,496 and practice kind of start to clash with one another. 833 00:35:38,496 --> 00:35:42,036 In the real world, N over K is absolutely better 834 00:35:42,036 --> 00:35:44,796 than N. Just arithmetically you divide some number 835 00:35:45,086 --> 00:35:49,646 by a positive number and you're going to get a smaller number 836 00:35:49,726 --> 00:35:50,846 than the original number, right? 837 00:35:50,846 --> 00:35:51,936 That just works out that way. 838 00:35:51,936 --> 00:35:54,256 So in the real world you're gonna spend less time 839 00:35:54,256 --> 00:35:56,546 or use less memory if you're dividing by that factor. 840 00:35:56,826 --> 00:35:59,586 But asymptotically sort of pre-quiz zero, 841 00:35:59,776 --> 00:36:02,946 in pre-quiz zero terms, we always said 842 00:36:02,946 --> 00:36:05,686 that this is still big O of N. So here's what's kind of neat 843 00:36:05,686 --> 00:36:08,046 and also what's neat in particular about problem set 6 844 00:36:08,046 --> 00:36:09,516 which is this speed based, 845 00:36:09,636 --> 00:36:12,286 this speed-oriented spell checking challenge. 846 00:36:12,556 --> 00:36:14,506 Well, yes it's still big O of N, 847 00:36:14,566 --> 00:36:17,006 this notion of a hash table whether we implement it 848 00:36:17,006 --> 00:36:18,986 like this which has a couple of problems. 849 00:36:19,276 --> 00:36:21,586 This which is better but still has some problems. 850 00:36:21,956 --> 00:36:23,826 In the real world doing something 851 00:36:23,826 --> 00:36:27,056 in say 100 seconds is absolutely gonna be slower 852 00:36:27,056 --> 00:36:30,626 than doing something in 10 seconds if K is 10, right? 853 00:36:30,626 --> 00:36:34,586 So any kind of real world improvement we can make actually 854 00:36:34,586 --> 00:36:37,366 benefits real people even if the mathematicians 855 00:36:37,366 --> 00:36:39,886 and theorists might say in asymptotically 856 00:36:39,886 --> 00:36:42,776 or with big enough numbers, this is really quite irrelevant 857 00:36:42,776 --> 00:36:45,226 but it's not irrelevant in the real world. 858 00:36:45,526 --> 00:36:51,236 So this promise land of having a constant time lookup is really 859 00:36:51,236 --> 00:36:52,726 only in an ideal world. 860 00:36:52,826 --> 00:36:57,186 You can construct in fact what are called ideal hash functions 861 00:36:57,456 --> 00:36:59,896 which indeed take some data structure like this 862 00:36:59,896 --> 00:37:01,256 or even one like this. 863 00:37:01,566 --> 00:37:04,966 And if you're smart enough and put enough time into the process 864 00:37:04,966 --> 00:37:06,536 of writing the hash function, 865 00:37:06,806 --> 00:37:09,486 the thing that determines where an input goes. 866 00:37:09,726 --> 00:37:12,396 Well, you can actually come up with an ideal hash function 867 00:37:12,396 --> 00:37:15,966 which will literally put everyone at a specific place 868 00:37:16,146 --> 00:37:19,006 without any collisions but it requires a lot of work. 869 00:37:19,006 --> 00:37:21,576 And so invariably every semester we have folks particularly 870 00:37:21,576 --> 00:37:24,666 that are hacker types that try to implement 4 piece at 6 871 00:37:24,666 --> 00:37:27,726 and ideal hash function and then they redefine and print 872 00:37:27,726 --> 00:37:29,996 where you're not allowed to do that in advance because you have 873 00:37:29,996 --> 00:37:31,766 to know something about the data in advance 874 00:37:31,766 --> 00:37:32,886 to actually construct that. 875 00:37:33,116 --> 00:37:35,946 So instead, you have to come up with better hash functions. 876 00:37:35,946 --> 00:37:38,326 The hash function that we've been using right now is actually 877 00:37:38,326 --> 00:37:38,806 been simple. 878 00:37:39,056 --> 00:37:42,446 You hand me a person, I look at, I being a hash function 879 00:37:42,446 --> 00:37:44,796 in this story, I look at the first letter of their name 880 00:37:44,826 --> 00:37:47,416 and I return a number between 0 and 25, 881 00:37:47,806 --> 00:37:49,496 so some letter of the alphabet. 882 00:37:49,666 --> 00:37:51,896 Or you hand me one of these apparently famous people 883 00:37:52,176 --> 00:37:55,186 and I hand you back a number which is the date 884 00:37:55,436 --> 00:37:59,056 of their birthday and then you use that index into an array. 885 00:37:59,056 --> 00:38:02,226 So hash function generally takes this input, some structure 886 00:38:02,226 --> 00:38:06,106 or some string or some number, and then returns a number 887 00:38:06,136 --> 00:38:07,746 within a specific range. 888 00:38:07,746 --> 00:38:08,606 What's that range? 889 00:38:08,676 --> 00:38:11,096 Well in this case it's 26, in this case it's 31, 890 00:38:11,626 --> 00:38:13,196 it's within some specific bound. 891 00:38:13,196 --> 00:38:16,686 So if you think back to your C operators, the modulo operator, 892 00:38:16,686 --> 00:38:18,746 it's actually quite common in this context. 893 00:38:18,746 --> 00:38:20,166 When you wanna compute something 894 00:38:20,446 --> 00:38:23,116 and then return a relatively smaller number. 895 00:38:23,506 --> 00:38:24,286 So let's do better. 896 00:38:24,286 --> 00:38:26,266 When I hand you the name of a person, 897 00:38:26,486 --> 00:38:29,866 if it's apparently not quite ideal to just look 898 00:38:29,866 --> 00:38:33,276 at their first letter, the first letter of their name, 899 00:38:33,486 --> 00:38:36,026 what could you do that's a little better that's still 900 00:38:36,026 --> 00:38:39,356 predictable, that you can repeat this process but it's not 901 00:38:39,356 --> 00:38:41,166 as naive as just looking at the first letter. 902 00:38:41,166 --> 00:38:45,576 What might you do instead? 903 00:38:45,626 --> 00:38:47,276 Alright, all you're given is a string and you have 904 00:38:47,276 --> 00:38:51,126 to somehow give me a number between let's say 0 and 25, 905 00:38:51,126 --> 00:38:53,696 but again this is arbitrary as well. 906 00:38:54,186 --> 00:38:55,276 What could you do besides looking 907 00:38:55,276 --> 00:38:56,826 at just the first letter, what else could you do? 908 00:38:56,826 --> 00:38:59,176 >> Look at the second letter. 909 00:38:59,176 --> 00:38:59,436 >> You're right. 910 00:38:59,436 --> 00:39:00,636 I mean that's actually pretty reasonable. 911 00:39:00,636 --> 00:39:01,916 You could look at the second letter. 912 00:39:02,166 --> 00:39:02,896 And what could you do? 913 00:39:02,896 --> 00:39:05,276 Well, remember that in C and in programming in general, 914 00:39:05,276 --> 00:39:06,936 a string is just a bunch of characters. 915 00:39:06,936 --> 00:39:09,906 A character is just a number so you can already think back 916 00:39:09,906 --> 00:39:12,116 to P-Set 2 with Caesar and Vigenere. 917 00:39:12,116 --> 00:39:14,966 You can already kind of treat strings as numbers. 918 00:39:15,176 --> 00:39:16,936 So maybe we could do something arbitrary 919 00:39:16,936 --> 00:39:19,796 but perhaps predictable like take the ASCII value 920 00:39:19,796 --> 00:39:22,146 of the first letter, add it to the ASCII value 921 00:39:22,146 --> 00:39:24,576 of the second letter, and that gives me a sum. 922 00:39:24,746 --> 00:39:27,086 And that sum might be a little too big, it might be 100 plus 923 00:39:27,086 --> 00:39:27,956 or something but that's okay. 924 00:39:27,956 --> 00:39:31,096 I can just use the mod operator, the percent operator in C 925 00:39:31,316 --> 00:39:33,506 and give me a number between 0 and 25. 926 00:39:33,786 --> 00:39:36,436 Now, whether or not that works is kind of unclear, right? 927 00:39:36,436 --> 00:39:38,886 There are still some dependencies in the real world 928 00:39:38,886 --> 00:39:40,756 between first letters and last letters 929 00:39:41,036 --> 00:39:43,086 like you don't generally see a name that's like ZK. 930 00:39:43,176 --> 00:39:47,836 You might see ZO or ZI but that's gonna be more common. 931 00:39:48,046 --> 00:39:49,376 So that might not be quite enough. 932 00:39:49,376 --> 00:39:52,416 So what can you do if the first two letters might not really be 933 00:39:52,616 --> 00:39:54,716 sort of independent or random enough? 934 00:39:55,746 --> 00:39:56,916 We'll do the first three or the first four. 935 00:39:57,046 --> 00:39:59,796 >> And in fact the common hash function is given a string is 936 00:39:59,916 --> 00:40:02,556 to actually add up the ASCII values of all the numbers 937 00:40:02,556 --> 00:40:05,666 and then take some modulo operator of it 938 00:40:05,666 --> 00:40:06,736 to get back a smaller number. 939 00:40:06,736 --> 00:40:09,536 Or you can do even much fancier things especially if you know 940 00:40:09,536 --> 00:40:12,086 in advance what the data is going to look 941 00:40:12,086 --> 00:40:14,026 like whether it's human names or the like. 942 00:40:14,526 --> 00:40:17,256 So, how can we represent? 943 00:40:17,966 --> 00:40:23,356 Say, let me say something like this. 944 00:40:24,506 --> 00:40:27,996 So, we can do a little better than this in fact. 945 00:40:28,396 --> 00:40:30,946 If we introduce the notion of a tree, we can actually, 946 00:40:30,946 --> 00:40:32,536 and let me tease you with the crazy looking picture, 947 00:40:32,876 --> 00:40:34,456 we can actually implement a little something 948 00:40:34,456 --> 00:40:35,336 that looks like this. 949 00:40:35,936 --> 00:40:38,166 And even though I just said that with a hash table, 950 00:40:38,166 --> 00:40:43,186 we can only get to say linear time or linear divided by K 951 00:40:43,186 --> 00:40:46,486 where K is some constant, I actually can retract that a bit. 952 00:40:46,486 --> 00:40:49,366 If you're willing to spend a lot more RAM, a lot more memory 953 00:40:49,586 --> 00:40:52,166 and have a lot of arrays inside of your program, 954 00:40:52,486 --> 00:40:56,246 we can actually boil down the problem of looking up strings 955 00:40:56,246 --> 00:40:59,776 or doing dictionary look ups in literally constant time. 956 00:41:00,186 --> 00:41:02,826 But let's pause for a couple minutes for a musical interlude 957 00:41:02,826 --> 00:41:05,906 by our own Ansel Duff again and then we'll return in 5 minutes 958 00:41:05,976 --> 00:41:07,636 and answer how this can be achieved. 959 00:41:08,116 --> 00:41:10,116 [ Music ] 960 00:41:10,596 --> 00:41:12,236 >> Okay, we're back. 961 00:41:12,696 --> 00:41:15,506 So, before we get to this amazing new data structure 962 00:41:15,506 --> 00:41:16,826 that's gonna solve all of our problems, 963 00:41:17,006 --> 00:41:20,056 let's just paint a quick picture here of this just 964 00:41:20,056 --> 00:41:22,596 to make clear how we can even go about implementing this 965 00:41:22,596 --> 00:41:25,396 so that is not completely in verbal pseudocode here. 966 00:41:25,646 --> 00:41:28,446 So, here was so far our best implementation. 967 00:41:28,446 --> 00:41:32,496 Best only and as much as it is dynamic and can grow to fit 968 00:41:32,496 --> 00:41:35,656 as many things as we actually have even though we might still 969 00:41:35,656 --> 00:41:37,006 get some long chain. 970 00:41:37,006 --> 00:41:39,976 And in the very worse case, we might get a lot of Alice names, 971 00:41:39,976 --> 00:41:42,716 A names, A names, A names, so it could devolve 972 00:41:42,716 --> 00:41:43,736 into just a linked list 973 00:41:44,046 --> 00:41:46,026 but hopefully that's not actually the case. 974 00:41:46,026 --> 00:41:47,126 Well, how do we implement each 975 00:41:47,126 --> 00:41:48,846 of these things I called a node before? 976 00:41:49,076 --> 00:41:51,886 Well again, we just need a way of expressing that picture. 977 00:41:51,886 --> 00:41:55,486 So, here's a struct called node and inside 978 00:41:55,486 --> 00:41:57,426 of the curly braces, I have just 2 fields. 979 00:41:57,426 --> 00:42:00,316 One of those fields is going to be a string so to speak. 980 00:42:00,606 --> 00:42:02,756 So it's an array of characters and it's an array 981 00:42:02,756 --> 00:42:05,946 of characters each of which is of length, length plus 1. 982 00:42:06,156 --> 00:42:07,346 And because it's-- we're in caps, 983 00:42:07,346 --> 00:42:10,246 I can infer that some constant define somewhere and it's, 984 00:42:10,246 --> 00:42:12,746 I don't know, maybe it's 16, maybe it's 128. 985 00:42:12,746 --> 00:42:16,136 It's some relatively reasonable if large constant 986 00:42:16,326 --> 00:42:19,556 that I'm assuming all of my words are shorter then. 987 00:42:19,876 --> 00:42:22,176 And the plus 1 is always for the-- 988 00:42:23,126 --> 00:42:26,276 back slash zero for the null terminating character. 989 00:42:26,506 --> 00:42:30,116 So, this thing though is that for by pointer that allows me 990 00:42:30,116 --> 00:42:32,636 to chain S. Adams to someone else, 991 00:42:32,726 --> 00:42:34,226 to someone else, to someone else. 992 00:42:34,486 --> 00:42:36,746 And the reason that there's this apparent redundancy 993 00:42:36,746 --> 00:42:40,306 between the word node here and node here is again the rule 994 00:42:40,306 --> 00:42:43,096 of thumb is if you're creating a data structure inside 995 00:42:43,096 --> 00:42:45,706 of which you want to have a pointer to another one 996 00:42:45,706 --> 00:42:49,266 of your self as object-- as a struct of the same type, 997 00:42:49,596 --> 00:42:52,456 you need to be able to express that with some word. 998 00:42:52,606 --> 00:42:54,806 So, if you say, struct node or struct foo 999 00:42:54,806 --> 00:42:56,196 or whatever you wanna call it, 1000 00:42:56,196 --> 00:42:57,786 but for consistency I'd used the same, 1001 00:42:58,126 --> 00:43:00,646 we can now because we've declared this here refer 1002 00:43:00,646 --> 00:43:01,976 to it inside of the structure. 1003 00:43:01,976 --> 00:43:02,926 But henceforth, thanks 1004 00:43:02,966 --> 00:43:06,076 to the typedef we can just call this thing a node. 1005 00:43:06,416 --> 00:43:08,636 So that's how we might actually implement that in code 1006 00:43:08,756 --> 00:43:12,166 but let's actually now use the same idea, these pointers 1007 00:43:12,166 --> 00:43:14,836 and this linked list so to speak and those concepts for Monday 1008 00:43:15,056 --> 00:43:16,886 to start making more interesting structure. 1009 00:43:16,886 --> 00:43:18,536 So, this is what in computer science, 1010 00:43:18,626 --> 00:43:20,106 is generally known as a tree. 1011 00:43:20,106 --> 00:43:22,966 It's kind of like a family tree where there's one patriarch 1012 00:43:22,966 --> 00:43:24,856 or matriarch at the very top of the tree 1013 00:43:25,006 --> 00:43:27,026 and then you have all the descendants beneath it. 1014 00:43:27,406 --> 00:43:29,816 Well, what do these nodes each represent? 1015 00:43:29,816 --> 00:43:31,136 Well, let's just toss some jargon out 1016 00:43:31,136 --> 00:43:34,866 and we'll actually see the same jargon with HTML next week. 1017 00:43:35,096 --> 00:43:38,536 So, anything at the top here, that has arrows emanating 1018 00:43:38,536 --> 00:43:41,656 from it but not into, it is gonna be called the root node. 1019 00:43:41,656 --> 00:43:44,406 So that's again the sort of oldest person in the family 1020 00:43:44,406 --> 00:43:45,356 at the top of the tree. 1021 00:43:45,656 --> 00:43:48,896 2 and 3 here, we would generally say our children 1022 00:43:48,926 --> 00:43:52,046 of node number 1, 4, 5, 6, 7. 1023 00:43:52,046 --> 00:43:56,576 Our grandchildren of node number 1, but in turn, children of 2 1024 00:43:56,576 --> 00:44:00,506 and 3 and these things down here like 8 and 9 and even 5, 6, 7. 1025 00:44:00,786 --> 00:44:04,096 Since we already have this tree analogy, well, this is going 1026 00:44:04,096 --> 00:44:08,366 to be a-- this is gonna be a leaves of the tree, 1027 00:44:08,616 --> 00:44:10,896 anything that has arrows coming into it 1028 00:44:11,276 --> 00:44:13,046 but not actually leaving it. 1029 00:44:13,226 --> 00:44:15,096 So, these arrows, at this point in the story, 1030 00:44:15,096 --> 00:44:16,176 just represent pointers. 1031 00:44:16,176 --> 00:44:18,416 So, if you wanna create some kind of tree structure, 1032 00:44:18,636 --> 00:44:21,356 you can actually use pointers to chain them together. 1033 00:44:21,626 --> 00:44:24,246 So once you do this and have this ability 1034 00:44:24,246 --> 00:44:25,606 to chain nodes together, 1035 00:44:25,826 --> 00:44:28,096 we can actually start solving some problems 1036 00:44:28,096 --> 00:44:29,376 in more sophisticated way. 1037 00:44:29,376 --> 00:44:32,186 And again, the goal here of this weekend and 2 weeks ago 1038 00:44:32,186 --> 00:44:34,556 in introducing this data structures is to not have 1039 00:44:34,586 --> 00:44:38,046 to fall back on simple variables and simple arrays. 1040 00:44:38,046 --> 00:44:39,076 We can now start building 1041 00:44:39,076 --> 00:44:40,936 and wiring things together in memory. 1042 00:44:41,276 --> 00:44:43,406 So this I claim is a binary search tree. 1043 00:44:43,646 --> 00:44:46,176 It is conceptually identical to the notion of a tree 1044 00:44:46,176 --> 00:44:48,256 that we just slap just some jargon on, 1045 00:44:48,606 --> 00:44:51,426 but there's some key feature of this binary search tree. 1046 00:44:51,426 --> 00:44:54,376 If you look at its numbers, what pattern do you see if any? 1047 00:44:54,426 --> 00:44:57,646 >> The leftmost is the smallest. 1048 00:44:58,706 --> 00:44:59,016 >> What's that? 1049 00:44:59,016 --> 00:45:00,276 >> The leftmost is the smallest. 1050 00:45:00,276 --> 00:45:01,996 >> Yeah. The leftmost is the smallest 1051 00:45:01,996 --> 00:45:04,246 and conversely the rightmost is the biggest. 1052 00:45:04,246 --> 00:45:05,356 It's actually a coincidence 1053 00:45:05,356 --> 00:45:08,236 that they're all double digits here or identical digits. 1054 00:45:08,516 --> 00:45:10,506 But notice that on the left are smaller numbers, 1055 00:45:10,506 --> 00:45:13,176 on the right are small numbers, but more specifically, 1056 00:45:13,176 --> 00:45:16,876 if you look at any specific node, its left child 1057 00:45:16,956 --> 00:45:18,516 so to speak is less than it. 1058 00:45:18,716 --> 00:45:21,896 And its child is greater than it. 1059 00:45:21,896 --> 00:45:25,066 And with that recursive definition, we can infer 1060 00:45:25,286 --> 00:45:27,896 that all of the nodes in this tree are indeed 1061 00:45:27,896 --> 00:45:30,726 in sorted order from, so to speak, left to right. 1062 00:45:30,726 --> 00:45:33,236 There're 22 and that's a child of 33 1063 00:45:33,236 --> 00:45:34,976 so 33 is indeed bigger than it. 1064 00:45:35,186 --> 00:45:39,826 But 44 is the right child of 33 so he is bigger than 33, 1065 00:45:39,826 --> 00:45:45,876 but 44 smaller than 55 because he is a left child descendant 1066 00:45:46,256 --> 00:45:48,976 of 55, right? 1067 00:45:48,976 --> 00:45:51,656 So, anytime you go left, everything to the left 1068 00:45:51,656 --> 00:45:52,946 of 55 will be smaller. 1069 00:45:53,016 --> 00:45:55,396 Anything to the right of 55 will be greater 1070 00:45:55,726 --> 00:45:58,026 and that same definition can be replied recursively, 1071 00:45:58,026 --> 00:46:00,556 so to speak, to 33 and 77. 1072 00:46:00,886 --> 00:46:01,646 So, who cares? 1073 00:46:01,646 --> 00:46:02,486 Why is this useful? 1074 00:46:02,486 --> 00:46:05,966 Well, thus far we've been only using arrays but suppose 1075 00:46:05,966 --> 00:46:08,096 that back in the day when we had all those pieces of paper 1076 00:46:08,096 --> 00:46:11,746 on the board, we instead transform that array 1077 00:46:11,886 --> 00:46:15,886 into a tree structure where we had a pointer pointing at nodes 1078 00:46:16,106 --> 00:46:17,726 that were arranged in this way. 1079 00:46:17,726 --> 00:46:20,586 Or just imagine students for instance standing awkwardly 1080 00:46:20,586 --> 00:46:22,156 with these numbers but pointing at each other 1081 00:46:22,156 --> 00:46:23,726 with multiple hands this time. 1082 00:46:24,226 --> 00:46:26,686 What's the running time of doing search 1083 00:46:26,816 --> 00:46:29,686 on a data structure that's called a binary search tree? 1084 00:46:30,166 --> 00:46:32,526 How many steps does it take if there're N nodes 1085 00:46:32,686 --> 00:46:33,906 or N numbers in this tree? 1086 00:46:35,236 --> 00:46:37,386 So, it's actually log N. And what's nice here is 1087 00:46:37,386 --> 00:46:38,886 that you can actually see this even 1088 00:46:38,886 --> 00:46:40,796 if you're still s little rusty on logarithms. 1089 00:46:40,796 --> 00:46:43,726 Logarithms in this class have generally involved dividing 1090 00:46:43,726 --> 00:46:46,076 something by 2, dividing something by 2 again and again 1091 00:46:46,076 --> 00:46:48,056 and again and again, dividing and conquering. 1092 00:46:48,256 --> 00:46:49,956 Well, what are you doing in a picture like this? 1093 00:46:49,956 --> 00:46:52,596 Well, if you take N nodes and you have one at a top 1094 00:46:52,826 --> 00:46:55,476 but then each node has 2 children, 2 children, 1095 00:46:55,476 --> 00:46:59,676 2 children that's affectively the same as having, having, 1096 00:46:59,676 --> 00:47:02,326 having the total height of this tree. 1097 00:47:02,636 --> 00:47:04,476 So in other words, what's the height of this tree 1098 00:47:04,476 --> 00:47:08,076 if you actually have 1, 2, 3, 4, 5, 6, 7 total nodes 1099 00:47:08,076 --> 00:47:10,176 which is roughly 8 total nodes give or take. 1100 00:47:10,176 --> 00:47:11,606 If I just have an off by 1 error. 1101 00:47:11,966 --> 00:47:15,816 Well, it's actually log base 2 of 8 or 3. 1102 00:47:15,886 --> 00:47:18,836 So, the height of this tree is 3 which mean if I wanna find 1103 00:47:18,836 --> 00:47:22,226 like the number 44 and I start at the root of this tree 1104 00:47:22,226 --> 00:47:23,416 which is just like a linked list, 1105 00:47:23,416 --> 00:47:24,876 you start at one special pointer. 1106 00:47:24,876 --> 00:47:27,036 In the context of a tree, you start at the top, the root. 1107 00:47:27,416 --> 00:47:28,796 How do I find 44? 1108 00:47:28,796 --> 00:47:29,626 Well, I first check. 1109 00:47:29,916 --> 00:47:32,246 Is 44 less than or greater than 55? 1110 00:47:32,466 --> 00:47:34,606 If it's less than, I go this way. 1111 00:47:34,876 --> 00:47:37,166 But notice conceptually just like our phonebook 1112 00:47:37,166 --> 00:47:40,886 that has the effect now snipping off this entire branch 1113 00:47:40,886 --> 00:47:43,726 of the tree so I don't even need to go there anymore. 1114 00:47:43,726 --> 00:47:45,736 So, even though it's not quite perfect division, 1115 00:47:45,936 --> 00:47:48,776 the problem has just gone half, give or take, 1116 00:47:48,776 --> 00:47:51,406 in size 'cause I now only consider the left branch. 1117 00:47:51,676 --> 00:47:52,696 So, now I'm at 33. 1118 00:47:52,696 --> 00:47:55,096 Is 44 less than or greater than 33? 1119 00:47:55,096 --> 00:47:58,066 It's obviously greater than so now I chop off this branch 1120 00:47:58,066 --> 00:47:58,586 of the tree. 1121 00:47:58,776 --> 00:48:00,836 Not as exciting 'cause it's a tiny little leaf. 1122 00:48:00,946 --> 00:48:03,076 But if you imagine now a much, much bigger picture, 1123 00:48:03,266 --> 00:48:05,996 you're really pruning the size of this problem 1124 00:48:05,996 --> 00:48:09,606 and huge branches, huge halves of problems are falling off 1125 00:48:09,916 --> 00:48:11,496 with every descent into the tree. 1126 00:48:11,766 --> 00:48:13,716 So how deep might 44 be? 1127 00:48:13,976 --> 00:48:17,846 Log base 2 of N is the answer or just generally log N. 1128 00:48:18,106 --> 00:48:20,046 So anytime you have a tree structure, 1129 00:48:20,256 --> 00:48:23,236 you have this opportunity to really improve the performance 1130 00:48:23,236 --> 00:48:27,696 of inserting things or finding things or even deleting things. 1131 00:48:27,926 --> 00:48:28,846 And in fact, you probably 1132 00:48:28,846 --> 00:48:30,656 at least heard the term "database" before. 1133 00:48:30,656 --> 00:48:32,976 We'll use a database called MySQL toward the end 1134 00:48:32,976 --> 00:48:36,196 of the semester and one of the key features 1135 00:48:36,196 --> 00:48:39,236 of databases these days is to let you insert and delete 1136 00:48:39,326 --> 00:48:41,386 and retrieve data really fast 1137 00:48:41,736 --> 00:48:45,046 and they typically do this using tree structures like this. 1138 00:48:45,216 --> 00:48:47,576 Database has actually used something a called a B-tree 1139 00:48:47,576 --> 00:48:48,996 which is a little more sophisticated, 1140 00:48:49,236 --> 00:48:50,426 but this idea is the same. 1141 00:48:50,426 --> 00:48:52,446 You might have some node in memory pointing 1142 00:48:52,446 --> 00:48:54,636 to some other nodes, to some other nodes, to some other nodes 1143 00:48:54,846 --> 00:48:57,766 so these data structures might get kind of wide which is okay. 1144 00:48:58,006 --> 00:49:02,056 They don't get very tall 'cause it's in the height that you end 1145 00:49:02,056 --> 00:49:04,976 up wasting time searching deeper and deeper and deeper. 1146 00:49:04,976 --> 00:49:06,576 So how do we implement this? 1147 00:49:06,986 --> 00:49:09,096 So now it should be getting a little easier, right? 1148 00:49:09,096 --> 00:49:12,126 Given a picture, you can kind of translate it directly to C code. 1149 00:49:12,366 --> 00:49:14,016 Here is a new definition of a node, 1150 00:49:14,086 --> 00:49:15,776 so I'm using the same word, "node." 1151 00:49:15,986 --> 00:49:18,736 But inside of this node is gonna be an int N 1152 00:49:18,806 --> 00:49:21,476 for 55, 44, 33, and so forth. 1153 00:49:21,666 --> 00:49:24,166 And then each of these nodes needs a left pointer 1154 00:49:24,166 --> 00:49:25,096 and a right pointer. 1155 00:49:25,556 --> 00:49:29,686 And just to follow the logic, what is going to be the value 1156 00:49:29,686 --> 00:49:32,986 of the left pointer and the pointer for any leaf in a tree? 1157 00:49:34,186 --> 00:49:35,846 Just null, right, we already have a special value 1158 00:49:35,846 --> 00:49:36,586 to represent that. 1159 00:49:36,586 --> 00:49:38,886 So, we can literally build this kind 1160 00:49:38,886 --> 00:49:41,466 of structure more sophisticated though it might be 1161 00:49:41,466 --> 00:49:44,556 than anything we've done before using just these basic 1162 00:49:44,646 --> 00:49:46,826 primitives, an int and a couple of pointers. 1163 00:49:47,416 --> 00:49:49,046 So, let's continue this idea. 1164 00:49:49,046 --> 00:49:49,996 Rather than have each 1165 00:49:49,996 --> 00:49:52,826 of our nodes be a little tiny circle with just an int. 1166 00:49:53,286 --> 00:49:57,966 What if I cram an entire array of pointers into a node? 1167 00:49:58,216 --> 00:50:01,976 So this is what's called a trie, T-R-I-E. 1168 00:50:02,046 --> 00:50:02,736 >> The name apparently derives 1169 00:50:02,736 --> 00:50:05,966 from retrieval even though people say "trie" so retrieval. 1170 00:50:06,256 --> 00:50:08,686 But it's a trie and that's the plan word, it's a tree 1171 00:50:08,686 --> 00:50:09,836 but it's used for retrieval. 1172 00:50:10,006 --> 00:50:13,336 That's the quick-- completely mangled history of the tries. 1173 00:50:13,756 --> 00:50:15,536 So, a trie is a data structure 1174 00:50:15,536 --> 00:50:17,376 that ultimately that looks like this. 1175 00:50:17,376 --> 00:50:20,766 Each of its nodes is actually an array and typically, 1176 00:50:21,176 --> 00:50:24,106 each of its arrays represents a letter of the alphabet. 1177 00:50:24,246 --> 00:50:26,396 And it's an over simplification 'cause sometimes you need 1178 00:50:26,396 --> 00:50:27,986 to deal with punctuation and what not. 1179 00:50:28,246 --> 00:50:31,916 But a very common approach to implementing a dictionary 1180 00:50:32,116 --> 00:50:34,796 for something like a website or spellchecker, 1181 00:50:34,976 --> 00:50:38,646 is to use this trie data structure whereby each node 1182 00:50:38,646 --> 00:50:41,856 again has an array, an array of pointers so that 1183 00:50:41,856 --> 00:50:45,606 if you have a word like "Alice" how do you store Alice 1184 00:50:45,696 --> 00:50:46,636 into a trie? 1185 00:50:46,836 --> 00:50:48,416 Well, you start at this so-called "roots" 1186 00:50:48,456 --> 00:50:50,566 which is the rectangle of the very top of the picture. 1187 00:50:50,846 --> 00:50:54,486 You index into the left most elements of the array for A, 1188 00:50:54,486 --> 00:50:56,906 then that brings you to a new node. 1189 00:50:56,906 --> 00:51:00,446 Then you index into the L location then the I location 1190 00:51:00,446 --> 00:51:03,486 then the C location then the E location, and you sort 1191 00:51:03,486 --> 00:51:07,276 of implicitly store Alice in the tree by sort of marking the spot 1192 00:51:07,276 --> 00:51:10,176 at the end of this path that you're taking from top to bottom 1193 00:51:10,176 --> 00:51:13,246 in the trie saying, "Alice was here," so to speak. 1194 00:51:13,586 --> 00:51:16,396 And now this picture has done it for what words? 1195 00:51:16,396 --> 00:51:19,116 Well, we have a-- turing on the right. 1196 00:51:19,716 --> 00:51:24,956 So if we start with the letter T-U-R-I-N-G, 1197 00:51:24,956 --> 00:51:26,406 all the way over there on the right. 1198 00:51:26,646 --> 00:51:28,966 Now, realize this picture, it would be impossible to read 1199 00:51:28,966 --> 00:51:30,376 if they drew all of the arrays 1200 00:51:30,376 --> 00:51:31,676 so that's why they've chopped them off. 1201 00:51:31,956 --> 00:51:34,906 So this refers to Alan Turing, a very famous computer scientist, 1202 00:51:35,016 --> 00:51:36,956 the so called father of computer scientist, 1203 00:51:36,956 --> 00:51:38,746 also take CS121 for more information. 1204 00:51:39,136 --> 00:51:40,336 Who's all the way on the left? 1205 00:51:40,336 --> 00:51:46,436 We have Maxwell, so, Maxwell's equation, M-A-X-W-E-L-L. 1206 00:51:46,436 --> 00:51:47,716 And what's with these triangles? 1207 00:51:47,806 --> 00:51:49,756 This is just arbitrary symbology here. 1208 00:51:50,286 --> 00:51:51,546 What does that probably represent? 1209 00:51:52,746 --> 00:51:54,266 You know of some sort, it's the sort 1210 00:51:54,266 --> 00:51:55,636 of like conceptual check mark 1211 00:51:55,636 --> 00:51:58,176 like Maxwell was here and Turing was here. 1212 00:51:58,476 --> 00:52:02,236 And so that we have to somehow remember that a name stops here. 1213 00:52:02,576 --> 00:52:05,526 So, the goal here is to insert a whole bunch of names 1214 00:52:05,526 --> 00:52:09,076 or more generally in the dictionary, 150,000 words 1215 00:52:09,076 --> 00:52:10,236 into a data structure. 1216 00:52:10,486 --> 00:52:11,766 And what's curious here is 1217 00:52:11,766 --> 00:52:13,556 that we don't actually store the names. 1218 00:52:13,836 --> 00:52:17,356 Instead, we implicitly store the names by way 1219 00:52:17,356 --> 00:52:20,666 of these pointers 'cause again, you have 26 pointers or maybe 27 1220 00:52:20,666 --> 00:52:23,146 or more if you have apostrophes or other punctuation 1221 00:52:23,146 --> 00:52:23,926 that appears in names. 1222 00:52:24,346 --> 00:52:27,526 You kind of walk through this tree allocating these nodes, 1223 00:52:27,526 --> 00:52:29,616 as soon as you need them, and then just check off 1224 00:52:29,616 --> 00:52:32,866 at the bottom or with this delta symbol, "Someone was here." 1225 00:52:33,236 --> 00:52:34,896 So you implicitly store among all 1226 00:52:34,896 --> 00:52:36,926 of these pointers some kind of identifier. 1227 00:52:37,336 --> 00:52:39,296 So what's really powerful here? 1228 00:52:39,596 --> 00:52:41,816 What's the running time of looking up someone 1229 00:52:41,816 --> 00:52:44,716 in this data structure if the name is Maxwell? 1230 00:52:45,286 --> 00:52:47,406 Well, how many steps does it take to answer the question, 1231 00:52:47,406 --> 00:52:49,026 is Maxwell in this data structure? 1232 00:52:49,696 --> 00:52:56,916 If you start at the top, you go M-A-X-W-E-L-L 1233 00:52:56,916 --> 00:52:58,236 and maybe one more just to get 1234 00:52:58,236 --> 00:53:00,766 to that special sentinel value, so 8 steps. 1235 00:53:01,286 --> 00:53:03,886 Now, contrast this with one of our earlier designs 1236 00:53:03,886 --> 00:53:08,786 which was actually pretty fancy at the time, a hash table 1237 00:53:08,786 --> 00:53:11,766 with separate chaining, how many steps might it take here? 1238 00:53:11,966 --> 00:53:13,806 Well, it might actually tale big-O of N 1239 00:53:13,806 --> 00:53:17,496 over K 'cause again it doesn't matter how big 1240 00:53:17,896 --> 00:53:19,776 or small Maxwell's name is here. 1241 00:53:20,066 --> 00:53:24,436 The running time of a-- a hash table seems more dependent 1242 00:53:24,496 --> 00:53:28,036 on the number of other words in the data structure, right? 1243 00:53:28,076 --> 00:53:29,786 Because if there's lots of Alices involved, 1244 00:53:29,836 --> 00:53:31,116 the chain might be long. 1245 00:53:31,316 --> 00:53:33,696 But in the case of a trie, what's kind of neat is 1246 00:53:33,696 --> 00:53:37,196 that by construction, no one is in Maxwell's way as you kind 1247 00:53:37,196 --> 00:53:38,856 of weave your way to the tree. 1248 00:53:39,076 --> 00:53:42,766 So the running time of a trie for looking up a name and I need 1249 00:53:42,766 --> 00:53:44,506 to come up an arbitrary new letter. 1250 00:53:44,506 --> 00:53:49,746 Let's just call it M, big-O of M. But what does M represent? 1251 00:53:51,246 --> 00:53:54,646 The length of the word, the name. 1252 00:53:54,766 --> 00:53:56,806 And here's what's kind of neat asymptotically. 1253 00:53:57,066 --> 00:54:00,316 If you assume that there is a finite limit on the length 1254 00:54:00,316 --> 00:54:02,826 of people's name or just words in the English dictionary 1255 00:54:03,046 --> 00:54:04,146 and that really is the case. 1256 00:54:04,146 --> 00:54:05,866 You can search through the whole dictionary and figure 1257 00:54:05,866 --> 00:54:08,216 out that their biggest word is of this many characters, 1258 00:54:08,216 --> 00:54:11,026 48 characters or something crazy, crazy large. 1259 00:54:11,296 --> 00:54:13,026 Well technically then, M is a constant. 1260 00:54:13,386 --> 00:54:15,616 So if it's a constant, we throw away constants, 1261 00:54:15,616 --> 00:54:19,376 which means big-O of M is equivalent to big-O of 1 1262 00:54:19,376 --> 00:54:22,306 and here we have it, this holy grail of data structures, right? 1263 00:54:22,306 --> 00:54:25,706 We can now look up words that are of some length 1264 00:54:25,776 --> 00:54:28,416 but bounded length in effectively constant time. 1265 00:54:28,416 --> 00:54:30,056 And heck, even if it's not constant time, 1266 00:54:30,266 --> 00:54:33,046 it's like M-A-X-W-E-L-L plus 1. 1267 00:54:33,276 --> 00:54:36,386 So, like 8 steps to find a letter, whereas before, 1268 00:54:36,386 --> 00:54:39,426 if I had a million words in my dictionary or 150,000 words 1269 00:54:39,426 --> 00:54:42,666 in my dictionary, I could maybe divide that by K, 26. 1270 00:54:42,986 --> 00:54:44,906 I'm guessing that's gonna be more steps 1271 00:54:45,306 --> 00:54:48,546 than just 8 as in this case. 1272 00:54:48,846 --> 00:54:50,416 Now, this is not necessarily, 1273 00:54:50,506 --> 00:54:51,826 unfortunately there's always a catch, 1274 00:54:52,056 --> 00:54:53,086 the direction you wanna go 1275 00:54:53,086 --> 00:54:55,776 in when implementing the fastest possible spell checker 1276 00:54:55,776 --> 00:54:59,186 for instance because what are you paying 1277 00:54:59,246 --> 00:55:01,606 to get this seemingly amazing performance? 1278 00:55:01,606 --> 00:55:03,136 What's this feature costing you? 1279 00:55:03,966 --> 00:55:04,126 >> Memory. 1280 00:55:04,376 --> 00:55:05,066 >> Memory, right? 1281 00:55:05,296 --> 00:55:06,896 This picture is kind of misleading 1282 00:55:06,896 --> 00:55:09,526 and that they only show the letters or really the pointers 1283 00:55:09,556 --> 00:55:10,676 that you're actually using. 1284 00:55:10,946 --> 00:55:12,726 But what's not shown in this picture? 1285 00:55:12,966 --> 00:55:14,456 For every one of those nodes, 1286 00:55:14,456 --> 00:55:17,736 it's currently not showing the 25 other pointers 1287 00:55:17,736 --> 00:55:20,156 that represent the other 25 letters of the alphabet 1288 00:55:20,406 --> 00:55:21,716 that you're just wasting, right? 1289 00:55:21,716 --> 00:55:24,186 There's absolutely words that just don't exist. 1290 00:55:24,256 --> 00:55:29,226 So you might not have a word like A, A, A, B, B, C, Q, Q, Q, 1291 00:55:29,406 --> 00:55:30,906 like random sequences of letters 1292 00:55:31,136 --> 00:55:33,316 that technically you could follow pointers 1293 00:55:33,316 --> 00:55:36,026 through the picture 'cause those might be prefixes of other words 1294 00:55:36,436 --> 00:55:38,096 but there's no little check mark 1295 00:55:38,096 --> 00:55:39,906 at the bottom, there's no triangle. 1296 00:55:40,086 --> 00:55:43,696 And yet you have to spend that memory if you're using arrays. 1297 00:55:43,916 --> 00:55:47,076 Well, okay, well, arrays are-- solutions to arrays is always, 1298 00:55:47,076 --> 00:55:48,566 well, if you're wasting space with an array, 1299 00:55:48,566 --> 00:55:49,796 use what data structure instead? 1300 00:55:51,296 --> 00:55:53,566 Like a linked list so in each of this node, wait a minute. 1301 00:55:53,566 --> 00:55:56,546 Let's not waste 26 or more pointers for every letter. 1302 00:55:56,706 --> 00:55:59,596 Let's just use one node per letter but the problem 1303 00:55:59,596 --> 00:56:02,206 with linked list is that you give up random access. 1304 00:56:02,486 --> 00:56:04,556 You can't just jump to the letter A, the letter B, 1305 00:56:04,556 --> 00:56:07,286 letter C. If you devolve it back into a linked list, 1306 00:56:07,566 --> 00:56:08,596 now you have to search. 1307 00:56:08,806 --> 00:56:12,566 So then the process becomes not so much going from top to bottom 1308 00:56:12,566 --> 00:56:14,736 like this as in the case of Maxwell. 1309 00:56:14,976 --> 00:56:17,236 You instead have to kind of go left to right then down, 1310 00:56:17,236 --> 00:56:18,496 left to right, down, left to right. 1311 00:56:18,676 --> 00:56:20,256 So you have a lot of lateral searching 1312 00:56:20,256 --> 00:56:23,456 which is gonna cost you time but save you space. 1313 00:56:23,456 --> 00:56:27,466 And it turns out in terms of implementation, hash tables can 1314 00:56:27,466 --> 00:56:29,426 in fact, even though theoretically 1315 00:56:29,626 --> 00:56:32,436 and asymptotically they're much slower than constant time 1316 00:56:32,766 --> 00:56:35,146 in reality because tries use so much memory, 1317 00:56:35,456 --> 00:56:37,346 they can actually perform much more slowly. 1318 00:56:37,346 --> 00:56:39,166 But really for P set 6 every year 1319 00:56:39,166 --> 00:56:42,386 with the fastest spell checker possible, you typically end 1320 00:56:42,386 --> 00:56:44,776 up choosing between one of these two data structures, 1321 00:56:44,776 --> 00:56:48,686 hash table with chaining or trie and then deciding whether 1322 00:56:48,686 --> 00:56:51,386 or not it was a good decision or how you can improve upon it. 1323 00:56:51,626 --> 00:56:54,126 So, when a trie, how might we represent this notion 1324 00:56:54,126 --> 00:56:55,736 of a check marker, that little triangle? 1325 00:56:56,006 --> 00:56:57,256 Well, let's just call it a bull [phonetic], right? 1326 00:56:57,256 --> 00:56:58,106 It is a word. 1327 00:56:58,106 --> 00:57:00,426 Yes or no, Maxwell was here or Turing with here. 1328 00:57:00,426 --> 00:57:02,146 If it's falls, there's no word that stops here. 1329 00:57:02,146 --> 00:57:03,666 If it's true, there was a word 1330 00:57:03,876 --> 00:57:05,926 and then here, struct no children. 1331 00:57:06,116 --> 00:57:09,016 Well, notice I'm borrowing the lexicon of like a family tree 1332 00:57:09,016 --> 00:57:11,036 where a node has a bunch of children. 1333 00:57:11,276 --> 00:57:13,306 Each of which has a pointer to another node 1334 00:57:13,636 --> 00:57:15,726 but 27 of them, why 27? 1335 00:57:15,726 --> 00:57:17,876 Well, in this case as you'll see in this particular P set. 1336 00:57:18,156 --> 00:57:19,996 We allow for alphabetical characters 1337 00:57:19,996 --> 00:57:21,126 and then also apostrophes 1338 00:57:21,326 --> 00:57:23,356 which are actually pretty common in English words. 1339 00:57:23,356 --> 00:57:26,396 But we disallow other crazy punctuation that does exist 1340 00:57:26,396 --> 00:57:29,426 in some words but isn't necessarily common. 1341 00:57:30,206 --> 00:57:34,416 So we can solve though even more interesting problems here now 1342 00:57:34,416 --> 00:57:35,376 that we have trees. 1343 00:57:35,636 --> 00:57:37,666 So, Morse code is something you're probably generally 1344 00:57:37,666 --> 00:57:39,896 familiar with as a kid, and you can bang out messages. 1345 00:57:39,896 --> 00:57:42,166 You've seen these dots and dashes or sounds 1346 00:57:42,166 --> 00:57:44,836 that represent short dots or long dashes. 1347 00:57:45,156 --> 00:57:46,656 But the problem with Morse code is 1348 00:57:46,686 --> 00:57:49,236 that it's not immediately decodable. 1349 00:57:49,236 --> 00:57:52,716 In other words, if I give you the sounds 1350 00:57:52,716 --> 00:57:59,996 or I show you the symbol, let's say dot slash or beep, beeeep. 1351 00:58:00,636 --> 00:58:02,626 What did I just spell out? 1352 00:58:04,276 --> 00:58:04,346 >> A 1353 00:58:04,576 --> 00:58:06,996 >> Okay. You say A and you might be right or-- 1354 00:58:08,676 --> 00:58:12,716 >> ET. So, Morse code is not immediately decodable 1355 00:58:12,946 --> 00:58:15,196 because there're ambiguities that arise. 1356 00:58:15,196 --> 00:58:18,316 If you have dot slash, it could be a dot and a slash 1357 00:58:18,576 --> 00:58:21,096 or it could be dot slash or dot dash. 1358 00:58:21,456 --> 00:58:22,826 So, how did they solve this? 1359 00:58:22,826 --> 00:58:25,716 Well, the military folks who had hit the little beepy-dippy-- 1360 00:58:25,716 --> 00:58:28,516 in the old movies, well, they would just pause essentially 1361 00:58:28,516 --> 00:58:30,806 between words or you get good enough at it 1362 00:58:31,026 --> 00:58:33,496 that you just realize that in the context, obviously, 1363 00:58:33,496 --> 00:58:34,996 you spelled the letter A not ET. 1364 00:58:35,256 --> 00:58:36,266 You just got good at it. 1365 00:58:36,466 --> 00:58:37,736 But fundamentally and certainly 1366 00:58:37,736 --> 00:58:39,766 if a program were parsing this information, 1367 00:58:40,006 --> 00:58:41,616 you need a split second pause. 1368 00:58:41,616 --> 00:58:42,736 What's the downside there? 1369 00:58:42,936 --> 00:58:45,316 Well, even though Morse code is just dots and dashes, 1370 00:58:45,346 --> 00:58:47,926 there's actually a third digit which is pause, 1371 00:58:48,186 --> 00:58:50,656 which is not pictured if you need to kinda delay yourself. 1372 00:58:50,656 --> 00:58:51,466 Why is that bad? 1373 00:58:51,466 --> 00:58:53,016 Well, it just kind of slows things down 1374 00:58:53,096 --> 00:58:54,916 or leads to ambiguities. 1375 00:58:55,246 --> 00:58:57,426 But it turns out you don't need these ambiguities. 1376 00:58:57,466 --> 00:59:02,096 You can actually encode messages being very succinct, right? 1377 00:59:02,096 --> 00:59:02,726 This should-- what's neat 1378 00:59:02,726 --> 00:59:05,766 about Morse code is relatively speaking how short this 1379 00:59:05,766 --> 00:59:06,656 sequence is. 1380 00:59:06,766 --> 00:59:10,616 And yet all this time in CS50, we've been using 8 bits 1381 00:59:10,616 --> 00:59:12,126 to represent every letter 1382 00:59:12,126 --> 00:59:15,556 of the alphabet whether it's a capital A for 65 or 97 1383 00:59:15,556 --> 00:59:17,476 for lower case A. I mean, they, 1384 00:59:17,476 --> 00:59:20,436 the Morse code folks wheedled this down, the letter E, 1385 00:59:20,436 --> 00:59:23,236 to a single dot which is effectively like a single bit. 1386 00:59:23,236 --> 00:59:26,666 And it's no coincidence that E is pretty short, 1387 00:59:26,816 --> 00:59:30,846 T is pretty short, I is pretty short and yet Z is kind of long 1388 00:59:30,846 --> 00:59:32,686 and Y I kind of long 'cause again, 1389 00:59:32,796 --> 00:59:35,786 the dashes represent longer tones of sound. 1390 00:59:36,146 --> 00:59:36,946 So, what did they do? 1391 00:59:36,946 --> 00:59:39,046 They optimized for the more common cases. 1392 00:59:39,106 --> 00:59:40,896 E is pretty common, A is pretty common 1393 00:59:40,896 --> 00:59:44,896 so they have relatively shorter sequences than J or Y or Z. 1394 00:59:45,146 --> 00:59:47,216 So this suggests some form of compression. 1395 00:59:47,216 --> 00:59:48,726 And we don't seem to have learned 1396 00:59:48,726 --> 00:59:52,566 from Mr. Morse here whereby for ASCII and for computer encoding 1397 00:59:52,566 --> 00:59:55,006 of letters, we've been using 8 bits this whole time. 1398 00:59:55,346 --> 00:59:58,036 And the result of using 8 bits is obviously every letter takes 1399 00:59:58,036 --> 00:59:59,776 up of 8 bits and so often, 1400 00:59:59,986 --> 01:00:01,976 we're just wasting a lot of space, right? 1401 01:00:02,056 --> 01:00:04,826 >> If the letters of the alphabet only go from 65 1402 01:00:05,086 --> 01:00:07,726 up to 26 plus that and the 97 plus that, 1403 01:00:07,726 --> 01:00:10,096 and you never use some of the crazy punctuation 1404 01:00:10,096 --> 01:00:12,586 on the keyboard, let alone some of the non-printable characters, 1405 01:00:12,866 --> 01:00:13,986 we're just wasting bits. 1406 01:00:14,546 --> 01:00:17,796 So, what do we want to-- what could we do instead? 1407 01:00:17,796 --> 01:00:20,516 Well, let's try using one of these data structures in sort 1408 01:00:20,516 --> 01:00:24,766 of a novel way to compress information and in turn, 1409 01:00:24,916 --> 01:00:27,556 answer the question, how do you actually compress information? 1410 01:00:27,556 --> 01:00:30,696 When you use zip or the compress feature in Windows or Mac OS 1411 01:00:30,696 --> 01:00:33,936 like what is actually going on, well, if it's a text file, 1412 01:00:34,176 --> 01:00:36,146 it might very well be something like this. 1413 01:00:36,546 --> 01:00:38,756 So let's come up with a completely arbitrary text just 1414 01:00:38,756 --> 01:00:40,316 to have a simple discussion point. 1415 01:00:40,566 --> 01:00:43,476 So, this string here E, C, A, it's completely random. 1416 01:00:43,476 --> 01:00:44,366 There's no pattern there. 1417 01:00:44,646 --> 01:00:48,936 But what is noteworthy about it is that it's only composed 1418 01:00:48,936 --> 01:00:52,416 of A's and B's and C's and D's and E's and if you add 1419 01:00:52,526 --> 01:00:54,706 up all the letters in that crazy long string, 1420 01:00:54,986 --> 01:00:57,906 which would normally take up 8 bits times the length 1421 01:00:57,906 --> 01:01:00,686 of that string, you see that 20 percent of the letters are A's, 1422 01:01:00,686 --> 01:01:03,846 hence the point 2, 10 percent are B's, 10 percent are C's, 1423 01:01:03,846 --> 01:01:06,586 15 percent D's and 45 percent E's. 1424 01:01:06,806 --> 01:01:10,066 So already, if we were going to reinvent the ASCII system 1425 01:01:10,276 --> 01:01:13,146 and use fewer than 8 bits to represent each of these letters, 1426 01:01:13,146 --> 01:01:16,256 A through E, which one would you probably wanna use the fewest 1427 01:01:16,256 --> 01:01:17,666 bits for it intuitively? 1428 01:01:18,316 --> 01:01:20,696 So, E. So hopefully we can come up with a system, 1429 01:01:20,696 --> 01:01:22,126 maybe using today's building blocks 1430 01:01:22,176 --> 01:01:26,376 that somehow gives us very short sequences of bits for E's and, 1431 01:01:26,376 --> 01:01:29,266 okay fine, we'll spend a few more bits on B's and C's 1432 01:01:29,506 --> 01:01:31,526 than we would on A's or E's. 1433 01:01:31,556 --> 01:01:33,526 In short, can we do better than ASCII? 1434 01:01:33,526 --> 01:01:36,016 Can we compress information and thus save disk space? 1435 01:01:36,336 --> 01:01:36,966 Well, let's try this. 1436 01:01:37,226 --> 01:01:39,456 We have this ability now to talk about trees 1437 01:01:39,456 --> 01:01:42,306 and also build trees using simple C structs. 1438 01:01:42,666 --> 01:01:46,416 So, let's go ahead and draw 5 trees but 5 stupid trees. 1439 01:01:46,416 --> 01:01:48,216 Trees that are only single node. 1440 01:01:48,216 --> 01:01:49,786 So, they're all leaves, they're all roots 1441 01:01:50,046 --> 01:01:51,486 but they are nonetheless trees. 1442 01:01:51,486 --> 01:01:53,676 They just have no children, each of these five trees. 1443 01:01:53,996 --> 01:01:56,596 And just to keep me saying I'm gonna put inside of each 1444 01:01:56,596 --> 01:01:58,746 of these nodes its current weight based 1445 01:01:58,746 --> 01:02:00,846 on the original weights in the sentence. 1446 01:02:01,026 --> 01:02:02,376 Now, just to be clear, even though we're 1447 01:02:02,376 --> 01:02:04,506 about to compress something completely inane like this, 1448 01:02:04,806 --> 01:02:08,086 now imagine this is actually a large movie script 1449 01:02:08,086 --> 01:02:11,846 or the King James Bible or some huge body of English text 1450 01:02:12,486 --> 01:02:15,116 that is nonetheless implemented with ASCII just like this. 1451 01:02:15,546 --> 01:02:16,986 And there are too, there's gonna be patterns, right? 1452 01:02:16,986 --> 01:02:19,266 The word "the" pretty darn common in English 1453 01:02:19,266 --> 01:02:21,656 and there's definitely common letters like E and so forth. 1454 01:02:21,876 --> 01:02:23,106 So even though this is contrived, 1455 01:02:23,336 --> 01:02:26,186 there is definitely statistical patterns in English words. 1456 01:02:26,426 --> 01:02:28,216 So here are my 5 simple trees. 1457 01:02:28,616 --> 01:02:31,576 I wanna go ahead and build one big tree 1458 01:02:31,766 --> 01:02:32,956 out of each of these nodes. 1459 01:02:32,956 --> 01:02:33,776 So you know what I'm gonna do. 1460 01:02:33,776 --> 01:02:37,626 I'm gonna go ahead and combine the two smallest nodes possible 1461 01:02:37,686 --> 01:02:38,946 at every decision point. 1462 01:02:39,096 --> 01:02:41,516 So, the smallest nodes right now are 10 percent 1463 01:02:41,516 --> 01:02:43,176 and 10 percent, point 1 and point 1. 1464 01:02:43,446 --> 01:02:45,876 So I'm gonna go ahead and merge those into a new tree 1465 01:02:46,126 --> 01:02:49,396 by creating a new root outdating its left and its right pointer 1466 01:02:49,396 --> 01:02:52,186 to point to those guys and again to keep myself saying, 1467 01:02:52,186 --> 01:02:54,536 I'm going to draw inside of the new node, 1468 01:02:54,536 --> 01:02:56,646 the total of the old node. 1469 01:02:56,646 --> 01:02:59,166 So point 1 plus point 1 gives me point 2. 1470 01:02:59,566 --> 01:03:02,366 And notice just because I kind of know where this is going, 1471 01:03:02,556 --> 01:03:04,796 I'm gonna start labeling the edges I'm drawing. 1472 01:03:04,796 --> 01:03:05,826 I'm gonna arbitrarily 1473 01:03:05,826 --> 01:03:08,596 but consistently say a left child's edge 1474 01:03:08,726 --> 01:03:11,176 or arrow just represents a 0 bit 1475 01:03:11,386 --> 01:03:14,326 and the right arrow represents a 1 bit, just because. 1476 01:03:14,896 --> 01:03:15,976 So now let's repeat. 1477 01:03:16,036 --> 01:03:17,946 So now let's combine two of the smallest. 1478 01:03:17,946 --> 01:03:19,806 We actually have a little discretion here 'cause I've got 1479 01:03:19,936 --> 01:03:23,126 point 2 as the root, point 1, 5 and point 2 is the root. 1480 01:03:23,216 --> 01:03:25,256 I'm gonna go ahead and break this tie arbitrarily 1481 01:03:25,256 --> 01:03:26,496 but consistently. 1482 01:03:26,826 --> 01:03:29,196 And I'm gonna go ahead and combine that point 2 on the left 1483 01:03:29,196 --> 01:03:30,466 with point 1, 5 on the right. 1484 01:03:30,716 --> 01:03:33,716 Create a new node whose sum is point 2 plus point 1, 5, 1485 01:03:33,716 --> 01:03:36,716 so that's point 3, 5 and I similarly label the edges. 1486 01:03:37,066 --> 01:03:39,556 And now, just to make sure we're on the same page, 1487 01:03:39,556 --> 01:03:41,916 what two nodes get combined next? 1488 01:03:42,296 --> 01:03:44,866 We have point 3, 5 and point 2 'cause those are both smaller 1489 01:03:44,866 --> 01:03:45,776 than point 4, 5. 1490 01:03:45,996 --> 01:03:47,026 So, the picture is growing 1491 01:03:47,176 --> 01:03:50,156 and now I realize I actually put the nodes in the right place 1492 01:03:50,156 --> 01:03:51,226 so we get a pretty picture. 1493 01:03:51,226 --> 01:03:53,236 But even if they weren't aligned in a pretty way, 1494 01:03:53,466 --> 01:03:55,436 you would still get the same tree fundamentally. 1495 01:03:55,636 --> 01:03:57,186 Let's now combine the final two nodes 1496 01:03:57,186 --> 01:03:58,826 and now I claim I have a tree. 1497 01:03:58,826 --> 01:04:00,876 This is actually called the Huffman tree, 1498 01:04:00,876 --> 01:04:02,966 named after a computer scientist named Mr. Huffman 1499 01:04:03,246 --> 01:04:09,376 who then proposed that using this policy of assigning 0 bits 1500 01:04:09,376 --> 01:04:11,896 on the left edges and 1 bits on the right edges, 1501 01:04:11,896 --> 01:04:15,536 you now have your own custom alternative to ASCII. 1502 01:04:15,816 --> 01:04:20,656 What this system proposes now is that I'm going to start using 1503 01:04:21,746 --> 01:04:26,986 for the letter A, B, C, and D, and I'll switch 1504 01:04:26,986 --> 01:04:28,436 over to the chalkboard in just a second. 1505 01:04:28,656 --> 01:04:32,496 The following encodings, if I am going to represent the letter A 1506 01:04:32,966 --> 01:04:39,046 or B or C or D or E, I'm gonna start at the root and figure 1507 01:04:39,046 --> 01:04:41,506 out what pattern of bit gets me to it. 1508 01:04:41,506 --> 01:04:44,276 So let's start with A. I start at the root to get 1509 01:04:44,276 --> 01:04:46,176 to A. I go left which gives me a 0, 1510 01:04:46,476 --> 01:04:48,986 then I go to the right which gives me a 1. 1511 01:04:49,206 --> 01:04:50,316 So I claim henceforth. 1512 01:04:50,356 --> 01:04:52,436 Forget about this crazy 8-bit ASCII system. 1513 01:04:52,626 --> 01:04:55,216 I'm gonna use 2 bits for the letter A, 0 and 1. 1514 01:04:55,386 --> 01:04:59,956 For the letter B, what pattern of bits will I use? 1515 01:04:59,956 --> 01:05:00,146 [ Inaudible Remark ] 1516 01:05:00,146 --> 01:05:01,066 >> Yeah, 4 zeros. 1517 01:05:01,066 --> 01:05:03,766 Again, if not quite following, you start with the root, 1518 01:05:03,766 --> 01:05:07,626 we just find B and to get there, we go through a 0, 0, 0, 0. 1519 01:05:07,886 --> 01:05:10,436 So, B is encoding, it's gonna be a little longer, still better 1520 01:05:10,436 --> 01:05:13,426 than 8 bits but it's 4 bits this time. 1521 01:05:13,426 --> 01:05:18,726 So, C is going to be 0, 0, 0, 0, 1 'cause I'm just following it, 1522 01:05:19,046 --> 01:05:20,456 following the root down to it. 1523 01:05:20,806 --> 01:05:27,136 D is gonna be 0, 0, 1 and E very nicely is going to be 1. 1524 01:05:27,486 --> 01:05:30,416 And now, what's neat about this and I'll put this on the screen 1525 01:05:30,566 --> 01:05:34,656 which maps that tree to numbers, what's neat about this is 1, 1526 01:05:34,896 --> 01:05:37,136 we've optimized for the really common case followed 1527 01:05:37,136 --> 01:05:38,866 by the next common cases and we're willing 1528 01:05:38,866 --> 01:05:40,976 to spend a few more bits on the les common cases. 1529 01:05:41,226 --> 01:05:43,346 But unlike Morse code, notice, 1530 01:05:43,346 --> 01:05:46,546 none of these things are prefixes of one another. 1531 01:05:46,786 --> 01:05:51,806 In other words, if I draw the numbers or let's say, 0, 1, 0, 1532 01:05:51,876 --> 01:05:54,866 0, 0, 0, what does this decode to? 1533 01:05:54,866 --> 01:05:56,486 Well notice there is no ambiguity. 1534 01:05:56,486 --> 01:05:58,936 If I start with 0, could be any one of these 1535 01:05:59,296 --> 01:06:02,446 but if I then immediately see a 1, it can only be in A 1536 01:06:02,446 --> 01:06:04,126 and that's pretty compelling. 1537 01:06:04,126 --> 01:06:06,276 There's no ambiguity which means I don't need a break, 1538 01:06:06,276 --> 01:06:08,756 I don't need to waste [inaudible] pause here 1539 01:06:08,756 --> 01:06:10,206 or here's a gap between letters. 1540 01:06:10,456 --> 01:06:12,546 Now I see another 0, it could be any one of these. 1541 01:06:12,776 --> 01:06:15,636 I see 3 zeros, okay, it could any one of these three. 1542 01:06:15,856 --> 01:06:18,146 I see 3 zeros, I've now whittled it down to this 1543 01:06:18,446 --> 01:06:22,656 but then notice 4 zeros means aha, this has gotta be B. So, 1544 01:06:22,656 --> 01:06:26,476 by contrast, Huffman's coding scheme is immediately decodable 1545 01:06:26,576 --> 01:06:30,026 which means it is very efficient in that you don't waste bits 1546 01:06:30,116 --> 01:06:32,136 or time in between letters. 1547 01:06:32,366 --> 01:06:33,376 So, what's the takeaway? 1548 01:06:33,376 --> 01:06:35,276 Well now, the takeaway is quite simple 1549 01:06:35,276 --> 01:06:39,236 and you can now encode A's and B's and C's and D's and E's 1550 01:06:39,446 --> 01:06:40,786 with far fewer than 8 bits. 1551 01:06:41,126 --> 01:06:42,596 But what's the catch? 1552 01:06:43,046 --> 01:06:45,986 If you're going to actually encode letters 1553 01:06:45,986 --> 01:06:48,826 in this different way and you're gonna kind of shrug of ASCII, 1554 01:06:49,936 --> 01:06:51,876 what do you have-- what's the price you're paying? 1555 01:06:51,876 --> 01:06:54,036 'Cause right if Mr. Huffman had this brilliant insight 1556 01:06:54,036 --> 01:06:55,136 and we're still using ASCII, 1557 01:06:55,136 --> 01:06:56,186 there's gotta be a catch to this. 1558 01:06:56,706 --> 01:07:03,276 So, why are we not all just using this scheme? 1559 01:07:03,276 --> 01:07:03,476 [ Inaudible Remark ] 1560 01:07:03,476 --> 01:07:04,246 >> Exactly, right. 1561 01:07:04,246 --> 01:07:07,466 I'm kind naively assuming, well, if my world only has A's, B's, 1562 01:07:07,516 --> 01:07:09,986 C's, D's and E's, well I don't really need to use ASCII 1563 01:07:09,986 --> 01:07:12,316 because I have such a smaller address space so, 1564 01:07:12,316 --> 01:07:14,236 of course I can use fewer than 8 bits. 1565 01:07:14,546 --> 01:07:15,586 But if you actually wanna be able 1566 01:07:15,586 --> 01:07:17,786 to let someone type any character on the keyboard, 1567 01:07:17,976 --> 01:07:20,466 you're gonna eventually need to start spending more bits. 1568 01:07:20,796 --> 01:07:23,226 So, what is really is, the takeaway really is 1569 01:07:23,226 --> 01:07:26,576 that if you have patterns in your data, common letters 1570 01:07:26,576 --> 01:07:29,846 or common words, you actually do have an opportunity then 1571 01:07:30,126 --> 01:07:30,976 to compress data. 1572 01:07:30,976 --> 01:07:32,606 And the fun thing about data is 1573 01:07:32,606 --> 01:07:34,296 that there's almost always patterns, right? 1574 01:07:34,296 --> 01:07:36,406 Even in things like music, there's choruses 1575 01:07:36,406 --> 01:07:37,856 and parts of songs that repeat. 1576 01:07:38,026 --> 01:07:40,396 And if you can leverage those commonalities and say, 1577 01:07:40,396 --> 01:07:42,796 do I really need to store this chorus part 1578 01:07:42,796 --> 01:07:44,406 of the song 3 different times just 1579 01:07:44,406 --> 01:07:46,346 because the artist sings it 3 different times? 1580 01:07:46,596 --> 01:07:47,916 Why not just store it once 1581 01:07:48,196 --> 01:07:50,896 and then somehow digitally remember times 3 1582 01:07:51,136 --> 01:07:53,436 which you can presumably express much more succinctly. 1583 01:07:53,586 --> 01:07:55,566 We can then save bits that way. 1584 01:07:55,796 --> 01:07:56,826 And in fact we can. 1585 01:07:56,826 --> 01:07:59,666 So just to be clear, these nodes we're now drawing, again, 1586 01:07:59,666 --> 01:08:01,646 we can come up with kind of a simple C structure 1587 01:08:01,646 --> 01:08:03,906 where we store the symbol that we're representing, 1588 01:08:04,186 --> 01:08:06,976 the frequency as an int or as a percentage, as a flow. 1589 01:08:06,976 --> 01:08:08,916 We can do it either way but generally, again, 1590 01:08:08,916 --> 01:08:10,956 using ints instead of flows avoids issues 1591 01:08:10,956 --> 01:08:12,256 of precision and accuracy. 1592 01:08:12,496 --> 01:08:14,236 And then we need a pointer to the left or right 1593 01:08:14,236 --> 01:08:17,706 but let's now map this same idea, not just the arcane text 1594 01:08:18,026 --> 01:08:20,036 in A's, B's and C's and D's. 1595 01:08:20,086 --> 01:08:23,396 Suppose we consider like the German flag here vis-a-vis say 1596 01:08:23,396 --> 01:08:26,806 the French flag which is nice in its simplicity too in that 1597 01:08:26,806 --> 01:08:28,826 where as these guys have horizontal bands, 1598 01:08:28,826 --> 01:08:30,316 these guys have vertical bands. 1599 01:08:30,856 --> 01:08:34,866 Any thoughts as to which of these 2 flags if stored as a GIF 1600 01:08:34,866 --> 01:08:38,666 in internet or a file format, which one could compress more? 1601 01:08:38,666 --> 01:08:40,276 They're pretty-- they are the same size even 1602 01:08:40,276 --> 01:08:42,246 if I didn't quite resize them correctly. 1603 01:08:42,486 --> 01:08:43,666 They're the same number of pixels, 1604 01:08:44,066 --> 01:08:45,396 which one might compress better? 1605 01:08:47,636 --> 01:08:51,056 The French, yeah, it's kind of nice but this German's kind 1606 01:08:51,056 --> 01:08:53,076 of redundant too from left to right. 1607 01:08:53,076 --> 01:08:55,896 It turns out you can't really notice unless you know the 1608 01:08:55,896 --> 01:08:56,876 GIF specification. 1609 01:08:56,876 --> 01:09:00,766 But there are opportunities here clearly to compress, right? 1610 01:09:00,766 --> 01:09:04,456 Why? Frankly, I could represent this flag with like 3 pixels 1611 01:09:04,456 --> 01:09:06,776 and then just say repeat some number of times. 1612 01:09:06,776 --> 01:09:07,996 Same deal for the French flag. 1613 01:09:08,236 --> 01:09:09,976 Well, the authors of the GIF file format, 1614 01:09:09,976 --> 01:09:11,796 G-I-F decided that, you know what? 1615 01:09:12,086 --> 01:09:15,816 I'm just like in BMPs whereby images are represented 1616 01:09:16,026 --> 01:09:18,016 as scan lines from left to right. 1617 01:09:18,266 --> 01:09:20,476 What we're really gonna do is recognize that, alright. 1618 01:09:20,476 --> 01:09:23,576 The top black band of this flag, let's just say 1619 01:09:23,576 --> 01:09:26,566 that the top left pixel is black and then say repeat 1620 01:09:26,606 --> 01:09:27,836 for the rest of the scan line. 1621 01:09:27,896 --> 01:09:30,856 Then the next row, we say black, repeat, black repeat, 1622 01:09:30,896 --> 01:09:32,776 black repeat then eventually we get to red 1623 01:09:32,776 --> 01:09:35,156 and then we just red repeat, red repeat. 1624 01:09:35,526 --> 01:09:37,276 So, it turns out the French flag if saved 1625 01:09:37,276 --> 01:09:39,976 as GIF doesn't quite compress as well 'cause you have 1626 01:09:39,976 --> 01:09:40,986 to keep changing your mind. 1627 01:09:40,986 --> 01:09:43,726 You have to say blue repeat, white repeat, 1628 01:09:43,916 --> 01:09:46,096 red repeat and then again. 1629 01:09:46,626 --> 01:09:49,726 So, you're still compressing it but you can't repeat 1630 01:09:49,726 --> 01:09:51,186 as long chunks as possible 1631 01:09:51,186 --> 01:09:53,426 because just somewhat arbitrarily the GIF authors 1632 01:09:53,426 --> 01:09:55,246 decided that we're gonna compress images 1633 01:09:55,246 --> 01:09:56,956 effectively laterally. 1634 01:09:57,166 --> 01:09:58,196 >> But you can actually see this 1635 01:09:58,196 --> 01:10:00,686 so I downloaded these same images into a folder 1636 01:10:00,686 --> 01:10:02,686 and if you actually look at this, 1637 01:10:03,256 --> 01:10:05,966 there's the same German flag, here's the same French flag 1638 01:10:05,966 --> 01:10:09,206 so they look like the same size but look at their file sizes, 1639 01:10:09,416 --> 01:10:11,416 and ignore the-- I in the middle there. 1640 01:10:12,686 --> 01:10:15,386 So, the French flag is twice as big, 8 kilobytes 1641 01:10:15,386 --> 01:10:19,166 and that's simply because of how GIF is actually compressed. 1642 01:10:19,166 --> 01:10:20,806 We can apply the same ideas, 1643 01:10:21,126 --> 01:10:24,316 independent of these tree structures indeed whereby we can 1644 01:10:24,316 --> 01:10:26,766 have say a picture of this apple, right? 1645 01:10:26,766 --> 01:10:29,856 This apple, it's stored as a GIF, you do need some red 1646 01:10:29,856 --> 01:10:32,126 and brown to represent the stem in the apple itself. 1647 01:10:32,336 --> 01:10:34,356 But this blue background, you definitely don't need 1648 01:10:34,356 --> 01:10:36,036 to remember all of those pixels. 1649 01:10:36,036 --> 01:10:38,206 Instead, just remember some of them and then say, 1650 01:10:38,426 --> 01:10:41,316 "repeat wherever it's white" or in the context, 1651 01:10:41,316 --> 01:10:43,856 perhaps more a modernly with a movie. 1652 01:10:43,936 --> 01:10:47,606 So, this is just a little cartoon of 2 scenes in a movie 1653 01:10:47,606 --> 01:10:49,866 where on the top strip of film there. 1654 01:10:50,116 --> 01:10:53,616 Notice the little RV, the truck and the house and the tree. 1655 01:10:53,896 --> 01:10:55,806 Now in this simple little cartoon movie, 1656 01:10:55,806 --> 01:10:57,476 the car is moving from left to right. 1657 01:10:57,626 --> 01:10:59,306 But what's obviously not gonna change? 1658 01:10:59,306 --> 01:11:01,586 Well, the tree's not moving and the house isn't moving, 1659 01:11:01,826 --> 01:11:05,086 the background's not changing and so what can you do? 1660 01:11:05,336 --> 01:11:07,886 Well actually in this case, the car, okay, 1661 01:11:07,886 --> 01:11:08,936 so some of those things were moving. 1662 01:11:09,186 --> 01:11:10,856 The tree and the house were effectively moving 1663 01:11:10,856 --> 01:11:14,016 as the camera move but the sky was not changing, right? 1664 01:11:14,016 --> 01:11:15,096 So, you can do the same thing. 1665 01:11:15,096 --> 01:11:17,966 Throw away information that's not changing 'cause you 1666 01:11:17,966 --> 01:11:19,286 definitely don't need to store 1667 01:11:19,286 --> 01:11:22,136 in a movie the same picture again and again. 1668 01:11:22,136 --> 01:11:25,076 And for those unfamiliar, a movie, a video, is pretty much 1669 01:11:25,076 --> 01:11:27,406 like a bunch of images back to back to back to back, 1670 01:11:27,816 --> 01:11:29,706 played really fast so it looks seamless. 1671 01:11:29,976 --> 01:11:32,326 And here too, in the bottom, you're seeing a same idea 1672 01:11:32,326 --> 01:11:34,706 where you can do interframe compression, 1673 01:11:35,016 --> 01:11:36,156 whereby, you know what? 1674 01:11:36,276 --> 01:11:38,366 If we know the car is starting over there on the left 1675 01:11:38,726 --> 01:11:40,526 and then it's gonna end up there over on the right, 1676 01:11:40,896 --> 01:11:44,036 do we really need to watch as the car does this or can't we 1677 01:11:44,036 --> 01:11:47,276 with algorithms kind of predict where the car is gonna be 1678 01:11:47,476 --> 01:11:49,416 and then throw away what might otherwise be 1679 01:11:49,796 --> 01:11:51,146 useless information. 1680 01:11:51,246 --> 01:11:52,436 And you can see it here too. 1681 01:11:52,436 --> 01:11:55,196 So, this is a picture on top of an uncompressed video 1682 01:11:55,196 --> 01:11:57,406 of a bee flying over some flowers and then 1683 01:11:57,406 --> 01:11:59,566 at the bottom you see, well, the flowers aren't changing, 1684 01:11:59,566 --> 01:12:02,476 only the bees moving, let's just throw away the background, 1685 01:12:02,476 --> 01:12:05,646 throw away the purple and the blue and just keep track 1686 01:12:05,646 --> 01:12:09,436 in our file where that bee is actually is. 1687 01:12:09,796 --> 01:12:10,986 So, where are we going with all this? 1688 01:12:10,986 --> 01:12:12,966 Well, we now have the ability to have images. 1689 01:12:12,966 --> 01:12:15,136 You have in your appliance the ability 1690 01:12:15,296 --> 01:12:17,446 to actually make webpages. 1691 01:12:17,446 --> 01:12:18,966 Since you actually have inside 1692 01:12:18,966 --> 01:12:20,706 of your appliance your own web server 1693 01:12:20,986 --> 01:12:24,016 and your own database server as you'll see. 1694 01:12:24,226 --> 01:12:27,106 I'm gonna go into the CS50 account here just for a moment 1695 01:12:27,106 --> 01:12:28,816 and use a program that let's meet edit, 1696 01:12:29,186 --> 01:12:30,386 the course's own website. 1697 01:12:30,466 --> 01:12:32,446 So, you might recall from our course's website, 1698 01:12:32,446 --> 01:12:34,746 we often have things like hungry RSVP, don't submit, 1699 01:12:34,746 --> 01:12:36,666 submit this form and so forth. 1700 01:12:36,796 --> 01:12:38,726 So this is HTML. 1701 01:12:39,056 --> 01:12:41,906 We saw divs before, we saw angle brackets before. 1702 01:12:41,906 --> 01:12:43,546 I didn't say anything about what they mean 1703 01:12:43,736 --> 01:12:45,856 but I thought it would be fun to maybe just update this 1704 01:12:45,856 --> 01:12:48,026 in real time as an example of where we can go. 1705 01:12:48,026 --> 01:12:50,336 I'm gonna create another division of the page like this 1706 01:12:50,336 --> 01:12:52,166 and I'm gonna say the style 1707 01:12:52,166 --> 01:12:54,486 of this division I want the background color for this part 1708 01:12:54,486 --> 01:12:58,096 of the website to be black that happens to represent black 1709 01:12:58,096 --> 01:12:59,376 or I could just say, literally black. 1710 01:12:59,376 --> 01:13:02,486 I wanna have some padding around it, 10 pixels worth 1711 01:13:02,536 --> 01:13:04,756 which is a bunch of white space around it. 1712 01:13:04,756 --> 01:13:07,546 I'm gonna go ahead and align everything 1713 01:13:07,546 --> 01:13:10,156 in the center here and that feels good. 1714 01:13:10,386 --> 01:13:11,556 I'm gonna go down div. 1715 01:13:11,816 --> 01:13:13,626 I'm gonna create what's called a line break below 1716 01:13:13,626 --> 01:13:15,006 that in whatever else is there. 1717 01:13:15,006 --> 01:13:16,776 I'm gonna go ahead and do image. 1718 01:13:16,776 --> 01:13:20,026 IMG is image, source equals, let's see, 1719 01:13:20,026 --> 01:13:23,676 I called it ice.ping like this. 1720 01:13:23,676 --> 01:13:25,636 Let me go ahead and actually do this. 1721 01:13:25,636 --> 01:13:27,606 I'm gonna surround this image tag with A, 1722 01:13:27,606 --> 01:13:30,426 H ref it goes go quote, unquote, dub, dub, dub-- 1723 01:13:30,426 --> 01:13:33,766 dot Facebook dot com slash Rboden91 [phonetic]. 1724 01:13:33,936 --> 01:13:37,436 You've be surprised how many of your friends they made-- - 1725 01:13:37,436 --> 01:13:38,596 the first we did this. 1726 01:13:38,956 --> 01:13:41,726 Alright, so now let me go ahead and save that. 1727 01:13:42,066 --> 01:13:45,686 And if I made no typos, here's before, 1728 01:13:46,516 --> 01:13:49,726 here's after, oh, damn it. 1729 01:13:49,926 --> 01:13:52,806 [Noise] What it's called? 1730 01:13:53,566 --> 01:13:55,866 Not, they took it away. 1731 01:13:55,866 --> 01:14:00,406 Ice.ping, ice-- no, what did I call the image, hang on. 1732 01:14:00,966 --> 01:14:03,406 Ice.ping, hold on. 1733 01:14:04,616 --> 01:14:08,056 Where is the image? 1734 01:14:08,306 --> 01:14:12,716 Damn it! [Laughter] Very seamless demo here. 1735 01:14:12,716 --> 01:14:14,736 Okay, standby. 1736 01:14:15,586 --> 01:14:20,386 This is a trick that you'll learn before long. 1737 01:14:20,976 --> 01:14:23,216 Alright, I've just transferred the file 1738 01:14:23,216 --> 01:14:25,916 and I'm gonna move the file into here. 1739 01:14:25,916 --> 01:14:27,396 I'm gonna change my application. 1740 01:14:27,586 --> 01:14:28,526 I'm gonna go over here. 1741 01:14:28,636 --> 01:14:31,066 [Laughter] Damn it! 1742 01:14:31,066 --> 01:14:32,296 What re-- what is it David? 1743 01:14:32,586 --> 01:14:36,536 Oh wait, ice-- ice.ping. 1744 01:14:37,276 --> 01:14:40,086 Oh, this is just catching issues. 1745 01:14:40,086 --> 01:14:41,526 Okay hold on. 1746 01:14:41,996 --> 01:14:43,586 [Laughter] Thank you! 1747 01:14:43,586 --> 01:14:45,006 [ Laughter ] 1748 01:14:45,006 --> 01:14:46,426 [ Applause ] 1749 01:14:46,426 --> 01:14:47,756 >> We will see you next week.