1 00:00:00,000 --> 00:00:03,472 [MUSIC PLAYING] 2 00:00:03,472 --> 00:00:17,322 3 00:00:17,322 --> 00:00:20,280 DAVID MALAN: Recall that an algorithm is just step-by-step instructions 4 00:00:20,280 --> 00:00:21,510 for solving some problem. 5 00:00:21,510 --> 00:00:25,470 Not unlike this problem here wherein I sought Mike Smith among the whole phone 6 00:00:25,470 --> 00:00:27,210 book of names and numbers. 7 00:00:27,210 --> 00:00:29,280 But up until now, we've really only focused 8 00:00:29,280 --> 00:00:31,680 on those step-by-step instructions and not 9 00:00:31,680 --> 00:00:35,163 so much on how the data we are searching is stored. 10 00:00:35,163 --> 00:00:38,080 Of course, in this version of that problem, it's stored here on paper, 11 00:00:38,080 --> 00:00:42,150 but in the digital world, it's of course not going to be paper, but 0's and 1's. 12 00:00:42,150 --> 00:00:45,180 But it's one thing to say that the numbers and maybe even the names 13 00:00:45,180 --> 00:00:49,950 are stored ultimately as 0's and 1's, but where and how exactly? 14 00:00:49,950 --> 00:00:52,650 There's all those transistors and they're flipping on and off, 15 00:00:52,650 --> 00:00:56,830 but with respect to each other, are those numbers laid out left to right, 16 00:00:56,830 --> 00:00:58,950 top to bottom, are they all over the place? 17 00:00:58,950 --> 00:01:01,440 Let's actually take a look at that question now 18 00:01:01,440 --> 00:01:03,540 and consider how a computer leverages what 19 00:01:03,540 --> 00:01:07,800 are called data structures to facilitate implementation of algorithms. 20 00:01:07,800 --> 00:01:12,060 Indeed, how you lay out a computer's data inside of its memory 21 00:01:12,060 --> 00:01:15,540 has non-trivial impacts on the performance or efficiency 22 00:01:15,540 --> 00:01:17,940 of your algorithms, whereas the algorithm itself 23 00:01:17,940 --> 00:01:21,990 can be correct as we've seen, but not necessarily efficient logically. 
24 00:01:21,990 --> 00:01:25,110 Both space and the representation underneath the hood of your data 25 00:01:25,110 --> 00:01:27,000 can also make a significant impact. 26 00:01:27,000 --> 00:01:28,590 But let's simplify the world first. 27 00:01:28,590 --> 00:01:32,850 And rather than focus on, say, a whole phone book of names and numbers, 28 00:01:32,850 --> 00:01:36,060 let's focus just on numbers, and much smaller numbers that aren't even 29 00:01:36,060 --> 00:01:40,770 phone numbers, but just integers, and only save seven of them at a time. 30 00:01:40,770 --> 00:01:42,940 And I've hidden these seven numbers, if you will, 31 00:01:42,940 --> 00:01:45,540 behind these seven yellow doors. 32 00:01:45,540 --> 00:01:48,360 And so by knocking and opening, one of these doors 33 00:01:48,360 --> 00:01:49,803 will reveal one number at a time. 34 00:01:49,803 --> 00:01:52,470 And the goal at hand, though, is to find a very specific number, 35 00:01:52,470 --> 00:01:56,280 just like I sought one specific phone number before, this time I want 36 00:01:56,280 --> 00:01:59,670 to find the number 50 specifically. 37 00:01:59,670 --> 00:02:01,290 Well, where to begin? 38 00:02:01,290 --> 00:02:04,050 I'll go with the one closest to me and knock, knock, knock-- 39 00:02:04,050 --> 00:02:05,580 15 is the number. 40 00:02:05,580 --> 00:02:06,840 So a little bit low. 41 00:02:06,840 --> 00:02:08,930 Let's proceed from there to see 23. 42 00:02:08,930 --> 00:02:10,320 We seem to be getting closer. 43 00:02:10,320 --> 00:02:14,040 Let's open this door next and-- oh, we seem to have veered down smaller, 44 00:02:14,040 --> 00:02:15,180 so I'm a little confused. 45 00:02:15,180 --> 00:02:17,130 But I have four doors left to check. 46 00:02:17,130 --> 00:02:23,940 So 50 is not there and 50 is not there and 50 is, in fact, there. 47 00:02:23,940 --> 00:02:24,690 So not bad. 48 00:02:24,690 --> 00:02:28,910 Within just six steps, have I found the number in question. 
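The left-to-right walk along the doors can be sketched in Python as a linear search. This is only an illustration: the name `linear_search` is made up here, and the lecture reads aloud only some of the door values (15, 23, 16, 50, and later 4), so the 8 and 42 placed behind the fourth and fifth doors are assumptions that complete the list.

```python
def linear_search(doors, target):
    """Open doors left to right; return (index, doors_opened), or (None, doors_opened)."""
    opened = 0
    for index, value in enumerate(doors):
        opened += 1  # each door opened costs one step
        if value == target:
            return index, opened
    return None, opened


# 15, 23, 16, 50 and 4 come from the lecture; 8 and 42 are assumed placements.
doors = [15, 23, 16, 8, 42, 50, 4]
print(linear_search(doors, 50))  # (5, 6): found behind the sixth door
```

With n doors in no particular order, the loop may have to open all n of them before it finds the target or gives up.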
49 00:02:28,910 --> 00:02:31,670 But of course, to be fair, there were only seven doors. 50 00:02:31,670 --> 00:02:36,050 So if we generalize that to say that there were n doors where n is just 51 00:02:36,050 --> 00:02:41,540 a number, well that was roughly n doors I had to open among the n doors 52 00:02:41,540 --> 00:02:44,210 just to find the one that I sought. 53 00:02:44,210 --> 00:02:45,740 So could I have done better? 54 00:02:45,740 --> 00:02:49,460 You know, my instincts like yours were perhaps to start at the left 55 00:02:49,460 --> 00:02:53,090 and move to the right, and we seem to be on a good path initially. 56 00:02:53,090 --> 00:02:56,390 We went from 15 to 23, and then darn it if 16 57 00:02:56,390 --> 00:02:59,210 didn't throw a wrench in the works, because I expected it, 58 00:02:59,210 --> 00:03:03,290 perhaps naively, to be bigger and bigger as I moved right. 59 00:03:03,290 --> 00:03:07,010 But honestly, had I not told you anything-- and indeed I did-- then 60 00:03:07,010 --> 00:03:10,220 you wouldn't have known anything about these numbers other than maybe 61 00:03:10,220 --> 00:03:12,380 the number 50 is actually there. 62 00:03:12,380 --> 00:03:14,840 I told you nothing as to the magnitude or the size 63 00:03:14,840 --> 00:03:17,453 of any of the other numbers, let alone the order, 64 00:03:17,453 --> 00:03:19,370 but in the world of the phone book, of course, 65 00:03:19,370 --> 00:03:22,280 we were able to take for granted that those names were 66 00:03:22,280 --> 00:03:27,080 sorted by the phone company for us-- from left to right, from A to Z. 67 00:03:27,080 --> 00:03:31,070 But in this case, if your data is just added to the computer's memory one 68 00:03:31,070 --> 00:03:33,620 at a time in no particular order, the onus 69 00:03:33,620 --> 00:03:37,850 is on you, the programmer or algorithm, to find that number 70 00:03:37,850 --> 00:03:39,560 you're interested in nonetheless. 
71 00:03:39,560 --> 00:03:40,820 Now what was left here? 72 00:03:40,820 --> 00:03:43,940 And indeed 4 is even smaller than 50. 73 00:03:43,940 --> 00:03:48,140 So these seven doors were by design randomly assigned a number. 74 00:03:48,140 --> 00:03:49,847 And so you could do no better. 75 00:03:49,847 --> 00:03:50,930 I might have gotten lucky. 76 00:03:50,930 --> 00:03:53,000 I might not have gone with my initial instincts 77 00:03:53,000 --> 00:03:55,010 and touched the number 15 at left. 78 00:03:55,010 --> 00:03:59,060 I might have, effectively blinded, gone and touched 50 and just gotten lucky, 79 00:03:59,060 --> 00:04:00,920 and then it would have been just one step. 80 00:04:00,920 --> 00:04:04,760 But there's only a one in seven chance I would have been correct so quickly, 81 00:04:04,760 --> 00:04:07,220 so that's not really an algorithm that I could 82 00:04:07,220 --> 00:04:11,340 reproduce with the same efficiency again and again. 83 00:04:11,340 --> 00:04:13,010 So how can I do better? 84 00:04:13,010 --> 00:04:16,490 And how does the phone company enable us to do better? 85 00:04:16,490 --> 00:04:19,339 Well they, of course, put in a huge amount of effort upfront 86 00:04:19,339 --> 00:04:23,270 to sort all of those names and associated numbers from left 87 00:04:23,270 --> 00:04:26,030 to right, from A to Z. And so that's a huge leg up for us, 88 00:04:26,030 --> 00:04:30,230 because then I can assume I can do divide and conquer or so-called binary 89 00:04:30,230 --> 00:04:32,570 search, dividing that phone book in two as 90 00:04:32,570 --> 00:04:36,860 implied by "bi" in "binary," halving the problem again and again. 91 00:04:36,860 --> 00:04:40,430 But someone's got to do that work for us, be it the phone company 92 00:04:40,430 --> 00:04:42,120 or perhaps me with these numbers. 
93 00:04:42,120 --> 00:04:44,360 So let's take one more stab at this problem, 94 00:04:44,360 --> 00:04:47,720 this time presuming that the seven doors in question 95 00:04:47,720 --> 00:04:52,370 do, in fact, have the numbers behind them sorted from left to right, 96 00:04:52,370 --> 00:04:54,170 small to big. 97 00:04:54,170 --> 00:04:56,630 So where to find the number 50 now? 98 00:04:56,630 --> 00:04:59,780 I have seven doors behind which are those same numbers, 99 00:04:59,780 --> 00:05:02,930 but this time they are sorted from left to right. 100 00:05:02,930 --> 00:05:06,420 And no skipping ahead thinking that, well, I remember all the other numbers, 101 00:05:06,420 --> 00:05:08,228 so I know immediately where 50 is. 102 00:05:08,228 --> 00:05:10,520 Let's assume for the moment that we don't know anything 103 00:05:10,520 --> 00:05:14,090 about the other numbers other than the fact that they are sorted. 104 00:05:14,090 --> 00:05:17,930 Well, my inclination is not to start at the left with this first door, 105 00:05:17,930 --> 00:05:21,260 much like my inclination ultimately with that phone book was not to start 106 00:05:21,260 --> 00:05:23,388 with the first page, but the middle. 107 00:05:23,388 --> 00:05:26,180 And indeed, I'm going to go here to the middle of these doors and-- 108 00:05:26,180 --> 00:05:27,050 16. 109 00:05:27,050 --> 00:05:29,030 Not quite the one I want. 110 00:05:29,030 --> 00:05:33,960 But if the doors are sorted now, I know that that number 50 is not to the left, 111 00:05:33,960 --> 00:05:35,460 and so I'm going to go to the right. 112 00:05:35,460 --> 00:05:37,280 Where do I go to the right? 113 00:05:37,280 --> 00:05:41,090 Well, I have three doors left, I'm going to follow the same algorithm 114 00:05:41,090 --> 00:05:44,960 and open that door in the middle and-- oh, so close. 115 00:05:44,960 --> 00:05:47,360 I found only, if you will, the meaning of life. 
116 00:05:47,360 --> 00:05:49,460 So 42, though, is not the number I care about, 117 00:05:49,460 --> 00:05:52,940 but I do know something about 50-- it's bigger than 42. 118 00:05:52,940 --> 00:05:57,320 And so now, it's quite simply the case that-- aha, 50 is there, 119 00:05:57,320 --> 00:05:59,840 it's going to be in that last number. 120 00:05:59,840 --> 00:06:03,770 So whereas before it took me up to six steps to find the number 50, 121 00:06:03,770 --> 00:06:05,750 and only then by luck did I find it where 122 00:06:05,750 --> 00:06:07,920 it was because it was just randomly placed, 123 00:06:07,920 --> 00:06:12,990 now I spent 1, 2, 3 steps in total, which is, of course, fewer than six. 124 00:06:12,990 --> 00:06:16,010 And as these numbers of doors grow in size 125 00:06:16,010 --> 00:06:18,200 and I have hundreds or thousands of doors, 126 00:06:18,200 --> 00:06:21,440 surely it will be the case just like the phone book that halving this problem 127 00:06:21,440 --> 00:06:23,990 again and again is going to get me to my answer 128 00:06:23,990 --> 00:06:29,430 if it's there in logarithmic instead of linear time, so to speak. 129 00:06:29,430 --> 00:06:33,830 But what's key to the success of this algorithm-- binary search-- 130 00:06:33,830 --> 00:06:39,350 is that the doors are not only sorted, but they are back-to-back-to-back. 131 00:06:39,350 --> 00:06:42,140 Now I have the luxury of feet and I can move back and forth 132 00:06:42,140 --> 00:06:46,190 among these numbers, but even my steps take me some amount of time and energy. 133 00:06:46,190 --> 00:06:50,120 But fortunately, each such step just takes one unit of energy, if you will, 134 00:06:50,120 --> 00:06:54,840 and I can immediately jump wherever I would like one step at a time. 135 00:06:54,840 --> 00:06:58,520 But a computer is purely electronic, and in the context of memory, 136 00:06:58,520 --> 00:07:01,490 doesn't actually need to take any steps. 
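The middle, then middle-of-the-right-half walk over the sorted doors can be sketched as a binary search. This is a minimal sketch, not the lecture's own code: `binary_search` and its variable names are hypothetical, and the value 8 is an assumption, since the lecture names only 4, 15, 16, 23, 42 and 50.

```python
def binary_search(sorted_doors, target):
    """Return (index, doors_opened) for target, or (None, doors_opened) if absent."""
    low, high = 0, len(sorted_doors) - 1
    opened = 0
    while low <= high:
        middle = (low + high) // 2  # middle of the doors still in play
        opened += 1
        if sorted_doors[middle] == target:
            return middle, opened
        if sorted_doors[middle] < target:
            low = middle + 1   # target can only be to the right
        else:
            high = middle - 1  # target can only be to the left
    return None, opened


# Same numbers as before, now sorted; the 8 is an assumed value.
sorted_doors = [4, 8, 15, 16, 23, 42, 50]
print(binary_search(sorted_doors, 50))  # (6, 3): doors 16, 42, then 50
```

Each pass through the loop discards about half of the remaining doors, which is where the logarithmic step count comes from.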
137 00:07:01,490 --> 00:07:05,270 Electronically a computer can jump to any location in memory 138 00:07:05,270 --> 00:07:07,460 instantly in so-called constant time. 139 00:07:07,460 --> 00:07:10,077 So just one step, that might take me several. 140 00:07:10,077 --> 00:07:13,160 And so that's an advantage a computer has and it's just one of the reasons 141 00:07:13,160 --> 00:07:17,330 why they are so much faster than us at solving so many problems. 142 00:07:17,330 --> 00:07:22,790 But the key ingredient to laying out the data for a computer to solve 143 00:07:22,790 --> 00:07:26,712 your problems quickly is that you need to put your data back-to-back-to-back. 144 00:07:26,712 --> 00:07:28,420 Because a computer at the end of the day, 145 00:07:28,420 --> 00:07:32,470 yes, stores only 0's and 1's, but those 0's and 1's are generally 146 00:07:32,470 --> 00:07:34,630 treated in units of, say, eight-- 147 00:07:34,630 --> 00:07:36,070 8 bits per byte. 148 00:07:36,070 --> 00:07:39,220 But those bytes, when storing numbers like this, 149 00:07:39,220 --> 00:07:42,670 need those numbers to be back-to-back-to-back and not just 150 00:07:42,670 --> 00:07:44,283 jumbled all over the place. 151 00:07:44,283 --> 00:07:46,450 Because it needs to be the case that the computer is 152 00:07:46,450 --> 00:07:51,280 allowed to do the simplest of arithmetic to figure out where to look. 153 00:07:51,280 --> 00:07:53,980 Even I in my head am sort of doing a bit of math figuring out, 154 00:07:53,980 --> 00:07:55,000 well where's the middle? 155 00:07:55,000 --> 00:07:57,812 Even though among few doors you can pretty much eyeball it quickly. 156 00:07:57,812 --> 00:08:00,520 But a computer's going to have to do a bit of arithmetic, so what 157 00:08:00,520 --> 00:08:01,600 is that math? 158 00:08:01,600 --> 00:08:05,680 Well if I have 1, 2, 3, 4, 5, 6, 7 doors initially, 159 00:08:05,680 --> 00:08:08,950 and I want to find the middle one, I'm actually just going to do what? 
160 00:08:08,950 --> 00:08:12,520 7 divided by 2, which gives me 3 and 1/2-- that's 161 00:08:12,520 --> 00:08:14,920 not an integer that's that useful for counting doors, 162 00:08:14,920 --> 00:08:17,330 so let's just round it down to 3. 163 00:08:17,330 --> 00:08:22,090 So 7 divided by 2 is 3.5 rounded down to 3 suggests 164 00:08:22,090 --> 00:08:26,920 mathematically that the number of the door that's in the middle of my doors 165 00:08:26,920 --> 00:08:29,088 should be that known as 3. 166 00:08:29,088 --> 00:08:30,880 Now recall that a computer generally starts 167 00:08:30,880 --> 00:08:34,690 counting at 0 because 0 bits represent 0 in decimal, 168 00:08:34,690 --> 00:08:38,440 and so this is door 0, 1, 2, 3, 4, 5, 6. 169 00:08:38,440 --> 00:08:42,340 So there's still seven doors, but the first is 0 and the last is called 6. 170 00:08:42,340 --> 00:08:46,458 So if I'm looking for number 3, that's 0, 1, 2, 3. 171 00:08:46,458 --> 00:08:49,000 And indeed, that's why I jumped to the middle of these doors, 172 00:08:49,000 --> 00:08:52,960 because I went very specifically to location 3. 173 00:08:52,960 --> 00:08:56,200 Now why did I jump to 42 next? 174 00:08:56,200 --> 00:08:59,020 Of course, that was in the middle of the three remaining doors, 175 00:08:59,020 --> 00:09:01,750 but how would a computer know mathematically where to go, 176 00:09:01,750 --> 00:09:04,480 whereas we can just rather eyeball it here? 177 00:09:04,480 --> 00:09:09,430 Well if you've got 3 doors divided by 2, that gives me, of course, 1.5-- 178 00:09:09,430 --> 00:09:11,230 let's round that down to 1. 179 00:09:11,230 --> 00:09:15,040 So if we now re-number these doors, it's 0, 1, 2, 180 00:09:15,040 --> 00:09:20,220 because these are the only three doors that exist, well door 1 is 0, 1-- 181 00:09:20,220 --> 00:09:25,018 the 42, and that's how a computer would know to jump right to 42. 182 00:09:25,018 --> 00:09:27,310 Of course, with just one door left, it's pretty simple. 
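The arithmetic just described, dividing the number of remaining doors by 2 and rounding down, can be written out directly. In this small sketch the helper `middle_door` and its parameters are hypothetical names; `first` is the position of the leftmost door still in play, so the renumbered answer can be translated back to the original door numbers.

```python
def middle_door(first, count):
    """Position of the middle door among `count` doors starting at position `first`."""
    return first + count // 2  # // divides and rounds down


print(7 // 2)             # 3: 3.5 rounded down
print(middle_door(0, 7))  # 3: the door holding 16
print(middle_door(4, 3))  # 5: door 1 of the remaining three, the one holding 42
```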
183 00:09:27,310 --> 00:09:29,980 You needn't even do any of that math if there's just one, 184 00:09:29,980 --> 00:09:33,820 and so we can immediately access that in constant time. 185 00:09:33,820 --> 00:09:37,360 In other words, even though my human feet are taking a bit of energy 186 00:09:37,360 --> 00:09:41,710 to get from one door to another, a computer has the leg-up, so to speak, 187 00:09:41,710 --> 00:09:45,250 of getting to these doors even quicker, because all it has to do 188 00:09:45,250 --> 00:09:47,770 is a little bit of division, maybe some rounding, 189 00:09:47,770 --> 00:09:51,400 and then jump exactly to that position in memory. 190 00:09:51,400 --> 00:09:56,050 And that is what we call constant time, but it presupposes, again, 191 00:09:56,050 --> 00:09:59,950 that the data is laid out back-to-back-to-back so that every one 192 00:09:59,950 --> 00:10:04,255 of these numbers is an equal distance away from every other. 193 00:10:04,255 --> 00:10:06,130 Because otherwise if you were to do this math 194 00:10:06,130 --> 00:10:08,278 and come up with the numbers 3 or 1, you 195 00:10:08,278 --> 00:10:10,570 have to be able to know where you're jumping in memory, 196 00:10:10,570 --> 00:10:15,880 because that number 42 can't be down here, it has to be numerically in order 197 00:10:15,880 --> 00:10:18,580 exactly where you expect. 198 00:10:18,580 --> 00:10:23,500 And so in computer science and in programming, this kind of arrangement 199 00:10:23,500 --> 00:10:27,970 where you have doors or really data back-to-back-to-back is known 200 00:10:27,970 --> 00:10:29,920 as an array. 201 00:10:29,920 --> 00:10:34,900 An array is a contiguous block of memory wherein values are stored 202 00:10:34,900 --> 00:10:39,280 back-to-back-to-back-to-back-- from left to right conceptually, 203 00:10:39,280 --> 00:10:42,700 although of course, direction has less meaning once you're inside 204 00:10:42,700 --> 00:10:44,720 of a computer. 
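Why the back-to-back layout makes the jump a single step can be illustrated with Python's standard `array` and `struct` modules. This is a sketch under assumptions: the typecode `"i"` (a C int, commonly 4 bytes, though the size is platform dependent) is chosen only for illustration, `value_at` is a hypothetical helper, and the 8 is again an assumed value.

```python
from array import array
import struct

# Seven integers stored contiguously, each taking doors.itemsize bytes.
doors = array("i", [4, 8, 15, 16, 23, 42, 50])
raw = doors.tobytes()  # the same values as one unbroken run of bytes


def value_at(index):
    """Jump straight to a value: one multiplication, then one read."""
    offset = index * doors.itemsize  # equal spacing makes this arithmetic valid
    return struct.unpack_from("i", raw, offset)[0]


print(value_at(3))  # 16
print(value_at(5))  # 42
```

If the values were scattered around memory instead, `index * itemsize` would no longer point at the right bytes, and the computer would have to walk to each value instead of jumping to it.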
205 00:10:44,720 --> 00:10:47,800 Now it is thanks to these arrays that we were able to search, 206 00:10:47,800 --> 00:10:49,690 even something like a phone book so quickly. 207 00:10:49,690 --> 00:10:52,000 After all, you can imagine in the physical world, 208 00:10:52,000 --> 00:10:54,850 a phone book isn't all that unlike an array, 209 00:10:54,850 --> 00:10:58,480 albeit a more arcane version here, because its pages are indeed 210 00:10:58,480 --> 00:11:02,120 back-to-back-to-back-to-back from left to right, which is wonderful. 211 00:11:02,120 --> 00:11:04,120 And you'll recall when we searched a phone book, 212 00:11:04,120 --> 00:11:07,600 we were already able to describe the efficiency via which 213 00:11:07,600 --> 00:11:10,330 we were able to search it-- via each of those three algorithms. 214 00:11:10,330 --> 00:11:13,870 One page at a time, two pages at a time, and then one-half 215 00:11:13,870 --> 00:11:15,610 of the remaining problem at a time. 216 00:11:15,610 --> 00:11:18,670 Well it turns out that there's a direct connection even 217 00:11:18,670 --> 00:11:20,860 to the simplification of that same problem. 218 00:11:20,860 --> 00:11:24,640 If I have n doors and I search them from left to right, that of course 219 00:11:24,640 --> 00:11:28,720 might take me as many as six, seven total steps or n if the number I'm seeking 220 00:11:28,720 --> 00:11:30,160 is all the way at the end. 221 00:11:30,160 --> 00:11:33,100 I could have gone two doors at a time, although that really 222 00:11:33,100 --> 00:11:36,130 would have gone off the rails with the randomly ordered numbers, 223 00:11:36,130 --> 00:11:39,970 because there would have been no logic to just going left to right twice as 224 00:11:39,970 --> 00:11:43,750 fast because I would be missing every other element never knowing 225 00:11:43,750 --> 00:11:45,070 when to go back. 
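On sorted doors, by contrast, skipping every other door is safe, because overshooting the target says exactly when to go back. The following is a hedged sketch of that two-at-a-time idea; the function name and step accounting are assumptions made for illustration, as is the value 8.

```python
def search_two_at_a_time(sorted_doors, target):
    """Open every second door; on overshoot, double back one door.

    Returns (index, doors_opened), or (None, doors_opened) if target is absent.
    """
    opened = 0
    i = 1  # skip door 0 and start with door 1
    while i < len(sorted_doors):
        opened += 1
        if sorted_doors[i] == target:
            return i, opened
        if sorted_doors[i] > target:
            opened += 1  # went too far: check the door that was skipped
            found = sorted_doors[i - 1] == target
            return (i - 1 if found else None), opened
        i += 2
    # With an odd number of doors, the last door has not been opened yet.
    if i - 1 < len(sorted_doors):
        opened += 1
        if sorted_doors[i - 1] == target:
            return i - 1, opened
    return None, opened


sorted_doors = [4, 8, 15, 16, 23, 42, 50]  # 8 is an assumed value
print(search_two_at_a_time(sorted_doors, 50))  # (6, 4)
print(search_two_at_a_time(sorted_doors, 15))  # (2, 3)
```

The step count is roughly half that of the one-door-at-a-time search, but it still grows in proportion to the number of doors.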
226 00:11:45,070 --> 00:11:47,800 And in the case of binary search, my last algorithm where 227 00:11:47,800 --> 00:11:50,050 I started in the middle and found 16, and then 228 00:11:50,050 --> 00:11:52,540 started in the middle of that middle and found 42, and then 229 00:11:52,540 --> 00:11:55,720 started in the middle of the middle and found my last number, 230 00:11:55,720 --> 00:12:00,010 binary search is quite akin to what we did by tearing that problem in half 231 00:12:00,010 --> 00:12:00,880 and in half. 232 00:12:00,880 --> 00:12:04,150 So how did we describe the efficiency of that algorithm last time? 233 00:12:04,150 --> 00:12:08,740 Well we proposed that my first algorithm was linear, this straight line in red 234 00:12:08,740 --> 00:12:12,340 represented here by the label n, because for every page in the phone book, 235 00:12:12,340 --> 00:12:15,550 in the worst case you might need one extra step to find someone 236 00:12:15,550 --> 00:12:16,313 like Mike Smith. 237 00:12:16,313 --> 00:12:19,480 And indeed, in the case of these doors, if there's just one more door added, 238 00:12:19,480 --> 00:12:23,920 you might need one more step to find that number 50 or any other. 239 00:12:23,920 --> 00:12:26,200 Now I could, once those doors are sorted, 240 00:12:26,200 --> 00:12:29,270 go through them twice as fast, looking two doors at a time, 241 00:12:29,270 --> 00:12:34,630 and if I go too far and find, say, 51, I could double-back and fix that mistake. 242 00:12:34,630 --> 00:12:36,672 But what I ultimately did was divide and conquer. 243 00:12:36,672 --> 00:12:39,088 Starting in the middle, and then the middle of the middle, 244 00:12:39,088 --> 00:12:41,020 and the middle of the middle of the middle, 245 00:12:41,020 --> 00:12:43,420 and that's what gave me this performance. 
246 00:12:43,420 --> 00:12:46,600 This so-called logarithmic time-- log base 2 of n, 247 00:12:46,600 --> 00:12:50,740 which if nothing else means that we have a different shape fundamentally 248 00:12:50,740 --> 00:12:52,570 to the performance of this algorithm. 249 00:12:52,570 --> 00:12:57,430 It grows so much more slowly in time even as the problem gets really big. 250 00:12:57,430 --> 00:13:01,600 And even off the screen here, imagine that even as n gets huge, 251 00:13:01,600 --> 00:13:04,270 that green line would not seem to be going very high 252 00:13:04,270 --> 00:13:07,150 even as the red and yellow ones do. 253 00:13:07,150 --> 00:13:08,950 So in computer science, there are actually 254 00:13:08,950 --> 00:13:14,110 formal labels we can apply to this sort of methodology of analyzing algorithms. 255 00:13:14,110 --> 00:13:18,280 When you talk about upper bounds, on just how much time an algorithm takes, 256 00:13:18,280 --> 00:13:21,490 you might say this-- big O, quite literally. 257 00:13:21,490 --> 00:13:24,940 That an algorithm is in a big O of some formula. 258 00:13:24,940 --> 00:13:27,760 For instance, among the formulas it might be are these here-- 259 00:13:27,760 --> 00:13:32,170 n squared, or n log n, or n, or log n, or 1. 260 00:13:32,170 --> 00:13:34,570 Which is to say you can represent somewhat simply 261 00:13:34,570 --> 00:13:38,860 mathematically using n-- or really any other placeholder-- as 262 00:13:38,860 --> 00:13:44,110 a variable that represents the size of the problem in question. 
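To see how differently those lines grow, the worst-case step counts for the three approaches can be tabulated for a few problem sizes. This is a rough sketch: the function names are made up here, and the two-at-a-time count of n // 2 + 1 is an approximation that allows one extra step for doubling back.

```python
def linear_steps(n):
    """One door (or page) at a time: up to n steps."""
    return n


def two_at_a_time_steps(n):
    """Two at a time: about half as many steps, plus one to double back."""
    return n // 2 + 1


def binary_steps(n):
    """Halve the problem until nothing is left: about log base 2 of n steps."""
    steps = 0
    while n > 0:
        steps += 1
        n //= 2
    return steps


for n in (7, 1000, 1000000):
    print(n, linear_steps(n), two_at_a_time_steps(n), binary_steps(n))
# 7 7 4 3
# 1000 1000 501 10
# 1000000 1000000 500001 20
```

Going from a thousand doors to a million adds only ten steps to the halving approach, while the other two grow a thousandfold.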
263 00:13:44,110 --> 00:13:47,200 So for instance, in the case of linear search, 264 00:13:47,200 --> 00:13:49,230 when I'm searching that phone book left to right 265 00:13:49,230 --> 00:13:51,970 or searching these doors left to right, in the worst case, 266 00:13:51,970 --> 00:13:55,990 it might take me as many as n steps to find Mike or that 50, 267 00:13:55,990 --> 00:13:58,990 and so we would say that that linear algorithm is 268 00:13:58,990 --> 00:14:03,850 in big O of n, which is just a fancier way of saying quite simply that it's 269 00:14:03,850 --> 00:14:06,070 indeed linear in time. 270 00:14:06,070 --> 00:14:09,340 But sometimes I might get lucky, and indeed in the best case, 271 00:14:09,340 --> 00:14:12,040 I might find Mike or 50 or anything else much faster, 272 00:14:12,040 --> 00:14:15,430 and computer scientists also have ways of expressing lower bounds 273 00:14:15,430 --> 00:14:17,170 on the running times of algorithms. 274 00:14:17,170 --> 00:14:19,630 Whereby in the best case, perhaps, an algorithm 275 00:14:19,630 --> 00:14:24,010 might take only this much time and at least this much time. 276 00:14:24,010 --> 00:14:28,450 And we use a capitalized omega to express that notion of a lower bound, 277 00:14:28,450 --> 00:14:32,410 whereas again, a big O represents an upper bound on the same. 278 00:14:32,410 --> 00:14:35,650 So we can use these same formulas, because depending on the algorithm, 279 00:14:35,650 --> 00:14:40,000 it might indeed take n squared steps or just 1 or a constant number thereof, 280 00:14:40,000 --> 00:14:43,360 but we can consider even linear search to have a lower bound, 281 00:14:43,360 --> 00:14:47,020 because in the best case, maybe Mike or maybe 50 282 00:14:47,020 --> 00:14:49,300 or any other inputs of the problem just so 283 00:14:49,300 --> 00:14:53,200 happens to be at the very beginning of that book or those doors. 
284 00:14:53,200 --> 00:14:56,440 And so in the best case, a lower bound on the running time of linear search 285 00:14:56,440 --> 00:14:59,170 might indeed be omega of 1 because you might just 286 00:14:59,170 --> 00:15:01,810 get lucky and take one step or two or three 287 00:15:01,810 --> 00:15:05,590 or terribly few, but independent of the number n. 288 00:15:05,590 --> 00:15:09,310 And so there, we might express this lower bound as well. 289 00:15:09,310 --> 00:15:11,830 Now meanwhile there's one more Greek symbol 290 00:15:11,830 --> 00:15:17,170 here, theta, capitalized here, which represents a coincidence of upper 291 00:15:17,170 --> 00:15:18,100 and lower bounds. 292 00:15:18,100 --> 00:15:20,980 Whereby if it happens to be the case for some algorithm 293 00:15:20,980 --> 00:15:24,250 that you have an upper bound and a lower bound that are the same, 294 00:15:24,250 --> 00:15:27,940 you can equivalently say not both of those statements, but quite simply 295 00:15:27,940 --> 00:15:30,755 that the algorithm is in theta of some formula. 296 00:15:30,755 --> 00:15:33,460 297 00:15:33,460 --> 00:15:36,460 Now suffice it to say, this green line is good. 298 00:15:36,460 --> 00:15:40,630 Indeed, any time we achieve logarithmic time instead of, say, linear time, we 299 00:15:40,630 --> 00:15:42,370 have made an improvement. 300 00:15:42,370 --> 00:15:44,180 But what did we presuppose? 301 00:15:44,180 --> 00:15:46,510 Well, we presupposed in both the case of the phone book 302 00:15:46,510 --> 00:15:50,162 and in the case of those doors that they were sorted in advance for us. 303 00:15:50,162 --> 00:15:52,120 By me in the case of the doors and by the phone 304 00:15:52,120 --> 00:15:54,370 company in the case of the book. 305 00:15:54,370 --> 00:15:56,560 But what did it cost me and what did it cost 306 00:15:56,560 --> 00:15:58,960 them to sort all of those numbers and names 307 00:15:58,960 --> 00:16:03,220 just to enable us ultimately to search logarithmically? 
308 00:16:03,220 --> 00:16:06,640 Well let's consider that in the context of, again, some numbers, this time 309 00:16:06,640 --> 00:16:09,130 some numbers that I myself can move around. 310 00:16:09,130 --> 00:16:13,270 Here we have eight cups, and on these eight cups are eight numbers from 1 311 00:16:13,270 --> 00:16:14,080 through 8. 312 00:16:14,080 --> 00:16:17,500 And they're indeed sorted from smallest to largest, though I could equivalently 313 00:16:17,500 --> 00:16:20,980 do this problem from largest to smallest so long as we all 314 00:16:20,980 --> 00:16:22,720 agree what the goal is. 315 00:16:22,720 --> 00:16:25,660 Well let me go ahead and just randomly shuffle some of these cups 316 00:16:25,660 --> 00:16:28,850 so that not everything is in order anymore, 317 00:16:28,850 --> 00:16:32,270 and indeed now they're fairly jumbled, and indeed not in the order I want, 318 00:16:32,270 --> 00:16:34,420 so some work needs to be done. 319 00:16:34,420 --> 00:16:36,610 Now why might they arrive in this order? 320 00:16:36,610 --> 00:16:39,100 Well in the case of the phone book, certainly new people 321 00:16:39,100 --> 00:16:42,580 are moving into a town every day, and so they're coming in not themselves 322 00:16:42,580 --> 00:16:44,980 in alphabetical order, but seemingly random, 323 00:16:44,980 --> 00:16:46,900 and it's up to the phone company to slot them 324 00:16:46,900 --> 00:16:50,800 into the right place in a phone book for the sake of next year's print. 325 00:16:50,800 --> 00:16:52,300 And the same thing with those doors. 326 00:16:52,300 --> 00:16:54,700 Were I to add more and more numbers behind those doors, 327 00:16:54,700 --> 00:16:57,580 I'd need to decide where to put them, and they're not necessarily 328 00:16:57,580 --> 00:17:01,870 going to arrive from my input source in the order I want. 
329 00:17:01,870 --> 00:17:04,780 So here, then, I have some randomly-ordered data, 330 00:17:04,780 --> 00:17:07,690 how do I go about sorting it quickly? 331 00:17:07,690 --> 00:17:10,170 Well, let's take a look at the first problem I see. 332 00:17:10,170 --> 00:17:13,990 2 and 1 are out of order, so let me just go ahead and swap, so to speak, 333 00:17:13,990 --> 00:17:14,829 those two. 334 00:17:14,829 --> 00:17:16,930 I've now improved the state of my cups, 335 00:17:16,930 --> 00:17:19,720 and I've made some progress still, but 2 and 6 336 00:17:19,720 --> 00:17:23,290 seem OK even though maybe there should be some cups in between. 337 00:17:23,290 --> 00:17:24,910 So let's look at the next pair now. 338 00:17:24,910 --> 00:17:29,260 We have 6 and 5, which definitely are out of order, so let's switch those. 339 00:17:29,260 --> 00:17:31,480 6 and 4 are the same, out of order. 340 00:17:31,480 --> 00:17:33,680 6 and 3, just as much. 341 00:17:33,680 --> 00:17:37,820 6 and 8 are not quite back-to-back, but there's probably 342 00:17:37,820 --> 00:17:41,260 going to be a number in-between, but they are at least in the right order, 343 00:17:41,260 --> 00:17:43,210 because 6, of course, is less than 8. 344 00:17:43,210 --> 00:17:45,310 And then lastly we have 8 and 7. 345 00:17:45,310 --> 00:17:47,740 Let's swap those here and done-- 346 00:17:47,740 --> 00:17:49,120 or are we not? 347 00:17:49,120 --> 00:17:53,680 Well I've made improvements with every such swap, but some of these cups 348 00:17:53,680 --> 00:17:55,240 still remain out of order. 349 00:17:55,240 --> 00:17:56,410 Now these two are all set. 350 00:17:56,410 --> 00:17:58,780 2 and 5 are as well, even though ultimately we 351 00:17:58,780 --> 00:18:02,390 might need some numbers between them, but 4 and 5 are indeed out of order. 352 00:18:02,390 --> 00:18:04,560 3 and 5 just as much. 353 00:18:04,560 --> 00:18:08,930 6 and 5 are OK, 7 and 6 are OK, and 8 and 7 as well. 
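The pairwise compare-and-swap passes over the cups can be sketched as follows; the technique is commonly called bubble sort, a name the lecture has not used at this point. Two things here are assumptions: the starting order of the cups is reconstructed from the swaps narrated above, and the stopping rule (quit after a full pass that makes no swaps) is the one the lecture arrives at shortly.

```python
def bubble_sort(cups):
    """Repeat left-to-right passes, swapping adjacent cups that are out of order.

    Returns (sorted_cups, passes), where passes includes the final swap-free pass.
    """
    cups = list(cups)  # work on a copy
    passes = 0
    while True:
        passes += 1
        swapped = False
        for i in range(len(cups) - 1):
            if cups[i] > cups[i + 1]:
                cups[i], cups[i + 1] = cups[i + 1], cups[i]
                swapped = True
        if not swapped:  # a clean pass means everything is in order
            return cups, passes


# Starting order inferred from the narration: 2 and 1 swapped first, and so on.
print(bubble_sort([2, 1, 6, 5, 4, 3, 8, 7]))
# ([1, 2, 3, 4, 5, 6, 7, 8], 4)
```

Three passes do the actual fixing for this arrangement, and the fourth only confirms that no cup is out of place.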
354 00:18:08,930 --> 00:18:11,750 So we're almost done there, but I do see some glitches. 355 00:18:11,750 --> 00:18:14,200 So let's again compare all of these cups pairwise-- 356 00:18:14,200 --> 00:18:17,895 1, 2; 2, 4-- oops, 4, 3, let's swap that. 357 00:18:17,895 --> 00:18:19,270 Let's keep going just to be safe. 358 00:18:19,270 --> 00:18:23,050 4, 5; 5, 6; 6, 7; 7, 8. 359 00:18:23,050 --> 00:18:27,430 And by way of this process, just comparing cups back-to-back, 360 00:18:27,430 --> 00:18:29,740 we can fix any mistakes we see. 361 00:18:29,740 --> 00:18:31,870 Just for good measure, let me do this once more. 362 00:18:31,870 --> 00:18:36,640 1, 2; 2, 3; 3, 4; 4, 5; 5, 6; 6, 7; 7, 8. 363 00:18:36,640 --> 00:18:40,000 Now this time that I've gone all the way from left to right checking 364 00:18:40,000 --> 00:18:44,350 that every cup is in order, I can safely conclude that these cups are sorted. 365 00:18:44,350 --> 00:18:47,320 After all, if I just went from left to right and did no work, 366 00:18:47,320 --> 00:18:50,800 why would I presume that if I do that same algorithm again, 367 00:18:50,800 --> 00:18:51,790 I'd make any changes? 368 00:18:51,790 --> 00:18:55,450 I wouldn't, so I can quit at this point. 369 00:18:55,450 --> 00:18:59,260 So that's all fine and good, but perhaps we could have sorted these differently. 370 00:18:59,260 --> 00:19:02,650 That felt a little tedious and I felt like I was doing a lot of work. 371 00:19:02,650 --> 00:19:06,550 What if I just try to select the cups I want rather than deal 372 00:19:06,550 --> 00:19:08,120 with two cups at a time? 373 00:19:08,120 --> 00:19:11,570 Let's go ahead and randomly shuffle these again in any old order, 374 00:19:11,570 --> 00:19:17,620 making sure to perturb what was otherwise left to right. 375 00:19:17,620 --> 00:19:20,137 And here we have now another random assortment of cups. 376 00:19:20,137 --> 00:19:21,970 But you know what I'm going to do this time? 
377 00:19:21,970 --> 00:19:25,000 I'm just going to select the smallest I see. 378 00:19:25,000 --> 00:19:28,043 2 is already pretty small, so I'll start as before on the left. 379 00:19:28,043 --> 00:19:31,210 So let's now check the other cups to see if there's something smaller that I 380 00:19:31,210 --> 00:19:33,550 might prefer to be in this location. 381 00:19:33,550 --> 00:19:36,760 3, 1-- ooh, 1 is better, I'm going to make mental note of this one. 382 00:19:36,760 --> 00:19:42,788 5, 8, 7, 6, 4-- all right, so 1 would seem to be the smallest number. 383 00:19:42,788 --> 00:19:45,080 So I'm going to go ahead and put this where it belongs, 384 00:19:45,080 --> 00:19:47,020 which is right here at the side. 385 00:19:47,020 --> 00:19:49,300 There's really no room for it, but you know what? 386 00:19:49,300 --> 00:19:51,300 These were randomly-sorted, let me just go ahead 387 00:19:51,300 --> 00:19:54,552 and evict whatever's there, too, and put 1 in its place. 388 00:19:54,552 --> 00:19:57,010 Now to be fair, I might have messed things up a little bit, 389 00:19:57,010 --> 00:20:01,420 but no more so than I might have when I received these numbers randomly. 390 00:20:01,420 --> 00:20:03,640 In fact, I might even get lucky-- by evicting a cup, 391 00:20:03,640 --> 00:20:08,960 I might end up putting it in the right place so it all washes out in the end. 392 00:20:08,960 --> 00:20:11,380 Now let's go ahead and select the next smallest number, 393 00:20:11,380 --> 00:20:14,140 but not bother looking at that first one anymore. 394 00:20:14,140 --> 00:20:16,730 So 3 is pretty small, so I'll keep that in mind. 395 00:20:16,730 --> 00:20:19,990 2 is even smaller, so I'll forget about 3 and now remember 2. 396 00:20:19,990 --> 00:20:22,780 5 is bigger, 8 and 7 and 6 and 4-- 397 00:20:22,780 --> 00:20:26,710 all right, 2 now seems to be the next smallest number I can select.
398 00:20:26,710 --> 00:20:29,700 I know it belongs there, but 3's already there, so let's evict 3 399 00:20:29,700 --> 00:20:31,510 and there you go, I got lucky. 400 00:20:31,510 --> 00:20:33,877 Now I have 1 and 2 in the right place. 401 00:20:33,877 --> 00:20:35,710 Let's again select the next smallest number. 402 00:20:35,710 --> 00:20:38,920 I see 3 here, and again, I don't necessarily know as a computer 403 00:20:38,920 --> 00:20:41,200 if I'm only looking at one number at a time 404 00:20:41,200 --> 00:20:44,980 if there are, in fact, anything smaller to its side. 405 00:20:44,980 --> 00:20:48,340 So let's check-- 5, 8, 7, 6, 4-- nope. 406 00:20:48,340 --> 00:20:52,492 So 3 I shall select, and I got lucky, I'll leave it alone. 407 00:20:52,492 --> 00:20:53,950 How about the next smallest number? 408 00:20:53,950 --> 00:20:58,030 5 is pretty small, but 8, 7, 6, 4 is even smaller. 409 00:20:58,030 --> 00:21:01,900 Let's select this one, put it in its place, evicting the 5 410 00:21:01,900 --> 00:21:03,650 and putting it where there's room. 411 00:21:03,650 --> 00:21:05,920 8 is not that small, but it's all I know now. 412 00:21:05,920 --> 00:21:08,020 But ooh-- 7 is smaller, I'll remember this. 413 00:21:08,020 --> 00:21:10,655 6 is even smaller, I'll remember that, and it feels 414 00:21:10,655 --> 00:21:12,280 like I'm creating some work for myself. 415 00:21:12,280 --> 00:21:15,010 5 is the next smallest, 8's in the way. 416 00:21:15,010 --> 00:21:18,070 We'll evict 8 and put 5 right there. 417 00:21:18,070 --> 00:21:22,150 7 is pretty small, but 6 is even smaller, but still smaller than 8, 418 00:21:22,150 --> 00:21:27,290 so let's pick up 6, evict 7, and put 7 in its place. 419 00:21:27,290 --> 00:21:30,250 Now for good measure, we're obviously done, but I as the computer 420 00:21:30,250 --> 00:21:33,458 don't know that yet if I'm just looking at one of these cups or, if you will, 421 00:21:33,458 --> 00:21:34,510 doors at a time. 
422 00:21:34,510 --> 00:21:38,740 7's pretty small, 8 is no smaller, so 7 I've selected 423 00:21:38,740 --> 00:21:40,750 to stay right there in its place. 424 00:21:40,750 --> 00:21:45,520 8 as well, by that same logic, is now in its right place. 425 00:21:45,520 --> 00:21:48,430 So it turns out that these two algorithms 426 00:21:48,430 --> 00:21:52,178 that I concocted along the way actually do have some formal semantics. 427 00:21:52,178 --> 00:21:54,220 In fact, in computer science, we'd call the first 428 00:21:54,220 --> 00:21:57,310 of those algorithms that thing here, bubble sort. 429 00:21:57,310 --> 00:22:01,990 Because in fact, as you compare two cups side-by-side and swap them on occasion 430 00:22:01,990 --> 00:22:05,980 in order to fix transpositions, well, your largest numbers 431 00:22:05,980 --> 00:22:08,590 would seem to be bubbling their way up to the top, 432 00:22:08,590 --> 00:22:13,210 or equivalently, the smallest ones down to the end, and so bubble sort 433 00:22:13,210 --> 00:22:15,130 is the formal name for that algorithm. 434 00:22:15,130 --> 00:22:18,850 How might I express this more succinctly than my voiceover there? 435 00:22:18,850 --> 00:22:20,380 Well let me propose this pseudocode. 436 00:22:20,380 --> 00:22:23,140 There's no one way to describe this or any algorithm, 437 00:22:23,140 --> 00:22:26,710 but this was as few English words as I could come up with and still 438 00:22:26,710 --> 00:22:28,390 be pretty precise. 439 00:22:28,390 --> 00:22:34,720 So repeat until no swaps the following-- for i from 0 to n minus 2, 440 00:22:34,720 --> 00:22:39,310 if the i-th and i-th plus 1 elements are out of order, swap them. 441 00:22:39,310 --> 00:22:40,570 Now why this lingo?
442 00:22:40,570 --> 00:22:43,390 Well computational thinking is all about expressing yourself 443 00:22:43,390 --> 00:22:46,210 very methodically, very clearly, and ultimately 444 00:22:46,210 --> 00:22:50,060 defining, say, some variables or terms that you'll need in your arguments. 445 00:22:50,060 --> 00:22:52,390 And so here what I've done is adopt a convention. 446 00:22:52,390 --> 00:22:54,550 I'm using i to represent an integer-- 447 00:22:54,550 --> 00:22:55,840 some sort of counter-- 448 00:22:55,840 --> 00:23:01,240 to represent the index of each of my cups or doors or pages. 449 00:23:01,240 --> 00:23:03,160 And here, we are adopting the convention, too, 450 00:23:03,160 --> 00:23:04,870 of starting to count from 0. 451 00:23:04,870 --> 00:23:08,030 And so if I want to start looking at the first cup, a.k.a. 452 00:23:08,030 --> 00:23:13,330 0, I want to keep looking up, up to the cup called n minus 2, 453 00:23:13,330 --> 00:23:22,210 because if my first cup is cup 0, and this is then 1, 2, 3, 4, 5, 6, 7, 454 00:23:22,210 --> 00:23:25,900 indeed the cup is labeled 8, but it's in position 7. 455 00:23:25,900 --> 00:23:30,355 And so this position more generally, if there are n cups, would be n minus 1. 456 00:23:30,355 --> 00:23:36,400 So bubble sort is telling me to start at 0 and then look up to n minus 2, 457 00:23:36,400 --> 00:23:38,500 because in the next line of code, I'm supposed 458 00:23:38,500 --> 00:23:43,450 to compare the i-th elements and the i-th plus 1, so to speak. 459 00:23:43,450 --> 00:23:45,610 So I don't want to look all the way to the end, 460 00:23:45,610 --> 00:23:49,210 I want to look one shy to the end, because I know in looking at pairs, 461 00:23:49,210 --> 00:23:53,680 I'm looking at this one as well as the one to its right, a.k.a. 462 00:23:53,680 --> 00:23:54,948 i plus 1. 
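The bubble sort pseudocode just described can be sketched in Python roughly as follows. This is only an illustration, not code from the lecture: the function name `bubble_sort` and the variable names are assumptions, and the starting order of the cups is reconstructed from the narration of the first demonstration.

```python
def bubble_sort(cups):
    """Sort a list in place by swapping adjacent pairs that are out of order."""
    n = len(cups)
    swapped = True
    while swapped:                  # repeat until a full pass makes no swaps
        swapped = False
        for i in range(n - 1):      # i runs from 0 up to n - 2, inclusive
            if cups[i] > cups[i + 1]:
                # The i-th and (i+1)-th cups are out of order, so swap them.
                cups[i], cups[i + 1] = cups[i + 1], cups[i]
                swapped = True
    return cups


# Starting order as narrated in the first cup demonstration.
print(bubble_sort([2, 1, 6, 5, 4, 3, 8, 7]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

Stopping at n - 2 matters because each step also looks at position i + 1; letting i reach n - 1 would compare the last cup against a cup that does not exist.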
463 00:23:54,948 --> 00:23:56,740 So the algorithm ultimately is just saying, 464 00:23:56,740 --> 00:24:01,360 as you repeat that process again and again until there are no swaps, just 465 00:24:01,360 --> 00:24:07,150 as I proposed, you're swapping any two cups that with respect to each other 466 00:24:07,150 --> 00:24:08,440 are out of order. 467 00:24:08,440 --> 00:24:13,030 And so this, too, is an example more generally of solving local problems 468 00:24:13,030 --> 00:24:16,600 and achieving ultimately a global result, if you will. 469 00:24:16,600 --> 00:24:22,030 Because with each swap of those cups, I'm improving the quality of my data. 470 00:24:22,030 --> 00:24:26,290 And each swap in and of itself doesn't necessarily solve the big picture, 471 00:24:26,290 --> 00:24:29,560 but together when we aggregate all of those smaller solutions have we 472 00:24:29,560 --> 00:24:32,540 assembled the final result. 473 00:24:32,540 --> 00:24:34,780 Now what about that second algorithm, wherein 474 00:24:34,780 --> 00:24:38,170 I started again with some random cups, and then that time I 475 00:24:38,170 --> 00:24:42,220 selected one at a time the number I actually wanted in place? 476 00:24:42,220 --> 00:24:43,660 I first sought out the smallest. 477 00:24:43,660 --> 00:24:46,990 I found that to be 1 and I put it all the way there on the left. 478 00:24:46,990 --> 00:24:49,090 And I then sought out the next smallest number, 479 00:24:49,090 --> 00:24:52,690 which after checking the remaining cups, I determined was 2. 480 00:24:52,690 --> 00:24:55,808 And so I put 2 second in place. 481 00:24:55,808 --> 00:24:57,850 And then I repeated that process again and again, 482 00:24:57,850 --> 00:25:02,380 not necessarily knowing in advance from anyone what numbers I'd find. 483 00:25:02,380 --> 00:25:04,960 Because I checked each and every remaining cup, 484 00:25:04,960 --> 00:25:10,000 I was able to conclude safely that I had indeed found the next smallest element.
485 00:25:10,000 --> 00:25:12,010 And so that algorithm, too, has a name-- 486 00:25:12,010 --> 00:25:13,120 selection sort. 487 00:25:13,120 --> 00:25:15,520 And I might describe it with pseudocode similar 488 00:25:15,520 --> 00:25:18,940 in structure but with different logic ultimately. 489 00:25:18,940 --> 00:25:23,270 Let me propose that we do for i from 0 to n minus 1, 490 00:25:23,270 --> 00:25:28,300 where again, n is the number of cups, and 0 is by convention my first cup, 491 00:25:28,300 --> 00:25:31,600 and n minus 1, therefore, is my last. 492 00:25:31,600 --> 00:25:35,370 And what I then want to do is find the smallest element between the i-th 493 00:25:35,370 --> 00:25:37,530 element and the one 494 00:25:37,530 --> 00:25:39,015 at n minus 1. 495 00:25:39,015 --> 00:25:42,390 That is, find the smallest element between wherever you've begun 496 00:25:42,390 --> 00:25:44,625 and that last element, n minus 1. 497 00:25:44,625 --> 00:25:48,390 And then if-- when you've found that smallest element, 498 00:25:48,390 --> 00:25:50,730 you swap it with the i-th element. 499 00:25:50,730 --> 00:25:53,190 And that's why I was picking up one cup and another 500 00:25:53,190 --> 00:25:57,630 and swapping them in place-- evicting one and putting one where it belongs. 501 00:25:57,630 --> 00:25:59,670 And you do this again and again and again, 502 00:25:59,670 --> 00:26:01,830 because each time you're incrementing i. 503 00:26:01,830 --> 00:26:06,780 So whereas the first iteration of this loop will start here all the way left, 504 00:26:06,780 --> 00:26:09,840 the second iteration will start here, and the third iteration 505 00:26:09,840 --> 00:26:10,860 will start here. 506 00:26:10,860 --> 00:26:13,950 And so the amount of problem to be solved 507 00:26:13,950 --> 00:26:20,020 is steadily decreasing until I have 1 and then 0 cups left.
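The selection sort pseudocode above might look like this in Python. It is a minimal sketch under stated assumptions: the name `selection_sort` and the helper variable `smallest` are illustrative choices, and the starting order is reconstructed from the narration of the second cup demonstration.

```python
def selection_sort(cups):
    """Sort a list in place by repeatedly selecting the smallest remaining item."""
    n = len(cups)
    for i in range(n):                  # i runs from 0 to n - 1
        smallest = i                    # index of the smallest cup seen so far
        for j in range(i + 1, n):       # check every remaining cup to the right
            if cups[j] < cups[smallest]:
                smallest = j            # make a mental note of the smaller cup
        # Evict whatever sits at position i and put the smallest cup there.
        cups[i], cups[smallest] = cups[smallest], cups[i]
    return cups


# Starting order as narrated in the second cup demonstration.
print(selection_sort([2, 3, 1, 5, 8, 7, 6, 4]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

Each pass of the outer loop starts one position further right, which is why the unsorted portion shrinks by one cup per pass.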
508 00:26:20,020 --> 00:26:22,810 Now it certainly took some work to sort those n cups, 509 00:26:22,810 --> 00:26:24,700 but how much work did it take? 510 00:26:24,700 --> 00:26:28,030 Well in the case of bubble sort, what was I doing on each pass 511 00:26:28,030 --> 00:26:29,020 through these cups? 512 00:26:29,020 --> 00:26:32,140 Well I was comparing and then potentially swapping 513 00:26:32,140 --> 00:26:36,100 each adjacent pair of cups, and then repeating myself again and again. 514 00:26:36,100 --> 00:26:39,100 Well if we have here n cups, how many pairs 515 00:26:39,100 --> 00:26:41,650 can you create which you then consider swapping? 516 00:26:41,650 --> 00:26:47,461 Well if I have n cups, I could seem to make 1, 2, 3, 4, 5, 6, 517 00:26:47,461 --> 00:26:52,750 7 pairs out of 8 cups at a time, so more generally n minus 1. 518 00:26:52,750 --> 00:26:57,100 So on each pass here, it would seem that I'm comparing n minus 1 pairs of cups. 519 00:26:57,100 --> 00:26:59,530 Now how many passes do I need to ultimately make? 520 00:26:59,530 --> 00:27:02,170 It would seem to be roughly n, because in the worst case, 521 00:27:02,170 --> 00:27:05,110 these cups might be completely out of order. 522 00:27:05,110 --> 00:27:09,160 Which is to say, I might indeed do n things n minus 1 times, 523 00:27:09,160 --> 00:27:13,990 and if you multiply that out, I'm going to get some factor of n squared. 524 00:27:13,990 --> 00:27:18,040 But what about selection sort, wherein I instead looked through all of the cups, 525 00:27:18,040 --> 00:27:20,680 selecting first the smallest, and then repeating 526 00:27:20,680 --> 00:27:23,380 that process for the next smallest still? 527 00:27:23,380 --> 00:27:25,840 Well in that case, I started with n cups, 528 00:27:25,840 --> 00:27:28,600 and I might need to look at all n, and then 529 00:27:28,600 --> 00:27:32,635 once I found that, I might instead look at n minus 1.
530 00:27:32,635 --> 00:27:37,990 So there, too, I seem to be summing something like n plus n minus 1 531 00:27:37,990 --> 00:27:40,600 plus n minus 2 and so forth, so let's see 532 00:27:40,600 --> 00:27:43,640 if we can't now summarize this as well. 533 00:27:43,640 --> 00:27:46,900 Well let me propose more mathematically, that, say, with selection sort, 534 00:27:46,900 --> 00:27:48,540 what we've done is this. 535 00:27:48,540 --> 00:27:53,200 In looking for that smallest cup, I had to make n minus 1 comparisons. 536 00:27:53,200 --> 00:27:55,960 Because as I identified the smallest cup I'd yet seen, 537 00:27:55,960 --> 00:27:59,590 I compared it to no more than n minus 1 others. 538 00:27:59,590 --> 00:28:04,870 Now if the first selection of a cup took me n minus 1 steps but then it's done, 539 00:28:04,870 --> 00:28:07,590 the next selection of the next smallest cup would 540 00:28:07,590 --> 00:28:10,690 have taken me only n minus 2 steps. 541 00:28:10,690 --> 00:28:13,280 And if you continue that logic with each pass, 542 00:28:13,280 --> 00:28:17,770 you have to do a little bit less work until you're left with just one 543 00:28:17,770 --> 00:28:20,680 very last cup at the end, such as 8. 544 00:28:20,680 --> 00:28:23,110 So what does this actually sum to? 545 00:28:23,110 --> 00:28:25,840 Well you might not remember or see it at first glance, 546 00:28:25,840 --> 00:28:28,960 but it turns out, particularly if you look at one of those charts 547 00:28:28,960 --> 00:28:33,970 at the back of a textbook, that this summation or series actually 548 00:28:33,970 --> 00:28:38,530 aggregates to n times n minus 1, all divided by 2. 549 00:28:38,530 --> 00:28:41,560 Now this you can perhaps multiply out a bit more readily as 550 00:28:41,560 --> 00:28:44,590 in n squared minus n all divided by 2. 551 00:28:44,590 --> 00:28:47,230 And if we factor that out, we can now get n squared 552 00:28:47,230 --> 00:28:51,520 divided by 2 minus n divided by 2.
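As a quick sanity check on that arithmetic, the sum (n - 1) + (n - 2) + ... + 1 can be compared against the closed form n(n - 1)/2 in a few lines of Python. The function name `selection_comparisons` is an assumption made for this sketch, and checking a few values of n is an illustration, not a proof.

```python
def selection_comparisons(n):
    """Total comparisons for selection sort on n items: (n-1) + (n-2) + ... + 1."""
    return sum(range(1, n))     # adds 1 + 2 + ... + (n - 1)


# The step-by-step sum agrees with the closed form n * (n - 1) / 2.
for n in (1, 2, 8, 1000):
    assert selection_comparisons(n) == n * (n - 1) // 2

print(selection_comparisons(8))  # 28 comparisons for the 8 cups
```

For the eight cups that is 7 + 6 + 5 + 4 + 3 + 2 + 1 = 28, which matches 8 times 7 divided by 2.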
553 00:28:51,520 --> 00:28:56,620 Now which of these terms, n squared divided by 2 or n divided by 2, 554 00:28:56,620 --> 00:28:58,450 tends to dominate the other? 555 00:28:58,450 --> 00:29:01,420 That is to say, as n gets larger and larger, 556 00:29:01,420 --> 00:29:05,500 which of these mathematical expressions has the biggest effect 557 00:29:05,500 --> 00:29:07,060 on the number of steps? 558 00:29:07,060 --> 00:29:12,760 Well surely it's n squared, albeit divided by 2, because as n gets large, 559 00:29:12,760 --> 00:29:15,677 n squared is certainly larger than n. 560 00:29:15,677 --> 00:29:18,010 And so what a computer scientist here would typically do 561 00:29:18,010 --> 00:29:22,330 is just ignore those lower-ordered terms, so to speak. 562 00:29:22,330 --> 00:29:26,440 And he would say with a figurative or literal wave of the hand, 563 00:29:26,440 --> 00:29:31,030 this is on the order of n squared this algorithm. 564 00:29:31,030 --> 00:29:33,550 That isn't to say it's precisely that many steps, 565 00:29:33,550 --> 00:29:37,270 but rather as n gets really large, it is pretty much 566 00:29:37,270 --> 00:29:41,140 that n squared term that really matters the most. 567 00:29:41,140 --> 00:29:45,310 Now this is not a form of proof, but rather a proof by example, if you will, 568 00:29:45,310 --> 00:29:49,600 but let's see if I can't convince you with a single example numerically 569 00:29:49,600 --> 00:29:51,550 of the impact of that square. 570 00:29:51,550 --> 00:29:56,350 Well if we start again with n squared over 2 minus n over 2 and say n 571 00:29:56,350 --> 00:30:01,660 is maybe 1 million initially-- so not eight cups, not 1,000 pages in a book, 572 00:30:01,660 --> 00:30:06,550 but 1 million numbers or any other element itself. 573 00:30:06,550 --> 00:30:08,280 What does this actually sum to? 
574 00:30:08,280 --> 00:30:12,820 Well 1 million squared divided by 2 minus 1 million divided by 2 575 00:30:12,820 --> 00:30:21,970 happens to be 500 billion minus 500,000, which of course is 499,999,500,000. 576 00:30:21,970 --> 00:30:28,090 Now I daresay that is pretty darn close to big O of n squared. 577 00:30:28,090 --> 00:30:28,900 Why? 578 00:30:28,900 --> 00:30:31,180 Well if we started with, say, 1 trillion then 579 00:30:31,180 --> 00:30:36,520 halved it and ended up with 499 billion, that's still pretty close. 580 00:30:36,520 --> 00:30:41,270 Now in real terms, that does not equal the same number of steps, 581 00:30:41,270 --> 00:30:44,710 but it gives us a general sense it's on the order of this many steps, 582 00:30:44,710 --> 00:30:47,830 because if we plugged in larger and larger values for n, 583 00:30:47,830 --> 00:30:51,118 that difference would not even be as extreme. 584 00:30:51,118 --> 00:30:54,160 Well why don't we take a look now at these algorithms in a different form 585 00:30:54,160 --> 00:30:58,030 altogether without the physical limitation of me as the computer? 586 00:30:58,030 --> 00:31:02,080 Pictured here is, if you will, an array of numbers, but pictured graphically. 587 00:31:02,080 --> 00:31:04,450 Wherein we have vertical bars, and the taller 588 00:31:04,450 --> 00:31:07,190 the bar, the bigger the number it represents. 589 00:31:07,190 --> 00:31:10,300 So big bar is big number, small bar is small number, 590 00:31:10,300 --> 00:31:12,970 but they're clearly, therefore, unsorted. 591 00:31:12,970 --> 00:31:16,870 Via these algorithms we've seen, bubble sort and selection sort, 592 00:31:16,870 --> 00:31:20,650 what does it actually look like to sort so many elements? 593 00:31:20,650 --> 00:31:22,140 Let's take a look. 594 00:31:22,140 --> 00:31:25,290 In this tool where I proceed to choose my first algorithm, 595 00:31:25,290 --> 00:31:27,630 which shall be, say, bubble sort.
596 00:31:27,630 --> 00:31:30,750 And you'll see rather slowly that this algorithm is indeed comparing 597 00:31:30,750 --> 00:31:32,280 pairwise elements, and if-- 598 00:31:32,280 --> 00:31:36,630 and only if they're out of order, swapping them again and again. 599 00:31:36,630 --> 00:31:38,730 Now to be fair, this quickly gets tedious, 600 00:31:38,730 --> 00:31:41,190 so let me increase the animation speed here. 601 00:31:41,190 --> 00:31:45,630 And now you can rather see that bubbling up of the largest. 602 00:31:45,630 --> 00:31:48,300 Previously it was my 8 and my 7 and 6. 603 00:31:48,300 --> 00:31:52,890 Here we have 99, 98, 97, but indeed, those tallest bars 604 00:31:52,890 --> 00:31:54,610 are making their way up. 605 00:31:54,610 --> 00:31:57,720 So let's turn our attention next to this other algorithm, selection sort, 606 00:31:57,720 --> 00:32:01,440 to see if it looks or perhaps feels rather different. 607 00:32:01,440 --> 00:32:03,720 Here now we have selection sort each time 608 00:32:03,720 --> 00:32:07,680 going through the entire list looking for the smallest possible element. 609 00:32:07,680 --> 00:32:09,810 Highlighted in red for just a moment here is 610 00:32:09,810 --> 00:32:13,530 9, because we have not yet until-- oh, now found 611 00:32:13,530 --> 00:32:16,455 a smaller element, now 2, and now 1. 612 00:32:16,455 --> 00:32:19,080 And we'll continue looking through the rest of the numbers just 613 00:32:19,080 --> 00:32:21,960 to be sure we don't find something smaller, and once we do, 614 00:32:21,960 --> 00:32:23,340 1 goes into place. 615 00:32:23,340 --> 00:32:26,820 And then we repeat that process, but we do fewer steps now, 616 00:32:26,820 --> 00:32:30,540 because whereas there are n total bars, we don't need to look at the leftmost 617 00:32:30,540 --> 00:32:34,455 now because it's sorted, we only need look at n minus 1. 618 00:32:34,455 --> 00:32:36,150 So this process again will repeat. 
619 00:32:36,150 --> 00:32:37,140 We found 2. 620 00:32:37,140 --> 00:32:40,440 We're just double-checking that there's not something smaller, 621 00:32:40,440 --> 00:32:42,520 and now 2 is in its place. 622 00:32:42,520 --> 00:32:44,850 Now we humans, of course, have the advantage 623 00:32:44,850 --> 00:32:48,030 of having an aerial view, if you will, of all this data. 624 00:32:48,030 --> 00:32:51,330 And certainly a computer could remember more than just 625 00:32:51,330 --> 00:32:53,970 the smallest number it's recently seen. 626 00:32:53,970 --> 00:32:56,985 Why not for efficiency remember the two smallest numbers? 627 00:32:56,985 --> 00:32:58,110 The three smallest numbers? 628 00:32:58,110 --> 00:32:59,400 The four smallest numbers? 629 00:32:59,400 --> 00:33:03,120 That's fine, but that argument is quickly devolving into-- 630 00:33:03,120 --> 00:33:05,670 just remember all the original numbers. 631 00:33:05,670 --> 00:33:08,280 And so yes, you could perhaps save some time, 632 00:33:08,280 --> 00:33:11,520 but it sounds like you're asking for more and more space 633 00:33:11,520 --> 00:33:14,580 with which to remember the answers to those questions. 634 00:33:14,580 --> 00:33:17,010 Now this, too, would seem to be taking us all day. 635 00:33:17,010 --> 00:33:20,160 Even if we down here increase the animation speed, 636 00:33:20,160 --> 00:33:24,000 it now is selecting those elements a bit faster and faster, 637 00:33:24,000 --> 00:33:26,650 but there's still so much work to be done. 638 00:33:26,650 --> 00:33:29,820 Indeed, these comparison-based sorts that are comparing things again 639 00:33:29,820 --> 00:33:34,110 and again and then redoing that work in some form to improve the problem still 640 00:33:34,110 --> 00:33:36,630 just tend to end up on the order of-- 641 00:33:36,630 --> 00:33:38,410 bingo, of n squared. 
642 00:33:38,410 --> 00:33:41,220 Which is to say that n squared or something quadratic 643 00:33:41,220 --> 00:33:42,810 tends to be rather slow. 644 00:33:42,810 --> 00:33:46,650 And this is quite in contrast to our logarithmic time before, 645 00:33:46,650 --> 00:33:51,600 but that logarithm thus far was for searching, not sorting. 646 00:33:51,600 --> 00:33:54,435 So let's compare these two now side by side, 647 00:33:54,435 --> 00:33:57,060 albeit with a different tool that presents the same information 648 00:33:57,060 --> 00:33:58,770 graphically sideways. 649 00:33:58,770 --> 00:34:01,440 Here again we have bars, and small bar is small number, 650 00:34:01,440 --> 00:34:06,090 and big bar is big number, but here, they've simply been rotated 90 degrees. 651 00:34:06,090 --> 00:34:09,090 On the left here we have selection sort, on the right here bubble sort, 652 00:34:09,090 --> 00:34:12,420 both of whose bars are randomly sorted so that neither 653 00:34:12,420 --> 00:34:14,880 has an edge necessarily over the other. 654 00:34:14,880 --> 00:34:18,150 Let's go ahead and play all and see what happens here. 655 00:34:18,150 --> 00:34:20,340 And you'll see that indeed, bubble's bubbling up 656 00:34:20,340 --> 00:34:23,460 and selection is improving its selections as we go. 657 00:34:23,460 --> 00:34:27,460 Bubble would seem to have won because selection's got a bit more work, 658 00:34:27,460 --> 00:34:30,929 but there, too, it's pretty close to a tie. 659 00:34:30,929 --> 00:34:32,800 So can we do better? 660 00:34:32,800 --> 00:34:37,080 Well it turns out we can, so long as we use a bit more of that intuition 661 00:34:37,080 --> 00:34:40,830 we had when we started thinking computationally 662 00:34:40,830 --> 00:34:44,340 and we divided and conquered, we divided and conquered.
663 00:34:44,340 --> 00:34:49,469 In other words, why not, given n doors or n cups or n pages, 664 00:34:49,469 --> 00:34:52,927 why don't we divide and conquer that problem again and again? 665 00:34:52,927 --> 00:34:54,719 In other words, in the context of the cups, 666 00:34:54,719 --> 00:34:59,220 why don't I simply sort for you the left half and then the right half, 667 00:34:59,220 --> 00:35:03,167 and then with two sorted halves, just interweave them for you together. 668 00:35:03,167 --> 00:35:06,000 That would seem to be a little different from walking back and forth 669 00:35:06,000 --> 00:35:08,722 and back and forth and swapping elements again and again. 670 00:35:08,722 --> 00:35:10,680 Just do a little bit of work here, a little bit 671 00:35:10,680 --> 00:35:14,070 more now, and then reassemble your total work. 672 00:35:14,070 --> 00:35:16,650 Now of course, if I simply say, I'll sort this left half, 673 00:35:16,650 --> 00:35:18,450 what does it mean to sort this left half? 674 00:35:18,450 --> 00:35:21,210 Well, I dare say this left half can be divided 675 00:35:21,210 --> 00:35:25,380 into a left half of the left half, thereby making the problem smaller. 676 00:35:25,380 --> 00:35:30,120 So somehow or other, we could leverage that intuition of binary search, 677 00:35:30,120 --> 00:35:31,800 but apply it to sort. 678 00:35:31,800 --> 00:35:35,400 It's not going to be in the end quite as fast as binary search, 679 00:35:35,400 --> 00:35:38,490 because with sort, you have to deal with all of the elements, 680 00:35:38,490 --> 00:35:40,590 you can't simply tear half of the problem 681 00:35:40,590 --> 00:35:44,490 away because you'd be leaving half of your elements unsorted. 682 00:35:44,490 --> 00:35:47,010 But it turns out there's many algorithms that 683 00:35:47,010 --> 00:35:50,520 are faster than selection and bubble sort, and one of those 684 00:35:50,520 --> 00:35:52,290 is called merge sort.
685 00:35:52,290 --> 00:35:55,800 And merge sort leverages precisely this intuition of dividing 686 00:35:55,800 --> 00:35:59,640 a problem in half and in half, and to be fair, touching all of those halves 687 00:35:59,640 --> 00:36:04,290 ultimately, but doing it in a way that's more efficient and less 688 00:36:04,290 --> 00:36:08,280 comparison-based than bubble sort and selection sort themselves. 689 00:36:08,280 --> 00:36:11,670 So let me go ahead and play all now with these three sets of bars 690 00:36:11,670 --> 00:36:15,410 and see just which one wins now. 691 00:36:15,410 --> 00:36:17,580 And after just a moment, there's nothing more 692 00:36:17,580 --> 00:36:22,500 to say-- merge sort has already won, if you will, even though now bubble has, 693 00:36:22,500 --> 00:36:23,520 and now selection. 694 00:36:23,520 --> 00:36:25,270 And perhaps this was a fluke-- to be fair, 695 00:36:25,270 --> 00:36:27,810 these numbers are random, maybe merge sort got lucky. 696 00:36:27,810 --> 00:36:31,500 Let's go ahead and play the test once more with other numbers. 697 00:36:31,500 --> 00:36:33,720 And indeed it again is done. 698 00:36:33,720 --> 00:36:36,540 Let me play it one third and final time, but notice the pattern 699 00:36:36,540 --> 00:36:38,520 now that emerges with merge sort. 700 00:36:38,520 --> 00:36:44,930 You can see if you look closely the actual halving again and again. 701 00:36:44,930 --> 00:36:47,360 And indeed, it seems that half of the list gets sorted, 702 00:36:47,360 --> 00:36:49,888 and then you reassemble it at the very end. 703 00:36:49,888 --> 00:36:51,680 And indeed, let's zoom in on this algorithm 704 00:36:51,680 --> 00:36:54,290 now and look specifically at merge sort alone. 705 00:36:54,290 --> 00:36:57,290 Here we have merge sort, and highlighted in colors 706 00:36:57,290 --> 00:37:01,160 as we do work is exactly the elements you're sorting again and again.
707 00:37:01,160 --> 00:37:03,410 The reason so few of these bars are being 708 00:37:03,410 --> 00:37:07,880 looked at a time is because again, logically or recursively, if you will, 709 00:37:07,880 --> 00:37:09,980 we are sorting first the left half. 710 00:37:09,980 --> 00:37:11,750 But no, the left half of the left half. 711 00:37:11,750 --> 00:37:15,020 But no, the left half of the left half of the left half and so 712 00:37:15,020 --> 00:37:17,120 forth, and what this really boils down to 713 00:37:17,120 --> 00:37:21,140 ultimately is sorting eventually individual elements. 714 00:37:21,140 --> 00:37:23,990 But if I hand you one element and I say, please sort this, 715 00:37:23,990 --> 00:37:27,920 it has no halves, so your work is done-- you don't need do a thing. 716 00:37:27,920 --> 00:37:30,947 But then if you have two halves, each of size 1, 717 00:37:30,947 --> 00:37:32,780 there might indeed be work to be done there, 718 00:37:32,780 --> 00:37:35,900 because if one is smaller than the other or one is larger than the other, 719 00:37:35,900 --> 00:37:39,410 you do need to interleave those for me to merge them. 720 00:37:39,410 --> 00:37:41,930 And that's exactly what merge sort's doing here. 721 00:37:41,930 --> 00:37:45,470 Allow me to increase the animation speed and you'll see as we go, 722 00:37:45,470 --> 00:37:48,410 that half of the list is getting sorted at a time. 723 00:37:48,410 --> 00:37:50,450 It's not perfect and it's not perfectly smooth, 724 00:37:50,450 --> 00:37:53,033 because that's-- because half of the other elements are there, 725 00:37:53,033 --> 00:37:56,600 but now we're re-merging the two halves. 726 00:37:56,600 --> 00:37:58,850 And that was fast, and it finished faster 727 00:37:58,850 --> 00:38:02,060 indeed than it would have for bubble and selection sort, 728 00:38:02,060 --> 00:38:05,750 but there was a price being paid.
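The recursive halving and merging described here can be sketched in Python as follows. This is one common way to write merge sort, offered as an assumption-laden illustration and not as the code behind the visualization: the name `merge_sort` and its variables are hypothetical. Note that it builds new lists for the halves and for the merged result, which is the extra space that the "price being paid" refers to.

```python
def merge_sort(items):
    """Return a new sorted list using recursive halving and merging."""
    if len(items) <= 1:
        return list(items)              # zero or one element is already sorted

    mid = len(items) // 2
    left = merge_sort(items[:mid])      # sort the left half
    right = merge_sort(items[mid:])     # sort the right half

    # Interleave the two sorted halves into a temporary list.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])             # at most one of these two is non-empty
    merged.extend(right[j:])
    return merged


print(merge_sort([2, 3, 1, 5, 8, 7, 6, 4]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

Unlike the two in-place sketches of bubble and selection sort, this version never swaps elements within the original list; it trades extra memory for fewer repeated comparisons.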
729 00:38:05,750 --> 00:38:09,170 If you think back to our vertical visualization of bubble sort 730 00:38:09,170 --> 00:38:13,420 and selection sort, they were doing all of their work in place. 731 00:38:13,420 --> 00:38:17,870 Merge sort seemed to be getting a little greedy on us, if you will, 732 00:38:17,870 --> 00:38:21,650 in that it was temporarily putting some of those bars down here, 733 00:38:21,650 --> 00:38:26,420 effectively using twice as much space as those first two algorithms, selection 734 00:38:26,420 --> 00:38:27,110 and bubble. 735 00:38:27,110 --> 00:38:30,350 And indeed, that's where merge sort gets its edge fundamentally. 736 00:38:30,350 --> 00:38:34,880 It's not just a better algorithm, per se, and better thought-out, 737 00:38:34,880 --> 00:38:40,100 but it actually additionally consumes more resources-- not time, but space. 738 00:38:40,100 --> 00:38:43,520 By using twice as much space-- not just the top half of the screen, 739 00:38:43,520 --> 00:38:44,510 but the bottom-- 740 00:38:44,510 --> 00:38:47,930 can merge sort temporarily put some of its work over here, 741 00:38:47,930 --> 00:38:51,710 continue doing some other work, and then reassemble them together. 742 00:38:51,710 --> 00:38:54,412 Both selection sort and bubble sort did not have that advantage. 743 00:38:54,412 --> 00:38:56,120 They had to do everything in place, which 744 00:38:56,120 --> 00:38:59,540 is why we had to swap so many things so many times. 745 00:38:59,540 --> 00:39:03,290 We had far fewer spots in which to work on that table. 746 00:39:03,290 --> 00:39:06,110 But with merge sort, spend a bit more space, 747 00:39:06,110 --> 00:39:10,580 and you can reduce that amount of time. 748 00:39:10,580 --> 00:39:15,370 Now all of these algorithms assume that our data is back-to-back-to-back-- 749 00:39:15,370 --> 00:39:17,140 that is, stored in an array.
750 00:39:17,140 --> 00:39:21,010 And that's great, because that's exactly how a computer is so inclined 751 00:39:21,010 --> 00:39:23,170 to store data inherently. 752 00:39:23,170 --> 00:39:26,590 For instance, pictured here is a stick of memory of RAM-- 753 00:39:26,590 --> 00:39:28,190 Random Access Memory. 754 00:39:28,190 --> 00:39:33,550 And indeed, albeit a bit of a misnomer that R in RAM, random, actually 755 00:39:33,550 --> 00:39:37,480 means that a computer can jump in instant or constant time 756 00:39:37,480 --> 00:39:38,920 to a specific byte. 757 00:39:38,920 --> 00:39:41,440 And that's so important when we want to jump 758 00:39:41,440 --> 00:39:46,330 around our data, our cups, or our pages in order to get at data instantly, 759 00:39:46,330 --> 00:39:47,440 if you will. 760 00:39:47,440 --> 00:39:52,270 And the reason it is so conducive to laying out information back-to-back 761 00:39:52,270 --> 00:39:57,370 contiguously in memory is if we consider one of these black chips on this DIMM-- 762 00:39:57,370 --> 00:39:59,380 or Dual In-line Memory Module-- 763 00:39:59,380 --> 00:40:02,200 is that we have in this black chip really, if you will, 764 00:40:02,200 --> 00:40:04,420 an artist's rendition at hand. 765 00:40:04,420 --> 00:40:06,670 That artist's rendition might propose that if you 766 00:40:06,670 --> 00:40:11,770 have some number of bytes in this chip, say 1 billion for 1 gigabyte, 767 00:40:11,770 --> 00:40:14,290 it certainly stands to reason that we humans could 768 00:40:14,290 --> 00:40:16,810 number those bytes from 0 on up-- 769 00:40:16,810 --> 00:40:19,810 from 0 to 1 billion, roughly speaking. 770 00:40:19,810 --> 00:40:23,110 And so the top left one here might be 0, the next one might be 1, 771 00:40:23,110 --> 00:40:25,960 the next one thereafter should be 2, and so we can 772 00:40:25,960 --> 00:40:28,090 number each and every one of our bytes. 
773 00:40:28,090 --> 00:40:33,370 And so when you store a number on a cup or a number behind a door, 774 00:40:33,370 --> 00:40:37,420 that amounts to just writing those numbers inside of each of these boxes. 775 00:40:37,420 --> 00:40:40,540 And each is next to the other, and so with simple arithmetic, 776 00:40:40,540 --> 00:40:43,060 a bit of division and rounding, you might 777 00:40:43,060 --> 00:40:46,240 be able to jump instantly to any one of these addresses. 778 00:40:46,240 --> 00:40:49,870 There are no moving parts here to do any work like my human feet might 779 00:40:49,870 --> 00:40:51,310 have to do in our real world. 780 00:40:51,310 --> 00:40:55,630 Rather the computer can jump instantly to that so-called address 781 00:40:55,630 --> 00:40:59,030 or index of the array. 782 00:40:59,030 --> 00:41:01,960 Now what can we do when we have a canvas that 783 00:41:01,960 --> 00:41:05,320 allows us to lay out memory in this way? 784 00:41:05,320 --> 00:41:07,720 We can represent any number of types. 785 00:41:07,720 --> 00:41:11,440 Indeed, in Python, there are all sorts of types of data. 786 00:41:11,440 --> 00:41:15,400 For instance, bool for a Boolean value and float for a floating point 787 00:41:15,400 --> 00:41:17,450 value, a real number with a decimal. 788 00:41:17,450 --> 00:41:20,530 An int for an integer and str for a string. 789 00:41:20,530 --> 00:41:23,800 Each of those is laid out in memory in some particular way that's 790 00:41:23,800 --> 00:41:26,590 conducive to accessing it efficiently. 791 00:41:26,590 --> 00:41:29,440 But that's precisely why, too, we've run into issues 792 00:41:29,440 --> 00:41:33,130 when using something like a float, because if you decide a priori to use 793 00:41:33,130 --> 00:41:36,460 only so many bytes, bytes to the left and to the right, 794 00:41:36,460 --> 00:41:41,050 above and below it might end up getting used by other parts of your program.
795 00:41:41,050 --> 00:41:46,120 And so if you've only asked for, say, 32 or 64 bits or 4 or 8 bytes, 796 00:41:46,120 --> 00:41:49,090 because you're then going to be surrounded by other data, 797 00:41:49,090 --> 00:41:55,030 that floating point value or some other can only be ultimately so precise. 798 00:41:55,030 --> 00:41:57,190 Because ultimately yes, we're operating in bits, 799 00:41:57,190 --> 00:42:01,520 but those bits are physically laid out in some order. 800 00:42:01,520 --> 00:42:06,910 So with that said, what are the options via which we can paint on this canvas? 801 00:42:06,910 --> 00:42:08,860 Surely it would be nice if we could store 802 00:42:08,860 --> 00:42:12,550 data not necessarily always back-to-back in this way, 803 00:42:12,550 --> 00:42:15,610 but we can create more sophisticated data structures 804 00:42:15,610 --> 00:42:20,860 so as to support not only these types here, but also ones like these. 805 00:42:20,860 --> 00:42:25,030 Dict in Python for dictionary, otherwise known as a hash table. 806 00:42:25,030 --> 00:42:29,770 And list for a sort of array that can grow and shrink, and range 807 00:42:29,770 --> 00:42:31,330 for a range of values. 808 00:42:31,330 --> 00:42:33,820 Set for a collection of values that contain 809 00:42:33,820 --> 00:42:39,160 no duplicates, and tuples, something like x, y or latitude, longitude. 810 00:42:39,160 --> 00:42:42,640 These concepts-- surely it would be nice to have 811 00:42:42,640 --> 00:42:46,750 accessible to us in higher level contexts like Python, 812 00:42:46,750 --> 00:42:50,530 but if at the end of the day all we have is bytes of memory back-to-back, 813 00:42:50,530 --> 00:42:53,830 we need some layers of abstraction on top of that memory 814 00:42:53,830 --> 00:42:57,200 so as to implement these more sophisticated structures. 
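To make the types just mentioned a bit more concrete, here is a small Python sketch. The specific values (the coordinates, the phone number) are hypothetical and chosen only for illustration; the struct module is used merely to show how many bytes a 32-bit or 64-bit float occupies.

```python
import struct

# Floats: a fixed number of bytes means only so much precision.
print(struct.calcsize("<f"))  # bytes in a 32-bit float
print(struct.calcsize("<d"))  # bytes in a 64-bit float
print(0.1 + 0.2 == 0.3)       # False: these values are not stored exactly

# Higher-level types layered on top of those bytes.
numbers = [4, 8, 15, 16, 23, 42]         # list: can grow and shrink
numbers.append(50)
point = (42.37, -71.11)                  # tuple: hypothetical latitude, longitude
unique = set([4, 8, 8, 15, 15])          # set: duplicates collapse
phone_book = {"Mike Smith": "555-0100"}  # dict: hypothetical number
first_seven = list(range(7))             # range: the values 0 through 6

print(numbers, point, unique, phone_book, first_seven)
```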
815 00:42:57,200 --> 00:42:59,275 So we'll take a look at a few in particular-- int 816 00:42:59,275 --> 00:43:02,380 and str and dict and list, because all of those 817 00:43:02,380 --> 00:43:08,140 somehow need to be built on top of these lower-level principles of memory. 818 00:43:08,140 --> 00:43:11,950 So how might this work and what problems might we solve? 819 00:43:11,950 --> 00:43:15,130 Let's now use the board as my canvas, drawing on it 820 00:43:15,130 --> 00:43:17,710 that same grid of rows and columns in order 821 00:43:17,710 --> 00:43:21,940 to divide this screen into that many bytes. 822 00:43:21,940 --> 00:43:26,020 And I'll go ahead and divide this board into these squares, each one of which 823 00:43:26,020 --> 00:43:30,640 represents an individual byte, and each of those bytes, of course, 824 00:43:30,640 --> 00:43:32,740 has some number associated with it. 825 00:43:32,740 --> 00:43:36,790 That number is not the number inside of that box, per se, not the bits 826 00:43:36,790 --> 00:43:40,000 that compose it, but rather just metadata-- an index 827 00:43:40,000 --> 00:43:44,350 or address that exists implicitly, but is not actually stored. 828 00:43:44,350 --> 00:43:47,470 This then might be index 0 or address 0, this 829 00:43:47,470 --> 00:43:52,180 might be 1, this 2, this 3, this one 4, this one 5. 830 00:43:52,180 --> 00:43:55,420 And if we, for artist's sake, move to the next row, 831 00:43:55,420 --> 00:43:59,000 we might call this 6 and this 7, and so forth. 832 00:43:59,000 --> 00:44:03,190 Now suppose we want to store some actual values in this memory, 833 00:44:03,190 --> 00:44:04,990 well let's go ahead and do just that. 834 00:44:04,990 --> 00:44:09,040 We might store the actual number 4 here, followed by 8, 835 00:44:09,040 --> 00:44:15,390 followed by 15 and 16, perhaps followed by 23, and then 42.
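The board just drawn can be imitated in Python with a bytearray, which is itself a run of bytes numbered from 0. This is only a sketch: the size of 16 bytes is an assumption made here, and each value happens to be small enough to fit in a single byte.

```python
memory = bytearray(16)  # 16 bytes, all initially 0; addresses 0 through 15 are implicit

values = [4, 8, 15, 16, 23, 42]
for address, value in enumerate(values):
    memory[address] = value  # write each value into the byte at that address

print(list(memory[:6]))  # the bytes at addresses 0 through 5
print(memory[5])         # the byte at address 5
```

Note that the addresses themselves are never stored anywhere; they are simply positions within the bytearray.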
836 00:44:15,390 --> 00:44:19,268 And so we have some random numbers inside of this memory, 837 00:44:19,268 --> 00:44:21,060 and because those numbers are back-to-back, 838 00:44:21,060 --> 00:44:24,240 we can call this an array of size 6. 839 00:44:24,240 --> 00:44:26,820 Its first index is 0, its last index is 5, 840 00:44:26,820 --> 00:44:30,150 and between there are six total values. 841 00:44:30,150 --> 00:44:36,630 Now what can we do if we're ready to add a seventh number to this list? 842 00:44:36,630 --> 00:44:38,580 Well, we could certainly put it right here 843 00:44:38,580 --> 00:44:41,490 because this is the next appropriate location, 844 00:44:41,490 --> 00:44:44,570 but it depends whether that spot is still available. 845 00:44:44,570 --> 00:44:46,320 Because the way a computer typically works 846 00:44:46,320 --> 00:44:48,070 is that when you're writing a program, you 847 00:44:48,070 --> 00:44:51,210 need to decide in advance how much memory you want. 848 00:44:51,210 --> 00:44:54,270 And you tell the computer by way of the operating system, 849 00:44:54,270 --> 00:44:57,420 be it Windows or macOS, Linux, or something else, 850 00:44:57,420 --> 00:45:02,640 how many bytes of memory you would like to allocate to your particular problem. 851 00:45:02,640 --> 00:45:04,740 And if I only had the foresight to say, I 852 00:45:04,740 --> 00:45:07,868 would like 6 bytes in which to store 6 numbers, 853 00:45:07,868 --> 00:45:10,410 the operating system might have handed me that back and said, 854 00:45:10,410 --> 00:45:14,340 fine, here you go, but the operating system thereafter 855 00:45:14,340 --> 00:45:18,720 might have proceeded to allocate subsequent adjacent bytes, like 6 856 00:45:18,720 --> 00:45:22,962 and 7, to some other aspect of your program. 
857 00:45:22,962 --> 00:45:25,920 Which is to say, you might have painted yourself into a bit of a corner 858 00:45:25,920 --> 00:45:32,100 by only in code asking the operating system for just those initial 6 bytes. 859 00:45:32,100 --> 00:45:35,670 You instead might have wanted to ask for more bytes 860 00:45:35,670 --> 00:45:37,710 so as to allow yourself this room to grow, 861 00:45:37,710 --> 00:45:41,760 but if you didn't do that in code, you might just be unlucky. 862 00:45:41,760 --> 00:45:44,520 But that's the price you pay for an array. 863 00:45:44,520 --> 00:45:46,710 You have this wonderfully efficient ability 864 00:45:46,710 --> 00:45:48,900 to search it randomly, if you will, which 865 00:45:48,900 --> 00:45:51,700 is to say instantly via arithmetic. 866 00:45:51,700 --> 00:45:54,510 You can jump to the beginning or the end or even the middle, 867 00:45:54,510 --> 00:45:59,520 as we've seen, by just doing perhaps some addition, subtraction, division, 868 00:45:59,520 --> 00:46:02,400 and rounding, and that gets you ultimately right where 869 00:46:02,400 --> 00:46:07,030 you want to go in some constant and very few number of steps. 870 00:46:07,030 --> 00:46:10,050 But unfortunately, because you wanted all of that memory 871 00:46:10,050 --> 00:46:15,000 back-to-back-to-back, it's up to you to decide how much of it you want. 872 00:46:15,000 --> 00:46:18,180 And if the operating system, I'm sorry, has already allocated 6, 7, 873 00:46:18,180 --> 00:46:22,620 and elsewhere on the board to other parts of the program, 874 00:46:22,620 --> 00:46:25,830 you might be faced with the decision as to just say, no, 875 00:46:25,830 --> 00:46:31,330 I cannot accept any more data, or you might say, OK, operating system, 876 00:46:31,330 --> 00:46:35,310 what if I don't mind where I am in memory-- and you probably don't-- 877 00:46:35,310 --> 00:46:39,390 but I would like you to find me more bytes somewhere else? 
878 00:46:39,390 --> 00:46:42,540 Rather like going from a one-bedroom to a two-bedroom apartment 879 00:46:42,540 --> 00:46:45,990 so that you have more room, you might physically have to pack your bags 880 00:46:45,990 --> 00:46:47,490 and go somewhere else. 881 00:46:47,490 --> 00:46:51,060 Unfortunately, just like in the real world, that's not without cost. 882 00:46:51,060 --> 00:46:54,600 You need to pack those bags and physically move, which takes time, 883 00:46:54,600 --> 00:46:57,780 and so will it take you and the operating system some time 884 00:46:57,780 --> 00:47:00,600 to relocate every one of your values. 885 00:47:00,600 --> 00:47:04,260 So sure, there might be plenty of space down here below on multiple rows 886 00:47:04,260 --> 00:47:07,890 and even not pictured, but it's going to take a non-zero amount of time 887 00:47:07,890 --> 00:47:14,310 to relocate that 4 and 8 and 15 and that 16 and 23 and 42 to new locations. 888 00:47:14,310 --> 00:47:17,370 That might be your only option if you want to support more data, 889 00:47:17,370 --> 00:47:20,880 and indeed, most programs would want-- it would be an unfortunate situation 890 00:47:20,880 --> 00:47:24,900 if you had to tell your user or boss, I'm sorry, I ran out of space, 891 00:47:24,900 --> 00:47:26,460 and that's certainly foolish. 892 00:47:26,460 --> 00:47:31,470 If you actually do have more space, it's just not right there next to you. 893 00:47:31,470 --> 00:47:34,830 So with an array, you have the ability physically 894 00:47:34,830 --> 00:47:39,390 to perform very sophisticated, very efficient algorithms such as we've 895 00:47:39,390 --> 00:47:42,570 seen-- binary search and bubble sort and selection sort 896 00:47:42,570 --> 00:47:46,860 and merge sort, and do so in quite fast time. 
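The cost of packing your bags, so to speak, can be sketched in Python as follows. The function name grow and the use of None for an empty slot are assumptions made for illustration; real operating systems and languages handle this in their own ways.

```python
def grow(old_array, new_capacity):
    """Copy every value into a new, larger block of slots.

    Returns the new block and the number of values that had to move.
    """
    if new_capacity < len(old_array):
        raise ValueError("new block must be at least as large as the old one")
    new_array = [None] * new_capacity  # None marks a slot not yet in use
    copies = 0
    for index, value in enumerate(old_array):
        new_array[index] = value  # relocate one value at a time
        copies += 1
    return new_array, copies


array = [4, 8, 15, 16, 23, 42]
array, copies = grow(array, 7)
array[6] = 50  # only now is there room for a seventh value
print(array, copies)
```

The number of copies grows with the number of values already stored, which is the non-zero amount of time that relocation takes.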
897 00:47:46,860 --> 00:47:50,160 Even though selection sort and bubble sort were big O of n squared, 898 00:47:50,160 --> 00:47:55,470 merge sort was actually n times log n, 899 00:47:55,470 --> 00:48:00,210 which is slower than log n alone, but faster than n squared. 900 00:48:00,210 --> 00:48:03,300 But they all presuppose that you had random access to elements 901 00:48:03,300 --> 00:48:06,630 arithmetically via their indexes or addresses, and that 902 00:48:06,630 --> 00:48:09,870 you can have with your computer's memory with arrays, 903 00:48:09,870 --> 00:48:12,210 but you need to commit to some value. 904 00:48:12,210 --> 00:48:13,110 All right, fine. 905 00:48:13,110 --> 00:48:17,190 Let's not ask the operating system for 6 bytes initially, let's say, give me 7 906 00:48:17,190 --> 00:48:19,080 because I'm going to leave one of them blank. 907 00:48:19,080 --> 00:48:22,470 Now of course, that might buy you some runway, so to speak, 908 00:48:22,470 --> 00:48:25,230 so that you can accommodate, if and when, a seventh element, 909 00:48:25,230 --> 00:48:26,723 but what about an eighth? 910 00:48:26,723 --> 00:48:29,640 Well, you could ask the operating system from the get-go, don't give me 911 00:48:29,640 --> 00:48:35,280 6 bytes of space, but give me 8 or give me 16 or give me 100. 912 00:48:35,280 --> 00:48:38,220 But at that point, you're starting to get a little greedy, 913 00:48:38,220 --> 00:48:41,130 and you're starting to ask for more memory than you might actually 914 00:48:41,130 --> 00:48:44,520 need anytime soon, and that, too, is unfortunate, 915 00:48:44,520 --> 00:48:46,030 because now you're being wasteful.
916 00:48:46,030 --> 00:48:49,140 Your computer, of course, only has a finite amount of space, 917 00:48:49,140 --> 00:48:51,930 and if you're asking for more of it than you actually need, 918 00:48:51,930 --> 00:48:55,860 that memory, by definition, is unavailable to other parts 919 00:48:55,860 --> 00:48:58,320 of your program and perhaps even others. 920 00:48:58,320 --> 00:49:00,690 And so your computer ultimately might not 921 00:49:00,690 --> 00:49:05,670 be able to get as much work done because it's been holding off to the side 922 00:49:05,670 --> 00:49:07,650 just some empty space. 923 00:49:07,650 --> 00:49:10,680 Empty parking spaces you've reserved for yourself or empty seats 924 00:49:10,680 --> 00:49:14,220 at a table that might potentially go unused, it's just wasteful. 925 00:49:14,220 --> 00:49:16,050 And hardware costs money. 926 00:49:16,050 --> 00:49:18,390 And hardware enables you to solve problems. 927 00:49:18,390 --> 00:49:21,930 And with less hardware available, can you solve fewer problems at hand, 928 00:49:21,930 --> 00:49:25,060 and so that, too, doesn't feel like a perfect solution. 929 00:49:25,060 --> 00:49:27,840 So again, this series of trade-offs, it depends 930 00:49:27,840 --> 00:49:32,430 on what's most important to you-- time or space or money or development 931 00:49:32,430 --> 00:49:35,980 or any number of other scarce resources. 932 00:49:35,980 --> 00:49:39,450 So what can we do instead as opposed to an array? 933 00:49:39,450 --> 00:49:43,770 How do we go about getting dynamism that we so clearly want here? 934 00:49:43,770 --> 00:49:47,550 Wouldn't it be nice if we could grow these great data 935 00:49:47,550 --> 00:49:50,220 structures, and better yet, even shrink them?
936 00:49:50,220 --> 00:49:52,380 If I no longer need some of these numbers, 937 00:49:52,380 --> 00:49:55,320 I'm going to give you back that memory so that I can use it elsewhere 938 00:49:55,320 --> 00:49:57,390 for more compelling purposes. 939 00:49:57,390 --> 00:49:59,790 Well it turns out that in computer science, 940 00:49:59,790 --> 00:50:02,670 programmers can create even fancier data structures 941 00:50:02,670 --> 00:50:04,950 but at a higher level of abstraction. 942 00:50:04,950 --> 00:50:10,360 It turns out, we could start making lists out of our values. 943 00:50:10,360 --> 00:50:13,530 In fact, if I wanted to add some number to the screen, and for instance, 944 00:50:13,530 --> 00:50:16,920 maybe these two spots were blocked off by something else. 945 00:50:16,920 --> 00:50:17,670 But you know what? 946 00:50:17,670 --> 00:50:20,370 I do know there's some room elsewhere on the screen, 947 00:50:20,370 --> 00:50:22,630 it just happens to be available here. 948 00:50:22,630 --> 00:50:26,977 And so if I want to put the number 50 in my list of values, 949 00:50:26,977 --> 00:50:29,310 I might just have to say, I don't care where you put it, 950 00:50:29,310 --> 00:50:31,350 go ahead and put it right there. 951 00:50:31,350 --> 00:50:32,400 Well where is there? 952 00:50:32,400 --> 00:50:38,743 Well if we continue this indexing-- this is 6 and 7 and 8 and 9, 10, 11, 12, 953 00:50:38,743 --> 00:50:46,440 13, 14, and 15, if 50 happens to end up by chance at location 15 954 00:50:46,440 --> 00:50:50,730 because it's the first byte available, because not only these two, but maybe 955 00:50:50,730 --> 00:50:54,360 even all of these are taken for some other reason-- 956 00:50:54,360 --> 00:50:58,170 ever since you asked for your first six, that's OK, 957 00:50:58,170 --> 00:51:02,940 so long as you can somehow link your original data to the new. 958 00:51:02,940 --> 00:51:07,290 And pictorially here, I might be inclined just to say, you know what? 
959 00:51:07,290 --> 00:51:12,390 Let me just leave a little breadcrumb, so to speak, and say that after the 42, 960 00:51:12,390 --> 00:51:16,380 I should actually go down here and follow this arrow. 961 00:51:16,380 --> 00:51:19,200 Sort of Chutes and Ladders style, if you will. 962 00:51:19,200 --> 00:51:22,530 Now that's fine and you can do that-- after all, at the end of the day, 963 00:51:22,530 --> 00:51:24,840 computers will do what you want, and if you 964 00:51:24,840 --> 00:51:27,420 can write the code to implement this idea, 965 00:51:27,420 --> 00:51:30,840 it will, in fact, remember that value. 966 00:51:30,840 --> 00:51:32,067 But how do we achieve this? 967 00:51:32,067 --> 00:51:34,650 Here, too, you have to come back to the fundamental definition 968 00:51:34,650 --> 00:51:36,450 of what your computer is doing and how. 969 00:51:36,450 --> 00:51:39,930 It's just got that chip of memory, and those bytes back-to-back, 970 00:51:39,930 --> 00:51:41,700 such as those pictured here. 971 00:51:41,700 --> 00:51:46,410 So this is all you get-- there is no arrow feature inside of a computer. 972 00:51:46,410 --> 00:51:49,320 You have to implement that notion yourself. 973 00:51:49,320 --> 00:51:51,940 So how can you go about doing that? 974 00:51:51,940 --> 00:51:54,780 Well, you can implement this concept of an arrow, 975 00:51:54,780 --> 00:51:58,200 but you need to implement it ultimately at a lower level or trust 976 00:51:58,200 --> 00:52:00,210 that someone else will for you. 977 00:52:00,210 --> 00:52:04,550 Well, as best I can tell, I do know that my first several elements happened 978 00:52:04,550 --> 00:52:09,870 to be back-to-back from 4 on up to 42 in locations 0 through 5. 979 00:52:09,870 --> 00:52:12,360 Because those are contiguous, I get my random access 980 00:52:12,360 --> 00:52:15,690 and I can immediately jump from beginning to middle to end. 
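The arithmetic behind that random access might look like the following in Python. The helper address_of is hypothetical, written here only to show that an element's address can be computed rather than searched for, assuming every element occupies the same number of bytes.

```python
def address_of(base, index, element_size):
    """Address of the element at a given index in a contiguous array."""
    return base + index * element_size


array = [4, 8, 15, 16, 23, 42]

first = 0
last = len(array) - 1
middle = (first + last) // 2  # a bit of addition, division, and rounding down

print(array[first], array[middle], array[last])
print(address_of(0, last, 1))  # last element when each value takes 1 byte
print(address_of(0, last, 4))  # last element when each value takes 4 bytes
```

Each lookup takes the same small number of steps no matter how long the array is, which is what constant time means here.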
981 00:52:15,690 --> 00:52:19,680 This 50 and anything after it needs to be handled a little better. 982 00:52:19,680 --> 00:52:23,700 If I want to implement this arrow, the only possible way 983 00:52:23,700 --> 00:52:28,650 seems to be to somehow remember that the next element after 42 984 00:52:28,650 --> 00:52:32,190 is at location 15. 985 00:52:32,190 --> 00:52:33,870 And that location, a.k.a. 986 00:52:33,870 --> 00:52:38,340 address or index, just has to be something I remember. 987 00:52:38,340 --> 00:52:42,720 Unfortunately I don't have quite enough room left to remember that. 988 00:52:42,720 --> 00:52:45,930 What I really want to do is not store this arrow, but by the way, 989 00:52:45,930 --> 00:52:50,340 parenthetically go ahead and store the number 15-- 990 00:52:50,340 --> 00:52:54,420 not as the index of that cell, but as the next address 991 00:52:54,420 --> 00:52:56,190 that should be followed. 992 00:52:56,190 --> 00:52:58,980 The catch, though, is that I've not left myself enough room. 993 00:52:58,980 --> 00:53:01,230 I've made mental note in parentheses here 994 00:53:01,230 --> 00:53:03,550 that we've got to solve this a bit better. 995 00:53:03,550 --> 00:53:07,260 So let's start over for the moment, and no longer worry 996 00:53:07,260 --> 00:53:11,340 about this very low level, because it's too messy at some point. 997 00:53:11,340 --> 00:53:13,050 It's like talking in 0's and 1's-- 998 00:53:13,050 --> 00:53:15,640 I don't want to talk in bytes in this way. 999 00:53:15,640 --> 00:53:17,670 So let's take things up in abstraction level, 1000 00:53:17,670 --> 00:53:24,390 if you will, and just agree to agree that you can store values in memory, 1001 00:53:24,390 --> 00:53:27,360 and those values can be data, like numbers you want-- 1002 00:53:27,360 --> 00:53:32,100 4, 8, 15, 16, 23, 42, and now 50. 
1003 00:53:32,100 --> 00:53:37,380 And you can also store somehow the addresses or indexes-- 1004 00:53:37,380 --> 00:53:39,600 locations of those values. 1005 00:53:39,600 --> 00:53:43,000 It's just up to you how to use this canvas. 1006 00:53:43,000 --> 00:53:45,420 So let's do that and clear the screen and now start 1007 00:53:45,420 --> 00:53:47,160 to build a higher-level concept. 1008 00:53:47,160 --> 00:53:50,265 Not an array, but something we'll call a linked list. 1009 00:53:50,265 --> 00:53:53,330 1010 00:53:53,330 --> 00:53:55,370 Now what is a linked list? 1011 00:53:55,370 --> 00:53:59,750 A linked list is a data structure that's a higher-level concept in abstraction 1012 00:53:59,750 --> 00:54:04,820 on top of what ultimately is just chunks of memory or bytes. 1013 00:54:04,820 --> 00:54:08,840 But this linked list shall enable me to store more and more values 1014 00:54:08,840 --> 00:54:12,510 and even remove them simply by linking them together. 1015 00:54:12,510 --> 00:54:16,640 So here, let me go ahead and represent those same values starting with 4, 1016 00:54:16,640 --> 00:54:23,300 followed by 8 and 15, and then 16 and 23, and finally, 42. 1017 00:54:23,300 --> 00:54:26,840 And now eventually I'm going to want to store 50, but I've run out of room 1018 00:54:26,840 --> 00:54:32,000 but that's fine, I'm going to go ahead and write 50 wherever there's space. 1019 00:54:32,000 --> 00:54:35,900 But now let's not worry about that grid, rows, and columns of memory. 1020 00:54:35,900 --> 00:54:38,480 Let's just stipulate that yes, that's actually there, 1021 00:54:38,480 --> 00:54:41,180 but it's not useful to operate at that level. 1022 00:54:41,180 --> 00:54:45,980 Much like it's not useful to continually talk in terms of 0's and 1's. 1023 00:54:45,980 --> 00:54:50,030 So let me go ahead and wrap these values with a higher-level idea 1024 00:54:50,030 --> 00:54:52,340 called a node or just a box. 
1025 00:54:52,340 --> 00:54:56,660 And this box is going to store for us each of these values. 1026 00:54:56,660 --> 00:55:02,303 Here I have 4, here I have 8 and 15, here I have 16, 1027 00:55:02,303 --> 00:55:05,960 I have 23, and finally, 42. 1028 00:55:05,960 --> 00:55:09,290 And then when it comes time to add 50 to the mix, 1029 00:55:09,290 --> 00:55:11,150 it, too, will come in this box. 1030 00:55:11,150 --> 00:55:12,080 Now what is this box? 1031 00:55:12,080 --> 00:55:14,990 It's just an artist's rendition of the underlying bytes, 1032 00:55:14,990 --> 00:55:18,790 but now I have the ability to draw a prettier picture, if you will, 1033 00:55:18,790 --> 00:55:22,320 that somehow interlinks these boxes together. 1034 00:55:22,320 --> 00:55:24,500 Indeed, what I ultimately want to remember 1035 00:55:24,500 --> 00:55:28,760 is that 4 comes first and 42 comes last, but then wait, if I add 50, 1036 00:55:28,760 --> 00:55:30,590 it shall now come last. 1037 00:55:30,590 --> 00:55:34,760 So we could do this as an artist quite simply with those arrows pointing 1038 00:55:34,760 --> 00:55:38,840 each box to the next, implying that the next element in the list, 1039 00:55:38,840 --> 00:55:45,590 whether it's next door or far away, happens to be at the end of that arrow. 1040 00:55:45,590 --> 00:55:46,850 But what are those arrows? 1041 00:55:46,850 --> 00:55:49,820 Those are not something that you can represent in a computer 1042 00:55:49,820 --> 00:55:55,940 if at the end of the day all you have are blocks of memory and in them bytes.
1043 00:55:55,940 --> 00:55:58,190 If all you have are bytes-- and, therefore, patterns 1044 00:55:58,190 --> 00:56:00,530 of 0's and 1's, whatever you store in the computer 1045 00:56:00,530 --> 00:56:04,580 must be representable with those 0's and 1's, and among the easiest things 1046 00:56:04,580 --> 00:56:10,340 to represent, we know already, is numbers, like indexes or addresses 1047 00:56:10,340 --> 00:56:11,540 of these nodes. 1048 00:56:11,540 --> 00:56:16,550 So for instance, depending on where these nodes are in memory, 1049 00:56:16,550 --> 00:56:20,300 we can simply check that address and store it as well. 1050 00:56:20,300 --> 00:56:23,510 So for instance, if the 4 still happens to be at address 0, 1051 00:56:23,510 --> 00:56:29,060 and this time 8 is at address 4, and this one 8, and this one 12, 1052 00:56:29,060 --> 00:56:34,430 and this one 16, and this one 20-- just by chance back-to-back-to-back 4 bytes 1053 00:56:34,430 --> 00:56:35,120 apart-- 1054 00:56:35,120 --> 00:56:38,270 32 bits, well 50 might be some distance away. 1055 00:56:38,270 --> 00:56:42,710 Maybe it's actually at location 100, that's OK. 1056 00:56:42,710 --> 00:56:44,150 We can still do this. 1057 00:56:44,150 --> 00:56:48,320 Because if we use part of this node, part of each box 1058 00:56:48,320 --> 00:56:53,300 to implement those actual arrows, we can actually store all the information 1059 00:56:53,300 --> 00:56:56,520 we need to know how to get from one box to another. 1060 00:56:56,520 --> 00:56:59,890 For instance, to get from 4 to the next element, 1061 00:56:59,890 --> 00:57:04,850 you're going to want to coincidentally go to not number 4, but address 4. 1062 00:57:04,850 --> 00:57:08,510 And if you want to go from value 8 to the next value, 15, 1063 00:57:08,510 --> 00:57:11,270 you're going to want to go to address 8.
1064 00:57:11,270 --> 00:57:14,300 And if you want to go from 15 to 16, the next address 1065 00:57:14,300 --> 00:57:19,010 is going to be 12, followed by 16, followed by 20. 1066 00:57:19,010 --> 00:57:21,170 And herein lies the magic-- 1067 00:57:21,170 --> 00:57:24,410 if you want to get from 42 to that newest element that's 1068 00:57:24,410 --> 00:57:31,820 just elsewhere at address 100, that's what gets associated with 42's node. 1069 00:57:31,820 --> 00:57:33,750 As for 50, it's the dead end. 1070 00:57:33,750 --> 00:57:35,990 There's nothing more there, so we might simply 1071 00:57:35,990 --> 00:57:38,780 draw a line through that box saying, eh, just 1072 00:57:38,780 --> 00:57:44,660 store in it all 0 bits or some other convention equivalently. 1073 00:57:44,660 --> 00:57:46,670 So there's so many numbers now on the screen, 1074 00:57:46,670 --> 00:57:50,330 but to be fair, that's all that's going on inside of a computer-- 1075 00:57:50,330 --> 00:57:52,640 just storing of these bytes. 1076 00:57:52,640 --> 00:57:56,720 But now we can stipulate that, OK, I can somehow 1077 00:57:56,720 --> 00:58:03,770 store the location of each node in memory using its index or address. 1078 00:58:03,770 --> 00:58:06,890 It's just frankly not all that pleasant to stare at these values, 1079 00:58:06,890 --> 00:58:10,910 I'd much rather look at and draw the arrows graphically, 1080 00:58:10,910 --> 00:58:14,870 thereby representing the same idea of these pointers, if you will, 1081 00:58:14,870 --> 00:58:18,710 a term of art in some languages that allows me to remember 1082 00:58:18,710 --> 00:58:21,680 which element goes to which. 1083 00:58:21,680 --> 00:58:25,830 And what is the upside of all this new complexity? 1084 00:58:25,830 --> 00:58:29,810 Well now we have the ability to string together all of these nodes.
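The addresses just described can be written down in Python as a small table, with each node holding a value and the address of the next node. This is a sketch under assumptions: a dict stands in for memory, and None marks the dead end rather than all 0 bits, since address 0 is already in use by the first node in this example.

```python
# address -> (value, address of the next node)
memory = {
    0: (4, 4),
    4: (8, 8),
    8: (15, 12),
    12: (16, 16),
    16: (23, 20),
    20: (42, 100),
    100: (50, None),  # the dead end: nothing comes after 50
}


def traverse(memory, start):
    """Follow the stored addresses from the first node to the last."""
    values = []
    address = start
    while address is not None:
        value, next_address = memory[address]
        values.append(value)
        address = next_address
    return values


print(traverse(memory, 0))
```

Drawing arrows and storing these next addresses are the same idea; the arrows are just easier on the eyes.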
1085 00:58:29,810 --> 00:58:32,210 And frankly, if we wanted to remove one of these elements 1086 00:58:32,210 --> 00:58:35,150 from the list, that's fine, we can rather snip it out. 1087 00:58:35,150 --> 00:58:37,910 And we can simply update what the arrow is pointing to, 1088 00:58:37,910 --> 00:58:42,140 and equivalently, we can update the next address in that node. 1089 00:58:42,140 --> 00:58:45,620 And we can certainly add to this list by drawing more nodes here or perhaps 1090 00:58:45,620 --> 00:58:50,120 over here and just link them with arrows conceptually, or more specifically, 1091 00:58:50,120 --> 00:58:54,640 by changing that dead end to the address of the next element. 1092 00:58:54,640 --> 00:58:59,650 And so we can create the idea of the abstraction of a list using 1093 00:58:59,650 --> 00:59:02,260 just this canvas of memory. 1094 00:59:02,260 --> 00:59:04,000 But not all is good here. 1095 00:59:04,000 --> 00:59:06,040 We've surely paid a price, right? 1096 00:59:06,040 --> 00:59:10,120 Surely we couldn't get dynamism for addition and removal 1097 00:59:10,120 --> 00:59:13,840 and updating of a list without paying some price. 1098 00:59:13,840 --> 00:59:17,260 This dynamic growth, this ability to store as many more elements 1099 00:59:17,260 --> 00:59:20,890 as we want without having to tell the operating system from the get-go how 1100 00:59:20,890 --> 00:59:23,110 many elements we expect. 1101 00:59:23,110 --> 00:59:26,590 And indeed, while we're lucky at first, perhaps, 1102 00:59:26,590 --> 00:59:29,770 if we know from the get-go we need at least six values here, 1103 00:59:29,770 --> 00:59:33,100 they might be a consistent distance apart-- 1104 00:59:33,100 --> 00:59:36,250 4 bytes or 32 bits. 1105 00:59:36,250 --> 00:59:38,650 And so I could do arithmetic on some of these nodes, 1106 00:59:38,650 --> 00:59:42,670 but that is no longer, unfortunately, a guarantee of this structure. 
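Adding to and snipping from such a list might be sketched in Python like so. The class and method names (Node, LinkedList, append, remove, to_list) are assumptions chosen for illustration, and Python references play the role of the stored addresses.

```python
class Node:
    """One box: a value plus a link to the next box."""

    def __init__(self, value):
        self.value = value
        self.next = None  # None is the dead end


class LinkedList:
    def __init__(self):
        self.head = None  # all we remember is the first node

    def append(self, value):
        """Change the dead end to point at a new node."""
        node = Node(value)
        if self.head is None:
            self.head = node
            return
        current = self.head
        while current.next is not None:
            current = current.next
        current.next = node

    def remove(self, value):
        """Snip out the first node holding value; return whether one was found."""
        previous, current = None, self.head
        while current is not None and current.value != value:
            previous, current = current, current.next
        if current is None:
            return False
        if previous is None:
            self.head = current.next
        else:
            previous.next = current.next  # update what the arrow points to
        return True

    def to_list(self):
        values, current = [], self.head
        while current is not None:
            values.append(current.value)
            current = current.next
        return values


numbers = LinkedList()
for n in [4, 8, 15, 16, 23, 42]:
    numbers.append(n)
numbers.append(50)
numbers.remove(15)
print(numbers.to_list())
```

Notice that append has to walk from the first node to the last, which hints at the price paid for this flexibility.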
1107 00:59:42,670 --> 00:59:47,230 Whereas arrays do guarantee you random access, linked lists do not. 1108 00:59:47,230 --> 00:59:52,900 And linked lists instead require that you traverse them in linear time 1109 00:59:52,900 --> 00:59:56,470 from the first element potentially all the way to the last. 1110 00:59:56,470 --> 00:59:58,600 There is no way to jump to the middle element, 1111 00:59:58,600 --> 01:00:04,140 because frankly, if I do that math as before, 100 bytes away is the last, 1112 01:00:04,140 --> 01:00:08,950 so 100 divided by 2 is 50-- rounding down, keeping me at 50, 1113 01:00:08,950 --> 01:00:12,590 puts me somewhere over here, and that's not right. 1114 01:00:12,590 --> 01:00:14,920 The middle element is earlier, but that's 1115 01:00:14,920 --> 01:00:19,450 because there's now no support for random access or instant arithmetic 1116 01:00:19,450 --> 01:00:24,250 access to elements like the first, last, or middle. 1117 01:00:24,250 --> 01:00:27,430 All we'll remember now for the linked list is that first element, 1118 01:00:27,430 --> 01:00:31,360 and from there, we have to follow all of those breadcrumbs. 1119 01:00:31,360 --> 01:00:34,100 So that might be too high of a price to pay. 1120 01:00:34,100 --> 01:00:36,760 And moreover, there's overhead now, because I'm not storing 1121 01:00:36,760 --> 01:00:39,280 for every node one value, but two-- 1122 01:00:39,280 --> 01:00:43,450 the value or data I care about, and the address or metadata that lets me 1123 01:00:43,450 --> 01:00:45,700 get to the next node. 1124 01:00:45,700 --> 01:00:48,190 So I'm using twice as much space there, say, 1125 01:00:48,190 --> 01:00:50,560 at least when storing numbers, but at least 1126 01:00:50,560 --> 01:00:53,500 I'm getting that dynamic support for growth. 1127 01:00:53,500 --> 01:00:59,710 So again, it depends on that trade-off and what is less costly to you. 1128 01:00:59,710 --> 01:01:00,490 But never fear.
1129 01:01:00,490 --> 01:01:02,650 This is just another problem to solve. 1130 01:01:02,650 --> 01:01:06,220 To be clear, we'd like to retain the dynamism that something 1131 01:01:06,220 --> 01:01:10,300 like a linked list offers-- the ability to grow and even shrink that data 1132 01:01:10,300 --> 01:01:14,770 structure over time without having to decide a priori just how much memory we 1133 01:01:14,770 --> 01:01:15,400 want. 1134 01:01:15,400 --> 01:01:19,810 But at the moment we've lost the ability to search it quickly, as with something 1135 01:01:19,810 --> 01:01:21,100 like binary search. 1136 01:01:21,100 --> 01:01:25,570 So wouldn't it be nice if we could get both properties together? 1137 01:01:25,570 --> 01:01:29,320 The ability to grow and shrink as well as to search fast? 1138 01:01:29,320 --> 01:01:31,480 Well, I daresay we can if we're just a bit more 1139 01:01:31,480 --> 01:01:34,300 clever about how we draw on our canvas. 1140 01:01:34,300 --> 01:01:36,310 Again, let's stipulate that we can certainly 1141 01:01:36,310 --> 01:01:40,180 store values anywhere in memory and somehow stitch them together 1142 01:01:40,180 --> 01:01:41,660 using addresses. 1143 01:01:41,660 --> 01:01:43,720 Now those addresses, otherwise known as pointers, 1144 01:01:43,720 --> 01:01:48,230 we no longer need draw, because frankly, they're just now a distraction. 1145 01:01:48,230 --> 01:01:52,450 It suffices to know we can draw them pictorially as with some arrows, 1146 01:01:52,450 --> 01:01:54,010 so let's do just that. 1147 01:01:54,010 --> 01:01:56,470 Let me go ahead now and draw those values, 1148 01:01:56,470 --> 01:02:02,680 say 16 up here followed by my 8 and 15, as well as my 4. 1149 01:02:02,680 --> 01:02:07,330 Over here will I draw that 42 and my 23, 1150 01:02:07,330 --> 01:02:10,450 and now it remains for me to somehow link these together.
1151 01:02:10,450 --> 01:02:13,870 Since I don't need to leave room for those actual addresses, 1152 01:02:13,870 --> 01:02:16,420 it suffices now to just draw arrows. 1153 01:02:16,420 --> 01:02:22,630 I'll go ahead and draw just a box around 16 and 8, as well as my 4 and my 15, 1154 01:02:22,630 --> 01:02:26,330 as well as my 23 and my 42. 1155 01:02:26,330 --> 01:02:28,060 Now how should I go about linking them? 1156 01:02:28,060 --> 01:02:31,600 Well let me propose that we no longer link just from left to right, 1157 01:02:31,600 --> 01:02:37,690 but rather assemble more of a hierarchy here with 16 pointing at 8, 1158 01:02:37,690 --> 01:02:40,960 and 16 also pointing at 42. 1159 01:02:40,960 --> 01:02:48,160 And 42, meanwhile, pointing at 23 with 8 pointing at 4 as well as 15. 1160 01:02:48,160 --> 01:02:51,190 Now why have I done it this way? 1161 01:02:51,190 --> 01:02:54,430 Well by including these arrows sometimes bidirectionally, 1162 01:02:54,430 --> 01:02:57,100 have I stitched together a two-dimensional data 1163 01:02:57,100 --> 01:02:58,300 structure, if you will. 1164 01:02:58,300 --> 01:03:01,720 Now this again surely could be mapped to that lower level of memory 1165 01:03:01,720 --> 01:03:06,040 just by jotting down the addresses that each of these arrows represents, 1166 01:03:06,040 --> 01:03:08,470 but I like thinking at this level of abstraction 1167 01:03:08,470 --> 01:03:12,130 because I now can think in more sophisticated form about how 1168 01:03:12,130 --> 01:03:14,550 I might lay out my data. 1169 01:03:14,550 --> 01:03:17,740 So what properties do I now get from this structure? 1170 01:03:17,740 --> 01:03:19,870 Well, dynamism was the first goal at hand, 1171 01:03:19,870 --> 01:03:21,970 and how might I go about adding a new value? 1172 01:03:21,970 --> 01:03:25,370 Say it's 50 that I'd like to add to this structure.
1173 01:03:25,370 --> 01:03:28,090 Well, if I look at the top here, 16, it's already 1174 01:03:28,090 --> 01:03:33,430 got two arrows, so it's full, but I know 50 is bigger than 16, 1175 01:03:33,430 --> 01:03:36,310 so let's start to apply that dynamic and say 50 1176 01:03:36,310 --> 01:03:39,190 shall definitely go down to the right. 1177 01:03:39,190 --> 01:03:43,870 Unfortunately, 42 already has one arrow off it, but there is room for more, 1178 01:03:43,870 --> 01:03:48,400 and it turns out that 50 is, in fact, greater than 42. 1179 01:03:48,400 --> 01:03:49,330 So you know what? 1180 01:03:49,330 --> 01:03:55,800 I'm just going to slot 50 right there and draw 42's second arrow to 50. 1181 01:03:55,800 --> 01:03:58,720 And what picture seems to be emerging here? 1182 01:03:58,720 --> 01:04:02,130 It's perhaps reminiscent of a family tree of sorts. 1183 01:04:02,130 --> 01:04:06,390 Indeed, with parents and children, or a tree more generally with roots. 1184 01:04:06,390 --> 01:04:08,940 Now whereas in our human world, trees tend to grow up, 1185 01:04:08,940 --> 01:04:12,170 these trees in computer science tend to grow down. 1186 01:04:12,170 --> 01:04:15,480 But henceforth, let's call this 16 our root, 1187 01:04:15,480 --> 01:04:19,140 and to its left is its left child, to its right is its right child, or more 1188 01:04:19,140 --> 01:04:22,740 generally, a whole left subtree and a whole right subtree. 1189 01:04:22,740 --> 01:04:26,040 Because indeed, starting at 42, we have another tree of sorts. 1190 01:04:26,040 --> 01:04:31,740 Rooted at 42 is a child called 23, and another child called 50. 1191 01:04:31,740 --> 01:04:35,310 So in this case, if each of the nodes in our structure, 1192 01:04:35,310 --> 01:04:41,520 otherwise known in computer science as a tree, has zero, one, or two children, 1193 01:04:41,520 --> 01:04:44,110 you can create the second dimension,
1194 01:04:44,110 --> 01:04:46,140 and you can preserve not only the ability 1195 01:04:46,140 --> 01:04:49,320 to add data dynamically like 50, but, but, 1196 01:04:49,320 --> 01:04:53,912 but, we also now gain back that ability to search. 1197 01:04:53,912 --> 01:04:55,620 After all, if I'm asked now the question, 1198 01:04:55,620 --> 01:04:57,900 is the number 15 in this structure? 1199 01:04:57,900 --> 01:04:59,190 Well let me check for you. 1200 01:04:59,190 --> 01:05:02,610 Starting at 16, which is where this structure begins, just like a linked 1201 01:05:02,610 --> 01:05:05,790 list starts conceptually at the left, I'll 1202 01:05:05,790 --> 01:05:08,940 check if 16 is the value you want-- it's not, it's too big, 1203 01:05:08,940 --> 01:05:13,000 but I do know that 15, if it's here, it's to the left. 1204 01:05:13,000 --> 01:05:15,450 Now 8, of course, is not the value you want either, 1205 01:05:15,450 --> 01:05:19,620 but 8 is smaller than 15, so I'll now go to the right. 1206 01:05:19,620 --> 01:05:22,470 And indeed, sure enough, I now find 15. 1207 01:05:22,470 --> 01:05:26,610 And it only took me one, two steps, not n to find it, 1208 01:05:26,610 --> 01:05:31,710 because through this second dimension am I able to lift up some of those nodes 1209 01:05:31,710 --> 01:05:34,890 rather than draw them just down as a straight line, 1210 01:05:34,890 --> 01:05:37,320 or in the linked list, all the way from left to right. 1211 01:05:37,320 --> 01:05:41,550 With the second dimension can I now organize things more tightly. 1212 01:05:41,550 --> 01:05:44,230 And notice the key characteristics of this tree. 1213 01:05:44,230 --> 01:05:48,210 It is what's generally known, indeed, as a binary search tree. 1214 01:05:48,210 --> 01:05:51,090 Not only because it's a tree that lends itself to search, 1215 01:05:51,090 --> 01:05:57,500 but also because each of the nodes has no more than two or bi-children-- 1216 01:05:57,500 --> 01:05:58,920 zero, one, or two.
1217 01:05:58,920 --> 01:06:02,430 And notice that to the left of the 16 is not only the value 1218 01:06:02,430 --> 01:06:07,710 8, but every number that can be reached to the left of 16 happens to be, 1219 01:06:07,710 --> 01:06:10,350 by design, less than 16. 1220 01:06:10,350 --> 01:06:12,000 And that's how we found 15. 1221 01:06:12,000 --> 01:06:17,280 Moreover, to the right of 16, every value is greater than 16, 1222 01:06:17,280 --> 01:06:18,720 just as we have here. 1223 01:06:18,720 --> 01:06:22,140 And that definition can be applied so-called recursively. 1224 01:06:22,140 --> 01:06:25,800 You can make that claim about every node in this tree at any level, 1225 01:06:25,800 --> 01:06:30,810 because here, 42, every node to its left, albeit just one, is less. 1226 01:06:30,810 --> 01:06:34,800 Every node to its right, albeit one, is indeed more. 1227 01:06:34,800 --> 01:06:39,840 So so long as you bring to bear to our data the same sort of intuition 1228 01:06:39,840 --> 01:06:43,590 we brought to our phone book can we achieve these same properties 1229 01:06:43,590 --> 01:06:47,340 and goals, this efficiency of logarithmic time. 1230 01:06:47,340 --> 01:06:52,050 Log base 2 of n is indeed how long it might take us, big O of that 1231 01:06:52,050 --> 01:06:54,540 to find or insert some value. 1232 01:06:54,540 --> 01:06:57,600 Now to be fair, there are some prices paid here. 1233 01:06:57,600 --> 01:07:00,120 If I'm not careful, a data structure like this 1234 01:07:00,120 --> 01:07:02,790 could actually devolve into a linked list 1235 01:07:02,790 --> 01:07:05,790 if I just keep adding, by coincidence or intent, 1236 01:07:05,790 --> 01:07:08,620 bigger and bigger numbers. 1237 01:07:08,620 --> 01:07:12,060 They might just so happen to get long and long and long and stringy 1238 01:07:12,060 --> 01:07:16,050 unless we're smart about how we rebalance the tree occasionally.
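To make the binary search tree concrete, here is a minimal Python sketch under stated assumptions: the names TreeNode, insert, search and height are hypothetical, duplicates are simply ignored, and no rebalancing is attempted. It builds the same tree drawn above by inserting 16, 8, 42, 4, 15, 23 and 50 in that order, then counts how many arrows a search has to follow.

```python
class TreeNode:
    """A node with a value and up to two children."""

    def __init__(self, value):
        self.value = value
        self.left = None   # subtree of smaller values
        self.right = None  # subtree of larger values


def insert(root, value):
    """Insert value below root, smaller to the left and larger to the right."""
    if root is None:
        return TreeNode(value)
    if value < root.value:
        root.left = insert(root.left, value)
    elif value > root.value:
        root.right = insert(root.right, value)
    return root  # equal values are ignored in this sketch


def search(root, value):
    """Return (found, steps), where steps counts the arrows followed."""
    steps = 0
    node = root
    while node is not None:
        if value == node.value:
            return True, steps
        node = node.left if value < node.value else node.right
        steps += 1
    return False, steps


def height(root):
    """Number of nodes on the longest path from root down to a leaf."""
    if root is None:
        return 0
    return 1 + max(height(root.left), height(root.right))


tree = None
for v in [16, 8, 42, 4, 15, 23, 50]:
    tree = insert(tree, v)

stringy = None
for v in [1, 2, 3, 4, 5, 6, 7]:  # ever bigger numbers, never rebalanced
    stringy = insert(stringy, v)
```

With the first insertion order the seven values fit in three levels, and finding 15 takes two steps from the root. Inserting ever bigger numbers instead produces one long right-leaning chain, which is the linked-list-like worst case mentioned above.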
1239 01:07:16,050 --> 01:07:18,420 And indeed, there are other forms of these trees that 1240 01:07:18,420 --> 01:07:22,560 are smart, and with more code, will rebalance themselves to make sure 1241 01:07:22,560 --> 01:07:26,910 that they don't get long and stringy, but stay as high up as possible. 1242 01:07:26,910 --> 01:07:30,420 But there's another price paid beyond that potential gotcha-- 1243 01:07:30,420 --> 01:07:31,800 more space. 1244 01:07:31,800 --> 01:07:36,930 Whereas my array used no arrows whatsoever and thus no extra space, 1245 01:07:36,930 --> 01:07:41,910 my linked list did use one extra chunk of space for each node-- 1246 01:07:41,910 --> 01:07:45,240 storage for that pointer or address of its neighbor. 1247 01:07:45,240 --> 01:07:48,180 But in a tree structure, if you're storing multiple children, 1248 01:07:48,180 --> 01:07:52,410 you're using as many as two additional chunks of memory 1249 01:07:52,410 --> 01:07:55,330 to store as many as two of those arrows. 1250 01:07:55,330 --> 01:07:58,110 And so with a tree structure are you spending more space, 1251 01:07:58,110 --> 01:08:00,640 but potentially it's saving you time. 1252 01:08:00,640 --> 01:08:02,610 So again, we see this theme of trade-offs, 1253 01:08:02,610 --> 01:08:06,240 whereby if you really want less time to be spent, 1254 01:08:06,240 --> 01:08:10,530 you're going to have to spend more of that space. 1255 01:08:10,530 --> 01:08:12,660 Now can we do even better? 1256 01:08:12,660 --> 01:08:15,720 With an array, we had instant access to data, 1257 01:08:15,720 --> 01:08:18,180 but we painted ourselves into that corner. 1258 01:08:18,180 --> 01:08:21,069 With a linked list did we solve that particular problem, 1259 01:08:21,069 --> 01:08:24,240 but we gave up the ability to jump right where we want.
1260 01:08:24,240 --> 01:08:27,000 But with trees, particularly binary search trees, 1261 01:08:27,000 --> 01:08:32,399 can we rearrange our data intelligently and regain that logarithmic time. 1262 01:08:32,399 --> 01:08:35,340 But wouldn't it be nice if we could achieve even better, say, 1263 01:08:35,340 --> 01:08:40,590 constant time searches of data and insertions thereof? 1264 01:08:40,590 --> 01:08:43,830 Well for that, perhaps we could amalgamate some of the ideas 1265 01:08:43,830 --> 01:08:47,970 we've seen thus far into just one especially clever structure. 1266 01:08:47,970 --> 01:08:51,569 And let's call that particular structure a hash table. 1267 01:08:51,569 --> 01:08:55,890 And indeed, this is perhaps, in theory, the holy grail of data structures, 1268 01:08:55,890 --> 01:09:01,380 insofar as you can store anything in it in ideally constant time. 1269 01:09:01,380 --> 01:09:02,630 But how best to do this? 1270 01:09:02,630 --> 01:09:06,120 Well let's begin by drawing ourselves an array. 1271 01:09:06,120 --> 01:09:08,939 And that array this time I'll draw vertically simply 1272 01:09:08,939 --> 01:09:12,810 to leave ourselves a bit more room for something clever. 1273 01:09:12,810 --> 01:09:16,830 This array, as always, can be indexed into by way of these locations 1274 01:09:16,830 --> 01:09:20,970 here where this might be location 0 and 1, 2, and 3, 1275 01:09:20,970 --> 01:09:23,830 followed by any number of others. 1276 01:09:23,830 --> 01:09:25,770 Now how do I want to use this array? 1277 01:09:25,770 --> 01:09:29,130 Well suppose that I want to store names and not numbers.
1278 01:09:29,130 --> 01:09:32,229 Those names, of course, could just be inserted in any old location, 1279 01:09:32,229 --> 01:09:34,430 but if unsorted, we already know we're going 1280 01:09:34,430 --> 01:09:37,350 to suffer as much as big O of n time-- 1281 01:09:37,350 --> 01:09:40,380 linear time with which to find a particular name in that array 1282 01:09:40,380 --> 01:09:43,957 if you know nothing a priori about the order. 1283 01:09:43,957 --> 01:09:47,040 Well we know already, too, we could do better just like the phone company, 1284 01:09:47,040 --> 01:09:50,100 and if we sort the names we're putting into this structure, 1285 01:09:50,100 --> 01:09:53,490 we can at least then do binary search and whittle that search time down 1286 01:09:53,490 --> 01:09:56,370 to log base 2 of n. 1287 01:09:56,370 --> 01:09:59,160 But wouldn't it be nice if we can whittle that down further 1288 01:09:59,160 --> 01:10:04,290 and get to any name we want in nearly constant time-- one step, maybe two 1289 01:10:04,290 --> 01:10:05,730 or a few? 1290 01:10:05,730 --> 01:10:10,440 Well with a hash table can you approximately or ideally do that, 1291 01:10:10,440 --> 01:10:14,640 so long as we decide in advance how to hash those strings. 1292 01:10:14,640 --> 01:10:18,680 In other words, those strings of characters, here called names, 1293 01:10:18,680 --> 01:10:23,640 they have letters inside of them, say D-A-V-I-D for my own. 1294 01:10:23,640 --> 01:10:26,580 Well what if we looked at not the whole name, 1295 01:10:26,580 --> 01:10:29,400 but that first letter, which is, of course, constant time 1296 01:10:29,400 --> 01:10:31,090 to just look at one value. 1297 01:10:31,090 --> 01:10:35,520 And so if D is the fourth letter in the English alphabet, what if I store 1298 01:10:35,520 --> 01:10:36,270 DAVID-- 1299 01:10:36,270 --> 01:10:40,620 or really, any D name at the fourth index in my array, 1300 01:10:40,620 --> 01:10:43,960 location 3 if you start counting at 0? 
1301 01:10:43,960 --> 01:10:47,730 So here might be the A names, and here the B names, and here the C names, 1302 01:10:47,730 --> 01:10:53,160 and someone like David now belongs in this bucket, if you will. 1303 01:10:53,160 --> 01:10:57,210 Now suppose I want to store other names in this structure. 1304 01:10:57,210 --> 01:11:02,400 Well Alice belongs at location 0, and Bob, for instance, location 1. 1305 01:11:02,400 --> 01:11:05,310 And we can continue this logic and can continue 1306 01:11:05,310 --> 01:11:09,690 to insert more and more names so long as we hash those names 1307 01:11:09,690 --> 01:11:12,240 and jump right to the right location. 1308 01:11:12,240 --> 01:11:15,360 After all, I can in one step look at A or B or D 1309 01:11:15,360 --> 01:11:18,840 and instantly know 0 or 1 or 3. 1310 01:11:18,840 --> 01:11:19,560 How? 1311 01:11:19,560 --> 01:11:22,560 Well recall that in a computer you have ASCII or Unicode. 1312 01:11:22,560 --> 01:11:26,190 And we already have numbers predetermined to map 1313 01:11:26,190 --> 01:11:27,820 to those same characters. 1314 01:11:27,820 --> 01:11:32,310 Now to be fair, A I'm pretty sure it was 65 in ASCII, 1315 01:11:32,310 --> 01:11:37,020 but we could certainly subtract 65 from 65 to get 0. 1316 01:11:37,020 --> 01:11:43,230 And if capital B was 66, we could certainly subtract 65 from 66 to get 1. 1317 01:11:43,230 --> 01:11:48,480 So we can look, then, at the first letter of any name, convert it to ASCII 1318 01:11:48,480 --> 01:11:51,810 and subtract quite simply 65 if it's capital, 1319 01:11:51,810 --> 01:11:54,810 and get precisely to the index we want. 1320 01:11:54,810 --> 01:11:57,850 So to be fair, that's not one, but it is two or three steps, 1321 01:11:57,850 --> 01:12:01,410 but that is a constant number of steps again and again independent 1322 01:12:01,410 --> 01:12:04,380 of n, the total number of names. 
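That first-letter arithmetic can be sketched in Python as follows. The function name hash_name is hypothetical, and the sketch assumes every name starts with an English letter; uppercasing the first letter before subtracting is an added convenience so that lowercase names land in the same buckets.

```python
def hash_name(name):
    """Map a name to an index 0 through 25 using only its first letter."""
    first = name[0].upper()        # assumes the name starts with A-Z or a-z
    return ord(first) - ord("A")   # ord("A") is 65, so "A" maps to 0
```

However many names are stored, the function does the same constant amount of work: look at one character, convert it to a number, subtract.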
1323 01:12:04,380 --> 01:12:08,220 Now what's nice about this is that we have a data structure into which we 1324 01:12:08,220 --> 01:12:13,530 can insert names instantly by hashing them and getting as output 1325 01:12:13,530 --> 01:12:19,470 that number or index 0 through 25, in the case of an English alphabet. 1326 01:12:19,470 --> 01:12:22,690 But what problem might arise? 1327 01:12:22,690 --> 01:12:25,740 The catch, though, is that if we have someone else, like Doug, 1328 01:12:25,740 --> 01:12:28,290 whose name happens to start with the same letter, 1329 01:12:28,290 --> 01:12:31,890 unfortunately there seems to be no room at this moment for Doug 1330 01:12:31,890 --> 01:12:33,510 since I'm already there. 1331 01:12:33,510 --> 01:12:37,140 But there we can draw inspiration from other data structures still. 1332 01:12:37,140 --> 01:12:40,770 We could maybe not just put David in this array, 1333 01:12:40,770 --> 01:12:45,270 but not even treat this array as the entire data structure, 1334 01:12:45,270 --> 01:12:47,590 but really the beginning of another. 1335 01:12:47,590 --> 01:12:52,230 In fact, let me go ahead and put David in his or my own box 1336 01:12:52,230 --> 01:12:54,690 and give Doug his own as well. 1337 01:12:54,690 --> 01:12:58,710 Now Doug and I are really just nodes in a structure. 1338 01:12:58,710 --> 01:13:03,480 And we can use this array still to get to the right nodes of interest, 1339 01:13:03,480 --> 01:13:07,290 but now we can use arrows to stitch them together. 1340 01:13:07,290 --> 01:13:10,760 If I have multiple names, each of which starts with a D, 1341 01:13:10,760 --> 01:13:12,740 I just need to remember to link those together, 1342 01:13:12,740 --> 01:13:15,650 thereby allowing myself to have any number of names 1343 01:13:15,650 --> 01:13:18,800 that start with that same letter, treating that list really 1344 01:13:18,800 --> 01:13:20,240 as a linked list.
1345 01:13:20,240 --> 01:13:24,680 But I get to that linked list instantly by looking at that first letter 1346 01:13:24,680 --> 01:13:27,710 and jumping here to the right location. 1347 01:13:27,710 --> 01:13:33,540 And so here I get both dynamic growth and instant access to that list, 1348 01:13:33,540 --> 01:13:37,140 thereby decreasing significantly the amount of time 1349 01:13:37,140 --> 01:13:40,950 it takes me to find someone, maybe to 1/26 of the time. 1350 01:13:40,950 --> 01:13:43,020 Now to be fair, wait a minute, we're already 1351 01:13:43,020 --> 01:13:47,160 seeing collisions, so to speak, whereby I have multiple inputs hashing 1352 01:13:47,160 --> 01:13:48,600 to the same output-- 1353 01:13:48,600 --> 01:13:50,460 three in this instance. 1354 01:13:50,460 --> 01:13:52,950 And in the worst case, perhaps everyone in the room 1355 01:13:52,950 --> 01:13:55,710 all has a name that starts with D, which means really, 1356 01:13:55,710 --> 01:13:58,240 you don't have a hash table or array at all, 1357 01:13:58,240 --> 01:14:02,970 you just have one really long linked list, and thus, linear. 1358 01:14:02,970 --> 01:14:07,180 But that would be considered a more perverse scenario, which you should try 1359 01:14:07,180 --> 01:14:09,570 to avoid by way of that hash function. 1360 01:14:09,570 --> 01:14:13,770 If that is the problem you're facing, then your hash function is just bad. 1361 01:14:13,770 --> 01:14:16,100 You should not have looked only in that case 1362 01:14:16,100 --> 01:14:18,270 at just the first letter of every name.
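The array-of-lists idea can be sketched in Python like this. It is only an illustration: the class name HashTable and its methods are hypothetical, names are assumed to start with an English letter, and each bucket is an ordinary Python list standing in for the linked list of nodes drawn above.

```python
class HashTable:
    """An array of 26 buckets, each holding a chain of colliding names."""

    def __init__(self, size=26):
        # One initially empty chain per index of the array.
        self.buckets = [[] for _ in range(size)]

    def _index(self, name):
        # Hash on the first letter only, as described above.
        return (ord(name[0].upper()) - ord("A")) % len(self.buckets)

    def insert(self, name):
        chain = self.buckets[self._index(name)]  # jump straight to the bucket
        if name not in chain:                    # then walk that one chain
            chain.append(name)

    def contains(self, name):
        return name in self.buckets[self._index(name)]


table = HashTable()
for person in ["David", "Doug", "Alice", "Bob"]:
    table.insert(person)
```

Looking up Doug touches only the D chain, not Alice or Bob. If every name started with D, though, that single chain would hold everything and a lookup would be linear again, which is the worst case described above.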
1363 01:14:18,270 --> 01:14:21,030 Perhaps you should have looked at the first two letters 1364 01:14:21,030 --> 01:14:26,510 back-to-back, and put anyone's name that starts with D-A in one list; 1365 01:14:26,510 --> 01:14:32,310 and D-B, if there is any, in a second list; and D-C, if there's any of those, 1366 01:14:32,310 --> 01:14:36,930 in some third list altogether; and D-D and D-E and D-F 1367 01:14:36,930 --> 01:14:42,870 and so forth, and actually have multiple combinations of every two letters, 1368 01:14:42,870 --> 01:14:47,790 and have as many buckets, so to speak, as many indexes in your array 1369 01:14:47,790 --> 01:14:51,840 as there are pairs of two alphabetical letters. 1370 01:14:51,840 --> 01:14:53,730 Now to be fair, you might have two people 1371 01:14:53,730 --> 01:14:58,770 whose names start with D-A or D-O, but hopefully there's even fewer. 1372 01:14:58,770 --> 01:15:00,510 And indeed, I say a hash table-- 1373 01:15:00,510 --> 01:15:04,440 this whole structure approximates the idea of constant time 1374 01:15:04,440 --> 01:15:10,680 because it can devolve in places to linear time with longer lists of names. 1375 01:15:10,680 --> 01:15:14,520 But if your hash function is good and you don't have these collisions, 1376 01:15:14,520 --> 01:15:19,650 and therefore ideally you don't have any linked lists, just names, then 1377 01:15:19,650 --> 01:15:23,010 you indeed have a structure that gives you constant time access, 1378 01:15:23,010 --> 01:15:25,710 ultimately, combining all of these underlying 1379 01:15:25,710 --> 01:15:29,250 principles of dynamic growth and random access 1380 01:15:29,250 --> 01:15:34,680 to achieve ultimately the storage of all your values. 1381 01:15:34,680 --> 01:15:39,150 How, then, might a language like Python implement data types like int and str? 1382 01:15:39,150 --> 01:15:41,130 Well in the case of Python's latest version, 1383 01:15:41,130 --> 01:15:45,040 it allows ints to grow as big as you need them to be. 
1384 01:15:45,040 --> 01:15:49,530 And so it surely can't only be using contiguous memory once allocated 1385 01:15:49,530 --> 01:15:51,190 that stays in the same place. 1386 01:15:51,190 --> 01:15:54,590 If instead you want a number to grow over time, 1387 01:15:54,590 --> 01:15:58,890 well you're probably going to need to allocate some variable number of bytes 1388 01:15:58,890 --> 01:15:59,850 in that memory. 1389 01:15:59,850 --> 01:16:01,020 Strings, too, as well. 1390 01:16:01,020 --> 01:16:04,540 If you want to allocate strings, you're going to need to allow them to grow, 1391 01:16:04,540 --> 01:16:07,080 which means finding extra space in proximity 1392 01:16:07,080 --> 01:16:10,890 to the characters you already have, or maybe relocating the whole structure 1393 01:16:10,890 --> 01:16:13,000 so that that value can keep growing. 1394 01:16:13,000 --> 01:16:15,960 But we know now, we can do this with our canvas of memory. 1395 01:16:15,960 --> 01:16:19,530 How the particular language does it isn't even necessarily of interest, 1396 01:16:19,530 --> 01:16:23,850 we just know that it can, and even underneath the hood, how 1397 01:16:23,850 --> 01:16:25,020 it might do so. 1398 01:16:25,020 --> 01:16:29,040 As for these other structures in Python like dict or dictionary and list, 1399 01:16:29,040 --> 01:16:32,760 well those, too, are exactly what we've seen here. 1400 01:16:32,760 --> 01:16:36,960 A dictionary in Python is really just a hash table, some sort of variable 1401 01:16:36,960 --> 01:16:40,770 that has indexes that are not necessarily numbers, but words, 1402 01:16:40,770 --> 01:16:43,950 and via those words can you get back a value. 1403 01:16:43,950 --> 01:16:47,580 Indeed, more generally does a hash table have keys and values. 1404 01:16:47,580 --> 01:16:51,100 The keys are the inputs via which you produce those outputs. 1405 01:16:51,100 --> 01:16:54,930 So in our data structure, the inputs might have been names.
1406 01:16:54,930 --> 01:16:59,710 The output of my hash function was an index value like some number. 1407 01:16:59,710 --> 01:17:03,390 And in Python do you have a wonderful abstraction in code that 1408 01:17:03,390 --> 01:17:06,540 allows you to express that idea of associating keys 1409 01:17:06,540 --> 01:17:10,200 with values, names with yes or no, true or false 1410 01:17:10,200 --> 01:17:14,520 they are present so that you can ask those questions yourself in your code. 1411 01:17:14,520 --> 01:17:16,740 And as for list, it's quite simply that. 1412 01:17:16,740 --> 01:17:19,440 It's the idea of an array but with that added dynamism, 1413 01:17:19,440 --> 01:17:22,390 and as such, a linked list of sorts. 1414 01:17:22,390 --> 01:17:27,390 And so now at this higher level of code can you not only think computationally, 1415 01:17:27,390 --> 01:17:31,050 but express yourself computationally knowing and trusting 1416 01:17:31,050 --> 01:17:32,970 that the computer can do that bidding. 1417 01:17:32,970 --> 01:17:35,250 How the data structures are organized really 1418 01:17:35,250 --> 01:17:37,950 is the secret sauce of these languages and tools, 1419 01:17:37,950 --> 01:17:42,240 and indeed, when you have some database or backend system, too, 1420 01:17:42,240 --> 01:17:45,360 the intellectual property that underlies those systems 1421 01:17:45,360 --> 01:17:47,820 ultimately boils down not only to the algorithms 1422 01:17:47,820 --> 01:17:50,100 in use, but also the data structures. 1423 01:17:50,100 --> 01:17:53,340 Because together, they-- and we've seen this-- 1424 01:17:53,340 --> 01:17:57,660 together combine to produce not only the correctness of answers you want, 1425 01:17:57,660 --> 01:18:01,820 but the efficiency with which you can get to those answers. 1426 01:18:01,820 --> 01:18:06,064
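As a small illustration of those built-in abstractions, the following Python sketch uses a dict to associate names with true or false, a list that grows on demand, and an int that grows past any fixed number of bits. The variable names are made up for the example, and nothing here is meant to show how Python implements these types underneath the hood.

```python
# A dict associates keys (names) with values (present or not).
present = {"Alice": True, "Bob": True, "David": True}
is_david_present = "David" in present        # ask by key, not by position
is_doug_present = present.get("Doug", False)  # a default for missing keys

# A list grows on demand; no size is declared up front.
names = ["Alice", "Bob"]
names.append("David")

# An int grows as large as needed rather than overflowing.
big = 2 ** 100
```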