1 00:00:00,000 --> 00:01:17,345 [MUSIC PLAYING] 2 00:01:17,345 --> 00:01:22,170 DAVID J. MALAN: This is CS50, and this is already week three. 3 00:01:22,170 --> 00:01:25,695 And even as we've gotten much more into the minutia of programming 4 00:01:25,695 --> 00:01:27,570 and some of the C stuff that we've been doing 5 00:01:27,570 --> 00:01:30,840 is all the more cryptic looking, recall that at the end of the day, 6 00:01:30,840 --> 00:01:34,647 like, everything we've been doing ultimately fits into to this model. 7 00:01:34,647 --> 00:01:36,480 So keep that in mind, particularly as things 8 00:01:36,480 --> 00:01:39,230 seem like they're getting more complicated and more sophisticated. 9 00:01:39,230 --> 00:01:41,850 It's just a process of learning a new language that ultimately 10 00:01:41,850 --> 00:01:44,430 lets us express this process. 11 00:01:44,430 --> 00:01:47,490 And of course, last week we really went into the weeds of like how 12 00:01:47,490 --> 00:01:49,240 inputs and outputs are represented. 13 00:01:49,240 --> 00:01:53,580 And this thing here, a photograph thereof, is called what? 14 00:01:53,580 --> 00:01:54,742 This is what? 15 00:01:54,742 --> 00:01:55,628 AUDIENCE: RAM. 16 00:01:55,628 --> 00:01:56,670 DAVID J. MALAN: RAM, I heard-- 17 00:01:56,670 --> 00:01:59,670 Random Access Memory or just generally known as memory. 18 00:01:59,670 --> 00:02:02,280 And recall that we looked at one of these little black chips 19 00:02:02,280 --> 00:02:04,500 that contains all of the bytes-- 20 00:02:04,500 --> 00:02:05,790 all of the bits, ultimately. 21 00:02:05,790 --> 00:02:08,400 It's just kind of a grid, sort of an artist grid, that 22 00:02:08,400 --> 00:02:11,520 allows us to think about every one of these memory locations 23 00:02:11,520 --> 00:02:14,040 as just having a number or an address, so to speak. 24 00:02:14,040 --> 00:02:16,770 Like, this might be byte number 0 and then 1 and then 2 25 00:02:16,770 --> 00:02:19,680 and then, maybe way down here again, something like 2 billion 26 00:02:19,680 --> 00:02:22,080 if you have 2 gigabytes of memory. 27 00:02:22,080 --> 00:02:26,070 And so as we did that, we started to explore how we could use this canvas 28 00:02:26,070 --> 00:02:29,760 to create kind of our own information, our own inputs and outputs, not 29 00:02:29,760 --> 00:02:32,380 just the basics like ints and floats and so forth. 30 00:02:32,380 --> 00:02:34,470 But we also talked about strings. 31 00:02:34,470 --> 00:02:37,080 And what is a string as you now know it? 32 00:02:37,080 --> 00:02:40,200 How would you describe in layperson's terms a string? 33 00:02:40,200 --> 00:02:41,010 Yeah, over there. 34 00:02:41,010 --> 00:02:41,990 AUDIENCE: I was gonna say-- 35 00:02:41,990 --> 00:02:42,480 [AUDIO OUT] 36 00:02:42,480 --> 00:02:43,897 DAVID J. MALAN: An array of characters. 37 00:02:43,897 --> 00:02:45,690 And an array, meanwhile-- let's go there. 38 00:02:45,690 --> 00:02:50,580 How might someone else define an array in more familiar now terms? 39 00:02:50,580 --> 00:02:53,160 What would be an array? 40 00:02:53,160 --> 00:02:53,670 Yeah. 41 00:02:53,670 --> 00:02:57,070 AUDIENCE: Kind of like an indexed set of things. 42 00:02:57,070 --> 00:02:59,060 DAVID J. MALAN: An indexed set of things-- not bad. 43 00:02:59,060 --> 00:03:02,210 And I think a key characteristic to keep in mind with an array is that it 44 00:03:02,210 --> 00:03:03,890 does actually pertain to memory. 45 00:03:03,890 --> 00:03:05,810 And it's contiguous memory. 46 00:03:05,810 --> 00:03:09,200 Byte after byte after byte is what constitutes an array. 47 00:03:09,200 --> 00:03:11,810 And we'll see in a couple of weeks time that there's actually 48 00:03:11,810 --> 00:03:15,770 more interesting ways to use this same primitive Canvas to stitch together 49 00:03:15,770 --> 00:03:19,240 things that are sort of two directional even that have some kind of shape 50 00:03:19,240 --> 00:03:19,740 to them. 51 00:03:19,740 --> 00:03:22,970 But for now, all we've talked about is arrays and just using these things 52 00:03:22,970 --> 00:03:27,240 from left to right, top to bottom, contiguously to represent information. 53 00:03:27,240 --> 00:03:29,900 So today, we'll consider still an array. 54 00:03:29,900 --> 00:03:33,020 But we won't focus so much on representation 55 00:03:33,020 --> 00:03:34,460 of strings or other data types. 56 00:03:34,460 --> 00:03:36,918 We'll actually now focus on the other part of that process, 57 00:03:36,918 --> 00:03:39,950 of inputs becoming outputs, namely the thing in the middle-- 58 00:03:39,950 --> 00:03:40,940 algorithms. 59 00:03:40,940 --> 00:03:45,200 But we have to keep in mind, even though every time we've looked at an array 60 00:03:45,200 --> 00:03:48,610 thus far, certainly on the board like this, you as a human certainly 61 00:03:48,610 --> 00:03:50,360 have the luxury of just kind of eyeballing 62 00:03:50,360 --> 00:03:53,820 the whole thing with a bird's eye view and seeing where all of those numbers 63 00:03:53,820 --> 00:03:54,320 are. 64 00:03:54,320 --> 00:03:56,153 If I asked you where a particular number is, 65 00:03:56,153 --> 00:03:59,150 like zero, odds are your eyes would go right to where it is, 66 00:03:59,150 --> 00:04:02,150 and boom, problem solved in sort of one step. 67 00:04:02,150 --> 00:04:07,260 But the catch is, with a computer that has this memory, even though you, 68 00:04:07,260 --> 00:04:10,730 the human, can [INAUDIBLE] see everything at once, a computer cannot. 69 00:04:10,730 --> 00:04:14,090 It's better to think of your computer's memory, your phone's memory, 70 00:04:14,090 --> 00:04:17,000 or more specifically an array of memory like this 71 00:04:17,000 --> 00:04:21,500 as really being a set of closed doors, not unlike lockers in a school. 72 00:04:21,500 --> 00:04:24,260 And only by opening each of those doors can 73 00:04:24,260 --> 00:04:26,090 the computer actually see what's in there, 74 00:04:26,090 --> 00:04:28,610 which is to say that the computer, unlike you, doesn't 75 00:04:28,610 --> 00:04:32,480 have this bird's eye view of all of the data in all these locations. 76 00:04:32,480 --> 00:04:35,400 It has to much more methodically look here, 77 00:04:35,400 --> 00:04:39,590 maybe look here, maybe look here, and so forth in order to find something. 78 00:04:39,590 --> 00:04:43,670 Now fortunately, we already have some building blocks-- loops, conditions, 79 00:04:43,670 --> 00:04:45,210 Boolean expressions, and the like-- 80 00:04:45,210 --> 00:04:47,120 where you could imagine writing some code 81 00:04:47,120 --> 00:04:51,680 that very methodically goes from left to right or right to left or something 82 00:04:51,680 --> 00:04:55,490 more sophisticated that actually finds something you're looking for. 83 00:04:55,490 --> 00:04:58,850 And just remember that the conventions we've had since last week 84 00:04:58,850 --> 00:05:03,320 now is that these arrays are zero indexed, so to speak. 85 00:05:03,320 --> 00:05:08,150 To be zero indexed just means that the data type starts counting from zero. 86 00:05:08,150 --> 00:05:11,930 So this is location 0, 1, 2, 3, 4, 5, 6. 87 00:05:11,930 --> 00:05:15,320 And notice even though there are seven total doors here, 88 00:05:15,320 --> 00:05:17,930 the right-most one, of course, is called 6 89 00:05:17,930 --> 00:05:19,940 just because we've started counting at 0. 90 00:05:19,940 --> 00:05:24,680 So in the general case, if you had n doors or n bytes of memory, 91 00:05:24,680 --> 00:05:29,480 0 would always be at the left, and n minus 1 would always be at the right. 92 00:05:29,480 --> 00:05:33,860 That's sort of a generalization of just thinking about this kind of convention. 93 00:05:33,860 --> 00:05:37,790 All right, so let's revisit the problem that we started the whole term off 94 00:05:37,790 --> 00:05:40,523 with in week zero, which was this notion of searching. 95 00:05:40,523 --> 00:05:42,440 And what does it mean to search for something? 96 00:05:42,440 --> 00:05:45,218 Well, to find information-- and this, of course, is omnipresent. 97 00:05:45,218 --> 00:05:48,260 Anytime you take out your phone, you're searching for a friend's contact. 98 00:05:48,260 --> 00:05:51,140 Any time you pull up a browser, you're googling for this or that. 99 00:05:51,140 --> 00:05:55,310 So search is kind of one of the most omnipresent topics and features 100 00:05:55,310 --> 00:05:56,877 of any device these days. 101 00:05:56,877 --> 00:05:59,960 So let's consider how the Googles, the Apples, the Microsofts of the world 102 00:05:59,960 --> 00:06:03,660 are implementing something as seemingly familiar as this. 103 00:06:03,660 --> 00:06:06,020 So here might be the problem statement. 104 00:06:06,020 --> 00:06:08,450 We want some input to become some output. 105 00:06:08,450 --> 00:06:09,800 What's that input going to be? 106 00:06:09,800 --> 00:06:13,520 Maybe it's a bunch of closed doors like this out of which we want 107 00:06:13,520 --> 00:06:16,250 to get back an answer, true or false. 108 00:06:16,250 --> 00:06:18,785 Is something we're looking for there or not? 109 00:06:18,785 --> 00:06:21,410 You can imagine taking this one step further and trying to find 110 00:06:21,410 --> 00:06:23,520 where is the thing you're looking for. 111 00:06:23,520 --> 00:06:25,940 But for now, let's just take one bite out of the problem. 112 00:06:25,940 --> 00:06:30,560 Can we tell ourselves, true or false, is some number behind one 113 00:06:30,560 --> 00:06:33,390 of these doors or lockers in memory? 114 00:06:33,390 --> 00:06:37,760 But before we go there and start talking about ways to do that-- that is, 115 00:06:37,760 --> 00:06:38,580 algorithms. 116 00:06:38,580 --> 00:06:42,680 Let's consider how we might lay the foundation of, like, 117 00:06:42,680 --> 00:06:46,108 comparing whether one algorithm is better than another. 118 00:06:46,108 --> 00:06:47,900 We talked about correctness, and it sort of 119 00:06:47,900 --> 00:06:51,440 goes without saying that any code you write, any algorithm you implement, 120 00:06:51,440 --> 00:06:52,460 had better be correct. 121 00:06:52,460 --> 00:06:55,340 Otherwise, what's the point if it doesn't give you the right answers? 122 00:06:55,340 --> 00:06:57,260 But we also talked about design. 123 00:06:57,260 --> 00:07:01,280 And in your own words, what do we mean when we say a program is better 124 00:07:01,280 --> 00:07:04,470 designed at this stage than another? 125 00:07:04,470 --> 00:07:07,880 How do you think about this notion of design now? 126 00:07:07,880 --> 00:07:09,050 Yeah, in the middle? 127 00:07:09,050 --> 00:07:11,300 AUDIENCE: Easier to understand or easier to institute. 128 00:07:11,300 --> 00:07:12,990 DAVID J. MALAN: OK, so easier to understand. 129 00:07:12,990 --> 00:07:13,825 I like that. 130 00:07:13,825 --> 00:07:14,450 Other thoughts? 131 00:07:14,450 --> 00:07:14,950 Yeah. 132 00:07:14,950 --> 00:07:15,980 AUDIENCE: Efficiency. 133 00:07:15,980 --> 00:07:18,813 DAVID J. MALAN: Efficiency, and what do you mean by efficiency precisely? 134 00:07:18,813 --> 00:07:22,163 AUDIENCE: [INAUDIBLE] 135 00:07:22,163 --> 00:07:22,830 DAVID J. MALAN: Nice. 136 00:07:22,830 --> 00:07:25,390 It doesn't use up too much memory, and it isn't redundant. 137 00:07:25,390 --> 00:07:27,120 So you can think about design along a few 138 00:07:27,120 --> 00:07:29,078 of these axes-- sort of the quality of the code 139 00:07:29,078 --> 00:07:31,270 but also the quality of the performance. 140 00:07:31,270 --> 00:07:35,818 And as our programs get bigger and more sophisticated and just longer, 141 00:07:35,818 --> 00:07:37,860 those kinds of things are really going to matter. 142 00:07:37,860 --> 00:07:39,860 And in the real world, if you start writing code 143 00:07:39,860 --> 00:07:42,630 not just by yourself but with someone else, getting the design 144 00:07:42,630 --> 00:07:46,140 right is just going to make it easier to collaborate and ultimately produce, 145 00:07:46,140 --> 00:07:48,670 write code, with just higher probability. 146 00:07:48,670 --> 00:07:52,290 So let's consider how we might focus on exactly the second characteristic, 147 00:07:52,290 --> 00:07:54,390 the efficiency, of an algorithm. 148 00:07:54,390 --> 00:07:58,230 And the way we might talk about the efficiency of algorithms, just how fast 149 00:07:58,230 --> 00:08:01,350 or how slow they are, is in terms of their running time. 150 00:08:01,350 --> 00:08:04,900 That is to say, when they're running, how much time do they take? 151 00:08:04,900 --> 00:08:08,010 And we might measure this in seconds or milliseconds or minutes 152 00:08:08,010 --> 00:08:10,350 or just some number of steps in the general case 153 00:08:10,350 --> 00:08:14,560 because presumably fewer steps, to your point, is better than more steps. 154 00:08:14,560 --> 00:08:16,410 So how might we think about running times? 155 00:08:16,410 --> 00:08:19,470 Well, there's one general notation we should define today. 156 00:08:19,470 --> 00:08:23,610 So computer scientists tend to describe the running time of an algorithm 157 00:08:23,610 --> 00:08:27,840 or a piece of code, for that matter, in terms of what's called big O notation. 158 00:08:27,840 --> 00:08:30,630 This is literally a capitalized O, a big O. 159 00:08:30,630 --> 00:08:34,110 And this generally means that the running time of some algorithm 160 00:08:34,110 --> 00:08:37,559 is on the order of such and such, where such and such, we'll see, 161 00:08:37,559 --> 00:08:40,169 is just going to be a very simple mathematical formula. 162 00:08:40,169 --> 00:08:42,450 It's kind of a way of waving your hands mathematically 163 00:08:42,450 --> 00:08:46,770 to convey the idea of just how fast or how slow some algorithm or code is 164 00:08:46,770 --> 00:08:48,600 without getting into the weeds of like, it 165 00:08:48,600 --> 00:08:52,330 took this many milliseconds or this many specific number of steps. 166 00:08:52,330 --> 00:08:56,280 So you might recall then from week zero, I even introduced this picture 167 00:08:56,280 --> 00:08:57,360 but without much context. 168 00:08:57,360 --> 00:09:00,660 At the time, we just use this to compare those phone book algorithms. 169 00:09:00,660 --> 00:09:03,600 Recall that this red straight line was the first algorithm, 170 00:09:03,600 --> 00:09:04,920 one page at a time. 171 00:09:04,920 --> 00:09:09,870 The yellow line that's still straight differed how if you recall? 172 00:09:09,870 --> 00:09:14,370 That line represented what alternative algorithm? 173 00:09:14,370 --> 00:09:15,370 Looking out and back. 174 00:09:15,370 --> 00:09:16,620 What is that second algorithm? 175 00:09:16,620 --> 00:09:17,430 Yeah, over there. 176 00:09:17,430 --> 00:09:18,930 AUDIENCE: Like, two pages at a time. 177 00:09:18,930 --> 00:09:22,080 DAVID J. MALAN: Two pages at a time, which was almost correct so long as we 178 00:09:22,080 --> 00:09:25,340 potentially double back a page if maybe we go a little too far in the phone 179 00:09:25,340 --> 00:09:25,840 book. 180 00:09:25,840 --> 00:09:28,372 So it had a potential bug but arguably solvable. 181 00:09:28,372 --> 00:09:31,080 This last algorithm, though, was the so-called divide and conquer 182 00:09:31,080 --> 00:09:34,590 strategy where I sort of unnecessarily tore the phone book in half 183 00:09:34,590 --> 00:09:36,600 and then in half and then in half, which, 184 00:09:36,600 --> 00:09:40,890 as dramatic as that was unnecessarily, it actually took significantly bigger 185 00:09:40,890 --> 00:09:43,420 bites out of the problem-- like 500 pages 186 00:09:43,420 --> 00:09:49,360 the first time, another 250, another 125 versus just 1 or 2 bytes at a time. 187 00:09:49,360 --> 00:09:52,583 And so we described its running time as this picture 188 00:09:52,583 --> 00:09:55,500 there, though I didn't use that expression at the time, running times. 189 00:09:55,500 --> 00:09:58,230 But indeed, time to solve might be measured just 190 00:09:58,230 --> 00:10:00,150 abstractly in some unit of measure-- 191 00:10:00,150 --> 00:10:03,420 seconds, milliseconds, minutes, pages-- 192 00:10:03,420 --> 00:10:05,110 via this y-axis here. 193 00:10:05,110 --> 00:10:07,830 So let's now slap some numbers on this. 194 00:10:07,830 --> 00:10:11,250 If we had n pages in that phone book, n just representing 195 00:10:11,250 --> 00:10:13,620 a generic number, the first algorithm here 196 00:10:13,620 --> 00:10:15,450 we might describe as taking n steps. 197 00:10:15,450 --> 00:10:18,870 Second algorithm we might describe as taking n divided by 2 steps, 198 00:10:18,870 --> 00:10:21,510 maybe give or take one if we have to double back but generally 199 00:10:21,510 --> 00:10:22,482 n divided by 2. 200 00:10:22,482 --> 00:10:24,690 And then this thing, if you remember your logarithms, 201 00:10:24,690 --> 00:10:26,648 was sort of a fundamentally different formula-- 202 00:10:26,648 --> 00:10:30,100 log base 2 of n or just log of n for short. 203 00:10:30,100 --> 00:10:32,790 So this is of a fundamentally different formula. 204 00:10:32,790 --> 00:10:36,430 But what's noteworthy is that these first two algorithms, 205 00:10:36,430 --> 00:10:40,050 even though, yes, the second algorithm was hands down faster-- 206 00:10:40,050 --> 00:10:41,910 I mean, literally twice as fast-- 207 00:10:41,910 --> 00:10:46,560 when you start to zoom out and if I increase my y-axis and x-axis, 208 00:10:46,560 --> 00:10:52,560 these first two start to look awfully similar to one another. 209 00:10:52,560 --> 00:10:54,810 And if we keep zooming out and zooming out and zooming 210 00:10:54,810 --> 00:10:57,090 out as n gets really large-- 211 00:10:57,090 --> 00:10:59,220 that is, the x-axis gets really long-- 212 00:10:59,220 --> 00:11:03,250 these first two algorithms start to become essentially the same. 213 00:11:03,250 --> 00:11:06,390 And so this is where computer scientists use big O notation. 214 00:11:06,390 --> 00:11:10,420 Instead of saying specifically, this algorithm takes any steps. 215 00:11:10,420 --> 00:11:13,620 And this one n divided by 2, a computer scientist would say, 216 00:11:13,620 --> 00:11:17,010 eh, each of those algorithms takes on the order of n steps 217 00:11:17,010 --> 00:11:19,020 or on the order of n over 2. 218 00:11:19,020 --> 00:11:19,770 But you know what? 219 00:11:19,770 --> 00:11:23,490 On the order of n over 2 is pretty much the same 220 00:11:23,490 --> 00:11:29,620 when n gets really large as being equivalent to big O of n itself. 221 00:11:29,620 --> 00:11:34,480 So yes, in practice, it's obviously fewer steps to move twice as fast. 222 00:11:34,480 --> 00:11:37,950 But in the big picture, when n becomes a million, a billion, 223 00:11:37,950 --> 00:11:40,620 the numbers are already so darn big at that point 224 00:11:40,620 --> 00:11:43,680 that these are as, the shapes of these curves imply, 225 00:11:43,680 --> 00:11:45,930 pretty much functionally equivalent. 226 00:11:45,930 --> 00:11:49,080 But this one still looks better and better 227 00:11:49,080 --> 00:11:52,470 as n gets large because it's rising so much less quickly. 228 00:11:52,470 --> 00:11:54,660 And so here, a computer scientist would say 229 00:11:54,660 --> 00:11:59,318 that that third algorithm was on the order of-- that is, big O of-- log n. 230 00:11:59,318 --> 00:12:01,110 And they don't have to bother with the base 231 00:12:01,110 --> 00:12:04,980 because it's a smaller mathematical detail that is also just in some sense 232 00:12:04,980 --> 00:12:07,890 a constant, multiplicative factor. 233 00:12:07,890 --> 00:12:09,840 So in short, what are the takeaways here? 234 00:12:09,840 --> 00:12:12,270 This is just a new vocabulary that we'll start 235 00:12:12,270 --> 00:12:16,020 to use when we just want to describe the running time of an algorithm. 236 00:12:16,020 --> 00:12:18,600 To make this more real, if any of you have implemented 237 00:12:18,600 --> 00:12:24,290 a for loop at this point in any of your code and that for loop iterated n times 238 00:12:24,290 --> 00:12:27,590 where maybe in was the height of your pyramid or maybe n 239 00:12:27,590 --> 00:12:31,550 was something else that you wanted to do n times, you wrote code 240 00:12:31,550 --> 00:12:36,440 or you implemented an algorithm that operated in big O of n time, 241 00:12:36,440 --> 00:12:37,200 if you will. 242 00:12:37,200 --> 00:12:39,350 So this is just a way now to retroactively start 243 00:12:39,350 --> 00:12:43,490 describing with somewhat mathematical notation what we've 244 00:12:43,490 --> 00:12:45,780 been doing in practice for a while now. 245 00:12:45,780 --> 00:12:51,500 So here's a list of commonly seen running times in the real world. 246 00:12:51,500 --> 00:12:55,010 This is not a thorough list because you could come up 247 00:12:55,010 --> 00:12:57,710 with an infinite number of mathematical formulas, certainly. 248 00:12:57,710 --> 00:13:01,400 But the common ones we'll discuss and you will see in your own code 249 00:13:01,400 --> 00:13:04,070 probably reduce to this list here. 250 00:13:04,070 --> 00:13:06,320 And if you were to study more computer science theory, 251 00:13:06,320 --> 00:13:07,903 this list would get longer and longer. 252 00:13:07,903 --> 00:13:11,930 But for now, these are sort of the most familiar ones that we'll soon see. 253 00:13:11,930 --> 00:13:14,700 All right, two other pieces of vocabulary, if you will, 254 00:13:14,700 --> 00:13:16,160 before we start to use this stuff-- 255 00:13:16,160 --> 00:13:19,430 so this, a big omega, capital omega symbol, 256 00:13:19,430 --> 00:13:25,170 is used now to describe a lower bound on the running time of an algorithm. 257 00:13:25,170 --> 00:13:28,610 So to be clear, big O is on the order of-- that 258 00:13:28,610 --> 00:13:31,910 is, an upper bound-- on how many steps an algorithm might 259 00:13:31,910 --> 00:13:34,700 take, on the order of so many steps. 260 00:13:34,700 --> 00:13:37,580 If you want to talk, though, from the other perspective, well, 261 00:13:37,580 --> 00:13:39,710 how few steps my algorithm take? 262 00:13:39,710 --> 00:13:42,320 Maybe in the so-called best case, it'd be nice 263 00:13:42,320 --> 00:13:45,110 if we had a notation to just describe what a lower 264 00:13:45,110 --> 00:13:47,840 bound is because some algorithms might be super fast 265 00:13:47,840 --> 00:13:49,890 in these so-called best cases. 266 00:13:49,890 --> 00:13:53,660 So the symbology is almost the same, but we replace the big O 267 00:13:53,660 --> 00:13:54,780 with the big omega. 268 00:13:54,780 --> 00:13:58,580 So to be clear, big O describes an upper bound and omega 269 00:13:58,580 --> 00:14:00,110 describes a lower bound. 270 00:14:00,110 --> 00:14:02,540 And we'll see examples of this before long. 271 00:14:02,540 --> 00:14:08,420 And then lastly, last one here, big theta, is used by a computer scientist 272 00:14:08,420 --> 00:14:12,860 when you have a case where both the upper bound on an algorithm's 273 00:14:12,860 --> 00:14:15,890 running time is the same as the lower bound. 274 00:14:15,890 --> 00:14:19,580 You can then describe it in one breath as being in theta of such and such 275 00:14:19,580 --> 00:14:23,660 instead of saying it's in big O and in omega of something else. 276 00:14:23,660 --> 00:14:27,920 All right, so out of context, sort of just seemingly cryptic symbols, 277 00:14:27,920 --> 00:14:31,463 but all they refer to is upper bounds, lower bounds, or when 278 00:14:31,463 --> 00:14:32,880 they happen to be one in the same. 279 00:14:32,880 --> 00:14:36,440 And we'll now introduce over time examples of how we might actually 280 00:14:36,440 --> 00:14:38,990 apply these to concrete problems. 281 00:14:38,990 --> 00:14:43,220 But first, let me pause to see if there's any questions. 282 00:14:43,220 --> 00:14:46,120 Any questions here? 283 00:14:46,120 --> 00:14:48,430 Any questions? 284 00:14:48,430 --> 00:14:50,880 I see pointing somewhere. 285 00:14:50,880 --> 00:14:53,100 Where are you pointing to? 286 00:14:53,100 --> 00:14:54,640 Over here-- there we go. 287 00:14:54,640 --> 00:14:55,980 OK, sorry-- very bright. 288 00:14:55,980 --> 00:14:59,560 AUDIENCE: So, um, smaller-- 289 00:14:59,560 --> 00:15:01,520 DAVID J. MALAN: Smaller n functions move faster. 290 00:15:01,520 --> 00:15:06,260 So yes, if you have something like n, that takes only steps. 291 00:15:06,260 --> 00:15:09,265 If you have a formula like n squared, just by nature of the math, 292 00:15:09,265 --> 00:15:11,900 that take more steps and therefore be slower. 293 00:15:11,900 --> 00:15:13,810 So the larger the mathematical expression, 294 00:15:13,810 --> 00:15:18,280 the slower your algorithm is because the more time or more steps that it takes. 295 00:15:18,280 --> 00:15:20,800 296 00:15:20,800 --> 00:15:23,170 AUDIENCE: So you want your n function to be small? 297 00:15:23,170 --> 00:15:26,410 DAVID J. MALAN: You want your n function, so to speak, to be small, yes. 298 00:15:26,410 --> 00:15:28,220 And in fact, the Holy Grail, so to speak, 299 00:15:28,220 --> 00:15:31,690 would be this last one here either in big O notation or even theta, 300 00:15:31,690 --> 00:15:34,900 when an algorithm is on the order of a single step. 301 00:15:34,900 --> 00:15:39,370 That means it literally takes constant time, one step, or maybe 10 steps, 302 00:15:39,370 --> 00:15:42,490 100 steps, but a fixed, constant number of steps. 303 00:15:42,490 --> 00:15:46,120 That's the best because even as the phone book gets bigger, 304 00:15:46,120 --> 00:15:50,350 even as the data set you're searching gets larger and larger, 305 00:15:50,350 --> 00:15:53,500 if something only takes a finite number of steps constantly, 306 00:15:53,500 --> 00:15:57,640 then it doesn't matter how big the data set actually gets. 307 00:15:57,640 --> 00:16:01,840 Questions as well on these notations-- yep, thank you for the pointing. 308 00:16:01,840 --> 00:16:03,310 This is actually very helpful. 309 00:16:03,310 --> 00:16:05,170 I'm seeing pointing this way? 310 00:16:05,170 --> 00:16:08,227 AUDIENCE: [INAUDIBLE] 311 00:16:08,227 --> 00:16:10,560 DAVID J. MALAN: What is the input to each of these functions? 312 00:16:10,560 --> 00:16:13,710 It is an expression of how many steps an algorithm takes. 313 00:16:13,710 --> 00:16:16,080 So in fact, let me go ahead and make this 314 00:16:16,080 --> 00:16:19,220 more concrete with an actual example here if we could. 315 00:16:19,220 --> 00:16:22,350 So on stage here, we have seven lockers which represent, 316 00:16:22,350 --> 00:16:24,300 if you will, an array of memory. 317 00:16:24,300 --> 00:16:26,400 And this array of memory is maybe storing 318 00:16:26,400 --> 00:16:30,150 seven integers, seven integers that we might actually want to search for. 319 00:16:30,150 --> 00:16:33,090 And if we want to search for these values, how might 320 00:16:33,090 --> 00:16:34,048 we go about doing this? 321 00:16:34,048 --> 00:16:36,257 Well, for this, why don't we make things interesting? 322 00:16:36,257 --> 00:16:37,810 Would a volunteer like to come on up? 323 00:16:37,810 --> 00:16:40,920 Have to be masked and on the internet if you are comfortable. 324 00:16:40,920 --> 00:16:44,190 Both of-- oh, there's someone putting their friend's hand up and back? 325 00:16:44,190 --> 00:16:44,940 Yes, OK. 326 00:16:44,940 --> 00:16:45,660 Come on down. 327 00:16:45,660 --> 00:16:50,380 328 00:16:50,380 --> 00:16:53,080 And in just a moment, our brave volunteer 329 00:16:53,080 --> 00:16:57,070 is going to help me find a specific number in the data set 330 00:16:57,070 --> 00:16:58,520 that we have here on the screen. 331 00:16:58,520 --> 00:17:02,230 So come on down, and I'll get things ready for you in advance here. 332 00:17:02,230 --> 00:17:08,723 Come on down nice to meet. 333 00:17:08,723 --> 00:17:09,640 And what is your name? 334 00:17:09,640 --> 00:17:10,390 AUDIENCE: [? Nomira. ?] 335 00:17:10,390 --> 00:17:10,645 DAVID J. MALAN: Minera? 336 00:17:10,645 --> 00:17:11,530 AUDIENCE: [? Nomira. ?] 337 00:17:11,530 --> 00:17:13,113 DAVID J. MALAN: [? Nomira. ?] Nice to meet. 338 00:17:13,113 --> 00:17:13,700 Come on over. 339 00:17:13,700 --> 00:17:17,650 So here we have for Nomira seven lockers or an array of memory. 340 00:17:17,650 --> 00:17:19,540 And behind each of these doors is a number. 341 00:17:19,540 --> 00:17:22,569 And the goal, quite simply, is, given this array of memory 342 00:17:22,569 --> 00:17:27,680 as input, to return, true or false, is the number I care about actually there? 343 00:17:27,680 --> 00:17:30,250 So suppose I care about the number 0. 344 00:17:30,250 --> 00:17:33,730 What would be the simplest, most correct algorithm you could 345 00:17:33,730 --> 00:17:38,200 apply in order to find us the number 0? 346 00:17:38,200 --> 00:17:41,608 OK, try opening the first one. 347 00:17:41,608 --> 00:17:44,150 All right, and maybe just step aside so the audience can see. 348 00:17:44,150 --> 00:17:46,160 I think you have not found 0 yet. 349 00:17:46,160 --> 00:17:47,480 OK, so keep the door open. 350 00:17:47,480 --> 00:17:50,120 Let's move on to your next choice. 351 00:17:50,120 --> 00:17:51,080 Second door, sure. 352 00:17:51,080 --> 00:17:51,998 AUDIENCE: [INAUDIBLE] 353 00:17:51,998 --> 00:17:53,540 DAVID J. MALAN: Oh, go ahead, second door. 354 00:17:53,540 --> 00:17:54,415 Let's keep it simple. 355 00:17:54,415 --> 00:17:57,557 Let's just move from left to right, sort of searching our way. 356 00:17:57,557 --> 00:17:58,640 And what do you see there? 357 00:17:58,640 --> 00:17:59,770 Oh, 6, not 0. 358 00:17:59,770 --> 00:18:00,770 How about the next door? 359 00:18:00,770 --> 00:18:04,200 360 00:18:04,200 --> 00:18:07,230 All right, also not working out so well yet, but that's OK. 361 00:18:07,230 --> 00:18:11,610 If you want to go on to the next, we're still looking for 0. 362 00:18:11,610 --> 00:18:12,930 All right, I see a 2. 363 00:18:12,930 --> 00:18:14,502 All right, it's not so good yet. 364 00:18:14,502 --> 00:18:15,210 Let's keep going. 365 00:18:15,210 --> 00:18:16,860 Next door. 366 00:18:16,860 --> 00:18:18,540 2, 7-- no. 367 00:18:18,540 --> 00:18:20,100 OK, next door. 368 00:18:20,100 --> 00:18:25,870 No, that's a-- all right, very well done. 369 00:18:25,870 --> 00:18:26,370 Oh. 370 00:18:26,370 --> 00:18:29,030 371 00:18:29,030 --> 00:18:32,330 All right, so I kind of set you up for a fairly slow algorithm, 372 00:18:32,330 --> 00:18:34,550 but let me just ask you to describe what is it 373 00:18:34,550 --> 00:18:37,534 you did by following the steps I gave you. 374 00:18:37,534 --> 00:18:39,660 AUDIENCE: I just went one by one to each character. 375 00:18:39,660 --> 00:18:41,660 DAVID J. MALAN: You went one by one to each character 376 00:18:41,660 --> 00:18:43,380 if you want to talk into here. 377 00:18:43,380 --> 00:18:45,140 So you went one by one by each character. 378 00:18:45,140 --> 00:18:48,620 And would you say that algorithm left or right is correct? 379 00:18:48,620 --> 00:18:49,790 AUDIENCE: No. 380 00:18:49,790 --> 00:18:50,960 DAVID J. MALAN: No? 381 00:18:50,960 --> 00:18:53,000 AUDIENCE: Or, yes, in the scenario. 382 00:18:53,000 --> 00:18:54,500 DAVID J. MALAN: OK, yes in this scenario. 383 00:18:54,500 --> 00:18:55,520 Why are you hesitating? 384 00:18:55,520 --> 00:18:56,060 What's going through your mind? 385 00:18:56,060 --> 00:18:58,100 AUDIENCE: Because it's not the most efficient way to do it. 386 00:18:58,100 --> 00:18:58,980 DAVID J. MALAN: OK, good. 387 00:18:58,980 --> 00:19:01,610 So we see a contrast here between correctness and design. 388 00:19:01,610 --> 00:19:04,360 I mean, I do think it was correct because even though it was slow, 389 00:19:04,360 --> 00:19:05,960 you eventually found zero. 390 00:19:05,960 --> 00:19:08,490 But it took some number of steps. 391 00:19:08,490 --> 00:19:10,280 So in fact, this would be an algorithm. 392 00:19:10,280 --> 00:19:12,350 It has a name, called linear search. 393 00:19:12,350 --> 00:19:14,840 And, [? Nomira, ?] as you did, you kind of walked along 394 00:19:14,840 --> 00:19:16,735 a line going from left to right. 395 00:19:16,735 --> 00:19:17,360 Now let me ask. 396 00:19:17,360 --> 00:19:19,880 If you had gone from right to left, would the algorithm 397 00:19:19,880 --> 00:19:23,030 have been fundamentally better? 398 00:19:23,030 --> 00:19:23,780 AUDIENCE: Yes. 399 00:19:23,780 --> 00:19:24,885 DAVID J. MALAN: OK, and why? 400 00:19:24,885 --> 00:19:27,260 AUDIENCE: Because the zero is here in the first scenario. 401 00:19:27,260 --> 00:19:30,860 But if it was like, the zero is in the middle, it wouldn't have been. 402 00:19:30,860 --> 00:19:35,020 DAVID J. MALAN: Yeah, and so here is where the right way to do things 403 00:19:35,020 --> 00:19:36,270 becomes a little less obvious. 404 00:19:36,270 --> 00:19:38,707 You would absolutely have given yourself a better result 405 00:19:38,707 --> 00:19:40,790 if you would just happened to start from the right 406 00:19:40,790 --> 00:19:42,780 or if I had pointed you to start over there. 407 00:19:42,780 --> 00:19:45,907 But the catch is if I asked her to find another number, like the number 8, 408 00:19:45,907 --> 00:19:47,240 well, that would have backfired. 409 00:19:47,240 --> 00:19:48,948 And this time, it would have taken longer 410 00:19:48,948 --> 00:19:51,450 to find that number because it's way over here instead. 411 00:19:51,450 --> 00:19:55,970 And so in the general case, going left to right or, heck, right to left 412 00:19:55,970 --> 00:19:59,540 is probably as correct as you can get because if you know nothing 413 00:19:59,540 --> 00:20:03,517 about the order of these numbers-- and indeed, they seem to be fairly random. 414 00:20:03,517 --> 00:20:05,600 Some of them are smaller, some of them are bigger. 415 00:20:05,600 --> 00:20:07,308 There doesn't seem to be rhyme or reason. 416 00:20:07,308 --> 00:20:11,330 Linear search is about as good as you can do when you don't know anything 417 00:20:11,330 --> 00:20:13,053 a priori about the numbers. 418 00:20:13,053 --> 00:20:15,720 So I have a little thank you gift here, a little CS stress ball. 419 00:20:15,720 --> 00:20:18,680 Round of applause for our first volunteer. 420 00:20:18,680 --> 00:20:19,550 Thank you so much. 421 00:20:19,550 --> 00:20:23,670 422 00:20:23,670 --> 00:20:27,690 Let's try to formalize what I just described as linear search 423 00:20:27,690 --> 00:20:30,938 because indeed, no matter which end [? Nomira ?] had started on, 424 00:20:30,938 --> 00:20:32,730 I could have kind of changed up the problem 425 00:20:32,730 --> 00:20:35,070 to make sure that it appears to be running slow. 426 00:20:35,070 --> 00:20:36,090 But it is correct. 427 00:20:36,090 --> 00:20:39,270 If zero were among those doors, she absolutely would have found it 428 00:20:39,270 --> 00:20:40,260 and indeed did. 429 00:20:40,260 --> 00:20:45,870 So let's now try to translate what we did into what we might call again 430 00:20:45,870 --> 00:20:48,210 pseudo code as from week zero. 431 00:20:48,210 --> 00:20:50,730 So with pseudo code, we just need a terse English 432 00:20:50,730 --> 00:20:53,770 like, or any language, syntax to describe what we did. 433 00:20:53,770 --> 00:20:56,280 So here might be one formulation of what [? Nomira ?] did. 434 00:20:56,280 --> 00:21:00,690 For each door, from left to right, if the number is behind the door, 435 00:21:00,690 --> 00:21:02,280 return true. 436 00:21:02,280 --> 00:21:07,630 Else, at the very end of the program, you would return false by default. 437 00:21:07,630 --> 00:21:08,800 And now you got lucky. 438 00:21:08,800 --> 00:21:10,800 And by the seventh door, [? Nomira ?] had indeed 439 00:21:10,800 --> 00:21:14,430 returned true by saying, well, there is the zero. 440 00:21:14,430 --> 00:21:17,070 But let's consider if this pseudo code is now correct, 441 00:21:17,070 --> 00:21:18,150 an accurate translation. 442 00:21:18,150 --> 00:21:22,360 First of all, normally, when we've seen ifs, we might see an if else. 443 00:21:22,360 --> 00:21:26,100 And yet down here, return false is aligned with the for. 444 00:21:26,100 --> 00:21:29,700 Why did I not indent the return false, or put another way, 445 00:21:29,700 --> 00:21:36,780 why did I not do if number is behind door, return true, else return false? 446 00:21:36,780 --> 00:21:39,855 Why would that version of this code have been problematic? 447 00:21:39,855 --> 00:21:41,318 Way in back. 448 00:21:41,318 --> 00:21:50,320 AUDIENCE: [INAUDIBLE] 449 00:21:50,320 --> 00:21:52,690 DAVID J. MALAN: OK, I'm not sure it's because of redundancy. 450 00:21:52,690 --> 00:21:55,000 Let me go ahead and just make this explicit. 451 00:21:55,000 --> 00:21:58,510 If I had instead done else return false, I 452 00:21:58,510 --> 00:22:02,942 don't think it's so much redundancy that I'd be worried about. 453 00:22:02,942 --> 00:22:04,150 Let me bounce somewhere else. 454 00:22:04,150 --> 00:22:04,810 Yeah, in front? 455 00:22:04,810 --> 00:22:08,350 AUDIENCE: Um, maybe [INAUDIBLE] for the entire list 456 00:22:08,350 --> 00:22:09,730 after just checking one number. 457 00:22:09,730 --> 00:22:11,920 DAVID J. MALAN: Yeah, it would be returning falls for-- 458 00:22:11,920 --> 00:22:13,795 even though I'd only looked at-- [? Nomira ?] 459 00:22:13,795 --> 00:22:15,100 had only looked at one element. 460 00:22:15,100 --> 00:22:18,142 And it would have been as though if all of these doors were still closed, 461 00:22:18,142 --> 00:22:21,610 she opens this up and says, nope, this is not zero, return false. 462 00:22:21,610 --> 00:22:24,708 That would give me an incorrect result because obviously, 463 00:22:24,708 --> 00:22:27,250 at that stage in the algorithm, she wouldn't have even looked 464 00:22:27,250 --> 00:22:28,720 through any of the other doors. 465 00:22:28,720 --> 00:22:32,510 So just the original indentation of this, if you will, 466 00:22:32,510 --> 00:22:35,200 without the [? else, ?] is correct because only 467 00:22:35,200 --> 00:22:38,950 if I get to the bottom of this algorithm or the pseudo code does 468 00:22:38,950 --> 00:22:41,740 it make sense to conclude at that point, once she's 469 00:22:41,740 --> 00:22:45,220 gone through all of the doors, that nope, there's in fact-- 470 00:22:45,220 --> 00:22:48,550 the number I'm looking for is, in fact, not actually there. 471 00:22:48,550 --> 00:22:52,790 So how might we consider now the running time of this algorithm? 472 00:22:52,790 --> 00:22:55,930 We have a few different types of vocabulary now. 473 00:22:55,930 --> 00:22:59,080 And if we consider now how we might think about this, 474 00:22:59,080 --> 00:23:02,620 let's start to translate it from sort of higher level pseudo code 475 00:23:02,620 --> 00:23:04,420 to something a little lower level. 476 00:23:04,420 --> 00:23:07,820 We've been writing code using n and loops and the like. 477 00:23:07,820 --> 00:23:12,340 So let's take this higher level pseudo code and now just kind of 478 00:23:12,340 --> 00:23:14,890 get a middle ground between English and C. 479 00:23:14,890 --> 00:23:18,910 Let me propose that we think about this version of the same algorithm 480 00:23:18,910 --> 00:23:20,680 as being a little more pedantic. 481 00:23:20,680 --> 00:23:28,820 For i from 0 to n minus 1, if number behind doors bracket i return true. 482 00:23:28,820 --> 00:23:31,520 Otherwise, at the end of the program, return false. 483 00:23:31,520 --> 00:23:33,520 Now I'm kind of mixing English and C here, 484 00:23:33,520 --> 00:23:35,950 but that's reasonable if the reader is familiar with C 485 00:23:35,950 --> 00:23:37,540 or some similar language. 486 00:23:37,540 --> 00:23:39,530 And notice this pattern here. 487 00:23:39,530 --> 00:23:44,650 This is a way of just saying in pseudo code, give myself a variable called i. 488 00:23:44,650 --> 00:23:49,360 Start at 0 and then just count up to n minus 1. 489 00:23:49,360 --> 00:23:53,410 And recall n minus 1 is not one shy of the end of the array. 490 00:23:53,410 --> 00:23:56,170 N minus 1 is the end of the array because again, we 491 00:23:56,170 --> 00:23:57,580 started counting at 0. 492 00:23:57,580 --> 00:24:00,880 So this is a very common way of expressing this kind of loop 493 00:24:00,880 --> 00:24:03,910 from the left all the way to the right of an array. 494 00:24:03,910 --> 00:24:07,210 Doors I'm kind of implicitly treating as the name of this array, 495 00:24:07,210 --> 00:24:10,000 like it's a variable from last week that I defined as being 496 00:24:10,000 --> 00:24:11,780 an array of integers in this case. 497 00:24:11,780 --> 00:24:16,690 So doors bracket i means that when i is 0, it's this location. 498 00:24:16,690 --> 00:24:18,220 When i is 1, it's this. 499 00:24:18,220 --> 00:24:21,790 When i is 7 or, more generally n minus-- 500 00:24:21,790 --> 00:24:26,270 sorry, 6 or, more generally, n minus 1, that's this location here. 501 00:24:26,270 --> 00:24:28,700 So same idea but a translation of it. 502 00:24:28,700 --> 00:24:32,920 So now let's consider what the running time of this algorithm is. 503 00:24:32,920 --> 00:24:36,010 If we have this menu of possible answers to this question, 504 00:24:36,010 --> 00:24:38,930 how efficient or inefficient is this algorithm, 505 00:24:38,930 --> 00:24:41,650 let's take a look in the context of this pseudo code. 506 00:24:41,650 --> 00:24:44,500 We don't even have to bother going all the way to C. 507 00:24:44,500 --> 00:24:47,720 How do we go about analyzing each of these steps? 508 00:24:47,720 --> 00:24:49,600 Well, let's consider this. 509 00:24:49,600 --> 00:24:55,540 This outermost loop here for i from 0 to n minus 1, that line of code 510 00:24:55,540 --> 00:24:57,790 is going to execute how many times? 511 00:24:57,790 --> 00:25:01,420 How many times will that loop execute? 512 00:25:01,420 --> 00:25:04,540 Let me give folks this moment to think on it. 513 00:25:04,540 --> 00:25:07,330 How many times is that going to loop here? 514 00:25:07,330 --> 00:25:08,584 Yeah, over there. 515 00:25:08,584 --> 00:25:10,360 AUDIENCE: [INAUDIBLE] 516 00:25:10,360 --> 00:25:11,530 DAVID J. MALAN: n times, right? 517 00:25:11,530 --> 00:25:13,720 Because it's from 0 to n minus 1. 518 00:25:13,720 --> 00:25:16,330 And if it's a little weird to think in from 0 to n minus 1, 519 00:25:16,330 --> 00:25:19,832 this is essentially the same mathematically as from 1 to n. 520 00:25:19,832 --> 00:25:21,790 And that's perhaps a little more obviously more 521 00:25:21,790 --> 00:25:23,690 intuitively n total steps. 522 00:25:23,690 --> 00:25:28,180 So I might just make a note to myself this loop is going to operate n times. 523 00:25:28,180 --> 00:25:29,770 What about these inner steps? 524 00:25:29,770 --> 00:25:33,160 Well, how many steps or seconds does it take to ask a question? 525 00:25:33,160 --> 00:25:35,290 If the number behind-- 526 00:25:35,290 --> 00:25:38,740 if the number you're looking for is behind doors bracket i, 527 00:25:38,740 --> 00:25:41,570 well, as [? Nomira ?] did, that's kind of like one step. 528 00:25:41,570 --> 00:25:42,820 So you open the door and boom. 529 00:25:42,820 --> 00:25:46,100 All right, maybe it's two steps, but it's a constant number of steps. 530 00:25:46,100 --> 00:25:48,320 So this is some constant number of steps. 531 00:25:48,320 --> 00:25:50,080 Let's just call it one for simplicity. 532 00:25:50,080 --> 00:25:53,500 How many steps or seconds does it take to return true? 533 00:25:53,500 --> 00:25:55,863 I don't know exactly in the computer's memory 534 00:25:55,863 --> 00:25:57,280 but that feels like a single step. 535 00:25:57,280 --> 00:25:58,940 Just return true. 536 00:25:58,940 --> 00:26:01,960 So if this takes one step, this takes one step 537 00:26:01,960 --> 00:26:04,960 but only if the condition is true, it looks 538 00:26:04,960 --> 00:26:09,370 like you're doing a constant number of things n times. 539 00:26:09,370 --> 00:26:12,530 Or maybe you're doing one additional step. 540 00:26:12,530 --> 00:26:15,010 So in short, the only thing that really matters here 541 00:26:15,010 --> 00:26:18,220 in terms of the efficiency or inefficiency of the algorithm 542 00:26:18,220 --> 00:26:21,495 is what are you doing again and again and again because that's obviously 543 00:26:21,495 --> 00:26:22,870 the thing that's going to add up. 544 00:26:22,870 --> 00:26:26,080 Doing one thing or two things a constant number of times? 545 00:26:26,080 --> 00:26:27,010 Not a big deal. 546 00:26:27,010 --> 00:26:32,050 But looping, that's going to add up over time because the more doors there are, 547 00:26:32,050 --> 00:26:35,420 the bigger n is going to be and the more steps that's going to take, 548 00:26:35,420 --> 00:26:38,320 which is all to say if you were to describe roughly 549 00:26:38,320 --> 00:26:43,120 how many steps does this algorithm take in big O notation, 550 00:26:43,120 --> 00:26:46,060 what might your instincts say? 551 00:26:46,060 --> 00:26:51,350 How many steps is this algorithm on the order of given n doors or n integers? 552 00:26:51,350 --> 00:26:51,850 Yeah? 553 00:26:51,850 --> 00:26:52,805 AUDIENCE: [INAUDIBLE] 554 00:26:52,805 --> 00:26:53,680 DAVID J. MALAN: Say again? 555 00:26:53,680 --> 00:26:54,670 AUDIENCE: O n. 556 00:26:54,670 --> 00:26:56,110 DAVID J. MALAN: Big O of n. 557 00:26:56,110 --> 00:26:58,220 And indeed, that's going to be the case here. 558 00:26:58,220 --> 00:26:58,450 Why? 559 00:26:58,450 --> 00:27:00,533 Because you're essentially, at the end of the day, 560 00:27:00,533 --> 00:27:03,992 doing n things as an upper bound on running time. 561 00:27:03,992 --> 00:27:05,950 And that's, in fact, what exactly what happened 562 00:27:05,950 --> 00:27:08,500 with [? Nomira. ?] She had to look at all n lockers 563 00:27:08,500 --> 00:27:11,450 before finally getting to the right answer. 564 00:27:11,450 --> 00:27:14,470 But what if she got lucky and the number we 565 00:27:14,470 --> 00:27:17,380 were looking for was not at the end of the array 566 00:27:17,380 --> 00:27:20,080 but was at the beginning of the array? 567 00:27:20,080 --> 00:27:21,650 How might we think about that? 568 00:27:21,650 --> 00:27:25,120 Well, have a nomenclature for this too, of course-- omega notation. 569 00:27:25,120 --> 00:27:27,730 Remember, omega notation is a lower bound. 570 00:27:27,730 --> 00:27:34,240 So given this menu of possible running times for lower bounds on an algorithm, 571 00:27:34,240 --> 00:27:38,897 what might the omega notation be for [? Nomira's ?] linear search? 572 00:27:38,897 --> 00:27:40,270 AUDIENCE: Omega 1. 573 00:27:40,270 --> 00:27:42,250 DAVID J. MALAN: Omega of 1, and why that? 574 00:27:42,250 --> 00:27:43,973 AUDIENCE: [INAUDIBLE] 575 00:27:43,973 --> 00:27:46,390 DAVID J. MALAN: Right, because if just by chance she gets lucky 576 00:27:46,390 --> 00:27:49,300 and the number she's looking for is right there where 577 00:27:49,300 --> 00:27:51,490 she begins the algorithm, that's it. 578 00:27:51,490 --> 00:27:52,540 It's one step. 579 00:27:52,540 --> 00:27:55,210 Maybe it's two steps if you have to unlock the door and open it, 580 00:27:55,210 --> 00:27:56,740 but it's a constant number of steps. 581 00:27:56,740 --> 00:27:58,810 And the way we describe constant number of steps 582 00:27:58,810 --> 00:28:01,030 is just with a single number like 1. 583 00:28:01,030 --> 00:28:04,990 So the omega notation for linear search might be omega of 1 584 00:28:04,990 --> 00:28:08,650 because in the best case, she might just get the number right from the get go. 585 00:28:08,650 --> 00:28:11,860 But in the worst case, we need to talk about the upper bound, which 586 00:28:11,860 --> 00:28:13,990 might indeed be big O of n. 587 00:28:13,990 --> 00:28:16,690 So again there's this way now of talking symbolically 588 00:28:16,690 --> 00:28:22,150 about best cases and worst cases or lower bounds and upper bounds. 589 00:28:22,150 --> 00:28:24,880 Theta notation, just as a little trivia now, 590 00:28:24,880 --> 00:28:27,965 is it applicable based on the definition I gave earlier? 591 00:28:27,965 --> 00:28:28,840 AUDIENCE: [INAUDIBLE] 592 00:28:28,840 --> 00:28:31,630 DAVID J. MALAN: OK, no, because you only take out the theta notation 593 00:28:31,630 --> 00:28:34,120 when those two bounds, upper and lower, happen 594 00:28:34,120 --> 00:28:36,740 to be the same for shorthand notation, if you will. 595 00:28:36,740 --> 00:28:41,530 So it suffices here to talk about just big O and omega notation. 596 00:28:41,530 --> 00:28:43,880 Well, what if we are a little smarter about this? 597 00:28:43,880 --> 00:28:47,350 Let me go ahead and sort of semi-secretly here 598 00:28:47,350 --> 00:28:48,380 rearrange these numbers. 599 00:28:48,380 --> 00:28:50,605 But first, how about one other volunteer? 600 00:28:50,605 --> 00:28:53,230 One other volunteer-- you have to be comfortable with your mask 601 00:28:53,230 --> 00:28:55,510 and your being on the internet. 602 00:28:55,510 --> 00:28:58,180 How about over here? 603 00:28:58,180 --> 00:28:59,880 Yes, you want to come on down? 604 00:28:59,880 --> 00:29:00,880 All right, come on down. 605 00:29:00,880 --> 00:29:03,640 And don't look at what I'm doing because I'm going to-- 606 00:29:03,640 --> 00:29:08,340 607 00:29:08,340 --> 00:29:10,830 take your time and don't look up this way 608 00:29:10,830 --> 00:29:14,550 because I need a moment to rearrange all of the numbers. 609 00:29:14,550 --> 00:29:17,430 And actually, if you could stay right there before coming up, 610 00:29:17,430 --> 00:29:20,910 just an awkward few seconds while I finish hiding the numbers 611 00:29:20,910 --> 00:29:22,590 behind these doors for you. 612 00:29:22,590 --> 00:29:23,890 AUDIENCE: [INAUDIBLE] 613 00:29:23,890 --> 00:29:26,490 DAVID J. MALAN: I will be right with you. 614 00:29:26,490 --> 00:29:31,110 Actually, if-- do you want to warm up the crowd for a moment 615 00:29:31,110 --> 00:29:32,283 and I'll be right back? 616 00:29:32,283 --> 00:29:33,700 So you want to introduce yourself? 617 00:29:33,700 --> 00:29:34,770 AUDIENCE: Yeah, hi, guys. 618 00:29:34,770 --> 00:29:35,340 I'm Rave. 619 00:29:35,340 --> 00:29:37,900 620 00:29:37,900 --> 00:29:38,400 Yeah! 621 00:29:38,400 --> 00:29:43,500 622 00:29:43,500 --> 00:29:46,200 DAVID J. MALAN: All right, I think I am ready. 623 00:29:46,200 --> 00:29:47,700 Thank you for stalling there. 624 00:29:47,700 --> 00:29:48,970 AUDIENCE: Of course. 625 00:29:48,970 --> 00:29:49,710 DAVID J. MALAN: And I didn't catch your name. 626 00:29:49,710 --> 00:29:50,250 What was your name? 627 00:29:50,250 --> 00:29:51,080 AUDIENCE: I'm Rave. 628 00:29:51,080 --> 00:29:51,450 DAVID J. MALAN: I'm sorry? 629 00:29:51,450 --> 00:29:52,440 AUDIENCE: Rave, like a party. 630 00:29:52,440 --> 00:29:53,130 DAVID J. MALAN: Rave, OK. 631 00:29:53,130 --> 00:29:53,672 Nice to meet. 632 00:29:53,672 --> 00:29:54,930 Come on over. 633 00:29:54,930 --> 00:29:56,760 So Rave has kindly volunteered now. 634 00:29:56,760 --> 00:29:58,725 And I'm going to give you an additional advantage this time. 635 00:29:58,725 --> 00:29:59,400 AUDIENCE: OK. 636 00:29:59,400 --> 00:30:03,180 DAVID J. MALAN: Unbeknownst to you, I now took numbers behind the doors, 637 00:30:03,180 --> 00:30:04,530 but I sorted them for you. 638 00:30:04,530 --> 00:30:06,120 So they're not in the same random order like they 639 00:30:06,120 --> 00:30:08,162 were for [? Nomira. ?] You now have the advantage 640 00:30:08,162 --> 00:30:10,678 to know that the numbers are sorted from small to big. 641 00:30:10,678 --> 00:30:11,220 AUDIENCE: OK. 642 00:30:11,220 --> 00:30:15,180 DAVID J. MALAN: Given that, and given perhaps what we talked about in week zero 643 00:30:15,180 --> 00:30:19,140 with the phone book, where might you propose we begin the story this time? 644 00:30:19,140 --> 00:30:20,850 With which locker? 645 00:30:20,850 --> 00:30:21,808 AUDIENCE: To find zero? 646 00:30:21,808 --> 00:30:23,600 DAVID J. MALAN: Let's find number six this time. 647 00:30:23,600 --> 00:30:24,940 Let's make things interesting. 648 00:30:24,940 --> 00:30:26,200 AUDIENCE: OK. 649 00:30:26,200 --> 00:30:27,300 I'll start in the middle. 650 00:30:27,300 --> 00:30:28,050 DAVID J. MALAN: OK, so the middle. 651 00:30:28,050 --> 00:30:28,990 There's seven total. 652 00:30:28,990 --> 00:30:29,490 So-- 653 00:30:29,490 --> 00:30:29,670 AUDIENCE: OK. 654 00:30:29,670 --> 00:30:30,420 DAVID J. MALAN: --that would be right here. 655 00:30:30,420 --> 00:30:30,920 Go ahead. 656 00:30:30,920 --> 00:30:32,460 Open that up. 657 00:30:32,460 --> 00:30:34,320 And you find, sadly, the number five. 658 00:30:34,320 --> 00:30:36,150 So what do you know now? 659 00:30:36,150 --> 00:30:37,257 AUDIENCE: I know to go up. 660 00:30:37,257 --> 00:30:37,840 DAVID J. MALAN: OK. 661 00:30:37,840 --> 00:30:38,382 AUDIENCE: OK. 662 00:30:38,382 --> 00:30:40,465 DAVID J. MALAN: All right, and just to keep it uniform, 663 00:30:40,465 --> 00:30:43,080 just like I did, I opened to the right half of the phone book. 664 00:30:43,080 --> 00:30:43,350 AUDIENCE: Yes. 665 00:30:43,350 --> 00:30:44,880 DAVID J. MALAN: Let's keep it similar. 666 00:30:44,880 --> 00:30:45,390 Yeah. 667 00:30:45,390 --> 00:30:46,223 AUDIENCE: All right. 668 00:30:46,223 --> 00:30:48,182 DAVID J. MALAN: All right, and, uh, a little too far 669 00:30:48,182 --> 00:30:50,070 even though I know you wanted to go one over. 670 00:30:50,070 --> 00:30:51,080 AUDIENCE: All good, all good. 671 00:30:51,080 --> 00:30:52,800 DAVID J. MALAN: And now we're going to go which direction? 672 00:30:52,800 --> 00:30:54,217 AUDIENCE: Over here in the middle. 673 00:30:54,217 --> 00:30:56,320 DAVID J. MALAN: Right, and voila, the number six. 674 00:30:56,320 --> 00:30:57,840 All right, so very nicely done. 675 00:30:57,840 --> 00:31:00,620 676 00:31:00,620 --> 00:31:02,320 A little stressful for you as well. 677 00:31:02,320 --> 00:31:03,210 Thank you again. 678 00:31:03,210 --> 00:31:05,790 So here we see by nature of the locker door 679 00:31:05,790 --> 00:31:10,350 still being open sort of an artifact of the greater efficiency, 680 00:31:10,350 --> 00:31:13,560 it would seem, of this algorithm because now that Rave 681 00:31:13,560 --> 00:31:16,470 was given the assumption that these numbers are sorted from small 682 00:31:16,470 --> 00:31:20,310 on the left to large on the right, she was able to apply that same divide 683 00:31:20,310 --> 00:31:23,610 and conquer algorithm from week zero which we're now going to give a name-- 684 00:31:23,610 --> 00:31:25,200 binary search. 685 00:31:25,200 --> 00:31:28,650 And simply by starting in the middle and realizing, 686 00:31:28,650 --> 00:31:32,670 OK, too small, then by going to the right half and realizing, oh, 687 00:31:32,670 --> 00:31:35,820 went a little too far, then by going to the left half, which, 688 00:31:35,820 --> 00:31:39,660 Rave able to find in just three steps instead of seven 689 00:31:39,660 --> 00:31:43,720 the number six in this case that we were actually searching for. 690 00:31:43,720 --> 00:31:47,890 So you can see that this would seem to be more efficient. 691 00:31:47,890 --> 00:31:50,700 Let's consider for just a moment is it correct. 692 00:31:50,700 --> 00:31:56,250 If I had used different numbers but still sorted them from left to right, 693 00:31:56,250 --> 00:31:59,378 would it still have worked this algorithm? 694 00:31:59,378 --> 00:32:00,420 You're nodding your head. 695 00:32:00,420 --> 00:32:01,170 Can I call on you? 696 00:32:01,170 --> 00:32:03,920 Like, why would it still have worked, do you think? 697 00:32:03,920 --> 00:32:06,700 AUDIENCE: [INAUDIBLE] 698 00:32:06,700 --> 00:32:08,450 DAVID J. MALAN: Yeah, so so long as the numbers 699 00:32:08,450 --> 00:32:10,760 are always in the same order from left to right 700 00:32:10,760 --> 00:32:13,970 or, heck, they could even be in reverse order, so long as it's consistent, 701 00:32:13,970 --> 00:32:18,410 the decisions that Rave was making-- if greater than, else, if less than-- 702 00:32:18,410 --> 00:32:20,820 would guide us to the solution no matter what. 703 00:32:20,820 --> 00:32:23,460 And it would seem to take fewer steps. 704 00:32:23,460 --> 00:32:26,220 So if we consider now the pseudo code for this algorithm, 705 00:32:26,220 --> 00:32:28,530 let's take a look how we might describe binary search. 706 00:32:28,530 --> 00:32:31,400 So binary search we might describe with something like this. 707 00:32:31,400 --> 00:32:34,640 If the number is behind the middle door, which is where Rave began, 708 00:32:34,640 --> 00:32:36,710 then we can just return true. 709 00:32:36,710 --> 00:32:40,290 Else if the number is less than the middle door, 710 00:32:40,290 --> 00:32:42,800 so if six is less than whatever is behind the middle door, 711 00:32:42,800 --> 00:32:45,140 then Rave would have searched the left half. 712 00:32:45,140 --> 00:32:47,690 Else if the number is greater than the middle door, 713 00:32:47,690 --> 00:32:49,700 Rave would have searched the right half. 714 00:32:49,700 --> 00:32:53,840 Else, if there are no doors-- and we'll see in a moment why I put 715 00:32:53,840 --> 00:32:55,710 this up top just to keep things clean. 716 00:32:55,710 --> 00:32:59,390 If there's no doors, what should Rave have presumably returned immediately 717 00:32:59,390 --> 00:33:02,870 if I gave her no lockers to work with? 718 00:33:02,870 --> 00:33:03,920 Just returned false. 719 00:33:03,920 --> 00:33:06,020 But this is an important case to consider 720 00:33:06,020 --> 00:33:09,980 because if in the process of searching by locker by locker, 721 00:33:09,980 --> 00:33:14,630 we might have whittled down the problem from seven doors to three doors 722 00:33:14,630 --> 00:33:17,370 to one door to zero doors-- and at that point, 723 00:33:17,370 --> 00:33:19,230 we might have had no doors left to search. 724 00:33:19,230 --> 00:33:22,040 So we have to naturally have a scenario for just considering 725 00:33:22,040 --> 00:33:23,120 if there were no doors. 726 00:33:23,120 --> 00:33:26,910 So it's not to say that maybe I don't give Rave any doors to begin with. 727 00:33:26,910 --> 00:33:28,910 But as she divides and divides and divides, 728 00:33:28,910 --> 00:33:32,720 if she runs out of lockers to ask those questions of-- or a few weeks ago, 729 00:33:32,720 --> 00:33:35,660 if I ran out of phone book pages to tear in half, 730 00:33:35,660 --> 00:33:39,210 I too might have had to return false as in this case. 731 00:33:39,210 --> 00:33:42,500 So how can we now describe this a little more like C 732 00:33:42,500 --> 00:33:45,710 just to give ourselves a variable to start thinking and talking about? 733 00:33:45,710 --> 00:33:48,930 Well, I might talk about doors as being an array. 734 00:33:48,930 --> 00:33:52,490 And so if I want to express the middle door, I could just, in pseudo code, 735 00:33:52,490 --> 00:33:54,470 say doors bracket middle. 736 00:33:54,470 --> 00:33:56,270 I'm assuming that someone has done the math 737 00:33:56,270 --> 00:33:59,450 to figure out what the middle door is, but that's easy enough to do. 738 00:33:59,450 --> 00:34:01,940 And then doors, if the number we're looking for 739 00:34:01,940 --> 00:34:05,470 is less than doors bracket middle, then search door 740 00:34:05,470 --> 00:34:09,230 zero through doors middle minus 1. 741 00:34:09,230 --> 00:34:13,610 So again, this is a more pedantic way of taking what's a pretty intuitive idea-- 742 00:34:13,610 --> 00:34:16,159 search the left half, search the right half-- 743 00:34:16,159 --> 00:34:22,790 but start to now describe it in terms of actual indices or indexes 744 00:34:22,790 --> 00:34:24,593 like we did with our array notation. 745 00:34:24,593 --> 00:34:26,510 The last scenario, of course, is if the number 746 00:34:26,510 --> 00:34:28,760 is greater than the door's bracket middle, 747 00:34:28,760 --> 00:34:32,060 then Rave would have wanted to search the middle door plus 1-- 748 00:34:32,060 --> 00:34:37,250 so 1 over-- through doors n minus 1-- 749 00:34:37,250 --> 00:34:38,330 through n minus 1. 750 00:34:38,330 --> 00:34:41,389 So again, just a way of sort of describing a little more syntactically 751 00:34:41,389 --> 00:34:42,989 what it is that's going on. 752 00:34:42,989 --> 00:34:46,820 So how might we translate this now into big O notation? 753 00:34:46,820 --> 00:34:53,870 Well, in the worst case, how many steps total might Rave's binary search 754 00:34:53,870 --> 00:34:55,070 algorithm have taken? 755 00:34:55,070 --> 00:34:59,030 Given seven doors or given more generically n doors, 756 00:34:59,030 --> 00:35:03,620 how many times could she go left or go right before finding herself with one 757 00:35:03,620 --> 00:35:06,110 or no doors left? 758 00:35:06,110 --> 00:35:08,787 What's the way to think about that? 759 00:35:08,787 --> 00:35:09,620 Yeah, in the middle? 760 00:35:09,620 --> 00:35:11,150 AUDIENCE: Log n. 761 00:35:11,150 --> 00:35:11,990 DAVID J. MALAN: Log n. 762 00:35:11,990 --> 00:35:13,280 So there's log n again. 763 00:35:13,280 --> 00:35:16,155 And even if you're not feeling wholly comfortable with your logarithm 764 00:35:16,155 --> 00:35:19,250 still, pretty much in programming and in computer science more generally, 765 00:35:19,250 --> 00:35:22,430 any time we talk about some algorithm that's dividing and conquering 766 00:35:22,430 --> 00:35:26,180 in half, in half, in half, or any other multiple, 767 00:35:26,180 --> 00:35:28,580 it's probably involving logarithms in some sense. 768 00:35:28,580 --> 00:35:31,400 And log base n essentially refers to the number 769 00:35:31,400 --> 00:35:37,160 of times you can divide n by 2 until you bottom out at just a single door 770 00:35:37,160 --> 00:35:39,410 or equivalently zero doors left. 771 00:35:39,410 --> 00:35:40,370 So log n. 772 00:35:40,370 --> 00:35:44,270 So we might say that indeed, binary search is in big O of log n 773 00:35:44,270 --> 00:35:48,440 because the door that Rave opened last, this one, 774 00:35:48,440 --> 00:35:50,360 happened to be three doors away. 775 00:35:50,360 --> 00:35:52,790 And actually, if you do the math here, that roughly 776 00:35:52,790 --> 00:35:54,600 works out to be exactly that case. 777 00:35:54,600 --> 00:35:58,640 If we add one, that's sort of out of seven doors or roughly eight, 778 00:35:58,640 --> 00:36:01,940 we were able to search it in just three total steps. 779 00:36:01,940 --> 00:36:03,980 What about omega notation, though? 780 00:36:03,980 --> 00:36:07,220 Like, in the best case, Rave might have gotten lucky. 781 00:36:07,220 --> 00:36:09,170 She opened the door, and there it is. 782 00:36:09,170 --> 00:36:14,970 So how might we describe a lower bound on the running time of binary search. 783 00:36:14,970 --> 00:36:15,470 Yeah. 784 00:36:15,470 --> 00:36:16,125 AUDIENCE: 1. 785 00:36:16,125 --> 00:36:17,000 DAVID J. MALAN: Say again? 786 00:36:17,000 --> 00:36:17,840 AUDIENCE: 1. 787 00:36:17,840 --> 00:36:19,220 DAVID J. MALAN: Omega of 1. 788 00:36:19,220 --> 00:36:23,810 So here too, we see that in some cases binary search and linear search, eh, 789 00:36:23,810 --> 00:36:25,700 like, they're pretty equivalent. 790 00:36:25,700 --> 00:36:30,830 And so this is why sometimes compelling to consider both the best 791 00:36:30,830 --> 00:36:33,290 case in the worst case because honestly, in general, 792 00:36:33,290 --> 00:36:35,600 who really cares if you just get lucky once in a while 793 00:36:35,600 --> 00:36:37,280 and your algorithm is super fast? 794 00:36:37,280 --> 00:36:40,250 What you probably care about is what's the worst case. 795 00:36:40,250 --> 00:36:41,390 How long are my users-- 796 00:36:41,390 --> 00:36:45,170 how long am I going to be sitting there watching some spinning hourglass 797 00:36:45,170 --> 00:36:50,807 or beach ball trying to give myself an answer to a pretty big problem? 798 00:36:50,807 --> 00:36:53,640 Well, odds are, you're going to generally care about big O notation. 799 00:36:53,640 --> 00:36:55,430 So indeed, moving forward, will generally 800 00:36:55,430 --> 00:36:58,730 talk about the running time of algorithms often in terms of big O, 801 00:36:58,730 --> 00:37:00,780 a little less so in terms of omega. 802 00:37:00,780 --> 00:37:03,140 But understanding the range can be important 803 00:37:03,140 --> 00:37:08,700 depending on the nature of the data that you're going to actually be given here. 804 00:37:08,700 --> 00:37:11,510 All right let me pause and see if there is any questions. 805 00:37:11,510 --> 00:37:14,200 806 00:37:14,200 --> 00:37:16,420 Any questions here? 807 00:37:16,420 --> 00:37:18,850 Yes, thank you. 808 00:37:18,850 --> 00:37:21,430 AUDIENCE: So this method is clearly more efficient, 809 00:37:21,430 --> 00:37:26,440 but it requires that the information is all compiled in a certain order. 810 00:37:26,440 --> 00:37:29,770 How do you ensure that you can compile information 811 00:37:29,770 --> 00:37:31,265 in a particular order at scale? 812 00:37:31,265 --> 00:37:33,140 DAVID J. MALAN: Yeah, it's a really good question. 813 00:37:33,140 --> 00:37:36,015 And if I can generalize it, how do you guarantee that you can do this 814 00:37:36,015 --> 00:37:38,560 at scale, which algorithm is better? 815 00:37:38,560 --> 00:37:41,440 I've sort of led us down this road of implying 816 00:37:41,440 --> 00:37:43,540 that Rave's second algorithm, binary search, 817 00:37:43,540 --> 00:37:45,580 is better because it's so much faster. 818 00:37:45,580 --> 00:37:49,600 It's log of n in the worst case instead of big O of n. 819 00:37:49,600 --> 00:37:53,230 But Rave was given an advantage when she came up here in that the doors were 820 00:37:53,230 --> 00:37:54,222 already sorted. 821 00:37:54,222 --> 00:37:55,930 And so that sort of invites the question, 822 00:37:55,930 --> 00:37:57,670 well, given a whole bunch of random data, 823 00:37:57,670 --> 00:38:00,710 either a small data set or, heck, something Google sized with millions, 824 00:38:00,710 --> 00:38:04,540 billions of pieces of data, should you sort it first 825 00:38:04,540 --> 00:38:07,150 from smallest to largest and then search? 826 00:38:07,150 --> 00:38:11,920 Or should you just dive right in and search it linearly? 827 00:38:11,920 --> 00:38:13,638 Like, how might you think about that? 828 00:38:13,638 --> 00:38:15,430 If you are Google, for instance, and you've 829 00:38:15,430 --> 00:38:19,090 got millions, billions of web pages, should they just go with linear search 830 00:38:19,090 --> 00:38:21,850 because it's always going to work even though it might be slow? 831 00:38:21,850 --> 00:38:24,820 Or should they invest the time in sorting all of that data-- 832 00:38:24,820 --> 00:38:26,630 we'll see how in a bit-- 833 00:38:26,630 --> 00:38:28,900 and then search it more efficiently? 834 00:38:28,900 --> 00:38:31,438 Like, how do you decide between those options? 835 00:38:31,438 --> 00:38:33,730 AUDIENCE: If you're sorting the data, then wouldn't you 836 00:38:33,730 --> 00:38:36,573 have to go through all of the data? 837 00:38:36,573 --> 00:38:38,740 DAVID J. MALAN: Yeah, if you had to sort the data first-- 838 00:38:38,740 --> 00:38:40,700 and we don't yet formally know how to do this. 839 00:38:40,700 --> 00:38:43,117 But obviously, as humans, we could probably figure it out. 840 00:38:43,117 --> 00:38:45,280 You do have to look at all of the data anyway. 841 00:38:45,280 --> 00:38:48,760 And so you're sort of wasting your time if you're sorting it only 842 00:38:48,760 --> 00:38:50,680 then to go in search it. 843 00:38:50,680 --> 00:38:52,898 But maybe it depends a bit more. 844 00:38:52,898 --> 00:38:54,940 Like, that's absolutely right, and if you're just 845 00:38:54,940 --> 00:38:58,060 searching for one thing in life, then that's probably a waste of time 846 00:38:58,060 --> 00:39:01,720 to sort it and then search it because you're just adding to the process. 847 00:39:01,720 --> 00:39:03,880 But what's another scenario in which you might not 848 00:39:03,880 --> 00:39:09,410 worry about that whereby it might make sense to sort it and then search? 849 00:39:09,410 --> 00:39:09,910 Yeah. 850 00:39:09,910 --> 00:39:16,580 AUDIENCE: [INAUDIBLE] you can go and use the other values as a way 851 00:39:16,580 --> 00:39:17,810 to find out what's happening. 852 00:39:17,810 --> 00:39:18,852 DAVID J. MALAN: Yeah, exactly. 853 00:39:18,852 --> 00:39:21,392 So if your problem is a Google-like problem where 854 00:39:21,392 --> 00:39:24,350 you have more than just one user who's searching for more than just one 855 00:39:24,350 --> 00:39:26,780 website page, probably you should incur the cost up front 856 00:39:26,780 --> 00:39:30,620 and sort the whole thing because every subsequent request thereafter 857 00:39:30,620 --> 00:39:32,810 is going to be faster, faster, faster because it's 858 00:39:32,810 --> 00:39:36,440 going to [INAUDIBLE] algorithm of binary search, binary search, binary search 859 00:39:36,440 --> 00:39:39,320 that's going to add up to be way fewer steps 860 00:39:39,320 --> 00:39:41,610 than doing linear search multiple times. 861 00:39:41,610 --> 00:39:43,490 So again, kind of depends on the use case 862 00:39:43,490 --> 00:39:45,350 and kind of depends on how important it is. 863 00:39:45,350 --> 00:39:48,050 And this happens even in real world contexts. 864 00:39:48,050 --> 00:39:50,900 I think back always to graduate school, when I was writing some code 865 00:39:50,900 --> 00:39:52,610 to analyze some large data set. 866 00:39:52,610 --> 00:39:55,400 And honestly, it was actually easier at the time for me 867 00:39:55,400 --> 00:39:58,310 to write pretty inefficient but hopefully correct 868 00:39:58,310 --> 00:39:59,537 code because you know what? 869 00:39:59,537 --> 00:40:02,870 I could just go to sleep for eight hours and let it analyze this really big data 870 00:40:02,870 --> 00:40:03,370 set. 871 00:40:03,370 --> 00:40:06,335 I didn't have to bother writing more complex code to sort it just 872 00:40:06,335 --> 00:40:07,460 to run it more efficiently. 873 00:40:07,460 --> 00:40:07,960 Why? 874 00:40:07,960 --> 00:40:11,520 Because I was the only user, and I only needed to run these queries once. 875 00:40:11,520 --> 00:40:13,700 And so this was kind of a reasonable approach, 876 00:40:13,700 --> 00:40:17,550 reasonable until I woke up eight hours later and my code was incorrect. 877 00:40:17,550 --> 00:40:20,960 And now I had to spend another eight hours rerunning it after fixing it. 878 00:40:20,960 --> 00:40:22,910 But even there, you see an example where, 879 00:40:22,910 --> 00:40:24,890 what is your most precious resource? 880 00:40:24,890 --> 00:40:26,720 Is it time to run the code? 881 00:40:26,720 --> 00:40:28,970 Is it time to write the code? 882 00:40:28,970 --> 00:40:31,098 Is it the amount of memory the computer is using? 883 00:40:31,098 --> 00:40:33,890 These are all resources we'll start to talk about because it really 884 00:40:33,890 --> 00:40:36,080 depends on what your goals are. 885 00:40:36,080 --> 00:40:39,050 Any questions, then, on upper bounds, lower bounds, 886 00:40:39,050 --> 00:40:42,260 or each of these two searches, linear or binary? 887 00:40:42,260 --> 00:40:42,991 Yeah. 888 00:40:42,991 --> 00:40:45,580 AUDIENCE: So just, when you're calculating running time, 889 00:40:45,580 --> 00:40:50,317 does the sorting step count for that time? 890 00:40:50,317 --> 00:40:53,150 DAVID J. MALAN: When analyzing running time, does the sorting step count? 891 00:40:53,150 --> 00:40:55,310 If you want it to if you actually do it. 892 00:40:55,310 --> 00:40:56,900 At the moment, it did not apply. 893 00:40:56,900 --> 00:41:01,100 I just gave Rave the luxury of knowing that the data was sorted. 894 00:41:01,100 --> 00:41:04,520 But if I really wanted to charge her for the amount of time 895 00:41:04,520 --> 00:41:07,730 it took to find that number six, I should have added the time 896 00:41:07,730 --> 00:41:09,712 to sort plus the time to search. 897 00:41:09,712 --> 00:41:11,420 And in fact, that's a road we'll go down. 898 00:41:11,420 --> 00:41:13,170 Why don't we go ahead and pace ourselves as before? 899 00:41:13,170 --> 00:41:14,587 Let's take a 10 minute break here. 900 00:41:14,587 --> 00:41:17,130 And when we come back, we'll write some actual code. 901 00:41:17,130 --> 00:41:21,038 So we've seen a couple of searches-- linear search and binary search, which, 902 00:41:21,038 --> 00:41:22,580 to be fair, we saw back in week zero. 903 00:41:22,580 --> 00:41:25,790 But let's actually translate at least one of those now to some code 904 00:41:25,790 --> 00:41:28,970 using this building block from last week where we can actually 905 00:41:28,970 --> 00:41:32,820 define an array if we want, like an array of integers called numbers. 906 00:41:32,820 --> 00:41:34,550 So let me switch over to VS Code here. 907 00:41:34,550 --> 00:41:37,910 Let me go ahead and start a program called numbers.c. 908 00:41:37,910 --> 00:41:40,940 And in numbers.c, let me go ahead here. 909 00:41:40,940 --> 00:41:44,840 And how about let's include our familiar header files? 910 00:41:44,840 --> 00:41:46,670 So css50.h. 911 00:41:46,670 --> 00:41:51,330 I'll include standardio.h that we can get input and print input if we want. 912 00:41:51,330 --> 00:41:54,410 And now I'm going to go ahead and give myself int main void. 913 00:41:54,410 --> 00:41:56,100 No command line arguments today. 914 00:41:56,100 --> 00:41:57,232 So I'll leave that as void. 915 00:41:57,232 --> 00:41:58,940 And I'm going to go ahead and give myself 916 00:41:58,940 --> 00:42:01,410 an array of how about seven numbers? 917 00:42:01,410 --> 00:42:04,220 So I'll call it int number 7. 918 00:42:04,220 --> 00:42:06,260 And then I can fill this array with numbers. 919 00:42:06,260 --> 00:42:10,100 Like, numbers brackets 0 can be the number 4, and numbers bracket 1 920 00:42:10,100 --> 00:42:14,308 could be the number 6, and numbers bracket 2 can be the number 8. 921 00:42:14,308 --> 00:42:16,850 And this is the same list that we saw with [? Nomira ?] a bit 922 00:42:16,850 --> 00:42:19,170 ago where it was 4, then 6, then 8. 923 00:42:19,170 --> 00:42:19,920 But you know what? 924 00:42:19,920 --> 00:42:22,340 There's actually another syntax I can show you here. 925 00:42:22,340 --> 00:42:25,220 If you know in advance in a C program that you 926 00:42:25,220 --> 00:42:30,390 want an array of certain values and you know therefore how many of those values 927 00:42:30,390 --> 00:42:33,410 you want, you can actually do this little trick using curly braces. 928 00:42:33,410 --> 00:42:36,560 You can say, don't worry about how big this is. 929 00:42:36,560 --> 00:42:39,620 It's going to be implicit by way of these curly braces. 930 00:42:39,620 --> 00:42:44,610 Here, I can do 4, 6, 8, 2, 7, 5, 0, close curly brace. 931 00:42:44,610 --> 00:42:46,730 So it's a somewhat new use of curly braces. 932 00:42:46,730 --> 00:42:50,780 But this has the effect of giving me an array called numbers inside 933 00:42:50,780 --> 00:42:52,440 of which are a whole bunch of integers. 934 00:42:52,440 --> 00:42:53,030 How many? 935 00:42:53,030 --> 00:42:56,960 The compiler can infer it from what's ever inside these curly braces. 936 00:42:56,960 --> 00:43:00,140 And it seems to be of size 1, 2, 3, 4, 5, 6, 7. 937 00:43:00,140 --> 00:43:05,510 And all seven elements will be initialized with 4, 6, 8, 2, 7, 5, 0 938 00:43:05,510 --> 00:43:06,330 respectively. 939 00:43:06,330 --> 00:43:08,780 So just a minor optimization code wise to tighten up 940 00:43:08,780 --> 00:43:12,090 what would have otherwise been like eight separate lines of code. 941 00:43:12,090 --> 00:43:15,080 Now let's go ahead and implement linear search, as we called it. 942 00:43:15,080 --> 00:43:18,122 And you can do this in a bunch of ways, but I'm going to do it like this. 943 00:43:18,122 --> 00:43:24,830 For int i get 0, i is less than 7 i plus plus. 944 00:43:24,830 --> 00:43:27,800 Then inside of my loop, I'm going to ask the question, well, 945 00:43:27,800 --> 00:43:33,020 if the numbers at location i equals equals, as we asked of 946 00:43:33,020 --> 00:43:36,680 [? Nomira, ?] the number 0, then I'm going to go ahead and do something 947 00:43:36,680 --> 00:43:41,450 like printf found backslash n. 948 00:43:41,450 --> 00:43:43,280 And then I'm going to return 0. 949 00:43:43,280 --> 00:43:46,040 Just because of last week's discussion of returning 950 00:43:46,040 --> 00:43:50,150 a value for main when all is well, I'm going to return 0 by convention 951 00:43:50,150 --> 00:43:53,060 just to signal that indeed, I found what I'm looking for. 952 00:43:53,060 --> 00:44:00,560 Otherwise, on what line do I want to go and add a printf, like, not found 953 00:44:00,560 --> 00:44:02,600 and return something other than 0? 954 00:44:02,600 --> 00:44:06,860 Right, I don't think I want an else here per our pseudo code earlier. 955 00:44:06,860 --> 00:44:11,030 So on what line would you prefer I sort of insert a default scenario 956 00:44:11,030 --> 00:44:14,490 of not found and I'll return an error? 957 00:44:14,490 --> 00:44:15,795 Yeah, over here? 958 00:44:15,795 --> 00:44:19,343 [INTERPOSING VOICES] 959 00:44:19,343 --> 00:44:20,010 DAVID J. MALAN: Nice. 960 00:44:20,010 --> 00:44:21,718 So at the end of the for loop because you 961 00:44:21,718 --> 00:44:23,550 want to give the program or our volunteer 962 00:44:23,550 --> 00:44:26,980 earlier a chance to go through all of the doors, all of the numbers. 963 00:44:26,980 --> 00:44:29,700 But if you go through the whole thing, through the whole loop, 964 00:44:29,700 --> 00:44:33,630 at the very end, you probably just want to conclude not found backslash n 965 00:44:33,630 --> 00:44:36,060 and then return something like positive 1 966 00:44:36,060 --> 00:44:38,040 just to signify that an error happened. 967 00:44:38,040 --> 00:44:40,170 And again, this was a minor detail last week. 968 00:44:40,170 --> 00:44:44,370 Any time main is successful, the programming convention is to return 0. 969 00:44:44,370 --> 00:44:45,850 That means all as well. 970 00:44:45,850 --> 00:44:49,020 And if something goes wrong, like you didn't find what you're looking for, 971 00:44:49,020 --> 00:44:52,950 you might return something other than 0, like positive 1, maybe positive 2, 972 00:44:52,950 --> 00:44:55,202 or even negative numbers if you want. 973 00:44:55,202 --> 00:44:57,160 All right, well, let me go ahead and save this. 974 00:44:57,160 --> 00:44:58,800 Let me do make numbers. 975 00:44:58,800 --> 00:45:01,230 Hopefully no syntax errors. 976 00:45:01,230 --> 00:45:02,520 All good so far. 977 00:45:02,520 --> 00:45:05,040 dot slash numbers, enter. 978 00:45:05,040 --> 00:45:07,500 All right, and it's found, as I would hope it would be. 979 00:45:07,500 --> 00:45:10,710 And just as a little check, let's search for something that's definitely 980 00:45:10,710 --> 00:45:14,790 not there, like the number negative 1. 981 00:45:14,790 --> 00:45:17,640 Let me go ahead and recompile the code with make numbers. 982 00:45:17,640 --> 00:45:19,890 Let me rerun the code with dot slash numbers 983 00:45:19,890 --> 00:45:21,900 and hopefully-- whew, OK, not found. 984 00:45:21,900 --> 00:45:24,473 So proof by example seems to be working correctly. 985 00:45:24,473 --> 00:45:26,640 But let's make things a little more interesting now. 986 00:45:26,640 --> 00:45:29,700 Right now, I'm using just an array of integers. 987 00:45:29,700 --> 00:45:34,090 Let me go ahead and introduce maybe an array of strings instead. 988 00:45:34,090 --> 00:45:37,360 And maybe this time, I'll store a bunch of names and not just integers 989 00:45:37,360 --> 00:45:39,100 but actual strings of names. 990 00:45:39,100 --> 00:45:40,330 So how might I do this? 991 00:45:40,330 --> 00:45:42,130 Well, let me go back to my code here. 992 00:45:42,130 --> 00:45:46,470 I'm going to switch us over to maybe a file called names.c. 993 00:45:46,470 --> 00:45:50,130 And in here, I'll go ahead and include cs50.h. 994 00:45:50,130 --> 00:45:53,550 I'll include standardio.h. 995 00:45:53,550 --> 00:45:57,030 And I'm going to go ahead and for now include a new friend 996 00:45:57,030 --> 00:46:00,478 from last week, string.h, which gives me some string-related functionality. 997 00:46:00,478 --> 00:46:03,270 Int main void because I'm not going to bother with any command line 998 00:46:03,270 --> 00:46:04,480 arguments for now. 999 00:46:04,480 --> 00:46:09,330 And now if I want an array of strings, I could do something like this-- 1000 00:46:09,330 --> 00:46:12,120 string names bracket 7. 1001 00:46:12,120 --> 00:46:14,100 And then I could start doing like before. 1002 00:46:14,100 --> 00:46:17,580 Names bracket 0 could be someone like Bill, and names bracket 1 1003 00:46:17,580 --> 00:46:20,740 could be someone like Charlie and so forth. 1004 00:46:20,740 --> 00:46:24,352 But there's this new improvement I can make. 1005 00:46:24,352 --> 00:46:27,060 Let me just let the compiler figure out how many names there are. 1006 00:46:27,060 --> 00:46:32,550 And using curly braces, I'll do Bill and then Charlie and then Fred and then 1007 00:46:32,550 --> 00:46:39,690 George and then Ginny and then Percy and then Ron if there's the pattern there. 1008 00:46:39,690 --> 00:46:42,930 All right, so now I have these seven names as strings. 1009 00:46:42,930 --> 00:46:44,230 Let's do something similar. 1010 00:46:44,230 --> 00:46:47,730 So for int, i get 0. 1011 00:46:47,730 --> 00:46:51,330 i is less than 7 as before, i plus plus as before. 1012 00:46:51,330 --> 00:46:54,900 And inside of the, loop lets this time check for the string in question, 1013 00:46:54,900 --> 00:46:57,630 and suppose we're searching for Ron arbitrarily. 1014 00:46:57,630 --> 00:47:00,090 He is there, so we should eventually find him. 1015 00:47:00,090 --> 00:47:07,530 Let me go ahead and say if names bracket i equals quote unquote Ron, then inside 1016 00:47:07,530 --> 00:47:11,340 of my if condition, I'm going to say printf found just like before. 1017 00:47:11,340 --> 00:47:13,590 And I'm going to return 0 just because all is well. 1018 00:47:13,590 --> 00:47:16,390 And I'm going to take your advice from the get go this time 1019 00:47:16,390 --> 00:47:20,560 and, at the end of the loop, print out not found because if I get this far, 1020 00:47:20,560 --> 00:47:23,670 I have not printed found, and I have not returned already. 1021 00:47:23,670 --> 00:47:27,840 So I'm just going to go ahead and return 1 after printing not found. 1022 00:47:27,840 --> 00:47:30,420 All right, let me go ahead and cross my fingers as always. 1023 00:47:30,420 --> 00:47:33,310 Make names this time. 1024 00:47:33,310 --> 00:47:36,300 And it doesn't seem to like my code here. 1025 00:47:36,300 --> 00:47:38,370 This is perhaps a new error that you might not 1026 00:47:38,370 --> 00:47:41,080 have seen yet in names.c line 11. 1027 00:47:41,080 --> 00:47:43,920 So that's this line here, my if condition. 1028 00:47:43,920 --> 00:47:47,970 Result of comparison against a string literal is unspecified. 1029 00:47:47,970 --> 00:47:50,392 Use an explicit string comparison function instead. 1030 00:47:50,392 --> 00:47:53,100 I mean, that's kind of a mouthful, and the first time you see it, 1031 00:47:53,100 --> 00:47:55,600 you're probably not going to know how to make sense of that. 1032 00:47:55,600 --> 00:47:59,130 But it does kind of draw our attention to something being awry 1033 00:47:59,130 --> 00:48:03,840 with the equality checking here, with equal equals and Ron. 1034 00:48:03,840 --> 00:48:06,240 And here's where again we've been telling 1035 00:48:06,240 --> 00:48:08,700 sort of a white lie for the past couple of weeks. 1036 00:48:08,700 --> 00:48:12,900 Strings are a thing in C. Strings are a thing in programming. 1037 00:48:12,900 --> 00:48:14,670 But recall from last week, I did disclaim 1038 00:48:14,670 --> 00:48:16,650 there's no such thing as a string data type 1039 00:48:16,650 --> 00:48:20,670 technically because it's not a primitive in the way an int 1040 00:48:20,670 --> 00:48:24,210 and a float and a bool are that are sort of built into the language. 1041 00:48:24,210 --> 00:48:28,170 You can't just use equation equals to compare two strings. 1042 00:48:28,170 --> 00:48:31,110 You actually have to use a special function that's 1043 00:48:31,110 --> 00:48:34,140 in this header file we talked briefly about last week. 1044 00:48:34,140 --> 00:48:36,780 In that header file was string length or strlen. 1045 00:48:36,780 --> 00:48:39,490 But there's other functions instead as well. 1046 00:48:39,490 --> 00:48:43,080 Let me, in fact, go ahead and open up the manual pages. 1047 00:48:43,080 --> 00:48:46,200 And if we go to string.h-- 1048 00:48:46,200 --> 00:48:47,760 let me scroll down a bit. 1049 00:48:47,760 --> 00:48:52,800 In string.h you can perhaps infer what function will probably take 1050 00:48:52,800 --> 00:48:56,612 the place of equals equals for today. 1051 00:48:56,612 --> 00:48:57,570 What do we want to use? 1052 00:48:57,570 --> 00:48:57,930 Yeah. 1053 00:48:57,930 --> 00:48:58,680 AUDIENCE: Strcmp? 1054 00:48:58,680 --> 00:49:03,305 DAVID J. MALAN: So strcmp, S-T-R-C-M-P, which apparently compares two strings. 1055 00:49:03,305 --> 00:49:05,430 And if I click on that, we'll see more information. 1056 00:49:05,430 --> 00:49:09,510 And indeed, if I click on strcmp, we'll see under the synopsis 1057 00:49:09,510 --> 00:49:14,490 that, OK, I need to use the CS50 header file and string.h, as I already have. 1058 00:49:14,490 --> 00:49:17,850 Here is its prototype, which is telling me 1059 00:49:17,850 --> 00:49:21,360 that strcmp takes two strings, S1 and S2, that 1060 00:49:21,360 --> 00:49:22,890 are presumably going to be compared. 1061 00:49:22,890 --> 00:49:25,230 And it returns an integer, which is interesting. 1062 00:49:25,230 --> 00:49:26,430 So let's read on. 1063 00:49:26,430 --> 00:49:29,730 The description of this function is that it compares two strings case 1064 00:49:29,730 --> 00:49:30,870 sensitively. 1065 00:49:30,870 --> 00:49:34,110 So uppercase or lowercase matters, just FYI. 1066 00:49:34,110 --> 00:49:36,570 And then let's look it the return value here. 1067 00:49:36,570 --> 00:49:40,860 The return value of this function returns an int less than 0 1068 00:49:40,860 --> 00:49:48,480 if S1 comes before S2, 0 if S1 is the same as S2, or an int greater than 0 1069 00:49:48,480 --> 00:49:50,940 if S1 comes after S2. 1070 00:49:50,940 --> 00:49:54,780 So the reason that this function returns an integer and not just a 1071 00:49:54,780 --> 00:49:57,390 bool, true or false, is that it actually will 1072 00:49:57,390 --> 00:50:00,750 allow us to sort these things eventually because if you can tell me 1073 00:50:00,750 --> 00:50:04,980 if two strings come in this order or in this order or they're the same, 1074 00:50:04,980 --> 00:50:07,080 you need three possible return values. 1075 00:50:07,080 --> 00:50:08,910 And a bool, of course, only gives you two, 1076 00:50:08,910 --> 00:50:12,480 but an int gives you like 4 billion even though we just need the 3. 1077 00:50:12,480 --> 00:50:17,520 So 0 or a positive number or a negative number is what this function returns. 1078 00:50:17,520 --> 00:50:21,960 And the documentation goes on to explain what we mean by ASCIIbetical order. 1079 00:50:21,960 --> 00:50:25,260 Recall that capital A is 65, capital B is 66, 1080 00:50:25,260 --> 00:50:27,660 and it's those underlying ASCII or Unicode 1081 00:50:27,660 --> 00:50:31,050 numbers that a computer uses to figure out whether something comes before it 1082 00:50:31,050 --> 00:50:33,180 or after it like in the dictionary. 1083 00:50:33,180 --> 00:50:35,950 But for our purposes now, we only care about equality. 1084 00:50:35,950 --> 00:50:37,680 So I'm going to go ahead and do this. 1085 00:50:37,680 --> 00:50:41,820 If I want to compare names bracket i against Ron, 1086 00:50:41,820 --> 00:50:49,320 I use stir compare or strcmp, names bracket i comma, quote unquote, Ron. 1087 00:50:49,320 --> 00:50:51,510 So it's a little more involved than actually 1088 00:50:51,510 --> 00:50:55,830 using equals equals, which does work for integers, longs, 1089 00:50:55,830 --> 00:50:57,000 and certain other values. 1090 00:50:57,000 --> 00:51:00,850 But for strings, it turns out we need to use a more powerful function. 1091 00:51:00,850 --> 00:51:01,350 Why? 1092 00:51:01,350 --> 00:51:03,630 Well, last week, recall what a string really is. 1093 00:51:03,630 --> 00:51:06,220 It's an array of characters. 1094 00:51:06,220 --> 00:51:09,690 And so whereas you can use equals equals for single characters, 1095 00:51:09,690 --> 00:51:12,600 strcmp, as we'll eventually see, is going 1096 00:51:12,600 --> 00:51:14,438 to compare multiple characters for us. 1097 00:51:14,438 --> 00:51:15,480 There's more logic there. 1098 00:51:15,480 --> 00:51:19,570 There's a loop needed, and that's why it comes with the string library. 1099 00:51:19,570 --> 00:51:22,290 But it doesn't just work out of the box with equals equals alone. 1100 00:51:22,290 --> 00:51:26,267 That would literally be comparing two things, not two arrays of things. 1101 00:51:26,267 --> 00:51:28,350 And we'll come back to this next week as to what's 1102 00:51:28,350 --> 00:51:29,920 really going on under the hood. 1103 00:51:29,920 --> 00:51:34,140 So let me go ahead and fix one bug that I just realized I made. 1104 00:51:34,140 --> 00:51:39,450 I want to check if the return value of str compare is equal to 0 1105 00:51:39,450 --> 00:51:42,660 because per the documentation, that meant they're the same. 1106 00:51:42,660 --> 00:51:45,640 All right, let me go ahead and make names this time. 1107 00:51:45,640 --> 00:51:46,710 Now it compiles. 1108 00:51:46,710 --> 00:51:49,770 Dot slash names, Enter, found. 1109 00:51:49,770 --> 00:51:55,290 And just as a sanity check, let's check someone outside the family. 1110 00:51:55,290 --> 00:51:59,010 Searching now for Hermione after recompiling the code, 1111 00:51:59,010 --> 00:52:00,570 after rerunning the code. 1112 00:52:00,570 --> 00:52:02,280 And she's not, in fact, found. 1113 00:52:02,280 --> 00:52:05,220 So here's just a similar implementation of linear search 1114 00:52:05,220 --> 00:52:09,570 not for integers this time but instead for strings, 1115 00:52:09,570 --> 00:52:13,140 the subtlety really being we need a helper function, str compare, 1116 00:52:13,140 --> 00:52:17,640 to actually do the legwork for us of comparing two arrays of characters. 1117 00:52:17,640 --> 00:52:21,208 All right, questions on either of these implementations-- yeah, in the middle? 1118 00:52:21,208 --> 00:52:22,892 AUDIENCE: So, if I do [INAUDIBLE] 1119 00:52:22,892 --> 00:52:24,100 DAVID J. MALAN: Ah, good question. 1120 00:52:24,100 --> 00:52:28,260 If I had not fixed what I claimed was a mistake earlier and I did this-- 1121 00:52:28,260 --> 00:52:30,840 and we saw an example of this last week, actually. 1122 00:52:30,840 --> 00:52:37,470 If a function returns an integer, be it negative or positive or 0, 1123 00:52:37,470 --> 00:52:40,740 when you get back 0, the expression, the Boolean expression, 1124 00:52:40,740 --> 00:52:42,040 will be considered false. 1125 00:52:42,040 --> 00:52:44,940 So 0 equals false always. 1126 00:52:44,940 --> 00:52:49,470 If a function returns any positive number, or any negative number, 1127 00:52:49,470 --> 00:52:52,770 that's going to be interpreted as true even 1128 00:52:52,770 --> 00:52:57,510 if it's positive or negative, whether it's 1, negative 1, 2, negative 2. 1129 00:52:57,510 --> 00:53:01,570 And so if I did this, this would be saying the opposite. 1130 00:53:01,570 --> 00:53:06,930 So if I were to say this, if str compare of names bracket i and Hermione, that's 1131 00:53:06,930 --> 00:53:13,393 implicitly like saying this does not equal 0, or it means sort of is true, 1132 00:53:13,393 --> 00:53:15,810 but you don't want to check for true because, again, we're 1133 00:53:15,810 --> 00:53:17,350 comparing integers here. 1134 00:53:17,350 --> 00:53:21,000 So the reason I did 0 here in this case is 1135 00:53:21,000 --> 00:53:24,610 that it explicitly checks for the return value that means they're the same. 1136 00:53:24,610 --> 00:53:25,110 And yeah. 1137 00:53:25,110 --> 00:53:26,055 Follow up? 1138 00:53:26,055 --> 00:53:30,948 AUDIENCE: [INAUDIBLE] 1139 00:53:30,948 --> 00:53:32,990 DAVID J. MALAN: Yes, you might not have seen this yet, 1140 00:53:32,990 --> 00:53:36,580 but you can express the equivalent because if you 1141 00:53:36,580 --> 00:53:40,300 want to check if this is false, you can actually 1142 00:53:40,300 --> 00:53:43,300 use an exclamation point, known as a bang in programming, 1143 00:53:43,300 --> 00:53:44,960 that inverts the meaning. 1144 00:53:44,960 --> 00:53:48,163 So false becomes true, true becomes false. 1145 00:53:48,163 --> 00:53:50,080 So this would be another way of expressing it. 1146 00:53:50,080 --> 00:53:54,940 This is arguably a worse design, though, because the documentation explicitly 1147 00:53:54,940 --> 00:53:58,660 says you should be checking for 0 or a positive value 1148 00:53:58,660 --> 00:54:01,750 or a negative value, and this little trick, while correct, 1149 00:54:01,750 --> 00:54:05,500 and I think you can make a reasonable case for it, sort of hides that detail. 1150 00:54:05,500 --> 00:54:07,420 And I would argue instead for the first way, 1151 00:54:07,420 --> 00:54:09,607 checking for equals equals 0 instead. 1152 00:54:09,607 --> 00:54:11,440 And if that's a little subtle, not to worry. 1153 00:54:11,440 --> 00:54:16,240 We'll come back to little syntactic tricks like that before long. 1154 00:54:16,240 --> 00:54:20,770 Other questions on linear search in these two forms. 1155 00:54:20,770 --> 00:54:22,450 Is there another hand or hands? 1156 00:54:22,450 --> 00:54:24,070 Two hands? 1157 00:54:24,070 --> 00:54:24,610 No? 1158 00:54:24,610 --> 00:54:25,900 OK, just holler if I missed. 1159 00:54:25,900 --> 00:54:28,012 So let's now actually take this one step further. 1160 00:54:28,012 --> 00:54:30,970 Suppose that we want to write a program that maybe implements something 1161 00:54:30,970 --> 00:54:35,110 a little more like a phone book that has both names and numbers and not 1162 00:54:35,110 --> 00:54:37,030 just integers but actual phone numbers. 1163 00:54:37,030 --> 00:54:39,340 Well, we could escalate things like this. 1164 00:54:39,340 --> 00:54:42,850 We could now have two arrays-- one called names, one called numbers. 1165 00:54:42,850 --> 00:54:45,370 And I'm going to use strings for the numbers now, 1166 00:54:45,370 --> 00:54:48,010 the phone numbers, because in most communities, 1167 00:54:48,010 --> 00:54:51,730 phone numbers might have dashes, pluses, parentheses, so something 1168 00:54:51,730 --> 00:54:55,030 that really looks more like a string even though we call it a phone number. 1169 00:54:55,030 --> 00:54:58,580 Probably don't want to use an int lest we throw away those kinds of details. 1170 00:54:58,580 --> 00:55:03,160 So let me switch back to VS Code here, and let's do one more program, this one 1171 00:55:03,160 --> 00:55:05,033 in a file called phonebook.c. 1172 00:55:05,033 --> 00:55:06,700 And now let me go ahead and do the same. 1173 00:55:06,700 --> 00:55:08,380 Let me include cs50.h. 1174 00:55:08,380 --> 00:55:14,320 Let me include standardio.h, and let me include string.h. 1175 00:55:14,320 --> 00:55:17,380 I'm going to again do int main void. 1176 00:55:17,380 --> 00:55:20,710 And then inside of my program, I'm going to give myself two arrays-- 1177 00:55:20,710 --> 00:55:22,510 the efficient way this time. 1178 00:55:22,510 --> 00:55:25,000 String names will be just two of us this time. 1179 00:55:25,000 --> 00:55:28,390 How about Carter and me? 1180 00:55:28,390 --> 00:55:30,880 And then I'll give myself-- oops, typo already. 1181 00:55:30,880 --> 00:55:33,790 If I want this to be an array, I don't have to specify the number. 1182 00:55:33,790 --> 00:55:35,380 The compiler can count for me. 1183 00:55:35,380 --> 00:55:37,300 But I do need the square brackets. 1184 00:55:37,300 --> 00:55:43,750 Then for numbers, I'm again going to use a string array specifying with 1185 00:55:43,750 --> 00:55:49,510 the curly braces that how about Carter can be at 1-617-495-1000. 1186 00:55:49,510 --> 00:55:51,280 And how about my own number here-- 1187 00:55:51,280 --> 00:55:55,000 1-949-468-- oh pattern appearing-- 1188 00:55:55,000 --> 00:55:57,760 2750 will be mine. 1189 00:55:57,760 --> 00:55:58,600 Why mine? 1190 00:55:58,600 --> 00:56:00,530 Well, I'm just kind of lined things up. 1191 00:56:00,530 --> 00:56:03,460 So Carter's number is apparently first in this array, 1192 00:56:03,460 --> 00:56:06,800 and I'm claiming that he'll be first in this array, respectively. 1193 00:56:06,800 --> 00:56:09,610 I, David, will be the first-- the second in the names array 1194 00:56:09,610 --> 00:56:12,267 and second in the numbers array. 1195 00:56:12,267 --> 00:56:15,100 If you want to have a little fun with programming, feel free to text 1196 00:56:15,100 --> 00:56:17,270 or call me some time at that number. 1197 00:56:17,270 --> 00:56:20,950 So now let's actually use this data in some way. 1198 00:56:20,950 --> 00:56:24,368 Let's go ahead and actually search for my own name and number here. 1199 00:56:24,368 --> 00:56:24,910 So let me do. 1200 00:56:24,910 --> 00:56:27,490 For int i, get 0. 1201 00:56:27,490 --> 00:56:32,090 There's two of us this time-- so i less than 2 and then i plus plus as before. 1202 00:56:32,090 --> 00:56:34,480 And now I'm going to practice what I preached earlier, 1203 00:56:34,480 --> 00:56:38,440 and I'm going to use str compare to find my name in this case. 1204 00:56:38,440 --> 00:56:45,100 And I'm going to say if strcmp of names bracket i equals quote unquote David 1205 00:56:45,100 --> 00:56:48,740 and that equals 0, meaning they're the same, 1206 00:56:48,740 --> 00:56:51,610 then just as before, I'm going to go ahead and print something out. 1207 00:56:51,610 --> 00:56:53,320 But this time, I'm going to make the program more useful 1208 00:56:53,320 --> 00:56:55,100 and not just say found or not found. 1209 00:56:55,100 --> 00:56:59,050 Now I'm implementing a phone book, like the contacts app on iOS or Android. 1210 00:56:59,050 --> 00:57:02,380 So I'm going to say something like, quote unquote, found percent 1211 00:57:02,380 --> 00:57:08,830 s backslash n and then actually plug in numbers bracket i 1212 00:57:08,830 --> 00:57:12,370 to correspond to the current name bracket i. 1213 00:57:12,370 --> 00:57:14,225 And then I'll return 0 as before. 1214 00:57:14,225 --> 00:57:16,600 And then down here if we get all the way through the loop 1215 00:57:16,600 --> 00:57:20,120 and David's not there for some reason, I'm going to print as before not found 1216 00:57:20,120 --> 00:57:21,580 and then return 1. 1217 00:57:21,580 --> 00:57:26,590 So let me go ahead and compile this with make phone dot slash phonebook, 1218 00:57:26,590 --> 00:57:29,240 and it seems to have found the number. 1219 00:57:29,240 --> 00:57:33,130 So this code I'm going to claim is correct. 1220 00:57:33,130 --> 00:57:36,190 It's kind of stupid because I've just made a phone book or a contacts 1221 00:57:36,190 --> 00:57:37,690 app that only supports two people. 1222 00:57:37,690 --> 00:57:39,503 They're only going to be me and Carter. 1223 00:57:39,503 --> 00:57:41,920 This would be like downloading the contacts app on a phone 1224 00:57:41,920 --> 00:57:43,837 and you can only call two people in the world. 1225 00:57:43,837 --> 00:57:45,820 There's no ability to add names or edit things. 1226 00:57:45,820 --> 00:57:48,860 That, of course, could come later using get string or something else. 1227 00:57:48,860 --> 00:57:50,693 But for now for the sake of discussion, I've 1228 00:57:50,693 --> 00:57:53,450 just hardcoded two names and two numbers. 1229 00:57:53,450 --> 00:57:56,290 But for what it does, I claim this is correct. 1230 00:57:56,290 --> 00:57:59,590 It's going to find me and print out my number. 1231 00:57:59,590 --> 00:58:01,480 But is it well-designed? 1232 00:58:01,480 --> 00:58:05,560 Let's start to now consider if we're not just using arrays, 1233 00:58:05,560 --> 00:58:07,852 but are we using them, well? 1234 00:58:07,852 --> 00:58:10,810 We started to use them last week, but are we using them well this week? 1235 00:58:10,810 --> 00:58:14,680 And what might I even mean by using an array well or designing 1236 00:58:14,680 --> 00:58:16,640 this program well? 1237 00:58:16,640 --> 00:58:21,940 Any critiques or concerns with why this might not 1238 00:58:21,940 --> 00:58:24,100 be the best road for us to be going down when 1239 00:58:24,100 --> 00:58:28,540 I want to implement something like a phone book with pieces of information? 1240 00:58:28,540 --> 00:58:31,540 It seems all too vulnerable to just mistakes. 1241 00:58:31,540 --> 00:58:35,620 For instance, if I screw up the actual number of names in the names array 1242 00:58:35,620 --> 00:58:40,190 such that it's now more or less than is in the numbers array or vise versa, 1243 00:58:40,190 --> 00:58:43,420 it feels like there's not a tight relationship between those pieces 1244 00:58:43,420 --> 00:58:47,140 of data, and it's just sort of is trusting on the honor system 1245 00:58:47,140 --> 00:58:53,440 that any time I use names bracket i that it lines up with numbers bracket i. 1246 00:58:53,440 --> 00:58:54,160 And that's fine. 1247 00:58:54,160 --> 00:58:56,080 If you're the one writing the code, you're probably 1248 00:58:56,080 --> 00:58:57,640 not going to really screw this up. 1249 00:58:57,640 --> 00:59:00,265 But if you start collaborating with someone else or the program 1250 00:59:00,265 --> 00:59:03,700 is getting much, much longer, the odds that you or your colleagues 1251 00:59:03,700 --> 00:59:08,110 remember that you're sort of just trusting that names and numbers line up 1252 00:59:08,110 --> 00:59:10,420 like this is going to fail eventually. 1253 00:59:10,420 --> 00:59:13,540 Someone's not going to realize that, and just, the code is going to break. 1254 00:59:13,540 --> 00:59:16,540 And you're going to start out putting the wrong numbers for names, which 1255 00:59:16,540 --> 00:59:20,710 is to say it'd be much nicer if we could somehow couple these two 1256 00:59:20,710 --> 00:59:24,850 pieces of data, names and numbers, a little more tightly together so 1257 00:59:24,850 --> 00:59:28,900 that you're not just trusting that these two independent variables, names 1258 00:59:28,900 --> 00:59:32,630 and numbers, have this kind of relationship with themselves. 1259 00:59:32,630 --> 00:59:35,200 So let's consider how we might solve this. 1260 00:59:35,200 --> 00:59:39,400 A new feature today that we'll introduce is generally known as a data structure. 1261 00:59:39,400 --> 00:59:43,280 In C, we have the ability to invent our own data types, 1262 00:59:43,280 --> 00:59:46,555 if you will-- data types that the authors of C decades 1263 00:59:46,555 --> 00:59:48,430 ago just didn't envision or just didn't think 1264 00:59:48,430 --> 00:59:51,880 were necessary because we can implement them ourselves-- similar to Scratch 1265 00:59:51,880 --> 00:59:54,280 just as you could create custom puzzle pieces, 1266 00:59:54,280 --> 00:59:56,360 or in C, you can create custom functions. 1267 00:59:56,360 --> 01:00:00,760 So in C, can you create your own types of data 1268 01:00:00,760 --> 01:00:04,900 that go beyond the built in ints and floats and even strings? 1269 01:00:04,900 --> 01:00:10,540 You can make, for instance, a person data type or a candidate data 1270 01:00:10,540 --> 01:00:13,090 type in the context of elections or a person data type 1271 01:00:13,090 --> 01:00:15,950 more generically that might have a name and a number. 1272 01:00:15,950 --> 01:00:17,720 So how might we do this? 1273 01:00:17,720 --> 01:00:23,470 Well, let me go here and propose that if we want to define a person, 1274 01:00:23,470 --> 01:00:26,920 wouldn't it be nice if we could have a person data type, 1275 01:00:26,920 --> 01:00:29,470 and then we could have an array called people? 1276 01:00:29,470 --> 01:00:32,770 And maybe that array is our only array with two things 1277 01:00:32,770 --> 01:00:35,200 in it, two persons in it. 1278 01:00:35,200 --> 01:00:38,140 But somehow, those data types, these persons, 1279 01:00:38,140 --> 01:00:41,158 would have both a name and a number associated with them. 1280 01:00:41,158 --> 01:00:42,700 So we don't need two separate arrays. 1281 01:00:42,700 --> 01:00:47,240 We need one array of persons, a brand new data type. 1282 01:00:47,240 --> 01:00:48,890 So how might we do this? 1283 01:00:48,890 --> 01:00:51,070 Well, if we want every person in the world 1284 01:00:51,070 --> 01:00:53,320 or in this program to have a name and a number, 1285 01:00:53,320 --> 01:00:56,380 we literally right out first those two data types. 1286 01:00:56,380 --> 01:00:57,860 Give me a string called name. 1287 01:00:57,860 --> 01:01:00,850 Give me a string called number semicolon, after each. 1288 01:01:00,850 --> 01:01:04,030 And then we wrap that, those two lines of code, 1289 01:01:04,030 --> 01:01:06,730 with this syntax, which at first glance is a little cryptic. 1290 01:01:06,730 --> 01:01:08,410 It's a lot of words all of a sudden. 1291 01:01:08,410 --> 01:01:12,730 But typedef is a new keyword today that defines a new data type. 1292 01:01:12,730 --> 01:01:16,510 This is the C key word that lets you create your own data 1293 01:01:16,510 --> 01:01:18,100 type for the very first time. 1294 01:01:18,100 --> 01:01:22,840 Struct is another related key word that tells the compiler that this isn't just 1295 01:01:22,840 --> 01:01:27,310 a simple data type, like an int or a float renamed or something like that. 1296 01:01:27,310 --> 01:01:28,940 It actually is a structure. 1297 01:01:28,940 --> 01:01:32,710 It's got some dimensions to it, like two things in it or three things in it 1298 01:01:32,710 --> 01:01:35,260 or even 50 things inside of it. 1299 01:01:35,260 --> 01:01:39,310 The last word down here is the name that you want to give your data type, 1300 01:01:39,310 --> 01:01:41,980 and it weirdly goes after the curly braces. 1301 01:01:41,980 --> 01:01:45,760 But this is how you invent a data type called person. 1302 01:01:45,760 --> 01:01:48,670 And what this code is implying is that henceforth, 1303 01:01:48,670 --> 01:01:53,980 the compiler clang will know that a person is composed of a name that's 1304 01:01:53,980 --> 01:01:56,770 a string and a number that's a string. 1305 01:01:56,770 --> 01:01:59,770 And you don't have to worry about having multiple arrays now. 1306 01:01:59,770 --> 01:02:03,950 You can just have an array of people moving forward. 1307 01:02:03,950 --> 01:02:05,918 So how can we go about using this? 1308 01:02:05,918 --> 01:02:07,960 Well, let me go back to my code from before where 1309 01:02:07,960 --> 01:02:09,340 I was implementing a phone book. 1310 01:02:09,340 --> 01:02:11,715 And why don't we enhance the phone book code a little bit 1311 01:02:11,715 --> 01:02:14,230 by borrowing some of that new syntax? 1312 01:02:14,230 --> 01:02:16,720 Let me go to the top of my program above main 1313 01:02:16,720 --> 01:02:19,660 and define a type that's a structure or a data 1314 01:02:19,660 --> 01:02:24,500 structure that has a name inside of it and that has a number inside of it. 1315 01:02:24,500 --> 01:02:28,150 And the name of this new structure again is going to be called person. 1316 01:02:28,150 --> 01:02:33,550 Inside of my code now, let me go ahead and delete this old stuff temporarily. 1317 01:02:33,550 --> 01:02:37,510 Let me give myself an array called people of size 2. 1318 01:02:37,510 --> 01:02:40,997 And I'm going to use the non-terse way to do this. 1319 01:02:40,997 --> 01:02:42,580 I'm not going to use the curly braces. 1320 01:02:42,580 --> 01:02:47,140 I'm going to more pedantic spell out what I want in this array of size 2 1321 01:02:47,140 --> 01:02:50,860 at location 0, which is the first person in an array 1322 01:02:50,860 --> 01:02:52,690 because you always start counting at 0. 1323 01:02:52,690 --> 01:02:56,500 I'm going to give that person a name of quote unquote Carter. 1324 01:02:56,500 --> 01:02:59,980 And the dot is admittedly one new piece of syntax today too. 1325 01:02:59,980 --> 01:03:02,410 The dot means go inside of that structure 1326 01:03:02,410 --> 01:03:06,190 and access the variable called name and give it this value Carter. 1327 01:03:06,190 --> 01:03:08,350 Similarly, if I'm going to give Carter a number, 1328 01:03:08,350 --> 01:03:13,030 I can go into people bracket 0 dot number and give that the same thing 1329 01:03:13,030 --> 01:03:17,950 as before plus 1-617-495-1000. 1330 01:03:17,950 --> 01:03:20,230 And then I can do the same for myself here-- 1331 01:03:20,230 --> 01:03:24,145 people bracket-- where should I go? 1332 01:03:24,145 --> 01:03:26,107 OK, one because again, two elements. 1333 01:03:26,107 --> 01:03:27,440 But we started counting at zero. 1334 01:03:27,440 --> 01:03:29,420 Bracket name equals quote unquote David. 1335 01:03:29,420 --> 01:03:34,340 And then lastly, people bracket 1 dot number equals quote unquote plus 1336 01:03:34,340 --> 01:03:40,370 1-949-468-2750. 1337 01:03:40,370 --> 01:03:43,250 So now if I scroll down here to my logic, 1338 01:03:43,250 --> 01:03:46,130 I don't think this part needs to change too much. 1339 01:03:46,130 --> 01:03:50,680 I'm still, for the sake of discussion, going to iterate 2 times from i 1340 01:03:50,680 --> 01:03:53,630 is 0 on up to but not through 2. 1341 01:03:53,630 --> 01:03:56,670 But I think this line of code needs to change. 1342 01:03:56,670 --> 01:04:04,910 How should I now refer to the i-th person's name as I iterate? 1343 01:04:04,910 --> 01:04:08,250 What should I compare quote unquote David to this time? 1344 01:04:08,250 --> 01:04:08,750 Let me see. 1345 01:04:08,750 --> 01:04:10,050 On the end here? 1346 01:04:10,050 --> 01:04:13,078 AUDIENCE: People bracket i dot name. 1347 01:04:13,078 --> 01:04:14,870 DAVID J. MALAN: Yeah, people bracket i dot name. 1348 01:04:14,870 --> 01:04:15,170 Why? 1349 01:04:15,170 --> 01:04:16,837 Because people is the name of the array. 1350 01:04:16,837 --> 01:04:20,480 Bracket i is the i-th person that we're iterating over in the current loop-- 1351 01:04:20,480 --> 01:04:23,240 first zero, then one, maybe higher if it had more people. 1352 01:04:23,240 --> 01:04:26,570 Then dot is our new syntax for going inside of a data structure 1353 01:04:26,570 --> 01:04:29,870 and accessing a variable therein which in this case is name. 1354 01:04:29,870 --> 01:04:32,280 And so I can compare David just as before. 1355 01:04:32,280 --> 01:04:36,890 So it's a little more verbose, but now arguably this is a better program 1356 01:04:36,890 --> 01:04:42,470 because now these people are full fledged data types unto themselves. 1357 01:04:42,470 --> 01:04:44,810 There's no more honor system inside of my loop 1358 01:04:44,810 --> 01:04:47,102 that this is going to line up because in just a moment, 1359 01:04:47,102 --> 01:04:49,867 I'm going to fix this one last remnant of the previous version. 1360 01:04:49,867 --> 01:04:51,950 And if I can call back on you again, what should I 1361 01:04:51,950 --> 01:04:54,830 change numbers bracket i to this time? 1362 01:04:54,830 --> 01:05:00,760 AUDIENCE: [INAUDIBLE] dot number. 1363 01:05:00,760 --> 01:05:02,390 DAVID J. MALAN: Dot number, exactly. 1364 01:05:02,390 --> 01:05:04,940 So gone is the honor system that just assumes 1365 01:05:04,940 --> 01:05:08,060 that bracket i in this array lines up with bracket i in this other array. 1366 01:05:08,060 --> 01:05:08,810 Now why? 1367 01:05:08,810 --> 01:05:10,430 There's only one array. 1368 01:05:10,430 --> 01:05:11,990 It's an array called people. 1369 01:05:11,990 --> 01:05:14,060 The things it stores are persons. 1370 01:05:14,060 --> 01:05:15,800 A person has a name and a number. 1371 01:05:15,800 --> 01:05:18,217 And so even though it's kind of marginal admittedly given 1372 01:05:18,217 --> 01:05:21,050 that this is a short program and given that this kind of made things 1373 01:05:21,050 --> 01:05:23,300 look more complicated at first glance, we're 1374 01:05:23,300 --> 01:05:26,540 now laying the foundation for just a better design because you really 1375 01:05:26,540 --> 01:05:29,180 can't screw up now the association of names 1376 01:05:29,180 --> 01:05:32,780 with numbers because every person's name and number is, so to speak, 1377 01:05:32,780 --> 01:05:36,710 encapsulated inside of the same data type. 1378 01:05:36,710 --> 01:05:38,270 And that's a term of art in CS. 1379 01:05:38,270 --> 01:05:41,720 Encapsulation means to encapsulate-- that is, contain-- 1380 01:05:41,720 --> 01:05:43,830 related pieces of information. 1381 01:05:43,830 --> 01:05:49,670 And thus, we have a person that encapsulates two other data types, name 1382 01:05:49,670 --> 01:05:50,450 and number. 1383 01:05:50,450 --> 01:05:52,310 And this just sets the foundation for all 1384 01:05:52,310 --> 01:05:55,190 of the cool stuff we've talked about and you use every day. 1385 01:05:55,190 --> 01:05:55,993 What is an image? 1386 01:05:55,993 --> 01:05:58,910 Well, recall that an image is a bunch of pixels or dots on the screen. 1387 01:05:58,910 --> 01:06:02,570 Every one of those dots has RGB values associated 1388 01:06:02,570 --> 01:06:04,430 with it-- red, green, and blue. 1389 01:06:04,430 --> 01:06:07,760 You could imagine now creating a structure in C probably where 1390 01:06:07,760 --> 01:06:11,540 maybe you have three values, three variables-- one called red, 1391 01:06:11,540 --> 01:06:13,400 one called green, one called blue. 1392 01:06:13,400 --> 01:06:15,980 And then you could name the thing not person but pixel. 1393 01:06:15,980 --> 01:06:19,910 And now you could store in C three different colors-- some amount of red, 1394 01:06:19,910 --> 01:06:23,978 some green, some blue-- and collectively treat it as the color of a pixel. 1395 01:06:23,978 --> 01:06:27,020 And you could imagine doing something similar perhaps for video or music. 1396 01:06:27,020 --> 01:06:30,440 Music, you might have three variables-- one for the musical note, 1397 01:06:30,440 --> 01:06:32,660 the duration, the loudness of it. 1398 01:06:32,660 --> 01:06:36,120 And you can imagine coming up with your own data type for music as well. 1399 01:06:36,120 --> 01:06:37,370 So this is a little low level. 1400 01:06:37,370 --> 01:06:39,920 We're just using like a familiar contacts application. 1401 01:06:39,920 --> 01:06:44,270 But we now have the way in code to express most any type of data 1402 01:06:44,270 --> 01:06:48,260 that we might want to implement or discuss ultimately. 1403 01:06:48,260 --> 01:06:53,510 So any questions now on struct or defining our own types, 1404 01:06:53,510 --> 01:06:58,110 the purposes for which are to use arrays but use them more responsibly 1405 01:06:58,110 --> 01:07:01,280 now in a better design but also to lay the foundation 1406 01:07:01,280 --> 01:07:05,880 for implementing cooler and cooler stuff per our week zero discussion? 1407 01:07:05,880 --> 01:07:06,380 Yeah. 1408 01:07:06,380 --> 01:07:07,713 AUDIENCE: What's the [INAUDIBLE] 1409 01:07:07,713 --> 01:07:10,713 DAVID J. MALAN: What's the difference between this and an object in an object 1410 01:07:10,713 --> 01:07:11,550 oriented language? 1411 01:07:11,550 --> 01:07:14,390 So slight side note, C is not object-oriented. 1412 01:07:14,390 --> 01:07:17,990 Languages like Java and C++ and others which you might have heard 1413 01:07:17,990 --> 01:07:21,200 of, programmed yourself, had friends program in, are object oriented 1414 01:07:21,200 --> 01:07:25,430 languages in those languages they have things called classes or objects which 1415 01:07:25,430 --> 01:07:26,450 are interrelated. 1416 01:07:26,450 --> 01:07:30,020 And objects can store not just data, like variables. 1417 01:07:30,020 --> 01:07:34,490 Objects can also store functions, and you can kind of sort of do this in C. 1418 01:07:34,490 --> 01:07:36,380 But it's not sort of conventional. 1419 01:07:36,380 --> 01:07:39,770 In C, you have data structures that store data. 1420 01:07:39,770 --> 01:07:44,780 In languages like Java and C+, you have objects that store data and functions 1421 01:07:44,780 --> 01:07:45,410 together. 1422 01:07:45,410 --> 01:07:47,790 Python is an object-oriented language as well. 1423 01:07:47,790 --> 01:07:51,270 So we'll see this issue in a few weeks, but let me wave my hands at it for now. 1424 01:07:51,270 --> 01:07:51,770 Yeah. 1425 01:07:51,770 --> 01:07:53,755 AUDIENCE: Could you use this [INAUDIBLE]?? 1426 01:07:53,755 --> 01:07:54,380 DAVID J. MALAN: Yes. 1427 01:07:54,380 --> 01:07:57,020 Could you use this struct to redefine how an int is defined? 1428 01:07:57,020 --> 01:07:58,250 Short answer, yes. 1429 01:07:58,250 --> 01:08:01,700 We talked a couple of times now about integer overflow. 1430 01:08:01,700 --> 01:08:05,900 And most recently, you might have seen me mention the bug in iOS and Mac OS 1431 01:08:05,900 --> 01:08:08,480 that was literally related to an int overflow. 1432 01:08:08,480 --> 01:08:12,890 That's the result of ints only storing 4 bytes or 32 bits 1433 01:08:12,890 --> 01:08:15,800 or even as long as 64 bits or 8 bytes. 1434 01:08:15,800 --> 01:08:16,880 But it's finite. 1435 01:08:16,880 --> 01:08:19,520 But if you want to implement some financial software 1436 01:08:19,520 --> 01:08:22,100 or some scientific or mathematical software that 1437 01:08:22,100 --> 01:08:25,970 allows you to count way bigger than a typical int or a long, 1438 01:08:25,970 --> 01:08:29,135 you could imagine John coming up with your own structure. 1439 01:08:29,135 --> 01:08:31,260 And in fact, in some languages there is a structure 1440 01:08:31,260 --> 01:08:35,370 called big int, which allows you to express even bigger numbers. 1441 01:08:35,370 --> 01:08:35,970 How? 1442 01:08:35,970 --> 01:08:40,410 Well, maybe you store inside of a big ant an array of values. 1443 01:08:40,410 --> 01:08:43,492 And you somehow allow yourself to store more and more bits 1444 01:08:43,492 --> 01:08:45,450 based on how high you want to be able to count. 1445 01:08:45,450 --> 01:08:46,470 So in short, yes. 1446 01:08:46,470 --> 01:08:49,840 We now have the ability now to do most anything we want in the language 1447 01:08:49,840 --> 01:08:52,200 even if it's not built in for us. 1448 01:08:52,200 --> 01:08:53,100 Other questions. 1449 01:08:53,100 --> 01:08:58,762 AUDIENCE: [INAUDIBLE] 1450 01:08:58,762 --> 01:09:01,470 DAVID J. MALAN: Could you define a name and a number in the same line? 1451 01:09:01,470 --> 01:09:02,069 Sort of. 1452 01:09:02,069 --> 01:09:03,986 It starts to get syntactically a little messy, 1453 01:09:03,986 --> 01:09:07,229 so I did it a little more pedantic line by line. 1454 01:09:07,229 --> 01:09:07,870 Good question. 1455 01:09:07,870 --> 01:09:08,430 Over here. 1456 01:09:08,430 --> 01:09:12,910 AUDIENCE: [INAUDIBLE] function you use for the function 1457 01:09:12,910 --> 01:09:15,340 at the bottom of the [INAUDIBLE]. 1458 01:09:15,340 --> 01:09:19,029 Could you do something like that [INAUDIBLE]?? 1459 01:09:19,029 --> 01:09:21,399 DAVID J. MALAN: Prototypes-- you have to do A and C. You 1460 01:09:21,399 --> 01:09:25,029 have to define anything you're going to use or declare anything you're going 1461 01:09:25,029 --> 01:09:26,779 to use before you actually use it. 1462 01:09:26,779 --> 01:09:30,830 So it is deliberate that I put it at the top of my code in this file. 1463 01:09:30,830 --> 01:09:34,870 Otherwise, the compiler would not know what I mean by person when I first 1464 01:09:34,870 --> 01:09:37,750 use it here on what's line 14. 1465 01:09:37,750 --> 01:09:41,479 So it has to come first, or it has to be put into something like a header file 1466 01:09:41,479 --> 01:09:44,979 so that you include it at the very top of your code. 1467 01:09:44,979 --> 01:09:46,779 Other questions over here. 1468 01:09:46,779 --> 01:09:47,662 Yeah. 1469 01:09:47,662 --> 01:09:53,282 AUDIENCE: [INAUDIBLE] 1470 01:09:53,282 --> 01:09:54,990 DAVID J. MALAN: Yeah, good question, and we'll 1471 01:09:54,990 --> 01:09:58,500 come back to this later in the term when we talk about SQL, a database language, 1472 01:09:58,500 --> 01:10:00,630 and storing things in actual databases. 1473 01:10:00,630 --> 01:10:04,320 Generally speaking, even though we humans call things phone numbers, 1474 01:10:04,320 --> 01:10:07,800 or in the US, we have social security numbers, those types of numbers 1475 01:10:07,800 --> 01:10:12,060 often have other punctuation in it, like dashes, parentheses, pluses, 1476 01:10:12,060 --> 01:10:13,380 and so forth. 1477 01:10:13,380 --> 01:10:17,340 You could not store any of that syntax or that punctuation inside of an int. 1478 01:10:17,340 --> 01:10:18,880 You could only store numbers. 1479 01:10:18,880 --> 01:10:20,940 So one motivation for using a string is just 1480 01:10:20,940 --> 01:10:24,570 I can store whatever the human wanted me to store, including parentheses 1481 01:10:24,570 --> 01:10:25,690 and so forth. 1482 01:10:25,690 --> 01:10:29,428 Another reason for storing things as strings, 1483 01:10:29,428 --> 01:10:31,470 even if they look like numbers, is in the context 1484 01:10:31,470 --> 01:10:33,088 of zip codes in the United States. 1485 01:10:33,088 --> 01:10:34,380 Again, we'll come back to this. 1486 01:10:34,380 --> 01:10:36,690 But long story short-- years ago, actually-- 1487 01:10:36,690 --> 01:10:39,732 I was using Microsoft Outlook for my email client. 1488 01:10:39,732 --> 01:10:41,190 And eventually I switched to Gmail. 1489 01:10:41,190 --> 01:10:42,900 And this is like 10 plus years ago now. 1490 01:10:42,900 --> 01:10:47,430 And Outlook at the time lets you export all of your contacts as a CSV file-- 1491 01:10:47,430 --> 01:10:48,780 Comma Separated Values. 1492 01:10:48,780 --> 01:10:50,610 More on that in the weeks to come too. 1493 01:10:50,610 --> 01:10:52,402 And that just means I could download a text 1494 01:10:52,402 --> 01:10:55,710 file with all of my friends and family and their numbers inside of it. 1495 01:10:55,710 --> 01:10:59,812 Unfortunately, I open that same CSV file with Excel, I think, at the time 1496 01:10:59,812 --> 01:11:01,770 just to kind of spot check it and see if what's 1497 01:11:01,770 --> 01:11:03,400 in there was what it was expected. 1498 01:11:03,400 --> 01:11:06,870 And I must have instinctively hit, like, Command or Control-S to save it. 1499 01:11:06,870 --> 01:11:09,900 And Excel at least has this habit of sort of reformatting your data. 1500 01:11:09,900 --> 01:11:12,630 If things look like numbers, it treats them as numbers. 1501 01:11:12,630 --> 01:11:14,040 And Apple Numbers does this too. 1502 01:11:14,040 --> 01:11:16,080 Google Spreadsheets does this to nowadays. 1503 01:11:16,080 --> 01:11:23,040 But long story short, I then imported my mildly saved CSV file into Gmail. 1504 01:11:23,040 --> 01:11:26,760 And now 10 plus years later, I'm still occasionally finding friends and family 1505 01:11:26,760 --> 01:11:32,640 members whose zip codes are in Cambridge, Massachusetts 2138, 1506 01:11:32,640 --> 01:11:36,300 which is missing the 0 because we here in Cambridge are 02138. 1507 01:11:36,300 --> 01:11:39,420 And that's because I treated or I let Excel 1508 01:11:39,420 --> 01:11:42,630 treat what looks like a number as an actual number or int, 1509 01:11:42,630 --> 01:11:45,540 and now leading zeros become a problem because mathematically, they 1510 01:11:45,540 --> 01:11:48,900 mean nothing, but in the mail system, they do-- 1511 01:11:48,900 --> 01:11:50,070 sending envelopes and such. 1512 01:11:50,070 --> 01:11:51,653 All right, other final questions here. 1513 01:11:51,653 --> 01:11:54,660 AUDIENCE: [INAUDIBLE] 1514 01:11:54,660 --> 01:11:58,500 DAVID J. MALAN: Yeah, so could I have used a 2D or two dimensional 1515 01:11:58,500 --> 01:12:02,730 array to solve the problem earlier of having just one array? 1516 01:12:02,730 --> 01:12:06,750 Yes, but one, I would argue it's less readable, especially 1517 01:12:06,750 --> 01:12:08,640 as I get lots of names and numbers. 1518 01:12:08,640 --> 01:12:11,730 And two, that too is also kind of relying on the honor system. 1519 01:12:11,730 --> 01:12:14,940 It would be all too easy to omit some of the square brackets in the two 1520 01:12:14,940 --> 01:12:15,940 dimensional array. 1521 01:12:15,940 --> 01:12:20,010 So I would argue it too is not as good as introducing a struct. 1522 01:12:20,010 --> 01:12:21,180 More on that down the road. 1523 01:12:21,180 --> 01:12:26,070 Two dimensional arrays just means arrays of arrays, as you might infer. 1524 01:12:26,070 --> 01:12:28,080 All right, so now that we have this ability 1525 01:12:28,080 --> 01:12:32,210 to store different types of data like contacts in a phone book, 1526 01:12:32,210 --> 01:12:33,960 having names and addresses, let's actually 1527 01:12:33,960 --> 01:12:36,780 take a step back and consider how we might now 1528 01:12:36,780 --> 01:12:41,850 solve one of the original problems by actually sorting the information we're 1529 01:12:41,850 --> 01:12:45,930 given in advance and considering, per our discussion earlier, just how 1530 01:12:45,930 --> 01:12:48,900 costly, how time consuming is that because that might tip 1531 01:12:48,900 --> 01:12:53,010 the scales in favor of sorting, then searching, or maybe just 1532 01:12:53,010 --> 01:12:54,960 not sorting and only searching. 1533 01:12:54,960 --> 01:12:58,470 It'll give us a sense of just how expensive, so to speak, 1534 01:12:58,470 --> 01:13:00,345 sorting something actually is. 1535 01:13:00,345 --> 01:13:02,220 Well, what's the formulation of this problem? 1536 01:13:02,220 --> 01:13:03,780 It's the same thing as week zero. 1537 01:13:03,780 --> 01:13:04,950 We've got input to sort. 1538 01:13:04,950 --> 01:13:07,120 We want it to be output as sorted. 1539 01:13:07,120 --> 01:13:10,560 So for instance, if we're taking unsorted input as input, 1540 01:13:10,560 --> 01:13:13,895 we want the sorted output as the result. More concretely, 1541 01:13:13,895 --> 01:13:15,270 if we've got numbers like these-- 1542 01:13:15,270 --> 01:13:20,100 63852741, which are just randomly arranged numbers-- 1543 01:13:20,100 --> 01:13:24,510 we want to get back out 12345678. 1544 01:13:24,510 --> 01:13:26,440 So we just want those things to be sorted. 1545 01:13:26,440 --> 01:13:28,470 So again, inside of the black box here is 1546 01:13:28,470 --> 01:13:33,460 going to be one or more algorithms that actually gets this job done. 1547 01:13:33,460 --> 01:13:35,680 So how might we go about doing this? 1548 01:13:35,680 --> 01:13:39,180 Well, just to vary things a bit more, I think we have a chance here 1549 01:13:39,180 --> 01:13:41,580 for a bit more audience participation. 1550 01:13:41,580 --> 01:13:43,790 But this time, we need eight people if we may. 1551 01:13:43,790 --> 01:13:46,290 All of you have to be comfortable appearing on the internet. 1552 01:13:46,290 --> 01:13:49,165 OK, so this is actually quite convenient that you're all quite close. 1553 01:13:49,165 --> 01:13:52,752 How about 1, 2, 3, 4, 5, 6, 7-- 1554 01:13:52,752 --> 01:13:57,060 oh, OK, and someone volunteering their friend-- number eight. 1555 01:13:57,060 --> 01:13:58,020 Come on down. 1556 01:13:58,020 --> 01:13:58,875 Come on down. 1557 01:13:58,875 --> 01:14:00,750 And if you could, I'm going to set things up. 1558 01:14:00,750 --> 01:14:03,630 If you all could join Valerie, my colleague over there, 1559 01:14:03,630 --> 01:14:08,910 to give you a prop to use here, we'll go ahead in just a moment 1560 01:14:08,910 --> 01:14:11,835 and try to find some numbers at hand. 1561 01:14:11,835 --> 01:14:15,180 1562 01:14:15,180 --> 01:14:20,820 In just a moment, each of our volunteers is going to be representing an integer. 1563 01:14:20,820 --> 01:14:25,380 And that integer is initially going to be in unsorted order. 1564 01:14:25,380 --> 01:14:29,100 And I claim that using an algorithm, step by step instructions, 1565 01:14:29,100 --> 01:14:33,820 we can probably sort these folks in at least a couple of different ways. 1566 01:14:33,820 --> 01:14:38,430 So they're in wardrobe right now just getting their very own Harvard T-shirts 1567 01:14:38,430 --> 01:14:42,945 with a Jersey number on it, which will then represent an element of our array. 1568 01:14:42,945 --> 01:14:46,650 1569 01:14:46,650 --> 01:14:51,270 Give us just a moment to finish getting the attire ready. 1570 01:14:51,270 --> 01:14:56,040 They're being handed a shirt and a number. 1571 01:14:56,040 --> 01:14:58,620 And let me ask the audience for just a moment. 1572 01:14:58,620 --> 01:15:02,760 As we have these numbers up here on the screen, these numbers too are unsorted. 1573 01:15:02,760 --> 01:15:04,353 They're just in random order. 1574 01:15:04,353 --> 01:15:05,520 And let me ask the audience. 1575 01:15:05,520 --> 01:15:10,980 How would you go about sorting these eight numbers on the screen? 1576 01:15:10,980 --> 01:15:12,930 How would you go about sorting these? 1577 01:15:12,930 --> 01:15:14,138 Yeah, what are your thoughts? 1578 01:15:14,138 --> 01:15:20,327 AUDIENCE: [INAUDIBLE] the number at the end, the following number. 1579 01:15:20,327 --> 01:15:20,910 DAVID J. MALAN: OK. 1580 01:15:20,910 --> 01:15:24,547 AUDIENCE: The following number is bigger, then I keep it as it is. 1581 01:15:24,547 --> 01:15:25,130 DAVID J. MALAN: OK. 1582 01:15:25,130 --> 01:15:26,800 AUDIENCE: If not, then [INAUDIBLE]. 1583 01:15:26,800 --> 01:15:29,352 DAVID J. MALAN: OK, so just to recap, you would start 1584 01:15:29,352 --> 01:15:30,810 with one of the numbers on the end. 1585 01:15:30,810 --> 01:15:33,210 You would look to the number to the right or to the left of it, 1586 01:15:33,210 --> 01:15:34,530 depending on which end you start at. 1587 01:15:34,530 --> 01:15:37,113 And if it's out of order, you would just start to swap things. 1588 01:15:37,113 --> 01:15:38,350 And that seems reasonable. 1589 01:15:38,350 --> 01:15:40,435 There's a whole bunch of mistakes to fix here 1590 01:15:40,435 --> 01:15:42,060 because things are pretty out of order. 1591 01:15:42,060 --> 01:15:45,060 But probably, if you start to solve small problems at a time, 1592 01:15:45,060 --> 01:15:47,910 you can achieve the end result of getting the whole thing sorted. 1593 01:15:47,910 --> 01:15:50,820 Other instincts, if you were just handed these numbers, how 1594 01:15:50,820 --> 01:15:54,077 you might go about sorting them? 1595 01:15:54,077 --> 01:15:54,660 How might you? 1596 01:15:54,660 --> 01:15:55,856 Yeah, in the back. 1597 01:15:55,856 --> 01:16:00,140 AUDIENCE: [INAUDIBLE] 1598 01:16:00,140 --> 01:16:01,920 DAVID J. MALAN: OK, I like that. 1599 01:16:01,920 --> 01:16:05,840 So to recap there, find the smallest one first and put it at the beginning, 1600 01:16:05,840 --> 01:16:07,070 if I heard you correctly. 1601 01:16:07,070 --> 01:16:10,342 And then presumably, you could do that again and again and again. 1602 01:16:10,342 --> 01:16:13,050 And that would seem to give you a couple of different algorithms. 1603 01:16:13,050 --> 01:16:15,710 And if you all are attired here-- 1604 01:16:15,710 --> 01:16:18,800 do you want to come on up if you're ready? 1605 01:16:18,800 --> 01:16:20,660 We had some [? felt ?] volunteers too. 1606 01:16:20,660 --> 01:16:23,030 Come on over. 1607 01:16:23,030 --> 01:16:25,520 So if you all would like to line yourselves up 1608 01:16:25,520 --> 01:16:27,770 facing the audience in exactly this order-- so 1609 01:16:27,770 --> 01:16:30,170 whoever is number zero should be way over here, 1610 01:16:30,170 --> 01:16:33,710 and whoever is number five should be way over there. 1611 01:16:33,710 --> 01:16:36,920 Feel free to distance as much as you'd like and scooch a little with this way 1612 01:16:36,920 --> 01:16:38,480 if you could. 1613 01:16:38,480 --> 01:16:39,653 OK, all right. 1614 01:16:39,653 --> 01:16:40,820 And make a little more room. 1615 01:16:40,820 --> 01:16:41,870 So seven-- let's see. 1616 01:16:41,870 --> 01:16:44,170 5, 2, 7, 4-- 1617 01:16:44,170 --> 01:16:45,090 AUDIENCE: [INAUDIBLE] 1618 01:16:45,090 --> 01:16:46,733 DAVID J. MALAN: 4, hopefully 1. 1619 01:16:46,733 --> 01:16:47,900 Yeah, keep them to the side. 1620 01:16:47,900 --> 01:16:51,210 OK, 1, 6, and there we go-- 1621 01:16:51,210 --> 01:16:51,710 3. 1622 01:16:51,710 --> 01:16:52,543 Come on over, three. 1623 01:16:52,543 --> 01:16:53,690 I was looking for you. 1624 01:16:53,690 --> 01:16:57,075 All right, so here, we have an array of eight numbers-- 1625 01:16:57,075 --> 01:16:58,200 eight integers if you will. 1626 01:16:58,200 --> 01:17:00,770 And do you want to each say a quick hello to the group? 1627 01:17:00,770 --> 01:17:02,750 AUDIENCE: Hello, I'm Quinn. 1628 01:17:02,750 --> 01:17:04,892 Go [INAUDIBLE]. 1629 01:17:04,892 --> 01:17:05,850 AUDIENCE: Hi, everyone. 1630 01:17:05,850 --> 01:17:08,060 I'm [INAUDIBLE]. 1631 01:17:08,060 --> 01:17:09,460 AUDIENCE: Hey, I'm Mitchell. 1632 01:17:09,460 --> 01:17:10,460 AUDIENCE: Hi, I'm Brett. 1633 01:17:10,460 --> 01:17:12,675 And also, go [INAUDIBLE]. 1634 01:17:12,675 --> 01:17:13,550 AUDIENCE: I'm Hannah. 1635 01:17:13,550 --> 01:17:15,137 Go [INAUDIBLE]. 1636 01:17:15,137 --> 01:17:16,220 AUDIENCE: Hi, I'm Matthew. 1637 01:17:16,220 --> 01:17:18,058 Go [INAUDIBLE] 1638 01:17:18,058 --> 01:17:19,100 AUDIENCE: Hi, I'm Miriam. 1639 01:17:19,100 --> 01:17:20,720 Go Winthrop. 1640 01:17:20,720 --> 01:17:22,905 AUDIENCE: Hi, I'm Celeste, and go Strauss. 1641 01:17:22,905 --> 01:17:23,780 DAVID J. MALAN: Wonderful. 1642 01:17:23,780 --> 01:17:26,930 Well, welcome all to the stage, and let's just visualize, 1643 01:17:26,930 --> 01:17:29,430 perhaps organically, how you eight would solve this problem. 1644 01:17:29,430 --> 01:17:32,540 So we currently have the numbers 0 through 7 quite out of order. 1645 01:17:32,540 --> 01:17:36,508 Could you go ahead and just yourselves from 0 through 7? 1646 01:17:36,508 --> 01:17:37,341 AUDIENCE: Thank you. 1647 01:17:37,341 --> 01:17:41,400 1648 01:17:41,400 --> 01:17:44,200 DAVID J. MALAN: OK, so what did they just do? 1649 01:17:44,200 --> 01:17:44,700 OK, yes. 1650 01:17:44,700 --> 01:17:46,260 First of all, yes, very well done. 1651 01:17:46,260 --> 01:17:50,030 1652 01:17:50,030 --> 01:17:53,037 How would you describe what they just did? 1653 01:17:53,037 --> 01:17:53,870 Well, let's do this. 1654 01:17:53,870 --> 01:17:56,060 Could you go back into that order on the screen-- 1655 01:17:56,060 --> 01:18:00,320 52741630? 1656 01:18:00,320 --> 01:18:03,440 And could you do exactly what you just did again? 1657 01:18:03,440 --> 01:18:04,750 Sort yourselves. 1658 01:18:04,750 --> 01:18:08,040 1659 01:18:08,040 --> 01:18:09,030 All right, what did-- 1660 01:18:09,030 --> 01:18:09,630 OK, yes. 1661 01:18:09,630 --> 01:18:10,470 Well done again. 1662 01:18:10,470 --> 01:18:14,190 1663 01:18:14,190 --> 01:18:17,480 All right, so admittedly, there's kind of a lot going on because each of you, 1664 01:18:17,480 --> 01:18:21,150 except number four, are doing something in parallel all at the same time. 1665 01:18:21,150 --> 01:18:23,390 And that's not really how a computer typically works. 1666 01:18:23,390 --> 01:18:26,900 Just like a computer can only look at one memory location, at one locker, 1667 01:18:26,900 --> 01:18:31,490 at a time, so can a computer only move one number at a time-- sort of opening 1668 01:18:31,490 --> 01:18:33,717 a locker, checking what's there, moving it as needed. 1669 01:18:33,717 --> 01:18:36,800 So let's try this more methodically based on the two audience suggestions. 1670 01:18:36,800 --> 01:18:42,350 If you all could randomize yourself again to 52741630, 1671 01:18:42,350 --> 01:18:44,762 let's take the second of those approaches first. 1672 01:18:44,762 --> 01:18:46,220 I'm going to look at these numbers. 1673 01:18:46,220 --> 01:18:48,870 And even though I as the human can obviously see all the numbers 1674 01:18:48,870 --> 01:18:50,930 and I just kind of have the intuition for how to fix this, 1675 01:18:50,930 --> 01:18:53,180 we got to be more methodical because eventually, we've 1676 01:18:53,180 --> 01:18:55,500 got to translate this to pseudo code and then code. 1677 01:18:55,500 --> 01:18:56,390 So let me see. 1678 01:18:56,390 --> 01:18:59,200 I'm going to search for, as you proposed, the smallest number. 1679 01:18:59,200 --> 01:19:00,950 And I'm going to start from left to right. 1680 01:19:00,950 --> 01:19:04,100 I could do it right to left, but left to right just tends to be convention. 1681 01:19:04,100 --> 01:19:07,080 All right, 5 at this moment is the smallest number I've seen. 1682 01:19:07,080 --> 01:19:09,818 So I'm going to remember that in a variable, if you will. 1683 01:19:09,818 --> 01:19:11,360 Now I'm going to take one more step-- 1684 01:19:11,360 --> 01:19:11,930 2. 1685 01:19:11,930 --> 01:19:15,260 OK, 2 I'm going to compare to the variable in mind, obviously smaller. 1686 01:19:15,260 --> 01:19:19,160 I'm going to forget about 5 and only now remember 2 as the now smallest 1687 01:19:19,160 --> 01:19:19,820 elements. 1688 01:19:19,820 --> 01:19:23,090 7, nope-- I'm going to ignore that because it's not smaller than the 2 1689 01:19:23,090 --> 01:19:23,810 I have in mind. 1690 01:19:23,810 --> 01:19:27,170 4, 1-- OK, I'm going to update the variable in mind 1691 01:19:27,170 --> 01:19:28,430 because that's indeed smaller. 1692 01:19:28,430 --> 01:19:31,130 Now obviously, we the humans know that's getting pretty small. 1693 01:19:31,130 --> 01:19:32,180 Maybe it's the end. 1694 01:19:32,180 --> 01:19:35,630 I have to check all values to see if there's something even smaller 1695 01:19:35,630 --> 01:19:38,435 because 6 is not, 3 is not, but 0 is. 1696 01:19:38,435 --> 01:19:39,560 And what's your name again? 1697 01:19:39,560 --> 01:19:40,430 AUDIENCE: Celeste. 1698 01:19:40,430 --> 01:19:41,270 DAVID J. MALAN: Celeste. 1699 01:19:41,270 --> 01:19:47,960 Where should Celeste or number 0 go according to this proposed algorithm? 1700 01:19:47,960 --> 01:19:49,590 All right, I'm seeing a lot of this. 1701 01:19:49,590 --> 01:19:52,820 So at the beginning of the array, so before doing this for real, 1702 01:19:52,820 --> 01:19:54,560 let's have you pop out in front. 1703 01:19:54,560 --> 01:19:58,100 And could you all shift and make room for Celeste? 1704 01:19:58,100 --> 01:20:02,030 Is this a good idea to have all of them move or equivalently 1705 01:20:02,030 --> 01:20:04,550 move everything in the array to make room for Celeste 1706 01:20:04,550 --> 01:20:06,920 and number 0 over there? 1707 01:20:06,920 --> 01:20:07,670 No, probably not. 1708 01:20:07,670 --> 01:20:08,878 That felt like a lot of work. 1709 01:20:08,878 --> 01:20:12,200 And even though it happened pretty quickly, that's like seven steps 1710 01:20:12,200 --> 01:20:14,040 to happen just to move her in place. 1711 01:20:14,040 --> 01:20:16,580 So what would be marginally smarter perhaps-- 1712 01:20:16,580 --> 01:20:18,960 a little more efficient, perhaps? 1713 01:20:18,960 --> 01:20:19,460 What's that? 1714 01:20:19,460 --> 01:20:19,910 AUDIENCE: Swapping. 1715 01:20:19,910 --> 01:20:20,490 DAVID J. MALAN: Swapping. 1716 01:20:20,490 --> 01:20:21,532 What do you mean by swap? 1717 01:20:21,532 --> 01:20:23,053 AUDIENCE: Replacing swaps. 1718 01:20:23,053 --> 01:20:24,470 DAVID J. MALAN: OK, replace two values. 1719 01:20:24,470 --> 01:20:28,090 So if you want to go back to where you were, one step Over, number 5, 1720 01:20:28,090 --> 01:20:29,560 he's not in the right place. 1721 01:20:29,560 --> 01:20:30,790 He's got to move eventually. 1722 01:20:30,790 --> 01:20:31,498 So you know what? 1723 01:20:31,498 --> 01:20:34,300 If that's where Celeste belongs, why don't we just swap 5 and 0? 1724 01:20:34,300 --> 01:20:36,925 So if you want to go ahead and exchange places with each other. 1725 01:20:36,925 --> 01:20:38,230 Notice what's just happened. 1726 01:20:38,230 --> 01:20:41,420 The problem I'm trying to solve has gotten smaller. 1727 01:20:41,420 --> 01:20:44,020 Instead of being size 8, now it's size 7. 1728 01:20:44,020 --> 01:20:47,050 Now granted, I moved 5 to another wrong location. 1729 01:20:47,050 --> 01:20:48,940 But if these numbers started off randomly, 1730 01:20:48,940 --> 01:20:52,790 it doesn't really matter where 5 goes until we get him into the right place. 1731 01:20:52,790 --> 01:20:54,040 So I think we've improved. 1732 01:20:54,040 --> 01:20:57,430 And now if I go back, my loop is sort of coming back around. 1733 01:20:57,430 --> 01:21:01,690 I can ignore Celeste and make this a seven step problem and not eight 1734 01:21:01,690 --> 01:21:03,430 because I know she's in the right place. 1735 01:21:03,430 --> 01:21:04,840 2 seems to be the smallest. 1736 01:21:04,840 --> 01:21:05,650 I'll remember that. 1737 01:21:05,650 --> 01:21:07,300 Not 7, not 4-- 1738 01:21:07,300 --> 01:21:09,070 1 seems to be the smallest. 1739 01:21:09,070 --> 01:21:13,600 Now I know as a human this should be my next smallest. 1740 01:21:13,600 --> 01:21:18,100 But why, intuitively, should I keep going, do you think? 1741 01:21:18,100 --> 01:21:20,950 I can't sort of optimize as a human and just say, number 1, 1742 01:21:20,950 --> 01:21:22,760 let's get you into the right place. 1743 01:21:22,760 --> 01:21:24,610 I still want to check the whole array. 1744 01:21:24,610 --> 01:21:25,330 Why? 1745 01:21:25,330 --> 01:21:25,870 Yeah. 1746 01:21:25,870 --> 01:21:28,477 AUDIENCE: Perhaps there's another 1. 1747 01:21:28,477 --> 01:21:30,310 DAVID J. MALAN: Maybe there's another 1, and that 1748 01:21:30,310 --> 01:21:32,030 could be another problem altogether. 1749 01:21:32,030 --> 01:21:32,770 Other thoughts? 1750 01:21:32,770 --> 01:21:33,270 Yeah. 1751 01:21:33,270 --> 01:21:34,458 AUDIENCE: Could be another 0 1752 01:21:34,458 --> 01:21:36,250 DAVID J. MALAN: There could be another 0 indeed, 1753 01:21:36,250 --> 01:21:38,560 but I did go through the list once, right? 1754 01:21:38,560 --> 01:21:39,970 And I kind of know there isn't. 1755 01:21:39,970 --> 01:21:40,570 Your thoughts? 1756 01:21:40,570 --> 01:21:43,300 AUDIENCE: You don't know that every value is represented. 1757 01:21:43,300 --> 01:21:47,387 So maybe there's a [INAUDIBLE] You just don't know what kind of data 1758 01:21:47,387 --> 01:21:48,220 you're working with. 1759 01:21:48,220 --> 01:21:50,600 DAVID J. MALAN: Yeah, I don't necessarily know what is there. 1760 01:21:50,600 --> 01:21:54,940 And honestly, I only stipulated earlier that I'm using one variable in my mind. 1761 01:21:54,940 --> 01:21:58,180 I could use two and remember the two smallest elements I've seen. 1762 01:21:58,180 --> 01:22:00,130 I could use three variables, four. 1763 01:22:00,130 --> 01:22:03,440 But then I'm going to start to use a lot of space in addition to time. 1764 01:22:03,440 --> 01:22:06,880 So if I've stipulated that I only have one variable to solve this problem, 1765 01:22:06,880 --> 01:22:09,190 I don't know anything more about these elements 1766 01:22:09,190 --> 01:22:11,398 because the only thing I'm remembering at this moment 1767 01:22:11,398 --> 01:22:13,490 is number 1 is the smallest element I've seen. 1768 01:22:13,490 --> 01:22:14,290 So I'm going to keep going. 1769 01:22:14,290 --> 01:22:14,710 6? 1770 01:22:14,710 --> 01:22:15,070 Nope. 1771 01:22:15,070 --> 01:22:15,490 3? 1772 01:22:15,490 --> 01:22:15,790 Nope. 1773 01:22:15,790 --> 01:22:16,210 5? 1774 01:22:16,210 --> 01:22:16,710 Nope. 1775 01:22:16,710 --> 01:22:18,670 OK, I know that number 1, and your name was-- 1776 01:22:18,670 --> 01:22:19,378 AUDIENCE: Hannah. 1777 01:22:19,378 --> 01:22:21,850 DAVID J. MALAN: --Hannah is the next smallest element. 1778 01:22:21,850 --> 01:22:24,340 I could have everyone move over to make room, but nope. 1779 01:22:24,340 --> 01:22:25,060 2? 1780 01:22:25,060 --> 01:22:26,980 You know, even though you're so close to where I want you, 1781 01:22:26,980 --> 01:22:29,170 I'm just going to keep it simple and swap you two. 1782 01:22:29,170 --> 01:22:31,570 So granted, I've made the problem a little worse. 1783 01:22:31,570 --> 01:22:35,200 But on average, I could get lucky too and just pop number 2 1784 01:22:35,200 --> 01:22:36,280 into the right place. 1785 01:22:36,280 --> 01:22:38,210 Now let me just accelerate this. 1786 01:22:38,210 --> 01:22:42,550 I can now ignore Hannah and Celeste, making the problem size 6 instead of 8. 1787 01:22:42,550 --> 01:22:43,870 So it's getting smaller. 1788 01:22:43,870 --> 01:22:45,190 7 is the smallest. 1789 01:22:45,190 --> 01:22:46,450 Nope, now 4 is-- 1790 01:22:46,450 --> 01:22:47,920 2 is the smallest. 1791 01:22:47,920 --> 01:22:50,470 Still 2, still 2, still 2. 1792 01:22:50,470 --> 01:22:53,560 So let's go ahead and swap 2 and 7. 1793 01:22:53,560 --> 01:22:56,230 And now I'll just kind of orchestrate it verbally. 1794 01:22:56,230 --> 01:22:57,950 4, you're about to have to do something. 1795 01:22:57,950 --> 01:23:01,750 So we now have 4, 7, 6 3, 5. 1796 01:23:01,750 --> 01:23:04,480 OK, 3-- could you swap with 4? 1797 01:23:04,480 --> 01:23:07,675 All right, now we have 7, 6, 4, 5. 1798 01:23:07,675 --> 01:23:10,420 OK, 4, could you swap with 7? 1799 01:23:10,420 --> 01:23:13,180 Now we have 6, 7, 5. 1800 01:23:13,180 --> 01:23:15,430 5, could you swap with 6? 1801 01:23:15,430 --> 01:23:16,835 And now we have 7, 6. 1802 01:23:16,835 --> 01:23:18,160 6, would you swap at 7? 1803 01:23:18,160 --> 01:23:19,690 And now perhaps round of applause. 1804 01:23:19,690 --> 01:23:20,980 They've sorted themselves. 1805 01:23:20,980 --> 01:23:25,250 OK, hang on there one minute. 1806 01:23:25,250 --> 01:23:27,130 So we'll do this one other approach. 1807 01:23:27,130 --> 01:23:30,220 And my God, that felt so much slower than the first approach, 1808 01:23:30,220 --> 01:23:33,040 but that's, one, because I was kind of providing a long voiceover. 1809 01:23:33,040 --> 01:23:37,690 But two, we were doing one thing at a time whereas the first time, you guys 1810 01:23:37,690 --> 01:23:41,110 had the luxury of moving like eight different CPUs-- 1811 01:23:41,110 --> 01:23:43,690 brains, if you will-- were all operating at the same time. 1812 01:23:43,690 --> 01:23:45,140 And computers like that exist. 1813 01:23:45,140 --> 01:23:47,893 If you have a computer with multiple cores, so to speak, 1814 01:23:47,893 --> 01:23:49,810 that's like having a computer that technically 1815 01:23:49,810 --> 01:23:51,490 can do multiple things at once. 1816 01:23:51,490 --> 01:23:54,470 But software typically, at least as we've written it thus far, 1817 01:23:54,470 --> 01:23:56,195 can only do one thing at a time. 1818 01:23:56,195 --> 01:23:58,070 So in a bit, we'll add up all of these steps. 1819 01:23:58,070 --> 01:23:59,862 But for now, let's take one other approach. 1820 01:23:59,862 --> 01:24:02,140 If you all could reorder yourselves like that-- 1821 01:24:02,140 --> 01:24:06,940 52741630-- let's take the other approach that 1822 01:24:06,940 --> 01:24:10,610 was recommended by just fixing small problems and see where this gets us. 1823 01:24:10,610 --> 01:24:12,730 So we're back in the original order. 1824 01:24:12,730 --> 01:24:14,812 5 and 2 are clearly out of order. 1825 01:24:14,812 --> 01:24:15,520 So you know what? 1826 01:24:15,520 --> 01:24:17,350 Let's just bite this problem off now. 1827 01:24:17,350 --> 01:24:19,090 5 and 2, could you swap? 1828 01:24:19,090 --> 01:24:20,530 Now let me take a next step. 1829 01:24:20,530 --> 01:24:22,330 5 and 7, I think you're OK. 1830 01:24:22,330 --> 01:24:25,000 There's a gap, yes, but that might not be a big deal. 1831 01:24:25,000 --> 01:24:26,360 7 and 4-- problem. 1832 01:24:26,360 --> 01:24:28,420 Let's have you swap. 1833 01:24:28,420 --> 01:24:31,150 OK, 7 and 1, let's have you swap. 1834 01:24:31,150 --> 01:24:34,120 7 and 6, let's have you swap. 1835 01:24:34,120 --> 01:24:35,830 7 and 3, you swap. 1836 01:24:35,830 --> 01:24:37,690 7 and 0, you swap. 1837 01:24:37,690 --> 01:24:39,490 Now let me pause for just a moment. 1838 01:24:39,490 --> 01:24:40,790 Still not sorted. 1839 01:24:40,790 --> 01:24:42,470 So I'm clearly not done. 1840 01:24:42,470 --> 01:24:45,340 But have I improved the problem? 1841 01:24:45,340 --> 01:24:48,100 Right, I can't see-- like before, I can't optimize like 1842 01:24:48,100 --> 01:24:50,210 before because 0 is obviously not here. 1843 01:24:50,210 --> 01:24:54,160 So unless they're still way back there, so it's not like I've gone from 8 steps 1844 01:24:54,160 --> 01:24:56,300 to 7 to 6 just yet. 1845 01:24:56,300 --> 01:24:58,047 But have I made any improvements? 1846 01:24:58,047 --> 01:24:58,630 AUDIENCE: Yes. 1847 01:24:58,630 --> 01:24:59,255 DAVID J. MALAN: Yes. 1848 01:24:59,255 --> 01:25:01,480 In what sense is this improved? 1849 01:25:01,480 --> 01:25:06,050 What's a concrete thing you could point to is better? 1850 01:25:06,050 --> 01:25:06,550 Yeah. 1851 01:25:06,550 --> 01:25:08,110 AUDIENCE: Sorted the highest number. 1852 01:25:08,110 --> 01:25:10,690 DAVID J. MALAN: I've sorted the highest number, which is indeed 7. 1853 01:25:10,690 --> 01:25:15,400 And conversely, if you prefer, Celeste is one step closer to the beginning. 1854 01:25:15,400 --> 01:25:19,970 Now worst case, Celeste is going to have to move one step on each iteration. 1855 01:25:19,970 --> 01:25:22,540 So I might need to do this thing like n total times 1856 01:25:22,540 --> 01:25:24,215 to move her all the way over. 1857 01:25:24,215 --> 01:25:25,340 But that might work out OK. 1858 01:25:25,340 --> 01:25:26,510 Let me see. 1859 01:25:26,510 --> 01:25:27,980 2 and 5, you're good. 1860 01:25:27,980 --> 01:25:29,480 5 and 4, swap you. 1861 01:25:29,480 --> 01:25:31,370 5 and 1, let's swap you. 1862 01:25:31,370 --> 01:25:32,570 5 and 6, you're good. 1863 01:25:32,570 --> 01:25:34,550 6 and 3, let's swap you. 1864 01:25:34,550 --> 01:25:36,650 6 and 0, let's swap you. 1865 01:25:36,650 --> 01:25:38,240 6 and 7, you're good. 1866 01:25:38,240 --> 01:25:39,380 And I think now-- 1867 01:25:39,380 --> 01:25:41,510 notice that the high values, as you noted, 1868 01:25:41,510 --> 01:25:44,150 are sort of bubbling up, if you will, to the end of the list. 1869 01:25:44,150 --> 01:25:45,530 2 and 4, you're good. 1870 01:25:45,530 --> 01:25:46,820 4 and 1, let's swap. 1871 01:25:46,820 --> 01:25:48,080 4 and 5, good. 1872 01:25:48,080 --> 01:25:49,430 5 and 3, swap. 1873 01:25:49,430 --> 01:25:51,780 5 and 0, swap. 1874 01:25:51,780 --> 01:25:53,370 5, 6, 7, of course, are good. 1875 01:25:53,370 --> 01:25:56,175 So now you can sort of see the problem resolving itself. 1876 01:25:56,175 --> 01:25:57,800 And let's just do this part now faster. 1877 01:25:57,800 --> 01:26:00,140 2 and 1, 2 and 4. 1878 01:26:00,140 --> 01:26:04,160 OK, 4 and 3, 4 and 0. 1879 01:26:04,160 --> 01:26:08,877 All right, now 1 and 2, 2, and 3, and 0, and good. 1880 01:26:08,877 --> 01:26:10,460 So we do have some optimization there. 1881 01:26:10,460 --> 01:26:12,860 We don't need to keep going because those all are sorted. 1882 01:26:12,860 --> 01:26:13,910 1 and 2, you're good. 1883 01:26:13,910 --> 01:26:16,730 2 and 0, all right, done. 1884 01:26:16,730 --> 01:26:20,330 1 and 0-- and big round of applause in closing. 1885 01:26:20,330 --> 01:26:24,890 OK, so thank you all. 1886 01:26:24,890 --> 01:26:27,140 We need the puppets back, but you can keep the shirts. 1887 01:26:27,140 --> 01:26:28,840 Thank you for volunteering here. 1888 01:26:28,840 --> 01:26:31,810 Feel free to make your way exits left or right. 1889 01:26:31,810 --> 01:26:33,760 And let's see if, thanks to our volunteers 1890 01:26:33,760 --> 01:26:40,000 here, we can't now formalize a little bit what we did on both passes here. 1891 01:26:40,000 --> 01:26:44,170 I claim that the first algorithm our volunteers kindly acted out 1892 01:26:44,170 --> 01:26:45,730 is what's called selection sort. 1893 01:26:45,730 --> 01:26:50,980 And as the name implied, we selected the smallest elements again and again 1894 01:26:50,980 --> 01:26:53,530 and again, working our way from left to right, 1895 01:26:53,530 --> 01:26:58,100 putting Celeste into the right place, and then continuing with everyone else. 1896 01:26:58,100 --> 01:27:01,060 So selection sort, as it's formally called, 1897 01:27:01,060 --> 01:27:04,000 can be described, for instance, with this pseudo code here-- 1898 01:27:04,000 --> 01:27:06,910 4i from 0 to n minus 1. 1899 01:27:06,910 --> 01:27:08,350 And again, why this? 1900 01:27:08,350 --> 01:27:10,420 This is just how talk about arrays. 1901 01:27:10,420 --> 01:27:14,920 The left end is 0, the right end is n minus 1 where in this case, 1902 01:27:14,920 --> 01:27:16,720 n happened to be eight people. 1903 01:27:16,720 --> 01:27:18,700 So that's 0 through 7. 1904 01:27:18,700 --> 01:27:22,240 So for i from 0 to n minus 1, what did I do? 1905 01:27:22,240 --> 01:27:27,400 I found the smallest number between numbers bracket i and numbers bracket 1906 01:27:27,400 --> 01:27:28,810 n minus 1. 1907 01:27:28,810 --> 01:27:31,030 It's a little cryptic at first glance, but this 1908 01:27:31,030 --> 01:27:34,510 is just a very pseudo code-like way of saying 1909 01:27:34,510 --> 01:27:37,960 find the smallest element among all eight volunteers 1910 01:27:37,960 --> 01:27:43,120 because if i starts at 0 and n minus 1 never changes because there's always 1911 01:27:43,120 --> 01:27:47,230 8, 8 people, so 8 minus 1 is 7, this first 1912 01:27:47,230 --> 01:27:50,320 says find the smallest number between numbers bracket 0 1913 01:27:50,320 --> 01:27:53,350 and numbers bracket 7, if you will. 1914 01:27:53,350 --> 01:27:54,550 Then what do I do? 1915 01:27:54,550 --> 01:27:57,730 Swap the smallest number with numbers bracket i. 1916 01:27:57,730 --> 01:28:01,210 So that's how we got Celeste from over here all the way over there. 1917 01:28:01,210 --> 01:28:03,340 We just swapped those two values. 1918 01:28:03,340 --> 01:28:05,840 What then happens next in this pseudo code? 1919 01:28:05,840 --> 01:28:07,900 i, of course, goes from 0 to 1. 1920 01:28:07,900 --> 01:28:10,060 And that's the technical way of saying now 1921 01:28:10,060 --> 01:28:13,810 find the smallest element among the 7 remaining volunteers, 1922 01:28:13,810 --> 01:28:17,570 ignoring Celeste this time because she was already in the correct location. 1923 01:28:17,570 --> 01:28:19,960 So the problem went from size 8 to size 7. 1924 01:28:19,960 --> 01:28:23,500 And if we repeat, size 6, 5, 4, 3, 2, 1, until boom, 1925 01:28:23,500 --> 01:28:25,820 it's all done at the very end. 1926 01:28:25,820 --> 01:28:29,200 So this is just one way of expressing in pseudo code what 1927 01:28:29,200 --> 01:28:33,040 we did a little more organically and a formalization of what someone 1928 01:28:33,040 --> 01:28:35,420 volunteered out in the audience. 1929 01:28:35,420 --> 01:28:40,300 So if we consider, then, the efficiency of this algorithm, 1930 01:28:40,300 --> 01:28:42,730 maybe abstracting it away now as a bunch of doors 1931 01:28:42,730 --> 01:28:46,960 where the left most again is always 0, the right most is always n minus 1, 1932 01:28:46,960 --> 01:28:50,350 or equivalently, the second to last is n minus 2, the third to last 1933 01:28:50,350 --> 01:28:54,550 is n minus 3 where n might be 8 or anything else, 1934 01:28:54,550 --> 01:28:59,620 how do we think about or quantify the running time of selection sort? 1935 01:28:59,620 --> 01:29:02,230 Big O of what? 1936 01:29:02,230 --> 01:29:05,230 I mean, that was a lot of steps to be adding up. 1937 01:29:05,230 --> 01:29:09,130 It's probably more than n, right, because I went through the list 1938 01:29:09,130 --> 01:29:10,030 again and again. 1939 01:29:10,030 --> 01:29:14,740 It was like n plus n minus 1 plus n minus 2. 1940 01:29:14,740 --> 01:29:17,080 Any instincts here? 1941 01:29:17,080 --> 01:29:20,620 We got like the whole team in the orchestra now. 1942 01:29:20,620 --> 01:29:25,180 Let me propose we think about it this way with just a bit of formula, say. 1943 01:29:25,180 --> 01:29:29,350 So the first time, I had to look at n different volunteers. 1944 01:29:29,350 --> 01:29:33,130 n was 8 in this case, but generically, I looked at all eight numbers 1945 01:29:33,130 --> 01:29:35,230 in order to decide who was the smallest. 1946 01:29:35,230 --> 01:29:37,270 And sure enough, Celeste was at the very end. 1947 01:29:37,270 --> 01:29:39,103 She happened to be all the way to the right. 1948 01:29:39,103 --> 01:29:43,510 But I only knew that once I looked at all 8 or all n volunteers. 1949 01:29:43,510 --> 01:29:45,880 So that took me n steps first. 1950 01:29:45,880 --> 01:29:49,270 But once the list was swapped into the right place, then 1951 01:29:49,270 --> 01:29:53,290 my problem with size n minus 1, and I had n minus 1 other people 1952 01:29:53,290 --> 01:29:54,320 to look through. 1953 01:29:54,320 --> 01:29:55,870 So that's n minus 1 steps. 1954 01:29:55,870 --> 01:29:59,825 Then after that, it's n minus 2 plus n minus 3 plus n minus 4 plus dot dot 1955 01:29:59,825 --> 01:30:01,270 dot until I had one final step. 1956 01:30:01,270 --> 01:30:04,460 And it's obvious that I only have one human left to consider. 1957 01:30:04,460 --> 01:30:07,180 So we might wave our hands at this with a little ellipsis 1958 01:30:07,180 --> 01:30:10,400 and just say dot dot dot plus 1 for the final step. 1959 01:30:10,400 --> 01:30:11,890 Now what does this actually equal? 1960 01:30:11,890 --> 01:30:13,480 Well, this is where you might think back on, like, 1961 01:30:13,480 --> 01:30:15,400 your high school math or physics textbook that 1962 01:30:15,400 --> 01:30:18,640 has a little cheat sheet at the end that shows these kinds of recurrences. 1963 01:30:18,640 --> 01:30:21,490 That happens to work out mathematically to be 1964 01:30:21,490 --> 01:30:25,120 n times n plus 1 all divided by 2. 1965 01:30:25,120 --> 01:30:28,690 That's just what that recurrence, that series, actually adds up to. 1966 01:30:28,690 --> 01:30:31,300 So if you take on faith that that math is correct, let's 1967 01:30:31,300 --> 01:30:35,530 just now multiply this out mathematically. 1968 01:30:35,530 --> 01:30:41,920 That's n squared plus n divided by 2 or n squared divided by 2 plus n over 2. 1969 01:30:41,920 --> 01:30:44,890 And here's where we're starting to get annoyingly into the weeds. 1970 01:30:44,890 --> 01:30:50,080 Like, honestly, as n gets really large, like a million doors or integers 1971 01:30:50,080 --> 01:30:54,940 or a billion web pages in Google search engine, honestly, which of these terms 1972 01:30:54,940 --> 01:30:57,400 is going to matter the most mathematically 1973 01:30:57,400 --> 01:30:59,200 if n is a really big number? 1974 01:30:59,200 --> 01:31:01,840 Is n squared divided by 2 the dominant factor, 1975 01:31:01,840 --> 01:31:04,260 or is n divided by 2 the dominant factor? 1976 01:31:04,260 --> 01:31:05,995 AUDIENCE: n squared. 1977 01:31:05,995 --> 01:31:07,120 DAVID J. MALAN: Yeah, n squared. 1978 01:31:07,120 --> 01:31:09,290 I mean, no matter what n is-- and the bigger it is, 1979 01:31:09,290 --> 01:31:12,580 the bigger raising it to the power 2 is going to be. 1980 01:31:12,580 --> 01:31:13,330 So you know what? 1981 01:31:13,330 --> 01:31:16,270 Let's just wave our hands at this because at the end of the day, 1982 01:31:16,270 --> 01:31:19,780 as n gets really large, the dominant factor is indeed that first one. 1983 01:31:19,780 --> 01:31:20,530 And you know what? 1984 01:31:20,530 --> 01:31:24,290 Even the divided 2, as I claimed earlier with our two phone book examples, where 1985 01:31:24,290 --> 01:31:26,960 the two straight lines if you keep zooming out essentially 1986 01:31:26,960 --> 01:31:31,760 looked the same when n is large enough, let's just call this on the order of n 1987 01:31:31,760 --> 01:31:32,490 squared. 1988 01:31:32,490 --> 01:31:37,130 So that is to say a computer scientist would describe bubble sort as taking 1989 01:31:37,130 --> 01:31:39,860 on the order of n squared steps. 1990 01:31:39,860 --> 01:31:41,510 That's an oversimplification. 1991 01:31:41,510 --> 01:31:44,600 If we really added it up, it's actually this many steps-- n 1992 01:31:44,600 --> 01:31:46,610 squared divided by 2 plus n over 2. 1993 01:31:46,610 --> 01:31:50,420 But again, if we want to just be able to generally compare two algorithms' 1994 01:31:50,420 --> 01:31:53,810 performance, I think it's going to suffice if we look at that highest 1995 01:31:53,810 --> 01:31:59,720 order term to get a sense of what the algorithm feels like, if you will, 1996 01:31:59,720 --> 01:32:02,540 or what it even looks like graphically. 1997 01:32:02,540 --> 01:32:06,170 All right, so with that said, we might describe bubble sort 1998 01:32:06,170 --> 01:32:07,790 as being in big O-- 1999 01:32:07,790 --> 01:32:11,700 sorry, selection sort as being in big O of n squared. 2000 01:32:11,700 --> 01:32:17,030 But what if we consider now the best case scenario-- an opportunity 2001 01:32:17,030 --> 01:32:19,070 to talk about a lower bound? 2002 01:32:19,070 --> 01:32:23,305 In the best case, how many steps does selection sort take? 2003 01:32:23,305 --> 01:32:24,680 Well, here, we need some context. 2004 01:32:24,680 --> 01:32:27,320 Like, what does it mean to be the best case or the worst case 2005 01:32:27,320 --> 01:32:29,090 when it comes to sorting? 2006 01:32:29,090 --> 01:32:32,600 Like, what could you imagine meaning the best possible scenario when you're 2007 01:32:32,600 --> 01:32:35,651 trying to sort a bunch of numbers? 2008 01:32:35,651 --> 01:32:37,100 I got the whole crew here again. 2009 01:32:37,100 --> 01:32:37,400 Yeah. 2010 01:32:37,400 --> 01:32:39,050 AUDIENCE: They would already be sorted. 2011 01:32:39,050 --> 01:32:40,370 DAVID J. MALAN: All right, they're already sorted, right? 2012 01:32:40,370 --> 01:32:44,030 I can't really imagine a better scenario than I have to sort some numbers, 2013 01:32:44,030 --> 01:32:46,040 but they're already sorted for me. 2014 01:32:46,040 --> 01:32:51,170 But does this algorithm leverage that fact in practice? 2015 01:32:51,170 --> 01:32:54,110 Even if all of our humans had lined up from 0 to 7, 2016 01:32:54,110 --> 01:32:56,930 I'm pretty sure I would have pretty naively started here. 2017 01:32:56,930 --> 01:32:58,670 And yes, Celeste happens to be here. 2018 01:32:58,670 --> 01:33:03,375 But I only know she needs to be here once I've looked at all eight people. 2019 01:33:03,375 --> 01:33:06,000 And then I would have realized, well, that was a waste of time. 2020 01:33:06,000 --> 01:33:07,280 I can leave Celeste be. 2021 01:33:07,280 --> 01:33:09,500 But then what would I have done? 2022 01:33:09,500 --> 01:33:12,860 I would have ignored her position because we've solved one problem. 2023 01:33:12,860 --> 01:33:16,260 I would have done the same thing now for seven people, then six people. 2024 01:33:16,260 --> 01:33:19,320 So every time I walk through, I'm not doing much useful work. 2025 01:33:19,320 --> 01:33:21,860 But I am doing those comparisons because I 2026 01:33:21,860 --> 01:33:25,500 don't know until I do the work that the people were in the right order. 2027 01:33:25,500 --> 01:33:30,440 So this would seem to imply that the omega notation, the best case 2028 01:33:30,440 --> 01:33:33,740 scenario, even, a lower bound on the running time would be what, then? 2029 01:33:33,740 --> 01:33:35,840 AUDIENCE: [INAUDIBLE] 2030 01:33:35,840 --> 01:33:37,470 DAVID J. MALAN: A little louder? 2031 01:33:37,470 --> 01:33:38,410 AUDIENCE: N squared. 2032 01:33:38,410 --> 01:33:40,250 DAVID J. MALAN: It's still going to be n squared, 2033 01:33:40,250 --> 01:33:45,820 in fact, because the code I'm giving myself doesn't leverage or benefit 2034 01:33:45,820 --> 01:33:50,260 from any of that scenario because it just mindlessly continues 2035 01:33:50,260 --> 01:33:51,770 to do this again and again. 2036 01:33:51,770 --> 01:33:57,010 So in this case, yes, I would claim that the omega notation for selection sort 2037 01:33:57,010 --> 01:33:58,830 is also big O of n squared. 2038 01:33:58,830 --> 01:34:00,580 So those are the kinds of numbers to beat. 2039 01:34:00,580 --> 01:34:03,760 It seems like the upper bound and lower bound of selection 2040 01:34:03,760 --> 01:34:06,130 sort are indeed n squared. 2041 01:34:06,130 --> 01:34:08,380 And so we can also describe selection sort, therefore, 2042 01:34:08,380 --> 01:34:09,717 as being in theta of n squared. 2043 01:34:09,717 --> 01:34:12,550 That's the first algorithm we've had the chance to describe that in, 2044 01:34:12,550 --> 01:34:14,650 which is to say that it's kind of slow. 2045 01:34:14,650 --> 01:34:16,595 I mean, maybe other algorithms are slower, 2046 01:34:16,595 --> 01:34:18,220 but this isn't the best starting point. 2047 01:34:18,220 --> 01:34:19,370 Can we do better? 2048 01:34:19,370 --> 01:34:22,930 Well, there's a reason that I guided us to doing the second algorithm second. 2049 01:34:22,930 --> 01:34:25,430 Even though you verbally proposed them in a different order, 2050 01:34:25,430 --> 01:34:28,600 this second algorithm we did is generally known as bubble sort. 2051 01:34:28,600 --> 01:34:31,210 And I deliberately used that word a bit ago, 2052 01:34:31,210 --> 01:34:34,930 saying the big values are bubbling their way up to the right 2053 01:34:34,930 --> 01:34:37,990 to kind of capture the fact that, indeed, this algorithm works 2054 01:34:37,990 --> 01:34:38,620 differently. 2055 01:34:38,620 --> 01:34:40,870 But let's consider if it's better or worse. 2056 01:34:40,870 --> 01:34:43,960 So here, we have pseudo code for bubble sort. 2057 01:34:43,960 --> 01:34:45,930 You could write this too in different ways. 2058 01:34:45,930 --> 01:34:48,550 But let's consider what we did on the stage. 2059 01:34:48,550 --> 01:34:51,850 We repeated the following n minus 1 times. 2060 01:34:51,850 --> 01:34:55,390 We initialized at least, even though I didn't verbalize it this way, 2061 01:34:55,390 --> 01:35:00,430 a variable like i from 0 to n minus 2, n minus 2. 2062 01:35:00,430 --> 01:35:01,810 And then I asked this question. 2063 01:35:01,810 --> 01:35:08,470 If numbers bracket i and numbers bracket i plus 1 are out of order, 2064 01:35:08,470 --> 01:35:10,250 then swap them. 2065 01:35:10,250 --> 01:35:12,580 So again, I just did it more intuitively by pointing, 2066 01:35:12,580 --> 01:35:14,950 but this would be a way, with a bit of pseudo code, 2067 01:35:14,950 --> 01:35:16,147 to describe what's going on. 2068 01:35:16,147 --> 01:35:18,730 But notice that I'm doing something a little differently here. 2069 01:35:18,730 --> 01:35:22,730 I'm iterating from if equals 0 to n minus 2. 2070 01:35:22,730 --> 01:35:23,230 Why? 2071 01:35:23,230 --> 01:35:26,740 Well, if I'm comparing two things, left hand and right hand, 2072 01:35:26,740 --> 01:35:28,690 I'd still want to start at 0. 2073 01:35:28,690 --> 01:35:31,450 But I don't want to go all the way to n minus 1 2074 01:35:31,450 --> 01:35:35,048 because then, I'd be going past the boundary of my array, which 2075 01:35:35,048 --> 01:35:35,590 would be bad. 2076 01:35:35,590 --> 01:35:38,080 I want to make sure that my left hand-- i, if you will-- 2077 01:35:38,080 --> 01:35:42,820 stops at n minus 2 so that when I plus 1 in my pseudo code, 2078 01:35:42,820 --> 01:35:45,610 I'm looking at the last two elements, not the last element 2079 01:35:45,610 --> 01:35:46,900 and then pass the boundary. 2080 01:35:46,900 --> 01:35:48,733 That's actually a common programming mistake 2081 01:35:48,733 --> 01:35:50,620 that we'll undoubtedly soon make by going 2082 01:35:50,620 --> 01:35:52,880 beyond the boundaries of your array. 2083 01:35:52,880 --> 01:35:59,530 So this pseudo code, then, allows me to say compare every one again and again 2084 01:35:59,530 --> 01:36:01,730 and swap them if they're out of order. 2085 01:36:01,730 --> 01:36:06,190 Why do I repeat the whole thing n minus 1 times? 2086 01:36:06,190 --> 01:36:11,500 Like, why does it not suffice just to do this loop here? 2087 01:36:11,500 --> 01:36:14,980 Think what happened with Celeste. 2088 01:36:14,980 --> 01:36:19,480 Why do I repeat this whole thing n minus 1 times? 2089 01:36:19,480 --> 01:36:20,230 Yeah, in the back? 2090 01:36:20,230 --> 01:36:27,855 AUDIENCE: [INAUDIBLE] 2091 01:36:27,855 --> 01:36:30,230 DAVID J. MALAN: Indeed, and I think if I can recap accurately, 2092 01:36:30,230 --> 01:36:31,640 think back to Celeste again. 2093 01:36:31,640 --> 01:36:34,130 And I'm sorry to keep calling on you as our number 0. 2094 01:36:34,130 --> 01:36:37,670 Each time through bubble sort, she only moved one step. 2095 01:36:37,670 --> 01:36:41,280 And so in total, if there's n locations, at the end of the day, 2096 01:36:41,280 --> 01:36:45,930 she needs to move n minus 1 steps to get 0 all the way to where it needs to be. 2097 01:36:45,930 --> 01:36:50,330 And so this inner loop, if you will, where we're iterating using i, 2098 01:36:50,330 --> 01:36:52,700 that just fixes some of the problems. 2099 01:36:52,700 --> 01:36:56,150 But it doesn't fix all of the problems until we do that same logic again 2100 01:36:56,150 --> 01:36:57,660 and again and again. 2101 01:36:57,660 --> 01:37:01,410 And so how might we quantify the running time of this algorithm? 2102 01:37:01,410 --> 01:37:04,670 Well, one way to see it is to just literally look at the pseudo code. 2103 01:37:04,670 --> 01:37:09,170 The outer loop repeats n minus 1 times by definition. 2104 01:37:09,170 --> 01:37:10,490 It literally says that. 2105 01:37:10,490 --> 01:37:15,960 The inner loop, the for loop, also iterates n minus 1 times. 2106 01:37:15,960 --> 01:37:16,460 Why? 2107 01:37:16,460 --> 01:37:18,540 Because it's going from 0 to n minus 2. 2108 01:37:18,540 --> 01:37:22,580 And if that's hard to think about, that's the same thing is 1 to n minus 1 2109 01:37:22,580 --> 01:37:25,320 if you just add 1 to both ends of the formula. 2110 01:37:25,320 --> 01:37:29,910 So that means you're doing n minus 1 things n minus 1 times. 2111 01:37:29,910 --> 01:37:32,540 So I literally multiply how many times the outer loop 2112 01:37:32,540 --> 01:37:35,960 is running by how many times the inner loop is running, which gives me 2113 01:37:35,960 --> 01:37:39,320 sort of FOIL method n minus 1 squared. 2114 01:37:39,320 --> 01:37:41,300 And I could multiply that whole thing out. 2115 01:37:41,300 --> 01:37:44,330 Well, let's consider this just a little more methodically here. 2116 01:37:44,330 --> 01:37:48,382 If I have n minus 1 on the outer, n minus 1 on the inner-- 2117 01:37:48,382 --> 01:37:49,590 let's go ahead and FOIL this. 2118 01:37:49,590 --> 01:37:53,180 So n squared minus n minus n plus 1, combine 2119 01:37:53,180 --> 01:37:56,570 like terms-- n squared minus 2n plus 1. 2120 01:37:56,570 --> 01:38:01,470 And now which of these terms is clearly going to be dominant, so to speak? 2121 01:38:01,470 --> 01:38:01,970 The-- 2122 01:38:01,970 --> 01:38:02,450 AUDIENCE: N squared. 2123 01:38:02,450 --> 01:38:03,720 DAVID J. MALAN: --the n squared. 2124 01:38:03,720 --> 01:38:06,260 So yes, even though minus 2n is a good thing 2125 01:38:06,260 --> 01:38:08,660 because it's subtracting off some of the time required, 2126 01:38:08,660 --> 01:38:11,690 plus 1 is not that big a thing, there's such drops in the bucket when 2127 01:38:11,690 --> 01:38:14,120 n gets really large, like in the millions or billions, 2128 01:38:14,120 --> 01:38:18,510 certainly, that bubble sort 2 is on the order of n squared. 2129 01:38:18,510 --> 01:38:21,110 It's not the same exactly as selection sort. 2130 01:38:21,110 --> 01:38:23,030 But as n gets big, honestly, we're barely 2131 01:38:23,030 --> 01:38:25,530 going to be able to notice the difference most likely. 2132 01:38:25,530 --> 01:38:29,190 And so it too might be said to be on the order of n squared. 2133 01:38:29,190 --> 01:38:34,530 And if we consider now the lower bound on bubble sort's running time, 2134 01:38:34,530 --> 01:38:38,720 here's where things get potentially interesting. 2135 01:38:38,720 --> 01:38:44,450 What might you claim is the running time of bubble sort in the best case? 2136 01:38:44,450 --> 01:38:48,230 And the best case, I claim, is when the numbers are already sorted. 2137 01:38:48,230 --> 01:38:50,720 Is our pseudo code going to take that into account? 2138 01:38:50,720 --> 01:38:52,442 AUDIENCE: N 2139 01:38:52,442 --> 01:38:53,150 DAVID J. MALAN: OK, n. 2140 01:38:53,150 --> 01:38:54,350 Why do you propose n? 2141 01:38:54,350 --> 01:38:59,043 AUDIENCE: [INAUDIBLE] 2142 01:38:59,043 --> 01:39:00,710 DAVID J. MALAN: Yes, and that's the key word. 2143 01:39:00,710 --> 01:39:05,150 To summarize, in bubble sort, I do have to minimally make one pass because if I 2144 01:39:05,150 --> 01:39:07,760 don't look at all n elements, that I'm theoretically 2145 01:39:07,760 --> 01:39:09,260 just guessing if it's sorted or not. 2146 01:39:09,260 --> 01:39:11,480 Like, I obviously intuitively have to look 2147 01:39:11,480 --> 01:39:14,390 at every element to decide yay or nay, it's in the right order. 2148 01:39:14,390 --> 01:39:17,180 And my original pseudo code, though, is pretty naive. 2149 01:39:17,180 --> 01:39:22,640 It's just going to blindly go back and forth n minus 1 times again and again, 2150 01:39:22,640 --> 01:39:23,810 and that's going to add up. 2151 01:39:23,810 --> 01:39:25,760 But what if I add a bit of an optimization 2152 01:39:25,760 --> 01:39:28,010 that you might have glimpsed on the slide a moment ago 2153 01:39:28,010 --> 01:39:31,610 where if I compare two people and I don't swap them, compare two people, 2154 01:39:31,610 --> 01:39:34,550 don't swap them, and I go all the way through the list comparing 2155 01:39:34,550 --> 01:39:38,120 every pair of adjacent people, and I make no swaps, 2156 01:39:38,120 --> 01:39:40,820 it would be kind of not just naive but stupid 2157 01:39:40,820 --> 01:39:44,360 to do that same process again because if the humans have not moved, 2158 01:39:44,360 --> 01:39:47,150 I'm not going to make any different decisions. 2159 01:39:47,150 --> 01:39:49,740 I'm going to do nothing again, nothing again. 2160 01:39:49,740 --> 01:39:53,150 So at that point, it would be stupid, very inefficient, to go back and forth 2161 01:39:53,150 --> 01:39:54,210 and back and forth. 2162 01:39:54,210 --> 01:39:58,070 So if I modify our pseudo code with just an additional if condition, 2163 01:39:58,070 --> 01:40:00,050 I bet we can speed this up. 2164 01:40:00,050 --> 01:40:06,200 Inside of that same pseudo code, what if I say, hey, if no swaps, quit? 2165 01:40:06,200 --> 01:40:10,010 Like quit, prematurely before the loops are finished running. 2166 01:40:10,010 --> 01:40:13,380 One of the loops has gone through per the indentation here. 2167 01:40:13,380 --> 01:40:16,190 But if I do a loop from left to right and I 2168 01:40:16,190 --> 01:40:18,680 have made no swaps, which you can think of as just being 2169 01:40:18,680 --> 01:40:22,250 one other variable that's plus plusing as I go keeping, track of how many 2170 01:40:22,250 --> 01:40:22,820 swaps-- 2171 01:40:22,820 --> 01:40:24,740 if I've made no swaps from left to right, 2172 01:40:24,740 --> 01:40:27,240 I'm not going to make any swaps the next time around either. 2173 01:40:27,240 --> 01:40:30,030 So let's just quit at that point. 2174 01:40:30,030 --> 01:40:32,630 And that is to say in the best case, if you will, 2175 01:40:32,630 --> 01:40:36,410 when the list is already sorted, the omega notation for bubble sort 2176 01:40:36,410 --> 01:40:41,460 might indeed be omega of n if you add that optimization 2177 01:40:41,460 --> 01:40:44,060 so as to short circuit all of that inefficient 2178 01:40:44,060 --> 01:40:49,830 looping to do it only as many times as is necessary. 2179 01:40:49,830 --> 01:40:52,280 Let me pause to see if there's any questions here. 2180 01:40:52,280 --> 01:40:53,192 Yeah. 2181 01:40:53,192 --> 01:41:02,038 AUDIENCE: [INAUDIBLE] to optimize the running time for all cases possible? 2182 01:41:02,038 --> 01:41:03,080 DAVID J. MALAN: Good question. 2183 01:41:03,080 --> 01:41:08,690 If the running time of selection sort and bubble sort are both in big O 2184 01:41:08,690 --> 01:41:14,480 of n squared but selection sort is in omega of n squared while bubble sort is 2185 01:41:14,480 --> 01:41:17,300 in omega of n, which sounds better-- 2186 01:41:17,300 --> 01:41:20,510 I think if I may, should we just always use bubble sort? 2187 01:41:20,510 --> 01:41:24,680 Yes if we think that we might benefit over time 2188 01:41:24,680 --> 01:41:29,250 from a lot of good case scenarios or best case scenarios. 2189 01:41:29,250 --> 01:41:31,340 However, the goal at hand in just a bit is 2190 01:41:31,340 --> 01:41:33,480 going to be to do even better than both of these. 2191 01:41:33,480 --> 01:41:35,330 So hold that question further for a moment. 2192 01:41:35,330 --> 01:41:35,830 Yeah. 2193 01:41:35,830 --> 01:41:41,357 AUDIENCE: [INAUDIBLE] n minus 1? 2194 01:41:41,357 --> 01:41:41,940 DAVID J. MALAN: No. 2195 01:41:41,940 --> 01:41:43,080 So yes, good question. 2196 01:41:43,080 --> 01:41:46,890 So I say omega of n, but is it technically omega of n minus 1? 2197 01:41:46,890 --> 01:41:50,110 Maybe, but again, we're throwing away lower order terms. 2198 01:41:50,110 --> 01:41:53,670 And that's an advantage because we're not comparing things ever so precisely. 2199 01:41:53,670 --> 01:41:57,180 Just like I plotted with the green and yellow and red chart, 2200 01:41:57,180 --> 01:41:59,980 I just want to get a sense of the shape of these algorithms 2201 01:41:59,980 --> 01:42:03,330 so that when n gets really large, which of these choices 2202 01:42:03,330 --> 01:42:05,185 is going to matter the most? 2203 01:42:05,185 --> 01:42:07,560 At the end of the day, it's actually perfectly reasonable 2204 01:42:07,560 --> 01:42:09,460 to use selection sort or bubble sort if you 2205 01:42:09,460 --> 01:42:12,210 don't have that much data because they're going to be pretty fast. 2206 01:42:12,210 --> 01:42:14,520 My God, our computers nowadays are 1 gigahertz, 2207 01:42:14,520 --> 01:42:18,440 2 gigahertz, 1 billion things per second, 2 billion things per second. 2208 01:42:18,440 --> 01:42:20,940 But if we have large data sets, as we will later in the term 2209 01:42:20,940 --> 01:42:23,470 and as you might in the real world, that the Googles of the world, 2210 01:42:23,470 --> 01:42:25,530 then you're going to want to be more thoughtful. 2211 01:42:25,530 --> 01:42:27,240 And that's where we're going today. 2212 01:42:27,240 --> 01:42:30,090 All right, so let's actually see this visualized a little bit. 2213 01:42:30,090 --> 01:42:32,010 In a moment, I'm going to change screens here 2214 01:42:32,010 --> 01:42:37,350 to open up what is a little visualization tool that will give us 2215 01:42:37,350 --> 01:42:40,920 a sense of how these things actually work and look at a faster rate 2216 01:42:40,920 --> 01:42:42,820 than our humans are able to do here on stage. 2217 01:42:42,820 --> 01:42:47,580 So here is another visualization of a bunch of numbers, an array of numbers. 2218 01:42:47,580 --> 01:42:50,670 Short bars mean small numbers, tall bars mean big numbers. 2219 01:42:50,670 --> 01:42:52,920 So instead of having the numbers on their torsos here, 2220 01:42:52,920 --> 01:42:57,690 we just have bars that are small or tall based on the magnitude of the number. 2221 01:42:57,690 --> 01:43:00,630 Let me go ahead, and I preconfigured this in advance 2222 01:43:00,630 --> 01:43:02,050 to operate somewhat quickly. 2223 01:43:02,050 --> 01:43:05,550 Let's go ahead and do selections sort by clicking this button. 2224 01:43:05,550 --> 01:43:08,220 And you'll see some pink bars flying by. 2225 01:43:08,220 --> 01:43:11,640 And that's like me walking left and right, left and right, 2226 01:43:11,640 --> 01:43:14,140 to select the next smallest number. 2227 01:43:14,140 --> 01:43:17,580 And so what you'll see happening on the left of this array of numbers 2228 01:43:17,580 --> 01:43:20,520 is Celeste, if you will, and all of the other smaller numbers 2229 01:43:20,520 --> 01:43:24,030 are appearing on the left while we continue to solve the remaining 2230 01:43:24,030 --> 01:43:25,810 problems to the right. 2231 01:43:25,810 --> 01:43:29,177 So again, we no longer have to touch the smaller numbers here. 2232 01:43:29,177 --> 01:43:32,010 So that's why the problem is getting smaller and smaller and smaller 2233 01:43:32,010 --> 01:43:32,790 over time. 2234 01:43:32,790 --> 01:43:36,420 But you can notice now visually, look at how many times 2235 01:43:36,420 --> 01:43:37,920 we're retracing our steps. 2236 01:43:37,920 --> 01:43:40,710 This is why things that are n squared tend 2237 01:43:40,710 --> 01:43:44,830 to be frowned upon if avoidable because I'm touching the same elements again 2238 01:43:44,830 --> 01:43:45,330 and again. 2239 01:43:45,330 --> 01:43:47,610 When I was walking through, I kept pointing at the same humans 2240 01:43:47,610 --> 01:43:48,360 again and again. 2241 01:43:48,360 --> 01:43:49,660 And that adds up. 2242 01:43:49,660 --> 01:43:52,750 So let's see if bubble sort looks or feels a little different. 2243 01:43:52,750 --> 01:43:56,188 Let me re-randomize the thing, and let me now click Bubble Sort at the top. 2244 01:43:56,188 --> 01:43:58,980 And as you might infer, there's other sorting algorithms out there, 2245 01:43:58,980 --> 01:44:00,272 not all of which we'll look at. 2246 01:44:00,272 --> 01:44:01,740 But here's bubble sort. 2247 01:44:01,740 --> 01:44:04,570 Same pink coloration, but it's doing something different. 2248 01:44:04,570 --> 01:44:07,860 It's two pink bars going through again and again comparing 2249 01:44:07,860 --> 01:44:09,550 the adjacent numbers. 2250 01:44:09,550 --> 01:44:13,050 And you'll see that the largest numbers are indeed bubbling the way up 2251 01:44:13,050 --> 01:44:18,120 to the right, but the smaller numbers, like our number 0 was, 2252 01:44:18,120 --> 01:44:20,287 is only slowly making its way over. 2253 01:44:20,287 --> 01:44:21,120 Here's a comparable. 2254 01:44:21,120 --> 01:44:22,200 Here's the number one. 2255 01:44:22,200 --> 01:44:24,900 And it's going to take a while to get all the way to the left. 2256 01:44:24,900 --> 01:44:28,560 And here too, notice how many times the same bars 2257 01:44:28,560 --> 01:44:32,590 are becoming pink, how many times the algorithm is retracing and retracing 2258 01:44:32,590 --> 01:44:33,090 its steps. 2259 01:44:33,090 --> 01:44:33,600 Why? 2260 01:44:33,600 --> 01:44:37,140 Because it's only solving one problem at a time on each pass. 2261 01:44:37,140 --> 01:44:40,717 And each time we do that, we're stepping through practically the whole array. 2262 01:44:40,717 --> 01:44:43,800 And now granted, I could speed this up even further if I really wanted to, 2263 01:44:43,800 --> 01:44:48,120 but my God, this is only, what, like 50 or 60 elements, something like that? 2264 01:44:48,120 --> 01:44:49,140 This is slow. 2265 01:44:49,140 --> 01:44:52,320 Like, this is what n squared looks like and feels like. 2266 01:44:52,320 --> 01:44:54,570 And now I'm just trying to come up with words to say 2267 01:44:54,570 --> 01:44:56,280 until we get to the finish line here. 2268 01:44:56,280 --> 01:44:59,250 Like, this would be annoying if this is the speed of sorting, 2269 01:44:59,250 --> 01:45:03,133 and this is why I sort of secretly sorted the numbers for Rave in advance 2270 01:45:03,133 --> 01:45:05,550 because it would have taken us an annoying number of steps 2271 01:45:05,550 --> 01:45:07,210 to get that in place for her. 2272 01:45:07,210 --> 01:45:10,170 So those two algorithms are n squared. 2273 01:45:10,170 --> 01:45:12,265 Can we do, in fact, better? 2274 01:45:12,265 --> 01:45:15,390 Well, to save the best algorithm for last, let's take a shorter five minute 2275 01:45:15,390 --> 01:45:15,890 break here. 2276 01:45:15,890 --> 01:45:20,410 And when we come back, we'll do even better than n squared. 2277 01:45:20,410 --> 01:45:22,480 All right. 2278 01:45:22,480 --> 01:45:27,180 So the challenge at hand is to do better than selection sort 2279 01:45:27,180 --> 01:45:30,270 and better than bubble sort and ideally not just marginally 2280 01:45:30,270 --> 01:45:32,460 better but fundamentally better. 2281 01:45:32,460 --> 01:45:35,970 Just like in week zero, that third and final divide and conquer algorithm 2282 01:45:35,970 --> 01:45:38,860 was sort of fundamentally faster than the other two. 2283 01:45:38,860 --> 01:45:41,950 So can we do better than something on the order of n squared? 2284 01:45:41,950 --> 01:45:44,310 Well, I bet we can if we start to approach 2285 01:45:44,310 --> 01:45:46,080 the problem a little differently. 2286 01:45:46,080 --> 01:45:48,202 The sorts we've done thus far, generally known 2287 01:45:48,202 --> 01:45:50,160 as comparison sorts-- and that kind of captures 2288 01:45:50,160 --> 01:45:54,000 the reality that we were doing a huge number of comparisons again and again. 2289 01:45:54,000 --> 01:45:57,360 And you kind of saw that in the vertical bars that were going pink as everything 2290 01:45:57,360 --> 01:45:58,855 was being compared again and again. 2291 01:45:58,855 --> 01:46:01,230 But there's this programming technique, and it's actually 2292 01:46:01,230 --> 01:46:04,140 a mathematical technique known as recursion 2293 01:46:04,140 --> 01:46:05,950 that we've actually seen before. 2294 01:46:05,950 --> 01:46:09,000 And this is a building block or a mental model 2295 01:46:09,000 --> 01:46:12,300 we can bring to bear on the problem to solve the sorting problem 2296 01:46:12,300 --> 01:46:13,740 sort of fundamentally differently. 2297 01:46:13,740 --> 01:46:16,620 But first, let's look at it in a more familiar context. 2298 01:46:16,620 --> 01:46:23,190 A little bit ago, I proposed this pseudo code for the binary search algorithm. 2299 01:46:23,190 --> 01:46:26,130 And notice that what was interesting about this code, 2300 01:46:26,130 --> 01:46:29,970 even though I didn't call it out at the time, it's kind of cyclically defined. 2301 01:46:29,970 --> 01:46:32,730 Like, I claim this is an algorithm for search, 2302 01:46:32,730 --> 01:46:36,750 and yet it seems a little unfair that I'm using the verb search 2303 01:46:36,750 --> 01:46:38,960 inside of the algorithm for search. 2304 01:46:38,960 --> 01:46:41,720 It's like an English sort of defining a word by using the word. 2305 01:46:41,720 --> 01:46:43,850 Normally, you shouldn't really get away with that. 2306 01:46:43,850 --> 01:46:46,430 But there's something interesting about this technique 2307 01:46:46,430 --> 01:46:51,110 here because even though this whole thing is a search algorithm 2308 01:46:51,110 --> 01:46:56,510 and I'm using my own algorithm to search the left half or the right half, 2309 01:46:56,510 --> 01:46:58,520 the key feature here that doesn't normally 2310 01:46:58,520 --> 01:47:01,670 happen in English when you define a word in terms of a word 2311 01:47:01,670 --> 01:47:05,300 is that when I search the left half or search the right half, yes, 2312 01:47:05,300 --> 01:47:06,810 I'm doing the same thing. 2313 01:47:06,810 --> 01:47:08,090 I'm using the same algorithm. 2314 01:47:08,090 --> 01:47:11,060 But the problem is, by definition, half as large. 2315 01:47:11,060 --> 01:47:14,180 So this isn't going to be a cyclical argument in the same way. 2316 01:47:14,180 --> 01:47:17,750 This approach, by using search within search 2317 01:47:17,750 --> 01:47:21,290 is going to whittle the problem down and down and down until hopefully, 2318 01:47:21,290 --> 01:47:24,180 one door or no doors remains. 2319 01:47:24,180 --> 01:47:26,720 And so recursion is a programming technique 2320 01:47:26,720 --> 01:47:29,930 whereby a function calls itself. 2321 01:47:29,930 --> 01:47:34,220 And we haven't seen this yet in C, and we haven't seen this really in Scratch. 2322 01:47:34,220 --> 01:47:38,122 But in C, you can have a function call itself. 2323 01:47:38,122 --> 01:47:39,830 And the form that takes is like literally 2324 01:47:39,830 --> 01:47:44,180 using the function's name inside of the function's implementation itself. 2325 01:47:44,180 --> 01:47:48,310 We've actually seen an opportunity for this once before too. 2326 01:47:48,310 --> 01:47:49,310 Think back to week zero. 2327 01:47:49,310 --> 01:47:51,560 Here's that same pseudo code for searching for someone 2328 01:47:51,560 --> 01:47:53,180 in an actual, physical phone book. 2329 01:47:53,180 --> 01:47:55,700 And notice these yellow lines here. 2330 01:47:55,700 --> 01:47:59,810 We described those in week zero as inducing a loop, a cycle. 2331 01:47:59,810 --> 01:48:04,310 And this is a very procedural approach, if you will, because lines 8 and 11 2332 01:48:04,310 --> 01:48:06,470 are very mechanically, if you will, telling 2333 01:48:06,470 --> 01:48:10,370 me to go back to line three to do this kind of looping thing. 2334 01:48:10,370 --> 01:48:14,750 But really, what that's doing in the binary search algorithm for the phone 2335 01:48:14,750 --> 01:48:20,120 book is it's just telling me to search the left half or search the right half. 2336 01:48:20,120 --> 01:48:23,900 I'm doing it more mechanically again by sort of telling myself 2337 01:48:23,900 --> 01:48:25,490 what line number to go back to. 2338 01:48:25,490 --> 01:48:28,400 But that's equivalent to just telling myself go search the left half, 2339 01:48:28,400 --> 01:48:30,950 search the right half, the key thing being the left 2340 01:48:30,950 --> 01:48:33,720 have and the right half are smaller than the original problem. 2341 01:48:33,720 --> 01:48:37,630 It would be a bug if I just said search the phone book, search the phone book, 2342 01:48:37,630 --> 01:48:39,380 because obviously, you never get anywhere. 2343 01:48:39,380 --> 01:48:41,540 But if you search the half, the half, the half, 2344 01:48:41,540 --> 01:48:43,470 problem gets smaller and smaller. 2345 01:48:43,470 --> 01:48:50,180 So let's reformulate week zero's phone book code to be not procedural as here 2346 01:48:50,180 --> 01:48:55,140 but recursive whereby in this search algorithm, 2347 01:48:55,140 --> 01:48:58,580 AKA binary search, formerly called divide and conquer, I'm 2348 01:48:58,580 --> 01:49:01,940 going to literally use also the keyword search here. 2349 01:49:01,940 --> 01:49:03,950 Notice among the benefits of doing this is it 2350 01:49:03,950 --> 01:49:06,798 kind of tightens the code up, makes it a little more succinct, 2351 01:49:06,798 --> 01:49:08,840 even though that's kind of a fringe benefit here. 2352 01:49:08,840 --> 01:49:12,350 But it's an elegant way too of describing 2353 01:49:12,350 --> 01:49:17,210 a problem by just having a function use itself 2354 01:49:17,210 --> 01:49:21,330 to solve a smaller puzzle at hand. 2355 01:49:21,330 --> 01:49:24,380 So let's now consider a familiar problem, a smaller 2356 01:49:24,380 --> 01:49:27,132 version than the one you've dabbled with-- this sort of pyramid, 2357 01:49:27,132 --> 01:49:28,340 this half pyramid from Mario. 2358 01:49:28,340 --> 01:49:31,070 And let's throw away the parts that aren't that interesting 2359 01:49:31,070 --> 01:49:35,420 and just consider how we might, up until now, implement this in C code, 2360 01:49:35,420 --> 01:49:37,520 this left aligned pyramid, if you will. 2361 01:49:37,520 --> 01:49:44,630 Let me go over here, and let me create a file called-- how about iteration.c? 2362 01:49:44,630 --> 01:49:48,080 And in this file, I'm going to go ahead and include cs50.h. 2363 01:49:48,080 --> 01:49:51,890 And I'm going to include stdio.h. 2364 01:49:51,890 --> 01:49:57,440 And the goal at hand is to implement in C a little program that just prints out 2365 01:49:57,440 --> 01:49:59,280 this and exactly this pyramid. 2366 01:49:59,280 --> 01:50:02,113 So no get string or any of that-- we're just going to keep it simple 2367 01:50:02,113 --> 01:50:05,400 and print exactly this pyramid of height 4 here. 2368 01:50:05,400 --> 01:50:06,570 So how might I do this? 2369 01:50:06,570 --> 01:50:10,945 Well, let me go ahead, and in main, let me first ask the user for-- 2370 01:50:10,945 --> 01:50:12,570 well, we'll go ahead and generalize it. 2371 01:50:12,570 --> 01:50:14,403 Let's go ahead and ask the user for heights. 2372 01:50:14,403 --> 01:50:16,250 We're using getint as before. 2373 01:50:16,250 --> 01:50:18,500 And I'll store that in a variable called height. 2374 01:50:18,500 --> 01:50:20,750 And then let me go ahead and simply call the function 2375 01:50:20,750 --> 01:50:22,410 draw passing in that height. 2376 01:50:22,410 --> 01:50:25,040 So for the moment, let me assume that someone somewhere 2377 01:50:25,040 --> 01:50:26,630 has implemented a draw function. 2378 01:50:26,630 --> 01:50:29,810 And this, then, is the entirety of my program. 2379 01:50:29,810 --> 01:50:32,970 All right, unfortunately, C does not come with a draw function. 2380 01:50:32,970 --> 01:50:34,820 So let me go ahead and invent one. 2381 01:50:34,820 --> 01:50:36,300 It doesn't need to return a value. 2382 01:50:36,300 --> 01:50:38,850 It just needs to print something-- so-called side effect. 2383 01:50:38,850 --> 01:50:43,460 So I'm going to define a function called draw that takes as input an int. 2384 01:50:43,460 --> 01:50:46,280 I'll call it n for number, but I could call it anything I want. 2385 01:50:46,280 --> 01:50:47,850 And inside of this. 2386 01:50:47,850 --> 01:50:53,240 I'm going to go ahead and print out a left aligned pyramid like this from top 2387 01:50:53,240 --> 01:50:53,990 to bottom. 2388 01:50:53,990 --> 01:50:57,710 The salient features here are that this is a pyramid, at least in this example, 2389 01:50:57,710 --> 01:50:58,820 of height four. 2390 01:50:58,820 --> 01:51:02,240 And now in height four, the first row has one brick. 2391 01:51:02,240 --> 01:51:03,470 The second row has two. 2392 01:51:03,470 --> 01:51:04,400 The third has three. 2393 01:51:04,400 --> 01:51:05,270 The fourth has four. 2394 01:51:05,270 --> 01:51:08,100 That's a nice pattern that I can probably represent in code. 2395 01:51:08,100 --> 01:51:09,210 So how might I do this? 2396 01:51:09,210 --> 01:51:11,390 Well, how about 4 int i gets-- 2397 01:51:11,390 --> 01:51:12,950 let me do it the old school way-- 2398 01:51:12,950 --> 01:51:13,730 1. 2399 01:51:13,730 --> 01:51:18,530 And then i is less than or equal to n. 2400 01:51:18,530 --> 01:51:20,180 And then i plus plus-- 2401 01:51:20,180 --> 01:51:23,810 so I'm going from 1 to 4 just to keep myself sane here. 2402 01:51:23,810 --> 01:51:26,690 And then inside of this loop, what do I want to do? 2403 01:51:26,690 --> 01:51:28,560 Well, let me keep it conventional, in fact. 2404 01:51:28,560 --> 01:51:31,970 Let me just change this to be the more conventional 0 2405 01:51:31,970 --> 01:51:36,170 to n even though it might not be as intuitive because now on row 0, 2406 01:51:36,170 --> 01:51:37,550 I want one brick. 2407 01:51:37,550 --> 01:51:39,965 On row 1, I want two bricks, dot dot dot. 2408 01:51:39,965 --> 01:51:41,300 On row 3, I want four. 2409 01:51:41,300 --> 01:51:42,530 So it's kind of offset now. 2410 01:51:42,530 --> 01:51:44,240 But I'm being more conventional. 2411 01:51:44,240 --> 01:51:48,440 So on each row, how many bricks do I want to print? 2412 01:51:48,440 --> 01:51:49,880 Well, I think I want to do this. 2413 01:51:49,880 --> 01:51:55,700 For int j, for instance, common to use j after if you have a nested loop, 2414 01:51:55,700 --> 01:52:02,930 let's start j at 0 and do this so long as is less than i plus 1 2415 01:52:02,930 --> 01:52:04,460 and then do j plus plus. 2416 01:52:04,460 --> 01:52:06,500 So why i plus 1? 2417 01:52:06,500 --> 01:52:10,670 Well, again, when I equals 0, that's the first row, and I want one brick. 2418 01:52:10,670 --> 01:52:12,650 When i equals 1, that's the second row. 2419 01:52:12,650 --> 01:52:13,430 I want two bricks. 2420 01:52:13,430 --> 01:52:16,250 And dot dot dot, when i is 3, I want four bricks. 2421 01:52:16,250 --> 01:52:19,460 So again, I have to add 1 to i to get the total number of bricks 2422 01:52:19,460 --> 01:52:21,120 that I want to print to the screen. 2423 01:52:21,120 --> 01:52:25,820 So inside of this nested for loop, I'm going to do printf of a hash 2424 01:52:25,820 --> 01:52:28,970 with no new line. 2425 01:52:28,970 --> 01:52:33,112 I'm going to save the new line for about here instead. 2426 01:52:33,112 --> 01:52:34,820 All right, the last thing I'm going to do 2427 01:52:34,820 --> 01:52:38,550 is copy and paste the prototype at the top of the file. 2428 01:52:38,550 --> 01:52:39,620 So that I can call this. 2429 01:52:39,620 --> 01:52:42,728 And again, this is of now week one, week two. 2430 01:52:42,728 --> 01:52:45,020 Wouldn't necessarily come to your mind as quickly as it 2431 01:52:45,020 --> 01:52:48,380 might to mine after all this practice, but this is something reminiscent 2432 01:52:48,380 --> 01:52:51,230 of what you yourself did already for Mario-- printing out 2433 01:52:51,230 --> 01:52:54,330 a pyramid that hopefully in a moment is going to look like this. 2434 01:52:54,330 --> 01:52:56,370 So let me go back to my code. 2435 01:52:56,370 --> 01:53:00,560 Let me run make iteration, and let me do dot slash iteration. 2436 01:53:00,560 --> 01:53:02,540 I'll type in 4, and voila. 2437 01:53:02,540 --> 01:53:05,840 Seems to be correct, and let's assume it's going to work for other inputs 2438 01:53:05,840 --> 01:53:06,500 as well. 2439 01:53:06,500 --> 01:53:07,310 Oh, thank you. 2440 01:53:07,310 --> 01:53:11,770 2441 01:53:11,770 --> 01:53:15,460 So this is indeed an example of iteration-- doing something 2442 01:53:15,460 --> 01:53:16,780 again and again. 2443 01:53:16,780 --> 01:53:18,160 And it's very procedural. 2444 01:53:18,160 --> 01:53:21,160 Like, I literally have a function called draw that does this thing. 2445 01:53:21,160 --> 01:53:25,570 But I can think about implementing draw in a somewhat different way that's 2446 01:53:25,570 --> 01:53:26,710 kind of clever. 2447 01:53:26,710 --> 01:53:28,720 And it's not strictly necessary for this problem 2448 01:53:28,720 --> 01:53:31,360 because this problem honestly is not that complicated 2449 01:53:31,360 --> 01:53:33,387 to solve once you have practice under your belt. 2450 01:53:33,387 --> 01:53:36,220 Certainly the first time around, probably significantly challenging. 2451 01:53:36,220 --> 01:53:39,250 But now that you kind of associate, OK, row one 2452 01:53:39,250 --> 01:53:42,010 with one brick, row two with two bricks, it kind of comes together 2453 01:53:42,010 --> 01:53:43,340 with these for loops. 2454 01:53:43,340 --> 01:53:45,770 But how else could we think about this problem? 2455 01:53:45,770 --> 01:53:48,940 Well, this physical structure, these bricks, in some sense 2456 01:53:48,940 --> 01:53:54,980 is a recursive structure, a structure that's defined in terms of itself. 2457 01:53:54,980 --> 01:53:56,360 Now what do I mean by that? 2458 01:53:56,360 --> 01:54:01,600 Well, if I were to ask you the question, what does a pyramid of height 4 2459 01:54:01,600 --> 01:54:04,720 look like, you would point, of course, to this picture. 2460 01:54:04,720 --> 01:54:11,170 But you could also kind of cleverly say to me, well, 2461 01:54:11,170 --> 01:54:16,240 it's actually a pyramid of height 3 plus 1 additional row. 2462 01:54:16,240 --> 01:54:18,237 And here's that cyclical argument, right? 2463 01:54:18,237 --> 01:54:21,070 Kind of obnoxious to do typically in English or in a spoken language 2464 01:54:21,070 --> 01:54:23,200 because you're defining one thing in terms of itself. 2465 01:54:23,200 --> 01:54:24,408 What's a pyramid of height 4? 2466 01:54:24,408 --> 01:54:28,180 Well, it's a pyramid of height 3 plus 1 more row. 2467 01:54:28,180 --> 01:54:30,940 But we can kind of leverage this logic in code. 2468 01:54:30,940 --> 01:54:32,590 Well, what's a pyramid of height 3? 2469 01:54:32,590 --> 01:54:34,870 Well, it's a pyramid of height 2 plus 1 more row. 2470 01:54:34,870 --> 01:54:36,940 Fine, what's a pyramid of height 2? 2471 01:54:36,940 --> 01:54:39,370 Well, it's a pyramid of height 1 plus 1 more row. 2472 01:54:39,370 --> 01:54:42,370 And then hopefully, this process ends, and it does because notice, 2473 01:54:42,370 --> 01:54:44,990 the pyramid is getting smaller and smaller. 2474 01:54:44,990 --> 01:54:48,610 So you're not going to have this sort of silly back and forth with me 2475 01:54:48,610 --> 01:54:52,550 infinitely many times because when we finally get to the base case, 2476 01:54:52,550 --> 01:54:54,130 the end of the pyramid, fine. 2477 01:54:54,130 --> 01:54:55,600 What is a pyramid of height 1? 2478 01:54:55,600 --> 01:54:58,630 Well, it's a pyramid of no height plus one more row. 2479 01:54:58,630 --> 01:55:01,270 And at that point, things just get negative-- 2480 01:55:01,270 --> 01:55:02,050 no pun intended. 2481 01:55:02,050 --> 01:55:04,035 Things just would otherwise go negative. 2482 01:55:04,035 --> 01:55:05,410 And so you can just kind of stop. 2483 01:55:05,410 --> 01:55:07,930 The base case is when there is no more pyramid. 2484 01:55:07,930 --> 01:55:12,080 So there's a way to draw a line in the sand and say, stop, no more arguments. 2485 01:55:12,080 --> 01:55:16,180 But this idea of defining a physical structure in terms of itself 2486 01:55:16,180 --> 01:55:22,180 or code in terms of itself actually lets us do some interesting new algorithms. 2487 01:55:22,180 --> 01:55:24,160 Let me go back to my code here. 2488 01:55:24,160 --> 01:55:29,800 Let me go ahead and create one final file here called recursion.c 2489 01:55:29,800 --> 01:55:36,100 that leverages this idea of this built-in self-referential nature. 2490 01:55:36,100 --> 01:55:38,080 Let me include cs50.h. 2491 01:55:38,080 --> 01:55:41,740 Let me go ahead and include standardio.h, int main void. 2492 01:55:41,740 --> 01:55:46,000 And then inside of main, I'm going to do the exact same thing-- int 2493 01:55:46,000 --> 01:55:50,470 height equals get int, asking the user for height. 2494 01:55:50,470 --> 01:55:53,870 And then I'm going to go ahead and call draw passing in height. 2495 01:55:53,870 --> 01:55:55,520 So that's going to stay the same. 2496 01:55:55,520 --> 01:56:01,580 I even am going to make my prototype the same-- void draw int n semicolon. 2497 01:56:01,580 --> 01:56:03,820 And now I'm going to implement void down here 2498 01:56:03,820 --> 01:56:05,590 with that same prototype, of course. 2499 01:56:05,590 --> 01:56:08,470 But the code now is going to be a little different. 2500 01:56:08,470 --> 01:56:10,070 What am I going to do here? 2501 01:56:10,070 --> 01:56:15,610 Well, first of all, if you ask me to draw a pyramid of height n, 2502 01:56:15,610 --> 01:56:18,580 I'm going to be kind of a wise ass here and say, well, just 2503 01:56:18,580 --> 01:56:21,010 draw a pyramid of n minus 1-- 2504 01:56:21,010 --> 01:56:21,762 done. 2505 01:56:21,762 --> 01:56:24,220 All right, but there's still a little more work to be done. 2506 01:56:24,220 --> 01:56:28,660 What happens after I print or draw a pyramid of height n minus 1 2507 01:56:28,660 --> 01:56:33,340 according to our structural definition a moment ago? 2508 01:56:33,340 --> 01:56:38,470 What remains after drawing a pyramid of height n minus 1 or 3, specifically? 2509 01:56:38,470 --> 01:56:40,840 AUDIENCE: [INAUDIBLE] 2510 01:56:40,840 --> 01:56:42,400 We need one more row of hashes. 2511 01:56:42,400 --> 01:56:43,750 OK, so I can do that, right? 2512 01:56:43,750 --> 01:56:45,213 I'm OK with the single loops. 2513 01:56:45,213 --> 01:56:46,630 There's no nesting necessary here. 2514 01:56:46,630 --> 01:56:51,010 I'm just going to do this-- for int i get 0, i is less than n, 2515 01:56:51,010 --> 01:56:53,185 which is the height that's passed in, i plus plus. 2516 01:56:53,185 --> 01:56:55,060 And then inside of this loop, I'm very simply 2517 01:56:55,060 --> 01:56:56,740 going to print out a single hash. 2518 01:56:56,740 --> 01:57:00,650 And then down here, I'm going to print out a new line at the very end. 2519 01:57:00,650 --> 01:57:01,600 So that's good, right? 2520 01:57:01,600 --> 01:57:03,720 I might not be as comfortable with nested loops. 2521 01:57:03,720 --> 01:57:04,720 This is nice and simple. 2522 01:57:04,720 --> 01:57:07,660 What does this loop do here on line 17 through 20? 2523 01:57:07,660 --> 01:57:13,540 It literally prints n hashes by counting from i equals 0 on up to 2524 01:57:13,540 --> 01:57:15,130 but not through n. 2525 01:57:15,130 --> 01:57:17,930 So that's sort of week one style syntax. 2526 01:57:17,930 --> 01:57:20,740 But this is kind of trippy now because I've somehow 2527 01:57:20,740 --> 01:57:25,480 boiled down the implementation of draw into printing a row after just 2528 01:57:25,480 --> 01:57:27,250 drawing the thing above it. 2529 01:57:27,250 --> 01:57:31,250 But this is problematic as is because in this case, 2530 01:57:31,250 --> 01:57:37,900 my drawer function, notice, is always going to call the draw function forever 2531 01:57:37,900 --> 01:57:38,890 in some sense. 2532 01:57:38,890 --> 01:57:44,200 But ideally, when do I want this cyclical process to stop? 2533 01:57:44,200 --> 01:57:48,010 When do I want to not call draw anymore? 2534 01:57:48,010 --> 01:57:51,297 Yeah, when n is 1, right? 2535 01:57:51,297 --> 01:57:53,380 When I get to the top of the pyramid, when n is 1, 2536 01:57:53,380 --> 01:57:55,720 or heck, when the pyramids all gone and n equals 0. 2537 01:57:55,720 --> 01:57:58,060 I can pick any line in the sand, so long as it's 2538 01:57:58,060 --> 01:57:59,690 sort of at the end of the process. 2539 01:57:59,690 --> 01:58:01,480 Then I don't want to call draw anymore. 2540 01:58:01,480 --> 01:58:03,850 So maybe what I should do is this. 2541 01:58:03,850 --> 01:58:10,450 If n equals equals 0, there's really nothing to draw. 2542 01:58:10,450 --> 01:58:14,340 So I'm just going to go ahead and return like this. 2543 01:58:14,340 --> 01:58:16,920 Otherwise, I'm going to go ahead and draw 2544 01:58:16,920 --> 01:58:20,238 n minus 1 rows and then one more row. 2545 01:58:20,238 --> 01:58:21,780 And I could express this differently. 2546 01:58:21,780 --> 01:58:24,540 I could do something like this, which would be equivalent. 2547 01:58:24,540 --> 01:58:29,490 I could say something like if n is greater than or equal to 0, 2548 01:58:29,490 --> 01:58:31,462 then go ahead and draw the row. 2549 01:58:31,462 --> 01:58:32,670 But I like it this way first. 2550 01:58:32,670 --> 01:58:34,587 For now, I'm going to go with the original way 2551 01:58:34,587 --> 01:58:37,740 just to ask a simple question and then just bail out of the function 2552 01:58:37,740 --> 01:58:38,970 if n equals 0. 2553 01:58:38,970 --> 01:58:41,740 And heck, just to be super safe, just in case 2554 01:58:41,740 --> 01:58:44,040 the user types in a negative number, let me also 2555 01:58:44,040 --> 01:58:46,980 just check if n is a negative number, also, just return immediately. 2556 01:58:46,980 --> 01:58:48,210 Don't do anything. 2557 01:58:48,210 --> 01:58:51,490 I'm not returning a value because again, the function is void. 2558 01:58:51,490 --> 01:58:53,680 It doesn't need or have a return value. 2559 01:58:53,680 --> 01:58:55,860 So just saying return suffices. 2560 01:58:55,860 --> 01:59:00,660 But if n equals 1 or 2 or 3 or anything higher, 2561 01:59:00,660 --> 01:59:06,120 it is reasonable to draw a pyramid of slightly shorter height like, instead 2562 01:59:06,120 --> 01:59:10,680 of 4, 3, and then go ahead and print one more row. 2563 01:59:10,680 --> 01:59:16,450 So this is an example now of code that calls itself within itself. 2564 01:59:16,450 --> 01:59:18,150 Draw is calling draw. 2565 01:59:18,150 --> 01:59:22,920 But this so-called base case ensures, this conditional ensures, 2566 01:59:22,920 --> 01:59:24,690 that we're not going to do this forever. 2567 01:59:24,690 --> 01:59:27,300 Otherwise, we literally would do this infinitely many times, 2568 01:59:27,300 --> 01:59:30,330 and something bad is probably going to happen. 2569 01:59:30,330 --> 01:59:34,230 All right, let me go ahead and compile this code-- make recursion. 2570 01:59:34,230 --> 01:59:38,250 OK, no syntax errors-- dot slash recursion, Enter, height of 4, 2571 01:59:38,250 --> 01:59:40,590 and voila. 2572 01:59:40,590 --> 01:59:44,400 If only because some of you have run into this issue accidentally already, 2573 01:59:44,400 --> 01:59:48,420 let me get rid of the base case here, and let me recompile the code. 2574 01:59:48,420 --> 01:59:49,980 Make recursion. 2575 01:59:49,980 --> 01:59:52,150 Oh, and actually, now it's actually catching it. 2576 01:59:52,150 --> 01:59:55,050 So the compiler is smart enough here to realize 2577 01:59:55,050 --> 01:59:58,470 that all paths through this function will call itself. 2578 01:59:58,470 --> 02:00:01,300 AKA, It's going to loop forever. 2579 02:00:01,300 --> 02:00:03,090 So let me do the first thing. 2580 02:00:03,090 --> 02:00:05,550 Suppose I only check for n equaling 0. 2581 02:00:05,550 --> 02:00:09,390 Let me go ahead and recompile this code with make recursion. 2582 02:00:09,390 --> 02:00:12,030 And now let me just be kind of uncooperative. 2583 02:00:12,030 --> 02:00:16,200 When I run this program, still works for 4, still works for 0. 2584 02:00:16,200 --> 02:00:19,110 What if I do like negative 100? 2585 02:00:19,110 --> 02:00:22,920 Have any of you experienced a segmentation fault or core dump? 2586 02:00:22,920 --> 02:00:24,240 OK, so no shame in this. 2587 02:00:24,240 --> 02:00:28,980 Like, this means I have somehow touched memory that I shouldn't have. 2588 02:00:28,980 --> 02:00:32,910 And in short, I actually called this function thousands of times 2589 02:00:32,910 --> 02:00:35,550 accidentally, it would seem now, until the program just 2590 02:00:35,550 --> 02:00:38,413 bailed on me because I eventually touched memory in the computer 2591 02:00:38,413 --> 02:00:39,330 that I shouldn't have. 2592 02:00:39,330 --> 02:00:41,140 That'll make even more sense next week. 2593 02:00:41,140 --> 02:00:42,640 But for now, it's simply a bug. 2594 02:00:42,640 --> 02:00:44,430 And I can avoid that bug in this context, 2595 02:00:44,430 --> 02:00:48,900 probably not your own pset context, by just making sure we don't even 2596 02:00:48,900 --> 02:00:51,340 allow for negative numbers at all. 2597 02:00:51,340 --> 02:00:53,700 So with this building block in place, what 2598 02:00:53,700 --> 02:00:57,420 can we now do in terms of those same numbers to sort? 2599 02:00:57,420 --> 02:01:00,390 Well, it turns out there's a sorting algorithm called merge sort. 2600 02:01:00,390 --> 02:01:01,980 And there's bunches of others too. 2601 02:01:01,980 --> 02:01:06,923 But merge sort is a nice one to discuss because it fundamentally, we hope, 2602 02:01:06,923 --> 02:01:09,090 is going to do better than selection sort and bubble 2603 02:01:09,090 --> 02:01:11,490 sort that is better than n squared. 2604 02:01:11,490 --> 02:01:14,238 But the catch is it's a little harder to think about. 2605 02:01:14,238 --> 02:01:17,280 In fact, I'll act it out myself with just these numbers on the shelf here 2606 02:01:17,280 --> 02:01:21,360 rather than humans because recursion in general takes a little bit of effort 2607 02:01:21,360 --> 02:01:23,700 to wrap your mind around, typically a bit of practice. 2608 02:01:23,700 --> 02:01:25,908 But I'll see if we can't walk through it methodically 2609 02:01:25,908 --> 02:01:28,260 enough such that this comes to light. 2610 02:01:28,260 --> 02:01:32,460 So here's the pseudo code I propose for this algorithm called merge sort. 2611 02:01:32,460 --> 02:01:35,640 In the spirit of recursion, this sorting algorithm 2612 02:01:35,640 --> 02:01:41,020 literally calls itself by using the verb sort in its pseudo code. 2613 02:01:41,020 --> 02:01:42,600 So how does merge sort work? 2614 02:01:42,600 --> 02:01:46,020 It sort of obnoxiously says, well, if you want to sort all of these things, 2615 02:01:46,020 --> 02:01:48,720 go sort the left half, then go sort the right half, 2616 02:01:48,720 --> 02:01:50,430 and then merge the two together. 2617 02:01:50,430 --> 02:01:51,682 Now obnoxious in what sense? 2618 02:01:51,682 --> 02:01:54,390 Well, if I just asked you to sort something and you just tell me, 2619 02:01:54,390 --> 02:01:56,070 well, go sort that thing and then go sort 2620 02:01:56,070 --> 02:01:58,737 that thing, what was the point of asking you in the first place? 2621 02:01:58,737 --> 02:02:01,080 But the key is that each of these lines is 2622 02:02:01,080 --> 02:02:03,880 sorting a smaller piece of the problem. 2623 02:02:03,880 --> 02:02:06,450 So eventually, we'll be able to pare this down 2624 02:02:06,450 --> 02:02:10,410 into something that doesn't go on forever because in fact, in merge sort, 2625 02:02:10,410 --> 02:02:12,120 there's a base case too. 2626 02:02:12,120 --> 02:02:14,940 There's a scenario where we just check, wait a minute, 2627 02:02:14,940 --> 02:02:17,520 if there's only one number to sort, that's it. 2628 02:02:17,520 --> 02:02:19,470 Quit then because you're all done. 2629 02:02:19,470 --> 02:02:22,590 So there has to be this base case in any use of recursion 2630 02:02:22,590 --> 02:02:27,090 to make sure that you don't mindlessly call yourself forever. 2631 02:02:27,090 --> 02:02:29,380 You've got to stop at some point. 2632 02:02:29,380 --> 02:02:32,560 So let's focus on the third of these steps. 2633 02:02:32,560 --> 02:02:37,328 What does it mean to merge two lists, two halves of a list, 2634 02:02:37,328 --> 02:02:39,120 just because this is apparently going to be 2635 02:02:39,120 --> 02:02:41,290 a key ingredient-- so here, for instance, 2636 02:02:41,290 --> 02:02:43,890 are two halves of a list of size 8. 2637 02:02:43,890 --> 02:02:47,010 We have the numbers 2-- and I'll call it out if you're at a bad angle-- 2638 02:02:47,010 --> 02:02:52,200 2457 and 0136. 2639 02:02:52,200 --> 02:02:56,710 Notice that the left half at the moment, 2457, is already sorted, 2640 02:02:56,710 --> 02:03:01,000 and the right half, 0136, is also sorted as well. 2641 02:03:01,000 --> 02:03:03,870 So that's a good thing because it means that theoretically, I've 2642 02:03:03,870 --> 02:03:05,280 sorted the left half already. 2643 02:03:05,280 --> 02:03:07,620 I've sorted the right half already before we began. 2644 02:03:07,620 --> 02:03:09,690 I just need to merge these two halves. 2645 02:03:09,690 --> 02:03:11,800 What does it mean to sort two halves? 2646 02:03:11,800 --> 02:03:13,550 Well, for the sake of discussion, I'm just 2647 02:03:13,550 --> 02:03:19,340 going to turn over most of the numbers except for the first numbers in each 2648 02:03:19,340 --> 02:03:20,360 of these halves. 2649 02:03:20,360 --> 02:03:23,390 There's two halves here, left and right. 2650 02:03:23,390 --> 02:03:25,460 At the moment, I'm only going to consider 2651 02:03:25,460 --> 02:03:29,532 the leftmost element of each half-- that is, the one on the left here 2652 02:03:29,532 --> 02:03:30,740 and the one on the left here. 2653 02:03:30,740 --> 02:03:33,800 How do I merge these two lists together? 2654 02:03:33,800 --> 02:03:38,540 Well, if I look at 2 and I look at 0, which one should presumably come first? 2655 02:03:38,540 --> 02:03:39,502 The smaller one. 2656 02:03:39,502 --> 02:03:41,210 So I'm going to grab the 0, and I'm going 2657 02:03:41,210 --> 02:03:44,150 to put it into its own place on this new shelf here. 2658 02:03:44,150 --> 02:03:50,300 And now I'm going to consider, as part of my iteration, 2659 02:03:50,300 --> 02:03:53,270 the beginning of this list and the new beginning of this list. 2660 02:03:53,270 --> 02:03:55,130 So I'm now comparing 2 and 1. 2661 02:03:55,130 --> 02:03:56,210 Which one's smaller? 2662 02:03:56,210 --> 02:03:58,400 I'm going to go ahead and grab the 1. 2663 02:03:58,400 --> 02:04:01,130 Now I'm going to compare the beginning of the left list 2664 02:04:01,130 --> 02:04:03,440 and the new beginning of the right list, 2 and 3. 2665 02:04:03,440 --> 02:04:05,018 Of course, it's 2. 2666 02:04:05,018 --> 02:04:07,310 Now I'm going to compare the beginning of the left list 2667 02:04:07,310 --> 02:04:09,290 and the beginning of the right list, 4 and 3. 2668 02:04:09,290 --> 02:04:11,510 It's of course 3. 2669 02:04:11,510 --> 02:04:14,250 Now I'm going to compare the 4 against the beginning and end, 2670 02:04:14,250 --> 02:04:15,830 it turns out, of the second list-- 2671 02:04:15,830 --> 02:04:16,988 4, of course. 2672 02:04:16,988 --> 02:04:19,280 Now I'm going to compare the beginning of the left list 2673 02:04:19,280 --> 02:04:20,822 and the beginning of the right list-- 2674 02:04:20,822 --> 02:04:22,430 5, of course. 2675 02:04:22,430 --> 02:04:25,010 I'm realizing this is not going to end well because I left 2676 02:04:25,010 --> 02:04:26,420 too much distance between the numbers. 2677 02:04:26,420 --> 02:04:28,337 But that has nothing to do with the algorithm. 2678 02:04:28,337 --> 02:04:29,840 7 is the beginning of the left list. 2679 02:04:29,840 --> 02:04:31,460 6 is the beginning of the right list. 2680 02:04:31,460 --> 02:04:32,960 It's, of course, 6. 2681 02:04:32,960 --> 02:04:36,410 And at the risk of knocking all of these over, 2682 02:04:36,410 --> 02:04:43,310 if I now make room for this element, we have hopefully 2683 02:04:43,310 --> 02:04:50,400 sorted the whole thing by having merged together the two halves of the list. 2684 02:04:50,400 --> 02:04:52,040 So in short-- thank you. 2685 02:04:52,040 --> 02:04:56,448 2686 02:04:56,448 --> 02:04:58,740 I'm a little worried that's just getting sarcastic now, 2687 02:04:58,740 --> 02:05:03,870 but we now have merged two half lists. 2688 02:05:03,870 --> 02:05:07,430 We haven't done the guts of the algorithm yet-- sort the left half 2689 02:05:07,430 --> 02:05:08,430 and sort the right half. 2690 02:05:08,430 --> 02:05:11,160 But I claim that that is how mechanically you 2691 02:05:11,160 --> 02:05:12,960 merge two sorted halves. 2692 02:05:12,960 --> 02:05:15,060 You keep looking at the beginning of each list, 2693 02:05:15,060 --> 02:05:17,640 and you just kind of weave them together based 2694 02:05:17,640 --> 02:05:21,070 on which one belongs first based on its size. 2695 02:05:21,070 --> 02:05:23,460 So if you agree that that was a reasonable way 2696 02:05:23,460 --> 02:05:28,020 to merge two lists together, let's go ahead and focus lastly 2697 02:05:28,020 --> 02:05:31,260 on what it means to actually sort the left half 2698 02:05:31,260 --> 02:05:33,437 and sort the right half of a whole bunch of numbers. 2699 02:05:33,437 --> 02:05:35,520 And for this, I'm going to go ahead and order them 2700 02:05:35,520 --> 02:05:37,320 in this seemingly random order. 2701 02:05:37,320 --> 02:05:40,200 And I just have a little cheat sheet above so that I don't mess up. 2702 02:05:40,200 --> 02:05:42,330 And I'm going to start at the very top this time. 2703 02:05:42,330 --> 02:05:45,390 And hopefully, these will not fall down at any point. 2704 02:05:45,390 --> 02:05:51,300 But I'm just deliberately putting them in this random order, 5274. 2705 02:05:51,300 --> 02:05:53,703 And then we have 1630-- 2706 02:05:53,703 --> 02:05:57,000 1630. 2707 02:05:57,000 --> 02:05:59,340 Hopefully this won't fall over. 2708 02:05:59,340 --> 02:06:03,990 Here is now an array of size 8 with eight integers. 2709 02:06:03,990 --> 02:06:05,290 And I want to sort this. 2710 02:06:05,290 --> 02:06:08,332 I could use selection sort and just go back and forth and back and forth. 2711 02:06:08,332 --> 02:06:11,100 I could use bubble sort and just compare pairs, pairs, pairs. 2712 02:06:11,100 --> 02:06:13,830 But those are going to be on the order of big O of n squared. 2713 02:06:13,830 --> 02:06:16,000 My hope is to do fundamentally better here. 2714 02:06:16,000 --> 02:06:17,760 So let's see if we can do better. 2715 02:06:17,760 --> 02:06:19,800 All right, so let me look now at my code. 2716 02:06:19,800 --> 02:06:21,490 I'll keep it on the screen. 2717 02:06:21,490 --> 02:06:23,310 How do I implement merge sort? 2718 02:06:23,310 --> 02:06:25,303 Well, if there's only one number, I quit. 2719 02:06:25,303 --> 02:06:26,220 There's obviously not. 2720 02:06:26,220 --> 02:06:28,270 There's eight numbers, so that's not applicable. 2721 02:06:28,270 --> 02:06:30,603 I'm going to go ahead and sort the left half of numbers. 2722 02:06:30,603 --> 02:06:32,460 All right, here's the left half-- 2723 02:06:32,460 --> 02:06:34,470 5274. 2724 02:06:34,470 --> 02:06:37,230 Do I sort an array of size 4? 2725 02:06:37,230 --> 02:06:40,170 Well, here's where the recursion kicks in. 2726 02:06:40,170 --> 02:06:42,150 How do you sort a list of size 4? 2727 02:06:42,150 --> 02:06:44,310 Well, there's the pseudo code on the board. 2728 02:06:44,310 --> 02:06:47,490 I sort the left half of the list of size 4. 2729 02:06:47,490 --> 02:06:49,200 So here we go. 2730 02:06:49,200 --> 02:06:50,550 I have a list of size 4. 2731 02:06:50,550 --> 02:06:51,390 How do I sort it? 2732 02:06:51,390 --> 02:06:52,680 I sort the left half. 2733 02:06:52,680 --> 02:06:54,480 All right, now I have a list of size 2. 2734 02:06:54,480 --> 02:06:56,680 How do I sort this? 2735 02:06:56,680 --> 02:06:58,600 Well, sort the left half. 2736 02:06:58,600 --> 02:06:59,520 So here we go. 2737 02:06:59,520 --> 02:07:01,020 Here's a list of size 1. 2738 02:07:01,020 --> 02:07:03,996 How do I sort this? 2739 02:07:03,996 --> 02:07:05,640 I think it's done, right? 2740 02:07:05,640 --> 02:07:06,480 That's quit, right? 2741 02:07:06,480 --> 02:07:08,010 If only one number, I'm done. 2742 02:07:08,010 --> 02:07:09,233 The 5 is sorted. 2743 02:07:09,233 --> 02:07:10,650 All right, what was the next step? 2744 02:07:10,650 --> 02:07:12,180 You have to now rewind in time. 2745 02:07:12,180 --> 02:07:16,510 I just sorted the left half of the left half of the left half. 2746 02:07:16,510 --> 02:07:17,970 What do I now sort? 2747 02:07:17,970 --> 02:07:20,850 The right half, which is 2. 2748 02:07:20,850 --> 02:07:22,030 This is one element. 2749 02:07:22,030 --> 02:07:22,740 So I'm done. 2750 02:07:22,740 --> 02:07:27,060 So now at this point in the story, I have sorted, sort of idiotically-- 2751 02:07:27,060 --> 02:07:29,850 the 5 assorted, and the 2 is sorted. 2752 02:07:29,850 --> 02:07:34,410 But what's the third and final step of this phase of the algorithm? 2753 02:07:34,410 --> 02:07:35,650 Merge the two together. 2754 02:07:35,650 --> 02:07:37,600 So here's the left, here's the right list. 2755 02:07:37,600 --> 02:07:38,850 How do I merge these together? 2756 02:07:38,850 --> 02:07:41,530 I compare the lists, and I put the two there. 2757 02:07:41,530 --> 02:07:43,590 I only have the [? 5 ?] left, and I do that. 2758 02:07:43,590 --> 02:07:45,990 So now we see some visible progress. 2759 02:07:45,990 --> 02:07:47,190 But again, let's rewind. 2760 02:07:47,190 --> 02:07:48,220 How did we get here? 2761 02:07:48,220 --> 02:07:52,860 We started to sort the left half of the left half of the left half, then 2762 02:07:52,860 --> 02:07:53,700 the right half. 2763 02:07:53,700 --> 02:07:54,900 And now where are we? 2764 02:07:54,900 --> 02:07:58,090 We've just sorted the left half of the left half. 2765 02:07:58,090 --> 02:08:01,140 So what comes after sorting the left half of anything? 2766 02:08:01,140 --> 02:08:01,692 Right half. 2767 02:08:01,692 --> 02:08:03,900 All right, here's the sort of same nonsensical thing. 2768 02:08:03,900 --> 02:08:05,460 Here's a list of size 2. 2769 02:08:05,460 --> 02:08:06,690 Let's sort the left half. 2770 02:08:06,690 --> 02:08:07,347 Done. 2771 02:08:07,347 --> 02:08:08,430 Let's sort the right half. 2772 02:08:08,430 --> 02:08:09,120 Done. 2773 02:08:09,120 --> 02:08:10,170 What's the third step? 2774 02:08:10,170 --> 02:08:11,590 Merge them together. 2775 02:08:11,590 --> 02:08:14,880 So that's the 4, and that's the 7. 2776 02:08:14,880 --> 02:08:16,110 What have I now done? 2777 02:08:16,110 --> 02:08:21,070 In total, I've now sorted the left half of the original thing. 2778 02:08:21,070 --> 02:08:23,092 So what happens next? 2779 02:08:23,092 --> 02:08:24,300 Wait a minute, wait a minute. 2780 02:08:24,300 --> 02:08:25,560 I have not done that. 2781 02:08:25,560 --> 02:08:26,550 What have I done? 2782 02:08:26,550 --> 02:08:29,220 I have sorted the left half of the left half, 2783 02:08:29,220 --> 02:08:32,350 and I've sorted the right half of the left half. 2784 02:08:32,350 --> 02:08:34,710 What do I now need to do lastly? 2785 02:08:34,710 --> 02:08:36,640 Merge those two lists together. 2786 02:08:36,640 --> 02:08:38,562 So again, I put my finger on the beginning 2787 02:08:38,562 --> 02:08:40,270 of this list, the beginning of this list. 2788 02:08:40,270 --> 02:08:42,360 And if you want, I'll do the same thing when I merged last time 2789 02:08:42,360 --> 02:08:43,830 to be clear what I'm comparing. 2790 02:08:43,830 --> 02:08:46,440 2 and 4-- the 2 obviously comes first. 2791 02:08:46,440 --> 02:08:48,220 What comes next? 2792 02:08:48,220 --> 02:08:50,410 Well, the 4 comes next. 2793 02:08:50,410 --> 02:08:51,670 What comes next? 2794 02:08:51,670 --> 02:08:55,920 The 5 comes next and then lastly, of course, the 7. 2795 02:08:55,920 --> 02:09:00,010 Notice that the 2457 are now sorted. 2796 02:09:00,010 --> 02:09:02,802 So the original left half is sorted. 2797 02:09:02,802 --> 02:09:05,010 And I'll do the rest a little faster because, my God, 2798 02:09:05,010 --> 02:09:06,385 this feels like it takes forever. 2799 02:09:06,385 --> 02:09:08,430 But I bet we're on to something here. 2800 02:09:08,430 --> 02:09:10,320 What step remains next? 2801 02:09:10,320 --> 02:09:12,570 I've just sorted the left half of the original. 2802 02:09:12,570 --> 02:09:14,280 Sort the right half of the original. 2803 02:09:14,280 --> 02:09:15,300 How do I sort this? 2804 02:09:15,300 --> 02:09:17,880 I sort the left half of the right half. 2805 02:09:17,880 --> 02:09:19,050 How do I sort this? 2806 02:09:19,050 --> 02:09:21,360 I sort the left half of the left half. 2807 02:09:21,360 --> 02:09:22,320 Done. 2808 02:09:22,320 --> 02:09:24,360 I sort the right half of the left half. 2809 02:09:24,360 --> 02:09:25,260 Done. 2810 02:09:25,260 --> 02:09:27,390 Now I merge the two together. 2811 02:09:27,390 --> 02:09:30,190 The 1 comes first, the 6 comes next. 2812 02:09:30,190 --> 02:09:34,380 Now I sort the right half of the right half. 2813 02:09:34,380 --> 02:09:35,190 What do I do? 2814 02:09:35,190 --> 02:09:36,600 Sort the left half. 2815 02:09:36,600 --> 02:09:37,200 Done. 2816 02:09:37,200 --> 02:09:38,200 Sort the right half. 2817 02:09:38,200 --> 02:09:38,700 Done. 2818 02:09:38,700 --> 02:09:39,420 What do I do? 2819 02:09:39,420 --> 02:09:41,230 Merge them together. 2820 02:09:41,230 --> 02:09:43,080 So that's the third step of that phase. 2821 02:09:43,080 --> 02:09:47,760 Now where are we in the stor-- oh my God, where are we in the story? 2822 02:09:47,760 --> 02:09:52,230 We have sorted the left half of the right half 2823 02:09:52,230 --> 02:09:54,180 and the right half of the right half. 2824 02:09:54,180 --> 02:09:55,760 What comes next? 2825 02:09:55,760 --> 02:09:56,263 Merge. 2826 02:09:56,263 --> 02:09:58,430 So I'm going to compare, and I'm going to move those 2827 02:09:58,430 --> 02:10:00,590 down just to make clear what I'm comparing, 2828 02:10:00,590 --> 02:10:02,300 the beginning of both sublists. 2829 02:10:02,300 --> 02:10:03,110 What comes first? 2830 02:10:03,110 --> 02:10:04,610 Of course, the 0. 2831 02:10:04,610 --> 02:10:07,110 What comes next? 2832 02:10:07,110 --> 02:10:08,550 What comes next? 2833 02:10:08,550 --> 02:10:10,340 The 1. 2834 02:10:10,340 --> 02:10:11,730 What comes next? 2835 02:10:11,730 --> 02:10:13,010 The 3. 2836 02:10:13,010 --> 02:10:14,930 And then lastly comes the 6. 2837 02:10:14,930 --> 02:10:16,880 All right, where are we in the story? 2838 02:10:16,880 --> 02:10:19,220 We've now sorted the left half of the original 2839 02:10:19,220 --> 02:10:20,780 and the right half of the original. 2840 02:10:20,780 --> 02:10:22,430 What step remains? 2841 02:10:22,430 --> 02:10:23,030 Merge. 2842 02:10:23,030 --> 02:10:25,070 All right, so I'm going to make the same point. 2843 02:10:25,070 --> 02:10:27,690 And this is actually literally what we did earlier 2844 02:10:27,690 --> 02:10:31,760 because I deliberately demoed those original numbers in this order, 2 2845 02:10:31,760 --> 02:10:32,630 and a 0. 2846 02:10:32,630 --> 02:10:34,010 This comes out first. 2847 02:10:34,010 --> 02:10:35,300 What comes next? 2848 02:10:35,300 --> 02:10:36,230 2 and 1. 2849 02:10:36,230 --> 02:10:37,890 The 1 comes out next. 2850 02:10:37,890 --> 02:10:39,450 What comes next? 2851 02:10:39,450 --> 02:10:40,920 The 2 comes next. 2852 02:10:40,920 --> 02:10:42,030 What comes next? 2853 02:10:42,030 --> 02:10:43,410 The 3 comes next. 2854 02:10:43,410 --> 02:10:44,460 What comes next? 2855 02:10:44,460 --> 02:10:47,150 The 4. 2856 02:10:47,150 --> 02:10:48,590 What comes after that? 2857 02:10:48,590 --> 02:10:50,720 The 5. 2858 02:10:50,720 --> 02:10:51,920 What comes after that? 2859 02:10:51,920 --> 02:10:52,700 The 6. 2860 02:10:52,700 --> 02:10:56,760 And lastly-- this is when we run out of memory-- 2861 02:10:56,760 --> 02:11:01,010 the 7 over there is actually in place. 2862 02:11:01,010 --> 02:11:03,390 OK. 2863 02:11:03,390 --> 02:11:05,867 OK, so admittedly, a little harder to explain, 2864 02:11:05,867 --> 02:11:07,950 and honestly, it gets a little trippy because it's 2865 02:11:07,950 --> 02:11:10,980 so easy to forget about where you are in the story 2866 02:11:10,980 --> 02:11:13,680 because we're constantly diving into the algorithm 2867 02:11:13,680 --> 02:11:15,250 and then backing back out of it. 2868 02:11:15,250 --> 02:11:17,970 But in code, we could express this pretty correctly 2869 02:11:17,970 --> 02:11:21,000 and, it turns out, pretty efficiently because what 2870 02:11:21,000 --> 02:11:24,660 I was doing, even though it's longer when I do it verbally, 2871 02:11:24,660 --> 02:11:28,260 I was touching these elements a minimal amount of times, right? 2872 02:11:28,260 --> 02:11:31,530 I wasn't going back and forth, back and forth in front of the shelf 2873 02:11:31,530 --> 02:11:32,490 again and again. 2874 02:11:32,490 --> 02:11:37,240 I was deliberately only ever merging the smallest elements in each list. 2875 02:11:37,240 --> 02:11:40,200 So every time we merge, even though I was doing it quickly, 2876 02:11:40,200 --> 02:11:44,370 my fingers were only touching each of the elements once. 2877 02:11:44,370 --> 02:11:50,152 And how many times did we divide, divide, divide in half the list? 2878 02:11:50,152 --> 02:11:52,110 Well, we started with all of the elements here, 2879 02:11:52,110 --> 02:11:53,318 and there were eight of them. 2880 02:11:53,318 --> 02:11:56,860 And then we moved them 1, 2, 3 positions. 2881 02:11:56,860 --> 02:12:03,420 So the height of this visualization, if you will, is actually log n, right? 2882 02:12:03,420 --> 02:12:06,280 If I started with 8, turns out if you do the arithmetic, 2883 02:12:06,280 --> 02:12:10,380 this is log n height because 2 to the 3 is 8. 2884 02:12:10,380 --> 02:12:12,750 But for now, just trust that this is a log n height. 2885 02:12:12,750 --> 02:12:14,280 And how wide is the shelf? 2886 02:12:14,280 --> 02:12:18,270 Well, it's of width n because there's n elements any time 2887 02:12:18,270 --> 02:12:19,470 they were on the shelf. 2888 02:12:19,470 --> 02:12:23,737 So technically, I was kind of cheating this algorithm because this 2889 02:12:23,737 --> 02:12:25,320 is the first time I've needed shelves. 2890 02:12:25,320 --> 02:12:29,010 With the human examples, we just had the humans, and that's it, and only eight 2891 02:12:29,010 --> 02:12:29,520 of them. 2892 02:12:29,520 --> 02:12:32,200 Here, I was sort of using more and more memory. 2893 02:12:32,200 --> 02:12:34,770 In fact, I was using like four times as much memory 2894 02:12:34,770 --> 02:12:36,930 even though that was just for visualization's sake. 2895 02:12:36,930 --> 02:12:41,040 Merge sort actually requires that you have some spare space, an empty array 2896 02:12:41,040 --> 02:12:44,045 to move the elements into when you're merging them together. 2897 02:12:44,045 --> 02:12:46,920 But if I really wanted and if I didn't have this shelf or this shelf, 2898 02:12:46,920 --> 02:12:49,590 honestly, I could have just gone back and forth between the two shelves. 2899 02:12:49,590 --> 02:12:51,250 That would have been sufficient. 2900 02:12:51,250 --> 02:12:56,190 So merge sort uses more memory for this merging process, 2901 02:12:56,190 --> 02:12:59,100 but the advantage of using more memory is 2902 02:12:59,100 --> 02:13:04,680 that the total running time, if you can perhaps infer from that math, is what? 2903 02:13:04,680 --> 02:13:07,560 The big O notation for merge sort, it turns out, 2904 02:13:07,560 --> 02:13:10,530 is actually going to be n times log n. 2905 02:13:10,530 --> 02:13:13,050 And even if you're a little rusty still on your logarithms, 2906 02:13:13,050 --> 02:13:18,390 we saw in week zero and again today that log n is smaller than n. 2907 02:13:18,390 --> 02:13:19,650 That's a good thing. 2908 02:13:19,650 --> 02:13:20,880 Binary search was log n. 2909 02:13:20,880 --> 02:13:23,250 That's faster than linear search, which was n. 2910 02:13:23,250 --> 02:13:28,830 So n times log n is, of course, smaller than n times n or n squared. 2911 02:13:28,830 --> 02:13:31,830 So it's sort of lower on this little cheat sheet that I've been drawing, 2912 02:13:31,830 --> 02:13:35,350 which is to suggest that it's running time is indeed better or faster. 2913 02:13:35,350 --> 02:13:38,340 And in fact, if we consider the best case running time, 2914 02:13:38,340 --> 02:13:42,645 turns out it's not quite as good as bubble sort with omega of n, 2915 02:13:42,645 --> 02:13:45,270 where you can just sort of abort if you realize, wait a minute, 2916 02:13:45,270 --> 02:13:46,320 I've done no work. 2917 02:13:46,320 --> 02:13:50,850 Merge sort, you actually have to do that work to get to the finish line anyway. 2918 02:13:50,850 --> 02:13:56,170 So it's actually in omega and ultimately theta of n log n as well. 2919 02:13:56,170 --> 02:13:58,230 So again, a trade off there because if you 2920 02:13:58,230 --> 02:14:00,540 happen to have a data set that is very often sorted, 2921 02:14:00,540 --> 02:14:02,665 honestly, you might want to stick with bubble sort. 2922 02:14:02,665 --> 02:14:05,490 But in the general case, where the data is unsorted, 2923 02:14:05,490 --> 02:14:08,820 n log n as sounding better than n squared. 2924 02:14:08,820 --> 02:14:10,830 Well, what does it actually look or feel like? 2925 02:14:10,830 --> 02:14:14,230 Give me a moment to just change over to our visualization here. 2926 02:14:14,230 --> 02:14:18,210 And we'll see with this example what merge sort looks like 2927 02:14:18,210 --> 02:14:20,170 depicted with now these vertical bars. 2928 02:14:20,170 --> 02:14:22,860 So same algorithm, but instead of my numbers on shelves, 2929 02:14:22,860 --> 02:14:27,970 here is a random array of numbers being sorted. 2930 02:14:27,970 --> 02:14:30,120 And you can see it being done half at a time. 2931 02:14:30,120 --> 02:14:33,720 And you see sort of remnants of the previous bars. 2932 02:14:33,720 --> 02:14:35,520 Actually, that was unfair. 2933 02:14:35,520 --> 02:14:37,500 Let me zoom out here. 2934 02:14:37,500 --> 02:14:42,030 Let me zoom out so you can actually see the height here. 2935 02:14:42,030 --> 02:14:44,730 Let me go ahead and randomize this again and run merge sort. 2936 02:14:44,730 --> 02:14:45,310 There we go. 2937 02:14:45,310 --> 02:14:50,100 Now you can see the second array and where the values are going temporarily. 2938 02:14:50,100 --> 02:14:54,130 And even though this one looks way more cryptic visualization-wise, 2939 02:14:54,130 --> 02:14:56,170 it does seem to be moving faster. 2940 02:14:56,170 --> 02:14:59,670 And it seems to be merging halves together, and boom, it's done. 2941 02:14:59,670 --> 02:15:03,990 So let's actually see, in conclusion, what these algorithms compare to 2942 02:15:03,990 --> 02:15:07,080 and consider that moving forward as we write more and more code, 2943 02:15:07,080 --> 02:15:10,350 the goal is, again, not just to be correct but to be well-designed. 2944 02:15:10,350 --> 02:15:13,820 And one measure of design is going to indeed be efficiency. 2945 02:15:13,820 --> 02:15:18,120 So here we have, in final, a visualization of three algorithms-- 2946 02:15:18,120 --> 02:15:21,000 selection sort, bubble sort, and merge sort-- 2947 02:15:21,000 --> 02:15:22,450 from top to bottom. 2948 02:15:22,450 --> 02:15:25,620 And let's see what these algorithms might look or sound like here. 2949 02:15:25,620 --> 02:15:27,855 Oh, if we can dim the lights for dramatic effect-- 2950 02:15:27,855 --> 02:15:32,160 2951 02:15:32,160 --> 02:15:35,970 selection's on top, bubble on bottom, merge in the middle. 2952 02:15:35,970 --> 02:16:33,320 [MUSIC PLAYING] 2953 02:16:33,320 --> 02:17:11,000 [MUSIC PLAYING]