WEBVTT X-TIMESTAMP-MAP=LOCAL:00:00:00.000,MPEGTS:900000 00:00:00.000 --> 00:01:17.345 [MUSIC PLAYING] 00:01:17.345 --> 00:01:22.170 DAVID J. MALAN: This is CS50, and this is already week three. 00:01:22.170 --> 00:01:25.695 And even as we've gotten much more into the minutia of programming 00:01:25.695 --> 00:01:27.570 and some of the C stuff that we've been doing 00:01:27.570 --> 00:01:30.840 is all the more cryptic looking, recall that at the end of the day, 00:01:30.840 --> 00:01:34.647 like, everything we've been doing ultimately fits into to this model. 00:01:34.647 --> 00:01:36.480 So keep that in mind, particularly as things 00:01:36.480 --> 00:01:39.230 seem like they're getting more complicated and more sophisticated. 00:01:39.230 --> 00:01:41.850 It's just a process of learning a new language that ultimately 00:01:41.850 --> 00:01:44.430 lets us express this process. 00:01:44.430 --> 00:01:47.490 And of course, last week we really went into the weeds of like how 00:01:47.490 --> 00:01:49.240 inputs and outputs are represented. 00:01:49.240 --> 00:01:53.580 And this thing here, a photograph thereof, is called what? 00:01:53.580 --> 00:01:54.742 This is what? 00:01:54.742 --> 00:01:55.628 AUDIENCE: RAM. 00:01:55.628 --> 00:01:56.670 DAVID J. MALAN: RAM, I heard-- 00:01:56.670 --> 00:01:59.670 Random Access Memory or just generally known as memory. 00:01:59.670 --> 00:02:02.280 And recall that we looked at one of these little black chips 00:02:02.280 --> 00:02:04.500 that contains all of the bytes-- 00:02:04.500 --> 00:02:05.790 all of the bits, ultimately. 00:02:05.790 --> 00:02:08.400 It's just kind of a grid, sort of an artist grid, that 00:02:08.400 --> 00:02:11.520 allows us to think about every one of these memory locations 00:02:11.520 --> 00:02:14.040 as just having a number or an address, so to speak. 00:02:14.040 --> 00:02:16.770 Like, this might be byte number 0 and then 1 and then 2 00:02:16.770 --> 00:02:19.680 and then, maybe way down here again, something like 2 billion 00:02:19.680 --> 00:02:22.080 if you have 2 gigabytes of memory. 00:02:22.080 --> 00:02:26.070 And so as we did that, we started to explore how we could use this canvas 00:02:26.070 --> 00:02:29.760 to create kind of our own information, our own inputs and outputs, not 00:02:29.760 --> 00:02:32.380 just the basics like ints and floats and so forth. 00:02:32.380 --> 00:02:34.470 But we also talked about strings. 00:02:34.470 --> 00:02:37.080 And what is a string as you now know it? 00:02:37.080 --> 00:02:40.200 How would you describe in layperson's terms a string? 00:02:40.200 --> 00:02:41.010 Yeah, over there. 00:02:41.010 --> 00:02:41.990 AUDIENCE: I was gonna say-- 00:02:41.990 --> 00:02:42.480 [AUDIO OUT] 00:02:42.480 --> 00:02:43.897 DAVID J. MALAN: An array of characters. 00:02:43.897 --> 00:02:45.690 And an array, meanwhile-- let's go there. 00:02:45.690 --> 00:02:50.580 How might someone else define an array in more familiar now terms? 00:02:50.580 --> 00:02:53.160 What would be an array? 00:02:53.160 --> 00:02:53.670 Yeah. 00:02:53.670 --> 00:02:57.070 AUDIENCE: Kind of like an indexed set of things. 00:02:57.070 --> 00:02:59.060 DAVID J. MALAN: An indexed set of things-- not bad. 00:02:59.060 --> 00:03:02.210 And I think a key characteristic to keep in mind with an array is that it 00:03:02.210 --> 00:03:03.890 does actually pertain to memory. 00:03:03.890 --> 00:03:05.810 And it's contiguous memory. 00:03:05.810 --> 00:03:09.200 Byte after byte after byte is what constitutes an array. 00:03:09.200 --> 00:03:11.810 And we'll see in a couple of weeks time that there's actually 00:03:11.810 --> 00:03:15.770 more interesting ways to use this same primitive Canvas to stitch together 00:03:15.770 --> 00:03:19.240 things that are sort of two directional even that have some kind of shape 00:03:19.240 --> 00:03:19.740 to them. 00:03:19.740 --> 00:03:22.970 But for now, all we've talked about is arrays and just using these things 00:03:22.970 --> 00:03:27.240 from left to right, top to bottom, contiguously to represent information. 00:03:27.240 --> 00:03:29.900 So today, we'll consider still an array. 00:03:29.900 --> 00:03:33.020 But we won't focus so much on representation 00:03:33.020 --> 00:03:34.460 of strings or other data types. 00:03:34.460 --> 00:03:36.918 We'll actually now focus on the other part of that process, 00:03:36.918 --> 00:03:39.950 of inputs becoming outputs, namely the thing in the middle-- 00:03:39.950 --> 00:03:40.940 algorithms. 00:03:40.940 --> 00:03:45.200 But we have to keep in mind, even though every time we've looked at an array 00:03:45.200 --> 00:03:48.610 thus far, certainly on the board like this, you as a human certainly 00:03:48.610 --> 00:03:50.360 have the luxury of just kind of eyeballing 00:03:50.360 --> 00:03:53.820 the whole thing with a bird's eye view and seeing where all of those numbers 00:03:53.820 --> 00:03:54.320 are. 00:03:54.320 --> 00:03:56.153 If I asked you where a particular number is, 00:03:56.153 --> 00:03:59.150 like zero, odds are your eyes would go right to where it is, 00:03:59.150 --> 00:04:02.150 and boom, problem solved in sort of one step. 00:04:02.150 --> 00:04:07.260 But the catch is, with a computer that has this memory, even though you, 00:04:07.260 --> 00:04:10.730 the human, can [INAUDIBLE] see everything at once, a computer cannot. 00:04:10.730 --> 00:04:14.090 It's better to think of your computer's memory, your phone's memory, 00:04:14.090 --> 00:04:17.000 or more specifically an array of memory like this 00:04:17.000 --> 00:04:21.500 as really being a set of closed doors, not unlike lockers in a school. 00:04:21.500 --> 00:04:24.260 And only by opening each of those doors can 00:04:24.260 --> 00:04:26.090 the computer actually see what's in there, 00:04:26.090 --> 00:04:28.610 which is to say that the computer, unlike you, doesn't 00:04:28.610 --> 00:04:32.480 have this bird's eye view of all of the data in all these locations. 00:04:32.480 --> 00:04:35.400 It has to much more methodically look here, 00:04:35.400 --> 00:04:39.590 maybe look here, maybe look here, and so forth in order to find something. 00:04:39.590 --> 00:04:43.670 Now fortunately, we already have some building blocks-- loops, conditions, 00:04:43.670 --> 00:04:45.210 Boolean expressions, and the like-- 00:04:45.210 --> 00:04:47.120 where you could imagine writing some code 00:04:47.120 --> 00:04:51.680 that very methodically goes from left to right or right to left or something 00:04:51.680 --> 00:04:55.490 more sophisticated that actually finds something you're looking for. 00:04:55.490 --> 00:04:58.850 And just remember that the conventions we've had since last week 00:04:58.850 --> 00:05:03.320 now is that these arrays are zero indexed, so to speak. 00:05:03.320 --> 00:05:08.150 To be zero indexed just means that the data type starts counting from zero. 00:05:08.150 --> 00:05:11.930 So this is location 0, 1, 2, 3, 4, 5, 6. 00:05:11.930 --> 00:05:15.320 And notice even though there are seven total doors here, 00:05:15.320 --> 00:05:17.930 the right-most one, of course, is called 6 00:05:17.930 --> 00:05:19.940 just because we've started counting at 0. 00:05:19.940 --> 00:05:24.680 So in the general case, if you had n doors or n bytes of memory, 00:05:24.680 --> 00:05:29.480 0 would always be at the left, and n minus 1 would always be at the right. 00:05:29.480 --> 00:05:33.860 That's sort of a generalization of just thinking about this kind of convention. 00:05:33.860 --> 00:05:37.790 All right, so let's revisit the problem that we started the whole term off 00:05:37.790 --> 00:05:40.523 with in week zero, which was this notion of searching. 00:05:40.523 --> 00:05:42.440 And what does it mean to search for something? 00:05:42.440 --> 00:05:45.218 Well, to find information-- and this, of course, is omnipresent. 00:05:45.218 --> 00:05:48.260 Anytime you take out your phone, you're searching for a friend's contact. 00:05:48.260 --> 00:05:51.140 Any time you pull up a browser, you're googling for this or that. 00:05:51.140 --> 00:05:55.310 So search is kind of one of the most omnipresent topics and features 00:05:55.310 --> 00:05:56.877 of any device these days. 00:05:56.877 --> 00:05:59.960 So let's consider how the Googles, the Apples, the Microsofts of the world 00:05:59.960 --> 00:06:03.660 are implementing something as seemingly familiar as this. 00:06:03.660 --> 00:06:06.020 So here might be the problem statement. 00:06:06.020 --> 00:06:08.450 We want some input to become some output. 00:06:08.450 --> 00:06:09.800 What's that input going to be? 00:06:09.800 --> 00:06:13.520 Maybe it's a bunch of closed doors like this out of which we want 00:06:13.520 --> 00:06:16.250 to get back an answer, true or false. 00:06:16.250 --> 00:06:18.785 Is something we're looking for there or not? 00:06:18.785 --> 00:06:21.410 You can imagine taking this one step further and trying to find 00:06:21.410 --> 00:06:23.520 where is the thing you're looking for. 00:06:23.520 --> 00:06:25.940 But for now, let's just take one bite out of the problem. 00:06:25.940 --> 00:06:30.560 Can we tell ourselves, true or false, is some number behind one 00:06:30.560 --> 00:06:33.390 of these doors or lockers in memory? 00:06:33.390 --> 00:06:37.760 But before we go there and start talking about ways to do that-- that is, 00:06:37.760 --> 00:06:38.580 algorithms. 00:06:38.580 --> 00:06:42.680 Let's consider how we might lay the foundation of, like, 00:06:42.680 --> 00:06:46.108 comparing whether one algorithm is better than another. 00:06:46.108 --> 00:06:47.900 We talked about correctness, and it sort of 00:06:47.900 --> 00:06:51.440 goes without saying that any code you write, any algorithm you implement, 00:06:51.440 --> 00:06:52.460 had better be correct. 00:06:52.460 --> 00:06:55.340 Otherwise, what's the point if it doesn't give you the right answers? 00:06:55.340 --> 00:06:57.260 But we also talked about design. 00:06:57.260 --> 00:07:01.280 And in your own words, what do we mean when we say a program is better 00:07:01.280 --> 00:07:04.470 designed at this stage than another? 00:07:04.470 --> 00:07:07.880 How do you think about this notion of design now? 00:07:07.880 --> 00:07:09.050 Yeah, in the middle? 00:07:09.050 --> 00:07:11.300 AUDIENCE: Easier to understand or easier to institute. 00:07:11.300 --> 00:07:12.990 DAVID J. MALAN: OK, so easier to understand. 00:07:12.990 --> 00:07:13.825 I like that. 00:07:13.825 --> 00:07:14.450 Other thoughts? 00:07:14.450 --> 00:07:14.950 Yeah. 00:07:14.950 --> 00:07:15.980 AUDIENCE: Efficiency. 00:07:15.980 --> 00:07:18.813 DAVID J. MALAN: Efficiency, and what do you mean by efficiency precisely? 00:07:18.813 --> 00:07:22.163 AUDIENCE: [INAUDIBLE] 00:07:22.163 --> 00:07:22.830 DAVID J. MALAN: Nice. 00:07:22.830 --> 00:07:25.390 It doesn't use up too much memory, and it isn't redundant. 00:07:25.390 --> 00:07:27.120 So you can think about design along a few 00:07:27.120 --> 00:07:29.078 of these axes-- sort of the quality of the code 00:07:29.078 --> 00:07:31.270 but also the quality of the performance. 00:07:31.270 --> 00:07:35.818 And as our programs get bigger and more sophisticated and just longer, 00:07:35.818 --> 00:07:37.860 those kinds of things are really going to matter. 00:07:37.860 --> 00:07:39.860 And in the real world, if you start writing code 00:07:39.860 --> 00:07:42.630 not just by yourself but with someone else, getting the design 00:07:42.630 --> 00:07:46.140 right is just going to make it easier to collaborate and ultimately produce, 00:07:46.140 --> 00:07:48.670 write code, with just higher probability. 00:07:48.670 --> 00:07:52.290 So let's consider how we might focus on exactly the second characteristic, 00:07:52.290 --> 00:07:54.390 the efficiency, of an algorithm. 00:07:54.390 --> 00:07:58.230 And the way we might talk about the efficiency of algorithms, just how fast 00:07:58.230 --> 00:08:01.350 or how slow they are, is in terms of their running time. 00:08:01.350 --> 00:08:04.900 That is to say, when they're running, how much time do they take? 00:08:04.900 --> 00:08:08.010 And we might measure this in seconds or milliseconds or minutes 00:08:08.010 --> 00:08:10.350 or just some number of steps in the general case 00:08:10.350 --> 00:08:14.560 because presumably fewer steps, to your point, is better than more steps. 00:08:14.560 --> 00:08:16.410 So how might we think about running times? 00:08:16.410 --> 00:08:19.470 Well, there's one general notation we should define today. 00:08:19.470 --> 00:08:23.610 So computer scientists tend to describe the running time of an algorithm 00:08:23.610 --> 00:08:27.840 or a piece of code, for that matter, in terms of what's called big O notation. 00:08:27.840 --> 00:08:30.630 This is literally a capitalized O, a big O. 00:08:30.630 --> 00:08:34.110 And this generally means that the running time of some algorithm 00:08:34.110 --> 00:08:37.559 is on the order of such and such, where such and such, we'll see, 00:08:37.559 --> 00:08:40.169 is just going to be a very simple mathematical formula. 00:08:40.169 --> 00:08:42.450 It's kind of a way of waving your hands mathematically 00:08:42.450 --> 00:08:46.770 to convey the idea of just how fast or how slow some algorithm or code is 00:08:46.770 --> 00:08:48.600 without getting into the weeds of like, it 00:08:48.600 --> 00:08:52.330 took this many milliseconds or this many specific number of steps. 00:08:52.330 --> 00:08:56.280 So you might recall then from week zero, I even introduced this picture 00:08:56.280 --> 00:08:57.360 but without much context. 00:08:57.360 --> 00:09:00.660 At the time, we just use this to compare those phone book algorithms. 00:09:00.660 --> 00:09:03.600 Recall that this red straight line was the first algorithm, 00:09:03.600 --> 00:09:04.920 one page at a time. 00:09:04.920 --> 00:09:09.870 The yellow line that's still straight differed how if you recall? 00:09:09.870 --> 00:09:14.370 That line represented what alternative algorithm? 00:09:14.370 --> 00:09:15.370 Looking out and back. 00:09:15.370 --> 00:09:16.620 What is that second algorithm? 00:09:16.620 --> 00:09:17.430 Yeah, over there. 00:09:17.430 --> 00:09:18.930 AUDIENCE: Like, two pages at a time. 00:09:18.930 --> 00:09:22.080 DAVID J. MALAN: Two pages at a time, which was almost correct so long as we 00:09:22.080 --> 00:09:25.340 potentially double back a page if maybe we go a little too far in the phone 00:09:25.340 --> 00:09:25.840 book. 00:09:25.840 --> 00:09:28.372 So it had a potential bug but arguably solvable. 00:09:28.372 --> 00:09:31.080 This last algorithm, though, was the so-called divide and conquer 00:09:31.080 --> 00:09:34.590 strategy where I sort of unnecessarily tore the phone book in half 00:09:34.590 --> 00:09:36.600 and then in half and then in half, which, 00:09:36.600 --> 00:09:40.890 as dramatic as that was unnecessarily, it actually took significantly bigger 00:09:40.890 --> 00:09:43.420 bites out of the problem-- like 500 pages 00:09:43.420 --> 00:09:49.360 the first time, another 250, another 125 versus just 1 or 2 bytes at a time. 00:09:49.360 --> 00:09:52.583 And so we described its running time as this picture 00:09:52.583 --> 00:09:55.500 there, though I didn't use that expression at the time, running times. 00:09:55.500 --> 00:09:58.230 But indeed, time to solve might be measured just 00:09:58.230 --> 00:10:00.150 abstractly in some unit of measure-- 00:10:00.150 --> 00:10:03.420 seconds, milliseconds, minutes, pages-- 00:10:03.420 --> 00:10:05.110 via this y-axis here. 00:10:05.110 --> 00:10:07.830 So let's now slap some numbers on this. 00:10:07.830 --> 00:10:11.250 If we had n pages in that phone book, n just representing 00:10:11.250 --> 00:10:13.620 a generic number, the first algorithm here 00:10:13.620 --> 00:10:15.450 we might describe as taking n steps. 00:10:15.450 --> 00:10:18.870 Second algorithm we might describe as taking n divided by 2 steps, 00:10:18.870 --> 00:10:21.510 maybe give or take one if we have to double back but generally 00:10:21.510 --> 00:10:22.482 n divided by 2. 00:10:22.482 --> 00:10:24.690 And then this thing, if you remember your logarithms, 00:10:24.690 --> 00:10:26.648 was sort of a fundamentally different formula-- 00:10:26.648 --> 00:10:30.100 log base 2 of n or just log of n for short. 00:10:30.100 --> 00:10:32.790 So this is of a fundamentally different formula. 00:10:32.790 --> 00:10:36.430 But what's noteworthy is that these first two algorithms, 00:10:36.430 --> 00:10:40.050 even though, yes, the second algorithm was hands down faster-- 00:10:40.050 --> 00:10:41.910 I mean, literally twice as fast-- 00:10:41.910 --> 00:10:46.560 when you start to zoom out and if I increase my y-axis and x-axis, 00:10:46.560 --> 00:10:52.560 these first two start to look awfully similar to one another. 00:10:52.560 --> 00:10:54.810 And if we keep zooming out and zooming out and zooming 00:10:54.810 --> 00:10:57.090 out as n gets really large-- 00:10:57.090 --> 00:10:59.220 that is, the x-axis gets really long-- 00:10:59.220 --> 00:11:03.250 these first two algorithms start to become essentially the same. 00:11:03.250 --> 00:11:06.390 And so this is where computer scientists use big O notation. 00:11:06.390 --> 00:11:10.420 Instead of saying specifically, this algorithm takes any steps. 00:11:10.420 --> 00:11:13.620 And this one n divided by 2, a computer scientist would say, 00:11:13.620 --> 00:11:17.010 eh, each of those algorithms takes on the order of n steps 00:11:17.010 --> 00:11:19.020 or on the order of n over 2. 00:11:19.020 --> 00:11:19.770 But you know what? 00:11:19.770 --> 00:11:23.490 On the order of n over 2 is pretty much the same 00:11:23.490 --> 00:11:29.620 when n gets really large as being equivalent to big O of n itself. 00:11:29.620 --> 00:11:34.480 So yes, in practice, it's obviously fewer steps to move twice as fast. 00:11:34.480 --> 00:11:37.950 But in the big picture, when n becomes a million, a billion, 00:11:37.950 --> 00:11:40.620 the numbers are already so darn big at that point 00:11:40.620 --> 00:11:43.680 that these are as, the shapes of these curves imply, 00:11:43.680 --> 00:11:45.930 pretty much functionally equivalent. 00:11:45.930 --> 00:11:49.080 But this one still looks better and better 00:11:49.080 --> 00:11:52.470 as n gets large because it's rising so much less quickly. 00:11:52.470 --> 00:11:54.660 And so here, a computer scientist would say 00:11:54.660 --> 00:11:59.318 that that third algorithm was on the order of-- that is, big O of-- log n. 00:11:59.318 --> 00:12:01.110 And they don't have to bother with the base 00:12:01.110 --> 00:12:04.980 because it's a smaller mathematical detail that is also just in some sense 00:12:04.980 --> 00:12:07.890 a constant, multiplicative factor. 00:12:07.890 --> 00:12:09.840 So in short, what are the takeaways here? 00:12:09.840 --> 00:12:12.270 This is just a new vocabulary that we'll start 00:12:12.270 --> 00:12:16.020 to use when we just want to describe the running time of an algorithm. 00:12:16.020 --> 00:12:18.600 To make this more real, if any of you have implemented 00:12:18.600 --> 00:12:24.290 a for loop at this point in any of your code and that for loop iterated n times 00:12:24.290 --> 00:12:27.590 where maybe in was the height of your pyramid or maybe n 00:12:27.590 --> 00:12:31.550 was something else that you wanted to do n times, you wrote code 00:12:31.550 --> 00:12:36.440 or you implemented an algorithm that operated in big O of n time, 00:12:36.440 --> 00:12:37.200 if you will. 00:12:37.200 --> 00:12:39.350 So this is just a way now to retroactively start 00:12:39.350 --> 00:12:43.490 describing with somewhat mathematical notation what we've 00:12:43.490 --> 00:12:45.780 been doing in practice for a while now. 00:12:45.780 --> 00:12:51.500 So here's a list of commonly seen running times in the real world. 00:12:51.500 --> 00:12:55.010 This is not a thorough list because you could come up 00:12:55.010 --> 00:12:57.710 with an infinite number of mathematical formulas, certainly. 00:12:57.710 --> 00:13:01.400 But the common ones we'll discuss and you will see in your own code 00:13:01.400 --> 00:13:04.070 probably reduce to this list here. 00:13:04.070 --> 00:13:06.320 And if you were to study more computer science theory, 00:13:06.320 --> 00:13:07.903 this list would get longer and longer. 00:13:07.903 --> 00:13:11.930 But for now, these are sort of the most familiar ones that we'll soon see. 00:13:11.930 --> 00:13:14.700 All right, two other pieces of vocabulary, if you will, 00:13:14.700 --> 00:13:16.160 before we start to use this stuff-- 00:13:16.160 --> 00:13:19.430 so this, a big omega, capital omega symbol, 00:13:19.430 --> 00:13:25.170 is used now to describe a lower bound on the running time of an algorithm. 00:13:25.170 --> 00:13:28.610 So to be clear, big O is on the order of-- that 00:13:28.610 --> 00:13:31.910 is, an upper bound-- on how many steps an algorithm might 00:13:31.910 --> 00:13:34.700 take, on the order of so many steps. 00:13:34.700 --> 00:13:37.580 If you want to talk, though, from the other perspective, well, 00:13:37.580 --> 00:13:39.710 how few steps my algorithm take? 00:13:39.710 --> 00:13:42.320 Maybe in the so-called best case, it'd be nice 00:13:42.320 --> 00:13:45.110 if we had a notation to just describe what a lower 00:13:45.110 --> 00:13:47.840 bound is because some algorithms might be super fast 00:13:47.840 --> 00:13:49.890 in these so-called best cases. 00:13:49.890 --> 00:13:53.660 So the symbology is almost the same, but we replace the big O 00:13:53.660 --> 00:13:54.780 with the big omega. 00:13:54.780 --> 00:13:58.580 So to be clear, big O describes an upper bound and omega 00:13:58.580 --> 00:14:00.110 describes a lower bound. 00:14:00.110 --> 00:14:02.540 And we'll see examples of this before long. 00:14:02.540 --> 00:14:08.420 And then lastly, last one here, big theta, is used by a computer scientist 00:14:08.420 --> 00:14:12.860 when you have a case where both the upper bound on an algorithm's 00:14:12.860 --> 00:14:15.890 running time is the same as the lower bound. 00:14:15.890 --> 00:14:19.580 You can then describe it in one breath as being in theta of such and such 00:14:19.580 --> 00:14:23.660 instead of saying it's in big O and in omega of something else. 00:14:23.660 --> 00:14:27.920 All right, so out of context, sort of just seemingly cryptic symbols, 00:14:27.920 --> 00:14:31.463 but all they refer to is upper bounds, lower bounds, or when 00:14:31.463 --> 00:14:32.880 they happen to be one in the same. 00:14:32.880 --> 00:14:36.440 And we'll now introduce over time examples of how we might actually 00:14:36.440 --> 00:14:38.990 apply these to concrete problems. 00:14:38.990 --> 00:14:43.220 But first, let me pause to see if there's any questions. 00:14:43.220 --> 00:14:46.120 Any questions here? 00:14:46.120 --> 00:14:48.430 Any questions? 00:14:48.430 --> 00:14:50.880 I see pointing somewhere. 00:14:50.880 --> 00:14:53.100 Where are you pointing to? 00:14:53.100 --> 00:14:54.640 Over here-- there we go. 00:14:54.640 --> 00:14:55.980 OK, sorry-- very bright. 00:14:55.980 --> 00:14:59.560 AUDIENCE: So, um, smaller-- 00:14:59.560 --> 00:15:01.520 DAVID J. MALAN: Smaller n functions move faster. 00:15:01.520 --> 00:15:06.260 So yes, if you have something like n, that takes only steps. 00:15:06.260 --> 00:15:09.265 If you have a formula like n squared, just by nature of the math, 00:15:09.265 --> 00:15:11.900 that take more steps and therefore be slower. 00:15:11.900 --> 00:15:13.810 So the larger the mathematical expression, 00:15:13.810 --> 00:15:18.280 the slower your algorithm is because the more time or more steps that it takes. 00:15:20.800 --> 00:15:23.170 AUDIENCE: So you want your n function to be small? 00:15:23.170 --> 00:15:26.410 DAVID J. MALAN: You want your n function, so to speak, to be small, yes. 00:15:26.410 --> 00:15:28.220 And in fact, the Holy Grail, so to speak, 00:15:28.220 --> 00:15:31.690 would be this last one here either in big O notation or even theta, 00:15:31.690 --> 00:15:34.900 when an algorithm is on the order of a single step. 00:15:34.900 --> 00:15:39.370 That means it literally takes constant time, one step, or maybe 10 steps, 00:15:39.370 --> 00:15:42.490 100 steps, but a fixed, constant number of steps. 00:15:42.490 --> 00:15:46.120 That's the best because even as the phone book gets bigger, 00:15:46.120 --> 00:15:50.350 even as the data set you're searching gets larger and larger, 00:15:50.350 --> 00:15:53.500 if something only takes a finite number of steps constantly, 00:15:53.500 --> 00:15:57.640 then it doesn't matter how big the data set actually gets. 00:15:57.640 --> 00:16:01.840 Questions as well on these notations-- yep, thank you for the pointing. 00:16:01.840 --> 00:16:03.310 This is actually very helpful. 00:16:03.310 --> 00:16:05.170 I'm seeing pointing this way? 00:16:05.170 --> 00:16:08.227 AUDIENCE: [INAUDIBLE] 00:16:08.227 --> 00:16:10.560 DAVID J. MALAN: What is the input to each of these functions? 00:16:10.560 --> 00:16:13.710 It is an expression of how many steps an algorithm takes. 00:16:13.710 --> 00:16:16.080 So in fact, let me go ahead and make this 00:16:16.080 --> 00:16:19.220 more concrete with an actual example here if we could. 00:16:19.220 --> 00:16:22.350 So on stage here, we have seven lockers which represent, 00:16:22.350 --> 00:16:24.300 if you will, an array of memory. 00:16:24.300 --> 00:16:26.400 And this array of memory is maybe storing 00:16:26.400 --> 00:16:30.150 seven integers, seven integers that we might actually want to search for. 00:16:30.150 --> 00:16:33.090 And if we want to search for these values, how might 00:16:33.090 --> 00:16:34.048 we go about doing this? 00:16:34.048 --> 00:16:36.257 Well, for this, why don't we make things interesting? 00:16:36.257 --> 00:16:37.810 Would a volunteer like to come on up? 00:16:37.810 --> 00:16:40.920 Have to be masked and on the internet if you are comfortable. 00:16:40.920 --> 00:16:44.190 Both of-- oh, there's someone putting their friend's hand up and back? 00:16:44.190 --> 00:16:44.940 Yes, OK. 00:16:44.940 --> 00:16:45.660 Come on down. 00:16:50.380 --> 00:16:53.080 And in just a moment, our brave volunteer 00:16:53.080 --> 00:16:57.070 is going to help me find a specific number in the data set 00:16:57.070 --> 00:16:58.520 that we have here on the screen. 00:16:58.520 --> 00:17:02.230 So come on down, and I'll get things ready for you in advance here. 00:17:02.230 --> 00:17:08.723 Come on down nice to meet. 00:17:08.723 --> 00:17:09.640 And what is your name? 00:17:09.640 --> 00:17:10.390 AUDIENCE: [? Nomira. ?] 00:17:10.390 --> 00:17:10.645 DAVID J. MALAN: Minera? 00:17:10.645 --> 00:17:11.530 AUDIENCE: [? Nomira. ?] 00:17:11.530 --> 00:17:13.113 DAVID J. MALAN: [? Nomira. ?] Nice to meet. 00:17:13.113 --> 00:17:13.700 Come on over. 00:17:13.700 --> 00:17:17.650 So here we have for Nomira seven lockers or an array of memory. 00:17:17.650 --> 00:17:19.540 And behind each of these doors is a number. 00:17:19.540 --> 00:17:22.569 And the goal, quite simply, is, given this array of memory 00:17:22.569 --> 00:17:27.680 as input, to return, true or false, is the number I care about actually there? 00:17:27.680 --> 00:17:30.250 So suppose I care about the number 0. 00:17:30.250 --> 00:17:33.730 What would be the simplest, most correct algorithm you could 00:17:33.730 --> 00:17:38.200 apply in order to find us the number 0? 00:17:38.200 --> 00:17:41.608 OK, try opening the first one. 00:17:41.608 --> 00:17:44.150 All right, and maybe just step aside so the audience can see. 00:17:44.150 --> 00:17:46.160 I think you have not found 0 yet. 00:17:46.160 --> 00:17:47.480 OK, so keep the door open. 00:17:47.480 --> 00:17:50.120 Let's move on to your next choice. 00:17:50.120 --> 00:17:51.080 Second door, sure. 00:17:51.080 --> 00:17:51.998 AUDIENCE: [INAUDIBLE] 00:17:51.998 --> 00:17:53.540 DAVID J. MALAN: Oh, go ahead, second door. 00:17:53.540 --> 00:17:54.415 Let's keep it simple. 00:17:54.415 --> 00:17:57.557 Let's just move from left to right, sort of searching our way. 00:17:57.557 --> 00:17:58.640 And what do you see there? 00:17:58.640 --> 00:17:59.770 Oh, 6, not 0. 00:17:59.770 --> 00:18:00.770 How about the next door? 00:18:04.200 --> 00:18:07.230 All right, also not working out so well yet, but that's OK. 00:18:07.230 --> 00:18:11.610 If you want to go on to the next, we're still looking for 0. 00:18:11.610 --> 00:18:12.930 All right, I see a 2. 00:18:12.930 --> 00:18:14.502 All right, it's not so good yet. 00:18:14.502 --> 00:18:15.210 Let's keep going. 00:18:15.210 --> 00:18:16.860 Next door. 00:18:16.860 --> 00:18:18.540 2, 7-- no. 00:18:18.540 --> 00:18:20.100 OK, next door. 00:18:20.100 --> 00:18:25.870 No, that's a-- all right, very well done. 00:18:25.870 --> 00:18:26.370 Oh. 00:18:29.030 --> 00:18:32.330 All right, so I kind of set you up for a fairly slow algorithm, 00:18:32.330 --> 00:18:34.550 but let me just ask you to describe what is it 00:18:34.550 --> 00:18:37.534 you did by following the steps I gave you. 00:18:37.534 --> 00:18:39.660 AUDIENCE: I just went one by one to each character. 00:18:39.660 --> 00:18:41.660 DAVID J. MALAN: You went one by one to each character 00:18:41.660 --> 00:18:43.380 if you want to talk into here. 00:18:43.380 --> 00:18:45.140 So you went one by one by each character. 00:18:45.140 --> 00:18:48.620 And would you say that algorithm left or right is correct? 00:18:48.620 --> 00:18:49.790 AUDIENCE: No. 00:18:49.790 --> 00:18:50.960 DAVID J. MALAN: No? 00:18:50.960 --> 00:18:53.000 AUDIENCE: Or, yes, in the scenario. 00:18:53.000 --> 00:18:54.500 DAVID J. MALAN: OK, yes in this scenario. 00:18:54.500 --> 00:18:55.520 Why are you hesitating? 00:18:55.520 --> 00:18:56.060 What's going through your mind? 00:18:56.060 --> 00:18:58.100 AUDIENCE: Because it's not the most efficient way to do it. 00:18:58.100 --> 00:18:58.980 DAVID J. MALAN: OK, good. 00:18:58.980 --> 00:19:01.610 So we see a contrast here between correctness and design. 00:19:01.610 --> 00:19:04.360 I mean, I do think it was correct because even though it was slow, 00:19:04.360 --> 00:19:05.960 you eventually found zero. 00:19:05.960 --> 00:19:08.490 But it took some number of steps. 00:19:08.490 --> 00:19:10.280 So in fact, this would be an algorithm. 00:19:10.280 --> 00:19:12.350 It has a name, called linear search. 00:19:12.350 --> 00:19:14.840 And, [? Nomira, ?] as you did, you kind of walked along 00:19:14.840 --> 00:19:16.735 a line going from left to right. 00:19:16.735 --> 00:19:17.360 Now let me ask. 00:19:17.360 --> 00:19:19.880 If you had gone from right to left, would the algorithm 00:19:19.880 --> 00:19:23.030 have been fundamentally better? 00:19:23.030 --> 00:19:23.780 AUDIENCE: Yes. 00:19:23.780 --> 00:19:24.885 DAVID J. MALAN: OK, and why? 00:19:24.885 --> 00:19:27.260 AUDIENCE: Because the zero is here in the first scenario. 00:19:27.260 --> 00:19:30.860 But if it was like, the zero is in the middle, it wouldn't have been. 00:19:30.860 --> 00:19:35.020 DAVID J. MALAN: Yeah, and so here is where the right way to do things 00:19:35.020 --> 00:19:36.270 becomes a little less obvious. 00:19:36.270 --> 00:19:38.707 You would absolutely have given yourself a better result 00:19:38.707 --> 00:19:40.790 if you would just happened to start from the right 00:19:40.790 --> 00:19:42.780 or if I had pointed you to start over there. 00:19:42.780 --> 00:19:45.907 But the catch is if I asked her to find another number, like the number 8, 00:19:45.907 --> 00:19:47.240 well, that would have backfired. 00:19:47.240 --> 00:19:48.948 And this time, it would have taken longer 00:19:48.948 --> 00:19:51.450 to find that number because it's way over here instead. 00:19:51.450 --> 00:19:55.970 And so in the general case, going left to right or, heck, right to left 00:19:55.970 --> 00:19:59.540 is probably as correct as you can get because if you know nothing 00:19:59.540 --> 00:20:03.517 about the order of these numbers-- and indeed, they seem to be fairly random. 00:20:03.517 --> 00:20:05.600 Some of them are smaller, some of them are bigger. 00:20:05.600 --> 00:20:07.308 There doesn't seem to be rhyme or reason. 00:20:07.308 --> 00:20:11.330 Linear search is about as good as you can do when you don't know anything 00:20:11.330 --> 00:20:13.053 a priori about the numbers. 00:20:13.053 --> 00:20:15.720 So I have a little thank you gift here, a little CS stress ball. 00:20:15.720 --> 00:20:18.680 Round of applause for our first volunteer. 00:20:18.680 --> 00:20:19.550 Thank you so much. 00:20:23.670 --> 00:20:27.690 Let's try to formalize what I just described as linear search 00:20:27.690 --> 00:20:30.938 because indeed, no matter which end [? Nomira ?] had started on, 00:20:30.938 --> 00:20:32.730 I could have kind of changed up the problem 00:20:32.730 --> 00:20:35.070 to make sure that it appears to be running slow. 00:20:35.070 --> 00:20:36.090 But it is correct. 00:20:36.090 --> 00:20:39.270 If zero were among those doors, she absolutely would have found it 00:20:39.270 --> 00:20:40.260 and indeed did. 00:20:40.260 --> 00:20:45.870 So let's now try to translate what we did into what we might call again 00:20:45.870 --> 00:20:48.210 pseudo code as from week zero. 00:20:48.210 --> 00:20:50.730 So with pseudo code, we just need a terse English 00:20:50.730 --> 00:20:53.770 like, or any language, syntax to describe what we did. 00:20:53.770 --> 00:20:56.280 So here might be one formulation of what [? Nomira ?] did. 00:20:56.280 --> 00:21:00.690 For each door, from left to right, if the number is behind the door, 00:21:00.690 --> 00:21:02.280 return true. 00:21:02.280 --> 00:21:07.630 Else, at the very end of the program, you would return false by default. 00:21:07.630 --> 00:21:08.800 And now you got lucky. 00:21:08.800 --> 00:21:10.800 And by the seventh door, [? Nomira ?] had indeed 00:21:10.800 --> 00:21:14.430 returned true by saying, well, there is the zero. 00:21:14.430 --> 00:21:17.070 But let's consider if this pseudo code is now correct, 00:21:17.070 --> 00:21:18.150 an accurate translation. 00:21:18.150 --> 00:21:22.360 First of all, normally, when we've seen ifs, we might see an if else. 00:21:22.360 --> 00:21:26.100 And yet down here, return false is aligned with the for. 00:21:26.100 --> 00:21:29.700 Why did I not indent the return false, or put another way, 00:21:29.700 --> 00:21:36.780 why did I not do if number is behind door, return true, else return false? 00:21:36.780 --> 00:21:39.855 Why would that version of this code have been problematic? 00:21:39.855 --> 00:21:41.318 Way in back. 00:21:41.318 --> 00:21:50.320 AUDIENCE: [INAUDIBLE] 00:21:50.320 --> 00:21:52.690 DAVID J. MALAN: OK, I'm not sure it's because of redundancy. 00:21:52.690 --> 00:21:55.000 Let me go ahead and just make this explicit. 00:21:55.000 --> 00:21:58.510 If I had instead done else return false, I 00:21:58.510 --> 00:22:02.942 don't think it's so much redundancy that I'd be worried about. 00:22:02.942 --> 00:22:04.150 Let me bounce somewhere else. 00:22:04.150 --> 00:22:04.810 Yeah, in front? 00:22:04.810 --> 00:22:08.350 AUDIENCE: Um, maybe [INAUDIBLE] for the entire list 00:22:08.350 --> 00:22:09.730 after just checking one number. 00:22:09.730 --> 00:22:11.920 DAVID J. MALAN: Yeah, it would be returning falls for-- 00:22:11.920 --> 00:22:13.795 even though I'd only looked at-- [? Nomira ?] 00:22:13.795 --> 00:22:15.100 had only looked at one element. 00:22:15.100 --> 00:22:18.142 And it would have been as though if all of these doors were still closed, 00:22:18.142 --> 00:22:21.610 she opens this up and says, nope, this is not zero, return false. 00:22:21.610 --> 00:22:24.708 That would give me an incorrect result because obviously, 00:22:24.708 --> 00:22:27.250 at that stage in the algorithm, she wouldn't have even looked 00:22:27.250 --> 00:22:28.720 through any of the other doors. 00:22:28.720 --> 00:22:32.510 So just the original indentation of this, if you will, 00:22:32.510 --> 00:22:35.200 without the [? else, ?] is correct because only 00:22:35.200 --> 00:22:38.950 if I get to the bottom of this algorithm or the pseudo code does 00:22:38.950 --> 00:22:41.740 it make sense to conclude at that point, once she's 00:22:41.740 --> 00:22:45.220 gone through all of the doors, that nope, there's in fact-- 00:22:45.220 --> 00:22:48.550 the number I'm looking for is, in fact, not actually there. 00:22:48.550 --> 00:22:52.790 So how might we consider now the running time of this algorithm? 00:22:52.790 --> 00:22:55.930 We have a few different types of vocabulary now. 00:22:55.930 --> 00:22:59.080 And if we consider now how we might think about this, 00:22:59.080 --> 00:23:02.620 let's start to translate it from sort of higher level pseudo code 00:23:02.620 --> 00:23:04.420 to something a little lower level. 00:23:04.420 --> 00:23:07.820 We've been writing code using n and loops and the like. 00:23:07.820 --> 00:23:12.340 So let's take this higher level pseudo code and now just kind of 00:23:12.340 --> 00:23:14.890 get a middle ground between English and C. 00:23:14.890 --> 00:23:18.910 Let me propose that we think about this version of the same algorithm 00:23:18.910 --> 00:23:20.680 as being a little more pedantic. 00:23:20.680 --> 00:23:28.820 For i from 0 to n minus 1, if number behind doors bracket i return true. 00:23:28.820 --> 00:23:31.520 Otherwise, at the end of the program, return false. 00:23:31.520 --> 00:23:33.520 Now I'm kind of mixing English and C here, 00:23:33.520 --> 00:23:35.950 but that's reasonable if the reader is familiar with C 00:23:35.950 --> 00:23:37.540 or some similar language. 00:23:37.540 --> 00:23:39.530 And notice this pattern here. 00:23:39.530 --> 00:23:44.650 This is a way of just saying in pseudo code, give myself a variable called i. 00:23:44.650 --> 00:23:49.360 Start at 0 and then just count up to n minus 1. 00:23:49.360 --> 00:23:53.410 And recall n minus 1 is not one shy of the end of the array. 00:23:53.410 --> 00:23:56.170 N minus 1 is the end of the array because again, we 00:23:56.170 --> 00:23:57.580 started counting at 0. 00:23:57.580 --> 00:24:00.880 So this is a very common way of expressing this kind of loop 00:24:00.880 --> 00:24:03.910 from the left all the way to the right of an array. 00:24:03.910 --> 00:24:07.210 Doors I'm kind of implicitly treating as the name of this array, 00:24:07.210 --> 00:24:10.000 like it's a variable from last week that I defined as being 00:24:10.000 --> 00:24:11.780 an array of integers in this case. 00:24:11.780 --> 00:24:16.690 So doors bracket i means that when i is 0, it's this location. 00:24:16.690 --> 00:24:18.220 When i is 1, it's this. 00:24:18.220 --> 00:24:21.790 When i is 7 or, more generally n minus-- 00:24:21.790 --> 00:24:26.270 sorry, 6 or, more generally, n minus 1, that's this location here. 00:24:26.270 --> 00:24:28.700 So same idea but a translation of it. 00:24:28.700 --> 00:24:32.920 So now let's consider what the running time of this algorithm is. 00:24:32.920 --> 00:24:36.010 If we have this menu of possible answers to this question, 00:24:36.010 --> 00:24:38.930 how efficient or inefficient is this algorithm, 00:24:38.930 --> 00:24:41.650 let's take a look in the context of this pseudo code. 00:24:41.650 --> 00:24:44.500 We don't even have to bother going all the way to C. 00:24:44.500 --> 00:24:47.720 How do we go about analyzing each of these steps? 00:24:47.720 --> 00:24:49.600 Well, let's consider this. 00:24:49.600 --> 00:24:55.540 This outermost loop here for i from 0 to n minus 1, that line of code 00:24:55.540 --> 00:24:57.790 is going to execute how many times? 00:24:57.790 --> 00:25:01.420 How many times will that loop execute? 00:25:01.420 --> 00:25:04.540 Let me give folks this moment to think on it. 00:25:04.540 --> 00:25:07.330 How many times is that going to loop here? 00:25:07.330 --> 00:25:08.584 Yeah, over there. 00:25:08.584 --> 00:25:10.360 AUDIENCE: [INAUDIBLE] 00:25:10.360 --> 00:25:11.530 DAVID J. MALAN: n times, right? 00:25:11.530 --> 00:25:13.720 Because it's from 0 to n minus 1. 00:25:13.720 --> 00:25:16.330 And if it's a little weird to think in from 0 to n minus 1, 00:25:16.330 --> 00:25:19.832 this is essentially the same mathematically as from 1 to n. 00:25:19.832 --> 00:25:21.790 And that's perhaps a little more obviously more 00:25:21.790 --> 00:25:23.690 intuitively n total steps. 00:25:23.690 --> 00:25:28.180 So I might just make a note to myself this loop is going to operate n times. 00:25:28.180 --> 00:25:29.770 What about these inner steps? 00:25:29.770 --> 00:25:33.160 Well, how many steps or seconds does it take to ask a question? 00:25:33.160 --> 00:25:35.290 If the number behind-- 00:25:35.290 --> 00:25:38.740 if the number you're looking for is behind doors bracket i, 00:25:38.740 --> 00:25:41.570 well, as [? Nomira ?] did, that's kind of like one step. 00:25:41.570 --> 00:25:42.820 So you open the door and boom. 00:25:42.820 --> 00:25:46.100 All right, maybe it's two steps, but it's a constant number of steps. 00:25:46.100 --> 00:25:48.320 So this is some constant number of steps. 00:25:48.320 --> 00:25:50.080 Let's just call it one for simplicity. 00:25:50.080 --> 00:25:53.500 How many steps or seconds does it take to return true? 00:25:53.500 --> 00:25:55.863 I don't know exactly in the computer's memory 00:25:55.863 --> 00:25:57.280 but that feels like a single step. 00:25:57.280 --> 00:25:58.940 Just return true. 00:25:58.940 --> 00:26:01.960 So if this takes one step, this takes one step 00:26:01.960 --> 00:26:04.960 but only if the condition is true, it looks 00:26:04.960 --> 00:26:09.370 like you're doing a constant number of things n times. 00:26:09.370 --> 00:26:12.530 Or maybe you're doing one additional step. 00:26:12.530 --> 00:26:15.010 So in short, the only thing that really matters here 00:26:15.010 --> 00:26:18.220 in terms of the efficiency or inefficiency of the algorithm 00:26:18.220 --> 00:26:21.495 is what are you doing again and again and again because that's obviously 00:26:21.495 --> 00:26:22.870 the thing that's going to add up. 00:26:22.870 --> 00:26:26.080 Doing one thing or two things a constant number of times? 00:26:26.080 --> 00:26:27.010 Not a big deal. 00:26:27.010 --> 00:26:32.050 But looping, that's going to add up over time because the more doors there are, 00:26:32.050 --> 00:26:35.420 the bigger n is going to be and the more steps that's going to take, 00:26:35.420 --> 00:26:38.320 which is all to say if you were to describe roughly 00:26:38.320 --> 00:26:43.120 how many steps does this algorithm take in big O notation, 00:26:43.120 --> 00:26:46.060 what might your instincts say? 00:26:46.060 --> 00:26:51.350 How many steps is this algorithm on the order of given n doors or n integers? 00:26:51.350 --> 00:26:51.850 Yeah? 00:26:51.850 --> 00:26:52.805 AUDIENCE: [INAUDIBLE] 00:26:52.805 --> 00:26:53.680 DAVID J. MALAN: Say again? 00:26:53.680 --> 00:26:54.670 AUDIENCE: O n. 00:26:54.670 --> 00:26:56.110 DAVID J. MALAN: Big O of n. 00:26:56.110 --> 00:26:58.220 And indeed, that's going to be the case here. 00:26:58.220 --> 00:26:58.450 Why? 00:26:58.450 --> 00:27:00.533 Because you're essentially, at the end of the day, 00:27:00.533 --> 00:27:03.992 doing n things as an upper bound on running time. 00:27:03.992 --> 00:27:05.950 And that's, in fact, what exactly what happened 00:27:05.950 --> 00:27:08.500 with [? Nomira. ?] She had to look at all n lockers 00:27:08.500 --> 00:27:11.450 before finally getting to the right answer. 00:27:11.450 --> 00:27:14.470 But what if she got lucky and the number we 00:27:14.470 --> 00:27:17.380 were looking for was not at the end of the array 00:27:17.380 --> 00:27:20.080 but was at the beginning of the array? 00:27:20.080 --> 00:27:21.650 How might we think about that? 00:27:21.650 --> 00:27:25.120 Well, have a nomenclature for this too, of course-- omega notation. 00:27:25.120 --> 00:27:27.730 Remember, omega notation is a lower bound. 00:27:27.730 --> 00:27:34.240 So given this menu of possible running times for lower bounds on an algorithm, 00:27:34.240 --> 00:27:38.897 what might the omega notation be for [? Nomira's ?] linear search? 00:27:38.897 --> 00:27:40.270 AUDIENCE: Omega 1. 00:27:40.270 --> 00:27:42.250 DAVID J. MALAN: Omega of 1, and why that? 00:27:42.250 --> 00:27:43.973 AUDIENCE: [INAUDIBLE] 00:27:43.973 --> 00:27:46.390 DAVID J. MALAN: Right, because if just by chance she gets lucky 00:27:46.390 --> 00:27:49.300 and the number she's looking for is right there where 00:27:49.300 --> 00:27:51.490 she begins the algorithm, that's it. 00:27:51.490 --> 00:27:52.540 It's one step. 00:27:52.540 --> 00:27:55.210 Maybe it's two steps if you have to unlock the door and open it, 00:27:55.210 --> 00:27:56.740 but it's a constant number of steps. 00:27:56.740 --> 00:27:58.810 And the way we describe constant number of steps 00:27:58.810 --> 00:28:01.030 is just with a single number like 1. 00:28:01.030 --> 00:28:04.990 So the omega notation for linear search might be omega of 1 00:28:04.990 --> 00:28:08.650 because in the best case, she might just get the number right from the get go. 00:28:08.650 --> 00:28:11.860 But in the worst case, we need to talk about the upper bound, which 00:28:11.860 --> 00:28:13.990 might indeed be big O of n. 00:28:13.990 --> 00:28:16.690 So again there's this way now of talking symbolically 00:28:16.690 --> 00:28:22.150 about best cases and worst cases or lower bounds and upper bounds. 00:28:22.150 --> 00:28:24.880 Theta notation, just as a little trivia now, 00:28:24.880 --> 00:28:27.965 is it applicable based on the definition I gave earlier? 00:28:27.965 --> 00:28:28.840 AUDIENCE: [INAUDIBLE] 00:28:28.840 --> 00:28:31.630 DAVID J. MALAN: OK, no, because you only take out the theta notation 00:28:31.630 --> 00:28:34.120 when those two bounds, upper and lower, happen 00:28:34.120 --> 00:28:36.740 to be the same for shorthand notation, if you will. 00:28:36.740 --> 00:28:41.530 So it suffices here to talk about just big O and omega notation. 00:28:41.530 --> 00:28:43.880 Well, what if we are a little smarter about this? 00:28:43.880 --> 00:28:47.350 Let me go ahead and sort of semi-secretly here 00:28:47.350 --> 00:28:48.380 rearrange these numbers. 00:28:48.380 --> 00:28:50.605 But first, how about one other volunteer? 00:28:50.605 --> 00:28:53.230 One other volunteer-- you have to be comfortable with your mask 00:28:53.230 --> 00:28:55.510 and your being on the internet. 00:28:55.510 --> 00:28:58.180 How about over here? 00:28:58.180 --> 00:28:59.880 Yes, you want to come on down? 00:28:59.880 --> 00:29:00.880 All right, come on down. 00:29:00.880 --> 00:29:03.640 And don't look at what I'm doing because I'm going to-- 00:29:08.340 --> 00:29:10.830 take your time and don't look up this way 00:29:10.830 --> 00:29:14.550 because I need a moment to rearrange all of the numbers. 00:29:14.550 --> 00:29:17.430 And actually, if you could stay right there before coming up, 00:29:17.430 --> 00:29:20.910 just an awkward few seconds while I finish hiding the numbers 00:29:20.910 --> 00:29:22.590 behind these doors for you. 00:29:22.590 --> 00:29:23.890 AUDIENCE: [INAUDIBLE] 00:29:23.890 --> 00:29:26.490 DAVID J. MALAN: I will be right with you. 00:29:26.490 --> 00:29:31.110 Actually, if-- do you want to warm up the crowd for a moment 00:29:31.110 --> 00:29:32.283 and I'll be right back? 00:29:32.283 --> 00:29:33.700 So you want to introduce yourself? 00:29:33.700 --> 00:29:34.770 AUDIENCE: Yeah, hi, guys. 00:29:34.770 --> 00:29:35.340 I'm Rave. 00:29:37.900 --> 00:29:38.400 Yeah! 00:29:43.500 --> 00:29:46.200 DAVID J. MALAN: All right, I think I am ready. 00:29:46.200 --> 00:29:47.700 Thank you for stalling there. 00:29:47.700 --> 00:29:48.970 AUDIENCE: Of course. 00:29:48.970 --> 00:29:49.710 DAVID J. MALAN: And I didn't catch your name. 00:29:49.710 --> 00:29:50.250 What was your name? 00:29:50.250 --> 00:29:51.080 AUDIENCE: I'm Rave. 00:29:51.080 --> 00:29:51.450 DAVID J. MALAN: I'm sorry? 00:29:51.450 --> 00:29:52.440 AUDIENCE: Rave, like a party. 00:29:52.440 --> 00:29:53.130 DAVID J. MALAN: Rave, OK. 00:29:53.130 --> 00:29:53.672 Nice to meet. 00:29:53.672 --> 00:29:54.930 Come on over. 00:29:54.930 --> 00:29:56.760 So Rave has kindly volunteered now. 00:29:56.760 --> 00:29:58.725 And I'm going to give you an additional advantage this time. 00:29:58.725 --> 00:29:59.400 AUDIENCE: OK. 00:29:59.400 --> 00:30:03.180 DAVID J. MALAN: Unbeknownst to you, I now took numbers behind the doors, 00:30:03.180 --> 00:30:04.530 but I sorted them for you. 00:30:04.530 --> 00:30:06.120 So they're not in the same random order like they 00:30:06.120 --> 00:30:08.162 were for [? Nomira. ?] You now have the advantage 00:30:08.162 --> 00:30:10.678 to know that the numbers are sorted from small to big. 00:30:10.678 --> 00:30:11.220 AUDIENCE: OK. 00:30:11.220 --> 00:30:15.180 DAVID J. MALAN: Given that, and given perhaps what we talked about in week zero 00:30:15.180 --> 00:30:19.140 with the phone book, where might you propose we begin the story this time? 00:30:19.140 --> 00:30:20.850 With which locker? 00:30:20.850 --> 00:30:21.808 AUDIENCE: To find zero? 00:30:21.808 --> 00:30:23.600 DAVID J. MALAN: Let's find number six this time. 00:30:23.600 --> 00:30:24.940 Let's make things interesting. 00:30:24.940 --> 00:30:26.200 AUDIENCE: OK. 00:30:26.200 --> 00:30:27.300 I'll start in the middle. 00:30:27.300 --> 00:30:28.050 DAVID J. MALAN: OK, so the middle. 00:30:28.050 --> 00:30:28.990 There's seven total. 00:30:28.990 --> 00:30:29.490 So-- 00:30:29.490 --> 00:30:29.670 AUDIENCE: OK. 00:30:29.670 --> 00:30:30.420 DAVID J. MALAN: --that would be right here. 00:30:30.420 --> 00:30:30.920 Go ahead. 00:30:30.920 --> 00:30:32.460 Open that up. 00:30:32.460 --> 00:30:34.320 And you find, sadly, the number five. 00:30:34.320 --> 00:30:36.150 So what do you know now? 00:30:36.150 --> 00:30:37.257 AUDIENCE: I know to go up. 00:30:37.257 --> 00:30:37.840 DAVID J. MALAN: OK. 00:30:37.840 --> 00:30:38.382 AUDIENCE: OK. 00:30:38.382 --> 00:30:40.465 DAVID J. MALAN: All right, and just to keep it uniform, 00:30:40.465 --> 00:30:43.080 just like I did, I opened to the right half of the phone book. 00:30:43.080 --> 00:30:43.350 AUDIENCE: Yes. 00:30:43.350 --> 00:30:44.880 DAVID J. MALAN: Let's keep it similar. 00:30:44.880 --> 00:30:45.390 Yeah. 00:30:45.390 --> 00:30:46.223 AUDIENCE: All right. 00:30:46.223 --> 00:30:48.182 DAVID J. MALAN: All right, and, uh, a little too far 00:30:48.182 --> 00:30:50.070 even though I know you wanted to go one over. 00:30:50.070 --> 00:30:51.080 AUDIENCE: All good, all good. 00:30:51.080 --> 00:30:52.800 DAVID J. MALAN: And now we're going to go which direction? 00:30:52.800 --> 00:30:54.217 AUDIENCE: Over here in the middle. 00:30:54.217 --> 00:30:56.320 DAVID J. MALAN: Right, and voila, the number six. 00:30:56.320 --> 00:30:57.840 All right, so very nicely done. 00:31:00.620 --> 00:31:02.320 A little stressful for you as well. 00:31:02.320 --> 00:31:03.210 Thank you again. 00:31:03.210 --> 00:31:05.790 So here we see by nature of the locker door 00:31:05.790 --> 00:31:10.350 still being open sort of an artifact of the greater efficiency, 00:31:10.350 --> 00:31:13.560 it would seem, of this algorithm because now that Rave 00:31:13.560 --> 00:31:16.470 was given the assumption that these numbers are sorted from small 00:31:16.470 --> 00:31:20.310 on the left to large on the right, she was able to apply that same divide 00:31:20.310 --> 00:31:23.610 and conquer algorithm from week zero which we're now going to give a name-- 00:31:23.610 --> 00:31:25.200 binary search. 00:31:25.200 --> 00:31:28.650 And simply by starting in the middle and realizing, 00:31:28.650 --> 00:31:32.670 OK, too small, then by going to the right half and realizing, oh, 00:31:32.670 --> 00:31:35.820 went a little too far, then by going to the left half, which, 00:31:35.820 --> 00:31:39.660 Rave able to find in just three steps instead of seven 00:31:39.660 --> 00:31:43.720 the number six in this case that we were actually searching for. 00:31:43.720 --> 00:31:47.890 So you can see that this would seem to be more efficient. 00:31:47.890 --> 00:31:50.700 Let's consider for just a moment is it correct. 00:31:50.700 --> 00:31:56.250 If I had used different numbers but still sorted them from left to right, 00:31:56.250 --> 00:31:59.378 would it still have worked this algorithm? 00:31:59.378 --> 00:32:00.420 You're nodding your head. 00:32:00.420 --> 00:32:01.170 Can I call on you? 00:32:01.170 --> 00:32:03.920 Like, why would it still have worked, do you think? 00:32:03.920 --> 00:32:06.700 AUDIENCE: [INAUDIBLE] 00:32:06.700 --> 00:32:08.450 DAVID J. MALAN: Yeah, so so long as the numbers 00:32:08.450 --> 00:32:10.760 are always in the same order from left to right 00:32:10.760 --> 00:32:13.970 or, heck, they could even be in reverse order, so long as it's consistent, 00:32:13.970 --> 00:32:18.410 the decisions that Rave was making-- if greater than, else, if less than-- 00:32:18.410 --> 00:32:20.820 would guide us to the solution no matter what. 00:32:20.820 --> 00:32:23.460 And it would seem to take fewer steps. 00:32:23.460 --> 00:32:26.220 So if we consider now the pseudo code for this algorithm, 00:32:26.220 --> 00:32:28.530 let's take a look how we might describe binary search. 00:32:28.530 --> 00:32:31.400 So binary search we might describe with something like this. 00:32:31.400 --> 00:32:34.640 If the number is behind the middle door, which is where Rave began, 00:32:34.640 --> 00:32:36.710 then we can just return true. 00:32:36.710 --> 00:32:40.290 Else if the number is less than the middle door, 00:32:40.290 --> 00:32:42.800 so if six is less than whatever is behind the middle door, 00:32:42.800 --> 00:32:45.140 then Rave would have searched the left half. 00:32:45.140 --> 00:32:47.690 Else if the number is greater than the middle door, 00:32:47.690 --> 00:32:49.700 Rave would have searched the right half. 00:32:49.700 --> 00:32:53.840 Else, if there are no doors-- and we'll see in a moment why I put 00:32:53.840 --> 00:32:55.710 this up top just to keep things clean. 00:32:55.710 --> 00:32:59.390 If there's no doors, what should Rave have presumably returned immediately 00:32:59.390 --> 00:33:02.870 if I gave her no lockers to work with? 00:33:02.870 --> 00:33:03.920 Just returned false. 00:33:03.920 --> 00:33:06.020 But this is an important case to consider 00:33:06.020 --> 00:33:09.980 because if in the process of searching by locker by locker, 00:33:09.980 --> 00:33:14.630 we might have whittled down the problem from seven doors to three doors 00:33:14.630 --> 00:33:17.370 to one door to zero doors-- and at that point, 00:33:17.370 --> 00:33:19.230 we might have had no doors left to search. 00:33:19.230 --> 00:33:22.040 So we have to naturally have a scenario for just considering 00:33:22.040 --> 00:33:23.120 if there were no doors. 00:33:23.120 --> 00:33:26.910 So it's not to say that maybe I don't give Rave any doors to begin with. 00:33:26.910 --> 00:33:28.910 But as she divides and divides and divides, 00:33:28.910 --> 00:33:32.720 if she runs out of lockers to ask those questions of-- or a few weeks ago, 00:33:32.720 --> 00:33:35.660 if I ran out of phone book pages to tear in half, 00:33:35.660 --> 00:33:39.210 I too might have had to return false as in this case. 00:33:39.210 --> 00:33:42.500 So how can we now describe this a little more like C 00:33:42.500 --> 00:33:45.710 just to give ourselves a variable to start thinking and talking about? 00:33:45.710 --> 00:33:48.930 Well, I might talk about doors as being an array. 00:33:48.930 --> 00:33:52.490 And so if I want to express the middle door, I could just, in pseudo code, 00:33:52.490 --> 00:33:54.470 say doors bracket middle. 00:33:54.470 --> 00:33:56.270 I'm assuming that someone has done the math 00:33:56.270 --> 00:33:59.450 to figure out what the middle door is, but that's easy enough to do. 00:33:59.450 --> 00:34:01.940 And then doors, if the number we're looking for 00:34:01.940 --> 00:34:05.470 is less than doors bracket middle, then search door 00:34:05.470 --> 00:34:09.230 zero through doors middle minus 1. 00:34:09.230 --> 00:34:13.610 So again, this is a more pedantic way of taking what's a pretty intuitive idea-- 00:34:13.610 --> 00:34:16.159 search the left half, search the right half-- 00:34:16.159 --> 00:34:22.790 but start to now describe it in terms of actual indices or indexes 00:34:22.790 --> 00:34:24.593 like we did with our array notation. 00:34:24.593 --> 00:34:26.510 The last scenario, of course, is if the number 00:34:26.510 --> 00:34:28.760 is greater than the door's bracket middle, 00:34:28.760 --> 00:34:32.060 then Rave would have wanted to search the middle door plus 1-- 00:34:32.060 --> 00:34:37.250 so 1 over-- through doors n minus 1-- 00:34:37.250 --> 00:34:38.330 through n minus 1. 00:34:38.330 --> 00:34:41.389 So again, just a way of sort of describing a little more syntactically 00:34:41.389 --> 00:34:42.989 what it is that's going on. 00:34:42.989 --> 00:34:46.820 So how might we translate this now into big O notation? 00:34:46.820 --> 00:34:53.870 Well, in the worst case, how many steps total might Rave's binary search 00:34:53.870 --> 00:34:55.070 algorithm have taken? 00:34:55.070 --> 00:34:59.030 Given seven doors or given more generically n doors, 00:34:59.030 --> 00:35:03.620 how many times could she go left or go right before finding herself with one 00:35:03.620 --> 00:35:06.110 or no doors left? 00:35:06.110 --> 00:35:08.787 What's the way to think about that? 00:35:08.787 --> 00:35:09.620 Yeah, in the middle? 00:35:09.620 --> 00:35:11.150 AUDIENCE: Log n. 00:35:11.150 --> 00:35:11.990 DAVID J. MALAN: Log n. 00:35:11.990 --> 00:35:13.280 So there's log n again. 00:35:13.280 --> 00:35:16.155 And even if you're not feeling wholly comfortable with your logarithm 00:35:16.155 --> 00:35:19.250 still, pretty much in programming and in computer science more generally, 00:35:19.250 --> 00:35:22.430 any time we talk about some algorithm that's dividing and conquering 00:35:22.430 --> 00:35:26.180 in half, in half, in half, or any other multiple, 00:35:26.180 --> 00:35:28.580 it's probably involving logarithms in some sense. 00:35:28.580 --> 00:35:31.400 And log base n essentially refers to the number 00:35:31.400 --> 00:35:37.160 of times you can divide n by 2 until you bottom out at just a single door 00:35:37.160 --> 00:35:39.410 or equivalently zero doors left. 00:35:39.410 --> 00:35:40.370 So log n. 00:35:40.370 --> 00:35:44.270 So we might say that indeed, binary search is in big O of log n 00:35:44.270 --> 00:35:48.440 because the door that Rave opened last, this one, 00:35:48.440 --> 00:35:50.360 happened to be three doors away. 00:35:50.360 --> 00:35:52.790 And actually, if you do the math here, that roughly 00:35:52.790 --> 00:35:54.600 works out to be exactly that case. 00:35:54.600 --> 00:35:58.640 If we add one, that's sort of out of seven doors or roughly eight, 00:35:58.640 --> 00:36:01.940 we were able to search it in just three total steps. 00:36:01.940 --> 00:36:03.980 What about omega notation, though? 00:36:03.980 --> 00:36:07.220 Like, in the best case, Rave might have gotten lucky. 00:36:07.220 --> 00:36:09.170 She opened the door, and there it is. 00:36:09.170 --> 00:36:14.970 So how might we describe a lower bound on the running time of binary search. 00:36:14.970 --> 00:36:15.470 Yeah. 00:36:15.470 --> 00:36:16.125 AUDIENCE: 1. 00:36:16.125 --> 00:36:17.000 DAVID J. MALAN: Say again? 00:36:17.000 --> 00:36:17.840 AUDIENCE: 1. 00:36:17.840 --> 00:36:19.220 DAVID J. MALAN: Omega of 1. 00:36:19.220 --> 00:36:23.810 So here too, we see that in some cases binary search and linear search, eh, 00:36:23.810 --> 00:36:25.700 like, they're pretty equivalent. 00:36:25.700 --> 00:36:30.830 And so this is why sometimes compelling to consider both the best 00:36:30.830 --> 00:36:33.290 case in the worst case because honestly, in general, 00:36:33.290 --> 00:36:35.600 who really cares if you just get lucky once in a while 00:36:35.600 --> 00:36:37.280 and your algorithm is super fast? 00:36:37.280 --> 00:36:40.250 What you probably care about is what's the worst case. 00:36:40.250 --> 00:36:41.390 How long are my users-- 00:36:41.390 --> 00:36:45.170 how long am I going to be sitting there watching some spinning hourglass 00:36:45.170 --> 00:36:50.807 or beach ball trying to give myself an answer to a pretty big problem? 00:36:50.807 --> 00:36:53.640 Well, odds are, you're going to generally care about big O notation. 00:36:53.640 --> 00:36:55.430 So indeed, moving forward, will generally 00:36:55.430 --> 00:36:58.730 talk about the running time of algorithms often in terms of big O, 00:36:58.730 --> 00:37:00.780 a little less so in terms of omega. 00:37:00.780 --> 00:37:03.140 But understanding the range can be important 00:37:03.140 --> 00:37:08.700 depending on the nature of the data that you're going to actually be given here. 00:37:08.700 --> 00:37:11.510 All right let me pause and see if there is any questions. 00:37:14.200 --> 00:37:16.420 Any questions here? 00:37:16.420 --> 00:37:18.850 Yes, thank you. 00:37:18.850 --> 00:37:21.430 AUDIENCE: So this method is clearly more efficient, 00:37:21.430 --> 00:37:26.440 but it requires that the information is all compiled in a certain order. 00:37:26.440 --> 00:37:29.770 How do you ensure that you can compile information 00:37:29.770 --> 00:37:31.265 in a particular order at scale? 00:37:31.265 --> 00:37:33.140 DAVID J. MALAN: Yeah, it's a really good question. 00:37:33.140 --> 00:37:36.015 And if I can generalize it, how do you guarantee that you can do this 00:37:36.015 --> 00:37:38.560 at scale, which algorithm is better? 00:37:38.560 --> 00:37:41.440 I've sort of led us down this road of implying 00:37:41.440 --> 00:37:43.540 that Rave's second algorithm, binary search, 00:37:43.540 --> 00:37:45.580 is better because it's so much faster. 00:37:45.580 --> 00:37:49.600 It's log of n in the worst case instead of big O of n. 00:37:49.600 --> 00:37:53.230 But Rave was given an advantage when she came up here in that the doors were 00:37:53.230 --> 00:37:54.222 already sorted. 00:37:54.222 --> 00:37:55.930 And so that sort of invites the question, 00:37:55.930 --> 00:37:57.670 well, given a whole bunch of random data, 00:37:57.670 --> 00:38:00.710 either a small data set or, heck, something Google sized with millions, 00:38:00.710 --> 00:38:04.540 billions of pieces of data, should you sort it first 00:38:04.540 --> 00:38:07.150 from smallest to largest and then search? 00:38:07.150 --> 00:38:11.920 Or should you just dive right in and search it linearly? 00:38:11.920 --> 00:38:13.638 Like, how might you think about that? 00:38:13.638 --> 00:38:15.430 If you are Google, for instance, and you've 00:38:15.430 --> 00:38:19.090 got millions, billions of web pages, should they just go with linear search 00:38:19.090 --> 00:38:21.850 because it's always going to work even though it might be slow? 00:38:21.850 --> 00:38:24.820 Or should they invest the time in sorting all of that data-- 00:38:24.820 --> 00:38:26.630 we'll see how in a bit-- 00:38:26.630 --> 00:38:28.900 and then search it more efficiently? 00:38:28.900 --> 00:38:31.438 Like, how do you decide between those options? 00:38:31.438 --> 00:38:33.730 AUDIENCE: If you're sorting the data, then wouldn't you 00:38:33.730 --> 00:38:36.573 have to go through all of the data? 00:38:36.573 --> 00:38:38.740 DAVID J. MALAN: Yeah, if you had to sort the data first-- 00:38:38.740 --> 00:38:40.700 and we don't yet formally know how to do this. 00:38:40.700 --> 00:38:43.117 But obviously, as humans, we could probably figure it out. 00:38:43.117 --> 00:38:45.280 You do have to look at all of the data anyway. 00:38:45.280 --> 00:38:48.760 And so you're sort of wasting your time if you're sorting it only 00:38:48.760 --> 00:38:50.680 then to go in search it. 00:38:50.680 --> 00:38:52.898 But maybe it depends a bit more. 00:38:52.898 --> 00:38:54.940 Like, that's absolutely right, and if you're just 00:38:54.940 --> 00:38:58.060 searching for one thing in life, then that's probably a waste of time 00:38:58.060 --> 00:39:01.720 to sort it and then search it because you're just adding to the process. 00:39:01.720 --> 00:39:03.880 But what's another scenario in which you might not 00:39:03.880 --> 00:39:09.410 worry about that whereby it might make sense to sort it and then search? 00:39:09.410 --> 00:39:09.910 Yeah. 00:39:09.910 --> 00:39:16.580 AUDIENCE: [INAUDIBLE] you can go and use the other values as a way 00:39:16.580 --> 00:39:17.810 to find out what's happening. 00:39:17.810 --> 00:39:18.852 DAVID J. MALAN: Yeah, exactly. 00:39:18.852 --> 00:39:21.392 So if your problem is a Google-like problem where 00:39:21.392 --> 00:39:24.350 you have more than just one user who's searching for more than just one 00:39:24.350 --> 00:39:26.780 website page, probably you should incur the cost up front 00:39:26.780 --> 00:39:30.620 and sort the whole thing because every subsequent request thereafter 00:39:30.620 --> 00:39:32.810 is going to be faster, faster, faster because it's 00:39:32.810 --> 00:39:36.440 going to [INAUDIBLE] algorithm of binary search, binary search, binary search 00:39:36.440 --> 00:39:39.320 that's going to add up to be way fewer steps 00:39:39.320 --> 00:39:41.610 than doing linear search multiple times. 00:39:41.610 --> 00:39:43.490 So again, kind of depends on the use case 00:39:43.490 --> 00:39:45.350 and kind of depends on how important it is. 00:39:45.350 --> 00:39:48.050 And this happens even in real world contexts. 00:39:48.050 --> 00:39:50.900 I think back always to graduate school, when I was writing some code 00:39:50.900 --> 00:39:52.610 to analyze some large data set. 00:39:52.610 --> 00:39:55.400 And honestly, it was actually easier at the time for me 00:39:55.400 --> 00:39:58.310 to write pretty inefficient but hopefully correct 00:39:58.310 --> 00:39:59.537 code because you know what? 00:39:59.537 --> 00:40:02.870 I could just go to sleep for eight hours and let it analyze this really big data 00:40:02.870 --> 00:40:03.370 set. 00:40:03.370 --> 00:40:06.335 I didn't have to bother writing more complex code to sort it just 00:40:06.335 --> 00:40:07.460 to run it more efficiently. 00:40:07.460 --> 00:40:07.960 Why? 00:40:07.960 --> 00:40:11.520 Because I was the only user, and I only needed to run these queries once. 00:40:11.520 --> 00:40:13.700 And so this was kind of a reasonable approach, 00:40:13.700 --> 00:40:17.550 reasonable until I woke up eight hours later and my code was incorrect. 00:40:17.550 --> 00:40:20.960 And now I had to spend another eight hours rerunning it after fixing it. 00:40:20.960 --> 00:40:22.910 But even there, you see an example where, 00:40:22.910 --> 00:40:24.890 what is your most precious resource? 00:40:24.890 --> 00:40:26.720 Is it time to run the code? 00:40:26.720 --> 00:40:28.970 Is it time to write the code? 00:40:28.970 --> 00:40:31.098 Is it the amount of memory the computer is using? 00:40:31.098 --> 00:40:33.890 These are all resources we'll start to talk about because it really 00:40:33.890 --> 00:40:36.080 depends on what your goals are. 00:40:36.080 --> 00:40:39.050 Any questions, then, on upper bounds, lower bounds, 00:40:39.050 --> 00:40:42.260 or each of these two searches, linear or binary? 00:40:42.260 --> 00:40:42.991 Yeah. 00:40:42.991 --> 00:40:45.580 AUDIENCE: So just, when you're calculating running time, 00:40:45.580 --> 00:40:50.317 does the sorting step count for that time? 00:40:50.317 --> 00:40:53.150 DAVID J. MALAN: When analyzing running time, does the sorting step count? 00:40:53.150 --> 00:40:55.310 If you want it to if you actually do it. 00:40:55.310 --> 00:40:56.900 At the moment, it did not apply. 00:40:56.900 --> 00:41:01.100 I just gave Rave the luxury of knowing that the data was sorted. 00:41:01.100 --> 00:41:04.520 But if I really wanted to charge her for the amount of time 00:41:04.520 --> 00:41:07.730 it took to find that number six, I should have added the time 00:41:07.730 --> 00:41:09.712 to sort plus the time to search. 00:41:09.712 --> 00:41:11.420 And in fact, that's a road we'll go down. 00:41:11.420 --> 00:41:13.170 Why don't we go ahead and pace ourselves as before? 00:41:13.170 --> 00:41:14.587 Let's take a 10 minute break here. 00:41:14.587 --> 00:41:17.130 And when we come back, we'll write some actual code. 00:41:17.130 --> 00:41:21.038 So we've seen a couple of searches-- linear search and binary search, which, 00:41:21.038 --> 00:41:22.580 to be fair, we saw back in week zero. 00:41:22.580 --> 00:41:25.790 But let's actually translate at least one of those now to some code 00:41:25.790 --> 00:41:28.970 using this building block from last week where we can actually 00:41:28.970 --> 00:41:32.820 define an array if we want, like an array of integers called numbers. 00:41:32.820 --> 00:41:34.550 So let me switch over to BS Code here. 00:41:34.550 --> 00:41:37.910 Let me go ahead and start a program called numbers.c. 00:41:37.910 --> 00:41:40.940 And in numbers.c, let me go ahead here. 00:41:40.940 --> 00:41:44.840 And how about let's include our familiar header files? 00:41:44.840 --> 00:41:46.670 So css50.h. 00:41:46.670 --> 00:41:51.330 I'll include standardio.h that we can get input and print input if we want. 00:41:51.330 --> 00:41:54.410 And now I'm going to go ahead and give myself int main void. 00:41:54.410 --> 00:41:56.100 No command line arguments today. 00:41:56.100 --> 00:41:57.232 So I'll leave that as void. 00:41:57.232 --> 00:41:58.940 And I'm going to go ahead and give myself 00:41:58.940 --> 00:42:01.410 an array of how about seven numbers? 00:42:01.410 --> 00:42:04.220 So I'll call it int number 7. 00:42:04.220 --> 00:42:06.260 And then I can fill this array with numbers. 00:42:06.260 --> 00:42:10.100 Like, numbers brackets 0 can be the number 4, and numbers bracket 1 00:42:10.100 --> 00:42:14.308 could be the number 6, and numbers bracket 2 can be the number 8. 00:42:14.308 --> 00:42:16.850 And this is the same list that we saw with [? Nomira ?] a bit 00:42:16.850 --> 00:42:19.170 ago where it was 4, then 6, then 8. 00:42:19.170 --> 00:42:19.920 But you know what? 00:42:19.920 --> 00:42:22.340 There's actually another syntax I can show you here. 00:42:22.340 --> 00:42:25.220 If you know in advance in a C program that you 00:42:25.220 --> 00:42:30.390 want an array of certain values and you know therefore how many of those values 00:42:30.390 --> 00:42:33.410 you want, you can actually do this little trick using curly braces. 00:42:33.410 --> 00:42:36.560 You can say, don't worry about how big this is. 00:42:36.560 --> 00:42:39.620 It's going to be implicit by way of these curly braces. 00:42:39.620 --> 00:42:44.610 Here, I can do 4, 6, 8, 2, 7, 5, 0, close curly brace. 00:42:44.610 --> 00:42:46.730 So it's a somewhat new use of curly braces. 00:42:46.730 --> 00:42:50.780 But this has the effect of giving me an array called numbers inside 00:42:50.780 --> 00:42:52.440 of which are a whole bunch of integers. 00:42:52.440 --> 00:42:53.030 How many? 00:42:53.030 --> 00:42:56.960 The compiler can infer it from what's ever inside these curly braces. 00:42:56.960 --> 00:43:00.140 And it seems to be of size 1, 2, 3, 4, 5, 6, 7. 00:43:00.140 --> 00:43:05.510 And all seven elements will be initialized with 4, 6, 8, 2, 7, 5, 0 00:43:05.510 --> 00:43:06.330 respectively. 00:43:06.330 --> 00:43:08.780 So just a minor optimization code wise to tighten up 00:43:08.780 --> 00:43:12.090 what would have otherwise been like eight separate lines of code. 00:43:12.090 --> 00:43:15.080 Now let's go ahead and implement linear search, as we called it. 00:43:15.080 --> 00:43:18.122 And you can do this in a bunch of ways, but I'm going to do it like this. 00:43:18.122 --> 00:43:24.830 For int i get 0, i is less than 7 i plus plus. 00:43:24.830 --> 00:43:27.800 Then inside of my loop, I'm going to ask the question, well, 00:43:27.800 --> 00:43:33.020 if the numbers at location i equals equals, as we asked of 00:43:33.020 --> 00:43:36.680 [? Nomira, ?] the number 0, then I'm going to go ahead and do something 00:43:36.680 --> 00:43:41.450 like printf found backslash n. 00:43:41.450 --> 00:43:43.280 And then I'm going to return 0. 00:43:43.280 --> 00:43:46.040 Just because of last week's discussion of returning 00:43:46.040 --> 00:43:50.150 a value for main when all is well, I'm going to return 0 by convention 00:43:50.150 --> 00:43:53.060 just to signal that indeed, I found what I'm looking for. 00:43:53.060 --> 00:44:00.560 Otherwise, on what line do I want to go and add a printf, like, not found 00:44:00.560 --> 00:44:02.600 and return something other than 0? 00:44:02.600 --> 00:44:06.860 Right, I don't think I want an else here per our pseudo code earlier. 00:44:06.860 --> 00:44:11.030 So on what line would you prefer I sort of insert a default scenario 00:44:11.030 --> 00:44:14.490 of not found and I'll return an error? 00:44:14.490 --> 00:44:15.795 Yeah, over here? 00:44:15.795 --> 00:44:19.343 [INTERPOSING VOICES] 00:44:19.343 --> 00:44:20.010 DAVID J. MALAN: Nice. 00:44:20.010 --> 00:44:21.718 So at the end of the for loop because you 00:44:21.718 --> 00:44:23.550 want to give the program or our volunteer 00:44:23.550 --> 00:44:26.980 earlier a chance to go through all of the doors, all of the numbers. 00:44:26.980 --> 00:44:29.700 But if you go through the whole thing, through the whole loop, 00:44:29.700 --> 00:44:33.630 at the very end, you probably just want to conclude not found backslash n 00:44:33.630 --> 00:44:36.060 and then return something like positive 1 00:44:36.060 --> 00:44:38.040 just to signify that an error happened. 00:44:38.040 --> 00:44:40.170 And again, this was a minor detail last week. 00:44:40.170 --> 00:44:44.370 Any time main is successful, the programming convention is to return 0. 00:44:44.370 --> 00:44:45.850 That means all as well. 00:44:45.850 --> 00:44:49.020 And if something goes wrong, like you didn't find what you're looking for, 00:44:49.020 --> 00:44:52.950 you might return something other than 0, like positive 1, maybe positive 2, 00:44:52.950 --> 00:44:55.202 or even negative numbers if you want. 00:44:55.202 --> 00:44:57.160 All right, well, let me go ahead and save this. 00:44:57.160 --> 00:44:58.800 Let me do make numbers. 00:44:58.800 --> 00:45:01.230 Hopefully no syntax errors. 00:45:01.230 --> 00:45:02.520 All good so far. 00:45:02.520 --> 00:45:05.040 dot slash numbers, enter. 00:45:05.040 --> 00:45:07.500 All right, and it's found, as I would hope it would be. 00:45:07.500 --> 00:45:10.710 And just as a little check, let's search for something that's definitely 00:45:10.710 --> 00:45:14.790 not there, like the number negative 1. 00:45:14.790 --> 00:45:17.640 Let me go ahead and recompile the code with make numbers. 00:45:17.640 --> 00:45:19.890 Let me rerun the code with dot slash numbers 00:45:19.890 --> 00:45:21.900 and hopefully-- whew, OK, not found. 00:45:21.900 --> 00:45:24.473 So proof by example seems to be working correctly. 00:45:24.473 --> 00:45:26.640 But let's make things a little more interesting now. 00:45:26.640 --> 00:45:29.700 Right now, I'm using just an array of integers. 00:45:29.700 --> 00:45:34.090 Let me go ahead and introduce maybe an array of strings instead. 00:45:34.090 --> 00:45:37.360 And maybe this time, I'll store a bunch of names and not just integers 00:45:37.360 --> 00:45:39.100 but actual strings of names. 00:45:39.100 --> 00:45:40.330 So how might I do this? 00:45:40.330 --> 00:45:42.130 Well, let me go back to my code here. 00:45:42.130 --> 00:45:46.470 I'm going to switch us over to maybe a file called names.c. 00:45:46.470 --> 00:45:50.130 And in here, I'll go ahead and include cs50.h. 00:45:50.130 --> 00:45:53.550 I'll include standardio.h. 00:45:53.550 --> 00:45:57.030 And I'm going to go ahead and for now include a new friend 00:45:57.030 --> 00:46:00.478 from last week, string.h, which gives me some string-related functionality. 00:46:00.478 --> 00:46:03.270 Int main void because I'm not going to bother with any command line 00:46:03.270 --> 00:46:04.480 arguments for now. 00:46:04.480 --> 00:46:09.330 And now if I want an array of strings, I could do something like this-- 00:46:09.330 --> 00:46:12.120 string names bracket 7. 00:46:12.120 --> 00:46:14.100 And then I could start doing like before. 00:46:14.100 --> 00:46:17.580 Names bracket 0 could be someone like Bill, and names bracket 1 00:46:17.580 --> 00:46:20.740 could be someone like Charlie and so forth. 00:46:20.740 --> 00:46:24.352 But there's this new improvement I can make. 00:46:24.352 --> 00:46:27.060 Let me just let the compiler figure out how many names there are. 00:46:27.060 --> 00:46:32.550 And using curly braces, I'll do Bill and then Charlie and then Fred and then 00:46:32.550 --> 00:46:39.690 George and then Ginny and then Percy and then Ron if there's the pattern there. 00:46:39.690 --> 00:46:42.930 All right, so now I have these seven names as strings. 00:46:42.930 --> 00:46:44.230 Let's do something similar. 00:46:44.230 --> 00:46:47.730 So for int, i get 0. 00:46:47.730 --> 00:46:51.330 i is less than 7 as before, i plus plus as before. 00:46:51.330 --> 00:46:54.900 And inside of the, loop lets this time check for the string in question, 00:46:54.900 --> 00:46:57.630 and suppose we're searching for Ron arbitrarily. 00:46:57.630 --> 00:47:00.090 He is there, so we should eventually find him. 00:47:00.090 --> 00:47:07.530 Let me go ahead and say if names bracket i equals quote unquote Ron, then inside 00:47:07.530 --> 00:47:11.340 of my if condition, I'm going to say printf found just like before. 00:47:11.340 --> 00:47:13.590 And I'm going to return 0 just because all is well. 00:47:13.590 --> 00:47:16.390 And I'm going to take your advice from the get go this time 00:47:16.390 --> 00:47:20.560 and, at the end of the loop, print out not found because if I get this far, 00:47:20.560 --> 00:47:23.670 I have not printed found, and I have not returned already. 00:47:23.670 --> 00:47:27.840 So I'm just going to go ahead and return 1 after printing not found. 00:47:27.840 --> 00:47:30.420 All right, let me go ahead and cross my fingers as always. 00:47:30.420 --> 00:47:33.310 Make names this time. 00:47:33.310 --> 00:47:36.300 And it doesn't seem to like my code here. 00:47:36.300 --> 00:47:38.370 This is perhaps a new error that you might not 00:47:38.370 --> 00:47:41.080 have seen yet in names.c line 11. 00:47:41.080 --> 00:47:43.920 So that's this line here, my if condition. 00:47:43.920 --> 00:47:47.970 Result of comparison against a string literal is unspecified. 00:47:47.970 --> 00:47:50.392 Use an explicit string comparison function instead. 00:47:50.392 --> 00:47:53.100 I mean, that's kind of a mouthful, and the first time you see it, 00:47:53.100 --> 00:47:55.600 you're probably not going to know how to make sense of that. 00:47:55.600 --> 00:47:59.130 But it does kind of draw our attention to something being awry 00:47:59.130 --> 00:48:03.840 with the equality checking here, with equal equals and Ron. 00:48:03.840 --> 00:48:06.240 And here's where again we've been telling 00:48:06.240 --> 00:48:08.700 sort of a white lie for the past couple of weeks. 00:48:08.700 --> 00:48:12.900 Strings are a thing in C. Strings are a thing in programming. 00:48:12.900 --> 00:48:14.670 But recall from last week, I did disclaim 00:48:14.670 --> 00:48:16.650 there's no such thing as a string data type 00:48:16.650 --> 00:48:20.670 technically because it's not a primitive in the way an int 00:48:20.670 --> 00:48:24.210 and a float and a bool are that are sort of built into the language. 00:48:24.210 --> 00:48:28.170 You can't just use equation equals to compare two strings. 00:48:28.170 --> 00:48:31.110 You actually have to use a special function that's 00:48:31.110 --> 00:48:34.140 in this header file we talked briefly about last week. 00:48:34.140 --> 00:48:36.780 In that header file was string length or strlen. 00:48:36.780 --> 00:48:39.490 But there's other functions instead as well. 00:48:39.490 --> 00:48:43.080 Let me, in fact, go ahead and open up the manual pages. 00:48:43.080 --> 00:48:46.200 And if we go to string.h-- 00:48:46.200 --> 00:48:47.760 let me scroll down a bit. 00:48:47.760 --> 00:48:52.800 In string.h you can perhaps infer what function will probably take 00:48:52.800 --> 00:48:56.612 the place of equals equals for today. 00:48:56.612 --> 00:48:57.570 What do we want to use? 00:48:57.570 --> 00:48:57.930 Yeah. 00:48:57.930 --> 00:48:58.680 AUDIENCE: Strcmp? 00:48:58.680 --> 00:49:03.305 DAVID J. MALAN: So strcmp, S-T-R-C-M-P, which apparently compares two strings. 00:49:03.305 --> 00:49:05.430 And if I click on that, we'll see more information. 00:49:05.430 --> 00:49:09.510 And indeed, if I click on strcmp, we'll see under the synopsis 00:49:09.510 --> 00:49:14.490 that, OK, I need to use the CS50 header file and string.h, as I already have. 00:49:14.490 --> 00:49:17.850 Here is its prototype, which is telling me 00:49:17.850 --> 00:49:21.360 that strcmp takes two strings, S1 and S2, that 00:49:21.360 --> 00:49:22.890 are presumably going to be compared. 00:49:22.890 --> 00:49:25.230 And it returns an integer, which is interesting. 00:49:25.230 --> 00:49:26.430 So let's read on. 00:49:26.430 --> 00:49:29.730 The description of this function is that it compares two strings case 00:49:29.730 --> 00:49:30.870 sensitively. 00:49:30.870 --> 00:49:34.110 So uppercase or lowercase matters, just FYI. 00:49:34.110 --> 00:49:36.570 And then let's look it the return value here. 00:49:36.570 --> 00:49:40.860 The return value of this function returns an int less than 0 00:49:40.860 --> 00:49:48.480 if S1 comes before S2, 0 if S1 is the same as S2, or an int greater than 0 00:49:48.480 --> 00:49:50.940 if S1 comes after S2. 00:49:50.940 --> 00:49:54.780 So the reason that this function returns an integer and not just a 00:49:54.780 --> 00:49:57.390 bool, true or false, is that it actually will 00:49:57.390 --> 00:50:00.750 allow us to sort these things eventually because if you can tell me 00:50:00.750 --> 00:50:04.980 if two strings come in this order or in this order or they're the same, 00:50:04.980 --> 00:50:07.080 you need three possible return values. 00:50:07.080 --> 00:50:08.910 And a bool, of course, only gives you two, 00:50:08.910 --> 00:50:12.480 but an int gives you like 4 billion even though we just need the 3. 00:50:12.480 --> 00:50:17.520 So 0 or a positive number or a negative number is what this function returns. 00:50:17.520 --> 00:50:21.960 And the documentation goes on to explain what we mean by ASCIIbetical order. 00:50:21.960 --> 00:50:25.260 Recall that capital A is 65, capital B is 66, 00:50:25.260 --> 00:50:27.660 and it's those underlying ASCII or Unicode 00:50:27.660 --> 00:50:31.050 numbers that a computer uses to figure out whether something comes before it 00:50:31.050 --> 00:50:33.180 or after it like in the dictionary. 00:50:33.180 --> 00:50:35.950 But for our purposes now, we only care about equality. 00:50:35.950 --> 00:50:37.680 So I'm going to go ahead and do this. 00:50:37.680 --> 00:50:41.820 If I want to compare names bracket i against Ron, 00:50:41.820 --> 00:50:49.320 I use stir compare or strcmp, names bracket i comma, quote unquote, Ron. 00:50:49.320 --> 00:50:51.510 So it's a little more involved than actually 00:50:51.510 --> 00:50:55.830 using equals equals, which does work for integers, longs, 00:50:55.830 --> 00:50:57.000 and certain other values. 00:50:57.000 --> 00:51:00.850 But for strings, it turns out we need to use a more powerful function. 00:51:00.850 --> 00:51:01.350 Why? 00:51:01.350 --> 00:51:03.630 Well, last week, recall what a string really is. 00:51:03.630 --> 00:51:06.220 It's an array of characters. 00:51:06.220 --> 00:51:09.690 And so whereas you can use equals equals for single characters, 00:51:09.690 --> 00:51:12.600 strcmp, as we'll eventually see, is going 00:51:12.600 --> 00:51:14.438 to compare multiple characters for us. 00:51:14.438 --> 00:51:15.480 There's more logic there. 00:51:15.480 --> 00:51:19.570 There's a loop needed, and that's why it comes with the string library. 00:51:19.570 --> 00:51:22.290 But it doesn't just work out of the box with equals equals alone. 00:51:22.290 --> 00:51:26.267 That would literally be comparing two things, not two arrays of things. 00:51:26.267 --> 00:51:28.350 And we'll come back to this next week as to what's 00:51:28.350 --> 00:51:29.920 really going on under the hood. 00:51:29.920 --> 00:51:34.140 So let me go ahead and fix one bug that I just realized I made. 00:51:34.140 --> 00:51:39.450 I want to check if the return value of str compare is equal to 0 00:51:39.450 --> 00:51:42.660 because per the documentation, that meant they're the same. 00:51:42.660 --> 00:51:45.640 All right, let me go ahead and make names this time. 00:51:45.640 --> 00:51:46.710 Now it compiles. 00:51:46.710 --> 00:51:49.770 Dot slash names, Enter, found. 00:51:49.770 --> 00:51:55.290 And just as a sanity check, let's check someone outside the family. 00:51:55.290 --> 00:51:59.010 Searching now for Hermione after recompiling the code, 00:51:59.010 --> 00:52:00.570 after rerunning the code. 00:52:00.570 --> 00:52:02.280 And she's not, in fact, found. 00:52:02.280 --> 00:52:05.220 So here's just a similar implementation of linear search 00:52:05.220 --> 00:52:09.570 not for integers this time but instead for strings, 00:52:09.570 --> 00:52:13.140 the subtlety really being we need a helper function, str compare, 00:52:13.140 --> 00:52:17.640 to actually do the legwork for us of comparing two arrays of characters. 00:52:17.640 --> 00:52:21.208 All right, questions on either of these implementations-- yeah, in the middle? 00:52:21.208 --> 00:52:22.892 AUDIENCE: So, if I do [INAUDIBLE] 00:52:22.892 --> 00:52:24.100 DAVID J. MALAN: Ah, good question. 00:52:24.100 --> 00:52:28.260 If I had not fixed what I claimed was a mistake earlier and I did this-- 00:52:28.260 --> 00:52:30.840 and we saw an example of this last week, actually. 00:52:30.840 --> 00:52:37.470 If a function returns an integer, be it negative or positive or 0, 00:52:37.470 --> 00:52:40.740 when you get back 0, the expression, the Boolean expression, 00:52:40.740 --> 00:52:42.040 will be considered false. 00:52:42.040 --> 00:52:44.940 So 0 equals false always. 00:52:44.940 --> 00:52:49.470 If a function returns any positive number, or any negative number, 00:52:49.470 --> 00:52:52.770 that's going to be interpreted as true even 00:52:52.770 --> 00:52:57.510 if it's positive or negative, whether it's 1, negative 1, 2, negative 2. 00:52:57.510 --> 00:53:01.570 And so if I did this, this would be saying the opposite. 00:53:01.570 --> 00:53:06.930 So if I were to say this, if str compare of names bracket i and Hermione, that's 00:53:06.930 --> 00:53:13.393 implicitly like saying this does not equal 0, or it means sort of is true, 00:53:13.393 --> 00:53:15.810 but you don't want to check for true because, again, we're 00:53:15.810 --> 00:53:17.350 comparing integers here. 00:53:17.350 --> 00:53:21.000 So the reason I did 0 here in this case is 00:53:21.000 --> 00:53:24.610 that it explicitly checks for the return value that means they're the same. 00:53:24.610 --> 00:53:25.110 And yeah. 00:53:25.110 --> 00:53:26.055 Follow up? 00:53:26.055 --> 00:53:30.948 AUDIENCE: [INAUDIBLE] 00:53:30.948 --> 00:53:32.990 DAVID J. MALAN: Yes, you might not have seen this yet, 00:53:32.990 --> 00:53:36.580 but you can express the equivalent because if you 00:53:36.580 --> 00:53:40.300 want to check if this is false, you can actually 00:53:40.300 --> 00:53:43.300 use an exclamation point, known as a bang in programming, 00:53:43.300 --> 00:53:44.960 that inverts the meaning. 00:53:44.960 --> 00:53:48.163 So false becomes true, true becomes false. 00:53:48.163 --> 00:53:50.080 So this would be another way of expressing it. 00:53:50.080 --> 00:53:54.940 This is arguably a worse design, though, because the documentation explicitly 00:53:54.940 --> 00:53:58.660 says you should be checking for 0 or a positive value 00:53:58.660 --> 00:54:01.750 or a negative value, and this little trick, while correct, 00:54:01.750 --> 00:54:05.500 and I think you can make a reasonable case for it, sort of hides that detail. 00:54:05.500 --> 00:54:07.420 And I would argue instead for the first way, 00:54:07.420 --> 00:54:09.607 checking for equals equals 0 instead. 00:54:09.607 --> 00:54:11.440 And if that's a little subtle, not to worry. 00:54:11.440 --> 00:54:16.240 We'll come back to little syntactic tricks like that before long. 00:54:16.240 --> 00:54:20.770 Other questions on linear search in these two forms. 00:54:20.770 --> 00:54:22.450 Is there another hand or hands? 00:54:22.450 --> 00:54:24.070 Two hands? 00:54:24.070 --> 00:54:24.610 No? 00:54:24.610 --> 00:54:25.900 OK, just holler if I missed. 00:54:25.900 --> 00:54:28.012 So let's now actually take this one step further. 00:54:28.012 --> 00:54:30.970 Suppose that we want to write a program that maybe implements something 00:54:30.970 --> 00:54:35.110 a little more like a phone book that has both names and numbers and not 00:54:35.110 --> 00:54:37.030 just integers but actual phone numbers. 00:54:37.030 --> 00:54:39.340 Well, we could escalate things like this. 00:54:39.340 --> 00:54:42.850 We could now have two arrays-- one called names, one called numbers. 00:54:42.850 --> 00:54:45.370 And I'm going to use strings for the numbers now, 00:54:45.370 --> 00:54:48.010 the phone numbers, because in most communities, 00:54:48.010 --> 00:54:51.730 phone numbers might have dashes, pluses, parentheses, so something 00:54:51.730 --> 00:54:55.030 that really looks more like a string even though we call it a phone number. 00:54:55.030 --> 00:54:58.580 Probably don't want to use an int lest we throw away those kinds of details. 00:54:58.580 --> 00:55:03.160 So let me switch back to BS Code here, and let's do one more program, this one 00:55:03.160 --> 00:55:05.033 in a file called phonebook.c. 00:55:05.033 --> 00:55:06.700 And now let me go ahead and do the same. 00:55:06.700 --> 00:55:08.380 Let me include cs50.h. 00:55:08.380 --> 00:55:14.320 Let me include standardio.h, and let me include string.h. 00:55:14.320 --> 00:55:17.380 I'm going to again do int main void. 00:55:17.380 --> 00:55:20.710 And then inside of my program, I'm going to give myself two arrays-- 00:55:20.710 --> 00:55:22.510 the efficient way this time. 00:55:22.510 --> 00:55:25.000 String names will be just two of us this time. 00:55:25.000 --> 00:55:28.390 How about Carter and me? 00:55:28.390 --> 00:55:30.880 And then I'll give myself-- oops, typo already. 00:55:30.880 --> 00:55:33.790 If I want this to be an array, I don't have to specify the number. 00:55:33.790 --> 00:55:35.380 The compiler can count for me. 00:55:35.380 --> 00:55:37.300 But I do need the square brackets. 00:55:37.300 --> 00:55:43.750 Then for numbers, I'm again going to use a string array specifying with 00:55:43.750 --> 00:55:49.510 the curly braces that how about Carter can be at 1-617-495-1000. 00:55:49.510 --> 00:55:51.280 And how about my own number here-- 00:55:51.280 --> 00:55:55.000 1-949-468-- oh pattern appearing-- 00:55:55.000 --> 00:55:57.760 2750 will be mine. 00:55:57.760 --> 00:55:58.600 Why mine? 00:55:58.600 --> 00:56:00.530 Well, I'm just kind of lined things up. 00:56:00.530 --> 00:56:03.460 So Carter's number is apparently first in this array, 00:56:03.460 --> 00:56:06.800 and I'm claiming that he'll be first in this array, respectively. 00:56:06.800 --> 00:56:09.610 I, David, will be the first-- the second in the names array 00:56:09.610 --> 00:56:12.267 and second in the numbers array. 00:56:12.267 --> 00:56:15.100 If you want to have a little fun with programming, feel free to text 00:56:15.100 --> 00:56:17.270 or call me some time at that number. 00:56:17.270 --> 00:56:20.950 So now let's actually use this data in some way. 00:56:20.950 --> 00:56:24.368 Let's go ahead and actually search for my own name and number here. 00:56:24.368 --> 00:56:24.910 So let me do. 00:56:24.910 --> 00:56:27.490 For int i, get 0. 00:56:27.490 --> 00:56:32.090 There's two of us this time-- so i less than 2 and then i plus plus as before. 00:56:32.090 --> 00:56:34.480 And now I'm going to practice what I preached earlier, 00:56:34.480 --> 00:56:38.440 and I'm going to use str compare to find my name in this case. 00:56:38.440 --> 00:56:45.100 And I'm going to say if strcmp of names bracket i equals quote unquote David 00:56:45.100 --> 00:56:48.740 and that equals 0, meaning they're the same, 00:56:48.740 --> 00:56:51.610 then just as before, I'm going to go ahead and print something out. 00:56:51.610 --> 00:56:53.320 But this time, I'm going to make the program more useful 00:56:53.320 --> 00:56:55.100 and not just say found or not found. 00:56:55.100 --> 00:56:59.050 Now I'm implementing a phone book, like the contacts app on iOS or Android. 00:56:59.050 --> 00:57:02.380 So I'm going to say something like, quote unquote, found percent 00:57:02.380 --> 00:57:08.830 s backslash n and then actually plug in numbers bracket i 00:57:08.830 --> 00:57:12.370 to correspond to the current name bracket i. 00:57:12.370 --> 00:57:14.225 And then I'll return 0 as before. 00:57:14.225 --> 00:57:16.600 And then down here if we get all the way through the loop 00:57:16.600 --> 00:57:20.120 and David's not there for some reason, I'm going to print as before not found 00:57:20.120 --> 00:57:21.580 and then return 1. 00:57:21.580 --> 00:57:26.590 So let me go ahead and compile this with make phone dot slash phonebook, 00:57:26.590 --> 00:57:29.240 and it seems to have found the number. 00:57:29.240 --> 00:57:33.130 So this code I'm going to claim is correct. 00:57:33.130 --> 00:57:36.190 It's kind of stupid because I've just made a phone book or a contacts 00:57:36.190 --> 00:57:37.690 app that only supports two people. 00:57:37.690 --> 00:57:39.503 They're only going to be me and Carter. 00:57:39.503 --> 00:57:41.920 This would be like downloading the contacts app on a phone 00:57:41.920 --> 00:57:43.837 and you can only call two people in the world. 00:57:43.837 --> 00:57:45.820 There's no ability to add names or edit things. 00:57:45.820 --> 00:57:48.860 That, of course, could come later using get string or something else. 00:57:48.860 --> 00:57:50.693 But for now for the sake of discussion, I've 00:57:50.693 --> 00:57:53.450 just hardcoded two names and two numbers. 00:57:53.450 --> 00:57:56.290 But for what it does, I claim this is correct. 00:57:56.290 --> 00:57:59.590 It's going to find me and print out my number. 00:57:59.590 --> 00:58:01.480 But is it well-designed? 00:58:01.480 --> 00:58:05.560 Let's start to now consider if we're not just using arrays, 00:58:05.560 --> 00:58:07.852 but are we using them, well? 00:58:07.852 --> 00:58:10.810 We started to use them last week, but are we using them well this week? 00:58:10.810 --> 00:58:14.680 And what might I even mean by using an array well or designing 00:58:14.680 --> 00:58:16.640 this program well? 00:58:16.640 --> 00:58:21.940 Any critiques or concerns with why this might not 00:58:21.940 --> 00:58:24.100 be the best road for us to be going down when 00:58:24.100 --> 00:58:28.540 I want to implement something like a phone book with pieces of information? 00:58:28.540 --> 00:58:31.540 It seems all too vulnerable to just mistakes. 00:58:31.540 --> 00:58:35.620 For instance, if I screw up the actual number of names in the names array 00:58:35.620 --> 00:58:40.190 such that it's now more or less than is in the numbers array or vise versa, 00:58:40.190 --> 00:58:43.420 it feels like there's not a tight relationship between those pieces 00:58:43.420 --> 00:58:47.140 of data, and it's just sort of is trusting on the honor system 00:58:47.140 --> 00:58:53.440 that any time I use names bracket i that it lines up with numbers bracket i. 00:58:53.440 --> 00:58:54.160 And that's fine. 00:58:54.160 --> 00:58:56.080 If you're the one writing the code, you're probably 00:58:56.080 --> 00:58:57.640 not going to really screw this up. 00:58:57.640 --> 00:59:00.265 But if you start collaborating with someone else or the program 00:59:00.265 --> 00:59:03.700 is getting much, much longer, the odds that you or your colleagues 00:59:03.700 --> 00:59:08.110 remember that you're sort of just trusting that names and numbers line up 00:59:08.110 --> 00:59:10.420 like this is going to fail eventually. 00:59:10.420 --> 00:59:13.540 Someone's not going to realize that, and just, the code is going to break. 00:59:13.540 --> 00:59:16.540 And you're going to start out putting the wrong numbers for names, which 00:59:16.540 --> 00:59:20.710 is to say it'd be much nicer if we could somehow couple these two 00:59:20.710 --> 00:59:24.850 pieces of data, names and numbers, a little more tightly together so 00:59:24.850 --> 00:59:28.900 that you're not just trusting that these two independent variables, names 00:59:28.900 --> 00:59:32.630 and numbers, have this kind of relationship with themselves. 00:59:32.630 --> 00:59:35.200 So let's consider how we might solve this. 00:59:35.200 --> 00:59:39.400 A new feature today that we'll introduce is generally known as a data structure. 00:59:39.400 --> 00:59:43.280 In C, we have the ability to invent our own data types, 00:59:43.280 --> 00:59:46.555 if you will-- data types that the authors of C decades 00:59:46.555 --> 00:59:48.430 ago just didn't envision or just didn't think 00:59:48.430 --> 00:59:51.880 were necessary because we can implement them ourselves-- similar to Scratch 00:59:51.880 --> 00:59:54.280 just as you could create custom puzzle pieces, 00:59:54.280 --> 00:59:56.360 or in C, you can create custom functions. 00:59:56.360 --> 01:00:00.760 So in C, can you create your own types of data 01:00:00.760 --> 01:00:04.900 that go beyond the built in ints and floats and even strings? 01:00:04.900 --> 01:00:10.540 You can make, for instance, a person data type or a candidate data 01:00:10.540 --> 01:00:13.090 type in the context of elections or a person data type 01:00:13.090 --> 01:00:15.950 more generically that might have a name and a number. 01:00:15.950 --> 01:00:17.720 So how might we do this? 01:00:17.720 --> 01:00:23.470 Well, let me go here and propose that if we want to define a person, 01:00:23.470 --> 01:00:26.920 wouldn't it be nice if we could have a person data type, 01:00:26.920 --> 01:00:29.470 and then we could have an array called people? 01:00:29.470 --> 01:00:32.770 And maybe that array is our only array with two things 01:00:32.770 --> 01:00:35.200 in it, two persons in it. 01:00:35.200 --> 01:00:38.140 But somehow, those data types, these persons, 01:00:38.140 --> 01:00:41.158 would have both a name and a number associated with them. 01:00:41.158 --> 01:00:42.700 So we don't need two separate arrays. 01:00:42.700 --> 01:00:47.240 We need one array of persons, a brand new data type. 01:00:47.240 --> 01:00:48.890 So how might we do this? 01:00:48.890 --> 01:00:51.070 Well, if we want every person in the world 01:00:51.070 --> 01:00:53.320 or in this program to have a name and a number, 01:00:53.320 --> 01:00:56.380 we literally right out first those two data types. 01:00:56.380 --> 01:00:57.860 Give me a string called name. 01:00:57.860 --> 01:01:00.850 Give me a string called number semicolon, after each. 01:01:00.850 --> 01:01:04.030 And then we wrap that, those two lines of code, 01:01:04.030 --> 01:01:06.730 with this syntax, which at first glance is a little cryptic. 01:01:06.730 --> 01:01:08.410 It's a lot of words all of a sudden. 01:01:08.410 --> 01:01:12.730 But typedef is a new keyword today that defines a new data type. 01:01:12.730 --> 01:01:16.510 This is the C key word that lets you create your own data 01:01:16.510 --> 01:01:18.100 type for the very first time. 01:01:18.100 --> 01:01:22.840 Struct is another related key word that tells the compiler that this isn't just 01:01:22.840 --> 01:01:27.310 a simple data type, like an int or a float renamed or something like that. 01:01:27.310 --> 01:01:28.940 It actually is a structure. 01:01:28.940 --> 01:01:32.710 It's got some dimensions to it, like two things in it or three things in it 01:01:32.710 --> 01:01:35.260 or even 50 things inside of it. 01:01:35.260 --> 01:01:39.310 The last word down here is the name that you want to give your data type, 01:01:39.310 --> 01:01:41.980 and it weirdly goes after the curly braces. 01:01:41.980 --> 01:01:45.760 But this is how you invent a data type called person. 01:01:45.760 --> 01:01:48.670 And what this code is implying is that henceforth, 01:01:48.670 --> 01:01:53.980 the compiler clang will know that a person is composed of a name that's 01:01:53.980 --> 01:01:56.770 a string and a number that's a string. 01:01:56.770 --> 01:01:59.770 And you don't have to worry about having multiple arrays now. 01:01:59.770 --> 01:02:03.950 You can just have an array of people moving forward. 01:02:03.950 --> 01:02:05.918 So how can we go about using this? 01:02:05.918 --> 01:02:07.960 Well, let me go back to my code from before where 01:02:07.960 --> 01:02:09.340 I was implementing a phone book. 01:02:09.340 --> 01:02:11.715 And why don't we enhance the phone book code a little bit 01:02:11.715 --> 01:02:14.230 by borrowing some of that new syntax? 01:02:14.230 --> 01:02:16.720 Let me go to the top of my program above main 01:02:16.720 --> 01:02:19.660 and define a type that's a structure or a data 01:02:19.660 --> 01:02:24.500 structure that has a name inside of it and that has a number inside of it. 01:02:24.500 --> 01:02:28.150 And the name of this new structure again is going to be called person. 01:02:28.150 --> 01:02:33.550 Inside of my code now, let me go ahead and delete this old stuff temporarily. 01:02:33.550 --> 01:02:37.510 Let me give myself an array called people of size 2. 01:02:37.510 --> 01:02:40.997 And I'm going to use the non-terse way to do this. 01:02:40.997 --> 01:02:42.580 I'm not going to use the curly braces. 01:02:42.580 --> 01:02:47.140 I'm going to more pedantic spell out what I want in this array of size 2 01:02:47.140 --> 01:02:50.860 at location 0, which is the first person in an array 01:02:50.860 --> 01:02:52.690 because you always start counting at 0. 01:02:52.690 --> 01:02:56.500 I'm going to give that person a name of quote unquote Carter. 01:02:56.500 --> 01:02:59.980 And the dot is admittedly one new piece of syntax today too. 01:02:59.980 --> 01:03:02.410 The dot means go inside of that structure 01:03:02.410 --> 01:03:06.190 and access the variable called name and give it this value Carter. 01:03:06.190 --> 01:03:08.350 Similarly, if I'm going to give Carter a number, 01:03:08.350 --> 01:03:13.030 I can go into people bracket 0 dot number and give that the same thing 01:03:13.030 --> 01:03:17.950 as before plus 1-617-495-1000. 01:03:17.950 --> 01:03:20.230 And then I can do the same for myself here-- 01:03:20.230 --> 01:03:24.145 people bracket-- where should I go? 01:03:24.145 --> 01:03:26.107 OK, one because again, two elements. 01:03:26.107 --> 01:03:27.440 But we started counting at zero. 01:03:27.440 --> 01:03:29.420 Bracket name equals quote unquote David. 01:03:29.420 --> 01:03:34.340 And then lastly, people bracket 1 dot number equals quote unquote plus 01:03:34.340 --> 01:03:40.370 1-949-468-2750. 01:03:40.370 --> 01:03:43.250 So now if I scroll down here to my logic, 01:03:43.250 --> 01:03:46.130 I don't think this part needs to change too much. 01:03:46.130 --> 01:03:50.680 I'm still, for the sake of discussion, going to iterate 2 times from i 01:03:50.680 --> 01:03:53.630 is 0 on up to but not through 2. 01:03:53.630 --> 01:03:56.670 But I think this line of code needs to change. 01:03:56.670 --> 01:04:04.910 How should I now refer to the i-th person's name as I iterate? 01:04:04.910 --> 01:04:08.250 What should I compare quote unquote David to this time? 01:04:08.250 --> 01:04:08.750 Let me see. 01:04:08.750 --> 01:04:10.050 On the end here? 01:04:10.050 --> 01:04:13.078 AUDIENCE: People bracket i dot name. 01:04:13.078 --> 01:04:14.870 DAVID J. MALAN: Yeah, people bracket i dot name. 01:04:14.870 --> 01:04:15.170 Why? 01:04:15.170 --> 01:04:16.837 Because people is the name of the array. 01:04:16.837 --> 01:04:20.480 Bracket i is the i-th person that we're iterating over in the current loop-- 01:04:20.480 --> 01:04:23.240 first zero, then one, maybe higher if it had more people. 01:04:23.240 --> 01:04:26.570 Then dot is our new syntax for going inside of a data structure 01:04:26.570 --> 01:04:29.870 and accessing a variable therein which in this case is name. 01:04:29.870 --> 01:04:32.280 And so I can compare David just as before. 01:04:32.280 --> 01:04:36.890 So it's a little more verbose, but now arguably this is a better program 01:04:36.890 --> 01:04:42.470 because now these people are full fledged data types unto themselves. 01:04:42.470 --> 01:04:44.810 There's no more honor system inside of my loop 01:04:44.810 --> 01:04:47.102 that this is going to line up because in just a moment, 01:04:47.102 --> 01:04:49.867 I'm going to fix this one last remnant of the previous version. 01:04:49.867 --> 01:04:51.950 And if I can call back on you again, what should I 01:04:51.950 --> 01:04:54.830 change numbers bracket i to this time? 01:04:54.830 --> 01:05:00.760 AUDIENCE: [INAUDIBLE] dot number. 01:05:00.760 --> 01:05:02.390 DAVID J. MALAN: Dot number, exactly. 01:05:02.390 --> 01:05:04.940 So gone is the honor system that just assumes 01:05:04.940 --> 01:05:08.060 that bracket i in this array lines up with bracket i in this other array. 01:05:08.060 --> 01:05:08.810 Now why? 01:05:08.810 --> 01:05:10.430 There's only one array. 01:05:10.430 --> 01:05:11.990 It's an array called people. 01:05:11.990 --> 01:05:14.060 The things it stores are persons. 01:05:14.060 --> 01:05:15.800 A person has a name and a number. 01:05:15.800 --> 01:05:18.217 And so even though it's kind of marginal admittedly given 01:05:18.217 --> 01:05:21.050 that this is a short program and given that this kind of made things 01:05:21.050 --> 01:05:23.300 look more complicated at first glance, we're 01:05:23.300 --> 01:05:26.540 now laying the foundation for just a better design because you really 01:05:26.540 --> 01:05:29.180 can't screw up now the association of names 01:05:29.180 --> 01:05:32.780 with numbers because every person's name and number is, so to speak, 01:05:32.780 --> 01:05:36.710 encapsulated inside of the same data type. 01:05:36.710 --> 01:05:38.270 And that's a term of art in CS. 01:05:38.270 --> 01:05:41.720 Encapsulation means to encapsulate-- that is, contain-- 01:05:41.720 --> 01:05:43.830 related pieces of information. 01:05:43.830 --> 01:05:49.670 And thus, we have a person that encapsulates two other data types, name 01:05:49.670 --> 01:05:50.450 and number. 01:05:50.450 --> 01:05:52.310 And this just sets the foundation for all 01:05:52.310 --> 01:05:55.190 of the cool stuff we've talked about and you use every day. 01:05:55.190 --> 01:05:55.993 What is an image? 01:05:55.993 --> 01:05:58.910 Well, recall that an image is a bunch of pixels or dots on the screen. 01:05:58.910 --> 01:06:02.570 Every one of those dots has RGB values associated 01:06:02.570 --> 01:06:04.430 with it-- red, green, and blue. 01:06:04.430 --> 01:06:07.760 You could imagine now creating a structure in C probably where 01:06:07.760 --> 01:06:11.540 maybe you have three values, three variables-- one called red, 01:06:11.540 --> 01:06:13.400 one called green, one called blue. 01:06:13.400 --> 01:06:15.980 And then you could name the thing not person but pixel. 01:06:15.980 --> 01:06:19.910 And now you could store in C three different colors-- some amount of red, 01:06:19.910 --> 01:06:23.978 some green, some blue-- and collectively treat it as the color of a pixel. 01:06:23.978 --> 01:06:27.020 And you could imagine doing something similar perhaps for video or music. 01:06:27.020 --> 01:06:30.440 Music, you might have three variables-- one for the musical note, 01:06:30.440 --> 01:06:32.660 the duration, the loudness of it. 01:06:32.660 --> 01:06:36.120 And you can imagine coming up with your own data type for music as well. 01:06:36.120 --> 01:06:37.370 So this is a little low level. 01:06:37.370 --> 01:06:39.920 We're just using like a familiar contacts application. 01:06:39.920 --> 01:06:44.270 But we now have the way in code to express most any type of data 01:06:44.270 --> 01:06:48.260 that we might want to implement or discuss ultimately. 01:06:48.260 --> 01:06:53.510 So any questions now on struct or defining our own types, 01:06:53.510 --> 01:06:58.110 the purposes for which are to use arrays but use them more responsibly 01:06:58.110 --> 01:07:01.280 now in a better design but also to lay the foundation 01:07:01.280 --> 01:07:05.880 for implementing cooler and cooler stuff per our week zero discussion? 01:07:05.880 --> 01:07:06.380 Yeah. 01:07:06.380 --> 01:07:07.713 AUDIENCE: What's the [INAUDIBLE] 01:07:07.713 --> 01:07:10.713 DAVID J. MALAN: What's the difference between this and an object in an object 01:07:10.713 --> 01:07:11.550 oriented language? 01:07:11.550 --> 01:07:14.390 So slight side note, C is not object-oriented. 01:07:14.390 --> 01:07:17.990 Languages like Java and C++ and others which you might have heard 01:07:17.990 --> 01:07:21.200 of, programmed yourself, had friends program in, are object oriented 01:07:21.200 --> 01:07:25.430 languages in those languages they have things called classes or objects which 01:07:25.430 --> 01:07:26.450 are interrelated. 01:07:26.450 --> 01:07:30.020 And objects can store not just data, like variables. 01:07:30.020 --> 01:07:34.490 Objects can also store functions, and you can kind of sort of do this in C. 01:07:34.490 --> 01:07:36.380 But it's not sort of conventional. 01:07:36.380 --> 01:07:39.770 In C, you have data structures that store data. 01:07:39.770 --> 01:07:44.780 In languages like Java and C+, you have objects that store data and functions 01:07:44.780 --> 01:07:45.410 together. 01:07:45.410 --> 01:07:47.790 Python is an object-oriented language as well. 01:07:47.790 --> 01:07:51.270 So we'll see this issue in a few weeks, but let me wave my hands at it for now. 01:07:51.270 --> 01:07:51.770 Yeah. 01:07:51.770 --> 01:07:53.755 AUDIENCE: Could you use this [INAUDIBLE]?? 01:07:53.755 --> 01:07:54.380 DAVID J. MALAN: Yes. 01:07:54.380 --> 01:07:57.020 Could you use this struct to redefine how an int is defined? 01:07:57.020 --> 01:07:58.250 Short answer, yes. 01:07:58.250 --> 01:08:01.700 We talked a couple of times now about integer overflow. 01:08:01.700 --> 01:08:05.900 And most recently, you might have seen me mention the bug in iOS and Mac OS 01:08:05.900 --> 01:08:08.480 that was literally related to an int overflow. 01:08:08.480 --> 01:08:12.890 That's the result of ints only storing 4 bytes or 32 bits 01:08:12.890 --> 01:08:15.800 or even as long as 64 bits or 8 bytes. 01:08:15.800 --> 01:08:16.880 But it's finite. 01:08:16.880 --> 01:08:19.520 But if you want to implement some financial software 01:08:19.520 --> 01:08:22.100 or some scientific or mathematical software that 01:08:22.100 --> 01:08:25.970 allows you to count way bigger than a typical int or a long, 01:08:25.970 --> 01:08:29.135 you could imagine John coming up with your own structure. 01:08:29.135 --> 01:08:31.260 And in fact, in some languages there is a structure 01:08:31.260 --> 01:08:35.370 called big int, which allows you to express even bigger numbers. 01:08:35.370 --> 01:08:35.970 How? 01:08:35.970 --> 01:08:40.410 Well, maybe you store inside of a big ant an array of values. 01:08:40.410 --> 01:08:43.492 And you somehow allow yourself to store more and more bits 01:08:43.492 --> 01:08:45.450 based on how high you want to be able to count. 01:08:45.450 --> 01:08:46.470 So in short, yes. 01:08:46.470 --> 01:08:49.840 We now have the ability now to do most anything we want in the language 01:08:49.840 --> 01:08:52.200 even if it's not built in for us. 01:08:52.200 --> 01:08:53.100 Other questions. 01:08:53.100 --> 01:08:58.762 AUDIENCE: [INAUDIBLE] 01:08:58.762 --> 01:09:01.470 DAVID J. MALAN: Could you define a name and a number in the same line? 01:09:01.470 --> 01:09:02.069 Sort of. 01:09:02.069 --> 01:09:03.986 It starts to get syntactically a little messy, 01:09:03.986 --> 01:09:07.229 so I did it a little more pedantic line by line. 01:09:07.229 --> 01:09:07.870 Good question. 01:09:07.870 --> 01:09:08.430 Over here. 01:09:08.430 --> 01:09:12.910 AUDIENCE: [INAUDIBLE] function you use for the function 01:09:12.910 --> 01:09:15.340 at the bottom of the [INAUDIBLE]. 01:09:15.340 --> 01:09:19.029 Could you do something like that [INAUDIBLE]?? 01:09:19.029 --> 01:09:21.399 DAVID J. MALAN: Prototypes-- you have to do A and C. You 01:09:21.399 --> 01:09:25.029 have to define anything you're going to use or declare anything you're going 01:09:25.029 --> 01:09:26.779 to use before you actually use it. 01:09:26.779 --> 01:09:30.830 So it is deliberate that I put it at the top of my code in this file. 01:09:30.830 --> 01:09:34.870 Otherwise, the compiler would not know what I mean by person when I first 01:09:34.870 --> 01:09:37.750 use it here on what's line 14. 01:09:37.750 --> 01:09:41.479 So it has to come first, or it has to be put into something like a header file 01:09:41.479 --> 01:09:44.979 so that you include it at the very top of your code. 01:09:44.979 --> 01:09:46.779 Other questions over here. 01:09:46.779 --> 01:09:47.662 Yeah. 01:09:47.662 --> 01:09:53.282 AUDIENCE: [INAUDIBLE] 01:09:53.282 --> 01:09:54.990 DAVID J. MALAN: Yeah, good question, and we'll 01:09:54.990 --> 01:09:58.500 come back to this later in the term when we talk about SQL, a database language, 01:09:58.500 --> 01:10:00.630 and storing things in actual databases. 01:10:00.630 --> 01:10:04.320 Generally speaking, even though we humans call things phone numbers, 01:10:04.320 --> 01:10:07.800 or in the US, we have social security numbers, those types of numbers 01:10:07.800 --> 01:10:12.060 often have other punctuation in it, like dashes, parentheses, pluses, 01:10:12.060 --> 01:10:13.380 and so forth. 01:10:13.380 --> 01:10:17.340 You could not store any of that syntax or that punctuation inside of an int. 01:10:17.340 --> 01:10:18.880 You could only store numbers. 01:10:18.880 --> 01:10:20.940 So one motivation for using a string is just 01:10:20.940 --> 01:10:24.570 I can store whatever the human wanted me to store, including parentheses 01:10:24.570 --> 01:10:25.690 and so forth. 01:10:25.690 --> 01:10:29.428 Another reason for storing things as strings, 01:10:29.428 --> 01:10:31.470 even if they look like numbers, is in the context 01:10:31.470 --> 01:10:33.088 of zip codes in the United States. 01:10:33.088 --> 01:10:34.380 Again, we'll come back to this. 01:10:34.380 --> 01:10:36.690 But long story short-- years ago, actually-- 01:10:36.690 --> 01:10:39.732 I was using Microsoft Outlook for my email client. 01:10:39.732 --> 01:10:41.190 And eventually I switched to Gmail. 01:10:41.190 --> 01:10:42.900 And this is like 10 plus years ago now. 01:10:42.900 --> 01:10:47.430 And Outlook at the time lets you export all of your contacts as a CSV file-- 01:10:47.430 --> 01:10:48.780 Comma Separated Values. 01:10:48.780 --> 01:10:50.610 More on that in the weeks to come too. 01:10:50.610 --> 01:10:52.402 And that just means I could download a text 01:10:52.402 --> 01:10:55.710 file with all of my friends and family and their numbers inside of it. 01:10:55.710 --> 01:10:59.812 Unfortunately, I open that same CSV file with Excel, I think, at the time 01:10:59.812 --> 01:11:01.770 just to kind of spot check it and see if what's 01:11:01.770 --> 01:11:03.400 in there was what it was expected. 01:11:03.400 --> 01:11:06.870 And I must have instinctively hit, like, Command or Control-S to save it. 01:11:06.870 --> 01:11:09.900 And Excel at least has this habit of sort of reformatting your data. 01:11:09.900 --> 01:11:12.630 If things look like numbers, it treats them as numbers. 01:11:12.630 --> 01:11:14.040 And Apple Numbers does this too. 01:11:14.040 --> 01:11:16.080 Google Spreadsheets does this to nowadays. 01:11:16.080 --> 01:11:23.040 But long story short, I then imported my mildly saved CSV file into Gmail. 01:11:23.040 --> 01:11:26.760 And now 10 plus years later, I'm still occasionally finding friends and family 01:11:26.760 --> 01:11:32.640 members whose zip codes are in Cambridge, Massachusetts 2138, 01:11:32.640 --> 01:11:36.300 which is missing the 0 because we here in Cambridge are 02138. 01:11:36.300 --> 01:11:39.420 And that's because I treated or I let Excel 01:11:39.420 --> 01:11:42.630 treat what looks like a number as an actual number or int, 01:11:42.630 --> 01:11:45.540 and now leading zeros become a problem because mathematically, they 01:11:45.540 --> 01:11:48.900 mean nothing, but in the mail system, they do-- 01:11:48.900 --> 01:11:50.070 sending envelopes and such. 01:11:50.070 --> 01:11:51.653 All right, other final questions here. 01:11:51.653 --> 01:11:54.660 AUDIENCE: [INAUDIBLE] 01:11:54.660 --> 01:11:58.500 DAVID J. MALAN: Yeah, so could I have used a 2D or two dimensional 01:11:58.500 --> 01:12:02.730 array to solve the problem earlier of having just one array? 01:12:02.730 --> 01:12:06.750 Yes, but one, I would argue it's less readable, especially 01:12:06.750 --> 01:12:08.640 as I get lots of names and numbers. 01:12:08.640 --> 01:12:11.730 And two, that too is also kind of relying on the honor system. 01:12:11.730 --> 01:12:14.940 It would be all too easy to omit some of the square brackets in the two 01:12:14.940 --> 01:12:15.940 dimensional array. 01:12:15.940 --> 01:12:20.010 So I would argue it too is not as good as introducing a struct. 01:12:20.010 --> 01:12:21.180 More on that down the road. 01:12:21.180 --> 01:12:26.070 Two dimensional arrays just means arrays of arrays, as you might infer. 01:12:26.070 --> 01:12:28.080 All right, so now that we have this ability 01:12:28.080 --> 01:12:32.210 to store different types of data like contacts in a phone book, 01:12:32.210 --> 01:12:33.960 having names and addresses, let's actually 01:12:33.960 --> 01:12:36.780 take a step back and consider how we might now 01:12:36.780 --> 01:12:41.850 solve one of the original problems by actually sorting the information we're 01:12:41.850 --> 01:12:45.930 given in advance and considering, per our discussion earlier, just how 01:12:45.930 --> 01:12:48.900 costly, how time consuming is that because that might tip 01:12:48.900 --> 01:12:53.010 the scales in favor of sorting, then searching, or maybe just 01:12:53.010 --> 01:12:54.960 not sorting and only searching. 01:12:54.960 --> 01:12:58.470 It'll give us a sense of just how expensive, so to speak, 01:12:58.470 --> 01:13:00.345 sorting something actually is. 01:13:00.345 --> 01:13:02.220 Well, what's the formulation of this problem? 01:13:02.220 --> 01:13:03.780 It's the same thing as week zero. 01:13:03.780 --> 01:13:04.950 We've got input to sort. 01:13:04.950 --> 01:13:07.120 We want it to be output as sorted. 01:13:07.120 --> 01:13:10.560 So for instance, if we're taking unsorted input as input, 01:13:10.560 --> 01:13:13.895 we want the sorted output as the result. More concretely, 01:13:13.895 --> 01:13:15.270 if we've got numbers like these-- 01:13:15.270 --> 01:13:20.100 63852741, which are just randomly arranged numbers-- 01:13:20.100 --> 01:13:24.510 we want to get back out 12345678. 01:13:24.510 --> 01:13:26.440 So we just want those things to be sorted. 01:13:26.440 --> 01:13:28.470 So again, inside of the black box here is 01:13:28.470 --> 01:13:33.460 going to be one or more algorithms that actually gets this job done. 01:13:33.460 --> 01:13:35.680 So how might we go about doing this? 01:13:35.680 --> 01:13:39.180 Well, just to vary things a bit more, I think we have a chance here 01:13:39.180 --> 01:13:41.580 for a bit more audience participation. 01:13:41.580 --> 01:13:43.790 But this time, we need eight people if we may. 01:13:43.790 --> 01:13:46.290 All of you have to be comfortable appearing on the internet. 01:13:46.290 --> 01:13:49.165 OK, so this is actually quite convenient that you're all quite close. 01:13:49.165 --> 01:13:52.752 How about 1, 2, 3, 4, 5, 6, 7-- 01:13:52.752 --> 01:13:57.060 oh, OK, and someone volunteering their friend-- number eight. 01:13:57.060 --> 01:13:58.020 Come on down. 01:13:58.020 --> 01:13:58.875 Come on down. 01:13:58.875 --> 01:14:00.750 And if you could, I'm going to set things up. 01:14:00.750 --> 01:14:03.630 If you all could join Valerie, my colleague over there, 01:14:03.630 --> 01:14:08.910 to give you a prop to use here, we'll go ahead in just a moment 01:14:08.910 --> 01:14:11.835 and try to find some numbers at hand. 01:14:15.180 --> 01:14:20.820 In just a moment, each of our volunteers is going to be representing an integer. 01:14:20.820 --> 01:14:25.380 And that integer is initially going to be in unsorted order. 01:14:25.380 --> 01:14:29.100 And I claim that using an algorithm, step by step instructions, 01:14:29.100 --> 01:14:33.820 we can probably sort these folks in at least a couple of different ways. 01:14:33.820 --> 01:14:38.430 So they're in wardrobe right now just getting their very own Harvard T-shirts 01:14:38.430 --> 01:14:42.945 with a Jersey number on it, which will then represent an element of our array. 01:14:46.650 --> 01:14:51.270 Give us just a moment to finish getting the attire ready. 01:14:51.270 --> 01:14:56.040 They're being handed a shirt and a number. 01:14:56.040 --> 01:14:58.620 And let me ask the audience for just a moment. 01:14:58.620 --> 01:15:02.760 As we have these numbers up here on the screen, these numbers too are unsorted. 01:15:02.760 --> 01:15:04.353 They're just in random order. 01:15:04.353 --> 01:15:05.520 And let me ask the audience. 01:15:05.520 --> 01:15:10.980 How would you go about sorting these eight numbers on the screen? 01:15:10.980 --> 01:15:12.930 How would you go about sorting these? 01:15:12.930 --> 01:15:14.138 Yeah, what are your thoughts? 01:15:14.138 --> 01:15:20.327 AUDIENCE: [INAUDIBLE] the number at the end, the following number. 01:15:20.327 --> 01:15:20.910 DAVID J. MALAN: OK. 01:15:20.910 --> 01:15:24.547 AUDIENCE: The following number is bigger, then I keep it as it is. 01:15:24.547 --> 01:15:25.130 DAVID J. MALAN: OK. 01:15:25.130 --> 01:15:26.800 AUDIENCE: If not, then [INAUDIBLE]. 01:15:26.800 --> 01:15:29.352 DAVID J. MALAN: OK, so just to recap, you would start 01:15:29.352 --> 01:15:30.810 with one of the numbers on the end. 01:15:30.810 --> 01:15:33.210 You would look to the number to the right or to the left of it, 01:15:33.210 --> 01:15:34.530 depending on which end you start at. 01:15:34.530 --> 01:15:37.113 And if it's out of order, you would just start to swap things. 01:15:37.113 --> 01:15:38.350 And that seems reasonable. 01:15:38.350 --> 01:15:40.435 There's a whole bunch of mistakes to fix here 01:15:40.435 --> 01:15:42.060 because things are pretty out of order. 01:15:42.060 --> 01:15:45.060 But probably, if you start to solve small problems at a time, 01:15:45.060 --> 01:15:47.910 you can achieve the end result of getting the whole thing sorted. 01:15:47.910 --> 01:15:50.820 Other instincts, if you were just handed these numbers, how 01:15:50.820 --> 01:15:54.077 you might go about sorting them? 01:15:54.077 --> 01:15:54.660 How might you? 01:15:54.660 --> 01:15:55.856 Yeah, in the back. 01:15:55.856 --> 01:16:00.140 AUDIENCE: [INAUDIBLE] 01:16:00.140 --> 01:16:01.920 DAVID J. MALAN: OK, I like that. 01:16:01.920 --> 01:16:05.840 So to recap there, find the smallest one first and put it at the beginning, 01:16:05.840 --> 01:16:07.070 if I heard you correctly. 01:16:07.070 --> 01:16:10.342 And then presumably, you could do that again and again and again. 01:16:10.342 --> 01:16:13.050 And that would seem to give you a couple of different algorithms. 01:16:13.050 --> 01:16:15.710 And if you all are attired here-- 01:16:15.710 --> 01:16:18.800 do you want to come on up if you're ready? 01:16:18.800 --> 01:16:20.660 We had some [? felt ?] volunteers too. 01:16:20.660 --> 01:16:23.030 Come on over. 01:16:23.030 --> 01:16:25.520 So if you all would like to line yourselves up 01:16:25.520 --> 01:16:27.770 facing the audience in exactly this order-- so 01:16:27.770 --> 01:16:30.170 whoever is number zero should be way over here, 01:16:30.170 --> 01:16:33.710 and whoever is number five should be way over there. 01:16:33.710 --> 01:16:36.920 Feel free to distance as much as you'd like and scooch a little with this way 01:16:36.920 --> 01:16:38.480 if you could. 01:16:38.480 --> 01:16:39.653 OK, all right. 01:16:39.653 --> 01:16:40.820 And make a little more room. 01:16:40.820 --> 01:16:41.870 So seven-- let's see. 01:16:41.870 --> 01:16:44.170 5, 2, 7, 4-- 01:16:44.170 --> 01:16:45.090 AUDIENCE: [INAUDIBLE] 01:16:45.090 --> 01:16:46.733 DAVID J. MALAN: 4, hopefully 1. 01:16:46.733 --> 01:16:47.900 Yeah, keep them to the side. 01:16:47.900 --> 01:16:51.210 OK, 1, 6, and there we go-- 01:16:51.210 --> 01:16:51.710 3. 01:16:51.710 --> 01:16:52.543 Come on over, three. 01:16:52.543 --> 01:16:53.690 I was looking for you. 01:16:53.690 --> 01:16:57.075 All right, so here, we have an array of eight numbers-- 01:16:57.075 --> 01:16:58.200 eight integers if you will. 01:16:58.200 --> 01:17:00.770 And do you want to each say a quick hello to the group? 01:17:00.770 --> 01:17:02.750 AUDIENCE: Hello, I'm Quinn. 01:17:02.750 --> 01:17:04.892 Go [INAUDIBLE]. 01:17:04.892 --> 01:17:05.850 AUDIENCE: Hi, everyone. 01:17:05.850 --> 01:17:08.060 I'm [INAUDIBLE]. 01:17:08.060 --> 01:17:09.460 AUDIENCE: Hey, I'm Mitchell. 01:17:09.460 --> 01:17:10.460 AUDIENCE: Hi, I'm Brett. 01:17:10.460 --> 01:17:12.675 And also, go [INAUDIBLE]. 01:17:12.675 --> 01:17:13.550 AUDIENCE: I'm Hannah. 01:17:13.550 --> 01:17:15.137 Go [INAUDIBLE]. 01:17:15.137 --> 01:17:16.220 AUDIENCE: Hi, I'm Matthew. 01:17:16.220 --> 01:17:18.058 Go [INAUDIBLE] 01:17:18.058 --> 01:17:19.100 AUDIENCE: Hi, I'm Miriam. 01:17:19.100 --> 01:17:20.720 Go Winthrop. 01:17:20.720 --> 01:17:22.905 AUDIENCE: Hi, I'm Celeste, and go Strauss. 01:17:22.905 --> 01:17:23.780 DAVID J. MALAN: Wonderful. 01:17:23.780 --> 01:17:26.930 Well, welcome all to the stage, and let's just visualize, 01:17:26.930 --> 01:17:29.430 perhaps organically, how you eight would solve this problem. 01:17:29.430 --> 01:17:32.540 So we currently have the numbers 0 through 7 quite out of order. 01:17:32.540 --> 01:17:36.508 Could you go ahead and just yourselves from 0 through 7? 01:17:36.508 --> 01:17:37.341 AUDIENCE: Thank you. 01:17:41.400 --> 01:17:44.200 DAVID J. MALAN: OK, so what did they just do? 01:17:44.200 --> 01:17:44.700 OK, yes. 01:17:44.700 --> 01:17:46.260 First of all, yes, very well done. 01:17:50.030 --> 01:17:53.037 How would you describe what they just did? 01:17:53.037 --> 01:17:53.870 Well, let's do this. 01:17:53.870 --> 01:17:56.060 Could you go back into that order on the screen-- 01:17:56.060 --> 01:18:00.320 52741630? 01:18:00.320 --> 01:18:03.440 And could you do exactly what you just did again? 01:18:03.440 --> 01:18:04.750 Sort yourselves. 01:18:08.040 --> 01:18:09.030 All right, what did-- 01:18:09.030 --> 01:18:09.630 OK, yes. 01:18:09.630 --> 01:18:10.470 Well done again. 01:18:14.190 --> 01:18:17.480 All right, so admittedly, there's kind of a lot going on because each of you, 01:18:17.480 --> 01:18:21.150 except number four, are doing something in parallel all at the same time. 01:18:21.150 --> 01:18:23.390 And that's not really how a computer typically works. 01:18:23.390 --> 01:18:26.900 Just like a computer can only look at one memory location, at one locker, 01:18:26.900 --> 01:18:31.490 at a time, so can a computer only move one number at a time-- sort of opening 01:18:31.490 --> 01:18:33.717 a locker, checking what's there, moving it as needed. 01:18:33.717 --> 01:18:36.800 So let's try this more methodically based on the two audience suggestions. 01:18:36.800 --> 01:18:42.350 If you all could randomize yourself again to 52741630, 01:18:42.350 --> 01:18:44.762 let's take the second of those approaches first. 01:18:44.762 --> 01:18:46.220 I'm going to look at these numbers. 01:18:46.220 --> 01:18:48.870 And even though I as the human can obviously see all the numbers 01:18:48.870 --> 01:18:50.930 and I just kind of have the intuition for how to fix this, 01:18:50.930 --> 01:18:53.180 we got to be more methodical because eventually, we've 01:18:53.180 --> 01:18:55.500 got to translate this to pseudo code and then code. 01:18:55.500 --> 01:18:56.390 So let me see. 01:18:56.390 --> 01:18:59.200 I'm going to search for, as you proposed, the smallest number. 01:18:59.200 --> 01:19:00.950 And I'm going to start from left to right. 01:19:00.950 --> 01:19:04.100 I could do it right to left, but left to right just tends to be convention. 01:19:04.100 --> 01:19:07.080 All right, 5 at this moment is the smallest number I've seen. 01:19:07.080 --> 01:19:09.818 So I'm going to remember that in a variable, if you will. 01:19:09.818 --> 01:19:11.360 Now I'm going to take one more step-- 01:19:11.360 --> 01:19:11.930 2. 01:19:11.930 --> 01:19:15.260 OK, 2 I'm going to compare to the variable in mind, obviously smaller. 01:19:15.260 --> 01:19:19.160 I'm going to forget about 5 and only now remember 2 as the now smallest 01:19:19.160 --> 01:19:19.820 elements. 01:19:19.820 --> 01:19:23.090 7, nope-- I'm going to ignore that because it's not smaller than the 2 01:19:23.090 --> 01:19:23.810 I have in mind. 01:19:23.810 --> 01:19:27.170 4, 1-- OK, I'm going to update the variable in mind 01:19:27.170 --> 01:19:28.430 because that's indeed smaller. 01:19:28.430 --> 01:19:31.130 Now obviously, we the humans know that's getting pretty small. 01:19:31.130 --> 01:19:32.180 Maybe it's the end. 01:19:32.180 --> 01:19:35.630 I have to check all values to see if there's something even smaller 01:19:35.630 --> 01:19:38.435 because 6 is not, 3 is not, but 0 is. 01:19:38.435 --> 01:19:39.560 And what's your name again? 01:19:39.560 --> 01:19:40.430 AUDIENCE: Celeste. 01:19:40.430 --> 01:19:41.270 DAVID J. MALAN: Celeste. 01:19:41.270 --> 01:19:47.960 Where should Celeste or number 0 go according to this proposed algorithm? 01:19:47.960 --> 01:19:49.590 All right, I'm seeing a lot of this. 01:19:49.590 --> 01:19:52.820 So at the beginning of the array, so before doing this for real, 01:19:52.820 --> 01:19:54.560 let's have you pop out in front. 01:19:54.560 --> 01:19:58.100 And could you all shift and make room for Celeste? 01:19:58.100 --> 01:20:02.030 Is this a good idea to have all of them move or equivalently 01:20:02.030 --> 01:20:04.550 move everything in the array to make room for Celeste 01:20:04.550 --> 01:20:06.920 and number 0 over there? 01:20:06.920 --> 01:20:07.670 No, probably not. 01:20:07.670 --> 01:20:08.878 That felt like a lot of work. 01:20:08.878 --> 01:20:12.200 And even though it happened pretty quickly, that's like seven steps 01:20:12.200 --> 01:20:14.040 to happen just to move her in place. 01:20:14.040 --> 01:20:16.580 So what would be marginally smarter perhaps-- 01:20:16.580 --> 01:20:18.960 a little more efficient, perhaps? 01:20:18.960 --> 01:20:19.460 What's that? 01:20:19.460 --> 01:20:19.910 AUDIENCE: Swapping. 01:20:19.910 --> 01:20:20.490 DAVID J. MALAN: Swapping. 01:20:20.490 --> 01:20:21.532 What do you mean by swap? 01:20:21.532 --> 01:20:23.053 AUDIENCE: Replacing swaps. 01:20:23.053 --> 01:20:24.470 DAVID J. MALAN: OK, replace two values. 01:20:24.470 --> 01:20:28.090 So if you want to go back to where you were, one step Over, number 5, 01:20:28.090 --> 01:20:29.560 he's not in the right place. 01:20:29.560 --> 01:20:30.790 He's got to move eventually. 01:20:30.790 --> 01:20:31.498 So you know what? 01:20:31.498 --> 01:20:34.300 If that's where Celeste belongs, why don't we just swap 5 and 0? 01:20:34.300 --> 01:20:36.925 So if you want to go ahead and exchange places with each other. 01:20:36.925 --> 01:20:38.230 Notice what's just happened. 01:20:38.230 --> 01:20:41.420 The problem I'm trying to solve has gotten smaller. 01:20:41.420 --> 01:20:44.020 Instead of being size 8, now it's size 7. 01:20:44.020 --> 01:20:47.050 Now granted, I moved 5 to another wrong location. 01:20:47.050 --> 01:20:48.940 But if these numbers started off randomly, 01:20:48.940 --> 01:20:52.790 it doesn't really matter where 5 goes until we get him into the right place. 01:20:52.790 --> 01:20:54.040 So I think we've improved. 01:20:54.040 --> 01:20:57.430 And now if I go back, my loop is sort of coming back around. 01:20:57.430 --> 01:21:01.690 I can ignore Celeste and make this a seven step problem and not eight 01:21:01.690 --> 01:21:03.430 because I know she's in the right place. 01:21:03.430 --> 01:21:04.840 2 seems to be the smallest. 01:21:04.840 --> 01:21:05.650 I'll remember that. 01:21:05.650 --> 01:21:07.300 Not 7, not 4-- 01:21:07.300 --> 01:21:09.070 1 seems to be the smallest. 01:21:09.070 --> 01:21:13.600 Now I know as a human this should be my next smallest. 01:21:13.600 --> 01:21:18.100 But why, intuitively, should I keep going, do you think? 01:21:18.100 --> 01:21:20.950 I can't sort of optimize as a human and just say, number 1, 01:21:20.950 --> 01:21:22.760 let's get you into the right place. 01:21:22.760 --> 01:21:24.610 I still want to check the whole array. 01:21:24.610 --> 01:21:25.330 Why? 01:21:25.330 --> 01:21:25.870 Yeah. 01:21:25.870 --> 01:21:28.477 AUDIENCE: Perhaps there's another 1. 01:21:28.477 --> 01:21:30.310 DAVID J. MALAN: Maybe there's another 1, and that 01:21:30.310 --> 01:21:32.030 could be another problem altogether. 01:21:32.030 --> 01:21:32.770 Other thoughts? 01:21:32.770 --> 01:21:33.270 Yeah. 01:21:33.270 --> 01:21:34.458 AUDIENCE: Could be another 0 01:21:34.458 --> 01:21:36.250 DAVID J. MALAN: There could be another 0 indeed, 01:21:36.250 --> 01:21:38.560 but I did go through the list once, right? 01:21:38.560 --> 01:21:39.970 And I kind of know there isn't. 01:21:39.970 --> 01:21:40.570 Your thoughts? 01:21:40.570 --> 01:21:43.300 AUDIENCE: You don't know that every value is represented. 01:21:43.300 --> 01:21:47.387 So maybe there's a [INAUDIBLE] You just don't know what kind of data 01:21:47.387 --> 01:21:48.220 you're working with. 01:21:48.220 --> 01:21:50.600 DAVID J. MALAN: Yeah, I don't necessarily know what is there. 01:21:50.600 --> 01:21:54.940 And honestly, I only stipulated earlier that I'm using one variable in my mind. 01:21:54.940 --> 01:21:58.180 I could use two and remember the two smallest elements I've seen. 01:21:58.180 --> 01:22:00.130 I could use three variables, four. 01:22:00.130 --> 01:22:03.440 But then I'm going to start to use a lot of space in addition to time. 01:22:03.440 --> 01:22:06.880 So if I've stipulated that I only have one variable to solve this problem, 01:22:06.880 --> 01:22:09.190 I don't know anything more about these elements 01:22:09.190 --> 01:22:11.398 because the only thing I'm remembering at this moment 01:22:11.398 --> 01:22:13.490 is number 1 is the smallest element I've seen. 01:22:13.490 --> 01:22:14.290 So I'm going to keep going. 01:22:14.290 --> 01:22:14.710 6? 01:22:14.710 --> 01:22:15.070 Nope. 01:22:15.070 --> 01:22:15.490 3? 01:22:15.490 --> 01:22:15.790 Nope. 01:22:15.790 --> 01:22:16.210 5? 01:22:16.210 --> 01:22:16.710 Nope. 01:22:16.710 --> 01:22:18.670 OK, I know that number 1, and your name was-- 01:22:18.670 --> 01:22:19.378 AUDIENCE: Hannah. 01:22:19.378 --> 01:22:21.850 DAVID J. MALAN: --Hannah is the next smallest element. 01:22:21.850 --> 01:22:24.340 I could have everyone move over to make room, but nope. 01:22:24.340 --> 01:22:25.060 2? 01:22:25.060 --> 01:22:26.980 You know, even though you're so close to where I want you, 01:22:26.980 --> 01:22:29.170 I'm just going to keep it simple and swap you two. 01:22:29.170 --> 01:22:31.570 So granted, I've made the problem a little worse. 01:22:31.570 --> 01:22:35.200 But on average, I could get lucky too and just pop number 2 01:22:35.200 --> 01:22:36.280 into the right place. 01:22:36.280 --> 01:22:38.210 Now let me just accelerate this. 01:22:38.210 --> 01:22:42.550 I can now ignore Hannah and Celeste, making the problem size 6 instead of 8. 01:22:42.550 --> 01:22:43.870 So it's getting smaller. 01:22:43.870 --> 01:22:45.190 7 is the smallest. 01:22:45.190 --> 01:22:46.450 Nope, now 4 is-- 01:22:46.450 --> 01:22:47.920 2 is the smallest. 01:22:47.920 --> 01:22:50.470 Still 2, still 2, still 2. 01:22:50.470 --> 01:22:53.560 So let's go ahead and swap 2 and 7. 01:22:53.560 --> 01:22:56.230 And now I'll just kind of orchestrate it verbally. 01:22:56.230 --> 01:22:57.950 4, you're about to have to do something. 01:22:57.950 --> 01:23:01.750 So we now have 4, 7, 6 3, 5. 01:23:01.750 --> 01:23:04.480 OK, 3-- could you swap with 4? 01:23:04.480 --> 01:23:07.675 All right, now we have 7, 6, 4, 5. 01:23:07.675 --> 01:23:10.420 OK, 4, could you swap with 7? 01:23:10.420 --> 01:23:13.180 Now we have 6, 7, 5. 01:23:13.180 --> 01:23:15.430 5, could you swap with 6? 01:23:15.430 --> 01:23:16.835 And now we have 7, 6. 01:23:16.835 --> 01:23:18.160 6, would you swap at 7? 01:23:18.160 --> 01:23:19.690 And now perhaps round of applause. 01:23:19.690 --> 01:23:20.980 They've sorted themselves. 01:23:20.980 --> 01:23:25.250 OK, hang on there one minute. 01:23:25.250 --> 01:23:27.130 So we'll do this one other approach. 01:23:27.130 --> 01:23:30.220 And my God, that felt so much slower than the first approach, 01:23:30.220 --> 01:23:33.040 but that's, one, because I was kind of providing a long voiceover. 01:23:33.040 --> 01:23:37.690 But two, we were doing one thing at a time whereas the first time, you guys 01:23:37.690 --> 01:23:41.110 had the luxury of moving like eight different CPUs-- 01:23:41.110 --> 01:23:43.690 brains, if you will-- were all operating at the same time. 01:23:43.690 --> 01:23:45.140 And computers like that exist. 01:23:45.140 --> 01:23:47.893 If you have a computer with multiple cores, so to speak, 01:23:47.893 --> 01:23:49.810 that's like having a computer that technically 01:23:49.810 --> 01:23:51.490 can do multiple things at once. 01:23:51.490 --> 01:23:54.470 But software typically, at least as we've written it thus far, 01:23:54.470 --> 01:23:56.195 can only do one thing at a time. 01:23:56.195 --> 01:23:58.070 So in a bit, we'll add up all of these steps. 01:23:58.070 --> 01:23:59.862 But for now, let's take one other approach. 01:23:59.862 --> 01:24:02.140 If you all could reorder yourselves like that-- 01:24:02.140 --> 01:24:06.940 52741630-- let's take the other approach that 01:24:06.940 --> 01:24:10.610 was recommended by just fixing small problems and see where this gets us. 01:24:10.610 --> 01:24:12.730 So we're back in the original order. 01:24:12.730 --> 01:24:14.812 5 and 2 are clearly out of order. 01:24:14.812 --> 01:24:15.520 So you know what? 01:24:15.520 --> 01:24:17.350 Let's just bite this problem off now. 01:24:17.350 --> 01:24:19.090 5 and 2, could you swap? 01:24:19.090 --> 01:24:20.530 Now let me take a next step. 01:24:20.530 --> 01:24:22.330 5 and 7, I think you're OK. 01:24:22.330 --> 01:24:25.000 There's a gap, yes, but that might not be a big deal. 01:24:25.000 --> 01:24:26.360 7 and 4-- problem. 01:24:26.360 --> 01:24:28.420 Let's have you swap. 01:24:28.420 --> 01:24:31.150 OK, 7 and 1, let's have you swap. 01:24:31.150 --> 01:24:34.120 7 and 6, let's have you swap. 01:24:34.120 --> 01:24:35.830 7 and 3, you swap. 01:24:35.830 --> 01:24:37.690 7 and 0, you swap. 01:24:37.690 --> 01:24:39.490 Now let me pause for just a moment. 01:24:39.490 --> 01:24:40.790 Still not sorted. 01:24:40.790 --> 01:24:42.470 So I'm clearly not done. 01:24:42.470 --> 01:24:45.340 But have I improved the problem? 01:24:45.340 --> 01:24:48.100 Right, I can't see-- like before, I can't optimize like 01:24:48.100 --> 01:24:50.210 before because 0 is obviously not here. 01:24:50.210 --> 01:24:54.160 So unless they're still way back there, so it's not like I've gone from 8 steps 01:24:54.160 --> 01:24:56.300 to 7 to 6 just yet. 01:24:56.300 --> 01:24:58.047 But have I made any improvements? 01:24:58.047 --> 01:24:58.630 AUDIENCE: Yes. 01:24:58.630 --> 01:24:59.255 DAVID J. MALAN: Yes. 01:24:59.255 --> 01:25:01.480 In what sense is this improved? 01:25:01.480 --> 01:25:06.050 What's a concrete thing you could point to is better? 01:25:06.050 --> 01:25:06.550 Yeah. 01:25:06.550 --> 01:25:08.110 AUDIENCE: Sorted the highest number. 01:25:08.110 --> 01:25:10.690 DAVID J. MALAN: I've sorted the highest number, which is indeed 7. 01:25:10.690 --> 01:25:15.400 And conversely, if you prefer, Celeste is one step closer to the beginning. 01:25:15.400 --> 01:25:19.970 Now worst case, Celeste is going to have to move one step on each iteration. 01:25:19.970 --> 01:25:22.540 So I might need to do this thing like n total times 01:25:22.540 --> 01:25:24.215 to move her all the way over. 01:25:24.215 --> 01:25:25.340 But that might work out OK. 01:25:25.340 --> 01:25:26.510 Let me see. 01:25:26.510 --> 01:25:27.980 2 and 5, you're good. 01:25:27.980 --> 01:25:29.480 5 and 4, swap you. 01:25:29.480 --> 01:25:31.370 5 and 1, let's swap you. 01:25:31.370 --> 01:25:32.570 5 and 6, you're good. 01:25:32.570 --> 01:25:34.550 6 and 3, let's swap you. 01:25:34.550 --> 01:25:36.650 6 and 0, let's swap you. 01:25:36.650 --> 01:25:38.240 6 and 7, you're good. 01:25:38.240 --> 01:25:39.380 And I think now-- 01:25:39.380 --> 01:25:41.510 notice that the high values, as you noted, 01:25:41.510 --> 01:25:44.150 are sort of bubbling up, if you will, to the end of the list. 01:25:44.150 --> 01:25:45.530 2 and 4, you're good. 01:25:45.530 --> 01:25:46.820 4 and 1, let's swap. 01:25:46.820 --> 01:25:48.080 4 and 5, good. 01:25:48.080 --> 01:25:49.430 5 and 3, swap. 01:25:49.430 --> 01:25:51.780 5 and 0, swap. 01:25:51.780 --> 01:25:53.370 5, 6, 7, of course, are good. 01:25:53.370 --> 01:25:56.175 So now you can sort of see the problem resolving itself. 01:25:56.175 --> 01:25:57.800 And let's just do this part now faster. 01:25:57.800 --> 01:26:00.140 2 and 1, 2 and 4. 01:26:00.140 --> 01:26:04.160 OK, 4 and 3, 4 and 0. 01:26:04.160 --> 01:26:08.877 All right, now 1 and 2, 2, and 3, and 0, and good. 01:26:08.877 --> 01:26:10.460 So we do have some optimization there. 01:26:10.460 --> 01:26:12.860 We don't need to keep going because those all are sorted. 01:26:12.860 --> 01:26:13.910 1 and 2, you're good. 01:26:13.910 --> 01:26:16.730 2 and 0, all right, done. 01:26:16.730 --> 01:26:20.330 1 and 0-- and big round of applause in closing. 01:26:20.330 --> 01:26:24.890 OK, so thank you all. 01:26:24.890 --> 01:26:27.140 We need the puppets back, but you can keep the shirts. 01:26:27.140 --> 01:26:28.840 Thank you for volunteering here. 01:26:28.840 --> 01:26:31.810 Feel free to make your way exits left or right. 01:26:31.810 --> 01:26:33.760 And let's see if, thanks to our volunteers 01:26:33.760 --> 01:26:40.000 here, we can't now formalize a little bit what we did on both passes here. 01:26:40.000 --> 01:26:44.170 I claim that the first algorithm our volunteers kindly acted out 01:26:44.170 --> 01:26:45.730 is what's called selection sort. 01:26:45.730 --> 01:26:50.980 And as the name implied, we selected the smallest elements again and again 01:26:50.980 --> 01:26:53.530 and again, working our way from left to right, 01:26:53.530 --> 01:26:58.100 putting Celeste into the right place, and then continuing with everyone else. 01:26:58.100 --> 01:27:01.060 So selection sort, as it's formally called, 01:27:01.060 --> 01:27:04.000 can be described, for instance, with this pseudo code here-- 01:27:04.000 --> 01:27:06.910 4i from 0 to n minus 1. 01:27:06.910 --> 01:27:08.350 And again, why this? 01:27:08.350 --> 01:27:10.420 This is just how talk about arrays. 01:27:10.420 --> 01:27:14.920 The left end is 0, the right end is n minus 1 where in this case, 01:27:14.920 --> 01:27:16.720 n happened to be eight people. 01:27:16.720 --> 01:27:18.700 So that's 0 through 7. 01:27:18.700 --> 01:27:22.240 So for i from 0 to n minus 1, what did I do? 01:27:22.240 --> 01:27:27.400 I found the smallest number between numbers bracket i and numbers bracket 01:27:27.400 --> 01:27:28.810 n minus 1. 01:27:28.810 --> 01:27:31.030 It's a little cryptic at first glance, but this 01:27:31.030 --> 01:27:34.510 is just a very pseudo code-like way of saying 01:27:34.510 --> 01:27:37.960 find the smallest element among all eight volunteers 01:27:37.960 --> 01:27:43.120 because if i starts at 0 and n minus 1 never changes because there's always 01:27:43.120 --> 01:27:47.230 8, 8 people, so 8 minus 1 is 7, this first 01:27:47.230 --> 01:27:50.320 says find the smallest number between numbers bracket 0 01:27:50.320 --> 01:27:53.350 and numbers bracket 7, if you will. 01:27:53.350 --> 01:27:54.550 Then what do I do? 01:27:54.550 --> 01:27:57.730 Swap the smallest number with numbers bracket i. 01:27:57.730 --> 01:28:01.210 So that's how we got Celeste from over here all the way over there. 01:28:01.210 --> 01:28:03.340 We just swapped those two values. 01:28:03.340 --> 01:28:05.840 What then happens next in this pseudo code? 01:28:05.840 --> 01:28:07.900 i, of course, goes from 0 to 1. 01:28:07.900 --> 01:28:10.060 And that's the technical way of saying now 01:28:10.060 --> 01:28:13.810 find the smallest element among the 7 remaining volunteers, 01:28:13.810 --> 01:28:17.570 ignoring Celeste this time because she was already in the correct location. 01:28:17.570 --> 01:28:19.960 So the problem went from size 8 to size 7. 01:28:19.960 --> 01:28:23.500 And if we repeat, size 6, 5, 4, 3, 2, 1, until boom, 01:28:23.500 --> 01:28:25.820 it's all done at the very end. 01:28:25.820 --> 01:28:29.200 So this is just one way of expressing in pseudo code what 01:28:29.200 --> 01:28:33.040 we did a little more organically and a formalization of what someone 01:28:33.040 --> 01:28:35.420 volunteered out in the audience. 01:28:35.420 --> 01:28:40.300 So if we consider, then, the efficiency of this algorithm, 01:28:40.300 --> 01:28:42.730 maybe abstracting it away now as a bunch of doors 01:28:42.730 --> 01:28:46.960 where the left most again is always 0, the right most is always n minus 1, 01:28:46.960 --> 01:28:50.350 or equivalently, the second to last is n minus 2, the third to last 01:28:50.350 --> 01:28:54.550 is n minus 3 where n might be 8 or anything else, 01:28:54.550 --> 01:28:59.620 how do we think about or quantify the running time of selection sort? 01:28:59.620 --> 01:29:02.230 Big O of what? 01:29:02.230 --> 01:29:05.230 I mean, that was a lot of steps to be adding up. 01:29:05.230 --> 01:29:09.130 It's probably more than n, right, because I went through the list 01:29:09.130 --> 01:29:10.030 again and again. 01:29:10.030 --> 01:29:14.740 It was like n plus n minus 1 plus n minus 2. 01:29:14.740 --> 01:29:17.080 Any instincts here? 01:29:17.080 --> 01:29:20.620 We got like the whole team in the orchestra now. 01:29:20.620 --> 01:29:25.180 Let me propose we think about it this way with just a bit of formula, say. 01:29:25.180 --> 01:29:29.350 So the first time, I had to look at n different volunteers. 01:29:29.350 --> 01:29:33.130 n was 8 in this case, but generically, I looked at all eight numbers 01:29:33.130 --> 01:29:35.230 in order to decide who was the smallest. 01:29:35.230 --> 01:29:37.270 And sure enough, Celeste was at the very end. 01:29:37.270 --> 01:29:39.103 She happened to be all the way to the right. 01:29:39.103 --> 01:29:43.510 But I only knew that once I looked at all 8 or all n volunteers. 01:29:43.510 --> 01:29:45.880 So that took me n steps first. 01:29:45.880 --> 01:29:49.270 But once the list was swapped into the right place, then 01:29:49.270 --> 01:29:53.290 my problem with size n minus 1, and I had n minus 1 other people 01:29:53.290 --> 01:29:54.320 to look through. 01:29:54.320 --> 01:29:55.870 So that's n minus 1 steps. 01:29:55.870 --> 01:29:59.825 Then after that, it's n minus 2 plus n minus 3 plus n minus 4 plus dot dot 01:29:59.825 --> 01:30:01.270 dot until I had one final step. 01:30:01.270 --> 01:30:04.460 And it's obvious that I only have one human left to consider. 01:30:04.460 --> 01:30:07.180 So we might wave our hands at this with a little ellipsis 01:30:07.180 --> 01:30:10.400 and just say dot dot dot plus 1 for the final step. 01:30:10.400 --> 01:30:11.890 Now what does this actually equal? 01:30:11.890 --> 01:30:13.480 Well, this is where you might think back on, like, 01:30:13.480 --> 01:30:15.400 your high school math or physics textbook that 01:30:15.400 --> 01:30:18.640 has a little cheat sheet at the end that shows these kinds of recurrences. 01:30:18.640 --> 01:30:21.490 That happens to work out mathematically to be 01:30:21.490 --> 01:30:25.120 n times n plus 1 all divided by 2. 01:30:25.120 --> 01:30:28.690 That's just what that recurrence, that series, actually adds up to. 01:30:28.690 --> 01:30:31.300 So if you take on faith that that math is correct, let's 01:30:31.300 --> 01:30:35.530 just now multiply this out mathematically. 01:30:35.530 --> 01:30:41.920 That's n squared plus n divided by 2 or n squared divided by 2 plus n over 2. 01:30:41.920 --> 01:30:44.890 And here's where we're starting to get annoyingly into the weeds. 01:30:44.890 --> 01:30:50.080 Like, honestly, as n gets really large, like a million doors or integers 01:30:50.080 --> 01:30:54.940 or a billion web pages in Google search engine, honestly, which of these terms 01:30:54.940 --> 01:30:57.400 is going to matter the most mathematically 01:30:57.400 --> 01:30:59.200 if n is a really big number? 01:30:59.200 --> 01:31:01.840 Is n squared divided by 2 the dominant factor, 01:31:01.840 --> 01:31:04.260 or is n divided by 2 the dominant factor? 01:31:04.260 --> 01:31:05.995 AUDIENCE: n squared. 01:31:05.995 --> 01:31:07.120 DAVID J. MALAN: Yeah, n squared. 01:31:07.120 --> 01:31:09.290 I mean, no matter what n is-- and the bigger it is, 01:31:09.290 --> 01:31:12.580 the bigger raising it to the power 2 is going to be. 01:31:12.580 --> 01:31:13.330 So you know what? 01:31:13.330 --> 01:31:16.270 Let's just wave our hands at this because at the end of the day, 01:31:16.270 --> 01:31:19.780 as n gets really large, the dominant factor is indeed that first one. 01:31:19.780 --> 01:31:20.530 And you know what? 01:31:20.530 --> 01:31:24.290 Even the divided 2, as I claimed earlier with our two phone book examples, where 01:31:24.290 --> 01:31:26.960 the two straight lines if you keep zooming out essentially 01:31:26.960 --> 01:31:31.760 looked the same when n is large enough, let's just call this on the order of n 01:31:31.760 --> 01:31:32.490 squared. 01:31:32.490 --> 01:31:37.130 So that is to say a computer scientist would describe bubble sort as taking 01:31:37.130 --> 01:31:39.860 on the order of n squared steps. 01:31:39.860 --> 01:31:41.510 That's an oversimplification. 01:31:41.510 --> 01:31:44.600 If we really added it up, it's actually this many steps-- n 01:31:44.600 --> 01:31:46.610 squared divided by 2 plus n over 2. 01:31:46.610 --> 01:31:50.420 But again, if we want to just be able to generally compare two algorithms' 01:31:50.420 --> 01:31:53.810 performance, I think it's going to suffice if we look at that highest 01:31:53.810 --> 01:31:59.720 order term to get a sense of what the algorithm feels like, if you will, 01:31:59.720 --> 01:32:02.540 or what it even looks like graphically. 01:32:02.540 --> 01:32:06.170 All right, so with that said, we might describe bubble sort 01:32:06.170 --> 01:32:07.790 as being in big O-- 01:32:07.790 --> 01:32:11.700 sorry, selection sort as being in big O of n squared. 01:32:11.700 --> 01:32:17.030 But what if we consider now the best case scenario-- an opportunity 01:32:17.030 --> 01:32:19.070 to talk about a lower bound? 01:32:19.070 --> 01:32:23.305 In the best case, how many steps does selection sort take? 01:32:23.305 --> 01:32:24.680 Well, here, we need some context. 01:32:24.680 --> 01:32:27.320 Like, what does it mean to be the best case or the worst case 01:32:27.320 --> 01:32:29.090 when it comes to sorting? 01:32:29.090 --> 01:32:32.600 Like, what could you imagine meaning the best possible scenario when you're 01:32:32.600 --> 01:32:35.651 trying to sort a bunch of numbers? 01:32:35.651 --> 01:32:37.100 I got the whole crew here again. 01:32:37.100 --> 01:32:37.400 Yeah. 01:32:37.400 --> 01:32:39.050 AUDIENCE: They would already be sorted. 01:32:39.050 --> 01:32:40.370 DAVID J. MALAN: All right, they're already sorted, right? 01:32:40.370 --> 01:32:44.030 I can't really imagine a better scenario than I have to sort some numbers, 01:32:44.030 --> 01:32:46.040 but they're already sorted for me. 01:32:46.040 --> 01:32:51.170 But does this algorithm leverage that fact in practice? 01:32:51.170 --> 01:32:54.110 Even if all of our humans had lined up from 0 to 7, 01:32:54.110 --> 01:32:56.930 I'm pretty sure I would have pretty naively started here. 01:32:56.930 --> 01:32:58.670 And yes, Celeste happens to be here. 01:32:58.670 --> 01:33:03.375 But I only know she needs to be here once I've looked at all eight people. 01:33:03.375 --> 01:33:06.000 And then I would have realized, well, that was a waste of time. 01:33:06.000 --> 01:33:07.280 I can leave Celeste be. 01:33:07.280 --> 01:33:09.500 But then what would I have done? 01:33:09.500 --> 01:33:12.860 I would have ignored her position because we've solved one problem. 01:33:12.860 --> 01:33:16.260 I would have done the same thing now for seven people, then six people. 01:33:16.260 --> 01:33:19.320 So every time I walk through, I'm not doing much useful work. 01:33:19.320 --> 01:33:21.860 But I am doing those comparisons because I 01:33:21.860 --> 01:33:25.500 don't know until I do the work that the people were in the right order. 01:33:25.500 --> 01:33:30.440 So this would seem to imply that the omega notation, the best case 01:33:30.440 --> 01:33:33.740 scenario, even, a lower bound on the running time would be what, then? 01:33:33.740 --> 01:33:35.840 AUDIENCE: [INAUDIBLE] 01:33:35.840 --> 01:33:37.470 DAVID J. MALAN: A little louder? 01:33:37.470 --> 01:33:38.410 AUDIENCE: N squared. 01:33:38.410 --> 01:33:40.250 DAVID J. MALAN: It's still going to be n squared, 01:33:40.250 --> 01:33:45.820 in fact, because the code I'm giving myself doesn't leverage or benefit 01:33:45.820 --> 01:33:50.260 from any of that scenario because it just mindlessly continues 01:33:50.260 --> 01:33:51.770 to do this again and again. 01:33:51.770 --> 01:33:57.010 So in this case, yes, I would claim that the omega notation for selection sort 01:33:57.010 --> 01:33:58.830 is also big O of n squared. 01:33:58.830 --> 01:34:00.580 So those are the kinds of numbers to beat. 01:34:00.580 --> 01:34:03.760 It seems like the upper bound and lower bound of selection 01:34:03.760 --> 01:34:06.130 sort are indeed n squared. 01:34:06.130 --> 01:34:08.380 And so we can also describe selection sort, therefore, 01:34:08.380 --> 01:34:09.717 as being in theta of n squared. 01:34:09.717 --> 01:34:12.550 That's the first algorithm we've had the chance to describe that in, 01:34:12.550 --> 01:34:14.650 which is to say that it's kind of slow. 01:34:14.650 --> 01:34:16.595 I mean, maybe other algorithms are slower, 01:34:16.595 --> 01:34:18.220 but this isn't the best starting point. 01:34:18.220 --> 01:34:19.370 Can we do better? 01:34:19.370 --> 01:34:22.930 Well, there's a reason that I guided us to doing the second algorithm second. 01:34:22.930 --> 01:34:25.430 Even though you verbally proposed them in a different order, 01:34:25.430 --> 01:34:28.600 this second algorithm we did is generally known as bubble sort. 01:34:28.600 --> 01:34:31.210 And I deliberately used that word a bit ago, 01:34:31.210 --> 01:34:34.930 saying the big values are bubbling their way up to the right 01:34:34.930 --> 01:34:37.990 to kind of capture the fact that, indeed, this algorithm works 01:34:37.990 --> 01:34:38.620 differently. 01:34:38.620 --> 01:34:40.870 But let's consider if it's better or worse. 01:34:40.870 --> 01:34:43.960 So here, we have pseudo code for bubble sort. 01:34:43.960 --> 01:34:45.930 You could write this too in different ways. 01:34:45.930 --> 01:34:48.550 But let's consider what we did on the stage. 01:34:48.550 --> 01:34:51.850 We repeated the following n minus 1 times. 01:34:51.850 --> 01:34:55.390 We initialized at least, even though I didn't verbalize it this way, 01:34:55.390 --> 01:35:00.430 a variable like i from 0 to n minus 2, n minus 2. 01:35:00.430 --> 01:35:01.810 And then I asked this question. 01:35:01.810 --> 01:35:08.470 If numbers bracket i and numbers bracket i plus 1 are out of order, 01:35:08.470 --> 01:35:10.250 then swap them. 01:35:10.250 --> 01:35:12.580 So again, I just did it more intuitively by pointing, 01:35:12.580 --> 01:35:14.950 but this would be a way, with a bit of pseudo code, 01:35:14.950 --> 01:35:16.147 to describe what's going on. 01:35:16.147 --> 01:35:18.730 But notice that I'm doing something a little differently here. 01:35:18.730 --> 01:35:22.730 I'm iterating from if equals 0 to n minus 2. 01:35:22.730 --> 01:35:23.230 Why? 01:35:23.230 --> 01:35:26.740 Well, if I'm comparing two things, left hand and right hand, 01:35:26.740 --> 01:35:28.690 I'd still want to start at 0. 01:35:28.690 --> 01:35:31.450 But I don't want to go all the way to n minus 1 01:35:31.450 --> 01:35:35.048 because then, I'd be going past the boundary of my array, which 01:35:35.048 --> 01:35:35.590 would be bad. 01:35:35.590 --> 01:35:38.080 I want to make sure that my left hand-- i, if you will-- 01:35:38.080 --> 01:35:42.820 stops at n minus 2 so that when I plus 1 in my pseudo code, 01:35:42.820 --> 01:35:45.610 I'm looking at the last two elements, not the last element 01:35:45.610 --> 01:35:46.900 and then pass the boundary. 01:35:46.900 --> 01:35:48.733 That's actually a common programming mistake 01:35:48.733 --> 01:35:50.620 that we'll undoubtedly soon make by going 01:35:50.620 --> 01:35:52.880 beyond the boundaries of your array. 01:35:52.880 --> 01:35:59.530 So this pseudo code, then, allows me to say compare every one again and again 01:35:59.530 --> 01:36:01.730 and swap them if they're out of order. 01:36:01.730 --> 01:36:06.190 Why do I repeat the whole thing n minus 1 times? 01:36:06.190 --> 01:36:11.500 Like, why does it not suffice just to do this loop here? 01:36:11.500 --> 01:36:14.980 Think what happened with Celeste. 01:36:14.980 --> 01:36:19.480 Why do I repeat this whole thing n minus 1 times? 01:36:19.480 --> 01:36:20.230 Yeah, in the back? 01:36:20.230 --> 01:36:27.855 AUDIENCE: [INAUDIBLE] 01:36:27.855 --> 01:36:30.230 DAVID J. MALAN: Indeed, and I think if I can recap accurately, 01:36:30.230 --> 01:36:31.640 think back to Celeste again. 01:36:31.640 --> 01:36:34.130 And I'm sorry to keep calling on you as our number 0. 01:36:34.130 --> 01:36:37.670 Each time through bubble sort, she only moved one step. 01:36:37.670 --> 01:36:41.280 And so in total, if there's n locations, at the end of the day, 01:36:41.280 --> 01:36:45.930 she needs to move n minus 1 steps to get 0 all the way to where it needs to be. 01:36:45.930 --> 01:36:50.330 And so this inner loop, if you will, where we're iterating using i, 01:36:50.330 --> 01:36:52.700 that just fixes some of the problems. 01:36:52.700 --> 01:36:56.150 But it doesn't fix all of the problems until we do that same logic again 01:36:56.150 --> 01:36:57.660 and again and again. 01:36:57.660 --> 01:37:01.410 And so how might we quantify the running time of this algorithm? 01:37:01.410 --> 01:37:04.670 Well, one way to see it is to just literally look at the pseudo code. 01:37:04.670 --> 01:37:09.170 The outer loop repeats n minus 1 times by definition. 01:37:09.170 --> 01:37:10.490 It literally says that. 01:37:10.490 --> 01:37:15.960 The inner loop, the for loop, also iterates n minus 1 times. 01:37:15.960 --> 01:37:16.460 Why? 01:37:16.460 --> 01:37:18.540 Because it's going from 0 to n minus 2. 01:37:18.540 --> 01:37:22.580 And if that's hard to think about, that's the same thing is 1 to n minus 1 01:37:22.580 --> 01:37:25.320 if you just add 1 to both ends of the formula. 01:37:25.320 --> 01:37:29.910 So that means you're doing n minus 1 things n minus 1 times. 01:37:29.910 --> 01:37:32.540 So I literally multiply how many times the outer loop 01:37:32.540 --> 01:37:35.960 is running by how many times the inner loop is running, which gives me 01:37:35.960 --> 01:37:39.320 sort of FOIL method n minus 1 squared. 01:37:39.320 --> 01:37:41.300 And I could multiply that whole thing out. 01:37:41.300 --> 01:37:44.330 Well, let's consider this just a little more methodically here. 01:37:44.330 --> 01:37:48.382 If I have n minus 1 on the outer, n minus 1 on the inner-- 01:37:48.382 --> 01:37:49.590 let's go ahead and FOIL this. 01:37:49.590 --> 01:37:53.180 So n squared minus n minus n plus 1, combine 01:37:53.180 --> 01:37:56.570 like terms-- n squared minus 2n plus 1. 01:37:56.570 --> 01:38:01.470 And now which of these terms is clearly going to be dominant, so to speak? 01:38:01.470 --> 01:38:01.970 The-- 01:38:01.970 --> 01:38:02.450 AUDIENCE: N squared. 01:38:02.450 --> 01:38:03.720 DAVID J. MALAN: --the n squared. 01:38:03.720 --> 01:38:06.260 So yes, even though minus 2n is a good thing 01:38:06.260 --> 01:38:08.660 because it's subtracting off some of the time required, 01:38:08.660 --> 01:38:11.690 plus 1 is not that big a thing, there's such drops in the bucket when 01:38:11.690 --> 01:38:14.120 n gets really large, like in the millions or billions, 01:38:14.120 --> 01:38:18.510 certainly, that bubble sort 2 is on the order of n squared. 01:38:18.510 --> 01:38:21.110 It's not the same exactly as selection sort. 01:38:21.110 --> 01:38:23.030 But as n gets big, honestly, we're barely 01:38:23.030 --> 01:38:25.530 going to be able to notice the difference most likely. 01:38:25.530 --> 01:38:29.190 And so it too might be said to be on the order of n squared. 01:38:29.190 --> 01:38:34.530 And if we consider now the lower bound on bubble sort's running time, 01:38:34.530 --> 01:38:38.720 here's where things get potentially interesting. 01:38:38.720 --> 01:38:44.450 What might you claim is the running time of bubble sort in the best case? 01:38:44.450 --> 01:38:48.230 And the best case, I claim, is when the numbers are already sorted. 01:38:48.230 --> 01:38:50.720 Is our pseudo code going to take that into account? 01:38:50.720 --> 01:38:52.442 AUDIENCE: N 01:38:52.442 --> 01:38:53.150 DAVID J. MALAN: OK, n. 01:38:53.150 --> 01:38:54.350 Why do you propose n? 01:38:54.350 --> 01:38:59.043 AUDIENCE: [INAUDIBLE] 01:38:59.043 --> 01:39:00.710 DAVID J. MALAN: Yes, and that's the key word. 01:39:00.710 --> 01:39:05.150 To summarize, in bubble sort, I do have to minimally make one pass because if I 01:39:05.150 --> 01:39:07.760 don't look at all n elements, that I'm theoretically 01:39:07.760 --> 01:39:09.260 just guessing if it's sorted or not. 01:39:09.260 --> 01:39:11.480 Like, I obviously intuitively have to look 01:39:11.480 --> 01:39:14.390 at every element to decide yay or nay, it's in the right order. 01:39:14.390 --> 01:39:17.180 And my original pseudo code, though, is pretty naive. 01:39:17.180 --> 01:39:22.640 It's just going to blindly go back and forth n minus 1 times again and again, 01:39:22.640 --> 01:39:23.810 and that's going to add up. 01:39:23.810 --> 01:39:25.760 But what if I add a bit of an optimization 01:39:25.760 --> 01:39:28.010 that you might have glimpsed on the slide a moment ago 01:39:28.010 --> 01:39:31.610 where if I compare two people and I don't swap them, compare two people, 01:39:31.610 --> 01:39:34.550 don't swap them, and I go all the way through the list comparing 01:39:34.550 --> 01:39:38.120 every pair of adjacent people, and I make no swaps, 01:39:38.120 --> 01:39:40.820 it would be kind of not just naive but stupid 01:39:40.820 --> 01:39:44.360 to do that same process again because if the humans have not moved, 01:39:44.360 --> 01:39:47.150 I'm not going to make any different decisions. 01:39:47.150 --> 01:39:49.740 I'm going to do nothing again, nothing again. 01:39:49.740 --> 01:39:53.150 So at that point, it would be stupid, very inefficient, to go back and forth 01:39:53.150 --> 01:39:54.210 and back and forth. 01:39:54.210 --> 01:39:58.070 So if I modify our pseudo code with just an additional if condition, 01:39:58.070 --> 01:40:00.050 I bet we can speed this up. 01:40:00.050 --> 01:40:06.200 Inside of that same pseudo code, what if I say, hey, if no swaps, quit? 01:40:06.200 --> 01:40:10.010 Like quit, prematurely before the loops are finished running. 01:40:10.010 --> 01:40:13.380 One of the loops has gone through per the indentation here. 01:40:13.380 --> 01:40:16.190 But if I do a loop from left to right and I 01:40:16.190 --> 01:40:18.680 have made no swaps, which you can think of as just being 01:40:18.680 --> 01:40:22.250 one other variable that's plus plusing as I go keeping, track of how many 01:40:22.250 --> 01:40:22.820 swaps-- 01:40:22.820 --> 01:40:24.740 if I've made no swaps from left to right, 01:40:24.740 --> 01:40:27.240 I'm not going to make any swaps the next time around either. 01:40:27.240 --> 01:40:30.030 So let's just quit at that point. 01:40:30.030 --> 01:40:32.630 And that is to say in the best case, if you will, 01:40:32.630 --> 01:40:36.410 when the list is already sorted, the omega notation for bubble sort 01:40:36.410 --> 01:40:41.460 might indeed be omega of n if you add that optimization 01:40:41.460 --> 01:40:44.060 so as to short circuit all of that inefficient 01:40:44.060 --> 01:40:49.830 looping to do it only as many times as is necessary. 01:40:49.830 --> 01:40:52.280 Let me pause to see if there's any questions here. 01:40:52.280 --> 01:40:53.192 Yeah. 01:40:53.192 --> 01:41:02.038 AUDIENCE: [INAUDIBLE] to optimize the running time for all cases possible? 01:41:02.038 --> 01:41:03.080 DAVID J. MALAN: Good question. 01:41:03.080 --> 01:41:08.690 If the running time of selection sort and bubble sort are both in big O 01:41:08.690 --> 01:41:14.480 of n squared but selection sort is in omega of n squared while bubble sort is 01:41:14.480 --> 01:41:17.300 in omega of n, which sounds better-- 01:41:17.300 --> 01:41:20.510 I think if I may, should we just always use bubble sort? 01:41:20.510 --> 01:41:24.680 Yes if we think that we might benefit over time 01:41:24.680 --> 01:41:29.250 from a lot of good case scenarios or best case scenarios. 01:41:29.250 --> 01:41:31.340 However, the goal at hand in just a bit is 01:41:31.340 --> 01:41:33.480 going to be to do even better than both of these. 01:41:33.480 --> 01:41:35.330 So hold that question further for a moment. 01:41:35.330 --> 01:41:35.830 Yeah. 01:41:35.830 --> 01:41:41.357 AUDIENCE: [INAUDIBLE] n minus 1? 01:41:41.357 --> 01:41:41.940 DAVID J. MALAN: No. 01:41:41.940 --> 01:41:43.080 So yes, good question. 01:41:43.080 --> 01:41:46.890 So I say omega of n, but is it technically omega of n minus 1? 01:41:46.890 --> 01:41:50.110 Maybe, but again, we're throwing away lower order terms. 01:41:50.110 --> 01:41:53.670 And that's an advantage because we're not comparing things ever so precisely. 01:41:53.670 --> 01:41:57.180 Just like I plotted with the green and yellow and red chart, 01:41:57.180 --> 01:41:59.980 I just want to get a sense of the shape of these algorithms 01:41:59.980 --> 01:42:03.330 so that when n gets really large, which of these choices 01:42:03.330 --> 01:42:05.185 is going to matter the most? 01:42:05.185 --> 01:42:07.560 At the end of the day, it's actually perfectly reasonable 01:42:07.560 --> 01:42:09.460 to use selection sort or bubble sort if you 01:42:09.460 --> 01:42:12.210 don't have that much data because they're going to be pretty fast. 01:42:12.210 --> 01:42:14.520 My God, our computers nowadays are 1 gigahertz, 01:42:14.520 --> 01:42:18.440 2 gigahertz, 1 billion things per second, 2 billion things per second. 01:42:18.440 --> 01:42:20.940 But if we have large data sets, as we will later in the term 01:42:20.940 --> 01:42:23.470 and as you might in the real world, that the Googles of the world, 01:42:23.470 --> 01:42:25.530 then you're going to want to be more thoughtful. 01:42:25.530 --> 01:42:27.240 And that's where we're going today. 01:42:27.240 --> 01:42:30.090 All right, so let's actually see this visualized a little bit. 01:42:30.090 --> 01:42:32.010 In a moment, I'm going to change screens here 01:42:32.010 --> 01:42:37.350 to open up what is a little visualization tool that will give us 01:42:37.350 --> 01:42:40.920 a sense of how these things actually work and look at a faster rate 01:42:40.920 --> 01:42:42.820 than our humans are able to do here on stage. 01:42:42.820 --> 01:42:47.580 So here is another visualization of a bunch of numbers, an array of numbers. 01:42:47.580 --> 01:42:50.670 Short bars mean small numbers, tall bars mean big numbers. 01:42:50.670 --> 01:42:52.920 So instead of having the numbers on their torsos here, 01:42:52.920 --> 01:42:57.690 we just have bars that are small or tall based on the magnitude of the number. 01:42:57.690 --> 01:43:00.630 Let me go ahead, and I preconfigured this in advance 01:43:00.630 --> 01:43:02.050 to operate somewhat quickly. 01:43:02.050 --> 01:43:05.550 Let's go ahead and do selections sort by clicking this button. 01:43:05.550 --> 01:43:08.220 And you'll see some pink bars flying by. 01:43:08.220 --> 01:43:11.640 And that's like me walking left and right, left and right, 01:43:11.640 --> 01:43:14.140 to select the next smallest number. 01:43:14.140 --> 01:43:17.580 And so what you'll see happening on the left of this array of numbers 01:43:17.580 --> 01:43:20.520 is Celeste, if you will, and all of the other smaller numbers 01:43:20.520 --> 01:43:24.030 are appearing on the left while we continue to solve the remaining 01:43:24.030 --> 01:43:25.810 problems to the right. 01:43:25.810 --> 01:43:29.177 So again, we no longer have to touch the smaller numbers here. 01:43:29.177 --> 01:43:32.010 So that's why the problem is getting smaller and smaller and smaller 01:43:32.010 --> 01:43:32.790 over time. 01:43:32.790 --> 01:43:36.420 But you can notice now visually, look at how many times 01:43:36.420 --> 01:43:37.920 we're retracing our steps. 01:43:37.920 --> 01:43:40.710 This is why things that are n squared tend 01:43:40.710 --> 01:43:44.830 to be frowned upon if avoidable because I'm touching the same elements again 01:43:44.830 --> 01:43:45.330 and again. 01:43:45.330 --> 01:43:47.610 When I was walking through, I kept pointing at the same humans 01:43:47.610 --> 01:43:48.360 again and again. 01:43:48.360 --> 01:43:49.660 And that adds up. 01:43:49.660 --> 01:43:52.750 So let's see if bubble sort looks or feels a little different. 01:43:52.750 --> 01:43:56.188 Let me re-randomize the thing, and let me now click Bubble Sort at the top. 01:43:56.188 --> 01:43:58.980 And as you might infer, there's other sorting algorithms out there, 01:43:58.980 --> 01:44:00.272 not all of which we'll look at. 01:44:00.272 --> 01:44:01.740 But here's bubble sort. 01:44:01.740 --> 01:44:04.570 Same pink coloration, but it's doing something different. 01:44:04.570 --> 01:44:07.860 It's two pink bars going through again and again comparing 01:44:07.860 --> 01:44:09.550 the adjacent numbers. 01:44:09.550 --> 01:44:13.050 And you'll see that the largest numbers are indeed bubbling the way up 01:44:13.050 --> 01:44:18.120 to the right, but the smaller numbers, like our number 0 was, 01:44:18.120 --> 01:44:20.287 is only slowly making its way over. 01:44:20.287 --> 01:44:21.120 Here's a comparable. 01:44:21.120 --> 01:44:22.200 Here's the number one. 01:44:22.200 --> 01:44:24.900 And it's going to take a while to get all the way to the left. 01:44:24.900 --> 01:44:28.560 And here too, notice how many times the same bars 01:44:28.560 --> 01:44:32.590 are becoming pink, how many times the algorithm is retracing and retracing 01:44:32.590 --> 01:44:33.090 its steps. 01:44:33.090 --> 01:44:33.600 Why? 01:44:33.600 --> 01:44:37.140 Because it's only solving one problem at a time on each pass. 01:44:37.140 --> 01:44:40.717 And each time we do that, we're stepping through practically the whole array. 01:44:40.717 --> 01:44:43.800 And now granted, I could speed this up even further if I really wanted to, 01:44:43.800 --> 01:44:48.120 but my God, this is only, what, like 50 or 60 elements, something like that? 01:44:48.120 --> 01:44:49.140 This is slow. 01:44:49.140 --> 01:44:52.320 Like, this is what n squared looks like and feels like. 01:44:52.320 --> 01:44:54.570 And now I'm just trying to come up with words to say 01:44:54.570 --> 01:44:56.280 until we get to the finish line here. 01:44:56.280 --> 01:44:59.250 Like, this would be annoying if this is the speed of sorting, 01:44:59.250 --> 01:45:03.133 and this is why I sort of secretly sorted the numbers for Rave in advance 01:45:03.133 --> 01:45:05.550 because it would have taken us an annoying number of steps 01:45:05.550 --> 01:45:07.210 to get that in place for her. 01:45:07.210 --> 01:45:10.170 So those two algorithms are n squared. 01:45:10.170 --> 01:45:12.265 Can we do, in fact, better? 01:45:12.265 --> 01:45:15.390 Well, to save the best algorithm for last, let's take a shorter five minute 01:45:15.390 --> 01:45:15.890 break here. 01:45:15.890 --> 01:45:20.410 And when we come back, we'll do even better than n squared. 01:45:20.410 --> 01:45:22.480 All right. 01:45:22.480 --> 01:45:27.180 So the challenge at hand is to do better than selection sort 01:45:27.180 --> 01:45:30.270 and better than bubble sort and ideally not just marginally 01:45:30.270 --> 01:45:32.460 better but fundamentally better. 01:45:32.460 --> 01:45:35.970 Just like in week zero, that third and final divide and conquer algorithm 01:45:35.970 --> 01:45:38.860 was sort of fundamentally faster than the other two. 01:45:38.860 --> 01:45:41.950 So can we do better than something on the order of n squared? 01:45:41.950 --> 01:45:44.310 Well, I bet we can if we start to approach 01:45:44.310 --> 01:45:46.080 the problem a little differently. 01:45:46.080 --> 01:45:48.202 The sorts we've done thus far, generally known 01:45:48.202 --> 01:45:50.160 as comparison sorts-- and that kind of captures 01:45:50.160 --> 01:45:54.000 the reality that we were doing a huge number of comparisons again and again. 01:45:54.000 --> 01:45:57.360 And you kind of saw that in the vertical bars that were going pink as everything 01:45:57.360 --> 01:45:58.855 was being compared again and again. 01:45:58.855 --> 01:46:01.230 But there's this programming technique, and it's actually 01:46:01.230 --> 01:46:04.140 a mathematical technique known as recursion 01:46:04.140 --> 01:46:05.950 that we've actually seen before. 01:46:05.950 --> 01:46:09.000 And this is a building block or a mental model 01:46:09.000 --> 01:46:12.300 we can bring to bear on the problem to solve the sorting problem 01:46:12.300 --> 01:46:13.740 sort of fundamentally differently. 01:46:13.740 --> 01:46:16.620 But first, let's look at it in a more familiar context. 01:46:16.620 --> 01:46:23.190 A little bit ago, I proposed this pseudo code for the binary search algorithm. 01:46:23.190 --> 01:46:26.130 And notice that what was interesting about this code, 01:46:26.130 --> 01:46:29.970 even though I didn't call it out at the time, it's kind of cyclically defined. 01:46:29.970 --> 01:46:32.730 Like, I claim this is an algorithm for search, 01:46:32.730 --> 01:46:36.750 and yet it seems a little unfair that I'm using the verb search 01:46:36.750 --> 01:46:38.960 inside of the algorithm for search. 01:46:38.960 --> 01:46:41.720 It's like an English sort of defining a word by using the word. 01:46:41.720 --> 01:46:43.850 Normally, you shouldn't really get away with that. 01:46:43.850 --> 01:46:46.430 But there's something interesting about this technique 01:46:46.430 --> 01:46:51.110 here because even though this whole thing is a search algorithm 01:46:51.110 --> 01:46:56.510 and I'm using my own algorithm to search the left half or the right half, 01:46:56.510 --> 01:46:58.520 the key feature here that doesn't normally 01:46:58.520 --> 01:47:01.670 happen in English when you define a word in terms of a word 01:47:01.670 --> 01:47:05.300 is that when I search the left half or search the right half, yes, 01:47:05.300 --> 01:47:06.810 I'm doing the same thing. 01:47:06.810 --> 01:47:08.090 I'm using the same algorithm. 01:47:08.090 --> 01:47:11.060 But the problem is, by definition, half as large. 01:47:11.060 --> 01:47:14.180 So this isn't going to be a cyclical argument in the same way. 01:47:14.180 --> 01:47:17.750 This approach, by using search within search 01:47:17.750 --> 01:47:21.290 is going to whittle the problem down and down and down until hopefully, 01:47:21.290 --> 01:47:24.180 one door or no doors remains. 01:47:24.180 --> 01:47:26.720 And so recursion is a programming technique 01:47:26.720 --> 01:47:29.930 whereby a function calls itself. 01:47:29.930 --> 01:47:34.220 And we haven't seen this yet in C, and we haven't seen this really in Scratch. 01:47:34.220 --> 01:47:38.122 But in C, you can have a function call itself. 01:47:38.122 --> 01:47:39.830 And the form that takes is like literally 01:47:39.830 --> 01:47:44.180 using the function's name inside of the function's implementation itself. 01:47:44.180 --> 01:47:48.310 We've actually seen an opportunity for this once before too. 01:47:48.310 --> 01:47:49.310 Think back to week zero. 01:47:49.310 --> 01:47:51.560 Here's that same pseudo code for searching for someone 01:47:51.560 --> 01:47:53.180 in an actual, physical phone book. 01:47:53.180 --> 01:47:55.700 And notice these yellow lines here. 01:47:55.700 --> 01:47:59.810 We described those in week zero as inducing a loop, a cycle. 01:47:59.810 --> 01:48:04.310 And this is a very procedural approach, if you will, because lines 8 and 11 01:48:04.310 --> 01:48:06.470 are very mechanically, if you will, telling 01:48:06.470 --> 01:48:10.370 me to go back to line three to do this kind of looping thing. 01:48:10.370 --> 01:48:14.750 But really, what that's doing in the binary search algorithm for the phone 01:48:14.750 --> 01:48:20.120 book is it's just telling me to search the left half or search the right half. 01:48:20.120 --> 01:48:23.900 I'm doing it more mechanically again by sort of telling myself 01:48:23.900 --> 01:48:25.490 what line number to go back to. 01:48:25.490 --> 01:48:28.400 But that's equivalent to just telling myself go search the left half, 01:48:28.400 --> 01:48:30.950 search the right half, the key thing being the left 01:48:30.950 --> 01:48:33.720 have and the right half are smaller than the original problem. 01:48:33.720 --> 01:48:37.630 It would be a bug if I just said search the phone book, search the phone book, 01:48:37.630 --> 01:48:39.380 because obviously, you never get anywhere. 01:48:39.380 --> 01:48:41.540 But if you search the half, the half, the half, 01:48:41.540 --> 01:48:43.470 problem gets smaller and smaller. 01:48:43.470 --> 01:48:50.180 So let's reformulate week zero's phone book code to be not procedural as here 01:48:50.180 --> 01:48:55.140 but recursive whereby in this search algorithm, 01:48:55.140 --> 01:48:58.580 AKA binary search, formerly called divide and conquer, I'm 01:48:58.580 --> 01:49:01.940 going to literally use also the keyword search here. 01:49:01.940 --> 01:49:03.950 Notice among the benefits of doing this is it 01:49:03.950 --> 01:49:06.798 kind of tightens the code up, makes it a little more succinct, 01:49:06.798 --> 01:49:08.840 even though that's kind of a fringe benefit here. 01:49:08.840 --> 01:49:12.350 But it's an elegant way too of describing 01:49:12.350 --> 01:49:17.210 a problem by just having a function use itself 01:49:17.210 --> 01:49:21.330 to solve a smaller puzzle at hand. 01:49:21.330 --> 01:49:24.380 So let's now consider a familiar problem, a smaller 01:49:24.380 --> 01:49:27.132 version than the one you've dabbled with-- this sort of pyramid, 01:49:27.132 --> 01:49:28.340 this half pyramid from Mario. 01:49:28.340 --> 01:49:31.070 And let's throw away the parts that aren't that interesting 01:49:31.070 --> 01:49:35.420 and just consider how we might, up until now, implement this in C code, 01:49:35.420 --> 01:49:37.520 this left aligned pyramid, if you will. 01:49:37.520 --> 01:49:44.630 Let me go over here, and let me create a file called-- how about iteration.c? 01:49:44.630 --> 01:49:48.080 And in this file, I'm going to go ahead and include cs50.h. 01:49:48.080 --> 01:49:51.890 And I'm going to include stdio.h. 01:49:51.890 --> 01:49:57.440 And the goal at hand is to implement in C a little program that just prints out 01:49:57.440 --> 01:49:59.280 this and exactly this pyramid. 01:49:59.280 --> 01:50:02.113 So no get string or any of that-- we're just going to keep it simple 01:50:02.113 --> 01:50:05.400 and print exactly this pyramid of height 4 here. 01:50:05.400 --> 01:50:06.570 So how might I do this? 01:50:06.570 --> 01:50:10.945 Well, let me go ahead, and in main, let me first ask the user for-- 01:50:10.945 --> 01:50:12.570 well, we'll go ahead and generalize it. 01:50:12.570 --> 01:50:14.403 Let's go ahead and ask the user for heights. 01:50:14.403 --> 01:50:16.250 We're using getint as before. 01:50:16.250 --> 01:50:18.500 And I'll store that in a variable called height. 01:50:18.500 --> 01:50:20.750 And then let me go ahead and simply call the function 01:50:20.750 --> 01:50:22.410 draw passing in that height. 01:50:22.410 --> 01:50:25.040 So for the moment, let me assume that someone somewhere 01:50:25.040 --> 01:50:26.630 has implemented a draw function. 01:50:26.630 --> 01:50:29.810 And this, then, is the entirety of my program. 01:50:29.810 --> 01:50:32.970 All right, unfortunately, C does not come with a draw function. 01:50:32.970 --> 01:50:34.820 So let me go ahead and invent one. 01:50:34.820 --> 01:50:36.300 It doesn't need to return a value. 01:50:36.300 --> 01:50:38.850 It just needs to print something-- so-called side effect. 01:50:38.850 --> 01:50:43.460 So I'm going to define a function called draw that takes as input an int. 01:50:43.460 --> 01:50:46.280 I'll call it n for number, but I could call it anything I want. 01:50:46.280 --> 01:50:47.850 And inside of this. 01:50:47.850 --> 01:50:53.240 I'm going to go ahead and print out a left aligned pyramid like this from top 01:50:53.240 --> 01:50:53.990 to bottom. 01:50:53.990 --> 01:50:57.710 The salient features here are that this is a pyramid, at least in this example, 01:50:57.710 --> 01:50:58.820 of height four. 01:50:58.820 --> 01:51:02.240 And now in height four, the first row has one brick. 01:51:02.240 --> 01:51:03.470 The second row has two. 01:51:03.470 --> 01:51:04.400 The third has three. 01:51:04.400 --> 01:51:05.270 The fourth has four. 01:51:05.270 --> 01:51:08.100 That's a nice pattern that I can probably represent in code. 01:51:08.100 --> 01:51:09.210 So how might I do this? 01:51:09.210 --> 01:51:11.390 Well, how about 4 int i gets-- 01:51:11.390 --> 01:51:12.950 let me do it the old school way-- 01:51:12.950 --> 01:51:13.730 1. 01:51:13.730 --> 01:51:18.530 And then i is less than or equal to n. 01:51:18.530 --> 01:51:20.180 And then i plus plus-- 01:51:20.180 --> 01:51:23.810 so I'm going from 1 to 4 just to keep myself sane here. 01:51:23.810 --> 01:51:26.690 And then inside of this loop, what do I want to do? 01:51:26.690 --> 01:51:28.560 Well, let me keep it conventional, in fact. 01:51:28.560 --> 01:51:31.970 Let me just change this to be the more conventional 0 01:51:31.970 --> 01:51:36.170 to n even though it might not be as intuitive because now on row 0, 01:51:36.170 --> 01:51:37.550 I want one brick. 01:51:37.550 --> 01:51:39.965 On row 1, I want two bricks, dot dot dot. 01:51:39.965 --> 01:51:41.300 On row 3, I want four. 01:51:41.300 --> 01:51:42.530 So it's kind of offset now. 01:51:42.530 --> 01:51:44.240 But I'm being more conventional. 01:51:44.240 --> 01:51:48.440 So on each row, how many bricks do I want to print? 01:51:48.440 --> 01:51:49.880 Well, I think I want to do this. 01:51:49.880 --> 01:51:55.700 For int j, for instance, common to use j after if you have a nested loop, 01:51:55.700 --> 01:52:02.930 let's start j at 0 and do this so long as is less than i plus 1 01:52:02.930 --> 01:52:04.460 and then do j plus plus. 01:52:04.460 --> 01:52:06.500 So why i plus 1? 01:52:06.500 --> 01:52:10.670 Well, again, when I equals 0, that's the first row, and I want one brick. 01:52:10.670 --> 01:52:12.650 When i equals 1, that's the second row. 01:52:12.650 --> 01:52:13.430 I want two bricks. 01:52:13.430 --> 01:52:16.250 And dot dot dot, when i is 3, I want four bricks. 01:52:16.250 --> 01:52:19.460 So again, I have to add 1 to i to get the total number of bricks 01:52:19.460 --> 01:52:21.120 that I want to print to the screen. 01:52:21.120 --> 01:52:25.820 So inside of this nested for loop, I'm going to do printf of a hash 01:52:25.820 --> 01:52:28.970 with no new line. 01:52:28.970 --> 01:52:33.112 I'm going to save the new line for about here instead. 01:52:33.112 --> 01:52:34.820 All right, the last thing I'm going to do 01:52:34.820 --> 01:52:38.550 is copy and paste the prototype at the top of the file. 01:52:38.550 --> 01:52:39.620 So that I can call this. 01:52:39.620 --> 01:52:42.728 And again, this is of now week one, week two. 01:52:42.728 --> 01:52:45.020 Wouldn't necessarily come to your mind as quickly as it 01:52:45.020 --> 01:52:48.380 might to mine after all this practice, but this is something reminiscent 01:52:48.380 --> 01:52:51.230 of what you yourself did already for Mario-- printing out 01:52:51.230 --> 01:52:54.330 a pyramid that hopefully in a moment is going to look like this. 01:52:54.330 --> 01:52:56.370 So let me go back to my code. 01:52:56.370 --> 01:53:00.560 Let me run make iteration, and let me do dot slash iteration. 01:53:00.560 --> 01:53:02.540 I'll type in 4, and voila. 01:53:02.540 --> 01:53:05.840 Seems to be correct, and let's assume it's going to work for other inputs 01:53:05.840 --> 01:53:06.500 as well. 01:53:06.500 --> 01:53:07.310 Oh, thank you. 01:53:11.770 --> 01:53:15.460 So this is indeed an example of iteration-- doing something 01:53:15.460 --> 01:53:16.780 again and again. 01:53:16.780 --> 01:53:18.160 And it's very procedural. 01:53:18.160 --> 01:53:21.160 Like, I literally have a function called draw that does this thing. 01:53:21.160 --> 01:53:25.570 But I can think about implementing draw in a somewhat different way that's 01:53:25.570 --> 01:53:26.710 kind of clever. 01:53:26.710 --> 01:53:28.720 And it's not strictly necessary for this problem 01:53:28.720 --> 01:53:31.360 because this problem honestly is not that complicated 01:53:31.360 --> 01:53:33.387 to solve once you have practice under your belt. 01:53:33.387 --> 01:53:36.220 Certainly the first time around, probably significantly challenging. 01:53:36.220 --> 01:53:39.250 But now that you kind of associate, OK, row one 01:53:39.250 --> 01:53:42.010 with one brick, row two with two bricks, it kind of comes together 01:53:42.010 --> 01:53:43.340 with these for loops. 01:53:43.340 --> 01:53:45.770 But how else could we think about this problem? 01:53:45.770 --> 01:53:48.940 Well, this physical structure, these bricks, in some sense 01:53:48.940 --> 01:53:54.980 is a recursive structure, a structure that's defined in terms of itself. 01:53:54.980 --> 01:53:56.360 Now what do I mean by that? 01:53:56.360 --> 01:54:01.600 Well, if I were to ask you the question, what does a pyramid of height 4 01:54:01.600 --> 01:54:04.720 look like, you would point, of course, to this picture. 01:54:04.720 --> 01:54:11.170 But you could also kind of cleverly say to me, well, 01:54:11.170 --> 01:54:16.240 it's actually a pyramid of height 3 plus 1 additional row. 01:54:16.240 --> 01:54:18.237 And here's that cyclical argument, right? 01:54:18.237 --> 01:54:21.070 Kind of obnoxious to do typically in English or in a spoken language 01:54:21.070 --> 01:54:23.200 because you're defining one thing in terms of itself. 01:54:23.200 --> 01:54:24.408 What's a pyramid of height 4? 01:54:24.408 --> 01:54:28.180 Well, it's a pyramid of height 3 plus 1 more row. 01:54:28.180 --> 01:54:30.940 But we can kind of leverage this logic in code. 01:54:30.940 --> 01:54:32.590 Well, what's a pyramid of height 3? 01:54:32.590 --> 01:54:34.870 Well, it's a pyramid of height 2 plus 1 more row. 01:54:34.870 --> 01:54:36.940 Fine, what's a pyramid of height 2? 01:54:36.940 --> 01:54:39.370 Well, it's a pyramid of height 1 plus 1 more row. 01:54:39.370 --> 01:54:42.370 And then hopefully, this process ends, and it does because notice, 01:54:42.370 --> 01:54:44.990 the pyramid is getting smaller and smaller. 01:54:44.990 --> 01:54:48.610 So you're not going to have this sort of silly back and forth with me 01:54:48.610 --> 01:54:52.550 infinitely many times because when we finally get to the base case, 01:54:52.550 --> 01:54:54.130 the end of the pyramid, fine. 01:54:54.130 --> 01:54:55.600 What is a pyramid of height 1? 01:54:55.600 --> 01:54:58.630 Well, it's a pyramid of no height plus one more row. 01:54:58.630 --> 01:55:01.270 And at that point, things just get negative-- 01:55:01.270 --> 01:55:02.050 no pun intended. 01:55:02.050 --> 01:55:04.035 Things just would otherwise go negative. 01:55:04.035 --> 01:55:05.410 And so you can just kind of stop. 01:55:05.410 --> 01:55:07.930 The base case is when there is no more pyramid. 01:55:07.930 --> 01:55:12.080 So there's a way to draw a line in the sand and say, stop, no more arguments. 01:55:12.080 --> 01:55:16.180 But this idea of defining a physical structure in terms of itself 01:55:16.180 --> 01:55:22.180 or code in terms of itself actually lets us do some interesting new algorithms. 01:55:22.180 --> 01:55:24.160 Let me go back to my code here. 01:55:24.160 --> 01:55:29.800 Let me go ahead and create one final file here called recursion.c 01:55:29.800 --> 01:55:36.100 that leverages this idea of this built-in self-referential nature. 01:55:36.100 --> 01:55:38.080 Let me include cs50.h. 01:55:38.080 --> 01:55:41.740 Let me go ahead and include standardio.h, int main void. 01:55:41.740 --> 01:55:46.000 And then inside of main, I'm going to do the exact same thing-- int 01:55:46.000 --> 01:55:50.470 height equals get int, asking the user for height. 01:55:50.470 --> 01:55:53.870 And then I'm going to go ahead and call draw passing in height. 01:55:53.870 --> 01:55:55.520 So that's going to stay the same. 01:55:55.520 --> 01:56:01.580 I even am going to make my prototype the same-- void draw int n semicolon. 01:56:01.580 --> 01:56:03.820 And now I'm going to implement void down here 01:56:03.820 --> 01:56:05.590 with that same prototype, of course. 01:56:05.590 --> 01:56:08.470 But the code now is going to be a little different. 01:56:08.470 --> 01:56:10.070 What am I going to do here? 01:56:10.070 --> 01:56:15.610 Well, first of all, if you ask me to draw a pyramid of height n, 01:56:15.610 --> 01:56:18.580 I'm going to be kind of a wise ass here and say, well, just 01:56:18.580 --> 01:56:21.010 draw a pyramid of n minus 1-- 01:56:21.010 --> 01:56:21.762 done. 01:56:21.762 --> 01:56:24.220 All right, but there's still a little more work to be done. 01:56:24.220 --> 01:56:28.660 What happens after I print or draw a pyramid of height n minus 1 01:56:28.660 --> 01:56:33.340 according to our structural definition a moment ago? 01:56:33.340 --> 01:56:38.470 What remains after drawing a pyramid of height n minus 1 or 3, specifically? 01:56:38.470 --> 01:56:40.840 AUDIENCE: [INAUDIBLE] 01:56:40.840 --> 01:56:42.400 We need one more row of hashes. 01:56:42.400 --> 01:56:43.750 OK, so I can do that, right? 01:56:43.750 --> 01:56:45.213 I'm OK with the single loops. 01:56:45.213 --> 01:56:46.630 There's no nesting necessary here. 01:56:46.630 --> 01:56:51.010 I'm just going to do this-- for int i get 0, i is less than n, 01:56:51.010 --> 01:56:53.185 which is the height that's passed in, i plus plus. 01:56:53.185 --> 01:56:55.060 And then inside of this loop, I'm very simply 01:56:55.060 --> 01:56:56.740 going to print out a single hash. 01:56:56.740 --> 01:57:00.650 And then down here, I'm going to print out a new line at the very end. 01:57:00.650 --> 01:57:01.600 So that's good, right? 01:57:01.600 --> 01:57:03.720 I might not be as comfortable with nested loops. 01:57:03.720 --> 01:57:04.720 This is nice and simple. 01:57:04.720 --> 01:57:07.660 What does this loop do here on line 17 through 20? 01:57:07.660 --> 01:57:13.540 It literally prints n hashes by counting from i equals 0 on up to 01:57:13.540 --> 01:57:15.130 but not through n. 01:57:15.130 --> 01:57:17.930 So that's sort of week one style syntax. 01:57:17.930 --> 01:57:20.740 But this is kind of trippy now because I've somehow 01:57:20.740 --> 01:57:25.480 boiled down the implementation of draw into printing a row after just 01:57:25.480 --> 01:57:27.250 drawing the thing above it. 01:57:27.250 --> 01:57:31.250 But this is problematic as is because in this case, 01:57:31.250 --> 01:57:37.900 my drawer function, notice, is always going to call the draw function forever 01:57:37.900 --> 01:57:38.890 in some sense. 01:57:38.890 --> 01:57:44.200 But ideally, when do I want this cyclical process to stop? 01:57:44.200 --> 01:57:48.010 When do I want to not call draw anymore? 01:57:48.010 --> 01:57:51.297 Yeah, when n is 1, right? 01:57:51.297 --> 01:57:53.380 When I get to the top of the pyramid, when n is 1, 01:57:53.380 --> 01:57:55.720 or heck, when the pyramids all gone and n equals 0. 01:57:55.720 --> 01:57:58.060 I can pick any line in the sand, so long as it's 01:57:58.060 --> 01:57:59.690 sort of at the end of the process. 01:57:59.690 --> 01:58:01.480 Then I don't want to call draw anymore. 01:58:01.480 --> 01:58:03.850 So maybe what I should do is this. 01:58:03.850 --> 01:58:10.450 If n equals equals 0, there's really nothing to draw. 01:58:10.450 --> 01:58:14.340 So I'm just going to go ahead and return like this. 01:58:14.340 --> 01:58:16.920 Otherwise, I'm going to go ahead and draw 01:58:16.920 --> 01:58:20.238 n minus 1 rows and then one more row. 01:58:20.238 --> 01:58:21.780 And I could express this differently. 01:58:21.780 --> 01:58:24.540 I could do something like this, which would be equivalent. 01:58:24.540 --> 01:58:29.490 I could say something like if n is greater than or equal to 0, 01:58:29.490 --> 01:58:31.462 then go ahead and draw the row. 01:58:31.462 --> 01:58:32.670 But I like it this way first. 01:58:32.670 --> 01:58:34.587 For now, I'm going to go with the original way 01:58:34.587 --> 01:58:37.740 just to ask a simple question and then just bail out of the function 01:58:37.740 --> 01:58:38.970 if n equals 0. 01:58:38.970 --> 01:58:41.740 And heck, just to be super safe, just in case 01:58:41.740 --> 01:58:44.040 the user types in a negative number, let me also 01:58:44.040 --> 01:58:46.980 just check if n is a negative number, also, just return immediately. 01:58:46.980 --> 01:58:48.210 Don't do anything. 01:58:48.210 --> 01:58:51.490 I'm not returning a value because again, the function is void. 01:58:51.490 --> 01:58:53.680 It doesn't need or have a return value. 01:58:53.680 --> 01:58:55.860 So just saying return suffices. 01:58:55.860 --> 01:59:00.660 But if n equals 1 or 2 or 3 or anything higher, 01:59:00.660 --> 01:59:06.120 it is reasonable to draw a pyramid of slightly shorter height like, instead 01:59:06.120 --> 01:59:10.680 of 4, 3, and then go ahead and print one more row. 01:59:10.680 --> 01:59:16.450 So this is an example now of code that calls itself within itself. 01:59:16.450 --> 01:59:18.150 Draw is calling draw. 01:59:18.150 --> 01:59:22.920 But this so-called base case ensures, this conditional ensures, 01:59:22.920 --> 01:59:24.690 that we're not going to do this forever. 01:59:24.690 --> 01:59:27.300 Otherwise, we literally would do this infinitely many times, 01:59:27.300 --> 01:59:30.330 and something bad is probably going to happen. 01:59:30.330 --> 01:59:34.230 All right, let me go ahead and compile this code-- make recursion. 01:59:34.230 --> 01:59:38.250 OK, no syntax errors-- dot slash recursion, Enter, height of 4, 01:59:38.250 --> 01:59:40.590 and voila. 01:59:40.590 --> 01:59:44.400 If only because some of you have run into this issue accidentally already, 01:59:44.400 --> 01:59:48.420 let me get rid of the base case here, and let me recompile the code. 01:59:48.420 --> 01:59:49.980 Make recursion. 01:59:49.980 --> 01:59:52.150 Oh, and actually, now it's actually catching it. 01:59:52.150 --> 01:59:55.050 So the compiler is smart enough here to realize 01:59:55.050 --> 01:59:58.470 that all paths through this function will call itself. 01:59:58.470 --> 02:00:01.300 AKA, It's going to loop forever. 02:00:01.300 --> 02:00:03.090 So let me do the first thing. 02:00:03.090 --> 02:00:05.550 Suppose I only check for n equaling 0. 02:00:05.550 --> 02:00:09.390 Let me go ahead and recompile this code with make recursion. 02:00:09.390 --> 02:00:12.030 And now let me just be kind of uncooperative. 02:00:12.030 --> 02:00:16.200 When I run this program, still works for 4, still works for 0. 02:00:16.200 --> 02:00:19.110 What if I do like negative 100? 02:00:19.110 --> 02:00:22.920 Have any of you experienced a segmentation fault or core dump? 02:00:22.920 --> 02:00:24.240 OK, so no shame in this. 02:00:24.240 --> 02:00:28.980 Like, this means I have somehow touched memory that I shouldn't have. 02:00:28.980 --> 02:00:32.910 And in short, I actually called this function thousands of times 02:00:32.910 --> 02:00:35.550 accidentally, it would seem now, until the program just 02:00:35.550 --> 02:00:38.413 bailed on me because I eventually touched memory in the computer 02:00:38.413 --> 02:00:39.330 that I shouldn't have. 02:00:39.330 --> 02:00:41.140 That'll make even more sense next week. 02:00:41.140 --> 02:00:42.640 But for now, it's simply a bug. 02:00:42.640 --> 02:00:44.430 And I can avoid that bug in this context, 02:00:44.430 --> 02:00:48.900 probably not your own pset context, by just making sure we don't even 02:00:48.900 --> 02:00:51.340 allow for negative numbers at all. 02:00:51.340 --> 02:00:53.700 So with this building block in place, what 02:00:53.700 --> 02:00:57.420 can we now do in terms of those same numbers to sort? 02:00:57.420 --> 02:01:00.390 Well, it turns out there's a sorting algorithm called merge sort. 02:01:00.390 --> 02:01:01.980 And there's bunches of others too. 02:01:01.980 --> 02:01:06.923 But merge sort is a nice one to discuss because it fundamentally, we hope, 02:01:06.923 --> 02:01:09.090 is going to do better than selection sort and bubble 02:01:09.090 --> 02:01:11.490 sort that is better than n squared. 02:01:11.490 --> 02:01:14.238 But the catch is it's a little harder to think about. 02:01:14.238 --> 02:01:17.280 In fact, I'll act it out myself with just these numbers on the shelf here 02:01:17.280 --> 02:01:21.360 rather than humans because recursion in general takes a little bit of effort 02:01:21.360 --> 02:01:23.700 to wrap your mind around, typically a bit of practice. 02:01:23.700 --> 02:01:25.908 But I'll see if we can't walk through it methodically 02:01:25.908 --> 02:01:28.260 enough such that this comes to light. 02:01:28.260 --> 02:01:32.460 So here's the pseudo code I propose for this algorithm called merge sort. 02:01:32.460 --> 02:01:35.640 In the spirit of recursion, this sorting algorithm 02:01:35.640 --> 02:01:41.020 literally calls itself by using the verb sort in its pseudo code. 02:01:41.020 --> 02:01:42.600 So how does merge sort work? 02:01:42.600 --> 02:01:46.020 It sort of obnoxiously says, well, if you want to sort all of these things, 02:01:46.020 --> 02:01:48.720 go sort the left half, then go sort the right half, 02:01:48.720 --> 02:01:50.430 and then merge the two together. 02:01:50.430 --> 02:01:51.682 Now obnoxious in what sense? 02:01:51.682 --> 02:01:54.390 Well, if I just asked you to sort something and you just tell me, 02:01:54.390 --> 02:01:56.070 well, go sort that thing and then go sort 02:01:56.070 --> 02:01:58.737 that thing, what was the point of asking you in the first place? 02:01:58.737 --> 02:02:01.080 But the key is that each of these lines is 02:02:01.080 --> 02:02:03.880 sorting a smaller piece of the problem. 02:02:03.880 --> 02:02:06.450 So eventually, we'll be able to pare this down 02:02:06.450 --> 02:02:10.410 into something that doesn't go on forever because in fact, in merge sort, 02:02:10.410 --> 02:02:12.120 there's a base case too. 02:02:12.120 --> 02:02:14.940 There's a scenario where we just check, wait a minute, 02:02:14.940 --> 02:02:17.520 if there's only one number to sort, that's it. 02:02:17.520 --> 02:02:19.470 Quit then because you're all done. 02:02:19.470 --> 02:02:22.590 So there has to be this base case in any use of recursion 02:02:22.590 --> 02:02:27.090 to make sure that you don't mindlessly call yourself forever. 02:02:27.090 --> 02:02:29.380 You've got to stop at some point. 02:02:29.380 --> 02:02:32.560 So let's focus on the third of these steps. 02:02:32.560 --> 02:02:37.328 What does it mean to merge two lists, two halves of a list, 02:02:37.328 --> 02:02:39.120 just because this is apparently going to be 02:02:39.120 --> 02:02:41.290 a key ingredient-- so here, for instance, 02:02:41.290 --> 02:02:43.890 are two halves of a list of size 8. 02:02:43.890 --> 02:02:47.010 We have the numbers 2-- and I'll call it out if you're at a bad angle-- 02:02:47.010 --> 02:02:52.200 2457 and 0136. 02:02:52.200 --> 02:02:56.710 Notice that the left half at the moment, 2457, is already sorted, 02:02:56.710 --> 02:03:01.000 and the right half, 0136, is also sorted as well. 02:03:01.000 --> 02:03:03.870 So that's a good thing because it means that theoretically, I've 02:03:03.870 --> 02:03:05.280 sorted the left half already. 02:03:05.280 --> 02:03:07.620 I've sorted the right half already before we began. 02:03:07.620 --> 02:03:09.690 I just need to merge these two halves. 02:03:09.690 --> 02:03:11.800 What does it mean to sort two halves? 02:03:11.800 --> 02:03:13.550 Well, for the sake of discussion, I'm just 02:03:13.550 --> 02:03:19.340 going to turn over most of the numbers except for the first numbers in each 02:03:19.340 --> 02:03:20.360 of these halves. 02:03:20.360 --> 02:03:23.390 There's two halves here, left and right. 02:03:23.390 --> 02:03:25.460 At the moment, I'm only going to consider 02:03:25.460 --> 02:03:29.532 the leftmost element of each half-- that is, the one on the left here 02:03:29.532 --> 02:03:30.740 and the one on the left here. 02:03:30.740 --> 02:03:33.800 How do I merge these two lists together? 02:03:33.800 --> 02:03:38.540 Well, if I look at 2 and I look at 0, which one should presumably come first? 02:03:38.540 --> 02:03:39.502 The smaller one. 02:03:39.502 --> 02:03:41.210 So I'm going to grab the 0, and I'm going 02:03:41.210 --> 02:03:44.150 to put it into its own place on this new shelf here. 02:03:44.150 --> 02:03:50.300 And now I'm going to consider, as part of my iteration, 02:03:50.300 --> 02:03:53.270 the beginning of this list and the new beginning of this list. 02:03:53.270 --> 02:03:55.130 So I'm now comparing 2 and 1. 02:03:55.130 --> 02:03:56.210 Which one's smaller? 02:03:56.210 --> 02:03:58.400 I'm going to go ahead and grab the 1. 02:03:58.400 --> 02:04:01.130 Now I'm going to compare the beginning of the left list 02:04:01.130 --> 02:04:03.440 and the new beginning of the right list, 2 and 3. 02:04:03.440 --> 02:04:05.018 Of course, it's 2. 02:04:05.018 --> 02:04:07.310 Now I'm going to compare the beginning of the left list 02:04:07.310 --> 02:04:09.290 and the beginning of the right list, 4 and 3. 02:04:09.290 --> 02:04:11.510 It's of course 3. 02:04:11.510 --> 02:04:14.250 Now I'm going to compare the 4 against the beginning and end, 02:04:14.250 --> 02:04:15.830 it turns out, of the second list-- 02:04:15.830 --> 02:04:16.988 4, of course. 02:04:16.988 --> 02:04:19.280 Now I'm going to compare the beginning of the left list 02:04:19.280 --> 02:04:20.822 and the beginning of the right list-- 02:04:20.822 --> 02:04:22.430 5, of course. 02:04:22.430 --> 02:04:25.010 I'm realizing this is not going to end well because I left 02:04:25.010 --> 02:04:26.420 too much distance between the numbers. 02:04:26.420 --> 02:04:28.337 But that has nothing to do with the algorithm. 02:04:28.337 --> 02:04:29.840 7 is the beginning of the left list. 02:04:29.840 --> 02:04:31.460 6 is the beginning of the right list. 02:04:31.460 --> 02:04:32.960 It's, of course, 6. 02:04:32.960 --> 02:04:36.410 And at the risk of knocking all of these over, 02:04:36.410 --> 02:04:43.310 if I now make room for this element, we have hopefully 02:04:43.310 --> 02:04:50.400 sorted the whole thing by having merged together the two halves of the list. 02:04:50.400 --> 02:04:52.040 So in short-- thank you. 02:04:56.448 --> 02:04:58.740 I'm a little worried that's just getting sarcastic now, 02:04:58.740 --> 02:05:03.870 but we now have merged two half lists. 02:05:03.870 --> 02:05:07.430 We haven't done the guts of the algorithm yet-- sort the left half 02:05:07.430 --> 02:05:08.430 and sort the right half. 02:05:08.430 --> 02:05:11.160 But I claim that that is how mechanically you 02:05:11.160 --> 02:05:12.960 merge two sorted halves. 02:05:12.960 --> 02:05:15.060 You keep looking at the beginning of each list, 02:05:15.060 --> 02:05:17.640 and you just kind of weave them together based 02:05:17.640 --> 02:05:21.070 on which one belongs first based on its size. 02:05:21.070 --> 02:05:23.460 So if you agree that that was a reasonable way 02:05:23.460 --> 02:05:28.020 to merge two lists together, let's go ahead and focus lastly 02:05:28.020 --> 02:05:31.260 on what it means to actually sort the left half 02:05:31.260 --> 02:05:33.437 and sort the right half of a whole bunch of numbers. 02:05:33.437 --> 02:05:35.520 And for this, I'm going to go ahead and order them 02:05:35.520 --> 02:05:37.320 in this seemingly random order. 02:05:37.320 --> 02:05:40.200 And I just have a little cheat sheet above so that I don't mess up. 02:05:40.200 --> 02:05:42.330 And I'm going to start at the very top this time. 02:05:42.330 --> 02:05:45.390 And hopefully, these will not fall down at any point. 02:05:45.390 --> 02:05:51.300 But I'm just deliberately putting them in this random order, 5274. 02:05:51.300 --> 02:05:53.703 And then we have 1630-- 02:05:53.703 --> 02:05:57.000 1630. 02:05:57.000 --> 02:05:59.340 Hopefully this won't fall over. 02:05:59.340 --> 02:06:03.990 Here is now an array of size 8 with eight integers. 02:06:03.990 --> 02:06:05.290 And I want to sort this. 02:06:05.290 --> 02:06:08.332 I could use selection sort and just go back and forth and back and forth. 02:06:08.332 --> 02:06:11.100 I could use bubble sort and just compare pairs, pairs, pairs. 02:06:11.100 --> 02:06:13.830 But those are going to be on the order of big O of n squared. 02:06:13.830 --> 02:06:16.000 My hope is to do fundamentally better here. 02:06:16.000 --> 02:06:17.760 So let's see if we can do better. 02:06:17.760 --> 02:06:19.800 All right, so let me look now at my code. 02:06:19.800 --> 02:06:21.490 I'll keep it on the screen. 02:06:21.490 --> 02:06:23.310 How do I implement merge sort? 02:06:23.310 --> 02:06:25.303 Well, if there's only one number, I quit. 02:06:25.303 --> 02:06:26.220 There's obviously not. 02:06:26.220 --> 02:06:28.270 There's eight numbers, so that's not applicable. 02:06:28.270 --> 02:06:30.603 I'm going to go ahead and sort the left half of numbers. 02:06:30.603 --> 02:06:32.460 All right, here's the left half-- 02:06:32.460 --> 02:06:34.470 5274. 02:06:34.470 --> 02:06:37.230 Do I sort an array of size 4? 02:06:37.230 --> 02:06:40.170 Well, here's where the recursion kicks in. 02:06:40.170 --> 02:06:42.150 How do you sort a list of size 4? 02:06:42.150 --> 02:06:44.310 Well, there's the pseudo code on the board. 02:06:44.310 --> 02:06:47.490 I sort the left half of the list of size 4. 02:06:47.490 --> 02:06:49.200 So here we go. 02:06:49.200 --> 02:06:50.550 I have a list of size 4. 02:06:50.550 --> 02:06:51.390 How do I sort it? 02:06:51.390 --> 02:06:52.680 I sort the left half. 02:06:52.680 --> 02:06:54.480 All right, now I have a list of size 2. 02:06:54.480 --> 02:06:56.680 How do I sort this? 02:06:56.680 --> 02:06:58.600 Well, sort the left half. 02:06:58.600 --> 02:06:59.520 So here we go. 02:06:59.520 --> 02:07:01.020 Here's a list of size 1. 02:07:01.020 --> 02:07:03.996 How do I sort this? 02:07:03.996 --> 02:07:05.640 I think it's done, right? 02:07:05.640 --> 02:07:06.480 That's quit, right? 02:07:06.480 --> 02:07:08.010 If only one number, I'm done. 02:07:08.010 --> 02:07:09.233 The 5 is sorted. 02:07:09.233 --> 02:07:10.650 All right, what was the next step? 02:07:10.650 --> 02:07:12.180 You have to now rewind in time. 02:07:12.180 --> 02:07:16.510 I just sorted the left half of the left half of the left half. 02:07:16.510 --> 02:07:17.970 What do I now sort? 02:07:17.970 --> 02:07:20.850 The right half, which is 2. 02:07:20.850 --> 02:07:22.030 This is one element. 02:07:22.030 --> 02:07:22.740 So I'm done. 02:07:22.740 --> 02:07:27.060 So now at this point in the story, I have sorted, sort of idiotically-- 02:07:27.060 --> 02:07:29.850 the 5 assorted, and the 2 is sorted. 02:07:29.850 --> 02:07:34.410 But what's the third and final step of this phase of the algorithm? 02:07:34.410 --> 02:07:35.650 Merge the two together. 02:07:35.650 --> 02:07:37.600 So here's the left, here's the right list. 02:07:37.600 --> 02:07:38.850 How do I merge these together? 02:07:38.850 --> 02:07:41.530 I compare the lists, and I put the two there. 02:07:41.530 --> 02:07:43.590 I only have the [? 5 ?] left, and I do that. 02:07:43.590 --> 02:07:45.990 So now we see some visible progress. 02:07:45.990 --> 02:07:47.190 But again, let's rewind. 02:07:47.190 --> 02:07:48.220 How did we get here? 02:07:48.220 --> 02:07:52.860 We started to sort the left half of the left half of the left half, then 02:07:52.860 --> 02:07:53.700 the right half. 02:07:53.700 --> 02:07:54.900 And now where are we? 02:07:54.900 --> 02:07:58.090 We've just sorted the left half of the left half. 02:07:58.090 --> 02:08:01.140 So what comes after sorting the left half of anything? 02:08:01.140 --> 02:08:01.692 Right half. 02:08:01.692 --> 02:08:03.900 All right, here's the sort of same nonsensical thing. 02:08:03.900 --> 02:08:05.460 Here's a list of size 2. 02:08:05.460 --> 02:08:06.690 Let's sort the left half. 02:08:06.690 --> 02:08:07.347 Done. 02:08:07.347 --> 02:08:08.430 Let's sort the right half. 02:08:08.430 --> 02:08:09.120 Done. 02:08:09.120 --> 02:08:10.170 What's the third step? 02:08:10.170 --> 02:08:11.590 Merge them together. 02:08:11.590 --> 02:08:14.880 So that's the 4, and that's the 7. 02:08:14.880 --> 02:08:16.110 What have I now done? 02:08:16.110 --> 02:08:21.070 In total, I've now sorted the left half of the original thing. 02:08:21.070 --> 02:08:23.092 So what happens next? 02:08:23.092 --> 02:08:24.300 Wait a minute, wait a minute. 02:08:24.300 --> 02:08:25.560 I have not done that. 02:08:25.560 --> 02:08:26.550 What have I done? 02:08:26.550 --> 02:08:29.220 I have sorted the left half of the left half, 02:08:29.220 --> 02:08:32.350 and I've sorted the right half of the left half. 02:08:32.350 --> 02:08:34.710 What do I now need to do lastly? 02:08:34.710 --> 02:08:36.640 Merge those two lists together. 02:08:36.640 --> 02:08:38.562 So again, I put my finger on the beginning 02:08:38.562 --> 02:08:40.270 of this list, the beginning of this list. 02:08:40.270 --> 02:08:42.360 And if you want, I'll do the same thing when I merged last time 02:08:42.360 --> 02:08:43.830 to be clear what I'm comparing. 02:08:43.830 --> 02:08:46.440 2 and 4-- the 2 obviously comes first. 02:08:46.440 --> 02:08:48.220 What comes next? 02:08:48.220 --> 02:08:50.410 Well, the 4 comes next. 02:08:50.410 --> 02:08:51.670 What comes next? 02:08:51.670 --> 02:08:55.920 The 5 comes next and then lastly, of course, the 7. 02:08:55.920 --> 02:09:00.010 Notice that the 2457 are now sorted. 02:09:00.010 --> 02:09:02.802 So the original left half is sorted. 02:09:02.802 --> 02:09:05.010 And I'll do the rest a little faster because, my God, 02:09:05.010 --> 02:09:06.385 this feels like it takes forever. 02:09:06.385 --> 02:09:08.430 But I bet we're on to something here. 02:09:08.430 --> 02:09:10.320 What step remains next? 02:09:10.320 --> 02:09:12.570 I've just sorted the left half of the original. 02:09:12.570 --> 02:09:14.280 Sort the right half of the original. 02:09:14.280 --> 02:09:15.300 How do I sort this? 02:09:15.300 --> 02:09:17.880 I sort the left half of the right half. 02:09:17.880 --> 02:09:19.050 How do I sort this? 02:09:19.050 --> 02:09:21.360 I sort the left half of the left half. 02:09:21.360 --> 02:09:22.320 Done. 02:09:22.320 --> 02:09:24.360 I sort the right half of the left half. 02:09:24.360 --> 02:09:25.260 Done. 02:09:25.260 --> 02:09:27.390 Now I merge the two together. 02:09:27.390 --> 02:09:30.190 The 1 comes first, the 6 comes next. 02:09:30.190 --> 02:09:34.380 Now I sort the right half of the right half. 02:09:34.380 --> 02:09:35.190 What do I do? 02:09:35.190 --> 02:09:36.600 Sort the left half. 02:09:36.600 --> 02:09:37.200 Done. 02:09:37.200 --> 02:09:38.200 Sort the right half. 02:09:38.200 --> 02:09:38.700 Done. 02:09:38.700 --> 02:09:39.420 What do I do? 02:09:39.420 --> 02:09:41.230 Merge them together. 02:09:41.230 --> 02:09:43.080 So that's the third step of that phase. 02:09:43.080 --> 02:09:47.760 Now where are we in the stor-- oh my God, where are we in the story? 02:09:47.760 --> 02:09:52.230 We have sorted the left half of the right half 02:09:52.230 --> 02:09:54.180 and the right half of the right half. 02:09:54.180 --> 02:09:55.760 What comes next? 02:09:55.760 --> 02:09:56.263 Merge. 02:09:56.263 --> 02:09:58.430 So I'm going to compare, and I'm going to move those 02:09:58.430 --> 02:10:00.590 down just to make clear what I'm comparing, 02:10:00.590 --> 02:10:02.300 the beginning of both sublists. 02:10:02.300 --> 02:10:03.110 What comes first? 02:10:03.110 --> 02:10:04.610 Of course, the 0. 02:10:04.610 --> 02:10:07.110 What comes next? 02:10:07.110 --> 02:10:08.550 What comes next? 02:10:08.550 --> 02:10:10.340 The 1. 02:10:10.340 --> 02:10:11.730 What comes next? 02:10:11.730 --> 02:10:13.010 The 3. 02:10:13.010 --> 02:10:14.930 And then lastly comes the 6. 02:10:14.930 --> 02:10:16.880 All right, where are we in the story? 02:10:16.880 --> 02:10:19.220 We've now sorted the left half of the original 02:10:19.220 --> 02:10:20.780 and the right half of the original. 02:10:20.780 --> 02:10:22.430 What step remains? 02:10:22.430 --> 02:10:23.030 Merge. 02:10:23.030 --> 02:10:25.070 All right, so I'm going to make the same point. 02:10:25.070 --> 02:10:27.690 And this is actually literally what we did earlier 02:10:27.690 --> 02:10:31.760 because I deliberately demoed those original numbers in this order, 2 02:10:31.760 --> 02:10:32.630 and a 0. 02:10:32.630 --> 02:10:34.010 This comes out first. 02:10:34.010 --> 02:10:35.300 What comes next? 02:10:35.300 --> 02:10:36.230 2 and 1. 02:10:36.230 --> 02:10:37.890 The 1 comes out next. 02:10:37.890 --> 02:10:39.450 What comes next? 02:10:39.450 --> 02:10:40.920 The 2 comes next. 02:10:40.920 --> 02:10:42.030 What comes next? 02:10:42.030 --> 02:10:43.410 The 3 comes next. 02:10:43.410 --> 02:10:44.460 What comes next? 02:10:44.460 --> 02:10:47.150 The 4. 02:10:47.150 --> 02:10:48.590 What comes after that? 02:10:48.590 --> 02:10:50.720 The 5. 02:10:50.720 --> 02:10:51.920 What comes after that? 02:10:51.920 --> 02:10:52.700 The 6. 02:10:52.700 --> 02:10:56.760 And lastly-- this is when we run out of memory-- 02:10:56.760 --> 02:11:01.010 the 7 over there is actually in place. 02:11:01.010 --> 02:11:03.390 OK. 02:11:03.390 --> 02:11:05.867 OK, so admittedly, a little harder to explain, 02:11:05.867 --> 02:11:07.950 and honestly, it gets a little trippy because it's 02:11:07.950 --> 02:11:10.980 so easy to forget about where you are in the story 02:11:10.980 --> 02:11:13.680 because we're constantly diving into the algorithm 02:11:13.680 --> 02:11:15.250 and then backing back out of it. 02:11:15.250 --> 02:11:17.970 But in code, we could express this pretty correctly 02:11:17.970 --> 02:11:21.000 and, it turns out, pretty efficiently because what 02:11:21.000 --> 02:11:24.660 I was doing, even though it's longer when I do it verbally, 02:11:24.660 --> 02:11:28.260 I was touching these elements a minimal amount of times, right? 02:11:28.260 --> 02:11:31.530 I wasn't going back and forth, back and forth in front of the shelf 02:11:31.530 --> 02:11:32.490 again and again. 02:11:32.490 --> 02:11:37.240 I was deliberately only ever merging the smallest elements in each list. 02:11:37.240 --> 02:11:40.200 So every time we merge, even though I was doing it quickly, 02:11:40.200 --> 02:11:44.370 my fingers were only touching each of the elements once. 02:11:44.370 --> 02:11:50.152 And how many times did we divide, divide, divide in half the list? 02:11:50.152 --> 02:11:52.110 Well, we started with all of the elements here, 02:11:52.110 --> 02:11:53.318 and there were eight of them. 02:11:53.318 --> 02:11:56.860 And then we moved them 1, 2, 3 positions. 02:11:56.860 --> 02:12:03.420 So the height of this visualization, if you will, is actually log n, right? 02:12:03.420 --> 02:12:06.280 If I started with 8, turns out if you do the arithmetic, 02:12:06.280 --> 02:12:10.380 this is log n height because 2 to the 3 is 8. 02:12:10.380 --> 02:12:12.750 But for now, just trust that this is a log n height. 02:12:12.750 --> 02:12:14.280 And how wide is the shelf? 02:12:14.280 --> 02:12:18.270 Well, it's of width n because there's n elements any time 02:12:18.270 --> 02:12:19.470 they were on the shelf. 02:12:19.470 --> 02:12:23.737 So technically, I was kind of cheating this algorithm because this 02:12:23.737 --> 02:12:25.320 is the first time I've needed shelves. 02:12:25.320 --> 02:12:29.010 With the human examples, we just had the humans, and that's it, and only eight 02:12:29.010 --> 02:12:29.520 of them. 02:12:29.520 --> 02:12:32.200 Here, I was sort of using more and more memory. 02:12:32.200 --> 02:12:34.770 In fact, I was using like four times as much memory 02:12:34.770 --> 02:12:36.930 even though that was just for visualization's sake. 02:12:36.930 --> 02:12:41.040 Merge sort actually requires that you have some spare space, an empty array 02:12:41.040 --> 02:12:44.045 to move the elements into when you're merging them together. 02:12:44.045 --> 02:12:46.920 But if I really wanted and if I didn't have this shelf or this shelf, 02:12:46.920 --> 02:12:49.590 honestly, I could have just gone back and forth between the two shelves. 02:12:49.590 --> 02:12:51.250 That would have been sufficient. 02:12:51.250 --> 02:12:56.190 So merge sort uses more memory for this merging process, 02:12:56.190 --> 02:12:59.100 but the advantage of using more memory is 02:12:59.100 --> 02:13:04.680 that the total running time, if you can perhaps infer from that math, is what? 02:13:04.680 --> 02:13:07.560 The big O notation for merge sort, it turns out, 02:13:07.560 --> 02:13:10.530 is actually going to be n times log n. 02:13:10.530 --> 02:13:13.050 And even if you're a little rusty still on your logarithms, 02:13:13.050 --> 02:13:18.390 we saw in week zero and again today that log n is smaller than n. 02:13:18.390 --> 02:13:19.650 That's a good thing. 02:13:19.650 --> 02:13:20.880 Binary search was log n. 02:13:20.880 --> 02:13:23.250 That's faster than linear search, which was n. 02:13:23.250 --> 02:13:28.830 So n times log n is, of course, smaller than n times n or n squared. 02:13:28.830 --> 02:13:31.830 So it's sort of lower on this little cheat sheet that I've been drawing, 02:13:31.830 --> 02:13:35.350 which is to suggest that it's running time is indeed better or faster. 02:13:35.350 --> 02:13:38.340 And in fact, if we consider the best case running time, 02:13:38.340 --> 02:13:42.645 turns out it's not quite as good as bubble sort with omega of n, 02:13:42.645 --> 02:13:45.270 where you can just sort of abort if you realize, wait a minute, 02:13:45.270 --> 02:13:46.320 I've done no work. 02:13:46.320 --> 02:13:50.850 Merge sort, you actually have to do that work to get to the finish line anyway. 02:13:50.850 --> 02:13:56.170 So it's actually in omega and ultimately theta of n log n as well. 02:13:56.170 --> 02:13:58.230 So again, a trade off there because if you 02:13:58.230 --> 02:14:00.540 happen to have a data set that is very often sorted, 02:14:00.540 --> 02:14:02.665 honestly, you might want to stick with bubble sort. 02:14:02.665 --> 02:14:05.490 But in the general case, where the data is unsorted, 02:14:05.490 --> 02:14:08.820 n log n as sounding better than n squared. 02:14:08.820 --> 02:14:10.830 Well, what does it actually look or feel like? 02:14:10.830 --> 02:14:14.230 Give me a moment to just change over to our visualization here. 02:14:14.230 --> 02:14:18.210 And we'll see with this example what merge sort looks like 02:14:18.210 --> 02:14:20.170 depicted with now these vertical bars. 02:14:20.170 --> 02:14:22.860 So same algorithm, but instead of my numbers on shelves, 02:14:22.860 --> 02:14:27.970 here is a random array of numbers being sorted. 02:14:27.970 --> 02:14:30.120 And you can see it being done half at a time. 02:14:30.120 --> 02:14:33.720 And you see sort of remnants of the previous bars. 02:14:33.720 --> 02:14:35.520 Actually, that was unfair. 02:14:35.520 --> 02:14:37.500 Let me zoom out here. 02:14:37.500 --> 02:14:42.030 Let me zoom out so you can actually see the height here. 02:14:42.030 --> 02:14:44.730 Let me go ahead and randomize this again and run merge sort. 02:14:44.730 --> 02:14:45.310 There we go. 02:14:45.310 --> 02:14:50.100 Now you can see the second array and where the values are going temporarily. 02:14:50.100 --> 02:14:54.130 And even though this one looks way more cryptic visualization-wise, 02:14:54.130 --> 02:14:56.170 it does seem to be moving faster. 02:14:56.170 --> 02:14:59.670 And it seems to be merging halves together, and boom, it's done. 02:14:59.670 --> 02:15:03.990 So let's actually see, in conclusion, what these algorithms compare to 02:15:03.990 --> 02:15:07.080 and consider that moving forward as we write more and more code, 02:15:07.080 --> 02:15:10.350 the goal is, again, not just to be correct but to be well-designed. 02:15:10.350 --> 02:15:13.820 And one measure of design is going to indeed be efficiency. 02:15:13.820 --> 02:15:18.120 So here we have, in final, a visualization of three algorithms-- 02:15:18.120 --> 02:15:21.000 selection sort, bubble sort, and merge sort-- 02:15:21.000 --> 02:15:22.450 from top to bottom. 02:15:22.450 --> 02:15:25.620 And let's see what these algorithms might look or sound like here. 02:15:25.620 --> 02:15:27.855 Oh, if we can dim the lights for dramatic effect-- 02:15:32.160 --> 02:15:35.970 selection's on top, bubble on bottom, merge in the middle. 02:15:35.970 --> 02:16:33.320 [MUSIC PLAYING] 02:16:33.320 --> 02:17:11.000 [MUSIC PLAYING]