WEBVTT X-TIMESTAMP-MAP=LOCAL:00:00:00.000,MPEGTS:900000 00:00:00.000 --> 00:00:04.960 [MUSIC PLAYING] 00:00:49.182 --> 00:00:50.140 DAVID MALAN: All right. 00:00:50.140 --> 00:00:53.260 This is CS50, and this is week 3. 00:00:53.260 --> 00:00:56.560 And you'll recall that last week, we equipped you with a lot more tools 00:00:56.560 --> 00:00:59.830 by which to solve problems-- not only problems that we had proposed, 00:00:59.830 --> 00:01:02.110 but problems in your own code, that is to say bugs. 00:01:02.110 --> 00:01:05.980 And recall that those tools involve command line tools like help50 00:01:05.980 --> 00:01:09.580 for help with cryptic error messages that the compiler might spit out; 00:01:09.580 --> 00:01:11.500 style50 which, gives you a bit of feedback 00:01:11.500 --> 00:01:14.230 on the stylization of your code, the aesthetics thereof; 00:01:14.230 --> 00:01:18.550 check50, which checks the correctness of your code against the specifications 00:01:18.550 --> 00:01:21.400 in a given problem set or lab; printf, which 00:01:21.400 --> 00:01:24.640 is a function that it exists in some form in almost any programming 00:01:24.640 --> 00:01:27.185 language that you might ultimately learn, and this 00:01:27.185 --> 00:01:30.310 is simply a way of printing out anything you might want from the computer's 00:01:30.310 --> 00:01:31.345 memory onto the screen. 00:01:31.345 --> 00:01:33.220 Then perhaps the most powerful of these tools 00:01:33.220 --> 00:01:36.190 was debug50, which was this interactive debugger. 00:01:36.190 --> 00:01:40.060 And even though this command debug50 is a little specific to CS50, 00:01:40.060 --> 00:01:42.970 what it triggers to happen, that little side window 00:01:42.970 --> 00:01:46.150 where you can see the stack of functions that you 00:01:46.150 --> 00:01:48.550 might have called during some break point, 00:01:48.550 --> 00:01:51.942 and you can see the local variables that you might have defined at some point 00:01:51.942 --> 00:01:53.650 during the execution of your code, that's 00:01:53.650 --> 00:01:58.900 a very common conventional feature of any debugger with most any language. 00:01:58.900 --> 00:02:02.500 And then lastly, recall there was this ddb, duck debugger, which 00:02:02.500 --> 00:02:06.490 of course, takes this physical form, if you happen to have a rubber duck lying 00:02:06.490 --> 00:02:07.870 around with whom you can talk. 00:02:07.870 --> 00:02:10.270 But I'm so pleased to say that if you lack 00:02:10.270 --> 00:02:16.420 that currently while at home, CS50's own Kareem and Brenda and Sophie 00:02:16.420 --> 00:02:19.150 have wonderfully added, if you haven't noticed already, 00:02:19.150 --> 00:02:22.090 that same virtual duck to CS50 IDE. 00:02:22.090 --> 00:02:24.400 So if you click in the top corner, you can actually 00:02:24.400 --> 00:02:27.328 begin to have a chat of sorts with the rubber duck. 00:02:27.328 --> 00:02:30.370 And while this is a certainly more playful incarnation of that same idea, 00:02:30.370 --> 00:02:32.560 we really can't emphasize enough the value 00:02:32.560 --> 00:02:36.700 of talking through problems when you're experiencing them in code with someone 00:02:36.700 --> 00:02:38.530 else or with something else. 00:02:38.530 --> 00:02:42.250 This particular duck, not all that large of a vocabulary, 00:02:42.250 --> 00:02:45.100 but it's not so much what the other person says but what you say 00:02:45.100 --> 00:02:47.952 and what you hear yourself saying that is undoubtedly 00:02:47.952 --> 00:02:49.535 the most valuable part of the process. 00:02:49.535 --> 00:02:53.560 So our thanks to Kareem and Brenda and Sophie on that. 00:02:53.560 --> 00:02:58.000 Recall last week, 2, that we took a look underneath the hood, 00:02:58.000 --> 00:03:02.020 literally in some sense, at the computer's memory in your laptop 00:03:02.020 --> 00:03:03.040 or desktop or phone. 00:03:03.040 --> 00:03:06.160 And then we decided to think about this more 00:03:06.160 --> 00:03:09.100 artistically as just a grid of bytes. 00:03:09.100 --> 00:03:11.470 So within that chip, there's a whole bunch of bits. 00:03:11.470 --> 00:03:14.620 And if you look at eight of them at a time, there's a whole bunch of bytes. 00:03:14.620 --> 00:03:16.150 And it stands to reason that we could think 00:03:16.150 --> 00:03:18.790 of this as the first byte, the second byte, the third byte, 00:03:18.790 --> 00:03:21.360 and so forth, and sort of chop this up pictorially 00:03:21.360 --> 00:03:25.790 into just a whole sequence of bytes in the computer's memory. 00:03:25.790 --> 00:03:30.160 And recall that if we zoom in on that and focus on just one continuous block 00:03:30.160 --> 00:03:33.500 of memory, otherwise known as an "array," 00:03:33.500 --> 00:03:37.640 we can do things within this array like storing a bunch of different values. 00:03:37.640 --> 00:03:41.110 So recall last week, we started by defining a little-- 00:03:41.110 --> 00:03:44.710 goofily, multiple variables that were almost identically names, 00:03:44.710 --> 00:03:47.470 like scores1, scores2, scores3. 00:03:47.470 --> 00:03:51.190 And then we began to clean up the design of our code by introducing an array, 00:03:51.190 --> 00:03:55.540 so we can have just one variable called scores, that is of size 3 00:03:55.540 --> 00:03:57.860 and has room for multiple values. 00:03:57.860 --> 00:04:02.320 So today, we'll continue to leverage this feature of many programming 00:04:02.320 --> 00:04:04.570 languages-- being able to store things continuously, 00:04:04.570 --> 00:04:07.000 back to back to back to back, in a computer's memory, 00:04:07.000 --> 00:04:11.050 because this very simple layout, this very simple feature of the language, 00:04:11.050 --> 00:04:13.270 is going to open up all sorts of powerful features. 00:04:13.270 --> 00:04:17.410 And in fact, we can even revisit some of the problems 00:04:17.410 --> 00:04:19.990 we tried to solve way back in week 0. 00:04:19.990 --> 00:04:21.700 But there is a catch with arrays. 00:04:21.700 --> 00:04:25.405 And we didn't really emphasize this much last week. 00:04:25.405 --> 00:04:27.280 And that's because, even though you and I can 00:04:27.280 --> 00:04:30.363 glance at this picture on the screen and see immediately that, oh, there's 00:04:30.363 --> 00:04:33.640 seven boxes on the screen, there are seven locations in which you 00:04:33.640 --> 00:04:37.870 can store values, you and I can sort of have this bird's eye view of everything 00:04:37.870 --> 00:04:41.380 and just see what's inside that entire array all at once. 00:04:41.380 --> 00:04:45.370 But computers, recall, are much more methodical, more algorithmic, 00:04:45.370 --> 00:04:46.250 if you will. 00:04:46.250 --> 00:04:48.790 And so a computer, as powerful as they are, 00:04:48.790 --> 00:04:51.945 can technically only look at one location in an array at a time. 00:04:51.945 --> 00:04:54.070 So whereas you and I can glance at this and sort of 00:04:54.070 --> 00:04:57.370 take it all in at once a computer just can't glance at its memory 00:04:57.370 --> 00:05:00.040 and take in all at once all of the values therein, 00:05:00.040 --> 00:05:04.270 it has to do so more methodically, for instance, from left to right, maybe 00:05:04.270 --> 00:05:06.950 right to left, maybe middle onward. 00:05:06.950 --> 00:05:08.990 But it has to be an algorithm. 00:05:08.990 --> 00:05:12.850 And so today we'll formalize that notion and really kind of hide the fact that 00:05:12.850 --> 00:05:16.840 this array cannot be seen all at once, you can only look at one location 00:05:16.840 --> 00:05:19.090 in an array at a given time. 00:05:19.090 --> 00:05:21.160 And this is going to have very real implications. 00:05:21.160 --> 00:05:23.820 For instance, if we consider that very first problem 00:05:23.820 --> 00:05:27.430 in the very first week where we tried to find my phone number in a phone book, 00:05:27.430 --> 00:05:30.640 the very naive approach was to start at the beginning and search from left 00:05:30.640 --> 00:05:31.180 to right. 00:05:31.180 --> 00:05:33.055 And we tried a couple of variants thereafter. 00:05:33.055 --> 00:05:35.590 But the problem, quite simply, is that of searching. 00:05:35.590 --> 00:05:37.690 And this is a term of art in computer science, 00:05:37.690 --> 00:05:40.540 super common, certainly for you and I as users on Google 00:05:40.540 --> 00:05:42.830 and the like to search for things all day long. 00:05:42.830 --> 00:05:46.690 And so certainly searching well, designing a search algorithm well, 00:05:46.690 --> 00:05:49.710 is certainly a compelling feature of so many of today's tools 00:05:49.710 --> 00:05:50.723 that you and I use. 00:05:50.723 --> 00:05:52.890 So if we think of this really as a problem to solve, 00:05:52.890 --> 00:05:56.430 we've got some input, which, for instance, might be an array of numbers, 00:05:56.430 --> 00:05:59.640 or maybe an array of web pages in the case of Google. 00:05:59.640 --> 00:06:01.270 And the goal is to get some output. 00:06:01.270 --> 00:06:05.340 So if the input to the problem is an array of values, the output, 00:06:05.340 --> 00:06:09.630 hopefully, is going to be something as simple, really, as a bool-- 00:06:09.630 --> 00:06:10.650 yes or no. 00:06:10.650 --> 00:06:15.180 Is the value you're looking for discoverable? 00:06:15.180 --> 00:06:20.130 Can you search for and find that value, yes or no, true or false? 00:06:20.130 --> 00:06:23.190 Now, within this black box, recall, is going to be some algorithm. 00:06:23.190 --> 00:06:25.530 And that's where today we'll spend most of our time. 00:06:25.530 --> 00:06:28.680 Indeed, we won't really introduce that many more features of C. 00:06:28.680 --> 00:06:30.480 We won't introduce that much more code. 00:06:30.480 --> 00:06:33.930 We'll focus again on ideas, just taking for granted now that you 00:06:33.930 --> 00:06:35.670 have some more tools in your toolkit. 00:06:35.670 --> 00:06:38.100 Beyond loops and conditions and Boolean expressions, 00:06:38.100 --> 00:06:40.830 we now have this other tool known as arrays. 00:06:40.830 --> 00:06:45.060 But let's first introduce some other terms of art, some jargon if you will, 00:06:45.060 --> 00:06:46.990 related to what we'll call running time. 00:06:46.990 --> 00:06:48.780 So we've alluded to this a few times. 00:06:48.780 --> 00:06:51.510 When we're thinking about just how good or bad an algorithm is, 00:06:51.510 --> 00:06:53.490 we describe how long it takes to run. 00:06:53.490 --> 00:06:54.540 That is its running time. 00:06:54.540 --> 00:06:57.330 The running time of an algorithm is how long it takes-- 00:06:57.330 --> 00:07:00.820 how many steps it takes, how many seconds it takes, how many iterations 00:07:00.820 --> 00:07:01.320 it takes. 00:07:01.320 --> 00:07:04.140 It doesn't really matter what your unit of measure is. 00:07:04.140 --> 00:07:07.350 Maybe it's time, maybe it's iterations or something else. 00:07:07.350 --> 00:07:10.422 But running time just refers to how long does an algorithm take. 00:07:10.422 --> 00:07:13.380 And there are ways that we can think about this a little more formally. 00:07:13.380 --> 00:07:16.020 And we kind of did this already in the first week, 00:07:16.020 --> 00:07:19.320 but we didn't give it this name, this italicized O, 00:07:19.320 --> 00:07:23.220 this capital O on the screen, is otherwise known as Big O notation. 00:07:23.220 --> 00:07:25.470 And computer scientists and some mathematicians 00:07:25.470 --> 00:07:28.410 will very frequently use, literally, this symbol 00:07:28.410 --> 00:07:31.410 to describe the running times of algorithms, 00:07:31.410 --> 00:07:33.310 or mathematically like a function. 00:07:33.310 --> 00:07:34.720 So recall this picture, in fact. 00:07:34.720 --> 00:07:38.190 When we were searching that phone book, we did it sort of good, better, best. 00:07:38.190 --> 00:07:41.800 We did it linearly-- that is, searching one page at a time, 00:07:41.800 --> 00:07:44.940 we did it twice as fast by doing two pages at a time-- 00:07:44.940 --> 00:07:48.690 and then we did it logarithmically by dividing and conquering, 00:07:48.690 --> 00:07:50.060 in half and half and half. 00:07:50.060 --> 00:07:52.560 And at the time, I proposed that if we think of a phone book 00:07:52.560 --> 00:07:57.630 as having n pages, where n is just a number in computer science vernacular, 00:07:57.630 --> 00:08:00.300 we might describe the running time, or the number of steps 00:08:00.300 --> 00:08:04.350 involved for that first algorithm, as being maybe in the worst case n steps. 00:08:04.350 --> 00:08:06.780 If the person you're looking for in a phone book maybe 00:08:06.780 --> 00:08:10.297 alphabetically has the last name starting with Z in English, 00:08:10.297 --> 00:08:12.880 well, the Z might be all the way at the end of the phone book. 00:08:12.880 --> 00:08:15.450 So at the worst case, you might be taking n steps 00:08:15.450 --> 00:08:18.090 to find someone like myself in that phone book. 00:08:18.090 --> 00:08:20.400 The second algorithm, though, was twice as fast, 00:08:20.400 --> 00:08:22.510 because we went two pages at a time. 00:08:22.510 --> 00:08:24.985 So we might describe its running time as n divided by 2. 00:08:24.985 --> 00:08:28.110 And then the third algorithm, where we divided the problem in half and half 00:08:28.110 --> 00:08:31.680 and half, literally throwing half of the problem away again and again, 00:08:31.680 --> 00:08:35.760 was logarithmic-- technically log base 2 of n, which, again, is just 00:08:35.760 --> 00:08:39.390 a mathematical formula that refers to halving something again 00:08:39.390 --> 00:08:40.289 and again and again. 00:08:40.289 --> 00:08:43.690 And you start with, of course, n pages in that scenario. 00:08:43.690 --> 00:08:46.560 Well, it turns out that a computer scientist would actually 00:08:46.560 --> 00:08:49.027 wave their hands at some of these mathematical details. 00:08:49.027 --> 00:08:51.360 Indeed, we're not going to get into the habit of writing 00:08:51.360 --> 00:08:53.517 very precise mathematical formulas. 00:08:53.517 --> 00:08:55.350 What we're instead going to do is try to get 00:08:55.350 --> 00:08:59.850 a sense of the order on which the running time of an algorithm 00:08:59.850 --> 00:09:03.870 is, just roughly how fast or how slow it is, but still 00:09:03.870 --> 00:09:06.210 using some symbology like n as a placeholder. 00:09:06.210 --> 00:09:09.480 And so a computer scientist would describe the running time of all three 00:09:09.480 --> 00:09:14.820 of those algorithms from week 0 as being big O of n, or big O of n/2, 00:09:14.820 --> 00:09:17.040 or big O of log base 2 of n. 00:09:17.040 --> 00:09:18.900 So "big O" just means "on the order of." 00:09:18.900 --> 00:09:20.520 It's sort of a wave of the hand. 00:09:20.520 --> 00:09:24.900 Maybe it's n minus 1, maybe it's n plus 1, maybe it's even 2n. 00:09:24.900 --> 00:09:28.240 But it's on the order of n or these other values. 00:09:28.240 --> 00:09:30.468 But in fact, too, notice that this chart, 00:09:30.468 --> 00:09:33.510 there's something kind of curious. , Like these first two algorithms from 00:09:33.510 --> 00:09:36.870 week 0 kind of pictorially look pretty much the same. 00:09:36.870 --> 00:09:39.137 Like undoubtedly, the yellow line is a little lower 00:09:39.137 --> 00:09:41.970 and therefore a little better and a little faster than the red line. 00:09:41.970 --> 00:09:43.200 But they have the same shape. 00:09:43.200 --> 00:09:47.370 And in fact, I bet if we zoomed way out, these two straight lines 00:09:47.370 --> 00:09:49.380 would pretty much look identical. 00:09:49.380 --> 00:09:53.412 If you change your axis to be big enough and tall enough, 00:09:53.412 --> 00:09:54.870 these would start to blur together. 00:09:54.870 --> 00:09:57.880 But clearly, the green line is fundamentally different. 00:09:57.880 --> 00:10:01.020 And so this speaks to a computer scientist's tendency 00:10:01.020 --> 00:10:03.870 to not really quibble over these details. 00:10:03.870 --> 00:10:06.977 Like, yes, the second algorithm in week 0 was better. 00:10:06.977 --> 00:10:08.310 Yes, this yellow line is better. 00:10:08.310 --> 00:10:12.540 But, eh, let's just call both of those algorithms running times 00:10:12.540 --> 00:10:13.800 on the order of n. 00:10:13.800 --> 00:10:18.390 That is to say, a computer scientist tends to throw away constant factors, 00:10:18.390 --> 00:10:20.840 like the 1/2 or the divided by 2. 00:10:20.840 --> 00:10:23.250 And they tend to focus only on the dominant factor, 00:10:23.250 --> 00:10:27.810 like which value in that mathematical expression is going to grow the most, 00:10:27.810 --> 00:10:28.890 grow the fastest. 00:10:28.890 --> 00:10:32.280 And n divided by 2n it's going to dominate over time. 00:10:32.280 --> 00:10:34.830 The bigger the phone book gets, the more pages you have. 00:10:34.830 --> 00:10:38.238 It's really n that's going to matter less so than that divided by 2. 00:10:38.238 --> 00:10:39.280 And same thing over here. 00:10:39.280 --> 00:10:41.700 If you're familiar with and remember your logarithms, 00:10:41.700 --> 00:10:45.120 we don't really have to even care about the base of that logarithm. 00:10:45.120 --> 00:10:48.940 Yes, it's base 2, but eh, we can just multiply that logarithm 00:10:48.940 --> 00:10:51.490 by some other number to convert it to any base we want-- 00:10:51.490 --> 00:10:54.040 base 10, base 3, base 7, anything. 00:10:54.040 --> 00:10:55.960 So let's just say it's on the order of log n. 00:10:55.960 --> 00:10:57.760 So this is good, because it means we're not really 00:10:57.760 --> 00:10:59.815 going to waste time getting really into the weeds 00:10:59.815 --> 00:11:02.440 mathematically when we talk about the efficiency of algorithms. 00:11:02.440 --> 00:11:07.580 It suffices to describe things really in terms of the variable, n in this case, 00:11:07.580 --> 00:11:09.370 if you will, that dominates over time. 00:11:09.370 --> 00:11:10.510 And indeed, let's zoom out. 00:11:10.510 --> 00:11:12.910 If I zoom out on this picture, boom, you begin 00:11:12.910 --> 00:11:16.405 to see that, yeah, these are really starting to look almost identical. 00:11:16.405 --> 00:11:20.140 And if we kept zooming out, you would see that they're essentially 00:11:20.140 --> 00:11:21.040 one in the same. 00:11:21.040 --> 00:11:25.420 But the green one stands out, so that's indeed on the order of log of n 00:11:25.420 --> 00:11:26.768 as opposed to n itself. 00:11:26.768 --> 00:11:28.060 So here's a little cheat sheet. 00:11:28.060 --> 00:11:30.280 It turns out that within computer science, 00:11:30.280 --> 00:11:33.250 and within the analysis of algorithms, we're 00:11:33.250 --> 00:11:36.410 going to tend to see some common formulas like this. 00:11:36.410 --> 00:11:38.290 So we've just seen on the order of n. 00:11:38.290 --> 00:11:39.880 We've seen on the order of log n. 00:11:39.880 --> 00:11:42.850 It turns out that the very common two is going to be n times 00:11:42.850 --> 00:11:46.677 log n, maybe even n squared, and then even big O of 1. 00:11:46.677 --> 00:11:48.760 And the last of those just means that an algorithm 00:11:48.760 --> 00:11:50.800 takes, wonderfully, one step-- 00:11:50.800 --> 00:11:55.420 or maybe two steps, maybe even 10 steps, but a constant number of steps. 00:11:55.420 --> 00:11:58.780 So that's sort of the best case scenario, at least among these options. 00:11:58.780 --> 00:12:01.820 Whereas, n squared is going to start to take a long time. 00:12:01.820 --> 00:12:03.820 It's going to start to feel slow, because if you 00:12:03.820 --> 00:12:06.070 take any value of n and square it, that's 00:12:06.070 --> 00:12:08.510 going to imply more and more steps. 00:12:08.510 --> 00:12:10.998 So just a bit of jargon, then, to start off today, 00:12:10.998 --> 00:12:12.790 whereby we now have this sort of vocabulary 00:12:12.790 --> 00:12:16.750 with which to describe the running times of an algorithm in terms of this Big O 00:12:16.750 --> 00:12:17.560 notation. 00:12:17.560 --> 00:12:19.240 But there's one other notation. 00:12:19.240 --> 00:12:22.960 And just as big O refers to an upper bound 00:12:22.960 --> 00:12:25.930 on running times, like how many steps maximally, 00:12:25.930 --> 00:12:27.970 how much time maximally might an algorithm take, 00:12:27.970 --> 00:12:31.730 this omega notation refers to the opposite. 00:12:31.730 --> 00:12:34.293 What's a lower bound on the running time of an algorithm? 00:12:34.293 --> 00:12:36.460 And we don't need another picture or other formulas. 00:12:36.460 --> 00:12:37.740 We can reuse the same one. 00:12:37.740 --> 00:12:40.150 So this cheat sheet here just proposes that, 00:12:40.150 --> 00:12:43.850 when describing the efficiency or inefficiency of an algorithm 00:12:43.850 --> 00:12:45.850 and you want to come up with a lower bound-- 00:12:45.850 --> 00:12:48.630 like minimally, how many steps does my algorithm take-- 00:12:48.630 --> 00:12:50.380 we can use the same mathematical formulas, 00:12:50.380 --> 00:12:54.340 but we can note that with omega instead of big O. 00:12:54.340 --> 00:12:56.620 So again, looks fancy, but it really just refers 00:12:56.620 --> 00:13:01.090 to a wave of the hand trying to sort of ballpark exactly what the running time 00:13:01.090 --> 00:13:02.600 is of an algorithm. 00:13:02.600 --> 00:13:05.080 And thankfully, we've seen a few algorithms already, 00:13:05.080 --> 00:13:08.650 including in that week 0, and now we're going to give it a more formal name. 00:13:08.650 --> 00:13:11.750 Linear search is what we did with that phone book 00:13:11.750 --> 00:13:15.010 first off by searching it page by page by page, 00:13:15.010 --> 00:13:18.710 looking for my phone number in that particular example. 00:13:18.710 --> 00:13:23.412 And so the difference today is that, unlike us humans, who 00:13:23.412 --> 00:13:26.620 can look down at a phone book page and see a whole bunch of names and numbers 00:13:26.620 --> 00:13:30.490 at once, unlike a human who can look at an array on the board a moment ago 00:13:30.490 --> 00:13:32.530 and sort of see everything at once, we need 00:13:32.530 --> 00:13:34.630 to be more methodical, more deliberate today so 00:13:34.630 --> 00:13:39.160 that we can translate week 0's ideas now, not into even pseudocode, 00:13:39.160 --> 00:13:40.510 but actual C code. 00:13:40.510 --> 00:13:43.810 And so wonderfully, here at the American Repertory Theater 00:13:43.810 --> 00:13:46.280 as we are on Harvard's campus this semester, 00:13:46.280 --> 00:13:48.610 we've been collaborating with the whole team 00:13:48.610 --> 00:13:51.670 here who are much more artistically inclined than certainly I 00:13:51.670 --> 00:13:52.900 could be on my own here. 00:13:52.900 --> 00:13:55.150 And we have these seven wonderful doors that 00:13:55.150 --> 00:13:58.690 were previously used in various theatrical shows that took place here 00:13:58.690 --> 00:13:59.680 in this theater. 00:13:59.680 --> 00:14:03.130 And we've even collaborated with the theater's prop shop, 00:14:03.130 --> 00:14:07.030 who in back have wonderfully manufactured some delightful numbers 00:14:07.030 --> 00:14:08.300 and brought them to life. 00:14:08.300 --> 00:14:12.437 Which is to say that, behind each of these seven doors is a number. 00:14:12.437 --> 00:14:15.520 And this is going to be an opportunity now to really hammer home the point 00:14:15.520 --> 00:14:18.430 that when we want to search for some number in an array, 00:14:18.430 --> 00:14:22.900 it's pretty equivalent to having to search for a number, in this case, 00:14:22.900 --> 00:14:25.210 behind an otherwise closed door. 00:14:25.210 --> 00:14:27.190 You and I can't just look at all of these doors 00:14:27.190 --> 00:14:29.020 now and figure out where a number is. 00:14:29.020 --> 00:14:30.430 We have to be more methodical. 00:14:30.430 --> 00:14:32.440 We have to start searching these doors, maybe 00:14:32.440 --> 00:14:35.688 from left to right, maybe from right to left, maybe from the middle on out. 00:14:35.688 --> 00:14:38.980 But we need to come up with an algorithm and ultimately translate that to code. 00:14:38.980 --> 00:14:43.630 So for instance, suppose I were to search for the number 0. 00:14:43.630 --> 00:14:48.700 How could we go about searching, methodically, these seven wooden doors 00:14:48.700 --> 00:14:50.200 for the number 0? 00:14:50.200 --> 00:14:54.310 Let me take a suggestion from the audience. 00:14:54.310 --> 00:14:56.620 What approach might you take? 00:14:56.620 --> 00:15:02.540 What first step would you propose I take here on my own with these doors? 00:15:02.540 --> 00:15:04.540 Any recommendations? 00:15:04.540 --> 00:15:07.030 How do I begin to find myself the number 0? 00:15:07.030 --> 00:15:09.280 Florence, what do you propose? 00:15:09.280 --> 00:15:13.523 AUDIENCE: I would propose starting form the left, since 0 is a smaller number. 00:15:13.523 --> 00:15:14.440 DAVID MALAN: OK, good. 00:15:14.440 --> 00:15:15.880 And hang in there for with me for just a moment. 00:15:15.880 --> 00:15:18.630 Let me go ahead and started on the left edge as Florence proposes. 00:15:18.630 --> 00:15:21.040 Go ahead and open the door, and hopefully, voila-- 00:15:21.040 --> 00:15:21.850 no. 00:15:21.850 --> 00:15:23.020 It's a number 4. 00:15:23.020 --> 00:15:23.920 So it's not a 0. 00:15:23.920 --> 00:15:27.150 So Florence, what would you propose I do next? 00:15:27.150 --> 00:15:31.790 AUDIENCE: I'd probably start in the middle somewhere, 00:15:31.790 --> 00:15:35.533 if, like, in case, I don't know, it's going down by 1. 00:15:35.533 --> 00:15:36.200 DAVID MALAN: OK. 00:15:36.200 --> 00:15:37.367 So maybe it's going down. 00:15:37.367 --> 00:15:38.700 So let me go ahead and try that. 00:15:38.700 --> 00:15:41.690 So you propose middle, I could go over here, and voila-- 00:15:41.690 --> 00:15:42.620 nope. 00:15:42.620 --> 00:15:44.360 That's the number 2. 00:15:44.360 --> 00:15:47.820 And I wonder, where else should I look. 00:15:47.820 --> 00:15:49.070 Let me-- I'm a little curious. 00:15:49.070 --> 00:15:50.570 I'm a little nervous that I ignored these doors. 00:15:50.570 --> 00:15:53.320 So Florence, if you don't mind, let's go ahead and look here and-- 00:15:53.320 --> 00:15:56.240 no, that's the number 6, it seems. 00:15:56.240 --> 00:15:59.690 Let's go ahead and check in here, the number 8. 00:15:59.690 --> 00:16:02.670 So they're kind of going up and down. 00:16:02.670 --> 00:16:05.150 So Florence, how might I finish searching for this number? 00:16:05.150 --> 00:16:08.420 What remains to be done, would you say? 00:16:08.420 --> 00:16:10.490 AUDIENCE: Probably start from the right now. 00:16:10.490 --> 00:16:11.157 DAVID MALAN: OK. 00:16:11.157 --> 00:16:14.420 So I could start from the right now, and maybe just go over here. 00:16:14.420 --> 00:16:16.550 And voila-- and there it is. 00:16:16.550 --> 00:16:17.690 So we found the number 0. 00:16:17.690 --> 00:16:19.820 So let me ask Florence, what was your algorithm? 00:16:19.820 --> 00:16:23.060 How did you go about so successfully finding the number 0 for us? 00:16:26.130 --> 00:16:30.990 AUDIENCE: I guess I initially tried starting, like, by going down by 1. 00:16:30.990 --> 00:16:36.840 So like, if the number was not at the left, 00:16:36.840 --> 00:16:40.490 then going to the center, which is, like, halfway in between 00:16:40.490 --> 00:16:41.780 and then going to [INAUDIBLE]. 00:16:41.780 --> 00:16:42.410 I don't know. 00:16:42.410 --> 00:16:45.780 DAVID MALAN: And playfully, how did that work out for you, going to the middle? 00:16:45.780 --> 00:16:48.420 Better or worse, no different? 00:16:48.420 --> 00:16:52.290 AUDIENCE: I mean, I guess maybe it helped a little bit 00:16:52.290 --> 00:16:54.083 to then go all the way to the right. 00:16:54.083 --> 00:16:54.750 DAVID MALAN: OK. 00:16:54.750 --> 00:16:56.370 Yeah, we might have gleaned some information. 00:16:56.370 --> 00:16:59.203 But let's go ahead and take a look at all of the doors for a moment. 00:16:59.203 --> 00:17:00.720 There's that 4 and the 6 again. 00:17:00.720 --> 00:17:02.400 Here is that 8 again. 00:17:02.400 --> 00:17:04.920 Over in the middle we have the 2 again. 00:17:04.920 --> 00:17:08.490 Over here we have a 7 for the first time. 00:17:08.490 --> 00:17:10.109 Over here we have a 5. 00:17:10.109 --> 00:17:11.996 And then of course, we have a 0. 00:17:11.996 --> 00:17:14.579 And if you took all of that in, honestly, Florence, you and I, 00:17:14.579 --> 00:17:16.079 we couldn't really have done any better. 00:17:16.079 --> 00:17:18.130 Because these door-- these numbers, it turns out, 00:17:18.130 --> 00:17:20.310 are just randomly arranged behind these doors. 00:17:20.310 --> 00:17:23.460 So it wasn't bad at all that you kind of hopped around. 00:17:23.460 --> 00:17:26.060 Although, the downside is if you hop around, 00:17:26.060 --> 00:17:29.642 you and I as humans can pretty easily remember where we've been before. 00:17:29.642 --> 00:17:32.100 But if you think about how we would translate that to code, 00:17:32.100 --> 00:17:34.500 I feel like we're starting to accumulate a bunch of variables 00:17:34.500 --> 00:17:36.417 maybe, because you have to keep track of that. 00:17:36.417 --> 00:17:40.140 So frankly, maybe the simplest solution-- whoops-- 00:17:40.140 --> 00:17:44.460 maybe the simplest solution would have been where we started in week 0, 00:17:44.460 --> 00:17:48.540 where we just take a very simple if naive approach of starting 00:17:48.540 --> 00:17:52.920 with our array, this time of size 7, behind which are some numbers. 00:17:52.920 --> 00:17:55.530 And if you don't know anything about those numbers, 00:17:55.530 --> 00:18:01.260 honestly the best you can do is just that same linear search from week 0, 00:18:01.260 --> 00:18:05.190 and just check, one at a time, the values behind each of these doors 00:18:05.190 --> 00:18:08.940 and just hope that eventually you will find it. 00:18:08.940 --> 00:18:12.360 So this is already sort of taking a lot of time, right? 00:18:12.360 --> 00:18:16.090 If I do this linear search approach like I did in week 0, 00:18:16.090 --> 00:18:18.840 I'm potentially going to have to search behind all of those doors. 00:18:18.840 --> 00:18:21.090 I'm going to have to search behind all of those doors. 00:18:21.090 --> 00:18:24.485 So let's consider a little more formally exactly how I could at least implement 00:18:24.485 --> 00:18:25.110 that algorithm. 00:18:25.110 --> 00:18:26.730 Because I could take the approach that Florence 00:18:26.730 --> 00:18:29.105 proposed, and just kind of jumping around and maybe using 00:18:29.105 --> 00:18:30.030 a bit of intuition. 00:18:30.030 --> 00:18:31.530 But again, that's not really an algorithm. 00:18:31.530 --> 00:18:33.572 We really need to do something more step by step. 00:18:33.572 --> 00:18:36.447 And in the meantime, let's go ahead, Joe, and let's close the curtain 00:18:36.447 --> 00:18:39.300 and see if we can't clean those up with another problem in a moment, 00:18:39.300 --> 00:18:43.180 while we consider now linear search and the analysis thereof. 00:18:43.180 --> 00:18:46.650 So with linear search, I would propose that we 00:18:46.650 --> 00:18:50.160 could implement it in pseudocode first, if you will, like this. 00:18:50.160 --> 00:18:53.280 For i from 0 to n minus 1-- 00:18:53.280 --> 00:18:57.090 we'll see where we're going with this-- if the number is behind the i-th door, 00:18:57.090 --> 00:19:02.110 return true, otherwise at the very end return false. 00:19:02.110 --> 00:19:05.280 So it's a relatively simple translation into pseudocode, 00:19:05.280 --> 00:19:08.130 much like we did with the phone book some time ago. 00:19:08.130 --> 00:19:10.260 And why, though, these values? 00:19:10.260 --> 00:19:13.200 Because I'm now starting to express myself a little more like C, 00:19:13.200 --> 00:19:14.770 even though it's still pseudocode. 00:19:14.770 --> 00:19:16.800 So for i from 0 to n minus 1. 00:19:16.800 --> 00:19:19.320 So computer scientists tend to start counting from 0. 00:19:19.320 --> 00:19:25.230 If there's n doors, or 7 doors in this case, you want to go from 0 on up to 6, 00:19:25.230 --> 00:19:27.930 or from 0 on up to n minus 1. 00:19:27.930 --> 00:19:31.980 So this is just a very common way of setting yourself up with a for loop, 00:19:31.980 --> 00:19:34.320 maybe in C, maybe in pseudocode in this case, that 00:19:34.320 --> 00:19:38.140 just gets you from left to right, algorithmically step by step. 00:19:38.140 --> 00:19:41.530 If a condition, number is behind the i-th door-- 00:19:41.530 --> 00:19:43.710 i-th just being a colloquial way of saying, 00:19:43.710 --> 00:19:45.960 what is behind the door at location i-- 00:19:45.960 --> 00:19:47.160 go ahead and return true. 00:19:47.160 --> 00:19:50.640 I have found myself the number I want, for instance, the number 0. 00:19:50.640 --> 00:19:54.640 And then notice that this return false is not part of an else, 00:19:54.640 --> 00:19:58.710 because I don't want to abort this algorithm prematurely and abort simply 00:19:58.710 --> 00:20:01.950 because a number is not behind the current door. 00:20:01.950 --> 00:20:05.820 I essentially want to wait all the way to the end of the algorithm, 00:20:05.820 --> 00:20:10.080 after I've checked all n doors, and if I have still not found 00:20:10.080 --> 00:20:14.150 the number I care about, then and only then am I going to return false. 00:20:14.150 --> 00:20:15.900 So a very common programming mistake might 00:20:15.900 --> 00:20:19.920 be to nest this internally and think about things in terms of ifs and elses. 00:20:19.920 --> 00:20:21.600 But you don't need to have an else. 00:20:21.600 --> 00:20:24.900 This is kind of a catchall here at the very end. 00:20:24.900 --> 00:20:29.460 But now let's consider, if this is the pseudocode for linear search, just what 00:20:29.460 --> 00:20:32.730 is the efficiency of linear search? 00:20:32.730 --> 00:20:35.650 What is the efficiency of linear search, which is to say, 00:20:35.650 --> 00:20:37.680 how well-designed is this algorithm? 00:20:37.680 --> 00:20:40.260 We put or gave ourselves a framework a moment ago, 00:20:40.260 --> 00:20:43.500 Big O notation, which is an upper bound, which we can think of for now 00:20:43.500 --> 00:20:45.060 as meaning like a worst case. 00:20:45.060 --> 00:20:49.380 In the worst case, how many steps might it take me to find the number 0-- 00:20:49.380 --> 00:20:51.030 or any number for that matter-- 00:20:51.030 --> 00:20:52.800 among n doors? 00:20:52.800 --> 00:20:57.060 Is it big O of n squared, big O of n times log n, big O of n, 00:20:57.060 --> 00:21:00.360 big O of log n, or big O of one, which, again, just means 00:21:00.360 --> 00:21:03.290 a constant fixed number of steps? 00:21:03.290 --> 00:21:06.670 Brian, could we go ahead and pull up this question? 00:21:06.670 --> 00:21:09.180 Let me go ahead and pull it up on my screen as well. 00:21:09.180 --> 00:21:14.250 If you go to our usual URL to propose what you think an upper bound 00:21:14.250 --> 00:21:18.210 is on the running time of linear search. 00:21:18.210 --> 00:21:18.710 OK. 00:21:18.710 --> 00:21:20.940 Indeed, if we consider now the running time of linear search, 00:21:20.940 --> 00:21:22.107 it's going to be big O of n. 00:21:22.107 --> 00:21:22.678 Why is that? 00:21:22.678 --> 00:21:24.720 So in the worst case, the number I'm looking for, 00:21:24.720 --> 00:21:27.480 0, might very well be at the end of that list, which 00:21:27.480 --> 00:21:31.087 is going to be on the order of n steps, or in this case precisely n steps. 00:21:31.087 --> 00:21:32.670 So that's one way to think about this. 00:21:32.670 --> 00:21:34.590 Well, now let me ask a follow-up question. 00:21:34.590 --> 00:21:39.090 Proposing instead that we consider omega notation, which is a lower bound 00:21:39.090 --> 00:21:40.800 on the running time of an algorithm-- 00:21:40.800 --> 00:21:43.320 Brian, could we go ahead and ask this question next? 00:21:43.320 --> 00:21:47.470 At that same URL, we'll see a question asking now 00:21:47.470 --> 00:21:55.210 for the possible answers for the running time-- 00:21:55.210 --> 00:21:58.280 for a lower bound on the running time of linear search. 00:21:58.280 --> 00:22:00.800 So let's go ahead and take a look at this one here. 00:22:00.800 --> 00:22:03.220 And in just a moment, we'll see as the responses come in, 00:22:03.220 --> 00:22:05.830 about 75-plus percent of you are proposing 00:22:05.830 --> 00:22:07.630 that it's actually omega of 1. 00:22:07.630 --> 00:22:09.220 So omega is a lower bound. 00:22:09.220 --> 00:22:10.923 1 refers to constant time. 00:22:10.923 --> 00:22:11.590 And why is that? 00:22:11.590 --> 00:22:13.507 Let me just take a quick answer on this point. 00:22:13.507 --> 00:22:17.890 Among the 75% of you who said one step, or a constant number of steps, 00:22:17.890 --> 00:22:19.060 why is that? 00:22:19.060 --> 00:22:23.260 How do you think about this lower bound on running time? 00:22:23.260 --> 00:22:25.210 How about from Keith? 00:22:25.210 --> 00:22:27.600 Why omega of 1? 00:22:27.600 --> 00:22:31.120 AUDIENCE: Yeah, you can just open it and be lucky and find it in the first door. 00:22:31.120 --> 00:22:31.410 DAVID MALAN: Yeah. 00:22:31.410 --> 00:22:32.830 So it really speaks to just that. 00:22:32.830 --> 00:22:34.900 You might just get lucky, and the number you're looking for 00:22:34.900 --> 00:22:36.290 might be at the very first door. 00:22:36.290 --> 00:22:39.760 So the lower bound, in the best case, if you will, of this algorithm, 00:22:39.760 --> 00:22:44.530 linear search might very well be omega of 1 for exactly that reason-- 00:22:44.530 --> 00:22:46.540 that you have-- get lucky and the element 00:22:46.540 --> 00:22:48.020 might be there at the beginning. 00:22:48.020 --> 00:22:48.670 So that's pretty good. 00:22:48.670 --> 00:22:50.378 You really can't do any better than that. 00:22:50.378 --> 00:22:54.880 So we this range now of a lower bound from omega of 1 on up to big O of n 00:22:54.880 --> 00:22:57.940 being an upper bound on the running time of linear search. 00:22:57.940 --> 00:23:00.760 But of course, we have this other algorithm in our toolkit. 00:23:00.760 --> 00:23:04.240 And recall from week 0 that we looked at binary search-- although, 00:23:04.240 --> 00:23:05.530 not necessarily by name. 00:23:05.530 --> 00:23:09.250 It was that divide-and-conquer third algorithm, where we took the phone book 00:23:09.250 --> 00:23:11.950 and split it in half and half and half again. 00:23:11.950 --> 00:23:17.140 Now, while I fumbled there, Joe kindly has given us a new set of doors. 00:23:17.140 --> 00:23:19.990 If Joe, you could go ahead and reveal our seven doors again, 00:23:19.990 --> 00:23:22.510 behind which we still have some numbers. 00:23:22.510 --> 00:23:26.440 But I think this time, I'm going to be a little better off. 00:23:26.440 --> 00:23:30.010 Cue Joe and the doors behind. 00:23:30.010 --> 00:23:30.620 There we go. 00:23:30.620 --> 00:23:32.140 So we have our same seven doors. 00:23:32.140 --> 00:23:34.930 But behind those doors now is a different arrangement of numbers. 00:23:34.930 --> 00:23:39.532 And suppose this time, I want to find myself the number 6. 00:23:39.532 --> 00:23:41.740 So the number 6-- we'll change the problem slightly-- 00:23:41.740 --> 00:23:44.630 but I'm going to give you one other ingredient this time, 00:23:44.630 --> 00:23:46.990 which is going to be key to this working. 00:23:46.990 --> 00:23:51.580 Why were Florence and I able to do no better than linear search before? 00:23:51.580 --> 00:23:53.530 Why were Florence and I able to do no better 00:23:53.530 --> 00:23:57.610 than randomly searching even last time? 00:23:57.610 --> 00:24:00.760 What was it about the array of numbers, or the array 00:24:00.760 --> 00:24:06.390 of doors, that did not allow me previously to use binary search? 00:24:06.390 --> 00:24:08.755 Iris, what do you think? 00:24:08.755 --> 00:24:11.380 AUDIENCE: Because we didn't know the numbers are sorted or not. 00:24:11.380 --> 00:24:11.860 DAVID MALAN: Yeah. 00:24:11.860 --> 00:24:14.060 We didn't know if the numbers were sorted or not. 00:24:14.060 --> 00:24:16.510 And indeed, barring that detail, Florence and I 00:24:16.510 --> 00:24:19.550 really couldn't have done any better than, say, linear search. 00:24:19.550 --> 00:24:22.900 So this time, though, Joe has kindly sorted some numbers 00:24:22.900 --> 00:24:24.440 behind these doors for us. 00:24:24.440 --> 00:24:26.590 And so if I want to search for the number 6, 00:24:26.590 --> 00:24:29.290 now I can begin to use a bit of that information. 00:24:29.290 --> 00:24:31.480 So you know what, I'm going to start just like we did with the phone book 00:24:31.480 --> 00:24:32.813 and start roughly in the middle. 00:24:32.813 --> 00:24:34.310 And voila, number 5. 00:24:34.310 --> 00:24:34.810 All right. 00:24:34.810 --> 00:24:36.850 So we're pretty close, we're pretty close. 00:24:36.850 --> 00:24:38.740 But the thing about binary search, recall, 00:24:38.740 --> 00:24:41.080 is that this is now useful information. 00:24:41.080 --> 00:24:44.920 If the numbers are sorted behind these doors all, of the doors to the left 00:24:44.920 --> 00:24:48.790 should presumably be lower than 5, and all of the doors to the right 00:24:48.790 --> 00:24:51.340 should presumably be larger than 5. 00:24:51.340 --> 00:24:54.430 Now, I might kind of cut a corner here and be like, well, if this is 5, 00:24:54.430 --> 00:24:57.220 6 is probably right next door, literally. 00:24:57.220 --> 00:24:59.530 But again, algorithmically, how might we do this? 00:24:59.530 --> 00:25:02.420 We don't want to necessarily consider these special cases. 00:25:02.420 --> 00:25:06.580 So more generally, it looks like I now have an array of size 3. 00:25:06.580 --> 00:25:10.400 So let me go ahead and apply that same algorithm, voila, to the middle. 00:25:10.400 --> 00:25:11.560 Now I have the number 7. 00:25:11.560 --> 00:25:14.888 And now it's becoming pretty clear that if the number 6 is present, 00:25:14.888 --> 00:25:16.180 it's probably behind this door. 00:25:16.180 --> 00:25:21.400 And indeed, if I now look at my remaining array of size 1, and voila, 00:25:21.400 --> 00:25:23.470 in the middle there is that number 6. 00:25:23.470 --> 00:25:27.700 So this time, I only had to open up three doors instead of all seven, 00:25:27.700 --> 00:25:31.390 potentially, or maybe all six doors to find my way to that number, 00:25:31.390 --> 00:25:34.060 because I was given this additional ingredient of all 00:25:34.060 --> 00:25:35.860 of those numbers being sorted. 00:25:35.860 --> 00:25:38.230 So it would seem, then, that you can apply 00:25:38.230 --> 00:25:40.960 the better, more efficient, better designed algorithm, now known 00:25:40.960 --> 00:25:46.060 as binary search, if only someone like Joe would sort the numbers for you 00:25:46.060 --> 00:25:46.820 in advance. 00:25:46.820 --> 00:25:50.110 So let's consider now a little more algorithmically 00:25:50.110 --> 00:25:51.440 how we might implement this. 00:25:51.440 --> 00:25:54.130 So with binary search, let me propose this pseudocode. 00:25:54.130 --> 00:25:57.980 If the number is behind the middle door, return true-- we found it. 00:25:57.980 --> 00:26:00.562 So if we got lucky, then we might very well 00:26:00.562 --> 00:26:02.520 have found the number 6 behind the middle door, 00:26:02.520 --> 00:26:03.687 and we would have been done. 00:26:03.687 --> 00:26:04.660 But that didn't happen. 00:26:04.660 --> 00:26:06.785 And in the general case that probably won't happen. 00:26:06.785 --> 00:26:10.452 So if the number is less than that behind the middle door, then 00:26:10.452 --> 00:26:12.910 just like with the phone book, I'm going to go to the left, 00:26:12.910 --> 00:26:17.020 and I'm going to search the left half of the remaining doors in the array. 00:26:17.020 --> 00:26:20.045 Else if the number is greater than that behind the middle door, 00:26:20.045 --> 00:26:22.420 then like the phone book I'm going to go ahead and search 00:26:22.420 --> 00:26:24.370 the right half of the phone book. 00:26:24.370 --> 00:26:28.000 But there might still be one final case potentially, 00:26:28.000 --> 00:26:32.170 whereby if there's no doors left at all, or no doors in the first place, 00:26:32.170 --> 00:26:36.430 I should at least have this one special case where I do say return false. 00:26:36.430 --> 00:26:40.272 For instance, if 6, for whatever reason, weren't be among those doors 00:26:40.272 --> 00:26:41.980 and I were searching for it, I still need 00:26:41.980 --> 00:26:44.860 to be able to handle that situation where I can say definitively 00:26:44.860 --> 00:26:48.280 return false if I'm left with no further doors to search. 00:26:48.280 --> 00:26:51.750 So here, then, might be the pseudocode for this algorithm a bit more formally. 00:26:51.750 --> 00:26:53.820 Now let's consider the analysis thereof. 00:26:53.820 --> 00:26:58.620 Before, where we left off, linear search was big O of n. 00:26:58.620 --> 00:27:00.630 Linear search was big O of n. 00:27:00.630 --> 00:27:05.310 This time let's consider where binary search actually falls into place 00:27:05.310 --> 00:27:06.908 by asking a different question. 00:27:06.908 --> 00:27:09.450 I'm going to go ahead and go back and ask this question now-- 00:27:09.450 --> 00:27:13.920 what's an upper bound on the running time of binary search? 00:27:13.920 --> 00:27:16.920 An upper bound on the running time of binary search-- 00:27:16.920 --> 00:27:21.210 and go ahead and buzz in, if you'd like, similarly to before. 00:27:21.210 --> 00:27:25.510 What's an upper bound on the running time of binary search? 00:27:25.510 --> 00:27:30.960 And you can see here answers are getting pretty dominant around log n. 00:27:30.960 --> 00:27:33.780 And indeed, that jives with exactly what we did in week 0. 00:27:33.780 --> 00:27:36.240 The correct answer is indeed log of n, because that's 00:27:36.240 --> 00:27:38.240 going to be the maximum number of times that you 00:27:38.240 --> 00:27:40.438 can take a list or an array of a given size 00:27:40.438 --> 00:27:42.480 and split it in half and half and half, until you 00:27:42.480 --> 00:27:45.360 find the number you're looking for, or ultimately you 00:27:45.360 --> 00:27:47.320 don't find that number at all. 00:27:47.320 --> 00:27:54.460 Meanwhile, if we consider now not just the upper bound on this algorithm-- 00:27:54.460 --> 00:27:57.930 so in the worst case, binary search takes big O of log n-- 00:27:57.930 --> 00:28:00.210 now let's consider a related question which 00:28:00.210 --> 00:28:03.420 is, what's a lower bound on the running time of this same algorithm? 00:28:03.420 --> 00:28:05.530 What's a lower bound on the running time? 00:28:05.530 --> 00:28:07.470 I'll go ahead and pluck this one off myself 00:28:07.470 --> 00:28:10.680 and go back to some of the suggestions thus far. 00:28:10.680 --> 00:28:13.573 In the best case, maybe, too, you do get lucky, 00:28:13.573 --> 00:28:15.990 and the number you're looking for, 6 or some other number, 00:28:15.990 --> 00:28:18.250 is smack dab in the middle of the array. 00:28:18.250 --> 00:28:20.940 And so maybe indeed you can get away with just one step. 00:28:20.940 --> 00:28:25.110 And indeed, a lower bound on binary search now might very well just 00:28:25.110 --> 00:28:29.250 be an omega of 1, because in that best case you just get lucky, 00:28:29.250 --> 00:28:33.250 and it's right where you happen to start, in this case in the middle. 00:28:33.250 --> 00:28:34.690 So we seem to have a range there. 00:28:34.690 --> 00:28:37.950 But strictly speaking, it would seem that binary search is better. 00:28:37.950 --> 00:28:40.710 Binary search is better than linear search, 00:28:40.710 --> 00:28:44.880 because as n gets big, big, big, you can really feel that difference. 00:28:44.880 --> 00:28:47.970 In fact, recall from week 0 we played a little bit with these light bulbs. 00:28:47.970 --> 00:28:50.698 And right now, all 64 of these light bulbs are on. 00:28:50.698 --> 00:28:53.490 And let's consider for a moment, just to put this into perspective, 00:28:53.490 --> 00:28:56.400 how long it would take to use linear search to find 00:28:56.400 --> 00:28:58.410 one light bulb among these 64. 00:28:58.410 --> 00:29:01.920 And recall that in the worst case, maybe the light bulb, or the number 00:29:01.920 --> 00:29:04.480 that we're looking for, is way down there at the end, 00:29:04.480 --> 00:29:06.100 but we don't know in advance. 00:29:06.100 --> 00:29:09.900 And so Sumner, if you wouldn't mind executing linear search 00:29:09.900 --> 00:29:13.410 on these light bulbs, let's just get a feel for the efficiency 00:29:13.410 --> 00:29:15.870 or inefficiency of this algorithm. 00:29:15.870 --> 00:29:18.360 Linear search in light bulb form. 00:29:18.360 --> 00:29:21.060 So you'll notice that one light bulb at a time 00:29:21.060 --> 00:29:24.570 is going out, implying that I've searched that door, searched that door, 00:29:24.570 --> 00:29:25.890 searched that door. 00:29:25.890 --> 00:29:27.990 But we've only gone through 10 or so bulbs, 00:29:27.990 --> 00:29:30.400 and we've got another 50-plus to go. 00:29:30.400 --> 00:29:34.920 And you can see that if we look inside of these doors one per second, 00:29:34.920 --> 00:29:39.010 or turn off these light bulbs one per second, it's going to take a long time. 00:29:39.010 --> 00:29:42.310 In fact, it doesn't seem worthwhile to even wait until the very end. 00:29:42.310 --> 00:29:45.143 So Sumner, if you wouldn't mind, let's bring all the lights back up, 00:29:45.143 --> 00:29:48.300 and let's try once more another algorithm, this one binary search, just 00:29:48.300 --> 00:29:52.500 to get, again, a feel of what the running time is of an algorithm, 00:29:52.500 --> 00:29:54.750 like binary search that runs in logarithmic time. 00:29:54.750 --> 00:29:58.260 So in just a moment, we'll go ahead and execute binary search 00:29:58.260 --> 00:30:01.510 on these light bulbs, the idea being that there's one bulb we care about. 00:30:01.510 --> 00:30:04.620 Let's see how fast we can get down to just one bulb out of 64. 00:30:04.620 --> 00:30:07.440 So Sumner, on your marks, get set, go. 00:30:10.850 --> 00:30:14.102 And we're done just a few steps later. 00:30:14.102 --> 00:30:15.560 And then have this sole light bulb. 00:30:15.560 --> 00:30:16.880 That was so much faster. 00:30:16.880 --> 00:30:20.330 And in fact, we did this deliberately one iteration at a time. 00:30:20.330 --> 00:30:24.920 The algorithm that we just executed with Sumner's and Matt's help, 00:30:24.920 --> 00:30:28.400 algorithmically was operating at what's called 1 hertz, 1 hertz. 00:30:28.400 --> 00:30:31.400 And if you're unfamiliar with hertz, it's just one something per second. 00:30:31.400 --> 00:30:34.855 It's very often used in physics or just in discussions of electricity 00:30:34.855 --> 00:30:35.480 more generally. 00:30:35.480 --> 00:30:38.090 And indeed, in this case if you're doing one thing per second, 00:30:38.090 --> 00:30:41.930 that first algorithm, linear search, might have taken us like 64 seconds 00:30:41.930 --> 00:30:44.480 to get all the way to that final light bulb. 00:30:44.480 --> 00:30:47.390 But that second algorithm was logarithmic. 00:30:47.390 --> 00:30:54.350 And so by going from 64 to 32 to 16 to 8 to 4 to 2 to 1, 00:30:54.350 --> 00:30:58.850 we get to the final result much faster, even going at the same pace. 00:30:58.850 --> 00:31:01.500 So in fact, if you think of your computer's CPU, 00:31:01.500 --> 00:31:03.560 CPUs are also measured in hertz-- 00:31:03.560 --> 00:31:06.750 H-E-R-T-Z. Probably measured in gigahertz, 00:31:06.750 --> 00:31:08.960 which is billions of hertz per second. 00:31:08.960 --> 00:31:11.960 So your CPU, the brain of your computer, If it's 1 gigahertz, 00:31:11.960 --> 00:31:15.840 that means it can literally do 1 billion things at a time. 00:31:15.840 --> 00:31:18.770 And here we have this sort of simpler setup of just light bulbs doing 00:31:18.770 --> 00:31:20.510 one thing per second. 00:31:20.510 --> 00:31:24.290 Your computer can do 1 billion of these kinds of operations at once. 00:31:24.290 --> 00:31:27.260 So just imagine, therefore, how much these savings tend 00:31:27.260 --> 00:31:30.743 to add up over time if you can take big bites out of these problems at once, 00:31:30.743 --> 00:31:32.660 as opposed to doing things like we did in week 00:31:32.660 --> 00:31:36.620 0, just one single step at a time. 00:31:36.620 --> 00:31:37.380 All right. 00:31:37.380 --> 00:31:39.922 Well, let's now go ahead and start to translate this to code. 00:31:39.922 --> 00:31:42.677 We have enough tools in our toolkit in C that I think, 00:31:42.677 --> 00:31:44.510 based on our discussion of arrays last week, 00:31:44.510 --> 00:31:48.330 we can now actually start to build something in code on our own. 00:31:48.330 --> 00:31:51.600 So I'm going to go ahead and create a file here in just a moment, 00:31:51.600 --> 00:31:55.430 in CS50 IDE, called, for instance, numbers.c. 00:31:55.430 --> 00:32:00.320 Let me go ahead and translate this to a file in C code called numbers.c. 00:32:00.320 --> 00:32:03.650 The goal at hand is just to implement linear search in code, 00:32:03.650 --> 00:32:06.380 just so that we're no longer waving our hands at the pseudocode 00:32:06.380 --> 00:32:08.520 but doing things a little more concretely. 00:32:08.520 --> 00:32:10.880 So I'm going to go ahead and include cs50.h. 00:32:10.880 --> 00:32:13.160 I'm going to go ahead and include stdio.h. 00:32:13.160 --> 00:32:15.530 And I'm going to start with no command line arguments, 00:32:15.530 --> 00:32:18.647 like we left off last week, but just with main void. 00:32:18.647 --> 00:32:21.230 And I'm going to go ahead and give myself an array of numbers, 00:32:21.230 --> 00:32:22.820 seven numbers, just like the doors. 00:32:22.820 --> 00:32:25.670 And I'm going to go ahead and say int numbers. 00:32:25.670 --> 00:32:28.250 And then this is a little trick that we didn't see last week, 00:32:28.250 --> 00:32:30.740 but it's handy for creating an array when 00:32:30.740 --> 00:32:33.110 you know in advance what numbers you want, which I do, 00:32:33.110 --> 00:32:36.440 because I'm going to mimic the doors that Joe kindly set up for us here, I'm 00:32:36.440 --> 00:32:44.120 going to go ahead and say give me an array that is equal to 4, 6, 8, 2, 7, 00:32:44.120 --> 00:32:45.480 5, 0. 00:32:45.480 --> 00:32:47.480 And this is the feature we didn't see last week. 00:32:47.480 --> 00:32:50.690 If you know in advance the numbers that you want to assign to an array, 00:32:50.690 --> 00:32:54.560 you actually don't have to bother specifying the size of the array 00:32:54.560 --> 00:32:55.790 explicitly. 00:32:55.790 --> 00:32:58.640 The compiler can figure that out intelligently for you. 00:32:58.640 --> 00:33:01.960 But you can use these curly braces with commas 00:33:01.960 --> 00:33:05.420 inside to enumerate from left to right the values 00:33:05.420 --> 00:33:07.400 that you want to put into that array. 00:33:07.400 --> 00:33:09.727 So after this line 6 has executed in my computer, 00:33:09.727 --> 00:33:11.810 I'm going to be left with an array called numbers, 00:33:11.810 --> 00:33:14.480 inside of which are seven integers listed 00:33:14.480 --> 00:33:17.895 from left to right in the computer's memory, so to speak, in this way. 00:33:17.895 --> 00:33:19.770 Now, what do I want to do with these numbers? 00:33:19.770 --> 00:33:21.560 Well, let's implement linear search. 00:33:21.560 --> 00:33:24.170 Linear search, as we latched on to earlier, 00:33:24.170 --> 00:33:27.110 is a searching from left to right or equivalently right to left-- 00:33:27.110 --> 00:33:29.460 but convention tends to go left to right. 00:33:29.460 --> 00:33:31.700 So I'm going to do a standard for loop. 00:33:31.700 --> 00:33:34.730 For int i gets 0, i is less than-- 00:33:34.730 --> 00:33:36.980 I'm going to keep it simple for now and hardcode this, 00:33:36.980 --> 00:33:40.400 but we could clean this up if we want, and I'm going to do i++ on each 00:33:40.400 --> 00:33:41.130 iteration. 00:33:41.130 --> 00:33:45.330 So I'm pretty sure that my line 8 will induce a for loop that 00:33:45.330 --> 00:33:47.090 iterates eight total times. 00:33:47.090 --> 00:33:49.460 And what question do I want to ask on each iteration? 00:33:49.460 --> 00:33:56.270 Well, if the numbers array at location i equals equals-- 00:33:56.270 --> 00:34:00.170 for instance, the number I was searching for initially, 00:34:00.170 --> 00:34:02.060 let's go ahead and search for 0-- 00:34:02.060 --> 00:34:03.620 then what do I want to do? 00:34:03.620 --> 00:34:06.860 Let me go ahead and print out something arbitrary but useful, 00:34:06.860 --> 00:34:09.787 like "Found," quote, unquote, so the human knows. 00:34:09.787 --> 00:34:12.620 And then let me go ahead, and just for good measure, let me go ahead 00:34:12.620 --> 00:34:14.000 and return 0. 00:34:14.000 --> 00:34:16.110 And we'll come back to that in just a moment. 00:34:16.110 --> 00:34:20.389 But at the end of this program, I'm also going to do this-- printf "Not found" 00:34:20.389 --> 00:34:21.610 with a backslash n. 00:34:21.610 --> 00:34:23.719 And then I'm going to go ahead and return 1. 00:34:23.719 --> 00:34:26.000 But before we tease apart those returns, just 00:34:26.000 --> 00:34:27.949 consider the code in the aggregate. 00:34:27.949 --> 00:34:30.020 Here's my entire main function. 00:34:30.020 --> 00:34:32.750 And on line 6, to recap, I initialized the array, 00:34:32.750 --> 00:34:37.909 just as we did at the very beginning, with a seemingly random list of numbers 00:34:37.909 --> 00:34:39.020 behind the doors. 00:34:39.020 --> 00:34:41.570 Then on line 8, I'm going to iterate with this for loop 00:34:41.570 --> 00:34:44.600 seven total times, incrementing i in each turn. 00:34:44.600 --> 00:34:48.469 And then line 10, just like I was opening the doors one at a time, 00:34:48.469 --> 00:34:51.469 I'm going to check if the i-th number in this array 00:34:51.469 --> 00:34:54.949 equals equals the number I care about, 0, with that first demo. 00:34:54.949 --> 00:34:56.570 I'm going to print "Found." 00:34:56.570 --> 00:35:00.110 Otherwise-- not else, per se-- but otherwise, 00:35:00.110 --> 00:35:04.550 if I go through this entire loop, checking if, if, if, if, if, and I 00:35:04.550 --> 00:35:08.870 never actually find 0, I'm going to have this catchall at the end that just 00:35:08.870 --> 00:35:13.790 says no matter what, if you reach line 16, print "Not found," 00:35:13.790 --> 00:35:15.505 and then return 1. 00:35:15.505 --> 00:35:16.880 Now, this is a bit of a subtlety. 00:35:16.880 --> 00:35:20.800 But could someone remind us what's going on with the return 0 00:35:20.800 --> 00:35:26.566 on line 13 and the return 1 on line 17? 00:35:26.566 --> 00:35:31.930 Why 0 in 1, and why am I returning at all? 00:35:31.930 --> 00:35:35.630 What problem is this solving for me? 00:35:35.630 --> 00:35:37.380 Even though most of our programs thus far, 00:35:37.380 --> 00:35:39.150 we haven't bothered too much with this. 00:35:39.150 --> 00:35:40.710 Demi, is it? 00:35:40.710 --> 00:35:42.340 What do you think? 00:35:42.340 --> 00:35:45.600 AUDIENCE: It's Demi, but basically, return 0 00:35:45.600 --> 00:35:49.490 is like it was executed correctly, or it found it, 00:35:49.490 --> 00:35:52.170 and it kind of exits that loop saying that it was found. 00:35:52.170 --> 00:35:58.375 And then return 1 is like the return false, and it exits as well. 00:35:58.375 --> 00:35:59.250 DAVID MALAN: Exactly. 00:35:59.250 --> 00:36:00.917 And "exit" really is the operative word. 00:36:00.917 --> 00:36:02.580 In main, when you are done-- 00:36:02.580 --> 00:36:05.340 ready to quit the program, as we've done with the word 00:36:05.340 --> 00:36:09.000 "quit" in some of our pseudocode in the past, you can literally return a value. 00:36:09.000 --> 00:36:11.970 And recall at the end of last week, we introduced the fact 00:36:11.970 --> 00:36:14.250 that main always returns an int. 00:36:14.250 --> 00:36:17.130 You and I have ignored that for at least a week or two, 00:36:17.130 --> 00:36:20.130 but sometimes it's useful to return an explicit value, 00:36:20.130 --> 00:36:22.410 whether it's for autograding purposes, whether it's 00:36:22.410 --> 00:36:24.735 for automated testing of your code in the real world, 00:36:24.735 --> 00:36:27.360 or just so it's a signal to the user that something indeed went 00:36:27.360 --> 00:36:28.030 Wrong. 00:36:28.030 --> 00:36:30.180 So you can return a value from main. 00:36:30.180 --> 00:36:33.507 And as Demi proposed, 0 means "all is well." 00:36:33.507 --> 00:36:35.340 And it's a little counter-intuitive, because 00:36:35.340 --> 00:36:37.330 thus far true tends to be a good thing. 00:36:37.330 --> 00:36:39.420 But in this case, 0 is a good thing. 00:36:39.420 --> 00:36:40.140 All is well. 00:36:40.140 --> 00:36:41.220 It's success. 00:36:41.220 --> 00:36:44.700 And if you return any other value, for instance 1, 00:36:44.700 --> 00:36:46.660 that indicates that something went wrong. 00:36:46.660 --> 00:36:51.390 So the reason I'm printing out, after the word "Found" I'm returning 0, 00:36:51.390 --> 00:36:54.498 is so that effectively the program exits at that point. 00:36:54.498 --> 00:36:56.790 I don't want to keep going again and again if I already 00:36:56.790 --> 00:36:58.320 found the number I care about. 00:36:58.320 --> 00:37:02.160 And down here, this one admittedly isn't strictly necessary, 00:37:02.160 --> 00:37:06.120 because if I hit line 16 and maybe deleted line 17, 00:37:06.120 --> 00:37:07.740 the program's going to end anyway. 00:37:07.740 --> 00:37:10.800 But there wouldn't be that so-called exit status 00:37:10.800 --> 00:37:12.960 that we discussed last week briefly, whereby 00:37:12.960 --> 00:37:15.293 you can kind of signal to the computer whether something 00:37:15.293 --> 00:37:17.190 was successful or unsuccessful. 00:37:17.190 --> 00:37:21.690 And the reason that 0 is a good thing and 1 or any other number is not, 00:37:21.690 --> 00:37:24.780 consider how many things can go wrong in programs that you write 00:37:24.780 --> 00:37:26.607 or that companies in the real world write 00:37:26.607 --> 00:37:28.440 when you get those error messages, sometimes 00:37:28.440 --> 00:37:30.060 with those cryptic error codes. 00:37:30.060 --> 00:37:32.430 There are hundreds, thousands of problems 00:37:32.430 --> 00:37:35.790 that might happen in a computer program that could be that many error 00:37:35.790 --> 00:37:37.680 codes that you see on the screen, reasons 00:37:37.680 --> 00:37:40.630 explaining why the program crashed or froze or the like. 00:37:40.630 --> 00:37:46.020 But 0 is sort of special in that it's just one value that the world has 00:37:46.020 --> 00:37:47.760 decided means "success." 00:37:47.760 --> 00:37:50.460 So there's only one way to get your program right, in a sense, 00:37:50.460 --> 00:37:54.000 but there's so many millions of ways in which things can go wrong. 00:37:54.000 --> 00:37:57.810 And that's why humans have adopted that particular convention. 00:37:57.810 --> 00:37:58.310 All right. 00:37:58.310 --> 00:38:02.378 But let's consider now not just numbers, but let's make things more interesting. 00:38:02.378 --> 00:38:04.170 Besides the doors, suppose that we actually 00:38:04.170 --> 00:38:06.300 had people's names behind them. 00:38:06.300 --> 00:38:08.790 Well, let's go ahead and write a program this time 00:38:08.790 --> 00:38:13.130 that not only searches for numbers, but instead searches for names. 00:38:13.130 --> 00:38:16.850 So I'm going to go ahead and create a different file here called names.c. 00:38:16.850 --> 00:38:18.600 And I'm going to start a little similarly. 00:38:18.600 --> 00:38:23.460 I'm going to include cs50.h at the top, I'm going to include stdio at the top. 00:38:23.460 --> 00:38:26.332 But I'm also this time going to include string.h, 00:38:26.332 --> 00:38:28.040 which we introduced briefly last week, so 00:38:28.040 --> 00:38:31.110 that we have access to strlen for getting the length of a string, 00:38:31.110 --> 00:38:33.100 and, it turns out, some other functions. 00:38:33.100 --> 00:38:35.790 Let me go ahead and declare int main void as usual. 00:38:35.790 --> 00:38:38.190 And then inside here, I need some arbitrary names. 00:38:38.190 --> 00:38:40.260 So let's come up with seven names here. 00:38:40.260 --> 00:38:43.825 And here, too, I can declare an array just as I did before. 00:38:43.825 --> 00:38:45.450 But it doesn't have to store only ints. 00:38:45.450 --> 00:38:47.400 It can store strings instead. 00:38:47.400 --> 00:38:49.560 So I've changed the data type from int to string, 00:38:49.560 --> 00:38:52.480 and I've changed the variable name from numbers to names. 00:38:52.480 --> 00:38:55.350 And I can still use this new curly brace notation, 00:38:55.350 --> 00:39:01.470 and I can give myself a name like Bill, and maybe Charlie, and maybe Fred, 00:39:01.470 --> 00:39:07.480 and maybe George, and maybe Ginny, and maybe Percy, and lastly, 00:39:07.480 --> 00:39:09.510 maybe a name like Ron. 00:39:09.510 --> 00:39:12.700 And it just barely fits on my screen. 00:39:12.700 --> 00:39:16.110 So with that said, I now have this array of names. 00:39:16.110 --> 00:39:19.620 And beyond there being a perhaps obvious pattern to them, 00:39:19.620 --> 00:39:23.310 there's a second less obvious, or maybe obvious, pattern to them. 00:39:23.310 --> 00:39:28.590 How would you describe the list of names I arbitrarily just came up with? 00:39:28.590 --> 00:39:31.350 What's a useful characteristic of them? 00:39:31.350 --> 00:39:34.410 What do you notice about these names? 00:39:34.410 --> 00:39:37.530 And there's at least two right answers to this question, I think. 00:39:37.530 --> 00:39:40.370 What do you notice about these names? 00:39:40.370 --> 00:39:41.533 Jack? 00:39:41.533 --> 00:39:43.200 AUDIENCE: They're in alphabetical order. 00:39:43.200 --> 00:39:43.500 DAVID MALAN: Yes. 00:39:43.500 --> 00:39:46.332 So beyond being the names of the Weasley children in Harry Potter, 00:39:46.332 --> 00:39:47.790 they're also in alphabetical order. 00:39:47.790 --> 00:39:49.080 And that's the more salient detail. 00:39:49.080 --> 00:39:50.970 For our purposes, I've had the forethought 00:39:50.970 --> 00:39:53.340 this time to sort these names in advance. 00:39:53.340 --> 00:39:56.580 And if I've sorted these names, that means implicitly 00:39:56.580 --> 00:39:59.670 I can use a better algorithm than linear search. 00:39:59.670 --> 00:40:02.160 I can use, for instance, our old binary search. 00:40:02.160 --> 00:40:05.615 But let's go ahead first and just search them naively for now. 00:40:05.615 --> 00:40:07.740 Let's still apply linear search, because, you know, 00:40:07.740 --> 00:40:12.925 what we haven't yet done is necessarily compare strings against one another. 00:40:12.925 --> 00:40:15.300 We've done a lot of comparisons of numbers like integers. 00:40:15.300 --> 00:40:16.178 But what about names? 00:40:16.178 --> 00:40:17.470 So let me go ahead and do this. 00:40:17.470 --> 00:40:22.200 So for int i gets 0, just like before, i less than 7, i++-- 00:40:22.200 --> 00:40:25.200 and I'm doing this only because I know in advance there are seven names. 00:40:25.200 --> 00:40:27.630 I think we could probably improve the design of this code, 00:40:27.630 --> 00:40:31.200 too, by having a variable or a constant storing that value. 00:40:31.200 --> 00:40:34.470 But I'm going to keep it simple and focus only on the new details for now. 00:40:34.470 --> 00:40:38.280 And it turns out, for reasons we'll explore in more detail next week, 00:40:38.280 --> 00:40:42.560 it is not sufficient to do what we did before and do something like this 00:40:42.560 --> 00:40:44.330 if I'm searching for "Ron." 00:40:44.330 --> 00:40:49.340 It turns out that in C, you can't use equals equals to compare two strings. 00:40:49.340 --> 00:40:52.080 You can for an int, you can for a char. 00:40:52.080 --> 00:40:53.870 And we've done both of those in the past. 00:40:53.870 --> 00:40:57.920 But there's a subtlety that we'll dive into in more detail next week that 00:40:57.920 --> 00:41:00.003 means you can't actually do this. 00:41:00.003 --> 00:41:02.420 And this is curious, because if you have prior programming 00:41:02.420 --> 00:41:06.270 experience in languages like Python or the like, you can do this. 00:41:06.270 --> 00:41:09.770 So in C you can't, but we'll see next time why. 00:41:09.770 --> 00:41:13.040 But for now, it turns out that C can solve this problem, 00:41:13.040 --> 00:41:15.600 and historically the way you do this is with a function. 00:41:15.600 --> 00:41:18.560 So inside of the string.h header file, there 00:41:18.560 --> 00:41:22.790 is not only a declaration for strlen, the length of a string like last week. 00:41:22.790 --> 00:41:25.400 There's another function called strcmp. 00:41:25.400 --> 00:41:29.000 And "stir compare," for short, S-T-R-C-M-P, 00:41:29.000 --> 00:41:33.590 allows me to pass in two strings, one string that I want to compare against 00:41:33.590 --> 00:41:34.880 another string. 00:41:34.880 --> 00:41:37.250 So it's not quite the same syntax. 00:41:37.250 --> 00:41:38.830 Indeed, it's a little harder to read. 00:41:38.830 --> 00:41:40.580 It's not quite as simple as equals equals. 00:41:40.580 --> 00:41:43.760 But strcmp, if we read the documentation for it, 00:41:43.760 --> 00:41:46.430 will tell us that this compares two strings. 00:41:46.430 --> 00:41:49.160 And it returns one of three possible values. 00:41:49.160 --> 00:41:54.410 If those two strings are equal, that is, identically the same letter for letter, 00:41:54.410 --> 00:41:58.550 then this function is going to return 0, it turns out. 00:41:58.550 --> 00:42:03.470 If the first string is supposed to come before the second string 00:42:03.470 --> 00:42:06.760 alphabetically, in some sense, then this function 00:42:06.760 --> 00:42:09.140 is going to return a negative value. 00:42:09.140 --> 00:42:13.550 If the first string is supposed to come after the second string alphabetically, 00:42:13.550 --> 00:42:16.320 if you will, then it's going to return a positive value. 00:42:16.320 --> 00:42:20.690 So there's three possible outcomes-- either equal to 0, or less than 0, 00:42:20.690 --> 00:42:22.460 or greater than 0. 00:42:22.460 --> 00:42:26.090 But you'll notice, and in fact, if you look at the documentation some time, 00:42:26.090 --> 00:42:30.890 it doesn't specify what value less than 0 or what value greater than 0. 00:42:30.890 --> 00:42:34.790 You have to just check for any negative value or any positive value. 00:42:34.790 --> 00:42:37.070 And I also told a bit of a white lie a moment ago. 00:42:37.070 --> 00:42:39.620 This does not check things alphabetically, 00:42:39.620 --> 00:42:42.530 even though it coincidentally does sometimes. 00:42:42.530 --> 00:42:45.590 Actually compares strings in what's called ASCII order, 00:42:45.590 --> 00:42:49.670 or ASCIIbetically which is kind of a goofy way of describing 00:42:49.670 --> 00:42:54.830 this function looks at every character in the two strings, from left to right, 00:42:54.830 --> 00:42:59.570 it checks the ASCII values of them, and then it compares those ASCII values 00:42:59.570 --> 00:43:01.440 character by character. 00:43:01.440 --> 00:43:03.680 And if the ASCII value is less than the other, 00:43:03.680 --> 00:43:06.780 then it returns a negative value or vice versa. 00:43:06.780 --> 00:43:10.790 So if you have, for instance, the letter A, capital A in the string, 00:43:10.790 --> 00:43:12.963 that gets converted first to 65. 00:43:12.963 --> 00:43:15.380 And then if you have an A in the other string capitalized, 00:43:15.380 --> 00:43:18.455 it, too, gets compared to 65, and those would be equal. 00:43:18.455 --> 00:43:21.080 But of course, all of these names have more than one character, 00:43:21.080 --> 00:43:25.620 so this ASCII order, or ASCIIbetical, precedes left to right 00:43:25.620 --> 00:43:29.750 so that strcmp checks every character in the names for you. 00:43:29.750 --> 00:43:33.290 And it stops when it hits that terminating null character. 00:43:33.290 --> 00:43:35.480 Recall that strings, underneath the hood, 00:43:35.480 --> 00:43:39.780 always end in C with this backslash 0, or eight 0 bits. 00:43:39.780 --> 00:43:43.470 So that's how strcmp knows when to stop comparing values. 00:43:43.470 --> 00:43:46.257 But if I go ahead and find someone like Ron, let me go ahead 00:43:46.257 --> 00:43:47.840 and print out quote, unquote, "Found." 00:43:47.840 --> 00:43:50.840 And like before, I'll go ahead and return, like Demi proposed, 00:43:50.840 --> 00:43:54.710 0, just to imply that all is successful. 00:43:54.710 --> 00:43:57.360 Otherwise, if we get all the way to the bottom of my code, 00:43:57.360 --> 00:44:00.620 I'm going to print out "Not found" to tell the story that we did not 00:44:00.620 --> 00:44:04.310 find Ron in this array, even though he does happen to be there, 00:44:04.310 --> 00:44:06.570 and I'm going to go ahead and return 1. 00:44:06.570 --> 00:44:08.480 So even though I've hardcoded everything-- 00:44:08.480 --> 00:44:11.960 to hardcode something in a program means to type it out explicitly-- 00:44:11.960 --> 00:44:15.200 you could imagine using a command line argument like last week 00:44:15.200 --> 00:44:16.310 to get user's input. 00:44:16.310 --> 00:44:17.720 Who would you like to search for? 00:44:17.720 --> 00:44:20.900 You could imagine using get_string to get user's input and ask them, 00:44:20.900 --> 00:44:22.460 who would you like to search for? 00:44:22.460 --> 00:44:26.150 But for now, just for demonstration sake, I've used only Ron's name. 00:44:26.150 --> 00:44:28.310 And if I haven't made any typos-- 00:44:28.310 --> 00:44:35.030 let me go ahead and type in make names, Enter, so far so good, ./names. 00:44:35.030 --> 00:44:37.400 And hopefully, we'll see, indeed, "Found," 00:44:37.400 --> 00:44:42.140 because "Ron" is very much in this array of seven siblings. 00:44:42.140 --> 00:44:44.120 But the building blocks that are new here 00:44:44.120 --> 00:44:47.540 are, again, the fact that when we declare an array of some fixed size 00:44:47.540 --> 00:44:49.580 we don't strictly need to put a number here, 00:44:49.580 --> 00:44:51.890 and we have this curly brace notation when we 00:44:51.890 --> 00:44:54.110 know the array's contents in advance. 00:44:54.110 --> 00:44:56.150 But perhaps lastly and most powerfully, we 00:44:56.150 --> 00:45:01.730 do have this function in C called strcmp that will allow us to actually 00:45:01.730 --> 00:45:05.370 store and compare strings in this way. 00:45:05.370 --> 00:45:07.550 So let me pause here and just ask if there's 00:45:07.550 --> 00:45:12.320 any questions about how we translated these ideas to code for numbers, 00:45:12.320 --> 00:45:15.650 and how we translated these ideas to code for now 00:45:15.650 --> 00:45:19.280 names, each time using linear search, not, binary. 00:45:19.280 --> 00:45:21.670 Caleb, question? 00:45:21.670 --> 00:45:22.400 AUDIENCE: Yeah. 00:45:22.400 --> 00:45:29.585 So would that program still work if "Ron," for example, was like all caps, 00:45:29.585 --> 00:45:30.960 like if you're trying to search-- 00:45:30.960 --> 00:45:35.643 like, if the cases are different in terms of uppercase and lowercase? 00:45:35.643 --> 00:45:37.060 DAVID MALAN: Really good question. 00:45:37.060 --> 00:45:40.510 And let me propose an instinct that's useful to acquire in general-- 00:45:40.510 --> 00:45:41.457 when in doubt, try it. 00:45:41.457 --> 00:45:42.790 So I'm going to do exactly that. 00:45:42.790 --> 00:45:44.998 I do happen to know the answer, but suppose I didn't. 00:45:44.998 --> 00:45:47.560 Let me go ahead and change "Ron" to all caps, just because. 00:45:47.560 --> 00:45:49.420 Maybe the human, the Caps Lock key was on, 00:45:49.420 --> 00:45:51.220 and they typed it in a little sloppily. 00:45:51.220 --> 00:45:53.300 Let me go ahead and make no other changes. 00:45:53.300 --> 00:45:57.820 Notice that I'm leaving the original array alone with only a capital R. 00:45:57.820 --> 00:46:01.900 Let me remake this program, make name, ./names. 00:46:01.900 --> 00:46:06.950 And voila, he's still, in fact, found. 00:46:10.860 --> 00:46:11.570 Stand by. 00:46:14.690 --> 00:46:15.960 Oh, OK. 00:46:19.910 --> 00:46:22.280 Caleb, you have just helped me unearth, a bug that 00:46:22.280 --> 00:46:23.810 was latent in the previous example. 00:46:23.810 --> 00:46:26.600 None of you should have accepted the fact 00:46:26.600 --> 00:46:29.317 that the previous program worked with "RON," because I didn't 00:46:29.317 --> 00:46:30.900 practice literally what I'm preaching. 00:46:30.900 --> 00:46:34.070 So Caleb, hold that thought for just a moment so I can rewind a little bit 00:46:34.070 --> 00:46:35.540 and fix my apparent bug. 00:46:35.540 --> 00:46:37.340 So "RON" was indeed found. 00:46:37.340 --> 00:46:39.650 But he wasn't found because "RON" was found. 00:46:39.650 --> 00:46:41.240 I did something stupid here. 00:46:41.240 --> 00:46:45.440 And it's perhaps all the more pedagogically appropriate 00:46:45.440 --> 00:46:47.420 now to highlight that. 00:46:47.420 --> 00:46:53.870 So how did this program say "Ron" was found, even though this time it also 00:46:53.870 --> 00:46:57.050 says "RON" was found in all caps? 00:46:57.050 --> 00:47:00.480 And you know what, let me get a little curious here. 00:47:00.480 --> 00:47:03.920 Let me go ahead and search for, not even "Ron." 00:47:03.920 --> 00:47:07.310 How about we search for Ron's mom, "Molly"? 00:47:07.310 --> 00:47:08.600 Make names. 00:47:08.600 --> 00:47:09.680 All right. 00:47:09.680 --> 00:47:14.330 And now, just to reveal that I really did do something stupid, ./names. 00:47:14.330 --> 00:47:16.910 OK, now something's clearly wrong, right? 00:47:16.910 --> 00:47:22.473 I can even search for the father "Arthur", make names, ./name. 00:47:22.473 --> 00:47:25.640 It seems that I wrote you a program that just literally always says "Found." 00:47:25.640 --> 00:47:27.830 So we shouldn't have accepted this as correct. 00:47:27.830 --> 00:47:32.870 Can anyone spot the bug based on my definition thus far? 00:47:32.870 --> 00:47:34.880 Can anyone spot the bug? 00:47:34.880 --> 00:47:38.900 In the meantime, this isn't really a bad time to open up the duck 00:47:38.900 --> 00:47:41.130 and say, "Hello, duck. 00:47:41.130 --> 00:47:50.240 I am having a problem whereby my program is always printing Found 00:47:50.240 --> 00:47:54.320 even when someone is not in the array. 00:47:54.320 --> 00:47:56.630 And I could proceed to explain my logic to the duck, 00:47:56.630 --> 00:48:00.640 but hopefully Sophia can point me at the solution even faster than the duck. 00:48:00.640 --> 00:48:03.350 AUDIENCE: You need to compare the value that we 00:48:03.350 --> 00:48:05.910 received from strcmp with something. 00:48:05.910 --> 00:48:08.030 So we need to compare it with like 0 and make sure 00:48:08.030 --> 00:48:11.265 that we receive the value that they're equal. 00:48:11.265 --> 00:48:12.140 DAVID MALAN: Perfect. 00:48:12.140 --> 00:48:16.400 So I said the right thing, but I literally did not do the right thing. 00:48:16.400 --> 00:48:19.070 If I want to check for equality, I literally 00:48:19.070 --> 00:48:22.580 need to check the return value when comparing names bracket i 00:48:22.580 --> 00:48:24.380 against "Ron" to equal 0. 00:48:24.380 --> 00:48:28.130 Because only in the case when the return value of strcmp is 0 00:48:28.130 --> 00:48:31.280 do I actually have a match. 00:48:31.280 --> 00:48:36.590 By contrast, if the function returns a negative value or the function 00:48:36.590 --> 00:48:39.470 returns a positive value, that means it's not a match. 00:48:39.470 --> 00:48:42.230 That means that one name is supposed to come before the other 00:48:42.230 --> 00:48:44.240 or after the other. 00:48:44.240 --> 00:48:47.870 But the catch with my shorthand syntax here, which is not always 00:48:47.870 --> 00:48:52.880 an incorrect syntax to use, whenever you have a Boolean expression inside 00:48:52.880 --> 00:48:56.340 of which is a function call like this-- 00:48:56.340 --> 00:48:58.550 notice that the entirety of my Boolean expression 00:48:58.550 --> 00:49:01.830 is just a call, so to speak, to strcmp. 00:49:01.830 --> 00:49:06.350 I'm passing in two inputs, names bracket i and quote, unquote "Ron." 00:49:06.350 --> 00:49:10.070 And therefore, I'm expecting strcmp to return output, a so-called return 00:49:10.070 --> 00:49:10.730 value. 00:49:10.730 --> 00:49:15.200 That return value is going to be negative or positive or 0. 00:49:15.200 --> 00:49:18.530 And in fact, to be clear, if the first name being searched 00:49:18.530 --> 00:49:22.760 for is "Bill" and names bracket i or names bracket 0 00:49:22.760 --> 00:49:26.540 is "Bill," "Bill" comma "Ron" is effectively what 00:49:26.540 --> 00:49:28.730 my input is on the first iteration. 00:49:28.730 --> 00:49:31.820 "Bill," alphabetically and ASCIIbetically, 00:49:31.820 --> 00:49:34.520 comes before "Ron," which means it should 00:49:34.520 --> 00:49:36.980 be returning a negative value to me. 00:49:36.980 --> 00:49:41.450 And the problem with Boolean expressions is, as implemented in this context, 00:49:41.450 --> 00:49:44.870 is that only 0 is false. 00:49:44.870 --> 00:49:50.240 Any other return value is by definition true or a Yes answer, 00:49:50.240 --> 00:49:53.810 whether it's negative 1 or positive 1, negative 1 million 00:49:53.810 --> 00:49:55.340 or positive 1 million-- 00:49:55.340 --> 00:49:58.850 any non-zero value in a computer language like C 00:49:58.850 --> 00:50:02.660 is considered true, also known as truthy. 00:50:02.660 --> 00:50:06.260 Any value that is 0 is considered false, but only 00:50:06.260 --> 00:50:07.880 that value is considered false. 00:50:07.880 --> 00:50:12.560 So really, I was getting lucky at first, because my program was finding "Bill," 00:50:12.560 --> 00:50:14.450 but I was confusing "Bill" for "Ron." 00:50:14.450 --> 00:50:17.960 Then when I did it again for Caleb and I capitalized "Ron," 00:50:17.960 --> 00:50:20.690 I was getting unlucky, because suddenly I 00:50:20.690 --> 00:50:23.030 knew "RON" capitalized wasn't in the array, 00:50:23.030 --> 00:50:24.590 and yet I'm still saying he's found. 00:50:24.590 --> 00:50:28.580 But that's because I didn't practice what I preach per Sophia's find. 00:50:28.580 --> 00:50:31.490 And so if I actually compare this against 0-- and now, Caleb, 00:50:31.490 --> 00:50:33.260 we come full circle to your question-- 00:50:33.260 --> 00:50:37.970 I rebuild this program with make names, I now do ./names and search for all 00:50:37.970 --> 00:50:43.280 caps "RON," I should now see, thankfully, "Not found." 00:50:43.280 --> 00:50:46.550 So I wish I could say that was deliberate, but thus 00:50:46.550 --> 00:50:48.410 is the common case of bugs. 00:50:48.410 --> 00:50:50.820 So here I am 20 years later making bugs in my code. 00:50:50.820 --> 00:50:53.360 So if you run up to a similar problem this week, 00:50:53.360 --> 00:50:58.910 rest assured that it never ends. 00:50:58.910 --> 00:51:01.430 But hopefully you won't have several people watching you 00:51:01.430 --> 00:51:03.120 while you do your problem set this week. 00:51:03.120 --> 00:51:03.620 All right. 00:51:03.620 --> 00:51:05.660 Any questions, then, beyond Caleb's? 00:51:05.660 --> 00:51:07.640 So great question, Caleb, and the answer is no. 00:51:07.640 --> 00:51:09.540 It is case sensitive. 00:51:09.540 --> 00:51:11.780 So it does not find "Rob"-- 00:51:11.780 --> 00:51:13.270 "RON." 00:51:13.270 --> 00:51:16.530 Any questions here? 00:51:16.530 --> 00:51:20.640 Any questions on linear search using strings? 00:51:20.640 --> 00:51:21.140 No? 00:51:21.140 --> 00:51:24.230 All right, well, let's go ahead and do one final example, I think, 00:51:24.230 --> 00:51:25.280 with searching. 00:51:25.280 --> 00:51:27.260 But let's introduce just one other feature. 00:51:27.260 --> 00:51:29.540 And this one's actually pretty cool and powerful. 00:51:29.540 --> 00:51:33.350 Up until now, we've been using data types that just come with C 00:51:33.350 --> 00:51:37.770 or come from CS50, like int, and char, and float, and the like. 00:51:37.770 --> 00:51:41.330 And you'll see now that there's actually sometimes 00:51:41.330 --> 00:51:44.570 reasons where you or I might want to create our own custom data types, 00:51:44.570 --> 00:51:48.500 our own types that didn't exist when C itself was invented. 00:51:48.500 --> 00:51:51.890 So for instance, suppose that I wanted to represent not just 00:51:51.890 --> 00:51:54.440 a whole bunch of numbers and not just a whole bunch of names, 00:51:54.440 --> 00:51:57.065 but suppose I want to implement like a full-fledged phone book. 00:51:57.065 --> 00:52:00.110 A phone book, of course, contains both names and numbers. 00:52:00.110 --> 00:52:02.870 And suppose I want to combine these two ideas together. 00:52:02.870 --> 00:52:06.950 Wouldn't it be nice if I could have a data structure that 00:52:06.950 --> 00:52:10.550 is a data type that has some structure to it that can actually 00:52:10.550 --> 00:52:11.850 store both at once? 00:52:11.850 --> 00:52:16.670 And in fact, wouldn't it be nice if C had a data type called person, 00:52:16.670 --> 00:52:19.550 so that if I want to represent a person, like in a phone book, who 00:52:19.550 --> 00:52:22.490 had both a name and a number, I can actually 00:52:22.490 --> 00:52:27.320 implement that and code by calling that variable of type person? 00:52:27.320 --> 00:52:30.170 Now, of course, the designers of C did not have the foresight 00:52:30.170 --> 00:52:32.308 to create a data type called person. 00:52:32.308 --> 00:52:34.100 And, indeed, that would be a slippery slope 00:52:34.100 --> 00:52:37.820 if they had a data type for every real-world entity you can think of. 00:52:37.820 --> 00:52:40.830 But they did give us the capabilities to do this. 00:52:40.830 --> 00:52:44.810 So if a person, in our limited world here of phone books, 00:52:44.810 --> 00:52:49.610 has both a name and a number, we might think of it as follows-- 00:52:49.610 --> 00:52:51.770 a name and a number, both of type string. 00:52:51.770 --> 00:52:53.060 But a quick check here. 00:52:53.060 --> 00:52:56.330 Why have I now decided, somewhat presumptuously, 00:52:56.330 --> 00:53:00.110 to call phone numbers strings as well? 00:53:00.110 --> 00:53:02.360 We've been talking about ints behind these doors. 00:53:02.360 --> 00:53:04.580 We've been searching for ints in code. 00:53:04.580 --> 00:53:08.150 But why did I just presume to propose that we instead 00:53:08.150 --> 00:53:12.560 implement a phone book using strings for names and numbers? 00:53:12.560 --> 00:53:14.650 Any thoughts here, Kurt? 00:53:14.650 --> 00:53:16.070 AUDIENCE: Yeah. 00:53:16.070 --> 00:53:17.570 Because we're not doing math on it. 00:53:17.570 --> 00:53:21.065 It's like-- a phone number could be, like, letters for all we care. 00:53:21.065 --> 00:53:22.940 And in fact, I mean, sometimes you see, like, 00:53:22.940 --> 00:53:26.160 1-800 Contacts or something like that, and maybe we want to allow that. 00:53:26.160 --> 00:53:27.410 DAVID MALAN: Yeah, absolutely. 00:53:27.410 --> 00:53:30.780 A phone number, despite its name, isn't necessarily just a number. 00:53:30.780 --> 00:53:33.710 It might be 1-800 Contacts, which is an English word. 00:53:33.710 --> 00:53:36.020 It might have hyphens in it or dashes. 00:53:36.020 --> 00:53:37.640 It might have parentheses in it. 00:53:37.640 --> 00:53:39.800 It might have a plus sign for country codes. 00:53:39.800 --> 00:53:41.720 There's a lot of characters that we absolutely 00:53:41.720 --> 00:53:46.257 can represent in C using strings that we couldn't represent in C using int. 00:53:46.257 --> 00:53:48.090 And so indeed, even though in the real world 00:53:48.090 --> 00:53:51.710 there are these "numbers" that you and I talk about once in a while like phone 00:53:51.710 --> 00:53:55.580 numbers, maybe in the US Social Security numbers, credit card numbers, 00:53:55.580 --> 00:54:00.170 those aren't necessarily values that you want to treat as actual integers. 00:54:00.170 --> 00:54:02.570 And in fact, those of you who did the credit problem 00:54:02.570 --> 00:54:04.760 and tried to validate credit card numbers 00:54:04.760 --> 00:54:08.720 may very well have run into challenges by using a long to represent a credit 00:54:08.720 --> 00:54:09.650 card number. 00:54:09.650 --> 00:54:11.900 It probably in retrospect might very well 00:54:11.900 --> 00:54:15.560 have been easier for you to treat credit card numbers as strings. 00:54:15.560 --> 00:54:17.810 The catch, of course, by design is that you didn't yet 00:54:17.810 --> 00:54:21.410 have strings in your vocabulary, at least in C yet. 00:54:21.410 --> 00:54:23.390 So suppose I want to create my own custom data 00:54:23.390 --> 00:54:27.800 type that encapsulates, if you will, two different types of values. 00:54:27.800 --> 00:54:31.430 A person shall be henceforth a name and a number. 00:54:31.430 --> 00:54:34.850 It turns out that C gives us this syntax here. 00:54:34.850 --> 00:54:39.770 This is the only juicy piece of new syntax besides those curly braces 00:54:39.770 --> 00:54:42.880 a moment ago that we'll see today in C, typedef. 00:54:42.880 --> 00:54:44.810 And as the name rather succinctly suggests, 00:54:44.810 --> 00:54:47.240 this allows you to define a type. 00:54:47.240 --> 00:54:50.120 And the type will be a structure of some sort. 00:54:50.120 --> 00:54:52.610 So a data structure in a programming language 00:54:52.610 --> 00:54:56.600 is typically a data type that has some structure to it. 00:54:56.600 --> 00:54:57.980 What do we mean by "structure"? 00:54:57.980 --> 00:55:01.610 It typically has one or more values inside of it. 00:55:01.610 --> 00:55:04.940 So using typedef, and in turn using the struct keyword, 00:55:04.940 --> 00:55:06.770 we can create our own custom types that's 00:55:06.770 --> 00:55:10.320 a structure, a composition of multiple other data types. 00:55:10.320 --> 00:55:14.360 So if we want to keep persons together as their own custom data type, 00:55:14.360 --> 00:55:15.890 the syntax is a little cryptic here. 00:55:15.890 --> 00:55:19.310 You literally do typedef struct open curly brace, 00:55:19.310 --> 00:55:22.137 then one per line you specify the data types that you want 00:55:22.137 --> 00:55:24.470 and the names that you want to give to those data types, 00:55:24.470 --> 00:55:26.270 for instance name and number. 00:55:26.270 --> 00:55:28.850 And then outside of the closing curly brace, 00:55:28.850 --> 00:55:33.290 you literally put the word "person," if that's indeed the data type 00:55:33.290 --> 00:55:35.280 that you want to invent. 00:55:35.280 --> 00:55:37.790 So how can we use this more powerfully? 00:55:37.790 --> 00:55:42.020 Well, let's go ahead and do things the wrong way without this feature first, 00:55:42.020 --> 00:55:44.060 so as to motivate its existence. 00:55:44.060 --> 00:55:47.016 Let me go ahead and save this file as phonebook.c. 00:55:47.016 --> 00:55:50.420 And let me start, as always, with includes cs50.h. 00:55:50.420 --> 00:55:53.030 And then let me go ahead and include stdio.h. 00:55:53.030 --> 00:55:56.180 And then lastly, let me also include string.h, 00:55:56.180 --> 00:55:59.100 because I know I'm going to be manipulating some strings in a moment. 00:55:59.100 --> 00:56:03.200 Let me go ahead now, and within my main function, let me go ahead 00:56:03.200 --> 00:56:06.050 and give myself initially, for the first version of this program, 00:56:06.050 --> 00:56:07.370 a whole bunch of names. 00:56:07.370 --> 00:56:10.940 Specifically, how about "Brian" comma "David"? 00:56:10.940 --> 00:56:15.200 We'll keep it short, just so as to focus on the ideas and not the actual data 00:56:15.200 --> 00:56:15.862 they're in. 00:56:15.862 --> 00:56:17.570 Then Brian and I each have phone numbers. 00:56:17.570 --> 00:56:19.890 So let's go ahead and store them in an array-- 00:56:19.890 --> 00:56:28.890 numbers equals, again the curly braces as before, and +1-617-495-1000-- 00:56:28.890 --> 00:56:32.670 and indeed, there's already motivation, per Kurt's comment, to use strings, 00:56:32.670 --> 00:56:35.250 because we've got a plus and a couple of dashes in there-- 00:56:35.250 --> 00:56:36.460 and then my number here. 00:56:36.460 --> 00:56:43.020 So we'll do +1-949-468-2750 close curly brace, semicolon. 00:56:43.020 --> 00:56:45.900 So I've gone ahead and declared two arrays, one called names, 00:56:45.900 --> 00:56:47.040 one called numbers. 00:56:47.040 --> 00:56:52.410 And I'm just going to have sort of a handshake agreement 00:56:52.410 --> 00:56:56.380 that the first name in names corresponds to the first number in numbers, 00:56:56.380 --> 00:56:59.702 the second name in names corresponds to the second number in numbers. 00:56:59.702 --> 00:57:02.910 And you can imagine that working well so long as you don't make any mistakes, 00:57:02.910 --> 00:57:04.950 and you have just the right number in each. 00:57:04.950 --> 00:57:08.520 Now let me go ahead and do int i equals 0, i less than 2-- 00:57:08.520 --> 00:57:11.580 I'm going to keep that hardcoded for now just to do the demonstration. 00:57:11.580 --> 00:57:14.910 And then inside of this loop, let me go ahead and search for my phone number, 00:57:14.910 --> 00:57:17.350 for instance, even though I happen to be at the end. 00:57:17.350 --> 00:57:22.590 So if strcmp of names bracket i equals-- 00:57:22.590 --> 00:57:27.360 rather, comma "David" equals equals 0-- 00:57:27.360 --> 00:57:29.200 so I'm not going to make that mistake again. 00:57:29.200 --> 00:57:37.290 Let me go ahead inside of this loop, inside of this condition here. 00:57:37.290 --> 00:57:41.650 And I'm going to go ahead and do the following-- print out that I found, 00:57:41.650 --> 00:57:43.910 for instance my number. 00:57:43.910 --> 00:57:45.160 And I'm going to plug that in. 00:57:45.160 --> 00:57:47.730 So numbers bracket i. 00:57:47.730 --> 00:57:50.430 And then as before, I'm going to go ahead and return 0. 00:57:50.430 --> 00:57:54.090 And if none of this works out, and I happen not to be in this array, 00:57:54.090 --> 00:57:59.200 I'll go ahead and print out as before "Not found" with a semicolon. 00:57:59.200 --> 00:58:00.870 And then I'll return 1 arbitrarily. 00:58:00.870 --> 00:58:03.750 I can return negative 1, I could return a million, negative million. 00:58:03.750 --> 00:58:07.530 But human convention would typically have you go from 0 to 1 to 2 to 3 00:58:07.530 --> 00:58:11.050 on up, if you have that many possible error conditions. 00:58:11.050 --> 00:58:11.760 All right. 00:58:11.760 --> 00:58:14.730 So I essentially have implemented in C a phone book of sorts. 00:58:14.730 --> 00:58:16.710 We did this verbally in week 0. 00:58:16.710 --> 00:58:17.987 Now I'm doing it in code. 00:58:17.987 --> 00:58:19.070 It's a limited phone book. 00:58:19.070 --> 00:58:21.030 It's only got two names and two numbers. 00:58:21.030 --> 00:58:22.988 But I could certainly implement this phone book 00:58:22.988 --> 00:58:26.230 by just using two arrays, two parallel arrays, if you will, 00:58:26.230 --> 00:58:29.490 by just using the honor system that the first element and names lines 00:58:29.490 --> 00:58:32.072 up with the first elements and numbers and so forth. 00:58:32.072 --> 00:58:33.780 Now hopefully, if I don't make any typos, 00:58:33.780 --> 00:58:36.030 let me go ahead and make phonebook. 00:58:36.030 --> 00:58:36.530 All right. 00:58:36.530 --> 00:58:41.440 It compiled OK. ./phonebook, and it found what seems to be my number there. 00:58:41.440 --> 00:58:45.040 So it seems to work correctly, though I've tried to pull that one over you 00:58:45.040 --> 00:58:45.540 before. 00:58:45.540 --> 00:58:48.100 But I'm pretty sure this one actually works correctly. 00:58:48.100 --> 00:58:51.990 And so we found my name and in turn number. 00:58:51.990 --> 00:58:57.030 But why is the design of this code not necessarily the best? 00:58:57.030 --> 00:59:00.090 This is starting to get more subtle, admittedly. 00:59:00.090 --> 00:59:02.980 And we've seen that we can do this differently. 00:59:02.980 --> 00:59:05.372 But what rubs you the wrong way about here? 00:59:05.372 --> 00:59:07.830 This is another example of what we might call "code smell." 00:59:07.830 --> 00:59:10.710 Like, something's a little funky here, like, ah, this 00:59:10.710 --> 00:59:13.500 might not be the best solution long term Nick, what do you think? 00:59:13.500 --> 00:59:14.125 AUDIENCE: Yeah. 00:59:14.125 --> 00:59:15.378 So what I'm guessing is that-- 00:59:15.378 --> 00:59:18.420 like, you know how you made the data frame before the new data structure, 00:59:18.420 --> 00:59:20.920 where the two things were linked together? 00:59:20.920 --> 00:59:23.010 In this case, we're just banking on the fact 00:59:23.010 --> 00:59:27.270 that we don't screw something up and unintentionally unlink them 00:59:27.270 --> 00:59:29.390 from the same index. 00:59:29.390 --> 00:59:32.812 So they're not intrinsically linked, which might not be-- 00:59:32.812 --> 00:59:34.770 DAVID MALAN: That's exactly the right instinct. 00:59:34.770 --> 00:59:38.790 In general, as great as a programmer as you're maybe aspiring to be, 00:59:38.790 --> 00:59:39.690 you're not all that. 00:59:39.690 --> 00:59:41.357 And like, you're going to make mistakes. 00:59:41.357 --> 00:59:44.700 And the more you can write code that's self-defensive, 00:59:44.700 --> 00:59:47.880 that protects you from yourself, the better off you're going to be, 00:59:47.880 --> 00:59:49.890 the more correct your code is going to be, 00:59:49.890 --> 00:59:53.460 and the more easily you're going to be able to collaborate successfully, 00:59:53.460 --> 00:59:56.710 if you so choose in the real world, on real-world programming projects, 00:59:56.710 --> 00:59:59.520 whether for a research project, a full-time job, a personal project 00:59:59.520 --> 01:00:00.210 or the like. 01:00:00.210 --> 01:00:04.290 Generally speaking, you should not trust yourself or other people 01:00:04.290 --> 01:00:06.060 that-- with whom you're writing code, you 01:00:06.060 --> 01:00:11.320 should have as many defense mechanisms in place exactly along these lines. 01:00:11.320 --> 01:00:15.030 So yes, there's nothing wrong with what I have done in the sense 01:00:15.030 --> 01:00:16.620 that this is correct. 01:00:16.620 --> 01:00:21.480 But as noted, if you screw up, and maybe you get an off by one error-- 01:00:21.480 --> 01:00:24.210 maybe you transpose two names or two numbers. 01:00:24.210 --> 01:00:26.910 I mean, imagine if you've got dozens of names and numbers, 01:00:26.910 --> 01:00:29.220 hundreds of names and numbers, thousands of them 01:00:29.220 --> 01:00:32.820 the odds that you or someone messes the order up at some point 01:00:32.820 --> 01:00:35.370 is just probably going to be too, too high. 01:00:35.370 --> 01:00:39.250 So it would be nice, then, if we could sort of keep related data together. 01:00:39.250 --> 01:00:42.805 This is kind of a hack, to just on the honor system say, my arrays line up, 01:00:42.805 --> 01:00:45.180 I'm just going to make sure to keep them the same length. 01:00:45.180 --> 01:00:46.110 We can do better. 01:00:46.110 --> 01:00:51.030 Let's keep related data together and design this a little more cleanly. 01:00:51.030 --> 01:00:53.570 And I can do this by defining my own type 01:00:53.570 --> 01:00:55.650 that I'll call for instance, a person. 01:00:55.650 --> 01:00:58.790 So at the top of this file, before main, I'm 01:00:58.790 --> 01:01:01.830 going to go ahead and typedef a structure, inside 01:01:01.830 --> 01:01:04.380 of which are the two types of data that I care about, 01:01:04.380 --> 01:01:08.520 string name and string number, just as before. 01:01:08.520 --> 01:01:15.030 Notice, though, here that what I have done here is not give myself an array. 01:01:15.030 --> 01:01:17.738 I've given myself one name and one number. 01:01:17.738 --> 01:01:20.280 Outside of this curly brace, I'm going to give this data type 01:01:20.280 --> 01:01:21.690 a name, which I could call "person." 01:01:21.690 --> 01:01:23.482 I could call it anything I want, but person 01:01:23.482 --> 01:01:25.410 seems pretty reasonable in this case. 01:01:25.410 --> 01:01:29.890 And now down here, I'm going to go ahead and change this code a little bit. 01:01:29.890 --> 01:01:32.370 I'm going to go ahead and give myself an array still, 01:01:32.370 --> 01:01:36.240 but this time I'm going to give myself an array of persons. 01:01:36.240 --> 01:01:39.360 And I'm going to call that array, somewhat playfully, "people," 01:01:39.360 --> 01:01:44.940 because I want to have two persons, two people, in this program, me and Brian. 01:01:44.940 --> 01:01:47.215 Now I want to go ahead and populate this array. 01:01:47.215 --> 01:01:48.840 That is, I want to fill it with values. 01:01:48.840 --> 01:01:50.940 And this syntax is a little new, but it's just 01:01:50.940 --> 01:01:55.230 to enable us to actually store values inside of a structure. 01:01:55.230 --> 01:01:58.050 If I want to index into this array, there's 01:01:58.050 --> 01:01:59.460 nothing different from last week. 01:01:59.460 --> 01:02:01.380 I do people bracket 0. 01:02:01.380 --> 01:02:06.270 That's going to give me the first person variable inside, so probably where 01:02:06.270 --> 01:02:07.890 "Brian" is supposed to go. 01:02:07.890 --> 01:02:10.350 The one last piece of syntax I need is how do I 01:02:10.350 --> 01:02:14.730 go inside of that structure, that person data structure, 01:02:14.730 --> 01:02:16.450 and access the person's name? 01:02:16.450 --> 01:02:17.710 I literally just do a dot. 01:02:17.710 --> 01:02:22.620 So people bracket 0 gives me the first person in the people array. 01:02:22.620 --> 01:02:26.580 And then the dot means, go inside of it and grab the person variable. 01:02:26.580 --> 01:02:29.880 I'm going to go ahead and set that name equal to quote, unquote "Brian." 01:02:29.880 --> 01:02:31.990 The syntax now for his name is almost identical-- 01:02:31.990 --> 01:02:39.630 people bracket 0 dot number equals quote, unquote "+1-617-495-1000" 01:02:39.630 --> 01:02:40.740 semicolon. 01:02:40.740 --> 01:02:43.092 Meanwhile, if I want to access a location for myself, 01:02:43.092 --> 01:02:45.300 I'm going to go ahead and put it at location 1, which 01:02:45.300 --> 01:02:46.620 is the second location. 01:02:46.620 --> 01:02:48.630 Name will be, quote, unquote "David." 01:02:48.630 --> 01:02:52.020 And then over here, I'm going to do people bracket 1 dot number equals 01:02:52.020 --> 01:02:59.250 quote, unquote "+1-949-468-2750" close quote, semicolon. 01:02:59.250 --> 01:03:01.320 So it's a bit verbose, admittedly. 01:03:01.320 --> 01:03:05.430 But you could imagine, if we just let our thoughts run ahead of ourselves 01:03:05.430 --> 01:03:08.370 here, if you used get_string, could sort of automatically do this. 01:03:08.370 --> 01:03:11.453 If you used command line arguments, maybe you could populate some of this. 01:03:11.453 --> 01:03:14.190 We don't just have to hardcode, that is, write my name and number 01:03:14.190 --> 01:03:15.668 and Brian's into this program. 01:03:15.668 --> 01:03:17.460 You can imagine doing this more dynamically 01:03:17.460 --> 01:03:20.700 using some of our techniques, using get_string and so forth, from week 1. 01:03:20.700 --> 01:03:22.920 But for now, it's just for demonstration's sake. 01:03:22.920 --> 01:03:29.040 So now if I want to search this new array, this new single array of people, 01:03:29.040 --> 01:03:31.410 I think my for loop can stay the same. 01:03:31.410 --> 01:03:33.990 And I think I can still use strcmp. 01:03:33.990 --> 01:03:38.970 But now I need to go inside of not names but people, 01:03:38.970 --> 01:03:41.740 and look for the dot name field. 01:03:41.740 --> 01:03:45.008 So data structures have fields or variables inside of them. 01:03:45.008 --> 01:03:46.800 So I'm going to use the dot notation there, 01:03:46.800 --> 01:03:50.670 too, go into the i-th person in the people array, 01:03:50.670 --> 01:03:53.610 and compare that name against, for instance, quote, unquote "David." 01:03:53.610 --> 01:03:57.300 And then if I have found "David," in this case myself, 01:03:57.300 --> 01:04:01.920 go ahead and access the people array again, but print out using printf 01:04:01.920 --> 01:04:02.830 the number. 01:04:02.830 --> 01:04:05.820 So again, the dot operator is the only new piece of syntax 01:04:05.820 --> 01:04:10.120 that's letting us go inside of this new feature known as a data structure. 01:04:10.120 --> 01:04:12.930 If I go ahead and make phonebook again after making those changes, 01:04:12.930 --> 01:04:13.610 all is well. 01:04:13.610 --> 01:04:14.850 It compiled OK. 01:04:14.850 --> 01:04:20.410 And if I run ./phonebook, I now have hopefully found my number again. 01:04:20.410 --> 01:04:24.420 So here is a seemingly useless exercise, in that all 01:04:24.420 --> 01:04:28.050 I really did was re-implement the same program using more lines of code 01:04:28.050 --> 01:04:29.630 and making it more complicated. 01:04:29.630 --> 01:04:31.460 But it's now better designed. 01:04:31.460 --> 01:04:34.130 Or it's a step toward being better designed, because now I've 01:04:34.130 --> 01:04:38.330 encapsulated all inside of one variable, for instance, people bracket 0, 01:04:38.330 --> 01:04:42.590 people bracket 1, all of the information we care about with respect to Brian, 01:04:42.590 --> 01:04:46.800 or me, or anyone else we might put into this program. 01:04:46.800 --> 01:04:49.880 And indeed, this is how programs, this is how Googles, of the world, 01:04:49.880 --> 01:04:52.520 Facebooks of the world store lots of information together. 01:04:52.520 --> 01:04:55.220 Consider any of your social media accounts like Instagram, 01:04:55.220 --> 01:04:57.230 or Facebook, or Snapchat and the like. 01:04:57.230 --> 01:04:59.810 You have multiple pieces of data associated 01:04:59.810 --> 01:05:02.810 with you on all of those platforms-- not just your username 01:05:02.810 --> 01:05:06.050 but also your password, also your history of posts, 01:05:06.050 --> 01:05:08.265 also your friends and followers and the like. 01:05:08.265 --> 01:05:11.390 So there's a lot of information that these companies, for better for worse, 01:05:11.390 --> 01:05:12.950 are collecting on all of us. 01:05:12.950 --> 01:05:17.660 And can you imagine if they just had one big array with all of our usernames, 01:05:17.660 --> 01:05:21.892 one big array with all of our passwords, one big array with all of our friends? 01:05:21.892 --> 01:05:23.600 Like, you can imagine certainly at scale, 01:05:23.600 --> 01:05:26.418 that's got to be a bad design, to just trust 01:05:26.418 --> 01:05:29.210 that you're going to get the ordering of all of these things right. 01:05:29.210 --> 01:05:30.110 They don't do that. 01:05:30.110 --> 01:05:34.040 They instead write code in some language that somehow encapsulates 01:05:34.040 --> 01:05:36.590 all the information related to me and Brian 01:05:36.590 --> 01:05:39.620 and you inside of some kind of data structure. 01:05:39.620 --> 01:05:43.370 And that's what they put in their database or some other server 01:05:43.370 --> 01:05:44.360 on their back end. 01:05:44.360 --> 01:05:47.720 So this encapsulation is a feature we now have in terms of C. 01:05:47.720 --> 01:05:51.740 And it allows us to create our own data structures that we can then use 01:05:51.740 --> 01:05:56.090 in order to keep related data together. 01:05:56.090 --> 01:05:59.990 All right, any questions, then, on data structures, or more 01:05:59.990 --> 01:06:04.130 specifically typedef and struct, the C keywords 01:06:04.130 --> 01:06:08.360 with which you can create your own custom types 01:06:08.360 --> 01:06:10.760 that themselves are data structures? 01:06:10.760 --> 01:06:11.720 Besley? 01:06:11.720 --> 01:06:12.590 AUDIENCE: Hi. 01:06:12.590 --> 01:06:17.450 So is it typical to define the new data structure outside of main, like 01:06:17.450 --> 01:06:17.950 in a header? 01:06:17.950 --> 01:06:19.367 DAVID MALAN: Really good question. 01:06:19.367 --> 01:06:22.070 Is it typical to define a new data structure outside of main? 01:06:22.070 --> 01:06:23.270 Quite often yes. 01:06:23.270 --> 01:06:26.270 In this case, it's immaterial, because I only have 01:06:26.270 --> 01:06:28.190 one function in this program, main. 01:06:28.190 --> 01:06:30.530 But as we'll see this week and next week and onward, 01:06:30.530 --> 01:06:34.370 our programs are going to start to get a little more complicated by nature 01:06:34.370 --> 01:06:35.930 of just having more features. 01:06:35.930 --> 01:06:38.987 And once you have more features, you probably have more functions. 01:06:38.987 --> 01:06:41.570 And when you have more functions, you want your data structure 01:06:41.570 --> 01:06:44.930 to be available to all of those functions. 01:06:44.930 --> 01:06:49.100 And so we'll begin to see definition of some of these structures being, 01:06:49.100 --> 01:06:51.620 indeed, outside of our own functions. 01:06:51.620 --> 01:06:53.000 Peter, over to you. 01:06:53.000 --> 01:06:54.020 AUDIENCE: Oh, yeah. 01:06:54.020 --> 01:06:58.160 Will we define new classes in header files later, 01:06:58.160 --> 01:07:00.350 or will we keep defining them outside of main? 01:07:00.350 --> 01:07:01.767 DAVID MALAN: Really good question. 01:07:01.767 --> 01:07:05.430 Might we define our own types and our own data structures in header files? 01:07:05.430 --> 01:07:05.930 Yes. 01:07:05.930 --> 01:07:07.220 Eventually we'll do that, too. 01:07:07.220 --> 01:07:11.210 Thus far, you and I have only been using header files that other people wrote. 01:07:11.210 --> 01:07:15.770 We've been using stdio.h, string.h, that the authors of C created. 01:07:15.770 --> 01:07:19.070 You've been using cs50.h which we the staff wrote. 01:07:19.070 --> 01:07:23.210 It turns out, you can also create your own header files, your own .h files, 01:07:23.210 --> 01:07:27.870 inside of which are pieces of code that you want to share across multiple files 01:07:27.870 --> 01:07:28.370 of your own. 01:07:28.370 --> 01:07:29.453 We're not quite there yet. 01:07:29.453 --> 01:07:32.990 But yes, Peter, that would be a solution to this problem 01:07:32.990 --> 01:07:36.130 by putting it in one central place. 01:07:36.130 --> 01:07:37.580 Thiago, over to you. 01:07:41.120 --> 01:07:43.370 AUDIENCE: I was-- 01:07:43.370 --> 01:07:54.265 I was thinking, this course really takes enough information to solve the sets, 01:07:54.265 --> 01:08:00.590 because I feel there is missing information. 01:08:00.590 --> 01:08:07.490 I am a freshman, and I was taking-- 01:08:07.490 --> 01:08:14.780 I was so concentrating, and I can't go on, go ahead on the sets. 01:08:14.780 --> 01:08:17.302 Is there anything that I'm missing? 01:08:17.302 --> 01:08:19.010 DAVID MALAN: It's a really good question. 01:08:19.010 --> 01:08:19.700 And quite fair. 01:08:19.700 --> 01:08:21.920 We do move quite quickly, admittedly. 01:08:21.920 --> 01:08:26.720 So indeed, recall from week 0 the fire hose metaphor 01:08:26.720 --> 01:08:29.270 that I borrowed from MIT's water fountain. 01:08:29.270 --> 01:08:31.069 Indeed, that's very much the case. 01:08:31.069 --> 01:08:33.569 There's a lot of new syntax, a lot of new ideas all at once. 01:08:33.569 --> 01:08:37.310 But when it comes to individual problems in the problem sets, 01:08:37.310 --> 01:08:39.920 do realize that you should take those step by step. 01:08:39.920 --> 01:08:45.460 And invariably, they tend to work from less complicated to more complicated. 01:08:45.460 --> 01:08:47.960 And throughout each of the lectures and each of the examples 01:08:47.960 --> 01:08:50.390 that we do, either live or via the examples that 01:08:50.390 --> 01:08:53.060 are premade on the course's website for your review, 01:08:53.060 --> 01:08:57.529 there's always little clues or hints or examples that you can then do. 01:08:57.529 --> 01:09:01.910 And certainly, by way of other resources like labs and the like, 01:09:01.910 --> 01:09:04.200 will you see additional building blocks as well. 01:09:04.200 --> 01:09:06.080 So feel free to reach out more individually afterword. 01:09:06.080 --> 01:09:07.997 Happy to point you at some of those resources. 01:09:07.997 --> 01:09:10.910 In fact, most recently, too, will you notice on the course's website 01:09:10.910 --> 01:09:13.399 what we call "shorts," which are shorter videos made 01:09:13.399 --> 01:09:17.120 by another colleague of mine, CS50's own Doug Lloyd, which are literally 01:09:17.120 --> 01:09:19.680 short videos on very specific topics. 01:09:19.680 --> 01:09:22.490 So after today, you'll see short videos by Doug 01:09:22.490 --> 01:09:25.580 with a different perspective on linear search, on binary search, 01:09:25.580 --> 01:09:28.260 and on a number of other algorithms as well. 01:09:28.260 --> 01:09:28.859 Good question. 01:09:28.859 --> 01:09:30.567 Sophia, back to you. 01:09:30.567 --> 01:09:32.609 AUDIENCE: I was wondering, with the return values 01:09:32.609 --> 01:09:37.380 that we have for different error cases, would that be-- 01:09:37.380 --> 01:09:39.630 like, what's an example of what we would use that for? 01:09:39.630 --> 01:09:42.540 Is that for later if there are like several different cases 01:09:42.540 --> 01:09:44.356 and we want to somehow keep track of them? 01:09:44.356 --> 01:09:45.689 DAVID MALAN: Exactly the latter. 01:09:45.689 --> 01:09:47.790 So right now, honestly, it's kind of stupid 01:09:47.790 --> 01:09:52.027 that we're even bothering to spend time returning 0 or returning 1. 01:09:52.027 --> 01:09:54.360 Like, we don't really need to do that, because we're not 01:09:54.360 --> 01:09:55.410 using the information. 01:09:55.410 --> 01:09:58.200 But what we're trying to do is lay the foundation 01:09:58.200 --> 01:09:59.782 for more complicated programs. 01:09:59.782 --> 01:10:01.740 And indeed, this week and next week and beyond, 01:10:01.740 --> 01:10:05.520 as your own programs get a little longer, and as we, the course, 01:10:05.520 --> 01:10:08.520 start providing you with starter code or distribution 01:10:08.520 --> 01:10:12.210 code, that is, lines of code that the staff and I write that you then 01:10:12.210 --> 01:10:15.630 have to build upon, it's going to be a very useful mechanism 01:10:15.630 --> 01:10:19.360 to be able to signal that this went wrong or this other thing went wrong. 01:10:19.360 --> 01:10:22.110 So all we're doing is preparing for that inevitability, 01:10:22.110 --> 01:10:25.870 even if right now it doesn't really seem to be scratching an itch. 01:10:25.870 --> 01:10:26.630 Anthony? 01:10:26.630 --> 01:10:28.672 AUDIENCE: I was just going to ask really quickly, 01:10:28.672 --> 01:10:32.200 obviously in this code we have "Brian" and your name, "David." 01:10:32.200 --> 01:10:33.310 And that's two people. 01:10:33.310 --> 01:10:35.500 So let's say we had 10 or 20 or even 30 people. 01:10:35.500 --> 01:10:37.500 I know it was a question in the chat, but I just 01:10:37.500 --> 01:10:39.460 wanted to clarify for myself, too. 01:10:39.460 --> 01:10:42.000 DAVID MALAN: And the "what if" being what would change? 01:10:42.000 --> 01:10:44.210 Or, what was the end of that question? 01:10:44.210 --> 01:10:44.850 AUDIENCE: Yeah. 01:10:44.850 --> 01:10:46.100 What would change in the code? 01:10:46.100 --> 01:10:48.017 Or what we do exactly to address that problem? 01:10:48.017 --> 01:10:48.850 DAVID MALAN: Ah, OK. 01:10:48.850 --> 01:10:49.450 Good question. 01:10:49.450 --> 01:10:53.305 So if we were to have more names, like a third name or a tenth name or the like, 01:10:53.305 --> 01:10:56.430 the only things that we would have to change in this version of the program 01:10:56.430 --> 01:10:58.947 is first, on line 14, the size of the array. 01:10:58.947 --> 01:11:00.780 So if we're going to have 10 people, we need 01:11:00.780 --> 01:11:03.180 to decide in advance that we're going to have 10 people. 01:11:03.180 --> 01:11:08.048 Better still, I could, for instance, allocate myself a constant up here. 01:11:08.048 --> 01:11:09.840 So let me actually go up here, just like we 01:11:09.840 --> 01:11:17.670 did in a previous class, where we did something like this-- const int NUMBER. 01:11:17.670 --> 01:11:19.380 And I'll just initialize this to 10. 01:11:19.380 --> 01:11:21.450 And recall that const means constant. 01:11:21.450 --> 01:11:23.140 That means this variable can't change. 01:11:23.140 --> 01:11:24.930 Int, of course means it's an integer. 01:11:24.930 --> 01:11:27.450 The fact that I've capitalized it is just a human convention 01:11:27.450 --> 01:11:29.640 to make a little visually clear that this is 01:11:29.640 --> 01:11:32.200 a constant, just so you don't forget. 01:11:32.200 --> 01:11:33.660 But it has no functional role. 01:11:33.660 --> 01:11:36.360 And then this, of course, is just the value to assign to NUMBER. 01:11:36.360 --> 01:11:40.187 Then I could go down here on line 16 and plug in that variable 01:11:40.187 --> 01:11:42.270 so that I don't have to hardcode what people would 01:11:42.270 --> 01:11:44.730 call a "magic number," which is just a number that 01:11:44.730 --> 01:11:46.410 appears seemingly out of nowhere. 01:11:46.410 --> 01:11:50.520 Now I've put all of my special numbers at the top of my file, 01:11:50.520 --> 01:11:54.190 or toward the top of my file, and now I'm using this variable here. 01:11:54.190 --> 01:11:57.390 And then what I could do-- and I alluded to this only verbally before-- 01:11:57.390 --> 01:12:00.120 I could absolutely start hardcoding in, for instance, 01:12:00.120 --> 01:12:03.420 Montague's name and number, and Rithvik's and Benedict's, and Cody's 01:12:03.420 --> 01:12:04.110 and others. 01:12:04.110 --> 01:12:07.048 But honestly, this seems kind of stupid if you're just hardcoding 01:12:07.048 --> 01:12:08.340 all of these names and numbers. 01:12:08.340 --> 01:12:10.423 And in a few weeks, we'll see how you can actually 01:12:10.423 --> 01:12:13.200 store all of the same information in like a spreadsheet, 01:12:13.200 --> 01:12:14.910 or what's called a CSV file-- 01:12:14.910 --> 01:12:19.560 Comma Separated Values-- or even in a proper database, which the Facebooks 01:12:19.560 --> 01:12:21.065 and Googles of the world would use. 01:12:21.065 --> 01:12:23.190 But what I could do for now is something like this. 01:12:23.190 --> 01:12:28.940 For int i gets 0, i less than the number of people, i++. 01:12:28.940 --> 01:12:30.720 And maybe I could do something like this-- 01:12:30.720 --> 01:12:38.250 people bracket i dot name equals get_string, "What's the name" 01:12:38.250 --> 01:12:39.150 question mark. 01:12:39.150 --> 01:12:43.080 And then here I could do people bracket i dot number equals 01:12:43.080 --> 01:12:46.380 get_string, "What's their number?" 01:12:46.380 --> 01:12:48.220 And I could ask that question, too. 01:12:48.220 --> 01:12:51.320 So now the program's getting to be a little better designed. 01:12:51.320 --> 01:12:53.700 I'm not arbitrarily hardcoding just me and Brian. 01:12:53.700 --> 01:12:54.780 Now it's dynamic. 01:12:54.780 --> 01:12:58.330 And technically, the phone book only supports 10 people at the moment, 01:12:58.330 --> 01:12:59.850 but I could make that dynamic, too. 01:12:59.850 --> 01:13:01.630 I could also call get_int. 01:13:01.630 --> 01:13:04.860 Or, like you did this past week, use a command line argument 01:13:04.860 --> 01:13:07.770 and parameterize the code so that it can actually 01:13:07.770 --> 01:13:09.990 be for 2 people, 10 people-- whatever you want, 01:13:09.990 --> 01:13:15.270 the program can dynamically adapt to it for you. 01:13:15.270 --> 01:13:16.890 Other questions? 01:13:16.890 --> 01:13:22.420 On structs, on types, or the like? 01:13:22.420 --> 01:13:23.170 No? 01:13:23.170 --> 01:13:23.670 All right. 01:13:23.670 --> 01:13:25.110 So how did we get here? 01:13:25.110 --> 01:13:27.810 Recall that we started with this problem of searching, 01:13:27.810 --> 01:13:30.420 whereby we just want to find someone in the doors. 01:13:30.420 --> 01:13:32.625 We just want to find someone in the array. 01:13:32.625 --> 01:13:34.500 We've sort of escalated things pretty quickly 01:13:34.500 --> 01:13:36.720 to finding not just numbers or names but now 01:13:36.720 --> 01:13:40.380 names with numbers in the form of these data structures. 01:13:40.380 --> 01:13:43.770 But to do this efficiently really requires a smarter algorithm 01:13:43.770 --> 01:13:44.850 like binary search. 01:13:44.850 --> 01:13:50.100 Up until now, we've only used in C code linear search, even though, recall, 01:13:50.100 --> 01:13:54.240 that we did have at our disposal the pseudocode for binary search. 01:13:54.240 --> 01:13:58.020 But with binary search, we're going to need the data to be sorted. 01:13:58.020 --> 01:14:01.080 And so if you want to get the speed benefits of searching 01:14:01.080 --> 01:14:04.290 more quickly by having sorted numbers, somehow someone 01:14:04.290 --> 01:14:06.120 is going to have to do that for us. 01:14:06.120 --> 01:14:09.748 Joe, for instance, sorted behind the curtain all of these numbers for us. 01:14:09.748 --> 01:14:11.790 But what algorithm did he use is going to open up 01:14:11.790 --> 01:14:15.005 a whole can of worms as to how we can sort numbers efficiently. 01:14:15.005 --> 01:14:17.130 And indeed, if you're the Googles and the Facebooks 01:14:17.130 --> 01:14:19.240 and the Instagrams of the world, with millions, 01:14:19.240 --> 01:14:22.740 billions of pieces of data and users, you surely 01:14:22.740 --> 01:14:24.850 want to keep that data sorted, presumably, 01:14:24.850 --> 01:14:26.940 so that you can use algorithms like binary search 01:14:26.940 --> 01:14:30.520 to find information quickly when you're searching for friends or for content. 01:14:30.520 --> 01:14:32.730 But let's go ahead and here take a five-minute break. 01:14:32.730 --> 01:14:35.160 And when we come back, we'll consider a few algorithms 01:14:35.160 --> 01:14:39.600 for sorting that's going to enable us to do everything we've just now discussed. 01:14:39.600 --> 01:14:40.910 See you in five. 01:14:40.910 --> 01:14:41.790 All right. 01:14:41.790 --> 01:14:43.050 We are back. 01:14:43.050 --> 01:14:47.760 So to recap, we have a couple different algorithms for searching, linear search 01:14:47.760 --> 01:14:48.870 and binary search. 01:14:48.870 --> 01:14:52.560 Binary search is clearly the winner from all measures we've seen thus far. 01:14:52.560 --> 01:14:56.910 The catch is that the data needs to be sorted in advanced in order 01:14:56.910 --> 01:14:58.107 to apply that algorithm. 01:14:58.107 --> 01:14:59.940 So let's just give ourselves a working model 01:14:59.940 --> 01:15:01.350 for what it means to sort something. 01:15:01.350 --> 01:15:04.560 Well, as always, if you think of this as just another problem to be solved, 01:15:04.560 --> 01:15:07.970 it's got input and output, and the goal is to take that input 01:15:07.970 --> 01:15:08.970 and produce that output. 01:15:08.970 --> 01:15:09.990 Well, what's the input? 01:15:09.990 --> 01:15:12.360 It's going to be a whole bunch of unsorted values. 01:15:12.360 --> 01:15:14.462 And the goal, of course, is to get sorted values. 01:15:14.462 --> 01:15:16.420 So the interesting part of the process is going 01:15:16.420 --> 01:15:18.270 to be whatever there is in the middle. 01:15:18.270 --> 01:15:22.710 But just to be even more concrete, if we think now in terms of this unsorted 01:15:22.710 --> 01:15:25.650 input as being an array of input-- because after all, 01:15:25.650 --> 01:15:28.470 that's perhaps the most useful mechanism we've seen thus far, 01:15:28.470 --> 01:15:32.880 to pass around a bunch of values at once using just one variable name-- 01:15:32.880 --> 01:15:37.710 we might have an array like this, 6 3 8 5 2 7 4 1, which seems to be, indeed, 01:15:37.710 --> 01:15:40.050 randomly ordered, that is unsorted. 01:15:40.050 --> 01:15:43.594 And we want to turn that into an equivalent array that's just 1 2 01:15:43.594 --> 01:15:45.120 3 4 5 6 7 8. 01:15:45.120 --> 01:15:47.490 So eight numbers this time instead of seven. 01:15:47.490 --> 01:15:51.190 But the goal this time is not to search them, per se, but to sort them. 01:15:51.190 --> 01:15:53.730 But before I get ahead of myself, could someone 01:15:53.730 --> 01:15:56.640 push back on this whole intellectual exercise 01:15:56.640 --> 01:15:59.400 we're about to do with sorting in the first place? 01:15:59.400 --> 01:16:02.280 Could someone make an argument as to why we might not 01:16:02.280 --> 01:16:06.990 want to bother using a sorted array, why we might not 01:16:06.990 --> 01:16:12.000 want to bother sorting the elements, and heck, let's just use linear search 01:16:12.000 --> 01:16:13.570 to find some element-- 01:16:13.570 --> 01:16:17.460 whether it's a number behind a door, a name in an array. 01:16:17.460 --> 01:16:23.850 Like, when might we want to just use linear search and not bother sorting? 01:16:23.850 --> 01:16:25.940 Sophia, what do you think? 01:16:25.940 --> 01:16:28.370 AUDIENCE: We could encounter errors in sorting, 01:16:28.370 --> 01:16:33.010 and that might cause errors, like, unpredictability in terms of, 01:16:33.010 --> 01:16:34.620 like, if we can find something. 01:16:34.620 --> 01:16:37.143 Versus linear search, we know we can find it. 01:16:37.143 --> 01:16:38.310 DAVID MALAN: OK, quite fair. 01:16:38.310 --> 01:16:41.940 I will concede that implementing binary search, not in pseudocode, which we've 01:16:41.940 --> 01:16:44.850 already done, but in code is actually more difficult, 01:16:44.850 --> 01:16:47.040 because you have to deal with rounding, especially 01:16:47.040 --> 01:16:49.790 if you've got a weird number of doors, like an odd number of doors 01:16:49.790 --> 01:16:52.410 versus an even number of doors or an array of those lengths. 01:16:52.410 --> 01:16:54.618 Honestly, you've got to deal with these corner cases, 01:16:54.618 --> 01:16:57.180 like rounding down or rounding up, because anything time 01:16:57.180 --> 01:16:59.857 you divide something by 2, you might get a fractional value 01:16:59.857 --> 01:17:01.190 or you might get a whole number. 01:17:01.190 --> 01:17:02.350 So we've got to make some decisions. 01:17:02.350 --> 01:17:03.630 So it's totally solvable. 01:17:03.630 --> 01:17:06.870 And humans for decades have been writing code that implements binary search. 01:17:06.870 --> 01:17:08.160 It's totally possible. 01:17:08.160 --> 01:17:09.450 There's libraries you can use. 01:17:09.450 --> 01:17:12.492 But it's definitely more challenging, and you open yourselves up to risk. 01:17:12.492 --> 01:17:14.385 But let me stipulate that that's OK. 01:17:14.385 --> 01:17:17.763 I am good enough at this point in my progression where I'm pretty 01:17:17.763 --> 01:17:19.180 sure I can implement it correctly. 01:17:19.180 --> 01:17:21.480 So correctness is not my concern. 01:17:21.480 --> 01:17:26.880 What else might demotivate me from sorting an array of elements? 01:17:26.880 --> 01:17:30.420 And what might motivate me to, ah, just use linear search. 01:17:30.420 --> 01:17:33.150 It's so simple. 01:17:33.150 --> 01:17:34.930 Can anyone propose why? 01:17:34.930 --> 01:17:36.480 Olivia, what do you think? 01:17:36.480 --> 01:17:39.420 AUDIENCE: If the name of the game is efficiency, 01:17:39.420 --> 01:17:42.180 and you have a small enough data set, then 01:17:42.180 --> 01:17:46.040 you might as well just search it versus sort 01:17:46.040 --> 01:17:47.920 it, which would be an extra expense. 01:17:47.920 --> 01:17:49.420 DAVID MALAN: Yeah, really well said. 01:17:49.420 --> 01:17:51.450 If you've got a relatively small data set, 01:17:51.450 --> 01:17:55.050 and your computer operates at a billion operations per second, 01:17:55.050 --> 01:17:59.130 for instance, my God, who cares if your code sucks and it's a little bit slow? 01:17:59.130 --> 01:18:00.580 Just do it the inefficient way. 01:18:00.580 --> 01:18:01.080 Why? 01:18:01.080 --> 01:18:04.205 Because it's going to take you maybe a few minutes to implement the simpler 01:18:04.205 --> 01:18:07.200 algorithm like linear search, even though it's going to take longer 01:18:07.200 --> 01:18:10.350 to run, whereas it might take you tens of minutes, maybe an hour 01:18:10.350 --> 01:18:12.810 or so, to not only write but debug something 01:18:12.810 --> 01:18:16.200 like a fancier algorithm, like binary search, at which point 01:18:16.200 --> 01:18:19.710 you might have spent more time writing the code, the faster code, than you 01:18:19.710 --> 01:18:22.342 would have just running the slower code. 01:18:22.342 --> 01:18:23.800 And I can speak to this personally. 01:18:23.800 --> 01:18:26.008 Back in grad school, some of the research I was doing 01:18:26.008 --> 01:18:28.170 involved analysis of very large data sets. 01:18:28.170 --> 01:18:30.960 And I had to write code in order to analyze this data. 01:18:30.960 --> 01:18:33.330 And I could have spent hours, days, even, 01:18:33.330 --> 01:18:37.320 writing the best designed algorithm I could to analyze 01:18:37.320 --> 01:18:39.480 the data as efficiently as possible. 01:18:39.480 --> 01:18:42.240 Or, frankly, I could write the crappy version of the code, 01:18:42.240 --> 01:18:44.550 go to sleep for eight hours, and my code will just 01:18:44.550 --> 01:18:46.770 produce the output I want by morning. 01:18:46.770 --> 01:18:49.513 And that is a very real-world, reasonable trade-off to make. 01:18:49.513 --> 01:18:51.930 And indeed, this is going to be thematic in the weeks that 01:18:51.930 --> 01:18:54.615 proceed in the course, where there's going to be this trade-off. 01:18:54.615 --> 01:18:56.490 And quite often, the trade-off is going to be 01:18:56.490 --> 01:19:01.110 time, or complexity, or the amount of space or memory that you're using. 01:19:01.110 --> 01:19:04.740 And part of the art of being a good computer scientist, 01:19:04.740 --> 01:19:08.140 and in turn programmer, is trying to decide where the line is. 01:19:08.140 --> 01:19:11.430 Do you exert more effort upfront to make a better, faster, more efficient 01:19:11.430 --> 01:19:13.800 algorithm, or do you maybe cut some corners 01:19:13.800 --> 01:19:17.580 there so that you can focus your most precious resource, human time, 01:19:17.580 --> 01:19:20.430 on other, more fundamentally challenging problems? 01:19:20.430 --> 01:19:22.620 So we for the course's problem sets and labs 01:19:22.620 --> 01:19:24.880 will always prescribe what's most important. 01:19:24.880 --> 01:19:27.113 But in a few weeks' time, with one of our problem 01:19:27.113 --> 01:19:29.280 sets will you implement your very own spell checker. 01:19:29.280 --> 01:19:30.988 And among the goals of that spell checker 01:19:30.988 --> 01:19:33.690 are going to be to minimize the amount of time 01:19:33.690 --> 01:19:37.980 your code is taking to run, and also to minimize the amount of space or memory 01:19:37.980 --> 01:19:41.070 that your program is taking while running. 01:19:41.070 --> 01:19:44.010 And so we'll begin to appreciate those trade-offs ever more. 01:19:44.010 --> 01:19:47.370 But indeed, it's the case-- and I really like Olivia's formulation of it-- 01:19:47.370 --> 01:19:50.340 if your data set is pretty small, it's probably not worth 01:19:50.340 --> 01:19:54.330 writing the fastest, best designed algorithm as possible. 01:19:54.330 --> 01:19:56.400 Just write it the simple way, the correct way, 01:19:56.400 --> 01:19:58.440 and get the answer quickly, and move on. 01:19:58.440 --> 01:20:01.530 But that's not going to be the case for a lot of problems, dare I say, 01:20:01.530 --> 01:20:03.420 most problems in life. 01:20:03.420 --> 01:20:06.150 If you're building Facebook or Instagram or Whatsapp, 01:20:06.150 --> 01:20:10.980 or any of today's most popular services that are getting thousands, millions 01:20:10.980 --> 01:20:13.500 of new pieces of data at a time, you can't just 01:20:13.500 --> 01:20:17.310 linearly search all of your friends or connections on LinkedIn efficiently. 01:20:17.310 --> 01:20:20.430 You can't just linearly search the billions of web pages 01:20:20.430 --> 01:20:23.820 that Google and Microsoft index in their search engines. 01:20:23.820 --> 01:20:25.450 You've got to be smarter about it. 01:20:25.450 --> 01:20:28.200 And undoubtedly, the more successful your programs are 01:20:28.200 --> 01:20:31.350 and your code are, or websites, your apps, whatever the case may be, 01:20:31.350 --> 01:20:33.810 the more important design does come into play. 01:20:33.810 --> 01:20:38.850 So indeed, let's stipulate now that the goal is not to search these doors once; 01:20:38.850 --> 01:20:41.040 the goal is not to search these light bulbs once; 01:20:41.040 --> 01:20:44.430 the goal is not to search the phone book once, but rather again 01:20:44.430 --> 01:20:45.970 and again and again. 01:20:45.970 --> 01:20:48.600 And if that's going to be the case, then we probably 01:20:48.600 --> 01:20:52.890 should spend a little more time and a little more complexity upfront 01:20:52.890 --> 01:20:56.250 getting our code, not only right but also efficient, 01:20:56.250 --> 01:20:59.670 so that we can benefit from that efficiency again and again 01:20:59.670 --> 01:21:01.480 and again, over time. 01:21:01.480 --> 01:21:04.050 So how might we go about sorting some numbers. 01:21:04.050 --> 01:21:06.390 So in fact, let me see, to do this, if we can maybe 01:21:06.390 --> 01:21:10.470 get a hand from Brian in back. 01:21:10.470 --> 01:21:12.180 Brian, do you mind helping with sorting? 01:21:12.180 --> 01:21:13.210 BRIAN: Yeah, absolutely. 01:21:13.210 --> 01:21:16.950 So I've got eight numbers here right now that all seem to be in unsorted order. 01:21:16.950 --> 01:21:17.700 DAVID MALAN: Yeah. 01:21:17.700 --> 01:21:21.400 And Brian, could you go ahead, and could you sort these eight numbers for us? 01:21:21.400 --> 01:21:22.900 BRIAN: Yeah, I'll put them in order. 01:21:22.900 --> 01:21:27.640 So we'll take these and-- 01:21:27.640 --> 01:21:35.450 um-- and all right. 01:21:35.450 --> 01:21:37.033 I think these are now in sorted order. 01:21:37.033 --> 01:21:38.117 DAVID MALAN: Yeah, indeed. 01:21:38.117 --> 01:21:38.630 I agree. 01:21:38.630 --> 01:21:42.420 And now let's take some critique from the audience, some observations. 01:21:42.420 --> 01:21:49.140 Would someone mind explaining how Brian just sorted those eight numbers? 01:21:49.140 --> 01:21:54.650 What did Brian just do, step by step, in order to get to that end result? 01:21:54.650 --> 01:21:57.780 The input was unsorted, the output now is sorted. 01:21:57.780 --> 01:21:58.550 So what did he do? 01:21:58.550 --> 01:22:01.190 Peter, what did you see happen? 01:22:01.190 --> 01:22:03.200 AUDIENCE: He went through them step by step. 01:22:03.200 --> 01:22:07.200 And if they were in increasing order, he flipped them, 01:22:07.200 --> 01:22:10.730 and kept doing it until they were all in the correct [INAUDIBLE].. 01:22:10.730 --> 01:22:11.480 DAVID MALAN: Yeah. 01:22:11.480 --> 01:22:13.820 He kept step by step kind of looking for small values 01:22:13.820 --> 01:22:16.130 and moving them to the left, and looking for big values 01:22:16.130 --> 01:22:18.380 and moving them to the right, so effectively selecting 01:22:18.380 --> 01:22:21.330 numbers one at a time and putting it into its right place. 01:22:21.330 --> 01:22:24.163 So let's see this is, maybe in more slow motion, if you will, Brian. 01:22:24.163 --> 01:22:25.913 And if you could be a little more pedantic 01:22:25.913 --> 01:22:27.900 and explain exactly what you're doing. 01:22:27.900 --> 01:22:32.390 I see you've already reset the numbers to their original, unsorted order. 01:22:32.390 --> 01:22:35.240 Why don't we go ahead and start a little more methodically? 01:22:35.240 --> 01:22:37.910 And could you go ahead and for us, more slowly this time, 01:22:37.910 --> 01:22:40.100 select the smallest value. 01:22:40.100 --> 01:22:41.860 Because I do think, per Peter, it's going 01:22:41.860 --> 01:22:44.100 to need to end up at the far left. 01:22:44.100 --> 01:22:44.850 BRIAN: Yeah, sure. 01:22:44.850 --> 01:22:47.370 So I'm looking at the numbers, and the 1 is the smallest. 01:22:47.370 --> 01:22:49.340 So I now have the smallest value. 01:22:49.340 --> 01:22:49.580 DAVID MALAN: All right. 01:22:49.580 --> 01:22:50.872 So you did that really quickly. 01:22:50.872 --> 01:22:52.880 But I feel like you took the liberty of being 01:22:52.880 --> 01:22:56.120 a human who can kind of have this bird's eye view of everything all at once. 01:22:56.120 --> 01:22:58.162 But be a little more computer-like, if you could. 01:22:58.162 --> 01:23:00.650 And if these eight numbers are technically in an array, 01:23:00.650 --> 01:23:02.690 kind of like my seven doors out here, such 01:23:02.690 --> 01:23:06.710 that you can only look at one number at a time, can you be even more methodical 01:23:06.710 --> 01:23:10.010 and deliberate this time in telling us how you found the smallest 01:23:10.010 --> 01:23:11.760 number to put into place? 01:23:11.760 --> 01:23:12.260 BRIAN: Sure. 01:23:12.260 --> 01:23:15.110 I guess, since the computer can only look at one number at a time, 01:23:15.110 --> 01:23:17.220 I would start at the left side of this array 01:23:17.220 --> 01:23:20.490 and work my way through the right, looking at each number one at a time. 01:23:20.490 --> 01:23:22.910 So I might start with the 6 and say, OK, this right now 01:23:22.910 --> 01:23:25.180 is the smallest number I've looked at so far. 01:23:25.180 --> 01:23:28.430 But then I look at the next number, and it's a 3, and that's smaller than a 6. 01:23:28.430 --> 01:23:31.700 So now the 3, that's the smallest number I found so far. 01:23:31.700 --> 01:23:33.440 So I'll remember that and keep looking. 01:23:33.440 --> 01:23:36.107 The 8 is bigger than the 3, so I don't need to worry about that. 01:23:36.107 --> 01:23:37.520 The 5 is bigger than the 3. 01:23:37.520 --> 01:23:39.680 The 2 is smaller than the 3, so that now is 01:23:39.680 --> 01:23:41.905 the smallest number I've found so far. 01:23:41.905 --> 01:23:42.780 But I'm not done yet. 01:23:42.780 --> 01:23:43.655 So I'll keep looking. 01:23:43.655 --> 01:23:46.490 The 7 is bigger than the 2, the 4 is bigger than the 2. 01:23:46.490 --> 01:23:47.990 But the 1 is smaller than the 2. 01:23:47.990 --> 01:23:50.750 So now I've made my way all the way to the end of the array. 01:23:50.750 --> 01:23:53.023 And 1, I can say, is the smallest number that I found. 01:23:53.023 --> 01:23:53.690 DAVID MALAN: OK. 01:23:53.690 --> 01:23:56.232 So what I'm hearing is you're doing all of these comparisons, 01:23:56.232 --> 01:23:58.867 also similar to what Peter implied, and you keep checking, 01:23:58.867 --> 01:24:00.950 is this smaller, is this smaller, is this smaller, 01:24:00.950 --> 01:24:04.100 and you're keeping track of the currently smallest number you've seen? 01:24:04.100 --> 01:24:05.180 BRIAN: Yeah, that sounds about right. 01:24:05.180 --> 01:24:05.420 DAVID MALAN: All right. 01:24:05.420 --> 01:24:06.260 So you found it. 01:24:06.260 --> 01:24:07.930 And I think it belongs at the beginning. 01:24:07.930 --> 01:24:09.760 So how do we put this into place now? 01:24:09.760 --> 01:24:11.250 BRIAN: Yeah, so I want to put it at the beginning. 01:24:11.250 --> 01:24:12.583 There's not really space for it. 01:24:12.583 --> 01:24:15.383 So I could make space for it, just by shifting these numbers over. 01:24:15.383 --> 01:24:16.050 DAVID MALAN: OK. 01:24:16.050 --> 01:24:16.500 Wait, wait. 01:24:16.500 --> 01:24:19.375 But I feel like you're just-- now you're doubling the amount of work. 01:24:19.375 --> 01:24:20.708 I feel like-- don't do all that. 01:24:20.708 --> 01:24:23.167 That feels like you're going to do more steps than we need. 01:24:23.167 --> 01:24:24.530 What else could we do here? 01:24:24.530 --> 01:24:25.030 BRIAN: OK. 01:24:25.030 --> 01:24:27.343 So the other option is, it needs to go in this spot, 01:24:27.343 --> 01:24:28.760 like this first spot in the array. 01:24:28.760 --> 01:24:30.267 So I could just put it there. 01:24:30.267 --> 01:24:33.350 But if I do that, I'm going to have to take the 6 which is there right now 01:24:33.350 --> 01:24:34.790 and pull the 6 out. 01:24:34.790 --> 01:24:35.060 DAVID MALAN: All right, but I think that's-- 01:24:35.060 --> 01:24:37.170 BRIAN: So the 1 is in the right place, but the 6 isn't. 01:24:37.170 --> 01:24:37.670 DAVID MALAN: Yeah, I agree. 01:24:37.670 --> 01:24:38.878 But I think that's OK, right? 01:24:38.878 --> 01:24:42.660 Because these numbers started randomly, and so the 6 is in the wrong place 01:24:42.660 --> 01:24:43.160 anyway. 01:24:43.160 --> 01:24:46.310 I don't think we're making the problem any worse by just moving it elsewhere. 01:24:46.310 --> 01:24:49.280 And indeed, it's a lot faster, I would think, to just swap two numbers, 01:24:49.280 --> 01:24:51.530 move one to the other and vice versa, then 01:24:51.530 --> 01:24:54.020 shift all of those numbers in between. 01:24:54.020 --> 01:24:54.520 BRIAN: Yeah. 01:24:54.520 --> 01:24:56.728 So I took the 1 out of the position at the very end 01:24:56.728 --> 01:24:58.770 of the array, all the way on the right-hand side. 01:24:58.770 --> 01:25:01.320 So I guess I could take the 6 and just put it there, 01:25:01.320 --> 01:25:04.100 because that's where there's an open space to put the number. 01:25:04.100 --> 01:25:04.370 DAVID MALAN: Yeah. 01:25:04.370 --> 01:25:07.110 And it's not exactly in the right space, but again, it's no worse off. 01:25:07.110 --> 01:25:07.610 So I like that. 01:25:07.610 --> 01:25:07.820 All right. 01:25:07.820 --> 01:25:10.487 But now, the fact that the 1 is in the right place-- and indeed, 01:25:10.487 --> 01:25:12.380 you've illuminated it to indicate as much-- 01:25:12.380 --> 01:25:14.630 I feel like we can pretty much ignore the 1 henceforth 01:25:14.630 --> 01:25:16.680 and now just select the next smallest element. 01:25:16.680 --> 01:25:18.210 So can you walk us through that? 01:25:18.210 --> 01:25:19.790 BRIAN: Yeah, so I guess I'd repeat the same process. 01:25:19.790 --> 01:25:20.870 I'd start with the 3. 01:25:20.870 --> 01:25:22.928 That's the smallest number I've found so far. 01:25:22.928 --> 01:25:23.720 And I keep looking. 01:25:23.720 --> 01:25:26.750 The 8 is bigger than the 3, the 5 is bigger than the 3. 01:25:26.750 --> 01:25:27.990 The 2 is smaller than the 3. 01:25:27.990 --> 01:25:29.060 So I'll remember that 2. 01:25:29.060 --> 01:25:31.023 That's the smallest thing I've seen so far. 01:25:31.023 --> 01:25:34.190 And then I just need to check to see if there's anything smaller than the 2. 01:25:34.190 --> 01:25:36.558 And I look at the 7, the 4, and the 6. 01:25:36.558 --> 01:25:38.100 None of those are smaller than the 2. 01:25:38.100 --> 01:25:41.623 So the 2, I can say is the next smallest number for the array. 01:25:41.623 --> 01:25:42.290 DAVID MALAN: OK. 01:25:42.290 --> 01:25:43.808 And where would you put that then? 01:25:43.808 --> 01:25:45.600 BRIAN: That needs to go in the second spot. 01:25:45.600 --> 01:25:47.060 So I need to pull the 3 out. 01:25:47.060 --> 01:25:50.450 And I guess I can take the 3 and just put it into this open spot, where 01:25:50.450 --> 01:25:51.770 there's available space. 01:25:51.770 --> 01:25:52.520 DAVID MALAN: Yeah. 01:25:52.520 --> 01:25:54.650 And I feel like it's starting to become clear 01:25:54.650 --> 01:25:56.870 that we're inside some kind of loop, because you pretty much told 01:25:56.870 --> 01:25:58.670 the same story again but with a different number. 01:25:58.670 --> 01:26:00.710 Do you mind just continuing the algorithm to the end 01:26:00.710 --> 01:26:03.168 and select the next smallest, next smallest, next smallest, 01:26:03.168 --> 01:26:04.001 and get that sorted? 01:26:04.001 --> 01:26:04.501 BRIAN: Sure. 01:26:04.501 --> 01:26:05.225 So we got the 8. 01:26:05.225 --> 01:26:07.768 5 is smaller than that, 3 is smaller than that. 01:26:07.768 --> 01:26:09.560 And then the rest of the number is 7, 4, 6. 01:26:09.560 --> 01:26:10.760 Those are all bigger. 01:26:10.760 --> 01:26:13.850 So the 3, that's going to go into sorted position here. 01:26:13.850 --> 01:26:15.900 And I'll take the 8 and swap it. 01:26:15.900 --> 01:26:17.380 Now I'm going to look at the 5. 01:26:17.380 --> 01:26:18.830 8 and 7 are both bigger. 01:26:18.830 --> 01:26:21.680 The 4 is smaller than the 5, but the 6 is bigger. 01:26:21.680 --> 01:26:24.540 So the 4, that's the smallest number I've seen so far. 01:26:24.540 --> 01:26:28.160 So the 4, that's going to go into place, and I'll swap it with the 5. 01:26:28.160 --> 01:26:29.485 And now I've got the 8. 01:26:29.485 --> 01:26:31.610 The 7 is smaller than the 8, so I'll remember that. 01:26:31.610 --> 01:26:33.040 5 is smaller than that. 01:26:33.040 --> 01:26:34.550 The 6 is bigger. 01:26:34.550 --> 01:26:37.190 So the 5, that's going to be the next number. 01:26:37.190 --> 01:26:39.877 And now I'm left with 7. 01:26:39.877 --> 01:26:41.960 8 is bigger, so 7 is still the smallest I've seen. 01:26:41.960 --> 01:26:45.350 But 6 is smaller, so 6 goes next. 01:26:45.350 --> 01:26:47.570 And now I'm down to the last two. 01:26:47.570 --> 01:26:50.900 And between the last two, the 8 and the 7, the 7 is smaller. 01:26:50.900 --> 01:26:53.450 So the 7 is going to go in this spot. 01:26:53.450 --> 01:26:55.730 And at this point, I've only got one number left. 01:26:55.730 --> 01:26:58.460 So that number must be in sorted position. 01:26:58.460 --> 01:27:01.310 And now I would say that this is a sorted array of numbers. 01:27:01.310 --> 01:27:02.060 DAVID MALAN: Nice. 01:27:02.060 --> 01:27:04.250 So it definitely seems to be correct. 01:27:04.250 --> 01:27:05.600 It felt a little slow. 01:27:05.600 --> 01:27:07.808 But of course, the computer could do this much faster 01:27:07.808 --> 01:27:09.132 than we, using an actual array. 01:27:09.132 --> 01:27:11.090 And if you don't mind my making an observation, 01:27:11.090 --> 01:27:15.870 it looks like if we have eight numbers to begin with, or n more generally, 01:27:15.870 --> 01:27:20.120 it looks like you essentially did n minus 1 comparisons, 01:27:20.120 --> 01:27:23.897 because you kept comparing numbers again-- actually, did n comparisons. 01:27:23.897 --> 01:27:25.730 You looked at the first number, and then you 01:27:25.730 --> 01:27:29.000 compared it again and again and again at all of the other possible values 01:27:29.000 --> 01:27:31.440 in order to find the smallest element. 01:27:31.440 --> 01:27:31.940 BRIAN: Yeah. 01:27:31.940 --> 01:27:35.190 Because for each of the numbers in the array, I had to do a comparison to see, 01:27:35.190 --> 01:27:38.250 is it smaller than the smallest thing that I've seen so far? 01:27:38.250 --> 01:27:40.540 And if it is smaller, than I needed to remember that. 01:27:40.540 --> 01:27:41.290 DAVID MALAN: Yeah. 01:27:41.290 --> 01:27:44.510 So in each pass you considered every number, so a total of n numbers first. 01:27:44.510 --> 01:27:46.760 And so you found the number 1 you put it in its place, 01:27:46.760 --> 01:27:50.300 and that left you to be clear with n minus 1 numbers thereafter. 01:27:50.300 --> 01:27:53.630 And then after that, n minus 2 numbers, n minus 3 numbers, dot, 01:27:53.630 --> 01:27:56.130 dot, dot, all the way down to one final number. 01:27:56.130 --> 01:27:57.440 So I think this is correct. 01:27:57.440 --> 01:28:00.050 And I think that's a pretty deliberate way 01:28:00.050 --> 01:28:03.428 of sorting these elements, a little more deliberately than your first approach, 01:28:03.428 --> 01:28:05.720 Brian, which I might describe as a little more organic. 01:28:05.720 --> 01:28:06.560 You kind of did it like-- 01:28:06.560 --> 01:28:09.352 more like a human, just kind of eyeballing things and moving things 01:28:09.352 --> 01:28:09.930 around. 01:28:09.930 --> 01:28:11.722 But if we were to translate this into code, 01:28:11.722 --> 01:28:13.790 recall that we have to be ever so precise. 01:28:13.790 --> 01:28:18.470 And so let me consider altogether how exactly we might translate what Brian 01:28:18.470 --> 01:28:20.820 did ultimately to, again, pseudocode. 01:28:20.820 --> 01:28:23.280 So what he did is actually an algorithm that has a name. 01:28:23.280 --> 01:28:24.990 It's called selection sort. 01:28:24.990 --> 01:28:25.530 Why? 01:28:25.530 --> 01:28:27.380 Well, it's sorting the elements ultimately. 01:28:27.380 --> 01:28:30.530 And it's doing so by having Brian, or really the computer, 01:28:30.530 --> 01:28:33.830 select the smallest elements again and again and again. 01:28:33.830 --> 01:28:35.660 And once you found each such small element, 01:28:35.660 --> 01:28:37.820 you get the added benefit of just ignoring it. 01:28:37.820 --> 01:28:39.680 Indeed, every time Brian lit up a number, 01:28:39.680 --> 01:28:43.610 he didn't need to keep comparing it, so the amount of work he was doing 01:28:43.610 --> 01:28:47.300 was decreasing each iteration-- n numbers, then n minus 1, 01:28:47.300 --> 01:28:49.950 then n minus 2, n minus 3, and so forth. 01:28:49.950 --> 01:28:53.450 And so we can think about the running time of this algorithm 01:28:53.450 --> 01:28:56.820 as being manifest in its actual pseudocode. 01:28:56.820 --> 01:28:58.483 So how might we define the pseudocode? 01:28:58.483 --> 01:29:00.650 Well, let me propose that we think of it like this-- 01:29:00.650 --> 01:29:03.300 for i from 0 to n minus 1. 01:29:03.300 --> 01:29:05.390 Now, undoubtedly this is probably the most cryptic 01:29:05.390 --> 01:29:07.972 looking line of the three lines of pseudocode on the screen. 01:29:07.972 --> 01:29:09.680 But again, this is the kind of thing that 01:29:09.680 --> 01:29:13.470 should become rote memory over time, or just instincts with code. 01:29:13.470 --> 01:29:15.470 We've seen in C how you can write a for loop. 01:29:15.470 --> 01:29:18.680 For loops typically, by convention, start counting at 0. 01:29:18.680 --> 01:29:22.400 But if you have n elements, you don't want to count up through n. 01:29:22.400 --> 01:29:27.560 You want to count up to n or equivalently up through n minus 1, 01:29:27.560 --> 01:29:29.430 so from 0 to n minus 1. 01:29:29.430 --> 01:29:29.930 All right. 01:29:29.930 --> 01:29:31.580 Now what do I want to do on the next-- 01:29:31.580 --> 01:29:32.900 on the first iteration? 01:29:32.900 --> 01:29:37.580 Find the smallest item between the i-th item and the last item. 01:29:37.580 --> 01:29:40.580 So this is not quite obvious, I think, at first glance. 01:29:40.580 --> 01:29:43.700 But I do think it's a fair characterization of what Brian did. 01:29:43.700 --> 01:29:47.540 Because if i is initialized to 0, that was like Brian pointing 01:29:47.540 --> 01:29:51.830 his left hand at the first number on the very left of the shelf. 01:29:51.830 --> 01:29:56.250 And what he then did was he found the smallest element between the i-th item, 01:29:56.250 --> 01:29:58.655 the first item 0, and the last item. 01:29:58.655 --> 01:30:00.530 So that's kind of a very fancy way of saying, 01:30:00.530 --> 01:30:04.790 Brian, find the smallest elements among all n elements. 01:30:04.790 --> 01:30:09.680 Then what he did was swapped the smallest item with the i-th item. 01:30:09.680 --> 01:30:12.140 So he just did that switcheroo, so as to not have 01:30:12.140 --> 01:30:15.080 to waste time shifting everything over. 01:30:15.080 --> 01:30:17.330 He instead just made room for it by swapping it 01:30:17.330 --> 01:30:20.217 with the value that was in its wrong place. 01:30:20.217 --> 01:30:23.300 But now in the next iteration of this loop, consider how a for loop works. 01:30:23.300 --> 01:30:25.582 You do an i++ implicitly in pseudocode. 01:30:25.582 --> 01:30:26.790 That's what's happening here. 01:30:26.790 --> 01:30:28.370 So now i equals 1. 01:30:28.370 --> 01:30:33.800 Find the smallest item between the i-th item, item 1 0 indexed, 01:30:33.800 --> 01:30:34.880 and the last item. 01:30:34.880 --> 01:30:39.140 So this is a fancy way of saying, Brian, check all of the n elements 01:30:39.140 --> 01:30:42.200 again, except for the first, because now you're 01:30:42.200 --> 01:30:45.680 starting at location 1 instead of location 0. 01:30:45.680 --> 01:30:47.300 And now the algorithm proceeds. 01:30:47.300 --> 01:30:49.490 So you could write this code in different ways 01:30:49.490 --> 01:30:51.410 in English like pseudocode, but this seems 01:30:51.410 --> 01:30:54.650 to be a reasonable formulation of exactly that algorithm. 01:30:54.650 --> 01:30:57.470 But let's see it a little more visually now, 01:30:57.470 --> 01:31:00.808 without all of the switching around of the humans moving around the numbers. 01:31:00.808 --> 01:31:02.600 Let me go ahead and use this visualization. 01:31:02.600 --> 01:31:04.040 And we'll put a link on the course's website 01:31:04.040 --> 01:31:05.707 if you'd like to play with this as well. 01:31:05.707 --> 01:31:10.520 This is just someone's visualization of an array of numbers. 01:31:10.520 --> 01:31:14.540 But this time, rather than represent the numbers as symbols, decimal digits, 01:31:14.540 --> 01:31:17.510 now this person is using vertical bars, like a bar chart. 01:31:17.510 --> 01:31:20.720 And what this means is that a small bar is like a small number, 01:31:20.720 --> 01:31:22.430 and a big bar is a big number. 01:31:22.430 --> 01:31:26.810 So the goal here is to these bars, which equivalently might as well be numbers, 01:31:26.810 --> 01:31:29.777 from short bars over to tall bars, left to right. 01:31:29.777 --> 01:31:30.860 And I'm going to go ahead. 01:31:30.860 --> 01:31:34.100 And along the top of the here, I can choose my sorting algorithm. 01:31:34.100 --> 01:31:36.830 And the one we just described, recall, was selection sort. 01:31:36.830 --> 01:31:39.170 So let me go ahead and do this. 01:31:39.170 --> 01:31:41.900 And notice, it takes a moment, I think, to wrap your mind 01:31:41.900 --> 01:31:43.740 around what's happening here. 01:31:43.740 --> 01:31:48.200 But notice that this pink line is going from left to right, because that's 01:31:48.200 --> 01:31:49.730 essentially what Brian was doing. 01:31:49.730 --> 01:31:52.730 He was walking back and forth, back and forth, back and forth 01:31:52.730 --> 01:31:56.180 through that shelf of numbers, looking for the next smallest number, 01:31:56.180 --> 01:32:00.260 and he kept putting the smallest number over on the left where it belongs. 01:32:00.260 --> 01:32:02.660 And indeed, that's why in this visualization 01:32:02.660 --> 01:32:07.730 you see the small numbers beginning to be put into place on the left 01:32:07.730 --> 01:32:09.500 as we keep swooping through. 01:32:09.500 --> 01:32:13.760 But notice, the colored bar keeps starting later and later, 01:32:13.760 --> 01:32:18.222 more rightward and more rightward, just like Brian was not retracing his steps. 01:32:18.222 --> 01:32:20.430 As soon as he lit up the numbers, he left them alone. 01:32:20.430 --> 01:32:23.257 And voila, all of these numbers are now sorted. 01:32:23.257 --> 01:32:26.090 So that's just a graphical way of thinking about the same algorithm. 01:32:26.090 --> 01:32:28.970 But how efficient or inefficient was that? 01:32:28.970 --> 01:32:31.100 Well, let's see if we can apply some numbers here. 01:32:31.100 --> 01:32:33.517 But there's also ways to do this a little more intuitively 01:32:33.517 --> 01:32:35.160 over time, which we'll do, too. 01:32:35.160 --> 01:32:38.450 So if the first time through the shelf of numbers, he 01:32:38.450 --> 01:32:41.420 had eight numbers at his disposal-- he had to look at all eight numbers 01:32:41.420 --> 01:32:43.560 in order to decide which of these is the smallest. 01:32:43.560 --> 01:32:45.380 So that's n steps initially. 01:32:45.380 --> 01:32:48.090 The next time he did a pass through the shelf, 01:32:48.090 --> 01:32:51.140 he ignored the brightly lit number 1, because it was already 01:32:51.140 --> 01:32:53.660 in place by definition of what he had already done. 01:32:53.660 --> 01:32:56.480 So now he had n minus 1 steps to go. 01:32:56.480 --> 01:33:01.860 Then he did another n minus 2 steps, then n minus 3, n minus 4, n minus 5, 01:33:01.860 --> 01:33:05.270 dot, dot, dot, all the way down to the final step, where he just 01:33:05.270 --> 01:33:08.210 had to find and leave alone the number 8, 01:33:08.210 --> 01:33:11.280 because that was the biggest number, so one single step. 01:33:11.280 --> 01:33:13.988 So this is some kind of series here, mathematically. 01:33:13.988 --> 01:33:17.030 You might recall something like this in, like, the back of your math book 01:33:17.030 --> 01:33:19.697 or in high school, or back of your physics textbook or the like. 01:33:19.697 --> 01:33:23.330 It turns out that this actually sums up to this formula here-- 01:33:23.330 --> 01:33:25.692 n times n plus 1 divided by 2. 01:33:25.692 --> 01:33:28.400 And if that's not familiar, you don't remember that, no big deal. 01:33:28.400 --> 01:33:31.280 Just let me stipulate that the mathematical formula with which we 01:33:31.280 --> 01:33:35.000 began, where we had the series of n, plus n minus 1, plus n minus 2, 01:33:35.000 --> 01:33:38.150 plus n minus 3, dot, dot, dot, simply sums up ultimately 01:33:38.150 --> 01:33:42.650 to the more succinct n times n plus 1 divided by 2. 01:33:42.650 --> 01:33:47.000 This, of course, if we multiply it out, gives us n squared plus n divided by 2. 01:33:47.000 --> 01:33:50.770 And this now, I will propose, gives us just this-- 01:33:50.770 --> 01:33:54.440 n squared divided by 2 plus n/2. 01:33:54.440 --> 01:33:56.900 So if we really wanted to be nit-picky, this 01:33:56.900 --> 01:34:01.400 is the total number of steps, or operations, or seconds, 01:34:01.400 --> 01:34:04.040 however we want to measure Brian's running time. 01:34:04.040 --> 01:34:07.970 This seems to be the precise mathematical formula therefore. 01:34:07.970 --> 01:34:11.700 But at the beginning of this week, we considered again, 01:34:11.700 --> 01:34:13.160 the sort of Big O notation. 01:34:13.160 --> 01:34:16.760 With a wave of the hand, we care more about the order of magnitude 01:34:16.760 --> 01:34:18.080 on which an algorithm operates. 01:34:18.080 --> 01:34:22.850 I really don't care about these divided by 2 and n/2. 01:34:22.850 --> 01:34:26.120 Because which of these factors is going to matter as n gets big? 01:34:26.120 --> 01:34:29.210 The bigger the phone book gets, the more doors we have, 01:34:29.210 --> 01:34:33.370 the more light bulbs we have, the more numbers we have on the shelf, 01:34:33.370 --> 01:34:35.810 n is going to keep getting bigger and bigger and bigger. 01:34:35.810 --> 01:34:38.430 And given that, which is the dominant factor? 01:34:38.430 --> 01:34:42.080 Rongxin, if we could call on someone here, which of these factors, 01:34:42.080 --> 01:34:48.320 n squared divided by 2, or n divided by 2, really matters in the long run 01:34:48.320 --> 01:34:54.090 as our problems get bigger and bigger, as n gets bigger and bigger? 01:34:54.090 --> 01:34:57.630 Which of those factors mathematically dominates? 01:34:57.630 --> 01:34:58.130 Anika? 01:34:58.130 --> 01:35:00.218 AUDIENCE: Oh, it's Anika, but-- 01:35:00.218 --> 01:35:01.010 DAVID MALAN: Anika. 01:35:01.010 --> 01:35:02.635 AUDIENCE: It would be the-- no problem. 01:35:02.635 --> 01:35:04.272 It would be the n squared. 01:35:04.272 --> 01:35:05.480 DAVID MALAN: Yeah, n squared. 01:35:05.480 --> 01:35:05.980 Right. 01:35:05.980 --> 01:35:08.390 If you take any number for n and you square it, 01:35:08.390 --> 01:35:11.150 that's going to be bigger, certainly in the long run, 01:35:11.150 --> 01:35:12.720 than just doing n divided by 2. 01:35:12.720 --> 01:35:15.470 And so with our Big O notation, we could describe the running time 01:35:15.470 --> 01:35:20.930 of Brian's selection sort implementation as, ah, it's on the order of n squared. 01:35:20.930 --> 01:35:23.150 Yes, I'm ignoring some numbers, and yes, if we really 01:35:23.150 --> 01:35:25.760 wanted to be nit-picky and count up every single step 01:35:25.760 --> 01:35:29.450 that Brian took, yes, it's n squared divided by 2 plus n/2. 01:35:29.450 --> 01:35:33.080 But again, if you think about the problem over time and n 01:35:33.080 --> 01:35:36.170 getting really large, sort of Facebook-sized, Twitter-sized, 01:35:36.170 --> 01:35:39.890 Google-sized, what's really going to dominate mathematically 01:35:39.890 --> 01:35:41.690 is this bigger factor here. 01:35:41.690 --> 01:35:44.420 That's what's going to make the total number of steps way 01:35:44.420 --> 01:35:47.370 bigger than just those smaller order terms. 01:35:47.370 --> 01:35:49.250 So in Big O notation, selection sort would 01:35:49.250 --> 01:35:51.540 seem to be on the order of n squared. 01:35:51.540 --> 01:35:54.140 So if we consider our chart from before where 01:35:54.140 --> 01:35:57.920 we had the upper bounds on our searching algorithms, 01:35:57.920 --> 01:36:01.410 both linear and binary, this one, unfortunately, 01:36:01.410 --> 01:36:05.765 is at really the tip top of this particular list of running times. 01:36:05.765 --> 01:36:07.140 And there's infinitely many more. 01:36:07.140 --> 01:36:09.470 These are just a subset of the more common formulas 01:36:09.470 --> 01:36:11.807 that a computer scientist might use and think about. 01:36:11.807 --> 01:36:13.640 Selection sort is kind of a top of the list. 01:36:13.640 --> 01:36:15.800 And being number one on this list is bad. 01:36:15.800 --> 01:36:18.980 n squared is certainly much slower than, say, 01:36:18.980 --> 01:36:22.610 big O of 1, which, of course, was constant time or one step. 01:36:22.610 --> 01:36:24.580 So I wonder if we could be-- 01:36:24.580 --> 01:36:25.880 if we could do a little better. 01:36:25.880 --> 01:36:27.770 I wonder if we could do a little better. 01:36:27.770 --> 01:36:29.970 And Peter actually did say something else earlier, 01:36:29.970 --> 01:36:33.890 which was about like sharing two numbers and fixing problems. 01:36:33.890 --> 01:36:35.720 And if I can kind of run with that, let me 01:36:35.720 --> 01:36:39.620 propose that we, Brian, return to you for a look at an algorithm that 01:36:39.620 --> 01:36:43.390 might be called instead bubble sort, bubble sort 01:36:43.390 --> 01:36:45.740 being a different algorithm, this one that 01:36:45.740 --> 01:36:47.287 tries to fix problems more locally. 01:36:47.287 --> 01:36:49.370 So in fact, Brian, if you look at the numbers that 01:36:49.370 --> 01:36:51.440 are in front of you, which you've kindly reset 01:36:51.440 --> 01:36:55.100 to their original, unsorted location, I feel like this really, 01:36:55.100 --> 01:36:58.578 if we focus on just pairs of numbers, it's just a lot of small numbers. 01:36:58.578 --> 01:37:00.620 Like last time, we tried to solve the big problem 01:37:00.620 --> 01:37:02.030 and sorting the whole thing. 01:37:02.030 --> 01:37:06.050 What if we just look at pairs of numbers that are adjacent to one another? 01:37:06.050 --> 01:37:09.860 Can we maybe make some little tweaks and change our algorithm fundamentally? 01:37:09.860 --> 01:37:14.450 So for instance, Brian, 6 and 3, what observation can you make there for us? 01:37:14.450 --> 01:37:15.200 BRIAN: Yeah, sure. 01:37:15.200 --> 01:37:18.233 So 6 and 3 that's, the first pair of numbers in the array. 01:37:18.233 --> 01:37:20.900 And if I want the array to be sorted, I want the smaller numbers 01:37:20.900 --> 01:37:23.670 to be on the left and the bigger numbers to be on the right. 01:37:23.670 --> 01:37:27.227 So just looking at this pair, I can tell you that the 6 and 3 or out of order. 01:37:27.227 --> 01:37:29.810 The 3 should be on the left, and the 6 should be on the right. 01:37:29.810 --> 01:37:30.140 DAVID MALAN: All right. 01:37:30.140 --> 01:37:31.932 So let's go ahead and do that, and go ahead 01:37:31.932 --> 01:37:34.070 and fix that by swapping those two. 01:37:34.070 --> 01:37:35.632 And just fix a small little problem. 01:37:35.632 --> 01:37:37.340 And now let's repeat this process, right? 01:37:37.340 --> 01:37:39.690 Loops seem to be omnipresent in a lot of our algorithms. 01:37:39.690 --> 01:37:41.213 So 6 and 8 is the next such pair. 01:37:41.213 --> 01:37:43.130 What you want-- what do you think about those? 01:37:43.130 --> 01:37:46.362 BRIAN: That particular pair seems OK, because the 6 is smaller and already 01:37:46.362 --> 01:37:47.070 on the left side. 01:37:47.070 --> 01:37:48.560 So I think I can leave this pair alone. 01:37:48.560 --> 01:37:49.518 DAVID MALAN: All right. 01:37:49.518 --> 01:37:50.380 How about 8 and 5? 01:37:50.380 --> 01:37:51.920 BRIAN: The 8 is bigger than the 5. 01:37:51.920 --> 01:37:53.300 So I'm going to swap these two. 01:37:53.300 --> 01:37:55.190 The 5 should be on the left of the 8. 01:37:55.190 --> 01:37:56.148 DAVID MALAN: All right. 01:37:56.148 --> 01:37:56.870 And 8 and 2? 01:37:56.870 --> 01:37:58.537 BRIAN: Same thing here, the 8 is bigger. 01:37:58.537 --> 01:38:00.337 So the 8 is going to be swapped with the 2. 01:38:00.337 --> 01:38:01.670 DAVID MALAN: All right, 8 and 7. 01:38:01.670 --> 01:38:06.133 BRIAN: The 8 is bigger than the 7, so the 8 I should switch with the 7. 01:38:06.133 --> 01:38:07.425 DAVID MALAN: All right 8 and 4? 01:38:07.425 --> 01:38:09.968 BRIAN: 8 and 4, same thing, it's bigger than the 4. 01:38:09.968 --> 01:38:11.010 DAVID MALAN: And 8 and 1. 01:38:11.010 --> 01:38:12.385 BRIAN: I can do it one last time. 01:38:12.385 --> 01:38:14.912 The 8 is bigger than the 1, and I think that's all. 01:38:14.912 --> 01:38:16.870 DAVID MALAN: And with a nice dramatic flourish, 01:38:16.870 --> 01:38:18.900 if you step off to the side, voila-- 01:38:18.900 --> 01:38:20.340 not sorted. 01:38:20.340 --> 01:38:23.070 In fact, it doesn't really look all that much better. 01:38:23.070 --> 01:38:26.070 But I do think Brian's done something smart here. 01:38:26.070 --> 01:38:29.550 Brian, can you speak to at least some of the marginal improvements 01:38:29.550 --> 01:38:30.300 that you've made? 01:38:30.300 --> 01:38:30.590 BRIAN: Yeah. 01:38:30.590 --> 01:38:32.298 So there are some improvements, at least. 01:38:32.298 --> 01:38:36.890 The 1 originally was all the way at the very end, and it moved back one spot. 01:38:36.890 --> 01:38:39.390 And the other improvement, I think, is that the 8 originally 01:38:39.390 --> 01:38:42.150 was way over here on the left side of the array somewhere. 01:38:42.150 --> 01:38:44.130 But because the 8 is the biggest number, I 01:38:44.130 --> 01:38:46.380 kept switching it over and over again until it made it 01:38:46.380 --> 01:38:47.640 all the way to the end. 01:38:47.640 --> 01:38:51.270 And so now actually, I think this 8 is in the correct place. 01:38:51.270 --> 01:38:54.660 It's the biggest number, and it ended up moving its way all the way 01:38:54.660 --> 01:38:56.190 to the right side of the array. 01:38:56.190 --> 01:38:56.550 DAVID MALAN: Yeah. 01:38:56.550 --> 01:38:59.130 And this is where this algorithm that we'll see the rest of in just a moment 01:38:59.130 --> 01:39:00.720 gets its name, bubble sort-- 01:39:00.720 --> 01:39:04.470 alludes to the fact that the biggest numbers start bubbling their way up 01:39:04.470 --> 01:39:08.220 to the top of, or the end of, the list, at the right-hand side 01:39:08.220 --> 01:39:09.570 of the shelf as Brian notes. 01:39:09.570 --> 01:39:14.350 But notice, as Brian does, too, the number 1 only moved over one position. 01:39:14.350 --> 01:39:16.080 So there's clearly more work to be done. 01:39:16.080 --> 01:39:18.690 And that's obvious from the other numbers being misordered as well. 01:39:18.690 --> 01:39:19.950 But we have improved things. 01:39:19.950 --> 01:39:23.760 The 8 is in place, and the 1 is closer to being in place. 01:39:23.760 --> 01:39:25.377 So how might we proceed next? 01:39:25.377 --> 01:39:28.210 Well, Brian, let's continue to solve some small bite-sized problems. 01:39:28.210 --> 01:39:29.668 Let's start at the beginning again. 01:39:29.668 --> 01:39:30.400 3 and 6? 01:39:30.400 --> 01:39:30.900 BRIAN: Sure. 01:39:30.900 --> 01:39:33.540 The 3 and the 6, those seem to be in order, so I'll leave those alone. 01:39:33.540 --> 01:39:34.590 DAVID MALAN: 6 and 5. 01:39:34.590 --> 01:39:37.420 BRIAN: 6 and 5 or out of the order, so I'll go ahead and take the 6 01:39:37.420 --> 01:39:38.420 and put it to the right. 01:39:38.420 --> 01:39:39.300 DAVID MALAN: 6 and 2. 01:39:39.300 --> 01:39:42.045 BRIAN: Those are out of order as well, so I'll swap the 2 and the 6. 01:39:42.045 --> 01:39:42.930 DAVID MALAN: 6 and 7. 01:39:42.930 --> 01:39:44.032 BRIAN: 6 and 7 are OK. 01:39:44.032 --> 01:39:44.740 They're in order. 01:39:44.740 --> 01:39:45.690 DAVID MALAN: 7 and 4. 01:39:45.690 --> 01:39:48.535 BRIAN: Those are out of order, so I'll switch the 4 and the 7. 01:39:48.535 --> 01:39:49.410 DAVID MALAN: 7 and 1. 01:39:49.410 --> 01:39:52.300 BRIAN: And those two are out of order as well, so I'll swap those. 01:39:52.300 --> 01:39:55.800 And now I think the 7 has made its way to the sorted position as well. 01:39:55.800 --> 01:39:56.783 DAVID MALAN: Indeed. 01:39:56.783 --> 01:39:58.200 So now we're making some progress. 01:39:58.200 --> 01:40:02.100 7 has bubbled its way up to the top of the list, stopping just before the 8, 01:40:02.100 --> 01:40:05.550 whereas the 1 has continued its advance to its correct location. 01:40:05.550 --> 01:40:08.490 So I bet, Brian, if we keep doing this again and again 01:40:08.490 --> 01:40:11.610 and again, so long as the list remains in part unsorted, 01:40:11.610 --> 01:40:13.530 I think we'll probably get to the finish line. 01:40:13.530 --> 01:40:15.690 Do you want to take it from here and sort the rest? 01:40:15.690 --> 01:40:16.270 BRIAN: Yeah, sure. 01:40:16.270 --> 01:40:17.830 So I just repeat the process again. 01:40:17.830 --> 01:40:19.410 The 3 and the 5 are OK. 01:40:19.410 --> 01:40:22.070 The 2 and the 5 are out of order, so I'll swap them. 01:40:22.070 --> 01:40:24.330 The 5 and the 6, those are fine as a pair. 01:40:24.330 --> 01:40:28.200 The 6 and the 4, out of order relative to each other, so I'll switch those. 01:40:28.200 --> 01:40:31.510 And the 6 and the 1, those are out of order as well, so I'll swap those. 01:40:31.510 --> 01:40:34.680 And now the 6, that I can say is in its correct position. 01:40:34.680 --> 01:40:36.010 And I'll repeat it again. 01:40:36.010 --> 01:40:38.820 The 3 and the 2 are out of order, so those get switched. 01:40:38.820 --> 01:40:40.380 The 3 and the 5 are OK. 01:40:40.380 --> 01:40:43.080 The 5 and the 4 are out of order, so those get switched. 01:40:43.080 --> 01:40:47.520 And then the 5 and the 1 need to be switched as well. 01:40:47.520 --> 01:40:49.325 So there's the 5 in sorted position. 01:40:49.325 --> 01:40:50.700 And now I'm left with these four. 01:40:50.700 --> 01:40:53.070 The 2 and the 3 are OK, the 3 and the 4 OK. 01:40:53.070 --> 01:40:55.110 But the 4 and the 1 are out of order. 01:40:55.110 --> 01:40:58.790 So those get switched, and now the four, that's in its place. 01:40:58.790 --> 01:41:02.720 The 2 and the 3 are OK, but the 3 and the 1 are not, so I'll swap those. 01:41:02.720 --> 01:41:05.700 And now the 3 goes into its sorted place. 01:41:05.700 --> 01:41:08.900 And then finally, the last pair to consider is just the 2 and the 1. 01:41:08.900 --> 01:41:12.690 Those are out of order, so I'll swap those, and now the 2 is in place. 01:41:12.690 --> 01:41:16.280 And 1 is the only remaining number, so I can say that that one's in place, too. 01:41:16.280 --> 01:41:18.440 And now I think we have a sorted array. 01:41:18.440 --> 01:41:19.190 DAVID MALAN: Nice. 01:41:19.190 --> 01:41:21.860 So it felt like this was a fundamentally different approach, 01:41:21.860 --> 01:41:23.762 but we still got to the same end point. 01:41:23.762 --> 01:41:26.720 So that really now invites the question as to whether bubbles or it was 01:41:26.720 --> 01:41:29.370 better or worse or maybe no different. 01:41:29.370 --> 01:41:33.270 But notice, too, that we've solved the same problem fundamentally differently. 01:41:33.270 --> 01:41:36.427 The first time, we took the more human natural intuition of just, 01:41:36.427 --> 01:41:37.510 find the smallest element. 01:41:37.510 --> 01:41:39.480 All right, do it again, do it again, do it again. 01:41:39.480 --> 01:41:42.230 This time, we sort of viewed the problem through a different lens. 01:41:42.230 --> 01:41:44.360 And we thought about, it would seem, what does it 01:41:44.360 --> 01:41:46.130 mean for the list to be unsorted? 01:41:46.130 --> 01:41:48.260 As Peter noted, it's when things are out of order. 01:41:48.260 --> 01:41:51.380 Like that very basic primitive where something is out of order 01:41:51.380 --> 01:41:54.380 suggests an opportunity to solve the problem that way. 01:41:54.380 --> 01:41:57.170 Just fix all of the tiny bite-sized problems. 01:41:57.170 --> 01:42:00.470 And it would seem that using a loop, if we repeat that intuition, 01:42:00.470 --> 01:42:03.506 is going to pay off eventually by fixing, fixing, fixing, 01:42:03.506 --> 01:42:06.620 fixing all of the little problems until the big one itself 01:42:06.620 --> 01:42:08.000 would seem to go away. 01:42:08.000 --> 01:42:12.150 Well, let me return to the visualization from before, re-randomize the bars-- 01:42:12.150 --> 01:42:15.480 short bar is small number, big bar is big number. 01:42:15.480 --> 01:42:17.900 And let me go ahead and run the bubble sort algorithm, 01:42:17.900 --> 01:42:20.390 this time with this visualization. 01:42:20.390 --> 01:42:24.680 And you'll notice now sweeping from left to right are two colored bars that 01:42:24.680 --> 01:42:30.320 represent the comparison of two adjacent numbers again and again and again. 01:42:30.320 --> 01:42:33.770 And you'll see this time that the bars are being a little smart, 01:42:33.770 --> 01:42:36.470 and they're not going all the way to the end every time, 01:42:36.470 --> 01:42:39.440 just like Brian illuminated the numbers and stopped 01:42:39.440 --> 01:42:43.080 looking at the 8 and the 7 and the 6 once they were in place. 01:42:43.080 --> 01:42:46.910 But he and this visualization do indeed keep returning to the beginning, 01:42:46.910 --> 01:42:50.330 doing another pass, another pass, and another pass. 01:42:50.330 --> 01:42:53.270 So if we think ahead to the analysis of this algorithm, 01:42:53.270 --> 01:42:57.560 it sort of invites us to consider, well, how many total comparisons are there 01:42:57.560 --> 01:42:58.610 this time? 01:42:58.610 --> 01:43:01.280 It would seem that the very first time through the bars, 01:43:01.280 --> 01:43:04.250 or equivalently the very first time through the shelf, Brian 01:43:04.250 --> 01:43:07.830 and this visualization did like n minus 1 comparisons. 01:43:07.830 --> 01:43:10.090 So n minus 1 comparisons from left to right, out 01:43:10.090 --> 01:43:13.880 of n elements you can compare n minus 1 adjacencies. 01:43:13.880 --> 01:43:17.780 After that it was n minus 2, n minus 3, n minus 4, 01:43:17.780 --> 01:43:22.050 n minus 5, until just two or one remain, and at that point you're done. 01:43:22.050 --> 01:43:25.070 So even though this algorithm fundamentally took a different approach 01:43:25.070 --> 01:43:29.160 and achieved the same goal, it sorted the elements successfully. 01:43:29.160 --> 01:43:31.400 Let's consider how it was implemented in code 01:43:31.400 --> 01:43:35.270 and whether it's actually a little faster or a little slower. 01:43:35.270 --> 01:43:37.910 And let's set one final bar, in fact, too. 01:43:37.910 --> 01:43:42.218 Earlier, we considered only the upper bound on selection sort, 01:43:42.218 --> 01:43:44.510 just so that we have something to compare this against. 01:43:44.510 --> 01:43:48.590 Let's also consider for a moment what the running time is 01:43:48.590 --> 01:43:52.940 of selection sort in terms of a lower bound-- best case scenario. 01:43:52.940 --> 01:43:56.690 With selection sort, if you have n elements, 01:43:56.690 --> 01:43:59.810 and you keep looking for the next smallest element, again and again 01:43:59.810 --> 01:44:04.390 and again, it turns out that selection sort is not really our friend. 01:44:04.390 --> 01:44:07.790 Here's, for instance, the chart of where we left off in terms of omega notation 01:44:07.790 --> 01:44:08.450 before. 01:44:08.450 --> 01:44:10.850 Linear search and binary search could very well 01:44:10.850 --> 01:44:13.910 get lucky and take just one step if you happen to open a door 01:44:13.910 --> 01:44:17.340 and, voila, the number you're looking for is already there. 01:44:17.340 --> 01:44:20.360 But with selection sort, as we've implemented it, 01:44:20.360 --> 01:44:23.150 both with Brian and with the visualization, 01:44:23.150 --> 01:44:26.990 unfortunately it's none so good with the lower bound. 01:44:26.990 --> 01:44:27.920 Why? 01:44:27.920 --> 01:44:32.720 Well, Brian pretty naively, every time he searched for a number, 01:44:32.720 --> 01:44:37.230 started at the left and went all the way to the right, started at the left, 01:44:37.230 --> 01:44:38.480 went all the way to the right. 01:44:38.480 --> 01:44:41.808 To be fair, he did ignore the numbers that were already in place. 01:44:41.808 --> 01:44:44.600 So he didn't keep looking at the 1, he didn't keep looking at the 2 01:44:44.600 --> 01:44:46.260 once they were in place. 01:44:46.260 --> 01:44:50.390 But he did keep repeating himself again and again, touching 01:44:50.390 --> 01:44:52.400 those numbers multiple times each. 01:44:52.400 --> 01:44:55.400 So again, even though you and I, the humans, could look at those numbers 01:44:55.400 --> 01:44:57.692 and be like, obviously there's the 1, obviously there's 01:44:57.692 --> 01:44:59.540 the 2, the obviously there's the 3, Brian 01:44:59.540 --> 01:45:01.430 had to do it much more methodically. 01:45:01.430 --> 01:45:07.220 And in fact, even if that list of numbers were perfectly sorted, 01:45:07.220 --> 01:45:09.480 he would have wasted just as much time. 01:45:09.480 --> 01:45:11.840 In fact, Brian, if you don't mind, could you quickly 01:45:11.840 --> 01:45:14.510 sort all eight numbers again? 01:45:14.510 --> 01:45:17.210 And Brian, if we start with a sorted list, 01:45:17.210 --> 01:45:21.190 this is kind of a nice perversion to consider, if you will, algorithmically. 01:45:21.190 --> 01:45:22.940 When analyzing an algorithm, sometimes you 01:45:22.940 --> 01:45:25.358 want to consider best cases and worst cases. 01:45:25.358 --> 01:45:28.400 And there would seem to be nothing better than, heck, the list is already 01:45:28.400 --> 01:45:31.530 sorted, you got lucky, there's really no work to be done. 01:45:31.530 --> 01:45:34.040 The worst case is the list is maybe completely backwards, 01:45:34.040 --> 01:45:36.020 and that's a huge amount of work to be done. 01:45:36.020 --> 01:45:40.262 Unfortunately, selection sort doesn't really optimize for that lucky case 01:45:40.262 --> 01:45:41.470 where they're already sorted. 01:45:41.470 --> 01:45:44.780 So Brian, I see you've resorted the numbers for us from left to right. 01:45:44.780 --> 01:45:48.380 If we were to re-execute selection sort as before, 01:45:48.380 --> 01:45:51.005 how would you go about finding the smallest number? 01:45:51.005 --> 01:45:53.630 BRIAN: So we decided earlier that, to find the smallest number, 01:45:53.630 --> 01:45:55.422 I need to look at all the numbers from left 01:45:55.422 --> 01:45:58.580 to right in the array and each time check to see if I found something. 01:45:58.580 --> 01:45:59.390 smaller. 01:45:59.390 --> 01:46:00.590 So I would start with the 1. 01:46:00.590 --> 01:46:02.530 That's the smallest thing I've seen so far. 01:46:02.530 --> 01:46:04.070 But I would have to keep looking, because maybe there's 01:46:04.070 --> 01:46:05.780 a 0 or a negative number later on. 01:46:05.780 --> 01:46:08.020 I need to check to see if there's anything smaller. 01:46:08.020 --> 01:46:11.240 So I would check, the 2 is bigger, the 3, 4, 5, 6, 7, 8. 01:46:11.240 --> 01:46:12.117 They're all bigger. 01:46:12.117 --> 01:46:13.700 So it turns out I was right all along. 01:46:13.700 --> 01:46:16.730 The 1 was the smallest number, and it's already in place. 01:46:16.730 --> 01:46:18.590 So now that number is in place. 01:46:18.590 --> 01:46:20.090 DAVID MALAN: And then to find the next smallest number, 01:46:20.090 --> 01:46:21.260 what would you have done? 01:46:21.260 --> 01:46:22.635 BRIAN: I would do the same thing. 01:46:22.635 --> 01:46:24.320 2 is the smallest number I found so far. 01:46:24.320 --> 01:46:25.580 And then I would look through all the rest 01:46:25.580 --> 01:46:27.497 to see if there's anything smaller than the 2. 01:46:27.497 --> 01:46:30.170 And I would look at 3, 4, 5, 6, 7, 8. 01:46:30.170 --> 01:46:31.880 Nothing's smaller than the 2. 01:46:31.880 --> 01:46:34.910 So I go back to the two and say, OK, that number must now 01:46:34.910 --> 01:46:36.327 be in its sorted position. 01:46:36.327 --> 01:46:37.160 DAVID MALAN: Indeed. 01:46:37.160 --> 01:46:40.035 And that story would be the same for the 3, for the 4, and for the 5. 01:46:40.035 --> 01:46:43.610 Like, nowhere in selection sort pseudocode or actual code 01:46:43.610 --> 01:46:46.850 is there any sort of intelligence of, eh, if the numbers are already sorted, 01:46:46.850 --> 01:46:47.630 quit. 01:46:47.630 --> 01:46:50.540 Like, there was no opportunity to short circuit and abort 01:46:50.540 --> 01:46:51.645 that algorithm earlier. 01:46:51.645 --> 01:46:53.520 Brian would literally be doing the same work, 01:46:53.520 --> 01:46:55.340 whether they're all sorted from the get-go 01:46:55.340 --> 01:46:57.620 or completely unsorted, and even backwards. 01:46:57.620 --> 01:47:00.660 And so selection sort doesn't really perform very highly. 01:47:00.660 --> 01:47:03.048 So now we're hoping bubble sort, indeed, does. 01:47:03.048 --> 01:47:05.590 So toward that end, let's take a look at some proposed pseudo 01:47:05.590 --> 01:47:09.040 code for bubble sort, assuming that the input is anything. 01:47:09.040 --> 01:47:10.790 Whether sorted or unsorted, the pseudocode 01:47:10.790 --> 01:47:13.070 is always going to look like this. 01:47:13.070 --> 01:47:14.510 Repeat until sorted. 01:47:14.510 --> 01:47:17.100 For i from 0 to n minus 2-- 01:47:17.100 --> 01:47:18.190 now, what does this mean? 01:47:18.190 --> 01:47:22.010 0 to n minus 1 goes from the first element to the last. 01:47:22.010 --> 01:47:27.050 So 0 to n minus 2 goes from the first element to the second to last. 01:47:27.050 --> 01:47:27.950 Why am I doing that? 01:47:27.950 --> 01:47:29.210 We'll see in just a moment. 01:47:29.210 --> 01:47:34.010 The condition inside of this loop is, if the i-th and the i plus 1th elements 01:47:34.010 --> 01:47:36.720 are out of order, swap them. 01:47:36.720 --> 01:47:38.590 So this is me being a little clever. 01:47:38.590 --> 01:47:41.090 If you think about all of these numbers as being in an array 01:47:41.090 --> 01:47:46.100 or behind doors, if you iterate from 0 to n minus 2, 01:47:46.100 --> 01:47:49.010 that's like going from the first door to the second to last door. 01:47:49.010 --> 01:47:52.830 But that's good, because my condition is checking door i and i plus 1. 01:47:52.830 --> 01:47:57.530 So if I start at the beginning here, and I only iterate up to this door, 01:47:57.530 --> 01:47:58.520 that's a good thing. 01:47:58.520 --> 01:48:02.070 Because when I compared door i and i plus 1, at the very end 01:48:02.070 --> 01:48:05.100 I'm going to compare door i and i plus 1. 01:48:05.100 --> 01:48:09.020 What I don't want to do is compare this door i against door i 01:48:09.020 --> 01:48:10.763 plus 1, which doesn't even exist. 01:48:10.763 --> 01:48:13.430 And indeed, that's going to be an error that probably all of you 01:48:13.430 --> 01:48:14.510 make at some point-- 01:48:14.510 --> 01:48:19.040 going beyond the boundary of an array, touching memory 01:48:19.040 --> 01:48:22.233 that is going one or more spaces too far in the array, 01:48:22.233 --> 01:48:24.150 even though you didn't allocate memory for it. 01:48:24.150 --> 01:48:26.940 So this hedges against that possibility. 01:48:26.940 --> 01:48:29.630 So this would seem to be a pretty smart algorithm. 01:48:29.630 --> 01:48:35.030 But as written, it's not actually as performant as might be ideal. 01:48:35.030 --> 01:48:39.830 With bubble sort, suppose the list were entirely sorted. 01:48:39.830 --> 01:48:43.910 Brian, not to make you sort and resort numbers too many times. 01:48:43.910 --> 01:48:47.300 Do you mind giving us a sorted list one more time real quick? 01:48:47.300 --> 01:48:50.660 In a moment, I want to see, if we consider that same sorted list as 01:48:50.660 --> 01:48:54.920 before, this time with bubble sort, can we do fundamentally better? 01:48:54.920 --> 01:48:58.160 I have this code saying, repeat until sorted. 01:48:58.160 --> 01:48:59.330 So how might this change? 01:48:59.330 --> 01:49:01.310 So Brian, you've got the sorted numbers again. 01:49:01.310 --> 01:49:02.660 This should be a good case. 01:49:02.660 --> 01:49:05.990 But selection sort did not benefit from this input, 01:49:05.990 --> 01:49:07.700 even though we could have gotten lucky. 01:49:07.700 --> 01:49:10.085 Bubble sort, what would your thought process be here? 01:49:10.085 --> 01:49:11.960 BRIAN: So the thought process for bubble sort 01:49:11.960 --> 01:49:14.030 was to go through each of the pairs one at a time 01:49:14.030 --> 01:49:17.000 and see if I need to make a swap for that particular pair. 01:49:17.000 --> 01:49:18.620 So I'd look at the 1 and the 2. 01:49:18.620 --> 01:49:20.600 Those two are OK, I don't need to swap them. 01:49:20.600 --> 01:49:21.763 The 2 and the 3 are OK. 01:49:21.763 --> 01:49:23.180 I don't need to make a swap there. 01:49:23.180 --> 01:49:24.270 The 3 and the 4 are OK. 01:49:24.270 --> 01:49:25.820 The 4 and the 5 are OK. 01:49:25.820 --> 01:49:29.750 Same with the 5 and the 6, and the 6 and the 7, and the 7 and the 8. 01:49:29.750 --> 01:49:32.480 So I made my way through all the entire array, 01:49:32.480 --> 01:49:36.020 and I never needed to make any swap, because every pair that I looked at, 01:49:36.020 --> 01:49:38.840 they were already in the correct order relative to each other. 01:49:38.840 --> 01:49:39.673 DAVID MALAN: Indeed. 01:49:39.673 --> 01:49:42.440 And so it would be foolish and so obvious this time 01:49:42.440 --> 01:49:45.140 if Brian literally retraced those steps and did it 01:49:45.140 --> 01:49:49.130 again with n minus 1 elements, and then did it again with n minus 2 elements. 01:49:49.130 --> 01:49:52.490 I mean, if he didn't do any work, any swaps the first pass, 01:49:52.490 --> 01:49:54.830 he's literally wasting his own time by even 01:49:54.830 --> 01:49:57.050 doing another pass or another pass. 01:49:57.050 --> 01:50:00.650 And so that's kind of implicit in the pseudocode, this repeat until sorted. 01:50:00.650 --> 01:50:03.110 Even though it doesn't translate perfectly into a for loop 01:50:03.110 --> 01:50:07.250 or a while loop in C, it kind of says intuitively what he should do-- 01:50:07.250 --> 01:50:08.390 repeat until sorted. 01:50:08.390 --> 01:50:11.150 Brian has already identified the fact, by nature of him 01:50:11.150 --> 01:50:13.670 not having made any swaps, that this list is sorted. 01:50:13.670 --> 01:50:16.400 Therefore, he can just stop, and this loop does not 01:50:16.400 --> 01:50:18.020 have to continue again and again. 01:50:18.020 --> 01:50:21.620 We can map this to C-like code a little more explicitly. 01:50:21.620 --> 01:50:24.770 We can by default say, do the following n minus 1 times. 01:50:24.770 --> 01:50:29.570 Because among n elements, you can look at n minus 1 total pairs from left 01:50:29.570 --> 01:50:31.620 to right without going too far. 01:50:31.620 --> 01:50:35.100 But notice, I can add an additional line of code here 01:50:35.100 --> 01:50:40.070 which might say, if no swaps, quit from the algorithm altogether. 01:50:40.070 --> 01:50:42.890 So, so long as Brian is keeping track of how many swaps 01:50:42.890 --> 01:50:47.160 he made or didn't make through one pass, as with a variable called counter 01:50:47.160 --> 01:50:50.940 or whatever, he can simply abort this algorithm early and certainly 01:50:50.940 --> 01:50:52.687 then save us some time. 01:50:52.687 --> 01:50:55.020 So with that said, let's consider for just a moment what 01:50:55.020 --> 01:50:59.280 the running time of bubble sort might be in terms of an upper bound, 01:50:59.280 --> 01:51:01.120 in the worst case, if you will. 01:51:01.120 --> 01:51:05.070 Well, in the case of bubble sort, notice with the pseudocode 01:51:05.070 --> 01:51:08.160 where we're doing something n minus 1 times, 01:51:08.160 --> 01:51:11.820 and inside of that we're doing something n minus 1 times. 01:51:11.820 --> 01:51:14.130 So again, repeat n minus 1 times literally 01:51:14.130 --> 01:51:17.040 says, do the following n minus 1 times. 01:51:17.040 --> 01:51:19.830 The for loop here, which is just a different way in pseudocode 01:51:19.830 --> 01:51:24.540 of expressing a similar idea but giving us a variable this time, for i from 0 01:51:24.540 --> 01:51:25.950 to n minus 1-- 01:51:25.950 --> 01:51:32.890 n minus 2, is a total number of n minus 1 comparisons. 01:51:32.890 --> 01:51:37.490 So this is an n minus 1 thing inside the repeat, 01:51:37.490 --> 01:51:39.680 and an n minus 1 outside the repeat. 01:51:39.680 --> 01:51:44.720 So I think what that gives me is n minus 1 things times n minus 1 times. 01:51:44.720 --> 01:51:47.180 So now if I just kind of FOIL this, sort of in high school 01:51:47.180 --> 01:51:51.080 or middle school math, n squared minus 1n minus 1n plus 1. 01:51:51.080 --> 01:51:54.115 We can combine like terms, n squared minus 2n plus 1. 01:51:54.115 --> 01:51:57.240 But per our discussion earlier, ugh, this is really getting into the weeds. 01:51:57.240 --> 01:51:59.870 Who cares about the 2n or the 1? 01:51:59.870 --> 01:52:04.320 The dominant factor as n gets large is definitely going to be the n squared. 01:52:04.320 --> 01:52:06.440 So it would seem that bubble sort, if you actually 01:52:06.440 --> 01:52:10.130 do out the math and the formulas, is going to have an upper bound of n 01:52:10.130 --> 01:52:13.200 squared, or rather, on the order of n squared steps. 01:52:13.200 --> 01:52:17.420 So in that sense, it is equivalent to selection sort. 01:52:17.420 --> 01:52:19.490 It is no better fundamentally. 01:52:19.490 --> 01:52:22.430 It's what we would say ask asymptotically equivalent. 01:52:22.430 --> 01:52:25.220 That is, as n gets really large, this formula 01:52:25.220 --> 01:52:28.670 is, for all intents and purposes, equivalent to the selection sort 01:52:28.670 --> 01:52:30.750 formula, even though they differed slightly 01:52:30.750 --> 01:52:32.930 in terms of their lower order terms. 01:52:32.930 --> 01:52:37.080 For all intents and purposes, ah, they're on the order of n squared both. 01:52:37.080 --> 01:52:40.460 But if we consider a lower bound, perhaps, 01:52:40.460 --> 01:52:43.520 even though bubble sort has the same upper bound running time, 01:52:43.520 --> 01:52:47.420 if we consider a lower bound, as with this smarter code, where Brian might 01:52:47.420 --> 01:52:50.840 actually have the wherewithal to notice, wait a minute, I didn't do any swaps, 01:52:50.840 --> 01:52:54.630 I'm just going to exit out of this looping pretty much early-- 01:52:54.630 --> 01:52:56.540 not even prematurely but early, because it 01:52:56.540 --> 01:52:59.120 would be fruitless to keep doing more and more work-- 01:52:59.120 --> 01:53:01.760 we can then whittle down this running time. 01:53:01.760 --> 01:53:07.610 I think-- not quite as good as omega of 1, which was constant time-- 01:53:07.610 --> 01:53:12.200 like, you cannot conclude definitively that an array is sorted unless you 01:53:12.200 --> 01:53:14.360 minimally look at all of the elements once. 01:53:14.360 --> 01:53:17.540 So constant time is completely naive and unrealistic. 01:53:17.540 --> 01:53:21.230 You can't look at one element, or two or three, and say, yes, this is sorted. 01:53:21.230 --> 01:53:24.650 You've got to obviously look at all of the elements at least once. 01:53:24.650 --> 01:53:28.400 So this would seem to suggest that the omega notation for it, that 01:53:28.400 --> 01:53:31.520 is, the lower bound on bubble sort's running time, 01:53:31.520 --> 01:53:37.850 if we're clever and don't retrace our steps unnecessarily, is in omega of n. 01:53:37.850 --> 01:53:39.950 Or technically, it's n minus 1 steps, right? 01:53:39.950 --> 01:53:41.950 Because if you've got n elements and you compare 01:53:41.950 --> 01:53:44.030 these two, these two, these two, these two, 01:53:44.030 --> 01:53:45.650 that's n minus 1 total comparisons. 01:53:45.650 --> 01:53:47.450 But who cares about the minus 1? 01:53:47.450 --> 01:53:53.380 It's on the order of n, or omega of n notation here. 01:53:53.380 --> 01:53:57.240 So to recap, selection sort selects the next smallest element again and again 01:53:57.240 --> 01:53:57.870 and again. 01:53:57.870 --> 01:54:01.020 Unfortunately, based on how it's implemented in pseudocode and actual 01:54:01.020 --> 01:54:03.330 code, it's in Big O of n squared. 01:54:03.330 --> 01:54:05.370 But it's also an omega of n squared, which 01:54:05.370 --> 01:54:10.170 means it's always going to take the same amount of time asymptotically, that is, 01:54:10.170 --> 01:54:11.490 as n gets large. 01:54:11.490 --> 01:54:16.083 Unfortunately, too, bubble sort is no better, it would seem, 01:54:16.083 --> 01:54:17.250 in terms of the upper bound. 01:54:17.250 --> 01:54:19.500 It's going to take as many as n squared steps, too. 01:54:19.500 --> 01:54:23.610 But it's at least marginally better when it comes to using something 01:54:23.610 --> 01:54:26.280 like an input that's already sorted. 01:54:26.280 --> 01:54:30.880 It can short circuit and not waste time. 01:54:30.880 --> 01:54:32.850 But honestly, n squared is bad. 01:54:32.850 --> 01:54:34.950 Like, n squared is really going to add up quickly. 01:54:34.950 --> 01:54:39.180 If you've got n squared and n is a million or n is a billion, I mean, 01:54:39.180 --> 01:54:40.860 my God, that's a lot of 0's. 01:54:40.860 --> 01:54:44.820 That's a lot of steps in the total running time of your algorithm. 01:54:44.820 --> 01:54:46.180 Can we do better? 01:54:46.180 --> 01:54:47.500 Can we do better? 01:54:47.500 --> 01:54:48.720 And it turns out we can. 01:54:48.720 --> 01:54:52.560 And we'll consider one final algorithm today that does fundamentally better. 01:54:52.560 --> 01:54:57.000 Just like in week 0, we sort of latched onto binary search and again today-- 01:54:57.000 --> 01:55:01.390 it's just fundamentally better than linear search by an order of magnitude, 01:55:01.390 --> 01:55:01.890 so to speak. 01:55:01.890 --> 01:55:05.250 Its picture representation was fundamentally different. 01:55:05.250 --> 01:55:09.048 I think we can do fundamentally better than bubble sort and selection sort. 01:55:09.048 --> 01:55:10.840 And so while both bubble sort and selection 01:55:10.840 --> 01:55:13.465 sort might be the sort of thing that I was using in grad school 01:55:13.465 --> 01:55:15.840 just to rip up the code quickly and then go to sleep, 01:55:15.840 --> 01:55:18.272 it's not going to work well for very large data sets. 01:55:18.272 --> 01:55:19.980 And frankly, it wouldn't have worked well 01:55:19.980 --> 01:55:22.140 if I didn't want to just sleep through the problem. 01:55:22.140 --> 01:55:26.190 Rather, we want to do things as efficiently as we can from the get go. 01:55:26.190 --> 01:55:29.560 And let me propose that we leverage a technique-- 01:55:29.560 --> 01:55:32.310 and this is a technique that you can use in almost any programming 01:55:32.310 --> 01:55:33.960 language, C among them-- 01:55:33.960 --> 01:55:35.430 known as recursion. 01:55:35.430 --> 01:55:41.940 And recursion, quite simply, is the ability for a function to call itself. 01:55:41.940 --> 01:55:44.760 Up until now, we have not seen any examples of this. 01:55:44.760 --> 01:55:47.070 We've seen functions calling other functions. 01:55:47.070 --> 01:55:49.170 Main keeps calling printf. 01:55:49.170 --> 01:55:51.000 Main has started to call strlen. 01:55:51.000 --> 01:55:54.930 Main called strcmp, compare, earlier today. 01:55:54.930 --> 01:55:56.700 But we've never seen main call main. 01:55:56.700 --> 01:55:59.770 And people don't do that, so that's not going to solve the problem. 01:55:59.770 --> 01:56:02.640 But we can implement our own functions and have 01:56:02.640 --> 01:56:05.100 our own functions call themselves. 01:56:05.100 --> 01:56:07.290 Now, this would seem to be a bad idea in principle. 01:56:07.290 --> 01:56:10.020 If a function calls itself, my God, where does it end? 01:56:10.020 --> 01:56:11.820 It would seem to just do something forever, 01:56:11.820 --> 01:56:13.487 and then something bad probably happens. 01:56:13.487 --> 01:56:14.132 And it could. 01:56:14.132 --> 01:56:15.840 And that's the danger of using recursion. 01:56:15.840 --> 01:56:17.495 You can screw it up easily. 01:56:17.495 --> 01:56:19.620 But it's also a very powerful technique, because it 01:56:19.620 --> 01:56:21.870 allows us to think about potential solutions 01:56:21.870 --> 01:56:25.770 to problems in a very interesting, and daresay elegant, way. 01:56:25.770 --> 01:56:29.160 So we're not only going to be able to achieve correctness but also better 01:56:29.160 --> 01:56:32.610 design, because of better efficiency, it would seem, here. 01:56:32.610 --> 01:56:33.960 So let me propose this. 01:56:33.960 --> 01:56:37.830 Recall this code from week 0, which was the pseudocode for finding someone 01:56:37.830 --> 01:56:38.790 in a phone book. 01:56:38.790 --> 01:56:42.180 And recall that, among the features of this pseudocode, 01:56:42.180 --> 01:56:44.520 were these lines here, "Go back to line 3." 01:56:44.520 --> 01:56:48.950 And we describe those in week 0 as being representative of loops, 01:56:48.950 --> 01:56:52.950 a programming construct that has something happen again and again. 01:56:52.950 --> 01:56:56.640 But you know what, there's a missed opportunity here in this pseudocode 01:56:56.640 --> 01:56:59.590 to use a technique known as recursion. 01:56:59.590 --> 01:57:02.280 This implementation is what we would call iterative. 01:57:02.280 --> 01:57:04.290 It is purely loop based. 01:57:04.290 --> 01:57:07.230 It tells me literally, go back to this line, go back to this line, 01:57:07.230 --> 01:57:08.400 go back to this line. 01:57:08.400 --> 01:57:10.230 There's no calling yourself. 01:57:10.230 --> 01:57:13.620 But what if I changed week 0's pseudocode to be a little more 01:57:13.620 --> 01:57:14.550 like this? 01:57:14.550 --> 01:57:19.110 Let me go ahead and get rid of, not just that one line but two lines 01:57:19.110 --> 01:57:21.000 in both of those conditions. 01:57:21.000 --> 01:57:23.310 And let me quite simply say, instead of open 01:57:23.310 --> 01:57:26.280 to the middle of the left half of the book and then go back to line 3, 01:57:26.280 --> 01:57:29.910 or open to the middle of the right half of the book and then go back to line 3, 01:57:29.910 --> 01:57:34.950 why don't I just more elegantly say, search left half of book, 01:57:34.950 --> 01:57:36.660 search right half of book? 01:57:36.660 --> 01:57:39.780 Now, immediately I can shorten the code a little bit. 01:57:39.780 --> 01:57:44.880 But I claim that by just saying search left half of book and search right 01:57:44.880 --> 01:57:49.140 half of book, I claim that this is enough information 01:57:49.140 --> 01:57:50.760 to implement the very same algorithm. 01:57:50.760 --> 01:57:53.100 But it's not using a loop per se. 01:57:53.100 --> 01:57:56.010 It's going to induce me the human or me the computer 01:57:56.010 --> 01:57:57.840 to do something again and again. 01:57:57.840 --> 01:58:00.270 But there's other ways to do things again and again-- 01:58:00.270 --> 01:58:03.690 not by way of a for loop, or a while loop, or a do while loop, 01:58:03.690 --> 01:58:06.360 or a repeat block, or a forever block-- 01:58:06.360 --> 01:58:08.610 you can actually use recursion. 01:58:08.610 --> 01:58:12.690 And recursion, again, is this technique where a function can call itself. 01:58:12.690 --> 01:58:15.660 And if we consider, after all, the pseudocode we are looking at 01:58:15.660 --> 01:58:18.210 is the pseudocode for searching. 01:58:18.210 --> 01:58:24.210 And on line 7 and 9 now, I am literally saying, "Search left half of book," 01:58:24.210 --> 01:58:28.800 and "Search right half of book," this is already, even in pseudocode form, 01:58:28.800 --> 01:58:30.270 an example of recursion. 01:58:30.270 --> 01:58:34.530 Here I have in 11 lines of code an algorithm or a function 01:58:34.530 --> 01:58:36.440 that searches a phone book. 01:58:36.440 --> 01:58:40.830 In lines 7 and 9, I have lines of code that literally say, search 01:58:40.830 --> 01:58:44.460 a phone book, but more specifically, search half of the phone book. 01:58:44.460 --> 01:58:47.520 And that's where recursion really works its magic. 01:58:47.520 --> 01:58:50.760 It would be foolish and incorrect and completely counterproductive 01:58:50.760 --> 01:58:53.148 to just have a function call itself with the same input, 01:58:53.148 --> 01:58:55.440 with the same input, with the same input, because you'd 01:58:55.440 --> 01:58:57.960 have to be kind of crazy to expect different output 01:58:57.960 --> 01:59:00.120 if the input is constantly the same. 01:59:00.120 --> 01:59:03.540 But that's not what we did in week 0, and that's not what we're doing now. 01:59:03.540 --> 01:59:07.110 If you use the same function, or equivalently algorithm, 01:59:07.110 --> 01:59:11.580 but change the input to be smaller and smaller and smaller, 01:59:11.580 --> 01:59:14.770 it's probably OK that a function is calling itself, 01:59:14.770 --> 01:59:18.120 so long as you have at least one line of code in there 01:59:18.120 --> 01:59:20.820 that very intelligently says, if you're out of doors, 01:59:20.820 --> 01:59:23.310 if you're out of phone book pages, quit. 01:59:23.310 --> 01:59:25.500 You need to have a so-called base case. 01:59:25.500 --> 01:59:28.650 You need some line of code that's going to notice, wait a minute, there's 01:59:28.650 --> 01:59:31.950 no more problem to be solved, quit now. 01:59:31.950 --> 01:59:35.190 And so how can we map this to actual code? 01:59:35.190 --> 01:59:38.130 Well, let's consider something very familiar from week 1. 01:59:38.130 --> 01:59:40.440 Recall when you reconstructed one of Mario's pyramids. 01:59:40.440 --> 01:59:43.330 It looked a little something like this. 01:59:43.330 --> 01:59:46.050 And let's consider that this is a pyramid of blocks, 01:59:46.050 --> 01:59:47.880 of bricks, that's of height 4. 01:59:47.880 --> 01:59:48.420 Why 4? 01:59:48.420 --> 01:59:52.450 Well, there's 1, then 2, then 3, then 4 bricks from top to bottom. 01:59:52.450 --> 01:59:54.210 So the total height here is 4. 01:59:54.210 --> 01:59:58.740 But let me ask the question, a little naively, how do you go about creating, 01:59:58.740 --> 02:00:02.730 or how do you go about printing a pyramid of height 4? 02:00:02.730 --> 02:00:06.000 Well, it turns out that this simple Mario pyramid, that's 02:00:06.000 --> 02:00:09.210 ever more clear if we get rid of the unnecessary background, 02:00:09.210 --> 02:00:12.700 is a recursive structure of some sort. 02:00:12.700 --> 02:00:14.250 It's a recursive physical structure. 02:00:14.250 --> 02:00:15.020 Why? 02:00:15.020 --> 02:00:18.780 Well, notice that this structure, this brick, this pyramid, 02:00:18.780 --> 02:00:20.970 is kind of defined in terms of itself. 02:00:20.970 --> 02:00:21.690 Why? 02:00:21.690 --> 02:00:24.660 Well, how do you make a pyramid of height 4? 02:00:24.660 --> 02:00:27.870 I would argue, a little obnoxiously, a little circularly, well, 02:00:27.870 --> 02:00:30.240 you create a pyramid of height 3, and then 02:00:30.240 --> 02:00:32.670 you add an additional row of bricks. 02:00:32.670 --> 02:00:33.400 All right. 02:00:33.400 --> 02:00:34.440 Well, let's continue that logic. 02:00:34.440 --> 02:00:35.107 All right, fine. 02:00:35.107 --> 02:00:38.160 How do you build a pyramid of height 3? 02:00:38.160 --> 02:00:41.670 Well, you sort of smile and say, well, you build a pyramid of height 2, 02:00:41.670 --> 02:00:43.110 and then you add one more layer. 02:00:43.110 --> 02:00:43.590 All right, fine. 02:00:43.590 --> 02:00:45.215 How do you build a pyramid of height 2? 02:00:45.215 --> 02:00:49.165 Well, you build a pyramid of height 1, and then you add one more layer. 02:00:49.165 --> 02:00:51.040 Well, how do you build a pyramid of height 1? 02:00:51.040 --> 02:00:53.610 Well, you just put the stupid brick down. 02:00:53.610 --> 02:00:56.280 You have a base case, where you sort of state the obvious 02:00:56.280 --> 02:00:57.990 and just do something once. 02:00:57.990 --> 02:00:59.550 You hardcode the logic. 02:00:59.550 --> 02:01:02.610 But notice what's kind of mind bending, or kind 02:01:02.610 --> 02:01:05.550 of obnoxious in a human interaction, like, 02:01:05.550 --> 02:01:08.640 you're just defining the answer in terms of itself. 02:01:08.640 --> 02:01:10.470 I keep saying the same thing. 02:01:10.470 --> 02:01:14.790 But that's OK, because the pyramid keeps getting smaller and smaller and smaller 02:01:14.790 --> 02:01:16.895 until I can handle that one special case. 02:01:16.895 --> 02:01:19.770 And so we can do this just for fun with these little cardboard bricks 02:01:19.770 --> 02:01:20.920 here, for instance. 02:01:20.920 --> 02:01:24.010 If I want to build a pyramid of height 4, how do I do it? 02:01:24.010 --> 02:01:26.220 Well, I can build a pyramid of height 3. 02:01:26.220 --> 02:01:29.280 All right, let me go ahead and build a pyramid of height 3. 02:01:29.280 --> 02:01:31.095 How do I build a pyramid of height 3? 02:01:31.095 --> 02:01:33.970 All right, well, I build a pyramid of height 2, and then I add to it. 02:01:33.970 --> 02:01:37.090 OK, how do I build a pyramid of height 2? 02:01:37.090 --> 02:01:38.790 Well, you build a pyramid of height 1. 02:01:38.790 --> 02:01:39.773 How do I do that? 02:01:39.773 --> 02:01:41.190 Well, you just put the brick down. 02:01:41.190 --> 02:01:43.140 And so here's where things kind of bottom out, 02:01:43.140 --> 02:01:45.390 and it's no longer a cyclical argument. 02:01:45.390 --> 02:01:47.640 You eventually just do some actual work. 02:01:47.640 --> 02:01:51.780 But in my mind, I have to remember all of the instructions you just gave me, 02:01:51.780 --> 02:01:53.040 or I gave myself. 02:01:53.040 --> 02:01:56.850 I had to build a pyramid of height 4; nope, 3; nope, 2; nope, 1. 02:01:56.850 --> 02:01:58.270 Now I'm actually doing that. 02:01:58.270 --> 02:02:00.090 So here's a pyramid of height 1. 02:02:00.090 --> 02:02:02.610 How do I now build a pyramid of height 2? 02:02:02.610 --> 02:02:04.350 Well, rewind in the story. 02:02:04.350 --> 02:02:07.950 To build a pyramid of height 2, you build a pyramid of height 1, 02:02:07.950 --> 02:02:09.940 and then you add one more layer. 02:02:09.940 --> 02:02:14.440 So I think to add one more layer, I essentially need to do this. 02:02:14.440 --> 02:02:14.940 All right. 02:02:14.940 --> 02:02:16.740 Now I have a pyramid of height 2. 02:02:16.740 --> 02:02:17.490 But wait a minute. 02:02:17.490 --> 02:02:20.010 The story began with, how do I build a pyramid of height 3? 02:02:20.010 --> 02:02:22.740 Well, you take a pyramid of height 2, which I have here, 02:02:22.740 --> 02:02:24.330 and you add an additional layer. 02:02:24.330 --> 02:02:26.370 So I've got to build this additional layer. 02:02:26.370 --> 02:02:30.900 I'm going to go ahead and give myself the layer, the layer, the layer. 02:02:30.900 --> 02:02:34.950 And then I'm going to put the original pyramid of height to on top of it. 02:02:34.950 --> 02:02:37.598 And voila, it's a pyramid of height 3 now. 02:02:37.598 --> 02:02:38.640 Well, how did I get here? 02:02:38.640 --> 02:02:40.230 Well, let me keep rewinding in the story. 02:02:40.230 --> 02:02:42.120 The very first question I asked myself was, 02:02:42.120 --> 02:02:43.828 how do you build a pyramid of height 4? 02:02:43.828 --> 02:02:45.870 Well, the answer was build a pyramid of height 3. 02:02:45.870 --> 02:02:47.110 Great, that's done. 02:02:47.110 --> 02:02:49.222 Then add one additional layer. 02:02:49.222 --> 02:02:51.930 And if I had more hands, I could do this a little more elegantly, 02:02:51.930 --> 02:02:54.190 but let me go ahead and just lay this out. 02:02:54.190 --> 02:02:57.280 Here's the new level of height 3. 02:02:57.280 --> 02:02:58.920 And now I'm going to go-- 02:02:58.920 --> 02:03:00.840 of width 4. 02:03:00.840 --> 02:03:05.940 Now I'm going to go and put the pyramid of height 3 on top of it, until voila, 02:03:05.940 --> 02:03:09.840 I have this form here of Mario's pyramid. 02:03:09.840 --> 02:03:13.020 So it's a bit cyclical in that, every time I 02:03:13.020 --> 02:03:15.627 asked myself to build a pyramid of a certain height, 02:03:15.627 --> 02:03:18.210 I kind of punted and said, no, build a pyramid of this height. 02:03:18.210 --> 02:03:19.290 No, build a pyramid of this height. 02:03:19.290 --> 02:03:20.940 No, build a pyramid of this height. 02:03:20.940 --> 02:03:25.890 But the magic of that algorithm was that there was constantly 02:03:25.890 --> 02:03:29.520 this, do a little more work, build a layer, do a little more work, 02:03:29.520 --> 02:03:30.690 build a layer. 02:03:30.690 --> 02:03:35.160 And it's in that implicit building of layer after layer after layer 02:03:35.160 --> 02:03:38.483 that the pyramid itself, the end goal, actually emerges. 02:03:38.483 --> 02:03:41.400 So you could implement the same thing with a for loop or a while loop. 02:03:41.400 --> 02:03:42.330 And frankly, you did. 02:03:42.330 --> 02:03:45.210 It was a slightly different shape for problem set 1, 02:03:45.210 --> 02:03:47.340 but you did the same thing using a loop. 02:03:47.340 --> 02:03:51.060 And you kind of had to do it that way, at least as we prescribed it. 02:03:51.060 --> 02:03:54.327 Because with printf, you have to print from the top of the screen 02:03:54.327 --> 02:03:54.910 to the bottom. 02:03:54.910 --> 02:03:57.910 Like, we haven't shown you a technique yet to print a layer 02:03:57.910 --> 02:03:59.253 and then go back on top. 02:03:59.253 --> 02:04:01.420 So I'm kind of taking some real-world liberties here 02:04:01.420 --> 02:04:03.503 by lifting these things up and moving them around. 02:04:03.503 --> 02:04:06.010 You'd have to be a little more clever in code. 02:04:06.010 --> 02:04:07.300 But the idea is the same. 02:04:07.300 --> 02:04:09.280 And so even physical objects like this can 02:04:09.280 --> 02:04:12.770 have some recursive definition to them. 02:04:12.770 --> 02:04:14.800 And so we present this sort of goofy example, 02:04:14.800 --> 02:04:18.708 because this notion of recursion is a fundamental programming technique 02:04:18.708 --> 02:04:20.500 that you can leverage now to solve problems 02:04:20.500 --> 02:04:22.480 in a fundamentally different way. 02:04:22.480 --> 02:04:26.500 And I think for this, we need one final visualization of merge sort, 02:04:26.500 --> 02:04:28.450 with both Brian's help and the computer's. 02:04:28.450 --> 02:04:32.170 And merge sort is going to be an algorithm whose pseudocode is, daresay, 02:04:32.170 --> 02:04:35.440 the simplest we've seen thus far, but deceptively simple. 02:04:35.440 --> 02:04:38.740 The pseudocode for merge sort, quite simply, is this-- 02:04:38.740 --> 02:04:42.690 sort the left half of numbers, sort the right half of numbers, 02:04:42.690 --> 02:04:44.950 merge the sorted halves. 02:04:44.950 --> 02:04:48.640 And notice, even at first glance this feels kind of unfair. 02:04:48.640 --> 02:04:51.310 Like, here's an algorithm for sorting, and yet I'm 02:04:51.310 --> 02:04:54.820 literally using the word "sort" in my algorithm for sorting. 02:04:54.820 --> 02:04:57.070 It's like in English if you're asked to define a word, 02:04:57.070 --> 02:04:59.290 and you literally use the word in the definition. 02:04:59.290 --> 02:05:03.950 Like, that rarely flies, because you're just making a circular argument. 02:05:03.950 --> 02:05:08.740 But in code, it's OK, so long as there's one special step that's doing something 02:05:08.740 --> 02:05:10.990 a little differently, and so long as the problem keeps 02:05:10.990 --> 02:05:12.157 getting smaller and smaller. 02:05:12.157 --> 02:05:13.180 And indeed it is. 02:05:13.180 --> 02:05:16.593 This pseudocode is not saying, sort the numbers, sort the numbers, 02:05:16.593 --> 02:05:17.260 sort of numbers. 02:05:17.260 --> 02:05:22.060 No, it's dividing the problem in half and then solving the other half 02:05:22.060 --> 02:05:22.700 as well. 02:05:22.700 --> 02:05:25.090 So it's shrinking the problem on each iteration. 02:05:25.090 --> 02:05:28.667 Now, I will disclaim we're going to need that so-called base case again. 02:05:28.667 --> 02:05:31.000 I'm going to have to do something stupid, but necessary, 02:05:31.000 --> 02:05:33.670 and say, if there's only one number, quit. 02:05:33.670 --> 02:05:34.540 It's sorted. 02:05:34.540 --> 02:05:36.370 That's the so-called base case. 02:05:36.370 --> 02:05:39.970 The recursive case is where the function calls itself. 02:05:39.970 --> 02:05:44.920 But this is, indeed, our third and final sorting algorithm called merge sort. 02:05:44.920 --> 02:05:48.290 And we'll focus here really on the juiciest pieces, 02:05:48.290 --> 02:05:49.870 one, this notion of merging. 02:05:49.870 --> 02:05:52.270 So in fact, Brian, can we come over to you 02:05:52.270 --> 02:05:56.300 just so we can define, before we look at the merge sort algorithm itself, 02:05:56.300 --> 02:06:00.010 what do we even mean when we say merge sorted halves? 02:06:00.010 --> 02:06:04.510 So for instance, Brian has on his shelf here two arrays of size 4. 02:06:04.510 --> 02:06:08.680 In the first array on the left are four integers, 3, 5, 6, 8. 02:06:08.680 --> 02:06:12.460 And in the right side, in another array of size 4, 02:06:12.460 --> 02:06:15.160 are four numbers, too, 1, 2, 4, 7. 02:06:15.160 --> 02:06:18.370 Both the left is sorted and the right is sorted. 02:06:18.370 --> 02:06:21.880 But now, Brian, I would like you to merge these sorted halves. 02:06:21.880 --> 02:06:23.480 Tell us what that means. 02:06:23.480 --> 02:06:23.980 BRIAN: Sure. 02:06:23.980 --> 02:06:26.680 So if I have a left half that sorted from smallest 02:06:26.680 --> 02:06:30.490 to largest and a right half that's also sorted from smallest to largest, 02:06:30.490 --> 02:06:34.120 I want to merge them into a new list that has all of the same numbers 02:06:34.120 --> 02:06:36.090 also from smallest to largest. 02:06:36.090 --> 02:06:38.890 And I guess where I could start here is that the smallest 02:06:38.890 --> 02:06:43.570 number of the combined array needs to begin with either the smallest 02:06:43.570 --> 02:06:46.540 number of the left half or the smallest number of the right half. 02:06:46.540 --> 02:06:49.690 So on the left the smallest number is the 3, and on the right 02:06:49.690 --> 02:06:51.580 the smallest number is the 1. 02:06:51.580 --> 02:06:55.030 Of those two has got to be the smallest number for the entire array. 02:06:55.030 --> 02:06:58.030 And between the 3 and the 1, the 1 is smaller. 02:06:58.030 --> 02:07:03.100 So I would take that 1, and that's going to be the first number, the smallest 02:07:03.100 --> 02:07:06.297 number, of the merged two halves. 02:07:06.297 --> 02:07:08.380 And then I guess I would repeat the process again. 02:07:08.380 --> 02:07:10.960 On the left side the smallest number is the 3. 02:07:10.960 --> 02:07:13.060 On the right side the smallest number is the 2. 02:07:13.060 --> 02:07:16.120 And between the 3 and the 2, 2 is smaller. 02:07:16.120 --> 02:07:19.270 So I would take the 2 [INAUDIBLE] and that's going to be the next number. 02:07:19.270 --> 02:07:22.090 So I'm slowly building up this sorted array that 02:07:22.090 --> 02:07:23.770 is the result of combining these two. 02:07:23.770 --> 02:07:27.040 Now I'm comparing the 3 on the left to the 4 on the right. 02:07:27.040 --> 02:07:29.170 Between the 3 and the 4, the 3 is smaller. 02:07:29.170 --> 02:07:33.020 So I'll take the 3, and we'll put that one into position. 02:07:33.020 --> 02:07:36.010 Now I'm comparing the 5 on the left with the 4 on the right. 02:07:36.010 --> 02:07:38.380 Between the 5 and the 4, the 4 is smaller. 02:07:38.380 --> 02:07:41.130 So that one goes into position. 02:07:41.130 --> 02:07:44.970 And then now I'm comparing the 5 on the left with the 7 on the right. 02:07:44.970 --> 02:07:47.811 5 is smaller, so the 5 goes next. 02:07:47.811 --> 02:07:50.880 Next I'm comparing the 6 on the left with the 7 on the right. 02:07:50.880 --> 02:07:55.200 The 6 is still smaller, so that one is going to go next. 02:07:55.200 --> 02:07:58.290 Now I'm comparing the 8 and the 7, the only two numbers left. 02:07:58.290 --> 02:08:00.650 The 7 is the smaller between the two. 02:08:00.650 --> 02:08:03.530 So I'll take the 7 and put that into place. 02:08:03.530 --> 02:08:05.280 And now I'm only left with one number that 02:08:05.280 --> 02:08:09.900 hasn't been put into the merging of the two halves, and that's the number 8. 02:08:09.900 --> 02:08:12.510 So that number is going to take up the final position. 02:08:12.510 --> 02:08:16.440 And now I've taken these to halves, each of which was originally sorted, 02:08:16.440 --> 02:08:19.887 and made one complete array that has all of those numbers in sorted order. 02:08:19.887 --> 02:08:20.720 DAVID MALAN: Indeed. 02:08:20.720 --> 02:08:21.930 And consider what we've done. 02:08:21.930 --> 02:08:24.120 We've essentially verbally and physically kind of 02:08:24.120 --> 02:08:27.360 defined a helper function, our own custom function if you will, 02:08:27.360 --> 02:08:32.430 whereby Brian has defined what does it mean to merge two arrays-- 02:08:32.430 --> 02:08:35.080 specifically merge two sorted arrays. 02:08:35.080 --> 02:08:35.640 Because why? 02:08:35.640 --> 02:08:37.050 Well, that's a building block that I think 02:08:37.050 --> 02:08:38.920 we're going to want in this merge sort algorithm. 02:08:38.920 --> 02:08:40.795 So just like in actual C code, you might have 02:08:40.795 --> 02:08:43.230 defined a function that does some small task, 02:08:43.230 --> 02:08:46.440 so have we now verbally and physically defined the notion of merging. 02:08:46.440 --> 02:08:49.620 The mind bending part here is that "Sort left 02:08:49.620 --> 02:08:52.770 half of numbers" and "Sort right half of numbers" 02:08:52.770 --> 02:08:54.690 is kind of already implemented. 02:08:54.690 --> 02:08:58.170 There's nothing more for Brian or me to define. 02:08:58.170 --> 02:09:01.890 All that remains is for us to execute this algorithm, focusing especially 02:09:01.890 --> 02:09:04.300 on these three highlighted lines of code. 02:09:04.300 --> 02:09:08.220 And let me disclaim that of the algorithms we've looked at thus far, 02:09:08.220 --> 02:09:10.290 odds are this will be the one that doesn't really 02:09:10.290 --> 02:09:11.887 sink in as quickly as the others. 02:09:11.887 --> 02:09:14.220 Even if the others might have taken you a moment, a day, 02:09:14.220 --> 02:09:16.887 a week to settle in-- or maybe you're still not quite there yet, 02:09:16.887 --> 02:09:17.520 that's fine-- 02:09:17.520 --> 02:09:20.610 merge sort is a bit of a mind bending one, 02:09:20.610 --> 02:09:23.130 because it seems to work magically. 02:09:23.130 --> 02:09:25.387 But it really just works more intelligently. 02:09:25.387 --> 02:09:27.720 And you'll begin to get more comfortable with harnessing 02:09:27.720 --> 02:09:31.190 these kinds of primitives so that we can ultimately, indeed, solve problems 02:09:31.190 --> 02:09:31.960 more efficiently. 02:09:31.960 --> 02:09:35.130 So Brian has kindly put the numbers again on the top shelf. 02:09:35.130 --> 02:09:37.740 And he has put them into their original, unsorted order, 02:09:37.740 --> 02:09:40.080 just like for selection sort and bubble sort. 02:09:40.080 --> 02:09:44.050 And Brian, I'd like to propose now that we execute this merge sort algorithm. 02:09:44.050 --> 02:09:47.500 And if you don't mind, I'll recite aloud first the few steps. 02:09:47.500 --> 02:09:51.690 So here is one array of size 8 with unsorted numbers. 02:09:51.690 --> 02:09:54.150 The goal is to these numbers using merge sort. 02:09:54.150 --> 02:09:57.520 And recall that merge sort essentially is just three steps-- 02:09:57.520 --> 02:10:00.330 sort left half, sort right half, merge sorted halves. 02:10:00.330 --> 02:10:02.280 So Brian, looking at those numbers there, 02:10:02.280 --> 02:10:04.860 could you go ahead and sort the left half of numbers? 02:10:04.860 --> 02:10:05.100 BRIAN: All right. 02:10:05.100 --> 02:10:06.225 So there are eight numbers. 02:10:06.225 --> 02:10:09.810 The left half would be these four numbers, so I will sort those. 02:10:09.810 --> 02:10:13.020 Except I'm not really sure how do I now sort these four numbers. 02:10:13.020 --> 02:10:13.770 DAVID MALAN: Yeah. 02:10:13.770 --> 02:10:16.353 So granted, we've seen selection sort, we've seen bubble sort. 02:10:16.353 --> 02:10:19.530 But we don't want to regress to those older, slower algorithms. 02:10:19.530 --> 02:10:22.110 Brian, I can kind of be a little clever here. 02:10:22.110 --> 02:10:24.220 Well, I'm giving you a sorting algorithm. 02:10:24.220 --> 02:10:28.140 So now you effectively have a smaller problem, an array of size 4, 02:10:28.140 --> 02:10:30.870 and I'm pretty sure we can use the same algorithm, merge sort, 02:10:30.870 --> 02:10:32.670 by sorting left half, sorting right half, 02:10:32.670 --> 02:10:34.520 and then merging the sorted halves. 02:10:34.520 --> 02:10:38.232 So could you go ahead and sort the left half of these four numbers? 02:10:38.232 --> 02:10:38.940 BRIAN: All right. 02:10:38.940 --> 02:10:40.020 So I have these four numbers. 02:10:40.020 --> 02:10:41.228 I want to sort the left half. 02:10:41.228 --> 02:10:42.893 That's these two numbers. 02:10:42.893 --> 02:10:45.060 So now I need to figure out how to sort two numbers. 02:10:45.060 --> 02:10:45.390 DAVID MALAN: All right. 02:10:45.390 --> 02:10:48.490 Now, us with human intuition might obviously know what we have to do here. 02:10:48.490 --> 02:10:51.360 But again, let's apply the algorithm-- sort left half, sort right half, 02:10:51.360 --> 02:10:52.170 merge sorted half. 02:10:52.170 --> 02:10:55.257 Brian, could you sort the right half of this array of size 2? 02:10:55.257 --> 02:10:57.340 BRIAN: So I've got the array of two, so I'll first 02:10:57.340 --> 02:11:00.150 sort the left half of the array of two, which is the 6. 02:11:00.150 --> 02:11:03.120 DAVID MALAN: And this is where the base case in white on the slide 02:11:03.120 --> 02:11:04.260 comes into play-- 02:11:04.260 --> 02:11:06.180 if only one number, quit. 02:11:06.180 --> 02:11:07.830 So Brian, I can let you off the hook. 02:11:07.830 --> 02:11:12.220 That list of size one with the number 6 is sorted. 02:11:12.220 --> 02:11:13.650 So that's step one of three done. 02:11:13.650 --> 02:11:16.950 Brian, could you sort the right half of that array of size two? 02:11:16.950 --> 02:11:18.620 BRIAN: The right half is the number 3. 02:11:18.620 --> 02:11:20.770 It's also just one number, so that one is done. 02:11:20.770 --> 02:11:21.520 DAVID MALAN: Good. 02:11:21.520 --> 02:11:22.890 So think about where we are on the story. 02:11:22.890 --> 02:11:25.710 We've sorted the left half, and we've started the right half, 02:11:25.710 --> 02:11:29.030 even though it looks like neither Brian nor I have done any useful work yet. 02:11:29.030 --> 02:11:30.270 But now the magic happens. 02:11:30.270 --> 02:11:34.290 Brian, you now have two arrays of size 1. 02:11:34.290 --> 02:11:36.222 Could you merge them together? 02:11:36.222 --> 02:11:36.930 BRIAN: All right. 02:11:36.930 --> 02:11:38.670 So I'm going to merge these two together. 02:11:38.670 --> 02:11:40.530 Between the 6 and the 3, the 3 is smaller. 02:11:40.530 --> 02:11:42.300 So that one I'll put there first. 02:11:42.300 --> 02:11:44.700 And then I'll take the 6, and that one goes next. 02:11:44.700 --> 02:11:47.892 And now I have a sorted array of size 2 that is now done. 02:11:47.892 --> 02:11:48.850 DAVID MALAN: All right. 02:11:48.850 --> 02:11:51.725 And this is where you now need to start remembering step by step sort 02:11:51.725 --> 02:11:53.850 of in your brain as the things pile up. 02:11:53.850 --> 02:11:55.180 How did we get to this point? 02:11:55.180 --> 02:11:57.120 We started with a list of size 8. 02:11:57.120 --> 02:12:00.360 We then looked at the left half, which was an array of size 4. 02:12:00.360 --> 02:12:03.630 We then looked at the left half of that, which was an array of size 2, 02:12:03.630 --> 02:12:06.990 then two arrays of size 1, then we merged those two sorted halves. 02:12:06.990 --> 02:12:09.600 So I think now if I rewind in that story, Brian, 02:12:09.600 --> 02:12:14.262 you need to sort the right half of the left half of the original numbers. 02:12:14.262 --> 02:12:14.970 BRIAN: All right. 02:12:14.970 --> 02:12:16.770 So the left half is these four. 02:12:16.770 --> 02:12:20.430 The right half of the left half is going to be these two numbers. 02:12:20.430 --> 02:12:23.790 And so now to those two, I guess I would repeat the process again-- 02:12:23.790 --> 02:12:25.240 look at the numbers individually. 02:12:25.240 --> 02:12:27.825 I would look at the left half of these two, which is the 8. 02:12:27.825 --> 02:12:29.250 That one is done. 02:12:29.250 --> 02:12:31.180 And the 5, that one is done as well. 02:12:31.180 --> 02:12:32.138 DAVID MALAN: All right. 02:12:32.138 --> 02:12:35.262 So step three of three, then, is merge those two sorted halves. 02:12:35.262 --> 02:12:35.970 BRIAN: All right. 02:12:35.970 --> 02:12:39.830 So between the 8 and the 5, the 5 is smaller, so that one will go in first. 02:12:39.830 --> 02:12:41.460 And the 8 will go after that. 02:12:41.460 --> 02:12:45.157 And now I have a second array of size 2 that is also now sorted. 02:12:45.157 --> 02:12:45.990 DAVID MALAN: Indeed. 02:12:45.990 --> 02:12:49.750 So here's where, again, you have to rewind in your mind's eye. 02:12:49.750 --> 02:12:53.940 We've just now sorted the left half, and we've 02:12:53.940 --> 02:12:58.240 sorted the left half and the right half of the left half. 02:12:58.240 --> 02:13:02.700 So I think the third and final step at this part of the story is, Brian, 02:13:02.700 --> 02:13:07.692 to merge those sorted halves, each of which now is of size 2. 02:13:07.692 --> 02:13:08.400 BRIAN: All right. 02:13:08.400 --> 02:13:12.008 I have two arrays of size 2, each of which is sorted, that I need to merge. 02:13:12.008 --> 02:13:14.300 So I'm going to compare the smallest numbers from each. 02:13:14.300 --> 02:13:16.110 I'm going to compare the 3 and the 5. 02:13:16.110 --> 02:13:18.570 The 3 is smaller, so that one will go in first. 02:13:18.570 --> 02:13:21.960 Now between these two arrays, I have a 6 and a 5 to compare. 02:13:21.960 --> 02:13:24.210 The 5 is smaller, so that one will go next. 02:13:24.210 --> 02:13:26.830 Between the 6 and the 8, the 6 is smaller. 02:13:26.830 --> 02:13:28.530 And I'm left with just the 8. 02:13:28.530 --> 02:13:32.250 So if we go back to the original story of eight numbers that I was sorting, 02:13:32.250 --> 02:13:35.970 I think I have now sorted the left half of the left four numbers 02:13:35.970 --> 02:13:37.230 from that original array. 02:13:37.230 --> 02:13:37.560 DAVID MALAN: Indeed. 02:13:37.560 --> 02:13:39.210 So if you're playing along at home, think 02:13:39.210 --> 02:13:41.335 about-- you've got all these thoughts probably kind 02:13:41.335 --> 02:13:42.430 of piling up in your mind. 02:13:42.430 --> 02:13:43.860 That's indeed supposed to be the case. 02:13:43.860 --> 02:13:46.152 And admittedly, it's hard to keep track of all of that. 02:13:46.152 --> 02:13:48.520 So we'll let Brian now execute this altogether 02:13:48.520 --> 02:13:52.213 together doing the same thing now, by sorting the right half all 02:13:52.213 --> 02:13:53.130 the way to completion. 02:13:53.130 --> 02:13:53.963 Brian, if you could. 02:13:53.963 --> 02:13:54.672 BRIAN: All right. 02:13:54.672 --> 02:13:56.340 So the right half, you got four numbers. 02:13:56.340 --> 02:13:59.670 I'm going to start by sorting the left half of the right half, which 02:13:59.670 --> 02:14:01.180 is these two numbers here. 02:14:01.180 --> 02:14:02.930 To do that, I'll repeat the same process-- 02:14:02.930 --> 02:14:06.240 sort the left half of these two numbers, which is just the 2. 02:14:06.240 --> 02:14:08.080 That one's done, it's only one number. 02:14:08.080 --> 02:14:09.372 Same thing with the right half. 02:14:09.372 --> 02:14:11.100 The 7 is only one number, so it's done. 02:14:11.100 --> 02:14:13.410 And now I'll merge the sorted halves together. 02:14:13.410 --> 02:14:17.160 Between the 2 and the 7, the 2 is smaller and then the 7. 02:14:17.160 --> 02:14:21.540 So here now is the left half of the right half, an array of size 2, 02:14:21.540 --> 02:14:22.500 that is sorted. 02:14:22.500 --> 02:14:25.260 And I'll do the same thing with the right half of the right half, 02:14:25.260 --> 02:14:27.120 starting with the left half, which is 4. 02:14:27.120 --> 02:14:28.090 That's done. 02:14:28.090 --> 02:14:29.200 The 1 is done. 02:14:29.200 --> 02:14:30.950 And now to merge these two together, I'll 02:14:30.950 --> 02:14:33.190 compare them and say the 1 is smaller. 02:14:33.190 --> 02:14:36.330 So I'll put the 1 down and then the 4. 02:14:36.330 --> 02:14:39.870 So now I have two sorted arrays, each of size 2, 02:14:39.870 --> 02:14:42.360 that I now need to backtrack and now merge together 02:14:42.360 --> 02:14:44.460 to form an array of size 4. 02:14:44.460 --> 02:14:45.930 So I'll compare the 2 and the 1. 02:14:45.930 --> 02:14:48.000 Between those two, the 1 is smaller. 02:14:48.000 --> 02:14:49.980 Then I'll compare the 2 with the 4. 02:14:49.980 --> 02:14:51.570 The 2 is smaller. 02:14:51.570 --> 02:14:53.190 Then I'll compare the 7 with the 4. 02:14:53.190 --> 02:14:54.480 The 4 is smaller. 02:14:54.480 --> 02:14:57.210 And then finally, I'll just take the 7, the last number, 02:14:57.210 --> 02:14:59.010 and put that in the final spot. 02:14:59.010 --> 02:15:01.740 And so now from the original array of eight numbers, 02:15:01.740 --> 02:15:05.520 I've now sorted the left half, and I've sorted the right half. 02:15:05.520 --> 02:15:09.030 DAVID MALAN: And now that brings us to our third and very final step. 02:15:09.030 --> 02:15:11.950 Could you, Brian, merge the sorted halves? 02:15:11.950 --> 02:15:12.450 BRIAN: Yeah. 02:15:12.450 --> 02:15:14.520 And I think this is actually an example we've seen already. 02:15:14.520 --> 02:15:16.728 And what I'm going to do in order to these two halves 02:15:16.728 --> 02:15:19.080 is just take the smaller number from each half 02:15:19.080 --> 02:15:20.640 and compare them again and again. 02:15:20.640 --> 02:15:24.240 So between the 3 and the 1, the 1, that's the smallest number. 02:15:24.240 --> 02:15:25.800 So that goes into place. 02:15:25.800 --> 02:15:29.070 Then between the 3 and the 2, the 2 is smaller, 02:15:29.070 --> 02:15:31.350 so we'll take that and put that into place. 02:15:31.350 --> 02:15:33.930 Now I'm comparing the 3 with the 4. 02:15:33.930 --> 02:15:36.420 The 3, that goes next. 02:15:36.420 --> 02:15:38.640 Next I'm comparing the 5 with the 4. 02:15:38.640 --> 02:15:42.450 4 is smaller, so the 4 goes into place next. 02:15:42.450 --> 02:15:44.670 Now I'm comparing the 5 with the 7. 02:15:44.670 --> 02:15:47.930 5 is smaller, so that one goes into place. 02:15:47.930 --> 02:15:51.630 And next, comparing the 6 with the 7, so the 6 is smaller. 02:15:51.630 --> 02:15:52.870 That goes next. 02:15:52.870 --> 02:15:55.320 And now I'm left with two numbers, the 8 and the 7. 02:15:55.320 --> 02:15:59.100 The 7 is the smaller of the 2, so that one goes next. 02:15:59.100 --> 02:16:02.200 And at this point, I only have one number left, which is the 8. 02:16:02.200 --> 02:16:05.100 And so that one's going to go into its sorted position 02:16:05.100 --> 02:16:06.607 at the end of the array. 02:16:06.607 --> 02:16:07.440 DAVID MALAN: Indeed. 02:16:07.440 --> 02:16:10.110 So even though it felt like we weren't really doing anything 02:16:10.110 --> 02:16:12.300 at several points in that story, it all sort of 02:16:12.300 --> 02:16:16.110 came together when we started merging and merging and merging these lists. 02:16:16.110 --> 02:16:19.650 And it's not an accident that Brian was using multiple shelves, 02:16:19.650 --> 02:16:22.320 moving the numbers from top to bottom, to make clear 02:16:22.320 --> 02:16:26.285 just how many times he was effectively dividing that list up. 02:16:26.285 --> 02:16:28.410 We started with a list of eight, and we essentially 02:16:28.410 --> 02:16:33.510 took it to two lists of size 4, four lists of size 2, eight lists of size 1. 02:16:33.510 --> 02:16:35.740 And while it wasn't exactly in that order, 02:16:35.740 --> 02:16:38.850 if you rewind and analyze all of the steps, that's indeed what he did. 02:16:38.850 --> 02:16:42.809 He went from 8 to two 4's to four 2's to eight 1's. 02:16:42.809 --> 02:16:47.040 And that's why he moved those numbers from the top shelf down three times-- 02:16:47.040 --> 02:16:50.920 from 8's, to 4's, to 2's, to 1'r. 02:16:50.920 --> 02:16:53.100 So how many times did he move the numbers? 02:16:53.100 --> 02:16:55.590 He moved them three times total. 02:16:55.590 --> 02:17:00.150 And on each of those shelves, how many numbers did he have to merge together? 02:17:00.150 --> 02:17:04.020 On each of those shelves, he ultimately touched all eight numbers. 02:17:04.020 --> 02:17:08.087 He first inserted the smallest number, then the second smallest, then 02:17:08.087 --> 02:17:08.879 the third smallest. 02:17:08.879 --> 02:17:13.709 But unlike selection sort, he had smartly already sorted those halves, 02:17:13.709 --> 02:17:16.072 so he was just plucking them off one at a time. 02:17:16.072 --> 02:17:18.030 He wasn't going back and forth, back and forth. 02:17:18.030 --> 02:17:22.469 He was constantly taking from the beginning of each of those half lists. 02:17:22.469 --> 02:17:26.309 So on every shelf, he was doing, let's say, n steps, 02:17:26.309 --> 02:17:29.790 because he was merging in all n elements of that shelf. 02:17:29.790 --> 02:17:34.379 But how many times did he merge n elements together? 02:17:34.379 --> 02:17:36.633 Well, he did that three total times. 02:17:36.633 --> 02:17:39.550 But if you think about binary search, and really the process of divide 02:17:39.550 --> 02:17:43.070 and conquer more generally, anytime you divide something in half and half 02:17:43.070 --> 02:17:47.150 and half, as he was doing from 8's to 4's to 2's to 1's. 02:17:47.150 --> 02:17:48.170 That's a logarithm. 02:17:48.170 --> 02:17:49.639 That's log base 2. 02:17:49.639 --> 02:17:52.730 And indeed, that is wonderfully the height of this shelf. 02:17:52.730 --> 02:17:56.480 If you have eight elements on the shelf, the number of additional shelves 02:17:56.480 --> 02:18:03.180 Brian used, 3, is exactly what you get by doing the math log base 2 of 8. 02:18:03.180 --> 02:18:08.059 Which is to say, Brian did n things log n times. 02:18:08.059 --> 02:18:10.520 And again with a wave of the hand, computer scientists 02:18:10.520 --> 02:18:13.129 don't bother mentioning the base with Big O notation. 02:18:13.129 --> 02:18:14.840 It suffices just to say log n-- 02:18:14.840 --> 02:18:18.410 Brian did n things log n times. 02:18:18.410 --> 02:18:22.490 And so if we consider, then, the asymptotic complexity 02:18:22.490 --> 02:18:25.430 of this algorithm, that is to say the running time of this algorithm, 02:18:25.430 --> 02:18:30.230 in terms of big O notation, notice that it performs strictly better then 02:18:30.230 --> 02:18:32.660 selection sort and bubble sort-- 02:18:32.660 --> 02:18:34.780 n times log n. 02:18:34.780 --> 02:18:37.760 And even, again, if you're a little rusty on logarithms, log n, 02:18:37.760 --> 02:18:40.040 we have seen as of week 0 in binary search, 02:18:40.040 --> 02:18:43.040 is definitely faster than n steps. 02:18:43.040 --> 02:18:45.530 So n squared is n times n. 02:18:45.530 --> 02:18:49.459 n log n is n times log n, which is indeed mathematically 02:18:49.459 --> 02:18:51.980 better then n squared. 02:18:51.980 --> 02:18:55.400 As with merge sort, though, if we consider the lower bound, 02:18:55.400 --> 02:19:00.049 notice that bubble sort, yes, got us as low as omega of n. 02:19:00.049 --> 02:19:03.750 Turns out merge sort is a little bit like selection sort 02:19:03.750 --> 02:19:07.940 in that it doesn't optimize itself and get you out of the algorithm early. 02:19:07.940 --> 02:19:13.375 It's always n log n, so it's lower bound omega of n log n. 02:19:13.375 --> 02:19:14.750 And that might not be acceptable. 02:19:14.750 --> 02:19:16.730 Sometimes you might have certain data inputs 02:19:16.730 --> 02:19:19.700 where maybe it tends to be sorted and you don't want to waste time. 02:19:19.700 --> 02:19:21.799 So maybe you'd be OK with bubble sort. 02:19:21.799 --> 02:19:24.920 But honestly, as n gets large, the probability 02:19:24.920 --> 02:19:29.540 that the input to your sorting algorithm is just by chance going to be sorted 02:19:29.540 --> 02:19:33.200 is probably so, so low that you're just better 02:19:33.200 --> 02:19:36.940 off in the general case using an algorithm like merge sort that's 02:19:36.940 --> 02:19:38.660 n log n always. 02:19:38.660 --> 02:19:41.299 We can see this visually using our bars, too. 02:19:41.299 --> 02:19:43.850 And notice, just as Brian was dividing and conquering 02:19:43.850 --> 02:19:47.150 the problem in half and half and half, and then reconstituting 02:19:47.150 --> 02:19:51.620 the array by merging those halves, you can kind of see that visually here. 02:19:51.620 --> 02:19:53.618 There's a lot more going on. 02:19:53.618 --> 02:19:56.660 And it's going to seem in a moment that everything just kind of magically 02:19:56.660 --> 02:19:57.160 worked. 02:19:57.160 --> 02:20:00.680 But you can see in the faded purple bars that, indeed, this 02:20:00.680 --> 02:20:04.400 is sorting things in halves and then merging those halves together. 02:20:04.400 --> 02:20:06.590 And this visualization was a little different. 02:20:06.590 --> 02:20:08.510 It did not have the luxury of three shelves. 02:20:08.510 --> 02:20:10.760 It just moved top to bottom, top to bottom. 02:20:10.760 --> 02:20:13.460 And honestly, Brian could have been a little more optimal there. 02:20:13.460 --> 02:20:16.250 We wanted to make clear how many total shelves there were. 02:20:16.250 --> 02:20:18.470 But honestly, there's no reason he couldn't have just 02:20:18.470 --> 02:20:21.950 moved the numbers down then back up, then back down then back up. 02:20:21.950 --> 02:20:24.440 And, indeed that's the price you pay with merge sort. 02:20:24.440 --> 02:20:27.920 Even though n log n is better than n squared, and ergo 02:20:27.920 --> 02:20:31.790 merge sort is arguably better than selection sort and bubble sort, 02:20:31.790 --> 02:20:32.960 you pay a price. 02:20:32.960 --> 02:20:35.750 And this speaks to the trade-off I mentioned earlier. 02:20:35.750 --> 02:20:39.560 Almost always, when you do something better in code 02:20:39.560 --> 02:20:42.897 or solve a problem more intelligently, you have paid a price. 02:20:42.897 --> 02:20:45.230 Maybe you spent more time as the human writing the code, 02:20:45.230 --> 02:20:47.420 because it was harder and took more sophistication. 02:20:47.420 --> 02:20:48.620 That is a cost. 02:20:48.620 --> 02:20:51.680 Maybe you had to use actually more space. 02:20:51.680 --> 02:20:56.950 Brian had to have at least one extra shelf in order to implement merge sort. 02:20:56.950 --> 02:20:59.360 If implementing merge sort in code and C, 02:20:59.360 --> 02:21:05.180 you will need at least a second array to temporarily put the numbers into as you 02:21:05.180 --> 02:21:06.800 merge things back and forth. 02:21:06.800 --> 02:21:09.740 If you want to be extravagant, you can have three separate arrays 02:21:09.740 --> 02:21:11.090 or four separate arrays. 02:21:11.090 --> 02:21:14.240 But it's suffices, per the graphical representation of merge sort, 02:21:14.240 --> 02:21:16.103 to just use a second array. 02:21:16.103 --> 02:21:18.020 Now, that might not seem like such a big deal. 02:21:18.020 --> 02:21:21.710 But implicitly, you need twice as much space. 02:21:21.710 --> 02:21:23.270 And that might be a big deal. 02:21:23.270 --> 02:21:27.590 If you've got a million things to sort, and you now need two arrays, 02:21:27.590 --> 02:21:30.530 that's 2 million chunks of memory that you need. 02:21:30.530 --> 02:21:32.052 And maybe that's not tenable. 02:21:32.052 --> 02:21:34.010 So there, too, there's going to be a trade-off. 02:21:34.010 --> 02:21:36.320 And maybe while slower, selection sort of bubble sort, maybe 02:21:36.320 --> 02:21:38.820 it's better because it's a little more efficient with space. 02:21:38.820 --> 02:21:41.120 It's going to depend on what you care about 02:21:41.120 --> 02:21:42.870 and what you want to optimize for. 02:21:42.870 --> 02:21:44.750 And honestly, money is sometimes a factor. 02:21:44.750 --> 02:21:48.500 In the real world, maybe it's better to write slightly slower code 02:21:48.500 --> 02:21:52.130 so that you don't have to buy twice as many servers or twice as much memory 02:21:52.130 --> 02:21:53.210 for your computer. 02:21:53.210 --> 02:21:55.730 It depends there on what resource is more important-- 02:21:55.730 --> 02:22:00.770 your time, the computer's time, your wallet, or some other resource 02:22:00.770 --> 02:22:01.340 altogether. 02:22:01.340 --> 02:22:03.590 So we'll continue to see these kinds of trade-offs. 02:22:03.590 --> 02:22:07.500 But perhaps the most mind blowing thing we can do as we wrap up here 02:22:07.500 --> 02:22:12.920 is share a few visualizations of how these algorithms actually compare. 02:22:12.920 --> 02:22:17.900 And one last piece of jargon is this one final Greek symbol, theta. 02:22:17.900 --> 02:22:21.920 It turns out that, thanks to selection sort and merge sort, 02:22:21.920 --> 02:22:26.570 we can actually apply one more term of art here, this theta notation. 02:22:26.570 --> 02:22:30.140 Anytime an algorithm has both the same upper bound 02:22:30.140 --> 02:22:33.050 as its lower bound running time, you can actually 02:22:33.050 --> 02:22:37.980 describe it in just one sentence instead of two in terms of theta notation. 02:22:37.980 --> 02:22:41.375 So because selection sort was in both big O of n squared and omega 02:22:41.375 --> 02:22:44.870 of n squared, you can actually just say, ah, it's in theta of n squared. 02:22:44.870 --> 02:22:49.010 It's always n squared either in the upper bound or in the lower bound. 02:22:49.010 --> 02:22:50.210 Same thing for merge sort. 02:22:50.210 --> 02:22:52.310 It's in theta of n log n. 02:22:52.310 --> 02:22:57.860 We cannot use theta for bubble sort or for binary search or for linear search, 02:22:57.860 --> 02:23:01.150 because they had different upper and lower bounds. 02:23:01.150 --> 02:23:05.600 Well, let me go ahead now and prepare a final demonstration, 02:23:05.600 --> 02:23:08.090 this time using some random inputs. 02:23:08.090 --> 02:23:11.750 So you'll see here a video comparing selection sort, bubble sort, 02:23:11.750 --> 02:23:13.760 and merge sort all together. 02:23:13.760 --> 02:23:16.560 All three of them start with random data. 02:23:16.560 --> 02:23:18.830 But let's just see what it means for an algorithm 02:23:18.830 --> 02:23:26.295 to be an n squared in the worst case or in n log n in this case instead. 02:23:26.295 --> 02:23:26.920 [MUSIC PLAYING] 02:23:26.920 --> 02:23:29.295 Selection sort's on the top, bubble sort's on the bottom, 02:23:29.295 --> 02:23:30.490 merge sort's in the middle. 02:23:30.490 --> 02:23:35.352 And would you believe it, merge sort is already done. 02:23:35.352 --> 02:23:37.280 [MUSIC INTENSIFIES] 02:23:37.280 --> 02:23:40.902 And meanwhile, we have some very trendy music we can listen to, 02:23:40.902 --> 02:23:42.610 which is really just there to distract us 02:23:42.610 --> 02:23:47.060 from the fact at how slow n squared actually is in practice. 02:23:47.060 --> 02:23:49.420 And notice, there's not that many bars here. 02:23:49.420 --> 02:23:51.310 There's maybe like a hundred or so bars. 02:23:51.310 --> 02:23:52.690 Like, n is 100. 02:23:52.690 --> 02:23:53.882 That's not even a big value. 02:23:53.882 --> 02:23:56.590 When we're talking about the Twitters, the Facebooks, the Googles 02:23:56.590 --> 02:23:59.080 of the world, these are trivial sizes. 02:23:59.080 --> 02:24:03.220 And yet, my God, we're still waiting for selection sort and bubble sort 02:24:03.220 --> 02:24:04.300 to finish. 02:24:04.300 --> 02:24:08.110 And so you can see here that it really matters when you exercise a little bit 02:24:08.110 --> 02:24:10.980 more cleverness, and you leverage a more efficient algorithm-- 02:24:10.980 --> 02:24:14.495 and finally, selection sort is done, bubble sort still taking 02:24:14.495 --> 02:24:15.370 a little longer here. 02:24:15.370 --> 02:24:16.810 And this is going to depend on the input. 02:24:16.810 --> 02:24:18.610 Sometimes you can get lucky or unlucky. 02:24:18.610 --> 02:24:22.390 But I think it's convincing that merge sort has won in this case. 02:24:22.390 --> 02:24:26.940 [MUSIC PLAYING]