WEBVTT X-TIMESTAMP-MAP=LOCAL:00:00:00.000,MPEGTS:900000 00:00:00.000 --> 00:00:02.994 [MUSIC PLAYING] 00:01:01.377 --> 00:01:04.590 DAVID MALAN: All right, this is CS50. 00:01:04.590 --> 00:01:05.700 And this is week 3. 00:01:05.700 --> 00:01:09.420 And as promised, we thought we'd take a bit of a break on new syntax this week 00:01:09.420 --> 00:01:12.300 and focus a lot more on algorithms and implementation 00:01:12.300 --> 00:01:15.840 thereof because over the past few weeks, besides Scratch, we've now had C. 00:01:15.840 --> 00:01:17.850 And you have a lot of vocabulary now. 00:01:17.850 --> 00:01:20.460 Even if it might not seem yet that you fully 00:01:20.460 --> 00:01:23.790 grasp all of the functionality of this particular language, with practice, 00:01:23.790 --> 00:01:25.030 it'll get easier and easier. 00:01:25.030 --> 00:01:28.410 But today, we'd focus instead on ideas and ultimately on this, 00:01:28.410 --> 00:01:32.700 how to think algorithmically, and so to take problems in the real world 00:01:32.700 --> 00:01:35.550 and try to quantize them in a way that you can map those puzzle 00:01:35.550 --> 00:01:39.210 pieces from week 0 or all of this new syntax from weeks 1 and 2 00:01:39.210 --> 00:01:41.792 on to actually writing code to solve those problems. 00:01:41.792 --> 00:01:43.500 To contextualize this, though, we thought 00:01:43.500 --> 00:01:46.230 we'd remind you of this here chart from week zero. 00:01:46.230 --> 00:01:48.900 And recall that in week zero, we painted this picture where 00:01:48.900 --> 00:01:52.320 on the x-axis on the horizontal was the size of the problem. 00:01:52.320 --> 00:01:54.808 And the number of phone book pages increased 00:01:54.808 --> 00:01:56.100 as you went from left to right. 00:01:56.100 --> 00:01:59.200 And then on the vertical axis or y-axis, we had time to solve. 00:01:59.200 --> 00:02:01.900 So this was like how many seconds, how many page turns, 00:02:01.900 --> 00:02:05.140 how many units of measure-- whatever you're using. 00:02:05.140 --> 00:02:07.690 We might actually describe the solution there too. 00:02:07.690 --> 00:02:10.660 And the first algorithm we had in week 0 for the phone book 00:02:10.660 --> 00:02:12.350 was, like, one page at a time. 00:02:12.350 --> 00:02:14.410 So we plotted it with this 1 to 1 slope. 00:02:14.410 --> 00:02:16.990 The second algorithm, we started doing two pages 00:02:16.990 --> 00:02:19.000 at a time, which did risk a bug. 00:02:19.000 --> 00:02:20.680 I might have to double back one page. 00:02:20.680 --> 00:02:23.120 But it was still going for the most part twice as fast. 00:02:23.120 --> 00:02:27.100 So it was still a slow but sort of a 2 to 1 slope instead of 1 to 1. 00:02:27.100 --> 00:02:29.290 But the third and final algorithm, recall, 00:02:29.290 --> 00:02:31.210 was sort of fundamentally different. 00:02:31.210 --> 00:02:34.250 And it was this logarithmic curve, so to speak, 00:02:34.250 --> 00:02:38.435 whereby it kept increasing, increasing, increasing but very, very slowly. 00:02:38.435 --> 00:02:40.810 So even if you, like, doubled the size of the phone book, 00:02:40.810 --> 00:02:44.810 as by having Cambridge and Allston here in Massachusetts merge, no big deal. 00:02:44.810 --> 00:02:49.190 It was just one more page turn, not another 500 or another 1,000. 00:02:49.190 --> 00:02:54.730 So think back today on that particular idea of how we began to divide 00:02:54.730 --> 00:02:55.900 and conquer the problem. 00:02:55.900 --> 00:02:59.330 And that sort of gave us a fundamentally better advantage. 00:02:59.330 --> 00:03:02.840 So we thought we'd see if we can apply this lesson learned as follows. 00:03:02.840 --> 00:03:06.340 If I were to take, like, attendance today sort of here on stage, 00:03:06.340 --> 00:03:10.870 I could do it old school, like, 1, 2, 3, 4, 5, 6, 7, 8, and so 00:03:10.870 --> 00:03:13.058 forth, so one step at a time. 00:03:13.058 --> 00:03:14.350 I could also double the speed-- 00:03:14.350 --> 00:03:18.200 2, 4, 6, 8, 10, 12, 14, 16, and so forth. 00:03:18.200 --> 00:03:20.890 But I dare say we can learn a bit from week zero. 00:03:20.890 --> 00:03:24.040 And if you'll indulge me right in place, could everyone 00:03:24.040 --> 00:03:26.865 stand up and think of the number one. 00:03:26.865 --> 00:03:28.990 So right where you are, just stand up if you could. 00:03:32.650 --> 00:03:33.250 Stand up. 00:03:33.250 --> 00:03:34.872 And think of the number one. 00:03:34.872 --> 00:03:38.080 So at this point, hopefully, everyone's literally thinking of the number one. 00:03:38.080 --> 00:03:42.010 And the second step of this algorithm, which I claim ultimately theoretically 00:03:42.010 --> 00:03:45.490 should be much faster than either my one person at a time 00:03:45.490 --> 00:03:47.740 or two people at a time, step two is this. 00:03:47.740 --> 00:03:49.240 Pair off with someone standing. 00:03:49.240 --> 00:03:51.340 And add their number to yours. 00:03:51.340 --> 00:03:52.345 And remember the sum. 00:03:59.620 --> 00:04:03.400 Person can be in front of, behind, left, or right of you. 00:04:06.510 --> 00:04:10.950 All right most likely most everyone in the room assuming you 00:04:10.950 --> 00:04:13.620 found someone is thinking of the number two 00:04:13.620 --> 00:04:16.073 now unless you're sort of an odd person out in the row. 00:04:16.073 --> 00:04:16.740 And that's fine. 00:04:16.740 --> 00:04:18.115 If you're still one, that's fine. 00:04:18.115 --> 00:04:19.800 But most of you are probably two. 00:04:19.800 --> 00:04:23.925 Next step is that one of you in those pairs should sit down. 00:04:29.290 --> 00:04:34.210 OK, so many of you tried to sit down as quickly as possible we noticed. 00:04:34.210 --> 00:04:37.898 But so next step now, at this point, rather, most of you 00:04:37.898 --> 00:04:39.190 are thinking of the number two. 00:04:39.190 --> 00:04:41.023 A few of you are thinking of the number one. 00:04:41.023 --> 00:04:41.710 And that's OK. 00:04:41.710 --> 00:04:45.130 The next step, and notice we're about to induce a loop, so the rest of this 00:04:45.130 --> 00:04:49.195 is on you, if still standing, go back to step two. 00:04:53.584 --> 00:04:56.572 [INTERPOSING VOICES] 00:05:18.030 --> 00:05:20.490 DAVID MALAN: If still standing, notice that this is a loop. 00:05:20.490 --> 00:05:21.270 So keep going. 00:05:21.270 --> 00:05:24.914 Keep going if still standing. 00:05:24.914 --> 00:05:27.680 [INTERPOSING VOICES] 00:05:29.100 --> 00:05:30.990 DAVID MALAN: There? 00:05:30.990 --> 00:05:33.276 How about there? 00:05:33.276 --> 00:05:35.702 [INTERPOSING VOICES] 00:05:35.702 --> 00:05:36.660 DAVID MALAN: That's OK. 00:05:36.660 --> 00:05:37.570 But now keep going. 00:05:37.570 --> 00:05:38.070 Keep going. 00:05:38.070 --> 00:05:39.900 Keep pairing off, so maybe you two. 00:05:42.610 --> 00:05:45.823 All right, a few more seconds. 00:05:45.823 --> 00:05:48.709 [INTERPOSING VOICES] 00:05:49.922 --> 00:05:51.255 DAVID MALAN: So step two, still. 00:05:55.810 --> 00:05:57.670 All right, keep pairing if you're standing. 00:06:00.595 --> 00:06:03.529 [INTERPOSING VOICES] 00:06:08.717 --> 00:06:09.675 DAVID MALAN: All right. 00:06:12.420 --> 00:06:16.325 All right, so theoretically, there's only one person standing left. 00:06:16.325 --> 00:06:17.700 But clearly, that's not the case. 00:06:17.700 --> 00:06:18.450 That's fine. 00:06:18.450 --> 00:06:22.198 I will help with the pairing because some of you are far away. 00:06:22.198 --> 00:06:22.740 So let's see. 00:06:22.740 --> 00:06:24.450 What's your number here? 00:06:24.450 --> 00:06:25.680 Sorry? 00:06:25.680 --> 00:06:27.620 What's your number? 00:06:27.620 --> 00:06:28.120 Eight. 00:06:28.120 --> 00:06:29.350 OK, go ahead and sit down. 00:06:29.350 --> 00:06:30.160 How about in back? 00:06:30.160 --> 00:06:31.480 What's your number? 00:06:31.480 --> 00:06:32.270 46. 00:06:32.270 --> 00:06:32.770 Nice. 00:06:32.770 --> 00:06:34.000 OK, go ahead and sit down. 00:06:34.000 --> 00:06:35.860 Who else is standing? 00:06:35.860 --> 00:06:38.840 Over here, what's your number? 00:06:38.840 --> 00:06:40.490 You're 16? 00:06:40.490 --> 00:06:42.350 OK, so go ahead and sit down. 00:06:42.350 --> 00:06:43.820 And behind you? 00:06:43.820 --> 00:06:44.480 48. 00:06:44.480 --> 00:06:46.400 OK, go ahead and sit down. 00:06:46.400 --> 00:06:48.290 Is anyone still standing? 00:06:48.290 --> 00:06:49.400 Yeah? 00:06:49.400 --> 00:06:50.270 32. 00:06:50.270 --> 00:06:50.960 Nice. 00:06:50.960 --> 00:06:53.180 Still standing over here. 00:06:53.180 --> 00:06:54.530 43. 00:06:54.530 --> 00:06:55.140 OK, nice. 00:06:55.140 --> 00:06:55.640 Sit down. 00:06:55.640 --> 00:06:56.140 Sit down. 00:06:56.140 --> 00:06:58.220 And anyone else still standing here? 00:06:58.220 --> 00:06:59.480 22. 00:06:59.480 --> 00:07:01.160 Go ahead and sit down. 00:07:01.160 --> 00:07:03.860 Is anyone still standing and participating? 00:07:03.860 --> 00:07:07.400 Yeah, where-- oh, yeah. 00:07:07.400 --> 00:07:08.357 16. 00:07:08.357 --> 00:07:09.440 OK, go ahead and sit down. 00:07:09.440 --> 00:07:12.360 Anyone else still standing? 00:07:12.360 --> 00:07:14.250 OK, so theoretically, everyone's paired off. 00:07:14.250 --> 00:07:16.330 You were the last person standing. 00:07:16.330 --> 00:07:19.257 So when I hit Enter here, having just greased the wheels 00:07:19.257 --> 00:07:21.090 to do all of the remaining additions myself, 00:07:21.090 --> 00:07:24.810 we should have the total count of people in the room. 00:07:24.810 --> 00:07:27.330 And recognize that unlike my algorithm, which 00:07:27.330 --> 00:07:29.910 would have required pointing at each and every person 00:07:29.910 --> 00:07:33.930 or my second algorithm, which would mean pointing at every two people 00:07:33.930 --> 00:07:38.010 twice as fast, theoretically, the algorithm you all just executed, 00:07:38.010 --> 00:07:40.480 I daresay, should've been fundamentally faster. 00:07:40.480 --> 00:07:40.980 Why? 00:07:40.980 --> 00:07:43.680 Because no matter how many people in the room-- maybe, like, 00:07:43.680 --> 00:07:45.840 if there were 1,000 people in the room, there 00:07:45.840 --> 00:07:50.280 would then have been 500, just as there would be 500 pages in week zero. 00:07:50.280 --> 00:07:52.320 Then from 500, there'd be 250-- 00:07:52.320 --> 00:07:53.200 125. 00:07:53.200 --> 00:07:53.700 Why? 00:07:53.700 --> 00:07:55.783 Because on each step of the algorithm, half of you 00:07:55.783 --> 00:07:59.310 theoretically were sitting down, sitting down, sitting down, dividing 00:07:59.310 --> 00:08:00.880 and conquering that problem. 00:08:00.880 --> 00:08:05.370 So the total number of people in the room as of now according to your count 00:08:05.370 --> 00:08:08.460 is 231. 00:08:08.460 --> 00:08:11.650 As a backup, though, Carter kindly did it the old-fashioned way, 00:08:11.650 --> 00:08:12.760 one person at a time. 00:08:12.760 --> 00:08:16.310 And Carter, the actual number of people in the room is? 00:08:16.310 --> 00:08:18.070 [LAUGHS] 00:08:18.070 --> 00:08:21.220 OK, so our first real world bug to be fair. 00:08:21.220 --> 00:08:23.510 So theoretically, that should have worked. 00:08:23.510 --> 00:08:26.230 But clearly, we lost some numbers along the way, so a bug 00:08:26.230 --> 00:08:27.620 that we can fix today. 00:08:27.620 --> 00:08:31.810 But remember that really, this is just similar in spirit 00:08:31.810 --> 00:08:34.750 to that algorithm we indeed did in week zero. 00:08:34.750 --> 00:08:36.532 It's the same as the phone book example. 00:08:36.532 --> 00:08:37.990 It went off the rails in this case. 00:08:37.990 --> 00:08:41.320 But it's the same idea, ultimately dividing and conquering. 00:08:41.320 --> 00:08:44.402 And any time you have this halving, halving, halving, 00:08:44.402 --> 00:08:46.360 there's going to be a logarithm involved there, 00:08:46.360 --> 00:08:48.152 even if you're a little rusty on your math. 00:08:48.152 --> 00:08:51.220 And that's fundamentally faster than just doing something n times 00:08:51.220 --> 00:08:55.090 or even n divided by 2 times where n is the number of people in the room 00:08:55.090 --> 00:08:57.980 or in week zero, the number of pages in the phone book. 00:08:57.980 --> 00:09:01.450 So even when you're using your iPhone or Android device later today, like, 00:09:01.450 --> 00:09:04.750 if you search for a contact using autocomplete, 00:09:04.750 --> 00:09:09.070 it is that so-called divide and conquer algorithm that's 00:09:09.070 --> 00:09:10.930 finding people in your address book. 00:09:10.930 --> 00:09:13.570 It's not starting top to bottom or bottom up. 00:09:13.570 --> 00:09:15.520 It's probably going roughly to the middle 00:09:15.520 --> 00:09:19.300 and then doing the top half or the bottom half, repeating again and again. 00:09:19.300 --> 00:09:21.040 So these ideas are everywhere. 00:09:21.040 --> 00:09:23.340 And hopefully, you end up finding just the one 00:09:23.340 --> 00:09:25.090 person you're looking for or in your case, 00:09:25.090 --> 00:09:28.030 the one person last standing who should have theoretically 00:09:28.030 --> 00:09:32.350 had the count of everyone because if each of you started by representing 1, 00:09:32.350 --> 00:09:35.440 effectively handed off your number, handed off your number, 00:09:35.440 --> 00:09:38.260 handed off your number, it theoretically should have coalesced 00:09:38.260 --> 00:09:40.090 in that final person standing. 00:09:40.090 --> 00:09:43.730 So let's consider the connection now between this idea 00:09:43.730 --> 00:09:47.680 and what we introduced last week, which was this idea of very simple data 00:09:47.680 --> 00:09:52.240 structures in your computer's memory-- like, actually using this memory as 00:09:52.240 --> 00:09:54.700 though it's kind of a grid of bytes. 00:09:54.700 --> 00:09:58.190 Each one of these squares, recall, represents 1 byte or 8 bits. 00:09:58.190 --> 00:10:00.250 And we can get rid of the hardware and sort 00:10:00.250 --> 00:10:03.850 of abstract it away as just this grid of memory or this canvas. 00:10:03.850 --> 00:10:05.680 And then we introduced arrays last week. 00:10:05.680 --> 00:10:08.680 And what was the key definition of an array? 00:10:08.680 --> 00:10:11.420 How would you describe an array? 00:10:11.420 --> 00:10:14.400 What is it, anyone? 00:10:14.400 --> 00:10:15.540 What's an array? 00:10:15.540 --> 00:10:18.180 Yeah, in the middle. 00:10:18.180 --> 00:10:20.720 A collection. 00:10:20.720 --> 00:10:21.330 A collection. 00:10:21.330 --> 00:10:24.790 And I don't love data types only because in C, it tends to be the same type. 00:10:24.790 --> 00:10:25.790 So a collection of data. 00:10:25.790 --> 00:10:26.450 I do like that. 00:10:26.450 --> 00:10:28.370 But there's one other key characteristic. 00:10:28.370 --> 00:10:29.670 Do you want to be more precise? 00:10:29.670 --> 00:10:32.850 It's not just a collection-- 00:10:32.850 --> 00:10:35.680 something about where it is. 00:10:35.680 --> 00:10:36.880 Potentially strings. 00:10:36.880 --> 00:10:39.850 But strings are just an example of putting char, char, char, char. 00:10:39.850 --> 00:10:42.500 It could certainly be integers or floating point values. 00:10:42.500 --> 00:10:44.410 Another characteristic? 00:10:44.410 --> 00:10:45.220 Sorry? 00:10:45.220 --> 00:10:46.433 It's not necessarily ordered. 00:10:46.433 --> 00:10:48.100 Actually, we'll come back to that today. 00:10:48.100 --> 00:10:49.340 It could be in any order. 00:10:49.340 --> 00:10:52.090 And certainly a string isn't necessarily in sorted order. 00:10:52.090 --> 00:10:54.890 It's in whatever the word is. 00:10:54.890 --> 00:10:56.450 It's a list in concept. 00:10:56.450 --> 00:10:59.750 But there was something key about where we put things in memory. 00:10:59.750 --> 00:11:01.580 Yeah? 00:11:01.580 --> 00:11:05.420 Consecutive-- the memory is consecutive, a.k.a., contiguous. 00:11:05.420 --> 00:11:08.660 An array is important in C because, yes, it's a list of values. 00:11:08.660 --> 00:11:10.130 Yes, it's a collection of values. 00:11:10.130 --> 00:11:15.080 But the real key distinction in an array in C is that it's contiguous. 00:11:15.080 --> 00:11:19.740 The bytes are back to back to back somewhere in the computer's memory, 00:11:19.740 --> 00:11:23.930 at least for any given data type, be it an int, a float, a bigger string. 00:11:23.930 --> 00:11:27.140 All of the characters are back to back to back in the computer's memory, 00:11:27.140 --> 00:11:30.150 not spread all out, even if you have space elsewhere. 00:11:30.150 --> 00:11:35.190 So with that said, we can actually start to solve problems with that mindset. 00:11:35.190 --> 00:11:39.020 And for instance, if I kind of pare this down to just an abstract array of size 00:11:39.020 --> 00:11:42.860 1, 2, 3, 4, 5, 6, 7, for instance, suppose 00:11:42.860 --> 00:11:47.220 that there are these numbers inside of this array of memory. 00:11:47.220 --> 00:11:51.080 So here are seven integers or ints in C. I have in this case 00:11:51.080 --> 00:11:55.100 sorted them just to make the numbers pop out as obviously smallest to largest. 00:11:55.100 --> 00:11:58.070 But the catch with C is that if I were to ask you 00:11:58.070 --> 00:12:01.650 or if you were to ask the computer through code to find you the number 50, 00:12:01.650 --> 00:12:03.390 well, obviously, every human in this room 00:12:03.390 --> 00:12:07.800 just found it obviously right there because we kind of have this bird's eye 00:12:07.800 --> 00:12:10.500 view of the whole memory at once. 00:12:10.500 --> 00:12:12.990 But the computer ironically does not have 00:12:12.990 --> 00:12:15.240 that bird's eye view of its own memory. 00:12:15.240 --> 00:12:18.820 It can only look at each location one at a time. 00:12:18.820 --> 00:12:22.080 So if you really were to do this like a computer, you would kind of have to, 00:12:22.080 --> 00:12:26.040 like, shield your eye and only look at one number at a time from left 00:12:26.040 --> 00:12:28.030 to right, from right to left, or in any order 00:12:28.030 --> 00:12:31.487 in order to find is the 50 actually there. 00:12:31.487 --> 00:12:32.820 You can't just take a step back. 00:12:32.820 --> 00:12:34.540 And boom, it pops out at you. 00:12:34.540 --> 00:12:37.230 So this is kind of analogous, this array, 00:12:37.230 --> 00:12:39.690 to being like a set of gym lockers or school 00:12:39.690 --> 00:12:42.828 lockers like this where the doors are actually closed by default. 00:12:42.828 --> 00:12:43.870 The numbers are in there. 00:12:43.870 --> 00:12:46.120 But the doors are closed, which is to say the computer 00:12:46.120 --> 00:12:47.562 and we can't actually look. 00:12:47.562 --> 00:12:49.020 So we couldn't find yellow lockers. 00:12:49.020 --> 00:12:51.010 But we did find red lockers here. 00:12:51.010 --> 00:12:53.850 And so I propose that you think of these lockers on the stage 00:12:53.850 --> 00:12:57.270 here of which there are seven as well as representing an array. 00:12:57.270 --> 00:13:01.740 And just so we have some terminology, notice that I've labeled these bracket 00:13:01.740 --> 00:13:05.220 0, bracket 1, bracket 2, 3, 4, 5 and 6. 00:13:05.220 --> 00:13:07.410 And the bracket notation recalls the new syntax 00:13:07.410 --> 00:13:10.230 from last week that lets you index into-- 00:13:10.230 --> 00:13:13.450 go to a specific spot inside of an array. 00:13:13.450 --> 00:13:15.720 And notice that even though there are seven lockers, 00:13:15.720 --> 00:13:17.520 I only counted as high as six. 00:13:17.520 --> 00:13:20.850 But again, that's just a side effect of our generally counting from 0. 00:13:20.850 --> 00:13:27.330 So 0 through 6 or 0 through n minus 1 because if there are n equal 7 lockers, 00:13:27.330 --> 00:13:28.590 n minus 1 is 6. 00:13:28.590 --> 00:13:31.900 So that's the left bound and the right bound respectively. 00:13:31.900 --> 00:13:37.560 So suppose that we were to use these lockers as representing a problem, 00:13:37.560 --> 00:13:40.320 like we want to find an actual number behind these doors. 00:13:40.320 --> 00:13:42.820 So this is actually a very common problem in the real world. 00:13:42.820 --> 00:13:46.650 And you and I take for granted every day that big companies like Google 00:13:46.650 --> 00:13:49.110 and Microsoft and others, like, do this for us 00:13:49.110 --> 00:13:52.920 constantly, not to mention AI doing something similar nowadays, 00:13:52.920 --> 00:13:54.360 searching for information. 00:13:54.360 --> 00:13:56.940 And we'll focus on some basics first that 00:13:56.940 --> 00:13:59.542 will lead us to more sophisticated algorithms ultimately. 00:13:59.542 --> 00:14:01.500 But all we're going to talk about fundamentally 00:14:01.500 --> 00:14:05.160 is this same picture from week zero and from week one and from week two 00:14:05.160 --> 00:14:07.120 where here is a problem to be solved. 00:14:07.120 --> 00:14:11.295 So if for instance, the input to this problem is an array of numbers-- 00:14:11.295 --> 00:14:12.420 an array of seven numbers-- 00:14:12.420 --> 00:14:15.870 I can't see them all at once-- but I'm looking for something like the number 00:14:15.870 --> 00:14:16.560 50-- 00:14:16.560 --> 00:14:18.540 ultimately, I want to get back out. 00:14:18.540 --> 00:14:20.280 I claim true or false. 00:14:20.280 --> 00:14:21.900 Like, the number 50 is there. 00:14:21.900 --> 00:14:22.740 Or it is not. 00:14:22.740 --> 00:14:25.050 That's one way of thinking about the search problem. 00:14:25.050 --> 00:14:27.390 Find me some piece of data if it's there. 00:14:27.390 --> 00:14:31.060 Otherwise, tell me that it's not-- true or false, respectively. 00:14:31.060 --> 00:14:33.772 So the algorithm inside of this black box, 00:14:33.772 --> 00:14:36.480 though, is where we're going to actually have to do some thinking 00:14:36.480 --> 00:14:41.097 and actually figure out how best to find the number or the data we care about. 00:14:41.097 --> 00:14:43.680 And even though we'll use numbers to keep things simple today, 00:14:43.680 --> 00:14:47.190 you could certainly generalize this to web pages or contacts 00:14:47.190 --> 00:14:51.450 or any other type of information that are in some computer or database more 00:14:51.450 --> 00:14:52.240 generally. 00:14:52.240 --> 00:14:56.970 So maybe to keep things interesting, could we 00:14:56.970 --> 00:14:59.028 get-- how about two volunteers? 00:14:59.028 --> 00:14:59.820 Wow, that was fast. 00:14:59.820 --> 00:15:02.080 OK, come on down. 00:15:02.080 --> 00:15:03.630 And how about one other volunteer? 00:15:03.630 --> 00:15:04.050 I'll go over here. 00:15:04.050 --> 00:15:04.800 OK, how about you? 00:15:04.800 --> 00:15:06.300 Come on down. 00:15:06.300 --> 00:15:08.670 Sure, round of applause for our volunteers. 00:15:08.670 --> 00:15:10.380 [APPLAUSE] 00:15:10.380 --> 00:15:13.670 OK, welcome. 00:15:13.670 --> 00:15:14.250 Come on over. 00:15:14.250 --> 00:15:15.875 Do you want to introduce yourself to the group? 00:15:15.875 --> 00:15:16.970 AUDIENCE: [INAUDIBLE]. 00:15:16.970 --> 00:15:18.760 DAVID MALAN: Like a few seconds is fine. 00:15:18.760 --> 00:15:20.155 AUDIENCE: Hi, I'm Sam. 00:15:20.155 --> 00:15:21.895 I am not a CS concentration. 00:15:21.895 --> 00:15:24.580 DAVID MALAN: [LAUGHS] So what are you? 00:15:24.580 --> 00:15:25.705 Do you know yet? 00:15:25.705 --> 00:15:26.830 AUDIENCE: Applied math. 00:15:26.830 --> 00:15:27.497 DAVID MALAN: OK. 00:15:27.497 --> 00:15:27.997 Nice. 00:15:27.997 --> 00:15:28.720 Nice to meet you. 00:15:28.720 --> 00:15:29.140 Welcome. 00:15:29.140 --> 00:15:29.640 And? 00:15:29.640 --> 00:15:30.910 AUDIENCE: Hi, I'm Louis. 00:15:30.910 --> 00:15:32.740 I'm from first year, Matthews. 00:15:32.740 --> 00:15:34.705 And I'm going to do Econ with Stats. 00:15:34.705 --> 00:15:36.690 DAVID MALAN: Oh, you're in Matthews too? 00:15:36.690 --> 00:15:37.190 OK. 00:15:37.190 --> 00:15:40.720 [CHUCKLES] I was in Matthews too, so Matthews South? 00:15:40.720 --> 00:15:41.350 Oh, wow. 00:15:41.350 --> 00:15:42.220 Oh, my god. 00:15:42.220 --> 00:15:43.840 I was room 201. 00:15:43.840 --> 00:15:44.986 AUDIENCE: 103. 00:15:44.986 --> 00:15:46.630 DAVID MALAN: Fifth floor-- 00:15:46.630 --> 00:15:48.280 all right, so anyhow. 00:15:48.280 --> 00:15:50.337 And so we have Louis and your name? 00:15:50.337 --> 00:15:50.920 AUDIENCE: Sam. 00:15:50.920 --> 00:15:51.460 DAVID MALAN: Sam. 00:15:51.460 --> 00:15:51.940 Louis and Sam. 00:15:51.940 --> 00:15:54.550 So Louis and I are going to step off to the side for just a moment because Sam, 00:15:54.550 --> 00:15:56.077 we have a problem for you to solve. 00:15:56.077 --> 00:15:57.910 This feels a little bit like Price Is Right. 00:15:57.910 --> 00:16:00.010 But behind you are these seven lockers. 00:16:00.010 --> 00:16:03.460 And we'd like you to just find us the number 50. 00:16:03.460 --> 00:16:05.050 That's all the information you get. 00:16:05.050 --> 00:16:08.205 But we'd like you to then explain how you go about finding it. 00:16:08.205 --> 00:16:10.122 AUDIENCE: Wait, are they in order or anything? 00:16:10.122 --> 00:16:12.486 Or [INAUDIBLE] it just kind of [INAUDIBLE]?? 00:16:12.486 --> 00:16:15.540 DAVID MALAN: Find us the number 50. 00:16:15.540 --> 00:16:17.370 And then tell us how you found it. 00:16:20.662 --> 00:16:22.745 OK, what was in there, just so the audience knows. 00:16:22.745 --> 00:16:23.620 AUDIENCE: [INAUDIBLE] 00:16:23.620 --> 00:16:24.850 DAVID MALAN: Yes. 00:16:24.850 --> 00:16:25.550 [LAUGHTER] 00:16:25.550 --> 00:16:28.818 AUDIENCE: It was $10, a very, very big $10 bill. 00:16:28.818 --> 00:16:29.485 DAVID MALAN: OK. 00:16:29.485 --> 00:16:30.423 AUDIENCE: Fake money. 00:16:30.423 --> 00:16:31.090 DAVID MALAN: OK. 00:16:33.925 --> 00:16:39.310 AUDIENCE: That was $100. 00:16:39.310 --> 00:16:42.100 $1. 00:16:42.100 --> 00:16:43.570 $5. 00:16:43.570 --> 00:16:45.400 OK, I'm not lucky. 00:16:45.400 --> 00:16:46.810 Oh, I found it. 00:16:46.810 --> 00:16:47.560 DAVID MALAN: Nice. 00:16:47.560 --> 00:16:49.180 Take it out and so people can believe. 00:16:49.180 --> 00:16:50.530 All right, wonderful. 00:16:50.530 --> 00:16:52.390 So you found the 50. 00:16:52.390 --> 00:16:53.260 [APPLAUSE] 00:16:53.260 --> 00:16:55.305 And now if you could explain. 00:16:55.305 --> 00:16:55.930 I'll take that. 00:16:55.930 --> 00:16:58.060 If you could explain, what was your algorithm? 00:16:58.060 --> 00:17:00.810 What was the step-by-step approach you took? 00:17:00.810 --> 00:17:03.490 AUDIENCE: I didn't really have one. 00:17:03.490 --> 00:17:04.720 I just started on one end. 00:17:04.720 --> 00:17:07.450 And I went down because it wasn't in order or anything. 00:17:07.450 --> 00:17:09.400 So I just kept going until I found it. 00:17:09.400 --> 00:17:10.720 DAVID MALAN: OK, so that's actually pretty fair. 00:17:10.720 --> 00:17:12.262 And in fact, let's step forward here. 00:17:12.262 --> 00:17:14.829 Carter's going to very secretly kind of shuffle the numbers 00:17:14.829 --> 00:17:16.880 in a particular new arrangement here. 00:17:16.880 --> 00:17:19.390 And so you really went from right to left. 00:17:19.390 --> 00:17:22.960 And I dare say maybe going from left to right might be equivalent. 00:17:22.960 --> 00:17:25.750 But could she have done better? 00:17:25.750 --> 00:17:29.590 Could she have done better, because that took 1, 2, 3, 4, 5 steps? 00:17:29.590 --> 00:17:31.970 Could Sam have done better? 00:17:31.970 --> 00:17:32.470 Sure. 00:17:32.470 --> 00:17:35.270 How? 00:17:35.270 --> 00:17:36.527 OK, so you got lucky. 00:17:36.527 --> 00:17:38.360 You could have found the number in one step. 00:17:38.360 --> 00:17:40.400 Although, luck isn't really an algorithm. 00:17:40.400 --> 00:17:42.780 It really isn't a step-by-step approach. 00:17:42.780 --> 00:17:45.053 So another thought? 00:17:45.053 --> 00:17:46.890 AUDIENCE: [INAUDIBLE] dollar bills-- 00:17:46.890 --> 00:17:49.656 sort them in order? 00:17:49.656 --> 00:17:51.270 DAVID MALAN: Oh, interesting. 00:17:51.270 --> 00:17:54.492 So you could have taken out all of the dollar bills, sorted them, 00:17:54.492 --> 00:17:57.200 put them back in, and then you probably could have done something 00:17:57.200 --> 00:17:59.315 like the divide and conquer approach. 00:17:59.315 --> 00:18:00.520 AUDIENCE: I didn't know I was allowed to do that. 00:18:00.520 --> 00:18:01.400 DAVID MALAN: No, you weren't allowed to. 00:18:01.400 --> 00:18:02.360 So that's fine. 00:18:02.360 --> 00:18:02.810 AUDIENCE: [INAUDIBLE] 00:18:02.810 --> 00:18:04.852 DAVID MALAN: But that would be a valid algorithm. 00:18:04.852 --> 00:18:07.970 Although, it sounds very inefficient to do all this work just 00:18:07.970 --> 00:18:10.128 to find the number 50 because in doing that work, 00:18:10.128 --> 00:18:11.420 she would've found the 50 once. 00:18:11.420 --> 00:18:13.420 But that might actually be a reasonable solution 00:18:13.420 --> 00:18:17.330 if she plans to search again and again and again, not just for seven numbers 00:18:17.330 --> 00:18:18.560 but maybe a lot of numbers. 00:18:18.560 --> 00:18:21.690 Maybe you do want to incur some cost up front and all the data 00:18:21.690 --> 00:18:22.910 so as to find it faster. 00:18:22.910 --> 00:18:24.320 But let's do this a little more methodically. 00:18:24.320 --> 00:18:25.862 We'll step off to the side once more. 00:18:25.862 --> 00:18:29.120 And just by convention, if you want to go ahead and-- do we need this? 00:18:29.120 --> 00:18:29.660 OK. 00:18:29.660 --> 00:18:30.368 AUDIENCE: Got it. 00:18:30.368 --> 00:18:32.390 DAVID MALAN: OK, let's go from left to right. 00:18:32.390 --> 00:18:32.900 Yep. 00:18:32.900 --> 00:18:34.858 And go ahead and show everyone the numbers just 00:18:34.858 --> 00:18:37.430 to prove that there's no funny business here. 00:18:37.430 --> 00:18:38.848 AUDIENCE: 20. 00:18:38.848 --> 00:18:41.596 DAVID MALAN: [LAUGHS] 00:18:41.596 --> 00:18:42.960 AUDIENCE: Oh, OK. 00:18:42.960 --> 00:18:44.010 500. 00:18:44.010 --> 00:18:44.760 DAVID MALAN: Nice. 00:18:48.191 --> 00:18:49.340 AUDIENCE: 10. 00:18:49.340 --> 00:18:50.290 DAVID MALAN: Mm-hm. 00:18:50.290 --> 00:18:54.212 So it doesn't sound like it's sorted, either, this time. 00:18:54.212 --> 00:18:54.945 AUDIENCE: Five. 00:18:54.945 --> 00:18:58.040 DAVID MALAN: Nice. 00:18:58.040 --> 00:18:59.826 And? 00:18:59.826 --> 00:19:00.940 AUDIENCE: 100. 00:19:00.940 --> 00:19:02.280 DAVID MALAN: Oh, so close. 00:19:05.090 --> 00:19:06.057 AUDIENCE: [INAUDIBLE] 00:19:06.057 --> 00:19:06.890 DAVID MALAN: I know. 00:19:06.890 --> 00:19:07.730 Now we're just messing with you. 00:19:07.730 --> 00:19:08.230 AUDIENCE: 1. 00:19:08.230 --> 00:19:10.904 DAVID MALAN: OK, and lastly? 00:19:10.904 --> 00:19:12.380 AUDIENCE: [GASPS] 50. 00:19:12.380 --> 00:19:13.670 DAVID MALAN: Very well done. 00:19:13.670 --> 00:19:15.360 OK, so nicely done. 00:19:15.360 --> 00:19:15.860 [APPLAUSE] 00:19:15.860 --> 00:19:17.870 And so thank you. 00:19:17.870 --> 00:19:20.990 So this is a say whether she goes from left to right or right to left, 00:19:20.990 --> 00:19:24.080 like, the performance of that algorithm just going linearly 00:19:24.080 --> 00:19:27.170 from side to side really depends on where the number ends up being. 00:19:27.170 --> 00:19:28.760 And it kind of does boil down to luck. 00:19:28.760 --> 00:19:31.550 And so that was kind of the best you could do because I dare say, 00:19:31.550 --> 00:19:34.772 had you gone linearly from right to left-- and go ahead to reset-- 00:19:34.772 --> 00:19:37.730 had you gone from right to left, that time you would have gotten lucky. 00:19:37.730 --> 00:19:40.080 So on average, it might just kind of work out. 00:19:40.080 --> 00:19:43.070 And so half the time, it takes you maybe half as many steps-- 00:19:43.070 --> 00:19:45.087 half the number of lockers to find it on average 00:19:45.087 --> 00:19:46.670 because sometimes it's in the one end. 00:19:46.670 --> 00:19:47.840 Sometimes it's on the other end. 00:19:47.840 --> 00:19:49.620 Sometimes it's smack dab in the middle. 00:19:49.620 --> 00:19:52.490 But we're going to give you the option now, 00:19:52.490 --> 00:19:54.523 Louis, of knowing that it's sorted. 00:19:54.523 --> 00:19:56.690 So we're going to take away the microphone from you. 00:19:56.690 --> 00:19:58.400 But stay on up here with us. 00:19:58.400 --> 00:20:00.710 And you are now given the assumption that the numbers 00:20:00.710 --> 00:20:03.860 are this time sorted from smallest to largest, left to right. 00:20:03.860 --> 00:20:08.570 And might you want to take more of a divide and conquer approach here? 00:20:08.570 --> 00:20:09.656 Wait a minute. 00:20:09.656 --> 00:20:10.934 AUDIENCE: [LAUGHS] 00:20:10.934 --> 00:20:14.450 DAVID MALAN: What might you do as your algorithm, Louis? 00:20:14.450 --> 00:20:16.700 AUDIENCE: Well, I think I know all the numbers, right? 00:20:16.700 --> 00:20:17.440 It's 1, 5-- 00:20:17.440 --> 00:20:18.482 DAVID MALAN: Oh, damn it. 00:20:18.482 --> 00:20:19.610 AUDIENCE: 10, 20, 50. 00:20:19.610 --> 00:20:21.630 DAVID MALAN: OK, so Louis has some memory, as we'll say. 00:20:21.630 --> 00:20:22.130 OK. 00:20:22.130 --> 00:20:24.320 AUDIENCE: But assuming if I didn't have memory. 00:20:24.320 --> 00:20:25.310 DAVID MALAN: OK, assuming if you didn't have 00:20:25.310 --> 00:20:26.480 memory, where would you start first? 00:20:26.480 --> 00:20:27.920 AUDIENCE: I would probably start in the middle. 00:20:27.920 --> 00:20:28.490 DAVID MALAN: OK, go ahead. 00:20:28.490 --> 00:20:29.220 Go in the middle. 00:20:29.220 --> 00:20:31.505 And what do you see? 00:20:31.505 --> 00:20:32.283 AUDIENCE: 20. 00:20:32.283 --> 00:20:32.950 DAVID MALAN: 20. 00:20:32.950 --> 00:20:34.505 OK, what does that mean now for you? 00:20:34.505 --> 00:20:37.630 AUDIENCE: So now that means that since4 I know there's some numbers bigger, 00:20:37.630 --> 00:20:38.390 I'll go to the right. 00:20:38.390 --> 00:20:38.500 DAVID MALAN: Good. 00:20:38.500 --> 00:20:40.720 So you can tear the problem in half, so to speak. 00:20:40.720 --> 00:20:43.690 As per week zero, skip all of the lockers on the left. 00:20:43.690 --> 00:20:45.460 And now go where relative to these three? 00:20:45.460 --> 00:20:46.540 AUDIENCE: Relative to the middle. 00:20:46.540 --> 00:20:46.720 DAVID MALAN: OK. 00:20:46.720 --> 00:20:48.800 AUDIENCE: So maybe [? you ?] don't know what's on the right-hand side. 00:20:48.800 --> 00:20:49.765 And so now it's 100. 00:20:49.765 --> 00:20:51.070 DAVID MALAN: OK, nice. 00:20:51.070 --> 00:20:54.250 AUDIENCE: So the 50 must be between 20 and 100. 00:20:54.250 --> 00:20:56.012 So it must be this one. 00:20:56.012 --> 00:20:56.845 DAVID MALAN: Indeed. 00:20:56.845 --> 00:21:00.342 So a round of applause for Louis for getting this right this time. 00:21:00.342 --> 00:21:01.790 [APPLAUSE] 00:21:01.790 --> 00:21:05.008 We have some lovely parting gifts, since we're using monopoly money. 00:21:05.008 --> 00:21:06.050 So it's not actual money. 00:21:06.050 --> 00:21:10.610 But this is the Cambridge edition Harvard Square and all for both of you. 00:21:10.610 --> 00:21:11.390 So welcome. 00:21:11.390 --> 00:21:12.986 Thank you so much. 00:21:12.986 --> 00:21:17.140 [APPLAUSE] 00:21:17.140 --> 00:21:19.630 So Louis was actually getting particularly clever there 00:21:19.630 --> 00:21:21.610 when his first instinct was to just remember 00:21:21.610 --> 00:21:23.800 what were all the numbers and then sort of deduce 00:21:23.800 --> 00:21:25.600 where the 50 must obviously be. 00:21:25.600 --> 00:21:26.440 So that's on us. 00:21:26.440 --> 00:21:29.050 Like, ideally, we would have completely changed the dollar amount so 00:21:29.050 --> 00:21:30.633 that he couldn't use that information. 00:21:30.633 --> 00:21:34.540 But it turns out Louie's instinct there to figure out where the number should 00:21:34.540 --> 00:21:35.980 be and jump right there-- 00:21:35.980 --> 00:21:39.160 index into that exact location-- is actually a technique. 00:21:39.160 --> 00:21:42.760 And it's a technique we'll talk about in future classes 00:21:42.760 --> 00:21:45.280 where you actually do take into account the information 00:21:45.280 --> 00:21:46.940 and go right where you want. 00:21:46.940 --> 00:21:50.230 It's an example of what we'll eventually call hashing, so to speak, 00:21:50.230 --> 00:21:51.940 in a concept called hash tables. 00:21:51.940 --> 00:21:54.520 But for now, let's try to formalize the algorithms 00:21:54.520 --> 00:21:58.240 that both of these volunteers kind of intuitively 00:21:58.240 --> 00:22:00.070 came up with, first and second. 00:22:00.070 --> 00:22:01.580 And we'll slap some names on them. 00:22:01.580 --> 00:22:05.140 So the first, and I've hinted at this, is what we would call linear search. 00:22:05.140 --> 00:22:08.590 Anytime you search from left to right or from right to left, 00:22:08.590 --> 00:22:10.190 it's generally called linear search. 00:22:10.190 --> 00:22:10.690 Why? 00:22:10.690 --> 00:22:13.482 Because you're kind of walking in a line, no matter which direction 00:22:13.482 --> 00:22:14.080 you're going. 00:22:14.080 --> 00:22:17.390 But now for today's purposes, let's see if we can't truly 00:22:17.390 --> 00:22:21.440 formalize what our volunteers' algorithms were by translating them, 00:22:21.440 --> 00:22:24.320 not necessarily to code yet, but pseudocode. 00:22:24.320 --> 00:22:28.190 See if we can't map it to English-like syntax that gets the ideas across. 00:22:28.190 --> 00:22:32.220 So I dare say, the first algorithm, even though she went from right to left, 00:22:32.220 --> 00:22:34.460 then from left to right, might look like this. 00:22:34.460 --> 00:22:39.800 For each door from left to right, she checked if 50 is behind the door 00:22:39.800 --> 00:22:40.850 as by looking at it. 00:22:40.850 --> 00:22:44.417 If it was behind the door, then she returned true. 00:22:44.417 --> 00:22:45.500 Like, yes, this is the 50. 00:22:45.500 --> 00:22:47.583 That didn't happen on the first iteration, though. 00:22:47.583 --> 00:22:51.750 So she moved on again and again. 00:22:51.750 --> 00:22:53.660 And now notice the indentation here is just 00:22:53.660 --> 00:22:55.670 as important as it was in week zero. 00:22:55.670 --> 00:22:59.180 Notice that only at the very bottom of this algorithm 00:22:59.180 --> 00:23:00.620 do I propose returning false. 00:23:00.620 --> 00:23:03.650 But it's not indented inside of this pseudocode. 00:23:03.650 --> 00:23:04.160 Why? 00:23:04.160 --> 00:23:06.530 Well, because if I had changed it to be this, 00:23:06.530 --> 00:23:10.740 what would be the logical bug in this version of that algorithm? 00:23:10.740 --> 00:23:11.240 Yeah? 00:23:17.040 --> 00:23:17.730 Exactly. 00:23:17.730 --> 00:23:22.020 If she had opened the first door, found it to be the wrong number if it says 00:23:22.020 --> 00:23:24.330 else-- if it's not behind the door, then return false-- 00:23:24.330 --> 00:23:26.760 that would erroneously conclude the 50's not there, 00:23:26.760 --> 00:23:29.350 even though it could certainly be among those other doors. 00:23:29.350 --> 00:23:32.370 So this first version of the code where the return false is 00:23:32.370 --> 00:23:35.670 sort of left indented, so to speak, and the very last thing 00:23:35.670 --> 00:23:39.690 you do if you don't previously return true, that just 00:23:39.690 --> 00:23:42.360 makes sure that we're handling all possible cases. 00:23:42.360 --> 00:23:45.060 But let's make this maybe a little more technical. 00:23:45.060 --> 00:23:47.580 This is how a computer scientist or a programmer 00:23:47.580 --> 00:23:49.970 would likely express this instead. 00:23:49.970 --> 00:23:52.260 Instead of just drawing it in broad strokes, 00:23:52.260 --> 00:23:56.100 it's actually fine to kind of steal some of the syntax from languages like C 00:23:56.100 --> 00:24:01.530 and actually use some of the indices or indexes like 0, 1, 2, 3, 4, 5, 6, 00:24:01.530 --> 00:24:04.360 to represent the pieces of data we care about. 00:24:04.360 --> 00:24:06.150 So this is a little more precise. 00:24:06.150 --> 00:24:12.030 For i-- like a variable i-- from the value 0 to n minus 1, 00:24:12.030 --> 00:24:16.390 so in the case of seven doors, this is like saying, for i starting at 0, 00:24:16.390 --> 00:24:19.280 going up to 6, do the following. 00:24:19.280 --> 00:24:22.730 If the number 50 is behind the doors array-- 00:24:22.730 --> 00:24:27.250 so I'm using array syntax, even though this is technically still pseudocode-- 00:24:27.250 --> 00:24:32.720 if the ith location of my doors array has the number 50, return true. 00:24:32.720 --> 00:24:35.680 Otherwise, if you do that again and again and again-- 00:24:35.680 --> 00:24:39.863 n total times-- and you still don't find it, you want to return false. 00:24:39.863 --> 00:24:40.780 So we introduced this. 00:24:40.780 --> 00:24:44.590 This is just an example of how you can start to borrow ideas from actual code 00:24:44.590 --> 00:24:47.890 to paint the picture even more precisely of what 00:24:47.890 --> 00:24:50.200 it is you want a colleague to do, what it 00:24:50.200 --> 00:24:53.860 is you want your code to do, ultimately, by sort 00:24:53.860 --> 00:24:58.720 of borrowing these ideas from code and incorporating it into our pseudocode. 00:24:58.720 --> 00:25:03.430 But what about the second algorithm here, 00:25:03.430 --> 00:25:07.930 the second algorithm, whereby he took a divide and conquer approach, 00:25:07.930 --> 00:25:10.660 starting in the middle and then going right and then going left. 00:25:10.660 --> 00:25:13.077 Well, it turns out this is generally called binary search. 00:25:13.077 --> 00:25:15.040 Bi implying two, because you're either going 00:25:15.040 --> 00:25:17.395 with the left half or the right half again and again. 00:25:17.395 --> 00:25:20.020 This is literally what we've been talking about since week zero 00:25:20.020 --> 00:25:21.312 when I searched the phone book. 00:25:21.312 --> 00:25:25.450 That too was binary search, dividing and dividing and dividing in half and half. 00:25:25.450 --> 00:25:27.850 So if we were to draw some pseudocode for this, 00:25:27.850 --> 00:25:30.250 I would propose that we could do something like this. 00:25:30.250 --> 00:25:34.900 If 50 is behind the middle door, then we got lucky. 00:25:34.900 --> 00:25:36.490 Just return true. 00:25:36.490 --> 00:25:40.700 Else if the 50 is less than the value at the middle door-- 00:25:40.700 --> 00:25:42.430 so if it's smaller than the middle door-- 00:25:42.430 --> 00:25:43.810 I want to search to the left. 00:25:43.810 --> 00:25:45.580 So I can say search left half. 00:25:45.580 --> 00:25:50.330 Else if 50 is greater than the middle door, I want to search to the right. 00:25:50.330 --> 00:25:54.550 And I think that's almost everything, right? 00:25:54.550 --> 00:25:57.680 Is there a fourth possible case? 00:25:57.680 --> 00:25:58.715 What else could happen? 00:26:03.130 --> 00:26:03.670 Good. 00:26:03.670 --> 00:26:07.720 So taking into account that if there are no doors left or no doors 00:26:07.720 --> 00:26:09.830 to begin with, we better handle that case 00:26:09.830 --> 00:26:11.650 so that we don't induce one of those spinning beach balls, 00:26:11.650 --> 00:26:13.540 so that the computer doesn't freeze or crash. 00:26:13.540 --> 00:26:17.020 There's really four possible scenarios in searching for information. 00:26:17.020 --> 00:26:19.990 It's either in the middle or to the left or to the right. 00:26:19.990 --> 00:26:21.438 Or it's just not there at all. 00:26:21.438 --> 00:26:24.730 And so I sort of slip that in at the end because technically, that's a question 00:26:24.730 --> 00:26:28.390 you should ask first because if there's no doors, there's no work to be done. 00:26:28.390 --> 00:26:31.780 But logically, this is the juicy part-- the other three questions 00:26:31.780 --> 00:26:33.320 that you might ask yourself. 00:26:33.320 --> 00:26:35.860 So this then might be the pseudocode for binary search. 00:26:35.860 --> 00:26:37.398 And we could make it more technical. 00:26:37.398 --> 00:26:39.940 And this is where it kind of escalates quickly syntactically. 00:26:39.940 --> 00:26:41.770 But I'm just using the same kind of syntax. 00:26:41.770 --> 00:26:45.190 If doors in my pseudocode represents an array of doors, 00:26:45.190 --> 00:26:48.550 well, then doors bracket middle is just a pseudocode-like way 00:26:48.550 --> 00:26:51.110 of saying go to the middle door in that array. 00:26:51.110 --> 00:26:55.090 And then notice, else if 50 is less than-- that middle value-- 00:26:55.090 --> 00:26:59.620 then search doors bracket 0, so the leftmost one-- 00:26:59.620 --> 00:27:02.380 through doors middle minus 1. 00:27:02.380 --> 00:27:05.780 So you don't need to waste time researching the middle door. 00:27:05.780 --> 00:27:10.250 So I say middle minus 1 so that I scooch over slightly to the left, so to speak. 00:27:10.250 --> 00:27:13.550 Else if 50 is greater than the value at the middle door, 00:27:13.550 --> 00:27:16.700 then you want to search 1 over to the right, 00:27:16.700 --> 00:27:21.770 so middle plus 1 among those doors, through the last door, which is not n, 00:27:21.770 --> 00:27:24.710 because we start counting at 0, but n minus 1. 00:27:24.710 --> 00:27:26.460 And the rest of the algorithm is the same. 00:27:26.460 --> 00:27:27.918 This is just a little more precise. 00:27:27.918 --> 00:27:32.450 And I dare say, when writing a program in C or any language, 00:27:32.450 --> 00:27:36.200 like, honestly, starting in pseudocode like this will generally make it 00:27:36.200 --> 00:27:38.370 much easier to write the actual code. 00:27:38.370 --> 00:27:41.360 So in fact, in this and future problem sets, do get into the habit, 00:27:41.360 --> 00:27:43.610 especially if you're struggling getting started-- just 00:27:43.610 --> 00:27:45.680 write things out in English and maybe high level 00:27:45.680 --> 00:27:47.120 English a little more like this. 00:27:47.120 --> 00:27:51.710 Then as a version two, go in with your keyboard or paper, pencil. 00:27:51.710 --> 00:27:56.240 And make it a little more precise using some code-like syntax. 00:27:56.240 --> 00:27:58.430 And then I dare say in version 3, you can now 00:27:58.430 --> 00:28:01.460 translate this pretty much verbatim to C code. 00:28:01.460 --> 00:28:04.040 The only headache is going to be rounding issues 00:28:04.040 --> 00:28:06.890 with integers because if you divide an integer and you a fraction, 00:28:06.890 --> 00:28:08.810 it's going to truncate, so all that kind of headache. 00:28:08.810 --> 00:28:10.670 But you can work through that by just thinking 00:28:10.670 --> 00:28:12.440 through what's going to get truncated when 00:28:12.440 --> 00:28:15.630 you round down or up as a solution. 00:28:15.630 --> 00:28:17.870 Any questions, though, on this pseudocode 00:28:17.870 --> 00:28:23.420 for either linear or binary search as we've defined them-- 00:28:23.420 --> 00:28:26.340 linear or binary search-- 00:28:26.340 --> 00:28:26.840 no? 00:28:26.840 --> 00:28:28.798 All right, well, let's consider then a bit more 00:28:28.798 --> 00:28:33.000 formally a question that we'll come back to in the future in future classes 00:28:33.000 --> 00:28:33.500 as well. 00:28:33.500 --> 00:28:35.885 What is the running time of these algorithms? 00:28:35.885 --> 00:28:38.927 What is, that is to say, the efficiency of these algorithms? 00:28:38.927 --> 00:28:41.510 And how do we actually measure the efficiency of an algorithm? 00:28:41.510 --> 00:28:44.017 Is it with a stopwatch or with some other mechanism? 00:28:44.017 --> 00:28:46.100 Well, I propose that we think back to this picture 00:28:46.100 --> 00:28:50.660 here, whereby this, again, was representative of both the phone book 00:28:50.660 --> 00:28:54.140 example from week zero and theoretically, bug aside, 00:28:54.140 --> 00:28:57.920 the attendance counting algorithm from earlier today, whereby 00:28:57.920 --> 00:29:02.150 this same green line theoretically represents how much time it should have 00:29:02.150 --> 00:29:04.380 taken us as a group to count ourselves. 00:29:04.380 --> 00:29:04.880 Why? 00:29:04.880 --> 00:29:07.520 Because if maybe another class comes in and doubles 00:29:07.520 --> 00:29:11.280 the size of the number of humans in this room, no big deal. 00:29:11.280 --> 00:29:14.960 That's just one more step or one more iteration of the loop 00:29:14.960 --> 00:29:17.850 because half of the people would anyway sit down. 00:29:17.850 --> 00:29:22.410 So this green algorithm still represents the faster theoretical algorithm today. 00:29:22.410 --> 00:29:26.130 And so recall that we described these things more mathematically as n. 00:29:26.130 --> 00:29:29.970 So this was one page at a time or one person at a time. 00:29:29.970 --> 00:29:33.130 This was two people or two pages at a time. 00:29:33.130 --> 00:29:36.480 So it's twice as fast, so if n is the number of people or pages 00:29:36.480 --> 00:29:38.850 and divided by 2 is the total number of steps. 00:29:38.850 --> 00:29:41.820 And then this one got a little mathy, but log base 2 of n. 00:29:41.820 --> 00:29:47.850 And log base 2 of n just means what is the value when you take n and divide it 00:29:47.850 --> 00:29:51.360 in two by 2 again and again and again and again 00:29:51.360 --> 00:29:54.960 until you're left with just one page or one person standing. 00:29:54.960 --> 00:29:58.500 But in the world of running times, it turns out that being this precise 00:29:58.500 --> 00:30:00.390 is not that intellectually interesting. 00:30:00.390 --> 00:30:02.370 And it sort of devolves into lower level math. 00:30:02.370 --> 00:30:04.560 It's just not necessary when having discussions 00:30:04.560 --> 00:30:07.500 about the efficiency of an algorithm or even code that you've written. 00:30:07.500 --> 00:30:10.020 So generally, a computer scientist, when asked 00:30:10.020 --> 00:30:11.730 what's the running time of your algorithm 00:30:11.730 --> 00:30:13.813 or what's the efficiency of your algorithm or more 00:30:13.813 --> 00:30:16.560 generally how good or bad is your algorithm, 00:30:16.560 --> 00:30:20.588 they'll talk about it being on the order of some number of steps. 00:30:20.588 --> 00:30:23.130 This is a phrase you'll hear increasingly in computer science 00:30:23.130 --> 00:30:25.020 where you can kind of wave your hand at it. 00:30:25.020 --> 00:30:27.600 Like, oh, the lower level details don't matter that much. 00:30:27.600 --> 00:30:31.410 All you care about in broad strokes are certain numbers 00:30:31.410 --> 00:30:32.940 that will add up the most. 00:30:32.940 --> 00:30:35.250 And in fact, when computer scientists talk 00:30:35.250 --> 00:30:37.860 about the efficiency of algorithms, they tend 00:30:37.860 --> 00:30:43.020 to throw away constant factors, so literally, numbers like 2 00:30:43.020 --> 00:30:46.540 that might be dividing here or a base here. 00:30:46.540 --> 00:30:49.710 So for instance, these two algorithms to a computer scientist 00:30:49.710 --> 00:30:50.918 would sort be the same. 00:30:50.918 --> 00:30:52.710 Like, yeah, it's technically twice as fast. 00:30:52.710 --> 00:30:53.940 But look at the lines. 00:30:53.940 --> 00:30:57.210 I mean, they're practically the same and this one here too, log base 2-- sure. 00:30:57.210 --> 00:31:01.050 But if you remember from math class, you can change the base of any logarithm 00:31:01.050 --> 00:31:03.040 from one number to another pretty easily. 00:31:03.040 --> 00:31:06.120 So ah, let's just generalize it as log of n. 00:31:06.120 --> 00:31:09.390 It doesn't really matter fundamentally what the numbers actually are. 00:31:09.390 --> 00:31:14.400 And honestly, if we zoom out slightly so that the y-axis and the x-axis 00:31:14.400 --> 00:31:18.240 get even bigger, honestly, these first two algorithms 00:31:18.240 --> 00:31:21.750 really do start to resemble each other closer and closer. 00:31:21.750 --> 00:31:25.050 And I daresay in your mind's eye, imagine zooming further and further 00:31:25.050 --> 00:31:26.070 and further out. 00:31:26.070 --> 00:31:29.850 Like, that red and yellow line are pretty much-- once n is large enough-- 00:31:29.850 --> 00:31:31.305 going to be functionally the same. 00:31:31.305 --> 00:31:33.180 Like, they're practically the same algorithm. 00:31:33.180 --> 00:31:35.700 But this one is still doing amazingly because it's 00:31:35.700 --> 00:31:37.360 a fundamentally different shape. 00:31:37.360 --> 00:31:40.320 So this is to say when a computer scientist talks about, thinks 00:31:40.320 --> 00:31:43.890 about the efficiency of algorithms, we just throw away the constant terms 00:31:43.890 --> 00:31:47.450 that when n gets really large just don't seem to matter as much. 00:31:47.450 --> 00:31:49.200 They don't add up as much or fundamentally 00:31:49.200 --> 00:31:51.760 change the picture in a case like this. 00:31:51.760 --> 00:31:55.320 So what I'm describing here with this capital letter O has a technical term. 00:31:55.320 --> 00:31:57.810 This is called big O notation. 00:31:57.810 --> 00:32:00.180 And this is omnipresent in computer science 00:32:00.180 --> 00:32:02.250 and often rears its head even in programming, 00:32:02.250 --> 00:32:05.617 specifically when talking about the design of some algorithm. 00:32:05.617 --> 00:32:07.950 And this is a little cheat sheet here on the screen now. 00:32:07.950 --> 00:32:10.890 Very often, algorithms that you write or you 00:32:10.890 --> 00:32:17.400 use will be describable as being on the order of one of these running times. 00:32:17.400 --> 00:32:21.180 So n is just representative of the number of things-- number of people, 00:32:21.180 --> 00:32:22.210 number of pages-- 00:32:22.210 --> 00:32:25.830 whatever it is you're actually doing in code. 00:32:25.830 --> 00:32:30.150 So the mathematical formulas inside of the parentheses 00:32:30.150 --> 00:32:33.870 describe as a function of the size of that input 00:32:33.870 --> 00:32:36.360 how fast or slow the algorithm's going to be. 00:32:36.360 --> 00:32:40.710 So this algorithm in the middle here, big O of n, so to speak, 00:32:40.710 --> 00:32:43.600 means that it takes linear time, in other words. 00:32:43.600 --> 00:32:44.760 So my first algorithm-- 00:32:44.760 --> 00:32:47.040 1, 2, 3, 4-- 00:32:47.040 --> 00:32:48.787 or my first algorithm in week zero-- 00:32:48.787 --> 00:32:51.630 1, 2, 3, 4. 00:32:51.630 --> 00:32:53.250 That was a linear search. 00:32:53.250 --> 00:32:56.940 The number of steps it takes is on the order of n because if there's n pages, 00:32:56.940 --> 00:33:00.065 in the worst case, like, John Harvard's all the way at the end of the phone 00:33:00.065 --> 00:33:01.318 book, so it takes me n steps. 00:33:01.318 --> 00:33:04.110 In this case, it's always going to take me n steps to count you all 00:33:04.110 --> 00:33:06.443 because if I want to point at each and every one of you, 00:33:06.443 --> 00:33:08.550 that is always going to take me n steps. 00:33:08.550 --> 00:33:13.605 So big O represents an upper bound on the number of steps 00:33:13.605 --> 00:33:14.730 that you might be counting. 00:33:14.730 --> 00:33:17.760 And so we often use it to consider the worst case and the worst 00:33:17.760 --> 00:33:20.190 case, John Harvard, or whoever might be all the way at the end of the phone 00:33:20.190 --> 00:33:20.690 book. 00:33:20.690 --> 00:33:23.240 So that linear search is on the order of n. 00:33:23.240 --> 00:33:25.230 But what about n squared? 00:33:25.230 --> 00:33:27.870 This means n people doing n things. 00:33:27.870 --> 00:33:30.590 So for instance, and we won't do this today, 00:33:30.590 --> 00:33:33.440 but if we were to ask you again to stand up and shake 00:33:33.440 --> 00:33:36.140 everyone's hand in the room-- 00:33:36.140 --> 00:33:40.340 not good for health nowadays-- but shake everyone's hand in the room, 00:33:40.340 --> 00:33:42.420 how many handshakes would there be? 00:33:42.420 --> 00:33:46.580 Well, if there's n of you and you've got to shake everyone else's hand, 00:33:46.580 --> 00:33:48.650 that's technically n times n minus 1. 00:33:48.650 --> 00:33:50.030 Let's throw away the minus 1. 00:33:50.030 --> 00:33:53.600 That's n times n or n squared handshakes. 00:33:53.600 --> 00:33:54.800 That's a lot of handshakes. 00:33:54.800 --> 00:33:57.140 And so the running time of shaking everyone's hand 00:33:57.140 --> 00:34:00.860 in the room to introduce yourself would be on the order of n squared. 00:34:00.860 --> 00:34:04.370 At the other end of the spectrum, the faster end, big O of 1 00:34:04.370 --> 00:34:07.310 doesn't mean that the algorithm takes literally one step. 00:34:07.310 --> 00:34:11.040 It could take two steps or three or even 1,000 steps. 00:34:11.040 --> 00:34:14.159 But what it means is it's a constant number of steps. 00:34:14.159 --> 00:34:16.520 So it doesn't matter how many people are in the room, 00:34:16.520 --> 00:34:21.320 this describes something taking just one step total 00:34:21.320 --> 00:34:23.520 or a constant number of steps total. 00:34:23.520 --> 00:34:26.610 So for instance, earlier when everyone stood up at the same time, 00:34:26.610 --> 00:34:29.412 that was constant time because if we had twice as many people 00:34:29.412 --> 00:34:32.370 come into the room, it's not going to take us twice as long to stand up 00:34:32.370 --> 00:34:34.739 if everyone stands up at the same time. 00:34:34.739 --> 00:34:37.030 That would be a constant time algorithm. 00:34:37.030 --> 00:34:38.290 So this is linear. 00:34:38.290 --> 00:34:39.330 This is constant. 00:34:39.330 --> 00:34:41.460 If you want to get fancy, this is quadratic. 00:34:41.460 --> 00:34:42.960 This is logarithmic. 00:34:42.960 --> 00:34:44.760 And this is n log n. 00:34:44.760 --> 00:34:48.340 Or there's other fancier terms we can give it as well, but for now, 00:34:48.340 --> 00:34:53.050 just a common list of running times that we might apply to certain algorithms. 00:34:53.050 --> 00:34:56.400 So linear search, I claim, is in big O of n 00:34:56.400 --> 00:34:58.900 because it's going to take in the worst case n steps. 00:34:58.900 --> 00:35:01.490 What about binary search? 00:35:01.490 --> 00:35:05.870 How many steps does binary search take on the order of 00:35:05.870 --> 00:35:07.490 according to this chart? 00:35:07.490 --> 00:35:07.990 Yeah? 00:35:07.990 --> 00:35:08.700 AUDIENCE: Log n. 00:35:08.700 --> 00:35:09.630 DAVID MALAN: Log n. 00:35:09.630 --> 00:35:14.250 Yeah, because no matter what with binary search, you're dividing in half, 00:35:14.250 --> 00:35:15.510 half, half. 00:35:15.510 --> 00:35:19.380 But in the worst case, it might be in the last door you check. 00:35:19.380 --> 00:35:21.630 But you only took log n steps to get there. 00:35:21.630 --> 00:35:23.550 But it still might be the last one you check. 00:35:23.550 --> 00:35:26.897 So binary search indeed would be on the order of log n. 00:35:26.897 --> 00:35:28.230 But sometimes, you do get lucky. 00:35:28.230 --> 00:35:30.990 And we saw with our volunteers that sometimes you can get lucky 00:35:30.990 --> 00:35:32.430 and just find things quicker. 00:35:32.430 --> 00:35:35.793 So we don't always want to talk about things in terms of an upper bound, 00:35:35.793 --> 00:35:38.460 like how many steps in the worst case might this algorithm take. 00:35:38.460 --> 00:35:40.860 Sometimes it's useful to know in the best case 00:35:40.860 --> 00:35:43.350 how few steps might an algorithm take. 00:35:43.350 --> 00:35:46.170 So for that, we have this capital Greek omega, which 00:35:46.170 --> 00:35:48.540 is another symbol in computer science. 00:35:48.540 --> 00:35:53.872 And whereas big O represents upper bound, omega represents lower bound. 00:35:53.872 --> 00:35:55.080 And it's the exact same idea. 00:35:55.080 --> 00:35:56.820 It's just a different symbol to represent 00:35:56.820 --> 00:35:58.930 a different idea, the opposite, in this case. 00:35:58.930 --> 00:36:00.690 So here's a similar cheat sheet here. 00:36:00.690 --> 00:36:02.670 But when you use the omega symbol, that just 00:36:02.670 --> 00:36:07.240 means that this algorithm might take as few as this many steps, 00:36:07.240 --> 00:36:09.610 for instance, in the very best case. 00:36:09.610 --> 00:36:12.640 So by that logic, if I ask about linear search, 00:36:12.640 --> 00:36:17.090 our first demonstration with the lockers, let's consider the best case. 00:36:17.090 --> 00:36:19.720 How many steps might it take to search n lockers 00:36:19.720 --> 00:36:21.610 using linear search in the best case? 00:36:21.610 --> 00:36:22.960 You get lucky. 00:36:22.960 --> 00:36:23.680 So I heard it. 00:36:23.680 --> 00:36:24.950 Yeah, just one step. 00:36:24.950 --> 00:36:29.210 So we could say that linear search is an omega of 1, so to speak. 00:36:29.210 --> 00:36:30.880 What about binary search? 00:36:30.880 --> 00:36:34.150 If you've got n lockers, in the best case, though, 00:36:34.150 --> 00:36:35.947 how few steps might it take us? 00:36:35.947 --> 00:36:37.780 Again, one, because we might just get lucky. 00:36:37.780 --> 00:36:40.460 And boom, it happens to be right there in the middle. 00:36:40.460 --> 00:36:46.000 So you could say that both linear search and binary search are an omega of 1. 00:36:46.000 --> 00:36:50.620 Now, by contrast, my attendance algorithm, the first one I proposed-- 00:36:50.620 --> 00:36:55.822 1, 2, 3, 4, 5, 6, 7, I claimed a moment ago that that's in big O of n 00:36:55.822 --> 00:36:58.780 because if there's n people in the room, I've got to point at everyone. 00:36:58.780 --> 00:37:00.580 But equivalently, if there's n people in the room 00:37:00.580 --> 00:37:03.400 and I have to point at everyone, what's the fewest number of steps 00:37:03.400 --> 00:37:08.770 it could take me to take attendance using this linear approach? 00:37:08.770 --> 00:37:11.730 So still n, right, unless I guess, which is not an algorithm. 00:37:11.730 --> 00:37:14.230 Like, unless I guess, I'm not going to get the right answer, 00:37:14.230 --> 00:37:16.120 so I kind of have to point at everyone. 00:37:16.120 --> 00:37:18.210 So in both the best case and the worst case, 00:37:18.210 --> 00:37:20.640 some algorithms still take n steps. 00:37:20.640 --> 00:37:24.300 And for this, we have what we'll call theta notation, whereby 00:37:24.300 --> 00:37:28.110 if big O and omega happen to be the same, which is not always the case 00:37:28.110 --> 00:37:33.190 but can be, then you can say that algorithm is in theta of such and such. 00:37:33.190 --> 00:37:37.530 So my attendance-taking algorithm, the very first, 1, 2, 3, 4, 00:37:37.530 --> 00:37:40.020 all the way on up to n would be in theta of n 00:37:40.020 --> 00:37:42.250 because in both the best case and the worst case, 00:37:42.250 --> 00:37:48.143 it takes the same number of steps as per my big O and omega analysis. 00:37:48.143 --> 00:37:51.060 Now, there is a more formal mathematical definition for both of these. 00:37:51.060 --> 00:37:52.620 And if you take higher level computer science, 00:37:52.620 --> 00:37:54.495 you'll go more into the weeds of these ideas. 00:37:54.495 --> 00:37:56.820 But for now, big O, upper bound. 00:37:56.820 --> 00:37:59.100 Omega, lower bound. 00:37:59.100 --> 00:38:03.430 Questions on this symbology? 00:38:03.430 --> 00:38:06.920 It'll be a tool in our tool kit. 00:38:06.920 --> 00:38:07.420 No? 00:38:07.420 --> 00:38:14.260 OK, so with that said, let's see how we might translate this to actual code 00:38:14.260 --> 00:38:17.050 in something that makes sense now using C and not so much 00:38:17.050 --> 00:38:20.570 new syntax but applications of similar ideas from last time. 00:38:20.570 --> 00:38:25.237 So for instance, let me actually go over to search dot C. And in search dot C, 00:38:25.237 --> 00:38:27.820 I'm going to go ahead and implement the idea of linear search, 00:38:27.820 --> 00:38:29.980 very simply, using integers initially. 00:38:29.980 --> 00:38:34.150 So to do this, let me go ahead and give myself the familiar CS50 dot h 00:38:34.150 --> 00:38:36.460 so that I can ask the human what number to search for. 00:38:36.460 --> 00:38:40.000 Then let me go ahead and include standard io.h 00:38:40.000 --> 00:38:42.430 so that I can use printf and the like. 00:38:42.430 --> 00:38:45.880 Then I'm going to go ahead and do int main void without any command line 00:38:45.880 --> 00:38:49.130 arguments because I'm not going to need any for this particular demonstration. 00:38:49.130 --> 00:38:50.920 And someone asked about this a while back. 00:38:50.920 --> 00:38:53.560 If I want to declare an array of values but I 00:38:53.560 --> 00:38:57.650 know in advance what the values are, there is a special syntax I can use, 00:38:57.650 --> 00:38:58.330 which is this. 00:38:58.330 --> 00:39:01.480 If I want a whole bunch of numbers but I want those numbers to be stored 00:39:01.480 --> 00:39:04.780 in an array, I can store them in-- 00:39:04.780 --> 00:39:06.380 using these curly braces. 00:39:06.380 --> 00:39:09.582 And I'm going to use the same numbers as the monopoly denominations 00:39:09.582 --> 00:39:10.790 that we've been playing with. 00:39:10.790 --> 00:39:12.690 And I'm just going to put them in sort of random order. 00:39:12.690 --> 00:39:15.315 But I'm going to deliberately put the 50 all the way at the end 00:39:15.315 --> 00:39:19.430 just so that I know that it's going to try all possible steps, so big O of n, 00:39:19.430 --> 00:39:20.210 ultimately. 00:39:20.210 --> 00:39:23.180 Now let's go ahead and ask the user for a value n. 00:39:23.180 --> 00:39:24.530 And we'll use get int. 00:39:24.530 --> 00:39:28.520 And just ask the user for a number to search for, be it 50 or something else. 00:39:28.520 --> 00:39:32.000 And then here's how I might implement in code now linear search, 00:39:32.000 --> 00:39:34.850 translating effectively the pseudocode from last time. 00:39:34.850 --> 00:39:38.090 For int i equals 0. 00:39:38.090 --> 00:39:40.940 i is less than 7. i plus plus. 00:39:40.940 --> 00:39:43.670 And then inside of this loop, I can ask a question. 00:39:43.670 --> 00:39:47.610 If the ith number in the numbers array-- 00:39:47.610 --> 00:39:51.170 so numbers bracket i-- equals, equals the number n 00:39:51.170 --> 00:39:55.130 that I care about, well, this is where I could just declare return true 00:39:55.130 --> 00:39:55.930 or return false. 00:39:55.930 --> 00:39:58.430 Here, I'm going to go ahead and just use a printf statement. 00:39:58.430 --> 00:40:01.670 And I'm going to say found, backslash n, just to know visually 00:40:01.670 --> 00:40:03.110 that I found the number. 00:40:03.110 --> 00:40:06.920 And else I might do something like this. 00:40:06.920 --> 00:40:09.080 Else if that's not the case, I'll go ahead 00:40:09.080 --> 00:40:13.910 and print out not found backslash n. 00:40:13.910 --> 00:40:16.410 All right, so let me zoom out for just a moment. 00:40:16.410 --> 00:40:17.690 Here's all of my code. 00:40:17.690 --> 00:40:25.130 Any concerns with this implementation of what I claim is now linear search? 00:40:25.130 --> 00:40:28.970 Any concerns with what I claim is linear search? 00:40:28.970 --> 00:40:29.885 Yeah? 00:40:29.885 --> 00:40:32.855 AUDIENCE: [INAUDIBLE] 00:40:32.855 --> 00:40:33.950 DAVID MALAN: Exactly. 00:40:33.950 --> 00:40:37.917 If I search the first number and it's not found, it's going to say not found. 00:40:37.917 --> 00:40:40.500 But it's going to keep saying not found, not found, not found, 00:40:40.500 --> 00:40:41.360 which might be fine. 00:40:41.360 --> 00:40:42.402 But it's a little stupid. 00:40:42.402 --> 00:40:44.325 I probably want to know if it's found or not. 00:40:44.325 --> 00:40:46.700 So I've made that same mistake that I called out earlier. 00:40:46.700 --> 00:40:49.700 Like, the else is not the alternative to not finding 00:40:49.700 --> 00:40:51.570 the number in that first location. 00:40:51.570 --> 00:40:55.890 It's the final decision to make when I haven't actually found the value. 00:40:55.890 --> 00:40:58.580 So I think what I want to do is get rid of this else clause. 00:40:58.580 --> 00:41:01.940 And then at the outside of this loop, I think 00:41:01.940 --> 00:41:03.770 I want to conclude printf not found. 00:41:03.770 --> 00:41:08.420 But here too there's a new bug that's arisen. 00:41:08.420 --> 00:41:13.310 There's a new bug, even though I fixed the logical error you just described. 00:41:13.310 --> 00:41:15.290 What symptom are we still going to see? 00:41:18.290 --> 00:41:20.960 Yeah, now it's going to always print not found, 00:41:20.960 --> 00:41:23.997 even when I have found it because even once I've found it 00:41:23.997 --> 00:41:26.330 and I finished going through the array, it's still going 00:41:26.330 --> 00:41:27.740 to assume that I got to the bottom. 00:41:27.740 --> 00:41:28.990 And therefore, it's not found. 00:41:28.990 --> 00:41:32.660 So I need to somehow exit out of main prematurely, if you will. 00:41:32.660 --> 00:41:35.510 And recall that last week, we also introduced the idea 00:41:35.510 --> 00:41:39.680 that main all this time does actually return a value, an integer. 00:41:39.680 --> 00:41:42.110 By default, it's secretly been zero. 00:41:42.110 --> 00:41:46.130 Anytime a program exits, it just returns zero, an exit status of zero. 00:41:46.130 --> 00:41:47.970 But we do now have control over that. 00:41:47.970 --> 00:41:51.080 And so a convention in C would be that when 00:41:51.080 --> 00:41:54.500 you want your main function to end prematurely if so, 00:41:54.500 --> 00:41:56.587 you can literally just return a value. 00:41:56.587 --> 00:41:59.670 And even though this feels a little backwards, this is just the way it is. 00:41:59.670 --> 00:42:02.450 You return zero to indicate success. 00:42:02.450 --> 00:42:06.120 And you return any other integer to indicate failure. 00:42:06.120 --> 00:42:09.110 So by convention, people go to 1 and then 2 and then 3. 00:42:09.110 --> 00:42:11.310 They don't think too hard about what numbers to use. 00:42:11.310 --> 00:42:14.120 But in this case, I'm going to go ahead and return 1 00:42:14.120 --> 00:42:16.440 if I do get to the bottom of this file. 00:42:16.440 --> 00:42:21.090 So now if I open back up my terminal window, I run make search. 00:42:21.090 --> 00:42:23.267 No syntax errors-- dot slash search. 00:42:23.267 --> 00:42:24.850 I'm going to be prompted for a number. 00:42:24.850 --> 00:42:26.683 Let's go ahead and search for the number 50. 00:42:26.683 --> 00:42:28.650 And I should see found. 00:42:28.650 --> 00:42:31.290 Meanwhile, if I run it again-- dot slash search-- 00:42:31.290 --> 00:42:34.590 I search for the number 13, that should be not found. 00:42:34.590 --> 00:42:38.850 So I claim that this is now a correct implementation of linear search 00:42:38.850 --> 00:42:43.620 that's gone from left to right, looking for a number that may or may not 00:42:43.620 --> 00:42:44.310 be there. 00:42:44.310 --> 00:42:48.420 Any questions on this version of the code here? 00:42:48.420 --> 00:42:49.578 Yeah? 00:42:49.578 --> 00:42:53.064 AUDIENCE: [INAUDIBLE] 00:42:53.064 --> 00:42:55.760 DAVID MALAN: Return zero indicates success 00:42:55.760 --> 00:42:58.250 when doing it from main in particular. 00:42:58.250 --> 00:43:02.480 And that's backwards only in the sense that generally 0 is false. 00:43:02.480 --> 00:43:04.640 And 1 is true. 00:43:04.640 --> 00:43:09.272 But the logic here is that if the program works correctly, that's zero. 00:43:09.272 --> 00:43:11.730 But there's an infinite number of things that can go wrong. 00:43:11.730 --> 00:43:15.290 And that's why we need 1 and 2 and 3 and 4 and all the way up. 00:43:15.290 --> 00:43:17.362 Other questions? 00:43:17.362 --> 00:43:19.842 AUDIENCE: [INAUDIBLE] 00:43:19.842 --> 00:43:21.637 DAVID MALAN: Yes. 00:43:21.637 --> 00:43:25.533 AUDIENCE: [INAUDIBLE] 00:43:25.533 --> 00:43:26.590 DAVID MALAN: Correct. 00:43:26.590 --> 00:43:31.270 When you return zero or return any value from main wherever it is in your code, 00:43:31.270 --> 00:43:34.240 the program will effectively terminate right then and there. 00:43:34.240 --> 00:43:37.600 No additional code will get executed at the bottom of the function. 00:43:37.600 --> 00:43:39.310 You'll effectively exit out. 00:43:39.310 --> 00:43:43.240 Just like in a normal function that isn't main, when you return a value, 00:43:43.240 --> 00:43:46.930 it immediately exits that function and hands back the value. 00:43:46.930 --> 00:43:47.440 Yeah? 00:43:47.440 --> 00:43:49.352 AUDIENCE: [INAUDIBLE] 00:43:49.352 --> 00:43:52.020 DAVID MALAN: So return 1 is just me being 00:43:52.020 --> 00:43:56.250 pedantic at this point because I'm frankly not going to really care what 00:43:56.250 --> 00:43:58.110 the exit status is of this program. 00:43:58.110 --> 00:44:03.810 But once I've introduced the idea of manually returning 0 on this line 14 00:44:03.810 --> 00:44:07.470 to indicate success, it stands to reason that I should also 00:44:07.470 --> 00:44:10.510 return a different value when I want to indicate failure. 00:44:10.510 --> 00:44:13.038 And so even though this does not functionally 00:44:13.038 --> 00:44:15.330 change the program-- it will still work-- it will still 00:44:15.330 --> 00:44:19.050 print the same things correctly-- it's a lower level detail 00:44:19.050 --> 00:44:22.260 that programmers, teaching assistants, testing software, 00:44:22.260 --> 00:44:24.930 might appreciate, knowing what actually happened 00:44:24.930 --> 00:44:28.160 in the program underneath the hood. 00:44:28.160 --> 00:44:30.660 All right, so what about strings? 00:44:30.660 --> 00:44:32.660 So it turns out with strings, we're going 00:44:32.660 --> 00:44:35.790 to have to think a little harder about how best to do this. 00:44:35.790 --> 00:44:37.530 So let me actually go ahead and do this. 00:44:37.530 --> 00:44:41.240 Let me go ahead and get rid of much of this code but transition 00:44:41.240 --> 00:44:44.420 to a different type of array, this time an array of strings. 00:44:44.420 --> 00:44:47.570 And I'm going to call the array strings itself plural, just to make clear 00:44:47.570 --> 00:44:48.985 what's in it instead of numbers. 00:44:48.985 --> 00:44:51.110 I'm going to use the square bracket notation, which 00:44:51.110 --> 00:44:54.235 just means I don't know at the moment how many elements it's going to have. 00:44:54.235 --> 00:44:56.030 But the compiler can figure it out for me. 00:44:56.030 --> 00:44:58.910 And in the spirit of Monopoly, let's go ahead in our curly braces, 00:44:58.910 --> 00:45:00.440 do something like this. 00:45:00.440 --> 00:45:03.200 Battleship is going to be one of the strings. 00:45:03.200 --> 00:45:05.390 Boot is going to be another. 00:45:05.390 --> 00:45:07.160 Cannon is going to be a third. 00:45:07.160 --> 00:45:09.140 Iron is going to be the fourth. 00:45:09.140 --> 00:45:11.510 Thimble-- and if you've ever played Monopoly, 00:45:11.510 --> 00:45:14.610 you know where these are coming from-- and top hat, for instance. 00:45:14.610 --> 00:45:17.510 So this gives me 1, 2, 3, 4, 5, 6. 00:45:17.510 --> 00:45:19.148 I could write the number 6 here. 00:45:19.148 --> 00:45:20.690 And the compiler would appreciate it. 00:45:20.690 --> 00:45:21.607 But it doesn't matter. 00:45:21.607 --> 00:45:25.280 The compiler can figure it out on its own just based on the number of commas. 00:45:25.280 --> 00:45:28.440 And this also ensures that I don't write one number to the left 00:45:28.440 --> 00:45:30.130 and then miscount on the right. 00:45:30.130 --> 00:45:32.890 So omitting it is probably in everyone's benefit. 00:45:32.890 --> 00:45:33.840 Now let's do this. 00:45:33.840 --> 00:45:38.340 Let's ask the user for a string using get string instead of get int 00:45:38.340 --> 00:45:40.650 for some string to search for. 00:45:40.650 --> 00:45:42.660 Then let's go ahead and do the exact same thing. 00:45:42.660 --> 00:45:46.620 For i int i equals 0. 00:45:46.620 --> 00:45:50.670 i is less than 6, which is technically a magic number. 00:45:50.670 --> 00:45:53.070 But let's focus on the searching algorithm for now. 00:45:53.070 --> 00:45:54.330 i plus plus. 00:45:54.330 --> 00:45:56.850 And then inside of this loop, let's do this. 00:45:56.850 --> 00:46:04.980 If strings bracket i equals equals, s, then let's go ahead 00:46:04.980 --> 00:46:08.100 and print out just as before, quote, unquote, found-- backslash n-- 00:46:08.100 --> 00:46:10.125 and proactively return zero this time. 00:46:10.125 --> 00:46:12.000 And if we don't find it anywhere in the loop, 00:46:12.000 --> 00:46:14.280 let's go ahead and return 1 at the very bottom, 00:46:14.280 --> 00:46:17.820 before which we will print not found backslash n. 00:46:17.820 --> 00:46:20.670 So it's the exact same logic at the moment, 00:46:20.670 --> 00:46:23.530 even though I've changed my ints to strings. 00:46:23.530 --> 00:46:25.950 Let me go ahead and open up my terminal window now. 00:46:25.950 --> 00:46:30.810 Do make search again and see, OK, so far, so good. 00:46:30.810 --> 00:46:33.780 Let me now go ahead and do dot slash search. 00:46:33.780 --> 00:46:37.470 And let's go ahead and search for-- how about top hat. 00:46:37.470 --> 00:46:39.653 So we should see found. 00:46:39.653 --> 00:46:40.415 Huh. 00:46:40.415 --> 00:46:41.290 All right, not found. 00:46:41.290 --> 00:46:43.332 All right, well, let's do dot slash search again. 00:46:43.332 --> 00:46:44.320 How about thimble? 00:46:44.320 --> 00:46:47.380 Maybe it's because it's just two words. 00:46:47.380 --> 00:46:50.110 No, thimble's not found, either. 00:46:50.110 --> 00:46:53.230 Dot slash search-- let's search for the first one, battleship. 00:46:53.230 --> 00:46:55.240 Enter-- still not found. 00:46:55.240 --> 00:46:57.280 Let's search for something else like cat. 00:46:57.280 --> 00:46:58.600 Not found. 00:46:58.600 --> 00:47:03.550 What is going on because I'm pretty sure the logic is exactly the same? 00:47:03.550 --> 00:47:09.010 Well, it turns out in C, this line here, currently line 11, 00:47:09.010 --> 00:47:11.350 is not how you compare strings. 00:47:11.350 --> 00:47:15.670 If you want to compare strings in C, you don't do it like you did integers. 00:47:15.670 --> 00:47:17.920 You actually need another technique altogether. 00:47:17.920 --> 00:47:21.070 And for that, we're going to need to revisit one of our friends, which 00:47:21.070 --> 00:47:24.610 is string.h, which is one of the header files for the string library 00:47:24.610 --> 00:47:27.610 that we introduced last week that has in addition to functions 00:47:27.610 --> 00:47:31.390 like strlen, which gives you the length of a string, it also gives us, 00:47:31.390 --> 00:47:34.330 as per the documentation here, another function that we'll 00:47:34.330 --> 00:47:38.830 start to find useful here, succinctly named strcmp, for string compare. 00:47:38.830 --> 00:47:44.750 And string compare will actually tell us if two strings are the same or not. 00:47:44.750 --> 00:47:46.200 It will indeed compare them. 00:47:46.200 --> 00:47:49.970 And if I use this now, let me go back to my code 00:47:49.970 --> 00:47:53.990 here and see what I might do differently if I go back into my code here 00:47:53.990 --> 00:47:55.160 and change this value. 00:47:55.160 --> 00:47:59.570 Instead of using strings bracket i equals, equals s, let's do str compare. 00:47:59.570 --> 00:48:01.190 And I read the documentation earlier. 00:48:01.190 --> 00:48:04.640 So I know that it takes two arguments, the first and the second string 00:48:04.640 --> 00:48:09.500 that you want to compare, so strings bracket i and then s, 00:48:09.500 --> 00:48:12.290 which is the string that the human typed in. 00:48:12.290 --> 00:48:15.680 But somewhat weirdly, what I want to check for 00:48:15.680 --> 00:48:19.220 is that str compare returns zero. 00:48:19.220 --> 00:48:24.950 So if str compare when given two strings is input, strings bracket i and s, 00:48:24.950 --> 00:48:30.750 returns an integer 0, that actually means the strings are the same. 00:48:30.750 --> 00:48:31.500 So let's try this. 00:48:31.500 --> 00:48:33.020 Let me do make search again. 00:48:33.020 --> 00:48:33.890 Huh. 00:48:33.890 --> 00:48:36.410 What did I do wrong here? 00:48:36.410 --> 00:48:39.200 A whole bunch of errors popped out. 00:48:39.200 --> 00:48:40.100 What did I do wrong? 00:48:40.100 --> 00:48:41.530 Yeah? 00:48:41.530 --> 00:48:44.510 Yeah, so I didn't include the very header file we're talking about. 00:48:44.510 --> 00:48:46.270 So again, it doesn't necessarily mean a logical error. 00:48:46.270 --> 00:48:47.920 It just means a stupid error on my part. 00:48:47.920 --> 00:48:49.670 I didn't actually include the header file. 00:48:49.670 --> 00:48:51.340 So let me go back and actually do that. 00:48:51.340 --> 00:48:54.280 Up at the top in addition to cs50.h and standard io, 00:48:54.280 --> 00:48:56.890 let's also include string.h. 00:48:56.890 --> 00:48:59.290 Let me clear my terminal and do make search again. 00:48:59.290 --> 00:49:00.340 Crossing my fingers. 00:49:00.340 --> 00:49:01.390 That time it worked. 00:49:01.390 --> 00:49:06.040 And now if I do dot slash search and search for top hat like before, 00:49:06.040 --> 00:49:08.500 now, thankfully, it is, in fact, found. 00:49:08.500 --> 00:49:12.340 If I do it once more and search for battleship, now it's, in fact, found. 00:49:12.340 --> 00:49:15.520 If I do it once more and search for cat, which should not be in there, 00:49:15.520 --> 00:49:17.870 that is not, in fact, found. 00:49:17.870 --> 00:49:21.050 So now just intuitively, even if you've never done this before, 00:49:21.050 --> 00:49:25.450 why might it be valuable for this function called strcmp 00:49:25.450 --> 00:49:30.370 to return zero if the strings are equal as opposed 00:49:30.370 --> 00:49:34.060 to a simple Boolean like true or false, which might have been your intuition? 00:49:34.060 --> 00:49:40.540 When you compare two strings, what are the possible takeaways 00:49:40.540 --> 00:49:43.270 you might have from comparing two strings? 00:49:43.270 --> 00:49:45.790 It's not just that they're equal or not. 00:49:45.790 --> 00:49:47.970 AUDIENCE: [INAUDIBLE] 00:49:47.970 --> 00:49:51.170 DAVID MALAN: OK, so maybe if the ASCII values are the same, that 00:49:51.170 --> 00:49:53.150 might imply, indeed, literal equality. 00:49:53.150 --> 00:49:54.720 But something else. 00:49:54.720 --> 00:49:58.044 AUDIENCE: [INAUDIBLE] about how similar these things are [INAUDIBLE].. 00:49:58.044 --> 00:49:59.020 DAVID MALAN: Ah, nice. 00:49:59.020 --> 00:50:00.940 Like, you and I in English, certainly, are very much 00:50:00.940 --> 00:50:02.982 in the habit of sorting information, whether it's 00:50:02.982 --> 00:50:06.490 in a dictionary, in our contacts, in a phone book, in any such technology. 00:50:06.490 --> 00:50:11.140 And so it's often useful to be able to know, does this string equal another? 00:50:11.140 --> 00:50:11.740 Sure. 00:50:11.740 --> 00:50:15.760 But does this string come before another alphabetically or maybe 00:50:15.760 --> 00:50:17.620 after another alphabetically? 00:50:17.620 --> 00:50:20.680 So sometimes, you want functions to give you back three answers. 00:50:20.680 --> 00:50:24.550 But equals, equals alone can only give you true or false, yes or no. 00:50:24.550 --> 00:50:27.040 And that might not be useful enough when you're 00:50:27.040 --> 00:50:29.200 trying to solve some problem involving strings. 00:50:29.200 --> 00:50:34.280 So it turns out str compare actually compares the two strings for equality 00:50:34.280 --> 00:50:38.470 but also for what's called, playfully, ASCII-betical order. 00:50:38.470 --> 00:50:41.590 So not alphabetical order, per se, but ASCII-betical order 00:50:41.590 --> 00:50:44.680 where it actually compares the integer values of the letters. 00:50:44.680 --> 00:50:46.450 So if you're comparing the letter A, It's 00:50:46.450 --> 00:50:51.250 going to compare 65 against some other letter's integer value-- hence 00:50:51.250 --> 00:50:52.880 ASCII-betical value. 00:50:52.880 --> 00:50:54.790 So we're not doing any form of sorting here. 00:50:54.790 --> 00:50:56.110 So it's sort of immaterial. 00:50:56.110 --> 00:50:59.000 But as per the documentation, I do know that str compare 00:50:59.000 --> 00:51:01.880 returns zero if two strings are equal. 00:51:01.880 --> 00:51:06.230 And a little teaser for next week, it turns out when I was only using equals, 00:51:06.230 --> 00:51:09.400 equals to compare strings bracket i and s, 00:51:09.400 --> 00:51:13.350 I was not comparing the strings in the way that you might have thought. 00:51:13.350 --> 00:51:18.087 And if you have programmed in Java or Python before, equals, equals 00:51:18.087 --> 00:51:20.420 is actually doing something different in those languages 00:51:20.420 --> 00:51:23.130 than it is actually doing in C. But more on that next week. 00:51:23.130 --> 00:51:26.600 For now, just take on faith that str compare is indeed 00:51:26.600 --> 00:51:29.010 how you compare two strings. 00:51:29.010 --> 00:51:33.080 So let's actually put this into play with some actual additional code. 00:51:33.080 --> 00:51:37.940 Let me propose that we implement a very simplistic phone book, for instance. 00:51:37.940 --> 00:51:40.640 Let me go ahead and implement here-- 00:51:40.640 --> 00:51:42.110 how about in a new file. 00:51:42.110 --> 00:51:46.550 Instead of search dot C, let's actually do phone book dot c. 00:51:46.550 --> 00:51:50.510 And in this phone book, I'm going to go ahead and include the same header file, 00:51:50.510 --> 00:51:52.730 so cs50.h, so I can get input-- 00:51:52.730 --> 00:51:55.310 standard io.h, so I can use printf-- 00:51:55.310 --> 00:51:58.070 string.h so that I can use str compare. 00:51:58.070 --> 00:52:02.250 Let me give myself a main function again without command line arguments for now. 00:52:02.250 --> 00:52:05.150 And let me go ahead now and store a proper phone book, which 00:52:05.150 --> 00:52:07.800 has some names and some actual numbers. 00:52:07.800 --> 00:52:09.320 So let's store the names first. 00:52:09.320 --> 00:52:12.560 So string-- names is going to be the name of my array. 00:52:12.560 --> 00:52:14.580 And let's go ahead and store Carter's name, 00:52:14.580 --> 00:52:18.260 how about my name, and maybe John Harvard for phone book throwback. 00:52:18.260 --> 00:52:22.640 Then let's go ahead and give me another array called numbers wherein I'll put 00:52:22.640 --> 00:52:27.560 our phone number, so 617-495-1000 for Carter-- 00:52:27.560 --> 00:52:32.060 617-495-1000 for me-- technically, directory assistance here. 00:52:32.060 --> 00:52:34.160 And then for John, we'll give him an actual one. 00:52:34.160 --> 00:52:40.100 So it's actually going to be 949-468-2750. 00:52:40.100 --> 00:52:42.380 You're welcome to text or call John when you want. 00:52:42.380 --> 00:52:43.070 Whoops. 00:52:43.070 --> 00:52:46.130 And just for good measure, let's go ahead and put our country codes 00:52:46.130 --> 00:52:50.990 in here plus 1, even though at the end of the day, these are strings. 00:52:50.990 --> 00:52:54.320 So indeed notice I'm not using an integer for these values. 00:52:54.320 --> 00:52:56.510 I kind of sort of should. 00:52:56.510 --> 00:52:59.270 But here's where data types in C and programming more 00:52:59.270 --> 00:53:01.280 generally might sometimes mislead you. 00:53:01.280 --> 00:53:04.100 Even though we call it, obviously, a phone number. 00:53:04.100 --> 00:53:08.210 It's probably best to represent it generally as strings, in fact, 00:53:08.210 --> 00:53:12.110 so that you can have the pluses-- you can have the dashes so that it doesn't 00:53:12.110 --> 00:53:14.030 get too big and overflow an integer. 00:53:14.030 --> 00:53:16.340 Maybe it's an international number for which there's even more digits. 00:53:16.340 --> 00:53:18.120 You don't want to risk overflowing a value. 00:53:18.120 --> 00:53:20.120 And in general, the rule of thumb in programming 00:53:20.120 --> 00:53:22.940 is even if in English we call something a number, 00:53:22.940 --> 00:53:26.600 if you wouldn't do math on it ever, you should probably be storing it 00:53:26.600 --> 00:53:29.060 as a string, not as an integer. 00:53:29.060 --> 00:53:33.060 And it makes no logical sense to do math on phone numbers, per se. 00:53:33.060 --> 00:53:35.570 So those are best instinctively left as strings. 00:53:35.570 --> 00:53:37.850 And in this case, even more simply, this ensures 00:53:37.850 --> 00:53:41.630 that we have pluses and dashes stored inside of the string. 00:53:41.630 --> 00:53:46.490 All right, so now that I have these two arrays in parallel, if you will. 00:53:46.490 --> 00:53:49.040 Like, I'm assuming that Carter's name is first. 00:53:49.040 --> 00:53:50.180 So his number is first. 00:53:50.180 --> 00:53:51.230 David's name is second. 00:53:51.230 --> 00:53:52.850 So his number is second. 00:53:52.850 --> 00:53:54.810 John's is third and thus third. 00:53:54.810 --> 00:53:56.060 So let's actually search this. 00:53:56.060 --> 00:53:58.010 Let's ask the user for-- 00:53:58.010 --> 00:54:01.340 how about a name using get string. 00:54:01.340 --> 00:54:03.950 And this will be a name to search for in the phone book 00:54:03.950 --> 00:54:05.420 just like we did in week zero. 00:54:05.420 --> 00:54:09.290 Let's do for int i equals 0, i less than 3. 00:54:09.290 --> 00:54:10.738 Again, the 3 is bad practice. 00:54:10.738 --> 00:54:12.530 I should probably store that in a constant. 00:54:12.530 --> 00:54:15.950 But let's keep the focus for today on the algorithm alone-- 00:54:15.950 --> 00:54:17.330 i plus, plus. 00:54:17.330 --> 00:54:20.240 Then in here, let's do if-- 00:54:20.240 --> 00:54:27.800 how about names bracket i equals, equals name typed in. 00:54:27.800 --> 00:54:28.550 But wait a minute. 00:54:28.550 --> 00:54:30.320 I'm screwing this up again. 00:54:30.320 --> 00:54:32.270 What should I be using here? 00:54:32.270 --> 00:54:37.340 str compare again-- so let's do str compare, names bracket i comma, 00:54:37.340 --> 00:54:38.930 name, which came from the user. 00:54:38.930 --> 00:54:42.770 And if that return value is zero, then let's go ahead 00:54:42.770 --> 00:54:45.830 and print out, just like before, found backslash n. 00:54:45.830 --> 00:54:46.580 But you know what? 00:54:46.580 --> 00:54:48.170 We can do something more interesting. 00:54:48.170 --> 00:54:49.670 Let's actually print out the number. 00:54:49.670 --> 00:54:51.260 So I didn't just find something. 00:54:51.260 --> 00:54:52.550 I found the name. 00:54:52.550 --> 00:54:55.710 So let's actually plug in that person's corresponding number. 00:54:55.710 --> 00:54:57.900 So now it's a more useful phonebook or contacts app 00:54:57.900 --> 00:55:00.060 where I'm going to show the human not just found 00:55:00.060 --> 00:55:02.430 but found this specific number. 00:55:02.430 --> 00:55:06.010 Then I'm going to go ahead as before and return 0 to indicate success. 00:55:06.010 --> 00:55:09.538 And if we get all the way down here, I'm going to go ahead and say not found 00:55:09.538 --> 00:55:12.330 and not print out any number because I obviously haven't found one, 00:55:12.330 --> 00:55:15.880 and return one by convention to indicate failure. 00:55:15.880 --> 00:55:17.890 So let me open my terminal window. 00:55:17.890 --> 00:55:20.340 Let me do make phone book-- enter. 00:55:20.340 --> 00:55:23.010 So far, so good-- dot slash phone book. 00:55:23.010 --> 00:55:25.080 Let's search for Carter-- 00:55:25.080 --> 00:55:27.060 enter-- found his number. 00:55:27.060 --> 00:55:27.840 Let's do it again. 00:55:27.840 --> 00:55:28.757 Dot slash phone book-- 00:55:28.757 --> 00:55:30.030 David-- found it. 00:55:30.030 --> 00:55:32.490 Let's do it one more time for John-- 00:55:32.490 --> 00:55:33.240 found it. 00:55:33.240 --> 00:55:38.160 And just for good measure, let's do one other here like Eli-- 00:55:38.160 --> 00:55:40.920 enter-- not found in this case. 00:55:40.920 --> 00:55:43.710 All right, so it seems to be working based on this example here. 00:55:43.710 --> 00:55:48.630 But now we've actually implemented the idea of, like, a proper phone book. 00:55:48.630 --> 00:55:51.510 But does any aspect of the design of this code, 00:55:51.510 --> 00:55:54.060 even if you've never programmed before CS50, 00:55:54.060 --> 00:55:58.120 does anything rub you wrong about how we're storing our data-- 00:55:58.120 --> 00:56:02.690 the phone book itself-- these names and numbers? 00:56:02.690 --> 00:56:04.220 Does anything rub you the wrong way? 00:56:04.220 --> 00:56:06.008 Yeah, in back. 00:56:06.008 --> 00:56:09.326 AUDIENCE: [INAUDIBLE] 00:56:11.222 --> 00:56:13.320 DAVID MALAN: Yeah, really good observation. 00:56:13.320 --> 00:56:15.510 I'm separating the names and the numbers, 00:56:15.510 --> 00:56:17.775 which indeed, it looks a little bit weird. 00:56:17.775 --> 00:56:20.650 And there's actually this technical term in the world of programming, 00:56:20.650 --> 00:56:22.088 which is code smell, where, like-- 00:56:22.088 --> 00:56:25.380 [SNIFFING]---- something smells a little off about this code in the sense that, 00:56:25.380 --> 00:56:27.270 like, this probably doesn't end well, right? 00:56:27.270 --> 00:56:31.438 If I add a fourth name, a fourth number, a fifth name, a fiftieth name 00:56:31.438 --> 00:56:33.480 and number, like, at some point, they're probably 00:56:33.480 --> 00:56:34.813 going to get out of sync, right? 00:56:34.813 --> 00:56:37.110 So, like, there's something awry about this design 00:56:37.110 --> 00:56:39.870 that I shouldn't decouple the names from the numbers. 00:56:39.870 --> 00:56:42.300 So something kind of smells about this code, so to speak. 00:56:42.300 --> 00:56:44.760 And any time you perceive that in your code, 00:56:44.760 --> 00:56:48.150 it's probably an opportunity to go about improving it somehow. 00:56:48.150 --> 00:56:50.893 But to do that, we actually need another tool in the toolkit. 00:56:50.893 --> 00:56:53.560 And that is, again, this term that I've used a couple of times-- 00:56:53.560 --> 00:56:54.367 data structures. 00:56:54.367 --> 00:56:56.700 Like, arrays have been the first of our data structures. 00:56:56.700 --> 00:56:57.783 And they're so simplistic. 00:56:57.783 --> 00:57:01.110 It just means storing things back to back to back contiguously in memory. 00:57:01.110 --> 00:57:03.300 But it turns out C-- 00:57:03.300 --> 00:57:06.660 and a little bit of new syntax-- but it's not a lot of new syntax today-- 00:57:06.660 --> 00:57:09.060 a little bit of new syntax today will allow 00:57:09.060 --> 00:57:13.710 us to create our own data structures, our own types of variables, 00:57:13.710 --> 00:57:16.330 largely using syntax we've seen thus far. 00:57:16.330 --> 00:57:19.230 So to do this, let me propose that in order 00:57:19.230 --> 00:57:22.980 to represent a person in a phone book-- well, let's 00:57:22.980 --> 00:57:26.250 not just implement them as a list of names and a list of numbers. 00:57:26.250 --> 00:57:30.660 Wouldn't it be nice if C had a data type actually called person? 00:57:30.660 --> 00:57:34.560 Because if it did, then I could go about creating an array called-- 00:57:34.560 --> 00:57:36.690 using the pluralized form-- people-- 00:57:36.690 --> 00:57:40.470 containing my people in my phone book. 00:57:40.470 --> 00:57:44.070 And maybe a person has both a name and a number. 00:57:44.070 --> 00:57:46.690 And therefore, we can kind of keep everything together. 00:57:46.690 --> 00:57:47.740 So how can I do this? 00:57:47.740 --> 00:57:48.960 Well, what is a person? 00:57:48.960 --> 00:57:52.350 Well, a person, really, in this story is a person has a name. 00:57:52.350 --> 00:57:53.730 And a person has a number. 00:57:53.730 --> 00:57:58.500 So can we create a new data type that maybe has both of these together? 00:57:58.500 --> 00:58:02.280 Well, we actually can by using one piece of new syntax 00:58:02.280 --> 00:58:05.100 today, which is just this here. 00:58:05.100 --> 00:58:09.395 Using what's called a struct, we can create our own data structure 00:58:09.395 --> 00:58:11.020 that actually has some structure in it. 00:58:11.020 --> 00:58:13.150 It's not just one thing like a string or an int. 00:58:13.150 --> 00:58:14.320 Maybe it's two strings. 00:58:14.320 --> 00:58:15.190 Maybe it's two ints. 00:58:15.190 --> 00:58:16.210 Maybe it's one of each. 00:58:16.210 --> 00:58:21.380 So a structure can be a variable that contains any number of other variables, 00:58:21.380 --> 00:58:22.060 so to speak. 00:58:22.060 --> 00:58:24.310 And typedef is a cryptic keyword that just 00:58:24.310 --> 00:58:28.630 means define the following type-- invent the following data type for me. 00:58:28.630 --> 00:58:30.320 And the syntax is a little weird. 00:58:30.320 --> 00:58:32.620 But you say typedef struct curly brace. 00:58:32.620 --> 00:58:35.410 Inside of the curly braces, you put all of the types 00:58:35.410 --> 00:58:38.090 of variables you want to associate with this new data type. 00:58:38.090 --> 00:58:40.660 And then after the curly brace, you invent the name 00:58:40.660 --> 00:58:41.830 that you want to give it. 00:58:41.830 --> 00:58:45.550 And this will create a new data type in C called person, 00:58:45.550 --> 00:58:47.830 even though no one thought of this decades 00:58:47.830 --> 00:58:52.670 ago when C was created alongside of the ints and the floats and so forth. 00:58:52.670 --> 00:58:55.220 So how can I actually use this? 00:58:55.220 --> 00:58:58.330 Well, it turns out that once you have this building 00:58:58.330 --> 00:59:03.190 block of creating your very own data structures, I can go back into my code 00:59:03.190 --> 00:59:06.340 and improve it as follows in direct response to your concern 00:59:06.340 --> 00:59:10.752 about it seeming not ideal that we're decoupling the names and the numbers. 00:59:10.752 --> 00:59:13.210 Now, it's going to look a little more complicated at first. 00:59:13.210 --> 00:59:15.320 But it will scale better over time. 00:59:15.320 --> 00:59:17.380 So within my code here, I'm going to go ahead 00:59:17.380 --> 00:59:21.020 and essentially type out exactly that same data type. 00:59:21.020 --> 00:59:27.460 So define a structure that has inside of it a string called name and a string 00:59:27.460 --> 00:59:28.300 called number. 00:59:28.300 --> 00:59:30.700 And let's call this thing a person. 00:59:30.700 --> 00:59:34.270 So these new lines copied and pasted from the slide a moment ago just 00:59:34.270 --> 00:59:36.370 invent the data type called person. 00:59:36.370 --> 00:59:39.010 So the only thing we need today is the syntax via which 00:59:39.010 --> 00:59:43.610 we can set a person's name and number-- like, how can we access those values. 00:59:43.610 --> 00:59:47.920 So to do this, I'm going to go ahead and erase the hard-coded arrays 00:59:47.920 --> 00:59:49.250 that I had before. 00:59:49.250 --> 00:59:53.740 And I'm going to give myself one array of type person called people 00:59:53.740 --> 00:59:55.610 with room for three people. 00:59:55.610 --> 00:59:57.490 So this seems like a bit of a mouthful. 00:59:57.490 --> 00:59:59.560 But the new data type is called person. 00:59:59.560 --> 01:00:02.048 The array name is called people, which in English is weird. 01:00:02.048 --> 01:00:04.840 I mean, it could call it persons to make it a little more parallel. 01:00:04.840 --> 01:00:06.520 But we call them people, generally. 01:00:06.520 --> 01:00:09.880 And I want three such people in my phone book. 01:00:09.880 --> 01:00:12.670 Now, how do I actually initialize those people? 01:00:12.670 --> 01:00:16.030 Well, previously, I did something like this-- names, bracket, 0. 01:00:16.030 --> 01:00:18.520 And then I did numbers, bracket, 0. 01:00:18.520 --> 01:00:20.080 Well, it's a similar idea. 01:00:20.080 --> 01:00:22.630 I do people, bracket, 0. 01:00:22.630 --> 01:00:27.490 But if I want to set this person's name, the one new piece of syntax today 01:00:27.490 --> 01:00:32.290 is a period, a literal dot operator, that says go inside of this person 01:00:32.290 --> 01:00:36.400 and access their name field or their name attribute. 01:00:36.400 --> 01:00:39.190 And set it equal to, quote, unquote, "Carter." 01:00:39.190 --> 01:00:41.800 Then go into that same person. 01:00:41.800 --> 01:00:43.660 But go into their number field. 01:00:43.660 --> 01:00:48.368 And set that equal to plus 1, 617-495-1000. 01:00:48.368 --> 01:00:50.410 Then-- and I'll just separate it with white space 01:00:50.410 --> 01:00:53.500 for readability-- go into people bracket 1. 01:00:53.500 --> 01:00:56.170 Set their name into mine, David. 01:00:56.170 --> 01:00:58.390 Let's go into people bracket 1 number. 01:00:58.390 --> 01:01:02.000 Set that equal to the same, since we're both available through the directory, 01:01:02.000 --> 01:01:04.720 so 617-495-1000. 01:01:04.720 --> 01:01:10.390 And then lastly, let's go ahead and do people bracket 2 dot name equals, 01:01:10.390 --> 01:01:15.210 quote, unquote, "John," and then people bracket 2 dot number equals, quote, 01:01:15.210 --> 01:01:21.646 unquote, "plus 1, 949-468-2750-- 01:01:21.646 --> 01:01:23.548 let me fix my dash-- 01:01:23.548 --> 01:01:24.850 semicolon. 01:01:24.850 --> 01:01:29.010 And now I think I can mostly keep the rest of the code the same 01:01:29.010 --> 01:01:31.500 because if I'm searching for this person's name, 01:01:31.500 --> 01:01:35.160 I think the only thing I want to change is this because I don't have a names 01:01:35.160 --> 01:01:36.300 array anymore. 01:01:36.300 --> 01:01:40.440 So what should I change this highlighted phrase to in order 01:01:40.440 --> 01:01:44.930 to search the ith person's name? 01:01:44.930 --> 01:01:45.800 What should this be? 01:01:45.800 --> 01:01:47.870 Yeah? 01:01:47.870 --> 01:01:51.470 People bracket i dash name because the whole point of this loop 01:01:51.470 --> 01:01:54.330 is to iterate over each of these people one at a time. 01:01:54.330 --> 01:01:57.950 So people bracket i gives me the ith person-- first [INAUDIBLE] people 0, 01:01:57.950 --> 01:01:59.060 people 1, people 2. 01:01:59.060 --> 01:02:01.880 But if on each iteration, I want to check that person's name 01:02:01.880 --> 01:02:04.010 and compare it against whatever the human typed in, 01:02:04.010 --> 01:02:05.700 now I can do exactly that. 01:02:05.700 --> 01:02:12.350 But I have to change the output to be people bracket i dot number, 01:02:12.350 --> 01:02:13.560 in this case. 01:02:13.560 --> 01:02:15.838 So I've added a little bit of complexity. 01:02:15.838 --> 01:02:19.130 And granted, this is not going to be the way long term you create a phone book. 01:02:19.130 --> 01:02:21.660 Odds are, we're going to get the phone book with a loop of some sort. 01:02:21.660 --> 01:02:24.702 We're going to use a constant, so I know how many people I have room for. 01:02:24.702 --> 01:02:27.440 For demonstration sake, I'm just typing everything out manually. 01:02:27.440 --> 01:02:30.660 But I think logically now, we've achieved the same thing. 01:02:30.660 --> 01:02:33.050 So let me do make phone book for this new version-- 01:02:33.050 --> 01:02:35.180 no syntax errors-- dot slash phone book. 01:02:35.180 --> 01:02:36.770 Let's search for Carter-- enter. 01:02:36.770 --> 01:02:38.210 And there indeed is his number. 01:02:38.210 --> 01:02:39.710 Let's go ahead and search for David. 01:02:39.710 --> 01:02:40.860 There's my number. 01:02:40.860 --> 01:02:42.470 And lastly, let's search for John-- 01:02:42.470 --> 01:02:42.970 his number. 01:02:42.970 --> 01:02:45.387 And again, we'll search for someone we know is not there-- 01:02:45.387 --> 01:02:46.130 Eli. 01:02:46.130 --> 01:02:47.840 And Eli is not found. 01:02:47.840 --> 01:02:52.160 So what we've done is try to solve this problem of introducing a brand new data 01:02:52.160 --> 01:02:57.560 type that allows us to represent this custom data that you and I just 01:02:57.560 --> 01:02:58.220 created. 01:02:58.220 --> 01:03:02.100 And that is by using the struct keyword to cluster these things together 01:03:02.100 --> 01:03:05.960 and the typedef keyword to give it a brand new name that we might like. 01:03:05.960 --> 01:03:08.330 Now, as an aside, when it comes to styling your code, 01:03:08.330 --> 01:03:11.430 you'll actually see that style50 and similar tools will actually 01:03:11.430 --> 01:03:14.610 put the name of the data type on the same line as the closing curly brace, 01:03:14.610 --> 01:03:15.900 just sort of a curiosity. 01:03:15.900 --> 01:03:16.523 That's fine. 01:03:16.523 --> 01:03:18.690 Even though I wrote it the first way for consistency 01:03:18.690 --> 01:03:23.530 with Scratch and our prior examples, this is the right way in styling it. 01:03:23.530 --> 01:03:26.470 Any questions now on this data type? 01:03:26.470 --> 01:03:26.970 Yeah? 01:03:29.675 --> 01:03:33.042 AUDIENCE: [INAUDIBLE] 01:03:34.966 --> 01:03:36.980 DAVID MALAN: The question is, do you have 01:03:36.980 --> 01:03:39.590 to assign both the name and the number when creating a person? 01:03:39.590 --> 01:03:42.410 Or can you only get away with assigning one of them? 01:03:42.410 --> 01:03:43.200 You can. 01:03:43.200 --> 01:03:46.200 But that's going to be one of those so-called garbage values, typically. 01:03:46.200 --> 01:03:48.800 And so there's just going to be some bogus data there. 01:03:48.800 --> 01:03:53.270 And unless you are so careful as to never touch that field thereafter, 01:03:53.270 --> 01:03:56.630 you probably run the risk of some kind of bug, even a crash in your code, 01:03:56.630 --> 01:03:59.330 if you try to access that value later, even though you've never 01:03:59.330 --> 01:03:59.960 initialized it. 01:03:59.960 --> 01:04:01.100 So yes, it's possible. 01:04:01.100 --> 01:04:02.520 But no, do not do that. 01:04:02.520 --> 01:04:05.020 Other questions? 01:04:05.020 --> 01:04:05.890 No? 01:04:05.890 --> 01:04:08.470 All right, well, now that we have the ability 01:04:08.470 --> 01:04:11.570 to sort of represent more interesting structures, up until now, 01:04:11.570 --> 01:04:14.350 we've sort of assumed that in order to get to binary search, 01:04:14.350 --> 01:04:16.990 we have a phone book that someone already sorted for us. 01:04:16.990 --> 01:04:19.540 For our second demonstration with our volunteers, 01:04:19.540 --> 01:04:21.820 we assumed that someone, Carter in that case, 01:04:21.820 --> 01:04:24.010 had already sorted the information for us. 01:04:24.010 --> 01:04:26.050 It was proposed by the audience that, well, 01:04:26.050 --> 01:04:29.770 what if we sort the information first and then go find the number 50? 01:04:29.770 --> 01:04:33.580 That invited the question even early on-- well, how expensive is it to sort? 01:04:33.580 --> 01:04:38.170 How much time and money and inefficiency do Google and Microsoft and others 01:04:38.170 --> 01:04:41.590 spend to keep their data and our data sorted? 01:04:41.590 --> 01:04:43.840 Well, let's consider what the problem really is. 01:04:43.840 --> 01:04:47.380 If this is how we represent any problem, the unsorted data 01:04:47.380 --> 01:04:50.770 that we might consume by typing things in randomly to a search engine 01:04:50.770 --> 01:04:54.430 or crawling the internet or just adding context in any old order to our phone 01:04:54.430 --> 01:04:55.840 is arguably unsorted. 01:04:55.840 --> 01:04:58.060 It's certainly not alphabetically sorted by default. 01:04:58.060 --> 01:04:59.650 But we want to get it sorted. 01:04:59.650 --> 01:05:02.860 And so somewhere in here in this black box, we need a set of algorithms 01:05:02.860 --> 01:05:05.130 for actually sorting information as well. 01:05:05.130 --> 01:05:09.080 For instance, if we have these integers here unsorted-- 01:05:09.080 --> 01:05:14.130 72541603-- effectively random, the problem of sorting, of course, 01:05:14.130 --> 01:05:18.050 is to turn it into 01234567 instead. 01:05:18.050 --> 01:05:20.100 And there's a bunch of different ways to do this. 01:05:20.100 --> 01:05:22.730 But before we do that, I think it's probably time for some brownies. 01:05:22.730 --> 01:05:24.605 So let's go ahead and take a 10-minute break. 01:05:24.605 --> 01:05:26.570 And we'll see you in 10. 01:05:26.570 --> 01:05:28.040 All right, we are back. 01:05:28.040 --> 01:05:30.470 And where we left off was this cliffhanger. 01:05:30.470 --> 01:05:32.060 We've got some unsorted numbers. 01:05:32.060 --> 01:05:33.380 We want to make them sorted. 01:05:33.380 --> 01:05:34.430 How do we do this? 01:05:34.430 --> 01:05:37.250 And at the risk of one too many volunteers, 01:05:37.250 --> 01:05:39.680 could we get eight more volunteers? 01:05:39.680 --> 01:05:41.840 Wow, OK, overwhelming. 01:05:41.840 --> 01:05:45.110 OK, how about 1, 2, 3, 4. 01:05:45.110 --> 01:05:46.250 How about all three of you? 01:05:46.250 --> 01:05:51.290 OK, 5, 6, 7-- 01:05:51.290 --> 01:05:51.950 OK, 8. 01:05:51.950 --> 01:05:52.460 Come on. 01:05:52.460 --> 01:05:54.120 All right, one from each section-- 01:05:54.120 --> 01:05:56.244 all right, come on down. 01:05:56.244 --> 01:06:00.073 [INTERPOSING VOICES] 01:06:00.073 --> 01:06:01.615 DAVID MALAN: All right, come on down. 01:06:05.410 --> 01:06:06.100 Thank you. 01:06:06.100 --> 01:06:10.695 And if you all could grab a number here. 01:06:10.695 --> 01:06:11.320 So you'll be 7. 01:06:11.320 --> 01:06:13.220 Stand on the left there if you could. 01:06:13.220 --> 01:06:14.470 All right, you'll be number 2. 01:06:14.470 --> 01:06:17.590 Stand to his left. 01:06:17.590 --> 01:06:20.020 OK, keep coming. 01:06:20.020 --> 01:06:21.910 OK, yeah. 01:06:21.910 --> 01:06:24.380 OK, here we go. 01:06:24.380 --> 01:06:26.530 There you go. 01:06:26.530 --> 01:06:28.210 OK, [INAUDIBLE] 0 and 3. 01:06:28.210 --> 01:06:31.810 OK, so let's just make sure they match what we've got there. 01:06:31.810 --> 01:06:32.810 Good so far. 01:06:32.810 --> 01:06:36.313 OK, so we have eight volunteers here, an array of volunteers, if you would. 01:06:36.313 --> 01:06:38.230 This time, we've used eight rather than seven. 01:06:38.230 --> 01:06:41.840 We deliberately had seven lockers just so that the divide by 2 math 01:06:41.840 --> 01:06:42.340 worked out. 01:06:42.340 --> 01:06:44.890 That was very deliberate that there was always a [INAUDIBLE] a middle, 01:06:44.890 --> 01:06:45.682 a middle, a middle. 01:06:45.682 --> 01:06:48.682 In this case, it doesn't matter because we're going to focus on sorting. 01:06:48.682 --> 01:06:51.130 But first, how about some introductions from each group? 01:06:51.130 --> 01:06:52.520 AUDIENCE: I'm Rebecca. 01:06:52.520 --> 01:06:53.962 I'm a first year in Pennypacker. 01:06:53.962 --> 01:06:56.170 And I'm thinking about studying environmental science 01:06:56.170 --> 01:06:57.410 and public policy. 01:06:57.410 --> 01:06:57.910 [CHEERING] 01:06:57.910 --> 01:06:59.240 DAVID MALAN: Nice. 01:06:59.240 --> 01:07:00.100 [APPLAUSE] 01:07:00.100 --> 01:07:02.420 AUDIENCE: I'm [? Mariella. ?] I'm also in Pennypacker. 01:07:02.420 --> 01:07:03.170 I'm a first year. 01:07:03.170 --> 01:07:05.020 And I'm thinking of studying bioengineering. 01:07:05.020 --> 01:07:06.565 DAVID MALAN: Wonderful. 01:07:06.565 --> 01:07:09.440 AUDIENCE: My name's [? Haron ?] [? Li. ?] I'm a freshman in Matthews. 01:07:09.440 --> 01:07:11.015 I'm planning on studying applied mathematics. 01:07:11.015 --> 01:07:12.290 DAVID MALAN: Nice-- Matthews. 01:07:12.290 --> 01:07:13.910 AUDIENCE: My name is Emily. 01:07:13.910 --> 01:07:15.140 I'm a first year in Canada. 01:07:15.140 --> 01:07:17.470 And I'm still deciding what to study. 01:07:17.470 --> 01:07:18.220 DAVID MALAN: Fair. 01:07:18.220 --> 01:07:21.020 AUDIENCE: My name's [? Tanai. ?] I'm a first year from Toronto. 01:07:21.020 --> 01:07:23.315 And I'm planning on studying ECON and CS. 01:07:23.315 --> 01:07:24.110 DAVID MALAN: Nice. 01:07:24.110 --> 01:07:26.030 AUDIENCE: My name is Teddy. 01:07:26.030 --> 01:07:27.320 I'm a first year in Hurlbut. 01:07:27.320 --> 01:07:30.320 And I'm planning on concentrating in computer science with linguistics. 01:07:30.320 --> 01:07:31.070 DAVID MALAN: Nice. 01:07:31.070 --> 01:07:32.000 AUDIENCE: Yeah! 01:07:32.000 --> 01:07:32.750 DAVID MALAN: Nice. 01:07:32.750 --> 01:07:33.690 [APPLAUSE] 01:07:33.690 --> 01:07:36.470 AUDIENCE: My name's [? Lenny. ?] I'm a first year in Matthews. 01:07:36.470 --> 01:07:39.300 And I'm planning on concentrating in gov and CS. 01:07:39.300 --> 01:07:40.220 DAVID MALAN: Ah, nice. 01:07:40.220 --> 01:07:40.565 [CHEERING] 01:07:40.565 --> 01:07:41.037 [APPLAUSE] 01:07:41.037 --> 01:07:41.537 And? 01:07:41.537 --> 01:07:42.740 AUDIENCE: My name is Eli. 01:07:42.740 --> 01:07:44.023 I'm a first year in Hollis. 01:07:44.023 --> 01:07:45.440 And I plan on concentrating in CS. 01:07:45.440 --> 01:07:47.540 DAVID MALAN: Eli, we keep looking for you today. 01:07:47.540 --> 01:07:49.962 OK, so if you guys could scooch a little bit this way just 01:07:49.962 --> 01:07:51.170 to be a little more centered. 01:07:51.170 --> 01:07:54.750 Notice that this array of volunteers is entirely unsorted. 01:07:54.750 --> 01:07:56.960 So we thought we'd do three passes at this. 01:07:56.960 --> 01:08:00.135 Could you all sort yourselves from smallest to largest? 01:08:00.135 --> 01:08:00.635 Go. 01:08:06.420 --> 01:08:08.190 All right, that was very good. 01:08:08.190 --> 01:08:09.420 So yes, so, sure. 01:08:09.420 --> 01:08:10.003 OK. 01:08:10.003 --> 01:08:10.503 [CHEERING] 01:08:10.503 --> 01:08:12.920 [APPLAUSE] 01:08:12.920 --> 01:08:14.420 And let's see. 01:08:14.420 --> 01:08:17.603 What was your algorithm? 01:08:17.603 --> 01:08:19.520 AUDIENCE: I just kind of found the person that 01:08:19.520 --> 01:08:21.260 was, like, one lower or one higher. 01:08:21.260 --> 01:08:22.560 So I looked at him. 01:08:22.560 --> 01:08:23.060 He had 2. 01:08:23.060 --> 01:08:24.500 So I knew I had to be on his left. 01:08:24.500 --> 01:08:25.729 And then she had 3. 01:08:25.729 --> 01:08:27.410 So I told her to come to my right. 01:08:27.410 --> 01:08:28.467 And then I knew he had 5. 01:08:28.467 --> 01:08:30.050 DAVID MALAN: OK, nice-- pretty clever. 01:08:30.050 --> 01:08:31.640 And how about your algorithm? 01:08:31.640 --> 01:08:34.935 AUDIENCE: Same thing-- looked for the number that was lower and higher 01:08:34.935 --> 01:08:35.810 and found the middle. 01:08:35.810 --> 01:08:38.029 DAVID MALAN: OK, interesting, because I think from the outside view, 01:08:38.029 --> 01:08:39.638 it all seemed very organic. 01:08:39.638 --> 01:08:42.180 And things just kind of worked themselves out, which is fine. 01:08:42.180 --> 01:08:44.930 But I dare say what you guys did was probably a little hard for me 01:08:44.930 --> 01:08:46.410 to translate into code. 01:08:46.410 --> 01:08:48.560 So let me propose that we take two passes at this. 01:08:48.560 --> 01:08:51.380 If you could reset yourselves to be in that order from left 01:08:51.380 --> 01:08:54.920 to right, which is just the exact same sequence, just so that we're 01:08:54.920 --> 01:08:58.310 starting from the same point each time. 01:08:58.310 --> 01:08:59.630 [INAUDIBLE] 01:08:59.630 --> 01:09:00.319 Very good. 01:09:00.319 --> 01:09:02.120 All right, so let me propose this. 01:09:02.120 --> 01:09:04.470 We can approach sorting in a couple of different ways. 01:09:04.470 --> 01:09:05.720 But it needs to be methodical. 01:09:05.720 --> 01:09:09.600 Like, it needs to translate ideally to pseudocode and eventually code. 01:09:09.600 --> 01:09:12.810 So as much as we can quantize things to be step by step, 01:09:12.810 --> 01:09:15.100 I think the better this will scale overall, 01:09:15.100 --> 01:09:19.020 especially when there's not eight people but maybe there's 80 people or 800. 01:09:19.020 --> 01:09:22.020 Because I dare say that if we had everyone in the room sort themselves-- 01:09:22.020 --> 01:09:23.580 if they were all handling a number-- 01:09:23.580 --> 01:09:26.062 like, that probably wouldn't have worked out very well. 01:09:26.062 --> 01:09:29.229 It probably would have taken forever because there's just so much more data. 01:09:29.229 --> 01:09:30.700 So let's be a little more methodical. 01:09:30.700 --> 01:09:33.283 So for instance, I don't have a bird's eye view of the numbers 01:09:33.283 --> 01:09:35.130 just as before because even though we don't 01:09:35.130 --> 01:09:38.460 have doors in front of our volunteers, they're effectively lockers too. 01:09:38.460 --> 01:09:42.040 And the computer or the human in my case can only look at one door at a time. 01:09:42.040 --> 01:09:44.880 So if I want to find the smallest number and put it 01:09:44.880 --> 01:09:46.830 where it should go on the left, I can't just 01:09:46.830 --> 01:09:49.500 take a step back and be like, OK, obviously, there's the one. 01:09:49.500 --> 01:09:51.010 I have to do it more methodically. 01:09:51.010 --> 01:09:52.350 So I'm going to check here-- 01:09:52.350 --> 01:09:52.890 7. 01:09:52.890 --> 01:09:55.320 At the moment, this is actually the smallest number I've seen. 01:09:55.320 --> 01:09:56.945 So I'm going to make mental note of it. 01:09:56.945 --> 01:09:58.230 OK, 2 I see now. 01:09:58.230 --> 01:10:01.140 I can forget about the 7 because 2 is clearly smaller. 01:10:01.140 --> 01:10:02.850 5-- I don't need to remember it. 01:10:02.850 --> 01:10:04.200 4-- I don't need to remember it. 01:10:04.200 --> 01:10:06.330 1-- OK, that's clearly smaller. 01:10:06.330 --> 01:10:08.560 Have I found my smallest number? 01:10:08.560 --> 01:10:10.337 Not even because there actually is a 0. 01:10:10.337 --> 01:10:11.920 So I should go through the whole list. 01:10:11.920 --> 01:10:14.905 But I will remember that my smallest element is now 1. 01:10:14.905 --> 01:10:18.190 6 is not smaller-- oh, 0 is smaller, so I'll remember this. 01:10:18.190 --> 01:10:19.400 3 is not smaller. 01:10:19.400 --> 01:10:21.790 And so now, our volunteer for 0-- what was your name? 01:10:21.790 --> 01:10:22.707 AUDIENCE: [INAUDIBLE]. 01:10:22.707 --> 01:10:26.120 DAVID MALAN: Mariana, and where should we put you clearly? 01:10:26.120 --> 01:10:26.665 So there. 01:10:26.665 --> 01:10:27.790 But what's your name again? 01:10:27.790 --> 01:10:28.665 AUDIENCE: [INAUDIBLE] 01:10:28.665 --> 01:10:30.010 DAVID MALAN: Eli's in the way. 01:10:30.010 --> 01:10:31.745 So we could have Mary Ellen? 01:10:31.745 --> 01:10:32.620 AUDIENCE: [INAUDIBLE] 01:10:32.620 --> 01:10:35.860 DAVID MALAN: [? Mariella ?] just go to the right of Eli. 01:10:35.860 --> 01:10:37.390 But that's kind of cheating, right? 01:10:37.390 --> 01:10:40.570 Because if this is an array, recall that they are contiguous. 01:10:40.570 --> 01:10:43.578 But there could be other stuff in memory to the left and to the right. 01:10:43.578 --> 01:10:44.620 So that's not quite fair. 01:10:44.620 --> 01:10:47.110 We can't just have the luxury of putting things wherever we want. 01:10:47.110 --> 01:10:49.120 But Eli, you're not even in the right order, anyway. 01:10:49.120 --> 01:10:50.620 So why don't we just swap you two. 01:10:50.620 --> 01:10:52.720 So [? Mariella ?] and Eli swap. 01:10:52.720 --> 01:10:55.780 But now I've taken one bite out of this problem. 01:10:55.780 --> 01:10:57.910 I've coincidentally made it a little better 01:10:57.910 --> 01:11:00.760 by moving Eli closer to where he is. 01:11:00.760 --> 01:11:01.510 But you know what? 01:11:01.510 --> 01:11:02.830 He was in a random location, anyway. 01:11:02.830 --> 01:11:05.372 So I don't think I made the problem any worse, fundamentally. 01:11:05.372 --> 01:11:06.820 Now I can do this again. 01:11:06.820 --> 01:11:08.870 And I can shave a little bit of time off of it 01:11:08.870 --> 01:11:11.600 because I don't need to revisit [? Mariella ?] because if she 01:11:11.600 --> 01:11:14.300 was the smaller and I've already selected her from the array, 01:11:14.300 --> 01:11:17.600 I can just move on and take one fewer bites this time around. 01:11:17.600 --> 01:11:20.930 So 2 is the smallest number, not 5, not 4. 01:11:20.930 --> 01:11:22.620 OK, 1 is the smallest number. 01:11:22.620 --> 01:11:27.320 So I'm going to remember that as with a mental variable- 6, 7, 3. 01:11:27.320 --> 01:11:28.430 OK, 1 is the smallest. 01:11:28.430 --> 01:11:29.930 So let me select number 1. 01:11:29.930 --> 01:11:32.270 And we're going to have to evict you, which 01:11:32.270 --> 01:11:34.080 is making the problem slightly worse. 01:11:34.080 --> 01:11:35.780 But I think it'll average out. 01:11:35.780 --> 01:11:38.490 And now 0 and 1 are in the right place. 01:11:38.490 --> 01:11:40.160 So now let me do this again but faster. 01:11:40.160 --> 01:11:42.230 So 5-- OK, 4-- 01:11:42.230 --> 01:11:43.340 2 is the smallest-- 01:11:43.340 --> 01:11:44.630 6, 7, 3. 01:11:44.630 --> 01:11:48.110 OK, 2, let's put you where you belong, evicting 5. 01:11:48.110 --> 01:11:50.900 Now I can skip all three of these volunteers. 01:11:50.900 --> 01:11:52.080 OK, 4, is the smallest. 01:11:52.080 --> 01:11:52.580 Nope. 01:11:52.580 --> 01:11:52.910 Nope. 01:11:52.910 --> 01:11:53.120 Nope. 01:11:53.120 --> 01:11:54.200 3, you're the smallest. 01:11:54.200 --> 01:11:55.880 Let me evict 4. 01:11:55.880 --> 01:11:59.330 All right, and now let me move in front of these volunteers. 01:11:59.330 --> 01:12:02.250 5, 6, 7, 4, come on back. 01:12:02.250 --> 01:12:04.910 All right, and now let's select 6, 7, 5. 01:12:04.910 --> 01:12:07.140 OK, come on back. 01:12:07.140 --> 01:12:08.340 Oh, no, no cheating, Eli. 01:12:08.340 --> 01:12:09.630 OK, and then let's see. 01:12:09.630 --> 01:12:13.020 5, 7, 6-- OK, 6, come on back. 01:12:13.020 --> 01:12:14.250 OK, now you have to move. 01:12:14.250 --> 01:12:18.847 OK, and now we've selected in turn all of the numbers from left to right. 01:12:18.847 --> 01:12:20.430 And even though that did feel slower-- 01:12:20.430 --> 01:12:22.890 I was doing it a little more verbally as I stepped through it-- 01:12:22.890 --> 01:12:25.510 but I dare say we could translate that probably more to code. 01:12:25.510 --> 01:12:26.010 Why? 01:12:26.010 --> 01:12:27.690 Because there's a lot of, like, looping through it 01:12:27.690 --> 01:12:30.270 and probably a lot of conditionals just asking the question, 01:12:30.270 --> 01:12:34.800 is this number smaller than this one or conversely greater than this other one. 01:12:34.800 --> 01:12:36.520 All right, let's do this one more time. 01:12:36.520 --> 01:12:41.010 If you guys could reset yourselves to this ordering. 01:12:41.010 --> 01:12:46.720 All right, so we again have 72541603. 01:12:46.720 --> 01:12:47.220 Good. 01:12:47.220 --> 01:12:48.480 So how about this? 01:12:48.480 --> 01:12:50.550 I liked the intuition that they originally 01:12:50.550 --> 01:12:53.520 had, funny enough, whereby they just kind of organically 01:12:53.520 --> 01:12:55.570 looked to the person to the left and to the right 01:12:55.570 --> 01:12:56.820 and kind of fixed the problem. 01:12:56.820 --> 01:12:59.970 So if they were out of order, they just kind of swapped locally adjacent 01:12:59.970 --> 01:13:00.610 to each other. 01:13:00.610 --> 01:13:01.500 So let's try this. 01:13:01.500 --> 01:13:03.670 So 7 and 2, you're clearly out of order. 01:13:03.670 --> 01:13:05.340 So let's swap just you two. 01:13:05.340 --> 01:13:06.870 7 and 5, you're out of order. 01:13:06.870 --> 01:13:08.190 Let's swap you two. 01:13:08.190 --> 01:13:10.290 7 and 4, let's swap you two. 01:13:10.290 --> 01:13:12.390 7 and 1, let's swap you two. 01:13:12.390 --> 01:13:15.360 7 and 6, let's swap you two. 01:13:15.360 --> 01:13:17.220 7 and 0, swap you two. 01:13:17.220 --> 01:13:18.900 7 and 3, swap you two. 01:13:18.900 --> 01:13:20.620 OK, that was a lot of swapping. 01:13:20.620 --> 01:13:22.560 But notice what happened. 01:13:22.560 --> 01:13:23.790 I'm not done, clearly. 01:13:23.790 --> 01:13:24.780 It's not sorted yet. 01:13:24.780 --> 01:13:26.230 But I have improved the situation. 01:13:26.230 --> 01:13:26.730 How? 01:13:26.730 --> 01:13:29.910 Who is now definitely in the right place? 01:13:29.910 --> 01:13:34.510 So, 7-- or Eli has wonderfully bubbled all the way up to the top of the list, 01:13:34.510 --> 01:13:35.170 so to speak. 01:13:35.170 --> 01:13:37.030 Now I can actually skip him moving forward. 01:13:37.030 --> 01:13:38.820 So that takes one bite out of the problem. 01:13:38.820 --> 01:13:39.653 Let's do this again. 01:13:39.653 --> 01:13:40.440 2 and 5 are OK. 01:13:40.440 --> 01:13:42.285 5 and 4, let's swap. 01:13:42.285 --> 01:13:43.020 No, over there. 01:13:43.020 --> 01:13:45.150 [LAUGHS] Thank you. 01:13:45.150 --> 01:13:46.950 5 and 1, let's swap. 01:13:46.950 --> 01:13:48.420 5 and 6 are good. 01:13:48.420 --> 01:13:50.220 6 and 0, let's swap. 01:13:50.220 --> 01:13:52.158 6 and 3, let's swap. 01:13:52.158 --> 01:13:53.700 And we don't have to worry about Eli. 01:13:53.700 --> 01:13:54.720 And now, what's your name again? 01:13:54.720 --> 01:13:54.960 AUDIENCE: [? Haron. ?] 01:13:54.960 --> 01:13:57.700 DAVID MALAN: [? Haron-- ?] we don't have to worry about him either as well. 01:13:57.700 --> 01:13:59.283 So now I can go back to the beginning. 01:13:59.283 --> 01:14:02.460 And even though it feels like I'm going back and forth a lot, 01:14:02.460 --> 01:14:05.257 that's OK because the problem's still getting better and better. 01:14:05.257 --> 01:14:06.840 I'm taking a bite out of it each time. 01:14:06.840 --> 01:14:07.770 2 and 4 are good. 01:14:07.770 --> 01:14:10.140 4 and 1, let's swap. 01:14:10.140 --> 01:14:11.130 4 and 5 are good. 01:14:11.130 --> 01:14:12.420 5 and 0, swap. 01:14:12.420 --> 01:14:13.950 5 and 3, swap. 01:14:13.950 --> 01:14:15.750 And now these three are in the right place. 01:14:15.750 --> 01:14:16.500 Let's do it again. 01:14:16.500 --> 01:14:17.670 2 and 1-- ah, here we go. 01:14:17.670 --> 01:14:18.330 Let's swap. 01:14:18.330 --> 01:14:19.200 2 and 4 are OK. 01:14:19.200 --> 01:14:20.670 4 and 0, swap. 01:14:20.670 --> 01:14:22.110 4 and 3, swap. 01:14:22.110 --> 01:14:23.820 Now, these four are OK. 01:14:23.820 --> 01:14:24.990 1 and 2, you're good. 01:14:24.990 --> 01:14:26.280 2 and 0, swap. 01:14:26.280 --> 01:14:27.840 2 and 3, you're good. 01:14:27.840 --> 01:14:28.530 You're good. 01:14:28.530 --> 01:14:30.790 OK, now 1 and 0, swap-- 01:14:30.790 --> 01:14:33.190 1 and 2 and now 0 and 1. 01:14:33.190 --> 01:14:36.027 A round of applause if we could for our volunteers. 01:14:36.027 --> 01:14:37.338 [APPLAUSE] 01:14:37.338 --> 01:14:38.622 [CHEERING] 01:14:38.622 --> 01:14:40.830 So if you want to put your numbers on the tray there, 01:14:40.830 --> 01:14:44.820 we have some lovely Super Mario Oreos today, which 01:14:44.820 --> 01:14:46.650 maybe drives a lot of the volunteerism. 01:14:46.650 --> 01:14:48.930 But here we go. 01:14:48.930 --> 01:14:51.180 Thank you all so much. 01:14:51.180 --> 01:14:54.060 And maybe, Carter, if you can help with the reset? 01:14:54.060 --> 01:14:56.020 All right, here we go. 01:14:56.020 --> 01:14:57.840 [INAUDIBLE] yes, all set. 01:14:57.840 --> 01:14:58.740 Thank you very much. 01:14:58.740 --> 01:14:59.430 Thank you. 01:14:59.430 --> 01:15:02.340 Thank you, guys. 01:15:02.340 --> 01:15:04.290 Thank you, yes. 01:15:04.290 --> 01:15:08.880 To recap, let's actually formalize a little more algorithmically 01:15:08.880 --> 01:15:09.573 what we did. 01:15:09.573 --> 01:15:11.490 And I deliberately kind of orchestrated things 01:15:11.490 --> 01:15:14.280 there to show two fundamentally different approaches, one 01:15:14.280 --> 01:15:18.210 where I kind of selected the element I wanted again and again 01:15:18.210 --> 01:15:19.770 on the basis of how small it was. 01:15:19.770 --> 01:15:22.210 I looked for the smallest, then the next smallest, and so forth. 01:15:22.210 --> 01:15:24.335 The second time around, I took a different approach 01:15:24.335 --> 01:15:26.010 by just fixing local problems. 01:15:26.010 --> 01:15:28.710 But I did it again and again and again until I 01:15:28.710 --> 01:15:30.408 fixed all of the minor problems. 01:15:30.408 --> 01:15:32.700 And frankly, what they did organically at the beginning 01:15:32.700 --> 01:15:36.167 was probably closer to the second algorithm than the first, 01:15:36.167 --> 01:15:39.250 even though I'm not sure they would write down the same pseudocode for it. 01:15:39.250 --> 01:15:43.510 The first algorithm we executed is actually called selection sort. 01:15:43.510 --> 01:15:46.060 And I deliberately used that vernacular of selecting 01:15:46.060 --> 01:15:50.590 the smallest element again and again to evoke this name of the algorithm. 01:15:50.590 --> 01:15:54.460 So for instance, when we had these numbers here initially unsorted, 01:15:54.460 --> 01:15:58.900 I kept looking again and again and again for the smallest element. 01:15:58.900 --> 01:16:02.380 And I don't know a priori what the smallest number is until I 01:16:02.380 --> 01:16:04.510 go through the list at least once. 01:16:04.510 --> 01:16:06.190 Then I can pluck out the 0. 01:16:06.190 --> 01:16:10.360 I then go through the list a second time to pluck out the 1, a third time 01:16:10.360 --> 01:16:11.560 to pluck out the 2. 01:16:11.560 --> 01:16:14.380 Now, this assumes that I don't have an infinite amount of memory 01:16:14.380 --> 01:16:17.920 because even though I kept repeating myself looking for the next smallest 01:16:17.920 --> 01:16:20.800 element, next smallest element, I propose that I only 01:16:20.800 --> 01:16:25.060 keep track of one number at a time, one variable in my mind, which 01:16:25.060 --> 01:16:28.330 was the smallest element I have seen thus far. 01:16:28.330 --> 01:16:32.080 If I used more memory, I could probably remember from the first pass 01:16:32.080 --> 01:16:33.580 where the 2 is, where the 3 is. 01:16:33.580 --> 01:16:34.955 But that's a different algorithm. 01:16:34.955 --> 01:16:37.180 And it would take more space, more memory. 01:16:37.180 --> 01:16:41.390 I was confining myself to just one variable in my approach. 01:16:41.390 --> 01:16:44.800 So here might be the pseudocode for what we'd call selection 01:16:44.800 --> 01:16:48.490 sort, the very first algorithm that we all did together, not organically, 01:16:48.490 --> 01:16:50.350 but more methodically as a group. 01:16:50.350 --> 01:16:54.220 I would propose that we use some syntax from C when we talk about the loop 01:16:54.220 --> 01:16:57.790 and say for i from 0 to n minus 1. 01:16:57.790 --> 01:16:58.600 Now, why is that? 01:16:58.600 --> 01:17:02.980 Well, if there were n people or 8, 0 through 7 01:17:02.980 --> 01:17:07.040 are the indexes or indices of those humans on stage from left to right. 01:17:07.040 --> 01:17:08.620 So what did I have myself do? 01:17:08.620 --> 01:17:15.190 Find the smallest number between numbers bracket i and numbers bracket 01:17:15.190 --> 01:17:16.390 n minus 1. 01:17:16.390 --> 01:17:20.710 So when the loop starts at 0, this is literally 01:17:20.710 --> 01:17:25.750 saying between numbers bracket 0 and numbers bracket n minus 1. 01:17:25.750 --> 01:17:30.490 So from the far left to the far right, find me the smallest number. 01:17:30.490 --> 01:17:34.300 Then swap that number with numbers bracket i. 01:17:34.300 --> 01:17:36.880 So that's why we evicted the person all the way 01:17:36.880 --> 01:17:39.970 over to the right or your left the very first time, 01:17:39.970 --> 01:17:42.650 and then the next person, and the next person, and so forth. 01:17:42.650 --> 01:17:46.990 So this is an algorithm that has us go back and forth, back and forth, 01:17:46.990 --> 01:17:49.250 iteratively selecting the smallest element. 01:17:49.250 --> 01:17:53.980 So if we generalize this now, not away from eight people to, like, n people, 01:17:53.980 --> 01:17:56.800 you can think of them as representing an array, a.k.a. 01:17:56.800 --> 01:18:01.180 doors like this where the leftmost one is 0, the rightmost one is n minus 1, 01:18:01.180 --> 01:18:04.540 second to last is n minus 2, and so forth if we don't know or care 01:18:04.540 --> 01:18:06.250 what n specifically is. 01:18:06.250 --> 01:18:11.980 So how many total steps does selection sort perhaps take? 01:18:11.980 --> 01:18:13.930 And let's make this a little more real here. 01:18:13.930 --> 01:18:17.990 Let me actually open up, for instance, a quick visualization here. 01:18:17.990 --> 01:18:19.930 And on the screen here, you'll just see now 01:18:19.930 --> 01:18:24.070 an artist's rendition of an array of values whereby tall purple bars 01:18:24.070 --> 01:18:25.390 represent big integers. 01:18:25.390 --> 01:18:28.310 And short purple bars represent small integers. 01:18:28.310 --> 01:18:31.690 So the idea of any sorting algorithm here as visualized 01:18:31.690 --> 01:18:34.600 is to get the small bars on the left and the big bars on the right. 01:18:34.600 --> 01:18:36.550 And this is a nice handy tool that lets you play around 01:18:36.550 --> 01:18:37.700 with different algorithms. 01:18:37.700 --> 01:18:39.970 So here is selection sort that I've just clicked. 01:18:39.970 --> 01:18:42.760 In pink again and again is the equivalent 01:18:42.760 --> 01:18:46.210 of me walking through the volunteers looking for the next smallest element. 01:18:46.210 --> 01:18:50.260 And as soon as I found them, I swapped them into their leftmost location, 01:18:50.260 --> 01:18:54.250 evicting whoever's there in order to gradually sort 01:18:54.250 --> 01:18:56.480 this list from left to right. 01:18:56.480 --> 01:19:00.400 And so as you can see here, it holds very briefly in pink 01:19:00.400 --> 01:19:03.250 whatever the currently smallest element it has found is. 01:19:03.250 --> 01:19:05.830 That's the analog of, like, me pointing to my head 01:19:05.830 --> 01:19:09.940 whereby it's constantly comparing, comparing, comparing, 01:19:09.940 --> 01:19:12.423 looking for the next smallest element. 01:19:12.423 --> 01:19:15.340 Now, I'm kind of stalling because I'm running out of intelligent words 01:19:15.340 --> 01:19:15.840 to say. 01:19:15.840 --> 01:19:18.190 But that is to say, this algorithm feels kind of slow. 01:19:18.190 --> 01:19:20.590 Like, it seems to be doing a lot of work. 01:19:20.590 --> 01:19:23.170 Where is the work coming in? 01:19:23.170 --> 01:19:26.060 Like, what is it doing a lot of specifically? 01:19:26.060 --> 01:19:27.118 Yeah? 01:19:27.118 --> 01:19:29.410 Yeah, it keeps going back and forth and back and forth. 01:19:29.410 --> 01:19:31.090 And even though it's shaving a little bit of time, 01:19:31.090 --> 01:19:33.670 right, because it doesn't have to stupidly go all the way back 01:19:33.670 --> 01:19:35.920 to the beginning, and I was saving myself a few steps, 01:19:35.920 --> 01:19:38.680 like, it's a lot of cyclicity again and again and again. 01:19:38.680 --> 01:19:41.500 And put another way, it's a lot of comparisons again and again. 01:19:41.500 --> 01:19:44.560 And some of those comparisons you're doing multiple times 01:19:44.560 --> 01:19:47.890 because I only remembered one element at a time in my head. 01:19:47.890 --> 01:19:50.950 So I have to kind of remind myself on every pass which 01:19:50.950 --> 01:19:52.640 is smallest, which is smallest. 01:19:52.640 --> 01:19:57.760 So this invites the question how fast or how slow or equivalently 01:19:57.760 --> 01:20:02.350 how efficient or inefficient is something like bubble-- 01:20:02.350 --> 01:20:03.400 selection sort. 01:20:03.400 --> 01:20:07.790 Well, let's actually consider how we could analyze this as follows. 01:20:07.790 --> 01:20:12.790 So if we have n numbers that we want to sort, 01:20:12.790 --> 01:20:15.730 how many comparisons do we do the first time? 01:20:15.730 --> 01:20:19.280 Well, if there's n numbers, you can only make n minus 1 comparisons. 01:20:19.280 --> 01:20:19.780 Why? 01:20:19.780 --> 01:20:21.550 Because if we have eight people here and we 01:20:21.550 --> 01:20:23.440 started with whoever's all the way over here, 01:20:23.440 --> 01:20:27.890 we compare this person against seven others-- n minus 1 if n is 8. 01:20:27.890 --> 01:20:32.000 So the first pass through the list, I made n minus 1 comparisons. 01:20:32.000 --> 01:20:34.950 But that put the smallest number 0 in place. 01:20:34.950 --> 01:20:37.547 The second time I walked across our eight volunteers, 01:20:37.547 --> 01:20:39.380 I didn't need to walk in front of all eight. 01:20:39.380 --> 01:20:41.630 I could shave off a little bit and do seven of them, 01:20:41.630 --> 01:20:43.460 then six, then five, then four. 01:20:43.460 --> 01:20:45.710 So if I were to write this out roughly mathematically, 01:20:45.710 --> 01:20:50.630 I could do this-- n minus 1 plus n minus 2 plus n minus 3 plus 01:20:50.630 --> 01:20:54.020 dot, dot, dot, all the way down to my very last comparison 01:20:54.020 --> 01:20:55.260 at the end of the list. 01:20:55.260 --> 01:20:57.320 Now, this is the kind of thing that typically in high school 01:20:57.320 --> 01:20:59.612 you'd look at the back of the math book or physics book 01:20:59.612 --> 01:21:02.960 that's got a little cheat sheet for all of the formulas that add up. 01:21:02.960 --> 01:21:07.160 This series here, let me just stipulate, adds up to this-- n times 01:21:07.160 --> 01:21:09.260 n minus 1 divided by 2. 01:21:09.260 --> 01:21:12.770 So no matter what n is, this formula captures that series, 01:21:12.770 --> 01:21:14.760 that summation of all of those values. 01:21:14.760 --> 01:21:18.830 So that is how many steps I took again and again 01:21:18.830 --> 01:21:22.070 and again when implementing selection sort for eight people. 01:21:22.070 --> 01:21:24.000 So of course, let's multiply this out. 01:21:24.000 --> 01:21:27.013 So this is like n squared minus n all divided by 2. 01:21:27.013 --> 01:21:28.430 Let's do it out a little bit more. 01:21:28.430 --> 01:21:31.610 That's n squared divided by 2 minus n over 2. 01:21:31.610 --> 01:21:34.913 And now we're back into the territory of running times. 01:21:34.913 --> 01:21:36.830 Like, how many steps does this algorithm take? 01:21:36.830 --> 01:21:39.080 How many comparisons are we making? 01:21:39.080 --> 01:21:42.950 Now, n squared seems like the biggest term. 01:21:42.950 --> 01:21:44.250 It's the dominant term. 01:21:44.250 --> 01:21:48.110 In other words, as n gets large-- not eight but 80 or 800 01:21:48.110 --> 01:21:53.000 or 8,000 or 8 million-- squaring that is going to make a way bigger difference 01:21:53.000 --> 01:21:55.880 than just doing, like, n divided by 2 and subtracting that off. 01:21:55.880 --> 01:21:59.300 Similarly, just dividing even this quadratic formula by 2, 01:21:59.300 --> 01:22:01.460 like, yes, it's going to halve it literally. 01:22:01.460 --> 01:22:03.410 But that's kind of a drop in the bucket. 01:22:03.410 --> 01:22:05.930 As n gets larger and larger and larger, it's 01:22:05.930 --> 01:22:07.650 still going to be a crazy big number. 01:22:07.650 --> 01:22:11.000 So computer scientists would typically wrap this in some big O notation 01:22:11.000 --> 01:22:16.070 and say, OK, OK, selection sort is on the order of n squared. 01:22:16.070 --> 01:22:17.502 That's not a precise measurement. 01:22:17.502 --> 01:22:19.460 But it's on the order of n squared because it's 01:22:19.460 --> 01:22:23.480 making so many darn comparisons, not unlike everyone shaking everyone else's 01:22:23.480 --> 01:22:25.470 hand like I proposed verbally earlier. 01:22:25.470 --> 01:22:27.300 It's a lot of work to get that done. 01:22:27.300 --> 01:22:30.170 So selection sort is on the order of n squared steps. 01:22:30.170 --> 01:22:31.967 And that's kind of a slow one. 01:22:31.967 --> 01:22:33.800 That's what was at the top of my cheat sheet 01:22:33.800 --> 01:22:37.232 earlier when I proposed a ranking of some common running times. 01:22:37.232 --> 01:22:38.690 There's an infinite number of them. 01:22:38.690 --> 01:22:40.380 But those are some of the common ones. 01:22:40.380 --> 01:22:42.920 So can we do actually better? 01:22:42.920 --> 01:22:47.330 Well, if bubble sort then is in big O-- 01:22:47.330 --> 01:22:47.840 sorry. 01:22:47.840 --> 01:22:49.010 Sorry-- spoiler. 01:22:49.010 --> 01:22:52.610 If selection sort is in the order of n squared, 01:22:52.610 --> 01:22:55.420 could we maybe get lucky sometimes? 01:22:55.420 --> 01:22:58.285 Like, with selection sort, what would the best case scenario be? 01:22:58.285 --> 01:23:00.910 Well, the best case would be, like, everyone is already sorted, 01:23:00.910 --> 01:23:02.020 0 through 7. 01:23:02.020 --> 01:23:03.250 And we just get lucky. 01:23:03.250 --> 01:23:05.820 But does bubble sort-- 01:23:05.820 --> 01:23:10.580 [SIGHS]---- does selection sort appreciate that? 01:23:10.580 --> 01:23:14.590 Does selection sort take into account whether the list is already sorted? 01:23:14.590 --> 01:23:18.280 Not necessarily, because if you look at the pseudocode even, 01:23:18.280 --> 01:23:22.480 there's no special conditional in here that says, if it's already sorted, 01:23:22.480 --> 01:23:23.620 exit early. 01:23:23.620 --> 01:23:27.010 It's just going to blindly do this this many times. 01:23:27.010 --> 01:23:30.380 And you can actually see pseudocode-wise the n squared. 01:23:30.380 --> 01:23:33.850 Notice that this line of pseudocode here is essentially telling me to do what? 01:23:33.850 --> 01:23:37.450 Do something n times from 0 to n minus 1, 01:23:37.450 --> 01:23:40.720 or equivalently, if you prefer the real world, from 1 to n. 01:23:40.720 --> 01:23:42.710 That's n times total. 01:23:42.710 --> 01:23:44.797 But what are you doing inside of this loop? 01:23:44.797 --> 01:23:46.630 Well, every time you're inside of this loop, 01:23:46.630 --> 01:23:49.720 you're looking for the smallest element, looking for the smallest element. 01:23:49.720 --> 01:23:51.790 And that might take you as many as n steps. 01:23:51.790 --> 01:23:55.000 So another way to think about the running time of this algorithm 01:23:55.000 --> 01:23:58.540 selection sort is this loop is telling you to do something n times. 01:23:58.540 --> 01:24:02.230 This line is telling you to do something n times as well. 01:24:02.230 --> 01:24:04.630 And that's roughly n times n or n squared. 01:24:04.630 --> 01:24:05.630 It's not precise. 01:24:05.630 --> 01:24:07.390 But it's on the order of n squared. 01:24:07.390 --> 01:24:11.090 Unfortunately, if you're just blindly doing that much work always, 01:24:11.090 --> 01:24:14.240 even if you have any number of doors, this 01:24:14.240 --> 01:24:19.800 is going to end up being in omega of n squared as well, 01:24:19.800 --> 01:24:23.130 because even in the best case where all of the numbers are already sorted, 01:24:23.130 --> 01:24:26.580 there is nothing about the algorithm called selection sort or even 01:24:26.580 --> 01:24:29.850 my implementation thereof that would have said, wait a minute-- like, I'm 01:24:29.850 --> 01:24:31.980 done, and exit prematurely. 01:24:31.980 --> 01:24:35.760 So selection sort is in big O of and omega 01:24:35.760 --> 01:24:39.420 of n squared as we've designed it. 01:24:39.420 --> 01:24:42.988 And by coincidence, because those boundaries are the same, 01:24:42.988 --> 01:24:45.030 you can also say that it's in theta of n squared. 01:24:45.030 --> 01:24:49.170 No matter what, you're going to spend n squared steps or n squared comparisons. 01:24:49.170 --> 01:24:53.100 But that second algorithm that I keep teasing called bubble sort, 01:24:53.100 --> 01:24:55.808 and I deliberately used the word bubble in my description of what 01:24:55.808 --> 01:24:58.433 was happening because Eli, I think was our first volunteer, who 01:24:58.433 --> 01:25:01.140 kind of bubbled his way up all the way to the end of the list. 01:25:01.140 --> 01:25:05.302 Then number 6 did, then 5, then 4, then 3, then 2, then 1. 01:25:05.302 --> 01:25:07.260 All of the numbers kind of bubbled their way up 01:25:07.260 --> 01:25:09.000 being the bigger values to the right. 01:25:09.000 --> 01:25:11.790 So bubble sort just does something again and again 01:25:11.790 --> 01:25:15.060 by comparing adjacencies, comparing, comparing, comparing. 01:25:15.060 --> 01:25:17.810 And then it does it again and again and again. 01:25:17.810 --> 01:25:19.180 So let's analyze bubble sort. 01:25:19.180 --> 01:25:22.750 If, for instance, this was the original array that we tried sorting before-- 01:25:22.750 --> 01:25:24.100 same exact numbers-- 01:25:24.100 --> 01:25:28.330 I was doing things like flipping the 7 and the 2, and then the 7 and the 5, 01:25:28.330 --> 01:25:29.770 and then the 7 and the 4. 01:25:29.770 --> 01:25:34.360 But that would only fix minimally one number. 01:25:34.360 --> 01:25:37.120 I then had to repeat again and again and again. 01:25:37.120 --> 01:25:39.290 So that one too felt like it was taking some time. 01:25:39.290 --> 01:25:41.560 So here's one way of thinking about bubble sort. 01:25:41.560 --> 01:25:43.660 Bubble sort pseudocode could say this. 01:25:43.660 --> 01:25:45.790 Repeat the following n times. 01:25:45.790 --> 01:25:48.700 And that's why I kept going through the list again and again-- 01:25:48.700 --> 01:25:53.473 for i from 0 to n minus 2. 01:25:53.473 --> 01:25:56.140 Now, this is a little weird, but we'll get back to this-- from i 01:25:56.140 --> 01:25:57.820 from 0 to n minus 2. 01:25:57.820 --> 01:26:02.380 If the number at location i and the number at location i plus 1 01:26:02.380 --> 01:26:05.140 are out of order, swap them. 01:26:05.140 --> 01:26:08.140 And then just repeat, repeat, repeat. 01:26:08.140 --> 01:26:11.470 But we've never seen n minus 2 in pseudocode before. 01:26:11.470 --> 01:26:14.830 Why is it n minus 2 and not the usual n minus 1? 01:26:14.830 --> 01:26:16.086 Yeah? 01:26:16.086 --> 01:26:20.370 AUDIENCE: [INAUDIBLE] 01:26:20.370 --> 01:26:22.770 DAVID MALAN: Exactly, because we're taking a number 01:26:22.770 --> 01:26:24.250 and comparing it to the next one. 01:26:24.250 --> 01:26:26.370 So you better not go all the way to the end 01:26:26.370 --> 01:26:29.245 and expect there to be a next one at the end of the list. 01:26:29.245 --> 01:26:31.620 So if we use these same numbers here, even though they're 01:26:31.620 --> 01:26:34.350 in a different order, if this is location 0, 01:26:34.350 --> 01:26:38.280 and this is location n minus 1 always, if you think 01:26:38.280 --> 01:26:42.480 of i as being my left hand, it's pointing from 0 to 1 to 2 to 3 01:26:42.480 --> 01:26:44.910 to 4 to 5 to 6 to 7. 01:26:44.910 --> 01:26:47.430 If you get to the end, you don't want to look at i plus 1 01:26:47.430 --> 01:26:49.555 because that's pointing to no man's land over here. 01:26:49.555 --> 01:26:51.960 And you can't go beyond the boundaries of your array. 01:26:51.960 --> 01:26:55.012 But n minus 2 would be the second to last element, as you know. 01:26:55.012 --> 01:26:56.970 And that makes sure that my left hand and right 01:26:56.970 --> 01:26:59.070 hand stay within the boundaries of the array 01:26:59.070 --> 01:27:02.777 if left hand represents i and right hand represents i plus 1. 01:27:02.777 --> 01:27:04.860 So it's just to make sure we don't screw up and go 01:27:04.860 --> 01:27:06.930 too far past the boundary of the array. 01:27:06.930 --> 01:27:10.920 But as implemented here, this too does not take into account whether or not 01:27:10.920 --> 01:27:13.620 the array is already sorted. 01:27:13.620 --> 01:27:16.087 So technically, we can actually do this n minus 1 times 01:27:16.087 --> 01:27:17.670 because the last one you get for free. 01:27:17.670 --> 01:27:19.020 It ends up being in place. 01:27:19.020 --> 01:27:22.060 But even still, let's do a quick analysis here. 01:27:22.060 --> 01:27:27.100 If we've got n doors in total from 0 on up to n minus 1, 01:27:27.100 --> 01:27:30.520 bubble sort's analysis looks a little bit like this. 01:27:30.520 --> 01:27:35.228 Do the following-- n minus 1 times n minus 1 times. 01:27:35.228 --> 01:27:36.520 Well, how did we get from that? 01:27:36.520 --> 01:27:38.020 Let me back up to the pseudocode. 01:27:38.020 --> 01:27:40.750 If this is the pseudocode as I originally put it, 01:27:40.750 --> 01:27:42.580 I then propose verbally a refinement. 01:27:42.580 --> 01:27:45.580 You don't need to repeat yourself n times again and again because again, 01:27:45.580 --> 01:27:46.915 the last one bubbles up. 01:27:46.915 --> 01:27:49.660 But by process of elimination, it's in the right place. 01:27:49.660 --> 01:27:52.390 So you can think of it as just n minus 1 times. 01:27:52.390 --> 01:27:57.910 How many times can you compare i and i plus 1? 01:27:57.910 --> 01:28:01.270 Well, you're iterating from 0 to n minus 2. 01:28:01.270 --> 01:28:03.370 And this is where the math is going to get weird. 01:28:03.370 --> 01:28:06.500 But that is n minus 1 steps also. 01:28:06.500 --> 01:28:07.000 Why? 01:28:07.000 --> 01:28:09.940 Because if you do it in the real world starting at one, 01:28:09.940 --> 01:28:13.300 this is saying from i from 1 to n minus 1. 01:28:13.300 --> 01:28:15.460 And then maybe it pops a little more obviously. 01:28:15.460 --> 01:28:18.400 This inner loop is repeating n minus 1 times. 01:28:18.400 --> 01:28:20.830 So outer loop says do the following n minus 1 times. 01:28:20.830 --> 01:28:23.620 Inner loop says do this now n minus 1 times. 01:28:23.620 --> 01:28:27.010 Mathematically, then, that's n minus 1 times n minus 1. 01:28:27.010 --> 01:28:28.550 Now we've got our old FOIL method. 01:28:28.550 --> 01:28:31.780 So n squared minus n minus n plus 1. 01:28:31.780 --> 01:28:35.440 Combine like terms gives us n squared minus 2n plus 1. 01:28:35.440 --> 01:28:37.600 But now let's think like a computer scientist. 01:28:37.600 --> 01:28:41.710 When n gets really large, we definitely don't care about the plus 1. 01:28:41.710 --> 01:28:45.640 When n gets really large, we probably don't even care about the minus 2n. 01:28:45.640 --> 01:28:48.580 It will make a mathematical difference but a drop in the bucket 01:28:48.580 --> 01:28:50.860 once n gets to be in the millions or billions. 01:28:50.860 --> 01:28:54.770 So this would be on the order of what, similarly? 01:28:54.770 --> 01:28:56.730 On the order of n squared as well. 01:28:56.730 --> 01:28:59.840 So algorithmically, and actually, we'll use a different term 01:28:59.840 --> 01:29:03.770 now-- asymptotically-- asymptotic notation is the fancy term 01:29:03.770 --> 01:29:08.960 to describe big O and omega and theta notation-- asymptotically bubble 01:29:08.960 --> 01:29:12.530 sort is, quote, unquote, "the same" as selection sort 01:29:12.530 --> 01:29:13.760 in terms of its upper bound. 01:29:13.760 --> 01:29:15.812 It's not 100% exactly the same. 01:29:15.812 --> 01:29:19.020 If we get into the weeds of the math, there are obviously different formulas. 01:29:19.020 --> 01:29:22.462 But when n gets really large and you plot the pictures on a chart, 01:29:22.462 --> 01:29:24.920 they're going to look almost the same because they're going 01:29:24.920 --> 01:29:27.060 to be fundamentally the same shape. 01:29:27.060 --> 01:29:33.380 So in our cheat sheet here, bubble sort now is in big O of n squared. 01:29:33.380 --> 01:29:36.600 But let me propose that we make an improvement. 01:29:36.600 --> 01:29:38.720 Here's our pseudocode earlier. 01:29:38.720 --> 01:29:43.430 When might it make sense to abort bubble sort early? 01:29:43.430 --> 01:29:47.780 Like, when could you logically conclude that I do not need 01:29:47.780 --> 01:29:51.975 to make another pass again and again? 01:29:51.975 --> 01:29:54.600 And remember what I did from left to right over our volunteers. 01:29:54.600 --> 01:29:57.510 I was comparing, comparing, comparing, comparing and maybe swapping 01:29:57.510 --> 01:29:59.200 people who were out of order. 01:29:59.200 --> 01:30:03.980 So what could I do to short circuit this? 01:30:03.980 --> 01:30:04.855 AUDIENCE: [INAUDIBLE] 01:30:04.855 --> 01:30:05.950 DAVID MALAN: Perfect. 01:30:05.950 --> 01:30:10.060 If I compare through the whole list left to right and I make no swaps, 01:30:10.060 --> 01:30:12.808 it stands to reason that they're already ordered. 01:30:12.808 --> 01:30:14.350 Otherwise, I would have swapped them. 01:30:14.350 --> 01:30:17.923 And therefore, I would be crazy to do that again and again and again. 01:30:17.923 --> 01:30:20.090 Because if I didn't swap them the first time around, 01:30:20.090 --> 01:30:22.757 why would I swap them the second time around if no one's moving. 01:30:22.757 --> 01:30:25.280 So I can terminate the algorithm early. 01:30:25.280 --> 01:30:27.550 So in pseudocode, I could say something like this. 01:30:27.550 --> 01:30:29.920 If no swaps, quit. 01:30:29.920 --> 01:30:31.940 And I can do that inside of the inner loop. 01:30:31.940 --> 01:30:35.560 So once I make a pass through the people and I say, hm, did I make any swaps? 01:30:35.560 --> 01:30:38.320 If no, just quit because they're not going 01:30:38.320 --> 01:30:41.260 to move any further if they didn't just move already. 01:30:41.260 --> 01:30:45.040 So bubble sort then might arguably be an omega of what? 01:30:45.040 --> 01:30:51.130 In the best case, how few steps could we get away with with bubble sort? 01:30:51.130 --> 01:30:52.330 OK, it's not one. 01:30:52.330 --> 01:30:53.490 But it is n. 01:30:53.490 --> 01:30:54.160 Why? 01:30:54.160 --> 01:30:56.980 Because I minimally need to go through the whole list to decide, 01:30:56.980 --> 01:30:58.240 did I make any swaps. 01:30:58.240 --> 01:31:03.010 And logically-- and this will come up a lot in the analysis of algorithms-- 01:31:03.010 --> 01:31:07.780 any question like, is this list sorted, cannot possibly take only one step 01:31:07.780 --> 01:31:09.040 if you've got n elements. 01:31:09.040 --> 01:31:09.310 Why? 01:31:09.310 --> 01:31:10.727 Because then you're just guessing. 01:31:10.727 --> 01:31:14.560 Like, if you don't at least take the time to look at every element, 01:31:14.560 --> 01:31:17.710 how in the world can you conclude logically that they're sorted or not? 01:31:17.710 --> 01:31:21.290 Like, you minimally have to meet us halfway and check every element. 01:31:21.290 --> 01:31:24.338 So it's minimally in omega of n. 01:31:24.338 --> 01:31:27.130 Constant time wouldn't even give you time to look at the whole list 01:31:27.130 --> 01:31:28.190 if there's n elements. 01:31:28.190 --> 01:31:29.500 So that's the intuition there. 01:31:29.500 --> 01:31:32.042 So we can't say anything about theta notation for bubble sort 01:31:32.042 --> 01:31:34.000 because it's different, upper and lower bounds. 01:31:34.000 --> 01:31:39.850 But we seem to have done better asymptotically than selection sort. 01:31:39.850 --> 01:31:41.530 In the worst case, they're just as bad. 01:31:41.530 --> 01:31:42.530 They're super slow. 01:31:42.530 --> 01:31:45.310 But in the best case, bubble sort with that tweak-- 01:31:45.310 --> 01:31:48.550 that conditional about swapping-- might actually be faster for us. 01:31:48.550 --> 01:31:52.940 And you'd want Google, Microsoft, your phone, to use that kind of algorithm. 01:31:52.940 --> 01:31:55.280 Let me go back to the sorting demonstration earlier. 01:31:55.280 --> 01:31:57.380 Let me re-randomize the array. 01:31:57.380 --> 01:31:58.820 Small bars is small number. 01:31:58.820 --> 01:31:59.990 Big bars is big number. 01:31:59.990 --> 01:32:02.030 And let me click bubble sort this time. 01:32:02.030 --> 01:32:04.970 And you'll see again, the pink represents comparisons. 01:32:04.970 --> 01:32:08.400 But now the comparisons are always adjacent from left to right. 01:32:08.400 --> 01:32:12.360 So this is pretty much doing with more bars what I did with eight people here. 01:32:12.360 --> 01:32:15.950 And you'll see that the biggest elements, Eli 01:32:15.950 --> 01:32:18.260 being the first one of the humans, bubbled up 01:32:18.260 --> 01:32:21.470 all the way to the right, followed by the next largest, next largest, 01:32:21.470 --> 01:32:22.490 next largest. 01:32:22.490 --> 01:32:25.700 But here again you can visually see. 01:32:25.700 --> 01:32:28.490 And with the number of words I'm using to kind of stall here, 01:32:28.490 --> 01:32:31.100 you can [? hear ?] that this is kind of slow 01:32:31.100 --> 01:32:34.618 because you're doing a lot of comparisons again and again and again. 01:32:34.618 --> 01:32:35.910 And you're making improvements. 01:32:35.910 --> 01:32:37.618 So you're taking bites out of the problem 01:32:37.618 --> 01:32:40.897 but really only one bite at a time. 01:32:40.897 --> 01:32:43.480 So I kind of feel this is going to be like holding in a sneeze 01:32:43.480 --> 01:32:44.560 if we don't finish this. 01:32:44.560 --> 01:32:47.410 So I feel like we should give you emotional closure 01:32:47.410 --> 01:32:49.930 with getting this to finish. 01:32:49.930 --> 01:32:53.020 Any questions in the meantime because there's going to be no surprise. 01:32:53.020 --> 01:32:54.830 It's going to be sorted. 01:32:54.830 --> 01:32:57.580 This is bubble sort. 01:32:57.580 --> 01:32:58.570 No? 01:32:58.570 --> 01:33:00.040 All right, we're almost there. 01:33:00.040 --> 01:33:02.800 It's going faster because the list is effectively getting shorter. 01:33:02.800 --> 01:33:08.350 And-- oh, OK, so that then was bubble sort. 01:33:08.350 --> 01:33:12.920 But both of these, I claim now, are actually pretty inefficient. 01:33:12.920 --> 01:33:13.780 They're pretty slow. 01:33:13.780 --> 01:33:16.330 And my god, that was only eight humans on the stage. 01:33:16.330 --> 01:33:19.540 That was only, like, what, 40 or 50 bars on the screen. 01:33:19.540 --> 01:33:23.320 What if we start talking about hundreds or thousands or millions of inputs? 01:33:23.320 --> 01:33:25.690 Like, those algorithms are probably going to take crazy 01:33:25.690 --> 01:33:28.220 long because n squared gets really big. 01:33:28.220 --> 01:33:30.440 So can we do better than n squared? 01:33:30.440 --> 01:33:31.330 Well, we can. 01:33:31.330 --> 01:33:35.170 But let me propose that we introduce a new problem-solving technique 01:33:35.170 --> 01:33:38.320 that we've actually been using already, just not by name. 01:33:38.320 --> 01:33:42.700 In programming and in mathematics, there's this idea of recursion. 01:33:42.700 --> 01:33:48.280 And recursion is a description for a function that calls itself. 01:33:48.280 --> 01:33:51.460 A function that calls itself is recursive. 01:33:51.460 --> 01:33:52.895 So what do we mean by this? 01:33:52.895 --> 01:33:54.520 Well, we've actually seen this already. 01:33:54.520 --> 01:33:58.810 Here's the same pseudocode for binary search earlier. 01:33:58.810 --> 01:34:01.180 And binary search-- bi implying two-- went 01:34:01.180 --> 01:34:05.070 either left or right or left or right, halving each time. 01:34:05.070 --> 01:34:06.950 So that was the division and conquering. 01:34:06.950 --> 01:34:08.570 So this is the same pseudocode. 01:34:08.570 --> 01:34:12.080 But notice that in my pseudocode earlier, 01:34:12.080 --> 01:34:16.710 I literally used the keyword search inside of my definition of search. 01:34:16.710 --> 01:34:19.550 So this is sort of one of those, like, circular definitions. 01:34:19.550 --> 01:34:21.590 Like, if this is a search algorithm, how am I 01:34:21.590 --> 01:34:24.860 getting away with using the word search in the definition of search? 01:34:24.860 --> 01:34:27.110 It's kind of when you get reprimanded for using a word 01:34:27.110 --> 01:34:28.850 to define a vocabulary word. 01:34:28.850 --> 01:34:32.960 This is kind of like that because this algorithm is calling itself. 01:34:32.960 --> 01:34:35.700 This search function is calling itself. 01:34:35.700 --> 01:34:38.300 Now, normally, if you were to have a function call itself, 01:34:38.300 --> 01:34:41.750 call itself, call itself, call itself, it would just do that infinitely. 01:34:41.750 --> 01:34:43.550 And presumably, it would go on forever. 01:34:43.550 --> 01:34:45.350 Or the program or computer would probably 01:34:45.350 --> 01:34:48.517 crash because out of memory or something like that-- more on that next week. 01:34:48.517 --> 01:34:50.780 But there's a key detail, a key characteristic 01:34:50.780 --> 01:34:54.770 about this algorithm that makes sure that doing the same thing again 01:34:54.770 --> 01:34:56.690 and again is not crazy. 01:34:56.690 --> 01:34:59.630 It's actually going to lead us to a solution. 01:34:59.630 --> 01:35:02.720 Even though I'm using search here or search here, 01:35:02.720 --> 01:35:08.742 what's happening to the size of the problem for each of these lines? 01:35:08.742 --> 01:35:09.450 What's happening? 01:35:09.450 --> 01:35:10.290 Yeah? 01:35:10.290 --> 01:35:11.610 It's being cut in half. 01:35:11.610 --> 01:35:15.420 So even though I'm doing the exact same thing algorithmically again and again, 01:35:15.420 --> 01:35:18.930 I'm doing it on a smaller, smaller, smaller input-- 01:35:18.930 --> 01:35:21.820 fewer, fewer, fewer doors-- fewer, fewer, fewer people. 01:35:21.820 --> 01:35:23.820 And so even though I'm doing it again and again, 01:35:23.820 --> 01:35:26.760 so long as I have this so-called base case, as we'll call it, 01:35:26.760 --> 01:35:29.700 that just makes sure if there's no doors-- you're done-- 01:35:29.700 --> 01:35:34.710 I can break out of what would otherwise be an infinite loop of sorts. 01:35:34.710 --> 01:35:37.525 All right, so those two lines we've already kind of seen. 01:35:37.525 --> 01:35:39.150 And actually, we saw this in week zero. 01:35:39.150 --> 01:35:42.510 So this is the pseudocode for the phone book example from week zero. 01:35:42.510 --> 01:35:45.300 And you might recall that we had these lines of code here. 01:35:45.300 --> 01:35:48.420 And this is very procedural or iterative where I literally 01:35:48.420 --> 01:35:49.872 told you go back to line 3. 01:35:49.872 --> 01:35:52.330 And we did that today when we took attendance, so to speak. 01:35:52.330 --> 01:35:55.860 With that third algorithm, you went back to step 2 again and again. 01:35:55.860 --> 01:35:58.000 That's a very iterative loop-based approach. 01:35:58.000 --> 01:35:58.750 But you know what? 01:35:58.750 --> 01:36:00.840 I could be a little more clever here too in week zero. 01:36:00.840 --> 01:36:03.210 We just didn't want to get too far ahead of ourselves. 01:36:03.210 --> 01:36:07.410 Let me actually change these highlighted lines to be more, succinctly, 01:36:07.410 --> 01:36:10.150 search left half book, search right half of book. 01:36:10.150 --> 01:36:11.620 I can now tighten the code up. 01:36:11.620 --> 01:36:13.270 So it's kind of a shorter algorithm. 01:36:13.270 --> 01:36:14.860 But it's the exact same idea. 01:36:14.860 --> 01:36:20.570 But on week zero, I very methodically told you what I want you to do next. 01:36:20.570 --> 01:36:22.900 But here, I'm sort of recognizing that, wait a minute, 01:36:22.900 --> 01:36:25.660 I just spent the past six lines telling you how to search. 01:36:25.660 --> 01:36:27.220 So go do more of that. 01:36:27.220 --> 01:36:29.180 You have all of the logic you need. 01:36:29.180 --> 01:36:33.280 So this too is recursive in the sense that if this is a search algorithm, 01:36:33.280 --> 01:36:36.610 the search algorithm is using a search algorithm inside of it. 01:36:36.610 --> 01:36:39.430 But here too, the phone book was getting divided and divided 01:36:39.430 --> 01:36:42.160 and divided in half again and again. 01:36:42.160 --> 01:36:45.370 So recursion doesn't just describe mathematical formulas. 01:36:45.370 --> 01:36:49.077 It doesn't just describe pseudocode or code-- even physical structures. 01:36:49.077 --> 01:36:51.410 So we deliberately brought back these bricks here today, 01:36:51.410 --> 01:36:54.040 which are reminiscent, of course, from Mario and Super Mario Brothers. 01:36:54.040 --> 01:36:56.695 That pyramid, especially if we get rid of the distractions 01:36:56.695 --> 01:36:58.570 of the ground and the mountains and so forth, 01:36:58.570 --> 01:37:01.810 is itself recursive as a structure. 01:37:01.810 --> 01:37:02.920 Now, why is that? 01:37:02.920 --> 01:37:05.440 Well, here you can kind of play verbal games. 01:37:05.440 --> 01:37:07.930 If this is a pyramid of height 4-- 01:37:07.930 --> 01:37:11.092 and ask the question, well, what is that-- what is a pyramid of height 4-- 01:37:11.092 --> 01:37:13.300 I could kind of be difficult and say, well, it's just 01:37:13.300 --> 01:37:16.270 a pyramid of height 3 plus 1 other row. 01:37:16.270 --> 01:37:18.310 All right, well, what's a pyramid of height 3? 01:37:18.310 --> 01:37:21.185 You could be difficult and say, well, it's just a pyramid of height 2 01:37:21.185 --> 01:37:22.060 plus 1 more row. 01:37:22.060 --> 01:37:23.290 What's a pyramid of height 2? 01:37:23.290 --> 01:37:27.363 Well, it's this plus 1 more row. 01:37:27.363 --> 01:37:28.780 I might have skipped a step there. 01:37:28.780 --> 01:37:30.010 What's a pyramid of height 2? 01:37:30.010 --> 01:37:33.130 Well, it's a pyramid of height 1 plus 1 more row. 01:37:33.130 --> 01:37:34.900 What's a pyramid of height 1? 01:37:34.900 --> 01:37:37.240 Well, it's nothing plus 1 more row. 01:37:37.240 --> 01:37:39.638 Or at this point, you can just say it's a single brick. 01:37:39.638 --> 01:37:42.430 You can have a base case that just says, like, the buck stops here. 01:37:42.430 --> 01:37:43.722 Stop asking me these questions. 01:37:43.722 --> 01:37:48.140 I can special case or hard code my answer to the very tiniest of problems. 01:37:48.140 --> 01:37:50.740 So even those bricks then are sort of-- 01:37:50.740 --> 01:37:52.622 the pyramid is recursively defined. 01:37:52.622 --> 01:37:54.580 And we can actually implement this even in code 01:37:54.580 --> 01:37:57.080 if we want in a couple of different ways to make this clear. 01:37:57.080 --> 01:37:59.690 So let me go back over to VS Code here. 01:37:59.690 --> 01:38:03.760 And let me propose that we do one old-school iterative example with 01:38:03.760 --> 01:38:04.510 this-- 01:38:04.510 --> 01:38:06.190 code iteration dot C. 01:38:06.190 --> 01:38:09.040 And iteration just means to loop again and again. 01:38:09.040 --> 01:38:11.000 An iteration is a pass through the code. 01:38:11.000 --> 01:38:13.430 And let me go ahead and whip this up as follows. 01:38:13.430 --> 01:38:15.730 Let's include the CS50 library. 01:38:15.730 --> 01:38:18.850 So we have our get int function. 01:38:18.850 --> 01:38:21.790 Let's include standard io so that we have printf. 01:38:21.790 --> 01:38:27.650 Let's go ahead and create main with no command line arguments today. 01:38:27.650 --> 01:38:30.430 Let's go ahead and create a variable of type int set 01:38:30.430 --> 01:38:33.070 equal to the return value of get int asking 01:38:33.070 --> 01:38:35.230 the user for the height of the pyramid. 01:38:35.230 --> 01:38:37.480 And then let me assume for the sake of discussion 01:38:37.480 --> 01:38:41.290 that there is a function now that will draw a pyramid of that height. 01:38:41.290 --> 01:38:44.810 All right, now let's actually implement that function as a helper function, 01:38:44.810 --> 01:38:45.500 so to speak. 01:38:45.500 --> 01:38:49.270 So if I have a function called draw, it's going to clearly take an integer. 01:38:49.270 --> 01:38:51.670 And I'll call it, maybe, n as input. 01:38:51.670 --> 01:38:53.600 And it doesn't need to return anything. 01:38:53.600 --> 01:38:55.120 So I'm going to say that this is a void function. 01:38:55.120 --> 01:38:56.203 It doesn't return a value. 01:38:56.203 --> 01:38:59.030 It just has a side effect of printing bricks on the screen. 01:38:59.030 --> 01:39:02.050 How can I go about doing a pyramid of height n? 01:39:02.050 --> 01:39:07.120 Well, I'm going to do this one, which looks like this-- one at a time, 01:39:07.120 --> 01:39:09.430 then two, then three, then four. 01:39:09.430 --> 01:39:11.210 And this is a little easier, for instance, 01:39:11.210 --> 01:39:14.380 than problem set one, which had the pyramid flipped around the other way. 01:39:14.380 --> 01:39:16.480 So this one's a little easier to do deliberately. 01:39:16.480 --> 01:39:21.250 So if I go back to my code, I could say something like this-- for int i gets 0. 01:39:21.250 --> 01:39:24.610 i is less than, let's say, n, which is the height-- 01:39:24.610 --> 01:39:25.810 i plus, plus. 01:39:25.810 --> 01:39:30.280 That's going to essentially iterate over every row of the pyramid top to bottom. 01:39:30.280 --> 01:39:31.480 This might feel similar-- 01:39:31.480 --> 01:39:33.700 reminiscent-- to problem set one. 01:39:33.700 --> 01:39:36.280 Then in my inner loop, I'm going to do four. 01:39:36.280 --> 01:39:38.200 int j gets 0. 01:39:38.200 --> 01:39:42.682 j is less than i plus 1 j plus, plus. 01:39:42.682 --> 01:39:44.390 And we'll see why this works in a moment. 01:39:44.390 --> 01:39:47.800 And then let's go ahead and print out a single hash, no new line. 01:39:47.800 --> 01:39:52.660 But at the very bottom of this row, let's print out just a new line 01:39:52.660 --> 01:39:54.850 to move the cursor down a line. 01:39:54.850 --> 01:39:56.050 Now, I'm not done yet. 01:39:56.050 --> 01:39:57.760 Let me hide my terminal for a second. 01:39:57.760 --> 01:39:58.940 I need the prototype. 01:39:58.940 --> 01:40:02.720 So I'm going to copy this, paste it up here with a semicolon. 01:40:02.720 --> 01:40:05.300 And I think now the code is correct. 01:40:05.300 --> 01:40:06.430 Now, why is this correct? 01:40:06.430 --> 01:40:11.160 Because on my outer loop, I'm iterating n times starting at 0. 01:40:11.160 --> 01:40:15.900 My inner loop-- realize I want to have at least one hash, 01:40:15.900 --> 01:40:18.480 then two, then three, then four. 01:40:18.480 --> 01:40:20.970 So I can't start my number of hashes at 0. 01:40:20.970 --> 01:40:26.910 And that's why I'm doing j all the way up to i plus 1 so that when i is 0, 01:40:26.910 --> 01:40:30.390 I'm actually still printing one hash, and then two hashes, and then three. 01:40:30.390 --> 01:40:32.670 Otherwise, I would start with 0, which is not my goal. 01:40:32.670 --> 01:40:35.700 Let me go in and do make iteration to compile this-- 01:40:35.700 --> 01:40:37.350 dot slash iteration. 01:40:37.350 --> 01:40:39.600 And let me go ahead and make my terminal bigger-- 01:40:39.600 --> 01:40:40.650 enter. 01:40:40.650 --> 01:40:42.660 The height will be, for instance, 4. 01:40:42.660 --> 01:40:45.660 And indeed, it's not quite to scale because these are a little more 01:40:45.660 --> 01:40:47.550 vertical than they are horizontal. 01:40:47.550 --> 01:40:49.440 But it essentially looks like that pyramid. 01:40:49.440 --> 01:40:50.910 And I can do this even larger. 01:40:50.910 --> 01:40:52.930 Let's do this again after clearing my screen. 01:40:52.930 --> 01:40:54.210 Let's do like 10 of these. 01:40:54.210 --> 01:40:55.300 And it gets bigger. 01:40:55.300 --> 01:40:56.750 Let's do, like, 50 of these. 01:40:56.750 --> 01:40:57.750 And it gets even bigger. 01:40:57.750 --> 01:40:59.167 It doesn't even fit on the screen. 01:40:59.167 --> 01:40:59.730 But it works. 01:40:59.730 --> 01:41:02.648 And that's an iterative approach, 100% correct, 01:41:02.648 --> 01:41:05.190 and similar to what you might have done already for something 01:41:05.190 --> 01:41:08.130 like week zero or week one. 01:41:08.130 --> 01:41:10.390 But it turns out if we leverage recursion, 01:41:10.390 --> 01:41:12.650 we can be a little more clever, in fact. 01:41:12.650 --> 01:41:13.880 Let me go ahead and do this. 01:41:13.880 --> 01:41:15.600 Let me minimize my terminal window. 01:41:15.600 --> 01:41:17.990 And let me go back into my code here. 01:41:17.990 --> 01:41:23.800 And let me propose to implement the exact same program recursively instead. 01:41:23.800 --> 01:41:27.710 And let me go ahead and do this. 01:41:27.710 --> 01:41:31.030 I'm going to leave main-- 01:41:31.030 --> 01:41:34.060 let's see-- exactly as is. 01:41:34.060 --> 01:41:39.410 And I'm going to go ahead and change my draw function to work as follows. 01:41:39.410 --> 01:41:41.860 Let me delete all of these loops because loops are sort 01:41:41.860 --> 01:41:43.750 of an indication of using iteration. 01:41:43.750 --> 01:41:45.550 And I'm instead going to do this. 01:41:45.550 --> 01:41:49.030 What is a pyramid of height 4? 01:41:49.030 --> 01:41:52.210 I said it's a pyramid of height 3 plus 1 more row. 01:41:52.210 --> 01:41:53.740 So let's take that literally. 01:41:53.740 --> 01:41:56.950 If a pyramid of height n needs to be drawn, 01:41:56.950 --> 01:42:01.270 let's first draw a pyramid of n minus 1. 01:42:01.270 --> 01:42:04.150 And then how do I go about drawing one more row? 01:42:04.150 --> 01:42:05.870 Well, this, I can use a bit of iteration. 01:42:05.870 --> 01:42:08.710 But I don't need a doubly-nested loop anymore. 01:42:08.710 --> 01:42:13.750 I can set i equal to 0, i less than n, i plus, plus. 01:42:13.750 --> 01:42:16.300 And this block of code is simply going to print 01:42:16.300 --> 01:42:23.590 one simple row of hashes again and again followed by a new line just 01:42:23.590 --> 01:42:25.070 to move the cursor down. 01:42:25.070 --> 01:42:27.040 So notice this line here. 01:42:27.040 --> 01:42:33.880 And I'll add some comments-- print pyramid of height n minus 1, 01:42:33.880 --> 01:42:35.990 print one more row. 01:42:35.990 --> 01:42:39.680 So I'm kind of taking a bite out of the problem by printing one row. 01:42:39.680 --> 01:42:44.230 But I'm deferring to, weirdly, myself, to print the rest of the pyramid. 01:42:44.230 --> 01:42:47.800 Now I'm going to go ahead and try compiling this-- make-- 01:42:47.800 --> 01:42:48.340 oh. 01:42:48.340 --> 01:42:49.940 And this is no longer iteration. 01:42:49.940 --> 01:42:53.920 So I'm actually going to do this-- code recursion dot c. 01:42:53.920 --> 01:42:56.437 I'm going to paste that same code into this new version, 01:42:56.437 --> 01:42:59.020 just so we have a different file without breaking the old one. 01:42:59.020 --> 01:43:01.960 And now I'm going to do make recursion. 01:43:01.960 --> 01:43:03.250 All right, interesting. 01:43:03.250 --> 01:43:05.710 So Clang is yelling at me with this error. 01:43:05.710 --> 01:43:08.810 All paths through this function will call itself. 01:43:08.810 --> 01:43:11.590 So Clang is actually smart enough to notice in my code 01:43:11.590 --> 01:43:14.920 that no matter what, the draw function is going to call the draw function. 01:43:14.920 --> 01:43:16.840 And the draw function is going to call the draw function. 01:43:16.840 --> 01:43:19.173 And the draw function's going to call the draw function. 01:43:19.173 --> 01:43:23.230 Now, to be fair, n, the input to draw, is getting smaller and smaller. 01:43:23.230 --> 01:43:25.990 But what's going to happen eventually to the input of draw 01:43:25.990 --> 01:43:28.870 as I've written this code right now? 01:43:28.870 --> 01:43:30.040 Yeah, in the back? 01:43:30.040 --> 01:43:32.722 AUDIENCE: [INAUDIBLE] 01:43:32.722 --> 01:43:33.800 DAVID MALAN: Exactly. 01:43:33.800 --> 01:43:35.900 Remember that integers are signed by default. 01:43:35.900 --> 01:43:38.550 They can be both positive or 0 or negative. 01:43:38.550 --> 01:43:41.330 And so here, if I just keep blindly subtracting 1, 01:43:41.330 --> 01:43:44.930 it's going to do that seemingly forever until I technically underflow. 01:43:44.930 --> 01:43:47.970 But that's going to be like 2 billion rows of pyramids later. 01:43:47.970 --> 01:43:51.660 So I think what I actually need to do is something like this. 01:43:51.660 --> 01:43:56.900 I need to ask the question, if n equals 0-- 01:43:56.900 --> 01:44:00.710 or heck, just to be super safe, let's say if n is less than or equal to 0, 01:44:00.710 --> 01:44:04.010 just to make sure it never goes negative, let's just go ahead 01:44:04.010 --> 01:44:06.540 and return without doing anything. 01:44:06.540 --> 01:44:10.130 So I'm going to comment this as, like, if nothing to draw, well, then 01:44:10.130 --> 01:44:11.780 don't blindly call draw again. 01:44:11.780 --> 01:44:13.380 So this is that so-called base case. 01:44:13.380 --> 01:44:16.490 This is analogous to saying, like, if John Harvard not in phone book 01:44:16.490 --> 01:44:22.410 or if no lockers or doors left, then just exit or return in this case. 01:44:22.410 --> 01:44:27.050 So now I'll never call draw a negative number of times 01:44:27.050 --> 01:44:28.370 or zero number of times. 01:44:28.370 --> 01:44:30.990 I will only do it so long as n is positive. 01:44:30.990 --> 01:44:36.420 So now if I make my terminal bigger, make recursion again does compile fine. 01:44:36.420 --> 01:44:39.630 Dot slash recursion-- let's do the same input for. 01:44:39.630 --> 01:44:42.840 And I get the exact same number of bricks. 01:44:42.840 --> 01:44:45.540 Let's go ahead and do it again with maybe 10. 01:44:45.540 --> 01:44:46.890 That seems to work too. 01:44:46.890 --> 01:44:48.510 Let's do it again with 50. 01:44:48.510 --> 01:44:49.503 That seems to work too. 01:44:49.503 --> 01:44:50.670 And I can go a little crazy. 01:44:50.670 --> 01:44:52.320 I can do, like, 5,000. 01:44:52.320 --> 01:44:53.500 This is still going to work. 01:44:53.500 --> 01:44:55.210 It's just not going to fit on my screen. 01:44:55.210 --> 01:44:59.073 But it is doing a valiant attempt to print all of those out. 01:44:59.073 --> 01:45:00.990 Now, it turns out that could get us in trouble 01:45:00.990 --> 01:45:05.790 if we start poking around in the-- now, maybe not 5,000 but 500,000 or 5 01:45:05.790 --> 01:45:07.020 million or beyond. 01:45:07.020 --> 01:45:11.050 But for now, we'll just assume that this is just a slow process at that. 01:45:11.050 --> 01:45:14.340 But if we go back to the goal, which was to print this thing here, 01:45:14.340 --> 01:45:17.760 this too is a recursive structure that we're just now translating 01:45:17.760 --> 01:45:19.623 the ideas of recursive code to. 01:45:19.623 --> 01:45:21.540 And actually, if you've never discovered this, 01:45:21.540 --> 01:45:23.790 we would be remiss in not doing this for you. 01:45:23.790 --> 01:45:27.840 If I go into a browser here. 01:45:27.840 --> 01:45:30.540 And suppose that I'm not using any AI fancy technology. 01:45:30.540 --> 01:45:32.070 I'm just going to Google.com. 01:45:32.070 --> 01:45:35.730 And I search for recursion because I'm curious to see what it means. 01:45:35.730 --> 01:45:39.650 Google for years has had this little Easter egg for computer scientists. 01:45:43.940 --> 01:45:45.680 It's not a typo. 01:45:45.680 --> 01:45:46.180 Get it? 01:45:46.180 --> 01:45:47.690 Ha, ha, kind of funny. 01:45:47.690 --> 01:45:48.190 Yeah? 01:45:48.190 --> 01:45:49.540 OK, like, see? 01:45:49.540 --> 01:45:52.070 Recursion-- so if you click recursion, OK, now we're 01:45:52.070 --> 01:45:53.320 in night mode for some reason. 01:45:53.320 --> 01:45:57.620 But OK, like, so it just does this endlessly. 01:45:57.620 --> 01:46:00.410 OK, so programmers with too much free time or too much 01:46:00.410 --> 01:46:02.560 control over Google's own servers to do that. 01:46:02.560 --> 01:46:05.060 So it turns out, yes, if you search for recursion on Google, 01:46:05.060 --> 01:46:07.560 this is a long-standing Easter egg in this case. 01:46:07.560 --> 01:46:09.410 But the point of introducing recursion is 01:46:09.410 --> 01:46:12.830 that, one, it's actually going to be a very powerful problem-solving technique 01:46:12.830 --> 01:46:15.770 because honestly, it, one, we've seen in pseudocode already, 01:46:15.770 --> 01:46:18.920 it kind of tightens up the amount of code or the amount of lines 01:46:18.920 --> 01:46:20.900 that you need to write to convey an algorithm. 01:46:20.900 --> 01:46:22.912 And two, it will actually allow us to solve 01:46:22.912 --> 01:46:24.620 problems in a fundamentally different way 01:46:24.620 --> 01:46:27.540 by using computer's memory in an interesting way. 01:46:27.540 --> 01:46:31.220 And so toward that end, we wanted to introduce one final sort today, which 01:46:31.220 --> 01:46:34.520 we won't use humans to demonstrate but just numbers instead, namely 01:46:34.520 --> 01:46:35.990 something called merge sort. 01:46:35.990 --> 01:46:40.790 And merge sort is an algorithm for sorting n numbers that I claim 01:46:40.790 --> 01:46:43.430 is going to be better than selection sort and bubble sort. 01:46:43.430 --> 01:46:46.010 It's got to be better because that n squared was killing us. 01:46:46.010 --> 01:46:48.540 We want to get something lower than n squared. 01:46:48.540 --> 01:46:51.960 So merge sort's pseudocode essentially looks like this. 01:46:51.960 --> 01:46:53.540 And this is it. 01:46:53.540 --> 01:46:58.460 Merge sort says, if you've got n numbers or an array of numbers, 01:46:58.460 --> 01:47:01.430 sort the left half of them, sort the right half of them. 01:47:01.430 --> 01:47:03.412 And then merge the sorted halves. 01:47:03.412 --> 01:47:05.370 And we'll see what this means in just a moment. 01:47:05.370 --> 01:47:07.280 But if there's only one number, just quit. 01:47:07.280 --> 01:47:09.920 So this is my base case because this is recursive. 01:47:09.920 --> 01:47:12.440 Because if this is a sorting algorithm called merge sort, 01:47:12.440 --> 01:47:15.685 I'm kind of cheating by using the verb sort in the sorting algorithm. 01:47:15.685 --> 01:47:17.060 But that just makes it recursive. 01:47:17.060 --> 01:47:18.330 It doesn't make it wrong. 01:47:18.330 --> 01:47:19.730 So what do I mean by merging? 01:47:19.730 --> 01:47:25.522 Just to make this clear, here among these numbers are two lists of size 4. 01:47:25.522 --> 01:47:27.980 And I'm going to scooch them over to the left and the right 01:47:27.980 --> 01:47:32.540 just to make clear that on the left is one half that is sorted-- 01:47:32.540 --> 01:47:35.270 1346 from smallest to largest. 01:47:35.270 --> 01:47:40.460 On the right is a second half that's also sorted 0257. 01:47:40.460 --> 01:47:42.980 So for the sake of discussion, suppose that I'm 01:47:42.980 --> 01:47:46.070 partway through this algorithm called merge sort. 01:47:46.070 --> 01:47:48.380 And I've sorted the left half already, clearly. 01:47:48.380 --> 01:47:50.570 I've sorted the right half already, clearly. 01:47:50.570 --> 01:47:52.790 Now I need to merge the sorted halves. 01:47:52.790 --> 01:47:54.270 What do we mean by that? 01:47:54.270 --> 01:47:57.710 Well, that means to essentially, conceptually, 01:47:57.710 --> 01:48:00.350 point your left hand at the first at the left half. 01:48:00.350 --> 01:48:02.540 Point your right hand at the right half. 01:48:02.540 --> 01:48:05.450 And then decide which of these numbers should come first 01:48:05.450 --> 01:48:08.990 in order to merge or kind of stitch these two lists together 01:48:08.990 --> 01:48:09.980 in sorted order. 01:48:09.980 --> 01:48:11.180 Well, which is smaller? 01:48:11.180 --> 01:48:12.350 Obviously, 0. 01:48:12.350 --> 01:48:16.460 So that last step, merge sorted halves, would have me take this number 01:48:16.460 --> 01:48:19.490 and put it in an extra empty array up here. 01:48:19.490 --> 01:48:22.250 Now I move my right hand to the next number in the right half. 01:48:22.250 --> 01:48:23.870 And I compare the 1 and the 2. 01:48:23.870 --> 01:48:25.440 Obviously, 1 comes next. 01:48:25.440 --> 01:48:27.140 So I put this now up here. 01:48:27.140 --> 01:48:28.880 Now I compare the 3 and the 2. 01:48:28.880 --> 01:48:30.450 Obviously, the 2 comes next. 01:48:30.450 --> 01:48:32.180 And so I put this up here-- 01:48:32.180 --> 01:48:34.670 3 and 5-- obviously, 3-- 01:48:34.670 --> 01:48:40.350 4 and 5-- obviously, 4-- 01:48:40.350 --> 01:48:43.620 6 and 5-- obviously, 5-- 01:48:43.620 --> 01:48:46.800 and 6 and 7-- obviously, 6. 01:48:46.800 --> 01:48:50.550 And I didn't leave quite enough room for 7, perhaps, and lastly, 7. 01:48:50.550 --> 01:48:53.838 So that's all we mean by merging two lists together. 01:48:53.838 --> 01:48:56.130 If they're already sorted, you just kind of stitch them 01:48:56.130 --> 01:48:59.280 together by plucking from one or the other the next number 01:48:59.280 --> 01:49:00.280 that you actually want. 01:49:00.280 --> 01:49:00.872 And that's it. 01:49:00.872 --> 01:49:03.330 And even though I picked up partway through this algorithm, 01:49:03.330 --> 01:49:05.370 those three steps alone would seem to work. 01:49:05.370 --> 01:49:08.310 So long as you can sort the left half and sort the right half, 01:49:08.310 --> 01:49:10.543 you can surely then merge the sorted halves. 01:49:10.543 --> 01:49:13.710 Now I'll go ahead and do this digitally rather than use the physical numbers 01:49:13.710 --> 01:49:16.377 because clearly, it's a little involved moving them up and down. 01:49:16.377 --> 01:49:19.650 But this rack here, this shelving, essentially 01:49:19.650 --> 01:49:22.753 represents one array with maybe a second array here. 01:49:22.753 --> 01:49:25.170 And heck, if I really want it, a third and a fourth array. 01:49:25.170 --> 01:49:27.660 It turns out that with selection sort and bubble sort, 01:49:27.660 --> 01:49:30.120 we've really been tying our hands because I only 01:49:30.120 --> 01:49:33.450 allowed myself a constant amount of memory, just one variable in my head, 01:49:33.450 --> 01:49:35.640 for instance, with selection sort that let me keep 01:49:35.640 --> 01:49:37.662 track of who was the smallest element. 01:49:37.662 --> 01:49:40.120 And when we did bubble sort, the only number I kept in mind 01:49:40.120 --> 01:49:42.460 was i, like i and i plus 1. 01:49:42.460 --> 01:49:44.650 I didn't allow myself any additional memory. 01:49:44.650 --> 01:49:47.740 But it turns out in programming and in CS, you 01:49:47.740 --> 01:49:50.480 can trade off one resource for another. 01:49:50.480 --> 01:49:53.410 So if you want to spend less time solving a problem, 01:49:53.410 --> 01:49:55.150 you've got to throw space at it. 01:49:55.150 --> 01:49:57.460 You've got to spend more space, spend more money, 01:49:57.460 --> 01:50:00.650 in order to give yourself more space to reduce your time. 01:50:00.650 --> 01:50:04.120 Conversely, if you're fine with things being slow in terms of time, 01:50:04.120 --> 01:50:06.400 then you can get away with very little space. 01:50:06.400 --> 01:50:08.950 So it's kind of like this balance whereby 01:50:08.950 --> 01:50:11.710 you have to decide which is more important to you, which 01:50:11.710 --> 01:50:15.350 is more expensive for you, or the like. 01:50:15.350 --> 01:50:20.780 So let's go ahead then and consider exactly this algorithm as follows. 01:50:20.780 --> 01:50:25.580 So suppose that these are the numbers in question. 01:50:25.580 --> 01:50:26.770 So here is our array. 01:50:26.770 --> 01:50:27.820 It's clearly unsorted. 01:50:27.820 --> 01:50:29.140 I'd like to sort this. 01:50:29.140 --> 01:50:30.340 I could use selection sort. 01:50:30.340 --> 01:50:33.130 But selection sort was not great because it's big O of n squared. 01:50:33.130 --> 01:50:36.227 And it's omega of n squared, so sort of damned if you do. 01:50:36.227 --> 01:50:37.060 Damned if you don't. 01:50:37.060 --> 01:50:38.630 Bubble sort was a little better. 01:50:38.630 --> 01:50:39.963 It was still big O of n squared. 01:50:39.963 --> 01:50:42.380 But sometimes, we could get lucky and shave some time off. 01:50:42.380 --> 01:50:44.680 So it was omega of only n. 01:50:44.680 --> 01:50:47.170 Let's see if merge sort is fundamentally better. 01:50:47.170 --> 01:50:51.385 And let's do so by trying to reduce the number of comparisons-- no more looping 01:50:51.385 --> 01:50:54.010 back and forth and back and forth and back and forth endlessly. 01:50:54.010 --> 01:50:58.210 So here's how we can draw inspiration from the idea of recursion and divide 01:50:58.210 --> 01:51:00.160 and conquer as per week zero. 01:51:00.160 --> 01:51:02.980 Let's first think of these numbers as indeed in an array 01:51:02.980 --> 01:51:04.810 contiguously from left to right. 01:51:04.810 --> 01:51:08.590 Let's then go ahead and sort the left half. 01:51:08.590 --> 01:51:10.750 Because again, the pseudocode that you have 01:51:10.750 --> 01:51:14.470 to remember throughout this whole algorithm has just three real steps. 01:51:14.470 --> 01:51:15.670 Sort the left half. 01:51:15.670 --> 01:51:16.930 Sort the right half. 01:51:16.930 --> 01:51:18.160 Merge the sorted halves. 01:51:18.160 --> 01:51:19.540 And then this is sort of a one-off thing. 01:51:19.540 --> 01:51:20.350 But it's important. 01:51:20.350 --> 01:51:21.555 But this is the juicy part. 01:51:21.555 --> 01:51:22.180 Sort left half. 01:51:22.180 --> 01:51:22.847 Sort right half. 01:51:22.847 --> 01:51:24.070 Merge the sorted halves. 01:51:24.070 --> 01:51:27.310 So if we go back to this array on the top shelf, here's the original array. 01:51:27.310 --> 01:51:29.320 Let's go ahead and sort the left half. 01:51:29.320 --> 01:51:32.600 And I'm going to do so by sort of stealing some more memory, 01:51:32.600 --> 01:51:36.160 so I can sort of work on a second shelf with just these numbers. 01:51:36.160 --> 01:51:39.730 So 6341 is now an array of size 4, essentially. 01:51:39.730 --> 01:51:42.070 How do I sort an array of size 4? 01:51:42.070 --> 01:51:44.395 Well, I've got an algorithm for that-- selection sort. 01:51:44.395 --> 01:51:45.520 But we decided that's slow. 01:51:45.520 --> 01:51:46.780 Bubble sort-- that's slow. 01:51:46.780 --> 01:51:47.500 Wait a minute. 01:51:47.500 --> 01:51:49.300 I'm in the middle of defining merge sort. 01:51:49.300 --> 01:51:52.570 Let's recursively use the same algorithm by just 01:51:52.570 --> 01:51:55.130 sorting the left half of the left half. 01:51:55.130 --> 01:51:58.335 So if you've got an array of size 4, it's only three steps to solve it. 01:51:58.335 --> 01:51:58.960 Sort left half. 01:51:58.960 --> 01:51:59.710 Sort right half. 01:51:59.710 --> 01:52:00.400 Merge. 01:52:00.400 --> 01:52:01.540 So let's do that. 01:52:01.540 --> 01:52:03.490 Let's sort the left half. 01:52:03.490 --> 01:52:04.300 OK, wait a minute. 01:52:04.300 --> 01:52:06.728 How do you sort an array of size 2? 01:52:06.728 --> 01:52:08.020 I've got an algorithm for that. 01:52:08.020 --> 01:52:08.710 Sort the left half. 01:52:08.710 --> 01:52:09.543 Sort the right half. 01:52:09.543 --> 01:52:10.870 Merge the two halves. 01:52:10.870 --> 01:52:14.110 All right, let's sort the left half. 01:52:14.110 --> 01:52:15.070 What now? 01:52:15.070 --> 01:52:17.020 So 6 is a list of size 1. 01:52:17.020 --> 01:52:19.580 What was that special base case then? 01:52:19.580 --> 01:52:21.500 So it was quit or just return. 01:52:21.500 --> 01:52:22.530 Like, I'm already done. 01:52:22.530 --> 01:52:26.510 So this list of size 1 is sort of weirdly already sorted. 01:52:26.510 --> 01:52:27.920 So I'm making progress. 01:52:27.920 --> 01:52:31.070 Meanwhile, what's the next step after sorting this left half? 01:52:31.070 --> 01:52:32.720 Sort the right half. 01:52:32.720 --> 01:52:33.710 Done. 01:52:33.710 --> 01:52:35.060 This is already sorted. 01:52:35.060 --> 01:52:36.170 But here's the magic. 01:52:36.170 --> 01:52:38.810 What's the third step for these numbers? 01:52:38.810 --> 01:52:39.310 Merge them. 01:52:39.310 --> 01:52:41.310 So this is like the left hand, right hand thing. 01:52:41.310 --> 01:52:42.260 Which one comes first? 01:52:42.260 --> 01:52:45.380 Obviously, 3 and then 6. 01:52:45.380 --> 01:52:50.638 And now we are making progress because now the list of size 2 is sorted. 01:52:50.638 --> 01:52:51.680 So what have I just done? 01:52:51.680 --> 01:52:56.400 I've just finished sorting the left half of the left half. 01:52:56.400 --> 01:53:01.890 So after I've sorted the left half, what comes next if you rewind in time? 01:53:01.890 --> 01:53:04.870 Sort the right half of that left half. 01:53:04.870 --> 01:53:06.900 So now I take the right half. 01:53:06.900 --> 01:53:09.330 And how do I sort this right half of size 2? 01:53:09.330 --> 01:53:10.920 Well, sort its left half. 01:53:10.920 --> 01:53:11.610 Done. 01:53:11.610 --> 01:53:12.870 Sort its right half. 01:53:12.870 --> 01:53:13.560 Done. 01:53:13.560 --> 01:53:15.180 Now merge them together. 01:53:15.180 --> 01:53:17.880 And obviously, the 1 comes first, then the 4. 01:53:17.880 --> 01:53:19.110 Now where are we? 01:53:19.110 --> 01:53:22.080 We're at the point in the story where we have a left left half 01:53:22.080 --> 01:53:26.040 sorted and a right left half sorted. 01:53:26.040 --> 01:53:28.890 So what comes next now narratively? 01:53:28.890 --> 01:53:35.333 Merge the two together, so left hand, right hand, so 1346. 01:53:35.333 --> 01:53:36.750 And where are we now in the story? 01:53:36.750 --> 01:53:41.220 We're sort of at the beginning because I've now sorted the left half. 01:53:41.220 --> 01:53:43.320 And you recall the demo I did physically, this 01:53:43.320 --> 01:53:46.830 is how I had the left half was already sorted on the second shelf. 01:53:46.830 --> 01:53:50.550 But now that I've sorted the left half of the original array, what's 01:53:50.550 --> 01:53:53.580 the original second step? 01:53:53.580 --> 01:53:54.820 Sort the right half. 01:53:54.820 --> 01:53:55.900 So it's the same idea. 01:53:55.900 --> 01:53:57.780 So I'm borrowing some extra memory here. 01:53:57.780 --> 01:53:59.370 And now I've got an array of size 4. 01:53:59.370 --> 01:54:00.870 How do I sort an array of size 4? 01:54:00.870 --> 01:54:02.010 Sort the left half. 01:54:02.010 --> 01:54:03.900 All right, how do I sort an array of size 2? 01:54:03.900 --> 01:54:05.190 Sort the left half. 01:54:05.190 --> 01:54:06.000 Done. 01:54:06.000 --> 01:54:07.230 Sort the right half. 01:54:07.230 --> 01:54:08.040 Done. 01:54:08.040 --> 01:54:08.670 Now what? 01:54:08.670 --> 01:54:09.990 Merge the two together. 01:54:09.990 --> 01:54:12.090 And of course, it's 2 and 5. 01:54:12.090 --> 01:54:13.080 Now what do I do? 01:54:13.080 --> 01:54:16.300 I've sorted the left half of the right half. 01:54:16.300 --> 01:54:19.830 So now I sort the right half of the right half. 01:54:19.830 --> 01:54:21.600 How do I sort a list of size 2? 01:54:21.600 --> 01:54:22.680 Well, sort the left half. 01:54:22.680 --> 01:54:23.100 Done. 01:54:23.100 --> 01:54:23.940 Sort the right half. 01:54:23.940 --> 01:54:24.600 Done. 01:54:24.600 --> 01:54:27.540 Merge them together-- 0 and 7. 01:54:27.540 --> 01:54:29.100 Where am I in the story? 01:54:29.100 --> 01:54:33.610 I've sorted the left half and the right half of the original right half. 01:54:33.610 --> 01:54:34.980 So I merge these two together-- 01:54:34.980 --> 01:54:38.730 0 and then 2 and then 5 and then 7. 01:54:38.730 --> 01:54:39.360 Whew. 01:54:39.360 --> 01:54:40.230 Where are we? 01:54:40.230 --> 01:54:43.080 We're at exactly the point in the story where we had the numbers 01:54:43.080 --> 01:54:44.670 originally on the second shelf. 01:54:44.670 --> 01:54:48.790 We had a list that's sorted on the left, a list that's sorted on the right. 01:54:48.790 --> 01:54:52.840 And what I demoed physically was merging those two together. 01:54:52.840 --> 01:54:54.100 So what happens now? 01:54:54.100 --> 01:55:00.130 01234567. 01:55:00.130 --> 01:55:04.030 And it seems kind of magical and kind of weird in that I kind of cheated. 01:55:04.030 --> 01:55:05.427 And when I had these leaves-- 01:55:05.427 --> 01:55:07.510 these leaf nodes, so to speak-- these singletons-- 01:55:07.510 --> 01:55:08.648 I was, like, sorted. 01:55:08.648 --> 01:55:09.940 I wasn't really doing anything. 01:55:09.940 --> 01:55:12.700 But it's that merging that seems to really be doing 01:55:12.700 --> 01:55:15.350 the magic of sorting things for us. 01:55:15.350 --> 01:55:17.380 So that felt like a mouthful. 01:55:17.380 --> 01:55:19.300 And recursion in general is the kind of thing 01:55:19.300 --> 01:55:21.440 that will, like, bend your brain a little bit. 01:55:21.440 --> 01:55:23.565 And if that went over your head, like, that's fine. 01:55:23.565 --> 01:55:25.600 It takes time and time and time and practice. 01:55:25.600 --> 01:55:30.550 But what there was not a lot of with this algorithm was again and again 01:55:30.550 --> 01:55:31.960 and again and again. 01:55:31.960 --> 01:55:35.080 I was kind of only doing things once. 01:55:35.080 --> 01:55:38.150 And then once I fixed a number, I moved on to the next. 01:55:38.150 --> 01:55:40.630 And that's sort of the essence of this algorithm here. 01:55:40.630 --> 01:55:43.480 If it helps you to see it in another visual way, 01:55:43.480 --> 01:55:46.300 let me go back to our previous visualization here. 01:55:46.300 --> 01:55:48.790 Let me re-randomize the array and click merge sort. 01:55:48.790 --> 01:55:51.670 And this time, notice that merge sort is using more space. 01:55:51.670 --> 01:55:54.970 Technically, I was using 1, 2, 3 shelves. 01:55:54.970 --> 01:55:57.400 But you can actually be slightly more intelligent about it 01:55:57.400 --> 01:56:00.220 and actually just go back and forth and back and forth between two shelves 01:56:00.220 --> 01:56:01.390 just to save a little space. 01:56:01.390 --> 01:56:03.320 So you're still using twice as much space. 01:56:03.320 --> 01:56:06.190 But you don't need four times as much space as the diagrams 01:56:06.190 --> 01:56:07.670 or as the shelves might imply. 01:56:07.670 --> 01:56:09.040 So here is merge sort. 01:56:09.040 --> 01:56:13.000 And you'll notice that we're sort of working in halves-- sometimes 01:56:13.000 --> 01:56:15.070 big halves, sometimes smaller halves. 01:56:15.070 --> 01:56:18.370 But you can see as the two halves are merged, 01:56:18.370 --> 01:56:20.660 things seem to happen very quickly. 01:56:20.660 --> 01:56:23.860 And so notice that this is the same number of bars as before. 01:56:23.860 --> 01:56:25.630 But that was way faster, right? 01:56:25.630 --> 01:56:29.050 I don't need to stall nearly as much as I have in the past. 01:56:29.050 --> 01:56:30.260 So why is that? 01:56:30.260 --> 01:56:33.340 Well, if we go back to the diagram in question, 01:56:33.340 --> 01:56:35.210 here's the array that's already sorted. 01:56:35.210 --> 01:56:38.440 And if we consider now exactly how much work is done, 01:56:38.440 --> 01:56:41.980 I'll stipulate already it's not n squared. n squared was slow. 01:56:41.980 --> 01:56:44.480 And merge sort is actually much better than that. 01:56:44.480 --> 01:56:46.340 Let's see how much better it is. 01:56:46.340 --> 01:56:48.070 So here is the original list-- 01:56:48.070 --> 01:56:50.740 63415270. 01:56:50.740 --> 01:56:54.160 And here are all of the remnants of that algorithm, all of the states 01:56:54.160 --> 01:56:57.520 that we were in at some point-- sort of leaving a whole bunch of breadcrumbs. 01:56:57.520 --> 01:57:01.300 How many pieces of work did I do? 01:57:01.300 --> 01:57:08.810 Well, like, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 01:57:08.810 --> 01:57:11.470 20, 21, 22, 23, 24. 01:57:11.470 --> 01:57:15.370 So I moved things around, like, 24 times, it seems. 01:57:15.370 --> 01:57:17.830 And how do I actually reason about that? 01:57:17.830 --> 01:57:19.040 Well, these are the numbers. 01:57:19.040 --> 01:57:21.070 This is like my temporary workspace here. 01:57:21.070 --> 01:57:22.447 And 24 is the literal number. 01:57:22.447 --> 01:57:25.030 But let's see if we can't do things a little more generically. 01:57:25.030 --> 01:57:27.790 Log base 2 of n, recall, is what refers to anything 01:57:27.790 --> 01:57:30.492 that we're doing in half-- dividing, dividing, dividing. 01:57:30.492 --> 01:57:32.200 And that's kind of what I was doing here. 01:57:32.200 --> 01:57:36.850 I took a list of size 8-- divide it into two of size 4, then four of size 2, 01:57:36.850 --> 01:57:38.320 then eight of size 1. 01:57:38.320 --> 01:57:41.950 Well, how many times can you do this if you start with eight numbers? 01:57:41.950 --> 01:57:43.600 Well, that's log base 2 of 8. 01:57:43.600 --> 01:57:46.990 And if we do some fancy math, that's just log base 2 of 2 to the third. 01:57:46.990 --> 01:57:49.520 And remember, the base and the number here can cancel out. 01:57:49.520 --> 01:57:50.710 So that's actually 3. 01:57:50.710 --> 01:57:55.450 So log base 2 of 8 is 3, which means that's how many times you 01:57:55.450 --> 01:57:59.800 can divide a problem of size 8 in half, in half, in half. 01:57:59.800 --> 01:58:03.760 But every time we did that per this chart, I had to take the numbers 01:58:03.760 --> 01:58:06.620 and merge them together, merge them together, merge them together. 01:58:06.620 --> 01:58:12.010 And so on every row of this postmortem of the algorithm, 01:58:12.010 --> 01:58:16.270 there are n steps, n steps, n steps. 01:58:16.270 --> 01:58:19.660 So laterally, there's n steps because I had to merge all of those things 01:58:19.660 --> 01:58:20.570 back together. 01:58:20.570 --> 01:58:23.140 But what is the height of these yellow remnants? 01:58:23.140 --> 01:58:26.650 Well, it's 3, which is log base 2 of 8, which is 3. 01:58:26.650 --> 01:58:32.110 So this is technically three times 8, ergo 24 steps. 01:58:32.110 --> 01:58:39.350 But more generally, this is log n height and n width, so to speak. 01:58:39.350 --> 01:58:41.770 So the total running time I claim is actually 01:58:41.770 --> 01:58:46.332 n log n, which it's OK if that doesn't quite gel immediately in your mind, 01:58:46.332 --> 01:58:48.040 especially if you're rusty on algorithms. 01:58:48.040 --> 01:58:51.230 But and we can throw away the base because that's just a constant factor 01:58:51.230 --> 01:58:54.770 and with a wave of our hand when we talk about big O notation, merge sort, 01:58:54.770 --> 01:59:01.070 I claim is in big O of n log n-- that is, n times log n. 01:59:01.070 --> 01:59:05.360 Unfortunately, it is also in omega of n log 01:59:05.360 --> 01:59:09.680 n, which means that, frankly, bubble sort might sometimes outperform it, 01:59:09.680 --> 01:59:14.600 at least when the inputs are already sorted or certainly relatively small. 01:59:14.600 --> 01:59:16.790 But that's probably OK because in general, 01:59:16.790 --> 01:59:18.920 data that we're sorting probably isn't very sorted. 01:59:18.920 --> 01:59:20.810 And honestly, we could even half merge sort 01:59:20.810 --> 01:59:24.500 to just do one pass to check initially, is the whole thing sorted, 01:59:24.500 --> 01:59:26.212 and then maybe terminate early. 01:59:26.212 --> 01:59:29.420 So we can maybe massage the algorithm a little better to be a little smarter. 01:59:29.420 --> 01:59:34.240 But fundamentally, merge sort is in theta of n log n, 01:59:34.240 --> 01:59:35.240 is how you would say it. 01:59:35.240 --> 01:59:38.210 It's on the order of n times log n steps. 01:59:38.210 --> 01:59:42.950 Now, in terms of that chart, it's strictly higher than linear. 01:59:42.950 --> 01:59:47.550 But it's strictly lower than quadratic-- n and n squared, respectively. 01:59:47.550 --> 01:59:49.530 So it clearly seems to be faster. 01:59:49.530 --> 01:59:51.170 So it's not as good as linear search. 01:59:51.170 --> 01:59:53.212 And it's definitely not as good as binary search. 01:59:53.212 --> 01:59:58.140 But it's way better than selection sort or bubble sort actually were. 01:59:58.140 --> 02:00:00.680 So with that said, we thought we'd conclude 02:00:00.680 --> 02:00:03.140 by showing you a final film that's just about 02:00:03.140 --> 02:00:08.000 a minute long that compares these actual algorithms and shows them as follows. 02:00:08.000 --> 02:00:12.230 You'll see on the top, the middle, and the bottom, three different algorithms. 02:00:12.230 --> 02:00:14.570 On the top, you will see selection sort. 02:00:14.570 --> 02:00:16.850 On the bottom, you will see bubble sort. 02:00:16.850 --> 02:00:23.450 And in the middle, you will see and hear an appreciation of what n log n is-- 02:00:23.450 --> 02:00:25.680 a.k.a., merge sort today. 02:00:25.680 --> 02:00:28.200 So if we could dim the lights dramatically, 02:00:28.200 --> 02:00:31.471 this is n log n vis a vis n squared. 02:00:31.471 --> 02:00:32.138 [VIDEO PLAYBACK] 02:00:32.138 --> 02:00:35.126 [MUSIC PLAYING] 02:01:37.837 --> 02:01:38.420 [END PLAYBACK] 02:01:38.420 --> 02:01:40.090 All right, that's it for CS50. 02:01:40.090 --> 02:01:41.620 We'll see you next time. 02:01:41.620 --> 02:01:42.520 [APPLAUSE] 02:01:42.520 --> 02:01:45.570 [MUSIC PLAYING]