1 00:00:00,000 --> 00:00:02,994 [MUSIC PLAYING] 2 00:00:02,994 --> 00:01:01,377 3 00:01:01,377 --> 00:01:04,590 DAVID MALAN: All right, this is CS50. 4 00:01:04,590 --> 00:01:05,700 And this is week 3. 5 00:01:05,700 --> 00:01:09,420 And as promised, we thought we'd take a bit of a break on new syntax this week 6 00:01:09,420 --> 00:01:12,300 and focus a lot more on algorithms and implementation 7 00:01:12,300 --> 00:01:15,840 thereof because over the past few weeks, besides Scratch, we've now had C. 8 00:01:15,840 --> 00:01:17,850 And you have a lot of vocabulary now. 9 00:01:17,850 --> 00:01:20,460 Even if it might not seem yet that you fully 10 00:01:20,460 --> 00:01:23,790 grasp all of the functionality of this particular language, with practice, 11 00:01:23,790 --> 00:01:25,030 it'll get easier and easier. 12 00:01:25,030 --> 00:01:28,410 But today, we'd focus instead on ideas and ultimately on this, 13 00:01:28,410 --> 00:01:32,700 how to think algorithmically, and so to take problems in the real world 14 00:01:32,700 --> 00:01:35,550 and try to quantize them in a way that you can map those puzzle 15 00:01:35,550 --> 00:01:39,210 pieces from week 0 or all of this new syntax from weeks 1 and 2 16 00:01:39,210 --> 00:01:41,792 on to actually writing code to solve those problems. 17 00:01:41,792 --> 00:01:43,500 To contextualize this, though, we thought 18 00:01:43,500 --> 00:01:46,230 we'd remind you of this here chart from week zero. 19 00:01:46,230 --> 00:01:48,900 And recall that in week zero, we painted this picture where 20 00:01:48,900 --> 00:01:52,320 on the x-axis on the horizontal was the size of the problem. 21 00:01:52,320 --> 00:01:54,808 And the number of phone book pages increased 22 00:01:54,808 --> 00:01:56,100 as you went from left to right. 23 00:01:56,100 --> 00:01:59,200 And then on the vertical axis or y-axis, we had time to solve. 
24 00:01:59,200 --> 00:02:01,900 So this was like how many seconds, how many page turns, 25 00:02:01,900 --> 00:02:05,140 how many units of measure-- whatever you're using. 26 00:02:05,140 --> 00:02:07,690 We might actually describe the solution there too. 27 00:02:07,690 --> 00:02:10,660 And the first algorithm we had in week 0 for the phone book 28 00:02:10,660 --> 00:02:12,350 was, like, one page at a time. 29 00:02:12,350 --> 00:02:14,410 So we plotted it with this 1 to 1 slope. 30 00:02:14,410 --> 00:02:16,990 The second algorithm, we started doing two pages 31 00:02:16,990 --> 00:02:19,000 at a time, which did risk a bug. 32 00:02:19,000 --> 00:02:20,680 I might have to double back one page. 33 00:02:20,680 --> 00:02:23,120 But it was still going for the most part twice as fast. 34 00:02:23,120 --> 00:02:27,100 So it was still a slope but sort of a 2 to 1 slope instead of 1 to 1. 35 00:02:27,100 --> 00:02:29,290 But the third and final algorithm, recall, 36 00:02:29,290 --> 00:02:31,210 was sort of fundamentally different. 37 00:02:31,210 --> 00:02:34,250 And it was this logarithmic curve, so to speak, 38 00:02:34,250 --> 00:02:38,435 whereby it kept increasing, increasing, increasing but very, very slowly. 39 00:02:38,435 --> 00:02:40,810 So even if you, like, doubled the size of the phone book, 40 00:02:40,810 --> 00:02:44,810 as by having Cambridge and Allston here in Massachusetts merge, no big deal. 41 00:02:44,810 --> 00:02:49,190 It was just one more page turn, not another 500 or another 1,000. 42 00:02:49,190 --> 00:02:54,730 So think back today on that particular idea of how we began to divide 43 00:02:54,730 --> 00:02:55,900 and conquer the problem. 44 00:02:55,900 --> 00:02:59,330 And that sort of gave us a fundamentally better advantage. 45 00:02:59,330 --> 00:03:02,840 So we thought we'd see if we can apply this lesson learned as follows.
46 00:03:02,840 --> 00:03:06,340 If I were to take, like, attendance today sort of here on stage, 47 00:03:06,340 --> 00:03:10,870 I could do it old school, like, 1, 2, 3, 4, 5, 6, 7, 8, and so 48 00:03:10,870 --> 00:03:13,058 forth, so one step at a time. 49 00:03:13,058 --> 00:03:14,350 I could also double the speed-- 50 00:03:14,350 --> 00:03:18,200 2, 4, 6, 8, 10, 12, 14, 16, and so forth. 51 00:03:18,200 --> 00:03:20,890 But I dare say we can learn a bit from week zero. 52 00:03:20,890 --> 00:03:24,040 And if you'll indulge me right in place, could everyone 53 00:03:24,040 --> 00:03:26,865 stand up and think of the number one? 54 00:03:26,865 --> 00:03:28,990 So right where you are, just stand up if you could. 55 00:03:28,990 --> 00:03:32,650 56 00:03:32,650 --> 00:03:33,250 Stand up. 57 00:03:33,250 --> 00:03:34,872 And think of the number one. 58 00:03:34,872 --> 00:03:38,080 So at this point, hopefully, everyone's literally thinking of the number one. 59 00:03:38,080 --> 00:03:42,010 And the second step of this algorithm, which I claim ultimately theoretically 60 00:03:42,010 --> 00:03:45,490 should be much faster than either my one person at a time 61 00:03:45,490 --> 00:03:47,740 or two people at a time, step two is this. 62 00:03:47,740 --> 00:03:49,240 Pair off with someone standing. 63 00:03:49,240 --> 00:03:51,340 And add their number to yours. 64 00:03:51,340 --> 00:03:52,345 And remember the sum. 65 00:03:52,345 --> 00:03:59,620 66 00:03:59,620 --> 00:04:03,400 Person can be in front of, behind, left, or right of you. 67 00:04:03,400 --> 00:04:06,510 68 00:04:06,510 --> 00:04:10,950 All right, most likely most everyone in the room, assuming you 69 00:04:10,950 --> 00:04:13,620 found someone, is thinking of the number two 70 00:04:13,620 --> 00:04:16,073 now, unless you're sort of an odd person out in the row. 71 00:04:16,073 --> 00:04:16,740 And that's fine. 72 00:04:16,740 --> 00:04:18,115 If you're still one, that's fine.
73 00:04:18,115 --> 00:04:19,800 But most of you are probably two. 74 00:04:19,800 --> 00:04:23,925 Next step is that one of you in those pairs should sit down. 75 00:04:23,925 --> 00:04:29,290 76 00:04:29,290 --> 00:04:34,210 OK, so many of you tried to sit down as quickly as possible we noticed. 77 00:04:34,210 --> 00:04:37,898 But so next step now, at this point, rather, most of you 78 00:04:37,898 --> 00:04:39,190 are thinking of the number two. 79 00:04:39,190 --> 00:04:41,023 A few of you are thinking of the number one. 80 00:04:41,023 --> 00:04:41,710 And that's OK. 81 00:04:41,710 --> 00:04:45,130 The next step, and notice we're about to induce a loop, so the rest of this 82 00:04:45,130 --> 00:04:49,195 is on you, if still standing, go back to step two. 83 00:04:49,195 --> 00:04:53,584 84 00:04:53,584 --> 00:04:56,572 [INTERPOSING VOICES] 85 00:04:56,572 --> 00:05:18,030 86 00:05:18,030 --> 00:05:20,490 DAVID MALAN: If still standing, notice that this is a loop. 87 00:05:20,490 --> 00:05:21,270 So keep going. 88 00:05:21,270 --> 00:05:24,914 Keep going if still standing. 89 00:05:24,914 --> 00:05:27,680 [INTERPOSING VOICES] 90 00:05:27,680 --> 00:05:29,100 91 00:05:29,100 --> 00:05:30,990 DAVID MALAN: There? 92 00:05:30,990 --> 00:05:33,276 How about there? 93 00:05:33,276 --> 00:05:35,702 [INTERPOSING VOICES] 94 00:05:35,702 --> 00:05:36,660 DAVID MALAN: That's OK. 95 00:05:36,660 --> 00:05:37,570 But now keep going. 96 00:05:37,570 --> 00:05:38,070 Keep going. 97 00:05:38,070 --> 00:05:39,900 Keep pairing off, so maybe you two. 98 00:05:39,900 --> 00:05:42,610 99 00:05:42,610 --> 00:05:45,823 All right, a few more seconds. 100 00:05:45,823 --> 00:05:48,709 [INTERPOSING VOICES] 101 00:05:48,709 --> 00:05:49,922 102 00:05:49,922 --> 00:05:51,255 DAVID MALAN: So step two, still. 103 00:05:51,255 --> 00:05:55,810 104 00:05:55,810 --> 00:05:57,670 All right, keep pairing if you're standing. 
105 00:05:57,670 --> 00:06:00,595 106 00:06:00,595 --> 00:06:03,529 [INTERPOSING VOICES] 107 00:06:03,529 --> 00:06:08,717 108 00:06:08,717 --> 00:06:09,675 DAVID MALAN: All right. 109 00:06:09,675 --> 00:06:12,420 110 00:06:12,420 --> 00:06:16,325 All right, so theoretically, there's only one person standing left. 111 00:06:16,325 --> 00:06:17,700 But clearly, that's not the case. 112 00:06:17,700 --> 00:06:18,450 That's fine. 113 00:06:18,450 --> 00:06:22,198 I will help with the pairing because some of you are far away. 114 00:06:22,198 --> 00:06:22,740 So let's see. 115 00:06:22,740 --> 00:06:24,450 What's your number here? 116 00:06:24,450 --> 00:06:25,680 Sorry? 117 00:06:25,680 --> 00:06:27,620 What's your number? 118 00:06:27,620 --> 00:06:28,120 Eight. 119 00:06:28,120 --> 00:06:29,350 OK, go ahead and sit down. 120 00:06:29,350 --> 00:06:30,160 How about in back? 121 00:06:30,160 --> 00:06:31,480 What's your number? 122 00:06:31,480 --> 00:06:32,270 46. 123 00:06:32,270 --> 00:06:32,770 Nice. 124 00:06:32,770 --> 00:06:34,000 OK, go ahead and sit down. 125 00:06:34,000 --> 00:06:35,860 Who else is standing? 126 00:06:35,860 --> 00:06:38,840 Over here, what's your number? 127 00:06:38,840 --> 00:06:40,490 You're 16? 128 00:06:40,490 --> 00:06:42,350 OK, so go ahead and sit down. 129 00:06:42,350 --> 00:06:43,820 And behind you? 130 00:06:43,820 --> 00:06:44,480 48. 131 00:06:44,480 --> 00:06:46,400 OK, go ahead and sit down. 132 00:06:46,400 --> 00:06:48,290 Is anyone still standing? 133 00:06:48,290 --> 00:06:49,400 Yeah? 134 00:06:49,400 --> 00:06:50,270 32. 135 00:06:50,270 --> 00:06:50,960 Nice. 136 00:06:50,960 --> 00:06:53,180 Still standing over here. 137 00:06:53,180 --> 00:06:54,530 43. 138 00:06:54,530 --> 00:06:55,140 OK, nice. 139 00:06:55,140 --> 00:06:55,640 Sit down. 140 00:06:55,640 --> 00:06:56,140 Sit down. 141 00:06:56,140 --> 00:06:58,220 And anyone else still standing here? 142 00:06:58,220 --> 00:06:59,480 22. 
143 00:06:59,480 --> 00:07:01,160 Go ahead and sit down. 144 00:07:01,160 --> 00:07:03,860 Is anyone still standing and participating? 145 00:07:03,860 --> 00:07:07,400 Yeah, where-- oh, yeah. 146 00:07:07,400 --> 00:07:08,357 16. 147 00:07:08,357 --> 00:07:09,440 OK, go ahead and sit down. 148 00:07:09,440 --> 00:07:12,360 Anyone else still standing? 149 00:07:12,360 --> 00:07:14,250 OK, so theoretically, everyone's paired off. 150 00:07:14,250 --> 00:07:16,330 You were the last person standing. 151 00:07:16,330 --> 00:07:19,257 So when I hit Enter here, having just greased the wheels 152 00:07:19,257 --> 00:07:21,090 to do all of the remaining additions myself, 153 00:07:21,090 --> 00:07:24,810 we should have the total count of people in the room. 154 00:07:24,810 --> 00:07:27,330 And recognize that unlike my algorithm, which 155 00:07:27,330 --> 00:07:29,910 would have required pointing at each and every person 156 00:07:29,910 --> 00:07:33,930 or my second algorithm, which would mean pointing at every two people 157 00:07:33,930 --> 00:07:38,010 twice as fast, theoretically, the algorithm you all just executed, 158 00:07:38,010 --> 00:07:40,480 I daresay, should've been fundamentally faster. 159 00:07:40,480 --> 00:07:40,980 Why? 160 00:07:40,980 --> 00:07:43,680 Because no matter how many people in the room-- maybe, like, 161 00:07:43,680 --> 00:07:45,840 if there were 1,000 people in the room, there 162 00:07:45,840 --> 00:07:50,280 would then have been 500, just as there would be 500 pages in week zero. 163 00:07:50,280 --> 00:07:52,320 Then from 500, there'd be 250-- 164 00:07:52,320 --> 00:07:53,200 125. 165 00:07:53,200 --> 00:07:53,700 Why? 166 00:07:53,700 --> 00:07:55,783 Because on each step of the algorithm, half of you 167 00:07:55,783 --> 00:07:59,310 theoretically were sitting down, sitting down, sitting down, dividing 168 00:07:59,310 --> 00:08:00,880 and conquering that problem. 
169 00:08:00,880 --> 00:08:05,370 So the total number of people in the room as of now according to your count 170 00:08:05,370 --> 00:08:08,460 is 231. 171 00:08:08,460 --> 00:08:11,650 As a backup, though, Carter kindly did it the old-fashioned way, 172 00:08:11,650 --> 00:08:12,760 one person at a time. 173 00:08:12,760 --> 00:08:16,310 And Carter, the actual number of people in the room is? 174 00:08:16,310 --> 00:08:18,070 [LAUGHS] 175 00:08:18,070 --> 00:08:21,220 OK, so our first real world bug to be fair. 176 00:08:21,220 --> 00:08:23,510 So theoretically, that should have worked. 177 00:08:23,510 --> 00:08:26,230 But clearly, we lost some numbers along the way, so a bug 178 00:08:26,230 --> 00:08:27,620 that we can fix today. 179 00:08:27,620 --> 00:08:31,810 But remember that really, this is just similar in spirit 180 00:08:31,810 --> 00:08:34,750 to that algorithm we indeed did in week zero. 181 00:08:34,750 --> 00:08:36,532 It's the same as the phone book example. 182 00:08:36,532 --> 00:08:37,990 It went off the rails in this case. 183 00:08:37,990 --> 00:08:41,320 But it's the same idea, ultimately dividing and conquering. 184 00:08:41,320 --> 00:08:44,402 And any time you have this halving, halving, halving, 185 00:08:44,402 --> 00:08:46,360 there's going to be a logarithm involved there, 186 00:08:46,360 --> 00:08:48,152 even if you're a little rusty on your math. 187 00:08:48,152 --> 00:08:51,220 And that's fundamentally faster than just doing something n times 188 00:08:51,220 --> 00:08:55,090 or even n divided by 2 times where n is the number of people in the room 189 00:08:55,090 --> 00:08:57,980 or in week zero, the number of pages in the phone book. 
190 00:08:57,980 --> 00:09:01,450 So even when you're using your iPhone or Android device later today, like, 191 00:09:01,450 --> 00:09:04,750 if you search for a contact using autocomplete, 192 00:09:04,750 --> 00:09:09,070 it is that so-called divide and conquer algorithm that's 193 00:09:09,070 --> 00:09:10,930 finding people in your address book. 194 00:09:10,930 --> 00:09:13,570 It's not starting top to bottom or bottom up. 195 00:09:13,570 --> 00:09:15,520 It's probably going roughly to the middle 196 00:09:15,520 --> 00:09:19,300 and then doing the top half or the bottom half, repeating again and again. 197 00:09:19,300 --> 00:09:21,040 So these ideas are everywhere. 198 00:09:21,040 --> 00:09:23,340 And hopefully, you end up finding just the one 199 00:09:23,340 --> 00:09:25,090 person you're looking for or in your case, 200 00:09:25,090 --> 00:09:28,030 the one person last standing who should have theoretically 201 00:09:28,030 --> 00:09:32,350 had the count of everyone because if each of you started by representing 1, 202 00:09:32,350 --> 00:09:35,440 effectively handed off your number, handed off your number, 203 00:09:35,440 --> 00:09:38,260 handed off your number, it theoretically should have coalesced 204 00:09:38,260 --> 00:09:40,090 in that final person standing. 205 00:09:40,090 --> 00:09:43,730 So let's consider the connection now between this idea 206 00:09:43,730 --> 00:09:47,680 and what we introduced last week, which was this idea of very simple data 207 00:09:47,680 --> 00:09:52,240 structures in your computer's memory-- like, actually using this memory as 208 00:09:52,240 --> 00:09:54,700 though it's kind of a grid of bytes. 209 00:09:54,700 --> 00:09:58,190 Each one of these squares, recall, represents 1 byte or 8 bits. 210 00:09:58,190 --> 00:10:00,250 And we can get rid of the hardware and sort 211 00:10:00,250 --> 00:10:03,850 of abstract it away as just this grid of memory or this canvas. 
212 00:10:03,850 --> 00:10:05,680 And then we introduced arrays last week. 213 00:10:05,680 --> 00:10:08,680 And what was the key definition of an array? 214 00:10:08,680 --> 00:10:11,420 How would you describe an array? 215 00:10:11,420 --> 00:10:14,400 What is it, anyone? 216 00:10:14,400 --> 00:10:15,540 What's an array? 217 00:10:15,540 --> 00:10:18,180 Yeah, in the middle. 218 00:10:18,180 --> 00:10:20,720 A collection. 219 00:10:20,720 --> 00:10:21,330 A collection. 220 00:10:21,330 --> 00:10:24,790 And I don't love data types only because in C, it tends to be the same type. 221 00:10:24,790 --> 00:10:25,790 So a collection of data. 222 00:10:25,790 --> 00:10:26,450 I do like that. 223 00:10:26,450 --> 00:10:28,370 But there's one other key characteristic. 224 00:10:28,370 --> 00:10:29,670 Do you want to be more precise? 225 00:10:29,670 --> 00:10:32,850 It's not just a collection-- 226 00:10:32,850 --> 00:10:35,680 something about where it is. 227 00:10:35,680 --> 00:10:36,880 Potentially strings. 228 00:10:36,880 --> 00:10:39,850 But strings are just an example of putting char, char, char, char. 229 00:10:39,850 --> 00:10:42,500 It could certainly be integers or floating point values. 230 00:10:42,500 --> 00:10:44,410 Another characteristic? 231 00:10:44,410 --> 00:10:45,220 Sorry? 232 00:10:45,220 --> 00:10:46,433 It's not necessarily ordered. 233 00:10:46,433 --> 00:10:48,100 Actually, we'll come back to that today. 234 00:10:48,100 --> 00:10:49,340 It could be in any order. 235 00:10:49,340 --> 00:10:52,090 And certainly a string isn't necessarily in sorted order. 236 00:10:52,090 --> 00:10:54,890 It's in whatever the word is. 237 00:10:54,890 --> 00:10:56,450 It's a list in concept. 238 00:10:56,450 --> 00:10:59,750 But there was something key about where we put things in memory. 239 00:10:59,750 --> 00:11:01,580 Yeah? 240 00:11:01,580 --> 00:11:05,420 Consecutive-- the memory is consecutive, a.k.a., contiguous. 
241 00:11:05,420 --> 00:11:08,660 An array is important in C because, yes, it's a list of values. 242 00:11:08,660 --> 00:11:10,130 Yes, it's a collection of values. 243 00:11:10,130 --> 00:11:15,080 But the real key distinction in an array in C is that it's contiguous. 244 00:11:15,080 --> 00:11:19,740 The bytes are back to back to back somewhere in the computer's memory, 245 00:11:19,740 --> 00:11:23,930 at least for any given data type, be it an int, a float, a bigger string. 246 00:11:23,930 --> 00:11:27,140 All of the characters are back to back to back in the computer's memory, 247 00:11:27,140 --> 00:11:30,150 not spread all out, even if you have space elsewhere. 248 00:11:30,150 --> 00:11:35,190 So with that said, we can actually start to solve problems with that mindset. 249 00:11:35,190 --> 00:11:39,020 And for instance, if I kind of pare this down to just an abstract array of size 250 00:11:39,020 --> 00:11:42,860 1, 2, 3, 4, 5, 6, 7, for instance, suppose 251 00:11:42,860 --> 00:11:47,220 that there are these numbers inside of this array of memory. 252 00:11:47,220 --> 00:11:51,080 So here are seven integers or ints in C. I have in this case 253 00:11:51,080 --> 00:11:55,100 sorted them just to make the numbers pop out as obviously smallest to largest. 254 00:11:55,100 --> 00:11:58,070 But the catch with C is that if I were to ask you 255 00:11:58,070 --> 00:12:01,650 or if you were to ask the computer through code to find you the number 50, 256 00:12:01,650 --> 00:12:03,390 well, obviously, every human in this room 257 00:12:03,390 --> 00:12:07,800 just found it obviously right there because we kind of have this bird's eye 258 00:12:07,800 --> 00:12:10,500 view of the whole memory at once. 259 00:12:10,500 --> 00:12:12,990 But the computer ironically does not have 260 00:12:12,990 --> 00:12:15,240 that bird's eye view of its own memory. 261 00:12:15,240 --> 00:12:18,820 It can only look at each location one at a time. 
262 00:12:18,820 --> 00:12:22,080 So if you really were to do this like a computer, you would kind of have to, 263 00:12:22,080 --> 00:12:26,040 like, shield your eye and only look at one number at a time from left 264 00:12:26,040 --> 00:12:28,030 to right, from right to left, or in any order 265 00:12:28,030 --> 00:12:31,487 in order to find is the 50 actually there. 266 00:12:31,487 --> 00:12:32,820 You can't just take a step back. 267 00:12:32,820 --> 00:12:34,540 And boom, it pops out at you. 268 00:12:34,540 --> 00:12:37,230 So this is kind of analogous, this array, 269 00:12:37,230 --> 00:12:39,690 to being like a set of gym lockers or school 270 00:12:39,690 --> 00:12:42,828 lockers like this where the doors are actually closed by default. 271 00:12:42,828 --> 00:12:43,870 The numbers are in there. 272 00:12:43,870 --> 00:12:46,120 But the doors are closed, which is to say the computer 273 00:12:46,120 --> 00:12:47,562 and we can't actually look. 274 00:12:47,562 --> 00:12:49,020 So we couldn't find yellow lockers. 275 00:12:49,020 --> 00:12:51,010 But we did find red lockers here. 276 00:12:51,010 --> 00:12:53,850 And so I propose that you think of these lockers on the stage 277 00:12:53,850 --> 00:12:57,270 here of which there are seven as well as representing an array. 278 00:12:57,270 --> 00:13:01,740 And just so we have some terminology, notice that I've labeled these bracket 279 00:13:01,740 --> 00:13:05,220 0, bracket 1, bracket 2, 3, 4, 5 and 6. 280 00:13:05,220 --> 00:13:07,410 And the bracket notation, recall, is the new syntax 281 00:13:07,410 --> 00:13:10,230 from last week that lets you index into-- 282 00:13:10,230 --> 00:13:13,450 go to a specific spot inside of an array. 283 00:13:13,450 --> 00:13:15,720 And notice that even though there are seven lockers, 284 00:13:15,720 --> 00:13:17,520 I only counted as high as six. 285 00:13:17,520 --> 00:13:20,850 But again, that's just a side effect of our generally counting from 0.
286 00:13:20,850 --> 00:13:27,330 So 0 through 6 or 0 through n minus 1 because if there are n equal 7 lockers, 287 00:13:27,330 --> 00:13:28,590 n minus 1 is 6. 288 00:13:28,590 --> 00:13:31,900 So that's the left bound and the right bound respectively. 289 00:13:31,900 --> 00:13:37,560 So suppose that we were to use these lockers as representing a problem, 290 00:13:37,560 --> 00:13:40,320 like we want to find an actual number behind these doors. 291 00:13:40,320 --> 00:13:42,820 So this is actually a very common problem in the real world. 292 00:13:42,820 --> 00:13:46,650 And you and I take for granted every day that big companies like Google 293 00:13:46,650 --> 00:13:49,110 and Microsoft and others, like, do this for us 294 00:13:49,110 --> 00:13:52,920 constantly, not to mention AI doing something similar nowadays, 295 00:13:52,920 --> 00:13:54,360 searching for information. 296 00:13:54,360 --> 00:13:56,940 And we'll focus on some basics first that 297 00:13:56,940 --> 00:13:59,542 will lead us to more sophisticated algorithms ultimately. 298 00:13:59,542 --> 00:14:01,500 But all we're going to talk about fundamentally 299 00:14:01,500 --> 00:14:05,160 is this same picture from week zero and from week one and from week two 300 00:14:05,160 --> 00:14:07,120 where here is a problem to be solved. 301 00:14:07,120 --> 00:14:11,295 So if for instance, the input to this problem is an array of numbers-- 302 00:14:11,295 --> 00:14:12,420 an array of seven numbers-- 303 00:14:12,420 --> 00:14:15,870 I can't see them all at once-- but I'm looking for something like the number 304 00:14:15,870 --> 00:14:16,560 50-- 305 00:14:16,560 --> 00:14:18,540 ultimately, I want to get back out. 306 00:14:18,540 --> 00:14:20,280 I claim true or false. 307 00:14:20,280 --> 00:14:21,900 Like, the number 50 is there. 308 00:14:21,900 --> 00:14:22,740 Or it is not. 309 00:14:22,740 --> 00:14:25,050 That's one way of thinking about the search problem. 
310 00:14:25,050 --> 00:14:27,390 Find me some piece of data if it's there. 311 00:14:27,390 --> 00:14:31,060 Otherwise, tell me that it's not-- true or false, respectively. 312 00:14:31,060 --> 00:14:33,772 So the algorithm inside of this black box, 313 00:14:33,772 --> 00:14:36,480 though, is where we're going to actually have to do some thinking 314 00:14:36,480 --> 00:14:41,097 and actually figure out how best to find the number or the data we care about. 315 00:14:41,097 --> 00:14:43,680 And even though we'll use numbers to keep things simple today, 316 00:14:43,680 --> 00:14:47,190 you could certainly generalize this to web pages or contacts 317 00:14:47,190 --> 00:14:51,450 or any other type of information that are in some computer or database more 318 00:14:51,450 --> 00:14:52,240 generally. 319 00:14:52,240 --> 00:14:56,970 So maybe to keep things interesting, could we 320 00:14:56,970 --> 00:14:59,028 get-- how about two volunteers? 321 00:14:59,028 --> 00:14:59,820 Wow, that was fast. 322 00:14:59,820 --> 00:15:02,080 OK, come on down. 323 00:15:02,080 --> 00:15:03,630 And how about one other volunteer? 324 00:15:03,630 --> 00:15:04,050 I'll go over here. 325 00:15:04,050 --> 00:15:04,800 OK, how about you? 326 00:15:04,800 --> 00:15:06,300 Come on down. 327 00:15:06,300 --> 00:15:08,670 Sure, round of applause for our volunteers. 328 00:15:08,670 --> 00:15:10,380 [APPLAUSE] 329 00:15:10,380 --> 00:15:13,670 OK, welcome. 330 00:15:13,670 --> 00:15:14,250 Come on over. 331 00:15:14,250 --> 00:15:15,875 Do you want to introduce yourself to the group? 332 00:15:15,875 --> 00:15:16,970 AUDIENCE: [INAUDIBLE]. 333 00:15:16,970 --> 00:15:18,760 DAVID MALAN: Like a few seconds is fine. 334 00:15:18,760 --> 00:15:20,155 AUDIENCE: Hi, I'm Sam. 335 00:15:20,155 --> 00:15:21,895 I am not a CS concentration. 336 00:15:21,895 --> 00:15:24,580 DAVID MALAN: [LAUGHS] So what are you? 337 00:15:24,580 --> 00:15:25,705 Do you know yet? 
338 00:15:25,705 --> 00:15:26,830 AUDIENCE: Applied math. 339 00:15:26,830 --> 00:15:27,497 DAVID MALAN: OK. 340 00:15:27,497 --> 00:15:27,997 Nice. 341 00:15:27,997 --> 00:15:28,720 Nice to meet you. 342 00:15:28,720 --> 00:15:29,140 Welcome. 343 00:15:29,140 --> 00:15:29,640 And? 344 00:15:29,640 --> 00:15:30,910 AUDIENCE: Hi, I'm Louis. 345 00:15:30,910 --> 00:15:32,740 I'm from first year, Matthews. 346 00:15:32,740 --> 00:15:34,705 And I'm going to do Econ with Stats. 347 00:15:34,705 --> 00:15:36,690 DAVID MALAN: Oh, you're in Matthews too? 348 00:15:36,690 --> 00:15:37,190 OK. 349 00:15:37,190 --> 00:15:40,720 [CHUCKLES] I was in Matthews too, so Matthews South? 350 00:15:40,720 --> 00:15:41,350 Oh, wow. 351 00:15:41,350 --> 00:15:42,220 Oh, my god. 352 00:15:42,220 --> 00:15:43,840 I was room 201. 353 00:15:43,840 --> 00:15:44,986 AUDIENCE: 103. 354 00:15:44,986 --> 00:15:46,630 DAVID MALAN: Fifth floor-- 355 00:15:46,630 --> 00:15:48,280 all right, so anyhow. 356 00:15:48,280 --> 00:15:50,337 And so we have Louis and your name? 357 00:15:50,337 --> 00:15:50,920 AUDIENCE: Sam. 358 00:15:50,920 --> 00:15:51,460 DAVID MALAN: Sam. 359 00:15:51,460 --> 00:15:51,940 Louis and Sam. 360 00:15:51,940 --> 00:15:54,550 So Louis and I are going to step off to the side for just a moment because Sam, 361 00:15:54,550 --> 00:15:56,077 we have a problem for you to solve. 362 00:15:56,077 --> 00:15:57,910 This feels a little bit like Price Is Right. 363 00:15:57,910 --> 00:16:00,010 But behind you are these seven lockers. 364 00:16:00,010 --> 00:16:03,460 And we'd like you to just find us the number 50. 365 00:16:03,460 --> 00:16:05,050 That's all the information you get. 366 00:16:05,050 --> 00:16:08,205 But we'd like you to then explain how you go about finding it. 367 00:16:08,205 --> 00:16:10,122 AUDIENCE: Wait, are they in order or anything? 368 00:16:10,122 --> 00:16:12,486 Or [INAUDIBLE] it just kind of [INAUDIBLE]?
369 00:16:12,486 --> 00:16:15,540 DAVID MALAN: Find us the number 50. 370 00:16:15,540 --> 00:16:17,370 And then tell us how you found it. 371 00:16:17,370 --> 00:16:20,662 372 00:16:20,662 --> 00:16:22,745 OK, what was in there, just so the audience knows. 373 00:16:22,745 --> 00:16:23,620 AUDIENCE: [INAUDIBLE] 374 00:16:23,620 --> 00:16:24,850 DAVID MALAN: Yes. 375 00:16:24,850 --> 00:16:25,550 [LAUGHTER] 376 00:16:25,550 --> 00:16:28,818 AUDIENCE: It was $10, a very, very big $10 bill. 377 00:16:28,818 --> 00:16:29,485 DAVID MALAN: OK. 378 00:16:29,485 --> 00:16:30,423 AUDIENCE: Fake money. 379 00:16:30,423 --> 00:16:31,090 DAVID MALAN: OK. 380 00:16:31,090 --> 00:16:33,925 381 00:16:33,925 --> 00:16:39,310 AUDIENCE: That was $100. 382 00:16:39,310 --> 00:16:42,100 $1. 383 00:16:42,100 --> 00:16:43,570 $5. 384 00:16:43,570 --> 00:16:45,400 OK, I'm not lucky. 385 00:16:45,400 --> 00:16:46,810 Oh, I found it. 386 00:16:46,810 --> 00:16:47,560 DAVID MALAN: Nice. 387 00:16:47,560 --> 00:16:49,180 Take it out and so people can believe. 388 00:16:49,180 --> 00:16:50,530 All right, wonderful. 389 00:16:50,530 --> 00:16:52,390 So you found the 50. 390 00:16:52,390 --> 00:16:53,260 [APPLAUSE] 391 00:16:53,260 --> 00:16:55,305 And now if you could explain. 392 00:16:55,305 --> 00:16:55,930 I'll take that. 393 00:16:55,930 --> 00:16:58,060 If you could explain, what was your algorithm? 394 00:16:58,060 --> 00:17:00,810 What was the step-by-step approach you took? 395 00:17:00,810 --> 00:17:03,490 AUDIENCE: I didn't really have one. 396 00:17:03,490 --> 00:17:04,720 I just started on one end. 397 00:17:04,720 --> 00:17:07,450 And I went down because it wasn't in order or anything. 398 00:17:07,450 --> 00:17:09,400 So I just kept going until I found it. 399 00:17:09,400 --> 00:17:10,720 DAVID MALAN: OK, so that's actually pretty fair. 400 00:17:10,720 --> 00:17:12,262 And in fact, let's step forward here. 
401 00:17:12,262 --> 00:17:14,829 Carter's going to very secretly kind of shuffle the numbers 402 00:17:14,829 --> 00:17:16,880 in a particular new arrangement here. 403 00:17:16,880 --> 00:17:19,390 And so you really went from right to left. 404 00:17:19,390 --> 00:17:22,960 And I dare say maybe going from left to right might be equivalent. 405 00:17:22,960 --> 00:17:25,750 But could she have done better? 406 00:17:25,750 --> 00:17:29,590 Could she have done better, because that took 1, 2, 3, 4, 5 steps? 407 00:17:29,590 --> 00:17:31,970 Could Sam have done better? 408 00:17:31,970 --> 00:17:32,470 Sure. 409 00:17:32,470 --> 00:17:35,270 How? 410 00:17:35,270 --> 00:17:36,527 OK, so you got lucky. 411 00:17:36,527 --> 00:17:38,360 You could have found the number in one step. 412 00:17:38,360 --> 00:17:40,400 Although, luck isn't really an algorithm. 413 00:17:40,400 --> 00:17:42,780 It really isn't a step-by-step approach. 414 00:17:42,780 --> 00:17:45,053 So another thought? 415 00:17:45,053 --> 00:17:46,890 AUDIENCE: [INAUDIBLE] dollar bills-- 416 00:17:46,890 --> 00:17:49,656 sort them in order? 417 00:17:49,656 --> 00:17:51,270 DAVID MALAN: Oh, interesting. 418 00:17:51,270 --> 00:17:54,492 So you could have taken out all of the dollar bills, sorted them, 419 00:17:54,492 --> 00:17:57,200 put them back in, and then you probably could have done something 420 00:17:57,200 --> 00:17:59,315 like the divide and conquer approach. 421 00:17:59,315 --> 00:18:00,520 AUDIENCE: I didn't know I was allowed to do that. 422 00:18:00,520 --> 00:18:01,400 DAVID MALAN: No, you weren't allowed to. 423 00:18:01,400 --> 00:18:02,360 So that's fine. 424 00:18:02,360 --> 00:18:02,810 AUDIENCE: [INAUDIBLE] 425 00:18:02,810 --> 00:18:04,852 DAVID MALAN: But that would be a valid algorithm. 
426 00:18:04,852 --> 00:18:07,970 Although, it sounds very inefficient to do all this work just 427 00:18:07,970 --> 00:18:10,128 to find the number 50 because in doing that work, 428 00:18:10,128 --> 00:18:11,420 she would've found the 50 once. 429 00:18:11,420 --> 00:18:13,420 But that might actually be a reasonable solution 430 00:18:13,420 --> 00:18:17,330 if she plans to search again and again and again, not just for seven numbers 431 00:18:17,330 --> 00:18:18,560 but maybe a lot of numbers. 432 00:18:18,560 --> 00:18:21,690 Maybe you do want to incur some cost up front and sort all the data 433 00:18:21,690 --> 00:18:22,910 so as to find it faster. 434 00:18:22,910 --> 00:18:24,320 But let's do this a little more methodically. 435 00:18:24,320 --> 00:18:25,862 We'll step off to the side once more. 436 00:18:25,862 --> 00:18:29,120 And just by convention, if you want to go ahead and-- do we need this? 437 00:18:29,120 --> 00:18:29,660 OK. 438 00:18:29,660 --> 00:18:30,368 AUDIENCE: Got it. 439 00:18:30,368 --> 00:18:32,390 DAVID MALAN: OK, let's go from left to right. 440 00:18:32,390 --> 00:18:32,900 Yep. 441 00:18:32,900 --> 00:18:34,858 And go ahead and show everyone the numbers just 442 00:18:34,858 --> 00:18:37,430 to prove that there's no funny business here. 443 00:18:37,430 --> 00:18:38,848 AUDIENCE: 20. 444 00:18:38,848 --> 00:18:41,596 DAVID MALAN: [LAUGHS] 445 00:18:41,596 --> 00:18:42,960 AUDIENCE: Oh, OK. 446 00:18:42,960 --> 00:18:44,010 500. 447 00:18:44,010 --> 00:18:44,760 DAVID MALAN: Nice. 448 00:18:44,760 --> 00:18:48,191 449 00:18:48,191 --> 00:18:49,340 AUDIENCE: 10. 450 00:18:49,340 --> 00:18:50,290 DAVID MALAN: Mm-hm. 451 00:18:50,290 --> 00:18:54,212 So it doesn't sound like it's sorted, either, this time. 452 00:18:54,212 --> 00:18:54,945 AUDIENCE: Five. 453 00:18:54,945 --> 00:18:58,040 DAVID MALAN: Nice. 454 00:18:58,040 --> 00:18:59,826 And? 455 00:18:59,826 --> 00:19:00,940 AUDIENCE: 100.
456 00:19:00,940 --> 00:19:02,280 DAVID MALAN: Oh, so close. 457 00:19:02,280 --> 00:19:05,090 458 00:19:05,090 --> 00:19:06,057 AUDIENCE: [INAUDIBLE] 459 00:19:06,057 --> 00:19:06,890 DAVID MALAN: I know. 460 00:19:06,890 --> 00:19:07,730 Now we're just messing with you. 461 00:19:07,730 --> 00:19:08,230 AUDIENCE: 1. 462 00:19:08,230 --> 00:19:10,904 DAVID MALAN: OK, and lastly? 463 00:19:10,904 --> 00:19:12,380 AUDIENCE: [GASPS] 50. 464 00:19:12,380 --> 00:19:13,670 DAVID MALAN: Very well done. 465 00:19:13,670 --> 00:19:15,360 OK, so nicely done. 466 00:19:15,360 --> 00:19:15,860 [APPLAUSE] 467 00:19:15,860 --> 00:19:17,870 And so thank you. 468 00:19:17,870 --> 00:19:20,990 So this is to say, whether she goes from left to right or right to left, 469 00:19:20,990 --> 00:19:24,080 like, the performance of that algorithm just going linearly 470 00:19:24,080 --> 00:19:27,170 from side to side really depends on where the number ends up being. 471 00:19:27,170 --> 00:19:28,760 And it kind of does boil down to luck. 472 00:19:28,760 --> 00:19:31,550 And so that was kind of the best you could do because I dare say, 473 00:19:31,550 --> 00:19:34,772 had you gone linearly from right to left-- and go ahead to reset-- 474 00:19:34,772 --> 00:19:37,730 had you gone from right to left, that time you would have gotten lucky. 475 00:19:37,730 --> 00:19:40,080 So on average, it might just kind of work out. 476 00:19:40,080 --> 00:19:43,070 And so half the time, it takes you maybe half as many steps-- 477 00:19:43,070 --> 00:19:45,087 half the number of lockers to find it on average 478 00:19:45,087 --> 00:19:46,670 because sometimes it's on the one end. 479 00:19:46,670 --> 00:19:47,840 Sometimes it's on the other end. 480 00:19:47,840 --> 00:19:49,620 Sometimes it's smack dab in the middle. 481 00:19:49,620 --> 00:19:52,490 But we're going to give you the option now, 482 00:19:52,490 --> 00:19:54,523 Louis, of knowing that it's sorted.
483 00:19:54,523 --> 00:19:56,690 So we're going to take away the microphone from you. 484 00:19:56,690 --> 00:19:58,400 But stay on up here with us. 485 00:19:58,400 --> 00:20:00,710 And you are now given the assumption that the numbers 486 00:20:00,710 --> 00:20:03,860 are this time sorted from smallest to largest, left to right. 487 00:20:03,860 --> 00:20:08,570 And might you want to take more of a divide and conquer approach here? 488 00:20:08,570 --> 00:20:09,656 Wait a minute. 489 00:20:09,656 --> 00:20:10,934 AUDIENCE: [LAUGHS] 490 00:20:10,934 --> 00:20:14,450 DAVID MALAN: What might you do as your algorithm, Louis? 491 00:20:14,450 --> 00:20:16,700 AUDIENCE: Well, I think I know all the numbers, right? 492 00:20:16,700 --> 00:20:17,440 It's 1, 5-- 493 00:20:17,440 --> 00:20:18,482 DAVID MALAN: Oh, damn it. 494 00:20:18,482 --> 00:20:19,610 AUDIENCE: 10, 20, 50. 495 00:20:19,610 --> 00:20:21,630 DAVID MALAN: OK, so Louis has some memory, as we'll say. 496 00:20:21,630 --> 00:20:22,130 OK. 497 00:20:22,130 --> 00:20:24,320 AUDIENCE: But assuming if I didn't have memory. 498 00:20:24,320 --> 00:20:25,310 DAVID MALAN: OK, assuming if you didn't have 499 00:20:25,310 --> 00:20:26,480 memory, where would you start first? 500 00:20:26,480 --> 00:20:27,920 AUDIENCE: I would probably start in the middle. 501 00:20:27,920 --> 00:20:28,490 DAVID MALAN: OK, go ahead. 502 00:20:28,490 --> 00:20:29,220 Go in the middle. 503 00:20:29,220 --> 00:20:31,505 And what do you see? 504 00:20:31,505 --> 00:20:32,283 AUDIENCE: 20. 505 00:20:32,283 --> 00:20:32,950 DAVID MALAN: 20. 506 00:20:32,950 --> 00:20:34,505 OK, what does that mean now for you? 507 00:20:34,505 --> 00:20:37,630 AUDIENCE: So now that means that since I know there's some numbers bigger, 508 00:20:37,630 --> 00:20:38,390 I'll go to the right. 509 00:20:38,390 --> 00:20:38,500 DAVID MALAN: Good. 510 00:20:38,500 --> 00:20:40,720 So you can tear the problem in half, so to speak.
511 00:20:40,720 --> 00:20:43,690 As per week zero, skip all of the lockers on the left. 512 00:20:43,690 --> 00:20:45,460 And now go where relative to these three? 513 00:20:45,460 --> 00:20:46,540 AUDIENCE: Relative to the middle. 514 00:20:46,540 --> 00:20:46,720 DAVID MALAN: OK. 515 00:20:46,720 --> 00:20:48,800 AUDIENCE: So maybe [? you ?] don't know what's on the right-hand side. 516 00:20:48,800 --> 00:20:49,765 And so now it's 100. 517 00:20:49,765 --> 00:20:51,070 DAVID MALAN: OK, nice. 518 00:20:51,070 --> 00:20:54,250 AUDIENCE: So the 50 must be between 20 and 100. 519 00:20:54,250 --> 00:20:56,012 So it must be this one. 520 00:20:56,012 --> 00:20:56,845 DAVID MALAN: Indeed. 521 00:20:56,845 --> 00:21:00,342 So a round of applause for Louis for getting this right this time. 522 00:21:00,342 --> 00:21:01,790 [APPLAUSE] 523 00:21:01,790 --> 00:21:05,008 We have some lovely parting gifts, since we're using Monopoly money. 524 00:21:05,008 --> 00:21:06,050 So it's not actual money. 525 00:21:06,050 --> 00:21:10,610 But this is the Cambridge edition, Harvard Square and all, for both of you. 526 00:21:10,610 --> 00:21:11,390 So welcome. 527 00:21:11,390 --> 00:21:12,986 Thank you so much. 528 00:21:12,986 --> 00:21:17,140 [APPLAUSE] 529 00:21:17,140 --> 00:21:19,630 So Louis was actually getting particularly clever there 530 00:21:19,630 --> 00:21:21,610 when his first instinct was to just remember 531 00:21:21,610 --> 00:21:23,800 what were all the numbers and then sort of deduce 532 00:21:23,800 --> 00:21:25,600 where the 50 must obviously be. 533 00:21:25,600 --> 00:21:26,440 So that's on us. 534 00:21:26,440 --> 00:21:29,050 Like, ideally, we would have completely changed the dollar amount so 535 00:21:29,050 --> 00:21:30,633 that he couldn't use that information.
536 00:21:30,633 --> 00:21:34,540 But it turns out Louis's instinct there to figure out where the number should 537 00:21:34,540 --> 00:21:35,980 be and jump right there-- 538 00:21:35,980 --> 00:21:39,160 index into that exact location-- is actually a technique. 539 00:21:39,160 --> 00:21:42,760 And it's a technique we'll talk about in future classes 540 00:21:42,760 --> 00:21:45,280 where you actually do take into account the information 541 00:21:45,280 --> 00:21:46,940 and go right where you want. 542 00:21:46,940 --> 00:21:50,230 It's an example of what we'll eventually call hashing, so to speak, 543 00:21:50,230 --> 00:21:51,940 in a concept called hash tables. 544 00:21:51,940 --> 00:21:54,520 But for now, let's try to formalize the algorithms 545 00:21:54,520 --> 00:21:58,240 that both of these volunteers kind of intuitively 546 00:21:58,240 --> 00:22:00,070 came up with, first and second. 547 00:22:00,070 --> 00:22:01,580 And we'll slap some names on them. 548 00:22:01,580 --> 00:22:05,140 So the first, and I've hinted at this, is what we would call linear search. 549 00:22:05,140 --> 00:22:08,590 Anytime you search from left to right or from right to left, 550 00:22:08,590 --> 00:22:10,190 it's generally called linear search. 551 00:22:10,190 --> 00:22:10,690 Why? 552 00:22:10,690 --> 00:22:13,482 Because you're kind of walking in a line, no matter which direction 553 00:22:13,482 --> 00:22:14,080 you're going. 554 00:22:14,080 --> 00:22:17,390 But now for today's purposes, let's see if we can't truly 555 00:22:17,390 --> 00:22:21,440 formalize what our volunteers' algorithms were by translating them, 556 00:22:21,440 --> 00:22:24,320 not necessarily to code yet, but pseudocode. 557 00:22:24,320 --> 00:22:28,190 See if we can't map it to English-like syntax that gets the ideas across.
558 00:22:28,190 --> 00:22:32,220 So I dare say, the first algorithm, even though she went from right to left, 559 00:22:32,220 --> 00:22:34,460 then from left to right, might look like this. 560 00:22:34,460 --> 00:22:39,800 For each door from left to right, she checked if 50 is behind the door 561 00:22:39,800 --> 00:22:40,850 as by looking at it. 562 00:22:40,850 --> 00:22:44,417 If it was behind the door, then she returned true. 563 00:22:44,417 --> 00:22:45,500 Like, yes, this is the 50. 564 00:22:45,500 --> 00:22:47,583 That didn't happen on the first iteration, though. 565 00:22:47,583 --> 00:22:51,750 So she moved on again and again. 566 00:22:51,750 --> 00:22:53,660 And now notice the indentation here is just 567 00:22:53,660 --> 00:22:55,670 as important as it was in week zero. 568 00:22:55,670 --> 00:22:59,180 Notice that only at the very bottom of this algorithm 569 00:22:59,180 --> 00:23:00,620 do I propose returning false. 570 00:23:00,620 --> 00:23:03,650 But it's not indented inside of this pseudocode. 571 00:23:03,650 --> 00:23:04,160 Why? 572 00:23:04,160 --> 00:23:06,530 Well, because if I had changed it to be this, 573 00:23:06,530 --> 00:23:10,740 what would be the logical bug in this version of that algorithm? 574 00:23:10,740 --> 00:23:11,240 Yeah? 575 00:23:11,240 --> 00:23:17,040 576 00:23:17,040 --> 00:23:17,730 Exactly. 577 00:23:17,730 --> 00:23:22,020 If she had opened the first door, found it to be the wrong number if it says 578 00:23:22,020 --> 00:23:24,330 else-- if it's not behind the door, then return false-- 579 00:23:24,330 --> 00:23:26,760 that would erroneously conclude the 50's not there, 580 00:23:26,760 --> 00:23:29,350 even though it could certainly be among those other doors. 
581 00:23:29,350 --> 00:23:32,370 So this first version of the code where the return false is 582 00:23:32,370 --> 00:23:35,670 sort of left indented, so to speak, and the very last thing 583 00:23:35,670 --> 00:23:39,690 you do if you don't previously return true, that just 584 00:23:39,690 --> 00:23:42,360 makes sure that we're handling all possible cases. 585 00:23:42,360 --> 00:23:45,060 But let's make this maybe a little more technical. 586 00:23:45,060 --> 00:23:47,580 This is how a computer scientist or a programmer 587 00:23:47,580 --> 00:23:49,970 would likely express this instead. 588 00:23:49,970 --> 00:23:52,260 Instead of just drawing it in broad strokes, 589 00:23:52,260 --> 00:23:56,100 it's actually fine to kind of steal some of the syntax from languages like C 590 00:23:56,100 --> 00:24:01,530 and actually use some of the indices or indexes like 0, 1, 2, 3, 4, 5, 6, 591 00:24:01,530 --> 00:24:04,360 to represent the pieces of data we care about. 592 00:24:04,360 --> 00:24:06,150 So this is a little more precise. 593 00:24:06,150 --> 00:24:12,030 For i-- like a variable i-- from the value 0 to n minus 1, 594 00:24:12,030 --> 00:24:16,390 so in the case of seven doors, this is like saying, for i starting at 0, 595 00:24:16,390 --> 00:24:19,280 going up to 6, do the following. 596 00:24:19,280 --> 00:24:22,730 If the number 50 is behind the doors array-- 597 00:24:22,730 --> 00:24:27,250 so I'm using array syntax, even though this is technically still pseudocode-- 598 00:24:27,250 --> 00:24:32,720 if the ith location of my doors array has the number 50, return true. 599 00:24:32,720 --> 00:24:35,680 Otherwise, if you do that again and again and again-- 600 00:24:35,680 --> 00:24:39,863 n total times-- and you still don't find it, you want to return false. 601 00:24:39,863 --> 00:24:40,780 So we introduced this. 
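The indexed pseudocode above could be sketched in C roughly as follows. This is only an illustration under stated assumptions, not code from the lecture: the function name `linear_search` and the parameters `doors`, `n`, and `target` are hypothetical names chosen to mirror the pseudocode.

```c
#include <stdbool.h>

// Hypothetical helper mirroring the pseudocode: check doors[0] through
// doors[n - 1] in order and report whether target is behind any of them.
bool linear_search(const int doors[], int n, int target)
{
    for (int i = 0; i < n; i++)
    {
        if (doors[i] == target)
        {
            return true;
        }
    }
    // Only after every door has been checked is it safe to say "not found".
    // Returning false inside the loop would give up after the first door.
    return false;
}
```

With the seven values from the second demonstration, `int doors[] = {20, 500, 10, 5, 100, 1, 50};`, a call such as `linear_search(doors, 7, 50)` only finds the 50 at the last index, and a call with `n` equal to 0 falls straight through to `return false`.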
602 00:24:40,780 --> 00:24:44,590 This is just an example of how you can start to borrow ideas from actual code 603 00:24:44,590 --> 00:24:47,890 to paint the picture even more precisely of what 604 00:24:47,890 --> 00:24:50,200 it is you want a colleague to do, what it 605 00:24:50,200 --> 00:24:53,860 is you want your code to do, ultimately, by sort 606 00:24:53,860 --> 00:24:58,720 of borrowing these ideas from code and incorporating it into our pseudocode. 607 00:24:58,720 --> 00:25:03,430 But what about the second algorithm here, 608 00:25:03,430 --> 00:25:07,930 the second algorithm, whereby he took a divide and conquer approach, 609 00:25:07,930 --> 00:25:10,660 starting in the middle and then going right and then going left. 610 00:25:10,660 --> 00:25:13,077 Well, it turns out this is generally called binary search. 611 00:25:13,077 --> 00:25:15,040 Bi implying two, because you're either going 612 00:25:15,040 --> 00:25:17,395 with the left half or the right half again and again. 613 00:25:17,395 --> 00:25:20,020 This is literally what we've been talking about since week zero 614 00:25:20,020 --> 00:25:21,312 when I searched the phone book. 615 00:25:21,312 --> 00:25:25,450 That too was binary search, dividing and dividing and dividing in half and half. 616 00:25:25,450 --> 00:25:27,850 So if we were to draw some pseudocode for this, 617 00:25:27,850 --> 00:25:30,250 I would propose that we could do something like this. 618 00:25:30,250 --> 00:25:34,900 If 50 is behind the middle door, then we got lucky. 619 00:25:34,900 --> 00:25:36,490 Just return true. 620 00:25:36,490 --> 00:25:40,700 Else if the 50 is less than the value at the middle door-- 621 00:25:40,700 --> 00:25:42,430 so if it's smaller than the middle door-- 622 00:25:42,430 --> 00:25:43,810 I want to search to the left. 623 00:25:43,810 --> 00:25:45,580 So I can say search left half. 624 00:25:45,580 --> 00:25:50,330 Else if 50 is greater than the middle door, I want to search to the right. 
625 00:25:50,330 --> 00:25:54,550 And I think that's almost everything, right? 626 00:25:54,550 --> 00:25:57,680 Is there a fourth possible case? 627 00:25:57,680 --> 00:25:58,715 What else could happen? 628 00:25:58,715 --> 00:26:03,130 629 00:26:03,130 --> 00:26:03,670 Good. 630 00:26:03,670 --> 00:26:07,720 So taking into account that if there are no doors left or no doors 631 00:26:07,720 --> 00:26:09,830 to begin with, we better handle that case 632 00:26:09,830 --> 00:26:11,650 so that we don't induce one of those spinning beach balls, 633 00:26:11,650 --> 00:26:13,540 so that the computer doesn't freeze or crash. 634 00:26:13,540 --> 00:26:17,020 There's really four possible scenarios in searching for information. 635 00:26:17,020 --> 00:26:19,990 It's either in the middle or to the left or to the right. 636 00:26:19,990 --> 00:26:21,438 Or it's just not there at all. 637 00:26:21,438 --> 00:26:24,730 And so I sort of slip that in at the end because technically, that's a question 638 00:26:24,730 --> 00:26:28,390 you should ask first because if there's no doors, there's no work to be done. 639 00:26:28,390 --> 00:26:31,780 But logically, this is the juicy part-- the other three questions 640 00:26:31,780 --> 00:26:33,320 that you might ask yourself. 641 00:26:33,320 --> 00:26:35,860 So this then might be the pseudocode for binary search. 642 00:26:35,860 --> 00:26:37,398 And we could make it more technical. 643 00:26:37,398 --> 00:26:39,940 And this is where it kind of escalates quickly syntactically. 644 00:26:39,940 --> 00:26:41,770 But I'm just using the same kind of syntax. 645 00:26:41,770 --> 00:26:45,190 If doors in my pseudocode represents an array of doors, 646 00:26:45,190 --> 00:26:48,550 well, then doors bracket middle is just a pseudocode-like way 647 00:26:48,550 --> 00:26:51,110 of saying go to the middle door in that array. 
648 00:26:51,110 --> 00:26:55,090 And then notice, else if 50 is less than-- that middle value-- 649 00:26:55,090 --> 00:26:59,620 then search doors bracket 0, so the leftmost one-- 650 00:26:59,620 --> 00:27:02,380 through doors middle minus 1. 651 00:27:02,380 --> 00:27:05,780 So you don't need to waste time re-searching the middle door. 652 00:27:05,780 --> 00:27:10,250 So I say middle minus 1 so that I scooch over slightly to the left, so to speak. 653 00:27:10,250 --> 00:27:13,550 Else if 50 is greater than the value at the middle door, 654 00:27:13,550 --> 00:27:16,700 then you want to search 1 over to the right, 655 00:27:16,700 --> 00:27:21,770 so middle plus 1 among those doors, through the last door, which is not n, 656 00:27:21,770 --> 00:27:24,710 because we start counting at 0, but n minus 1. 657 00:27:24,710 --> 00:27:26,460 And the rest of the algorithm is the same. 658 00:27:26,460 --> 00:27:27,918 This is just a little more precise. 659 00:27:27,918 --> 00:27:32,450 And I dare say, when writing a program in C or any language, 660 00:27:32,450 --> 00:27:36,200 like, honestly, starting in pseudocode like this will generally make it 661 00:27:36,200 --> 00:27:38,370 much easier to write the actual code. 662 00:27:38,370 --> 00:27:41,360 So in fact, in this and future problem sets, do get into the habit, 663 00:27:41,360 --> 00:27:43,610 especially if you're struggling getting started-- just 664 00:27:43,610 --> 00:27:45,680 write things out in English and maybe high level 665 00:27:45,680 --> 00:27:47,120 English a little more like this. 666 00:27:47,120 --> 00:27:51,710 Then as a version two, go in with your keyboard or paper, pencil. 667 00:27:51,710 --> 00:27:56,240 And make it a little more precise using some code-like syntax. 668 00:27:56,240 --> 00:27:58,430 And then I dare say in version 3, you can now 669 00:27:58,430 --> 00:28:01,460 translate this pretty much verbatim to C code.
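One possible version 3 of that pseudocode, as a sketch rather than the lecture's own code, might look like the following. The name `binary_search` and the `low` and `high` parameters are assumptions introduced here so that "search doors[0] through doors[middle - 1]" can be expressed as a call on a smaller range; the function calls itself on the remaining half, and it assumes the array is already sorted from smallest to largest.

```c
#include <stdbool.h>

// Hypothetical helper: search the sorted range doors[low] through
// doors[high], inclusive, for target.
bool binary_search(const int doors[], int low, int high, int target)
{
    // No doors left: the fourth case from the pseudocode, checked first.
    if (low > high)
    {
        return false;
    }

    // Integer division truncates, so with an even number of doors this
    // picks the left of the two middle doors.
    int middle = low + (high - low) / 2;

    if (doors[middle] == target)
    {
        return true;
    }
    else if (target < doors[middle])
    {
        // Search doors[low] through doors[middle - 1].
        return binary_search(doors, low, middle - 1, target);
    }
    else
    {
        // Search doors[middle + 1] through doors[high].
        return binary_search(doors, middle + 1, high, target);
    }
}
```

For the sorted lockers from the demonstration, `int doors[] = {1, 5, 10, 20, 50, 100, 500};`, the call `binary_search(doors, 0, 6, 50)` looks at 20, then 100, then 50, which matches the three doors opened on stage.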
670 00:28:01,460 --> 00:28:04,040 The only headache is going to be rounding issues 671 00:28:04,040 --> 00:28:06,890 with integers because if you divide an integer and you get a fraction, 672 00:28:06,890 --> 00:28:08,810 it's going to truncate, so all that kind of headache. 673 00:28:08,810 --> 00:28:10,670 But you can work through that by just thinking 674 00:28:10,670 --> 00:28:12,440 through what's going to get truncated when 675 00:28:12,440 --> 00:28:15,630 you round down or up as a solution. 676 00:28:15,630 --> 00:28:17,870 Any questions, though, on this pseudocode 677 00:28:17,870 --> 00:28:23,420 for either linear or binary search as we've defined them-- 678 00:28:23,420 --> 00:28:26,340 linear or binary search-- 679 00:28:26,340 --> 00:28:26,840 no? 680 00:28:26,840 --> 00:28:28,798 All right, well, let's consider then a bit more 681 00:28:28,798 --> 00:28:33,000 formally a question that we'll come back to in the future in future classes 682 00:28:33,000 --> 00:28:33,500 as well. 683 00:28:33,500 --> 00:28:35,885 What is the running time of these algorithms? 684 00:28:35,885 --> 00:28:38,927 What is, that is to say, the efficiency of these algorithms? 685 00:28:38,927 --> 00:28:41,510 And how do we actually measure the efficiency of an algorithm? 686 00:28:41,510 --> 00:28:44,017 Is it with a stopwatch or with some other mechanism? 687 00:28:44,017 --> 00:28:46,100 Well, I propose that we think back to this picture 688 00:28:46,100 --> 00:28:50,660 here, whereby this, again, was representative of both the phone book 689 00:28:50,660 --> 00:28:54,140 example from week zero and theoretically, bug aside, 690 00:28:54,140 --> 00:28:57,920 the attendance counting algorithm from earlier today, whereby 691 00:28:57,920 --> 00:29:02,150 this same green line theoretically represents how much time it should have 692 00:29:02,150 --> 00:29:04,380 taken us as a group to count ourselves. 693 00:29:04,380 --> 00:29:04,880 Why?
694 00:29:04,880 --> 00:29:07,520 Because if maybe another class comes in and doubles 695 00:29:07,520 --> 00:29:11,280 the size of the number of humans in this room, no big deal. 696 00:29:11,280 --> 00:29:14,960 That's just one more step or one more iteration of the loop 697 00:29:14,960 --> 00:29:17,850 because half of the people would anyway sit down. 698 00:29:17,850 --> 00:29:22,410 So this green algorithm still represents the faster theoretical algorithm today. 699 00:29:22,410 --> 00:29:26,130 And so recall that we described these things more mathematically as n. 700 00:29:26,130 --> 00:29:29,970 So this was one page at a time or one person at a time. 701 00:29:29,970 --> 00:29:33,130 This was two people or two pages at a time. 702 00:29:33,130 --> 00:29:36,480 So it's twice as fast, so if n is the number of people or pages, 703 00:29:36,480 --> 00:29:38,850 n divided by 2 is the total number of steps. 704 00:29:38,850 --> 00:29:41,820 And then this one got a little mathy, but log base 2 of n. 705 00:29:41,820 --> 00:29:47,850 And log base 2 of n just means what is the value when you take n and divide it 706 00:29:47,850 --> 00:29:51,360 by 2 again and again and again and again 707 00:29:51,360 --> 00:29:54,960 until you're left with just one page or one person standing. 708 00:29:54,960 --> 00:29:58,500 But in the world of running times, it turns out that being this precise 709 00:29:58,500 --> 00:30:00,390 is not that intellectually interesting. 710 00:30:00,390 --> 00:30:02,370 And it sort of devolves into lower level math. 711 00:30:02,370 --> 00:30:04,560 It's just not necessary when having discussions 712 00:30:04,560 --> 00:30:07,500 about the efficiency of an algorithm or even code that you've written.
713 00:30:07,500 --> 00:30:10,020 So generally, a computer scientist, when asked 714 00:30:10,020 --> 00:30:11,730 what's the running time of your algorithm 715 00:30:11,730 --> 00:30:13,813 or what's the efficiency of your algorithm or more 716 00:30:13,813 --> 00:30:16,560 generally how good or bad is your algorithm, 717 00:30:16,560 --> 00:30:20,588 they'll talk about it being on the order of some number of steps. 718 00:30:20,588 --> 00:30:23,130 This is a phrase you'll hear increasingly in computer science 719 00:30:23,130 --> 00:30:25,020 where you can kind of wave your hand at it. 720 00:30:25,020 --> 00:30:27,600 Like, oh, the lower level details don't matter that much. 721 00:30:27,600 --> 00:30:31,410 All you care about in broad strokes are certain numbers 722 00:30:31,410 --> 00:30:32,940 that will add up the most. 723 00:30:32,940 --> 00:30:35,250 And in fact, when computer scientists talk 724 00:30:35,250 --> 00:30:37,860 about the efficiency of algorithms, they tend 725 00:30:37,860 --> 00:30:43,020 to throw away constant factors, so literally, numbers like 2 726 00:30:43,020 --> 00:30:46,540 that might be dividing here or a base here. 727 00:30:46,540 --> 00:30:49,710 So for instance, these two algorithms to a computer scientist 728 00:30:49,710 --> 00:30:50,918 would sort of be the same. 729 00:30:50,918 --> 00:30:52,710 Like, yeah, it's technically twice as fast. 730 00:30:52,710 --> 00:30:53,940 But look at the lines. 731 00:30:53,940 --> 00:30:57,210 I mean, they're practically the same and this one here too, log base 2-- sure. 732 00:30:57,210 --> 00:31:01,050 But if you remember from math class, you can change the base of any logarithm 733 00:31:01,050 --> 00:31:03,040 from one number to another pretty easily. 734 00:31:03,040 --> 00:31:06,120 So ah, let's just generalize it as log of n. 735 00:31:06,120 --> 00:31:09,390 It doesn't really matter fundamentally what the numbers actually are.
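As a short worked note on why those constants can be dropped, here is the standard change-of-base identity written out, with the capital O used as shorthand for "on the order of." The choice of base 10 is arbitrary and only for illustration; any other base differs by some other fixed multiplier.

```latex
\[
  \frac{n}{2} = \frac{1}{2} \cdot n \in O(n),
  \qquad
  \log_2 n = \frac{\log_{10} n}{\log_{10} 2} \approx 3.32 \, \log_{10} n \in O(\log n).
\]
```

In both cases the difference is a constant factor that does not depend on n, which is why the two straight lines count as the same kind of algorithm and why the base of the logarithm is left off.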
736 00:31:09,390 --> 00:31:14,400 And honestly, if we zoom out slightly so that the y-axis and the x-axis 737 00:31:14,400 --> 00:31:18,240 get even bigger, honestly, these first two algorithms 738 00:31:18,240 --> 00:31:21,750 really do start to resemble each other closer and closer. 739 00:31:21,750 --> 00:31:25,050 And I daresay in your mind's eye, imagine zooming further and further 740 00:31:25,050 --> 00:31:26,070 and further out. 741 00:31:26,070 --> 00:31:29,850 Like, that red and yellow line are pretty much-- once n is large enough-- 742 00:31:29,850 --> 00:31:31,305 going to be functionally the same. 743 00:31:31,305 --> 00:31:33,180 Like, they're practically the same algorithm. 744 00:31:33,180 --> 00:31:35,700 But this one is still doing amazingly because it's 745 00:31:35,700 --> 00:31:37,360 a fundamentally different shape. 746 00:31:37,360 --> 00:31:40,320 So this is to say when a computer scientist talks about, thinks 747 00:31:40,320 --> 00:31:43,890 about the efficiency of algorithms, we just throw away the constant terms 748 00:31:43,890 --> 00:31:47,450 that when n gets really large just don't seem to matter as much. 749 00:31:47,450 --> 00:31:49,200 They don't add up as much or fundamentally 750 00:31:49,200 --> 00:31:51,760 change the picture in a case like this. 751 00:31:51,760 --> 00:31:55,320 So what I'm describing here with this capital letter O has a technical term. 752 00:31:55,320 --> 00:31:57,810 This is called big O notation. 753 00:31:57,810 --> 00:32:00,180 And this is omnipresent in computer science 754 00:32:00,180 --> 00:32:02,250 and often rears its head even in programming, 755 00:32:02,250 --> 00:32:05,617 specifically when talking about the design of some algorithm. 756 00:32:05,617 --> 00:32:07,950 And this is a little cheat sheet here on the screen now. 
757 00:32:07,950 --> 00:32:10,890 Very often, algorithms that you write or you 758 00:32:10,890 --> 00:32:17,400 use will be describable as being on the order of one of these running times. 759 00:32:17,400 --> 00:32:21,180 So n is just representative of the number of things-- number of people, 760 00:32:21,180 --> 00:32:22,210 number of pages-- 761 00:32:22,210 --> 00:32:25,830 whatever it is you're actually doing in code. 762 00:32:25,830 --> 00:32:30,150 So the mathematical formulas inside of the parentheses 763 00:32:30,150 --> 00:32:33,870 describe as a function of the size of that input 764 00:32:33,870 --> 00:32:36,360 how fast or slow the algorithm's going to be. 765 00:32:36,360 --> 00:32:40,710 So this algorithm in the middle here, big O of n, so to speak, 766 00:32:40,710 --> 00:32:43,600 means that it takes linear time, in other words. 767 00:32:43,600 --> 00:32:44,760 So my first algorithm-- 768 00:32:44,760 --> 00:32:47,040 1, 2, 3, 4-- 769 00:32:47,040 --> 00:32:48,787 or my first algorithm in week zero-- 770 00:32:48,787 --> 00:32:51,630 1, 2, 3, 4. 771 00:32:51,630 --> 00:32:53,250 That was a linear search. 772 00:32:53,250 --> 00:32:56,940 The number of steps it takes is on the order of n because if there's n pages, 773 00:32:56,940 --> 00:33:00,065 in the worst case, like, John Harvard's all the way at the end of the phone 774 00:33:00,065 --> 00:33:01,318 book, so it takes me n steps. 775 00:33:01,318 --> 00:33:04,110 In this case, it's always going to take me n steps to count you all 776 00:33:04,110 --> 00:33:06,443 because if I want to point at each and every one of you, 777 00:33:06,443 --> 00:33:08,550 that is always going to take me n steps. 778 00:33:08,550 --> 00:33:13,605 So big O represents an upper bound on the number of steps 779 00:33:13,605 --> 00:33:14,730 that you might be counting. 
780 00:33:14,730 --> 00:33:17,760 And so we often use it to consider the worst case. And in the worst 781 00:33:17,760 --> 00:33:20,190 case, John Harvard, or whoever, might be all the way at the end of the phone 782 00:33:20,190 --> 00:33:20,690 book. 783 00:33:20,690 --> 00:33:23,240 So that linear search is on the order of n. 784 00:33:23,240 --> 00:33:25,230 But what about n squared? 785 00:33:25,230 --> 00:33:27,870 This means n people doing n things. 786 00:33:27,870 --> 00:33:30,590 So for instance, and we won't do this today, 787 00:33:30,590 --> 00:33:33,440 but if we were to ask you again to stand up and shake 788 00:33:33,440 --> 00:33:36,140 everyone's hand in the room-- 789 00:33:36,140 --> 00:33:40,340 not good for health nowadays-- but shake everyone's hand in the room, 790 00:33:40,340 --> 00:33:42,420 how many handshakes would there be? 791 00:33:42,420 --> 00:33:46,580 Well, if there's n of you and you've got to shake everyone else's hand, 792 00:33:46,580 --> 00:33:48,650 that's technically n times n minus 1. 793 00:33:48,650 --> 00:33:50,030 Let's throw away the minus 1. 794 00:33:50,030 --> 00:33:53,600 That's n times n or n squared handshakes. 795 00:33:53,600 --> 00:33:54,800 That's a lot of handshakes. 796 00:33:54,800 --> 00:33:57,140 And so the running time of shaking everyone's hand 797 00:33:57,140 --> 00:34:00,860 in the room to introduce yourself would be on the order of n squared. 798 00:34:00,860 --> 00:34:04,370 At the other end of the spectrum, the faster end, big O of 1 799 00:34:04,370 --> 00:34:07,310 doesn't mean that the algorithm takes literally one step. 800 00:34:07,310 --> 00:34:11,040 It could take two steps or three or even 1,000 steps. 801 00:34:11,040 --> 00:34:14,159 But what it means is it's a constant number of steps.
802 00:34:14,159 --> 00:34:16,520 So it doesn't matter how many people are in the room, 803 00:34:16,520 --> 00:34:21,320 this describes something taking just one step total 804 00:34:21,320 --> 00:34:23,520 or a constant number of steps total. 805 00:34:23,520 --> 00:34:26,610 So for instance, earlier when everyone stood up at the same time, 806 00:34:26,610 --> 00:34:29,412 that was constant time because if we had twice as many people 807 00:34:29,412 --> 00:34:32,370 come into the room, it's not going to take us twice as long to stand up 808 00:34:32,370 --> 00:34:34,739 if everyone stands up at the same time. 809 00:34:34,739 --> 00:34:37,030 That would be a constant time algorithm. 810 00:34:37,030 --> 00:34:38,290 So this is linear. 811 00:34:38,290 --> 00:34:39,330 This is constant. 812 00:34:39,330 --> 00:34:41,460 If you want to get fancy, this is quadratic. 813 00:34:41,460 --> 00:34:42,960 This is logarithmic. 814 00:34:42,960 --> 00:34:44,760 And this is n log n. 815 00:34:44,760 --> 00:34:48,340 Or there's other fancier terms we can give it as well, but for now, 816 00:34:48,340 --> 00:34:53,050 just a common list of running times that we might apply to certain algorithms. 817 00:34:53,050 --> 00:34:56,400 So linear search, I claim, is in big O of n 818 00:34:56,400 --> 00:34:58,900 because it's going to take in the worst case n steps. 819 00:34:58,900 --> 00:35:01,490 What about binary search? 820 00:35:01,490 --> 00:35:05,870 How many steps does binary search take on the order of 821 00:35:05,870 --> 00:35:07,490 according to this chart? 822 00:35:07,490 --> 00:35:07,990 Yeah? 823 00:35:07,990 --> 00:35:08,700 AUDIENCE: Log n. 824 00:35:08,700 --> 00:35:09,630 DAVID MALAN: Log n. 825 00:35:09,630 --> 00:35:14,250 Yeah, because no matter what with binary search, you're dividing in half, 826 00:35:14,250 --> 00:35:15,510 half, half. 827 00:35:15,510 --> 00:35:19,380 But in the worst case, it might be in the last door you check. 
828 00:35:19,380 --> 00:35:21,630 But you only took log n steps to get there. 829 00:35:21,630 --> 00:35:23,550 But it still might be the last one you check. 830 00:35:23,550 --> 00:35:26,897 So binary search indeed would be on the order of log n. 831 00:35:26,897 --> 00:35:28,230 But sometimes, you do get lucky. 832 00:35:28,230 --> 00:35:30,990 And we saw with our volunteers that sometimes you can get lucky 833 00:35:30,990 --> 00:35:32,430 and just find things quicker. 834 00:35:32,430 --> 00:35:35,793 So we don't always want to talk about things in terms of an upper bound, 835 00:35:35,793 --> 00:35:38,460 like how many steps in the worst case might this algorithm take. 836 00:35:38,460 --> 00:35:40,860 Sometimes it's useful to know in the best case 837 00:35:40,860 --> 00:35:43,350 how few steps might an algorithm take. 838 00:35:43,350 --> 00:35:46,170 So for that, we have this capital Greek omega, which 839 00:35:46,170 --> 00:35:48,540 is another symbol in computer science. 840 00:35:48,540 --> 00:35:53,872 And whereas big O represents upper bound, omega represents lower bound. 841 00:35:53,872 --> 00:35:55,080 And it's the exact same idea. 842 00:35:55,080 --> 00:35:56,820 It's just a different symbol to represent 843 00:35:56,820 --> 00:35:58,930 a different idea, the opposite, in this case. 844 00:35:58,930 --> 00:36:00,690 So here's a similar cheat sheet here. 845 00:36:00,690 --> 00:36:02,670 But when you use the omega symbol, that just 846 00:36:02,670 --> 00:36:07,240 means that this algorithm might take as few as this many steps, 847 00:36:07,240 --> 00:36:09,610 for instance, in the very best case. 848 00:36:09,610 --> 00:36:12,640 So by that logic, if I ask about linear search, 849 00:36:12,640 --> 00:36:17,090 our first demonstration with the lockers, let's consider the best case. 850 00:36:17,090 --> 00:36:19,720 How many steps might it take to search n lockers 851 00:36:19,720 --> 00:36:21,610 using linear search in the best case? 
852 00:36:21,610 --> 00:36:22,960 You get lucky. 853 00:36:22,960 --> 00:36:23,680 So I heard it. 854 00:36:23,680 --> 00:36:24,950 Yeah, just one step. 855 00:36:24,950 --> 00:36:29,210 So we could say that linear search is an omega of 1, so to speak. 856 00:36:29,210 --> 00:36:30,880 What about binary search? 857 00:36:30,880 --> 00:36:34,150 If you've got n lockers, in the best case, though, 858 00:36:34,150 --> 00:36:35,947 how few steps might it take us? 859 00:36:35,947 --> 00:36:37,780 Again, one, because we might just get lucky. 860 00:36:37,780 --> 00:36:40,460 And boom, it happens to be right there in the middle. 861 00:36:40,460 --> 00:36:46,000 So you could say that both linear search and binary search are an omega of 1. 862 00:36:46,000 --> 00:36:50,620 Now, by contrast, my attendance algorithm, the first one I proposed-- 863 00:36:50,620 --> 00:36:55,822 1, 2, 3, 4, 5, 6, 7, I claimed a moment ago that that's in big O of n 864 00:36:55,822 --> 00:36:58,780 because if there's n people in the room, I've got to point at everyone. 865 00:36:58,780 --> 00:37:00,580 But equivalently, if there's n people in the room 866 00:37:00,580 --> 00:37:03,400 and I have to point at everyone, what's the fewest number of steps 867 00:37:03,400 --> 00:37:08,770 it could take me to take attendance using this linear approach? 868 00:37:08,770 --> 00:37:11,730 So still n, right, unless I guess, which is not an algorithm. 869 00:37:11,730 --> 00:37:14,230 Like, unless I guess, I'm not going to get the right answer, 870 00:37:14,230 --> 00:37:16,120 so I kind of have to point at everyone. 871 00:37:16,120 --> 00:37:18,210 So in both the best case and the worst case, 872 00:37:18,210 --> 00:37:20,640 some algorithms still take n steps. 
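The best-case and worst-case step counts discussed above can be checked with a minimal sketch in C. The function names linear_steps and binary_steps are hypothetical and not from the lecture, and the binary search assumes the values are already sorted in ascending order.

```c
#include <stddef.h>

// Counts how many elements linear search looks at before it finds
// target or runs out of elements.
size_t linear_steps(const int values[], size_t n, int target)
{
    size_t steps = 0;
    for (size_t i = 0; i < n; i++)
    {
        steps++;
        if (values[i] == target)
        {
            break;
        }
    }
    return steps;
}

// Counts how many elements binary search looks at.
// Assumes values is sorted in ascending order.
size_t binary_steps(const int values[], size_t n, int target)
{
    size_t steps = 0;
    size_t low = 0;
    size_t high = n; // search the half-open range [low, high)
    while (low < high)
    {
        size_t mid = low + (high - low) / 2;
        steps++;
        if (values[mid] == target)
        {
            break;
        }
        else if (values[mid] < target)
        {
            low = mid + 1; // discard the left half
        }
        else
        {
            high = mid; // discard the right half
        }
    }
    return steps;
}
```

With seven sorted values, linear search takes 1 step when the target is first and 7 when it is last or absent, while binary search takes 1 step when the target is in the middle and at most 3 otherwise, which lines up with omega of 1 for both and big O of n versus big O of log n.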
873 00:37:20,640 --> 00:37:24,300 And for this, we have what we'll call theta notation, whereby 874 00:37:24,300 --> 00:37:28,110 if big O and omega happen to be the same, which is not always the case 875 00:37:28,110 --> 00:37:33,190 but can be, then you can say that algorithm is in theta of such and such. 876 00:37:33,190 --> 00:37:37,530 So my attendance-taking algorithm, the very first, 1, 2, 3, 4, 877 00:37:37,530 --> 00:37:40,020 all the way on up to n would be in theta of n 878 00:37:40,020 --> 00:37:42,250 because in both the best case and the worst case, 879 00:37:42,250 --> 00:37:48,143 it takes the same number of steps as per my big O and omega analysis. 880 00:37:48,143 --> 00:37:51,060 Now, there is a more formal mathematical definition for both of these. 881 00:37:51,060 --> 00:37:52,620 And if you take higher level computer science, 882 00:37:52,620 --> 00:37:54,495 you'll go more into the weeds of these ideas. 883 00:37:54,495 --> 00:37:56,820 But for now, big O, upper bound. 884 00:37:56,820 --> 00:37:59,100 Omega, lower bound. 885 00:37:59,100 --> 00:38:03,430 Questions on this symbology? 886 00:38:03,430 --> 00:38:06,920 It'll be a tool in our tool kit. 887 00:38:06,920 --> 00:38:07,420 No? 888 00:38:07,420 --> 00:38:14,260 OK, so with that said, let's see how we might translate this to actual code 889 00:38:14,260 --> 00:38:17,050 in something that makes sense now using C and not so much 890 00:38:17,050 --> 00:38:20,570 new syntax but applications of similar ideas from last time. 891 00:38:20,570 --> 00:38:25,237 So for instance, let me actually go over to search dot C. And in search dot C, 892 00:38:25,237 --> 00:38:27,820 I'm going to go ahead and implement the idea of linear search, 893 00:38:27,820 --> 00:38:29,980 very simply, using integers initially. 
894 00:38:29,980 --> 00:38:34,150 So to do this, let me go ahead and give myself the familiar CS50 dot h 895 00:38:34,150 --> 00:38:36,460 so that I can ask the human what number to search for. 896 00:38:36,460 --> 00:38:40,000 Then let me go ahead and include standard io.h 897 00:38:40,000 --> 00:38:42,430 so that I can use printf and the like. 898 00:38:42,430 --> 00:38:45,880 Then I'm going to go ahead and do int main void without any command line 899 00:38:45,880 --> 00:38:49,130 arguments because I'm not going to need any for this particular demonstration. 900 00:38:49,130 --> 00:38:50,920 And someone asked about this a while back. 901 00:38:50,920 --> 00:38:53,560 If I want to declare an array of values but I 902 00:38:53,560 --> 00:38:57,650 know in advance what the values are, there is a special syntax I can use, 903 00:38:57,650 --> 00:38:58,330 which is this. 904 00:38:58,330 --> 00:39:01,480 If I want a whole bunch of numbers but I want those numbers to be stored 905 00:39:01,480 --> 00:39:04,780 in an array, I can store them in-- 906 00:39:04,780 --> 00:39:06,380 using these curly braces. 907 00:39:06,380 --> 00:39:09,582 And I'm going to use the same numbers as the monopoly denominations 908 00:39:09,582 --> 00:39:10,790 that we've been playing with. 909 00:39:10,790 --> 00:39:12,690 And I'm just going to put them in sort of random order. 910 00:39:12,690 --> 00:39:15,315 But I'm going to deliberately put the 50 all the way at the end 911 00:39:15,315 --> 00:39:19,430 just so that I know that it's going to try all possible steps, so big O of n, 912 00:39:19,430 --> 00:39:20,210 ultimately. 913 00:39:20,210 --> 00:39:23,180 Now let's go ahead and ask the user for a value n. 914 00:39:23,180 --> 00:39:24,530 And we'll use get int. 915 00:39:24,530 --> 00:39:28,520 And just ask the user for a number to search for, be it 50 or something else. 
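The curly-brace initializer described above might look like the following sketch. The order of the values is an assumption (the lecture only says the 50 goes last), and the sizeof idiom for counting elements is an addition that the lecture itself does not use.

```c
#include <stddef.h>

// The seven Monopoly denominations in an arbitrary order, with 50 last
// so that a left-to-right search for 50 has to visit every element.
// The exact order is illustrative, not taken from the lecture.
static const int numbers[] = {20, 500, 10, 5, 100, 1, 50};

// Let the compiler count the elements instead of hardcoding 7:
// total size of the array divided by the size of one element.
static const size_t numbers_count = sizeof(numbers) / sizeof(numbers[0]);
```

Because the square brackets are left empty, the compiler sizes the array from the initializer, and numbers_count follows automatically if a value is added or removed.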
916 00:39:28,520 --> 00:39:32,000 And then here's how I might implement in code now linear search, 917 00:39:32,000 --> 00:39:34,850 translating effectively the pseudocode from last time. 918 00:39:34,850 --> 00:39:38,090 For int i equals 0. 919 00:39:38,090 --> 00:39:40,940 i is less than 7. i plus plus. 920 00:39:40,940 --> 00:39:43,670 And then inside of this loop, I can ask a question. 921 00:39:43,670 --> 00:39:47,610 If the ith number in the numbers array-- 922 00:39:47,610 --> 00:39:51,170 so numbers bracket i-- equals, equals the number n 923 00:39:51,170 --> 00:39:55,130 that I care about, well, this is where I could just declare return true 924 00:39:55,130 --> 00:39:55,930 or return false. 925 00:39:55,930 --> 00:39:58,430 Here, I'm going to go ahead and just use a printf statement. 926 00:39:58,430 --> 00:40:01,670 And I'm going to say found, backslash n, just to know visually 927 00:40:01,670 --> 00:40:03,110 that I found the number. 928 00:40:03,110 --> 00:40:06,920 And else I might do something like this. 929 00:40:06,920 --> 00:40:09,080 Else if that's not the case, I'll go ahead 930 00:40:09,080 --> 00:40:13,910 and print out not found backslash n. 931 00:40:13,910 --> 00:40:16,410 All right, so let me zoom out for just a moment. 932 00:40:16,410 --> 00:40:17,690 Here's all of my code. 933 00:40:17,690 --> 00:40:25,130 Any concerns with this implementation of what I claim is now linear search? 934 00:40:25,130 --> 00:40:28,970 Any concerns with what I claim is linear search? 935 00:40:28,970 --> 00:40:29,885 Yeah? 936 00:40:29,885 --> 00:40:32,855 AUDIENCE: [INAUDIBLE] 937 00:40:32,855 --> 00:40:33,950 DAVID MALAN: Exactly. 938 00:40:33,950 --> 00:40:37,917 If I search the first number and it's not found, it's going to say not found. 939 00:40:37,917 --> 00:40:40,500 But it's going to keep saying not found, not found, not found, 940 00:40:40,500 --> 00:40:41,360 which might be fine. 941 00:40:41,360 --> 00:40:42,402 But it's a little stupid. 
942 00:40:42,402 --> 00:40:44,325 I probably want to know if it's found or not. 943 00:40:44,325 --> 00:40:46,700 So I've made that same mistake that I called out earlier. 944 00:40:46,700 --> 00:40:49,700 Like, the else is not the alternative to not finding 945 00:40:49,700 --> 00:40:51,570 the number in that first location. 946 00:40:51,570 --> 00:40:55,890 It's the final decision to make when I haven't actually found the value. 947 00:40:55,890 --> 00:40:58,580 So I think what I want to do is get rid of this else clause. 948 00:40:58,580 --> 00:41:01,940 And then at the outside of this loop, I think 949 00:41:01,940 --> 00:41:03,770 I want to conclude printf not found. 950 00:41:03,770 --> 00:41:08,420 But here too there's a new bug that's arisen. 951 00:41:08,420 --> 00:41:13,310 There's a new bug, even though I fixed the logical error you just described. 952 00:41:13,310 --> 00:41:15,290 What symptom are we still going to see? 953 00:41:15,290 --> 00:41:18,290 954 00:41:18,290 --> 00:41:20,960 Yeah, now it's going to always print not found, 955 00:41:20,960 --> 00:41:23,997 even when I have found it because even once I've found it 956 00:41:23,997 --> 00:41:26,330 and I finished going through the array, it's still going 957 00:41:26,330 --> 00:41:27,740 to assume that I got to the bottom. 958 00:41:27,740 --> 00:41:28,990 And therefore, it's not found. 959 00:41:28,990 --> 00:41:32,660 So I need to somehow exit out of main prematurely, if you will. 960 00:41:32,660 --> 00:41:35,510 And recall that last week, we also introduced the idea 961 00:41:35,510 --> 00:41:39,680 that main all this time does actually return a value, an integer. 962 00:41:39,680 --> 00:41:42,110 By default, it's secretly been zero. 963 00:41:42,110 --> 00:41:46,130 Anytime a program exits, it just returns zero, an exit status of zero. 964 00:41:46,130 --> 00:41:47,970 But we do now have control over that. 
965 00:41:47,970 --> 00:41:51,080 And so a convention in C would be that when 966 00:41:51,080 --> 00:41:54,500 you want your main function to end prematurely if so, 967 00:41:54,500 --> 00:41:56,587 you can literally just return a value. 968 00:41:56,587 --> 00:41:59,670 And even though this feels a little backwards, this is just the way it is. 969 00:41:59,670 --> 00:42:02,450 You return zero to indicate success. 970 00:42:02,450 --> 00:42:06,120 And you return any other integer to indicate failure. 971 00:42:06,120 --> 00:42:09,110 So by convention, people go to 1 and then 2 and then 3. 972 00:42:09,110 --> 00:42:11,310 They don't think too hard about what numbers to use. 973 00:42:11,310 --> 00:42:14,120 But in this case, I'm going to go ahead and return 1 974 00:42:14,120 --> 00:42:16,440 if I do get to the bottom of this file. 975 00:42:16,440 --> 00:42:21,090 So now if I open back up my terminal window, I run make search. 976 00:42:21,090 --> 00:42:23,267 No syntax errors-- dot slash search. 977 00:42:23,267 --> 00:42:24,850 I'm going to be prompted for a number. 978 00:42:24,850 --> 00:42:26,683 Let's go ahead and search for the number 50. 979 00:42:26,683 --> 00:42:28,650 And I should see found. 980 00:42:28,650 --> 00:42:31,290 Meanwhile, if I run it again-- dot slash search-- 981 00:42:31,290 --> 00:42:34,590 I search for the number 13, that should be not found. 982 00:42:34,590 --> 00:42:38,850 So I claim that this is now a correct implementation of linear search 983 00:42:38,850 --> 00:42:43,620 that's gone from left to right, looking for a number that may or may not 984 00:42:43,620 --> 00:42:44,310 be there. 985 00:42:44,310 --> 00:42:48,420 Any questions on this version of the code here? 986 00:42:48,420 --> 00:42:49,578 Yeah? 987 00:42:49,578 --> 00:42:53,064 AUDIENCE: [INAUDIBLE] 988 00:42:53,064 --> 00:42:55,760 DAVID MALAN: Return zero indicates success 989 00:42:55,760 --> 00:42:58,250 when doing it from main in particular. 
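The corrected linear search described above can be sketched as follows. To keep the sketch independent of the CS50 library, the search lives in a hypothetical helper called search_numbers rather than directly in main, and the number to look for is passed in as a parameter instead of coming from get_int.

```c
#include <stdio.h>

// Linear search over n integers, left to right.
// Returns 0 (success) as soon as target is found, and 1 (failure)
// only after the whole array has been checked.
int search_numbers(const int numbers[], int n, int target)
{
    for (int i = 0; i < n; i++)
    {
        if (numbers[i] == target)
        {
            printf("Found\n");
            return 0; // exit early so "Not found" is never printed
        }
    }
    // Reached only if no element matched.
    printf("Not found\n");
    return 1;
}
```

A main function could then end with return search_numbers(numbers, 7, n); so that the program's exit status is 0 on success and 1 on failure, matching the convention described above.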
990 00:42:58,250 --> 00:43:02,480 And that's backwards only in the sense that generally 0 is false. 991 00:43:02,480 --> 00:43:04,640 And 1 is true. 992 00:43:04,640 --> 00:43:09,272 But the logic here is that if the program works correctly, that's zero. 993 00:43:09,272 --> 00:43:11,730 But there's an infinite number of things that can go wrong. 994 00:43:11,730 --> 00:43:15,290 And that's why we need 1 and 2 and 3 and 4 and all the way up. 995 00:43:15,290 --> 00:43:17,362 Other questions? 996 00:43:17,362 --> 00:43:19,842 AUDIENCE: [INAUDIBLE] 997 00:43:19,842 --> 00:43:21,637 DAVID MALAN: Yes. 998 00:43:21,637 --> 00:43:25,533 AUDIENCE: [INAUDIBLE] 999 00:43:25,533 --> 00:43:26,590 DAVID MALAN: Correct. 1000 00:43:26,590 --> 00:43:31,270 When you return zero or return any value from main wherever it is in your code, 1001 00:43:31,270 --> 00:43:34,240 the program will effectively terminate right then and there. 1002 00:43:34,240 --> 00:43:37,600 No additional code will get executed at the bottom of the function. 1003 00:43:37,600 --> 00:43:39,310 You'll effectively exit out. 1004 00:43:39,310 --> 00:43:43,240 Just like in a normal function that isn't main, when you return a value, 1005 00:43:43,240 --> 00:43:46,930 it immediately exits that function and hands back the value. 1006 00:43:46,930 --> 00:43:47,440 Yeah? 1007 00:43:47,440 --> 00:43:49,352 AUDIENCE: [INAUDIBLE] 1008 00:43:49,352 --> 00:43:52,020 DAVID MALAN: So return 1 is just me being 1009 00:43:52,020 --> 00:43:56,250 pedantic at this point because I'm frankly not going to really care what 1010 00:43:56,250 --> 00:43:58,110 the exit status is of this program. 1011 00:43:58,110 --> 00:44:03,810 But once I've introduced the idea of manually returning 0 on this line 14 1012 00:44:03,810 --> 00:44:07,470 to indicate success, it stands to reason that I should also 1013 00:44:07,470 --> 00:44:10,510 return a different value when I want to indicate failure. 
1014 00:44:10,510 --> 00:44:13,038 And so even though this does not functionally 1015 00:44:13,038 --> 00:44:15,330 change the program-- it will still work-- it will still 1016 00:44:15,330 --> 00:44:19,050 print the same things correctly-- it's a lower level detail 1017 00:44:19,050 --> 00:44:22,260 that programmers, teaching assistants, testing software, 1018 00:44:22,260 --> 00:44:24,930 might appreciate, knowing what actually happened 1019 00:44:24,930 --> 00:44:28,160 in the program underneath the hood. 1020 00:44:28,160 --> 00:44:30,660 All right, so what about strings? 1021 00:44:30,660 --> 00:44:32,660 So it turns out with strings, we're going 1022 00:44:32,660 --> 00:44:35,790 to have to think a little harder about how best to do this. 1023 00:44:35,790 --> 00:44:37,530 So let me actually go ahead and do this. 1024 00:44:37,530 --> 00:44:41,240 Let me go ahead and get rid of much of this code but transition 1025 00:44:41,240 --> 00:44:44,420 to a different type of array, this time an array of strings. 1026 00:44:44,420 --> 00:44:47,570 And I'm going to call the array strings itself plural, just to make clear 1027 00:44:47,570 --> 00:44:48,985 what's in it instead of numbers. 1028 00:44:48,985 --> 00:44:51,110 I'm going to use the square bracket notation, which 1029 00:44:51,110 --> 00:44:54,235 just means I don't know at the moment how many elements it's going to have. 1030 00:44:54,235 --> 00:44:56,030 But the compiler can figure it out for me. 1031 00:44:56,030 --> 00:44:58,910 And in the spirit of Monopoly, let's go ahead in our curly braces, 1032 00:44:58,910 --> 00:45:00,440 do something like this. 1033 00:45:00,440 --> 00:45:03,200 Battleship is going to be one of the strings. 1034 00:45:03,200 --> 00:45:05,390 Boot is going to be another. 1035 00:45:05,390 --> 00:45:07,160 Cannon is going to be a third. 1036 00:45:07,160 --> 00:45:09,140 Iron is going to be the fourth. 
1037 00:45:09,140 --> 00:45:11,510 Thimble-- and if you've ever played Monopoly, 1038 00:45:11,510 --> 00:45:14,610 you know where these are coming from-- and top hat, for instance. 1039 00:45:14,610 --> 00:45:17,510 So this gives me 1, 2, 3, 4, 5, 6. 1040 00:45:17,510 --> 00:45:19,148 I could write the number 6 here. 1041 00:45:19,148 --> 00:45:20,690 And the compiler would appreciate it. 1042 00:45:20,690 --> 00:45:21,607 But it doesn't matter. 1043 00:45:21,607 --> 00:45:25,280 The compiler can figure it out on its own just based on the number of commas. 1044 00:45:25,280 --> 00:45:28,440 And this also ensures that I don't write one number to the left 1045 00:45:28,440 --> 00:45:30,130 and then miscount on the right. 1046 00:45:30,130 --> 00:45:32,890 So omitting it is probably in everyone's benefit. 1047 00:45:32,890 --> 00:45:33,840 Now let's do this. 1048 00:45:33,840 --> 00:45:38,340 Let's ask the user for a string using get string instead of get int 1049 00:45:38,340 --> 00:45:40,650 for some string to search for. 1050 00:45:40,650 --> 00:45:42,660 Then let's go ahead and do the exact same thing. 1051 00:45:42,660 --> 00:45:46,620 For int i equals 0. 1052 00:45:46,620 --> 00:45:50,670 i is less than 6, which is technically a magic number. 1053 00:45:50,670 --> 00:45:53,070 But let's focus on the searching algorithm for now. 1054 00:45:53,070 --> 00:45:54,330 i plus plus. 1055 00:45:54,330 --> 00:45:56,850 And then inside of this loop, let's do this. 1056 00:45:56,850 --> 00:46:04,980 If strings bracket i equals, equals s, then let's go ahead 1057 00:46:04,980 --> 00:46:08,100 and print out just as before, quote, unquote, found-- backslash n-- 1058 00:46:08,100 --> 00:46:10,125 and proactively return zero this time.
1059 00:46:10,125 --> 00:46:12,000 And if we don't find it anywhere in the loop, 1060 00:46:12,000 --> 00:46:14,280 let's go ahead and return 1 at the very bottom, 1061 00:46:14,280 --> 00:46:17,820 before which we will print not found backslash n. 1062 00:46:17,820 --> 00:46:20,670 So it's the exact same logic at the moment, 1063 00:46:20,670 --> 00:46:23,530 even though I've changed my ints to strings. 1064 00:46:23,530 --> 00:46:25,950 Let me go ahead and open up my terminal window now. 1065 00:46:25,950 --> 00:46:30,810 Do make search again and see, OK, so far, so good. 1066 00:46:30,810 --> 00:46:33,780 Let me now go ahead and do dot slash search. 1067 00:46:33,780 --> 00:46:37,470 And let's go ahead and search for-- how about top hat. 1068 00:46:37,470 --> 00:46:39,653 So we should see found. 1069 00:46:39,653 --> 00:46:40,415 Huh. 1070 00:46:40,415 --> 00:46:41,290 All right, not found. 1071 00:46:41,290 --> 00:46:43,332 All right, well, let's do dot slash search again. 1072 00:46:43,332 --> 00:46:44,320 How about thimble? 1073 00:46:44,320 --> 00:46:47,380 Maybe it's because it's just two words. 1074 00:46:47,380 --> 00:46:50,110 No, thimble's not found, either. 1075 00:46:50,110 --> 00:46:53,230 Dot slash search-- let's search for the first one, battleship. 1076 00:46:53,230 --> 00:46:55,240 Enter-- still not found. 1077 00:46:55,240 --> 00:46:57,280 Let's search for something else like cat. 1078 00:46:57,280 --> 00:46:58,600 Not found. 1079 00:46:58,600 --> 00:47:03,550 What is going on because I'm pretty sure the logic is exactly the same? 1080 00:47:03,550 --> 00:47:09,010 Well, it turns out in C, this line here, currently line 11, 1081 00:47:09,010 --> 00:47:11,350 is not how you compare strings. 1082 00:47:11,350 --> 00:47:15,670 If you want to compare strings in C, you don't do it like you did integers. 1083 00:47:15,670 --> 00:47:17,920 You actually need another technique altogether. 
1084 00:47:17,920 --> 00:47:21,070 And for that, we're going to need to revisit one of our friends, which 1085 00:47:21,070 --> 00:47:24,610 is string.h, which is one of the header files for the string library 1086 00:47:24,610 --> 00:47:27,610 that we introduced last week that has in addition to functions 1087 00:47:27,610 --> 00:47:31,390 like strlen, which gives you the length of a string, it also gives us, 1088 00:47:31,390 --> 00:47:34,330 as per the documentation here, another function that we'll 1089 00:47:34,330 --> 00:47:38,830 start to find useful here, succinctly named strcmp, for string compare. 1090 00:47:38,830 --> 00:47:44,750 And string compare will actually tell us if two strings are the same or not. 1091 00:47:44,750 --> 00:47:46,200 It will indeed compare them. 1092 00:47:46,200 --> 00:47:49,970 And if I use this now, let me go back to my code 1093 00:47:49,970 --> 00:47:53,990 here and see what I might do differently if I go back into my code here 1094 00:47:53,990 --> 00:47:55,160 and change this value. 1095 00:47:55,160 --> 00:47:59,570 Instead of using strings bracket i equals, equals s, let's do str compare. 1096 00:47:59,570 --> 00:48:01,190 And I read the documentation earlier. 1097 00:48:01,190 --> 00:48:04,640 So I know that it takes two arguments, the first and the second string 1098 00:48:04,640 --> 00:48:09,500 that you want to compare, so strings bracket i and then s, 1099 00:48:09,500 --> 00:48:12,290 which is the string that the human typed in. 1100 00:48:12,290 --> 00:48:15,680 But somewhat weirdly, what I want to check for 1101 00:48:15,680 --> 00:48:19,220 is that str compare returns zero. 1102 00:48:19,220 --> 00:48:24,950 So if str compare when given two strings is input, strings bracket i and s, 1103 00:48:24,950 --> 00:48:30,750 returns an integer 0, that actually means the strings are the same. 1104 00:48:30,750 --> 00:48:31,500 So let's try this. 1105 00:48:31,500 --> 00:48:33,020 Let me do make search again. 
1106 00:48:33,020 --> 00:48:33,890 Huh. 1107 00:48:33,890 --> 00:48:36,410 What did I do wrong here? 1108 00:48:36,410 --> 00:48:39,200 A whole bunch of errors popped out. 1109 00:48:39,200 --> 00:48:40,100 What did I do wrong? 1110 00:48:40,100 --> 00:48:41,530 Yeah? 1111 00:48:41,530 --> 00:48:44,510 Yeah, so I didn't include the very header file we're talking about. 1112 00:48:44,510 --> 00:48:46,270 So again, it doesn't necessarily mean a logical error. 1113 00:48:46,270 --> 00:48:47,920 It just means a stupid error on my part. 1114 00:48:47,920 --> 00:48:49,670 I didn't actually include the header file. 1115 00:48:49,670 --> 00:48:51,340 So let me go back and actually do that. 1116 00:48:51,340 --> 00:48:54,280 Up at the top in addition to cs50.h and standard io, 1117 00:48:54,280 --> 00:48:56,890 let's also include string.h. 1118 00:48:56,890 --> 00:48:59,290 Let me clear my terminal and do make search again. 1119 00:48:59,290 --> 00:49:00,340 Crossing my fingers. 1120 00:49:00,340 --> 00:49:01,390 That time it worked. 1121 00:49:01,390 --> 00:49:06,040 And now if I do dot slash search and search for top hat like before, 1122 00:49:06,040 --> 00:49:08,500 now, thankfully, it is, in fact, found. 1123 00:49:08,500 --> 00:49:12,340 If I do it once more and search for battleship, now it's, in fact, found. 1124 00:49:12,340 --> 00:49:15,520 If I do it once more and search for cat, which should not be in there, 1125 00:49:15,520 --> 00:49:17,870 that is not, in fact, found. 1126 00:49:17,870 --> 00:49:21,050 So now just intuitively, even if you've never done this before, 1127 00:49:21,050 --> 00:49:25,450 why might it be valuable for this function called strcmp 1128 00:49:25,450 --> 00:49:30,370 to return zero if the strings are equal as opposed 1129 00:49:30,370 --> 00:49:34,060 to a simple Boolean like true or false, which might have been your intuition? 
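The string version of the search, using strcmp as described above, can be sketched like this. The sketch uses plain const char * values instead of the CS50 string type so that it compiles without cs50.h, and the helper name search_strings is hypothetical.

```c
#include <stdio.h>
#include <string.h>

// Linear search over n strings.
// strcmp returns 0 when the two strings contain the same characters,
// so a result of 0 means "found".
int search_strings(const char *strings[], int n, const char *s)
{
    for (int i = 0; i < n; i++)
    {
        if (strcmp(strings[i], s) == 0)
        {
            printf("Found\n");
            return 0;
        }
    }
    printf("Not found\n");
    return 1;
}
```

Called with the six Monopoly pieces and the text the user typed, this returns 0 for top hat or battleship and 1 for cat, matching the runs described above.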
1130 00:49:34,060 --> 00:49:40,540 When you compare two strings, what are the possible takeaways 1131 00:49:40,540 --> 00:49:43,270 you might have from comparing two strings? 1132 00:49:43,270 --> 00:49:45,790 It's not just that they're equal or not. 1133 00:49:45,790 --> 00:49:47,970 AUDIENCE: [INAUDIBLE] 1134 00:49:47,970 --> 00:49:51,170 DAVID MALAN: OK, so maybe if the ASCII values are the same, that 1135 00:49:51,170 --> 00:49:53,150 might imply, indeed, literal equality. 1136 00:49:53,150 --> 00:49:54,720 But something else. 1137 00:49:54,720 --> 00:49:58,044 AUDIENCE: [INAUDIBLE] about how similar these things are [INAUDIBLE]. 1138 00:49:58,044 --> 00:49:59,020 DAVID MALAN: Ah, nice. 1139 00:49:59,020 --> 00:50:00,940 Like, you and I in English, certainly, are very much 1140 00:50:00,940 --> 00:50:02,982 in the habit of sorting information, whether it's 1141 00:50:02,982 --> 00:50:06,490 in a dictionary, in our contacts, in a phone book, in any such technology. 1142 00:50:06,490 --> 00:50:11,140 And so it's often useful to be able to know, does this string equal another? 1143 00:50:11,140 --> 00:50:11,740 Sure. 1144 00:50:11,740 --> 00:50:15,760 But does this string come before another alphabetically or maybe 1145 00:50:15,760 --> 00:50:17,620 after another alphabetically? 1146 00:50:17,620 --> 00:50:20,680 So sometimes, you want functions to give you back three answers. 1147 00:50:20,680 --> 00:50:24,550 But equals, equals alone can only give you true or false, yes or no. 1148 00:50:24,550 --> 00:50:27,040 And that might not be useful enough when you're 1149 00:50:27,040 --> 00:50:29,200 trying to solve some problem involving strings. 1150 00:50:29,200 --> 00:50:34,280 So it turns out str compare actually compares the two strings for equality 1151 00:50:34,280 --> 00:50:38,470 but also for what's called, playfully, ASCII-betical order.
1152 00:50:38,470 --> 00:50:41,590 So not alphabetical order, per se, but ASCII-betical order 1153 00:50:41,590 --> 00:50:44,680 where it actually compares the integer values of the letters. 1154 00:50:44,680 --> 00:50:46,450 So if you're comparing the letter A, it's 1155 00:50:46,450 --> 00:50:51,250 going to compare 65 against some other letter's integer value-- hence 1156 00:50:51,250 --> 00:50:52,880 ASCII-betical value. 1157 00:50:52,880 --> 00:50:54,790 So we're not doing any form of sorting here. 1158 00:50:54,790 --> 00:50:56,110 So it's sort of immaterial. 1159 00:50:56,110 --> 00:50:59,000 But as per the documentation, I do know that str compare 1160 00:50:59,000 --> 00:51:01,880 returns zero if two strings are equal. 1161 00:51:01,880 --> 00:51:06,230 And a little teaser for next week, it turns out when I was only using equals, 1162 00:51:06,230 --> 00:51:09,400 equals to compare strings bracket i and s, 1163 00:51:09,400 --> 00:51:13,350 I was not comparing the strings in the way that you might have thought. 1164 00:51:13,350 --> 00:51:18,087 And if you have programmed in Java or Python before, equals, equals 1165 00:51:18,087 --> 00:51:20,420 is actually doing something different in those languages 1166 00:51:20,420 --> 00:51:23,130 than it is actually doing in C. But more on that next week. 1167 00:51:23,130 --> 00:51:26,600 For now, just take on faith that str compare is indeed 1168 00:51:26,600 --> 00:51:29,010 how you compare two strings. 1169 00:51:29,010 --> 00:51:33,080 So let's actually put this into play with some actual additional code. 1170 00:51:33,080 --> 00:51:37,940 Let me propose that we implement a very simplistic phone book, for instance. 1171 00:51:37,940 --> 00:51:40,640 Let me go ahead and implement here-- 1172 00:51:40,640 --> 00:51:42,110 how about in a new file. 1173 00:51:42,110 --> 00:51:46,550 Instead of search dot C, let's actually do phone book dot c.
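The three possible answers from strcmp mentioned above (before, equal, after) can be illustrated with a small sketch. The helper compare_sign is hypothetical. It exists because the C standard only promises that strcmp returns a negative value, zero, or a positive value, not specifically -1, 0, or 1.

```c
#include <string.h>

// Collapses strcmp's result to exactly -1, 0, or 1:
//   -1 if a comes before b in ASCII-betical order,
//    0 if the two strings are equal,
//    1 if a comes after b.
int compare_sign(const char *a, const char *b)
{
    int result = strcmp(a, b);
    return (result > 0) - (result < 0);
}
```

Because the comparison is on character codes rather than dictionary order, compare_sign("Zebra", "apple") is -1: uppercase Z is 90 in ASCII and lowercase a is 97, so every uppercase letter sorts before every lowercase one.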
1174 00:51:46,550 --> 00:51:50,510 And in this phone book, I'm going to go ahead and include the same header file, 1175 00:51:50,510 --> 00:51:52,730 so cs50.h, so I can get input-- 1176 00:51:52,730 --> 00:51:55,310 standard io.h, so I can use printf-- 1177 00:51:55,310 --> 00:51:58,070 string.h so that I can use str compare. 1178 00:51:58,070 --> 00:52:02,250 Let me give myself a main function again without command line arguments for now. 1179 00:52:02,250 --> 00:52:05,150 And let me go ahead now and store a proper phone book, which 1180 00:52:05,150 --> 00:52:07,800 has some names and some actual numbers. 1181 00:52:07,800 --> 00:52:09,320 So let's store the names first. 1182 00:52:09,320 --> 00:52:12,560 So string-- names is going to be the name of my array. 1183 00:52:12,560 --> 00:52:14,580 And let's go ahead and store Carter's name, 1184 00:52:14,580 --> 00:52:18,260 how about my name, and maybe John Harvard for phone book throwback. 1185 00:52:18,260 --> 00:52:22,640 Then let's go ahead and give me another array called numbers wherein I'll put 1186 00:52:22,640 --> 00:52:27,560 our phone number, so 617-495-1000 for Carter-- 1187 00:52:27,560 --> 00:52:32,060 617-495-1000 for me-- technically, directory assistance here. 1188 00:52:32,060 --> 00:52:34,160 And then for John, we'll give him an actual one. 1189 00:52:34,160 --> 00:52:40,100 So it's actually going to be 949-468-2750. 1190 00:52:40,100 --> 00:52:42,380 You're welcome to text or call John when you want. 1191 00:52:42,380 --> 00:52:43,070 Whoops. 1192 00:52:43,070 --> 00:52:46,130 And just for good measure, let's go ahead and put our country codes 1193 00:52:46,130 --> 00:52:50,990 in here plus 1, even though at the end of the day, these are strings. 1194 00:52:50,990 --> 00:52:54,320 So indeed notice I'm not using an integer for these values. 1195 00:52:54,320 --> 00:52:56,510 I kind of sort of should. 
1196 00:52:56,510 --> 00:52:59,270 But here's where data types in C and programming more 1197 00:52:59,270 --> 00:53:01,280 generally might sometimes mislead you. 1198 00:53:01,280 --> 00:53:04,100 Even though we call it, obviously, a phone number. 1199 00:53:04,100 --> 00:53:08,210 It's probably best to represent it generally as strings, in fact, 1200 00:53:08,210 --> 00:53:12,110 so that you can have the pluses-- you can have the dashes so that it doesn't 1201 00:53:12,110 --> 00:53:14,030 get too big and overflow an integer. 1202 00:53:14,030 --> 00:53:16,340 Maybe it's an international number for which there's even more digits. 1203 00:53:16,340 --> 00:53:18,120 You don't want to risk overflowing a value. 1204 00:53:18,120 --> 00:53:20,120 And in general, the rule of thumb in programming 1205 00:53:20,120 --> 00:53:22,940 is even if in English we call something a number, 1206 00:53:22,940 --> 00:53:26,600 if you wouldn't do math on it ever, you should probably be storing it 1207 00:53:26,600 --> 00:53:29,060 as a string, not as an integer. 1208 00:53:29,060 --> 00:53:33,060 And it makes no logical sense to do math on phone numbers, per se. 1209 00:53:33,060 --> 00:53:35,570 So those are best instinctively left as strings. 1210 00:53:35,570 --> 00:53:37,850 And in this case, even more simply, this ensures 1211 00:53:37,850 --> 00:53:41,630 that we have pluses and dashes stored inside of the string. 1212 00:53:41,630 --> 00:53:46,490 All right, so now that I have these two arrays in parallel, if you will. 1213 00:53:46,490 --> 00:53:49,040 Like, I'm assuming that Carter's name is first. 1214 00:53:49,040 --> 00:53:50,180 So his number is first. 1215 00:53:50,180 --> 00:53:51,230 David's name is second. 1216 00:53:51,230 --> 00:53:52,850 So his number is second. 1217 00:53:52,850 --> 00:53:54,810 John's is third and thus third. 1218 00:53:54,810 --> 00:53:56,060 So let's actually search this. 
1219 00:53:56,060 --> 00:53:58,010 Let's ask the user for-- 1220 00:53:58,010 --> 00:54:01,340 how about a name using get string. 1221 00:54:01,340 --> 00:54:03,950 And this will be a name to search for in the phone book 1222 00:54:03,950 --> 00:54:05,420 just like we did in week zero. 1223 00:54:05,420 --> 00:54:09,290 Let's do for int i equals 0, i less than 3. 1224 00:54:09,290 --> 00:54:10,738 Again, the 3 is bad practice. 1225 00:54:10,738 --> 00:54:12,530 I should probably store that in a constant. 1226 00:54:12,530 --> 00:54:15,950 But let's keep the focus for today on the algorithm alone-- 1227 00:54:15,950 --> 00:54:17,330 i plus, plus. 1228 00:54:17,330 --> 00:54:20,240 Then in here, let's do if-- 1229 00:54:20,240 --> 00:54:27,800 how about names bracket i equals, equals name typed in. 1230 00:54:27,800 --> 00:54:28,550 But wait a minute. 1231 00:54:28,550 --> 00:54:30,320 I'm screwing this up again. 1232 00:54:30,320 --> 00:54:32,270 What should I be using here? 1233 00:54:32,270 --> 00:54:37,340 str compare again-- so let's do str compare, names bracket i comma, 1234 00:54:37,340 --> 00:54:38,930 name, which came from the user. 1235 00:54:38,930 --> 00:54:42,770 And if that return value is zero, then let's go ahead 1236 00:54:42,770 --> 00:54:45,830 and print out, just like before, found backslash n. 1237 00:54:45,830 --> 00:54:46,580 But you know what? 1238 00:54:46,580 --> 00:54:48,170 We can do something more interesting. 1239 00:54:48,170 --> 00:54:49,670 Let's actually print out the number. 1240 00:54:49,670 --> 00:54:51,260 So I didn't just find something. 1241 00:54:51,260 --> 00:54:52,550 I found the name. 1242 00:54:52,550 --> 00:54:55,710 So let's actually plug in that person's corresponding number. 1243 00:54:55,710 --> 00:54:57,900 So now it's a more useful phonebook or contacts app 1244 00:54:57,900 --> 00:55:00,060 where I'm going to show the human not just found 1245 00:55:00,060 --> 00:55:02,430 but found this specific number. 
1246 00:55:02,430 --> 00:55:06,010 Then I'm going to go ahead as before and return 0 to indicate success. 1247 00:55:06,010 --> 00:55:09,538 And if we get all the way down here, I'm going to go ahead and say not found 1248 00:55:09,538 --> 00:55:12,330 and not print out any number because I obviously haven't found one, 1249 00:55:12,330 --> 00:55:15,880 and return one by convention to indicate failure. 1250 00:55:15,880 --> 00:55:17,890 So let me open my terminal window. 1251 00:55:17,890 --> 00:55:20,340 Let me do make phone book-- enter. 1252 00:55:20,340 --> 00:55:23,010 So far, so good-- dot slash phone book. 1253 00:55:23,010 --> 00:55:25,080 Let's search for Carter-- 1254 00:55:25,080 --> 00:55:27,060 enter-- found his number. 1255 00:55:27,060 --> 00:55:27,840 Let's do it again. 1256 00:55:27,840 --> 00:55:28,757 Dot slash phone book-- 1257 00:55:28,757 --> 00:55:30,030 David-- found it. 1258 00:55:30,030 --> 00:55:32,490 Let's do it one more time for John-- 1259 00:55:32,490 --> 00:55:33,240 found it. 1260 00:55:33,240 --> 00:55:38,160 And just for good measure, let's do one other here like Eli-- 1261 00:55:38,160 --> 00:55:40,920 enter-- not found in this case. 1262 00:55:40,920 --> 00:55:43,710 All right, so it seems to be working based on this example here. 1263 00:55:43,710 --> 00:55:48,630 But now we've actually implemented the idea of, like, a proper phone book. 1264 00:55:48,630 --> 00:55:51,510 But does any aspect of the design of this code, 1265 00:55:51,510 --> 00:55:54,060 even if you've never programmed before CS50, 1266 00:55:54,060 --> 00:55:58,120 does anything rub you wrong about how we're storing our data-- 1267 00:55:58,120 --> 00:56:02,690 the phone book itself-- these names and numbers? 1268 00:56:02,690 --> 00:56:04,220 Does anything rub you the wrong way? 1269 00:56:04,220 --> 00:56:06,008 Yeah, in back. 
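The parallel-array search just described can be sketched as follows. This is a minimal sketch, not the lecture's exact file: it uses plain `const char *` in place of the CS50 `string` type, takes the name as an argument rather than calling `get_string`, and the helper name `lookup_number` and the dash formatting of the numbers are assumptions for illustration.

```c
#include <stdio.h>
#include <string.h>

// Two parallel arrays: names[i] and numbers[i] describe the same contact.
// Numbers are stored as strings so that '+' and '-' survive and nothing overflows.
static const char *names[] = {"Carter", "David", "John"};
static const char *numbers[] = {"+1-617-495-1000", "+1-617-495-1000", "+1-949-468-2750"};

// Hypothetical helper: linear search over the names array.
// Returns the matching number, or NULL when the name is not present.
const char *lookup_number(const char *name)
{
    for (int i = 0; i < 3; i++)
    {
        // strcmp returns 0 when the two strings are equal;
        // == on two strings would compare addresses, not characters.
        if (strcmp(names[i], name) == 0)
        {
            return numbers[i];
        }
    }
    return NULL;
}
```

A caller would print the returned number when it is not NULL and return 0, or print "Not found" and return 1 otherwise, matching the exit-status convention mentioned above.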
1270 00:56:06,008 --> 00:56:09,326 AUDIENCE: [INAUDIBLE] 1271 00:56:09,326 --> 00:56:11,222 1272 00:56:11,222 --> 00:56:13,320 DAVID MALAN: Yeah, really good observation. 1273 00:56:13,320 --> 00:56:15,510 I'm separating the names and the numbers, 1274 00:56:15,510 --> 00:56:17,775 which indeed, it looks a little bit weird. 1275 00:56:17,775 --> 00:56:20,650 And there's actually this technical term in the world of programming, 1276 00:56:20,650 --> 00:56:22,088 which is code smell, where, like-- 1277 00:56:22,088 --> 00:56:25,380 [SNIFFING]---- something smells a little off about this code in the sense that, 1278 00:56:25,380 --> 00:56:27,270 like, this probably doesn't end well, right? 1279 00:56:27,270 --> 00:56:31,438 If I add a fourth name, a fourth number, a fifth name, a fiftieth name 1280 00:56:31,438 --> 00:56:33,480 and number, like, at some point, they're probably 1281 00:56:33,480 --> 00:56:34,813 going to get out of sync, right? 1282 00:56:34,813 --> 00:56:37,110 So, like, there's something awry about this design 1283 00:56:37,110 --> 00:56:39,870 that I shouldn't decouple the names from the numbers. 1284 00:56:39,870 --> 00:56:42,300 So something kind of smells about this code, so to speak. 1285 00:56:42,300 --> 00:56:44,760 And any time you perceive that in your code, 1286 00:56:44,760 --> 00:56:48,150 it's probably an opportunity to go about improving it somehow. 1287 00:56:48,150 --> 00:56:50,893 But to do that, we actually need another tool in the toolkit. 1288 00:56:50,893 --> 00:56:53,560 And that is, again, this term that I've used a couple of times-- 1289 00:56:53,560 --> 00:56:54,367 data structures. 1290 00:56:54,367 --> 00:56:56,700 Like, arrays have been the first of our data structures. 1291 00:56:56,700 --> 00:56:57,783 And they're so simplistic. 1292 00:56:57,783 --> 00:57:01,110 It just means storing things back to back to back contiguously in memory. 
1293 00:57:01,110 --> 00:57:03,300 But it turns out C-- 1294 00:57:03,300 --> 00:57:06,660 and a little bit of new syntax-- but it's not a lot of new syntax today-- 1295 00:57:06,660 --> 00:57:09,060 a little bit of new syntax today will allow 1296 00:57:09,060 --> 00:57:13,710 us to create our own data structures, our own types of variables, 1297 00:57:13,710 --> 00:57:16,330 largely using syntax we've seen thus far. 1298 00:57:16,330 --> 00:57:19,230 So to do this, let me propose that in order 1299 00:57:19,230 --> 00:57:22,980 to represent a person in a phone book-- well, let's 1300 00:57:22,980 --> 00:57:26,250 not just implement them as a list of names and a list of numbers. 1301 00:57:26,250 --> 00:57:30,660 Wouldn't it be nice if C had a data type actually called person? 1302 00:57:30,660 --> 00:57:34,560 Because if it did, then I could go about creating an array called-- 1303 00:57:34,560 --> 00:57:36,690 using the pluralized form-- people-- 1304 00:57:36,690 --> 00:57:40,470 containing my people in my phone book. 1305 00:57:40,470 --> 00:57:44,070 And maybe a person has both a name and a number. 1306 00:57:44,070 --> 00:57:46,690 And therefore, we can kind of keep everything together. 1307 00:57:46,690 --> 00:57:47,740 So how can I do this? 1308 00:57:47,740 --> 00:57:48,960 Well, what is a person? 1309 00:57:48,960 --> 00:57:52,350 Well, a person, really, in this story is a person has a name. 1310 00:57:52,350 --> 00:57:53,730 And a person has a number. 1311 00:57:53,730 --> 00:57:58,500 So can we create a new data type that maybe has both of these together? 1312 00:57:58,500 --> 00:58:02,280 Well, we actually can by using one piece of new syntax 1313 00:58:02,280 --> 00:58:05,100 today, which is just this here. 1314 00:58:05,100 --> 00:58:09,395 Using what's called a struct, we can create our own data structure 1315 00:58:09,395 --> 00:58:11,020 that actually has some structure in it. 
1316 00:58:11,020 --> 00:58:13,150 It's not just one thing like a string or an int. 1317 00:58:13,150 --> 00:58:14,320 Maybe it's two strings. 1318 00:58:14,320 --> 00:58:15,190 Maybe it's two ints. 1319 00:58:15,190 --> 00:58:16,210 Maybe it's one of each. 1320 00:58:16,210 --> 00:58:21,380 So a structure can be a variable that contains any number of other variables, 1321 00:58:21,380 --> 00:58:22,060 so to speak. 1322 00:58:22,060 --> 00:58:24,310 And typedef is a cryptic keyword that just 1323 00:58:24,310 --> 00:58:28,630 means define the following type-- invent the following data type for me. 1324 00:58:28,630 --> 00:58:30,320 And the syntax is a little weird. 1325 00:58:30,320 --> 00:58:32,620 But you say typedef struct curly brace. 1326 00:58:32,620 --> 00:58:35,410 Inside of the curly braces, you put all of the types 1327 00:58:35,410 --> 00:58:38,090 of variables you want to associate with this new data type. 1328 00:58:38,090 --> 00:58:40,660 And then after the curly brace, you invent the name 1329 00:58:40,660 --> 00:58:41,830 that you want to give it. 1330 00:58:41,830 --> 00:58:45,550 And this will create a new data type in C called person, 1331 00:58:45,550 --> 00:58:47,830 even though no one thought of this decades 1332 00:58:47,830 --> 00:58:52,670 ago when C was created alongside of the ints and the floats and so forth. 1333 00:58:52,670 --> 00:58:55,220 So how can I actually use this? 1334 00:58:55,220 --> 00:58:58,330 Well, it turns out that once you have this building 1335 00:58:58,330 --> 00:59:03,190 block of creating your very own data structures, I can go back into my code 1336 00:59:03,190 --> 00:59:06,340 and improve it as follows in direct response to your concern 1337 00:59:06,340 --> 00:59:10,752 about it seeming not ideal that we're decoupling the names and the numbers. 1338 00:59:10,752 --> 00:59:13,210 Now, it's going to look a little more complicated at first. 1339 00:59:13,210 --> 00:59:15,320 But it will scale better over time. 
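The typedef struct syntax described here might look like the sketch below. The field names `name` and `number` come from the lecture; `const char *` stands in for the CS50 `string` type, and `make_person` is a hypothetical helper added only to show a value of the new type being created.

```c
// Define a new data type called person that groups two strings together.
typedef struct
{
    const char *name;
    const char *number;
} person;

// Hypothetical helper: build a person with both fields filled in,
// so that neither field is left holding a garbage value.
person make_person(const char *name, const char *number)
{
    person p = {name, number};
    return p;
}
```

Once the type exists, `person` can be used anywhere `int` or `float` could be, including as the element type of an array.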
1340 00:59:15,320 --> 00:59:17,380 So within my code here, I'm going to go ahead 1341 00:59:17,380 --> 00:59:21,020 and essentially type out exactly that same data type. 1342 00:59:21,020 --> 00:59:27,460 So define a structure that has inside of it a string called name and a string 1343 00:59:27,460 --> 00:59:28,300 called number. 1344 00:59:28,300 --> 00:59:30,700 And let's call this thing a person. 1345 00:59:30,700 --> 00:59:34,270 So these new lines copied and pasted from the slide a moment ago just 1346 00:59:34,270 --> 00:59:36,370 invent the data type called person. 1347 00:59:36,370 --> 00:59:39,010 So the only thing we need today is the syntax via which 1348 00:59:39,010 --> 00:59:43,610 we can set a person's name and number-- like, how can we access those values? 1349 00:59:43,610 --> 00:59:47,920 So to do this, I'm going to go ahead and erase the hard-coded arrays 1350 00:59:47,920 --> 00:59:49,250 that I had before. 1351 00:59:49,250 --> 00:59:53,740 And I'm going to give myself one array of type person called people 1352 00:59:53,740 --> 00:59:55,610 with room for three people. 1353 00:59:55,610 --> 00:59:57,490 So this seems like a bit of a mouthful. 1354 00:59:57,490 --> 00:59:59,560 But the new data type is called person. 1355 00:59:59,560 --> 01:00:02,048 The array is called people, which in English is weird. 1356 01:00:02,048 --> 01:00:04,840 I mean, I could call it persons to make it a little more parallel. 1357 01:00:04,840 --> 01:00:06,520 But we call them people, generally. 1358 01:00:06,520 --> 01:00:09,880 And I want three such people in my phone book. 1359 01:00:09,880 --> 01:00:12,670 Now, how do I actually initialize those people? 1360 01:00:12,670 --> 01:00:16,030 Well, previously, I did something like this-- names, bracket, 0. 1361 01:00:16,030 --> 01:00:18,520 And then I did numbers, bracket, 0. 1362 01:00:18,520 --> 01:00:20,080 Well, it's a similar idea. 1363 01:00:20,080 --> 01:00:22,630 I do people, bracket, 0.
1364 01:00:22,630 --> 01:00:27,490 But if I want to set this person's name, the one new piece of syntax today 1365 01:00:27,490 --> 01:00:32,290 is a period, a literal dot operator, that says go inside of this person 1366 01:00:32,290 --> 01:00:36,400 and access their name field or their name attribute. 1367 01:00:36,400 --> 01:00:39,190 And set it equal to, quote, unquote, "Carter." 1368 01:00:39,190 --> 01:00:41,800 Then go into that same person. 1369 01:00:41,800 --> 01:00:43,660 But go into their number field. 1370 01:00:43,660 --> 01:00:48,368 And set that equal to plus 1, 617-495-1000. 1371 01:00:48,368 --> 01:00:50,410 Then-- and I'll just separate it with white space 1372 01:00:50,410 --> 01:00:53,500 for readability-- go into people bracket 1. 1373 01:00:53,500 --> 01:00:56,170 Set their name into mine, David. 1374 01:00:56,170 --> 01:00:58,390 Let's go into people bracket 1 number. 1375 01:00:58,390 --> 01:01:02,000 Set that equal to the same, since we're both available through the directory, 1376 01:01:02,000 --> 01:01:04,720 so 617-495-1000. 1377 01:01:04,720 --> 01:01:10,390 And then lastly, let's go ahead and do people bracket 2 dot name equals, 1378 01:01:10,390 --> 01:01:15,210 quote, unquote, "John," and then people bracket 2 dot number equals, quote, 1379 01:01:15,210 --> 01:01:21,646 unquote, "plus 1, 949-468-2750-- 1380 01:01:21,646 --> 01:01:23,548 let me fix my dash-- 1381 01:01:23,548 --> 01:01:24,850 semicolon. 1382 01:01:24,850 --> 01:01:29,010 And now I think I can mostly keep the rest of the code the same 1383 01:01:29,010 --> 01:01:31,500 because if I'm searching for this person's name, 1384 01:01:31,500 --> 01:01:35,160 I think the only thing I want to change is this because I don't have a names 1385 01:01:35,160 --> 01:01:36,300 array anymore. 1386 01:01:36,300 --> 01:01:40,440 So what should I change this highlighted phrase to in order 1387 01:01:40,440 --> 01:01:44,930 to search the ith person's name? 
1388 01:01:44,930 --> 01:01:45,800 What should this be? 1389 01:01:45,800 --> 01:01:47,870 Yeah? 1390 01:01:47,870 --> 01:01:51,470 People bracket i dot name because the whole point of this loop 1391 01:01:51,470 --> 01:01:54,330 is to iterate over each of these people one at a time. 1392 01:01:54,330 --> 01:01:57,950 So people bracket i gives me the ith person-- first [INAUDIBLE] people 0, 1393 01:01:57,950 --> 01:01:59,060 people 1, people 2. 1394 01:01:59,060 --> 01:02:01,880 But if on each iteration, I want to check that person's name 1395 01:02:01,880 --> 01:02:04,010 and compare it against whatever the human typed in, 1396 01:02:04,010 --> 01:02:05,700 now I can do exactly that. 1397 01:02:05,700 --> 01:02:12,350 But I have to change the output to be people bracket i dot number, 1398 01:02:12,350 --> 01:02:13,560 in this case. 1399 01:02:13,560 --> 01:02:15,838 So I've added a little bit of complexity. 1400 01:02:15,838 --> 01:02:19,130 And granted, this is not going to be the way long term you create a phone book. 1401 01:02:19,130 --> 01:02:21,660 Odds are, we're going to get the phone book with a loop of some sort. 1402 01:02:21,660 --> 01:02:24,702 We're going to use a constant, so I know how many people I have room for. 1403 01:02:24,702 --> 01:02:27,440 For demonstration's sake, I'm just typing everything out manually. 1404 01:02:27,440 --> 01:02:30,660 But I think logically now, we've achieved the same thing. 1405 01:02:30,660 --> 01:02:33,050 So let me do make phone book for this new version-- 1406 01:02:33,050 --> 01:02:35,180 no syntax errors-- dot slash phone book. 1407 01:02:35,180 --> 01:02:36,770 Let's search for Carter-- enter. 1408 01:02:36,770 --> 01:02:38,210 And there indeed is his number. 1409 01:02:38,210 --> 01:02:39,710 Let's go ahead and search for David. 1410 01:02:39,710 --> 01:02:40,860 There's my number. 1411 01:02:40,860 --> 01:02:42,470 And lastly, let's search for John-- 1412 01:02:42,470 --> 01:02:42,970 his number.
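The struct-based version of the phone book walked through above can be sketched like this. It is an illustration under assumptions rather than the lecture's exact file: `const char *` replaces the CS50 `string` type, and `load_people` and `find_number` are hypothetical helper names standing in for code that the lecture writes directly inside main.

```c
#include <stdio.h>
#include <string.h>

// One type that keeps a name and its number together.
typedef struct
{
    const char *name;
    const char *number;
} person;

// Hypothetical helper: fill in the three entries by hand, using the dot
// operator to reach inside each person, as in the demonstration.
void load_people(person people[3])
{
    people[0].name = "Carter";
    people[0].number = "+1-617-495-1000";

    people[1].name = "David";
    people[1].number = "+1-617-495-1000";

    people[2].name = "John";
    people[2].number = "+1-949-468-2750";
}

// Hypothetical helper: linear search that compares people[i].name
// and returns people[i].number, or NULL when nobody matches.
const char *find_number(const person people[], int count, const char *name)
{
    for (int i = 0; i < count; i++)
    {
        if (strcmp(people[i].name, name) == 0)
        {
            return people[i].number;
        }
    }
    return NULL;
}
```

Because each name travels with its number inside one array element, adding a fourth or fiftieth entry cannot leave the two out of sync, which was the concern raised about the parallel arrays.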
1413 01:02:42,970 --> 01:02:45,387 And again, we'll search for someone we know is not there-- 1414 01:02:45,387 --> 01:02:46,130 Eli. 1415 01:02:46,130 --> 01:02:47,840 And Eli is not found. 1416 01:02:47,840 --> 01:02:52,160 So what we've done is try to solve this problem of introducing a brand new data 1417 01:02:52,160 --> 01:02:57,560 type that allows us to represent this custom data that you and I just 1418 01:02:57,560 --> 01:02:58,220 created. 1419 01:02:58,220 --> 01:03:02,100 And that is by using the struct keyword to cluster these things together 1420 01:03:02,100 --> 01:03:05,960 and the typedef keyword to give it a brand new name that we might like. 1421 01:03:05,960 --> 01:03:08,330 Now, as an aside, when it comes to styling your code, 1422 01:03:08,330 --> 01:03:11,430 you'll actually see that style50 and similar tools will actually 1423 01:03:11,430 --> 01:03:14,610 put the name of the data type on the same line as the closing curly brace, 1424 01:03:14,610 --> 01:03:15,900 just sort of a curiosity. 1425 01:03:15,900 --> 01:03:16,523 That's fine. 1426 01:03:16,523 --> 01:03:18,690 Even though I wrote it the first way for consistency 1427 01:03:18,690 --> 01:03:23,530 with Scratch and our prior examples, this is the right way in styling it. 1428 01:03:23,530 --> 01:03:26,470 Any questions now on this data type? 1429 01:03:26,470 --> 01:03:26,970 Yeah? 1430 01:03:26,970 --> 01:03:29,675 1431 01:03:29,675 --> 01:03:33,042 AUDIENCE: [INAUDIBLE] 1432 01:03:33,042 --> 01:03:34,966 1433 01:03:34,966 --> 01:03:36,980 DAVID MALAN: The question is, do you have 1434 01:03:36,980 --> 01:03:39,590 to assign both the name and the number when creating a person? 1435 01:03:39,590 --> 01:03:42,410 Or can you only get away with assigning one of them? 1436 01:03:42,410 --> 01:03:43,200 You can. 1437 01:03:43,200 --> 01:03:46,200 But that's going to be one of those so-called garbage values, typically. 
1438 01:03:46,200 --> 01:03:48,800 And so there's just going to be some bogus data there. 1439 01:03:48,800 --> 01:03:53,270 And unless you are so careful as to never touch that field thereafter, 1440 01:03:53,270 --> 01:03:56,630 you probably run the risk of some kind of bug, even a crash in your code, 1441 01:03:56,630 --> 01:03:59,330 if you try to access that value later, even though you've never 1442 01:03:59,330 --> 01:03:59,960 initialized it. 1443 01:03:59,960 --> 01:04:01,100 So yes, it's possible. 1444 01:04:01,100 --> 01:04:02,520 But no, do not do that. 1445 01:04:02,520 --> 01:04:05,020 Other questions? 1446 01:04:05,020 --> 01:04:05,890 No? 1447 01:04:05,890 --> 01:04:08,470 All right, well, now that we have the ability 1448 01:04:08,470 --> 01:04:11,570 to sort of represent more interesting structures, up until now, 1449 01:04:11,570 --> 01:04:14,350 we've sort of assumed that in order to get to binary search, 1450 01:04:14,350 --> 01:04:16,990 we have a phone book that someone already sorted for us. 1451 01:04:16,990 --> 01:04:19,540 For our second demonstration with our volunteers, 1452 01:04:19,540 --> 01:04:21,820 we assumed that someone, Carter in that case, 1453 01:04:21,820 --> 01:04:24,010 had already sorted the information for us. 1454 01:04:24,010 --> 01:04:26,050 It was proposed by the audience that, well, 1455 01:04:26,050 --> 01:04:29,770 what if we sort the information first and then go find the number 50? 1456 01:04:29,770 --> 01:04:33,580 That invited the question even early on-- well, how expensive is it to sort? 1457 01:04:33,580 --> 01:04:38,170 How much time and money and inefficiency do Google and Microsoft and others 1458 01:04:38,170 --> 01:04:41,590 spend to keep their data and our data sorted? 1459 01:04:41,590 --> 01:04:43,840 Well, let's consider what the problem really is. 
1460 01:04:43,840 --> 01:04:47,380 If this is how we represent any problem, the unsorted data 1461 01:04:47,380 --> 01:04:50,770 that we might consume by typing things in randomly to a search engine 1462 01:04:50,770 --> 01:04:54,430 or crawling the internet or just adding contacts in any old order to our phone 1463 01:04:54,430 --> 01:04:55,840 is arguably unsorted. 1464 01:04:55,840 --> 01:04:58,060 It's certainly not alphabetically sorted by default. 1465 01:04:58,060 --> 01:04:59,650 But we want to get it sorted. 1466 01:04:59,650 --> 01:05:02,860 And so somewhere in here in this black box, we need a set of algorithms 1467 01:05:02,860 --> 01:05:05,130 for actually sorting information as well. 1468 01:05:05,130 --> 01:05:09,080 For instance, if we have these integers here unsorted-- 1469 01:05:09,080 --> 01:05:14,130 7, 2, 5, 4, 1, 6, 0, 3-- effectively random, the problem of sorting, of course, 1470 01:05:14,130 --> 01:05:18,050 is to turn it into 0, 1, 2, 3, 4, 5, 6, 7 instead. 1471 01:05:18,050 --> 01:05:20,100 And there's a bunch of different ways to do this. 1472 01:05:20,100 --> 01:05:22,730 But before we do that, I think it's probably time for some brownies. 1473 01:05:22,730 --> 01:05:24,605 So let's go ahead and take a 10-minute break. 1474 01:05:24,605 --> 01:05:26,570 And we'll see you in 10. 1475 01:05:26,570 --> 01:05:28,040 All right, we are back. 1476 01:05:28,040 --> 01:05:30,470 And where we left off was this cliffhanger. 1477 01:05:30,470 --> 01:05:32,060 We've got some unsorted numbers. 1478 01:05:32,060 --> 01:05:33,380 We want to make them sorted. 1479 01:05:33,380 --> 01:05:34,430 How do we do this? 1480 01:05:34,430 --> 01:05:37,250 And at the risk of one too many volunteers, 1481 01:05:37,250 --> 01:05:39,680 could we get eight more volunteers? 1482 01:05:39,680 --> 01:05:41,840 Wow, OK, overwhelming. 1483 01:05:41,840 --> 01:05:45,110 OK, how about 1, 2, 3, 4. 1484 01:05:45,110 --> 01:05:46,250 How about all three of you?
1485 01:05:46,250 --> 01:05:51,290 OK, 5, 6, 7-- 1486 01:05:51,290 --> 01:05:51,950 OK, 8. 1487 01:05:51,950 --> 01:05:52,460 Come on. 1488 01:05:52,460 --> 01:05:54,120 All right, one from each section-- 1489 01:05:54,120 --> 01:05:56,244 all right, come on down. 1490 01:05:56,244 --> 01:06:00,073 [INTERPOSING VOICES] 1491 01:06:00,073 --> 01:06:01,615 DAVID MALAN: All right, come on down. 1492 01:06:01,615 --> 01:06:05,410 1493 01:06:05,410 --> 01:06:06,100 Thank you. 1494 01:06:06,100 --> 01:06:10,695 And if you all could grab a number here. 1495 01:06:10,695 --> 01:06:11,320 So you'll be 7. 1496 01:06:11,320 --> 01:06:13,220 Stand on the left there if you could. 1497 01:06:13,220 --> 01:06:14,470 All right, you'll be number 2. 1498 01:06:14,470 --> 01:06:17,590 Stand to his left. 1499 01:06:17,590 --> 01:06:20,020 OK, keep coming. 1500 01:06:20,020 --> 01:06:21,910 OK, yeah. 1501 01:06:21,910 --> 01:06:24,380 OK, here we go. 1502 01:06:24,380 --> 01:06:26,530 There you go. 1503 01:06:26,530 --> 01:06:28,210 OK, [INAUDIBLE] 0 and 3. 1504 01:06:28,210 --> 01:06:31,810 OK, so let's just make sure they match what we've got there. 1505 01:06:31,810 --> 01:06:32,810 Good so far. 1506 01:06:32,810 --> 01:06:36,313 OK, so we have eight volunteers here, an array of volunteers, if you would. 1507 01:06:36,313 --> 01:06:38,230 This time, we've used eight rather than seven. 1508 01:06:38,230 --> 01:06:41,840 We deliberately had seven lockers just so that the divide by 2 math 1509 01:06:41,840 --> 01:06:42,340 worked out. 1510 01:06:42,340 --> 01:06:44,890 That was very deliberate that there was always a [INAUDIBLE] a middle, 1511 01:06:44,890 --> 01:06:45,682 a middle, a middle. 1512 01:06:45,682 --> 01:06:48,682 In this case, it doesn't matter because we're going to focus on sorting. 1513 01:06:48,682 --> 01:06:51,130 But first, how about some introductions from each group? 1514 01:06:51,130 --> 01:06:52,520 AUDIENCE: I'm Rebecca. 
1515 01:06:52,520 --> 01:06:53,962 I'm a first year in Pennypacker. 1516 01:06:53,962 --> 01:06:56,170 And I'm thinking about studying environmental science 1517 01:06:56,170 --> 01:06:57,410 and public policy. 1518 01:06:57,410 --> 01:06:57,910 [CHEERING] 1519 01:06:57,910 --> 01:06:59,240 DAVID MALAN: Nice. 1520 01:06:59,240 --> 01:07:00,100 [APPLAUSE] 1521 01:07:00,100 --> 01:07:02,420 AUDIENCE: I'm [? Mariella. ?] I'm also in Pennypacker. 1522 01:07:02,420 --> 01:07:03,170 I'm a first year. 1523 01:07:03,170 --> 01:07:05,020 And I'm thinking of studying bioengineering. 1524 01:07:05,020 --> 01:07:06,565 DAVID MALAN: Wonderful. 1525 01:07:06,565 --> 01:07:09,440 AUDIENCE: My name's [? Haron ?] [? Li. ?] I'm a freshman in Matthews. 1526 01:07:09,440 --> 01:07:11,015 I'm planning on studying applied mathematics. 1527 01:07:11,015 --> 01:07:12,290 DAVID MALAN: Nice-- Matthews. 1528 01:07:12,290 --> 01:07:13,910 AUDIENCE: My name is Emily. 1529 01:07:13,910 --> 01:07:15,140 I'm a first year in Canaday. 1530 01:07:15,140 --> 01:07:17,470 And I'm still deciding what to study. 1531 01:07:17,470 --> 01:07:18,220 DAVID MALAN: Fair. 1532 01:07:18,220 --> 01:07:21,020 AUDIENCE: My name's [? Tanai. ?] I'm a first year from Toronto. 1533 01:07:21,020 --> 01:07:23,315 And I'm planning on studying ECON and CS. 1534 01:07:23,315 --> 01:07:24,110 DAVID MALAN: Nice. 1535 01:07:24,110 --> 01:07:26,030 AUDIENCE: My name is Teddy. 1536 01:07:26,030 --> 01:07:27,320 I'm a first year in Hurlbut. 1537 01:07:27,320 --> 01:07:30,320 And I'm planning on concentrating in computer science with linguistics. 1538 01:07:30,320 --> 01:07:31,070 DAVID MALAN: Nice. 1539 01:07:31,070 --> 01:07:32,000 AUDIENCE: Yeah! 1540 01:07:32,000 --> 01:07:32,750 DAVID MALAN: Nice. 1541 01:07:32,750 --> 01:07:33,690 [APPLAUSE] 1542 01:07:33,690 --> 01:07:36,470 AUDIENCE: My name's [? Lenny. ?] I'm a first year in Matthews. 1543 01:07:36,470 --> 01:07:39,300 And I'm planning on concentrating in gov and CS.
1544 01:07:39,300 --> 01:07:40,220 DAVID MALAN: Ah, nice. 1545 01:07:40,220 --> 01:07:40,565 [CHEERING] 1546 01:07:40,565 --> 01:07:41,037 [APPLAUSE] 1547 01:07:41,037 --> 01:07:41,537 And? 1548 01:07:41,537 --> 01:07:42,740 AUDIENCE: My name is Eli. 1549 01:07:42,740 --> 01:07:44,023 I'm a first year in Hollis. 1550 01:07:44,023 --> 01:07:45,440 And I plan on concentrating in CS. 1551 01:07:45,440 --> 01:07:47,540 DAVID MALAN: Eli, we keep looking for you today. 1552 01:07:47,540 --> 01:07:49,962 OK, so if you guys could scooch a little bit this way just 1553 01:07:49,962 --> 01:07:51,170 to be a little more centered. 1554 01:07:51,170 --> 01:07:54,750 Notice that this array of volunteers is entirely unsorted. 1555 01:07:54,750 --> 01:07:56,960 So we thought we'd do three passes at this. 1556 01:07:56,960 --> 01:08:00,135 Could you all sort yourselves from smallest to largest? 1557 01:08:00,135 --> 01:08:00,635 Go. 1558 01:08:00,635 --> 01:08:06,420 1559 01:08:06,420 --> 01:08:08,190 All right, that was very good. 1560 01:08:08,190 --> 01:08:09,420 So yes, so, sure. 1561 01:08:09,420 --> 01:08:10,003 OK. 1562 01:08:10,003 --> 01:08:10,503 [CHEERING] 1563 01:08:10,503 --> 01:08:12,920 [APPLAUSE] 1564 01:08:12,920 --> 01:08:14,420 And let's see. 1565 01:08:14,420 --> 01:08:17,603 What was your algorithm? 1566 01:08:17,603 --> 01:08:19,520 AUDIENCE: I just kind of found the person that 1567 01:08:19,520 --> 01:08:21,260 was, like, one lower or one higher. 1568 01:08:21,260 --> 01:08:22,560 So I looked at him. 1569 01:08:22,560 --> 01:08:23,060 He had 2. 1570 01:08:23,060 --> 01:08:24,500 So I knew I had to be on his left. 1571 01:08:24,500 --> 01:08:25,729 And then she had 3. 1572 01:08:25,729 --> 01:08:27,410 So I told her to come to my right. 1573 01:08:27,410 --> 01:08:28,467 And then I knew he had 5. 1574 01:08:28,467 --> 01:08:30,050 DAVID MALAN: OK, nice-- pretty clever. 1575 01:08:30,050 --> 01:08:31,640 And how about your algorithm? 
1576 01:08:31,640 --> 01:08:34,935 AUDIENCE: Same thing-- looked for the number that was lower and higher 1577 01:08:34,935 --> 01:08:35,810 and found the middle. 1578 01:08:35,810 --> 01:08:38,029 DAVID MALAN: OK, interesting, because I think from the outside view, 1579 01:08:38,029 --> 01:08:39,638 it all seemed very organic. 1580 01:08:39,638 --> 01:08:42,180 And things just kind of worked themselves out, which is fine. 1581 01:08:42,180 --> 01:08:44,930 But I dare say what you guys did was probably a little hard for me 1582 01:08:44,930 --> 01:08:46,410 to translate into code. 1583 01:08:46,410 --> 01:08:48,560 So let me propose that we take two passes at this. 1584 01:08:48,560 --> 01:08:51,380 If you could reset yourselves to be in that order from left 1585 01:08:51,380 --> 01:08:54,920 to right, which is just the exact same sequence, just so that we're 1586 01:08:54,920 --> 01:08:58,310 starting from the same point each time. 1587 01:08:58,310 --> 01:08:59,630 [INAUDIBLE] 1588 01:08:59,630 --> 01:09:00,319 Very good. 1589 01:09:00,319 --> 01:09:02,120 All right, so let me propose this. 1590 01:09:02,120 --> 01:09:04,470 We can approach sorting in a couple of different ways. 1591 01:09:04,470 --> 01:09:05,720 But it needs to be methodical. 1592 01:09:05,720 --> 01:09:09,600 Like, it needs to translate ideally to pseudocode and eventually code. 1593 01:09:09,600 --> 01:09:12,810 So as much as we can quantize things to be step by step, 1594 01:09:12,810 --> 01:09:15,100 I think the better this will scale overall, 1595 01:09:15,100 --> 01:09:19,020 especially when there's not eight people but maybe there's 80 people or 800. 1596 01:09:19,020 --> 01:09:22,020 Because I dare say that if we had everyone in the room sort themselves-- 1597 01:09:22,020 --> 01:09:23,580 if they were all handling a number-- 1598 01:09:23,580 --> 01:09:26,062 like, that probably wouldn't have worked out very well. 
1599 01:09:26,062 --> 01:09:29,229 It probably would have taken forever because there's just so much more data. 1600 01:09:29,229 --> 01:09:30,700 So let's be a little more methodical. 1601 01:09:30,700 --> 01:09:33,283 So for instance, I don't have a bird's eye view of the numbers 1602 01:09:33,283 --> 01:09:35,130 just as before because even though we don't 1603 01:09:35,130 --> 01:09:38,460 have doors in front of our volunteers, they're effectively lockers too. 1604 01:09:38,460 --> 01:09:42,040 And the computer or the human in my case can only look at one door at a time. 1605 01:09:42,040 --> 01:09:44,880 So if I want to find the smallest number and put it 1606 01:09:44,880 --> 01:09:46,830 where it should go on the left, I can't just 1607 01:09:46,830 --> 01:09:49,500 take a step back and be like, OK, obviously, there's the one. 1608 01:09:49,500 --> 01:09:51,010 I have to do it more methodically. 1609 01:09:51,010 --> 01:09:52,350 So I'm going to check here-- 1610 01:09:52,350 --> 01:09:52,890 7. 1611 01:09:52,890 --> 01:09:55,320 At the moment, this is actually the smallest number I've seen. 1612 01:09:55,320 --> 01:09:56,945 So I'm going to make mental note of it. 1613 01:09:56,945 --> 01:09:58,230 OK, 2 I see now. 1614 01:09:58,230 --> 01:10:01,140 I can forget about the 7 because 2 is clearly smaller. 1615 01:10:01,140 --> 01:10:02,850 5-- I don't need to remember it. 1616 01:10:02,850 --> 01:10:04,200 4-- I don't need to remember it. 1617 01:10:04,200 --> 01:10:06,330 1-- OK, that's clearly smaller. 1618 01:10:06,330 --> 01:10:08,560 Have I found my smallest number? 1619 01:10:08,560 --> 01:10:10,337 Not even because there actually is a 0. 1620 01:10:10,337 --> 01:10:11,920 So I should go through the whole list. 1621 01:10:11,920 --> 01:10:14,905 But I will remember that my smallest element is now 1. 1622 01:10:14,905 --> 01:10:18,190 6 is not smaller-- oh, 0 is smaller, so I'll remember this. 1623 01:10:18,190 --> 01:10:19,400 3 is not smaller. 
1624 01:10:19,400 --> 01:10:21,790 And so now, our volunteer for 0-- what was your name? 1625 01:10:21,790 --> 01:10:22,707 AUDIENCE: [INAUDIBLE]. 1626 01:10:22,707 --> 01:10:26,120 DAVID MALAN: Mariana, and where should we put you clearly? 1627 01:10:26,120 --> 01:10:26,665 So there. 1628 01:10:26,665 --> 01:10:27,790 But what's your name again? 1629 01:10:27,790 --> 01:10:28,665 AUDIENCE: [INAUDIBLE] 1630 01:10:28,665 --> 01:10:30,010 DAVID MALAN: Eli's in the way. 1631 01:10:30,010 --> 01:10:31,745 So we could have Mary Ellen? 1632 01:10:31,745 --> 01:10:32,620 AUDIENCE: [INAUDIBLE] 1633 01:10:32,620 --> 01:10:35,860 DAVID MALAN: [? Mariella ?] just go to the right of Eli. 1634 01:10:35,860 --> 01:10:37,390 But that's kind of cheating, right? 1635 01:10:37,390 --> 01:10:40,570 Because if this is an array, recall that they are contiguous. 1636 01:10:40,570 --> 01:10:43,578 But there could be other stuff in memory to the left and to the right. 1637 01:10:43,578 --> 01:10:44,620 So that's not quite fair. 1638 01:10:44,620 --> 01:10:47,110 We can't just have the luxury of putting things wherever we want. 1639 01:10:47,110 --> 01:10:49,120 But Eli, you're not even in the right order, anyway. 1640 01:10:49,120 --> 01:10:50,620 So why don't we just swap you two. 1641 01:10:50,620 --> 01:10:52,720 So [? Mariella ?] and Eli swap. 1642 01:10:52,720 --> 01:10:55,780 But now I've taken one bite out of this problem. 1643 01:10:55,780 --> 01:10:57,910 I've coincidentally made it a little better 1644 01:10:57,910 --> 01:11:00,760 by moving Eli closer to where he is. 1645 01:11:00,760 --> 01:11:01,510 But you know what? 1646 01:11:01,510 --> 01:11:02,830 He was in a random location, anyway. 1647 01:11:02,830 --> 01:11:05,372 So I don't think I made the problem any worse, fundamentally. 1648 01:11:05,372 --> 01:11:06,820 Now I can do this again. 
1649 01:11:06,820 --> 01:11:08,870 And I can shave a little bit of time off of it 1650 01:11:08,870 --> 01:11:11,600 because I don't need to revisit [? Mariella ?] because if she 1651 01:11:11,600 --> 01:11:14,300 was the smallest and I've already selected her from the array, 1652 01:11:14,300 --> 01:11:17,600 I can just move on and take one fewer bites this time around. 1653 01:11:17,600 --> 01:11:20,930 So 2 is the smallest number, not 5, not 4. 1654 01:11:20,930 --> 01:11:22,620 OK, 1 is the smallest number. 1655 01:11:22,620 --> 01:11:27,320 So I'm going to remember that with a mental variable-- 6, 7, 3. 1656 01:11:27,320 --> 01:11:28,430 OK, 1 is the smallest. 1657 01:11:28,430 --> 01:11:29,930 So let me select number 1. 1658 01:11:29,930 --> 01:11:32,270 And we're going to have to evict you, which 1659 01:11:32,270 --> 01:11:34,080 is making the problem slightly worse. 1660 01:11:34,080 --> 01:11:35,780 But I think it'll average out. 1661 01:11:35,780 --> 01:11:38,490 And now 0 and 1 are in the right place. 1662 01:11:38,490 --> 01:11:40,160 So now let me do this again but faster. 1663 01:11:40,160 --> 01:11:42,230 So 5-- OK, 4-- 1664 01:11:42,230 --> 01:11:43,340 2 is the smallest-- 1665 01:11:43,340 --> 01:11:44,630 6, 7, 3. 1666 01:11:44,630 --> 01:11:48,110 OK, 2, let's put you where you belong, evicting 5. 1667 01:11:48,110 --> 01:11:50,900 Now I can skip all three of these volunteers. 1668 01:11:50,900 --> 01:11:52,080 OK, 4 is the smallest. 1669 01:11:52,080 --> 01:11:52,580 Nope. 1670 01:11:52,580 --> 01:11:52,910 Nope. 1671 01:11:52,910 --> 01:11:53,120 Nope. 1672 01:11:53,120 --> 01:11:54,200 3, you're the smallest. 1673 01:11:54,200 --> 01:11:55,880 Let me evict 4. 1674 01:11:55,880 --> 01:11:59,330 All right, and now let me move in front of these volunteers. 1675 01:11:59,330 --> 01:12:02,250 5, 6, 7, 4, come on back. 1676 01:12:02,250 --> 01:12:04,910 All right, and now let's select 6, 7, 5. 1677 01:12:04,910 --> 01:12:07,140 OK, come on back.
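The procedure being acted out here (scan the remaining elements, remember the smallest one seen, then swap it into the leftmost unsorted position) is commonly known as selection sort. A minimal sketch in C follows; the function name `selection_sort` and the variable names are assumptions for illustration, not code from the lecture.

```c
// Sort values[0..count-1] from smallest to largest, in place.
void selection_sort(int values[], int count)
{
    // Everything to the left of i is already in its final position.
    for (int i = 0; i < count - 1; i++)
    {
        // The "mental note": index of the smallest value seen so far.
        int smallest = i;
        for (int j = i + 1; j < count; j++)
        {
            if (values[j] < values[smallest])
            {
                smallest = j;
            }
        }

        // Swap the smallest remaining value into position i,
        // "evicting" whatever was standing there.
        int evicted = values[i];
        values[i] = values[smallest];
        values[smallest] = evicted;
    }
}
```

Each pass looks at one fewer element than the one before, which corresponds to skipping the volunteers who have already been placed.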
1678 01:12:07,140 --> 01:12:08,340 Oh, no, no cheating, Eli. 1679 01:12:08,340 --> 01:12:09,630 OK, and then let's see. 1680 01:12:09,630 --> 01:12:13,020 5, 7, 6-- OK, 6, come on back. 1681 01:12:13,020 --> 01:12:14,250 OK, now you have to move. 1682 01:12:14,250 --> 01:12:18,847 OK, and now we've selected in turn all of the numbers from left to right. 1683 01:12:18,847 --> 01:12:20,430 And even though that did feel slower-- 1684 01:12:20,430 --> 01:12:22,890 I was doing it a little more verbally as I stepped through it-- 1685 01:12:22,890 --> 01:12:25,510 I dare say we could probably translate that more readily to code. 1686 01:12:25,510 --> 01:12:26,010 Why? 1687 01:12:26,010 --> 01:12:27,690 Because there's a lot of, like, looping through it 1688 01:12:27,690 --> 01:12:30,270 and probably a lot of conditionals just asking the question, 1689 01:12:30,270 --> 01:12:34,800 is this number smaller than this one or conversely greater than this other one. 1690 01:12:34,800 --> 01:12:36,520 All right, let's do this one more time. 1691 01:12:36,520 --> 01:12:41,010 If you guys could reset yourselves to this ordering. 1692 01:12:41,010 --> 01:12:46,720 All right, so we again have 7, 2, 5, 4, 1, 6, 0, 3. 1693 01:12:46,720 --> 01:12:47,220 Good. 1694 01:12:47,220 --> 01:12:48,480 So how about this? 1695 01:12:48,480 --> 01:12:50,550 I liked the intuition that they originally 1696 01:12:50,550 --> 01:12:53,520 had, funny enough, whereby they just kind of organically 1697 01:12:53,520 --> 01:12:55,570 looked to the person to the left and to the right 1698 01:12:55,570 --> 01:12:56,820 and kind of fixed the problem. 1699 01:12:56,820 --> 01:12:59,970 So if they were out of order, they just kind of swapped locally adjacent 1700 01:12:59,970 --> 01:13:00,610 to each other. 1701 01:13:00,610 --> 01:13:01,500 So let's try this. 1702 01:13:01,500 --> 01:13:03,670 So 7 and 2, you're clearly out of order. 1703 01:13:03,670 --> 01:13:05,340 So let's swap just you two.
1704 01:13:05,340 --> 01:13:06,870 7 and 5, you're out of order. 1705 01:13:06,870 --> 01:13:08,190 Let's swap you two. 1706 01:13:08,190 --> 01:13:10,290 7 and 4, let's swap you two. 1707 01:13:10,290 --> 01:13:12,390 7 and 1, let's swap you two. 1708 01:13:12,390 --> 01:13:15,360 7 and 6, let's swap you two. 1709 01:13:15,360 --> 01:13:17,220 7 and 0, swap you two. 1710 01:13:17,220 --> 01:13:18,900 7 and 3, swap you two. 1711 01:13:18,900 --> 01:13:20,620 OK, that was a lot of swapping. 1712 01:13:20,620 --> 01:13:22,560 But notice what happened. 1713 01:13:22,560 --> 01:13:23,790 I'm not done, clearly. 1714 01:13:23,790 --> 01:13:24,780 It's not sorted yet. 1715 01:13:24,780 --> 01:13:26,230 But I have improved the situation. 1716 01:13:26,230 --> 01:13:26,730 How? 1717 01:13:26,730 --> 01:13:29,910 Who is now definitely in the right place? 1718 01:13:29,910 --> 01:13:34,510 So, 7-- or Eli has wonderfully bubbled all the way up to the top of the list, 1719 01:13:34,510 --> 01:13:35,170 so to speak. 1720 01:13:35,170 --> 01:13:37,030 Now I can actually skip him moving forward. 1721 01:13:37,030 --> 01:13:38,820 So that takes one bite out of the problem. 1722 01:13:38,820 --> 01:13:39,653 Let's do this again. 1723 01:13:39,653 --> 01:13:40,440 2 and 5 are OK. 1724 01:13:40,440 --> 01:13:42,285 5 and 4, let's swap. 1725 01:13:42,285 --> 01:13:43,020 No, over there. 1726 01:13:43,020 --> 01:13:45,150 [LAUGHS] Thank you. 1727 01:13:45,150 --> 01:13:46,950 5 and 1, let's swap. 1728 01:13:46,950 --> 01:13:48,420 5 and 6 are good. 1729 01:13:48,420 --> 01:13:50,220 6 and 0, let's swap. 1730 01:13:50,220 --> 01:13:52,158 6 and 3, let's swap. 1731 01:13:52,158 --> 01:13:53,700 And we don't have to worry about Eli. 1732 01:13:53,700 --> 01:13:54,720 And now, what's your name again? 1733 01:13:54,720 --> 01:13:54,960 AUDIENCE: [? Haron. ?] 1734 01:13:54,960 --> 01:13:57,700 DAVID MALAN: [? Haron-- ?] we don't have to worry about him either as well. 
1735 01:13:57,700 --> 01:13:59,283 So now I can go back to the beginning. 1736 01:13:59,283 --> 01:14:02,460 And even though it feels like I'm going back and forth a lot, 1737 01:14:02,460 --> 01:14:05,257 that's OK because the problem's still getting better and better. 1738 01:14:05,257 --> 01:14:06,840 I'm taking a bite out of it each time. 1739 01:14:06,840 --> 01:14:07,770 2 and 4 are good. 1740 01:14:07,770 --> 01:14:10,140 4 and 1, let's swap. 1741 01:14:10,140 --> 01:14:11,130 4 and 5 are good. 1742 01:14:11,130 --> 01:14:12,420 5 and 0, swap. 1743 01:14:12,420 --> 01:14:13,950 5 and 3, swap. 1744 01:14:13,950 --> 01:14:15,750 And now these three are in the right place. 1745 01:14:15,750 --> 01:14:16,500 Let's do it again. 1746 01:14:16,500 --> 01:14:17,670 2 and 1-- ah, here we go. 1747 01:14:17,670 --> 01:14:18,330 Let's swap. 1748 01:14:18,330 --> 01:14:19,200 2 and 4 are OK. 1749 01:14:19,200 --> 01:14:20,670 4 and 0, swap. 1750 01:14:20,670 --> 01:14:22,110 4 and 3, swap. 1751 01:14:22,110 --> 01:14:23,820 Now, these four are OK. 1752 01:14:23,820 --> 01:14:24,990 1 and 2, you're good. 1753 01:14:24,990 --> 01:14:26,280 2 and 0, swap. 1754 01:14:26,280 --> 01:14:27,840 2 and 3, you're good. 1755 01:14:27,840 --> 01:14:28,530 You're good. 1756 01:14:28,530 --> 01:14:30,790 OK, now 1 and 0, swap-- 1757 01:14:30,790 --> 01:14:33,190 1 and 2 and now 0 and 1. 1758 01:14:33,190 --> 01:14:36,027 A round of applause if we could for our volunteers. 1759 01:14:36,027 --> 01:14:37,338 [APPLAUSE] 1760 01:14:37,338 --> 01:14:38,622 [CHEERING] 1761 01:14:38,622 --> 01:14:40,830 So if you want to put your numbers on the tray there, 1762 01:14:40,830 --> 01:14:44,820 we have some lovely Super Mario Oreos today, which 1763 01:14:44,820 --> 01:14:46,650 maybe drives a lot of the volunteerism. 1764 01:14:46,650 --> 01:14:48,930 But here we go. 1765 01:14:48,930 --> 01:14:51,180 Thank you all so much. 
1766 01:14:51,180 --> 01:14:54,060 And maybe, Carter, if you can help with the reset? 1767 01:14:54,060 --> 01:14:56,020 All right, here we go. 1768 01:14:56,020 --> 01:14:57,840 [INAUDIBLE] yes, all set. 1769 01:14:57,840 --> 01:14:58,740 Thank you very much. 1770 01:14:58,740 --> 01:14:59,430 Thank you. 1771 01:14:59,430 --> 01:15:02,340 Thank you, guys. 1772 01:15:02,340 --> 01:15:04,290 Thank you, yes. 1773 01:15:04,290 --> 01:15:08,880 To recap, let's actually formalize a little more algorithmically 1774 01:15:08,880 --> 01:15:09,573 what we did. 1775 01:15:09,573 --> 01:15:11,490 And I deliberately kind of orchestrated things 1776 01:15:11,490 --> 01:15:14,280 there to show two fundamentally different approaches, one 1777 01:15:14,280 --> 01:15:18,210 where I kind of selected the element I wanted again and again 1778 01:15:18,210 --> 01:15:19,770 on the basis of how small it was. 1779 01:15:19,770 --> 01:15:22,210 I looked for the smallest, then the next smallest, and so forth. 1780 01:15:22,210 --> 01:15:24,335 The second time around, I took a different approach 1781 01:15:24,335 --> 01:15:26,010 by just fixing local problems. 1782 01:15:26,010 --> 01:15:28,710 But I did it again and again and again until I 1783 01:15:28,710 --> 01:15:30,408 fixed all of the minor problems. 1784 01:15:30,408 --> 01:15:32,700 And frankly, what they did organically at the beginning 1785 01:15:32,700 --> 01:15:36,167 was probably closer to the second algorithm than the first, 1786 01:15:36,167 --> 01:15:39,250 even though I'm not sure they would write down the same pseudocode for it. 1787 01:15:39,250 --> 01:15:43,510 The first algorithm we executed is actually called selection sort. 1788 01:15:43,510 --> 01:15:46,060 And I deliberately used that vernacular of selecting 1789 01:15:46,060 --> 01:15:50,590 the smallest element again and again to evoke this name of the algorithm. 
1790 01:15:50,590 --> 01:15:54,460 So for instance, when we had these numbers here initially unsorted, 1791 01:15:54,460 --> 01:15:58,900 I kept looking again and again and again for the smallest element. 1792 01:15:58,900 --> 01:16:02,380 And I don't know a priori what the smallest number is until I 1793 01:16:02,380 --> 01:16:04,510 go through the list at least once. 1794 01:16:04,510 --> 01:16:06,190 Then I can pluck out the 0. 1795 01:16:06,190 --> 01:16:10,360 I then go through the list a second time to pluck out the 1, a third time 1796 01:16:10,360 --> 01:16:11,560 to pluck out the 2. 1797 01:16:11,560 --> 01:16:14,380 Now, this assumes that I don't have an infinite amount of memory 1798 01:16:14,380 --> 01:16:17,920 because even though I kept repeating myself looking for the next smallest 1799 01:16:17,920 --> 01:16:20,800 element, next smallest element, I propose that I only 1800 01:16:20,800 --> 01:16:25,060 keep track of one number at a time, one variable in my mind, which 1801 01:16:25,060 --> 01:16:28,330 was the smallest element I have seen thus far. 1802 01:16:28,330 --> 01:16:32,080 If I used more memory, I could probably remember from the first pass 1803 01:16:32,080 --> 01:16:33,580 where the 2 is, where the 3 is. 1804 01:16:33,580 --> 01:16:34,955 But that's a different algorithm. 1805 01:16:34,955 --> 01:16:37,180 And it would take more space, more memory. 1806 01:16:37,180 --> 01:16:41,390 I was confining myself to just one variable in my approach. 1807 01:16:41,390 --> 01:16:44,800 So here might be the pseudocode for what we'd call selection 1808 01:16:44,800 --> 01:16:48,490 sort, the very first algorithm that we all did together, not organically, 1809 01:16:48,490 --> 01:16:50,350 but more methodically as a group. 1810 01:16:50,350 --> 01:16:54,220 I would propose that we use some syntax from C when we talk about the loop 1811 01:16:54,220 --> 01:16:57,790 and say for i from 0 to n minus 1. 
1812 01:16:57,790 --> 01:16:58,600 Now, why is that? 1813 01:16:58,600 --> 01:17:02,980 Well, if there were n people, or 8, then 0 through 7 1814 01:17:02,980 --> 01:17:07,040 are the indexes or indices of those humans on stage from left to right. 1815 01:17:07,040 --> 01:17:08,620 So what did I have myself do? 1816 01:17:08,620 --> 01:17:15,190 Find the smallest number between numbers bracket i and numbers bracket 1817 01:17:15,190 --> 01:17:16,390 n minus 1. 1818 01:17:16,390 --> 01:17:20,710 So when the loop starts at 0, this is literally 1819 01:17:20,710 --> 01:17:25,750 saying between numbers bracket 0 and numbers bracket n minus 1. 1820 01:17:25,750 --> 01:17:30,490 So from the far left to the far right, find me the smallest number. 1821 01:17:30,490 --> 01:17:34,300 Then swap that number with numbers bracket i. 1822 01:17:34,300 --> 01:17:36,880 So that's why we evicted the person all the way 1823 01:17:36,880 --> 01:17:39,970 over to the right or your left the very first time, 1824 01:17:39,970 --> 01:17:42,650 and then the next person, and the next person, and so forth. 1825 01:17:42,650 --> 01:17:46,990 So this is an algorithm that has us go back and forth, back and forth, 1826 01:17:46,990 --> 01:17:49,250 iteratively selecting the smallest element. 1827 01:17:49,250 --> 01:17:53,980 So if we generalize this now away from eight people to, like, n people, 1828 01:17:53,980 --> 01:17:56,800 you can think of them as representing an array, a.k.a. 1829 01:17:56,800 --> 01:18:01,180 doors like this where the leftmost one is 0, the rightmost one is n minus 1, 1830 01:18:01,180 --> 01:18:04,540 second to last is n minus 2, and so forth if we don't know or care 1831 01:18:04,540 --> 01:18:06,250 what n specifically is. 1832 01:18:06,250 --> 01:18:11,980 So how many total steps does selection sort perhaps take? 1833 01:18:11,980 --> 01:18:13,930 And let's make this a little more real here.
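The selection sort pseudocode described above could be sketched in C roughly as follows. This is a minimal sketch, not the lecture's own code: the function name `selection_sort` and the variable names `smallest` and `tmp` are assumptions chosen for illustration. The single `smallest` variable plays the role of the one "mental variable" that remembers the smallest element seen so far.

```c
// Hypothetical helper (name is an assumption, not from the lecture).
// For each position i, find the smallest value in numbers[i..n-1]
// and swap it into position i, evicting whatever was there.
void selection_sort(int numbers[], int n)
{
    for (int i = 0; i < n; i++)
    {
        // Index of the smallest element seen so far in this pass
        int smallest = i;

        // Everything left of i is already in place, so start at i + 1
        for (int j = i + 1; j < n; j++)
        {
            if (numbers[j] < numbers[smallest])
            {
                smallest = j;
            }
        }

        // Swap the smallest remaining element into position i
        int tmp = numbers[i];
        numbers[i] = numbers[smallest];
        numbers[smallest] = tmp;
    }
}
```

Calling `selection_sort(numbers, 8)` on the volunteers' ordering 7, 2, 5, 4, 1, 6, 0, 3 would leave the array as 0 through 7. Notice that the inner loop starts at `i + 1`, which is the "one fewer bite" saved on each pass.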
1834 01:18:13,930 --> 01:18:17,990 Let me actually open up, for instance, a quick visualization here. 1835 01:18:17,990 --> 01:18:19,930 And on the screen here, you'll just see now 1836 01:18:19,930 --> 01:18:24,070 an artist's rendition of an array of values whereby tall purple bars 1837 01:18:24,070 --> 01:18:25,390 represent big integers. 1838 01:18:25,390 --> 01:18:28,310 And short purple bars represent small integers. 1839 01:18:28,310 --> 01:18:31,690 So the idea of any sorting algorithm here as visualized 1840 01:18:31,690 --> 01:18:34,600 is to get the small bars on the left and the big bars on the right. 1841 01:18:34,600 --> 01:18:36,550 And this is a nice handy tool that lets you play around 1842 01:18:36,550 --> 01:18:37,700 with different algorithms. 1843 01:18:37,700 --> 01:18:39,970 So here is selection sort that I've just clicked. 1844 01:18:39,970 --> 01:18:42,760 In pink again and again is the equivalent 1845 01:18:42,760 --> 01:18:46,210 of me walking through the volunteers looking for the next smallest element. 1846 01:18:46,210 --> 01:18:50,260 And as soon as I found them, I swapped them into their leftmost location, 1847 01:18:50,260 --> 01:18:54,250 evicting whoever's there in order to gradually sort 1848 01:18:54,250 --> 01:18:56,480 this list from left to right. 1849 01:18:56,480 --> 01:19:00,400 And so as you can see here, it holds very briefly in pink 1850 01:19:00,400 --> 01:19:03,250 whatever the currently smallest element it has found is. 1851 01:19:03,250 --> 01:19:05,830 That's the analog of, like, me pointing to my head 1852 01:19:05,830 --> 01:19:09,940 whereby it's constantly comparing, comparing, comparing, 1853 01:19:09,940 --> 01:19:12,423 looking for the next smallest element. 1854 01:19:12,423 --> 01:19:15,340 Now, I'm kind of stalling because I'm running out of intelligent words 1855 01:19:15,340 --> 01:19:15,840 to say. 1856 01:19:15,840 --> 01:19:18,190 But that is to say, this algorithm feels kind of slow. 
1857 01:19:18,190 --> 01:19:20,590 Like, it seems to be doing a lot of work. 1858 01:19:20,590 --> 01:19:23,170 Where is the work coming in? 1859 01:19:23,170 --> 01:19:26,060 Like, what is it doing a lot of specifically? 1860 01:19:26,060 --> 01:19:27,118 Yeah? 1861 01:19:27,118 --> 01:19:29,410 Yeah, it keeps going back and forth and back and forth. 1862 01:19:29,410 --> 01:19:31,090 And even though it's shaving a little bit of time, 1863 01:19:31,090 --> 01:19:33,670 right, because it doesn't have to stupidly go all the way back 1864 01:19:33,670 --> 01:19:35,920 to the beginning, and I was saving myself a few steps, 1865 01:19:35,920 --> 01:19:38,680 like, it's a lot of cyclicity again and again and again. 1866 01:19:38,680 --> 01:19:41,500 And put another way, it's a lot of comparisons again and again. 1867 01:19:41,500 --> 01:19:44,560 And some of those comparisons you're doing multiple times 1868 01:19:44,560 --> 01:19:47,890 because I only remembered one element at a time in my head. 1869 01:19:47,890 --> 01:19:50,950 So I have to kind of remind myself on every pass which 1870 01:19:50,950 --> 01:19:52,640 is smallest, which is smallest. 1871 01:19:52,640 --> 01:19:57,760 So this invites the question how fast or how slow or equivalently 1872 01:19:57,760 --> 01:20:02,350 how efficient or inefficient is something like bubble-- 1873 01:20:02,350 --> 01:20:03,400 selection sort. 1874 01:20:03,400 --> 01:20:07,790 Well, let's actually consider how we could analyze this as follows. 1875 01:20:07,790 --> 01:20:12,790 So if we have n numbers that we want to sort, 1876 01:20:12,790 --> 01:20:15,730 how many comparisons do we do the first time? 1877 01:20:15,730 --> 01:20:19,280 Well, if there's n numbers, you can only make n minus 1 comparisons. 1878 01:20:19,280 --> 01:20:19,780 Why? 
1879 01:20:19,780 --> 01:20:21,550 Because if we have eight people here and we 1880 01:20:21,550 --> 01:20:23,440 started with whoever's all the way over here, 1881 01:20:23,440 --> 01:20:27,890 we compare this person against seven others-- n minus 1 if n is 8. 1882 01:20:27,890 --> 01:20:32,000 So the first pass through the list, I made n minus 1 comparisons. 1883 01:20:32,000 --> 01:20:34,950 But that put the smallest number 0 in place. 1884 01:20:34,950 --> 01:20:37,547 The second time I walked across our eight volunteers, 1885 01:20:37,547 --> 01:20:39,380 I didn't need to walk in front of all eight. 1886 01:20:39,380 --> 01:20:41,630 I could shave off a little bit and do seven of them, 1887 01:20:41,630 --> 01:20:43,460 then six, then five, then four. 1888 01:20:43,460 --> 01:20:45,710 So if I were to write this out roughly mathematically, 1889 01:20:45,710 --> 01:20:50,630 I could do this-- n minus 1 plus n minus 2 plus n minus 3 plus 1890 01:20:50,630 --> 01:20:54,020 dot, dot, dot, all the way down to my very last comparison 1891 01:20:54,020 --> 01:20:55,260 at the end of the list. 1892 01:20:55,260 --> 01:20:57,320 Now, this is the kind of thing that typically in high school 1893 01:20:57,320 --> 01:20:59,612 you'd look at the back of the math book or physics book 1894 01:20:59,612 --> 01:21:02,960 that's got a little cheat sheet for all of the formulas that add up. 1895 01:21:02,960 --> 01:21:07,160 This series here, let me just stipulate, adds up to this-- n times 1896 01:21:07,160 --> 01:21:09,260 n minus 1 divided by 2. 1897 01:21:09,260 --> 01:21:12,770 So no matter what n is, this formula captures that series, 1898 01:21:12,770 --> 01:21:14,760 that summation of all of those values. 1899 01:21:14,760 --> 01:21:18,830 So that is how many steps I took again and again 1900 01:21:18,830 --> 01:21:22,070 and again when implementing selection sort for eight people. 1901 01:21:22,070 --> 01:21:24,000 So of course, let's multiply this out. 
1902 01:21:24,000 --> 01:21:27,013 So this is like n squared minus n all divided by 2. 1903 01:21:27,013 --> 01:21:28,430 Let's do it out a little bit more. 1904 01:21:28,430 --> 01:21:31,610 That's n squared divided by 2 minus n over 2. 1905 01:21:31,610 --> 01:21:34,913 And now we're back into the territory of running times. 1906 01:21:34,913 --> 01:21:36,830 Like, how many steps does this algorithm take? 1907 01:21:36,830 --> 01:21:39,080 How many comparisons are we making? 1908 01:21:39,080 --> 01:21:42,950 Now, n squared seems like the biggest term. 1909 01:21:42,950 --> 01:21:44,250 It's the dominant term. 1910 01:21:44,250 --> 01:21:48,110 In other words, as n gets large-- not eight but 80 or 800 1911 01:21:48,110 --> 01:21:53,000 or 8,000 or 8 million-- squaring that is going to make a way bigger difference 1912 01:21:53,000 --> 01:21:55,880 than just doing, like, n divided by 2 and subtracting that off. 1913 01:21:55,880 --> 01:21:59,300 Similarly, just dividing even this quadratic formula by 2, 1914 01:21:59,300 --> 01:22:01,460 like, yes, it's going to halve it literally. 1915 01:22:01,460 --> 01:22:03,410 But that's kind of a drop in the bucket. 1916 01:22:03,410 --> 01:22:05,930 As n gets larger and larger and larger, it's 1917 01:22:05,930 --> 01:22:07,650 still going to be a crazy big number. 1918 01:22:07,650 --> 01:22:11,000 So computer scientists would typically wrap this in some big O notation 1919 01:22:11,000 --> 01:22:16,070 and say, OK, OK, selection sort is on the order of n squared. 1920 01:22:16,070 --> 01:22:17,502 That's not a precise measurement. 1921 01:22:17,502 --> 01:22:19,460 But it's on the order of n squared because it's 1922 01:22:19,460 --> 01:22:23,480 making so many darn comparisons, not unlike everyone shaking everyone else's 1923 01:22:23,480 --> 01:22:25,470 hand like I proposed verbally earlier. 1924 01:22:25,470 --> 01:22:27,300 It's a lot of work to get that done. 
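Written out, the summation stipulated above looks like the following. It only restates the arithmetic from the lecture, with the eight-volunteer case plugged in as a sanity check.

```latex
\[
(n-1) + (n-2) + \cdots + 1 \;=\; \frac{n(n-1)}{2} \;=\; \frac{n^2}{2} - \frac{n}{2}
\]
\[
n = 8:\quad 7 + 6 + 5 + 4 + 3 + 2 + 1 \;=\; 28 \;=\; \frac{8 \cdot 7}{2}
\]
```

Dropping the lower-order term and the constant factor of one half is what leaves the "on the order of n squared" description.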
1925 01:22:27,300 --> 01:22:30,170 So selection sort is on the order of n squared steps. 1926 01:22:30,170 --> 01:22:31,967 And that's kind of a slow one. 1927 01:22:31,967 --> 01:22:33,800 That's what was at the top of my cheat sheet 1928 01:22:33,800 --> 01:22:37,232 earlier when I proposed a ranking of some common running times. 1929 01:22:37,232 --> 01:22:38,690 There's an infinite number of them. 1930 01:22:38,690 --> 01:22:40,380 But those are some of the common ones. 1931 01:22:40,380 --> 01:22:42,920 So can we do actually better? 1932 01:22:42,920 --> 01:22:47,330 Well, if bubble sort then is in big O-- 1933 01:22:47,330 --> 01:22:47,840 sorry. 1934 01:22:47,840 --> 01:22:49,010 Sorry-- spoiler. 1935 01:22:49,010 --> 01:22:52,610 If selection sort is on the order of n squared, 1936 01:22:52,610 --> 01:22:55,420 could we maybe get lucky sometimes? 1937 01:22:55,420 --> 01:22:58,285 Like, with selection sort, what would the best case scenario be? 1938 01:22:58,285 --> 01:23:00,910 Well, the best case would be, like, everyone is already sorted, 1939 01:23:00,910 --> 01:23:02,020 0 through 7. 1940 01:23:02,020 --> 01:23:03,250 And we just get lucky. 1941 01:23:03,250 --> 01:23:05,820 But does bubble sort-- 1942 01:23:05,820 --> 01:23:10,580 [SIGHS]-- does selection sort appreciate that? 1943 01:23:10,580 --> 01:23:14,590 Does selection sort take into account whether the list is already sorted? 1944 01:23:14,590 --> 01:23:18,280 Not necessarily, because if you look at the pseudocode even, 1945 01:23:18,280 --> 01:23:22,480 there's no special conditional in here that says, if it's already sorted, 1946 01:23:22,480 --> 01:23:23,620 exit early. 1947 01:23:23,620 --> 01:23:27,010 It's just going to blindly do this this many times. 1948 01:23:27,010 --> 01:23:30,380 And you can actually see pseudocode-wise the n squared. 1949 01:23:30,380 --> 01:23:33,850 Notice that this line of pseudocode here is essentially telling me to do what?
1950 01:23:33,850 --> 01:23:37,450 Do something n times from 0 to n minus 1, 1951 01:23:37,450 --> 01:23:40,720 or equivalently, if you prefer the real world, from 1 to n. 1952 01:23:40,720 --> 01:23:42,710 That's n times total. 1953 01:23:42,710 --> 01:23:44,797 But what are you doing inside of this loop? 1954 01:23:44,797 --> 01:23:46,630 Well, every time you're inside of this loop, 1955 01:23:46,630 --> 01:23:49,720 you're looking for the smallest element, looking for the smallest element. 1956 01:23:49,720 --> 01:23:51,790 And that might take you as many as n steps. 1957 01:23:51,790 --> 01:23:55,000 So another way to think about the running time of this algorithm 1958 01:23:55,000 --> 01:23:58,540 selection sort is this loop is telling you to do something n times. 1959 01:23:58,540 --> 01:24:02,230 This line is telling you to do something n times as well. 1960 01:24:02,230 --> 01:24:04,630 And that's roughly n times n or n squared. 1961 01:24:04,630 --> 01:24:05,630 It's not precise. 1962 01:24:05,630 --> 01:24:07,390 But it's on the order of n squared. 1963 01:24:07,390 --> 01:24:11,090 Unfortunately, if you're just blindly doing that much work always, 1964 01:24:11,090 --> 01:24:14,240 even if you have any number of doors, this 1965 01:24:14,240 --> 01:24:19,800 is going to end up being in omega of n squared as well, 1966 01:24:19,800 --> 01:24:23,130 because even in the best case where all of the numbers are already sorted, 1967 01:24:23,130 --> 01:24:26,580 there is nothing about the algorithm called selection sort or even 1968 01:24:26,580 --> 01:24:29,850 my implementation thereof that would have said, wait a minute-- like, I'm 1969 01:24:29,850 --> 01:24:31,980 done, and exit prematurely. 1970 01:24:31,980 --> 01:24:35,760 So selection sort is in big O of and omega 1971 01:24:35,760 --> 01:24:39,420 of n squared as we've designed it. 
1972 01:24:39,420 --> 01:24:42,988 And by coincidence, because those boundaries are the same, 1973 01:24:42,988 --> 01:24:45,030 you can also say that it's in theta of n squared. 1974 01:24:45,030 --> 01:24:49,170 No matter what, you're going to spend n squared steps or n squared comparisons. 1975 01:24:49,170 --> 01:24:53,100 But that second algorithm that I keep teasing called bubble sort, 1976 01:24:53,100 --> 01:24:55,808 and I deliberately used the word bubble in my description of what 1977 01:24:55,808 --> 01:24:58,433 was happening because Eli, I think was our first volunteer, who 1978 01:24:58,433 --> 01:25:01,140 kind of bubbled his way up all the way to the end of the list. 1979 01:25:01,140 --> 01:25:05,302 Then number 6 did, then 5, then 4, then 3, then 2, then 1. 1980 01:25:05,302 --> 01:25:07,260 All of the numbers kind of bubbled their way up 1981 01:25:07,260 --> 01:25:09,000 being the bigger values to the right. 1982 01:25:09,000 --> 01:25:11,790 So bubble sort just does something again and again 1983 01:25:11,790 --> 01:25:15,060 by comparing adjacencies, comparing, comparing, comparing. 1984 01:25:15,060 --> 01:25:17,810 And then it does it again and again and again. 1985 01:25:17,810 --> 01:25:19,180 So let's analyze bubble sort. 1986 01:25:19,180 --> 01:25:22,750 If, for instance, this was the original array that we tried sorting before-- 1987 01:25:22,750 --> 01:25:24,100 same exact numbers-- 1988 01:25:24,100 --> 01:25:28,330 I was doing things like flipping the 7 and the 2, and then the 7 and the 5, 1989 01:25:28,330 --> 01:25:29,770 and then the 7 and the 4. 1990 01:25:29,770 --> 01:25:34,360 But that would only fix minimally one number. 1991 01:25:34,360 --> 01:25:37,120 I then had to repeat again and again and again. 1992 01:25:37,120 --> 01:25:39,290 So that one too felt like it was taking some time. 1993 01:25:39,290 --> 01:25:41,560 So here's one way of thinking about bubble sort. 
1994 01:25:41,560 --> 01:25:43,660 Bubble sort pseudocode could say this. 1995 01:25:43,660 --> 01:25:45,790 Repeat the following n times. 1996 01:25:45,790 --> 01:25:48,700 And that's why I kept going through the list again and again-- 1997 01:25:48,700 --> 01:25:53,473 for i from 0 to n minus 2. 1998 01:25:53,473 --> 01:25:56,140 Now, this is a little weird, but we'll get back to this-- for i 1999 01:25:56,140 --> 01:25:57,820 from 0 to n minus 2. 2000 01:25:57,820 --> 01:26:02,380 If the number at location i and the number at location i plus 1 2001 01:26:02,380 --> 01:26:05,140 are out of order, swap them. 2002 01:26:05,140 --> 01:26:08,140 And then just repeat, repeat, repeat. 2003 01:26:08,140 --> 01:26:11,470 But we've never seen n minus 2 in pseudocode before. 2004 01:26:11,470 --> 01:26:14,830 Why is it n minus 2 and not the usual n minus 1? 2005 01:26:14,830 --> 01:26:16,086 Yeah? 2006 01:26:16,086 --> 01:26:20,370 AUDIENCE: [INAUDIBLE] 2007 01:26:20,370 --> 01:26:22,770 DAVID MALAN: Exactly, because we're taking a number 2008 01:26:22,770 --> 01:26:24,250 and comparing it to the next one. 2009 01:26:24,250 --> 01:26:26,370 So you better not go all the way to the end 2010 01:26:26,370 --> 01:26:29,245 and expect there to be a next one at the end of the list. 2011 01:26:29,245 --> 01:26:31,620 So if we use these same numbers here, even though they're 2012 01:26:31,620 --> 01:26:34,350 in a different order, if this is location 0, 2013 01:26:34,350 --> 01:26:38,280 and this is location n minus 1 always, if you think 2014 01:26:38,280 --> 01:26:42,480 of i as being my left hand, it's pointing from 0 to 1 to 2 to 3 2015 01:26:42,480 --> 01:26:44,910 to 4 to 5 to 6 to 7. 2016 01:26:44,910 --> 01:26:47,430 If you get to the end, you don't want to look at i plus 1 2017 01:26:47,430 --> 01:26:49,555 because that's pointing to no man's land over here. 2018 01:26:49,555 --> 01:26:51,960 And you can't go beyond the boundaries of your array.
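The bubble sort pseudocode above, as first stated (repeat n times, with i running from 0 to n minus 2), might look like this in C. It is a minimal sketch under assumptions: the function name `bubble_sort` and the loop variable `pass` are illustrative and not from the lecture. The inner bound `i <= n - 2` is what keeps `numbers[i + 1]` inside the array.

```c
// Hypothetical helper (name is an assumption, not from the lecture).
// Repeat n times: walk left to right and swap any adjacent pair
// that is out of order.
void bubble_sort(int numbers[], int n)
{
    for (int pass = 0; pass < n; pass++)
    {
        // Stop at n - 2 so that i + 1 never goes past index n - 1
        for (int i = 0; i <= n - 2; i++)
        {
            if (numbers[i] > numbers[i + 1])
            {
                int tmp = numbers[i];
                numbers[i] = numbers[i + 1];
                numbers[i + 1] = tmp;
            }
        }
    }
}
```

If the inner loop ran to `n - 1` instead, the last comparison would read `numbers[n]`, which is one element past the end of the array.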
2019 01:26:51,960 --> 01:26:55,012 But n minus 2 would be the second to last element, as you know. 2020 01:26:55,012 --> 01:26:56,970 And that makes sure that my left hand and right 2021 01:26:56,970 --> 01:26:59,070 hand stay within the boundaries of the array 2022 01:26:59,070 --> 01:27:02,777 if left hand represents i and right hand represents i plus 1. 2023 01:27:02,777 --> 01:27:04,860 So it's just to make sure we don't screw up and go 2024 01:27:04,860 --> 01:27:06,930 too far past the boundary of the array. 2025 01:27:06,930 --> 01:27:10,920 But as implemented here, this too does not take into account whether or not 2026 01:27:10,920 --> 01:27:13,620 the array is already sorted. 2027 01:27:13,620 --> 01:27:16,087 So technically, we can actually do this n minus 1 times 2028 01:27:16,087 --> 01:27:17,670 because the last one you get for free. 2029 01:27:17,670 --> 01:27:19,020 It ends up being in place. 2030 01:27:19,020 --> 01:27:22,060 But even still, let's do a quick analysis here. 2031 01:27:22,060 --> 01:27:27,100 If we've got n doors in total from 0 on up to n minus 1, 2032 01:27:27,100 --> 01:27:30,520 bubble sort's analysis looks a little bit like this. 2033 01:27:30,520 --> 01:27:35,228 Do the following-- n minus 1 times n minus 1 times. 2034 01:27:35,228 --> 01:27:36,520 Well, how did we get from that? 2035 01:27:36,520 --> 01:27:38,020 Let me back up to the pseudocode. 2036 01:27:38,020 --> 01:27:40,750 If this is the pseudocode as I originally put it, 2037 01:27:40,750 --> 01:27:42,580 I then propose verbally a refinement. 2038 01:27:42,580 --> 01:27:45,580 You don't need to repeat yourself n times again and again because again, 2039 01:27:45,580 --> 01:27:46,915 the last one bubbles up. 2040 01:27:46,915 --> 01:27:49,660 But by process of elimination, it's in the right place. 2041 01:27:49,660 --> 01:27:52,390 So you can think of it as just n minus 1 times. 2042 01:27:52,390 --> 01:27:57,910 How many times can you compare i and i plus 1? 
2043 01:27:57,910 --> 01:28:01,270 Well, you're iterating from 0 to n minus 2. 2044 01:28:01,270 --> 01:28:03,370 And this is where the math is going to get weird. 2045 01:28:03,370 --> 01:28:06,500 But that is n minus 1 steps also. 2046 01:28:06,500 --> 01:28:07,000 Why? 2047 01:28:07,000 --> 01:28:09,940 Because if you do it in the real world starting at one, 2048 01:28:09,940 --> 01:28:13,300 this is saying for i from 1 to n minus 1. 2049 01:28:13,300 --> 01:28:15,460 And then maybe it pops a little more obviously. 2050 01:28:15,460 --> 01:28:18,400 This inner loop is repeating n minus 1 times. 2051 01:28:18,400 --> 01:28:20,830 So outer loop says do the following n minus 1 times. 2052 01:28:20,830 --> 01:28:23,620 Inner loop says do this now n minus 1 times. 2053 01:28:23,620 --> 01:28:27,010 Mathematically, then, that's n minus 1 times n minus 1. 2054 01:28:27,010 --> 01:28:28,550 Now we've got our old FOIL method. 2055 01:28:28,550 --> 01:28:31,780 So n squared minus n minus n plus 1. 2056 01:28:31,780 --> 01:28:35,440 Combine like terms gives us n squared minus 2n plus 1. 2057 01:28:35,440 --> 01:28:37,600 But now let's think like a computer scientist. 2058 01:28:37,600 --> 01:28:41,710 When n gets really large, we definitely don't care about the plus 1. 2059 01:28:41,710 --> 01:28:45,640 When n gets really large, we probably don't even care about the minus 2n. 2060 01:28:45,640 --> 01:28:48,580 It will make a mathematical difference but a drop in the bucket 2061 01:28:48,580 --> 01:28:50,860 once n gets to be in the millions or billions. 2062 01:28:50,860 --> 01:28:54,770 So this would be on the order of what, similarly? 2063 01:28:54,770 --> 01:28:56,730 On the order of n squared as well.
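The FOIL step above can be written out as follows. It restates the lecture's arithmetic for the refined version with n minus 1 passes, and again uses the eight-volunteer case as a check.

```latex
\[
(n-1)(n-1) \;=\; n^2 - n - n + 1 \;=\; n^2 - 2n + 1
\]
\[
n = 8:\quad 7 \cdot 7 \;=\; 49 \;=\; 64 - 16 + 1
\]
```

As n grows, the $n^2$ term dominates both $-2n$ and $+1$, which is why this count is also described as on the order of n squared.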
2064 01:28:56,730 --> 01:28:59,840 So algorithmically, and actually, we'll use a different term 2065 01:28:59,840 --> 01:29:03,770 now-- asymptotically-- asymptotic notation is the fancy term 2066 01:29:03,770 --> 01:29:08,960 to describe big O and omega and theta notation-- asymptotically bubble 2067 01:29:08,960 --> 01:29:12,530 sort is, quote, unquote, "the same" as selection sort 2068 01:29:12,530 --> 01:29:13,760 in terms of its upper bound. 2069 01:29:13,760 --> 01:29:15,812 It's not 100% exactly the same. 2070 01:29:15,812 --> 01:29:19,020 If we get into the weeds of the math, there are obviously different formulas. 2071 01:29:19,020 --> 01:29:22,462 But when n gets really large and you plot the pictures on a chart, 2072 01:29:22,462 --> 01:29:24,920 they're going to look almost the same because they're going 2073 01:29:24,920 --> 01:29:27,060 to be fundamentally the same shape. 2074 01:29:27,060 --> 01:29:33,380 So in our cheat sheet here, bubble sort now is in big O of n squared. 2075 01:29:33,380 --> 01:29:36,600 But let me propose that we make an improvement. 2076 01:29:36,600 --> 01:29:38,720 Here's our pseudocode earlier. 2077 01:29:38,720 --> 01:29:43,430 When might it make sense to abort bubble sort early? 2078 01:29:43,430 --> 01:29:47,780 Like, when could you logically conclude that I do not need 2079 01:29:47,780 --> 01:29:51,975 to make another pass again and again? 2080 01:29:51,975 --> 01:29:54,600 And remember what I did from left to right over our volunteers. 2081 01:29:54,600 --> 01:29:57,510 I was comparing, comparing, comparing, comparing and maybe swapping 2082 01:29:57,510 --> 01:29:59,200 people who were out of order. 2083 01:29:59,200 --> 01:30:03,980 So what could I do to short circuit this? 2084 01:30:03,980 --> 01:30:04,855 AUDIENCE: [INAUDIBLE] 2085 01:30:04,855 --> 01:30:05,950 DAVID MALAN: Perfect. 
2086 01:30:05,950 --> 01:30:10,060 If I compare through the whole list left to right and I make no swaps, 2087 01:30:10,060 --> 01:30:12,808 it stands to reason that they're already ordered. 2088 01:30:12,808 --> 01:30:14,350 Otherwise, I would have swapped them. 2089 01:30:14,350 --> 01:30:17,923 And therefore, I would be crazy to do that again and again and again. 2090 01:30:17,923 --> 01:30:20,090 Because if I didn't swap them the first time around, 2091 01:30:20,090 --> 01:30:22,757 why would I swap them the second time around if no one's moving. 2092 01:30:22,757 --> 01:30:25,280 So I can terminate the algorithm early. 2093 01:30:25,280 --> 01:30:27,550 So in pseudocode, I could say something like this. 2094 01:30:27,550 --> 01:30:29,920 If no swaps, quit. 2095 01:30:29,920 --> 01:30:31,940 And I can do that inside of the inner loop. 2096 01:30:31,940 --> 01:30:35,560 So once I make a pass through the people and I say, hm, did I make any swaps? 2097 01:30:35,560 --> 01:30:38,320 If no, just quit because they're not going 2098 01:30:38,320 --> 01:30:41,260 to move any further if they didn't just move already. 2099 01:30:41,260 --> 01:30:45,040 So bubble sort then might arguably be an omega of what? 2100 01:30:45,040 --> 01:30:51,130 In the best case, how few steps could we get away with with bubble sort? 2101 01:30:51,130 --> 01:30:52,330 OK, it's not one. 2102 01:30:52,330 --> 01:30:53,490 But it is n. 2103 01:30:53,490 --> 01:30:54,160 Why? 2104 01:30:54,160 --> 01:30:56,980 Because I minimally need to go through the whole list to decide, 2105 01:30:56,980 --> 01:30:58,240 did I make any swaps. 2106 01:30:58,240 --> 01:31:03,010 And logically-- and this will come up a lot in the analysis of algorithms-- 2107 01:31:03,010 --> 01:31:07,780 any question like, is this list sorted, cannot possibly take only one step 2108 01:31:07,780 --> 01:31:09,040 if you've got n elements. 2109 01:31:09,040 --> 01:31:09,310 Why? 
2110 01:31:09,310 --> 01:31:10,727 Because then you're just guessing. 2111 01:31:10,727 --> 01:31:14,560 Like, if you don't at least take the time to look at every element, 2112 01:31:14,560 --> 01:31:17,710 how in the world can you conclude logically that they're sorted or not? 2113 01:31:17,710 --> 01:31:21,290 Like, you minimally have to meet us halfway and check every element. 2114 01:31:21,290 --> 01:31:24,338 So it's minimally in omega of n. 2115 01:31:24,338 --> 01:31:27,130 Constant time wouldn't even give you time to look at the whole list 2116 01:31:27,130 --> 01:31:28,190 if there's n elements. 2117 01:31:28,190 --> 01:31:29,500 So that's the intuition there. 2118 01:31:29,500 --> 01:31:32,042 So we can't say anything about theta notation for bubble sort 2119 01:31:32,042 --> 01:31:34,000 because it's different, upper and lower bounds. 2120 01:31:34,000 --> 01:31:39,850 But we seem to have done better asymptotically than selection sort. 2121 01:31:39,850 --> 01:31:41,530 In the worst case, they're just as bad. 2122 01:31:41,530 --> 01:31:42,530 They're super slow. 2123 01:31:42,530 --> 01:31:45,310 But in the best case, bubble sort with that tweak-- 2124 01:31:45,310 --> 01:31:48,550 that conditional about swapping-- might actually be faster for us. 2125 01:31:48,550 --> 01:31:52,940 And you'd want Google, Microsoft, your phone, to use that kind of algorithm. 2126 01:31:52,940 --> 01:31:55,280 Let me go back to the sorting demonstration earlier. 2127 01:31:55,280 --> 01:31:57,380 Let me re-randomize the array. 2128 01:31:57,380 --> 01:31:58,820 Small bars is small number. 2129 01:31:58,820 --> 01:31:59,990 Big bars is big number. 2130 01:31:59,990 --> 01:32:02,030 And let me click bubble sort this time. 2131 01:32:02,030 --> 01:32:04,970 And you'll see again, the pink represents comparisons. 2132 01:32:04,970 --> 01:32:08,400 But now the comparisons are always adjacent from left to right. 
2133 01:32:08,400 --> 01:32:12,360 So this is pretty much doing with more bars what I did with eight people here. 2134 01:32:12,360 --> 01:32:15,950 And you'll see that the biggest elements, Eli 2135 01:32:15,950 --> 01:32:18,260 being the first one of the humans, bubbled up 2136 01:32:18,260 --> 01:32:21,470 all the way to the right, followed by the next largest, next largest, 2137 01:32:21,470 --> 01:32:22,490 next largest. 2138 01:32:22,490 --> 01:32:25,700 But here again you can visually see. 2139 01:32:25,700 --> 01:32:28,490 And with the number of words I'm using to kind of stall here, 2140 01:32:28,490 --> 01:32:31,100 you can hear that this is kind of slow 2141 01:32:31,100 --> 01:32:34,618 because you're doing a lot of comparisons again and again and again. 2142 01:32:34,618 --> 01:32:35,910 And you're making improvements. 2143 01:32:35,910 --> 01:32:37,618 So you're taking bites out of the problem 2144 01:32:37,618 --> 01:32:40,897 but really only one bite at a time. 2145 01:32:40,897 --> 01:32:43,480 So I kind of feel this is going to be like holding in a sneeze 2146 01:32:43,480 --> 01:32:44,560 if we don't finish this. 2147 01:32:44,560 --> 01:32:47,410 So I feel like we should give you emotional closure 2148 01:32:47,410 --> 01:32:49,930 with getting this to finish. 2149 01:32:49,930 --> 01:32:53,020 Any questions in the meantime? Because there's going to be no surprise. 2150 01:32:53,020 --> 01:32:54,830 It's going to be sorted. 2151 01:32:54,830 --> 01:32:57,580 This is bubble sort. 2152 01:32:57,580 --> 01:32:58,570 No? 2153 01:32:58,570 --> 01:33:00,040 All right, we're almost there. 2154 01:33:00,040 --> 01:33:02,800 It's going faster because the list is effectively getting shorter. 2155 01:33:02,800 --> 01:33:08,350 And-- oh, OK, so that then was bubble sort. 2156 01:33:08,350 --> 01:33:12,920 But both of these, I claim now, are actually pretty inefficient. 2157 01:33:12,920 --> 01:33:13,780 They're pretty slow.
2158 01:33:13,780 --> 01:33:16,330 And my god, that was only eight humans on the stage. 2159 01:33:16,330 --> 01:33:19,540 That was only, like, what, 40 or 50 bars on the screen. 2160 01:33:19,540 --> 01:33:23,320 What if we start talking about hundreds or thousands or millions of inputs? 2161 01:33:23,320 --> 01:33:25,690 Like, those algorithms are probably going to take crazy 2162 01:33:25,690 --> 01:33:28,220 long because n squared gets really big. 2163 01:33:28,220 --> 01:33:30,440 So can we do better than n squared? 2164 01:33:30,440 --> 01:33:31,330 Well, we can. 2165 01:33:31,330 --> 01:33:35,170 But let me propose that we introduce a new problem-solving technique 2166 01:33:35,170 --> 01:33:38,320 that we've actually been using already, just not by name. 2167 01:33:38,320 --> 01:33:42,700 In programming and in mathematics, there's this idea of recursion. 2168 01:33:42,700 --> 01:33:48,280 And recursion is a description for a function that calls itself. 2169 01:33:48,280 --> 01:33:51,460 A function that calls itself is recursive. 2170 01:33:51,460 --> 01:33:52,895 So what do we mean by this? 2171 01:33:52,895 --> 01:33:54,520 Well, we've actually seen this already. 2172 01:33:54,520 --> 01:33:58,810 Here's the same pseudocode for binary search earlier. 2173 01:33:58,810 --> 01:34:01,180 And binary search-- bi implying two-- went 2174 01:34:01,180 --> 01:34:05,070 either left or right or left or right, halving each time. 2175 01:34:05,070 --> 01:34:06,950 So that was the division and conquering. 2176 01:34:06,950 --> 01:34:08,570 So this is the same pseudocode. 2177 01:34:08,570 --> 01:34:12,080 But notice that in my pseudocode earlier, 2178 01:34:12,080 --> 01:34:16,710 I literally used the keyword search inside of my definition of search. 2179 01:34:16,710 --> 01:34:19,550 So this is sort of one of those, like, circular definitions. 
2180 01:34:19,550 --> 01:34:21,590 Like, if this is a search algorithm, how am I 2181 01:34:21,590 --> 01:34:24,860 getting away with using the word search in the definition of search? 2182 01:34:24,860 --> 01:34:27,110 It's kind of when you get reprimanded for using a word 2183 01:34:27,110 --> 01:34:28,850 to define a vocabulary word. 2184 01:34:28,850 --> 01:34:32,960 This is kind of like that because this algorithm is calling itself. 2185 01:34:32,960 --> 01:34:35,700 This search function is calling itself. 2186 01:34:35,700 --> 01:34:38,300 Now, normally, if you were to have a function call itself, 2187 01:34:38,300 --> 01:34:41,750 call itself, call itself, call itself, it would just do that infinitely. 2188 01:34:41,750 --> 01:34:43,550 And presumably, it would go on forever. 2189 01:34:43,550 --> 01:34:45,350 Or the program or computer would probably 2190 01:34:45,350 --> 01:34:48,517 crash because out of memory or something like that-- more on that next week. 2191 01:34:48,517 --> 01:34:50,780 But there's a key detail, a key characteristic 2192 01:34:50,780 --> 01:34:54,770 about this algorithm that makes sure that doing the same thing again 2193 01:34:54,770 --> 01:34:56,690 and again is not crazy. 2194 01:34:56,690 --> 01:34:59,630 It's actually going to lead us to a solution. 2195 01:34:59,630 --> 01:35:02,720 Even though I'm using search here or search here, 2196 01:35:02,720 --> 01:35:08,742 what's happening to the size of the problem for each of these lines? 2197 01:35:08,742 --> 01:35:09,450 What's happening? 2198 01:35:09,450 --> 01:35:10,290 Yeah? 2199 01:35:10,290 --> 01:35:11,610 It's being cut in half. 2200 01:35:11,610 --> 01:35:15,420 So even though I'm doing the exact same thing algorithmically again and again, 2201 01:35:15,420 --> 01:35:18,930 I'm doing it on a smaller, smaller, smaller input-- 2202 01:35:18,930 --> 01:35:21,820 fewer, fewer, fewer doors-- fewer, fewer, fewer people. 
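A search function that calls itself on a smaller and smaller input might look like this in C. It is a sketch under assumptions: the signature with `low` and `high` indexes, the array name `doors`, and the use of sorted `int` values are illustrative choices, not the lecture's code. The first check is what stops the recursion once no doors are left.

```c
#include <stdbool.h>

// Returns true if target is among the sorted values doors[low..high].
// Pass low = 0 and high = count - 1 to search the whole array.
bool search(const int doors[], int low, int high, int target)
{
    // No doors left to look behind, so stop recursing
    if (low > high)
    {
        return false;
    }

    int middle = low + (high - low) / 2;

    if (doors[middle] == target)
    {
        // Found behind the middle door
        return true;
    }
    else if (target < doors[middle])
    {
        // Search left half: same function, smaller input
        return search(doors, low, middle - 1, target);
    }
    else
    {
        // Search right half: same function, smaller input
        return search(doors, middle + 1, high, target);
    }
}
```

Each call either answers immediately or hands roughly half of the remaining range to another call of `search`, so the range shrinks until it is empty.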
2203 01:35:21,820 --> 01:35:23,820 And so even though I'm doing it again and again, 2204 01:35:23,820 --> 01:35:26,760 so long as I have this so-called base case, as we'll call it, 2205 01:35:26,760 --> 01:35:29,700 that just makes sure if there's no doors-- you're done-- 2206 01:35:29,700 --> 01:35:34,710 I can break out of what would otherwise be an infinite loop of sorts. 2207 01:35:34,710 --> 01:35:37,525 All right, so those two lines we've already kind of seen. 2208 01:35:37,525 --> 01:35:39,150 And actually, we saw this in week zero. 2209 01:35:39,150 --> 01:35:42,510 So this is the pseudocode for the phone book example from week zero. 2210 01:35:42,510 --> 01:35:45,300 And you might recall that we had these lines of code here. 2211 01:35:45,300 --> 01:35:48,420 And this is very procedural or iterative where I literally 2212 01:35:48,420 --> 01:35:49,872 told you go back to line 3. 2213 01:35:49,872 --> 01:35:52,330 And we did that today when we took attendance, so to speak. 2214 01:35:52,330 --> 01:35:55,860 With that third algorithm, you went back to step 2 again and again. 2215 01:35:55,860 --> 01:35:58,000 That's a very iterative loop-based approach. 2216 01:35:58,000 --> 01:35:58,750 But you know what? 2217 01:35:58,750 --> 01:36:00,840 I could be a little more clever here too in week zero. 2218 01:36:00,840 --> 01:36:03,210 We just didn't want to get too far ahead of ourselves. 2219 01:36:03,210 --> 01:36:07,410 Let me actually change these highlighted lines to be more, succinctly, 2220 01:36:07,410 --> 01:36:10,150 search left half book, search right half of book. 2221 01:36:10,150 --> 01:36:11,620 I can now tighten the code up. 2222 01:36:11,620 --> 01:36:13,270 So it's kind of a shorter algorithm. 2223 01:36:13,270 --> 01:36:14,860 But it's the exact same idea. 2224 01:36:14,860 --> 01:36:20,570 But on week zero, I very methodically told you what I want you to do next. 
2225 01:36:20,570 --> 01:36:22,900 But here, I'm sort of recognizing that, wait a minute, 2226 01:36:22,900 --> 01:36:25,660 I just spent the past six lines telling you how to search. 2227 01:36:25,660 --> 01:36:27,220 So go do more of that. 2228 01:36:27,220 --> 01:36:29,180 You have all of the logic you need. 2229 01:36:29,180 --> 01:36:33,280 So this too is recursive in the sense that if this is a search algorithm, 2230 01:36:33,280 --> 01:36:36,610 the search algorithm is using a search algorithm inside of it. 2231 01:36:36,610 --> 01:36:39,430 But here too, the phone book was getting divided and divided 2232 01:36:39,430 --> 01:36:42,160 and divided in half again and again. 2233 01:36:42,160 --> 01:36:45,370 So recursion doesn't just describe mathematical formulas. 2234 01:36:45,370 --> 01:36:49,077 It doesn't just describe pseudocode or code-- even physical structures. 2235 01:36:49,077 --> 01:36:51,410 So we deliberately brought back these bricks here today, 2236 01:36:51,410 --> 01:36:54,040 which are reminiscent, of course, from Mario and Super Mario Brothers. 2237 01:36:54,040 --> 01:36:56,695 That pyramid, especially if we get rid of the distractions 2238 01:36:56,695 --> 01:36:58,570 of the ground and the mountains and so forth, 2239 01:36:58,570 --> 01:37:01,810 is itself recursive as a structure. 2240 01:37:01,810 --> 01:37:02,920 Now, why is that? 2241 01:37:02,920 --> 01:37:05,440 Well, here you can kind of play verbal games. 2242 01:37:05,440 --> 01:37:07,930 If this is a pyramid of height 4-- 2243 01:37:07,930 --> 01:37:11,092 and ask the question, well, what is that-- what is a pyramid of height 4-- 2244 01:37:11,092 --> 01:37:13,300 I could kind of be difficult and say, well, it's just 2245 01:37:13,300 --> 01:37:16,270 a pyramid of height 3 plus 1 other row. 2246 01:37:16,270 --> 01:37:18,310 All right, well, what's a pyramid of height 3? 
2247 01:37:18,310 --> 01:37:21,185 You could be difficult and say, well, it's just a pyramid of height 2 2248 01:37:21,185 --> 01:37:22,060 plus 1 more row. 2249 01:37:22,060 --> 01:37:23,290 What's a pyramid of height 2? 2250 01:37:23,290 --> 01:37:27,363 Well, it's this plus 1 more row. 2251 01:37:27,363 --> 01:37:28,780 I might have skipped a step there. 2252 01:37:28,780 --> 01:37:30,010 What's a pyramid of height 2? 2253 01:37:30,010 --> 01:37:33,130 Well, it's a pyramid of height 1 plus 1 more row. 2254 01:37:33,130 --> 01:37:34,900 What's a pyramid of height 1? 2255 01:37:34,900 --> 01:37:37,240 Well, it's nothing plus 1 more row. 2256 01:37:37,240 --> 01:37:39,638 Or at this point, you can just say it's a single brick. 2257 01:37:39,638 --> 01:37:42,430 You can have a base case that just says, like, the buck stops here. 2258 01:37:42,430 --> 01:37:43,722 Stop asking me these questions. 2259 01:37:43,722 --> 01:37:48,140 I can special case or hard code my answer to the very tiniest of problems. 2260 01:37:48,140 --> 01:37:50,740 So even those bricks then are sort of-- 2261 01:37:50,740 --> 01:37:52,622 the pyramid is recursively defined. 2262 01:37:52,622 --> 01:37:54,580 And we can actually implement this even in code 2263 01:37:54,580 --> 01:37:57,080 if we want in a couple of different ways to make this clear. 2264 01:37:57,080 --> 01:37:59,690 So let me go back over to VS Code here. 2265 01:37:59,690 --> 01:38:03,760 And let me propose that we do one old-school iterative example with 2266 01:38:03,760 --> 01:38:04,510 this-- 2267 01:38:04,510 --> 01:38:06,190 code iteration dot C. 2268 01:38:06,190 --> 01:38:09,040 And iteration just means to loop again and again. 2269 01:38:09,040 --> 01:38:11,000 An iteration is a pass through the code. 2270 01:38:11,000 --> 01:38:13,430 And let me go ahead and whip this up as follows. 2271 01:38:13,430 --> 01:38:15,730 Let's include the CS50 library. 2272 01:38:15,730 --> 01:38:18,850 So we have our get int function. 
2273 01:38:18,850 --> 01:38:21,790 Let's include standard io so that we have printf. 2274 01:38:21,790 --> 01:38:27,650 Let's go ahead and create main with no command line arguments today. 2275 01:38:27,650 --> 01:38:30,430 Let's go ahead and create a variable of type int set 2276 01:38:30,430 --> 01:38:33,070 equal to the return value of get int asking 2277 01:38:33,070 --> 01:38:35,230 the user for the height of the pyramid. 2278 01:38:35,230 --> 01:38:37,480 And then let me assume for the sake of discussion 2279 01:38:37,480 --> 01:38:41,290 that there is a function now that will draw a pyramid of that height. 2280 01:38:41,290 --> 01:38:44,810 All right, now let's actually implement that function as a helper function, 2281 01:38:44,810 --> 01:38:45,500 so to speak. 2282 01:38:45,500 --> 01:38:49,270 So if I have a function called draw, it's going to clearly take an integer. 2283 01:38:49,270 --> 01:38:51,670 And I'll call it, maybe, n as input. 2284 01:38:51,670 --> 01:38:53,600 And it doesn't need to return anything. 2285 01:38:53,600 --> 01:38:55,120 So I'm going to say that this is a void function. 2286 01:38:55,120 --> 01:38:56,203 It doesn't return a value. 2287 01:38:56,203 --> 01:38:59,030 It just has a side effect of printing bricks on the screen. 2288 01:38:59,030 --> 01:39:02,050 How can I go about doing a pyramid of height n? 2289 01:39:02,050 --> 01:39:07,120 Well, I'm going to do this one, which looks like this-- one at a time, 2290 01:39:07,120 --> 01:39:09,430 then two, then three, then four. 2291 01:39:09,430 --> 01:39:11,210 And this is a little easier, for instance, 2292 01:39:11,210 --> 01:39:14,380 than problem set one, which had the pyramid flipped around the other way. 2293 01:39:14,380 --> 01:39:16,480 So this one's a little easier to do deliberately. 2294 01:39:16,480 --> 01:39:21,250 So if I go back to my code, I could say something like this-- for int i gets 0. 
2295 01:39:21,250 --> 01:39:24,610 i is less than, let's say, n, which is the height-- 2296 01:39:24,610 --> 01:39:25,810 i plus, plus. 2297 01:39:25,810 --> 01:39:30,280 That's going to essentially iterate over every row of the pyramid top to bottom. 2298 01:39:30,280 --> 01:39:31,480 This might feel similar-- 2299 01:39:31,480 --> 01:39:33,700 reminiscent-- to problem set one. 2300 01:39:33,700 --> 01:39:36,280 Then in my inner loop, I'm going to do for-- 2301 01:39:36,280 --> 01:39:38,200 int j gets 0. 2302 01:39:38,200 --> 01:39:42,682 j is less than i plus 1-- j plus, plus. 2303 01:39:42,682 --> 01:39:44,390 And we'll see why this works in a moment. 2304 01:39:44,390 --> 01:39:47,800 And then let's go ahead and print out a single hash, no new line. 2305 01:39:47,800 --> 01:39:52,660 But at the very bottom of this row, let's print out just a new line 2306 01:39:52,660 --> 01:39:54,850 to move the cursor down a line. 2307 01:39:54,850 --> 01:39:56,050 Now, I'm not done yet. 2308 01:39:56,050 --> 01:39:57,760 Let me hide my terminal for a second. 2309 01:39:57,760 --> 01:39:58,940 I need the prototype. 2310 01:39:58,940 --> 01:40:02,720 So I'm going to copy this, paste it up here with a semicolon. 2311 01:40:02,720 --> 01:40:05,300 And I think now the code is correct. 2312 01:40:05,300 --> 01:40:06,430 Now, why is this correct? 2313 01:40:06,430 --> 01:40:11,160 Because on my outer loop, I'm iterating n times starting at 0. 2314 01:40:11,160 --> 01:40:15,900 My inner loop-- realize I want to have at least one hash, 2315 01:40:15,900 --> 01:40:18,480 then two, then three, then four. 2316 01:40:18,480 --> 01:40:20,970 So I can't start my number of hashes at 0. 2317 01:40:20,970 --> 01:40:26,910 And that's why I'm doing j all the way up to i plus 1 so that when i is 0, 2318 01:40:26,910 --> 01:40:30,390 I'm actually still printing one hash, and then two hashes, and then three. 2319 01:40:30,390 --> 01:40:32,670 Otherwise, I would start with 0, which is not my goal.
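The iterative draw function described here, typed out, would look something like the following. It is reconstructed from the spoken description, so treat it as a sketch: the lecture's main reads the height with get_int from the CS50 library, and that part is left out so the snippet compiles with only the standard library.

```c
#include <stdio.h>

// Draws a left-aligned pyramid of height n using nested loops.
void draw(int n)
{
    // One iteration per row, top to bottom
    for (int i = 0; i < n; i++)
    {
        // Row i (counting from 0) gets i + 1 hashes,
        // so the first row has one hash rather than zero
        for (int j = 0; j < i + 1; j++)
        {
            printf("#");
        }

        // End of the row: move the cursor down a line
        printf("\n");
    }
}
```

Calling `draw(4)` from main prints rows of one, two, three, and four hashes.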
2320 01:40:32,670 --> 01:40:35,700 Let me go in and do make iteration to compile this-- 2321 01:40:35,700 --> 01:40:37,350 dot slash iteration. 2322 01:40:37,350 --> 01:40:39,600 And let me go ahead and make my terminal bigger-- 2323 01:40:39,600 --> 01:40:40,650 enter. 2324 01:40:40,650 --> 01:40:42,660 The height will be, for instance, 4. 2325 01:40:42,660 --> 01:40:45,660 And indeed, it's not quite to scale because these are a little more 2326 01:40:45,660 --> 01:40:47,550 vertical than they are horizontal. 2327 01:40:47,550 --> 01:40:49,440 But it essentially looks like that pyramid. 2328 01:40:49,440 --> 01:40:50,910 And I can do this even larger. 2329 01:40:50,910 --> 01:40:52,930 Let's do this again after clearing my screen. 2330 01:40:52,930 --> 01:40:54,210 Let's do like 10 of these. 2331 01:40:54,210 --> 01:40:55,300 And it gets bigger. 2332 01:40:55,300 --> 01:40:56,750 Let's do, like, 50 of these. 2333 01:40:56,750 --> 01:40:57,750 And it gets even bigger. 2334 01:40:57,750 --> 01:40:59,167 It doesn't even fit on the screen. 2335 01:40:59,167 --> 01:40:59,730 But it works. 2336 01:40:59,730 --> 01:41:02,648 And that's an iterative approach, 100% correct, 2337 01:41:02,648 --> 01:41:05,190 and similar to what you might have done already for something 2338 01:41:05,190 --> 01:41:08,130 like week zero or week one. 2339 01:41:08,130 --> 01:41:10,390 But it turns out if we leverage recursion, 2340 01:41:10,390 --> 01:41:12,650 we can be a little more clever, in fact. 2341 01:41:12,650 --> 01:41:13,880 Let me go ahead and do this. 2342 01:41:13,880 --> 01:41:15,600 Let me minimize my terminal window. 2343 01:41:15,600 --> 01:41:17,990 And let me go back into my code here. 2344 01:41:17,990 --> 01:41:23,800 And let me propose to implement the exact same program recursively instead. 2345 01:41:23,800 --> 01:41:27,710 And let me go ahead and do this. 
2346 01:41:27,710 --> 01:41:31,030 I'm going to leave main-- 2347 01:41:31,030 --> 01:41:34,060 let's see-- exactly as is. 2348 01:41:34,060 --> 01:41:39,410 And I'm going to go ahead and change my draw function to work as follows. 2349 01:41:39,410 --> 01:41:41,860 Let me delete all of these loops because loops are sort 2350 01:41:41,860 --> 01:41:43,750 of an indication of using iteration. 2351 01:41:43,750 --> 01:41:45,550 And I'm instead going to do this. 2352 01:41:45,550 --> 01:41:49,030 What is a pyramid of height 4? 2353 01:41:49,030 --> 01:41:52,210 I said it's a pyramid of height 3 plus 1 more row. 2354 01:41:52,210 --> 01:41:53,740 So let's take that literally. 2355 01:41:53,740 --> 01:41:56,950 If a pyramid of height n needs to be drawn, 2356 01:41:56,950 --> 01:42:01,270 let's first draw a pyramid of n minus 1. 2357 01:42:01,270 --> 01:42:04,150 And then how do I go about drawing one more row? 2358 01:42:04,150 --> 01:42:05,870 Well, this, I can use a bit of iteration. 2359 01:42:05,870 --> 01:42:08,710 But I don't need a doubly-nested loop anymore. 2360 01:42:08,710 --> 01:42:13,750 I can set i equal to 0, i less than n, i plus, plus. 2361 01:42:13,750 --> 01:42:16,300 And this block of code is simply going to print 2362 01:42:16,300 --> 01:42:23,590 one simple row of hashes again and again followed by a new line just 2363 01:42:23,590 --> 01:42:25,070 to move the cursor down. 2364 01:42:25,070 --> 01:42:27,040 So notice this line here. 2365 01:42:27,040 --> 01:42:33,880 And I'll add some comments-- print pyramid of height n minus 1, 2366 01:42:33,880 --> 01:42:35,990 print one more row. 2367 01:42:35,990 --> 01:42:39,680 So I'm kind of taking a bite out of the problem by printing one row. 2368 01:42:39,680 --> 01:42:44,230 But I'm deferring to, weirdly, myself, to print the rest of the pyramid. 2369 01:42:44,230 --> 01:42:47,800 Now I'm going to go ahead and try compiling this-- make-- 2370 01:42:47,800 --> 01:42:48,340 oh. 
2371 01:42:48,340 --> 01:42:49,940 And this is no longer iteration. 2372 01:42:49,940 --> 01:42:53,920 So I'm actually going to do this-- code recursion dot c. 2373 01:42:53,920 --> 01:42:56,437 I'm going to paste that same code into this new version, 2374 01:42:56,437 --> 01:42:59,020 just so we have a different file without breaking the old one. 2375 01:42:59,020 --> 01:43:01,960 And now I'm going to do make recursion. 2376 01:43:01,960 --> 01:43:03,250 All right, interesting. 2377 01:43:03,250 --> 01:43:05,710 So Clang is yelling at me with this error. 2378 01:43:05,710 --> 01:43:08,810 All paths through this function will call itself. 2379 01:43:08,810 --> 01:43:11,590 So Clang is actually smart enough to notice in my code 2380 01:43:11,590 --> 01:43:14,920 that no matter what, the draw function is going to call the draw function. 2381 01:43:14,920 --> 01:43:16,840 And the draw function is going to call the draw function. 2382 01:43:16,840 --> 01:43:19,173 And the draw function's going to call the draw function. 2383 01:43:19,173 --> 01:43:23,230 Now, to be fair, n, the input to draw, is getting smaller and smaller. 2384 01:43:23,230 --> 01:43:25,990 But what's going to happen eventually to the input of draw 2385 01:43:25,990 --> 01:43:28,870 as I've written this code right now? 2386 01:43:28,870 --> 01:43:30,040 Yeah, in the back? 2387 01:43:30,040 --> 01:43:32,722 AUDIENCE: [INAUDIBLE] 2388 01:43:32,722 --> 01:43:33,800 DAVID MALAN: Exactly. 2389 01:43:33,800 --> 01:43:35,900 Remember that integers are signed by default. 2390 01:43:35,900 --> 01:43:38,550 They can be both positive or 0 or negative. 2391 01:43:38,550 --> 01:43:41,330 And so here, if I just keep blindly subtracting 1, 2392 01:43:41,330 --> 01:43:44,930 it's going to do that seemingly forever until I technically underflow. 2393 01:43:44,930 --> 01:43:47,970 But that's going to be like 2 billion rows of pyramids later. 
2394 01:43:47,970 --> 01:43:51,660 So I think what I actually need to do is something like this. 2395 01:43:51,660 --> 01:43:56,900 I need to ask the question, if n equals 0-- 2396 01:43:56,900 --> 01:44:00,710 or heck, just to be super safe, let's say if n is less than or equal to 0, 2397 01:44:00,710 --> 01:44:04,010 just to make sure it never goes negative, let's just go ahead 2398 01:44:04,010 --> 01:44:06,540 and return without doing anything. 2399 01:44:06,540 --> 01:44:10,130 So I'm going to comment this as, like, if nothing to draw, well, then 2400 01:44:10,130 --> 01:44:11,780 don't blindly call draw again. 2401 01:44:11,780 --> 01:44:13,380 So this is that so-called base case. 2402 01:44:13,380 --> 01:44:16,490 This is analogous to saying, like, if John Harvard not in phone book 2403 01:44:16,490 --> 01:44:22,410 or if no lockers or doors left, then just exit or return in this case. 2404 01:44:22,410 --> 01:44:27,050 So now I'll never call draw a negative number of times 2405 01:44:27,050 --> 01:44:28,370 or zero number of times. 2406 01:44:28,370 --> 01:44:30,990 I will only do it so long as n is positive. 2407 01:44:30,990 --> 01:44:36,420 So now if I make my terminal bigger, make recursion again does compile fine. 2408 01:44:36,420 --> 01:44:39,630 Dot slash recursion-- let's do the same input, 4. 2409 01:44:39,630 --> 01:44:42,840 And I get the exact same number of bricks. 2410 01:44:42,840 --> 01:44:45,540 Let's go ahead and do it again with maybe 10. 2411 01:44:45,540 --> 01:44:46,890 That seems to work too. 2412 01:44:46,890 --> 01:44:48,510 Let's do it again with 50. 2413 01:44:48,510 --> 01:44:49,503 That seems to work too. 2414 01:44:49,503 --> 01:44:50,670 And I can go a little crazy. 2415 01:44:50,670 --> 01:44:52,320 I can do, like, 5,000. 2416 01:44:52,320 --> 01:44:53,500 This is still going to work. 2417 01:44:53,500 --> 01:44:55,210 It's just not going to fit on my screen.
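Put together, the recursive draw function with its base case would look something like this. As before, this is a sketch reconstructed from the spoken description; the lecture's main, which gets the height via get_int from the CS50 library, is omitted so the snippet needs only the standard library.

```c
#include <stdio.h>

// Draws a left-aligned pyramid of height n recursively.
void draw(int n)
{
    // Base case: if nothing to draw, return without calling draw again
    if (n <= 0)
    {
        return;
    }

    // Print pyramid of height n - 1
    draw(n - 1);

    // Print one more row of n hashes
    for (int i = 0; i < n; i++)
    {
        printf("#");
    }
    printf("\n");
}
```

Because the recursive call comes before the loop, the shortest row is printed first: `draw(4)` prints rows of one, two, three, and four hashes, the same output as the iterative version.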
2418 01:44:55,210 --> 01:44:59,073 But it is doing a valiant attempt to print all of those out. 2419 01:44:59,073 --> 01:45:00,990 Now, it turns out that could get us in trouble 2420 01:45:00,990 --> 01:45:05,790 if we start poking around in the-- now, maybe not 5,000 but 500,000 or 5 2421 01:45:05,790 --> 01:45:07,020 million or beyond. 2422 01:45:07,020 --> 01:45:11,050 But for now, we'll just assume that this is just a slow process at that. 2423 01:45:11,050 --> 01:45:14,340 But if we go back to the goal, which was to print this thing here, 2424 01:45:14,340 --> 01:45:17,760 this too is a recursive structure that we're just now translating 2425 01:45:17,760 --> 01:45:19,623 the ideas of recursive code to. 2426 01:45:19,623 --> 01:45:21,540 And actually, if you've never discovered this, 2427 01:45:21,540 --> 01:45:23,790 we would be remiss in not doing this for you. 2428 01:45:23,790 --> 01:45:27,840 If I go into a browser here. 2429 01:45:27,840 --> 01:45:30,540 And suppose that I'm not using any AI fancy technology. 2430 01:45:30,540 --> 01:45:32,070 I'm just going to Google.com. 2431 01:45:32,070 --> 01:45:35,730 And I search for recursion because I'm curious to see what it means. 2432 01:45:35,730 --> 01:45:39,650 Google for years has had this little Easter egg for computer scientists. 2433 01:45:39,650 --> 01:45:43,940 2434 01:45:43,940 --> 01:45:45,680 It's not a typo. 2435 01:45:45,680 --> 01:45:46,180 Get it? 2436 01:45:46,180 --> 01:45:47,690 Ha, ha, kind of funny. 2437 01:45:47,690 --> 01:45:48,190 Yeah? 2438 01:45:48,190 --> 01:45:49,540 OK, like, see? 2439 01:45:49,540 --> 01:45:52,070 Recursion-- so if you click recursion, OK, now we're 2440 01:45:52,070 --> 01:45:53,320 in night mode for some reason. 2441 01:45:53,320 --> 01:45:57,620 But OK, like, so it just does this endlessly. 2442 01:45:57,620 --> 01:46:00,410 OK, so programmers with too much free time or too much 2443 01:46:00,410 --> 01:46:02,560 control over Google's own servers to do that. 
2444 01:46:02,560 --> 01:46:05,060 So it turns out, yes, if you search for recursion on Google, 2445 01:46:05,060 --> 01:46:07,560 this is a long-standing Easter egg in this case. 2446 01:46:07,560 --> 01:46:09,410 But the point of introducing recursion is 2447 01:46:09,410 --> 01:46:12,830 that, one, it's actually going to be a very powerful problem-solving technique 2448 01:46:12,830 --> 01:46:15,770 because honestly, it, one, we've seen in pseudocode already, 2449 01:46:15,770 --> 01:46:18,920 it kind of tightens up the amount of code or the amount of lines 2450 01:46:18,920 --> 01:46:20,900 that you need to write to convey an algorithm. 2451 01:46:20,900 --> 01:46:22,912 And two, it will actually allow us to solve 2452 01:46:22,912 --> 01:46:24,620 problems in a fundamentally different way 2453 01:46:24,620 --> 01:46:27,540 by using computer's memory in an interesting way. 2454 01:46:27,540 --> 01:46:31,220 And so toward that end, we wanted to introduce one final sort today, which 2455 01:46:31,220 --> 01:46:34,520 we won't use humans to demonstrate but just numbers instead, namely 2456 01:46:34,520 --> 01:46:35,990 something called merge sort. 2457 01:46:35,990 --> 01:46:40,790 And merge sort is an algorithm for sorting n numbers that I claim 2458 01:46:40,790 --> 01:46:43,430 is going to be better than selection sort and bubble sort. 2459 01:46:43,430 --> 01:46:46,010 It's got to be better because that n squared was killing us. 2460 01:46:46,010 --> 01:46:48,540 We want to get something lower than n squared. 2461 01:46:48,540 --> 01:46:51,960 So merge sort's pseudocode essentially looks like this. 2462 01:46:51,960 --> 01:46:53,540 And this is it. 2463 01:46:53,540 --> 01:46:58,460 Merge sort says, if you've got n numbers or an array of numbers, 2464 01:46:58,460 --> 01:47:01,430 sort the left half of them, sort the right half of them. 2465 01:47:01,430 --> 01:47:03,412 And then merge the sorted halves. 
2466 01:47:03,412 --> 01:47:05,370 And we'll see what this means in just a moment. 2467 01:47:05,370 --> 01:47:07,280 But if there's only one number, just quit. 2468 01:47:07,280 --> 01:47:09,920 So this is my base case because this is recursive. 2469 01:47:09,920 --> 01:47:12,440 Because if this is a sorting algorithm called merge sort, 2470 01:47:12,440 --> 01:47:15,685 I'm kind of cheating by using the verb sort in the sorting algorithm. 2471 01:47:15,685 --> 01:47:17,060 But that just makes it recursive. 2472 01:47:17,060 --> 01:47:18,330 It doesn't make it wrong. 2473 01:47:18,330 --> 01:47:19,730 So what do I mean by merging? 2474 01:47:19,730 --> 01:47:25,522 Just to make this clear, here among these numbers are two lists of size 4. 2475 01:47:25,522 --> 01:47:27,980 And I'm going to scooch them over to the left and the right 2476 01:47:27,980 --> 01:47:32,540 just to make clear that on the left is one half that is sorted-- 2477 01:47:32,540 --> 01:47:35,270 1, 3, 4, 6 from smallest to largest. 2478 01:47:35,270 --> 01:47:40,460 On the right is a second half that's also sorted-- 0, 2, 5, 7. 2479 01:47:40,460 --> 01:47:42,980 So for the sake of discussion, suppose that I'm 2480 01:47:42,980 --> 01:47:46,070 partway through this algorithm called merge sort. 2481 01:47:46,070 --> 01:47:48,380 And I've sorted the left half already, clearly. 2482 01:47:48,380 --> 01:47:50,570 I've sorted the right half already, clearly. 2483 01:47:50,570 --> 01:47:52,790 Now I need to merge the sorted halves. 2484 01:47:52,790 --> 01:47:54,270 What do we mean by that? 2485 01:47:54,270 --> 01:47:57,710 Well, that means to essentially, conceptually, 2486 01:47:57,710 --> 01:48:00,350 point your left hand at the first at the left half. 2487 01:48:00,350 --> 01:48:02,540 Point your right hand at the right half.
2488 01:48:02,540 --> 01:48:05,450 And then decide which of these numbers should come first 2489 01:48:05,450 --> 01:48:08,990 in order to merge or kind of stitch these two lists together 2490 01:48:08,990 --> 01:48:09,980 in sorted order. 2491 01:48:09,980 --> 01:48:11,180 Well, which is smaller? 2492 01:48:11,180 --> 01:48:12,350 Obviously, 0. 2493 01:48:12,350 --> 01:48:16,460 So that last step, merge sorted halves, would have me take this number 2494 01:48:16,460 --> 01:48:19,490 and put it in an extra empty array up here. 2495 01:48:19,490 --> 01:48:22,250 Now I move my right hand to the next number in the right half. 2496 01:48:22,250 --> 01:48:23,870 And I compare the 1 and the 2. 2497 01:48:23,870 --> 01:48:25,440 Obviously, 1 comes next. 2498 01:48:25,440 --> 01:48:27,140 So I put this now up here. 2499 01:48:27,140 --> 01:48:28,880 Now I compare the 3 and the 2. 2500 01:48:28,880 --> 01:48:30,450 Obviously, the 2 comes next. 2501 01:48:30,450 --> 01:48:32,180 And so I put this up here-- 2502 01:48:32,180 --> 01:48:34,670 3 and 5-- obviously, 3-- 2503 01:48:34,670 --> 01:48:40,350 4 and 5-- obviously, 4-- 2504 01:48:40,350 --> 01:48:43,620 6 and 5-- obviously, 5-- 2505 01:48:43,620 --> 01:48:46,800 and 6 and 7-- obviously, 6. 2506 01:48:46,800 --> 01:48:50,550 And I didn't leave quite enough room for 7, perhaps, and lastly, 7. 2507 01:48:50,550 --> 01:48:53,838 So that's all we mean by merging two lists together. 2508 01:48:53,838 --> 01:48:56,130 If they're already sorted, you just kind of stitch them 2509 01:48:56,130 --> 01:48:59,280 together by plucking from one or the other the next number 2510 01:48:59,280 --> 01:49:00,280 that you actually want. 2511 01:49:00,280 --> 01:49:00,872 And that's it. 2512 01:49:00,872 --> 01:49:03,330 And even though I picked up partway through this algorithm, 2513 01:49:03,330 --> 01:49:05,370 those three steps alone would seem to work. 
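The left hand, right hand merge just demonstrated can be sketched in C roughly as follows. This is only an illustration under assumptions: the function name merge_sorted and its parameter names are made up here and do not come from the lecture, and the caller is assumed to supply an output array with room for both halves (the "extra empty array" on the upper shelf).

```c
#include <stddef.h>

// Hypothetical helper (name and signature are assumptions, not from the lecture).
// Merges two runs that are each already sorted into out, which must have
// room for left_len + right_len values.
void merge_sorted(const int left[], size_t left_len,
                  const int right[], size_t right_len,
                  int out[])
{
    size_t i = 0; // "left hand": next unused value in the left half
    size_t j = 0; // "right hand": next unused value in the right half
    size_t k = 0; // next free slot in the output array

    // While both halves still have values, pluck the smaller one.
    while (i < left_len && j < right_len)
    {
        if (left[i] <= right[j])
        {
            out[k++] = left[i++];
        }
        else
        {
            out[k++] = right[j++];
        }
    }

    // One half has run out; copy whatever remains of the other.
    while (i < left_len)
    {
        out[k++] = left[i++];
    }
    while (j < right_len)
    {
        out[k++] = right[j++];
    }
}
```

Called with the two halves from the demo, 1, 3, 4, 6 and 0, 2, 5, 7, and an output array of size 8, this fills the output with 0 through 7 in order. Each value is copied exactly once, which is why merging n values takes on the order of n steps.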
2514 01:49:05,370 --> 01:49:08,310 So long as you can sort the left half and sort the right half, 2515 01:49:08,310 --> 01:49:10,543 you can surely then merge the sorted halves. 2516 01:49:10,543 --> 01:49:13,710 Now I'll go ahead and do this digitally rather than use the physical numbers 2517 01:49:13,710 --> 01:49:16,377 because clearly, it's a little involved moving them up and down. 2518 01:49:16,377 --> 01:49:19,650 But this rack here, this shelving, essentially 2519 01:49:19,650 --> 01:49:22,753 represents one array with maybe a second array here. 2520 01:49:22,753 --> 01:49:25,170 And heck, if I really want it, a third and a fourth array. 2521 01:49:25,170 --> 01:49:27,660 It turns out that with selection sort and bubble sort, 2522 01:49:27,660 --> 01:49:30,120 we've really been tying our hands because I only 2523 01:49:30,120 --> 01:49:33,450 allowed myself a constant amount of memory, just one variable in my head, 2524 01:49:33,450 --> 01:49:35,640 for instance, with selection sort that let me keep 2525 01:49:35,640 --> 01:49:37,662 track of who was the smallest element. 2526 01:49:37,662 --> 01:49:40,120 And when we did bubble sort, the only number I kept in mind 2527 01:49:40,120 --> 01:49:42,460 was i, like i and i plus 1. 2528 01:49:42,460 --> 01:49:44,650 I didn't allow myself any additional memory. 2529 01:49:44,650 --> 01:49:47,740 But it turns out in programming and in CS, you 2530 01:49:47,740 --> 01:49:50,480 can trade off one resource for another. 2531 01:49:50,480 --> 01:49:53,410 So if you want to spend less time solving a problem, 2532 01:49:53,410 --> 01:49:55,150 you've got to throw space at it. 2533 01:49:55,150 --> 01:49:57,460 You've got to spend more space, spend more money, 2534 01:49:57,460 --> 01:50:00,650 in order to give yourself more space to reduce your time. 
2535 01:50:00,650 --> 01:50:04,120 Conversely, if you're fine with things being slow in terms of time, 2536 01:50:04,120 --> 01:50:06,400 then you can get away with very little space. 2537 01:50:06,400 --> 01:50:08,950 So it's kind of like this balance whereby 2538 01:50:08,950 --> 01:50:11,710 you have to decide which is more important to you, which 2539 01:50:11,710 --> 01:50:15,350 is more expensive for you, or the like. 2540 01:50:15,350 --> 01:50:20,780 So let's go ahead then and consider exactly this algorithm as follows. 2541 01:50:20,780 --> 01:50:25,580 So suppose that these are the numbers in question. 2542 01:50:25,580 --> 01:50:26,770 So here is our array. 2543 01:50:26,770 --> 01:50:27,820 It's clearly unsorted. 2544 01:50:27,820 --> 01:50:29,140 I'd like to sort this. 2545 01:50:29,140 --> 01:50:30,340 I could use selection sort. 2546 01:50:30,340 --> 01:50:33,130 But selection sort was not great because it's big O of n squared. 2547 01:50:33,130 --> 01:50:36,227 And it's omega of n squared, so sort of damned if you do. 2548 01:50:36,227 --> 01:50:37,060 Damned if you don't. 2549 01:50:37,060 --> 01:50:38,630 Bubble sort was a little better. 2550 01:50:38,630 --> 01:50:39,963 It was still big O of n squared. 2551 01:50:39,963 --> 01:50:42,380 But sometimes, we could get lucky and shave some time off. 2552 01:50:42,380 --> 01:50:44,680 So it was omega of only n. 2553 01:50:44,680 --> 01:50:47,170 Let's see if merge sort is fundamentally better. 2554 01:50:47,170 --> 01:50:51,385 And let's do so by trying to reduce the number of comparisons-- no more looping 2555 01:50:51,385 --> 01:50:54,010 back and forth and back and forth and back and forth endlessly. 2556 01:50:54,010 --> 01:50:58,210 So here's how we can draw inspiration from the idea of recursion and divide 2557 01:50:58,210 --> 01:51:00,160 and conquer as per week zero. 
2558 01:51:00,160 --> 01:51:02,980 Let's first think of these numbers as indeed in an array 2559 01:51:02,980 --> 01:51:04,810 contiguously from left to right. 2560 01:51:04,810 --> 01:51:08,590 Let's then go ahead and sort the left half. 2561 01:51:08,590 --> 01:51:10,750 Because again, the pseudocode that you have 2562 01:51:10,750 --> 01:51:14,470 to remember throughout this whole algorithm has just three real steps. 2563 01:51:14,470 --> 01:51:15,670 Sort the left half. 2564 01:51:15,670 --> 01:51:16,930 Sort the right half. 2565 01:51:16,930 --> 01:51:18,160 Merge the sorted halves. 2566 01:51:18,160 --> 01:51:19,540 And then this is sort of a one-off thing. 2567 01:51:19,540 --> 01:51:20,350 But it's important. 2568 01:51:20,350 --> 01:51:21,555 But this is the juicy part. 2569 01:51:21,555 --> 01:51:22,180 Sort left half. 2570 01:51:22,180 --> 01:51:22,847 Sort right half. 2571 01:51:22,847 --> 01:51:24,070 Merge the sorted halves. 2572 01:51:24,070 --> 01:51:27,310 So if we go back to this array on the top shelf, here's the original array. 2573 01:51:27,310 --> 01:51:29,320 Let's go ahead and sort the left half. 2574 01:51:29,320 --> 01:51:32,600 And I'm going to do so by sort of stealing some more memory, 2575 01:51:32,600 --> 01:51:36,160 so I can sort of work on a second shelf with just these numbers. 2576 01:51:36,160 --> 01:51:39,730 So 6, 3, 4, 1 is now an array of size 4, essentially. 2577 01:51:39,730 --> 01:51:42,070 How do I sort an array of size 4? 2578 01:51:42,070 --> 01:51:44,395 Well, I've got an algorithm for that-- selection sort. 2579 01:51:44,395 --> 01:51:45,520 But we decided that's slow. 2580 01:51:45,520 --> 01:51:46,780 Bubble sort-- that's slow. 2581 01:51:46,780 --> 01:51:47,500 Wait a minute. 2582 01:51:47,500 --> 01:51:49,300 I'm in the middle of defining merge sort. 2583 01:51:49,300 --> 01:51:52,570 Let's recursively use the same algorithm by just 2584 01:51:52,570 --> 01:51:55,130 sorting the left half of the left half.
2585 01:51:55,130 --> 01:51:58,335 So if you've got an array of size 4, it's only three steps to solve it. 2586 01:51:58,335 --> 01:51:58,960 Sort left half. 2587 01:51:58,960 --> 01:51:59,710 Sort right half. 2588 01:51:59,710 --> 01:52:00,400 Merge. 2589 01:52:00,400 --> 01:52:01,540 So let's do that. 2590 01:52:01,540 --> 01:52:03,490 Let's sort the left half. 2591 01:52:03,490 --> 01:52:04,300 OK, wait a minute. 2592 01:52:04,300 --> 01:52:06,728 How do you sort an array of size 2? 2593 01:52:06,728 --> 01:52:08,020 I've got an algorithm for that. 2594 01:52:08,020 --> 01:52:08,710 Sort the left half. 2595 01:52:08,710 --> 01:52:09,543 Sort the right half. 2596 01:52:09,543 --> 01:52:10,870 Merge the two halves. 2597 01:52:10,870 --> 01:52:14,110 All right, let's sort the left half. 2598 01:52:14,110 --> 01:52:15,070 What now? 2599 01:52:15,070 --> 01:52:17,020 So 6 is a list of size 1. 2600 01:52:17,020 --> 01:52:19,580 What was that special base case then? 2601 01:52:19,580 --> 01:52:21,500 So it was quit or just return. 2602 01:52:21,500 --> 01:52:22,530 Like, I'm already done. 2603 01:52:22,530 --> 01:52:26,510 So this list of size 1 is sort of weirdly already sorted. 2604 01:52:26,510 --> 01:52:27,920 So I'm making progress. 2605 01:52:27,920 --> 01:52:31,070 Meanwhile, what's the next step after sorting this left half? 2606 01:52:31,070 --> 01:52:32,720 Sort the right half. 2607 01:52:32,720 --> 01:52:33,710 Done. 2608 01:52:33,710 --> 01:52:35,060 This is already sorted. 2609 01:52:35,060 --> 01:52:36,170 But here's the magic. 2610 01:52:36,170 --> 01:52:38,810 What's the third step for these numbers? 2611 01:52:38,810 --> 01:52:39,310 Merge them. 2612 01:52:39,310 --> 01:52:41,310 So this is like the left hand, right hand thing. 2613 01:52:41,310 --> 01:52:42,260 Which one comes first? 2614 01:52:42,260 --> 01:52:45,380 Obviously, 3 and then 6. 2615 01:52:45,380 --> 01:52:50,638 And now we are making progress because now the list of size 2 is sorted. 
2616 01:52:50,638 --> 01:52:51,680 So what have I just done? 2617 01:52:51,680 --> 01:52:56,400 I've just finished sorting the left half of the left half. 2618 01:52:56,400 --> 01:53:01,890 So after I've sorted the left half, what comes next if you rewind in time? 2619 01:53:01,890 --> 01:53:04,870 Sort the right half of that left half. 2620 01:53:04,870 --> 01:53:06,900 So now I take the right half. 2621 01:53:06,900 --> 01:53:09,330 And how do I sort this right half of size 2? 2622 01:53:09,330 --> 01:53:10,920 Well, sort its left half. 2623 01:53:10,920 --> 01:53:11,610 Done. 2624 01:53:11,610 --> 01:53:12,870 Sort its right half. 2625 01:53:12,870 --> 01:53:13,560 Done. 2626 01:53:13,560 --> 01:53:15,180 Now merge them together. 2627 01:53:15,180 --> 01:53:17,880 And obviously, the 1 comes first, then the 4. 2628 01:53:17,880 --> 01:53:19,110 Now where are we? 2629 01:53:19,110 --> 01:53:22,080 We're at the point in the story where we have a left left half 2630 01:53:22,080 --> 01:53:26,040 sorted and a right left half sorted. 2631 01:53:26,040 --> 01:53:28,890 So what comes next now narratively? 2632 01:53:28,890 --> 01:53:35,333 Merge the two together, so left hand, right hand, so 1, 3, 4, 6. 2633 01:53:35,333 --> 01:53:36,750 And where are we now in the story? 2634 01:53:36,750 --> 01:53:41,220 We're sort of at the beginning because I've now sorted the left half. 2635 01:53:41,220 --> 01:53:43,320 And if you recall the demo I did physically, this 2636 01:53:43,320 --> 01:53:46,830 is how the left half was already sorted on the second shelf. 2637 01:53:46,830 --> 01:53:50,550 But now that I've sorted the left half of the original array, what's 2638 01:53:50,550 --> 01:53:53,580 the original second step? 2639 01:53:53,580 --> 01:53:54,820 Sort the right half. 2640 01:53:54,820 --> 01:53:55,900 So it's the same idea. 2641 01:53:55,900 --> 01:53:57,780 So I'm borrowing some extra memory here. 2642 01:53:57,780 --> 01:53:59,370 And now I've got an array of size 4.
2643 01:53:59,370 --> 01:54:00,870 How do I sort an array of size 4? 2644 01:54:00,870 --> 01:54:02,010 Sort the left half. 2645 01:54:02,010 --> 01:54:03,900 All right, how do I sort an array of size 2? 2646 01:54:03,900 --> 01:54:05,190 Sort the left half. 2647 01:54:05,190 --> 01:54:06,000 Done. 2648 01:54:06,000 --> 01:54:07,230 Sort the right half. 2649 01:54:07,230 --> 01:54:08,040 Done. 2650 01:54:08,040 --> 01:54:08,670 Now what? 2651 01:54:08,670 --> 01:54:09,990 Merge the two together. 2652 01:54:09,990 --> 01:54:12,090 And of course, it's 2 and 5. 2653 01:54:12,090 --> 01:54:13,080 Now what do I do? 2654 01:54:13,080 --> 01:54:16,300 I've sorted the left half of the right half. 2655 01:54:16,300 --> 01:54:19,830 So now I sort the right half of the right half. 2656 01:54:19,830 --> 01:54:21,600 How do I sort a list of size 2? 2657 01:54:21,600 --> 01:54:22,680 Well, sort the left half. 2658 01:54:22,680 --> 01:54:23,100 Done. 2659 01:54:23,100 --> 01:54:23,940 Sort the right half. 2660 01:54:23,940 --> 01:54:24,600 Done. 2661 01:54:24,600 --> 01:54:27,540 Merge them together-- 0 and 7. 2662 01:54:27,540 --> 01:54:29,100 Where am I in the story? 2663 01:54:29,100 --> 01:54:33,610 I've sorted the left half and the right half of the original right half. 2664 01:54:33,610 --> 01:54:34,980 So I merge these two together-- 2665 01:54:34,980 --> 01:54:38,730 0 and then 2 and then 5 and then 7. 2666 01:54:38,730 --> 01:54:39,360 Whew. 2667 01:54:39,360 --> 01:54:40,230 Where are we? 2668 01:54:40,230 --> 01:54:43,080 We're at exactly the point in the story where we had the numbers 2669 01:54:43,080 --> 01:54:44,670 originally on the second shelf. 2670 01:54:44,670 --> 01:54:48,790 We had a list that's sorted on the left, a list that's sorted on the right. 2671 01:54:48,790 --> 01:54:52,840 And what I demoed physically was merging those two together. 2672 01:54:52,840 --> 01:54:54,100 So what happens now? 2673 01:54:54,100 --> 01:55:00,130 0, 1, 2, 3, 4, 5, 6, 7.
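The whole walkthrough (sort the left half, sort the right half, merge the sorted halves, and quit when only one number is left) can be sketched in C along these lines. This is a minimal sketch under assumptions, not the course's own code: the names sort_range and merge_sort are hypothetical, and the caller is assumed to pass a scratch array at least as long as the input, which plays the role of the second shelf.

```c
#include <string.h>

// Hypothetical helper (an assumption for illustration): sorts values[lo..hi),
// using scratch as the extra shelf. scratch must be as long as values.
static void sort_range(int values[], int scratch[], int lo, int hi)
{
    // Base case: zero or one number is already sorted, so just return.
    if (hi - lo < 2)
    {
        return;
    }

    int mid = lo + (hi - lo) / 2;
    sort_range(values, scratch, lo, mid); // sort the left half
    sort_range(values, scratch, mid, hi); // sort the right half

    // Merge the sorted halves into scratch: i is the left hand,
    // j is the right hand, k is the next free slot.
    int i = lo, j = mid, k = lo;
    while (i < mid && j < hi)
    {
        scratch[k++] = (values[i] <= values[j]) ? values[i++] : values[j++];
    }
    while (i < mid)
    {
        scratch[k++] = values[i++];
    }
    while (j < hi)
    {
        scratch[k++] = values[j++];
    }

    // Copy the merged run back onto the original shelf.
    memcpy(values + lo, scratch + lo, (size_t) (hi - lo) * sizeof(int));
}

// Hypothetical entry point: sorts the first n values in ascending order.
void merge_sort(int values[], int scratch[], int n)
{
    sort_range(values, scratch, 0, n);
}
```

With the lecture's array 6, 3, 4, 1, 5, 2, 7, 0 and a scratch array of size 8, the recursive calls happen in the same order as the narration: 6 and 3 merge into 3, 6, then 4 and 1 into 1, 4, then those into 1, 3, 4, 6, and likewise on the right. Using a single scratch array rather than a new one per level is a design choice made here to keep the sketch short.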
2674 01:55:00,130 --> 01:55:04,030 And it seems kind of magical and kind of weird in that I kind of cheated. 2675 01:55:04,030 --> 01:55:05,427 And when I had these leaves-- 2676 01:55:05,427 --> 01:55:07,510 these leaf nodes, so to speak-- these singletons-- 2677 01:55:07,510 --> 01:55:08,648 I was, like, sorted. 2678 01:55:08,648 --> 01:55:09,940 I wasn't really doing anything. 2679 01:55:09,940 --> 01:55:12,700 But it's that merging that seems to really be doing 2680 01:55:12,700 --> 01:55:15,350 the magic of sorting things for us. 2681 01:55:15,350 --> 01:55:17,380 So that felt like a mouthful. 2682 01:55:17,380 --> 01:55:19,300 And recursion in general is the kind of thing 2683 01:55:19,300 --> 01:55:21,440 that will, like, bend your brain a little bit. 2684 01:55:21,440 --> 01:55:23,565 And if that went over your head, like, that's fine. 2685 01:55:23,565 --> 01:55:25,600 It takes time and time and time and practice. 2686 01:55:25,600 --> 01:55:30,550 But what there was not a lot of with this algorithm was again and again 2687 01:55:30,550 --> 01:55:31,960 and again and again. 2688 01:55:31,960 --> 01:55:35,080 I was kind of only doing things once. 2689 01:55:35,080 --> 01:55:38,150 And then once I fixed a number, I moved on to the next. 2690 01:55:38,150 --> 01:55:40,630 And that's sort of the essence of this algorithm here. 2691 01:55:40,630 --> 01:55:43,480 If it helps you to see it in another visual way, 2692 01:55:43,480 --> 01:55:46,300 let me go back to our previous visualization here. 2693 01:55:46,300 --> 01:55:48,790 Let me re-randomize the array and click merge sort. 2694 01:55:48,790 --> 01:55:51,670 And this time, notice that merge sort is using more space. 2695 01:55:51,670 --> 01:55:54,970 Technically, I was using 1, 2, 3 shelves. 
2696 01:55:54,970 --> 01:55:57,400 But you can actually be slightly more intelligent about it 2697 01:55:57,400 --> 01:56:00,220 and actually just go back and forth and back and forth between two shelves 2698 01:56:00,220 --> 01:56:01,390 just to save a little space. 2699 01:56:01,390 --> 01:56:03,320 So you're still using twice as much space. 2700 01:56:03,320 --> 01:56:06,190 But you don't need four times as much space as the diagrams 2701 01:56:06,190 --> 01:56:07,670 or as the shelves might imply. 2702 01:56:07,670 --> 01:56:09,040 So here is merge sort. 2703 01:56:09,040 --> 01:56:13,000 And you'll notice that we're sort of working in halves-- sometimes 2704 01:56:13,000 --> 01:56:15,070 big halves, sometimes smaller halves. 2705 01:56:15,070 --> 01:56:18,370 But you can see as the two halves are merged, 2706 01:56:18,370 --> 01:56:20,660 things seem to happen very quickly. 2707 01:56:20,660 --> 01:56:23,860 And so notice that this is the same number of bars as before. 2708 01:56:23,860 --> 01:56:25,630 But that was way faster, right? 2709 01:56:25,630 --> 01:56:29,050 I don't need to stall nearly as much as I have in the past. 2710 01:56:29,050 --> 01:56:30,260 So why is that? 2711 01:56:30,260 --> 01:56:33,340 Well, if we go back to the diagram in question, 2712 01:56:33,340 --> 01:56:35,210 here's the array that's already sorted. 2713 01:56:35,210 --> 01:56:38,440 And if we consider now exactly how much work is done, 2714 01:56:38,440 --> 01:56:41,980 I'll stipulate already it's not n squared. n squared was slow. 2715 01:56:41,980 --> 01:56:44,480 And merge sort is actually much better than that. 2716 01:56:44,480 --> 01:56:46,340 Let's see how much better it is. 2717 01:56:46,340 --> 01:56:48,070 So here is the original list-- 2718 01:56:48,070 --> 01:56:50,740 6, 3, 4, 1, 5, 2, 7, 0.
2719 01:56:50,740 --> 01:56:54,160 And here are all of the remnants of that algorithm, all of the states 2720 01:56:54,160 --> 01:56:57,520 that we were in at some point-- sort of leaving a whole bunch of breadcrumbs. 2721 01:56:57,520 --> 01:57:01,300 How many pieces of work did I do? 2722 01:57:01,300 --> 01:57:08,810 Well, like, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2723 01:57:08,810 --> 01:57:11,470 20, 21, 22, 23, 24. 2724 01:57:11,470 --> 01:57:15,370 So I moved things around, like, 24 times, it seems. 2725 01:57:15,370 --> 01:57:17,830 And how do I actually reason about that? 2726 01:57:17,830 --> 01:57:19,040 Well, these are the numbers. 2727 01:57:19,040 --> 01:57:21,070 This is like my temporary workspace here. 2728 01:57:21,070 --> 01:57:22,447 And 24 is the literal number. 2729 01:57:22,447 --> 01:57:25,030 But let's see if we can't do things a little more generically. 2730 01:57:25,030 --> 01:57:27,790 Log base 2 of n, recall, is what refers to anything 2731 01:57:27,790 --> 01:57:30,492 that we're doing in half-- dividing, dividing, dividing. 2732 01:57:30,492 --> 01:57:32,200 And that's kind of what I was doing here. 2733 01:57:32,200 --> 01:57:36,850 I took a list of size 8-- divide it into two of size 4, then four of size 2, 2734 01:57:36,850 --> 01:57:38,320 then eight of size 1. 2735 01:57:38,320 --> 01:57:41,950 Well, how many times can you do this if you start with eight numbers? 2736 01:57:41,950 --> 01:57:43,600 Well, that's log base 2 of 8. 2737 01:57:43,600 --> 01:57:46,990 And if we do some fancy math, that's just log base 2 of 2 to the third. 2738 01:57:46,990 --> 01:57:49,520 And remember, the base and the number here can cancel out. 2739 01:57:49,520 --> 01:57:50,710 So that's actually 3. 2740 01:57:50,710 --> 01:57:55,450 So log base 2 of 8 is 3, which means that's how many times you 2741 01:57:55,450 --> 01:57:59,800 can divide a problem of size 8 in half, in half, in half. 
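The "in half, in half, in half" count can be checked with a few lines of C. This is just an illustrative sketch; the function name halvings is an assumption made up for this example.

```c
// Hypothetical helper (name is an assumption): counts how many times n can
// be divided in half before reaching 1. For powers of two this is exactly
// log base 2 of n.
int halvings(int n)
{
    int count = 0;
    while (n > 1)
    {
        n /= 2;   // divide the problem in half
        count++;  // one more level of dividing
    }
    return count;
}
```

For n equal to 8 the loop goes 8, 4, 2, 1, so it returns 3, matching log base 2 of 8. Doubling the input to 16 adds only one more halving.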
2742 01:57:59,800 --> 01:58:03,760 But every time we did that per this chart, I had to take the numbers 2743 01:58:03,760 --> 01:58:06,620 and merge them together, merge them together, merge them together. 2744 01:58:06,620 --> 01:58:12,010 And so on every row of this postmortem of the algorithm, 2745 01:58:12,010 --> 01:58:16,270 there are n steps, n steps, n steps. 2746 01:58:16,270 --> 01:58:19,660 So laterally, there's n steps because I had to merge all of those things 2747 01:58:19,660 --> 01:58:20,570 back together. 2748 01:58:20,570 --> 01:58:23,140 But what is the height of these yellow remnants? 2749 01:58:23,140 --> 01:58:26,650 Well, it's 3, which is log base 2 of 8, which is 3. 2750 01:58:26,650 --> 01:58:32,110 So this is technically three times 8, ergo 24 steps. 2751 01:58:32,110 --> 01:58:39,350 But more generally, this is log n height and n width, so to speak. 2752 01:58:39,350 --> 01:58:41,770 So the total running time I claim is actually 2753 01:58:41,770 --> 01:58:46,332 n log n, and it's OK if that doesn't quite gel immediately in your mind, 2754 01:58:46,332 --> 01:58:48,040 especially if you're rusty on logarithms. 2755 01:58:48,040 --> 01:58:51,230 But we can throw away the base because that's just a constant factor. 2756 01:58:51,230 --> 01:58:54,770 And with a wave of our hand when we talk about big O notation, merge sort, 2757 01:58:54,770 --> 01:59:01,070 I claim, is in big O of n log n-- that is, n times log n. 2758 01:59:01,070 --> 01:59:05,360 Unfortunately, it is also in omega of n log 2759 01:59:05,360 --> 01:59:09,680 n, which means that, frankly, bubble sort might sometimes outperform it, 2760 01:59:09,680 --> 01:59:14,600 at least when the inputs are already sorted or certainly relatively small. 2761 01:59:14,600 --> 01:59:16,790 But that's probably OK because in general, 2762 01:59:16,790 --> 01:59:18,920 data that we're sorting probably isn't very sorted.
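To see the "n steps per row, log n rows" count come out of actual code, here is a sketch in C that sorts and tallies how many values get placed during merging. Everything here is an assumption for illustration: the name count_moves is hypothetical, a "move" is taken to mean one value written into the scratch array during a merge, and copying the merged run back is deliberately not counted.

```c
#include <string.h>

// Hypothetical instrumented merge sort (name and counting rule are
// assumptions). Sorts values[lo..hi) using scratch, and returns how many
// values were written into scratch while merging. The copy back into
// values is not counted.
long count_moves(int values[], int scratch[], int lo, int hi)
{
    // Base case: a single number needs no merging, so no moves.
    if (hi - lo < 2)
    {
        return 0;
    }

    int mid = lo + (hi - lo) / 2;
    long moves = count_moves(values, scratch, lo, mid)
               + count_moves(values, scratch, mid, hi);

    // Merge the two sorted halves, counting each value placed.
    int i = lo, j = mid, k = lo;
    while (i < mid && j < hi)
    {
        scratch[k++] = (values[i] <= values[j]) ? values[i++] : values[j++];
        moves++;
    }
    while (i < mid)
    {
        scratch[k++] = values[i++];
        moves++;
    }
    while (j < hi)
    {
        scratch[k++] = values[j++];
        moves++;
    }

    memcpy(values + lo, scratch + lo, (size_t) (hi - lo) * sizeof(int));
    return moves;
}
```

For the eight numbers 6, 3, 4, 1, 5, 2, 7, 0 this returns 24: eight values placed at each of the three levels of merging. For 16 numbers it returns 64, which is 16 times log base 2 of 16. The count is the same whatever order the input starts in, which is consistent with merge sort being in both big O and omega of n log n.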
2763 01:59:18,920 --> 01:59:20,810 And honestly, we could even have merge sort 2764 01:59:20,810 --> 01:59:24,500 to just do one pass to check initially, is the whole thing sorted, 2765 01:59:24,500 --> 01:59:26,212 and then maybe terminate early. 2766 01:59:26,212 --> 01:59:29,420 So we can maybe massage the algorithm a little better to be a little smarter. 2767 01:59:29,420 --> 01:59:34,240 But fundamentally, merge sort is in theta of n log n, 2768 01:59:34,240 --> 01:59:35,240 is how you would say it. 2769 01:59:35,240 --> 01:59:38,210 It's on the order of n times log n steps. 2770 01:59:38,210 --> 01:59:42,950 Now, in terms of that chart, it's strictly higher than linear. 2771 01:59:42,950 --> 01:59:47,550 But it's strictly lower than quadratic-- n and n squared, respectively. 2772 01:59:47,550 --> 01:59:49,530 So it clearly seems to be faster. 2773 01:59:49,530 --> 01:59:51,170 So it's not as good as linear search. 2774 01:59:51,170 --> 01:59:53,212 And it's definitely not as good as binary search. 2775 01:59:53,212 --> 01:59:58,140 But it's way better than selection sort or bubble sort actually were. 2776 01:59:58,140 --> 02:00:00,680 So with that said, we thought we'd conclude 2777 02:00:00,680 --> 02:00:03,140 by showing you a final film that's just about 2778 02:00:03,140 --> 02:00:08,000 a minute long that compares these actual algorithms and shows them as follows. 2779 02:00:08,000 --> 02:00:12,230 You'll see on the top, the middle, and the bottom, three different algorithms. 2780 02:00:12,230 --> 02:00:14,570 On the top, you will see selection sort. 2781 02:00:14,570 --> 02:00:16,850 On the bottom, you will see bubble sort. 2782 02:00:16,850 --> 02:00:23,450 And in the middle, you will see and hear an appreciation of what n log n is-- 2783 02:00:23,450 --> 02:00:25,680 a.k.a., merge sort today. 2784 02:00:25,680 --> 02:00:28,200 So if we could dim the lights dramatically, 2785 02:00:28,200 --> 02:00:31,471 this is n log n vis a vis n squared.
2786 02:00:31,471 --> 02:00:32,138 [VIDEO PLAYBACK] 2787 02:00:32,138 --> 02:00:35,126 [MUSIC PLAYING] 2788 02:00:35,126 --> 02:01:37,837 2789 02:01:37,837 --> 02:01:38,420 [END PLAYBACK] 2790 02:01:38,420 --> 02:01:40,090 All right, that's it for CS50. 2791 02:01:40,090 --> 02:01:41,620 We'll see you next time. 2792 02:01:41,620 --> 02:01:42,520 [APPLAUSE] 2793 02:01:42,520 --> 02:01:45,570 [MUSIC PLAYING] 2794 02:01:45,570 --> 02:02:11,000